OAI-PMH data harvesting (Morning Glory)

Overview  

The purpose of the OAI-PMH tests is to measure the performance of the Morning Glory release and to find possible issues and bottlenecks, per PERF-263.

Environment

  • mod-oai-pmh v3.9.1  
  • edge-oai-pmh v2.5.0
  • mod-source-record-manager v3.4.1
  • mod-source-record-storage v5.4.0
  • mod-inventory-storage v24.0.3
  • okapi v4.14.2

Specifically, the following settings were used:


Module                  CPU    Memory (soft | hard)    Xmx    MaxMetaspaceSize    Tasks Count    Task Rev Number
mod-oai-pmh             2048   1845 | 2048             1440   512                 2              4
edge-oai-pmh            1024   1360 | 1512             952    128                 2              3
mod-inventory-storage   1024   1684 | 1872             1440   512                 2              8

Summary

  • Average response time per request with a resumption token was 600ms (compared to Lotus's 850ms).
  • Incremental calls performed: 82,299 (Bugfest data set, 1 user, and 20 DB connections)*; the call pattern is sketched after the note below.
  • OOM occurred frequently when following the recommended setting (soft limit < MaxMetaspaceSize + Xmx). Only after changing to soft limit > MaxMetaspaceSize + Xmx did the harvests complete successfully.
  • Thread-block errors and subsequent OOMs happened about 50% of the time, likely due to the fast rate of incremental calls by the JMeter test script. When the rate was reduced to 40 requests/min there were no more errors, but at that very small rate the harvest of 8M records would take over 30 hours to complete.

*Note: the Bugfest dataset was used because it has more SRS records than PTF's dataset.
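For context, below is a minimal sketch of the incremental call pattern these numbers describe: an initial ListRecords request followed by repeated requests that pass back the resumptionToken until the repository stops returning one. The endpoint URL and metadata prefix are placeholders, and authentication (e.g., the edge-oai-pmh API key) is omitted; this is not the actual JMeter script used in the tests.

    import urllib.parse
    import urllib.request
    import xml.etree.ElementTree as ET

    OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"

    def harvest(base_url, metadata_prefix="marc21"):
        """Issue ListRecords calls, following resumptionTokens until exhausted."""
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
        calls = records = 0
        while True:
            url = base_url + "?" + urllib.parse.urlencode(params)
            with urllib.request.urlopen(url) as resp:
                root = ET.fromstring(resp.read())
            calls += 1
            records += len(root.findall(".//" + OAI_NS + "record"))
            token = root.find(".//" + OAI_NS + "resumptionToken")
            if token is None or not (token.text or "").strip():
                break  # no resumptionToken returned: the harvest is complete
            params = {"verb": "ListRecords", "resumptionToken": token.text.strip()}
        print(f"{records} records harvested in {calls} incremental calls")

    # Hypothetical endpoint; the real harvest goes through edge-oai-pmh with an API key.
    harvest("https://edge-oai-pmh.example.org/oai")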

Test Results

Test 1

This test was done with a database freshly restored from Bugfest (Morning Glory). There was no reindexing in Elasticsearch, nor were the inventory-storage indexes recreated or the tables analyzed.

  • 8.26M records were transferred and harvested in about 19 hours
  • Each incremental call to harvest took about 811ms; a total of 82,300 calls were made.
  • No memory or CPU issues observed.
    • mod-oai-pmh's CPU started out spiking up to 50% for about 40 minutes during the initial transfer of instances.
    • No memory issues were observed from the time the test started on 8/24 at 22:00.
  • RDS CPU utilization graph doesn't show any abnormality

Test 2

Test 2 was done after re-indexing Elasticsearch, re-creating the relevant database indexes, and analyzing the tables to update the table stats.
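For reference, a minimal sketch of the table-analysis step, assuming direct database access via psycopg2; the tenant id, schema, and table names below follow the usual FOLIO naming convention but are assumptions, not necessarily the exact objects analyzed here.

    import psycopg2

    # Assumed tenant id and inventory-storage tables; adjust to the environment.
    TABLES = [
        "fs09000000_mod_inventory_storage.instance",
        "fs09000000_mod_inventory_storage.holdings_record",
        "fs09000000_mod_inventory_storage.item",
    ]

    conn = psycopg2.connect(host="db.example.org", dbname="folio",
                            user="folio_admin", password="change-me")
    conn.autocommit = True
    with conn.cursor() as cur:
        for table in TABLES:
            cur.execute(f"ANALYZE {table};")  # refresh planner statistics for the table
    conn.close()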

The test failed after 26 minutes with a 502 error:


HTTP 502 Service temporarily unavailable.

Please check back in a minute or two. 

If the issue persists, please report it to EBSCO Connect.

  • Only 3,570,000 instances were transferred.
  • One mod-oai-pmh task crashed at the 106% memory level.
  • 1,339 incremental API calls to harvest were made, averaging 1,173ms each.

Tests 3 and 4

Tests 3 and 4 suffered the same fate of running out of heap space. Shortly after the harvests were launched (during the initial transfer of instances), one of the two OAI-PMH tasks crashed, leading to a timeout on the client side, and the whole harvest came to a complete halt. Below are the memory and CPU graphs of Tests 3 and 4.

Test 5 

After adjusting the memory soft limit to be greater than Xmx + MaxMetaspaceSize, the harvest did not crash and completed successfully in 13 hours (see the arithmetic check after the table below).


Module                  CPU    Memory (soft | hard)    Xmx    MaxMetaspaceSize    Tasks Count    Task Rev Number
mod-oai-pmh             2048   2000 | 2048             1440   512                 2              5
edge-oai-pmh            1024   1360 | 1512             952    128                 2              3
mod-inventory-storage   1024   1684 | 1872             1440   512                 2              8
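For reference, the memory arithmetic behind this change, using the mod-oai-pmh values from the two settings tables (all values assumed to be in MiB):

    # Soft limit must exceed Xmx + MaxMetaspaceSize for the harvest to survive.
    xmx = 1440
    max_metaspace = 512
    jvm_footprint = xmx + max_metaspace        # 1952

    soft_limit_tests_1_to_4 = 1845             # below the footprint -> OOM risk
    soft_limit_test_5 = 2000                   # above the footprint -> stable

    print(soft_limit_tests_1_to_4 > jvm_footprint)  # False
    print(soft_limit_test_5 > jvm_footprint)        # True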


This time the test was launched from carrier-io, so the timing is even better than that of the first test. Response times were much faster as well.

  • 8.26M records were transferred and harvested in about 13 hours and 40 minutes
  • Each incremental call to harvest took about 592ms; a total of 82,300 calls were made.
  • No memory or CPU issues observed.

CPU utilization was typical for an OAI-PMH harvest, with mod-oai-pmh leading the pack, spiking to 50% for about half an hour during the initial instance transfers before settling down to around 5% thereafter.

okapi and its variants (nginx-okapi, pub-okapi) also spiked initially for about 10 minutes but subsided afterward.



Tests 6 and 7

  • The harvests stopped about an hour in.
  • Logs show "Thread blocked" errors, which led to OOM. The mod-oai-pmh tasks did not crash, however.
  • The mod-oai-pmh service's memory and CPU utilization were nominal, with memory at about 70%.
  • It's worth noting that the mod-oai-pmh service was not restarted before these tests. Logs are attached. 




Test 8

Based on a log of the EBSCO harvester, the rate of incremental harvest requests was anywhere from 9 to 47 requests/min. In this test we slowed the request rate down to 40 requests/min. This test and subsequent tests did not log any errors during the first couple of hours, or even several hours later, which is typically when the thread-block and OOM issues occurred.

Note that the configured request rate works out to around 40.02 requests/min, which proved to be a point of stability. This test was stopped short; had it gone on longer, it would have taken about 36 hours to fully harvest all the data at this rate. A Jira was created to improve the performance of mod-oai-pmh: MODOAIPMH-443.
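For illustration, a minimal sketch of pacing the harvest at roughly 40 requests/min, the rate that proved stable here; make_incremental_call is a placeholder for a single ListRecords/resumptionToken request and is not the actual JMeter sampler.

    import time

    REQUESTS_PER_MIN = 40
    INTERVAL = 60.0 / REQUESTS_PER_MIN    # 1.5 seconds between incremental calls

    def paced_harvest(make_incremental_call, max_calls=100_000):
        """Invoke make_incremental_call() at ~REQUESTS_PER_MIN until it returns False."""
        for _ in range(max_calls):
            started = time.monotonic()
            more = make_incremental_call()            # one ListRecords/resumptionToken call
            if not more:
                break
            elapsed = time.monotonic() - started
            time.sleep(max(0.0, INTERVAL - elapsed))  # pace the next call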


Database showing little CPU usage during the harvest.

CPU utilization of relevant modules during the harvest.


Memory utilization of the modules involved in the OAI-PMH workflow.