OAI-PMH data harvesting (Morning Glory)
Overview
The purpose of the OAI-PMH tests is to measure the performance of the Morning Glory release and to find possible issues and bottlenecks, per PERF-263.
Environment
- mod-oai-pmh v3.9.1
- edge-oai-pmh v2.5.0
- mod-source-record-manager v3.4.1
- mod-source-record-storage v5.4.0
- mod-inventory-storage v24.0.3
- okapi v4.14.2
Specifically, the following settings were used (CPU in CPU units, memory values in MiB):
| Module | CPU | Memory (soft limit) | Memory (hard limit) | Xmx | MaxMetaspaceSize | Tasks Count | Task Rev Number |
|---|---|---|---|---|---|---|---|
| mod-oai-pmh | 2048 | 1845 | 2048 | 1440 | 512 | 2 | 4 |
| edge-oai-pmh | 1024 | 1360 | 1512 | 952 | 128 | 2 | 3 |
| mod-inventory-storage | 1024 | 1684 | 1872 | 1440 | 512 | 2 | 8 |
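Here the Xmx and MaxMetaspaceSize columns presumably map to the JVM options passed to each container, e.g. -Xmx1440m -XX:MaxMetaspaceSize=512m for mod-oai-pmh and -Xmx952m -XX:MaxMetaspaceSize=128m for edge-oai-pmh, while the CPU and memory limits are the ECS task-level settings.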
Summary
- Average response time per request with resumption token: 600 ms, compared to Lotus's 850 ms (see the request sketch below).
- Incremental calls performed: 82,299 (Bugfest data set, 1 user, and 20 DB connections)*.
- OOMs happened frequently when following the recommended setting (soft limit < MaxMetaspaceSize + Xmx); the harvests only completed successfully after changing to soft limit > MaxMetaspaceSize + Xmx.
- Thread-block errors and subsequent OOMs happened about 50% of the time, likely due to the fast rate of incremental calls issued by the JMeter test script. When the rate was reduced to 40 requests/min, there were no more errors, but at that low rate a harvest of 8M records would take over 30 hours to complete.
* Note: the Bugfest dataset was used because it has more SRS records than PTF's dataset.
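Each "incremental call" referenced above is an OAI-PMH ListRecords request that passes the resumptionToken returned by the previous response back to edge-oai-pmh. The Java sketch below illustrates the shape of such a loop; the base URL, metadata prefix, and apikey parameter are placeholders rather than values taken from this test setup, and the regex-based token extraction stands in for proper XML parsing.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Minimal sketch of an incremental OAI-PMH harvest loop (not the actual test script).
public class OaiPmhHarvestSketch {

    // Hypothetical endpoint and API key; real deployments expose edge-oai-pmh behind their own host.
    private static final String BASE_URL = "https://folio-edge.example.org/oai";
    private static final String API_KEY = "REPLACE_WITH_EDGE_API_KEY";

    // Naive token extraction; a production harvester would parse the OAI-PMH XML properly.
    private static final Pattern TOKEN =
            Pattern.compile("<resumptionToken[^>]*>([^<]+)</resumptionToken>");

    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();

        // Initial request; every later "incremental call" carries only the verb and the resumption token.
        String url = BASE_URL + "?verb=ListRecords&metadataPrefix=marc21&apikey=" + API_KEY;
        long calls = 0;
        long totalMillis = 0;

        while (url != null) {
            long start = System.currentTimeMillis();
            HttpResponse<String> response = client.send(
                    HttpRequest.newBuilder(URI.create(url)).GET().build(),
                    HttpResponse.BodyHandlers.ofString());
            totalMillis += System.currentTimeMillis() - start;
            calls++;

            Matcher m = TOKEN.matcher(response.body());
            url = m.find()
                    ? BASE_URL + "?verb=ListRecords&resumptionToken="
                        + URLEncoder.encode(m.group(1), StandardCharsets.UTF_8)
                        + "&apikey=" + API_KEY
                    : null; // an absent or empty token marks the end of the harvest
        }

        System.out.printf("Incremental calls: %d, average response time: %d ms%n",
                calls, calls == 0 ? 0 : totalMillis / calls);
    }
}
```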
Test Results
Test 1
This test was done with a database freshly restored from Bugfest (Morning Glory). There was no reindexing in Elasticsearch, no recreation of the database indexes, and no ANALYZE run on the inventory-storage tables.
- 8.26M records were transferred and harvested in about 19 hours.
- Each incremental harvest call took about 811 ms, for a total of 82,300 calls (see the consistency check after this list).
- No memory or CPU issues were observed.
- mod-oai-pmh CPU spiked up to 50% for about 40 minutes during the initial transfer of instances.
- No memory issues were observed from the start of the test on 8/24 at 22:00.
- The RDS CPU utilization graph does not show any abnormality.
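As a rough consistency check on these figures: 82,300 calls × ~811 ms per call ≈ 66,700 seconds, or about 18.5 hours, which lines up with the reported total of roughly 19 hours.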
Test 2
Test 2 was done after re-indexing Elasticsearch, re-creating the relevant database indexes, and running ANALYZE on the tables to update the table statistics.
The test failed after 26 minutes with a 502 error:
HTTP 502 Service temporarily unavailable.
Please check back in a minute or two.
If the issue persists, please report it to EBSCO Connect.
- Only 3,570,000 instances were transferred.
- One mod-oai-pmh task crashed at 106% memory utilization.
- 1,339 incremental harvest API calls were made, averaging 1,173 ms each (see the check after this list).
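The same kind of consistency check holds here: 1,339 calls × ~1,173 ms ≈ 1,570 seconds, or about 26 minutes, matching the reported time to failure.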
Tests 3 and 4
Tests 3 and 4 suffered the same fate of running out of heap space. Shortly after the harvests were launched (during the initial transfer of instances), one of the two mod-oai-pmh tasks crashed, leading to a timeout on the client side and bringing the whole harvest to a halt. Below are the memory and CPU graphs of tests 3 and 4.
Test 5
After adjusting the memory soft limit to be greater than Xmx + MaxMetaspaceSize, the harvest did not crash and completed successfully in 13 hours (see the arithmetic after the table).
| Module | CPU | Memory (soft limit) | Memory (hard limit) | Xmx | MaxMetaspaceSize | Tasks Count | Task Rev Number |
|---|---|---|---|---|---|---|---|
| mod-oai-pmh | 2048 | 2000 | 2048 | 1440 | 512 | 2 | 5 |
| edge-oai-pmh | 1024 | 1360 | 1512 | 952 | 128 | 2 | 3 |
| mod-inventory-storage | 1024 | 1684 | 1872 | 1440 | 512 | 2 | 8 |
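For the mod-oai-pmh settings above, Xmx + MaxMetaspaceSize = 1440 + 512 = 1952 MiB. The soft limit of 1845 MiB used in Tests 1-4 fell below this sum, while the adjusted soft limit of 2000 MiB exceeds it and still fits within the 2048 MiB hard limit; this is the change that allowed the harvest to complete.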
This time the test was launched from carrier-io, and the overall timing was even better than in the first test; response times were much faster as well.
- 8.26M records were transferred and harvested in about 13 hours and 40 minutes.
- Each incremental harvest call took about 592 ms, for a total of 82,300 calls.
- No memory or CPU issues were observed.
CPU utilization was typical for an OAI-PMH harvest, with mod-oai-pmh leading the pack: it spiked to about 50% for roughly half an hour during the initial instance transfers, then settled down to around 5% for the rest of the run.
okapi and its variants (nginx-okapi, pub-okapi) also spiked initially for about 10 minutes but subsided afterward.
Tests 6 and 7
- The harvests stopped about an hour in.
- Logs show "Thread blocked" errors, which led to OOMs; the mod-oai-pmh tasks did not crash, however.
- The mod-oai-pmh service's memory and CPU utilization were nominal, with memory at about 70%.
- It is worth noting that the mod-oai-pmh service was not restarted before these tests. Logs are attached.
Test 8
Based on a log from the EBSCO harvester, the rate of incremental harvest requests ranged anywhere from 9 to 47 requests/min. In this test the request rate was slowed to 40 requests/min. This test and subsequent tests did not log any errors during the first couple of hours, or even several hours later, which is typically when the thread-block and OOM issues had occurred.
The configured rate works out to around 40.02 requests/min, which proved to be a point of stability. This test was stopped short because, had it gone on, it would have taken about 36 hours to fully harvest all the data at this rate. A Jira issue was created to improve the performance of mod-oai-pmh: MODOAIPMH-443.
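For reference, at a fixed pace of 40 requests/min the roughly 82,300 incremental calls needed for a full harvest work out to 82,300 / 40 ≈ 2,058 minutes, i.e. a bit over 34 hours of pacing alone, in line with the estimate of about 36 hours above.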
Database showing little CPU usage during the harvest.
CPU utilization of relevant modules during the harvest.
Memory utilization of the modules involved in the OAI-PMH workflow.