OAI-PMH data harvesting (Morning Glory)
- 1 Overview
- 2 Environment
- 3 Summary
- 4 Test Results
- 4.1 Test 1
- 4.2 Test 2
- 4.3 Tests 3 and 4
- 4.4 Test 5
- 4.5 Tests 6 and 7
- 4.6 Test 8
Overview
The purpose of the OAI-PMH tests is to measure the performance of the Morning Glory release and to find possible issues and bottlenecks, per PERF-263.
Environment
mod-oai-pmh v3.9.1
edge-oai-pmh v2.5.0
mod-source-record-manager v3.4.1
mod-source-record-storage v5.4.0
mod-inventory-storage v24.0.3
okapi v4.14.2
The following task settings were used:
| Module | CPU (units) | Memory soft limit (MB) | Memory hard limit (MB) | Xmx (MB) | MaxMetaspaceSize (MB) | Tasks Count | Task Rev Number |
|---|---|---|---|---|---|---|---|
| mod-oai-pmh | 2048 | 1845 | 2048 | 1440 | 512 | 2 | 4 |
| edge-oai-pmh | 1024 | 1360 | 1512 | 952 | 128 | 2 | 3 |
| mod-inventory-storage | 1024 | 1684 | 1872 | 1440 | 512 | 2 | 8 |
Summary
Average response time per request with a resumption token: 600 ms (compared to 850 ms in Lotus).
Incremental calls performed: 82,299 (Bugfest data set, 1 user, 20 DB connections)*.
Out-of-memory (OOM) errors occurred frequently with the recommended setting (soft limit < MaxMetaspaceSize + Xmx). The harvests completed successfully only after the soft limit was raised above MaxMetaspaceSize + Xmx.
Thread-blocked errors and subsequent OOMs occurred in about 50% of the runs. This is likely due to the fast rate of incremental calls issued by the JMeter test script. When the rate was reduced to 40 requests/min, the errors stopped, but at such a low rate a harvest of 8M records would take over 30 hours to complete.
* Note: The Bugfest data set was used because it has more SRS records than the PTF data set.
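As a rough consistency check on the summary figures, harvest duration can be estimated from the number of incremental calls. The sketch below assumes a single harvesting client that either issues calls back to back (duration is about calls × average response time) or is throttled to a fixed request rate. It ignores the initial instance transfer and any client-side overhead, and the function names are illustrative, not part of any FOLIO or JMeter tooling.

```python
"""Back-of-the-envelope harvest duration estimates (illustrative sketch)."""


def serial_duration_hours(calls: int, avg_response_ms: float) -> float:
    """Duration when each resumption-token call starts only after the previous one returns.

    Assumes one harvesting client (1 user) and no think time between calls.
    """
    return calls * avg_response_ms / 1000 / 3600


def paced_duration_hours(calls: int, requests_per_min: float) -> float:
    """Duration when calls are throttled to a fixed request rate."""
    return calls / requests_per_min / 60


if __name__ == "__main__":
    calls = 82_299  # incremental calls reported in the summary
    print(f"serial at 600 ms/call: {serial_duration_hours(calls, 600):.1f} h")
    print(f"paced at 40 req/min:  {paced_duration_hours(calls, 40):.1f} h")
```

With 82,299 calls this gives roughly 13.7 hours at 600 ms per call, close to the 13 hours 40 minutes measured in Test 5, and roughly 34 hours at 40 requests/min, in line with the over-30-hours estimate above.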
Test Results
Test 1
This test was run against a database freshly restored from Bugfest (Morning Glory). Elasticsearch was not reindexed, the database indexes were not recreated, and the inventory-storage tables were not analyzed.
8.26M records were transferred and harvested in about 19 hours.
Each incremental harvest call took about 811 ms on average, over a total of 82,300 calls.
No memory or CPU issues were observed.
mod-oai-pmh CPU utilization initially spiked up to 50% for about 40 minutes, during the initial transfer of instances.
No memory issues were observed after the test started on 8/24 at 22:00.
The RDS CPU utilization graph shows no abnormalities.
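Each incremental call is an OAI-PMH ListRecords request: the first one names a metadataPrefix, and every later one sends only the resumptionToken returned by the previous response, until a response carries an empty token. The minimal Python sketch below shows that loop with the standard library only. The base URL and the `marc21` prefix are placeholders (assumptions), the function names are hypothetical, and `fetch` is injected so the loop can be tried without a live edge-oai-pmh endpoint. It is not the JMeter script used in these tests.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET
from typing import Callable, Iterator, Union

# OAI-PMH 2.0 XML namespace.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def harvest(base_url: str, fetch: Callable[[str], Union[str, bytes]],
            metadata_prefix: str = "marc21") -> Iterator[ET.Element]:
    """Yield <record> elements from ListRecords, following resumption tokens."""
    # The first request names the metadata format (placeholder default).
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    while True:
        url = base_url + "?" + urllib.parse.urlencode(params)
        root = ET.fromstring(fetch(url))
        yield from root.iterfind("oai:ListRecords/oai:record", OAI_NS)
        token = root.find("oai:ListRecords/oai:resumptionToken", OAI_NS)
        if token is None or not (token.text or "").strip():
            return  # A missing or empty token marks the end of the list.
        # Each incremental call sends only the verb and the resumption token,
        # because resumptionToken is an exclusive argument.
        params = {"verb": "ListRecords", "resumptionToken": token.text.strip()}


def http_fetch(url: str) -> bytes:
    """Hypothetical fetcher for a real endpoint (not called in this sketch)."""
    with urllib.request.urlopen(url, timeout=60) as resp:
        return resp.read()
```

Against a real server this would be driven as `for record in harvest("https://<edge-oai-pmh host>/<path>", http_fetch): ...`, where the host and path depend on the deployment.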
Test 2
Test 2 was run after reindexing Elasticsearch, recreating the relevant database indexes, and analyzing the tables to update their statistics.
The test failed after 26 minutes with a 502 error:
HTTP 502 Service temporarily unavailable.
Please check back in a minute or two.
If the issue persists, please report it to EBSCO Connect.
Only 3,570,000 instances were transferred.
One mod-oai-pmh task crashed when its memory utilization reached 106%.
1,339 incremental harvest API calls were made, averaging 1,173 ms each.
Tests 3 and 4
Tests 3 and 4 also ran out of heap memory. Shortly after the harvests were launched (during the initial transfer of instances), one of the two mod-oai-pmh tasks crashed, causing a client-side timeout and bringing the whole harvest to a halt. Below are the memory and CPU graphs of tests 3 and 4.
Test 5
After the memory soft limit was raised above Xmx + MaxMetaspaceSize, the harvest did not crash and completed successfully in about 13 hours.
| Module | CPU (units) | Memory soft limit (MB) | Memory hard limit (MB) | Xmx (MB) | MaxMetaspaceSize (MB) | Tasks Count | Task Rev Number |
|---|---|---|---|---|---|---|---|
| mod-oai-pmh | 2048 | 2000 | 2048 | 1440 | 512 | 2 | 5 |
| edge-oai-pmh | 1024 | 1360 | 1512 | 952 | 128 | 2 | 3 |
| mod-inventory-storage | 1024 | 1684 | 1872 | 1440 | 512 | 2 | 8 |
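The fix in Test 5 comes down to simple arithmetic: the container's soft limit should exceed the JVM heap ceiling (Xmx) plus the metaspace ceiling (MaxMetaspaceSize). The sketch below checks this for mod-oai-pmh. It assumes the second numeric column of the settings tables is the soft limit (it is the only value that changed between task revisions 4 and 5) and that all values are in MB. It counts only heap and metaspace; thread stacks, direct buffers, and other native memory add to the real footprint, so positive headroom is necessary but not sufficient. The class and constant names are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskMemory:
    """Memory settings for one task definition revision, in MB (assumed units)."""
    soft_limit: int
    xmx: int
    max_metaspace: int

    def jvm_ceiling(self) -> int:
        # Heap plus metaspace only; native memory (thread stacks, direct
        # buffers, code cache) is not included.
        return self.xmx + self.max_metaspace

    def headroom(self) -> int:
        # Positive means the soft limit sits above Xmx + MaxMetaspaceSize.
        return self.soft_limit - self.jvm_ceiling()


# mod-oai-pmh values from the two settings tables.
REV_4 = TaskMemory(soft_limit=1845, xmx=1440, max_metaspace=512)
REV_5 = TaskMemory(soft_limit=2000, xmx=1440, max_metaspace=512)

for name, cfg in (("rev 4", REV_4), ("rev 5", REV_5)):
    print(f"{name}: ceiling={cfg.jvm_ceiling()} MB, headroom={cfg.headroom()} MB")
```

Revision 4 comes out 107 MB short of the 1952 MB ceiling, while revision 5 has 48 MB of headroom, which lines up with the switch from crashing harvests to a completed one.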
This time the test was launched from carrier-io, so the overall timing was even better than in the first test. Response times were also much faster.
8.26M records were transferred and harvested in about 13 hours and 40 minutes.
Each incremental harvest call took about 592 ms on average, over a total of 82,300 calls.
No memory or CPU issues were observed.
CPU utilization was typical for an OAI-PMH harvest. mod-oai-pmh led the pack, spiking to 50% for about half an hour during the initial instance transfer and settling at around 5% thereafter.
okapi and its variants (nginx-okapi, pub-okapi) also spiked for about the first 10 minutes and then subsided.
Tests 6 and 7
The harvests stopped about an hour after they started.
The logs show "Thread blocked" errors that led to OOM errors. However, the mod-oai-pmh tasks did not crash.
The mod-oai-pmh service's memory and CPU utilization were nominal, with memory at about 70%.
Note that the mod-oai-pmh service was not restarted before these tests. Logs are attached.
Test 8
According to a log from the EBSCO harvester, the incremental harvest request rate ranged from 9 to 47 requests/min. In this test, we slowed the request rate to 40 requests/min. This test and subsequent tests logged no errors during the first couple of hours, which is when the thread-blocked and OOM issues typically occurred, and none even several hours later.
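The source does not show how the JMeter script was throttled. As an illustration of fixed-rate pacing, the Python sketch below starts each call at least 60 / rate seconds after the previous one started (1.5 s at 40 requests/min); a slow response simply delays the next call instead of triggering a burst of catch-up calls. The function and parameter names are hypothetical, and `clock` and `sleep` are injectable so the pacing can be checked with a fake clock.

```python
import time
from typing import Callable


def paced_calls(n_calls: int, requests_per_min: float,
                call: Callable[[int], None],
                clock: Callable[[], float] = time.monotonic,
                sleep: Callable[[float], None] = time.sleep) -> None:
    """Run call(i) n_calls times, no faster than requests_per_min."""
    interval = 60.0 / requests_per_min  # 1.5 s at 40 requests/min
    next_start = clock()
    for i in range(n_calls):
        delay = next_start - clock()
        if delay > 0:
            sleep(delay)  # wait out the rest of the interval
        started = clock()
        call(i)
        # Schedule from the actual start time, so a slow response delays the
        # next call rather than being followed by catch-up calls.
        next_start = started + interval
```

For example, `paced_calls(3, 40, lambda i: print("request", i))` prints three lines about 1.5 s apart. Pacing by start time keeps the offered load at or below the target rate, which is the property this test relied on.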
Note that the request rate, converted to requests/min, was about 40.02 req/min. This rate proved to be a point of stability. The test was stopped early because at this rate a full harvest would have taken about 36 hours. A Jira ticket was created to improve the performance of mod-oai-pmh: MODOAIPMH-443: Investigate OAI-PMH thread blocked (Closed).
Database CPU usage remained low during the harvest.
CPU utilization of the relevant modules during the harvest.
Memory utilization of the modules involved in the OAI-PMH workflow.