OAI-PMH data harvesting (Orchid) by EBSCO Harvester
Overview
A JMeter script with a 40 requests per minute throughput limit was used to achieve a fully successful harvest; however, such a throughput limit is not a simulation of real user behaviour. The purpose of the current OAI-PMH tests is to measure the performance of the Orchid release with the recommended tool, EBSCO Harvester, and to find possible issues and bottlenecks, per PERF-510.
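For reference, below is a minimal sketch of the kind of throttled OAI-PMH harvest loop the JMeter script approximated (about 40 requests per minute, following resumption tokens). The endpoint URL, tenant header, and metadataPrefix are placeholder assumptions, not the exact values used in these tests.

```python
# Minimal sketch of a throttled OAI-PMH harvest loop (~40 requests per minute),
# approximating the JMeter setup mentioned above. The URL, tenant header, and
# metadataPrefix are placeholder assumptions, not the values used in the tests.
import time
import xml.etree.ElementTree as ET

import requests

OAI_URL = "https://folio-orchid.example.org/oai/records"  # placeholder endpoint
HEADERS = {"x-okapi-tenant": "fs09000000"}                 # placeholder tenant
MIN_INTERVAL = 60 / 40                                     # 40 requests per minute
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"

def resumption_token(xml_text: str):
    """Return the <resumptionToken> value, or None when the harvest is finished."""
    node = ET.fromstring(xml_text).find(f".//{OAI_NS}resumptionToken")
    return node.text if node is not None and node.text else None

def harvest():
    params = {"verb": "ListRecords", "metadataPrefix": "marc21_withholdings"}
    while True:
        started = time.monotonic()
        resp = requests.get(OAI_URL, params=params, headers=HEADERS, timeout=120)
        resp.raise_for_status()
        token = resumption_token(resp.text)
        if not token:
            break  # no resumptionToken in the response: harvest complete
        params = {"verb": "ListRecords", "resumptionToken": token}
        # Pace consecutive requests so the overall rate stays near 40 per minute.
        time.sleep(max(0.0, MIN_INTERVAL - (time.monotonic() - started)))
```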
Summary
- Tests were executed with EBSCO Harvester on the AWS ptf-windows instance. Unlike the JMeter script, EBSCO Harvester does not enforce the 40 requests per minute throughput limit, which is why a successful harvesting operation with EBSCO Harvester took less time (~18 hours) than with the throttled JMeter script (~25 hours). Moreover, EBSCO Harvester has retry functionality that allows harvesting to continue even after requests fail with timeouts (a retry sketch follows this list).
- When the mod-oai-pmh.instance table accumulates instance UUIDs from previous harvests, as happened when the PTF environment's mod-oai-pmh.instance table reached 30M records, inserting new records into the table takes longer, so the overall duration of creating (downloading) new 'instanceUUIDs' records increases as well. (Not a critical issue.)
- '(504) Gateway timeout' responses were caused by mod-oai-pmh 'java.lang.OutOfMemoryError: Java heap space' errors. These 'java.lang.OutOfMemoryError' exceptions appeared when not all 'instanceUUIDs' records were created for the request; the reason why this happens should be investigated additionally.
- The 'repository.fetchingChunkSize=8000' option increased the duration of the harvesting request; the default value (5000) shows optimal results.
- All test executions show similar service memory behaviour. Only after restarting the service is service memory usage at an optimal level. Once a harvesting operation starts, service memory usage grows to 99%, stabilizes at that level, and does not drop back (even during subsequent harvesting processes). On the one hand, it is currently not possible to determine how much memory is used per harvest (a CloudWatch sketch that could help with this follows this list). On the other hand, the reason why service memory usage does not decrease when there is no activity should be investigated additionally: it could be that AWS displays the aggregated memory of the containers, or it could be a FOLIO issue.
- When all 'instanceUUIDs' records were created for the request as expected (ocp2 - 10'023'100 'instance' records), with either 'repository.fetchingChunkSize' value (5000, 8000), the harvesting operation took less than 24 hours and completed successfully (a counter check illustrating how completeness can be verified follows this list). However, the instability of harvesting on the Orchid release, caused by not all 'instance' records being created for the request, should be investigated and fixed under MODOAIPMH-509.
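As noted above, EBSCO Harvester retries failed requests instead of aborting the harvest. Below is a minimal sketch of that kind of retry-on-timeout behaviour; the attempt count and back-off values are illustrative assumptions, not EBSCO Harvester's actual settings.

```python
# Minimal sketch of retry-on-timeout behaviour similar to what is described for
# EBSCO Harvester: repeat a failed ListRecords request a few times so that a
# single timeout or 504 does not abort the whole harvest. Attempt count and
# back-off values are illustrative assumptions.
import time

import requests

def fetch_with_retry(url, params, headers, attempts=3, backoff_seconds=30):
    for attempt in range(1, attempts + 1):
        try:
            resp = requests.get(url, params=params, headers=headers, timeout=120)
            if resp.status_code == 504:  # gateway timeout surfaced by the proxy
                raise requests.exceptions.Timeout("504 Gateway Timeout")
            resp.raise_for_status()
            return resp
        except requests.exceptions.Timeout:
            if attempt == attempts:
                raise  # give up after the configured number of attempts
            time.sleep(backoff_seconds * attempt)  # simple linear back-off
```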
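Regarding per-harvest memory usage, one possible way to obtain container-level figures is to query CloudWatch Container Insights for the mod-oai-pmh service, as sketched below. This assumes Container Insights is enabled for the ECS cluster; the cluster/service names, region, and time window are placeholders.

```python
# Minimal sketch, assuming Container Insights is enabled for the ECS cluster, of
# pulling mod-oai-pmh memory figures from CloudWatch to estimate how much memory
# a single harvest consumes. Cluster/service names, region, and the time window
# are placeholders.
from datetime import datetime, timedelta

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")  # placeholder region

response = cloudwatch.get_metric_statistics(
    Namespace="ECS/ContainerInsights",
    MetricName="MemoryUtilized",
    Dimensions=[
        {"Name": "ClusterName", "Value": "ptf-cluster"},  # placeholder cluster
        {"Name": "ServiceName", "Value": "mod-oai-pmh"},  # placeholder service
    ],
    StartTime=datetime.utcnow() - timedelta(hours=24),    # e.g. the harvest window
    EndTime=datetime.utcnow(),
    Period=300,
    Statistics=["Average", "Maximum"],
)
for point in sorted(response["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Average"], point["Maximum"])
```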
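The 'Metadata Final Response' column in the results table below contains the request metadata returned by mod-oai-pmh. A minimal sketch of how those counters can be interpreted to confirm a complete harvest follows; it assumes the JSON has been saved to a file and uses the ocp2 instance count from this report as the expected total.

```python
# Minimal sketch of interpreting the mod-oai-pmh request metadata shown in the
# "Metadata Final Response" column: treat a harvest as complete only when the
# stream has ended and the downloaded counter matches the expected instance
# count (10,023,100 on ocp2). The JSON is assumed to have been saved to a file.
import json

EXPECTED_INSTANCES = 10_023_100  # 'instance' record count on ocp2 (from this report)

def harvest_completed(metadata_path: str) -> bool:
    with open(metadata_path) as fh:
        meta = json.load(fh)["requestMetadataCollection"][0]
    downloaded = meta["downloadedAndSavedInstancesCounter"]
    print(
        f"requestId={meta['requestId']} streamEnded={meta['streamEnded']} "
        f"downloaded={downloaded} returned={meta['returnedInstancesCounter']} "
        f"skipped={meta['skippedInstancesCounter']} failed={meta['failedInstancesCounter']}"
    )
    return meta["streamEnded"] and downloaded == EXPECTED_INSTANCES
```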
Recommendations & Jiras
- MODOAIPMH-509
Test Runs & Results
The table contains information about the test executions, their results, and the 'mod-oai-pmh' service configuration:
# | Date | Start | Finish | Source | Records/Req | Metadata Final Response | DB_Instances Created | Result | Comment | Version | CPU | memory / memoryReservation | Xmx | MaxMetaSpaceSize | MetaspaceSize | Tasks | Task Revision
1 | 4/4/2023 | 4/4/2023 14:15 | SRS+Inv | 100 | { "requestMetadataCollection": [ { "requestId": "df1ef27f-fcc5-4e51-a7b8-0577ced08380", "lastUpdatedDate": "2023-04-04T14:56:26.003+00:00", "streamEnded": false, "downloadedAndSavedInstancesCounter": 0, "failedToSaveInstancesCounter": 0, "returnedInstancesCounter": 1741881, "skippedInstancesCounter": 33019, "failedInstancesCounter": 0, "suppressedInstancesCounter": 0 } ], "totalRecords": 1 } | x | (504) Gateway timeout was caused by mod-oai-pmh java.lang.OutOfMemoryError: Java heap space | mod-oai-pmh:3.11.3 | 2048 | 2048/1845 | 1440m | 512m | 384m | 2 | 12 | ||
2 | 4/5/2023 | 5/4/2023 11:13 | SRS+Inv | 100 | x | x | Not enough space. Local problem | mod-oai-pmh:3.11.3 | 3072 | 3072/2872 | 1440m | 512m | 384m | 2 | 13 | ||
3 | 4/5/2023 | 5/4/2023 12:58 | SRS+Inv | 100 | x | x | Not enough space. Local problem | mod-oai-pmh:3.11.3 | 3072 | 3072/2872 | 1440m | 512m | 384m | 2 | 13 | ||
4 | 4/6/2023 | 6/4/2023 20:03 | 7/4/2023 14:15 | SRS+Inv | 100 | { "requestMetadataCollection": [ { "requestId": "d6dbf068-9a12-4d8b-915c-c340115dc2e0", "lastUpdatedDate": "2023-04-07T11:57:25.317+00:00", "streamEnded": true, "downloadedAndSavedInstancesCounter": 10023100, "failedToSaveInstancesCounter": 0, "returnedInstancesCounter": 9951864, "skippedInstancesCounter": 71236, "failedInstancesCounter": 0, "suppressedInstancesCounter": 0 } ], "totalRecords": 1 } | Harvest has completed | mod-oai-pmh:3.12.0-S.299 | 2048 | 3072/2767 | 2150m | 512m | 384m | 2 | 14 | ||
5 | 4/7/2023 | 7/4/2023 16:30 | 7/4/2023 17:09 | SRS | 100 | { "requestMetadataCollection": [ { "requestId": "27d95a17-cdf9-4d3b-af70-a0792d839d29", "lastUpdatedDate": "2023-04-07T17:26:55.603+00:00", "streamEnded": false, "downloadedAndSavedInstancesCounter": 0, "failedToSaveInstancesCounter": 0, "returnedInstancesCounter": 2127445, "skippedInstancesCounter": 32755, "failedInstancesCounter": 0, "suppressedInstancesCounter": 0 } ], "totalRecords": 1 } | 2,345,000 | (504) Gateway timeout was caused by mod-oai-pmh java.lang.OutOfMemoryError: Java heap space | mod-oai-pmh:3.12.0-S.299 | 2048 | 3072/2767 | 2150m | 512m | 384m | 2 | 14 | |
6 | 4/11/2023 | 11/4/2023 11:23 | 11/4/2023 12:50 | SRS | 100 | { "requestMetadataCollection": [ { & |