OAI-PMH data harvesting [Incremental + Full] (Poppy)
Overview
- The purpose of the OAI-PMH Incremental Harvesting tests is to measure the performance of the Poppy release and to find possible issues and bottlenecks (PERF-660) on the OCP2 environment.
- The purpose of the OAI-PMH Full Harvesting tests is to measure the performance of the Poppy release using the EBSCO-recommended Harvester tool and to find possible issues and bottlenecks (PERF-659) on the OCP2 environment.
Summary
OAI-PMH - Incremental Harvesting:
Three tests were executed with a JMeter script to check the performance of harvesting 10K, 25K, 50K, 500K, and 1 MLN records with different OAI-PMH Behaviors:
Test 1. Record source set to Source record storage;
Test 2. Record source set to Inventory;
Test 3. Record source set to Source record storage and inventory.
- Harvesting time is similar in Test 1 and Test 3, but Test 2 (10K, 25K, 50K) takes about 50% longer to process because of the instance-creation date distribution: about 250K instances were created between 1962 and 2023, and about 800K instances were created on 2023-10-23;
- CPU usage was consistent throughout all of the tests and did not exceed 5% for any service; at the beginning of each test we observed a spike in CPU usage that lasted for a few seconds;
- Memory utilization was stable, except for the edge-oai-pmh service;
- Database CPU utilization reached a maximum of 15%; the number of DB connections was about 140.
OAI-PMH - Full Harvesting:
Three tests were executed using the EBSCO Harvester to check performance with different OAI-PMH Behaviors:
Test 4. Record source set to Source record storage. Test duration was about 16 hours 42 min; 10,403,507 Inventory instances returned; 76.4 GB of data stored on disk.
Test 5. Record source set to Inventory. Test duration was about 1 hour 46 min; 1,122,521 Inventory instances returned; 1.96 GB of data stored on disk.
Test 6. Record source set to Source record storage and inventory. Test duration was about 16 hours 50 min; 11,526,892 Inventory instances returned; 78.4 GB of data stored on disk.
- The average CPU utilization during Test 4 and Test 6 was about: mod-oai-pmh-b = 7%, edge-oai-pmh-b = 3%, mod-source-record-storage-b = 1%, okapi-b = 1.2%, mod-inventory-storage-b = 0.5%; for Test 5 these values were 1-2% lower;
- Memory utilization showed no problems. The edge-oai-pmh-b service was not restarted after each test, as it was on previous runs, in order to check for memory leaks. After Test 4 its memory utilization reached about 53%, and during Test 5 and Test 6 it fluctuated in the range of 45-55%;
- Average database CPU utilization for Tests 4-6 was about 16%; the number of DB connections was about 140.
Comparison results
Analysis of the OAI-PMH Incremental Harvester logs showed that after each /oai/records?verb=ListRecords&apikey=[APIKey]&resumptionToken=[resumptionToken] request, a waiting time was added in JMeter, which the program uses to save the response to a file. In addition, 800K instances were generated for Test 2. It would therefore be incorrect to compare the processing time and resource usage directly, since the system load and the number of RPS have changed.
Nevertheless, in comparison to OAI-PMH data harvesting (Orchid) and OAI-PMH data harvesting (Orchid) by EBSCO Harvester, several important points can be distinguished:
Incremental Harvesting
1) The duration of the harvesting is similar.
2) After stabilization, CPU utilization in both releases does not exceed 5%, and RDS CPU utilization was about 15%.
3) At the beginning of all tests there is a sharp increase in CPU usage, but in the Poppy release the maximum value is much lower than in Orchid; CPU usage stabilizes within a few minutes in Poppy, compared to 30 minutes in Orchid.
4) Memory usage: in Poppy the mod-oai-pmh service does not use 100% of the memory. The edge-oai-pmh-b service has a similar memory usage profile on both releases.
5) In Orchid, RDS CPU has no spikes at the beginning of each test.
Full harvest
1) Same as for Incremental Harvesting regarding memory usage: in Poppy, mod-oai-pmh does not use 100% of the memory. The edge-oai-pmh-b service has a similar memory usage profile on both releases.
2) DB CPU usage is more even. In both releases there are still spikes on ocp2-db-01 and ocp2-db-02, but they may be caused by the OAI-PMH Harvester program.
Improvements that can be noted in Poppy release:
1) There is no degradation in request processing time, as the duration is approximately the same;
2) High memory consumption by the mod-oai-pmh service has been fixed;
3) At the beginning of the tests there are no sharp spikes in either service CPU usage or database CPU usage;
4) Service CPU utilization is very low (~7%) and RDS CPU utilization is also very low (~15%), so there are enough resources to perform other actions in the system.
Recommendations & Jiras
- To ensure the same starting conditions before running tests with different Record source settings, the edge-oai-pmh service was restarted; this was done to return the service memory usage to its starting (post-deployment) value;
- Run the incremental harvesting tests with different Max records per response values, for example 200, 500, etc.;
- Conduct a more detailed analysis of why the edge-oai-pmh service consumes a lot of memory and does not release it after the tests finish;
- Generate 1 million instances with a uniform distribution over the period 2022-12-21 to 2023-10-16.
Test Runs & Results
Incremental harvesting
Number of harvested records | Test 1. Record source = Source record storage Duration | Test 2. Record source = Inventory Duration | Test 3. Record source = Source record storage and inventory Duration | Orchid source = Source record storage Duration | Orchid source = Source record storage and inventory Duration |
---|---|---|---|---|---|
10000 records (10K) | 2 min 8 sec | 3 min 44 sec | 2 min 4 sec | not tested | not tested |
25000 records (25K) | 4 min 43 sec | 6 min 50 sec | 4 min 13 sec | 3 min 50 sec | 4 min 32 sec |
50000 records (50K) | 9 min 12 sec | 12 min 48 sec | 8 min 12 sec | not tested | not tested |
500000 records (500K) | 1 hr 18 min | 1 hr 19 min | 1 hr 15 min | 1 hr 14 min | 1 hr 7 min |
1000000 records (1MLN) | 2 hr 29 min | 2 hr 29 min | 2 hr 24 min | 2 hr 1 min | 2 hr 21 min |
Full harvesting using EBSCO Harvester
Record source | Duration | Number of returned instances | Returned data volume (GB) | Number of files | Orchid Duration |
---|---|---|---|---|---|
Source record storage | 16 hr 42 min | 10,403,507 | 76.4 | 104,737 | ~17 hr |
Inventory | 1 hr 46 min | 1,122,521 | 1.96 | 11,227 | not tested |
Source record storage and inventory | 16 hr 50 min | 11,526,892 | 78.4 | 115,971 | ~18 hr |
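A rough throughput estimate can be derived from the table values alone. The sketch below (all inputs copied from the table) shows that all three record-source settings sustained roughly 620-690 thousand instances per hour, i.e. the extra 15 hours of Test 4/Test 6 versus Test 5 are explained by the ~10x larger record count, not by a slower rate.

```python
# Average harvested instances per hour, computed from the durations and
# instance counts in the table above (no values beyond the table are assumed).
def per_hour(records: int, hours: int, minutes: int) -> int:
    """Records divided by elapsed time in hours, rounded to the nearest int."""
    return round(records / (hours + minutes / 60))

rates = {
    "Source record storage":               per_hour(10_403_507, 16, 42),
    "Inventory":                           per_hour(1_122_521, 1, 46),
    "Source record storage and inventory": per_hour(11_526_892, 16, 50),
}
```

All three rates land in the same ballpark, which is consistent with the report's observation that harvest duration scales with the number of returned instances.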
Incremental harvesting resources utilization
Test 1. Record source = Source record storage
Service CPU Utilization
During four harvesting tests with 10K, 50K, 500K, and 1 MLN records, CPU usage remained steady, with a few minor fluctuations at the beginning of each test. The average CPU usage was: mod-oai-pmh-b = 3%, edge-oai-pmh-b = 2.5%, mod-source-record-storage-b = 1.5%, okapi-b = 1.2%, mod-inventory-storage-b = 0.5%. After the middle of the 4th test (1 MLN records), a hidden JMeter script was launched, which caused a significant increase in CPU consumption but did not affect processing time.
Service Memory Utilization
Memory utilization showed no problems, except for the edge-oai-pmh-b service. At the beginning of the testing it consumed approximately 20% of memory, but 30 minutes after the test finished it was consuming around 45%.
RDS CPU Utilization
Average CPU utilization during the four tests was about 13%.
RDS Database Connections
The number of database connections was about 140.
Test 2. Record source = Inventory
Service CPU Utilization
During four harvesting tests with 10K, 50K, 500K, and 1 MLN records, CPU usage remained steady, with a few minor fluctuations at the beginning of each test. The average CPU usage was: mod-oai-pmh-b = 2%, edge-oai-pmh-b = 2%, mod-source-record-storage-b = 1.2%, okapi-b = 1.1%, mod-inventory-storage-b = 0.5%.
Service Memory Utilization
Memory utilization showed no problems, except for the edge-oai-pmh-b service. At the beginning of the testing it consumed approximately 18% of memory, but 30 minutes after the test finished it was consuming around 34%.
RDS CPU Utilization
Average CPU utilization during the four tests was about 15%.
RDS Database Connections
The number of database connections was about 140.
Test 3. Record source = Source record storage and inventory
Service CPU Utilization
During four harvesting tests with 10K, 50K, 500K, and 1 MLN records, CPU usage remained steady, with a few minor fluctuations at the beginning of each test. The average CPU usage was: mod-oai-pmh-b = 3%, edge-oai-pmh-b = 2.5%, mod-source-record-storage-b = 1.4%, okapi-b = 1.1%, mod-inventory-storage-b = 0.5%.
Service Memory Utilization
Memory utilization showed no problems, except for the edge-oai-pmh-b service. During the third test, its memory consumption was the same as in the previous tests. Between the 500K and 1 MLN record tests there was a two-hour period during which no tests were running and the system was not loaded at all, yet memory consumption by edge-oai-pmh-b did not decrease during this period.
RDS CPU Utilization
Average CPU utilization during the four tests was about 12%.
RDS Database Connections
The number of database connections was about 140.
Full harvesting resources utilization
Test 4. Record source = Source record storage
Service CPU Utilization
During the harvesting test, the average CPU usage was: mod-oai-pmh-b = 7%, edge-oai-pmh-b = 3%, mod-source-record-storage-b = 1%, okapi-b = 1.2%, mod-inventory-storage-b = 0.5%. After the test, CPU utilization returned to its pre-test level.
Service Memory Utilization
Memory utilization showed no problems, except for the edge-oai-pmh-b service: during the test its memory consumption increased, and one hour after the test finished it had not decreased.
RDS CPU Utilization
Average CPU utilization during the test was about 16%.
RDS Database Connections
The number of database connections was about 140.
Test 5. Record source = Inventory
Service CPU Utilization
During the harvesting test, the average CPU usage was: mod-oai-pmh-b = 4%, edge-oai-pmh-b = 2%, mod-source-record-storage-b = 1%, okapi-b = 1.3%, mod-inventory-storage-b = 0.5%. After the test, CPU utilization returned to its pre-test level.
Service Memory Utilization
The full harvesting tests were run one after another without restarting the edge-oai-pmh-b service; memory consumption was stable and did not increase.
RDS CPU Utilization
Average CPU utilization during the test was about 16%.
RDS Database Connections
The number of database connections was about 140.
Test 6. Record source = Source record storage and inventory
Service CPU Utilization
During the harvesting test, the average CPU usage was: mod-oai-pmh-b = 7%, edge-oai-pmh-b = 4%, mod-source-record-storage-b = 1%, okapi-b = 1.2%, mod-inventory-storage-b = 0.5%. After the test, CPU utilization returned to its pre-test level.
Service Memory Utilization
Memory consumption was stable for all services. The edge-oai-pmh-b service was not restarted before the test; its memory utilization varied from 45% to 55%.
RDS CPU Utilization
Average CPU utilization during the test was about 17%.
The spike at 18:10-18:20 was caused by a sharp increase in the number of requests to ocp2-db-01.
RDS Database Connections
The number of database connections was about 140.
Appendix
Methodology/Approach
OAI-PMH incremental harvesting was carried out by a JMeter script from the carrier machine with two main requests:
- /oai/records?verb=ListRecords&metadataPrefix=marc21_withholdings&apikey=[APIKey]
- /oai/records?verb=ListRecords&apikey=[APIKey]&resumptionToken=[resumptionToken]
To extract the required number of records, a loop counter was used with the following configuration:
- 98 loop counts for 10K records;
- 248 loop counts for 25K records;
- 499 loop counts for 50K records;
- 5000 loop counts for 500K records;
- 10000 loop counts for 1MLN records;
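The request sequence above can be sketched as a simple loop: one initial ListRecords request with a metadataPrefix, then repeated requests with the resumptionToken from the previous response until the token is absent or the loop limit is reached. This is a minimal illustration, not the actual JMeter script; the `fetch` callable stands in for the HTTP call, the base URL and API key are placeholders, and the regex assumes plain OAI-PMH XML responses. (The loop counts listed suggest roughly 100 records per response, but that value is not stated in this report.)

```python
import re
from typing import Callable, Optional

def next_token(body: str) -> Optional[str]:
    """Extract a non-empty <resumptionToken> from an OAI-PMH XML response.
    An absent or empty token means the harvest is complete."""
    m = re.search(r"<resumptionToken[^>]*>([^<]+)</resumptionToken>", body)
    return m.group(1) if m else None

def harvest(fetch: Callable[[str], str], base: str, api_key: str,
            max_loops: int) -> int:
    """Run the initial ListRecords request, then follow resumptionTokens.
    Returns the total number of <record> elements seen."""
    url = (f"{base}?verb=ListRecords"
           f"&metadataPrefix=marc21_withholdings&apikey={api_key}")
    total = 0
    # one initial request plus up to max_loops resumption requests
    for _ in range(max_loops + 1):
        body = fetch(url)
        total += body.count("<record>")
        token = next_token(body)
        if token is None:
            break
        url = f"{base}?verb=ListRecords&apikey={api_key}&resumptionToken={token}"
    return total
```

In the real tests, JMeter's loop counter plays the role of `max_loops`, and the waiting time mentioned in the comparison section is the time the harvester spends writing each response to disk between iterations.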
To run the incremental harvesting tests, the following time ranges were defined experimentally. The time range for Test 2* was extended because it was impossible to harvest the defined number of records; the subsequent tests were run after adding 800K instances to the database.
Test | Start date | Until date |
---|---|---|
Test 1. | 2022-12-21 | 2023-10-16 |
Test 2*. | 1962-12-21 | 2023-10-23* |
Test 3. | 2022-12-21 | 2023-10-16 |
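For illustration, the harvest windows above map onto the standard OAI-PMH selective-harvesting arguments `from` and `until` on the initial ListRecords request. A minimal sketch, with the base URL and API key as placeholders (not values from this environment):

```python
from urllib.parse import urlencode

def list_records_url(base: str, api_key: str,
                     from_date: str, until_date: str) -> str:
    """Build the initial ListRecords URL for a given harvest window."""
    params = {
        "verb": "ListRecords",
        "metadataPrefix": "marc21_withholdings",
        "from": from_date,    # inclusive lower bound (YYYY-MM-DD granularity)
        "until": until_date,  # inclusive upper bound
        "apikey": api_key,
    }
    return f"{base}?{urlencode(params)}"

# The Test 1 / Test 3 window from the table:
url = list_records_url("https://edge.example.org/oai/records",
                       "APIKey", "2022-12-21", "2023-10-16")
```

Subsequent requests in a harvest carry only the resumptionToken; the `from`/`until` window is fixed by the initial request.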
OAI-PMH (full harvesting)
Before running the OAI-PMH full harvest, the following database command was executed to optimize the tables (from https://folio-org.atlassian.net/wiki/display/FOLIOtips/OAI-PMH+Best+Practices#OAIPMHBestPractices-SlowPerformance):
REINDEX INDEX <tenant>_mod_inventory_storage.audit_item_pmh_createddate_idx;
The following query was executed in the related database to remove existing 'instances' created by a previous harvesting request, together with the request itself:
TRUNCATE TABLE fs09000000_mod_oai_pmh.request_metadata_lb CASCADE;
Full harvesting tests were run from the ptf-windows machine using the EBSCO Harvester. The following cmd command (cmd should be run in the same directory as the EBSCO Harvester) starts the EBSCO Harvester:
With the following definition
Infrastructure
Environment: OCP2
Release: Poppy (2023 R2)
- 9 m6i.2xlarge EC2 instances located in US East (N. Virginia)
- 2 db.r6.xlarge database instances: one reader and one writer
- MSK tenant
- 4 brokers
Apache Kafka version 2.8.0
EBS storage volume per broker 300 GiB
- auto.create.topics.enable=true
- log.retention.minutes=480
- default.replication.factor=3
Modules
Module ocp2-pvt Mon Oct 23 15:48:03 UTC 2023 | Task Def. Revision | Module Version | Task Count | Mem Hard Limit | Mem Soft limit | CPU units | Xmx | MetaspaceSize | MaxMetaspaceSize | R/W split enabled |
---|---|---|---|---|---|---|---|---|---|---|
pub-edge | 8 | pub-edge:2022.03.02 | 2 | 1024 | 896 | 128 | 768 | 0 | 0 | false |
mod-inventory-storage | 1 | mod-inventory-storage:26.1.0-SNAPSHOT.696 | 2 | 2208 | 1952 | 1024 | 1440 | 384 | 512 | false |
edge-oai-pmh | 8 | edge-oai-pmh:2.7.0-SNAPSHOT.141 | 2 | 1512 | 1360 | 1024 | 1440 | 384 | 512 | false |
mod-source-record-storage | 13 | mod-source-record-storage:5.7.0-SNAPSHOT.247 | 2 | 5600 | 5000 | 2048 | 3500 | 384 | 512 | false |
mod-inventory | 13 | mod-inventory:20.1.0-SNAPSHOT.446 | 2 | 2880 | 2592 | 1024 | 1814 | 384 | 512 | false |
mod-circulation | 10 | mod-circulation:24.0.0-SNAPSHOT.601 | 2 | 2880 | 2592 | 1536 | 1814 | 384 | 512 | false |
mod-source-record-manager | 15 | mod-source-record-manager:3.7.0-SNAPSHOT.240 | 2 | 5600 | 5000 | 2048 | 3500 | 384 | 512 | false |
mod-quick-marc | 8 | mod-quick-marc:5.0.0-SNAPSHOT.114 | 1 | 2288 | 2176 | 128 | 1664 | 384 | 512 | false |
nginx-okapi | 8 | nginx-okapi:2023.09.21 | 2 | 1024 | 896 | 128 | 0 | 0 | 0 | false |
okapi-b | 9 | okapi:5.0.1 | 3 | 1684 | 1440 | 1024 | 922 | 384 | 512 | false |
mod-oai-pmh | 10 | mod-oai-pmh:3.12.0-SNAPSHOT.362 | 2 | 4096 | 3690 | 2048 | 3076 | 384 | 512 | false |