OAI-PMH data harvesting[Incremental + Full] (Poppy)
Overview
- The purpose of the OAI-PMH Incremental Harvesting tests is to measure the performance of the Poppy release and to find possible issues and bottlenecks, per PERF-660, on the OCP2 environment.
- The purpose of the OAI-PMH Full Harvesting tests is to measure the performance of the Poppy release with the recommended EBSCO Harvester tool and to find possible issues and bottlenecks, per PERF-659, on the OCP2 environment.
Summary
OAI-PMH - Incremental Harvesting:
Three tests were executed with a JMeter script to check the performance of harvesting 10K, 25K, 50K, 500K and 1 MLN records with different OAI-PMH behaviors:
Test 1. Record source set to Source record storage;
Test 2. Record source set to Inventory;
Test 3. Record source set to Source record storage and inventory.
- Harvesting time is similar in Test 1 and Test 3, but Test 2 (10K, 25K, 50K) takes about 50% more time to process because of the creation-date distribution: about 250K instances were created between 1962 and 2023, and about 800K instances were created on 2023-10-23;
- CPU usage was consistent throughout all tests and did not exceed 5% for any service; at the beginning of each test there was a spike in CPU usage that lasted a few seconds;
- Memory utilization was stable, except for the edge-oai-pmh service;
- Database CPU utilization reached a maximum of 15%; number of DB connections = 140.
OAI-PMH - Full Harvesting:
Three tests were executed using the EBSCO Harvester to check performance with different OAI-PMH behaviors:
Test 4. Record source set to Source record storage. Test duration was about 16 hours 42 min; 10403507 Inventory instances were returned and 76.4 GB of data was stored on disk.
Test 5. Record source set to Inventory. Test duration was about 1 hour 46 min; 1122521 Inventory instances were returned and 1.96 GB of data was stored on disk.
Test 6. Record source set to Source record storage and inventory. Test duration was about 16 hours 50 min; 11526892 Inventory instances were returned and 78.4 GB of data was stored on disk.
- The average CPU utilization during Test 4 and Test 6 was about: mod-oai-pmh-b = 7%, edge-oai-pmh-b = 3%, mod-source-record-storage-b = 1%, okapi-b = 1.2%, mod-inventory-storage-b = 0.5%. For Test 5, these values were 1-2% lower;
- Memory utilization showed no problems. Unlike in previous runs, the edge-oai-pmh-b service was not restarted after each test, in order to check for memory leaks. After Test 4, its memory utilization reached about 53%, and during Test 5 and Test 6 it fluctuated in the range of 45-55%;
- Average database CPU utilization for Tests 4-6 was about 16%; number of DB connections = 140.
Comparison results
Following analysis of the OAI-PMH incremental harvester logs, a waiting time was added in JMeter after each /oai/records?verb=ListRecords&apikey=[APIKey]&resumptionToken=[resumptionToken] request, to match the time the harvester program spends saving the response to a file. In addition, 800K instances were generated for Test 2. It would therefore be incorrect to compare processing time and resource usage directly, since the system load and the number of requests per second (RPS) have changed.
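The request pattern described above (repeated ListRecords calls chained through resumption tokens, with a pause after each response) can be sketched as follows. This is a minimal illustration, not the actual JMeter script or harvester: `fetch` is a hypothetical callable that would issue the GET request to `/oai/records` and return the response body, and `save_delay_s` stands in for the waiting time added in JMeter. The only protocol facts relied on are the standard OAI-PMH 2.0 XML namespace and the rule that an empty or absent `resumptionToken` element ends the list.

```python
import time
import xml.etree.ElementTree as ET
from typing import Callable, Iterator, Optional

# Standard OAI-PMH 2.0 XML namespace, in ElementTree's "{uri}" notation.
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def next_token(response_xml: str) -> Optional[str]:
    """Return the resumptionToken of a ListRecords response, or None if the list is complete."""
    root = ET.fromstring(response_xml)
    token = root.find(f".//{OAI_NS}resumptionToken")
    if token is None or not (token.text or "").strip():
        return None  # absent or empty token: no more pages
    return token.text.strip()


def harvest(fetch: Callable[[Optional[str]], str],
            save_delay_s: float = 0.0) -> Iterator[str]:
    """Yield ListRecords response bodies until no resumption token is returned.

    `fetch(None)` is assumed to issue the initial request and `fetch(token)`
    a follow-up request with that resumptionToken (both hypothetical).
    """
    token: Optional[str] = None
    while True:
        body = fetch(token)
        yield body
        time.sleep(save_delay_s)  # emulates the time spent saving the response
        token = next_token(body)
        if token is None:
            break
```

Keeping the HTTP call behind `fetch` makes the loop easy to exercise with canned responses before pointing it at a real environment.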
Compared with the results in OAI-PMH data harvesting (Orchid) and OAI-PMH data harvesting (Orchid) by EBSCO Harvester, several important points can still be distinguished:
Incremental Harvesting
1) The duration of the harvesting is similar.
2) After stabilization, service CPU utilization in both tests does not exceed 5%, and RDS CPU utilization was about 15%.
3) At the beginning of all tests there is a sharp increase in CPU usage, but in the Poppy release the maximum value is much lower than in Orchid. CPU usage stabilizes within a few minutes in Poppy, compared to 30 minutes in Orchid.
4) Memory usage: in Poppy, the mod-oai-pmh service does not use 100% of the memory. The edge-oai-pmh-b service has a similar memory usage profile in both releases.
5) RDS CPU in Orchid has no spikes at the beginning of each test.
Full harvest
1) Memory usage is the same as in incremental harvesting: in Poppy, mod-oai-pmh does not use 100% of the memory. The edge-oai-pmh-b service has a similar memory usage profile in both releases.
2) DB CPU usage is more even. In both releases there are still spikes on ocp2-db-01 and ocp2-db-02, but they may be caused by the OAI-PMH harvester program.
Improvements that can be noted in the Poppy release:
1) There is no degradation in request processing time; the duration is approximately the same;
2) High memory consumption by the mod-oai-pmh service has been fixed;
3) At the beginning of the tests there are no sharp spikes in service CPU usage or database CPU usage;
4) Service CPU utilization is very low (~7%), and RDS CPU utilization is also very low (~15%), so enough resources remain to perform other actions in the system.
Recommendations & Jiras
- To have the same starting conditions before running tests with different Record source settings, the edge-oai-pmh service was restarted; this returned the service memory usage to its starting (after deployment) value;
- Run the incremental harvesting tests with different Max records per response values, for example 200, 500 etc.;
- Conduct a more detailed analysis of why the edge-oai-pmh service consumes a lot of memory and does not release it after the tests are finished;
- Generate 1 million instances with creation dates uniformly distributed between 2022-12-21 and 2023-10-16.
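For the last recommendation, the timestamps themselves are straightforward to produce. The sketch below draws creation timestamps uniformly at random between two dates. The function name and the `seed` parameter are illustrative. How the timestamps would be applied to generated instances is not covered here and depends on the data generation tooling in use.

```python
import random
from datetime import datetime, timedelta
from typing import List, Optional


def uniform_timestamps(count: int, start: datetime, end: datetime,
                       seed: Optional[int] = None) -> List[datetime]:
    """Return `count` timestamps drawn uniformly at random from [start, end]."""
    if end <= start:
        raise ValueError("end must be later than start")
    rng = random.Random(seed)  # a seeded generator makes test data reproducible
    span_s = (end - start).total_seconds()
    return [start + timedelta(seconds=rng.uniform(0, span_s)) for _ in range(count)]


# Date range taken from the recommendation above; a small count keeps the demo fast.
sample = uniform_timestamps(1_000, datetime(2022, 12, 21), datetime(2023, 10, 16), seed=42)
print(min(sample), max(sample))
```

Passing `count=1_000_000` would cover the full data set named in the recommendation.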
Test Runs & Results
Incremental harvesting
Number of harvested records | Test 1 duration (Record source = Source record storage) | Test 2 duration (Record source = Inventory) | Test 3 duration (Record source = Source record storage and inventory) | Orchid duration (Record source = Source record storage) | Orchid duration (Record source = Source record storage and inventory) |
---|---|---|---|---|---|
10000 records (10K) | 2 min 8 sec | 3 min 44 sec | 2 min 4 sec | not tested | not tested |
25000 records (25K) | 4 min 43 sec | 6 min 50 sec | 4 min 13 sec | 3 min 50 sec | 4 min 32 sec |
50000 records (50K) | 9 min 12 sec | 12 min 48 sec | 8 min 12 sec | not tested | not tested |
500000 records (500K) | 1 hour 18 min | 1 hour 19 min | 1 hour 15 min | 1 hour 14 min | 1 hour 7 min |
1000000 records (1 MLN) | 2 hours 29 min | 2 hours 29 min | 2 hours 24 min | 2 hours 1 min | 2 hours 21 min |
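Because the test sizes span two orders of magnitude, the durations are easier to compare as throughput. The following sketch converts the Test 1 durations from the table above into records per second. The helper and dictionary names are illustrative and not part of the test tooling.

```python
def records_per_second(records: int, hours: int = 0, minutes: int = 0, seconds: int = 0) -> float:
    """Average harvesting throughput for one test run."""
    total_s = hours * 3600 + minutes * 60 + seconds
    if total_s <= 0:
        raise ValueError("duration must be positive")
    return records / total_s


# Test 1 (Record source = Source record storage): records -> (hours, minutes, seconds)
POPPY_TEST1 = {
    10_000: (0, 2, 8),
    25_000: (0, 4, 43),
    50_000: (0, 9, 12),
    500_000: (1, 18, 0),
    1_000_000: (2, 29, 0),
}

for records, (h, m, s) in POPPY_TEST1.items():
    print(f"{records:>9} records: {records_per_second(records, h, m, s):6.1f} records/s")
```

The same helper can be applied to the other columns to compare record sources or releases on a common scale.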
Full harvesting using EBSCO Harvester
Record source | Duration | Number of returned instances | Volume of returned data (GB) | Number of files | Orchid duration |
---|---|---|---|---|---|
Source record storage | 16 hours 42 min | 10403507 | 76.4 | 104,737 | ~ 17 h |
Inventory | 1 hour 46 min | 1122521 | 1.96 | 11,227 | not tested |
Source record storage and inventory | 16 hours 50 min | 11526892 | 78.4 | 115,971 | ~ 18 h |
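A few derived figures make the full harvesting rows easier to compare: records per output file, average throughput, and average size per record. The sketch below computes them from the table values. It assumes that 1 GB means 10^9 bytes, which the report does not state, and the class and field names are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FullHarvest:
    source: str
    duration_s: int    # test duration in seconds
    instances: int     # number of returned instances
    volume_gb: float   # data stored on disk, in GB
    files: int         # number of files written by the harvester

    @property
    def records_per_file(self) -> float:
        return self.instances / self.files

    @property
    def records_per_second(self) -> float:
        return self.instances / self.duration_s

    @property
    def bytes_per_record(self) -> float:
        # Assumption: 1 GB = 10**9 bytes (the report does not say GB vs GiB).
        return self.volume_gb * 1e9 / self.instances


RUNS = [
    FullHarvest("Source record storage", 16 * 3600 + 42 * 60, 10_403_507, 76.4, 104_737),
    FullHarvest("Inventory", 1 * 3600 + 46 * 60, 1_122_521, 1.96, 11_227),
    FullHarvest("Source record storage and inventory", 16 * 3600 + 50 * 60, 11_526_892, 78.4, 115_971),
]

for run in RUNS:
    print(f"{run.source}: {run.records_per_file:.1f} records/file, "
          f"{run.records_per_second:.0f} records/s, "
          f"{run.bytes_per_record / 1000:.1f} KB/record")
```

All three runs work out to roughly 100 records per file, while the average size per record differs noticeably between Source record storage and Inventory.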
Incremental harvesting resources utilization
Test 1. Record source = Source record storage
Service CPU Utilization
During the four harvesting tests with 10K, 50K, 500K and 1 MLN records, CPU usage remained steady, with a few minor fluctuations at the beginning of each test. Average CPU usage: mod-oai-pmh-b = 3%, edge-oai-pmh-b = 2.5%, mod-source-record-storage-b = 1.5%, okapi-b = 1.2%, mod-inventory-storage-b = 0.5%. After the middle of the fourth test (1 MLN records), a hidden JMeter script was launched by an unidentified trigger, which caused a significant increase in CPU consumption but did not affect processing time.
Service Memory Utilization
Memory utilization showed no problems, except for the edge-oai-pmh-b service. At the beginning of testing it consumed approximately 20% of memory, but 30 minutes after the test finished it was consuming around 45%.