OAI-PMH data harvesting [Incremental + Full] (Poppy)

Overview

  • The purpose of the OAI-PMH Incremental Harvesting tests is to measure the performance of the Poppy release and to find possible issues and bottlenecks, per PERF-660, on the OCP2 environment.
  • The purpose of the OAI-PMH Full Harvesting tests is to measure the performance of the Poppy release using the recommended EBSCO Harvester tool and to find possible issues and bottlenecks, per PERF-659, on the OCP2 environment.

Summary

  • OAI-PMH - Incremental Harvesting:

    • Three tests were executed with a JMeter script to check the performance of harvesting 10K, 25K, 50K, 500K and 1 MLN records with different OAI-PMH Behaviors:

      • Test 1. Record source set to Source record storage;

      • Test 2. Record source set to Inventory;

      • Test 3. Record source set to Source record storage and inventory;

    • Harvesting time is similar in Test 1 and Test 3, but Test 2 (10K, 25K, 50K) takes about 50% longer to process because of the distribution of creation dates: about 250K instances were created between 1962 and 2023, and about 800K instances were created on 2023-10-23;
    • CPU usage was consistent throughout all of the tests and did not exceed 5% for any service; at the beginning of each test we observed a spike in CPU usage that lasted for a few seconds;
    • Memory utilization was stable, except for the edge-oai-pmh service;
    • Database CPU utilization reached a maximum of 15%; number of DB connections = 140;
  • OAI-PMH - Full Harvesting:
    • Three tests were executed using EBSCO Harvester to check performance with different OAI-PMH Behaviors:

      • Test 4. Record source set to Source record storage. Test duration was about 16 hours 42 min; 10403507 Inventory instances were returned; 76.4 GB of data was stored on disk.

      • Test 5. Record source set to Inventory. Test duration was about 1 hour 46 min; 1122521 Inventory instances were returned; 1.96 GB of data was stored on disk.

      • Test 6. Record source set to Source record storage and inventory. Test duration was about 16 hours 50 min; 11526892 Inventory instances were returned; 78.4 GB of data was stored on disk.

    • The average CPU utilization during Test 4 and Test 6 was about: mod-oai-pmh-b = 7%, edge-oai-pmh-b = 3%, mod-source-record-storage-b = 1%, okapi-b = 1.2%, mod-inventory-storage-b = 0.5%; for Test 5, these values were 1-2% lower;
    • Memory utilization showed no problems. Unlike in previous runs, the edge-oai-pmh-b service was not restarted after each test, in order to check for memory leaks. After Test 4 its memory utilization reached about 53%, and during Test 5 and Test 6 it fluctuated in the range of 45-55%;
    • Average database CPU utilization for Tests 4-6 was about 16%; number of DB connections = 140.

Comparison results
Analysis of the OAI-PMH Incremental Harvester logs showed that, after each request /oai/records?verb=ListRecords&apikey=[APIKey]&resumptionToken=[resumptionToken] is executed, the harvester program spends time saving the response to a file, so an equivalent waiting time was added in JMeter after each request. In addition, 800K instances were generated for Test 2. It would therefore be incorrect to compare processing time and resource usage directly, since the system load and the number of requests per second (RPS) have changed.
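
The request pattern described above (repeat ListRecords with the returned resumptionToken, pausing after each response) can be sketched as follows. This is a minimal illustration, not the JMeter script or the harvester itself: `fetch`, `fake_fetch` and `wait_seconds` are hypothetical names, and the fake responses only imitate the standard OAI-PMH `resumptionToken` element.

```python
import time
import xml.etree.ElementTree as ET
from typing import Callable, Iterator, Optional

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def next_token(response_xml: str) -> Optional[str]:
    """Return the resumptionToken of a ListRecords response, or None on the last page."""
    root = ET.fromstring(response_xml)
    token = root.find(f".//{OAI_NS}resumptionToken")
    if token is None or not (token.text or "").strip():
        return None  # a missing or empty token marks the end of the list
    return token.text.strip()


def harvest(fetch: Callable[[Optional[str]], str], wait_seconds: float = 0.0) -> Iterator[str]:
    """Yield every response page, pausing between requests."""
    token = None
    while True:
        page = fetch(token)  # hypothetical callable that performs the HTTP request
        yield page
        token = next_token(page)
        if token is None:
            break
        time.sleep(wait_seconds)  # stands in for the time spent saving the response to a file


def fake_fetch(token: Optional[str]) -> str:
    """Stand-in for the real endpoint: three pages chained by tokens t1 and t2."""
    following = {None: "t1", "t1": "t2", "t2": ""}[token]
    return (
        '<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/"><ListRecords>'
        f"<resumptionToken>{following}</resumptionToken>"
        "</ListRecords></OAI-PMH>"
    )


print(len(list(harvest(fake_fetch))))  # 3
```

Because the pause happens once per request, the total harvest time depends on the number of pages as well as on server-side processing, which is one reason the durations are not directly comparable between releases.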

Nevertheless, in comparison to the OAI-PMH data harvesting (Orchid) and OAI-PMH data harvesting (Orchid) by EBSCO Harvester reports, several important points stand out:
Incremental Harvesting
1) The duration of the harvesting is similar.
2) After stabilization, the service CPU utilization in both releases does not exceed 5%, and RDS CPU utilization was about 15%.
3) At the beginning of all tests there is a sharp increase in CPU usage, but in the Poppy release the maximum value is much lower than in Orchid, and CPU usage stabilizes within a few minutes in Poppy, compared to 30 minutes in Orchid.
4) Memory usage. In Poppy, the mod-oai-pmh service does not use 100% of the memory. The edge-oai-pmh-b service has a similar memory usage profile on both releases.
5) RDS CPU in Orchid has no spikes at the beginning of each test.

Full harvest

1) Memory usage: same as Incremental Harvesting. In Orchid, mod-oai-pmh does not use 100% of the memory. The edge-oai-pmh-b service has a similar memory usage profile on both releases.
2) DB CPU usage is more even. In both releases there are still spikes on ocp2-db-01 and ocp2-db-02, but they may be caused by the OAI-PMH Harvester program.

Improvements that can be noted in Poppy release:
1) There is no degradation in request processing time, as duration is approximately the same;

2) Fixed high memory consumption by mod-oai-pmh service;

3) At the beginning of the tests, there are no sharp spikes in service CPU usage or database CPU usage.

4) Service CPU utilization is very low (~7%), and RDS CPU utilization is also very low (~15%), so enough resources remain to perform other actions in the system.

Recommendations & Jiras

  • To ensure the same starting conditions, the edge-oai-pmh service was restarted before running each test with a different Record source setting; this returned the service memory usage to its starting (post-deployment) value;
  • Run the incremental harvesting tests with different Max records per response values, for example 200, 500, etc.;
  • Conduct a more detailed analysis of why the edge-oai-pmh service consumes a lot of memory and does not release it after the tests are finished;
  • Generate 1 million instances with a uniform distribution over the period 2022-12-21 to 2023-10-16.
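
To plan the runs with different Max records per response values, it may help to estimate how many ListRecords requests each harvest size would need. The sketch below assumes each response carries exactly the configured maximum until the last page; the baseline of 100 is an assumption used for illustration, not a value stated in this report.

```python
import math


def requests_needed(total_records: int, max_records_per_response: int) -> int:
    """Number of ListRecords responses needed to page through total_records."""
    if max_records_per_response <= 0:
        raise ValueError("max_records_per_response must be positive")
    return math.ceil(total_records / max_records_per_response)


harvest_sizes = (10_000, 25_000, 50_000, 500_000, 1_000_000)
for page_size in (100, 200, 500):  # 100 is an assumed baseline
    counts = [requests_needed(n, page_size) for n in harvest_sizes]
    print(page_size, counts)
```

Fewer, larger responses reduce the number of round trips, but whether that shortens the total harvest depends on per-response processing cost, which is what the proposed tests would measure.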

Test Runs & Results

Incremental harvesting

| Number of harvested records | Test 1. Record source = Source record storage, Duration | Test 2. Record source = Inventory, Duration | Test 3. Record source = Source record storage and inventory, Duration | Orchid, source = Source record storage, Duration | Orchid, source = Source record storage and inventory, Duration |
|---|---|---|---|---|---|
| 10,000 (10K) | 2 min 8 sec | 3 min 44 sec | 2 min 4 sec | not tested | not tested |
| 25,000 (25K) | 4 min 43 sec | 6 min 50 sec | 4 min 13 sec | 3 min 50 sec | 4 min 32 sec |
| 50,000 (50K) | 9 min 12 sec | 12 min 48 sec | 8 min 12 sec | not tested | not tested |
| 500,000 (500K) | 1 h 18 min | 1 h 19 min | 1 h 15 min | 1 h 14 min | 1 h 7 min |
| 1,000,000 (1 MLN) | 2 h 29 min | 2 h 29 min | 2 h 24 min | 2 h 1 min | 2 h 21 min |
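
As a rough cross-check of the incremental results, the Poppy durations can be converted to throughput and to the relative slowdown of Test 2 against Test 1. The figures are transcribed from the durations reported above; the function names are illustrative only.

```python
# Durations in seconds for (Test 1, Test 2, Test 3), keyed by number of records.
durations = {
    10_000: (128, 224, 124),
    25_000: (283, 410, 253),
    50_000: (552, 768, 492),
    500_000: (78 * 60, 79 * 60, 75 * 60),
    1_000_000: (149 * 60, 149 * 60, 144 * 60),
}


def throughput(records: int, seconds: int) -> float:
    """Average harvested records per second."""
    return records / seconds


def overhead_pct(baseline_seconds: int, other_seconds: int) -> float:
    """How much longer other_seconds is than baseline_seconds, in percent."""
    return (other_seconds - baseline_seconds) / baseline_seconds * 100


for records, (t1, t2, t3) in durations.items():
    print(
        f"{records:>9}: Test 1 {throughput(records, t1):6.1f} rec/s, "
        f"Test 2 is {overhead_pct(t1, t2):5.1f}% slower than Test 1"
    )
```

With these inputs the Test 2 slowdown is about 75% at 10K, 45% at 25K and 39% at 50K, and it nearly disappears at 500K and 1 MLN, which fits the observation that the difference is limited to the smaller harvests.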

Full harvesting using EBSCO Harvester

| Record source | Duration | Number of returned instances | Volume of returned data (GB) | Number of files | Orchid, Duration |
|---|---|---|---|---|---|
| Source record storage | 16 h 42 min | 10403507 | 76.4 | 104,737 | ~ 17 h |
| Inventory | 1 h 46 min | 1122521 | 1.96 | 11,227 | not tested |
| Source record storage and inventory | 16 h 50 min | 11526892 | 78.4 | 115,971 | ~ 18 h |
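
The full-harvest figures can be reduced to two derived numbers that are easier to compare across runs: instances returned per second and instances per stored file. The sketch below only performs arithmetic on the values reported above; the dictionary layout and names are illustrative.

```python
# (duration in seconds, returned instances, number of files) per record source.
full_harvest = {
    "Source record storage": (16 * 3600 + 42 * 60, 10_403_507, 104_737),
    "Inventory": (1 * 3600 + 46 * 60, 1_122_521, 11_227),
    "Source record storage and inventory": (16 * 3600 + 50 * 60, 11_526_892, 115_971),
}


def summarize(seconds: int, instances: int, files: int) -> tuple[float, float]:
    """Return (instances per second, instances per file)."""
    return instances / seconds, instances / files


for source, row in full_harvest.items():
    per_second, per_file = summarize(*row)
    print(f"{source}: {per_second:.0f} instances/s, {per_file:.1f} instances/file")
```

All three runs come out at slightly under 100 instances per file, which would be consistent with one file per response at 100 records per response; that page size is an inference from the numbers, not something stated in this report.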

Incremental harvesting resources utilization

Test 1.  Record source = Source record storage

Service CPU Utilization

During the four harvesting tests with 10K, 50K, 500K and 1 MLN records, CPU usage remained steady, with a few minor fluctuations at the beginning of each test. Average CPU usage was: mod-oai-pmh-b = 3%, edge-oai-pmh-b = 2.5%, mod-source-record-storage-b = 1.5%, okapi-b = 1.2%, mod-inventory-storage-b = 0.5%. After the midpoint of the fourth test (1 MLN records), a hidden JMeter script was launched by an unidentified source, which caused a significant increase in CPU consumption but did not affect processing time.

Service Memory Utilization

Memory utilization showed no problems, except for the edge-oai-pmh-b service. At the beginning of testing it consumed approximately 20% of memory, but 30 minutes after the test finished it was consuming around 45%.