OAI-PMH data harvesting[Incremental + Full] (Poppy)

Overview

  • The purpose of the OAI-PMH Incremental Harvesting tests is to measure the performance of the Poppy release and to find possible issues and bottlenecks, per PERF-660, on the OCP2 environment.
  • The purpose of the OAI-PMH Full Harvesting tests is to measure the performance of the Poppy release with the recommended EBSCO Harvester tool and to find possible issues and bottlenecks, per PERF-659, on the OCP2 environment.

Summary

  • OAI-PMH - Incremental Harvesting:

    • Three tests were executed with a JMeter script to check the performance of harvesting 10K, 25K, 50K, 500K and 1 MLN records with different OAI-PMH Behavior settings:

      • Test 1. Record source set to Source record storage ;

      • Test 2. Record source set to Inventory ;

      • Test 3.  Record source set to Source record storage and inventory; 

    • Harvesting time is similar in Test 1 and Test 3, but Test 2 (10K, 25K, 50K) takes about 50% longer because of how instance creation dates are distributed: about 250K instances were created between 1962 and 2023, and about 800K were created on 2023-10-23;
    • CPU usage was consistent throughout all tests and did not exceed 5% for any service; at the beginning of each test we observed a spike in CPU usage that lasted a few seconds;
    • Memory utilization was stable, except for the edge-oai-pmh service;
    • Database CPU utilization reached a maximum of 15%; number of DB connections = 140;
  • OAI-PMH - Full Harvesting:
    • Three tests were executed using EBSCO Harvester to check performance with different OAI-PMH Behavior settings:

      • Test 4. Record source set to Source record storage. Test duration was about 16 hours 42 min; 10403507 instances were returned and 76.4 GB of data was stored on disk.

      • Test 5. Record source set to Inventory. Test duration was about 1 hour 46 min; 1122521 instances were returned and 1.96 GB of data was stored on disk.

      • Test 6. Record source set to Source record storage and inventory. Test duration was about 16 hours 50 min; 11526892 instances were returned and 78.4 GB of data was stored on disk.

    • Average CPU utilization during Test 4 and Test 6 was about mod-oai-pmh-b = 7%, edge-oai-pmh-b = 3%, mod-source-record-storage-b = 1%, okapi-b = 1.2%, mod-inventory-storage-b = 0.5%; for Test 5 these values were 1-2% lower;
    • Memory utilization showed no problems. Unlike in previous runs, the edge-oai-pmh-b service was not restarted after each test, in order to check for memory leaks. After Test 4 its memory utilization reached about 53%, and during Test 5 and Test 6 it fluctuated in the range of 45-55%;
    • Average RDS CPU utilization for Tests 4-6 was about 16%; number of DB connections = 140.

Comparison results
After analysis of the OAI-PMH harvester logs, a wait time was added in JMeter after each /oai/records?verb=ListRecords&apikey=[APIKey]&resumptionToken=[resumptionToken] request, to account for the time the program spends saving the response to a file. In addition, 800K instances were generated for Test 2. A direct comparison of processing time and resource usage would therefore be incorrect, since the system load and the request rate (RPS) have changed.

Nevertheless, in comparison with the Orchid reports (OAI-PMH data harvesting (Orchid) and OAI-PMH data harvesting (Orchid) by EBSCO Harvester), several important points can be distinguished:
Incremental Harvesting
1) The duration of the harvesting is similar.
2) After stabilization, service CPU utilization in both tests does not exceed 5%, and RDS CPU utilization was about 15%.
3) At the beginning of all tests there is a sharp increase in CPU usage, but in the Poppy release the maximum value is much lower than in Orchid. CPU usage stabilizes within a few minutes in Poppy, compared to 30 minutes in Orchid.
4) Memory usage: in Poppy the mod-oai-pmh service does not use 100% of the memory. The edge-oai-pmh-b service has a similar memory usage profile on both releases.
5) RDS CPU in Orchid has no spikes at the beginning of each test.

Full harvest

1) Memory usage: same as Incremental Harvesting. On Orchid, mod-oai-pmh does not use 100% of the memory. The edge-oai-pmh-b service has a similar memory usage profile on both releases.
2) DB CPU usage is more even. In both releases there are still spikes on ocp2-db-01 and ocp2-db-02, but they may be caused by the OAI-PMH Harvester program.

Improvements that can be noted in Poppy release:
1) There is no degradation in request processing time; the duration is approximately the same;

2) High memory consumption by the mod-oai-pmh service is fixed;

3) At the beginning of the tests there are no sharp spikes in service CPU usage or database CPU usage;

4) Service CPU utilization is very low (~7%), and RDS CPU utilization is also very low (~15%), so enough resources remain to perform other actions in the system.

Recommendations & Jiras

  • To have the same starting conditions before running tests with different Record source settings, the edge-oai-pmh service was restarted; this returned the service memory usage to its starting (after deployment) value;
  • Run the incremental harvesting tests with different Max records per response values, for example 200 and 500;
  • Conduct a more detailed analysis of why the edge-oai-pmh service consumes a lot of memory and does not release it after the tests are finished;
  • Generate 1 million instances with a uniform distribution over the period from 2022-12-21 to 2023-10-16.
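
One way to prepare such a data set is to draw the creation timestamps uniformly from the target date range. The sketch below is only an illustration in Python: the function name uniform_timestamps and its seed parameter are hypothetical, and how the timestamps would be written to instance records is not described in this report.

```python
import random
from datetime import datetime, timedelta, timezone


def uniform_timestamps(start, end, count, seed=None):
    """Return `count` sorted datetimes drawn uniformly from [start, end]."""
    rng = random.Random(seed)  # seeded so a data set can be regenerated
    span_seconds = (end - start).total_seconds()
    return sorted(
        start + timedelta(seconds=rng.uniform(0, span_seconds))
        for _ in range(count)
    )


# Date range taken from the recommendation; a small count keeps the demo fast.
range_start = datetime(2022, 12, 21, tzinfo=timezone.utc)
range_end = datetime(2023, 10, 16, tzinfo=timezone.utc)
sample = uniform_timestamps(range_start, range_end, count=1000, seed=1)
print(sample[0].isoformat(), sample[-1].isoformat())
```

For the real data set, count would be 1_000_000; the seed only makes the generated dates reproducible.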

Test Runs & Results

Incremental harvesting

Number of harvested records | Test 1. Record source = Source record storage, Duration | Test 2. Record source = Inventory, Duration | Test 3. Record source = Source record storage and inventory, Duration | Orchid, source = Source record storage, Duration | Orchid, source = Source record storage and inventory, Duration
10000 records (10K) | 2 min 8 sec | 3 min 44 sec | 2 min 4 sec | not tested | not tested
25000 records (25K) | 4 min 43 sec | 6 min 50 sec | 4 min 13 sec | 3 min 50 sec | 4 min 32 sec
50000 records (50K) | 9 min 12 sec | 12 min 48 sec | 8 min 12 sec | not tested | not tested
500000 records (500K) | 1 hour 18 min | 1 hour 19 min | 1 hour 15 min | 1 hour 14 min | 1 hour 7 min
1000000 records (1MLN) | 2 hours 29 min | 2 hours 29 min | 2 hours 24 min | 2 hours 1 min | 2 hours 21 min
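
To make the Test 1 versus Test 2 difference concrete, the following Python sketch converts the reported durations to seconds and computes how much longer Test 2 took for each record count. The helper names to_seconds and slowdown_percent are illustrative only; the durations are copied from the incremental harvesting results above.

```python
import re

UNIT_SECONDS = {"h": 3600, "m": 60, "s": 1}


def to_seconds(text: str) -> int:
    """Convert strings such as '1 hours 18 min' or '3 min 44 sec' to seconds."""
    total = 0
    for value, unit in re.findall(r"(\d+)\s*([a-zA-Z]+)", text):
        # Only the first letter of the unit matters: h(ours), m(in), s(ec).
        total += int(value) * UNIT_SECONDS[unit[0].lower()]
    return total


def slowdown_percent(baseline: str, other: str) -> float:
    """Percentage by which `other` is longer than `baseline`."""
    base = to_seconds(baseline)
    return (to_seconds(other) - base) / base * 100


# (Test 1 duration, Test 2 duration) per number of harvested records.
durations = {
    "10K": ("2 min 8 sec", "3 min 44 sec"),
    "25K": ("4 min 43 sec", "6 min 50 sec"),
    "50K": ("9 min 12 sec", "12 min 48 sec"),
    "500K": ("1 hours 18 min", "1 hours 19 min"),
    "1MLN": ("2 hours 29 min", "2 hours 29 min"),
}

for size, (test1, test2) in durations.items():
    print(f"{size}: Test 2 took {slowdown_percent(test1, test2):.0f}% longer")
```

With these inputs the overhead is concentrated in the 10K, 25K and 50K runs and is negligible for 500K and 1MLN.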

Full harvesting using EBSCO Harvester

Record source | Duration | Number of returned instances | Volume of returned data (GB) | Number of files | Orchid, Duration
Source record storage | 16 hours 42 min | 10403507 | 76.4 | 104,737 | ~ 17 h
Inventory | 1 hour 46 min | 1122521 | 1.96 | 11,227 | not tested
Source record storage and inventory | 16 hours 50 min | 11526892 | 78.4 | 115,971 | ~ 18 h
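
The full harvesting figures can be turned into a few derived metrics for comparison across runs. The Python sketch below is a rough illustration: the FullHarvest class and its property names are hypothetical, and the per-record size assumes 1 GB = 10^6 KB because the report does not state whether GB or GiB is meant.

```python
from dataclasses import dataclass


@dataclass
class FullHarvest:
    name: str
    duration_min: int  # test duration in minutes
    instances: int     # number of returned instances
    volume_gb: float   # data stored on disk
    files: int         # number of files written by the harvester

    @property
    def records_per_second(self) -> float:
        return self.instances / (self.duration_min * 60)

    @property
    def records_per_file(self) -> float:
        return self.instances / self.files

    @property
    def kb_per_record(self) -> float:
        # Assumption: decimal units (1 GB = 1,000,000 KB).
        return self.volume_gb * 1_000_000 / self.instances


runs = [
    FullHarvest("Test 4 (Source record storage)", 16 * 60 + 42, 10_403_507, 76.4, 104_737),
    FullHarvest("Test 5 (Inventory)", 1 * 60 + 46, 1_122_521, 1.96, 11_227),
    FullHarvest("Test 6 (Source record storage and inventory)", 16 * 60 + 50, 11_526_892, 78.4, 115_971),
]

for run in runs:
    print(
        f"{run.name}: {run.records_per_second:.1f} records/s, "
        f"{run.records_per_file:.1f} records/file, "
        f"{run.kb_per_record:.2f} KB/record"
    )
```

In all three runs the number of records per file comes out just under 100, and the average record size is noticeably smaller when the record source is Inventory only.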

Incremental harvesting resources utilization

Test 1.  Record source = Source record storage

Service CPU Utilization

During four harvesting tests with 10K, 50K, 500K and 1MLN records, CPU usage remained steady, with a few minor fluctuations at the beginning of each test. Average CPU usage: mod-oai-pmh-b = 3%, edge-oai-pmh-b = 2.5%, mod-source-record-storage-b = 1.5%, okapi-b = 1.2%, mod-inventory-storage-b = 0.5%. After the middle of the 4th test (1MLN records), a hidden JMeter script was launched by an unidentified source, which caused a significant increase in CPU consumption but did not affect processing time.

Service Memory Utilization

Memory utilization showed no problems, except for the edge-oai-pmh-b service. At the beginning of testing it consumed approximately 20% of memory, but 30 minutes after the test finished it was consuming around 45%.

RDS CPU Utilization

Average CPU utilization during the 4 tests was about 13%.

RDS Database Connections

The number of database connections was about 140.

Test 2.  Record source = Inventory

Service CPU Utilization

During four harvesting tests with 10K, 50K, 500K and 1MLN records, CPU usage remained steady, with a few minor fluctuations at the beginning of each test. Average CPU usage: mod-oai-pmh-b = 2%, edge-oai-pmh-b = 2%, mod-source-record-storage-b = 1.2%, okapi-b = 1.1%, mod-inventory-storage-b = 0.5%.

Service Memory Utilization

Memory utilization showed no problems, except for the edge-oai-pmh-b service. At the beginning of testing it consumed approximately 18% of memory, but 30 minutes after the test finished it was consuming around 34%.

RDS CPU Utilization

Average CPU utilization during the 4 tests was about 15%.

RDS Database Connections

The number of database connections was about 140.

Test 3.  Record source = Source record storage and inventory

Service CPU Utilization

During four harvesting tests with 10K, 50K, 500K and 1MLN records, CPU usage remained steady, with a few minor fluctuations at the beginning of each test. Average CPU usage: mod-oai-pmh-b = 3%, edge-oai-pmh-b = 2.5%, mod-source-record-storage-b = 1.4%, okapi-b = 1.1%, mod-inventory-storage-b = 0.5%.

Service Memory Utilization

Memory utilization showed no problems, except for the edge-oai-pmh-b service. During the third test, memory consumption was the same as in the previous tests. Between the 500K and 1MLN record tests there was a period of two hours during which no tests were running and the system was not loaded at all, yet memory consumption by edge-oai-pmh-b did not decrease during this period.

RDS CPU Utilization

Average CPU utilization during the 4 tests was about 12%.

RDS Database Connections

The number of database connections was about 140.

Full harvesting resources utilization

Test 4.  Record source = Source record storage

Service CPU Utilization

During the harvesting test, average CPU usage was: mod-oai-pmh-b = 7%, edge-oai-pmh-b = 3%, mod-source-record-storage-b = 1%, okapi-b = 1.2%, mod-inventory-storage-b = 0.5%. After the test, CPU utilization returned to its pre-test level.

Service Memory Utilization

Memory utilization showed no problems, except for the edge-oai-pmh-b service: its memory consumption increased during the test and had not decreased 1 hour after the test finished.

RDS CPU Utilization

Average CPU utilization during the test was about 16%.

RDS Database Connections

The number of database connections was about 140.

Test 5.  Record source = Inventory

Service CPU Utilization

During the harvesting test, average CPU usage was: mod-oai-pmh-b = 4%, edge-oai-pmh-b = 2%, mod-source-record-storage-b = 1%, okapi-b = 1.3%, mod-inventory-storage-b = 0.5%. After the test, CPU utilization returned to its pre-test level.

Service Memory Utilization

The full harvesting tests were run one after another without restarting the edge-oai-pmh-b service; memory consumption was stable and did not increase.

RDS CPU Utilization

Average CPU utilization during the test was about 16%.

RDS Database Connections

The number of database connections was about 140.

Test 6.  Record source = Source record storage and inventory

Service CPU Utilization

During the harvesting test, average CPU usage was: mod-oai-pmh-b = 7%, edge-oai-pmh-b = 4%, mod-source-record-storage-b = 1%, okapi-b = 1.2%, mod-inventory-storage-b = 0.5%. After the test, CPU utilization returned to its pre-test level.

Service Memory Utilization

Memory consumption was stable for all services. The edge-oai-pmh-b service was not restarted before the test; its memory utilization varied from 45% to 55%.

RDS CPU Utilization

Average CPU utilization during the test was about 17%.

The spike at 18.10-18.20 was caused by a sharp increase in the number of requests to ocp2-db-01.

 DB PerfInsights graph

RDS Database Connections

The number of database connections was about 140.


Appendix

Methodology/Approach

OAI-PMH (incremental harvesting) was carried out by a JMeter script run from Carrier, with 2 main requests:

  • /oai/records?verb=ListRecords&metadataPrefix=marc21_withholdings&apikey=[APIKey]
  • /oai/records?verb=ListRecords&apikey=[APIKey]&resumptionToken=[resumptionToken]

To extract the required number of records, a loop counter was used with the following configuration:

  • 98 loop counts for 10K records;
  • 248 loop counts for 25K records;
  • 499 loop counts for 50K records;
  • 5000 loop counts for 500K records;
  • 10000 loop counts for 1MLN records;
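
The request pattern above (one initial ListRecords call followed by repeated resumptionToken calls) can be sketched in Python as follows. This is not the JMeter script used in the tests: the harvest function, its fetch callback and the max_requests parameter are hypothetical, and max_requests only loosely corresponds to the JMeter loop counter, since the report does not state how the loop counts map to the number of requests.

```python
import xml.etree.ElementTree as ET
from typing import Callable, Iterator
from urllib.parse import urlencode

# Namespace defined by the OAI-PMH 2.0 protocol for response elements.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def harvest(fetch: Callable[[str], str], base_url: str, api_key: str,
            max_requests: int) -> Iterator[ET.Element]:
    """Yield <record> elements, following resumption tokens.

    `fetch` takes a URL and returns the response body as text, so any
    HTTP client (or a stub in tests) can be plugged in.
    """
    params = {"verb": "ListRecords",
              "metadataPrefix": "marc21_withholdings",
              "apikey": api_key}
    for _ in range(max_requests):
        root = ET.fromstring(fetch(f"{base_url}?{urlencode(params)}"))
        yield from root.iterfind(".//oai:record", OAI_NS)
        token = root.find(".//oai:resumptionToken", OAI_NS)
        if token is None or not (token.text or "").strip():
            break  # an absent or empty token marks the last page
        # Follow-up requests carry only the token, as in the second request above.
        params = {"verb": "ListRecords", "apikey": api_key,
                  "resumptionToken": token.text.strip()}
```

A real run could pass something like `lambda url: urllib.request.urlopen(url).read().decode()` as fetch; the callback is kept separate so the loop can be exercised without a live endpoint.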

The time ranges for the incremental harvesting tests were determined experimentally. The time range for Test 2* was extended because the defined number of records could not be harvested within the original range; the subsequent tests were run after adding 800K instances to the database.


Test | Start date | Until date
Test 1 | 2022-12-21 | 2023-10-16
Test 2* | 1962-12-21 | 2023-10-23*
Test 3 | 2022-12-21 | 2023-10-16
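
The report does not show how the date ranges were passed to the requests. OAI-PMH defines the optional from and until arguments for ListRecords, and they also appear in the harvest definition URL further down, so a selective initial request might be built as in this Python sketch. The function name list_records_url, the host and the API key value are placeholders, not values from the tests.

```python
from urllib.parse import urlencode


def list_records_url(base_url: str, api_key: str, start: str, until: str,
                     metadata_prefix: str = "marc21_withholdings") -> str:
    """Build the initial ListRecords URL for a date-bounded harvest."""
    query = urlencode({
        "verb": "ListRecords",
        "metadataPrefix": metadata_prefix,
        "from": start,    # OAI-PMH lower bound
        "until": until,   # OAI-PMH upper bound
        "apikey": api_key,
    })
    return f"{base_url}?{query}"


# Time range used for Test 1 and Test 3; host and key are placeholders.
print(list_records_url("https://edge.example.org/oai/records",
                       "API_KEY", "2022-12-21", "2023-10-16"))
```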


OAI-PMH (full harvesting)

Before running OAI-PMH with full harvest, the following database commands were executed to optimize the tables (from https://folio-org.atlassian.net/wiki/display/FOLIOtips/OAI-PMH+Best+Practices#OAIPMHBestPractices-SlowPerformance):

REINDEX index <tenant>_mod_inventory_storage.audit_item_pmh_createddate_idx ;
REINDEX index <tenant>_mod_inventory_storage.audit_holdings_record_pmh_createddate_idx;
REINDEX index <tenant>_mod_inventory_storage.holdings_record_pmh_metadata_updateddate_idx;
REINDEX index <tenant>_mod_inventory_storage.item_pmh_metadata_updateddate_idx;
REINDEX index <tenant>_mod_inventory_storage.instance_pmh_metadata_updateddate_idx;
analyze verbose <tenant>_mod_inventory_storage.instance;
analyze verbose <tenant>_mod_inventory_storage.item;
analyze verbose <tenant>_mod_inventory_storage.holdings_record;
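
Since the statements differ only in the <tenant> prefix, they can be generated for a given tenant. The Python sketch below is a convenience illustration, not part of the documented procedure: the function name maintenance_sql and the tenant-name check are assumptions, and the index and table names are copied from the commands above.

```python
import re

INDEXES = [
    "audit_item_pmh_createddate_idx",
    "audit_holdings_record_pmh_createddate_idx",
    "holdings_record_pmh_metadata_updateddate_idx",
    "item_pmh_metadata_updateddate_idx",
    "instance_pmh_metadata_updateddate_idx",
]
TABLES = ["instance", "item", "holdings_record"]


def maintenance_sql(tenant: str) -> list[str]:
    """Return the REINDEX and ANALYZE statements for one tenant schema."""
    # Reject anything that is not a plain identifier before building SQL text.
    if not re.fullmatch(r"[a-z][a-z0-9_]*", tenant):
        raise ValueError(f"unexpected tenant id: {tenant!r}")
    schema = f"{tenant}_mod_inventory_storage"
    statements = [f"REINDEX INDEX {schema}.{index};" for index in INDEXES]
    statements += [f"ANALYZE VERBOSE {schema}.{table};" for table in TABLES]
    return statements


# fs09000000 is the tenant id that appears in the TRUNCATE statement below.
print("\n".join(maintenance_sql("fs09000000")))
```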


Execute the following query in the related database to remove existing 'instances' created by a previous harvesting request, along with the request itself:

TRUNCATE TABLE fs09000000_mod_oai_pmh.request_metadata_lb cascade

Full harvesting tests were run from the ptf-windows machine using EBSCO Harvester. The following command (cmd should be run in the same directory as EBSCO Harvester) starts EBSCO Harvester:

OAIPMHHarvester.exe -HarvestMode=full -DefinitionId=poppy-marc21-with-holdings -HarvesterWebClientTimeout_Seconds=0s=0

With the following definition

 Harvest definition

<?xml version="1.0" encoding="UTF-8"?>
<HarvestDefinition xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="HarvestDefinition.xsd">
  <id>Orchid</id>
  <Description>Orchid</Description>
  <Urls>
    <!-- include as many as necessary/provided -->
    <Url>https://edge-ptf-ocp2-00.int.aws.folio.org/oai/eyJzIjoiVDNUSzAzR2QyViIsInQiOiJmczA5MDAwMDAwIiwidSI6ImZzMDkwMDAwMDAifQ=?verb=ListRecords&metadataPrefix=marc21_withholdings&set=all&from=2018-10-18T00:00:00Z&until=2018-10-19T00:00:00Z</Url>
  </Urls>
  <!-- 0 if no throttle required -->
  <ThrottledInMiliseconds>0</ThrottledInMiliseconds>
  <!-- Enables the user to segment harvesting requests by setting the StartDate and WindowSizeInDays parameters -->
  <!-- metadata to harvest -->
  <MetadataFormat>marc21_withholdings</MetadataFormat>
  <!-- Harvested, None, or Custom. If Custom, specific as many child setSpecs as necessary. These will act as filters when harvesting -->
  <Sets use="Harvested">
  </Sets>
  <!--<Sets use="Custom"> -->
  <!--<setSpec>NameOfSetSpecGoesHere</setSpec> -->
  <!--<setSpec>NameOfSetSpecGoesHere</setSpec>
  </Sets> -->
</HarvestDefinition>

Infrastructure

Environment: OCP2
Release: Poppy (2023 R2)

  • 9 m6i.2xlarge EC2 instances located in US East (N. Virginia)
  • 2 instances of db.r6.xlarge database instances, one reader, and one writer 
  • MSK tenant
    • 4 brokers
    • Apache Kafka version 2.8.0

    • EBS storage volume per broker 300 GiB

    • auto.create.topics.enable=true
    • log.retention.minutes=480
    • default.replication.factor=3

Modules

ocp2-pvt, Mon Oct 23 15:48:03 UTC 2023

Module | Task Def. Revision | Module Version | Task Count | Mem Hard Limit | Mem Soft limit | CPU units | Xmx | MetaspaceSize | MaxMetaspaceSize | R/W split enabled
pub-edge | 8 | pub-edge:2022.03.02 | 2 | 1024 | 896 | 128 | 768 | 0 | 0 | false
mod-inventory-storage | 1 | mod-inventory-storage:26.1.0-SNAPSHOT.696 | 2 | 2208 | 1952 | 1024 | 1440 | 384 | 512 | false
edge-oai-pmh | 8 | edge-oai-pmh:2.7.0-SNAPSHOT.141 | 2 | 1512 | 1360 | 1024 | 1440 | 384 | 512 | false
mod-source-record-storage | 13 | mod-source-record-storage:5.7.0-SNAPSHOT.247 | 2 | 5600 | 5000 | 2048 | 3500 | 384 | 512 | false
mod-inventory | 13 | mod-inventory:20.1.0-SNAPSHOT.446 | 2 | 2880 | 2592 | 1024 | 1814 | 384 | 512 | false
mod-circulation | 10 | mod-circulation:24.0.0-SNAPSHOT.601 | 2 | 2880 | 2592 | 1536 | 1814 | 384 | 512 | false
mod-source-record-manager | 15 | mod-source-record-manager:3.7.0-SNAPSHOT.240 | 2 | 5600 | 5000 | 2048 | 3500 | 384 | 512 | false
mod-quick-marc | 8 | mod-quick-marc:5.0.0-SNAPSHOT.114 | 1 | 2288 | 2176 | 128 | 1664 | 384 | 512 | false
nginx-okapi | 8 | nginx-okapi:2023.09.21 | 2 | 1024 | 896 | 128 | 0 | 0 | 0 | false
okapi-b | 9 | okapi:5.0.1 | 3 | 1684 | 1440 | 1024 | 922 | 384 | 512 | false
mod-oai-pmh | 10 | mod-oai-pmh:3.12.0-SNAPSHOT.362 | 2 | 4096 | 3690 | 2048 | 3076 | 384 | 512 | false