OAI-PMH data harvesting[Concurrent Incremental] (Poppy)
Overview
Summary
OAI-PMH - Incremental Harvesting:
Three tests were executed with a JMeter script to check the performance of harvesting 10K, 25K, 50K, 500K, and 1 MLN records with different OAI-PMH behaviors:
Test 1. Record source set to Source record storage ;
Test 2. Record source set to Inventory* (data set limit in OCP3 - 250k) ;
Test 3. Record source set to Source record storage and inventory.
- Number of multiple concurrent harvests:
- 2 harvests;
- 4 harvests;
- 6 harvests.
- CPU utilization during all tests was proportional to the number of concurrent harvests.
- Test #1 mod-oai-pmh-b: 2 harvests - 5%, 4 harvests - 10%, 6 harvests - 15%
- Test #2 mod-oai-pmh-b: 2 harvests - 1%, 4 harvests - 3.7%, 6 harvests - 5.5%
- Test #3 mod-oai-pmh-b: 2 harvests - 10%, 4 harvests - 15%, 6 harvests - 25%
- Memory consumption was stable, except for mod-inventory, which grew slowly, and mod-oai-pmh, which grew from 46% to 56%. Tests:
- Tests #1 and #3: mod-oai-pmh-b didn't exceed 40%
- Test #2: mod-oai-pmh-b reached 55%
- RDS CPU utilization:
- Average CPU usage for 2 harvests - 15%
- Average CPU usage for 4 harvests - 20%
- Average CPU usage for 6 harvests - 25%
- Durations of harvests differed significantly between tests #1 and #3 (SRS) and test #2 (Inventory) because of the distribution of record creation dates relative to the fromDate and untilDate parameters.
- Durations did not degrade as the number of concurrent harvests increased.
- Response times for the tests can be found under the expandable links in the "Test #. Record source" sections.
Improvements that can be noted in Poppy release:
1) A non-ECS environment with the Poppy release can handle concurrent OAI-PMH harvests.
Recommendations & Jiras
- To prepare the tests, it is worth populating the complete_updated_date column in {tenant}_mod_inventory_storage.instance using the migration. More info in the Appendix section.
- To avoid degradation of OAI-PMH response times, check that the top DB queries do not include the DELETE and INSERT statements for marc_id values that run after a cluster restart.
- To ensure the same starting conditions before running tests with different Record source settings, the edge-oai-pmh service was restarted; this returned the service's memory usage to its starting (post-deployment) value.
Test Runs & Results
Incremental harvesting
Durations (hh:mm:ss) per test and concurrency level. SRS = Source record storage.

| Number of harvested records | Test 1 (SRS), 2 harvests | Test 2 (Inventory), 2 harvests | Test 3 (SRS + Inventory), 2 harvests | Test 1 (SRS), 4 harvests | Test 2 (Inventory), 4 harvests | Test 3 (SRS + Inventory), 4 harvests | Test 1 (SRS), 6 harvests | Test 2 (Inventory), 6 harvests | Test 3 (SRS + Inventory), 6 harvests |
|---|---|---|---|---|---|---|---|---|---|
| 10000 records (10K) | 00:02:08 | 00:08:55 | 00:01:39 | 00:01:05 | 00:01:46 | 00:01:31 | 00:01:07 | 00:01:32 | 00:01:14 |
| 25000 records (25K) | 00:04:09 | 00:16:25 | 00:04:27 | 00:02:38 | 00:21:00 | 00:04:34 | 00:02:52 | 00:20:32 | 00:02:57 |
| 50000 records (50K) | 00:07:40 | 00:33:25 | 00:08:10 | 00:05:17 | 00:32:46 | 00:07:44 | 00:05:34 | 00:32:47 | 00:13:25 |
| 500000 records (500K) / 250000 records (250K) in test #2 | 01:56:40 | 02:33:30 | 01:51:24 | 01:58:34 | 02:35:29 | 01:48:48 | 01:34:29 | 02:37:45 | 01:44:42 |
| 1000000 records (1 MLN) | 02:50:17 | not enough data | 02:39:09 | 02:59:09 | not enough data | 02:50:29 | 03:04:30 | not enough data | 02:58:50 |
Incremental harvesting
Test 1. Record source = Source record storage
Service CPU Utilization
During five harvesting tests with 10K, 25K, 50K, 500K, and 1 MLN records, CPU utilization remained steady for the same number of concurrent harvests.
Average CPU usage for 2 harvests: mod-oai-pmh-b = 5%, edge-oai-pmh-b = 3.5%, mod-source-record-storage-b = 2%, okapi-b = 1.5%, mod-inventory-storage-b = 1.5%.
Average CPU usage for 4 harvests: mod-oai-pmh-b = 9%, edge-oai-pmh-b = 5.4%, mod-source-record-storage-b = 1.5%, okapi-b = 1.7%, mod-inventory-storage-b = 0.7%.
Average CPU usage for 6 harvests: mod-oai-pmh-b = 15.5%, edge-oai-pmh-b = 9%, mod-source-record-storage-b = 1.5%, okapi-b = 2.4%, mod-inventory-storage-b = 1%.
A few minor fluctuations occurred at the beginning of each test.
Service Memory Consumption
Memory consumption was stable.
Average memory consumption didn't exceed: mod-oai-pmh-b = 40%, edge-oai-pmh-b = 31%, mod-source-record-storage-b = 37%, okapi-b = 37%, mod-inventory-storage-b = 14%.
This graph for 10k, 25k, 50k records
This graph for 500k and 1 MLN records
This graph for 1 MLN records only
RDS CPU Utilization
Average CPU utilization was stable for the same number of concurrent harvests.
Average CPU usage for 2 harvests - 15%
Average CPU usage for 4 harvests - 20%
Average CPU usage for 6 harvests - 25-30%
RDS Database Connections
The number of database connections was about 440.
Database load
This graph shows top sql queries for OAI-PMH 10k, 25k, 50k
This graph shows top sql queries for OAI-PMH 500k, 1 MLN
The marked query ran from cluster start until 16:30 UTC. The same query was also found in the pcp1 cluster.
This graph for 1 MLN only. 4 and 6 concurrent harvests
Test 2. Record source = Inventory
Service CPU Utilization
Average CPU usage for 2 harvests: mod-oai-pmh-b = 1%, edge-oai-pmh-b = 0.5%, mod-source-record-storage-b = 1.5%, okapi-b = 0.8%, mod-inventory-storage-b = 0.3%.
Average CPU usage for 4 harvests: mod-oai-pmh-b = 3.7%, edge-oai-pmh-b = 1.5%, mod-source-record-storage-b = 1.6%, okapi-b = 1.2%, mod-inventory-storage-b = 0.4%.
Average CPU usage for 6 harvests: mod-oai-pmh-b = 5.5%, edge-oai-pmh-b = 2%, mod-source-record-storage-b = 1.4%, okapi-b = 1.2%, mod-inventory-storage-b = 0.5%.
This graph for 10k, 25k, 50k.
This graph for 250k
Service Memory Consumption
For 10k, 25k, and 50k, memory consumption of mod-oai-pmh was 28% at the beginning and grew to 46%.
For the 250k tests, memory consumption of mod-oai-pmh was 55% at the beginning and stayed at that level.
This graph for 10k, 25k, 50k.
This graph for 250k
RDS CPU Utilization
RDS for 10k, 25k, 50k
The fluctuations on the graph are explained by the DELETE and INSERT queries with marc_id values triggered by the daily cluster restart. After 14:30 this process had finished, and the graph reflects only the tests.
RDS for 250k
RDS Database Connections
Connections were the same as in the other tests - about 440.
Test 3. Record source = Source record storage and inventory
Service CPU Utilization
Average CPU usage for 2 harvests: mod-oai-pmh-b = 10%, edge-oai-pmh-b = 7%, mod-source-record-storage-b = 1.7%, okapi-b = 1.5%, mod-inventory-storage-b = 0.6%.
Average CPU usage for 4 harvests: mod-oai-pmh-b = 15%, edge-oai-pmh-b = 10%, mod-source-record-storage-b = 1.5%, okapi-b = 2%, mod-inventory-storage-b = 0.8%.
Average CPU usage for 6 harvests: mod-oai-pmh-b = 25%, edge-oai-pmh-b = 15%, mod-source-record-storage-b = 1.4%, okapi-b = 2.4%, mod-inventory-storage-b = 1%.
The graph shows 10k, 25k, 50k, and 2 harvests of 500k
The graph shows the 500k and 1 MLN harvests
The graph shows the 1 MLN harvests only
Service Memory Utilization
Memory consumption was stable for the OAI-PMH-related modules. mod-inventory didn't exceed 72%.
Average memory consumption didn't exceed: mod-oai-pmh-b = 40%, edge-oai-pmh-b = 29%, mod-source-record-storage-b = 37%, okapi-b = 37%, mod-inventory-storage-b = 15%, mod-inventory = 72%.
RDS CPU Utilization
Average CPU utilization was stable for the same number of concurrent harvests, close to the results of test #1.
The fluctuations on the DB graphs are explained by a DELETE query against the marc_indexers table that runs after every daily cluster start. It produces a high load that affects OAI-PMH response times, and it happens each time the cluster restarts. The query:
- deletes rows from marc_indexers based on conditions defined in two separate subqueries;
- captures the marc_id values of the deleted rows;
- inserts the distinct marc_id values from both subqueries into marc_indexers_deleted_ids to keep track of the deleted marc_id values.
Average CPU usage for 2 harvests - 15%
Average CPU usage for 4 harvests - 20%
Average CPU usage for 6 harvests - 25-30%
RDS Database Connections
The number of database connections was about 440 in all tests.
Database load
This graph shows 10k, 25k, 50k
Top query:
- WITH deleted_rows AS (
    DELETE FROM marc_indexers mi
    WHERE EXISTS (
      SELECT 1 FROM marc_records_tracking mrt
      WHERE mrt.is_dirty = true
        AND mrt.marc_id = mi.marc_id
        AND mrt.version > mi.version
    )
    RETURNING mi.marc_id
  ),
  deleted_rows2 AS (
    DELETE FROM marc_indexers mi
    WHERE EXISTS (
      SELECT 1 FROM records_lb
      WHERE records_lb.id = mi.marc_id
        AND records_lb.state = 'OLD'
    )
    RETURNING mi.marc_id
  )
  INSERT INTO marc_indexers_deleted_ids
  SELECT DISTINCT marc_id FROM deleted_rows
  UNION
  SELECT marc_id FROM deleted_rows2
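The effect of this cleanup query can be illustrated with a small, self-contained sketch. It uses SQLite with invented sample rows in place of the production data, and captures the ids with a SELECT before deleting (a stand-in for PostgreSQL's DELETE ... RETURNING, which older SQLite versions lack); table and column names follow the query above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE marc_indexers (marc_id TEXT, version INTEGER);
CREATE TABLE marc_records_tracking (marc_id TEXT, is_dirty INTEGER, version INTEGER);
CREATE TABLE records_lb (id TEXT, state TEXT);
CREATE TABLE marc_indexers_deleted_ids (marc_id TEXT);
-- Invented rows: 'a' has a newer dirty version, 'b' is an OLD record, 'c' is current
INSERT INTO marc_indexers VALUES ('a', 1), ('b', 1), ('c', 2);
INSERT INTO marc_records_tracking VALUES ('a', 1, 2), ('c', 0, 2);
INSERT INTO records_lb VALUES ('b', 'OLD'), ('c', 'ACTUAL');
""")

# Condition 1: a newer dirty version is tracked for the same marc_id
cond1 = ("EXISTS (SELECT 1 FROM marc_records_tracking mrt "
         "WHERE mrt.is_dirty = 1 AND mrt.marc_id = marc_indexers.marc_id "
         "AND mrt.version > marc_indexers.version)")
# Condition 2: the underlying record is in state 'OLD'
cond2 = ("EXISTS (SELECT 1 FROM records_lb "
         "WHERE records_lb.id = marc_indexers.marc_id "
         "AND records_lb.state = 'OLD')")

# Capture the ids first (stand-in for DELETE ... RETURNING), then delete,
# then record the deleted ids in marc_indexers_deleted_ids
deleted = sorted(r[0] for r in cur.execute(
    f"SELECT DISTINCT marc_id FROM marc_indexers WHERE {cond1} OR {cond2}"))
cur.execute(f"DELETE FROM marc_indexers WHERE {cond1} OR {cond2}")
cur.executemany("INSERT INTO marc_indexers_deleted_ids VALUES (?)",
                [(m,) for m in deleted])
remaining = [r[0] for r in cur.execute("SELECT marc_id FROM marc_indexers")]
```

With the sample rows, 'a' and 'b' are deleted and tracked while 'c' survives, which mirrors why the production query's load scales with the number of dirty/OLD records accumulated since the last run.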
Appendix
Methodology/Approach
OAI-PMH (incremental harvesting) was carried out by a JMeter script from the carrier with 2 main requests:
- /oai/records?verb=ListRecords&metadataPrefix=marc21_withholdings&apikey=[APIKey]
- /oai/records?verb=ListRecords&apikey=[APIKey]&resumptionToken=[resumptionToken]
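The two requests above drive a simple harvest loop: the initial ListRecords call returns a page of records plus a resumptionToken, which is fed back until the token is empty. A minimal sketch of that loop (the `_fake_fetch` function and its canned XML are stand-ins for the real edge-oai-pmh HTTP responses):

```python
import xml.etree.ElementTree as ET

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"

def resumption_token(xml_text):
    """Extract the resumptionToken from a ListRecords response ('' when exhausted)."""
    root = ET.fromstring(xml_text)
    token = root.find(f"./{OAI_NS}ListRecords/{OAI_NS}resumptionToken")
    return (token.text or "").strip() if token is not None else ""

def harvest(fetch):
    """Drive the ListRecords/resumptionToken loop; `fetch` stands in for the HTTP call."""
    pages = 0
    token = resumption_token(fetch(None))       # initial ListRecords request
    while token:
        pages += 1
        token = resumption_token(fetch(token))  # follow-up request with resumptionToken
    return pages

# Simulated responses: two pages with tokens, then an exhausted (empty) token
def _fake_fetch(token):
    nxt = {None: "t1", "t1": "t2", "t2": ""}[token]
    return (f'<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">'
            f'<ListRecords><resumptionToken>{nxt}</resumptionToken></ListRecords>'
            f'</OAI-PMH>')

pages_followed = harvest(_fake_fetch)
```

The JMeter loop counter described below plays the role of the `while` loop here, issuing the resumptionToken request a fixed number of times instead of checking for an empty token.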
To extract the required number of records, a loop counter was used with the following configuration:
- 98 loop counts for 10K records;
- 248 loop counts for 25K records;
- 498 loop counts for 50K records;
- 2498 loop counts for 250k records*
- 4998 loop counts for 500K records;
- 9998 loop counts for 1MLN records
* - Test #2 data set limit
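The loop counts above are consistent with roughly 100 records per ListRecords response, with the initial request and the final (token-exhausting) request sitting outside the loop. That assumption (the page size of 100 is inferred from the numbers, not stated in the test configuration) can be sanity-checked:

```python
# Hypothetical helper: loop count implied by ~100 records per response,
# with the initial and final requests outside the JMeter loop
def loop_count(records, page_size=100):
    return records // page_size - 2

# The configured loop counts from the list above
targets = {10_000: 98, 25_000: 248, 50_000: 498,
           250_000: 2498, 500_000: 4998, 1_000_000: 9998}
assert all(loop_count(n) == c for n, c in targets.items())
```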
The following time ranges for the incremental harvesting tests were defined experimentally. The time range for Test 2* was extended because it was impossible to harvest the defined number of records; subsequent tests were run after adding 800K instances to the database.
| | Start date | Until date |
|---|---|---|
| Test 1. | 2022-12-21 | 2023-10-16 |
| Test 2*. | 1962-12-21 | 2023-10-23* |
| Test 3. | 2022-12-21 | 2023-10-16 |
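These date ranges map onto the standard OAI-PMH selective-harvesting parameters `from` and `until` of the initial ListRecords request. A sketch of how such a request URL is assembled (the base URL and API key are placeholders, not the real environment values):

```python
from urllib.parse import urlencode

def list_records_url(base, apikey, from_date, until_date):
    """Build the initial ListRecords request with a selective-harvesting date range."""
    params = {"verb": "ListRecords",
              "metadataPrefix": "marc21_withholdings",
              "apikey": apikey,
              "from": from_date,
              "until": until_date}
    return f"{base}/oai/records?{urlencode(params)}"

# Test 1/3 range from the table above, with placeholder host and key
url = list_records_url("https://edge.example.org", "APIKEY",
                       "2022-12-21", "2023-10-16")
```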
OAI-PMH
Before testing OAI-PMH, the following database command to optimize the tables was executed (from https://folio-org.atlassian.net/wiki/display/FOLIOtips/OAI-PMH+Best+Practices#OAIPMHBestPractices-SlowPerformance):
REINDEX INDEX <tenant>_mod_inventory_storage.audit_item_pmh_createddate_idx;
Execute the following query in the related database to remove existing 'instances' created by a previous harvesting request, and the request itself:
TRUNCATE TABLE fs09000000_mod_oai_pmh.request_metadata_lb CASCADE;
Execute the migration for the complete_updated_date column as described in Migration scripts for OAI-PMH (note that in step 2 the command SET search_path = "{tenant}_mod_inventory_storage", "public"; may not work for some reason). It's ok to skip that command in the scope of OAI-PMH.
Infrastructure
Environment: OCP3
Release: Poppy (2023 R2)
- 9 m6i.2xlarge EC2 instances located in US East (N. Virginia)
- 2 db.r6.xlarge database instances: one reader and one writer
- MSK tenant
- 4 brokers
Apache Kafka version 2.8.0
EBS storage volume per broker 300 GiB
- auto.create.topics.enable=true
- log.retention.minutes=480
- default.replication.factor=3
Modules