OAI-PMH data harvesting[Concurrent Incremental] (Poppy)

Overview

Summary

  • OAI-PMH - Incremental Harvesting:

    • Three tests were executed with a JMeter script to measure the performance of harvesting 10K, 25K, 50K, 500K and 1 MLN records with different OAI-PMH Behavior settings:

      • Test 1. Record source set to Source record storage;

      • Test 2. Record source set to Inventory* (data set limit in OCP3: 250k);

      • Test 3. Record source set to Source record storage and inventory.

    • Number of concurrent harvests:

      • 2 harvests;

      • 4 harvests;

      • 6 harvests.

  • CPU utilization in all tests scaled with the number of concurrent harvests.

    • Test #1 mod-oai-pmh-b: 2 harvests - 5%, 4 harvests - 10%, 6 harvests - 15%

    • Test #2 mod-oai-pmh-b: 2 harvests - 1%, 4 harvests - 3.7%, 6 harvests - 5.5%

    • Test #3 mod-oai-pmh-b: 2 harvests - 10%, 4 harvests - 15%, 6 harvests - 25%

  • Memory consumption was stable, except for mod-inventory, which grew slowly, and mod-oai-pmh, which grew from 46% to 56%. By test:

    • Tests #1 and #3: mod-oai-pmh-b did not exceed 40%

    • Test #2: mod-oai-pmh-b reached 55%

  • RDS CPU utilization:

    • Average CPU usage for 2 harvests: 15%

    • Average CPU usage for 4 harvests: 20%

    • Average CPU usage for 6 harvests: 25%

  • Harvest durations differed significantly between tests #1 and #3 (SRS) and test #2 (Inventory) because of how record creation dates are distributed across the range set by the fromDate and untilDate parameters.

  • Durations did not degrade as the number of concurrent harvests increased.

  • Response times for each test are listed in the corresponding "Test #. Record source" section.

Improvements noted in the Poppy release:
1) A non-ECS environment on the Poppy release can handle concurrent OAI-PMH harvests.

Recommendations & Jiras

  • When preparing tests, it is a good idea to populate the complete_updated_date column in {tenant}_mod_inventory_storage.instance using a migration. More information is in the Appendix section.

  • To avoid degradation of OAI-PMH response times, check that the top DB queries do not include DELETE and INSERT statements for marc_id values after a cluster restart.

  • To have the same starting conditions before running tests with different Record source settings, the edge-oai-pmh service was restarted. This returned the service memory usage to its starting (post-deployment) value.
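The check on top DB queries can be scripted. The following is a minimal sketch, assuming the top query texts have already been collected as plain strings (for example, copied from the database load view). The helper names are hypothetical; the table names come from the top query shown later in this report.

```python
import re
from typing import List

# Hypothetical helper, not part of any FOLIO module.
# A statement is treated as the marc_id cleanup when it both deletes from
# marc_indexers and inserts into marc_indexers_deleted_ids.
_CLEANUP_PATTERNS = (
    re.compile(r"\bdelete\s+from\s+marc_indexers\b", re.IGNORECASE),
    re.compile(r"\binsert\s+into\s+marc_indexers_deleted_ids\b", re.IGNORECASE),
)


def is_marc_id_cleanup(sql: str) -> bool:
    """Return True if the statement matches both cleanup patterns."""
    return all(pattern.search(sql) for pattern in _CLEANUP_PATTERNS)


def cleanup_queries(top_queries: List[str]) -> List[str]:
    """Keep only the statements that look like the marc_id cleanup."""
    return [sql for sql in top_queries if is_marc_id_cleanup(sql)]
```

If `cleanup_queries` returns a non-empty list for the current top queries, the cleanup is probably still running and the harvest results may be skewed.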

Test Runs & Results

Incremental harvesting

 

Durations (hh:mm:ss) by number of concurrent incremental OAI-PMH harvests and by test. Test 1: Record source = Source record storage. Test 2: Record source = Inventory. Test 3: Record source = Source record storage and inventory.

| Number of harvested records | 2 concurrent, Test 1 | 2 concurrent, Test 2 | 2 concurrent, Test 3 | 4 concurrent, Test 1 | 4 concurrent, Test 2 | 4 concurrent, Test 3 | 6 concurrent, Test 1 | 6 concurrent, Test 2 | 6 concurrent, Test 3 |
|---|---|---|---|---|---|---|---|---|---|
| 10000 records (10K) | 00:02:08 | 00:08:55 | 00:01:39 | 00:01:05 | 00:01:46 | 00:01:31 | 00:01:07 | 00:01:32 | 00:01:14 |
| 25000 records (25K) | 00:04:09 | 00:16:25 | 00:04:27 | 00:02:38 | 00:21:00 | 00:04:34 | 00:02:52 | 00:20:32 | 00:02:57 |
| 50000 records (50K) | 00:07:40 | 00:33:25 | 00:08:10 | 00:05:17 | 00:32:46 | 00:07:44 | 00:05:34 | 00:32:47 | 00:13:25 |
| 500000 records (500K) / 250000 records (250K) in test #2 | 01:56:40 | 02:33:30 | 01:51:24 | 01:58:34 | 02:35:29 | 01:48:48 | 01:34:29 | 02:37:45 | 01:44:42 |
| 1000000 records (1 MLN) | 02:50:17 | not enough data | 02:39:09 | 02:59:09 | not enough data | 02:50:29 | 03:04:30 | not enough data | 02:58:50 |
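To compare durations across concurrency levels, the hh:mm:ss values can be converted to seconds and expressed relative to the 2-harvest run. This is a small sketch using the Test 1 durations for 1 MLN records from the results above; the function names are illustrative only.

```python
def to_seconds(duration: str) -> int:
    """Convert an hh:mm:ss string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in duration.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def relative_change(baseline: str, other: str) -> float:
    """Fractional change of `other` against `baseline` (0.05 means 5% longer)."""
    base = to_seconds(baseline)
    return (to_seconds(other) - base) / base


# Test 1 (Source record storage), 1 MLN records, keyed by concurrent harvests.
srs_1mln = {2: "02:50:17", 4: "02:59:09", 6: "03:04:30"}

for harvests, duration in srs_1mln.items():
    change = relative_change(srs_1mln[2], duration)
    print(f"{harvests} harvests: {duration} ({change:+.1%} vs 2 harvests)")
```

The same helpers can be applied to any other row, which makes it easier to spot outliers such as the 6 concurrent 50K run in Test 3.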

Incremental harvesting

Test 1.  Record source = Source record storage

| Test Label | Number of harvested records | Average Response Times, ms | Duration |
|---|---|---|---|
| SRS 2 concurrent 10k | 10000 | 0.982 | 00:02:08 |
| SRS 4 concurrent 10k | 10000 | 0.356 | 00:01:05 |
| SRS 6 concurrent 10k | 10000 | 0.37 | 00:01:07 |
| SRS 2 concurrent 25k | 25000 | 0.689 | 00:04:09 |
| SRS 4 concurrent 25k | 25000 | 0.331 | 00:02:38 |
| SRS 6 concurrent 25k | 25000 | 0.385 | 00:02:52 |
| SRS 2 concurrent 50k | 50000 | 0.616 | 00:07:40 |
| SRS 4 concurrent 50k | 50000 | 0.334 | 00:05:17 |
| SRS 6 concurrent 50k | 50000 | 0.364 | 00:05:34 |
| SRS 2 concurrent 500k | 500000 | 0.903 | 01:56:40 |
| SRS 4 concurrent 500k | 500000 | 1.12 | 01:58:34 |
| SRS 6 concurrent 500k | 500000 | 0.829 | 01:34:29 |
| SRS 2 concurrent 1Mln | 1000000 | 0.718 | 02:50:17 |
| SRS 4 concurrent 1Mln | 1000000 | 0.77 | 02:59:09 |
| SRS 6 concurrent 1Mln | 1000000 | 0.802 | 03:04:30 |

This graph shows response times for the GET requests that retrieve data. For 4 and 6 concurrent harvests of 10k, 25k and 50k records, response times drop significantly, which shortens the duration. The reason for the drop has not been identified.

Service CPU Utilization

During the five harvesting tests with 10K, 25K, 50K, 500K and 1 MLN records, CPU utilization remained steady for the same number of concurrent harvests.

Average CPU usage for 2 harvests: mod-oai-pmh-b = 5%, edge-oai-pmh-b = 3.5%, mod-source-record-storage-b = 2%, okapi-b = 1.5%, mod-inventory-storage-b = 1.5%

Average CPU usage for 4 harvests: mod-oai-pmh-b = 9%, edge-oai-pmh-b = 5.4%, mod-source-record-storage-b = 1.5%, okapi-b = 1.7%, mod-inventory-storage-b = 0.7%

Average CPU usage for 6 harvests: mod-oai-pmh-b = 15.5%, edge-oai-pmh-b = 9%, mod-source-record-storage-b = 1.5%, okapi-b = 2.4%, mod-inventory-storage-b = 1%

A few minor fluctuations occurred at the beginning of each test.
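One way to check how evenly CPU scales with concurrency is to divide each average by the number of harvests. The sketch below does this for the mod-oai-pmh-b averages reported for Test 1; the function name is illustrative, and a roughly constant per-harvest value would suggest close to linear scaling.

```python
from typing import Dict

# Average mod-oai-pmh-b CPU utilization (%) in Test 1,
# keyed by the number of concurrent harvests.
mod_oai_pmh_cpu: Dict[int, float] = {2: 5.0, 4: 9.0, 6: 15.5}


def cpu_per_harvest(cpu_by_harvests: Dict[int, float]) -> Dict[int, float]:
    """Divide each average CPU value by its number of concurrent harvests."""
    return {harvests: cpu / harvests for harvests, cpu in cpu_by_harvests.items()}


for harvests, share in cpu_per_harvest(mod_oai_pmh_cpu).items():
    print(f"{harvests} harvests: {share:.2f}% CPU per harvest")
```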

Service Memory Consumption

Memory consumption was stable. 

Average memory consumption did not exceed the following values: mod-oai-pmh-b = 40%, edge-oai-pmh-b = 31%, mod-source-record-storage-b = 37%, okapi-b = 37%, mod-inventory-storage-b = 14%

This graph is for 10k, 25k and 50k records.

This graph is for 500k and 1 MLN records.

This graph is for 1 MLN records only.

RDS CPU Utilization

Average CPU utilization was stable for the same number of concurrent harvests.

Average CPU usage for 2 harvests: 15%

Average CPU usage for 4 harvests: 20%

Average CPU usage for 6 harvests: 25-30%

 

RDS Database Connections

The number of database connections was about 440.

Database load

This graph shows the top SQL queries for OAI-PMH harvests of 10k, 25k and 50k records.

This graph shows the top SQL queries for OAI-PMH harvests of 500k and 1 MLN records.

The marked query runs from cluster start until 16:30 UTC. The same query was also found in the pcp1 cluster.

This graph is for 1 MLN records only, with 4 and 6 concurrent harvests.

Test 2.  Record source = Inventory

Service CPU Utilization

Average CPU usage for 2 harvests: mod-oai-pmh-b = 1%, edge-oai-pmh-b = 0.5%, mod-source-record-storage-b = 1.5%, okapi-b = 0.8%, mod-inventory-storage-b = 0.3%

Average CPU usage for 4 harvests: mod-oai-pmh-b = 3.7%, edge-oai-pmh-b = 1.5%, mod-source-record-storage-b = 1.6%, okapi-b = 1.2%, mod-inventory-storage-b = 0.4%

Average CPU usage for 6 harvests: mod-oai-pmh-b = 5.5%, edge-oai-pmh-b = 2%, mod-source-record-storage-b = 1.4%, okapi-b = 1.2%, mod-inventory-storage-b = 0.5%

This graph is for 10k, 25k and 50k records.

This graph is for 250k records.

Service Memory Consumption

For the 10k, 25k and 50k tests, mod-oai-pmh memory consumption was 28% at the beginning and grew to 46%.

For the 250k tests, mod-oai-pmh memory consumption was 55% at the beginning and stayed at this level.

This graph is for 10k, 25k and 50k records.

This graph is for 250k records.

RDS CPU Utilization

RDS for 10k, 25k and 50k records

The fluctuations on the graph are explained by DELETE and INSERT queries with marc_id values, which are connected to the daily cluster restart. This process finished after 14:30, and from that point the graph reflects the tests themselves.

RDS for 250k records

RDS Database Connections

The number of connections was the same as in the other tests: about 440.

 

Test 3.  Record source = Source record storage and inventory

| Test Label | Average Response Times, ms | Duration |
|---|---|---|
| SRS+INV 2 concurrent 10k | 0.71 | 00:01:39 |
| SRS+INV 4 concurrent 10k | 0.617 | 00:01:31 |
| SRS+INV 6 concurrent 10k | 0.439 | 00:01:14 |
| SRS+INV 2 concurrent 25k | 0.773 | 00:04:27 |
| SRS+INV 4 concurrent 25k | 0.802 | 00:04:34 |
| SRS+INV 6 concurrent 25k | 0.407 | 00:02:57 |
| SRS+INV 2 concurrent 50k | 0.684 | 00:08:10 |
| SRS+INV 4 concurrent 50k | 0.629 | 00:07:44 |
| SRS+INV 6 concurrent 50k | 1.31 | 00:13:25 |
| SRS+INV 2 concurrent 500k | 1.03 | 01:51:24 |
| SRS+INV 4 concurrent 500k | 1 | 01:48:48 |
| SRS+INV 6 concurrent 500k | 0.953 | 01:44:42 |
| SRS+INV 2 concurrent 1Mln | 0.652 | 02:39:09 |
| SRS+INV 4 concurrent 1Mln | 0.721 | 02:50:29 |
| SRS+INV 6 concurrent 1Mln | 0.768 | 02:58:50 |

Service CPU Utilization

Average CPU usage for 2 harvests: mod-oai-pmh-b = 10%, edge-oai-pmh-b = 7%, mod-source-record-storage-b = 1.7%, okapi-b = 1.5%, mod-inventory-storage-b = 0.6%

Average CPU usage for 4 harvests: mod-oai-pmh-b = 15%, edge-oai-pmh-b = 10%, mod-source-record-storage-b = 1.5%, okapi-b = 2%, mod-inventory-storage-b = 0.8%

Average CPU usage for 6 harvests: mod-oai-pmh-b = 25%, edge-oai-pmh-b = 15%, mod-source-record-storage-b = 1.4%, okapi-b = 2.4%, mod-inventory-storage-b = 1%

This graph shows the 10k, 25k and 50k harvests and the 2 concurrent harvests of 500k.

This graph shows the 500k and 1 MLN harvests.

This graph shows the 1 MLN harvests only.

Service Memory Utilization

Memory consumption was stable for the OAI-PMH related modules. mod-inventory did not exceed 72%.

Average memory consumption did not exceed the following values: mod-oai-pmh-b = 40%, edge-oai-pmh-b = 29%, mod-source-record-storage-b = 37%, okapi-b = 37%, mod-inventory-storage-b = 15%, mod-inventory = 72%

RDS CPU Utilization

Average CPU utilization was stable for the same number of concurrent harvests and close to the results of test #1.

The fluctuations on the DB graphs are explained by DELETE queries against the marc_indexers table with a specific condition, which we observed after every daily cluster start. They produce high load that affects OAI-PMH response times, and this happens after each cluster restart. The query does the following:

  • It deletes rows from the marc_indexers table based on conditions defined in two separate subqueries.

  • It captures the marc_id values of the deleted rows.

  • It inserts the distinct marc_id values from both subqueries into the marc_indexers_deleted_ids table to keep track of the deleted marc_id values.

Average CPU usage for 2 harvests: 15%

Average CPU usage for 4 harvests: 20%

Average CPU usage for 6 harvests: 25-30%

RDS Database Connections

The number of database connections was about 440 in all tests.

 

Database load

This graph shows 10k, 25k, 50k 

Top query:

  • WITH deleted_rows AS (
        DELETE FROM marc_indexers mi
        WHERE EXISTS (
          SELECT 1 FROM marc_records_tracking mrt
          WHERE mrt.is_dirty = true AND mrt.marc_id = mi.marc_id AND mrt.version > mi.version
        )
        RETURNING mi.marc_id
      ),
      deleted_rows2 AS (
        DELETE FROM marc_indexers mi
        WHERE EXISTS (
          SELECT 1 FROM records_lb
          WHERE records_lb.id = mi.marc_id AND records_lb.state = 'OLD'
        )
        RETURNING mi.marc_id
      )
      INSERT INTO marc_indexers_deleted_ids
      SELECT DISTINCT marc_id FROM deleted_rows
      UNION
      SELECT marc_id FROM deleted_rows2

 

 

Appendix

Methodology/Approach

OAI-PMH incremental harvesting was carried out by a JMeter script run from carrier, using 2 main requests:

  • /oai/records?verb=ListRecords&metadataPrefix=marc21_withholdings&apikey=[APIKey]

  • /oai/records?verb=ListRecords&apikey=[APIKey]&resumptionToken=[resumptionToken]
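The two requests form the usual OAI-PMH paging pattern: the first call starts the list, and each following call passes the resumptionToken returned by the previous response. The sketch below illustrates that flow in Python and is not the JMeter script itself. The `fetch` callable, the function names and the `max_follow_ups` parameter are assumptions made for the example; only the paths and query parameters come from the requests above.

```python
import xml.etree.ElementTree as ET
from typing import Callable, Optional, Tuple
from urllib.parse import urlencode

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}
BASE_PATH = "/oai/records"


def first_request_url(api_key: str, metadata_prefix: str = "marc21_withholdings") -> str:
    """Build the initial ListRecords request."""
    query = {"verb": "ListRecords", "metadataPrefix": metadata_prefix, "apikey": api_key}
    return BASE_PATH + "?" + urlencode(query)


def next_request_url(api_key: str, token: str) -> str:
    """Build a follow-up ListRecords request that carries the resumption token."""
    query = {"verb": "ListRecords", "apikey": api_key, "resumptionToken": token}
    return BASE_PATH + "?" + urlencode(query)


def parse_page(xml_text: str) -> Tuple[int, Optional[str]]:
    """Return the number of records on the page and the next token, if any."""
    root = ET.fromstring(xml_text)
    records = root.findall("./oai:ListRecords/oai:record", OAI_NS)
    token_el = root.find("./oai:ListRecords/oai:resumptionToken", OAI_NS)
    token = token_el.text.strip() if token_el is not None and token_el.text else None
    return len(records), token


def harvest(fetch: Callable[[str], str], api_key: str, max_follow_ups: int) -> int:
    """Count harvested records; `fetch` maps a request URL to a response body."""
    count, token = parse_page(fetch(first_request_url(api_key)))
    follow_ups = 0
    while token and follow_ups < max_follow_ups:
        page_count, token = parse_page(fetch(next_request_url(api_key, token)))
        count += page_count
        follow_ups += 1
    return count
```

Passing `fetch` as a parameter keeps the sketch free of any HTTP client choice, so it can be exercised with canned XML responses. The harvest stops when the server returns an empty resumptionToken or when the follow-up limit is reached.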

A loop counter was used to extract the required number of records, with the following configuration:

  • 98 loop counts for 10K records;

  • 248 loop counts for 25K records;

  • 498 loop counts for 50K records;

  • 2498 loop counts for 250k records* 

  • 4998 loop counts for 500K records;
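The loop counts listed above follow a simple pattern. One reading that is consistent with the numbers is that each response returns 100 records and that two requests run outside the loop. Neither value is stated in this report, so both are assumptions in the sketch below, and the function name is illustrative.

```python
# Both constants are assumptions inferred from the listed loop counts,
# not values taken from the JMeter script or the module configuration.
RECORDS_PER_RESPONSE = 100
REQUESTS_OUTSIDE_LOOP = 2


def loop_count(target_records: int) -> int:
    """Loop iterations needed to reach the target number of records."""
    total_requests, remainder = divmod(target_records, RECORDS_PER_RESPONSE)
    if remainder:
        total_requests += 1  # a partial last page still needs one request
    return max(total_requests - REQUESTS_OUTSIDE_LOOP, 0)


for target in (10_000, 25_000, 50_000, 250_000, 500_000):
    print(f"{target} records: {loop_count(target)} loop counts")
```

If the page size in the tested environment differs from 100, `RECORDS_PER_RESPONSE` would need to be adjusted accordingly.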