OAI-PMH data harvesting releases comparison

PERF-197 - Getting issue details... STATUS

Overview

As was mentioned in Jira - there is huge performance degradation between releases iris - juniper. According to Jira - previously harvesting process of 350K records takes ±2 hour and now on Juniper release it takes 4 hours. 

Test flow

Test consist of few calls:

Initial call:

  • /oai/records?verb=ListRecords&metadataPrefix=marc21_withholdings&apikey=[APIKey] - performing only once

Harvesting call:

  • /oai/records?verb=ListRecords&apikey=[APIKey]&resumptionToken=[resumptionToken] - performing repeatedly, harvesting 100 records each time until there is no more data in [tenant]_mod_oai_pmh.instances table to harvest.

[resumptionToken] returning in initial call response and in each harvesting call until there is no more records to harvest. When all data has being harvested - resumptionToken will not return with the response.


Environments to test on:

  • ICP1 - Iris release;
  • JCP1 - Juniper release;
  • IMTC1 - Latest Juniper release (and more powerful environment).


Software versions

IMTC1 (Juniper)JCP1 (Juniper)ICP1 (Iris)
mod-oai-pmh:3.5.0mod-oai-pmh:3.5.0mod-oai-pmh:3.4.2
edge-oai-pmh:2.3.1edge-oai-pmh:2.3.1edge-oai-pmh:2.3.1

mod-source-record-storage:5.1.5

mod-source-record-storage:5.1.6mod-source-record-storage:5.0.6
mod-source-record-manager:3.1.3mod-source-record-manager:3.1.3mod-source-record-manager:3.0.9
mod-inventory-storage:21.0.3mod-inventory-storage:21.0.3mod-inventory-storage:20.2.2
okapi:4.8.2okapi:4.8.2okapi:4.7.3


Test results 

Harvesting of 350K records.


ICP1 (Iris)IMTC1 (Juniper)JCP1 (Juniper)
Test 11 hr 47 min1 hr 14 min45 min
Test 21 hr 26 min1 hr 10 min42 min


CountsICP1 (Iris)IMTC1 (Juniper)JCP1 (Juniper)
Instances8,072,048756,4177,252,127
SRS records1,186,613597,223710,371


Conclusions:

  • As you can see from table above - we can say that Juniper release at least not slower than Iris release, or even faster.
  • We can see slowness on IMTC1 env. However it's still better or the same as ICP1 (Iris).
  • The only difference between IMTC1 and JCP1 is state of DB (different data set may affect the performance). Between IMTC1 and JCP1, the difference here is the proportion of SRS records relative to the instance records. IMTC1's harvesting time may be slower because there are more SRS records to process per instances than with JCP1 instances.
  • We'll need additional tests on same data set but on different releases.


Resource usage

We observe different CPU usage on mod-inventory storage between JCP and IMTC. I think that i can be explained with different numbers of instances that have SRS records. 

IMTC



JCP1







ICP1