IN PROGRESS

Overview

The purpose of this set of tests is to measure the performance of the Kiwi release and to find possible issues and bottlenecks.

Jira: PERF-198

Environment

Software versions (Test 1-2)

mod-oai-pmh:3.7.0-SNAPSHOT.188
edge-oai-pmh:2.4.0
mod-source-record-manager:3.2.3
mod-source-record-storage:5.2.1
mod-inventory-storage:22.0.1
okapi:4.9.0

Original PTF dataset containing 1,212,039 underlying MARC records for 8.7M instances

Software versions (Test 3 with Bugfest Dataset)

mod-oai-pmh:3.6.1
edge-oai-pmh:2.4.0
mod-source-record-manager:3.2.6
mod-source-record-storage:5.2.5
mod-inventory-storage:22.0.3
okapi:4.9.0

Bugfest dataset containing 8,034,444 underlying MARC records for 8.3M instances

Summary

The Kiwi release was able to harvest 7,808,200 records in 19 hr 8 min (1M records per 2 hours and 15 min).
The average response time per request with a resumption token was 0.874 s.
No memory or CPU issues were found (after the first couple of JIRAs below had been fixed).
KPIs:
- mod-oai-pmh CPU usage: 120% during data transfer, 100% during harvesting.
- RDS CPU usage: 80% during data transfer, about 15% during harvesting.
- Memory usage: 105-107% on mod-source-record-manager, 35% on mod-oai-pmh. No signs of memory leaks in the related modules.
A few issues were found:
- OutOfMemory exception: MODOAIPMH-374
- Thread block issue: MODOAIPMH-374
- When instances did not have underlying MARC records, multiple repeated calls were made from mod-edge-oai-pmh to mod-oai-pmh, and the end client received a timeout. See MODOAIPMH-383.

Test flow

The test consists of a few calls:

Initial call, performed only once:

Code Block
/oai/records?verb=ListRecords&metadataPrefix=marc21_withholdings&apikey=[APIKey]

...

Subsequent harvesting calls:

Code Block
/oai/records?verb=ListRecords&apikey=[APIKey]&resumptionToken=[resumptionToken]

...

These calls were performed repeatedly, harvesting 100 records each time, until there was no more data to harvest in the [tenant]_mod_oai_pmh.instances table.

The number of records per response was set to 100. A [resumptionToken] is returned in the initial call response and in each harvesting call response until there are no more records to harvest. Once all data has been harvested, the response no longer contains a resumptionToken.
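The call sequence above can be sketched as a simple loop. This is a minimal illustration, not the script used in the test: the `harvest` and `build_url` helpers, the `fetch` callable (anything that takes a URL and returns the response body as text) and the base URL are assumptions made for the example. The `record` and `resumptionToken` element names and the namespace come from the OAI-PMH 2.0 specification. The loop stops when the token is either missing or empty.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# XML namespace defined by the OAI-PMH 2.0 specification.
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def build_url(base, api_key, token=None):
    """Build the initial call (no token) or a subsequent harvesting call."""
    if token is None:
        params = {
            "verb": "ListRecords",
            "metadataPrefix": "marc21_withholdings",
            "apikey": api_key,
        }
    else:
        params = {"verb": "ListRecords", "apikey": api_key, "resumptionToken": token}
    return base + "/oai/records?" + urlencode(params)


def harvest(fetch, base, api_key):
    """Follow resumption tokens until none is returned.

    `fetch` is a hypothetical callable: URL in, response body (str) out.
    Returns (number of calls, number of records seen).
    """
    calls = 0
    records = 0
    token = None
    while True:
        root = ET.fromstring(fetch(build_url(base, api_key, token)))
        calls += 1
        records += len(root.findall(f".//{OAI_NS}record"))
        elem = root.find(f".//{OAI_NS}resumptionToken")
        token = elem.text.strip() if elem is not None and elem.text else None
        if not token:
            return calls, records
```

In a real run `fetch` could wrap any HTTP client; keeping it as a parameter lets the loop be exercised without a live environment.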

Issues detected during testing

1) OutOfMemory exception. Fixed in scope of MODOAIPMH-374.

2) Thread block issue. Fixed in scope of MODOAIPMH-374.

3) Client timeouts. See MODOAIPMH-383.

A new issue appears when the DB transfer and the harvesting process are started at the same time. This leads to a high load on the DB, which then responds with a timeout:

2021-12-01T10:02:42,566 ERROR [vert.x-eventloop-thread-0] MarcWithHoldingsRequestHelper Save instance Ids failed: Timeout.

io.vertx.core.impl.NoStackTraceThrowable: Timeout

Test results

Test 1

Total Underlying SRS records: 1,212,039
Duration: 4 hr 57 min
Records transferred: 4,770,043 (should be 8,415,303)
Calls performed: 20,618

Image Added
This chart shows the unstable part of the test. The spikes are sharply increased response times, which lead to gaps in throughput. It was not clear what was happening at these points, so we checked the logs of:
- RDS response times: PGLogs.log
- mod-oai-pmh
- nginx-oai-pmh
- edge-oai-pmh
- okapi
Each of these showed good response times, and we could not find a correlation between the logs and this chart.

Image Added
Service CPU usage reached about 200% during data transfer and stayed at 50-60% during harvesting. During the "unstable part" of the test, however, it dropped to 20-25%.

Image Added
Service memory usage is stable. There are no signs of a memory leak.

Image Added
Notable observations:
- While the data transfer process was running in the background, DB CPU usage reached 70%-75%.
- The data transfer process failed after 10 minutes and transferred only 4,770,043 of the 8M records.
- Harvesting itself consumes 15% DB CPU.

Image Added

Test 2

Total Underlying SRS records: 1,212,039
Duration: 4 hr 25 min
Records transferred: 3,815,867 (should be 8,415,303)
Calls performed: 22,305

Results were the same as in test 1, showing consistency in failures due to missing a large number of underlying MARC records.
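To make the shortfall in Tests 1 and 2 easier to compare, the transferred counts can be expressed as a share of the expected 8,415,303 records. The snippet below is only a worked calculation on the figures reported above; the `completion_pct` helper name is made up for the example.

```python
EXPECTED = 8_415_303  # expected number of transferred records
TRANSFERRED = {"Test 1": 4_770_043, "Test 2": 3_815_867}


def completion_pct(done, expected):
    """Share of expected records that were actually transferred, in percent."""
    return 100.0 * done / expected


for name, done in TRANSFERRED.items():
    pct = completion_pct(done, EXPECTED)
    print(f"{name}: {pct:.1f}% transferred, {EXPECTED - done:,} missing")
# Test 1: 56.7% transferred, 3,645,260 missing
# Test 2: 45.3% transferred, 4,599,436 missing
```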

Image Added

Test 3 (with Bugfest Dataset)

Underlying MARC records: 8,034,444
Records transferred: 8,213,392
Records harvested: 78,082 x 100 = 7,808,200
Time spent: 19 hr 8 min

Average response time for a call with a resumption token: 0.874 s
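As a sanity check on these figures, the overall wall-clock rate can be derived from the call count and the duration. This is a rough calculation that assumes the 19 hr 8 min covers the whole run; the variable names are illustrative. The resulting wall-clock time per call (about 0.88 s) is close to the 0.874 s average response time quoted in the Summary, which is what one would expect if the calls were issued one after another.

```python
RECORDS_PER_CALL = 100
CALLS = 78_082
DURATION_S = 19 * 3600 + 8 * 60  # 19 hr 8 min in seconds

records = CALLS * RECORDS_PER_CALL          # total records harvested
throughput = records / DURATION_S           # records per second, wall clock
sec_per_call = DURATION_S / CALLS           # wall-clock seconds per call

print(f"records harvested: {records:,}")
print(f"throughput: {throughput:.1f} records/s")
print(f"wall-clock time per call: {sec_per_call:.3f} s")
```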

With the new data set there are no "unstable parts" in this test. The results of this test are the best and most accurate representation of OAI-PMH performance in Kiwi.

Image Added

CPU usage is stable, without big spikes. The higher CPU usage at the beginning of the test is the data transfer process between mod-inventory-storage and mod-oai-pmh.

Image Added

Memory usage is stable and none of the containers failed.

Image Added

Notable observations:

The OutOfMemory exception and the thread block issue (both MODOAIPMH-374) were found and resolved early in the testing cycle.
The remaining issue: the unstable parts of the first couple of tests were caused by the data set (MODOAIPMH-383).
- Instances did not have underlying records, which caused multiple repeated calls from mod-edge-oai-pmh to mod-oai-pmh.
- As a result, the end client had to wait until oai-pmh found instances with underlying records, and it often failed with a 504 Gateway Timeout (load balancer timeout: 400 seconds).
- The workaround was to test with a dataset that has underlying MARC records, namely Bugfest's. This is an edge case as most systems would have more MARC records than instances. It does not need to be resolved for the Kiwi release.
