OAI-PMH data harvesting [KIWI]


IN PROGRESS


Jira: PERF-198



Overview 

The purpose of this set of tests is to measure the performance of the Kiwi release and to find possible issues and bottlenecks.


Environment 

Software versions (Test 1-2)

  • mod-oai-pmh:3.7.0-SNAPSHOT.188
  • edge-oai-pmh:2.4.0
  • mod-source-record-manager:3.2.3
  • mod-source-record-storage:5.2.1
  • mod-inventory-storage:22.0.1
  • okapi:4.9.0


Original PTF dataset containing 1,212,039 underlying records for 8.7M instances


Software versions (Test 3)

  • mod-oai-pmh:3.6.1
  • edge-oai-pmh:2.4.0
  • mod-source-record-manager:3.2.6
  • mod-source-record-storage:5.2.5
  • mod-inventory-storage:22.0.3
  • okapi:4.9.0

Bugfest dataset containing 8,034,444 underlying records for 8.3M instances


Summary

The Kiwi release was able to harvest 7,808,200 records in 19 hr 8 min (≈2 hr 27 min per million records).

Average response time per request: 0.874 s.
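As a quick sanity check, the totals above can be reproduced with a few lines of arithmetic. This is plain Python using only the figures reported in this page; the per-call average comes out slightly above the reported 0.874 s because the wall-clock time also includes per-call overhead.

```python
# Sanity check of the harvest totals reported above (arithmetic only).
total_records = 78_082 * 100       # 100 records per harvesting call
total_minutes = 19 * 60 + 8        # 19 hr 8 min

# Minutes needed per million records at the observed rate.
minutes_per_million = total_minutes / (total_records / 1_000_000)

# Wall-clock time divided by the number of calls; slightly above the
# reported 0.874 s average because it includes per-call overhead.
avg_seconds_per_call = total_minutes * 60 / 78_082

print(total_records)               # 7808200
print(round(minutes_per_million))  # 147 min, i.e. about 2 hr 27 min
print(round(avg_seconds_per_call, 3))
```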


Test flow

The test consists of two types of calls:

Initial call:

  • /oai/records?verb=ListRecords&metadataPrefix=marc21_withholdings&apikey=[APIKey] - performed only once

Harvesting call:

  • /oai/records?verb=ListRecords&apikey=[APIKey]&resumptionToken=[resumptionToken] - performed repeatedly, harvesting 100 records per call until there is no more data to harvest in the [tenant]_mod_oai_pmh.instances table.

A [resumptionToken] is returned in the initial call response and in each harvesting call response until there are no more records to harvest. When all data has been harvested, the resumptionToken is no longer returned with the response.
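The flow above can be sketched as a small client loop. This is an illustrative sketch, not the actual PTF test script: the base URL and API key are placeholder assumptions, and only the resumptionToken handling mirrors the described flow.

```python
# Hypothetical sketch of the harvesting loop described above.
# base_url and api_key are placeholders, not real PTF values.
import urllib.request
import xml.etree.ElementTree as ET

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"

def extract_resumption_token(xml_text):
    """Return the resumptionToken from a ListRecords response, or None.

    An absent or empty token element means the harvest is complete.
    """
    root = ET.fromstring(xml_text)
    token = root.find(f".//{OAI_NS}resumptionToken")
    if token is None or not (token.text or "").strip():
        return None
    return token.text.strip()

def harvest(base_url, api_key):
    """Issue the initial call, then repeat until no token is returned."""
    url = (f"{base_url}/oai/records?verb=ListRecords"
           f"&metadataPrefix=marc21_withholdings&apikey={api_key}")
    calls = 0
    while url:
        with urllib.request.urlopen(url) as resp:
            body = resp.read().decode("utf-8")
        calls += 1
        token = extract_resumption_token(body)
        url = (f"{base_url}/oai/records?verb=ListRecords"
               f"&apikey={api_key}&resumptionToken={token}") if token else None
    return calls
```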


Issues detected during testing

List of issues found, their causes, and possible fixes:

1) OutOfMemory exception. Fixed in the scope of MODOAIPMH-374.


2) Thread block issue. Fixed in the scope of MODOAIPMH-374.


3) DB timeout. A new issue that appears when the DB transfer and the harvesting process are started at the same time. This leads to high load on the DB, which responds with a timeout:

2021-12-01T10:02:42,566 ERROR [vert.x-eventloop-thread-0] MarcWithHoldingsRequestHelper Save instance Ids failed: Timeout.

io.vertx.core.impl.NoStackTraceThrowable: Timeout

Fixed by changing the dataset to a bugfest-like one.


Test results

Test 1

Duration - 4 hr 57 min

Records transferred - 4,770,043 (should be 8,415,303)

Records harvested - 20,618 × 100 = 2,061,800

Total Underlying SRS records: 1,212,039


An unstable part of the test is visible here. The spikes on the chart show extremely increased response times, which lead to throughput gaps. At this point we are still not sure why this happens; we have checked:

  • RDS response times (PGLogs.log);
  • mod-oai-pmh (logs);
  • nginx-oai-pmh (logs);
  • edge-oai-pmh (logs);
  • okapi (logs);

At each point the response times are good, and we cannot see any correlation between the logs and this chart.



Service CPU usage reached ±200% during data transfer and stayed at the 50-60% level during data harvesting. However, during the "unstable part" of the test it dropped to 20-25%.


  • Service memory usage is stable. There are no signs of a memory leak.


Notable observations:

  • While the data transfer process runs in the background, DB CPU usage reaches 70%-75%.
  • The data transfer process failed after 10 minutes, transferring only 4,770,043 of 8M records.
  • Harvesting itself consumes 15% of DB CPU.





Test 2

Duration 4 hr 25 min

Records transferred - 3,815,867 (should be 8,415,303)

Records harvested - 22,305 × 100 = 2,230,500

Total Underlying SRS records: 1,212,039



Final test (bugfest data set)

Records transferred - 8,213,392

Records harvested - 78,082 × 100 = 7,808,200

Time spent: 19 hr 8 min

Underlying records number: 8,034,444

Average response time for a call with resumption token: 0.874 s


With the new dataset there are no "unstable parts" in this test.


CPU usage is stable and without big spikes. The higher CPU usage at the beginning of the test corresponds to the data transfer process between mod-inventory-storage and mod-oai-pmh.



We can see here that memory usage is stable and that no container failed.







Notable observations:

  • The unstable parts of the first tests were caused by the dataset:
    • instances did not have underlying records, which caused multiple repeated calls from mod-edge-oai-pmh to mod-oai-pmh;
    • as a result, the end client had to wait until oai-pmh found records with underlying records, and the client often failed with a 504 Gateway Timeout (load balancer timeout of 400 seconds).
  • The DB timeouts were fixed by changing the dataset to a bugfest-like one.
  • A Jira ticket was created to handle client waits: MODOAIPMH-383

