OAI-PMH data harvesting [iris]


 

PERF-144: Harvesting 10M using marc21_withholdings metadata prefix (Closed)

 

 


 

 

Overview

The purpose of this test report is to highlight results from mod-oai-pmh harvesting tests. 

Test flow

The test consists of two kinds of calls:

Initial call:

  • /oai/records?verb=ListRecords&metadataPrefix=marc21_withholdings&apikey=[APIKey] - performed only once

Harvesting call:

  • /oai/records?verb=ListRecords&apikey=[APIKey]&resumptionToken=[resumptionToken] - repeated in a loop for as long as the [tenant]_mod_oai_pmh.instances table still contains data to harvest.

A [resumptionToken] is returned in the response to the initial call and to each harvesting call while there is still data to harvest. Once all data has been harvested, the response no longer contains a resumptionToken.
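The call sequence above can be sketched as a small harvesting loop. This is a minimal illustration and not the load script used in the tests: the names `harvest`, `http_get`, `extract_token` and `count_records` are hypothetical, the base URL and API key are placeholders, and the response is assumed to be standard OAI-PMH XML in the `http://www.openarchives.org/OAI/2.0/` namespace. An empty resumptionToken element is treated the same as a missing one.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

# Standard OAI-PMH 2.0 namespace (assumed for the response body).
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def http_get(url):
    """Fetch a URL and return the response body as text."""
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode("utf-8")


def extract_token(xml_text):
    """Return the resumptionToken, or None if it is absent or empty."""
    node = ET.fromstring(xml_text).find(f".//{OAI_NS}resumptionToken")
    if node is None or not (node.text or "").strip():
        return None
    return node.text.strip()


def count_records(xml_text):
    """Count <record> elements in one ListRecords response."""
    return len(ET.fromstring(xml_text).findall(f".//{OAI_NS}record"))


def harvest(base_url, api_key, fetch=http_get):
    """Run the initial call once, then follow resumption tokens.

    Returns (total_records, number_of_calls).
    """
    # Initial call: carries metadataPrefix, no resumptionToken.
    query = {
        "verb": "ListRecords",
        "metadataPrefix": "marc21_withholdings",
        "apikey": api_key,
    }
    total = 0
    calls = 0
    while True:
        body = fetch(f"{base_url}/oai/records?{urllib.parse.urlencode(query)}")
        calls += 1
        total += count_records(body)
        token = extract_token(body)
        if token is None:
            # No token in the response: everything has been harvested.
            return total, calls
        # Harvesting call: carries only the token from the previous response.
        query = {
            "verb": "ListRecords",
            "apikey": api_key,
            "resumptionToken": token,
        }
```

Passing `fetch` as a parameter keeps the loop independent of the HTTP client, so it can be exercised against canned responses before pointing it at a real environment.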

 

Environment:

  • 61 back-end modules deployed in 110 ECS services

  • 3 okapi ECS services

  • 8 m5.large EC2 instances

  • 2 db.r5.xlarge AWS RDS instances (1 reader, 1 writer)

Software version

mod-oai-pmh 3.4.2

Summary 

We were able to harvest the entire available data set, 7.2 M records, with different values of the "Max records per response" parameter.

There may be a memory leak on the mod-oai-pmh side: memory and CPU usage grew continuously during the tests (see screenshots below).

 

Tests and results

 

| Test | Max records per response | Time to complete | Result | Issues |
|------|--------------------------|------------------|--------|--------|
| 1 | 100 | 6 hours 26 minutes | All data harvested | Growing CPU/RAM usage |
| 2 | 300 | 2 hours 31 minutes | 5.5 M records harvested | Connection to the load generator lost (not an oai-pmh issue) |
| 3 | 500 | 2 hours 27 minutes | All data harvested | Growing CPU/RAM usage |
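To compare the runs on a common scale, the figures above can be turned into an approximate throughput and request count. The sketch below is only arithmetic on the reported numbers: "all data" is taken to mean the 7.2 M records stated in the summary (a rounded figure), and the request count assumes every response carried the full "Max records per response" batch. `summarize` and `RUNS` are hypothetical names.

```python
TOTAL_RECORDS = 7_200_000  # "7.2 M" from the summary; a rounded figure

# (test number, max records per response, duration in seconds, records harvested)
RUNS = [
    (1, 100, 6 * 3600 + 26 * 60, TOTAL_RECORDS),
    (2, 300, 2 * 3600 + 31 * 60, 5_500_000),
    (3, 500, 2 * 3600 + 27 * 60, TOTAL_RECORDS),
]


def summarize(runs):
    """Return throughput and estimated request count for each run."""
    rows = []
    for test, batch, seconds, records in runs:
        rows.append({
            "test": test,
            "records_per_second": records / seconds,
            # Ceiling division: assumes every response was a full batch.
            "requests": -(-records // batch),
        })
    return rows


for row in summarize(RUNS):
    print(f"test {row['test']}: "
          f"{row['records_per_second']:.0f} records/s, "
          f"~{row['requests']} requests")
```

Under these assumptions the runs come out at roughly 311, 607 and 816 records per second. Raising the batch size from 100 to 500 therefore cut the total time by more than half, even though each response was five times larger.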

 

Service CPU usage

 

Service Memory usage

 

Source-record-storage memory usage

 

Source-record-storage CPU usage

 

 

 

Source-record-manager memory usage

 

Source-record-manager CPU usage

 

mod-inventory-storage CPU usage

 

 

mod-inventory-storage memory usage

 

 

Heap Analysis

Two leak suspects are common to all heap dumps (one dump was taken after each test):

io.vertx.core.http.impl.HttpClientImpl:

The number of instances keeps growing: 7 347 → 13 248 → 20 664.

 

 

 

io.vertx.core.http.impl.ConnectionManager:

The number of instances keeps growing: 14 694 → 26 514 → 41 328.
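The growth between dumps is easier to read as deltas and ratios. The sketch below only restates the instance counts reported above and assumes the dumps were taken in test order. The variable names are hypothetical.

```python
# Instance counts from the three heap dumps (one taken after each test).
http_client_impl = [7_347, 13_248, 20_664]
connection_manager = [14_694, 26_514, 41_328]


def deltas(counts):
    """Instances added between consecutive heap dumps."""
    return [later - earlier for earlier, later in zip(counts, counts[1:])]


client_growth = deltas(http_client_impl)
manager_growth = deltas(connection_manager)

# ConnectionManager instances per HttpClientImpl instance in each dump.
ratios = [m / c for m, c in zip(connection_manager, http_client_impl)]

print("HttpClientImpl growth:   ", client_growth)
print("ConnectionManager growth:", manager_growth)
print("ratio per dump:          ", [round(r, 3) for r in ratios])
```

In every dump the ConnectionManager count is almost exactly twice the HttpClientImpl count. This is consistent with each retained client holding two connection managers, so the two suspects may share a single root cause, although the dumps alone do not confirm this.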