OAI-PMH data harvesting (Orchid)

OAI-PMH data harvesting (Orchid)

Overview  

The purpose of the OAI-PMH tests is to measure the performance of the Orchid release and to find possible issues, bottlenecks per PERF-410: [Orchid] [OAI-PMH] Performance testing (Jmeter)Closed

Environment

 

Version

CPU

Memory/MemoryReservation

Xmx

MaxMetaSpaceSize

MetaspaceSize

Tasks Count

Task Rev Number

 

Version

CPU

Memory/MemoryReservation

Xmx

MaxMetaSpaceSize

MetaspaceSize

Tasks Count

Task Rev Number

mod-oai-pmh

mod-oai-pmh:3.11.3

2048

2048/1845

1440m

512m

384m

2

12

edge-oai-pmh

edge-oai-pmh:2.6.1

1024

1512/1360

1440m

512m

384m

2

5

mod-inventory-storage

mod-inventory-storage:26.0.0

1024

2208/1952

1440m

512m

384m

2

4

mod-source-record-storage

mod-source-record-storage:5.6.5

2048

5600/5000

 

3600m

512M

384M

2

4

okapi

okapi:5.1.0-SNAPSHOT.1352

okapi:5.0.1

1024

1684/1440

922m

512m

384m

3

2

Summary

  • Tests were executed by JMeter script in the AWS Carrier-io instance.

  • The average response time per request with a resumption token is 604 ms ( compared to Nolana's 646 ms).

  • 'OutOfMemory' exceptions (OOM) were detected at all times when tests were run without throughput limits (w/o 40 requests/min). Tests with 40 requests/min (which in Morning Glory proved to be successful 100% of the time) had positive results but the integrity of the harvesting was not checked. OOMs seen in Morning Glory and Nolana releases are also seen here.

  • 'Thread blocked' exceptions were detected only during the investigation of a custom branch (https://github.com/folio-org/mod-oai-pmh/tree/vert-timeout-rollback) with vert timeout manipulation.

Recommendations & Jiras

Test Results

Test Number

recordsSource

maxRecordsPerResponse

Duration

Total Requests

Requests/min

Average Response time,

ms

Completed

Test Number

recordsSource

maxRecordsPerResponse

Duration

Total Requests

Requests/min

Average Response time,

ms

Completed

1

Source record storage

100

~25 h

92'040

39.5

604

No issues 

2

Source record storage and Inventory

100

~9 h

19'889

38.8

488

No issues 

- the integrity of the harvesting was not checked

Response metadata:

Test #1
{    "requestMetadataCollection": [        {            "requestId": "5457c24a-d3c6-4807-8b9e-9a0123802820",            "lastUpdatedDate": "2023-04-03T06:12:59.807+00:00",            "streamEnded": true,            "downloadedAndSavedInstancesCounter": 10023100,            "failedToSaveInstancesCounter": 0,            "returnedInstancesCounter": 9160018,            "skippedInstancesCounter": 45682,            "failedInstancesCounter": 0,            "suppressedInstancesCounter": 0        }    ],    "totalRecords": 1 }

 

Test #2
{    "requestMetadataCollection": [        {            "requestId": "0c2ef1f4-ab94-450a-b092-723ffc5338d6",            "lastUpdatedDate": "2023-04-03T18:39:29.738+00:00",            "streamEnded": true,            "downloadedAndSavedInstancesCounter": 10023100,            "failedToSaveInstancesCounter": 0,            "returnedInstancesCounter": 1957773,            "skippedInstancesCounter": 33027,            "failedInstancesCounter": 0,            "suppressedInstancesCounter": 0        }    ],    "totalRecords": 1 }

For additional investigation of the behaviour of OAI-PMH operations with the throughput limitation of 40 requests/min, a series of test executions have been run on the local machine:

Id

recordsSource

Throughput

Exec

Result

Comments

1

Source records storage

No limits

local

OutOfMemory

 

2

40/min

local

connection lost ~12h

Not representative due to the local machine has lost connection

3

Inventory

No limits

local

504 Gateway Time-out

 

5

Source record storage and Inventory

No limits

local

java.net.SocketException: An established connection was aborted by the software in your host machine

Not representative due to the local machine has lost connection

Moreover, for additional investigation of the behaviour of OAI-PMH operations, several versions of 'mod-oai-pmh' were tested:

Version of the mod-oai-pmh

Throughput

recordsSource

Result

mod-oai-pmh:3.11.0

No limits

SRS

OutOfMemory

mod-oai-pmh:3.11.1

No limits

SRS

OutOfMemory

mod-oai-pmh:3.12.0

No limits

SRS

OutOfMemory

https://github.com/folio-org/mod-oai-pmh/tree/vert-timeout-rollback

No limits

SRS

Thread blocked >> OOM

Resources Utilization: Test #1 - SRS > 92'040 requests

Service CPU Utilization

mod-oai-pmh can spike up to over 300% initially.  The other modules barely incur any high CPU utilization above their baselines. 

Service Memory Usage

No memory issues were detected.

Database CPU Utilization

The DB CPU utilization shows spikes in the first 30-60 minutes for transferring instances and is relatively even and smooth for the rest of the harvest. 

ocp2-db-01: in the case of excepting the first 30-60 minutes, we can see a tendency of slow CPU Utilization growth.

Resources Utilization: Test #2 - SRS+Inventory > 19'889 requests

Service CPU Utilization

'mod-oai-pmh' spiked up to less than 80% initially.  The other modules barely incur any high CPU utilization above their baselines. 

Service Memory Usage

No memory issues were detected.

 

Database CPU Utilization

The DB CPU utilization shows spikes in the first 30-60 minutes for transferring instances and is relatively even and smooth for the rest of the harvest. 

 

Incremental harvesting

 

Environment

Incremental harvesting testing were performed on a bit different environment (for most of a modules it's the same except mod-source-record-storage memory configuration for incremental harvesting it's less than for full harvesting env.5600/5000 vs 2048/1440 and xmx 3600 vs 1024)

For env configuration details please see : Appendix section

Summary

  • For all tests all data were harvested successfully;

  • There are a chance to get "504 Gateway Time-out" on initial call. Bug created MODINVSTOR-1052

  • Response times for harvesting calls is in average 650 -750ms;

  • No memory leak suspects were defined;

  • RDS CPU usage is on 12-13% level during harvesting and ±35-40% during data transferring.

 

Methodology

For testing incremental harvesting PTF team create JMeter script that contains initial oai-pmh call:

/oai?verb=ListRecords&metadataPrefix=marc21_withholdings&apikey=<apiKey>&from=<start-date>&until=<end-date>

and subsequent call:

oai/<APIKey>?verb=ListRecords&resumptionToken=<resumptionToken>

 

Time ranges and data set sizes: 

set

from

to 

num of records

set

from

to 

num of records

1

2023-04-07

2023-04-07

25K

2

2023-04-01

2023-04-06

590K

3

2023-03-28T18:00:00Z

2023-04-06T18:00:00Z

1M records

 

 

 

Test Results

 

test #

Data set

duration 

Record source

status

test #

Data set

duration 

Record source

status

1

25K

3 min 50s

SRS

Success 

2

590K

1 hr 14min

SRS

Success

3

1M

2 hr 1min

SRS

Success

4

25K

4 min 32 s

SRS+inventory

Success

5

590K

1 hr 7min

SRS+inventory

Success

6

1M

2 hr 21 min

SRS+inventory

Success

 

Record source SRS resource usage

 

 

 

Record source SRS+inventory resource usage 

Service CPU usage - 25K- 500K instances harvesting

Service CPU usage - 1M instances harvesting

Service Memory usage - 25K- 500K instances harvesting

Service Memory usage - 1M instances harvesting

RDS CPU usage 25K- 500K instances harvesting

RDS CPU usage 1M instances harvesting

 

ERRORS

Below log from mod-oai-pmh containing 504 Gateway Time-out

13:07:43 [] [] [] [mod_oai_pmh] INFO RestRouting invoking getAdminHealth

13:07:43 [] [] [] [mod_oai_pmh] INFO LogUtil 127.0.0.1:35168 GET /admin/health null HTTP_1_1 200 4 0 tid=null OK

13:08:13 [] [] [] [mod_oai_pmh] INFO RestRouting invoking getAdminHealth

13:08:13 [] [] [] [mod_oai_pmh] INFO LogUtil 127.0.0.1:50660 GET /admin/health null HTTP_1_1 200 4 0 tid=null OK

13:08:13 [] [] [] [mod_oai_pmh] INFO RestRouting invoking getAdminHealth

13:08:13 [] [] [] [mod_oai_pmh] INFO LogUtil 127.0.0.1:50704 GET /admin/health null HTTP_1_1 200 4 0 tid=null OK

13:08:43 [] [] [] [mod_oai_pmh] INFO RestRouting invoking getAdminHealth

13:08:43 [] [] [] [mod_oai_pmh] INFO LogUtil 127.0.0.1:51752 GET /admin/health null HTTP_1_1 200 4 0 tid=null OK

13:08:43 [] [] [] [mod_oai_pmh] INFO RestRouting invoking getAdminHealth

13:08:43 [] [] [] [mod_oai_pmh] INFO LogUtil 127.0.0.1:51786 GET /admin/health null HTTP_1_1 200 4 0 tid=null OK

13:08:46 [] [] [] [] ERROR OaiPmhJsonParser Unexpected character ('<' (code 60)): expected a valid value (JSON String, Number, Array, Object or token 'null', 'true' or 'false')

at [Source: UNKNOWN; line: 1, column: 2]

13:08:46 [] [] [] [] ERROR OaiPmhJsonParser Error position at error part of json is 2

13:08:46 [] [] [] [] ERROR OaiPmhJsonParser <html><head><title>504 Gateway Time-out</title></head><body><center><h1>504 Gateway Time-out</h1></cen

13:08:46 [] [] [] [] WARN OaiPmhImpl getOaiRecords:: RequestId b3eca50d-eba2-4bfd-b9a4-bad773caa434 completed with error Got error response from inventory-storage, uri: '/inventory-hierarchy/updated-instance-ids?onlyInstanceUpdateDate=false&endDate=2023-04-06T18:00:01Z&deletedRecordSupport=true&startDate=2023-03-28T18:00:00Z&skipSuppressedFromDiscoveryRecords=true' message: Invalid response: Gateway Time-out null

13:08:46 [] [] [] [] INFO LogUtil null GET /oai/records verb=ListRecords&metadataPrefix=marc21_withholdings&apikey=eyJzIjoiVDNUSzAzR2QyViIsInQiOiJmczA5MDAwMDAwIiwidSI6ImZzMDkwMDAwMDAifQ==&from=2023-03-28T18:00:00Z&until=2023-04-06T18:00:00Z HTTP_1_1 500 295 400029 tid=fs09000000 Internal Server Error

13:09:13 [] [] [] [mod_oai_pmh] INFO RestRouting invoking getAdminHealth

13:09:13 [] [] [] [mod_oai_pmh] INFO LogUtil 127.0.0.1:44342 GET /admin/health null HTTP_1_1 200 4 0 tid=null OK

13:09:13 [] [] [] [mod_oai_pmh] INFO RestRouting invoking getAdminHealth

13:09:13 [] [] [] [mod_oai_pmh] INFO LogUtil 127.0.0.1:44360 GET /admin/health null HTTP_1_1 200 4 0 tid=null OK

13:09:43 [] [] [] [mod_oai_pmh] INFO RestRouting invoking getAdminHealth

13:09:43 [] [] [] [mod_oai_pmh] INFO LogUtil 127.0.0.1:55432 GET /admin/health null HTTP_1_1 200 4 0 tid=null OK

13:09:43 [] [] [] [mod_oai_pmh] INFO RestRouting invoking getAdminHealth

13:09:43 [] [] [] [mod_oai_pmh] INFO LogUtil 127.0.0.1:55472 GET /admin/health null HTTP_1_1 200 4 0 tid=null OK

13:10:13 [] [] [] [mod_oai_pmh] INFO RestRouting invoking getAdminHealth

13:10:13 [] [] [] [mod_oai_pmh] INFO LogUtil 127.0.0.1:52508 GET /admin/health null HTTP_1_1 200 4 0 tid=null OK

13:10:13 [] [] [] [mod_oai_pmh] INFO RestRouting invoking getAdminHealth

13:10:13 [] [] [] [mod_oai_pmh] INFO LogUtil 127.0.0.1:52528 GET /admin/health null HTTP_1_1 200 4 0 tid=null OK

13:10:43 [] [] [] [mod_oai_pmh] INFO RestRouting invoking getAdminHealth

13:10:43 [] [] [] [mod_oai_pmh] INFO LogUtil 127.0.0.1:59294 GET /admin/health null HTTP_1_1 200 4 0 tid=null OK

13:10:43 [] [] [] [mod_oai_pmh] INFO RestRouting invoking getAdminHealth

13:10:43 [] [] [] [mod_oai_pmh] INFO LogUtil 127.0.0.1:59350 GET /admin/health null HTTP_1_1 200 4 0 tid=null OK

13:11:13 [] [] [] [mod_oai_pmh] INFO RestRouting invoking getAdminHealth

13:11:13 [] [] [] [mod_oai_pmh] INFO LogUtil 127.0.0.1:37806 GET /admin/health null HTTP_1_1 200 4 0 tid=null OK

13:11:13 [] [] [] [mod_oai_pmh] INFO RestRouting invoking getAdminHealth

13:11:13 [] [] [] [mod_oai_pmh] INFO LogUtil 127.0.0.1:37818 GET /admin/health null HTTP_1_1 200 4 0 tid=null OK

13:11:43 [] [] [] [mod_oai_pmh] INFO RestRouting invoking getAdminHealth

13:11:43 [] [] [] [mod_oai_pmh] INFO LogUtil 127.0.0.1:40848 GET /admin/health null HTTP_1_1 200 4 0 tid=null OK

13:11:43 [] [] [] [mod_oai_pmh] INFO RestRouting invoking getAdminHealth

13:11:43 [] [] [] [mod_oai_pmh] INFO LogUtil 127.0.0.1:40868 GET /admin/health null HTTP_1_1 200 4 0 tid=null OK

13:12:13 [] [] [] [mod_oai_pmh] INFO RestRouting invoking getAdminHealth

13:12:13 [] [] [] [mod_oai_pmh] INFO LogUtil 127.0.0.1:56864 GET /admin/health null HTTP_1_1 200 4 0 tid=null OK

13:12:13 [] [] [] [mod_oai_pmh] INFO RestRouting invoking getAdminHealth

13:12:13 [] [] [] [mod_oai_pmh] INFO LogUtil 127.0.0.1:56888 GET /admin/health null HTTP_1_1 200 4 0 tid=null OK

13:12:43 [] [] [] [mod_oai_pmh] INFO RestRouting invoking getAdminHealth

13:12:43 [] [] [] [mod_oai_pmh] INFO LogUtil 127.0.0.1:55784 GET /admin/health null HTTP_1_1 200 4 0 tid=null OK