eHoldings (Nolana)

Overview

eHoldings was tested with a Nolana snapshot taken at the very end of the development cycle, leading up to the Nolana Bugfest, so the results can be treated as testing of the official Nolana release. Failover testing and testing of multiple concurrent jobs were performed. This page documents the results of the eHoldings testing.

Infrastructure

PTF environment (ncp1)

  • 9 m6i.2xlarge EC2 instances located in us-west-2. 
  • 2 db.r6.xlarge database instances, one reader and one writer
  • MSK (ptf-kafka-1 cluster)
    • 4 m5.2xlarge brokers in 2 zones
    • auto.create.topics.enable=true
    • log.retention.minutes=480
    • default.replication.factor=3
    • Apache Kafka v2.8.0
    • EBS storage volume per broker = 300GB
    • Kafka topics 

Memory parameters for relevant modules:

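| Module | Version | Max Metaspace Size (MB) | Xmx (MB) | Soft Limit (MB) | Hard Limit (MB) | CPU | Number of ECS Tasks |
| --- | --- | --- | --- | --- | --- | --- | --- |
| mod-agreements | 5.4.0-SNAPSHOT.104 | 512 | 968 | 1488 | 1592 | 128 | 2 |
| mod-notes | 4.0.0-SNAPSHOT.237 | 128 | 322 | 896 | 1024 | 128 | 2 |
| mod-feesfines | 18.2.0-SNAPSHOT.132 | 128 | 768 | 896 | 1024 | 128 | 2 |
| mod-data-export-spring | 1.5.0-SNAPSHOT.58 | 512 | 1536 | 1844 | 2048 | 256 | 1 |
| mod-data-export-worker | 1.5.0-SNAPSHOT.85 | 512 | 2048 | 2600 | 3072 | 1024 | 2 |
| mod-kb-ebsco-java | 3.12.0-SNAPSHOT.61 | 128 | 768 | 896 | 1024 | 128 | 2 |
| okapi | 4.14.4 | 512 | 922 | 1360 | 1512 | 1024 | 3 |
| nginx-okapi | 2022.03.02 | - | - | 896 | 1024 | 128 | 2 |
| pub-okapi | 2022.03.02 | - | - | 896 | 1024 | 128 | 2 |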
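
To make the memory settings easier to compare, the sketch below computes how much room is left under each container's hard limit if the JVM heap (Xmx) and metaspace both fill up. The figures are copied from the table above; the name `headroom_mb` and the idea of "headroom" are ours, not a PTF metric, and other JVM memory (thread stacks, direct buffers, code cache) is deliberately ignored.

```python
# (max metaspace MB, Xmx MB, hard limit MB), copied from the table above.
MODULES = {
    "mod-agreements": (512, 968, 1592),
    "mod-notes": (128, 322, 1024),
    "mod-feesfines": (128, 768, 1024),
    "mod-data-export-spring": (512, 1536, 2048),
    "mod-data-export-worker": (512, 2048, 3072),
    "mod-kb-ebsco-java": (128, 768, 1024),
    "okapi": (512, 922, 1512),
}


def headroom_mb(metaspace, xmx, hard_limit):
    """Memory left under the hard limit if heap and metaspace are both full."""
    return hard_limit - (xmx + metaspace)


for name, (metaspace, xmx, hard_limit) in MODULES.items():
    print(f"{name}: {headroom_mb(metaspace, xmx, hard_limit)} MB")
```

A small or zero result does not by itself indicate a problem, since a module may never reach its configured maximums; it only shows which containers have the least slack on paper.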

High Level Summary

  • With mod-data-export-worker v2.0.1 and v2.0.2:
    • Up to 16 concurrent jobs can run without any issues.
    • Multiple tenants can run concurrent jobs. PTF tested with 3 tenants, each kicking off 5 concurrent jobs.
  • Behavior before mod-data-export-worker v2.0.1:
    • Up to 10 concurrent jobs completed without failures, but took up to 87 minutes overall.
    • 14 concurrent jobs: 10 succeeded, 4 failed; took 4 hours to complete all jobs.
    • With more than 10 concurrent jobs, eHoldings exports show long lulls between spikes of activity (seen as low module CPU utilization). This could contribute to the 10+ hours needed to complete the jobs.
  • Simulating a failed ECS task/container shows that eHoldings has no problem completing the ongoing job.
  • Exporting a package without assigned titles, agreements, and notes is about 2x faster than exporting one with them.

Test Results

| Test # | Jobs | Package | Job Duration | Overall Duration | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | 1 | Wiley UBCM - Engineering (without notes/agreements) | 8 mins | 8 mins | Successful |
| 2 | 1 | Wiley UBCM - Engineering (with notes) | 38 mins | 38 mins | Successful |
| 3 | 1 | Wiley UBCM - Engineering (with notes) | 29 mins | 29 mins | Successful |
| 4 | 2 | Wiley UBCM - Engineering (with notes) | 30 mins; 29 mins | 30 mins | Successful (all) |
| 5 | 4 | Wiley UBCM - Engineering (with notes) | 30 mins (4x); 29 mins (1x) | 30 mins | Successful (all) |
| 6 | 6 | Wiley UBCM - Engineering (with notes) | 29 mins (3x); 28 mins (3x) | 29 mins | Successful (all) |
| 7 | 10 | Wiley UBCM - Engineering (with notes) | 28-30 mins (8x); 32 mins (2x - started 40 minutes later) | 87 mins | Successful (all) |
| 8 | 14 | Wiley UBCM - Engineering (with notes) | 30 mins (3x - started on time) | 4 hours | Successful |
|   |    |   | 29 mins (3x - started 2 hours later) |   | Successful |
|   |    |   | 29 mins (2x - started 30 mins later) |   | Successful |
|   |    |   | 1 min (2x - started 4 hours later) |   | Failed |
|   |    |   | 1 min (1x - started 90 mins later) |   | Failed |
|   |    |   | 6 mins (1x - started 30 mins later) |   | Failed |

In the table above, a figure such as "3x" in the Job Duration column denotes the number of jobs that had the same outcome. In tests 7 and 8, all jobs were kicked off at the same time, but some of them actually started (that is, transitioned from the Scheduled state to the In Progress state) between 40 minutes and 4 hours later. For these two tests, the Status column shows the status of each group of jobs, in the same order as the groups in the Job Duration column. The Overall Duration column reports the total time from the moment all jobs were kicked off until the last job completed, whether successfully or with an error.
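
The relationship between the grouped job durations and the Overall Duration column can be sketched as follows. The `JobGroup` type and the figures in the example are hypothetical and chosen only to illustrate the arithmetic; they are not taken from any of the tests above, where start delays were observed only approximately.

```python
from typing import NamedTuple


class JobGroup(NamedTuple):
    count: int        # the "Nx" figure in the Job Duration column
    start_delay: int  # minutes spent in Scheduled before In Progress
    duration: int     # minutes from In Progress until the job finished


def total_jobs(groups):
    """Number of jobs represented by all groups together."""
    return sum(group.count for group in groups)


def overall_duration(groups):
    """Minutes from the common kickoff until the last job finished."""
    return max(group.start_delay + group.duration for group in groups)


# Hypothetical run: 4 jobs start at once, 2 more start 45 minutes later.
groups = [JobGroup(4, 0, 30), JobGroup(2, 45, 31)]
print(total_jobs(groups), overall_duration(groups))
```

This is why a test can report an overall duration far longer than any single job duration: the delay before a job leaves the Scheduled state counts toward the overall figure but not toward that job's own duration.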

The results show that the performance of each job, whether running alone or concurrently with others, is fairly consistent: with up to 6 concurrent jobs, every job finishes in around 30 minutes. However, as more jobs were added, as in test #7, the export became less stable. Of the 10 concurrent jobs, 8 started immediately while 2 started 40 minutes later. When 14 jobs were scheduled at once, only 3 started immediately, while the rest were delayed by between 30 minutes and 4 hours. There were also 4 failures in the test with 14 concurrent jobs. A couple of the errors came from the HoldingsIQ service, an external, non-FOLIO service that mod-data-export-worker calls, while one came from mod-data-export-worker itself.

Errors

HoldingsIQ service Errors

[404 Not Found] during [GET] to [http://eholdings/providers/58] [KbEbscoClient#getProviderById(String,String)]: [{
  "errors" : [ {
    "title" : "Provider not found"
  } ],
  "jsonapi" : {
    "version" : "1.0"
  }
}] (NotFound)


[504 Gateway Timeout] during [GET] to [http://eholdings/packages/58-2110695/resources?searchfield=title&page=21&count=20&include=accessType] [KbEbscoClient#getResourcesByPackageId(String,Map)]: [{
  "errors" : [ {
    "title" : "Endpoint request timed out"
  } ],
  "jsonapi" : {
    "version" : "1.0"
  }
}] (GatewayTimeout)
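
When many jobs fail, it can help to tally errors of this shape by status code and client method. The following is a minimal sketch that assumes every error line begins with the same "[status reason] during [METHOD] to [url] [client#method]" prefix seen in the two messages above; `ERROR_RE` and `parse_error` are illustrative names, not part of any FOLIO module.

```python
import re

# Matches the bracketed prefix of the error messages shown above.
ERROR_RE = re.compile(
    r"\[(?P<status>\d{3}) (?P<reason>[^\]]+)\] during \[(?P<method>[A-Z]+)\] "
    r"to \[(?P<url>[^\]]+)\] \[(?P<client>[^\]]+)\]"
)


def parse_error(line):
    """Return the parts of an error line as a dict, or None if it does not match."""
    match = ERROR_RE.search(line)
    if match is None:
        return None
    info = match.groupdict()
    info["status"] = int(info["status"])
    return info


sample = (
    "[504 Gateway Timeout] during [GET] to "
    "[http://eholdings/packages/58-2110695/resources"
    "?searchfield=title&page=21&count=20&include=accessType] "
    "[KbEbscoClient#getResourcesByPackageId(String,Map)]: [{"
)
print(parse_error(sample))
```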


mod-data-export-worker Error:

Your proposed upload is smaller than the minimum allowed size (Service: S3, Status Code: 400, Request ID: FHEHBW1CRV4QSSWG, Extended Request ID: /r5bKrM7NTz2PqRFzGLJ4S6xOj79TS2ng0VKekasA3C2Xzrp0uzoGmW4NBNE6+f5wkVdMXja4Z0=) (S3Exception)
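
This message matches the error S3 returns (code EntityTooSmall) when a multipart upload contains a part, other than the last one, that is smaller than 5 MiB. The tests did not establish how mod-data-export-worker assembled the upload that failed, so the sketch below only illustrates the S3 constraint itself; `plan_parts` and `violates_minimum` are hypothetical helpers, not code from the module.

```python
MIN_PART_SIZE = 5 * 1024 * 1024  # S3 minimum for every part except the last


def plan_parts(total_size, part_size=MIN_PART_SIZE):
    """Split total_size bytes into part sizes that satisfy the S3 minimum."""
    if part_size < MIN_PART_SIZE:
        raise ValueError("part_size is below the S3 multipart minimum")
    full_parts, remainder = divmod(total_size, part_size)
    parts = [part_size] * full_parts
    if remainder:
        parts.append(remainder)  # only the final part may be smaller
    return parts


def violates_minimum(parts):
    """True if any part other than the last is below the S3 minimum."""
    return any(size < MIN_PART_SIZE for size in parts[:-1])


print(plan_parts(12 * 1024 * 1024))
print(violates_minimum([1024, 1024]))
```

An export that writes many small chunks as separate parts would trip this check, which is one possible explanation for failed jobs that ran for only a minute; that link is an assumption, not a finding of these tests.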

Modules CPU Utilization

2 Concurrent jobs:

6 Concurrent jobs:

In this graph the 6 jobs started at 17:38 and ended at 18:08. Note the long lull of apparent inactivity after 17:45.

10 Concurrent jobs:


With 10 concurrent jobs, the CPU graph continues to show the pattern of spikes and lulls. Between about 18:10 and 19:40, there are 4 periods when the modules appear to be actively working, while the rest of the time they are idle. This pattern is even more apparent with 14 concurrent jobs:

14 Concurrent Jobs:

Here the CPU activity is spread over 14 hours. Note that the periodic spikes get smaller as time progresses, because more jobs have completed and fewer jobs remain to run.
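
The spike-and-lull pattern described above can be located in exported CPU metrics with a small helper. This is a sketch only: the threshold, the minimum lull length, and the sample series are made-up values, and `find_lulls` is a hypothetical name rather than a tool that was used in these tests.

```python
def find_lulls(samples, threshold=5.0, min_length=3):
    """Return (start, end) index pairs of runs where CPU stays below threshold.

    samples: CPU utilization percentages taken at a fixed interval.
    Runs shorter than min_length samples are ignored.
    """
    lulls = []
    start = None
    for index, value in enumerate(samples):
        if value < threshold:
            if start is None:
                start = index  # a quiet run begins here
        else:
            if start is not None and index - start >= min_length:
                lulls.append((start, index - 1))
            start = None
    # Close a quiet run that lasts until the end of the series.
    if start is not None and len(samples) - start >= min_length:
        lulls.append((start, len(samples) - 1))
    return lulls


# Made-up series: two busy periods separated and followed by quiet periods.
cpu = [60, 55, 2, 1, 1, 2, 70, 65, 1, 1, 1]
print(find_lulls(cpu))
```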

The trends are as follows:

  • More jobs mean higher CPU utilization in mod-agreements and mod-notes. These two modules use the most CPU.
  • okapi (and its variants) and the other modules use much less CPU than mod-agreements and mod-notes.

Modules Memory Utilization

  • Nothing remarkable here: none of the modules shows any sign of memory growth or leaks.

Database CPU Utilization

The database does not use much CPU in any of the tests. Here are the graphs for the tests with up to 10 concurrent jobs:

With 14 concurrent jobs, the graph looks like this:

The CPU spikes correspond to the spikes in the modules CPU utilization graphs. 

Failover Tests

Two failover tests were performed, each while exporting a single package. Both mod-data-export-worker ECS tasks were killed during the export, but the jobs still finished successfully.
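
How the tasks were killed is not recorded here. One way to reproduce such a test is to stop the service's running tasks through the ECS API, as in the sketch below. It assumes an object that behaves like a boto3 ECS client (`list_tasks` and `stop_task`); the function name, the cluster and service names, and the reason string are placeholders.

```python
def stop_service_tasks(ecs, cluster, service, reason="failover test"):
    """Stop every running task of one ECS service and return the task ARNs.

    ecs is expected to behave like boto3.client("ecs"). It is passed in
    so the function can be exercised without AWS access.
    """
    response = ecs.list_tasks(
        cluster=cluster, serviceName=service, desiredStatus="RUNNING"
    )
    task_arns = response["taskArns"]
    for arn in task_arns:
        ecs.stop_task(cluster=cluster, task=arn, reason=reason)
    return task_arns


# Usage against a real cluster (names below are placeholders):
#   import boto3
#   stop_service_tasks(boto3.client("ecs"), "my-cluster", "mod-data-export-worker")
```

After the tasks are stopped, the ECS service scheduler starts replacements to restore the desired task count, which is what allows an in-flight export to be picked up again.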

mod-data-export-worker v2.0.1

Performance improved greatly with mod-data-export-worker v2.0.1: single-job exports that took 29 to 38 minutes on the snapshot finished in 3 to 4 minutes.
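
As a rough indication of the size of the improvement, the speedup for the two single-job tests can be computed from the snapshot and v2.0.1 durations reported in the table on this page (tests 2 and 3). The helper name `speedup` is ours.

```python
# (snapshot minutes, v2.0.1 minutes) for the single-job tests 2 and 3.
RESULTS = {2: (38, 3), 3: (29, 4)}


def speedup(before_minutes, after_minutes):
    """How many times faster the export ran after the upgrade."""
    return before_minutes / after_minutes


for test_id, (before, after) in RESULTS.items():
    print(f"test {test_id}: {speedup(before, after):.1f}x faster")
```

Because the durations are rounded to whole minutes, these ratios are approximate.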

| Test # | Jobs | Package | Job Duration (snapshot) | Overall Duration (snapshot) | Job Duration (v2.0.1) | Overall Duration (v2.0.1) |
| --- | --- | --- | --- | --- | --- | --- |
| 2 | 1 | Wiley UBCM - Engineering (with notes) | 38 mins | 38 mins | 3 mins | 3 mins |
| 3 | 1 | Wiley UBCM - Engineering (with notes) | 29 mins | 29 mins | 4 mins | 4 mins |
| 4 | 2 | Wiley UBCM - Engineering (with notes) | 30 mins; 29 mins | 30 mins | 3 mins; 4 mins | 4 mins |
5