eHoldings (Nolana)

Overview

eHoldings was tested with a Nolana snapshot (which was taken toward the very tail end of the dev cycle, up to the Nolana Bugfest, so it can be considered as an official Nolana release testing). Failover testing and multiple concurrent jobs testing were performed.  The following documents the results of eHoldings testing.

Infrastructure

PTF -environment (ncp1)

  • 9 m6i.2xlarge EC2 instances located in us-west-2. 
  • 2 instances of db.r6.xlarge database instances, one reader and one writer
  • MSK (ptf-kafka-1 cluster)
    • 4 m5.2xlarge brokers in 2 zones
    • auto.create-topics.enable = true
    • log.retention.minutes=480
    • default.replication.factor=3
    • Apache Kafka v2.8.0
    • EBS storage volume per broker = 300GB
    • Kafka topics 

Memory parameters for relevant modules:

Module

Version

Max Metaspace Size (MB)

XmX (MB)

Soft Limit (MB)

Hard Limit (MB)

CPUNumber of ECS Tasks
mod-agreements5.4.0-SNAPSHOT.104512968148815921282
mod-notes4.0.0-SNAPSHOT.23712832289610241282
mod-feesfines18.2.0-SNAPSHOT.13212876889610241282
mod-data-export-spring1.5.0-SNAPSHOT.585121536184420482561
mod-data-export-worker1.5.0-SNAPSHOT.855122048 2600307210242
mod-kb-ebsco-java3.12.0-SNAPSHOT.6112876889610241282
okapi4.14.45129221360151210243
nginx-okapi2022.03.02--89610241282
pub-okapi2022.03.02--89610241282

High Level Summary

  • With mod-data-export-worker v2.0.1 and v2.0.2, 
    • Up to 16 concurrent jobs can be performed without any issues. 
    • Multiple tenants can run concurrent jobs. PTF tested with 3 tenants and each tenant kicks off 5 concurrent jobs.
  • Pre mod-data-export-worker v2.0.1 behavior:
    • Up to 10 concurrent jobs can be performed without issues but it would take up to 87 minutes to complete. 
    • 14 concurrent jobs: 10 succeeded, 4 failed; took 4 hours to complete all jobs.
    • When there are more than 10 concurrent jobs, eHoldings have long lulls (as seen in low modules CPU utilization) in between spikes. This could contribute to the 10+ hours to complete the jobs.
  • Simulation of a failed ECS task/container show that eHoldings has no problems completing the ongoing job.
  • Exporting a package without assigned titles, agreements, and notes is about 2x faster than with them.

Test Results

Test #Jobs PackageJob DurationOverall DurationStatus
11Wiley UBCM - Engineering (without notes/agreements)8 mins8 minsSuccessful
21Wiley UBCM - Engineering (with notes)38 mins38 minsSuccessful
31Wiley UBCM - Engineering (with notes)29 mins29 minsSuccessful
42Wiley UBCM - Engineering (with notes)

30 mins

29mins

30 minsSuccessful (all)
54Wiley UBCM -Engineering (with notes)

30 mins (4x)

29 mins (1x)

30 minsSuccessful (all)
66Wiley UBCM -Engineering (with notes)

29 mins (3x)

28 mins (3x)

29 minsSuccessful (all)
710Wiley UBCM -Engineering (with notes)

28-30 mins (8x)

32 mins (2x - started 40 minutes later)

87 minsSuccessful (all)
814Wiley UBCM -Engineering (with notes)

30 mins ( 3x - started on time)

29 mins (3x - started 2 hours later)

29 mins (2x - 30 mins later)

1 min (2x - started 4 hours later)

1 min (1x - started 90 mins later)

6 min (1x - started 30 mins later)

4 hours

Successful

Successful

Successful

Failed

Failed

Failed

In the table above, the number "3x" in the Job Duration column, for example, denotes the number of jobs that have the same outcome. With tests 7 and 8, there are some jobs although they were all kicked off at the same time they actually started (or transitioned from the Scheduled state to In Progress state) 40 minutes or up to 4 hours later. For these two tests, the status column shows the statuses of each group of jobs, as they are grouped in the Job Duration column.  The Overall Duration column reports the overall time that all jobs took from the very beginning when they were all kicked off until the very last job that completed either in success or error. 

What we're seeing here is the performance for each job, whether by themselves or running concurrently with others, are rather consistent. They all finish around 30 minutes with up to 6 concurrent jobs.  However, as more jobs were added as in test #7, the export's stability decreases. Two out of the 10 concurrent jobs started 40 minutes later while 8 of them started initially.  When 14 jobs were scheduled at once only 3 jobs started on immediately while the rest were delayed by 30 minutes and up to 4 hours later.  There were 4 failures in the 14 concurrent jobs test as well.  A couple of errors were  due to HoldingsIQ service which is an external and non-FOLIO service that mod-data-worker calls, while one is from mod-data-export-worker itself.

Errors

HoldingsIQ service Errors

[404 Not Found] during [GET] to [http://eholdings/providers/58] [KbEbscoClient#getProviderById(String,String)]: [{
  "errors" : [ {
    "title" : "Provider not found"
  } ],
  "jsonapi" : {
    "version" : "1.0"
  }
}] (NotFound)


[504 Gateway Timeout] during [GET] to [http://eholdings/packages/58-2110695/resources?searchfield=title&page=21&count=20&include=accessType] [KbEbscoClient#getResourcesByPackageId(String,Map)]: [{
  "errors" : [ {
    "title" : "Endpoint request timed out"
  } ],
  "jsonapi" : {
    "version" : "1.0"
  }
}] (GatewayTimeout)


mod-data-export-worker Error:

Your proposed upload is smaller than the minimum allowed size (Service: S3, Status Code: 400, Request ID: FHEHBW1CRV4QSSWG, Extended Request ID: /r5bKrM7NTz2PqRFzGLJ4S6xOj79TS2ng0VKekasA3C2Xzrp0uzoGmW4NBNE6+f5wkVdMXja4Z0=) (S3Exception)
Modules CPU Utilization 

2 Concurrent jobs:

6 Concurrent jobs

In this graph the 6 jobs started at 17:38 and ended at 18:08. Note the large lull of seemingly inactivity after 17:45. 

10 Concurrent jobs:


With 10 concurrent jobs the CPU graphs continues to show the interesting pattern of spikes and lulls. Starting around 18:10 to around 19:40, there are 4 periods when the modules seem to be actively working while the rest of the time they didn't do anything. This pattern is even more apparent when there are 14 concurrent jobs:

14 Concurrent Jobs:

Here the CPU activities are spread throughout 14 hours. Note the periodic spikes getting smaller as more jobs get completed and less jobs need to run as time progressed.

The trends are as follows:

  • More jobs = higher CPU utilization from mod-agreements, mod-notes. These two modules use most CPU. 
  • okapi (and its variants) and other modules use much less CPU compared to mod-agreements and mod-notes
Modules Memory Utilization
  • Nothing remarkable here, as all modules don't exhibit any signs of memory growth or leaks.
Database CPU Utilization

The database does not use much CPU in all tests. Here are the graphs of tests up to 10 concurrent jobs

With 14 concurrent jobs there graph looks like below:

The CPU spikes correspond to the spikes in the modules CPU utilization graphs. 

Failover Tests

Two failover tests were performed when exporting a single package each.  Both mod-data-export-worker ECS tasks were killed but the jobs still finished successfully.

mod-data-export-worker v2.0.1

Performance improved greatly with mod-data-export-worker v2.0.1

Test #Jobs PackageJob Duration (snapshot)Overall Duration (snapshot)Job Duration (v2.0.1)Overall Duration (v2.0.1)
21Wiley UBCM - Engineering (with notes)38 mins38 mins3 mins3 mins
31Wiley UBCM - Engineering (with notes)29 mins29 mins4 mins4 mins
42Wiley UBCM - Engineering (with notes)

30 mins

29mins

30 mins

3 mins

4 mins

4 mins
54Wiley UBCM -Engineering (with notes)

30 mins (4x)

29 mins (1x)

30 mins

2 mins (1x)

3 mins (3x)

3 mins
66Wiley UBCM -Engineering (with notes)

29 mins (3x)

28 mins (3x)

29 mins

2 mins (1x - started 1 min later)

4 mins (5x)

4 mins
710Wiley UBCM -Engineering (with notes)

28-30 mins (8x)

32 mins (2x - started 40 minutes later)

87 mins

3 mins (All )


7 mins
814Wiley UBCM -Engineering (with notes)

30 mins ( 3x - started on time)

29 mins (3x - started 2 hours later)

29 mins (2x - 30 mins later)

1 min (2x - started 4 hours later)

1 min (1x - started 90 mins later)

6 min (1x - started 30 mins later)

4 hours

3 mins (1x - started within 1 min of each other)

3 mins (8x - started 3 mins later)

4 mins (6x - started within 1 min of each other)

7 mins
916Wiley UBCM -Engineering (with notes)Did not runDid not run

3-4 mins (12x)

3 mins (4x - 2 started 1 min after, 2 started 4 mins later)

7 mins


Service CPU utilization of the tests with mod-data-export-worker v2.0.1.

Note that mod-agreements and mod-notes lead the spikes in all tests.  With 4 conc. jobs mod-agreements reached over 100%, mod-notes 50%,  6 concurrent jobs mod-agreements 140%, mod-notes 70%, 16 concurrent jobs mod-agreements over 300% and mod-notes about 180%. This is expected as the package has 15 notes and 5 agreements. 

Memory of all the relevant modules were fairly stable in all the tests for the interval shown. mod-data-export-worker has a little bump each time a new test is run. Whether or not the trend continues at least with eHoldings testing is unknown (more tests would need to be run.) However, in other tests involving mod-data-export-worker, no memory leaks were observed. 


Database CPU utilization was unremarkable as well:


mod-data-export-worker v2.0.2 & Multi-tenants Exporting

eHoldings concurrent exports was tested with mod-data-export-worker v2.0.2.  This time the load is distributed equally among 2-3 participating tenants on the same cluster. These tests were done to see if eHoldings exporting can be done concurrently in a multi-tenants environment. The answer is yes. Here are the results:

There are 3 tenants on the cluster: Tenant 00, 01, and 02. 

The table below shows the results of 6 tests, each test add more concurrent jobs.  In test 1, for example, there are two tenants, 01 and 02 exporting.  The jobs were kicked off at the same time, as indicated by the timestamp in the Launched Time column. The job may be kicked off but they could be scheduled waiting to get started. When they start exporting, that is when the timestamp in the "Start Time" column was recorded.  End Time is self-explanatory. All jobs were successful hence the "S".  Duration is for the overall duration of each individual job. Wait Time is the time the job had to wait, or be in the Scheduled state, before it is transitioned to In Progress. Concurrency is the number of jobs that are running at the same time as the job we're looking at. 

As seen, all the jobs were passing successfully. Some got started later than the other.  Most of the time a bunch of jobs got started first at the same time, then after that each remaining job takes turn to get started.  This is evident in tests 3-6.  Another thing to note is the duration of each job is consistent no matter the number of concurrent jobs being thrown at it.  


Test 

Tenant

Job ID

Launched Time

Start Time (UTC)

End Time (UTC)

Final Status

Duration

Wait Time

Concurrency

1

01

000046

9:35pm

9:35pm

9:38pm

S

3

0

1


02

000037

9:35pm

9:39pm

9:41pm

S

2

4

1











2

01

000047

9:43pm

9:43pm

9:46pm

S

3

0

1


01

000048

9:43pm

9:53pm

9:56pm

S

3

10

1


02

000038

9:43pm

9:46pm

9:50pm

S

4

3

1


02

000039

9:43pm

9:50pm

9:53pm

S

3

7

1











3

00

006034

10:08pm

10:08pm

10:11pm

S

3

0

3


00

006035

10:08pm

10:08pm

10:11pm

S

3

0

3


01

000049

10:08pm

10:08pm

10:11pm

S

3

0

3


01

000050

10:08pm

10:18pm

10:21pm

S

3

10

1


02

000040

10:08pm

10:11pm

10:14pm

S

3

3

1


02

000041

10:08pm

10:14pm

10:18pm

S

4

6

1











4

00

006036

10:28pm

10:28pm

10:31pm

S

3

0

4


00

006037

10:28pm

10:28pm

10:31pm

S

3

0

4


00

006038

10:28pm

10:28pm

10:31pm

S

3

0

4


01

000051

10:28pm

10:28pm

10:31pm

S

3

0

4


01

000052

10:28pm

10:41pm

10:44pm

S

3

13

1


01

000053

10:28pm

10:44pm

10:48pm

S

4

16

1


02

000042

10:28pm

10:31pm

10:34pm

S

3

3

1


02

000043

10:28pm

10:34pm

10:38pm

S

4

6

1


02

000044

10:28pm

10:38pm

10:41pm

S

3

10

1











5

00

006039

11:00pm

11:00pm

11:03pm

S

3

0

5


00

006040

11:00pm

11:00pm

11:03pm

S

3

0

5


00

006041

11:00pm

11:00pm

11:03pm

S

3

0

5


00

006042

11:00pm

11:00pm

11:03pm

S

3

0

5


01

000054

11:00pm

11:13pm

11:16pm

S

3

13

1


01

000055

11:00pm

11:16pm

11:19pm

S

3

16

1


01

000056

11:00pm

11:19pm

11:22pm

S

3

19

1


01

000057

11:00pm

11:22pm

11:26pm

S

4

22

1


02

000045

11:00pm

11:00pm

11:03pm

S

3

0

5


02

000046

11:00pm

11:03pm

11:06pm

S

3

3

1


02

000047

11:00pm

11:06pm

11:09pm

S

3

6

1


02

000048

11:00pm

11:09pm

11:13pm

S

4

9

1











6

00

006043

11:31pm

11:31pm

11:34pm

S

3

0

6


00

006044

11:31pm

11:31pm

11:34pm

S

3

0

6


00

006045

11:31pm

11:31pm

11:34pm

S

3

0

6


00

006046

11:31pm

11:31pm

11:34pm

S

3

0

6


00

006047

11:31pm

11:31pm

11:34pm

S

3

0

6


01

000058

11:30pm

11:30pm

11:34pm

S

4

0

6


01

000059

11:31pm

11:50pm

11:53pm

S

3

19

1


01

000060

11:31pm

11:53pm

11:56pm

S

3

22

1


01

000061

11:31pm

11:56pm

12:00am

S

4

25

1


01

000062

11:31pm

12:00am

12:03am

S

3

29

1


02

000049

11:31pm

11:34pm

11:37pm

S

3

3

1


02

000050

11:31pm

11:37pm

11:40pm

S

3

6

1


02

000051

11:31pm

11:40pm

11:43pm

S

3

9

1


02

000052

11:31pm

11:43pm

11:46pm

S

3

12

1


02

000053

11:31pm

11:46pm

11:50pm

S

4

15

1


Service CPU utilization of these 6 tests:

Service memory utilization:


Only mod-data-export-worker exhibits a growth in memory, from 46% to 52% after running 6 tests.  This might warrant an investigation.


Database CPU utilization of the tests: