Overview

eHoldings was tested with a Nolana snapshot (which was taken toward the very tail end of the dev cycle, up to the Nolana Bugfest, so it can be considered as an official Nolana release testing). Failover testing and multiple concurrent jobs testing were performed. The following documents the results of eHoldings testing.

Infrastructure

PTF -environment (ncp1)

9 m6i.2xlarge EC2 instances located in us-west-2.
2 instances of db.r6.xlarge database instances, one reader and one writer
MSK (ptf-kafka-1 cluster)
- 4 m5.2xlarge brokers in 2 zones
- auto.create-topics.enable = true
- log.retention.minutes=480
- default.replication.factor=3
- Apache Kafka v2.8.0
- EBS storage volume per broker = 300GB
- Kafka topics
  - .data-export.job.command - 50 partitions
  - data-export.job.update - 50 partitions

Memory parameters for relevant modules:

Module	Version	Max Metaspace Size (MB)	XmX (MB)	Soft Limit (MB)	Hard Limit (MB)	CPU	Number of ECS Tasks
mod-agreements	5.4.0-SNAPSHOT.104	512	968	1488	1592	128	2
mod-notes	4.0.0-SNAPSHOT.237	128	322	896	1024	128	2
mod-feesfines	18.2.0-SNAPSHOT.132	128	768	896	1024	128	2
mod-data-export-spring	1.5.0-SNAPSHOT.58	512	1536	1844	2048	256	1
mod-data-export-worker	1.5.0-SNAPSHOT.85	512	2048	2600	3072	1024	2
mod-kb-ebsco-java	3.12.0-SNAPSHOT.61	128	768	896	1024	128	2
okapi	4.14.4	512	922	1360	1512	1024	3
nginx-okapi	2022.03.02	-	-	896	1024	128	2
pub-okapi	2022.03.02	-	-	896	1024	128	2

High Level Summary

With mod-data-export-worker v2.0.1 and v2.0.2,
- Up to 16 concurrent jobs can be performed without any issues.
- Multiple tenants can run concurrent jobs. PTF tested with 3 tenants and each tenant kicks off 5 concurrent jobs.
Pre mod-data-export-worker v2.0.1 behavior:
- Up to 10 concurrent jobs can be performed without issues but it would take up to 87 minutes to complete.
- 14 concurrent jobs: 10 succeeded, 4 failed; took 4 hours to complete all jobs.
- When there are more than 10 concurrent jobs, eHoldings have long lulls (as seen in low modules CPU utilization) in between spikes. This could contribute to the 10+ hours to complete the jobs.
Simulation of a failed ECS task/container show that eHoldings has no problems completing the ongoing job.
Exporting a package without assigned titles, agreements, and notes is about 2x faster than with them.

Test Results

Test #	Jobs	Package	Job Duration	Overall Duration	Status
1	1	Wiley UBCM - Engineering (without notes/agreements)	8 mins	8 mins	Successful
2	1	Wiley UBCM - Engineering (with notes)	38 mins	38 mins	Successful
3	1	Wiley UBCM - Engineering (with notes)	29 mins	29 mins	Successful
4	2	Wiley UBCM - Engineering (with notes)	30 mins 29mins	30 mins	Successful (all)
5	4	Wiley UBCM -Engineering (with notes)	30 mins (4x) 29 mins (1x)	30 mins	Successful (all)
6	6	Wiley UBCM -Engineering (with notes)	29 mins (3x) 28 mins (3x)	29 mins	Successful (all)
7	10	Wiley UBCM -Engineering (with notes)	28-30 mins (8x) 32 mins (2x - started 40 minutes later)	87 mins	Successful (all)
8	14	Wiley UBCM -Engineering (with notes)	30 mins ( 3x - started on time) 29 mins (3x - started 2 hours later) 29 mins (2x - 30 mins later) 1 min (2x - started 4 hours later) 1 min (1x - started 90 mins later) 6 min (1x - started 30 mins later)	4 hours	Successful Successful Successful Failed Failed Failed

In the table above, the number "3x" in the Job Duration column, for example, denotes the number of jobs that have the same outcome. With tests 7 and 8, there are some jobs although they were all kicked off at the same time they actually started (or transitioned from the Scheduled state to In Progress state) 40 minutes or up to 4 hours later. For these two tests, the status column shows the statuses of each group of jobs, as they are grouped in the Job Duration column. The Overall Duration column reports the overall time that all jobs took from the very beginning when they were all kicked off until the very last job that completed either in success or error.

What we're seeing here is the performance for each job, whether by themselves or running concurrently with others, are rather consistent. They all finish around 30 minutes with up to 6 concurrent jobs. However, as more jobs were added as in test #7, the export's stability decreases. Two out of the 10 concurrent jobs started 40 minutes later while 8 of them started initially. When 14 jobs were scheduled at once only 3 jobs started on immediately while the rest were delayed by 30 minutes and up to 4 hours later. There were 4 failures in the 14 concurrent jobs test as well. A couple of errors were due to HoldingsIQ service which is an external and non-FOLIO service that mod-data-worker calls, while one is from mod-data-export-worker itself.

Errors

HoldingsIQ service Errors

[404 Not Found] during [GET] to [http://eholdings/providers/58] [KbEbscoClient#getProviderById(String,String)]: [{
  "errors" : [ {
    "title" : "Provider not found"
  } ],
  "jsonapi" : {
    "version" : "1.0"
  }
}] (NotFound)

[504 Gateway Timeout] during [GET] to [http://eholdings/packages/58-2110695/resources?searchfield=title&page=21&count=20&include=accessType] [KbEbscoClient#getResourcesByPackageId(String,Map)]: [{
  "errors" : [ {
    "title" : "Endpoint request timed out"
  } ],
  "jsonapi" : {
    "version" : "1.0"
  }
}] (GatewayTimeout)

mod-data-export-worker Error:

Your proposed upload is smaller than the minimum allowed size (Service: S3, Status Code: 400, Request ID: FHEHBW1CRV4QSSWG, Extended Request ID: /r5bKrM7NTz2PqRFzGLJ4S6xOj79TS2ng0VKekasA3C2Xzrp0uzoGmW4NBNE6+f5wkVdMXja4Z0=) (S3Exception)

Modules CPU Utilization

2 Concurrent jobs:

6 Concurrent jobs

In this graph the 6 jobs started at 17:38 and ended at 18:08. Note the large lull of seemingly inactivity after 17:45.

10 Concurrent jobs:

With 10 concurrent jobs the CPU graphs continues to show the interesting pattern of spikes and lulls. Starting around 18:10 to around 19:40, there are 4 periods when the modules seem to be actively working while the rest of the time they didn't do anything. This pattern is even more apparent when there are 14 concurrent jobs:

14 Concurrent Jobs:

Here the CPU activities are spread throughout 14 hours. Note the periodic spikes getting smaller as more jobs get completed and less jobs need to run as time progressed.

The trends are as follows:

More jobs = higher CPU utilization from mod-agreements, mod-notes. These two modules use most CPU.
okapi (and its variants) and other modules use much less CPU compared to mod-agreements and mod-notes

Modules Memory Utilization

Nothing remarkable here, as all modules don't exhibit any signs of memory growth or leaks.

Database CPU Utilization

The database does not use much CPU in all tests. Here are the graphs of tests up to 10 concurrent jobs

With 14 concurrent jobs there graph looks like below:

The CPU spikes correspond to the spikes in the modules CPU utilization graphs.

Failover Tests

Two failover tests were performed when exporting a single package each. Both mod-data-export-worker ECS tasks were killed but the jobs still finished successfully.

mod-data-export-worker v2.0.1

Performance improved greatly with mod-data-export-worker v2.0.1

Test #	Jobs	Package	Job Duration (snapshot)	Overall Duration (snapshot)	Job Duration (v2.0.1)	Overall Duration (v2.0.1)
2	1	Wiley UBCM - Engineering (with notes)	38 mins	38 mins	3 mins	3 mins
3	1	Wiley UBCM - Engineering (with notes)	29 mins	29 mins	4 mins	4 mins
4	2	Wiley UBCM - Engineering (with notes)	30 mins 29mins	30 mins	3 mins 4 mins	4 mins
5	4	Wiley UBCM -Engineering (with notes)	30 mins (4x) 29 mins (1x)	30 mins	2 mins (1x) 3 mins (3x)	3 mins
6	6	Wiley UBCM -Engineering (with notes)	29 mins (3x) 28 mins (3x)	29 mins	2 mins (1x - started 1 min later) 4 mins (5x)	4 mins
7	10	Wiley UBCM -Engineering (with notes)	28-30 mins (8x) 32 mins (2x - started 40 minutes later)	87 mins	3 mins (All )	7 mins
8	14	Wiley UBCM -Engineering (with notes)	30 mins ( 3x - started on time) 29 mins (3x - started 2 hours later) 29 mins (2x - 30 mins later) 1 min (2x - started 4 hours later) 1 min (1x - started 90 mins later) 6 min (1x - started 30 mins later)	4 hours	3 mins (1x - started within 1 min of each other) 3 mins (8x - started 3 mins later) 4 mins (6x - started within 1 min of each other)	7 mins
9	16	Wiley UBCM -Engineering (with notes)	Did not run	Did not run	3-4 mins (12x) 3 mins (4x - 2 started 1 min after, 2 started 4 mins later)	7 mins

Service CPU utilization of the tests with mod-data-export-worker v2.0.1.

Note that mod-agreements and mod-notes lead the spikes in all tests. With 4 conc. jobs mod-agreements reached over 100%, mod-notes 50%, 6 concurrent jobs mod-agreements 140%, mod-notes 70%, 16 concurrent jobs mod-agreements over 300% and mod-notes about 180%. This is expected as the package has 15 notes and 5 agreements.

Memory of all the relevant modules were fairly stable in all the tests for the interval shown. mod-data-export-worker has a little bump each time a new test is run. Whether or not the trend continues at least with eHoldings testing is unknown (more tests would need to be run.) However, in other tests involving mod-data-export-worker, no memory leaks were observed.

Database CPU utilization was unremarkable as well:

mod-data-export-worker v2.0.2 & Multi-tenants Exporting

eHoldings concurrent exports was tested with mod-data-export-worker v2.0.2. This time the load is distributed equally among 2-3 participating tenants on the same cluster. These tests were done to see if eHoldings exporting can be done concurrently in a multi-tenants environment. The answer is yes. Here are the results:

There are 3 tenants on the cluster: Tenant 00, 01, and 02.

The table below shows the results of 6 tests, each test add more concurrent jobs. In test 1, for example, there are two tenants, 01 and 02 exporting. The jobs were kicked off at the same time, as indicated by the timestamp in the Launched Time column. The job may be kicked off but they could be scheduled waiting to get started. When they start exporting, that is when the timestamp in the "Start Time" column was recorded. End Time is self-explanatory. All jobs were successful hence the "S". Duration is for the overall duration of each individual job. Wait Time is the time the job had to wait, or be in the Scheduled state, before it is transitioned to In Progress. Concurrency is the number of jobs that are running at the same time as the job we're looking at.

As seen, all the jobs were passing successfully. Some got started later than the other. Most of the time a bunch of jobs got started first at the same time, then after that each remaining job takes turn to get started. This is evident in tests 3-6. Another thing to note is the duration of each job is consistent no matter the number of concurrent jobs being thrown at it.

Test	Tenant	Job ID	Launched Time	Start Time (UTC)	End Time (UTC)	Final Status	Duration	Wait Time	Concurrency
1	01	000046	9:35pm	9:35pm	9:38pm	S	3	0	1
	02	000037	9:35pm	9:39pm	9:41pm	S	2	4	1

2	01	000047	9:43pm	9:43pm	9:46pm	S	3	0	1
	01	000048	9:43pm	9:53pm	9:56pm	S	3	10	1
	02	000038	9:43pm	9:46pm	9:50pm	S	4	3	1
	02	000039	9:43pm	9:50pm	9:53pm	S	3	7	1

3	00	006034	10:08pm	10:08pm	10:11pm	S	3	0	3
	00	006035	10:08pm	10:08pm	10:11pm	S	3	0	3
	01	000049	10:08pm	10:08pm	10:11pm	S	3	0	3
	01	000050	10:08pm	10:18pm	10:21pm	S	3	10	1
	02	000040	10:08pm	10:11pm	10:14pm	S	3	3	1
	02	000041	10:08pm	10:14pm	10:18pm	S	4	6	1

4	00	006036	10:28pm	10:28pm	10:31pm	S	3	0	4
	00	006037	10:28pm	10:28pm	10:31pm	S	3	0	4
	00	006038	10:28pm	10:28pm	10:31pm	S	3	0	4
	01	000051	10:28pm	10:28pm	10:31pm	S	3	0	4
	01	000052	10:28pm	10:41pm	10:44pm	S	3	13	1
	01	000053	10:28pm	10:44pm	10:48pm	S	4	16	1
	02	000042	10:28pm	10:31pm	10:34pm	S	3	3	1
	02	000043	10:28pm	10:34pm	10:38pm	S	4	6	1
	02	000044	10:28pm	10:38pm	10:41pm	S	3	10	1

5	00	006039	11:00pm	11:00pm	11:03pm	S	3	0	5
	00	006040	11:00pm	11:00pm	11:03pm	S	3	0	5
	00	006041	11:00pm	11:00pm	11:03pm	S	3	0	5
	00	006042	11:00pm	11:00pm	11:03pm	S	3	0	5
	01	000054	11:00pm	11:13pm	11:16pm	S	3	13	1
	01	000055	11:00pm	11:16pm	11:19pm	S	3	16	1
	01	000056	11:00pm	11:19pm	11:22pm	S	3	19	1
	01	000057	11:00pm	11:22pm	11:26pm	S	4	22	1
	02	000045	11:00pm	11:00pm	11:03pm	S	3	0	5
	02	000046	11:00pm	11:03pm	11:06pm	S	3	3	1
	02	000047	11:00pm	11:06pm	11:09pm	S	3	6	1
	02	000048	11:00pm	11:09pm	11:13pm	S	4	9	1

6	00	006043	11:31pm	11:31pm	11:34pm	S	3	0	6
	00	006044	11:31pm	11:31pm	11:34pm	S	3	0	6
	00	006045	11:31pm	11:31pm	11:34pm	S	3	0	6
	00	006046	11:31pm	11:31pm	11:34pm	S	3	0	6
	00	006047	11:31pm	11:31pm	11:34pm	S	3	0	6
	01	000058	11:30pm	11:30pm	11:34pm	S	4	0	6
	01	000059	11:31pm	11:50pm	11:53pm	S	3	19	1
	01	000060	11:31pm	11:53pm	11:56pm	S	3	22	1
	01	000061	11:31pm	11:56pm	12:00am	S	4	25	1
	01	000062	11:31pm	12:00am	12:03am	S	3	29	1
	02	000049	11:31pm	11:34pm	11:37pm	S	3	3	1
	02	000050	11:31pm	11:37pm	11:40pm	S	3	6	1
	02	000051	11:31pm	11:40pm	11:43pm	S	3	9	1
	02	000052	11:31pm	11:43pm	11:46pm	S	3	12	1
	02	000053	11:31pm	11:46pm	11:50pm	S	4	15	1

Service CPU utilization of these 6 tests:

Service memory utilization:

Only mod-data-export-worker exhibits a growth in memory, from 46% to 52% after running 6 tests. This might warrant an investigation.

Database CPU utilization of the tests:

Browser not supported