eHoldings (Nolana)
Overview
eHoldings was tested against a Nolana snapshot taken at the very end of the development cycle (up to the Nolana Bugfest), so the results can be considered representative of official Nolana release testing. Failover testing and multiple-concurrent-jobs testing were performed. The following documents the results of eHoldings testing.
Infrastructure
PTF environment (ncp1)
- 9 m6i.2xlarge EC2 instances located in us-west-2.
- 2 db.r6.xlarge database instances, one reader and one writer
- MSK (ptf-kafka-1 cluster)
- 4 m5.2xlarge brokers in 2 zones
- auto.create.topics.enable = true
- log.retention.minutes=480
- default.replication.factor=3
- Apache Kafka v2.8.0
- EBS storage volume per broker = 300GB
- Kafka topics
- data-export.job.command - 50 partitions
- data-export.job.update - 50 partitions
Memory parameters for relevant modules:
Module | Version | Max Metaspace Size (MB) | Xmx (MB) | Soft Limit (MB) | Hard Limit (MB) | CPU | Number of ECS Tasks |
---|---|---|---|---|---|---|---|
mod-agreements | 5.4.0-SNAPSHOT.104 | 512 | 968 | 1488 | 1592 | 128 | 2 |
mod-notes | 4.0.0-SNAPSHOT.237 | 128 | 322 | 896 | 1024 | 128 | 2 |
mod-feesfines | 18.2.0-SNAPSHOT.132 | 128 | 768 | 896 | 1024 | 128 | 2 |
mod-data-export-spring | 1.5.0-SNAPSHOT.58 | 512 | 1536 | 1844 | 2048 | 256 | 1 |
mod-data-export-worker | 1.5.0-SNAPSHOT.85 | 512 | 2048 | 2600 | 3072 | 1024 | 2 |
mod-kb-ebsco-java | 3.12.0-SNAPSHOT.61 | 128 | 768 | 896 | 1024 | 128 | 2 |
okapi | 4.14.4 | 512 | 922 | 1360 | 1512 | 1024 | 3 |
nginx-okapi | 2022.03.02 | - | - | 896 | 1024 | 128 | 2 |
pub-okapi | 2022.03.02 | - | - | 896 | 1024 | 128 | 2 |
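One way to read the memory columns above: the max heap (Xmx) plus max metaspace should fit under each container's soft memory limit, with whatever remains left for native memory, thread stacks, and buffers. A small sanity-check sketch (the values are copied from the table; the check itself is our addition, not part of the PTF tooling):

```python
# Per-module (max metaspace MB, Xmx MB, soft limit MB) from the table above.
MODULES = {
    "mod-agreements":         (512,  968, 1488),
    "mod-notes":              (128,  322,  896),
    "mod-feesfines":          (128,  768,  896),
    "mod-data-export-spring": (512, 1536, 1844),
    "mod-data-export-worker": (512, 2048, 2600),
    "mod-kb-ebsco-java":      (128,  768,  896),
    "okapi":                  (512,  922, 1360),
}

def headroom_mb(metaspace: int, xmx: int, soft_limit: int) -> int:
    """MB left under the soft limit after max heap and max metaspace.

    Native memory (threads, code cache, direct buffers) is extra, so this
    is an upper bound on the real headroom, not a guarantee.
    """
    return soft_limit - (xmx + metaspace)

for name, (meta, xmx, soft) in MODULES.items():
    print(f"{name}: {headroom_mb(meta, xmx, soft)} MB headroom")
```

For example, mod-agreements has only 8 MB of headroom under its soft limit by this measure, while mod-data-export-worker has 40 MB.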
High Level Summary
- With mod-data-export-worker v2.0.1 and v2.0.2:
- Up to 16 concurrent jobs can be performed without any issues.
- Multiple tenants can run concurrent jobs; PTF tested with 3 tenants, each kicking off 5 concurrent jobs.
- Pre-v2.0.1 mod-data-export-worker behavior:
- Up to 10 concurrent jobs could be performed without failures, but it took up to 87 minutes to complete them all.
- 14 concurrent jobs: 10 succeeded, 4 failed; it took 4 hours to complete all jobs.
- With more than 10 concurrent jobs, eHoldings has long lulls (visible as low module CPU utilization) between spikes, which could contribute to the 10+ hours needed to complete the jobs.
- Simulating a failed ECS task/container showed that eHoldings has no problem completing the ongoing job.
- Exporting a package without assigned titles, agreements, and notes is about 2x faster than exporting one with them.
Test Results
Test # | Jobs | Package | Job Duration | Overall Duration | Status |
---|---|---|---|---|---|
1 | 1 | Wiley UBCM - Engineering (without notes/agreements) | 8 mins | 8 mins | Successful |
2 | 1 | Wiley UBCM - Engineering (with notes) | 38 mins | 38 mins | Successful |
3 | 1 | Wiley UBCM - Engineering (with notes) | 29 mins | 29 mins | Successful |
4 | 2 | Wiley UBCM - Engineering (with notes) | 30 mins; 29 mins | 30 mins | Successful (all) |
5 | 4 | Wiley UBCM - Engineering (with notes) | 30 mins (4x); 29 mins (1x) | 30 mins | Successful (all) |
6 | 6 | Wiley UBCM - Engineering (with notes) | 29 mins (3x); 28 mins (3x) | 29 mins | Successful (all) |
7 | 10 | Wiley UBCM - Engineering (with notes) | 28-30 mins (8x); 32 mins (2x, started 40 mins later) | 87 mins | Successful (all) |
8 | 14 | Wiley UBCM - Engineering (with notes) | 30 mins (3x, started on time); 29 mins (3x, started 2 hours later); 29 mins (2x, started 30 mins later); 1 min (2x, started 4 hours later); 1 min (1x, started 90 mins later); 6 mins (1x, started 30 mins later) | 4 hours | Successful; Successful; Successful; Failed; Failed; Failed |
In the table above, a notation such as "3x" in the Job Duration column denotes the number of jobs with the same outcome. In tests 7 and 8, some jobs, although kicked off at the same time as the others, actually started (i.e., transitioned from the Scheduled state to the In Progress state) 40 minutes to 4 hours later. For these two tests, the Status column shows the status of each group of jobs as grouped in the Job Duration column. The Overall Duration column reports the total time from when all jobs were kicked off until the last job completed, whether in success or error.
What we're seeing here is that the performance of each job, whether running alone or concurrently with others, is rather consistent: all jobs finish in around 30 minutes with up to 6 concurrent jobs. However, as more jobs were added in test #7, the export's stability decreased: two of the 10 concurrent jobs started 40 minutes late, while the other 8 started immediately. When 14 jobs were scheduled at once, only 3 started immediately; the rest were delayed by 30 minutes to 4 hours, and 4 of the 14 jobs failed. A couple of the errors came from the HoldingsIQ service, an external, non-FOLIO service that mod-data-export-worker calls, while one came from mod-data-export-worker itself.
Errors
HoldingsIQ service Errors
[404 Not Found] during [GET] to [http://eholdings/providers/58] [KbEbscoClient#getProviderById(String,String)]: [{ "errors" : [ { "title" : "Provider not found" } ], "jsonapi" : { "version" : "1.0" } }] (NotFound)
[504 Gateway Timeout] during [GET] to [http://eholdings/packages/58-2110695/resources?searchfield=title&page=21&count=20&include=accessType] [KbEbscoClient#getResourcesByPackageId(String,Map)]: [{ "errors" : [ { "title" : "Endpoint request timed out" } ], "jsonapi" : { "version" : "1.0" } }] (GatewayTimeout)
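The 504 above is a transient timeout from the external HoldingsIQ gateway. A possible client-side mitigation, sketched below in Python (hypothetical helper names, not the module's actual code), is to retry idempotent GETs with exponential backoff:

```python
import time

RETRYABLE = {502, 503, 504}  # transient gateway/service errors worth retrying

def get_with_retry(fetch, max_attempts=4, base_delay=1.0):
    """Call fetch() -- an idempotent GET returning (status, body) -- and
    retry transient gateway errors with exponential backoff."""
    for attempt in range(max_attempts):
        status, body = fetch()
        if status not in RETRYABLE:
            return status, body
        if attempt < max_attempts - 1:
            time.sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...
    return status, body
```

Note that a permanent error such as the 404 "Provider not found" above is returned immediately: retrying it would only add load to HoldingsIQ without changing the outcome.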
mod-data-export-worker Error:
Your proposed upload is smaller than the minimum allowed size (Service: S3, Status Code: 400, Request ID: FHEHBW1CRV4QSSWG, Extended Request ID: /r5bKrM7NTz2PqRFzGLJ4S6xOj79TS2ng0VKekasA3C2Xzrp0uzoGmW4NBNE6+f5wkVdMXja4Z0=) (S3Exception)
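This S3 error typically appears when a multipart upload sends a non-final part smaller than S3's 5 MiB minimum part size. A minimal sketch of the buffering rule that avoids it (our illustration, not mod-data-export-worker's actual upload code) is to emit only parts of at least 5 MiB and let only the final part be a different size:

```python
MIN_PART_SIZE = 5 * 1024 * 1024  # S3 minimum size for non-final multipart parts

def split_into_parts(data: bytes, part_size: int = MIN_PART_SIZE):
    """Split a payload into multipart-upload parts where every part except
    the last is at least MIN_PART_SIZE, so S3 will not reject the upload."""
    parts = []
    offset = 0
    while offset < len(data):
        end = offset + part_size
        # If the remainder after this part would be under the minimum,
        # fold it into the final part instead of emitting an undersized one.
        if len(data) - end < part_size:
            end = len(data)
        parts.append(data[offset:end])
        offset = end
    return parts
```

With this rule, a 12 MiB payload becomes one 5 MiB part plus one 7 MiB final part, rather than a trailing 2 MiB part that S3 would reject mid-upload.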
Modules CPU Utilization
2 Concurrent jobs:
6 Concurrent jobs
In this graph the 6 jobs started at 17:38 and ended at 18:08. Note the large lull of seeming inactivity after 17:45.
10 Concurrent jobs:
With 10 concurrent jobs the CPU graph continues to show this interesting pattern of spikes and lulls. From around 18:10 to around 19:40 there are 4 periods when the modules appear to be actively working, while the rest of the time they are idle. The pattern is even more apparent with 14 concurrent jobs:
14 Concurrent Jobs:
Here the CPU activity is spread over 14 hours. Note the periodic spikes getting smaller as more jobs completed and fewer jobs remained to run.
The trends are as follows:
- More jobs = higher CPU utilization from mod-agreements and mod-notes; these two modules use the most CPU.
- okapi (and its variants) and the other modules use much less CPU than mod-agreements and mod-notes.
Modules Memory Utilization
- Nothing remarkable here; no module exhibits any signs of memory growth or leaks.
Database CPU Utilization
The database does not use much CPU in any of the tests. Here are the graphs for the tests of up to 10 concurrent jobs.
With 14 concurrent jobs the graph looks like the one below:
The CPU spikes correspond to the spikes in the modules CPU utilization graphs.
Failover Tests
Two failover tests were performed, each while exporting a single package. Both mod-data-export-worker ECS tasks were killed, but the jobs still finished successfully.
mod-data-export-worker v2.0.1
Performance improved greatly with mod-data-export-worker v2.0.1:
Test # | Jobs | Package | Job Duration (snapshot) | Overall Duration (snapshot) | Job Duration (v2.0.1) | Overall Duration (v2.0.1) |
---|---|---|---|---|---|---|
2 | 1 | Wiley UBCM - Engineering (with notes) | 38 mins | 38 mins | 3 mins | 3 mins |
3 | 1 | Wiley UBCM - Engineering (with notes) | 29 mins | 29 mins | 4 mins | 4 mins |
4 | 2 | Wiley UBCM - Engineering (with notes) | 30 mins; 29 mins | 30 mins | 3 mins; 4 mins | 4 mins |
5 | 4 | Wiley UBCM - Engineering (with notes) | 30 mins (4x); 29 mins (1x) | 30 mins | 2 mins (1x); 3 mins (3x) | 3 mins |
6 | 6 | Wiley UBCM - Engineering (with notes) | 29 mins (3x); 28 mins (3x) | 29 mins | 2 mins (1x, started 1 min later); 4 mins (5x) | 4 mins |
7 | 10 | Wiley UBCM - Engineering (with notes) | 28-30 mins (8x); 32 mins (2x, started 40 mins later) | 87 mins | 3 mins (all) | 7 mins |
8 | 14 | Wiley UBCM - Engineering (with notes) | 30 mins (3x, started on time); 29 mins (3x, started 2 hours later); 29 mins (2x, started 30 mins later); 1 min (2x, started 4 hours later); 1 min (1x, started 90 mins later); 6 mins (1x, started 30 mins later) | 4 hours | 3 mins (1x, started within 1 min of the others); 3 mins (8x, started 3 mins later); 4 mins (6x, started within 1 min of each other) | 7 mins |
9 | 16 | Wiley UBCM - Engineering (with notes) | Did not run | Did not run | 3-4 mins (12x); 3 mins (4x: 2 started 1 min later, 2 started 4 mins later) | 7 mins |
Service CPU utilization of the tests with mod-data-export-worker v2.0.1.
Note that mod-agreements and mod-notes lead the spikes in all tests: with 4 concurrent jobs mod-agreements reached over 100% CPU and mod-notes 50%; with 6 concurrent jobs, 140% and 70%; with 16 concurrent jobs, over 300% and about 180%. This is expected, as the package has 15 notes and 5 agreements.
Memory of all the relevant modules was fairly stable across all tests for the interval shown. mod-data-export-worker showed a small bump each time a new test was run; whether this trend would continue in further eHoldings testing is unknown (more tests would need to be run). However, in other tests involving mod-data-export-worker, no memory leaks were observed.
Database CPU utilization was unremarkable as well:
mod-data-export-worker v2.0.2 & Multi-tenant Exporting
eHoldings concurrent exports were tested with mod-data-export-worker v2.0.2. This time the load was distributed equally among 2-3 participating tenants on the same cluster, to see whether eHoldings exports can run concurrently in a multi-tenant environment. The answer is yes. Here are the results:
There are 3 tenants on the cluster: Tenant 00, 01, and 02.
The table below shows the results of 6 tests; each test adds more concurrent jobs. In test 1, for example, two tenants, 01 and 02, are exporting. The jobs were kicked off at the same time, as indicated by the timestamp in the Launched Time column. A job may be kicked off but remain scheduled, waiting to start; the timestamp when it actually starts exporting is recorded in the Start Time column. End Time is self-explanatory. All jobs were successful, hence the "S" in Final Status. Duration is the overall duration of each individual job, in minutes. Wait Time is the time, in minutes, a job spent in the Scheduled state before transitioning to In Progress. Concurrency is the number of jobs (including the one being looked at) running at the same time.
As seen, all jobs passed successfully. Some started later than others: most of the time a batch of jobs started together first, and then the remaining jobs took turns starting. This is evident in tests 3-6. Also note that the duration of each job is consistent no matter how many concurrent jobs are thrown at it.
Test | Tenant | Job ID | Launched Time | Start Time (UTC) | End Time (UTC) | Final Status | Duration (mins) | Wait Time (mins) | Concurrency |
---|---|---|---|---|---|---|---|---|---|
1 | 01 | 000046 | 9:35pm | 9:35pm | 9:38pm | S | 3 | 0 | 1 |
1 | 02 | 000037 | 9:35pm | 9:39pm | 9:41pm | S | 2 | 4 | 1 |
2 | 01 | 000047 | 9:43pm | 9:43pm | 9:46pm | S | 3 | 0 | 1 |
2 | 01 | 000048 | 9:43pm | 9:53pm | 9:56pm | S | 3 | 10 | 1 |
2 | 02 | 000038 | 9:43pm | 9:46pm | 9:50pm | S | 4 | 3 | 1 |
2 | 02 | 000039 | 9:43pm | 9:50pm | 9:53pm | S | 3 | 7 | 1 |
3 | 00 | 006034 | 10:08pm | 10:08pm | 10:11pm | S | 3 | 0 | 3 |
3 | 00 | 006035 | 10:08pm | 10:08pm | 10:11pm | S | 3 | 0 | 3 |
3 | 01 | 000049 | 10:08pm | 10:08pm | 10:11pm | S | 3 | 0 | 3 |
3 | 01 | 000050 | 10:08pm | 10:18pm | 10:21pm | S | 3 | 10 | 1 |
3 | 02 | 000040 | 10:08pm | 10:11pm | 10:14pm | S | 3 | 3 | 1 |
3 | 02 | 000041 | 10:08pm | 10:14pm | 10:18pm | S | 4 | 6 | 1 |
4 | 00 | 006036 | 10:28pm | 10:28pm | 10:31pm | S | 3 | 0 | 4 |
4 | 00 | 006037 | 10:28pm | 10:28pm | 10:31pm | S | 3 | 0 | 4 |
4 | 00 | 006038 | 10:28pm | 10:28pm | 10:31pm | S | 3 | 0 | 4 |
4 | 01 | 000051 | 10:28pm | 10:28pm | 10:31pm | S | 3 | 0 | 4 |
4 | 01 | 000052 | 10:28pm | 10:41pm | 10:44pm | S | 3 | 13 | 1 |
4 | 01 | 000053 | 10:28pm | 10:44pm | 10:48pm | S | 4 | 16 | 1 |
4 | 02 | 000042 | 10:28pm | 10:31pm | 10:34pm | S | 3 | 3 | 1 |
4 | 02 | 000043 | 10:28pm | 10:34pm | 10:38pm | S | 4 | 6 | 1 |
4 | 02 | 000044 | 10:28pm | 10:38pm | 10:41pm | S | 3 | 10 | 1 |
5 | 00 | 006039 | 11:00pm | 11:00pm | 11:03pm | S | 3 | 0 | 5 |
5 | 00 | 006040 | 11:00pm | 11:00pm | 11:03pm | S | 3 | 0 | 5 |
5 | 00 | 006041 | 11:00pm | 11:00pm | 11:03pm | S | 3 | 0 | 5 |
5 | 00 | 006042 | 11:00pm | 11:00pm | 11:03pm | S | 3 | 0 | 5 |
5 | 01 | 000054 | 11:00pm | 11:13pm | 11:16pm | S | 3 | 13 | 1 |
5 | 01 | 000055 | 11:00pm | 11:16pm | 11:19pm | S | 3 | 16 | 1 |
5 | 01 | 000056 | 11:00pm | 11:19pm | 11:22pm | S | 3 | 19 | 1 |
5 | 01 | 000057 | 11:00pm | 11:22pm | 11:26pm | S | 4 | 22 | 1 |
5 | 02 | 000045 | 11:00pm | 11:00pm | 11:03pm | S | 3 | 0 | 5 |
5 | 02 | 000046 | 11:00pm | 11:03pm | 11:06pm | S | 3 | 3 | 1 |
5 | 02 | 000047 | 11:00pm | 11:06pm | 11:09pm | S | 3 | 6 | 1 |
5 | 02 | 000048 | 11:00pm | 11:09pm | 11:13pm | S | 4 | 9 | 1 |
6 | 00 | 006043 | 11:31pm | 11:31pm | 11:34pm | S | 3 | 0 | 6 |
6 | 00 | 006044 | 11:31pm | 11:31pm | 11:34pm | S | 3 | 0 | 6 |
6 | 00 | 006045 | 11:31pm | 11:31pm | 11:34pm | S | 3 | 0 | 6 |
6 | 00 | 006046 | 11:31pm | 11:31pm | 11:34pm | S | 3 | 0 | 6 |
6 | 00 | 006047 | 11:31pm | 11:31pm | 11:34pm | S | 3 | 0 | 6 |
6 | 01 | 000058 | 11:30pm | 11:30pm | 11:34pm | S | 4 | 0 | 6 |
6 | 01 | 000059 | 11:31pm | 11:50pm | 11:53pm | S | 3 | 19 | 1 |
6 | 01 | 000060 | 11:31pm | 11:53pm | 11:56pm | S | 3 | 22 | 1 |
6 | 01 | 000061 | 11:31pm | 11:56pm | 12:00am | S | 4 | 25 | 1 |
6 | 01 | 000062 | 11:31pm | 12:00am | 12:03am | S | 3 | 29 | 1 |
6 | 02 | 000049 | 11:31pm | 11:34pm | 11:37pm | S | 3 | 3 | 1 |
6 | 02 | 000050 | 11:31pm | 11:37pm | 11:40pm | S | 3 | 6 | 1 |
6 | 02 | 000051 | 11:31pm | 11:40pm | 11:43pm | S | 3 | 9 | 1 |
6 | 02 | 000052 | 11:31pm | 11:43pm | 11:46pm | S | 3 | 12 | 1 |
6 | 02 | 000053 | 11:31pm | 11:46pm | 11:50pm | S | 4 | 15 | 1 |
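The Duration, Wait Time, and Concurrency columns above can be derived mechanically from the three timestamps. A minimal sketch of that derivation (our illustration, not the tooling PTF actually used; it assumes all timestamps fall on the same day, so test 6's 12:00am rows would need extra date handling):

```python
from datetime import datetime

FMT = "%I:%M%p"  # parses timestamps like "9:35PM"

def _t(s: str) -> datetime:
    return datetime.strptime(s.upper(), FMT)

def summarize(jobs):
    """jobs: list of (launched, start, end) timestamp strings.

    Returns per-job (duration_mins, wait_mins, concurrency), matching the
    table's definitions: concurrency counts jobs (including this one) whose
    [start, end) windows overlap this job's window.
    """
    parsed = [tuple(map(_t, job)) for job in jobs]
    rows = []
    for launched, start, end in parsed:
        duration = int((end - start).total_seconds() // 60)
        wait = int((start - launched).total_seconds() // 60)
        concurrency = sum(1 for _, s, e in parsed if s < end and start < e)
        rows.append((duration, wait, concurrency))
    return rows
```

Running it on the two test-1 rows reproduces the table's figures: the first job shows (3, 0, 1) and the second (2, 4, 1), since the second job's window does not overlap the first's.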
Service CPU utilization of these 6 tests:
Service memory utilization:
Only mod-data-export-worker exhibited memory growth, from 46% to 52% after running the 6 tests. This might warrant an investigation.
Database CPU utilization of the tests: