eHoldings (Nolana)
Overview
eHoldings was tested against a Nolana snapshot taken at the very end of the development cycle, leading up to the Nolana Bugfest, so the results can be treated as official Nolana release testing. Failover testing and testing with multiple concurrent jobs were performed. This document reports the results.
Infrastructure
PTF environment (ncp1)
- 9 m6i.2xlarge EC2 instances located in us-west-2.
- 2 db.r6.xlarge database instances, one reader and one writer
- MSK (ptf-kafka-1 cluster)
  - 4 m5.2xlarge brokers in 2 zones
  - auto.create.topics.enable=true
  - log.retention.minutes=480
  - default.replication.factor=3
  - Apache Kafka v2.8.0
  - EBS storage volume per broker = 300GB
- Kafka topics
  - .data-export.job.command - 50 partitions
  - data-export.job.update - 50 partitions
Memory parameters for relevant modules:
Module | Version | Max Metaspace Size (MB) | Xmx (MB) | Soft Limit (MB) | Hard Limit (MB) | CPU | Number of ECS Tasks |
---|---|---|---|---|---|---|---|
mod-agreements | 5.4.0-SNAPSHOT.104 | 512 | 968 | 1488 | 1592 | 128 | 2 |
mod-notes | 4.0.0-SNAPSHOT.237 | 128 | 322 | 896 | 1024 | 128 | 2 |
mod-feesfines | 18.2.0-SNAPSHOT.132 | 128 | 768 | 896 | 1024 | 128 | 2 |
mod-data-export-spring | 1.5.0-SNAPSHOT.58 | 512 | 1536 | 1844 | 2048 | 256 | 1 |
mod-data-export-worker | 1.5.0-SNAPSHOT.85 | 512 | 2048 | 2600 | 3072 | 1024 | 2 |
mod-kb-ebsco-java | 3.12.0-SNAPSHOT.61 | 128 | 768 | 896 | 1024 | 128 | 2 |
okapi | 4.14.4 | 512 | 922 | 1360 | 1512 | 1024 | 3 |
nginx-okapi | 2022.03.02 | - | - | 896 | 1024 | 128 | 2 |
pub-okapi | 2022.03.02 | - | - | 896 | 1024 | 128 | 2 |
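As a rough cross-check of the table above, the following sketch computes how much of each container's hard limit remains after the configured heap (Xmx) and maximum Metaspace are subtracted. It assumes the Hard Limit column is the container memory limit, and it ignores other JVM memory (thread stacks, direct buffers, code cache), so it is a lower bound on footprint rather than a sizing rule. The helper name `headroom_mb` is hypothetical.

```python
# (module, max metaspace MB, Xmx MB, hard limit MB), copied from the table above
MODULES = [
    ("mod-agreements", 512, 968, 1592),
    ("mod-notes", 128, 322, 1024),
    ("mod-feesfines", 128, 768, 1024),
    ("mod-data-export-spring", 512, 1536, 2048),
    ("mod-data-export-worker", 512, 2048, 3072),
    ("mod-kb-ebsco-java", 128, 768, 1024),
    ("okapi", 512, 922, 1512),
]


def headroom_mb(metaspace_mb, xmx_mb, hard_limit_mb):
    """Memory left under the hard limit after heap and Metaspace maximums."""
    return hard_limit_mb - (metaspace_mb + xmx_mb)


for name, metaspace, xmx, hard in MODULES:
    print(f"{name}: {headroom_mb(metaspace, xmx, hard)} MB headroom")
```

With these values, mod-data-export-spring has no headroom left between Xmx plus Metaspace and its hard limit, while mod-data-export-worker has 512 MB.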
High Level Summary
- With mod-data-export-worker v2.0.1 and v2.0.2:
  - Up to 16 concurrent jobs can be performed without any issues.
  - Multiple tenants can run concurrent jobs. PTF tested with 3 tenants, each kicking off 5 concurrent jobs.
- Pre mod-data-export-worker v2.0.1 behavior:
  - Up to 10 concurrent jobs can be performed without issues, but they take up to 87 minutes to complete.
  - 14 concurrent jobs: 10 succeeded, 4 failed; took 4 hours to complete all jobs.
  - When there are more than 10 concurrent jobs, eHoldings has long lulls (seen as low module CPU utilization) between spikes. This could contribute to the 10+ hours needed to complete the jobs.
- Simulation of a failed ECS task/container shows that eHoldings has no problems completing the ongoing job.
- Exporting a package without assigned titles, agreements, and notes is about 2x faster than with them.
Test Results
Test # | Jobs | Package | Job Duration | Overall Duration | Status |
---|---|---|---|---|---|
1 | 1 | Wiley UBCM - Engineering (without notes/agreements) | 8 mins | 8 mins | Successful |
2 | 1 | Wiley UBCM - Engineering (with notes) | 38 mins | 38 mins | Successful |
3 | 1 | Wiley UBCM - Engineering (with notes) | 29 mins | 29 mins | Successful |
4 | 2 | Wiley UBCM - Engineering (with notes) | 30 mins, 29 mins | 30 mins | Successful (all) |
5 | 4 | Wiley UBCM - Engineering (with notes) | 30 mins (4x), 29 mins (1x) | 30 mins | Successful (all) |
6 | 6 | Wiley UBCM - Engineering (with notes) | 29 mins (3x), 28 mins (3x) | 29 mins | Successful (all) |
7 | 10 | Wiley UBCM - Engineering (with notes) | 28-30 mins (8x); 32 mins (2x - started 40 minutes later) | 87 mins | Successful (all) |
8 | 14 | Wiley UBCM - Engineering (with notes) | 30 mins (3x - started on time); 29 mins (3x - started 2 hours later); 29 mins (2x - 30 mins later); 1 min (2x - started 4 hours later); 1 min (1x - started 90 mins later); 6 min (1x - started 30 mins later) | 4 hours | Successful; Successful; Successful; Failed; Failed; Failed |
In the table above, a count such as "3x" in the Job Duration column denotes the number of jobs that had the same outcome. In tests 7 and 8, some jobs that were kicked off at the same time as the others actually started (that is, transitioned from the Scheduled state to the In Progress state) between 40 minutes and 4 hours later. For these two tests, the Status column lists the status of each group of jobs, in the same order as the groups appear in the Job Duration column. The Overall Duration column reports the time from the moment all jobs were kicked off until the last job completed, whether in success or in error.
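The relationship between the Job Duration and Overall Duration columns can be sketched as follows. This is a minimal illustration, not the tooling used for the tests: the function name `summarize_jobs` is hypothetical and the timestamps are invented.

```python
from datetime import datetime, timedelta


def summarize_jobs(kickoff, jobs):
    """Return (job_durations, start_delays, overall_duration).

    kickoff: datetime when all jobs were kicked off.
    jobs: list of (started, ended) datetime pairs, one per job, where
          "started" is the Scheduled -> In Progress transition.
    """
    job_durations = [ended - started for started, ended in jobs]
    start_delays = [started - kickoff for started, _ in jobs]
    # Overall duration runs from kickoff to the last job to finish.
    overall_duration = max(ended for _, ended in jobs) - kickoff
    return job_durations, start_delays, overall_duration


# Invented timestamps, for illustration only (not measured values).
kickoff = datetime(2022, 1, 1, 10, 0)
jobs = [
    (datetime(2022, 1, 1, 10, 0), datetime(2022, 1, 1, 10, 30)),   # started on time
    (datetime(2022, 1, 1, 10, 40), datetime(2022, 1, 1, 11, 12)),  # started 40 minutes late
]
durations, delays, overall = summarize_jobs(kickoff, jobs)
```

A delayed start leaves the job's own duration unchanged but lengthens the overall duration, which is why the two columns diverge in tests 7 and 8.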
The performance of each job, whether running alone or concurrently with others, is rather consistent: with up to 6 concurrent jobs, every job finishes in around 30 minutes. However, as more jobs are added, as in test #7, the export becomes less stable. Two of the 10 concurrent jobs started 40 minutes later, while the other 8 started right away. When 14 jobs were scheduled at once, only 3 started immediately, while the rest were delayed by between 30 minutes and 4 hours. There were also 4 failures in the test with 14 concurrent jobs. A couple of the errors came from the HoldingsIQ service, an external, non-FOLIO service that mod-data-export-worker calls, while one came from mod-data-export-worker itself.
Errors
HoldingsIQ service Errors
[404 Not Found] during [GET] to [http://eholdings/providers/58] [KbEbscoClient#getProviderById(String,String)]: [{ "errors" : [ { "title" : "Provider not found" } ], "jsonapi" : { "version" : "1.0" } }] (NotFound)
[504 Gateway Timeout] during [GET] to [http://eholdings/packages/58-2110695/resources?searchfield=title&page=21&count=20&include=accessType] [KbEbscoClient#getResourcesByPackageId(String,Map)]: [{ "errors" : [ { "title" : "Endpoint request timed out" } ], "jsonapi" : { "version" : "1.0" } }] (GatewayTimeout)
mod-data-export-worker Error:
Your proposed upload is smaller than the minimum allowed size (Service: S3, Status Code: 400, Request ID: FHEHBW1CRV4QSSWG, Extended Request ID: /r5bKrM7NTz2PqRFzGLJ4S6xOj79TS2ng0VKekasA3C2Xzrp0uzoGmW4NBNE6+f5wkVdMXja4Z0=) (S3Exception)
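The S3 message above is the one S3 typically returns for a multipart upload in which a part other than the last is smaller than the 5 MiB minimum. The report does not confirm that this was the cause here, so the following is only a sketch of how an uploader might regroup small chunks to respect that minimum. The function name `split_into_parts` is hypothetical and does not come from mod-data-export-worker.

```python
MIN_PART_SIZE = 5 * 1024 * 1024  # S3 minimum for every multipart part except the last


def split_into_parts(chunks, min_part_size=MIN_PART_SIZE):
    """Regroup byte chunks so every part but the last is >= min_part_size."""
    buffer = bytearray()
    for chunk in chunks:
        buffer.extend(chunk)
        if len(buffer) >= min_part_size:
            yield bytes(buffer)  # bytes() copies, so clearing below is safe
            buffer.clear()
    if buffer:
        # The final part is allowed to be smaller than the minimum.
        yield bytes(buffer)


# Small threshold used only to keep the example readable.
parts = list(split_into_parts([b"abc", b"def", b"gh"], min_part_size=5))
```

Each yielded part could then be passed to a multipart upload call; only the trailing part may fall below the threshold.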
Modules CPU Utilization
2 Concurrent jobs:
6 Concurrent jobs
In this graph the 6 jobs started at 17:38 and ended at 18:08. Note the long lull of apparent inactivity after 17:45.
10 Concurrent jobs:
With 10 concurrent jobs the CPU graph continues to show the interesting pattern of spikes and lulls. From around 18:10 to around 19:40, there are 4 periods when the modules seem to be actively working, while the rest of the time they do nothing. This pattern is even more apparent when there are 14 concurrent jobs:
14 Concurrent Jobs:
Here the CPU activity is spread over 14 hours. Note that the periodic spikes get smaller as more jobs complete and fewer jobs remain to run.
The trends are as follows:
- More jobs mean higher CPU utilization by mod-agreements and mod-notes, the two modules that use the most CPU.
- okapi (and its variants) and the other modules use much less CPU than mod-agreements and mod-notes.
Modules Memory Utilization
- Nothing remarkable here: none of the modules exhibit any signs of memory growth or leaks.
Database CPU Utilization
The database does not use much CPU in any of the tests. Here are the graphs for the tests with up to 10 concurrent jobs:
With 14 concurrent jobs, the graph looks like this:
The CPU spikes correspond to the spikes in the modules CPU utilization graphs.
Failover Tests
Two failover tests were performed, each while exporting a single package. Both mod-data-export-worker ECS tasks were killed, but the jobs still finished successfully.
mod-data-export-worker v2.0.1
Performance improved greatly with mod-data-export-worker v2.0.1:
Test # | Jobs | Package | Job Duration (snapshot) | Overall Duration (snapshot) | Job Duration (v2.0.1) | Overall Duration (v2.0.1) |
---|---|---|---|---|---|---|
2 | 1 | Wiley UBCM - Engineering (with notes) | 38 mins | 38 mins | 3 mins | 3 mins |
3 | 1 | Wiley UBCM - Engineering (with notes) | 29 mins | 29 mins | 4 mins | 4 mins |
4 | 2 | Wiley UBCM - Engineering (with notes) | 30 mins, 29 mins | 30 mins | 3 mins, 4 mins | 4 mins |
5 | <