PTF - DI testing for Cornell (Iris hotfixes)
Overview
This workflow checks the performance of Data Import (DI) for Cornell. The tests mimic the Cornell load described in the Rally ticket https://rally1.rallydev.com/#/79944863724d/search?detail=%2Fuserstory%2F50807e3e-96eb-4ece-b24b-24665b1a4dc5&fdp=true&keywords=cornell
The tests were run in the PTF environment, in the icp1 hotfix-1 cluster - https://iris-cap1.int.aws.folio.org/
The following changes were made to the environment based on Cornell's load requirements:
1. Give more memory/CPU to Okapi, mod-srm, mod-srs, mod-inventory and mod-inventory-storage - 4x of what's already in the Task Definition
2. Scale the DB up to a 4xlarge instance class (db.r5.4xlarge)
3. Vertically scale up EC2 instances to m5.2xlarge
| Module | Memory (hard limit, MB) | CPU | Task Def |
|---|---|---|---|
| Okapi | 3456 | 512 | #2 |
| mod-SRM | 5760 | 512 | #4 |
| mod-SRS | 5760 | 512 | #4 |
| mod-inventory | 7488 | 1024 | #5 |
| mod-inventory-storage | 3456 | 512 | #2 |
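As a rough cross-check of the 4x scaling, the sketch below divides the limits in the table by four to recover the implied pre-scaling values. It assumes the table lists the limits after scaling and that the factor was applied uniformly to both memory and CPU; neither is stated explicitly here, and the function name is illustrative only.

```python
# Limits copied from the table above: module -> (memory hard limit in MB, CPU units).
# Assumption: these are the values AFTER the 4x increase.
scaled_limits = {
    "Okapi": (3456, 512),
    "mod-SRM": (5760, 512),
    "mod-SRS": (5760, 512),
    "mod-inventory": (7488, 1024),
    "mod-inventory-storage": (3456, 512),
}

SCALE_FACTOR = 4  # assumption: applied uniformly to memory and CPU


def implied_baseline(limits, factor=SCALE_FACTOR):
    """Return the per-module limits implied before scaling by `factor`."""
    return {
        module: (memory // factor, cpu // factor)
        for module, (memory, cpu) in limits.items()
    }


baseline = implied_baseline(scaled_limits)
for module, (memory, cpu) in baseline.items():
    print(f"{module}: {memory} MB, {cpu} CPU units (implied baseline)")
```

If the implied values do not match the previous Task Definition revisions, the uniform-factor assumption does not hold for that module.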
Hotfix-1
Backend:
mod-data-import-2.0.2
mod-source-record-storage-5.0.4
mod-source-record-manager-3.0.7
okapi-4.7.3
Frontend:
folio_data-import-4.0.3
Hotfix-2
Backend:
mod-data-import-2.0.3
mod-source-record-storage-5.0.5
mod-source-record-manager-3.0.8
okapi-4.7.3
mod-inventory-16.3.3
mod-inventory-storage-20.2.1
Frontend:
folio_data-import-4.0.4
Environment:
7.2 million UChi SRS records
7.2 million inventory records
69 FOLIO back-end modules deployed in 149 ECS services
3 okapi ECS services
12 m5.2xlarge EC2 instances
1 writer and 1 reader db.r5.4xlarge AWS RDS instances
INFO logging level
Test Runs
Hotfix-1
Data Import
| Test | Profile | Load | Duration | Status | Instances | Holdings | Items | SRS MARC | Job ID | Results/Notes |
|---|---|---|---|---|---|---|---|---|---|---|
| 1. | Create import 25K (lone job) | 25K | 10:21 AM - 11:26 AM EST (1 hour 6 minutes) | Completed with Error | 24960 | 24959 | 24959 | 24950 | 991 | Job failed to create all records; mod-inventory's CPU spiked. kafka.consumer.max.poll.records=10 |
| 2. | Create import 25K (lone job) | 25K | 2+ hours | Stuck at 99% | 25000 | 25000 | 24998 | 25000 | 1024 | Failed to create 2 items; mod-inventory's CPU spiked. kafka.consumer.max.poll.records=10 |
| 3. | Create import 5K (lone job) | 5K | 16:37 - 16:47 EDT | Completed Successfully | 5000 | 5000 | 5000 | 5000 | 1025 | |
| 4. | Create import 25K (lone job) | 25K | 16:50 - 17:32 EDT (42 minutes) | Completed Successfully | 25000 | 25000 | 25000 | 25000 | 1026 | mod-inventory's CPU did not spike during the import, only right after. |
| 5. | Create import 30K (lone job) | 30K | 18:05 - 19:11 EDT (1 hour 6 minutes) | Stuck at 99% | 30000 | 30000 | 29999 | 30000 | 1027 | mod-inventory's CPU did not spike during the import. A dip was observed about midway; the events_cache message rate spiked at the dip. |
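To make the partial failures easier to compare, the following sketch computes how many records of each type are missing per Hotfix-1 run. It reads the three count columns before SRS Marc as instances, holdings, and items; that reading is an assumption inferred from the "Failed to create 2 items" note in test 2. The counts are copied from the table above.

```python
# (test number, expected load, instances, holdings, items, SRS MARC records),
# copied from the Hotfix-1 Data Import table.
# Assumption: the three count columns are instances, holdings, and items.
hotfix1_runs = [
    (1, 25000, 24960, 24959, 24959, 24950),
    (2, 25000, 25000, 25000, 24998, 25000),
    (3, 5000, 5000, 5000, 5000, 5000),
    (4, 25000, 25000, 25000, 25000, 25000),
    (5, 30000, 30000, 30000, 29999, 30000),
]

RECORD_TYPES = ("instances", "holdings", "items", "srs_marc")


def shortfalls(runs):
    """Map each test number to the count of missing records per record type."""
    result = {}
    for test, expected, *created in runs:
        result[test] = {
            name: expected - count for name, count in zip(RECORD_TYPES, created)
        }
    return result


missing = shortfalls(hotfix1_runs)
for test, counts in missing.items():
    incomplete = {name: n for name, n in counts.items() if n > 0}
    print(f"Test {test}: {incomplete if incomplete else 'all records created'}")
```

Only tests 3 and 4 show no shortfall, which agrees with their "Completed Successfully" status.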
Checkin-Checkout
Ran Checkin-Checkout for 20 Users for 1 hour
Relatively stable, except that a few requests failed.
Hotfix-2
Data Import
| Test | Profile | Load | Duration | Status | Instances | Holdings | Items | SRS MARC | Job ID | Results/Notes |
|---|---|---|---|---|---|---|---|---|---|---|
| 1. | Create import 25K (lone job) | 25K | 06/28 4:29 PM - 5:45 PM EDT (1 hour 16 minutes) | Completed with Error | 24950 | 24950 | 24950 | 24950 | 1090 | Job failed to create all records. kafka.consumer.max.poll.records=10 |
| 2. | Create import 25K (lone job) | 25K | 06/29 2:05 PM - 3:15 PM EDT (1 hour 10 minutes) | Completed | 25000 | 25000 | 25000 | 25000 | 1093 | Created all records as expected. There was one sudden spike in the mod-inventory CPU. |
| 3. | Create import 25K (lone job) | 25K | 06/29 3:31 PM - 4:30 PM EDT (1 hour 1 minute) | Completed | 25000 | 25000 | 25000 | 25000 | 1094 | kafka.consumer.max.poll.records=10. Created all records as expected. No spikes seen. |
| 4. | Create import 25K (lone job) | 25K | 06/29 17:35 - 18:23 EDT (48 minutes) | Completed | 25000 | 25000 | 25000 | 25000 | 1095 | kafka.consumer.max.poll.records=10. Created all records as expected. No spikes seen. |
| 5. | Create import 50K (lone job) | 50K | 06/29 22:51 UTC - 06/30 00:52 UTC (2 hours) | Completed with Error | 50000 | 50000 | 49999 | 50000 | 1096 | kafka.consumer.max.poll.records=10. Job stuck at 99%. Huge mod-inventory CPU spike midway (13 minutes) after 8 minutes of very low activity. events_cache spiked multiple times. |
| 6. | Create import 25K (with 20 users checkin/out) | 25K | 06/30 03:56 UTC - 06/30 04:49 UTC (53 minutes) | Completed | 25000 | 25000 | 25000 | 25000 | 1097 | kafka.consumer.max.poll.records=10 |
| 7. | Create import 25K (with 40 users checkin/out) | 25K | 06/30 4:41 PM UTC - 5:16 PM UTC (35 minutes) | Completed | 25000 | 25000 | 25000 | 25000 | 1098 | kafka.consumer.max.poll.records=10. No spikes. Okapi CPU utilization is high, around 220%, because checkin-checkout is also running in the background. |
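The durations above can be turned into a throughput figure for easier comparison across runs. The sketch below is a minimal calculation using the load and the stated duration of each Hotfix-2 run; it uses the rounded durations as written in the table (for example, 120 minutes for test 5), so the rates are approximate.

```python
# (test number, load in records, stated duration in minutes),
# copied from the Hotfix-2 Data Import table.
hotfix2_runs = [
    (1, 25000, 76),
    (2, 25000, 70),
    (3, 25000, 61),
    (4, 25000, 48),
    (5, 50000, 120),
    (6, 25000, 53),
    (7, 25000, 35),
]


def records_per_minute(runs):
    """Map each test number to its average import rate in records per minute."""
    return {test: load / minutes for test, load, minutes in runs}


rates = records_per_minute(hotfix2_runs)
for test, rate in rates.items():
    print(f"Test {test}: {rate:.1f} records/min")
```

These are averages over the whole job, so they hide the mid-run dips and spikes mentioned in the notes; tests 1 and 5 also did not create all records.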
Checkin-Checkout as background for DI above
| DI Test # (from above) | Users | Transaction | Req/s | Min | 50th pct | 75th pct | 95th pct | 99th pct | Max | Average | Latency | Grafana dashboard link | Results/Notes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 6. | 20 | Total Check-in | 19.495 | 0.356 | 1.397 | 1.957 | 3.831 | 5.999 | 10.34 | 1.695 | 3.4 | | High Okapi CPU utilization (180%), which is normal for 20 users |
| | | Total Check-out | 60.657 | 1.326 | 3.03 | 4.022 | 8.202 | 12.924 | 31.539 | 3.669 | 7.076 | | |
| 7. | 40 | Total Check-in | 28.509 | 0.413 | 2.706 | 3.971 | 7.911 | 10.883 | 19.222 | 3.354 | 6.578 | | High Okapi CPU utilization (245%), which is normal for 40 users |
| | | Total Check-out | 86.383 | 1.669 | 6.111 | 8.417 | 17.415 | 27.235 | 53.761 | 7.551 | 14.468 | | |
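One way to quantify the impact of a concurrent Data Import on checkin-checkout is to take the ratio of each background figure to the corresponding figure in the Checkin-Checkout Standalone section of this report. The sketch below does this for the Average and 95th pct columns. The pairs are limited to rows present in both tables, and the dictionary layout is illustrative only.

```python
# (average, 95th pct) per (users, transaction), copied from this report.
# Ratios are unitless, so no assumption about the time unit is needed.
with_di = {
    (20, "check-in"): (1.695, 3.831),
    (20, "check-out"): (3.669, 8.202),
    (40, "check-in"): (3.354, 7.911),
}
standalone = {
    (20, "check-in"): (0.891, 1.551),
    (20, "check-out"): (2.081, 3.172),
    (40, "check-in"): (2.061, 4.675),
}


def slowdown(background, baseline):
    """Return (average ratio, 95th pct ratio) for every key present in both inputs."""
    ratios = {}
    for key in background.keys() & baseline.keys():
        bg_avg, bg_p95 = background[key]
        base_avg, base_p95 = baseline[key]
        ratios[key] = (bg_avg / base_avg, bg_p95 / base_p95)
    return ratios


ratios = slowdown(with_di, standalone)
for (users, transaction), (avg_ratio, p95_ratio) in sorted(ratios.items()):
    print(f"{users} users, {transaction}: average x{avg_ratio:.2f}, 95th pct x{p95_ratio:.2f}")
```

A ratio above 1 means the transaction was slower while Data Import was running.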
Checkin-Checkout Standalone
| Test | Users | Transaction | Req/s | Min | 50th pct | 75th pct | 95th pct | 99th pct | Max | Average | Latency | Grafana dashboard link | Results/Notes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1. | 20 | Total Check-in | 24.755 | 0.326 | 0.807 | 1.039 | 1.551 | 2.284 | 13.608 | 0.891 | 1.479 | | High Okapi CPU utilization (140%), which is normal for 20 users |
| | | Total Check-out | 75.733 | 1.235 | 1.926 | 2.228 | 3.172 | 4.751 | 12.905 | 2.081 | 3.014 | | |
| 2. | 40 | Total Check-in | 36.529 | 0.349 | 1.724 | 2.356 | 4.675 | 6.746 | 13.136 | 2.061 | 4.02 | | High Okapi CPU utilization (250%), which is normal for 40 users |