PTF - DI testing for Cornell (Iris hotfixes)
Overview
- In this workflow we check the performance of Data Import for Cornell. The testing mimics Cornell's load as described in the Rally ticket: https://rally1.rallydev.com/#/79944863724d/search?detail=%2Fuserstory%2F50807e3e-96eb-4ece-b24b-24665b1a4dc5&fdp=true&keywords=cornell
These tests were run in the PTF environment, in the cap1 hotfix-1 cluster: https://iris-cap1.int.aws.folio.org/
The following changes were made to the environment based on Cornell's load requirements:
1. Gave more memory/CPU to Okapi, mod-srm, mod-srs, mod-inventory, and mod-inventory-storage: 4x what was already in the Task Definition (see the table below; a scripted sketch of this change follows the table)
2. Made the DB 4x larger
3. Vertically scaled up the EC2 instances to m5.2xlarge
Module | Memory (hard limit, MB) | CPU (units) | Task Def |
---|---|---|---|
Okapi | 3456 | 512 | #2 |
mod-SRM | 5760 | 512 | #4 |
mod-SRS | 5760 | 512 | #4 |
mod-inventory | 7488 | 1024 | #5 |
mod-inventory-storage | 3456 | 512 | #2 |
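A hedged boto3 sketch of the kind of Task Definition change listed above is shown below. The module family names, region, and carried-over fields are illustrative assumptions, not the exact procedure used; in practice a new revision must also carry over networkMode, roles, volumes, and the containers' other settings.

```python
# Hypothetical sketch: register a new ECS Task Definition revision with
# 4x the memory/CPU of the current one (family names and region assumed).
import boto3

ecs = boto3.client("ecs", region_name="us-east-1")  # region is an assumption

def scale_task_definition(family: str, factor: int = 4) -> str:
    current = ecs.describe_task_definition(taskDefinition=family)["taskDefinition"]
    containers = current["containerDefinitions"]
    for c in containers:
        if "memory" in c:
            c["memory"] *= factor   # hard memory limit, MiB
        if "cpu" in c:
            c["cpu"] *= factor      # CPU units (1024 = 1 vCPU)
    new = ecs.register_task_definition(
        family=current["family"],
        containerDefinitions=containers,
        # networkMode, volumes, and task/execution roles must be carried over too
    )
    return new["taskDefinition"]["taskDefinitionArn"]

for module in ("okapi", "mod-srm", "mod-srs", "mod-inventory", "mod-inventory-storage"):
    print(scale_task_definition(module))
```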
Hotfix-1
- Backend:
- mod-data-import-2.0.2
- mod-source-record-storage-5.0.4
- mod-source-record-manager-3.0.7
- okapi-4.7.3
- Frontend:
- folio_data-import-4.0.3
Hotfix-2
- Backend:
- mod-data-import-2.0.3
- mod-source-record-storage-5.0.5
- mod-source-record-manager-3.0.8
- okapi-4.7.3
- mod-inventory-16.3.3
- mod-inventory-storage-20.2.1
- Frontend:
- folio_data-import-4.0.4
Environment:
- 7.2 million UChi SRS records
- 7.2 million inventory records
- 69 FOLIO back-end modules deployed in 149 ECS services
- 3 Okapi ECS services
- 12 m5.2xlarge EC2 instances
- 1 writer and 1 reader db.r5.4xlarge AWS RDS instances
- INFO logging level
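The dataset sizes above can be spot-checked through Okapi. A minimal sketch follows; the Okapi URL, tenant, and token are placeholders, and the limit=0/totalRecords behavior should be verified against the deployed storage-module versions.

```python
# Hedged sketch: count records through Okapi to confirm the dataset sizes.
# OKAPI_URL, tenant, and token are placeholders.
import requests

OKAPI_URL = "https://okapi-iris-cap1.example.org"  # placeholder Okapi URL
HEADERS = {"X-Okapi-Tenant": "<tenant>", "X-Okapi-Token": "<token>"}

for name, path in [
    ("instances", "/instance-storage/instances"),
    ("holdings", "/holdings-storage/holdings"),
    ("items", "/item-storage/items"),
    ("SRS records", "/source-storage/records"),
]:
    resp = requests.get(f"{OKAPI_URL}{path}", params={"limit": 0}, headers=HEADERS)
    resp.raise_for_status()
    print(name, resp.json().get("totalRecords"))
```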
Test Runs
Hotfix-1
Data Import
Test | Profile | Load | Duration | Status | Instances | Holdings | Items | SRS MARC | Job Id | Results/Notes |
---|---|---|---|---|---|---|---|---|---|---|
1. | Create import 25K (lone job) | 25K | 10:21 AM - 11:26 AM EST, 1 hour 6 minutes | Completed with errors | 24960 | 24959 | 24959 | 24950 | 991 | Job failed to create all records; mod-inventory's CPU spiked. kafka.consumer.max.poll.records=10 |
2. | Create import 25K (lone job) | 25K | 2+ hours | Stuck at 99% | 25000 | 25000 | 24998 | 25000 | 1024 | Failed to create 2 items; mod-inventory's CPU spiked. kafka.consumer.max.poll.records=10 |
3. | Create import 5K (lone job) | 5K | 16:37 - 16:47 EDT, 10 minutes | Completed successfully | 5000 | 5000 | 5000 | 5000 | 1025 | |
4. | Create import 25K (lone job) | 25K | 16:50 - 17:32 EDT, 42 minutes | Completed successfully | 25000 | 25000 | 25000 | 25000 | 1026 | mod-inventory's CPU did not spike during the import, only right after. |
5. | Create import 30K (lone job) | 30K | 18:05 - 19:11 EDT, 1 hour 6 minutes | Stuck at 99% | 30000 | 30000 | 29999 | 30000 | 1027 | mod-inventory's CPU did not spike during the import. A dip was observed about midway; the events_cache message rate spiked at the dip. |
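Several of the runs above note kafka.consumer.max.poll.records=10. This is the standard Kafka consumer setting capping how many records a single poll() returns. The FOLIO DI modules are Java and set the equivalent consumer property there; the kafka-python sketch below (topic, broker, and group id are placeholders) is purely illustrative of what the cap does.

```python
# Illustration only (the FOLIO DI modules are Java; this mirrors the setting
# in Python). Topic, broker, and group id below are placeholders.
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "DI_SRS_MARC_BIB_RECORD_CREATED",   # placeholder topic name
    bootstrap_servers="kafka:9092",      # placeholder broker
    group_id="mod-inventory-di",         # placeholder consumer group
    max_poll_records=10,                 # the setting referenced in the notes
    enable_auto_commit=False,
)

while True:
    # With the cap above, poll() returns at most 10 records per call,
    # throttling how much import work the consumer takes on at once.
    batch = consumer.poll(timeout_ms=1000)
    for _tp, records in batch.items():
        for record in records:
            pass  # process one DI event
    if batch:
        consumer.commit()
```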
Checkin-Checkout
Ran Checkin-Checkout with 20 users for 1 hour.
Performance was relatively stable, except that a few requests failed.
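For reference, a minimal Locust-style sketch of such a checkin-checkout loop against the standard FOLIO circulation endpoints is below. It is not the actual PTF test script; the host, tenant, token, barcodes, and service point UUID are placeholders.

```python
# Minimal sketch of a checkin-checkout load loop (not the actual PTF script).
# Host, tenant, token, barcodes, and service point UUID are placeholders.
from locust import HttpUser, task, between

HEADERS = {"X-Okapi-Tenant": "<tenant>", "X-Okapi-Token": "<token>"}

class CheckinCheckout(HttpUser):
    host = "https://okapi-iris-cap1.example.org"  # placeholder Okapi URL
    wait_time = between(1, 5)

    @task
    def checkout_then_checkin(self):
        self.client.post(
            "/circulation/check-out-by-barcode",
            json={"itemBarcode": "<item-barcode>",
                  "userBarcode": "<user-barcode>",
                  "servicePointId": "<service-point-uuid>"},
            headers=HEADERS,
        )
        self.client.post(
            "/circulation/check-in-by-barcode",
            json={"itemBarcode": "<item-barcode>",
                  "servicePointId": "<service-point-uuid>",
                  "checkInDate": "2021-06-29T00:00:00.000Z"},
            headers=HEADERS,
        )
```

A run like the one described above could be launched headless with, e.g., locust -f checkin_checkout.py --headless --users 20 --run-time 1h.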
Hotfix-2
Data Import
Test | Profile | Load | Duration | Status | Instances | Holdings | Items | SRS MARC | Job Id | Results/Notes |
---|---|---|---|---|---|---|---|---|---|---|
1. | Create import 25K (lone job) | 25K | 06/28 4:29 PM - 5:45 PM EDT, 1 hour 16 minutes | Completed with errors | 24950 | 24950 | 24950 | 24950 | 1090 | Job failed to create all records. kafka.consumer.max.poll.records=10 |
2. | Create import 25K (lone job) | 25K | 06/29 2:05 PM - 3:15 PM EDT, 1 hour 10 minutes | Completed | 25000 | 25000 | 25000 | 25000 | 1093 | Created all records as expected. There was one sudden spike in mod-inventory CPU. |
3. | Create import 25K (lone job) | 25K | 06/29 3:31 PM - 4:30 PM EDT, 1 hour 1 minute | Completed | 25000 | 25000 | 25000 | 25000 | 1094 | Created all records as expected. No spikes seen. kafka.consumer.max.poll.records=10 |
4. | Create import 25K (lone job) | 25K | 06/29 17:35 - 18:23 EDT, 48 minutes | Completed | 25000 | 25000 | 25000 | 25000 | 1095 | Created all records as expected. No spikes seen. kafka.consumer.max.poll.records=10 |
5. | Create import 50K (lone job) | 50K | 06/29 22:51 UTC - 06/30 00:52 UTC, 2 hours | Completed with errors | 50000 | 50000 | 49999 | 50000 | 1096 | Job stuck at 99%. Huge mod-inventory CPU spike midway (13 minutes) after 8 minutes of very low activity. events_cache spiked multiple times. kafka.consumer.max.poll.records=10 |
6. | Create import 25K (with 20 users checkin/out) | 25K | 06/30 03:56 UTC - 06/30 04:49 UTC, 53 minutes | Completed | 25000 | 25000 | 25000 | 25000 | 1097 | kafka.consumer.max.poll.records=10 |
7. | Create import 25K (with 40 users checkin/out) | 25K | 06/30 4:41 PM UTC - 5:16 PM UTC, 35 minutes | Completed | 25000 | 25000 | 25000 | 25000 | 1098 | No spikes. Okapi CPU utilization was high (around 220%) because checkin-checkout was also running in the background. kafka.consumer.max.poll.records=10 |
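The "stuck at 99%" runs above were observed by watching job progress. A hedged sketch of polling a DI job through Okapi follows; the endpoint shape follows mod-source-record-manager's /change-manager/jobExecutions API, but the URL, tenant, token, and job UUID are placeholders and the response fields should be verified against the deployed version.

```python
# Hedged sketch: poll a Data Import job's progress to detect a stall.
# OKAPI_URL, tenant, token, and the job execution UUID are placeholders.
import time
import requests

OKAPI_URL = "https://okapi-iris-cap1.example.org"  # placeholder Okapi URL
HEADERS = {"X-Okapi-Tenant": "<tenant>", "X-Okapi-Token": "<token>"}

def wait_for_job(job_execution_id: str, poll_seconds: int = 30) -> dict:
    while True:
        resp = requests.get(
            f"{OKAPI_URL}/change-manager/jobExecutions/{job_execution_id}",
            headers=HEADERS,
        )
        resp.raise_for_status()
        job = resp.json()
        progress = job.get("progress", {})
        print(job.get("status"), progress.get("current"), "/", progress.get("total"))
        if job.get("status") in ("COMMITTED", "ERROR"):  # terminal statuses
            return job
        time.sleep(poll_seconds)
```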
Checkin-Checkout as background load for the DI tests above
DI Test # (from above) | Users | Transaction | Req/s | Min | 50th pct | 75th pct | 95th pct | 99th pct | Max | Average | Latency | Grafana dashboard link | Results/Notes |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
6. | 20 | Total Check-in | 19.495 | 0.356 | 1.397 | 1.957 | 3.831 | 5.999 | 10.34 | 1.695 | 3.4 | 6/29/2021 11:55 PM - 00:55 AM EDT | High CPU utilization for Okapi (180%), which is normal for 20 users |
 | | Total Check-out | 60.657 | 1.326 | 3.03 | 4.022 | 8.202 | 12.924 | 31.539 | 3.669 | 7.076 | | |
7. | 40 | Total Check-in | 28.509 | 0.413 | 2.706 | 3.971 | 7.911 | 10.883 | 19.222 | 3.354 | 6.578 | 6/30/2021 12:20 PM - 1:40 PM EDT | High CPU utilization for Okapi (245%), which is normal for 40 users |
 | | Total Check-out | 86.383 | 1.669 | 6.111 | 8.417 | 17.415 | 27.235 | 53.761 | 7.551 | 14.468 | | |
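The percentile columns in these tables are computed from the raw per-request response times. A toy example of that computation (made-up sample values; units assumed to be seconds):

```python
# Toy example: percentile columns derived from raw response times.
# The sample values below are made up; units assumed to be seconds.
import numpy as np

samples = np.array([0.36, 0.9, 1.4, 2.0, 3.8, 6.0, 10.3])
for pct in (50, 75, 95, 99):
    print(f"{pct}th pct: {np.percentile(samples, pct):.3f}")
print(f"average: {samples.mean():.3f}")
```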
Checkin-Checkout Standalone
Test | Users | Transaction | Req/s | Min | 50th pct | 75th pct | 95th pct | 99th pct | Max | Average | Latency | Grafana dashboard link | Results/Notes |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1. | 20 | Total Check-in | 24.755 | 0.326 | 0.807 | 1.039 | 1.551 | 2.284 | 13.608 | 0.891 | 1.479 | 6/28/2021 2:42 PM - 3:50 PM EDT | High CPU utilization for Okapi (140%), which is normal for 20 users |
 | | Total Check-out | 75.733 | 1.235 | 1.926 | 2.228 | 3.172 | 4.751 | 12.905 | 2.081 | 3.014 | | |
2. | 40 | Total Check-in | 36.529 | 0.349 | 1.724 | 2.356 | 4.675 | 6.746 | 13.136 | 2.061 | 4.02 | 6/30/2021 10:14 AM - 11:22 AM EDT | High CPU utilization for Okapi (250%), which is normal for 40 users |
 | | Total Check-out | 109.9 | 1.44 | 4.037 | 5.346 | 10.508 | 15.765 | 32.952 | 4.844 | 8.978 | | |
Observations
Accidentally ran a parallel import in another cluster (cap2) during Test 1 above. A parallel import does not create all of the expected records.