PTF - DI testing for Cornell (Iris hotfixes)

Overview

This workflow tests the performance of Data Import for Cornell. The testing mimics Cornell's load as described in the Rally ticket: https://rally1.rallydev.com/#/79944863724d/search?detail=%2Fuserstory%2F50807e3e-96eb-4ece-b24b-24665b1a4dc5&fdp=true&keywords=cornell

These tests were run in the PTF environment, in the cap1 hotfix-1 cluster: https://iris-cap1.int.aws.folio.org/

The following changes were made to the environment based on Cornell's load requirements:

1. Give Okapi, mod-srm, mod-srs, mod-inventory, and mod-inventory-storage more memory/CPU - 4x what was already in their Task Definitions (see the sketch after the table below)

2. Make the DB 4x larger

3. Vertically scale the EC2 instances up to m5.2xlarge

Module                | Memory (Hard limit), MB | CPU  | Task Def
Okapi                 | 3456                    | 512  | #2
mod-SRM               | 5760                    | 512  | #4
mod-SRS               | 5760                    | 512  | #4
mod-inventory         | 7488                    | 1024 | #5
mod-inventory-storage | 3456                    | 512  | #2
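
A minimal sketch of change #1 above, assuming boto3 and ECS task-definition families named after the modules; the region and the uniform 4x factor are assumptions for illustration, not the recorded procedure:

# Hedged sketch: register a 4x-scaled revision of each module's ECS task
# definition with boto3. Family names and region are assumptions.
import boto3

ecs = boto3.client("ecs", region_name="us-east-1")

def scale_task_definition(family: str, factor: int = 4) -> str:
    """Register a new revision of `family` with container memory/CPU scaled by `factor`."""
    current = ecs.describe_task_definition(taskDefinition=family)["taskDefinition"]
    containers = current["containerDefinitions"]
    for container in containers:
        if "memory" in container:    # hard limit, MB (the table above)
            container["memory"] *= factor
        if "cpu" in container:       # CPU units; 1024 = one vCPU
            container["cpu"] *= factor
    new_revision = ecs.register_task_definition(
        family=current["family"],
        containerDefinitions=containers,
        networkMode=current.get("networkMode", "bridge"),
    )
    return new_revision["taskDefinition"]["taskDefinitionArn"]

for module in ("okapi", "mod-srm", "mod-srs", "mod-inventory", "mod-inventory-storage"):
    print(scale_task_definition(module))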


Hotfix-1

  • Backend:
    • mod-data-import-2.0.2
    • mod-source-record-storage-5.0.4
    • mod-source-record-manager-3.0.7
    • okapi-4.7.3
  • Frontend:
    • folio_data-import-4.0.3

Hotfix-2

  • Backend:
    • mod-data-import-2.0.3
    • mod-source-record-storage-5.0.5
    • mod-source-record-manager-3.0.8
    • okapi-4.7.3
    • mod-inventory-16.3.3
    • mod-inventory-storage-20.2.1
  • Frontend:
    • folio_data-import-4.0.4


Environment:

  • 7.2 million UChi SRS records
  • 7.2 million inventory records
  • 69 FOLIO back-end modules deployed in 149 ECS services
  • 3 okapi ECS services
  • 12 m5.2xlarge EC2 instances
  • 1 writer and 1 reader db.r5.4xlarge AWS RDS instances
  • INFO logging level

Test Runs

Hotfix-1

Data Import

Test | Profile | Load | Duration | Status | Instances | Holdings | Items | SRS MARC | Job Id | Results/Notes
1. | Create import 25K (lone job) | 25K | 10:21 AM - 11:26 AM EST (1 hour 6 minutes) | Completed with Error | 24960 | 24959 | 24959 | 24950 | 991 | Job failed to create all records; mod-inventory's CPU spiked. kafka.consumer.max.poll.records=10 (see the Kafka sketch below the table)
2. | Create import 25K (lone job) | 25K | 2+ hours | Stuck at 99% | 25000 | 25000 | 24998 | 25000 | 1024 | Failed to create 2 items; mod-inventory's CPU spiked. kafka.consumer.max.poll.records=10
3. | Create import 5K (lone job) | 5K | 16:37 - 16:47 EDT (10 minutes) | Completed Successfully | 5000 | 5000 | 5000 | 5000 | 1025 |
4. | Create import 25K (lone job) | 25K | 16:50 - 17:32 EDT (42 minutes) | Completed Successfully | 25000 | 25000 | 25000 | 25000 | 1026 | mod-inventory's CPU did not spike during the import, only right after.
5. | Create import 30K (lone job) | 30K | 18:05 - 19:11 EDT (1 hour 6 minutes) | Stuck at 99% | 30000 | 30000 | 29999 | 30000 | 1027 | mod-inventory's CPU did not spike during the import; a dip was observed about midway, and the events_cache message rate spiked at the dip.
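
Several of the runs above set kafka.consumer.max.poll.records=10, which caps the number of records a Kafka consumer receives per poll. The FOLIO modules are Java services that read this as configuration; the kafka-python snippet below is only an illustrative equivalent, with a hypothetical topic, broker, and group id:

# Illustrative equivalent of kafka.consumer.max.poll.records=10 (kafka-python).
# Topic name, broker address, and group id are hypothetical placeholders.
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "DI_INVENTORY_INSTANCE_CREATED",   # hypothetical DI event topic
    bootstrap_servers="localhost:9092",
    group_id="mod-inventory-di",
    max_poll_records=10,               # at most 10 records per poll()
)

for message in consumer:               # iterating polls under the hood
    print(message.topic, message.offset)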

Checkin-Checkout

Ran Checkin-Checkout with 20 users for 1 hour; a sketch of one iteration follows the Grafana link below.

Relatively stable, except that a few requests failed.

Grafana dashboard - http://carrier-io.int.folio.ebsco.com/grafana/d/q69rYQlik/jmeter-performance?orgId=1&var-percentile=95&var-test_type=baseline&var-test=circulation_checkInCheckOut&var-env=int&var-grouping=1s&var-low_limit=250&var-high_limit=700&var-db_name=jmeter&var-sampler_type=All&from=1624636077093&to=1624640056950
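
For reference, each iteration of this workload reduces to two mod-circulation calls through Okapi. A minimal sketch, assuming the standard check-out-by-barcode / check-in-by-barcode endpoints; the host, tenant, token, barcodes, and service point id are placeholders (the real load was driven through the carrier-io/JMeter stack):

# One check-out/check-in iteration against mod-circulation via Okapi.
# OKAPI URL, tenant, token, and all ids/barcodes are placeholders.
from datetime import datetime, timezone
import requests

OKAPI = "https://okapi.example.org"
HEADERS = {"X-Okapi-Tenant": "diku", "X-Okapi-Token": "<token>"}

def check_out(item_barcode: str, user_barcode: str, service_point_id: str) -> requests.Response:
    # POST /circulation/check-out-by-barcode opens a loan for the item
    return requests.post(
        f"{OKAPI}/circulation/check-out-by-barcode",
        headers=HEADERS,
        json={
            "itemBarcode": item_barcode,
            "userBarcode": user_barcode,
            "servicePointId": service_point_id,
        },
    )

def check_in(item_barcode: str, service_point_id: str) -> requests.Response:
    # POST /circulation/check-in-by-barcode closes the loan again
    return requests.post(
        f"{OKAPI}/circulation/check-in-by-barcode",
        headers=HEADERS,
        json={
            "itemBarcode": item_barcode,
            "servicePointId": service_point_id,
            "checkInDate": datetime.now(timezone.utc).isoformat(),
        },
    )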

 

Hotfix-2

Data Import

Test | Profile | Load | Duration | Status | Instances | Holdings | Items | SRS MARC | Job Id | Results/Notes
1. | Create import 25K (lone job) | 25K | 06/28 4:29 PM - 5:45 PM EDT (1 hour 16 minutes) | Completed with Error | 24950 | 24950 | 24950 | 24950 | 1090 | Job failed to create all records. kafka.consumer.max.poll.records=10
2. | Create import 25K (lone job) | 25K | 06/29 2:05 PM - 3:15 PM EDT (1 hour 10 minutes) | Completed | 25000 | 25000 | 25000 | 25000 | 1093 | Created all records as expected. There was one sudden spike in mod-inventory CPU.
3. | Create import 25K (lone job) | 25K | 06/29 3:31 PM - 4:30 PM EDT (1 hour 1 minute) | Completed | 25000 | 25000 | 25000 | 25000 | 1094 | kafka.consumer.max.poll.records=10. Created all records as expected. No spikes seen.
4. | Create import 25K (lone job) | 25K | 06/29 17:35 - 18:23 EDT (48 minutes) | Completed | 25000 | 25000 | 25000 | 25000 | 1095 | kafka.consumer.max.poll.records=10. Created all records as expected. No spikes seen.
5. | Create import 50K (lone job) | 50K | 06/29 22:51 UTC - 06/30 00:52 UTC (2 hours) | Completed with Error | 50000 | 50000 | 49999 | 50000 | 1096 | kafka.consumer.max.poll.records=10. Job stuck at 99% (see the polling sketch below the table). Huge mod-inventory CPU spike midway (13 minutes) after 8 minutes of very low activity. events_cache spiked multiple times.
6. | Create import 25K (with 20 users checkin/out) | 25K | 06/30 03:56 - 04:49 UTC (53 minutes) | Completed | 25000 | 25000 | 25000 | 25000 | 1097 | kafka.consumer.max.poll.records=10
7. | Create import 25K (with 40 users checkin/out) | 25K | 06/30 4:41 PM - 5:16 PM UTC (35 minutes) | Completed | 25000 | 25000 | 25000 | 25000 | 1098 | kafka.consumer.max.poll.records=10. No spikes. Okapi CPU utilization was high (around 220%) because checkin-checkout was also running in the background.
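
"Stuck at 99%" above refers to the job's progress indicator. A minimal sketch of checking that progress through the API, assuming mod-source-record-manager's /change-manager/jobExecutions endpoint; the host, tenant, token, and job-execution UUID are placeholders (the numeric Job Ids in the tables are human-readable ids, not the UUIDs this endpoint takes):

# Hedged sketch: poll a Data Import job's progress through Okapi.
# OKAPI URL, tenant, token, and the job-execution UUID are placeholders;
# the endpoint path and response fields are assumed from mod-srm's API.
import time
import requests

OKAPI = "https://okapi.example.org"
HEADERS = {"X-Okapi-Tenant": "diku", "X-Okapi-Token": "<token>"}

def wait_for_job(job_execution_id: str, poll_seconds: int = 60) -> dict:
    """Poll until the job leaves its in-progress states, printing progress."""
    while True:
        resp = requests.get(
            f"{OKAPI}/change-manager/jobExecutions/{job_execution_id}",
            headers=HEADERS,
        )
        resp.raise_for_status()
        job = resp.json()
        print(job.get("status"), job.get("progress"))
        if job.get("status") in ("COMMITTED", "ERROR", "CANCELLED"):
            return job
        time.sleep(poll_seconds)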

Checkin-Checkout as background for DI above

DI Test # | Users | Transaction | Req/s | Min | 50th pct | 75th pct | 95th pct | 99th pct | Max | Average | Latency | Grafana dashboard link | Results/Notes
6. | 20 | Total Check-in | 19.495 | 0.356 | 1.397 | 1.957 | 3.831 | 5.999 | 10.34 | 1.695 | 3.4 | 6/29/2021 11:55PM-00:55AM EDT | High CPU utilization for Okapi (180%), which is normal for 20 users
   |    | Total Check-out | 60.657 | 1.326 | 3.03 | 4.022 | 8.202 | 12.924 | 31.539 | 3.669 | 7.076 | |
7. | 40 | Total Check-in | 28.509 | 0.413 | 2.706 | 3.971 | 7.911 | 10.883 | 19.222 | 3.354 | 6.578 | 6/30/2021 12:20PM-1:40PM EDT | High CPU utilization for Okapi (245%), which is normal for 40 users
   |    | Total Check-out | 86.383 | 1.669 | 6.111 | 8.417 | 17.415 | 27.235 | 53.761 | 7.551 | 14.468 | |

Checkin-Checkout Standalone

Test | Users | Transaction | Req/s | Min | 50th pct | 75th pct | 95th pct | 99th pct | Max | Average | Latency | Grafana dashboard link | Results/Notes
1. | 20 | Total Check-in | 24.755 | 0.326 | 0.807 | 1.039 | 1.551 | 2.284 | 13.608 | 0.891 | 1.479 | 6/28/2021 2:42PM-3:50PM EDT | High CPU utilization for Okapi (140%), which is normal for 20 users
   |    | Total Check-out | 75.733 | 1.235 | 1.926 | 2.228 | 3.172 | 4.751 | 12.905 | 2.081 | 3.014 | |
2. | 40 | Total Check-in | 36.529 | 0.349 | 1.724 | 2.356 | 4.675 | 6.746 | 13.136 | 2.061 | 4.02 | 6/30/2021 10:14AM-11:22AM EDT | High CPU utilization for Okapi (250%), which is normal for 40 users
   |    | Total Check-out | 109.9 | 1.44 | 4.037 | 5.346 | 10.508 | 15.765 | 32.952 | 4.844 | 8.978 | |
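
For reference, the summary statistics in these tables can be recomputed from raw per-request sample times; a minimal sketch, assuming a hypothetical one-column CSV export of response times in seconds (the real numbers come from the carrier-io/Grafana stack):

# Recompute the tables' summary statistics from raw per-request times.
# "checkin_samples.csv" is a hypothetical one-column export, in seconds.
import numpy as np

times = np.loadtxt("checkin_samples.csv")

print(f"Min     : {times.min():.3f}")
for pct in (50, 75, 95, 99):
    print(f"{pct}th pct: {np.percentile(times, pct):.3f}")
print(f"Max     : {times.max():.3f}")
print(f"Average : {times.mean():.3f}")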

Observations

Accidentally ran a parallel import in another cluster (cap2) during Test 1 above. A parallel import does not create all of the expected records.