Capacity Testing with m7g KRaft mode MSK cluster

Overview

This document contains the results of testing the Check-out workflow (Check-in had functional issues at the time) and Data Import for MARC Bibliographic records with the PTF - Create-3 job profile in the Quesnelia release with MSK in KRaft mode. Tests were conducted on the qcpt environment. The main goal is to see how the number of topics and the number of messages affect the resource usage of the MSK cluster and the main KPIs of the tested flows. Tests should be run with different numbers of Kafka brokers (2, 4, 6, 8) to reach the required load: 100,000 partitions in sync replicas.

Ticket: PERF-939

Summary

  • Defining capacity for the MSK instance type kafka.m7g.xlarge while changing the number of topics and messages
    • When the number of brokers changes for the same number of topics and synced messages, the best performance is observed with 6 brokers: data imports are stable and the check-out average response time is 1.1 seconds, which is no longer than in the baseline test results with the mcpt cluster.
    • Additional load from the script that generates messages on topics not involved in data import and check-out does not affect the performance of these flows.
    • The test with 6 brokers and 200,000 synced messages failed. Kafka was not stable, with CPU above 90%.
    • The test with 8 brokers and 200,000 synced messages was successful, so the cluster can handle even this doubled load.
    • The mod-pubsub-b module plays an important role during testing and affects the CPU utilization of the MSK brokers.
  • Resource utilization
    • 1st test: kafka.m7g.xlarge, 2 brokers, total topics: 6690, total partitions: 48062
      • Module CPU utilization: mod-pubsub-b - 82%, mod-inventory-b - 38%, mod-quick-marc-b - 35%, mod-dcb-b - 27%, okapi-b - 19%, mod-data-import-b - 17%

      • Module Memory: mod-inventory-b - 93%, mod-dcb-b - 88%, mod-permissions-b - 82%, mod-data-import-b - 63%, mod-quick-marc-b - 62%, mod-source-record-storage-b - 61%, okapi-b - 60%, mod-search-b - 58%

    • 5th test: kafka.m7g.xlarge, 4 brokers, total topics: 6743, total partitions: online - 55872, in sync replicas - 111748
      • Module CPU utilization: mod-inventory-b - 56%, mod-quick-marc-b - 38%, mod-dcb-b - 26%, mod-pubsub-b - 23%, mod-di-converter-storage-b - 16%, okapi-b - 16%, mod-configuration-b - 15%, mod-users-b - 13%, mod-search-b - 12%

      • Module Memory: 
    • 6th test: kafka.m7g.xlarge, 6 brokers, total topics: 6743, total partitions: online - 55872, in sync replicas - 111748
      • Module CPU utilization: spike of mod-data-import - 189%, mod-inventory - 86%, mod-quick-marc - 55%
      • Module Memory: mod-dcb-b - 104%, mod-inventory-b - 88%, mod-quick-marc-b - 62%, okapi-b - 60%, mod-search-b - 60%, mod-source-record-manager-b - 59%

Recommendations & Jiras

  • The mod-pubsub-b module was not stable: its containers stopped because they ran out of memory. Allocating more resources could resolve the problem. The last configuration for this module: "cpu": 0, "memoryReservation": 2048 / -XX:MaxMetaspaceSize=512m -Xmx2500m
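
As a rough sanity check of the configuration quoted above, the sketch below adds the JVM heap ceiling to the metaspace ceiling and compares the sum with the container's soft memory limit. The function name is hypothetical, and treating heap plus metaspace as a lower bound on JVM memory demand is an assumption made for illustration; real usage also includes thread stacks, code cache, and direct buffers.

```python
def jvm_ceiling_mib(xmx_mib: int, max_metaspace_mib: int) -> int:
    """Lower bound on what the JVM may claim: max heap plus max metaspace."""
    return xmx_mib + max_metaspace_mib


# Values taken from the last mod-pubsub-b configuration quoted above.
memory_reservation_mib = 2048   # ECS soft limit ("memoryReservation")
xmx_mib = 2500                  # -Xmx2500m
max_metaspace_mib = 512         # -XX:MaxMetaspaceSize=512m

ceiling = jvm_ceiling_mib(xmx_mib, max_metaspace_mib)
overshoot = ceiling - memory_reservation_mib
print(f"JVM ceiling: {ceiling} MiB, soft limit: {memory_reservation_mib} MiB, "
      f"overshoot: {overshoot} MiB")
```

With these numbers the heap and metaspace ceilings alone exceed the reservation by 964 MiB. Whether that leads to a container stop depends on the hard limit, which is not stated for this configuration.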

Test Runs 

Test configurations

# | MSK instance type | Brokers # | Replication factor | Partitions online # | Partitions in replicas # | Scenario | Load level
1 | kafka.m7g.xlarge KRaft | 2, 6, 8 | 2 | ~100k (96426); ~50k (48062) | 192852; 96124 | CICO + DI MARC Bib Create | 5 users + 1 single record concurrently on 15 tenants + 10k on 1 tenant
2 | kafka.m7g.xlarge KRaft | 4 | 2 | ~50k (55872) | 111748 | CICO + DI MARC Bib Create | 5 users + 1 single record concurrently on 15 tenants + 10k on 1 tenant, 10k on 15 tenants
3 | kafka.m7g.xlarge KRaft | 6 | 2 | ~50k (55872) | 111748 | CICO + DI MARC Bib Create | 5 users + 1 single record concurrently on 15 tenants + 10k on 1 tenant
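
The relation between online partitions, replication factor, and partitions in replicas can be sketched as simple arithmetic. The function names below are hypothetical, and the per-broker figure assumes replicas are spread evenly across brokers, which this report does not confirm.

```python
def replicas_total(online_partitions: int, replication_factor: int) -> int:
    """Partition replicas across the cluster: online partitions x replication factor."""
    return online_partitions * replication_factor


def replicas_per_broker(online_partitions: int, replication_factor: int, brokers: int) -> float:
    """Average replicas hosted per broker, assuming an even spread (an assumption)."""
    return replicas_total(online_partitions, replication_factor) / brokers


# (online partitions, replication factor, brokers) as reported for the test runs.
runs = [
    (48062, 2, 2),
    (96426, 2, 6),
    (96426, 2, 8),
    (55872, 2, 4),
    (55872, 2, 6),
]

for online, rf, brokers in runs:
    total = replicas_total(online, rf)
    per_broker = replicas_per_broker(online, rf, brokers)
    print(f"{brokers} brokers: {total} replicas, ~{per_broker:.0f} per broker")
```

For the ~50k runs the product is 2 x 55872 = 111744, four fewer than the 111748 reported; the report does not explain the difference.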

Test Results

  • 1st test: kafka.m7g.xlarge, 2 brokers, total topics: 6690, total partitions: 48062 - qcpt pointed to the MSK cluster
    • MSK CPU utilization - 85%
    • MSK disk usage - 0.5%
    • Check-out flow - 1.6 seconds without data import, 1.4 seconds with data import
    • Data import Create job profile with 10k file on 1 tenant - 46 minutes, completed successfully
    • Data import on 15 tenants concurrently - lasted longer than 4 hours and completed with errors (No actions reason)
  • 2nd test: kafka.m7g.xlarge, 6 brokers, total topics: 7335, total partitions: 96426 - qcpt + qcon pointed to the MSK cluster
    • MSK CPU utilization - 93%, not stable
    • Check-out flow - testing could not be started
    • DI - testing could not be started
  • 3rd test: kafka.m7g.xlarge, 8 brokers, total topics: 7335, total partitions: 96426 - qcpt + qcon pointed to the MSK cluster
    • MSK CPU utilization - 67% on brokers 1, 2; 45% on brokers 3, 4, 5, 6, 7, 8
    • Check-out flow - 1.1 seconds
    • Data import 1 single record 15 tenants duration - 55 seconds
  • 4th test (retest of test #3 after brokers reboot): kafka.m7g.xlarge, 8 brokers, total topics: 7335, total partitions: 96426 - qcpt + qcon pointed to the MSK cluster
    • MSK CPU utilization - 90% on brokers 1, 2; 70% on brokers 3, 4, 5, 6, 7, 8. The cluster was unstable for an hour; after stabilization the tests completed successfully
    • Check-out flow - 1.6 seconds
    • Data import 1 single record 15 tenants duration - 44 seconds
    • Data import Create job profile with 10k file on 1 tenant - 16 minutes, completed successfully
  • 5th test: kafka.m7g.xlarge, 4 brokers, total topics: 6743, total partitions: online - 55872, in sync replicas - 111748 - qcpt pointed to the MSK cluster
    • MSK CPU utilization - 79% on brokers 1, 2; 62% on brokers 9, 10
    • MSK CPU utilization with Data import 10k - 89% on all brokers
    • Check-out flow - 1.1 seconds
    • Data import 1 single record 15 tenants duration - 1 minute 20 seconds
    • Data import Create job profile with 10k file on 1 tenant - 10 minutes
    • Data import Create job profile with 10k file on 15 tenants - 2 hours 25 minutes
  • 6th test: kafka.m7g.xlarge, 6 brokers, total topics: 6743, total partitions: online - 55872, in sync replicas - 111748 - qcpt pointed to the MSK cluster
    • MSK CPU utilization with DI single record - 70% on brokers 1, 2; 60% on brokers 9, 10, 11, 12
    • MSK CPU utilization with Data import 10k - 85%
    • Check-out flow - 1.2 seconds
    • Data import 1 single record 15 tenants duration - 1 minute 40 seconds
    • Data import Create job profile with 10k file on 1 tenant - 6 minutes 40 seconds
    • Data import Create job profile with 10k file on 15 tenants - 1 hour 30 minutes, completed successfully on all tenants
  • 7th test: kafka.m7g.xlarge, 6 brokers, total topics: 6743, total partitions: online - 55872, in sync replicas - 111748 - qcpt pointed to the MSK cluster. Retest of #6 with additional load from the message creation script (./folio-topics-load-messages.sh  NUM_RECORDS=100000).
    • The mod-pubsub-b module plays an important role during testing and affects the CPU utilization of the MSK brokers.
    • MSK CPU utilization with DI single record - 52% on brokers 1, 2; 37% on brokers 9, 10, 11, 12
    • MSK CPU utilization with Data import 10k - 70%
    • Check-out flow - 1.1 seconds
    • Data import 1 single record 15 tenants duration - 1 minute 10 seconds
    • Data import Create job profile with 10k file on 1 tenant - 6 minutes 20 seconds
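
To put the 4-broker and 6-broker data import durations side by side, the minimal sketch below converts the reported times to `timedelta` values and computes the ratio between them. The dictionary layout and the `speedup` helper are illustrative assumptions, not part of the test tooling.

```python
from datetime import timedelta

# Durations reported for the 10k Create job on the ~50k-partition cluster.
one_tenant = {
    4: timedelta(minutes=10),             # 5th test, 4 brokers
    6: timedelta(minutes=6, seconds=40),  # 6th test, 6 brokers
}
fifteen_tenants = {
    4: timedelta(hours=2, minutes=25),    # 5th test, 4 brokers
    6: timedelta(hours=1, minutes=30),    # 6th test, 6 brokers
}


def speedup(durations: dict, slow_brokers: int = 4, fast_brokers: int = 6) -> float:
    """Ratio of the slower duration to the faster one (timedelta / timedelta is a float)."""
    return durations[slow_brokers] / durations[fast_brokers]


print(f"10k on 1 tenant:   {speedup(one_tenant):.2f}x faster with 6 brokers")
print(f"10k on 15 tenants: {speedup(fifteen_tenants):.2f}x faster with 6 brokers")
```

Each figure comes from a single run, so the ratios indicate a direction rather than a measured scaling factor.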

Response time



Test | Start time | End time
1st test | 8/7/24, 9:33 AM | 8/7/24, 9:53 AM
2nd test | 8/7/24, 10:47 AM | 8/7/24, 11:47 AM
3rd test | 8/7/24, 12:41 PM | 8/7/24, 1:42 PM
4th test | 8/16/24, 9:09 AM | 8/16/24, 10:09 AM
5th test ? | 8/8/24, 9:19 AM | 8/8/24, 10:20 AM
6th test | 8/13/24, 10:10 AM | 8/13/24, 11:10 AM
7th test | 8/14/24, 12:01 PM | 8/14/24, 1:01 PM
8th test | 8/16/24, 9:09 AM | 8/16/24, 10:09 AM
11th test - 4 brokers, online partitions: ~50k, replication factor: 2 | 8/16/24, 2:16 PM | 8/16/24, 3:16 PM
12th test - 6 brokers, online partitions: ~50k, replication factor: 2 | 8/19/24, 9:44 AM | 8/19/24, 10:44 AM
16th test - 6 brokers, online partitions: ~50k, replication factor: 2 | 8/23/24, 12:10 PM | 8/23/24, 1:10 PM
Response times (ms) per request. Each cell lists #Samples / Average / 90th pct / 95th pct; the 16th test column lists Average / 90th pct / 95th pct only.

Label | 1st test | 2nd test | 3rd test | 4th test | 5th test ? | 6th test | 7th test | 8th test | 11th test | 12th test | 16th test
CICO_TC_Check-Out Controller_cs00000001_0015 | 122 / 1505.3 / 1750.7 / 1863.3 | 406 / 1414.2 / 1633.6 / 1769.2 | 406 / 1351.12 / 1561.9 / 1709 | 406 / 1175.1 / 1369.8 / 1498.9 | 406 / 1701.7 / 2318.9 / 3290.7 | 407 / 1132.8 / 1304.4 / 1392.6 | 403 / 1492.2 / 1779.4 / 2400.8 | 406 / 1175.1 / 1369.8 / 1498.9 | 406 / 1173.78 / 1385.2 / 1572.9 | 407 / 1241.01 / 1576.2 / 1747.2 | -
CICO_TC_Check-Out Controller_cs00000001_0014 | 122 / 1478.6 / 1734.4 / 1894 | 406 / 1409.9 / 1620.5 / 1765.9 | 406 / 1350.8 / 1584.1 / 1693 | 407 / 1175.1 / 1362.2 / 1543.2 | 406 / 1661.8 / 2378.5 / 3146.6 | 406 / 1127.9 / 1281 / 1369.7 | 404 / 1594.8 / 1681.5 / 3296.5 | 407 / 1175.1 / 1362.2 / 1543.2 | 407 / 1165.5 / 1366.2 / 1499.6 | 407 / 1236.14 / 1570.4 / 1710 | 1144.4 / 1515.8 / 1676
CICO_TC_Check-Out Controller_cs00000001_0013 | 122 / 1601.8 / 1655.9 / 1797.6 | 406 / 1383.8 / 1654.6 / 1842.3 | 406 / 1356.46 / 1575.9 / 1811 | 407 / 1159 / 1399.6 / 1562 | 406 / 1673 / 2366.8 / 2985.6 | 407 / 1096.1 / 1264.8 / 1356.4 | 404 / 1606.6 / 1894 / 4181.5 | 407 / 1159 / 1399.6 / 1562 | 406 / 1166.09 / 1360.5 / 1493.9 | 407 / 1198.18 / 1521.6 / 1634.4 | 1127.59 / 1427.8 / 1596.8
CICO_TC_Check-Out Controller_cs00000001_0012 | 122 / 1560 / 1798.1 / 2136.2 | 406 / 1440.9 / 1639.2 / 1807.3 | 406 / 1388.25 / 1605.2 / 1742 | 407 / 1174.9 / 1381.8 / 1493.6 | 406 / 1727.2 / 2568.6 / 3828.3 | 407 / 1126.2 / 1290.6 / 1380.4 | 404 / 1651.9 / 2123.5 / 3857.8 | 407 / 1174.9 / 1381.8 / 1493.6 | 407 / 1206.21 / 1376.8 / 1521.6 | 407 / 1225.12 / 1566.4 / 1720.8 | 1148.71 / 1507.8 / 1718.4
CICO_TC_Check-Out Controller_cs00000001_0011 | 122 / 1548 / 1762.5 / 2041.1 | 406 / 1460.2 / 1642.3 / 1862.9 | 406 / 1386.67 / 1668.6 / 1790 | 407 / 1188.1 / 1425 / 1598.4 | 406 / 1706 / 2504.9 / 3224.7 | 407 / 1135.1 / 1287.4 / 1436.2 | 404 / 1537.6 / 1836.5 / 2978.5 | 407 / 1188.1 / 1425 / 1598.4 | 407 / 1193.97 / 1394.8 / 1521.2 | 407 / 1235.29 / 1587.8 / 1735.2 | 1154.3 / 1530.4 / 1721.6
CICO_TC_Check-Out Controller_cs00000001_0010 | 122 / 1534.2 / 1892.2 / 2127.6 | 406 / 1418.5 / 1603.2 / 1705 | 406 / 1397.73 / 1625.5 / 1777 | 407 / 1184 / 1419.4 / 1536.2 | 406 / 1787.9 / 2605 / 4088.5 | 407 / 1135.9 / 1293.8 / 1408.6 | 404 / 1679.3 / 1869 / 4800 | 407 / 1184 / 1419.4 / 1536.2 | 407 / 1191.1 / 1391.6 / 1588.6 | 407 / 1254.72 / 1590.4 / 1774.6 | 1147.41 / 1502.6 / 1755.4
CICO_TC_Check-Out Controller_cs00000001_0009 | 122 / 1525.5 / 1838.9 / 2048.1 | 406 / 1416.8 / 1604.4 / 1815.2 | 406 / 1399.74 / 1631.8 / 1797 | 407 / 1171.7 / 1363.2 / 1560 | 406 / 1701.4 / 2289.3 / 3402.1 | 407 / 1138.6 / 1277.2 / 1354.2 | 404 / 1670.8 / 2350 / 4237.3 | 407 / 1171.7 / 1363.2 / 1560 | 407 / 1199.58 / 1413.6 / 1562.6 | 407 / 1247.65 / 1571 / 1735.6 | 1156 / 1499.2 / 1697.4
CICO_TC_Check-Out Controller_cs00000001_0008 | 122 / 1539 / 1787.1 / 2033.9 | 406 / 1422.6 / 1634.5 / 1778.7 | 406 / 1346.01 / 1582.7 / 1764 | 407 / 1193.4 / 1384.8 / 1622.6 | 406 / 1734.9 / 2462.4 / 3578.7 | 407 / 1136.1 / 1323.4 / 1454.2 | 404 / 1673.1 / 2212 / 3792.3 | 407 / 1193.4 / 1384.8 / 1622.6 | 407 / 1174.81 / 1343.6 / 1497.4 | 407 / 1244.71 / 1586.4 / 1746.6 | 1155.01 / 1569 / 1703.2
CICO_TC_Check-Out Controller_cs00000001_0007 | 122 / 1525 / 1833.3 / 2065.4 | 406 / 1433.9 / 1627.9 / 1785.2 | 406 / 1400.38 / 1650.4 / 1739 | 408 / 1182.9 / 1420.4 / 1596.9 | 407 / 1714.9 / 2321 / 3529.8 | 407 / 1132.6 / 1287.4 / 1396.6 | 404 / 1558.1 / 1638 / 2489.3 | 408 / 1182.9 / 1420.4 / 1596.9 | 407 / 1205.18 / 1391.2 / 1549.6 | 407 / 1249.27 / 1561.4 / 1798.6 | 1148.81 / 1517.6 / 1722.4
CICO_TC_Check-Out Controller_cs00000001_0006 | 122 / 1537.5 / 1796.5 / 1908.1 | 406 / 1418.5 / 1634.2 / 1748.1 | 407 / 1351.72 / 1583.6 / 1678 | 407 / 1176.2 / 1416 / 1576.6 | 406 / 1640.8 / 2279.5 / 2843.2 | 407 / 1137.8 / 1306.4 / 1424.2 | 404 / 1565.8 / 1780 / 2357.8 | 407 / 1176.2 / 1416 / 1576.6 | 407 / 1211.39 / 1360 / 1538.4 | 407 / 1238.76 / 1584 / 1784.2 | 1154.56 / 1510 / 1733.2
CICO_TC_Check-Out Controller_cs00000001_0005 | 122 / 1675.3 / 1889.5 / 2173.8 | 406 / 1370.8 / 1615.2 / 1756.9 | 407 / 1411.17 / 1586.2 / 1704 | 408 / 1179.3 / 1406.3 / 1596.7 | 407 / 1687.1 / 2339.4 / 3068.6 | 407 / 1126.8 / 1300.6 / 1385.6 | 404 / 1544.5 / 1798 / 2730.3 | 408 / 1179.3 / 1406.3 / 1596.7 | 407 / 1176.38 / 1390.6 / 1506.2 | 407 / 1245.97 / 1634.6 / 1830.2 | 1151.18 / 1497.2 / 1705.6
CICO_TC_Check-Out Controller_cs00000001_0004 | 122 / 1485.2 / 1758.1 / 1855.8 | 406 / 1429.4 / 1644.3 / 1755.7 | 407 / 1344.81 / 1585.2 / 1689 | 408 / 1176.2 / 1373.5 / 1496.9 | 407 / 1674.7 / 2239.8 / 3014.4 | 407 / 1132.6 / 1302 / 1379.4 | 404 / 1459.7 / 1535.5 / 2051 | 408 / 1176.2 / 1373.5 / 1496.9 | 407 / 1195.11 / 1435.4 / 1611.6 | 407 / 1245.72 / 1610.2 / 1733.4 | 1151.02 / 1524.4 / 1738.4
CICO_TC_Check-Out Controller_cs00000001_0003 | 122 / 1546.2 / 1866.8 / 1986 | 406 / 1453.4 / 1693.8 / 1911.7 | 407 / 1370.7 / 1613 / 1736 | 407 / 1195.1 / 1419.2 / 1550.6 | 407 / 1695.3 / 2292.4 / 3515.2 | 408 / 1131.1 / 1312.2 / 1402.8 | 404 / 1607.1 / 1808 / 3617 | 407 / 1195.1 / 1419.2 / 1550.6 | 407 / 1183.62 / 1401 / 1547.8 | 408 / 1240.08 / 1565.1 / 1701 | 1157.1 / 1532.1 / 1732.8
CICO_TC_Check-Out Controller_cs00000001_0002 | 122 / 1683.2 / 1786.1 / 2017.4 | 406 / 1361.6 / 1595.9 / 1691.6 | 407 / 1350.79 / 1564 / 1705 | 408 / 1190.9 / 1394.6 / 1601.1 | 407 / 1680.3 / 2253.6 / 3284.8 | 407 / 1122.5 / 1257.4 / 1361.8 | 405 / 1644.4 / 1938.8 / 4028.2 | 408 / 1190.9 / 1394.6 / 1601.1 | 407 / 1163.45 / 1391.4 / 1459 | 407 / 1234.21 / 1561.8 / 1734.6 | 1145.47 / 1495.5 / 1711.6
CICO_TC_Check-Out Controller_cs00000001_0001 | 123 / 1552.7 / 1811.2 / 1990.6 | 406 / 1395.3 / 1677.6 / 1833 | 407 / 1365.9 / 1610 / 1747 | 408 / 1180.3 / 1390 / 1533.6 | 407 / 1715.6 / 2403.2 / 3431 | 407 / 1130 / 1290.6 / 1361 | 404 / 1687.1 / 2228 / 4428.5 | 408 / 1180.3 / 1390 / 1533.6 | 407 / 1167.53 / 1396.6 / 1504.8 | 407 / 1235 / 1564.4 / 1720 | 1156.13 / 1542.3 / 1692
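
A single check-out figure for a run can be derived from the per-thread rows by averaging the "Average" column. The sketch below does this for the 12th test (6 brokers, ~50k online partitions), with values read from the table above. It uses an unweighted mean, which is an assumption; the sample counts per thread are nearly identical (407 or 408), so weighting would change little.

```python
from statistics import mean

# "Average" response times (ms) per check-out thread, 12th test column,
# listed from thread _0015 down to _0001.
averages_ms = [
    1241.01, 1236.14, 1198.18, 1225.12, 1235.29,
    1254.72, 1247.65, 1244.71, 1249.27, 1238.76,
    1245.97, 1245.72, 1240.08, 1234.21, 1235.0,
]

overall_ms = mean(averages_ms)
print(f"Mean of per-thread averages: {overall_ms:.0f} ms ({overall_ms / 1000:.2f} s)")
```

The result is roughly 1.24 seconds.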

Service CPU Utilization


DI MARC BIB Create and Update + CICO

5th test (4 brokers, ~50k online partitions)

6th test (6 brokers, ~50k online partitions)

Spikes:

  • 10k on 1 tenant: mod-quick-marc - 134%, mod-inventory - 95%, mod-di-converter-storage - 57%
  • 10k on 15 tenants: mod-data-import - 189%, mod-inventory - 86%, mod-quick-marc - 55%

Service Memory Utilization

5th test (4 brokers, ~50k online partitions)

6th test (6 brokers, ~50k online partitions)

DB CPU Utilization

5th test (4 brokers, ~50k online partitions)

6th test (6 brokers, ~50k online partitions)

DB Connections

5th test (4 brokers, ~50k online partitions)

6th test (6 brokers, ~50k online partitions)

MSK instance resource utilization

Disk usage by broker

5th test (4 brokers, ~50k online partitions)

6th test (6 brokers, ~50k online partitions)

CPU (User) usage by broker

5th test (4 brokers, ~50k online partitions)

6th test (6 brokers, ~50k online partitions)


Appendix

Infrastructure

PTF environment: qcpt

  • 10 m6i.2xlarge EC2 instances located in US East (N. Virginia), us-east-1
  • 1 database instance, writer

    Name | Memory GiB | vCPUs | max_connections
    db.r6g.4xlarge | | | -
  • MSK ptf-KRaft-mode2
    • multiple configurations of m7g.xlarge brokers in 2 zones (1, 2, 3, 4 per zone)
    • Apache Kafka version 3.7.x, mode - KRaft

    • Cluster configuration name: fse-kafka-config revision 26
    • EBS storage volume per broker 300 GiB

    • auto.create.topics.enable=true
    • log.retention.minutes=480
    • default.replication.factor=2 (in the last test with 6 brokers the replication factor was 3)
  • OpenSearch ptf-test
    • Data nodes
      • Instance type - r6g.2xlarge.search
      • Number of nodes - 4
      • Version: OpenSearch_2_7_R20240502
    • Dedicated master nodes
      • Instance type - r6g.large.search
      • Number of nodes - 3

The task count for the mod-agreements module was set to 0 before the test start.

Modules

 All qcpt modules
Module | Task Definition Revision | Module Version | Task Count | Mem Hard Limit | Mem Soft Limit | CPU Units | Xmx | Metaspace Size | Max Metaspace Size
mod-remote-storage | 1 | mod-remote-storage:3.2.0 | 2 | 4920 | 4472 | 1024 | 3960 | 512 | 512
mod-finance-storage | 1 | mod-finance-storage:8.6.1 | 2 | 1024 | 896 | 1024 | 700 | 88 | 128
mod-ncip | 1 | mod-ncip:1.14.5 | 2 | 1024 | 896 | 128 | 768 | 88 | 128
mod-agreements | 5 | mod-agreements:7.0.5 | 2 | 4096 | 3584 | 1024 | 0 | 0 | 0
mod-ebsconet | 1 | mod-ebsconet:2.2.0 | 2 | 1248 | 1024 | 128 | 700 | 128 | 256
mod-organizations | 1 | mod-organizations:1.9.2 | 2 | 1024 | 896 | 128 | 768 | 88 | 128
mod-consortia | 1 | mod-consortia:1.1.0 | 2 | 5136 | 4776 | 1024 | 4416 | 384 | 512
edge-sip2 | 1 | edge-sip2:3.2.4 | 2 | 1024 | 896 | 128 | 768 | 88 | 128
mod-settings | 1 | mod-settings:1.0.3 | 2 | 1024 | 896 | 200 | 768 | 88 | 128
mod-serials-management | 1 | mod-serials-management:1.0.3 | 2 | 2480 | 2312 | 128 | 1792 | 384 | 512
mod-data-import | 1 | mod-data-import:3.1.1 | 1 | 2048 | 1844 | 256 | 1292 | 384 | 512
edge-dematic | 1 | edge-dematic:2.2.2 | 1 | 1024 | 896 | 128 | 768 | 88 | 128
mod-search | 1 | mod-search:3.2.6 | 2 | 2592 | 2480 | 2048 | 1440 | 512 | 1024
mod-tags | 1 | mod-tags:2.2.0 | 2 | 1024 | 896 | 128 | 768 | 88 | 128
mod-authtoken | 1 | mod-authtoken:2.15.1 | 2 | 1440 | 1152 | 512 | 922 | 88 | 128
edge-courses | 1 | edge-courses:1.4.1 | 2 | 1024 | 896 | 128 | 768 | 88 | 128
edge-inventory | 1 | edge-inventory:1.4.0 | 2 | 1024 | 896 | 128 | 768 | 88 | 128
mod-inventory-update | 1 | mod-inventory-update:3.3.1 | 2 | 1024 | 896 | 128 | 768 | 88 | 128
mod-notify | 1 | mod-notify:3.2.0 | 2 | 1024 | 896 | 128 | 768 | 88 | 128
mod-configuration | 1 | mod-configuration:5.10.0 | 2 | 1024 | 896 | 128 | 768 | 88 | 128
mod-orders-storage | 1 | mod-orders-storage:13.7.2 | 2 | 1024 | 896 | 512 | 700 | 88 | 128
edge-caiasoft | 1 | edge-caiasoft:2.2.2 | 2 | 1024 | 896 | 128 | 768 | 88 | 128
mod-login-saml | 1 | mod-login-saml:2.8.1 | 2 | 1024 | 896 | 128 | 768 | 88 | 128
mod-gobi | 1 | mod-gobi:2.8.1 | 2 | 1024 | 896 | 128 | 768 | 88 | 128
mod-licenses | 1 | mod-licenses:6.0.2 | 2 | 2480 | 2312 | 512 | 1792 | 384 | 512
mod-password-validator | 1 | mod-password-validator:3.2.0 | 2 | 1440 | 1298 | 128 | 768 | 384 | 512
edge-dcb | 1 | edge-dcb:1.1.0-SNAPSHOT.15 | 2 | 1024 | 896 | 128 | 768 | 88 | 128
mod-bulk-operations | 1 | mod-bulk-operations:2.0.2 | 2 | 3072 | 2600 | 1024 | 1536 | 384 | 512
mod-fqm-manager | 1 | mod-fqm-manager:2.0.4 | 2 | 1024 | 896 | 128 | 768 | 88 | 128
mod-graphql | 2 | mod-graphql:1.12.1 | 2 | 1024 | 896 | 128 | 768 | 88 | 128
mod-finance | 1 | mod-finance:4.9.0 | 2 | 1024 | 896 | 128 | 768 | 88 | 128
mod-batch-print | 1 | mod-batch-print:1.1.0 | 2 | 1024 | 896 | 128 | 768 | 88 | 128
mod-lists | 1 | mod-lists:2.0.6 | 2 | 1024 | 896 | 128 | 768 | 88 | 128
mod-copycat | 1 | mod-copycat:1.6.0 | 2 | 1024 | 896 | 128 | 768 | 88 | 128
mod-entities-links | 1 | mod-entities-links:3.0.1 | 2 | 2592 | 2480 | 400 | 1440 | 0 | 1024
mod-permissions | 2 | mod-permissions:6.5.0 | 2 | 1684 | 1544 | 512 | 1024 | 384 | 512
pub-edge | 1 | pub-edge:2023.06.14 | 2 | 1024 | 896 | 128 | 768 | 0 | 0
mod-orders | 1 | mod-orders:12.8.8 | 2 | 2048 | 1740 | 1024 | 1024 | 384 | 512
edge-patron | 2 | edge-patron:5.1.1 | 2 | 1024 | 896 | 256 | 768 | 88 | 128
edge-ncip | 1 | edge-ncip:1.10.0 | 2 | 1024 | 896 | 128 | 768 | 88 | 128
mod-users-bl | 1 | mod-users-bl:7.7.3 | 2 | 1440 | 1152 | 512 | 922 | 88 | 128
mod-invoice | 1 | mod-invoice:5.8.2 | 2 | 1440 | 1152 | 512 | 922 | 88 | 128
mod-inventory-storage | 1 | mod-inventory-storage:27.1.3 | 2 | 4096 | 3690 | 2048 | 3076 | 384 | 512
edge-ea-data-export | 1 | edge-ea-data-export:4.2.0 | 2 | 1024 | 896 | 128 | 768 | 88 | 128
mod-user-import | 1 | mod-user-import:3.8.0 | 2 | 1024 | 896 | 128 | 768 | 88 | 128
mod-sender | 1 | mod-sender:1.12.0 | 2 | 1024 | 896 | 128 | 768 | 88 | 128
edge-oai-pmh | 1 | edge-oai-pmh:2.9.1 | 2 | 1512 | 1360 | 1024 | 1440 | 384 | 512
mod-data-export-worker | 1 | mod-data-export-worker:3.2.4 | 2 | 3072 | 2800 | 1024 | 2048 | 384 | 512
mod-rtac | 1 | mod-rtac:3.6.0 | 2 | 1024 | 896 | 128 | 768 | 88 | 128
mod-task-list | 1 | mod-task-list:1.9.2 | 2 | 1024 | 896 | 128 | 768 | 88 | 128
mod-circulation-storage | 1 | mod-circulation-storage:17.2.1 | 2 | 2880 | 2592 | 1536 | 1814 | 384 | 512
mod-calendar | 1 | mod-calendar:3.1.0 | 2 | 1024 | 896 | 128 | 768 | 88 | 128
mod-source-record-storage | 1 | mod-source-record-storage:5.8.5 | 2 | 5600 | 5000 | 2048 | 3500 | 384 | 512
mod-event-config | 1 | mod-event-config:2.7.1 | 2 | 1024 | 896 | 128 | 768 | 88 | 128
mod-courses | 1 | mod-courses:1.4.10 | 2 | 1024 | 896 | 128 | 768 | 88 | 128
mod-circulation-item | 3 | mod-circulation-item:1.0.0-SNAPSHOT.12 | 2 | 1024 | 896 | 128 | 0 | 0 | 0
mod-inventory | 1 | mod-inventory:20.2.6 | 2 | 2880 | 2592 | 1024 | 1814 | 384 | 512
mod-email | 1 | mod-email:1.17.0 | 2 | 2800 | 2550 | 512 | 1800 | 384 | 512
mod-di-converter-storage | 1 | mod-di-converter-storage:2.2.2 | 2 | 1024 | 896 | 128 | 768 | 88 | 128
mod-circulation | 1 | mod-circulation:24.2.5 | 2 | 2880 | 2592 | 1536 | 1814 | 384 | 512
mod-pubsub | 1 | mod-pubsub:2.13.1 | 2 | 1536 | 1440 | 1024 | 922 | 384 | 512
edge-rtac | 1 | edge-rtac:2.7.2 | 2 | 1024 | 896 | 128 | 768 | 88 | 128
edge-orders | 1 | edge-orders:3.0.2 | 2 | 1024 | 896 | 128 | 768 | 88 | 128
mod-template-engine | 1 | mod-template-engine:1.20.0 | 2 | 1024 | 896 | 128 | 768 | 88 | 128
mod-users | 1 | mod-users:19.3.1 | 2 | 1024 | 896 | 128 | 768 | 88 | 128
mod-patron-blocks | 1 | mod-patron-blocks:1.10.0 | 2 | 1024 | 896 | 1024 | 768 | 88 | 128
edge-fqm | 1 | edge-fqm:2.0.1 | 2 | 1024 | 896 | 128 | 768 | 88 | 128
mod-audit | 1 | mod-audit:2.9.0 | 2 | 1024 | 896 | 128 | 768 | 88 | 128
mod-source-record-manager | 1 | mod-source-record-manager:3.8.5 | 2 | 5600 | 5000 | 2048 | 3500 | 384 | 512
nginx-edge | 1 | nginx-edge:2023.06.14 | 2 | 1024 | 896 | 128 | 0 | 0 | 0
mod-quick-marc | 1 | mod-quick-marc:5.1.1 | 1 | 2288 | 2176 | 128 | 1664 | 384 | 512
nginx-okapi | 1 | nginx-okapi:2023.06.14 | 2 | 1024 | 896 | 1024 | 0 | 0 | 0
okapi-b | 1 | okapi:5.3.0 | 3 | 1684 | 1440 | 1024 | 922 | 384 | 512
mod-feesfines | 1 | mod-feesfines:19.1.0 | 2 | 1024 | 896 | 128 | 768 | 88 | 128
mod-invoice-storage | 1 | mod-invoice-storage:5.8.1 | 2 | 1872 | 1536 | 1024 | 1024 | 384 | 512
edge-users | 1 | edge-users:1.2.0 | 2 | 1024 | 896 | 128 | 768 | 88 | 128
mod-service-interaction | 1 | mod-service-interaction:4.0.2 | 2 | 2048 | 1844 | 256 | 1290 | 384 | 512
mod-dcb | 1 | mod-dcb:1.1.0-SNAPSHOT.17 | 2 | 1024 | 896 | 128 | 768 | 88 | 128
mod-data-export | 3 | mod-data-export:5.0.4 | 1 | 2304 | 2048 | 1024 | 1536 | 88 | 256
mod-patron | 1 | mod-patron:6.1.0 | 2 | 1024 | 896 | 128 | 768 | 88 | 128
mod-oai-pmh | 1 | mod-oai-pmh:3.13.2 | 2 | 4096 | 3690 | 2048 | 3076 | 384 | 512
edge-connexion | 1 | edge-connexion:1.3.0 | 2 | 1024 | 896 | 128 | 768 | 88 | 128
mod-notes | 1 | mod-notes:5.2.0 | 2 | 1024 | 896 | 128 | 952 | 384 | 512
mod-kb-ebsco-java | 1 | mod-kb-ebsco-java:4.0.0 | 2 | 1024 | 896 | 128 | 768 | 88 | 128
mod-login | 1 | mod-login:7.11.1 | 2 | 1440 | 1298 | 1024 | 768 | 384 | 512
mod-data-export-spring | 1 | mod-data-export-spring:3.2.2 | 1 | 2048 | 1844 | 256 | 1536 | 384 | 512
mod-organizations-storage | 1 | mod-organizations-storage:4.7.0 | 2 | 1024 | 896 | 128 | 768 | 88 | 128
pub-okapi | 1 | pub-okapi:2023.06.14 | 2 | 1024 | 896 | 128 | 768 | 0 | 0
mod-eusage-reports | 1 | mod-eusage-reports:2.0.0 | 2 | 1024 | 896 | 128 | 768 | 88 | 128
edge-erm | 1 | edge-erm:1.0.0 | 2 | 1024 | 896 | 128 | 768 | 88 | 128

Methodology/Approach

  • Populate the ptf-KRaft-mode2 cluster with topics from the tenant cluster for qcpt
  • Create the list of unconsolidated topics for all modules that are involved in data import and check-in/check-out
    • The list of modules to unconsolidate

      1. mod-search
      2. mod-orders-storage
      3. mod-inn-reach
      4. mod-data-export-worker
      5. mod-data-import
      6. mod-bulk-operations
      7. mod-pubsub
      8. mod-dcb (doesn't exist yet)
      9. mod-quick-marc
      10. mod-inventory
      11. mod-invoice
      12. mod-search-ebsco
      13. mod-consortia
      14. mod-circulation-storage
      15. mod-remote-storage
      16. mod-source-record-manager
      17. mod-data-export-spring
      18. mod-audit
      19. mod-source-record-storage
      20. mod-linked-data
      21. mod-entities-links
      22. mod-circulation
      23. mod-settings
      24. mod-orders
      25. mod-inventory-storage
      26. mod-users
      For completeness' sake, and for the future, here are the Eureka Kafka modules:
      1. mod-users-keycloak
      2. mgr-tenant-entitlements
      3. mgr-applications
      4. mod-consortia-keycloak

  • Run a smoke CICO test for 20 minutes and a single data import on a random tenant involved in testing
  • Run CICO for 1 hour and the 10k DI create job for 15 tenants concurrently
  • Adjust the sh script to generate additional messages in topics that are not related to the tested flows
  • Repeat testing
  • Change the number of brokers (2, 4, 6, 8); to troubleshoot possible issues, reboot the brokers after changing the number of brokers
  • Compare MSK resource utilization and the main KPIs for CICO & DI
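
Selecting topics that are not related to the tested flows could look like the sketch below. It is a Python illustration only, not the contents of the sh script used in the tests. The marker substrings and the topic names are hypothetical examples and would need to match the real topic naming on the cluster.

```python
from typing import Iterable, List, Sequence

# Hypothetical markers: substrings identifying topics used by data import
# and check-in/check-out. The real list would come from the modules above.
TESTED_FLOW_MARKERS: Sequence[str] = ("DI_", "inventory", "circulation")


def unrelated_topics(topics: Iterable[str],
                     markers: Sequence[str] = TESTED_FLOW_MARKERS) -> List[str]:
    """Return topics whose names contain none of the tested-flow markers."""
    lowered = [m.lower() for m in markers]
    return [t for t in topics if not any(m in t.lower() for m in lowered)]


# Hypothetical topic names, for illustration only.
all_topics = [
    "qcpt.Default.tenant1.DI_COMPLETED",
    "qcpt.tenant1.inventory.item",
    "qcpt.tenant1.orders.order",
    "qcpt.tenant1.users.userGroup",
]

for topic in unrelated_topics(all_topics):
    print(topic)  # candidates for additional message load
```

Keeping the filter separate from message production makes it easy to review the target list before any load is generated.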

Additional/Files

Scripts can be found in the S3 bucket: fse-ptf/capacity_testing/qcpt

Topics: