Capacity Testing with m7g KRaft mode MSK cluster
Overview
This document contains the results of testing the Check-out workflow (Check-in had a functional issue at the time of testing) and Data Import for MARC Bibliographic records with the PTF - Create-3 job profile in the Quesnelia release with MSK in KRaft mode. Tests were conducted on the qcpt environment. The main idea is to see how the number of topics and the number of messages affect the resource usage of the MSK cluster and the main KPIs of the tested flows. Tests should be performed with a varying number of Kafka brokers (2, 4, 6, 8) to reach the necessary load: 100,000 partitions in in-sync replicas.
Ticket: PERF-939
Summary
- The goal was to define the capacity of the MSK instance type kafka.m7g.xlarge while changing the number of topics and messages.
- When changing the number of brokers for the same number of topics and synced messages, we observed the best performance with 6 brokers: data imports were stable and the check-out average response time was 1.1 seconds, which is no longer than in the baseline test results with the mcpt cluster.
- The test with additional load, using the script that generates messages on topics not involved in data import and check-out, did not affect the performance of these flows.
- The test with 6 brokers and 200,000 synced messages failed. Kafka was not stable, with CPU utilization above 90%.
- The test with 8 brokers and 200,000 synced messages was successful, so the cluster can handle even this doubled load.
- The mod-pubsub-b module plays an important role during testing and affects the CPU utilization of the MSK brokers.
- Resource utilization
  - 1st test: kafka.m7g.xlarge, 2 brokers, total topics: 6690, total partitions: 48062
    - Module CPU utilization: mod-pubsub-b - 82%, mod-inventory-b - 38%, mod-quick-marc-b - 35%, mod-dcb-b - 27%, okapi-b - 19%, mod-data-import-b - 17%
    - Module Memory: mod-inventory-b - 93%, mod-dcb-b - 88%, mod-permissions-b - 82%, mod-data-import-b - 63%, mod-quick-marc-b - 62%, mod-source-record-storage-b - 61%, okapi-b - 60%, mod-search-b - 58%
  - 5th test: kafka.m7g.xlarge, 4 brokers, total topics: 6743, total partitions: online - 55872, in-sync replicas - 111748
    - Module CPU utilization: mod-inventory-b - 56%, mod-quick-marc-b - 38%, mod-dcb-b - 26%, mod-pubsub-b - 23%, mod-di-converter-storage-b - 16%, okapi-b - 16%, mod-configuration-b - 15%, mod-users-b - 13%, mod-search-b - 12%
    - Module Memory:
  - 6th test: kafka.m7g.xlarge, 6 brokers, total topics: 6743, total partitions: online - 55872, in-sync replicas - 111748
    - Module CPU utilization: spike of mod-data-import - 189%, mod-inventory - 86%, mod-quick-marc - 55%
    - Module Memory: mod-dcb-b - 104%, mod-inventory-b - 88%, mod-quick-marc-b - 62%, okapi-b - 60%, mod-search-b - 60%, mod-source-record-manager-b - 59%
Recommendations & Jiras
- The mod-pubsub-b module was not stable and its containers stopped due to out-of-memory errors. Allocating more resources could resolve the problem. The last configuration for this module: "cpu": 0, "memoryReservation": 2048 / -XX:MaxMetaspaceSize=512m -Xmx2500m
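The configuration quoted above can be checked with simple arithmetic. The following is a minimal sketch, assuming that `memoryReservation` is a soft container memory limit expressed in MiB and that the JVM flags use MiB as well; the function name is illustrative and not part of any module. It only adds the heap and metaspace ceilings, so other JVM memory (thread stacks, direct buffers, code cache) would come on top of this figure.

```python
def jvm_ceiling_mib(xmx_mib, max_metaspace_mib):
    """Upper bound for heap plus metaspace only; other JVM memory is not counted."""
    return xmx_mib + max_metaspace_mib


# Values taken from the mod-pubsub-b configuration quoted in the recommendation.
memory_reservation_mib = 2048  # "memoryReservation": 2048 (assumed to be MiB)
ceiling = jvm_ceiling_mib(xmx_mib=2500, max_metaspace_mib=512)

print(f"heap + metaspace ceiling: {ceiling} MiB")
print(f"exceeds memoryReservation by: {ceiling - memory_reservation_mib} MiB")
```

Under these assumptions the JVM is allowed to grow past the reserved memory, which may be worth considering when more resources are allocated to the module.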
Test Runs
Test configurations
# | MSK instance type | Brokers # | Replication factor | Partitions online # | Partitions in replicas # | Scenario | Load level |
---|---|---|---|---|---|---|---|
1 | kafka.m7g.xlarge KRaft | 2, 6, 8 | 2 | ~100k (96426), ~50k (48062) | 192852, 96124 | CICO + DI MARC Bib Create | 5 users + 1 single record concurrently on 15 tenants + 10k on 1 tenant |
2 | kafka.m7g.xlarge KRaft | 4 | 2 | ~50k (55872) | 111748 | CICO + DI MARC Bib Create | 5 users + 1 single record concurrently on 15 tenants + 10k on 1 tenant, 10k on 15 tenants |
3 | kafka.m7g.xlarge KRaft | 6 | 2 | ~50k (55872) | 111748 | CICO + DI MARC Bib Create | 5 users + 1 single record concurrently on 15 tenants + 10k on 1 tenant |
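The "Partitions in replicas" figures follow from the online partition count multiplied by the replication factor, and dividing by the broker count gives a rough per-broker replica load. Below is a minimal sketch of that arithmetic, assuming replicas are spread evenly across brokers (the report does not state the actual distribution). The broker and partition pairs are taken from the test runs; the function name is illustrative. For the ~50k runs the report lists 111748, slightly above the simple product 111744; the sketch uses the product.

```python
# Broker counts, online partitions and replication factor from the test runs.
configs = [
    {"brokers": 2, "online": 48062, "rf": 2},
    {"brokers": 4, "online": 55872, "rf": 2},
    {"brokers": 6, "online": 55872, "rf": 2},
    {"brokers": 6, "online": 96426, "rf": 2},
    {"brokers": 8, "online": 96426, "rf": 2},
]


def replicas_per_broker(online, rf, brokers):
    """Return (total replicas, average replicas per broker), assuming an even spread."""
    total = online * rf
    return total, total / brokers


for c in configs:
    total, per_broker = replicas_per_broker(c["online"], c["rf"], c["brokers"])
    print(f'{c["brokers"]} brokers: {total} replicas, ~{per_broker:.0f} per broker')
```

Comparing the per-broker figures across runs gives one way to relate broker CPU utilization to partition density rather than to the broker count alone.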
Test Results
- 1st test: kafka.m7g.xlarge, 2 brokers, total topics: 6690, total partitions: 48062 - qcpt pointed to MSK cluster
  - MSK CPU utilization - 85%
  - MSK disk usage - 0.5%
  - Check-out flow - 1.6 seconds without data import, 1.4 seconds with data import
  - Data import Create job profile with 10k file on 1 tenant - 46 minutes, completed successfully
  - Data import on 15 tenants concurrently - lasted longer than 4 hours and completed with errors (No actions reason)
- 2nd test: kafka.m7g.xlarge, 6 brokers, total topics: 7335, total partitions: 96426 - qcpt + qcon pointed to MSK cluster
  - MSK CPU utilization - 93%, not stable
  - Check-out flow - testing could not be started
  - DI - testing could not be started
- 3rd test: kafka.m7g.xlarge, 8 brokers, total topics: 7335, total partitions: 96426 - qcpt + qcon pointed to MSK cluster
  - MSK CPU utilization - 67% on brokers 1, 2; 45% on brokers 3, 4, 5, 6, 7, 8
  - Check-out flow - 1.1 seconds
  - Data import 1 single record 15 tenants duration - 55 seconds
- 4th test (retest of test #3 after brokers reboot): kafka.m7g.xlarge, 8 brokers, total topics: 7335, total partitions: 96426 - qcpt + qcon pointed to MSK cluster
  - MSK CPU utilization - 90% on brokers 1, 2; 70% on brokers 3, 4, 5, 6, 7, 8. The cluster was unstable for an hour; after stabilization the tests completed successfully
  - Check-out flow - 1.6 seconds
  - Data import 1 single record 15 tenants duration - 44 seconds
  - Data import Create job profile with 10k file on 1 tenant - 16 minutes, completed successfully
- 5th test: kafka.m7g.xlarge, 4 brokers, total topics: 6743, total partitions: online - 55872, in-sync replicas - 111748 - qcpt pointed to MSK cluster
  - MSK CPU utilization - 79% on brokers 1, 2; 62% on brokers 9, 10
  - MSK CPU utilization with Data import 10k - 89% on all brokers
  - Check-out flow - 1.1 seconds
  - Data import 1 single record 15 tenants duration - 1 minute 20 seconds
  - Data import Create job profile with 10k file on 1 tenant - 10 minutes
  - Data import Create job profile with 10k file on 15 tenants - 2 hours 25 minutes
- 6th test: kafka.m7g.xlarge, 6 brokers, total topics: 6743, total partitions: online - 55872, in-sync replicas - 111748 - qcpt pointed to MSK cluster
  - MSK CPU utilization with DI single record - 70% on brokers 1, 2; 60% on brokers 9, 10, 11, 12
  - MSK CPU utilization with Data import 10k - 85%
  - Check-out flow - 1.2 seconds
  - Data import 1 single record 15 tenants duration - 1 minute 40 seconds
  - Data import Create job profile with 10k file on 1 tenant - 6 minutes 40 seconds
  - Data import Create job profile with 10k file on 15 tenants - 1 hour 30 minutes, completed successfully on all tenants
- 7th test: kafka.m7g.xlarge, 6 brokers, total topics: 6743, total partitions: online - 55872, in-sync replicas - 111748 - qcpt pointed to MSK cluster. Retest of #6 with additional load from the message creation script (./folio-topics-load-messages.sh NUM_RECORDS=100000).
  - The mod-pubsub-b module plays an important role during testing and affects the CPU utilization of the MSK brokers.
  - MSK CPU utilization with DI single record - 52% on brokers 1, 2; 37% on brokers 9, 10, 11, 12
  - MSK CPU utilization with Data import 10k - 70%
  - Check-out flow - 1.1 seconds
  - Data import 1 single record 15 tenants duration - 1 minute 10 seconds
  - Data import Create job profile with 10k file on 1 tenant - 6 minutes 20 seconds
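The effect of moving from 4 to 6 brokers on data import duration can be expressed as a relative reduction. This is a minimal sketch using the durations reported for the 5th test (4 brokers) and the 6th test (6 brokers); the helper names are illustrative and the calculation is plain arithmetic on single runs, not a statistical comparison.

```python
def seconds(hours=0, minutes=0, secs=0):
    """Convert a duration to seconds."""
    return hours * 3600 + minutes * 60 + secs


def reduction_pct(before_s, after_s):
    """Relative reduction of a duration, in percent, rounded to one decimal."""
    return round(100 * (before_s - after_s) / before_s, 1)


# (4 brokers, 6 brokers) durations for the DI Create job profile with a 10k file.
di_durations = {
    "10k on 1 tenant": (seconds(minutes=10), seconds(minutes=6, secs=40)),
    "10k on 15 tenants": (seconds(hours=2, minutes=25), seconds(hours=1, minutes=30)),
}

for scenario, (four_brokers, six_brokers) in di_durations.items():
    pct = reduction_pct(four_brokers, six_brokers)
    print(f"{scenario}: {pct}% shorter with 6 brokers")
```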
Response time
1st Test | Start Time "8/7/24, 9:33 AM" End Time "8/7/24, 9:53 AM" | 2nd Test | Start Time "8/7/24, 10:47 AM" End Time "8/7/24, 11:47 AM" | 3rd Test | Start Time "8/7/24, 12:41 PM" End Time "8/7/24, 1:42 PM" | 4th Test | Start Time "8/16/24, 9:09 AM" End Time "8/16/24, 10:09 AM" | 5th Test ? | Start Time "8/8/24, 9:19 AM" End Time "8/8/24, 10:20 AM" | 6th Test | Start Time "8/13/24, 10:10 AM" End Time "8/13/24, 11:10 AM" | 7th Test | Start Time "8/14/24, 12:01 PM" End Time "8/14/24, 1:01 PM" | 8th Test | Start Time "8/16/24, 9:09 AM" End Time "8/16/24, 10:09 AM" | 11th Test - 4 brokers, online partitions: ~50k, replication factor: 2 | Start Time "8/16/24, 2:16 PM" End Time "8/16/24, 3:16 PM" | 12th Test - 6 brokers, online partitions: ~50k, replication factor: 2 | Start Time "8/19/24, 9:44 AM" End Time "8/19/24, 10:44 AM" | 16th Test - 6 brokers, online partitions: ~50k, replication factor: 2 | Start Time "8/23/24, 12:10 PM" End Time "8/23/24, 1:10 PM" | ||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Requests | Executions | Response Times (ms) | Executions | Response Times (ms) | Executions | Response Times (ms) | Executions | Response Times (ms) | Executions | Response Times (ms) | Executions | Response Times (ms) | Executions | Response Times (ms) | Executions | Response Times (ms) | Executions | Response Times (ms) | Executions | Response Times (ms) | Response Times (ms) | ||||||||||||||||||||||
Label | #Samples | Average | 90th pct | 95th pct | #Samples | Average | 90th pct | 95th pct | #Samples | Average | 90th pct | 95th pct | #Samples | Average | 90th pct | 95th pct | #Samples | Average | 90th pct | 95th pct | #Samples | Average | 90th pct | 95th pct | #Samples | Average | 90th pct | 95th pct | #Samples | Average | 90th pct | 95th pct | #Samples | Average | 90th pct | 95th pct | #Samples | Average | 90th pct | 95th pct | Average | 90th pct | 95th pct |
CICO_TC_Check-Out Controller_cs00000001_0015 | 122 | 1505.3 | 1750.7 | 1863.3 | 406 | 1414.2 | 1633.6 | 1769.2 | 406 | 1351.12 | 1561.9 | 1709 | 406 | 1175.1 | 1369.8 | 1498.9 | 406 | 1701.7 | 2318.9 | 3290.7 | 407 | 1132.8 | 1304.4 | 1392.6 | 403 | 1492.2 | 1779.4 | 2400.8 | 406 | 1175.1 | 1369.8 | 1498.9 | 406 | 1173.78 | 1385.2 | 1572.9 | 407 | 1241.01 | 1576.2 | 1747.2 | |||
CICO_TC_Check-Out Controller_cs00000001_0014 | 122 | 1478.6 | 1734.4 | 1894 | 406 | 1409.9 | 1620.5 | 1765.9 | 406 | 1350.8 | 1584.1 | 1693 | 407 | 1175.1 | 1362.2 | 1543.2 | 406 | 1661.8 | 2378.5 | 3146.6 | 406 | 1127.9 | 1281 | 1369.7 | 404 | 1594.8 | 1681.5 | 3296.5 | 407 | 1175.1 | 1362.2 | 1543.2 | 407 | 1165.5 | 1366.2 | 1499.6 | 407 | 1236.14 | 1570.4 | 1710 | 1144.4 | 1515.8 | 1676 |
CICO_TC_Check-Out Controller_cs00000001_0013 | 122 | 1601.8 | 1655.9 | 1797.6 | 406 | 1383.8 | 1654.6 | 1842.3 | 406 | 1356.46 | 1575.9 | 1811 | 407 | 1159 | 1399.6 | 1562 | 406 | 1673 | 2366.8 | 2985.6 | 407 | 1096.1 | 1264.8 | 1356.4 | 404 | 1606.6 | 1894 | 4181.5 | 407 | 1159 | 1399.6 | 1562 | 406 | 1166.09 | 1360.5 | 1493.9 | 407 | 1198.18 | 1521.6 | 1634.4 | 1127.59 | 1427.8 | 1596.8 |
CICO_TC_Check-Out Controller_cs00000001_0012 | 122 | 1560 | 1798.1 | 2136.2 | 406 | 1440.9 | 1639.2 | 1807.3 | 406 | 1388.25 | 1605.2 | 1742 | 407 | 1174.9 | 1381.8 | 1493.6 | 406 | 1727.2 | 2568.6 | 3828.3 | 407 | 1126.2 | 1290.6 | 1380.4 | 404 | 1651.9 | 2123.5 | 3857.8 | 407 | 1174.9 | 1381.8 | 1493.6 | 407 | 1206.21 | 1376.8 | 1521.6 | 407 | 1225.12 | 1566.4 | 1720.8 | 1148.71 | 1507.8 | 1718.4 |
CICO_TC_Check-Out Controller_cs00000001_0011 | 122 | 1548 | 1762.5 | 2041.1 | 406 | 1460.2 | 1642.3 | 1862.9 | 406 | 1386.67 | 1668.6 | 1790 | 407 | 1188.1 | 1425 | 1598.4 | 406 | 1706 | 2504.9 | 3224.7 | 407 | 1135.1 | 1287.4 | 1436.2 | 404 | 1537.6 | 1836.5 | 2978.5 | 407 | 1188.1 | 1425 | 1598.4 | 407 | 1193.97 | 1394.8 | 1521.2 | 407 | 1235.29 | 1587.8 | 1735.2 | 1154.3 | 1530.4 | 1721.6 |
CICO_TC_Check-Out Controller_cs00000001_0010 | 122 | 1534.2 | 1892.2 | 2127.6 | 406 | 1418.5 | 1603.2 | 1705 | 406 | 1397.73 | 1625.5 | 1777 | 407 | 1184 | 1419.4 | 1536.2 | 406 | 1787.9 | 2605 | 4088.5 | 407 | 1135.9 | 1293.8 | 1408.6 | 404 | 1679.3 | 1869 | 4800 | 407 | 1184 | 1419.4 | 1536.2 | 407 | 1191.1 | 1391.6 | 1588.6 | 407 | 1254.72 | 1590.4 | 1774.6 | 1147.41 | 1502.6 | 1755.4 |
CICO_TC_Check-Out Controller_cs00000001_0009 | 122 | 1525.5 | 1838.9 | 2048.1 | 406 | 1416.8 | 1604.4 | 1815.2 | 406 | 1399.74 | 1631.8 | 1797 | 407 | 1171.7 | 1363.2 | 1560 | 406 | 1701.4 | 2289.3 | 3402.1 | 407 | 1138.6 | 1277.2 | 1354.2 | 404 | 1670.8 | 2350 | 4237.3 | 407 | 1171.7 | 1363.2 | 1560 | 407 | 1199.58 | 1413.6 | 1562.6 | 407 | 1247.65 | 1571 | 1735.6 | 1156 | 1499.2 | 1697.4 |
CICO_TC_Check-Out Controller_cs00000001_0008 | 122 | 1539 | 1787.1 | 2033.9 | 406 | 1422.6 | 1634.5 | 1778.7 | 406 | 1346.01 | 1582.7 | 1764 | 407 | 1193.4 | 1384.8 | 1622.6 | 406 | 1734.9 | 2462.4 | 3578.7 | 407 | 1136.1 | 1323.4 | 1454.2 | 404 | 1673.1 | 2212 | 3792.3 | 407 | 1193.4 | 1384.8 | 1622.6 | 407 | 1174.81 | 1343.6 | 1497.4 | 407 | 1244.71 | 1586.4 | 1746.6 | 1155.01 | 1569 | 1703.2 |
CICO_TC_Check-Out Controller_cs00000001_0007 | 122 | 1525 | 1833.3 | 2065.4 | 406 | 1433.9 | 1627.9 | 1785.2 | 406 | 1400.38 | 1650.4 | 1739 | 408 | 1182.9 | 1420.4 | 1596.9 | 407 | 1714.9 | 2321 | 3529.8 | 407 | 1132.6 | 1287.4 | 1396.6 | 404 | 1558.1 | 1638 | 2489.3 | 408 | 1182.9 | 1420.4 | 1596.9 | 407 | 1205.18 | 1391.2 | 1549.6 | 407 | 1249.27 | 1561.4 | 1798.6 | 1148.81 | 1517.6 | 1722.4 |
CICO_TC_Check-Out Controller_cs00000001_0006 | 122 | 1537.5 | 1796.5 | 1908.1 | 406 | 1418.5 | 1634.2 | 1748.1 | 407 | 1351.72 | 1583.6 | 1678 | 407 | 1176.2 | 1416 | 1576.6 | 406 | 1640.8 | 2279.5 | 2843.2 | 407 | 1137.8 | 1306.4 | 1424.2 | 404 | 1565.8 | 1780 | 2357.8 | 407 | 1176.2 | 1416 | 1576.6 | 407 | 1211.39 | 1360 | 1538.4 | 407 | 1238.76 | 1584 | 1784.2 | 1154.56 | 1510 | 1733.2 |
CICO_TC_Check-Out Controller_cs00000001_0005 | 122 | 1675.3 | 1889.5 | 2173.8 | 406 | 1370.8 | 1615.2 | 1756.9 | 407 | 1411.17 | 1586.2 | 1704 | 408 | 1179.3 | 1406.3 | 1596.7 | 407 | 1687.1 | 2339.4 | 3068.6 | 407 | 1126.8 | 1300.6 | 1385.6 | 404 | 1544.5 | 1798 | 2730.3 | 408 | 1179.3 | 1406.3 | 1596.7 | 407 | 1176.38 | 1390.6 | 1506.2 | 407 | 1245.97 | 1634.6 | 1830.2 | 1151.18 | 1497.2 | 1705.6 |
CICO_TC_Check-Out Controller_cs00000001_0004 | 122 | 1485.2 | 1758.1 | 1855.8 | 406 | 1429.4 | 1644.3 | 1755.7 | 407 | 1344.81 | 1585.2 | 1689 | 408 | 1176.2 | 1373.5 | 1496.9 | 407 | 1674.7 | 2239.8 | 3014.4 | 407 | 1132.6 | 1302 | 1379.4 | 404 | 1459.7 | 1535.5 | 2051 | 408 | 1176.2 | 1373.5 | 1496.9 | 407 | 1195.11 | 1435.4 | 1611.6 | 407 | 1245.72 | 1610.2 | 1733.4 | 1151.02 | 1524.4 | 1738.4 |
CICO_TC_Check-Out Controller_cs00000001_0003 | 122 | 1546.2 | 1866.8 | 1986 | 406 | 1453.4 | 1693.8 | 1911.7 | 407 | 1370.7 | 1613 | 1736 | 407 | 1195.1 | 1419.2 | 1550.6 | 407 | 1695.3 | 2292.4 | 3515.2 | 408 | 1131.1 | 1312.2 | 1402.8 | 404 | 1607.1 | 1808 | 3617 | 407 | 1195.1 | 1419.2 | 1550.6 | 407 | 1183.62 | 1401 | 1547.8 | 408 | 1240.08 | 1565.1 | 1701 | 1157.1 | 1532.1 | 1732.8 |
CICO_TC_Check-Out Controller_cs00000001_0002 | 122 | 1683.2 | 1786.1 | 2017.4 | 406 | 1361.6 | 1595.9 | 1691.6 | 407 | 1350.79 | 1564 | 1705 | 408 | 1190.9 | 1394.6 | 1601.1 | 407 | 1680.3 | 2253.6 | 3284.8 | 407 | 1122.5 | 1257.4 | 1361.8 | 405 | 1644.4 | 1938.8 | 4028.2 | 408 | 1190.9 | 1394.6 | 1601.1 | 407 | 1163.45 | 1391.4 | 1459 | 407 | 1234.21 | 1561.8 | 1734.6 | 1145.47 | 1495.5 | 1711.6 |
CICO_TC_Check-Out Controller_cs00000001_0001 | 123 | 1552.7 | 1811.2 | 1990.6 | 406 | 1395.3 | 1677.6 | 1833 | 407 | 1365.9 | 1610 | 1747 | 408 | 1180.3 | 1390 | 1533.6 | 407 | 1715.6 | 2403.2 | 3431 | 407 | 1130 | 1290.6 | 1361 | 404 | 1687.1 | 2228 | 4428.5 | 408 | 1180.3 | 1390 | 1533.6 | 407 | 1167.53 | 1396.6 | 1504.8 | 407 | 1235 | 1564.4 | 1720 | 1156.13 | 1542.3 | 1692 |
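The per-controller averages in the table can be collapsed into a single check-out figure per test. The sketch below does this for the Average column of the 12th test (6 brokers, ~50k online partitions), with values transcribed from the table. It takes an unweighted mean, which assumes the per-controller sample counts (407 to 408) are close enough that weighting would not change the result noticeably; the function name is illustrative.

```python
from statistics import mean

# Average response times (ms) per check-out controller, 12th test column,
# listed in table order (controller 0015 down to 0001).
averages_ms = [
    1241.01, 1236.14, 1198.18, 1225.12, 1235.29,
    1254.72, 1247.65, 1244.71, 1249.27, 1238.76,
    1245.97, 1245.72, 1240.08, 1234.21, 1235.0,
]


def overall_average_seconds(values_ms):
    """Unweighted mean of per-controller averages, converted to seconds."""
    return mean(values_ms) / 1000


print(f"Check-out average: {overall_average_seconds(averages_ms):.2f} s")
```

The same function can be applied to any other test column to compare check-out response times between broker configurations.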
Service CPU Utilization
DI MARC BIB Create and Update + CICO
5th test (4 brokers, ~50k online partitions)
6th test (6 brokers, ~50k online partitions)
Spikes:
- 10k on 1 tenant: mod-quick-marc - 134%, mod-inventory - 95%, mod-di-converter-storage - 57%
- 10k on 15 tenants: mod-data-import - 189%, mod-inventory - 86%, mod-quick-marc - 55%
Service Memory Utilization
5th test (4 brokers, ~50k online partitions)
6th test (6 brokers, ~50k online partitions)
DB CPU Utilization
5th test (4 brokers, ~50k online partitions)
6th test (6 brokers, ~50k online partitions)
DB Connections
5th test (4 brokers, ~50k online partitions)
6th test (6 brokers, ~50k online partitions)
Appendix
Infrastructure
PTF - environment qcpt
- 10 m6i.2xlarge EC2 instances located in US East (N. Virginia) us-east-1
- 1 database instance, writer
Name | Memory GiB | vCPUs | max_connections |
---|---|---|---|
db.r6g.4xlarge | | | |
- MSK ptf-KRaft-mode2
  - multiple configurations of m7g.xlarge brokers in 2 zones (1, 2, 3, 4 per zone)
  - Apache Kafka version 3.7.x, mode - KRaft
  - Cluster configuration name: fse-kafka-config revision 26
  - EBS storage volume per broker 300 GiB
  - auto.create.topics.enable=true
  - log.retention.minutes=480
  - default.replication.factor=2 (in the last test with 6 brokers the replication factor was 3)
- OpenSearch ptf-test
  - Data nodes
    - Instance type - r6g.2xlarge.search
    - Number of nodes - 4
    - Version: OpenSearch_2_7_R20240502
  - Dedicated master nodes
    - Instance type - r6g.large.search
    - Number of nodes - 3
The task count for the mod-agreements module was set to 0 before the test start.
Modules
Methodology/Approach
- Populate the ptf-KRaft-mode2 cluster with topics from the tenant cluster for qcpt
- Create the list of unconsolidated topics for all modules that are involved in data import and check-in/check-out
- Run smoke CICO for 20 minutes and a single data import on a random tenant involved in testing
- Run 1 hour CICO and a 10k DI create job for 15 tenants concurrently
- Adjust the sh script to generate additional messages in topics which are not related to the tested flows
- Repeat testing
- Change the number of brokers (2, 4, 6, 8); to troubleshoot possible issues, reboot the brokers after each change
- Compare the resource utilization of MSK and the main KPIs for CICO & DI
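The step of generating additional messages only in topics unrelated to the tested flows comes down to filtering the topic list. The following is a minimal sketch of such a filter; it is not the actual script used in testing, and the topic names and markers shown are hypothetical placeholders chosen for illustration.

```python
def topics_for_extra_load(all_topics, flow_markers):
    """Return topics whose names contain none of the given flow markers."""
    return [t for t in all_topics if not any(m in t for m in flow_markers)]


# Hypothetical topic names and markers, for illustration only.
topics = [
    "env.tenant1.DI_EXAMPLE_EVENT",
    "env.tenant1.circulation.example",
    "env.tenant1.other.example",
]
markers = ["DI_", "circulation"]

print(topics_for_extra_load(topics, markers))
```

The resulting list could then be passed to whatever tool produces the extra messages, so that the data import and check-out topics stay untouched.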
Additional/Files
Scripts can be found in the S3 bucket: fse-ptf/capacity_testing/qcpt
Topics: