Shared kafka topic experiments [inProgress]
Overview
Shared kafka topic is the feature introduced by FoliJet team [ documentation ] to reduce number of topic will be created by modules (with excluding tenant ID from name and adding "ALL" instead)
How it was:
- Multiple tenants creates multiple topics in Kafka for various purposes;
- Lots of kafka topics (also taking into account that they are replicated on different brokers) may consume a lot of space on brokers.
- It may be even a bigger issue when it will be consortia env (with more than 60 tenant)
How It suppose to be:
- Multiple tenants will communicate (produce and consume messages) with only one topic with keyword "ALL" instead of separate topics with tenantId
In scope of PERF-684 and PERF-713 - PTF team decided to run several experiments on Data Import using shared kafka topic approach
Test:
The idea of a test is to verify and check performance of approach of using shared kafka topics;
Part 1 (Data import only)
Test has being conducted manually (using UI). Performed DI create on 3 tenants with 25K file. Started the job on the first tenant, then wait until 30% start a second job on the 2nd tenant, then wait until the first job gets to 60% and start the third job. Record all jobs' durations and KPIs. Did this test 2 times.
Part 2(Data Import and CheckIn CheckOut)
Test 20 concurrent users on CICO with Data Import on the background with and without shared topic to check affect of shared topic on CICO together with Data Import.
Summary
- Overall performance looks about the same if we'll analyse all durations of data imports as one.
- On primary tenant duration become a worse (+10,+5 min) while on secondary tenants it's becomes better (-6,-4 min).
- As we can see - approach to use shared kafka topic may be a bit misbalanced on load of brokers. As now it's not multiple topics for each tenant that can be placed randomly on each broker but few topics for all tenants - it may cause that one (or few) kafka brokers can receive much more load than the others.
- With changing kafka topics approach - nothing has changed on service CPU usage side.
- No memory leaks were found.
Test Runs/Results
Test # | fileSize | comment | DI Duration/tenant | ||
---|---|---|---|---|---|
fs09.... | fs07...01 | fs07...02 | |||
1 | 25K | no shared topics | 13 min 49 s | 28 min 3 s | 21 min 48 s |
2 | 25K | no shared topics | 14 min 47 s | 26 min 47 s | 21 min 43 s |
3 | 25K | with shared topics | 23 min 59 s | 22 min 7 s | 17 min 17 s |
4 | 25K | with shared topics | 18 min 10 s | 20 min 31 s | 18 min 38 s |
5 | 50K | with shared topics | 30 min 3 s | 35 min 6 s | 40 min 9 s (error) |
6 | 100K | with shared topics | stuck on 99%* | 1 hr 24 min | 1 hr 43min (error) * |
7 | CICO 20 users + DI 25K on 3 tenants | no shared topics | completed with errors 27 min.* | 50 min | 41 min |
8 | CICO 20 users + DI 25K on 3 tenants | with shared topics | 39 min | 41 min | 37 min |
DI ERRORS. in all cases of error during Data Import - the error is io.netty.channel.StacklessClosedChannelException
Test 7,8 CICO
no shared topic(#7) | with shared topic(#8) | |||
before DI | during DI | before | during DI | |
Check In | 0,432s | 1,21s | 0,494s | 1,641s |
Check Out | 0,824s | 2,355s | 1,029s | 2,702s |
Memory Utilization
Note: No signs of memory leaks found.
CPU Utilization
Note: The CPU usage pattern it the same across all the tests with no anomalies.
RDS Metrics
Notes:
- CPU usage has a same pattern and reaches up to 90+%.
- Connections rate stays the same and at it's maximum reached ±700 connections.
Note: Here is visible that DB load increases with applying of shared kafka topics. May be explain with increased throughput of kafka that is causing more load on DB.
Kafka Metrics
Note: as was mentioned above - kafka brokers CPU load may be a bit misbalances. and here is visible that changing of approach did change a pattern of CPU usage on brokers.
Open Search Metrics
Note: As in multiple times previous tests Open Search data nodes CPU usage has reached 100% during indexing operations on newly created records with DI
Test 7,8 results
Memory Utilization
CPU Utilization
RDS Metrics
Kafka Metrics
Appendix
Infrastructure
PTF -environment pcp1
Release - Poppy.
- 10 m6i.2xlarge EC2 instances located in US East (N. Virginia)us-east-1
1 database instance, writer db.r6g.xlarge
- MSK tenant
- 4 m5.2xlarge brokers in 2 zones
Apache Kafka version 2.8.0
EBS storage volume per broker 300 GiB
- auto.create.topics.enable=true
- log.retention.minutes=480
- default.replication.factor=3
- Open Search fse-ptf cluster
- instance type r6g.large.search
- Data nodes -4
- Master nodes -3
- EBS size 500 GiB
Modules version:
Module pcp1-pvt Thu Oct 26 10:24:30 UTC 2023 | Task Def. Revision | Module Version | Task Count | Mem Hard Limit | Mem Soft limit | CPU units | Xmx | MetaspaceSize | MaxMetaspaceSize | R/W split enabled |
---|---|---|---|---|---|---|---|---|---|---|
mod-inventory-storage-b | 11 | mod-inventory-storage:27.0.0 | 2 | 4096 | 3690 | 2048 | 3076 | 384 | 512 | false |
mod-data-import-b | 15 | mod-data-import:3.0.1 | 1 | 2048 | 1844 | 256 | 1292 | 384 | 512 | false |
mod-source-record-storage-b | 11 | mod-source-record-storage:5.7.0 | 2 | 5600 | 5000 | 2048 | 3500 | 384 | 512 | false |
mod-inventory-b | 10 | mod-inventory:20.1.0 | 2 | 2880 | 2592 | 1024 | 1814 | 384 | 512 | false |
mod-source-record-manager-b | 12 | mod-source-record-manager:3.7.0 | 2 | 5600 | 5000 | 2048 | 3500 | 384 | 512 | false |
mod-di-converter-storage-b | 14 | mod-di-converter-storage:2.1.0 | 2 | 1024 | 896 | 128 | 768 | 88 | 128 | false |
mod-circulation-storage-b | 10 | mod-circulation-storage:17.1.0 | 2 | 2880 | 2592 | 1536 | 1814 | 384 | 512 | false |
mod-pubsub-b | 9 | mod-pubsub:2.11.0 | 2 | 1536 | 1440 | 1024 | 922 | 384 | 512 | false |
mod-circulation-b | 10 | mod-circulation:24.0.0 | 2 | 2880 | 2592 | 1536 | 1814 | 384 | 512 | false |
Methodology/Approach
- with the default set up, Run 25K ptf-create-2 job profile data import on first tenant, wait until it has 30% progress, start 25K with the same job profile o second tenant, wait until first tenant will have 60% progress and finally start third tenant 25K job.
- Change task definitions of related modules (mod-data-import, mod-di-conventer-storage,mod-inventory,mod-inventory-storage,mod-source-record-storage,mod-source-record-manager) to use shared kafka topics. Simply add to the parameters of a service : {
"name": "KAFKA_PRODUCER_TENANT_COLLECTION",
"value": "ALL"
}, - rerun all the tests and compare results. (Data import durations, Cpu usage on a services, Mem usage on services, DB CPU, connections, kafka (MSK) CPU usage, Open search data nodes CPU usage)