Shared kafka topic experiments [inProgress]

Overview

Shared Kafka topic is a feature introduced by the FoliJet team [ documentation ] to reduce the number of topics created by modules, by excluding the tenant ID from topic names and using the keyword "ALL" instead.

How it was:

  • Multiple tenants create multiple topics in Kafka for various purposes;

  • A large number of Kafka topics (also taking into account that they are replicated across brokers) may consume a lot of disk space on the brokers;

  • The issue gets even bigger on a consortia environment (with more than 60 tenants).

How it is supposed to be:

  • Multiple tenants communicate (produce and consume messages) through a single topic with the keyword "ALL" instead of separate topics with the tenant ID.
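As a rough illustration of the naming change (the `{env}.Default.{tenant}.{name}` layout and the tenant ids below are assumptions for this sketch, not the exact FOLIO format):

```python
def topic_name(env, tenant, name, tenant_collection=None):
    # With KAFKA_PRODUCER_TENANT_COLLECTION=ALL the tenant id part of the
    # topic name is replaced by the collection keyword, so every tenant
    # produces to and consumes from the same topic.
    tenant_part = tenant_collection or tenant
    return f"{env}.Default.{tenant_part}.{name}"

tenants = ["tenantA", "tenantB", "tenantC"]

per_tenant = {topic_name("folio", t, "DI_COMPLETED") for t in tenants}
shared = {topic_name("folio", t, "DI_COMPLETED", tenant_collection="ALL")
          for t in tenants}

print(len(per_tenant))  # three separate topics without the feature
print(len(shared))      # a single shared topic with it
```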

Within the scope of PERF-684 and PERF-713, the PTF team decided to run several experiments on Data Import using the shared Kafka topic approach.

Test:

The idea of the test is to verify the performance of the shared Kafka topic approach.

Part 1 (Data import only)

The test was conducted manually (using the UI). Performed a DI create job on 3 tenants with a 25K file each: started the job on the first tenant, waited until it reached 30% to start the second job on the 2nd tenant, then waited until the first job reached 60% to start the third job. Recorded all job durations and KPIs. This test was run 2 times.
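The staggered schedule above was driven manually through the UI; as a sketch, it could be automated like this (`start_import` and `get_progress` are hypothetical helpers standing in for the UI actions):

```python
import time

def run_staggered(tenants, start_import, get_progress,
                  thresholds=(30, 60), poll_s=5):
    """Start DI on the first tenant, then start each following tenant's
    job once the first tenant's job passes the matching progress threshold."""
    start_import(tenants[0])
    for tenant, threshold in zip(tenants[1:], thresholds):
        # Poll the first job's progress (percent) until the threshold is hit.
        while get_progress(tenants[0]) < threshold:
            time.sleep(poll_s)
        start_import(tenant)
```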

Part 2 (Data Import and Check-In/Check-Out)

Tested 20 concurrent CICO users with Data Import running in the background, with and without the shared topic, to check the effect of the shared topic on CICO together with Data Import.

Summary

  • Overall performance looks about the same if all Data Import durations are analysed together.

  • On the primary tenant the durations became worse (+10, +5 min), while on the secondary tenants they became better (-6, -4 min).

  • The shared Kafka topic approach may somewhat unbalance broker load: instead of multiple per-tenant topics that can land randomly on any broker, there are only a few topics for all tenants, so one (or a few) Kafka brokers may receive much more load than the others.

  • Changing the Kafka topic approach did not change service CPU usage.

  • No memory leaks were found. 
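The broker-imbalance point above can be illustrated with a toy leader-placement model (a deliberate simplification; real MSK partition and replica assignment is more involved):

```python
import random

def leader_load(num_topics, partitions_per_topic, num_brokers, seed=1):
    """Toy model: each topic's partition leaders are placed round-robin
    starting from a random broker; returns leaders hosted per broker."""
    rng = random.Random(seed)
    load = [0] * num_brokers
    for _ in range(num_topics):
        start = rng.randrange(num_brokers)
        for p in range(partitions_per_topic):
            load[(start + p) % num_brokers] += 1
    return load

# Many per-tenant topics: random starting points spread leaders out.
print(leader_load(num_topics=60, partitions_per_topic=2, num_brokers=4))
# One shared topic with few partitions: only 2 of the 4 brokers lead any
# partitions at all, so those two take all of the producer traffic.
print(leader_load(num_topics=1, partitions_per_topic=2, num_brokers=4))
```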

Test Runs/Results

| Test # | File size / workload | Comment | DI duration, fs09.... | DI duration, fs07...01 | DI duration, fs07...02 |
| --- | --- | --- | --- | --- | --- |
| 1 | 25K | no shared topics | 13 min 49 s | 28 min 3 s | 21 min 48 s |
| 2 | 25K | no shared topics | 14 min 47 s | 26 min 47 s | 21 min 43 s |
| 3 | 25K | with shared topics | 23 min 59 s | 22 min 7 s | 17 min 17 s |
| 4 | 25K | with shared topics | 18 min 10 s | 20 min 31 s | 18 min 38 s |
| 5 | 50K | with shared topics | 30 min 3 s | 35 min 6 s | 40 min 9 s (error) |
| 6 | 100K | with shared topics | stuck on 99%* | 1 hr 24 min | 1 hr 43 min (error)* |
| 7 | CICO 20 users + DI 25K on 3 tenants | no shared topics | completed with errors, 27 min* | 50 min | 41 min |
| 8 | CICO 20 users + DI 25K on 3 tenants | with shared topics | 39 min | 41 min | 37 min |

DI errors: in all error cases during Data Import, the error was io.netty.channel.StacklessClosedChannelException.



Test 7, 8 CICO response times

|  | no shared topic (#7), before DI | no shared topic (#7), during DI | with shared topic (#8), before DI | with shared topic (#8), during DI |
| --- | --- | --- | --- | --- |
| Check In | 0.432 s | 1.21 s | 0.494 s | 1.641 s |
| Check Out | 0.824 s | 2.355 s | 1.029 s | 2.702 s |

Memory Utilization

Note: No signs of memory leaks found. 

CPU Utilization 

Note: The CPU usage pattern is the same across all the tests, with no anomalies.

RDS Metrics 

Notes:

  • CPU usage has the same pattern and reaches up to 90+%.

  • The connection rate stays the same and reached ±700 connections at its maximum.



Note: It is visible here that DB load increases after applying shared Kafka topics. This may be explained by the increased Kafka throughput causing more load on the DB.

Kafka Metrics 

Note: As mentioned above, Kafka broker CPU load may be somewhat unbalanced, and it is visible here that changing the approach changed the CPU usage pattern on the brokers.





Open Search Metrics 

Note: As in many previous tests, Open Search data node CPU usage reached 100% during indexing operations on the records newly created by DI.





Test 7,8 results



Memory Utilization



CPU Utilization 



RDS Metrics 



Kafka Metrics 



Appendix

Infrastructure

PTF environment: pcp1

Release: Poppy

  • 10 m6i.2xlarge EC2 instances located in US East (N. Virginia) us-east-1

  • 1 database instance, writer db.r6g.xlarge

  • MSK tenant

    • 4 m5.2xlarge brokers in 2 zones

    • Apache Kafka version 2.8.0

    • EBS storage volume per broker 300 GiB

    • auto.create.topics.enable=true

    • log.retention.minutes=480

    • default.replication.factor=3

  • Open Search fse-ptf cluster

    • instance type r6g.large.search

    • Data nodes: 4

    • Master nodes: 3

    • EBS size 500 GiB

Modules version: 

pcp1-pvt, Thu Oct 26 10:24:30 UTC 2023

| Module | Task Def. Revision | Module Version | Task Count | Mem Hard Limit | Mem Soft Limit | CPU units | Xmx | MetaspaceSize | MaxMetaspaceSize | R/W split enabled |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| mod-inventory-storage-b | 11 | mod-inventory-storage:27.0.0 | 2 | 4096 | 3690 | 2048 | 3076 | 384 | 512 | false |
| mod-data-import-b | 15 | mod-data-import:3.0.1 | 1 | 2048 | 1844 | 256 | 1292 | 384 | 512 | false |
| mod-source-record-storage-b | 11 | mod-source-record-storage:5.7.0 | 2 | 5600 | 5000 | 2048 | 3500 | 384 | 512 | false |
| mod-inventory-b | 10 | mod-inventory:20.1.0 | 2 | 2880 | 2592 | 1024 | 1814 | 384 | 512 | false |
| mod-source-record-manager-b | 12 | mod-source-record-manager:3.7.0 | 2 | 5600 | 5000 | 2048 | 3500 | 384 | 512 | false |
| mod-di-converter-storage-b | 14 | mod-di-converter-storage:2.1.0 | 2 | 1024 | 896 | 128 | 768 | 88 | 128 | false |
| mod-circulation-storage-b | 10 | mod-circulation-storage:17.1.0 | 2 | 2880 | 2592 | 1536 | 1814 | 384 | 512 | false |
| mod-pubsub-b | 9 | mod-pubsub:2.11.0 | 2 | 1536 | 1440 | 1024 | 922 | 384 | 512 | false |
| mod-circulation-b | 10 | mod-circulation:24.0.0 | 2 | 2880 | 2592 | 1536 | 1814 | 384 | 512 | false |

Methodology/Approach

  • With the default setup, run a 25K ptf-create-2 job profile data import on the first tenant; wait until it reaches 30% progress, then start a 25K job with the same job profile on the second tenant; wait until the first tenant reaches 60% progress and finally start the third tenant's 25K job.

  • Change the task definitions of the related modules (mod-data-import, mod-di-converter-storage, mod-inventory, mod-inventory-storage, mod-source-record-storage, mod-source-record-manager) to use shared Kafka topics. Simply add the following to the environment parameters of each service:

    {
      "name": "KAFKA_PRODUCER_TENANT_COLLECTION",
      "value": "ALL"
    }

  • Rerun all the tests and compare results (Data Import durations, service CPU usage, service memory usage, DB CPU and connections, Kafka (MSK) CPU usage, Open Search data node CPU usage).
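One quick sanity check after the change is that the Data Import topics in the cluster carry the "ALL" keyword instead of tenant ids. A small filter over a topic listing (the name layout and sample names below are illustrative assumptions):

```python
def shared_di_topics(topic_names):
    """Keep only shared ("ALL") Data Import topics from a Kafka topic
    listing, e.g. the output of `kafka-topics.sh --list`."""
    return sorted(t for t in topic_names if ".ALL." in t and ".DI_" in t)

listing = [
    "folio.Default.tenantA.DI_RAW_RECORDS_CHUNK_READ",  # old per-tenant topic
    "folio.Default.ALL.DI_RAW_RECORDS_CHUNK_READ",
    "folio.Default.ALL.DI_COMPLETED",
    "folio.Default.tenantB.DI_COMPLETED",
]
print(shared_di_topics(listing))
```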