Shared Kafka topic experiments [inProgress]

Overview

Shared Kafka topics is a feature introduced by the FoliJet team [ documentation ] to reduce the number of topics created by modules: the tenant ID is excluded from the topic name and the keyword "ALL" is used instead.

How it was:

  • Multiple tenants created multiple topics in Kafka for various purposes;
  • A large number of Kafka topics (especially considering that they are replicated across brokers) may consume a lot of disk space on the brokers;
  • The issue grows even bigger on a consortia environment (with more than 60 tenants).
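To make the scale concrete, here is a rough back-of-the-envelope calculation; the tenant and event-type counts are illustrative assumptions, not measured values from this environment:

```python
# Rough illustration of how per-tenant topics multiply.
# The event-type count (50) is an assumption for illustration only.
def total_partition_replicas(tenants, event_types, partitions_per_topic, replication_factor):
    """Partition replicas the brokers must host when every tenant gets its own topics."""
    return tenants * event_types * partitions_per_topic * replication_factor

# Per-tenant topics: a 60-tenant consortium with ~50 event types and
# replication factor 3 (as configured on this MSK cluster).
per_tenant = total_partition_replicas(tenants=60, event_types=50,
                                      partitions_per_topic=1, replication_factor=3)
# Shared ("ALL") topics: the tenant multiplier disappears.
shared = total_partition_replicas(tenants=1, event_types=50,
                                  partitions_per_topic=1, replication_factor=3)
print(per_tenant, shared)  # → 9000 150
```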

How it is supposed to be:

  • Multiple tenants communicate (produce and consume messages) through a single topic with the keyword "ALL" instead of separate topics with the tenant ID.
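FOLIO Kafka topic names embed an environment prefix, a tenant segment, and an event type. A minimal sketch of what the shared-topic switch changes (the helper below is an assumption based on that convention, not the modules' actual code):

```python
def topic_name(env, tenant_id, event_type, tenant_collection=None):
    """Build a FOLIO-style topic name: {env}.{tenant}.{eventType}.
    When a tenant collection (e.g. "ALL") is configured, it replaces
    the tenant ID, so every tenant shares one topic per event type."""
    tenant_segment = tenant_collection if tenant_collection else tenant_id
    return f"{env}.{tenant_segment}.{event_type}"

# Without sharing: one topic per tenant per event type.
print(topic_name("folio", "diku", "DI_COMPLETED"))
# → folio.diku.DI_COMPLETED
# With the tenant collection set to "ALL": one topic for everyone.
print(topic_name("folio", "diku", "DI_COMPLETED", tenant_collection="ALL"))
# → folio.ALL.DI_COMPLETED
```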

In scope of PERF-684 and PERF-713, the PTF team decided to run several experiments on Data Import using the shared Kafka topic approach.

Test: 

The goal of the test is to verify the performance of the shared Kafka topic approach.

Part 1 (Data Import only)

The test was conducted manually (using the UI). A DI create job with a 25K file was run on 3 tenants: the job was started on the first tenant; when it reached 30% progress, a second job was started on the 2nd tenant; when the first job reached 60%, a third job was started on the 3rd tenant. All job durations and KPIs were recorded. The test was run 2 times.
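The staggered start described above can be sketched as a small driver; `start_job` and `job_progress` are hypothetical injected helpers standing in for the manual UI actions (no such API client was part of the test):

```python
import time

def run_staggered_import(tenants, start_job, job_progress,
                         thresholds=(30, 60), poll_interval=1):
    """Start a DI job on the first tenant, then start each subsequent
    tenant's job once the FIRST job passes the next progress threshold
    (30%, then 60%). Returns the job ids in start order.
    start_job/job_progress are callables supplied by the caller."""
    first = start_job(tenants[0])
    jobs = [first]
    for tenant, threshold in zip(tenants[1:], thresholds):
        while job_progress(first) < threshold:
            time.sleep(poll_interval)  # poll until the threshold is reached
        jobs.append(start_job(tenant))
    return jobs
```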

Part 2 (Data Import and Check-In/Check-Out)

Test 20 concurrent users on CICO with Data Import running in the background, with and without shared topics, to check the effect of shared topics on CICO combined with Data Import.

Summary

  • Overall performance looks about the same if all data import durations are analysed as a whole.
  • On the primary tenant the duration became worse (+10, +5 min), while on the secondary tenants it became better (-6, -4 min).
  • The shared Kafka topic approach may somewhat unbalance the load on the brokers. Instead of multiple topics per tenant that can be placed randomly on any broker, there are now only a few topics for all tenants, so one (or a few) Kafka brokers may receive much more load than the others.
  • Changing the Kafka topic approach did not change service CPU usage.
  • No memory leaks were found. 
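The broker-imbalance concern can be illustrated by counting partition leaders per broker under the two layouts; the topic and broker counts below are illustrative assumptions (with simple round-robin leader placement):

```python
from collections import Counter
from itertools import cycle

def leaders_per_broker(num_topics, partitions_per_topic, brokers):
    """Assign partition leaders round-robin across brokers and count
    how many leaders each broker ends up hosting."""
    assignment = Counter()
    broker_cycle = cycle(brokers)
    for _ in range(num_topics * partitions_per_topic):
        assignment[next(broker_cycle)] += 1
    return assignment

brokers = ["b1", "b2", "b3", "b4"]
# Many per-tenant topics: leaders spread evenly across the 4 brokers.
print(leaders_per_broker(num_topics=60, partitions_per_topic=1, brokers=brokers))
# Few shared topics: only a handful of leaders exist, so a busy shared
# topic can pin most produce/consume traffic onto one or two brokers,
# and some brokers host no leaders at all.
print(leaders_per_broker(num_topics=3, partitions_per_topic=1, brokers=brokers))
```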

Test Runs/Results

Test # | File size / load                     | Comment            | DI duration fs09... | fs07...01   | fs07...02
1      | 25K                                  | no shared topics   | 13 min 49 s         | 28 min 3 s  | 21 min 48 s
2      | 25K                                  | no shared topics   | 14 min 47 s         | 26 min 47 s | 21 min 43 s
3      | 25K                                  | with shared topics | 23 min 59 s         | 22 min 7 s  | 17 min 17 s
4      | 25K                                  | with shared topics | 18 min 10 s         | 35 min 6 s → 20 min 31 s | 18 min 38 s
5      | 50K                                  | with shared topics | 30 min 3 s          | 35 min 6 s  | 40 min 9 s (error)
6      | 100K                                 | with shared topics | stuck on 99% (error)* | 1 hr 24 min | 1 hr 43 min
7      | CICO 20 users + DI 25K on 3 tenants  | no shared topics   | completed with errors, 27 min* | 50 min | 41 min
8      | CICO 20 users + DI 25K on 3 tenants  | with shared topics | 39 min              | 41 min      | 37 min

DI errors: in all cases where an error occurred during Data Import, the error was io.netty.channel.StacklessClosedChannelException.


Test 7,8 CICO



          | no shared topic (#7)   | with shared topic (#8)
          | before DI | during DI  | before DI | during DI
Check In  | 0.432 s   | 1.21 s     | 0.494 s   | 1.641 s
Check Out | 0.824 s   | 2.355 s    | 1.029 s   | 2.702 s
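Relative to the no-shared-topic baseline, the during-DI response times above degrade by roughly the following factors (simple arithmetic on the table values):

```python
def slowdown(baseline, value):
    """Percentage increase of value over baseline, rounded to one decimal."""
    return round((value - baseline) / baseline * 100, 1)

# During-DI response times: no shared topic (#7) vs with shared topic (#8).
print(slowdown(1.210, 1.641))  # Check In:  → 35.6 (% slower with shared topics)
print(slowdown(2.355, 2.702))  # Check Out: → 14.7 (% slower with shared topics)
```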

Memory Utilization

Note: No signs of memory leaks were found.

CPU Utilization 

Note: The CPU usage pattern is the same across all the tests, with no anomalies.

RDS Metrics 

Notes:

  • CPU usage has the same pattern and reaches up to 90+%.
  • The connection rate stays the same and at its maximum reached ±700 connections.


Note: It is visible here that DB load increases with shared Kafka topics applied. This may be explained by the increased Kafka throughput causing more load on the DB.

Kafka Metrics 

Note: As mentioned above, Kafka broker CPU load may be somewhat unbalanced, and it is visible here that changing the approach changed the CPU usage pattern on the brokers.



Open Search Metrics 

Note: As in multiple previous tests, Open Search data node CPU usage reached 100% during indexing operations on records newly created by DI.



Test 7,8 results


Memory Utilization


CPU Utilization 


RDS Metrics 


Kafka Metrics 


Appendix

Infrastructure

PTF environment: pcp1

Release - Poppy.

  • 10 m6i.2xlarge EC2 instances located in US East (N. Virginia) us-east-1
  • 1 database instance, writer db.r6g.xlarge

  • MSK tenant
    • 4 m5.2xlarge brokers in 2 zones
    • Apache Kafka version 2.8.0

    • EBS storage volume per broker 300 GiB

    • auto.create.topics.enable=true
    • log.retention.minutes=480
    • default.replication.factor=3
  • Open Search fse-ptf cluster
    • instance type r6g.large.search
    • Data nodes -4
    • Master nodes -3
    • EBS size 500 GiB

Modules version: 

pcp1-pvt, Thu Oct 26 10:24:30 UTC 2023

Module (Task Def. Revision)   | Module Version                 | Task Count | Mem Hard Limit | Mem Soft Limit | CPU units | Xmx  | MetaspaceSize | MaxMetaspaceSize | R/W split enabled
mod-inventory-storage-b11     | mod-inventory-storage:27.0.0   | 2 | 4096 | 3690 | 2048 | 3076 | 384 | 512 | false
mod-data-import-b15           | mod-data-import:3.0.1          | 1 | 2048 | 1844 | 256  | 1292 | 384 | 512 | false
mod-source-record-storage-b11 | mod-source-record-storage:5.7.0 | 2 | 5600 | 5000 | 2048 | 3500 | 384 | 512 | false
mod-inventory-b10             | mod-inventory:20.1.0           | 2 | 2880 | 2592 | 1024 | 1814 | 384 | 512 | false
mod-source-record-manager-b12 | mod-source-record-manager:3.7.0 | 2 | 5600 | 5000 | 2048 | 3500 | 384 | 512 | false
mod-di-converter-storage-b14  | mod-di-converter-storage:2.1.0 | 2 | 1024 | 896  | 128  | 768  | 88  | 128 | false
mod-circulation-storage-b10   | mod-circulation-storage:17.1.0 | 2 | 2880 | 2592 | 1536 | 1814 | 384 | 512 | false
mod-pubsub-b9                 | mod-pubsub:2.11.0              | 2 | 1536 | 1440 | 1024 | 922  | 384 | 512 | false
mod-circulation-b10           | mod-circulation:24.0.0         | 2 | 2880 | 2592 | 1536 | 1814 | 384 | 512 | false

Methodology/Approach

  • With the default setup, run a 25K ptf-create-2 job profile data import on the first tenant; wait until it reaches 30% progress and start a 25K import with the same job profile on the second tenant; wait until the first tenant's job reaches 60% progress and finally start the third tenant's 25K job.
  • Change the task definitions of the related modules (mod-data-import, mod-di-converter-storage, mod-inventory, mod-inventory-storage, mod-source-record-storage, mod-source-record-manager) to use shared Kafka topics. Simply add the following to the service's environment parameters:

        {
          "name": "KAFKA_PRODUCER_TENANT_COLLECTION",
          "value": "ALL"
        }

  • Rerun all the tests and compare results (Data Import durations, CPU usage on services, memory usage on services, DB CPU, connections, Kafka (MSK) CPU usage, Open Search data node CPU usage).
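The environment variable above drives the producer-side topic naming. A sketch of how a module-side producer could honor it (the variable name is real per the step above; the helper itself is illustrative, not the modules' actual code):

```python
import os

def effective_tenant(tenant_id):
    """Return the tenant segment used in topic names: the shared collection
    name when KAFKA_PRODUCER_TENANT_COLLECTION is set (e.g. "ALL"),
    otherwise the real tenant ID."""
    return os.environ.get("KAFKA_PRODUCER_TENANT_COLLECTION") or tenant_id

# With the variable set, every tenant maps to the shared segment.
os.environ["KAFKA_PRODUCER_TENANT_COLLECTION"] = "ALL"
print(effective_tenant("fs09000000"))  # → ALL
```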