Shared kafka topic experiments [inProgress]

Overview

Shared Kafka topic is a feature introduced by the FoliJet team [ documentation ] to reduce the number of topics created by modules, by excluding the tenant ID from topic names and using the keyword "ALL" instead.

How it was:

  • Multiple tenants create multiple topics in Kafka for various purposes;

  • A large number of Kafka topics (also taking into account that they are replicated across brokers) may consume a lot of disk space on the brokers;

  • The issue gets even bigger on a consortia environment (with more than 60 tenants).

How it is supposed to be:

  • Multiple tenants communicate (produce and consume messages) through a single topic with the keyword "ALL" instead of separate topics with the tenant ID.
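As a rough illustration of the naming change (the `{env}.Default.{tenant}.{name}` layout and the tenant ids below are assumptions for this sketch, not the exact FOLIO format):

```python
def topic_name(env, tenant, name, tenant_collection=None):
    # With KAFKA_PRODUCER_TENANT_COLLECTION=ALL the tenant id part of the
    # topic name is replaced by the collection keyword, so every tenant
    # produces to and consumes from the same topic.
    tenant_part = tenant_collection or tenant
    return f"{env}.Default.{tenant_part}.{name}"

tenants = ["tenantA", "tenantB", "tenantC"]

per_tenant = {topic_name("folio", t, "DI_COMPLETED") for t in tenants}
shared = {topic_name("folio", t, "DI_COMPLETED", tenant_collection="ALL")
          for t in tenants}

print(len(per_tenant))  # three separate topics without the feature
print(len(shared))      # a single shared topic with it
```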

Within the scope of PERF-684 and PERF-713, the PTF team decided to run several experiments on Data Import using the shared Kafka topic approach.

Test:

The idea of the test is to verify the performance of the shared Kafka topic approach.

Part 1 (Data import only)

The test was conducted manually (using the UI). Performed a DI create job on 3 tenants with a 25K file each: started the job on the first tenant, waited until it reached 30% to start the second job on the 2nd tenant, then waited until the first job reached 60% to start the third job. Recorded all job durations and KPIs. This test was run 2 times.
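The staggered schedule above was driven manually through the UI; as a sketch, it could be automated like this (`start_import` and `get_progress` are hypothetical helpers standing in for the UI actions):

```python
import time

def run_staggered(tenants, start_import, get_progress,
                  thresholds=(30, 60), poll_s=5):
    """Start DI on the first tenant, then start each following tenant's
    job once the first tenant's job passes the matching progress threshold."""
    start_import(tenants[0])
    for tenant, threshold in zip(tenants[1:], thresholds):
        # Poll the first job's progress (percent) until the threshold is hit.
        while get_progress(tenants[0]) < threshold:
            time.sleep(poll_s)
        start_import(tenant)
```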

Part 2 (Data Import and Check-In/Check-Out)

Tested 20 concurrent CICO users with Data Import running in the background, with and without the shared topic, to check the effect of the shared topic on CICO together with Data Import.

Summary

  • Overall performance looks about the same if all Data Import durations are analysed together.

  • On the primary tenant the durations became worse (+10, +5 min), while on the secondary tenants they became better (-6, -4 min).

  • The shared Kafka topic approach may somewhat unbalance broker load: instead of multiple per-tenant topics that can land randomly on any broker, there are only a few topics for all tenants, so one (or a few) Kafka brokers may receive much more load than the others.

  • Changing the Kafka topic approach did not change service CPU usage.

  • No memory leaks were found. 
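The broker-imbalance point above can be illustrated with a toy leader-placement model (a deliberate simplification; real MSK partition and replica assignment is more involved):

```python
import random

def leader_load(num_topics, partitions_per_topic, num_brokers, seed=1):
    """Toy model: each topic's partition leaders are placed round-robin
    starting from a random broker; returns leaders hosted per broker."""
    rng = random.Random(seed)
    load = [0] * num_brokers
    for _ in range(num_topics):
        start = rng.randrange(num_brokers)
        for p in range(partitions_per_topic):
            load[(start + p) % num_brokers] += 1
    return load

# Many per-tenant topics: random starting points spread leaders out.
print(leader_load(num_topics=60, partitions_per_topic=2, num_brokers=4))
# One shared topic with few partitions: only 2 of the 4 brokers lead any
# partitions at all, so those two take all of the producer traffic.
print(leader_load(num_topics=1, partitions_per_topic=2, num_brokers=4))
```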

Test Runs/Results

| Test # | File size / workload | Comment | DI duration, fs09.... | DI duration, fs07...01 | DI duration, fs07...02 |
| --- | --- | --- | --- | --- | --- |
| 1 | 25K | no shared topics | 13 min 49 s | 28 min 3 s | 21 min 48 s |
| 2 | 25K | no shared topics | 14 min 47 s | 26 min 47 s | 21 min 43 s |
| 3 | 25K | with shared topics | 23 min 59 s | 22 min 7 s | 17 min 17 s |
| 4 | 25K | with shared topics | 18 min 10 s | 20 min 31 s | 18 min 38 s |
| 5 | 50K | with shared topics | 30 min 3 s | 35 min 6 s | 40 min 9 s (error) |
| 6 | 100K | with shared topics | stuck on 99%* | 1 hr 24 min | 1 hr 43 min (error)* |
| 7 | CICO 20 users + DI 25K on 3 tenants | no shared topics | completed with errors, 27 min* | 50 min | 41 min |
| 8 | CICO 20 users + DI 25K on 3 tenants | with shared topics | 39 min | 41 min | 37 min |

DI errors: in all error cases during Data Import, the error was io.netty.channel.StacklessClosedChannelException.



Test 7, 8 CICO response times

|  | no shared topic (#7), before DI | no shared topic (#7), during DI | with shared topic (#8), before DI | with shared topic (#8), during DI |
| --- | --- | --- | --- | --- |
| Check In | 0.432 s | 1.21 s | 0.494 s | 1.641 s |
| Check Out | 0.824 s | 2.355 s | 1.029 s | 2.702 s |

Memory Utilization

Note: No signs of memory leaks found. 

CPU Utilization 

Note: The CPU usage pattern is the same across all the tests, with no anomalies.

RDS Metrics 

Notes:

  • CPU usage has the same pattern and reaches up to 90+%.

  • The connection rate stays the same and reached ±700 connections at its maximum.



Note: It is visible here that DB load increases after applying shared Kafka topics. This may be explained by the increased Kafka throughput causing more load on the DB.

Kafka Metrics 

Note: As mentioned above, Kafka broker CPU load may be somewhat unbalanced, and it is visible here that changing the approach changed the CPU usage pattern on the brokers.





Open Search Metrics 

Note: As in many previous tests, Open Search data node CPU usage reached 100% during indexing operations on the records newly created by DI.





Test 7,8 results



Memory Utilization



CPU Utilization 



RDS Metrics 



Kafka Metrics 



Appendix

Infrastructure

PTF environment: pcp1

Release: Poppy

  • 10 m6i.2xlarge EC2 instances located in US East (N. Virginia) us-east-1

  • 1 database instance, writer db.r6g.xlarge

  • MSK tenant

    • 4 m5.2xlarge brokers in 2 zones

    • Apache Kafka version 2.8.0

    • EBS storage volume per broker 300 GiB

    • auto.create.topics.enable=true

    • log.retention.minutes=480

    • default.replication.factor=3

  • Open Search fse-ptf cluster

    • instance type r6g.large.search

    • Data nodes: 4

    • Master nodes: 3

    • EBS size 500 GiB

Modules version: 

pcp1-pvt, Thu Oct 26 10:24:30 UTC 2023

| Module | Task Def. Revision | Module Version | Task Count | Mem Hard Limit | Mem Soft Limit | CPU units | Xmx | MetaspaceSize | MaxMetaspaceSize | R/W split enabled |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| mod-inventory-storage-b | 11 | mod-inventory-storage:27.0.0 | 2 | 4096 | 3690 | 2048 | 3076 | 384 | 512 | false |
| mod-data-import-b | 15 | mod-data-import:3.0.1 | 1 | 2048 | 1844 | 256 | 1292 | 384 | 512 | false |
| mod-source-record-storage-b | 11 | mod-source-record-storage:5.7.0 | 2 | 5600 | 5000 | 2048 | 3500 | 384 | 512 | false |
| mod-inventory-b | 10 | mod-inventory:20.1.0 | 2 | 2880 | 2592 | 1024 | 1814 | 384 | 512 | false |
| mod-source-record-manager-b | 12 | mod-source-record-manager:3.7.0 | 2 | 5600 | 5000 | 2048 | 3500 | 384 | 512 | false |
| mod-di-converter-storage-b | 14 | mod-di-converter-storage:2.1.0 | 2 | 1024 | 896 | 128 | 768 | 88 | 128 | false |
| mod-circulation-storage-b | 10 | mod-circulation-storage:17.1.0 | 2 | 2880 | 2592 | 1536 | 1814 | 384 | 512 | false |
| mod-pubsub-b | 9 | mod-pubsub:2.11.0 | 2 | 1536 | 1440 | 1024 | 922 | 384 | 512 | false |
| mod-circulation-b | 10 | mod-circulation:24.0.0 | 2 | 2880 | 2592 | 1536 | 1814 | 384 | 512 | false |

Methodology/Approach

  • With the default setup, run a 25K ptf-create-2 job profile data import on the first tenant; wait until it reaches 30% progress, then start a 25K job with the same job profile on the second tenant; wait until the first tenant reaches 60% progress and finally start the third tenant's 25K job.

  • Change the task definitions of the related modules (mod-data-import, mod-di-converter-storage, mod-inventory, mod-inventory-storage, mod-source-record-storage, mod-source-record-manager) to use shared Kafka topics. Simply add the following to the environment parameters of each service:

    {
      "name": "KAFKA_PRODUCER_TENANT_COLLECTION",
      "value": "ALL"
    }

  • Rerun all the tests and compare results (Data Import durations, service CPU usage, service memory usage, DB CPU and connections, Kafka (MSK) CPU usage, Open Search data node CPU usage).
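One quick sanity check after the change is that the Data Import topics in the cluster carry the "ALL" keyword instead of tenant ids. A small filter over a topic listing (the name layout and sample names below are illustrative assumptions):

```python
def shared_di_topics(topic_names):
    """Keep only shared ("ALL") Data Import topics from a Kafka topic
    listing, e.g. the output of `kafka-topics.sh --list`."""
    return sorted(t for t in topic_names if ".ALL." in t and ".DI_" in t)

listing = [
    "folio.Default.tenantA.DI_RAW_RECORDS_CHUNK_READ",  # old per-tenant topic
    "folio.Default.ALL.DI_RAW_RECORDS_CHUNK_READ",
    "folio.Default.ALL.DI_COMPLETED",
    "folio.Default.tenantB.DI_COMPLETED",
]
print(shared_di_topics(listing))
```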