Dependencies between mod-pubsub Kafka partitions and CICO performance (Orchid)


Overview

According to PERF-534, Data Import (DI) performance improved greatly when the number of partitions on the DI Kafka topics was increased to 2.

In this testing effort, we want to see whether increasing the number of partitions on mod-pubsub's Kafka topics from 1 to 2 has the same positive impact on the Check In Check Out (CICO) workflow, since many of mod-pubsub's topics are related to circulation. We also test CICO with R/W split both enabled and disabled.

Summary

  • Changing the mod-pubsub topic partitions to 2 does not appear to help much.
  • Response times were faster in some tests and slower in others (roughly -20 ms to +50 ms).
  • Enabling Read/Write split brings no big benefit to CICO in these standalone tests. However, benefits may appear in real-life usage and/or when several workflows with high DB load (such as Data Import) run at the same time, because the split distributes load between DB nodes.

Recommendations

The only notable observation is that the names of the mod-pubsub Kafka topics include the mod-pubsub version, for example:

  • ncp5.pub-sub.fs09000000.FEE_FINE_BALANCE_CHANGED.mod-pubsub-2.7.0
  • ncp5.pub-sub.fs09000000.FEE_FINE_BALANCE_CHANGED.mod-pubsub-2.9.1
  • ncp5.pub-sub.fs09000000.FEE_FINE_BALANCE_CHANGED.mod-pubsub-2.10.0-SNAPSHOT

*These are the same topic for different mod-pubsub versions.

If mod-pubsub is updated frequently, the old topics may linger and accumulate unnecessarily. It may therefore be a good idea to exclude the version number from the topic naming pattern.
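To illustrate the recommendation, here is a minimal Python sketch that strips the version suffix from topic names and groups the versioned topics that would collapse into one. The suffix shape (`.mod-pubsub-<x>.<y>.<z>` with an optional `-SNAPSHOT`) is an assumption based only on the three examples above, and the helper names are hypothetical, not part of mod-pubsub.

```python
import re
from collections import defaultdict

# Assumed suffix shape, inferred from the example topic names only:
# ".mod-pubsub-<major>.<minor>.<patch>" with an optional "-SNAPSHOT".
VERSION_SUFFIX = re.compile(r"\.mod-pubsub-\d+\.\d+\.\d+(?:-SNAPSHOT)?$")


def strip_version(topic: str) -> str:
    """Return the topic name without the trailing mod-pubsub version."""
    return VERSION_SUFFIX.sub("", topic)


def group_by_base(topics):
    """Map each version-less topic name to the versioned topics sharing it."""
    groups = defaultdict(list)
    for topic in topics:
        groups[strip_version(topic)].append(topic)
    return dict(groups)


topics = [
    "ncp5.pub-sub.fs09000000.FEE_FINE_BALANCE_CHANGED.mod-pubsub-2.7.0",
    "ncp5.pub-sub.fs09000000.FEE_FINE_BALANCE_CHANGED.mod-pubsub-2.9.1",
    "ncp5.pub-sub.fs09000000.FEE_FINE_BALANCE_CHANGED.mod-pubsub-2.10.0-SNAPSHOT",
]

groups = group_by_base(topics)
for base, versioned in groups.items():
    # Any base name with more than one entry has leftover per-version topics.
    print(f"{base}: {len(versioned)} versioned topic(s)")
```

A grouping like this could also be used to spot stale topics left behind by earlier module versions before deciding what to clean up.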


Test Sets 

| Test # | Test Conditions | Duration | Load generator size | Load generator Memory (GiB) | Notes |
|---|---|---|---|---|---|
| 1 | 8, 20, 30, 75 users CI/CO | 30 mins each | t3.large | 3 | 2 pub-sub partitions, R/W split enabled |
| 2 | 8, 20, 30, 75 users CI/CO | 30 mins each | t3.large | 3 | 1 pub-sub partition, R/W split enabled |
| 3 | 8, 20, 30, 75 users CI/CO | 30 mins each | t3.large | 3 | 1 pub-sub partition, R/W split disabled |
| 4 | 8, 20, 30, 75 users CI/CO | 30 mins each | t3.large | 3 | 2 pub-sub partitions, R/W split disabled |

Results

Listed below are the response times in seconds (average, 75th percentile, and 95th percentile) for the tests (8, 20, 30, and 75 users) with 1 and 2 mod-pubsub Kafka topic partitions.

A comparison between 2 partitions and 1 partition is also provided. A value of +n means that the response time with 2 partitions is n ms slower than the corresponding value from the 1-partition test; -n means it is n ms faster.
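The delta notation can be sketched in a few lines of Python. This is only an illustration of the arithmetic; the function names are hypothetical, and the sample values are taken from the 8-user row with R/W split enabled.

```python
def delta_ms(two_partitions_s: float, one_partition_s: float) -> int:
    """Difference in whole milliseconds; positive means 2 partitions is slower."""
    # Convert each value to integer milliseconds first to avoid float noise.
    return round(two_partitions_s * 1000) - round(one_partition_s * 1000)


def format_cell(two_partitions_s: float, one_partition_s: float) -> str:
    """Render a response time followed by its signed delta in ms."""
    return f"{two_partitions_s:.3f} ({delta_ms(two_partitions_s, one_partition_s):+d})"


# 75th percentile, 8 users, R/W split enabled: 0.784 s vs 0.746 s
print(format_cell(0.784, 0.746))
# Average, 8 users, R/W split enabled: 0.763 s vs 0.770 s
print(format_cell(0.763, 0.770))
```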

With Read Write split enabled

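R/W split enabled, CI (response times in seconds; deltas in ms relative to 1 partition)

| Users | 2 partitions avg. | 2 partitions 75% | 2 partitions 95% | 1 partition avg. | 1 partition 75% | 1 partition 95% |
|---|---|---|---|---|---|---|
| 8 | 0.476 | 0.496 (+2) | 0.556 (-4) | 0.476 | 0.494 | 0.560 |
| 20 | 0.459 (-4) | 0.477 (-7) | 0.527 (-12) | 0.463 | 0.484 | 0.539 |
| 30 | 0.456 (+16) | 0.482 (+18) | 0.530 (+24) | 0.440 | 0.464 | 0.506 |
| 75 | 0.526 (-3) | 0.574 (-6) | 0.707 (+12) | 0.529 | 0.580 | 0.695 |

R/W split enabled, CO (response times in seconds; deltas in ms relative to 1 partition)

| Users | 2 partitions avg. | 2 partitions 75% | 2 partitions 95% | 1 partition avg. | 1 partition 75% | 1 partition 95% |
|---|---|---|---|---|---|---|
| 8 | 0.763 (-7) | 0.784 (+38) | 0.890 (-27) | 0.770 | 0.746 | 0.917 |
| 20 | 0.740 (-10) | 0.763 (-10) | 0.845 (-13) | 0.750 | 0.773 | 0.858 |
| 30 | 0.747 (+8) | 0.770 (+9) | 0.848 (+6) | 0.739 | 0.761 | 0.842 |
| 75 | 0.951 (-4) | 1.020 (-11) | 1.187 (+2) | 0.955 | 1.031 | 1.185 |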

*There is no significant difference in response times between 1 and 2 partitions of the mod-pubsub Kafka topics when R/W split is enabled. Some cases are better and some are worse, so we conclude that there is no pattern and that having 2 partitions brings no benefit in response times.
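One way to check the "no pattern" conclusion is to tally the signs of the reported deltas. The sketch below does that for the R/W split enabled results; the deltas are copied from the table, and the one cell with no reported delta (8 users, CI average, where both values are 0.476) is assumed to be 0.

```python
# Deltas in ms (2 partitions minus 1 partition), R/W split enabled.
# Order within each row: CI avg, CI 75%, CI 95%, CO avg, CO 75%, CO 95%.
# The first value of the 8-user row is an assumed 0 (no delta was reported).
deltas_ms = {
    8: [0, 2, -4, -7, 38, -27],
    20: [-4, -7, -12, -10, -10, -13],
    30: [16, 18, 24, 8, 9, 6],
    75: [-3, -6, 12, -4, -11, 2],
}

all_deltas = [d for row in deltas_ms.values() for d in row]

faster = sum(1 for d in all_deltas if d < 0)      # 2 partitions faster
slower = sum(1 for d in all_deltas if d > 0)      # 2 partitions slower
unchanged = sum(1 for d in all_deltas if d == 0)
net_ms = sum(all_deltas)

print(f"faster={faster} slower={slower} unchanged={unchanged} net={net_ms:+d} ms")
```

The measurements split almost evenly between faster and slower, and the direction flips from one user count to the next, which is consistent with the conclusion above.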

Read Write split disabled 

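R/W split disabled, CI (response times in seconds; deltas in ms relative to 1 partition)

| Users | 2 partitions avg. | 2 partitions 75% | 2 partitions 95% | 1 partition avg. | 1 partition 75% | 1 partition 95% |
|---|---|---|---|---|---|---|
| 8 | 0.484 (+17) | 0.497 (+20) | 0.561 (+7) | 0.467 | 0.477 | 0.554 |
| 20 | 0.471 (+11) | 0.468 (-11) | 0.540 (+19) | 0.460 | 0.479 | 0.521 |
| 30 | 0.446 (-7) | 0.472 (-6) | 0.520 (-2) | 0.453 | 0.479 | 0.522 |
| 75 | 0.552 | 0.604 (+91) | 0.736 (+37) | 0.522 | 0.513 | 0.669 |

R/W split disabled, CO (response times in seconds; deltas in ms relative to 1 partition)

| Users | 2 partitions avg. | 2 partitions 75% | 2 partitions 95% | 1 partition avg. | 1 partition 75% | 1 partition 95% |
|---|---|---|---|---|---|---|
| 8 | 0.759 (+24) | 0.769 (+26) | 0.926 (+70) | 0.735 | 0.743 | 0.856 |
| 20 | 0.748 (+14) | 0.771 (+15) | 0.848 (+16) | 0.734 | 0.756 | 0.832 |
| 30 | 0.727 (-10) | 0.749 (-9) | 0.824 (-5) | 0.737 | 0.758 | 0.829 |
| 75 | 0.977 (+49) | 1.046 (+54) | 1.220 | 0.928 | 0.992 | 1.120 |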

*There is no significant difference in response times between 1 and 2 partitions of the mod-pubsub Kafka topics when R/W split is disabled. Some cases are better and some are worse, so we conclude that there is no pattern and that having 2 partitions brings no benefit in response times.

Comparisons

Comparison between R/W split enabled and disabled with 1 and 2 partitions

The table below shows how many milliseconds are gained or lost by enabling Read/Write split on the DB. Response times with R/W split disabled are the baseline: +n means the response time with R/W split enabled is n ms slower than the baseline, and -n means it is n ms faster.

Notable observations:

  • As shown here, there is a big difference in the CPU usage pattern with and without R/W split. For now it does not look like R/W split helps much, and in most cases it makes performance worse. However, there may be performance benefits in real-life usage and/or when several workflows with high DB load (such as Data Import) run at the same time, because the split distributes load between DB nodes.
  • For now there is no visible pattern from which to conclude whether R/W split works better with 1 mod-pubsub partition or with 2.
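
R/W split disabled (baseline), CI (response times in seconds; deltas in ms when R/W split is enabled)

| Users | 2 partitions avg. | 2 partitions 75% | 2 partitions 95% | 1 partition avg. | 1 partition 75% | 1 partition 95% |
|---|---|---|---|---|---|---|
| 8 | 0.484 (-8) | 0.497 (-1) | 0.561 (-5) | 0.467 (+9) | 0.477 (+17) | 0.554 (+6) |
| 20 | 0.471 (-17) | 0.468 (+9) | 0.540 (-13) | 0.460 (+3) | 0.479 (+5) | 0.521 (+18) |
| 30 | 0.446 (+10) | 0.472 (+10) | 0.520 (+10) | 0.453 (-13) | 0.479 (-15) | 0.522 (-16) |
| 75 | 0.552 (-26) | 0.604 (-30) | 0.736 (+29) | 0.522 (+7) | 0.513 (+67) | 0.669 (+26) |

R/W split disabled (baseline), CO (response times in seconds; deltas in ms when R/W split is enabled)

| Users | 2 partitions avg. | 2 partitions 75% | 2 partitions 95% | 1 partition avg. | 1 partition 75% | 1 partition 95% |
|---|---|---|---|---|---|---|
| 8 | 0.759 (+4) | 0.769 (+15) | 0.926 (-36) | 0.735 (+35) | 0.743 (+3) | 0.856 (+61) |
| 20 | 0.748 (-8) | 0.771 (-8) | 0.848 (-3) | 0.734 (+16) | 0.756 (+17) | 0.832 (+26) |
| 30 | 0.727 (+20) | 0.749 (+21) | 0.824 (+20) | 0.737 (+2) | 0.758 (+3) | 0.829 (+13) |
| 75 | 0.977 (-26) | 1.046 (-26) | 1.220 (-13) | 0.928 (+27) | 0.992 (+39) | 1.120 (+65) |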

Comparison between current result vs initial results

The initial results were obtained by measuring CICO performance on snapshot versions of the modules.

As the baseline for the current results we use the CICO results with Read/Write split disabled and one mod-pubsub Kafka partition, since these were also the conditions of the initial testing.


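Recent results (seconds) vs. initial test (milliseconds)

| Users | CI avg. | CI 95% | CO avg. | CO 95% |
|---|---|---|---|---|
| 8 | 0.467 vs. 534 (-14%) | 0.554 vs. 1,041 (-87%) | 0.735 vs. 909 (-23%) | 0.856 vs. 1,462 (-70%) |
| 20 | 0.460 vs. 600 (-30%) | 0.521 vs. 1,196 (-129%) | 0.734 vs. 1,110 (-51%) | 0.832 vs. 1,807 (-117%) |
| 30 | 0.453 vs. 834 (-84%) | 0.522 vs. 1,708 (-227%) | 0.737 vs. 1,590 (-115%) | 0.829 vs. 2,777 (-234%) |
| 75 | 0.522 vs. 825 (-58%) | 0.669 vs. 1,566 (-134%) | 0.928 vs. 1,960 (-111%) | 1.120 vs. 3,142 (-180%) |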

*Numbers like 534 are from the initial test (with snapshot versions), in milliseconds. Numbers like 0.467 are results of the recent tests, in seconds.

**There is a significant difference in response times. The initial results were obtained on snapshot versions of the modules and the recent results on released versions, which explains the difference.
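The report does not state how the percentages were calculated. The sketch below is an assumption that reproduces the values in the table: the difference is taken relative to the recent result and truncated toward zero. The function name is hypothetical.

```python
def percent_vs_initial(recent_s: float, initial_ms: int) -> int:
    """Percent difference of the initial result relative to the recent one.

    Negative values mean the initial (snapshot) test was slower.
    Assumed formula: (recent - initial) / recent * 100, truncated toward zero.
    """
    recent_ms = round(recent_s * 1000)  # seconds to whole milliseconds
    return int((recent_ms - initial_ms) * 100 / recent_ms)


# 8 users, CI average: 0.467 s recent vs 534 ms initial
print(percent_vs_initial(0.467, 534))
# 30 users, CO 95th percentile: 0.829 s recent vs 2777 ms initial
print(percent_vs_initial(0.829, 2777))
```

Because the denominator is the smaller, recent value, results below -100% are possible; for example, an initial time more than twice the recent one.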

CPU Utilization 


[Graphs: R/W split enabled | R/W split disabled]

Average CPU usage per module for the 30-user test:

| Module | CPU usage R/W enabled, 2 partitions | CPU usage R/W enabled, 1 partition | CPU usage R/W disabled, 1 partition | CPU usage R/W disabled, 2 partitions |
|---|---|---|---|---|
| mod-inventory | 11% | 11% | 12% | 12% |
| mod-inventory-storage | 14% | 11% | 16% | 14% |
| okapi | 32% | 32% | 33% | 33% |
| mod-feesfines | 23% | 23% | 23% | 25% |
| mod-patron-blocks | 1% | 2% | 2% | 2% |
| mod-pubsub | 6% | 6% | 7% | 8% |
| mod-authtoken | 17% | 19% | 14% | 17% |
| mod-circulation-storage | 6% | 5% | 6% | 6% |
| mod-circulation | 6% | 6% | 6% | 6% |
| mod-configuration | 20% | 24% | 20% | 26% |
| mod-users | 42% | 47% | 38% | 45% |
| mod-remote-storage | 2% | 2% | 2% | 2% |

*The table above shows no big benefit from a CPU utilization perspective for 1 or 2 partitions, with R/W split enabled or disabled. For mod-pubsub itself there is no benefit either: its CPU usage stays at 6-8% regardless of the number of Kafka topic partitions and of whether Read/Write split is enabled.


Instance level CPU usage

[Graphs: R/W split enabled | R/W split disabled]

Memory Utilization

Note: There is no sign of memory leaks in any module. Memory usage shows a stable trend.

[Graphs: R/W split enabled | R/W split disabled]


RDS CPU Utilization 

[Graphs: R/W split enabled | R/W split disabled]

Note: The graphs show how DB load is distributed when R/W split is enabled compared with when it is disabled.
With R/W split enabled, maximum CPU usage is about 25% on both the reader and the writer DB nodes; with R/W split disabled, it is about 45% on the primary node.


Appendix

Infrastructure

PTF environment ncp5

  • 10 m6i.2xlarge EC2 instances located in US East (N. Virginia) us-east-1
  • 2 db.r6g.xlarge database instances, one reader and one writer
  • MSK ptf-kakfa-3
    • 4 kafka.m5.2xlarge brokers in 2 zones
    • Apache Kafka version 2.8.0
    • EBS storage volume per broker 300 GiB
    • auto.create.topics.enable=true
    • log.retention.minutes=480
    • default.replication.factor=3


Modules memory and CPU parameters 

| Module | Version | Running Tasks | CPU | Memory | MemoryReservation | MaxMetaspaceSize | Xmx |
|---|---|---|---|---|---|---|---|
| mod-inventory | 20.0.4 | 2 | 1024 | 2880 | 2592 | 512m | 1814m |
| mod-inventory-storage | 26.0.0 | 2 | 1024 | 2208 | 1952 | 512m | 1440m |
| okapi | 5.0.1 | 3 | 1024 | 1684 | 1440 | 512m | 922m |
| mod-feesfines | 18.2.1 | 2 | 128 | 1024 | 896 | 128m | 768m |
| mod-patron-blocks | 1.8.0 | 2 | 1024 | 1024 | 896 | 128m | 768m |
| mod-pubsub | 2.9.1 | 2 | 1024 | 1536 | 1440 | 512m | 922m |
| mod-authtoken | 2.13.0 | 2 | 512 | 1440 | 1152 | 128m | 922m |
| mod-circulation-storage | 16.0.0 | 2 | 1024 | 1536 | 1440 | 512m | 896m |
| mod-circulation | 23.5.4 | 2 | 1536 | 2880 | 2592 | 512m | 1814m |
| mod-configuration | 5.9.1 | 2 | 128 | 1024 | 896 | 128m | 768m |
| mod-users | 19.1.1 | 2 | 128 | 1024 | 896 | 128m | 768m |
| mod-remote-storage | 2.0.2 | 2 | 1024 | 4920 | 4472 | 512m | 3960m |


mod-pub-sub Kafka topics

  • ncp5.pub-sub.fs09000000.FEE_FINE_BALANCE_CHANGED.mod-pubsub-2.9.1
  • ncp5.pub-sub.fs09000000.ITEM_AGED_TO_LOST.mod-pubsub-2.9.1
  • ncp5.pub-sub.fs09000000.ITEM_CHECKED_IN.mod-pubsub-2.9.1
  • ncp5.pub-sub.fs09000000.ITEM_CHECKED_OUT.mod-pubsub-2.9.1
  • ncp5.pub-sub.fs09000000.ITEM_CLAIMED_RETURNED.mod-pubsub-2.9.1
  • ncp5.pub-sub.fs09000000.ITEM_DECLARED_LOST.mod-pubsub-2.9.1
  • ncp5.pub-sub.fs09000000.LOAN_CLOSED.mod-pubsub-2.9.1
  • ncp5.pub-sub.fs09000000.LOAN_DUE_DATE_CHANGED.mod-pubsub-2.9.1
  • ncp5.pub-sub.fs09000000.LOAN_RELATED_FEE_FINE_CLOSED.mod-pubsub-2.9.1
  • ncp5.pub-sub.fs09000000.LOG_RECORD.mod-pubsub-2.9.1
  • ncp5.pub-sub.fs09000000.QM_ERROR.mod-pubsub-2.9.1
  • ncp5.pub-sub.fs09000000.QM_INVENTORY_INSTANCE_UPDATED.mod-pubsub-2.9.1
  • ncp5.pub-sub.fs09000000.QM_RECORD_UPDATED.mod-pubsub-2.9.1
  • ncp5.pub-sub.fs09000000.QM_SRS_MARC_BIB_RECORD_UPDATED.mod-pubsub-2.9.1
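
The report does not say how the partition count was changed for these topics. As one possible approach, the sketch below builds `kafka-topics.sh --alter` command lines for a list of topics. The bootstrap server address `broker:9092` is a placeholder and the helper name is hypothetical; note that Kafka only allows the partition count of an existing topic to be increased, not decreased.

```python
import shlex


def alter_partitions_command(topic: str, partitions: int, bootstrap_server: str) -> str:
    """Build a kafka-topics.sh command that raises a topic's partition count."""
    args = [
        "kafka-topics.sh",
        "--bootstrap-server", bootstrap_server,
        "--alter",
        "--topic", topic,
        "--partitions", str(partitions),
    ]
    # Quote each argument so the line is safe to paste into a shell.
    return " ".join(shlex.quote(arg) for arg in args)


# Two of the topics listed above; "broker:9092" is a placeholder address.
topics = [
    "ncp5.pub-sub.fs09000000.ITEM_CHECKED_IN.mod-pubsub-2.9.1",
    "ncp5.pub-sub.fs09000000.ITEM_CHECKED_OUT.mod-pubsub-2.9.1",
]

commands = [alter_partitions_command(t, 2, "broker:9092") for t in topics]
for command in commands:
    print(command)
```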