Dependencies between mod-pubsub Kafka partitions and CICO performance (Orchid)
Overview
According to PERF-534, Data Import (DI) performance improved greatly when the partition count of the DI Kafka topics was increased to 2.
In this testing effort, we'd like to see whether increasing the partitions of mod-pubsub's Kafka topics from 1 to 2 has the same positive impact on the Check In/Check Out (CICO) workflow, since many of mod-pubsub's topics are related to circulation. We also test CICO with R/W split both enabled and disabled.
Summary
- Increasing the mod-pubsub partitions to 2 does not appear to help much.
- Response times were faster in some tests and slower in others (roughly -20 ms to +50 ms).
- Enabling Read/Write split brings no big benefit for CICO in these standalone tests. Benefits may still appear in real-life usage and/or when several high-DB-load workflows (such as Data Import) run at the same time, because R/W split distributes the load between DB nodes.
Recommendations
The only notable observation is that the names of the mod-pubsub Kafka topics include the mod-pubsub version, for example:
- ncp5.pub-sub.fs09000000.FEE_FINE_BALANCE_CHANGED.mod-pubsub-2.7.0
- ncp5.pub-sub.fs09000000.FEE_FINE_BALANCE_CHANGED.mod-pubsub-2.9.1
- ncp5.pub-sub.fs09000000.FEE_FINE_BALANCE_CHANGED.mod-pubsub-2.10.0-SNAPSHOT
*These are the same topic for different mod-pubsub versions.
If mod-pubsub is updated frequently, the old topics may linger and accumulate unnecessarily. It may therefore be a good idea to exclude the version number from the topic naming pattern.
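To illustrate the accumulation described above, here is a minimal Python sketch that groups topic names by their version-less base name. The regular expression is an assumption inferred from the example names on this page (`<env>.pub-sub.<tenant>.<EVENT_TYPE>.mod-pubsub-<version>`); it is not taken from the mod-pubsub code, and `group_by_base` is a hypothetical helper.

```python
import re
from collections import defaultdict

# Assumed layout, inferred from the topic names listed on this page:
#   <env>.pub-sub.<tenant>.<EVENT_TYPE>.mod-pubsub-<version>
TOPIC_RE = re.compile(
    r"^(?P<base>[^.]+\.pub-sub\.[^.]+\.[A-Z_]+)"
    r"\.mod-pubsub-(?P<version>\d+\.\d+\.\d+(?:-SNAPSHOT)?)$"
)


def group_by_base(topics):
    """Map each version-less topic name to the module versions found for it."""
    groups = defaultdict(list)
    for name in topics:
        match = TOPIC_RE.match(name)
        if match:  # names that do not follow the assumed layout are ignored
            groups[match.group("base")].append(match.group("version"))
    return dict(groups)


topics = [
    "ncp5.pub-sub.fs09000000.FEE_FINE_BALANCE_CHANGED.mod-pubsub-2.7.0",
    "ncp5.pub-sub.fs09000000.FEE_FINE_BALANCE_CHANGED.mod-pubsub-2.9.1",
    "ncp5.pub-sub.fs09000000.FEE_FINE_BALANCE_CHANGED.mod-pubsub-2.10.0-SNAPSHOT",
    "ncp5.pub-sub.fs09000000.ITEM_CHECKED_IN.mod-pubsub-2.9.1",
]

groups = group_by_base(topics)
# Base names that exist under more than one module version are cleanup candidates.
duplicated = {base: versions for base, versions in groups.items() if len(versions) > 1}
for base, versions in duplicated.items():
    print(base, versions)
```

A check like this only reports candidates; whether an older topic is safe to delete depends on whether any consumer still reads from it.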
Test Sets
| Test # | Test Conditions | Duration | Load generator size | Load generator Memory (GiB) | Notes |
|---|---|---|---|---|---|
| 1. | 8, 20, 30, 75 users CI/CO | 30 mins each | t3.large | 3 | 2 pub-sub partitions, R/W split enabled |
| 2. | 8, 20, 30, 75 users CI/CO | 30 mins each | t3.large | 3 | 1 pub-sub partition, R/W split enabled |
| 3. | 8, 20, 30, 75 users CI/CO | 30 mins each | t3.large | 3 | 1 pub-sub partition, R/W split disabled |
| 4. | 8, 20, 30, 75 users CI/CO | 30 mins each | t3.large | 3 | 2 pub-sub partitions, R/W split disabled |
Results
The tables below list response times (average (avg.), 75th percentile, and 95th percentile) for the tests (8, 20, 30, 75 users) with 1 and 2 mod-pubsub Kafka topic partitions. Response times are in seconds.
The tables also compare 2 partitions against 1 partition: a value of +n means that the response time is n ms slower than the corresponding value from the 1-partition test, and -n means that it is n ms faster.
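As a worked example of how the deltas are read, the sketch below recomputes them for the 8-user CI/CO row with R/W split enabled. It assumes response times are in seconds and deltas are in milliseconds, as the values suggest; `delta_ms` is a hypothetical helper, not part of any test tooling.

```python
def delta_ms(two_partitions_s, one_partition_s):
    """Difference in ms between the 2-partition and 1-partition response times.

    Positive means 2 partitions were slower, negative means faster.
    """
    return round((two_partitions_s - one_partition_s) * 1000)


# 8 users, R/W split enabled (values copied from the results table)
print(delta_ms(0.496, 0.494))  # CI 75th percentile
print(delta_ms(0.556, 0.560))  # CI 95th percentile
print(delta_ms(0.784, 0.746))  # CO 75th percentile
print(delta_ms(0.890, 0.917))  # CO 95th percentile
```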
With Read Write split enabled
| R/W split enabled | CI, 2 partitions, avg. | CI, 2 partitions, 75% | CI, 2 partitions, 95% | CI, 1 partition, avg. | CI, 1 partition, 75% | CI, 1 partition, 95% | CO, 2 partitions, avg. | CO, 2 partitions, 75% | CO, 2 partitions, 95% | CO, 1 partition, avg. | CO, 1 partition, 75% | CO, 1 partition, 95% |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 8 users | 0.476 | 0.496 (+2) | 0.556 (-4) | 0.476 | 0.494 | 0.560 | 0.763 (-7) | 0.784 (+38) | 0.890 (-27) | 0.770 | 0.746 | 0.917 |
| 20 users | 0.459 (-4) | 0.477 (-7) | 0.527 (-12) | 0.463 | 0.484 | 0.539 | 0.740 (-10) | 0.763 (-10) | 0.845 (-13) | 0.750 | 0.773 | 0.858 |
| 30 users | 0.456 (+16) | 0.482 (+18) | 0.530 (+24) | 0.440 | 0.464 | 0.506 | 0.747 (+8) | 0.770 (+9) | 0.848 (+6) | 0.739 | 0.761 | 0.842 |
| 75 users | 0.526 (-3) | 0.574 (-6) | 0.707 (+12) | 0.529 | 0.580 | 0.695 | 0.951 (-4) | 1.020 (-11) | 1.187 (+2) | 0.955 | 1.031 | 1.185 |
*With R/W split enabled, there is no significant difference in response times between 1 and 2 partitions of the mod-pubsub Kafka topics. Some cases are better and some are worse, so there is no pattern, and having 2 partitions brings no benefit in response times.
Read Write split disabled
| R/W split disabled | CI, 2 partitions, avg. | CI, 2 partitions, 75% | CI, 2 partitions, 95% | CI, 1 partition, avg. | CI, 1 partition, 75% | CI, 1 partition, 95% | CO, 2 partitions, avg. | CO, 2 partitions, 75% | CO, 2 partitions, 95% | CO, 1 partition, avg. | CO, 1 partition, 75% | CO, 1 partition, 95% |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 8 users | 0.484 (+17) | 0.497 (+20) | 0.561 (+7) | 0.467 | 0.477 | 0.554 | 0.759 (+24) | 0.769 (+26) | 0.926 (+70) | 0.735 | 0.743 | 0.856 |
| 20 users | 0.471 (+11) | 0.468 (-11) | 0.540 (+19) | 0.460 | 0.479 | 0.521 | 0.748 (+14) | 0.771 (+15) | 0.848 (+16) | 0.734 | 0.756 | 0.832 |
| 30 users | 0.446 (-7) | 0.472 (-6) | 0.520 (-2) | 0.453 | 0.479 | 0.522 | 0.727 (-10) | 0.749 (-9) | 0.824 (-5) | 0.737 | 0.758 | 0.829 |
| 75 users | 0.552 | 0.604 (+91) | 0.736 (+37) | 0.522 | 0.513 | 0.669 | 0.977 (+49) | 1.046 (+54) | 1.220 | 0.928 | 0.992 | 1.120 |
*With R/W split disabled, there is no significant difference in response times between 1 and 2 partitions of the mod-pubsub Kafka topics. Some cases are better and some are worse, so there is no pattern, and having 2 partitions brings no benefit in response times.
Comparisons
Comparison between R/W split enabled/disabled with 1 and 2 partitions
The table below shows how many milliseconds we gain or lose by enabling Read/Write split on the DB. (Response times with R/W split disabled are the baseline for the comparison.)
Notable observations:
- There is a big difference in the CPU usage pattern with and without R/W split. For now, R/W split does not appear to help much, and in most cases it makes performance worse. However, there may be performance benefits in real-life usage and/or when several high-DB-load workflows (such as Data Import) run at the same time, because R/W split distributes the load between DB nodes.
- For now, there is no visible pattern showing whether R/W split works better with 1 mod-pubsub partition or with 2.
| R/W split disabled | CI, 2 partitions, avg. | CI, 2 partitions, 75% | CI, 2 partitions, 95% | CI, 1 partition, avg. | CI, 1 partition, 75% | CI, 1 partition, 95% | CO, 2 partitions, avg. | CO, 2 partitions, 75% | CO, 2 partitions, 95% | CO, 1 partition, avg. | CO, 1 partition, 75% | CO, 1 partition, 95% |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 8 users | 0.484 (-8) | 0.497 (-1) | 0.561 (-5) | 0.467 (+9) | 0.477 (+17) | 0.554 (+6) | 0.759 (+4) | 0.769 (+15) | 0.926 (-36) | 0.735 (+35) | 0.743 (+3) | 0.856 (+61) |
| 20 users | 0.471 (-17) | 0.468 (+9) | 0.540 (-13) | 0.460 (+3) | 0.479 (+5) | 0.521 (+18) | 0.748 (-8) | 0.771 (-8) | 0.848 (-3) | 0.734 (+16) | 0.756 (+17) | 0.832 (+26) |
| 30 users | 0.446 (+10) | 0.472 (+10) | 0.520 (+10) | 0.453 (-13) | 0.479 (-15) | 0.522 (-16) | 0.727 (+20) | 0.749 (+21) | 0.824 (+20) | 0.737 (+2) | 0.758 (+3) | 0.829 (+13) |
| 75 users | 0.552 (-26) | 0.604 (-30) | 0.736 (+29) | 0.522 (+7) | 0.513 (+67) | 0.669 (+26) | 0.977 (-26) | 1.046 (-26) | 1.220 (-13) | 0.928 (+27) | 0.992 (+39) | 1.120 (+65) |
Comparison between current result vs initial results
The initial results were obtained by measuring CICO performance on snapshot versions of the modules.
As the base for the current results, we use the CICO results with Read/Write split disabled and with one mod-pubsub Kafka partition (these were also the input conditions for the initial testing).
| Initial test | CI avg. | CI 95% | CO avg. | CO 95% |
|---|---|---|---|---|
| 8 users | 0.467 / 534 (-14%) | 0.554 / 1,041 (-87%) | 0.735 / 909 (-23%) | 0.856 / 1,462 (-70%) |
| 20 users | 0.460 / 600 (-30%) | 0.521 / 1,196 (-129%) | 0.734 / 1,110 (-51%) | 0.832 / 1,807 (-117%) |
| 30 users | 0.453 / 834 (-84%) | 0.522 / 1,708 (-227%) | 0.737 / 1,590 (-115%) | 0.829 / 2,777 (-234%) |
| 75 users | 0.522 / 825 (-58%) | 0.669 / 1,566 (-134%) | 0.928 / 1,960 (-111%) | 1.120 / 3,142 (-180%) |
*Numbers like 534 come from the initial test (with snapshot versions) and are in milliseconds. Numbers like 0.467 are the results of the recent tests and are in seconds.
** There is a significant difference in response times. The initial results were obtained on snapshot versions of the modules and the recent results on released versions, which explains such a difference.
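The percentages in this comparison are not defined on the page. The sketch below shows one formula that reproduces them: the difference between the recent and the initial value, relative to the recent value, truncated toward zero. This is inferred from the numbers and is an assumption, not the author's stated method; `percent_change` is a hypothetical helper, and both inputs are taken in milliseconds.

```python
def percent_change(recent_ms, initial_ms):
    """Relative difference of the recent result vs. the initial one, in percent.

    The difference is taken relative to the recent value and truncated
    toward zero; this matches the table but is an inferred formula.
    """
    return int((recent_ms - initial_ms) / recent_ms * 100)


# 8 users (recent values converted from seconds to milliseconds)
print(percent_change(467, 534))    # CI avg.
print(percent_change(554, 1041))   # CI 95%
print(percent_change(735, 909))    # CO avg.
print(percent_change(856, 1462))   # CO 95%
```

Because the baseline is the recent (smaller) value, differences can exceed -100%: for example, -234% means the initial response time was more than three times the recent one.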
CPU Utilization
[Charts: module CPU utilization, R/W split enabled vs. R/W split disabled]
Average CPU usage per module for the 30-user test:
| Modules | CPU usage R/W enabled, 2 partitions | CPU usage R/W enabled, 1 partition | CPU usage R/W disabled, 1 partition | CPU usage R/W disabled, 2 partitions |
|---|---|---|---|---|
| mod-inventory | 11% | 11% | 12% | 12% |
| mod-inventory-storage | 14% | 11% | 16% | 14% |
| okapi | 32% | 32% | 33% | 33% |
| mod-feesfines | 23% | 23% | 23% | 25% |
| mod-patron-blocks | 1% | 2% | 2% | 2% |
| mod-pubsub | 6% | 6% | 7% | 8% |
| mod-authtoken | 17% | 19% | 14% | 17% |
| mod-circulation-storage | 6% | 5% | 6% | 6% |
| mod-circulation | 6% | 6% | 6% | 6% |
| mod-configuration | 20% | 24% | 20% | 26% |
| mod-users | 42% | 47% | 38% | 45% |
| mod-remote-storage | 2% | 2% | 2% | 2% |
*The table above shows no big benefit in CPU utilization from using 1 or 2 partitions, with R/W split either enabled or disabled. The same holds for mod-pubsub itself: its CPU usage stays at 6-8% regardless of the number of Kafka topic partitions and of whether Read/Write split is enabled.
Instance level CPU usage
[Charts: instance-level CPU usage, R/W split enabled vs. R/W split disabled]
Memory Utilization
Note: No module shows any sign of a memory leak. Memory usage follows a stable trend.
[Charts: memory utilization, R/W split enabled vs. R/W split disabled]
RDS CPU Utilization
[Charts: RDS CPU utilization, R/W split enabled vs. R/W split disabled]
Note: These charts show how the DB load is distributed with R/W split enabled compared with disabled.
With R/W split enabled, the maximum CPU usage of the reader and writer DB nodes is approximately 25%, versus approximately 45% on the primary node when it is disabled.
Appendix
Infrastructure
PTF environment: ncp5
- 10 m6i.2xlarge EC2 instances located in US East (N. Virginia), us-east-1
- 2 db.r6g.xlarge database instances, one reader and one writer
- MSK ptf-kakfa-3
  - 4 kafka.m5.2xlarge brokers in 2 zones
  - Apache Kafka version 2.8.0
  - EBS storage volume per broker 300 GiB
  - auto.create.topics.enable=true
  - log.retention.minutes=480
  - default.replication.factor=3
Modules memory and CPU parameters
| Modules | Version | Running Tasks | CPU | Memory | MemoryReservation | MaxMetaspaceSize | Xmx |
|---|---|---|---|---|---|---|---|
| mod-inventory | 20.0.4 | 2 | 1024 | 2880 | 2592 | 512m | 1814m |
| mod-inventory-storage | 26.0.0 | 2 | 1024 | 2208 | 1952 | 512m | 1440m |
| okapi | 5.0.1 | 3 | 1024 | 1684 | 1440 | 512m | 922m |
| mod-feesfines | 18.2.1 | 2 | 128 | 1024 | 896 | 128m | 768m |
| mod-patron-blocks | 1.8.0 | 2 | 1024 | 1024 | 896 | 128m | 768m |
| mod-pubsub | 2.9.1 | 2 | 1024 | 1536 | 1440 | 512m | 922m |
| mod-authtoken | 2.13.0 | 2 | 512 | 1440 | 1152 | 128m | 922m |
| mod-circulation-storage | 16.0.0 | 2 | 1024 | 1536 | 1440 | 512m | 896m |
| mod-circulation | 23.5.4 | 2 | 1536 | 2880 | 2592 | 512m | 1814m |
| mod-configuration | 5.9.1 | 2 | 128 | 1024 | 896 | 128m | 768m |
| mod-users | 19.1.1 | 2 | 128 | 1024 | 896 | 128m | 768m |
| mod-remote-storage | 2.0.2 | 2 | 1024 | 4920 | 4472 | 512m | 3960m |
mod-pub-sub Kafka topics
- ncp5.pub-sub.fs09000000.FEE_FINE_BALANCE_CHANGED.mod-pubsub-2.9.1
- ncp5.pub-sub.fs09000000.ITEM_AGED_TO_LOST.mod-pubsub-2.9.1
- ncp5.pub-sub.fs09000000.ITEM_CHECKED_IN.mod-pubsub-2.9.1
- ncp5.pub-sub.fs09000000.ITEM_CHECKED_OUT.mod-pubsub-2.9.1
- ncp5.pub-sub.fs09000000.ITEM_CLAIMED_RETURNED.mod-pubsub-2.9.1
- ncp5.pub-sub.fs09000000.ITEM_DECLARED_LOST.mod-pubsub-2.9.1
- ncp5.pub-sub.fs09000000.LOAN_CLOSED.mod-pubsub-2.9.1
- ncp5.pub-sub.fs09000000.LOAN_DUE_DATE_CHANGED.mod-pubsub-2.9.1
- ncp5.pub-sub.fs09000000.LOAN_RELATED_FEE_FINE_CLOSED.mod-pubsub-2.9.1
- ncp5.pub-sub.fs09000000.LOG_RECORD.mod-pubsub-2.9.1
- ncp5.pub-sub.fs09000000.QM_ERROR.mod-pubsub-2.9.1
- ncp5.pub-sub.fs09000000.QM_INVENTORY_INSTANCE_UPDATED.mod-pubsub-2.9.1
- ncp5.pub-sub.fs09000000.QM_RECORD_UPDATED.mod-pubsub-2.9.1
- ncp5.pub-sub.fs09000000.QM_SRS_MARC_BIB_RECORD_UPDATED.mod-pubsub-2.9.1
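This page does not record how the partition count of these topics was changed. One possible way is the stock `kafka-topics.sh` tool with `--alter --partitions`; the Python sketch below only builds the command lines for a few of the topics above and does not run them. The bootstrap address `broker:9092` is a placeholder, and `alter_partitions_command` is a hypothetical helper.

```python
import shlex


def alter_partitions_command(topic, partitions, bootstrap_server):
    """Build (but do not run) a kafka-topics.sh command that sets the partition count."""
    args = [
        "kafka-topics.sh",
        "--bootstrap-server", bootstrap_server,
        "--alter",
        "--topic", topic,
        "--partitions", str(partitions),
    ]
    return shlex.join(args)


# A few circulation-related topics from the list above
topics = [
    "ncp5.pub-sub.fs09000000.ITEM_CHECKED_IN.mod-pubsub-2.9.1",
    "ncp5.pub-sub.fs09000000.ITEM_CHECKED_OUT.mod-pubsub-2.9.1",
    "ncp5.pub-sub.fs09000000.LOAN_CLOSED.mod-pubsub-2.9.1",
]

# "broker:9092" is a placeholder, not the real MSK bootstrap address
commands = [alter_partitions_command(t, 2, "broker:9092") for t in topics]
for command in commands:
    print(command)
```

Note that Kafka only allows the partition count of an existing topic to be increased, so going back from 2 partitions to 1 requires deleting and recreating the topic.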