Comparing different values of KAFKA_CONSUMER_MAX_POLL_RECORDS
Overview
The PTF was asked to evaluate whether decreasing the value of KAFKA_CONSUMER_MAX_POLL_RECORDS from 600 to 200 in mod-search’s task definition would have any performance impact on the workflows that make use of this parameter. This parameter, as its name implies, caps how many records are returned by each poll of Kafka. Fetching more records per poll (600) results in higher memory consumption but fewer trips to Kafka, whereas fetching fewer records (200) results in more trips but lower memory consumption and perhaps more stability for mod-search, the main consumer of this parameter.
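The trade-off can be sketched with a little arithmetic. The snippet below is a minimal illustration, assuming (as is typical for FOLIO modules) that the container environment variable is mapped onto the standard Kafka consumer property `max.poll.records`; the mapping and default shown here are illustrative, not mod-search’s actual wiring.

```java
import java.util.Properties;

public class MaxPollRecordsSketch {
    public static void main(String[] args) {
        // Hypothetical mapping: the env var feeds the standard Kafka
        // consumer property "max.poll.records" (mod-search's real
        // configuration code may differ).
        String envValue = System.getenv()
                .getOrDefault("KAFKA_CONSUMER_MAX_POLL_RECORDS", "200");
        Properties consumerProps = new Properties();
        consumerProps.setProperty("max.poll.records", envValue);

        // Rough trade-off for a 25K-record import: fewer records per poll
        // means more round trips to the broker, but a smaller batch held
        // in memory at any one time.
        int totalRecords = 25_000;
        for (int maxPoll : new int[] {600, 200}) {
            int polls = (totalRecords + maxPoll - 1) / maxPoll; // ceiling division
            System.out.println(maxPoll + " records/poll -> at least " + polls + " polls");
        }
    }
}
```

For 25K records this works out to at least 42 polls at 600 records per poll versus 125 polls at 200, i.e. roughly three times as many broker round trips at the smaller batch size.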
Two workflows were chosen for testing on a Sunflower release: Data Import (of 25K records) and reindexing (of instances in a cluster with a central tenant and five member tenants). Both Data Import jobs and reindexing runs produce messages for records that need to be (re)indexed. Data Import should not be affected directly, because the indexing of imported records happens asynchronously, whereas reindexing happens in real time, so its performance could be impacted directly by the parameter.
Summary
Data Import jobs had the same durations whether KAFKA_CONSUMER_MAX_POLL_RECORDS was set to 600 or 200.
Reindexing durations were likewise similar whether the value was 600 or 200.
All performance metrics (service CPU and memory utilization, database CPU utilization, and Average Active Sessions (AAS)) are similar between the two sets of tests for both the Data Import and reindexing workflows. No significant differences were identified.
Conclusion: there is no performance impact from decreasing this value from 600 to 200.
Recommendations
KAFKA_CONSUMER_MAX_POLL_RECORDS can be set to 200 as necessary.
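In the ECS task definition, this would typically appear as an entry in the container’s `environment` list, for example (a minimal sketch; the surrounding task-definition fields are omitted):

```json
{
  "name": "KAFKA_CONSUMER_MAX_POLL_RECORDS",
  "value": "200"
}
```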
Test Results
The table below contains Data Import test results for Create and Update imports of 25K MARC BIB records with the KAFKA_CONSUMER_MAX_POLL_RECORDS environment variable set to 600 and to 200. Durations of the Data Import jobs are listed as hours:minutes:seconds. Evidently the Create and Update import durations do not vary much with either setting.
| | KAFKA_CONSUMER_MAX_POLL_RECORDS = 600 | KAFKA_CONSUMER_MAX_POLL_RECORDS = 200 |
|---|---|---|
| Create Import | 0:12:38 | 0:13:13 |
| Create Import | 0:12:31 | 0:12:35 |
| Create Import | 0:12:41 | 0:11:44 |
| Update Import | 0:21:28 | 0:21:48 |
| Update Import | 0:20:22 | 0:21:51 |
| Update Import | 0:22:09 | 0:21:56 |
The next table shows reindexing test results with KAFKA_CONSUMER_MAX_POLL_RECORDS equal to 600 and to 200. Full reindexing was done on the central tenant, which has over 1M instance records. Again, the durations, in minutes, are very similar between the two runs. The table shows duration data gathered from different components: mod-search, OpenSearch, and the database.
| Duration (minutes) | KAFKA_CONSUMER_MAX_POLL_RECORDS = 600 | KAFKA_CONSUMER_MAX_POLL_RECORDS = 200 |
|---|---|---|
| Reindexing (mod-search) | 174 | 169 |
| Indexing time (via indexing rate in OpenSearch) | 120 | 120 |
| Database | 178 | 173 |
Metrics
Data Import
The next two graphs show service CPU and memory utilization being similar for DI jobs with KAFKA_CONSUMER_MAX_POLL_RECORDS set to 600 or 200. Note: henceforth KAFKA_CONSUMER_MAX_POLL_RECORDS is denoted as “variable” in the graphs for conciseness.
(Note that the service CPU and memory graphs show imports of 10K, not 25K, records. The 25K graphs are no longer available, so 10K graphs are shown instead; the 25K and 10K import graphs differ only in duration, not in spike pattern or magnitude.)
Database CPU utilization for Create Imports: no changes between the imports with the variable equal to 600 or 200.
Database CPU utilization for Update Imports: no changes between the imports with the variable equal to 600 or 200.
OpenSearch indexing rates and patterns for the import tests are similar whether the variable is 600 or 200.
ReIndexing
Service CPU and memory metrics for reindexing exhibit the same pattern of spikes as Data Import whether the variable is 600 or 200.
Database metrics such as CPU utilization and load show the same durations and pattern of spikes during reindexing of the dataset with the variable set to 600 or 200.
Indexing rates also show the same durations and pattern of spikes during reindexing of the dataset with the variable = 600 or 200.
Appendix
Infrastructure
PTF environment: secon
12 r7g.2xlarge EC2 instances located in US East (N. Virginia), us-east-1
db.r7.xlarge database instances (writer)
MSK fse-test
4 kafka.m7g.xlarge brokers in 2 zones
Apache Kafka version 3.7.x (KRaft mode)
EBS storage volume per broker 300 GiB
auto.create.topics.enable=true
log.retention.minutes=480
default.replication.factor=3
OpenSearch 2.13 ptf-test cluster (for Data Import tests)
r7g.2xlarge.search 4 data nodes
r6g.large.search 3 dedicated master nodes
OpenSearch 2.13 ptf-loc cluster (for reindexing tests)
r7g.xlarge.search 4 data nodes
m7g.large.search 3 dedicated master nodes