Multi-tenant ECS report [in progress]
Overview
The Mobius-like environment has 61 tenants. We need to test this environment to determine whether it can handle load from all 61 tenants, using high-level preliminary tests that consist of a few use cases.
1) Perform 30-minute baseline CICO tests on the main tenant and a secondary tenant with 5 concurrent users.
2) Perform the following tests on all 61 tenants at the same time, or in combinations:
- CICO test with 5 users each on all tenants
- 1-hour test, repeated 2 times
- DI Create jobs (10K) on 10-20 tenants
- Search/Browse on all tenants
3) Increase the number of Kafka brokers (+2), retest
4) Increase the number of data nodes for the OpenSearch service, retest
Compare the results against the baseline tests and record the KPIs and other observations, such as response times and errors, in a report.
Summary
- The db.r6g.xlarge DB instance cannot handle the load of 61 tenants, even with a 1-user Check-In/Check-Out (CICO) test, due to the high number of connections being created. To handle this number of concurrent DB connections, the instance type should be at least db.r6g.4xlarge (it has an acceptable connection limit). However, even that may not be enough: in the latest tests db.r6g.4xlarge often reaches its connection limit. A shared connection pool would likely have a positive effect here.
- According to https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/AuroraMySQL.Managing.Performance.html, db.r6g.4xlarge has a maximum of 4000 connections, while db.r6g.xlarge has only 2000.
- A ticket was created for the performance degradation: MODINVSTOR-1124.
- nginx-okapi CPU in the combined tests spikes up to 400%, so its CPU units should be increased, at least to 512 (currently 128);
- Kafka CPU usage stays at ±60% during the whole test (35-40% when idle). Increasing the number of Kafka brokers (+2) has a positive effect on Data Import; however, while DI performance improved, CICO was affected. As we observed, higher DI throughput loads the DB more and negatively affects CICO response times.
- io.vertx.core.impl.NoStackTraceThrowable: Connection is not active now, current status: CLOSED
- or io.netty.channel.StacklessClosedChannelException
- The first DI job (typically on the primary tenant) runs fastest; each subsequent tenant runs slower, taking up to 3 hr.
- In the last test, 5 DI jobs completed with errors due to the same issues mentioned above. One job did not even start, due to a 500 Internal Server Error on the POST call to start the job.
- OpenSearch CPU usage is at 90% during the whole test, likely because DI jobs require indexing of each created record. This indexing is done asynchronously, so it does not affect the overall DI duration, but it likely affects the performance of other workflows.
- No memory leaks were found.
- The improvements in tests #7-8 (adding 2 more brokers to the Kafka cluster and changing CPU units on nginx-edge to 512) did make Data Import faster; however, they also affected CICO, increasing CI and CO response times by +200 ms on average.
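The connection-pressure finding above can be illustrated with a back-of-the-envelope calculation. The module count and per-tenant pool size below are illustrative assumptions, not measured values from this environment:

```python
# Back-of-the-envelope estimate of concurrent DB connections.
# MODULES_WITH_DB_POOLS and POOL_SIZE_PER_TENANT are assumed,
# illustrative numbers, not measured on the Mobius-like env.
TENANTS = 61
MODULES_WITH_DB_POOLS = 10   # assumed modules that keep a pool per tenant
POOL_SIZE_PER_TENANT = 5     # assumed connections per module per tenant

estimated = TENANTS * MODULES_WITH_DB_POOLS * POOL_SIZE_PER_TENANT
print(estimated)           # 3050
print(estimated > 2000)    # True  - exceeds the db.r6g.xlarge limit
print(estimated > 4000)    # False - fits db.r6g.4xlarge, but with little headroom
```

Even with these modest assumptions the estimate exceeds the 2,000-connection limit of db.r6g.xlarge and approaches the 4,000 limit of db.r6g.4xlarge, which matches the observed behaviour; a shared connection pool reduces the per-tenant multiplier.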
Recommendations & Jiras
- Original ticket: PERF-639 Preliminary Testing of Mobius-like Env;
- Ticket to improve resources and retest: PERF-670
- Recommended to increase the DB instance type to at least db.r6g.4xlarge on an env with 61 tenants (all tests below were performed with this instance type)
- Recommended to increase CPU units to at least 512 on nginx-okapi;
- Recommended to scale Kafka (either instance type or number of brokers) due to high CPU usage;
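The nginx-okapi CPU recommendation maps to the container-level CPU units in its ECS task definition. A minimal illustrative fragment; the family and container names are assumptions, not taken from the actual environment:

```json
{
  "family": "nginx-okapi",
  "containerDefinitions": [
    {
      "name": "nginx-okapi",
      "cpu": 512,
      "essential": true
    }
  ]
}
```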
Test Runs
Test # | Test Conditions | Duration | Load generator size | Load generator Memory (GiB) | Notes |
1 | 2 tenants, 5 users each, CICO | 30 min | t3.2xlarge | 3 | |
2 | 61 tenants, 5 users each, CICO | 30 min | t3.2xlarge | 3 | |
3 | 61 tenants, 5 users each, CICO + 10K MARC BIB Create on 5 tenants | 30 min | t3.2xlarge | 3 | |
4 | 5 users CI/CO on 61 tenants + DI 10K MARC BIB Create on 15 tenants + Search workflow 1 user on 61 tenants | 60 min | t3.2xlarge | 3 | |
5 | 5 users CI/CO on 61 tenants + DI 10K MARC BIB Create on 15 tenants + Search workflow 1 user on 61 tenants | 60 min | t3.2xlarge | 3 | |
6 | 5 users CI/CO on 61 tenants + DI 10K MARC BIB Create on 30 tenants + Search workflow 1 user on 61 tenants | 90 min | t3.2xlarge | 12 | |
7 | 5 users CI/CO on 61 tenants + DI 10K MARC BIB Create on 15 tenants + Search workflow 1 user on 61 tenants | 60 min | t3.2xlarge | 10 | test with CPU units increased to 512 and 2 more brokers added to Kafka |
8 | 5 users CI/CO on 61 tenants + DI 10K MARC BIB Create on 15 tenants + Search workflow 1 user on 61 tenants (retest) | 60 min | t3.2xlarge | 10 | test with CPU units increased to 512 and 2 more brokers added to Kafka |
Results
# | Test | CI primary | CI secondary | CO primary | CO secondary | DI duration primary | DI duration secondary | Search primary | Search secondary |
1 | 2 tenants 5 users each CICO | 0,953 | 0,662 | 1,701 | 1,196 | - | - | - | - |
2 | 61 tenants 5 users CICO (*) | 1,041 | 0,797 | 1,966 | 1,414 | - | - | - | - |
3 | 61 tenants 5 users + DI on 5 tenants 10K | 1,141 | 0,863 | 2,003 | 1,511 | 30 min | 50-52 min (StacklessClosed on one tenant, on one record) | - | - |
4 | 61 tenants 5 users + DI 15 tenants 10K + Search on 61 tenants | 1,245 | 0,974 | 2,25 | 1,72 | 42 min | 1 hr 39 min - 2 hr 57 min (one tenant: StacklessClosedChannel; one tenant: "Connection is not active now, current status: CLOSED") | 1,3-4,4 s | 0,7-4,3 s |
5 | 61 tenants 5 users + DI 15 tenants 10K + Search on 61 tenants (rerun) | 1,216 | 0,941 | 2,14 | 1,641 | (stuck at 98%) | 1 hr 33 min - 2 hr 40 min (5 completed with errors, including primary) | 3,5-21 s | 2,3-18 s |
6 | 61 tenants 5 users + DI 30 tenants 10K + Search on 61 tenants | 1,321 | 0,998 | 2,277 | 1,770 | stuck at 99%; 7 records failed with 504 Gateway Time-out | 3 hr 38 min - 6 hr 24 min (4 completed with errors, including primary): 504 Gateway Time-out, io.netty.channel.StacklessClosedChannelException | 2,9-6,3 s | 2,1-7,3 s |
7 | 5 users CI/CO on 61 tenants + DI 10K MARC BIB Create on 15 tenants + Search workflow 1 user on 61 tenants (with improvement of adding 2 more brokers to Kafka) | 1,503 | 1,190 | 2,313 | 1,836 | 14 min | 56 min - 1 hr 47 min (8 jobs either stuck or completed with errors) | 3,0-5,088 | 1,800-4,348 |
8 | 5 users CI/CO on 61 tenants + DI 10K MARC BIB Create on 15 tenants + Search workflow 1 user on 61 tenants (with improvement of adding 2 more brokers to Kafka) | 1,77 | 1,487 | 2,839 | 2,319 | 29 min (error: 11 records failed with io.netty.channel.StacklessClosedChannelException) | 1 hr 2 min - 1 hr 54 min (** 10 jobs failed) | 4,103-7,605 | 3,058-8,752 |
9 | 5 users CI/CO on 61 tenants + DI 10K MARC BIB Create on 15 tenants + Search workflow 1 user on 61 tenants (with improvement of adding 2 more data nodes to OpenSearch)* with 2 Kafka brokers | 1,442 | 1,181 | 2,312 | 1,769 | 44 min, completed | 1 hr 39 min - 3 hr 8 min; one job failed on one record with "Connection is not active now, current status: CLOSED" | 1,779-3,534 | 0,67-3,636 |
10 | 5 users CI/CO on 61 tenants + DI 10K MARC BIB Create on 15 tenants + Search workflow 1 user on 61 tenants (with improvement of adding 2 more data nodes to OpenSearch)* with 2 Kafka brokers | 1,496 | 1,220 | 2,334 | 1,845 | 44 min, completed with errors (2 records) | 1 hr 42 min - 2 hr 48 min; one job failed on one record with "Connection is not active now, current status: CLOSED" | 2,381-5,197 | 1,308-4,982 |
*Note: two tests were performed here to check the consistency of the results.
** New errors that were never observed before in the DI workflow appeared in test #8:
- io.vertx.pgclient.PgException: FATAL: remaining connection slots are reserved for non-replication superuser connections (53300) -- on mod-inventory, mod-login, mod-authtoken, mod-permissions, mod-source-record-manager, mod-source-record-storage, mod-users, mod-circulation.
- io.vertx.pgclient.PgException: FATAL: sorry, too many clients already (53300) -- on mod-inventory only
Test #1 2 tenants 5 users each CICO
There is almost no visible CPU, memory, or DB usage; no issues were found.
Memory Utilization
Note: No memory leaks were found.
CPU Utilization
Note: CPU usage on the related containers is hardly visible due to the low load during the baseline test.
RDS metrics
Test #2 61 tenants 5 users each
Two identical tests were performed; as the results are mostly the same, only one of them is used in this report.
test #2 (61 tenants 5 user each CICO)
- The modules using the most memory are mod-inventory (±100%), mod-circulation (±80%), and mod-circulation-storage (±80%).
- nginx-okapi CPU usage is ±400% due to the small number of CPU units allocated to the module by default (128).
- DB CPU is 15% on average during the whole test.
- DB connection count is ±2,000 on average, with spikes up to 3.5K.
Memory Utilization
Note: even with a total of 305 users (5 users × 61 tenants) there are no visible memory leaks or anomalies.
CPU Utilization
Note: with 305 total users in the test, CPU usage on nginx-okapi spikes above 400%.
RDS metrics
Note: RDS CPU usage is mostly below 20% during the whole test.
Note: the RDS DB connection count when the system is in a stand-by state is ±1,000 connections. During the test this number increases to ±2K, with spikes above 3K connections.
Test #3 61 tenants 5 users + DI on 5 tenants 10K
- nginx-okapi CPU usage is ±400% due to the small number of CPU units allocated to the module by default (128).
- No memory leaks were found.
- DB CPU usage is close to 50%.
- DB connection count spikes up to 4,000 connections (the maximum for the current DB instance type).
Memory Utilization
Note: no memory leaks were found in the modules related to this test.
CPU Utilization
Note: the charts above and below are identical, except that the one above excludes nginx-okapi for a better view of the other modules' CPU usage. All modules are in good shape, without reaching their limits.
RDS metrics
Note: with DI added to CICO in this test, RDS CPU usage grew to ±50%.
Note: DB connections spike up to 4K.
Kafka metrics
Note: Kafka memory usage during the test reached 2.3%; this memory is freed by the retention policy after 480 minutes.
Note: Kafka CPU usage is ±50-60% on each broker.
Test #4 61 tenants 5 users + DI 15 tenants 10K + Search on 61 tenants
The test consisted of two parts:
- The actual test, which includes CICO + DI + Search on each tenant (approximately from 12:15 to 13:15 on the chart)
- Data imports continuing to run after the actual test (from 13:15 to 15:45)
test #4,#5
- nginx-okapi CPU usage is ±500%;
- No signs of memory leaks were found;
- DB CPU usage is 70-80% on average.
- DB connection count is ±3,000 connections on average, with spikes up to 4K.
- OpenSearch CPU usage is close to 90% (max) from the beginning of the test, due to the Search workflow included in this test plus Data Import indexing.
- A few DI jobs have "Completed with errors" status due to a few records failing with "Connection is not active now, current status: CLOSED" and StacklessClosedChannel.
- "Connection is not active now, current status: CLOSED" happened only once and has not been reproduced yet.
- A DI job (10K) took up to 3 hours to complete.
- Search rate is 180 Ops/min.
Memory Utilization
Note: no major issues with memory were found.
CPU Utilization
Note: below are two charts: one without nginx-okapi for a more accurate view of the CPU trends, and a second one with nginx-okapi.
Note: nginx-okapi has more than 500% CPU usage because only 128 CPU units are allocated to the nginx-okapi service. We should consider increasing the CPU units for nginx-okapi to at least 512.
RDS metrics
OpenSearch metrics
Kafka metrics
Test #5 61 tenants 5 users + DI 15 tenants 10K + Search on 61 tenants (rerun)
This is a rerun of the previous test to check the consistency of the results.
The results are more or less the same, except for the Search workflow, whose response time increased 4×.
Memory Utilization
CPU Utilization
RDS metrics
OpenSearch metrics
Kafka metrics
Test #6 61 tenants 5 users + DI 30 tenants 10K + Search on 61 tenants
Test #6
- CPU usage on nginx-edge is, as expected, high: ±500%
- CPU usage on mod-quick-marc reached 1.5K% and did not go down after the test. However, it did not affect the results; moreover, this behaviour has not been reproduced since.
- DI took up to 6.5 hr to complete.
Memory Utilization
CPU Utilization
Note: two CPU utilisation charts are included here; the first one excludes nginx-okapi and mod-quick-marc.
mod-quick-marc reached 1.5K% CPU usage from the very beginning of the test and did not come down even after the test ended. Note that this is new behaviour and was not observed in previous tests.
RDS metrics
Note: RDS connections reached 4K concurrent connections during the main part of the test.
OpenSearch metrics
Kafka metrics
Note: the test itself, including the data imports that ran afterwards, ended at ±17:00 (with respect to the timeline); however, messages were still present for 8 more hours.
Test #7-8 61 tenants 5 users + DI 15 tenants 10K + Search on 61 tenants (with improvements, rerun)
Test #7,#8 (improvement with 2 additional Kafka brokers)
- CPU usage on nginx-edge after the improvements is 150% on average.
- All other modules' CPU and memory usage is the same as in previous tests.
- DI became faster (possibly due to better-performing Kafka with 4 brokers in the cluster). In previous tests it took up to 2 hr 57 min; now it is 1 hr 57 min.
- CI/CO response times are affected by +200 ms for CI and +500 ms for CO. This can be explained by the increased number of Kafka brokers: DI became faster and produced more load on the DB side, which affects CICO response times.
Memory Utilization
CPU Utilization
RDS metrics
OpenSearch metrics
Kafka metrics
Test #9-10 61 tenants 5 users + DI 15 tenants 10K + Search on 61 tenants (with additional data nodes on OpenSearch)
Test #9,#10 (tests with only 2 Kafka brokers and 2 additional data nodes on OpenSearch)
- All other modules' CPU and memory usage is the same as in previous tests.
- DI became slower in comparison to the previous two tests; however, it is more stable.
- CICO response times became faster than in the previous tests (#7, #8), which supports the point that the faster DI is, the more it loads the DB and the more it affects CICO response times.
- Search response times do not seem better: compared to other tests they are, on average, sometimes better and sometimes worse. So 2 additional data nodes did not change search performance much.
Memory Utilization
CPU Utilization
RDS metrics
OpenSearch metrics
Kafka metrics
Appendix
The Mobius-like env has 61 tenants:
- Primary tenant fs00001137 has 3M+ records in inventory
- 60 secondary tenants (mob01, mob02, ... mob060) originally had 10K records prepared each (at this point the numbers may vary, as a number of data imports were performed on different tenants).
Infrastructure
PTF environment ompt-pvt
- 11 m6g.2xlarge EC2 instances located in US East (N. Virginia)us-east-1
- 1 instance of a db.r6g.4xlarge database
- MSK ptf-mobius-testing
- 2 kafka.m5.2xlarge brokers in 2 zones
Apache Kafka version 2.8.0
EBS storage volume per broker 250 GiB
- auto.create.topics.enable=true
- log.retention.minutes=480
- num.partitions=2
- OpenSearch fse
- version - OpenSearch 2.7
- instance type r6g.xlarge.search
- 4 data nodes
- EBS volume 500 GiB
- Dedicated Master nodes 3 X r6g.large.search
Modules memory and CPU parameters
Methodology/Approach
- The PTF team developed DMS (Data Management Server), which responds via API calls with the needed data for each tenant.
- API to call: ${DMSHost}:5222/${tenantID}/CICO_available. This API call returns an available item ID in JSON format.
- PTF prepared a data preparation script (Bash) that loops through tenant IDs and prepares data for each tenant, including the primary one.
- Script: https://github.com/folio-org/perf-testing/tree/master/workflows-scripts/master-script-multi-tenant:
- Artefact placed here: multi-tenant-checkInCheckOut-DI-Search2.zip
- The PTF team improved the existing CICO script (+DI script + Search script) to work with multiple tenants using the DMS server.
- The script logs in for each of the tenants available in credentials.csv and writes all the data (such as tenantId, token, tenantHost) to a separate file.
- Each subsequent thread group uses the already prepared file with tokens and tenant information to run its workflows.
- The script is designed to start each new DI with a delay of 2 minutes.
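The per-tenant login step described above can be sketched as follows. This is a minimal illustration, not the actual JMeter script: the login call is stubbed out (a real script would POST the credentials to each tenant's authn/login endpoint and read the token from the response), and the field names are assumptions based on the description:

```python
import csv
import io

def fake_login(tenant_id, username, password):
    # Stub: a real script would POST the credentials to the tenant's
    # login endpoint and return the token from the response.
    return f"token-for-{tenant_id}"

def prepare_token_file(credentials_csv, login=fake_login):
    """Read tenantId,username,password,tenantHost rows and emit
    tenantId,token,tenantHost rows for the later thread groups."""
    out = io.StringIO()
    writer = csv.writer(out)
    for tenant_id, user, pwd, host in csv.reader(io.StringIO(credentials_csv)):
        writer.writerow([tenant_id, login(tenant_id, user, pwd), host])
    return out.getvalue()

creds = ("mob01,user1,pw1,https://mob01.example.org\n"
         "mob02,user2,pw2,https://mob02.example.org\n")
print(prepare_token_file(creds))
```

The point of the two-phase design is that login happens once per tenant up front, so the CICO/DI/Search thread groups only read the prepared token file instead of re-authenticating 305 users during the measured interval.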
Analysis:
To determine the average response time for CICO, use the avg. column from the summary table. As DI is running in the background for almost the whole time, we can use the whole time range of the test for the average analysis.
To check DI duration, either run CheckDI.sh (located on the carrier box at /home/ec2-user/MasterDataLoad/Mobius) using
bash CheckDI.sh psql.conf
or run on DB side
SELECT started_date, completed_date - started_date AS duration, file_name, status FROM [tenantId]_mod_source_record_manager.job_execution ORDER BY started_date DESC LIMIT 100;
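To run the duration check above across many tenants, the query can be templated per tenant ID, since each tenant has its own mod_source_record_manager schema. A minimal sketch; the example tenant IDs follow the Appendix:

```python
# Template of the per-tenant DI duration query shown above; each tenant
# has its own <tenantId>_mod_source_record_manager schema.
QUERY = (
    "SELECT started_date, completed_date - started_date AS duration, "
    "file_name, status "
    "FROM {tenant}_mod_source_record_manager.job_execution "
    "ORDER BY started_date DESC LIMIT 100;"
)

def di_duration_queries(tenant_ids):
    """Return one ready-to-run query string per tenant schema."""
    return [QUERY.format(tenant=t) for t in tenant_ids]

# Each string can then be passed to psql with -c, one tenant at a time.
for q in di_duration_queries(["fs00001137", "mob01"]):
    print(q)
```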
Additional Screenshots of graphs or charts
An Excel spreadsheet is attached with summary tables for all tests in this report. Each tab in the spreadsheet corresponds to a test. (Note: the spreadsheet does not include DI durations.)
Discussion
Things to discuss:
- io.vertx.core.impl.NoStackTraceThrowable: Connection is not active now, current status: CLOSED. This is a DI error we have never seen before; moreover, we have not been able to reproduce it yet.
- In test #5 the response time for the Search workflow increased significantly without any visible reason.