Multi-tenant ECS report [in progress]
Overview
The Mobius-like environment has 61 tenants. We need to test this environment to determine whether it can handle load from all 61 tenants, using high-level preliminary tests that consist of a few use cases.
1) Perform 30-minute baseline CICO tests on the main tenant and a secondary tenant with 5 concurrent users.
2) Perform the following tests on all 61 tenants at the same time, or in combinations:
- CICO test with 5 users each on all tenants
- 1-hour test, repeated 2 times
- DI Create jobs (10K) on 10-20 tenants
- Search/Browse on all tenants
3) Increase the number of Kafka brokers (+2), retest
4) Increase the number of data nodes for the OpenSearch service, retest
Compare the results against the baseline tests and record the KPIs and other observations, such as response times and errors, in a report.
Summary
- The db.r6g.xlarge DB instance cannot handle the load of 61 tenants, even with a 1-user Check-In/Check-Out (CICO) test, due to the high number of connections being created. To handle this number of concurrent DB connections, the instance type should be at least db.r6g.4xlarge (it has an acceptable connection limit). However, even that may not be enough: in the latest tests db.r6g.4xlarge often reaches its connection limit. A shared connection pool would likely have a positive effect here.
- According to https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/AuroraMySQL.Managing.Performance.html, db.r6g.4xlarge has a maximum of 4000 connections, while db.r6g.xlarge has only 2000.
- A ticket was created for the performance degradation: MODINVSTOR-1124.
- nginx-okapi CPU in the combined tests spikes up to 400%, so its CPU units should be increased, at least to 512 (currently 128);
- Kafka CPU usage stays at ±60% during the whole test (35-40% when idle). Increasing the number of Kafka brokers (+2) has a positive effect on Data Import; however, while DI performance improved, CICO was affected. As we observed, higher DI throughput loads the DB more and negatively affects CICO response times.
- io.vertx.core.impl.NoStackTraceThrowable: Connection is not active now, current status: CLOSED
- or io.netty.channel.StacklessClosedChannelException
- The first DI job (typically on the primary tenant) runs fastest; each subsequent tenant runs slower, taking up to 3 hr.
- In the last test, 5 DI jobs completed with errors due to the same issues mentioned above. One job did not even start, due to a 500 Internal Server Error on the POST call to start the job.
- OpenSearch CPU usage is at 90% during the whole test, likely because DI jobs require indexing of each created record. This indexing is done asynchronously, so it does not affect the overall DI duration, but it likely affects the performance of other workflows.
- No memory leaks were found.
- The improvements in tests #7-8 (adding 2 more brokers to the Kafka cluster and changing CPU units on nginx-edge to 512) did make Data Import faster; however, they also affected CICO, increasing CI and CO response times by +200 ms on average.
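The connection-pressure finding above can be illustrated with a back-of-the-envelope calculation. The module count and per-tenant pool size below are illustrative assumptions, not measured values from this environment:

```python
# Back-of-the-envelope estimate of concurrent DB connections.
# MODULES_WITH_DB_POOLS and POOL_SIZE_PER_TENANT are assumed,
# illustrative numbers, not measured on the Mobius-like env.
TENANTS = 61
MODULES_WITH_DB_POOLS = 10   # assumed modules that keep a pool per tenant
POOL_SIZE_PER_TENANT = 5     # assumed connections per module per tenant

estimated = TENANTS * MODULES_WITH_DB_POOLS * POOL_SIZE_PER_TENANT
print(estimated)           # 3050
print(estimated > 2000)    # True  - exceeds the db.r6g.xlarge limit
print(estimated > 4000)    # False - fits db.r6g.4xlarge, but with little headroom
```

Even with these modest assumptions the estimate exceeds the 2,000-connection limit of db.r6g.xlarge and approaches the 4,000 limit of db.r6g.4xlarge, which matches the observed behaviour; a shared connection pool reduces the per-tenant multiplier.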
Recommendations & Jiras
- Original ticket: PERF-639 Preliminary Testing of Mobius-like Env;
- Ticket to improve resources and retest: PERF-670
- Recommended to increase the DB instance type to at least db.r6g.4xlarge on an env with 61 tenants (all tests below were performed with this instance type)
- Recommended to increase CPU units to at least 512 on nginx-okapi;
- Recommended to scale Kafka (either instance type or number of brokers) due to high CPU usage;
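The nginx-okapi CPU recommendation maps to the container-level CPU units in its ECS task definition. A minimal illustrative fragment; the family and container names are assumptions, not taken from the actual environment:

```json
{
  "family": "nginx-okapi",
  "containerDefinitions": [
    {
      "name": "nginx-okapi",
      "cpu": 512,
      "essential": true
    }
  ]
}
```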
Test Runs
Test # | Test Conditions | Duration | Load generator size | Load generator Memory (GiB) | Notes |
1 | 2 tenants, 5 users each, CICO | 30 min | t3.2xlarge | 3 | |
2 | 61 tenants, 5 users each, CICO | 30 min | t3.2xlarge | 3 | |
3 | 61 tenants, 5 users each, CICO + 10K MARC BIB Create on 5 tenants | 30 min | t3.2xlarge | 3 | |
4 | 5 users CI/CO on 61 tenants + DI 10K MARC BIB Create on 15 tenants + Search workflow 1 user on 61 tenants | 60 min | t3.2xlarge | 3 | |
5 | 5 users CI/CO on 61 tenants + DI 10K MARC BIB Create on 15 tenants + Search workflow 1 user on 61 tenants | 60 min | t3.2xlarge | 3 | |
6 | 5 users CI/CO on 61 tenants + DI 10K MARC BIB Create on 30 tenants + Search workflow 1 user on 61 tenants | 90 min | t3.2xlarge | 12 | |
7 | 5 users CI/CO on 61 tenants + DI 10K MARC BIB Create on 15 tenants + Search workflow 1 user on 61 tenants | 60 min | t3.2xlarge | 10 | test with CPU units increased to 512 and 2 more brokers added to Kafka |
8 | 5 users CI/CO on 61 tenants + DI 10K MARC BIB Create on 15 tenants + Search workflow 1 user on 61 tenants (retest) | 60 min | t3.2xlarge | 10 | test with CPU units increased to 512 and 2 more brokers added to Kafka |
Results
# | Test | CI primary | CI secondary | CO primary | CO secondary | DI duration primary | DI duration secondary | Search primary | Search secondary |
1 | 2 tenants 5 users each CICO | 0,953 | 0,662 | 1,701 | 1,196 | - | - | - | - |
2 | 61 tenants 5 users CICO (*) | 1,041 | 0,797 | 1,966 | 1,414 | - | - | - | - |
3 | 61 tenants 5 users + DI on 5 tenants 10K | 1,141 | 0,863 | 2,003 | 1,511 | 30 min | 50-52 min (StacklessClosed on one tenant, on one record) | - | - |
4 | 61 tenants 5 users + DI 15 tenants 10K + Search on 61 tenants | 1,245 | 0,974 | 2,25 | 1,72 | 42 min | 1 hr 39 min - 2 hr 57 min (one tenant: StacklessClosedChannel; one tenant: "Connection is not active now, current status: CLOSED") | 1,3-4,4 s | 0,7-4,3 s |
5 | 61 tenants 5 users + DI 15 tenants 10K + Search on 61 tenants (rerun) | 1,216 | 0,941 | 2,14 | 1,641 | (stuck at 98%) | 1 hr 33 min - 2 hr 40 min (5 completed with errors, including primary) | 3,5-21 s | 2,3-18 s |
6 | 61 tenants 5 users + DI 30 tenants 10K + Search on 61 tenants | 1,321 | 0,998 | 2,277 | 1,770 | stuck at 99%; 7 records failed with 504 Gateway Time-out | 3 hr 38 min - 6 hr 24 min (4 completed with errors, including primary): 504 Gateway Time-out, io.netty.channel.StacklessClosedChannelException | 2,9-6,3 s | 2,1-7,3 s |
7 | 5 users CI/CO on 61 tenants + DI 10K MARC BIB Create on 15 tenants + Search workflow 1 user on 61 tenants (with improvement of adding 2 more brokers to Kafka) | 1,503 | 1,190 | 2,313 | 1,836 | 14 min | 56 min - 1 hr 47 min (8 jobs either stuck or completed with errors) | 3,0-5,088 | 1,800-4,348 |
8 | 5 users CI/CO on 61 tenants + DI 10K MARC BIB Create on 15 tenants + Search workflow 1 user on 61 tenants (with improvement of adding 2 more brokers to Kafka) | 1,77 | 1,487 | 2,839 | 2,319 | 29 min (error: 11 records failed with io.netty.channel.StacklessClosedChannelException) | 1 hr 2 min - 1 hr 54 min (** 10 jobs failed) | 4,103-7,605 | 3,058-8,752 |
9 | 5 users CI/CO on 61 tenants + DI 10K MARC BIB Create on 15 tenants + Search workflow 1 user on 61 tenants (with improvement of adding 2 more data nodes to OpenSearch)* with 2 Kafka brokers | 1,442 | 1,181 | 2,312 | 1,769 | 44 min, completed | 1 hr 39 min - 3 hr 8 min; one job failed on one record with "Connection is not active now, current status: CLOSED" | 1,779-3,534 | 0,67-3,636 |
10 | 5 users CI/CO on 61 tenants + DI 10K MARC BIB Create on 15 tenants + Search workflow 1 user on 61 tenants (with improvement of adding 2 more data nodes to OpenSearch)* with 2 Kafka brokers | 1,496 | 1,220 | 2,334 | 1,845 | 44 min, completed with errors (2 records) | 1 hr 42 min - 2 hr 48 min; one job failed on one record with "Connection is not active now, current status: CLOSED" | 2,381-5,197 | 1,308-4,982 |
*Note: two tests were performed here to check the consistency of the results.
** New errors that were never observed before in the DI workflow appeared in test #8:
- io.vertx.pgclient.PgException: FATAL: remaining connection slots are reserved for non-replication superuser connections (53300) -- on mod-inventory, mod-login, mod-authtoken, mod-permissions, mod-source-record-manager, mod-source-record-storage, mod-users, mod-circulation.
- io.vertx.pgclient.PgException: FATAL: sorry, too many clients already (53300) -- on mod-inventory only
Test #1 2 tenants 5 users each CICO
There is almost no visible CPU, memory, or DB usage; no issues were found.
Memory Utilization
Note: No memory leaks were found.
CPU Utilization
Note: CPU usage on the related containers is hardly visible due to the low load during the baseline test.
RDS metrics
Test #2 61 tenants 5 users each
Two identical tests were performed; as the results are mostly the same, only one of them is used in this report.
test #2 (61 tenants 5 user each CICO)
- The modules using the most memory are mod-inventory (±100%), mod-circulation (±80%), and mod-circulation-storage (±80%).
- nginx-okapi CPU usage is ±400% due to the small number of CPU units allocated to the module by default (128).
- DB CPU is 15% on average during the whole test.
- DB connection count is ±2,000 on average, with spikes up to 3.5K.
Memory Utilization
Note: even with a total of 305 users (5 users × 61 tenants) there are no visible memory leaks or anomalies.
CPU Utilization
Note: with 305 total users in the test, CPU usage on nginx-okapi spikes above 400%.
RDS metrics
Note: RDS CPU usage is mostly below 20% during the whole test.
Note: the RDS DB connection count when the system is in a stand-by state is ±1,000 connections. During the test this number increases to ±2K, with spikes above 3K connections.
Test #3 61 tenants 5 users + DI on 5 tenants 10K
- nginx-okapi CPU usage is ±400% due to the small number of CPU units allocated to the module by default (128).
- No memory leaks were found.
- DB CPU usage is close to 50%.
- DB connection count spikes up to 4,000 connections (the maximum for the current DB instance type).
Memory Utilization
Note: no memory leaks were found in the modules related to this test.
CPU Utilization
Note: the charts above and below are identical, except that the one above excludes nginx-okapi for a better view of the other modules' CPU usage. All modules are in good shape, without reaching their limits.
RDS metrics
Note: with DI added to CICO in this test, RDS CPU usage grew to ±50%.
Note: DB connections spike up to 4K.
Kafka metrics
Note: Kafka memory usage during the test reached 2.3%; this memory is freed by the retention policy after 480 minutes.
Note: Kafka CPU usage is ±50-60% on each broker.
Test #4 61 tenants 5 users + DI 15 tenants 10K + Search on 61 tenants
The test consisted of two parts:
- The actual test, which includes CICO + DI + Search on each tenant (approximately from 12:15 to 13:15 on the chart)
- Data imports continuing to run after the actual test (from 13:15 to 15:45)
test #4,#5
- nginx-okapi CPU usage is ±500%;
- No signs of memory leaks were found;
- DB CPU usage is 70-80% on average.
- DB connection count is ±3,000 connections on average, with spikes up to 4K.
- OpenSearch CPU usage is close to 90% (max) from the beginning of the test, due to the Search workflow included in this test plus Data Import indexing.
- A few DI jobs have "Completed with errors" status due to a few records failing with "Connection is not active now, current status: CLOSED" and StacklessClosedChannel.
- "Connection is not active now, current status: CLOSED" happened only once and has not been reproduced yet.
- A DI job (10K) took up to 3 hours to complete.
- Search rate is 180 Ops/min.
Memory Utilization
Note: no major issues with memory were found.
CPU Utilization
Note: below are two charts: one without nginx-okapi for a more accurate view of the CPU trends, and a second one with nginx-okapi.
Note: nginx-okapi has more than 500% CPU usage because only 128 CPU units are allocated to the nginx-okapi service. We should consider increasing the CPU units for nginx-okapi to at least 512.
RDS metrics
OpenSearch metrics
Kafka metrics
Test #5 61 tenants 5 users + DI 15 tenants 10K + Search on 61 tenants (rerun)
This is a rerun of the previous test to check the consistency of the results.
The results are more or less the same, except for the Search workflow, whose response time increased 4×.
Memory Utilization
CPU Utilization
RDS metrics
OpenSearch metrics
Kafka metrics
Test #6 61 tenants 5 users + DI 30 tenants 10K + Search on 61 tenants
Test #6
- CPU usage on nginx-edge is, as expected, high: ±500%
- CPU usage on mod-quick-marc reached 1.5K% and did not go down after the test. However, it did not affect the results; moreover, this behaviour has not been reproduced since.
- DI took up to 6.5 hr to complete.
Memory Utilization
CPU Utilization
Note: two CPU utilisation charts are included here; the first one excludes nginx-okapi and mod-quick-marc.
mod-quick-marc reached 1.5K% CPU usage from the very beginning of the test and did not come down even after the test ended. Note that this is new behaviour and was not observed in previous tests.
RDS metrics
Note: RDS connections reached 4K concurrent connections during the main part of the test.
OpenSearch metrics
Kafka metrics
Note: the test itself, including the data imports that ran afterwards, ended at ±17:00 (with respect to the timeline); however, messages were still present for 8 more hours.
Test #7-8 61 tenants 5 users + DI 15 tenants 10K + Search on 61 tenants (with improvements, rerun)
Test #7,#8 (improvement with 2 additional Kafka brokers)
- CPU usage on nginx-edge after the improvements is 150% on average.
- All other modules' CPU and memory usage is the same as in previous tests.
- DI became faster (possibly due to better-performing Kafka with 4 brokers in the cluster). In previous tests it took up to 2 hr 57 min; now it is 1 hr 57 min.
- CI/CO response times are affected by +200 ms for CI and +500 ms for CO. This can be explained by the increased number of Kafka brokers: DI became faster and produced more load on the DB side, which affects CICO response times.
Memory Utilization
CPU Utilization
RDS metrics
OpenSearch metrics
Kafka metrics
Test #9-10 61 tenants 5 users + DI 15 tenants 10K + Search on 61 tenants (with additional data nodes on OpenSearch)
Test #9,#10 (tests with only 2 Kafka brokers and 2 additional data nodes on OpenSearch)
- All other modules' CPU and memory usage is the same as in previous tests.
- DI became slower in comparison to the previous two tests; however, it is more stable.
- CICO response times became faster than in the previous tests (#7, #8), which supports the point that the faster DI is, the more it loads the DB and the more it affects CICO response times.
- Search response times do not seem better: compared to other tests they are, on average, sometimes better and sometimes worse. So 2 additional data nodes did not change search performance much.
Memory Utilization
CPU Utilization
RDS metrics
OpenSearch metrics
Kafka metrics
Appendix
The Mobius-like env has 61 tenants:
- Primary tenant fs00001137 has 3M+ records in inventory
- 60 secondary tenants (mob01, mob02, ... mob060) originally had 10K records prepared each (at this point the numbers may vary, as a number of data imports were performed on different tenants).
Infrastructure
PTF environment ompt-pvt
- 11 m6g.2xlarge EC2 instances located in US East (N. Virginia)us-east-1
- 1 instance of a db.r6g.4xlarge database
- MSK ptf-mobius-testing
- 2 kafka.m5.2xlarge brokers in 2 zones
Apache Kafka version 2.8.0
EBS storage volume per broker 250 GiB
- auto.create.topics.enable=true
- log.retention.minutes=480
- num.partitions=2
- OpenSearch fse
- version - OpenSearch 2.7
- instance type r6g.xlarge.search
- 4 data nodes
- EBS volume 500 GiB
- Dedicated Master nodes 3 X r6g.large.search
Modules memory and CPU parameters
Methodology/Approach
- The PTF team developed DMS (Data Management Server), which responds via API calls with the needed data for each tenant.
- API to call: ${DMSHost}:5222/${tenantID}/CICO_available. This API call returns an available item ID in JSON format.
- PTF prepared a data preparation script (Bash) that loops through tenant IDs and prepares data for each tenant, including the primary one.
- Script: https://github.com/folio-org/perf-testing/tree/master/workflows-scripts/master-script-multi-tenant:
- Artefact placed here: multi-tenant-checkInCheckOut-DI-Search2.zip
- The PTF team improved the existing CICO script (+DI script + Search script) to work with multiple tenants using the DMS server.
- The script logs in for each of the tenants available in credentials.csv and writes all the data (such as tenantId, token, tenantHost) to a separate file.
- Each subsequent thread group uses the already prepared file with tokens and tenant information to run its workflows.
- The script is designed to start each new DI with a delay of 2 minutes.
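The per-tenant login step described above can be sketched as follows. This is a minimal illustration, not the actual JMeter script: the login call is stubbed out (a real script would POST the credentials to each tenant's authn/login endpoint and read the token from the response), and the field names are assumptions based on the description:

```python
import csv
import io

def fake_login(tenant_id, username, password):
    # Stub: a real script would POST the credentials to the tenant's
    # login endpoint and return the token from the response.
    return f"token-for-{tenant_id}"

def prepare_token_file(credentials_csv, login=fake_login):
    """Read tenantId,username,password,tenantHost rows and emit
    tenantId,token,tenantHost rows for the later thread groups."""
    out = io.StringIO()
    writer = csv.writer(out)
    for tenant_id, user, pwd, host in csv.reader(io.StringIO(credentials_csv)):
        writer.writerow([tenant_id, login(tenant_id, user, pwd), host])
    return out.getvalue()

creds = ("mob01,user1,pw1,https://mob01.example.org\n"
         "mob02,user2,pw2,https://mob02.example.org\n")
print(prepare_token_file(creds))
```

The point of the two-phase design is that login happens once per tenant up front, so the CICO/DI/Search thread groups only read the prepared token file instead of re-authenticating 305 users during the measured interval.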
Analysis:
To determine the average response time for CICO, use the avg. column from the summary table. As DI is running in the background for almost the whole time, we can use the whole time range of the test for the average analysis.
To check DI duration, either run CheckDI.sh (located on the carrier box at /home/ec2-user/MasterDataLoad/Mobius) using
bash CheckDI.sh psql.conf
or run on DB side
SELECT started_date, completed_date - started_date AS duration, file_name, status FROM [tenantId]_mod_source_record_manager.job_execution ORDER BY started_date DESC LIMIT 100;
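To run the duration check above across many tenants, the query can be templated per tenant ID, since each tenant has its own mod_source_record_manager schema. A minimal sketch; the example tenant IDs follow the Appendix:

```python
# Template of the per-tenant DI duration query shown above; each tenant
# has its own <tenantId>_mod_source_record_manager schema.
QUERY = (
    "SELECT started_date, completed_date - started_date AS duration, "
    "file_name, status "
    "FROM {tenant}_mod_source_record_manager.job_execution "
    "ORDER BY started_date DESC LIMIT 100;"
)

def di_duration_queries(tenant_ids):
    """Return one ready-to-run query string per tenant schema."""
    return [QUERY.format(tenant=t) for t in tenant_ids]

# Each string can then be passed to psql with -c, one tenant at a time.
for q in di_duration_queries(["fs00001137", "mob01"]):
    print(q)
```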
Additional Screenshots of graphs or charts
An Excel spreadsheet is attached with summary tables for all tests in this report. Each tab in the spreadsheet corresponds to a test. (Note: the spreadsheet does not include DI durations.)
Discussion
Things to discuss:
- io.vertx.core.impl.NoStackTraceThrowable: Connection is not active now, current status: CLOSED. This is a DI error we have never seen before; moreover, we have not been able to reproduce it yet.
- In test #5 the response time for the Search workflow increased significantly without any visible reason.