The Mobius-like environment has 61 tenants. The goal is to determine whether it can handle load from all 61 tenants, using high-level preliminary tests that cover a few use cases.
1) Perform 30-minute baseline CICO tests on the main tenant and a secondary tenant with 5 concurrent users.
2) Perform the following tests on all 61 tenants, either at the same time or in combinations:
3) Increase the number of Kafka brokers (+2) and retest.
4) Increase the number of data nodes for the OpenSearch service and retest.
Compare the results against the baseline tests and record the KPIs and other observations, such as response times or errors, in a report.
Test # | Test conditions | Duration | Load generator size | Load generator memory (GiB) | Notes |
1 | 2 tenants, 2 users each, CICO | 30 min | t3.2xlarge | 3 | |
2 | 61 tenants, 5 users each, CICO | 30 min | t3.2xlarge | 3 | |
3 | 61 tenants, 5 users each, CICO + DI 10k MARC BIB Create on 5 tenants | 30 min | t3.2xlarge | 3 | |
4 | 5 users CI/CO on 61 tenants + DI 10k MARC BIB Create on 15 tenants + Search workflow, 1 user, on 61 tenants | 60 min | t3.2xlarge | 3 | |
5 | 5 users CI/CO on 61 tenants + DI 10k MARC BIB Create on 15 tenants + Search workflow, 1 user, on 61 tenants | 60 min | t3.2xlarge | 3 | |
6 | 5 users CI/CO on 61 tenants + DI 10k MARC BIB Create on 30 tenants + Search workflow, 1 user, on 61 tenants | 90 min | t3.2xlarge | 12 | |
7 | 5 users CI/CO on 61 tenants + DI 10k MARC BIB Create on 15 tenants + Search workflow, 1 user, on 61 tenants | 60 min | t3.2xlarge | 10 | Test with CPU units changed up to 512 and 2 more brokers added to Kafka |
8 | 5 users CI/CO on 61 tenants + DI 10k MARC BIB Create on 15 tenants + Search workflow, 1 user, on 61 tenants (retest) | 60 min | t3.2xlarge | 10 | Test with CPU units changed up to 512 and 2 more brokers added to Kafka |
# | Test | CI primary | CI secondary | CO primary | CO secondary | DI duration primary | DI duration secondary | Search primary | Search secondary |
1 | 2 tenants, 5 users each, CICO | 0,953 | 0,662 | 1,701 | 1,196 | - | - | - | - |
2 | 61 tenants, 5 users, CICO (*) | 1,041 | 0,797 | 1,966 | 1,414 | - | - | - | - |
3 | 61 tenants, 5 users + DI 10K on 5 tenants | 1,141 | 0,863 | 2,003 | 1,511 | 30 min | 50-52 min (StacklessClosed on one tenant on one record) | - | - |
4 | 61 tenants, 5 users + DI 10K on 15 tenants + Search on 61 tenants | 1,245 | 0,974 | 2,25 | 1,72 | 42 min | 1 hr 39 min - 2 hr 57 min (one tenant with StacklessClosedChannel; one tenant with "Connection is not active now, current status: CLOSED") | 1,3-4,4 s | 0,7-4,3 s |
5 | 61 tenants, 5 users + DI 10K on 15 tenants + Search on 61 tenants (rerun) | 1,216 | 0,941 | 2,14 | 1,641 | (stuck on 98%) | 1 hr 33 min - 2 hr 40 min (5 completed with errors, including primary) | 3,5-21 s | 2,3-18 s |
6 | 61 tenants, 5 users + DI 10K on 30 tenants + Search on 61 tenants | 1,321 | 0,998 | 2,277 | 1,770 | stuck on 99%; 7 records failed with 504 Gateway Time-out | 3 hr 38 min - 6 hr 24 min (4 completed with errors, including primary): 504 Gateway Time-out, io.netty.channel.StacklessClosedChannelException | 2,9-6,3 s | 2,1-7,3 s |
7 | 5 users CI/CO on 61 tenants + DI 10k MARC BIB Create on 15 tenants + Search workflow, 1 user, on 61 tenants (with 2 more brokers added to Kafka) | 1,503 | 1,190 | 2,313 | 1,836 | 14 min | 56 min - 1 hr 47 min (8 jobs either stuck or completed with errors) | 3,0-5,088 s | 1,800-4,348 s |
8 | 5 users CI/CO on 61 tenants + DI 10k MARC BIB Create on 15 tenants + Search workflow, 1 user, on 61 tenants (with 2 more brokers added to Kafka) | 1,77 | 1,487 | 2,839 | 2,319 | 29 min (error: 11 records failed with io.netty.channel.StacklessClosedChannelException) | 1 hr 2 min - 1 hr 54 min (** 10 jobs failed) | 4,103-7,605 s | 3,058-8,752 s |
9 | 5 users CI/CO on 61 tenants + DI 10k MARC BIB Create on 15 tenants + Search workflow, 1 user, on 61 tenants (with 2 more data nodes added to OpenSearch; 2 Kafka brokers) | 1,442 | 1,181 | 2,312 | 1,769 | 44 min, completed | 1 hr 39 min - 3 hr 8 min (one job failed on one record with "Connection is not active now, current status: CLOSED") | 1,779-3,534 s | 0,67-3,636 s |
10 | 5 users CI/CO on 61 tenants + DI 10k MARC BIB Create on 15 tenants + Search workflow, 1 user, on 61 tenants (with 2 more data nodes added to OpenSearch; 2 Kafka brokers) | 1,496 | 1,220 | 2,334 | 1,845 | 44 min, completed with errors (2 records) | 1 hr 42 min - 2 hr 48 min (one job failed on one record with "Connection is not active now, current status: CLOSED") | 2,381-5,197 s | 1,308-4,982 s |
* Note: two tests were performed here to check the consistency of the results.
** New errors that had never been observed before in the DI workflow appeared in test #8:
io.vertx.pgclient.PgException: FATAL: sorry, too many clients already (53300) (on mod-inventory only)
CPU, memory, and DB usage are barely visible, and no issues were found.
Note: No memory leaks were found.
Note: CPU usage on the related containers is hardly visible because of the low load during the baseline test.
Two identical tests were performed; as the results are mostly the same, only one of them is used in this report.
Test #2 (61 tenants, 5 users each, CICO)
Note: Even with a total of 5 x 61 tenants = 305 users, there are no visible memory leaks or anomalies.
Note: With 305 total users in the test, CPU usage on nginx-okapi spikes above 400%.
Note: RDS CPU usage is mostly below 20% during the whole test.
Note: The number of RDS DB connections is ±1000 when the system is idle. During the test this number increases to ±2K, with spikes above 3K connections.
Note: No memory leaks were found in the modules related to the test.
Note: The charts above and below are identical, except that the one above excludes nginx-okapi to give a better view of the other modules' CPU usage. All modules are in good shape and do not reach their limits.
Note: With DI added to CICO in this test, RDS CPU usage grew to ±50%.
Note: DB connections spike up to 4K.
Note: Kafka memory usage reached 2.3% during the test. This memory will be freed by the retention policy after 480 minutes.
Note: Kafka CPU usage is ±50-60% on each broker.
In this test CICO
The test consisted of two parts:
Test #4, #5
Note: No major memory issues were found.
Note: Two charts are shown below: one without nginx-okapi, for a more accurate view of the CPU trends, and a second with nginx-okapi.
Note: nginx-okapi shows more than 500% CPU usage because only 128 CPU units are allocated to the nginx-okapi service. We should consider increasing the CPU units for nginx-okapi to at least 512.
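The sizing argument for nginx-okapi can be checked with simple arithmetic. The sketch below is illustrative only: the function names are hypothetical, and it assumes the reported CPU percentage is measured relative to the CPU units reserved for the service, which is what the note above implies.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; assumes utilisation (%) is relative to reserved CPU units.

# Absolute CPU units consumed, given the reserved units and the reported utilisation (%).
cpu_units_used() {
  local reserved="$1" util_pct="$2"
  echo $(( reserved * util_pct / 100 ))
}

# Utilisation (%) that the same load would show against a different reservation.
util_at_reservation() {
  local used="$1" new_reserved="$2"
  echo $(( used * 100 / new_reserved ))
}

used=$(cpu_units_used 128 500)     # 640 units consumed at 500% of 128
util_at_reservation "$used" 512    # prints 125
```

Under this assumption, the same load would still exceed a 512-unit reservation, which is consistent with treating 512 as a lower bound.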
Note: This is a rerun of the previous test to check the consistency of the results.
The results are more or less the same, except for the Search workflow, whose response time increased roughly 4 times.
Test #6
Note: Two CPU utilisation charts are included; the first one excludes nginx-okapi and mod-quick-marc.
mod-quick-marc reached 1,5K CPU usage from the very beginning of the test and did not come down even after the test ended. This is new behaviour that was not observed in previous tests.
Note: RDS connections reached 4K concurrent connections during the main part of the test.
Note: The test itself, including the data imports that followed it on the timeline, ended at ±17:00; however, messages remained for 8 more hours.
Test #7, #8 (improvement with 2 additional Kafka brokers)
Test #9, #10 (tests with only 2 Kafka brokers and 2 additional OpenSearch data nodes)
Mobius-like environment with 61 tenants
PTF environment: ompt-pvt
Apache Kafka version 2.8.0
EBS storage volume per broker 250 GiB
Modules memory and CPU parameters
Analysis:
To determine the average response time for CICO, use the avg. column from the summary table. Because DI runs in the background for almost the entire test, the whole time range of the test can be used for the average analysis.
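Comparing an average against the baseline can be expressed as a percent change. The snippet below is a minimal sketch, not part of the test tooling: the helper names `to_dot` and `pct_change` are hypothetical, and it assumes values are copied from the results table with decimal commas.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for comparing a KPI against its baseline value.

# Convert a decimal-comma value such as "1,041" to "1.041".
to_dot() {
  printf '%s\n' "${1/,/.}"
}

# Percent change of a measured value relative to the baseline, one decimal place.
pct_change() {
  local baseline measured
  baseline=$(to_dot "$1")
  measured=$(to_dot "$2")
  LC_ALL=C awk -v b="$baseline" -v m="$measured" \
    'BEGIN { printf "%.1f\n", (m - b) / b * 100 }'
}

# Primary-tenant check-in: test 1 (baseline) vs test 2 (61 tenants).
pct_change "0,953" "1,041"   # prints 9.2
```

For example, the primary-tenant check-out average moves from 1,701 to 1,966 between tests 1 and 2, which the same helper reports as 15.6.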
To check DI duration, either run CheckDI.sh (located on the carrier box in /home/ec2-user/MasterDataLoad/Mobius):
bash CheckDI.sh psql.conf
or run the following query on the DB side:
SELECT started_date, completed_date - started_date AS duration, file_name, status FROM [tenantId]_mod_source_record_manager.job_execution ORDER BY started_date DESC LIMIT 100;
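When DI runs on many tenants, the same query has to be repeated per tenant schema. The following is a rough sketch of how that could be scripted; it is not the contents of CheckDI.sh. The function names are hypothetical, and it assumes psql picks up its connection settings from the standard PGHOST, PGUSER, PGDATABASE, and PGPASSWORD environment variables (psql.conf may supply them differently).

```shell
#!/usr/bin/env bash
# Hypothetical helpers; not the actual CheckDI.sh.

# Build the DI-duration query for one tenant schema.
di_duration_query() {
  local tenant="$1"
  printf 'SELECT started_date, completed_date - started_date AS duration, file_name, status FROM %s_mod_source_record_manager.job_execution ORDER BY started_date DESC LIMIT 100;\n' "$tenant"
}

# Run the query for each tenant ID passed as an argument.
# Assumes connection settings are provided through the PG* environment variables.
check_di_for_tenants() {
  local tenant
  for tenant in "$@"; do
    echo "== ${tenant} =="
    psql -c "$(di_duration_query "$tenant")"
  done
}
```

A call such as `check_di_for_tenants tenant_a tenant_b` (placeholder tenant IDs) would print one result set per tenant, so the primary and secondary durations can be read side by side.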
An Excel spreadsheet with the summary tables for all tests in this report is attached. Each tab in the spreadsheet corresponds to one test. (Note: the spreadsheet does not include DI durations.)
Things to discuss: