Multi-tenant ECS report [in progress]


Overview

The Mobius-like environment has 61 tenants. We need to test this environment to see whether it can handle load from all 61 tenants, using high-level preliminary tests that consist of a few use cases.

1) Perform 30-minute baseline CICO tests on the main tenant and a secondary tenant with 5 concurrent users.
2) Perform the following tests on all 61 tenants at the same time or in combinations:

  • CICO test with 5 users on each tenant
  • 1-hour test, repeated 2 times
  • DI Create jobs (10K) on 10-20 tenants
  • Search/Browse on all tenants

3) Increase the number of Kafka brokers (+2), retest.

4) Increase the number of data nodes in the OpenSearch service, retest.

Compare the results against the baseline tests and record the KPIs and other observations, such as response times and errors, in a report.

Summary

  • The db.r6g.xlarge DB instance cannot handle the load of 61 tenants, even in a 1-user Check-In/Check-Out (CICO) test, due to the high number of connections being created. To handle this number of concurrent DB connections, the instance type should be at least db.r6g.4xlarge (which has an acceptable connection limit). However, even that may not be enough: in the latest tests, db.r6g.4xlarge often reaches its limit of available connections. A shared connection pool would likely have a positive effect here.
  • A ticket was created for the performance degradation: MODINVSTOR-1124.
  • In combined tests, nginx-okapi CPU spikes up to 400%, so its CPU units should be increased to at least 512 (currently 128).
  • Kafka CPU usage stays at the ±60% level during the whole test (35-40% in the idle state). Increasing the number of Kafka brokers (+2) has a positive effect on Data Import; however, while DI performance improved, CICO was affected: higher DI throughput loads the DB more and degrades CICO response times, with errors such as:
    • io.vertx.core.impl.NoStackTraceThrowable: Connection is not active now, current status: CLOSED
    • or io.netty.channel.StacklessClosedChannelException
  • The first DI job (typically on the primary tenant) runs fastest; each subsequent tenant's job runs slower, taking up to 3 hr.
  • In the last test, 5 DI jobs completed with errors due to the same issues mentioned above. One job did not even start, due to a 500 Internal Server Error on the POST call that starts a job.
  • OpenSearch CPU usage is at 90% during the whole test, likely because DI jobs require indexing of each created record. This indexing is asynchronous, so it does not affect overall DI duration, but it likely affects the performance of other workflows.
  • No memory leaks were found.
  • The improvements in tests #7-8 (adding 2 more brokers to the Kafka cluster and raising CPU units on nginx-edge to 512) made Data Import faster; however, they also affected CICO, increasing Check-In and Check-Out response times by +200 ms on average.
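For context on the connection limits discussed above: AWS documents the default RDS-for-PostgreSQL `max_connections` as a function of instance memory. A minimal sketch of that formula (the formula and the instance memory sizes are assumptions taken from AWS documentation, not values measured on this environment):

```python
# Default RDS PostgreSQL setting (per AWS docs):
#   max_connections = LEAST(DBInstanceClassMemory / 9531392, 5000)
# Instance memory: db.r6g.xlarge = 32 GiB, db.r6g.4xlarge = 128 GiB.
def default_max_connections(memory_gib: int) -> int:
    memory_bytes = memory_gib * 1024 ** 3
    return min(memory_bytes // 9531392, 5000)

print(default_max_connections(32))   # db.r6g.xlarge  -> 3604
print(default_max_connections(128))  # db.r6g.4xlarge -> 5000 (hits the cap)
```

These defaults line up with the order of magnitude of the connection spikes (3.5-4K) seen in these tests, and show why even db.r6g.4xlarge still has a hard ceiling under combined load.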

Recommendations & Jiras

  • Original ticket - PERF-639 Preliminary Testing of Mobius-like Env;
  • Ticket to improve resources and retest: PERF-670;
  • Increase the DB instance type to at least db.r6g.4xlarge on an environment with 61 tenants (all tests below were performed with this instance type);
  • Increase CPU units on nginx-okapi to at least 512;
  • Scale Kafka (either the instance type or the number of brokers) due to high CPU usage.



Test Runs 

Test # | Test Conditions | Duration | Load generator size | Load generator memory (GiB) | Notes
1 | 2 tenants, 2 users each, CICO | 30 min | t3.2xlarge | 3 |
2 | 61 tenants, 5 users each, CICO | 30 min | t3.2xlarge | 3 |
3 | 61 tenants, 5 users each, CICO + 10k MARC BIB Create on 5 tenants | 30 min | t3.2xlarge | 3 |
4 | 5 users CI/CO on 61 tenants + DI 10k MARC BIB Create on 15 tenants + Search workflow, 1 user, 61 tenants | 60 min | t3.2xlarge | 3 |
5 | 5 users CI/CO on 61 tenants + DI 10k MARC BIB Create on 15 tenants + Search workflow, 1 user, 61 tenants | 60 min | t3.2xlarge | 3 |
6 | 5 users CI/CO on 61 tenants + DI 10k MARC BIB Create on 30 tenants + Search workflow, 1 user, 61 tenants | 90 min | t3.2xlarge | 12 |
7 | 5 users CI/CO on 61 tenants + DI 10k MARC BIB Create on 15 tenants + Search workflow, 1 user, 61 tenants | 60 min | t3.2xlarge | 10 | test with CPU units raised to 512 and 2 more Kafka brokers added
8 | 5 users CI/CO on 61 tenants + DI 10k MARC BIB Create on 15 tenants + Search workflow, 1 user, 61 tenants (retest) | 60 min | t3.2xlarge | 10 | test with CPU units raised to 512 and 2 more Kafka brokers added

Results


Response times are in seconds (primary / secondary tenant).

# | Test | Check-In (primary / secondary) | Check-Out (primary / secondary) | Data Import duration (primary / secondary) | Searches (primary / secondary)
1 | 2 tenants, 5 users each, CICO | 0,953 / 0,662 | 1,701 / 1,196 | - / - | - / -
2 | 61 tenants, 5 users, CICO (*) | 1,041 / 0,797 | 1,966 / 1,414 | - / - | - / -
3 | 61 tenants, 5 users + DI on 5 tenants, 10K | 1,141 / 0,863 | 2,003 / 1,511 | 30 min / 50-52 min (StacklessClosed on one tenant, on one record) | - / -
4 | 61 tenants, 5 users + DI 15 tenants, 10K + Search on 61 tenants | 1,245 / 0,974 | 2,25 / 1,724 | 2 min / 1 hr 39 min - 2 hr 57 min (one tenant: StacklessClosedChannel; one tenant: "Connection is not active now, current status: CLOSED") | 1,3-4,4 s / 0,7-4,3 s
5 | 61 tenants, 5 users + DI 15 tenants, 10K + Search on 61 tenants (rerun) | 1,216 / 0,941 | 2,14 / 1,641 | stuck on 98% / 1 hr 33 min - 2 hr 40 min (5 completed with errors, including primary) | 3,5-21 s / 2,3-18 s
6 | 61 tenants, 5 users + DI 30 tenants, 10K + Search on 61 tenants | 1,321 / 0,998 | 2,277 / 1,770 | stuck on 99%, 7 records failed with 504 Gateway Time-out / 3 hr 38 min - 6 hr 24 min (4 completed with errors, including primary; 504 Gateway Time-out, io.netty.channel.StacklessClosedChannelException) | 2,9-6,3 s / 2,1-7,3 s
7 | 5 users CI/CO on 61 tenants + DI 10k MARC BIB Create on 15 tenants + Search workflow, 1 user, 61 tenants (with improvement: 2 more Kafka brokers) | 1,503 / 1,190 | 2,313 / 1,836 | 14 min / 56 min - 1 hr 47 min (8 jobs either stuck or completed with errors) | 3,0-5,088 / 1,800-4,348
8 | 5 users CI/CO on 61 tenants + DI 10k MARC BIB Create on 15 tenants + Search workflow, 1 user, 61 tenants (with improvement: 2 more Kafka brokers) | 1,77 / 1,487 | 2,839 / 2,319 | 29 min (error, 11 records failed with io.netty.channel.StacklessClosedChannelException) / 1 hr 2 min - 1 hr 54 min (** 10 jobs failed) | 4,103-7,605 / 3,058-8,752
9 | 5 users CI/CO on 61 tenants + DI 10k MARC BIB Create on 15 tenants + Search workflow, 1 user, 61 tenants (with improvement: 2 more OpenSearch data nodes; with 2 Kafka brokers) (*) | 1,442 / 1,181 | 2,312 / 1,769 | 44 min, completed / 1 hr 39 min - 3 hr 8 min (one job failed on one record with "Connection is not active now, current status: CLOSED") | 1,779-3,534 / 0,67-3,636
10 | 5 users CI/CO on 61 tenants + DI 10k MARC BIB Create on 15 tenants + Search workflow, 1 user, 61 tenants (with improvement: 2 more OpenSearch data nodes; with 2 Kafka brokers) (*) | 1,496 / 1,220 | 2,334 / 1,845 | 44 min, completed with errors (2 records: io.netty.channel.StacklessClosedChannelException; "Connection is not active now, current status: CLOSED") / 1 hr 42 min - 2 hr 48 min (one job failed on one record with "Connection is not active now, current status: CLOSED") | 2,381-5,197 / 1,308-4,982

* Note: two tests were performed here to check the consistency of the results.

** New errors that were never observed before in the DI workflow appeared in test #8:

  • io.vertx.pgclient.PgException: FATAL: remaining connection slots are reserved for non-replication superuser connections (53300), on mod-inventory, mod-login, mod-authtoken, mod-permissions, mod-source-record-manager, mod-source-record-storage, mod-users, mod-circulation.
  • io.vertx.pgclient.PgException: FATAL: sorry, too many clients already (53300), on mod-inventory only.
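Both 53300 errors mean the server-side max_connections budget was exhausted. A back-of-envelope sketch of why private per-module pools add up, and why the shared connection pool suggested in the summary would help; every count below is an illustrative assumption, not a value measured on this environment:

```python
# Illustrative assumptions only: module count, task count, and pool sizes
# are NOT measured values from this environment.
modules = 30           # backend modules that hold their own DB pools
tasks_per_module = 2   # ECS tasks per module
per_task_pool = 20     # connections in each task's private pool

# With private pools, the DB must accept the worst-case sum of all pools.
separate_total = modules * tasks_per_module * per_task_pool
print(separate_total)  # 1200

# With one shared, bounded pooler (e.g. PgBouncer) in front of the DB,
# server-side connections stay capped no matter how many clients connect.
shared_pool_cap = 500
print(shared_pool_cap)  # 500
```

The point is the multiplication: every new module, task, or tenant-driven pool grows the total, while a shared pooler turns it into a fixed, configurable cap.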


Test #1 2 tenants 5 users each CICO

CPU, memory, and DB usage are barely visible at this load; no issues were found.

Memory Utilization

Note: No memory leaks were found.

CPU Utilization 

Note: CPU usage on the related containers is hardly visible due to the low load during the baseline test.


RDS metrics 


Test #2 61 tenants 5 users each

Two identical tests were performed; as the results are mostly the same, only one of them is used in this report.

Test #2 (61 tenants, 5 users each, CICO)

    1. The modules using the most memory are mod-inventory (±100%), mod-circulation (±80%), and mod-circulation-storage (±80%).
    2. nginx-okapi CPU usage is ±400% due to the small number of CPU units allocated to the module by default (128).
    3. DB CPU averages 15% during the whole test.
    4. The DB connection count averages ±2,000, with spikes up to 3.5K.

Memory Utilization


Note: even with a total of 305 users (5 users × 61 tenants), there are no visible memory leaks or anomalies.

CPU Utilization 


Note: with 305 total users in the test, CPU usage on nginx-okapi spikes above 400%.

RDS metrics 

Note: RDS CPU usage is mostly below 20% during the whole test.

Note: the RDS connection count with the system in the stand-by state is ±1,000 connections. During the test it increases to ±2K, with spikes above 3K connections.


Test #3 61 tenants 5 users + DI on 5 tenants 10K

    1. nginx-okapi CPU usage is ±400% due to the small number of CPU units allocated to the module by default (128).
    2. No memory leaks were found.
    3. DB CPU usage is close to 50%.
    4. The DB connection count spikes up to 4,000 connections (the maximum for the current DB instance type).

Memory Utilization

Note: no memory leaks were found in the modules related to the test.


CPU Utilization 

Note: the charts above and below are identical, except that the one above excludes nginx-okapi for a better view of the other modules' CPU usage. All modules are in good shape and do not reach their limits.



RDS metrics 


Note: with DI added to CICO in this test, RDS CPU usage grew to ±50%.


Note: DB connections spike up to 4K.


Kafka metrics

Note: Kafka memory usage during the test reached 2.3%. This memory will be freed by the retention policy after 480 minutes.

Note: Kafka CPU usage is ±50-60% on each broker. 
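A note on the broker scaling tested later in this report: adding brokers lowers per-broker load only after partitions are spread across the enlarged cluster; Kafka does not rebalance existing partitions automatically, so a partition reassignment is required. A minimal sketch with an assumed, illustrative partition count:

```python
# The total partition count is an illustrative assumption, not measured.
partitions = 600

# Per-broker share of partitions, assuming an even reassignment.
for brokers in (2, 4):
    per_broker = partitions // brokers
    print(f"{brokers} brokers -> ~{per_broker} partitions each")
```

Going from 2 to 4 brokers halves the per-broker partition share (and, roughly, the per-broker CPU attributable to those partitions), which matches the DI speed-up observed in tests #7-8.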



Test #4 61 tenants 5 users + DI 15 tenants 10K + Search on 61 tenants

The test consisted of two parts:

  • The actual test, which includes CICO + DI + Search on each tenant (approximately from 12:15 to 13:15 on the charts)
  • Data Import jobs that continue running after the actual test (from 13:15 to 15:45)

Tests #4, #5

    1. nginx-okapi CPU usage is ±500%.
    2. No signs of memory leaks were found.
    3. DB CPU usage averages 70-80%.
    4. The DB connection count averages ±3,000 connections, with spikes up to 4K.
    5. OpenSearch CPU usage is close to 90% (max) from the beginning of the test, due to the Search workflow included in this test plus Data Import indexing.
    6. A few DI jobs finished with "Completed with errors" status because a few records failed with "Connection is not active now, current status: CLOSED" or StacklessClosedChannel.
    7. "Connection is not active now, current status: CLOSED" happened only once and has not been reproduced yet.
    8. A DI job (10K records) took up to 3 hours to complete.
    9. The search rate is 180 ops/min.

Memory Utilization


Note: no major memory issues were found.

CPU Utilization 

Note: below are two charts: one without nginx-okapi for a clearer view of the CPU trends, and a second one with nginx-okapi.


Note: nginx-okapi has more than 500% CPU usage because only 128 CPU units are allocated to the nginx-okapi service. We should consider increasing its CPU units to at least 512.


RDS metrics 



OpenSearch metrics




Kafka metrics


Test #5 61 tenants 5 users + DI 15 tenants 10K + Search on 61 tenants (rerun)

This is a rerun of the previous test to check the consistency of the results.

Results are more or less the same, except for the Search workflow, whose response times increased 4×.

Memory Utilization


CPU Utilization 



RDS metrics 



OpenSearch metrics



Kafka metrics



Test #6 61 tenants 5 users + DI 30 tenants 10K + Search on 61 tenants

Test #6

    1. CPU usage on nginx-edge is, as expected, high: ±500%.
    2. CPU usage on mod-quick-marc reached 1.5K% and did not go down after the test. However, it did not affect the results, and this behaviour has not been reproduced since.
    3. DI took up to 6.5 hr to complete.


Memory Utilization

CPU Utilization 

Note: two CPU utilization charts are included here; the first one excludes nginx-okapi and mod-quick-marc.

mod-quick-marc reached 1.5K% CPU usage from the very beginning of the test and did not come down even after the test ended. Note that this is new behaviour that was not observed in previous tests.

RDS metrics 

Note: RDS reached 4K concurrent connections during the main part of the test.

OpenSearch metrics


Kafka metrics

Note: the test itself, including the data imports that ran afterwards, ended at ±17:00 on the timeline; however, messages remained for 8 more hours.



Test #7-8 61 tenants 5 users + DI 15 tenants 10K + Search on 61 tenants (with improvements, rerun)

Tests #7, #8 (improvement: 2 additional Kafka brokers)

    1. CPU usage on nginx-edge after the improvements is 150% on average.
    2. Other modules' CPU and memory usage is the same as in previous tests.
    3. DI became faster (possibly due to better-performing Kafka with 4 brokers in the cluster): in previous tests it took up to 2 hr 57 min, now it takes 1 hr 57 min.
    4. CI/CO response times were affected: +200 ms for CI and +500 ms for CO. This can be explained by the increased number of Kafka brokers: DI became faster and produced more load on the DB side, which affects CICO response times.

Memory Utilization

CPU Utilization