Multi tenants ECS report [in progress]

 

Overview

The Mobius-like environment has 61 tenants. The goal is to determine, through high-level preliminary tests covering a few use cases, whether this environment can handle load from all 61 tenants.

1) Perform 30-minute baseline check-in/check-out (CICO) tests on the main tenant and a secondary tenant with 5 concurrent users.
2) Perform the following tests in all 61 tenants at the same time or with combinations:

  • CICO test with 5 users on each tenant

  • 1-hour test, repeated 2 times

  • Data Import (DI) create jobs (10K records) on 10-20 tenants

  • Search/Browse on all tenants

3) Increase the number of Kafka brokers (+2) and retest

4) Increase the number of data nodes for the OpenSearch service and retest

Compare results against the baseline tests and record the KPIs and other observations such as response times or errors in a report. 
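The comparison against the baseline can be sketched as follows. This is a minimal illustration, not the tooling used for this report: the function name `pct_change` is hypothetical, and the sample values are the primary-tenant check-in response times reported for test #1 (baseline) and test #2 in the Results section.

```python
def pct_change(baseline: float, observed: float) -> float:
    """Return the relative change of `observed` versus `baseline`, in percent."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (observed - baseline) / baseline * 100.0


# Primary-tenant check-in response times taken from the Results section.
baseline_ci = 0.953  # test #1: 2 tenants (baseline)
loaded_ci = 1.041    # test #2: 61 tenants

print(f"CI response time change vs baseline: {pct_change(baseline_ci, loaded_ci):+.1f}%")
```

The same function can be applied to each KPI column (CI, CO, search) to express every test run as a percentage change over the baseline.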

Summary

  • The db.r6g.xlarge DB instance cannot handle the load of 61 tenants, even in a test with 1 user for the check-in/check-out workflow, because of the high number of connections being created. To handle this number of concurrent DB connections, the instance type should be at least db.r6g.4xlarge, which allows an acceptable number of connections. Even that may not be enough: in the latest tests db.r6g.4xlarge often reached its limit of available connections. A shared connection pool may have a positive effect here.

  • A ticket was created for the performance degradation: MODINVSTOR-1124

  • In combined tests nginx-okapi CPU usage spikes up to 400%, so its CPU units should be increased to at least 512 (currently 128).

  • Kafka CPU usage stays at about 60% during the whole test (35-40% when idle). Adding 2 Kafka brokers has a positive effect on data import; however, while DI performance improved, CICO was affected. As observed, higher DI throughput puts more load on the DB and has a negative effect on CICO response times.

    • io.vertx.core.impl.NoStackTraceThrowable: Connection is not active now, current status: CLOSED

    • or io.netty.channel.StacklessClosedChannelException

  • The first DI job (typically on the primary tenant) runs fastest. Each subsequent tenant's job runs slower, taking up to 3 hr.

  • In the last test 5 DI jobs completed with errors due to the same issues mentioned above. One job did not even start because of a 500 Internal Server Error on the POST call that starts the job.

  • OpenSearch CPU usage stays at 90% during the whole test, likely because DI jobs require indexing of each record created. This indexing is done asynchronously, so it does not affect the overall DI duration, but it likely affects the performance of other workflows.

  • No memory leaks were found.

  • The improvements in tests #7-8 (adding 2 more brokers to the Kafka cluster and changing the CPU units on nginx-edge to 512) made Data Import faster; however, they also affected CICO, increasing CI and CO response times by about 200 ms on average.
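To illustrate why a shared connection pool might help with the connection limit, the sketch below compares an upper bound on DB connections when every module instance keeps a separate pool per tenant against the case where each module instance shares one pool across all tenants. It assumes per-tenant pooling is what drives the connection count; the module instance count and pool sizes are hypothetical placeholders, not values measured in this environment. Only the tenant count (61) comes from this report.

```python
def max_connections_per_tenant_pools(tenants: int, module_instances: int, pool_size: int) -> int:
    """Upper bound when each module instance opens its own pool for every tenant."""
    return tenants * module_instances * pool_size


def max_connections_shared_pool(module_instances: int, shared_pool_size: int) -> int:
    """Upper bound when each module instance uses one pool shared by all tenants."""
    return module_instances * shared_pool_size


TENANTS = 61            # from this report
MODULE_INSTANCES = 40   # hypothetical number of DB-backed module instances
POOL_SIZE = 5           # hypothetical per-tenant pool size
SHARED_POOL_SIZE = 50   # hypothetical shared pool size

print("per-tenant pools:", max_connections_per_tenant_pools(TENANTS, MODULE_INSTANCES, POOL_SIZE))
print("shared pool:     ", max_connections_shared_pool(MODULE_INSTANCES, SHARED_POOL_SIZE))
```

With per-tenant pools the bound grows linearly with the number of tenants, while with a shared pool it is independent of the tenant count.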

Recommendations & Jiras

  • Original ticket: PERF-639 Preliminary Testing of Mobius-like Env

  • Ticket to improve resources and retest: PERF-670

  • Increase the DB instance type to at least db.r6g.4xlarge on an environment with 61 tenants (all tests below were performed with this instance type).

  • Increase the CPU units on nginx-okapi to at least 512.

  • Scale Kafka (either the instance type or the number of brokers) because of its high CPU usage.

 

 

Test Runs 

| Test # | Test conditions | Duration | Load generator size | Load generator memory (GiB) | Notes |
|---|---|---|---|---|---|
| 1 | 2 tenants, 2 users each, CICO | 30 min | t3.2xlarge | 3 | |
| 2 | 61 tenants, 5 users each, CICO | 30 min | t3.2xlarge | 3 | |
| 3 | 61 tenants, 5 users each, CICO + 10K MARC BIB Create on 5 tenants | 30 min | t3.2xlarge | 3 | |
| 4 | 5 users CI/CO on 61 tenants + DI 10K MARC BIB Create on 15 tenants + Search workflow, 1 user, on 61 tenants | 60 min | t3.2xlarge | 3 | |
| 5 | 5 users CI/CO on 61 tenants + DI 10K MARC BIB Create on 15 tenants + Search workflow, 1 user, on 61 tenants | 60 min | t3.2xlarge | 3 | |
| 6 | 5 users CI/CO on 61 tenants + DI 10K MARC BIB Create on 30 tenants + Search workflow, 1 user, on 61 tenants | 90 min | t3.2xlarge | 12 | |
| 7 | 5 users CI/CO on 61 tenants + DI 10K MARC BIB Create on 15 tenants + Search workflow, 1 user, on 61 tenants | 60 min | t3.2xlarge | 10 | CPU units changed to 512 and 2 more brokers added to Kafka |
| 8 | 5 users CI/CO on 61 tenants + DI 10K MARC BIB Create on 15 tenants + Search workflow, 1 user, on 61 tenants (retest) | 60 min | t3.2xlarge | 10 | CPU units changed to 512 and 2 more brokers added to Kafka |

Results

 

| # | Test | CI primary (s) | CI secondary (s) | CO primary (s) | CO secondary (s) | Data Import duration, primary | Data Import duration, secondary | Searches primary (s) | Searches secondary (s) |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 2 tenants, 5 users each, CICO | 0.953 | 0.662 | 1.701 | 1.196 | - | - | - | - |
| 2 | 61 tenants, 5 users, CICO (*) | 1.041 | 0.797 | 1.966 | 1.414 | - | - | - | - |
| 3 | 61 tenants, 5 users + DI 10K on 5 tenants | 1.141 | 0.863 | 2.003 | 1.511 | 30 min | 50-52 min (StacklessClosedChannelException on one tenant, on one record) | - | - |
| 4 | 61 tenants, 5 users + DI 10K on 15 tenants + Search on 61 tenants | 1.245 | 0.974 | 2.25 | 1.72 | 42 min | 1 hr 39 min - 2 hr 57 min (one tenant with StacklessClosedChannelException; one tenant with "Connection is not active now, current status: CLOSED") | 1.3-4.4 | 0.7-4.3 |
| 5 | 61 tenants, 5 users + DI 10K on 15 tenants + Search on 61 tenants (rerun) | 1.216 | 0.941 | 2.14 | 1.641 | (stuck on 98%) | 1 hr 33 min - 2 hr 40 min (5 completed with errors, including primary) | 3.5-21 | 2.3-18 |
| 6 | 61 tenants, 5 users + DI 10K on 30 tenants + Search on 61 tenants | 1.321 | 0.998 | 2.277 | 1.770 | stuck on 99%; 7 records failed with 504 Gateway Time-out | 3 hr 38 min - 6 hr 24 min (4 completed with errors, including primary): 504 Gateway Time-out, io.netty.channel.StacklessClosedChannelException | 2.9-6.3 | 2.1-7.3 |
| 7 | 5 users CI/CO on 61 tenants + DI 10K MARC BIB Create on 15 tenants + Search workflow, 1 user, on 61 tenants (with 2 more Kafka brokers) | 1.503 | 1.190 | 2.313 | 1.836 | 14 min | 56 min - 1 hr 47 min (8 jobs either stuck or completed with errors) | 3.0-5.088 | 1.800-4.348 |
| 8 | 5 users CI/CO on 61 tenants + DI 10K MARC BIB Create on 15 tenants + Search workflow, 1 user, on 61 tenants (with 2 more Kafka brokers) | 1.77 | 1.487 | 2.839 | 2.319 | 29 min (error: 11 records failed with io.netty.channel.StacklessClosedChannelException) | 1 hr 2 min - 1 hr 54 min (** 10 jobs failed) | 4.103-7.605 | 3.058-8.752 |
| 9 | 5 users CI/CO on 61 tenants + DI 10K MARC BIB Create on 15 tenants + Search workflow, 1 user, on 61 tenants (with 2 more OpenSearch data nodes; with 2 Kafka brokers) | 1.442 | 1.181 | 2.312 | 1.769 | 44 min, completed | 1 hr 39 min - 3 hr 8 min; one job failed on one record with "Connection is not active now, current status: CLOSED" | 1.779-3.534 | 0.67-3.636 |
| 10 | 5 users CI/CO on 61 tenants + DI 10K MARC BIB Create on 15 tenants + Search workflow, 1 user, on 61 tenants (with 2 more OpenSearch data nodes; with 2 Kafka brokers) | 1.496 | 1.220 | 2.334 | 1.845 | 44 min, completed with errors (2 records): io.netty.channel.StacklessClosedChannelException; "Connection is not active now, current status: CLOSED" | 1 hr 42 min - 2 hr 48 min; one job failed on one record with "Connection is not active now, current status: CLOSED" | 2.381-5.197 | 1.308-4.982 |

* Note: two tests were performed here to check the consistency of the results.

** New errors that had never been observed before in the DI workflow appeared in test #8:

  • io.vertx.pgclient.PgException: FATAL: remaining connection slots are reserved for non-replication superuser connections (53300), on mod-inventory, mod-login, mod-authtoken, mod-permissions, mod-source-record-manager, mod-source-record-storage, mod-users, mod-circulation.

  • io.vertx.pgclient.PgException: FATAL: sorry, too many clients already (53300), on mod-inventory only
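Counting how often each of these error signatures appears in module logs can be sketched as below. This is an illustrative helper, not part of the test tooling: the function name, the signature labels, and the sample log lines are hypothetical, while the matched substrings are the error messages quoted in this report.

```python
from collections import Counter
from typing import Iterable

# Substrings taken from the error messages quoted in this report;
# the dictionary keys are hypothetical labels.
SIGNATURES = {
    "reserved_slots": "remaining connection slots are reserved",
    "too_many_clients": "sorry, too many clients already",
    "connection_closed": "Connection is not active now, current status: CLOSED",
    "stackless_closed_channel": "StacklessClosedChannelException",
}


def tally_errors(lines: Iterable[str]) -> Counter:
    """Count log lines containing each known error signature."""
    counts: Counter = Counter()
    for line in lines:
        for label, needle in SIGNATURES.items():
            if needle in line:
                counts[label] += 1
    return counts


# Hypothetical sample input; real module logs will have a different layout.
sample_log = [
    "io.vertx.pgclient.PgException: FATAL: sorry, too many clients already (53300)",
    "io.netty.channel.StacklessClosedChannelException",
    "io.vertx.core.impl.NoStackTraceThrowable: Connection is not active now, current status: CLOSED",
    "INFO request completed",
]

for label, count in tally_errors(sample_log).items():
    print(label, count)
```

Both PgException messages carry SQLSTATE 53300 (too_many_connections), so a rising count for either label points at the DB connection limit rather than at the module itself.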

     

Test #1 2 tenants 5 user each CICO

CPU, memory, and DB usage were barely visible, and no issues were found.

Memory Utilization

Note: No memory leaks were found.

CPU Utilization 

Note: CPU usage on the related containers is hardly visible because of the low load during the baseline test.

 

RDS metrics 

 

Test #2 61 tenants 5 user each

Two identical tests were performed; as the results are mostly the same, only one of them is used in this report.

test #2 (61 tenants 5 user each CICO)

Memory Utilization

 

Note: even with 305 users in total (5 users x 61 tenants) there are no visible memory leaks or anomalies.

CPU Utilization 

 

Note: with 305 users in total, CPU usage on nginx-okapi spikes above 400%.

RDS metrics 

Note: RDS CPU usage is mostly below 20% during the whole test.

Note: when the system is idle, the number of RDS DB connections is about 1,000. During the test this number increases to about 2K, with spikes above 3K connections.
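For a rough sense of scale, the observed connection counts can be turned into per-tenant averages and a utilisation figure against a connection limit. The sketch below uses the approximate figures from the note above; the connection limit is a hypothetical placeholder, since the actual limit of the instance is not stated in this report.

```python
TENANTS = 61  # from this report

# Approximate connection counts from the note above.
IDLE_CONNECTIONS = 1000
TEST_CONNECTIONS = 2000
SPIKE_CONNECTIONS = 3000

# Hypothetical limit, for illustration only; substitute the real max_connections.
ASSUMED_CONNECTION_LIMIT = 5000


def per_tenant(connections: int, tenants: int = TENANTS) -> float:
    """Average number of DB connections per tenant."""
    return connections / tenants


def utilisation_pct(connections: int, limit: int) -> float:
    """Share of the connection limit in use, in percent."""
    return connections / limit * 100.0


for label, value in [("idle", IDLE_CONNECTIONS), ("test", TEST_CONNECTIONS), ("spike", SPIKE_CONNECTIONS)]:
    print(f"{label}: {per_tenant(value):.1f} per tenant, "
          f"{utilisation_pct(value, ASSUMED_CONNECTION_LIMIT):.0f}% of assumed limit")
```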

 

Test #3 61 tenant 5 users +DI on 5 tenants 10K

Memory Utilization

Note: no memory leaks were found in the modules related to the test.

 

CPU Utilization 

Note: the charts above and below are identical, except that the one above excludes nginx-okapi for a better view of the other modules' CPU usage. All modules are in good shape and do not reach their limits.

 

 

RDS metrics 

 

Note: with DI added to CICO in this test, RDS CPU usage grew to about 50%.

 

Note: DB connections spike up to 4K.

 

Kafka metrics

Note: Kafka memory usage reached 2.3% during the test. This memory will be freed by the retention policy after 480 minutes.

Note: Kafka CPU usage is about 50-60% on each broker.

 

 

Test #4 61 tenant 5 users + DI 15 tenants 10 K + Search on 61 tenant 

In this test CICO

The test consisted of two parts:

  • The actual test, which includes CICO + DI + search on each tenant (approximately 12:15 to 13:15 on the chart)

  • Data imports that continued running after the actual test (13:15 to 15:45)

test #4,#5

Memory Utilization

 

Note: no major memory issues were found.

CPU Utilization 

Note: two charts follow: one without nginx-okapi for a clearer view of CPU trends, and one with nginx-okapi.

 

Note: nginx-okapi shows more than 500% CPU usage because only 128 CPU units are allocated to the nginx-okapi service. We should consider increasing its CPU units to at least 512.
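The arithmetic behind this recommendation can be sketched as follows, assuming the charted percentage is relative to the CPU units reserved for the service and that the load stays the same after resizing. The function names are hypothetical; the only external fact used is that ECS defines 1 vCPU as 1024 CPU units.

```python
UNITS_PER_VCPU = 1024  # ECS: 1 vCPU corresponds to 1024 CPU units


def consumed_units(reserved_units: float, utilisation_pct: float) -> float:
    """CPU units actually consumed, given utilisation relative to the reservation."""
    return reserved_units * utilisation_pct / 100.0


def utilisation_after_resize(reserved_units: float, utilisation_pct: float,
                             new_reserved_units: float) -> float:
    """Utilisation (%) expected for the same load under a new reservation."""
    return consumed_units(reserved_units, utilisation_pct) / new_reserved_units * 100.0


used = consumed_units(128, 500)  # 128 reserved units at 500% utilisation
print(f"consumed: {used:.0f} units ({used / UNITS_PER_VCPU:.3f} vCPU)")
print(f"with 512 units: {utilisation_after_resize(128, 500, 512):.0f}%")
```

Under these assumptions 500% of 128 units corresponds to 640 units, so a 512-unit reservation would still be exceeded at peak, which is consistent with treating 512 as a minimum.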

 

RDS metrics 

Note:  

 

OpenSearch metrics

 

 

 

Kafka metrics

 

Test #5 61 tenant 5 users + DI 15 tenants 10 K + Search on 61 tenant (rerun)

This is a rerun of the previous test to check the consistency of the results.

Results are more or less the same, except for the Search workflow, whose response time increased about 4 times.

Memory Utilization

 

CPU Utilization 

 

 

RDS metrics 

 

 

OpenSearch metrics

 

 

Kafka metrics

 

 

Test #6 61 tenant 5 users + DI 30 tenants 10 K + Search on 61 tenant 

Test #6

 

Memory Utilization

CPU Utilization 

Note: two CPU utilisation charts are included here; the first one excludes nginx-okapi and mod-quick-marc.

mod-quick-marc reached 1.5K CPU usage from the very beginning of the test and did not come down even after the test ended. This is new behaviour that was not observed in previous tests.

RDS metrics 

Note: RDS reached 4K concurrent connections during the main part of the test.

OpenSearch metrics

 

Kafka metrics

Note: the test itself, including the data imports that ran afterwards, ended at about 17:00; however, messages remained in Kafka for 8 more hours.
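The 8 hours match the 480-minute retention policy mentioned for test #3. A minimal sketch of that calculation is shown below; the calendar date is a hypothetical placeholder, and actual deletion in Kafka can happen somewhat later than this estimate because retention is applied per log segment.

```python
from datetime import datetime, timedelta

RETENTION_MINUTES = 480  # retention policy mentioned in this report


def earliest_expiry(last_message_at: datetime,
                    retention_minutes: int = RETENTION_MINUTES) -> datetime:
    """Earliest time at which a message becomes eligible for deletion."""
    return last_message_at + timedelta(minutes=retention_minutes)


# Hypothetical date; only the 17:00 end time comes from the note above.
test_end = datetime(2024, 1, 1, 17, 0)
print("messages eligible for deletion from:", earliest_expiry(test_end))
```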

 

 

Test #7-8 61 tenants 5 users + DI 15 tenants 10K + Search on 61 tenants (with improvements, rerun)

Test #7,#8 (improvement with 2 additional Kafka brokers)

Memory Utilization

CPU Utilization 

 

RDS metrics 

 

 

 

OpenSearch metrics

 

 

Kafka metrics

 

 

 

Test #9-10 61 tenants 5 users + DI 15 tenants 10K + Search on 61 tenants (with additional data nodes on OpenSearch)

Test #9,#10 (tests with only 2 Kafka brokers and 2 additional data nodes on OpenSearch)

Memory Utilization

CPU Utilization 

 

RDS metrics 

 

 

OpenSearch metrics

 

 

Kafka metrics

 

 

 

Appendix