Multi-tenant ECS report [in progress]


Overview

The Mobius-like environment has 61 tenants. We need to test this environment to see whether it can handle the load from all 61 tenants in high-level preliminary tests consisting of a few use cases.

1) Perform 30-minute baseline CICO tests on the main tenant and a secondary tenant with 5 concurrent users.
2) Perform the following tests on all 61 tenants at the same time, or in combinations:

  • CICO test with 5 users each on all tenants
  • 1-hour test, repeated 2 times
  • DI Create jobs (10K records) on 10-20 tenants
  • Search/Browse on all tenants

3) Increase the number of Kafka brokers (+2), retest

4) Increase the number of data nodes for the OpenSearch service, retest

Compare results against the baseline tests and record the KPIs and other observations, such as response times and errors, in a report.

Summary

  • The db.r6g.xlarge DB instance cannot handle the load of 61 tenants, even with a 1-user Check-In/Check-Out (CICO) test, due to the high number of connections being created. To handle this number of concurrent DB connections, the instance type should be at least db.r6g.4xlarge (it has an acceptable connection limit). However, even that may not be enough: in the latest tests the db.r6g.4xlarge instance often reaches its limit of available connections (see the query sketch after this list). A shared connection pool would likely have a positive effect here.
  • A ticket was created for the performance degradation: MODINVSTOR-1124.
  • nginx-okapi spikes up to 400% CPU in the combined tests, so its CPU units should be increased to at least 512 (currently 128).
  • Kafka CPU usage is at the ±60% level during the whole test (35-40% in the idle state). Increasing the number of Kafka brokers (+2) has a positive effect on Data Import; however, while DI performance improved, CICO was affected: the higher DI throughput loads the DB more and negatively affects CICO response times. During DI, jobs intermittently failed with the following errors:
    • io.vertx.core.impl.NoStackTraceThrowable: Connection is not active now, current status: CLOSED
    • io.netty.channel.StacklessClosedChannelException
  • The first DI job (typically on the primary tenant) runs fastest; each subsequent tenant's job runs slower, taking up to 3 hours.
  • In the last test, 5 DI jobs completed with errors due to the same issues mentioned above. One job did not even start, due to a 500 Internal Server Error on the POST call that starts a job.
  • OpenSearch CPU usage is at 90% during the whole test, likely because DI jobs require indexing of each created record. This indexing is done asynchronously, so it does not affect the overall DI duration, but it likely affects the performance of other workflows.
  • No memory leaks were found.
  • The improvements in tests #7-8 (adding 2 more brokers to the Kafka cluster and changing CPU units on nginx-edge to 512) did make Data Import faster; however, they also affected CICO, increasing CI and CO response times by about +200 ms on average.
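
As a quick way to see how close the cluster is to the DB connection ceiling, the configured limit can be compared with the live session count directly on the RDS instance. This is a minimal sketch, assuming psql access to the RDS endpoint; the host, user, and database values below are placeholders:

    # Placeholder connection settings - replace with the real RDS endpoint and credentials
    export PGHOST=ptf-mobius-db.example.us-east-1.rds.amazonaws.com PGUSER=postgres PGDATABASE=folio

    # Configured connection ceiling for the instance
    psql -c "SHOW max_connections;"

    # Live sessions, total and currently active
    psql -c "SELECT count(*) AS total_connections, count(*) FILTER (WHERE state = 'active') AS active FROM pg_stat_activity;"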

Recommendations & Jiras

  • Original ticket: PERF-639 Preliminary Testing of Mobius-like Env;
  • Ticket to improve resources and retest: PERF-670;
  • Recommended to increase the DB instance type to at least db.r6g.4xlarge on an environment with 61 tenants (all tests below were performed with this instance type);
  • Recommended to increase CPU units on nginx-okapi to at least 512;
  • Recommended to scale Kafka (either instance type or number of brokers) due to high CPU usage.



Test Runs 

Test # | Test Conditions | Duration | Load generator size | Load generator Memory (GiB) | Notes
1 | 2 tenants, 5 users each, CICO | 30 min | t3.2xlarge | 3 | -
2 | 61 tenants, 5 users each, CICO | 30 min | t3.2xlarge | 3 | -
3 | 61 tenants, 5 users each, CICO + 10k MARC BIB Create on 5 tenants | 30 min | t3.2xlarge | 3 | -
4 | 5 users CI/CO on 61 tenants + DI 10k MARC BIB Create on 15 tenants + Search workflow 1 user on 61 tenants | 60 min | t3.2xlarge | 3 | -
5 | 5 users CI/CO on 61 tenants + DI 10k MARC BIB Create on 15 tenants + Search workflow 1 user on 61 tenants | 60 min | t3.2xlarge | 3 | -
6 | 5 users CI/CO on 61 tenants + DI 10k MARC BIB Create on 30 tenants + Search workflow 1 user on 61 tenants | 90 min | t3.2xlarge | 12 | -
7 | 5 users CI/CO on 61 tenants + DI 10k MARC BIB Create on 15 tenants + Search workflow 1 user on 61 tenants | 60 min | t3.2xlarge | 10 | Test with CPU units raised to 512 and 2 more brokers added to Kafka
8 | 5 users CI/CO on 61 tenants + DI 10k MARC BIB Create on 15 tenants + Search workflow 1 user on 61 tenants (retest) | 60 min | t3.2xlarge | 10 | Test with CPU units raised to 512 and 2 more brokers added to Kafka

Results


All response times are shown in seconds; each cell gives primary / secondary tenant values.

# | Test | CI (primary / secondary) | CO (primary / secondary) | Data Import duration (primary / secondary) | Searches (primary / secondary)
1 | 2 tenants, 5 users each, CICO | 0.953 / 0.662 | 1.701 / 1.196 | - / - | - / -
2 | 61 tenants, 5 users each, CICO (*) | 1.041 / 0.797 | 1.966 / 1.414 | - / - | - / -
3 | 61 tenants, 5 users + DI on 5 tenants 10K | 1.141 / 0.863 | 2.003 / 1.511 | 30 min / 50-52 min (StacklessClosedChannelException on one tenant on one record) | - / -
4 | 61 tenants, 5 users + DI 15 tenants 10K + Search on 61 tenants | 1.245 / 0.974 | 2.25 / 1.72 | 42 min / 1 hr 39 min - 2 hr 57 min (one tenant StacklessClosedChannelException; one tenant with "Connection is not active now, current status: CLOSED") | 1.3-4.4 s / 0.7-4.3 s
5 | 61 tenants, 5 users + DI 15 tenants 10K + Search on 61 tenants (rerun) | 1.216 / 0.941 | 2.14 / 1.641 | stuck on 98% / 1 hr 33 min - 2 hr 40 min (5 completed with errors, including primary) | 3.5-21 s / 2.3-18 s
6 | 61 tenants, 5 users + DI 30 tenants 10K + Search on 61 tenants | 1.321 / 0.998 | 2.277 / 1.770 | stuck on 99%, 7 records failed with 504 Gateway Time-out / 3 hr 38 min - 6 hr 24 min (4 completed with errors, including primary; 504 Gateway Time-out, io.netty.channel.StacklessClosedChannelException) | 2.9-6.3 s / 2.1-7.3 s
7 | 5 users CI/CO on 61 tenants + DI 10k MARC BIB Create on 15 tenants + Search workflow 1 user on 61 tenants (with the improvement of adding 2 more brokers to Kafka) | 1.503 / 1.190 | 2.313 / 1.836 | 14 min / 56 min - 1 hr 47 min (8 jobs either stuck or completed with errors) | 3.0-5.088 s / 1.800-4.348 s
8 | 5 users CI/CO on 61 tenants + DI 10k MARC BIB Create on 15 tenants + Search workflow 1 user on 61 tenants (retest, with the improvement of adding 2 more brokers to Kafka) | 1.77 / 1.487 | 2.839 / 2.319 | 29 min (error, 11 records failed with io.netty.channel.StacklessClosedChannelException) / 1 hr 2 min - 1 hr 54 min (** 10 jobs failed) | 4.103-7.605 s / 3.058-8.752 s
9 | 5 users CI/CO on 61 tenants + DI 10k MARC BIB Create on 15 tenants + Search workflow 1 user on 61 tenants (with the improvement of adding 2 more data nodes to OpenSearch) *with 2 Kafka brokers | 1.442 / 1.181 | 2.312 / 1.769 | 44 min, completed / 1 hr 39 min - 3 hr 8 min (one job failed on one record with "Connection is not active now, current status: CLOSED") | 1.779-3.534 s / 0.67-3.636 s
10 | 5 users CI/CO on 61 tenants + DI 10k MARC BIB Create on 15 tenants + Search workflow 1 user on 61 tenants (with the improvement of adding 2 more data nodes to OpenSearch) *with 2 Kafka brokers | 1.496 / 1.220 | 2.334 / 1.845 | 44 min, completed with errors (2 records: io.netty.channel.StacklessClosedChannelException; "Connection is not active now, current status: CLOSED") / 1 hr 42 min - 2 hr 48 min (one job failed on one record with "Connection is not active now, current status: CLOSED") | 2.381-5.197 s / 1.308-4.982 s

* Note: two tests were performed here to check the consistency of the results.

** New errors that had never been observed before in the DI workflow appeared in test #8 (a diagnostic query is sketched after this list):

  • io.vertx.pgclient.PgException: FATAL: remaining connection slots are reserved for non-replication superuser connections (53300) - on mod-inventory, mod-login, mod-authtoken, mod-permissions, mod-source-record-manager, mod-source-record-storage, mod-users, mod-circulation.
  • io.vertx.pgclient.PgException: FATAL: sorry, too many clients already (53300) - on mod-inventory only.
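
When these FATAL errors appear, it helps to see which module roles are holding the connections. A possible diagnostic, sketched under the assumption that each FOLIO module connects with its own database role (so grouping by usename separates the modules):

    # Group live DB sessions by role and state to see which modules hold the most connections
    psql -c "SELECT usename, state, count(*) AS sessions FROM pg_stat_activity GROUP BY usename, state ORDER BY sessions DESC LIMIT 20;"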


Test #1 2 tenants 5 users each CICO

At this load there is almost no visible CPU, memory, or DB usage, and no issues were found.

Memory Utilization

Note: No memory leaks were found.

CPU Utilization 

Note: CPU usage on the related containers is hardly visible due to the low load during the baseline test.


RDS metrics 


Test #2 61 tenants 5 users each CICO

Two identical tests were performed; as the results are mostly the same, only one of them is used in this report.

Test #2 (61 tenants, 5 users each, CICO)

    1. The modules using the most memory are mod-inventory (±100%), mod-circulation (±80%), and mod-circulation-storage (±80%).
    2. nginx-okapi CPU usage is ±400% due to the small number of CPU units allocated to the module by default (128).
    3. DB CPU is 15% on average during the whole test.
    4. DB connection count is ±2,000 on average, with spikes up to 3.5K.

Memory Utilization


Note: even with a total of 305 users (5 users × 61 tenants), there are no visible memory leaks or anomalies.

CPU Utilization 


Note: with 305 total users in the test, CPU usage on nginx-okapi spikes above 400%.

RDS metrics 

Note: RDS CPU usage is mostly below 20% during the whole test.

Note: the RDS DB connection count when the system is in a standby state is ±1,000 connections. During the test this number increases to ±2K, with spikes above 3K connections.


Test #3 61 tenants 5 users + DI on 5 tenants 10K

    1. nginx-okapi CPU usage is ±400% due to the small number of CPU units allocated to the module by default (128).
    2. No memory leaks were found.
    3. DB CPU usage is close to 50%.
    4. DB connection count has spikes up to 4,000 connections (which is the maximum for the current DB instance type).

Memory Utilization

Note: no memory leaks were found in the modules related to the test.


CPU Utilization 

Note: the charts above and below are identical, except that the one above excludes nginx-okapi for a better view of the other modules' CPU usage. All modules are in good shape and not reaching their limits.



RDS metrics 


Note: with DI added to CICO in this test, RDS CPU usage grew to ±50%.


Note: DB connections have spikes up to 4K.


Kafka metrics

Note: Kafka memory usage during the test reached 2.3%. This memory will be freed up by the retention policy after 480 minutes (log.retention.minutes=480).

Note: Kafka CPU usage is ±50-60% on each broker.



Test #4 61 tenants 5 users + DI 15 tenants 10K + Search on 61 tenants

The test consisted of two parts:

  • the actual test, which includes CICO + DI + Search on each tenant (approximately from 12:15 to 13:15 on the charts);
  • data imports continuing to run after the actual test (from 13:15 to 15:45).

Tests #4, #5

    1. nginx-okapi CPU usage is ±500%.
    2. No signs of memory leaks were found.
    3. DB CPU usage is 70-80% on average.
    4. DB connection count is ±3,000 connections on average, with spikes up to 4K.
    5. OpenSearch CPU usage is close to 90% (max) from the beginning of the test, due to the Search workflow included in this test plus Data Import indexing (see the sketch after this list).
    6. A few DI jobs got the "Completed with errors" status because a few records failed with "Connection is not active now, current status: CLOSED" and StacklessClosedChannelException.
    7. "Connection is not active now, current status: CLOSED" happened only once and has not been reproduced yet.
    8. A DI job (10K records) took up to 3 hours to complete.
    9. The search rate is 180 ops/min.
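
Per-node CPU and indexing pressure on the OpenSearch domain can be spot-checked with the _cat APIs; a minimal sketch, assuming network access to the domain endpoint (the URL and credential variables below are placeholders):

    OS_URL="https://fse.example.es.amazonaws.com"   # placeholder domain endpoint

    # CPU and load per data node
    curl -s -u "$OS_USER:$OS_PASS" "$OS_URL/_cat/nodes?v&h=name,node.role,cpu,load_1m"

    # Write (indexing) thread pool queue and rejections - a proxy for DI indexing pressure
    curl -s -u "$OS_USER:$OS_PASS" "$OS_URL/_cat/thread_pool/write?v&h=node_name,active,queue,rejected"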

Memory Utilization


Note: no major issues with memory were found.

CPU Utilization 

Note: below are two charts: one without nginx-okapi, for a more accurate view of the CPU trends, and a second one with nginx-okapi.


Note: nginx-okapi shows more than 500% CPU usage. That is because only 128 CPU units are allocated to the nginx-okapi service. We should consider increasing the CPU units for nginx-okapi to at least 512.


RDS metrics 



OpenSearch metrics




Kafka metrics


Test #5 61 tenants 5 users + DI 15 tenants 10K + Search on 61 tenants (rerun)

This is a rerun of the previous test to check result consistency.

The results are more or less the same, except for the Search workflow, whose response times increased about 4 times.

Memory Utilization


CPU Utilization 



RDS metrics 



OpenSearch metrics



Kafka metrics



Test #6 61 tenants 5 users + DI 30 tenants 10K + Search on 61 tenants

Test #6

    1. CPU usage on nginx-edge is, as expected, high (±500%).
    2. CPU usage on mod-quick-marc reached 1.5K% and did not go down after the test. However, it did not affect the results; moreover, this behaviour has not reproduced since then.
    3. DI took up to 6.5 hr to complete.


Memory Utilization

CPU Utilization 

Note: two CPU utilisation charts are included here; the first one excludes nginx-okapi and mod-quick-marc.

mod-quick-marc reached 1.5K% CPU usage from the very beginning of the test and did not come down even after the test ended. Note that this is new behaviour and was not observed in previous tests.

RDS metrics 

Note: RDS connections reached 4K concurrent connections during the main part of the test.

OpenSearch metrics


Kafka metrics

Note: the test itself, including the data imports running afterwards, ended at about 17:00 on the timeline; however, messages were still present for 8 more hours.



Test #7-8 61 tenants 5 users + DI 15 tenants 10K + Search on 61 tenants (with improvements, rerun)

Tests #7, #8 (improvement: 2 additional Kafka brokers)

    1. CPU usage on nginx-edge after the improvements is 150% on average.
    2. All other modules' CPU and memory usage is the same as in previous tests.
    3. DI became faster (possibly due to better-performing Kafka with 4 brokers in the cluster). In previous tests it took up to 2 hr 57 min; now it is 1 hr 57 min.
    4. CI/CO response times were affected by +200 ms for CI and +500 ms for CO. This can be explained by the increased number of Kafka brokers: DI became faster and produced more load on the DB side, which affects CICO response times (a consumer-lag check is sketched after this list).
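
One way to confirm whether the additional brokers actually relieved the DI pipeline is to watch consumer lag during a run. A sketch, assuming the Kafka CLI tools are available on a client host and $BROKERS holds the MSK bootstrap string (the report does not list the DI consumer group names, so all groups are described):

    # Snapshot of consumer lag across all groups during a DI run; the LAG column shows
    # the per-partition backlog the consumers still have to process
    kafka-consumer-groups.sh --bootstrap-server "$BROKERS" --describe --all-groups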

Memory Utilization

CPU Utilization 


RDS metrics 




OpenSearch metrics



Kafka metrics




Test #9-10 61 tenants 5 users + DI 15 tenants 10K + Search on 61 tenants (with additional data nodes on OpenSearch)

Tests #9, #10 (tests with only 2 Kafka brokers and 2 additional data nodes on OpenSearch)

    1. All other modules' CPU and memory usage is the same as in previous tests.
    2. DI became slower in comparison to the previous two tests; however, it is more stable.
    3. CICO response times became faster than in the previous tests (#7, #8), which supports the point that the faster DI is, the more it loads the DB and the more it affects CICO response times.
    4. Search response times do not seem to be better: compared to the other tests they are sometimes better and sometimes worse on average, so 2 additional data nodes did not change search performance much.

Memory Utilization

CPU Utilization 


RDS metrics 



OpenSearch metrics



Kafka metrics




Appendix

The Mobius-like environment has 61 tenants:

  • Primary tenant fs00001137 has 3M+ records in inventory.
  • 60 secondary tenants (mob01, mob02, ..., mob060) were originally prepared with 10K records each (at this point the numbers may vary, as a number of data imports have been performed on different tenants).

Infrastructure

PTF environment ompt-pvt

  • 11 m6g.2xlarge EC2 instances located in US East (N. Virginia)us-east-1
  • 1 instance of a db.r6g.4xlarge database
  • MSK ptf-mobius-testing
    • 2 kafka.m5.2xlarge brokers in 2 zones
    • Apache Kafka version 2.8.0

    • EBS storage volume per broker 250 GiB

    • auto.create.topics.enable=true
    • log.retention.minutes=480
    • num.partitions=2
  • OpenSearch fse
    • version - OpenSearch 2.7
    • instance type r6g.xlarge.search
    • 4 data nodes
    • EBS volume 500 GiB
    • Dedicated Master nodes 3 X r6g.large.search
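
The broker-level settings listed above can be verified against the running MSK cluster; a sketch, assuming the Kafka CLI tools are installed on a client host, $BROKERS holds the MSK bootstrap string, and broker id 1 exists:

    # Describe the effective configuration of broker 1 and filter the settings of interest
    kafka-configs.sh --bootstrap-server "$BROKERS" --entity-type brokers --entity-name 1 --describe --all | grep -E "auto.create.topics.enable|log.retention.minutes|num.partitions"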


Modules memory and CPU parameters

Modules list with parameters (environment ompt-pvt, captured Thu Sep 14 12:13:07 UTC 2023).
Each row below lists, in order: Module, Task Def#, Task Count, Mem Hard Limit, Mem Soft Limit, CPU units, Xmx, MetaspaceSize, MaxMetaspaceSize, R/W split enabled.
mod-remote-storage:2.0.3224920447210243960512512false
mod-finance-storage:8.4.2221024896102470088128false
mod-ncip:1.13.122102489612876888128false
mod-agreements:5.5.23265006400102438683841524false
mod-ebsconet:2.0.02212481024128700128256false
edge-sip2:3.0.022102489612876888128false
mod-organizations:1.7.022102489612876888128false
mod-settings:1.0.0221024896200000false
edge-dematic:2.0.0211024896128000false
mod-data-import:2.7.121204818442561292384512false
mod-search:2.0.12225922480204814405121024false
mod-tags:2.0.122102489612876888128false
mod-authtoken:2.13.0221440115251292288128false
mod-inventory-update:3.0.122102489612876888128false
mod-notify:3.0.022102489612876888128false
mod-configuration:5.9.122102489612876888128false
mod-orders-storage:13.5.022102489651276888128false
edge-caiasoft:2.0.0221024896128000false
mod-login-saml:2.6.122102489612876888128false
mod-gobi:2.6.022102489612876888128false
mod-password-validator:3.0.02214401298128768384512false
mod-licenses:4.3.13280007900102458123841536false
mod-bulk-operations:1.0.6223072260010241536384512false
mod-graphql:1.11.022102489612876888128false
mod-finance:4.7.122102489612876888128false
mod-copycat:1.4.022102489612876888128false
mod-entities-links:1.0.22225922480400144001024false
mod-permissions:6.3.242168415445121024384512false
pub-edge:2023.06.1422102489612876800false
mod-orders:12.6.8222048144010241024384512false
edge-patron:4.11.022102489625676888128false
edge-ncip:1.8.122102489612876888128false
mod-users-bl:7.5.0221440115251292288128false
edge-ea-data-export:3.9.022102489612876888128false
mod-inventory-storage:26.0.0224096369020483076384512false
mod-invoice:5.6.5221440115251292288128false
mod-user-import:3.7.222102489612876888128false
mod-sender:1.10.022102489612876888128false
edge-oai-pmh:2.6.1221512136010241440384512false
mod-data-export-worker:3.0.13223072280010242048384512false
mod-rtac:3.5.022102489612876888128false
mod-task-list:1.8.022102489612876888128false
mod-circulation-storage:16.0.1222880259215361814384512false
mod-source-record-storage:5.6.7225600500020483500384512false
mod-calendar:2.4.222102489612876888128false
mod-event-config:2.5.022102489612876888128false
mod-courses:1.4.722102489612876888128false
mod-inventory:20.0.6222880259210241814384512false
mod-email:1.15.322102489612876888128false
mod-di-converter-storage:2.0.522102489612876888128false
mod-circulation:23.5.62228802592153676888128false
mod-pubsub:2.9.122153614401024922384512false
edge-orders:2.8.122102489612876888128false
edge-rtac:2.6.022102489612876888128false
mod-users:19.1.122102489612876888128false
mod-template-engine:1.18.022102489612876888128false
mod-patron-blocks:1.8.0221024896102476888128false
mod-audit:2.7.022102489612876888128false
mod-source-record-manager:3.6.4225600500020483500384512false
nginx-edge:2023.06.14221024896128000false
mod-quick-marc:3.0.021228821761281664384512false
nginx-okapi:2023.06.14221024896128000false
okapi:5.0.123168414401024922384512false
mod-feesfines:18.2.122102489612876888128false
mod-invoice-storage:5.6.0221872153610241024384512false
mod-service-interaction:2.2.222204818442561290384512false
mod-patron:5.5.222102489612876888128false
mod-data-export:4.7.1211024896102476888128false
mod-oai-pmh:3.11.3224096369020483076384512false
edge-connexion:1.0.622102489612876888128false
mod-notes:5.0.1221024896128952384512false
mod-kb-ebsco-java:3.13.022102489612876888128false
mod-login:7.9.022144012981024768384512false
mod-organizations-storage:4.5.122102489612876888128false
mod-data-export-spring:2.0.221204818442561536384512false
pub-okapi:2023.06.1422102489612876800false

Methodology/Approach

  • The PTF team has developed a DMS (Data Management Server) which, when called via its API, responds with the data needed for each tenant.
    • API to call: ${DMSHost}:5222/${tenantID}/CICO_available. This API call returns an available item ID in JSON format (see the sketch below).
  • PTF has prepared a data preparation script (Bash) that loops through tenant IDs and prepares data for each tenant, including the primary one.
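
For illustration, the DMS call can be exercised per tenant directly with curl; a minimal sketch, assuming the DMS host is reachable from the load generator (the host value and the short tenant list are placeholders):

    DMSHost="dms.example.internal"              # placeholder host
    for tenantID in fs00001137 mob01 mob02; do  # illustrative subset of the 61 tenants
      echo "tenant: ${tenantID}"
      # Returns an available item ID for the tenant's CICO scenario, in JSON
      curl -s "http://${DMSHost}:5222/${tenantID}/CICO_available"
      echo
    done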


Analysis:

To define the average response time for CICO, use the avg. column from the summary table. As DI is running in the background for almost the whole time, we can use the whole time range of a test for the average analysis.

To check DI duration, either run CheckDI.sh (located on the carrier box in /home/ec2-user/MasterDataLoad/Mobius) using

 bash CheckDI.sh psql.conf

or run the following on the DB side:

SELECT started_date, completed_date - started_date AS duration, file_name, status
FROM [tenantId]_mod_source_record_manager.job_execution
ORDER BY started_date DESC
LIMIT 100;
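
To collect the same DI durations across several tenants in one pass, the query can be wrapped in a small loop; a sketch, assuming psql connectivity is already configured (for example via the same settings psql.conf provides to CheckDI.sh) and that schemas follow the [tenantId]_mod_source_record_manager naming shown above:

    # Latest DI job durations for a few tenants (illustrative tenant list)
    for tenant in fs00001137 mob01 mob02; do
      echo "=== ${tenant} ==="
      psql -c "SELECT started_date, completed_date - started_date AS duration, file_name, status FROM ${tenant}_mod_source_record_manager.job_execution ORDER BY started_date DESC LIMIT 5;"
    done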


Additional Screenshots of graphs or charts 

An Excel spreadsheet with summary tables for all tests in this report is attached. Each tab in the spreadsheet corresponds to one test. (Note: the spreadsheet does not include DI durations.)



Discussion

Things to discuss: 

  • io.vertx.core.impl.NoStackTraceThrowable: Connection is not active now, current status: CLOSED. This is a DI error that we have never seen before; moreover, we have not been able to reproduce it yet.
  • In test #5 the response times for the Search workflow increased significantly without any visible reason.