RDS Proxy experiments

Overview

Within the scope of PERF-747 we need to evaluate the performance and usability of the RDS Proxy approach.

RDS Proxy is used to better manage the actual number of database connections created and maintained by the database server. Instead of a 1:1 mapping, where each connection created by the client is added to the pool by the DB, the proxy can multiplex many client connections onto one (or a few) server connections and still allow clients to execute their queries in time. Testing on the Mobius-like cluster (61 tenants) showed that the maximum number of database connections was exhausted very quickly when 30 DI jobs were running with 5 vUsers doing CICO on each tenant.
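For reference, here is a minimal sketch (boto3) of how such a proxy is created in front of a PostgreSQL database. The secret ARN, IAM role, subnets, and cluster identifier below are hypothetical placeholders, not values taken from this environment:

import boto3

rds = boto3.client("rds")

# Create the proxy that will multiplex many client connections
# onto a small number of actual database connections.
rds.create_db_proxy(
    DBProxyName="pmpt-proxy",
    EngineFamily="POSTGRESQL",
    Auth=[{
        "AuthScheme": "SECRETS",
        "SecretArn": "arn:aws:secretsmanager:...:secret:folio-db-creds",  # placeholder
        "IAMAuth": "DISABLED",
    }],
    RoleArn="arn:aws:iam::...:role/pmpt-proxy-role",  # placeholder
    VpcSubnetIds=["subnet-aaaa", "subnet-bbbb"],  # placeholders
)

# Register the database as the target of the proxy's default target group.
rds.register_db_proxy_targets(
    DBProxyName="pmpt-proxy",
    TargetGroupName="default",
    DBClusterIdentifiers=["pmpt-db-cluster"],  # placeholder
)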

Two proxies were created, with different sets of modules communicating through them:

  • pmpt-proxy (for mod-inventory, mod-inventory-storage)
  • pmpt-proxy-2 (for mod-di-converter-storage, mod-source-record-manager, mod-source-record-storage)

Each proxy has a "connection pool maximum connections" setting, which defines the maximum allowed connections as a percentage of the maximum connection limit of your database (see the sketch after the list below).

  • pmpt-proxy (for mod-inventory, mod-inventory-storage) has 50% (25% in later tests; see the notes in the results table below)
  • pmpt-proxy-2 (for mod-di-converter-storage, mod-source-record-manager, mod-source-record-storage) has 25%
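In AWS terms this percentage is the MaxConnectionsPercent setting of each proxy's default target group. A minimal sketch (boto3) of how the values above would be applied; only the proxy names come from this page, the rest is assumed:

import boto3

rds = boto3.client("rds")

# 50% of the database's max_connections for the inventory-side proxy
# (lowered to 25% in later tests):
rds.modify_db_proxy_target_group(
    DBProxyName="pmpt-proxy",
    TargetGroupName="default",
    ConnectionPoolConfig={"MaxConnectionsPercent": 50},
)

# 25% for the Data Import-side proxy:
rds.modify_db_proxy_target_group(
    DBProxyName="pmpt-proxy-2",
    TargetGroupName="default",
    ConnectionPoolConfig={"MaxConnectionsPercent": 25},
)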

Most of the tests were performed with a 4xlarge database; then, as an experiment, we switched to an 8xlarge one.

Summary


Test Runs 


Results

# | Test | CI (s) | CO (s) | Data Import duration | Configuration | Notes
1 | 61 tenants, 5 users each, CICO + 10k MARC BIB Create on 14 tenants | 1.49 | 2.35 | 34 min – 2 hr 53 min | 50% connections on pmpt-proxy, 4xlarge DB | No errors
2 | 61 tenants, 5 users each, CICO + 10k MARC BIB Create on 14 tenants | 1.94 | 3.0 | 2 hr – 2 hr 50 min | 50% connections on pmpt-proxy, 4xlarge DB | 2 DI jobs stuck
3 | 61 tenants, 5 users each, CICO + 10k MARC BIB Create on 14 tenants | 1.7 | 2.67 | 20 min – 2 hr 33 min | 50% connections on pmpt-proxy, 4xlarge DB | 11 DI jobs completed with errors
4 | 61 tenants, 5 users each, CICO + 10k MARC BIB Create on 14 tenants | 2.23 | 3.47 | 34 min – 3 hr 3 min | 25% connections on pmpt-proxy, 4xlarge DB, shared pool on mod-circulation-storage (1000) | 9 DI jobs failed
5 | 61 tenants, 5 users each, CICO + 10k MARC BIB Create on 14 tenants | 1.18 | 1.9 | 30 min – 3 hr 38 min | 25% connections on pmpt-proxy, 8xlarge DB | No errors
6 | 61 tenants, 5 users each, CICO + 10k MARC BIB Create on 30 tenants | 1.181 | 1.884 | 2 hr 33 min – 6 hr 25 min | 25% connections on pmpt-proxy, 8xlarge DB | 3 DI jobs stuck
7 | 61 tenants, 5 users each, CICO + 10k MARC BIB Create on 14 tenants | 1.358 | 2.107 | 45 min – 3 hr 13 min | 25% connections on pmpt-proxy, 8xlarge DB, max DB connections raised from default (5K) to 10K | 2 DI jobs failed
8 | 61 tenants, 5 users each, CICO + 10k MARC BIB Create on 14 tenants | 3.835 | 5.606 | 3 hr 15 min | 25% connections on pmpt-proxy, 4xlarge DB, max DB connections raised from default (5K) to 10K | 10 DI jobs failed
9 | 61 tenants, 5 users each, CICO + 10k MARC BIB Create on 15 tenants | 1.623 | 2.294 | 50 min – 3 hr 10 min | No proxy, 4xlarge DB, max DB connections raised from default (5K) to 10K | No errors

Test 1

Initial test. No errors except on the fs00001137 tenant (both CI/CO and Data Import), because it was not registered with the proxies.

All tenants involved finished CI/CO without any errors.

All tenants involved finished Data Import successfully.

CPU Utilization 

Memory Utilization

Note: This chart covers multiple tests, to show that using RDS Proxy does not produce a memory leak.

Note: The drop in memory usage on several modules is due to module restarts before the tests.

RDS metrics 

Note: RDS CPU usage is close to 50–60% during CICO+DI, and around 40% when only DI jobs are running in the background.

Note: The RDS connection count is around 4.5K during CICO+DI, and 3.0K–3.2K during Data Import only.

Note: Low available memory is visible here when using RDS Proxy.

Proxy Metrics

Proxy 1 | Proxy 2



Test 2

The second test was performed right after the first one (with CICO data preparation in between), without restarting any modules or the database.

Note: The throughput chart clearly shows a big gap (21:35–21:40) with multiple errors in it. This gap was caused by a DB restart after an "Out Of Memory" exception. While the DB was restarting there were many CI/CO errors on all tenants, and 2 DI jobs got stuck. Many modules reported errors (even those not directly involved in these workflows), but all of them are related to the DB restart and OOM (Out Of Memory).

After the DB restart, the test proceeded without errors.

DB logs: 

2023-12-04 04:21:44 UTC:10.23.10.247(36116):mob08_mod_users@folio:[15713]:ERROR:  out of memory at character 8
2023-12-04 04:21:44 UTC:10.23.10.247(36116):mob08_mod_users@folio:[15713]:DETAIL:  Failed on request of size 8192.
2023-12-04 04:21:58 UTC:10.23.10.183(60756):folio@folio:[13487]:DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2023-12-04 04:21:59 UTC::@:[537]:FATAL:  Can't handle storage runtime process crash
2023-12-04 04:21:59 UTC::@:[537]:LOG:  database system is shut down


CPU Utilization 

Memory Utilization

Note: This chart covers multiple tests, to show that using RDS Proxy does not produce a memory leak.

RDS metrics 

Note: The gap in the RDS connections chart is clearly visible here; it is when the DB was restarting due to the Out Of Memory exception.

Note: Low available memory is visible here when using RDS Proxy.

Proxy Metrics

Proxy 1 | Proxy 2


Test 3

This test was performed without any restarts after test #2, to check whether there is degradation between tests.

This time the DB restarted multiple times, which led to multiple errors during the test (from the CI/CO side and from DI as well). The errors and logs are the same as in test #2.

CPU Utilization 

Memory Utilization

Note: This chart covers multiple tests, to show that using RDS Proxy does not produce a memory leak.

RDS metrics 

Note: It is clearly visible that at some points of the test the DB was low on free memory.

Proxy Metrics

Proxy 1 | Proxy 2


Test 4 with shared pool on mod-circulation-storage

This test checks whether enabling the shared pool on mod-circulation-storage affects performance (see the configuration sketch below).
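As a sketch of what "shared pool ... 1000" likely means in configuration terms, assuming mod-circulation-storage is built on RMB and the value maps to RMB's DB_MAXSHAREDPOOLSIZE environment variable (an assumption, not confirmed by this page):

# Hypothetical container environment for mod-circulation-storage.
# DB_MAXSHAREDPOOLSIZE enables one connection pool shared across all
# tenants (instead of a separate pool per tenant) with the given size.
DB_MAXSHAREDPOOLSIZE=1000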

CPU Utilization 

Memory Utilization

RDS metrics 


Note: Same behaviour as in the previous few tests: the DB ran out of memory, which led to a DB restart.


Test 5: 8xlarge DB

This test checks performance with an 8xlarge DB.

  • The DB was stable; the test completed without DB restarts.
  • CI/CO response times are better than in previous tests (CI 1.118 / CO 1.9).
  • All DI jobs finished successfully.

CPU Utilization 

Memory Utilization

RDS metrics 


Note: We do not see the memory issues here that we observed in previous tests.


Test 6: 8xlarge DB, 30 DI jobs

Test with 30 parallel DI jobs running. 

  • The test finished successfully.
  • All errors visible in the chart below are due to sudden restarts of mod-inventory.

CPU Utilization 

Memory Utilization

Note: Several restarts of mod-inventory are visible here.

RDS metrics 

Test 7: 8xlarge DB, 10,000 connections

A repeat of the previous test, with the DB's maximum allowed connections parameter changed from 5,000 to 10,000 (see the sketch below).
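A minimal sketch (boto3) of this parameter change, assuming the limit is the PostgreSQL max_connections parameter and a hypothetical parameter group name:

import boto3

rds = boto3.client("rds")

# max_connections is a static parameter, so the change only takes
# effect after a reboot. For an Aurora cluster the equivalent call is
# modify_db_cluster_parameter_group on the cluster parameter group.
rds.modify_db_parameter_group(
    DBParameterGroupName="pmpt-db-params",  # hypothetical name
    Parameters=[{
        "ParameterName": "max_connections",
        "ParameterValue": "10000",
        "ApplyMethod": "pending-reboot",
    }],
)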

CPU Utilization 

Memory Utilization

RDS metrics 



Test 8: 4xlarge DB, 10,000 connections

CPU Utilization 


Memory Utilization

RDS metrics