RDS proxy experiments
Overview
Within the scope of PERF-747 we need to check the performance and usability of the RDS Proxy approach.
RDS Proxy is used to better manage the actual number of database connections created and maintained by the database server. Instead of a 1:1 mapping between client connections and connections added to the pool by the DB, the proxy can provide a many:1 or many:few connection ratio while still allowing clients to execute their queries in time. Testing on the Mobius-like cluster, which had 61 tenants, showed that the maximum number of database connections was exhausted very quickly when 30 DI jobs were running with 5 vUsers doing CICO on each tenant.
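From the client side nothing changes except the endpoint: a module connects to the proxy exactly as it would to the database, and the proxy multiplexes those client connections onto a smaller number of real DB connections. A minimal illustration in Python (endpoint names and credentials below are hypothetical, not from this environment):

```python
import psycopg2

# Hypothetical endpoints for illustration only.
DB_ENDPOINT = "folio-db.cluster-xxxx.us-east-1.rds.amazonaws.com"
PROXY_ENDPOINT = "pmpt-proxy.proxy-xxxx.us-east-1.rds.amazonaws.com"

# A module connecting directly to the database...
direct = psycopg2.connect(host=DB_ENDPOINT, port=5432, dbname="folio",
                          user="folio", password="***", sslmode="require")

# ...versus connecting through RDS Proxy: identical client code, only the host
# differs. The proxy decides how many real DB connections back these sessions.
proxied = psycopg2.connect(host=PROXY_ENDPOINT, port=5432, dbname="folio",
                           user="folio", password="***", sslmode="require")
```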
Two proxies were created, with different sets of modules communicating through them:
- pmpt-proxy (for mod-inventory, mod-inventory-storage)
- pmpt-proxy-2 (for mod-di-converter-storage, mod-source-record-manager, mod-source-record-storage)
Each proxy has a "connection pool maximum connections" setting, which defines the maximum allowed connections as a percentage of the database's maximum connection limit (a sketch of setting this value via the API is shown below):
- pmpt-proxy (mod-inventory, mod-inventory-storage): 50%, lowered to 25% in later tests (see the notes section below)
- pmpt-proxy-2 (mod-di-converter-storage, mod-source-record-manager, mod-source-record-storage): 25%
Most of the tests were performed with a 4xlarge database instance; for the later experiments it was switched to 8xlarge.
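For reference, this percentage is set on the proxy's default target group. Below is a minimal boto3 sketch of how the values above could be applied; the region is an assumption and this is not necessarily the exact procedure used in these tests.

```python
import boto3

# Sketch only: adjust the "connection pool maximum connections" setting on an
# RDS Proxy default target group. Proxy names are taken from this page; the
# region is an assumption.
rds = boto3.client("rds", region_name="us-east-1")

def set_max_connections_percent(proxy_name: str, percent: int) -> None:
    """Set MaxConnectionsPercent on the proxy's default target group."""
    rds.modify_db_proxy_target_group(
        DBProxyName=proxy_name,
        TargetGroupName="default",  # RDS Proxy uses a single default target group
        ConnectionPoolConfig={"MaxConnectionsPercent": percent},
    )

# 50% was used for pmpt-proxy in the first tests and 25% later; pmpt-proxy-2 stayed at 25%.
set_max_connections_percent("pmpt-proxy", 50)
set_max_connections_percent("pmpt-proxy-2", 25)
```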
Summary
Test Runs
Results
| # | Test | CI (s) | CO (s) | Data Import duration | Configuration | Notes |
|---|---|---|---|---|---|---|
| 1 | 61 tenants, 5 users each, CICO + 10k MARC BIB Create on 14 tenants | 1.49 | 2.35 | 34 min - 2 hr 53 min | 50% connections on pmpt-proxy, 4xLarge DB | No errors |
| 2 | 61 tenants, 5 users each, CICO + 10k MARC BIB Create on 14 tenants | 1.94 | 3.0 | 2 hr - 2 hr 50 min | 50% connections on pmpt-proxy, 4xLarge DB | 2 DI jobs stuck |
| 3 | 61 tenants, 5 users each, CICO + 10k MARC BIB Create on 14 tenants | 1.7 | 2.67 | 20 min - 2 hr 33 min | 50% connections on pmpt-proxy, 4xLarge DB | 11 DI jobs completed with errors |
| 4 | 61 tenants, 5 users each, CICO + 10k MARC BIB Create on 14 tenants | 2.23 | 3.47 | 34 min - 3 hr 3 min | 25% connections on pmpt-proxy, 4xLarge DB, with shared pool on mod-circulation-storage (1000) | 9 DI jobs failed |
| 5 | 61 tenants, 5 users each, CICO + 10k MARC BIB Create on 14 tenants | 1.18 | 1.9 | 30 min - 3 hr 38 min | 25% connections on pmpt-proxy, 8xLarge DB | No errors |
| 6 | 61 tenants, 5 users each, CICO + 10k MARC BIB Create on 30 tenants | 1.181 | 1.884 | 2 hr 33 min - 6 hr 25 min | 25% connections on pmpt-proxy, 8xLarge DB | 3 DI jobs stuck |
| 7 | 61 tenants, 5 users each, CICO + 10k MARC BIB Create on 14 tenants | 1.358 | 2.107 | 45 min - 3 hr 13 min | 25% connections on pmpt-proxy, 8xLarge DB, max DB connections changed from default (5K) to 10K | 2 DI jobs failed |
| 8 | 61 tenants, 5 users each, CICO + 10k MARC BIB Create on 14 tenants | 3.835 | 5.606 | 3 hr 15 min | 25% connections on pmpt-proxy, 4xLarge DB, max DB connections changed from default (5K) to 10K | 10 DI jobs failed |
| 9 | 61 tenants, 5 users each, CICO + 10k MARC BIB Create on 15 tenants | 1.623 | 2.294 | 50 min - 3 hr 10 min | No proxy, 4xLarge DB, max DB connections changed from default (5K) to 10K | No errors |
Test 1
Initial test. No errors except on the fs00001137 tenant (both CI/CO and Data Import), as it wasn't registered with the proxies.
All tenants involved finished CI/CO without any errors.
All tenants involved finished Data Import successfully.
CPU Utilization
Memory Utilization
Note: this chart contains multiple tests to show that there is no memory leak produced by using RDS Proxy.
Note: the drop in memory usage on several modules is due to module restarts before the tests:
RDS metrics
Note: RDS CPU usage is close to 50-60% during CICO + DI, and around 40% while only DI jobs are running in the background.
Note: the RDS connection count is around 4.5K during CICO + DI, and 3.0K-3.2K during Data Import only.
Note: the low amount of freeable memory remaining when using RDS Proxy is visible here.
Proxy Metrics
Proxy 1 | Proxy 2 |
---|---|
Test 2
The second test was performed right after the first one (with CICO data preparation in between), without restarting any modules or the database.
Note: the throughput chart clearly shows a big gap (21:35-21:40) with multiple errors in it. This gap happened because the DB restarted with an "Out Of Memory" exception. During this time (while the DB was restarting) there were many errors on CI/CO across all tenants, and 2 DI jobs got stuck. There are also many errors on multiple modules (even ones not directly involved in these workflows), but all of them are related to the DB restart and OOM (Out Of Memory).
After the DB restart, the test proceeded without errors.
DB logs:
```
2023-12-04 04:21:44 UTC:10.23.10.247(36116):mob08_mod_users@folio:[15713]:ERROR: out of memory at character 8
2023-12-04 04:21:44 UTC:10.23.10.247(36116):mob08_mod_users@folio:[15713]:DETAIL: Failed on request of size 8192.
2023-12-04 04:21:58 UTC:10.23.10.183(60756):folio@folio:[13487]:DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2023-12-04 04:21:59 UTC::@:[537]:FATAL: Can't handle storage runtime process crash
2023-12-04 04:21:59 UTC::@:[537]:LOG: database system is shut down
```
CPU Utilization
Memory Utilization
Note: this chart contains multiple tests to show that there is no memory leak produced by using RDS Proxy.
RDS metrics
Note: a gap is clearly visible in the RDS connections chart; this is when the DB was restarting due to the Out Of Memory exception.
Note: the low amount of freeable memory remaining when using RDS Proxy is visible here.
Proxy Metrics
Proxy 1 | Proxy 2 |
---|---|
Test 3
This test was performed without any restart after test #2 to check whether there is degradation between tests.
This time the DB restarted multiple times, which led to multiple errors during the test (both on the CI/CO side and in DI). The errors and logs are the same as in test #2.
CPU Utilization
Memory Utilization
Note: this chart contains multiple tests to show that there is no memory leak produced by using RDS Proxy.
RDS metrics
Note: it is clearly visible that at some points of the test the DB was low on free memory.
Proxy Metrics
Proxy 1 | Proxy 2 |
---|---|
Test 4 with shared pool on mod-circulation-storage
Test to check whether enabling the shared pool on mod-circulation-storage affects performance.
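For context, in RMB-based modules such as mod-circulation-storage the shared pool size is controlled by the DB_MAXSHAREDPOOLSIZE environment variable (to the best of our knowledge). The sketch below shows how a new ECS task definition revision with that variable could be registered; the task family name and the ECS deployment details are assumptions, not taken from this environment.

```python
import boto3

# Sketch only: register a new task definition revision for mod-circulation-storage
# with the shared-pool size set to 1000 (the value used in test 4). Assumes the
# module runs on ECS and that DB_MAXSHAREDPOOLSIZE controls the shared pool;
# the task family name is hypothetical.
ecs = boto3.client("ecs")

current = ecs.describe_task_definition(taskDefinition="mod-circulation-storage")["taskDefinition"]
container = current["containerDefinitions"][0]

# Replace (or add) the shared pool variable, keeping the other variables as-is.
env = [e for e in container.get("environment", []) if e["name"] != "DB_MAXSHAREDPOOLSIZE"]
env.append({"name": "DB_MAXSHAREDPOOLSIZE", "value": "1000"})
container["environment"] = env

ecs.register_task_definition(
    family=current["family"],
    containerDefinitions=current["containerDefinitions"],
    # In practice cpu/memory/networkMode/roles would be copied from `current` as well.
)
```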
CPU Utilization
Memory Utilization
RDS metrics
Note: the same behaviour as in the previous few tests; the DB ran out of memory, which led to a DB restart.
Test 5 8x large
Test to check performance when running with an 8xlarge DB.
- The DB was stable; the test completed without DB restarts.
- CI/CO response times are better than in previous tests (CI 1.118 / CO 1.9).
- All DI jobs finished successfully.
CPU Utilization
Memory Utilization
RDS metrics
Note: we do not see memory issues here (as we observed in previous tests).
Test 6 8x large 30 DI jobs
Test with 30 parallel DI jobs running.
- Test finished successfully.
- All errors visible in the chart below are due to a sudden restart of mod-inventory.
CPU Utilization
Memory Utilization
Note: several restarts of mod-inventory are visible here.
RDS metrics
Test 7 8x large 10000 connections
Repeat of the previous test with the maximum allowed DB connections parameter changed from 5,000 to 10,000.
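For reference, this change is typically applied through the parameter group attached to the database. Below is a minimal boto3 sketch with a hypothetical parameter group name; it is not necessarily the exact procedure used in these tests.

```python
import boto3

# Sketch only: raise max_connections from the 5K default to 10K on the parameter
# group attached to the test database. The parameter group name is hypothetical;
# for an Aurora cluster the equivalent call is modify_db_cluster_parameter_group.
rds = boto3.client("rds")

rds.modify_db_parameter_group(
    DBParameterGroupName="pmpt-db-params",  # hypothetical name
    Parameters=[
        {
            "ParameterName": "max_connections",
            "ParameterValue": "10000",
            "ApplyMethod": "pending-reboot",  # max_connections is a static parameter
        }
    ],
)
```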