Overview
This document contains the results of testing concurrent Data Import with the file-splitting feature for MARC Bibliographic records in the Quesnelia release.
The purpose of this test is to determine how concurrent DI affects the duration of DI jobs on the central tenant and to check for possible issues during a smoke test with a 50k DI Create job running concurrently on all 3 tenants.
Compared with Data Import Creates + Updates Poppy pcp1
Ticket: PERF-837
Summary
- The duration of 10k and 25k Data Import jobs scales with the number of concurrent DI jobs: it roughly doubles with 2 concurrent jobs (1 job per tenant) and is about 3 times longer with 3 concurrent jobs. This trend is consistent across tenants and is the same as in Poppy.
- The smoke test with a 50k file and a DI Create job revealed an issue with the mod-permissions module, which spiked and caused 4 split jobs on the third tenant (fs07000002) to fail with SNAPSHOT_UPDATE_ERROR at the beginning of the test. No error messages were shown in the UI. Reason - No action.
- Average CPU utilization of the top two modules: during DI Create jobs, mod-inventory-b - 121% and mod-quick-marc-b - 92%; during DI Update jobs, mod-inventory-b - 150% and mod-quick-marc-b - 92%. Modules used fewer resources in Quesnelia. More information can be found in the Service CPU Utilization section.
- Memory consumption: mod-inventory-b - 87%, mod-data-import-b - 80%, mod-source-record-storage-b - 73%, mod-permissions-b - 71% which is slightly less than in Poppy
- RDS CPU utilization was 88% for all jobs, which is 10% less than in Poppy.
- DB connections: Create jobs - 990 for 2 tenants and 1100 for 3 tenants; Update jobs - 950 for 2 tenants and 1000 for 3 tenants.
- The top long-running query was SELECT jsonb FROM fs07000002_mod_permissions.permissions: 4275 rows/sec, average latency 1579 ms/call.
Test Runs
Test # | Scenario | Load level |
---|---|---|
1 - Concurrent Create imports | DI MARC Bib Create | 10K, 25K concurrently (with 5 min pause) on 2 and 3 tenants |
2 - Concurrent Update imports | DI MARC Bib Update | 10K, 25K concurrently (with 5 min pause) on 2 and 3 tenants |
3 - Concurrent Create imports ("smoke test") of 50K | DI MARC Bib Create | 50k concurrently on 3 tenants |
Test Results
The duration of DI jobs grows proportionally to file size.
Smoke Test finished successfully for 3 concurrent DI Create jobs of 50K each.
Job type | File size | Test # | Number of concurrent jobs | Main tenant (fs09000000) | Second tenant (fs07000001) | Third tenant (fs07000002) |
---|---|---|---|---|---|---|
DI Create | 10k | Baseline | 1 | 00:05:01 | 00:05:01 | 00:05:00 |
DI Create | 10k | 1 | 2 | 00:09:04 | 00:09:08 | |
DI Create | 10k | 2 | 3 | 00:14:25 | 00:15:12 | 00:15:22 |
DI Create | 25k | Baseline | 1 | 00:10:43 | 00:11:15 | 00:11:22 |
DI Create | 25k | 3 | 2 | 00:22:44 | 00:22:47 | |
DI Create | 25k | 4 | 3 | 00:40:32 | 00:40:17 | 00:40:25 |
DI Update | 10k | Baseline | 1 | 00:06:25 | | |
DI Update | 10k | 5 | 2 | 00:14:08 | 00:13:44 | |
DI Update | 10k | 6 | 3 | 00:20:13 | 00:21:03 | 00:20:41 |
DI Update | 25k | Baseline | 1 | 00:15:17 | | |
DI Update | 25k | 7 | 2 | 00:35:27 | 00:35:34 | |
DI Update | 25k | 8 | 3 | 00:58:49 | 00:58:59 | 00:59:05 |
DI Create (Smoke test) | 50K | 9 | 1 | 00:22:44.2 | | |
DI Create (Smoke test) | 50K | 10 | 3 | 01:21:37 | 01:22:49 | 01:22:49 |
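The scaling described in the Summary can be checked directly from the durations in the table above. The following is a minimal sketch, not part of the original test tooling: the helper names `parse_duration` and `slowdown` are hypothetical, and the input values are the central tenant (fs09000000) DI Create durations copied from the table.

```python
from datetime import timedelta


def parse_duration(text: str) -> timedelta:
    """Parse 'HH:MM:SS' or 'MM:SS.f' (the format used in the 50K row)."""
    parts = [float(p) for p in text.split(":")]
    while len(parts) < 3:  # pad a missing hours field with zero
        parts.insert(0, 0.0)
    hours, minutes, seconds = parts
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


def slowdown(durations: dict) -> dict:
    """Ratio of each run to the single-job baseline (key 1)."""
    baseline = parse_duration(durations[1]).total_seconds()
    return {
        jobs: round(parse_duration(value).total_seconds() / baseline, 2)
        for jobs, value in durations.items()
    }


# Keys are the number of concurrent jobs; values are central tenant durations.
create_10k = {1: "00:05:01", 2: "00:09:04", 3: "00:14:25"}
create_25k = {1: "00:10:43", 2: "00:22:44", 3: "00:40:32"}

print("10k:", slowdown(create_10k))
print("25k:", slowdown(create_25k))
```

For these inputs the ratios come out to roughly 1.8 and 2.9 for 10k and roughly 2.1 and 3.8 for 25k, so the 25k run with 3 concurrent jobs is somewhat slower than a strict 3x.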
Comparison
Job type | File size | Test # | Number of concurrent jobs | Quesnelia central tenant | Quesnelia second tenant | Quesnelia third tenant | Poppy central tenant | Poppy second tenant | Poppy third tenant | Delta fs09000000 | Delta fs07000001 | Delta fs07000002 |
---|---|---|---|---|---|---|---|---|---|---|---|---|
DI Create | 10k | Baseline | 1 | 00:05:01 | 00:05:01 | 00:05:00 | 00:04:56 | | | 00:00:05 | | |
DI Create | 10k | 1 | 2 | 00:09:04 | 00:09:08 | | 00:10:43 | 00:10:37 | | 00:01:39 | 00:01:29 | |
DI Create | 10k | 2 | 3 | 00:14:25 | 00:15:12 | 00:15:22 | 00:21:12 | 00:21:06 | 00:20:57 | 00:06:47 | 00:05:54 | 00:05:35 |
DI Create | 25k | Baseline | 1 | 00:10:43 | 00:11:15 | 00:11:22 | 00:11:24 | | | 00:00:41 | | |
DI Create | 25k | 3 | 2 | 00:22:44 | 00:22:47 | | 00:23:44 | 00:23:30 | | 00:01:00 | 00:00:43 | |
DI Create | 25k | 4 | 3 | 00:40:32 | 00:40:17 | 00:40:25 | 00:37:11 | 00:37:05 | 00:36:58 | 00:03:21 | 00:03:12 | 00:03:27 |
DI Update | 10k | Baseline | 1 | 00:06:25 | | | 00:06:32 | | | 00:00:07 | | |
DI Update | 10k | 5 | 2 | 00:14:08 | 00:13:44 | | 00:09:47 | 00:11:26 | | 00:04:21 | 00:02:18 | |
DI Update | 10k | 6 | 3 | 00:20:13 | 00:21:03 | 00:20:41 | 00:19:08 | 00:19:06 | 00:18:31 | 00:01:05 | 00:01:57 | 00:02:10 |
DI Update | 25k | Baseline | 1 | 00:15:17 | | | 00:15:13 | | | 00:00:04 | | |
DI Update | 25k | 7 | 2 | 00:35:27 | 00:35:34 | | 00:30:49 | 00:30:52 | | 00:04:38 | 00:04:42 | |
DI Update | 25k | 8 | 3 | 00:58:49 | 00:58:59 | 00:59:05 | 00:47:47 | 00:48:17 | 00:47:54 | 00:11:02 | 00:10:42 | 00:11:11 |
DI Create (Smoke test) | 50K | 9 | 1 | 00:22:44.2 | | | 00:22:31 | | | 00:00:13 | | |
DI Create (Smoke test) | 50K | 10 | 3 | 01:21:37 | 01:22:49 | 01:22:49 | 01:12:54 | 01:12:44 | 01:12:35 | 00:08:43 | 00:10:05 | 00:10:14 |
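The delta columns above are unsigned, so they do not show which release was faster. As an illustration only, the sketch below recomputes a signed delta (Quesnelia minus Poppy) and a percent change for the central tenant with 3 concurrent jobs. The function names `to_seconds` and `compare` are hypothetical; the durations are copied from the comparison table.

```python
def to_seconds(hms: str) -> int:
    """Convert 'HH:MM:SS' to whole seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def compare(quesnelia: str, poppy: str) -> tuple:
    """Return (signed delta in seconds, percent change relative to Poppy)."""
    q, p = to_seconds(quesnelia), to_seconds(poppy)
    delta = q - p  # negative means Quesnelia was faster
    return delta, round(100 * delta / p, 1)


# Central tenant (fs09000000), 3 concurrent jobs: (Quesnelia, Poppy).
rows = {
    "DI Create 10k": ("00:14:25", "00:21:12"),
    "DI Create 25k": ("00:40:32", "00:37:11"),
    "DI Update 10k": ("00:20:13", "00:19:08"),
    "DI Update 25k": ("00:58:49", "00:47:47"),
}

for name, (quesnelia, poppy) in rows.items():
    delta, pct = compare(quesnelia, poppy)
    print(f"{name}: {delta:+d} s ({pct:+.1f}%)")
```

With these inputs, only the 10k Create run is faster in Quesnelia (about 32% shorter); the other three runs are between roughly 6% and 23% longer than in Poppy.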
Service CPU Utilization
Service Memory Utilization
MSK tenant cluster
Disk usage by broker
CPU (User) usage by broker
DB CPU Utilization
RDS CPU utilization was 88% for all jobs, which is 10% less than in Poppy.
DB Connections
Create jobs DB connections for 2 tenants - 990, for 3 tenants - 1100
Update jobs DB connections for 2 tenants - 950, for 3 tenants - 1000
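To put these connection counts in context, they can be compared with the max_connections value of 2731 listed for the database instance in the Infrastructure section. This is a rough sketch under that assumption; the `utilization` helper is hypothetical and the counts are the figures reported above.

```python
MAX_CONNECTIONS = 2731  # max_connections of db.r6g.xlarge, from Infrastructure

# (job type, number of tenants) -> reported DB connections
reported_connections = {
    ("Create", 2): 990,
    ("Create", 3): 1100,
    ("Update", 2): 950,
    ("Update", 3): 1000,
}


def utilization(connections: int, limit: int = MAX_CONNECTIONS) -> float:
    """Share of the connection limit in use, in percent."""
    return round(100 * connections / limit, 1)


for (job_type, tenants), count in reported_connections.items():
    print(f"{job_type}, {tenants} tenants: {count} ({utilization(count)}% of limit)")
```

Under this assumption, the highest reported count (1100 for Create jobs on 3 tenants) is about 40% of the limit.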
DB load
Top SQL
SELECT jsonb FROM fs07000002_mod_permissions.permissions
Appendix
Errors & Exceptions
During the smoke test with 50k on 3 tenants, SNAPSHOT_UPDATE_ERROR was observed on the third tenant (fs07000002). No error messages were shown in the UI. Reason - No action on four split jobs at the beginning of the test.
Infrastructure
PTF environment qcp1
- 10 m6i.2xlarge EC2 instances located in US East (N. Virginia), us-east-1
- 1 database instance, writer

Name | Memory GiB | vCPUs | max_connections |
---|---|---|---|
db.r6g.xlarge | 32 GiB | 4 vCPUs | 2731 |

- Number of records in DB:
  - fs09000000
    - instances - 25901331
    - items - 27074913
    - holdings - 25871735
  - fs07000001
    - instances - 10100620
    - items - 1484850
    - holdings - 10522266
  - fs07000002
    - instances - 1161275
    - items - 1153548
    - holdings - 1153548
- MSK tenant
  - 4 m5.2xlarge brokers in 2 zones
  - Apache Kafka version 2.8.0
  - EBS storage volume per broker 300 GiB
  - auto.create.topics.enable=true
  - log.retention.minutes=480
  - default.replication.factor=3
Module Version qcp1-pvt | Revision | Task Count | Mem Hard Limit | Mem Soft Limit | CPU | Xmx | MetaspaceSize | MaxMetaspaceSize |
---|---|---|---|---|---|---|---|---|
mod-users-bl:7.7.0 | 5 | 2 | 1440 | 1152 | 512 | 922 | 88 | 128 |
mod-configuration:5.10.0 | 5 | 2 | 1024 | 896 | 128 | 768 | 88 | 128 |
mod-authtoken:2.15.1 | 6 | 2 | 1440 | 1152 | 512 | 922 | 88 | 128 |
mod-data-import:3.1.0 | 8 | 1 | 2048 | 1844 | 256 | 1292 | 384 | 512 |
mod-remote-storage:3.2.0 | 5 | 2 | 4920 | 4472 | 1024 | 3960 | 512 | 512 |
mod-inventory-storage:27.1.0 | 5 | 2 | 4096 | 3690 | 2048 | 3076 | 384 | 512 |
pub-okapi:2023.06.14 | 3 | 2 | 1024 | 896 | 128 | 768 | Not found | Not found |
mod-feesfines:19.1.0 | 5 | 2 | 1024 | 896 | 128 | 768 | 88 | 128 |
okapi:5.3.0 | 5 | 3 | 1684 | 1440 | 1024 | 922 | 384 | 512 |
nginx-okapi:2023.06.14 | 3 | 2 | 1024 | 896 | 128 | Not found | Not found | Not found |
mod-quick-marc:5.1.0 | 5 | 1 | 2288 | 2176 | 128 | 1664 | 384 | 512 |
mod-source-record-manager:3.9.0-SNAPSHOT.330 | 6 | 2 | 5600 | 5000 | 2048 | 3500 | 384 | 512 |
mod-patron-blocks:1.10.0 | 5 | 2 | 1024 | 896 | 1024 | 768 | 88 | 128 |
mod-pubsub:2.13.0 | 5 | 2 | 1536 | 1440 | 1024 | 922 | 384 | 512 |
mod-circulation:24.2.0 | 5 | 2 | 2880 | 2592 | 1536 | 1814 | 384 | 512 |
mod-di-converter-storage:2.2.0 | 5 | 2 | 1024 | 896 | 128 | 768 | 88 | 128 |
mod-inventory:20.2.0 | 5 | 2 | 2880 | 2592 | 1024 | 1814 | 384 | 512 |
mod-source-record-storage:5.8.0 | 5 | 2 | 5600 | 5000 | 2048 | 3500 | 384 | 512 |
mod-circulation-storage:17.2.0 | 5 | 2 | 2880 | 2592 | 1536 | 1814 | 384 | 512 |
Methodology/Approach
- DI tests were started from the UI concurrently, with 1 job per tenant: first on fs09000000 and then on fs07000001, for a total of two jobs on two tenants.
- Then 1 job was started on each of the three tenants concurrently, with a delay of several seconds between starts: first on fs09000000, then on fs07000001, and finally on fs07000002.
- DI Create jobs were executed first, followed by DI Update jobs.
The following queries were used to get instance IDs and job durations:

    -- Instance IDs: keep exactly one "from" line and one "where" line uncommented.
    select id
    from fs09000000_mod_inventory_storage.instance
    -- from fs07000001_mod_inventory_storage.instance
    -- from fs07000002_mod_inventory_storage.instance
    -- limit 1
    -- To get instance ids for 10k Update on main tenant
    where creation_date > '2024-06-18 06:40:14' and creation_date < '2024-06-18 06:45:15.549'
    -- To get instance ids for 25k Update on main tenant
    -- where creation_date > '2024-06-18 06:50:00' and creation_date < '2024-06-18 07:00:43.192'
    -- To get instance ids for 10k Update on 0701 tenant
    -- where creation_date > '2024-06-18 07:07:51' and creation_date < '2024-06-18 07:12:53.209'
    -- To get instance ids for 25k Update on 0701 tenant
    -- where creation_date > '2024-06-18 07:17:04' and creation_date < '2024-06-18 07:28:19.22'
    -- To get instance ids for 10k Update on 0702 tenant
    -- where creation_date > '2024-06-18 07:33:07' and creation_date < '2024-06-18 07:38:06.762'
    -- To get instance ids for 25k Update on 0702 tenant
    -- where creation_date > '2024-06-18 07:42:50' and creation_date < '2024-06-18 07:54:12.09'
    ;

    -- Job durations: keep exactly one "from" line uncommented.
    select file_name, total_records_in_file, started_date, completed_date,
           completed_date - started_date as duration, status, error_status
    from fs09000000_mod_source_record_manager.job_execution
    -- from fs07000001_mod_source_record_manager.job_execution
    -- from fs07000002_mod_source_record_manager.job_execution
    where subordination_type = 'COMPOSITE_PARENT'
    -- and started_date > '2024-06-13 14:47:54' and completed_date < '2024-06-13 19:01:50.832'
    order by started_date desc
    limit 10;
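Because the job-duration query differs between tenants only by schema prefix, it can also be generated rather than edited by hand. The sketch below is illustrative only: the function name `job_duration_query` is hypothetical, the tenant list is taken from this report, and the function only builds the SQL text (it does not connect to a database).

```python
# Tenant IDs used in this report; the schema name is "<tenant>_mod_source_record_manager".
TENANTS = ("fs09000000", "fs07000001", "fs07000002")


def job_duration_query(tenant: str, limit: int = 10) -> str:
    """Build the job-duration query for one tenant schema."""
    if tenant not in TENANTS:
        # Schema names cannot be bound as query parameters, so only
        # known tenant IDs are accepted before interpolation.
        raise ValueError(f"unknown tenant: {tenant}")
    return (
        "select file_name, total_records_in_file, started_date, completed_date, "
        "completed_date - started_date as duration, status, error_status "
        f"from {tenant}_mod_source_record_manager.job_execution "
        "where subordination_type = 'COMPOSITE_PARENT' "
        f"order by started_date desc limit {int(limit)}"
    )


for tenant in TENANTS:
    print(job_duration_query(tenant))
```

Restricting the tenant argument to a fixed list keeps the string interpolation safe, since the schema name is inserted into the SQL text directly.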