Test multi-tenants DI by tweaking Database Parameters
Overview
Folijet investigated MODSOURCE-581 and determined that multi-tenant concurrent DI job failures are due to exhausting database resources like database connections and CPU and/or memory.
- PERF-544Getting issue details... STATUS
Summary
- Multitenant DI operations (3 tenants in parallel) were completed successfully for all tests with different configs and 10k & 25k files in the 'ncp5' PTF Environment.
- Based on the comparison, the updated configuration by increasing the RDS instance to db.r6g.2xlarge helped to decrease DI duration by up to -42%. However, the updated configuration by - MODSOURCE-629Getting issue details... STATUS 's recommendations didn't show positive changes (DI duration decreased up to +6%).
- Based on global test executions behaviour, the updated configuration by increasing the RDS instance to db.r6g.2xlarge and the updated configuration by - MODSOURCE-629Getting issue details... STATUS 's recommendations didn't help to resolve resource utilization problems (RDS CPU usage was approx. 95% for all tests with different configs and 10k & 25k files).
Recommendations & Jiras (Optional)
Results
The following table contains information about test executions, configurations, and results.
DB instance | # | UI | DI File | Profile | Result | Start | End | Duration | ID |
db.r6g.xlarge default | 1 | ptf-ncp5-00.int.aws.folio.org | 10k | PTF - Create 2 | Completed | 5/16/23 6:40 AM | 5/16/23 6:55 AM | 0:15 | 25973 |
ptf-ncp5-01.int.aws.folio.org | 10k | PTF - Create 2 | Completed | 5/16/23 6:40 AM | 5/16/23 6:56 AM | 0:16 | 3631 | ||
2 | ptf-ncp5-00.int.aws.folio.org | 10k | PTF - Create 2 | Completed | 5/16/23 7:48 AM | 5/16/23 8:12 AM | 0:24 | 25974 | |
ptf-ncp5-01.int.aws.folio.org | 10k | PTF - Create 2 | Completed | 5/16/23 7:49 AM | 5/16/23 8:12 AM | 0:23 | 3632 | ||
ptf-ncp5-02.int.aws.folio.org | 10k | PTF - Create 2 | Completed | 5/16/23 7:49 AM | 5/16/23 8:09 AM | 0:20 | 826 | ||
3 | ptf-ncp5-00.int.aws.folio.org | 25k | PTF - Create 2 | Completed | 5/16/23 10:35 AM | 5/16/23 11:33 AM | 0:58 | 26005 | |
ptf-ncp5-01.int.aws.folio.org | 25k | PTF - Create 2 | Completed | 5/16/23 10:35 AM | 5/16/23 11:34 AM | 0:59 | 3664 | ||
ptf-ncp5-02.int.aws.folio.org | 25k | PTF - Create 2 | Completed | 5/16/23 10:35 AM | 5/16/23 11:25 AM | 0:50 | 892 | ||
db.r6g.2xlarge | 4U | ptf-ncp5-00.int.aws.folio.org | 10k | PTF - Create 2 | Completed | 5/17/23 6:59 AM | 5/17/23 7:14 AM | 0:15 | 26071 |
ptf-ncp5-01.int.aws.folio.org | 10k | PTF - Create 2 | Completed | 5/17/23 6:59 AM | 5/17/23 7:13 AM | 0:14 | 3697 | ||
ptf-ncp5-02.int.aws.folio.org | 10k | PTF - Create 2 | Completed | 5/17/23 6:59 AM | 5/17/23 7:12 AM | 0:13 | 925 | ||
5U | ptf-ncp5-00.int.aws.folio.org | 25k | PTF - Create 2 | Completed | 5/17/23 7:21 AM | 5/17/23 8:11 AM | 0:50 | 26075 | |
ptf-ncp5-01.int.aws.folio.org | 25k | PTF - Create 2 | Completed | 5/17/23 7:21 AM | 5/17/23 7:55 AM | 0:34 | 3698 | ||
ptf-ncp5-02.int.aws.folio.org | 25k | PTF - Create 2 | Completed | 5/17/23 7:21 AM | 5/17/23 7:52 AM | 0:31 | 926 | ||
db.r6g.xlarge default | 6P | ptf-ncp5-00.int.aws.folio.org | 10k | PTF - Create 2 | Completed | 5/18/23 8:34 AM | 5/18/23 8:58 AM | 0:24 | 26203 |
ptf-ncp5-01.int.aws.folio.org | 10k | PTF - Create 2 | Completed | 5/18/23 8:34 AM | 5/18/23 8:58 AM | 0:24 | 3829 | ||
ptf-ncp5-02.int.aws.folio.org | 10k | PTF - Create 2 | Completed | 5/18/23 8:34 AM | 5/18/23 8:55 AM | 0:21 | 1057 | ||
7P | ptf-ncp5-00.int.aws.folio.org | 25k | PTF - Create 2 | Completed | 5/18/23 9:36 AM | 5/18/23 10:36 AM | 1:00 | 26236 | |
ptf-ncp5-01.int.aws.folio.org | 25k | PTF - Create 2 | Completed | 5/18/23 9:36 AM | 5/18/23 10:37 AM | 1:01 | 3862 | ||
ptf-ncp5-02.int.aws.folio.org | 25k | PTF - Create 2 | Completed | 5/18/23 9:36 AM | 5/18/23 10:29 AM | 0:53 | 1090 |
Comparisons
The following table contains information about the comparison of different configurations. Based on the comparison, the updated configuration by increasing the RDS instance to db.r6g.2xlarge helped to decrease DI duration by up to 42%. However, the updated configuration by - MODSOURCE-629Getting issue details... STATUS 's recommendations didn't show positive changes.
UI | Records per file | DI duration for Default config, min | DI duration for Instance update, min | Delta, +/-% | DI duration for SRM parameters update, min | Delta, +/-% |
ptf-ncp5-00.int.aws.folio.org | 10k | 0:24 | 0:15 | -37.50% | 0:24 | 0% |
ptf-ncp5-01.int.aws.folio.org | 10k | 0:23 | 0:14 | -39.13% | 0:24 | +4.35% |
ptf-ncp5-02.int.aws.folio.org | 10k | 0:20 | 0:13 | -35.00% | 0:21 | +5.00% |
------------------------------- | --------- | ---------------------- | ------------------------ | --------- | --------------------------------- | -------- |
ptf-ncp5-00.int.aws.folio.org | 25k | 0:58 | 0:50 | -13.79% | 1:00 | +3.45% |
ptf-ncp5-01.int.aws.folio.org | 25k | 0:59 | 0:34 | -42.37% | 1:01 | +3.39% |
ptf-ncp5-02.int.aws.folio.org | 25k | 0:50 | 0:31 | -38.00% | 0:53 | +6.00% |
Memory Utilization
Default config
'mod-source-record-storage' and 'mod-inventory' Memory utilization were more than 80%
Instance update
'mod-inventory' Memory utilization was more than 80%
SRM parameters update
'mod-source-record-storage' and 'mod-inventory' Memory utilization were more than 80%
'mod-source-record-manager-import' had visible waves of Memory utilization: up to 48% for multitenant DI 10k, and up to 70% for multitenant DI 25k.
CPU Utilization
Default config
The screenshot above shows the behaviour of typically used DI modules during DI operations.
'mod-data-import' CPU usage started with some spike up to 250% for multitenant DI (3 tenants - 25k file) but then normalized.
'mod-source-record-manager-import' CPU usage started with some spike up to ~150% for multitenant DI (3 tenants - 25k file) but then normalized.
'mod-inventory' and 'mod-di-converter' CPU usage had rarely increased to 100% (3 tenants - 10k).
The screenshot above shows the global picture during DI operations. Moreover, several 'edge-dematic-b' spikes were identified. Potentially, spikes didn't affect multitenant DI.
Instance update
The screenshot above shows the behaviour of typically used DI modules during DI operations.
'mod-data-import' CPU usage started with some spike up to >300% for multitenant DI (3 tenants - 25k file) but then normalized.
'mod-source-record-manager-import' CPU usage started with some spike up to ~200% for multitenant DI (3 tenants - 25k file) but then normalized.
'mod-inventory' and 'mod-di-converter' CPU usage had increased to more than 100% (3 tenants - 10k).
The screenshot above shows the global picture during DI operations. Arrow 1 indicates the anomaly behaviour of 'mod-quick-marc'. Arrow 2 indicates some anomaly for 'edge-dematic-b'.
SRM parameters update
The screenshot above shows the behaviour of typically used DI modules during DI operations.
'mod-data-import' CPU usage started with some spike up to 300% for multitenant DI (3 tenants - 25k file) but then normalized.
'mod-source-record-manager-import' CPU usage started with some spike up to ~160% for multitenant DI (3 tenants - 25k file) but then normalized.
'mod-inventory' and 'mod-di-converter' CPU usage had rare increasing to more than 100% (3 tenants - 10k).
The screenshot above shows the global picture during DI operations. The arrow indicates some anomaly but it isn't identified by AWS.
RDS CPU Utilization
Default config
RDS CPU utilization increased up to approx. 95% for the default configuration (db.r6g.xlarge).
Instance update
RDS CPU utilization increased up to approx. 95% for the updated configuration by increasing the RDS instance to db.r6g.2xlarge.
SRM parameters update
RDS CPU utilization increased to approx. 95% for the updated configuration by - MODSOURCE-629Getting issue details... STATUS 's recommendations.
Additional information from module and database logs
Instance update
Based on anomaly 'mod-quick-marc' behaviour, the following log error messages analysis for the updated configuration by increasing the RDS instance to db.r6g.2xlarge:
fields @timestamp, @logStream, @message | filter @logStream not like "okapi" | filter @logStream not like "edge-dematic" | filter @logStream not like "mod-login" | filter @logStream not like "mod-inn-reach" | filter @logStream not like "mod-authtoken" | filter @logStream not like "mod-remote-storage" | filter @logStream not like "mod-source-record-storage" | filter @message like " ERROR " | filter @message not like "operator does not exist: uuid = bytea" | filter @message not like "[${FolioLoggingContext:requestid}] [${FolioLoggingContext:tenantid}] [${FolioLoggingContext:userid}] [${FolioLoggingContext:moduleid}] ERROR LogAccessor Error handler threw an exception" | filter @message not like "[${FolioLoggingContext:requestid}] [${FolioLoggingContext:tenantid}] [${FolioLoggingContext:userid}] [${FolioLoggingContext:moduleid}] ERROR LogAccessor Backoff FixedBackOff" | filter @message not like "[${FolioLoggingContext:requestid}] [${FolioLoggingContext:tenantid}] [${FolioLoggingContext:userid}] [${FolioLoggingContext:moduleid}] ERROR syncExceptionHandler Async method [public void org.folio.innreach.domain.service.impl.ContributionActionServiceImpl.handleItemCreation(org.folio.innreach.dto.Item)] throw exception" | filter @message not like "ERROR FilterApi Permission missing in" | filter @message not like "ERROR Api Access for user 'mod-innreach'" | filter @message not like "Error in process with Exception org.springframework.kafka.listener.ListenerExecutionFailedException: Listener method 'public void org.folio.rs.integration.KafkaMessageListener.handleEvents(java.util.List<org.folio.rs.domain.dto.DomainEvent>)' threw exception and the record is org.apache.kafka.clients.consumer.ConsumerRecords" | filter @message not like "ERROR KafkaConsumerWrapper businessHandlerCompletionHandler" | filter @message not like "ERROR KafkaConsumerWrapper start:: Error while KafkaConsumerWrapper is working:" | sort @timestamp desc | limit 4000
Total count of Error messages | Mod | Relative count of error messages | Details of error message |
---|---|---|---|
224612 | mod-quick-marc | 112306 | operator does not exist: uuid = bytea |
101078 | [${FolioLoggingContext:requestid}] [${FolioLoggingContext:tenantid}] [${FolioLoggingContext:userid}] [${FolioLoggingContext:moduleid}] ERROR LogAccessor Error handler threw an exception | ||
11228 | [${FolioLoggingContext:requestid}] [${FolioLoggingContext:tenantid}] [${FolioLoggingContext:userid}] [${FolioLoggingContext:moduleid}] ERROR LogAccessor Backoff FixedBackOff | ||
7 | okapi | 7 | ? HTTP response code=404 msg=No suitable module found for path |
10 | edge-dematic | 10 | SpringApplication Application run failed |
10 | mod-login | 10 | Error verifying user existence: No user found by username stagingDirector |
105000 | mod-inn-reach | 105000 | [${FolioLoggingContext:requestid}] [${FolioLoggingContext:tenantid}] [${FolioLoggingContext:userid}] [${FolioLoggingContext:moduleid}] ERROR syncExceptionHandler Async method [public void org.folio.innreach.domain.service.impl.ContributionActionServiceImpl.handleItemCreation(org.folio.innreach.dto.Item)] throw exception |
210000 | mod-authtoken | 105000 | ERROR FilterApi Permission missing in |
105000 | ERROR Api Access for user 'mod-innreach' | ||
34674 | mod-remote-storage | 34674 | Error in process with Exception org.springframework.kafka.listener.ListenerExecutionFailedException: Listener method 'public void org.folio.rs.integration.KafkaMessageListener.handleEvents(java.util.List<org.folio.rs.domain.dto.DomainEvent>)' threw exception and the record is org.apache.kafka.clients.consumer.ConsumerRecords |
1084 | mod-source-record-storage | 1044 | ERROR KafkaConsumerWrapper businessHandlerCompletionHandler |
40 | ERROR KafkaConsumerWrapper start:: Error while KafkaConsumerWrapper is working: | ||
20 | mod-courses | 20 | ERROR CRUtil Expected status code 200 for GET request to url http://pvt.lb.ncp5.folio-eis.us-east-1:9130/service-points/3a40852d-49fd-4df2-a1f9-6e2641a6e91f, got 404: Not found @timestamp 1684308244675 |
4 | edge-caiasoft | 2 | ERROR $LoggingErrorHandler Unexpected error occurred in scheduled task |
2 | ERROR curityManagerService Exception retrieving password |
Appendix
Infrastructure
PTF -environment ncp5:
- 8 m6i.2xlarge EC2 instances located in US East (N. Virginia)us-east-1 [Number of ECS instances, instance type, location region]
- 2 instances of db.r6.xlarge database instances: Writer & reader instances
- MSK ptf-kakfa-3 [ kafka configurations]
- 4 kafka.m5.2xlarge brokers in 2 zones
Apache Kafka version 2.8.0
EBS storage volume per broker 300 GiB
- auto.create.topics.enable=true
- log.retention.minutes=480
- default.replication.factor=3
Modules memory and CPU parameters [table of services properties, will be generated with script soon]
Modules | Version | Task Definition | Running Tasks | CPU | Memory | MemoryReservation | MaxMetaspaceSize | Xmx |
---|---|---|---|---|---|---|---|---|
mod-data-import | 2.7.1 | 8 | 1 | 256 | 2048 | 1844 | 512 | 1292 |
mod-di-converter-storage | 2.0.2 | 5 | 2 | 128 | 1024 | 896 | 128 | 768 |
mod-source-record-storage | 5.6.5 | 24 | 2 | 1024 | 4096 | 3688 | 512 | 3076 |
mod-source-record-manager | 3.6.2 | 16 | 2 | 1024 | 4096 | 3688 | 512 | 3076 |
mod-inventory-storage | 26.0.0 | 10 | 2 | 1024 | 2208 | 1952 | 384 | 1440 |
mod-inventory | 20.0.4 | 8 | 2 | 1024 | 2880 | 2592 | 512 | 1814 |
Methodology/Approach
DI profile 'PTF - Create 2' was used for 3 different tenants.
Default config - RDS db.r6g.xlarge
Instance update - RDS db.r6g.2xlarge
SRM parameters update - updating 'mod-source-record-manager' with the following attributes:
-Ddi.flow.control.reset.state.interval=PT1M -Ddi.flow.control.records.threshold=10 (default value 25)