Test multi-tenants DI by tweaking Database Parameters
- 1 Overview
- 2 Summary
- 3 Results
- 3.1 Comparisons
- 3.2 Memory Utilization
- 3.2.1 Default config
- 3.2.2 Instance update
- 3.2.3 SRM parameters update
- 3.3 CPU Utilization
- 3.3.1 Default config
- 3.3.2 Instance update
- 3.3.3 SRM parameters update
- 3.4 RDS CPU Utilization
- 3.4.1 Default config
- 3.4.2 Instance update
- 3.4.3 SRM parameters update
- 3.5 Additional information from module and database logs
- 3.5.1 Instance update
- 4 Appendix
Overview
Folijet investigated MODSOURCE-581 and determined that concurrent multi-tenant DI job failures are caused by exhausted database resources: database connections, CPU, and/or memory.
PERF-544: Test multi-tenants DI by tweaking Database Parameters (Closed)
Summary
Multitenant DI operations (3 tenants in parallel) were completed successfully for all tests with different configs and 10k & 25k files in the 'ncp5' PTF Environment.
Based on the comparison, increasing the RDS instance to db.r6g.2xlarge decreased DI duration by up to 42%. However, applying the recommendations from MODSOURCE-629 (CLONE - Importing 10,000 MARC authority records > Completes with errors due to timeout, Closed) showed no improvement: DI duration increased by up to 6%.
Based on overall test execution behaviour, neither increasing the RDS instance to db.r6g.2xlarge nor applying the MODSOURCE-629 recommendations resolved the resource utilization problems: RDS CPU usage was approx. 95% for all tests with different configs and 10k & 25k files.
Results
The following table contains information about test executions, configurations, and results.
| DB instance | # | UI | DI File | Profile | Result | Start | End | Duration, h:mm | ID |
|---|---|---|---|---|---|---|---|---|---|
| db.r6g.xlarge default | 1 | | 10k | PTF - Create 2 | Completed | 5/16/23 6:40 AM | 5/16/23 6:55 AM | 0:15 | 25973 |
| db.r6g.xlarge default | 1 | | 10k | PTF - Create 2 | Completed | 5/16/23 6:40 AM | 5/16/23 6:56 AM | 0:16 | 3631 |
| db.r6g.xlarge default | 2 | | 10k | PTF - Create 2 | Completed | 5/16/23 7:48 AM | 5/16/23 8:12 AM | 0:24 | 25974 |
| db.r6g.xlarge default | 2 | | 10k | PTF - Create 2 | Completed | 5/16/23 7:49 AM | 5/16/23 8:12 AM | 0:23 | 3632 |
| db.r6g.xlarge default | 2 | | 10k | PTF - Create 2 | Completed | 5/16/23 7:49 AM | 5/16/23 8:09 AM | 0:20 | 826 |
| db.r6g.xlarge default | 3 | | 25k | PTF - Create 2 | Completed | 5/16/23 10:35 AM | 5/16/23 11:33 AM | 0:58 | 26005 |
| db.r6g.xlarge default | 3 | | 25k | PTF - Create 2 | Completed | 5/16/23 10:35 AM | 5/16/23 11:34 AM | 0:59 | 3664 |
| db.r6g.xlarge default | 3 | | 25k | PTF - Create 2 | Completed | 5/16/23 10:35 AM | 5/16/23 11:25 AM | 0:50 | 892 |
| db.r6g.2xlarge | 4U | | 10k | PTF - Create 2 | Completed | 5/17/23 6:59 AM | 5/17/23 7:14 AM | 0:15 | 26071 |
| db.r6g.2xlarge | 4U | | 10k | PTF - Create 2 | Completed | 5/17/23 6:59 AM | 5/17/23 7:13 AM | 0:14 | 3697 |
| db.r6g.2xlarge | 4U | | 10k | PTF - Create 2 | Completed | 5/17/23 6:59 AM | 5/17/23 7:12 AM | 0:13 | 925 |
| db.r6g.2xlarge | 5U | | 25k | PTF - Create 2 | Completed | 5/17/23 7:21 AM | 5/17/23 8:11 AM | 0:50 | 26075 |
| db.r6g.2xlarge | 5U | | 25k | PTF - Create 2 | Completed | 5/17/23 7:21 AM | 5/17/23 7:55 AM | 0:34 | 3698 |
| db.r6g.2xlarge | 5U | | 25k | PTF - Create 2 | Completed | 5/17/23 7:21 AM | 5/17/23 7:52 AM | 0:31 | 926 |
| db.r6g.xlarge default | 6P | | 10k | PTF - Create 2 | Completed | 5/18/23 8:34 AM | 5/18/23 8:58 AM | 0:24 | 26203 |
| db.r6g.xlarge default | 6P | | 10k | PTF - Create 2 | Completed | 5/18/23 8:34 AM | 5/18/23 8:58 AM | 0:24 | 3829 |
| db.r6g.xlarge default | 6P | | 10k | PTF - Create 2 | Completed | 5/18/23 8:34 AM | 5/18/23 8:55 AM | 0:21 | 1057 |
| db.r6g.xlarge default | 7P | | 25k | PTF - Create 2 | Completed | 5/18/23 9:36 AM | 5/18/23 10:36 AM | 1:00 | 26236 |
| db.r6g.xlarge default | 7P | | 25k | PTF - Create 2 | Completed | 5/18/23 9:36 AM | 5/18/23 10:37 AM | 1:01 | 3862 |
| db.r6g.xlarge default | 7P | | 25k | PTF - Create 2 | Completed | 5/18/23 9:36 AM | 5/18/23 10:29 AM | 0:53 | 1090 |
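The Duration column can be re-derived from the Start and End timestamps. The following is a minimal sketch of that arithmetic, assuming the timestamps follow the `M/D/YY H:MM AM/PM` pattern seen in the table; the function name `job_duration` is hypothetical and not part of any PTF tooling.

```python
from datetime import datetime

# Assumed timestamp layout, matching values such as "5/16/23 6:40 AM".
TIME_FORMAT = "%m/%d/%y %I:%M %p"


def job_duration(start: str, end: str) -> str:
    """Return the elapsed time between two table timestamps as h:mm."""
    elapsed = datetime.strptime(end, TIME_FORMAT) - datetime.strptime(start, TIME_FORMAT)
    total_minutes = int(elapsed.total_seconds() // 60)
    hours, minutes = divmod(total_minutes, 60)
    return f"{hours}:{minutes:02d}"


# Test 1, first tenant, and test 7P, second tenant, from the table above.
print(job_duration("5/16/23 6:40 AM", "5/16/23 6:55 AM"))   # 0:15
print(job_duration("5/18/23 9:36 AM", "5/18/23 10:37 AM"))  # 1:01
```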
Comparisons
The following table compares the different configurations. Increasing the RDS instance to db.r6g.2xlarge decreased DI duration by up to 42%. However, applying the recommendations from MODSOURCE-629 (CLONE - Importing 10,000 MARC authority records > Completes with errors due to timeout, Closed) showed no improvement.
| UI | Records | DI duration for Default config, h:mm | DI duration for Instance update, h:mm | Delta, % | DI duration for SRM parameters update, h:mm | Delta, % |
|---|---|---|---|---|---|---|
| | 10k | 0:24 | 0:15 | -37.50% | 0:24 | 0% |
| | 10k | 0:23 | 0:14 | -39.13% | 0:24 | +4.35% |
| | 10k | 0:20 | 0:13 | -35.00% | 0:21 | +5.00% |
| | 25k | 0:58 | 0:50 | -13.79% | 1:00 | +3.45% |
| | 25k | 0:59 | 0:34 | -42.37% | 1:01 | +3.39% |
| | 25k | 0:50 | 0:31 | -38.00% | 0:53 | +6.00% |
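The Delta columns are the relative change of each configuration's duration against the default config. A minimal sketch of that calculation, assuming durations are written as h:mm; the helper names `to_minutes` and `delta_percent` are illustrative only.

```python
def to_minutes(duration: str) -> int:
    """Convert an h:mm duration such as "1:01" to whole minutes."""
    hours, minutes = duration.split(":")
    return int(hours) * 60 + int(minutes)


def delta_percent(baseline: str, candidate: str) -> float:
    """Relative change of candidate vs. baseline, in percent (negative = faster)."""
    base = to_minutes(baseline)
    return round((to_minutes(candidate) - base) / base * 100, 2)


# 25k rows from the comparison table: default vs. instance update,
# and default vs. SRM parameters update.
print(delta_percent("0:59", "0:34"))  # -42.37
print(delta_percent("0:50", "0:53"))  # 6.0
```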
Memory Utilization
Default config
'mod-source-record-storage' and 'mod-inventory' memory utilization was more than 80%.
Instance update
'mod-inventory' Memory utilization was more than 80%
SRM parameters update
'mod-source-record-storage' and 'mod-inventory' memory utilization was more than 80%.
'mod-source-record-manager-import' showed visible waves of memory utilization: up to 48% for multitenant DI 10k, and up to 70% for multitenant DI 25k.
CPU Utilization
Default config
The screenshot above shows the behaviour of typically used DI modules during DI operations.
'mod-data-import' CPU usage started with a spike of up to 250% for multitenant DI (3 tenants - 25k file) and then normalized.
'mod-source-record-manager-import' CPU usage started with a spike of up to ~150% for multitenant DI (3 tenants - 25k file) and then normalized.
'mod-inventory' and 'mod-di-converter' CPU usage occasionally rose to 100% (3 tenants - 10k).
The screenshot above shows the global picture during DI operations. Several 'edge-dematic-b' spikes were also identified; they probably did not affect multitenant DI.
Instance update
The screenshot above shows the behaviour of typically used DI modules during DI operations.
'mod-data-import' CPU usage started with a spike of more than 300% for multitenant DI (3 tenants - 25k file) and then normalized.
'mod-source-record-manager-import' CPU usage started with a spike of up to ~200% for multitenant DI (3 tenants - 25k file) and then normalized.
'mod-inventory' and 'mod-di-converter' CPU usage rose to more than 100% (3 tenants - 10k).
The screenshot above shows the global picture during DI operations. Arrow 1 indicates the anomalous behaviour of 'mod-quick-marc'. Arrow 2 indicates an anomaly for 'edge-dematic-b'.
SRM parameters update
The screenshot above shows the behaviour of typically used DI modules during DI operations.
'mod-data-import' CPU usage started with a spike of up to 300% for multitenant DI (3 tenants - 25k file) and then normalized.
'mod-source-record-manager-import' CPU usage started with a spike of up to ~160% for multitenant DI (3 tenants - 25k file) and then normalized.
'mod-inventory' and 'mod-di-converter' CPU usage occasionally rose to more than 100% (3 tenants - 10k).
The screenshot above shows the global picture during DI operations. The arrow indicates an anomaly that AWS did not identify.
RDS CPU Utilization
Default config
RDS CPU utilization increased up to approx. 95% for the default configuration (db.r6g.xlarge).
Instance update
RDS CPU utilization increased up to approx. 95% for the updated configuration with the RDS instance increased to db.r6g.2xlarge.
SRM parameters update
RDS CPU utilization increased up to approx. 95% for the configuration updated according to the recommendations from MODSOURCE-629 (CLONE - Importing 10,000 MARC authority records > Completes with errors due to timeout, Closed).
Additional information from module and database logs
Instance update
Because of the anomalous 'mod-quick-marc' behaviour, error log messages were analysed for the updated configuration (RDS instance increased to db.r6g.2xlarge) using the following query:
fields @timestamp, @logStream, @message
| filter @logStream not like "okapi"
| filter @logStream not like "edge-dematic"
| filter @logStream not like "mod-login"
| filter @logStream not like "mod-inn-reach"
| filter @logStream not like "mod-authtoken"
| filter @logStream not like "mod-remote-storage"
| filter @logStream not like "mod-source-record-storage"
| filter @message like " ERROR "
| filter @message not like "operator does not exist: uuid = bytea"
| filter @message not like "[${FolioLoggingContext:requestid}] [${FolioLoggingContext:tenantid}] [${FolioLoggingContext:userid}] [${FolioLoggingContext:moduleid}] ERROR LogAccessor Error handler threw an exception"
| filter @message not like "[${FolioLoggingContext:requestid}] [${FolioLoggingContext:tenantid}] [${FolioLoggingContext:userid}] [${FolioLoggingContext:moduleid}] ERROR LogAccessor Backoff FixedBackOff"
| filter @message not like "[${FolioLoggingContext:requestid}] [${FolioLoggingContext:tenantid}] [${FolioLoggingContext:userid}] [${FolioLoggingContext:moduleid}] ERROR syncExceptionHandler Async method [public void org.folio.innreach.domain.service.impl.ContributionActionServiceImpl.handleItemCreation(org.folio.innreach.dto.Item)] throw exception"
| filter @message not like "ERROR FilterApi Permission missing in"
| filter @message not like "ERROR Api Access for user 'mod-innreach'"
| filter @message not like "Error in process with Exception org.springframework.kafka.listener.ListenerExecutionFailedException: Listener method 'public void org.folio.rs.integration.KafkaMessageListener.handleEvents(java.util.List<org.folio.rs.domain.dto.DomainEvent>)' threw exception and the record is org.apache.kafka.clients.consumer.ConsumerRecords"
| filter @message not like "ERROR KafkaConsumerWrapper businessHandlerCompletionHandler"
| filter @message not like "ERROR KafkaConsumerWrapper start:: Error while KafkaConsumerWrapper is working:"
| sort @timestamp desc
| limit 4000
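A query of this shape grows by one `filter` line each time a known error is excluded, so it can be convenient to generate it from lists. The following is a minimal sketch under that assumption; `build_error_query` and its parameters are hypothetical helpers, not part of the original analysis, and the sketch only assembles the query text without running it.

```python
def build_error_query(excluded_streams, excluded_messages, limit=4000):
    """Assemble a log query that keeps ERROR lines and drops already-known noise."""
    lines = ["fields @timestamp, @logStream, @message"]
    # Skip log streams of modules that were already analysed.
    lines += [f'| filter @logStream not like "{stream}"' for stream in excluded_streams]
    lines.append('| filter @message like " ERROR "')
    # Skip error messages that were already counted.
    lines += [f'| filter @message not like "{message}"' for message in excluded_messages]
    lines += ["| sort @timestamp desc", f"| limit {limit}"]
    return "\n".join(lines)


query = build_error_query(
    excluded_streams=["okapi", "edge-dematic", "mod-login"],
    excluded_messages=["operator does not exist: uuid = bytea"],
)
print(query)
```

Each newly classified error is then added to `excluded_messages`, and the regenerated query surfaces only the messages that are still unexplained.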
| Total count of error messages | Mod | Relative count of error messages | Details of error message |
|---|---|---|---|
| 224612 | mod-quick-marc | 112306 | operator does not exist: uuid = bytea |
| | | 101078 | [${FolioLoggingContext:requestid}] [${FolioLoggingContext:tenantid}] [${FolioLoggingContext:userid}] [${FolioLoggingContext:moduleid}] ERROR LogAccessor Error handler threw an exception |
| | | 11228 | [${FolioLoggingContext:requestid}] [${FolioLoggingContext:tenantid}] [${FolioLoggingContext:userid}] [${FolioLoggingContext:moduleid}] ERROR LogAccessor Backoff FixedBackOff |
| 7 | okapi | 7 | ? HTTP response code=404 msg=No suitable module found for path |
| 10 | edge-dematic | 10 | SpringApplication Application run failed |
| 10 | mod-login | 10 | Error verifying user existence: No user found by username stagingDirector |
105000 |