Test multi-tenants DI by tweaking Database Parameters

Overview

Folijet investigated MODSOURCE-581 and determined that multi-tenant concurrent DI job failures are due to exhausting database resources like database connections and CPU and/or memory.

PERF-544 - Getting issue details... STATUS

Summary

  • Multitenant DI operations (3 tenants in parallel) were completed successfully for all tests with different configs and 10k & 25k files in the 'ncp5' PTF Environment.
  • Based on the comparison, the updated configuration by increasing the RDS instance to db.r6g.2xlarge helped to decrease DI duration by up to -42%. However, the updated configuration by MODSOURCE-629 - Getting issue details... STATUS 's recommendations didn't show positive changes (DI duration decreased up to +6%).
  • Based on global test executions behaviour, the updated configuration by increasing the RDS instance to db.r6g.2xlarge and the updated configuration by MODSOURCE-629 - Getting issue details... STATUS 's recommendations didn't help to resolve resource utilization problems (RDS CPU usage was approx. 95% for all tests with different configs and 10k & 25k files). 

Recommendations & Jiras (Optional)

Results

The following table contains information about test executions, configurations, and results.

DB instance#UIDI FileProfileResultStartEndDurationID

db.r6g.xlarge

default

1ptf-ncp5-00.int.aws.folio.org10kPTF - Create 2Completed5/16/23 6:40 AM5/16/23 6:55 AM0:1525973
ptf-ncp5-01.int.aws.folio.org10kPTF - Create 2Completed5/16/23 6:40 AM5/16/23 6:56 AM0:163631
2ptf-ncp5-00.int.aws.folio.org10kPTF - Create 2Completed5/16/23 7:48 AM5/16/23 8:12 AM0:2425974
ptf-ncp5-01.int.aws.folio.org10kPTF - Create 2Completed5/16/23 7:49 AM5/16/23 8:12 AM0:233632
ptf-ncp5-02.int.aws.folio.org10kPTF - Create 2Completed5/16/23 7:49 AM5/16/23 8:09 AM0:20826
3ptf-ncp5-00.int.aws.folio.org25kPTF - Create 2Completed5/16/23 10:35 AM5/16/23 11:33 AM0:5826005
ptf-ncp5-01.int.aws.folio.org25kPTF - Create 2Completed5/16/23 10:35 AM5/16/23 11:34 AM0:593664
ptf-ncp5-02.int.aws.folio.org25kPTF - Create 2Completed5/16/23 10:35 AM5/16/23 11:25 AM0:50892
db.r6g.2xlarge4Uptf-ncp5-00.int.aws.folio.org10kPTF - Create 2Completed5/17/23 6:59 AM5/17/23 7:14 AM0:1526071
ptf-ncp5-01.int.aws.folio.org10kPTF - Create 2Completed5/17/23 6:59 AM5/17/23 7:13 AM0:143697
ptf-ncp5-02.int.aws.folio.org10kPTF - Create 2Completed5/17/23 6:59 AM5/17/23 7:12 AM0:13925
5Uptf-ncp5-00.int.aws.folio.org25kPTF - Create 2Completed5/17/23 7:21 AM5/17/23 8:11 AM0:5026075
ptf-ncp5-01.int.aws.folio.org25kPTF - Create 2Completed5/17/23 7:21 AM5/17/23 7:55 AM0:343698
ptf-ncp5-02.int.aws.folio.org25kPTF - Create 2Completed5/17/23 7:21 AM5/17/23 7:52 AM0:31926

db.r6g.xlarge

default

6Pptf-ncp5-00.int.aws.folio.org10kPTF - Create 2Completed5/18/23 8:34 AM5/18/23 8:58 AM0:2426203
ptf-ncp5-01.int.aws.folio.org10kPTF - Create 2Completed5/18/23 8:34 AM5/18/23 8:58 AM0:243829
ptf-ncp5-02.int.aws.folio.org10kPTF - Create 2Completed5/18/23 8:34 AM5/18/23 8:55 AM0:211057
7Pptf-ncp5-00.int.aws.folio.org25kPTF - Create 2Completed5/18/23 9:36 AM5/18/23 10:36 AM1:0026236
ptf-ncp5-01.int.aws.folio.org25kPTF - Create 2Completed5/18/23 9:36 AM5/18/23 10:37 AM1:013862
ptf-ncp5-02.int.aws.folio.org25kPTF - Create 2Completed5/18/23 9:36 AM5/18/23 10:29 AM0:531090

Comparisons

The following table contains information about the comparison of different configurations. Based on the comparison, the updated configuration by increasing the RDS instance to db.r6g.2xlarge helped to decrease DI duration by up to 42%. However, the updated configuration by MODSOURCE-629 - Getting issue details... STATUS 's recommendations didn't show positive changes.

UIRecords
per file

DI duration for

Default config, min
Baseline

DI duration for

Instance update, min

Delta,
+/-%

DI duration for

SRM parameters update, min

Delta,
+/-%
ptf-ncp5-00.int.aws.folio.org10k0:240:15-37.50%0:240%
ptf-ncp5-01.int.aws.folio.org10k0:230:14-39.13%0:24+4.35%
ptf-ncp5-02.int.aws.folio.org10k0:200:13-35.00%0:21+5.00%
----------------------------------------------------------------------------------------------------------------------------------------
ptf-ncp5-00.int.aws.folio.org25k0:580:50-13.79%1:00+3.45%
ptf-ncp5-01.int.aws.folio.org25k0:590:34-42.37%1:01+3.39%
ptf-ncp5-02.int.aws.folio.org25k0:500:31-38.00%0:53+6.00%

Memory Utilization

Default config

https://us-east-1.console.aws.amazon.com/cloudwatch/home?region=us-east-1#dashboards:name=ncp5_Performance_Dashboard;start=2023-05-16T06:30:00Z;end=2023-05-16T11:40:00Z

'mod-source-record-storage' and 'mod-inventory' Memory utilization were more than 80%

Instance update

https://us-east-1.console.aws.amazon.com/cloudwatch/home?region=us-east-1#dashboards:name=ncp5_Performance_Dashboard;start=2023-05-17T06:55:00Z;end=2023-05-17T08:00:00Z

'mod-inventory' Memory utilization was more than 80%

SRM parameters update

https://us-east-1.console.aws.amazon.com/cloudwatch/home?region=us-east-1#dashboards:name=ncp5_Performance_Dashboard;start=2023-05-18T08:30:00Z;end=2023-05-18T10:45:00Z

'mod-source-record-storage' and 'mod-inventory' Memory utilization were more than 80%

'mod-source-record-manager-import' had visible waves of Memory utilization: up to 48% for multitenant DI 10k, and up to 70% for multitenant DI 25k.

CPU Utilization 

Default config

The screenshot above shows the behaviour of typically used DI modules during DI operations.

'mod-data-import' CPU usage started with some spike up to 250% for multitenant DI (3 tenants - 25k file) but then normalized.

'mod-source-record-manager-import' CPU usage started with some spike up to ~150% for multitenant DI (3 tenants - 25k file) but then normalized.

'mod-inventory' and 'mod-di-converter' CPU usage had rarely increased to 100% (3 tenants - 10k).

The screenshot above shows the global picture during DI operations. Moreover, several 'edge-dematic-b' spikes were identified. Potentially, spikes didn't affect multitenant DI.

Instance update

The screenshot above shows the behaviour of typically used DI modules during DI operations.

'mod-data-import' CPU usage started with some spike up to >300% for multitenant DI (3 tenants - 25k file) but then normalized.

'mod-source-record-manager-import' CPU usage started with some spike up to ~200% for multitenant DI (3 tenants - 25k file) but then normalized.

'mod-inventory' and 'mod-di-converter' CPU usage had increased to more than 100% (3 tenants - 10k).

The screenshot above shows the global picture during DI operations. Arrow 1 indicates the anomaly behaviour of 'mod-quick-marc'. Arrow 2 indicates some anomaly for 'edge-dematic-b'.

SRM parameters update

The screenshot above shows the behaviour of typically used DI modules during DI operations.

'mod-data-import' CPU usage started with some spike up to 300% for multitenant DI (3 tenants - 25k file) but then normalized.

'mod-source-record-manager-import' CPU usage started with some spike up to ~160% for multitenant DI (3 tenants - 25k file) but then normalized.

'mod-inventory' and 'mod-di-converter' CPU usage had rare increasing to more than 100% (3 tenants - 10k).

The screenshot above shows the global picture during DI operations. The arrow indicates some anomaly but it isn't identified by AWS.

RDS CPU Utilization 

Default config

RDS CPU utilization increased up to approx. 95% for the default configuration (db.r6g.xlarge).

Instance update

RDS CPU utilization increased up to approx. 95% for the updated configuration by increasing the RDS instance to db.r6g.2xlarge.

SRM parameters update

RDS CPU utilization increased to approx. 95% for the updated configuration by MODSOURCE-629 - Getting issue details... STATUS 's recommendations.

Additional information from module and database logs 

Instance update

Based on anomaly 'mod-quick-marc' behaviour, the following log error messages analysis for the updated configuration by increasing the RDS instance to db.r6g.2xlarge:

	fields @timestamp, @logStream, @message
		| filter @logStream not like "okapi"
		| filter @logStream not like "edge-dematic"
		| filter @logStream not like "mod-login"
		| filter @logStream not like "mod-inn-reach"
		| filter @logStream not like "mod-authtoken"
		| filter @logStream not like "mod-remote-storage"
		| filter @logStream not like "mod-source-record-storage"
		| filter @message like " ERROR "
        | filter @message not like "operator does not exist: uuid = bytea"
		| filter @message not like "[${FolioLoggingContext:requestid}] [${FolioLoggingContext:tenantid}] [${FolioLoggingContext:userid}] [${FolioLoggingContext:moduleid}] ERROR LogAccessor          Error handler threw an exception"
        | filter @message not like "[${FolioLoggingContext:requestid}] [${FolioLoggingContext:tenantid}] [${FolioLoggingContext:userid}] [${FolioLoggingContext:moduleid}] ERROR LogAccessor          Backoff FixedBackOff"
        | filter @message not like "[${FolioLoggingContext:requestid}] [${FolioLoggingContext:tenantid}] [${FolioLoggingContext:userid}] [${FolioLoggingContext:moduleid}] ERROR syncExceptionHandler Async method [public void org.folio.innreach.domain.service.impl.ContributionActionServiceImpl.handleItemCreation(org.folio.innreach.dto.Item)] throw exception"
		| filter @message not like "ERROR FilterApi            Permission missing in"
		| filter @message not like "ERROR Api                  Access for user 'mod-innreach'"
		| filter @message not like "Error in process with Exception org.springframework.kafka.listener.ListenerExecutionFailedException: Listener method 'public void org.folio.rs.integration.KafkaMessageListener.handleEvents(java.util.List<org.folio.rs.domain.dto.DomainEvent>)' threw exception and the record is org.apache.kafka.clients.consumer.ConsumerRecords"
		| filter @message not like "ERROR KafkaConsumerWrapper businessHandlerCompletionHandler"
		| filter @message not like "ERROR KafkaConsumerWrapper start:: Error while KafkaConsumerWrapper is working:"
		| sort @timestamp desc
		| limit 4000


Total count of

Error messages 

Mod

Relative count of 

error messages

Details of error message
224612

mod-quick-marc

112306operator does not exist: uuid = bytea
101078[${FolioLoggingContext:requestid}] [${FolioLoggingContext:tenantid}] [${FolioLoggingContext:userid}] [${FolioLoggingContext:moduleid}] ERROR LogAccessor          Error handler threw an exception
11228[${FolioLoggingContext:requestid}] [${FolioLoggingContext:tenantid}] [${FolioLoggingContext:userid}] [${FolioLoggingContext:moduleid}] ERROR LogAccessor          Backoff FixedBackOff
7okapi7? HTTP response code=404 msg=No suitable module found for path
10edge-dematic10SpringApplication Application run failed
10mod-login10Error verifying user existence: No user found by username stagingDirector
105000mod-inn-reach105000[${FolioLoggingContext:requestid}] [${FolioLoggingContext:tenantid}] [${FolioLoggingContext:userid}] [${FolioLoggingContext:moduleid}] ERROR syncExceptionHandler Async method [public void org.folio.innreach.domain.service.impl.ContributionActionServiceImpl.handleItemCreation(org.folio.innreach.dto.Item)] throw exception
210000 mod-authtoken 105000ERROR FilterApi            Permission missing in
105000ERROR Api                  Access for user 'mod-innreach'
34674mod-remote-storage34674Error in process with Exception org.springframework.kafka.listener.ListenerExecutionFailedException: Listener method 'public void org.folio.rs.integration.KafkaMessageListener.handleEvents(java.util.List<org.folio.rs.domain.dto.DomainEvent>)' threw exception and the record is org.apache.kafka.clients.consumer.ConsumerRecords
1084 mod-source-record-storage 1044ERROR KafkaConsumerWrapper businessHandlerCompletionHandler
40ERROR KafkaConsumerWrapper start:: Error while KafkaConsumerWrapper is working:
20mod-courses20ERROR CRUtil               Expected status code 200 for GET request to url http://pvt.lb.ncp5.folio-eis.us-east-1:9130/service-points/3a40852d-49fd-4df2-a1f9-6e2641a6e91f, got 404: Not found
@timestamp
1684308244675
edge-caiasoft 2ERROR $LoggingErrorHandler Unexpected error occurred in scheduled task
2ERROR curityManagerService Exception retrieving password


Appendix

Infrastructure

PTF -environment ncp5: 

  • 8 m6i.2xlarge EC2 instances located in US East (N. Virginia)us-east-1 [Number of ECS instances, instance type, location region]
  • 2 instances of db.r6.xlarge database instances: Writer & reader instances
  • MSK ptf-kakfa-3 [ kafka configurations]
    • 4 kafka.m5.2xlarge brokers in 2 zones
    • Apache Kafka version 2.8.0

    • EBS storage volume per broker 300 GiB

    • auto.create.topics.enable=true
    • log.retention.minutes=480
    • default.replication.factor=3


Modules memory and CPU parameters [table of services properties, will be generated with script soon]

Modules

Version

Task Definition

Running Tasks 

CPU

Memory

MemoryReservation

MaxMetaspaceSize

Xmx

mod-data-import2.7.181256204818445121292
mod-di-converter-storage2.0.2521281024896128768
mod-source-record-storage5.6.52421024409636885123076
mod-source-record-manager3.6.21621024409636885123076
mod-inventory-storage26.0.01021024220819523841440
mod-inventory20.0.4821024288025925121814


Methodology/Approach

DI profile 'PTF - Create 2' was used for 3 different tenants.

Default config - RDS db.r6g.xlarge

Instance update - RDS db.r6g.2xlarge

SRM parameters update - updating 'mod-source-record-manager' with the following attributes:

-Ddi.flow.control.reset.state.interval=PT1M -Ddi.flow.control.records.threshold=10  (default value 25)