Test multi-tenants DI by tweaking Database Parameters

Test multi-tenants DI by tweaking Database Parameters

Overview

Folijet investigated MODSOURCE-581 and determined that multi-tenant concurrent DI job failures are due to exhausting database resources like database connections and CPU and/or memory.

PERF-544: Test multi-tenants DI by tweaking Database ParametersClosed

Summary

Recommendations & Jiras (Optional)

Results

The following table contains information about test executions, configurations, and results.

DB instance

#

UI

DI File

Profile

Result

Start

End

Duration

ID

db.r6g.xlarge

default

1

ptf-ncp5-00.int.aws.folio.org

10k

PTF - Create 2

Completed

5/16/23 6:40 AM

5/16/23 6:55 AM

0:15

25973

ptf-ncp5-01.int.aws.folio.org

10k

PTF - Create 2

Completed

5/16/23 6:40 AM

5/16/23 6:56 AM

0:16

3631

2

ptf-ncp5-00.int.aws.folio.org

10k

PTF - Create 2

Completed

5/16/23 7:48 AM

5/16/23 8:12 AM

0:24

25974

ptf-ncp5-01.int.aws.folio.org

10k

PTF - Create 2

Completed

5/16/23 7:49 AM

5/16/23 8:12 AM

0:23

3632

ptf-ncp5-02.int.aws.folio.org

10k

PTF - Create 2

Completed

5/16/23 7:49 AM

5/16/23 8:09 AM

0:20

826

3

ptf-ncp5-00.int.aws.folio.org

25k

PTF - Create 2

Completed

5/16/23 10:35 AM

5/16/23 11:33 AM

0:58

26005

ptf-ncp5-01.int.aws.folio.org

25k

PTF - Create 2

Completed

5/16/23 10:35 AM

5/16/23 11:34 AM

0:59

3664

ptf-ncp5-02.int.aws.folio.org

25k

PTF - Create 2

Completed

5/16/23 10:35 AM

5/16/23 11:25 AM

0:50

892

db.r6g.2xlarge

4U

ptf-ncp5-00.int.aws.folio.org

10k

PTF - Create 2

Completed

5/17/23 6:59 AM

5/17/23 7:14 AM

0:15

26071

ptf-ncp5-01.int.aws.folio.org

10k

PTF - Create 2

Completed

5/17/23 6:59 AM

5/17/23 7:13 AM

0:14

3697

ptf-ncp5-02.int.aws.folio.org

10k

PTF - Create 2

Completed

5/17/23 6:59 AM

5/17/23 7:12 AM

0:13

925

5U

ptf-ncp5-00.int.aws.folio.org

25k

PTF - Create 2

Completed

5/17/23 7:21 AM

5/17/23 8:11 AM

0:50

26075

ptf-ncp5-01.int.aws.folio.org

25k

PTF - Create 2

Completed

5/17/23 7:21 AM

5/17/23 7:55 AM

0:34

3698

ptf-ncp5-02.int.aws.folio.org

25k

PTF - Create 2

Completed

5/17/23 7:21 AM

5/17/23 7:52 AM

0:31

926

db.r6g.xlarge

default

6P

ptf-ncp5-00.int.aws.folio.org

10k

PTF - Create 2

Completed

5/18/23 8:34 AM

5/18/23 8:58 AM

0:24

26203

ptf-ncp5-01.int.aws.folio.org

10k

PTF - Create 2

Completed

5/18/23 8:34 AM

5/18/23 8:58 AM

0:24

3829

ptf-ncp5-02.int.aws.folio.org

10k

PTF - Create 2

Completed

5/18/23 8:34 AM

5/18/23 8:55 AM

0:21

1057

7P

ptf-ncp5-00.int.aws.folio.org

25k

PTF - Create 2

Completed

5/18/23 9:36 AM

5/18/23 10:36 AM

1:00

26236

ptf-ncp5-01.int.aws.folio.org

25k

PTF - Create 2

Completed

5/18/23 9:36 AM

5/18/23 10:37 AM

1:01

3862

ptf-ncp5-02.int.aws.folio.org

25k

PTF - Create 2

Completed

5/18/23 9:36 AM

5/18/23 10:29 AM

0:53

1090

Comparisons

The following table contains information about the comparison of different configurations. Based on the comparison, the updated configuration by increasing the RDS instance to db.r6g.2xlarge helped to decrease DI duration by up to 42%. However, the updated configuration by MODSOURCE-629: CLONE - Importing 10,000 MARC authority records > Completes with errors due to timeout Closed's recommendations didn't show positive changes.

UI

Records
per file

DI duration for

Default config, min
Baseline

DI duration for

Instance update, min

Delta,
+/-%

DI duration for

SRM parameters update, min

Delta,
+/-%

ptf-ncp5-00.int.aws.folio.org

10k

0:24

0:15

-37.50%

0:24

0%

ptf-ncp5-01.int.aws.folio.org

10k

0:23

0:14

-39.13%

0:24

+4.35%

ptf-ncp5-02.int.aws.folio.org

10k

0:20

0:13

-35.00%

0:21

+5.00%

-------------------------------

---------

----------------------

------------------------

---------

---------------------------------

--------

ptf-ncp5-00.int.aws.folio.org

25k

0:58

0:50

-13.79%

1:00

+3.45%

ptf-ncp5-01.int.aws.folio.org

25k

0:59

0:34

-42.37%

1:01

+3.39%

ptf-ncp5-02.int.aws.folio.org

25k

0:50

0:31

-38.00%

0:53

+6.00%

Memory Utilization

Default config

https://us-east-1.console.aws.amazon.com/cloudwatch/home?region=us-east-1#dashboards:name=ncp5_Performance_Dashboard;start=2023-05-16T06:30:00Z;end=2023-05-16T11:40:00Z

'mod-source-record-storage' and 'mod-inventory' Memory utilization were more than 80%

Instance update

https://us-east-1.console.aws.amazon.com/cloudwatch/home?region=us-east-1#dashboards:name=ncp5_Performance_Dashboard;start=2023-05-17T06:55:00Z;end=2023-05-17T08:00:00Z

'mod-inventory' Memory utilization was more than 80%

SRM parameters update

https://us-east-1.console.aws.amazon.com/cloudwatch/home?region=us-east-1#dashboards:name=ncp5_Performance_Dashboard;start=2023-05-18T08:30:00Z;end=2023-05-18T10:45:00Z

'mod-source-record-storage' and 'mod-inventory' Memory utilization were more than 80%

'mod-source-record-manager-import' had visible waves of Memory utilization: up to 48% for multitenant DI 10k, and up to 70% for multitenant DI 25k.

CPU Utilization 

Default config

The screenshot above shows the behaviour of typically used DI modules during DI operations.

'mod-data-import' CPU usage started with some spike up to 250% for multitenant DI (3 tenants - 25k file) but then normalized.

'mod-source-record-manager-import' CPU usage started with some spike up to ~150% for multitenant DI (3 tenants - 25k file) but then normalized.

'mod-inventory' and 'mod-di-converter' CPU usage had rarely increased to 100% (3 tenants - 10k).

The screenshot above shows the global picture during DI operations. Moreover, several 'edge-dematic-b' spikes were identified. Potentially, spikes didn't affect multitenant DI.

Instance update

The screenshot above shows the behaviour of typically used DI modules during DI operations.

'mod-data-import' CPU usage started with some spike up to >300% for multitenant DI (3 tenants - 25k file) but then normalized.

'mod-source-record-manager-import' CPU usage started with some spike up to ~200% for multitenant DI (3 tenants - 25k file) but then normalized.

'mod-inventory' and 'mod-di-converter' CPU usage had increased to more than 100% (3 tenants - 10k).

The screenshot above shows the global picture during DI operations. Arrow 1 indicates the anomaly behaviour of 'mod-quick-marc'. Arrow 2 indicates some anomaly for 'edge-dematic-b'.

SRM parameters update

The screenshot above shows the behaviour of typically used DI modules during DI operations.

'mod-data-import' CPU usage started with some spike up to 300% for multitenant DI (3 tenants - 25k file) but then normalized.

'mod-source-record-manager-import' CPU usage started with some spike up to ~160% for multitenant DI (3 tenants - 25k file) but then normalized.

'mod-inventory' and 'mod-di-converter' CPU usage had rare increasing to more than 100% (3 tenants - 10k).

The screenshot above shows the global picture during DI operations. The arrow indicates some anomaly but it isn't identified by AWS.

RDS CPU Utilization 

Default config

RDS CPU utilization increased up to approx. 95% for the default configuration (db.r6g.xlarge).

Instance update

RDS CPU utilization increased up to approx. 95% for the updated configuration by increasing the RDS instance to db.r6g.2xlarge.

SRM parameters update

RDS CPU utilization increased to approx. 95% for the updated configuration by MODSOURCE-629: CLONE - Importing 10,000 MARC authority records > Completes with errors due to timeout Closed's recommendations.

Additional information from module and database logs 

Instance update

Based on anomaly 'mod-quick-marc' behaviour, the following log error messages analysis for the updated configuration by increasing the RDS instance to db.r6g.2xlarge:

fields @timestamp, @logStream, @message | filter @logStream not like "okapi" | filter @logStream not like "edge-dematic" | filter @logStream not like "mod-login" | filter @logStream not like "mod-inn-reach" | filter @logStream not like "mod-authtoken" | filter @logStream not like "mod-remote-storage" | filter @logStream not like "mod-source-record-storage" | filter @message like " ERROR " | filter @message not like "operator does not exist: uuid = bytea" | filter @message not like "[${FolioLoggingContext:requestid}] [${FolioLoggingContext:tenantid}] [${FolioLoggingContext:userid}] [${FolioLoggingContext:moduleid}] ERROR LogAccessor Error handler threw an exception" | filter @message not like "[${FolioLoggingContext:requestid}] [${FolioLoggingContext:tenantid}] [${FolioLoggingContext:userid}] [${FolioLoggingContext:moduleid}] ERROR LogAccessor Backoff FixedBackOff" | filter @message not like "[${FolioLoggingContext:requestid}] [${FolioLoggingContext:tenantid}] [${FolioLoggingContext:userid}] [${FolioLoggingContext:moduleid}] ERROR syncExceptionHandler Async method [public void org.folio.innreach.domain.service.impl.ContributionActionServiceImpl.handleItemCreation(org.folio.innreach.dto.Item)] throw exception" | filter @message not like "ERROR FilterApi Permission missing in" | filter @message not like "ERROR Api Access for user 'mod-innreach'" | filter @message not like "Error in process with Exception org.springframework.kafka.listener.ListenerExecutionFailedException: Listener method 'public void org.folio.rs.integration.KafkaMessageListener.handleEvents(java.util.List<org.folio.rs.domain.dto.DomainEvent>)' threw exception and the record is org.apache.kafka.clients.consumer.ConsumerRecords" | filter @message not like "ERROR KafkaConsumerWrapper businessHandlerCompletionHandler" | filter @message not like "ERROR KafkaConsumerWrapper start:: Error while KafkaConsumerWrapper is working:" | sort @timestamp desc | limit 4000

 

Total count of

Error messages 

Mod

Relative count of 

error messages

Details of error message

Total count of

Error messages 

Mod

Relative count of 

error messages

Details of error message

224612

mod-quick-marc

112306

operator does not exist: uuid = bytea

101078

[${FolioLoggingContext:requestid}] [${FolioLoggingContext:tenantid}] [${FolioLoggingContext:userid}] [${FolioLoggingContext:moduleid}] ERROR LogAccessor          Error handler threw an exception

11228

[${FolioLoggingContext:requestid}] [${FolioLoggingContext:tenantid}] [${FolioLoggingContext:userid}] [${FolioLoggingContext:moduleid}] ERROR LogAccessor          Backoff FixedBackOff

7

okapi

7

? HTTP response code=404 msg=No suitable module found for path

10

edge-dematic

10

SpringApplication Application run failed

10

mod-login

10

Error verifying user existence: No user found by username stagingDirector

105000