PTF - DI testing for Cornell (mod-srs, mod-srm connections)

Overview

This workflow checks the performance of Data Import (DI) for Cornell. The tests mimic Cornell's load with an increased number of DB connections in mod-srs and mod-srm. Jira ticket: PERF-555.


These tests were run in the PTF environment, in the cptf2 cluster.

The following changes were made to the environment based on Cornell's load requirements:

1. Allow up to 500 DB connections for mod-srm and mod-srs.

2. Run the database on each of the following configurations: db.r6g.xlarge, db.r6g.2xlarge, db.r6g.4xlarge, db.r6g.8xlarge, Serverless v2 (0.5 - 128 ACUs), Serverless v2 (32 - 128 ACUs), Serverless v2 (16 - 96 ACUs), Serverless v2 (8 - 96 ACUs).

Summary


Data Import with the profile "EBSCO ebooks new and updated" puts a very high load on the database.

Increasing the number of connections for mod-srs and mod-srm to 500 (or even to 100) severely increases database load and resource utilization.

One reason is the sheer size of the database: for example, the table mod_source_record_storage.marc_indexers contains 1,028,022,472 records.

Data Import with the profile "EBSCO ebooks new and updated" ran successfully on the database sizes db.r6g.2xlarge, db.r6g.4xlarge, and db.r6g.8xlarge. The duration of the data import job decreases as the database size increases.

Increasing the number of connections for mod-srs and mod-srm to 500 (or even to 100) also causes database memory usage errors.

Test Runs

| Aurora PostgreSQL Changes | DI duration with 30 mod-srs & mod-srm connections | Status | DI duration with 500 mod-srs & mod-srm connections | ACU | Average ACUs per test | Mem (GiB) | Price per hour $ | Baseline price per month $ | Additional expenses |
|---|---|---|---|---|---|---|---|---|---|
| Serverless v2 (0.5 - 128 ACUs) | 1 hour 3 min | Completed with errors | stopped due to DB error | 0.5 | 84 | 1 | 0.09 (per 0.5 ACU) | 64.80 | + Price per additionally used ACUs |
| Serverless v2 (32 - 128 ACUs) | 34 min | Completed | 20 min hanging, 0% completed, DB error | 32 | 99 | 64 | 0.18 (per 1 ACU) | 4,147.20 | + Price per additionally used ACUs |
| db.r6g.8xlarge | 18 min | Completed | stopped due to DB error | - | - | 256 | 3.597 | 2,589.84 | No additional expenses |
| db.r6g.4xlarge | 29 min | Completed | stopped due to DB error | - | - | 128 | 1.798 | 1,294.56 | No additional expenses |
| db.r6g.2xlarge | 50 min | Completed | stopped due to DB error | - | - | 64 | 0.889 | 640.00 | No additional expenses |
| db.r6g.xlarge | 2 hours (20% done) | Could be completed with errors due to DB error | stopped due to DB error | - | - | 32 | 0.45 | 324.00 | No additional expenses |
| Serverless v2 (16 - 96 ACUs) | 45 min | Completed | - | 16 | 63 | 32 | 0.18 (per 1 ACU) | 2,073.60 | + Price per additionally used ACUs |
| Serverless v2 (8 - 96 ACUs) | 36 min (25 min when started right after the first one) | Completed | - | 8 | 83 | 16 | 0.18 (per 1 ACU) | 1,036.80 | + Price per additionally used ACUs |

Note that with no activity in the environment, the database averages approximately 4 ACUs. Every job, test, or other activity adds to that ACU utilization.
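The "Price per additionally used ACUs" entries in the table can be approximated from the average ACUs per test. The sketch below is only an illustration: it assumes the reported average holds for the whole run, that only ACUs above the configured minimum count as additional expense, and that the price is the approximate $0.18 per ACU-hour used in this report. The function name `extra_acu_cost` is hypothetical.

```python
def extra_acu_cost(avg_acus, min_acus, duration_min, price_per_acu_hour=0.18):
    """Rough cost (USD) of ACUs used above the configured minimum during one run.

    Assumptions (not from the test data itself):
    - avg_acus is the average capacity over the whole run;
    - capacity at or below min_acus is already covered by the baseline price.
    """
    extra_acus = max(avg_acus - min_acus, 0)  # never negative
    duration_hours = duration_min / 60
    return extra_acus * duration_hours * price_per_acu_hour


if __name__ == "__main__":
    # Values taken from the table: (configuration, average ACUs, minimum ACUs, minutes)
    runs = [
        ("Serverless v2 (32 - 128 ACUs)", 99, 32, 34),
        ("Serverless v2 (16 - 96 ACUs)", 63, 16, 45),
        ("Serverless v2 (8 - 96 ACUs)", 83, 8, 36),
    ]
    for name, avg, minimum, minutes in runs:
        print(f"{name}: ~${extra_acu_cost(avg, minimum, minutes):.2f} above baseline")
```

The result is an estimate per DI job only; actual billing depends on the regional ACU price and on how capacity changes during the run.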

Explanation of the table above:

$0.18 per ACU-hour is the approximate average price for Serverless v2.

The price depends on the AWS region: $0.12 for US East (N. Virginia), $0.25 for South America (Sao Paulo).

When the DB is already scaled up, DI duration decreases (for example, from 36 to 25 min for Serverless v2 (8 - 96 ACUs) when the job was started right after the first one).

To calculate the database price more easily, use https://calculator.aws/#/addService/AuroraPostgreSQL. Enter the average ACU utilization per month to get more accurate results.
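The baseline monthly prices in the table appear to follow from the hourly price multiplied by 720 hours (a 30-day month). The sketch below reproduces that arithmetic. The 720-hour month is an assumption inferred from the table values, the 2 GiB of memory per ACU is the approximate figure from AWS documentation, and the function names are hypothetical.

```python
HOURS_PER_MONTH = 720   # assumption: 30 days x 24 hours, inferred from the table
GIB_PER_ACU = 2         # approximate memory per ACU for Aurora Serverless v2


def provisioned_monthly_price(price_per_hour):
    """Baseline monthly price (USD) for a provisioned instance."""
    return price_per_hour * HOURS_PER_MONTH


def serverless_baseline_monthly_price(min_acus, price_per_acu_hour=0.18):
    """Baseline monthly price (USD) for Serverless v2 held at its minimum capacity."""
    return min_acus * price_per_acu_hour * HOURS_PER_MONTH


def serverless_memory_gib(acus):
    """Approximate memory (GiB) available at a given ACU capacity."""
    return acus * GIB_PER_ACU


if __name__ == "__main__":
    print(f"db.r6g.8xlarge: ${provisioned_monthly_price(3.597):,.2f}")
    print(f"Serverless v2, min 32 ACUs: ${serverless_baseline_monthly_price(32):,.2f}")
    print(f"Memory at 32 ACUs: {serverless_memory_gib(32)} GiB")
```

Under the same assumption, 0.889 × 720 gives 640.08 for db.r6g.2xlarge, so the 640.00 listed in the table looks rounded.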

Observations

Import jobs cannot be run in parallel. All jobs run one at a time.

RDS Resource utilization

Serverless v2 (8,16 - 96 ACUs)


The gaps in these graphs correspond to the DB issues that occurred when the connections for mod-srs and mod-srm were increased to 500.


Appendix


Infrastructure

Record counts:

  • mod_source_record_storage.marc_indexers = 1028022472

PTF environment cptf2

  • 6 m6i.2xlarge EC2 instances located in US East (N. Virginia), us-east-1
  • 1 database instance (one writer only)
  • MSK ptf-kakfa-3
    • 4 m5.2xlarge brokers in 2 zones
    • Apache Kafka version 2.8.0
    • EBS storage volume per broker: 300 GiB
    • auto.create.topics.enable=true
    • log.retention.minutes=480
    • default.replication.factor=3
    • 2 partitions per DI topic


| Module (cptf2-pvt) | Task Def. Revision | Module Version | Task Count | Mem Hard Limit | Mem Soft Limit | CPU units | Xmx | MetaspaceSize | MaxMetaspaceSize | R/W split enabled |
|---|---|---|---|---|---|---|---|---|---|---|
| mod-data-import | 5 | 579891902283.dkr.ecr.us-east-1.amazonaws.com/folio/mod-data-import:2.7.1 | 1 | 2048 | 1844 | 256 | 1292 | 384 | 512 | false |
| mod-authtoken | 5 | 579891902283.dkr.ecr.us-east-1.amazonaws.com/folio/mod-authtoken:2.13.0 | 2 | 1440 | 1152 | 512 | 922 | 88 | 128 | false |
| mod-inventory-update | 5 | 579891902283.dkr.ecr.us-east-1.amazonaws.com/folio/mod-inventory-update:3.0.1 | 2 | 1024 | 896 | 128 | 768 | 88 | 128 | false |
| mod-configuration | 5 | 579891902283.dkr.ecr.us-east-1.amazonaws.com/folio/mod-configuration:5.9.1 | 2 | 1024 | 896 | 128 | 768 | 88 | 128 | false |
| mod-inventory-storage | 5 | 579891902283.dkr.ecr.us-east-1.amazonaws.com/folio/mod-inventory-storage:26.0.0 | 2 | 4096 | 3690 | 2048 | 3076 | 384 | 512 | false |
| mod-source-record-storage | 18-22 | 579891902283.dkr.ecr.us-east-1.amazonaws.com/folio/mod-source-record-storage:5.6.10 | 2 | 5600 | 5000 | 2048 | 3500 | 384 | 512 | false |
| mod-inventory | 7 | 579891902283.dkr.ecr.us-east-1.amazonaws.com/folio/mod-inventory:20.0.6 | 2 | 2880 | 2592 | 1024 | 1814 | 384 | 512 | false |
| mod-di-converter-storage | 6 | 579891902283.dkr.ecr.us-east-1.amazonaws.com/folio/mod-di-converter-storage:2.0.5 | 2 | 1024 | 896 | 128 | 768 | 88 | 128 | false |
| mod-source-record-manager | 14-21 | 579891902283.dkr.ecr.us-east-1.amazonaws.com/folio/mod-source-record-manager:3.6.4 | 2 | 5600 | 5000 | 2048 | 3500 | 384 | 512 | false |
| nginx-edge | 10 | 579891902283.dkr.ecr.us-east-1.amazonaws.com/folio/nginx-edge:2023.06.14 | 2 | 1024 | 896 | 128 | 0 | 0 | 0 | false |
| mod-quick-marc | 5 | 579891902283.dkr.ecr.us-east-1.amazonaws.com/folio/mod-quick-marc:3.0.0 | 1 | 2288 | 2176 | 128 | 1664 | 384 | 512 | false |
| nginx-okapi | 10 | 579891902283.dkr.ecr.us-east-1.amazonaws.com/folio/nginx-okapi:2023.06.14 | 2 | 1024 | 896 | 128 | 0 | 0 | 0 | false |
| okapi-b | 12 | 579891902283.dkr.ecr.us-east-1.amazonaws.com/folio/okapi:5.0.1 | 3 | 1684 | 1440 | 1024 | 922 | 384 | 512 | false |
| pub-okapi | 10 | 579891902283.dkr.ecr.us-east-1.amazonaws.com/folio/pub-okapi:2023.06.14 | 2 | 1024 | 896 | 128 | 768 | 0 | 0 | false |
| mod-data-import-converter-storage | 1 | 579891902283.dkr.ecr.us-east-1.amazonaws.com/folio/mod-data-import-converter-storage:1.15.2 | 2 | 1024 | 896 | 128 | 768 | 88 | 128 | false |