ECS [Ramsons] Migrate/Update Large number BIB records


Overview

In production, when mapping rules are changed, the migration process needs to be run using the mod-marc-migrations module API. The whole process consists of two stages: data mapping and data saving.
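As a rough illustration of how this two-stage flow can be driven, below is a minimal sketch of calling the module through Okapi. The endpoint paths, payload fields, status values, tenant and token are assumptions made for illustration and should be checked against the actual mod-marc-migrations API documentation.

    # Hypothetical sketch of the two-stage marc-migration flow via Okapi.
    # Endpoint paths, payload fields and status names are assumptions based on
    # the flow described above; verify against the mod-marc-migrations API docs.
    import time

    import requests

    OKAPI_URL = "https://okapi.example.org"        # placeholder Okapi URL
    HEADERS = {
        "x-okapi-tenant": "cs00000int",            # example tenant id
        "x-okapi-token": "<token>",                # placeholder token
        "Content-Type": "application/json",
    }

    # Stage 1: start the data mapping operation for instance (bib) records.
    resp = requests.post(f"{OKAPI_URL}/marc-migrations",
                         json={"entityType": "instance", "operationType": "remapping"},
                         headers=HEADERS)
    resp.raise_for_status()
    operation_id = resp.json()["id"]

    # Wait until the mapping stage is finished before triggering saving.
    while True:
        operation = requests.get(f"{OKAPI_URL}/marc-migrations/{operation_id}",
                                 headers=HEADERS).json()
        if operation["status"] != "DATA_MAPPING":
            break
        time.sleep(30)

    # Stage 2: trigger the data saving stage for the mapped records.
    requests.put(f"{OKAPI_URL}/marc-migrations/{operation_id}",
                 json={"status": "DATA_SAVING"},
                 headers=HEADERS).raise_for_status()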

The goal of this round of testing is to observe and measure the duration of the marc-migration process in an ECS environment, run from both the central tenant and a member tenant, and to find any possible issues or performance problems.

Tests were done with two parameter sets (CHUNK_FETCH_IDS_COUNT, RECORDS_CHUNK_SIZE). The first tests used the optimal parameters defined in the authority migration report (12000 and 4000 respectively). The second tests used the default parameters (500 and 500).
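To give a feel for what these two parameter sets mean at this data volume, the back-of-the-envelope calculation below compares the resulting batch counts. It assumes RECORDS_CHUNK_SIZE is the number of records written per save batch and CHUNK_FETCH_IDS_COUNT the number of record IDs fetched per query; this is an interpretation of the parameter names, not something stated in this report.

    # Rough batch-count arithmetic for the two parameter sets, assuming
    # RECORDS_CHUNK_SIZE = records written per save batch and
    # CHUNK_FETCH_IDS_COUNT = record ids fetched per query (an interpretation
    # of the parameter names, not confirmed by this report).
    import math

    total_records = 1_160_063  # mapped records on the central tenant (tests 1 and 2)

    for fetch_ids, chunk_size in [(12_000, 4_000), (500, 500)]:
        id_fetches = math.ceil(total_records / fetch_ids)
        save_batches = math.ceil(total_records / chunk_size)
        print(f"fetch_ids={fetch_ids}, chunk_size={chunk_size}: "
              f"~{id_fetches} id fetch queries, ~{save_batches} save batches")

With 4000-record batches each save transaction is roughly eight times larger than with the default 500, so fewer but much longer transactions are issued, which lines up with the long-running batch updates hypothesized in the Summary below.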

Original ticket: https://folio-org.atlassian.net/browse/PERF-1045

Summary

  • Tests performed on two tenants (central and member).

  • All data mapping processes finished successfully.

    • No issues or errors found during data mapping process.

    • Data mapping consumes up to 25% CPU on the mod-marc-migrations service and up to 20-22% of DB CPU.

  • Data saving success rates varied across tests; the tests on the central and member tenants with “CHUNK_FETCH_IDS_COUNT = 500, RECORDS_CHUNK_SIZE = 500” saved the highest number of instances (89% for the central tenant, 99% for the member tenant).

    • Records failed to save because of an issue found in the DB logs: optimistic locking.

    • During the data saving process, runtime reindexing (mod-search) runs in the background, causing high DB load for a long time (more than 1.5 hours) even after data saving is completed.

    • Deadlocks were observed in the DB; however, they do not cause any issues with the marc-migration process, as these deadlocks happen on mod-search schemas.

    • During data saving, CPU usage was 5% on mod-marc-migrations and 3% on mod-inventory-storage.

    • DB CPU reached 100% and stayed high even after the process finished due to mod-search runtime reindexing.

    • No signs of memory leaks were found during any of the tests.

  • The test results clearly show a low data saving success rate with high values of the CHUNK_FETCH_IDS_COUNT and RECORDS_CHUNK_SIZE parameters. mod-marc-migrations is probably failing to save data because of the high DB CPU load caused by mod-search during data saving: the batch update transaction likely takes too long to complete, triggering the “Apparent connection leak detected” warning. mod-marc-migrations then retries the same batch and gets optimistic locking errors for records that were already saved by that batch in the previous attempt.

Recommendations & Jiras (Optional)

 

Test Runs/Result

 

| Test # | Test Conditions | Mapped records | Mapping duration | Mapping status | Saved records | Saving duration | Saving status | Notes (Optional) |
|--------|-----------------|----------------|------------------|----------------|---------------|-----------------|---------------|------------------|
| 1 | CHUNK_FETCH_IDS_COUNT = 12000, RECORDS_CHUNK_SIZE = 4000 | 1160063 | 10 min 10 s | success | 4063 | 1 hr 30 min | failed | central tenant |
| 2 | CHUNK_FETCH_IDS_COUNT = 500 (default), RECORDS_CHUNK_SIZE = 500 (default) | 1160063 | 17 min | success | 1043563 | 3 hr 46 min | failed | central tenant |
| 3 | CHUNK_FETCH_IDS_COUNT = 500 (default), RECORDS_CHUNK_SIZE = 500 (default) | 1460010* | 23 min | success | 1459010 | 2 hr 33 min | failed | member tenant |
| 4 | CHUNK_FETCH_IDS_COUNT = 12000, RECORDS_CHUNK_SIZE = 4000 | 1460010* | 18 min | success | 8010 | 1 hr 31 min | failed | member tenant |

Note: in tests 3 and 4 a different number of instances was mapped, as a few data imports were performed in between the tests.
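For reference, the saving success rates quoted in the Summary follow directly from the table: roughly 1043563 / 1160063 ≈ 90% of mapped records were saved for the central tenant (test 2) and 1459010 / 1460010 ≈ 99.9% for the member tenant (test 3), compared to only 4063 / 1160063 ≈ 0.4% and 8010 / 1460010 ≈ 0.5% with the larger chunk sizes (tests 1 and 4).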

Memory Utilization

Memory utilization across all modules was stable, with no memory leaks observed. The most heavily used module, mod-marc-migrations, experienced an increase in memory usage from 25% to approximately 80% during testing with 'Optimal configurations.' However, after the process was completed, memory usage stabilized and returned to normal levels.

image-20250212-102117.png

 

CPU Utilization 

CPU utilization during the data-saving process was stable and returned to normal levels after the test was completed. The most utilized service was mod-consortia, with a peak usage of approximately 4.5%. The entire process, including reindexing by mod-search, took approximately 4 hours. With 'Optimal configurations,' the data-saving process was faster by about 1 hour and 30 minutes compared to the default configuration, where it took 2 hours and 40 minutes.

image-20250212-102458.png

Here is the CPU usage of the mod-marc-migrations module during the data-mapping process. It is evident that mod-marc-migrations can handle high loads and return to normal resource usage after the process is completed.

Note: Once the data-mapping process is finished, the CPU usage of mod-marc-migrations becomes less noticeable, as most of the logic for the next step (data saving) occurs on the database side.

 

RDS CPU Utilization 

DB CPU usage reached 100% during data saving.

NOTE: after data saving finished (see the marks on the screenshot), runtime reindexing continued and occupied more than 90% of DB CPU for additional time.

 

image-20250212-102939.png

During the data saving process a lot of deadlocks were found. These deadlocks happen on the mod-search side and do not affect the data saving process.

image-20250212-104656.png

 

image-20250212-105212.png
image-20250212-105333.png

 

image-20250212-105527.png
image-20250212-105637.png
image-20250212-110424.png

Errors

mod-marc-migrations error:

13:46:25 [] [] [] [] WARN ProxyLeakTask Connection leak detection triggered for org.postgresql.jdbc.PgConnection@7a388818 on thread chunksProcessingExecutor-5, stack trace follows
java.lang.Exception: Apparent connection leak detected
    at com.zaxxer.hikari.HikariDataSource.getConnection(HikariDataSource.java:127) ~[HikariCP-5.1.0.jar!/:?]
    at org.folio.spring.config.DataSourceFolioWrapper.getConnection(DataSourceFolioWrapper.java:46) ~[folio-spring-base-8.2.0.jar!/:8.2.0]
    at org.hibernate.engine.jdbc.connections.internal.DatasourceConnectionProviderImpl.getConnection(DatasourceConnectionProviderImpl.java:122) ~[hibernate-core-6.5.3.Final.jar!/:6.5.3.Final]
    at org.hibernate.internal.NonContextualJdbcConnectionAccess.obtainConnection(NonContextualJdbcConnectionAccess.java:46) ~[hibernate-core-6.5.3.Final.jar!/:6.5.3.Final]
    at org.hibernate.resource.jdbc.internal.LogicalConnectionManagedImpl.acquireConnectionIfNeeded(LogicalConnectionManagedImpl.java:126) ~[hibernate-core-6.5.3.Final.jar!/:6.5.3.Final]
    at org.hibernate.resource.jdbc.internal.LogicalConnectionManagedImpl.getPhysicalConnection(LogicalConnectionManagedImpl.java:156) ~[hibernate-core-6.5.3.Final.jar!/:6.5.3.Final]
    at org.hibernate.resource.jdbc.internal.LogicalConnectionManagedImpl.getConnectionForTransactionManagement(LogicalConnectionManagedImpl.java:286) ~[hibernate-core-6.5.3.Final.jar!/:6.5.3.Final]
    at org.hibernate.resource.jdbc.internal.LogicalConnectionManagedImpl.begin(LogicalConnectionManagedImpl.java:294) ~[hibernate-core-6.5.3.Final.jar!/:6.5.3.Final]
    at org.hibernate.resource.transaction.backend.jdbc.internal.JdbcResourceLocalTransactionCoordinatorImpl$TransactionDriverControlImpl.begin(JdbcResourceLocalTransactionCoordinatorImpl.java:232) ~[hibernate-core-6.5.3.Final.jar!/:6.5.3.Final]
    at org.hibernate.engine.transaction.internal.TransactionImpl.begin(TransactionImpl.java:83) ~[hibernate-core-6.5.3.Final.jar!/:6.5.3.Final]
    at org.springframework.orm.jpa.vendor.HibernateJpaDialect.beginTransaction(HibernateJpaDialect.java:176) ~[spring-orm-6.1.14.jar!/:6.1.14]
    at org.springframework.orm.jpa.JpaTransactionManager.doBegin(JpaTransactionManager.java:420) ~[spring-orm-6.1.14.jar!/:6.1.14]
    at org.springframework.transaction.support.AbstractPlatformTransactionManager.startTransaction(AbstractPlatformTransactionManager.java:532) ~[spring-tx-6.1.14.jar!/:6.1.14]
    at org.springframework.transaction.support.AbstractPlatformTransactionManager.getTransaction(AbstractPlatformTransactionManager.java:405) ~[spring-tx-6.1.14.jar!/:6.1.14]
    at org.springframework.transaction.support.TransactionTemplate.execute(TransactionTemplate.java:137) ~[spring-tx-6.1.14.jar!/:6.1.14]
    at org.springframework.batch.core.step.tasklet.TaskletStep$2.doInChunkContext(TaskletStep.java:255) ~[spring-batch-core-5.1.2.jar!/:5.1.2]
    at org.springframework.batch.core.scope.context.StepContextRepeatCallback.doInIteration(StepContextRepeatCallback.java:82) ~[spring-batch-core-5.1.2.jar!/:5.1.2]
    at org.springframework.batch.repeat.support.TaskExecutorRepeatTemplate$ExecutingRunnable.run(TaskExecutorRepeatTemplate.java:261) ~[spring-batch-infrastructure-5.1.2.jar!/:5.1.2]
    at org.folio.spring.scope.FolioExecutionScopeExecutionContextManager.lambda$getRunnableWithFolioContext$0(FolioExecutionScopeExecutionContextManager.java:94) ~[folio-spring-base-8.2.0.jar!/:8.2.0]
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) [?:?]
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) [?:?]
    at java.base/java.lang.Thread.run(Unknown Source) [?:?]
13:46:25 [] [] [] [] WARN BulkStorageService Failed to save entities with type INSTANCE specified with file path: operation/4db21b1f-2354-42ec-a78d-275515a37e42/f54f9418-dd94-48ef-91f5-15c1daf1d2df_entity
13:46:25 [] [] [] [] INFO ProxyLeakTask Previously reported leaked connection org.postgresql.jdbc.PgConnection@7a388818 on thread chunksProcessingExecutor-5 was returned to the pool (unleaked)
13:46:26 [184008/marc-migrations] [cs00000int_0001] [0f4eb7c2-05dd-4559-81ae-0c8e8cb95bc1] [mod-marc-migrations] WARN ProxyLeakTask Connection leak detection triggered for org.postgresql.jdbc.PgConnection@67906bca on thread chunksProcessingExecutor-6, stack trace follows
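A note on reading this log (general HikariCP behaviour, not something measured in this test): the “Apparent connection leak detected” warning is emitted when a connection has been checked out of the pool for longer than Hikari's configured leak detection threshold, and the later “was returned to the pool (unleaked)” message shows the connection eventually came back. In other words, the warning here most likely indicates a save transaction that ran longer than the threshold rather than a genuinely leaked connection, which fits the hypothesis in the Summary that long-running batch updates lead to retries and optimistic locking failures.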

DB Errors:

2025-02-12 13:25:12 UTC:10.23.34.46(47780):cs00000int_0001_mod_inventory_storage@folio:[25207]:ERROR: Cannot update record 380e8ef5-1f80-465f-bebd-7e736c068329 because it has been changed (optimistic locking): Stored _version is 4, _version of request is 3
2025-02-12 13:25:12 UTC:10.23.34.46(47780):cs00000int_0001_mod_inventory_storage@folio:[25207]:CONTEXT: PL/pgSQL function instance_set_ol_version() line 15 at RAISE
SQL statement "UPDATE cs00000int_0001_mod_inventory_storage.instance SET jsonb=$3 WHERE id=$2 RETURNING id"
PL/pgSQL function upsert(text,uuid,anyelement) line 5 at EXECUTE
2025-02-12 16:12:50 UTC:10.23.34.21(54784):folio@folio:[22461]:ERROR: deadlock detected
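The first error matches the optimistic locking mechanism visible in the log itself: each instance carries a _version, and the instance_set_ol_version() trigger rejects an update whose _version no longer matches the stored one. If a failing batch is retried, the records that were already updated in the first attempt have an incremented _version, so the retry is rejected with exactly this error, which would explain the partial save counts and the "failed" statuses in the result table. The deadlock error, as noted in the Summary, occurs on mod-search schemas and does not affect the migration itself.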

Appendix

Infrastructure

PTF environment: rcon

  • 11 m6g.2xlarge EC2 instances located in US East (N. Virginia), us-east-1

  • db.r6.xlarge database instances, writer

  • MSK fse-test

    • 4 kafka.m7g.xlarge brokers in 2 zones

    • Apache Kafka version 3.7.x (KRaft mode)

    • EBS storage volume per broker 300 GiB

    • auto.create.topics.enable=true

    • log.retention.minutes=480

    • default.replication.factor=3

  • OpenSearch 2.13 ptf-test cluster

    • r6g.2xlarge.search 4 data nodes

    • r6g.large.search 3 dedicated master nodes

Cluster Resources - rcon