ECS [Ramsons] Migrate/Update Large Number of BIB Records
Overview
In production, when mapping rules are changed, the migration process needs to be run using the mod-marc-migration module API. The whole process consists of two stages: data mapping and data saving.
The goal of this round of testing is to observe and measure the duration of the marc-migration process in an ECS environment from the central tenant and member tenants, and to find any possible issues or performance problems.
Tests were run with two parameter sets (CHUNK_FETCH_IDS_COUNT, RECORDS_CHUNK_SIZE). The first tests used the optimal parameters defined in the report for authority migration (12000 and 4000, respectively). The second tests used the default parameters (500 and 500).
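For context, both stages are driven through the module's REST API, while the chunking parameters are supplied as configuration of the module itself. The snippet below is only a minimal sketch of how such a run could be scripted; the endpoint paths, payload fields, host, tenant, and token are assumptions and should be verified against the module's API documentation.

import requests

OKAPI_URL = "https://folio-rcon.example.org"   # placeholder host
HEADERS = {
    "x-okapi-tenant": "cs00000int",            # placeholder tenant
    "x-okapi-token": "<token>",
    "Content-Type": "application/json",
}

# Stage 1: start data mapping for MARC bib records (assumed endpoint and payload).
resp = requests.post(f"{OKAPI_URL}/marc-migrations",
                     json={"operationType": "remapping", "entityType": "marc-bib"},
                     headers=HEADERS)
operation_id = resp.json()["id"]

# Stage 2: once mapping reports completion, trigger data saving for the same operation
# (assumed payload; CHUNK_FETCH_IDS_COUNT and RECORDS_CHUNK_SIZE are environment
# variables of the module's task definition, not request parameters).
requests.put(f"{OKAPI_URL}/marc-migrations/{operation_id}",
             json={"status": "DATA_MAPPING_COMPLETED"},
             headers=HEADERS)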
Original ticket: https://folio-org.atlassian.net/browse/PERF-1045
Summary
Tests performed on two tenants (central and member).
All data mapping processes finished successfully.
No issues or errors found during data mapping process.
Data mapping consumes up to 25% CPU on the mod-marc-migration service and up to 20-22% of DB CPU.
Success rates for data saving differed between tests; however, the tests for the central and member tenants with parameters "CHUNK_FETCH_IDS_COUNT = 500, RECORDS_CHUNK_SIZE = 500" saved the highest number of instances (89% for the central tenant, 99% for the member tenant).
Records failed to save because of an issue found in the DB logs: optimistic locking.
During the data saving process, runtime reindexing happens in the background (mod-search), causing high DB load for a long time (more than 1.5 hours) after data saving is completed.
Deadlocks were observed in the DB; however, they do not cause any issues with the marc-migration process, as these deadlocks happen on mod-search schemas.
During data saving, CPU usage was 5% on mod-marc-migration and 3% on mod-inventory-storage.
DB CPU reached 100% and stayed high even after the process finished due to mod-search runtime reindexing.
No signs of memory leaks were found during any of the tests.
The test results clearly show a low success rate for data saving with high values of the CHUNK_FETCH_IDS_COUNT and RECORDS_CHUNK_SIZE parameters. mod-marc-migration is probably failing to save data because of the high DB CPU load caused by mod-search during data saving: the batch update transaction likely takes too long to complete and the "Apparent connection leak detected" issue occurs. mod-marc-migration then retries the same batch and gets optimistic locking errors back for records that were already saved by the current batch in the previous attempt.
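A toy sketch (not the module's actual code) of how a retried batch runs into optimistic locking: records saved by the first, interrupted attempt already have their stored _version bumped, so re-sending the same batch reproduces the error seen in the DB logs.

class OptimisticLockError(Exception):
    pass

stored_versions = {"rec-1": 3, "rec-2": 3}   # record id -> stored _version
batch = [("rec-1", 3), ("rec-2", 3)]         # updates built against _version 3

def save(record_id, request_version):
    # Mimics the optimistic-locking check: reject updates built against a stale version.
    if stored_versions[record_id] != request_version:
        raise OptimisticLockError(
            f"Stored _version is {stored_versions[record_id]}, "
            f"_version of request is {request_version}")
    stored_versions[record_id] = request_version + 1  # successful update bumps the version

# First attempt: rec-1 is saved, then the long transaction is interrupted
# (the "Apparent connection leak detected" situation), so the batch is retried.
save("rec-1", 3)

# Retry of the whole batch: rec-1 is now at _version 4 and fails with optimistic locking.
for record_id, version in batch:
    try:
        save(record_id, version)
    except OptimisticLockError as err:
        print(record_id, err)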
Recommendations & Jiras (Optional)
In order to get high performance, it is recommended to run 8 mod-inventory-storage tasks (see the sketch after this list).
mod-marc-migration data saving failure: https://folio-org.atlassian.net/browse/MODMARCMIG-43
mod-search deadlocks ticket: https://folio-org.atlassian.net/browse/MSEARCH-932
mod-search slow query affecting DI performance: https://folio-org.atlassian.net/browse/MSEARCH-972
Possibly related mod-search ticket: https://folio-org.atlassian.net/browse/MSEARCH-924
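Assuming the modules run as ECS services (as in this PTF environment), the task count could be raised with a one-off call such as the sketch below; the cluster and service names are placeholders and must match the actual deployment.

import boto3

ecs = boto3.client("ecs", region_name="us-east-1")

# Scale mod-inventory-storage to 8 tasks (cluster/service names are placeholders).
ecs.update_service(
    cluster="rcon",
    service="mod-inventory-storage",
    desiredCount=8,
)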
Test Runs/Results
Test # | Test Conditions | Mapped records | Mapping duration | Mapping status | Saved records | Saving duration | Saving status | Notes (Optional)
---|---|---|---|---|---|---|---|---
1 | CHUNK_FETCH_IDS_COUNT = 12000, RECORDS_CHUNK_SIZE = 4000 | 1160063 | 10 min 10 s | success | 4063 | 1 hr 30 min | failed | central tenant
2 | CHUNK_FETCH_IDS_COUNT = 500 (default), RECORDS_CHUNK_SIZE = 500 (default) | 1160063 | 17 min | success | 1043563 | 3 hr 46 min | failed | central tenant
3 | CHUNK_FETCH_IDS_COUNT = 500 (default), RECORDS_CHUNK_SIZE = 500 (default) | 1460010* | 23 min | success | 1459010 | 2 hr 33 min | failed | member tenant
4 | CHUNK_FETCH_IDS_COUNT = 12000, RECORDS_CHUNK_SIZE = 4000 | 1460010* | 18 min | success | 8010 | 1 hr 31 min | failed | member tenant
Note: in tests 3 and 4 a different number of instances was mapped, as a few data imports were performed between the tests.
Memory Utilization
Memory utilization across all modules was stable, with no memory leaks observed. The most heavily used module, mod-marc-migrations, experienced an increase in memory usage from 25% to approximately 80% during testing with 'Optimal configurations.' However, after the process was completed, memory usage stabilized and returned to normal levels.
CPU Utilization
CPU utilization during the data-saving process was stable and returned to normal levels after the test was completed. The most utilized service was mod-consortia, with a peak usage of approximately 4.5%. The entire process, including reindexing by mod-search, took approximately 4 hours. With 'Optimal configurations,' the data-saving process was faster by about 1 hour and 30 minutes compared to the default configuration, where it took 2 hours and 40 minutes.
Here is the CPU usage of the mod-marc-migrations module during the data-mapping process. It is evident that mod-marc-migrations can handle high loads and return to normal resource usage after the process is completed.
Note: Once the data-mapping process is finished, the CPU usage of mod-marc-migrations becomes less noticeable, as most of the logic for the next step (data saving) occurs on the database side.
RDS CPU Utilization
DB CPU usage reached 100% during data saving
NOTE: after data saving finished (see the marks on the screenshot), runtime reindexing proceeded and occupied more than 90% of DB CPU for additional time.
During the data saving process, a lot of deadlocks were found. These deadlocks happen on the mod-search side and do not affect the data saving process.
Errors
mod-marc-migration error:
13:46:25 [] [] [] [] WARN ProxyLeakTask Connection leak detection triggered for org.postgresql.jdbc.PgConnection@7a388818 on thread chunksProcessingExecutor-5, stack trace follows
java.lang.Exception: Apparent connection leak detected
at com.zaxxer.hikari.HikariDataSource.getConnection(HikariDataSource.java:127) ~[HikariCP-5.1.0.jar!/:?]
at org.folio.spring.config.DataSourceFolioWrapper.getConnection(DataSourceFolioWrapper.java:46) ~[folio-spring-base-8.2.0.jar!/:8.2.0]
at org.hibernate.engine.jdbc.connections.internal.DatasourceConnectionProviderImpl.getConnection(DatasourceConnectionProviderImpl.java:122) ~[hibernate-core-6.5.3.Final.jar!/:6.5.3.Final]
at org.hibernate.internal.NonContextualJdbcConnectionAccess.obtainConnection(NonContextualJdbcConnectionAccess.java:46) ~[hibernate-core-6.5.3.Final.jar!/:6.5.3.Final]
at org.hibernate.resource.jdbc.internal.LogicalConnectionManagedImpl.acquireConnectionIfNeeded(LogicalConnectionManagedImpl.java:126) ~[hibernate-core-6.5.3.Final.jar!/:6.5.3.Final]
at org.hibernate.resource.jdbc.internal.LogicalConnectionManagedImpl.getPhysicalConnection(LogicalConnectionManagedImpl.java:156) ~[hibernate-core-6.5.3.Final.jar!/:6.5.3.Final]
at org.hibernate.resource.jdbc.internal.LogicalConnectionManagedImpl.getConnectionForTransactionManagement(LogicalConnectionManagedImpl.java:286) ~[hibernate-core-6.5.3.Final.jar!/:6.5.3.Final]
at org.hibernate.resource.jdbc.internal.LogicalConnectionManagedImpl.begin(LogicalConnectionManagedImpl.java:294) ~[hibernate-core-6.5.3.Final.jar!/:6.5.3.Final]
at org.hibernate.resource.transaction.backend.jdbc.internal.JdbcResourceLocalTransactionCoordinatorImpl$TransactionDriverControlImpl.begin(JdbcResourceLocalTransactionCoordinatorImpl.java:232) ~[hibernate-core-6.5.3.Final.jar!/:6.5.3.Final]
at org.hibernate.engine.transaction.internal.TransactionImpl.begin(TransactionImpl.java:83) ~[hibernate-core-6.5.3.Final.jar!/:6.5.3.Final]
at org.springframework.orm.jpa.vendor.HibernateJpaDialect.beginTransaction(HibernateJpaDialect.java:176) ~[spring-orm-6.1.14.jar!/:6.1.14]
at org.springframework.orm.jpa.JpaTransactionManager.doBegin(JpaTransactionManager.java:420) ~[spring-orm-6.1.14.jar!/:6.1.14]
at org.springframework.transaction.support.AbstractPlatformTransactionManager.startTransaction(AbstractPlatformTransactionManager.java:532) ~[spring-tx-6.1.14.jar!/:6.1.14]
at org.springframework.transaction.support.AbstractPlatformTransactionManager.getTransaction(AbstractPlatformTransactionManager.java:405) ~[spring-tx-6.1.14.jar!/:6.1.14]
at org.springframework.transaction.support.TransactionTemplate.execute(TransactionTemplate.java:137) ~[spring-tx-6.1.14.jar!/:6.1.14]
at org.springframework.batch.core.step.tasklet.TaskletStep$2.doInChunkContext(TaskletStep.java:255) ~[spring-batch-core-5.1.2.jar!/:5.1.2]
at org.springframework.batch.core.scope.context.StepContextRepeatCallback.doInIteration(StepContextRepeatCallback.java:82) ~[spring-batch-core-5.1.2.jar!/:5.1.2]
at org.springframework.batch.repeat.support.TaskExecutorRepeatTemplate$ExecutingRunnable.run(TaskExecutorRepeatTemplate.java:261) ~[spring-batch-infrastructure-5.1.2.jar!/:5.1.2]
at org.folio.spring.scope.FolioExecutionScopeExecutionContextManager.lambda$getRunnableWithFolioContext$0(FolioExecutionScopeExecutionContextManager.java:94) ~[folio-spring-base-8.2.0.jar!/:8.2.0]
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) [?:?]
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) [?:?]
at java.base/java.lang.Thread.run(Unknown Source) [?:?]
13:46:25 [] [] [] [] WARN BulkStorageService Failed to save entities with type INSTANCE specified with file path: operation/4db21b1f-2354-42ec-a78d-275515a37e42/f54f9418-dd94-48ef-91f5-15c1daf1d2df_entity
13:46:25 [] [] [] [] INFO ProxyLeakTask Previously reported leaked connection org.postgresql.jdbc.PgConnection@7a388818 on thread chunksProcessingExecutor-5 was returned to the pool (unleaked)
13:46:26 [184008/marc-migrations] [cs00000int_0001] [0f4eb7c2-05dd-4559-81ae-0c8e8cb95bc1] [mod-marc-migrations] WARN ProxyLeakTask Connection leak detection triggered for org.postgresql.jdbc.PgConnection@67906bca on thread chunksProcessingExecutor-6, stack trace follows
DB Errors:
2025-02-12 13:25:12 UTC:10.23.34.46(47780):cs00000int_0001_mod_inventory_storage@folio:[25207]:ERROR: Cannot update record 380e8ef5-1f80-465f-bebd-7e736c068329 because it has been changed (optimistic locking): Stored _version is 4, _version of request is 3
2025-02-12 13:25:12 UTC:10.23.34.46(47780):cs00000int_0001_mod_inventory_storage@folio:[25207]:CONTEXT: PL/pgSQL function instance_set_ol_version() line 15 at RAISE
SQL statement "UPDATE cs00000int_0001_mod_inventory_storage.instance SET jsonb=$3 WHERE id=$2 RETURNING id"
PL/pgSQL function upsert(text,uuid,anyelement) line 5 at EXECUTE
2025-02-12 16:12:50 UTC:10.23.34.21(54784):folio@folio:[22461]:ERROR: deadlock detected
Appendix
Infrastructure
PTF environment: rcon
11 m6g.2xlarge EC2 instances located in US East (N. Virginia), us-east-1
db.r6.xlarge database instances, writer
MSK fse-test
4 kafka.m7g.xlarge brokers in 2 zones
Apache Kafka version 3.7.x (KRaft mode)
EBS storage volume per broker 300 GiB
auto.create.topics.enable=true
log.retention.minutes=480
default.replication.factor=3
OpenSearch 2.13 ptf-test cluster
r6g.2xlarge.search 4 data nodes
r6g.large.search 3 dedicated master nodes