Data Import MARC BIB Sunflower CSP2 [ECS]
Overview
This document contains the results of testing Data Import for MARC Bibliographic records with create and update profiles and different file sizes (10K, 25K, 50K, 100K) at Sunflower release [ECS] and Critical service path 2(CSP2).
Ticket: https://folio-org.atlassian.net/browse/PERF-1219
Summary
All Data-imports jobs finished successfully without errors.
The duration of data imports for creates and updates has a noticeable improvement compared to CSP1
(More detailes are in the report https://folio-org.atlassian.net/wiki/spaces/FOLIJET/pages/1200652289 )CSP2 Updates are faster than CSP 1 likely due to a mod-search query that was running in CSP 1 (which created locking waits in the DB) not running in CSP 2.
CSP2’s Create imports MSK CPU utilizations are much lower compared to CSP 1 Create imports
DI duration growth correlates with the number of records imported.
No memory leak is suspected for DI modules.
Services CPU utilization, Service memory utilization, and DB CPU utilization have the same utilization trend and values as in previous reports
Recommendations & Jiras
Improve the performance of the SQL query that causes Lock on DB: https://folio-org.atlassian.net/browse/MODORDSTOR-491
Before running regression testing, please discuss the possibility of changing the contribution and subject field in MRC files that are used for DI to avoid this issue:
https://folio-org.atlassian.net/browse/MSEARCH-972
https://folio-org.atlassian.net/browse/MODDATAIMP-1242
Results
Test # | Data-import test | Profile | Duration CSP1 | Duration *2 | Duration Sunflower CSP1 +LWT *2 | Duration Sunflower |
1.1 | 5k MARC BIB Create | PTF - Create 2 | 3 min 3 s | 3 min 42 s | 4 min 07 s | 2 min 21 s |
1.2 | 10k MARC BIB Create | PTF - Create 2 | 5 min 32 s | 7 min 08 s | 7 min 16 s | 4 min 9 s |
1.3 | 25k MARC BIB Create | PTF - Create 2 | 12 min 25 s | 16 min 29 s | 17 min 30 s | 10 min 29 s |
1.4 | 50k MARC BIB Create | PTF - Create 2 | 26 min 20 s | 36 min 4 s | 35 min 54 s | 19 min 35 s |
1.5 | 100k MARC BIB Create | PTF - Create 2 | 54 min 8 s | 1 hr 11 min | 1 hr 14 min | 38 min 4 s |
2.1 | 5k MARC BIB Update | PTF - Updates Success - 6 | 2 min 44 s | 2 min 31 s | 3 min 41 s | 2 min 27 s |
2.2 | 10k MARC BIB Update | PTF - Updates Success - 6 | 5 min 33 s | 4 min 11 | 4 min 58 | 4 min 20 s |
2.3 | 25k MARC BIB Update | PTF - Updates Success - 6 | 12 min 51 s | 9 min 49 s | 12 min 33 s | 10 min 48 s |
2.4 | 50k MARC BIB Update | PTF - Updates Success - 6 | 25 min 32 s | 19 min 27 s | 24 min 16 | 22 min 30 s |
2.5 | 100k MARC BIB Update | PTF - Updates Success - 6 | 54 min 41 s | 40 min 30 s | 53 min 11 s | 46 min 41 s |
Testing results from previous reports
*-https://folio-org.atlassian.net/wiki/spaces/FOLIJET/pages/1200750594
*2-https://folio-org.atlassian.net/wiki/spaces/FOLIJET/pages/1289814018
Resource utilization graphs
CPU Utilization
Test 1.1 - 1.5 (5k-10k-25k-50k-100k)
Test 2.1 - 2.5 (5k-10k-25k-50k-100k)
Memory Utilization
Memory utilisation showing stable trend, without spikes and drops. No signs for memory leaks.
Test 1.1 - 1.5 (5k-10k-25k-50k-100k)
Test 2.1 - 2.5 (5k-10k-25k-50k-100k)
RDS Metrics
Test 1.1 - 1.5 (5k-10k-25k-50k-100k)
Test 2.1 - 2.5 (5k-10k-25k-50k-100k)
DB Connections
Test 1.1 - 1.5 (5k-10k-25k-50k-100k)
Test 2.1 - 2.5 (5k-10k-25k-50k-100k)
Performance insight metrics
Test 1.1 - 1.5 (5k-10k-25k-50k-100k)
High resource utilization of SELECT * FROM cs00000int_0001_mod_orders_storage.internal_lock WHERE lock_name = $1 FOR UPDATE is described in addtional section
Test 2.1 - 2.5 (5k-10k-25k-50k-100k)
MSK service
CPU usage by broker
Test 1.1 - 1.5 (5k-10k-25k-50k-100k) - Test 2.1 - 2.5 (5k-10k-25k-50k-100k)
Disk usage by broker
Test 1.1 - 1.5 (5k-10k-25k-50k-100k) - Test 2.1 - 2.5 (5k-10k-25k-50k-100k)
Additional information
Analysis of High Lock:tuple Wait Events on internal_lock Table
Executive Summary
The high database load observed during the test, specifically characterized by the Lock:tuple wait event, is expected behavior and not a performance defect. This load is intentionally generated by the mod-orders-storage service to ensure data consistency when processing audit events.
The behavior is the direct result of a pessimistic locking mechanism (SELECT FOR UPDATE) implemented as per the user story, "Implement lock mechanism for Outbox Event Log."
Detailed Analysis
Observed Symptom: The AWS Performance Insights dashboard shows a high number of Average Active Sessions (AAS) where the dominant wait event is
Lock:tuple. This load is directly attributed to the following query:SELECT * FROM {tenant}_mod_orders_storage.internal_lock WHERE lock_name = ...Design and Business Rationale: This lock is a deliberate design choice to solve a potential data integrity problem:
Function: The
mod-orders-storageservice reads events from anoutbox_eventstable and sends them to Kafka for auditing (the "Outbox Pattern").Problem: The system runs multiple instances of this service in parallel. Without a lock, two or more instances could read the same batch of events simultaneously and send duplicate audit messages to Kafka.
Solution: A distributed lock was implemented using a single row in the
internal_locktable.
Mechanism of Action:
Before an instance of
mod-orders-storagecan read events, it must first execute aSELECT ... FOR UPDATEquery on theinternal_locktable.The first instance to execute this query acquires an exclusive, row-level "tuple lock" on that row.
All other service instances that try to run the same query are forced to wait. This "waiting" state is what Performance Insights reports as the
Lock:tuplewait event.Once the first instance has finished processing its batch (reading events, sending to Kafka) and its transaction is complete, the lock is released. The database then grants the lock to the next waiting instance in the queue.
Conclusion
The high Lock:tuple wait time is a direct and healthy sign that the locking mechanism is working as intended. It represents the time service instances spend "waiting in line" to process audit events sequentially.
This behavior is by design and essential for guaranteeing that audit events are sent to Kafka exactly once, preventing data duplication. It should be documented as an accepted system characteristic rather than a performance bottleneck to be fixed.
Feature: https://folio-org.atlassian.net/browse/UXPROD-3215
User Story: https://folio-org.atlassian.net/browse/MODORDSTOR-332
Appendix
Infrastructure
PTF -environment secon
8 r7g.2xlarge EC2 instances located in US East (N. Virginia)us-east-1
db.r7g.xlarge database instances, writer
MSK fse-test
4 kafka.m7g.xlarge brokers in 2 zones
Apache Kafka version 3.7.x (KRaft mode)
EBS storage volume per broker 300 GiB
auto.create.topics.enable=true
log.retention.minutes=480
default.replication.factor=3