Overview

The Data Import Task Force (DITF) implemented a feature that splits large input MARC files into smaller ones, producing smaller jobs so that big files can be imported, and imported consistently. This document contains the results of performance tests on the feature and an analysis of its performance with respect to the baseline tests. The following Jiras were implemented.

- PERF-644
- PERF-645
- PERF-647
- PERF-646
- PERF-671

...

One record on one tenant could be discarded with the error: io.netty.channel.StacklessClosedChannelException.
Jira: MODDATAIMP-748
The issue reproduces both with and without the splitting feature enabled, in at least 30% of test runs with 500k record files and multitenant testing.
During testing of the new Data Import splitting feature, items for update were discarded with the error: io.vertx.core.impl.NoStackTraceThrowable: Cannot get actual Item by id: org.folio.inventory.exceptions.InternalServerErrorException: Access for user 'data-import-system-user' (f3486d35-f7f7-4a69-bcd0-d8e5a35cb292) requires permission: inventory-storage.items.item.get. Less than 1% of records could be discarded because of this missing permission for 'data-import-system-user'. The permission was not added automatically during service deployment. After the permission was added manually to the database, the error no longer occurred.
Jira: MODDATAIMP-930
UI issue: when a job is canceled or completes with an error, its progress bar cannot be removed from the screen.
Jira: MODDATAIMP-929
Usage:
- Do not set RECORDS_PER_SPLIT_FILE (RPSF) below 1000. The system is stable enough to ingest 1000 records consistently, and smaller values incur more overhead, resulting in longer job durations. CPU utilization for mod-di-converter-storage was 160% at RPSF = 500, 180% at 1000, 380% at 5K, and 433% at 10K, so for the 5K or 10K configurations we recommend allocating more CPU to the mod-di-converter-storage service.
- When toggling the file-splitting feature, the tasks of mod-source-record-storage and mod-source-record-manager need to be restarted.
- Keep the Kafka broker's disk size in mind: bigger jobs (up to 500K records) can now be run, and consecutive jobs may use up the disk quickly because the message retention time is currently set to 8 hours. For example, with a 300 GB disk, consecutive jobs of 250K, 500K, and 500K records will exhaust the disk.
- More CPU could be allocated to mod-inventory and mod-di-converter-storage.
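
To make the RECORDS_PER_SPLIT_FILE trade-off concrete, here is a minimal Python sketch that relates the setting to the number of chunks a file would be cut into and to the mod-di-converter-storage CPU figures reported above. It assumes the file is split into pieces of at most RPSF records each (so the chunk count is a ceiling division); the function name and the lookup table are illustrative and not part of any FOLIO module.

```python
# CPU utilization (%) of mod-di-converter-storage as reported in this document,
# keyed by RECORDS_PER_SPLIT_FILE (RPSF).
MEASURED_CPU_PERCENT = {500: 160, 1000: 180, 5000: 380, 10000: 433}


def split_file_count(total_records: int, records_per_split_file: int) -> int:
    """Return how many chunks a file yields, assuming each chunk holds
    at most `records_per_split_file` records (an assumption of this sketch)."""
    if total_records < 0 or records_per_split_file <= 0:
        raise ValueError("record counts must be non-negative and RPSF positive")
    # Integer ceiling division avoids floating-point rounding.
    return -(-total_records // records_per_split_file)


if __name__ == "__main__":
    for rpsf, cpu in MEASURED_CPU_PERCENT.items():
        chunks = split_file_count(500_000, rpsf)
        print(f"RPSF={rpsf}: {chunks} chunks for 500K records, measured CPU {cpu}%")
```

Smaller RPSF values multiply the number of chunks (and the per-chunk overhead), while larger values raise the CPU demand on mod-di-converter-storage, which is the balance the recommendations above describe.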

...

* - One record on one tenant could be discarded with the error: io.netty.channel.StacklessClosedChannelException.

Jira: MODDATAIMP-748

The issue reproduces both with and without the splitting feature, in at least 30% of test runs with 500k record files and multitenant testing.

...

** - Up to 10 items were discarded with the error: io.vertx.core.impl.NoStackTraceThrowable: Cannot get actual Item by id: org.folio.inventory.exceptions.InternalServerErrorException: Access for user 'data-import-system-user' (f3486d35-f7f7-4a69-bcd0-d8e5a35cb292) requires permission: inventory-storage.items.item.get. Less than 1% of records could be discarded because of this missing permission for 'data-import-system-user'. The permission was not added automatically during service deployment. After the permission was added manually to the database, the error no longer occurred.

Jira: MODDATAIMP-930

...

With CI/CO 20 users and DI 25k records on each of the 3 tenants Splitting Feature Disabled

ocp3-mod-data-import:12

Data Import Robustness Enhancement

...

Memory utilization reached a maximum of 88% for mod-source-record-storage-b and 85% for mod-source-record-manager-b.

Test 2. Test with 1, 2, and 3 tenants' concurrent jobs with configuration RECORDS_PER_SPLIT_FILE = 10K, 2 runs for each test.

...

Test 2. Test with 1, 2, and 3 tenants' concurrent jobs with configuration RECORDS_PER_SPLIT_FILE = 10K, 2 runs for each test.

RDS CPU Utilization

Test 1. Test with 1, 2, and 3 tenants' concurrent jobs with configuration RECORDS_PER_SPLIT_FILE = 500, 2 runs for each test. Maximal CPU Utilization = 95%

...

Retest the DI feature to be sure that the new changes have not affected performance negatively. Retest the DI file-splitting feature for the following scenarios:

Jira: PERF-681

...

250K MARC BIB Create, PTF - Create 2 -> 44 minutes
250K MARC BIB Update, PTF - Updates Success - 1 -> 45 minutes
Multitenant MARC Create (100k, 50k, and 1 record), PTF - Create 2 -> 1 hour 35 minutes
- Check-Out without DI ~ 200ms
- Check-In without DI ~ 65ms
- Check-Out with DI ~ 770ms
- Check-In with DI ~ 330ms

...

Service CPU utilization on Poppy is about the same as on Orchid;
Memory utilization on Poppy is about the same as on Orchid;
RDS CPU utilization during all tests and on both releases was about 96%;
The number of connections to the DB was about the same on both releases, ranging from 550 (Test 1.1) to 1200 (Test 1.4).

Test 1. Single tenant (primary fs09000000): create and update 250K file

Test #	Test parameters	Profile	Duration (Poppy), Splitting Feature Enabled	Status	Previous results (Orchid), Duration	diff = Poppy processing time - Orchid processing time	Duration (Poppy), Splitting Feature Disabled
1.1	250K MARC BIB Create	PTF - Create 2	2 hours 16 min	Completed	1 hour 32 min	44 minutes	failed
1.2	250K MARC BIB Update	PTF - Updates Success - 1	3 hours 1 min	Completed	2 hours 16 min	45 minutes	failed
1.3	Multitenant MARC Create (100k, 50k, and 1 record)	PTF - Create 2	4 hours 14 min	Completed	2 hours 40 min	1 hour 35 minutes	failed

On Poppy with the split feature disabled, large files stopped processing. A ticket was created for this problem:

Jira: PERF-744

Test 1.4 With CI/CO 20 users and DI 25k records on each of the 3 tenants

Splitting Feature enabled

	Orchid: response time without DI (Average)	Orchid: response time with DI (Average)	Poppy: response time without DI (Average)	Poppy: response time with DI (Average)	diff (Poppy - Orchid) without DI	diff (Poppy - Orchid) with DI
Check-Out	0.804s	1.48s	1.03s	2.26s	200ms	770ms
Check-In	0.505s	1.067s	0.570s	1.4s	65ms	330ms

	Orchid: DI duration with CI/CO	Poppy: DI duration with CI/CO
Tenant _1	16 min 53 sec	34 min 55 sec
Tenant _2	20 min 39 sec	27 min 39 sec
Tenant _3	17 min 54 sec	25 min 17 sec
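
The diff values for Test 1.4 follow the stated rule diff = Poppy time - Orchid time. The Python sketch below recomputes them from the figures in the tables above; the dictionary layout and function names are illustrative only. Note that the exact differences for Check-Out (226 ms and 780 ms) are slightly higher than the 200 ms and 770 ms shown in the report, which appear to be rounded or taken from unrounded averages.

```python
# Average response times in milliseconds, copied from the Test 1.4 table.
RESPONSE_MS = {
    "Check-Out": {
        "orchid": {"without_di": 804, "with_di": 1480},
        "poppy": {"without_di": 1030, "with_di": 2260},
    },
    "Check-In": {
        "orchid": {"without_di": 505, "with_di": 1067},
        "poppy": {"without_di": 570, "with_di": 1400},
    },
}

# DI durations with CI/CO as (minutes, seconds), copied from the same test.
DI_DURATION = {
    "Tenant _1": {"orchid": (16, 53), "poppy": (34, 55)},
    "Tenant _2": {"orchid": (20, 39), "poppy": (27, 39)},
    "Tenant _3": {"orchid": (17, 54), "poppy": (25, 17)},
}


def response_diff_ms(action: str, mode: str) -> int:
    """Poppy minus Orchid average response time, in milliseconds."""
    row = RESPONSE_MS[action]
    return row["poppy"][mode] - row["orchid"][mode]


def duration_diff_seconds(tenant: str) -> int:
    """Poppy minus Orchid DI duration for one tenant, in seconds."""
    def to_seconds(minutes_seconds):
        minutes, seconds = minutes_seconds
        return minutes * 60 + seconds

    row = DI_DURATION[tenant]
    return to_seconds(row["poppy"]) - to_seconds(row["orchid"])


if __name__ == "__main__":
    for action in RESPONSE_MS:
        for mode in ("without_di", "with_di"):
            print(f"{action} {mode}: {response_diff_ms(action, mode)} ms")
    for tenant in DI_DURATION:
        minutes, seconds = divmod(duration_diff_seconds(tenant), 60)
        print(f"{tenant}: DI took {minutes} min {seconds} sec longer on Poppy")
```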

...

Service CPU Utilization

There is a sharp CPU spike at the beginning of test 1; we see similar behavior in all of the DI tests. CPU consumption was uniform during the test.

...

The goal of the tests was to investigate how the file-splitting feature affected Data Import on the Poppy release, and the impact of Refresh Token Rotation (RTR). The tests were performed on the ocp3 (Poppy), pcp1 (Poppy), and ncp5 (Orchid) environments.

Jira: PERF-723

Refresh Token Rotation (RTR)

...

tenant0_mod_source_record_storage.marc_records_lb = 9674629
tenant2_mod_source_record_storage.marc_records_lb = 0
tenant3_mod_source_record_storage.marc_records_lb = 0
tenant0_mod_source_record_storage.raw_records_lb = 9604805
tenant2_mod_source_record_storage.raw_records_lb = 0
tenant3_mod_source_record_storage.raw_records_lb = 0
tenant0_mod_source_record_storage.records_lb = 9674677
tenant2_mod_source_record_storage.records_lb = 0
tenant3_mod_source_record_storage.records_lb = 0
tenant0_mod_source_record_storage.marc_indexers = 620042011
tenant2_mod_source_record_storage.marc_indexers = 0
tenant3_mod_source_record_storage.marc_indexers = 0
tenant0_mod_source_record_storage.marc_indexers with field_no 010 = 3285833
tenant2_mod_source_record_storage.marc_indexers with field_no 010 = 0
tenant3_mod_source_record_storage.marc_indexers with field_no 010 = 0
tenant0_mod_source_record_storage.marc_indexers with field_no 035 = 19241844
tenant2_mod_source_record_storage.marc_indexers with field_no 035 = 0
tenant3_mod_source_record_storage.marc_indexers with field_no 035 = 0
tenant0_mod_inventory_storage.authority = 4
tenant2_mod_inventory_storage.authority = 0
tenant3_mod_inventory_storage.authority = 0
tenant0_mod_inventory_storage.holdings_record = 9592559
tenant2_mod_inventory_storage.holdings_record = 16
tenant3_mod_inventory_storage.holdings_record = 16
tenant0_mod_inventory_storage.instance = 9976519
tenant2_mod_inventory_storage.instance = 32
tenant3_mod_inventory_storage.instance = 32
tenant0_mod_inventory_storage.item = 10787893
tenant2_mod_inventory_storage.item = 19
tenant3_mod_inventory_storage.item = 19
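
The report does not say how the record counts above were collected. As a hedged sketch, counts of this kind could be gathered with plain COUNT(*) queries per tenant schema; the helper below only builds the SQL text. The schema naming (tenant + module) and table names follow the labels above, while the `field_no` column is inferred from the "with field_no 010" labels and is an assumption, as is the helper itself.

```python
from typing import Optional


def count_query(tenant: str, module: str, table: str,
                field_no: Optional[str] = None) -> str:
    """Build a COUNT(*) statement for <tenant>_<module>.<table>.

    `field_no` optionally restricts marc_indexers rows to one MARC field;
    the column name is assumed from the labels in this report.
    """
    for part in (tenant, module, table):
        # Only allow simple identifiers, since they are interpolated into SQL.
        if not part.isidentifier():
            raise ValueError(f"unsafe identifier: {part!r}")
    if field_no is not None and not field_no.isdigit():
        raise ValueError(f"field_no must be numeric: {field_no!r}")

    sql = f"SELECT COUNT(*) FROM {tenant}_{module}.{table}"
    if field_no is not None:
        sql += f" WHERE field_no = '{field_no}'"
    return sql + ";"


if __name__ == "__main__":
    for tenant in ("tenant0", "tenant2", "tenant3"):
        print(count_query(tenant, "mod_source_record_storage", "marc_records_lb"))
        print(count_query(tenant, "mod_source_record_storage", "marc_indexers", "010"))
        print(count_query(tenant, "mod_inventory_storage", "instance"))
```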

PTF - environment ocp3

10 m6i.2xlarge EC2 instances located in US East (N. Virginia), us-east-1
2 database instances, one reader, and one writer
Name	API Name	Memory GiB	vCPUs	max_connections
R6G Extra Large	db.r6g.xlarge	32 GiB	4 vCPUs	2731
MSK ptf-kakfa-3
- 4 m5.2xlarge brokers in 2 zones
- Apache Kafka version 2.8.0
- EBS storage volume per broker 300 GiB
- auto.create.topics.enable=true
- log.retention.minutes=480
- default.replication.factor=3
Kafka topics partitioning: - 2 partitions for DI topics

...
