Data Import Splitting Feature test report (Orchid) ocp3 + retesting Poppy FSF and RTR

Overview

The Data Import Task Force (DITF) implemented a feature that splits large input MARC files into smaller ones, resulting in smaller jobs, so that big files can be imported, and imported consistently. This document contains the results of performance tests on the feature and an analysis of the feature's performance with respect to the baseline tests. The following Jiras were implemented.

  • https://folio-org.atlassian.net/browse/PERF-644
  • https://folio-org.atlassian.net/browse/PERF-645
  • https://folio-org.atlassian.net/browse/PERF-647
  • https://folio-org.atlassian.net/browse/PERF-646
  • https://folio-org.atlassian.net/browse/PERF-671

Summary

  • The file-splitting feature is stable and makes Data Import jobs more robust even with the current infrastructure configuration. When failures occur, it is now easier to find the exact records that failed and act on them.

    • No stuck jobs in all tests performed.

    • There were errors (see below) in some partial jobs, but they still completed so the entire job status is "Completed with error".

    • Both kinds of imports, create and update of MARC BIBs, worked well with the file-splitting feature enabled and also with it disabled.

  • There is no performance degradation (jobs are not getting slower) on single-tenant imports. On multi-tenant imports, performance is a little better.

  • Duration for DI correlates with the number of records imported (100k records - 38 min, 250k - 1 hour 32 min, 500k - 3 hours 29 min).

  • Multitenant DI could be performed successfully for up to 9 jobs in parallel. Big jobs start one by one, in order, for each tenant, but are processed in parallel across the 3 tenants. A small DI (1 record) can finish sooner, out of order.

  • No memory leak is suspected for all of the modules.

  • Average CPU usage for mod-inventory was 144%, for mod-di-converter-storage about 107%, and for all other modules it did not exceed 100%. Spikes in CPU usage of mod-data-import, up to 260%, can be observed at the beginning of the Data Import jobs. This is a big improvement over the previous version (without file-splitting) for 500K imports, where mod-di-converter-storage's CPU utilization was 462% and other modules were above 100% and up to 150%.

  • DB CPU usage is up to approximately 95%.
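
The correlation between duration and record count noted in the Summary can be checked with simple arithmetic. The sketch below uses only the three durations quoted there; the function name is illustrative and not part of any FOLIO tooling.

```python
# Durations quoted in the Summary for MARC BIB Create, in minutes.
runs = {
    100_000: 38,           # 38 min
    250_000: 60 + 32,      # 1 hour 32 min
    500_000: 3 * 60 + 29,  # 3 hours 29 min
}


def records_per_minute(records: int, minutes: int) -> float:
    """Average import throughput for one job."""
    return records / minutes


for records, minutes in runs.items():
    rate = records_per_minute(records, minutes)
    print(f"{records:>7} records in {minutes:>3} min: {rate:6.0f} records/min")
```

With these figures the throughput stays between roughly 2,400 and 2,700 records per minute, which is consistent with duration growing close to linearly with file size.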

Recommendations and Jiras

  1. One record on one tenant could be discarded with error: io.netty.channel.StacklessClosedChannelException. https://folio-org.atlassian.net/browse/MODDATAIMP-748 Reproduces in both cases with and without splitting feature enabled in at least 30% of test runs with 500k record files and multitenant testing.

  2. During testing of the new Data Import splitting feature, items for update were discarded with the error: io.vertx.core.impl.NoStackTraceThrowable: Cannot get actual Item by id: org.folio.inventory.exceptions.InternalServerErrorException: Access for user 'data-import-system-user' (f3486d35-f7f7-4a69-bcd0-d8e5a35cb292) requires permission: inventory-storage.items.item.get. Less than 1% of records could be discarded due to the missing permission for 'data-import-system-user'. The permission was not added automatically during the service deployment. I added the permission manually to the database and the error no longer occurs. https://folio-org.atlassian.net/browse/MODDATAIMP-930

  3. UI issue: when a job is canceled or completed with error, its progress bar cannot be deleted from the screen. https://folio-org.atlassian.net/browse/MODDATAIMP-929

  4. Usage:

    • Do not use less than 1000 for RECORDS_PER_SPLIT_FILE (RPSF). The system is stable enough to ingest 1000 records consistently, and smaller values incur more overhead, resulting in longer job durations. CPU utilization for mod-di-converter-storage was 160% for 500 RPSF, 180% for 1000 RPSF, 380% for 5K RPSF, and 433% for 10K RPSF, so when selecting the 5K or 10K configuration we recommend adding more CPU to the mod-di-converter-storage service.

    • When toggling the file-splitting feature, the tasks of mod-source-record-storage and mod-source-record-manager need to be restarted.

    • Keep the Kafka broker's disk size in mind: now that bigger jobs (up to 500K) can be run, consecutive jobs may use up the disk quickly because the messages' retention time is currently set at 8 hours. For example, with a 300GB disk, consecutive jobs of 250K, 500K, and 500K records will exhaust the disk.

  5. More CPU could be allocated to mod-inventory and mod-di-converter-storage
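
The Kafka disk warning in the usage notes follows from timing: the 250K, 500K, 500K sequence barely outlasts the 8-hour retention time, so almost all of its messages are on disk at once. The sketch below illustrates this under stated assumptions: the jobs run back to back, the durations are the splitting-enabled MARC BIB Create durations from this report (1 hour 32 min and 3 hours 29 min), and retention is treated as purely time-based, ignoring how Kafka actually deletes data segment by segment. The function name is illustrative.

```python
RETENTION_MIN = 8 * 60  # retention time stated in this report: 8 hours

# (label, duration in minutes) for consecutive jobs; durations are assumed
# to match the splitting-enabled Create runs in this report.
jobs = [("250K", 92), ("500K", 209), ("500K", 209)]


def retained_minutes(jobs, retention_min):
    """For each job, minutes of its traffic still retained when the last job ends."""
    total = sum(duration for _, duration in jobs)
    window_start = max(0, total - retention_min)  # older messages have expired
    retained = []
    start = 0
    for label, duration in jobs:
        end = start + duration
        kept = max(0, end - max(start, window_start))
        retained.append((label, kept))
        start = end
    return retained


for label, kept in retained_minutes(jobs, RETENTION_MIN):
    print(f"{label}: {kept} min of messages still on disk")
```

Under these assumptions the three jobs take 8 hours 30 minutes in total, so when the last job ends only the first 30 minutes of the first job's messages have aged out; everything else still occupies the broker's disk.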

Results

Test # | Test | Job profile | Splitting Feature Enabled | Results | Splitting Feature Disabled | Results | Before Splitting Feature Deployed | Results
1 | 100K MARC BIB Create | PTF - Create 2 | 37 min - 39 min | Completed | 40 min | Completed | 32 min - 33 min | Completed
1 | 250K MARC BIB Create | PTF - Create 2 | 1 hour 32 min | Completed | 1 hour 41 min | Completed | 1 hour 33 min - 1 hour 57 min | Completed
1 | 500K MARC BIB Create | PTF - Create 2 | 3 hours 29 min | Completed* | 3 hours 55 min | Completed | 3 hours 33 min | Completed
2 | Multitenant MARC Create (100k, 50k, and 1 record) | PTF - Create 2 | 2 hours 40 min | Completed* | 2 hours 43 min | Completed* | 3 hours 1 min | Completed
3 | CI/CO + DI MARC BIB Create (20 users CI/CO, 25k records DI on 3 tenants) | PTF - Create 2 | 24 min 18 sec | Completed | 31 min 31 sec | Completed | 24 min | Completed *
4 | 100K MARC BIB Update (Create new file) | PTF - Updates Success - 1 | 58 min 25 sec; 57 min 19 sec | Completed | 1 hour 3 min | Completed | - | -
4 | 250K MARC BIB Update | PTF - Updates Success - 1 | 2 hours 2 min **; 2 hours 12 min | Completed with errors **; Completed | 1 hour 53 min | Completed | - | -
4 | 500K MARC BIB Update | PTF - Updates Success - 1 | 4 hours 43 min; 4 hours 38 min | Completed; Completed | 5 hours 59 min | Completed | - | -

 * - One record on one tenant could be discarded with error: io.netty.channel.StacklessClosedChannelException. https://folio-org.atlassian.net/browse/MODDATAIMP-748 Reproduces in both cases with and without splitting features in at least 30% of test runs with 500k record files and multitenant testing.

 

 ** - Up to 10 items were discarded with the error: io.vertx.core.impl.NoStackTraceThrowable: Cannot get actual Item by id: org.folio.inventory.exceptions.InternalServerErrorException: Access for user 'data-import-system-user' (f3486d35-f7f7-4a69-bcd0-d8e5a35cb292) requires permission: inventory-storage.items.item.get. Less than 1% of records could be discarded due to the missing permission for 'data-import-system-user'. The permission was not added automatically during the service deployment. I added the permission manually to the database and the error no longer occurs. https://folio-org.atlassian.net/browse/MODDATAIMP-930
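
To compare the columns of the results table numerically, the duration strings first have to be converted to a common unit. The following is a minimal sketch of one way to do that; the helper names are illustrative, and ranges such as "37 min - 39 min" are not handled (both ends would be summed), so they should be split before parsing.

```python
import re


def to_minutes(text: str) -> float:
    """Convert strings such as '3 hours 29 min' or '24 min 18 sec' to minutes."""
    factors = {"h": 60.0, "m": 1.0, "s": 1.0 / 60.0}
    total = 0.0
    # Each match is a number followed by the first letter of its unit.
    for value, unit in re.findall(r"(\d+)\s*(h|m|s)", text.lower()):
        total += int(value) * factors[unit]
    return total


def relative_change(enabled: str, disabled: str) -> float:
    """Percent change in duration with splitting enabled versus disabled."""
    on, off = to_minutes(enabled), to_minutes(disabled)
    return (on - off) / off * 100.0


# 500K MARC BIB Create: enabled versus disabled
print(f"{relative_change('3 hours 29 min', '3 hours 55 min'):.1f}%")
```

A negative value means the job was shorter with the splitting feature enabled; for the 500K create run the difference is about 11%.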

 

Test 1,2. 100k, 250K, 500k and Multitenant MARC BIB Create

Memory Utilization

Memory utilization increases here because the modules had been restarted beforehand (the cluster is shut down every day); no memory leak is suspected for DI modules.

MARC BIB CREATE

Test#1 100k, 250k, 500k records DI

Test#2 Multitenant  DI (9 concurrent jobs)

Service CPU Utilization 

MARC BIB CREATE

Average CPU usage for mod-inventory was 144%, for mod-di-converter-storage about 107%, and for all other modules it did not exceed 100%. Spikes in CPU usage of mod-data-import, up to 260%, can be observed at the beginning of the Data Import jobs.

Test#1 500k records DI

 

Test#2 Multitenant

 

Instance CPU Utilization

Test#1 500k records DI

Test#2 Multitenant DI (9 concurrent jobs)

 

RDS CPU Utilization 

MARC BIB CREATE

DB CPU usage is up to approximately 95%.

Test#1  500k records DI

Test#2 Multitenant  DI (9 concurrent jobs)

Maximum DB CPU usage is about 95%.

 

RDS Database Connections

MARC BIB CREATE
For the DI Create job, the maximum connection count was 535.

Test#1  500k records DI

Test#2 Multitenant

 

Test 3. With CI/CO 20 users and DI 25k records on each of the 3 tenants, Splitting Feature Enabled & Splitting Feature Disabled

 

Operation | Response time without DI (Before Splitting Feature Deployed) | Response time with DI (Before Splitting Feature Deployed) | Response time without DI (Splitting Feature disabled) | Response time with DI (Splitting Feature disabled) | Response time without DI, average (Splitting Feature enabled) | Response time with DI, average (Splitting Feature enabled)
Check-In | 0.517s | 1.138s | 0.542s | 1.1s | 0.505s | 1.067s
Check-Out | 0.796s | 1.552s | 0.841s | 1.6s | 0.804s | 1.48s

 

Tenant | DI Duration without CI/CO (Before Splitting Feature Deployed) | DI Duration with CI/CO (Before Splitting Feature Deployed) | DI Duration without CI/CO (Splitting Feature disabled) | DI Duration with CI/CO (Splitting Feature disabled) | DI Duration without CI/CO (Splitting Feature enabled) | DI Duration with CI/CO (Splitting Feature enabled)
Tenant _1 | 14 min (18 min for run 2) | 20 min | 27 min 47 sec | 31 min 30 sec | 16 min 18 sec | 16 min 53 sec
Tenant _2 | 16 min (18 min for run 2) | 19 min | 23 min 16 sec | 26 min 22 sec | 20 min 13 sec | 20 min 39 sec
Tenant _3 | 16 min (15 min for run 2) | 16 min | 18 min 40 sec | 20 min 44 sec | 17 min 42 sec | 17 min 54 sec

 * - The same approach was used for DI testing: 3 DI jobs in total on 3 tenants, without CI/CO. The second job starts after the first one reaches 30% completion, and a job on the third tenant starts after the first job reaches 60% completion. DI file size: 25k.
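
The staggered start described in the note above can be expressed as a small decision function. This is only a sketch of the rule (start the second job at 30% and the third at 60% of the first job's progress); the tenant keys are hypothetical placeholders, and polling job progress and submitting the jobs are left out because the report does not describe how that was done.

```python
# Progress of the first job (percent) at which each tenant's job is started.
# Tenant keys are placeholders, not real tenant IDs.
START_THRESHOLDS = {"tenant_1": 0, "tenant_2": 30, "tenant_3": 60}


def tenants_to_start(first_job_progress: float, already_started: set) -> list:
    """Return the tenants whose DI job should be started now."""
    return [
        tenant
        for tenant, threshold in START_THRESHOLDS.items()
        if first_job_progress >= threshold and tenant not in already_started
    ]


started = set()
for progress in (0, 15, 30, 45, 60, 100):
    for tenant in tenants_to_start(progress, started):
        print(f"first job at {progress}%: start DI on {tenant}")
        started.add(tenant)
```

Keeping the rule as a pure function makes the schedule easy to check independently of whatever load-test tool drives the jobs.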

Response time graph

With CI/CO 20 users and DI 25k records on each of the 3 tenants Splitting Feature Disabled

ocp3-mod-data-import:12

Data Import Robustness Enhancement

25K records; duration and status for each RECORDS_PER_SPLIT_FILE value

Number of concurrent tenants | Job profile | 500 | Status | 1K | Status | 5K | Status | 10K | Status | Test with Split disabled | Status
1 Tenant test#1 | PTF - Create 2 | 12 min 55 sec | Completed | 11 min 48 sec | Completed | 9 min 21 sec | Completed | 9 min 2 sec | Completed | 10 min 35 sec | Completed
1 Tenant test#2 | PTF - Create 2 | 10 min 31 sec | Completed | 9 min 32 sec | Completed | 9 min 6 sec | Completed | 9 min 14 sec | Completed | 11 min 27 sec | Completed
2 Tenants test#1 | PTF - Create 2 | 19 min 29 sec | Completed | 15 min 47 sec | Completed | 16 min 15 sec | Completed | 16 min 3 sec | Completed | 19 min 18 sec | Completed
2 Tenants test#2 | PTF - Create 2 | 18 min 19 sec | Completed | 15 min 47 sec | Completed | 16 min 11 sec | Completed | 16 min 41 sec | Completed | 20 min 33 sec | Completed
3 Tenants test#1 | PTF - Create 2 | 24 min 15 sec | Completed | 25 min 47 sec | Completed | 23 min | Completed | 23 min 27 sec | Completed | 30 min 2 sec | Completed
3 Tenants test#2 | PTF - Create 2 | 24 min 38 sec | Completed | 23 min 28 sec | Completed | 23 min 2 sec | Completed | 23 min 26 sec | Completed | 29 min 54 sec | Completed *

 * - T1 - "00:33:35.1" Error; T2 - "01:23:36.144"; T3 - "01:16:26.391". On the first tenant, processing stopped with the error "io.vertx.core.impl.NoStackTraceThrowable: Connection is not active now, current status: CLOSED". It caused a spike of CPU utilization on Kafka (tenant cluster) up to 94%.
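
For the 25K-record file used in these runs, the RECORDS_PER_SPLIT_FILE value determines how many partial jobs each import is divided into. The sketch below assumes that every split file holds at most RECORDS_PER_SPLIT_FILE records and the last one holds the remainder; the report does not describe the actual splitting logic, and the function name is illustrative.

```python
import math


def split_file_count(total_records: int, records_per_split_file: int) -> int:
    """Number of split files, assuming each holds at most records_per_split_file records."""
    return math.ceil(total_records / records_per_split_file)


for rpsf in (500, 1_000, 5_000, 10_000):
    print(f"RECORDS_PER_SPLIT_FILE={rpsf:>6}: {split_file_count(25_000, rpsf)} split files")
```

Under that assumption a 25K file becomes 50 partial jobs at 500 records per file but only 3 at 10K, which gives a sense of how much per-job overhead the smaller settings add.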

Instance CPU Utilization 

Test 1. Test with 1, 2, and 3 tenants' concurrent jobs with configuration RECORDS_PER_SPLIT_FILE = 500, 2 runs for each test. The maximum CPU utilization value is 38%.

Test 2. Test with 1, 2, and 3 tenants' concurrent jobs with configuration RECORDS_PER_SPLIT_FILE = 10K, 2 runs for each test. The maximum CPU utilization value is 37%.

Memory Utilization

Test 1. Test with 1, 2, and 3 tenants' concurrent jobs with configuration RECORDS_PER_SPLIT_FILE = 500, 2 runs for each test.

Most of the modules were stable during the test, and no memory leak is suspected for DI modules. Only 2 modules increased their memory consumption after the beginning of the tests.

Memory utilization reached its maximum value of 88% for mod-source-record-storage-b and 85% for mod-source-record-manager-b.

Test 2. Test with 1, 2, and 3 tenants' concurrent jobs with configuration RECORDS_PER_SPLIT_FILE = 10K, 2 runs for each test.

Service CPU Utilization 

Test 1. Test with 1, 2, and 3 tenants' concurrent jobs with configuration RECORDS_PER_SPLIT_FILE = 500, 2 runs for each test.

Test 2. Test with 1, 2, and 3 tenants' concurrent jobs with configuration RECORDS_PER_SPLIT_FILE = 10K, 2 runs for each test.

CPU utilization of  mod-di-converter-storage-b

RDS CPU Utilization