Skip to end of banner
Go to start of banner

Data Import Splitting Feature test report (Orchid) ocp3

Skip to end of metadata
Go to start of metadata

You are viewing an old version of this page. View the current version.

Compare with Current View Page History

« Previous Version 59 Next »

In Progress

Overview

The Data Import Task Force (DITF) implements a feature that splits large input MARC files into smaller ones, resulting in smaller jobs, so that the big files could be imported and be imported consistently.  This document contains the results of performance tests on the feature and also an analysis the feature's performance with respect to the baseline tests.  The following Jiras were implemented. 

PERF-644 - Getting issue details... STATUS PERF-645 - Getting issue details... STATUS PERF-647 - Getting issue details... STATUS PERF-646 - Getting issue details... STATUS PERF-671 - Getting issue details... STATUS

Summary

  • The file-splitting feature is stable and offers more robustness to Data Import jobs even with the current infrastructure configuration. If there were failures, it's easier now to find the exact failed records to take actions on them. 
    • No stuck jobs in all tests performed.
    • There were errors (see below) in some partial jobs, but they still completed so the entire job status is "Completed with error".
    • Both of kinds of imports, create and update MARC BIBs worked well with this file-splitting feature enabled and also disabled. 
  • (At this point) There is no performance degradations, jobs not getting slower, on single-tenant imports. On multi-tenants imports, performance is be a little better
  • Duration for DI correlates with number of the records imported (100k records- 38 min, 250k - 1 hour 32 min, 500k - 3 hours 29 min).
  • Multitenant DI could be performed successfully for up to 9 jobs in parallel. If jobs are big they will start one by one in order for each tenant but processed in parallel on 3 tenants. Small DI (1 record) could be finished faster not in order. 
  • No memory leak is suspected for all of the modules.
  • Average CPU usage for mod-inventory -was 144%, mod-di-converter-storage was about 107%, and for all other modules did not exceed 100 %. We can observe spikes in CPU usage of mod-data-import at the beginning of the Data Import jobs up to 260%.  Big improvement over previous version (without file-splitting) for 500K imports where mod-di-converter-storage's CPU utilization was 462% and other modules were above 100% and up to 150%. 
  • Approximately DB CPU usage is up to 95%.

Recommendations and Jiras

  1. One record on one tenant could be discarded with error: io.netty.channel.StacklessClosedChannelException. MODDATAIMP-748 - Getting issue details... STATUS Reproduces in both cases with and without splitting feature enabled in at least 30% of test runs with 500k record files and multitenant testing.
  2. During the new Data Import splitting feature testing, items for update were discarded with the error: io.vertx.core.impl.NoStackTraceThrowable: Cannot get actual Item by id: org.folio.inventory.exceptions.InternalServerErrorException: Access for user 'data-import-system-user' (f3486d35-f7f7-4a69-bcd0-d8e5a35cb292) requires permission: inventory-storage.items.item.get. Less than 1% of records could be discarded due to missing permission for  'data-import-system-user'. Permission was not added automatically during the service deployment. I added permission manually to the database and the error does not occur anymore. MODDATAIMP-930 - Getting issue details... STATUS
  3. UI issue, when canceled or completed with error Job progress bar cannot be deleted from the screen. MODDATAIMP-929 - Getting issue details... STATUS
  4. Usage:
    • Should not use less than 1000 for RECORDS_PER_SPLIT_FILE. The system is stable enough to ingest 1000 records consistently and smaller amounts will incur more overheads, resulting in longer jobs' durations.  CPU utilization for mod-di-converter-storage for 500 RECORDS_PER_SPLIT_FILE(RPSF) = 160%, for 1000RPSF =180%, for 5K RPSF =380% and for 10K RPSF =433%, so in the case of selecting configurations 5K or 10K we recommend to add more CPU to mod-di-converter-storage service.
    • When toggling the file-splitting feature, mod-source-record-storage, mod-source-record-manager's tasks need to be restarted.
    • Keep in mind about the Kafka broker's disk size (as bigger jobs - up to 500K - can be run now), consecutive jobs may use up the disk quickly because the messages' retention time currently is set at 8 hours. For example with 300GB disk size, consecutive jobs of 250K, 500K, 500K sizes will exhaust the disk. 
  5. More CPU could be allocated to mod-inventory and mod-di-converter-storage

Results

Test #

Profile

Splitting Feature EnabledResults

Splitting Feature Disabled

ResultsBefore Splitting Feature DeployedResults
1

100K MARC BIB Create

PTF - Create 237 min -39 minCompleted40 minCompleted32-33 minutesCompleted
1

250K MARC BIB Create 

PTF - Create 21 hour 32 minCompleted1 hour 41 minCompleted1 hour 33 min - 1 hour 57 minCompleted
1500K MARC BIB CreatePTF - Create 23 hours 29 minCompleted*3 hours 55 minCompleted3 hours 33 minCompleted
2Multitenant MARC Create (100k, 50k, and 1 record)PTF - Create 22 hours 40 minCompleted*2 hours 43 minCompleted*3 hours 1 minCompleted
3CI/CO + DI MARC BIB Create (20 users CI/CO, 25k records DI on 3 tenants)PTF - Create 224 min 18 secCompleted31 min 31 secCompleted24 minCompleted *
4

100K MARC BIB Update (Create new file)

PTF - Updates Success - 1

58 min 25 sec

57 min 19 sec

Completed1 hour 3 minCompleted--
4

250K MARC BIB Update

PTF - Updates Success - 1

2 hours 2 min **


2 hours 12 min

Completed with errors **

Completed

1 hour 53 minCompleted--
4500K MARC BIB UpdatePTF - Updates Success - 1

4 hours 43 min

4 hours 38 minutes

Completed

Completed

5 hour 59 minCompleted--

 * - One record on one tenant could be discarded with error: io.netty.channel.StacklessClosedChannelException. MODDATAIMP-748 - Getting issue details... STATUS  Reproduces in both cases with and without splitting features in at least 30% of test runs with 500k record files and multitenant testing.


 ** -  up to 10 items were discarded with the error: io.vertx.core.impl.NoStackTraceThrowable: Cannot get actual Item by id: org.folio.inventory.exceptions.InternalServerErrorException: Access for user 'data-import-system-user' (f3486d35-f7f7-4a69-bcd0-d8e5a35cb292) requires permission: inventory-storage.items.item.get. Less than 1% of records could be discarded due to missing permission for  'data-import-system-user'. Permission was not added automatically during the service deployment. I added permission manually to the database and the error does not occur anymore. MODDATAIMP-930 - Getting issue details... STATUS


Test 1,2. 100k, 250K, 500k and Multitenant MARC BIB Create

Memory Utilization

This has memory utilization increasing due to previous modules restarting (everyday cluster shot down process) no memory leak is suspected for DI modules.

MARC BIB CREATE

Test#1 100k, 250k, 500k records DI

Test#2 Multitenant  DI (9 concurrent jobs)

Service CPU Utilization 

MARC BIB CREATE

Average CPU usage for mod-inventory -was 144%, mod-di-converter-storage was about 107%, and for all other modules did not exceed 100 %. We can observe spikes in CPU usage of mod-data-import at the beginning of the Data Import jobs up to 260%.

Test#1 500k records DI


Test#2 Multitenant


Instance CPU Utilization

Test#1 500k records DI

Test#2 Multitenant DI (9 concurrent jobs)


RDS CPU Utilization 

MARC BIB CREATE

Approximately DB CPU usage is up to 95%

Test#1  500k records DI

Test#2 Multitenant  DI (9 concurrent jobs)

Maximal DB CPU usage is about 95%


RDS Database Connections

MARC BIB CREATE
 For DI  job Create- Maximum 535 connections count.

Test#1  500k records DI

Test#2 Multitenant


Test 3 With CI/CO 20 users and DI 25k records on each of the 3 tenants Splitting Feature Enabled & 

Splitting Feature Disabled


Response time with DI

(Average) Splitting Feature enabled

Response time with DI Before Splitting Feature DeployedResponse time without DI 
(Average)  Splitting Feature enabled
Response time without DI Before Splitting Feature DeployedResponse time with DI  Splitting Feature disabledResponse time without DI Splitting Feature disabled
Check-In1.067s1.138s0.505s0.517s1.1s0.542s
Check-Out1.48s1.552s0.804s0.796s1.6s0.841s

DI Duration with CI/CO DI Duration with CI/CO Before Splitting Feature DeployedDI Duration without CI/CO DI Duration without CI/CO Before Splitting Feature DeployedDI Duration with CI/CO Splitting Feature disabledDI Duration without CI/CO Splitting Feature disabled
Tenant _116 min 53 sec20 min16min 18sec14 min (18 min for run 2)31min 30sec
Tenant _220min 39 sec19 min20min 13sec16 min (18 min for run 2)26min 22sec
Tenant _317min 54 sec16 min17min 42sec16 min (15 min for run 2)20min 44sec

 * - Same approach testing DI: 3 DI jobs total on 3 tenants without CI/CO. Start the second job after the first one reaches 30%, and start another job on a third tenant after the first job reaches 60% completion. DI file size: 25k

Response time graph

With CI/CO 20 users and DI 25k records on each of the 3 tenants Splitting Feature Disabled

ocp3-mod-data-import:12

Data Import Robustness Enhancement  PERF-646 - Getting issue details... STATUS

25K records RECORDS_PER_SPLIT_FILE
Number of concurrent tenantsJob profile 500Status1KStatus5KStatus10KStatusTest with Split disabledStatus
1 Tenant test#1PTF - Create 212 minutes 55 secondsCompleted11 minutes 48 secondsCompleted09 minutes 21 secondsCompleted9 minutes 2 secCompleted10 min 35 secCompleted
1 Tenant test#210 minutes 31 secondsCompleted09 minutes 32 secondsCompleted9 minutes 6 secCompleted9 minutes 14 secCompleted11 min 27 secCompleted
2 Tenants test#1PTF - Create 219 minutes 29 secondsCompleted15 minutes 47 secondsCompleted16 minutes 15 secondsCompleted16 minutes 3 secondsCompleted19 min 18 secCompleted
2 Tenants test#218 minutes 19 secondsCompleted15 minutes 47 secondsCompleted16 minutes 11 secCompleted16 min 41 secCompleted20 min 33 secCompleted
3 Tenants test#1PTF - Create 2

24 minutes 15 seconds

Completed

25 minutes 47 seconds

Completed23 minutes Completed23 minutes 27 secondsCompleted30 min 2 secCompleted
3 Tenants test#224 minutes 38 secondsCompleted

23 minutes 28 seconds

Completed23 minutes 2 secCompleted23 minutes 26 secondsCompleted

T1 - "00:33:35.1" Error

T2 - "01:23:36.144"

T3 - "01:16:26.391"

Completed with error and long proccesing*

*    on the first tenant proccesing stoped wit error " LOGS in progress "

it caused the spike of CPU utilization on Kafka (tenant cluster) up to 94% 

Instance CPU Utilization 

Test 1. Test with 1, 2, and 3 tenants' concurrent jobs with configuration RECORDS_PER_SPLIT_FILE = 500, 2 runs for each test. The maximal CPU Utilization value is 38%. 

Test 2. Test with 1, 2, and 3 tenants' concurrent jobs with configuration RECORDS_PER_SPLIT_FILE = 10K, 2 runs for each test. The maximal CPU Utilization value is 37%. 

Memory Utilization

Test 1. Test with 1, 2, and 3 tenants' concurrent jobs with configuration RECORDS_PER_SPLIT_FILE = 500, 2 runs for each test.

Most of the modules were stable during the test, and no memory leak is suspected for DI modules, only 2 modules increased memory consumption usage after the beginning of the tests

Memory utilization rich maximal value for mod-source-record-storage-b 88%  and for mod-source-record-manager-b 85%.

Test 2. Test with 1, 2, and 3 tenants' concurrent jobs with configuration RECORDS_PER_SPLIT_FILE = 10K, 2 runs for each test.

Service CPU Utilization 

Test 1. Test with 1, 2, and 3 tenants' concurrent jobs with configuration RECORDS_PER_SPLIT_FILE = 500, 2 runs for each test.

Test 2. Test with 1, 2, and 3 tenants' concurrent jobs with configuration RECORDS_PER_SPLIT_FILE = 10K, 2 runs for each test.

CPU utilization of 
mod-di-converter-storage-b

 

RDS CPU Utilization 

Test 1. Test with 1, 2, and 3 tenants' concurrent jobs with configuration RECORDS_PER_SPLIT_FILE = 500, 2 runs for each test. Maximal  CPU Utilization = 95%



Test 2. Test with 1, 2, and 3 tenants' concurrent jobs with configuration RECORDS_PER_SPLIT_FILE = 10K, 2 runs for each test. Maximal  CPU Utilization = 94%

RDS Database Connections

Test 1. Test with 1, 2, and 3 tenants' concurrent jobs with configuration RECORDS_PER_SPLIT_FILE = 500, 2 runs for each test.

Test 2. Test with 1, 2, and 3 tenants' concurrent jobs with configuration RECORDS_PER_SPLIT_FILE = 10K, 2 runs for each test. Maximal  CPU Utilization = 94%

Appendix

Infrastructure ocp3  with the "Bugfest" Dataset

Records count :

  • tenant0_mod_source_record_storage.marc_records_lb = 9674629
  • tenant2_mod_source_record_storage.marc_records_lb = 0
  • tenant3_mod_source_record_storage.marc_records_lb = 0
  • tenant0_mod_source_record_storage.raw_records_lb = 9604805
  • tenant2_mod_source_record_storage.raw_records_lb = 0
  • tenant3_mod_source_record_storage.raw_records_lb = 0
  • tenant0_mod_source_record_storage.records_lb = 9674677
  • tenant2_mod_source_record_storage.records_lb = 0
  • tenant3_mod_source_record_storage.records_lb = 0
  • tenant0_mod_source_record_storage.marc_indexers =  620042011
  • tenant2_mod_source_record_storage.marc_indexers =  0
  • tenant3_mod_source_record_storage.marc_indexers =  0
  • tenant0_mod_source_record_storage.marc_indexers with field_no 010 = 3285833
  • tenant2_mod_source_record_storage.marc_indexers with field_no 010 = 0
  • tenant3_mod_source_record_storage.marc_indexers with field_no 010 = 0
  • tenant0_mod_source_record_storage.marc_indexers with field_no 035 = 19241844
  • tenant2_mod_source_record_storage.marc_indexers with field_no 035 = 0
  • tenant3_mod_source_record_storage.marc_indexers with field_no 035 = 0
  • tenant0_mod_inventory_storage.authority = 4
  • tenant2_mod_inventory_storage.authority = 0
  • tenant3_mod_inventory_storage.authority = 0
  • tenant0_mod_inventory_storage.holdings_record = 9592559
  • tenant2_mod_inventory_storage.holdings_record = 16
  • tenant3_mod_inventory_storage.holdings_record = 16
  • tenant0_mod_inventory_storage.instance = 9976519
  • tenant2_mod_inventory_storage.instance = 32
  • tenant3_mod_inventory_storage.instance = 32 
  • tenant0_mod_inventory_storage.item = 10787893
  • tenant2_mod_inventory_storage.item = 19
  • tenant3_mod_inventory_storage.item = 19

PTF -environment ocp3 

  • 10 m6i.2xlarge EC2 instances located in US East (N. Virginia)us-east-1
  • 2 database  instances, one reader, and one writer

    NameAPI NameMemory GIBvCPUsmax_connections
    R6G Extra Largedb.r6g.xlarge32 GiB4 vCPUs2731
  • MSK ptf-kakfa-3
    • 4 m5.2xlarge brokers in 2 zones
    • Apache Kafka version 2.8.0

    • EBS storage volume per broker 300 GiB

    • auto.create.topics.enable=true
    • log.retention.minutes=480
    • default.replication.factor=3
  • Kafka topics partitioning: - 2 partitions for DI topics

Before Splitting Feature released

Module
ocp3-pvt
Mon Sep 11 09:33:28 UTC 2023
Task Def. RevisionModule VersionTask CountMem Hard LimitMem Soft limitCPU unitsXmxMetaspaceSizeMaxMetaspaceSizeR/W split enabled
mod-remote-storage13mod-remote-storage:2.0.324920447210243960512512false
mod-agreements8mod-agreements:5.5.2215921488128968384512false
mod-data-import7mod-data-import:2.7.11204818442561292384512false
mod-search30mod-search:2.0.1225922480204814405121024false
mod-authtoken7mod-authtoken:2.13.021440115251292288128false
mod-configuration7mod-configuration:5.9.12102489612876888128false
mod-inventory-storage1mod-inventory-storage:26.1.0-SNAPSHOT.66502208195210241440384512false
mod-circulation-storage15mod-circulation-storage:16.0.122880259215361814384512false
mod-source-record-storage11mod-source-record-storage:5.6.725600500020483500384512false
mod-calendar7mod-calendar:2.4.22102489612876888128false
mod-inventory12mod-inventory:20.0.622880259210241814384512false
mod-circulation9mod-circulation:23.5.622880259215361814384512false
mod-di-converter-storage8mod-di-converter-storage:2.0.52102489612876888128false
mod-pubsub8mod-pubsub:2.9.12153614401024922384512false
mod-users8mod-users:19.1.12102489612876888128false
mod-patron-blocks8mod-patron-blocks:1.8.021024896102476888128false
mod-source-record-manager9mod-source-record-manager:3.6.425600500020483500384512false
nginx-edge7nginx-edge:2023.06.1421024896128000false
mod-quick-marc7mod-quick-marc:3.0.01228821761281664384512false
nginx-okapi7nginx-okapi:2023.06.1421024896128000false
okapi-b8okapi:5.0.13168414401024922384512false
mod-feesfines7mod-feesfines:18.2.12102489612876888128false
mod-patron7mod-patron:5.5.22102489612876888128false
mod-notes7mod-notes:5.0.121024896128952384512false
pub-okapi7pub-okapi:2023.06.142102489612876800false


Service versions for Splitting Feature test

Module
ocp3-pvt
Mon Sep 25 12:43:06 UTC 2023
Task Def. RevisionModule VersionTask CountMem Hard LimitMem Soft limitCPU unitsXmxMetaspaceSizeMaxMetaspaceSizeR/W split enabled
mod-data-import10mod-data-import:2.7.2-SNAPSHOT.1371204818442561292384512false
mod-search30mod-search:2.0.1225922480204814405121024false
mod-configuration8mod-configuration:5.9.12102489612876888128false
mod-bulk-operations7mod-bulk-operations:1.0.623072260010241536384512false
mod-inventory-storage1mod-inventory-storage:26.1.0-SNAPSHOT.66502208195210241440384512false
mod-circulation-storage15mod-circulation-storage:16.0.122880259215361814384512false
mod-source-record-storage12mod-source-record-storage:5.6.725600500020483500384512false
mod-calendar7mod-calendar:2.4.22102489612876888128false
mod-inventory12mod-inventory:20.0.622880259210241814384512false
mod-circulation9mod-circulation:23.5.622880259215361814384512false
mod-di-converter-storage8mod-di-converter-storage:2.0.52102489612876888128false
mod-pubsub9mod-pubsub:2.9.12153614401024922384512false
mod-users9mod-users:19.1.12102489612876888128false
mod-patron-blocks9mod-patron-blocks:1.8.021024896102476888128false
mod-source-record-manager12mod-source-record-manager:3.6.5-SNAPSHOT.24525600500020483500384512false
mod-quick-marc7mod-quick-marc:3.0.01228821761281664384512false
nginx-okapi7nginx-okapi:2023.06.1421024896128000false
okapi-b8okapi:5.0.13168414401024922384512false
mod-feesfines8mod-feesfines:18.2.12102489612876888128false
mod-notes7mod-notes:5.0.121024896128952384512false
pub-okapi7pub-okapi:2023.06.142102489612876800false


Methodology/Approach

To set splitting feature: Detailed Release Notes for Data Import Splitting Feature

Test 1: Manually tested 100k, 250k, and 500k records files started one by one on one tenant only.

Test 2: Manually tested 100k+50k+1 record files DI started simultaneously on every 3 tenants (9 jobs total).

Test 3: Run CICO on one tenant, DI jobs 3 tenants, including the one that runs CICO. Start the second job after the first one reaches 30%, and start another job on a third tenant after the first job reaches 60% completion. CICO: 20 users, DI file size: 25k


  • No labels