In progress (in review): retesting results will be added to this report in the scope of PERF-681 (FOLIO Issue Tracker).

Table of Contents

Overview

The Data Import Task Force (DITF) implemented a feature that splits large input MARC files into smaller ones, resulting in smaller jobs, so that big files can be imported reliably and consistently. This document contains the results of performance tests of the feature, along with an analysis of the feature's performance relative to the baseline tests. The following Jiras were implemented.

  • PERF-644 (FOLIO Issue Tracker)
  • PERF-645 (FOLIO Issue Tracker)
  • PERF-647 (FOLIO Issue Tracker)
  • PERF-646 (FOLIO Issue Tracker)
  • PERF-671 (FOLIO Issue Tracker)

Summary

  • The file-splitting feature is stable and makes Data Import jobs more robust, even with the current infrastructure configuration. When failures occur, it is now easier to identify the exact failed records and act on them.
    • No stuck jobs in any of the tests performed.
    • There were errors (see below) in some partial jobs, but those jobs still completed, so the overall job status is "Completed with errors".
    • Both kinds of imports, MARC BIB create and update, worked well with the file-splitting feature enabled as well as disabled.
  • There is no performance degradation on single-tenant imports; jobs are not getting slower. On multi-tenant imports, performance is slightly better.
  • DI duration correlates with the number of records imported (100k records: 38 min; 250k: 1 hour 32 min; 500k: 3 hours 29 min).
  • Multitenant DI was performed successfully with up to 9 jobs in parallel. Big jobs start one after another, in order, on each tenant, but are processed in parallel across the 3 tenants. Small DI jobs (1 record) can finish sooner, out of order.
  • No memory leaks are suspected in any of the modules.
  • Average CPU usage was 144% for mod-inventory and about 107% for mod-di-converter-storage, and did not exceed 100% for any other module. mod-data-import shows CPU spikes of up to 260% at the beginning of Data Import jobs. This is a big improvement over the previous version (without file-splitting) for 500K imports, where mod-di-converter-storage's CPU utilization was 462% and other modules were above 100% and up to 150%.
  • DB CPU usage reaches approximately 95%.
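The near-linear relationship between job duration and record count can be checked with a quick back-of-the-envelope calculation using the durations reported above (a sketch, not part of the test harness):

```python
# Reported single-tenant MARC BIB create durations: records -> minutes.
durations_min = {100_000: 38, 250_000: 92, 500_000: 209}

# Throughput in records per minute; roughly constant values indicate
# that duration scales approximately linearly with file size.
throughput = {n: n / t for n, t in durations_min.items()}
for n, r in throughput.items():
    print(f"{n:>7} records: {r:,.0f} records/min")
```

All three jobs land in the same 2,400-2,700 records/min band, which is consistent with the "duration correlates with number of records" observation.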


  1. One record on one tenant could be discarded with the error io.netty.channel.StacklessClosedChannelException (MODDATAIMP-748, FOLIO Issue Tracker). Reproduces both with and without the splitting feature enabled, in at least 30% of test runs with 500k-record files and multitenant testing.
  2. During testing of the new Data Import splitting feature, items for update were discarded with the error io.vertx.core.impl.NoStackTraceThrowable: Cannot get actual Item by id: org.folio.inventory.exceptions.InternalServerErrorException: Access for user 'data-import-system-user' (f3486d35-f7f7-4a69-bcd0-d8e5a35cb292) requires permission: inventory-storage.items.item.get (MODDATAIMP-930, FOLIO Issue Tracker). Less than 1% of records could be discarded due to the missing permission for 'data-import-system-user'. The permission was not added automatically during service deployment; after it was added manually to the database, the error no longer occurs.
  3. UI issue: when a job is canceled or completed with errors, its progress bar cannot be removed from the screen (MODDATAIMP-929, FOLIO Issue Tracker).
  4. Usage:
    • Do not use less than 1000 for RECORDS_PER_SPLIT_FILE. The system is stable enough to ingest 1000 records consistently, and smaller values incur more overhead, resulting in longer job durations. CPU utilization of mod-di-converter-storage was 160% at RECORDS_PER_SPLIT_FILE (RPSF) = 500, 180% at 1000 RPSF, 380% at 5K RPSF, and 433% at 10K RPSF, so if a 5K or 10K configuration is selected, we recommend adding more CPU to the mod-di-converter-storage service.
    • When toggling the file-splitting feature, the mod-source-record-storage and mod-source-record-manager tasks need to be restarted.
    • Keep the Kafka brokers' disk size in mind (since bigger jobs, up to 500K records, can now be run): consecutive jobs may use up the disk quickly, because the message retention time is currently set to 8 hours. For example, with a 300 GB disk, consecutive jobs of 250K, 500K, and 500K records will exhaust the disk.
  5. More CPU could be allocated to mod-inventory and mod-di-converter-storage.
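As a rough illustration of the disk-exhaustion example above, the sketch below estimates broker log volume for consecutive jobs that all fall inside the 8-hour retention window. The per-record log volume is a hypothetical figure backed out of the reported example (not a measured value), so treat this only as a sizing aid:

```python
# Hypothetical assumption: each imported record produces ~250 KB of
# retained Kafka log volume across all DI topics (including replication).
BYTES_PER_RECORD = 250_000
DISK_CAPACITY_GB = 300  # disk size from the example above

# Consecutive job sizes (records), all within the 8-hour retention window,
# so none of their messages have been deleted yet.
jobs = [250_000, 500_000, 500_000]
used_gb = sum(jobs) * BYTES_PER_RECORD / 1e9
print(f"estimated retained log volume: {used_gb:.1f} GB of {DISK_CAPACITY_GB} GB")
print("disk exhausted" if used_gb >= DISK_CAPACITY_GB else "fits")
```

Under this assumption the three jobs retain about 312 GB of messages, exceeding the 300 GB capacity, which matches the reported failure mode.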


 * One record on one tenant could be discarded with the error io.netty.channel.StacklessClosedChannelException (MODDATAIMP-748, FOLIO Issue Tracker). Reproduces both with and without the splitting feature, in at least 30% of test runs with 500k-record files and multitenant testing.

 ** Up to 10 items were discarded with the error io.vertx.core.impl.NoStackTraceThrowable: Cannot get actual Item by id: org.folio.inventory.exceptions.InternalServerErrorException: Access for user 'data-import-system-user' (f3486d35-f7f7-4a69-bcd0-d8e5a35cb292) requires permission: inventory-storage.items.item.get (MODDATAIMP-930, FOLIO Issue Tracker). Less than 1% of records could be discarded due to the missing permission for 'data-import-system-user'. The permission was not added automatically during service deployment; after it was added manually to the database, the error no longer occurs.


Tests 1, 2. 100K, 250K, 500K, and Multitenant MARC BIB Create


Splitting Feature Disabled

| Request | Without DI, before feature deployed | With DI, before feature deployed | Without DI, feature disabled | With DI, feature disabled | Without DI (average), feature enabled | With DI (average), feature enabled |
| Check-In | 0.517s | 1.138s | 0.542s | 1.1s | 0.505s | 1.067s |
| Check-Out | 0.796s | 1.552s | 0.841s | 1.6s | 0.804s | 1.48s |



| Tenant | DI duration without CI/CO, before feature deployed | DI duration with CI/CO, before feature deployed | DI duration without CI/CO, feature disabled | DI duration with CI/CO, feature disabled | DI duration without CI/CO | DI duration with CI/CO |
| Tenant_1 | 14 min (18 min for run 2) | 20 min | 27 min 47 sec | 31 min 30 sec | 16 min 18 sec | 16 min 53 sec |
| Tenant_2 | 16 min (18 min for run 2) | 19 min | 23 min 16 sec | 26 min 22 sec | 20 min 13 sec | 20 min 39 sec |
| Tenant_3 | 16 min (15 min for run 2) | 16 min | 18 min 40 sec | 20 min 44 sec | 17 min 42 sec | 17 min 54 sec |


 * Same DI testing approach: 3 DI jobs total, one per tenant, without CI/CO. Start the second job after the first one reaches 30% completion, and start the third job on a third tenant after the first job reaches 60% completion. DI file size: 25k records.


With CI/CO (20 users) and DI of 25k records on each of the 3 tenants, Splitting Feature Disabled

ocp3-mod-data-import:12

Data Import Robustness Enhancement: PERF-646 (FOLIO Issue Tracker)

25K-record file; DI durations by RECORDS_PER_SPLIT_FILE setting. All runs Completed (* see footnote):

| Concurrent tenants | Job profile | 500 | 1K | 5K | 10K | Split disabled |
| 1 Tenant, test #1 | PTF - Create 2 | 12 min 55 sec | 11 min 48 sec | 9 min 21 sec | 9 min 2 sec | 10 min 35 sec |
| 1 Tenant, test #2 | PTF - Create 2 | 10 min 31 sec | 9 min 32 sec | 9 min 6 sec | 9 min 14 sec | 11 min 27 sec |
| 2 Tenants, test #1 | PTF - Create 2 | 19 min 29 sec | 15 min 47 sec | 16 min 15 sec | 16 min 3 sec | 19 min 18 sec |
| 2 Tenants, test #2 | PTF - Create 2 | 18 min 19 sec | 15 min 47 sec | 16 min 11 sec | 16 min 41 sec | 20 min 33 sec |
| 3 Tenants, test #1 | PTF - Create 2 | 24 min 15 sec | 25 min 47 sec | 23 min | 23 min 27 sec | 30 min 2 sec |
| 3 Tenants, test #2 | PTF - Create 2 | 24 min 38 sec | 23 min 28 sec | 23 min 2 sec | 23 min 26 sec | 29 min 54 sec * |


Memory utilization reached a maximum of 88% for mod-source-record-storage-b and 85% for mod-source-record-manager-b.

Test 2. Test with 1, 2, and 3 tenants' concurrent jobs with configuration RECORDS_PER_SPLIT_FILE = 10K, 2 runs for each test.



CPU utilization of mod-di-converter-storage-b

 

RDS CPU Utilization 

Test 1. Test with 1, 2, and 3 tenants' concurrent jobs with configuration RECORDS_PER_SPLIT_FILE = 500, 2 runs for each test. Maximal  CPU Utilization = 95%


Retest the DI feature to be sure that the new changes have not affected performance negatively.  Retest the DI file-splitting feature for the following scenarios:

PERF-681 (FOLIO Issue Tracker)

Task definition:

{
    "taskDefinitionArn": "arn:aws:ecs:us-east-1:054267740449:task-definition/ocp3-mod-data-import:23",
    "containerDefinitions": [
        {
            "name": "mod-data-import",
            "image": "579891902283.dkr.ecr.us-east-1.amazonaws.com/folio/mod-data-import:2.7.2-SNAPSHOT.150",
            "cpu": 256,
            "memory": 2048,
            "memoryReservation": 1844,
            "portMappings": [
                {
                    "containerPort": 8081,
                    "hostPort": 0,
                    "protocol": "tcp"
                }
            ],
            "essential": true,
            "environment": [
                {
                    "name": "DB_MAXPOOLSIZE",
                    "value": "20"
                },
                {
                    "name": "CONFIG_FILE",
                    "value": "config.json"
                },
                {
                    "name": "SCORE_AGE_NEWEST",
                    "value": "0"
                },
                {
                    "name": "DB_PORT",
                    "value": "5432"
                },
                {
                    "name": "AWS_URL",
                    "value": "https://s3.amazonaws.com"
                },
                {
                    "name": "SCORE_TENANT_USAGE_MAX",
                    "value": "-200"
                },
                {
                    "name": "ASYNC_PROCESSOR_POLL_INTERVAL_MS",
                    "value": "5000"
                },
                {
                    "name": "JAVA_ARGS",
                    "value": "-Dhttp.port=8082 -Dlog.level=info"
                },
                {
                    "name": "JAVA_OPTS",
                    "value": "-Dvertx.logger-delegate-factory-class-name=io.vertx.core.logging.SLF4JLogDelegateFactory -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/usr/ms/mod-data-import.hprof -XX:OnOutOfMemoryError=/usr/ms/heapdump.sh -XX:MetaspaceSize=384m -XX:MaxMetaspaceSize=512m -Xmx1292m"
                },
                {
                    "name": "AWS_BUCKET",
                    "value": "data-import-folio-eis-us-east-1-int-tenant"
                },
                {
                    "name": "ENV",
                    "value": "ocp3"
                },
                {
                    "name": "SCORE_AGE_OLDEST",
                    "value": "50"
                },
                {
                    "name": "AWS_SDK",
                    "value": "true"
                },
                {
                    "name": "JAVA_PROFILER_OPTS",
                    "value": "-noverify -javaagent:\"/usr/ms/jvm-profiler-1.0.0.jar\"=configProvider=com.uber.profiling.YamlConfigProvider,configFile=\"/usr/ms/profiler.yaml\" -cp \"/usr/ms/jvm-profiler-1.0.0.jar\" "
                },
                {
                    "name": "SCORE_AGE_EXTREME_THRESHOLD_MINUTES",
                    "value": "480"
                },
                {
                    "name": "SCORE_TENANT_USAGE_MIN",
                    "value": "100"
                },
                {
                    "name": "ASYNC_PROCESSOR_MAX_WORKERS_COUNT",
                    "value": "1"
                },
                {
                    "name": "SCORE_PART_NUMBER_FIRST",
                    "value": "1"
                },
                {
                    "name": "SPLIT_FILES_ENABLED",
                    "value": "true"
                },
                {
                    "name": "SCORE_JOB_SMALLEST",
                    "value": "40"
                },
                {
                    "name": "file.processing.edifact.buffer.chunk.size",
                    "value": "10"
                },
                {
                    "name": "SCORE_PART_NUMBER_LAST_REFERENCE",
                    "value": "100"
                },
                {
                    "name": "S3_FORCEPATHSTYLE",
                    "value": "true"
                },
                {
                    "name": "JAVA_PROFILER_STATE",
                    "value": "disabled"
                },
                {
                    "name": "AWS_REGION",
                    "value": "us-east-1"
                },
                {
                    "name": "SCORE_JOB_REFERENCE",
                    "value": "100000"
                },
                {
                    "name": "SCORE_AGE_EXTREME_VALUE",
                    "value": "10000"
                },
                {
                    "name": "DB_HOST",
                    "value": "db.ocp3.folio-eis.us-east-1"
                },
                {
                    "name": "SCORE_JOB_LARGEST",
                    "value": "-40"
                },
                {
                    "name": "MAX_REQUEST_SIZE",
                    "value": "4000000"
                },
                {
                    "name": "KAFKA_PORT",
                    "value": "9092"
                },
                {
                    "name": "KAFKA_HOST",
                    "value": "kafka.ocp3.folio-eis.us-east-1"
                },
                {
                    "name": "LOG4J_CONFIGURATION_FILE",
                    "value": "https://s3.amazonaws.com/ocp3-folio-eis-us-east-1-int/log/log4j2.properties"
                },
                {
                    "name": "PREFIX",
                    "value": "ocp3"
                },
                {
                    "name": "RECORDS_PER_SPLIT_FILE",
                    "value": "1000"
                },
                {
                    "name": "SCORE_PART_NUMBER_LAST",
                    "value": "0"
                },
                {
                    "name": "DB_DATABASE",
                    "value": "folio"
                },
                {
                    "name": "DB_EXPLAIN_QUERY_THRESHOLD",
                    "value": "300000"
                }
            ],
            "mountPoints": [],
            "volumesFrom": [],
            "secrets": [
                {
                    "name": "DB_USERNAME",
                    "valueFrom": "arn:aws:ssm:us-east-1:054267740449:parameter/fse/cluster/ocp3/dbClusterMaster_userName"
                },
                {
                    "name": "DB_PASSWORD",
                    "valueFrom": "arn:aws:ssm:us-east-1:054267740449:parameter/fse/cluster/ocp3/dbClusterMaster_userPassword"
                }
            ],
            "stopTimeout": 120,
            "ulimits": [
                {
                    "name": "nofile",
                    "softLimit": 1048576,
                    "hardLimit": 1048576
                }
            ],
            "logConfiguration": {
                "logDriver": "awslogs",
                "options": {
                    "awslogs-group": "ocp3-folio-eis",
                    "awslogs-region": "us-east-1",
                    "awslogs-stream-prefix": "ocp3"
                }
            }
        }
    ],
    "family": "ocp3-mod-data-import",
    "taskRoleArn": "arn:aws:iam::054267740449:role/Role-folio-ecs-task",
    "executionRoleArn": "arn:aws:iam::054267740449:role/Role-folio-ecs-task",
    "revision": 23,
    "volumes": [],
    "status": "ACTIVE",
    "requiresAttributes": [
        {
            "name": "com.amazonaws.ecs.capability.logging-driver.awslogs"
        },
        {
            "name": "ecs.capability.execution-role-awslogs"
        },
        {
            "name": "com.amazonaws.ecs.capability.ecr-auth"
        },
        {
            "name": "com.amazonaws.ecs.capability.docker-remote-api.1.19"
        },
        {
            "name": "com.amazonaws.ecs.capability.docker-remote-api.1.21"
        },
        {
            "name": "com.amazonaws.ecs.capability.task-iam-role"
        },
        {
            "name": "ecs.capability.container-ordering"
        },
        {
            "name": "ecs.capability.execution-role-ecr-pull"
        },
        {
            "name": "ecs.capability.secrets.ssm.environment-variables"
        },
        {
            "name": "com.amazonaws.ecs.capability.docker-remote-api.1.18"
        }
    ],
    "placementConstraints": [],
    "compatibilities": [
        "EXTERNAL",
        "EC2"
    ],
    "registeredAt": "2023-10-04T16:59:48.967Z",
    "registeredBy": "arn:aws:sts::054267740449:assumed-role/AWSReservedSSO_FOLIOFSE_ead3c38ca817a601/rsafiulin@ebsco.com",
    "tags": []
}
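Given RECORDS_PER_SPLIT_FILE = 1000 in the task definition above, the number of part-jobs a large input file fans out into is a simple ceiling division. A minimal sketch (the `split_count` helper name is hypothetical, not from mod-data-import):

```python
import math

RECORDS_PER_SPLIT_FILE = 1000  # value from the task definition above

def split_count(total_records: int) -> int:
    """Number of smaller files a large MARC file is split into."""
    return math.ceil(total_records / RECORDS_PER_SPLIT_FILE)

for size in (100_000, 250_000, 500_000):
    print(f"{size} records -> {split_count(size)} part files")
```

So a 500K-record import becomes 500 part-jobs of 1000 records each, which is why smaller RECORDS_PER_SPLIT_FILE values add per-job overhead and larger ones concentrate load on mod-di-converter-storage.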


Test 1. Single tenant: create and update, 250K file

| Test # | Test parameters | Profile | Duration | Status | Previous results duration |
| 1.1 | 250K MARC BIB Create | PTF - Create 2 | 2 hours 3 min | Completed | 2 hours 2 min |
| 1.2 | 250K MARC BIB Update | PTF - Updates Success - 1 | | | |














Memory Utilization


Service CPU Utilization 


Instance CPU Utilization

RDS CPU Utilization  


RDS Database Connections

Appendix

Infrastructure ocp3  with the "Bugfest" Dataset


  • tenant0_mod_source_record_storage.marc_records_lb = 9674629
  • tenant2_mod_source_record_storage.marc_records_lb = 0
  • tenant3_mod_source_record_storage.marc_records_lb = 0
  • tenant0_mod_source_record_storage.raw_records_lb = 9604805
  • tenant2_mod_source_record_storage.raw_records_lb = 0
  • tenant3_mod_source_record_storage.raw_records_lb = 0
  • tenant0_mod_source_record_storage.records_lb = 9674677
  • tenant2_mod_source_record_storage.records_lb = 0
  • tenant3_mod_source_record_storage.records_lb = 0
  • tenant0_mod_source_record_storage.marc_indexers =  620042011
  • tenant2_mod_source_record_storage.marc_indexers =  0
  • tenant3_mod_source_record_storage.marc_indexers =  0
  • tenant0_mod_source_record_storage.marc_indexers with field_no 010 = 3285833
  • tenant2_mod_source_record_storage.marc_indexers with field_no 010 = 0
  • tenant3_mod_source_record_storage.marc_indexers with field_no 010 = 0
  • tenant0_mod_source_record_storage.marc_indexers with field_no 035 = 19241844
  • tenant2_mod_source_record_storage.marc_indexers with field_no 035 = 0
  • tenant3_mod_source_record_storage.marc_indexers with field_no 035 = 0
  • tenant0_mod_inventory_storage.authority = 4
  • tenant2_mod_inventory_storage.authority = 0
  • tenant3_mod_inventory_storage.authority = 0
  • tenant0_mod_inventory_storage.holdings_record = 9592559
  • tenant2_mod_inventory_storage.holdings_record = 16
  • tenant3_mod_inventory_storage.holdings_record = 16
  • tenant0_mod_inventory_storage.instance = 9976519
  • tenant2_mod_inventory_storage.instance = 32
  • tenant3_mod_inventory_storage.instance = 32 
  • tenant0_mod_inventory_storage.item = 10787893
  • tenant2_mod_inventory_storage.item = 19
  • tenant3_mod_inventory_storage.item = 19

PTF environment ocp3

  • 10 m6i.2xlarge EC2 instances located in US East (N. Virginia), us-east-1
  • 2 database instances, one reader and one writer

    | Name | API Name | Memory GiB | vCPUs | max_connections |
    | R6G Extra Large | db.r6g.xlarge | 32 GiB | 4 vCPUs | 2731 |


  • MSK ptf-kakfa-3
    • 4 m5.2xlarge brokers in 2 zones
    • Apache Kafka version 2.8.0

    • EBS storage volume per broker 300 GiB

    • auto.create.topics.enable=true
    • log.retention.minutes=480
    • default.replication.factor=3
  • Kafka topics partitioning: 2 partitions for DI topics
