MARC Authorities Update + Create [Orchid]


 

Overview

 

Within the scope of https://folio-org.atlassian.net/browse/PERF-386, tests are needed to answer the following questions:

  • Determine time it takes to complete import

  • Determine main modules that are involved in the process (if obvious or if known)

  • Test specific scenarios: DI while Check-in and Checkout (CICO) is in progress with 5 concurrent users, and concurrent DI jobs from multiple tenants in the same cluster.

Summary

  • Time to successfully complete a data import for tenant ptf-ncp5-00 is approximately 15 sec for 1k records, 1 min for 5k records, 2 min for 10k records, 4 min 30 sec for 22.7k records, and 9 min 37 sec for 50k records.

  • Main modules that are involved in the process:

    1. mod-quick-marc

    2. mod-source-record-storage

    3. mod-inventory

    4. mod-source-record-manager

    5. mod-data-import

    6. mod-di-converter-storage

    7. mod-search

    8. nginx-okapi

    9. mod-inventory-storage

    10. okapi

    11. mod-entities-links

  • DI with CI/CO: no degradation of data import time, but Check-in and Checkout times degrade by up to 3 times during data import. In multitenant testing, both concurrent jobs from different tenants and consecutive jobs from the ptf-ncp5-01 and ptf-ncp5-02 tenants completed with errors and all records were discarded; at times only a single run per DI file was successful. Jobs run for 2 or 3 tenants simultaneously never finished because of MODSOURCE-581 (SPIKE: Multiple tenant DI testing - import jobs are hanging; Closed) *. Because those jobs were stopped by the user after the error occurred (about 10-15% done after 5 hours), their results are not meaningful.

  • Memory utilization grows for 3 modules: mod-source-record-manager:3.6.2 from 83% to 105%, mod-source-record-storage:5.6.5 from 86% to 89%, mod-inventory-storage:26.0.0 from 42% to 54%. A Jira ticket was opened: PERF-541 (Investigate potential memory leak for DI modules; Closed). All other modules are stable during data import.
    Update 17/05/2023: a series of tests was performed as described in PERF-541. Memory growth for mod-source-record-manager was not significant and stabilized after some time. Heap dumps were analyzed for all modules and revealed no memory leaks.

  • Most CPU-consuming modules: mod-quick-marc - 79%, mod-source-record-storage - 74%, mod-inventory - 69%, mod-source-record-manager - 67%, others - usage less than 30%.

 

* MODSOURCE-581 (SPIKE: Multiple tenant DI testing - import jobs are hanging; Closed) is reproducible on the Orchid release with the following module configuration:

  • mod-source-record-storage: cpu:1024 memory:4096/3688, DB_MAXPOOLSIZE=30, DB_CONNECTION_TIMEOUT=40

  • mod-source-record-manager: cpu:1024 memory:4096/3688, DB_MAXPOOLSIZE=30

It is planned to be retested with an increased database size (https://folio-org.atlassian.net/browse/PERF-544) and with all required trigger functions (https://folio-org.atlassian.net/browse/PERF-547).
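
The import times quoted in the Summary can be converted into an approximate throughput to make the scaling behaviour easier to compare across file sizes. The following is only a sketch: the helper names (`to_seconds`, `throughput`) are hypothetical, the durations are the rounded values from the Summary, and the 22.7k file is taken as 22,778 records as listed in the results table.

```python
import re

# Durations as quoted in the Summary for tenant ptf-ncp5-00 (rounded values).
# 22_778 is the exact record count of the "22.7k" file from the results table.
REPORTED = {
    1_000: "15 sec",
    5_000: "1 min",
    10_000: "2 min",
    22_778: "4 min 30 sec",
    50_000: "9 min 37 sec",
}


def to_seconds(text: str) -> int:
    """Convert a duration such as '9 min 37 sec' into whole seconds."""
    minutes = re.search(r"(\d+)\s*min", text)
    seconds = re.search(r"(\d+)\s*sec", text)
    total = int(minutes.group(1)) * 60 if minutes else 0
    total += int(seconds.group(1)) if seconds else 0
    return total


def throughput(records: int, duration: str) -> float:
    """Average records imported per second for one DI job."""
    return records / to_seconds(duration)


if __name__ == "__main__":
    for records, duration in REPORTED.items():
        rate = throughput(records, duration)
        print(f"{records:>6} records in {duration:<12} -> {rate:5.1f} records/sec")
```

Because the inputs are rounded, the resulting rates are indicative only; they are useful for spotting a change in trend between file sizes rather than as precise measurements.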

Recommendations & Jiras (Optional)

Jiras

https://folio-org.atlassian.net/browse/MODSOURMAN-982 Do not process chunks when the DI job is completed

PERF-541 (Closed) Investigate potential memory leak for DI modules

https://folio-org.atlassian.net/browse/MODDATAIMP-809 Investigate why records are discarded for jobs completed with errors.

Test Runs & Results

Job Profile "KG Create authority" -  https://bugfest-nolana.int.aws.folio.org/settings/data-import/job-profiles/view/d3271c74-97ec-4dd9-9470-97b2154d63fd?query=KG&sort=name

Baseline test

 Test with CICO 5 concurrent users

| Test # | # of records | Time it takes to complete import | CI time Avg | Baseline CI Avg delta | CI time 95th pct | Baseline CI delta | CO time Avg | Baseline CO Avg delta | CO time 95th pct | Baseline CO delta |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1,000 | 14 sec | 0.585 | +21% | 0.778 | +37% | 1.012 | +34% | 1.426 | +62% |
| 2 | 5,000 | 56 sec | 0.914 | +90% | 1.467 | +157% | 1.305 | +73% | 2.403 | +173% |
| 3 | 10,000 | 1 min 54 sec | 0.907 | +89% | 1.759 | +209% | 1.408 | +86% | 2.721 | +209% |
| 4 | 22,778 | 4 min 32 sec | 0.853 | +78% | 1.616 | +184% | 1.425 | +89% | 2.497 | +183% |
| 5 | 50,000 | 9 min 37 sec | 0.862 | +80% | 1.471 | +158% | 1.510 | +100% | 2.403 | +173% |

| Baseline | Avg | 95th pct |
|---|---|---|
| CI | 0.480 | 0.569 |
| CO | 0.755 | 0.881 |
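
The delta columns above appear to be the relative increase of each CI/CO response time over the corresponding baseline value. A minimal sketch of that calculation follows, assuming the formula (measured - baseline) / baseline; the function name `delta_pct` is hypothetical. Since the table values are already rounded, recomputed deltas can differ from the published ones by about one percentage point.

```python
# Baseline CI/CO response times taken from the baseline table.
BASELINE = {
    "CI": {"avg": 0.480, "p95": 0.569},
    "CO": {"avg": 0.755, "p95": 0.881},
}

# Response times measured during the 50,000-record import (test 5).
TEST_5 = {
    "CI": {"avg": 0.862, "p95": 1.471},
    "CO": {"avg": 1.510, "p95": 2.403},
}


def delta_pct(measured: float, baseline: float) -> float:
    """Relative change of `measured` against `baseline`, in percent."""
    return (measured - baseline) / baseline * 100


if __name__ == "__main__":
    for operation, metrics in TEST_5.items():
        for metric, measured in metrics.items():
            base = BASELINE[operation][metric]
            print(f"{operation} {metric}: {measured:.3f} vs {base:.3f} "
                  f"-> {delta_pct(measured, base):+.0f}%")
```

A delta of +100% means the response time doubled, and +200% means it tripled, which is where the "up to 3 times" degradation quoted in the Summary comes from.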

Multitenant testing

  • test 1-5: testing DI on each tenant consecutively (5 jobs from 3 tenants = 15 test runs)

  • test 6-8: testing DI jobs from two tenants simultaneously with 1 min ramp-up.

  • test 9: testing DI jobs from 3 tenants simultaneously with 1 min ramp-up.

| Test # | # of records | Tenant ptf-ncp5-00 time | Comment | Tenant ptf-ncp5-01 time | Comment | Tenant ptf-ncp5-02 time | Comment |
|---|---|---|---|---|---|---|---|
| 1 | 1,000 | 15 sec | COMMITTED | 56 sec / 17 sec | 1 time COMMITTED / other ERROR | 13 sec - 30 min | ERROR; one of the jobs stuck for 30 min* |
| 2 | 5,000 | 1 min | COMMITTED | 58 sec | 1 time COMMITTED / other ERROR | 47 sec - 55 min | 1 time COMMITTED / other ERROR; one of the jobs stuck for 30 min |
| 3 | 10,000 | 2 min 02 sec | COMMITTED | 1 min 36 sec | 1 time COMMITTED | 19 min 22 sec | ERROR |
| 4 | 22,778 | 4 min 20 sec | COMMITTED | 11 min 52 sec | ERROR | - | |
| 5 | 50,000 | 9 min 53 sec | COMMITTED | 3 min 56 sec | ERROR | - | |
| 6 | Tenant-00 + Tenant-01, 50,000 records | Stopped by user | | | | | MODSOURCE-581: SPIKE: Multiple tenant DI testing - import jobs are hanging (Closed) |
| 7 | Tenant-01 + Tenant-02, 50,000 records | Stopped by user | | | | | MODSOURCE-581: SPIKE: Multiple tenant DI testing - import jobs are hanging (Closed) |
| 8 | Tenant-00 + Tenant-02, 50,000 records | Stopped by user | | | | | MODSOURCE-581: SPIKE: Multiple tenant DI testing - import jobs are hanging (Closed) |
| 9 | Tenant-00 + Tenant-01 + Tenant-02, 50,000 records | Stopped by user | | | | | MODSOURCE-581: SPIKE: Multiple tenant DI testing - import jobs are hanging (Closed) |

Jobs were always successful for tenant ptf-ncp5-00. For the other 2 tenants, jobs completed with errors and all records were discarded; at times only a single run per DI file was successful.
Jobs run for 2 or 3 tenants simultaneously never finished because of MODSOURCE-581 (SPIKE: Multiple tenant DI testing - import jobs are hanging; Closed). Because those jobs were stopped by the user after the error occurred (about 10-15% done after 5 hours), their results are not meaningful.

Multitenant testing errors and warnings:

mod-source-record-manager

11:16:50 [] [] [] [] ERROR KafkaConsumerWrapper businessHandlerCompletionHandler:: Error while processing a record - id: 2 subscriptionPattern: SubscriptionDefinition(eventType=DI_PARSED_RECORDS_CHUNK_SAVED, subscriptionPattern=ncp5\.Default\.\w{1,}\.DI_PARSED_RECORDS_CHUNK_SAVED) offset: 1947

io.vertx.core.impl.NoStackTraceThrowable: Timeout

11:16:50 [] [] [] [] WARN  taImportKafkaHandler handle:: Error with database during collecting of deduplication info for handlerId: 6713adda-72ce-11ec-90d6-0242ac120003 , eventId: e4a75577-b3b0-4404-bd2f-f9586fd412c3. 

io.vertx.core.impl.NoStackTraceThrowable: Timeout

11:16:50 [] [] [] [] ERROR KafkaConsumerWrapper businessHandlerCompletionHandler:: Error while processing a record - id: 3 subscriptionPattern: SubscriptionDefinition(eventType=DI_COMPLETED, subscriptionPattern=ncp5\.Default\.\w{1,}\.DI_COMPLETED) offset: 68691 

io.vertx.core.impl.NoStackTraceThrowable: Timeout

11:16:50 [] [] [] [] WARN  tHandlingServiceImpl handle:: Failed to handle DI_COMPLETED event 

io.vertx.core.impl.NoStackTraceThrowable: Timeout

11:16:50 [] [] [] [] WARN  rdChunksKafkaHandler handle:: RecordsBatchResponse processing has failed with errors chunkId: f5a92a02-86ce-4afa-aeab-7931f1fd13c6 chunkNumber: 742 jobExecutionId: ff56fb28-5c8a-4109-95eb-33bb0dbed57c 

io.vertx.core.impl.NoStackTraceThrowable: Timeout

mod-source-record-storage

12:07:05 [] [] [] [] ERROR KafkaConsumerWrapper businessHandlerCompletionHandler:: Error while processing a record - id: 13 subscriptionPattern: SubscriptionDefinition(eventType=DI_SRS_MARC_AUTHORITY_RECORD_CREATED, subscriptionPattern=ncp5\.Default\.\w{1,}\.DI_SRS_MARC_AUTHORITY_RECORD_CREATED) offset: 9181 

io.vertx.core.impl.NoStackTraceThrowable: handle:: Failed to process data import event payload from topic 'ncp5.Default.fs07000002.DI_SRS_MARC_AUTHORITY_RECORD_CREATED' by jobExecutionId: '719bcf8f-0017-4b92-93b8-b85e46566634' with recordId: 'f3360e49-908e-4bbf-9c0e-45d811b9863a' and chunkId: '1a556829-8290-4e8c-ab8e-eb15d19af624' 

12:07:05 [] [] [] [] WARN  KafkaConsumerWrapper businessHandlerCompletionHandler:: Error handler has not been implemented for subscriptionPattern: SubscriptionDefinition(eventType=DI_SRS_MARC_AUTHORITY_RECORD_CREATED, subscriptionPattern=ncp5\.Default\.\w{1,}\.DI_SRS_MARC_AUTHORITY_RECORD_CREATED) failures

12:07:05 [] [] [] [] ERROR KafkaConsumerWrapper businessHandlerCompletionHandler:: Error while processing a record - id: 13 subscriptionPattern: SubscriptionDefinition(eventType=DI_SRS_MARC_AUTHORITY_RECORD_CREATED, subscriptionPattern=ncp5\.Default\.\w{1,}\.DI_SRS_MARC_AUTHORITY_RECORD_CREATED) offset: 9181 

io.vertx.core.impl.NoStackTraceThrowable: handle:: Failed to process data import event payload from topic 'ncp5.Default.fs07000002.DI_SRS_MARC_AUTHORITY_RECORD_CREATED' by jobExecutionId: '719bcf8f-0017-4b92-93b8-b85e46566634' with recordId: 'f3360e49-908e-4bbf-9c0e-45d811b9863a' and chunkId: '1a556829-8290-4e8c-ab8e-eb15d19af624' 

12:07:05 [] [] [] [] WARN  AbstractConfig       These configurations '[ssl.protocol, ssl.keystore.location, ssl.truststore.type, ssl.keystore.type, ssl.truststore.location, ssl.keystore.password, ssl.key.password, ssl.truststore.password, ssl.endpoint.identification.algorithm]' were supplied but are not used yet. 

mod-di-converter-storage

12:05:46 [636406/data-import-profiles] [fs07000002] [90aad488-be59-4879-b63b-2f8f13b08e85] [mod_di_converter_storage] WARN  CQL2PgJSON           Doing LIKE search without index for job_profiles.jsonb->>'hidden', CQL >>> SQL: hidden == false >>> lower(f_unaccent(job_profiles.jsonb->>'hidden')) LIKE lower(f_unaccent('false')) 
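
The `subscriptionPattern` values in the log lines above are regular expressions over Kafka topic names. The sketch below checks one of them against the topic quoted in the mod-source-record-storage error, as an aid to reading the logs. Two assumptions are made here: that the pattern is applied to the whole topic name (full match), and that the third dot-separated segment of the topic is the tenant ID, which is consistent with `fs07000002` appearing in the mod-di-converter-storage log line. The helper names are hypothetical.

```python
import re

# Pattern and topic copied from the mod-source-record-storage log lines.
SUBSCRIPTION_PATTERN = r"ncp5\.Default\.\w{1,}\.DI_SRS_MARC_AUTHORITY_RECORD_CREATED"
TOPIC = "ncp5.Default.fs07000002.DI_SRS_MARC_AUTHORITY_RECORD_CREATED"


def matches(pattern: str, topic: str) -> bool:
    """True if the whole topic name matches the subscription pattern."""
    return re.fullmatch(pattern, topic) is not None


def tenant_segment(topic: str) -> str:
    """Third dot-separated segment of the topic (assumed to be the tenant ID)."""
    return topic.split(".")[2]


if __name__ == "__main__":
    print(matches(SUBSCRIPTION_PATTERN, TOPIC))
    print(tenant_segment(TOPIC))
```

Under these assumptions a single subscription pattern covers the topics of every tenant in the cluster, so events from all tenants are handled by the same consumers.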

Memory Utilization

Memory utilization grows for 3 modules:

  • mod-source-record-manager:3.6.2  from 83% to 105%.

  • mod-source-record-storage:5.6.5  from 86% to 89%.

  • mod-inventory-storage:26.0.0 from 42% to 54%.

A Jira ticket was opened: PERF-541 (Investigate potential memory leak for DI modules; Closed).

All other modules are stable during data import.

*This test was performed after a run of 2 sets of the same jobs (1k, 5k, 10k, 22.7k, 50k records twice)

 

 

Service CPU Utilization 

*In the chart below, each small spike corresponds to one DI job.

**Some spikes are shorter than others because of differences in the number of records imported.

**Test #1 has higher CPU usage because it includes background activity (CICO with 5 users plus DI).

 

 

Most CPU-consuming modules: 

  • mod-quick-marc - 79%

  • mod-source-record-storage - 74%

  • mod-inventory - 69%

  • mod-source-record-manager - 67%

  • others - usage less than 30%

 

Instance CPU Utilization

 

RDS CPU Utilization 

As expected, each DI job consumes a lot of DB CPU (each spike here corresponds to one DI job).

DB CPU usage during DI jobs is approximately 96%.

Appendix

Infrastructure

PTF environment ncp3

  • m6i.2xlarge EC2 instances located in US East (N. Virginia), us-east-1

  • 2 db.r6.xlarge database instances: one reader and one writer

  • MSK ptf-kakfa-3

    • 4 m5.2xlarge brokers in 2 zones

    • Apache Kafka version 2.8.0

    • EBS storage volume per broker 300 GiB

    • auto.create.topics.enable=true

    • log.retention.minutes=480

    • default.replication.factor=3

  • Kafka topics partitioning: 

    • DI_RAW_RECORDS_CHUNK_READ - 2

    • DI_RAW_RECORDS_CHUNK_PARSED - 2

    • DI_PARSED_RECORDS_CHUNK_SAVED - 2

    • DI_SRS_MARC_AUTHORITY_RECORD_CREATED - 2

    • DI_COMPLETED - 2

 

Modules memory and CPU parameters

| Modules | Version | Task Definition | Running Tasks |