MARC Authorities Update + Create [Orchid]


Overview


Within the scope of PERF-386, tests were needed to answer the following questions:

  • Determine time it takes to complete import
  • Determine main modules that are involved in the process (if obvious or if known)
  • Test specific scenarios: DI while Check-in and Check-out (CICO) is in progress with 5 concurrent users, and concurrent DI jobs from multiple tenants in the same cluster.

Summary

  • Data import completed successfully for tenant ptf-ncp5-00. Approximate durations: 1k records - 15 sec, 5k records - 1 min, 10k records - 2 min, 22.7k records - 4 min 30 sec, 50k records - 9 min 37 sec.
  • Main modules that are involved in the process:
    1. mod-quick-marc
    2. mod-source-record-storage
    3. mod-inventory
    4. mod-source-record-manager
    5. mod-data-import
    6. mod-di-converter-storage
    7. mod-search
    8. nginx-okapi
    9. mod-inventory-storage
    10. okapi
    11. mod-entities-links
  • DI with CI/CO: no degradation of data import time, but Check-in and Check-out times degrade by up to 3 times during data import. Multitenant testing covered concurrent jobs from different tenants and consecutive jobs from the ptf-ncp5-01 and ptf-ncp5-02 tenants. Both completed with errors and all records were discarded; at most one run per DI file succeeded. Jobs run simultaneously for 2 or 3 tenants never finished because of the error described in MODSOURCE-581*. Because the jobs were stopped by the user (about 10-15% done after 5 hours), these results are not meaningful.
  • Memory utilization grew for 3 modules: mod-source-record-manager:3.6.2 from 83% to 105%, mod-source-record-storage:5.6.5 from 86% to 89%, mod-inventory-storage:26.0.0 from 42% to 54%. A Jira ticket, PERF-541, was opened. All other modules remained stable during data import.
    17/05/2023: a series of tests was performed as described in PERF-541. The memory growth for mod-source-record-manager was not significant and stabilized after some time. Heap dump analysis was performed for all modules and did not reveal memory leaks.
  • Most CPU-consuming modules: mod-quick-marc - 79%, mod-source-record-storage - 74%, mod-inventory - 69%, mod-source-record-manager - 67%, others - usage less than 30%.
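
The import durations in the first summary bullet can be turned into an approximate throughput figure, which makes it easier to compare runs of different sizes. The following is a minimal sketch, not part of the original test tooling; the helper name `parse_duration` is hypothetical, and the inputs are the rounded values quoted in the summary.

```python
import re


def parse_duration(text: str) -> int:
    """Convert a string such as '9 min 37 sec' or '15 sec' to whole seconds."""
    minutes = re.search(r"(\d+)\s*min", text)
    seconds = re.search(r"(\d+)\s*sec", text)
    total = int(minutes.group(1)) * 60 if minutes else 0
    total += int(seconds.group(1)) if seconds else 0
    return total


# Record counts and approximate durations from the summary (tenant ptf-ncp5-00).
imports = {
    1_000: "15 sec",
    5_000: "1 min",
    10_000: "2 min",
    22_700: "4 min 30 sec",
    50_000: "9 min 37 sec",
}

# Records per second for each file size.
throughput = {count: count / parse_duration(d) for count, d in imports.items()}

for count, rate in throughput.items():
    print(f"{count:>6} records: {rate:.1f} records/sec")
```

With these rounded inputs the rate stays between roughly 67 and 87 records per second across all file sizes, so import time scales close to linearly with record count in the single-tenant case.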


* MODSOURCE-581 (SPIKE: Multiple tenant DI testing - import jobs are hanging, CLOSED) is reproducible on the Orchid release with the following module configuration: mod-source-record-storage: cpu:1024 memory:4096/3688, DB_MAXPOOLSIZE=30, DB_CONNECTION_TIMEOUT=40; mod-source-record-manager: cpu:1024 memory:4096/3688, DB_MAXPOOLSIZE=30. It is planned to be retested with an increased database size (PERF-544) and with all required trigger functions (PERF-547).

Recommendations & Jiras (Optional)

Jiras

MODSOURMAN-982: Do not process chunks when the DI job is completed

PERF-541: Investigate potential memory leak for DI modules

MODDATAIMP-809: Investigate why records are discarded for jobs completed with errors

Test Runs & Results

Job Profile "KG Create authority" -  https://bugfest-nolana.int.aws.folio.org/settings/data-import/job-profiles/view/d3271c74-97ec-4dd9-9470-97b2154d63fd?query=KG&sort=name

Baseline test

 Test with CICO 5 concurrent users

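| Test # | # of records | Time it takes to complete import | CI time Avg | Baseline CI Avg delta | CI time 95th pct | Baseline CI 95th pct delta | CO time Avg | Baseline CO Avg delta | CO time 95th pct | Baseline CO 95th pct delta |
|--------|--------------|----------------------------------|-------------|-----------------------|------------------|----------------------------|-------------|-----------------------|------------------|----------------------------|
| 1      | 1,000        | 14 sec                           | 0.585       | +21%                  | 0.778            | +37%                       | 1.012       | +34%                  | 1.426            | +62%                       |
| 2      | 5,000        | 56 sec                           | 0.914       | +90%                  | 1.467            | +157%                      | 1.305       | +73%                  | 2.403            | +173%                      |
| 3      | 10,000       | 1 min 54 sec                     | 0.907       | +89%                  | 1.759            | +209%                      | 1.408       | +86%                  | 2.721            | +209%                      |
| 4      | 22,778       | 4 min 32 sec                     | 0.853       | +78%                  | 1.616            | +184%                      | 1.425       | +89%                  | 2.497            | +183%                      |
| 5      | 50,000       | 9 min 37 sec                     | 0.862       | +80%                  | 1.471            | +158%                      | 1.510       | +100%                 | 2.403            | +173%                      |

| Baseline | Avg   | 95th pct |
|----------|-------|----------|
| CI       | 0.480 | 0.569    |
| CO       | 0.755 | 0.881    |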
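
The delta columns express each measured CI/CO time as a percentage change from the baseline values. The sketch below recomputes them for test #3 (10,000 records) as a worked example. It assumes the usual (measured - baseline) / baseline formula, which the report does not state explicitly, so recomputed values may differ from the reported ones by about one percentage point because of rounding.

```python
# Baseline CI/CO response times from the baseline table (units as in the report).
BASELINE = {"CI avg": 0.480, "CI 95th": 0.569, "CO avg": 0.755, "CO 95th": 0.881}

# Test #3 (10,000 records): measured times and the deltas reported in the table.
measured = {"CI avg": 0.907, "CI 95th": 1.759, "CO avg": 1.408, "CO 95th": 2.721}
reported = {"CI avg": 89, "CI 95th": 209, "CO avg": 86, "CO 95th": 209}


def delta_pct(value: float, baseline: float) -> float:
    """Percentage change of a measured value relative to its baseline."""
    return (value - baseline) / baseline * 100


deltas = {key: delta_pct(measured[key], BASELINE[key]) for key in measured}
# Slowdown factor: how many times slower than baseline each metric is.
ratios = {key: measured[key] / BASELINE[key] for key in measured}

for key in measured:
    print(f"{key:<8} delta {deltas[key]:+6.1f}% (reported +{reported[key]}%), "
          f"ratio {ratios[key]:.2f}x")
```

For this test the 95th percentile ratios come out slightly above 3x, which is consistent with the summary statement that CICO time degrades by up to 3 times during data import.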

Multitenant testing

  • test 1-5: testing DI on each tenant consecutively (5 jobs from 3 tenants = 15 test runs)
  • test 6-8: testing DI jobs from two tenants simultaneously with 1 min ramp-up.
  • test 9: testing DI jobs from 3 tenants simultaneously with 1 min ramp-up.

| Test # | # of records | Tenant ptf-ncp5-00 time | Comment   | Tenant ptf-ncp5-01 time | Comment                        | Tenant ptf-ncp5-02 time | Comment                                                          |
|--------|--------------|-------------------------|-----------|-------------------------|--------------------------------|-------------------------|------------------------------------------------------------------|
| 1      | 1,000        | 15 sec                  | COMMITTED | 56 sec / 17 sec         | 1 time COMMITTED / other ERROR | 13 sec - 30 min         | ERROR; one of the jobs stuck for 30 min*                         |
| 2      | 5,000        | 1 min                   | COMMITTED | 58 sec                  | 1 time COMMITTED / other ERROR | 47 sec - 55 min         | 1 time COMMITTED / other ERROR; one of the jobs stuck for 30 min |
| 3      | 10,000       | 2 min 02 sec            | COMMITTED | 1 min 36 sec            | 1 time COMMITTED               | 19 min 22 sec           | ERROR                                                            |
| 4      | 22,778       | 4 min 20 sec            | COMMITTED | 11 min 52 sec           | ERROR                          | -                       |                                                                  |
| 5      | 50,000       | 9 min 53 sec            | COMMITTED | 3 min 56 sec            | ERROR                          | -                       |                                                                  |

| Test # | Tenants                           | # of records | Result          | Issue         |
|--------|-----------------------------------|--------------|-----------------|---------------|
| 6      | Tenant-00 + Tenant-01             | 50,000       | Stopped by user | MODSOURCE-581 |
| 7      | Tenant-01 + Tenant-02             | 50,000       | Stopped by user | MODSOURCE-581 |
| 8      | Tenant-00 + Tenant-02             | 50,000       | Stopped by user | MODSOURCE-581 |
| 9      | Tenant-00 + Tenant-01 + Tenant-02 | 50,000       | Stopped by user | MODSOURCE-581 |

Jobs were always successful for tenant ptf-ncp5-00. For the other two tenants, jobs completed with errors and all records were discarded; at most one run per DI file succeeded.
Jobs run simultaneously for 2 or 3 tenants never finished because of the error described in MODSOURCE-581. Because the jobs were stopped by the user (about 10-15% done after 5 hours), these results are not meaningful.

Multitenant testing errors and warnings:

mod-source-record-manager

11:16:50 [] [] [] [] ERROR KafkaConsumerWrapper businessHandlerCompletionHandler:: Error while processing a record - id: 2 subscriptionPattern: SubscriptionDefinition(eventType=DI_PARSED_RECORDS_CHUNK_SAVED, subscriptionPattern=ncp5\.Default\.\w{1,}\.DI_PARSED_RECORDS_CHUNK_SAVED) offset: 1947

io.vertx.core.impl.NoStackTraceThrowable: Timeout

11:16:50 [] [] [] [] WARN  taImportKafkaHandler handle:: Error with database during collecting of deduplication info for handlerId: 6713adda-72ce-11ec-90d6-0242ac120003 , eventId: e4a75577-b3b0-4404-bd2f-f9586fd412c3. 

io.vertx.core.impl.NoStackTraceThrowable: Timeout

11:16:50 [] [] [] [] ERROR KafkaConsumerWrapper businessHandlerCompletionHandler:: Error while processing a record - id: 3 subscriptionPattern: SubscriptionDefinition(eventType=DI_COMPLETED, subscriptionPattern=ncp5\.Default\.\w{1,}\.DI_COMPLETED) offset: 68691 

io.vertx.core.impl.NoStackTraceThrowable: Timeout

11:16:50 [] [] [] [] WARN  tHandlingServiceImpl handle:: Failed to handle DI_COMPLETED event 

io.vertx.core.impl.NoStackTraceThrowable: Timeout

11:16:50 [] [] [] [] WARN  rdChunksKafkaHandler handle:: RecordsBatchResponse processing has failed with errors chunkId: f5a92a02-86ce-4afa-aeab-7931f1fd13c6 chunkNumber: 742 jobExecutionId: ff56fb28-5c8a-4109-95eb-33bb0dbed57c 

io.vertx.core.impl.NoStackTraceThrowable: Timeout

mod-source-record-storage

12:07:05 [] [] [] [] ERROR KafkaConsumerWrapper businessHandlerCompletionHandler:: Error while processing a record - id: 13 subscriptionPattern: SubscriptionDefinition(eventType=DI_SRS_MARC_AUTHORITY_RECORD_CREATED, subscriptionPattern=ncp5\.Default\.\w{1,}\.DI_SRS_MARC_AUTHORITY_RECORD_CREATED) offset: 9181 

io.vertx.core.impl.NoStackTraceThrowable: handle:: Failed to process data import event payload from topic 'ncp5.Default.fs07000002.DI_SRS_MARC_AUTHORITY_RECORD_CREATED' by jobExecutionId: '719bcf8f-0017-4b92-93b8-b85e46566634' with recordId: 'f3360e49-908e-4bbf-9c0e-45d811b9863a' and chunkId: '1a556829-8290-4e8c-ab8e-eb15d19af624' 

12:07:05 [] [] [] [] WARN  KafkaConsumerWrapper businessHandlerCompletionHandler:: Error handler has not been implemented for subscriptionPattern: SubscriptionDefinition(eventType=DI_SRS_MARC_AUTHORITY_RECORD_CREATED, subscriptionPattern=ncp5\.Default\.\w{1,}\.DI_SRS_MARC_AUTHORITY_RECORD_CREATED) failures

12:07:05 [] [] [] [] ERROR KafkaConsumerWrapper businessHandlerCompletionHandler:: Error while processing a record - id: 13 subscriptionPattern: SubscriptionDefinition(eventType=DI_SRS_MARC_AUTHORITY_RECORD_CREATED, subscriptionPattern=ncp5\.Default\.\w{1,}\.DI_SRS_MARC_AUTHORITY_RECORD_CREATED) offset: 9181 

io.vertx.core.impl.NoStackTraceThrowable: handle:: Failed to process data import event payload from topic 'ncp5.Default.fs07000002.DI_SRS_MARC_AUTHORITY_RECORD_CREATED' by jobExecutionId: '719bcf8f-0017-4b92-93b8-b85e46566634' with recordId: 'f3360e49-908e-4bbf-9c0e-45d811b9863a' and chunkId: '1a556829-8290-4e8c-ab8e-eb15d19af624' 

12:07:05 [] [] [] [] WARN  AbstractConfig       These configurations '[ssl.protocol, ssl.keystore.location, ssl.truststore.type, ssl.keystore.type, ssl.truststore.location, ssl.keystore.password, ssl.key.password, ssl.truststore.password, ssl.endpoint.identification.algorithm]' were supplied but are not used yet. 

mod-di-converter-storage

12:05:46 [636406/data-import-profiles] [fs07000002] [90aad488-be59-4879-b63b-2f8f13b08e85] [mod_di_converter_storage] WARN  CQL2PgJSON           Doing LIKE search without index for job_profiles.jsonb->>'hidden', CQL >>> SQL: hidden == false >>> lower(f_unaccent(job_profiles.jsonb->>'hidden')) LIKE lower(f_unaccent('false')) 
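
The subscription patterns in the log lines above contain a `\w{1,}` wildcard in the tenant position of the topic name. The sketch below illustrates what that implies: a single pattern matches the same event topic for every tenant in the cluster. It assumes full-match semantics for the pattern and a four-part topic layout (environment, namespace, tenant, event type) inferred from the logged topic name. The helper `tenant_of` and the `tenant_a` topic are hypothetical, and the `DI_COMPLETED` topic for `fs07000002` is constructed from the pattern rather than copied from the logs.

```python
import re

# Pattern copied from the mod-source-record-manager log line for DI_COMPLETED.
SUBSCRIPTION_PATTERN = r"ncp5\.Default\.\w{1,}\.DI_COMPLETED"

# Assumed topic layout: <environment>.<namespace>.<tenant>.<event type>
TOPIC_LAYOUT = re.compile(r"(?P<env>\w+)\.(?P<ns>\w+)\.(?P<tenant>\w+)\.(?P<event>\w+)")


def tenant_of(topic: str):
    """Return the tenant segment of a topic name, or None if the layout differs."""
    match = TOPIC_LAYOUT.fullmatch(topic)
    return match.group("tenant") if match else None


topics = [
    "ncp5.Default.fs07000002.DI_COMPLETED",  # tenant id taken from the logs
    "ncp5.Default.tenant_a.DI_COMPLETED",    # hypothetical second tenant
    "ncp5.Default.fs07000002.DI_SRS_MARC_AUTHORITY_RECORD_CREATED",  # other event
]

# Topics a consumer using this pattern would receive, under full-match semantics.
matched = [t for t in topics if re.fullmatch(SUBSCRIPTION_PATTERN, t)]

for topic in matched:
    print(f"{topic} -> tenant {tenant_of(topic)}")
```

Only the two `DI_COMPLETED` topics match, one per tenant, while the topic for the other event type is excluded.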

Memory Utilization

Memory utilization grows for 3 modules:

  • mod-source-record-manager:3.6.2  from 83% to 105%.
  • mod-source-record-storage:5.6.5  from 86% to 89%.
  • mod-inventory-storage:26.0.0 from 42% to 54%.

A Jira ticket, PERF-541, was opened.

All other modules remained stable during data import.

*This test was performed after running 2 sets of the same jobs (1k, 5k, 10k, 22.7k, and 50k records, twice).



Service CPU Utilization 

*On the chart below, each small spike corresponds to one DI job.

**Some spikes are shorter than others because of differences in the number of records imported.

**Test #1 has higher CPU usage because of background activity (CICO with 5 users + DI).



Most CPU-consuming modules: 

  • mod-quick-marc - 79%
  • mod-source-record-storage - 74%
  • mod-inventory - 69%
  • mod-source-record-manager - 67%
  • others - usage less than 30%


Instance CPU Utilization


RDS CPU Utilization 

As expected, each DI job consumes a lot of DB CPU (each spike corresponds to one DI job).

DB CPU usage is approximately 96%.

Appendix

Infrastructure

PTF environment ncp3

  • m6i.2xlarge EC2 instances located in US East (N. Virginia), us-east-1
  • 2 db.r6.xlarge database instances: one reader and one writer
  • MSK ptf-kakfa-3
    • 4 m5.2xlarge brokers in 2 zones
    • Apache Kafka version 2.8.0

    • EBS storage volume per broker 300 GiB

    • auto.create.topics.enable=true
    • log.retention.minutes=480
    • default.replication.factor=3
  • Kafka topics partitioning: 
    • DI_RAW_RECORDS_CHUNK_READ -2 
    • DI_RAW_RECORDS_CHUNK_PARSED -2
    • DI_PARSED_RECORDS_CHUNK_SAVED -2
    • DI_SRS_MARC_AUTHORITY_RECORD_CREATED -2
    • DI_COMPLETED -2


Modules memory and CPU parameters

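| Modules                   | Version | Task Definition | Running Tasks | CPU  | Memory | MemoryReservation | MaxMetaspaceSize | Xmx  |
|---------------------------|---------|-----------------|---------------|------|--------|-------------------|------------------|------|
| mod-data-import           | 2.7.1   | 8               | 1             | 256  | 2048   | 1844              | 512              | 1292 |
| mod-di-converter-storage  | 2.0.2   | 5               | 2             | 128  | 1024   | 896               | 128              | 768  |
| mod-source-record-storage | 5.6.5   | 24              | 2             | 1024 | 4096   | 3688              | 512              | 3076 |
| mod-source-record-manager | 3.6.2   | 14              | 2             | 1024 | 4096   | 3688              | 512              | 3076 |
| mod-inventory-storage     | 26.0.0  | 10              | 2             | 1024 | 2208   | 1952              | 384              | 1440 |
| mod-inventory             | 20.0.4  | 8               | 2             | 1024 | 2880   | 2592              | 512              | 1814 |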
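
When reading the memory utilization figures against these parameters, it can help to see how much of each module's memory reservation is left once the configured heap (Xmx) and metaspace (MaxMetaspaceSize) ceilings are subtracted. The following is a rough sketch only. It assumes all values are in MiB, which the report does not state, and the `headroom` helper is hypothetical. The JVM also allocates memory outside heap and metaspace (thread stacks, direct buffers, code cache), so the headroom computed here is an upper bound on what remains for those areas.

```python
# Values copied from the "Modules memory and CPU parameters" table.
# Units are assumed to be MiB; the report does not state them.
# Tuple layout: (Memory, MemoryReservation, MaxMetaspaceSize, Xmx)
MODULES = {
    "mod-data-import": (2048, 1844, 512, 1292),
    "mod-di-converter-storage": (1024, 896, 128, 768),
    "mod-source-record-storage": (4096, 3688, 512, 3076),
    "mod-source-record-manager": (4096, 3688, 512, 3076),
    "mod-inventory-storage": (2208, 1952, 384, 1440),
    "mod-inventory": (2880, 2592, 512, 1814),
}


def headroom(reservation: int, metaspace: int, xmx: int) -> int:
    """Reservation left after the heap and metaspace ceilings are subtracted."""
    return reservation - (xmx + metaspace)


report = {
    name: headroom(reservation, metaspace, xmx)
    for name, (_memory, reservation, metaspace, xmx) in MODULES.items()
}

for name, spare in report.items():
    print(f"{name:<28} headroom vs reservation: {spare:>4}")
```

Under these assumptions every module fits heap plus metaspace within its reservation, with mod-di-converter-storage having no spare reservation at all and the two source-record modules having 100 each.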

Methodology/Approach

JMeter scripts were used to run the baseline DI tests and the DI tests with 5 concurrent CICO users.

Multitenant testing

  • test 1-5: testing DI on each tenant consecutively (5 jobs from 3 tenants = 15 test runs)
  • test 6-8: testing DI jobs from two tenants simultaneously with 1 min ramp-up.
  • test 9: testing DI jobs from 3 tenants simultaneously with 1 min ramp-up.