MARC Authorities Update + Create [Orchid]
Overview
In the scope of PERF-386 it is necessary to run tests to answer the following questions:
- Determine time it takes to complete import
- Determine main modules that are involved in the process (if obvious or if known)
- Test specific settings, items, or scenarios: Data Import while Check-in/Check-out (CICO) is in progress with 5 concurrent users, and concurrent DI jobs from multiple tenants in the same cluster.
Summary
- Time to successfully complete a Data Import for tenant ptf-ncp5-00 is approximately 15 sec for 1k records, 1 min for 5k records, 2 min for 10k records, 4 min 30 sec for 22.7k records, and 9 min 37 sec for 50k records.
- Main modules that are involved in the process:
- mod-quick-marc
- mod-source-record-storage
- mod-inventory
- mod-source-record-manager
- mod-data-import
- mod-di-converter-storage
- mod-search
- nginx-okapi
- mod-inventory-storage
- okapi
- mod-entities-links
- DI with CI/CO: no degradation of Data Import time, but Check-in and Check-out response times degrade up to 3 times during Data Import. In multitenant testing, both concurrent jobs from different tenants and consecutive jobs from the ptf-ncp5-01 and ptf-ncp5-02 tenants completed with errors where all records were discarded; occasionally only a single run of a given DI file succeeded. Jobs run for 2 or 3 tenants simultaneously never finished because the error MODSOURCE-581* occurred. As those jobs were stopped by the user after the error (about 10-15% done after 5 hours), their results are not meaningful.
- Memory utilization grows for 3 modules: mod-source-record-manager:3.6.2 from 83% to 105%, mod-source-record-storage:5.6.5 from 86% to 89%, and mod-inventory-storage:26.0.0 from 42% to 54%. A Jira ticket was opened - PERF-541. All other modules behave stably during Data Import.
  17/05/2023: in accordance with the description of PERF-541, a series of tests was performed. The memory growth for mod-source-record-manager was not significant and stabilized after some time. Heap dump analysis was performed for all modules and did not reveal any memory leaks.
- Most CPU-consuming modules: mod-quick-marc - 79%, mod-source-record-storage - 74%, mod-inventory - 69%, mod-source-record-manager - 67%; all other modules use less than 30%.
* MODSOURCE-581 (SPIKE: Multiple tenant DI testing - import jobs are hanging; CLOSED) is reproducible for the Orchid release with the following module configuration: mod-source-record-storage - cpu:1024, memory:4096/3688, DB_MAXPOOLSIZE=30, DB_CONNECTION_TIMEOUT=40; mod-source-record-manager - cpu:1024, memory:4096/3688, DB_MAXPOOLSIZE=30. It is planned to be retested with an increased size of the database (PERF-544) and with all needed trigger functions in place (PERF-547).
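The DB_MAXPOOLSIZE and DB_CONNECTION_TIMEOUT values above are module environment variables. As a minimal sketch only (not the modules' actual code), assuming they are wired into the Vert.x PostgreSQL client that produces the "Timeout" errors quoted in the multitenant log excerpts further below, such settings typically map onto pool options like this:

```java
// Minimal sketch, not the modules' actual code: how DB_MAXPOOLSIZE and
// DB_CONNECTION_TIMEOUT are typically fed into the Vert.x PostgreSQL client.
import io.vertx.core.Vertx;
import io.vertx.pgclient.PgConnectOptions;
import io.vertx.pgclient.PgPool;
import io.vertx.sqlclient.PoolOptions;

public class DiDbPoolSketch {

  public static PgPool createPool(Vertx vertx) {
    // Connection settings: placeholder defaults, normally injected per environment.
    PgConnectOptions connectOptions = new PgConnectOptions()
        .setHost(System.getenv().getOrDefault("DB_HOST", "localhost"))
        .setPort(Integer.parseInt(System.getenv().getOrDefault("DB_PORT", "5432")))
        .setDatabase(System.getenv().getOrDefault("DB_DATABASE", "folio"))
        .setUser(System.getenv().getOrDefault("DB_USERNAME", "folio"))
        .setPassword(System.getenv().getOrDefault("DB_PASSWORD", "folio"));

    // DB_MAXPOOLSIZE=30 in the tested configuration: the upper bound on
    // concurrent DB connections per module instance.
    PoolOptions poolOptions = new PoolOptions()
        .setMaxSize(Integer.parseInt(System.getenv().getOrDefault("DB_MAXPOOLSIZE", "30")));

    // DB_CONNECTION_TIMEOUT=40 is assumed to bound how long a caller waits for a
    // connection; exceeding such a limit is what surfaces as the
    // "io.vertx.core.impl.NoStackTraceThrowable: Timeout" entries in the logs below.
    return PgPool.pool(vertx, connectOptions, poolOptions);
  }
}
```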
Recommendations & Jiras (Optional)
Jiras
- MODSOURMAN-982 - Do not process chunks when the DI job is completed
- PERF-541 - Investigate potential memory leak for DI modules
- MODDATAIMP-809 - Investigate why records are discarded for jobs completed with errors
Test Runs & Results
Job Profile "KG Create authority" - https://bugfest-nolana.int.aws.folio.org/settings/data-import/job-profiles/view/d3271c74-97ec-4dd9-9470-97b2154d63fd?query=KG&sort=name
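To inspect the job profile definition used in these runs, it can be fetched through Okapi. A minimal sketch, assuming the standard /data-import-profiles/jobProfiles path of the converter storage API; the Okapi URL, tenant, and token are placeholders:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class JobProfileLookupSketch {
  public static void main(String[] args) throws Exception {
    // Profile id copied from the link above; Okapi URL, tenant, and token are placeholders.
    String profileId = "d3271c74-97ec-4dd9-9470-97b2154d63fd";
    HttpRequest request = HttpRequest.newBuilder(
            URI.create("https://okapi.example.org/data-import-profiles/jobProfiles/" + profileId))
        .header("X-Okapi-Tenant", "fs07000002")
        .header("X-Okapi-Token", "<okapi-token>")
        .GET()
        .build();
    HttpResponse<String> response =
        HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
    System.out.println(response.body()); // JSON definition of the "KG Create authority" profile
  }
}
```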
Baseline test
Test # | # of records | % with updates | % creates | File | Time it takes to complete import |
---|---|---|---|---|---|
1 | 1,000 | 0 | 100 | https://folio-org.atlassian.net/wiki/download/attachments/1385982/1k_marc_authority.mrc?api=v2 | 14 sec |
2 | 5,000 | 0 | 100 | https://folio-org.atlassian.net/wiki/download/attachments/1385982/LC_SUBJ_msplit00000000.mrc?api=v2 | 55 sec |
3 | 10,000 | 0 | 100 | https://folio-org.atlassian.net/wiki/download/attachments/1385982/msplit00000000.mrc?api=v2 | 1 min 59 sec |
4 | 22,778 | 0 | 100 | https://folio-org.atlassian.net/wiki/download/attachments/1385982/msplit00000013.mrc?api=v2 | 4 min 31 sec |
5 | 50,000 | 0 | 100 | https://folio-org.atlassian.net/wiki/download/attachments/1385982/50000_authorityrecords.mrc?api=v2 | 9 min 48 sec |
Test with CICO 5 concurrent users
Test # | # of records | Time to complete import | CI Avg, sec | CI Avg delta vs baseline | CI 95th pct, sec | CI 95th pct delta vs baseline | CO Avg, sec | CO Avg delta vs baseline | CO 95th pct, sec | CO 95th pct delta vs baseline |
---|---|---|---|---|---|---|---|---|---|---|
1 | 1,000 | 14 sec | 0.585 | +21% | 0.778 | +37% | 1.012 | +34% | 1.426 | +62% |
2 | 5,000 | 56 sec | 0.914 | +90% | 1.467 | +157% | 1.305 | +73% | 2.403 | +173% |
3 | 10,000 | 1 min 54 sec | 0.907 | +89% | 1.759 | +209% | 1.408 | +86% | 2.721 | +209% |
4 | 22,778 | 4 min 32 sec | 0.853 | +78% | 1.616 | +184% | 1.425 | +89% | 2.497 | +183% |
5 | 50,000 | 9 min 37 sec | 0.862 | +80% | 1.471 | +158% | 1.510 | +100% | 2.403 | +173% |
Baseline | Avg, sec | 95th pct, sec |
---|---|---|
CI | 0.480 | 0.569 |
CO | 0.755 | 0.881 |
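The baseline delta columns in the table above appear to be computed relative to these baselines as

\[
\Delta = \frac{t_{\text{during DI}} - t_{\text{baseline}}}{t_{\text{baseline}}} \times 100\%
\]

e.g., for the CO average in test 1: (1.012 − 0.755) / 0.755 ≈ +34%, which matches the reported value.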
Multitenant testing
- tests 1-5: DI on each tenant consecutively (5 jobs from 3 tenants = 15 test runs)
- tests 6-8: DI jobs from two tenants simultaneously with a 1 min ramp-up
- test 9: DI jobs from 3 tenants simultaneously with a 1 min ramp-up
Test # | # of records | Tenant ptf-ncp5-00 time | Comment | Tenant ptf-ncp5-01 time | Comment | Tenant ptf-ncp5-02 time | Comment |
---|---|---|---|---|---|---|---|
1 | 1,000 | 15 sec | COMMITTED | 56 sec / 17 sec | 1 time COMMITTED / others ERROR | 13 sec - 30 min | ERROR, one of the jobs was stuck for 30 min* |
2 | 5,000 | 1 min | COMMITTED | 58 sec | 1 time COMMITTED / others ERROR | 47 sec - 55 min | 1 time COMMITTED / others ERROR |
3 | 10,000 | 2 min 02 sec | COMMITTED | 1 min 36 sec | 1 time COMMITTED | 19 min 22 sec | ERROR |
4 | 22,778 | 4 min 20 sec | COMMITTED | 11 min 52 sec | ERROR | - | - |
5 | 50,000 | 9 min 53 sec | COMMITTED | 3 min 56 sec | ERROR | - | - |
6 | Tenant-00 + Tenant-01, 50,000 records | Stopped by user | | | | | |
7 | Tenant-01 + Tenant-02, 50,000 records | Stopped by user | | | | | |
8 | Tenant-00 + Tenant-02, 50,000 records | Stopped by user | | | | | |
9 | Tenant-00 + Tenant-01 + Tenant-02, 50,000 records | Stopped by user | | | | | |
Jobs were always successful for tenant ptf-ncp5-00. For the other 2 tenants, jobs completed with errors where all records were discarded; occasionally only a single run of a given DI file succeeded.
Jobs for 2 or 3 tenants simultaneously were tested but never finished because the error MODSOURCE-581 occurred. As those jobs were stopped by the user after the error (about 10-15% done after 5 hours), their results are not meaningful.
Multitenant testing errors and warnings:
11:16:50 [] [] [] [] ERROR KafkaConsumerWrapper businessHandlerCompletionHandler:: Error while processing a record - id: 2 subscriptionPattern: SubscriptionDefinition(eventType=DI_PARSED_RECORDS_CHUNK_SAVED, subscriptionPattern=ncp5\.Default\.\w{1,}\.DI_PARSED_RECORDS_CHUNK_SAVED) offset: 1947 |
io.vertx.core.impl.NoStackTraceThrowable: Timeout |
11:16:50 [] [] [] [] WARN taImportKafkaHandler handle:: Error with database during collecting of deduplication info for handlerId: 6713adda-72ce-11ec-90d6-0242ac120003 , eventId: e4a75577-b3b0-4404-bd2f-f9586fd412c3. |
io.vertx.core.impl.NoStackTraceThrowable: Timeout |
11:16:50 [] [] [] [] ERROR KafkaConsumerWrapper businessHandlerCompletionHandler:: Error while processing a record - id: 3 subscriptionPattern: SubscriptionDefinition(eventType=DI_COMPLETED, subscriptionPattern=ncp5\.Default\.\w{1,}\.DI_COMPLETED) offset: 68691 |
io.vertx.core.impl.NoStackTraceThrowable: Timeout |
11:16:50 [] [] [] [] WARN tHandlingServiceImpl handle:: Failed to handle DI_COMPLETED event |
io.vertx.core.impl.NoStackTraceThrowable: Timeout |
11:16:50 [] [] [] [] WARN rdChunksKafkaHandler handle:: RecordsBatchResponse processing has failed with errors chunkId: f5a92a02-86ce-4afa-aeab-7931f1fd13c6 chunkNumber: 742 jobExecutionId: ff56fb28-5c8a-4109-95eb-33bb0dbed57c |
io.vertx.core.impl.NoStackTraceThrowable: Timeout |
12:07:05 [] [] [] [] ERROR KafkaConsumerWrapper businessHandlerCompletionHandler:: Error while processing a record - id: 13 subscriptionPattern: SubscriptionDefinition(eventType=DI_SRS_MARC_AUTHORITY_RECORD_CREATED, subscriptionPattern=ncp5\.Default\.\w{1,}\.DI_SRS_MARC_AUTHORITY_RECORD_CREATED) offset: 9181 |
io.vertx.core.impl.NoStackTraceThrowable: handle:: Failed to process data import event payload from topic 'ncp5.Default.fs07000002.DI_SRS_MARC_AUTHORITY_RECORD_CREATED' by jobExecutionId: '719bcf8f-0017-4b92-93b8-b85e46566634' with recordId: 'f3360e49-908e-4bbf-9c0e-45d811b9863a' and chunkId: '1a556829-8290-4e8c-ab8e-eb15d19af624' |
12:07:05 [] [] [] [] WARN KafkaConsumerWrapper businessHandlerCompletionHandler:: Error handler has not been implemented for subscriptionPattern: SubscriptionDefinition(eventType=DI_SRS_MARC_AUTHORITY_RECORD_CREATED, subscriptionPattern=ncp5\.Default\.\w{1,}\.DI_SRS_MARC_AUTHORITY_RECORD_CREATED) failures |
12:07:05 [] [] [] [] ERROR KafkaConsumerWrapper businessHandlerCompletionHandler:: Error while processing a record - id: 13 subscriptionPattern: SubscriptionDefinition(eventType=DI_SRS_MARC_AUTHORITY_RECORD_CREATED, subscriptionPattern=ncp5\.Default\.\w{1,}\.DI_SRS_MARC_AUTHORITY_RECORD_CREATED) offset: 9181 |
io.vertx.core.impl.NoStackTraceThrowable: handle:: Failed to process data import event payload from topic 'ncp5.Default.fs07000002.DI_SRS_MARC_AUTHORITY_RECORD_CREATED' by jobExecutionId: '719bcf8f-0017-4b92-93b8-b85e46566634' with recordId: 'f3360e49-908e-4bbf-9c0e-45d811b9863a' and chunkId: '1a556829-8290-4e8c-ab8e-eb15d19af624' |
12:07:05 [] [] [] [] WARN AbstractConfig These configurations '[ssl.protocol, ssl.keystore.location, ssl.truststore.type, ssl.keystore.type, ssl.truststore.location, ssl.keystore.password, ssl.key.password, ssl.truststore.password, ssl.endpoint.identification.algorithm]' were supplied but are not used yet. |
12:05:46 [636406/data-import-profiles] [fs07000002] [90aad488-be59-4879-b63b-2f8f13b08e85] [mod_di_converter_storage] WARN CQL2PgJSON Doing LIKE search without index for job_profiles.jsonb->>'hidden', CQL >>> SQL: hidden == false >>> lower(f_unaccent(job_profiles.jsonb->>'hidden')) LIKE lower(f_unaccent('false')) |
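For reference, the subscriptionPattern values in these messages are per-event-type regular expressions that match tenant-specific topics of the form <env>.Default.<tenant>.<eventType>, so a single consumer subscription covers all tenants in the environment. A minimal sketch (topic names are illustrative) of how such a pattern is evaluated:

```java
import java.util.regex.Pattern;

public class SubscriptionPatternSketch {
  public static void main(String[] args) {
    // Pattern copied from the DI_COMPLETED log entry above; the \w{1,} part makes
    // one consumer subscription cover every tenant in the ncp5 environment.
    Pattern diCompleted = Pattern.compile("ncp5\\.Default\\.\\w{1,}\\.DI_COMPLETED");

    // Illustrative topic names (the fs07000002 tenant id appears in the log excerpts).
    System.out.println(diCompleted.matcher("ncp5.Default.fs07000002.DI_COMPLETED").matches()); // true
    System.out.println(diCompleted.matcher("ncp5.Default.fs07000002.DI_ERROR").matches());     // false
  }
}
```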
Memory Utilization
Memory utilization grows for 3 modules:
- mod-source-record-manager:3.6.2 from 83% to 105%.
- mod-source-record-storage:5.6.5 from 86% to 89%.
- mod-inventory-storage:26.0.0 from 42% to 54%.
A Jira ticket was opened - PERF-541.
All other modules behave stably during Data Import.
*This test was performed after a run of 2 sets of the same jobs (1k, 5k, 10k, 22.7k, 50k records twice)
Service CPU Utilization
*On the chart below, each small spike corresponds to one DI job.
**Some spikes are shorter than others because of differences in the number of records imported.
**Test #1 shows higher CPU usage because of background activity (5 CICO users running alongside DI).
Most CPU-consuming modules:
- mod-quick-marc - 79%
- mod-source-record-storage - 74%
- mod-inventory - 69%
- mod-source-record-manager - 67%
- others - usage less than 30%
Instance CPU Utilization
RDS CPU Utilization
As expected, each DI job consumes a lot of DB CPU (each spike corresponds to one DI job).
DB CPU usage is approximately 96%.
Appendix
Infrastructure
PTF environment: ncp5
- 9 m6i.2xlarge EC2 instances located in US East (N. Virginia), us-east-1
- 2 db.r6.xlarge database instances, one reader and one writer
- MSK ptf-kafka-3
  - 4 m5.2xlarge brokers in 2 zones
  - Apache Kafka version 2.8.0
  - EBS storage volume per broker: 300 GiB
  - auto.create.topics.enable=true
  - log.retention.minutes=480
  - default.replication.factor=3
- Kafka topic partitioning (partitions per DI topic):
  - DI_RAW_RECORDS_CHUNK_READ - 2
  - DI_RAW_RECORDS_CHUNK_PARSED - 2
  - DI_PARSED_RECORDS_CHUNK_SAVED - 2
  - DI_SRS_MARC_AUTHORITY_RECORD_CREATED - 2
  - DI_COMPLETED - 2
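For illustration only, the partitioning above could be applied with the Kafka AdminClient. A minimal sketch; the bootstrap servers are placeholders and the per-tenant topic names follow the <env>.Default.<tenant>.<eventType> convention seen in the log excerpts (whether topics are pre-created this way or via auto.create.topics.enable is not covered by this report):

```java
import java.util.List;
import java.util.Properties;
import java.util.stream.Collectors;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class DiTopicsSketch {
  public static void main(String[] args) throws Exception {
    Properties props = new Properties();
    // Placeholder bootstrap servers; the real MSK endpoints are not part of this report.
    props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "b-1.example:9092");

    List<String> eventTypes = List.of(
        "DI_RAW_RECORDS_CHUNK_READ",
        "DI_RAW_RECORDS_CHUNK_PARSED",
        "DI_PARSED_RECORDS_CHUNK_SAVED",
        "DI_SRS_MARC_AUTHORITY_RECORD_CREATED",
        "DI_COMPLETED");

    try (AdminClient admin = AdminClient.create(props)) {
      // 2 partitions per DI topic and default.replication.factor=3, as listed above.
      List<NewTopic> topics = eventTypes.stream()
          .map(type -> new NewTopic("ncp5.Default.fs07000002." + type, 2, (short) 3))
          .collect(Collectors.toList());
      admin.createTopics(topics).all().get();
    }
  }
}
```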
Modules memory and CPU parameters
Module | Version | Task Definition | Running Tasks | CPU | Memory, MiB | MemoryReservation, MiB | MaxMetaspaceSize, MB | Xmx, MB |
---|---|---|---|---|---|---|---|---|
mod-data-import | 2.7.1 | 8 | 1 | 256 | 2048 | 1844 | 512 | 1292 |
mod-di-converter-storage | 2.0.2 | 5 | 2 | 128 | 1024 | 896 | 128 | 768 |
mod-source-record-storage | 5.6.5 | 24 | 2 | 1024 | 4096 | 3688 | 512 | 3076 |
mod-source-record-manager | 3.6.2 | 14 | 2 | 1024 | 4096 | 3688 | 512 | 3076 |
mod-inventory-storage | 26.0.0 | 10 | 2 | 1024 | 2208 | 1952 | 384 | 1440 |
mod-inventory | 20.0.4 | 8 | 2 | 1024 | 2880 | 2592 | 512 | 1814 |
Methodology/Approach
JMeter scripts were used to run the baseline DI tests and the DI with CICO (5 concurrent users) tests, as illustrated by the sketch below.
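The JMeter scripts themselves are not part of this page. For illustration only, the CICO workload shape (5 concurrent users repeatedly checking items out and in through Okapi while DI runs) can be sketched as follows; the endpoint paths, barcodes, service point id, Okapi URL, tenant, and token are placeholders and assumptions rather than values taken from the actual scripts:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Instant;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class CicoLoadSketch {
  // Placeholder values; real ones come from the environment and the JMeter scripts.
  static final String OKAPI = "https://ncp5-okapi.example.org";
  static final String TENANT = "fs07000002";
  static final String TOKEN = "<okapi-token>";

  public static void main(String[] args) {
    HttpClient client = HttpClient.newHttpClient();
    ExecutorService users = Executors.newFixedThreadPool(5); // 5 concurrent CICO users
    for (int u = 0; u < 5; u++) {
      users.submit(() -> {
        while (!Thread.currentThread().isInterrupted()) {
          // Check-out followed by check-in for the same (placeholder) item barcode.
          post(client, "/circulation/check-out-by-barcode",
              "{\"itemBarcode\":\"item-0001\",\"userBarcode\":\"user-0001\","
                  + "\"servicePointId\":\"00000000-0000-0000-0000-000000000000\"}");
          post(client, "/circulation/check-in-by-barcode",
              "{\"itemBarcode\":\"item-0001\","
                  + "\"servicePointId\":\"00000000-0000-0000-0000-000000000000\","
                  + "\"checkInDate\":\"" + Instant.now() + "\"}");
        }
      });
    }
  }

  static void post(HttpClient client, String path, String body) {
    try {
      HttpRequest request = HttpRequest.newBuilder(URI.create(OKAPI + path))
          .header("Content-Type", "application/json")
          .header("X-Okapi-Tenant", TENANT)
          .header("X-Okapi-Token", TOKEN)
          .POST(HttpRequest.BodyPublishers.ofString(body))
          .build();
      HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
      System.out.println(path + " -> " + response.statusCode());
    } catch (Exception e) {
      Thread.currentThread().interrupt();
    }
  }
}
```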
Multitenant testing
- tests 1-5: DI on each tenant consecutively (5 jobs from 3 tenants = 15 test runs)
- tests 6-8: DI jobs from two tenants simultaneously with a 1 min ramp-up
- test 9: DI jobs from 3 tenants simultaneously with a 1 min ramp-up