...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
In Progress
Overview
...
Splitting feature documentation: Detailed Release Notes for Data Import Splitting Feature
Summary
- Duration for DI correlates with number of the records imported (100k records- 38 min, 250k - 1 hour 32 min, 500k - 3 hours 29 min).
- Multitenant DI could be performed successfully for up to 9 jobs in parallel. Big jobs start one by one, in order, for each tenant but are processed in parallel on 3 tenants. A small DI (1 record) could finish sooner, out of order. Check-In/Check-Out takes about twice as long during DI.
- Memory utilization increases because the modules had been restarted beforehand (daily cluster shutdown process); no memory leak is suspected in any of the modules.
- Average CPU usage was 144% for mod-inventory, about 107% for mod-di-converter-storage, and did not exceed 100% for all other modules. CPU usage of mod-data-import spikes up to 260% at the beginning of the Data Import jobs.
- DB CPU usage peaks at approximately 95%.
Recommendations and Jiras
1) One record on one tenant could be discarded with error: io.netty.channel.StacklessClosedChannelException.
Reproduces both with and without the splitting feature in at least 30% of test runs with 500k record files and multitenant testing. Jira: MODDATAIMP-748
2) During the new Data Import splitting feature testing, items for update were discarded with the error: io.vertx.core.impl.NoStackTraceThrowable: Cannot get actual Item by id: org.folio.inventory.exceptions.InternalServerErrorException: Access for user 'data-import-system-user' (f3486d35-f7f7-4a69-bcd0-d8e5a35cb292) requires permission: inventory-storage.items.item.get. Less than 1% of records could be discarded due to the missing permission for 'data-import-system-user'. The permission was not added automatically during the service deployment. I added the permission manually to the database and the error no longer occurs. Jira: MODDATAIMP-930
...
- The feature is stable and makes DI jobs more robust even with the current infrastructure configuration. If there are failures, it is now easier to find the exact failed records and act on them.
- No jobs got stuck in any of the tests performed.
- There were errors (see below) in some partial jobs, but they still completed, so the entire job status is "Completed with error".
- Both kinds of imports, create and update of MARC BIBs, worked well with the file-splitting feature enabled and also with it disabled.
- (At this point) There is no performance degradation on single-tenant imports; jobs are not getting slower. On multi-tenant imports, performance is a little better (in Test 2, 2 hours 40 min with the feature enabled versus 3 hours 1 min with it disabled).
- Duration for DI correlates with number of the records imported (100k records- 38 min, 250k - 1 hour 32 min, 500k - 3 hours 29 min).
- Multitenant DI could be performed successfully for up to 9 jobs in parallel. Big jobs start one by one, in order, for each tenant but are processed in parallel on 3 tenants. A small DI (1 record) could finish sooner, out of order.
- No memory leak is suspected in any of the modules.
- Average CPU usage was 144% for mod-inventory, about 107% for mod-di-converter-storage, and did not exceed 100% for all other modules. CPU usage of mod-data-import spikes up to 260% at the beginning of the Data Import jobs. This is a big improvement over the previous version (without file-splitting) for 500K imports, where mod-di-converter-storage's CPU utilization was 462% and other modules ranged from above 100% up to 150%.
- DB CPU usage peaks at approximately 95%.
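The correlation between duration and record count can be checked by turning the reported durations into average throughput. The sketch below uses only the three MARC Create figures quoted in the summary; the names `REPORTED_RUNS` and `records_per_minute` are illustrative and not part of any FOLIO module.

```python
# Durations reported in the summary for imports with splitting enabled.
# Keys are record counts, values are durations in minutes.
REPORTED_RUNS = {
    100_000: 38,            # 38 min
    250_000: 60 + 32,       # 1 hour 32 min
    500_000: 3 * 60 + 29,   # 3 hours 29 min
}


def records_per_minute(records: int, minutes: int) -> float:
    """Average throughput of one import job."""
    return records / minutes


throughput = {n: records_per_minute(n, m) for n, m in REPORTED_RUNS.items()}

for n, rate in throughput.items():
    print(f"{n:>7} records: {rate:,.0f} records/min")
```

All three runs land in a band of roughly 2,400 to 2,700 records per minute, which is what a near-linear relationship between file size and duration would look like.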
Recommendations and Jiras
- One record on one tenant could be discarded with error: io.netty.channel.StacklessClosedChannelException. Reproduces both with and without the splitting feature enabled in at least 30% of test runs with 500k record files and multitenant testing. Jira: MODDATAIMP-748
- During the new Data Import splitting feature testing, items for update were discarded with the error: io.vertx.core.impl.NoStackTraceThrowable: Cannot get actual Item by id: org.folio.inventory.exceptions.InternalServerErrorException: Access for user 'data-import-system-user' (f3486d35-f7f7-4a69-bcd0-d8e5a35cb292) requires permission: inventory-storage.items.item.get. Less than 1% of records could be discarded due to the missing permission for 'data-import-system-user'. The permission was not added automatically during the service deployment. I added the permission manually to the database and the error no longer occurs. Jira: MODDATAIMP-930
- UI issue: when a job is canceled or completed with error, its progress bar cannot be deleted from the screen. Jira: MODDATAIMP-929
- Usage:
- RECORDS_PER_SPLIT_FILE should not be set below 1000. The system is stable enough to ingest 1000 records consistently, and smaller values incur more overhead, resulting in longer job durations.
- When toggling the file-splitting feature, the tasks of mod-source-record-storage and mod-source-record-manager need to be restarted.
- Keep the Kafka brokers' disk size in mind: bigger jobs (up to 500K) can be run now, and consecutive jobs may use up the disk quickly because the message retention time is currently set to 8 hours. For example, with a 300GB disk, consecutive jobs of 250K, 500K, and 500K records will exhaust the disk.
- More CPU could be allocated to mod-inventory and mod-di-converter-storage.
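To make the overhead point about RECORDS_PER_SPLIT_FILE concrete, the sketch below counts how many partial jobs one file produces for a given setting. It assumes the file is cut into consecutive chunks of at most RECORDS_PER_SPLIT_FILE records, which this report does not spell out; `split_file_count` is a hypothetical helper, not a FOLIO API.

```python
def split_file_count(total_records: int, records_per_split_file: int) -> int:
    """Number of chunks when a file is cut into pieces of at most
    records_per_split_file records (assumed splitting behaviour)."""
    if total_records < 0 or records_per_split_file <= 0:
        raise ValueError("record counts must be positive")
    # Ceiling division without floating point.
    return -(-total_records // records_per_split_file)


# A 500K-record file under three candidate settings.
for size in (100, 1_000, 10_000):
    print(f"RECORDS_PER_SPLIT_FILE={size}: "
          f"{split_file_count(500_000, size)} partial jobs")
```

Under that assumption a 500K file yields 5000 partial jobs at a setting of 100, 500 at 1000, and 50 at 10K, so any fixed per-chunk cost is paid ten times more often for each tenfold reduction of the setting.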
Results
Test # | Data import test | Profile | Splitting Feature Enabled: duration | Results | Splitting Feature Disabled: duration | Results | Before Splitting Feature released: duration | Results |
---|---|---|---|---|---|---|---|---|
1 | 100K MARC Create | PTF - Create 2 | 37 min - 39 min | Completed | 40 min | Completed | 32 - 33 min | Completed |
1 | 250K MARC Create | PTF - Create 2 | 1 hour 32 min | Completed | 1 hour 41 min | Completed | 1 hour 33 min - 1 hour 57 min | Completed |
1 | 500K MARC Create | PTF - Create 2 | 3 hours 29 min | Completed* | 3 hours 55 min | Completed | 3 hours 33 min | Completed |
2 | Multitenant MARC Create (100k, 50k, and 1 record) | PTF - Create 2 | 2 hours 40 min | Completed* | 3 hours 1 min | Completed | | |
3 | CI/CO + DI MARC Create (20 users CI/CO, 25k records DI on 3 tenants) | PTF - Create 2 | 24 min | Completed* | | | | |
4 | 100K MARC Update (Create new file) | PTF - Updates Success - 1 | 58 min 25 sec; 57 min 19 sec | Completed | 1 hour 3 min | Completed | - | - |
4 | 250K MARC Update | PTF - Updates Success - 1 | 2 hours 2 min **; 2 hours 12 min | Completed with errors **; Completed | 1 hour 53 min | Completed | - | - |
4 | 500K MARC Update | PTF - Updates Success - 1 | 4 hours 43 min; 4 hours 38 min | Completed; Completed | 5 hours 59 min | Completed | - | - |
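The enabled and disabled columns can be compared as a relative saving. The sketch below does this for the MARC Create rows that report a single duration in both columns; the values are copied from the table, while `CREATE_RUNS` and `percent_faster` are illustrative names.

```python
# Durations in minutes from the MARC Create rows of the results table.
# Tuple layout: (splitting enabled, splitting disabled).
CREATE_RUNS = {
    "250K MARC Create": (60 + 32, 60 + 41),
    "500K MARC Create": (3 * 60 + 29, 3 * 60 + 55),
    "Multitenant MARC Create": (2 * 60 + 40, 3 * 60 + 1),
}


def percent_faster(enabled_min: int, disabled_min: int) -> float:
    """Share of the disabled-run duration saved with the feature enabled."""
    return (disabled_min - enabled_min) / disabled_min * 100


savings = {name: percent_faster(*mins) for name, mins in CREATE_RUNS.items()}

for name, pct in savings.items():
    print(f"{name}: {pct:.1f}% shorter with splitting enabled")
```

For these rows the saving is about 9% to 12%. The MARC Update rows list two runs each and are left out of this sketch.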
...
** - up to 10 items were discarded with the error: io.vertx.core.impl.NoStackTraceThrowable: Cannot get actual Item by id: org.folio.inventory.exceptions.InternalServerErrorException: Access for user 'data-import-system-user' (f3486d35-f7f7-4a69-bcd0-d8e5a35cb292) requires permission: inventory-storage.items.item.get. Less than 1% of records could be discarded due to the missing permission for 'data-import-system-user'. The permission was not added automatically during the service deployment. I added the permission manually to the database and the error no longer occurs. Jira: MODDATAIMP-930
...
Memory utilization reached its maximum values of 88% for mod-source-record-storage-b and 85% for mod-source-record-manager-b.
Test 2. Concurrent jobs on 1, 2, and 3 tenants with configuration RECORDS_PER_SPLIT_FILE = 10K, 2 runs for each test.
...
- tenant0_mod_source_record_storage.marc_records_lb = 9674629
- tenant2_mod_source_record_storage.marc_records_lb = 0
- tenant3_mod_source_record_storage.marc_records_lb = 0
- tenant0_mod_source_record_storage.raw_records_lb = 9604805
- tenant2_mod_source_record_storage.raw_records_lb = 0
- tenant3_mod_source_record_storage.raw_records_lb = 0
- tenant0_mod_source_record_storage.records_lb = 9674677
- tenant2_mod_source_record_storage.records_lb = 0
- tenant3_mod_source_record_storage.records_lb = 0
- tenant0_mod_source_record_storage.marc_indexers = 620042011
- tenant2_mod_source_record_storage.marc_indexers = 0
- tenant3_mod_source_record_storage.marc_indexers = 0
- tenant0_mod_source_record_storage.marc_indexers with field_no 010 = 3285833
- tenant2_mod_source_record_storage.marc_indexers with field_no 010 = 0
- tenant3_mod_source_record_storage.marc_indexers with field_no 010 = 0
- tenant0_mod_source_record_storage.marc_indexers with field_no 035 = 19241844
- tenant2_mod_source_record_storage.marc_indexers with field_no 035 = 0
- tenant3_mod_source_record_storage.marc_indexers with field_no 035 = 0
- tenant0_mod_inventory_storage.authority = 4
- tenant2_mod_inventory_storage.authority = 0
- tenant3_mod_inventory_storage.authority = 0
- tenant0_mod_inventory_storage.holdings_record = 9592559
- tenant2_mod_inventory_storage.holdings_record = 16
- tenant3_mod_inventory_storage.holdings_record = 16
- tenant0_mod_inventory_storage.instance = 9976519
- tenant2_mod_inventory_storage.instance = 32
- tenant3_mod_inventory_storage.instance = 32
- tenant0_mod_inventory_storage.item = 10787893
- tenant2_mod_inventory_storage.item = 19
- tenant3_mod_inventory_storage.item = 19
PTF - environment ocp3
- 10 m6i.2xlarge EC2 instances located in US East (N. Virginia), us-east-1
- 2 database instances, one reader and one writer
Name | API Name | Memory GiB | vCPUs | max_connections |
---|---|---|---|---|
R6G Extra Large | db.r6g.xlarge | 32 GiB | 4 vCPUs | 2731 |
- MSK ptf-kakfa-3
- 4 m5.2xlarge brokers in 2 zones
- Apache Kafka version 2.8.0
- EBS storage volume per broker 300 GiB
- auto.create.topics.enable=true
- log.retention.minutes=480
- default.replication.factor=3
- Kafka topics partitioning: 2 partitions for DI topics
...