Data Import Test Report (Kiwi)

Testing found that the actual import durations were about two times longer than originally reported. The PTF environment was missing a DB trigger; once the trigger was restored, import durations doubled.
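
As a rough illustration of the correction described above, the sketch below scales a few reported durations by the stated factor of about two. It assumes the factor applies uniformly to every import and that the sample values (the first CREATE run at each size in the Import Durations table) are uncorrected figures; neither assumption is confirmed by this report, and the function name is hypothetical.

```python
# Sketch only: scale reported durations by the approximate 2x factor.
# Assumption: the factor applies uniformly to all imports.
CORRECTION_FACTOR = 2.0  # "about two times longer", per the note above


def estimate_actual_minutes(reported_minutes, factor=CORRECTION_FACTOR):
    """Return an estimated actual duration for a reported duration."""
    return reported_minutes * factor


# First CREATE run at each size, taken from the Import Durations table
# and assumed here to be uncorrected values.
reported = {"5K MARC Create": 5, "10K MARC Create": 11, "20K MARC Create": 20}

estimated = {name: estimate_actual_minutes(m) for name, m in reported.items()}
for name, minutes in estimated.items():
    print(f"{name}: reported {reported[name]} min, estimated ~{minutes:.0f} min")
```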

Overview

This document contains the results of testing Data Import in Kiwi and compares them against Hot Fix 3's results to detect performance trends.

Infrastructure

  • 6 m5.xlarge EC2 instances 
  • 2 db.r6.xlarge database instances, one reader and one writer
  • MSK
    • 4 m5.2xlarge brokers in 2 zones
    • auto.create.topics.enable=true
    • log.retention.minutes=120
  • mod-inventory
    • 256 CPU units, 1814MB mem
    • inventory.kafka.DataImportConsumerVerticle.instancesNumber=10
    • inventory.kafka.MarcBibInstanceHridSetConsumerVerticle.instancesNumber=10
    • kafka.consumer.max.poll.records=10
  • mod-inventory-storage
    • 128 CPU units, 544MB mem
  • mod-source-record-storage
    • 128 CPU units, 908MB mem
  • mod-source-record-manager
    • 128 CPU units, 1292MB mem
  • mod-data-import
    • 128 CPU units, 1024MB mem

Software versions

  • mod-data-import v2.2.0
  • mod-source-record-manager v3.2.3
  • mod-source-record-storage v5.2.1
  • mod-inventory v18.0.1
  • mod-inventory-storage v22.0.1

Results

Import durations are recorded in the following table. Several tests were run more than once; for completeness, all of their durations are listed, separated by commas.

Import Durations

Profiles used for PTF testing


| Test | Profile | KIWI | KIWI (with OL) |
|---|---|---|---|
| 5K MARC Create | PTF - Create 2 | 5 min, 8 min | 8 min |
| 5K MARC Update | PTF - Updates Success - 1 | 11 min, 13 min | 6 min |
| 10K MARC Create | PTF - Create 2 | 11 min, 14 min | 12 min |
| 10K MARC Update | PTF - Updates Success - 1 | 22 min, 24 min | 15 min |
| 20K MARC Create | PTF - Create 2 | 20 min, 28 min | 21 min |
| 20K MARC Update | PTF - Updates Success - 1 | 43 min, 50 min | 27 min |
| 25K MARC Create | PTF - Create 2 | 23 min, 25 min, 26 min | 24 min |
| 25K MARC Update | PTF - Updates Success - 1 | 1 hr 20 min (completed with errors) *, 56 min | 40 min |
| 50K MARC Create | PTF - Create 2 | Completed with errors, 1 hr 40 min | 43 min |
| 50K Update | PTF - Updates Success - 1 | 2 hr 32 min (job stuck at 76% completion) | 1 hr 4 min |

* This run used a faulty MARC file from the 25K Create import (which did not finish successfully). The test is meant to gauge how long a successful 25K UPDATE import might take.

High Level Summary

  1. Consistent CREATE and UPDATE imports were achieved for up to 25K MARC files. The import durations were consistent as well. All expected SRS, instances, holdings, and items were created.
  2. No CREATE or UPDATE import was achieved with 25K MARC files due to various errors. See MODSOURCE-417.
    1. *Update 12/1 - Further testing shows that even with a 5K import the issues in MODSOURCE-417 prevented CREATE imports from creating all expected holdings and items.
    2. After restarting ALL DI modules, DI jobs consistently completed successfully again.
    3. *Update 1/10/22 - Further testing shows that the SQL errors in MODSOURCE-438 also kill the current job and fail subsequent jobs, whether creates or updates.
  3. The events_cache topic is (still) present in Kiwi, and whenever it spikes, for whatever reason, it disrupts data import, leading to prolonged imports that rarely complete without errors. See MODINV-444, which was created when this behavior was observed in the Iris/Juniper releases.
  4. With Optimistic Locking enabled, import jobs seemed to finish faster than without, especially with updates. 
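
To make the Optimistic Locking comparison concrete, here is a minimal sketch that computes the ratio of the mean duration without OL to the duration with OL for the 5K-20K rows of the Import Durations table. The helper names (`to_minutes`, `ol_ratio`) are hypothetical, and averaging the repeated runs is an assumption about how to compare them, not a method stated in this report.

```python
import re
from statistics import mean


def to_minutes(text):
    """Convert strings such as '43 min' or '1 hr 4 min' to whole minutes."""
    hours = re.search(r"(\d+)\s*(?:hr|hour)", text)
    minutes = re.search(r"(\d+)\s*min", text)
    total = int(hours.group(1)) * 60 if hours else 0
    total += int(minutes.group(1)) if minutes else 0
    return total


# (test, runs without OL, run with OL), copied from the Import Durations table.
rows = [
    ("5K MARC Update", ["11 min", "13 min"], "6 min"),
    ("10K MARC Update", ["22 min", "24 min"], "15 min"),
    ("20K MARC Update", ["43 min", "50 min"], "27 min"),
    ("5K MARC Create", ["5 min", "8 min"], "8 min"),
    ("10K MARC Create", ["11 min", "14 min"], "12 min"),
    ("20K MARC Create", ["20 min", "28 min"], "21 min"),
]


def ol_ratio(runs_without_ol, run_with_ol):
    """Mean duration without OL divided by duration with OL (>1: OL was faster)."""
    return mean(to_minutes(r) for r in runs_without_ol) / to_minutes(run_with_ol)


ratios = {name: round(ol_ratio(without, with_ol), 2) for name, without, with_ol in rows}
for name, ratio in ratios.items():
    print(f"{name}: {ratio}x")
```

On these rows the UPDATE ratios are all well above 1, while the CREATE ratios sit close to 1, which matches the observation that the gain is most visible for updates.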

Resource Usages

For the most part, the CPU utilization picture is consistent across the various CREATE imports. The following describes the spikes that took place in the first 10 minutes of the imports.

  • mod-source-record-manager spikes up to around 500% for CREATE imports of up to 20K records, and to around 800% for 25K records. The more records, the harder mod-source-record-manager works.
  • mod-source-record-storage spikes up to around 400% for up to 20K CREATE, 570% for 25K
  • mod-inventory hovers around 300% for all CREATE jobs, up to 25K
  • mod-inventory-storage uses about 200%, rising gradually over the duration of the import. 
  • mod-data-import-cs spikes around 400% but averages 140%. 
  • mod-data-import has a brief spike to 80%.
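
Percentages above 100% are easier to read when converted to vCPUs. The sketch below does that conversion under two assumptions not stated in this report: that the "CPU units" in the Infrastructure section follow the Amazon ECS convention of 1024 units per vCPU, and that the utilization percentages are measured relative to each service's reserved units. The function name is hypothetical.

```python
# Assumption: ECS-style CPU units, where 1024 units equal one vCPU.
ECS_UNITS_PER_VCPU = 1024


def vcpus_used(reserved_units, utilization_pct):
    """Approximate vCPUs consumed, given reserved units and utilization in %."""
    return reserved_units * (utilization_pct / 100) / ECS_UNITS_PER_VCPU


# (reserved CPU units from Infrastructure, approximate spike % from the list above)
spikes = {
    "mod-source-record-manager": (128, 500),
    "mod-source-record-storage": (128, 400),
    "mod-inventory": (256, 300),
    "mod-inventory-storage": (128, 200),
}

usage = {name: vcpus_used(units, pct) for name, (units, pct) in spikes.items()}
for name, vcpus in usage.items():
    print(f"{name}: ~{vcpus:.3f} vCPU at spike")
```

Under these assumptions, even a 500% spike on a 128-unit reservation is well under one vCPU, so the large percentages mostly reflect small reservations.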

5K-10K Imports


EC2 Instance CPU Utilization

The EC2 instances approached 80% CPU utilization during the initial 10-minute spikes and averaged about 70%.


Service Memory Usage

Memory is stable during short/small-dataset imports. 


RDS CPU Utilization

Spikes approaching 90%, averaging around 70%.

20K Imports



Database




25K CREATE Import

In the 50K import we see a flat, stepped load across instances, as in the 25K test; its shape corresponds to the dynamic load distribution of the Java VM.

MSK/Kafka Resources



25K UPDATE import

EC2 Instance CPU Utilization


No visible memory issues during the update.

Database Metrics

No unusual CPU or memory utilization spikes seen during the 25K UPDATE imports.

MSK/Kafka Resources

The following graphs show the message rates of the Kafka topics. Note that events_cache spiked toward the end of the import and dwarfs the other topics' rates. This disruptive spike also corresponds to the spike by mod-inventory toward the end of the import. See MODINV-444 for more details on the general disruptiveness of such spikes.

Without the events_cache spike there are usually about 100-200 messages per second among the topics.


Other Observations

At times CREATE jobs finished "too soon": they were officially marked "Completed with errors" although there were no errors; the counts of instances, holdings, and item records had simply not yet reached the expected totals. In these cases, about 1-5 minutes later all records had been created successfully and all counts matched the expected values. MODSOURMAN-622 was created to address this issue.
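
When verifying an import by hand, one way to avoid being misled by this behavior is to re-check the record counts for a few minutes before declaring a failure. The sketch below is an illustration under stated assumptions: `wait_for_counts` and the `get_counts` callable are hypothetical (how the counts are actually fetched is not covered here), and the snapshot numbers are made up for the demo.

```python
import time


def wait_for_counts(get_counts, expected, timeout_s=300, poll_s=15,
                    sleep=time.sleep, clock=time.monotonic):
    """Poll get_counts() until every expected total is reached or time runs out.

    get_counts is a hypothetical callable returning a dict such as
    {"instances": 5000, "holdings": 5000, "items": 5000}.
    Returns True if all counts reached the expected totals, else False.
    """
    deadline = clock() + timeout_s
    while True:
        actual = get_counts()
        if all(actual.get(key, 0) >= total for key, total in expected.items()):
            return True
        if clock() >= deadline:
            return False
        sleep(poll_s)


# Demo with simulated snapshots (illustrative values, not measured data).
snapshots = iter([
    {"instances": 5000, "holdings": 4200, "items": 3900},
    {"instances": 5000, "holdings": 5000, "items": 5000},
])
expected = {"instances": 5000, "holdings": 5000, "items": 5000}
ok = wait_for_counts(lambda: next(snapshots), expected, sleep=lambda s: None)
print("all records present:", ok)
```

The default timeout of 300 seconds mirrors the 1-5 minute delay noted above; the `sleep` and `clock` parameters exist only so the loop can be exercised without real waiting.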