PTF - Data Import Creates + Updates multi tenant [non-ECS]

Overview

This document contains the results of testing concurrent Data Import (DI) with the file-splitting feature for MARC Bibliographic records in the Quesnelia release.

The purpose of this test is to determine how concurrent DI affects the duration of DI jobs on the central tenant, and to check for possible issues during a smoke test with a 50k DI Create job running concurrently on all 3 tenants.

Compared with Data Import Creates + Updates Poppy pcp1


Ticket: PERF-837

Summary

  • Data Import duration for 10k and 25k jobs scales with the number of concurrent DI jobs: it roughly doubles with 2 concurrent jobs (1 job per tenant) and is about 3 to 4 times longer with 3 concurrent jobs. This trend is consistent across tenants. The same trend was observed in Poppy.
  • The 50k smoke test with DI Create jobs revealed an issue with the mod-permissions module, which spiked; as a result, 4 split jobs ended with SNAPSHOT_UPDATE_ERROR at the beginning of the test on the third tenant (fs07000002). No error messages appeared in the UI. Reason: No action.
  • Average CPU utilization of the top two modules: during DI Create jobs, mod-inventory-b - 121% and mod-quick-marc-b - 92%; during DI Update jobs, mod-inventory-b - 150% and mod-quick-marc-b - 92%. Modules used fewer resources in Quesnelia. More information can be found in the Service CPU Utilization section.
  • Memory consumption: mod-inventory-b - 87%, mod-data-import-b - 80%, mod-source-record-storage-b - 73%, mod-permissions-b - 71%, which is slightly less than in Poppy.
  • RDS CPU utilization was 88% for all jobs, which is 10% less than in Poppy.
  • DB connections: Create jobs - 990 for 2 tenants and 1100 for 3 tenants; Update jobs - 950 for 2 tenants and 1000 for 3 tenants.
  • The top long-running query was SELECT jsonb FROM fs07000002_mod_permissions.permissions: 4275 rows/sec, average latency 1579 ms/call.
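
The scaling described in the first bullet can be checked against the durations listed in Test Results. The following is a minimal sketch, assuming durations are given as HH:MM:SS; the helper names `to_seconds` and `slowdown` are illustrative and not part of any FOLIO tooling.

```python
def to_seconds(hms: str) -> int:
    """Convert an HH:MM:SS duration string to whole seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def slowdown(baseline: str, concurrent: str) -> float:
    """Ratio of a concurrent-run duration to the single-job baseline."""
    return round(to_seconds(concurrent) / to_seconds(baseline), 2)


# Main tenant (fs09000000), DI Create 10k, durations taken from Test Results.
BASELINE = "00:05:01"
factors = {
    2: slowdown(BASELINE, "00:09:04"),  # 2 concurrent jobs
    3: slowdown(BASELINE, "00:14:25"),  # 3 concurrent jobs
}
print(factors)
```

The same helper can be applied to any other row, for example the 25k Create runs, to see how closely each file size follows the trend.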

Test Runs 

| Test # | Scenario | Load level |
|---|---|---|
| 1 - Concurrent Create imports | DI MARC Bib Create | 10K, 25K concurrently (with 5 min pause) on 2 and 3 tenants |
| 2 - Concurrent Update imports | DI MARC Bib Update | 10K, 25K concurrently (with 5 min pause) on 2 and 3 tenants |
| 3 - Concurrent Create imports ("smoke test") of 50K | DI MARC Bib Create | 50k concurrently on 3 tenants |

Test Results

The duration of DI jobs grows roughly in proportion to file size.

Smoke Test finished successfully for 3 concurrent DI Create jobs of 50K each.

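| Job type | File size | # of test | Number of concurrent jobs | Main tenant (fs09000000) | Second tenant (fs07000001) | Third tenant (fs07000002) |
|---|---|---|---|---|---|---|
| DI Create | 10k | Baseline | 1 | 00:05:01 | 00:05:01 | 00:05:00 |
| DI Create | 10k | 1 | 2 | 00:09:04 | 00:09:08 | |
| DI Create | 10k | 2 | 3 | 00:14:25 | 00:15:12 | 00:15:22 |
| DI Create | 25k | Baseline | 1 | 00:10:43 | 00:11:15 | 00:11:22 |
| DI Create | 25k | 3 | 2 | 00:22:44 | 00:22:47 | |
| DI Create | 25k | 4 | 3 | 00:40:32 | 00:40:17 | 00:40:25 |
| DI Update | 10k | Baseline | 1 | 00:06:25 | | |
| DI Update | 10k | 5 | 2 | 00:14:08 | 00:13:44 | |
| DI Update | 10k | 6 | 3 | 00:20:13 | 00:21:03 | 00:20:41 |
| DI Update | 25k | Baseline | 1 | 00:15:17 | | |
| DI Update | 25k | 7 | 2 | 00:35:27 | 00:35:34 | |
| DI Update | 25k | 8 | 3 | 00:58:49 | 00:58:59 | 00:59:05 |
| DI Create (Smoke test) | 50K | 9 | 1 | 00:22:44.2 | | |
| DI Create (Smoke test) | 50K | 10 | 3 | 01:21:37 | 01:22:49 | 01:22:49 |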
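
One way to sanity-check the proportionality claim is to compare throughput across file sizes: if duration were exactly proportional to file size, records per second would be the same for every size. The sketch below uses the single-job baseline durations on the main tenant; the function name `records_per_second` is illustrative.

```python
def records_per_second(records: int, duration: str) -> float:
    """Average throughput for a job, given its HH:MM:SS duration."""
    hours, minutes, seconds = map(int, duration.split(":"))
    total_seconds = hours * 3600 + minutes * 60 + seconds
    return round(records / total_seconds, 1)


# Main tenant (fs09000000), DI Create baselines (1 job, no concurrency).
baseline_runs = {10_000: "00:05:01", 25_000: "00:10:43"}
throughput = {size: records_per_second(size, d) for size, d in baseline_runs.items()}
print(throughput)
```

Similar throughput values for the two file sizes are consistent with duration growing roughly in line with file size.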

Comparison

Durations per release and the Quesnelia/Poppy deltas:

| Job type | File size | # of test | Number of concurrent jobs | Quesnelia central tenant | Quesnelia second tenant | Quesnelia third tenant | Poppy central tenant | Poppy second tenant | Poppy third tenant | Delta fs09000000 | Delta fs07000001 | Delta fs07000002 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| DI Create | 10k | Baseline | 1 | 00:05:01 | 00:05:01 | 00:05:00 | 00:04:56 | | | 00:00:05 | | |
| DI Create | 10k | 1 | 2 | 00:09:04 | 00:09:08 | | 00:10:43 | 00:10:37 | | 00:01:39 | 00:01:29 | |
| DI Create | 10k | 2 | 3 | 00:14:25 | 00:15:12 | 00:15:22 | 00:21:12 | 00:21:06 | 00:20:57 | 00:06:47 | 00:05:54 | 00:05:35 |
| DI Create | 25k | Baseline | 1 | 00:10:43 | 00:11:15 | 00:11:22 | 00:11:24 | | | 00:00:41 | | |
| DI Create | 25k | 3 | 2 | 00:22:44 | 00:22:47 | | 00:23:44 | 00:23:30 | | 00:01:00 | 00:00:43 | |
| DI Create | 25k | 4 | 3 | 00:40:32 | 00:40:17 | 00:40:25 | 00:37:11 | 00:37:05 | 00:36:58 | 00:03:21 | 00:03:12 | 00:03:27 |
| DI Update | 10k | Baseline | 1 | 00:06:25 | | | 00:06:32 | | | 00:00:07 | | |
| DI Update | 10k | 5 | 2 | 00:14:08 | 00:13:44 | | 00:09:47 | 00:11:26 | | 00:04:21 | 00:02:18 | |
| DI Update | 10k | 6 | 3 | 00:20:13 | 00:21:03 | 00:20:41 | 00:19:08 | 00:19:06 | 00:18:31 | 00:01:05 | 00:01:57 | 00:02:10 |
| DI Update | 25k | Baseline | 1 | 00:15:17 | | | 00:15:13 | | | 00:00:04 | | |
| DI Update | 25k | 7 | 2 | 00:35:27 | 00:35:34 | | 00:30:49 | 00:30:52 | | 00:04:38 | 00:04:42 | |
| DI Update | 25k | 8 | 3 | 00:58:49 | 00:58:59 | 00:59:05 | 00:47:47 | 00:48:17 | 00:47:54 | 00:11:02 | 00:10:42 | 00:11:11 |
| DI Create (Smoke test) | 50K | 9 | 1 | 00:22:44.2 | | | 00:22:31 | | | 00:00:13 | | |
| DI Create (Smoke test) | 50K | 10 | 3 | 01:21:37 | 01:22:49 | 01:22:49 | 01:12:54 | 01:12:44 | 01:12:35 | 00:08:43 | 00:10:05 | 00:10:14 |
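
The delta values are unsigned differences between the Quesnelia and Poppy durations, so the direction of the change has to be read from the duration columns. The following sketch, with illustrative function names, reproduces a delta and adds a signed percentage change relative to Poppy (negative means Quesnelia was faster).

```python
from datetime import timedelta


def parse(hms: str) -> timedelta:
    """Parse an HH:MM:SS duration string."""
    hours, minutes, seconds = map(int, hms.split(":"))
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


def compare(quesnelia: str, poppy: str):
    """Return (absolute delta as HH:MM:SS, signed % change vs. Poppy)."""
    q, p = parse(quesnelia), parse(poppy)
    total = int(abs(q - p).total_seconds())
    delta = f"{total // 3600:02d}:{total % 3600 // 60:02d}:{total % 60:02d}"
    change = round((q - p) / p * 100, 1)  # timedelta / timedelta gives a float
    return delta, change


# Central tenant (fs09000000), 3 concurrent DI Create jobs.
create_10k = compare("00:14:25", "00:21:12")
create_25k = compare("00:40:32", "00:37:11")
print(create_10k, create_25k)
```

For these two rows the sketch shows the 10k Create run was faster in Quesnelia while the 25k Create run was slower, which the unsigned delta columns alone do not convey.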


Service CPU Utilization

CPU utilization comparison

| Module (Quesnelia) | CPU, % (Create 25k, 2 tenants) | CPU, % (Update 25k, 2 tenants) |
|---|---|---|
| mod-inventory-b | 121.54 | 149.37 |
| mod-quick-marc-b | 92.33 | 92.09 |
| mod-di-converter-storage-b | 59.54 | 97.72 |
| nginx-okapi | 52.43 | 86.49 |
| okapi-b | 27.08 | 47.92 |
| mod-source-record-storage-b | 24.1 | 41.22 |
| mod-inventory-storage-b | 23.5 | 26.07 |
| mod-permissions-b | 8.32 | 32.46 |
| mod-source-record-manager-b | 15.54 | 19.42 |
| mod-users-b | 7.2 | 8.1 |
| mod-pubsub-b | 7.85 | 8.07 |
| mod-password-validator-b | 2.89 | 2.83 |
| mod-feesfines-b | 2.72 | 3.14 |
| mod-configuration-b | 2.38 | 3.65 |
| mod-authtoken-b | 2.19 | 2.86 |
| mod-circulation-storage-b | 1.88 | 2.14 |
| mod-data-import-b | 1.85 | 2.04 |
| mod-circulation-b | 0.41 | 0.4 |
| pub-okapi | 0.26 | 0.25 |

| Service (Poppy) | CPU, % (Create) | CPU, % (Update) |
|---|---|---|
| mod-inventory-b | 122.87 | 181.72 |
| mod-di-converter-storage-b | 78.94 | 75.21 |
| mod-quick-marc-b | 75.71 | 22.16 |
| nginx-okapi | 71.79 | 78.33 |
| mod-source-record-storage-b | 47.36 | 42.14 |
| okapi-b | 36.99 | 29.78 |
| mod-source-record-manager-b | 30.41 | 36.98 |
| mod-inventory-storage-b | 24.83 | 19.45 |
| mod-users-b | 19.33 | 5.61 |
| mod-configuration-b | 11.69 | 2.73 |
| mod-permissions-b | 9.19 | 18.71 |
| mod-pubsub-b | 6.97 | 6.85 |
| mod-authtoken-b | 6.51 | 3.44 |
| mod-password-validator-b | 3.27 | 2.75 |
| mod-feesfines-b | 2.29 | 2.5 |
| mod-data-import-b | 1.84 | 2.09 |
| mod-circulation-storage-b | 1.27 | 1.65 |
| mod-circulation-b | 0.33 | 0.34 |
| pub-okapi | 0.23 | 0.24 |

Service Memory Utilization

Memory consumption comparison

| Module (Quesnelia) | Memory, % |
|---|---|
| mod-inventory-b | 86.73 |
| mod-data-import-b | 80.14 |
| mod-source-record-storage-b | 72.9 |
| mod-permissions-b | 71.48 |
| mod-users-b | 51.34 |
| mod-di-converter-storage-b | 50.2 |
| okapi-b | 48.87 |
| mod-source-record-manager-b | 46.51 |
| mod-configuration-b | 40.74 |
| mod-feesfines-b | 37.94 |
| mod-quick-marc-b | 31.61 |
| mod-pubsub-b | 30.38 |
| mod-authtoken-b | 27.78 |
| mod-inventory-storage-b | 26.54 |
| mod-circulation-storage-b | 22.15 |
| mod-circulation-b | 14.66 |
| nginx-okapi | 5.08 |
| pub-okapi | 4.58 |

| Service (Poppy) | Memory, % (Create) | Memory, % (Update) |
|---|---|---|
| mod-inventory-b | 95.16 | 98.34 |
| mod-permissions-b | 75.03 | 79.63 |
| mod-source-record-storage-b | 62.29 | 72.77 |
| mod-users-b | 61.23 | 59.93 |
| mod-data-import-b | 61.02 | 68.28 |
| mod-source-record-manager-b | 47.76 | 54.2 |
| okapi-b | 41.84 | 42.55 |
| mod-di-converter-storage-b | 34.62 | 35.22 |
| mod-feesfines-b | 28.63 | 27.51 |
| mod-quick-marc-b | 28.39 | 30.48 |
| mod-configuration-b | 27.57 | 26.51 |
| mod-pubsub-b | 24.72 | 24.86 |
| mod-authtoken-b | 21.93 | 20.1 |
| mod-inventory-storage-b | 17.21 | 18.03 |
| mod-circulation-storage-b | 17.04 | 16.55 |
| mod-circulation-b | 10.88 | 11.13 |
| nginx-okapi | 4.69 | 4.69 |
| pub-okapi | 4.63 | 4.46 |

MSK tenant cluster

Disk usage by broker

CPU (User) usage by broker

DB CPU Utilization

RDS CPU utilization was 88% for all jobs, which is 10% less than in Poppy.


DB Connections

Create jobs: 990 DB connections for 2 tenants, 1100 for 3 tenants.

Update jobs: 950 DB connections for 2 tenants, 1000 for 3 tenants.
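
To put these connection counts in context, they can be compared with the max_connections value of the database instance listed under Infrastructure (2731). The sketch below computes the share of the limit in use; the dictionary layout is illustrative.

```python
# max_connections of the db.r6g.xlarge writer instance (see Infrastructure).
MAX_CONNECTIONS = 2731

# Observed DB connections, keyed by (job type, number of tenants).
observed = {
    ("Create", 2): 990,
    ("Create", 3): 1100,
    ("Update", 2): 950,
    ("Update", 3): 1000,
}

# Share of the connection limit in use, as a percentage.
usage = {key: round(count / MAX_CONNECTIONS * 100, 1) for key, count in observed.items()}
for (job, tenants), pct in usage.items():
    print(f"{job}, {tenants} tenants: {pct}% of max_connections")
```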


DB load

Top SQL

TOP SQL Queries
SELECT jsonb FROM fs07000002_mod_permissions.permissions 


Appendix

Errors & Exceptions

During the 50k smoke test on 3 tenants, SNAPSHOT_UPDATE_ERROR was observed on the third tenant (fs07000002). No error messages appeared in the UI. Reason: No action on four split jobs at the beginning of the test.

Infrastructure

PTF environment: qcp1

  • 10 m6i.2xlarge EC2 instances located in US East (N. Virginia), us-east-1
  • 1 database instance, writer
    • Name: db.r6g.xlarge
    • Memory: 32 GiB
    • vCPUs: 4
    • max_connections: 2731
  • Number of records in DB:
    • fs09000000
      • instances - 25901331
      • items - 27074913
      • holdings - 25871735
    • fs07000001
      • instances - 10100620
      • items - 1484850
      • holdings - 10522266
    • fs07000002
      • instances - 1161275
      • items - 1153548
      • holdings - 1153548
  • MSK tenant
    • 4 m5.2xlarge brokers in 2 zones
    • Apache Kafka version 2.8.0

    • EBS storage volume per broker 300 GiB

    • auto.create.topics.enable=true
    • log.retention.minutes=480
    • default.replication.factor=3
  • OpenSearch ptf-test
    • version OpenSearch_2_7_R20240502 
    • Data nodes
      • Instance type - r6g.2xlarge.search
      • Number of nodes - 4
      • Storage type - EBS
      • EBS volume size (GiB) - 500
    • Dedicated master nodes
      • Instance type - r6g.large.search
      • Number of nodes - 3

Module Version

qcp1-pvt

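| Module:Version | Revision | Task Count | Mem Hard Limit | Mem Soft Limit | CPU | Xmx | MetaspaceSize | MaxMetaspaceSize |
|---|---|---|---|---|---|---|---|---|
| mod-users-bl:7.7.0 | 5 | 2 | 1440 | 1152 | 512 | 922 | 88 | 128 |
| mod-configuration:5.10.0 | 5 | 2 | 1024 | 896 | 128 | 768 | 88 | 128 |
| mod-authtoken:2.15.1 | 6 | 2 | 1440 | 1152 | 512 | 922 | 88 | 128 |
| mod-data-import:3.1.0 | 8 | 1 | 2048 | 1844 | 256 | 1292 | 384 | 512 |
| mod-remote-storage:3.2.0 | 5 | 2 | 4920 | 4472 | 1024 | 3960 | 512 | 512 |
| mod-inventory-storage:27.1.0 | 5 | 2 | 4096 | 3690 | 2048 | 3076 | 384 | 512 |
| pub-okapi:2023.06.14 | 3 | 2 | 1024 | 896 | 128 | 768 | Not found | Not found |
| mod-feesfines:19.1.0 | 5 | 2 | 1024 | 896 | 128 | 768 | 88 | 128 |
| okapi:5.3.0 | 5 | 3 | 1684 | 1440 | 1024 | 922 | 384 | 512 |
| nginx-okapi:2023.06.14 | 3 | 2 | 1024 | 896 | 128 | Not found | Not found | Not found |
| mod-quick-marc:5.1.0 | 5 | 1 | 2288 | 2176 | 128 | 1664 | 384 | 512 |
| mod-source-record-manager:3.9.0-SNAPSHOT.330 | 6 | 2 | 5600 | 5000 | 2048 | 3500 | 384 | 512 |
| mod-patron-blocks:1.10.0 | 5 | 2 | 1024 | 896 | 1024 | 768 | 88 | 128 |
| mod-pubsub:2.13.0 | 5 | 2 | 1536 | 1440 | 1024 | 922 | 384 | 512 |
| mod-circulation:24.2.0 | 5 | 2 | 2880 | 2592 | 1536 | 1814 | 384 | 512 |
| mod-di-converter-storage:2.2.0 | 5 | 2 | 1024 | 896 | 128 | 768 | 88 | 128 |
| mod-inventory:20.2.0 | 5 | 2 | 2880 | 2592 | 1024 | 1814 | 384 | 512 |
| mod-source-record-storage:5.8.0 | 5 | 2 | 5600 | 5000 | 2048 | 3500 | 384 | 512 |
| mod-circulation-storage:17.2.0 | 5 | 2 | 2880 | 2592 | 1536 | 1814 | 384 | 512 |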

Methodology/Approach

  • DI tests were started from the UI concurrently, with 1 job on each tenant: first on fs09000000 and then on fs07000001, for a total of two jobs on two tenants.
  • Then 1 job was started on each of the three tenants concurrently, with a delay of several seconds between them: first on fs09000000, then on fs07000001, and finally on fs07000002.
  • DI Create jobs were executed first, followed by DI Update jobs.

The following queries were used to get instance IDs and job durations.

Queries to prepare files for Update jobs, job durations
select id
-- from fs09000000_mod_inventory_storage.instance
-- from fs07000001_mod_inventory_storage.instance
-- from fs07000002_mod_inventory_storage.instance

-- limit 1
-- To get instance ids for 10k Update on main tenant
-- where creation_date > '2024-06-18 06:40:14' and creation_date < '2024-06-18 06:45:15.549'
-- To get instance ids for 25k Update on main tenant
-- where creation_date > '2024-06-18 06:50:00' and creation_date < '2024-06-18 07:00:43.192'


-- To get instance ids for 10k Update on 0701 tenant
-- where creation_date > '2024-06-18 07:07:51' and creation_date < '2024-06-18 07:12:53.209'
-- To get instance ids for 25k Update on 0701 tenant
-- where creation_date > '2024-06-18 07:17:04' and creation_date < '2024-06-18 07:28:19.22'

-- To get instance ids for 10k Update on 0702 tenant
-- where creation_date > '2024-06-18 07:33:07' and creation_date < '2024-06-18 07:38:06.762'
-- To get instance ids for 25k Update on 0702 tenant
-- where creation_date > '2024-06-18 07:42:50' and creation_date < '2024-06-18 07:54:12.09'


select file_name,total_records_in_file,started_date,completed_date, completed_date - started_date as duration ,status,error_status
-- from fs09000000_mod_source_record_manager.job_execution
-- from fs07000001_mod_source_record_manager.job_execution
-- from fs07000002_mod_source_record_manager.job_execution
where subordination_type = 'COMPOSITE_PARENT'
-- where started_date > '2024-06-13 14:47:54' and completed_date < '2024-06-13 19:01:50.832'
order by started_date desc
limit 10
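
The instance-id query above is a template: one "from" line and one "where" line are uncommented per tenant and job. As an alternative, the same query can be assembled programmatically. This is a minimal sketch, assuming the schema naming pattern `<tenant>_mod_inventory_storage` shown above; the function name `instance_id_query` is hypothetical, and the string interpolation is only suitable for trusted, manually entered values in ad hoc use.

```python
def instance_id_query(tenant: str, created_after: str, created_before: str) -> str:
    """Build the instance-id query for one tenant and one creation window.

    Values are interpolated directly into the SQL text, so pass only
    trusted tenant ids and timestamps (ad hoc use, not application code).
    """
    return (
        f"select id from {tenant}_mod_inventory_storage.instance "
        f"where creation_date > '{created_after}' "
        f"and creation_date < '{created_before}'"
    )


# Window of the 10k Create job on the main tenant, taken from the comments above.
query = instance_id_query(
    "fs09000000", "2024-06-18 06:40:14", "2024-06-18 06:45:15.549"
)
print(query)
```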