[Poppy] List App with multiple tenants and R/W split enabled

[Poppy] List App with multiple tenants and R/W split enabled

Overview

This document contains the results of testing List App refreshing of 200k records on multiple tenants with R/W split enabled (Poppy release). The goal of testing is to assess the performance of mod-lists with load spread across multiple tenants.

Ticket:

PERF-665: [List App] Test performance of refreshing a list with many records (with multiple tenants and R/W split enabled)Closed

Summary

  • Tests showed the Lists App refresh of concurrent lists on 3 tenants are:

    • 1.5 mins for 3 concurrent lists refresh test (1 list refresh on each tenant);

    • 2.3 mins for 10 concurrent lists refresh test (3-4 lists refresh on each tenant).

  • During load test for 30 lists (10 lists per tenant) some of the list refreshes failed. After the test end "isRefreshing" status remained "true" for each list. It was reset manually directly through the database. Details on the issue can be found here: Failed list refresh investigation.

  • During the 10 lists test CPU utilization reached 200% for mod-fqm-manager and 111% for mod-lists. Also, mod-permissions' CPU utilization exceeded 100% during 30 lists test.

  • Maximum DB CPU utilization reached 83% (writer instance) and 99% (reader instance) during the 30 lists test. In comparison with testing with R/W split disabled, RDS CPU utilization didn't decrease when DB R/W split was enabled.

  • Memory utilization for mod-permissions increased from 48% to 76% during the tests. No memory leak is suspected for all the modules.

Recommendations & Jiras 

PERF-732: Investigate failing list refresh during load testsClosed

Test runs

Query used in lists - "Item status != Available". List refresh result is about 200K records.

Scenario

Data quantity

Scenario

Data quantity

List App refresh

multiple tenants







tenant 1 - 1 list

tenant 2 - 1 list

tenant 3 - 1 list

tenant 1 - 3 lists

tenant 2 - 3 lists

tenant 3 - 4 lists

tenant 1 - 10 lists

tenant 2 - 10 lists

tenant 3 - 10 lists



Results

Transaction

Duration, avg

Release

Tenants

Number of lists

R/W split

Other conditions

Transaction

Duration, avg

Release

Tenants

Number of lists

R/W split

Other conditions

Lists App refresh

previous test results*

10 min 40 sec

[Orchid]

1 tenant

10 

disabled



8.5 min

[Poppy]

1 tenant

10 

disabled



17.7 min

[Poppy]

1 tenant

10 

disabled

Testing in parallel with DI and CICO

Lists App refresh

current test results**

1.5 min

[Poppy]

3 tenants

enabled



2.3 min

[Poppy]

3 tenants

10 

enabled



error

[Poppy]

3 tenants

30 

enabled

1 list refresh failed for the 1st tenant

8 list refreshes failed for the 3rd tenant***

* Query used in lists - "Item status == Checked out". List refresh result is 200K records. Results are taken from previous test reports: [Poppy] List App with multiple workflows and R/W split disabled test report[Orchid] List App test report

**Query used in lists - "Item status != Available". List refresh result is about 200K records.

***Details on the issue can be found at Failed list refresh investigation part

Instance CPU Utilization

Service CPU Utilization

During the 10-lists test CPU utilization reached 200% for mod-fqm-manager and 111% for mod-lists. Also, mod-permissions CPU utilization exceeded 100% during 30 lists test.

Memory Utilization

Memory utilization for mod-permissions increased from 48% to 76% during the tests. No memory leak is suspected for all the modules.

This editor does not support displaying this content: heading

DB CPU Utilization

Maximum DB CPU utilization reached 83% (writer instance) and 99% (reader instance) during 30 lists test. 

In comparison with testing with R/W split disabled, RDS CPU utilization for the writer node decreased from 70% to 29% for 3 users test, from 95% to 68% for 10 users test. At the same time CPU load on the reader node is higher than on the writer.

Results for multiple tenants and R/W split disabled testing (for comparison):

Details can be found here: [Poppy] List App with multiple tenants and R/W split disabled

DB Connections

DB Load

Writer DB node

Reader DB node

TOP SQL

Writer DB node

Reader DB node

Long-running queries: 

select id from [tenant]_mod_fqm_manager.drv_item_callnumber_location where lower(cast(item_status as varchar)) <> lower($1) parameters: $1 = 'Available'



delete from list_contents where list_id=$1 and refresh_id=$2

Failed list refresh investigation

In order to investigate the issue there were conducted addtional tests for single tenant (30 lists in parallel) and multiple tenants (10 lists on each of three tenants in parallel).



Test

Failed list refresh

1 tenant

(30 lists in parallel)

3 tenants 

(10 lists on each in parallel)

1

1 list failed

2 lists failed for the 1st tenant

10 lists failed for the 3rd tenant

2

1 list failed

1 lists failed for the 1st tenant

8 lists failed for the 3rd tenant

3

2 lists failed

2 lists failed for the 1st tenant

8 lists failed for the 3rd tenant 

! For the second tenant there were no failed list refreshes during all the tests.

! After the test end "isRefreshing" status remained "true" for each of the failed lists. It was reset manually directly through the database.

Log messages

"Reason for failure: org.springframework.dao.DataIntegrityViolationException: could not execute batch..."

"ERROR: insert or update on table "list_contents_11" violates foreign key constraint "fk_refresh_id""

Log message 1
07:24:29 [] [] [] [] ERROR efreshFailedCallback Refresh failed for list 275ffe0d-a7aa-4971-8649-05dfd5fabcb8, refreshId 2604f768-8365-4cd1-8d44-42ac2eb4b0cb. Reason for failure: org.springframework.dao.DataIntegrityViolationException: could not execute batch [Batch entry 0 insert into list_contents (sort_seq,content_id,list_id,refresh_id) values (0,'00003218-d1c0-4e28-a1f9-7ff1b876adab'::uuid,'275ffe0d-a7aa-4971-8649-05dfd5fabcb8'::uuid,'2604f768-8365-4cd1-8d44-42ac2eb4b0cb'::uuid),(1,'00006fb1-e36f-428e-a0d6-783a3cad91d8'::uuid,'275ffe0d-a7aa-4971-8649-05dfd5fabcb8'::uuid,'2604f768-8365-4cd1-8d44-42ac2eb4b0cb'::uuid),(2,'0000b30e-b794-4fc8-af62-dd769232f958'::uuid,'275ffe0d-a7aa-4971-8649-05dfd5fabcb8'::uuid,'2604f768-8365-4cd1-8d44-42ac2eb4b0cb'::uuid),(3,'0000c889-ad4f-42b2-8f29-70ed0ab09904'::uuid,'275ffe0d-a7aa-4971-8649-05dfd5fabcb8'::uuid,'2604f768-8365-4cd1-8d44-42ac2eb4b0cb'::uuid),(4,'00011866-fe22-409c-bbf6-f0fd6a8d4580'::uuid,'275ffe0d-a7aa-4971-8649-05dfd5fabcb8'::uuid,'2604f768-8365-4cd1-8d44-42ac2eb4b0cb'::uuid),(5,'00012a5c-28c5-40c1-9728-ce66e867bc35'::uuid,'275ffe0d-a7aa-4971-8649-05dfd5fabcb8'::uuid,'2604f768-8365-4cd1-8d44-42ac2eb4b0cb'::uuid),(6,'00012af0-33fc-4323-aa75-91e0c2d4174b'::uuid,'275ffe0d-a7aa-4971-8649-05dfd5fabcb8'::uuid,'2604f768-8365-4cd1-8d44-42ac2eb4b0cb'::uuid),(7,'0002346f-e359-42ce-b69c-df35ad8b9b76'::uuid,'275ffe0d-a7aa-4971-8649-05dfd5fabcb8'::uuid,'2604f768-8365-4cd1-8d44-42ac2eb4b0cb'::uuid),(8,'000251b9-c33d-4e3a-9a36-33a994a9f128'::uuid,'275ffe0d-a7aa-4971-8649-05dfd5fabcb8'::uuid,'2604f768-8365-4cd1-8d44-42ac2eb4b0cb'::uuid),(9,'00035926-2c06-4616-975a-b3908bb24346'::uuid,'275ffe0d-a7aa-4971-8649-05dfd5fabcb8'::uuid,'2604f768-8365-4cd1-8d44-42ac2eb4b0cb'::uuid),(10,'0003c0a2-432d-41ec-8847-e757f50f6d4d'::uuid,'275ffe0d-a7aa-4971-8649-05dfd5fabcb8'::uuid,'2604f768-8365-4cd1-8d44-42ac2eb4b0cb'::uuid),(11,'000410ba-83fb-43ae-bf2e-20878e27ef23'::uuid,'275ffe0d-a7aa-4971-8649-05dfd5fabcb8'::uuid,'2604f768-8365-4cd1-8d44-42ac2eb4b0cb'::uuid),(12,'0004a65d-fb57-4754-9d36-23072986b9ed'::uuid,'275ffe0d-a7aa-4971-8649-05dfd5fabcb8'::uuid,'2604f768-8365-4cd1-8d44-42ac2eb4b0cb'::uuid),(13,'00066938-046f-4f50-8bfb-522a0aa4d0b1'::uuid,'275ffe0d-a7aa-4971-8649-05dfd5fabcb8'::uuid,'2604f768-8365-4cd1-8d44-42ac2eb4b0cb'::uuid),(14,'0006939e-ce6f-4876-8db9-0032e4cb41c2'::uuid,'275ffe0d-a7aa-4971-8649-05dfd5fabcb8'::uuid,'2604f768-8365-4cd1-8d44-42ac2eb4b0cb'::uuid),(15,'00069719-c84e-4436-b45b-6999b8459742'::uuid,'275ffe0d-a7aa-4971-8649-05dfd5fabcb8'::uuid,'2604f768-8365-4cd1-8d44-42ac2eb4b0cb'::uuid),(16,'0006f3f3-2fd2-4be0-80c5-8a6998b9dd78'::uuid,'275ffe0d-a7aa-4971-8649-05dfd5fabcb8'::uuid,'2604f768-8365-4cd1-8d44-42ac2eb4b0cb'::uuid),(17,'00070833-4f16-4aeb-824b-2a8ef4527bc9'::uuid,'275ffe0d-a7aa-4971-8649-05dfd5fabcb8'::uuid,'2604f768-8365-4cd1-8d44-42ac2eb4b0cb'::uuid),(18,'00076065-3bb3-462c-81b4-a28ae075f31d'::uuid,'275ffe0d-a7aa-4971-8649-05dfd5fabcb8'::uuid,'2604f768-8365-4cd1-8d44-42ac2eb4b0cb'::uuid),(19,'00080f89-b119-47b1-ba9c-73fdbca5873b'::uuid,'275ffe0d-a7aa-4971-8649-05dfd5fabcb8'::uuid,'2604f768-8365-4cd1-8d44-42ac2eb4b0cb'::uuid),(20,'0008c6d6-a51a-48ed-88f2-6063e7f621e2'::uuid,'275ffe0d-a7aa-4971-8649-05dfd5fabcb8'::uuid,'2604f768-8365-4cd1-8d44-42ac2eb4b0cb'::uuid),(21,'0009200f-056e-42cc-bec7-8d49dbf95e1c'::uuid,'275ffe0d-a7aa-4971-8649-05dfd5fabcb8'::uuid,'2604f768-8365-4cd1-8d44-42ac2eb4b0cb'::uuid),(22,'00097224-3228-4dc5-bce1-4a0cda755ed1'::uuid,'275ffe0d-a7aa-4971-8649-05dfd5fabcb8'::uuid,'2604f768-8365-4cd1-8d44-42ac2eb4b0cb'::uuid),(23,'000977de-102b-4430-ab53-5f256654da93'::uuid,'275ffe0d-a7aa-4971-8649-05dfd5fabcb8'::uuid,'2604f768-8365-4cd1-8d44-42ac2eb4b0cb'::uuid),(24,'000a05bc-cd51-49ab-a07c-e7d896c44404'::uuid,'275ffe0d-a7aa-4971-8649-05dfd5fabcb8'::uuid,'2604f768-8365-4cd1-8d44-42ac2eb4b0cb'::uuid),(25,'000ac6a6-ef1f-4ea9-9faf-d1a03fafe66f'::uuid,'275ffe0d-a7aa-4971-8649-05dfd5fabcb8'::uuid,'2604f768-8365-4cd1-8d44-42ac2eb4b0cb'::uuid),(26,'000ae987-f93b-41d7-9125-d9ce7c51cb4f'::uuid,'275ffe0d-a7aa-4971-8649-05dfd5fabcb8'::uuid,'2604f768-8365-4cd1-8d44-42ac2eb4b0cb'::uuid),(27,'000b149d-4600-4839-ad27-44d012136a05'::uuid,'275ffe0d-a7aa-4971-8649-05dfd5fabcb8'::uuid,'2604f768-8365-4cd1-8d44-42ac2eb4b0cb'::uuid),(28,'000b5e84-c840-42e5-885a-e26185f5866c'::uuid,'275ffe0d-a7aa-4971-8649-05dfd5fabcb8'::uuid,'2604f768-8365-4cd1-8d44-42ac2eb4b0cb'::uuid),(29,'000bb99a-2cb2-4d18-9441-d98b0e505346'::uuid,'275ffe0d-a7aa-4971-8649-05dfd5fabcb8'::uuid,'2604f768-8365-4cd1-8d44-42ac2eb4b0cb'::uuid),(30,'000c7f25-659a-4584-9333-4c9543bb6ac9'::uuid,'275ffe0d-a7aa-4971-8649-05dfd5fabcb8'::uuid,'2604f768-8365-4cd1-8d44-42ac2eb4b0cb'::uuid),(31,'000cad02-dbf1-4cee-ab9b-62cdc3242a9e'::uuid,'275ffe0d-a7aa-4971-8649-05dfd5fabcb8'::uuid,'2604f768-8365-4cd1-8d44-42ac2eb4b0cb'::uuid) was aborted: ERROR: insert or update on table "list_contents_11" violates foreign key constraint "fk_refresh_id"

"Key (refresh_id)=(2604f768-8365-4cd1-8d44-42ac2eb4b0cb) is not present in table "list_refresh_details""

Log message 2
Detail: Key (refresh_id)=(2604f768-8365-4cd1-8d44-42ac2eb4b0cb) is not present in table "list_refresh_details". Call getNextException to see other errors in the batch.] [insert into list_contents (sort_seq,content_id,list_id,refresh_id) values (?,?,?,?)]; SQL [insert into list_contents (sort_seq,content_id,list_id,refresh_id) values (?,?,?,?)]; constraint [fk_refresh_id]

Other log messages can be found in the faile attached:

RDS CPU

Slow queries

Reader node:

select id from fs09000000_mod_fqm_manager.drv_item_callnumber_location where lower(cast(item_status as varchar)) <> lower($1) parameters: $1 = 'Available'
select result_id from fs07000001_mod_fqm_manager.query_results where query_id = cast($1 as uuid) order by result_id offset $2 rows fetch next $3 rows only parameters: $1 = '53b7d3be-df3c-4545-abf8-aded0a0b0a87', $2 = '92000', $3 = '1000'

Writer node:

SELECT jsonb FROM fs07000001_mod_permissions.permissions
delete from list_contents where list_id=$1 and refresh_id=$2 parameters: $1 = '06831116-06d0-49d1-8002-f030e9764020', $2 = 'ba886eaa-67c2-4e5f-8f06-8d383b497967'
insert into list_contents (sort_seq,content_id,list_id,refresh_id) values ($1,$2,$3,$4),($5,$6,$7,$8),($9,$10,$11,$12),($13,$14,$15,$16),($17,$18,$19,$20),($21,$22,$23,$24),($25,$26,$27,$28),($29,$30,$31,$32),($33,$34,$35,$36),($37,$38,$39,$40),($41,$42,$43,$44),($45,$46,$47,$48),($49,$50,$51,$52),($53,$54,$55,$56),($57,$58,$59,$60),($61,$62,$63,$64)

Appendix

Infrastructure

PTF -environment pcp1

  • 10 m6i.2xlarge EC2 instances located in US East (N. Virginia)us-east-1

  • 1 database  instance, writer

  • MSK tenant

    • 4 m5.2xlarge brokers in 2 zones

    • Apache Kafka version 2.8.0

    • EBS storage volume per broker 300 GiB

    • auto.create.topics.enable=true

    • log.retention.minutes=480

    • default.replication.factor=3



Module
pcp1-pvt
Fri Oct 27 08:26:47 UTC 2023

Task Def. Revision

Task Count

Mem Hard Limit

Mem Soft limit

CPU units

Xmx

MetaspaceSize

MaxMetaspaceSize

R/W split enabled

Module
pcp1-pvt
Fri Oct 27 08:26:47 UTC 2023

Task Def. Revision

Task Count

Mem Hard Limit

Mem Soft limit

CPU units

Xmx

MetaspaceSize

MaxMetaspaceSize

R/W split enabled

mod-inventory-storage:27.0.0

10

2

4096

3690

2048

3076

384

512

false

mod-users:19.2.0

19

2

1024

896

128

768

88

128

false

nginx-okapi:2023.06.14

8

2

1024

896

128

0

0

0

false

mod-circulation-storage:17.1.0

10

2

2880

2592

1536

1814

384

512

false

okapi:5.1.1

9

3

1684

1440

1024

922

384

512

false

mod-inventory:20.1.0

9

2

2880

2592

1024

1814

384

512

false

mod-circulation:24.0.0

10

2

2880

2592

1536

1814

384

512

false

pub-okapi:2023.06.14

8

2

1024

896

128

768

0

0

false

mod-fqm-manager:1.0.0

5

2

1024

896

128

768

88

128

true

mod-lists:1.0.0

5

2

3000

2600

128

2048

384

512

false



Methodology

  1. Enable R/W split for mod-fqm-manager.

  2. Create 10 lists with the query "Item status != Available" on each of three tenants to be able to run a test for up to 30 concurrent lists (users).

  3. Prepare 200K item records for the query to return. Details can be found at the link: Steps for testing process#ListApp

  4. Conduct tests with JMeter script for multiple tenants.