mod-search: Test Reindexing of Instances on consortium environment (Quesnelia)

Overview

The purpose of this document is to assess reindexing performance on a consortium environment with the Quesnelia release: to measure reindex duration and the size of the resulting indices.

Recommendations & Jiras

Test Summary

Test Runs / Results

Reindexing on the Quesnelia release, consortium environment.

| Test # | Date and time | Instances number | Test Conditions | Duration * |
| --- | --- | --- | --- | --- |
| 1 | 2024-04-24 09:37 - 12:10 | 1,706,932 | Sequential: cs00000int | 2 hours 23 min |
| 2 | 2024-04-24 13:03 - 23:38 | 6,905,646 | Sequential: cs00000int_0001 | 10 hours 35 min |
| 3 | 2024-04-25 00:19 - 12:28 | 6,937,091 | Sequential: cs00000int_0002 | 12 hours |
| 4 | 2024-04-25 11:45 - 2024-04-26 04:30 | | In parallel: 3 tenants | 16 hours 45 minutes |

Notes (common to all test runs):

  • mod-search task count = 8

  • mod-inventory-storage task count = 2

  • mod-okapi task count = 3

  • OpenSearch data node instances scaled up to r6g.4xlarge.search

  • without configuration of the number_of_replicas and refresh_interval values of ES/OpenSearch (see the sketch after this table)
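How each run was started: every test triggered a full reindex per tenant through mod-search's reindex endpoint. The sketch below is a minimal illustration rather than the exact script used in these tests; the gateway URL, token, and OpenSearch endpoint are placeholders, and tune_index_settings shows the number_of_replicas/refresh_interval tuning that was deliberately not applied in these runs.

```python
import requests

# Hypothetical endpoints and credentials; substitute real values.
OKAPI_URL = "https://okapi.example.org"
OPENSEARCH_URL = "https://127.0.0.1:9999"  # placeholder, as cited in this report
BASE_HEADERS = {"x-okapi-token": "<token>", "Content-Type": "application/json"}

def start_reindex(tenant: str) -> dict:
    """Start a full instance reindex for one tenant through mod-search."""
    resp = requests.post(
        f"{OKAPI_URL}/search/index/inventory/reindex",
        headers={**BASE_HEADERS, "x-okapi-tenant": tenant},
        json={"recreateIndex": True, "resourceName": "instance"},
    )
    resp.raise_for_status()
    return resp.json()  # reindex job descriptor returned by mod-search

def tune_index_settings(index: str, replicas: int, refresh: str) -> None:
    """Pre-reindex tuning that these tests intentionally skipped: dropping
    replicas and disabling refresh usually speeds up bulk indexing."""
    requests.put(
        f"{OPENSEARCH_URL}/{index}/_settings",
        json={"index": {"number_of_replicas": replicas,
                        "refresh_interval": refresh}},
        verify=False,
    ).raise_for_status()

# Tests 1-3 ran sequentially, one tenant after another; test 4 issued the
# same three calls in parallel.
for tenant in ("cs00000int", "cs00000int_0001", "cs00000int_0002"):
    print(tenant, start_reindex(tenant))
```

For scale, run 2 indexed 6,905,646 instances in 10 hours 35 minutes, i.e. roughly 6,905,646 / 635 ≈ 10.9K instances per minute.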

Indexing size

All the data in the tables below were captured after each test finished. Results from the GET request for reindex monitoring, https://127.0.0.1:9999/_cat/indices:
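The monitoring call can be reproduced with the standard _cat/indices parameters; a minimal sketch, assuming direct access to the OpenSearch endpoint cited above (the h= parameter selects exactly the columns captured in the tables below):

```python
import requests

OPENSEARCH_URL = "https://127.0.0.1:9999"  # placeholder host from this report

resp = requests.get(
    f"{OPENSEARCH_URL}/_cat/indices",
    params={
        "v": "true",  # include the header row
        "h": "health,status,index,uuid,pri,rep,"
             "docs.count,docs.deleted,store.size,pri.store.size",
        "s": "index",  # sort rows by index name
    },
    verify=False,  # internal endpoints often use self-signed certificates
)
print(resp.text)
```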

In parallel: 3 tenants

| health | status | index | uuid | pri | rep | docs.count | docs.deleted | store.size | pri.store.size |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| green | open | qcon_instance_subject_cs00000int | nwbohqi6SGiQsrlPjMOlOg | 4 | 2 | 936066 | 87976 | 8gb | 2.6gb |
| green | open | qcon_contributor_cs00000int | UemRKKfuTjSV4JrzZLQEGg | 4 | 2 | 880696 | 77071 | 8.7gb | 2.9gb |
| green | open | qcon_instance_classification_cs00000int | ixObbJPQTJyAHelpcV-r8w | 4 | 2 | 0 | 0 | 2.4kb | 832b |
| green | open | qcon_instance_cs00000int | NnavdW7vSc-2MH2Hvy5K-A | 4 | 2 | 4929451 | 180596 | 62.3gb | 20.7gb |

Sequential: cs00000int

| health | status | index | uuid | pri | rep | docs.count | docs.deleted | store.size | pri.store.size |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| green | open | qcon_instance_subject_cs00000int | Sf1cHAdHQ4G1qCHzAWxHJg | 4 | 2 | 904477 | 195457 | 4.5gb | 1.6gb |
| green | open | qcon_contributor_cs00000int | pzG1jGK3TdmG--jQXucPbw | 4 | 2 | 846936 | 114235 | 3.7gb | 1.1gb |
| green | open | qcon_instance_classification_cs00000int | cumgAMEFQFaFHRGQQfG2-w | 4 | 2 | 0 | 0 | 2.4kb | 832b |
| green | open | qcon_instance_cs00000int | yEg2R1xERAGZBTPvFkVeSA | 4 | 2 | 1706932 | 0 | 31.4gb | 10.4gb |

Sequential: cs00000int_0001

| health | status | index | uuid | pri | rep | docs.count | docs.deleted | store.size | pri.store.size |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| green | open | qcon_instance_subject_cs00000int | Sf1cHAdHQ4G1qCHzAWxHJg | 4 | 2 | 906513 | 98723 | 9.9gb | 3.1gb |
| green | open | qcon_contributor_cs00000int | pzG1jGK3TdmG--jQXucPbw | 4 | 2 | 849593 | 22767 | 7.9gb | 2.8gb |
| green | open | qcon_instance_classification_cs00000int | cumgAMEFQFaFHRGQQfG2-w | 4 | 2 | 0 | 0 | 2.4kb | 832b |
| green | open | qcon_instance_cs00000int | yEg2R1xERAGZBTPvFkVeSA | 4 | 2 | 6905646 | 216733 | 74.6gb | 24.5gb |

Sequential: cs00000int_0002

| health | status | index | uuid | pri | rep | docs.count | docs.deleted | store.size | pri.store.size |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| green | open | qcon_instance_subject_cs00000int | Sf1cHAdHQ4G1qCHzAWxHJg | 4 | 2 | 909754 | 69518 | 13.6gb | 4.2gb |
| green | open | qcon_contributor_cs00000int | pzG1jGK3TdmG--jQXucPbw | 4 | 2 | 852873 | 30507 | 8.9gb | 3.3gb |
| green | open | qcon_instance_classification_cs00000int | cumgAMEFQFaFHRGQQfG2-w | 4 | 2 | 0 | 0 | 2.4kb | 832b |
| green | open | qcon_instance_cs00000int | yEg2R1xERAGZBTPvFkVeSA | 4 | 2 | 6937091 | 257165 | 74.7gb | 24.8gb |

Resource utilization

Test #1 (cs00000int main tenant)

During the test on the cs00000int tenant, service CPU utilization was stable and consistent. The most CPU-consuming services were mod-inventory-storage = 94%, nginx-okapi = 81%, okapi = 60%, and mod-inventory = 15%. No memory leaks were observed. The number of database connections increased by 50, to 650. DB CPU usage rose from 1% to 14% during the first 15 minutes of reindexing; this spike correlated with the change in service CPU utilization and was caused by a high indexing data rate (about 500K operations per minute).

OpenSearch metrics

At the beginning of the reindex, the data rate was about 500K operations per minute. After 15 minutes it decreased to 8K. Indexing latency rose to about 4 milliseconds roughly 15 minutes after the test started and decreased to 1.5 milliseconds after 50 minutes. This spike correlates with data node CPU utilization increasing to 16% and then decreasing to 7%. Master node CPU utilization varied from 5% to 27% during reindexing.
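The data rate and latency figures above come from the cluster monitoring graphs; as a rough cross-check, both can be approximated from two samples of the nodes' indexing counters. A minimal sketch, again assuming the placeholder OpenSearch endpoint used earlier:

```python
import time

import requests

OPENSEARCH_URL = "https://127.0.0.1:9999"  # placeholder endpoint

def cluster_indexing_totals() -> tuple[int, int]:
    """Sum index_total and index_time_in_millis across all nodes."""
    stats = requests.get(
        f"{OPENSEARCH_URL}/_nodes/stats/indices/indexing", verify=False
    ).json()
    docs = millis = 0
    for node in stats["nodes"].values():
        indexing = node["indices"]["indexing"]
        docs += indexing["index_total"]
        millis += indexing["index_time_in_millis"]
    return docs, millis

docs_before, ms_before = cluster_indexing_totals()
time.sleep(60)  # one-minute sample interval
docs_after, ms_after = cluster_indexing_totals()

ops = docs_after - docs_before
print(f"indexing data rate: {ops} ops/min")
if ops:
    # Average time spent per indexing operation over the interval.
    print(f"indexing latency: {(ms_after - ms_before) / ops:.2f} ms")
```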

Service CPU Utilization

Memory Utilization

DB CPU Utilization

DB Connections



OpenSearch metrics

Indexing Data Rate

Subrange of the reindexing process from 09:55 to 11:50. This graph was added to show in detail the behavior aggregated in the graph above.


 

Indexing latency

Master CPU Utilization (Average)

CPU utilization percentage for all data nodes (Average).

Memory usage percentage for all data nodes (Average).

Test #2 (cs00000int_0001)

During the test on the cs00000int_0001 tenant, the most CPU-consuming services were mod-inventory-storage and nginx-okapi; both spiked at the beginning of the reindex and again in the middle. No memory leaks were observed. The number of database connections increased by 100 at the beginning, to 700. DB CPU usage increased to 14%; this spike correlates with the change in service CPU utilization and was caused by a high indexing data rate (about 110K operations per minute).

OpenSearch metrics

At the beginning of the reindex, the data rate was about 120K operations per minute; after an hour it decreased to 25K, averaged ~25K for the next 2 hours, and for the last 5 hours decreased to 60. After the data rate decreased to 60, the average indexing latency increased from 350 to 600 milliseconds. Data node CPU utilization was about 60% at the beginning and correlated with the reindex data rate. Master node CPU utilization varied from 5% to 27% during reindexing.



Service CPU Utilization

Memory Utilization

DB CPU Utilization

DB Connections

OpenSearch metrics

Indexing Data Rate

 


Indexing latency

Master CPU Utilization (Average)

CPU utilization percentage for all data nodes (Average).

Memory usage percentage for all data nodes (Average).

Test #3 (cs00000int_0002)

During the test on the cs00000int_0002 tenant, mod-inventory-storage and nginx-okapi CPU utilization spiked throughout the test. No memory leaks were observed. The number of database connections increased by 50, to 650. DB CPU usage was also spiking, reaching up to 20% during the test.

OpenSearch metrics

At the beginning, the reindex data rate was about 20K operations per minute and was spiking. After 2 hours it decreased to 5K. Indexing latency varied during the test, reaching up to 3000 milliseconds. This spike correlates with data node CPU utilization increasing to 16% and then decreasing to 7%. Master node CPU utilization varied from 5% to 27% during reindexing. CPU changes on the data nodes were not uniform, varying from 0 to 70%.

Service CPU Utilization

Memory Utilization

DB CPU Utilization

DB Connections

OpenSearch metrics

Indexing Data Rate

Indexing latency

Master CPU Utilization (Average)

CPU utilization percentage for all data nodes (Average).

Memory usage percentage for all data nodes (Average).

Test #4 (In parallel: 3 tenants)

Service CPU Utilization

Memory Utilization

DB CPU Utilization

DB Connections

 

OpenSearch metrics

Indexing Data Rate

Subrange of the reindexing process from 22:30 to 05:00. This graph was added to show in detail the behavior aggregated in the graph above.

Indexing latency

Master CPU Utilization (Average)

CPU utilization percentage for all data nodes (Average).

Memory usage percentage for all data nodes (Average).

 

Appendix

Infrastructure

PTF-environment qcon

  • 10 m6g.2xlarge EC2 instances located in US East (N. Virginia), us-east-1

  • 1 db.r6g.8xlarge database instance, engine version 16.1

  • MSK ptf-kakfa-3

    • 4 m5.2xlarge brokers in 2 zones

    • Apache Kafka version 2.8.0

    • EBS storage volume per broker 300 GiB

    • auto.create.topics.enable=true

    • log.retention.minutes=480

    • default.replication.factor=3

  • OpenSearch fse cluster

    • OpenSearch version 2.7

    • Data nodes

      • Availability Zone(s) - 2-AZ without standby

      • Instance type - r6g.4xlarge.search

      • Number of nodes - 6

      • EBS volume size (GiB) - 500

      • Provisioned IOPS - 3000

      • Provisioned Throughput (MiB/s) - 250 MiB/s

    • Dedicated master nodes