mod-search: Test Reindexing of Instances on consortium environment (Quesnelia)
- 1 Overview
- 2 Recommendations & Jiras
- 3 Test Summary
- 4 Test Runs / Results
- 4.1 Indexing size
- 5 Resource utilization
- 5.1 Test #1 (cs00000int main tenant)
- 5.1.1 Service CPU Utilization
- 5.1.2 Memory Utilization
- 5.1.3 DB CPU Utilization
- 5.1.4 DB Connections
- 5.1.5 Open Search metrics
- 5.2 Test #2 (cs00000int_0001)
- 5.2.1 Memory Utilization
- 5.2.2 DB CPU Utilization
- 5.2.3 DB Connections
- 5.3 Test #3 (cs00000int_0002)
- 5.3.1 Service CPU Utilization
- 5.3.2 Memory Utilization
- 5.3.3 DB CPU Utilization
- 5.3.4 DB Connections
- 5.3.5 Open Search metrics
- 5.4 Test #4 (In parallel: 3 tenants)
- 5.4.1 Service CPU Utilization
- 5.4.2 Memory Utilization
- 5.4.3 DB CPU Utilization
- 5.4.4 DB Connections
- 5.4.5 Open Search metrics
- 6 Appendix
- 6.1 Infrastructure
- 6.2 Data structure
- 6.3 Module versions
- 6.4 Methodology/Approach
Overview
The purpose of this document is to assess reindexing performance on a consortium environment with the Quesnelia release: measure the reindex duration and the size of the resulting indices.
Recommendations & Jiras
Original ticket to test: PERF-889: ECS Reindex for mod-search (classification browse) (Closed)
Additional info: ElasticSearch Reindex Performance Recommendations
Test Summary
Test Runs / Results
Test # | Date/Time | Instances number | Test Conditions (reindexing on Quesnelia release, consortium environment) | Duration | Notes |
---|---|---|---|---|---|
1 | | 1,706,932 | Sequential: cs00000int | 2 hours 23 min | |
2 | | 6,905,646 | Sequential: cs00000int_0001 | 10 hours 35 min | |
3 | 2024-04-25 00:19 - 12:28 | 6,937,091 | Sequential: cs00000int_0002 | 12 hours | |
4 | 2024-04-25 11:45 - 2024-04-26 04:30 | | In parallel: 3 tenants | 16 hours 45 min | |
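As a sanity check, the durations and average indexing rates can be recomputed from the timestamps and instance counts in the table above. A minimal Python sketch (all values are taken from the table; the average-rate calculation assumes a roughly constant indexing pace, which the per-test metrics below show is not strictly true):

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M"

def duration(start: str, end: str) -> str:
    """Return elapsed time between two timestamps as 'Xh Ym'."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    minutes = int(delta.total_seconds() // 60)
    return f"{minutes // 60}h {minutes % 60}m"

# Test #3: sequential reindex of cs00000int_0002
print(duration("2024-04-25 00:19", "2024-04-25 12:28"))  # → 12h 9m

# Test #4: parallel reindex of 3 tenants
print(duration("2024-04-25 11:45", "2024-04-26 04:30"))  # → 16h 45m

# Average rate for Test #3: 6,937,091 instances over 729 minutes
print(round(6_937_091 / (12 * 60 + 9)))  # → 9516 instances per minute
```

The recomputed Test #4 duration matches the table exactly; Test #3 comes out at 12h 9m, consistent with the rounded "12 hours" reported.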
Indexing size
All the data in the tables below were captured after each test finished. Results come from a GET request used for reindex monitoring, https://127.0.0.1:9999/_cat/indices:
In parallel: 3 tenants | |||||||||
health | status | index | uuid | pri | rep | docs.count | docs.deleted | store.size | pri.store.size |
---|---|---|---|---|---|---|---|---|---|
green | open | qcon_instance_subject_cs00000int | nwbohqi6SGiQsrlPjMOlOg | 4 | 2 | 936066 | 87976 | 8gb | 2.6gb |
green | open | qcon_contributor_cs00000int | UemRKKfuTjSV4JrzZLQEGg | 4 | 2 | 880696 | 77071 | 8.7gb | 2.9gb |
green | open | qcon_instance_classification_cs00000int | ixObbJPQTJyAHelpcV-r8w | 4 | 2 | 0 | 0 | 2.4kb | 832b |
green | open | qcon_instance_cs00000int | NnavdW7vSc-2MH2Hvy5K-A | 4 | 2 | 4929451 | 180596 | 62.3gb | 20.7gb |
Sequential: cs00000int | |||||||||
health | status | index | uuid | pri | rep | docs.count | docs.deleted | store.size | pri.store.size |
---|---|---|---|---|---|---|---|---|---|
green | open | qcon_instance_subject_cs00000int | Sf1cHAdHQ4G1qCHzAWxHJg | 4 | 2 | 904477 | 195457 | 4.5gb | 1.6gb |
green | open | qcon_contributor_cs00000int | pzG1jGK3TdmG--jQXucPbw | 4 | 2 | 846936 | 114235 | 3.7gb | 1.1gb |
green | open | qcon_instance_classification_cs00000int | cumgAMEFQFaFHRGQQfG2-w | 4 | 2 | 0 | 0 | 2.4kb | 832b |
green | open | qcon_instance_cs00000int | yEg2R1xERAGZBTPvFkVeSA | 4 | 2 | 1706932 | 0 | 31.4gb | 10.4gb |
Sequential: cs00000int_0001 | |||||||||
green | open | qcon_instance_subject_cs00000int | Sf1cHAdHQ4G1qCHzAWxHJg | 4 | 2 | 906513 | 98723 | 9.9gb | 3.1gb |
green | open | qcon_contributor_cs00000int | pzG1jGK3TdmG--jQXucPbw | 4 | 2 | 849593 | 22767 | 7.9gb | 2.8gb |
green | open | qcon_instance_classification_cs00000int | cumgAMEFQFaFHRGQQfG2-w | 4 | 2 | 0 | 0 | 2.4kb | 832b |
green | open | qcon_instance_cs00000int | yEg2R1xERAGZBTPvFkVeSA | 4 | 2 | 6905646 | 216733 | 74.6gb | 24.5gb |
Sequential: cs00000int_0002 | |||||||||
green | open | qcon_instance_subject_cs00000int | Sf1cHAdHQ4G1qCHzAWxHJg | 4 | 2 | 909754 | 69518 | 13.6gb | 4.2gb |
green | open | qcon_contributor_cs00000int | pzG1jGK3TdmG--jQXucPbw | 4 | 2 | 852873 | 30507 | 8.9gb | 3.3gb |
green | open | qcon_instance_classification_cs00000int | cumgAMEFQFaFHRGQQfG2-w | 4 | 2 | 0 | 0 | 2.4kb | 832b |
green | open | qcon_instance_cs00000int | yEg2R1xERAGZBTPvFkVeSA | 4 | 2 | 6937091 | 257165 | 74.7gb | 24.8gb |
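The `store.size` and `pri.store.size` columns above are related through the replica count: with `rep = 2`, each index keeps three copies of its data, so total store size should be roughly three times the primary store size (e.g. 31.4gb ≈ 3 × 10.4gb). A small Python sketch that checks this for one row of `_cat/indices` output (the row values are copied from the "Sequential: cs00000int" table; the parsing assumes plain whitespace-separated `_cat` output rather than this page's table formatting):

```python
# One row of `_cat/indices` output, values from the table above.
row = ("green open qcon_instance_cs00000int yEg2R1xERAGZBTPvFkVeSA "
       "4 2 1706932 0 31.4gb 10.4gb")

health, status, index, uuid, pri, rep, docs, deleted, store, pri_store = row.split()

def to_gb(size: str) -> float:
    """Convert a _cat size string such as '31.4gb' or '832b' to gigabytes."""
    units = {"b": 1 / 1024**3, "kb": 1 / 1024**2, "mb": 1 / 1024, "gb": 1.0}
    for suffix in ("kb", "mb", "gb", "b"):  # check 'b' last: 'gb' also ends in 'b'
        if size.endswith(suffix):
            return float(size[: -len(suffix)]) * units[suffix]
    raise ValueError(size)

copies = 1 + int(rep)                    # primary + replicas
print(copies)                            # → 3
print(to_gb(store) / to_gb(pri_store))   # ≈ 3.02, consistent with 2 replicas
```

The small deviation from exactly 3 reflects per-shard overhead and merge state, not missing replicas.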
Resource utilization
Test #1 (cs00000int main tenant)
During the test on the cs00000int tenant, service CPU utilization was stable and consistent. The most CPU-consuming services were mod-inventory-storage = 94%, nginx-okapi = 81%, okapi = 60%, and mod-inventory = 15%. No memory leaks were observed. The number of database connections increased by 50, to 650. DB CPU usage rose from 1% to 14% during the first 15 minutes of reindexing; this spike correlated with the change in service CPU utilization and was caused by a high indexing data rate (about 500K operations per minute).
Open search metrics.
At the beginning of the reindex, the data rate was about 500K operations per minute; after 15 minutes it decreased to 8K. Indexing latency rose to about 4 milliseconds 15 minutes after the test started and fell back to 1.5 milliseconds after 50 minutes. This spike correlates with data-node CPU utilization increasing to 16% and then decreasing to 7%. CPU utilization of the master node varied from 5% to 27% during reindexing.
Service CPU Utilization
Memory Utilization
DB CPU Utilization
DB Connections
Open Search metrics
Indexing Data Rate
Subrange of the reindexing process from 09:55 to 11:50. This graph was added to show, in detail, behavior that is aggregated in the graph above.
Indexing latency
Master CPU Utilization (Average)
CPU utilization percentage for all data nodes (Average).
Memory usage percentage for all data nodes (Average).
Test #2 (cs00000int_0001)
During the test on the cs00000int_0001 tenant, the most CPU-consuming services were mod-inventory-storage and nginx-okapi; both spiked at the beginning of the reindex and again in the middle. No memory leaks were observed. The number of database connections increased by 100 at the beginning, reaching 700. DB CPU usage increased to 14%; this spike correlates with the change in service CPU utilization and was caused by a high indexing data rate (about 110K operations per minute).
Open search metrics.
At the beginning of the reindex, the data rate was about 120K operations per minute; after an hour it decreased to 25K, stayed at roughly 25K for the next 2 hours, and dropped to 60 for the last 5 hours. After the data rate decreased to 60, the average indexing latency increased from 350 to 600 milliseconds. Data-node CPU utilization was about 60% at the beginning and correlated with the reindex data rate. CPU utilization of the master node varied from 5% to 27% during reindexing.
Memory Utilization
DB CPU Utilization
DB Connections
Open Search metrics
Indexing Data Rate
Indexing latency
Master CPU Utilization (Average)
CPU utilization percentage for all data nodes (Average).
Memory usage percentage for all data nodes (Average).
Test #3 (cs00000int_0002)
During the test on the cs00000int_0002 tenant, mod-inventory-storage and nginx-okapi CPU utilization spiked throughout the test. No memory leaks were observed. The number of database connections increased by 50, to 650. DB CPU usage also spiked during the test, fluctuating between 20% and 100%.
Open search metrics.
In the beginning, the reindex data rate was about 20K operations per minute and was spiky; after 2 hours it decreased to 5K. Indexing latency varied during the test, peaking at about 3000 milliseconds. This correlates with data-node CPU utilization increasing to 16% and then decreasing to 7%. CPU utilization of the master node varied from 5% to 27% during reindexing. CPU changes on the data nodes were not uniform and varied from 0 to 70%.
Service CPU Utilization
Memory Utilization
DB CPU Utilization
DB Connections
Open Search metrics
Indexing Data Rate
Indexing latency
Master CPU Utilization (Average)
CPU utilization percentage for all data nodes (Average).
Memory usage percentage for all data nodes (Average).
Test #4 (In parallel: 3 tenants)
Service CPU Utilization
Memory Utilization
DB CPU Utilization
DB Connections
Open Search metrics
Indexing Data Rate
Subrange of the reindexing process from 22:30 to 05:00. This graph was added to show, in detail, behavior that is aggregated in the graph above.
Indexing latency
Master CPU Utilization (Average)
CPU utilization percentage for all data nodes (Average).
Memory usage percentage for all data nodes (Average).
Appendix
Infrastructure
PTF environment: qcon
10 m6g.2xlarge EC2 instances located in US East (N. Virginia), us-east-1
1 db.r6g.8xlarge database instance. Engine version 16.1
MSK ptf-kafka-3
4 m5.2xlarge brokers in 2 zones
Apache Kafka version 2.8.0
EBS storage volume per broker 300 GiB
auto.create.topics.enable=true
log.retention.minutes=480
default.replication.factor=3
OpenSearch fse cluster
OpenSearch versionĀ 2.7;
Data nodes
Availability Zone(s) - 2-AZ without standby
Instance type - r6g.4xlarge.search
Number of nodes - 6
EBS volume size (GiB) - 500
Provisioned IOPS - 3000
Provisioned Throughput (MiB/s) - 250 MiB/s
Dedicated master nodes
Instance type - r6g.large.search
Number of nodes - 3
Data structure
Module versions
Module | Task Def. Revision | Module Version | Task Count | Mem Hard Limit | Mem Soft limit | CPU units | Xmx | MetaspaceSize | MaxMetaspaceSize | R/W split enabled |
qcon-pvt | ||||||||||
mod-search | 2 | mod-search:3.2.0 | 8 | 2592 | 2480 | 2048 | 1440 | 512 | 1024 | FALSE |
mod-authtoken | 1 | mod-authtoken:2.15.1 | 2 | 1440 | 1152 | 512 | 922 | 88 | 128 | FALSE |
mod-inventory-storage | 1 | mod-inventory-storage:27.1.0 | 2 | 4096 | 3690 | 2048 | 3076 | 384 | 512 | FALSE |
mod-inventory | 1 | mod-inventory:20.2.0 | 2 | 2880 | 2592 | 1024 | 1814 | 384 | 512 | FALSE |
mod-users | 1 | mod-users:19.3.1 | 2 | 1024 | 896 | 128 | 768 | 88 | 128 | FALSE |
nginx-okapi | 1 | nginx-okapi:2023.06.14 | 2 | 1024 | 896 | 128 | 0 | 0 | 0 | FALSE |
okapi-b | 1 | okapi:5.3.0 | 3 | 1684 | 1440 | 1024 | 922 | 384 | 512 | FALSE |
Methodology/Approach
Use consortium cluster for testing (qcon in our case).
Configure the environment in accordance with the Infrastructure parameters and requirements in ticket PERF-889: ECS Reindex for mod-search (classification browse) (Closed).
Reindex, then record the indexing time and index size. See Steps for testing process#Reindex for details.
The reindex process was started from a JMeter script using a POST request to /search/index/inventory/reindex with the following parameters:
{
  "recreateIndex": true,
  "resourceName": "instance"
}
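Outside of JMeter, the same request can be issued with the Python standard library. This is only a sketch: the Okapi base URL, tenant id, and token below are placeholders (not from this document); only the endpoint path and JSON body come from the description above.

```python
import json
import urllib.request

# Placeholder Okapi gateway URL and credentials -- replace with your
# environment's values before use.
OKAPI_URL = "https://okapi.example.org"
HEADERS = {
    "Content-Type": "application/json",
    "x-okapi-tenant": "cs00000int",  # one of the tenants used in the tests
    "x-okapi-token": "<token>",      # placeholder
}

# Request body as described above.
payload = {"recreateIndex": True, "resourceName": "instance"}

request = urllib.request.Request(
    f"{OKAPI_URL}/search/index/inventory/reindex",
    data=json.dumps(payload).encode("utf-8"),
    headers=HEADERS,
    method="POST",
)
# urllib.request.urlopen(request)  # uncomment to actually start the reindex
```

Note that `recreateIndex: true` drops and recreates the index, so search results are incomplete until the reindex finishes.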