ECS mod-search: Full Reindexing Test (Ramsons)
Overview
The purpose of this document is to assess full reindexing performance on the Ramsons release.
Feature implementation: https://folio-org.atlassian.net/browse/UXPROD-4892
Jira ticket: https://folio-org.atlassian.net/browse/PERF-984
Test Summary
A full reindex of about 13.8 million instances across all tenants completed in 3 hours 4 minutes (db.r6g.8xlarge). This is a new feature: the reindex was started on the central tenant and performed for all tenants in parallel. The reindex time meets the requirement (expected response time: the whole reindexing procedure should take under 6 hours).
Service CPU utilization reached up to 60% for mod-search and 5% for mod-inventory-storage; no other service exceeded 4%.
Memory utilization was stable and no memory leaks or OOM issues were observed.
RDS CPU utilization reached up to about 28% (db.r6g.8xlarge).
Test Runs/Results
Test # | Start time (UTC) | End time (UTC) | Instances number | Test conditions (reindexing on Ramsons release, consortium environment) | Duration | Notes |
---|---|---|---|---|---|---|
1 | 2024-10-22T13:02:35 | 2024-10-22T16:06:18 | 13,777,503 * | In parallel: all tenants | 3 hours 4 minutes | |
* Total number of instances across all tenants, according to the database
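The 3 hours 4 minutes figure is the rounded difference between the start and end times above. The Python sketch below reproduces it and checks it against the 6-hour requirement; the average throughput (instances per second) is a derived illustration, not a metric reported by the test itself.

```python
from datetime import datetime

# Start/end timestamps (UTC) and instance count taken from the Test Runs table.
START = "2024-10-22T13:02:35"
END = "2024-10-22T16:06:18"
TOTAL_INSTANCES = 13_777_503
REQUIREMENT_HOURS = 6  # "Whole reindexing procedure should take under 6 hours"


def run_summary(start: str, end: str, instances: int) -> dict:
    """Return duration, average throughput and requirement check for one run."""
    elapsed = datetime.fromisoformat(end) - datetime.fromisoformat(start)
    seconds = int(elapsed.total_seconds())
    return {
        "duration": elapsed,
        "seconds": seconds,
        "instances_per_second": instances / seconds,
        "meets_requirement": seconds < REQUIREMENT_HOURS * 3600,
    }


summary = run_summary(START, END, TOTAL_INSTANCES)
print(summary["duration"])                     # 3:03:43
print(round(summary["instances_per_second"]))  # 1250
print(summary["meets_requirement"])            # True
```

The exact elapsed time is 3:03:43, which rounds to the 3 hours 4 minutes reported in the table.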
Number of instances per tenant
Tenant name | Instances (UI) | Instances (database) |
---|---|---|
cs00000int | 2,216,166 | 2,216,185 |
cs00000int_0001 | 8,799,538 | 7,015,237 |
cs00000int_0002 | 3,560,509 | 1,347,316 |
cs00000int_0003 | 3,187,778 | 1,135,806 |
cs00000int_0004 | 3,038,850 | 1,054,330 |
cs00000int_0005 | 2,836,270 | 1,004,629 |
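The per-tenant database counts can be cross-checked against the total in the Test Runs table with a short Python sketch. Summing the database column as listed gives 13,773,503, which is 4,000 fewer than the 13,777,503 quoted above, so one of these figures may be worth rechecking against the database.

```python
# Per-tenant instance counts copied from the table above: (UI, database).
COUNTS = {
    "cs00000int": (2_216_166, 2_216_185),
    "cs00000int_0001": (8_799_538, 7_015_237),
    "cs00000int_0002": (3_560_509, 1_347_316),
    "cs00000int_0003": (3_187_778, 1_135_806),
    "cs00000int_0004": (3_038_850, 1_054_330),
    "cs00000int_0005": (2_836_270, 1_004_629),
}

REPORTED_TOTAL = 13_777_503  # value from the Test Runs table

db_total = sum(db for _, db in COUNTS.values())
ui_total = sum(ui for ui, _ in COUNTS.values())

print(f"database total: {db_total:,}")
print(f"UI total:       {ui_total:,}")
print(f"difference from reported total: {REPORTED_TOTAL - db_total:,}")
```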
Indexing size
All data below were captured after the test from the reindex monitoring endpoint GET /search/index/instance-records/reindex/status:
[
{
"entityType":"HOLDINGS",
"status":"MERGE_COMPLETED",
"totalMergeRanges":26246,
"processedMergeRanges":26246,
"totalUploadRanges":0,
"processedUploadRanges":0,
"startTimeMerge":"2024-10-22T13:02:36.049Z",
"endTimeMerge":"2024-10-22T14:12:31.965Z"
},
{
"entityType":"ITEM",
"status":"MERGE_COMPLETED",
"totalMergeRanges":31369,
"processedMergeRanges":31369,
"totalUploadRanges":0,
"processedUploadRanges":0,
"startTimeMerge":"2024-10-22T13:02:35.944Z",
"endTimeMerge":"2024-10-22T14:06:32.674Z"
},
{
"entityType":"SUBJECT",
"status":"UPLOAD_COMPLETED",
"totalMergeRanges":0,
"processedMergeRanges":0,
"totalUploadRanges":4095,
"processedUploadRanges":4095,
"startTimeUpload":"2024-10-22T14:12:33.759Z",
"endTimeUpload":"2024-10-22T15:11:58.204Z"
},
{
"entityType":"INSTANCE",
"status":"UPLOAD_COMPLETED",
"totalMergeRanges":0,
"processedMergeRanges":0,
"totalUploadRanges":12559,
"processedUploadRanges":12569,
"startTimeUpload":"2024-10-22T14:12:35.052Z",
"endTimeUpload":"2024-10-22T16:06:18.415Z"
},
{
"entityType":"CONTRIBUTOR",
"status":"UPLOAD_COMPLETED",
"totalMergeRanges":0,
"processedMergeRanges":0,
"totalUploadRanges":4095,
"processedUploadRanges":4095,
"startTimeUpload":"2024-10-22T14:12:34.509Z",
"endTimeUpload":"2024-10-22T15:12:19.081Z"
},
{
"entityType":"CLASSIFICATION",
"status":"UPLOAD_COMPLETED",
"totalMergeRanges":0,
"processedMergeRanges":0,
"totalUploadRanges":4095,
"processedUploadRanges":4095,
"startTimeUpload":"2024-10-22T14:12:35.088Z",
"endTimeUpload":"2024-10-22T15:26:05.314Z"
}
]
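Per-entity stage durations can be derived from the status response above. The Python sketch below parses the response (zero-valued range fields omitted for brevity), computes how long each merge or upload stage took, and checks whether processed ranges equal total ranges. It assumes entities whose status starts with MERGE carry the *Merge fields and all others carry the *Upload fields, as in the captured response. On this data the check flags INSTANCE, whose processedUploadRanges (12569) is higher than its totalUploadRanges (12559).

```python
import json
from datetime import datetime, timedelta

# Status response captured after the test (fields with zero values omitted).
STATUS_JSON = """[
  {"entityType": "HOLDINGS", "status": "MERGE_COMPLETED", "totalMergeRanges": 26246, "processedMergeRanges": 26246,
   "startTimeMerge": "2024-10-22T13:02:36.049Z", "endTimeMerge": "2024-10-22T14:12:31.965Z"},
  {"entityType": "ITEM", "status": "MERGE_COMPLETED", "totalMergeRanges": 31369, "processedMergeRanges": 31369,
   "startTimeMerge": "2024-10-22T13:02:35.944Z", "endTimeMerge": "2024-10-22T14:06:32.674Z"},
  {"entityType": "SUBJECT", "status": "UPLOAD_COMPLETED", "totalUploadRanges": 4095, "processedUploadRanges": 4095,
   "startTimeUpload": "2024-10-22T14:12:33.759Z", "endTimeUpload": "2024-10-22T15:11:58.204Z"},
  {"entityType": "INSTANCE", "status": "UPLOAD_COMPLETED", "totalUploadRanges": 12559, "processedUploadRanges": 12569,
   "startTimeUpload": "2024-10-22T14:12:35.052Z", "endTimeUpload": "2024-10-22T16:06:18.415Z"},
  {"entityType": "CONTRIBUTOR", "status": "UPLOAD_COMPLETED", "totalUploadRanges": 4095, "processedUploadRanges": 4095,
   "startTimeUpload": "2024-10-22T14:12:34.509Z", "endTimeUpload": "2024-10-22T15:12:19.081Z"},
  {"entityType": "CLASSIFICATION", "status": "UPLOAD_COMPLETED", "totalUploadRanges": 4095, "processedUploadRanges": 4095,
   "startTimeUpload": "2024-10-22T14:12:35.088Z", "endTimeUpload": "2024-10-22T15:26:05.314Z"}
]"""


def parse_ts(value: str) -> datetime:
    # datetime.fromisoformat only accepts a trailing "Z" from Python 3.11, so map it to +00:00.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def summarize(entries: list) -> list:
    """Return (entityType, stage, duration, ranges_ok) for each status entry."""
    rows = []
    for entry in entries:
        # Assumption: MERGE_* statuses use *Merge fields, everything else *Upload fields.
        stage = "Merge" if entry["status"].startswith("MERGE") else "Upload"
        duration = parse_ts(entry[f"endTime{stage}"]) - parse_ts(entry[f"startTime{stage}"])
        ranges_ok = entry[f"processed{stage}Ranges"] == entry[f"total{stage}Ranges"]
        rows.append((entry["entityType"], stage.lower(), duration, ranges_ok))
    return rows


rows = summarize(json.loads(STATUS_JSON))
for name, stage, duration, ranges_ok in rows:
    print(f"{name:<15} {stage:<7} {duration}  ranges_ok={ranges_ok}")
```

The merge stage (HOLDINGS, ITEM) ends around 14:12 UTC, after which the upload stages start, and INSTANCE upload is the last to finish, which matches the overall end time of 16:06:18.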
Resource utilization
Service CPU Utilization
Memory Utilization
Instance CPU Utilization
DB CPU Utilization
DB Connections
OpenSearch metrics
Sub-range of the reindexing process, 13:02-16:06 UTC. This graph shows in more detail the behavior aggregated in the graph above.
CPU utilization percentage for all data nodes
Memory usage percentage for all data nodes
Average JVM Memory Pressure
Maximum memory utilization (SysMemoryUtilization)
Appendix
Infrastructure
PTF environment: rcon
9 m6g.2xlarge EC2 instances located in US East (N. Virginia) us-east-1
1 db.r6g.8xlarge database instance (writer)
MSK - fse-tenant
4 kafka.m7g.xlarge brokers in 2 zones
Apache Kafka version 3.7.x
EBS storage volume per broker 300 GiB
auto.create.topics.enable=true
log.retention.minutes=480
default.replication.factor=3
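The broker settings above are in Kafka properties syntax. As an illustrative sanity check (not part of the original test), the Python sketch below parses them and confirms that the 480-minute log retention (8 hours) outlasts the roughly 3-hour reindex run and that a replication factor of 3 fits within the 4 brokers. The minimal parser is a hypothetical helper written for this example.

```python
# Broker settings copied from the MSK configuration above (Kafka properties syntax).
BROKER_PROPERTIES = """
auto.create.topics.enable=true
log.retention.minutes=480
default.replication.factor=3
"""
BROKER_COUNT = 4                 # 4 kafka.m7g.xlarge brokers
RUN_DURATION_HOURS = 3 + 4 / 60  # observed reindex duration, about 3 h 4 min


def parse_properties(text: str) -> dict:
    """Hypothetical minimal key=value parser; skips blank lines and # comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


props = parse_properties(BROKER_PROPERTIES)
retention_hours = int(props["log.retention.minutes"]) / 60
replication = int(props["default.replication.factor"])

print(f"retention: {retention_hours:g} h")  # retention: 8 h
print("retention outlasts the run:", retention_hours > RUN_DURATION_HOURS)
print("replication factor fits broker count:", replication <= BROKER_COUNT)
```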
OpenSearch ptf-reindex-test cluster
OpenSearch version 2.13
Data nodes
Availability Zone(s) - 2-AZ without standby
Instance type - r6g.4xlarge.search
Number of nodes - 4
EBS volume size (GiB) - 300
Provisioned IOPS - 3000 IOPS
Provisioned Throughput (MiB/s) - 250 MiB/s
Dedicated master nodes
Enabled - No
Module versions
Module | Task Def. Revision | Module Version | Task Count | Mem Hard Limit | Mem Soft Limit | CPU units | Xmx | MetaspaceSize | MaxMetaspaceSize |
---|---|---|---|---|---|---|---|---|---|
rcon-pvt | | | | | | | | | |
mod-search | 8 | mod-search:4.0.0-SNAPSHOT.281 | 4 | 2592 | 2480 | 2048 | | 512 | 1024 |
mod-authtoken | 3 | mod-authtoken:2.16.0-SNAPSHOT.303 | 2 | 1440 | 1152 | 0 | 922 | 88 | 128 |
mod-inventory-storage | 3 | mod-inventory-storage:27.2.0-SNAPSHOT.773 | 4 | 4096 | 3690 | 0 | 3076 | 512 | 1024 |
mod-inventory | 2 | mod-inventory:20.3.0-SNAPSHOT.546 | 2 | 2880 | 2592 | 0 | 1814 | 384 | 512 |
mod-users | 2 | mod-users:19.3.3-SNAPSHOT.702 | 2 | 1024 | 896 | 0 | 768 | 88 | 128 |
nginx-okapi | 2 | nginx-okapi:2023.06.14 | 2 | 1024 | 896 | 0 | 0 | 0 | 0 |
okapi-b | 2 | okapi:5.3.0 | 3 | 1684 | 1440 | 0 | 922 | 384 | 512 |
Methodology/Approach
Use the consortium cluster for testing (rcon in our case).
Configure the environment according to the infrastructure parameters and requirements in the ticket https://folio-org.atlassian.net/browse/PERF-889.
Start the reindex from the JMeter script with a POST request to /search/index/instance-records/reindex/full, without any parameters, on the central tenant. Reindexing for all other tenants in the consortium cluster is then performed automatically.
When the reindex finishes, get the indexing time and size results from GET /search/index/instance-records/reindex/status.
Script: http://github.com/folio-org/perf-testing/mod-search
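The steps above were driven from a JMeter script. For readers who want to reproduce them without JMeter, the Python sketch below shows one way to start the full reindex and poll the status endpoint using only the standard library. The Okapi URL, token and tenant ID are placeholders (cs00000int is assumed to be the central tenant), and the completion rule, that INSTANCE, SUBJECT, CONTRIBUTOR and CLASSIFICATION all report UPLOAD_COMPLETED, is an assumption based on the final status response captured in this test rather than documented mod-search behavior.

```python
import json
import time
import urllib.request

# Placeholder values: replace with the environment under test.
OKAPI_URL = "https://okapi.example.org"  # hypothetical Okapi URL
CENTRAL_TENANT = "cs00000int"            # assumed central tenant ID
TOKEN = "<okapi-token>"                  # hypothetical auth token

# Assumption: entity types that end with UPLOAD_COMPLETED in the captured status response.
UPLOAD_ENTITIES = {"INSTANCE", "SUBJECT", "CONTRIBUTOR", "CLASSIFICATION"}


def okapi_request(path: str, method: str = "GET") -> urllib.request.Request:
    """Build a request against Okapi with FOLIO tenant/token headers."""
    return urllib.request.Request(
        OKAPI_URL + path,
        method=method,
        headers={
            "X-Okapi-Tenant": CENTRAL_TENANT,
            "X-Okapi-Token": TOKEN,
            "Accept": "application/json",
        },
    )


def start_full_reindex() -> int:
    # POST with no body, matching "without any parameters" in the methodology.
    req = okapi_request("/search/index/instance-records/reindex/full", "POST")
    with urllib.request.urlopen(req, timeout=30) as resp:
        return resp.status


def get_status() -> list:
    req = okapi_request("/search/index/instance-records/reindex/status")
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)


def is_finished(entries: list) -> bool:
    """Assumed rule: every upload entity reports UPLOAD_COMPLETED."""
    done = {e["entityType"] for e in entries if e["status"] == "UPLOAD_COMPLETED"}
    return UPLOAD_ENTITIES <= done


def wait_for_reindex(poll_seconds: int = 60) -> float:
    """Poll the status endpoint until finished; return elapsed seconds."""
    started = time.monotonic()
    while not is_finished(get_status()):
        time.sleep(poll_seconds)
    return time.monotonic() - started
```

Usage against a live environment would be `start_full_reindex()` followed by `wait_for_reindex()`; the elapsed time measured this way only approximates the start/end timestamps reported by the status endpoint itself.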