ECS mod-search: Test Reindexing full (Ramsons)
- 1 Overview
- 2 Test Summary
- 3 Test Runs /Results
- 3.1 Indexing size
- 4 Resource utilization
- 5 Appendix
- 6 Additional Testing with mod-search:4.0.11
- 6.1 Overview
- 6.2 Summary
- 6.3 Results
- 6.3.1 Detailed results analysis/comparison
- 6.3.2 Merging
- 6.3.3 Uploading
- 6.4 Resource utilization
- 6.4.1 OpenSearch metrics
- 6.4.2 Instance CPU
- 6.4.3 Services metrics
- 6.4.4 DB Metrics
- 6.5 Infrastructure
Overview
This document assesses mod-search reindexing performance on the Ramsons release.
Implementation of the feature UXPROD-4892: Reindexing improvements (Closed)
Jira ticket: PERF-984: [Ramsons] [ECS] Reindex for mod-search (reindex improvements) (Closed)
Test Summary
A full reindex of about 13.8 million instances across all tenants completed in 3 hours and 4 minutes on db.r6g.8xlarge. With this new feature, the reindex is started on the central tenant and runs for all tenants in parallel. The reindex time meets the requirement (expected response time: the whole reindexing procedure should take under 6 hours).
Service CPU utilization was up to 60% for mod-search and 5% for mod-inventory-storage. For all other services CPU did not exceed 4%.
Memory utilization was stable and no memory leaks or OOM issues were observed.
RDS CPU utilization peaked at about 28% on db.r6g.8xlarge.
Test Runs /Results
Test # | Start time | End time | Instances number | Test conditions (reindexing on Ramsons release, consortium environment) | Duration | Notes |
---|---|---|---|---|---|---|
1 | 2024-10-22T13:02:35 | 2024-10-22T16:06:18 | 13,777,503 * | In parallel: all tenants | 3 hours 4 minutes | |
* Total number of instances across all tenants according to the database.
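As a rough illustration of what the duration means in terms of throughput, the sketch below divides the total instance count by the elapsed time between the start and end timestamps from the table. It is an end-to-end average over both phases and all tenants, not a measured rate; the function and variable names are chosen here for illustration only.

```python
from datetime import datetime

# Values copied from the test run table above.
START = datetime.fromisoformat("2024-10-22T13:02:35")
END = datetime.fromisoformat("2024-10-22T16:06:18")
TOTAL_INSTANCES = 13_777_503


def throughput_per_second(total: int, start: datetime, end: datetime) -> float:
    """Average number of records processed per second over the whole run."""
    elapsed = (end - start).total_seconds()
    return total / elapsed


elapsed_seconds = (END - START).total_seconds()
rate = throughput_per_second(TOTAL_INSTANCES, START, END)
print(f"elapsed: {elapsed_seconds:.0f} s, average rate: {rate:.0f} instances/s")
```

Under these inputs the run lasts 11,023 seconds, which works out to roughly 1,250 instances per second on average.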
Instances number per tenant
Tenant name | Instances number from UI | Instances number from the database |
---|---|---|
cs00000int | 2,216,166 | 2,216,185 |
cs00000int_0001 | 8,799,538 | 7,015,237 |
cs00000int_0002 | 3,560,509 | 1,347,316 |
cs00000int_0003 | 3,187,778 | 1,135,806 |
cs00000int_0004 | 3,038,850 | 1,054,330 |
cs00000int_0005 | 2,836,270 | 1,004,629 |
Indexing size
All data below was captured after the test. Response of the reindex monitoring request GET /search/index/instance-records/reindex/status:
[
{
"entityType":"HOLDINGS",
"status":"MERGE_COMPLETED",
"totalMergeRanges":26246,
"processedMergeRanges":26246,
"totalUploadRanges":0,
"processedUploadRanges":0,
"startTimeMerge":"2024-10-22T13:02:36.049Z",
"endTimeMerge":"2024-10-22T14:12:31.965Z"
},
{
"entityType":"ITEM",
"status":"MERGE_COMPLETED",
"totalMergeRanges":31369,
"processedMergeRanges":31369,
"totalUploadRanges":0,
"processedUploadRanges":0,
"startTimeMerge":"2024-10-22T13:02:35.944Z",
"endTimeMerge":"2024-10-22T14:06:32.674Z"
},
{
"entityType":"SUBJECT",
"status":"UPLOAD_COMPLETED",
"totalMergeRanges":0,
"processedMergeRanges":0,
"totalUploadRanges":4095,
"processedUploadRanges":4095,
"startTimeUpload":"2024-10-22T14:12:33.759Z",
"endTimeUpload":"2024-10-22T15:11:58.204Z"
},
{
"entityType":"INSTANCE",
"status":"UPLOAD_COMPLETED",
"totalMergeRanges":0,
"processedMergeRanges":0,
"totalUploadRanges":12559,
"processedUploadRanges":12569,
"startTimeUpload":"2024-10-22T14:12:35.052Z",
"endTimeUpload":"2024-10-22T16:06:18.415Z"
},
{
"entityType":"CONTRIBUTOR",
"status":"UPLOAD_COMPLETED",
"totalMergeRanges":0,
"processedMergeRanges":0,
"totalUploadRanges":4095,
"processedUploadRanges":4095,
"startTimeUpload":"2024-10-22T14:12:34.509Z",
"endTimeUpload":"2024-10-22T15:12:19.081Z"
},
{
"entityType":"CLASSIFICATION",
"status":"UPLOAD_COMPLETED",
"totalMergeRanges":0,
"processedMergeRanges":0,
"totalUploadRanges":4095,
"processedUploadRanges":4095,
"startTimeUpload":"2024-10-22T14:12:35.088Z",
"endTimeUpload":"2024-10-22T15:26:05.314Z"
}
]
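The per-entity durations can be derived from the timestamps in this response. The following is a minimal sketch, assuming the field names shown above (`startTimeMerge`, `endTimeMerge`, `startTimeUpload`, `endTimeUpload`); only the timestamps are copied from the response, and the helper names are hypothetical.

```python
from datetime import datetime, timedelta

# Timestamps copied from the status response above (range counters omitted).
STATUS = [
    {"entityType": "HOLDINGS", "startTimeMerge": "2024-10-22T13:02:36.049Z", "endTimeMerge": "2024-10-22T14:12:31.965Z"},
    {"entityType": "ITEM", "startTimeMerge": "2024-10-22T13:02:35.944Z", "endTimeMerge": "2024-10-22T14:06:32.674Z"},
    {"entityType": "SUBJECT", "startTimeUpload": "2024-10-22T14:12:33.759Z", "endTimeUpload": "2024-10-22T15:11:58.204Z"},
    {"entityType": "INSTANCE", "startTimeUpload": "2024-10-22T14:12:35.052Z", "endTimeUpload": "2024-10-22T16:06:18.415Z"},
    {"entityType": "CONTRIBUTOR", "startTimeUpload": "2024-10-22T14:12:34.509Z", "endTimeUpload": "2024-10-22T15:12:19.081Z"},
    {"entityType": "CLASSIFICATION", "startTimeUpload": "2024-10-22T14:12:35.088Z", "endTimeUpload": "2024-10-22T15:26:05.314Z"},
]


def parse_ts(value: str) -> datetime:
    """Parse the UTC timestamps used in the status response."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%S.%fZ")


def entity_window(entry: dict) -> tuple:
    """Return (phase, start, end) for whichever phase the entry reports."""
    for phase in ("Merge", "Upload"):
        start = entry.get(f"startTime{phase}")
        end = entry.get(f"endTime{phase}")
        if start and end:
            return phase.upper(), parse_ts(start), parse_ts(end)
    raise ValueError(f"no complete phase timestamps for {entry['entityType']}")


windows = {entry["entityType"]: entity_window(entry) for entry in STATUS}
for entity, (phase, start, end) in windows.items():
    print(f"{entity:<15}{phase:<8}{end - start}")

# Overall duration: earliest phase start to latest phase end.
overall = max(w[2] for w in windows.values()) - min(w[1] for w in windows.values())
print("overall:", overall)
```

The overall span computed this way (earliest merge start to latest upload end) is what the 3 hours 4 minutes in the test run table corresponds to; the INSTANCE upload is the longest single step.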
Resource utilization
Service CPU Utilization
Memory Utilization
Instance CPU Utilization
DB CPU Utilization
DB Connections
OpenSearch metrics
Subrange of the reindexing process, 13:02-16:06 UTC. This graph was added to show in detail the behavior that is aggregated in the graph above.
CPU utilization percentage for all data nodes
Memory usage percentage for all data nodes
Average JVM Memory Pressure
Maximum memory utilization (SysMemoryUtilization)
Appendix
Infrastructure
PTF-environment rcon
9 m6g.2xlarge EC2 instances located in US East (N. Virginia), us-east-1
1 instance of db.r6g.8xlarge database, writer instance.
MSK - fse-tenant
4 kafka.m7g.xlarge brokers in 2 zones
Apache Kafka version 3.7.x
EBS storage volume per broker 300 GiB
auto.create.topics.enable=true
log.retention.minutes=480
default.replication.factor=3
OpenSearch ptf-reindex-test cluster
OpenSearch version 2.13
Data nodes
Availability Zone(s) - 2-AZ without standby
Instance type - r6g.4xlarge.search
Number of nodes - 4
EBS volume size (GiB) - 300
Provisioned IOPS - 3000 IOPS
Provisioned Throughput (MiB/s) - 250 MiB/s
Dedicated master nodes
Enabled - No
Module versions
Module | Task Def. Revision | Module Version | Task Count | Mem Hard Limit | Mem Soft Limit | CPU units | Xmx | MetaspaceSize | MaxMetaspaceSize |
---|---|---|---|---|---|---|---|---|---|
rcon-pvt | |||||||||
mod-search | 8 | mod-search:4.0.0-SNAPSHOT.281 | 4 | 2592 | 2480 | 2048 | | 512 | 1024 |
mod-authtoken | 3 | mod-authtoken:2.16.0-SNAPSHOT.303 | 2 | 1440 | 1152 | 0 | 922 | 88 | 128 |
mod-inventory-storage | 3 | mod-inventory-storage:27.2.0-SNAPSHOT.773 | 4 | 4096 | 3690 | 0 | 3076 | 512 | 1024 |
mod-inventory | 2 | mod-inventory:20.3.0-SNAPSHOT.546 | 2 | 2880 | 2592 | 0 | 1814 | 384 | 512 |
mod-users | 2 | mod-users:19.3.3-SNAPSHOT.702 | 2 | 1024 | 896 | 0 | 768 | 88 | 128 |
nginx-okapi | 2 | nginx-okapi:2023.06.14 | 2 | 1024 | 896 | 0 | 0 | 0 | 0 |
okapi-b | 2 | okapi:5.3.0 | 3 | 1684 | 1440 | 0 | 922 | 384 | 512 |
Methodology/Approach
Use the consortium cluster for testing (rcon in our case).
Configure the environment according to the Infrastructure parameters and the requirements in ticket PERF-889: ECS Reindex for mod-search (classification browse) (Closed).
Start the reindex from the JMeter script with a POST request to /search/index/instance-records/reindex/full, without any parameters, on the central tenant. For all other tenants in the consortium cluster the reindex is performed automatically.
After the reindex, get the indexing time and size from GET /search/index/instance-records/reindex/status.
The script is located at http://github.com/folio-org/perf-testing/mod-search
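The test itself was driven from JMeter, but the start-and-poll flow described above can also be sketched in Python. This is an illustration only: the two endpoint paths and the status values come from this report, while the `x-okapi-tenant` / `x-okapi-token` headers, the polling interval, the completion rule in `EXPECTED_FINAL`, and all function names are assumptions. Failure statuses are not handled other than by the timeout.

```python
import json
import time
import urllib.request
from typing import Callable

# Endpoint paths as given in the methodology above.
FULL_REINDEX_PATH = "/search/index/instance-records/reindex/full"
STATUS_PATH = "/search/index/instance-records/reindex/status"

# Final status per entity type, as seen in the status response in this report.
# Assumption: a run is complete once every entity below reports this status.
EXPECTED_FINAL = {
    "HOLDINGS": "MERGE_COMPLETED",
    "ITEM": "MERGE_COMPLETED",
    "INSTANCE": "UPLOAD_COMPLETED",
    "SUBJECT": "UPLOAD_COMPLETED",
    "CONTRIBUTOR": "UPLOAD_COMPLETED",
    "CLASSIFICATION": "UPLOAD_COMPLETED",
}


def okapi_request(base_url: str, path: str, tenant: str, token: str, method: str = "GET"):
    """Send a request with tenant/token headers; return parsed JSON or None."""
    request = urllib.request.Request(
        base_url + path,
        method=method,
        headers={"x-okapi-tenant": tenant, "x-okapi-token": token},
    )
    with urllib.request.urlopen(request) as response:
        body = response.read()
    return json.loads(body) if body else None


def is_finished(statuses: list) -> bool:
    """True when every expected entity reports its final status."""
    current = {entry["entityType"]: entry["status"] for entry in statuses}
    return all(current.get(entity) == final for entity, final in EXPECTED_FINAL.items())


def wait_for_reindex(
    fetch_status: Callable[[], list],
    poll_seconds: float = 60.0,
    timeout_seconds: float = 6 * 3600,  # the 6-hour requirement from the summary
    sleep: Callable[[float], None] = time.sleep,
) -> list:
    """Poll the status until the reindex is finished or the timeout is exceeded."""
    waited = 0.0
    while True:
        statuses = fetch_status()
        if is_finished(statuses):
            return statuses
        if waited >= timeout_seconds:
            raise TimeoutError("reindex did not finish within the allowed time")
        sleep(poll_seconds)
        waited += poll_seconds


# Example wiring (hypothetical base URL, tenant and token; not executed here):
# okapi_request(BASE_URL, FULL_REINDEX_PATH, "cs00000int", TOKEN, method="POST")
# result = wait_for_reindex(lambda: okapi_request(BASE_URL, STATUS_PATH, "cs00000int", TOKEN))
```

Passing the status fetcher and the sleep function as parameters keeps the polling logic testable without a live environment.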
Additional Testing with mod-search:4.0.11
Overview
Reindex retest in the scope of PERF-1087: [Ramsons] [ECS] Reindex for mod-search (Call Number Browse Refactor) (Closed), performed after the call number browse refactoring. The goal of the retest is to measure the performance of the mod-search reindex and its effect on DI (Data Import) running in the background.
The whole reindex process consists of two phases, “Merging” and “Uploading”, which run in sequence.
The “Merging” phase consists of “items merging” and “holdings merging”, which run concurrently.
The “Uploading” phase consists of the Instance, Subject, Contributor, Classification, and Call_number uploads to OS (OpenSearch), which run concurrently.
Summary
Both tests finished successfully. All data was reindexed.
Tests were performed with 8 mod-search tasks, 2xlarge OpenSearch data nodes, and an 8xlarge DB instance.
For a Bugfest-like data volume on the ECS environment, the full reindex finished in 45-52 minutes (including the merging and uploading phases).
Data Import of 25K records (instances, holdings, items + bib creation) finished successfully in 6 minutes; on the same configuration without reindexing in the background it takes 5 min 30 sec (the latest result for 25K on the same environment with default configuration is 10 min 30 s, see Data Import test report Ramsons [ECS]). The DI performance improvement is explained by scaling the DB up to an 8xlarge instance. Acceptance criteria met (DI affected less than 5 %).
The reindex process is affected by Data Import: the merge phase duration increased by 6 min.
Merging documents may be affected by the high rate of Lock:transactionid waits on most of the typical queries (see the Performance Insights screenshot). Lock:transactionid may affect performance when there is high concurrency on the same row or table. Follow-up task: MSEARCH-1014: Investigate DB query optimization (Lock:transactionid) (Closed)
Deadlocks are still observed during reindexing, see MSEARCH-932: Simplify bulk failure error logs (Closed)
DB CPU utilization with the 8xlarge instance is about 35%.
Resource usage of the services (mod-inventory-storage, mod-search) is predictable and low: 12-15% CPU for mod-inventory-storage and less than 8% for mod-search. No memory leaks or memory anomalies were detected.
OpenSearch CPU reached 96% at 600-650K operations/sec (note: the OS data nodes are 2xlarge; see the Infrastructure section for details). This instance size is probably the minimum required for reindexing.
OpenSearch memory utilization reached 95% during the data saving phase of reindexing.
Results
Test # | Merge phase | Upload phase | Comment |
---|---|---|---|
1 | 30 min | 15 min | |
2 | 36 min | 16 min | DI 25K (create instances, holdings and items + MARC bib) running in the background |
Detailed results analysis/comparison
Test 2 is 6 minutes slower in the merging phase and 1.5-2 minutes slower in the uploading phase.
Merging | Records (test 1) | Records (test 2) | Duration (test 1) | Duration (test 2) | Duration change |
---|---|---|---|---|---|
ITEM | 13,156 | 13,156 | 12:23–12:52 (~29min) | 14:06–14:41 (~35min) | +6 min |
HOLDINGS | 7,554 | 7,554 | 12:23–12:53 (~30min) | 14:06–14:42 (~36min) | +6 min |
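The duration change for the merging phase follows directly from the time windows in the table. A small sketch of that arithmetic, using only the minute-resolution windows listed above (the helper name is illustrative):

```python
from datetime import datetime


def minutes_between(window: str) -> int:
    """Length in minutes of an 'HH:MM–HH:MM' window within a single day."""
    start, end = window.split("–")
    delta = datetime.strptime(end, "%H:%M") - datetime.strptime(start, "%H:%M")
    return int(delta.total_seconds() // 60)


# (test 1 window, test 2 window) copied from the merging table above.
MERGE_WINDOWS = {
    "ITEM": ("12:23–12:52", "14:06–14:41"),
    "HOLDINGS": ("12:23–12:53", "14:06–14:42"),
}

changes = {
    entity: minutes_between(test2) - minutes_between(test1)
    for entity, (test1, test2) in MERGE_WINDOWS.items()
}
print(changes)
```

Both entities come out 6 minutes slower in test 2, matching the table. Because the windows are rounded to whole minutes, the result is accurate only to about a minute.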
Uploading | Ranges (test 1) | Ranges (test 2) | Duration (test 1) | Duration (test 2) | Duration change |
---|---|---|---|---|---|
INSTANCE | 3,004 | 3,029* | 12:53–13:07 (~14min) | 14:42–14:56 (~14min) | Same |
SUBJECT | 4,096 | 4,096 | 12:53–13:07 (~14min) | 14:42–14:56 (~14min) | Same |
CONTRIBUTOR | 4,096 | 4,096 | 12:53–13:07 (~14min) | 14:42–14:56 (~15.5min) | +1.5 min |
CLASSIFICATION | 4,096 | 4,096 | 12:53–13:07 (~14min) | 14:42–14:56 (~16min) | +2 min |
CALL_NUMBER | 4,096 | 4,096 | 12:53–13:08 (~15min) | 14:42–14:57 (~15.5min) | ~Same |
* Note: the number of instances uploaded changed in test #2 because a 25K Data Import ran in the background.
Resource utilization
OpenSearch metrics
OpenSearch CPU usage almost reached its capacity. Scaling up may be considered as an improvement. All other metrics look normal.
Instance CPU
Services metrics
Note: no signs of memory leaks were observed. Memory usage of all involved modules returned to normal after testing.
DB Metrics
DB CPU is in good shape and reached 35% at most. Scaling the DB down may be considered for cost optimization.
Infrastructure
Module | Task Def. Revision | Module Version | Task Count | Mem Hard Limit | Mem Soft Limit | CPU units | Xmx | MetaspaceSize | MaxMetaspaceSize |
---|---|---|---|---|---|---|---|---|---|
rcon-pvt | |||||||||
mod-search | 28 | mod-search:4.0.11 | 8 | 2592 | 2480 | 0 | | 512 | 1024 |
mod-inventory-storage | 20 | mod-inventory-storage:28.0.9 | 2 | 4096 | 3690 | 0 | 3076 | 512 | 1024 |
PTF-environment rcon
6 m6g.2xlarge EC2 instances located in US East (N. Virginia), us-east-1
1 instance of db.r6g.8xlarge database, writer instance.
MSK - fse-tenant
4 kafka.m7g.xlarge brokers in 2 zones
Kafka mode: KRaft
Apache Kafka version 3.7.x
EBS storage volume per broker 300 GiB
auto.create.topics.enable=true
log.retention.minutes=480
default.replication.factor=3
OpenSearch ptf-reindex-test cluster
OpenSearch version 2.13
Data nodes
Availability Zone(s) - 2-AZ without standby
Instance type - r6g.2xlarge.search
Number of nodes - 4
EBS volume size (GiB) - 300
Provisioned IOPS - 3000 IOPS
Provisioned Throughput (MiB/s) - 250 MiB/s
Dedicated master nodes
Enabled - No