Non-ECS mod-search: Test Reindexing full (Ramsons)
Overview
This document assesses reindexing performance on the Ramsons release: the reindex duration and the indexing size. Ticket: PERF-1003: [Ramsons] [Standalone] Reindex for mod-search (reindex improvements) (Closed)
Test Summary
A full reindex of 10 million instances completed in 1 hour 25 minutes on db.r6g.8xlarge, which is 6.8 times faster than the Poppy release on db.r6g.xlarge. On db.r6g.xlarge (the same database size as in the Poppy test) it took 2 hours 25 minutes, which is 4 times faster than Poppy.
Reindex can run on the small-size database (xlarge): it took 2 hours 25 minutes for 10 mln records.
Multitenant reindex is not possible. When 3 reindexes are started in parallel for 3 tenants, between 1 and 3 of them fail.
Service CPU utilization was up to 50% for mod-search and 40% for mod-inventory-storage. For all other services CPU did not exceed 20%.
Memory utilization was stable and no memory leaks or OOM issues were observed.
RDS CPU utilization was about 90% for the database db.r6g.xlarge and up to 35% for db.r6g.8xlarge.
A larger database instance type typically results in faster reindexing times. However, for the db.r6g.8xlarge and db.r6g.4xlarge, the reindexing duration is nearly identical. Therefore, for 10 million instance records reindex, it's more efficient to use the db.r6g.4xlarge or db.r6g.2xlarge database instance.
Recommendations & Jiras
Multitenant reindex is not possible. When 3 reindexes are started in parallel for 3 tenants, between 1 and 3 of them fail. MSEARCH-868: Unable to run reindex for 3 tenants in parallel (Open)
Multiple deadlocks in the database were observed at the start of reindex.
Reindex can run on the small-size database (xlarge, 2 instances): it took 2 hours 25 minutes for 10 mln records.
mod-search:
task count = 4
Mem Hard Limit = 2592
Mem Soft Limit = 2480
Xmx = -XX:MaxRAMPercentage=85.0
"name": "JAVA_OPTS", "value": "-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/usr/ms/mod-search.hprof -XX:OnOutOfMemoryError=/usr/ms/heapdump.sh -XX:MetaspaceSize=512m -XX:MaxMetaspaceSize=1024m -Dserver.port=8082 -XX:MaxRAMPercentage=85.0"
mod-inventory-storage task count = 4
OpenSearch data node instances scaled up to r6g.4xlarge.search
Use the db.r6g.4xlarge or db.r6g.2xlarge database instance
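The heap ceiling implied by the mod-search settings above can be estimated with a small calculation. This is a rough sketch, assuming the memory limits are in MiB and that the JVM derives its maximum heap from the container hard limit when `-XX:MaxRAMPercentage` is set; the function name is illustrative only.

```python
def max_heap_mib(container_limit_mib: float, max_ram_percentage: float) -> float:
    """Approximate upper bound of the JVM heap for a given container limit.

    Assumption: the JVM sees the container hard limit as its total RAM.
    """
    return container_limit_mib * max_ram_percentage / 100.0


hard_limit = 2592          # mod-search Mem Hard Limit from the list above
heap = max_heap_mib(hard_limit, 85.0)
headroom = hard_limit - heap

# Whatever is left must cover metaspace, thread stacks and other native memory.
print(f"max heap ~ {heap:.0f} MiB, non-heap headroom ~ {headroom:.0f} MiB")
```

Under these assumptions the heap can grow to roughly 2203 MiB, leaving about 389 MiB of the hard limit for everything outside the heap.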
Test Runs /Results
Test # | Start time | End time | Number of instances | Test conditions (reindexing on Ramsons release, non-consortium environment) | Duration | Notes | Objective |
---|---|---|---|---|---|---|---|
1 | 2024-10-17T12:41:14 | 2024-10-17T14:06:46 | 10,099,620 | Sequential: fs07000001 | 1 hour 25 min | | We tested mod-search version 4.0.0-SNAPSHOT.281 with the baseline configuration, and it delivered very good results. However, the database size is quite large (db.r6g.8xlarge), so we plan to experiment with other database instance types to explore potential optimizations. |
2 | 2024-10-17T14:35:59 | 2024-10-17T19:49:26 | 27,957,839 | Sequential: fs09000000 | 5 hours 14 min | | |
3 | 2024-10-17T19:58:07 | 2024-10-17T20:12:24 | 1,210,000 | Sequential: fs07000002 | 14 min | | |
4 | 2024-10-17T20:21:23 | 2024-10-17T22:46:34 | | In parallel: 3 tenants | All tenants reindex FAILED in | | |
5 | 2024-10-16T14:38:08 | 2024-10-16T15:53:37 | 10,099,620 | Sequential: fs07000001 | 1 hour 15 min | | We tested mod-search version 4.0.0-SNAPSHOT.281 with increased memory parameters to assess potential performance improvements. However, the results showed no significant improvement, indicating that increasing the memory parameters is not beneficial. |
6 | 2024-10-16T16:11:53 | 2024-10-16T21:08:00 | 27,957,839 | Sequential: fs09000000 | 4 hours 57 min | | |
7 | 2024-10-17T06:20:22 | 2024-10-17T06:34:04 | 1,210,000 | Sequential: fs07000002 | 14 min | | |
8 | 2024-10-17T06:40:22 | 2024-10-17T09:12:00 | | In parallel: 3 tenants | Reindex FAILED for 1 tenant | | |
9 | 2024-10-15T10:16:50 | 2024-10-15T12:04:51 | 10,099,620 | Sequential: fs07000001 | 1 hour 48 min | | For this version, we encountered out-of-memory (OOM) issues with mod-search and attempted to resolve them by increasing the memory parameters. However, these adjustments did not resolve the issue, so we updated mod-search to version 4.0.0-SNAPSHOT.281. |
10 | 2024-10-16T08:01:14 | 2024-10-16T12:32:19 | 10,099,620 | Sequential: fs07000001 | 4 hours 31 min | | This test was performed to check whether reindex is possible with the regular OpenSearch size, without scaling it up. |
11 | 2024-10-21T09:49:10 | 2024-10-21T12:14:38 | 10,099,620 | Sequential: fs07000001 | 2 hours 25 min | | The baseline database size (db.r6g.8xlarge) is quite large. A performance test was conducted to assess whether this size is optimal. The results indicate that it would be more efficient to use smaller instances, such as db.r6g.4xlarge or db.r6g.2xlarge. |
12 | 2024-10-21T16:36:18 | 2024-10-21T18:26:15 | 10,099,620 | Sequential: fs07000001 | 1 hour 50 min | | |
13 | 2024-10-22T09:16:23 | 2024-10-22T10:38:20 | 10,099,620 | Sequential: fs07000001 | 1 hour 22 min | | |
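The Duration column can be cross-checked against the Start time and End time columns. The snippet below is a minimal sketch of that check for two of the runs; the helper name `elapsed` is an assumption introduced for illustration, and the table values appear to be rounded to whole minutes.

```python
from datetime import datetime, timedelta

TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%S"


def elapsed(start: str, end: str) -> timedelta:
    """Return the wall-clock time between two timestamps from the table."""
    return datetime.strptime(end, TIMESTAMP_FORMAT) - datetime.strptime(start, TIMESTAMP_FORMAT)


# Test 1 (db.r6g.8xlarge) and test 11 (2 instances of db.r6g.xlarge).
test_1 = elapsed("2024-10-17T12:41:14", "2024-10-17T14:06:46")
test_11 = elapsed("2024-10-21T09:49:10", "2024-10-21T12:14:38")

print("Test 1: ", test_1)   # 1:25:32
print("Test 11:", test_11)  # 2:25:28
```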
Indexing size
All the data below were captured after each test. They are the responses to the reindex monitoring request GET /search/index/instance-records/reindex/status:
Test #1
[{"entityType":"ITEM","status":"MERGE_COMPLETED","totalMergeRanges":2970,"processedMergeRanges":2970,"totalUploadRanges":0,"processedUploadRanges":0,"startTimeMerge":"2024-10-17T12:41:14.778Z","endTimeMerge":"2024-10-17T13:08:31.434Z"},
{"entityType":"HOLDINGS","status":"MERGE_COMPLETED","totalMergeRanges":21045,"processedMergeRanges":21045,"totalUploadRanges":0,"processedUploadRanges":0,"startTimeMerge":"2024-10-17T12:41:14.839Z","endTimeMerge":"2024-10-17T13:13:52.870Z"},
{"entityType":"SUBJECT","status":"UPLOAD_COMPLETED","totalMergeRanges":0,"processedMergeRanges":0,"totalUploadRanges":4095,"processedUploadRanges":4095,"startTimeUpload":"2024-10-17T13:14:08.544Z","endTimeUpload":"2024-10-17T13:50:47.667Z"},
{"entityType":"CONTRIBUTOR","status":"UPLOAD_COMPLETED","totalMergeRanges":0,"processedMergeRanges":0,"totalUploadRanges":4095,"processedUploadRanges":4095,"startTimeUpload":"2024-10-17T13:14:08.941Z","endTimeUpload":"2024-10-17T14:02:34.736Z"},
{"entityType":"INSTANCE","status":"UPLOAD_COMPLETED","totalMergeRanges":0,"processedMergeRanges":0,"totalUploadRanges":10101,"processedUploadRanges":10101,"startTimeUpload":"2024-10-17T13:14:08.598Z","endTimeUpload":"2024-10-17T14:06:09.460Z"},
{"entityType":"CLASSIFICATION","status":"UPLOAD_COMPLETED","totalMergeRanges":0,"processedMergeRanges":0,"totalUploadRanges":4095,"processedUploadRanges":4095,"startTimeUpload":"2024-10-17T13:14:09.193Z","endTimeUpload":"2024-10-17T14:06:46.360Z"}]
Test #2
[{"entityType":"INSTANCE","status":"UPLOAD_COMPLETED","totalMergeRanges":0,"processedMergeRanges":0,"totalUploadRanges":27830,"processedUploadRanges":27830,"startTimeUpload":"2024-10-17T16:57:51.729Z","endTimeUpload":"2024-10-17T19:49:26.148Z"},
{"entityType":"ITEM","status":"MERGE_COMPLETED","totalMergeRanges":57956,"processedMergeRanges":57956,"totalUploadRanges":0,"processedUploadRanges":0,"startTimeMerge":"2024-10-17T14:35:59.909Z","endTimeMerge":"2024-10-17T16:42:19.528Z"},
{"entityType":"HOLDINGS","status":"MERGE_COMPLETED","totalMergeRanges":55778,"processedMergeRanges":55778,"totalUploadRanges":0,"processedUploadRanges":0,"startTimeMerge":"2024-10-17T14:36:00.032Z","endTimeMerge":"2024-10-17T16:57:34.560Z"},
{"entityType":"SUBJECT","status":"UPLOAD_COMPLETED","totalMergeRanges":0,"processedMergeRanges":0,"totalUploadRanges":4095,"processedUploadRanges":4095,"startTimeUpload":"2024-10-17T16:57:51.682Z","endTimeUpload":"2024-10-17T17:46:13.105Z"},
{"entityType":"CONTRIBUTOR","status":"UPLOAD_COMPLETED","totalMergeRanges":0,"processedMergeRanges":0,"totalUploadRanges":4095,"processedUploadRanges":4095,"startTimeUpload":"2024-10-17T16:57:51.850Z","endTimeUpload":"2024-10-17T18:04:25.328Z"},
{"entityType":"CLASSIFICATION","status":"UPLOAD_COMPLETED","totalMergeRanges":0,"processedMergeRanges":0,"totalUploadRanges":4095,"processedUploadRanges":4095,"startTimeUpload":"2024-10-17T16:57:52.021Z","endTimeUpload":"2024-10-17T19:03:31.441Z"}]
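The status responses above carry enough information to derive how long each merge or upload phase took. The following is a hedged sketch of such a calculation on a two-entry excerpt of the Test #1 response; the helper names are illustrative, and the field names are taken from the payloads as captured.

```python
import json
from datetime import datetime

TS_FORMAT = "%Y-%m-%dT%H:%M:%S.%fZ"

# Excerpt of the Test #1 response (ITEM merge and INSTANCE upload only).
SAMPLE = """[
 {"entityType":"ITEM","status":"MERGE_COMPLETED",
  "startTimeMerge":"2024-10-17T12:41:14.778Z","endTimeMerge":"2024-10-17T13:08:31.434Z"},
 {"entityType":"INSTANCE","status":"UPLOAD_COMPLETED",
  "startTimeUpload":"2024-10-17T13:14:08.598Z","endTimeUpload":"2024-10-17T14:06:09.460Z"}
]"""


def phase_seconds(entry: dict) -> tuple:
    """Return (phase name, duration in seconds) for one status entry."""
    phase = "Merge" if "startTimeMerge" in entry else "Upload"
    start = datetime.strptime(entry[f"startTime{phase}"], TS_FORMAT)
    end = datetime.strptime(entry[f"endTime{phase}"], TS_FORMAT)
    return phase.lower(), (end - start).total_seconds()


def summarize(status: list) -> dict:
    """Map each entity type to its phase and phase duration."""
    return {entry["entityType"]: phase_seconds(entry) for entry in status}


summary = summarize(json.loads(SAMPLE))
for entity, (phase, seconds) in summary.items():
    print(f"{entity}: {phase} took {seconds / 60:.1f} min")
```

Applied to a full response, the same approach shows which entity type dominates the total reindex time.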
Test #3
Test #4
Test #5
Test #6
Test #7
Test #9
Test #10
Compared to Poppy
The comparison uses the tests with 10 mln instance records.
Difference in configuration:
Configuration | Ramsons | Poppy |
---|---|---|
Database | 1 db.r6g.8xlarge database instance (writer) | 2 db.r6g.xlarge database instances (one reader and one writer) |
mod-search version | 4.0.0-SNAPSHOT.281 | 3.0.0-SNAPSHOT.151 |
mod-search task count | 4 | 8 |
Data node instances count (r6g.4xlarge.search) | 4 | 6 |
Dedicated master nodes | No | 3 of r6g.large.search instances |
Duration comparison:
| Ramsons | Poppy | Absolute delta | Delta |
---|---|---|---|---|
Ramsons database instance type r6g.8xlarge according to requirements PERF-1003: [Ramsons] [Standalone] Reindex for mod-search (reindex improvements)Closed | 1 hour 25 min | 9 hours 38 min | 8 hours 13 min | 6.8 times |
Ramsons: 2 instances of database instance type r6g.xlarge (the same as for Poppy testing) | 2 hours 25 min | 9 hours 38 min | 7 hours 13 min | 4 times |
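The deltas in the table above follow directly from the durations. A short sketch of the arithmetic, with illustrative helper names that are not part of any test tooling:

```python
def to_minutes(hours: int, minutes: int) -> int:
    """Convert an 'H hours M min' duration to minutes."""
    return hours * 60 + minutes


def compare(ramsons_min: int, poppy_min: int) -> tuple:
    """Return (absolute delta in minutes, speed-up factor)."""
    return poppy_min - ramsons_min, poppy_min / ramsons_min


poppy = to_minutes(9, 38)

for label, ramsons in [("db.r6g.8xlarge", to_minutes(1, 25)),
                       ("2 x db.r6g.xlarge", to_minutes(2, 25))]:
    delta, factor = compare(ramsons, poppy)
    hours, minutes = divmod(delta, 60)
    print(f"{label}: {hours} hours {minutes} min faster, {factor:.1f} times")
```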
Reindex duration and database size correlation:
A larger database instance type typically results in faster reindexing times. However, for the db.r6g.8xlarge and db.r6g.4xlarge, the reindexing duration is nearly identical. Therefore, for 10 million instance records reindex, it's more efficient to use the db.r6g.4xlarge or db.r6g.2xlarge database instance.
Database size | Duration |
---|---|
2 instances of database db.r6g.xlarge | 2 hours 25 min |
database db.r6g.2xlarge | 1 hour 50 min |
database db.r6g.4xlarge | 1 hour 22 min |
database db.r6g.8xlarge | 1 hour 25 min |
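To make the diminishing return visible, the change in duration between consecutive database sizes can be computed from the table above. This is only an illustration of the arithmetic on single runs, so small differences such as the last step may be within run-to-run variation.

```python
# Durations in minutes, taken from the table above.
durations_min = [
    ("2 x db.r6g.xlarge", 145),
    ("db.r6g.2xlarge", 110),
    ("db.r6g.4xlarge", 82),
    ("db.r6g.8xlarge", 85),
]


def step_changes(rows: list) -> list:
    """Percentage change in duration when moving to the next larger size."""
    changes = []
    for (_, previous), (name, current) in zip(rows, rows[1:]):
        changes.append((name, (current - previous) / previous * 100.0))
    return changes


changes = step_changes(durations_min)
for name, percent in changes:
    print(f"-> {name}: {percent:+.1f}%")
```

Each step up to db.r6g.4xlarge shortens the reindex by about a quarter, while the step to db.r6g.8xlarge brings no further gain.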
Resource utilization
Service CPU Utilization
Service CPU utilization was up to 50% for mod-search and 40% for mod-inventory-storage. For all other services CPU did not exceed 20%.
Instance CPU Utilization
Memory Utilization
DB CPU Utilization
Average database CPU utilization was up to 35% for db.r6g.8xlarge.
DB Connections
The database used the same average number of connections.
Open Search metrics
CPU utilization percentage for all data nodes (Average).
Memory usage percentage for all data nodes (Average).
Test #11 2 instances of db.r6g.xlarge database: writer and reader instances.
Appendix
Infrastructure
PTF-environment rcp1
10 m6g.2xlarge EC2 instances located in US East (N. Virginia), us-east-1
1 instance of db.r6g.8xlarge database, writer instance.
MSK - fse-tenant
4 kafka.m7g.xlarge brokers in 2 zones
Apache Kafka version 3.7.x
EBS storage volume per broker 300 GiB
auto.create.topics.enable=true
log.retention.minutes=480
default.replication.factor=3
OpenSearch ptf-reindex-test cluster
OpenSearch version 2.13
Data nodes
Availability Zone(s) - 2-AZ without standby
Instance type - r6g.4xlarge.search
Number of nodes - 4
EBS volume size (GiB) - 300
Provisioned IOPS - 3000 IOPS
Provisioned Throughput (MiB/s) - 250 MiB/s
Dedicated master nodes
Enabled - No
Data structure
Module | Task Definition Revision | Module Version | Task Count | Mem Hard Limit | Mem Soft Limit | CPU Units | Xmx | Metaspace Size | Max Metaspace Size |
---|---|---|---|---|---|---|---|---|---|
mod-search | 10 | mod-search:4.0.0-SNAPSHOT.281 | 4 | 2592 | 2480 | 2048 | MaxRAMPercentage=85.0 | 512 | 1024 |
mod-authtoken | 1 | mod-authtoken:2.16.0-SNAPSHOT.303 | 2 | 1440 | 1152 | 512 | 922 | 88 | 128 |
mod-inventory-storage | 4 | mod-inventory-storage:27.2.0-SNAPSHOT.773 | 4 | 4096 | 3690 | 2048 | 3076 | 512 | 1024 |
mod-inventory | 1 | mod-inventory:20.3.0-SNAPSHOT.546 | 2 | 2880 | 2592 | 1024 | 1814 | 384 | 512 |
mod-users | 1 | mod-users:19.3.2-SNAPSHOT.696 | 2 | 1024 | 896 | 0 | 768 | 88 | 128 |
nginx-okapi | 1 | nginx-okapi:2023.06.14 | 2 | 1024 | 896 | 0 | 0 | 0 | 0 |
okapi-b | 1 | okapi:5.3.0 | 3 | 1684 | 1440 | 1024 | 922 | 384 | 512 |
pub-okapi | 1 | pub-okapi:2023.06.14 | 2 | 1024 | 896 | 0 | 768 | 0 | 0 |
Module versions
Methodology/Approach
Use a non-consortium cluster for testing (rcp1 in our case).
Configure the environment in accordance with the Infrastructure parameters and the requirements in the ticket PERF-1003: [Ramsons] [Standalone] Reindex for mod-search (reindex improvements) (Closed)
Run the reindex and get the indexing time and size from GET /search/index/instance-records/reindex/status
The reindex process was started from the JMeter script with a POST request to /search/index/instance-records/reindex/full without any parameters
The script is located at http://github.com/folio-org/perf-testing/mod-search
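The tests used a JMeter script; purely as an illustration of the same two requests, the steps above can be sketched in Python. This is not the team's script. The base URL, the token value, the `x-okapi-tenant` and `x-okapi-token` header usage, the polling interval, and all function names are assumptions; only the two endpoint paths and the completed status values come from this report.

```python
import json
import time
import urllib.request

FULL_REINDEX_PATH = "/search/index/instance-records/reindex/full"
STATUS_PATH = "/search/index/instance-records/reindex/status"
# Only the terminal statuses seen in the captured responses are listed here.
DONE_STATUSES = {"MERGE_COMPLETED", "UPLOAD_COMPLETED"}


def build_request(base_url, path, tenant, token, method):
    """Build (but do not send) a request with the assumed Okapi headers."""
    return urllib.request.Request(
        url=base_url.rstrip("/") + path,
        method=method,
        headers={"x-okapi-tenant": tenant, "x-okapi-token": token,
                 "Accept": "application/json"},
    )


def is_finished(status):
    """True when every entity type reports a completed merge or upload."""
    return bool(status) and all(e.get("status") in DONE_STATUSES for e in status)


def run_full_reindex(base_url, tenant, token, poll_seconds=60, max_polls=600):
    """Start a full reindex without parameters, then poll the status endpoint."""
    start = build_request(base_url, FULL_REINDEX_PATH, tenant, token, "POST")
    urllib.request.urlopen(start).close()
    poll = build_request(base_url, STATUS_PATH, tenant, token, "GET")
    for _ in range(max_polls):
        with urllib.request.urlopen(poll) as response:
            status = json.load(response)
        if is_finished(status):
            return status
        time.sleep(poll_seconds)
    raise TimeoutError("reindex did not complete within the polling budget")
```

A call might look like `run_full_reindex("https://okapi.example.org", "fs07000001", token)`, where the host is a placeholder. A real script would also need to detect failed reindexes, which this sketch does not attempt.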