RANCHER-779 Investigate shared OpenSearch load and check if we need bigger instance

Overview

During the spike analysis, the current AWS OpenSearch cluster was found to be constrained by memory rather than CPU: CPU usage was only around 10%, while memory usage stayed consistently between 85% and 87%. It is therefore recommended to move the cluster to a memory-optimized instance type to improve performance.

Statistics over 1 day:

Statistics over 1 week:
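
For reference, the same load figures can be pulled from CloudWatch without the console. Below is a minimal sketch using boto3; the domain name, account ID, and region are placeholders, not our real values:

from datetime import datetime, timedelta, timezone

import boto3

DOMAIN_NAME = "shared-opensearch"   # hypothetical domain name
ACCOUNT_ID = "123456789012"         # hypothetical AWS account id

cloudwatch = boto3.client("cloudwatch", region_name="eu-west-1")

def fetch_metric(metric_name, days=1, stat="Average"):
    """Return hourly datapoints for one OpenSearch domain metric over the last `days` days."""
    now = datetime.now(timezone.utc)
    resp = cloudwatch.get_metric_statistics(
        Namespace="AWS/ES",  # OpenSearch Service metrics are published under the AWS/ES namespace
        MetricName=metric_name,
        Dimensions=[
            {"Name": "DomainName", "Value": DOMAIN_NAME},
            {"Name": "ClientId", "Value": ACCOUNT_ID},
        ],
        StartTime=now - timedelta(days=days),
        EndTime=now,
        Period=3600,
        Statistics=[stat],
    )
    return sorted(resp["Datapoints"], key=lambda p: p["Timestamp"])

# Peak CPU and JVM memory pressure over the last week
for metric in ("CPUUtilization", "JVMMemoryPressure"):
    points = fetch_metric(metric, days=7, stat="Maximum")
    peak = max((p["Maximum"] for p in points), default=0)
    print(f"{metric}: peak over 7 days = {peak:.1f}%")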

Memory Optimization

To improve the performance of the cluster, it is recommended to switch to a memory-optimized instance type. The tables below show the current instance type and the recommended memory-optimized alternatives, along with their specifications and on-demand prices:

Used now:

General Purpose - Current Generation | vCPU | Memory (GiB) | Instance Storage (GB) | Price per hour
m5.xlarge.search | 4 | 16 | EBS Only | $0.283

Recommended memory-optimized instance types:

Memory Optimized - Current Generation | vCPU | Memory (GiB) | Instance Storage (GB) | Price per hour
r6g.large.search | 2 | 16 | EBS Only | $0.167
r6g.xlarge.search | 4 | 32 | EBS Only | $0.335
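
If we go with the memory-optimized type, the change is a single UpdateDomainConfig call on the domain (an instance type change typically triggers a blue/green deployment, so it takes a while but does not require downtime). A minimal sketch with boto3; the domain name and region are placeholders:

import boto3

opensearch = boto3.client("opensearch", region_name="eu-west-1")

# Switch the data nodes to r6g.large.search while keeping the current node count.
response = opensearch.update_domain_config(
    DomainName="shared-opensearch",  # hypothetical domain name
    ClusterConfig={
        "InstanceType": "r6g.large.search",
        "InstanceCount": 2,
    },
)
print(response["DomainConfig"]["ClusterConfig"]["Status"]["State"])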

Shard Replication Issue

Additionally, the cluster health is currently yellow, and many shard replicas cannot be assigned to any node, with the error message "a copy of this shard is already allocated to this node."

One of the reasons for the yellow status of the OpenSearch cluster is this shard replication issue, which occurs when there are not enough nodes to allocate all the replicas. The replica count is set to 2, so each shard needs a primary plus two replica copies, and each copy must be placed on a different node. With only 2 data nodes, one replica per shard can never be assigned, so the cluster stays yellow.


To fix this issue, either of the following would work:

  1. Increase the number of nodes in the cluster (currently 2 nodes), or
  2. Decrease the replica count to 1 (see the sketch below).
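
Option 2 can be applied directly through the OpenSearch REST API. A minimal sketch using the requests library; the endpoint and credentials are placeholders, and note that this only changes existing indices, so index templates or defaults need the same adjustment for new indices:

import requests

ENDPOINT = "https://search-shared-opensearch-xxxx.eu-west-1.es.amazonaws.com"  # hypothetical endpoint
AUTH = ("admin", "change-me")  # or SigV4-signed requests, depending on the domain's access setup

# Set number_of_replicas to 1 on all existing indices.
resp = requests.put(
    f"{ENDPOINT}/_all/_settings",
    json={"index": {"number_of_replicas": 1}},
    auth=AUTH,
    timeout=30,
)
resp.raise_for_status()
print(resp.json())  # expect {"acknowledged": true}

# The cluster should go back to green once no replicas are left unassigned.
print(requests.get(f"{ENDPOINT}/_cluster/health", auth=AUTH, timeout=30).json()["status"])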

Before decreasing the replica count:

After decreasing the replica count to 1:

Disk Throughput Throttle

Additionally, we received the following notification:

This message means that your AWS OpenSearch domain is experiencing throttling because it has reached the limits of its disk throughput capacity. Disk throughput refers to the rate at which data can be read from or written to the disk.

When disk throughput is limited, it can slow down queries and indexing operations, which can cause performance issues for your applications. To resolve this issue, AWS recommends scaling your domain to suit your throughput needs.

Scaling can involve adding more data nodes to your cluster to increase disk throughput capacity, or upgrading to a more powerful instance type with higher disk throughput capabilities. AWS OpenSearch provides a number of metrics and monitoring tools to help you track your disk throughput and determine when scaling is necessary.
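
The throttling can also be confirmed from CloudWatch, which publishes ThroughputThrottle and IopsThrottle metrics for EBS-backed OpenSearch domains. A minimal sketch along the same lines as the earlier metrics check; the domain name and account ID are again placeholders:

from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="eu-west-1")
now = datetime.now(timezone.utc)

# Count the hours in the last week where the EBS throttle counters were non-zero.
for metric in ("ThroughputThrottle", "IopsThrottle"):
    resp = cloudwatch.get_metric_statistics(
        Namespace="AWS/ES",
        MetricName=metric,
        Dimensions=[
            {"Name": "DomainName", "Value": "shared-opensearch"},  # hypothetical domain name
            {"Name": "ClientId", "Value": "123456789012"},         # hypothetical account id
        ],
        StartTime=now - timedelta(days=7),
        EndTime=now,
        Period=3600,
        Statistics=["Maximum"],
    )
    throttled = [p for p in resp["Datapoints"] if p["Maximum"] > 0]
    print(f"{metric}: throttled during {len(throttled)} of the last 168 hours")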


Changing the volume type from gp2 to gp3 in AWS OpenSearch can make sense for us.

gp3 volumes offer higher baseline performance, more predictable performance, and lower cost compared to gp2 volumes. gp3 volumes also offer a higher maximum IOPS (input/output operations per second) than gp2 volumes, making them better suited for workloads that require higher I/O throughput.

https://aws.amazon.com/blogs/big-data/lower-your-amazon-opensearch-service-storage-cost-with-gp3-amazon-ebs-volumes/
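
Switching the volumes to gp3 is likewise just an EBSOptions update on the domain, and it can be combined with the instance-type change above in a single UpdateDomainConfig call. A minimal sketch; the volume size, IOPS, and throughput values are placeholders that would need to be sized for our actual data:

import boto3

opensearch = boto3.client("opensearch", region_name="eu-west-1")

# Move the domain's EBS volumes from gp2 to gp3.
response = opensearch.update_domain_config(
    DomainName="shared-opensearch",  # hypothetical domain name
    EBSOptions={
        "EBSEnabled": True,
        "VolumeType": "gp3",
        "VolumeSize": 100,   # GiB per data node (placeholder)
        "Iops": 3000,        # gp3 baseline
        "Throughput": 125,   # MiB/s, gp3 baseline
    },
)
print(response["DomainConfig"]["EBSOptions"]["Status"]["State"])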

Conclusion

To optimize the AWS OpenSearch cluster, there are three key decisions to make:

  1. Choose the appropriate instance type to use, either r6g.large or r6g.xlarge.
  2. Determine the required number of nodes and replicas (currently 2 nodes and 2 replicas).
  3. Choose the appropriate storage type, either gp2 or gp3.

Based on the requirements and the spike analysis, it is recommended to use the r6g.large.search instance type with 2 nodes and a replica count of 1. In addition, it is recommended to use the gp3 storage type for better performance.

To summarize:

  1. Instance type: r6g.large.search or r6g.xlarge.search
  2. Number of nodes and replica count: 2 nodes with 1 replica
  3. Storage type: gp3