ElasticSearch Reindex Performance Recommendations
Purpose
The purpose of the document is to illustrate performance changes that can be implemented to improve reindex performance. A git patch is attached to this page for inspiration for the application changes. The main bottleneck exists at the supported search engine. More efficient processing in supported search engine will decrease reindex duration significantly.
Changes
Routing
Elasticsearch/Opensearch allows routing a document a document to a shard. mod-search has default routing on the tenant id. With a default shard count of 4, this means that all records for a particular tenant will be routed to a single shard out of the four. With Elasticsearch/Opensearch trying to have a balanced cluster, primary shards are placed at least at each data node. This means that only one shard => data node is used during the index process rather than all data nodes. Recommendation is to remove routing by tenant id and just use default routing on _id
or some other attribute that will ensure balanced distribution in the Elasticsearch/Opensearch cluster.
GZIP
Testing with a large dataset show potential large documents that need to travel through the network. Since there is still CPU to spare on the mod-search instances and Elasticsearch/Opensearch nodes, it is prudent to employ some compression to minimize space usage.
SMILE
SMILE is a data format that is best expressed as compacted binary JSON. It is one of the formats supported by Elasticsearch's bulk api, other than JSON. Using SMILE in concert with GZIP can be a boon; link.
Unfortunately, there is a bug that has been acknowledged by AWS Support. Essentially, when using AWS OpenSearch, the response returned by the bulk api is malformed which is causing errors in mod-search. The issue does not occur with local instances of OpenSearch or ElasticSearch, only instances in AWS Infrastructure. Due to this, performance gains could not be determined. When the issue is resolved by AWS, this can be revisited.
Kafka Fetch Size Upon Poll
AWS recommends at least 5MB worth of data passed into the bulk api allow efficient processing of data. mod-search currently send less than 800KB in general. It is such a small size because mod-search is limited to 50 records when retrieving instance Ids from Kafka. The poll record count was reduced to 50 due to a downstream issue of mod-search querying mod-inventory-storage for the latest instance, the query(via CQL) appends IDs to the URI of the HTTP call[MSEARCH-75]. Querying more than 50 records throws an HTTP exception of having an HTTP call with a URI that is too long.
A change was made to increase the poll record count back to its previous value of 200 but query mod-inventory-storage in batches of 50. This resolved the downstream issue while retaining a high poll record count. Code used is in the attached patch.
ES Refresh Interval
Elasticsearch/Opensearch refreshes new indexed documents to be available for search every second. A configuration can delay or disable this refresh process until indexing is done.
ES Replica Count
Elasticsearch/Opensearch replicates changes on a shard for high availability. Disabling this replication can help speed up the indexing process. Replication can be enabled by increasing replica count from 0 to its default of 2.
Performance Results
The following tests were run on a large dataset which has about 9 million instances. The re-index occurring here is only for instance, contributor & subject. Values in red denotes a change from the previous configuration.
Config 1 | Config 2 | Config 3 | Config 4 | Config 5 | Config 6 | Config 7 | Config 8 | |
---|---|---|---|---|---|---|---|---|
AWS ES Instance Class | r6g.large.search | r6g.2xlarge.search | r6g.2xlarge.search | r6g.2xlarge.search | r6g.2xlarge.search | r6g.2xlarge.search | r6g.2xlarge.search | r6g.2xlarge.search |
Routing | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
GZIP | No | No | No | No | No | No | No | No |
SMILE | No | No | No | No | No | No | No | No |
mod-search instance count | 4 | 4 | 4 | 4 | 4 | 8 | 8 | 8 |
KAFKA_EVENTS_CONCURRENCY | 2 | 2 | 4 | 2 | 2 | 2 | 4 | 4 |
KAFKA_CONTRIBUTOR_CONCURRENCY | 1 | 1 | 2 | 1 | 1 | 1 | 2 | 2 |
Kafka fetch size | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 200 |
ES Refresh Interval | 1s | 1s | 1s | -1 | -1 | -1 | -1 | -1 |
ES Number of replicas | 2 | 2 | 2 | 2 | 0 | 0 | 0 | 0 |
Duration | 13 hrs 30 mins | 6 hrs 20 mins | 6hrs 30 mins | 6 hrs | 2 hrs 30 mins | 2 hrs | 1 hr 45 mins | 1 hr 15 mins |
ES CPU Average | 99% | 50% | 51% | 30% | 34% | 50% | 55% | 58% |
mod-search CPU Average | 12% | 25% | 30% | 25% | 70% | 55% | 64% | 75% |
Recommendations for FOLIO Dev Team
- Use default routing for indexing and searching. This is allow optimum use of resources.
- Increase Kafka fetch size to add more data during a single bulk request. Check attached patch for inspiration on fixing URI-Too Long error on mod-inventory-storage.
- Allow number of shards to be configurable, possibly for each tenant. The primary shards are 52GB and growing, ideal is 10-30GB.
- Can be extended to just any index setting possible.
- Introduce better monitoring that will allow an operator to determine when a reindex is complete.
- If above is completed, it would be possible to allow configuration that will disable refresh interval and replica count during reindex and re-enable when done.
- Investigate GZIP + SMILE. If implemented, it should be configurable. Some code is in the attached patch.
At the end of a reindex, the indexes storing contributor and subject data experience a lot of document deletions compared to document creation. As a first time load, deletion ought to be minimal. Consider computing interim results in Postgres and then uploading the "final" document into elasticsearch.
GET /_cat/indices?v&s=store.size:deschealth status index uuid pri rep docs.count docs.deleted store.size pri.store.size green open mtes_instance_fs00001034 GFpif_XjTh-1PJTZac1_Hw 4 0 8196944 5951 51.4gb 51.4gb green open mtes_contributor_fs00001034 yFhS12Y4SviL170UFQZZlw 4 0 4739266 1126590 10.3gb 10.3gb green open mtes_instance_subject_fs00001034 pdD9z85SQZOW8uHMgMXvBw 4 0 3824725 13681561 2.2gb 2.2gb green open .kibana_1 47dox0rZTD-8gUtV-xwM1w 1 1 6 1 46.4kb 18.3kb
How to change number_of_replicas and refresh_interval values of ES/OpenSearch
Using cmd and curl utility you should send request to ES/OpenSearch endpoint.
Be sure that you can connect to ES/OpenSearch endpoint.
curl -u "USERNAME:PASSWORD" --location --request GET "https://{ENDPOINT_URL}/_cat/indices" to get all indeces of ES/OpenSearch
UPDATE values number_of_replicas, refresh_interval with recommendations in the spike!
To see the current configuration you should change type of request to GET.
curl -u "USERNAME:PASSWORD" --location --request PUT 'https://{ENDPOINT_URL}/{FOLIO_ENVIRONMENT}_instance_{TENANT_ID}/_settings' \
--header 'Content-Type: application/json' \
--data-raw '{
"index": {
"number_of_replicas": "2",
"refresh_interval": "1s"
}
}'
UPDATE values number_of_replicas, refresh_interval with recommendations in the spike!
To see the current configuration you should change type of request to GET.
curl -u "USERNAME:PASSWORD" --location --request PUT 'https://{ENDPOINT_URL}/{FOLIO_ENVIRONMENT}_contributor_{TENANT_ID}/_settings' \
--header 'Content-Type: application/json' \
--data-raw '{
"index": {
"number_of_replicas": "2",
"refresh_interval": "1s"
}
}'
UPDATE values number_of_replicas, refresh_interval with recommendations in the spike!
To see the current configuration you should change type of request to GET.
curl -u "USERNAME:PASSWORD" --location --request PUT 'https://{ENDPOINT_URL}/{FOLIO_ENVIRONMENT}_authority_{TENANT_ID}/_settings' \
--header 'Content-Type: application/json' \
--data-raw '{
"index": {
"number_of_replicas": "2",
"refresh_interval": "1s"
}
}'
UPDATE values number_of_replicas, refresh_interval with recommendations in the spike!
To see the current configuration you should change type of request to GET.
curl -u "USERNAME:PASSWORD" --location --request PUT 'https://{ENDPOINT_URL}/{FOLIO_ENVIRONMENT}_instance_subject_{TENANT_ID}/_settings' \
--header 'Content-Type: application/json' \
--data-raw '{
"index": {
"number_of_replicas": "2",
"refresh_interval": "1s"
}
}'
After reindex process refresh_interval should be set to 1s, and also for high availability replication can be enabled by increasing replica count from 0 to its default of 2.
ES documentation https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-update-settings.html#bulk