MSEARCH-426 - search by contributor's values limiting to one object
Goal
Identify the best approach to searching by contributor values so that all search terms must match within a single contributor object.
Problem
The problem lies in how search engines based on the Lucene library (Elasticsearch, OpenSearch) handle arrays of objects. The engine has no concept of inner objects, so it flattens object hierarchies into a simple list of field names and values. For instance, consider the following document:
{ "group" : "fans", "user" : [ { "first" : "John", "last" : "Smith" }, { "first" : "Alice", "last" : "White" } ] }
Internally, the document is transformed into:
{ "group" : "fans", "user.first" : [ "alice", "john" ], "user.last" : [ "smith", "white" ] }
The user.first and user.last fields are flattened into multi-value fields, and the association between alice and white is lost. This document would incorrectly match a query for alice AND smith.
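The false match can be reproduced outside the search engine. The following Python sketch is only a simplified model of the flattening, not Lucene's actual implementation: `flatten` and `matches_all` are hypothetical helper names, and lowercasing stands in for the analyzer.

```python
from collections import defaultdict


def flatten(doc, prefix=""):
    """Flatten a document into {dotted.field: set of lowercased values}."""
    fields = defaultdict(set)
    for key, value in doc.items():
        path = f"{prefix}{key}"
        items = value if isinstance(value, list) else [value]
        for item in items:
            if isinstance(item, dict):
                # Inner objects are merged into the same multi-value fields,
                # so the link between values of one object is dropped.
                for sub_path, sub_values in flatten(item, f"{path}.").items():
                    fields[sub_path] |= sub_values
            else:
                fields[path].add(str(item).lower())
    return dict(fields)


def matches_all(flat_doc, terms):
    """Return True if every (field, value) term occurs in the flattened document."""
    return all(value.lower() in flat_doc.get(field, set()) for field, value in terms)


doc = {
    "group": "fans",
    "user": [
        {"first": "John", "last": "Smith"},
        {"first": "Alice", "last": "White"},
    ],
}

flat = flatten(doc)
# Alice and Smith belong to different users, yet the flattened document matches.
false_positive = matches_all(flat, [("user.first", "Alice"), ("user.last", "Smith")])
print(flat, false_positive)
```

Because each field is reduced to a set of values, nothing in `flat` records which first name was paired with which last name.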
Related documentation: Nested field type | Elasticsearch Guide [8.4] | Elastic
Solution
Use the nested field type for arrays of objects.
Internally, nested objects index each object in the array as a separate hidden document, meaning that each nested object can be queried independently of the others with the nested query.
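As a rough illustration, an index mapping that declares the array as nested could be built as follows. This is a sketch under assumptions: `nested_mapping` is a hypothetical helper, the `text` sub-field types are chosen only for the example, and the resulting dictionary is meant to be serialized as the body of an index-creation request. The `type: nested` and `include_in_parent` mapping parameters themselves are part of the Elasticsearch nested field type.

```python
import json


def nested_mapping(field, sub_fields, include_in_parent=False):
    """Build an index mapping that declares `field` as a nested array of objects."""
    nested = {
        "type": "nested",
        "properties": {
            name: {"type": field_type} for name, field_type in sub_fields.items()
        },
    }
    if include_in_parent:
        # Additionally copies the nested fields into the parent document
        # as regular flattened fields.
        nested["include_in_parent"] = True
    return {"mappings": {"properties": {field: nested}}}


# Field name and sub-fields follow the `user` example used in this page.
mapping = nested_mapping("user", {"first": "text", "last": "text"})
mapping_with_parent = nested_mapping(
    "user", {"first": "text", "last": "text"}, include_in_parent=True
)
print(json.dumps(mapping, indent=2))
```

With `include_in_parent` enabled, the same values are indexed both in the hidden nested documents and in the parent document, which is why that variant is expected to take more space.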
Difference between regular and nested queries:
Regular query:
{ "query": { "bool": { "must": [ { "match": { "user.first": "Alice" }}, { "match": { "user.last": "Smith" }} ] } } }
Nested query:
{ "query": { "nested": { "path": "user", "query": { "bool": { "must": [ { "match": { "user.first": "Alice" }}, { "match": { "user.last": "Smith" }} ] } } } } }
Testing
Index mappings | Indexed instances | Number of documents | Total index size | Avg size per instance (B) | Contributor search | Keyword search |
---|---|---|---|---|---|---|
Current | 8266332 | 8266332 | 25850326050 B (~25.8 GB) | 3127 | | |
nested contributors | 8266332 | 22533677 | 26439831643 B (~26.4 GB) | 3198 | | |
nested contributors with include_in_parent | 8266332 | 22533677 | 27184318612 B (~27.1 GB) | 3288 | | |
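The figures in the table can be compared with a few lines of arithmetic. The sketch below only recomputes values from the numbers above; the variable names are illustrative, and the interpretation of the additional documents as one hidden nested document per contributor is an assumption based on how nested fields are indexed.

```python
# Figures copied from the table above.
INSTANCES = 8_266_332
measurements = {
    "current": {"documents": 8_266_332, "size_bytes": 25_850_326_050},
    "nested": {"documents": 22_533_677, "size_bytes": 26_439_831_643},
    "nested_include_in_parent": {"documents": 22_533_677, "size_bytes": 27_184_318_612},
}

baseline = measurements["current"]["size_bytes"]
summary = {}
for name, m in measurements.items():
    summary[name] = {
        # Integer division reproduces the "Avg size per instance" column.
        "avg_bytes_per_instance": m["size_bytes"] // INSTANCES,
        # Index growth relative to the current mapping, in percent.
        "growth_percent": round(100 * (m["size_bytes"] - baseline) / baseline, 2),
        # Assumption: every extra document is one hidden nested contributor.
        "hidden_docs_per_instance": round((m["documents"] - INSTANCES) / INSTANCES, 2),
    }

for name, row in summary.items():
    print(name, row)
```

On these numbers the nested mapping grows the index by a little over 2%, and adding include_in_parent by a little over 5%, relative to the current mapping.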
Outcome
- The only way to search instances by a combination of fields belonging to a single contributor is to use a nested mapping for the contributor objects.
- Not including the contributors in the instance document will require substantial work on the CQL-query to search-query conversion:
  - the CQL query must be pre-analyzed to identify which search options relate to nested documents and which do not
  - queries for alias search options such as keyword search become more complicated, because nested-document fields must be separated from non-nested fields
  - all terms query builders must be updated to support nested-type queries
- Including contributors in the instance document allows the same queries to be kept for all existing search options.
  To support nested queries when contributors are included in the instance, a modifier to the CQL relation could be introduced. For example: contributor.name = "name" AND/nested contributor.typeId = "typeId"
- Including contributors in the instance document increases the size of the index (about 27.1 GB versus 26.4 GB for nested contributors alone in the test above).
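The proposed modifier could translate to search queries roughly as sketched below. This is not the existing mod-search conversion code: `term`, `and_clauses` and the `nested_path` parameter are hypothetical names, the `/nested` modifier is only the proposal from this page, and the nested path `contributor` is assumed to match the field prefix used in the CQL example.

```python
def term(field, value):
    """One match clause for a single field/value pair."""
    return {"match": {field: value}}


def and_clauses(left, right, nested_path=None):
    """Combine two (field, value) terms with AND.

    `nested_path` models the proposed `/nested` modifier: when it is set,
    both terms must match inside the same nested object under that path.
    """
    both = {"bool": {"must": [term(*left), term(*right)]}}
    if nested_path is None:
        # Plain AND: the terms may be satisfied by different contributors.
        return both
    return {"nested": {"path": nested_path, "query": both}}


name = ("contributor.name", "name")
type_id = ("contributor.typeId", "typeId")

# contributor.name = "name" AND contributor.typeId = "typeId"
plain_query = and_clauses(name, type_id)
# contributor.name = "name" AND/nested contributor.typeId = "typeId"
nested_query = and_clauses(name, type_id, nested_path="contributor")
```

Without the modifier the conversion stays as it is today, so existing search options are unaffected; only clauses explicitly marked as nested are wrapped in a nested query.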