MSEARCH-426 - search by contributor's values limiting to one object

Goal

Identify the best approach for searching by contributor values so that all search conditions must match within a single contributor object.

Problem

The problem lies in how search engines based on the Lucene library (Elasticsearch, OpenSearch) handle arrays of objects. The engine has no concept of inner objects, so it flattens object hierarchies into a simple list of field names and values. For instance, consider the following document:

{
  "group" : "fans",
  "user" : [ 
    {
      "first" : "John",
      "last" :  "Smith"
    },
    {
      "first" : "Alice",
      "last" :  "White"
    }
  ]
}

The document will be transformed internally into:

{
  "group" :        "fans",
  "user.first" : [ "alice", "john" ],
  "user.last" :  [ "smith", "white" ]
}

The user.first and user.last fields are flattened into multi-value fields, and the association between alice and white is lost. This document would therefore incorrectly match a query for alice AND smith.
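The effect can be reproduced outside the search engine. The following is a minimal Python sketch, not the engine's actual implementation: the helper names flatten and matches_all are hypothetical, and lowercasing stands in for the analyzer. It flattens the example document and shows that a query for alice AND smith matches even though no single user has both values.

```python
from collections import defaultdict


def flatten(doc, prefix=""):
    """Flatten a document into {dotted field name: set of lowercased values}."""
    fields = defaultdict(set)
    for key, value in doc.items():
        path = f"{prefix}{key}"
        items = value if isinstance(value, list) else [value]
        for item in items:
            if isinstance(item, dict):
                # Inner objects lose their identity: values are merged per field.
                for sub_path, sub_values in flatten(item, path + ".").items():
                    fields[sub_path] |= sub_values
            else:
                fields[path].add(str(item).lower())
    return fields


def matches_all(fields, conditions):
    """True if every (field, value) condition is found in the flattened fields."""
    return all(
        value.lower() in fields.get(field, set())
        for field, value in conditions.items()
    )


doc = {
    "group": "fans",
    "user": [
        {"first": "John", "last": "Smith"},
        {"first": "Alice", "last": "White"},
    ],
}

flat = flatten(doc)
cross_match = matches_all(flat, {"user.first": "Alice", "user.last": "Smith"})

print(sorted(flat["user.first"]))  # ['alice', 'john']
print(cross_match)                 # True, although no user is named Alice Smith
```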

Related documentation: Nested field type | Elasticsearch Guide [8.4] | Elastic

Solution

Use the nested field type for arrays of objects.

Internally, the nested type indexes each object in the array as a separate hidden document, so each nested object can be queried independently of the others with the nested query.
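As an illustration only, nested matching can be modelled in Python by evaluating all conditions against each object separately. The function nested_match is a hypothetical helper, and exact case-insensitive comparison is a simplifying assumption in place of real text analysis.

```python
def nested_match(doc, path, conditions):
    """True if at least one object under `path` satisfies every condition."""
    return any(
        all(
            str(obj.get(field, "")).lower() == value.lower()
            for field, value in conditions.items()
        )
        for obj in doc.get(path, [])
    )


doc = {
    "group": "fans",
    "user": [
        {"first": "John", "last": "Smith"},
        {"first": "Alice", "last": "White"},
    ],
}

# Conditions are checked per object, so values from different users never combine.
alice_smith = nested_match(doc, "user", {"first": "Alice", "last": "Smith"})
alice_white = nested_match(doc, "user", {"first": "Alice", "last": "White"})

print(alice_smith)  # False
print(alice_white)  # True
```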

Difference between regular and nested queries:

Regular
{
  "query": {
    "bool": {
      "must": [
        { "match": { "user.first": "Alice" }},
        { "match": { "user.last":  "Smith" }}
      ]
    }
  }
}


Nested
{
  "query": {
    "nested": {
      "path": "user",
      "query": {
        "bool": {
          "must": [
            { "match": { "user.first": "Alice" }},
            { "match": { "user.last":  "Smith" }} 
          ]
        }
      }
    }
  }
}
Because nested documents are indexed as separate documents, they can only be accessed within the scope of the nested query.
All fields of the nested object can also be added to the parent document as standard (flat) fields by using the include_in_parent parameter.
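A mapping of this kind might look like the sketch below, written as a Python dictionary that serializes to the JSON body of an index-creation request. The contributors and name fields come from the queries used in testing; the text type for name and the helper contributors_mapping are assumptions, and the real index mapping contains many more fields.

```python
import json


def contributors_mapping(include_in_parent: bool) -> dict:
    """Build a minimal index mapping with contributors as a nested field."""
    return {
        "mappings": {
            "properties": {
                "contributors": {
                    "type": "nested",
                    # When true, nested fields are also copied to the parent
                    # document as flat fields, so regular queries still work.
                    "include_in_parent": include_in_parent,
                    "properties": {
                        "name": {"type": "text"},  # field type is an assumption
                    },
                }
            }
        }
    }


mapping = contributors_mapping(include_in_parent=True)
print(json.dumps(mapping, indent=2))
```

With include_in_parent set to false, contributors.name is reachable only through a nested query; with true, both nested and regular queries can use it, at the cost of a larger index.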

Testing

Index mappings | Indexed instances | Amount of documents | Total index size | Avg size per instance (bytes) | Contributor search | Keyword search
Current | 8266332 | 8266332 | 25850326050 B (~25.8 GB) | 3127 | query below | query below

Contributor search query (current mappings):
{
    "query": {
        "multi_match": {
            "query": "Pavlo Smahin",
            "fields": [
                "contributors.name^1.0"
            ],
            "type": "cross_fields",
            "operator": "AND"
        }
    }
}
Keyword search query (current mappings):
{
  "query": {
    "bool": {
      "must": [
        {
          "multi_match": {
            "query": "semantic",
            "fields": [
              "alternativeTitles.alternativeTitle.*^1.0",
              "contributors.name^1.0",
              "identifiers.value^1.0",
              "indexTitle.*^1.0",
              "series.*^1.0",
              "title.*^1.0"
            ],
            "type": "best_fields",
            "operator": "OR"
          }
        },
        {
          "multi_match": {
            "query": "web",
            "fields": [
              "alternativeTitles.alternativeTitle.*^1.0",
              "contributors.name^1.0",
              "identifiers.value^1.0",
              "indexTitle.*^1.0",
              "series.*^1.0",
              "title.*^1.0"
            ],
            "type": "best_fields",
            "operator": "OR"
          }
        },
        {
          "multi_match": {
            "query": "primer",
            "fields": [
              "alternativeTitles.alternativeTitle.*^1.0",
              "contributors.name^1.0",
              "identifiers.value^1.0",
              "indexTitle.*^1.0",
              "series.*^1.0",
              "title.*^1.0"
            ],
            "type": "best_fields",
            "operator": "OR"
          }
        }
      ],
      "adjust_pure_negative": true,
      "boost": 1.0
    }
  }
}
nested contributors | 8266332 | 22533677 | 26439831643 B (~26.4 GB) | 3198 | query below | query below

Contributor search query (nested contributors):
{
    "query": {
        "nested": {
            "path": "contributors",
            "query": {
                "match_phrase": {
                    "contributors.name": "Pavlo Smahin Test"
                }
            }
        }
    }
}
Keyword search query (nested contributors):
{
  "query": {
    "bool": {
      "should": [
        {
          "nested": {
            "path": "contributors",
            "query": {
              "match_phrase": {
                "contributors.name": "Pavlo Smahin Test"
              }
            }
          }
        },
        {
          "bool": {
            "must": [
              {
                "multi_match": {
                  "query": "Pavlo",
                  "fields": [
                    "alternativeTitles.alternativeTitle.*^1.0",
                    "identifiers.value^1.0",
                    "indexTitle.*^1.0",
                    "series.*^1.0",
                    "title.*^1.0"
                  ],
                  "type": "best_fields",
                  "operator": "OR"
                }
              },
              {
                "multi_match": {
                  "query": "Smahin",
                  "fields": [
                    "alternativeTitles.alternativeTitle.*^1.0",
                    "identifiers.value^1.0",
                    "indexTitle.*^1.0",
                    "series.*^1.0",
                    "title.*^1.0"
                  ],
                  "type": "best_fields",
                  "operator": "OR"
                }
              },
              {
                "multi_match": {
                  "query": "Test",
                  "fields": [
                    "alternativeTitles.alternativeTitle.*^1.0",
                    "identifiers.value^1.0",
                    "indexTitle.*^1.0",
                    "series.*^1.0",
                    "title.*^1.0"
                  ],
                  "type": "best_fields",
                  "operator": "OR"
                }
              }
            ],
            "adjust_pure_negative": true,
            "boost": 1.0
          }
        }
      ]
    }
  }
}
nested contributors with include_in_parent | 8266332 | 22533677 | 27184318612 B (~27.1 GB) | 3288 | query below | query below

Contributor search query (nested contributors with include_in_parent):
{
    "query": {
        "multi_match": {
            "query": "Pavlo Smahin",
            "fields": [
                "contributors.name^1.0"
            ],
            "type": "cross_fields",
            "operator": "AND"
        }
    }
}
Keyword search query (nested contributors with include_in_parent):
{
  "query": {
    "bool": {
      "must": [
        {
          "multi_match": {
            "query": "semantic",
            "fields": [
              "alternativeTitles.alternativeTitle.*^1.0",
              "contributors.name^1.0",
              "identifiers.value^1.0",
              "indexTitle.*^1.0",
              "series.*^1.0",
              "title.*^1.0"
            ],
            "type": "best_fields",
            "operator": "OR"
          }
        },
        {
          "multi_match": {
            "query": "web",
            "fields": [
              "alternativeTitles.alternativeTitle.*^1.0",
              "contributors.name^1.0",
              "identifiers.value^1.0",
              "indexTitle.*^1.0",
              "series.*^1.0",
              "title.*^1.0"
            ],
            "type": "best_fields",
            "operator": "OR"
          }
        },
        {
          "multi_match": {
            "query": "primer",
            "fields": [
              "alternativeTitles.alternativeTitle.*^1.0",
              "contributors.name^1.0",
              "identifiers.value^1.0",
              "indexTitle.*^1.0",
              "series.*^1.0",
              "title.*^1.0"
            ],
            "type": "best_fields",
            "operator": "OR"
          }
        }
      ],
      "adjust_pure_negative": true,
      "boost": 1.0
    }
  }
}

Outcome

  1. The only way to search instances by a combination of fields belonging to a single contributor is to use a nested mapping for the contributor objects.
  2. Not including contributors in the instance document would require substantial work on the CQL-query to search-query conversion:
    1. the CQL query must be pre-analyzed to identify which search options relate to nested documents and which do not
    2. queries for alias search options such as keyword search become more complicated, because nested-document fields must be separated from non-nested fields
    3. all term query builders must be updated to support nested queries
  3. Including contributors in the instance document allows the existing queries to be kept for all current search options.
  4. To support nested queries when contributors are included in the instance document, a CQL modifier could be introduced. For example: contributor.name = "name" AND/nested contributor.typeId = "typeId"
  5. Including contributors in the instance document increases the index size: about 27.1 GB in testing, compared with 26.4 GB for nested contributors alone and 25.8 GB for the current mappings.
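The conversion work described in item 2 and the same-object search proposed in item 4 can be sketched roughly as follows. This is an illustrative Python outline, not the module's actual query builder: NESTED_PATHS, nested_path_of, term_query and same_object_query are hypothetical names, and contributors.typeId is an assumed index field derived from the CQL example.

```python
# Paths mapped as nested in the index (assumption for this sketch).
NESTED_PATHS = {"contributors"}


def nested_path_of(field):
    """Return the nested path a field belongs to, or None for a regular field."""
    for path in NESTED_PATHS:
        if field == path or field.startswith(path + "."):
            return path
    return None


def term_query(field, value):
    """Build a match query, wrapping it in a nested query when required."""
    query = {"match": {field: value}}
    path = nested_path_of(field)
    if path is None:
        return query
    return {"nested": {"path": path, "query": query}}


def same_object_query(path, conditions):
    """Require all conditions to hold within one nested object under `path`."""
    must = [{"match": {field: value}} for field, value in conditions.items()]
    return {"nested": {"path": path, "query": {"bool": {"must": must}}}}


title_query = term_query("title", "web")
name_query = term_query("contributors.name", "Smahin")
combined = same_object_query(
    "contributors",
    {"contributors.name": "name", "contributors.typeId": "typeId"},
)

print(title_query)
print(name_query)
print(combined)
```

Keeping the nested-path lookup in one place means each term query builder only needs to call term_query, while a modifier such as AND/nested would map to same_object_query.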