POC: Investigate performance impact on supporting search by all properties in inventory records

Purpose

Some users would like to search across all properties of all record types. The scope of this story is to propose a possible solution.

Scope

  • create an additional search option that covers search by all properties from all inventory record types
  • investigate the performance impact on the already implemented search options
  • investigate the performance of the new search-all query
  • determine the impact on the required disk space for a collection of ~5M records
  • investigate the increase in required RAM for mod-search
  • support multi-language full-text fields

Approach

  1. Implement an algorithm that aggregates all field values into a single field called all, keeping multi-language and non-multi-language values separate. Type id values such as identifierTypeId and alternativeTitleTypeId are excluded. All processed values must also be included in this field.
  2. Update CqlSearchQueryConverter to accept and process queries that use the cql.all term.
  3. Run performance tests locally and in a performance Rancher environment.

Implemented classes

A field value provider that creates 6 fields (one multi-language field and one plain field each for instances, holding records, and items) used for searching by their field values.

AllSearchFieldValueProvider
package org.folio.search.service.setter.instance;

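import static org.folio.search.utils.SearchUtils.INSTANCE_RESOURCE;
import static org.folio.search.utils.SearchUtils.PLAIN_MULTILANG_PREFIX; // assumed to live in SearchUtils
import static org.folio.search.utils.SearchUtils.getMultilangValue;

import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;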
import java.util.concurrent.ConcurrentHashMap;
import lombok.Data;
import lombok.RequiredArgsConstructor;
import org.apache.commons.collections4.MapUtils;
import org.folio.search.model.metadata.PlainFieldDescription;
import org.folio.search.service.metadata.LocalSearchFieldProvider;
import org.springframework.stereotype.Component;

@Component
@RequiredArgsConstructor
public class AllSearchFieldValueProvider {

  private final LocalSearchFieldProvider searchFieldProvider;
  private final ConcurrentHashMap<String, Boolean> multilangCache = new ConcurrentHashMap<>();

  public Map<String, Object> getFieldValue(Map<String, Object> eventBody, List<String> languages) {
    var instanceEventBody = new LinkedHashMap<>(eventBody);
    instanceEventBody.remove("items");
    instanceEventBody.remove("holdings");

    var resultMap = new LinkedHashMap<String, Object>();

    resultMap.putAll(getAllFields("allInstances", instanceEventBody, languages));
    resultMap.putAll(getAllFields("allItems", eventBody.get("items"), languages));
    resultMap.putAll(getAllFields("allHoldings", eventBody.get("holdings"), languages));

    return resultMap;
  }

  @SuppressWarnings("unchecked")
  private Map<String, Object> getAllFields(String field, Object eventBody, List<String> languages) {
    if (eventBody == null) {
      return Collections.emptyMap();
    }

    var context = new Context();
    addStrings(null, context, eventBody);

    var resultMap = getMultilangValue(field, new LinkedHashSet<>(context.getMultilangValues()), languages);

    var plainField = PLAIN_MULTILANG_PREFIX + field;
    var newPlainAll = new LinkedHashSet<>((Set<String>) resultMap.get(plainField));
    newPlainAll.addAll(context.getPlainValues());
    resultMap.put(plainField, newPlainAll);

    return resultMap;
  }

  private void addStrings(String path, Context context, Collection<?> collection) {
    if (collection.isEmpty()) {
      return;
    }

    for (Object value : collection) {
      addStrings(path, context, value);
    }
  }

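  @SuppressWarnings("unchecked")
  private void addStrings(String path, Context context, Object value) {
    if (value instanceof String) {
      context.addValue((String) value, isMultilang(path));
    }

    if (value instanceof Collection<?>) {
      addStrings(path, context, (Collection<?>) value);
    }

    if (value instanceof Map<?, ?>) {
      addStrings(path, context, (Map<String, Object>) value);
    }
  }

  // Map overload called above; assumed to walk the entries and extend the path with each key.
  private void addStrings(String path, Context context, Map<String, Object> map) {
    if (MapUtils.isEmpty(map)) {
      return;
    }

    for (var entry : map.entrySet()) {
      addStrings(getPath(path, entry.getKey()), context, entry.getValue());
    }
  }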

  private static String getPath(String initValue, String path) {
    return initValue != null ? initValue + "." + path : path;
  }

  private boolean isMultilang(String path) {
    var isMultilang = multilangCache.get(path);
    if (isMultilang != null) {
      return isMultilang;
    }

    var result = searchFieldProvider.getPlainFieldByPath(INSTANCE_RESOURCE, path)
      .filter(PlainFieldDescription::isMultilang)
      .isPresent();

    multilangCache.put(path, result);
    return result;
  }

  @Data
  private static class Context {

    private Collection<String> plainValues = new ArrayList<>();
    private Collection<String> multilangValues = new ArrayList<>();

    void addValue(String value, boolean isMultilang) {
      if (isMultilang) {
        multilangValues.add(value);
        return;
      }

      plainValues.add(value);
    }
  }
}
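
The listing above depends on mod-search classes and does not show how type id values are skipped. The following is a minimal, self-contained sketch of the same traversal under stated assumptions: the class name AllFieldSketch, the hard-coded MULTILANG_PATHS set, and the rule that excluded keys end with TypeId are all hypothetical and stand in for the metadata lookup done through LocalSearchFieldProvider.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch; not the mod-search implementation.
final class AllFieldSketch {

  // Assumption: the paths of multi-language fields are known up front.
  private static final Set<String> MULTILANG_PATHS = Set.of("title", "contributors.name");

  // Assumption: type id fields are recognised by their key suffix.
  static boolean isExcluded(String key) {
    return key.endsWith("TypeId");
  }

  // Collects every string value of the record into a "multilang" or a "plain" bucket.
  static Map<String, List<String>> aggregate(Map<String, Object> record) {
    var result = new LinkedHashMap<String, List<String>>();
    result.put("multilang", new ArrayList<>());
    result.put("plain", new ArrayList<>());
    collect(null, record, result);
    return result;
  }

  private static void collect(String path, Object value, Map<String, List<String>> result) {
    if (value instanceof String) {
      var multilang = path != null && MULTILANG_PATHS.contains(path);
      result.get(multilang ? "multilang" : "plain").add((String) value);
    } else if (value instanceof Collection<?>) {
      // Elements of a collection share the path of the collection itself.
      for (Object element : (Collection<?>) value) {
        collect(path, element, result);
      }
    } else if (value instanceof Map<?, ?>) {
      for (Map.Entry<?, ?> entry : ((Map<?, ?>) value).entrySet()) {
        var key = String.valueOf(entry.getKey());
        if (!isExcluded(key)) {
          collect(path == null ? key : path + "." + key, entry.getValue(), result);
        }
      }
    }
  }

  public static void main(String[] args) {
    Map<String, Object> record = new LinkedHashMap<>();
    record.put("title", "Climate change");
    record.put("identifiers", List.of(Map.of("value", "12345", "identifierTypeId", "abc")));
    System.out.println(aggregate(record));
  }
}
```

In this sketch the identifierTypeId value never reaches either bucket, while the identifier value itself lands in the plain bucket because its path is not listed as multi-language.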


Updated mappings

"properties":{
  "all": {
    "properties": {
      "eng": {
        "type": "text",
        "analyzer": "english"
      },
      "fre": {
        "type": "text",
        "analyzer": "french"
      },
      "ger": {
        "type": "text",
        "analyzer": "german"
      },
      "ita": {
        "type": "text",
        "analyzer": "italian"
      },
      "spa": {
        "type": "text",
        "analyzer": "spanish"
      },
      "src": {
        "type": "text",
        "analyzer": "source_analyzer"
      }
    }
  },
  "plain_all": {
    "type": "keyword",
    "normalizer": "keyword_lowercase"
  }
}
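
To connect this mapping to the cql.all term, the sketch below builds one possible Elasticsearch request body that targets every language sub-field of all together with plain_all. It is an assumption about how such a query could look, not the query actually produced by CqlSearchQueryConverter; the class and method names are hypothetical.

```java
import java.util.List;
import java.util.Map;

// Hypothetical sketch; the real converter may build a different query.
final class SearchAllQuerySketch {

  // Builds a bool query that matches the term either in any analyzed
  // sub-field of "all" or as an exact value in the keyword field "plain_all".
  static Map<String, Object> searchAllQuery(String term) {
    Map<String, Object> multiMatch = Map.of("multi_match",
      Map.of("query", term, "fields", List.of("all.*")));
    Map<String, Object> plainTerm = Map.of("term",
      Map.of("plain_all", term));
    return Map.of("bool",
      Map.of("should", List.of(multiMatch, plainTerm), "minimum_should_match", 1));
  }

  public static void main(String[] args) {
    System.out.println(searchAllQuery("shelving"));
  }
}
```

Keeping the analyzed and the keyword clause in one should list means a record matches when either representation of the aggregated values matches the term.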

Performance test configuration

Local

| Property | Value |
| --- | --- |
| Environment | localhost |
| mod-search configuration | single node without limits |
| Count of resources | 3 million (local dataset) |
| Query limit | 100 (default) |
| Performance test duration | 600 s (10 min) |
| V_USERS | 10 |
| RAMP_UP | 30 |
| HOSTNAME | localhost |
| Port | 8081 |

Search queries:

  • cql.all = "climate ch*"
  • cql.all all "*climate change*"
  • cql.all = "*depression*"
  • cql.all = "depress*"
  • cql.all = "covid*"
  • cql.all any "book usage"
  • cql.all = "*covid-19*"
  • cql.all any "covid-19"
  • cql.all == "shelving"
  • cql.all all "book usage"
  • cql.all any "*usage*"
  • cql.all all "book"
  • cql.all all "Van Paassen"

Performance test results

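| Label | # Samples | Average | Median | 90% Line | 95% Line | 99% Line | Min | Max | Error % | Throughput | Received KB/sec | Sent KB/sec |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GET /search/instances: cql.all = "climate ch*" | 186 | 28 | 29 | 34 | 35 | 38 | 22 | 48 | 0.00% | 0.3106 | 14.1 | 0.08 |
| GET /search/instances: cql.all = "depress*" | 185 | 28 | 29 | 33 | 35 | 36 | 21 | 46 | 0.00% | 0.31267 | 12.15 | 0.08 |
| GET /search/instances: cql.all = "covid*" | 185 | 4 | 4 | 6 | 6 | 7 | 3 | 9 | 0.00% | 0.31269 | 0.06 | 0.08 |
| GET /search/instances: cql.all any "book usage" | 185 | 40 | 40 | 47 | 49 | 52 | 30 | 60 | 0.00% | 0.31267 | 9.81 | 0.08 |
| GET /search/instances: cql.all all "*climate change*" | 186 | 7897 | 8008 | 8111 | 8155 | 8257 | 6755 | 8407 | 0.00% | 0.30701 | 11.54 | 0.08 |
| GET /search/instances: cql.all any "covid-19" | 185 | 32 | 33 | 38 | 41 | 44 | 24 | 45 | 0.00% | 0.31279 | 11.05 | 0.08 |
| GET /search/instances: cql.all == "shelving" | 185 | 26 | 26 | 31 | 32 | 35 | 20 | 41 | 0.00% | 0.31279 | 8.74 | 0.08 |
| GET /search/instances: cql.all all "book usage" | 185 | 29 | 29 | 34 | 36 | 39 | 22 | 42 | 0.00% | 0.31279 | 11.11 | 0.08 |
| GET /search/instances: cql.all all "book" | 185 | 36 | 37 | 42 | 44 | 48 | 28 | 64 | 0.00% | 0.31363 | 9.73 | 0.08 |
| GET /search/instances: cql.all all "Van Paassen" | 185 | 6 | 7 | 8 | 9 | 11 | 5 | 15 | 0.00% | 0.31364 | 0.6 | 0.08 |
| GET /search/instances: cql.all = "*depression*" | 186 | 7896 | 8004 | 8107 | 8181 | 8254 | 6855 | 8513 | 0.00% | 0.30845 | 10.01 | 0.08 |
| GET /search/instances: cql.all = "*covid-19*" | 185 | 7869 | 7982 | 8073 | 8115 | 8193 | 6935 | 8268 | 0.00% | 0.30896 | 0.06 | 0.08 |
| GET /search/instances: cql.all any "*usage*" | 185 | 7892 | 7992 | 8084 | 8127 | 8233 | 7003 | 8297 | 0.00% | 0.30907 | 11.57 | 0.08 |
| TOTAL | 2408 | 2449 | 33 | 8032 | 8069 | 8193 | 3 | 8513 | 0.00% | 3.97396 | 108.65 | 1.07 |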

Rancher Performance Tests

Configuration

| Property | Value |
| --- | --- |
| Environment | falcon-perf-okapi.ci.folio.org |
| mod-search configuration | 2 nodes with CPU limit = 300m and memoryLimit = 800MB |
| Count of resources | 7.6 million (bugfest-iris dataset) |
| Query limit | 100 (default) |
| Performance test duration | 600 s (10 min) |
| V_USERS | 10 |
| RAMP_UP | 30 |
| HOSTNAME | falcon-perf-okapi.ci.folio.org |
| Port | 443 |

Search queries:

  • cql.all = "climate ch*"
  • cql.all all "*climate change*"
  • cql.all = "*depression*"
  • cql.all = "depress*"
  • cql.all = "covid*"
  • cql.all any "book usage"
  • cql.all = "*covid-19*"
  • cql.all any "covid-19"
  • cql.all == "shelving"
  • cql.all all "book usage"
  • cql.all any "*usage*"
  • cql.all all "book"
  • cql.all all "Van Paassen"

Performance test results without wildcard queries (starts-with queries are still included)

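| Label | # Samples | Average | Median | 90% Line | 95% Line | 99% Line | Min | Max | Error % | Throughput | Received KB/sec | Sent KB/sec |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| authn/login HTTP Request | 1 | 1253 | 1253 | 1253 | 1253 | 1253 | 1253 | 1253 | 0.00% | 0.79808 | 0.76 | 0.23 |
| GET /search/instances: cql.all = "climate ch*" | 160 | 260 | 250 | 259 | 269 | 377 | 238 | 1523 | 0.00% | 0.26719 | 11.38 | 0.13 |
| GET /search/instances: cql.all = "depress*" | 160 | 251 | 248 | 260 | 266 | 377 | 234 | 428 | 0.00% | 0.26774 | 9.08 | 0.13 |
| GET /search/instances: cql.all = "covid*" | 160 | 220 | 217 | 223 | 226 | 237 | 205 | 868 | 0.00% | 0.2678 | 0.07 | 0.13 |
| GET /search/instances: cql.all any "book usage" | 160 | 370 | 360 | 374 | 384 | 570 | 345 | 1172 | 0.00% | 0.2677 | 8.29 | 0.13 |
| GET /search/instances: cql.all any "covid-19" | 160 | 266 | 260 | 272 | 288 | 382 | 247 | 770 | 0.00% | 0.2681 | 8.17 | 0.13 |
| GET /search/instances: cql.all == "shelving" | 159 | 254 | 248 | 255 | 260 | 436 | 235 | 1045 | 0.00% | 0.26849 | 8.13 | 0.13 |
| GET /search/instances: cql.all all "book usage" | 159 | 253 | 252 | 260 | 272 | 290 | 239 | 347 | 0.00% | 0.26859 | 8.83 | 0.13 |
| GET /search/instances: cql.all all "book" | 159 | 335 | 327 | 338 | 349 | 440 | 314 | 1187 | 0.00% | 0.2686 | 7.91 | 0.13 |
| GET /search/instances: cql.all all "Van Paassen" | 159 | 232 | 224 | 230 | 232 | 370 | 212 | 918 | 0.00% | 0.26873 | 1.19 | 0.13 |
| GET /search/instances: cql.all == "visualization" | 159 | 254 | 249 | 261 | 270 | 420 | 236 | 459 | 0.00% | 0.26879 | 10.24 | 0.13 |
| GET /search/instances: cql.all = "any times in inventory" | 159 | 235 | 225 | 231 | 242 | 533 | 213 | 914 | 0.00% | 0.26888 | 0.68 | 0.13 |
| GET /search/instances: cql.all = "Bob Dylan" | 159 | 259 | 248 | 261 | 267 | 463 | 235 | 1237 | 0.00% | 0.26901 | 8.55 | 0.13 |
| GET /search/instances: cql.all == "America*" | 159 | 295 | 293 | 303 | 317 | 372 | 279 | 501 | 0.00% | 0.26907 | 9.69 | 0.13 |
| GET /search/instances: cql.all = "Water in africa" | 159 | 263 | 255 | 268 | 275 | 496 | 241 | 1037 | 0.00% | 0.26916 | 10.29 | 0.13 |
| TOTAL | 2232 | 268 | 251 | 334 | 361 | 464 | 205 | 1523 | 0.00% | 3.70892 | 101.13 | 1.81 |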

Performance test results with wildcard queries

| Label | # Samples | Average | Median | 90% Line | 95% Line | 99% Line | Min | Max | Error % | Throughput | Received KB/sec | Sent KB/sec |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| authn/login HTTP Request | 1 | 1244 | 1244 | 1244 | 1244 | 1244 | 1244 | 1244 | 0.00% | 0.80386 | 0.77 | 0.23 |
| GET /search/instances: cql.all = "climate ch*" | 25 | 13228 | 9675 | 30953 | 31072 | 31229 | 259 | 31229 | 28.00% | 0.04018 | 1.23 | 0.02 |
| GET /search/instances: cql.all = "depress*" | 24 | 14828 | 13856 | 30777 | 31116 | 31771 | 269 | 31771 | 20.83% | 0.04034 | 1.09 | 0.02 |
| GET /search/instances: cql.all = "covid*" | 24 | 16087 | 16822 | 31029 | 31067 | 31299 | 207 | 31299 | 25.00% | 0.04043 | 0.01 | 0.02 |
| GET /search/instances: cql.all any "book usage" | 24 | 13627 | 7664 | 31090 | 31551 | 32629 | 365 | 32629 | 25.00% | 0.04249 | 0.99 | 0.02 |
| GET /search/instances: cql.all any "covid-19" | 24 | 11798 | 5338 | 30994 | 31138 | 31863 | 280 | 31863 | 20.83% | 0.0404 | 0.98 | 0.02 |
| GET /search/instances: cql.all == "shelving" | 24 | 13305 | 10538 | 31067 | 31073 | 31170 | 262 | 31170 | 20.83% | 0.04034 | 0.97 | 0.02 |
| GET /search/instances: cql.all all "book usage" | 24 | 14016 | 12317 | 31311 | 31654 | 32430 | 254 | 32430 | 20.83% | 0.0424 | 1.11 | 0.02 |
| GET /search/instances: cql.all all "book" | 24 | 13123 | 9347 | 31230 | 31436 | 31554 | 353 | 31554 | 20.83% | 0.04158 | 0.97 | 0.02 |
| GET /search/instances: cql.all all "Van Paassen" | 24 | 13208 | 8487 | 31127 | 31135 | 31493 | 221 | 31493 | 25.00% | 0.03955 | 0.14 | 0.02 |
| GET /search/instances: cql.all all "*climate change*" | 25 | 30959 | 30902 | 31485 | 31730 | 32232 | 30241 | 32232 | 100.00% | 0.04024 | 0.02 | 0.02 |
| GET /search/instances: cql.all = "*depression*" | 25 | 30639 | 30911 | 31753 | 32285 | 32290 | 25161 | 32290 | 92.00% | 0.04019 | 0.13 | 0.02 |
| GET /search/instances: cql.all = "*covid-19*" | 24 | 30525 | 30511 | 31155 | 31166 | 31385 | 28203 | 31385 | 91.67% | 0.04034 | 0.02 | 0.02 |
| GET /search/instances: cql.all any "*usage*" | 24 | 30448 | 30726 | 31392 | 31408 | 31433 | 26512 | 31433 | 87.50% | 0.04027 | 0.23 | 0.02 |
| TOTAL | 316 | 18908 | 25345 | 31224 | 31484 | 32285 | 207 | 32629 | 44.62% | 0.50422 | 7.44 | 0.25 |

Conclusions

  1. Search by all field values is possible and responds quickly for all queries except wildcard queries of the "contains" or "ends with" type (for example, *depression*, *book).
  2. Wildcard queries using the starts-with approach work well (for example, climate ch*).
  3. Wildcard queries can degrade the performance of regular queries when the two kinds are mixed: in the mixed Rancher run, the queries without a leading wildcard also showed error rates of roughly 21-28% and average response times of about 12-16 seconds, compared with 220-370 ms when run on their own.
  4. Multilanguage queries behave the same as before.
  5. All tokenized and pre-processed values can also be included in the search by all.
  6. All performance tests were run using a single field for all field values from instances/items/holding-records, but it would not be hard to re-implement the approach to search separately by instances/holding-records/items.
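
The distinction drawn in conclusions 1 and 2 can be sketched as a small classifier that a caller might use to flag expensive terms before they reach the search engine. The class, enum, and method names are hypothetical, and treating every term that begins with * as the slow case is an assumption based on the results above, not behavior implemented in mod-search.

```java
// Hypothetical sketch; mod-search does not necessarily classify terms this way.
final class WildcardTermSketch {

  enum Kind {
    PLAIN,                 // no wildcard at all, e.g. shelving
    STARTS_WITH,           // wildcard after a fixed prefix, e.g. climate ch*
    CONTAINS_OR_ENDS_WITH  // leading wildcard, e.g. *depression* or *book
  }

  static Kind classify(String term) {
    if (term.startsWith("*")) {
      // No fixed prefix: this is the case that was slow in the tests.
      return Kind.CONTAINS_OR_ENDS_WITH;
    }
    if (term.contains("*")) {
      // A wildcard later in the term still leaves a fixed prefix to match on.
      return Kind.STARTS_WITH;
    }
    return Kind.PLAIN;
  }

  public static void main(String[] args) {
    for (String term : new String[] {"shelving", "climate ch*", "*depression*", "*book"}) {
      System.out.println(term + " -> " + classify(term));
    }
  }
}
```

A caller could use the result to reject, rate-limit, or route CONTAINS_OR_ENDS_WITH terms separately so that they do not compete with regular queries.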
