POC: Investigate performance impact on supporting search by all properties in inventory records

Purpose

Some users want to search across all properties of all inventory record types. The scope of this story is to propose a possible solution and measure its cost.

Scope

  • create an additional search option that covers search by all properties of all inventory record types

  • investigate performance impact on the already implemented search option

  • investigate the performance of the new search all query

  • determine the impact on the required disk space for a collection of ~5M records

  • investigate the increase in required RAM for mod-search

  • support multilanguage full-text fields

Approach

  1. Implement an algorithm that aggregates all field values into a single field called all, keeping multi-language and non-multi-language values separate. Type ids such as identifierTypeId and alternativeTitleTypeId are excluded. All processed values must also be included in this field.

  2. Update CqlSearchQueryConverter to accept and process queries that use the cql.all term

  3. Perform performance tests locally and using a performance-rancher environment
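
Step 1 can be sketched in isolation as follows. This is a minimal illustration, not the mod-search implementation: the class name AllFieldSketch, the hard-coded MULTILANG_PATHS set, and the rule that type ids are recognized by a "TypeId" suffix are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

class AllFieldSketch {
  // Hypothetical list of paths that hold multi-language full-text values.
  static final Set<String> MULTILANG_PATHS = Set.of("title", "alternativeTitles.alternativeTitle");

  final List<String> multilangValues = new ArrayList<>();
  final List<String> plainValues = new ArrayList<>();

  // Assumption: type-id properties can be recognized by their "TypeId" suffix.
  static boolean isExcluded(String key) {
    return key.endsWith("TypeId");
  }

  // Walks strings, collections and nested maps, sorting each string into one of two buckets.
  void collect(String path, Object value) {
    if (value instanceof String) {
      boolean multilang = path != null && MULTILANG_PATHS.contains(path);
      (multilang ? multilangValues : plainValues).add((String) value);
    } else if (value instanceof Collection<?>) {
      for (Object element : (Collection<?>) value) {
        collect(path, element); // list elements share the path of the list
      }
    } else if (value instanceof Map<?, ?>) {
      for (Map.Entry<?, ?> entry : ((Map<?, ?>) value).entrySet()) {
        String key = String.valueOf(entry.getKey());
        if (!isExcluded(key)) {
          collect(path == null ? key : path + "." + key, entry.getValue());
        }
      }
    }
  }

  static AllFieldSketch of(Map<String, Object> record) {
    AllFieldSketch sketch = new AllFieldSketch();
    sketch.collect(null, record);
    return sketch;
  }

  public static void main(String[] args) {
    Map<String, Object> record = new LinkedHashMap<>();
    record.put("title", "Climate change");
    record.put("identifiers", List.of(Map.of("value", "12345", "identifierTypeId", "some-uuid")));
    AllFieldSketch sketch = AllFieldSketch.of(record);
    System.out.println("multilang: " + sketch.multilangValues);
    System.out.println("plain:     " + sketch.plainValues);
  }
}
```

Keeping two buckets lets the multi-language values go through language analyzers while the remaining values are indexed as lowercase keywords.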

Implemented classes

The field value provider creates 6 fields for searching by field values: a multi-language field and a plain field for each of instances, holdings records, and items.

AllSearchFieldValueProvider
package org.folio.search.service.setter.instance;

import static org.folio.search.utils.SearchUtils.INSTANCE_RESOURCE;
// Assumed to be defined in SearchUtils; the original listing used it without an import.
import static org.folio.search.utils.SearchUtils.PLAIN_MULTILANG_PREFIX;
import static org.folio.search.utils.SearchUtils.getMultilangValue;

import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import lombok.Data;
import lombok.RequiredArgsConstructor;
import org.apache.commons.collections4.MapUtils;
import org.folio.search.model.metadata.PlainFieldDescription;
import org.folio.search.service.metadata.LocalSearchFieldProvider;
import org.springframework.stereotype.Component;

@Component
@RequiredArgsConstructor
public class AllSearchFieldValueProvider {

  private final LocalSearchFieldProvider searchFieldProvider;
  // Caches the multilang lookup per field path.
  private final ConcurrentHashMap<String, Boolean> multilangCache = new ConcurrentHashMap<>();

  public Map<String, Object> getFieldValue(Map<String, Object> eventBody, List<String> languages) {
    // Instance-level values are collected without the nested items and holdings.
    var instanceEventBody = new LinkedHashMap<>(eventBody);
    instanceEventBody.remove("items");
    instanceEventBody.remove("holdings");

    var resultMap = new LinkedHashMap<String, Object>();
    resultMap.putAll(getAllFields("allInstances", instanceEventBody, languages));
    resultMap.putAll(getAllFields("allItems", eventBody.get("items"), languages));
    resultMap.putAll(getAllFields("allHoldings", eventBody.get("holdings"), languages));
    return resultMap;
  }

  @SuppressWarnings("unchecked")
  private Map<String, Object> getAllFields(String field, Object eventBody, List<String> languages) {
    if (eventBody == null) {
      return Collections.emptyMap();
    }

    var context = new Context();
    addStrings(null, context, eventBody);

    var resultMap = getMultilangValue(field, new LinkedHashSet<>(context.getMultilangValues()), languages);
    // The plain field holds the multilang values plus all non-multilang values.
    var plainField = PLAIN_MULTILANG_PREFIX + field;
    var newPlainAll = new LinkedHashSet<>((Set<String>) resultMap.get(plainField));
    newPlainAll.addAll(context.getPlainValues());
    resultMap.put(plainField, newPlainAll);
    return resultMap;
  }

  private void addStrings(String path, Context context, Collection<?> collection) {
    if (collection.isEmpty()) {
      return;
    }
    for (Object value : collection) {
      addStrings(path, context, value);
    }
  }

  // NOTE: this overload is missing from the original listing. Without it, the Map branch
  // of addStrings(String, Context, Object) would call itself forever. It is reconstructed
  // here from the otherwise unused getPath helper and MapUtils import.
  private void addStrings(String path, Context context, Map<String, Object> map) {
    if (MapUtils.isEmpty(map)) {
      return;
    }
    for (var entry : map.entrySet()) {
      addStrings(getPath(path, entry.getKey()), context, entry.getValue());
    }
  }

  @SuppressWarnings("unchecked")
  private void addStrings(String path, Context context, Object value) {
    if (value instanceof String) {
      context.addValue((String) value, isMultilang(path));
    }
    if (value instanceof Collection<?>) {
      addStrings(path, context, (Collection<?>) value);
    }
    if (value instanceof Map<?, ?>) {
      addStrings(path, context, (Map<String, Object>) value);
    }
  }

  private static String getPath(String initValue, String path) {
    return initValue != null ? initValue + "." + path : path;
  }

  private boolean isMultilang(String path) {
    var isMultilang = multilangCache.get(path);
    if (isMultilang != null) {
      return isMultilang;
    }

    var result = searchFieldProvider.getPlainFieldByPath(INSTANCE_RESOURCE, path)
      .filter(PlainFieldDescription::isMultilang)
      .isPresent();
    multilangCache.put(path, result);
    return result;
  }

  @Data
  private static class Context {

    private Collection<String> plainValues = new ArrayList<>();
    private Collection<String> multilangValues = new ArrayList<>();

    void addValue(String value, boolean isMultilang) {
      if (isMultilang) {
        multilangValues.add(value);
        return;
      }
      plainValues.add(value);
    }
  }
}

 

Updated mappings

"properties": {
  "all": {
    "properties": {
      "eng": { "type": "text", "analyzer": "english" },
      "fre": { "type": "text", "analyzer": "french" },
      "ger": { "type": "text", "analyzer": "german" },
      "ita": { "type": "text", "analyzer": "italian" },
      "spa": { "type": "text", "analyzer": "spanish" },
      "src": { "type": "text", "analyzer": "source_analyzer" }
    }
  },
  "plain_all": {
    "type": "keyword",
    "normalizer": "keyword_lowercase"
  }
}

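To show how the two mapped fields could be used at query time, here is a hedged sketch that turns a cql.all term into an Elasticsearch query body. The routing rule (terms containing * go to a wildcard query on plain_all, all other terms go to a multi_match over the all.* language fields) and the class name CqlAllQuerySketch are assumptions for illustration. They are not taken from CqlSearchQueryConverter.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

class CqlAllQuerySketch {

  // Builds the query part of a search request as nested maps (serializable to JSON).
  static Map<String, Object> toQuery(String term) {
    Map<String, Object> query = new LinkedHashMap<>();
    if (term.contains("*")) {
      // Assumption: wildcard terms are matched against the lowercase keyword field.
      String value = term.toLowerCase(Locale.ROOT);
      query.put("wildcard", Map.of("plain_all", Map.of("value", value)));
    } else {
      // Assumption: full-text terms are matched against every language sub-field of "all".
      Map<String, Object> multiMatch = new LinkedHashMap<>();
      multiMatch.put("query", term);
      multiMatch.put("fields", List.of("all.*"));
      multiMatch.put("operator", "and");
      query.put("multi_match", multiMatch);
    }
    return query;
  }

  public static void main(String[] args) {
    System.out.println(toQuery("depress*"));
    System.out.println(toQuery("book usage"));
  }
}
```

Under this assumed rule, a term such as "*depression*" becomes a leading-wildcard query on a keyword field, which Elasticsearch generally cannot answer without scanning many terms.
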
Performance test configuration

Local

| Property | Value |
|---|---|
| Environment | localhost |
| mod-search configuration | single node without limits |
| Search queries | cql.all = "climate ch*" |
| | cql.all all "*climate change*" |
| | cql.all = "*depression*" |
| | cql.all = "depress*" |
| | cql.all = "covid*" |
| | cql.all any "book usage" |
| | cql.all = "*covid-19*" |
| | cql.all any "covid-19" |
| | cql.all == "shelving" |
| | cql.all all "book usage" |
| | cql.all any "*usage*" |
| | cql.all all "book" |
| | cql.all all "Van Paassen" |
| Count of resources | 3 million (local dataset) |
| Query limit | 100 (default) |
| Performance test duration | 600 s (10 min) |
| V_USERS | 10 |
| RAMP_UP | 30 |
| HOSTNAME | localhost |
| port | 8081 |

Performance test results

| Label | # Samples | Average (ms) | Median (ms) | 90% Line (ms) | 95% Line (ms) | 99% Line (ms) | Min (ms) | Max (ms) | Error % | Throughput (req/s) | Received KB/sec | Sent KB/sec |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GET /search/instances: cql.all = "climate ch*" | 186 | 28 | 29 | 34 | 35 | 38 | 22 | 48 | 0.00% | 0.3106 | 14.1 | 0.08 |
| GET /search/instances: cql.all = "depress*" | 185 | 28 | 29 | 33 | 35 | 36 | 21 | 46 | 0.00% | 0.31267 | 12.15 | 0.08 |
| GET /search/instances: cql.all = "covid*" | 185 | 4 | 4 | 6 | 6 | 7 | 3 | 9 | 0.00% | 0.31269 | 0.06 | 0.08 |
| GET /search/instances: cql.all any "book usage" | 185 | 40 | 40 | 47 | 49 | 52 | 30 | 60 | 0.00% | 0.31267 | 9.81 | 0.08 |
| GET /search/instances: cql.all all "*climate change*" | 186 | 7897 | 8008 | 8111 | 8155 | 8257 | 6755 | 8407 | 0.00% | 0.30701 | 11.54 | 0.08 |
| GET /search/instances: cql.all any "covid-19" | 185 | 32 | 33 | 38 | 41 | 44 | 24 | 45 | 0.00% | 0.31279 | 11.05 | 0.08 |
| GET /search/instances: cql.all == "shelving" | 185 | 26 | 26 | 31 | 32 | 35 | 20 | 41 | 0.00% | 0.31279 | 8.74 | 0.08 |
| GET /search/instances: cql.all all "book usage" | 185 | 29 | 29 | 34 | 36 | 39 | 22 | 42 | 0.00% | 0.31279 | 11.11 | 0.08 |
| GET /search/instances: cql.all all "book" | 185 | 36 | 37 | 42 | 44 | 48 | 28 | 64 | 0.00% | 0.31363 | 9.73 | 0.08 |
| GET /search/instances: cql.all all "Van Paassen" | 185 | 6 | 7 | 8 | 9 | 11 | 5 | 15 | 0.00% | 0.31364 | 0.6 | 0.08 |
| GET /search/instances: cql.all = "*depression*" | 186 | 7896 | 8004 | 8107 | 8181 | 8254 | 6855 | 8513 | 0.00% | 0.30845 | 10.01 | 0.08 |
| GET /search/instances: cql.all = "*covid-19*" | 185 | 7869 | 7982 | 8073 | 8115 | 8193 | 6935 | 8268 | 0.00% | 0.30896 | 0.06 | 0.08 |
| GET /search/instances: cql.all any "*usage*" | 185 | 7892 | 7992 | 8084 | 8127 | 8233 | 7003 | 8297 | 0.00% | 0.30907 | 11.57 | 0.08 |
| TOTAL | 2408 | 2449 | 33 | 8032 | 8069 | 8193 | 3 | 8513 | 0.00% | 3.97396 | 108.65 | 1.07 |
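
The TOTAL average hides two very different groups of queries. The sketch below recomputes sample-weighted averages from the # Samples and Average values of the local results, split by whether the search term starts with *. The class name LocalResultsSummary and the grouping flag are introduced only for this illustration. The inputs are the rounded per-row averages from the report, so the results are approximate.

```java
class LocalResultsSummary {

  // Each row: {samples, average response time in ms, 1 if the term starts with '*', else 0}.
  static final long[][] ROWS = {
    {186, 28, 0},    // cql.all = "climate ch*"
    {185, 28, 0},    // cql.all = "depress*"
    {185, 4, 0},     // cql.all = "covid*"
    {185, 40, 0},    // cql.all any "book usage"
    {186, 7897, 1},  // cql.all all "*climate change*"
    {185, 32, 0},    // cql.all any "covid-19"
    {185, 26, 0},    // cql.all == "shelving"
    {185, 29, 0},    // cql.all all "book usage"
    {185, 36, 0},    // cql.all all "book"
    {185, 6, 0},     // cql.all all "Van Paassen"
    {186, 7896, 1},  // cql.all = "*depression*"
    {185, 7869, 1},  // cql.all = "*covid-19*"
    {185, 7892, 1},  // cql.all any "*usage*"
  };

  // Total number of samples in one group.
  static long samples(int leadingWildcard) {
    long total = 0;
    for (long[] row : ROWS) {
      if (row[2] == leadingWildcard) {
        total += row[0];
      }
    }
    return total;
  }

  // Sum of samples * average for one group, in ms.
  static long weightedSum(int leadingWildcard) {
    long total = 0;
    for (long[] row : ROWS) {
      if (row[2] == leadingWildcard) {
        total += row[0] * row[1];
      }
    }
    return total;
  }

  static double average(int leadingWildcard) {
    return (double) weightedSum(leadingWildcard) / samples(leadingWildcard);
  }

  public static void main(String[] args) {
    System.out.printf("leading wildcard: %.1f ms%n", average(1));
    System.out.printf("other queries:    %.1f ms%n", average(0));
  }
}
```

With these inputs, the four leading-wildcard queries average roughly 7.9 s, while the other nine queries average roughly 25 ms, so the overall average of about 2.4 s is driven almost entirely by the leading-wildcard group.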

Rancher Performance Tests

Configuration

| Property | Value |
|---|---|
| Environment | falcon-perf-okapi.ci.folio.org |
| mod-search configuration | 2 nodes with CPU limit = 300m and memory limit = 800 MB |
| Search queries | cql.all = "climate ch*" |
| | cql.all all "*climate change*" |
| | cql.all = "*depression*" |
| | cql.all = "depress*" |
| | cql.all = "covid*" |
| | cql.all any "book usage" |
| | cql.all = "*covid-19*" |
| | cql.all any "covid-19" |
| | cql.all == "shelving" |
| | cql.all all "book usage" |
| | cql.all any "*usage*" |
| | cql.all all "book" |
| | cql.all all "Van Paassen" |
| Count of resources | 7.6 million (bugfest-iris dataset) |
| Query limit | 100 (default) |
| Performance test duration | 600 s (10 min) |
| V_USERS | 10 |
| RAMP_UP | 30 |
| HOSTNAME | falcon-perf-okapi.ci.folio.org |
| port | 443 |

Performance test results without leading-wildcard queries (queries whose term starts with * are excluded)

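| Label | # Samples | Average (ms) | Median (ms) | 90% Line (ms) | 95% Line (ms) | 99% Line (ms) | Min (ms) | Max (ms) | Error % | Throughput (req/s) | Received KB/sec | Sent KB/sec |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| authn/login HTTP Request | 1 | 1253 | 1253 | 1253 | 1253 | 1253 | 1253 | 1253 | 0.00% | 0.79808 | 0.76 | 0.23 |
| GET /search/instances: cql.all = "climate ch*" | 160 | 260 | 250 | 259 | 269 | 377 | 238 | 1523 | 0.00% | 0.26719 | 11.38 | 0.13 |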