POC: Investigate the performance impact of supporting search by all properties in inventory records
Purpose
Some users would like to be able to conduct a search across all properties of all record types. The scope of this story is to provide a possible solution.
Scope
Create an additional search option that covers search by all properties of all inventory record types
Investigate the performance impact on the already implemented search options
Investigate the performance of the new search-all query
Determine the impact on the required disk space for a collection of ~5M records
Investigate the increase in RAM required by mod-search
Support multi-language full-text fields
Approach
Implement an algorithm that aggregates all field values into a single field called all, keeping multi-language and non-multi-language values separate. Type ID values such as identifierTypeId and alternativeTitleTypeId are excluded. All processed values must also be included in this field.
Update CqlSearchQueryConverter to accept and process queries that use the cql.all term
Run performance tests locally and in a performance-rancher environment
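The aggregation step described above can be sketched without any mod-search dependencies. This is a minimal illustration, not the POC code: the class name AllFieldAggregationSketch, the set of multi-language paths passed in by the caller, and the rule that skips every key ending in "TypeId" are assumptions made for the example (the source only names identifierTypeId and alternativeTitleTypeId as excluded).

```java
import java.util.Collection;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical, dependency-free sketch of the "aggregate everything" step.
final class AllFieldAggregationSketch {

  /** Holds the two groups of aggregated values. */
  static final class Aggregated {
    final Set<String> multilang = new LinkedHashSet<>();
    final Set<String> plain = new LinkedHashSet<>();
  }

  /** Walks the record and sorts every string value into one of the two groups. */
  static Aggregated aggregate(Map<String, Object> record, Set<String> multilangPaths) {
    var result = new Aggregated();
    collect(null, record, multilangPaths, result);
    return result;
  }

  private static void collect(String path, Object value, Set<String> multilangPaths, Aggregated out) {
    if (value instanceof String) {
      boolean multilang = path != null && multilangPaths.contains(path);
      (multilang ? out.multilang : out.plain).add((String) value);
    } else if (value instanceof Collection<?>) {
      // Collection elements share the path of the field that holds them.
      for (Object element : (Collection<?>) value) {
        collect(path, element, multilangPaths, out);
      }
    } else if (value instanceof Map<?, ?>) {
      for (Map.Entry<?, ?> entry : ((Map<?, ?>) value).entrySet()) {
        var key = String.valueOf(entry.getKey());
        if (key.endsWith("TypeId")) {
          continue; // assumed exclusion rule for type id fields
        }
        collect(path == null ? key : path + "." + key, entry.getValue(), multilangPaths, out);
      }
    }
  }
}
```

For a record with a multi-language title and an identifiers list, the title value would land in the multilang group, the identifier value in the plain group, and the identifierTypeId value would be dropped.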
Implemented classes
A field value provider that creates 6 fields for searching by instance, holdings-record, and item field values.
AllSearchFieldValueProvider
package org.folio.search.service.setter.instance;

import static org.folio.search.utils.SearchUtils.INSTANCE_RESOURCE;
import static org.folio.search.utils.SearchUtils.PLAIN_MULTILANG_PREFIX;
import static org.folio.search.utils.SearchUtils.getMultilangValue;

import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import lombok.Data;
import lombok.RequiredArgsConstructor;
import org.apache.commons.collections4.MapUtils;
import org.folio.search.model.metadata.PlainFieldDescription;
import org.folio.search.service.metadata.LocalSearchFieldProvider;
import org.springframework.stereotype.Component;

@Component
@RequiredArgsConstructor
public class AllSearchFieldValueProvider {

  private final LocalSearchFieldProvider searchFieldProvider;
  // Caches the multi-language flag per field path to avoid repeated metadata lookups.
  private final ConcurrentHashMap<String, Boolean> multilangCache = new ConcurrentHashMap<>();

  public Map<String, Object> getFieldValue(Map<String, Object> eventBody, List<String> languages) {
    // Instance-level values are collected from a copy without the nested items and holdings.
    var instanceEventBody = new LinkedHashMap<>(eventBody);
    instanceEventBody.remove("items");
    instanceEventBody.remove("holdings");

    var resultMap = new LinkedHashMap<String, Object>();
    resultMap.putAll(getAllFields("allInstances", instanceEventBody, languages));
    resultMap.putAll(getAllFields("allItems", eventBody.get("items"), languages));
    resultMap.putAll(getAllFields("allHoldings", eventBody.get("holdings"), languages));
    return resultMap;
  }

  @SuppressWarnings("unchecked")
  private Map<String, Object> getAllFields(String field, Object eventBody, List<String> languages) {
    if (eventBody == null) {
      return Collections.emptyMap();
    }

    var context = new Context();
    addStrings(null, context, eventBody);

    var resultMap = getMultilangValue(field, new LinkedHashSet<>(context.getMultilangValues()), languages);
    // The plain field holds the multi-language values plus all non-multi-language values.
    var plainField = PLAIN_MULTILANG_PREFIX + field;
    var newPlainAll = new LinkedHashSet<>((Set<String>) resultMap.get(plainField));
    newPlainAll.addAll(context.getPlainValues());
    resultMap.put(plainField, newPlainAll);
    return resultMap;
  }

  private void addStrings(String path, Context context, Collection<?> collection) {
    if (collection.isEmpty()) {
      return;
    }
    // Collection elements share the path of the field that holds them.
    for (Object value : collection) {
      addStrings(path, context, value);
    }
  }

  // Assumed shape of the map overload that the Object overload below dispatches to;
  // it extends the path with each key, which is what getPath and MapUtils are imported for.
  private void addStrings(String path, Context context, Map<String, Object> map) {
    if (MapUtils.isEmpty(map)) {
      return;
    }
    map.forEach((key, value) -> addStrings(getPath(path, key), context, value));
  }

  @SuppressWarnings("unchecked")
  private void addStrings(String path, Context context, Object value) {
    if (value instanceof String) {
      context.addValue((String) value, isMultilang(path));
    } else if (value instanceof Collection<?>) {
      addStrings(path, context, (Collection<?>) value);
    } else if (value instanceof Map<?, ?>) {
      addStrings(path, context, (Map<String, Object>) value);
    }
  }

  private static String getPath(String initValue, String path) {
    return initValue != null ? initValue + "." + path : path;
  }

  private boolean isMultilang(String path) {
    var isMultilang = multilangCache.get(path);
    if (isMultilang != null) {
      return isMultilang;
    }
    var result = searchFieldProvider.getPlainFieldByPath(INSTANCE_RESOURCE, path)
      .filter(PlainFieldDescription::isMultilang)
      .isPresent();
    multilangCache.put(path, result);
    return result;
  }

  @Data
  private static class Context {
    private Collection<String> plainValues = new ArrayList<>();
    private Collection<String> multilangValues = new ArrayList<>();

    void addValue(String value, boolean isMultilang) {
      if (isMultilang) {
        multilangValues.add(value);
        return;
      }
      plainValues.add(value);
    }
  }
}
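The shape of the map that getAllFields produces can be sketched in isolation. The sketch below assumes that getMultilangValue copies the values under each language code plus a src subfield and also under a plain_-prefixed key; that behavior is inferred from the mappings in the next section and from how the class uses the result, and is not confirmed by the code shown here. AllFieldDocumentSketch and toDocument are hypothetical names.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of the indexed shape of one aggregated "all" field.
final class AllFieldDocumentSketch {

  static Map<String, Object> toDocument(String field, Set<String> multilangValues,
      Set<String> plainValues, List<String> languages) {
    // One subfield per language plus "src", each holding the multi-language values
    // (assumed behavior of getMultilangValue).
    var perLanguage = new LinkedHashMap<String, Object>();
    for (String language : languages) {
      perLanguage.put(language, multilangValues);
    }
    perLanguage.put("src", multilangValues);

    // The plain field receives the multi-language values plus the remaining values.
    var plainAll = new LinkedHashSet<>(multilangValues);
    plainAll.addAll(plainValues);

    var document = new LinkedHashMap<String, Object>();
    document.put(field, perLanguage);
    document.put("plain_" + field, plainAll);
    return document;
  }
}
```

Calling toDocument for each of allInstances, allItems, and allHoldings yields two keys per call, which matches the 6 fields mentioned above.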
Updated mappings
"properties": {
  "all": {
    "properties": {
      "eng": {
        "type": "text",
        "analyzer": "english"
      },
      "fre": {
        "type": "text",
        "analyzer": "french"
      },
      "ger": {
        "type": "text",
        "analyzer": "german"
      },
      "ita": {
        "type": "text",
        "analyzer": "italian"
      },
      "spa": {
        "type": "text",
        "analyzer": "spanish"
      },
      "src": {
        "type": "text",
        "analyzer": "source_analyzer"
      }
    }
  },
  "plain_all": {
    "type": "keyword",
    "normalizer": "keyword_lowercase"
  }
}
Performance test configuration
Local
| Property | Value |
|---|---|
| Environment | localhost |
| mod-search configuration | single node without limits |
| Search queries | |
| Count of resources | 3 million (local dataset) |
| Query limit | 100 (default) |
| Performance test duration | 600 s (10 min) |
| V_USERS | 10 |
| RAMP_UP | 30 |
| HOSTNAME | localhost |
| port | 8081 |
Performance test results
| Label | # Samples | Average | Median | 90% Line | 95% Line | 99% Line | Min | Max | Error % | Throughput | Received KB/sec | Sent KB/sec |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GET /search/instances: cql.all = "climate ch*" | 186 | 28 | 29 | 34 | 35 | 38 | 22 | 48 | 0.00% | 0.3106 | 14.1 | 0.08 |
| GET /search/instances: cql.all = "depress*" | 185 | 28 | 29 | 33 | 35 | 36 | 21 | 46 | 0.00% | 0.31267 | 12.15 | 0.08 |
| GET /search/instances: cql.all = "covid*" | 185 | 4 | 4 | 6 | 6 | 7 | 3 | 9 | 0.00% | 0.31269 | 0.06 | 0.08 |
| GET /search/instances: cql.all any "book usage" | 185 | 40 | 40 | 47 | 49 | 52 | 30 | 60 | 0.00% | 0.31267 | 9.81 | 0.08 |
| GET /search/instances: cql.all all "*climate change*" | 186 | 7897 | 8008 | 8111 | 8155 | 8257 | 6755 | 8407 | 0.00% | 0.30701 | 11.54 | 0.08 |
| GET /search/instances: cql.all any "covid-19" | 185 | 32 | 33 | 38 | 41 | 44 | 24 | 45 | 0.00% | 0.31279 | 11.05 | 0.08 |
| GET /search/instances: cql.all == "shelving" | 185 | 26 | 26 | 31 | 32 | 35 | 20 | 41 | 0.00% | 0.31279 | 8.74 | 0.08 |
| GET /search/instances: cql.all all "book usage" | 185 | 29 | 29 | 34 | 36 | 39 | 22 | 42 | 0.00% | 0.31279 | 11.11 | 0.08 |
| GET /search/instances: cql.all all "book" | 185 | 36 | 37 | 42 | 44 | 48 | 28 | 64 | 0.00% | 0.31363 | 9.73 | 0.08 |
| GET /search/instances: cql.all all "Van Paassen" | 185 | 6 | 7 | 8 | 9 | 11 | 5 | 15 | 0.00% | 0.31364 | 0.6 | 0.08 |
| GET /search/instances: cql.all = "*depression*" | 186 | 7896 | 8004 | 8107 | 8181 | 8254 | 6855 | 8513 | 0.00% | 0.30845 | 10.01 | 0.08 |
| GET /search/instances: cql.all = "*covid-19*" | 185 | 7869 | 7982 | 8073 | 8115 | 8193 | 6935 | 8268 | 0.00% | 0.30896 | 0.06 | 0.08 |
| GET /search/instances: cql.all any "*usage*" | 185 | 7892 | 7992 | 8084 | 8127 | 8233 | 7003 | 8297 | 0.00% | 0.30907 | 11.57 | 0.08 |
| TOTAL | 2408 | 2449 | 33 | 8032 | 8069 | 8193 | 3 | 8513 | 0.00% | 3.97396 | 108.65 | 1.07 |
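The TOTAL row mixes nine queries that average 40 or less with the four leading-wildcard queries that average close to 7,900 (in the table's time units), which is why its average (2449) sits so far from its median (33). As a cross-check, the sample-weighted average can be recomputed from the per-query rows. The class name TotalRowCheck is hypothetical; the numbers are copied from the table above. The result is about 2448, which agrees with the reported 2449 once the rounding of the per-row averages is taken into account.

```java
// Hypothetical cross-check of the TOTAL row using the per-query rows of the local test.
final class TotalRowCheck {

  // "# Samples" and "Average" columns, in table order.
  static final int[] SAMPLES = {186, 185, 185, 185, 186, 185, 185, 185, 185, 185, 186, 185, 185};
  static final int[] AVERAGES = {28, 28, 4, 40, 7897, 32, 26, 29, 36, 6, 7896, 7869, 7892};

  static int totalSamples() {
    int total = 0;
    for (int samples : SAMPLES) {
      total += samples;
    }
    return total;
  }

  /** Average over all samples, weighting each query's average by its sample count. */
  static double weightedAverage() {
    long weightedSum = 0;
    for (int i = 0; i < SAMPLES.length; i++) {
      weightedSum += (long) SAMPLES[i] * AVERAGES[i];
    }
    return (double) weightedSum / totalSamples();
  }
}
```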
Rancher Performance Tests
Configuration
| Property | Value |
|---|---|
| Environment | |
| mod-search configuration | 2 nodes with CPU limit = 300m and memoryLimit = 800MB |
| Search queries | |
| Count of resources | 7.6 million (bugfest-iris dataset) |
| Query limit | 100 (default) |
| Performance test duration | 600 s (10 min) |
| V_USERS | 10 |
| RAMP_UP | 30 |
| HOSTNAME | |
| port | 443 |
Performance test results without wildcard queries (except starts-with queries)
| Label | # Samples | Average | Median | 90% Line | 95% Line | 99% Line | Min | Max | Error % | Throughput | Received KB/sec | Sent KB/sec |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| authn/login HTTP Request | 1 | 1253 | 1253 | 1253 | 1253 | 1253 | 1253 | 1253 | 0.00% | 0.79808 | 0.76 | 0.23 |
| GET /search/instances: cql.all = "climate ch*" | 160 | 260 | 250 | 259 | 269 | 377 | 238 | 1523 | 0.00% | 0.26719 | 11.38 | 0.13 |