POC: Investigate performance impact on supporting search by all properties in inventory records
Purpose
Some users would like to be able to conduct a search across all properties of all record types. The scope of this story is to provide a possible solution.
Scope
- create an additional search option that covers search by all properties of all inventory record types
- investigate performance impact on the already implemented search option
- investigate the performance of the new search all query
- determine the impact on the required disk space for a collection of ~5M records
- investigate the increase in required RAM for mod-search
- support multilanguage full-text fields
Approach
- Implement an algorithm that aggregates all field values into a single field called all, keeping multi-language and non-multi-language values separate. Type-id values such as identifierTypeId and alternativeTitleTypeId are excluded. All processed values must also be included in this field.
- Update CqlSearchQueryConverter to accept and process queries that use the cql.all term
- Perform performance tests locally and using a performance-rancher environment
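As a rough illustration of the second step, the sketch below shows one way a cql.all term could be turned into an Elasticsearch query body. It is not the actual CqlSearchQueryConverter logic: the class name AllFieldQuerySketch, the buildQuery helper, and the rule of routing wildcard terms to the keyword field are assumptions made for this example. Only the field names all and plain_all come from the mappings on this page.

```java
import java.util.List;
import java.util.Locale;
import java.util.Map;

public class AllFieldQuerySketch {

  // Hypothetical helper: builds an Elasticsearch query body (as nested maps) for a cql.all term.
  static Map<String, Object> buildQuery(String term) {
    if (term.contains("*")) {
      // Assumption: wildcard terms are matched against the lowercase keyword field.
      return Map.of("wildcard",
        Map.of("plain_all", Map.of("value", term.toLowerCase(Locale.ROOT))));
    }
    // Assumption: other terms are matched against every language sub-field of "all".
    return Map.of("multi_match",
      Map.of("query", term, "fields", List.of("all.*")));
  }

  public static void main(String[] args) {
    System.out.println(buildQuery("climate ch*"));
    System.out.println(buildQuery("Bob Dylan"));
  }
}
```

Lowercasing the wildcard term mirrors the keyword_lowercase normalizer on plain_all, so the pattern is compared against values in the same case they were indexed in.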
Implemented classes
A field value provider that creates six fields (allInstances, allItems, allHoldings, and a plain counterpart of each) for searching by instance, holdings-record, and item field values.
AllSearchFieldValueProvider
    package org.folio.search.service.setter.instance;

    import static org.folio.search.utils.SearchUtils.INSTANCE_RESOURCE;
    // Assumption: PLAIN_MULTILANG_PREFIX lives in SearchUtils; the original listing uses it without an import.
    import static org.folio.search.utils.SearchUtils.PLAIN_MULTILANG_PREFIX;
    import static org.folio.search.utils.SearchUtils.getMultilangValue;

    import java.util.ArrayList;
    import java.util.Collection;
    import java.util.Collections;
    import java.util.LinkedHashMap;
    import java.util.LinkedHashSet;
    import java.util.List;
    import java.util.Map;
    import java.util.Set;
    import java.util.concurrent.ConcurrentHashMap;
    import lombok.Data;
    import lombok.RequiredArgsConstructor;
    import org.folio.search.model.metadata.PlainFieldDescription;
    import org.folio.search.service.metadata.LocalSearchFieldProvider;
    import org.springframework.stereotype.Component;

    @Component
    @RequiredArgsConstructor
    public class AllSearchFieldValueProvider {

      private final LocalSearchFieldProvider searchFieldProvider;
      // Caches whether a field path is multi-language, to avoid repeated metadata lookups.
      private final ConcurrentHashMap<String, Boolean> multilangCache = new ConcurrentHashMap<>();

      public Map<String, Object> getFieldValue(Map<String, Object> eventBody, List<String> languages) {
        // Instance-level values are collected without the nested items and holdings.
        var instanceEventBody = new LinkedHashMap<>(eventBody);
        instanceEventBody.remove("items");
        instanceEventBody.remove("holdings");

        var resultMap = new LinkedHashMap<String, Object>();
        resultMap.putAll(getAllFields("allInstances", instanceEventBody, languages));
        resultMap.putAll(getAllFields("allItems", eventBody.get("items"), languages));
        resultMap.putAll(getAllFields("allHoldings", eventBody.get("holdings"), languages));
        return resultMap;
      }

      @SuppressWarnings("unchecked")
      private Map<String, Object> getAllFields(String field, Object eventBody, List<String> languages) {
        if (eventBody == null) {
          return Collections.emptyMap();
        }
        var context = new Context();
        addStrings(null, context, eventBody);
        var resultMap = getMultilangValue(field, new LinkedHashSet<>(context.getMultilangValues()), languages);
        // Non-multi-language values are appended to the plain field only.
        var plainField = PLAIN_MULTILANG_PREFIX + field;
        var newPlainAll = new LinkedHashSet<>((Set<String>) resultMap.get(plainField));
        newPlainAll.addAll(context.getPlainValues());
        resultMap.put(plainField, newPlainAll);
        return resultMap;
      }

      private void addStrings(String path, Context context, Collection<?> collection) {
        if (collection.isEmpty()) {
          return;
        }
        for (Object value : collection) {
          addStrings(path, context, value);
        }
      }

      // Assumption: this overload is missing from the original listing. It is reconstructed here
      // so that nested objects are traversed and getPath is actually used.
      private void addStrings(String path, Context context, Map<String, Object> map) {
        for (var entry : map.entrySet()) {
          addStrings(getPath(path, entry.getKey()), context, entry.getValue());
        }
      }

      @SuppressWarnings("unchecked")
      private void addStrings(String path, Context context, Object value) {
        if (value instanceof String) {
          context.addValue((String) value, isMultilang(path));
        }
        if (value instanceof Collection<?>) {
          addStrings(path, context, (Collection<?>) value);
        }
        if (value instanceof Map<?, ?>) {
          addStrings(path, context, (Map<String, Object>) value);
        }
      }

      private static String getPath(String initValue, String path) {
        return initValue != null ? initValue + "." + path : path;
      }

      private boolean isMultilang(String path) {
        var isMultilang = multilangCache.get(path);
        if (isMultilang != null) {
          return isMultilang;
        }
        var result = searchFieldProvider.getPlainFieldByPath(INSTANCE_RESOURCE, path)
          .filter(PlainFieldDescription::isMultilang)
          .isPresent();
        multilangCache.put(path, result);
        return result;
      }

      @Data
      private static class Context {
        private Collection<String> plainValues = new ArrayList<>();
        private Collection<String> multilangValues = new ArrayList<>();

        void addValue(String value, boolean isMultilang) {
          if (isMultilang) {
            multilangValues.add(value);
            return;
          }
          plainValues.add(value);
        }
      }
    }
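To make the aggregation step easier to follow without the mod-search dependencies, here is a simplified, standalone sketch of the same idea. It is not the provider itself: the class AllFieldAggregationSketch, the hard-coded MULTILANG_PATHS set (standing in for the metadata lookup), the sample record, and the rule of skipping keys that end in TypeId are all assumptions made for illustration.

```java
import java.util.Collection;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class AllFieldAggregationSketch {

  // Hypothetical stand-in for the metadata lookup: paths treated as multi-language.
  static final Set<String> MULTILANG_PATHS = Set.of("title", "contributors.name");

  final Set<String> multilangValues = new LinkedHashSet<>();
  final Set<String> plainValues = new LinkedHashSet<>();

  // Walks maps and collections recursively; list elements keep the path of their parent key.
  void collect(String path, Object value) {
    if (value instanceof String) {
      boolean multilang = path != null && MULTILANG_PATHS.contains(path);
      (multilang ? multilangValues : plainValues).add((String) value);
    } else if (value instanceof Collection<?>) {
      for (Object element : (Collection<?>) value) {
        collect(path, element);
      }
    } else if (value instanceof Map<?, ?>) {
      for (Map.Entry<?, ?> entry : ((Map<?, ?>) value).entrySet()) {
        String key = String.valueOf(entry.getKey());
        if (key.endsWith("TypeId")) {
          continue; // assumed exclusion rule for type ids such as identifierTypeId
        }
        collect(path == null ? key : path + "." + key, entry.getValue());
      }
    }
  }

  public static void main(String[] args) {
    // Made-up sample record, not taken from any real dataset.
    Map<String, Object> instance = Map.of(
      "title", "Climate change",
      "hrid", "in001",
      "identifiers", List.of(Map.of("identifierTypeId", "type-1", "value", "12345")),
      "contributors", List.of(Map.of("name", "Van Paassen")));
    var sketch = new AllFieldAggregationSketch();
    sketch.collect(null, instance);
    System.out.println("multilang: " + sketch.multilangValues);
    System.out.println("plain: " + sketch.plainValues);
  }
}
```

In this sketch the title and contributor name end up in the multi-language set, the hrid and identifier value end up in the plain set, and the identifierTypeId value is dropped.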
Updated mappings
    "properties": {
      "all": {
        "properties": {
          "eng": { "type": "text", "analyzer": "english" },
          "fre": { "type": "text", "analyzer": "french" },
          "ger": { "type": "text", "analyzer": "german" },
          "ita": { "type": "text", "analyzer": "italian" },
          "spa": { "type": "text", "analyzer": "spanish" },
          "src": { "type": "text", "analyzer": "source_analyzer" }
        }
      },
      "plain_all": { "type": "keyword", "normalizer": "keyword_lowercase" }
    }
Performance test configuration
Local
Property | Value |
---|---|
Environment | localhost |
mod-search configuration | single node without limits |
Search queries | cql.all = "climate ch*" |
Count of resources | 3 million (local dataset) |
Query limit | 100 (default) |
Performance test duration | 600 s (10 min) |
V_USERS | 10 |
RAMP_UP | 30 |
HOSTNAME | localhost |
port | 8081 |
Performance test results
Label | # Samples | Average (ms) | Median (ms) | 90% Line (ms) | 95% Line (ms) | 99% Line (ms) | Min (ms) | Max (ms) | Error % | Throughput (req/s) | Received KB/sec | Sent KB/sec |
---|---|---|---|---|---|---|---|---|---|---|---|---|
GET /search/instances: cql.all = "climate ch*" | 186 | 28 | 29 | 34 | 35 | 38 | 22 | 48 | 0.00% | 0.3106 | 14.1 | 0.08 |
GET /search/instances: cql.all = "depress*" | 185 | 28 | 29 | 33 | 35 | 36 | 21 | 46 | 0.00% | 0.31267 | 12.15 | 0.08 |
GET /search/instances: cql.all = "covid*" | 185 | 4 | 4 | 6 | 6 | 7 | 3 | 9 | 0.00% | 0.31269 | 0.06 | 0.08 |
GET /search/instances: cql.all any "book usage" | 185 | 40 | 40 | 47 | 49 | 52 | 30 | 60 | 0.00% | 0.31267 | 9.81 | 0.08 |
GET /search/instances: cql.all all "*climate change*" | 186 | 7897 | 8008 | 8111 | 8155 | 8257 | 6755 | 8407 | 0.00% | 0.30701 | 11.54 | 0.08 |
GET /search/instances: cql.all any "covid-19" | 185 | 32 | 33 | 38 | 41 | 44 | 24 | 45 | 0.00% | 0.31279 | 11.05 | 0.08 |
GET /search/instances: cql.all == "shelving" | 185 | 26 | 26 | 31 | 32 | 35 | 20 | 41 | 0.00% | 0.31279 | 8.74 | 0.08 |
GET /search/instances: cql.all all "book usage" | 185 | 29 | 29 | 34 | 36 | 39 | 22 | 42 | 0.00% | 0.31279 | 11.11 | 0.08 |
GET /search/instances: cql.all all "book" | 185 | 36 | 37 | 42 | 44 | 48 | 28 | 64 | 0.00% | 0.31363 | 9.73 | 0.08 |
GET /search/instances: cql.all all "Van Paassen" | 185 | 6 | 7 | 8 | 9 | 11 | 5 | 15 | 0.00% | 0.31364 | 0.6 | 0.08 |
GET /search/instances: cql.all = "*depression*" | 186 | 7896 | 8004 | 8107 | 8181 | 8254 | 6855 | 8513 | 0.00% | 0.30845 | 10.01 | 0.08 |
GET /search/instances: cql.all = "*covid-19*" | 185 | 7869 | 7982 | 8073 | 8115 | 8193 | 6935 | 8268 | 0.00% | 0.30896 | 0.06 | 0.08 |
GET /search/instances: cql.all any "*usage*" | 185 | 7892 | 7992 | 8084 | 8127 | 8233 | 7003 | 8297 | 0.00% | 0.30907 | 11.57 | 0.08 |
TOTAL | 2408 | 2449 | 33 | 8032 | 8069 | 8193 | 3 | 8513 | 0.00% | 3.97396 | 108.65 | 1.07 |
Rancher Performance Tests
Configuration
Property | Value |
---|---|
Environment | |
mod-search configuration | 2 nodes with CPU limit = 300m and memoryLimit = 800MB |
Search queries | cql.all = "climate ch*" |
Count of resources | 7.6 million (bugfest-iris dataset) |
Query limit | 100 (default) |
Performance test duration | 600 s (10 min) |
V_USERS | 10 |
RAMP_UP | 30 |
HOSTNAME | falcon-perf-okapi.ci.folio.org |
port | 443 |
Performance test results without leading-wildcard queries (starts-with queries such as climate ch* are still included)
Label | # Samples | Average (ms) | Median (ms) | 90% Line (ms) | 95% Line (ms) | 99% Line (ms) | Min (ms) | Max (ms) | Error % | Throughput (req/s) | Received KB/sec | Sent KB/sec |
---|---|---|---|---|---|---|---|---|---|---|---|---|
authn/login HTTP Request | 1 | 1253 | 1253 | 1253 | 1253 | 1253 | 1253 | 1253 | 0.00% | 0.79808 | 0.76 | 0.23 |
GET /search/instances: cql.all = "climate ch*" | 160 | 260 | 250 | 259 | 269 | 377 | 238 | 1523 | 0.00% | 0.26719 | 11.38 | 0.13 |
GET /search/instances: cql.all = "depress*" | 160 | 251 | 248 | 260 | 266 | 377 | 234 | 428 | 0.00% | 0.26774 | 9.08 | 0.13 |
GET /search/instances: cql.all = "covid*" | 160 | 220 | 217 | 223 | 226 | 237 | 205 | 868 | 0.00% | 0.2678 | 0.07 | 0.13 |
GET /search/instances: cql.all any "book usage" | 160 | 370 | 360 | 374 | 384 | 570 | 345 | 1172 | 0.00% | 0.2677 | 8.29 | 0.13 |
GET /search/instances: cql.all any "covid-19" | 160 | 266 | 260 | 272 | 288 | 382 | 247 | 770 | 0.00% | 0.2681 | 8.17 | 0.13 |
GET /search/instances: cql.all == "shelving" | 159 | 254 | 248 | 255 | 260 | 436 | 235 | 1045 | 0.00% | 0.26849 | 8.13 | 0.13 |
GET /search/instances: cql.all all "book usage" | 159 | 253 | 252 | 260 | 272 | 290 | 239 | 347 | 0.00% | 0.26859 | 8.83 | 0.13 |
GET /search/instances: cql.all all "book" | 159 | 335 | 327 | 338 | 349 | 440 | 314 | 1187 | 0.00% | 0.2686 | 7.91 | 0.13 |
GET /search/instances: cql.all all "Van Paassen" | 159 | 232 | 224 | 230 | 232 | 370 | 212 | 918 | 0.00% | 0.26873 | 1.19 | 0.13 |
GET /search/instances: cql.all == "visualization" | 159 | 254 | 249 | 261 | 270 | 420 | 236 | 459 | 0.00% | 0.26879 | 10.24 | 0.13 |
GET /search/instances: cql.all = "any times in inventory" | 159 | 235 | 225 | 231 | 242 | 533 | 213 | 914 | 0.00% | 0.26888 | 0.68 | 0.13 |
GET /search/instances: cql.all = "Bob Dylan" | 159 | 259 | 248 | 261 | 267 | 463 | 235 | 1237 | 0.00% | 0.26901 | 8.55 | 0.13 |
GET /search/instances: cql.all == "America*" | 159 | 295 | 293 | 303 | 317 | 372 | 279 | 501 | 0.00% | 0.26907 | 9.69 | 0.13 |
GET /search/instances: cql.all = "Water in africa" | 159 | 263 | 255 | 268 | 275 | 496 | 241 | 1037 | 0.00% | 0.26916 | 10.29 | 0.13 |
TOTAL | 2232 | 268 | 251 | 334 | 361 | 464 | 205 | 1523 | 0.00% | 3.70892 | 101.13 | 1.81 |
Performance test results with wildcard queries
Label | # Samples | Average (ms) | Median (ms) | 90% Line (ms) | 95% Line (ms) | 99% Line (ms) | Min (ms) | Max (ms) | Error % | Throughput (req/s) | Received KB/sec | Sent KB/sec |
---|---|---|---|---|---|---|---|---|---|---|---|---|
authn/login HTTP Request | 1 | 1244 | 1244 | 1244 | 1244 | 1244 | 1244 | 1244 | 0.00% | 0.80386 | 0.77 | 0.23 |
GET /search/instances: cql.all = "climate ch*" | 25 | 13228 | 9675 | 30953 | 31072 | 31229 | 259 | 31229 | 28.00% | 0.04018 | 1.23 | 0.02 |
GET /search/instances: cql.all = "depress*" | 24 | 14828 | 13856 | 30777 | 31116 | 31771 | 269 | 31771 | 20.83% | 0.04034 | 1.09 | 0.02 |
GET /search/instances: cql.all = "covid*" | 24 | 16087 | 16822 | 31029 | 31067 | 31299 | 207 | 31299 | 25.00% | 0.04043 | 0.01 | 0.02 |
GET /search/instances: cql.all any "book usage" | 24 | 13627 | 7664 | 31090 | 31551 | 32629 | 365 | 32629 | 25.00% | 0.04249 | 0.99 | 0.02 |
GET /search/instances: cql.all any "covid-19" | 24 | 11798 | 5338 | 30994 | 31138 | 31863 | 280 | 31863 | 20.83% | 0.0404 | 0.98 | 0.02 |
GET /search/instances: cql.all == "shelving" | 24 | 13305 | 10538 | 31067 | 31073 | 31170 | 262 | 31170 | 20.83% | 0.04034 | 0.97 | 0.02 |
GET /search/instances: cql.all all "book usage" | 24 | 14016 | 12317 | 31311 | 31654 | 32430 | 254 | 32430 | 20.83% | 0.0424 | 1.11 | 0.02 |
GET /search/instances: cql.all all "book" | 24 | 13123 | 9347 | 31230 | 31436 | 31554 | 353 | 31554 | 20.83% | 0.04158 | 0.97 | 0.02 |
GET /search/instances: cql.all all "Van Paassen" | 24 | 13208 | 8487 | 31127 | 31135 | 31493 | 221 | 31493 | 25.00% | 0.03955 | 0.14 | 0.02 |
GET /search/instances: cql.all all "*climate change*" | 25 | 30959 | 30902 | 31485 | 31730 | 32232 | 30241 | 32232 | 100.00% | 0.04024 | 0.02 | 0.02 |
GET /search/instances: cql.all = "*depression*" | 25 | 30639 | 30911 | 31753 | 32285 | 32290 | 25161 | 32290 | 92.00% | 0.04019 | 0.13 | 0.02 |
GET /search/instances: cql.all = "*covid-19*" | 24 | 30525 | 30511 | 31155 | 31166 | 31385 | 28203 | 31385 | 91.67% | 0.04034 | 0.02 | 0.02 |
GET /search/instances: cql.all any "*usage*" | 24 | 30448 | 30726 | 31392 | 31408 | 31433 | 26512 | 31433 | 87.50% | 0.04027 | 0.23 | 0.02 |
TOTAL | 316 | 18908 | 25345 | 31224 | 31484 | 32285 | 207 | 32629 | 44.62% | 0.50422 | 7.44 | 0.25 |
Conclusions
- Search by all field values is possible and responds quickly for all queries except wildcard queries of the "contains" or "ends with" type (for example, *depression* or *book).
- Wildcard queries of the "starts with" type work well (for example, climate ch*).
- Leading-wildcard queries also degrade regular queries that run alongside them: in the Rancher run that mixed both kinds, regular queries averaged 12-16 s with error rates of 20.83-28.00%, compared with 220-370 ms and no errors when leading-wildcard queries were left out.
- Multi-language queries behave the same as before.
- All tokenized and pre-processed values can also be included in the search by all.
- All performance tests were run using a single field for all field values from instances, items, and holdings records, but it would not be hard to re-implement the approach to search instances, holdings records, and items separately.
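The distinction the conclusions draw between wildcard types can be sketched as a small classifier. This is a hypothetical helper, not part of mod-search: the names WildcardKindSketch, WildcardKind, classify, and isExpensive are made up here, and the page does not say whether or how such a guard should be applied.

```java
public class WildcardKindSketch {

  enum WildcardKind { NONE, STARTS_WITH, ENDS_WITH, CONTAINS, OTHER }

  // Classifies a search term by where its '*' wildcards sit.
  static WildcardKind classify(String term) {
    if (!term.contains("*")) {
      return WildcardKind.NONE;
    }
    boolean leading = term.startsWith("*");
    boolean trailing = term.endsWith("*");
    if (leading && trailing && term.length() > 1) {
      return WildcardKind.CONTAINS;     // e.g. *depression*
    }
    if (leading) {
      return WildcardKind.ENDS_WITH;    // e.g. *book
    }
    if (trailing) {
      return WildcardKind.STARTS_WITH;  // e.g. climate ch*
    }
    return WildcardKind.OTHER;          // wildcard only in the middle of the term
  }

  // Flags the two kinds that were slow in the tests on this page.
  static boolean isExpensive(String term) {
    WildcardKind kind = classify(term);
    return kind == WildcardKind.ENDS_WITH || kind == WildcardKind.CONTAINS;
  }

  public static void main(String[] args) {
    for (String term : new String[] {"climate ch*", "*depression*", "*book", "book usage"}) {
      System.out.println(term + " -> " + classify(term) + ", expensive=" + isExpensive(term));
    }
  }
}
```

A caller could use isExpensive to reject, rate-limit, or route such terms differently. Which of those is appropriate is a product decision that this investigation leaves open.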