Problem statement
...
- Assumption: For over 20+ millions of records search request should be executed in less than 500ms
- Multi-tenant search (e.g. for consortia) should allow to share metadata view permission from one tenant to another
- Search should be efficient in terms of multi-tenancy, namely the amount of data in each tenant shouldn't affect request execution time for other tenants
- Result counts should be precise for any query and amount of data
- Auto-complete should be based on titles. The response time should be less than 100ms
- Facets (e.g. how many Print-books are found for the search request) should be precise
- Assumption: Rich full-text functionality which will be beneficial for user, should be provided, namely:
- Stemming for words (e.g. find record with term "books" for query "book" )
- Stop-words (e.g. and. or for English language) should not affect relevancy
- Relevancy scoring should be based on TF-IDF frequencies, in order to provide the most relevant records at the top
- Didyoumean for input string spelling correction should be implemented and showed as tip if there is more significantly more relevant query
- All language depended features should support certain predefined list of languages
- There should be support for CQL queries from input string
Purpose of this page
The purpose of this page is to capture good ideas that have been discussed regarding introduction of a search engine, so that they would not be lost. In addition to that, since these are highly relevant to the possible blueprint item they could provide the beginning of that conversation and therefore this page will be linked to that list.
Proposed solution
Overall architecture
...
In terms of overall architecture it is reasonable to have golden source (e.g. central reliable source of data) for each type of data. For now both inventory and SRS are sources of metadata information. Their roles should be clearly separated, as inventory has additional information (items, holdings), but SRS have actual marc files. Possibly, inventory could be the golden source of metadata information and SRS is storage for corresponding marc files.
...
The data store for it should be Elasticsearch, as it satisfies all the mentioned requirements and is purely linear scalable and therefore multi-tenancy friendly (see corresponding section below).
CQRS principle should be implemented: inventory should still be used for any write queries and get all by date range queries and should still be the golden source of this information. Contrariwise all full-text requests from UI should be processed only by this module for these type of data. Other data from various components could be also indexed and searched by this microservice, but every particular case should be considered separately, in terms of data size (e.g. more than 1 mln) and search type (e.g. only for full-text search). No transactional calls should be performed against search microservice as Elasticsearch is not ACID and have eventual consistency.
...
High Level Elastic Search Client should be used for doing the requests: https://www.elastic.co/guide/en/elasticsearch/client/java-rest/master/java-rest-high.html.
Inventory domain events
Inventory should send notifications, when there is any change of domain entities (instance/holding/item). For the documentation of the architectural pattern please see: https://microservices.io/patterns/data/domain-event.html .
The pattern means that every time when an instance/item is created/updated/removed a message is posted to kafka topic:
inventory.instance
- for instances;inventory.item
- for items;inventory.holdings-record
- for holdings records.
The event payload has following structure:
{ "old": {...}, // the instance/item before update or delete "new": {...}, // the instance/item after update or create "type": "UPDATE|DELETE|CREATE|DELETE_ALL", // type of the event "tenant": "diku" // tenant name }
X-Okapi-Url
and X-Okapi-Tenant
headers are set from the request to the kafka message.
Kafka partition key for all the events is instance id (for items it is retrieved from associated holding record).
Domain events for items
The new
and old
records also includes instanceId
property, on the same level with other item properties, which defined in the schema:
{ "instanceId": "<the instance id>", // all other properties that defined in the schema }
Domain events for delete all APIs
There are delete all APIs for items instances and holding records. For such APIs we're issuing a special domain event:
- Partition key:
00000000-0000-0000-0000-000000000000
- Event payload:
{ "type": "DELETE_ALL", "tenant": "<the tenant name>" }
Searching over linked entities fields
...
The same Elasticsearch clsuter should be used for all tenants. But separate index should be created for each tenant. All the requiests must contain standard OKAPI_TENANT header. This header is used to use appropriate tenant index for each query.
If mod-search is not enabled for the tenant, then mod-search should skip the Kafka messages for it (otherwise search would wait forever if tenant was not created forever and failed to process messages for other tenants).
NB: When the consortia cross-tenant search requirements are clear, the proposed solution could be enhanced with out-of-the-box solution for ES aliases handling cross-tenant searching (for consortia) in case when permission of viewing tenant metadata is granted from one tenant to another. In order to search over several tenants their ES aliases should be used for request as string concatenation joined by comma. Elasticsearch in such case will use appropriate routing parameter (and therefore shards) and filters. If there is extremely huge tenant, separate index could be created for it and alias could be switched to it without restart, downtime or any code manipulations. Aliases should be automatically created for all tenants and should have filter for "tenant-id" and "routing" parameter set to tenant-id in it. For every entity _routing field should have value of tenant_id.
Reindex in case of data structure changes
...
There should be authentication to make elasticsearch requests. The same approach, which is used for database access should be used for elasticsearch. The values should be a env variables, which are used in application.yml
Java library (maven dependency)
Java library should be created in order to provide ability to send Kafka messages for indexing in case of data modification. Declarative approach could be leveraged: methods, which are supposed to modify entities could have corresponding java annotations. Common message format (e.g. id, type fields) should be presented in this library.