Reindex Improvements
Arch ticket: https://folio-org.atlassian.net/browse/ARCH-273
Summary
Implementation of the classification browse feature required the introduction of a new search index. Adding a new index negatively impacted the reindexing procedure performance. For tenants with large datasets, the reindexing procedure exceeds the maintenance time window.
The impact is especially significant in ECS environments because data must be aggregated across multiple tenants' inventory storage. In the event model of reindexing, mod-search receives "create/update" domain events that contain only the identifiers of related instances, and the module then fetches the full entity through HTTP requests. This is the root cause of the slow reindexing procedure. The proposed solution addresses the issue by replacing HTTP communication with database schema-to-schema communication.
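The difference between the two approaches can be illustrated with a minimal sketch. It is not the mod-search implementation: the schema names `inventory` and `search`, the `instance` table, and all function names are hypothetical, the HTTP fetch is simulated with a dictionary lookup, and attached in-memory SQLite databases stand in for Postgres schemas.

```python
import sqlite3


def fetch_over_http(instance_id, storage):
    """Stand-in for one HTTP GET to inventory storage (no real network call)."""
    return storage[instance_id]


def reindex_event_model(events, storage):
    """Event model: every domain event carries only an id, so each costs one fetch."""
    docs, calls = [], 0
    for event in events:
        docs.append(fetch_over_http(event["id"], storage))
        calls += 1
    return docs, calls


def build_demo_connection(rows):
    """Create two attached in-memory databases that play the role of two schemas."""
    conn = sqlite3.connect(":memory:")
    conn.execute("ATTACH DATABASE ':memory:' AS inventory")
    conn.execute("ATTACH DATABASE ':memory:' AS search")
    conn.execute("CREATE TABLE inventory.instance (id TEXT PRIMARY KEY, title TEXT)")
    conn.execute("CREATE TABLE search.instance (id TEXT PRIMARY KEY, title TEXT)")
    conn.executemany("INSERT INTO inventory.instance VALUES (?, ?)", rows)
    return conn


def reindex_schema_to_schema(conn):
    """Schema-to-schema: one set-based statement copies all rows, with no per-row round trip."""
    conn.execute(
        "INSERT INTO search.instance (id, title) "
        "SELECT id, title FROM inventory.instance"
    )
    return conn.execute("SELECT id, title FROM search.instance ORDER BY id").fetchall()
```

In the event model the number of round trips grows with the number of instances, while the schema-to-schema variant issues a single statement regardless of dataset size.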
Requirements
Functional requirements
There should be no impact on the current behavior of search capabilities.
The event model for indexing newly created or updated documents should remain as-is.
Non-functional requirements
Performance
ECS Support
Baseline Architecture
The baseline architecture is described here:
Drawbacks of existing solution:
HTTP calls to inventory impact the latency of indexing of a single instance.
Slow-running “upsert scripts” for partial updates in OpenSearch/Elasticsearch.
The need to aggregate instances across multiple tenants in an ECS environment requires multiple updates for every instance.
Data duplication in the consortium_instance table might cause additional overhead in Postgres performance for large datasets.
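The cost of per-tenant updates in an ECS environment can be sketched as follows. This is an illustration under assumptions, not the actual indexing code: the event tuples `(tenant, instance_id, holdings)` and both function names are hypothetical, and a plain dictionary stands in for the search index.

```python
from collections import defaultdict


def updates_per_event(tenant_events):
    """Each tenant's event triggers its own partial update of the shared document."""
    index = defaultdict(dict)
    update_count = 0
    for tenant, instance_id, holdings in tenant_events:
        index[instance_id][tenant] = holdings  # one index update per event
        update_count += 1
    return dict(index), update_count


def merge_then_index(tenant_events):
    """Aggregate all tenants' data first, then write each instance exactly once."""
    merged = defaultdict(dict)
    for tenant, instance_id, holdings in tenant_events:
        merged[instance_id][tenant] = holdings  # in-memory merge, no index write yet
    index = {}
    write_count = 0
    for instance_id, doc in merged.items():
        index[instance_id] = doc  # one index write per instance
        write_count += 1
    return index, write_count
```

Both functions produce the same final documents. The first issues one index update per tenant event, while the second issues one write per instance after merging.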
Solution Options
# | Option | Description | Pros | Cons | Decision |
---|---|---|---|---|---|
0 | Existing architecture | The reindexing procedure is based on the domain event model. | | | |
1 | Database-to-Database query | The reindexing is split into “merge” and “indexing” stages and the |