Reindex Improvements
Status | DONE |
---|---|
Impact | high |
Prod Ticket | |
Arch Ticket | |
Summary
Implementation of the classification browse feature required the introduction of a new search index. Adding the new index degraded the performance of the reindexing procedure: for tenants with large datasets, reindexing now exceeds the maintenance time window. The impact is especially significant in ECS environments because data must be aggregated across multiple tenants' inventory storage. In the current event model of reindexing, mod-search receives "create/update" domain events that carry only the identifiers of the related instances, and the module then fetches the full entity information through HTTP requests. These per-entity HTTP calls are the root cause of the reindexing slowness. The proposed solution addresses the issue by replacing HTTP communication with database schema-to-schema communication.
Requirements
Functional requirements
There should be no impact on the current behavior of search capabilities.
The event model for indexing newly created/updated documents should remain as-is.
Non-functional requirements
Performance
ECS Support
Baseline Architecture
The baseline architecture is described here:
Drawbacks of existing solution:
HTTP calls to inventory impact the latency of indexing of a single instance.
Slow-running “upsert scripts” for partial updates in OpenSearch/Elasticsearch.
The need to aggregate instances across multiple tenants in an ECS environment requires multiple updates for every instance.
Data duplication in the consortium_instance table may cause additional Postgres performance overhead for large datasets.
Solution Options
# | Option | Description | Pros | Cons | Decision |
---|---|---|---|---|---|
0 | Existing architecture | The reindexing procedure is based on the domain event model. | | | |
1 | Database-to-database query | The reindexing is split into “merge” and “indexing” stages, and the “merge” stage is done in a db-to-db query. | | | Rejected due to introduction of dependencies |
2 | Reindexing through Kafka | The reindexing is split into “merge” and “indexing” stages, and the “merge” stage is done by sending all entities through the Kafka topic. | | | |
Target Architecture Option 1
Assumptions
The solution is aimed at ECS and non-ECS environments using a schema-per-tenant database approach. For other cases the existing implementation of the reindexing procedure is available.
Schema-to-schema communication is not a user-facing use case; it is limited to the initial load, which is a background maintenance process.
Key aspects of the solution
The solution should address both single-tenant environments and ECS. Key aspects of the proposed solution are:
1. The whole reindexing process should be split into two steps:
   1. Aggregate all the information related to instances, holdings, and items in the mod-search database.
   2. Bulk upload prepared documents from the mod-search database to OpenSearch/Elasticsearch.
2. Split the consortium_instance table into instances, holdings, and items tables to reduce data duplication.
3. Create a table for tracking chunk processing progress.
4. For initial inventory indexing, or when the inventory data model changes, all entities from the inventory should be copied through INSERT INTO ... SELECT with cross-tenant (cross-schema) SQL queries. To allow parallel execution of queries, it is proposed to split the data transfer into chunks.
5. Bulk upload of documents should happen when there are any search index mapping changes. This should be a separate procedure from #1.
6. The existing event model should support the new data model in the mod-search database.
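As an illustration of aspects 2 and 3 above, the split aggregate tables and the chunk-progress table might take the following shape. This is a sketch only: all table names, column names, and types are assumptions, not the final schema.

```sql
-- Hypothetical sketch of the split aggregate tables in the mod-search database.
CREATE TABLE instances (
    id        uuid PRIMARY KEY,
    tenant_id text  NOT NULL,
    json      jsonb NOT NULL
);

CREATE TABLE holdings (
    id          uuid PRIMARY KEY,
    instance_id uuid  NOT NULL REFERENCES instances (id),
    tenant_id   text  NOT NULL,
    json        jsonb NOT NULL
);

CREATE TABLE items (
    id         uuid PRIMARY KEY,
    holding_id uuid  NOT NULL REFERENCES holdings (id),
    tenant_id  text  NOT NULL,
    json       jsonb NOT NULL
);

-- Tracks progress of the parallel, chunked INSERT INTO ... SELECT transfers.
CREATE TABLE reindex_chunk (
    id          uuid PRIMARY KEY,
    entity_type text NOT NULL,            -- 'instance' | 'holding' | 'item'
    lower_id    uuid NOT NULL,            -- inclusive lower bound of the id range
    upper_id    uuid NOT NULL,            -- exclusive upper bound of the id range
    status      text NOT NULL DEFAULT 'PENDING',
    finished_at timestamptz
);
```

Splitting consortium_instance into per-entity tables avoids storing the same instance payload once per member tenant; the chunk table lets workers claim disjoint id ranges and resume after failures.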
Sequence Diagram
ERD
Solution Details
In a single-tenant environment or ECS mode, the mod-search database should contain the following tables, which act as aggregates for the entities that will be posted to the search index.
A sample of the initial data upload function from the mod-inventory-storage database to the mod-search database is below:
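The original sample is not reproduced in this export. As an illustrative sketch only, a chunked cross-schema copy could be wrapped in a PL/pgSQL function like the following; the schema names (tenant1_mod_inventory_storage, tenant1_mod_search), table names, and column layout are hypothetical placeholders:

```sql
-- Illustrative sketch: copies one id-range chunk of instances from the
-- inventory schema into the mod-search aggregate table with a cross-schema
-- INSERT INTO ... SELECT, and returns the number of rows copied.
CREATE OR REPLACE FUNCTION copy_instance_chunk(
    lower_id uuid,   -- inclusive lower bound of the chunk
    upper_id uuid    -- exclusive upper bound of the chunk
) RETURNS bigint AS $$
DECLARE
    copied bigint;
BEGIN
    INSERT INTO tenant1_mod_search.instances (id, tenant_id, json)
    SELECT i.id, 'tenant1', i.jsonb
    FROM tenant1_mod_inventory_storage.instance i
    WHERE i.id >= lower_id AND i.id < upper_id
    ON CONFLICT (id) DO UPDATE SET json = EXCLUDED.json;

    GET DIAGNOSTICS copied = ROW_COUNT;
    RETURN copied;
END;
$$ LANGUAGE plpgsql;
```

Because each call operates on a disjoint id range, several chunks can be copied in parallel by separate workers without lock contention on the same rows.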
Target Architecture Option 2
Key aspects of the solution
The solution should address both single-tenant environments and ECS. Key aspects of the proposed solution are:
1. The whole reindexing process should be split into two steps:
   1. Aggregate all the information related to instances, holdings, and items in the mod-search database.
   2. Bulk upload prepared documents from the mod-search database to OpenSearch/Elasticsearch.
2. Split the consortium_instance table into instances, holdings, and items tables to reduce data duplication.
3. Create a table for tracking range processing progress.
4. For initial inventory indexing, or when the inventory data model changes, all entities from the inventory should be copied through the Kafka topic.
5. Bulk upload of documents should happen when there are any search index mapping changes. This should be a separate procedure from #1.
6. The existing event model should support the new data model in the mod-search database.
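For aspect 3 above, the range-progress table would need to track both publication to the Kafka topic and consumption by mod-search. A possible shape, with all names and the status values being assumptions rather than the final design, might be:

```sql
-- Hypothetical sketch of a table tracking which id ranges have been
-- published to the Kafka topic and later consumed by mod-search.
CREATE TABLE merge_range (
    id          uuid PRIMARY KEY,
    entity_type text NOT NULL,                    -- 'instance' | 'holding' | 'item'
    tenant_id   text NOT NULL,
    lower_id    uuid NOT NULL,                    -- inclusive lower bound
    upper_id    uuid NOT NULL,                    -- exclusive upper bound
    status      text NOT NULL DEFAULT 'CREATED',  -- CREATED -> PUBLISHED -> CONSUMED
    finished_at timestamptz
);
```

A status column per range allows the merge stage to be restarted safely: only ranges not yet marked CONSUMED need to be re-published after a failure.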
Sequence Diagram
ERD
WBS
Further Improvements
Use a procedure to populate secondary tables after the main table load is complete, within the same transaction.
Use a temp table to track the modified instances.
Compute the list of derived values (e.g., subjects) to be inserted into the subjects table; the list of subjects should be deduplicated before inserting (unique constraint?).
Perform the insert into the subjects table in a batch (one insert statement), not in a loop.
Use unlogged (no-WAL) database tables.
Is there a separate code path for the initial load vs. updates? We should consider this: if so, new tables for each reindex job could be created as unlogged.
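The deduplicated batch insert suggested in the notes above could look like the following sketch; the modified_instances temp table, the subjects table layout, and the JSON path for subjects are all assumptions for illustration:

```sql
-- Hypothetical sketch: insert all subjects of the modified instances in one
-- statement, deduplicating with DISTINCT and the table's unique constraint
-- instead of looping over instances and inserting row by row.
INSERT INTO subjects (instance_id, subject)
SELECT DISTINCT m.instance_id,
       jsonb_array_elements_text(m.json -> 'subjects')
FROM modified_instances m     -- temp table tracking the modified instances
ON CONFLICT DO NOTHING;       -- assumes a unique constraint on (instance_id, subject)
```

A single set-based statement lets Postgres plan the whole insert at once, and ON CONFLICT DO NOTHING makes reruns idempotent when the same chunk is processed twice.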