Reindex Improvements

Summary

Implementation of the classification browse feature required the introduction of a new search index. Adding this index degraded the performance of the reindexing procedure; for tenants with large datasets, reindexing now exceeds the maintenance window.

The impact is especially significant in ECS environments because data must be aggregated across multiple tenants' inventory storage. In the reindexing event model, mod-search receives "create/update" domain events that carry only the identifiers of the related instances, so the module must fetch the full entity through HTTP requests; this is the root cause of the slowness of the reindexing procedure. The proposed solution addresses the issue by replacing HTTP communication with database schema-to-schema communication.

Requirements

Functional requirements

  1. There should be no impact on the current behavior of search capabilities.

  2. The event model for indexing newly created/updated documents should remain as-is.

Non-functional requirements

  1. Performance

  2. ECS Support

Baseline Architecture

The baseline architecture is described here:

  1. Search indexing procedure architecture

  2. ECS indexing procedure

Drawbacks of the existing solution:

  1. HTTP calls to inventory increase the latency of indexing a single instance.

  2. Slow-running “upsert scripts” for partial updates in OpenSearch/Elasticsearch.

  3. The need to aggregate instances across multiple tenants in an ECS environment requires multiple updates for every instance.

  4. Data duplication in the consortium_instance table adds PostgreSQL performance overhead for large datasets.

Solution Options

Option 0: Existing architecture

Description: The reindexing procedure is based on the domain event model.

Pros: (none)

Cons:

  • Reaches the limits of the maintenance window

  • A change of the search mapping requires full reindexing

  • Domain events during reindexing contain only the instance id and require an additional HTTP request per instance to fetch the whole entity

Decision: (none)

Option 1: Database-to-database query

Description: The reindexing is split into “merge” and “indexing” stages, and the “merge” stage is performed as a db-to-db query.

Pros:

  • Stages can be started independently

  • Supports both ECS and standalone mode

  • Improved performance

Cons:

  • The “merge” stage requires data duplication in mod-search’s database

  • Creates a dependency on the internal database structure between mod-search and mod-inventory-storage

  • Requires cross-tenant and cross-schema queries

  • Cannot be applied when the modules’ databases are located on different database servers

Decision: Rejected due to the dependencies it introduces

Option 2: Reindexing through Kafka

Description: The reindexing is split into “merge” and “indexing” stages, and the “merge” stage is performed by sending all entities through a Kafka topic.

Pros:

  • Stages can be started independently

  • Supports both ECS and standalone mode

  • Improved performance

  • Module and tenant boundaries are not affected

Cons:

  • The “merge” stage requires data duplication in mod-search’s database

Decision: (none)

Target Architecture Option 1

Assumptions

  1. The solution targets ECS and non-ECS environments that use a schema-per-tenant database approach. For other cases, the existing implementation of the reindexing procedure remains available.

  2. Schema-to-schema communication is not a user-facing use case; it is limited to the initial load, which is a background maintenance process.

Key aspects of the solution

The solution should address both single-tenant environments and ECS. Key aspects of the proposed solution are:

  1. The whole reindexing process should be split into two steps:

    1. Aggregate all the information related to instances, holdings, and items in the mod-search database

    2. Bulk upload prepared documents from the mod-search database to OpenSearch/Elasticsearch

  2. Split the consortium_instance table into instance, holdings, and item tables to reduce data duplication

  3. Create a table for tracking chunk processing progress

  4. For initial inventory indexing, or when the inventory data model changes, all entities from the inventory should be copied through INSERT INTO ... SELECT with cross-tenant (cross-schema) SQL queries. To allow parallel execution of queries, the data transfer is split into chunks (a sample copy function is sketched in Solution Details below).

  5. Bulk upload of documents should happen whenever there are search index mapping changes. This should be a separate procedure from step 1 (see the read-query sketch after this list).

  6. The existing event model should support the new data model in the mod-search database
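
For the second step, a possible shape of the per-chunk read is shown below; it folds holdings and items into each instance row so the bulk upload needs no further HTTP calls to inventory. Table and column names follow the illustrative schema in Solution Details and are assumptions, not a final design.

  -- Illustrative stage-two read: pull one id range of aggregated documents.
  -- $1/$2 are the bounds of one chunk from the progress-tracking table.
  SELECT i.id,
         i.json AS instance,
         COALESCE(jsonb_agg(DISTINCT h.json)  FILTER (WHERE h.id  IS NOT NULL), '[]'::jsonb) AS holdings,
         COALESCE(jsonb_agg(DISTINCT it.json) FILTER (WHERE it.id IS NOT NULL), '[]'::jsonb) AS items
  FROM instance i
  LEFT JOIN holding h ON h.instance_id = i.id
  LEFT JOIN item it   ON it.instance_id = i.id
  WHERE i.id >= $1 AND i.id < $2
  GROUP BY i.id, i.json;

Each returned row can then be serialized into a bulk request to OpenSearch/Elasticsearch.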

Sequence Diagram

ERD

Solution Details

The mod-search database, in both single-tenant and ECS mode, should contain the following tables, which act as aggregates for the entities that will be posted to the search index.
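
A possible shape for these tables is sketched below; table names, columns, and keys are illustrative assumptions rather than a final design:

  -- Illustrative DDL for the mod-search aggregate ("merge") tables.
  CREATE TABLE IF NOT EXISTS instance (
    id        uuid    NOT NULL,
    tenant_id text    NOT NULL,            -- owning tenant in ECS mode
    shared    boolean NOT NULL DEFAULT false,
    json      jsonb   NOT NULL,            -- full instance payload
    PRIMARY KEY (id)
  );

  CREATE TABLE IF NOT EXISTS holding (
    id          uuid  NOT NULL,
    tenant_id   text  NOT NULL,
    instance_id uuid  NOT NULL,
    json        jsonb NOT NULL,
    PRIMARY KEY (tenant_id, id)
  );

  CREATE TABLE IF NOT EXISTS item (
    id          uuid  NOT NULL,
    tenant_id   text  NOT NULL,
    holding_id  uuid  NOT NULL,
    instance_id uuid  NOT NULL,
    json        jsonb NOT NULL,
    PRIMARY KEY (tenant_id, id)
  );

  -- Progress tracking for parallel chunked transfer (key aspect #3).
  CREATE TABLE IF NOT EXISTS merge_range (
    id          uuid NOT NULL,
    entity_type text NOT NULL,             -- instance | holding | item
    tenant_id   text NOT NULL,
    range_lower uuid NOT NULL,             -- inclusive lower bound of id range
    range_upper uuid NOT NULL,             -- exclusive upper bound
    status      text NOT NULL,             -- pending | in_progress | done | failed
    finished_at timestamptz,
    PRIMARY KEY (id)
  );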

A sample of the initial data upload function, copying from the mod-inventory-storage database into the mod-search database, is below:
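
The following is a minimal sketch of such a function, assuming the FOLIO convention that each tenant's inventory schema is named <tenant>_mod_inventory_storage and lives on the same database server; the function name and column names are illustrative.

  -- Sketch of a chunked cross-schema copy for one tenant and one id range.
  -- Assumes the RMB convention of an "instance" table with a "jsonb" column in
  -- each tenant's <tenant>_mod_inventory_storage schema; adjust names as needed.
  CREATE OR REPLACE FUNCTION copy_instance_range(tenant text, lower_id uuid, upper_id uuid)
  RETURNS void
  LANGUAGE plpgsql
  AS $$
  BEGIN
    EXECUTE format(
      'INSERT INTO instance (id, tenant_id, shared, json)
       SELECT i.id, %L, false, i.jsonb
       FROM %I.instance i
       WHERE i.id >= %L AND i.id < %L
       ON CONFLICT (id) DO UPDATE SET json = EXCLUDED.json',
      tenant, tenant || '_mod_inventory_storage', lower_id, upper_id);
  END;
  $$;

A coordinator can then split the uuid space into N ranges, record them in the progress-tracking table, and run copy_instance_range for several ranges in parallel; resolution of the shared flag for ECS is omitted for brevity.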

Target Architecture Option 2

Key aspects of the solution

The solution should address both single-tenant environments and ECS. Key aspects of the proposed solution are:

  1. The whole reindexing process should be split into two steps:

    1. Aggregate all the information related to instances, holdings, and items in the mod-search database

    2. Bulk upload prepared documents from the mod-search database to OpenSearch/Elasticsearch

  2. Split the consortium_instance table into instance, holdings, and item tables to reduce data duplication

  3. Create a table for tracking range processing progress

  4. For initial inventory indexing, or when the inventory data model changes, all entities from the inventory should be copied through the Kafka topic (see the upsert sketch after this list).

  5. Bulk upload of documents should happen whenever there are search index mapping changes. This should be a separate procedure from step 1.

  6. The existing event model should support the new data model in the mod-search database
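
With Kafka, the “merge” stage reduces to an idempotent upsert that the mod-search consumer can execute for every entity in a polled batch. A minimal sketch against the illustrative instance table from Option 1 (parameter binding is up to the consumer implementation):

  -- Idempotent per-record upsert for the Kafka batch listener; redelivered
  -- messages simply overwrite the row, so at-least-once delivery is safe.
  INSERT INTO instance (id, tenant_id, shared, json)
  VALUES ($1, $2, $3, $4)
  ON CONFLICT (id) DO UPDATE
    SET tenant_id = EXCLUDED.tenant_id,
        shared    = EXCLUDED.shared,
        json      = EXCLUDED.json;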

Sequence Diagram

ERD

WBS

Further Improvements

  1. Use a procedure to populate secondary tables after the main table is complete, within the same transaction.

    • Use a temp table to track the modified instances.

    • Compute the list of values (e.g., subjects) to be inserted into the subjects table; the list should be deduplicated before inserting (unique constraint?).

  2. Perform the insert into the subjects table as a batch (one insert statement), not in a loop (see the sketch after this list).

    • Use UNLOGGED database tables to avoid WAL overhead.

    • Is there a separate code path for the initial load vs. updates? This should be considered.

    • If so, new tables for each reindex job could be created as UNLOGGED.
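
A sketch of the batched, deduplicated subjects insert from improvement #2, assuming a temp table modified_instances populated as in improvement #1 and subjects stored as an array of strings in the instance JSON (all names are illustrative):

  -- One set-based statement instead of a per-row loop; DISTINCT dedupes the
  -- batch and ON CONFLICT relies on a unique constraint for cross-batch dedupe.
  INSERT INTO subjects (subject_value, instance_id)
  SELECT DISTINCT s.subject, m.instance_id
  FROM modified_instances m
  JOIN instance i ON i.id = m.instance_id
  CROSS JOIN LATERAL jsonb_array_elements_text(i.json -> 'subjects') AS s(subject)
  ON CONFLICT DO NOTHING;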