Indexing of child resources is stuck in the queue during Data Import
| Submitted | Feb 24, 2026 |
|---|---|
| Approved | |
| Status | proposed |
| Impact | medium |
| Arch Ticket | |
| Prod Ticket | |
Problem Statement
Indexing of child resources stalls in the queue during large data import jobs, such as bulk editing 90,000 MARC bibliographic records. This issue also occurs when the data import job runs on another environment (tenant) within the same cluster.
For instance, child resources remain unindexed (unavailable via Search & Browse) on both non-ECS and ECS environments when a data import job runs on the non-ECS environment. This issue was identified during CSP5 testing on Sunflower TLS environments.
Impact
During a 90,000-record bulk import, real-time updates and updates from other tenants can be delayed by [X minutes/hours], making newly created or updated records invisible in Search & Browse for that duration.
Root Cause Analysis
The Sunflower BF environment uses Kafka topic consolidation, meaning that events for all tenants are published to the following shared topics:
| Topic Name | # of partitions |
|---|---|
| sebf.ALL.inventory.async-migration | 50 |
| sebf.ALL.inventory.bound-with | 50 |
| sebf.ALL.inventory.call-number-type | 1 |
| sebf.ALL.inventory.campus | 2 |
| sebf.ALL.inventory.classification-type | 2 |
| sebf.ALL.inventory.holdings-record | 50 |
| sebf.ALL.inventory.instance | 50 |
| sebf.ALL.inventory.instance-contribution | 50 |
| sebf.ALL.inventory.instance-date-type | 2 |
| sebf.ALL.inventory.institution | 2 |
| sebf.ALL.inventory.item | 50 |
| sebf.ALL.inventory.library | 2 |
| sebf.ALL.inventory.location | 2 |
| sebf.ALL.inventory.reindex-records | 16 |
| sebf.ALL.inventory.service-point | 50 |
| sebf.ALL.inventory.subject-source | 2 |
| sebf.ALL.inventory.subject-type | 2 |
Real-time events are consumed from the following topics for Search & Browse:
- sebf.ALL.inventory.holdings-record
- sebf.ALL.inventory.instance
- sebf.ALL.inventory.item
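A minimal sketch of how these real-time topics could be consumed (the group id, bootstrap servers, and deserializers are illustrative assumptions; mod-search's actual consumer configuration may differ):

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class SearchEventConsumerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");      // illustrative
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "mod-search-events-group");      // illustrative group id
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // All tenants share these consolidated topics, so bulk traffic produced by
            // one tenant sits ahead of every tenant's records in the same partitions.
            consumer.subscribe(List.of(
                    "sebf.ALL.inventory.instance",
                    "sebf.ALL.inventory.holdings-record",
                    "sebf.ALL.inventory.item"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                records.forEach(r -> { /* index the entity into the search backend */ });
            }
        }
    }
}
```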
Data Import jobs use the inventory-storage endpoints (Bulk API) to process new/updated records. Internally, this produces an Apache Kafka event for every created/updated entity, with the entity id (instanceId or holdingId) as the message key. Kafka uses the message key to determine which partition an event is sent to (the partition can also be set in other ways, e.g. explicitly by the producer).
KafkaProducer.send(topic={{topic}}, key=entityId, value=json)
│
▼
┌-------------------┐
│ Serialize Key │
│ (UUID → bytes) │
└-------------------┘
│
▼
┌------------------------┐
│ Partition Select │
│ │
│ murmur2(keyBytes) % 50 │
└------------------------┘
│
▼
┌-------------------┐
│ Buffer & Batch │
└-------------------┘
│
▼
┌------------------------------┐
│ KAFKA CLUSTER │
│ topic: {{topicName}} │
│ │
│ P0 P1 ... P12 ... P49 │
│ ▲ │
│ │ │
│ message lands here │
└------------------------------┘
✔ Same entityId → always same partition → ordered per entity.

UUIDv4 keys are uniformly distributed, so under significant load from mod-data-import, mod-inventory-storage fills all partitions of the affected topic, creating a roughly uniform lag across partitions.
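For illustration, a minimal sketch of the partition selection described above, using Kafka's own murmur2 utility and assuming the entityId key is serialized as a UTF-8 string (the 50-partition count matches the large topics in the table):

```java
import java.nio.charset.StandardCharsets;
import java.util.UUID;
import org.apache.kafka.common.utils.Utils;

public class PartitionSelectionSketch {
    public static void main(String[] args) {
        int numPartitions = 50; // e.g. sebf.ALL.inventory.instance

        // Same entityId -> same murmur2 hash -> always the same partition,
        // which is what preserves per-entity ordering.
        String entityId = UUID.randomUUID().toString();
        byte[] keyBytes = entityId.getBytes(StandardCharsets.UTF_8);
        int partition = Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions;
        System.out.println(entityId + " -> partition " + partition);

        // UUIDv4 keys are uniformly distributed, so a 90,000-record import spreads
        // events roughly evenly (~1,800 per partition) across all 50 partitions.
        int[] counts = new int[numPartitions];
        for (int i = 0; i < 90_000; i++) {
            byte[] k = UUID.randomUUID().toString().getBytes(StandardCharsets.UTF_8);
            counts[Utils.toPositive(Utils.murmur2(k)) % numPartitions]++;
        }
        int min = Integer.MAX_VALUE, max = 0;
        for (int c : counts) { min = Math.min(min, c); max = Math.max(max, c); }
        System.out.println("per-partition count: min=" + min + ", max=" + max + " (expected ~1800)");
    }
}
```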
A new message from another tenant lands at the end of the queue, which means it is processed only after all previous messages in its partition; how long that takes depends on the processing capacity of the consumer (mod-search).
Possible Resolution Options
Increasing Number of Partitions
This option is not viable: adding partitions would lower the lag on each partition for newly produced messages, but consumers would still have to work through the messages already queued ahead of them.
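For reference only, increasing the partition count of an existing topic is a one-off administrative operation; a minimal sketch using Kafka's AdminClient (the broker address and target count are placeholders):

```java
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewPartitions;

public class IncreasePartitionsSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder

        try (AdminClient admin = AdminClient.create(props)) {
            // Raise sebf.ALL.inventory.instance from 50 to 100 partitions.
            // Note: the existing backlog stays where it is; messages already queued in
            // the original 50 partitions still have to be consumed before newer ones.
            admin.createPartitions(
                    Map.of("sebf.ALL.inventory.instance", NewPartitions.increaseTo(100)))
                 .all().get();
        }
    }
}
```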
Fast-Lane partitions
This option requires complex logic to identify lag across partitions, enabling the publisher to select real-time update partitions with minimal lag. Analogy: a fast lane for real-time updates.
By default, Kafka partitions are assigned across consumers in a consumer group using partition.assignment.strategy, which defaults to RangeAssignor. This means fast-lane partitions would have to be spread evenly across consumers, which producer-side routing logic can only influence indirectly.
This approach reserves specific partitions for real-time updates (low-lag "fast lanes") while routing bulk messages to the remaining partitions.
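A minimal sketch of what producer-side lane routing could look like, assuming 8 reserved fast-lane partitions out of 50 and explicit partition selection on the ProducerRecord; lag-aware partition selection is omitted, and the sketch deliberately exhibits the ordering issue listed under Problems below:

```java
import java.nio.charset.StandardCharsets;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.utils.Utils;

/**
 * Illustrative only: routes real-time events to a reserved "fast lane" partition
 * range and bulk events to the rest by setting the partition explicitly.
 */
public class FastLaneRoutingSketch {
    private static final int TOTAL_PARTITIONS = 50;    // matches the large topics above
    private static final int FAST_LANE_PARTITIONS = 8; // assumption: 8 reserved lanes

    private final KafkaProducer<String, String> producer;

    public FastLaneRoutingSketch(KafkaProducer<String, String> producer) {
        this.producer = producer;
    }

    public void send(String topic, String entityId, String json, boolean realTime) {
        int hash = Utils.toPositive(Utils.murmur2(entityId.getBytes(StandardCharsets.UTF_8)));
        int partition = realTime
                ? hash % FAST_LANE_PARTITIONS                                              // P0..P7
                : FAST_LANE_PARTITIONS + hash % (TOTAL_PARTITIONS - FAST_LANE_PARTITIONS); // P8..P49

        // Note: the same entityId now lands in different partitions depending on the
        // realTime flag, which is exactly the ordering problem described below.
        producer.send(new ProducerRecord<>(topic, partition, entityId, json));
    }
}
```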
Problems
- Key-based routing breaks: the same entityId may need to be routed to either lane depending on context (real-time vs bulk). This means the same entity ends up in multiple partitions, breaking ordering guarantees.
- Consumer assignment is unaware of lanes: Kafka's partition assignment strategies (RangeAssignor, StickyAssignor) distribute partitions by count, not by purpose. A rebalance can assign both fast-lane and bulk partitions to the same consumer, negating the benefit.
- Requires complex lag-monitoring logic on the producer side to dynamically select low-lag partitions.
- Rigid: changing the fast/bulk partition ratio requires operational changes.
Tenant specific topics
The main benefit is that under Bulk Edit / Data Import load only the partitions of the specific tenant fill up; other tenants are unaffected on the Kafka side, although they still share CPU on the consumer side. As-is, this option is not viable because it increases cluster costs, but it might work with reduced partition counts per tenant.

Currently, 50 partitions suffice for the multi-tenant deployment; a single tenant might require only 5 to 8 partitions.
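As a rough illustration only: tenant-specific topics would replace the shared ALL segment with a tenant id and could carry a smaller per-tenant partition count (the naming pattern and counts below are assumptions, not the current convention):

```java
import java.util.Map;

/** Illustrative topic-name resolution for tenant-specific topics. */
public class TenantTopicSketch {
    // Assumption: smaller tenants get 5-8 partitions instead of the shared 50.
    private static final Map<String, Integer> PARTITIONS_PER_TENANT =
            Map.of("tenant1", 8, "tenant2", 5);

    /** e.g. instanceTopic("sebf", "tenant1") -> "sebf.tenant1.inventory.instance" */
    static String instanceTopic(String envPrefix, String tenantId) {
        return envPrefix + "." + tenantId + ".inventory.instance";
    }

    public static void main(String[] args) {
        System.out.println(instanceTopic("sebf", "tenant1"));
        System.out.println("partitions: " + PARTITIONS_PER_TENANT.get("tenant1"));
    }
}
```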