SPIKE: Data Import: Improve Resiliency When Database Crashes

Description

Leave in Poppy for now

Purpose/Overview:

Currently, when the database crashes because it runs out of memory, Data Import does not handle the situation well: it logs an error and the job finishes with the status "Completed with error". Just as any DI module can crash and be restarted, the database can crash and fail over to the reader instance, or restart on its own. In these situations, DI needs to know which messages were not written to the database and resume processing from those messages once the database comes back up.

Here are some logs when this happened in PTF environment:

mod-inventory/inventory-storage

03:25:07 [] [] [] [] WARN Instances Exception occurred

03:25:07 [] [] [] [] ERROR KafkaConsumerWrapper Error while processing a record - id: 2 subscriptionPattern: SubscriptionDefinition(eventType=DI_SRS_MARC_BIB_INSTANCE_HRID_SET, subscriptionPattern=imtc\.Default\.\w{1,}\.DI_SRS_MARC_BIB_INSTANCE_HRID_SET)

03:25:07 [] [] [] [] ERROR eHridSetKafkaHandler Failed to process data import event payload

mod-source-record-storage / mod-source-record-manager (the messages are the same)

03:25:07.243 [vert.x-worker-thread-14] WARN ? [618228eqId] Backend notice: severity='WARNING', code='57P02', message='terminating connection because of crash of another server process', detail='The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.', hint='In a moment you should be able to reconnect to the database and repeat your command.', position='null', internalPosition='null', internalQuery='null', where='SQL statement "delete from fs00001020_mod_source_record_storage.marc_indexers where marc_id = NEW.id"

03:25:30.198 [vert.x-eventloop-thread-1] WARN ? [641183eqId] ExtendedQueryCommandCodec should handle message ParameterStatus

This behavior was found when running DI with the following modules:

  • mod-data-import v2.1.2

  • mod-inventory v17.0.4

  • mod-inventory-storage v21.0.3

  • mod-source-record-storage v5.1.5

  • mod-source-record-manager v3.1.3


Environment

None

Potential Workaround

None

Checklist


TestRail: Results

Activity


Ann-Marie Breaux July 25, 2022 at 4:02 PM

Thanks, Moved to Orchid

Olamide Kolawole July 25, 2022 at 3:49 PM

Yes.

Kateryna Senchenko July 25, 2022 at 12:30 PM

Hi, is it OK to move this ticket to Orchid to reduce the Nolana scope?

Olamide Kolawole May 20, 2022 at 3:10 PM

I believe that this is a good NFR to shoot for. Most actions of a data import process are managed by Kafka, not the DB, so a DB crash and subsequent restart should allow the data import process to continue. The outcome of this spike should tell us what is missing to meet the resiliency goal. When a message is retrieved from Kafka, the FOLIO module has to "commit" the message to say it is done with it. My hunch is that our DI modules are "committing" the message even if an error, like a DB crash, occurs. The pieces are in place to make this work; let's find and fix the gaps. I estimate that this effort would be a medium-sized task.

Definitely move to Nolana or further.
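The commit-after-success idea in the comment above can be illustrated with a small self-contained sketch. This is plain Java with an in-memory stand-in for Kafka and the database, not actual FOLIO or Kafka client code; all names here (FlakyDb, AtLeastOnceSketch, the rec-N messages) are hypothetical. The key point it demonstrates: if the consumer advances its offset only after the database write succeeds, a DB crash causes a retry of the same message after recovery instead of a silently dropped record.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of at-least-once processing: the "offset" is only
// committed after the DB write succeeds, so a crash mid-write means the
// same message is re-processed once the database comes back up.
public class AtLeastOnceSketch {

    // Simulated database that can be taken down to mimic a crash.
    static class FlakyDb {
        boolean up = true;
        final List<String> rows = new ArrayList<>();

        void write(String row) {
            if (!up) {
                throw new IllegalStateException("db connection lost");
            }
            rows.add(row);
        }
    }

    public static void main(String[] args) {
        List<String> topic = List.of("rec-1", "rec-2", "rec-3");
        FlakyDb db = new FlakyDb();
        int committedOffset = 0;   // next message to process
        boolean crashInjected = false;

        while (committedOffset < topic.size()) {
            // Inject a one-time DB crash while processing the second message.
            if (committedOffset == 1 && !crashInjected) {
                db.up = false;
                crashInjected = true;
            }
            String msg = topic.get(committedOffset);
            try {
                db.write(msg);
                committedOffset++;   // commit ONLY after a successful write
            } catch (IllegalStateException e) {
                // Do NOT advance the offset; simulate the DB restarting,
                // after which the same message is retried.
                db.up = true;
            }
        }

        System.out.println(db.rows);   // prints [rec-1, rec-2, rec-3]
    }
}
```

If the catch block instead advanced the offset (the suspected current behavior), rec-2 would never reach the database and the job would finish with records missing, which matches the "Completed with error" outcome described above.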

Ann-Marie Breaux May 16, 2022 at 9:18 PM

Sounds good - thanks!

Details

Assignee

Reporter

Priority

Story Points

Development Team

Folijet

Release

Trillium (R2 2025)

TestRail: Cases


TestRail: Runs


Created October 8, 2021 at 3:23 PM
Updated March 4, 2025 at 8:42 PM