MODDATAIMP-789 SPIKE: Design solution to clean up temp data

Ticket: MODDATAIMP-789
Status: In Progress

Problem Statement

The Data Import (DI) process generates temporary data that is persisted in the database but not cleaned up after the process is completed. Over time, this unused data accumulates and causes unnecessary storage usage in multiple FOLIO modules. A systematic solution is needed to clean up this obsolete data and prevent further clutter.


Temporary Data and Sources of Clutter

The following items are identified as sources of temporary data that remain after Data Import jobs are finished:

  1. Kafka Event Deduplication Data

    • Present across several modules:

      • mod-source-record-manager

      • mod-inventory

      • mod-orders

      • mod-invoice

    • No existing mechanism to periodically clean up event deduplication data.

  2. ProfileSnapshot Data

    • Stored in profile_snapshots table in mod-di-converter-storage.

    • Once a Data Import job is completed, ProfileSnapshots are no longer required, but they are not currently cleaned up.

  3. JobExecution Data and Associated Journal Records

    • Stored in mod-source-record-manager.

    • This issue is already being addressed in MODSOURMAN-1329.

  4. Old Versions of SRS Records

Specific Clutter in Modules and Affected Tables

  • mod-inventory

    • records_instances

    • records_holdings

    • records_items

    • records_authorities

    • events_shared_instances: createddate column is present. Introduce an expiration period for these records and delete them periodically.

  • mod-source-record-manager

    • mapping_rules_snapshots and mapping_params_snapshot: Extend the JobExecutionDao.hardDeleteJobExecutions() method to clean up these tables, or use the same periodic cleanup approach.

    • events_processed: Contains deduplication-related data.

  • mod-source-record-storage

    • edifact_records_lb: These records can be deleted as incoming EDIFACT records are no longer saved in the module.

  • mod-orders

    • processed_records: created_date column is present. Add an expiration period and clean up the data periodically.

    • sequential_order_id: saved_timestamp column is present.

    • po_lines_processing_progress: creation_date column is present.

  • mod-invoice

    • records_invoices: created_date column is present. Add an expiration period and periodically clean up these records.
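
Most of the table-level cleanups above reduce to the same time-bounded delete. As a sketch (the 30-day retention is the example value from this spike, not a decided setting; processed_records/created_date are taken from the mod-orders list above), the statement a module would issue could be built like this:

```java
import java.time.Duration;
import java.time.Instant;
import java.time.temporal.ChronoUnit;

public class CleanupQueryBuilder {

    // Builds a time-bounded DELETE for a table with a creation-timestamp column.
    // A real module would run this through its DB layer as a prepared statement;
    // here we only construct the statement text to show the shape of the cleanup.
    static String buildDeleteStatement(String table, String timestampColumn, Instant cutoff) {
        return "DELETE FROM " + table
                + " WHERE " + timestampColumn + " < '" + cutoff + "'";
    }

    public static void main(String[] args) {
        // Assumed example retention period of 30 days.
        Duration retention = Duration.ofDays(30);
        Instant cutoff = Instant.now().minus(retention).truncatedTo(ChronoUnit.SECONDS);

        System.out.println(buildDeleteStatement("processed_records", "created_date", cutoff));
    }
}
```

The same statement shape applies to every table above that carries a created_date/timestamp column; only the table and column names change.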


Proposed Solutions

1. Introduce Expiration Periods for Temporary Data and Delete Existing Records

  • Description: Add an expiration period for temporary data where relevant. Define a time-based retention period (e.g., 30 days) for all "clutter" records so they can be cleaned up once they are no longer needed. Delete all existing clutter records once, after the module is upgraded to the new version.

  • Implementation Steps:

    1. Introduce a created_date/timestamp column (if not already present) for all relevant tables across modules.

    2. Add retention configuration properties in each module to define the expiration period.

    3. Implement background jobs in each module to periodically check and clean up expired records.

    4. Create scripts for all affected modules to clean up existing clutter records right before or after the module upgrade. These scripts should be executed manually, once.

  • Considerations: consider introducing a new utility service in the shared library to be used by all affected modules and reduce code duplication.
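
The implementation steps above can be sketched end to end as a retention-based purge driven by a scheduled job. Everything below is illustrative: the in-memory map stands in for a database table, and the hourly schedule and 30-day retention are assumed example values, not decided configuration:

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class PeriodicCleanupSketch {

    // Removes every entry whose creation timestamp is older than the retention
    // period and returns the number of removed entries. In a real module this
    // would be a DELETE against the table, not an in-memory map.
    static int purgeExpired(Map<String, Instant> store, Duration retention, Instant now) {
        Instant cutoff = now.minus(retention);
        int before = store.size();
        store.values().removeIf(created -> created.isBefore(cutoff));
        return before - store.size();
    }

    public static void main(String[] args) {
        Map<String, Instant> store = new ConcurrentHashMap<>();
        store.put("fresh", Instant.now());
        store.put("stale", Instant.now().minus(Duration.ofDays(40)));

        // Assumed example retention of 30 days (step 2: a configuration property).
        Duration retention = Duration.ofDays(30);

        // Step 3: a periodic background job checks for expired records.
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(
                () -> purgeExpired(store, retention, Instant.now()),
                1, 1, TimeUnit.HOURS);

        System.out.println(purgeExpired(store, retention, Instant.now())); // prints 1
        scheduler.shutdown(); // a real module would keep the scheduler running
    }
}
```

Only the purge function would differ per module; the scheduling wrapper is a natural candidate for the shared-library utility mentioned in the considerations.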


2. Extend Existing Hard Deletion Processes for JobExecutions

  • Description: Reuse current hard deletion logic in mod-source-record-manager.

  • Implementation Steps:

    1. Extend the JobExecutionDao.hardDeleteJobExecutions() mechanism to clean up the mapping_rules_snapshots and mapping_params_snapshot tables.

  • Considerations: the same periodic cleanup approach can be used instead.


3. Immediate Cleanup After Job Completion

  • Description: In cases where temporary data is created only for individual jobs (e.g., ProfileSnapshot in mod-di-converter-storage), perform immediate cleanup as part of the job completion process.

  • Implementation Steps:

    1. Add a new DELETE endpoint for ProfileSnapshot that will be called from mod-source-record-manager.

    2. Modify job completion logic to delete job-specific temporary data (e.g., delete ProfileSnapshot data associated with the job as soon as the DI job is completed).

  • Considerations: the same approach as for the deduplication data can be used instead.
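
A minimal sketch of step 1 from the caller's side, assuming the new endpoint lives under a jobProfileSnapshots path in mod-di-converter-storage (the actual path, headers, and error handling would be defined when the API is added):

```java
import java.net.URI;
import java.net.http.HttpRequest;

public class ProfileSnapshotCleanupSketch {

    // Builds the DELETE request that mod-source-record-manager could send to
    // mod-di-converter-storage once a DI job completes. The endpoint path is a
    // hypothetical placeholder for the new API.
    static HttpRequest buildDeleteRequest(String okapiUrl, String jobProfileSnapshotId) {
        return HttpRequest.newBuilder()
                .uri(URI.create(okapiUrl + "/data-import-profiles/jobProfileSnapshots/" + jobProfileSnapshotId))
                .DELETE()
                .build();
    }

    public static void main(String[] args) {
        HttpRequest request = buildDeleteRequest("http://okapi:9130", "example-snapshot-id");
        System.out.println(request.method() + " " + request.uri());
        // The job-completion handler would send this via java.net.http.HttpClient
        // (with the usual Okapi tenant/token headers) after marking the job done.
    }
}
```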


Next Steps

  1. Add scripts to delete temporary data for each module.

  2. Add a column with a timestamp, and ensure it is populated when a deduplication or snapshot record is persisted.

  3. Define an abstract utility service in a shared library that accepts a deletion function and a retention period parameter.

  4. Implement a periodic service in each affected module to clean up temporary data and snapshot records. Use the same periodic job for all types of clutter (Solution #1).
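
The abstract utility service from step 3 could take roughly this shape in the shared library. All names here are hypothetical, and each module would register its own DAO-backed deletion function; the stubs below just return fixed counts:

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical shared-library utility: modules register a deletion function per
// clutter type together with its retention period, and a single periodic job
// invokes them all (Solution #1).
public class TempDataCleanupService {

    private record Task(String name, Duration retention, Function<Duration, Integer> deleteExpired) {}

    private final List<Task> tasks = new ArrayList<>();

    public void register(String name, Duration retention, Function<Duration, Integer> deleteExpired) {
        tasks.add(new Task(name, retention, deleteExpired));
    }

    // Invoked from one scheduled job; returns total rows deleted across all tasks.
    public int runAll() {
        int total = 0;
        for (Task task : tasks) {
            total += task.deleteExpired().apply(task.retention());
        }
        return total;
    }

    public static void main(String[] args) {
        TempDataCleanupService service = new TempDataCleanupService();
        // Real modules would pass DAO-backed deletions; these stubs pretend that
        // 3 and 2 rows matched the retention cutoff.
        service.register("events_processed", Duration.ofDays(30), retention -> 3);
        service.register("profile_snapshots", Duration.ofDays(30), retention -> 2);
        System.out.println(service.runAll()); // prints 5
    }
}
```

Keeping the module-specific SQL inside the deletion function means the shared library never needs to know each module's schema.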


By adopting these solutions, DI modules can significantly reduce database clutter, improve system performance, and simplify ongoing maintenance efforts.