MODDATAIMP-789 SPIKE: Design solution to clean up temp data
Ticket: MODDATAIMP-789
Status: In Progress
Problem Statement
The Data Import (DI) process generates temporary data that is persisted in the database but not cleaned up after the process is completed. Over time, this unused data accumulates and causes unnecessary storage usage in multiple FOLIO modules. A systematic solution is needed to clean up this obsolete data and prevent further clutter.
Temporary Data and Sources of Clutter
The following items are identified as sources of temporary data that remain after Data Import jobs are finished:
Kafka Event Deduplication Data
Present across several modules:
mod-source-record-managermod-inventorymod-ordersmod-invoice
No existing mechanism to periodically clean up event deduplication data.
ProfileSnapshot Data
Stored in
profile_snapshotstable inmod-di-converter-storage.Once a Data Import job is completed, ProfileSnapshots are no longer required, but they are not currently cleaned up.
JobExecution Data and Associated Journal Records
Stored in
mod-source-record-manager.This issue is already being addressed in MODSOURMAN-1329.
Old Versions of SRS Records
Addressed separately in the scope of MODSOURMAN-857.
Specific Clutter in Modules and Affected tables
mod-inventory
records_instancesrecords_holdingsrecords_itemsrecords_authoritiesevents_shared_instances: createddate column is present. Introduce an expiration period for these records and delete them periodically.
mod-source-record-manager
mapping_rules_snapshotsandmapping_params_snapshot: Extend theJobExecutionDao.hardDeleteJobExecutions()method to clean up these tables, or use the same periodic cleanup approach.events_processed: Contains deduplication-related data.
mod-source-record-storage
edifact_records_lb: These records can be deleted as incoming EDIFACT records are no longer saved in the module.
mod-orders
processed_records: created_date column is present. Add an expiration period and clean up the data periodically.sequential_order_id: saved_timestamp column is present.po_lines_processing_progress:creation_date column is present
mod-invoice
records_invoices: created_date column is present. Add an expiration period and periodically clean up these records.
Proposed Solutions
1. Introduce Expiration Periods for Temporary Data and delete existing
Description: Add an expiration period for temporary data where relevant. Define a time-based retention period (e.g., 30 days) for all "clutter" records to allow for clean-up after they are no longer needed. Delete all existing once after the module upgrade on the new version.
Implementation Steps:
Introduce a
created_date/timestampcolumn (if not already present) for all relevant tables across modules.Add retention configuration properties in each module to define the expiration period.
Implement background jobs in each module to periodically check and clean up expired records.
Create scripts for all affected modules to clean up existing clutter records right before or after modules upgrade. These scripts should be executed manually once.
Considerations: maybe introduce a new util service in the shared library to be used in all affected modules and reduce code duplication.
2. Extend Existing Hard Deletion Processes for JobExecutions
Description: Reuse current hard deletion logic in mod-source-record-manager.
Implementation Steps:
Extend
JobExecutionDao.hardDeleteJobExecutionsmechanism to clean upmapping_rules_snapshotsandmapping_params_snapshottables.
Considerations: same approach with periodic clean up can be used instead.
3. Immediate Cleanup After Job Completion
Description: In cases where temporary data is created only for individual jobs (e.g.,
ProfileSnapshotinmod-di-converter-storage), perform immediate cleanup as part of the job completion process.Implementation Steps:
Add new endpoint to DELETE ProfileSnaphsot that will be called from mod-source-record-manager
Modify job completion logic to delete job-specific temporary data (e.g., delete
ProfileSnapshotdata associated with the job as soon as the DI job is completed).
Considerations: same approach as for the deduplication data can be used instead.
Next Steps
Add scripts to delete temporary data for each module.
Add a column with a timestamp, and ensure it is populated when a deduplication or snapshot record is persisted.
Define an abstract utility service in a shared library that accepts a deletion function and a retention period parameter.
Implement a periodic service in each affected module to clean up temporary data and snapshot records. Use the same periodic job for all types of clutter (Solution #1).
By adopting these solutions, DI modules can significantly reduce database clutter, improve system performance, and simplify ongoing maintenance efforts.