Spike: MODEXPW-224 - eHoldings: create a plan to migrate data that is stored in job context to database
Purpose:
Within the scope of this spike we need to create a detailed plan for the implementation of MODEXPW-218.
Background/Problem:
Here is a brief illustration of the current eHoldings export job flow:
The main problem is high RAM usage: all job information is aggregated in the final step so that it can be converted to the output format and written to the file within a single step.
This aggregation is needed mainly to determine the number of note columns in the result file, so all titles of a package must be queried first to find the maximum number of notes that any single title has.
Within the scope of MODEXPW-198 a Postgres database was introduced as a replacement for the in-memory HSQLDB, so a different approach can now be used for the eHoldings job.
Solution
A brief illustration of the proposed solution:
Implementation plan
Create DB schemas for packages and titles:
E_HOLDINGS_PACKAGE (
    ID               VARCHAR(50) NOT NULL,
    JOB_EXECUTION_ID BIGINT      NOT NULL,
    E_PACKAGE        TEXT,
    AGREEMENTS       TEXT,
    NOTES            TEXT,
    PRIMARY KEY (JOB_EXECUTION_ID, ID)
)

E_HOLDINGS_RESOURCE (
    ID               VARCHAR(50) NOT NULL,
    JOB_EXECUTION_ID BIGINT      NOT NULL,
    RESOURCES_DATA   TEXT,
    AGREEMENTS       TEXT,
    NOTES            TEXT,
    PRIMARY KEY (JOB_EXECUTION_ID, ID)
)
The TEXT field type is used because no operations are performed on these fields; they simply hold values that are queried by id or via the "seek" method (using the id).
Two separate tables are used so that the data can be paginated without additional filtering (e.g. an enum column for the package/title entity type).
JOB_EXECUTION_ID is added so that only data related to the current job is queried, and so that the data can be cleaned up afterwards.
Create repositories for the new tables
Create EholdingsPackageRepository along with an entity and a mapper to "org.folio.dew.domain.dto.eholdings.EHoldingsPackage". It needs to save a single record, read a single record by id, and remove all records by job execution id.
Create EholdingsResourceRepository along with an entity and a mapper to "org.folio.dew.domain.dto.eholdings.EHoldingsResource". It needs to save a batch of records, read a batch of records using the "seek" method (passing the e_holdings_resource id), and remove all records by job execution id. A sketch of both repositories follows.
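A minimal sketch of the two repositories, assuming Spring Data derived queries; the entity and id class names (EHoldingsPackageEntity, EHoldingsPackageId, EHoldingsResourceEntity, EHoldingsResourceId) are illustrative placeholders, and the id classes would map the composite (JOB_EXECUTION_ID, ID) primary keys:

// EHoldingsPackageRepository.java
import java.util.Optional;
import org.springframework.data.repository.CrudRepository;

public interface EHoldingsPackageRepository
    extends CrudRepository<EHoldingsPackageEntity, EHoldingsPackageId> {

  // Read the single package record that belongs to the current job.
  Optional<EHoldingsPackageEntity> findByJobExecutionIdAndId(Long jobExecutionId, String id);

  // Cleanup at the end of the job.
  void deleteAllByJobExecutionId(Long jobExecutionId);
}

// EHoldingsResourceRepository.java
import java.util.List;
import org.springframework.data.domain.Pageable;
import org.springframework.data.repository.CrudRepository;

public interface EHoldingsResourceRepository
    extends CrudRepository<EHoldingsResourceEntity, EHoldingsResourceId> {

  // "Seek" pagination: the next chunk starts strictly after the last id seen,
  // ordered by id, so no offset is needed and every query stays cheap.
  List<EHoldingsResourceEntity> findByJobExecutionIdAndIdGreaterThanOrderByIdAsc(
      Long jobExecutionId, String lastSeenId, Pageable chunk);

  void deleteAllByJobExecutionId(Long jobExecutionId);
}

Batch saving of resources is already covered by saveAll(...) inherited from CrudRepository.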
Split prepareEholdingsStep
Move the data-reading logic to a dedicated "Reader".
Add a "Writer" that saves the package data to the database; "packageMaxNotes" must still be written to the job context (see the sketch below).
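A sketch of such a writer, assuming the Spring Batch 4 ItemWriter contract (write(List) signature); EHoldingsPackageDbWriter, the mapper, and the notes accessor are assumptions, not existing code:

import java.util.List;
import org.folio.dew.domain.dto.eholdings.EHoldingsPackage;
import org.springframework.batch.core.StepExecution;
import org.springframework.batch.core.annotation.BeforeStep;
import org.springframework.batch.item.ItemWriter;

public class EHoldingsPackageDbWriter implements ItemWriter<EHoldingsPackage> {

  private final EHoldingsPackageRepository repository;
  private final EHoldingsPackageMapper mapper; // hypothetical DTO-to-entity mapper
  private StepExecution stepExecution;

  public EHoldingsPackageDbWriter(EHoldingsPackageRepository repository,
                                  EHoldingsPackageMapper mapper) {
    this.repository = repository;
    this.mapper = mapper;
  }

  // The writer has to be registered as a step listener for this to fire.
  @BeforeStep
  public void beforeStep(StepExecution stepExecution) {
    this.stepExecution = stepExecution;
  }

  @Override
  public void write(List<? extends EHoldingsPackage> items) {
    Long jobExecutionId = stepExecution.getJobExecutionId();
    items.forEach(item -> repository.save(mapper.toEntity(item, jobExecutionId)));

    // "packageMaxNotes" still goes to the job execution context, because the
    // final step needs it to size the note columns of the result file
    // (getNotes() stands in for however the DTO actually exposes its notes).
    int maxNotes = items.stream()
        .mapToInt(item -> item.getNotes() == null ? 0 : item.getNotes().size())
        .max()
        .orElse(0);
    stepExecution.getJobExecution().getExecutionContext()
        .putInt("packageMaxNotes", maxNotes);
  }
}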
Rework getEholdings step
Remove "stepExecutionListener" that writes resources to context.
Implement custom "writer" that writes resources to database. Replace "listItemWriter" with newly created one. "titleMaxNotes" still must be written to job context.
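A possible wiring for the reworked step; the bean names and chunk size are illustrative, with EHoldingsResourceDbWriter being the database-backed counterpart of the package writer sketched above:

import org.folio.dew.domain.dto.eholdings.EHoldingsResource;
import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.item.ItemReader;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class EHoldingsStepConfig {

  private static final int CHUNK_SIZE = 100; // assumed chunk size

  @Bean
  public Step getEholdingsStep(StepBuilderFactory stepBuilderFactory,
                               ItemReader<EHoldingsResource> eHoldingsResourceReader,
                               EHoldingsResourceDbWriter eHoldingsResourceDbWriter) {
    return stepBuilderFactory.get("getEholdings")
        .<EHoldingsResource, EHoldingsResource>chunk(CHUNK_SIZE)
        .reader(eHoldingsResourceReader)
        // was: .writer(listItemWriter) - resources now go to Postgres instead
        // of an in-memory list, while "titleMaxNotes" is still written to the
        // job context (e.g. by the writer itself, as in the package sketch).
        .writer(eHoldingsResourceDbWriter)
        .build();
  }
}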
Rework saveEholdings step
Data should be processed in chunks (the same as for the "getEholdings" step).
Implement an abstract reader for the "seek" method (see "CsvItemReader" for reference; use the last object of the previous batch instead of an "offset").
Implement a reader that extends the new abstract reader. It should query data from the database, using the "e_holdings_resource" id as the seek key.
Package data should be queried from the database once, in the reader's "beforeStep", and saved to a class field ("EHoldingsExportConfig" can be passed to the constructor and the package id retrieved from "recordId").
The reader should return objects mapped to "EHoldingsResourceExportFormat". A sketch of both readers follows.
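A sketch of the two readers under those requirements, built on AbstractItemCountingItemStreamItemReader (the Spring Batch base class that CsvItemReader appears to build on); all other names are illustrative:

// AbstractSeekItemReader.java
import java.util.List;
import org.springframework.batch.item.support.AbstractItemCountingItemStreamItemReader;

// Abstract "seek" reader: instead of an offset it remembers the id of the last
// object of the previous batch and asks subclasses for the next chunk after it.
public abstract class AbstractSeekItemReader<T>
    extends AbstractItemCountingItemStreamItemReader<T> {

  private final int chunkSize;
  private List<T> currentChunk;
  private int currentIndex;
  private String lastSeenId; // replaces the "offset" of CsvItemReader

  protected AbstractSeekItemReader(int chunkSize) {
    this.chunkSize = chunkSize;
    setName(getClass().getSimpleName());
  }

  @Override
  protected T doRead() {
    if (currentChunk == null || currentIndex >= currentChunk.size()) {
      currentChunk = getNextChunk(lastSeenId, chunkSize);
      currentIndex = 0;
      if (currentChunk.isEmpty()) {
        return null; // no more data - ends the step
      }
      lastSeenId = idOf(currentChunk.get(currentChunk.size() - 1));
    }
    return currentChunk.get(currentIndex++);
  }

  @Override
  protected void doOpen() { /* nothing to open for a db-backed reader */ }

  @Override
  protected void doClose() { /* nothing to close */ }

  protected abstract List<T> getNextChunk(String seekAfterId, int limit);

  protected abstract String idOf(T item);
}

// EHoldingsResourceDbReader.java
import java.util.List;
import org.springframework.batch.core.StepExecution;
import org.springframework.batch.core.annotation.BeforeStep;
import org.springframework.data.domain.PageRequest;

// Concrete reader: seeks by the e_holdings_resource id via the repository
// sketched earlier. Mapping of each entity (together with the cached package
// data) to EHoldingsResourceExportFormat is left out for brevity.
public class EHoldingsResourceDbReader extends AbstractSeekItemReader<EHoldingsResourceEntity> {

  private final EHoldingsResourceRepository repository;
  private Long jobExecutionId;
  private EHoldingsPackageEntity eHoldingsPackage; // queried once, kept for mapping

  public EHoldingsResourceDbReader(EHoldingsResourceRepository repository, int chunkSize) {
    super(chunkSize);
    this.repository = repository;
  }

  // Package data is read once before the step, not once per chunk; the package
  // id would come from EHoldingsExportConfig#getRecordId().
  @BeforeStep
  public void beforeStep(StepExecution stepExecution) {
    this.jobExecutionId = stepExecution.getJobExecutionId();
    // this.eHoldingsPackage = packageRepository.findByJobExecutionIdAndId(...);
  }

  @Override
  protected List<EHoldingsResourceEntity> getNextChunk(String seekAfterId, int limit) {
    return repository.findByJobExecutionIdAndIdGreaterThanOrderByIdAsc(
        jobExecutionId, seekAfterId == null ? "" : seekAfterId, PageRequest.of(0, limit));
  }

  @Override
  protected String idOf(EHoldingsResourceEntity item) {
    return item.getId();
  }
}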
Implement database "cleanup"
Add a tasklet that removes all records from "EholdingsPackageRepository" and "EholdingsResourceRepository" by "job_execution_id".
Append the tasklet step via "next" at the end of the "getEHoldingsJob" bean in "EHoldingsJobConfig". A tasklet sketch follows.
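A sketch of the cleanup tasklet, reusing the deleteAllByJobExecutionId methods assumed in the repositories sketch:

import org.springframework.batch.core.StepContribution;
import org.springframework.batch.core.scope.context.ChunkContext;
import org.springframework.batch.core.step.tasklet.Tasklet;
import org.springframework.batch.repeat.RepeatStatus;

// Removes everything the job wrote, keyed by job execution id.
public class EHoldingsCleanupTasklet implements Tasklet {

  private final EHoldingsPackageRepository packageRepository;
  private final EHoldingsResourceRepository resourceRepository;

  public EHoldingsCleanupTasklet(EHoldingsPackageRepository packageRepository,
                                 EHoldingsResourceRepository resourceRepository) {
    this.packageRepository = packageRepository;
    this.resourceRepository = resourceRepository;
  }

  @Override
  public RepeatStatus execute(StepContribution contribution, ChunkContext chunkContext) {
    Long jobExecutionId = chunkContext.getStepContext().getStepExecution().getJobExecutionId();
    packageRepository.deleteAllByJobExecutionId(jobExecutionId);
    resourceRepository.deleteAllByJobExecutionId(jobExecutionId);
    return RepeatStatus.FINISHED;
  }
}

The step wrapping this tasklet is what gets appended with "next" at the end of the "getEHoldingsJob" flow.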
Add logging
Add "info" level logs for each step start/finish.
Add "trace" level logs about each operation start/finish (on every chunk).