Approach for DI to take into account external identifiers
Jira link: https://folio-org.atlassian.net/browse/MODDATAIMP-743
Spike Status: IN PROGRESS
Objective: Design an approach for Data Import to rely on external identifiers during Create Import jobs to prevent creating duplicate records (or at least de-duplicate records in the incoming file).
Background
Duplicate records in the incoming file are saved as separate records with newly assigned FOLIO identifiers. Data Import also does not check for existing records with the same non-FOLIO identifiers, which likewise results in the creation of duplicate records.
Problem Statement
DI does not check for existing records before creating new ones. If a record whose FOLIO identifiers (UUID and/or HRID) already exist is inserted into the db, an error prevents the duplication. However, non-FOLIO identifiers are not taken into account. Therefore, importing the same records multiple times results in the creation of multiple (duplicate) records with different FOLIO identifiers assigned. This causes problems in managing records and during update imports: if a non-FOLIO identifier is used as a match point for an update, multiple records satisfy the condition and the necessary actions are not performed.
Scope
In Scope
Current research focuses primarily on MARC_Bib records that are mapped into Instances during import. Solution can possibly be extended to cover MARC_Authority and MARC_Holdings records that are mapped into Authority and Holdings records respectively.
Goals:
De-duplication of records within the incoming file (existing records in the db are not taken into account).
De-duplication of records taking into account those that already exist in the db (essentially the step often taken during import of Authorities: match by identifier, and create only if there is no match).
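The first goal (de-duplication within a single incoming file) can be illustrated with a minimal sketch. The record shape, the function names, and the choice of 035 as the comparison field are assumptions made for illustration only; they are not part of the current Data Import code.

```python
from typing import Dict, Iterable, List, Optional, Tuple

# Hypothetical record shape: a dict with an "identifiers" mapping from a MARC
# field tag (for example "010" or "035") to the list of values in that field.
Record = Dict


def external_key(record: Record, tags: Tuple[str, ...]) -> Optional[tuple]:
    """Build a comparison key from the chosen tags, or None if the record has no such values."""
    identifiers = record.get("identifiers", {})
    key = tuple(
        (tag, value.strip())
        for tag in tags
        for value in sorted(identifiers.get(tag, []))
    )
    return key or None


def split_duplicates(
    records: Iterable[Record], tags: Tuple[str, ...] = ("035",)
) -> Tuple[List[Record], List[Record]]:
    """Keep the first record for each external key; later records with the same key are duplicates.

    Records without any of the chosen identifiers are never treated as duplicates.
    """
    seen = set()
    unique, duplicates = [], []
    for record in records:
        key = external_key(record, tags)
        if key is not None and key in seen:
            duplicates.append(record)
            continue
        if key is not None:
            seen.add(key)
        unique.append(record)
    return unique, duplicates
```

Only identifiers seen earlier in the same file are compared here, so this sketch does not address records that already exist in the db.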
Out of Scope
Spike does not cover EDIFACT imports, nor import of MARC_Bibs that do not result in creation of Instances (import of Orders, Holdings, and Items mapped from incoming MARC_Bib).
De-duplication of existing records.
Research Questions
Is there a problem of having duplicate records (same non-FOLIO identifiers) in the db?
Instances - duplicates currently exist on all reference environments (including bugfest) as a result of importing the same files over and over again with Create profiles (profiles that contain Create Instance action). Problem is reported by customers as well. Main focus of this spike.
Holdings - multiple Holdings can be created and assigned to the same Instance, and those Holdings can have the same Location. Duplicates are probably not a concern here.
Authorities - duplicates are not desired. To prevent creation of duplicate Authority records, they are often imported with profiles that first match on 010$a and create an Authority on non-match.
What identifiers can be used for Instances? Default mapping fills the following identifiers: 010 (LCCN), 019 (Canceled system control number), 020 (ISBN), 022 (ISSN), 024 (Other standard identifier), 028 (Publisher number), 035 (System control number). The original 001 value is also combined with 003 and added as an additional 035 (System control number). Can we rely on this value?
Other questions:
Should 010 $a be used for de-duplication of Authority?
In case MARC Holdings also need to be de-duplicated, what identifiers should be used? Default mapping for MARC Holdings - 035 (System control number, former id)
Holdings/Items that are mapped from incoming MARC Bib - should these entities also be considered? What identifiers should we look for?
Duplicates that already exist in the DB (mainly Instances). Is there a need to de-duplicate them? If yes, what should be done with linked Holdings, Items, and related Orders?
Option 1
De-duplication of records based on search among records existing in the DB:
This approach would follow the scenario of a preliminary “match”, in which we first search for the record before performing any actions. The criterion for such a “match” will have to be predefined depending on which external identifiers are chosen (they could differ based on the entity type). If this step is added for MARC_Bib records, the change can be placed in mod-inventory CreateInstanceHandler. The request could go either to mod-source-record-storage, searching by specified fields, or to mod-inventory-storage, searching for Instances with a specified identifier of a defined identifier type. On match, the record will either be skipped or handled as an error.
Advantages:
Low effort to implement the change. A match with multiple search criteria can be constructed.
Disadvantages:
mod-inventory processes records one by one, so adding a “search existing record” stage to a create operation will affect performance. Create imports would take approximately the same time as an Update with a profile of similar complexity (single match and one action).
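The flow of Option 1 can be sketched as follows. This is not the CreateInstanceHandler implementation: the in-memory storage class, the function name, the status strings, and the chosen identifier types are all hypothetical stand-ins for a lookup against mod-inventory-storage or mod-source-record-storage.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class InMemoryInstanceStorage:
    """Hypothetical stand-in for a search request to a storage module."""

    instances: List[dict] = field(default_factory=list)

    def find_by_identifier(self, identifier_type: str, value: str) -> List[dict]:
        return [
            instance
            for instance in self.instances
            if (identifier_type, value) in instance["identifiers"]
        ]

    def save(self, instance: dict) -> None:
        self.instances.append(instance)


def create_instance_if_absent(
    storage: InMemoryInstanceStorage,
    instance: dict,
    match_types: Tuple[str, ...] = ("LCCN", "System control number"),
) -> str:
    """Search by the predefined identifier types first; create only when nothing matches."""
    for identifier_type, value in instance["identifiers"]:
        if identifier_type in match_types and storage.find_by_identifier(identifier_type, value):
            # On match the record is skipped; it could equally be reported as an error.
            return "SKIPPED_DUPLICATE"
    storage.save(instance)
    return "CREATED"
```

Each create performs one search per candidate identifier before saving, which is the per-record overhead described in the disadvantage above.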
Option 2
Add a column for a particular external (non-FOLIO) identifier in mod-source-record-storage with a unique constraint (taking into account the state and generation of the record). Saving an SRS MARC record will then fail on a duplicate, and that error can be handled with a clear message for the user.
Advantages:
Low effort to implement the change. No impact on performance
Disadvantages:
Approach is good only in case a single value is defined as external identifier.
The need to migrate existing data, as db schema will be changed
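A minimal sketch of Option 2, assuming that "taking into account the state" means uniqueness is enforced only among records in the current (ACTUAL) state. SQLite is used here purely so the example is self-contained; the table, column, and index names are hypothetical and do not reflect the actual mod-source-record-storage schema.

```python
import sqlite3


def open_storage() -> sqlite3.Connection:
    """Create a simplified, hypothetical records table with a partial unique index."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE records (
               id INTEGER PRIMARY KEY,
               external_id TEXT,
               state TEXT NOT NULL,
               generation INTEGER NOT NULL
           )"""
    )
    # Uniqueness applies only to ACTUAL records, so older generations of the
    # same record (state OLD) do not conflict with the current one.
    conn.execute(
        """CREATE UNIQUE INDEX idx_records_external_id_actual
               ON records (external_id) WHERE state = 'ACTUAL'"""
    )
    return conn


def save_record(conn: sqlite3.Connection, external_id: str, generation: int = 0) -> str:
    """Insert a record; translate the constraint violation into a user-facing message."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO records (external_id, state, generation) VALUES (?, 'ACTUAL', ?)",
                (external_id, generation),
            )
        return "CREATED"
    except sqlite3.IntegrityError:
        return f"Duplicate record: external identifier {external_id!r} already exists"
```

No extra query runs before the insert; the database rejects the duplicate, which is why this option is not expected to affect performance. Adding such a column and index to a populated table is the migration cost listed above.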
Option 3
Predefine the criteria of a duplicate record on the mod-source-record-storage side. Construct a query and run it before saving a new record. Fail the create operation in case of a duplicate.
Advantages:
Low effort to implement the change. Multiple search criteria can be used.
Disadvantages:
Requires a preliminary search on the marc_indexers table before each save.
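A minimal sketch of Option 3, assuming the duplicate criteria are a fixed list of field/subfield pairs that must all match. SQLite is used only to keep the example self-contained, and the marc_indexers table below is a simplified, assumed layout (field_no, subfield_no, value, marc_id); the criteria list, function names, and status strings are hypothetical.

```python
import sqlite3
from typing import Dict, List, Tuple

# Hypothetical predefined criteria: (field, subfield) pairs used to detect duplicates.
DUPLICATE_CRITERIA = [("010", "a"), ("035", "a")]


def open_storage() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE marc_indexers (field_no TEXT, subfield_no TEXT, value TEXT, marc_id TEXT)"
    )
    return conn


def find_duplicates(conn: sqlite3.Connection, incoming: Dict[Tuple[str, str], str]) -> List[str]:
    """Return marc_ids of stored records that match every criterion present in the incoming record."""
    present = [(f, s, incoming[(f, s)]) for f, s in DUPLICATE_CRITERIA if (f, s) in incoming]
    if not present:
        return []
    clause = " OR ".join(["(field_no = ? AND subfield_no = ? AND value = ?)"] * len(present))
    params = [part for triple in present for part in triple]
    sql = (
        f"SELECT marc_id FROM marc_indexers WHERE {clause} "
        "GROUP BY marc_id HAVING COUNT(DISTINCT field_no || subfield_no) = ?"
    )
    return [row[0] for row in conn.execute(sql, params + [len(present)])]


def save_record(conn: sqlite3.Connection, marc_id: str, incoming: Dict[Tuple[str, str], str]) -> str:
    """Run the duplicate query first and fail the create operation when it finds a match."""
    if find_duplicates(conn, incoming):
        return "DUPLICATE"
    with conn:
        conn.executemany(
            "INSERT INTO marc_indexers VALUES (?, ?, ?, ?)",
            [(f, s, v, marc_id) for (f, s), v in incoming.items()],
        )
    return "CREATED"
```

Unlike Option 2, several criteria can be combined, at the cost of one additional query per saved record.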