Jira link: MODDATAIMP-743
Spike Status: IN PROGRESS
Objective: Design an approach for Data Import to rely on external identifiers during Create Import jobs to prevent creating duplicate records (or at least de-duplicate records in the incoming file).
Background
Duplicate records in the incoming file are saved as separate records with newly assigned FOLIO identifiers. Data Import also does not check for existing records with the same non-FOLIO identifiers, which likewise results in duplicate records being created.
Problem Statement
Goals:
Deduplicate records within the incoming file (existing records in the DB are not taken into account).
Deduplicate records taking into account those that already exist in the DB (essentially the step often taken when importing Authorities: match by identifier and, if there is no match, create). This match by identifier would be performed by default for all incoming records, with no need for an explicit match step in the Job Profile.
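The two goals above can be illustrated with a minimal sketch. It is not the Data Import implementation. The `IncomingRecord` class, the `normalize` rule, the in-memory `existing_index`, and the placeholder id format are all assumptions made for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class IncomingRecord:
    """Hypothetical stand-in for a parsed incoming record."""
    title: str
    # (identifier type, value) pairs, e.g. ("LCCN", "n 79021164")
    identifiers: list = field(default_factory=list)


def normalize(value: str) -> str:
    # Assumed normalization: drop whitespace and ignore case.
    return "".join(value.split()).lower()


def identifier_keys(record):
    return {(id_type, normalize(value)) for id_type, value in record.identifiers}


def deduplicate_file(records):
    """Goal 1: keep the first record for each external identifier in the file."""
    seen = set()
    unique = []
    for record in records:
        keys = identifier_keys(record)
        if keys & seen:
            continue  # shares an identifier with an earlier record in this file
        seen |= keys
        unique.append(record)
    return unique


def match_or_create(record, existing_index):
    """Goal 2: implicit match by identifier; create only when nothing matches.

    existing_index is a hypothetical dict mapping identifier keys to the
    FOLIO id of an already stored record.
    """
    keys = identifier_keys(record)
    for key in keys:
        if key in existing_index:
            return "matched", existing_index[key]
    new_id = f"new-{len(existing_index) + 1}"  # placeholder for a generated UUID
    for key in keys:
        existing_index[key] = new_id
    return "created", new_id
```

In this sketch a record without any external identifier is always kept and always created, since there is nothing to match on. Whether that is the desired behaviour is one of the questions this spike would need to answer.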
Are there any problems with duplicate (same non-FOLIO identifiers) Instances existing in the db?
What identifiers should be taken into account for Instances? We have default mapping for 010 (LCCN), 019 (Canceled system control number), 020 (ISBN), 022 (ISSN), 024 (Other standard identifier), 028 (Publisher number), 035 (System control number)
Are there any problems with duplicate (same non-FOLIO identifiers) Authorities existing in the db? Authorities are often imported using profiles that have a step for preliminary match (by 010 $a)
Are there any problems with duplicate (same non-FOLIO identifiers) Holdings existing in the db?
Default mapping for MARC Holdings - 035 (System control number, former id)
Holdings/Items that are mapped from incoming MARC Bib - should these entities also be considered? What identifiers should we look for?
Duplicates that already exist in the DB (mainly Instances)
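To gauge how many duplicates already exist, one option is to group stored records by external identifier and report every identifier shared by more than one record. The sketch below assumes rows have been read from storage as `(record_id, identifiers)` tuples. That shape and the function name are hypothetical, not an existing FOLIO API.

```python
from collections import defaultdict


def find_existing_duplicates(records):
    """Group record ids that share an external identifier.

    records: iterable of (record_id, [(identifier_type, value), ...]) tuples,
    a hypothetical shape for rows read from storage.
    """
    by_identifier = defaultdict(set)
    for record_id, identifiers in records:
        for id_type, value in identifiers:
            by_identifier[(id_type, value.strip())].add(record_id)
    # Keep only identifiers claimed by more than one record.
    return {key: sorted(ids) for key, ids in by_identifier.items() if len(ids) > 1}
```

Each key in the result is an identifier that would make an implicit match ambiguous, because an incoming record with that identifier would match more than one existing record.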