...
Goal:
De-duplication of records within the incoming file (existing records in the DB are not taken into account).
De-duplication of incoming records against records that already exist in the DB. This is essentially the step often taken when importing Authorities: match by identifier and, if there is no match, create. This match by identifier will be performed by default for all incoming records, with no need for an explicit match in the Job Profile.
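The two goals above can be sketched as follows. This is a minimal illustration under stated assumptions, not the Data Import implementation: the function names `dedupe_incoming` and `match_or_create`, the record shape (a dict with an `identifiers` list of type/value pairs), and the in-memory `existing` index standing in for the DB are all hypothetical.

```python
from typing import Any, Dict, Iterable, List, Set, Tuple

Identifier = Tuple[str, str]  # (identifier type, value), e.g. ("LCCN", "2001012345")
Record = Dict[str, Any]       # hypothetical shape: {"title": ..., "identifiers": [Identifier, ...]}


def dedupe_incoming(records: Iterable[Record]) -> List[Record]:
    """Goal 1: drop records that share an identifier with an earlier record
    in the same file. Existing DB records are not consulted.
    Records without identifiers are always kept, since nothing can match them."""
    seen: Set[Identifier] = set()
    unique: List[Record] = []
    for record in records:
        ids = set(record["identifiers"])
        if ids & seen:
            continue  # duplicate of an earlier record in this file
        seen |= ids
        unique.append(record)
    return unique


def match_or_create(record: Record, existing: Dict[Identifier, Record]) -> Tuple[str, Record]:
    """Goal 2: match by identifier against stored records; create on non-match.
    `existing` is a stand-in for a DB lookup keyed by identifier."""
    for identifier in record["identifiers"]:
        if identifier in existing:
            return "matched", existing[identifier]
    for identifier in record["identifiers"]:
        existing[identifier] = record  # register the new record under each identifier
    return "created", record
```

In this sketch any single shared identifier counts as a match. Whether a match should require a specific identifier type is one of the questions this spike leaves open.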
Scope
In Scope
Current research focuses primarily on MARC_Bib records that are mapped into Instances during import. The solution could be extended to cover MARC_Authority and MARC_Holdings records, which are mapped into Authority and Holdings records respectively.
Out of Scope
The spike does not cover EDIFACT imports or imports of MARC_Bibs that do not result in the creation of Instances (import of Orders, Holdings, and Items mapped from incoming MARC_Bib).
De-duplication of existing records.
Research Questions
Is there a problem of having duplicate records (same non-FOLIO identifiers)
...
in the db?
Instances - duplicates currently exist on all reference environments (including bugfest) as a result of importing the same files over and over again with Create profiles (profiles that contain a Create Instance action). The problem is reported by customers as well. This is the main focus of this spike.
Holdings - multiple Holdings can be created and assigned to the same Instance, and those Holdings can have the same Location. There is probably no concern about having duplicates.
...
Authorities - duplicates are not desired. To prevent the creation of duplicate Authority records, they are often imported with profiles that first match on 010 $a and create an Authority on non-match.
What identifiers can be used for Instances? The default mapping populates the following identifiers: 010 (LCCN), 019 (Canceled system control number), 020 (ISBN), 022 (ISSN), 024 (Other standard identifier), 028 (Publisher number), 035 (System control number).
...
Are there any problems with duplicate (same non-FOLIO identifiers) Authorities existing in the db? Authorities are often imported using profiles that include a preliminary match step (by 010 $a).
...
Are there any problems with duplicate (same non-FOLIO identifiers) Holdings existing in the db?
The original 001 value is also combined with 003 and added as an additional 035 (System control number). Can we rely on this value?
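To make the identifier questions concrete, the sketch below collects candidate identifiers from a simplified MARC record. The tag-to-name table comes from the default Instance mapping listed in this document. The record shape (a tag mapped to a list of `$a` values) and the function names are hypothetical. The `(003)001` form follows the MARC 21 convention for 035 $a, where the organization code appears in parentheses before the control number. Whether Data Import produces exactly this form is an assumption that should be verified.

```python
from typing import Dict, List, Optional, Tuple

# Tags populated by the default Instance mapping, as listed in this document.
IDENTIFIER_TAGS: Dict[str, str] = {
    "010": "LCCN",
    "019": "Canceled system control number",
    "020": "ISBN",
    "022": "ISSN",
    "024": "Other standard identifier",
    "028": "Publisher number",
    "035": "System control number",
}


def control_number_035(field_001: Optional[str], field_003: Optional[str]) -> Optional[str]:
    """Build the additional 035 value from the original 001 and 003.

    Assumption: the value is '(003)001', or the bare 001 when 003 is absent.
    """
    value = (field_001 or "").strip()
    if not value:
        return None
    org_code = (field_003 or "").strip()
    return f"({org_code}){value}" if org_code else value


def collect_identifiers(record: Dict[str, List[str]]) -> List[Tuple[str, str]]:
    """Return (identifier type, value) pairs for a simplified record.

    Hypothetical record shape: {"001": ["12345"], "003": ["OCoLC"], "020": ["978..."]}.
    """
    found: List[Tuple[str, str]] = []
    for tag, name in IDENTIFIER_TAGS.items():
        for value in record.get(tag, []):
            found.append((name, value.strip()))
    derived = control_number_035(
        (record.get("001") or [None])[0],
        (record.get("003") or [None])[0],
    )
    # Avoid adding the derived 035 twice if the record already carries it.
    if derived and (IDENTIFIER_TAGS["035"], derived) not in found:
        found.append((IDENTIFIER_TAGS["035"], derived))
    return found
```

A value built this way is only as reliable as the source system's 001/003 pair. Records exported without a 003 yield a bare control number that may collide across sources.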
Other questions:
Should 010 $a be used for de-duplication of Authority?
In case MARC Holdings also need to be de-duplicated, what identifiers should be used? Default mapping for MARC Holdings - 035 (System control number, former id)
Holdings/Items that are mapped from incoming MARC Bib - should these entities also be considered? What identifiers should we look for?
Duplicates that already exist in the DB (mainly Instances): is there a need to de-duplicate them? If yes, what should be done with linked Holdings, Items, and related Orders?
Can the original 001 value from the incoming file be used as an identifier for de-duplication of records?
Option 1
De-duplication of records based on search among records existing in the DB:
...