DRAFT - Approach for DI to take into account external identifiers

Jira link: MODDATAIMP-743

Spike Status: IN PROGRESS

Objective: Design an approach for Data Import to rely on external identifiers during Create Import jobs to prevent creating duplicate records (or at least de-duplicate records in the incoming file).

Background

Duplicate records in the incoming file are saved as different records with newly assigned FOLIO identifiers. Data Import also does not check for existing records with the same non-FOLIO identifiers, which likewise results in the creation of duplicate records.

Problem Statement

  • Goal (open question):

    • deduplication of records in the incoming file (we don’t take into account existing records in the db)

    • deduplication of records taking into account records that already exist in the db. This is the step often taken when importing Authorities: match by identifier and, if there is no match, create. This match by identifier would be performed by default for all incoming records, with no need for an explicit match in the Job Profile.
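The first goal (de-duplication within the incoming file only) can be illustrated with a minimal sketch. It assumes each record has already been parsed into a set of (identifier type, value) pairs; `IncomingRecord`, `deduplicate_in_file` and the `key_types` parameter are hypothetical names used for illustration, not existing Data Import code. Which identifier types should serve as keys is still an open question of this spike.

```python
from typing import Iterable, List, NamedTuple, Set, Tuple

# (identifier type, value), e.g. ("LCCN", "2001012345")
Identifier = Tuple[str, str]


class IncomingRecord(NamedTuple):
    """Hypothetical stand-in for one parsed record from the incoming file."""
    position: int           # order of the record in the file
    identifiers: frozenset  # frozenset of Identifier pairs


def deduplicate_in_file(
    records: Iterable[IncomingRecord],
    key_types: Set[str],
) -> Tuple[List[IncomingRecord], List[IncomingRecord]]:
    """Split records into (unique, duplicates).

    A record is a duplicate if it shares at least one identifier of a
    chosen key type with a record kept earlier. Records that carry no
    key identifier are always kept, since nothing can be compared.
    """
    seen: Set[Identifier] = set()
    unique: List[IncomingRecord] = []
    duplicates: List[IncomingRecord] = []
    for record in records:
        keys = {i for i in record.identifiers if i[0] in key_types}
        if keys & seen:
            duplicates.append(record)
        else:
            unique.append(record)
            seen |= keys
    return unique, duplicates
```

In this sketch the first occurrence in the file wins and later occurrences are reported as duplicates. The existing db is never consulted, which matches the first goal only.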

Scope

In Scope

Current research focuses primarily on MARC_Bib records that are mapped into Instances during import. Solution can possibly be extended to cover MARC_Authority and MARC_Holdings records that are mapped into Authority and Holdings records respectively.

Out of Scope

Spike does not cover EDIFACT imports, nor import of MARC_Bibs that do not result in creation of Instances (import of Orders, Holdings, and Items mapped from incoming MARC_Bib).

De-duplication of existing records.

Research Questions

  1. Is there a problem of having duplicate records (same non-FOLIO identifiers) in the db?

    1. Instances - duplicates currently exist on all reference environments (including bugfest) as a result of importing the same files over and over again with Create profiles (profiles that contain Create Instance action). Problem is reported by customers as well. Main focus of this spike.

    2. Holdings - multiple Holdings can be created and assigned to the same Instance, and those Holdings can have the same Location. There is probably no concern in having duplicates.

    3. Authorities - duplicates are not desired. To prevent creation of duplicate Authority records, they are often imported with profiles that first match on 010$a and create an Authority on non-match.

  2. What identifiers can be used for Instances? Default mapping fills the following identifiers: 010 (LCCN), 019 (Canceled system control number), 020 (ISBN), 022 (ISSN), 024 (Other standard identifier), 028 (Publisher number), 035 (System control number). The original 001 value is also combined with 003 and added as an additional 035 (System control number). Can we rely on this value?
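To make the 001/003 question concrete, here is a small sketch of how such a combined value could be built and compared. It follows the general MARC 035 convention of putting the organization code in parentheses before the control number; whether Data Import produces exactly this form is an assumption, and both function names and the normalization rule are hypothetical.

```python
from typing import Optional


def system_control_number(field_001: str, field_003: Optional[str]) -> str:
    """Combine 001 and 003 into an 035-style value: "(003)001".

    Assumption: a missing or empty 003 yields the bare 001 value.
    """
    number = field_001.strip()
    organization = (field_003 or "").strip()
    return f"({organization}){number}" if organization else number


def normalize_for_comparison(value: str) -> str:
    """Hypothetical normalization: drop all whitespace and ignore case."""
    return "".join(value.split()).lower()
```

For example, `system_control_number("ocm12345", "OCoLC")` gives `(OCoLC)ocm12345`. Whether this value is reliable enough to act as a de-duplication key depends on how consistently source systems populate 001 and 003, which remains the open question.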

Other questions:

  • Should 010 $a be used for de-duplication of Authority?

  • In case MARC Holdings also need to be de-duplicated, what identifiers should be used? Default mapping for MARC Holdings - 035 (System control number, former id)

  • Holdings/Items that are mapped from incoming MARC Bib - should these entities also be considered? What identifiers should we look for?

  • Duplicates that already exist in the DB (mainly Instances). Is there a need to de-duplicate them? If yes, what should be done with linked Holdings, Items, and related Orders?

Option 1

De-duplication of records based on search among records existing in the DB:

This approach would basically follow the scenario of a preliminary “match”, in which we first search for the record before performing any actions. The criterion for such a “match” will have to be predefined depending on which external identifiers are chosen (they could differ based on the entity type). If this step is added for MARC_Bib records, the change can be placed in mod-inventory CreateInstanceHandler. The request could go either to mod-source-record-storage, searching by specified fields, or to mod-inventory-storage, searching for Instances with a specified identifier of a defined identifier type. On match, the record will either be skipped or handled as an error.
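The flow described above can be sketched as follows. This is an illustration under stated assumptions, not the CreateInstanceHandler implementation: `create_unless_exists`, `DuplicateRecordError` and `InMemoryStorage` are hypothetical, and the storage class stands in for whichever module (mod-source-record-storage or mod-inventory-storage) would actually serve the search.

```python
from typing import Callable, Dict, Optional, Tuple

# (identifier type, value), e.g. ("System control number", "(OCoLC)12345")
Identifier = Tuple[str, str]


class DuplicateRecordError(Exception):
    """Raised when a match is configured to be handled as an error."""


class InMemoryStorage:
    """Hypothetical stand-in for the storage module that is searched."""

    def __init__(self) -> None:
        self._by_identifier: Dict[Identifier, str] = {}
        self.lookups = 0  # counts the extra search done per record

    def find(self, identifier: Identifier) -> Optional[str]:
        self.lookups += 1
        return self._by_identifier.get(identifier)

    def create(self, identifier: Identifier) -> str:
        record_id = f"instance-{len(self._by_identifier) + 1}"
        self._by_identifier[identifier] = record_id
        return record_id


def create_unless_exists(
    identifier: Identifier,
    find_existing: Callable[[Identifier], Optional[str]],
    create: Callable[[Identifier], str],
    on_match: str = "skip",
) -> Tuple[str, str]:
    """Search first; create only on non-match.

    On match the record is skipped, or an error is raised when
    on_match == "error".
    """
    existing_id = find_existing(identifier)
    if existing_id is None:
        return "created", create(identifier)
    if on_match == "error":
        raise DuplicateRecordError(f"{identifier} already exists as {existing_id}")
    return "skipped", existing_id
```

Every call performs one lookup before any create, so importing N records costs N additional searches. The `lookups` counter in the stand-in storage makes that overhead visible.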

Advantages:

  • Low effort to implement the change

Disadvantages:

  • mod-inventory processes records one by one, so adding a “search existing record” stage to a create operation will affect performance. Create imports would take approximately the same time as an Update with a profile of similar complexity (single match and one action).
