
DRAFT - Approach for DI to take into account external identifiers


Jira link: MODDATAIMP-743

Spike Status: IN PROGRESS

Objective: Design an approach for Data Import to rely on external identifiers during Create Import jobs to prevent creating duplicate records (or at least de-duplicate records in the incoming file).

Background

Duplicate records in the incoming file are saved as separate records with newly assigned FOLIO identifiers. Data Import also does not check for existing records with the same non-FOLIO identifiers, which likewise results in duplicate records being created.

Problem Statement

  • Goal (open question): which of the following is in scope?

    • De-duplication of records within the incoming file only (existing records in the DB are not taken into account)

    • De-duplication of records that also takes into account records that already exist in the DB. This is the step often taken when importing Authorities: match by identifier and, if there is no match, create. In this case the match by identifier would be performed by default for all incoming records, with no need for an explicit match in the Job Profile.

  • Are there any problems with duplicate (same non-FOLIO identifiers) Instances existing in the db?

  • What identifiers should be taken into account for Instances? We have default mapping for 010 (LCCN), 019 (Canceled system control number), 020 (ISBN), 022 (ISSN), 024 (Other standard identifier), 028 (Publisher number), 035 (System control number) 

  • Are there any problems with duplicate (same non-FOLIO identifiers) Authorities existing in the db? Authorities are often imported using profiles that have a step for preliminary match (by 010 $a)

  • Are there any problems with duplicate (same non-FOLIO identifiers) Holdings existing in the db?

  • Default mapping for MARC Holdings - 035 (System control number, former id)

  • Holdings/Items that are mapped from incoming MARC Bib - should these entities also be considered? What identifiers should we look for?

  • Duplicates that already exist in the DB (mainly Instances)

  • Can original 001 value from the incoming file be used as an identifier for de-duplication of records?
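The first goal above (de-duplication within the incoming file only) could be sketched roughly as follows. This is a minimal illustration, not the Data Import implementation: the `Record` shape (a dict of MARC tag to identifier values), the function names, and the "keep the first occurrence" rule are all assumptions made for the example. The tag list is taken from the default Instance mapping mentioned above.

```python
from typing import Dict, Iterable, List, Set, Tuple

# Hypothetical simplified record: MARC tag -> list of identifier values.
Record = Dict[str, List[str]]

# Tags from the default Instance identifier mapping listed above.
INSTANCE_ID_TAGS = ("010", "019", "020", "022", "024", "028", "035")


def identifier_keys(record: Record, tags: Iterable[str]) -> Set[Tuple[str, str]]:
    """Collect (tag, value) pairs for the chosen identifier tags."""
    keys = set()
    for tag in tags:
        for value in record.get(tag, []):
            normalized = value.strip()
            if normalized:
                keys.add((tag, normalized))
    return keys


def dedupe_in_file(
    records: List[Record], tags: Iterable[str] = INSTANCE_ID_TAGS
) -> Tuple[List[Record], List[Record]]:
    """Split records into (unique, duplicates), keeping the first occurrence.

    Assumption: a record is a duplicate if it shares at least one
    (tag, value) identifier with an earlier record that was kept.
    Records without any of the chosen identifiers are always kept.
    """
    tags = tuple(tags)
    seen: Set[Tuple[str, str]] = set()
    unique: List[Record] = []
    duplicates: List[Record] = []
    for record in records:
        keys = identifier_keys(record, tags)
        if keys & seen:
            duplicates.append(record)
        else:
            unique.append(record)
            seen |= keys
    return unique, duplicates
```

Whether a single shared identifier is enough to treat two records as duplicates, and whether values need normalization beyond trimming whitespace, are exactly the kind of questions this spike still has to answer.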

Option 1

De-duplication of records based on search among records existing in the DB:

This approach would essentially follow the scenario of a preliminary “match”, in which we first search for the record before performing any actions. The criteria for such a “match” will have to be predefined, depending on which external identifiers are chosen (they could differ by entity type). If this step is added for MARC_Bib records, the change can be placed in the mod-inventory CreateInstanceHandler. The request could go either to mod-source-record-storage, searching by the specified fields, or to mod-inventory-storage, searching for Instances that have the specified identifier of a defined identifier type. On a match, the record will either be skipped or handled as an error.
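The control flow described above can be sketched as a "search, then create" step. This is only an illustration under stated assumptions: `find_existing` and `create` are hypothetical callables standing in for the storage lookup and the create call, and none of the names below correspond to actual mod-inventory APIs.

```python
from enum import Enum
from typing import Callable, List, Optional, Tuple


class Outcome(Enum):
    CREATED = "created"
    SKIPPED = "skipped"
    ERROR = "error"


def create_with_prematch(
    identifiers: List[Tuple[str, str]],
    find_existing: Callable[[str, str], Optional[str]],
    create: Callable[[], str],
    on_match: Outcome = Outcome.SKIPPED,
) -> Tuple[Outcome, str]:
    """Search by each (identifier type, value) pair before creating.

    find_existing and create are hypothetical stand-ins for the storage
    lookup and the create request. Returns the outcome together with the
    id of the matched or newly created record.
    """
    for id_type, value in identifiers:
        existing_id = find_existing(id_type, value)
        if existing_id is not None:
            # A match was found: skip or report an error, never create.
            return on_match, existing_id
    return Outcome.CREATED, create()
```

For example, with an in-memory index standing in for storage:

```python
index = {("ISBN", "9780000000001"): "instance-1"}
lookup = lambda id_type, value: index.get((id_type, value))

create_with_prematch([("ISBN", "9780000000001")], lookup, lambda: "instance-2")
# matched, so nothing is created

create_with_prematch([("ISSN", "1234-5678")], lookup, lambda: "instance-2")
# no match, so the record is created
```

Note that each incoming record triggers at least one lookup before the create, which is the source of the performance cost discussed under Disadvantages.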

Advantages:

  • Low effort to implement the change

Disadvantages:

  • mod-inventory processes records one by one, so adding a “search existing record” stage to a create operation will affect performance. Create imports would take approximately the same time as an Update with a profile of similar complexity (a single match and one action).
