...
Goal:
Deduplication of records within the incoming file (existing records in the DB are not taken into account).
Deduplication of records taking into account records that already exist in the DB. This is essentially the step often taken during import of Authorities: match by identifier and, if there is no match, create. In fact, this match by identifier will be performed universally for all incoming records by default, with no need for an explicit match in the Job Profile.
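The first goal (deduplication inside the incoming file only) can be illustrated with a minimal sketch. The record shape (a dict with an `id` and a list of `(identifier type, value)` pairs) and the function name `dedupe_incoming` are assumptions made for illustration, not part of any FOLIO module; the sketch keeps the first record seen for each identifier and sets later ones aside.

```python
def dedupe_incoming(records):
    """Split incoming records into (unique, duplicates) by external identifier.

    Hypothetical record shape: {"id": str, "identifiers": [(type, value), ...]}.
    A record is a duplicate if ANY of its identifiers was already seen in the file.
    Records without identifiers cannot be compared and are always kept.
    """
    seen = set()
    unique, duplicates = [], []
    for record in records:
        # Normalize whitespace so " n 79021164 " and "n 79021164" compare equal.
        keys = {(t, v.strip()) for t, v in record["identifiers"] if v and v.strip()}
        if keys & seen:
            duplicates.append(record)
            continue
        seen |= keys
        unique.append(record)
    return unique, duplicates
```

Whether a later duplicate should be dropped, reported as an error, or merged is a policy decision this sketch does not make; it only separates the two groups.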
Are there any problems with duplicate (same non-FOLIO identifiers) Instances existing in the db?
What identifiers should be taken into account for Instances? We have default mapping for 010 (LCCN), 019 (Canceled system control number), 020 (ISBN), 022 (ISSN), 024 (Other standard identifier), 028 (Publisher number), 035 (System control number)
Are there any problems with duplicate (same non-FOLIO identifiers) Authorities existing in the db? Authorities are often imported using profiles that have a step for preliminary match (by 010 $a)
Are there any problems with duplicate (same non-FOLIO identifiers) Holdings existing in the db?
Default mapping for MARC Holdings - 035 (System control number, former id)
Holdings/Items that are mapped from incoming MARC Bib - should these entities also be considered? What identifiers should we look for?
Duplicates that already exist in the DB (mainly Instances)
Can original 001 value from the incoming file be used as an identifier for de-duplication of records?
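To make the identifier question concrete, the default Instance mapping listed above can be written as a lookup table from MARC tag to identifier label. The sketch below is illustrative only: the simplified field shape (`(tag, {subfield: value})`), the function name `extract_identifiers`, and the choice to read only subfield `$a` are all assumptions, and the real mapping rules may use other subfields.

```python
# MARC tag -> identifier label, taken from the default mapping listed in this document.
DEFAULT_INSTANCE_IDENTIFIER_FIELDS = {
    "010": "LCCN",
    "019": "Canceled system control number",
    "020": "ISBN",
    "022": "ISSN",
    "024": "Other standard identifier",
    "028": "Publisher number",
    "035": "System control number",
}


def extract_identifiers(marc_fields, mapping=DEFAULT_INSTANCE_IDENTIFIER_FIELDS):
    """Return (identifier label, value) pairs usable as de-duplication keys.

    marc_fields: list of (tag, {subfield code: value}); a hypothetical,
    simplified shape. Only subfield "a" is read (an assumption).
    """
    found = []
    for tag, subfields in marc_fields:
        label = mapping.get(tag)
        value = (subfields.get("a") or "").strip()
        if label and value:
            found.append((label, value))
    return found
```

Restricting or extending the table per entity type (for example, only 035 for MARC Holdings) would be a matter of passing a different `mapping`.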
Option 1
De-duplication of records based on search among records existing in the DB:
This approach would basically follow the scenario of a preliminary “match”, in which we first search for the record before performing any actions. The criterion for such a “match” will have to be predefined depending on which external identifiers are chosen (they could differ based on the entity type). If this step is added for MARC_Bib records, the change can be placed in mod-inventory CreateInstanceHandler. The request could go either to mod-source-record-storage, searching by the specified fields, or to mod-inventory-storage, searching for Instances with the specified identifier of a defined identifier type. On match, the record will either be skipped or handled as an error.
Advantages:
Low effort to implement the change
Disadvantages:
mod-inventory processes records one by one, so adding a “search existing record” stage to a create operation will affect performance. Create imports would take approximately the same time as an Update with a profile of similar complexity (a single match and one action).
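The flow of Option 1 and its cost can be sketched as follows. This is not the mod-inventory implementation: `InMemoryInstanceStore` is a hypothetical stand-in for the storage module, and `create_with_preliminary_match` is an illustrative name. The lookup counter is there only to show that every create now pays for at least one search.

```python
from dataclasses import dataclass, field


@dataclass
class InMemoryInstanceStore:
    """Hypothetical stand-in for the storage module; counts lookups."""
    by_identifier: dict = field(default_factory=dict)
    lookups: int = 0

    def find(self, id_type, value):
        self.lookups += 1  # in a real deployment each lookup would be a remote request
        return self.by_identifier.get((id_type, value))

    def create(self, record):
        for id_type, value in record["identifiers"]:
            self.by_identifier[(id_type, value)] = record["id"]


def create_with_preliminary_match(store, record, on_match="skip"):
    """Search by each external identifier first; create only if nothing matches."""
    for id_type, value in record["identifiers"]:
        existing = store.find(id_type, value)
        if existing is not None:
            if on_match == "error":
                raise ValueError(f"duplicate of {existing}: {id_type}={value}")
            return "skipped"
    store.create(record)
    return "created"
```

Because records are handled one at a time, the number of lookups grows with the number of incoming records multiplied by the identifiers checked per record, which is the performance concern noted above.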