MARC record deduplication and merge

The problem of MARC record deduplication and merge

General problem definition

Within a given collection of MARC records, find pairs of records that are duplicates and replace each such pair with a single record.

A solution may find all such pairs or only some of them.

A partial solution still delivers partial value.

Problem decomposition

Two approaches to searching for pairs can be considered:

New records are compared with each record in the collection. This is helpful when the collection is known to be free of duplicates and the task is to avoid adding duplicates when new records are added. It can also serve as a lightweight solution for a collection that may already contain duplicates; in that case it prevents the number of duplicates in the collection from growing.

Each record is compared with every other record in the collection, i.e. exhaustive search. This is a heavy search that may be helpful for an initial cleanup of the collection.
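The two search approaches differ only in which pairs they produce. A minimal sketch, assuming records can be held in plain Python lists; the function names and the string placeholders standing in for records are hypothetical and used only for illustration:

```python
from itertools import combinations


def pairs_for_new_records(collection, new_records):
    """First approach: pair every new record with every existing record."""
    for new in new_records:
        for existing in collection:
            yield new, existing


def pairs_exhaustive(collection):
    """Second approach: every unordered pair of records in the collection."""
    return combinations(collection, 2)


# Placeholder strings stand in for real MARC records.
collection = ["rec-a", "rec-b", "rec-c", "rec-d"]
new_records = ["rec-x", "rec-y"]

incremental = list(pairs_for_new_records(collection, new_records))
exhaustive = list(pairs_exhaustive(collection))

print(len(incremental))  # 2 new records * 4 existing records = 8 pairs
print(len(exhaustive))   # 4 * 3 / 2 = 6 pairs
```

This sketch does not compare the new records with each other. If each new record is added to the collection before the next one is checked, that case is covered as well.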

Regardless of which search approach is selected, the general problem has common aspects.

The full procedure for the most accurate recognition of two duplicates can be costly in terms of time and money. The cost is multiplied by the size of the collection, or even by the size squared in the case of exhaustive search.
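To make the scaling concrete, the number of thorough comparisons can be counted directly. The collection size and batch size below are illustrative assumptions, not measurements:

```python
def exhaustive_pair_count(n: int) -> int:
    """Unordered pairs among n records: n * (n - 1) / 2."""
    return n * (n - 1) // 2


def incremental_pair_count(n: int, m: int) -> int:
    """Each of m new records is compared with each of n existing records."""
    return n * m


n = 1_000_000  # assumed collection size
print(exhaustive_pair_count(n))        # 499999500000
print(incremental_pair_count(n, 100))  # 100000000
```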

To mitigate that impact, a second, lightweight procedure can be used to solve the comparison problem partially: exclude the pairs that are easily recognized as non-duplicates.

If we use such a pre-screening procedure, then the general problem can be decomposed into 3 stages:

Pre-screening: with the help of a lightweight method, select all the duplicate candidates

Thorough comparison: apply a thorough comparison procedure to each pair of candidates to make the final decision

Merge: perform the merge procedure, leaving only one record with the desired content
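The stages above can be sketched as a pipeline with one pluggable function per stage. This is an illustration under stated assumptions, not a definitive design: a record is simplified to a flat dictionary from field tag to string (real MARC records also carry indicators and subfields), and `deduplicate`, `is_candidate`, `is_duplicate` and `merge` are hypothetical names.

```python
from itertools import combinations
from typing import Callable, Dict, List

Record = Dict[str, str]  # assumption: simplified record, field tag -> value


def deduplicate(
    records: List[Record],
    is_candidate: Callable[[Record, Record], bool],
    is_duplicate: Callable[[Record, Record], bool],
    merge: Callable[[Record, Record], Record],
) -> List[Record]:
    """Run pre-screening, thorough comparison and merge over all pairs."""
    result = list(records)
    merged_away = set()  # indexes of records absorbed into another record
    for i, j in combinations(range(len(result)), 2):
        if i in merged_away or j in merged_away:
            continue
        a, b = result[i], result[j]
        if not is_candidate(a, b):  # stage 1: cheap pre-screening
            continue
        if not is_duplicate(a, b):  # stage 2: costly thorough comparison
            continue
        result[i] = merge(a, b)     # stage 3: keep a single merged record
        merged_away.add(j)
    return [r for k, r in enumerate(result) if k not in merged_away]


# Toy data and toy stage functions, for illustration only.
records = [
    {"245": "Moby Dick", "100": "Melville, Herman"},
    {"245": "moby dick", "100": "Melville, Herman", "020": "isbn-placeholder"},
    {"245": "Walden", "100": "Thoreau, Henry David"},
]
cleaned = deduplicate(
    records,
    is_candidate=lambda a, b: a.get("245", "").lower() == b.get("245", "").lower(),
    is_duplicate=lambda a, b: a.get("100") == b.get("100"),
    merge=lambda a, b: {**b, **a},  # keep a's values, add fields only b has
)
print(len(cleaned))  # 2
```

Keeping the stages as separate callables lets the cheap pre-screening filter be tuned independently of the expensive comparison behind it.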

Use of an LLM

An LLM is presumably too costly a tool for the pre-screening stage. Pre-screening should be solved with simpler algorithmic methods, e.g.:

Calculate a metric that defines the distance between any two MARC records on the basis of alphanumeric comparison, a thesaurus, etc.

Use a threshold on the metric to distinguish between non-duplicates and duplicate candidates
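One possible form of such a metric is sketched below with `difflib.SequenceMatcher` from the Python standard library. The choice of fields (245 for title, 100 for main author), the flat dictionary representation of a record, the equal weighting and the threshold value are all assumptions that would need tuning on real data; no thesaurus lookup is included.

```python
import re
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Lowercase and collapse everything except ASCII letters and digits."""
    return re.sub(r"[^0-9a-z]+", " ", text.lower()).strip()


def field_distance(a: str, b: str) -> float:
    """0.0 for identical normalized strings, 1.0 for nothing in common."""
    return 1.0 - SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def record_distance(rec_a, rec_b, tags=("245", "100")) -> float:
    """Average field distance over the compared tags (assumed equal weights)."""
    distances = [field_distance(rec_a.get(t, ""), rec_b.get(t, "")) for t in tags]
    return sum(distances) / len(distances)


THRESHOLD = 0.2  # hypothetical value, to be tuned on labelled examples


def is_candidate(rec_a, rec_b, threshold: float = THRESHOLD) -> bool:
    """Pairs at or below the threshold go on to thorough comparison."""
    return record_distance(rec_a, rec_b) <= threshold


a = {"245": "Moby Dick; or, The Whale", "100": "Melville, Herman"}
b = {"245": "moby dick or the whale", "100": "MELVILLE HERMAN"}
c = {"245": "Walden", "100": "Thoreau, Henry David"}

print(is_candidate(a, b))  # True: the records differ only in case and punctuation
print(is_candidate(a, c))  # False
```

Two limitations of this sketch: a field missing from both records counts as distance 0.0, and the normalization discards non-ASCII letters, so a real implementation would need a Unicode-aware normalization step.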

Thorough comparison of MARC records can presumably be solved with the help of an LLM.

The merge procedure can also use an LLM.
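A sketch of how the thorough comparison could be delegated to an LLM without committing to a particular vendor. Everything here is an assumption: `ask_llm` is a hypothetical caller-supplied function that sends a prompt to some model and returns its text answer, the prompt wording is illustrative, and records are simplified to flat dictionaries from field tag to string.

```python
from typing import Callable, Dict

Record = Dict[str, str]  # assumption: simplified record, field tag -> value


def format_record(record: Record) -> str:
    """Render a record as one 'tag: value' line per field, sorted by tag."""
    return "\n".join(f"{tag}: {value}" for tag, value in sorted(record.items()))


def build_comparison_prompt(rec_a: Record, rec_b: Record) -> str:
    """Compose the question put to the model (illustrative wording)."""
    return (
        "You are comparing two MARC bibliographic records.\n"
        "Answer with exactly one word: DUPLICATE or DISTINCT.\n\n"
        f"Record A:\n{format_record(rec_a)}\n\n"
        f"Record B:\n{format_record(rec_b)}\n"
    )


def is_duplicate(rec_a: Record, rec_b: Record, ask_llm: Callable[[str], str]) -> bool:
    """Ask the model and reject any answer outside the two allowed words."""
    answer = ask_llm(build_comparison_prompt(rec_a, rec_b)).strip().upper()
    if answer not in {"DUPLICATE", "DISTINCT"}:
        raise ValueError(f"Unexpected model answer: {answer!r}")
    return answer == "DUPLICATE"


# A stub stands in for a real model call so the sketch runs offline.
rec_a = {"245": "Moby Dick", "100": "Melville, Herman"}
rec_b = {"245": "Moby Dick", "100": "Melville, Herman"}
print(is_duplicate(rec_a, rec_b, ask_llm=lambda prompt: "duplicate\n"))  # True
```

Restricting the answer to a fixed vocabulary and failing loudly on anything else keeps the model's output machine-checkable. A merge step could follow the same pattern, with the model asked to return the merged record instead of a verdict.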

Proof of Concept

Research the definition of MARC record duplicates

Find representative positive and negative examples, where positive examples are duplicate records and negative examples are non-duplicates that could be mistaken for duplicates

Trial an LLM on the thorough comparison of both positive and negative examples

Trial an LLM on the merge procedure for the positive examples

Evaluate trial outcomes
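One possible way to evaluate the comparison trial is to score the LLM's verdicts against the labelled examples using precision and recall. These measures are a suggestion rather than something the plan prescribes, and the labels and predictions below are made-up illustrative values, not trial results.

```python
def evaluate(labels, predictions):
    """Compare predicted duplicate verdicts with the known labels."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for label, pred in pairs if label and pred)          # found duplicates
    fp = sum(1 for label, pred in pairs if not label and pred)      # false alarms
    fn = sum(1 for label, pred in pairs if label and not pred)      # missed duplicates
    tn = sum(1 for label, pred in pairs if not label and not pred)  # correct rejections
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn,
            "precision": precision, "recall": recall}


# True = duplicate (positive example), False = non-duplicate (negative example).
labels = [True, True, True, False, False]
predictions = [True, True, False, True, False]  # illustrative model verdicts

report = evaluate(labels, predictions)
print(report["tp"], report["fp"], report["fn"], report["tn"])  # 2 1 1 1
```

Low precision means distinct records would be merged by mistake, while low recall means duplicates are left in the collection. The acceptable balance between the two depends on how costly a wrong merge is.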