Within a given collection of MARC records, find pairs of records that are duplicates and replace each such pair with a single record.
A solution may find all such pairs or only some of them.
A partial solution still delivers partial value.
Problem decomposition
Two approaches to searching for pairs can be considered:
Incremental search: each new record is compared with every record already in the collection. This helps when the collection is known to be free of duplicates and the goal is to avoid introducing duplicates as new records are added. It can also serve as a lightweight solution for a collection that may already contain duplicates; in that case it prevents the number of duplicates from growing.
Exhaustive search: each record is compared with every other record in the collection. This is a heavy search that may be helpful for the initial cleanup of a collection.
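The two search approaches above differ only in which pairs they enumerate. A minimal sketch, assuming records can be held in ordinary Python sequences; the function names incremental_pairs and exhaustive_pairs are illustrative and not part of any existing tool:

```python
from itertools import combinations
from typing import Iterator, Sequence, Tuple, TypeVar

R = TypeVar("R")  # any in-memory representation of a MARC record


def incremental_pairs(
    collection: Sequence[R], new_records: Sequence[R]
) -> Iterator[Tuple[R, R]]:
    """Pair each new record with every record already in the collection."""
    current = list(collection)
    for new in new_records:
        for existing in current:
            yield new, existing
        # Assumption: once checked, the new record joins the collection,
        # so later arrivals are compared against it as well.
        current.append(new)


def exhaustive_pairs(collection: Sequence[R]) -> Iterator[Tuple[R, R]]:
    """Pair every record with every other record, each unordered pair once."""
    return combinations(collection, 2)


print(list(exhaustive_pairs(["a", "b", "c"])))
# [('a', 'b'), ('a', 'c'), ('b', 'c')]
```

Both generators are lazy, so a large collection does not require holding all pairs in memory at once.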
Regardless of the search approach selected, some aspects of the general problem are common to both.
The full procedure for the most accurate recognition of two duplicates can be costly in terms of time and money. The cost becomes especially significant when multiplied by the size of the collection, or by roughly the size squared in the case of exhaustive search.
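To put numbers on that growth: an exhaustive search over n records needs n(n-1)/2 pairwise comparisons, while adding m new records to a collection of n needs n*m + m(m-1)/2. A small sketch of the arithmetic; the function names are illustrative, and the second formula assumes each new record joins the collection after it is checked:

```python
def exhaustive_comparisons(n: int) -> int:
    """Number of unordered pairs among n records."""
    return n * (n - 1) // 2


def incremental_comparisons(n: int, m: int = 1) -> int:
    """Comparisons needed to add m new records to a collection of n.

    Assumes each new record joins the collection once checked, so the
    k-th arrival (counting from 0) is compared with n + k records.
    """
    return n * m + m * (m - 1) // 2


print(exhaustive_comparisons(100_000))      # 4999950000
print(incremental_comparisons(100_000, 1))  # 100000
```

Multiplying these counts by the time or price of a single thorough comparison gives a rough budget for each approach.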
To reduce that cost, a separate lightweight procedure can solve the comparison problem partially by excluding the pairs that are easily recognized as non-duplicates.
If such a pre-screening procedure is used, the general problem can be decomposed into three stages:
Pre-screening: use a lightweight method to select all duplicate candidates
Thorough comparison: apply a thorough comparison procedure to each candidate pair to make the final decision
Merger: merge each confirmed pair, leaving a single record with the desired content
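The three stages can be wired together as follows. This is a sketch under stated assumptions, not a definitive implementation: a record is simplified to a dict mapping a MARC tag to its field text, the three stage functions are supplied by the caller, and pairs are processed greedily in exhaustive-search order.

```python
from itertools import combinations
from typing import Callable, Dict, List

Record = Dict[str, str]  # assumed simplification: MARC tag -> field text


def deduplicate(
    records: List[Record],
    is_candidate: Callable[[Record, Record], bool],  # stage 1: cheap pre-screening
    is_duplicate: Callable[[Record, Record], bool],  # stage 2: thorough comparison
    merge: Callable[[Record, Record], Record],       # stage 3: merger
) -> List[Record]:
    result = list(records)
    removed = set()  # indices of records absorbed into another record
    for i, j in combinations(range(len(result)), 2):
        if i in removed or j in removed:
            continue
        if not is_candidate(result[i], result[j]):
            continue  # cheaply excluded, the costly comparison is skipped
        if not is_duplicate(result[i], result[j]):
            continue
        result[i] = merge(result[i], result[j])
        removed.add(j)
    return [r for k, r in enumerate(result) if k not in removed]


# Toy stage functions, for illustration only.
records = [{"245": "Moby Dick"}, {"245": "moby dick"}, {"245": "Dracula"}]
cleaned = deduplicate(
    records,
    is_candidate=lambda a, b: a["245"][:1].lower() == b["245"][:1].lower(),
    is_duplicate=lambda a, b: a["245"].lower() == b["245"].lower(),
    merge=lambda a, b: {**b, **a},  # values from the first record win
)
print(cleaned)  # [{'245': 'Moby Dick'}, {'245': 'Dracula'}]
```

Keeping the stages as separate callables lets the pre-screening stay algorithmic while the comparison and merger are swapped for costlier implementations.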
Usage of LLM
An LLM is presumably too costly a tool for the pre-screening stage. Pre-screening should be handled by simpler algorithmic methods, e.g.:
Calculate a metric that defines the distance between any two MARC records based on alphanumeric comparison, a thesaurus, etc.
Use a threshold on the metric to distinguish non-duplicates from duplicate candidates
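One possible shape for such a metric, using only the Python standard library. Everything specific here is an assumption made for illustration: records are simplified to a dict of tag to field text, only fields 245 (title statement) and 100 (main entry, personal name) are compared, similarity comes from difflib.SequenceMatcher, and the threshold of 0.2 is an arbitrary starting point to be tuned on real data.

```python
import re
from difflib import SequenceMatcher
from typing import Dict, Sequence

Record = Dict[str, str]  # assumed simplification: MARC tag -> field text


def normalize(text: str) -> str:
    """Lowercase and collapse everything except letters a-z and digits."""
    return re.sub(r"[^0-9a-z]+", " ", text.lower()).strip()


def record_distance(
    a: Record, b: Record, tags: Sequence[str] = ("245", "100")
) -> float:
    """Distance in [0, 1]: 0 means identical on the compared fields."""
    similarities = []
    for tag in tags:
        left = normalize(a.get(tag, ""))
        right = normalize(b.get(tag, ""))
        # Note: a field missing from both records counts as a perfect
        # match, because SequenceMatcher rates two empty strings as 1.0.
        similarities.append(SequenceMatcher(None, left, right).ratio())
    return 1.0 - sum(similarities) / len(similarities)


def is_candidate(a: Record, b: Record, threshold: float = 0.2) -> bool:
    """Pairs at or below the threshold go on to thorough comparison."""
    return record_distance(a, b) <= threshold


first = {"245": "Moby Dick, or, The whale /", "100": "Melville, Herman"}
second = {"245": "Moby-Dick; or, The Whale", "100": "Melville, Herman."}
print(record_distance(first, second))  # 0.0
```

A lower threshold excludes more pairs and saves more thorough comparisons, at the risk of screening out true duplicates.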
Thorough comparison of MARC records can presumably be handled by an LLM.
The merger procedure can also use an LLM.
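How the LLM would be called is left open here. The sketch below only shows the shape of the two tasks: the LLM is a hypothetical callable ask_llm that takes a prompt and returns reply text, the prompt wording and the JSON reply format are assumptions, and records are simplified to a dict of tag to field text. No specific vendor API is implied.

```python
import json
from typing import Callable, Dict

Record = Dict[str, str]        # assumed simplification: MARC tag -> field text
AskLLM = Callable[[str], str]  # hypothetical: prompt text in, reply text out


def format_record(record: Record) -> str:
    """Render a record as one 'tag value' line per field, sorted by tag."""
    return "\n".join(f"{tag} {value}" for tag, value in sorted(record.items()))


def compare_with_llm(a: Record, b: Record, ask_llm: AskLLM) -> bool:
    """Thorough comparison: ask for a JSON verdict and parse it."""
    prompt = (
        "Do these two MARC records describe the same bibliographic item?\n"
        'Answer with JSON only, for example {"duplicate": true}.\n\n'
        f"Record A:\n{format_record(a)}\n\nRecord B:\n{format_record(b)}\n"
    )
    reply = json.loads(ask_llm(prompt))
    return reply["duplicate"] is True


def merge_with_llm(a: Record, b: Record, ask_llm: AskLLM) -> Record:
    """Merger: ask for a single merged record as a JSON object."""
    prompt = (
        "Merge these two duplicate MARC records into one record, keeping "
        "the most complete value for each tag.\n"
        "Answer with a JSON object mapping tags to field text.\n\n"
        f"Record A:\n{format_record(a)}\n\nRecord B:\n{format_record(b)}\n"
    )
    merged = json.loads(ask_llm(prompt))
    return {str(tag): str(value) for tag, value in merged.items()}
```

Passing the LLM in as a plain callable keeps both functions testable with a stub and leaves the choice of model and client library open.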
Proof of Concept
Research the definition of MARC record duplicates
Find representative positive and negative examples, where positive examples are duplicate records and negative examples are non-duplicates that could be mistaken for duplicates
Trial an LLM on the thorough comparison of both positive and negative examples
Trial an LLM on the merger procedure for positive examples
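The comparison trial could be scored against the labeled examples roughly as follows. The names are illustrative, records are again simplified to a dict of tag to field text, and the comparator passed in may be an LLM-backed function or any stand-in; the toy comparator and data below are made up purely to show the mechanics.

```python
from typing import Callable, Dict, Iterable, Tuple

Record = Dict[str, str]                    # assumed simplification
LabeledPair = Tuple[Record, Record, bool]  # (record, record, is a duplicate)


def evaluate(
    examples: Iterable[LabeledPair],
    is_duplicate: Callable[[Record, Record], bool],
) -> Dict[str, float]:
    """Count outcomes over labeled pairs and derive precision and recall."""
    tp = fp = fn = tn = 0
    for a, b, expected in examples:
        predicted = is_duplicate(a, b)
        if predicted and expected:
            tp += 1
        elif predicted and not expected:
            fp += 1
        elif expected:
            fn += 1
        else:
            tn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn,
            "precision": precision, "recall": recall}


examples = [
    ({"245": "Moby Dick"}, {"245": "moby dick"}, True),   # positive
    ({"245": "Moby-Dick"}, {"245": "Moby Dick"}, True),   # positive
    ({"245": "Dracula"}, {"245": "Dracula"}, False),      # lookalike negative
    ({"245": "Dracula"}, {"245": "Carmilla"}, False),     # negative
]
scores = evaluate(examples, lambda a, b: a["245"].lower() == b["245"].lower())
print(scores["precision"], scores["recall"])  # 0.5 0.5
```

Low precision here means non-duplicates would be merged, which loses data, while low recall only leaves some duplicates in place, so the two errors may deserve different weights.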