MARC records deduplication and merge
Problem of MARC records deduplication and merge
General problem definition.
Within a given collection of MARC records, find pairs of records that are duplicates and replace each such pair with a single record.
A solution may find all such pairs or only some of them.
A partial solution still delivers partial value.
The impact of a false-positive duplicate recognition is higher than the impact of a false-negative one, because redundant data is better than data loss.
Problem decomposition
Two approaches to searching for pairs can be considered:
Compare each new record with each record in the collection. This helps when the collection is known to be free of duplicates and the task is to avoid introducing duplicates as new records are added. It can also serve as a lightweight solution for a collection that may already contain duplicates; in that case it prevents the number of duplicates from growing.
Compare each record with every other record in the collection, i.e. exhaustive search. This heavy search may be helpful for an initial collection cleanup.
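The difference in workload between the two approaches can be illustrated with a minimal Python sketch. The function names and the toy record identifiers are assumptions made for illustration; real input would be parsed MARC records.

```python
from itertools import combinations
from typing import Iterable, Iterator, Sequence, Tuple


def incremental_pairs(
    new_records: Iterable[str], collection: Sequence[str]
) -> Iterator[Tuple[str, str]]:
    """Approach 1: pair each new record with each record already in the collection."""
    for new in new_records:
        for existing in collection:
            yield new, existing


def exhaustive_pairs(collection: Sequence[str]) -> Iterator[Tuple[str, str]]:
    """Approach 2: every unordered pair of records in the collection, each exactly once."""
    return combinations(collection, 2)


# Toy identifiers standing in for MARC records (hypothetical data).
collection = ["r1", "r2", "r3", "r4"]
new_records = ["n1", "n2"]

incremental = list(incremental_pairs(new_records, collection))
exhaustive = list(exhaustive_pairs(collection))

print(len(incremental))  # 2 new records * 4 existing records
print(len(exhaustive))   # 4 * 3 / 2 unordered pairs
```

With n records in the collection and m new records, the first approach makes m * n comparisons, while the exhaustive search makes n * (n - 1) / 2.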
Regardless of the search approach selected, the general problem has common aspects.
The full procedure for the most accurate recognition of two duplicates can be costly in terms of time and money. That cost is multiplied by the size of the collection, or even by the square of the size in the case of exhaustive search.
To mitigate that impact, a separate lightweight procedure can be used to solve the comparison problem partially: exclude the pairs that are easily recognized as non-duplicates.
If such a pre-screening procedure is used, the general problem can be decomposed into 3 stages:
Pre-screening: select all the duplicate candidates with the help of a lightweight method
Thorough comparison: apply a thorough comparison procedure to each pair of candidates to make the final decision
Merge: perform the merge procedure, leaving only one record with the desired content
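The three stages can be sketched as a small pipeline in which each stage is a pluggable function. This is only an illustration under stated assumptions: the `deduplicate` function, the dict-based record shape, and the three toy stage functions are hypothetical stand-ins, not a proposed implementation.

```python
from itertools import combinations
from typing import Callable, Dict, List

Record = Dict[str, str]  # assumed simplified record: field name -> value


def deduplicate(
    records: List[Record],
    is_candidate: Callable[[Record, Record], bool],  # stage 1: cheap pre-screening
    is_duplicate: Callable[[Record, Record], bool],  # stage 2: thorough comparison
    merge: Callable[[Record, Record], Record],       # stage 3: merge
) -> List[Record]:
    result = list(records)
    merged_away = set()  # indexes of records absorbed into another record
    for i, j in combinations(range(len(result)), 2):
        if i in merged_away or j in merged_away:
            continue
        a, b = result[i], result[j]
        if not is_candidate(a, b):
            continue  # cheaply excluded, the costly check is never run
        if not is_duplicate(a, b):
            continue  # candidate pair rejected by the thorough check
        result[i] = merge(a, b)
        merged_away.add(j)
    return [r for k, r in enumerate(result) if k not in merged_away]


# Toy stand-ins for the three stages (assumptions, for illustration only).
def same_year(a: Record, b: Record) -> bool:
    return a["year"] == b["year"]


def same_title(a: Record, b: Record) -> bool:
    return a["title"].lower() == b["title"].lower()


def keep_longer_values(a: Record, b: Record) -> Record:
    return {key: max(a[key], b[key], key=len) for key in a}


records = [
    {"title": "Introduction to Biology", "year": "2010", "publisher": "Nature Press"},
    {"title": "introduction to biology", "year": "2010", "publisher": "Nature Publishing Group"},
    {"title": "Introduction to Chemistry", "year": "2010", "publisher": "Nature Press"},
]

cleaned = deduplicate(records, same_year, same_title, keep_longer_values)
print(len(cleaned))
```

Keeping the stages as separate functions means an algorithmic pre-screen and a more expensive comparison or merge step can be swapped independently.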
Usage of LLM
An LLM is presumably too costly a tool for the pre-screening stage. Pre-screening should be solved with the help of simpler algorithmic methods, e.g.:
Calculate a metric defining the distance between any 2 MARC records on the basis of alphanumeric comparison, a thesaurus, etc.
Use a threshold for the metric to distinguish between non-duplicates and duplicate candidates
This is only a hypothesis and it might be wrong: the similarity of records may be quite subtle, so an effective pre-screening metric may not be achievable without the use of AI/LLM.
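A minimal sketch of such a metric and threshold is given below, using only string similarity from the Python standard library. The choice of fields, the weights, and the threshold value are all assumptions that would need tuning on real data; the sample values are taken from the simplified examples later on this page, plus one invented unrelated record.

```python
import re
from difflib import SequenceMatcher
from typing import Dict

Record = Dict[str, str]  # assumed simplified record: field name -> value

# Assumed field weights (they sum to 1.0) and threshold; not tuned on real data.
WEIGHTS = {"author": 0.3, "title": 0.5, "publisher": 0.2}
THRESHOLD = 0.4


def normalize(value: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    value = re.sub(r"[^a-z0-9 ]+", " ", value.lower())
    return " ".join(value.split())


def field_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between two field values after normalization."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def distance(rec_a: Record, rec_b: Record) -> float:
    """Weighted distance in [0, 1]; 0 means identical on all weighted fields."""
    similarity = sum(
        weight * field_similarity(rec_a.get(field, ""), rec_b.get(field, ""))
        for field, weight in WEIGHTS.items()
    )
    return 1.0 - similarity


def is_candidate(rec_a: Record, rec_b: Record) -> bool:
    """Pre-screening decision: keep the pair for thorough comparison."""
    return distance(rec_a, rec_b) <= THRESHOLD


rec1 = {"author": "Smith, John", "title": "Introduction to Biology A comprehensive study",
        "publisher": "Nature Press"}
rec2 = {"author": "Smith, John.", "title": "Introduction to Biology Comprehensive biology",
        "publisher": "Nature Publications"}
unrelated = {"author": "Doe, Jane", "title": "Medieval European History",
             "publisher": "Oxford Press"}

print(is_candidate(rec1, rec2))       # similar records pass the pre-screen
print(is_candidate(rec1, unrelated))  # dissimilar records are excluded
```

Because false positives are the costlier error, the threshold at this stage can afford to be generous: a wrongly kept candidate pair is still subject to the thorough comparison.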
Thorough comparison of MARC records can presumably be solved with the help of an LLM.
The merge procedure can also use an LLM.
Proof of Concept
Research the definition of MARC record duplicates
Find representative positive and negative examples, where positive examples are duplicate records and negative examples are non-duplicates that could be mistaken for duplicates
Trial an LLM on thorough comparison of both positive and negative examples
Trial an LLM on the merge procedure for positive examples
Evaluate trial outcomes
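One possible way to evaluate the comparison trial is to count the four outcome types over the labelled examples and report precision and recall. The sketch below assumes each example pair has a known label and a verdict from the trial; the sample lists are invented placeholders, not trial results.

```python
from typing import Dict, Optional, Sequence


def evaluate(labels: Sequence[bool], predictions: Sequence[bool]) -> Dict[str, Optional[float]]:
    """Compare known labels (True = duplicate) with the verdicts of the trial."""
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y and p)          # duplicates recognized
    fp = sum(1 for y, p in pairs if not y and p)      # non-duplicates wrongly matched
    fn = sum(1 for y, p in pairs if y and not p)      # duplicates missed
    tn = sum(1 for y, p in pairs if not y and not p)  # non-duplicates rejected
    # Precision and recall are undefined when their denominator is zero.
    precision = tp / (tp + fp) if (tp + fp) else None
    recall = tp / (tp + fn) if (tp + fn) else None
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn,
            "precision": precision, "recall": recall}


# Placeholder data: three positive and two negative examples (hypothetical).
labels = [True, True, True, False, False]
predictions = [True, True, False, True, False]

report = evaluate(labels, predictions)
print(report)
```

Since a false positive leads to data loss after merging, precision is the figure to watch first; a lower recall only leaves some redundant records in place.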
GPT-4 Opinion
Below is the conversation with GPT-4:
User: Can you propose a definition of duplicate bibliographical MARC records? Can you also show some examples?
GPT-4: Definition of Duplicate Bibliographical MARC Records: Duplicate bibliographical MARC records refer to multiple MARC entries in a library catalog that represent the same bibliographic item, but might slightly differ in terms of metadata details such as cataloging source, copy-specific details, or minor variations in titling, authors, or publication data. Such duplicates can arise during data entry, catalog migration, or when integrating records from different sources.
The presence of duplicate bibliographic records can lead to inefficiencies in library operations, confusion among patrons, and inaccuracies in the catalog data. Identifying and merging these duplicates is crucial for maintaining a clean, user-friendly, and effective bibliographic database.
Examples of Duplicate Bibliographical MARC Records:
For illustration, consider the following simplified MARC records which represent the same book, but with slight variations:
Example 1:
Leader: 01542cam a2200361 a 4500
001: 78901234
100: 1 $aSmith, John$eauthor
245: 10$aIntroduction to Biology$bA comprehensive study$cJohn Smith
260: $aNew York$bNature Press$c2010
300: $a300 p.$bill., maps$c24 cm
Example 2:
Leader: 01452cam a2200341 a 4500
001: 78901235
100: 1 $aSmith, John.$eauthor
245: 10$aIntroduction to Biology$bComprehensive biology$cJohn Smith
260: $aNew York$bNature Publications$c2010
300: $a301 p.$bill., maps$c24 cm
Example 3:
Leader: 01543cam a2200367 a 4500
001: 78901236
100: 1 $aSmith, John Robert$eauthor
245: 10$aIntroduction to Biology:$ba comprehensive guide$cby John Robert Smith
260: $aNY$bNature Publishing Group$c2010
300: $a300 p.$bill., maps, charts$c24 cm
Analysis of Examples:
All records refer to the same work, "Introduction to Biology," authored by John Smith, published in 2010 in New York. However, the records have minor differences in:
100 field (author's name slightly varies).
245 field (title and subtitle differences).
260 field (variation in publisher's name and abbreviation).
300 field (slight differences in pagination and description).
These variations make each entry unique in cataloging terms but represent the same bibliographical item. Identifying such entries as duplicates and merging them into a single, accurate, and comprehensive MARC record can reduce redundancy and help in maintaining a streamlined catalog.
Efficiently managing duplicate MARC records requires careful examination of bibliographic elements, consistent application of cataloging rules, and potentially the use of automated tools or LLMs to identify and suggest merges based on the library's specific criteria and standards.
Risks of Comparison Mistakes
There is a category of permissible duplicates, as well as records that only look like duplicates. Examples:
Successive cataloguing entries of a serial record
Records in different languages. Only one record per language of descriptive cataloging will be allowed as a permissible duplicate for each title
Different editions of the same book, different parts of a series which all have the series name, and even resources which happen to have the same title but are not related. Source: https://knowledge.exlibrisgroup.com/Content/Knowledge_Articles/Alma/Alma_community_zone%3A_Match_logic%2C_duplicate_Bibliographic_records%2C_and_wrong_portfolios-Bibliographic_records_match
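Some of these cases could be guarded against with simple rules applied before any merge. The sketch below is an assumption-laden illustration: it uses a simplified dict record whose hypothetical keys `cataloging_language` and `edition` stand for MARC 040 $b (language of cataloging) and 250 $a (edition statement), and it does not attempt to handle successive serial entries.

```python
from typing import Dict, Optional

Record = Dict[str, Optional[str]]  # assumed simplified record, not a real MARC structure


def normalized(value: Optional[str]) -> str:
    """Lowercase and collapse whitespace; a missing value becomes an empty string."""
    return " ".join((value or "").lower().split())


def is_permissible_pair(a: Record, b: Record) -> bool:
    """Return True when the pair should be kept as two records and never merged."""
    # Records catalogued in different languages may legitimately coexist.
    if normalized(a.get("cataloging_language")) != normalized(b.get("cataloging_language")):
        return True
    # Different editions are different bibliographic items. A missing edition
    # statement on one side is treated as a difference, which errs on the side
    # of keeping both records.
    if normalized(a.get("edition")) != normalized(b.get("edition")):
        return True
    return False


# Hypothetical sample records.
english = {"title": "Introduction to Biology", "cataloging_language": "eng", "edition": "2nd ed."}
german = {"title": "Introduction to Biology", "cataloging_language": "ger", "edition": "2nd ed."}
third_edition = {"title": "Introduction to Biology", "cataloging_language": "eng", "edition": "3rd ed."}
english_copy = dict(english)

print(is_permissible_pair(english, german))         # different language of cataloging
print(is_permissible_pair(english, third_edition))  # different edition
print(is_permissible_pair(english, english_copy))   # nothing blocks a merge
```

The guard is deliberately conservative: when in doubt it blocks the merge, in line with the principle that redundant data is better than data loss.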
Definitions
Duplicate records:
Two or more bibliographic records describing the same bibliographic item, ostensibly in the same manner.
https://www.itsmarc.com/crs/mergedProjects/editgde/editgde/definition_duplicate_records.htm
Bibliographic item:
An item or set of items in any physical form, either serial or nonserial, published, issued or treated as an entity and, as such, forming the basis for a single bibliographic description. The bibliographic item is the entity referred to by the item identifier. Examples of bibliographic items are: a one-volume book, a set of maps, a musical score with parts, a compact digital disc, a multimedia kit, a manuscript collection, a microform journal, a videotape with an accompanying pamphlet, a looseleaf publication together with its serially-issued updates.
https://www.itsmarc.com/crs/mergedprojects/holdings/holdings/definition_bibliographic_item.htm
https://www.itsmarc.com/crs/mergedprojects/editgde/editgde/idh_sect_c7__ceg.htm
Publications on the topic