MARC records deduplication and merge

Problem of MARC records deduplication and merge

General problem definition

Within a given collection of MARC records, find pairs of records that are duplicates and replace each such pair with a single record.

Either all such pairs can be found or only some of them; a partial solution yields partial value.

The impact of a false-positive duplicate recognition is higher than that of a false-negative one, because redundant data is better than data loss.

Problem decomposition

Two approaches to searching for pairs can be considered:

  1. Each new record is compared with each record in the collection. This is helpful when the collection is known to be free of duplicates and the task is to avoid introducing them as new records are added. It can also serve as a lightweight solution for a collection that may already contain duplicates; in that case this approach prevents the number of duplicates from growing

  2. Each record is compared with every other record in the collection, i.e. exhaustive search. This is a heavy search that may be useful for an initial collection cleanup
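The two approaches differ mainly in cost: the first is linear in the collection size per new record, the second quadratic in the collection size. A minimal sketch, assuming records are opaque values and `is_duplicate` is a placeholder for whatever comparison procedure is eventually chosen:

```python
from itertools import combinations

def incremental_check(new_record, collection, is_duplicate):
    """Approach 1: compare a single new record against the collection, O(n)."""
    return [rec for rec in collection if is_duplicate(new_record, rec)]

def exhaustive_check(collection, is_duplicate):
    """Approach 2: compare every pair of records, O(n^2) -- initial cleanup."""
    return [(a, b) for a, b in combinations(collection, 2) if is_duplicate(a, b)]
```

Here `is_duplicate` stands in for the full pre-screening/thorough-comparison pipeline described below; equality on a normalized key would be the crudest possible instance.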

Regardless of the search approach selected, some aspects of the general problem are common to both.

The full procedure for the most accurate recognition of two duplicates can be costly in terms of time and money. The cost is especially significant when multiplied by the size of the collection, or even by its square in the case of exhaustive search.

To mitigate that cost, a lightweight procedure can be used to solve the comparison problem partially: exclude the pairs that are easily recognizable as non-duplicates.

If we use such a pre-screening procedure, then the general problem can be decomposed into 3 stages:

  1. Pre-screening: use a lightweight method to select all duplicate candidates

  2. Thorough comparison: apply a rigorous comparison procedure to each candidate pair to make the final resolution

  3. Merge: perform the merge procedure, leaving a single record with the desired content
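The three stages above can be wired together as a pipeline. A minimal sketch, assuming the stage implementations (`prescreen`, `confirm`, `merge`) are supplied by the caller; note this simple version does not re-check merged records against the rest of the collection:

```python
from itertools import combinations

def deduplicate(records, prescreen, confirm, merge):
    """Three-stage sketch: prescreen(records) cheaply yields candidate pairs,
    confirm(a, b) is the costly thorough comparison, merge(a, b) builds the
    single surviving record. Merged records are not re-checked here."""
    survivors = list(records)
    for a, b in prescreen(survivors):
        # Skip pairs whose members were already consumed by an earlier merge.
        if a in survivors and b in survivors and confirm(a, b):
            survivors.remove(a)
            survivors.remove(b)
            survivors.append(merge(a, b))
    return survivors
```

With toy stage functions (e.g. a first-letter prescreen and an anagram `confirm`), the pipeline collapses each confirmed pair into one record and leaves the rest untouched.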

Use of an LLM

An LLM is presumably too costly a tool for the pre-screening stage. Pre-screening should be solved with simpler algorithmic methods, e.g.:

  1. Calculate a metric defining the distance between any two MARC records on the basis of alphanumeric comparison, a thesaurus, etc.

  2. Use a threshold on the metric to distinguish non-duplicates from duplicate candidates
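A minimal sketch of such a metric-plus-threshold pre-screen, using only the standard library. The choice of identity-bearing fields (100 = author, 245 = title, 260 = publication), the flat dict representation of a record, and the 0.7 threshold are all assumptions for illustration, not validated values:

```python
from difflib import SequenceMatcher

# Assumed identity-bearing MARC tags; real pre-screening would need tuning.
KEY_TAGS = ("100", "245", "260")

def record_key(record):
    """Concatenate the key fields into one lowercase string (record = dict)."""
    return " ".join(record.get(tag, "").lower() for tag in KEY_TAGS)

def similarity(rec_a, rec_b):
    """Cheap alphanumeric similarity in [0, 1] via difflib's ratio."""
    return SequenceMatcher(None, record_key(rec_a), record_key(rec_b)).ratio()

def is_candidate(rec_a, rec_b, threshold=0.7):
    """threshold=0.7 is a placeholder value that would need tuning on real data."""
    return similarity(rec_a, rec_b) >= threshold
```

Applied to the GPT-4 examples further below, records 1 and 2 score well above an unrelated record, so they would survive pre-screening as candidates for the thorough comparison stage.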

This is just a hypothesis and it might be wrong: the similarity between duplicate records may be quite subtle, and an effective pre-screening metric might not be achievable without AI/LLM.

Thorough comparison of MARC records can presumably be solved with the help of an LLM.

The merge procedure can also make use of an LLM.
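For both LLM-backed stages, the model-independent part is prompt construction; the actual model call is deliberately omitted here, since any client API could be plugged in. The wording below is an illustrative assumption, not a validated prompt:

```python
def build_compare_prompt(rec_a, rec_b):
    """Prompt for the thorough-comparison stage. The fixed first-line verdict
    keeps the LLM's answer machine-parseable."""
    return (
        "You are an expert MARC cataloger. Decide whether the two "
        "bibliographic records below describe the same bibliographic item.\n"
        "First line of your answer: DUPLICATE or DISTINCT. "
        "Then justify your verdict field by field.\n\n"
        f"Record A:\n{rec_a}\n\nRecord B:\n{rec_b}\n"
    )

def build_merge_prompt(rec_a, rec_b):
    """Prompt for the merge stage, used only after a DUPLICATE verdict."""
    return (
        "The two MARC records below are confirmed duplicates. Produce a "
        "single merged record, preferring the more complete field values.\n\n"
        f"Record A:\n{rec_a}\n\nRecord B:\n{rec_b}\n"
    )
```

The prompts embed the raw record text, so they work equally for full MARC dumps or for the simplified field lists used in the examples below.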

Proof of Concept

  1. Research the definition of MARC record duplicates

  2. Find representative positive and negative examples, where positive examples are duplicate records and negative examples are non-duplicates that could be suspected of being duplicates

  3. Trial an LLM on thorough comparison of both positive and negative examples

  4. Trial an LLM on the merge procedure for positive examples

  5. Evaluate trial outcomes

GPT-4 Opinion

Below is a conversation with GPT-4:

User: Can you propose a definition of duplicate bibliographical MARC records? Can you also show some examples?

GPT-4: Definition of Duplicate Bibliographical MARC Records: Duplicate bibliographical MARC records refer to multiple MARC entries in a library catalog that represent the same bibliographic item, but might slightly differ in terms of metadata details such as cataloging source, copy-specific details, or minor variations in titling, authors, or publication data. Such duplicates can arise during data entry, catalog migration, or when integrating records from different sources.

The presence of duplicate bibliographic records can lead to inefficiencies in library operations, confusion among patrons, and inaccuracies in the catalog data. Identifying and merging these duplicates is crucial for maintaining a clean, user-friendly, and effective bibliographic database.

Examples of Duplicate Bibliographical MARC Records:

For illustration, consider the following simplified MARC records which represent the same book, but with slight variations:

  1. Example 1:

    • Leader: 01542cam a2200361 a 4500

    • 001: 78901234

    • 100: 1 $aSmith, John$eauthor

    • 245: 10$aIntroduction to Biology$bA comprehensive study$cJohn Smith

    • 260: $aNew York$bNature Press$c2010

    • 300: $a300 p.$bill., maps$c24 cm

  2. Example 2:

    • Leader: 01452cam a2200341 a 4500

    • 001: 78901235

    • 100: 1 $aSmith, John.$eauthor

    • 245: 10$aIntroduction to Biology$bComprehensive biology$cJohn Smith

    • 260: $aNew York$bNature Publications$c2010

    • 300: $a301 p.$bill., maps$c24 cm

  3. Example 3:

    • Leader: 01543cam a2200367 a 4500

    • 001: 78901236

    • 100: 1 $aSmith, John Robert$eauthor

    • 245: 10$aIntroduction to Biology:$ba comprehensive guide$cby John Robert Smith

    • 260: $aNY$bNature Publishing Group$c2010

    • 300: $a300 p.$bill., maps, charts$c24 cm

Analysis of Examples:

  • All records refer to the same work, "Introduction to Biology," authored by John Smith, published in 2010 in New York. However, the records have minor differences in:

    • 100 field (author's name slightly varies).

    • 245 field (title and subtitle differences).

    • 260 field (variation in publisher's name and abbreviation).

    • 300 field (slight differences in pagination and description).

These variations make each entry unique in cataloging terms but represent the same bibliographical item. Identifying such entries as duplicates and merging them into a single, accurate, and comprehensive MARC record can reduce redundancy and help in maintaining a streamlined catalog.

Efficiently managing duplicate MARC records requires careful examination of bibliographic elements, consistent application of cataloging rules, and potentially the use of automated tools or LLMs to identify and suggest merges based on the library's specific criteria and standards.
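The field-level differences GPT-4 lists in its analysis can be checked programmatically. A minimal sketch, with records modeled as plain tag-to-value dicts (a simplification of real MARC structure, where fields are repeatable and carry indicators and subfields):

```python
def differing_fields(rec_a, rec_b):
    """Return the sorted list of tags whose values differ between two records."""
    tags = set(rec_a) | set(rec_b)
    return sorted(tag for tag in tags if rec_a.get(tag) != rec_b.get(tag))
```

Run against Examples 1 and 2 above, this flags the Leader plus fields 001, 100, 245, 260, and 300 as differing, matching the analysis: every field varies slightly even though both records describe the same book.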

Risks of Comparison Mistakes

There is a category of permissible duplicates, for example:

  • Successive cataloguing entries of a serial record

  • Records in different languages. Only one record per language of descriptive cataloging is allowed as a permissible duplicate for each title

From https://knowledge.exlibrisgroup.com/Content/Knowledge_Articles/Alma/Alma_community_zone%3A_Match_logic%2C_duplicate_Bibliographic_records%2C_and_wrong_portfolios-Bibliographic_records_match: "For example, different editions of the same book, different parts of a series which all have the series name, and even resources which happen to have the same title but are not related"


Definitions

Duplicate records:

Two or more bibliographic records describing the same bibliographic item, ostensibly in the same manner.

https://www.itsmarc.com/crs/mergedProjects/editgde/editgde/definition_duplicate_records.htm

Bibliographic item:

An item or set of items in any physical form, either serial or nonserial, published, issued or treated as an entity and, as such, forming the basis for a single bibliographic description.  The bibliographic item is the entity referred to by the item identifier.  Examples of bibliographic items are:  a one-volume book, a set of maps, a musical score with parts, a compact digital disc, a multimedia kit, a manuscript collection, a microform journal, a videotape with an accompanying pamphlet, a looseleaf publication together with its serially-issued updates.

https://www.itsmarc.com/crs/mergedprojects/holdings/holdings/definition_bibliographic_item.htm

https://www.itsmarc.com/crs/mergedprojects/editgde/editgde/idh_sect_c7__ceg.htm

Publications on the topic

https://kb.osu.edu/bitstream/handle/1811/45259/RogersS_LibraryResourcesTechnicalServices_1993_v37_n1_p59-71.pdf?sequence=2