STORY: Punctuation

Description

Current situation: The presence of punctuation in some MARC fields affects the fingerprinting algorithms, so the system creates new resources for records that would otherwise match existing resources.

The purpose of this card is to apply normalization rules to remove punctuation from original MARC records to help prevent creating new resources unintentionally. This will help to keep the data graph smaller and cleaner.

The scope of this card is limited to applying punctuation normalization to MARC fields that have been fully mapped within the Linked Data Editor.

The JSON file - copied below - is an exhaustive list of normalization routines used in BiblioGraph.

Each line lists one or more punctuation marks between square brackets. The rule instructs the system to check the subfield that immediately precedes the named subfield in the MARC record for trailing punctuation. If that preceding subfield ends with any of the listed punctuation marks, strip the mark from the end of its value.

Example

  "240$f": ["."] → if the incoming MARC record has a 240 $f field, check the subfield immediately preceding the 240 $f for the presence of a trailing period (“.”). If the preceding subfield ends in a period, strip the period from the value of the preceding subfield.

NOTE: Always ignore the first subfield of each MARC tag.
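The preceding-subfield rule described above can be sketched roughly as follows. This is a minimal illustration, not the actual mod-linked-data or BiblioGraph implementation: the function name `strip_before`, the representation of a field as a list of `(code, value)` tuples, and the sample values are all assumptions. It also assumes that "ignore the first subfield" means a rule keyed on the first subfield is skipped (it has no preceding subfield), and that when several marks are listed the longest matching one is removed.

```python
def strip_before(subfields, tag, rules):
    """Strip trailing punctuation from the subfield preceding each ruled subfield.

    subfields: list of (code, value) tuples in record order (assumed shape).
    rules: dict such as {"240$f": ["."]}.
    """
    cleaned = list(subfields)
    # Start at index 1: the first subfield has nothing before it.
    for i in range(1, len(cleaned)):
        code = cleaned[i][0]
        marks = rules.get("{}${}".format(tag, code), [])
        prev_code, prev_value = cleaned[i - 1]
        # Try longer marks first so " --" is preferred over " -" (assumption).
        for mark in sorted(marks, key=len, reverse=True):
            if mark and prev_value.endswith(mark):
                cleaned[i - 1] = (prev_code, prev_value[: -len(mark)])
                break
    return cleaned


# Hypothetical 240 field: the period before $f is stripped.
field_240 = [("a", "Works."), ("f", "1990")]
print(strip_before(field_240, "240", {"240$f": ["."]}))
```

Only an exact trailing match is removed in this sketch; whether surrounding whitespace should also be trimmed is not stated in this card.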

Multiple rules could apply to any given subfield. Where there are multiple rules, apply the rules one at a time following this priority list:

    lookups = [
        'XX{}${}'.format(tag[2], code),
        '{}XX$X'.format(tag[0]),
        '{}XX${}'.format(tag[0], code),
        '{}X$X'.format(tag[:2]),
        '{}X${}'.format(tag[:2], code),
        '{}${}'.format(tag, code),
    ]
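To make the lookup patterns concrete, the sketch below wraps the same six format strings in a function and shows the keys they produce for one subfield. The function names `build_lookups` and `matching_rules` and the sample rule set are hypothetical; the card only gives the list itself, so how each matched rule is then applied is left out here.

```python
def build_lookups(tag, code):
    """Return the candidate rule keys for one subfield, in the listed order."""
    return [
        'XX{}${}'.format(tag[2], code),   # wildcard on the first two tag digits
        '{}XX$X'.format(tag[0]),          # tag block, any subfield
        '{}XX${}'.format(tag[0], code),   # tag block, this subfield
        '{}X$X'.format(tag[:2]),          # first two digits, any subfield
        '{}X${}'.format(tag[:2], code),   # first two digits, this subfield
        '{}${}'.format(tag, code),        # exact tag and subfield
    ]


def matching_rules(tag, code, rules):
    """Return (key, marks) pairs for the keys that exist in rules, in lookup order."""
    return [(key, rules[key]) for key in build_lookups(tag, code) if key in rules]


print(build_lookups("505", "t"))
# → ['XX5$t', '5XX$X', '5XX$t', '50X$X', '50X$t', '505$t']
```

For example, with a rule set containing only `"505$t"`, `matching_rules("505", "t", rules)` returns a single pair for that exact key.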

List of punctuation normalization rules, by MARC tag.


For the following MARC tags, remove the punctuation from the last subfield.
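A minimal sketch of the last-subfield rule, assuming the rule set is keyed by tag alone (for example "260": ["."], the entry quoted in the comments on this card). The function name `strip_last`, the `(code, value)` tuple representation, and the sample 260 values are hypothetical. The mark is removed only when the value actually ends with one of the listed marks, so a trailing semicolon is left alone when only a period is listed.

```python
def strip_last(subfields, tag, last_rules):
    """Strip a listed trailing mark from the last subfield of a field.

    subfields: list of (code, value) tuples in record order (assumed shape).
    last_rules: dict keyed by tag, such as {"260": ["."]}.
    """
    if not subfields:
        return []
    code, value = subfields[-1]
    # Longest mark first, so a longer listed mark wins over a shorter one (assumption).
    for mark in sorted(last_rules.get(tag, []), key=len, reverse=True):
        if mark and value.endswith(mark):
            return list(subfields[:-1]) + [(code, value[: -len(mark)])]
    return list(subfields)


rules = {"260": ["."]}
# Hypothetical 260 field ending in a period: the period is removed.
print(strip_last([("a", "New York :"), ("c", "1999.")], "260", rules))
# Ending in a semicolon instead: nothing is removed.
print(strip_last([("a", "New York :"), ("c", "1999;")], "260", rules))
```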

Environment

None

Potential Workaround

None

Attachments

14

Activity

Tetiana Kovalchuk January 9, 2025 at 4:30 PM

Tested on folio-testing-sprint-fs09000000.ci.folio.org

Build version: mod-linked-data-1.0.1-SNAPSHOT.88

Test cases and evidence attached.

Punnoose Kutty Jacob Pullolickal January 7, 2025 at 2:29 PM

Punctuation will be removed from the last subfield only if the last character of the last subfield’s value is the same as the specified punctuation.
In your example, the last character of the last subfield is a semicolon (;). However, the specified punctuation character is a period (.). Hence, the ; will not be removed.

The value shown in the UI screen you attached is expected.

Tetiana Kovalchuk January 7, 2025 at 9:43 AM

For the following MARC tags, remove the punctuation from the last subfield:

"260": ["."]

Andrei Bordak January 3, 2025 at 11:12 AM

Hi,

What’s the best way to test this story, given that resource creation on the SourceRecordDomainEvent event is currently disabled and we save admin metadata only?

Punnoose Kutty Jacob Pullolickal December 18, 2024 at 10:00 PM
Edited

Example 3: https://id.loc.gov/resources/instances/21122663.marcxml.xml

Original 505:

Note: Some of the $t subfields in the original MARC record are omitted in this example.

Normalization process:

  • The “ --” at the end of every $t except the last one will be removed by rule # 95 ("505$t": [" -", " --", " -- ", " ;"]) in the first JSON.

  • The “.” at the end of the last $t will be removed by rule # 15 ("505": ["."]) in the second JSON.

Normalized 505:

Done

Created December 17, 2024 at 10:37 PM
Updated March 4, 2025 at 7:51 PM
Resolved January 9, 2025 at 9:30 PM