2024-10-29 Better Sample Data Meeting notes

🗓 Date

15 Oct 2024

👥 Participants

Yogesh Kumar, Lee Braginsky, Charlotte Whitt, Kristin Martin, Autumn Faulkner (NA), Alissa Hafele (regrets), Tod Olson

🥅 Goals

Follow up on the status of discussion topics and tasks

🗣 Discussion topics

Time

Item

Notes

Letter to Risk Office at Michigan State University Libraries

Stanford is working on sample authority data to be ready soon for the snapshot environment.

Follow up on activities since last meeting:
1. Update on - Golden copy for Bugfest environment.
2. Update on - Work on a small data set for FOLIO Snapshot.
3. Update on POC findings - write robust data anonymization scripts

Finalize the letter for the MSU dean.

The topic was on the agenda for today’s meeting with the OLF Officers. The plan was to attend and present the purpose of the letter and our draft document.

We missed it. Autumn Faulkner is to forward the letter to Simeon Warner, and then we will have the conversation on Slack instead.

Kristin sent FOLIO and OLF images for the letterhead.

Autumn Faulkner: We need to finalize the Google Doc with the revised version of the letter to be brought to the

Identified MARC-formatted holdings records; we still need an instance to go with them.

a. Golden Copy / Yogesh Kumar

The environment was created by the Kitfox team and updated by QA. The Kitfox team is now working on fixing the issues listed here /wiki/spaces/DQA/pages/203685917

All work is done by now. Writing scripts to clean up duplicates is not easy, and teams are busy wrapping up Ramsons features. Writing such scripts may be possible in the Sunflower release. For Ramsons, we will have to clean up manually, if needed.

Charlotte can check with Charles Ledvina on how he would recommend identifying duplicates.
Update: Charles Ledvina dedupes records based on the legacy system's control number. In his words: "Since I use the legacy control number as HRID, if the HRID has already been seen, then the record gets rejected."

In Bugfest, depending on the source of the records, the legacy control number would most likely be the OCLC number. Can we dedupe on the OCLC number?
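Illustration only: the rule described above (reject a record when its HRID has already been seen) and the open question about OCLC numbers could look roughly like the sketch below. This is not Charles's actual script. The dict record shape, the field names `hrid` and `oclc`, and the helpers `normalize_oclc` and `dedupe` are assumptions made up for the example. Records without a usable key are kept here, which is also an assumption.

```python
import re
from typing import Callable, Iterable, Optional

# Hypothetical record shape: flat dicts with "hrid" and "oclc" keys.
Record = dict

# Common OCLC spellings: optional "(OCoLC)", optional ocm/ocn/on prefix, leading zeros.
_OCLC_PATTERN = re.compile(
    r"\s*(?:\(OCoLC\))?\s*(?:ocm|ocn|on)?\s*0*(\d+)\s*", re.IGNORECASE
)


def normalize_oclc(raw: Optional[str]) -> Optional[str]:
    """Reduce an OCLC number string to bare digits, or None if it does not parse."""
    if not raw:
        return None
    match = _OCLC_PATTERN.fullmatch(raw)
    return match.group(1) if match else None


def dedupe(records: Iterable[Record], key: Callable[[Record], Optional[str]]):
    """Keep the first record per key; reject later records whose key was already seen."""
    seen = set()
    kept, rejected = [], []
    for record in records:
        k = key(record)
        if k is not None and k in seen:
            rejected.append(record)
            continue
        if k is not None:
            seen.add(k)
        kept.append(record)  # assumption: records without a key are kept
    return kept, rejected


sample = [
    {"hrid": "in001", "oclc": "(OCoLC)ocm00012345"},
    {"hrid": "in002", "oclc": "ocm12345"},
    {"hrid": "in001", "oclc": "(OCoLC)99999"},
    {"hrid": "in003", "oclc": None},
]

# Dedupe on HRID (the legacy control number in the approach described above).
kept_by_hrid, rejected_by_hrid = dedupe(sample, key=lambda r: r["hrid"])

# Dedupe on normalized OCLC number instead.
kept_by_oclc, rejected_by_oclc = dedupe(sample, key=lambda r: normalize_oclc(r["oclc"]))
```

With this sample, HRID dedupe rejects the second `in001` record, while OCLC dedupe rejects `in002` because its number normalizes to the same digits as the first record. The two keys can disagree, so the choice of key matters for Bugfest.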

b. Snapshot environment. Update from Charlotte Whitt: Still waiting for the monthly build environment (https://folio-quesnelia.dev.folio.org/) to be built.

Update: Charlotte is trying to move this forward. The status of the work is still Open (To do).

Recruit members from MM-SIG - extra eyes, review the records updated in FOLIO Instance = FOLIO - Still waiting on env.

Status:

Autumn has provided sample data for music records and serials records with multiple holdings and multiple items. Holdings statements to be added.

These records are loaded to this group's shared drive.

Write up Data Import Job profiles - Kristin Martin?

A Data Import job profile that can import ~100 bibs in MARC 21 and create instances, holdings, and items (corresponding to the locations we have set up). MM-SIG to put eyes on the 100 bibs and confirm that these records have the right mix of miscellaneous types to cover the basics, including bound-with.

Stanford has MFHD-formatted holdings data in the current export workflow. Alissa Hafele mentioned that Stanford could probably provide these data. Maybe ~25-50 examples. In progress.

Then later we can move on to Order/Order lines data. We will ask the Acquisition SIG (Kristin Martin will take the lead on this) and the ERM-SIG to review the entered data. Waiting for the initial env. and data to be added.

c. Anonymization script

Update: from Lee Braginsky

Right now, we have a POC. Lee is requesting that 2 developers from the community help for 2 sprints to write a Java/SQL and FOLIO schema tool to anonymize the data.

Tod suggests publicizing the developer requirements in the #folio-implementers Slack channel.

When Stanford is ready to deliver data, Yogesh and Lee can set up a meeting. Alissa will check with her colleagues.

Lee Braginsky: Can we make the anonymization POC slide deck available in the WG folders? Is the POC code in GitHub? Yogesh to follow up.
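Illustration only, for readers unfamiliar with what an anonymization script does: one common approach is keyed pseudonymization, where each sensitive value is replaced by a stable token derived from a secret. The sketch below is not the POC and is not the planned Java/SQL tool. These notes do not say which fields are sensitive, so the field names in `PERSONAL_FIELDS`, the flat dict record shape, and the helpers `pseudonymize` and `anonymize_record` are all placeholders.

```python
import hashlib
import hmac

# Placeholder field names; a real FOLIO record has a different, richer schema.
PERSONAL_FIELDS = ("lastName", "firstName", "email", "barcode")


def pseudonymize(value: str, secret: bytes, length: int = 12) -> str:
    """Map a value to a stable hex token using HMAC-SHA256 keyed with `secret`."""
    digest = hmac.new(secret, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:length]


def anonymize_record(record: dict, secret: bytes) -> dict:
    """Return a copy of a flat record with the placeholder personal fields tokenized."""
    cleaned = dict(record)  # shallow copy is enough for a flat dict
    for field in PERSONAL_FIELDS:
        if cleaned.get(field):
            cleaned[field] = pseudonymize(str(cleaned[field]), secret)
    return cleaned


secret = b"example-secret-do-not-reuse"  # assumption: kept outside the data set
patron = {"lastName": "Example", "email": "person@example.org", "patronGroup": "staff"}
masked = anonymize_record(patron, secret)
```

Because the same input and secret always give the same token, links between records that share a value survive anonymization, while the original value cannot be read back without guessing it and knowing the secret.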

Review timeline document

Draft - timeline document -

We will split it up into the three tracks listed above.

Yogesh will update the Golden copy (Bugfest) section.

Charlotte will update the FOLIO Snapshot section (stuck until we get the environment up and running)

Lee will update the timeline doc for his work on anonymization scripts.

Other topics

?

✅ Action items

Charlotte Whitt to talk with Index Data’s developers to get a Quesnelia reference environment similar to the one we have at https://folio-orchid.dev.folio.org/. This environment should have all rebuilds paused until we have the data set captured.

Quick update on the build of the environment https://folio-quesnelia.dev.folio.org/ (right now this link does not work): Index Data’s DevOps will spin up this environment, hopefully in October. Charlotte is working on ensuring that the environment can stay persistent while we gather our data.

Yogesh Kumar will update the Golden copy (Bugfest) track in the Timeline doc
Charlotte Whitt will update the FOLIO Snapshot track in the Timeline doc
Lee Braginsky will update the track for Scripts to anonymize data set.