2024-12-10 Better Sample Data Meeting notes

🗓 Date

10 Dec 2024

👥 Participants

Yogesh Kumar, Lee Braginsky, Charlotte Whitt, Kristin Martin, Autumn Faulkner, Tod Olson, Shelley Doljack

🥅 Goals

Follow up on the status of discussion topics and tasks

🗣 Discussion topics

Time

Item

Notes

Letter to Risk Office at Michigan State University Libraries

Stanford is working on sample authority data, which should be ready soon for the snapshot environment.

Follow up on activities since last meeting:

Update on the Golden copy for the Bugfest environment.
Update on the work on a small data set for FOLIO Snapshot.
Update on POC findings: write robust data anonymization scripts.

Finalized the letter for the MSU dean.

Autumn Faulkner, update on Slack: “I have passed along the letter to our Dean who will take it to the Risk Management office. Hope to have updates for you soon but these processes generally drag on forever so we'll see.”

Questions from Shelley Doljack:

Some questions that came to mind today are:

anonymize all data? or specific data?
anybody have inventory data or MARC data that's not supposed to be shared? (e.g. not allowed to export to shared bibliographic utilities?) Does anonymizing matter?
donor or bookplate data in MARC tags? Or other inventory record fields?
any donor info at all?
are circulation rules supposed to be anonymized? When shuffling inventory data to preserve loan patterns, how do we ensure that open loans adhere to the circ rules for testing things like bills, notices, aging to lost, etc.?
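One possible way to approach the last question, offered only as a sketch and not as the POC's method or a group decision: permute descriptive data only among items that share the attributes circulation rules depend on, so that an open loan still points at an item with the same rule-relevant values. The field names below (`title`, `materialType`, `loanType`, `location`) are hypothetical and do not reflect the actual FOLIO item schema.

```python
import random
from collections import defaultdict

# Hypothetical attributes that circulation rules are assumed to match on.
RULE_KEYS = ("materialType", "loanType", "location")


def shuffle_within_rule_groups(items, payload_key="title", seed=0):
    """Return copies of items with payload_key permuted inside each rule group.

    Item ids and the RULE_KEYS values are left untouched, so loans that
    reference an item id still resolve to the same rule-relevant attributes.
    """
    rng = random.Random(seed)

    # Group item positions by their rule-relevant attributes.
    groups = defaultdict(list)
    for index, item in enumerate(items):
        groups[tuple(item[key] for key in RULE_KEYS)].append(index)

    # Work on copies so the input records are not modified.
    result = [dict(item) for item in items]
    for indices in groups.values():
        payloads = [items[i][payload_key] for i in indices]
        rng.shuffle(payloads)
        for i, payload in zip(indices, payloads):
            result[i][payload_key] = payload
    return result
```

A group containing a single item is left unchanged by this approach, so very small groups would need a different treatment (for example, replacing the value with synthetic text).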

Alissa Hafele also contributed authority-controlled data (see the Slack conversation). It has been added to the Google Drive.

a. Golden Copy / Yogesh Kumar

The Golden copy has ~8-9 million instances.

For PTF performance testing, there is a larger environment with 20 million instances.

The environment was created by the Kitfox team and updated by QA. Yogesh's QA team removed a number of duplicates in the Ramsons environment.

b. Snapshot environment. Update from Charlotte Whitt: Still waiting for the monthly build environment (https://folio-quesnelia.dev.folio.org/) to be built.

The Jira ticket for creating https://folio-quesnelia.dev.folio.org/ is FOLIO-4071.

Any update? No

Status:

Autumn has provided sample data for music records and serials records with multiple holdings and multiple items. Holdings statements to be added.

These records are loaded to this group's shared drive.

Write up Data Import Job profiles - Kristin Martin?

KEM: Once the reference environment is available, I will work on DI profiles.

A Data Import job profile that can import ~100 bibs in MARC 21 and create instance, holdings, and item records (corresponding to the locations we have set up). MM-SIG to review the 100 bibs and confirm that these records have the right mix of types to cover the basics, including bound-with.

Stanford has MFHD-formatted holdings data in the current export workflow. Alissa Hafele mentioned that Stanford could probably provide these data, maybe ~25-50 examples. In progress. Any update? Alissa: I have put data in the drive in a new folder: 30 instances with corresponding MARC Holdings records and FOLIO Holdings records (can be identified by lack of $s in 999). Records still include Stanford-specific locations. Also included is a link to a spreadsheet with reasons for picking the records. Just let me know if there are any questions or if we need to tweak the export format!
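As a rough illustration of the "lack of $s in 999" check, the sketch below splits records by whether any 999 field carries a subfield `s`. It assumes the records have been converted to a MARC-in-JSON style dictionary (a `fields` list of single-key dicts); the actual export format in the drive may differ. Reading the group without `$s` as the FOLIO Holdings records is taken from Alissa's note and should be confirmed against the data.

```python
def has_999_subfield_s(record):
    """Return True if any 999 field in the record has a subfield 's'."""
    for field in record.get("fields", []):
        data = field.get("999")
        # Data fields are dicts with a 'subfields' list; skip anything else.
        if isinstance(data, dict):
            if any("s" in subfield for subfield in data.get("subfields", [])):
                return True
    return False


def split_by_999_s(records):
    """Partition records into (with $s in 999, without $s in 999)."""
    with_s, without_s = [], []
    for record in records:
        (with_s if has_999_subfield_s(record) else without_s).append(record)
    return with_s, without_s


# Placeholder records; the identifier values are made up for illustration.
example_records = [
    {"fields": [{"999": {"ind1": "f", "ind2": "f",
                         "subfields": [{"i": "holdings-id-1"}, {"s": "srs-id-1"}]}}]},
    {"fields": [{"999": {"ind1": "f", "ind2": "f",
                         "subfields": [{"i": "holdings-id-2"}]}}]},
]
with_s, without_s = split_by_999_s(example_records)
```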

Later we can move on to Order/Order lines data. We will ask the Acquisition SIG (Kristin Martin will take the lead on this) and the ERM-SIG to review the entered data. Waiting for the initial environment and data to be added.

c. Anonymization script

See also Shelley Doljack's questions above.

Update from Lee Braginsky:

Lee presented this to the Community Council, which resulted in an engaged discussion. Lee has received a positive response from a student at the University of Colorado who would be willing to work on this. The need is for 2 developers from the community, with Java/SQL skills and knowledge of the FOLIO Schematool, to anonymize the data over 2 sprints. Alternatively, the work could be done in Python.

When Stanford is ready to deliver data, Yogesh and Lee can set up a meeting. Alissa will check with her colleagues.

Lee Braginsky: Can we make the anonymization POC slide deck available in the WG folders? Is the POC code in GitHub? Yogesh to follow up. /wiki/spaces/DQA/pages/409108481

Review timeline document

Draft timeline document

Nothing new to report

Other topics

?

✅ Action items

Charlotte Whitt is following up with DevOps to get the Q environment built: https://folio-quesnelia.dev.folio.org/ (right now this link does not work). The Jira ticket is FOLIO-4071, Create Quesnelia reference environment. The latest comment is from 11/12/2024.

Yogesh Kumar will update the Golden copy (Bugfest) track in the Timeline doc
Charlotte Whitt will update the FOLIO Snapshot track in the Timeline doc
Lee Braginsky will update the track for scripts to anonymize the data set.
Autumn Faulkner: We need to finalize the Google Doc with the revised version of the letter to be brought to the
Lee Braginsky will publicize the developer requirements in the #folio-implementers Slack channel.