2024-12-10 Better Sample Data Meeting notes
Date
Dec 10, 2024
Participants
@Yogesh Kumar, @Lee Braginsky, @Charlotte Whitt, @Kristin Martin (regrets), @Autumn Faulkner (regrets), @Tod Olson, @Shelley Doljack
Goals
Follow up on the status of discussion topics and tasks
Discussion topics
Time | Item | Notes |
---|---|---|
|
Follow up on activities since last meeting:
| Finalized the letter for the MSU dean. @Autumn Faulkner- Update on Slack: “I have passed along the letter to our Dean who will take it to the Risk Management office. Hope to have updates for you soon but these processes generally drag on forever so we'll see.”
Questions from @Shelley Doljack : Some questions that came to mind today are:
Lee’s group will be tied up with Ramsons and Sunflower work. Lee suggests focusing on one area, e.g. Patron and user group data. A tool can be used to fake the data (names, phone numbers, addresses: all PII data), and several scrambling tools are available. Lee met with Shelley yesterday. The work is estimated at 2 developers for 3-4 sprints plus a part-time tester. Shelley will be able to contribute a maximum of 10 hours a week. Shelley has previously worked in Python. Tod mentioned that Chicago could maybe contribute a Python developer too; will come back to this in the new year. Shelley, Yogesh, Noah, and Lee will meet every Monday. A short update from these Monday meetings will be added to this agenda.
Shelley would need to have the technical requirements written up. Will start with the document provided by Lee and his project on the POC. Contributing institutions will need to stand up a second test environment. Tod explained that Chicago would store the data in a local PostgreSQL database and then run the anonymization. Can maybe use the test and migration script. This would be a conversation with the hosting provider (here Index Data). Then extract the anonymized data set. Eventually the tool will need to run in a test environment. Shelley asked about the tenant and the tenant IDs: is the reference environment to be a multi-tenant environment (Chicago, Stanford, MSU)? The current Bugfest environment is a single-tenant environment. Tod pointed out that data from all three institutions would cause inconsistency in the use of reference data. Tod is thinking specifically of locations, but also of the use of item material types. Shelley asked if a multi-tenant environment would mean that each institution had its own reference data. Would a solution be to have multiple stand-alone environments (A, B, and C)? Yogesh confirmed. Merging data would be phase 2.
a. Golden Copy / Yogesh Kumar
Golden copy has ~8-9 million instances. For performance testing (the PTF testing), they have a larger environment with 20 million instances. The environment was created by the Kitfox team and updated by QA. Yogesh's QA team removed a number of duplicates in the Ramsons environment.
b. Snapshot environment
Update from Charlotte Whitt: Still waiting for the monthly build environment https://folio-quesnelia.dev.folio.org/ to be built. The Jira ticket for creating https://folio-quesnelia.dev.folio.org/ is FOLIO-4071. Any update? Yes, the ticket is now In progress.
Status: Autumn has provided sample data for music records and serials records with multiple holds and multiple items. Holdings statements to be added. These records are loaded to this group's shared drive.
Write up Data Import Job profiles - Kristin Martin? KEM: once reference environment is available, I will work on DI profiles.
Data Import Job Profile which can import ~100 bibs in MARC 21 and create instance, holdings, item (corresponding to the locations we have set up).
MM-SIG eyes on the 100 bibs, to check that these records have the right mix of misc. types to cover the basics, incl. bound-with.
Stanford has MFHD-formed holdings data in the current export workflow. Alissa Hafele mentioned that Stanford could probably provide these data. Maybe ~25-50 examples - In progress. Any update? Alissa: I have put data in the drive in a new folder - 30 instances with corresponding MARC Holdings records and FOLIO Holdings records (can be identified by lack of $s in 999). Records still include Stanford specific locations. Also included is a link to spreadsheet with reasons for picking the records. Just let me know if there are any questions or if we need to tweak the export format!
Then later we can move on to Order/Order lines data. We will ask the Acquisition SIG (Kristin Martin will take the lead on this) and the ERM-SIG to review the entered data - waiting for initial env. and data to be added.
c. Anonymization script
@Lee Braginsky Can we make the anonymized POC slide deck available in the WG folders? Is the POC code in Github? Yogesh to follow up. https://folio-org.atlassian.net/wiki/spaces/DQA/pages/409108481/Data+Anonymization+Project
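A minimal sketch of the kind of PII scrambling discussed for Patron data, assuming Python and only the standard library. It is not the POC code. The field names (`personal`, `firstName`, `lastName`, `email`, `phone`, `addresses`, `username`, `barcode`, `patronGroup`) are assumptions about the shape of a user record, and `anonymize_user`, the name pools, and the salt are hypothetical. Reference data such as `patronGroup` is left untouched so it stays consistent with the environment.

```python
import hashlib

# Hypothetical name pools; a real run would more likely use a dedicated fake-data tool.
FIRST_NAMES = ["Alex", "Sam", "Jordan", "Robin", "Casey", "Morgan"]
LAST_NAMES = ["Reader", "Borrower", "Patron", "Scholar", "Visitor", "Member"]


def _digest(value: str, salt: str) -> int:
    """Map a value to a stable integer so the same input always scrambles the same way."""
    return int(hashlib.sha256((salt + value).encode("utf-8")).hexdigest(), 16)


def anonymize_user(user: dict, salt: str = "change-me") -> dict:
    """Return a copy of a user record with PII replaced by fake values.

    The record id and reference data (e.g. patronGroup) are kept as-is.
    """
    seed = _digest(user.get("id", ""), salt)
    tag = f"{seed % 10**8:08d}"  # 8-digit string reused for all fake values

    personal = dict(user.get("personal", {}))  # shallow copy, original is not mutated
    personal.update({
        "firstName": FIRST_NAMES[seed % len(FIRST_NAMES)],
        "lastName": LAST_NAMES[(seed // len(FIRST_NAMES)) % len(LAST_NAMES)],
        "email": f"user{tag}@example.org",
        "phone": f"555-{tag[:4]}",
    })
    if "addresses" in personal:
        # Keep non-PII keys (e.g. address type), overwrite the identifying parts.
        personal["addresses"] = [
            {**addr,
             "addressLine1": f"{int(tag[4:]) + 1} Example St",
             "city": "Sampletown",
             "postalCode": "00000"}
            for addr in personal["addresses"]
        ]

    result = dict(user)
    result["personal"] = personal
    result["username"] = f"user{tag}"
    result["barcode"] = tag
    return result
```

Deriving the fake values from a salted hash of the record id keeps the output repeatable between runs, which may help when the same extract is anonymized more than once; the salt would need to be kept private by each contributing institution.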
|
|
|
Nothing new to report |
|
| ?
|
Action items
Decisions
Lee, Yogesh, and Shelley will inform the working group on the talks and progress on developing the anonymization tool. Lee, Yogesh, Shelley, and Noah meet every Monday.