2019-06-18 f2f data migration subgroup meeting notes

Date

Attendees

Goals

Discussion items

Time | Item | Who | Notes

 Definitions

Data migration tooling is about getting data into a new FOLIO system (not ongoing record import maintenance, or ongoing patron loads). 


General Approach

Outlined at Wolfcon: https://docs.google.com/document/d/1jo2UMDtOKSjBxkXiG2KlqSK097j5U23HIPQSEpAX6fg

Data migration is an ETL (extract, transform, load) process; the extract and transform steps are mostly in the hands of the libraries doing the implementation. The one exception is MARC records (mapping MARC to instances). The approach is to provide or share tooling for the load step of the ETL process where it is generalizable.
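
As a rough sketch of what a shared load step could look like (assuming records have already been extracted and transformed into FOLIO-shaped JSON, one record per line, and assuming the standard Okapi headers and the mod-inventory-storage /instance-storage/instances endpoint):

    import json
    import requests  # third-party HTTP client (pip install requests)

    # Assumed values; replace with your own Okapi gateway, tenant, and token.
    OKAPI_URL = "https://okapi.example.edu"
    HEADERS = {
        "x-okapi-tenant": "diku",
        "x-okapi-token": "<token>",
        "Content-Type": "application/json",
    }

    def load_instances(path):
        """Load step only: post already-transformed instance JSON, one record per line."""
        with open(path) as f:
            for line in f:
                record = json.loads(line)
                resp = requests.post(f"{OKAPI_URL}/instance-storage/instances",
                                     headers=HEADERS, json=record)
                resp.raise_for_status()

The same shape would apply to holdings, items, users, loans, and so on, just with different endpoints and record schemas.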


Requirements

There are requirements articulated in a few places already:

https://folio-org.atlassian.net/browse/UXPROD-850

https://docs.google.com/document/d/1cMmKSJ2L8OqSSeMZ0aIxmuQn4N-q76oFax6S0Q2xp9k

https://docs.google.com/document/d/1oXbEE48zd889lGD87dP7cF3GfuuKwllI_MDp2zGwTRg

Work has not started on many of the issues under UXPROD-850.

Group generated a list of things to ask for:

  • Bulk APIs, including business logic module support
  • Error reporting w/bulk import
  • Provide response body JSON schemas through the APIs (schemas for data types going in are already available this way)
  • Matching/de-duplication tools (a rough sketch follows this list)
  • A way to bring sequencing from old system into FOLIO (could rely on hrid which is unique in the schema already--so possibly already provided)
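
As a hedged illustration of the matching/de-duplication ask, a minimal version might reduce each record to a normalized match key and keep one record per key; real matching rules (multiple identifiers, fuzzy titles, merge logic) would be considerably more involved, and the field names below are assumptions:

    # Rough illustration of matching/de-duplication on a normalized match key
    # (here: title + first ISBN). Field names are assumed, not a FOLIO schema.
    def match_key(record):
        title = (record.get("title") or "").lower().strip()
        isbns = record.get("isbn") or []
        return (title, isbns[0] if isbns else "")

    def dedupe(records):
        seen = {}
        for rec in records:
            key = match_key(rec)
            if key not in seen:
                seen[key] = rec  # first occurrence wins
            # else: a real tool would merge the records or report the duplicate
        return list(seen.values())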

Comment on bulk APIs: they would still be useful. You need to touch every record either way; using bulk APIs means you touch them before loading them into FOLIO.
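
To illustrate the difference, a loader built against a bulk endpoint would chunk records rather than posting them one at a time. The endpoint path and payload key below (/instance-storage/batch/synchronous, "instances") are assumptions and would need to be checked against the module's actual interface:

    import requests

    BATCH_SIZE = 500  # arbitrary chunk size

    def load_in_batches(okapi_url, headers, records):
        """Sketch of posting records in chunks to an assumed synchronous batch endpoint."""
        for i in range(0, len(records), BATCH_SIZE):
            batch = records[i:i + BATCH_SIZE]
            resp = requests.post(f"{okapi_url}/instance-storage/batch/synchronous",
                                 headers=headers,
                                 json={"instances": batch})  # payload shape assumed
            resp.raise_for_status()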


Existing Tools

Discussion of direction for data migration tooling in next few quarters:

TAMU is working on using mod-camunda/workflows to do data import.

Things that are generalizable would be good candidates for shared tooling. For example, everyone will need a way to persist things like open loan IDs during migration.
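
A minimal sketch of that kind of persistence, assuming nothing more than a local SQLite file mapping legacy IDs to FOLIO UUIDs (table and column names are made up):

    import sqlite3

    conn = sqlite3.connect("migration_state.db")
    conn.execute("""CREATE TABLE IF NOT EXISTS id_map (
                        record_type TEXT,
                        legacy_id   TEXT,
                        folio_uuid  TEXT,
                        PRIMARY KEY (record_type, legacy_id))""")

    def remember(record_type, legacy_id, folio_uuid):
        """Record e.g. an open loan's legacy ID and the UUID it was loaded under."""
        conn.execute("INSERT OR REPLACE INTO id_map VALUES (?, ?, ?)",
                     (record_type, legacy_id, folio_uuid))
        conn.commit()

    def lookup(record_type, legacy_id):
        row = conn.execute("SELECT folio_uuid FROM id_map "
                           "WHERE record_type = ? AND legacy_id = ?",
                           (record_type, legacy_id)).fetchone()
        return row[0] if row else None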

The RAML Module Builder has support for bulk import API endpoints, but few if any modules implement this feature so far. The group would like modules to support bulk data loading capabilities.

Error reporting is another important component--some way to manage incomplete loads.
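
A simple version of this, assuming per-record posts like the loader sketched above, is to write every failure to a file that can be reviewed and replayed later:

    import json
    import requests

    def post_with_error_log(url, headers, records, error_path="failed_records.jsonl"):
        """Post records one by one and log failures for later review or retry."""
        failed = 0
        with open(error_path, "w") as errs:
            for record in records:
                resp = requests.post(url, headers=headers, json=record)
                if not resp.ok:
                    failed += 1
                    errs.write(json.dumps({"status": resp.status_code,
                                           "error": resp.text,
                                           "record": record}) + "\n")
        print(f"{failed} of {len(records)} records failed; see {error_path}")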

A few possible paths forward:

  • Get behind the toolkit from EBSCO; use it and ask for features (from Theodor)
  • Look at the Bywater migration tool and see if that's something the project would want to invest in
  • Ask the project to encapsulate existing data loading tools in a module (the module would expose an API where files can be posted and would handle loading into, say, inventory)

Bywater's toolkit is designed to get data ready to go into Koha. Using it would require development to 1) adapt it to FOLIO data structures and 2) do the actual data loading.

Would it be good to have a module for data loading, or would it be better to have a set of command line tools?

One benefit of having a module is that it is very visible, which could help with adoption if people could see that a tool was available.

Many sysadmins are already comfortable with CLI tools.

If the command line solution is easier and faster to implement, then it could be the way to go for the early implementers. A module could be developed later (perhaps based on some of the same logic).
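
For example, a command line front end could be little more than an argument parser around the loader logic sketched earlier; the flags and record types here are illustrative, not an agreed interface:

    import argparse

    # Hypothetical CLI wrapper; the loader it calls is the kind sketched above.
    def main():
        parser = argparse.ArgumentParser(description="Load transformed records into FOLIO")
        parser.add_argument("file", help="newline-delimited JSON of transformed records")
        parser.add_argument("--okapi-url", required=True)
        parser.add_argument("--tenant", required=True)
        parser.add_argument("--token", required=True)
        parser.add_argument("--record-type", default="instances",
                            choices=["instances", "holdings", "items"])
        args = parser.parse_args()
        # ... call a loader such as load_instances(args.file) here ...

    if __name__ == "__main__":
        main()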

There is also a desire for an API endpoint where one can post MARC and get back instance records.
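
No such endpoint exists today; purely as an illustration of the requested interaction (post raw MARC, receive instance JSON back), where the path and content type below are made up:

    import requests

    # Hypothetical only: this endpoint does not exist; it just sketches the ask.
    def marc_to_instances(okapi_url, headers, marc_path):
        with open(marc_path, "rb") as f:
            resp = requests.post(f"{okapi_url}/marc-to-instance",  # made-up path
                                 headers={**headers,
                                          "Content-Type": "application/octet-stream"},
                                 data=f)
        resp.raise_for_status()
        return resp.json()  # expected: instance records derived from the MARC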

Workflow (mod-camunda) could also be used to develop import workflows and set up complex interactions; bulk update capabilities would also be desirable for this. The idea of using workflows is fairly new and not yet documented. This work is happening at TAMU.

Import concerns other than bulk support: de-duplication, making sure data types are properly linked, how to mint UUIDs and make sure they line up (some of this overlaps with the transform step of the ETL pipeline), and error reporting.
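
On minting UUIDs that line up: one common approach, offered here only as a sketch, is to derive them deterministically from legacy identifiers with uuid5, so that instances, holdings, and items generated in separate passes still reference each other correctly:

    import uuid

    # The namespace UUID is arbitrary; a library would pick its own and keep it fixed
    # for the whole migration so repeated runs mint the same UUIDs.
    NAMESPACE = uuid.UUID("8405ae4d-b315-42e1-918a-d1919900cf3f")

    def mint_uuid(record_type, legacy_id):
        return str(uuid.uuid5(NAMESPACE, f"{record_type}:{legacy_id}"))

    # The same legacy ID always yields the same FOLIO UUID:
    assert mint_uuid("instance", "b1234567") == mint_uuid("instance", "b1234567")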

Action items

  • patty.wanninger to invite TAMU to show their work with workflows at a future meeting