2019-04-29 - Data Migration Subgroup Agenda and Notes

Date

2019-04-29 at 11 EST

Discussion items

Time | Item | Who | Notes
0 | New meeting link | Dale | Please note: we will be meeting using a new Zoom link going forward: https://zoom.us/j/204980147 The old one expired and is no longer available.
5 | Welcome | Dale

Welcome and request for someone to take notes.

25 | Inventory demo | Dale

Dale gave a talk on how Chicago is doing their migration to Folio, focusing on migration to inventory.
They use Python 3 to script their migration; the language is a good fit for the tasks, and the REPL is very useful for bottom-up programming.
Dale emphasized how useful cleanup of legacy data is to the migration process. Clean data makes it much easier to conceptualize what transformations need to occur, check your data against your understanding of the transformation, and revise. Bad data confuses things because you are no longer dealing with classes of data, but with errant individual values.
Data migration is done most usefully in iterative cycles, where you can get through at least one cycle per day. That way you can make revisions, review them, and make new revisions while the ideas are fresh.
To do this, data migration needs to be fast. Chicago's inventory ETL takes 4 hours to load everything. That exact speed is not required, but being able to complete a cycle in a single day is important for a productive workflow.
To make things fast at Chicago, Dale codes his data structures in Python as dictionaries, which are extremely fast and lightweight.
Data is then processed in batches of 1000 and saved directly to the db. Batching minimizes db commits, which are time consuming.
He also uses the bulk loading capabilities of the db for further speedup.
He could use the Postgres COPY command, but COPY lacks good error checking and handling, so he uses insert-many functionality instead.
If a batch throws an error, he drops down and processes the records individually to associate the error with particular records, whose ids he then logs for future processing.
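A minimal sketch of this batch-with-fallback pattern, assuming psycopg2 against a Postgres table named instance with id and jsonb columns (the table, columns, and batch size are illustrative, not Chicago's actual schema):

    import logging

    import psycopg2
    from psycopg2.extras import execute_values

    log = logging.getLogger("migration")
    BATCH_SIZE = 1000  # illustrative batch size


    def save_batch(conn, rows):
        """Save a batch with one multi-row INSERT; on failure, retry row by row
        so each error can be tied to a particular record and its id logged.

        rows: list of (id, json_string) tuples.
        """
        try:
            with conn.cursor() as cur:
                execute_values(cur, "INSERT INTO instance (id, jsonb) VALUES %s", rows)
            conn.commit()
        except psycopg2.Error:
            conn.rollback()
            for row in rows:
                try:
                    with conn.cursor() as cur:
                        cur.execute("INSERT INTO instance (id, jsonb) VALUES (%s, %s)", row)
                    conn.commit()
                except psycopg2.Error as err:
                    conn.rollback()
                    log.error("record %s failed: %s", row[0], err)  # id kept for later processing

The caller accumulates rows and calls save_batch once per 1000 records, which keeps commits (the expensive part) to one per batch in the normal case.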

Program design for inventory migration:
Create a machine-readable version of the MM SIG's MARC-to-instance mapping spreadsheet
Parse the MARCXML. Using the MARC-to-instance map, copy MARC elements to instance categories
The MARCXML is traversed only once in this process, for performance.
Use functions to do the mapping
Where mapping is indirect, use dictionaries to provide a bridge from OLE to FOLIO values
Dale traced how this mapping was accomplished, from the spreadsheet map through the map_languages function (a sketch follows this list)
Instance JSON is built up as the functions that produce the instance elements are executed
Then the administrative metadata is attached and the row is added to the batch to be saved to FOLIO.
Batches are saved, and, if any errors arise, the records are saved one at a time and the errors isolated.
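A rough sketch of the dictionary-backed mapping idea; the map_languages name comes from the talk, but the table contents, codes, and input shape below are invented for illustration:

    # Illustrative bridge dictionary from legacy OLE language codes to FOLIO
    # values; the real one is generated from the MM SIG mapping spreadsheet.
    OLE_TO_FOLIO_LANGUAGE = {
        "en": "eng",
        "eng": "eng",
        "fre": "fra",
    }

    def map_languages(ole_codes):
        """Translate legacy language codes to FOLIO values by dictionary lookup."""
        return [OLE_TO_FOLIO_LANGUAGE[code] for code in ole_codes
                if code in OLE_TO_FOLIO_LANGUAGE]

    # The instance JSON is built up as each mapping function executes.
    instance = {}
    instance["languages"] = map_languages(["eng", "fre"])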

Roughly the same procedure as above is followed for holdings and items,
except that with holdings and items some one-to-many relationships must be handled by creating lists of child rows in columns (much like an ORM; see the sketch below)
Processing is inexpensive because the lookups are dictionary-based (see, for example, how data is passed from OLE values through the map_holdings_statistical_codes function to FOLIO UUIDs, all using dictionaries)
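A small sketch of that grouping step, with invented row shapes standing in for the real OLE query results:

    from collections import defaultdict

    # Invented OLE rows; in practice these come from SQL queries.
    ole_holdings_rows = [{"id": "h1", "call_number": "QA76.9"}]
    ole_item_rows = [{"id": "i1", "holdings_id": "h1"},
                     {"id": "i2", "holdings_id": "h1"}]

    # Group child item rows under their parent holdings id, so each holdings
    # record carries a list of children (much as an ORM would attach them).
    items_by_holdings = defaultdict(list)
    for item in ole_item_rows:
        items_by_holdings[item["holdings_id"]].append(item)

    for holdings in ole_holdings_rows:
        holdings["items"] = items_by_holdings[holdings["id"]]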

How to manage UUIDs
Need to have persistent UUIDs so that there is a consistent mapping from the legacy system to FOLIO and back again
Create UUIDs on the OLE side using cat.uuid
Add these UUIDs to instance, holdings, and item rows through SQL joins
Keep the UUID table in sync by diffing UUID rows against OLE rows
Put them in a dictionary keyed on name values and look up the UUID whenever needed (see, for example, map_item_level_call_number_type and the sketch after this list)
Dates have to be in UTC and be timezone-aware
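A hedged sketch of the name-keyed lookup and the UTC requirement; map_item_level_call_number_type is named in the talk, but the dictionary contents and UUID values below are placeholders:

    from datetime import datetime, timezone

    # UUIDs are created once on the OLE side (e.g. with cat.uuid) and persisted;
    # at migration time they are loaded into a dict keyed on the legacy name value.
    # These names and UUIDs are placeholders.
    call_number_type_uuids = {
        "Library of Congress classification": "00000000-0000-4000-8000-000000000001",
        "Other scheme": "00000000-0000-4000-8000-000000000002",
    }

    def map_item_level_call_number_type(ole_name):
        """Resolve a legacy call-number-type name to its persistent FOLIO UUID."""
        return call_number_type_uuids.get(ole_name)

    # Dates must be timezone-aware UTC values.
    created_date = datetime.now(timezone.utc).isoformat()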

Reference tables are maintained in code for ease of lookup and customization, then loaded into FOLIO whenever a db update is needed (sketched below).
The UUIDs for the reference tables are also persisted on the FOLIO side.
The mapping from OLE identifiers to FOLIO UUIDs usually occurs through a shared name value.
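One way an in-code reference table might be pushed to FOLIO, assuming an Okapi HTTP endpoint such as /call-number-types; the record shape, ids, and headers are illustrative, not the code shown in the demo:

    import requests

    # Reference table kept in code; the shared "name" value is what legacy
    # identifiers map through, and the id is persisted on the FOLIO side.
    CALL_NUMBER_TYPES = [
        {"id": "00000000-0000-4000-8000-000000000001",
         "name": "Library of Congress classification",
         "source": "local"},
    ]

    def load_call_number_types(okapi_url, headers):
        """Load the in-code reference table into FOLIO when a db update is needed."""
        for record in CALL_NUMBER_TYPES:
            requests.post(okapi_url + "/call-number-types", json=record, headers=headers)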
Validation was not covered; there was no time to show the code.

Performance
Go direct to the db for large data sets.
Use db optimizations, such as bulk inserts, and drop/rebuild indexes. The code could be threaded, but he wants to keep it simple for now.
All functions have real-time performance metrics attached and logged, which is useful for fine-grained performance analysis (a generic timing sketch follows).
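One common way to attach per-function timing is a decorator; this is a generic sketch, not Dale's actual instrumentation:

    import functools
    import logging
    import time

    log = logging.getLogger("migration.perf")

    def timed(func):
        """Log the wall-clock duration of every call for fine-grained analysis."""
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            result = func(*args, **kwargs)
            log.info("%s took %.3f s", func.__name__, time.perf_counter() - start)
            return result
        return wrapper

Decorating each mapping and load function this way keeps the metrics close to the code while leaving the functions themselves unchanged.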

https://github.com/arntson/data-migration

Link to Acquisitions Interface Fields
Link to FOLIO Record Data Elements (contains links to specific spreadsheets, but most of them are not up to date).

Action Items