25 | Inventory demo | Dale

Dale gave a talk on how Chicago is doing its migration to Folio, focusing on migration to Inventory. They use Python 3 to script the migration; the language is a good fit for the tasks, and the REPL is very useful for bottom-up programming.

Dale emphasized how useful cleanup of legacy data is to the migration process. Clean data makes it much easier to conceptualize what transformations need to occur, check your data against your understanding of the transformation, and revise. Bad data confuses things because you are no longer dealing with classes of data but with errant individual values.

Data migration is done most usefully in iterative cycles, where you can get through at least one cycle per day: make revisions, review them, and make new revisions while the ideas are fresh. To do this, data migration needs to be fast. Chicago's Inventory ETL takes 4 hours to load everything. That kind of speed is not required, but being able to do things in a single day is important for a productive workflow.

To make things fast at Chicago, Dale codes his data structures as Python dictionaries, which are extremely fast and lightweight. Data is then processed in batches of 1000 and saved directly to the database. Batching minimizes database commits, which are time-consuming. He also uses the bulk-loading capabilities of the database for further speedup. He could use the Postgres COPY command, but COPY lacks good error checking and handling, so he uses insert-many functionality instead. If a batch throws an error, he drops down and processes the records individually to associate the error with particular records, whose ids he then logs for future processing.

Program design for inventory migration:
- Create a machine-readable version of the MM SIG's MARC-to-Instance mapping spreadsheet.
- Parse the MARCXML. Using the MARC-to-Instance map, copy MARC elements to Instance categories. The MARCXML is traversed only once in this process, for performance.
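The batch-save-with-fallback pattern described above can be sketched as follows. This is a minimal illustration, not Chicago's actual code: sqlite3 stands in for the Postgres database, and the table name, columns, and sample rows are invented for the example.

```python
import sqlite3
import uuid

BATCH_SIZE = 1000  # the talk mentions batches of 1000

def save_batch(conn, rows, failed_ids):
    """Insert a batch in a single executemany call; if the batch fails,
    retry row by row so errors can be pinned to particular records."""
    sql = "INSERT INTO instances (id, title) VALUES (?, ?)"
    try:
        with conn:  # one commit per batch, rolled back on error
            conn.executemany(sql, rows)
    except sqlite3.Error:
        # Drop down to per-record inserts to isolate the bad rows.
        for row in rows:
            try:
                with conn:
                    conn.execute(sql, row)
            except sqlite3.Error:
                failed_ids.append(row[0])  # log the id for later processing

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE instances (id TEXT PRIMARY KEY, title TEXT)")

rows = [(str(uuid.uuid4()), f"Title {n}") for n in range(3)]
rows.append((rows[0][0], "duplicate id"))  # force a batch failure

failed = []
save_batch(conn, rows, failed)
```

With Postgres the same shape applies, only the driver's insert-many call and the error classes change; the key point is that the happy path costs one commit per batch, and the per-record path runs only when something is wrong.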
- Use functions to do the mapping; where the mapping is indirect, use dictionaries to provide a bridge from OLE to Folio values. Dale traced how this mapping was accomplished, from the spreadsheet map through the map_languages function.
- The Instance JSON is built up as the functions that produce the Instance elements are executed. Then the administrative metadata is attached and the row is added to the batch to be saved to Folio.
- Batches are saved and, if any errors arise, the records are saved one at a time and the errors isolated.

Roughly the same procedure is followed for holdings and items, except that with holdings and items some one-to-many relationships must be handled by creating lists of child rows in columns (like an ORM). Processing is inexpensive because the lookups are dictionary-based (see, for example, how data is passed from OLE through the function map_holdings_statistical_codes to a Folio uuid, all using dictionaries).

How to manage uuids:
- Uuids need to be persistent so that there is a consistent mapping from the legacy system to Folio and back again.
- Create uuids on the OLE side using cat.uuid.
- Add these uuids to instance, holdings, and item rows through SQL joins.
- Keep the uuids table in sync by diffing uuid rows against OLE rows.
- Put the uuids in a dictionary keyed on name values and look up the uuid whenever needed (see, for example, map_item_level_call_number_type).

Dates have to be in UTC and timezone-aware. Reference tables are maintained in code for ease of lookup and customization, then loaded into Folio whenever a db update is needed. The uuids for the reference tables are also persisted on the Folio side. The mapping from OLE identifiers to Folio uuids usually occurs through a shared name value. Validation was not covered; there was no time to show the code.

Performance:
- Go direct to the db for large data sets.
- Use db optimizations, such as bulk inserts.
- Drop and rebuild indexes.
- The code could be threaded, but Dale wants to keep it simple for now.
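The name-keyed uuid lookup described above can be sketched as follows. The function name map_item_level_call_number_type comes from the talk, but its body here is a guess for illustration only, and the OLE codes and Folio names in the bridge dictionary are invented; Chicago's real code persists these uuids in the database rather than minting them in memory.

```python
import uuid

# Reference table kept in code; maps a Folio name value to its
# persistent uuid. In the real pipeline these uuids are also
# persisted on the Folio side.
call_number_types = {}

def get_ref_uuid(table, name):
    """Return the uuid for a name value, minting one the first time the
    name is seen so the name-to-uuid mapping stays stable."""
    if name not in table:
        table[name] = str(uuid.uuid4())
    return table[name]

def map_item_level_call_number_type(ole_value):
    """Illustrative only: bridge an OLE code to a Folio name, then to
    the persistent uuid, using dictionary lookups throughout."""
    ole_to_folio = {
        "LC": "Library of Congress classification",
        "DD": "Dewey Decimal classification",
    }
    return get_ref_uuid(call_number_types, ole_to_folio[ole_value])
```

Because every hop is a dictionary lookup, mapping an OLE value to a Folio uuid is effectively free, which is what keeps per-row processing inexpensive.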
All functions have real-time performance metrics attached and logged, which is useful for fine-grained performance analysis.

Code: https://github.com/arntson/data-migration
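Per-function metrics in this spirit can be attached with a decorator. This is a generic sketch, not the repository's implementation; map_languages is a function name from the talk, but its body here is invented for the example.

```python
import functools
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("migration.perf")

def timed(fn):
    """Wrap a function so every call is timed and logged, and running
    totals are kept for later fine-grained analysis."""
    metrics = {"calls": 0, "total_seconds": 0.0}

    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return fn(*args, **kwargs)
        finally:
            elapsed = time.perf_counter() - start
            metrics["calls"] += 1
            metrics["total_seconds"] += elapsed
            log.info("%s took %.6fs", fn.__name__, elapsed)

    wrapper.metrics = metrics  # expose running totals on the function
    return wrapper

@timed
def map_languages(codes):
    # Stand-in body; the real function maps MARC language codes.
    return [c.lower() for c in codes]
```

Aggregating per-function totals like this makes it easy to see which mapping step dominates a 4-hour ETL run.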