2020-08-24 - Data Migration Subgroup Agenda and Notes
Attendees
- Tod Olson
- Ingolf Kuss
- Charlotte Whitt
- Ian Walls
- Michelle Suranofsky
- Lloyd Chittenden
- Jeff Fleming
- patty.wanninger
- Ann-Marie Breaux (Deactivated)
- Dale Arntson
- Jenn Colt
- Former user (Deleted)
- Chris Froese
- Jon Miller
Meeting Link
- https://zoom.us/j/204980147
- password: folio-lsp
Discussion items
Time | Item | Who | Notes |
---|---|---|---|
5 | Welcome | Dale | Someone who hasn't done it in a while should take notes. |
55 | Bulk APIs | various | We will continue discussion of the bulk APIs needed to facilitate data migration; detailed notes below. |

Bulk API feature set

Last week a few of us met to formulate, at a high level, the features bulk APIs should support to make them suitable for data migration. At this meeting we will try to reach agreement on this feature set, on which modules should support it, and on whom we should take it to from here. Here is the list of points we came up with. All bulk APIs should support the following:
* Create, read, update, and delete (CRUD) operations on records. Updates should also optionally support upserts through a parameter setting. Deletes should be configurable using CQL. (How should we handle large delete sets, such as all users except diku_admin?) See the upsert and delete sketches following these notes.
* For insert and update operations, user-supplied data should be used in place of generated data where supplied. This includes UUIDs, creation and update dates, created_by and updated_by identifiers, and HRIDs.
* Streaming for reads. Streaming provides a single consistent read over a long period of time, whereas query paging could miss records or reread the same records as a result of concurrent inserts and deletes by other users. See the streaming sketch below.
* For insert, update, and delete operations, it is sufficient to return a count of the records processed (plus any errors; see below). Returning all of the processed JSON objects in the response would probably degrade performance substantially without providing much benefit.
* Inserts, updates, and deletes should be processed in batches, where a stream of records is broken up into units of, say, a thousand records for processing.
* If an error is encountered while processing a batch of inserts or updates, the error should be returned along with the identifier (or JSON object) of the record causing it. This is very useful for debugging, and for identifying and fixing data problems. There are many ways to implement such a capability; here is one we have found useful: when a batch operation fails, the failed batch is rolled back and resubmitted as a list of single-record operations, and the error and identifier (or JSON object) of each failed record is collected and returned to the caller as a list, along with the number of records successfully processed. Identifying errors at the record level like this would allow a whole data set to be migrated, and the problem records to be collected, cleaned up, and reprocessed on a second pass. The batching sketch below illustrates this fallback.
* Bulk APIs should not be required to track table dependencies; users should be expected to manage these.

Ian's document characterizing the modules to which these features could be applied: FOLIO Bulk API Support (Google Docs)

Additional meeting notes
- How will these things get moved along to the JIRA stage and assigned to development teams? Each app dev team will need a clear definition of the changes before the work can enter its dev queue; the sooner this definition is settled, the sooner the work can move into those queues.
- Q4 work is being prepped and this isn't in it yet; it needs to be defined within the next few weeks or it will be held until Q1.
- We hope to finish today and get to the next step. Between here and the dev teams, the work has to be part of the capacity plan. Is this high enough priority to make it through capacity planning? It might have to wait multiple quarters depending on capacity planning. What next?
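As a rough illustration of the first two points (upserts controlled by a parameter, and user-supplied identifiers and audit metadata stored as given), here is a minimal client-side sketch in Python. The /bulk-users path, the upsert parameter, and the request envelope are placeholders invented for this sketch; no concrete bulk endpoints have been agreed yet, so only the shape of the idea is shown.

```python
import json
import uuid

import requests

OKAPI_URL = "https://folio-snapshot-okapi.dev.folio.org"  # example Okapi gateway
HEADERS = {
    "x-okapi-tenant": "diku",
    "x-okapi-token": "<token>",          # placeholder token
    "Content-Type": "application/json",
}

# A migrated user record that carries its own identifiers and audit metadata.
# The point of the "user-supplied data" requirement is that the module should
# store these values as given instead of regenerating them.
record = {
    "id": str(uuid.uuid4()),             # UUID minted by the migration tooling
    "username": "jsmith",
    "metadata": {
        "createdDate": "2015-03-17T09:30:00.000+0000",  # preserved from the legacy system
        "createdByUserId": "11111111-1111-1111-1111-111111111111",
        "updatedDate": "2019-11-02T14:05:00.000+0000",
        "updatedByUserId": "11111111-1111-1111-1111-111111111111",
    },
}

# "/bulk-users" and the "upsert" flag are placeholders for whatever the real
# bulk API ends up being called; the notes only ask that such a switch exist.
resp = requests.post(
    f"{OKAPI_URL}/bulk-users?upsert=true",
    headers=HEADERS,
    data=json.dumps({"users": [record]}),
)
print(resp.status_code, resp.text)
```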
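A similar sketch for CQL-configurable deletes, using the diku_admin example from the notes. The CQL expression is ordinary FOLIO-style CQL; the delete endpoint and its query parameter are again assumptions for illustration.

```python
import requests

OKAPI_URL = "https://folio-snapshot-okapi.dev.folio.org"  # example Okapi gateway
HEADERS = {"x-okapi-tenant": "diku", "x-okapi-token": "<token>"}

# Delete every user except diku_admin. The "/bulk-users" path is a placeholder,
# since no bulk delete endpoint has been agreed on yet.
cql = 'cql.allRecords=1 NOT username=="diku_admin"'
resp = requests.delete(
    f"{OKAPI_URL}/bulk-users",
    headers=HEADERS,
    params={"query": cql},
)
print(resp.status_code, resp.text)
```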
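For the streaming-read point, here is a sketch of what a client might look like if reads were returned as one long stream rather than offset/limit pages. The /bulk-users/stream path and the newline-delimited JSON framing are assumptions about how such an endpoint could behave.

```python
import json

import requests

OKAPI_URL = "https://folio-snapshot-okapi.dev.folio.org"  # example Okapi gateway
HEADERS = {"x-okapi-tenant": "diku", "x-okapi-token": "<token>"}

# Read all users as a single streamed response instead of paging with
# offset/limit, so concurrent inserts and deletes cannot shift the pages.
with requests.get(
    f"{OKAPI_URL}/bulk-users/stream",
    headers=HEADERS,
    stream=True,
) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        if not line:
            continue
        user = json.loads(line)
        # ... write the record to the migration workspace ...
        print(user["id"])
```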
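The notes describe the rollback-and-resubmit error handling as something the bulk API itself would do (roll back a failed batch, retry record by record, and return per-record errors). The sketch below shows the same pattern driven from the client side, again against a placeholder /bulk-users endpoint, purely to make the control flow concrete.

```python
import json

import requests

OKAPI_URL = "https://folio-snapshot-okapi.dev.folio.org"  # example Okapi gateway
HEADERS = {
    "x-okapi-tenant": "diku",
    "x-okapi-token": "<token>",
    "Content-Type": "application/json",
}
BATCH_SIZE = 1000  # "units of, say, a thousand records"


def chunked(records, size):
    """Yield successive fixed-size batches from a stream of records."""
    batch = []
    for record in records:
        batch.append(record)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def load_users(records):
    """Post records in batches; when a batch fails, fall back to single-record
    posts so errors can be reported per record."""
    loaded = 0
    failures = []  # (record id, error text) pairs for a second-pass cleanup
    for batch in chunked(records, BATCH_SIZE):
        resp = requests.post(
            f"{OKAPI_URL}/bulk-users",
            headers=HEADERS,
            data=json.dumps({"users": batch}),
        )
        if resp.ok:
            loaded += len(batch)
            continue
        # The batch failed as a whole: retry record by record and collect
        # the identifier and error of each record that still fails.
        for record in batch:
            single = requests.post(
                f"{OKAPI_URL}/bulk-users",
                headers=HEADERS,
                data=json.dumps({"users": [record]}),
            )
            if single.ok:
                loaded += 1
            else:
                failures.append((record.get("id"), single.text))
    return loaded, failures
```

The returned failures list is the artifact the notes ask for: the problem records can be cleaned up and reloaded on a second pass while the rest of the data set migrates in a single run.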
Action items
Dale Arntson will set up a topic backlog page.
Ian Walls will talk to Cate and Jakub and move forward on JIRAs.
Dale Arntson and Ian Walls will work on document clean-up together.