2020-08-24 - Data Migration Subgroup Agenda and Notes

Date

Attendees

Discussion items

Time (min) | Item | Who | Notes

5 | Welcome | Dale

Someone who hasn't done it in a while should take notes.

  • Jenn will take notes.
55 | Bulk APIs | various

We will continue discussion of the bulk APIs needed to facilitate data migration. Last week a few of us met to formulate, at a high level, the features bulk APIs should support to make them suitable for data migration. At this meeting, we will try to reach agreement on this feature set, on which modules we feel should support it, and on whom we should take it to from here. Here is the list of points we came up with:

All bulk APIs should support the following:

* Create, read, update, and delete (CRUD) operations on records. Updates should also optionally support upserts via a parameter setting. Deletes should be configurable using CQL. (How should we handle large delete sets, such as all users except diku_admin?) A request sketch follows this list.

  • Dale: Not sure whether deletes for large data sets are really still an issue.
  • Isn't it still a problem to write to storage and skip the business API? Writing just to storage assumes the user has already done the logic. If you do the logic, you have to process records one by one rather than in batch. The logic already in the business API assumes you are talking about the present, not historical migration, so for migration you sometimes have to skip it.
  • Some are okay with doing the logic outside and loading to storage in order to be faster. What about orders, which are complex? The business logic just isn't set up for migration concerns.
  • We might need migration-oriented business APIs, but as a high-level minimal set for right now, storage is what we would need.
  • One doesn't have to exclude the other; have this as a start and then move forward.
  • If you want performant business APIs, you probably need the performant storage APIs first anyway.
  • Why include the complexity of upserts if the logic is done beforehand? Upsert is already in inventory for bulk loading, though we haven't tried it yet. Upsert in inventory can currently key only on UUIDs, not HRIDs; HRID doesn't have its own column. The advantage of upsert is not having to segregate new records from updated records when loading. There were questions about whether that should happen automatically: you might want to know if something you were going to update doesn't exist, hence the question about parameterizing upsert.
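As a rough illustration of the call shapes discussed above, here is a minimal sketch in Python of a parameterized upsert and a CQL-scoped delete against a hypothetical bulk endpoint. The endpoint path, the upsert parameter name, and the request and response shapes are all assumptions for discussion, not existing FOLIO APIs.

```python
import requests

OKAPI_URL = "https://folio-okapi.example.org"  # hypothetical Okapi gateway
HEADERS = {
    "X-Okapi-Tenant": "diku",        # example tenant
    "X-Okapi-Token": "<auth token>",
}


def bulk_upsert_users(records):
    """POST a batch of user records; ?upsert=true asks the (hypothetical)
    bulk endpoint to update existing records and insert the rest in one call."""
    resp = requests.post(
        f"{OKAPI_URL}/user-storage/users/bulk",  # assumed path, not an existing endpoint
        params={"upsert": "true"},
        json={"users": records},
        headers=HEADERS,
    )
    resp.raise_for_status()
    return resp.json()  # e.g. counts plus per-record errors


def bulk_delete_users_by_cql(cql_query):
    """DELETE scoped by a CQL query, e.g. everything except diku_admin."""
    resp = requests.delete(
        f"{OKAPI_URL}/user-storage/users/bulk",  # assumed path
        params={"query": cql_query},
        headers=HEADERS,
    )
    resp.raise_for_status()
    return resp.json()


# Example of the "large delete set" case from the notes:
# bulk_delete_users_by_cql('username<>"diku_admin"')
```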

* For insert and update operations, user-supplied data should be used in place of generated data where supplied. This includes UUIDs, creation and update dates, created_by and updated_by identifiers, and HRIDs. An example payload follows this list.

  • Migration data is historical and needs to retain these values. There is a difference in the metadata block between created and updated, and some variation in use cases for these two data points. If you have the option to retain or not retain, doesn't that solve the problem? The update procedure would need to remember to scrub the fields when appropriate, and someone might overlook it.
  • UChicago uses the update information to keep track of who is doing what, so you don't want to lose that in migration; it is valuable to library staff.
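To make the point concrete, here is a sketch of the kind of record a migration load might send, with the UUID, HRID, and metadata block supplied by the caller rather than generated server-side. The field names follow the usual FOLIO metadata conventions, but the specific values and the assumption that a bulk endpoint would store them as-is are illustrative only.

```python
# Hypothetical migration payload for an inventory instance: the identifiers and
# the metadata block are carried over from the legacy system and would be
# stored as-is rather than regenerated by the module.
instance_record = {
    "id": "5bf370e0-8cca-4d9c-82e4-5170ab2a0a39",   # caller-supplied UUID
    "hrid": "in00000001",                           # caller-supplied HRID
    "title": "Example title from the legacy system",
    "instanceTypeId": "30fffe0e-e985-4144-b2e2-1e8179bdb41f",  # example reference UUID
    "metadata": {
        # Historical values preserved from the source system.
        "createdDate": "2002-03-14T09:30:00.000+0000",
        "createdByUserId": "21457ab5-4c3a-4d4e-9160-362bd0be8e3a",
        "updatedDate": "2015-11-02T16:45:00.000+0000",
        "updatedByUserId": "c7e25ea1-2f62-4a4a-9e4a-7e7f0c8e8c11",
    },
}
```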

* Streaming for reads. Streaming provides for atomic reads over long periods of time, whereas query paging could miss records or read the same records twice as a result of concurrent inserts and deletes from other users. A client sketch follows this list.

  • RMB already supports streaming for reads and it is supported in inventory and SRS (we think)
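For reference, a client consuming a streaming read might look roughly like the sketch below, which iterates over the response body line by line instead of paging with offset/limit. The endpoint path and the assumption that records arrive as one JSON object per line are illustrative, not a description of an existing API.

```python
import json

import requests

OKAPI_URL = "https://folio-okapi.example.org"  # hypothetical Okapi gateway
HEADERS = {"X-Okapi-Tenant": "diku", "X-Okapi-Token": "<auth token>"}


def stream_instances(cql_query="cql.allRecords=1"):
    """Yield instance records from an assumed streaming endpoint, one JSON
    object per line, without offset/limit paging or holding the full result
    set in memory."""
    with requests.get(
        f"{OKAPI_URL}/instance-storage/instances/stream",  # assumed path
        params={"query": cql_query},
        headers=HEADERS,
        stream=True,
    ) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines():
            if line:  # skip keep-alive blank lines
                yield json.loads(line)


# for record in stream_instances():
#     handle(record)
```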

* For insert, update, and delete operations, it is sufficient to return a count of the records processed (as well as errors; see below). Returning all of the processed JSON objects in the response would probably degrade performance substantially without providing much benefit. A sketch of such a response follows this list.

  • Since individual record APIs return the records, it may add complexity to not do so in the bulk APIs.
  • Main argument is to enhance performance. What is the best practice? If using SQL as a model, it only returns information on reads.
  • This seems like a lower priority compared to the other issues.
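As a sketch of the count-plus-errors style of response discussed above, a bulk write might return something like the following; the exact field names and shape are assumptions for discussion.

```python
# Hypothetical bulk-write response: counts plus per-record errors instead of
# echoing back every processed record.
example_response = {
    "totalRecords": 1000,  # records received in the request
    "created": 640,
    "updated": 359,
    "errors": [            # see the error-reporting point below
        {
            "id": "0c2d35e1-8f5a-4b6e-9f0e-2b1a7f3d9c44",
            "message": "instance with hrid in00000042 already exists",
        },
    ],
}
```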

* Inserts, updates, and deletes should be processed in batches, where a stream of records is broken up into units of, say, a thousand records for processing. A chunking sketch follows this list.

  • It is good to be able to delete by CQL. Right now people think they can scope a delete with CQL and accidentally delete more than they meant to, because the endpoint ignores the CQL query.
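A minimal sketch of the chunking mentioned above: break a record stream into fixed-size batches (here 1,000) before posting each batch to a bulk endpoint. The batch size, the helper name, and the idea that the caller does the chunking are assumptions for illustration.

```python
from itertools import islice


def batches(records, size=1000):
    """Yield lists of at most `size` records from any iterable or stream."""
    it = iter(records)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


# Example: post each chunk of a large legacy export to a bulk endpoint
# (read_legacy_records and post_bulk are hypothetical stand-ins).
# for chunk in batches(read_legacy_records()):
#     post_bulk(chunk)
```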

* If an error is encountered in processing a batch of inserts or updates, the error should be returned along with the identifier (or JSON object) of the record causing the error. This is very useful for debugging, and for identifying and fixing data problems. There are many ways to implement such a capability. Here is one we have found useful: when a batch operation fails, the failed batch is rolled back and resubmitted as a list of single-record operations, and the error and identifier (or JSON object) of each failed record is collected and returned to the caller as a list, along with the number of records successfully processed. Identifying errors at the record level like this would allow whole data sets to be migrated, with the problem records collected into a list, cleaned up, and reprocessed on a second pass. A sketch of this fallback flow follows this list.

  • Needs to be thorough; some APIs are throwing 502s currently.
  • Consistency will be important.
  • A performance decrease from the second pass would be worth it to get errors on individual records, and there probably isn't even a performance decrease.
  • Being able to generate this list for data clean-up makes migration life simpler.
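Whether this logic lives inside the module (as described above) or in a loading client, the control flow is the same: try the whole batch, and on failure resubmit record by record so the failing records and their errors can be collected. A minimal sketch, assuming hypothetical post_batch and post_single callables standing in for the bulk and single-record endpoints:

```python
def load_with_error_capture(batch, post_batch, post_single):
    """Try the whole batch first; if it fails, fall back to one-by-one posts
    so each bad record and its error can be reported individually.
    Returns (success_count, errors)."""
    try:
        post_batch(batch)
        return len(batch), []
    except Exception:
        # The failed batch is assumed to have been rolled back, so every
        # record is resubmitted on its own.
        successes, errors = 0, []
        for record in batch:
            try:
                post_single(record)
                successes += 1
            except Exception as exc:
                errors.append({"id": record.get("id"), "error": str(exc)})
        return successes, errors


# The collected `errors` list gives migrators the problem records to clean up
# and reload on a second pass.
```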

* Bulk APIs should not be required to track table dependencies. Users should be expected to manage these.

And here is Ian's document characterizing the modules to which these features could be applied.

FOLIO Bulk API Support (Google Docs)


Additional Meeting Notes:

How will these things get moved along to the JIRA stage and assigned to development teams? Each app dev team will need a clear definition of these changes, which will then need to get into the teams' dev queues.

The sooner we can get this definition set, the sooner we can move it into dev queues. Q4 work is being prepped and this isn't there yet; it needs to be in within the next few weeks or else it will be held until Q1.

Hoping to finish today and get to the next step.

There is a step between here and the dev teams: this has to be part of the capacity plan. Is it high enough priority to make it through capacity planning? It might have to wait multiple quarters depending on capacity planning.

What next?

  • Ian will close loop with Jakub and Cate
  • Can get help from Jakub and Cate but will need to write JIRAs ourselves
  • Does this need to get blessed by anybody before going into JIRA? Sysops? Tech council? Just cap planning?


Action items

Dale Arntson will set up a topic backlog page.

Ian Walls will talk to Cate and Jakub and move forward on JIRAs.

Dale Arntson and Ian Walls will work on document clean-up together.