Data migration process
Brief context
- The release of some features sometimes requires a data migration in the database (DDL such as adding/removing columns, or JSON transformations)
- In some cases the data migration takes a very long time (e.g. durations of up to 10-13 h have been mentioned)
- How can we improve this? Can any patterns or approaches be recommended?
MODINVSTOR-812 analysis
Analyze MODINVSTOR-812 / MODINVSTOR-774 to identify the root cause(s) of the long migration (e.g., ongoing indexing on the DB, single-threaded processing, etc.) and optimize the identified bottlenecks.
Notes:
- https://github.com/folio-org/mod-inventory-storage/blob/master/src/main/resources/templates/db_scripts/populatePublicationPeriod.sql
- All migration is done when the tenant is initialized. If it takes too long, it prevents the module from starting
- This particular script fills in 2 new fields (publicationPeriod.start and publicationPeriod.end) and affects each instance in the inventory; at least one of the tenants contains ~ 16M instances
- The script is executed as a SQL function on the Postgres DB engine, so there are no cross-module communications or API calls and no network involved at all; there seems to be no massive load on the database itself
- The logic is pretty straightforward: retrieve the JSONB for an instance, parse publicationDate with a set of regex patterns, build a new JSONB and save it back to the database
- Currently there are no indices on the dateOfPublication, publicationPeriod.start and publicationPeriod.end fields
- Are the JSONB operations the most heavyweight part of the migration?
- https://github.com/folio-org/mod-inventory-storage/pull/686
- The problem was a missing index when working with the publicationPeriod field, which resulted in a long scan when fetching. The optimization: add an index on publicationPeriod before performing the migration and drop this index after the migration is complete (see the sketch after this list)
- Tested by the Falcon team on a performance environment with 7M records; the data migration took ~35 min
- So, this issue is a good example of a case suitable for optimization
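For illustration, below is a minimal SQL sketch of the temporary-index idea described above. The table/column names (instance, jsonb, id) and the parse_publication_period helper are assumptions made for the example; they are not the contents of the actual populatePublicationPeriod.sql or the PR.

```sql
-- Sketch only: names are assumed, not taken from the real migration script.

-- 1. Add a temporary expression index that supports the migration's fetch
--    predicate, so selecting not-yet-migrated instances avoids a full scan.
CREATE INDEX IF NOT EXISTS instance_publication_period_migration_idx
    ON instance ((jsonb -> 'publicationPeriod'));

-- 2. Migrate in batches: fetch a batch of not-yet-migrated instances (cheap
--    thanks to the index), build the new publicationPeriod object and merge
--    it back into the instance JSONB. Repeat until no rows are left.
UPDATE instance
SET jsonb = jsonb || jsonb_build_object(
        'publicationPeriod',
        parse_publication_period(jsonb))      -- hypothetical parsing function
WHERE id IN (
    SELECT id
    FROM instance
    WHERE jsonb -> 'publicationPeriod' IS NULL
    LIMIT 5000);

-- 3. Drop the index once the migration is complete, so it does not slow down
--    regular writes afterwards.
DROP INDEX IF EXISTS instance_publication_period_migration_idx;
```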
Establish a process to manage data migrations
The reason for the MODINVSTOR-812 problem is that the originally implemented migration script fully met the functional requirements, but did not meet the non-functional performance requirements. Moreover, the performance requirements themselves are not formulated anywhere, and there is no validation against large / production-like amounts of data.
- Request more detailed information from the FSE team regarding the data migration process: which migrations happen (happened) in Juniper, and how much time do they take?
- At the moment, there is no information (and no way to obtain it) about how long data migration takes for a particular module. Based on observations, the most problematic modules are mod-inventory-storage, mod-inventory, mod-srs, mod-srm, mod-orders, though in general it would make sense to focus on -storage modules
- The deployment pipeline could be updated to log more details about the migration process
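As one possible way to obtain such information, each migration script (or the pipeline around it) could record its start and end times in a small bookkeeping table, which the deployment pipeline could then query and log. This is only a sketch: the migration_log table and the step name below are invented for illustration, not an existing FOLIO convention.

```sql
-- Illustrative bookkeeping sketch; the migration_log table is an assumption.
CREATE TABLE IF NOT EXISTS migration_log (
    module      text,
    step        text,
    started_at  timestamptz,
    finished_at timestamptz
);

-- A migration script records when it starts...
INSERT INTO migration_log (module, step, started_at)
VALUES ('mod-inventory-storage', 'populatePublicationPeriod', clock_timestamp());

-- ...the actual migration statements run here...

-- ...and when it finishes (clock_timestamp() is used instead of now(),
-- because now() is fixed at transaction start).
UPDATE migration_log
SET finished_at = clock_timestamp()
WHERE module = 'mod-inventory-storage'
  AND step = 'populatePublicationPeriod'
  AND finished_at IS NULL;

-- The deployment pipeline can then query and log per-step durations.
SELECT module, step, finished_at - started_at AS duration
FROM migration_log
ORDER BY started_at;
```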
To make the process of preparing and testing data migration scripts more transparent, understandable and effective, and to reduce the risks of problems occurring directly during the release / migration on the production environment, 2 main steps are proposed:
Open questions regarding performance testing include:
- Who runs the tests
- Which data sets do we use (size, variety, etc.)
- What constitutes a reasonable amount of time for a migration script... it's somewhat arbitrary
- Where are these tests run... is there a way to easily populate existing envs (scratch envs, etc.) with the target data sets? Are the envs shared?
Other thoughts:
- What is an acceptable threshold?
- MODINVSTOR-812 mentions that the database update should take as little time as possible, though 2 hrs is usually acceptable
- Should the threshold be set per migration, with an SLA for the whole release?
- The acceptable threshold should specify the baseline infrastructure and the amount of data used for performance testing
- This acceptable threshold should be communicated to all development teams and set as a non-functional requirement
- Comparing the performance testing results against the acceptable threshold results in a go/no-go decision
- The measured actual duration(s) should be included in the release documentation
Best practices and checklist for data-migration tasks
This is a list of items worth paying attention to while implementing and reviewing tasks labeled data-migration; the list can be extended over time:
Also, please take a look at an example of Cross-module migration solution design.
More examples of data migrations:
Re-think data migration approach
It should be noted that although the process proposed and formulated in the previous section is intended to streamline data migrations, it is likely that a more serious and in-depth rethinking of the current approach on the platform is required.
Perhaps, taking into account such characteristics as downtime, multi-tenancy, etc., we can talk about eliminating heavy-weight data migration as a mandatory step at release, and turning it into a gradual migration that does not require stopping applications.
In any case, this may be the topic of a separate study. The table below shows several possible options. Feedback is appreciated.
| | #1 Lazy migration | #2 Expand-contract pattern | #3 Optimization | #4 Separate Applications approach | #5 Blue-green deployment | #6 Aggregate several data migrations into one |
|---|---|---|---|---|---|---|
| Explanation | Update the FOLIO platform without an immediate data migration, and start background task(s) for gradual data migration. Applications must be able to work with both migrated and not-yet-migrated data while the migration is in progress | Update the FOLIO platform without an immediate data migration, but build the migration into the code and "modify upon get": whenever a record is accessed, perform the necessary changes and write them back to the DB. This way the migration happens over time, and only one schema is in play at a given time (a generic sketch follows the table). More to read: https://www.prisma.io/dataguide/types/relational/expand-and-contract-pattern | Follow the existing way of doing data migration as part of a release, but review and analyze data migration scripts and optimize them | Deploy FOLIO not as a monolith but as a set of independent applications. As a result, individual applications can be released separately, and data migration is also distributed among applications | Before the migration: a) create a clone of the main database, b) execute the data migration on the clone, c) roll over all new changes that happened during the migration, d) during the release, switch from the previous main database to the clone and treat the clone as the new main DB | |
| Benefits | No data migration as part of the release | No data migration as part of the release | No significant changes in the FOLIO platform logic; potential improvements with minimal costs | Even a long migration affects only a particular application rather than the full platform | Safe non-blocking migration | |
| Concerns | The data migration process can be less transparent to engineers, while the risks of partially completed migrations remain; application logic gets more complex as well | Not suitable for cases with breaking changes or when fully migrated data is required for proper functioning (e.g. for filtering, sorting, etc.) | Performance may still be unsatisfactory with no room for further optimization | Data migration is distributed across applications, but the approach to migration itself does not change; therefore, a lengthy migration process is still possible, although it will only affect one application and not the entire platform | Step c) above requires accurate synchronization of all changes that occurred in the main database between cloning and switching | |
| Efforts | A totally new approach: one needs to implement a background task runner and make sure applications can work with both versions of data | A totally new approach | There is a chance that small point optimizations will be efficient enough | This is just a side effect of the Separate Applications approach | Mainly DevOps effort to automate the process | |
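To make option #2 a bit more concrete, here is a generic expand-contract sketch: the schema is first expanded so old and new code can coexist, data is migrated gradually, and only then is the old structure contracted away. The item table and its columns are hypothetical and only illustrate the pattern, not an actual FOLIO schema.

```sql
-- Generic expand-contract sketch with hypothetical table/column names.

-- 1. Expand: add the new structure while keeping the old one, so code that
--    only knows the old column keeps working during the transition.
ALTER TABLE item ADD COLUMN IF NOT EXISTS publication_period jsonb;

-- 2. Migrate gradually: a background job (or "modify upon get" in the
--    application) fills the new column in small batches instead of one
--    release-blocking UPDATE. Repeat until no rows are left.
UPDATE item
SET publication_period = jsonb_build_object('start', publication_year)
WHERE id IN (
    SELECT id FROM item WHERE publication_period IS NULL LIMIT 5000);

-- 3. Contract: once every row is migrated and no deployed code reads the old
--    column, drop it in a later release.
ALTER TABLE item DROP COLUMN IF EXISTS publication_year;
```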
Other remarks and notes
- Perform data migration at the data access layer, without Okapi? If, during data migration, a module needs to get additional information from another module, this can only be done through Okapi calls. If there is a lot of data to migrate, this approach can slow down performance. It is important to note that currently a module cannot request data from another module directly, bypassing Okapi, since this violates the principles of module isolation.