Data migration process
Brief context
Release of some features sometimes requires data migration in database (DDL like add/remove columns, or JSON transformation)
In some cases data migration takes a very long time (e.g. up to 10-13 h was mentioned)
How can we improve this? Can any patterns or approaches be recommended?
MODINVSTOR-812 analysis
Analyze MODINVSTOR-812: Improve ways to populate publicationPeriod fieldClosed /
MODINVSTOR-774: Juniper migration is unreasonably slowClosed to identify root cause(-s) of long migration (e.g., ongoing indexing on DB, single thread processing etc.) and make optimization of identified bottlenecks.
Oct 18, 2021 Notes:
All migration is done when the tenant is initialized. If it takes too long, it prevents the module from starting
This particular script fills in 2 new fields (publicationPeriod.start and publicationPeriod.end) and affects each instance in the inventory; at least one of the tenants contains ~ 16M instances
The script is executed as SQL function on Postgres DB engine, so - no any cross-module communications or API calls, no network involved at all; seems to be no massive load on database itself
The logic is pretty straightforward - retrieve JSONB for an instance, parse publicationDate with a set of regex patterns, build a new JSONB and save it to database
Currently there are no indices on dateOfPublication , publicationPeriod.start and publicationPeriod.end fields
Does it seem that JSONB operations are the most heavyweight in the migration?
Oct 25, 2021
The problem was a missing index when working with the publicationPeriod field, which resulted in a long scan when fetching. Optimization - add an index on publicationPeriod before performing the migration and drop this index after the migration is complete
Tested by Falcon team on performance environment with 7M records, data migration process took ~35 min
So, this issue is a good example of a case suitable for optimization
Establish a process to manage data migrations
The reason for the MODINVSTOR-812 problem is that the originally implemented migration script fully met the functional requirements, but did not meet the non-functional performance requirements. Moreover, the performance requirements themselves are not formulated, and there is also no validation against large / production-like amount of data.
Request from FSE team more detailed information regarding data migration process - which migrations happen (happened) in Juniper, how much time do they take?
Oct 8, 2021 At the moment, there is no (and cannot be obtained) information about how long it takes to migrate data for a particular module. Based on observations, the most problematic modules are mod-inventory-storage, mod-inventory, mod-srs, mod-srm, mod-orders though in general it would make sense to focus on -storage modules
The deployment pipeline could be updated to log more details about the migration process
To make the process of preparing and testing data migration scripts more transparent, understandable and effective, and to reduce the risks of problems occurring directly during the release / migration on the production environment, 2 main steps are proposed:
Analysis and review of data migration scripts at the stage of preparation and development
Add a specific label data-migration to all Jira features/tasks related to data migration
Update Architecture board https://folio-org.atlassian.net/secure/RapidBoard.jspa?rapidView=224 to reflect such labeled tasks in In Review, In Progress and In Code Review statuses
Organize code review and analysis of labeled tickets and PRs by TechLs/SAs using Best practices and Check-list for data-migration tasks below
Communicate this process and required steps to involved teams and interested stakeholders, share this documentation
Prepare a brief instruction on How to test data migration performance on Rancher , and communicate to dev teams to conduct performance testing of a particular migration on a temporary performance environment
Performance testing for all data migrations on a dedicated environment and on a sufficiently large amount of data as a mandatory step in order to receive an early performance metrics
Open questions regarding performance testing include:
Who runs the tests
Which data sets do we use (size, variety, etc.)
What constitutes a reasonable amount of time for a migration script... it's somewhat arbitrary
Where are these tests run... is there a way to easily populate existing envs (scratch envs, etc.) with the target data sets? Are the envs shared?
Other thoughts:
What is an acceptable threshold?
Database update takes as little time as possible, though 2 hrs is usually acceptable is mentioned in MODINVSTOR-812
Should it be set per 1 migration and SLA per whole release?
Acceptable threshold should include baseline infrastructure and amount of data for performance testing
This acceptable threshold should be communicated to all development teams and set as a Non-functional requirement
Comparing of mentioned performance testing results with acceptable threshold is resulted in a go/no-go decision
Measured actual duration(s) should be included into release documentation
Best practices and Check-list for data-migration tasks
This is a list of items that make sense to pay attention to while implementation and reviewing of data-migration labeled tasks; the list can be replenished and expanded:
the presence of an index for one or more fields, which leads to a constant rebuilding of the index when changing each record
(when using Java code) work is done in one thread, sequentially, record by record
there is no index on some field used to select the original data
during data migration, additional information is requested from other modules via API
...
Also, please take a look at an example of Cross-module migration solution design.
More examples of data migrations:
Re-think data migration approach
It should be noted that although the process proposed and formulated in the previous section is intended to streamline data migrations, it is likely that a more serious and in-depth rethinking of the current approach on the platform is required.
Perhaps, taking into account such characteristics as downtime, multi-tenancy, etc., we can talk about eliminating heavy-weight data migration as a mandatory step at release, and turning it into a gradual migration that does not require stopping applications.
In any case, this may be the topic of a separate study. The table below shows several possible options. Feedback is appreciated.
| #1 Lazy migration | #2 Expand-contract pattern | #3 Optimization | #4 Separate Applications approach | #5 Blue-green deployment | #6 Aggregate several data migration in one |
|---|---|---|---|---|---|---|
Explanation | Update FOLIO platform without immediate data migration, and start a background task(-s) for gradual data migration Applications are to be ready to work with both migrated and not-yet-migrated data while migration is in progress | Update FOLIO platform without immediate data migration but build the migration into the code, and do a "modify upon get". So whenever a record is accessed, perform the necessary changes and write them back to the DB. This way the migration happens over time, and only one schema is in play at a given time. More to read https://www.prisma.io/dataguide/types/relational/expand-and-contract-pattern | Follow existing way of data migration as a part of release but review and analyze data migration scripts, and optimize them | Follow an approach of FOLIO deployment as not a monolith but rather as a set of independent applications. As the result release of individual applications will be possible, and data migration will also be distributed among applications | Before migration a) create a clone of main database, b) execute data migration on the clone, c) roll over all new changes that happened during the migration, d) during release - switch main database from the previous one to this clone, and consider the clone as a new main db |
|
Benefits | No data migration as a part of release | No data migration as a part of release | No significant changes in the FOLIO platform logic. Potential improvements with minimal costs | Even long migration will affect a particular application only rather than full platform | Safe non-blocking migration |
|
Concerns | The data migration process can be less transparent to engineers, while leaving the risks of partially completed migrations; application logic gets more complex as well | Not suitable for cases with breaking changes or when fully migrated data is required for proper functioning (e.g. for filtering, or sorting, etc.) | There might be a case of still unsatisfactory performance while no place for further optimization | Data migration is distributed across applications, and the approach to migration itself does not change. Therefore, a lengthy migration process is still possible, although this will only affect one application and not the entire platform | Step c) from the list above - accurate synchronization of all changes that occurred in the main database during the period of time between cloning and switching is required |
|
Efforts | Totally new approach, so that one need implement background task runner, make sure applications can work with both versions of data | Totally new approach | There is a chance that small point optimizations can be efficient enough | This is just a side-effect of Separate Applications approach | Mainly devops efforts to automate the process |
|
Other remarks and notes
Make data migration on data access layer, without OkAPI? If in the process of data migration a module needs to get additional information from another module, then this can only be done through OkApi calls. If there is a lot of data to migrate, this approach can slow down performance. It is important to note that now a module cannot request data from another module directly, without OkApi, since this violates the principles of module isolation.