Data migration process

Brief context

  • Release of some features sometimes requires data migration in database (DDL like add/remove columns, or JSON transformation)
  • In some cases data migration takes a very long time (e.g. up to 10-13 h was mentioned)
  • How can we improve this? Can any patterns or approaches be recommended?

MODINVSTOR-812 analysis

Analyze MODINVSTOR-812 - Getting issue details... STATUS / MODINVSTOR-774 - Getting issue details... STATUS to identify root cause(-s) of long migration (e.g., ongoing indexing on DB, single thread processing etc.) and make optimization of identified bottlenecks.

Notes:

  • https://github.com/folio-org/mod-inventory-storage/blob/master/src/main/resources/templates/db_scripts/populatePublicationPeriod.sql
  • All migration is done when the tenant is initialized. If it takes too long, it prevents the module from starting
  • This particular script fills in 2 new fields (publicationPeriod.start and publicationPeriod.end) and affects each instance in the inventory; at least one of the tenants contains ~ 16M instances
  • The script is executed as SQL function on Postgres DB engine, so - no any cross-module communications or API calls, no network involved at all; seems to be no massive load on database itself
  • The logic is pretty straightforward - retrieve JSONB for an instance, parse publicationDate with a set of regex patterns, build a new JSONB and save it to database
  • Currently there are no indices on dateOfPublication , publicationPeriod.start and publicationPeriod.end fields
  • Does it seem that JSONB operations are the most heavyweight in the migration?

  • https://github.com/folio-org/mod-inventory-storage/pull/686
  • The problem was a missing index when working with the publicationPeriod field, which resulted in a long scan when fetching. Optimization - add an index on publicationPeriod before performing the migration and drop this index after the migration is complete
  • Tested by Falcon team on performance environment with 7M records, data migration process took ~35 min
  • So, this issue is a good example of a case suitable for optimization

Establish a process to manage data migrations

The reason for the MODINVSTOR-812 problem is that the originally implemented migration script fully met the functional requirements, but did not meet the non-functional performance requirements. Moreover, the performance requirements themselves are not formulated, and there is also no validation against large / production-like amount of data.

  • (plus) Request from FSE team more detailed information regarding data migration process - which migrations happen (happened) in Juniper, how much time do they take?
    • At the moment, there is no (and cannot be obtained) information about how long it takes to migrate data for a particular module. Based on observations, the most problematic modules are mod-inventory-storage, mod-inventory, mod-srs, mod-srm, mod-orders though in general it would make sense to focus on -storage modules
    • The deployment pipeline could be updated to log more details about the migration process

To make the process of preparing and testing data migration scripts more transparent, understandable and effective, and to reduce the risks of problems occurring directly during the release / migration on the production environment, 2 main steps are proposed:

  1. Analysis and review of data migration scripts at the stage of preparation and development
    1. (plus) Add a specific label data-migration to all Jira features/tasks related to data migration
    2. (plus) Update Architecture board https://folio-org.atlassian.net/secure/RapidBoard.jspa?rapidView=224 to reflect such labeled tasks in In Review, In Progress and In Code Review statuses
    3. Organize code review and analysis of labeled tickets and PRs by TechLs/SAs using Best practices and Check-list for data-migration tasks below
      1. (plus) Communicate this process and required steps to involved teams and interested stakeholders, share this documentation
    4. (plus) Prepare a brief instruction on How to test data migration performance on Rancher , and communicate to dev teams to conduct performance testing of a particular migration on a temporary performance environment 
  2. (plus) Performance testing for all data migrations on a dedicated environment and on a sufficiently large amount of data as a mandatory step in order to receive an early performance metrics
    1. UXPROD-3387 - Getting issue details... STATUS and RANCHER-191 - Getting issue details... STATUS

Open questions regarding performance testing include:

  • Who runs the tests
  • Which data sets do we use (size, variety, etc.)
  • What constitutes a reasonable amount of time for a migration script... it's somewhat arbitrary
  • Where are these tests run...  is there a way to easily populate existing envs (scratch envs, etc.) with the target data sets? Are the envs shared?

Other thoughts:

  • What is an acceptable threshold?
    • Database update takes as little time as possible, though 2 hrs is usually acceptable is mentioned in MODINVSTOR-812
    • Should it be set per 1 migration and SLA per whole release?
  • Acceptable threshold should include baseline infrastructure and amount of data for performance testing
  • This acceptable threshold should be communicated to all development teams and set as a Non-functional requirement
  • Comparing of mentioned performance testing results with acceptable threshold is resulted in a go/no-go decision
  • Measured actual duration(s) should be included into release documentation

Best practices and Check-list for data-migration tasks

This is a list of items that make sense to pay attention to while implementation and reviewing of data-migration labeled tasks; the list can be replenished and expanded:

  • the presence of an index for one or more fields, which leads to a constant rebuilding of the index when changing each record
  • (when using Java code) work is done in one thread, sequentially, record by record
  • there is no index on some field used to select the original data
  • during data migration, additional information is requested from other modules via API
  • ...

Also, please take a look at an example of Cross-module migration solution design.

More examples of data migrations:

Re-think data migration approach

It should be noted that although the process proposed and formulated in the previous section is intended to streamline data migrations, it is likely that a more serious and in-depth rethinking of the current approach on the platform is required.

Perhaps, taking into account such characteristics as downtime, multi-tenancy, etc., we can talk about eliminating heavy-weight data migration as a mandatory step at release, and turning it into a gradual migration that does not require stopping applications.

In any case, this may be the topic of a separate study. The table below shows several possible options. Feedback is appreciated.


#1 Lazy migration#2 Expand-contract pattern#3 Optimization#4 Separate Applications approach#5 Blue-green deployment#6 Aggregate several data migration in one
Explanation

Update FOLIO platform without immediate data migration, and start a background task(-s) for gradual data migration

Applications are to be ready to work with both migrated and not-yet-migrated data while migration is in progress

Update FOLIO platform without immediate data migration but build the migration into the code, and do a "modify upon get". So whenever a record is accessed, perform the necessary changes and write them back to the DB. This way the migration happens over time, and only one schema is in play at a given time. More to read https://www.prisma.io/dataguide/types/relational/expand-and-contract-pattern

Follow existing way of data migration as a part of release but review and analyze data migration scripts, and optimize themFollow an approach of FOLIO deployment as not a monolith but rather as a set of independent applications. As the result release of individual applications will be possible, and data migration will also be distributed among applicationsBefore migration a) create a clone of main database, b) execute data migration on the clone, c) roll over all new changes that happened during the migration, d) during release - switch main database from the previous one to this clone, and consider the clone as a new main db
BenefitsNo data migration as a part of releaseNo data migration as a part of releaseNo significant changes in the FOLIO platform logic. Potential improvements with minimal costsEven long migration will affect a particular application only rather than full platformSafe non-blocking migration
ConcernsThe data migration process can be less transparent to engineers, while leaving the risks of partially completed migrations; application logic gets more complex as wellNot suitable for cases with breaking changes or when fully migrated data is required for proper functioning (e.g. for filtering, or sorting, etc.)There might be a case of still unsatisfactory performance while no place for further optimizationData migration is distributed across applications, and the approach to migration itself does not change. Therefore, a lengthy migration process is still possible, although this will only affect one application and not the entire platformStep c) from the list above - accurate synchronization of all changes that occurred in the main database during the period of time between cloning and switching is required
EffortsTotally new approach, so that one need implement background task runner, make sure applications can work with both versions of dataTotally new approachThere is a chance that small point optimizations can be efficient enoughThis is just a side-effect of Separate Applications approachMainly devops efforts to automate the process

Other remarks and notes

  • (question) Make data migration on data access layer, without OkAPI? If in the process of data migration a module needs to get additional information from another module, then this can only be done through OkApi calls. If there is a lot of data to migrate, this approach can slow down performance. It is important to note that now a module cannot request data from another module directly, without OkApi, since this violates the principles of module isolation.