2024-01-24 - Direct DB Migration Scripts

Date

Attendees 


Discussion items

Time | Item | Who | Notes
1 min | Scribe | All

Jeremy Huff is next, followed by Marc Johnson.

Reminder:  Please copy/paste the Zoom chat into the notes.  If you miss it, this is saved along with the meeting recording, but having it here has benefits. 

1 min | Reminders | All

Quick reminders to TC members...

  1. Please review Marc's draft changes to the Ramsons OST page.  See thread in #tc-internal for additional context/details. 
    1. Ideally we will have time to go over this (and feedback) together on .  Depending on how that goes we may or may not need a dedicated hour.
  2. Please review the PR for proposed changes to the TCR process:  https://github.com/folio-org/tech-council/pull/55
    1. This will be the topic of discussion on  
  3. Please review the adjustments to the OST page wrt timing of upkeep activities. 
    1. Like #1 I'm hoping we can get to this on  
    2. From Maccabee Levine in #tc-internal:

Please see the thread above about an error on the OST page for when to do one of the status transitions. I have added two new example tables to the OST page, in place of the examples that were listed previously:

  • one which shows all the dates & triggering events that would affect the Quesnelia OST page
  • and one that shows all the events during the Quesnelia release cycle that affect various OST pages (for Orchid, Quesnelia, Ramsons and Sunflower).

Wish I could do a PR on Confluence, but I just published the changes, here's the diff and we can always revert or adjust further.

* | Direct DB Migration Scripts | All

Context/Background:

From Ingolf Kuss in #tc-internal.  See full thread here.

Sys Ops SIG wants to reach out formally to the TC because of the topic of direct db upgrade scripts in Poppy migration.
In the Poppy release notes, there are a number of db scripts described which need to be run by the operator after migration of the tenant.
I find the following scripts in the action column of the Release Notes:
  - Script 3 of https://folio-org.atlassian.net/wiki/display/FOLIOtips/Scripts+for+Inventory%2C+Source+Record+Storage%2C+and+Data+Import+Cleanup
  - https://folio-org.atlassian.net/wiki/display/FOLIJET/Call-numbers+migration
  - https://folio-org.atlassian.net/wiki/display/FOLIJET/Authorities+migration
  - https://folio-org.atlassian.net/wiki/display/FOLIOtips/Migration+scripts+for+OAI-PMH
  - https://folio-org.atlassian.net/wiki/display/FOLIJET/Scripts+to+populate+marc_indexers+version
  - https://folio-org.atlassian.net/wiki/display/FOLIJET/Adding+a+new+member+tenant+to+consortium.+mod-entities-links+scope
So far, in earlier releases, those kind of scripts have (in almost all cases) been part of the module, and have been triggered automatically when the new module is first being enabled for the tenant, while the old module is still enabled for the tenant (the old module is then being disabled and removed in the course of the upgrade).
SysOps SIG strongly feels that some of these scripts should be handled in that way: to be part of the module upgrade triggered by enablement for the tenant. FOLIO operators at Index Data find it pretty burdensome to deal with the upgrade scripts with many tenants in multiple environments. Other members of Sys Ops agree and are confused why those scripts are not part of the module's db migration.

If the migration is long-running (e.g. 4-5 hours), it appears reasonable to put it into a separate script. However, Sys Ops think, it should be available by some post-upgrade API, like /inventory-storage/migrations/jobs is for inventory migration.

If a decision has to be made during upgrade, in an ideal world, a tenant admin (not a sysadmin) should get notified (via UI) that he/she has to make a decision. Until the decision has been made the module may stop working as usual.

We think the TC could document some standard expectation for the POs.

Also, SysOps should be involved in the release retrospective.

Notes:


Ingolf Kuss mentioned that Wayne from Index Data had raised the number of database update scripts and asked whether this is an ideal situation. The unanimous response was that no, this isn't ideal.

Aleksey Petrenko agreed that it is not ideal and clarified that this is not specific to Poppy.

Taras Spashchenko explained why this decision was made: performance tests with plain SQL scripts were taking up to 14 hours. This was not acceptable, so the scripts were rewritten to split the data set into 16 separate chunks. Each chunk took an hour, but the chunks can be run in parallel. This decision was made to reduce the overall runtime of the migration. The approach also allows for remediation per chunk if something goes wrong.
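Illustration (not from the meeting): a minimal Python sketch of the chunk-and-parallelize idea described above. It is not the actual Poppy migration code. The function names, the use of row ranges as the chunking key, and the `demo_migrate_chunk` stand-in are all assumptions made for the example; a real script would run the per-chunk SQL update in place of the stand-in.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def split_into_chunks(total_rows, chunk_count):
    """Return half-open (start, end) row ranges that together cover total_rows."""
    base, extra = divmod(total_rows, chunk_count)
    ranges, start = [], 0
    for i in range(chunk_count):
        # Spread any remainder over the first `extra` chunks.
        size = base + (1 if i < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges


def run_chunks(ranges, migrate_chunk, max_workers=4):
    """Run migrate_chunk(start, end) for every range in parallel.

    Returns a dict mapping each failed range to its exception, so that
    only those chunks need to be fixed and re-run.
    """
    failed = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(migrate_chunk, s, e): (s, e) for s, e in ranges}
        for future in as_completed(futures):
            try:
                future.result()
            except Exception as exc:
                failed[futures[future]] = exc
    return failed


def demo_migrate_chunk(start, end):
    """Hypothetical stand-in for the per-chunk SQL update."""
    if start == 0:
        raise RuntimeError("simulated failure in the first chunk")
```

With these helpers, `run_chunks(split_into_chunks(9_000_000, 16), demo_migrate_chunk)` reports only the first range as failed, which is the per-chunk remediation property mentioned in the discussion: the other 15 chunks do not need to be repeated.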

Craig McNally is this decision handled on a case-by-case basis, or are there guidelines?

Taras Spashchenko it was made based on the specific circumstances.

Marc Johnson what Ingolf wants to talk about is a general set of procedures for this process. Marc asked where the practice of splitting SQL into chunks originated.

Aleksey Petrenko says that this change is an improvement when there is a significant amount of data.

Marc Johnson these changes might be improvements, but where do we want this conversation to go? He is hearing that this is what we have to do, and others are saying they are not happy with this approach.

Craig McNally It is helpful to hear the background on how this decision was made. Since the improvement brought the runtime from 15 hours to 15 minutes, it might be run in-band. Can we do these optimizations beforehand, so that splitting the scripts out of band is not needed?

Aleksey Petrenko: It would be good to get feedback from EBSCO. It would be beneficial to involve team leads who have performed these migrations.

Taras Spashchenko when new fields have been added to the instance table that need to be populated with values based on the holdings and items data, the SQL update takes place on a single thread, so this is a candidate for parallelization. In regard to call number updates, the update requires an update of the JSON object. As a single update it does not take advantage of the DB resources or of parallelization.

Marc Johnson if we are going to talk about specific examples we should get the dev teams' observations. It would be good to have an EBSCO rep in the Sys Ops SIG. He appreciates the background information, but is not sure if this information will help us move forward with the questions at hand.

Ingolf Kuss he is hearing for the first time that it is necessary to run these updates in parallel. Maybe this should be expressed in the release notes. Ingolf Kuss has invited EBSCO representatives to the Sys Ops SIG.

Craig McNally It seems clear that keeping these as in-band scripts is not going to work. It is also a pain point to run these out of band. It is doable but not ideal. It is better but still inconvenient. Maybe it is sufficient for us to just have a better understanding of the situation, and maybe we can document these things as general guidelines for how to approach these decisions. It is being handled on a case-by-case basis. Can we parallelize in the in-band process?

Ingolf Kuss there was not enough time to test this. This explanation helps him understand, and we can produce standardized documentation.

Craig McNally Documenting what the process is will help. If we can look at improvements that will also be useful.

Florian Gleixner Do out-of-band scripts need to be run on a FOLIO system that is pristine? Maybe in-band scripts are better since the usage of modules during migration can be controlled. There are possible situations where the upgrade of one tenant may have a negative impact on other tenants. Even if this upgrade does not present these issues, future updates may.

Jeremy Huff what are the blockers for parallelization during an in-band upgrade?

Taras Spashchenko RMB cannot parallelize database interactions. Spring modules may be able to do this with some changes. Maybe providing some sort of driver script for the out of band script approach might make sense.

Maccabee Levine what is really missing here is documenting the out of band approach. 

Craig McNally Communication is always important. He is not sure if this was mentioned in the release notes of Poppy.

Florian Gleixner two questions: do you have to shut down the tenant for the upgrade (the answer is yes), and how big was the tenant which took 14 hours to upgrade?

Taras Spashchenko it was 9 million records. 

Florian Gleixner Maybe we only do parallelism on large tenants? As for the idea of providing a shell script with the out-of-band scripts, this would be nice to have but it is not necessary.

Craig McNally If the script could be parameterized for number of threads or data chunks, this could be a good idea.
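Illustration (not from the meeting): one way such a parameterized driver could expose the number of chunks and threads, sketched in Python with the standard `argparse` module. The option names (`--tenant`, `--chunks`, `--threads`), their defaults, and the `plan` helper are hypothetical and do not describe any existing FOLIO script.

```python
import argparse


def parse_args(argv=None):
    """Parse the (hypothetical) options of an out-of-band migration driver."""
    parser = argparse.ArgumentParser(
        description="Hypothetical driver for an out-of-band migration script."
    )
    parser.add_argument("--tenant", required=True, help="tenant id to migrate")
    parser.add_argument("--chunks", type=int, default=16,
                        help="number of data chunks")
    parser.add_argument("--threads", type=int, default=4,
                        help="number of chunks to run at the same time")
    args = parser.parse_args(argv)
    if args.chunks < 1 or args.threads < 1:
        parser.error("--chunks and --threads must be at least 1")
    # More threads than chunks gains nothing, so cap the value.
    args.threads = min(args.threads, args.chunks)
    return args


def plan(args):
    """Describe the run; a real driver would start the per-chunk scripts here."""
    return f"tenant={args.tenant} chunks={args.chunks} threads={args.threads}"
```

For example, `plan(parse_args(["--tenant", "diku", "--chunks", "2", "--threads", "8"]))` yields `tenant=diku chunks=2 threads=2`. A small tenant could then be run with `--chunks 1 --threads 1`, which relates to the question above of whether parallelism is only needed for large tenants.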

Ingolf Kuss he thinks a shell script could be helpful. What sort of deployment is documented? What sort of deployment should we document the process for? He feels it should only be for the single server. Sometimes you need to deactivate Kafka during the upgrade. He has heard that jroot does this. Was this a factor in this upgrade?

Craig McNally would prefer if this question was addressed in the sysops group.

Aleksey Petrenko appreciates this feedback. He is happy to see us in the development team, feel free to join.

Craig McNally maybe there could be a tighter integration between development and sysops. Better communication between these two groups might make sense. What are the concrete action items? We want to document the decision process for when migrations need to be split out into asynchronous migrations. Do we have a volunteer?

Marc Johnson The TC has limited experience with this. It would be better for this documentation to be produced by the people who have the most experience with this.

Taras Spashchenko will draft the rationale that was used for Poppy, and we can derive general rules from that.

Craig McNally we can use the Poppy release as a case study for creating general guidelines. We can also take a look at how these can be improved. We will have additional follow-up conversations about this topic.


NA | Zoom Chat

11:09:32 From Jenn Colt to Everyone:
    <comments on pros/cons of fewer releases here> 🙂
11:10:07 From Tod Olson to Everyone:
    Reacted to "<comments on pros/co..." with 🙂
11:26:29 From Marc Johnson to Everyone:
    Running them in parallel is optional AFAIK
11:27:02 From Jenn Colt to Everyone:
    To be fair they are deep in upgrades right now I’m sure
11:30:00 From Tod Olson to Everyone:
    I need to drop off. Thank you for the discussion and I look forward to checking the notes.
11:34:26 From Marc Johnson to Everyone:
    AFAIK all system upgrades require downtime, probably for most tenants, not only the tenant being upgraded
11:35:08 From Ingolf Kuss to Everyone:
    Reacted to "<comments on pros/co..." with 🙂
11:36:55 From Marc Johnson to Everyone:
    RMB is deprecated
    
    Could non-RMB code be added to those modules?
11:39:07 From Marc Johnson to Everyone:
    Alternatively, we build the capability for modules to run post upgrade scripts in the background and report any errors
11:40:37 From Marc Johnson to Everyone:
    A single repository for the whole system does not fit with the application formalisation proposal that intends to provide lower coupled independents parts of the system
11:43:40 From Marc Johnson to Everyone:
    Anything we build needs to respect the community decisions on technology adoption
11:45:28 From Marc Johnson to Everyone:
    The single server documentation is for new installations only IIUC
    
    If that is the case, upgrades are irrelevant for that documentation
11:46:19 From Marc Johnson to Everyone:
    The Kafka topic is a separate topic
11:46:43 From Craig McNally to Everyone:
    Yeah, I don't think we have time to go down that path right now
11:46:46 From Florian Gleixner to Everyone:
    Shutting down Kafka during upgrade only works for single tenant environments
11:47:18 From Marc Johnson to Everyone:
    Changes made directly in the database won’t be published to Kafka
    
    Meaning anything done in that manner won’t be kept in sync between modules
11:48:14 From Jenn Colt to Everyone:
    Maybe there needs to be some visits to the sysops sig
11:48:22 From Marc Johnson to Everyone:
    Reacted to "Maybe there needs to…" with 💯
11:48:38 From Ingolf Kuss to Everyone:
    Reacted to "Maybe there needs to..." with 🙂
11:49:25 From Marc Johnson to Everyone:
    We need to be careful when considering development teams as a homogeneous group 
    
    At least, they vary depending on where they come from e.g. there is no common structure or adoption of a lead dev etc
11:49:38 From Jenn Colt to Everyone:
    Is this part of definitions of done?
11:49:51 From Oleksii Petrenko to Everyone:
    May me we need resurrect Team Lead weekly or biweekly sync up
11:49:58 From Ingolf Kuss to Everyone:
    Reacted to "Is this part of defi..." with 👍
11:50:03 From Ingolf Kuss to Everyone:
    Removed a 👍 reaction from "Is this part of defi..."
11:50:56 From Marc Johnson to Everyone:
    Replying to "Is this part of defi…"
    IME those are not universally adopted or applied
    
    Especially at the release level, rather than development of a story level
11:54:13 From Ingolf Kuss to Everyone:
    it was definitely helpful
11:54:20 From Craig McNally to Everyone:
    Reacted to "it was definitely he..." with 👍
11:54:31 From Florian Gleixner to Everyone:
    Reacted to "it was definitely he..." with 👍

Topic Backlog

Decision Log Review | All

Review decisions which are in progress.  Can any of them be accepted?  rejected?

Translation Subgroup | All | Since we're having trouble finding volunteers for a subgroup, maybe we can make progress during a dedicated discussion session?
Communicating Breaking Changes | All | Since we're having trouble finding volunteers for a subgroup, maybe we can make progress during a dedicated discussion session?
Officially Supported Technologies - Upkeep | All

Previous Notes:

  • A workflow for these pages. When do they transition from one state to another. Do we even need statuses at all ?
  • Stripes architecture group has some questions about the Poppy release.
  • Zak: A handshake between developers, dev ops and the TC. Who makes that decision and how do we pass along that knowledge ? E.g. changes in Nodes and in the UI boxes. How to communicate this ? We have a large number of teams, all have to be aware of it.  TC should be alerted that changes are happening. We have a couple of dedicated channels for that. Most dev ops have subscribed to these channels. How can dev ops folk raise issues to the next level of community awareness ? There hasn't been a specific piece of TC to move that along.
  • Craig: There is a fourth group, "Capacity Planning" or "Release Planning". Slack is the de facto communication channel.  There are no objections to using Slack. An example is the Java 17 RFC. 
  • Craig: The TC gets it on the agenda and we will discuss it. The TC gets the final say.
  • Marc Johnson: We shouldn’t use the DevOps Channel. The dev ops folks have made it clear that it should only be used for support requests made to them.
  • Jakub: Our responsibility is to avoid piling up technical debt.
  • Marc: Some set of people have to actually make the call. Who lowers the chequered flag ?
  • Craig: It needs to ultimately come to the TC at least for awareness. There is a missing piece. Capacity Planning needs to provide input here. 
  • Marc: Stakeholders / Capacity Planning could make that decision. Who makes the decision ? Is it the government or is it some parts of the body ?
  • Marc: the developers community, the dev ops community and sys ops are involved. For example the Spring Framework discussion or the Java 17 discussion. But it was completely separate to the TC decision. It is a coordination and communication effort.
  • Marc: Maybe the TC needs to let go that they are the decision makers so that they be a moderating group.
  • Jakub: I agree with Marc. But we are not a system operating group. Dependency management should be in the responsibility of Release management. There are structures in the project for that.
  • Jason Root: I agree with Jakub and with Marc also. Policies should drive operational/release/support aspects of Folio.
  • Jason Root: If the idea of “support” is that frameworks are supported, then of course the project should meet that.
  • Marc Johnson
    Some group needs to inform Oleksii when a relevant policy event occurs.
    These documents effectively ARE the manifestation of the policy.
  • Craig: This is a topic for the next Monday session.
  • Craig to see if Oleksii Petrenko could join us to discuss the process for updating the officially supported technologies lists.

  • Do common libraries used to build in approved frameworks need to be on this list? Such as spring-way and spring-module-core.


Dev Documentation Visibility | All

Possible topic/activity for a Wednesday session:

  • Discuss/brainstorm:
    • Ideas for the type of developer-facing documentation we think would be most helpful for new developers
    • How we might bring existing documentation up to date and ensure it's consistent 
    • etc.

Action Items