[FOLIO-2662] test automatic migration for Goldenrod Created: 29/Jun/20  Updated: 18/Aug/20  Resolved: 07/Aug/20

Status: Closed
Project: FOLIO
Components: None
Affects versions: None
Fix versions: None

Type: Task Priority: P2
Reporter: Jakub Skoczen Assignee: jroot
Resolution: Done Votes: 0
Labels: devops-backlog
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original estimate: Not Specified

Issue links:
Blocks
is blocked by MODCONF-54 Upgrade issue between Q1 and Q2: "pub... Closed
is blocked by MODORDSTOR-161 Upgrade issue between Q1 and Q2: "pub... Closed
is blocked by MODORGSTOR-74 Upgrade issue between Q1 and Q2: "pub... Closed
is blocked by MODPERMS-88 Upgrade issue between Q1 and Q2: "pub... Closed
is blocked by MODPUBSUB-106 Enabling module fails in Fameflower-G... Closed
is blocked by MODPUBSUB-112 Upgrade issue between Q1 and Q2: PubS... Closed
is blocked by MODUSERS-213 current transaction is aborted Closed
is blocked by RMB-687 ResponseException for TenantAPI, fix ... Closed
Sprint: DevOps: sprint 94
Development Team: FOLIO DevOps
Affected Institution:
TAMU

 Description   

Issue used to capture the upgrade testing results and to link back any bug tickets as they are created.



 Comments   
Comment by Jakub Skoczen [ 15/Jul/20 ]

Asked Brandon Tharp if the migration could be retested early next week since Jason is away.

Comment by jroot [ 29/Jul/20 ]

This is ongoing. My DRAFT process for testing an upgrade is the following:

1.) Grab the install.json and okapi-install.json files from the desired Folio-org's platform-complete release branch: https://github.com/folio-org/platform-complete

2.) Git Clone the appropriate Libraries Folio release repo from here: <Private Git repo URL, alternative here: https://github.com/folio-org/folio-install/tree/kube-rancher/alternative-install/kubernetes-rancher/TAMU>

3.) Copy in the json files from the 1st step above to the /deploy-jobs/create-deploy/install folder.

4.) Update the install.json and okapi-install.json under /deploy-jobs/create-deploy-pubsub/install to include only the version of pubsub that is desired.

5.) Build and push two create-deploy Docker images to the VMware Integrated Containers registry: qX-202X-pubsub and qX-202X-test.

Docker command examples below:

docker build -t vic.library.tamu.edu/folio/create-deploy:q2-2020-test .
docker push vic.library.tamu.edu/folio/create-deploy:q2-2020-test
docker build -t vic.library.tamu.edu/folio/create-deploy:q2-2020-pubsub .
docker push vic.library.tamu.edu/folio/create-deploy:q2-2020-pubsub

6.) In Rancher Dev, deploy the "create-upgrade-pubsub" K8s Job to the appropriate upgrade testing namespace first using the Diku/Tamu-tenant-config secret and tagged qX-202X-pubsub image mentioned above. For the Job Configuration settings set Completions, Parallelism and Back Off Limit to 1.

7.) If it succeeded, in Rancher Dev deploy the "create-upgrade-diku/tamu" K8s Job to the appropriate upgrade testing namespace second using the Diku/Tamu-tenant-config secret and the qX-202X-test image mentioned above. For the Job Configuration settings set Completions, Parallelism and Back Off Limit to 1. Set a long Active Deadline Seconds for the Job (I use 10000).

8.) If any of it fails, get the logs from Splunk Rancher index and/or from the containers themselves and record them. Folio Issue Jiras will need to be filed.
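The fetch-and-filter part of steps 1, 3 and 4 above can be sketched roughly as follows. This assumes the release repo from step 2 is already cloned and is the working directory, and that python3 is available; the branch name and the pubsub-only filter are illustrative.

```shell
# Rough sketch of steps 1, 3 and 4 (assumption: run from the cloned release
# repo of step 2; branch name and paths are examples from this ticket).
BRANCH=q2-2020
for f in install.json okapi-install.json; do
  curl -sSLO "https://raw.githubusercontent.com/folio-org/platform-complete/${BRANCH}/${f}"
done
cp install.json okapi-install.json deploy-jobs/create-deploy/install/
# Step 4: keep only the desired mod-pubsub entry for the pubsub-first job
python3 - <<'EOF'
import json
mods = json.load(open("install.json"))
pubsub = [m for m in mods if m["id"].startswith("mod-pubsub")]
json.dump(pubsub, open("deploy-jobs/create-deploy-pubsub/install/install.json", "w"), indent=2)
EOF
```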

To roll back:

1.) In Rancher Dev, spin down the Okapi container in the appropriate upgrade testing namespace, and spin down the Postgres Okapi database and Postgres modules database containers in their corresponding "postgres-modules/okapi" namespaces.

2.) Restore the Postgres data volumes in vSphere as they were snapshot the night before. You can find the volume names for the databases under the Volumes tab in Rancher Dev - Folio Project.

3.) In Rancher Dev, spin back up the two database containers to 1 pod each, then the Okapi container to 1 pod.

Note: If any new modules need deploying, copy an existing older version of the Workload in Rancher Dev, and update the name and tag. You can also update the "workloads.yaml" provided in the appropriate Folio Git repo for Libraries under the YAML folder, and import it to the appropriate testing namespace in Rancher Dev.
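The workloads.yaml route in the note above might look something like this from the command line (TAMU used the Rancher UI import; the file layout, image tag, and namespace here are assumptions for illustration):

```shell
# Hedged sketch of the workloads.yaml route in the note above. File layout,
# module image tag, and namespace name are assumptions, not TAMU's actual values.
sed -i.bak 's|folio/mod-pubsub:.*|folio/mod-pubsub:1.2.5|' yaml/workloads.yaml
kubectl apply -f yaml/workloads.yaml -n folio-upgrade-test
```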

Comment by jroot [ 29/Jul/20 ]

Many Jira issues have been created in the course of testing; some have already been completed and resolved successfully:

https://folio-org.atlassian.net/browse/MODORDSTOR-161
https://folio-org.atlassian.net/browse/MODCONF-54
https://folio-org.atlassian.net/browse/MODORGSTOR-74
https://folio-org.atlassian.net/browse/MODPERMS-88
https://folio-org.atlassian.net/browse/MODPUBSUB-112
https://folio-org.atlassian.net/browse/MODPUBSUB-106

A majority of these have been closed after being tested successfully. There appear to still be issues with mod-pubsub, mod-feesfines and mod-circ.

Comment by jroot [ 31/Jul/20 ]

During more testing and log tracing, it was determined that when upgrading a whole instance at once, these components/modules need to be upgraded for the tenant first:

okapi v3.x
mod-pubsub
mod-permissions
mod-authtoken

The reason is that the new SYS permissions introduced by Okapi 3.x do not exist yet. Okapi does not appear to apply this order of operations for you, so it must be done deliberately. Incremental upgrades of modules for a tenant do not seem to trigger the same issues.

Details on the errors are in the Comments section of these tickets:
https://folio-org.atlassian.net/browse/MODPUBSUB-106
https://folio-org.atlassian.net/browse/MODPUBSUB-112
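The ordering above can be sketched as two install calls against Okapi before posting the full install.json. Host, tenant, and module versions are examples (the Goldenrod versions shown are the ones quoted later in this ticket):

```shell
# Hedged sketch: enable the ordering-sensitive components for one tenant
# first. OKAPI host, TENANT, and module versions are example values.
OKAPI=http://okapi:9130
TENANT=tamu
curl -sS -X POST -H "Content-Type: application/json" \
  -d '[{"id":"okapi-3.1.2","action":"enable"}]' \
  "${OKAPI}/_/proxy/tenants/${TENANT}/install"
curl -sS -X POST -H "Content-Type: application/json" \
  -d '[{"id":"mod-pubsub-1.2.5","action":"enable"},
       {"id":"mod-permissions-5.11.2","action":"enable"},
       {"id":"mod-authtoken-2.5.1","action":"enable"}]' \
  "${OKAPI}/_/proxy/tenants/${TENANT}/install"
```

Only after these succeed should the remaining modules be posted for the tenant.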

After having successfully upgraded two tenants with data, I am not able to log in to either tenant after building the new Q2 front-end. I am hitting these errors:

From the Folio UI:

Sorry, the information entered does not match our records.

From mod-users pod log:

INFO: loadDbSchema: Loaded templates/db_scripts/schema.json OK
21:31:34 INFO CQLWrapper CQL >>> SQL: username==tamu_admin >>>WHERE lower(f_unaccent(users.jsonb->>'username')) LIKE lower(f_unaccent('tamu\_admin')) LIMIT 10 OFFSET 0
Jul 30, 2020 9:31:34 PM org.folio.cql2pgjson.CQL2PgJSON loadDbSchema
INFO: loadDbSchema: Loaded templates/db_scripts/schema.json OK
21:31:34 INFO CQLWrapper CQL >>> SQL: username==tamu_admin >>>WHERE lower(f_unaccent(users.jsonb->>'username')) LIKE lower(f_unaccent('tamu\_admin')) LIMIT 10 OFFSET 0
21:31:34 ERROR PgUtil current transaction is aborted, commands ignored until end of transaction block
io.vertx.pgclient.PgException: current transaction is aborted, commands ignored until end of transaction block
at io.vertx.pgclient.impl.codec.ErrorResponse.toException(ErrorResponse.java:29) ~[mod-users-fat.jar:?]
at io.vertx.pgclient.impl.codec.QueryCommandBaseCodec.handleErrorResponse(QueryCommandBaseCodec.java:57) ~[mod-users-fat.jar:?]
at io.vertx.pgclient.impl.codec.PgDecoder.decodeError(PgDecoder.java:233) ~[mod-users-fat.jar:?]
at io.vertx.pgclient.impl.codec.PgDecoder.decodeMessage(PgDecoder.java:122) [mod-users-fat.jar:?]
at io.vertx.pgclient.impl.codec.PgDecoder.channelRead(PgDecoder.java:102) [mod-users-fat.jar:?]
at io.netty.channel.CombinedChannelDuplexHandler.channelRead(CombinedChannelDuplexHandler.java:251) [mod-users-fat.jar:?]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) [mod-users-fat.jar:?]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) [mod-users-fat.jar:?]
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) [mod-users-fat.jar:?]
at io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:286) [mod-users-fat.jar:?]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) [mod-users-fat.jar:?]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) [mod-users-fat.jar:?]
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) [mod-users-fat.jar:?]
at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410) [mod-users-fat.jar:?]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) [mod-users-fat.jar:?]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) [mod-users-fat.jar:?]
at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919) [mod-users-fat.jar:?]
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:163) [mod-users-fat.jar:?]
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:714) [mod-users-fat.jar:?]
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:650) [mod-users-fat.jar:?]
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:576) [mod-users-fat.jar:?]
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493) [mod-users-fat.jar:?]
at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989) [mod-users-fat.jar:?]
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) [mod-users-fat.jar:?]
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) [mod-users-fat.jar:?]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_222]
Jul 30, 2020 9:31:37 PM org.folio.cql2pgjson.CQL2PgJSON loadDbSchema

Before the upgrade, login capability was tested successfully.

Comment by jroot [ 03/Aug/20 ]

It was noted that when doing an Okapi upgrade for the instance, the newer Okapi module version gets enabled for the supertenant automatically, but not for the institutional tenants.

Enabling the newer Okapi module for the tenant made no difference to the Q2 tenant upgrade either way: the upgrade still reports as successful, but I still cannot log in. A bug has been filed against mod-users for this (https://folio-org.atlassian.net/browse/MODUSERS-213).

Comment by Wayne Schneider [ 07/Aug/20 ]

Here are some notes on Index Data's experience with Goldenrod upgrades, mostly confirming jroot's notes above:

Our experience attempting to upgrade from Fameflower to Goldenrod in place is rather mixed. There seem to be some issues with the order in which you proceed, due to the Okapi major version upgrade, the minor version upgrades to mod-permissions and mod-authtoken that are required to support module permissions, and changes to mod-pubsub. The safest high-level upgrade order seems to be:

  1. Upgrade Okapi and enable new version for all tenants
  2. For each regular tenant, upgrade mod-permissions, mod-authtoken, and mod-pubsub. This will break the tenant temporarily (users will be unable to log in, timer tasks will fail)
  3. Upgrade the rest of the modules for each regular tenant by posting the `install.json` file
  4. Upgrade the supertenant

Detailed procedure

  1. Register all Goldenrod module descriptors with Okapi (easiest using Okapi module descriptor sharing)
  2. Deploy all Goldenrod modules and register with Okapi discovery service
  3. Load PostgreSQL extensions in the public schema of each RMB module's database with the command `ALTER EXTENSION pg_trgm SET SCHEMA public;` (if all module schemas are in one database, this only needs to be done once; it is a database-level operation)
  4. Upgrade Okapi. Goldenrod version of Okapi is 3.1.2.
  5. For each tenant (except for the supertenant, which is automatically upgraded), upgrade the Okapi API to the Goldenrod version with a POST to /_/proxy/tenants/<tenantId>/install with payload: [ { "id": "okapi-3.1.2", "action": "enable" } ]
  6. To upgrade a tenant, first upgrade mod-permissions, mod-authtoken, and mod-pubsub to the Goldenrod versions with a POST to /_/proxy/tenants/<tenantId>/install with payload:
    [ {
      "id" : "mod-permissions-5.11.2",
      "action" : "enable"
    }, {
      "id" : "mod-authtoken-2.5.1",
      "action" : "enable"
    }, {
      "id" : "mod-pubsub-1.2.5",
      "action" : "enable"
    } ]
    

    AT THIS POINT:

  • mod-pubsub may throw errors on init if the pubsub permissions user record is not present (this was true in several of our installations, not sure how it happens).
    20:25:41.048 [vert.x-eventloop-thread-0] ERROR SecurityManagerImpl  [80559444eqId] Failed to add permission source-storage.events.post for pub-sub user. Received status code 400
    20:25:41.058 [vert.x-eventloop-thread-0] ERROR SecurityManagerImpl  [80559454eqId] Failed to add permission inventory.events.post for pub-sub user. Received status code 400
    20:25:41.060 [vert.x-eventloop-thread-0] ERROR SecurityManagerImpl  [80559456eqId] Failed to add permission source-records-manager.events.post for pub-sub user. Received status code 400
    20:25:41.061 [vert.x-eventloop-thread-0] ERROR SecurityManagerImpl  [80559457eqId] Failed to add permission patron-blocks.events.post for pub-sub user. Received status code 400
    20:25:41.070 [vert.x-eventloop-thread-0] ERROR SecurityManagerImpl  [80559466eqId] Failed to add permission circulation.events.post for pub-sub user. Received status code 400
    20:25:41.115 [vert.x-eventloop-thread-0] ERROR SecurityManagerImpl  [80559511eqId] pub-sub user was not logged in, received status 422
    

To resolve, create a permissionsUser record with the appropriate permissions:

{
  "userId": "<UUID of pub-sub user>",
  "permissions": [
    "source-storage.events.post",
    "source-records-manager.events.post",
    "inventory.events.post",
    "circulation.events.post",
    "patron-blocks.events.post"
  ]
}
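One hedged way to create that permissionsUser record is a POST to mod-permissions' /perms/users endpoint; the UUID and token placeholders below are from the record above, and the host is an example:

```shell
# Hedged sketch: POST the permissionsUser record above to mod-permissions.
# <UUID of pub-sub user> and <tokenValue> are placeholders; host is an example.
curl -sS -X POST \
  -H "Content-Type: application/json" \
  -H "X-Okapi-Tenant: <tenantId>" \
  -H "X-Okapi-Token: <tokenValue>" \
  -d '{
        "userId": "<UUID of pub-sub user>",
        "permissions": [
          "source-storage.events.post",
          "source-records-manager.events.post",
          "inventory.events.post",
          "circulation.events.post",
          "patron-blocks.events.post"
        ]
      }' \
  http://okapi:9130/perms/users
```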
  • Timer calls start failing. This is OK, they recover after the full upgrade:
    timer call failed to module mod-circulation-18.0.12 for tenant sim : POST request for mod-circulation-18.0.12 /circulation/scheduled-anonymize-processing failed with 500: HTTP request to "http://okapi:9130/configurations/entries" failed, status code: 403, response: "Access requires permission: configuration.entries.collection.get"
    

Finally, post the full upgrade install.json file from the platform-complete q2-2020 branch with a command like:

curl -w '\n' -D - -X POST -d @install.json -H "X-Okapi-Token: <tokenValue>" http://okapi:9130/_/proxy/tenants/<tenantId>/install

Note that this request can take anywhere from several minutes to several hours to return, depending on the size of the tenant dataset, so plan accordingly.

Generated at Thu Feb 08 23:22:19 UTC 2024 using Jira 1001.0.0-SNAPSHOT#100246-sha1:7a5c50119eb0633d306e14180817ddef5e80c75d.