[FOLIO-3613] reference env builds are failing 14.10.2022 Created: 17/Oct/22  Updated: 17/Nov/22  Resolved: 26/Oct/22

Status: Closed
Project: FOLIO
Components: None
Affects versions: None
Fix versions: None

Type: Story Priority: P2
Reporter: Jakub Skoczen Assignee: Julian Ladisch
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original estimate: Not Specified

Issue links:
Relates
relates to MODINVSTOR-972 Revert recent changes due to failure ... Closed
relates to MODINVSTOR-974 Remove DB_*_READER env var from Modul... Closed
relates to MODINVSTOR-975 ItemStorageTest.canMoveItemToNewInsta... Closed
relates to MODINVSTOR-976 AsyncMigrationTest.canMigrateItemsIns... Closed
relates to MODINVSTOR-977 hasRemoveEvent fixing sporadic unit t... Closed
relates to RMB-949 Report PostgreSQL URI on timeout Closed
relates to RMB-348 Add support for database read/write s... Closed
relates to MODINVSTOR-971 KafkaException: Producer closed while... Closed
relates to RMB-946 Retry GET request in TenantLoading (r... Closed
relates to RMB-947 LocalRowSet for RowDesc from both 4.3... Closed
Sprint: CP: Sprint 150, DevOps Sprint 151
Development Team: FOLIO DevOps
RCA Group: TBD

 Description   

See the discussion on #hosted-reference-env

https://folio-project.slack.com/archives/CFQU1MF61/p1665759083072169



 Comments   
Comment by Jakub Skoczen [ 17/Oct/22 ]

Wayne Schneider David Crossley Guys, is it possible to get logs for mod-inventory-storage? E.g. if there is a stack trace related to the "500 Timeout" messages.

Comment by Marc Johnson [ 17/Oct/22 ]

Jakub Skoczen

The last successful platform complete build included mod-inventory-storage-25.0.0-SNAPSHOT.737.

That version came from this build, it was based upon this revision.

This was the change immediately prior to the merges around RMB 35.x upgrade and new Batch APIs.

Comment by John Malconian [ 17/Oct/22 ]

I've submitted a PR to pin mod-inventory-storage to v25.0.0-SNAPSHOT.737. The build-platform-complete-snapshot now passes. I've started a rebuild of folio-snapshot-2. There are Okapi logs available from the previous, failed build-platform-complete-snapshot jobs in the Jenkins UI. If these logs are not sufficient to troubleshoot the RMB issue, I can create a separate, one-off build-complete-snapshot job that retains the ec2 instance after it fails.

Comment by John Malconian [ 17/Oct/22 ]

I've unpinned mod-inventory-storage snapshot after receiving word the breaking changes had been rolled back in mod-inventory-storage and the build-platform-complete-snapshot Jenkins job passes. Changes should be reflected in this evening's folio-snapshot build.

Comment by Jakub Skoczen [ 18/Oct/22 ]

Julian Ladisch we talked with John Malconian about setting up a build that includes MIS with RMB 35 for debugging purposes. Would that be useful to have?

Comment by John Malconian [ 18/Oct/22 ]

I've got an ec2 FOLIO instance running that fails tenant init exactly as it did previously - with mod-inventory-storage version 25.0.0-SNAPSHOT.740. This was the version prior to the RMB 35 changes that were reverted.

Ansible reports the error as the following:

fatal: [10.36.1.105]: FAILED! => {
    "changed": false,
    "connection": "close",
    "content": "Tenant operation failed for module mod-inventory-storage-25.0.0-SNAPSHOT.740: GET http://10.36.1.105:9133/identifier-types/1795ea23-6856-48a5-a772-f356e16a8a6c returned status 500: Timeout POST http://10.36.1.105:9133/identifier-types returned status 400: duplicate key value violates unique constraint \"identifier_type_pkey\": Key (id)=(1795ea23-6856-48a5-a772-f356e16a8a6c) already exists.",
    "content_length": "390",
    "content_type": "text/plain",
    "elapsed": 499,
    "msg": "Status code was 500 and not [200]: HTTP Error 500: Internal Server Error",
    "redirected": false,
    "status": 500,
    "url": "http://10.36.1.105:9130/_/proxy/tenants/diku/install?deploy=true&tenantParameters=loadSample%3Dtrue%2CloadReference%3Dtrue%2CrunReindex%3Dtrue"
}

I've DM'ed Julian information pertinent to connecting to the system.

Comment by Wayne Schneider [ 18/Oct/22 ]

FWIW I note that the UUID referred to above is referenced in both the reference data for the module:

https://github.com/folio-org/mod-inventory-storage/blob/master/reference-data/identifier-types/UPC.json

And in a db script:

https://github.com/folio-org/mod-inventory-storage/blob/master/src/main/resources/templates/db_scripts/addIdentifierTypesUpcIsmn.sql

Comment by Marc Johnson [ 18/Oct/22 ]

FWIW I note that the UUID referred to above is referenced in both the reference data for the module And in a db script

Typically, those scripts are intended to create reference records in situations where the reference record was introduced after the initial installation, reference record loading is disabled, and yet the record is still desired.

If that script is interfering in an initial load with reference records enabled, then that suggests a flaw in that script.

Comment by Julian Ladisch [ 19/Oct/22 ]

Environment variables in mod-inventory-storage container:

                "DB_HOST=10.36.1.175",
                "DB_HOST_READER=postgres-read",

But the hostname postgres-read is not configured.
The GET /identifier-types/1795ea23-6856-48a5-a772-f356e16a8a6c is the first request that uses the DB_HOST_READER and not DB_HOST. It times out because the DNS lookup of the non-existing hostname takes too long.
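The failure mode described here can be illustrated with a minimal sketch. This is not RMB's actual code; only the environment variable names (DB_HOST, DB_HOST_READER) come from the ticket, and the function is hypothetical:

```python
import os

def pick_db_host(read_only: bool) -> str:
    """Choose the database host for a query, falling back to the
    writer when no reader host is configured (illustrative sketch,
    not RMB's implementation)."""
    writer = os.environ["DB_HOST"]
    reader = os.environ.get("DB_HOST_READER")  # optional read replica
    if read_only and reader:
        # If this hostname does not resolve (as with the unconfigured
        # "postgres-read" above), the first read-only query stalls in
        # the DNS lookup and the request eventually times out.
        return reader
    return writer

os.environ["DB_HOST"] = "10.36.1.175"
os.environ.pop("DB_HOST_READER", None)
print(pick_db_host(read_only=True))  # reader unset: falls back to the writer
```

Under this model, the first read-only request is the first one that touches the bad hostname, which matches the observed behavior: tenant init succeeds on writes and then times out on the GET.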

Comment by Jakub Skoczen [ 20/Oct/22 ]

Julian Ladisch thanks for figuring this out.

Before we simply define these env variables: I would expect the system to fail more gracefully here and either refuse to start or continue to work without read/write separation.

Note: Julian has already created a PR for more graceful error reporting: https://folio-org.atlassian.net/browse/RMB-949

Comment by Julian Ladisch [ 20/Oct/22 ]

I don't think that we should silently use the writer db if the reader db config is wrong, because sysOps configure a reader db for a reason.
Instead, a better error message should be provided in the log and in the POST /_/tenant HTTP response:
RMB-949 Closed "Report PostgreSQL URI on timeout" = https://github.com/folio-org/raml-module-builder/pull/1095

I don't see a reason why the database read/write splitting feature ( RMB-348 Closed ) should be removed from RMB 35.

Comment by Julian Ladisch [ 20/Oct/22 ]

Before we simply define these env variables

There is no need to define the DB_HOST_READER value. If no reader db exists simply don't define the DB_HOST_READER value.

Comment by Julian Ladisch [ 20/Oct/22 ]

Simply undefine the DB_HOST_READER value and the hosted reference environments (snapshot, snapshot-2) will build successfully with all RMB 35 modules.

Comment by Jakub Skoczen [ 20/Oct/22 ]

Julian Ladisch I agree: if the variable is defined but wrong, the system should fail loud and clear, so your fix is great. And the feature should be deactivated when no env var is specified, which, as you say, seems to be the case.

John Malconian Marc Johnson Guys, can we remove the DB_HOST_READER parameters and revert the RMB35 revert and kick off a new snapshot build?
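One way to get the loud, early failure discussed above is a preflight check at startup that resolves DB_HOST_READER before the module accepts traffic. This is a sketch of the idea, not an existing RMB feature:

```python
import os
import socket
import sys

def preflight_reader_host() -> None:
    """Fail fast at startup if DB_HOST_READER is set but unresolvable,
    instead of timing out on the first read-only query (sketch only)."""
    reader = os.environ.get("DB_HOST_READER")
    if not reader:
        return  # read/write splitting disabled; nothing to check
    try:
        socket.gethostbyname(reader)
    except socket.gaierror:
        sys.exit(f"DB_HOST_READER={reader!r} does not resolve; "
                 "unset it or fix DNS before starting the module")

preflight_reader_host()
```

With an unresolvable hostname like "postgres-read", this exits immediately with a clear message rather than letting tenant init hang until the 500 Timeout.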

Comment by Jakub Skoczen [ 20/Oct/22 ]

Btw, John Malconian, any idea why the "DB_HOST_READER" has been configured for this deployment? Where is it defined?

Comment by Marc Johnson [ 20/Oct/22 ]

Jakub Skoczen

can we remove the DB_HOST_READER parameters and revert the RMB35 revert and kick off a new snapshot build?

I initially thought this would be defined in the launch descriptor section of the module descriptor at the point of the RMB 35.x upgrade, as I believe this is used for defaulting a bunch of configuration. However, this environment variable is not defined there.

The definition of this environment variable was included in the PR for the new unsafe batch APIs.

A search of the whole folio-org github organisation does not find any definitions of this environment variable outside of RMB, so I think this is localised.

That suggests we should be able to upgrade to RMB 35.x without this issue recurring. We should not re-apply those API changes as-is; instead, that work needs to be altered before being added back in.

Julian Ladisch John Malconian does this fit with your understanding?

Comment by Jakub Skoczen [ 20/Oct/22 ]

Marc Johnson I talked with Julian Ladisch and he will re-open PRs with the reverted commits but remove the DB_*_READER envvars. Does this sound good? We will ask you and Kevin Day to review, OK?

Comment by Marc Johnson [ 20/Oct/22 ]

Jakub Skoczen

I talked with Julian Ladisch and he will re-open PRs with the reverted commits but remove the DB_*_READER envvars. Does this sound good?

Sure, that is totally ok with me.

I think we should upgrade to RMB 35.x completely separately from this work and make sure that runs through at least one platform core snapshot build before we merge that work.

I've already raised a pull request for that change.

We will ask you and Kevin Day to review, OK?

Ok, my review will be limited to only checking for changes relating to those environment variables and RMB versions etc. I won't review the functional changes.

Comment by Marc Johnson [ 20/Oct/22 ]

Jakub Skoczen Julian Ladisch

Unfortunately, builds for that PR are consistently failing.

Kevin Day Julian Ladisch please can you work together to figure out how to address that.

Comment by Jakub Skoczen [ 20/Oct/22 ]

Marc Johnson I am all for first bringing back RMB 35 and then rolling back the revert for batch update work cc Julian Ladisch

Comment by Marc Johnson [ 20/Oct/22 ]

Julian Ladisch Jakub Skoczen

I am all for first bringing back RMB 35 and then rolling back the revert for batch update work

Following these comments, I think there has been some confusion around how we would resolve the issues with the RMB 35.x upgrade as Julian Ladisch has already submitted a new pull request for reinstating the unsafe APIs and raised MODINVSTOR-973 Closed and MODINVSTOR-974 Closed as separate issues.

To avoid further confusion, I think it is better if the Prokopovych team step away from this work and let the Core Platform team undertake the changes in the way they want. Unless I am told otherwise, I am going to assume that the Core Platform team are taking over resolution of this issue.

I think it could have been better for us all to have had a conversation about this.

Generated at Thu Feb 08 23:29:25 UTC 2024 using Jira 1001.0.0-SNAPSHOT#100246-sha1:7a5c50119eb0633d306e14180817ddef5e80c75d.