[FOLIO-3034] "testing/testing-backend" Vagrant build fails Created: 23/Feb/21  Updated: 09/Mar/21  Resolved: 09/Mar/21

Status: Closed
Project: FOLIO
Components: None
Affects versions: None
Fix versions: None

Type: Bug Priority: TBD
Reporter: Wayne Schneider Assignee: Wayne Schneider
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original estimate: Not Specified

Attachments: okapi-1.log, okapi.log
Issue links:
Blocks
is blocked by OKAPI-988 Deployment fails with Docker (part two) Closed
is blocked by OKAPI-992 Okapi discovery times out after 5 min... Closed
Relates
relates to OKAPI-984 Retry when slow module startup causes... Closed
relates to OKAPI-994 Include module ID in some deployment ... Closed
Sprint: DevOps Sprint 109, DevOps Sprint 108
Development Team: FOLIO DevOps

 Description   

The "testing" and "testing-backend" Vagrant builds have been failing the last few days with an error in the "Post install list for deployment and enabling" task:

testing: fatal: [default]: FAILED! => {"changed": false, "connection": "close", "content": "Deployment failed. Could not connect to port 9165: readAddress(..) failed: Connection reset by peer", "content_length": "99", "content_type": "text/plain", "elapsed": 166, "msg": "Status code was 400 and not [200]: HTTP Error 400: Bad Request", "redirected": false, "status": 400, "url": "http://10.0.2.15:9130/_/proxy/tenants/diku/install?deploy=true&tenantParameters=loadSample%3Dtrue%2CloadReference%3Dtrue", "vary": "origin"}

The port in the error message varies, but the rest of the error is consistent. Apparently some module is crashing before the tenant init call. Not clear why this is not happening in the AWS reference builds.
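
For readers decoding the failing URL above: the tenantParameters value is a single comma-separated key=value string that has been percent-encoded. A minimal sketch of how that query string could be assembled follows. The helper name install_url is hypothetical and this is not the code the Ansible task runs; only the host, tenant, and parameter names come from the error message.

```python
from urllib.parse import urlencode


def install_url(okapi_url, tenant, tenant_params, deploy=True):
    """Build the tenant install URL seen in the error message (hypothetical helper)."""
    # Join the parameters into one comma-separated string: "loadSample=true,loadReference=true"
    joined = ",".join(f"{key}={value}" for key, value in tenant_params.items())
    # urlencode percent-encodes "=" as %3D and "," as %2C inside the value
    query = urlencode({"deploy": str(deploy).lower(), "tenantParameters": joined})
    return f"{okapi_url}/_/proxy/tenants/{tenant}/install?{query}"


url = install_url(
    "http://10.0.2.15:9130",
    "diku",
    {"loadSample": "true", "loadReference": "true"},
)
print(url)
```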



 Comments   
Comment by Wayne Schneider [ 23/Feb/21 ]

As Adam Dickmeiss suspected, this problem does not occur with Okapi v4.6.3. okapi.log attached. You can see in the log that a number of modules fail to respond after 30 attempts and are shut down.

It's not clear why the problem does not occur in the AWS reference environment build. One possibility is that the Nexus Docker Hub mirror is much closer on the network to the AWS environment than to the Equinix Metal server used to build the Vagrant boxes, so the Docker image pulls there are much faster (allowing the containers to launch quickly enough for the Okapi install and deploy invocation to succeed), but that is just speculation.

In order to get more recent module software out in the Vagrant boxes, we'll run manual builds of folio/testing-backend and folio/testing with Okapi v4.6.3.

One workaround could be to use the okapi-deployment role to deploy modules one at a time separate from the call to the Okapi tenant install API (so that the containers would be responsive by the time the API call is made). This is the strategy used in the folio/snapshot Vagrant box build, and may explain why it is not affected by this problem.
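
A rough sketch of the request sequence this workaround implies: one deployment call per module, followed by a single tenant install call without deploy=true. This is an illustration only, not what the okapi-deployment role actually does. The function name plan_requests, the module IDs in the usage lines, and the nodeId value are hypothetical, and the descriptor fields srvcId and nodeId are assumed from Okapi's deployment descriptor format. Note that a later comment in this issue points out that deployment through /_/discovery/modules depends on the same readiness check.

```python
def plan_requests(module_ids, tenant, node_id="localhost"):
    """Return (method, path, body) tuples: one deploy per module, then one install."""
    requests = []
    for module_id in module_ids:
        # Assumed descriptor shape: srvcId names the module, nodeId the Okapi node.
        descriptor = {"srvcId": module_id, "nodeId": node_id}
        requests.append(("POST", "/_/discovery/modules", descriptor))
    # One install call at the end; deploy=true is omitted because the
    # modules are expected to be running by the time this call is made.
    enable = [{"id": module_id, "action": "enable"} for module_id in module_ids]
    requests.append(("POST", f"/_/proxy/tenants/{tenant}/install", enable))
    return requests


# Hypothetical module IDs, for illustration only.
plan = plan_requests(["mod-example-a-1.0.0", "mod-example-b-1.0.0"], "diku")
for method, path, body in plan:
    print(method, path, body)
```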

Comment by Adam Dickmeiss [ 24/Feb/21 ]

Rolling back is only acceptable as a short-term solution. OKAPI-984 Closed did in fact fix a problem with the Vagrant box, so this must be fixed. Either the host is not correct or it doesn't wait long enough. See OKAPI-988 Closed

Comment by Wayne Schneider [ 24/Feb/21 ]

Adam Dickmeiss, the containerHost default of localhost should be fine, I think. netstat shows all containers listening on localhost for connections, and some modules can be initialized successfully.

Comment by Wayne Schneider [ 24/Feb/21 ]

Damien, new builds are available for the folio/testing-backend and folio/testing Vagrant boxes with the latest module versions (as of late afternoon US time on 23 February) and with Okapi v4.6.3. If this issue is not resolved today, we will continue to update those boxes manually from time to time until it can be fixed.

Comment by Wayne Schneider [ 24/Feb/21 ]

One possible workaround, if OKAPI-988 Closed turns out to be a non-starter, would be to deploy all modules (using /_/discovery/modules) before enabling them for the tenant.

Comment by Adam Dickmeiss [ 24/Feb/21 ]

So in OKAPI-988 Closed the wait time is increased and the probed host now defaults to the Docker host (it can still be overridden with containerHost). By the way, /_/discovery/modules would also fail if the container cannot be checked for readiness, so that wouldn't help.
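
To make the readiness check discussed here concrete, the sketch below polls a TCP port a fixed number of times before giving up. It is an illustration of the general technique, not Okapi's implementation. The function name wait_for_port and the delay and timeout values are assumptions; the default of 30 iterations echoes the "30 attempts" seen in the attached okapi.log.

```python
import socket
import time


def wait_for_port(host, port, iterations=30, delay_s=1.0, timeout_s=1.0):
    """Return True once a TCP connection to host:port succeeds, else False."""
    for attempt in range(iterations):
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=timeout_s):
                return True
        except OSError:
            # Refused, reset, or timed out: sleep before the next attempt,
            # but not after the final one.
            if attempt < iterations - 1:
                time.sleep(delay_s)
    return False
```

Both knobs matter in this sketch: probing the wrong host fails no matter how long the loop waits, and too few iterations gives up on a container that is merely slow to start.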

Comment by Wayne Schneider [ 25/Feb/21 ]

With Okapi v4.6.5, I can build the testing box successfully from the command line, but the Jenkins job fails. More investigation is needed.

Comment by Wayne Schneider [ 01/Mar/21 ]

Error in Jenkins job when posting to /_/proxy/tenants/diku/install:

Timed out after waiting 300000(ms) for a reply. address: __vertx.reply.6, repliedAddress: http://10.0.2.15:9130/deploy

To get into the VM for the failed build, we need to run packer build --on-error=abort

Comment by Adam Dickmeiss [ 02/Mar/21 ]

Could you point me to the relevant log file, please.

We need to see whether deploy.waitIterations needs to be increased or whether there's something else wrong, with one or more modules being extremely slow. OKAPI-990 Closed

Comment by Wayne Schneider [ 02/Mar/21 ]

Adam Dickmeiss, the log file is attached: okapi-1.log

I can't tell which module Okapi is giving up on, though – is there some indication in the log that I'm missing?

Comment by Adam Dickmeiss [ 03/Mar/21 ]

The deploy.waitIterations is not the problem, but rather an internal timeout in message communication between Okapi's discovery and deployment services. See OKAPI-992 Closed
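
The failure mode described here, a caller giving up on a reply while the work behind it is still in progress, can be sketched with a generic request/reply timeout. This uses Python's asyncio, not Vert.x, and the names deploy, request_with_timeout, and run are hypothetical. The durations are scaled down from the 300000 ms (5 minutes) shown in the Jenkins error. One difference to keep in mind: asyncio.wait_for cancels the pending work on timeout, which is not necessarily what happens on Okapi's side.

```python
import asyncio


async def deploy(duration_s):
    # Stand-in for the deployment service pulling images and starting containers.
    await asyncio.sleep(duration_s)
    return "deployed"


async def request_with_timeout(duration_s, reply_timeout_s):
    # Stand-in for the discovery service waiting for a reply with a fixed timeout.
    try:
        return await asyncio.wait_for(deploy(duration_s), timeout=reply_timeout_s)
    except asyncio.TimeoutError:
        return "timed out"


def run(duration_s, reply_timeout_s):
    return asyncio.run(request_with_timeout(duration_s, reply_timeout_s))


print(run(0.01, 0.5))  # reply arrives within the timeout
print(run(0.5, 0.01))  # deployment outlasts the reply timeout
```

In this sketch, raising a per-module wait setting would not help once the total deployment time exceeds the reply timeout, which is consistent with the diagnosis above.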

Comment by Wayne Schneider [ 03/Mar/21 ]

With Okapi 4.7.1, I believe this issue is resolved. However, the testing (and snapshot) Vagrant boxes now can't be built due to different issues (MODRS-21 Closed, FOLIO-3053 Closed). Putting this "In Review" until we can confirm.

Comment by Wayne Schneider [ 09/Mar/21 ]

Finally got a clean build for the Vagrant folio/testing-backend box, so moving this issue to DONE.

Generated at Thu Feb 08 23:25:07 UTC 2024 using Jira 1001.0.0-SNAPSHOT#100246-sha1:7a5c50119eb0633d306e14180817ddef5e80c75d.