[FOLIO-3034] "testing/testing-backend" Vagrant build fails Created: 23/Feb/21 Updated: 09/Mar/21 Resolved: 09/Mar/21 |
|
| Status: | Closed |
| Project: | FOLIO |
| Components: | None |
| Affects versions: | None |
| Fix versions: | None |
| Type: | Bug | Priority: | TBD |
| Reporter: | Wayne Schneider | Assignee: | Wayne Schneider |
| Resolution: | Done | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original estimate: | Not Specified |
| Attachments: |
|
| Issue links: |
|
| Sprint: | DevOps Sprint 109, DevOps Sprint 108 |
| Development Team: | FOLIO DevOps |
| Description |
|
The "testing" and "testing-backend" Vagrant builds have been failing for the last few days with an error in the "Post install list for deployment and enabling" task:

testing: fatal: [default]: FAILED! => {"changed": false, "connection": "close", "content": "Deployment failed. Could not connect to port 9165: readAddress(..) failed: Connection reset by peer", "content_length": "99", "content_type": "text/plain", "elapsed": 166, "msg": "Status code was 400 and not [200]: HTTP Error 400: Bad Request", "redirected": false, "status": 400, "url": "http://10.0.2.15:9130/_/proxy/tenants/diku/install?deploy=true&tenantParameters=loadSample%3Dtrue%2CloadReference%3Dtrue", "vary": "origin"}

The port in the error message varies, but the rest of the error is consistent. Apparently some module is crashing before the tenant init call. It is not clear why this is not happening in the AWS reference builds. |
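For triage, the failing request and the varying port can be pulled out of the error programmatically. The following is a minimal sketch, not part of the build tooling: the helper name `describe_failure` is hypothetical, and the two input strings are copied verbatim from the error above.

```python
import re
from urllib.parse import parse_qs, urlsplit

# Values copied from the Ansible failure reported in the description.
failed_url = (
    "http://10.0.2.15:9130/_/proxy/tenants/diku/install"
    "?deploy=true&tenantParameters=loadSample%3Dtrue%2CloadReference%3Dtrue"
)
content = (
    "Deployment failed. Could not connect to port 9165: "
    "readAddress(..) failed: Connection reset by peer"
)


def describe_failure(url, content):
    """Hypothetical helper: summarize which call failed and on which module port."""
    parts = urlsplit(url)
    # parse_qs percent-decodes values; each key appears once here, so take item 0.
    query = {key: values[0] for key, values in parse_qs(parts.query).items()}
    match = re.search(r"Could not connect to port (\d+)", content)
    return {
        "path": parts.path,
        "query": query,
        "module_port": int(match.group(1)) if match else None,
    }


print(describe_failure(failed_url, content))
```

Decoding the query string shows that the request is a tenant install with `deploy=true` and `tenantParameters` set to `loadSample=true,loadReference=true`, so Okapi is deploying the modules and initializing the tenant in the same call. The extracted port identifies which module's container reset the connection on a given run.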
| Comments |
| Comment by Wayne Schneider [ 23/Feb/21 ] |
|
As Adam Dickmeiss suspected, this problem does not occur with Okapi v4.6.3 (see okapi.log).

It's not clear why the problem does not occur in the AWS reference environment build. One possibility is that the Nexus Docker Hub mirror is much closer on the network than the Equinix Metal server used to build the Vagrant boxes, and so the Docker image pulls are much faster (allowing the containers to launch quickly enough for the Okapi install and deploy invocation to succeed), but that is just speculation.

In order to get more recent module software out in the Vagrant boxes, we'll run manual builds of folio/testing-backend and folio/testing with Okapi v4.6.3.

One workaround could be to use the okapi-deployment role to deploy modules one at a time, separately from the call to the Okapi tenant install API (so that the containers would be responsive by the time the API call is made). This is the strategy used in the folio/snapshot Vagrant box build, and may explain why it is not affected by this problem. |
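The idea behind that workaround, making sure a container is responsive before the tenant install call is issued, can be sketched as a simple readiness poll. This is an illustration only, not how the okapi-deployment role is implemented: the function name `wait_until_responsive`, the timeout values, and the health URL in the usage comment are all assumptions. It probes over HTTP rather than with a bare TCP connect because the reported failure was a connection reset, which suggests the port can accept a connection before the module is actually serving.

```python
import http.client
import time
import urllib.error
import urllib.request


def wait_until_responsive(url, timeout=120.0, interval=2.0):
    """Poll url until the server returns any HTTP response or timeout expires.

    Returns True once the server answers (even with an error status),
    False if it never answers within timeout seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            with urllib.request.urlopen(url, timeout=interval):
                return True
        except urllib.error.HTTPError:
            # The server answered with a 4xx/5xx status: it is up and talking HTTP.
            return True
        except (OSError, http.client.HTTPException):
            # Refused, reset, or timed out: the container is not ready yet.
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)


# Usage (hypothetical port and path; adjust to the module being deployed):
#   if not wait_until_responsive("http://localhost:9165/admin/health"):
#       raise SystemExit("module did not become responsive in time")
```

Treating any HTTP status as "responsive" keeps the check independent of which endpoints a given module exposes; a stricter check would require a known health endpoint for every module.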
| Comment by Adam Dickmeiss [ 24/Feb/21 ] |
|
Rolling back is only acceptable as a short-term solution.
|
| Comment by Wayne Schneider [ 24/Feb/21 ] |
|
Adam Dickmeiss, the containerHost default of localhost should be fine, I think. netstat shows all containers listening on localhost for connections, and some modules are able to be initialized. |
| Comment by Wayne Schneider [ 24/Feb/21 ] |
|
Damien, new builds are available for the folio/testing-backend and folio/testing Vagrant boxes with the latest module versions (as of late afternoon US time on 23 February) and with Okapi v4.6.3. If this issue is not resolved today, we will continue to periodically update those boxes manually until it can be fixed. |
| Comment by Wayne Schneider [ 24/Feb/21 ] |
|
One possible workaround, if
|
| Comment by Adam Dickmeiss [ 24/Feb/21 ] |
|
So in
|
| Comment by Wayne Schneider [ 25/Feb/21 ] |
|
With Okapi v4.6.5, I can build the testing box successfully from the command line, but the Jenkins job fails. More investigation is needed. |
| Comment by Wayne Schneider [ 01/Mar/21 ] |
|
Error in Jenkins job when posting to /_/proxy/tenants/diku/install:
To get into the VM for the failed build, need to call packer build --on-error=abort |
| Comment by Adam Dickmeiss [ 02/Mar/21 ] |
|
Could you point me to the relevant log file, please? I need to see whether deploy.waitIterations needs to be increased or whether there's something else wrong, with one or more modules being extremely slow.
|
| Comment by Wayne Schneider [ 02/Mar/21 ] |
|
Adam Dickmeiss, the log file is attached: okapi-1.log. I can't tell which module Okapi is giving up on, though – is there some indication in the log that I'm missing? |
| Comment by Adam Dickmeiss [ 03/Mar/21 ] |
|
The deploy.waitIterations is not the problem, but rather an internal timeout in message communication between Okapi's discovery and deployment services. See
|
| Comment by Wayne Schneider [ 03/Mar/21 ] |
|
With Okapi 4.7.1, I believe this issue is resolved – however, the testing (and snapshot) Vagrant boxes now can't be built due to different issues (
|
| Comment by Wayne Schneider [ 09/Mar/21 ] |
|
Finally got a clean build for the Vagrant folio/testing-backend box, so moving this issue to DONE. |