[FOLIO-3514] Jenkins jobs fail 2022-06-08 platform and refenv builds Created: 08/Jun/22  Updated: 13/Jun/22  Resolved: 13/Jun/22

Status: Closed
Project: FOLIO
Components: None
Affects versions: None
Fix versions: None

Type: Bug Priority: TBD
Reporter: David Crossley Assignee: John Malconian
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original estimate: Not Specified

Sprint: DevOps Sprint 142, DevOps Sprint 141
Development Team: FOLIO DevOps
RCA Group: TBD

 Description   

The hourly platform builds started failing with the 01:19 UTC platform-complete job (and every run since), during the "Build FOLIO instance" stage. The daily folio-snapshot build failed in the same way.

The folio-snapshot-2 rebuild job has been disabled to preserve the existing instance.

The failing jobs seem to emit different error messages and fail at different Ansible tasks, all centred around setting up the Kafka infrastructure.



 Comments   
Comment by John Malconian [ 08/Jun/22 ]

These jobs appear to be failing because of an issue with SSH to the target host. Possibly something in the environment has changed.

Comment by John Malconian [ 08/Jun/22 ]

On the other hand, the okapi-docker-container Ansible role, which runs right before the kafka-zk role, is able to copy a j2 template to the target system with no issues. Very weird.

Comment by John Malconian [ 08/Jun/22 ]

We are running into the Ubuntu kernel issue documented here: https://bugs.launchpad.net/ubuntu/+source/linux-aws/+bug/1977919

Comment by John Malconian [ 08/Jun/22 ]

Reverting to the previous AWS Ubuntu Focal AMI fixes the problem. I did this by hardcoding the AMI in the launch_ec2 playbook in folio-infrastructure:

roles:
    - role: launch_ec2_instance
      ec2_security_groupids:  ["{{ okapi_ec2_security_groupid }}", "{{ postgres_ec2_security_groupid }}", "{{ stripes_ec2_security_groupid }}"]
      ec2_group: "{{ ec2_group }}"
      ec2_hostgroups: "tag_Env_ci,tag_Group_{{ ec2_group }},tag_Okapi_true,tag_Postgres_true"
      ec2_instance_type: "{{ instance_type }}"
      # Hardcode AMI temporarily. See https://folio-org.atlassian.net/browse/FOLIO-3514
      #ec2_ami: "{{ latest_ami.image_id }}"
      ec2_ami: ami-01f18be4e32df20e2
      ec2_instance_tags: "{{ instance_tags }}"

This should only be temporary. I suspect the issue will be fixed within a day or two.
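The commented-out line refers to `latest_ami.image_id`, which implies an earlier task that looks up the newest Ubuntu Focal AMI. The following is only a sketch of what such a lookup could look like, assuming the amazon.aws collection is installed. The task names, the `ami_search` register name and the filter values are illustrative assumptions, not necessarily what folio-infrastructure actually uses.

```yaml
# Hypothetical lookup of the newest Canonical Ubuntu 20.04 (Focal) AMI.
# 099720109477 is Canonical's AWS owner ID; the name pattern targets amd64 HVM SSD images.
- name: Find Ubuntu Focal AMIs published by Canonical
  amazon.aws.ec2_ami_info:
    owners: ["099720109477"]
    filters:
      name: "ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-*"
      architecture: x86_64
  register: ami_search

- name: Keep only the most recently created image
  ansible.builtin.set_fact:
    # creation_date is an ISO 8601 timestamp, so a string sort orders images oldest to newest
    latest_ami: "{{ ami_search.images | sort(attribute='creation_date') | last }}"
```

With a lookup like this, `{{ latest_ami.image_id }}` always resolves to whatever image Canonical published most recently. That is exactly why a broken kernel in a new AMI reached the CI builds without any change on the FOLIO side, and why hardcoding the previous AMI ID sidesteps it.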

Comment by John Malconian [ 13/Jun/22 ]

Confirmed that Ubuntu has released a fixed kernel and AWS AMI. Reverted the change so that we once again get the latest Ubuntu Focal AMI.
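One possible way to avoid editing the playbook for a future pin-and-revert cycle is to make the pin an optional variable that falls back to the latest AMI. This is a sketch only: `pinned_ami` is a hypothetical variable name, and the other role parameters are assumed to stay as in the playbook snippet quoted earlier in this issue.

```yaml
roles:
    - role: launch_ec2_instance
      # (other role parameters unchanged)
      # pinned_ami is a hypothetical extra var; when it is not set,
      # the default filter falls back to the latest Focal AMI.
      ec2_ami: "{{ pinned_ami | default(latest_ami.image_id) }}"
```

A temporary pin would then be a command-line change, for example `ansible-playbook <launch_ec2 playbook> -e pinned_ami=ami-01f18be4e32df20e2`, and "reverting" is simply dropping the extra var.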

Generated at Thu Feb 08 23:28:42 UTC 2024 using Jira 1001.0.0-SNAPSHOT#100246-sha1:7a5c50119eb0633d306e14180817ddef5e80c75d.