Rancher namespace issue 01-26-2024 - Root Cause Analysis

Define the problem

Provisioning of namespaces on Rancher unexpectedly started failing with the error: Getting App V2: Get "https://rancher.ci.folio.org/k8s/clusters/c-dw5wl/v1/catalog.cattle.io.apps/karate/opensearch-client": context deadline exceeded (Client.Timeout exceeded while awaiting headers).

We assumed that the metadata of some namespaces had become corrupted, so we decided to refresh the metadata.

Unfortunately, Rancher then started terminating the underlying EC2 instances, and as a result all namespaces were destroyed.
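The error quoted above is a client-side timeout: the request was sent, but no response headers arrived before the client's deadline. As a minimal sketch of how this symptom could be told apart from other failures before any corrective action is taken, the hypothetical helper below issues one GET request with an explicit timeout and classifies the outcome. The function name `probe`, the result labels, and the default timeout are assumptions for illustration; authentication against the Rancher API is not shown.

```python
import socket
import urllib.error
import urllib.request


def probe(url: str, timeout_s: float = 10.0) -> str:
    """Classify one GET request as 'ok', 'http-<code>', 'timeout', or 'error'."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s):
            return "ok"
    except urllib.error.HTTPError as exc:
        # The server answered, but with a non-success status code.
        return f"http-{exc.code}"
    except urllib.error.URLError as exc:
        # Failures while connecting or sending are wrapped in URLError.
        if isinstance(exc.reason, (socket.timeout, TimeoutError)):
            return "timeout"
        return "error"
    except (socket.timeout, TimeoutError):
        # Timeouts while waiting for the response headers are raised unwrapped.
        return "timeout"
    except OSError:
        return "error"
```

A "timeout" result for one endpoint while other endpoints answer normally would point at a slow or failing backend component rather than at corrupted metadata.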

Collect data

Identify causal factors

Rancher has an additional component, rancher-webhook, which started failing on 01-26-2024 and led to the failure described under "Define the problem".

Identify root cause(s)

The combination of undocumented Rancher behaviour and the failed rancher-webhook component triggered the termination of the EC2 instances.

Implement solutions

  1. All namespaces were restored within 1 working day using existing automated pipelines (by the Kitfox team)
  2. Document the unexpected behaviour and share it with all related parties
  3. Include the document from #2 in the regular Kitfox onboarding process
  4. Enhance Rancher role-based access control
  5. Implement periodic backups of the embedded (dev) PostgreSQL
  6. Document the disaster recovery (DRP) process and schedule semi-annual training for the Kitfox team
  7. Conduct a DRP exercise after the Quesnelia release (est. May 2024)
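As an illustration of item 5, the sketch below shows one possible shape for a periodic backup job built around `pg_dump`. It is not the pipeline the Kitfox team implemented. The host, user, database name, output directory, and retention count are hypothetical parameters. The sketch assumes the embedded PostgreSQL is reachable from wherever the job runs and that credentials are supplied through the standard `PGPASSWORD` variable or a `.pgpass` file.

```python
import subprocess
from datetime import datetime, timezone
from pathlib import Path


def backup_path(out_dir: Path, dbname: str, now: datetime) -> Path:
    """Timestamped file name, so names for one database sort chronologically."""
    return out_dir / f"{dbname}-{now.strftime('%Y%m%dT%H%M%SZ')}.dump"


def pg_dump_command(host: str, port: int, user: str, dbname: str, target: Path) -> list[str]:
    """Build a pg_dump invocation that writes a custom-format archive."""
    return [
        "pg_dump",
        f"--host={host}",
        f"--port={port}",
        f"--username={user}",
        f"--dbname={dbname}",
        "--format=custom",
        f"--file={target}",
    ]


def stale_backups(paths: list[Path], keep: int) -> list[Path]:
    """Return every backup except the newest `keep`, oldest first."""
    ordered = sorted(paths, key=lambda p: p.name)
    return ordered[:-keep] if keep > 0 else ordered


def run_backup(host: str, port: int, user: str, dbname: str,
               out_dir: Path, keep: int = 7) -> Path:
    """Take one backup, then prune old archives of the same database."""
    out_dir.mkdir(parents=True, exist_ok=True)
    target = backup_path(out_dir, dbname, datetime.now(timezone.utc))
    subprocess.run(pg_dump_command(host, port, user, dbname, target), check=True)
    for old in stale_backups(list(out_dir.glob(f"{dbname}-*.dump")), keep):
        old.unlink()
    return target
```

Scheduling is left to whatever the environment already provides, for example a cron entry or a scheduled pipeline that calls `run_backup` once per day. Archives written with `--format=custom` are restored with `pg_restore`.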

