Rancher namespace issue 01-26-2024 - Root Cause Analysis


Define the problem

Provisioning of namespaces on Rancher unexpectedly started failing with the error "Getting App V2: Get "https://rancher.ci.folio.org/k8s/clusters/c-dw5wl/v1/catalog.cattle.io.apps/karate/opensearch-client": context deadline exceeded (Client.Timeout exceeded while awaiting headers)".

We assumed that the metadata of some namespaces had been corrupted, so we decided to refresh the metadata.

Unfortunately, Rancher then started terminating the underlying EC2 instances, and as a result all namespaces were destroyed.

Collect data

Identify causal factors

Rancher has an additional component, rancher-webhook, which started failing on 01-26-2024 and led to the provisioning failure described under "Define the problem".
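A failing rancher-webhook could be detected earlier with a simple readiness check. The following is a minimal sketch, not the check Kitfox actually runs. It assumes that rancher-webhook is deployed as a Deployment named rancher-webhook in the cattle-system namespace and that kubectl is configured for the cluster; the function names are hypothetical.

```python
import json
import subprocess


def webhook_is_healthy(deployment: dict) -> bool:
    """Return True when every desired replica of the Deployment reports ready."""
    desired = deployment.get("spec", {}).get("replicas", 1)
    # Kubernetes omits readyReplicas from the status when no replica is ready.
    ready = deployment.get("status", {}).get("readyReplicas", 0)
    return desired > 0 and ready >= desired


def fetch_deployment(namespace: str = "cattle-system",
                     name: str = "rancher-webhook") -> dict:
    """Read the Deployment object as JSON via kubectl (names are assumptions)."""
    result = subprocess.run(
        ["kubectl", "-n", namespace, "get", "deployment", name, "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)


if __name__ == "__main__":
    if webhook_is_healthy(fetch_deployment()):
        print("rancher-webhook is ready")
    else:
        print("rancher-webhook is NOT ready")
```

Keeping the readiness decision in a pure function makes it possible to test the logic without access to a cluster.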

Identify root cause(s)

The combination of undocumented behaviour and the failed rancher-webhook component triggered the termination of the EC2 instances.

Implement solutions

  1. All namespaces were restored within 1 working day by the Kitfox team using existing automated pipelines (RANCHER-1193, RANCHER-1200)
  2. Document the unexpected behaviour and share it with all related parties (RANCHER-1194)
  3. Include the document from #2 in the regular Kitfox onboarding process (RANCHER-1195)
  4. Enhance Rancher role-based access control (RANCHER-1196)
  5. (P1) Implement periodic backups of the embedded (dev) PostgreSQL (RANCHER-1197)
  6. Document the disaster recovery process (DRP) and schedule semiannual training for the Kitfox team (RANCHER-1198)
  7. Conduct the DRP after the Quesnelia release (est. May 2024) (RANCHER-1199)
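The periodic PostgreSQL backup from item 5 could be sketched roughly as follows. This is an illustration under stated assumptions, not the solution implemented in RANCHER-1197: it assumes the database is reachable from the host running the script, that pg_dump is installed there, and that credentials are supplied outside the script (for example through the PGPASSWORD environment variable or a ~/.pgpass file). The host, user, database, and directory values are placeholders.

```python
import subprocess
from datetime import datetime, timezone
from pathlib import Path


def backup_path(backup_dir: str, db_name: str, now: datetime) -> Path:
    """Build a timestamped dump file name so backups never overwrite each other."""
    stamp = now.strftime("%Y%m%dT%H%M%SZ")
    return Path(backup_dir) / f"{db_name}-{stamp}.dump"


def pg_dump_command(host: str, port: int, user: str,
                    db_name: str, target: Path) -> list[str]:
    """Assemble a pg_dump call that writes a custom-format archive."""
    return [
        "pg_dump",
        "--host", host,
        "--port", str(port),
        "--username", user,
        "--dbname", db_name,
        "--format=custom",  # custom format can be restored with pg_restore
        "--file", str(target),
    ]


def run_backup(host: str, port: int, user: str,
               db_name: str, backup_dir: str) -> Path:
    """Run one backup and return the path of the dump file."""
    target = backup_path(backup_dir, db_name, datetime.now(timezone.utc))
    target.parent.mkdir(parents=True, exist_ok=True)
    subprocess.run(pg_dump_command(host, port, user, db_name, target), check=True)
    return target


if __name__ == "__main__":
    # Placeholder values; replace with the real connection details.
    print(run_backup("localhost", 5432, "postgres", "folio", "/tmp/pg-backups"))
```

A scheduler such as cron or a Kubernetes CronJob could invoke the script at the desired interval; where the dumps are stored and how long they are retained are left open here.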

How often is Rancher upgraded?

Do we use the latest stable version? 2.7.1