Rancher namespace issue 01-26-2024 - Root Cause Analysis
Define the problem
Provisioning of namespaces on Rancher unexpectedly started failing with the error "Getting App V2: Get "https://rancher.ci.folio.org/k8s/clusters/c-dw5wl/v1/catalog.cattle.io.apps/karate/opensearch-client": context deadline exceeded (Client.Timeout exceeded while awaiting headers)".
We assumed that the metadata of some namespaces had become corrupted, so we decided to refresh the metadata.
Unfortunately, Rancher then started terminating the underlying EC2 instances, and as a result all namespaces were destroyed.
Collect data
Identify causal factors
Rancher has an additional component, rancher-webhook, which started failing on 01-26-2024 and led to the failure described under "Define the problem".
Identify root cause(s)
The combination of undocumented behaviour and the failed rancher-webhook component triggered the termination of the EC2 instances.
Implement solutions
- All namespaces were restored within one working day by the Kitfox team using existing automated pipelines - RANCHER-1193, RANCHER-1200
- Document the unexpected behaviour and share it with all related parties - RANCHER-1194
- Include the document from the previous item in the regular Kitfox onboarding process - RANCHER-1195
- Enhance Rancher role-based access control - RANCHER-1196
- (P1) Implement periodic backups of the embedded (dev) PostgreSQL - RANCHER-1197
- Document the disaster recovery process (DRP) and schedule semi-annual training for the Kitfox team - RANCHER-1198
- Conduct the DRP after the Quesnelia release (est. May 2024) - RANCHER-1199
How often is Rancher upgraded?
Do we use the latest stable version? 2.7.1