Define the problem
Provisioning of namespaces on Rancher unexpectedly started failing with the following error:
Getting App V2: Get "https://rancher.ci.folio.org/k8s/clusters/c-dw5wl/v1/catalog.cattle.io.apps/karate/opensearch-client": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
We assumed that the metadata of some namespaces had been corrupted, so we decided to refresh the metadata.
Unfortunately, Rancher then started terminating the underlying EC2 instances, and as a result all namespaces were destroyed.
Collect data
Identify causal factors
Rancher has an additional component, rancher-webhook, which started failing on 01-26-2024 and led to the problem described under "Define the problem".
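Because the failing rancher-webhook component went unnoticed until provisioning broke, a small health check could surface a similar failure earlier. The following is a minimal sketch, not the check the team actually uses. It assumes that rancher-webhook runs as a Deployment in the cattle-system namespace and that kubectl is installed and configured for the cluster; the function names are hypothetical.

```python
import json
import subprocess


def deployment_is_healthy(deployment: dict) -> bool:
    """Return True when every desired replica of a Deployment is available."""
    # Kubernetes defaults spec.replicas to 1 and omits
    # status.availableReplicas from the JSON when it is zero.
    desired = deployment.get("spec", {}).get("replicas", 1)
    available = deployment.get("status", {}).get("availableReplicas", 0)
    return desired > 0 and available >= desired


def fetch_deployment(name: str, namespace: str) -> dict:
    """Read a Deployment as JSON through kubectl (assumed to be on PATH)."""
    result = subprocess.run(
        ["kubectl", "get", "deployment", name, "-n", namespace, "-o", "json"],
        capture_output=True,
        text=True,
        check=True,
        timeout=30,
    )
    return json.loads(result.stdout)


# Example call (assumption: the webhook lives in cattle-system):
#   webhook = fetch_deployment("rancher-webhook", "cattle-system")
#   print("healthy" if deployment_is_healthy(webhook) else "UNHEALTHY")
```

Keeping the health decision in a pure function separate from the kubectl call makes it easy to run the check on a schedule and to test it without cluster access.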
Identify root cause(s)
The combination of undocumented Rancher behaviour and the failed rancher-webhook component triggered the termination of the EC2 instances.
Implement solutions
- All namespaces were restored within 1 working day by the existing automated pipelines (by the Kitfox team)
- Document the unexpected behaviour and share it with all related parties
- Include the document from #2 in the regular Kitfox onboarding process
- Enhance Rancher role-based access control
- Implement periodic backups of the embedded (dev) PostgreSQL
- Document the disaster recovery (DRP) process and schedule semi-annual training for the Kitfox team
- Conduct DRP after the Quesnelia release (est. May 2024)
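
The periodic PostgreSQL backup item above could start from something like the following sketch. It is illustrative only: the host, database, and user values, the file naming scheme, and the retention policy are all assumptions, and the helper names are hypothetical. It relies only on standard pg_dump options (--host, --username, --format=custom, --file).

```python
from datetime import datetime, timezone
from pathlib import Path


def backup_path(backup_dir: Path, database: str, now: datetime) -> Path:
    """Build a timestamped dump file name; `now` is expected to be in UTC."""
    stamp = now.strftime("%Y%m%dT%H%M%SZ")
    return backup_dir / f"{database}-{stamp}.dump"


def pg_dump_command(host: str, database: str, user: str, target: Path) -> list[str]:
    """Return the pg_dump argument list for a custom-format dump."""
    return [
        "pg_dump",
        "--host", host,
        "--username", user,
        "--format=custom",
        "--file", str(target),
        database,
    ]


def backups_to_prune(existing: list[Path], keep: int) -> list[Path]:
    """Return the oldest dumps of one database, keeping the newest `keep`."""
    # The timestamp format sorts chronologically when compared as text.
    ordered = sorted(existing, key=lambda p: p.name)
    return ordered[:-keep] if keep > 0 else ordered


# Example wiring (all values are placeholders):
#   target = backup_path(Path("/backups"), "folio", datetime.now(timezone.utc))
#   subprocess.run(pg_dump_command("localhost", "folio", "postgres", target), check=True)
```

A scheduler such as cron or a Kubernetes CronJob could run this periodically; restoring from a dump should be rehearsed as part of the DRP training rather than assumed to work.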