Downscale module instances during weekends/nights (RANCHER-685)

Purpose/Overview:

The main goal of this spike is to gather requirements, study existing solutions, and see whether they can be applied to our infrastructure.
If no suitable solution exists, prepare a possible draft solution.


The main idea is to set the number of module replicas to 0, which allows the number of nodes in the cluster node group to be reduced and, accordingly, lowers the costs.
Also, for perf (or any other) environments that use an AWS RDS database, the instance can be stopped for up to 7 days (after which AWS automatically starts it again), which also reduces costs; a minimal sketch of stopping/starting the instance is shown below.
Additionally, we will investigate whether the size of the Kafka / OpenSearch instances can be changed during the weekend, which would also reduce costs.
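
The RDS part could look roughly like the following sketch (Python/boto3; the region and the instance identifier "perf-env-db" are placeholders, not real resources):

```python
# Sketch only: stop / start an RDS instance for the weekend with boto3.
import boto3

rds = boto3.client("rds", region_name="eu-central-1")  # region is an assumption

def stop_db(instance_id: str) -> None:
    # AWS will start a stopped instance again automatically after 7 days.
    rds.stop_db_instance(DBInstanceIdentifier=instance_id)

def start_db(instance_id: str) -> None:
    rds.start_db_instance(DBInstanceIdentifier=instance_id)

# Example: stop the perf environment database before the weekend.
# stop_db("perf-env-db")
```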

1. Kubernetes CronJobs to scale deployments up or down

The most common way to accomplish this task is to use Kubernetes CronJobs to scale deployments up or down.
To do this, 2 Kubernetes CronJobs with a set schedule (scale-down and scale-up) are created in each namespace (project).


1) The scale-down job saves the current number of replicas of each module in the given namespace (project) to a ConfigMap and then sets the replica counts (with exceptions) to 0. If the environment uses an AWS RDS database, it also stops the instance. A minimal sketch of this logic is shown after this list.
2) The scale-up job starts the AWS RDS instance (if the environment uses one), reads the saved replica counts for the namespace (project) from the ConfigMap, clears it, and then restores the deployments to the values read from the ConfigMap.
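
A minimal sketch of the scale-down job, assuming the Python kubernetes client and boto3; the ConfigMap name "downscale-replicas", the excluded module list and the RDS identifier are illustrative assumptions, not our actual naming:

```python
from typing import Optional

import boto3
from kubernetes import client, config

CONFIGMAP = "downscale-replicas"
EXCLUDED = {"some-critical-module"}          # deployments that must stay up (assumption)

def scale_down(namespace: str, rds_instance: Optional[str] = None) -> None:
    config.load_incluster_config()           # the job runs inside the CronJob pod
    apps, core = client.AppsV1Api(), client.CoreV1Api()

    # 1. Remember the current replica counts in a ConfigMap.
    replicas = {
        d.metadata.name: str(d.spec.replicas or 0)
        for d in apps.list_namespaced_deployment(namespace).items
        if d.metadata.name not in EXCLUDED
    }
    cm = client.V1ConfigMap(metadata=client.V1ObjectMeta(name=CONFIGMAP), data=replicas)
    core.create_namespaced_config_map(namespace, cm)

    # 2. Scale every non-excluded deployment to 0.
    for name in replicas:
        apps.patch_namespaced_deployment_scale(name, namespace, {"spec": {"replicas": 0}})

    # 3. Stop the RDS instance, if the environment has one.
    if rds_instance:
        boto3.client("rds").stop_db_instance(DBInstanceIdentifier=rds_instance)
```

The scale-up job would do the reverse: start the RDS instance, read and delete the ConfigMap, and patch each deployment back to the saved replica count.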


Cons of the solution
- Despite the simplicity of the description, the solution must be deployed in each namespace (project).
- The scale-up / scale-down schedule must be set when the namespace (project) is created.
- If the schedule needs to be changed later, this must be done at the Kubernetes CronJob level, either manually or via a Jenkins Job that updates the Kubernetes CronJob.
- If the environment needs to be enabled / disabled on demand, the Kubernetes CronJob has to be triggered at the Kubernetes level (see the sketch after this list), or an additional Jenkins Job has to be created for this.
- If we need a centralized schedule for turning all namespaces (projects) on / off, we will additionally have to create a sync CronJob that periodically synchronizes the schedule file with the individual Kubernetes CronJobs.
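
Triggering a CronJob on demand boils down to creating a one-off Job from its jobTemplate. A sketch, assuming a recent Python kubernetes client where CronJobs are served from batch/v1 and a CronJob named "scale-down" (both assumptions):

```python
import time

from kubernetes import client, config

def run_cronjob_now(cronjob_name: str, namespace: str) -> None:
    config.load_kube_config()                # run from a Jenkins agent / operator machine
    batch = client.BatchV1Api()
    cj = batch.read_namespaced_cron_job(cronjob_name, namespace)
    job = client.V1Job(
        metadata=client.V1ObjectMeta(name=f"{cronjob_name}-manual-{int(time.time())}"),
        spec=cj.spec.job_template.spec,      # reuse the CronJob's job template as-is
    )
    batch.create_namespaced_job(namespace, job)

# Example: scale the project down right now instead of waiting for the schedule.
# run_cronjob_now("scale-down", "my-project")
```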


2. Jenkins Jobs to scale deployments up or down

In this case, Jenkins Jobs can be used to schedule the scale-up / scale-down of deployments.


1) 1 main Jenkins Job will be created with parameters such as cluster name, project name, and action. This Jenkins Job will be available to all teams so they can enable/disable an environment as needed (see the sketch after this list).
2) Also, for each cluster-namespace (project), 2 separate scheduled Jenkins Jobs will be created - scale deployments up AND scale deployments down - each carrying the scale-up or scale-down schedule for that cluster-namespace (project).
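
The main Jenkins Job could be a thin wrapper around a single parameterized entrypoint script. A sketch; the parameter names and the scale_up/scale_down helpers are assumptions (the scale-down logic would be the same as in option 1):

```python
import argparse

def scale_down(cluster: str, project: str) -> None:
    ...  # switch kube context to `cluster` and run the option-1 scale-down logic

def scale_up(cluster: str, project: str) -> None:
    ...  # reverse operation: start RDS, restore replicas from the ConfigMap

def main() -> None:
    parser = argparse.ArgumentParser(description="Scale an environment up or down")
    parser.add_argument("--cluster", required=True, help="target cluster name")
    parser.add_argument("--project", required=True, help="namespace (project) name")
    parser.add_argument("--action", required=True, choices=["up", "down"])
    args = parser.parse_args()

    if args.action == "down":
        scale_down(args.cluster, args.project)
    else:
        scale_up(args.cluster, args.project)

if __name__ == "__main__":
    main()
```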


Cons of the solution
- Code duplication, namely the scheduled Jenkins Jobs described in point 2 repeated for each cluster-namespace (project)
- The downscale/upscale schedule of environments will not be centralized but will be split per cluster-namespace (project) across Jenkins Jobs


The underlying implementation of both options might look like this:
both the Kubernetes CronJobs and the Jenkins Jobs will call the AWS CLI and the Kubernetes API to update the ConfigMap, change the number of replicas in the cluster-namespace (project), and, if necessary, change the state of the AWS RDS instance.

ATTENTION! Make sure that the autoscaling group actually shrinks to its minimum size after the cluster-namespace (project) replicas are scaled down, since parameters such as min cluster size / desired cluster size may prevent it. Most likely, after changing the number of cluster-namespace (project) replicas, the min cluster size / desired cluster size will also have to be changed.
^^^UPD: There is no need to check whether the autoscaling group decreases the desired cluster size when we scale module replicas down, since we use a Kubernetes add-on that does this automatically.