Kafka shutdown experiments
Objectives
In scope of https://folio-org.atlassian.net/browse/PERF-1226, investigate the Kafka shutdown and recreation process.
Weekend Kafka shutdown (INT account):
By stopping Kafka clusters during weekends, we can avoid 8–10 days of runtime per month, resulting in 25–30% cost savings.
Daily Kafka shutdown during non-working hours / unused periods:
Implementing automated daily shutdowns provides an additional ~30% savings.
Total Impact
Combining both approaches, Kafka operating costs can be reduced by up to ~60%.
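As a rough cross-check of these figures, the sketch below estimates the fraction of monthly runtime that each approach avoids. It assumes that cost scales linearly with the hours the cluster exists, a 30-day month, 8 weekend days, and 10 off-hours per working day. All of these inputs are illustrative assumptions, not billing data.

```python
def runtime_savings(days_in_month=30, weekend_days=8, off_hours_per_workday=10):
    """Return (weekend, daily, total) fractions of monthly runtime avoided.

    Assumes cost is proportional to the hours the cluster exists.
    All default values are illustrative assumptions, not measured data.
    """
    total_hours = days_in_month * 24
    weekend_hours = weekend_days * 24
    workdays = days_in_month - weekend_days
    daily_hours = workdays * off_hours_per_workday
    weekend = weekend_hours / total_hours
    daily = daily_hours / total_hours
    return weekend, daily, weekend + daily


weekend, daily, total = runtime_savings()
print(f"weekend: {weekend:.1%}, daily: {daily:.1%}, total: {total:.1%}")
# weekend: 26.7%, daily: 30.6%, total: 57.2%
```

With these assumed inputs the estimate lands close to the 25–30% and ~30% figures above, and the combined value stays under the ~60% upper bound.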
Valuable information
The fse-pause-msk job currently offers no way to choose which MSK cluster to pause.
The job looks for MSK clusters that carry the fse-autoshutdown tag. However, it ignores the tag's value (true/false), so it pauses every cluster that has the tag.
A ticket to fix the job has been created: US1458276: Update fse-pause-msk job with possibility to choose exact MSK cluster
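One possible shape for the fixed selection logic is sketched below. It is not the job's actual code: the input is assumed to be a list of dicts with "ClusterName" and "Tags" keys, and the `only_names` parameter is a hypothetical way to choose exact clusters.

```python
TAG_KEY = "fse-autoshutdown"


def clusters_to_pause(clusters, only_names=None):
    """Select clusters whose autoshutdown tag is explicitly true.

    `clusters` is a list of dicts with "ClusterName" and "Tags" keys
    (an assumed shape for the MSK cluster listing).
    `only_names` (hypothetical parameter) optionally restricts the
    selection to explicitly chosen cluster names.
    """
    selected = []
    for cluster in clusters:
        tag_value = cluster.get("Tags", {}).get(TAG_KEY, "")
        if tag_value.strip().lower() != "true":
            continue  # tag missing or set to false: leave the cluster running
        if only_names is not None and cluster["ClusterName"] not in only_names:
            continue  # tagged, but not one of the clusters the user chose
        selected.append(cluster["ClusterName"])
    return selected


example = [
    {"ClusterName": "cluster-a", "Tags": {"fse-autoshutdown": "true"}},
    {"ClusterName": "cluster-b", "Tags": {"fse-autoshutdown": "false"}},
    {"ClusterName": "cluster-c", "Tags": {}},
    {"ClusterName": "cluster-d", "Tags": {"fse-autoshutdown": "True"}},
]
print(clusters_to_pause(example))                             # ['cluster-a', 'cluster-d']
print(clusters_to_pause(example, only_names={"cluster-d"}))   # ['cluster-d']
```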
Required steps
AWS offers no way to "pause" an MSK cluster, so we need a safe way to shut the cluster down and later spin up a new Kafka cluster with its topics recreated.
Back up Kafka topics
The FSE-created Jenkins job fse-backup-msk-topics-to-file backs up MSK topics to a file on S3 (in the dev account), recording each topic's name, partition count, and replication factor in CSV format:
relctls2.Default.ALL.DI_RAW_RECORDS_CHUNK_PARSED,8,2
relctls2.cs00000int_0005.mgr-tenant-entitlements.system-user,2,2
relctls2.Default.ALL.DI_PARSED_RECORDS_CHUNK_SAVED,8,2
relctls2.ALL.inventory.campus,2,2
relctls2.ALL.mod-pubsub.LOG_RECORD,1,2
relctls2.Default.ALL.DI_SRS_MARC_BIB_RECORD_MODIFIED_READY_FOR_POST_PROCESSING,8,2
relctls2.ALL.mod-pubsub.LOAN_CLOSED,1,2
relctls2.Default.ALL.DI_INVENTORY_INSTANCE_UPDATED,8,2
relctls2.Default.ALL.DI_SRS_MARC_BIB_RECORD_MODIFIED,8,2
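A backup file in this format can be read back with a few lines of Python. This is a minimal sketch, assuming one `topic,partitions,replication_factor` row per line with no header; the `TopicSpec` type and the function name are illustrative, not part of the Jenkins job.

```python
import csv
import io
from dataclasses import dataclass


@dataclass(frozen=True)
class TopicSpec:
    name: str
    partitions: int
    replication_factor: int


def parse_topic_backup(text):
    """Parse backup lines of the form `topic,partitions,replication_factor`."""
    specs = []
    for row in csv.reader(io.StringIO(text)):
        if not row:
            continue  # skip blank lines
        name, partitions, replication = row
        specs.append(TopicSpec(name.strip(), int(partitions), int(replication)))
    return specs


sample = """\
relctls2.Default.ALL.DI_RAW_RECORDS_CHUNK_PARSED,8,2
relctls2.ALL.mod-pubsub.LOG_RECORD,1,2
"""
topics = parse_topic_backup(sample)
print(topics[0].name, topics[0].partitions, topics[0].replication_factor)
```

Each resulting `TopicSpec` carries what is needed to recreate the topic with the same partition count and replication factor.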
Pause Kafka cluster
Use fse-pause-msk to pause (delete) the Kafka cluster. This job will:
Trigger the Kafka topics backup
Get and save the cluster info: broker instance type, Kafka version, configuration version, broker count, tags, storage size
Delete the Kafka cluster
All of these steps together usually take 6-10 minutes.
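The ordering of these steps can be sketched as follows. This is not the job's implementation: `kafka` is assumed to behave like a boto3 "kafka" client (`describe_cluster`, `delete_cluster`), while `backup_topics` and `save_snapshot` are hypothetical callables. Subnets and security groups are captured in addition to the fields listed above, on the assumption that recreating the cluster needs them.

```python
def snapshot_cluster_info(cluster_info):
    """Extract the fields needed to recreate the cluster later.

    `cluster_info` is assumed to have the shape of the "ClusterInfo"
    object returned by the MSK DescribeCluster API.
    """
    brokers = cluster_info["BrokerNodeGroupInfo"]
    software = cluster_info["CurrentBrokerSoftwareInfo"]
    return {
        "ClusterName": cluster_info["ClusterName"],
        "InstanceType": brokers["InstanceType"],
        "ClientSubnets": brokers["ClientSubnets"],
        "SecurityGroups": brokers.get("SecurityGroups", []),
        "VolumeSizeGiB": brokers["StorageInfo"]["EbsStorageInfo"]["VolumeSize"],
        "KafkaVersion": software["KafkaVersion"],
        "ConfigurationArn": software.get("ConfigurationArn"),
        "ConfigurationRevision": software.get("ConfigurationRevision"),
        "NumberOfBrokerNodes": cluster_info["NumberOfBrokerNodes"],
        "Tags": cluster_info.get("Tags", {}),
    }


def pause_cluster(kafka, cluster_arn, backup_topics, save_snapshot):
    """Back up topics, save cluster info, then delete the cluster.

    `backup_topics` and `save_snapshot` are hypothetical callables
    standing in for the backup job and the snapshot storage.
    """
    backup_topics(cluster_arn)
    info = kafka.describe_cluster(ClusterArn=cluster_arn)["ClusterInfo"]
    snapshot = snapshot_cluster_info(info)
    save_snapshot(snapshot)
    # Delete only after both the backup and the snapshot have succeeded.
    kafka.delete_cluster(ClusterArn=cluster_arn, CurrentVersion=info["CurrentVersion"])
    return snapshot
```

Deleting last means that a failure in the backup or snapshot step raises before anything irreversible happens.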
Unpause Kafka cluster
Use fse-unpause-msk to unpause (recreate) the Kafka cluster. This job will:
Trigger creation of the Kafka cluster from the previously saved cluster info
Recreate the Kafka topics from the previously taken backup
Update Route53 records so that the ECS clusters pointing to this MSK cluster get the new broker endpoints
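Two of these steps are mostly data transformation and can be sketched without calling AWS. The snapshot keys below are hypothetical and must match whatever the pause step actually stored. The request dictionaries follow the parameter names of the MSK CreateCluster and Route53 ChangeResourceRecordSets APIs. Mapping stable CNAME records to the new broker hosts is only an assumption about how the Route53 records are organized.

```python
def build_create_cluster_request(snapshot):
    """Build keyword arguments for MSK CreateCluster from a saved snapshot."""
    broker_group = {
        "InstanceType": snapshot["InstanceType"],
        "ClientSubnets": snapshot["ClientSubnets"],
        "StorageInfo": {"EbsStorageInfo": {"VolumeSize": snapshot["VolumeSizeGiB"]}},
    }
    if snapshot.get("SecurityGroups"):
        broker_group["SecurityGroups"] = snapshot["SecurityGroups"]
    request = {
        "ClusterName": snapshot["ClusterName"],
        "KafkaVersion": snapshot["KafkaVersion"],
        "NumberOfBrokerNodes": snapshot["NumberOfBrokerNodes"],
        "BrokerNodeGroupInfo": broker_group,
        "Tags": snapshot.get("Tags", {}),
    }
    if snapshot.get("ConfigurationArn"):
        request["ConfigurationInfo"] = {
            "Arn": snapshot["ConfigurationArn"],
            "Revision": snapshot["ConfigurationRevision"],
        }
    return request


def build_route53_change_batch(record_names, broker_hosts, ttl=60):
    """Build an UPSERT change batch pointing stable CNAMEs at new broker hosts.

    Assumes one stable record name per broker; the real record layout
    may differ.
    """
    if len(record_names) != len(broker_hosts):
        raise ValueError("need exactly one record name per broker host")
    changes = [
        {
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": name,
                "Type": "CNAME",
                "TTL": ttl,
                "ResourceRecords": [{"Value": host}],
            },
        }
        for name, host in zip(record_names, broker_hosts)
    ]
    return {"Changes": changes}
```

With boto3, the first result would be passed as `kafka.create_cluster(**request)` and the second as the `ChangeBatch` argument of `route53.change_resource_record_sets`, once the new cluster reports its broker endpoints.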
Usage with pause/unpause-folio
With Pause-folio
Currently, fse-pause-folio pauses the RDS DB cluster and the ECS infrastructure (by changing the task count for each service and then setting the Auto Scaling group's desired instance count to 0).
The fse-pause-folio job could be extended to pause Kafka as well:
Add a check box to mark whether MSK should be paused together with ECS.
The job should check whether any other running ECS clusters point to the same MSK cluster.
If additional ECS clusters point to the Kafka cluster, the job should either fail or skip this step.
Stop the MSK and ECS clusters.
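The shared-cluster check could look roughly like this. It is a sketch under assumptions: how the job learns which ECS cluster points to which MSK cluster is not specified here, so the `ecs_to_msk` mapping, the `running_ecs` set, and the `on_conflict` option are all hypothetical inputs.

```python
def decide_msk_pause(target_ecs, msk_arn, ecs_to_msk, running_ecs, on_conflict="skip"):
    """Decide whether MSK can be paused together with `target_ecs`.

    ecs_to_msk: mapping of ECS cluster name -> MSK cluster ARN it points to.
    running_ecs: set of ECS cluster names that are currently running.
    Returns "pause" or "skip"; raises RuntimeError when on_conflict="fail".
    """
    others = sorted(
        name
        for name, arn in ecs_to_msk.items()
        if arn == msk_arn and name != target_ecs and name in running_ecs
    )
    if not others:
        return "pause"
    if on_conflict == "fail":
        raise RuntimeError(f"MSK still used by running ECS clusters: {others}")
    return "skip"


mapping = {"ecs-a": "arn:msk-1", "ecs-b": "arn:msk-1", "ecs-c": "arn:msk-2"}
print(decide_msk_pause("ecs-a", "arn:msk-1", mapping, running_ecs={"ecs-a", "ecs-b"}))  # skip
print(decide_msk_pause("ecs-a", "arn:msk-1", mapping, running_ecs={"ecs-a", "ecs-c"}))  # pause
```

Returning "skip" lets the ECS and RDS pause proceed while leaving Kafka up for the other consumers; "fail" suits cases where a partial pause would be misleading.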
With Unpause-folio
Before unpausing ECS, the job should check whether the corresponding MSK cluster exists and unpause it first.
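The required ordering can be sketched as below. The three callables are hypothetical stand-ins for job steps, and the check is read here as "is the MSK cluster currently running"; if the intended check is different, the condition would change accordingly.

```python
def unpause_folio(msk_is_running, unpause_msk, unpause_ecs):
    """Make sure MSK is available before the ECS services start.

    All three arguments are hypothetical callables standing in for job steps.
    Returns the list of steps that were executed, in order.
    """
    steps = []
    if not msk_is_running():
        unpause_msk()  # recreate the cluster and its topics first
        steps.append("unpause_msk")
    unpause_ecs()
    steps.append("unpause_ecs")
    return steps


print(unpause_folio(lambda: False, lambda: None, lambda: None))  # ['unpause_msk', 'unpause_ecs']
print(unpause_folio(lambda: True, lambda: None, lambda: None))   # ['unpause_ecs']
```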