Kafka shutdown experiments

Objectives

In the scope of https://folio-org.atlassian.net/browse/PERF-1226, investigate the Kafka shutdown and recreation process.

  • Weekend Kafka shutdown (INT account):
    By stopping Kafka clusters during weekends, we can avoid 8–10 days of runtime per month, resulting in 25–30% cost savings.

  • Daily Kafka shutdown during non-working hours / unused periods:
    Implementing automated daily shutdowns provides an additional ~30% savings.

Total Impact

Combining both approaches, Kafka operating costs can be reduced by up to ~60%.
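The combined figure can be sanity-checked with simple arithmetic. The sketch below assumes that cost scales linearly with cluster runtime hours, a 30-day month with 8 weekend days, and 10 unused hours on each working day. The off-hours value in particular is an illustrative assumption, not a measured number.

```python
HOURS_PER_DAY = 24


def runtime_savings(days_in_month: int, weekend_days: int, off_hours_per_workday: float) -> float:
    """Return the fraction of monthly runtime hours avoided by both shutdown schedules."""
    total_hours = days_in_month * HOURS_PER_DAY
    # Weekend shutdown: the cluster does not run at all on these days.
    weekend_hours = weekend_days * HOURS_PER_DAY
    # Daily shutdown: only the remaining working days contribute off-hours.
    workdays = days_in_month - weekend_days
    nightly_hours = workdays * off_hours_per_workday
    return (weekend_hours + nightly_hours) / total_hours


# Assumed inputs: 30-day month, 8 weekend days, 10 off-hours per working day.
weekend_only = runtime_savings(30, 8, 0)
combined = runtime_savings(30, 8, 10)
print(f"weekend only: {weekend_only:.1%}, combined: {combined:.1%}")
```

With these assumed inputs the weekend shutdown alone avoids about 27% of runtime and both schedules together about 57%, which is in line with the estimates above.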

 

Valuable information

  • fse-pause-msk currently offers no way to choose which MSK cluster to pause.

  • The fse-pause-msk job looks for MSK clusters that carry the fse-autoshutdown tag. However, the job ignores the value of the tag (true/false), so it pauses every cluster that has this tag.

  • A ticket to fix/update the job has been created: US1458276: Update fse-pause-msk job with possibility to choose exact MSK cluster
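The selection behaviour the ticket asks for could look roughly like the sketch below. It is not the job's actual code: the function name and the cluster names are hypothetical, and the input dictionaries only mimic the "ClusterName" and "Tags" fields of the entries that the boto3 kafka client returns from list_clusters.

```python
from typing import Optional


def clusters_to_pause(clusters: list[dict], only_name: Optional[str] = None) -> list[str]:
    """Return names of clusters whose fse-autoshutdown tag is explicitly "true"."""
    selected = []
    for cluster in clusters:
        tags = cluster.get("Tags") or {}
        # Honor the tag value, not just the presence of the tag.
        if str(tags.get("fse-autoshutdown", "")).strip().lower() != "true":
            continue
        # Optionally narrow the run to one explicitly chosen cluster.
        if only_name is not None and cluster["ClusterName"] != only_name:
            continue
        selected.append(cluster["ClusterName"])
    return selected


# Hypothetical cluster entries for illustration only.
clusters = [
    {"ClusterName": "msk-a", "Tags": {"fse-autoshutdown": "true"}},
    {"ClusterName": "msk-b", "Tags": {"fse-autoshutdown": "false"}},
    {"ClusterName": "msk-c", "Tags": {}},
]
print(clusters_to_pause(clusters))           # ['msk-a']
print(clusters_to_pause(clusters, "msk-b"))  # []
```

Requiring both the tag value and an optional explicit name keeps a cluster tagged "false" from being paused even when it is selected by mistake.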

 

Required steps

Because AWS offers no way to “pause” a Kafka cluster, a safe procedure is needed to shut the cluster down and later spin up a new Kafka cluster with its topics recreated.

Back up Kafka topics

Using the FSE-created Jenkins job fse-backup-msk-topics-to-file, MSK topics can be backed up to a file on S3 (in the dev account) together with their partition count and replication factor, in CSV format (topic name, partition count, replication factor):

relctls2.Default.ALL.DI_RAW_RECORDS_CHUNK_PARSED,8,2
relctls2.cs00000int_0005.mgr-tenant-entitlements.system-user,2,2
relctls2.Default.ALL.DI_PARSED_RECORDS_CHUNK_SAVED,8,2
relctls2.ALL.inventory.campus,2,2
relctls2.ALL.mod-pubsub.LOG_RECORD,1,2
relctls2.Default.ALL.DI_SRS_MARC_BIB_RECORD_MODIFIED_READY_FOR_POST_PROCESSING,8,2
relctls2.ALL.mod-pubsub.LOAN_CLOSED,1,2
relctls2.Default.ALL.DI_INVENTORY_INSTANCE_UPDATED,8,2
relctls2.Default.ALL.DI_SRS_MARC_BIB_RECORD_MODIFIED,8,2
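A backup file in this format is straightforward to read back. The following is a minimal parsing sketch, assuming one topic per line and no header row; TopicSpec and parse_topic_backup are hypothetical names, not part of the Jenkins job.

```python
import csv
import io
from typing import NamedTuple


class TopicSpec(NamedTuple):
    name: str
    partitions: int
    replication_factor: int


def parse_topic_backup(text: str) -> list[TopicSpec]:
    """Parse 'topic,partitions,replication_factor' lines into TopicSpec rows."""
    specs = []
    for row in csv.reader(io.StringIO(text)):
        if not row:  # skip blank lines
            continue
        name, partitions, replication_factor = row
        specs.append(TopicSpec(name.strip(), int(partitions), int(replication_factor)))
    return specs


# A few rows taken from the sample backup above.
backup = """\
relctls2.Default.ALL.DI_RAW_RECORDS_CHUNK_PARSED,8,2
relctls2.ALL.inventory.campus,2,2
relctls2.ALL.mod-pubsub.LOG_RECORD,1,2
"""
specs = parse_topic_backup(backup)
print(specs[0])
```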

 

Pause Kafka cluster

Use fse-pause-msk to pause (delete) the Kafka cluster. This job will:

  • Trigger the Kafka topics backup

  • Get and save the cluster info: broker instance type, Kafka version, config version, broker count, tags, storage size

  • Delete the Kafka cluster

All these steps usually take 6-10 minutes in total.
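The "get and save cluster info" step can be pictured as reducing the cluster description to the fields listed above. This sketch is an assumption about how that could be done, not the job's code: it takes a dictionary shaped like the ClusterInfo structure returned by the boto3 kafka client's describe_cluster call, and every value in the sample is made up.

```python
import json


def extract_cluster_info(cluster: dict) -> dict:
    """Keep only the settings needed to recreate the cluster later."""
    brokers = cluster["BrokerNodeGroupInfo"]
    software = cluster["CurrentBrokerSoftwareInfo"]
    return {
        "cluster_name": cluster["ClusterName"],
        "instance_type": brokers["InstanceType"],
        "storage_gib": brokers["StorageInfo"]["EbsStorageInfo"]["VolumeSize"],
        "kafka_version": software["KafkaVersion"],
        # A cluster without a custom configuration has no ARN or revision.
        "configuration_arn": software.get("ConfigurationArn"),
        "configuration_revision": software.get("ConfigurationRevision"),
        "broker_count": cluster["NumberOfBrokerNodes"],
        "tags": cluster.get("Tags", {}),
    }


# Hypothetical sample input; names and values are placeholders.
sample = {
    "ClusterName": "example-msk",
    "BrokerNodeGroupInfo": {
        "InstanceType": "kafka.m5.large",
        "StorageInfo": {"EbsStorageInfo": {"VolumeSize": 100}},
    },
    "CurrentBrokerSoftwareInfo": {"KafkaVersion": "3.5.1", "ConfigurationRevision": 1},
    "NumberOfBrokerNodes": 2,
    "Tags": {"fse-autoshutdown": "true"},
}
info = extract_cluster_info(sample)
print(json.dumps(info, indent=2))
```

Serialising the result as JSON gives the unpause step a single file to read the settings back from.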

Unpause Kafka cluster

Use fse-unpause-msk to unpause (recreate) the Kafka cluster. This job will:

  • Trigger creation of a Kafka cluster based on the previously saved cluster info

  • Recreate the Kafka topics from the previously taken backup

  • Update Route53 records so that the ECS clusters pointing to this MSK cluster receive the new broker endpoints
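One possible way to carry out the topic recreation step is to turn each backup row into a kafka-topics.sh invocation. This is only a sketch: the source does not say which tool the job uses, and the function name and bootstrap address are hypothetical.

```python
def topic_create_commands(backup_text: str, bootstrap_server: str) -> list[list[str]]:
    """Build one 'kafka-topics.sh --create' argument list per backup row."""
    commands = []
    for line in backup_text.splitlines():
        line = line.strip()
        if not line:  # skip blank lines
            continue
        name, partitions, replication_factor = line.split(",")
        commands.append([
            "kafka-topics.sh",
            "--bootstrap-server", bootstrap_server,
            "--create",
            "--topic", name,
            "--partitions", partitions,
            "--replication-factor", replication_factor,
        ])
    return commands


backup = "relctls2.ALL.inventory.campus,2,2\nrelctls2.ALL.mod-pubsub.LOAN_CLOSED,1,2\n"
# The bootstrap address below is a placeholder, not a real endpoint.
commands = topic_create_commands(backup, "broker.example:9092")
for command in commands:
    print(" ".join(command))
```

Returning argument lists rather than shell strings lets a caller pass each command to subprocess.run without any quoting concerns.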

Usage with pause/unpause-folio

With Pause-folio

Currently fse-pause-folio pauses the RDS DB cluster and the ECS infrastructure (by changing the task count for each service and then setting the Auto Scaling group's desired instance count to 0).

The fse-pause-folio job may be extended so that it can also pause Kafka:

  • Add a checkbox to indicate whether MSK should be paused together with ECS.

  • The job should check whether any other running ECS clusters point to the same MSK cluster.

  • If additional ECS clusters point to the same Kafka cluster, the job should either fail or skip this step. Otherwise, stop the MSK and ECS clusters.
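The shared-cluster check described above can be sketched as a small decision function. How the job discovers which ECS cluster points to which MSK cluster is not covered here, so the mapping is assumed to be prepared beforehand; all names in the example are hypothetical.

```python
def plan_msk_pause(
    target_ecs: str,
    ecs_to_msk: dict[str, str],
    running_ecs: set[str],
    fail_on_shared: bool = False,
) -> str:
    """Decide whether the MSK cluster behind target_ecs may be paused."""
    msk = ecs_to_msk[target_ecs]
    # Other ECS clusters that are still running and use the same MSK cluster.
    others = sorted(
        name
        for name, cluster in ecs_to_msk.items()
        if cluster == msk and name != target_ecs and name in running_ecs
    )
    if not others:
        return "pause_msk"
    if fail_on_shared:
        raise RuntimeError(f"{msk} is still used by running ECS clusters: {', '.join(others)}")
    return "skip_msk"


# Hypothetical mapping of ECS clusters to the MSK cluster they point to.
ecs_to_msk = {"ecs-a": "msk-1", "ecs-b": "msk-1", "ecs-c": "msk-2"}
print(plan_msk_pause("ecs-a", ecs_to_msk, running_ecs={"ecs-a", "ecs-b"}))  # skip_msk
print(plan_msk_pause("ecs-a", ecs_to_msk, running_ecs={"ecs-a"}))           # pause_msk
```

The fail_on_shared flag covers both options named in the list: skipping the MSK step or failing the job.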

 

With Unpause-folio

  • Before unpausing ECS, the job should check whether the appropriate MSK cluster exists and unpause it first.