System monitoring approach

The problem

There are many events being raised inside the running system.

They can be split into at least 2 categories:

  1. Infrastructure events
  2. Business events

Infrastructure events

These are all the events related to infrastructure.

For example: database load, service health, whether the web UI is accessible from a customer's browser, and so on.

This category also includes queue states. For example, if a queue grows beyond a specified threshold, this indicates that the processes responsible for processing the queue are not running or not functioning correctly.
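A minimal sketch of such a queue check, in Python. The function name, the sampling of queue depth, and the `min_consecutive` parameter are assumptions made for illustration; requiring several consecutive samples above the threshold is one possible way to avoid alerting on a brief burst.

```python
from typing import Sequence


def queue_over_threshold(
    depth_samples: Sequence[int],
    threshold: int,
    min_consecutive: int = 3,
) -> bool:
    """Return True if the latest `min_consecutive` samples all exceed `threshold`.

    `depth_samples` is a hypothetical series of queue depths, oldest first.
    """
    if min_consecutive <= 0:
        raise ValueError("min_consecutive must be positive")
    recent = depth_samples[-min_consecutive:]
    # Not enough samples yet: do not alert.
    if len(recent) < min_consecutive:
        return False
    return all(depth > threshold for depth in recent)
```

With a threshold of 100, the samples `[10, 120, 130, 150]` would raise an alert, while `[120, 130, 90]` would not, because the latest sample is back under the threshold.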

Business events

This category contains all the events raised by the running software.

For example:

  1. A new tenant is created
  2. A job is started, finished, or stuck
  3. A user is unable to log in within N attempts
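As an illustration, a business event like the ones above could be written as a single JSON log line that a monitoring system can filter on. This is only a sketch: the function name and the field names (`category`, `event`, `tenant`, `details`) are assumptions, not an existing Folio format.

```python
import json
from datetime import datetime, timezone


def format_business_event(event_type: str, tenant_id: str, **details: object) -> str:
    """Serialize a business event as one JSON log line (hypothetical schema)."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "category": "business",
        "event": event_type,
        "tenant": tenant_id,
        "details": details,
    }
    # sort_keys keeps the output stable, which simplifies filter patterns.
    return json.dumps(record, sort_keys=True)


if __name__ == "__main__":
    print(format_business_event("login_failed", "tenant-1", user="jdoe", attempts=5))
```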


All events in both categories need to be monitored and covered by alerts, which are sent to the group of people responsible for the relevant area (account administrators, DevOps, etc.).

Requirements

A monitoring and alerting system needs to be set up and configured to help solve the problem.

This system should be the single point for managing all alerts, tracking states, and storing and providing the metrics of a running Folio system.

This system is managed by system administrators or DevOps.

Alerts and recipient groups need to be easily configurable by the administrators of the system.

A two-way communication channel should be set up to allow customers (account administrators) to request changes to business alerts (e.g. add a recipient to a group, configure a new alert, disable an alert permanently or temporarily, etc.).
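The configurability requirement can be pictured with a small in-memory model. This is a sketch under assumptions: the class names `AlertRule` and `AlertConfig` and their methods are hypothetical, and a real monitoring system would keep this configuration in its own storage.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class AlertRule:
    name: str
    recipient_group: str
    enabled: bool = True


@dataclass
class AlertConfig:
    """Hypothetical container for recipient groups and alert rules."""

    groups: Dict[str, Set[str]] = field(default_factory=dict)
    rules: Dict[str, AlertRule] = field(default_factory=dict)

    def add_recipient(self, group: str, email: str) -> None:
        self.groups.setdefault(group, set()).add(email)

    def add_rule(self, rule: AlertRule) -> None:
        self.rules[rule.name] = rule

    def set_enabled(self, rule_name: str, enabled: bool) -> None:
        self.rules[rule_name].enabled = enabled

    def recipients_for(self, rule_name: str) -> List[str]:
        """Return the recipients of a rule, or an empty list if it is disabled."""
        rule = self.rules[rule_name]
        if not rule.enabled:
            return []
        return sorted(self.groups.get(rule.recipient_group, set()))
```

Each change a customer might request (adding a recipient, adding an alert, disabling an alert) maps to one method call here, which is the property the requirement asks for.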

Solution description

The Folio system can be deployed in different environments.

Some of them are:

  • AWS
  • Docker/Kubernetes

There are also "on-prem" installations, though they are not covered in this document because we have no information about what those environments look like. Configuring a monitoring and alerting system there rests entirely on the shoulders of the system administrator of the installation.

The Folio software only provides 'metrics', and those metrics are consumed by the existing monitoring system of the vendor or installation.


For AWS environments, this system consists of:

  • CloudWatch (using metric filters, or Lambda functions if more complex processing is required)
  • SNS topics configured to deliver messages to a group of recipients


For K8S installations:

  • Filebeat, Fluentd or similar software to parse logs into JSON format (if logs are not delivered in JSON initially)
  • Elasticsearch to store logs and provide search capabilities
  • Redis (if required) to buffer logs and provide a backpressure mechanism in front of Elasticsearch (to smooth out load spikes on Elasticsearch)
  • Kibana/Instana or similar to filter messages and create/send alerts
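The buffering role described for Redis in the list above can be illustrated with an in-memory stand-in. This sketch does not use Redis itself; the `LogBuffer` class, its capacity, and the batch size are assumptions chosen only to show the idea of refusing new lines when the buffer is full and draining it in bounded batches.

```python
from collections import deque
from typing import Deque, List


class LogBuffer:
    """Bounded buffer between a log shipper and an indexer (illustrative only)."""

    def __init__(self, capacity: int) -> None:
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self._capacity = capacity
        self._items: Deque[str] = deque()

    def offer(self, line: str) -> bool:
        """Accept a line if there is room; False signals backpressure to the producer."""
        if len(self._items) >= self._capacity:
            return False
        self._items.append(line)
        return True

    def drain(self, batch_size: int) -> List[str]:
        """Remove and return up to `batch_size` lines, oldest first."""
        batch: List[str] = []
        while self._items and len(batch) < batch_size:
            batch.append(self._items.popleft())
        return batch
```

The indexer calls `drain` at its own pace, so a burst of log lines fills the buffer instead of hitting the index directly; what the producer does on a `False` result (retry, block, drop) is a separate design decision.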


Prometheus could also be used for this purpose.

AWS installation monitoring flow

  1. The installation needs to be set up, and CloudWatch needs to be configured to receive log messages from the running containers.
  2. CloudWatch receives a message in a given format, stores it, and runs filters to check whether any action is required for the message.
  3. If a filter matches, the message is split into tokens (this might require triggering a Lambda if additional information needs to be gathered or complex logic is needed to process the message).
  4. The data packet containing the event is published to a configured SNS topic (this step can be skipped, with the message sent directly from the filter, if the alert does not need a custom email template).
  5. The SNS consumer is configured either to send an email or to pass the message to a Lambda, which creates a proper email message and delivers it to the recipients.


This flow can also be simplified if the email is created in the second step and SNS only delivers the emails.
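The filter, tokenize, and publish steps of the numbered flow can be sketched in plain Python. Nothing here calls AWS: `publish` is a hypothetical callable standing in for publishing to SNS, and the JSON message shape with an `event` field is an assumption about the log format.

```python
import json
from typing import Callable, Dict, Optional


def extract_event(raw_message: str, wanted_event: str) -> Optional[Dict[str, object]]:
    """Parse a log message and return its fields if it matches the filter."""
    try:
        fields = json.loads(raw_message)
    except json.JSONDecodeError:
        return None  # not in the expected format, so no filter can match
    if not isinstance(fields, dict) or fields.get("event") != wanted_event:
        return None
    return fields


def handle_message(
    raw_message: str,
    wanted_event: str,
    publish: Callable[[str, str], None],
) -> bool:
    """Publish a subject/body pair for a matching message; return True if published."""
    fields = extract_event(raw_message, wanted_event)
    if fields is None:
        return False
    subject = f"Alert: {wanted_event}"
    body = json.dumps(fields, sort_keys=True)
    publish(subject, body)
    return True
```

Passing `publish` in as a parameter keeps the sketch testable without AWS credentials; in a real deployment that role would be played by the SNS topic, or by a Lambda when a custom email template is needed.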

K8S installation monitoring flow

  1. The installation needs to have Elasticsearch, Fluentd, and Kibana configured and receiving logs.
  2. Elasticsearch receives the processed message and updates the index.
  3. Kibana triggers events based on a pattern, creates an email, and delivers it to an email group.
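The pattern-based trigger in the last step can be pictured as follows. This is a plain-Python sketch, not Kibana's rule syntax: the `build_alert` function, the `message` field, and the `min_matches` threshold are assumptions used only to show the logic of matching indexed records and producing an email-like alert.

```python
import re
from typing import Dict, List, Mapping, Optional, Sequence


def build_alert(
    records: Sequence[Mapping[str, str]],
    pattern: str,
    min_matches: int,
    recipients: List[str],
) -> Optional[Dict[str, object]]:
    """Return an email-like alert if at least `min_matches` records match `pattern`."""
    regex = re.compile(pattern)
    matches = [r["message"] for r in records if regex.search(r.get("message", ""))]
    if len(matches) < min_matches:
        return None
    return {
        "to": list(recipients),
        "subject": f"{len(matches)} log messages matched '{pattern}'",
        "match_count": len(matches),
        "body": "\n".join(matches),
    }
```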

To discuss / outline

Alert reaction policy (alerts without a reaction assigned are to be considered obsolete)

Why not implement alerting inside the Folio software

Automation based on metrics/alerts (e.g. autoscaling)

Outline boundaries (progress vs alerts)

Add motivation

Display state inside UI also

Not replacing APM