
Nov 22, 2017 | 7 min read

Tech

Alerting in Prometheus or: How I can sleep well at night

You build it, you run it, you get alerts

Fabian Gutierrez

Developer


Monitoring is a well-known problem we all face when building software, but to be completely honest, nowadays we have plenty of tools that help reduce the pain. As you might guess, for this article we have chosen Prometheus as our monitoring tool, but the same ideas can be applied to alternative tools. In this post we start with a short recap of the Prometheus platform. We then discuss alerting rules, explore Alertmanager and its notification possibilities, and finally show you how to integrate it with Slack.

Prometheus

Prometheus is a piece of software intended to monitor our applications through metrics. These metrics can be defined either by us (e.g. business metrics) or by the platform we are running on (JVM, Ubuntu, etc.). Each metric can carry several labelled dimensions at the same time, hence multi-dimensional. And because Prometheus itself queries our application over HTTP to retrieve them, we call it pull-oriented.

There are several client libraries that help us build the metric output in the format expected by Prometheus (a small instrumentation sketch is shown right after the list). With all of those libraries, every metric falls into one of the following categories:

  • Counters, cumulative metrics that can only increase
  • Gauges, single values that can be increased and decreased
  • Histograms, which sample observations and count them in configurable buckets
  • Summaries, which sample observations and expose configurable quantiles
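
Here is a minimal instrumentation sketch using the official Prometheus JVM client (io.prometheus.client) from Scala. The companion Play application may rely on a different library, so treat the wiring below as an assumption rather than the repository's actual code.

import io.prometheus.client.{Counter, Gauge}

object Metrics {
  // Counter: cumulative, it can only go up
  val requests: Counter = Counter.build()
    .name("play_requests_total")
    .help("Total number of visits to the home page.")
    .register()

  // Gauge: a single value that can go up and down
  val currentUsers: Gauge = Gauge.build()
    .name("play_current_users")
    .help("Number of currently connected users.")
    .register()
}

// Hypothetical usage from the controllers:
// Metrics.requests.inc()      // on GET /
// Metrics.currentUsers.inc()  // on /login
// Metrics.currentUsers.dec()  // on /logout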

However, as Prometheus only expects a simple text response served over HTTP, we could actually use any programming language able to produce it. Once the data is retrieved and stored by Prometheus, it is subsequently queried by specialized dashboards like Grafana.
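
The queries a Grafana dashboard typically runs against Prometheus are written in Prometheus' query language. Two examples based on the metrics of the application presented below:

# Per-second rate of visits over the last 5 minutes
rate(play_requests_total[5m])

# Current number of connected users
play_current_users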

If you want to spend more time on the basic concepts you can check the official docs, or here, and if you also speak French, here.

Installing and configuring

In the repository accompanying this article you'll find a docker-compose.yml. This file will keep you away from the trouble of configuring all the tools we use. You will need Docker and docker-compose installed, though. Even if there is no relationship per se between Docker and Prometheus, Docker is always an easy way to share configurations and installations.

Once you have cloned the repository, all you need to do is go to the play-prometheus directory and execute the following instruction

$ docker-compose up -d

This command will download some Docker images, then configure and run the following containers (a sketch of such a compose file is shown after the list):
  • One container with a Play application that exposes some metrics, available at localhost:9000. We will present these endpoints later
  • One container for Prometheus to gather metrics, available at localhost:9090
  • One container for Alertmanager to trigger alerts on metrics, available at localhost:9093
  • One container for Grafana, available at localhost:3000
  • One container for cAdvisor, available at localhost:8080
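
To make this more concrete, here is a sketch of what such a docker-compose.yml can look like. The service names, build path and volume locations are assumptions; the actual file in the repository may differ.

version: '2'
services:
  play-app:
    build: ./play-app            # hypothetical path to the Play application
    ports:
      - "9000:9000"
  prometheus:
    image: prom/prometheus
    volumes:
      - ./prometheus/:/etc/prometheus/
    ports:
      - "9090:9090"
  alertmanager:
    image: prom/alertmanager
    volumes:
      - ./alertmanager/:/etc/alertmanager/
    ports:
      - "9093:9093"
  grafana:
    image: grafana/grafana
    ports:
      - "3000:3000"
  cadvisor:
    image: google/cadvisor
    ports:
      - "8080:8080"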

The relationship between these containers is shown in the following diagram

relationship between containers

In this article we will focus on the interactions between our monitored application, Prometheus and Alertmanager. However, just to verify that everything is configured as expected, we can navigate to Grafana's admin page and check that it looks similar to this:

If that's not the case you should add a data source and then import the dashboard file Grafana_Dashboard.json available in the repository downloaded previously.

With this dashboard we can inspect the following metrics:

  • Amount of visits to our application
  • Amount of successful logins
  • Memory, CPU and file system usage in a particular server

We display this page on a build wall and now everything looks nice and clean. It is time to go home, right?… right? Sadly, not quite. I mean, this is already very good. We've come a long way since SSH connections to inspect logs. However, we are still restricted to visual inspections of the data. Or are we?

Alerting in Prometheus

In this section we are going to discuss the application used in the example and the metrics it generates. After that, we are going to explain how to configure Prometheus and Alertmanager to describe rules from existing metrics. Finally, we will see how to trigger alerts and get notifications when those rules are met.

The monitored application

The application we provided for this article exposes the following endpoints to interact with the metrics (see the curl examples after the list):

  • /. Increases the Counter metric of visits, play_requests_total
  • /login. Increases the Gauge metric of connected users, play_current_users
  • /logout. Decreases the Gauge metric of connected users, play_current_users
  • /metrics. Gives the output in the format expected by Prometheus
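
For instance, assuming the stack is running locally as described above, we can exercise these endpoints from the command line:

$ curl http://localhost:9000/          # increments play_requests_total
$ curl http://localhost:9000/login     # increments play_current_users
$ curl http://localhost:9000/logout    # decrements play_current_users
$ curl http://localhost:9000/metrics   # exposition format read by Prometheus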

A typical output of /metrics containing the metrics and their current values could be as follows

http_request_duration_seconds_bucket{le="+Inf",method="GET",path="/login",status="2xx"} 3
http_request_duration_seconds_count{method="GET",path="/login",status="2xx"} 3
http_request_duration_seconds_sum{method="GET",path="/login",status="2xx"} 0.006110352
http_request_mismatch_total 0.0
play_current_users 3.0
play_requests_total 0.0

Alerting overview

In the Prometheus platform, alerting is handled through an independent component: Alertmanager. Usually, we first tell Prometheus where Alertmanager is located, then we create the alerting rules in the Prometheus configuration and finally, we configure Alertmanager to handle and send alerts to a receiver (email, webhook, Slack, etc.). These dynamics are shown in the following diagram

Prometheus configure Alertmanager
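
The first step, pointing Prometheus at Alertmanager, depends on the Prometheus version. A sketch, where the alertmanager hostname comes from our docker-compose setup and is therefore an assumption:

# Prometheus 1.x: a command-line flag
-alertmanager.url=http://alertmanager:9093

# Prometheus 2.x: an alerting block in prometheus.yml
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']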

Alerting rules

Alerting rules are the mechanism proposed by Prometheus to define alerts on recorded metrics. They are written in dedicated rule files, referenced in prometheus.yml

rule_files:
 - "/etc/prometheus/alert.rules"

and are based on the following template

ALERT <alert name>
  IF <expression>
  [ FOR <duration> ]
  [ LABELS <label set> ]
  [ ANNOTATIONS <label set> ]

Where:

  • Alert name, is the alert identifier. It does not need to be unique.
  • Expression, is the condition that gets evaluated in order to trigger the alert. It usually uses existing metrics, such as those exposed by the /metrics endpoint.
  • Duration, is the period of time during which the condition must hold before the alert fires. For example, 5s for 5 seconds.
  • Label set, the labels (and annotations) attached to the alert; they can be reused inside your message templates.

We can define a new rule in our alert.rules to be informed when we have fewer than two logged-in users in our application:

ALERT low_connected_users
  IF play_current_users < 2
  FOR 30s
  LABELS {
    severity = "warning"
  }
  ANNOTATIONS {
    summary = "Instance {{ $labels.instance }} under lower load",
    description = "{{ $labels.instance }} of job {{ $labels.job }} is under lower load.",
  }

Alertmanager

Alertmanager is a buffer for alerts (no surprise here) that has the following characteristics (a minimal routing configuration is sketched after the list):

  • Is able to receive alerts through a dedicated endpoint (not specific to Prometheus).
  • Can route alerts to receivers like Slack, HipChat or email.
  • Is intelligent enough to determine that a similar notification was already sent, so you don't end up drowned by thousands of emails in case of a problem.
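
The grouping and de-duplication behaviour mentioned above is driven by the route section of alertmanager.yml. A minimal sketch, with illustrative values that are not necessarily the ones shipped in the companion repository:

route:
  receiver: slack_general       # default receiver, defined in the receivers section
  group_by: ['alertname']       # alerts with the same name end up in the same notification
  group_wait: 30s               # wait before sending the first notification for a new group
  group_interval: 5m            # wait before sending an update about a group
  repeat_interval: 3h           # wait before re-sending a still-firing alert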

A client of Alertmanager (in this case Prometheus) starts by sending a POST request to /api/v1/alerts with all the alerts it wants to be handled. For example

[
  {
    "labels": {
      "alertname": "low_connected_users",
      "severity": "warning"
    },
    "annotations": {
      "description": "Instance play-app:9000 under lower load",
      "summary": "play-app:9000 of job playframework-app is under lower load"
    }
  }
]
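
Prometheus does this for us, but for testing purposes the same payload can be pushed by hand. Assuming the JSON above is saved as alerts.json:

$ curl -XPOST -d @alerts.json http://localhost:9093/api/v1/alerts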

Workflow

Once these alerts are stored in Alertmanager they can be in any of the following states:

  • Inactive. Nothing happens here.
  • Pending. The client told us that this alert must be triggered. However, alerts can be grouped, suppressed/inhibited (more on inhibition later) or silenced/muted (we will discuss silences later). Once all these validations pass, the alert moves to Firing.
  • Firing. The alert is sent to the Notification Pipeline, which will contact all the receivers of our alert. The client can then tell us the alert is resolved, and we make a transition back to Inactive.

Prometheus has a dedicated endpoint that lets us list all alerts and follow the state transitions. Each state, as displayed by Prometheus, together with the condition that leads to the transition, is shown below

  • The rule is not met. The alert is not active

  • The rule is met. The alert is now active. Some validations are executed in order to avoid drowning the receiver with messages

  • The alert is sent to the receiver

Shortly after, the receiver (a Slack channel) gets notified with a new message

with a red bar on the left that indicates that this alert is now active.

As the rules continue to be evaluated, we also have the possibility to send another notification when the rule is no longer valid. This is configured in the alertmanager.yml with the parameter send_resolved:

receivers:
- name: slack_general
  slack_configs:
  - api_url: https://hooks.slack.com/services/FOO/BAR/FOOBAR
    channel: '#prometheus-article'
    send_resolved: true

In order to get this second notification, we need to make our alert condition play_current_users < 2 invalid. We can achieve this by increasing the gauge metric used in the rule or, simply put, by navigating several times to the /login endpoint of our application. Now we receive a new message with a green bar in our Slack channel
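
If you prefer the command line, hitting /login a few times has the same effect:

$ for i in 1 2 3; do curl -s http://localhost:9000/login > /dev/null; done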

More on transitions from Pending to Firing

Inhibition
Inhibitions allow us to suppress notifications for some alerts when another alert is already firing. For example, we could configure an inhibition that mutes any warning-level notification if the same alert (based on alertname) is already critical. The relevant section of the alertmanager.yml file could look like this:

inhibit_rules:
- source_match:
    severity: 'critical'
  target_match:
    severity: 'warning'
  equal: ['alertname']
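
For this inhibition to actually kick in, a critical-severity rule with the same alert name has to exist. Such a rule is not shown above, so the threshold here is only a sketch following the same rule syntax:

ALERT low_connected_users
  IF play_current_users == 0
  FOR 30s
  LABELS {
    severity = "critical"
  }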

Silences
Silences are a quick way to temporarily mute alerts. We configure them directly through a dedicated page in the Alertmanager admin console. They can be useful to avoid getting spammed while trying to resolve a critical production issue.

Message templates

Message templates are a mechanism that allows us to take the annotations present in the alert and format them in a particular way. The template files must be declared in the Alertmanager configuration file.
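
In our setup, this declaration could look like the following; the path is an assumption, the repository may store the templates elsewhere.

templates:
  - '/etc/alertmanager/*.tmpl'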

The file alertmessage.tmpl used to produce the Slack notifications could be defined as follows

{{ define "__slack_text" }}
{{ range .Alerts }}{{ .Annotations.description}}{{ end }}
{{ end }}

{{ define "__slack_title" }}
{{ range .Alerts }} :scream: {{ .Annotations.summary}} :scream: {{ end }}
{{ end }}

{{ define "slack.default.text" }}{{ template "__slack_text" . }}{{ end }}
{{ define "slack.default.title" }}{{ template "__slack_title" . }}{{ end }}

Final thoughts

In this article we have briefly covered the basics of Prometheus and the types of metrics we can monitor with it. Then we discussed alerting rules, Alertmanager, and the different receivers, with a focus on Slack.

We hope this article will help you explore new possibilities in your own infrastructure and in general make your monitoring experience less of a pain. However, we still need to explore the monitoring possibilities in a discoverable architecture and the changes introduced in Prometheus 2.0, but that will be the topic of another article.

We hope you enjoyed it. See you next time.

Want to join us?

Check out our job openings!