Metrics with Prometheus and Grafana

As your application grows, it becomes more and more important to have insight into what is going on inside it. Metrics are one way to expose information about how your application behaves at run time.

One of the industry standards that has emerged in recent years is Prometheus, a monitoring toolkit originally built at SoundCloud. It handles metrics collection and alerting, and comes with visualisation support and a powerful query language.

Prometheus is written in Go and originally targeted Go applications, but client libraries for other languages keep appearing, and Rust has been supported for quite some time now.

The Rust client library was created by the developers behind TiKV and TiDB, highly scalable and performant NoSQL/NewSQL databases.

How do Prometheus metrics work

On the Rust side, metrics are collected into a registry. By default, metrics are not sent anywhere; rather, they are scraped on request. Generally, metrics are exposed over HTTP, and a GET request to the endpoint scrapes them.

However, the Rust crate also lets you encode the gathered metrics as text, which you can then transfer however you wish:

use prometheus::{Counter, Encoder, Opts, Registry, TextEncoder};

fn main() {
    // Create a Counter.
    let counter_opts = Opts::new("test_counter", "test counter help");
    let counter = Counter::with_opts(counter_opts).unwrap();

    // Create a Registry and register the Counter.
    let r = Registry::new();
    r.register(Box::new(counter.clone())).unwrap();

    // Increment the Counter.
    counter.inc();

    // Gather the metrics and encode them in the text format.
    let mut buffer = vec![];
    let encoder = TextEncoder::new();
    let metric_families = r.gather();
    encoder.encode(&metric_families, &mut buffer).unwrap();

    // Output to the standard output.
    println!("{}", String::from_utf8(buffer).unwrap());
}

On the server side, a monitoring platform often looks like this:

Multiple metric exporters run and expose local metrics over HTTP
Prometheus is used to centralize and store the metrics
Alertmanager triggers alerts based on those metrics
Grafana produces dashboards
PromQL is the query language used to describe dashboards and alerts.

Prometheus types

Prometheus supports four core metric types. These exist only in the client libraries and the wire protocol; the Prometheus server flattens everything into untyped time series (but that may change).

These are:

Counter
Gauge
Histogram
Summary

Out of these, Summary is the only one not supported by the Rust library.

Histograms sample observations (for example, request durations or response sizes) and count them in buckets, while also providing a sum of all observed values.
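To make the bucket idea concrete, here is a minimal, dependency-free sketch of how a histogram could track observations. The type SimpleHistogram and its methods are hypothetical and are not the prometheus crate's Histogram; the sketch only assumes that bucket counts are cumulative, as they are in the Prometheus text format (each bucket counts all observations less than or equal to its upper bound).

```rust
/// Hypothetical, simplified histogram. It illustrates the idea only and is
/// not the `prometheus` crate's `Histogram` type.
#[derive(Debug)]
pub struct SimpleHistogram {
    /// Upper bounds of the buckets, sorted in ascending order.
    upper_bounds: Vec<f64>,
    /// Cumulative counts: `bucket_counts[i]` is the number of observations
    /// less than or equal to `upper_bounds[i]`.
    bucket_counts: Vec<u64>,
    /// Sum of all observed values.
    sum: f64,
    /// Total number of observations (the implicit `+Inf` bucket).
    count: u64,
}

impl SimpleHistogram {
    pub fn new(mut upper_bounds: Vec<f64>) -> Self {
        upper_bounds.sort_by(|a, b| a.total_cmp(b));
        let bucket_counts = vec![0; upper_bounds.len()];
        Self {
            upper_bounds,
            bucket_counts,
            sum: 0.0,
            count: 0,
        }
    }

    /// Records one observation in every bucket whose bound covers it.
    pub fn observe(&mut self, value: f64) {
        for (bound, bucket) in self.upper_bounds.iter().zip(self.bucket_counts.iter_mut()) {
            if value <= *bound {
                *bucket += 1;
            }
        }
        self.sum += value;
        self.count += 1;
    }

    pub fn bucket_counts(&self) -> &[u64] {
        &self.bucket_counts
    }

    pub fn sum(&self) -> f64 {
        self.sum
    }

    pub fn count(&self) -> u64 {
        self.count
    }
}
```

With bounds of 0.5 and 1.0, an observation of 2.0 lands in neither bucket but still contributes to the sum and the total count.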

The difference between a Counter and a Gauge is that a Counter should be monotonically increasing, whereas a Gauge can go up and down with no issue. A Counter may also be reset back to zero on restart.

Use Counter to keep count of things:

errors
number of requests
tasks completed
etc.

Use Gauge to report measured values:

currently logged in users
hash rate
current draw
etc.

Here is an example of using counters and gauges:

// Copyright 2019 TiKV Project Authors. Licensed under Apache-2.0.

use prometheus::{IntCounter, IntCounterVec, IntGauge, IntGaugeVec};

use lazy_static::lazy_static;
use prometheus::{
    register_int_counter, register_int_counter_vec, register_int_gauge, register_int_gauge_vec,
};

lazy_static! {
    static ref A_INT_COUNTER: IntCounter =
        register_int_counter!("A_int_counter", "foobar").unwrap();
    static ref A_INT_COUNTER_VEC: IntCounterVec =
        register_int_counter_vec!("A_int_counter_vec", "foobar", &["a", "b"]).unwrap();
    static ref A_INT_GAUGE: IntGauge = register_int_gauge!("A_int_gauge", "foobar").unwrap();
    static ref A_INT_GAUGE_VEC: IntGaugeVec =
        register_int_gauge_vec!("A_int_gauge_vec", "foobar", &["a", "b"]).unwrap();
}

fn main() {
    A_INT_COUNTER.inc();
    A_INT_COUNTER.inc_by(10);
    assert_eq!(A_INT_COUNTER.get(), 11);

    A_INT_COUNTER_VEC.with_label_values(&["a", "b"]).inc_by(5);
    assert_eq!(A_INT_COUNTER_VEC.with_label_values(&["a", "b"]).get(), 5);

    A_INT_COUNTER_VEC.with_label_values(&["c", "d"]).inc();
    assert_eq!(A_INT_COUNTER_VEC.with_label_values(&["c", "d"]).get(), 1);

    A_INT_GAUGE.set(5);
    assert_eq!(A_INT_GAUGE.get(), 5);
    A_INT_GAUGE.dec();
    assert_eq!(A_INT_GAUGE.get(), 4);
    A_INT_GAUGE.add(2);
    assert_eq!(A_INT_GAUGE.get(), 6);

    A_INT_GAUGE_VEC.with_label_values(&["a", "b"]).set(10);
    A_INT_GAUGE_VEC.with_label_values(&["a", "b"]).dec();
    A_INT_GAUGE_VEC.with_label_values(&["a", "b"]).sub(2);
    assert_eq!(A_INT_GAUGE_VEC.with_label_values(&["a", "b"]).get(), 7);
}

CounterVec and GaugeVec each bundle a set of metrics that share the same description but have different values for their variable labels. They are used if you want to count the same thing partitioned by various dimensions, such as:

HTTP requests partitioned by response code and method
Bitcoin mining shares partitioned by valid, invalid and duplicate shares

To learn more about generic usage of Prometheus and Grafana, check out the following links:

https://prometheus.io/docs/introduction/overview/
https://grafana.com/docs/grafana/latest/dashboards/ - for Grafana dashboards in particular

Braiins tips and tricks for using Prometheus in production

At Braiins, we have developed a couple of internal conventions for our Rust code, which we believe help us organize metrics in a reasonable way.

Always using partitioning with Counter/GaugeVec

The partitioning mentioned above is extremely useful: it helps you organize data better, and by default, the partitioned series show up in a single graph.

Wrapping metrics recording in methods

It is quite useful to have a dedicated method for recording each particular metric, which leaves very little to speculation and configuration. That makes it easy to orient yourself in the code as to where a particular metric with a particular label set is being recorded.

For example, imagine you are writing a web app, and you want to count requests. A request may return 200, 403 or 404 depending on what request it is.

In the code, you would have a CounterVec with labels for ok, forbidden and notfound. While it may be tempting to pass the label as a string parameter to a single function, it is a good practice to avoid stringly-typed APIs.

At Braiins, we would create the following three functions/methods (depending on whether you use the default registry or your own one):

pub fn record_request_ok()
pub fn record_request_forbidden()
pub fn record_request_not_found()

This also makes it easily searchable in your code editor.
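As a rough illustration of this convention, the sketch below hides the label strings behind the three methods. To stay dependency-free, it uses a hypothetical RequestMetrics type backed by a Mutex-protected HashMap as a stand-in for a CounterVec with a single status label; with the prometheus crate, the private record helper would call with_label_values on the shared vector instead.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

/// Hypothetical stand-in for a CounterVec with a single `status` label.
#[derive(Debug, Default)]
pub struct RequestMetrics {
    requests_total: Mutex<HashMap<&'static str, u64>>,
}

impl RequestMetrics {
    /// The only place that takes a label string; it is private on purpose.
    fn record(&self, status: &'static str) {
        *self
            .requests_total
            .lock()
            .unwrap()
            .entry(status)
            .or_insert(0) += 1;
    }

    pub fn record_request_ok(&self) {
        self.record("ok");
    }

    pub fn record_request_forbidden(&self) {
        self.record("forbidden");
    }

    pub fn record_request_not_found(&self) {
        self.record("notfound");
    }

    /// Reads the current count for one label value (0 if never recorded).
    pub fn requests(&self, status: &str) -> u64 {
        self.requests_total
            .lock()
            .unwrap()
            .get(status)
            .copied()
            .unwrap_or(0)
    }
}
```

Callers can only record one of the three known outcomes, so a typo in a label value cannot silently create a new time series.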

Nested metrics

Sometimes, you have a crate that is both a library and an application, and you write another application that depends on it while adding a ton of its own functionality. How do we deal with metrics?

Easily.

The nested crate gets its own containing structure which looks something like this:

use prometheus::{IntCounterVec, IntGauge, IntGaugeVec};

#[derive(Debug)]
pub struct Metrics {
    metric1: IntCounterVec,
    metric2: IntGauge,
    metric3: IntGaugeVec,
    metric4: IntCounterVec,
}

It is accompanied by its own initialization function, which creates an instance, registers all the metrics with the registry, and stores the instance in a global variable (static) to be freely used by anything that needs it.

The topmost crate will have a similar structure and the same init() function, except that it also calls the init() function of the dependency.

It would look something like this:

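pub fn init(
    info1: Type1,
    info2: &Type2,
) -> MetricsRegistry {
    // The inner crate creates the registry and registers its own metrics.
    let metrics_registry = inner_metrics::init(info1, info2);
    // The top-level metrics are registered with the same registry.
    let metrics = Metrics::new(&metrics_registry);

    // Store the instance in the global variable; a second call is a bug.
    INSTANCE
        .set(metrics)
        .expect("BUG: metrics instance already initialized!");

    metrics_registry
}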

In this case, we have the inner, base metrics construct the registry, which we then use to register our top-level metrics.

You could also do it the other way around, passing the registry downwards.

Furthermore, it is also handy to have something like a base_metrics() method on your top-level metrics, which returns a reference to the InnerMetrics, so you don't have to fish for the global variable every time you need to record a base metric from your crate.

You can use this model to create not just a chain of discrete metrics implementors, but also a tree, which might come in handy when you have multiple subcomponents that need to expose metrics.
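The relationship between the two levels can be sketched without any dependencies. In the sketch below, all names (InnerMetrics, inner_init, record_connection, record_job and so on) are hypothetical, plain atomics stand in for Prometheus counters, and registration with a registry is left out. It also uses get_or_init from std's OnceLock, so repeated calls return the same instance instead of panicking as the set(...).expect(...) variant above would.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::OnceLock;

/// Metrics of the inner (library) crate. All names are hypothetical.
#[derive(Debug, Default)]
pub struct InnerMetrics {
    connections_total: AtomicU64,
}

impl InnerMetrics {
    pub fn record_connection(&self) {
        self.connections_total.fetch_add(1, Ordering::Relaxed);
    }

    pub fn connections_total(&self) -> u64 {
        self.connections_total.load(Ordering::Relaxed)
    }
}

/// Global instance owned by the inner crate.
static INNER_INSTANCE: OnceLock<InnerMetrics> = OnceLock::new();

/// Init function of the inner crate.
pub fn inner_init() -> &'static InnerMetrics {
    INNER_INSTANCE.get_or_init(InnerMetrics::default)
}

/// Metrics of the top-level application, holding a reference to the base ones.
#[derive(Debug)]
pub struct Metrics {
    base: &'static InnerMetrics,
    jobs_total: AtomicU64,
}

impl Metrics {
    /// Shortcut to the inner crate's metrics, so callers do not have to
    /// look up the inner global themselves.
    pub fn base_metrics(&self) -> &InnerMetrics {
        self.base
    }

    pub fn record_job(&self) {
        self.jobs_total.fetch_add(1, Ordering::Relaxed);
    }

    pub fn jobs_total(&self) -> u64 {
        self.jobs_total.load(Ordering::Relaxed)
    }
}

/// Global instance owned by the top-level crate.
static INSTANCE: OnceLock<Metrics> = OnceLock::new();

/// Init function of the top-level crate; it also initializes the inner one.
pub fn init() -> &'static Metrics {
    INSTANCE.get_or_init(|| Metrics {
        base: inner_init(),
        jobs_total: AtomicU64::new(0),
    })
}
```

A tree of subcomponents would follow the same shape, with the top-level struct holding one reference per subcomponent.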

Metrics backreading

Prometheus metrics are easily shared global state. This makes them a prime place to store information you plan to read in a different part of your application.

You may also want to have your application dynamically change its behavior based on some of its statistics.

Therefore, it can come in handy to read metrics you have already recorded.

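Individual metric handles can be read directly: the assertions in the counter and gauge example above call get() on them. Refer to the very first code example to see how to gather everything registered in a registry at once.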
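Reading a metric back to steer behavior could look like the following sketch. It is dependency-free: an AtomicI64 stands in for an IntGauge, and the names IN_FLIGHT_REQUESTS, MAX_IN_FLIGHT_REQUESTS, try_start_request and finish_request are hypothetical, as is the limit of 2.

```rust
use std::sync::atomic::{AtomicI64, Ordering};

/// Hypothetical stand-in for an IntGauge counting requests in flight.
static IN_FLIGHT_REQUESTS: AtomicI64 = AtomicI64::new(0);

/// Hypothetical soft limit, chosen only for this example.
const MAX_IN_FLIGHT_REQUESTS: i64 = 2;

/// Reads the gauge back and refuses new work once the limit is reached.
/// The check and the increment are two separate steps, so under
/// concurrency this is a soft limit rather than a strict one.
pub fn try_start_request() -> bool {
    if IN_FLIGHT_REQUESTS.load(Ordering::Relaxed) >= MAX_IN_FLIGHT_REQUESTS {
        return false;
    }
    IN_FLIGHT_REQUESTS.fetch_add(1, Ordering::Relaxed);
    true
}

/// Marks one request as finished by decrementing the gauge.
pub fn finish_request() {
    IN_FLIGHT_REQUESTS.fetch_sub(1, Ordering::Relaxed);
}
```

The same value that is exported for dashboards doubles as the input to the decision, so the two cannot drift apart.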

Including unit name in gauges

If you have many metrics that report values with units, it can get messy figuring out which metric uses which unit, so it is a good idea to add the unit to the name of the metric and perhaps also to the name of your recording function. For example, you might want to use suffixes such as _bytes or _watts.

The Task: Implementing a simple Hyper HTTP server to expose metrics

For this project, it is your task to create a simple barebones HTTP server with the Hyper framework, which exposes the following metrics:

http_requests_total -> the total number of http requests made
http_response_size_bytes -> the HTTP response sizes in bytes

These metrics should be available on 127.0.0.1:9898.

TIP: For hyper, use the features server, http1, tcp to get at least the minimal functionality out of Hyper to be able to write this application.

For tokio, you should be fine with macros and rt-multi-thread.

TIP 2: The register_counter!/register_gauge! and the opts! macros are quite handy for registering metrics with the default registry. The default registry is absolutely enough for this application.

If you want, you can also forgo hyper and use the HTTP server from the async chapter.

If you are feeling particularly brave, you can try running a local Prometheus server and scraping the metrics with it:

https://prometheus.io/docs/prometheus/latest/getting_started/

When you run the application and either curl it, open the URL in your browser, or inspect the raw output via Prometheus or Grafana, it should look something like this:

# HELP http_requests_total Number of HTTP requests made.
# TYPE http_requests_total counter
http_requests_total 121
# HELP http_response_size_bytes The HTTP response sizes in bytes.
# TYPE http_response_size_bytes gauge
http_response_size_bytes 249

End product

In the end, you should be left with a well-prepared project that has the following:

documented code explaining your reasoning where it isn't self-evident
optionally tests
and an example or two where applicable
clean git history that does not contain fix-ups, merge commits or malformed/misformatted commits

Your Rust code should be formatted by rustfmt / cargo fmt and should produce no warnings when built. It should also work on stable Rust and follow the Braiins Standard.

Braiins University