Metrics with Prometheus and Grafana
As your application grows, it becomes more and more important to have insight into what is going on inside it. Metrics are one way to provide information about how your application is running.
In recent years, Prometheus, a toolkit originally built at SoundCloud, has become
one of the industry standards. It makes metrics and alerting easy,
and offers great visualisation and powerful queries.
Prometheus is written in Go and was originally aimed at Go applications, but more and more client implementations are popping up, and Rust has been supported for quite some time now.
The Rust client library was created
by the developers behind TiKV and TiDB, highly scalable and performant NoSQL/NewSQL
software.
How do Prometheus metrics work
On the Rust side, metrics are collected into a registry. By default, metrics are not sent anywhere; rather, they are scraped whenever they are requested. Generally, metrics are exposed over HTTP, and a GET request to the endpoint scrapes them.
However, the Rust crate also lets you encode the gathered metrics as text, which you can then transfer however you wish:
use prometheus::{Counter, Encoder, Opts, Registry, TextEncoder};

fn main() {
    // Create a Counter.
    let counter_opts = Opts::new("test_counter", "test counter help");
    let counter = Counter::with_opts(counter_opts).unwrap();

    // Create a Registry and register Counter.
    let r = Registry::new();
    r.register(Box::new(counter.clone())).unwrap();

    // Inc.
    counter.inc();

    // Gather the metrics.
    let mut buffer = vec![];
    let encoder = TextEncoder::new();
    let metric_families = r.gather();
    encoder.encode(&metric_families, &mut buffer).unwrap();

    // Output to the standard output.
    println!("{}", String::from_utf8(buffer).unwrap());
}
On the server side, a monitoring platform often looks like this:
- Multiple metric exporters are running and export local metrics on HTTP
- Prometheus is used to centralize and store the metrics
- Alertmanager triggers alerts based on those metrics
- Grafana produces dashboards
- PromQL is the query language used to describe dashboards and alerts.
Prometheus types
Prometheus supports four core metric types. These exist only in the client libraries and the wire protocol; the Prometheus server flattens everything into untyped time series (but that may change).
These are:
- Counter
- Gauge
- Histogram
- Summary
Out of these, Summary is the only one not supported by the Rust library.
Histograms sample observations (for example, request durations or response sizes) and count them into buckets, while also providing a sum of all observed values.
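To make the bucket idea concrete, here is a minimal, dependency-free sketch of how such a histogram can be modelled. ToyHistogram and its methods are hypothetical names used only for illustration; this is not the prometheus crate's Histogram type, and it assumes cumulative buckets (each bucket counts every observation less than or equal to its upper bound), which is how buckets appear in the Prometheus text format.

```rust
/// A toy model of a Prometheus-style histogram (hypothetical, std-only).
#[derive(Debug)]
pub struct ToyHistogram {
    /// Inclusive upper bounds of the buckets, sorted ascending.
    upper_bounds: Vec<f64>,
    /// Cumulative counts: counts[i] = number of observations <= upper_bounds[i].
    counts: Vec<u64>,
    /// Sum of all observed values.
    sum: f64,
    /// Total number of observations (what the implicit +Inf bucket would hold).
    count: u64,
}

impl ToyHistogram {
    pub fn new(mut upper_bounds: Vec<f64>) -> Self {
        upper_bounds.sort_by(|a, b| a.total_cmp(b));
        let counts = vec![0; upper_bounds.len()];
        Self { upper_bounds, counts, sum: 0.0, count: 0 }
    }

    /// Records one observation in every bucket whose bound covers it.
    pub fn observe(&mut self, value: f64) {
        for (bound, count) in self.upper_bounds.iter().zip(self.counts.iter_mut()) {
            if value <= *bound {
                *count += 1;
            }
        }
        self.sum += value;
        self.count += 1;
    }

    pub fn bucket_counts(&self) -> Vec<u64> {
        self.counts.clone()
    }

    pub fn sum(&self) -> f64 {
        self.sum
    }

    pub fn count(&self) -> u64 {
        self.count
    }
}
```

With bounds of 0.1, 0.5 and 1.0 seconds, a 0.3 second request lands in the 0.5 and 1.0 buckets, while a 2 second request only increases the total count and the sum.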
The difference between a Counter and a Gauge is that a Counter should be monotonically increasing,
whereas a Gauge can go up and down with no issue. A Counter may also be reset back to zero on restart.
Use Counter to keep count of things:
- errors
- numbers of requests
- tasks completed
- etc.
Use Gauge to report measured values:
- currently logged in users
- hash rate
- current draw
- etc.
Here is an example of using counters and gauges:
// Copyright 2019 TiKV Project Authors. Licensed under Apache-2.0.

use prometheus::{IntCounter, IntCounterVec, IntGauge, IntGaugeVec};

use lazy_static::lazy_static;
use prometheus::{
    register_int_counter, register_int_counter_vec, register_int_gauge, register_int_gauge_vec,
};

lazy_static! {
    static ref A_INT_COUNTER: IntCounter =
        register_int_counter!("A_int_counter", "foobar").unwrap();
    static ref A_INT_COUNTER_VEC: IntCounterVec =
        register_int_counter_vec!("A_int_counter_vec", "foobar", &["a", "b"]).unwrap();
    static ref A_INT_GAUGE: IntGauge = register_int_gauge!("A_int_gauge", "foobar").unwrap();
    static ref A_INT_GAUGE_VEC: IntGaugeVec =
        register_int_gauge_vec!("A_int_gauge_vec", "foobar", &["a", "b"]).unwrap();
}

fn main() {
    A_INT_COUNTER.inc();
    A_INT_COUNTER.inc_by(10);
    assert_eq!(A_INT_COUNTER.get(), 11);

    A_INT_COUNTER_VEC.with_label_values(&["a", "b"]).inc_by(5);
    assert_eq!(A_INT_COUNTER_VEC.with_label_values(&["a", "b"]).get(), 5);

    A_INT_COUNTER_VEC.with_label_values(&["c", "d"]).inc();
    assert_eq!(A_INT_COUNTER_VEC.with_label_values(&["c", "d"]).get(), 1);

    A_INT_GAUGE.set(5);
    assert_eq!(A_INT_GAUGE.get(), 5);
    A_INT_GAUGE.dec();
    assert_eq!(A_INT_GAUGE.get(), 4);
    A_INT_GAUGE.add(2);
    assert_eq!(A_INT_GAUGE.get(), 6);

    A_INT_GAUGE_VEC.with_label_values(&["a", "b"]).set(10);
    A_INT_GAUGE_VEC.with_label_values(&["a", "b"]).dec();
    A_INT_GAUGE_VEC.with_label_values(&["a", "b"]).sub(2);
    assert_eq!(A_INT_GAUGE_VEC.with_label_values(&["a", "b"]).get(), 7);
}
CounterVec and GaugeVec each bundle a set of metrics that share the same description, but have different values for their variable labels.
They are used if you want to count the same thing partitioned by various dimensions, such as:
- HTTP requests partitioned by response code and method
- Bitcoin mining shares partitioned by valid, invalid and duplicate shares
To learn more about generic usage of Prometheus and Grafana, check out the following links:
- https://prometheus.io/docs/introduction/overview/
- https://grafana.com/docs/grafana/latest/dashboards/ - for Grafana dashboards in particular
Braiins tips and tricks for using Prometheus in production
At Braiins, we have developed a couple of internal conventions for our Rust code, which we believe help us organize metrics in a reasonable way.
Always using partitioning with Counter/GaugeVec
The partitioning described a few lines above is extremely useful: it helps you organize data better, and by default you will see all partitions of a metric in a single graph.
Wrapping metrics recording in methods
It is quite useful to have a dedicated method for recording each particular metric, one that leaves very little to speculation and configuration. That makes it easy to find where in the code a particular metric with a particular label set is being recorded.
For example, imagine you are writing a web app, and you want to count requests. A request may return 200, 403 or 404, depending on what request it is.
In the code, you would have a CounterVec with labels for ok, forbidden and notfound.
While it may be tempting to pass the label as a string parameter to a single function, it is good practice to avoid stringly-typed APIs.
At Braiins, we would create the following three functions/methods (depending on whether you use the default registry or your own one):
- pub fn record_request_ok()
- pub fn record_request_forbidden()
- pub fn record_request_not_found()
This also makes it easily searchable in your code editor.
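The convention can be sketched roughly as follows. To keep the sketch free of dependencies, a Mutex<HashMap> stands in for the CounterVec; HTTP_REQUESTS, inc_requests and requests_with_status are hypothetical names introduced for illustration. With the prometheus crate, the private helper would instead call with_label_values on the counter family.

```rust
use std::collections::HashMap;
use std::sync::{Mutex, OnceLock};

/// Std-only stand-in for a labelled counter family (roughly the role a
/// CounterVec plays). Hypothetical, for illustration only.
static HTTP_REQUESTS: OnceLock<Mutex<HashMap<&'static str, u64>>> = OnceLock::new();

/// Private helper: the only function that accepts a label string.
fn inc_requests(status_label: &'static str) {
    let family = HTTP_REQUESTS.get_or_init(|| Mutex::new(HashMap::new()));
    let mut counts = family.lock().expect("metrics mutex poisoned");
    *counts.entry(status_label).or_insert(0) += 1;
}

/// Records a request answered with 200.
pub fn record_request_ok() {
    inc_requests("ok");
}

/// Records a request answered with 403.
pub fn record_request_forbidden() {
    inc_requests("forbidden");
}

/// Records a request answered with 404.
pub fn record_request_not_found() {
    inc_requests("notfound");
}

/// Reads the current count for a label; returns 0 if nothing was recorded.
pub fn requests_with_status(status_label: &str) -> u64 {
    let Some(family) = HTTP_REQUESTS.get() else {
        return 0;
    };
    let counts = family.lock().expect("metrics mutex poisoned");
    counts.get(status_label).copied().unwrap_or(0)
}
```

Because the label strings appear in exactly one place, a typo such as "not_found" instead of "notfound" cannot silently create a new time series at some distant call site.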
Nested metrics
Sometimes you have a crate that is both a library and an application, and you write another application that adds a lot of its own functionality while depending on the first one. How do we deal with metrics?
Easily.
The nested crate gets its own containing structure which looks something like this:
use prometheus::{IntCounterVec, IntGauge, IntGaugeVec};

#[derive(Debug)]
pub struct Metrics {
    metric1: IntCounterVec,
    metric2: IntGauge,
    metric3: IntGaugeVec,
    metric4: IntCounterVec,
}
It is accompanied by its own initialization function, which creates an instance, registers all the metrics with the registry, and stores the instance in a global variable (static) so that anything that needs it can use it freely.
The topmost crate will have a similar structure and the same init() function, except that it also calls the init() function
of the dependency.
It would look something like this:
pub fn init(
    info1: Type1,
    info2: &Type2,
) -> MetricsRegistry {
    let metrics_registry = inner_metrics::init(info1, info2);
    let metrics = Metrics::new(&metrics_registry);
    INSTANCE
        .set(metrics)
        .expect("BUG: metrics instance already initialized!");
    metrics_registry
}
In this case, the inner, base metrics construct the registry, and we then use it to register our top-level metrics.
You could also do it the other way around, passing the registry downwards.
Furthermore, it is also handy to have something like a base_metrics() method on your top-level metrics, which returns
a reference to the InnerMetrics, so you don't have to fish for the global variable every time you need to record a base
metric from your crate.
You can use this model to create not just a chain of discrete metrics implementors, but also a tree, which might come in handy when you have multiple subcomponents that need to expose metrics.
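The whole arrangement, including the base_metrics() accessor, might be sketched like this. The sketch is std-only and rests on several assumptions: MetricsRegistry is a hypothetical stand-in that merely remembers metric names, atomics stand in for real counters, and the inner and top-level parts share one file, whereas in practice they would live in separate crates. The metric names and the init_inner() and metrics() functions are invented for illustration.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::OnceLock;

/// Hypothetical stand-in for a metrics registry: it only remembers names.
#[derive(Debug, Default)]
pub struct MetricsRegistry {
    pub names: Vec<&'static str>,
}

// ---- What would live in the dependency (inner) crate ----

#[derive(Debug, Default)]
pub struct InnerMetrics {
    connections_total: AtomicU64,
}

impl InnerMetrics {
    pub fn record_connection(&self) {
        self.connections_total.fetch_add(1, Ordering::Relaxed);
    }

    pub fn connections(&self) -> u64 {
        self.connections_total.load(Ordering::Relaxed)
    }
}

static INNER_INSTANCE: OnceLock<InnerMetrics> = OnceLock::new();

/// Constructs the registry, registers the inner metrics, stores the instance.
pub fn init_inner() -> MetricsRegistry {
    let mut registry = MetricsRegistry::default();
    registry.names.push("inner_connections_total");
    INNER_INSTANCE
        .set(InnerMetrics::default())
        .expect("BUG: inner metrics already initialized!");
    registry
}

// ---- What would live in the top-level crate ----

#[derive(Debug, Default)]
pub struct Metrics {
    jobs_total: AtomicU64,
}

impl Metrics {
    fn new(registry: &mut MetricsRegistry) -> Self {
        registry.names.push("app_jobs_total");
        Self::default()
    }

    pub fn record_job(&self) {
        self.jobs_total.fetch_add(1, Ordering::Relaxed);
    }

    pub fn jobs(&self) -> u64 {
        self.jobs_total.load(Ordering::Relaxed)
    }

    /// Shortcut to the dependency's metrics, so callers do not have to
    /// fish for the inner global themselves.
    pub fn base_metrics(&self) -> &'static InnerMetrics {
        INNER_INSTANCE
            .get()
            .expect("BUG: inner metrics not initialized!")
    }
}

static INSTANCE: OnceLock<Metrics> = OnceLock::new();

/// Initializes the inner metrics first, then registers the top-level ones.
pub fn init() -> MetricsRegistry {
    let mut registry = init_inner();
    let metrics = Metrics::new(&mut registry);
    INSTANCE
        .set(metrics)
        .expect("BUG: metrics instance already initialized!");
    registry
}

pub fn metrics() -> &'static Metrics {
    INSTANCE.get().expect("BUG: metrics not initialized!")
}
```

After a single call to init(), code in the top-level crate can record its own metrics with metrics().record_job() and base metrics with metrics().base_metrics().record_connection().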
Metrics backreading
Prometheus metrics are easily shared global state. This makes them a prime place to store information you plan to read in a different part of your application.
You may also want to have your application dynamically change its behavior based on some of its statistics.
Therefore, it can come in handy to read metrics you have already recorded.
Refer to the very first code example to see how to read your own metrics.
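For a single metric, reading the current value directly is often enough. The following std-only sketch shows behavior driven by a recorded value: an atomic integer stands in for an integer gauge, and QUEUE_DEPTH, MAX_QUEUE_DEPTH and the function names are hypothetical. With the prometheus crate, the read would be a get() call on the metric, as in the assertions of the counter and gauge example earlier.

```rust
use std::sync::atomic::{AtomicI64, Ordering};

/// Hypothetical stand-in for an integer gauge tracking queued jobs.
static QUEUE_DEPTH: AtomicI64 = AtomicI64::new(0);

/// Assumed limit for this sketch; not a value taken from any real system.
const MAX_QUEUE_DEPTH: i64 = 3;

/// Recording function for a job entering the queue.
pub fn record_job_enqueued() {
    QUEUE_DEPTH.fetch_add(1, Ordering::Relaxed);
}

/// Recording function for a job leaving the queue.
pub fn record_job_finished() {
    QUEUE_DEPTH.fetch_sub(1, Ordering::Relaxed);
}

/// Reads the metric back and refuses new work when the queue is full.
/// The check and the increment are separate steps, so under concurrency
/// the limit is approximate rather than strict.
pub fn try_enqueue() -> bool {
    if QUEUE_DEPTH.load(Ordering::Relaxed) >= MAX_QUEUE_DEPTH {
        return false;
    }
    record_job_enqueued();
    true
}
```

The same value that ends up on a dashboard also decides whether the application accepts more work, so the two can never disagree.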
Including unit name in gauges
If you have many metrics that report values with units, it can get messy figuring out which metric uses which unit,
so it is a good idea to add the unit to the name of the metric and perhaps also to your recording function. For
example, you might want to use suffixes such as _bytes or _watts.
The Task: Implementing a simple Hyper HTTP server to expose metrics
For this project, it is your task to create a simple barebones HTTP server with the Hyper framework,
which exposes the following metrics:
- http_requests_total -> the total number of HTTP requests made
- http_response_size_bytes -> the HTTP response sizes in bytes
These metrics should be available on 127.0.0.1:9898.
TIP: For hyper, use the features server, http1, tcp to get at least the minimal functionality out of Hyper to be able to write this application.
For tokio, you should be fine with macros, rt-multi-thread.
TIP 2: The register_counter!/register_gauge! and the opts! macros are quite handy for registering metrics with the default registry. The default registry is absolutely enough for this application.
If you want, you can also forgo hyper and use the HTTP server from the async chapter.
If you are feeling particularly brave, you can try running a local prometheus server and scraping the metrics with it:
https://prometheus.io/docs/prometheus/latest/getting_started/
When you run the application and either curl it, open the URL in your browser, or inspect the raw output via Prometheus or Grafana, it should look something like this:
# HELP http_requests_total Number of HTTP requests made.
# TYPE http_requests_total counter
http_requests_total 121
# HELP http_response_size_bytes The HTTP response sizes in bytes.
# TYPE http_response_size_bytes gauge
http_response_size_bytes 249
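Each metric in this output follows the same three-line shape: a HELP line, a TYPE line, and the sample itself. As a rough illustration of that shape only, here is a std-only sketch that renders one unlabelled metric; Kind and render_metric are hypothetical names, and in the task itself the TextEncoder from the first code example produces this text for you.

```rust
/// Metric kinds used in this sketch (hypothetical helper type).
#[derive(Debug, Clone, Copy)]
pub enum Kind {
    Counter,
    Gauge,
}

impl Kind {
    fn as_str(self) -> &'static str {
        match self {
            Kind::Counter => "counter",
            Kind::Gauge => "gauge",
        }
    }
}

/// Renders a single unlabelled metric as HELP, TYPE and sample lines.
pub fn render_metric(name: &str, help: &str, kind: Kind, value: u64) -> String {
    let kind = kind.as_str();
    format!("# HELP {name} {help}\n# TYPE {name} {kind}\n{name} {value}\n")
}
```

Calling render_metric once per metric and concatenating the results reproduces the layout of the listing above.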
End product
In the end, you should be left with a well-prepared project that has the following:
- documented code explaining your reasoning where it isn't self-evident
- optionally tests
- and an example or two where applicable
- clean git history that does not contain fix-ups, merge commits or malformed/misformatted commits
Your Rust code should be formatted by rustfmt / cargo fmt and should produce no
warnings when built. It should also work on stable Rust and follow the Braiins Standard.