The Four Golden Metrics of a Monitoring System
Recently, I was asked a question about the four golden signals (also known as golden metrics) of a monitoring system. I couldn’t quite remember, so I read some materials and decided to take some notes.
Origin
The four golden metrics of a monitoring system come from Chapter 6, "Monitoring Distributed Systems", of the book Site Reliability Engineering: How Google Runs Production Systems.
This chapter covers topics such as why monitoring is necessary, black-box and white-box monitoring, the four golden metrics, the long-tail problem, the appropriate resolution for measurements, and the long-term maintenance of a monitoring system. It touches on most of the important aspects of building a monitoring system.
My leader recommended this book to me in my second year at work. It's an excellent book, but it's been a long time and I've forgotten a lot of the content, so I've been rereading it over the past couple of days.
What are the Four Golden Metrics
The four golden metrics are as follows:
Latency
Latency refers to the time from when a request is sent until a response is received. This metric includes not only the response time of successful requests but also that of failed requests.
Traffic
Traffic refers to the number of requests the system receives or the amount of data it processes, usually measured in queries per second (QPS) or transactions per second (TPS).
Errors
Errors refer to the number or ratio of requests that fail to be processed successfully. Errors can be HTTP 5xx status codes, application exceptions, or other failure conditions.
Saturation
Saturation refers to how heavily system resources are used and how close they are to their limits. Resources include CPU, memory, disk I/O, network bandwidth, etc. It is generally expressed as a utilization percentage or as remaining capacity.
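To make these concrete, here is roughly how each signal might be queried in PromQL. This is only a sketch: it assumes a service exposing a http_request_duration_seconds histogram, an http_requests_total counter with a status label, and node_exporter metrics; all metric names here are assumptions, not prescriptions.

```
# Latency: P99 over the last 5 minutes (assumed Histogram metric)
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

# Traffic: requests per second
sum(rate(http_requests_total[5m]))

# Errors: share of requests answered with a 5xx status
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))

# Saturation: average CPU utilization across all cores (node_exporter)
1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m]))
```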
Precautions when Applying these Four Metrics
Latency
- Monitor the latency at different percentiles such as P50, P95, and P99 to have a more comprehensive understanding of system performance.
- Distinguish between the latency of successful and failed requests to diagnose problems more accurately.
- Pay attention to latency changes across different time periods, such as peak and off-peak hours.
- Create dedicated dashboards and alert rules for important APIs and services; a sketch of such queries follows this list.
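For example, splitting percentile latency by request outcome might look like the following. This is a hedged sketch: the status label on the histogram is an assumption about how the service is instrumented.

```
# P95 latency of successful requests
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{status!~"5.."}[5m])) by (le))

# P95 latency of failed requests, tracked separately
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{status=~"5.."}[5m])) by (le))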
Traffic
- Monitor the number of requests per second and its trend to understand the system load.
- Monitor data throughput, such as the number of bytes processed per second (see the sketch after this list).
- Combine with business metrics (such as the number of active users, transaction volume, number of artifact downloads, etc.) to better understand the relationship between traffic and business activities.
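A minimal sketch of both forms, reusing the assumed http_requests_total counter and adding a hypothetical http_response_size_bytes histogram:

```
# Requests per second, broken down by handler
sum(rate(http_requests_total[5m])) by (handler)

# Bytes served per second (from the assumed histogram's running sum)
sum(rate(http_response_size_bytes_sum[5m]))
```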
Errors
- Monitor both the overall error rate and specific types of error rates (such as 4xx and 5xx errors) separately.
- Monitor the change of errors over time to detect abnormal fluctuations.
- When necessary, combine application logs and load-balancer logs to trace and diagnose the root cause of errors. Example queries for the first two points follow this list.
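Sticking with the assumed status label on http_requests_total, the overall and per-class error rates might be queried like this:

```
# Overall error ratio: 5xx responses over all responses
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))

# 4xx and 5xx rates as separate series, to tell client errors from server errors
sum(rate(http_requests_total{status=~"4.."}[5m]))
sum(rate(http_requests_total{status=~"5.."}[5m]))
```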
Saturation
- Monitor key resources such as CPU utilization, memory utilization, disk I/O utilization, and network bandwidth utilization.
- Set alert thresholds to get timely warnings when resources approach exhaustion.
- Monitor the trend of system resource usage to do capacity planning ahead of time; see the sketch after this list.
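With node_exporter, the usual saturation queries look roughly like this (the mountpoint filter is an assumption about the host layout):

```
# Memory utilization
1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes

# Fraction of disk space still available on the root filesystem
node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}

# Network receive throughput in bytes per second, per host
sum(rate(node_network_receive_bytes_total[5m])) by (instance)
```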
Focus on the Distribution of Metrics Instead of the Average Value
It is important to focus on the distribution of monitoring metrics rather than just the average value. This is because the average value often cannot fully reflect the actual performance of the system and the user experience, especially in the presence of high variability or anomalies.
The main reasons are as follows:
- The average value conceals important details
The average value represents the central tendency of all data points, but it cannot reflect the distribution of the data. The following aspects illustrate the limitations of the average value:
- **Ignores volatility**: The average value cannot reflect the volatility of the data. For example, if the response time of a system is fast most of the time but occasionally very slow, the average response time may seem good, but the actual user experience may be poor.
- **Conceals anomalies**: If there are some extreme values (outliers), they may significantly increase or decrease the average value, thus concealing the actual situation of most data points.
- The distribution provides a more comprehensive perspective
Focusing on the distribution of monitoring metrics can help us understand the performance and behavior of the system more comprehensively:
- **Percentiles**: By looking at different percentiles (such as P50, P90, P95, P99), we can better understand the actual experience of most users. For example, P90 means that 90% of the requests are faster than this value, and 10% are slower. P99 means that 99% of the requests are faster than this value, and 1% are slower.
- **Histograms and quantile plots**: These charts show the distribution of the data, helping to identify performance bottlenecks and outliers. For example, Prometheus supports the Histogram and Summary metric types for recording and displaying distributions.
- Example illustration
Response time: Suppose a web service handles 100 requests with the following response times:
- 95 requests at 50 ms, 5 requests at 500 ms
Calculate the average value: (95 × 50 + 5 × 500) / 100 = 72.5 ms
From the average value, the response time seems to be about 72.5 ms. But in fact, 95% of requests complete in 50 ms and 5% take ten times longer, and that slow tail is exactly what users feel.
If we look at the percentiles:
- P50 (median): 50 ms
- P90: 50 ms
- P95: 50 ms
- P99: 500 ms
From these percentiles, we can see that the vast majority of requests have a response time of 50 ms, and only a few requests are very slow.
- Tools and methods in practice
In practice, using appropriate tools and methods can help us better analyze the distribution of monitoring metrics:
- **Prometheus**: Supports the Histogram and Summary types, which can be used to record and analyze the distribution of time-series data. For example, percentiles can be computed from a Histogram in PromQL with histogram_quantile (a Summary, by contrast, exposes precomputed quantiles directly via a quantile label):

```
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
```
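For contrast, the corresponding average latency comes from the histogram's _sum and _count series, and it is exactly the number that hides the tail discussed above:

```
# Average latency over 5m — can look healthy while P99 is terrible
rate(http_request_duration_seconds_sum[5m]) / rate(http_request_duration_seconds_count[5m])
```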
- **Grafana**: When combined with Prometheus, Grafana can visualize different percentiles, histograms, heatmaps, and so on. For example, plotting several latency percentiles side by side with the Prometheus data source:

```
histogram_quantile(0.50, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
```