All You Need to Know About Prometheus, for Beginner

The Prometheus is a popular open-source monitoring and alerting toolkit used for system and service monitoring. With its flexible architecture and powerful query language, it's a great tool for beginners to gain insights into their infrastructure's performance and troubleshoot issues efficiently.

All You Need to Know About Prometheus, for Beginner
Discover Prometheus: Open-source monitoring tool for Cloud-Native environments. Maximize observability with efficient metrics & pull mechanism.

Prometheus is an open-source monitoring and alerting system that is widely used for collecting and recording metrics from various systems and applications. It was originally developed at SoundCloud and later became a part of the Cloud Native Computing Foundation (CNCF).

Prometheus follows a pull-based model where it regularly scrapes metrics from designated targets such as servers, services, and applications. It stores these metrics in a time-series database, allowing for efficient querying and analysis of historical data. Prometheus provides a flexible query language called PromQL (Prometheus Query Language) to retrieve and manipulate metrics.

According to the Cloud Native Computing Foundation's survey, Prometheus is the most widely used monitoring and observability tool in the cloud-native ecosystem, with 63% of respondents using it.
source: Cloud native computing foundation

Prometheus offers a rich set of features for monitoring and alerting. It supports real-time monitoring, time-based alerting, multi-dimensional data collection, and data visualization. It also has a robust ecosystem with various integrations and exporters available for different platforms and services.

Prometheus is commonly used in cloud-native environments and is often paired with other tools like Grafana for data visualization and alert managers for notification and alerting workflows. It provides valuable insights into system performance, resource utilization, and application behavior, helping organizations monitor and troubleshoot their infrastructure effectively.

In this guide, you will gain some insights into the importance of Prometheus and how to maximize its value for full-stack observability.

What is Prometheus Metrics?

Prometheus metrics are quantifiable data points used to monitor cloud infrastructure, providing insight into system health. They can be visualized on dashboards or continuously monitored to trigger notifications. Prometheus collects metrics from targets to determine the root cause of any issues.

How does Prometheus collect those metrics from targets?

Prometheus collects metrics using various system components. These components are optional including

Target Endpoints and exporters

Prometheus can scrape metrics from target endpoints and exporters. This involves querying HTTP endpoints on the target, which respond with metric data. This is a pull-based approach where the Prometheus server fetches data from targets regularly.

Exporters are applications that extract metrics from other systems and expose them as HTTP endpoints. Prometheus can scrape these endpoints to collect metrics from external systems.

Monitoring your application

Prometheus also provides client libraries that can be integrated into applications to expose their internal metrics. These libraries allow Prometheus to collect data directly from the application's code. This approach is push-based, where the application pushes metrics to the Prometheus server.

Push gateway for short-lived data import jobs

The Push Gateway is a service that allows short-lived jobs to push their metrics to Prometheus. This is useful when monitoring batch jobs, cron jobs, or other short-lived processes. The Push Gateway collects metrics from these jobs and presents them to Prometheus for scraping.

Exporters that send data to other services (e.g. HAProxy, StatsD, Graphite)

Exporters are applications that extract metrics from other systems and expose them as HTTP endpoints. Prometheus can scrape these endpoints to collect metrics from external systems. Some of the popular exporters include:

  • HAProxy exporter: collects metrics from HAProxy load balancer
  • StatsD exporter: collects metrics from the StatsD daemon
  • Graphite exporter: collects metrics from the Graphite monitoring system

AlertManager to handle alerts

AlertManager is a Prometheus component that handles alerts. It receives alerts from Prometheus and sends notifications via various channels such as email, Slack, PagerDuty, etc. AlertManager can group related alerts, suppress notifications for specific periods, and also route alerts to specific receivers based on their labels.

Pull Mechanism - Unique advantage of Prometheus

Prometheus uses a pull-based mechanism to collect metrics from targets. This approach has several unique advantages, which are discussed below:

Flexible Target Discovery

Prometheus allows flexible target discovery, which means it can dynamically discover new targets and start scraping them. This allows for easy scaling and management of the monitoring infrastructure.

Reduced Network Load

Since Prometheus initiates the data collection by pulling data from targets, it reduces the network load compared to push-based monitoring systems where targets continuously send data to the monitoring system.

Configurable Scrape Interval

Prometheus allows configurable scrape intervals for each target, which means it can scrape metrics from different targets at different intervals depending on the criticality of the target. This also helps in reducing network load.

Rich Querying Capabilities

Prometheus has rich querying capabilities that allow users to query metrics data using a powerful query language called PromQL. This enables users to slice and dice the data and gain valuable insights into the system's health and performance.

Automatic Labeling and Relabeling

Prometheus automatically adds and modifies labels for scraped metrics based on target metadata, allowing easy and efficient data aggregation and filtering.

Overall, the pull-based mechanism is one of the unique advantages of Prometheus, providing flexibility, reduced network load, rich querying capabilities, and more.

Prometheus Metrics Types

source: logz.io

Counter

A counter is a cumulative metric that represents a single monotonically increasing counter whose value can only increase or reset to zero on a restart. It is typically used to count the number of events that occur, such as requests served or tasks completed.

Gauge

A gauge is a metric that represents a single numerical value that can go up or down over time. It is typically used to measure a value at a particular point in time, such as the current number of active requests or the amount of free memory available.

Histogram

A histogram is a metric that samples observations and counts them in configurable buckets. It also provides a sum of all observed values. It is typically used to measure the distribution of values over time, such as the response time of a server.

Summary

A summary is similar to a histogram but calculates configurable quantiles over sliding time windows. It is typically used to measure the distribution of values over time, such as the latency of a service.

Configuring Prometheus - Example YAML Configuration

Configuring Prometheus involves defining the targets from which Prometheus will scrape metrics, setting up alerting rules, and defining global configurations. Here is an example YAML configuration for Prometheus:


global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'my-service'
    static_configs:
      - targets: ['my-service:8080']

  - job_name: 'my-other-service'
    static_configs:
      - targets: ['my-other-service:8080']

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']

Let's break down each section of the configuration file:

Global

The global section contains global configurations for Prometheus. In this example, we have set the scrape_interval and evaluation_interval to 15 seconds.

Scrape Configs

The scrape_configs section contains configurations for scraping metrics from targets. In this example, we have defined three jobs: prometheus, my-service, and my-other-service.

For each job, we have specified job_name the targets to scrape metrics from using the targets parameter. The static_configs parameter indicates that we are using static configuration, which means the targets are explicitly defined in the configuration file.

Alerting

The alerting section contains configurations for alerting. In this example, we have defined the Alertmanager targets using the targets parameter under static_configs.

Overall, this configuration file demonstrates a simple example of how to configure Prometheus to scrape metrics from multiple targets and set up alerting using Alertmanager. However, configurations can become much more complex depending on the needs of the system being monitored.

Prometheus Data Storage - Where does Prometheus store the data?

Prometheus stores its data in a time series database. By default, it uses its own embedded database called the Prometheus TSDB (Time Series Database), which is designed for high write throughput and efficient storage of time series data. To minimize storage space, the Prometheus TSDB stores all data on a local disk in a compressed format.

However, the technical terms commonly used when referring to Prometheus storage, are local and remote. For new users who may be unfamiliar with these terms, understanding them is important to avoid misunderstandings.

Important Terms

Here are some important terms to understand when it comes to Prometheus data storage:

  • Time series: A sequence of timestamped values that represents the state of a system or process over time. In Prometheus, each time series is identified by a unique combination of metric names and a set of labels.
  • Metric: A numeric time series that represents a specific aspect of a system or process. For example, a metric could represent the number of HTTP requests served per second.
  • Label: A key-value pair that provides metadata for a metric. Labels are used to identify and group time series.
  • Chunk: A compressed block of time series data that is stored in the chunks directory of the Prometheus data directory.
  • Index: An index of all the time series and labels in the Prometheus database, stored in the index directory.
  • Write-ahead log (WAL): A log of all incoming samples that have not yet been written to the database. The WAL is stored in the wal directory and is used to ensure data durability in case of system failures.
  • Tombstones: Markers in the database that indicate that a time series or set of labels has been deleted.

Conclusion

Prometheus has emerged as the leading open-source monitoring tool, widely adopted by cloud-native organizations. Its pull-based mechanism offers unique advantages such as flexible target discovery, reduced network load, configurable scrape intervals, rich querying capabilities, and automatic labeling. Prometheus metrics types, including counters, gauges, histograms, and summaries, provide valuable insights into system health. Configuring Prometheus involves defining scrape configurations and alerting rules, while data is stored in the Prometheus TSDB. By leveraging Prometheus, organizations can achieve comprehensive full-stack observability and effectively monitor their cloud infrastructure and applications.


Hi! I am Safoor Safdar a Senior SRE. Read More. Don't hesitate to reach out! You can find me on Linkedin, or simply drop me an email at me@safoorsafdar.com