Metric Collection: Push vs. Pull Models - When to Use Which?

Metric Collection Approaches: The Core Differences

Collecting metrics is crucial for understanding the health and performance of our systems. There are two primary methods for obtaining these metrics: Push and Pull. I've used both models extensively in my own projects and in consulting roles. Which one we choose depends on our infrastructure's structure, scale, and the specific metrics we want to collect.

In the Push model, the applications or services that produce metrics send them to the collecting system (e.g., a monitoring service) on their own schedule; the collector does not query them. In the Pull model, the roles are reversed: the collecting service periodically polls the target systems and requests the metrics. This approach is quite common, especially in distributed systems and microservice architectures.
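The difference comes down to who initiates the transfer. The following is a minimal in-process sketch of that idea, not a real monitoring client; the class names PullCollector and PushCollector and the app_metrics function are made up for illustration.

```python
from typing import Callable, Dict, List, Tuple

Metrics = Dict[str, float]
Sample = Tuple[str, str, float]


class PullCollector:
    """Pull: the collector holds the target list and asks each target for data."""

    def __init__(self) -> None:
        self.targets: Dict[str, Callable[[], Metrics]] = {}
        self.samples: List[Sample] = []

    def register(self, name: str, read_metrics: Callable[[], Metrics]) -> None:
        self.targets[name] = read_metrics

    def scrape_all(self) -> None:
        # The collector decides when each target is read.
        for name, read_metrics in self.targets.items():
            for metric, value in read_metrics().items():
                self.samples.append((name, metric, value))


class PushCollector:
    """Push: the collector only listens; producers decide when to send."""

    def __init__(self) -> None:
        self.samples: List[Sample] = []

    def receive(self, source: str, metrics: Metrics) -> None:
        for metric, value in metrics.items():
            self.samples.append((source, metric, value))


def app_metrics() -> Metrics:
    # Stand-in for an application's current metric values.
    return {"requests_total": 42.0}


pull = PullCollector()
pull.register("app-1", app_metrics)
pull.scrape_all()                      # collector initiates

push = PushCollector()
push.receive("app-1", app_metrics())   # producer initiates
```

Both collectors end up with the same samples; what changes is which side needs to know about, and be able to reach, the other.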

Advantages and Disadvantages of the Push Model

With the Push model, the application or service generating the metrics sends them to a central collection point at its own intervals or when specific events occur. This is often seen in "agent-based" solutions. For example, an application might push its metrics to its own logs or a specific metric database (like InfluxDB with the Telegraf agent).

The biggest advantage of the Push model is that the metric collector doesn't need to constantly query the metric producers. Each producer decides when to send, so it can use its own resources more efficiently and keep its network traffic under control. Collecting metrics from systems behind firewalls or NAT also becomes easier, because the connection is opened outbound by the producer. The trade-off is that every producer sends independently, so the central collection system must accept and manage all of these inbound connections.

ℹ️ Use Cases for the Push Model

The Push model is particularly beneficial in the following scenarios:

Event-driven systems: Sending metrics when a specific event occurs.

Environments with network constraints: Collecting metrics from systems behind firewalls or with difficult access.

Short-lived services: For containers or functions that start and finish within seconds.

Edge devices or IoT: When collecting metrics from resource-constrained devices.
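For the short-lived case in particular, a common pattern is to measure the job and push once when it ends. Here is a minimal sketch using a context manager; pushed_job_metrics and the send callback are hypothetical names, and send stands in for whatever transport you actually use (an HTTP POST, a UDP packet, and so on).

```python
import time
from contextlib import contextmanager
from typing import Callable, Dict, Iterator, List, Tuple

Sender = Callable[[str, Dict[str, object]], None]


@contextmanager
def pushed_job_metrics(job_name: str, send: Sender) -> Iterator[Dict[str, object]]:
    """Time a job and push its metrics exactly once, even if the job fails."""
    fields: Dict[str, object] = {"status": "success"}
    start = time.perf_counter()
    try:
        yield fields                   # the job can add its own fields here
    except Exception:
        fields["status"] = "error"
        raise                          # do not swallow the job's failure
    finally:
        fields["duration_seconds"] = time.perf_counter() - start
        send(job_name, fields)


# Usage with an in-memory sender instead of a real network call.
sent: List[Tuple[str, Dict[str, object]]] = []

with pushed_job_metrics("invoice_job", lambda name, f: sent.append((name, dict(f)))) as fields:
    fields["records_processed"] = 120
```

Because the push happens in the finally block, the job reports its outcome before the process exits, which is exactly what a periodic scrape would likely miss.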

Advantages and Disadvantages of the Pull Model

In the Pull model, the main collecting service periodically polls the services that produce and expose metrics. Popular monitoring tools like Prometheus use this model. Prometheus collects metrics by regularly querying configured targets. The biggest advantage of this model is having a central point of control. Which metrics to collect and how often can be managed from a single location.

A disadvantage of the Pull model is that the metric collecting service must be able to reach all target systems. If a target system is behind a firewall or unreachable, it's impossible to pull its metrics. Furthermore, when there are a large number of target systems, the metric collector can experience significant load. However, this load is generally manageable, and tools like Prometheus are quite successful in terms of scalability.

💡 Use Cases for the Pull Model

The Pull model is preferred in the following situations:

Microservice architectures: Each service exposes its own metric endpoint, and a central agent pulls them.

Stable and continuously running services: Infrastructure where metrics can be regularly pulled.

Regular, predictable sampling: Pulling metrics at a fixed interval yields evenly spaced data that is never older than one scrape interval.

Centralized configuration: Managing metric collection settings from a single point.

The Pull Model: Concrete Examples with Prometheus

The Pull model is very popular, especially in modern, distributed systems and microservice architectures. The most well-known example of this model is undoubtedly Prometheus. Prometheus collects metrics by querying the /metrics endpoint over HTTP. These metrics are typically served in Prometheus's own text-based format or the OpenMetrics format.
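To make the text-based format less abstract, here is a simplified sketch that renders one counter by hand. It is only an illustration: render_counter is a hypothetical helper, it skips the label-value escaping a real client performs, and in practice a client library produces this output for you.

```python
from typing import Dict, Tuple

LabelSet = Tuple[Tuple[str, str], ...]


def render_counter(name: str, help_text: str, samples: Dict[LabelSet, float]) -> str:
    """Render one counter family in the line-oriented text exposition style."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in samples.items():
        label_str = ",".join(f'{key}="{val}"' for key, val in labels)
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines) + "\n"


output = render_counter(
    "http_requests_total",
    "Total number of HTTP requests",
    {(("method", "GET"), ("endpoint", "/")): 3.0},
)
print(output)
```

Each sample is one line: the metric name, its labels in braces, and the current value. The collector parses these lines on every scrape and attaches its own timestamp.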

Let's go through an example. Suppose we have a FastAPI application and we want to collect some basic metrics from it. We can use the prometheus_client library for this.

import asyncio
import random
import time

from fastapi import FastAPI
from prometheus_client import CONTENT_TYPE_LATEST, Counter, Gauge, Histogram, generate_latest
from starlette.responses import Response

app = FastAPI()

# Define the metrics
REQUEST_COUNT = Counter('http_requests_total', 'Total number of HTTP requests', ['method', 'endpoint', 'status_code'])
REQUEST_LATENCY = Histogram('http_request_duration_seconds', 'HTTP request latency in seconds', ['method', 'endpoint'])
ACTIVE_USERS = Gauge('active_users', 'Number of active users')

@app.middleware("http")
async def add_metrics(request, call_next):
    # perf_counter() is monotonic, so the duration is not affected by clock adjustments
    start_time = time.perf_counter()
    method = request.method
    # Note: raw paths as label values can create many time series if URLs contain IDs
    endpoint = request.url.path
    response = await call_next(request)
    status_code = response.status_code
    process_time = time.perf_counter() - start_time

    REQUEST_COUNT.labels(method=method, endpoint=endpoint, status_code=status_code).inc()
    REQUEST_LATENCY.labels(method=method, endpoint=endpoint).observe(process_time)

    # Simulate a random number of active users (always set, so the gauge never goes negative)
    ACTIVE_USERS.set(random.randint(10, 100))

    return response

@app.get("/metrics")
async def metrics():
    # Serve the current metric values with the content type the client library defines
    return Response(generate_latest(), media_type=CONTENT_TYPE_LATEST)

@app.get("/")
async def homepage():
    return {"message": "Hello, World!"}

@app.get("/slow")
async def slow_page():
    # asyncio.sleep yields to the event loop; time.sleep would block every other request
    await asyncio.sleep(random.uniform(0.5, 2.0))
    return {"message": "This is a slow page."}

# Example usage:
# uvicorn main:app --reload
# Configure Prometheus server to scrape this application.

This FastAPI application will monitor every incoming request and generate metrics like REQUEST_COUNT, REQUEST_LATENCY, and ACTIVE_USERS. When you configure the Prometheus server to scrape the /metrics endpoint of this application at regular intervals, the pull model is in action.

In Prometheus's scrape_configs section, we can define this target like this:

scrape_configs:
  - job_name: 'my_fastapi_app'
    static_configs:
      - targets: ['localhost:8000'] # Where your FastAPI application is running

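With this configuration, Prometheus will fetch metrics from http://localhost:8000/metrics once per scrape interval. The interval comes from the scrape_interval setting, which defaults to 1 minute; many setups, including the sample configuration shipped with Prometheus, set it to 15 seconds. This provides centralized control and regular data collection.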

⚠️ Challenges of the Pull Model

In the Pull model, Prometheus's inability to reach target services is the biggest issue. If the localhost:8000 address is blocked by a firewall or the service is down, Prometheus cannot collect metrics from that service. In such cases, we see incomplete or outdated data on our monitoring dashboards. Setting up alert mechanisms correctly for such situations is vital.
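Prometheus itself records a synthetic up series for every scrape (1 on success, 0 on failure), and alerting on that series is the usual way to catch unreachable targets. The sketch below only imitates the idea in plain Python; scrape_targets, targets_down, and the two fetch functions are hypothetical stand-ins for real HTTP requests.

```python
from typing import Callable, Dict, List


def scrape_targets(targets: Dict[str, Callable[[], str]]) -> Dict[str, int]:
    """Return an up value per target: 1 if the fetch succeeded, 0 if it raised."""
    up: Dict[str, int] = {}
    for name, fetch in targets.items():
        try:
            fetch()
            up[name] = 1
        except Exception:
            up[name] = 0
    return up


def targets_down(up: Dict[str, int]) -> List[str]:
    """List the targets whose last scrape failed."""
    return sorted(name for name, value in up.items() if value == 0)


def healthy_fetch() -> str:
    return "http_requests_total 3\n"


def blocked_fetch() -> str:
    # Simulates a firewall block or a crashed service.
    raise ConnectionError("connection refused")


up = scrape_targets({"api": healthy_fetch, "worker": blocked_fetch})
print(targets_down(up))  # prints ['worker']
```

The useful property is that a failed scrape is itself a data point, so missing data turns into something you can alert on rather than a silent gap on a dashboard.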

The Push Model: Sending Metrics to the Center

The Push model operates in the opposite way to the Pull model. The service or agent that generates metrics actively sends them to a central collection point. This model is more useful in situations where the network topology is complex, firewall rules are strict, or short-lived processes need to produce metrics.

For example, consider an application running inside a Docker container. This container might have a short lifespan, and it might not always be possible for Prometheus to query it directly. In such cases, an agent within the container can collect metrics and send them to a more persistent database (like InfluxDB or Graphite).

Another common use case is integrating metrics with a central log aggregation system. We can capture specific error patterns in logs and increment metrics corresponding to these patterns.

import random
import time

import requests

# The endpoint where we will send metrics (here, the InfluxDB 1.x HTTP write API)
METRIC_ENDPOINT = "http://your-metric-collector:8086/write?db=mydb"

def format_field(value):
    # Line protocol requires string field values to be double-quoted and escaped
    if isinstance(value, str):
        escaped = value.replace("\\", "\\\\").replace('"', '\\"')
        return f'"{escaped}"'
    return str(value)

def send_metric(measurement, tags, fields):
    timestamp = time.time_ns() # Nanosecond precision for InfluxDB
    tag_str = ",".join(f"{k}={v}" for k, v in tags.items())
    field_str = ",".join(f"{k}={format_field(v)}" for k, v in fields.items())
    payload = f"{measurement},{tag_str} {field_str} {timestamp}"

    try:
        response = requests.post(METRIC_ENDPOINT, data=payload, timeout=5)
        if response.status_code != 204: # InfluxDB write success is 204 No Content
            print(f"Error sending metric: {response.status_code} - {response.text}")
    except requests.exceptions.RequestException as e:
        # A failed push must not crash the application; here we only log it
        print(f"Request failed: {e}")

# Application logic simulation
def process_request(request_id):
    # Tags are indexed, so keep them low-cardinality; the unique request_id goes into a field
    tags = {"service": "my_app"}
    start_time = time.perf_counter()
    try:
        # Simulate processing
        time.sleep(random.uniform(0.1, 1.5))
        if random.random() < 0.1: # 10% error rate
            raise RuntimeError("Internal processing error")

        fields = {"duration_ms": (time.perf_counter() - start_time) * 1000, "status": "success", "request_id": request_id}
        send_metric("request_latency", tags, fields)
        print(f"Request {request_id} processed successfully.")

    except Exception as e:
        fields = {"duration_ms": (time.perf_counter() - start_time) * 1000, "status": "error", "error_message": str(e), "request_id": request_id}
        send_metric("request_latency", tags, fields)
        print(f"Request {request_id} failed: {e}")

# Main loop
if __name__ == "__main__":
    for i in range(10): # Simulate 10 requests
        process_request(f"req_{i}")
        time.sleep(random.uniform(0.5, 2.0))

In this code, the process_request function measures each request and then calls send_metric to push the duration and the outcome (success/failure) to a central endpoint. That endpoint could be InfluxDB itself or an agent such as Telegraf that forwards the data to an InfluxDB database.

💡 Flexibility of the Push Model

The Push model offers great flexibility, especially in dynamic environments and situations with network constraints. When you start or stop a container, the task of sending metrics automatically begins or ends. This reduces the need for manual configuration.

Why Are We Collecting So Many Metrics?

The primary goal of metric collection is to understand our systems' behavior, detect problems, and optimize their performance. Some critical metrics I've encountered in production environments include:

CPU Usage: The processor load of servers or containers. High CPU usage can be a sign of performance issues or insufficient resources.
Memory Usage: How much RAM applications are consuming. Memory leaks or insufficient RAM can seriously affect system stability.
Disk I/O: Disk read/write speeds. Slow disks can slow down database or file system operations, reducing overall performance.
Network Traffic: The size and number of incoming and outgoing network packets. Network bottlenecks or abnormal traffic patterns can be detected.
Error Rates: The number of errors within the application or in HTTP requests (e.g., 5xx HTTP errors).
Latency: How long it takes for requests to be responded to. High latency negatively impacts user experience.

Collecting these metrics continuously, and not only when something is broken, shows us what the system's "normal" behavior looks like. This "baseline" information is invaluable for detecting anomalies (e.g., 50% higher CPU usage than normal).
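As a rough illustration of the baseline idea, the sketch below compares a new reading against the mean of recent history. The is_anomalous function, the sample values, and the 50% threshold are assumptions chosen to match the example above; real systems usually account for time of day, seasonality, and variance as well.

```python
from statistics import mean
from typing import Sequence


def is_anomalous(history: Sequence[float], current: float, threshold: float = 0.5) -> bool:
    """Flag a reading that exceeds the historical mean by more than `threshold` (0.5 = 50%)."""
    if not history:
        return False  # no baseline yet, so nothing to compare against
    baseline = mean(history)
    return current > baseline * (1 + threshold)


cpu_history = [40.0, 42.0, 38.0, 40.0]   # percent; the mean is 40.0

print(is_anomalous(cpu_history, 55.0))   # prints False (below 60.0)
print(is_anomalous(cpu_history, 61.0))   # prints True (above 60.0)
```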

When to Use Which Model?

Both models have their use cases. Some factors to consider when making a choice include:

Infrastructure Structure: Microservices or monolith? Containers or virtual machines? How complex is the network structure?
Metric Producer Characteristics: Short-lived or continuously running? Are network accesses restricted? Can it expose its own metric endpoint?
Scalability Needs: How many services and metrics will be collected? What will be the load on the central collector?
Network Security and Accessibility: Situations like firewall rules, services behind NAT.
Operational Complexity: Which model is easier to manage?

⚠️ Hybrid Approach

In the real world, we often see hybrid approaches that combine both models. For example, we might use the Pull model (with Prometheus) for continuously running services, while using the Push model (with Fluentd, Logstash, or custom agents) for short-lived or network-constrained services. This allows us to leverage the advantages of both models.
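The per-workload decision can be written down as a simple rule. This is a deliberately naive sketch of the reasoning in this section, not a complete decision procedure; the Workload fields and the choose_model function are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    name: str
    long_running: bool              # does it live long enough to be scraped?
    reachable_from_collector: bool  # can the collector open a connection to it?


def choose_model(workload: Workload) -> str:
    """Pull needs a target that stays up and is reachable; otherwise fall back to push."""
    if workload.long_running and workload.reachable_from_collector:
        return "pull"
    return "push"


workloads = [
    Workload("web-api", long_running=True, reachable_from_collector=True),
    Workload("hourly-batch", long_running=False, reachable_from_collector=True),
    Workload("edge-sensor", long_running=True, reachable_from_collector=False),
]

plan = {w.name: choose_model(w) for w in workloads}
print(plan)
```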

Examples from My Own Experience

While working on a production ERP system, we needed to monitor both the main application (which was monolithic) and various background processors. For the main application, we used the Pull model with Prometheus. We collected basic metrics like CPU, memory, request count, and latency through the application's /metrics endpoint.

However, we had background processes that ran periodically (e.g., hourly invoice generation, daily reporting). These processors were sometimes one-off tasks, and sometimes they finished within a few minutes. Some of them also ran behind a firewall. For these processors, we opted for the Push model. During its execution, each processor sent the metrics it generated (processing time, counts of successful and failed records, etc.) directly to InfluxDB. This way, we could monitor the health of the main application in real time and analyze the performance of the background processors in detail. This hybrid approach played a critical role in achieving our 99.9% uptime goal.

In another scenario, to track our mobile application's performance, we collected crash reports and performance metrics (screen load times, network request times) directly from the application itself. These metrics were pushed from the mobile devices to a central service, because our servers cannot open connections to mobile devices to pull from them, and the devices' network connections are unreliable. In such cases, the Push model becomes almost the only option for data collection.
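When the connection is unreliable, the pushing side usually needs a buffer so that a failed send does not lose data. Below is a minimal single-threaded sketch of that idea; BufferedPusher and the send_batch callback are hypothetical, and the bounded buffer silently drops the oldest entries once it is full.

```python
from collections import deque
from typing import Callable, Deque, List


class BufferedPusher:
    """Keep metric lines locally and only discard them after a successful send."""

    def __init__(self, send_batch: Callable[[List[str]], bool], max_buffer: int = 1000) -> None:
        self._send_batch = send_batch
        self._buffer: Deque[str] = deque(maxlen=max_buffer)  # oldest entries drop first

    def record(self, line: str) -> None:
        self._buffer.append(line)

    def flush(self) -> bool:
        """Try to send everything buffered; keep the data if the send fails."""
        if not self._buffer:
            return True
        if self._send_batch(list(self._buffer)):
            self._buffer.clear()
            return True
        return False

    def pending(self) -> int:
        return len(self._buffer)


# Simulated network: the first attempt fails, the second succeeds.
attempts = {"count": 0}
delivered: List[str] = []


def flaky_send(batch: List[str]) -> bool:
    attempts["count"] += 1
    if attempts["count"] == 1:
        return False
    delivered.extend(batch)
    return True


pusher = BufferedPusher(flaky_send)
pusher.record("screen_load_ms=320")
pusher.record("screen_load_ms=410")
first_try = pusher.flush()    # fails, data stays buffered
second_try = pusher.flush()   # succeeds, buffer is emptied
```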

When is the Pull Model More Advantageous?

Ease of Service Discovery: If your services have a service discovery mechanism, Prometheus can automatically find them and pull metrics. This is a great convenience, especially in dynamic environments (like Kubernetes).
Centralized Control: Settings like metric collection frequency and format are managed from a single location.
Network Load Distribution: The load of pulling metrics falls on the metric collector (Prometheus). Metric-producing services do not have additional workload (other than exposing an endpoint).
More Reliable Data: The metric collector (Prometheus) regularly checks if target services are running. If a service doesn't respond, the failed scrape is recorded (the target's up metric becomes 0), so the problem is visible within one scrape interval.

When is the Push Model More Advantageous?

Systems Behind Firewalls: When the collection point cannot open a connection to the metric producer, but the producer can make outbound connections to the collection point.
Short-Lived Workloads: When metrics need to be collected from a script or a short-running container.
Event-Driven Metrics: For sending metrics after a specific event.
Low Bandwidth Environments: When the metric producer needs to send aggregated data to the collection point at specific intervals.
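For the low-bandwidth case, the producer can count events locally and send one small summary per interval instead of one message per event. A minimal sketch, assuming a hypothetical EventAggregator class and leaving the actual sending out:

```python
from collections import Counter
from typing import Dict


class EventAggregator:
    """Accumulate event counts locally and hand them over in one payload."""

    def __init__(self) -> None:
        self._counts: Counter = Counter()

    def incr(self, name: str, amount: int = 1) -> None:
        self._counts[name] += amount

    def drain(self) -> Dict[str, int]:
        """Return the counts gathered since the last drain and reset them."""
        snapshot = dict(self._counts)
        self._counts.clear()
        return snapshot


agg = EventAggregator()
for _ in range(500):
    agg.incr("page_views")
agg.incr("errors", 3)

payload = agg.drain()   # one payload instead of 501 separate pushes
print(payload)
```

A timer or the end of a work cycle would call drain() and push the resulting payload to the collection point.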

Visualizing and Analyzing Metrics

Collecting metrics is just the first step. The real value lies in making these metrics meaningful. Metrics collected with Prometheus are typically used in conjunction with visualization tools like Grafana. Grafana allows us to create rich and interactive dashboards with metrics from Prometheus.

A dashboard typically includes the following panels:

General Status Panel: Shows basic system metrics like CPU, memory, and disk usage.
Application Performance Panel: Contains application-specific metrics like request count, error rates, and latency.
Error Analysis Panel: Graphs showing error types and their frequencies.
Capacity Planning Panel: Shows resource usage trends to help predict future needs.

Consider a "request_latency" histogram graph we created in Grafana. This graph shows how long requests took to complete within a specific time frame. For example, the 50th percentile (p50) indicates that 50% of requests were completed within this duration. The 99th percentile (p99) shows how long the slowest 1% of requests took. These metrics are critical for understanding user experience.
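Percentiles from a histogram are estimates, because only the bucket counts are stored. The sketch below shows the general idea of linear interpolation inside a bucket, which is also the approach behind PromQL's histogram_quantile; estimate_quantile and the bucket values are made up for illustration, and the function is a simplification, not a reimplementation of Prometheus.

```python
import math
from typing import List, Tuple


def estimate_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate a quantile from (upper_bound, cumulative_count) pairs.

    Buckets must be sorted by upper bound and end with a +Inf bucket.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                return prev_bound   # cannot interpolate inside the +Inf bucket
            if count == prev_count:
                return bound        # empty bucket, nothing to interpolate
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# 100 requests: 50 took <= 0.1s, 90 took <= 0.5s, 99 took <= 1.0s.
latency_buckets = [(0.1, 50.0), (0.5, 90.0), (1.0, 99.0), (math.inf, 100.0)]

p50 = estimate_quantile(0.50, latency_buckets)
p70 = estimate_quantile(0.70, latency_buckets)
p99 = estimate_quantile(0.99, latency_buckets)
```

With these buckets, p50 lands at the edge of the first bucket (about 0.1s), p70 falls halfway through the second bucket (about 0.3s), and p99 lands at about 1.0s. The accuracy of the estimate depends on how well the bucket boundaries match the real latency distribution.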

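# Example Grafana PromQL query:
sum(rate(http_request_duration_seconds_bucket{job="my_fastapi_app", le="0.5"}[5m]))
/
sum(rate(http_request_duration_seconds_count{job="my_fastapi_app"}[5m]))

This query graphs the fraction of requests in the last 5 minutes that completed in under 0.5 seconds. If the value stays at or above 0.5, the median (p50) latency is at most 0.5 seconds. Note that neither side is grouped by le: the _count series has no le label, so grouping by it would leave the two sides of the division with no matching label sets.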

ℹ️ Alerting Mechanisms

Continuously monitoring collected metrics and receiving alerts when anomalies occur is also very important. Prometheus Alertmanager receives alerts from Prometheus and, according to configured rules, notifies the relevant individuals (via email, Slack, PagerDuty, etc.). For example, rules like "Alert if CPU usage exceeds 90% and this condition persists for more than 5 minutes" can be defined.
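The "persists for more than 5 minutes" part is what keeps a short spike from paging anyone. The sketch below imitates that hold period in plain Python; first_firing_time, the sample data, and the 60-second spacing are assumptions for illustration and not how Prometheus evaluates rules internally.

```python
from typing import Iterable, Optional, Tuple


def first_firing_time(
    samples: Iterable[Tuple[float, float]],
    threshold: float = 90.0,
    hold_seconds: float = 300.0,
) -> Optional[float]:
    """Return the timestamp at which the alert would fire, or None.

    `samples` are (timestamp_seconds, value) pairs in chronological order.
    The alert fires once the value has stayed above `threshold` for `hold_seconds`.
    """
    breach_start: Optional[float] = None
    for timestamp, value in samples:
        if value > threshold:
            if breach_start is None:
                breach_start = timestamp
            if timestamp - breach_start >= hold_seconds:
                return timestamp
        else:
            breach_start = None  # condition cleared, so the hold period restarts
    return None


# CPU samples every 60 seconds; the dip at t=120 resets the hold period.
cpu_samples = [
    (0, 95.0), (60, 96.0), (120, 50.0), (180, 95.0), (240, 95.0),
    (300, 95.0), (360, 95.0), (420, 95.0), (480, 95.0),
]

fired_at = first_firing_time(cpu_samples)
print(fired_at)  # prints 480
```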

Conclusion: Choosing the Right Model

The choice between Push and Pull models for metric collection depends entirely on your project's specific requirements. Both models have their strengths and weaknesses. Often, the best approach is to choose the model that is most suitable for different components of your infrastructure, or to use both models in conjunction.

The Pull model is a great option for modern, distributed systems that require centralized control and service discovery. Prometheus is the most popular representative of this model. The Push model, on the other hand, offers a more flexible solution for systems with network constraints, short-lived processes, or event-driven architectures.

It's important to remember that metric collection is just a tool. The ultimate goal is to use this data to make our systems more reliable, performant, and understandable. Therefore, selecting the right metrics, collecting them correctly, and visualizing them meaningfully are integral parts of modern system operations.