Cardinality Explosion: Should Every Detail Really Be Observed? And

The metrics and logs we collect to monitor the health of our systems can sometimes create problems for us. Especially when the concept we call cardinality is overlooked, a simple monitoring system can suddenly turn into a massive cost and performance issue. This situation directly affects not only the systems but also the careers and professional approaches of engineers like us working in operations and development.

In this post, I will try to explain what a cardinality explosion is, why it has become such a significant problem, and how we can avoid or deal with this issue when we encounter it, based on my own experiences. While the desire to observe every detail is a noble intention, it comes at a price, and anticipating this price is our responsibility as engineers.

What is Cardinality Explosion and Why is it Important?

Cardinality refers to the number of unique items in a dataset. In the context of monitoring systems, it means the variety of unique values that labels (tags) or fields we add to a metric or log record can take. For example, the cardinality of the status_code label in an HTTP request metric is low (a few values like 200, 404, 500), but the cardinality of the request_id label is very high because it takes a unique value for each request.

High cardinality leads to two main problems: cost and performance. A metrics system must store a separate time series for each unique label combination, and a log platform must index every unique value of every indexed field. Over time this bloats storage, slows queries, and can even bring the monitoring system down completely. In my career, I've encountered many situations where alarms didn't work, dashboards wouldn't load, or bills unexpectedly increased due to such an explosion.
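To make the "one series per unique label combination" point concrete, here is a rough back-of-the-envelope sketch. The label names and counts are made-up illustrative values, not measurements from any of my systems, and the function only computes a worst-case upper bound (the product of the per-label cardinalities).

```python
from math import prod

def worst_case_series(label_cardinalities: dict[str, int]) -> int:
    """Upper bound on time series for one metric:
    the product of the number of unique values per label."""
    return prod(label_cardinalities.values())

# Hypothetical label counts, chosen only for illustration
low = {"status_code": 5, "method": 4, "service": 20}
high = {**low, "request_id": 1_000_000}

low_series = worst_case_series(low)
high_series = worst_case_series(high)

print(low_series)   # 400
print(high_series)  # 400000000
```

In practice only the combinations that actually occur become series, so the real number is usually below this bound. Still, a single unbounded label is enough to guarantee at least as many series as it has values.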

⚠️ Hidden Danger

A cardinality explosion often emerges gradually as systems grow or new features are added. It might not be noticed initially, but when you suddenly see your systems slowing down or costs skyrocketing one day, the source of the problem is usually here.

This situation can spiral out of control, especially in large-scale and dynamic environments, when combined with the desire to monitor every detail. Every developer wants to see every detail of their module, and these well-intentioned requests, when combined, can paralyze the monitoring infrastructure. Therefore, understanding which details truly need to be observed and what level of granularity is sufficient is critically important.

Real Scenarios: Where Did I Encounter It?

Cardinality explosion can manifest in different ways across various systems. I've battled this problem in both metric collection systems and log management platforms. Here are a few concrete examples:

High Cardinality Metrics in Prometheus

While developing an ERP system for a manufacturing firm, we wanted to track the status of each product on the production line. Initially, we started sending separate metrics for each product_id and batch_id. For example: production_status{product_id="P123", batch_id="B456", machine_id="M1"} 1. It was fine at first because production volume was low. However, as production increased and thousands of different product_ids and hundreds of batch_ids began to be produced daily, our Prometheus server's disk space and RAM usage went out of control.

Prometheus's time series database (TSDB) stores a separate time series for each unique label set. Due to this explosion, the TSDB blocks grew rapidly, and queries started taking minutes. On April 28th, the disk filled up to 100%, and a WAL rotation alarm went off at 03:14. This was an operational nightmare caused by just one metric. One of the most important lessons I learned that day was not to use unique identifiers like product_id as metric labels.

# Example of a PromQL query that returns a high-cardinality result
sum by (product_id, batch_id) (production_status)

This query returns a separate result for each unique product_id and batch_id combination. If there are thousands or even millions of different combinations, this query will stress Prometheus and reduce the readability of the result.

Cardinality Nightmare in Log Management

A similar situation occurred when I was managing logs on an internal platform for a bank. We were adding a unique session_id and transaction_id to the logs for each user request. Our goal was to easily track the entire lifecycle of a specific request. Our logging architecture was built on Elasticsearch, and this approach seemed very logical at first.

However, in an environment processing millions of requests daily, these unique IDs expanded the size of our Elasticsearch indexes to unimaginable levels. Elasticsearch maintains an inverted index for each indexed field, and every unique value becomes a separate term in it, so high-cardinality fields consume enormous amounts of memory and disk. Within a month, the index size grew to terabytes, and even a simple session_id search took over ten seconds.

{
  "timestamp": "2026-05-29T10:00:00Z",
  "level": "INFO",
  "service": "payment-gateway",
  "message": "Payment processed successfully.",
  "session_id": "b9a0c1d2-e3f4-5678-90ab-cdef12345678",
  "transaction_id": "TXY-9876543210",
  "user_id": "U12345",
  "amount": 100.50
}

In a log entry like the one above, the session_id and transaction_id fields have high cardinality. Indexing these fields puts a significant load on Elasticsearch. Such situations, no matter how well-intentioned, taught me painfully that we need to think pragmatically about system design.
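A toy model helps show why such fields are expensive. The sketch below is a deliberately simplified inverted index (a plain dictionary from term to document positions), not how Elasticsearch or Lucene actually store terms, and the records are synthetic. It only illustrates that the number of terms follows the number of unique values.

```python
from collections import defaultdict

def build_inverted_index(docs, field):
    """Map each unique value of `field` to the positions of the documents containing it."""
    index = defaultdict(list)
    for doc_id, doc in enumerate(docs):
        index[doc[field]].append(doc_id)
    return index

# Synthetic log records: 'level' repeats, 'session_id' is unique per record
levels = ["INFO", "WARN", "ERROR"]
docs = [
    {"level": levels[i % 3], "session_id": f"sess-{i:08d}"}
    for i in range(10_000)
]

level_terms = len(build_inverted_index(docs, "level"))
session_terms = len(build_inverted_index(docs, "session_id"))

print(level_terms, session_terms)  # 3 10000
```

The level index stays at three terms no matter how many records arrive, while the session_id index gains one term per record.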

Cost and Performance Impacts: What's Coming Out of Our Pockets?

A cardinality explosion doesn't just cause the monitoring system to slow down; it also leads to significant costs and operational overhead. These impacts are our direct responsibility as engineers, and being aware of them moves us a step forward in our careers.

Storage cost is one of the most obvious impacts. Every unique time series or log record takes up disk space. The massive data piles created by high cardinality can drive monthly bills with cloud providers to unexpected levels. Once, due to a poorly designed metric, our monthly monitoring cost of $500 suddenly jumped to $3000. Such a cost increase is immediately noticed by management and puts the project's budget in jeopardy.

In terms of performance, slow queries are the main problem. Searching or plotting graphs on data with many unique labels or fields excessively consumes the CPU and RAM of database servers. This, in turn, leads to delayed alarms, extended troubleshooting processes, and general operational inefficiency. Similarly, network bandwidth can also be significantly affected, especially in distributed systems, during the transfer of these large data piles.

ℹ️ Related: Observability and Cost Relationship

When I previously thought about [related: observability costs and optimization], I realized that cardinality is one of the biggest multipliers in this equation. Observability is essential for "seeing" the system, but blindly collecting everything can leave us just as much in the dark as collecting nothing.

Operational overhead is an added burden. The monitoring system itself is a system and needs maintenance and tuning. If the monitoring system constantly causes problems due to high cardinality, our team's valuable time is spent resolving these issues. This forces us to grapple with infrastructure problems instead of developing new features or focusing on more strategic tasks. As engineers, reducing this burden is our responsibility.

Methods for Detecting and Preventing Cardinality Explosion

To detect and prevent cardinality explosion, we need to apply different strategies in both metric and log management. In my own experiences, I've prevented many crises by actively using these methods.
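For the detection half, the basic idea is to count distinct series per metric and distinct values per label. The sketch below does this over an in-memory list of samples. The (metric_name, labels) input shape and the function name cardinality_report are assumptions made for illustration; Prometheus also reports comparable statistics on its TSDB status page.

```python
from collections import defaultdict

def cardinality_report(samples):
    """samples: iterable of (metric_name, labels_dict).
    Returns {metric: (series_count, {label: unique_value_count})}."""
    series = defaultdict(set)
    values = defaultdict(lambda: defaultdict(set))
    for name, labels in samples:
        # A series is identified by its full label set
        series[name].add(frozenset(labels.items()))
        for key, value in labels.items():
            values[name][key].add(value)
    return {
        name: (len(series[name]), {k: len(v) for k, v in values[name].items()})
        for name in series
    }

# Synthetic samples: one metric with a unique ID label, one with a bounded label
samples = [
    ("production_status", {"product_id": f"P{i}", "machine_id": f"M{i % 2}"})
    for i in range(1000)
] + [
    ("http_requests_total", {"status_code": code}) for code in ("200", "404", "500")
]

report = cardinality_report(samples)
print(report["production_status"])    # (1000, {'product_id': 1000, 'machine_id': 2})
print(report["http_requests_total"])  # (3, {'status_code': 3})
```

A label whose unique-value count is close to the series count, like product_id here, is usually the one driving the explosion.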

Practical Approaches on the Metric Side

To manage cardinality in metric systems like Prometheus, there are several effective methods:

Label Limitation: Choose the labels you add to your metrics carefully. Avoid using high-cardinality identifiers like request_id, user_id, session_id as labels. Instead, use more general categories (e.g., user_type, request_path_group).
Label Cleaning with Regex: If your label values contain unnecessary or dynamic parts, you can clean them using Prometheus's relabeling rules. For example, you can capture dynamic IDs in a URL path and convert them to a more general pattern.
Aggregation at Source: When collecting metrics, aggregate them at the source whenever possible. For instance, instead of sending a separate metric for each product, send the total number of products or errors produced in a period (e.g., 1 minute). This significantly reduces cardinality.
Metric Relabeling: Prometheus's metric_relabel_configs feature can rename, drop, or transform labels on the samples scraped from targets, using regex, before they are stored. (relabel_configs, by contrast, acts on the targets themselves before the scrape.) This is a powerful tool for controlling cardinality.

# Example Prometheus scrape config: transforming a high cardinality label
- job_name: 'my_app'
  static_configs:
    - targets: ['localhost:8080']
  # metric_relabel_configs is applied to scraped samples before ingestion
  metric_relabel_configs:
    # Collapse dynamic IDs in a URL path label into a more general path
    # ('path' is an assumed label name; use whatever your app exposes)
    - source_labels: [path]
      regex: '/api/v1/users/[0-9]+/orders'
      target_label: path
      replacement: '/api/v1/users/orders'
    # Remove a high cardinality label like 'request_id' from every series
    - regex: 'request_id'
      action: labeldrop

In the example above, I reduce cardinality by rewriting the path label to a more general form and by removing the request_id label with labeldrop. Note that action: drop would discard whole series matching a regex rather than remove a label, and that after a label is removed the remaining labels must still identify each series uniquely. Relabeling at scrape time is a safety net; fixing the labels in the application itself is still preferable. Such configurations are vital for protecting our monitoring infrastructure.
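The same two ideas, label cleaning and aggregation at source, can also be applied inside the application before anything is exported. Here is a minimal sketch; the function names, the ":id" placeholder, and the event fields (including status) are hypothetical and chosen only for illustration.

```python
import re
from collections import Counter

# Matches a purely numeric path segment such as /12345
_NUMERIC_SEGMENT = re.compile(r"/\d+(?=/|$)")

def normalize_path(path: str) -> str:
    """Replace numeric path segments with a fixed placeholder."""
    return _NUMERIC_SEGMENT.sub("/:id", path)

def aggregate_by_machine(events):
    """Count events per (machine_id, status), discarding per-product identifiers."""
    counts = Counter()
    for event in events:
        counts[(event["machine_id"], event["status"])] += 1
    return counts

print(normalize_path("/api/v1/users/12345/orders"))  # /api/v1/users/:id/orders
print(normalize_path("/api/v1/users/42"))            # /api/v1/users/:id

# Hypothetical production events collected during one reporting interval
events = [
    {"product_id": "P123", "batch_id": "B456", "machine_id": "M1", "status": "ok"},
    {"product_id": "P124", "batch_id": "B456", "machine_id": "M1", "status": "ok"},
    {"product_id": "P125", "batch_id": "B457", "machine_id": "M2", "status": "error"},
]
counts = aggregate_by_machine(events)
print(counts[("M1", "ok")], counts[("M2", "error")])  # 2 1
```

With this shape, the number of series depends on the number of machines and statuses, not on how many products pass through the line.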

Strategies on the Log Side

Managing cardinality in log management systems requires slightly different approaches:

Caution with Structured Logging: Writing logs in structured formats like JSON is great, but you don't have to index every field. High-cardinality fields (e.g., transaction_id) can stay in the stored document without being indexed; in Elasticsearch, for example, setting "index": false in the field's mapping keeps the value in the stored document without building an inverted index for it. Only index fields you genuinely need to search.
Dropping Unnecessary Fields with Log Parsers: When parsing logs with tools like Logstash or Fluentd, you can completely drop high-cardinality and rarely searched fields. For example, using Grok filters, you can extract only specific fields and ignore others.
Log Sampling: Instead of storing all logs, you can perform sampling at a certain rate. Storing only 10% of informational logs, except for critical logs like error logs, can significantly reduce storage costs and cardinality.
TTL (Time To Live) Management: Implementing TTL policies that determine how long logs should be stored ensures that old and high-cardinality data is automatically purged. This helps keep index sizes under control.

# Example Logstash filter: Dropping high cardinality fields
filter {
  if [type] == "application_log" {
    # Keep transaction_id only in the message, do not index as a separate field
    mutate {
      remove_field => ["transaction_id", "session_id"]
    }
  }
}

This Logstash filter removes the transaction_id and session_id fields from the log record, thus preventing Elasticsearch from creating inverted indexes for these fields. Such fine-tuning is critical to prevent accumulated cost and performance issues over time.
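The log sampling idea from the list above can be sketched in a few lines. This is one possible approach under my own assumptions, not a feature of any particular log shipper: every high-severity record is kept, and the rest are sampled by hashing session_id so that all records of a session are kept or dropped together. The names should_keep and ALWAYS_KEEP are hypothetical.

```python
import hashlib

# Severity levels that are never sampled away (an assumed set)
ALWAYS_KEEP = {"WARN", "ERROR", "CRITICAL"}

def should_keep(record: dict, sample_rate: float = 0.10) -> bool:
    """Keep every high-severity record; keep a deterministic fraction of the rest."""
    if record["level"] in ALWAYS_KEEP:
        return True
    # Hash the session_id into [0, 1) so the decision is stable per session
    digest = hashlib.sha256(record["session_id"].encode("utf-8")).digest()
    fraction = int.from_bytes(digest[:8], "big") / 2**64
    return fraction < sample_rate

# Synthetic records: 10,000 informational logs and one error
records = [{"level": "INFO", "session_id": f"s-{i}"} for i in range(10_000)]
records.append({"level": "ERROR", "session_id": "s-err"})

kept = [r for r in records if should_keep(r)]
kept_info = sum(1 for r in kept if r["level"] == "INFO")
kept_errors = sum(1 for r in kept if r["level"] == "ERROR")
print(kept_info, kept_errors)
```

Hashing instead of calling a random generator means a given session is either fully present or fully absent, which keeps the surviving logs usable for tracing a request end to end.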

Reflections on My Career: What Did I Learn?

Battling cardinality explosions has been not just a technical skill but also a significant area of professional development in my career. The lessons learned during this process have shaped many aspects, from my general system design approach to my cost awareness.

First and foremost, I understood how important it is to think ahead in system design. A label or field that seems small today can turn into a nightmare tomorrow when millions of data points are collected. Therefore, anticipating how a system will behave under load as it grows has become one of our most valuable competencies as engineers. Asking "What will its cardinality be?" before adding a new metric or log field has become a habit of mine.

Cost awareness was a direct result of these experiences. The solutions we develop must not only be technically robust but also economically sustainable. In today's world of rapidly increasing cloud costs, using resources efficiently and avoiding unnecessary expenses falls within an engineer's scope of responsibility. Now, when designing a solution, I always ask, "How much will this cost us?"

💡 Learning and Development

Last month I wrote sleep 360 and got OOM-killed, then switched to polling-wait. I'm not ashamed of making mistakes; the important thing is to learn from them. Cardinality explosion was also such a learning process.

Finally, my ability to explain and manage trade-offs has improved. The desire to observe every detail is understandable, but it comes at a price. Being able to clearly explain this price, even to non-technical stakeholders, and finding the best balance point demonstrates an engineer's communication skills. In such situations, as I mentioned in my article on "[related: software architecture trade-offs]", clearly presenting the options and their consequences is very important. This has strengthened my technical leadership and helped the team make more informed decisions.

Conclusion

Cardinality explosion is one of the most insidious and costly problems we face in the realm of observability. However, confronting this problem offers us invaluable lessons, not just technically but also professionally. When designing and managing our systems, we must consider the potential cost and performance overhead that comes with the desire to monitor every detail.

Monitoring is not just a tool; it is a critical artery that keeps the pulse of our systems. We must always keep the awareness of cardinality alive to avoid blocking this artery. Gaining and applying this awareness ensures that our systems run more healthily and helps engineers like us make more informed and valuable decisions. I will continue to use these lessons as a guide in future projects.