At 2:07 AM, the payment system started failing.
Not with a crash. Not with a clear error. Just… silently dropping transactions.
At first, it was a trickle. Then retries. Then support tickets. Then alerts.
Within minutes, revenue was bleeding.
The system handled thousands of transactions per minute. Everything looked fine. CPU was normal. Memory was stable. No obvious outage. But something was wrong — and here's what made it worse: there were logs, but they were completely useless.
What Bad Logging Looks Like in Production
When the team opened the logs, this is what they found:
INFO Processing request
ERROR Something failed
INFO Retrying
That's it. No request ID. No user context. No indication of which service failed or which worker was affected. Debugging at that point wasn't engineering — it was guessing.
What Good Logging Would Have Looked Like
Now imagine the same incident with structured, intentional logging:
2026-06-12T02:07:03Z | INFO | payment-service | pid=12345 | req=abc123 | payment_initiated | user_id=7821 amount=500
2026-06-12T02:07:04Z | INFO | payment-service | pid=12345 | req=abc123 | calling_gateway | provider=stripe
2026-06-12T02:07:07Z | ERROR | payment-service | pid=12345 | req=abc123 | gateway_timeout | duration=3s
2026-06-12T02:07:07Z | WARNING | payment-service | pid=12345 | req=abc123 | retry_scheduled | attempt=1
2026-06-12T02:07:10Z | CRITICAL | payment-service | pid=12345 | req=abc123 | payment_failed | reason=timeout
Now the story tells itself. A specific request failed at the gateway after a 3-second timeout. Retry logic kicked in but couldn't recover. Final failure recorded with full context. No guessing — just facts.
Logging Is Not Debugging. It's Reconstruction.
This is the mindset shift that separates junior and senior engineers.
Logging is not about printing messages. Logging is about reconstructing reality after things go wrong.
When systems scale — multiple services, async flows, retries, distributed workers — you can't step through production code with a debugger. By the time you're investigating, the process that failed is long gone. Logs become your only timeline of truth. They are the flight recorder of your application.
What to Log (And Why It Matters)
Good logs aren't random. They're intentional. Here's a practical framework.
1. Log the lifecycle of important flows
Every meaningful operation should leave a trail:
- request received
- validation result
- external call made
- response received
- retry attempted
- success or failure
This creates a narrative for every request — a story you can follow from start to finish even days later.
2. Log failures with context
Bad:
logger.error("Payment failed")
Good:
logger.error(
"payment_failed | request_id=%s user_id=%s amount=%s error=%s",
request_id, user_id, amount, error
)
The second version answers who, what, how much, and why — the four questions you'll always ask during an incident.
3. Log external dependencies
The majority of real production failures originate outside your code — in APIs, databases, or message queues. Always log the service name, latency, response code, and retry behaviour for every external call. If a third-party API starts degrading, this is where you'll see it first.
4. Log state transitions
For business workflows, log every state change:
order_created → payment_pending → payment_failed
Without these breadcrumbs, debugging business logic issues — the kind where the code technically worked but produced the wrong outcome — becomes extremely painful.
What NOT to Log
Equally important is knowing what to leave out:
- passwords and auth tokens
- sensitive personal data (card numbers, SSNs)
- raw unfiltered request payloads
- high-frequency noise that adds volume but no signal
A log entry should earn its place. If it doesn't help you debug a real problem, it's just noise that makes the real signals harder to find.
String Formatting: A Hidden Performance Detail
This is where many developers unknowingly hurt performance in high-throughput systems.
The f-string trap
logger.debug(f"User data: {get_user_data()}")
Even when DEBUG level is disabled and this line will never write anything, get_user_data() still gets called and evaluated. In a tight loop or a hot code path, this adds up.
Why logging uses % formatting
logger.debug("User data: %s", get_user_data())
When you pass arguments separately, Python's logging module uses lazy evaluation — it only formats the string if the message will actually be written. If the level is disabled, the function never runs. For debug-heavy code, this can make a measurable difference.
What about t-strings?
Python has two things that could be called "t-strings" and it's worth distinguishing them.
The older one is string.Template — it's been in the standard library since Python 2.4:
from string import Template
Template("User $user triggered $event").substitute(user=user, event=event)
It's safer than % or f-strings when handling untrusted input because it doesn't execute arbitrary expressions, only simple variable substitution. Rarely seen in logging pipelines in practice.
The newer one is Python 3.14's t"" literal — a genuine language-level feature that produces a Template object rather than a string, giving you access to the raw parts before interpolation happens:
# Python 3.14+
entry = t"payment_failed | user={user_id} amount={amount}"
This is genuinely interesting for structured logging because you could intercept the parts, sanitise values, or build a JSON payload — without the string ever being fully rendered as plain text. Still very new; ecosystem support is early. Worth watching, but not something you'd reach for today.
Rule of thumb: Use f-strings everywhere in your normal code. Use
%sarguments inside logger calls. Keep an eye ont""as structured logging tooling catches up.
Log Levels — Think in Incidents, Not Categories
Log levels are not just labels. They define how urgently your system is communicating with you.
- DEBUG — granular internal detail, only useful during active development
- INFO — the normal, expected flow of the application
- WARNING — something unexpected happened, but the system recovered or continued
- ERROR — a specific operation failed and needs attention
- CRITICAL — the system itself is in danger; immediate action required
A common mistake is using ERROR for everything unexpected. If everything is an error, nothing is. Reserve CRITICAL for situations that genuinely threaten system integrity — and make sure those alerts actually wake someone up.
A Practical Logging Format
Consistency matters more than any particular style. A format that works well across services:
<timestamp> | <level> | <service> | pid=<pid> | req=<request_id> | <event> | key=value pairs
Real example:
2026-06-12T02:07:03Z | ERROR | payment-service | pid=12345 | req=abc123 | payment_failed | user_id=7821 amount=500
This format is scannable by humans and parseable by machines — important when you eventually pipe logs into a centralised system.
Why PID Matters in Multiprocessing Systems
In systems running multiple workers — Gunicorn, Celery, Python multiprocessing — logs from different processes interleave in the same output stream. Without process IDs, this happens:
INFO Processing request
INFO Processing request
ERROR Something failed
Which worker failed? No idea. With PIDs:
INFO pid=12345 Processing request abc123
INFO pid=12346 Processing request def456
ERROR pid=12346 Payment failed for request def456
Now you can instantly isolate which worker had the problem, whether failures are isolated to one worker or spreading across all of them, and whether a particular worker is consistently failing.
PID vs Request ID — these answer different questions. PID tells you which process handled something. Request ID tells you which flow you're following. In distributed systems, you need both.
Local vs Remote Logging
Scenario Recommended approach
─────────────────────────────────────────────
Small / single app Local logging
Growing app Hybrid (local + remote)
Distributed system Remote / centralised
Local logging — logs written to files on the same machine. Simple to set up, fast, no infrastructure required. But logs are lost if the machine crashes, hard to search across multiple instances, and don't scale.
Remote / centralised logging — logs shipped to a dedicated system like ELK, Loki, or a managed service. Searchable, unified across services, supports alerting and dashboards. Adds setup complexity and cost, but essentially mandatory once you're running more than one service.
For growing applications, a sensible middle ground is to log locally in development and early production, then add a remote sink once you have multiple services or need cross-service correlation.
Monolith vs Distributed Systems
Monolith — file-based logging is perfectly fine. PIDs help. Log levels and a consistent format are usually sufficient to debug most issues.
Distributed system — the rules change. You must have request IDs that propagate across service boundaries (often called correlation IDs). Logs should be structured (JSON is preferred over plain text because it's machine-parseable). A centralised logging system isn't optional. Every log entry must include the service name or you'll have no idea where it originated.
Without these in a distributed system, logs from ten services tell ten disconnected stories instead of one coherent narrative.
Logging Libraries
logging (built-in)
The standard library choice. Flexible, well-supported, integrates with everything. The configuration is verbose but gives you full control. Good default for most applications.
loguru
A third-party library that dramatically simplifies the setup with sensible defaults and a clean API. Great for smaller projects or teams that want to move fast. Less granular control than the built-in module.
structlog
Purpose-built for structured, machine-readable logs. Excellent for microservices where logs will be ingested by a centralised system. Has a learning curve but produces the cleanest output for large-scale systems.
Remote Logging Systems
ELK Stack (Elasticsearch, Logstash, Kibana) — the industry standard for self-hosted centralised logging. Powerful search, rich visualisation, highly configurable. Operationally heavy to run yourself.
Grafana Loki — a lighter-weight alternative designed to work alongside Prometheus. More efficient storage, simpler to operate, but less powerful full-text search than Elasticsearch.
Datadog / Splunk — fully managed, production-grade logging platforms. Rich alerting, anomaly detection, and integrations out of the box. Expensive at scale, but they eliminate all operational overhead.
Logging Execution Time
How long did that function take? That question comes up constantly — slow database queries, degrading third-party APIs, background jobs creeping past their SLA. If you're not logging duration, you'll only find out something is slow once it's already causing timeouts.
The simplest approach for a one-off:
import time
start = time.perf_counter()
result = call_payment_gateway(payload)
duration = time.perf_counter() - start
logger.info(
"gateway_call_complete | provider=%s duration_ms=%.2f status=%s",
provider, duration * 1000, result.status
)
Use time.perf_counter() rather than time.time() for measuring elapsed duration — it's higher resolution and not affected by system clock adjustments.
For anything you want to time repeatedly across your codebase, a decorator keeps things clean:
import time
import functools
def log_duration(func):
@functools.wraps(func)
def wrapper(*args, **kwargs):
start = time.perf_counter()
try:
result = func(*args, **kwargs)
duration_ms = (time.perf_counter() - start) * 1000
logger.info(
"function_complete | name=%s duration_ms=%.2f",
func.__name__, duration_ms
)
return result
except Exception as e:
duration_ms = (time.perf_counter() - start) * 1000
logger.error(
"function_failed | name=%s duration_ms=%.2f error=%s",
func.__name__, duration_ms, str(e)
)
raise
return wrapper
# Usage
@log_duration
def process_payment(user_id, amount):
...
This logs duration on both success and failure — which matters. A function that fails in 2ms failed for a different reason than one that fails after 30 seconds.
For timing a specific block of code rather than a whole function, a context manager is neater:
from contextlib import contextmanager
@contextmanager
def timed_block(label):
start = time.perf_counter()
try:
yield
finally:
duration_ms = (time.perf_counter() - start) * 1000
logger.info("timed_block | label=%s duration_ms=%.2f", label, duration_ms)
# Usage
with timed_block("db_fetch_user_orders"):
orders = db.query("SELECT * FROM orders WHERE user_id = %s", user_id)
What to actually time depends on your application. A web API cares about p95 response times per endpoint. A data pipeline cares about per-stage throughput. A background worker cares about job duration trends over time. The rule of thumb: log duration wherever you call something external — a database, an API, a cache — because that's where latency problems almost always originate.
A Production-Ready Setup
Here's a minimal but solid starting point for any Python service:
import logging
logging.basicConfig(
level=logging.INFO,
format="%(asctime)s | %(levelname)-8s | payment-service | pid=%(process)d | %(message)s",
datefmt="%Y-%m-%dT%H:%M:%SZ"
)
logger = logging.getLogger(__name__)
# Usage
logger.info(
"payment_initiated | request_id=%s user_id=%s amount=%s",
request_id, user_id, amount
)
logger.error(
"payment_failed | request_id=%s user_id=%s error=%s",
request_id, user_id, str(error)
)
A few additions worth considering as you grow:
-
Log rotation — use
RotatingFileHandlerorTimedRotatingFileHandlerto prevent log files from growing unbounded and filling your disk -
JSON formatter — swap the plain-text format for a JSON one (e.g. via
python-json-logger) when you're ready to ship logs to a centralised system - Correlation ID middleware — in web frameworks, inject a unique request ID at the entry point and thread it through every log call for that request
The Real Cost of Bad Logging
Logging is often treated as an afterthought — something you add when you have time. But bad logging doesn't just make debugging slower. It makes incidents longer, outages more expensive, and on-call shifts more stressful.
The difference between a 10-minute incident resolution and a 3-hour one often isn't the complexity of the bug. It's whether the logs told a clear story or left engineers guessing in the dark.
When production breaks, logs aren't helpful. They're everything.
And how quickly you recover comes down to one thing: how well you logged before things went wrong.
Found this useful? Follow for more deep dives into backend engineering, production systems, and the things they don't teach in tutorials.











