Reliable monitoring becomes surprisingly hard once you assume the network will eventually fail.
Monitoring systems quietly depend on one dangerous assumption:
the monitoring server will always be reachable
In production, that assumption breaks constantly.
And once it breaks,
you discover that metrics pipelines are really distributed systems.
The problem
Our PostgreSQL monitoring agent periodically collects metrics and sends them to the Control Plane.
Under normal conditions:
Agent
│
└──► metrics ───► Control Plane ───► storage
Simple enough.
But then production networks happen:
- VPN reconnects
- firewall reloads
- DNS failures
- routing instability
- Control Plane maintenance windows
Once connectivity drops,
the entire delivery pipeline becomes unreliable.
The obvious answer sounds simple:
“just retry later”
That idea turned out to be much harder than expected.
Questions immediately appeared:
- how long should metrics stay in memory?
- what happens during process restart?
- how do we avoid unbounded memory growth?
- how do we recover after long outages?
- how do we know delivery is degraded?
At that point,
this stopped being “metrics collection”.
It became a buffering and reliability problem.
The incident that changed the design
One outage lasted almost six hours after a network partition between monitoring agents and the Control Plane.
The agents themselves stayed healthy.
PostgreSQL stayed healthy.
Only the network path disappeared.
The original implementation buffered metrics in memory.
At 4:12 AM, one agent restarted during a package update.
Six hours of buffered monitoring data vanished instantly.
The dashboards looked like this:
timeline ─────── gap ─────── timeline
No visibility.
No recovery.
No idea what happened during the outage.
That incident permanently killed the idea of in-memory buffering.
Metrics had to survive process restarts.
The architecture
The final design became intentionally simple:
PostgreSQL
│
▼
metrics collector
│
▼
+----------------------+
| SQLite WAL |
|----------------------|
| durable local queue |
| bounded retention |
+----------------------+
│
▼
batch delivery loop
│
▼
Control Plane
The key architectural shift:
the Control Plane is no longer in the critical path of collection
Metrics are always written locally first.
Network delivery became asynchronous.
That single separation changed the reliability model completely.
The network could fail for hours,
and the agent would continue operating normally.
Why SQLite
The interesting part here is not SQL.
It's what SQLite fundamentally is:
- a single file
- atomic writes
- crash-safe persistence
- zero infrastructure
- included in Python itself
We evaluated alternatives.
In-memory queue
Failed because:
- metrics disappear on restart
- memory usage becomes unbounded during long outages
Plain append-only files
Failed because:
- manual locking
- manual serialization
- corruption handling
- cleanup complexity
Redis
We already use Redis elsewhere in the platform (for the psql terminal).
But for the metrics buffer, Redis was the wrong fit:
- Redis is in-memory by default. Persistence (RDB/AOF) is optional and adds latency.
- If Redis restarts, the buffer disappears — even if the agent is healthy.
- Mixing low-latency terminal traffic with background batch writes would create contention.
SQLite gives us durability on disk, survives agent restarts, and has zero dependencies outside the agent process.
Redis is great for real-time coordination. SQLite is better for durable local buffering.
The schema is intentionally tiny
CREATE TABLE IF NOT EXISTS metrics_queue (
id INTEGER PRIMARY KEY AUTOINCREMENT,
instance_id TEXT NOT NULL,
timestamp TEXT NOT NULL,
metric_type TEXT NOT NULL,
data TEXT NOT NULL,
created_at TEXT NOT NULL DEFAULT (datetime('now'))
)
One row per metric type per collection cycle.
Payloads are JSON-encoded.
This is not relational modeling.
This is durable buffering.
Concurrency: why SQLite didn't become a bottleneck
One obvious question:
“doesn't SQLite lock the whole database?”
Yes.
And that matters.
The reason it worked here:
- exactly one writer exists
- the collector loop owns all inserts
- delivery workers only read/delete in batches
We also enabled WAL mode:
PRAGMA journal_mode=WAL;
PRAGMA synchronous=NORMAL;
WAL mode mattered for one specific reason:
the collector continuously appends metrics,
while the delivery loop simultaneously reads and deletes batches.
Without WAL,
those operations would frequently block each other.
WAL allows readers to continue while writes are happening.
We also use:
PRAGMA synchronous=NORMAL;
This reduces fsync frequency while still preserving crash consistency,
which was a better latency/durability tradeoff for monitoring data.
For our workload:
- collection every 10–30 seconds
- batches of 100–500 rows
- occasional degraded-network bursts
SQLite contention was negligible.
If we needed hundreds of concurrent writers,
we would likely move to PostgreSQL instead.
For this system,
operational simplicity mattered more than horizontal scalability.
The buffer state machine
The buffer itself behaves like a bounded queue:
COLLECT
│
├── buffer healthy ──► INSERT
│
└── limits exceeded
│
▼
DROP + WARN
Two limits control growth:
- maximum file size
- maximum record age
Those constraints turned out to be operationally critical.
Why age-based cleanup matters
Imagine a six-hour outage.
Without retention limits,
the moment connectivity returns,
the agent floods the Control Plane with six hours of historical metrics.
On dashboards, that looks like:
timeline ─── gap ─── MASSIVE SPIKE ─── normal
Operationally, that's worse than losing data.
The spike is misleading.
So we intentionally discard stale records.
If buffered metrics exceed max_age_hours,
they are silently removed during cleanup.
The result becomes:
timeline ─── gap ─── normal
The outage remains visible.
But the monitoring data stays trustworthy.
That tradeoff mattered more than perfect retention.
Batch delivery
The delivery loop periodically reconstructs batches from SQLite:
SELECT id, instance_id, metric_type, data
FROM metrics_queue
ORDER BY created_at ASC
LIMIT ?
Rows are grouped back into metric payloads and sent to the Control Plane.
Importantly:
rows are deleted only after successful delivery
if send_to_control_plane(batch):
cursor.execute(
f"""
DELETE FROM metrics_queue
WHERE id IN ({','.join('?' * len(ids))})
""",
ids,
)
conn.commit()
This creates a very simple delivery guarantee:
INSERT → SEND → DELETE
If the process crashes before DELETE,
the metrics remain in SQLite and are retried later.
That makes delivery effectively at-least-once.
This effectively turns SQLite into a durable local write-ahead queue.
The second production incident
Several months later,
another issue appeared.
Monitoring dashboards showed healthy agents,
but metrics were arriving hours late.
The agents were buffering correctly.
The problem was:
the SQLite files had silently grown to hundreds of megabytes after a prolonged outage.
Delivery throughput could not drain the backlog fast enough.
At that point we realized:
the buffer itself needed observability
Buffer observability
Every heartbeat sent to the Control Plane now includes buffer state:
{
'buffer_size_bytes': size,
'oldest_record_age': age,
'buffer_utilization': utilization,
}
Operationally, the most useful metrics became:
Metric | Meaning | Operational action
───────────────────+──────────────────────────+────────────────────────
buffer_size_bytes | Current SQLite file size | Alert if >100 MB
oldest_record_age | Delivery lag | Alert if >1 hour
buffer_utilization | Current size / max size | Scale retention policy
delivery_failures | Consecutive failed sends | Investigate connectivity
Without these metrics,
network degradation looked identical to healthy operation.
The buffer itself became another production subsystem
that required monitoring.
Production numbers
Typical production values:
- collection interval: 10–30 seconds
- average batch size: 100–500 rows
- SQLite buffer limit: 128 MB
- longest outage survived: ~6 hours
Under normal degraded-network conditions,
the SQLite file rarely exceeded 10–15 MB.
CPU overhead was effectively negligible.
Failure modes we actually cared about
A. Agent restart during outage
Metrics survive because the buffer lives on disk.
Delivery resumes automatically after restart.
B. Control Plane recovery after long outage
Fresh metrics resume immediately.
Stale buffered records expire automatically.
No misleading spikes.
C. Disk pressure
Once the buffer exceeds the configured size limit:
- new writes stop
- warnings are emitted
- collection pauses temporarily
Delivery resumes automatically once backlog drains.
D. SQLite corruption
Rare, but possible.
We intentionally kept handling simple:
- log the error
- recreate the buffer
- continue operation
For a monitoring system,
temporary metric loss is acceptable.
Operational complexity was not worth stronger guarantees.
Bounded loss over unbounded growth
One important design decision:
the system intentionally prefers bounded data loss over unbounded resource growth
Old metrics eventually expire.
The SQLite file has a hard size limit.
That tradeoff was deliberate.
For operational monitoring,
partial visibility is better than:
- a dead agent
- an exhausted disk
- or an overloaded recovery pipeline
Perfect durability was never the goal.
Operational survivability was.
The mental model
If you understand only one thing:
collect
▼
SQLite
▼
deliver
▼
delete
SQLite here is not acting as a database.
It's:
- a durable queue
- with bounded retention
- bounded size
- atomic writes
- and zero infrastructure dependencies
That's the entire design.
Final thought
The interesting realization was this:
monitoring reliability problems are mostly buffering problems
Once network delivery became unreliable,
the architecture naturally evolved toward:
- local durability
- asynchronous delivery
- bounded retention
- explicit degradation handling
Not because SQLite is magical.
Because a single durable file solved the exact failure mode we actually had.













