SQLite as an offline metrics buffer in a monitoring agent

Reliable monitoring becomes surprisingly hard once you assume the network will eventually fail.

Monitoring systems quietly depend on one dangerous assumption:

the monitoring server will always be reachable

In production, that assumption breaks constantly.

And once it breaks,
you discover that metrics pipelines are really distributed systems.

The problem

Our PostgreSQL monitoring agent periodically collects metrics and sends them to the Control Plane.

Under normal conditions:

Agent
   │
   └──► metrics ───► Control Plane ───► storage

Simple enough.

But then production networks happen:

VPN reconnects
firewall reloads
DNS failures
routing instability
Control Plane maintenance windows

Once connectivity drops,
the entire delivery pipeline becomes unreliable.

The obvious answer sounds simple:

“just retry later”

That idea turned out to be much harder than expected.

Questions immediately appeared:

how long should metrics stay in memory?
what happens during process restart?
how do we avoid unbounded memory growth?
how do we recover after long outages?
how do we know delivery is degraded?

At that point,
this stopped being “metrics collection”.

It became a buffering and reliability problem.

The incident that changed the design

One outage lasted almost six hours after a network partition between monitoring agents and the Control Plane.

The agents themselves stayed healthy.
PostgreSQL stayed healthy.

Only the network path disappeared.

The original implementation buffered metrics in memory.

At 4:12 AM, one agent restarted during a package update.

Six hours of buffered monitoring data vanished instantly.

The dashboards looked like this:
timeline ─────── gap ─────── timeline

No visibility.
No recovery.
No idea what happened during the outage.

That incident permanently killed the idea of in-memory buffering.

Metrics had to survive process restarts.

The architecture

The final design became intentionally simple:

                PostgreSQL
                     │
                     ▼
              metrics collector
                     │
                     ▼
          +----------------------+
          |      SQLite WAL      |
          |----------------------|
          | durable local queue  |
          | bounded retention    |
          +----------------------+
                     │
                     ▼
             batch delivery loop
                     │
                     ▼
               Control Plane

The key architectural shift:

the Control Plane is no longer in the critical path of collection

Metrics are always written locally first.

Network delivery became asynchronous.

That single separation changed the reliability model completely.

The network could fail for hours,
and the agent would continue operating normally.

Why SQLite

The interesting part here is not SQL.

It's what SQLite fundamentally is:

a single file
atomic writes
crash-safe persistence
zero infrastructure
included in Python itself

We evaluated alternatives.

In-memory queue

Failed because:

metrics disappear on restart
memory usage becomes unbounded during long outages

Plain append-only files

Failed because:

manual locking
manual serialization
corruption handling
cleanup complexity

Redis

We already use Redis elsewhere in the platform (for the psql terminal).

But for the metrics buffer, Redis was the wrong fit:

Redis is in-memory by default. Persistence (RDB/AOF) is optional and adds latency.
If Redis restarts, the buffer disappears — even if the agent is healthy.
Mixing low-latency terminal traffic with background batch writes would create contention.

SQLite gives us durability on disk, survives agent restarts, and has zero dependencies outside the agent process.

Redis is great for real-time coordination. SQLite is better for durable local buffering.

The schema is intentionally tiny

CREATE TABLE IF NOT EXISTS metrics_queue (
    id          INTEGER PRIMARY KEY AUTOINCREMENT,
    instance_id TEXT    NOT NULL,
    timestamp   TEXT    NOT NULL,
    metric_type TEXT    NOT NULL,
    data        TEXT    NOT NULL,
    created_at  TEXT    NOT NULL DEFAULT (datetime('now'))
)

One row per metric type per collection cycle.

Payloads are JSON-encoded.

This is not relational modeling.

This is durable buffering.

Concurrency: why SQLite didn't become a bottleneck

One obvious question:

“doesn't SQLite lock the whole database?”

Yes.
And that matters.

The reason it worked here:

exactly one writer exists
the collector loop owns all inserts
delivery workers only read/delete in batches

We also enabled WAL mode:

PRAGMA journal_mode=WAL;
PRAGMA synchronous=NORMAL;

WAL mode mattered for one specific reason:

the collector continuously appends metrics,
while the delivery loop simultaneously reads and deletes batches.

Without WAL,
those operations would frequently block each other.

WAL allows readers to continue while writes are happening.

We also use:

PRAGMA synchronous=NORMAL;

This reduces fsync frequency while still preserving crash consistency,
which was a better latency/durability tradeoff for monitoring data.

For our workload:

collection every 10–30 seconds
batches of 100–500 rows
occasional degraded-network bursts

SQLite contention was negligible.

If we needed hundreds of concurrent writers,
we would likely move to PostgreSQL instead.

For this system,
operational simplicity mattered more than horizontal scalability.

The buffer state machine

The buffer itself behaves like a bounded queue:

COLLECT
   │
   ├── buffer healthy ──► INSERT
   │
   └── limits exceeded
              │
              ▼
         DROP + WARN

Two limits control growth:

maximum file size
maximum record age

Those constraints turned out to be operationally critical.

Why age-based cleanup matters

Imagine a six-hour outage.

Without retention limits,
the moment connectivity returns,
the agent floods the Control Plane with six hours of historical metrics.

On dashboards, that looks like:

timeline ─── gap ─── MASSIVE SPIKE ─── normal

Operationally, that's worse than losing data.

The spike is misleading.

So we intentionally discard stale records.

If buffered metrics exceed max_age_hours,
they are silently removed during cleanup.

The result becomes:

timeline ─── gap ─── normal

The outage remains visible.
But the monitoring data stays trustworthy.

That tradeoff mattered more than perfect retention.

Batch delivery

The delivery loop periodically reconstructs batches from SQLite:

SELECT id, instance_id, metric_type, data
FROM metrics_queue
ORDER BY created_at ASC
LIMIT ?

Rows are grouped back into metric payloads and sent to the Control Plane.
Importantly:

rows are deleted only after successful delivery

if send_to_control_plane(batch):
    cursor.execute(
        f"""
        DELETE FROM metrics_queue
        WHERE id IN ({','.join('?' * len(ids))})
        """,
        ids,
    )
    conn.commit()

This creates a very simple delivery guarantee:
INSERT → SEND → DELETE
If the process crashes before DELETE,
the metrics remain in SQLite and are retried later.

That makes delivery effectively at-least-once.

This effectively turns SQLite into a durable local write-ahead queue.

The second production incident

Several months later,
another issue appeared.

Monitoring dashboards showed healthy agents,
but metrics were arriving hours late.

The agents were buffering correctly.

The problem was:
the SQLite files had silently grown to hundreds of megabytes after a prolonged outage.

Delivery throughput could not drain the backlog fast enough.

At that point we realized:

the buffer itself needed observability

Buffer observability

Every heartbeat sent to the Control Plane now includes buffer state:

{
    'buffer_size_bytes': size,
    'oldest_record_age': age,
    'buffer_utilization': utilization,
}

Operationally, the most useful metrics became:

Metric             | Meaning                  | Operational action
───────────────────+──────────────────────────+────────────────────────
buffer_size_bytes  | Current SQLite file size | Alert if >100 MB
oldest_record_age  | Delivery lag             | Alert if >1 hour
buffer_utilization | Current size / max size  | Scale retention policy
delivery_failures  | Consecutive failed sends | Investigate connectivity

Without these metrics,
network degradation looked identical to healthy operation.

The buffer itself became another production subsystem
that required monitoring.

Production numbers

Typical production values:

collection interval: 10–30 seconds
average batch size: 100–500 rows
SQLite buffer limit: 128 MB
longest outage survived: ~6 hours

Under normal degraded-network conditions,
the SQLite file rarely exceeded 10–15 MB.

CPU overhead was effectively negligible.

Failure modes we actually cared about

A. Agent restart during outage

Metrics survive because the buffer lives on disk.
Delivery resumes automatically after restart.

B. Control Plane recovery after long outage

Fresh metrics resume immediately.
Stale buffered records expire automatically.
No misleading spikes.

C. Disk pressure

Once the buffer exceeds the configured size limit:

new writes stop
warnings are emitted
collection pauses temporarily

Delivery resumes automatically once backlog drains.

D. SQLite corruption

Rare, but possible.

We intentionally kept handling simple:

log the error
recreate the buffer
continue operation

For a monitoring system,
temporary metric loss is acceptable.

Operational complexity was not worth stronger guarantees.

Bounded loss over unbounded growth

One important design decision:

the system intentionally prefers bounded data loss over unbounded resource growth

Old metrics eventually expire.

The SQLite file has a hard size limit.

That tradeoff was deliberate.

For operational monitoring,
partial visibility is better than:

a dead agent
an exhausted disk
or an overloaded recovery pipeline

Perfect durability was never the goal.

Operational survivability was.

The mental model

If you understand only one thing:

collect
   ▼
SQLite
   ▼
deliver
   ▼
delete

SQLite here is not acting as a database.

It's:

a durable queue
with bounded retention
bounded size
atomic writes
and zero infrastructure dependencies

That's the entire design.

Final thought

The interesting realization was this:

monitoring reliability problems are mostly buffering problems

Once network delivery became unreliable,
the architecture naturally evolved toward:

local durability
asynchronous delivery
bounded retention
explicit degradation handling

Not because SQLite is magical.

Because a single durable file solved the exact failure mode we actually had.

SQLite as an offline metrics buffer in a monitoring agent

The problem

The incident that changed the design

The architecture

Why SQLite

In-memory queue

Plain append-only files

Redis

The schema is intentionally tiny

Concurrency: why SQLite didn't become a bottleneck

The buffer state machine

Why age-based cleanup matters

Batch delivery

The second production incident

Buffer observability

Production numbers

Failure modes we actually cared about

A. Agent restart during outage

B. Control Plane recovery after long outage

C. Disk pressure

D. SQLite corruption

Bounded loss over unbounded growth

The mental model

Final thought

Tags

Author

Stats

Published

You Might Also Like

Hermes Mentor — A Local AI Agent That Gets You Out of Tutorial Hell

Seasons time-lapse - alignment

Hermes Agent ran overnight and I woke up to a $47 bill — so I built a kill-switch

GCP: Upgrading a LINE Bot with Vertex AI ADK Tools for Smart Business Cards and Backup Search

Running a Full Multi-Stage Intrusion Simulation. Every Detection Fired.

A Curious Journey Into Reverse Engineering an AI-Generated Python .exe