Querying Video Analytics Parquet Exports With DuckDB Instead of a Warehouse

Every Monday I used to run the same painful ritual. Someone on the growth side would ask a question like "which Korean-language channels gained the most watch-time in Taiwan and Singapore last week, and how does that compare to the prior week?" Our production data lives in SQLite (with an FTS5 CJK tokenizer for search), and the raw event stream — every video impression, play, and completion across the Asia-Pacific regions we aggregate — gets exported nightly to Parquet on object storage. To answer that one question I would spin up a Python notebook, pull pandas, load three days of Parquet, run out of memory, downsample, and an hour later hand back a number I wasn't fully confident in.

The production stack at TopVideoHub is deliberately boring and cheap: PHP 8.4, SQLite with FTS5, LiteSpeed, Cloudflare in front. That stack is perfect for serving pages. It is terrible for slicing 400 million event rows on a whim. What I actually needed was an analytics engine I could point at a folder of Parquet files, with no server to run, no cluster to pay for, and no ETL step. That engine turned out to be DuckDB, and it has quietly replaced the entire "spin up a warehouse for ad-hoc questions" part of my job. This post is the concrete walkthrough of how we use it.

The shape of the data and why Parquet

Our event pipeline writes one Parquet file per region per hour. The directory layout is partitioned, which matters enormously for query speed:

s3://tvh-analytics/events/
  region=TW/dt=2026-05-25/events-00.parquet
  region=TW/dt=2026-05-25/events-01.parquet
  region=SG/dt=2026-05-25/events-00.parquet
  region=JP/dt=2026-05-25/events-00.parquet
  ...

Each row is a single user action. The schema is narrow and columnar-friendly:

event_ts (TIMESTAMP) — when it happened, UTC
region (VARCHAR) — TW, SG, JP, KR, HK, etc. (also encoded in the path)
dt (DATE) — the partition date
video_id (VARCHAR)
channel_id (VARCHAR)
lang (VARCHAR) — detected content language: ko, ja, zh-Hant, zh-Hans, en
event_type (VARCHAR) — impression, play, complete
watch_ms (BIGINT) — milliseconds watched for play/complete events
session_id (VARCHAR)

Parquet is the right export format here for three reasons. It is columnar, so a query that touches watch_ms and region never reads the session_id bytes off disk. It is compressed (we use ZSTD), so a day of events is roughly 1.8 GB on disk instead of ~14 GB of JSON. And it carries its own schema and per-column statistics, so a query engine can skip entire row groups whose event_ts range falls outside your filter. DuckDB exploits all three.

Installing DuckDB and the first query

There is no daemon. DuckDB is an embedded library, the same deployment model as SQLite, which is exactly why it fits my brain after years of SQLite. The CLI is a single binary:

# macOS / Linux — grab the standalone CLI
curl -L https://install.duckdb.org | sh

# Or via Python, which is how our analysts use it
pip install duckdb

Now the part that still feels like magic. You can query Parquet on disk directly, no import, no CREATE TABLE:

-- Launch with: duckdb
-- Hive-style partition columns (region, dt) are read from the path
SELECT
    region,
    count(*)            AS events,
    count(DISTINCT video_id) AS videos
FROM read_parquet(
    'events/region=*/dt=2026-05-25/*.parquet',
    hive_partitioning = true
)
WHERE event_type = 'play'
GROUP BY region
ORDER BY events DESC;

The hive_partitioning = true flag tells DuckDB to parse region=TW and dt=2026-05-25 out of the file paths and expose them as real columns. The huge win: because dt is in the path, the planner prunes to just that day's files before reading a single byte of Parquet. Filtering on region prunes folders too. This is partition pruning, and it is the difference between scanning 1.8 GB and scanning 60 GB.

Querying remote Parquet without downloading it

We do not keep the full event history on the laptop. It lives in object storage. DuckDB's httpfs extension reads Parquet over HTTP range requests, pulling only the row groups it needs:

INSTALL httpfs;
LOAD httpfs;

-- Cloudflare R2 is S3-compatible; point the S3 settings at it
CREATE SECRET tvh_r2 (
    TYPE s3,
    PROVIDER config,
    KEY_ID     '...redacted...',
    SECRET     '...redacted...',
    ENDPOINT   '<account>.r2.cloudflarestorage.com',
    URL_STYLE  'path',
    REGION     'auto'
);

-- Watch-time leaders for Korean content across two AP regions, last 7 days
SELECT
    channel_id,
    round(sum(watch_ms) / 3.6e6, 1) AS watch_hours,
    count(*) FILTER (WHERE event_type = 'complete') AS completions
FROM read_parquet(
    's3://tvh-analytics/events/region=*/dt=*/*.parquet',
    hive_partitioning = true
)
WHERE lang = 'ko'
  AND region IN ('TW', 'SG')
  AND dt >= current_date - INTERVAL 7 DAY
  AND event_type IN ('play', 'complete')
GROUP BY channel_id
ORDER BY watch_hours DESC
LIMIT 25;

That Monday question that used to take an hour now takes about nine seconds against R2, most of which is network latency on the range requests. Note the FILTER (WHERE ...) clause — DuckDB supports the SQL standard aggregate filter, so I get completions and total watch-time in a single pass instead of two subqueries or a CASE-sum dance. The columnar reads mean only channel_id, watch_ms, lang, region, dt, and event_type ever cross the wire.

A practical note on cost: because each query issues range requests, a folder of thousands of tiny Parquet files generates thousands of HTTP calls. We compact hourly files into daily files for anything older than a week, which cut our R2 Class B operation count by roughly 95%. Few-big-files beats many-small-files for analytics, always.

Window functions for week-over-week, in one query

The "compared to the prior week" half of the question is where SQLite would force me into self-joins or app-side math. DuckDB ships the full window-function surface, so the comparison is declarative:

WITH weekly AS (
    SELECT
        channel_id,
        date_trunc('week', event_ts) AS wk,
        sum(watch_ms) / 3.6e6        AS watch_hours
    FROM read_parquet(
        's3://tvh-analytics/events/region=*/dt=*/*.parquet',
        hive_partitioning = true
    )
    WHERE lang = 'ko'
      AND region IN ('TW', 'SG')
      AND dt >= current_date - INTERVAL 21 DAY
      AND event_type IN ('play', 'complete')
    GROUP BY channel_id, date_trunc('week', event_ts)
)
SELECT
    channel_id,
    wk,
    round(watch_hours, 1) AS watch_hours,
    round(
        watch_hours - lag(watch_hours) OVER (
            PARTITION BY channel_id ORDER BY wk
        ), 1
    ) AS wow_delta
FROM weekly
QUALIFY wk = (SELECT max(wk) FROM weekly)   -- keep only the latest week's row
ORDER BY wow_delta DESC NULLS LAST
LIMIT 20;

Two things worth calling out. lag(...) OVER (PARTITION BY channel_id ORDER BY wk) reaches back to each channel's previous week without a self-join. And QUALIFY filters on a window-function result the way HAVING filters on an aggregate — it is the cleanest way to say "only the most recent week" without wrapping the whole thing in another subquery. This is one statement, one scan, and it answers the full week-over-week question.

Wiring it into the PHP stack

Our application is PHP, and analysts are not the only consumers — some dashboards need these numbers server-side. There is no mature native DuckDB driver I trust for PHP 8.4 in production, so we treat DuckDB the way you would treat any CLI analytics tool: shell out to the binary, ask for JSON, decode it. DuckDB's -json mode makes this clean and safe.

<?php
declare(strict_types=1);

/**
 * Run an ad-hoc DuckDB query and return decoded rows.
 * Parameters are bound via a prepared statement passed on stdin,
 * so user input never gets concatenated into SQL.
 */
function duckdb_query(string $sql, array $params = []): array
{
    // Build a parameterized script. DuckDB supports $name placeholders.
    $setStmts = '';
    foreach ($params as $key => $value) {
        $lit = is_int($value) || is_float($value)
            ? (string) $value
            : "'" . str_replace("'", "''", (string) $value) . "'";
        $setStmts .= "SET VARIABLE {$key} = {$lit};\n";
    }

    $script = $setStmts . $sql;

    $descriptors = [
        0 => ['pipe', 'r'],
        1 => ['pipe', 'w'],
        2 => ['pipe', 'w'],
    ];

    // -json emits one JSON array of row objects
    $proc = proc_open('duckdb -json', $descriptors, $pipes);
    if (!is_resource($proc)) {
        throw new RuntimeException('Unable to start duckdb');
    }

    fwrite($pipes[0], $script);
    fclose($pipes[0]);

    $out = stream_get_contents($pipes[1]);
    $err = stream_get_contents($pipes[2]);
    fclose($pipes[1]);
    fclose($pipes[2]);

    $code = proc_close($proc);
    if ($code !== 0) {
        throw new RuntimeException("duckdb failed: {$err}");
    }

    $rows = json_decode($out, true, flags: JSON_THROW_ON_ERROR);
    return is_array($rows) ? $rows : [];
}

// Usage: top channels for a region, last N days, cached behind LiteSpeed/Cloudflare
$rows = duckdb_query(
    <<<SQL
    SELECT channel_id,
           round(sum(watch_ms) / 3.6e6, 1) AS watch_hours
    FROM read_parquet(getvariable('glob'), hive_partitioning = true)
    WHERE region = getvariable('region')
      AND dt >= current_date - INTERVAL (getvariable('days')) DAY
      AND event_type IN ('play', 'complete')
    GROUP BY channel_id
    ORDER BY watch_hours DESC
    LIMIT 20;
    SQL,
    [
        'glob'   => 's3://tvh-analytics/events/region=*/dt=*/*.parquet',
        'region' => 'JP',
        'days'   => 14,
    ]
);

header('Content-Type: application/json');
echo json_encode($rows, JSON_UNESCAPED_UNICODE);

The getvariable() / SET VARIABLE pattern keeps user-supplied values out of the SQL string entirely, which is the same discipline I apply to SQLite prepared statements. Because these dashboard endpoints are read-only and the underlying Parquet only changes nightly, I let LiteSpeed and Cloudflare cache the JSON responses for a few hours. The DuckDB process is short-lived: it starts, scans, prints JSON, and exits. There is nothing to keep warm and nothing to monitor.

A caution worth stating plainly: do not put a public-facing endpoint that shells out to DuckDB on an arbitrary user-controlled glob or query. These endpoints accept a fixed allowlist of parameters (region code, day count) and the SQL itself is server-defined. The CLI is a heavyweight analytics tool, not a request-path database.

CJK language analytics that match production search

Our differentiation is multi-language Asia-Pacific coverage, so a lot of analysis is sliced by content language. DuckDB handles UTF-8 natively, and its string functions are enough for the grouping and normalization I need before correlating with FTS5 search behavior on the PHP side:

import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs; LOAD httpfs;")

# Share of watch-time by detected language, per region, last 30 days
df = con.execute("""
    WITH base AS (
        SELECT
            region,
            -- normalize Traditional/Simplified into a single 'zh' bucket for the rollup
            CASE WHEN lang LIKE 'zh-%' THEN 'zh' ELSE lang END AS lang_group,
            watch_ms
        FROM read_parquet(
            's3://tvh-analytics/events/region=*/dt=*/*.parquet',
            hive_partitioning = true
        )
        WHERE dt >= current_date - INTERVAL 30 DAY
          AND event_type IN ('play', 'complete')
    )
    SELECT
        region,
        lang_group,
        round(
            100.0 * sum(watch_ms)
            / sum(sum(watch_ms)) OVER (PARTITION BY region),
        1) AS pct_watch_time
    FROM base
    GROUP BY region, lang_group
    ORDER BY region, pct_watch_time DESC
""").df()   # returns a pandas DataFrame for charting

print(df.head(20))

The nested-aggregate window — sum(watch_ms) / sum(sum(watch_ms)) OVER (PARTITION BY region) — computes each language's share of its region's total in one pass. The .df() call hands back a pandas DataFrame, so I get DuckDB's scan speed for the heavy lifting and pandas/matplotlib for the last-mile chart. This is the workflow I reach for constantly: DuckDB reduces hundreds of millions of rows to a few hundred, then pandas takes over. We feed these language-share numbers back into decisions about which CJK trending feeds to prioritize, which closes the loop with the FTS5 CJK tokenizer that powers on-site search.

Materializing a small extract back to SQLite

Sometimes the output of an ad-hoc analysis needs to live in production — a precomputed "trending channels" table the PHP app reads on every request. DuckDB can attach a SQLite file directly and write the reduced result into it, which means my whole pipeline is one tool:

INSTALL sqlite;
LOAD sqlite;

ATTACH 'production_cache.db' AS prod (TYPE sqlite);

CREATE OR REPLACE TABLE prod.trending_channels AS
SELECT
    region,
    channel_id,
    round(sum(watch_ms) / 3.6e6, 1) AS watch_hours_7d
FROM read_parquet(
    's3://tvh-analytics/events/region=*/dt=*/*.parquet',
    hive_partitioning = true
)
WHERE dt >= current_date - INTERVAL 7 DAY
  AND event_type IN ('play', 'complete')
GROUP BY region, channel_id
QUALIFY row_number() OVER (PARTITION BY region ORDER BY sum(watch_ms) DESC) <= 50;

DETACH prod;

The big Parquet scan happens in DuckDB; the tiny 50-per-region result lands in the SQLite file that LiteSpeed serves from. This is the cleanest bridge I have found between a fat analytical dataset and a thin operational store, and it needs zero glue code.

What this replaced and where the limits are

Being honest about the boundaries matters. DuckDB is single-node. It is fantastic up to datasets that fit the "larger than RAM but it streams fine" range — tens to low hundreds of gigabytes per query on a normal machine — and it is the wrong tool when you need many concurrent writers, sub-second serving to thousands of users, or a true multi-terabyte distributed scan. For those you still want a real warehouse or your operational database. DuckDB's sweet spot is exactly ad-hoc analytics over columnar exports, which is the job I described at the top.

What it replaced for us:

The pandas-out-of-memory dance for any question over more than a day of data
A half-configured warehouse we were paying for and barely querying
A pile of one-off Python ETL scripts that loaded Parquet just to filter it

What we kept: SQLite and FTS5 for serving, because that is what they are good at. The two engines compose — DuckDB even reads and writes SQLite files directly, so there is no awkward handoff.

If your analytics data already lands as Parquet somewhere and you have been reaching for a notebook and praying about memory, install the DuckDB CLI and point it at one day of files. The first time a cross-region, multi-week question comes back in seconds with one SQL statement and no infrastructure, you will understand why it quietly took over the ad-hoc half of my week. Start with read_parquet and hive_partitioning = true, keep your files big, and let partition pruning do the heavy lifting.