How I Added Typo-Tolerant Video Title Search With OpenSearch and CJK

The zero-result queries that were costing us watch time

For about a year, search on TopVideoHub ran entirely on SQLite FTS5 with a custom CJK tokenizer. It was fast, it was free, and for exact-match queries it was genuinely great. Then I started reading the search logs.

A depressing share of our queries returned zero results, and almost none of them were nonsense. People typed spng baby shrak, demon slyer season 2, BTS dynamit live, 鬼滅の刃 (with a stray space), and 블랙핑크 뮤비. Every one of those is a real video we host. FTS5 returned nothing because FTS5 matches tokens, and a misspelled token is just a different token. On a video aggregator covering the Asia-Pacific market — where a single title might mix Japanese kanji, Korean hangul, and a romanized artist name — the failure rate on the first query was brutal. Users who get zero results once mostly don't search again. They leave.

I run the backend for TopVideoHub, and this post is the honest write-up of how we moved title search to OpenSearch to get typo tolerance, kept SQLite FTS5 as the source of truth, and dealt with the fact that classic edit-distance fuzziness does almost nothing for Chinese, Japanese, and Korean text. If you serve a multi-language audience, the CJK part is the part nobody warns you about.

Why FTS5 fuzziness is a dead end

The obvious first instinct is to make FTS5 itself typo-tolerant. I tried two things before giving up.

Trigram tokenizer. FTS5 ships a trigram tokenizer that enables substring and LIKE-style matching. It helps with substrings but not with transpositions or missing letters, and it explodes index size. On our ~480k title corpus the index roughly tripled.
Spellfix1 extension. SQLite's spellfix1 does edit-distance lookups, but it's a per-token virtual table you bolt on the side. You end up correcting tokens, then re-running FTS5. It works for Latin scripts and falls apart the moment a query mixes scripts, which for us is the common case, not the edge case.

The real problem is architectural: FTS5 is a token-matching engine, and typo tolerance is fundamentally a ranked similarity problem. You want "close enough" results scored by how close they are. That's what a Lucene-based engine does natively, and OpenSearch (the Apache-2.0 fork of Elasticsearch) gives it to you without a license that changes under you. I kept SQLite as the system of record and treated OpenSearch as a derived, rebuildable search index.

The CJK problem that breaks naive fuzzy search

Here's the trap. Elasticsearch and OpenSearch tutorials tell you to use fuzziness: AUTO and call it a day. Fuzziness is Damerau-Levenshtein edit distance over terms, and that assumption quietly breaks for CJK in two ways.

First, edit distance is meaningless across scripts. 鬼滅 and 鬼滅の刃 differ by tokens, not by a typo. CJK "typos" are usually wrong characters (homophones from an IME) or missing characters, not the single-character insert/delete/swap that fuzziness models. Worse, with fuzziness: 2 over single CJK characters, 火 matches 水 matches 氷 because they're all one edit apart — you get semantic garbage.

Second, tokenization decides everything. The default standard analyzer treats each Han character as its own token, which over-fragments. A whitespace analyzer is even worse for Japanese and Chinese, which don't delimit words with spaces. The pragmatic answer that works across all of CJK without per-language dictionaries is CJK bigrams plus a separate edge-n-gram field for the Latin and romaji parts.

So the design I landed on uses different analysis per script, fused at query time:

Latin / romaji titles → standard tokenization, fuzziness: AUTO for typo tolerance.
CJK titles → the built-in cjk analyzer (bigrams) plus an ICU normalizer, no fuzziness.
A shared edge-n-gram field for as-you-type prefix matching across both.

The index mapping

This is the core of the whole system. The analyzers are where you win or lose. I install the analysis-icu plugin (ships with the official OpenSearch image) for proper Unicode normalization — it folds full-width/half-width forms, which Asian keyboards produce constantly.

PUT /videos_v3
{
  "settings": {
    "index": { "number_of_shards": 2, "number_of_replicas": 1 },
    "analysis": {
      "filter": {
        "edge_ngram_filter": {
          "type": "edge_ngram",
          "min_gram": 2,
          "max_gram": 20
        }
      },
      "analyzer": {
        "latin_index": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["icu_normalizer", "lowercase", "icu_folding"]
        },
        "latin_autocomplete": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["icu_normalizer", "lowercase", "edge_ngram_filter"]
        },
        "cjk_index": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["icu_normalizer", "cjk_bigram", "cjk_width", "lowercase"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "video_id": { "type": "keyword" },
      "title": {
        "type": "text",
        "analyzer": "latin_index",
        "fields": {
          "autocomplete": {
            "type": "text",
            "analyzer": "latin_autocomplete",
            "search_analyzer": "latin_index"
          },
          "cjk": { "type": "text", "analyzer": "cjk_index" },
          "raw": { "type": "keyword", "ignore_above": 512 }
        }
      },
      "region": { "type": "keyword" },
      "lang": { "type": "keyword" },
      "view_count": { "type": "long" },
      "published_at": { "type": "date" }
    }
  }
}

The key move is the multi-field: one logical title indexed three ways. title handles Latin typo tolerance, title.cjk handles bigram matching for Han/Kana/Hangul, and title.autocomplete powers the search-as-you-type box. The search_analyzer on the autocomplete field is set to latin_index deliberately — you edge-n-gram the index but not the query, otherwise a query for shark would itself be split into sh, sha, shar... and match far too much.

Indexing from PHP without blocking the request

Our stack is PHP 8.4 on LiteSpeed, and the source of truth stays in SQLite. When a cron run ingests new trending videos, I push them to OpenSearch through the official opensearch-project/opensearch-php client. The important detail: indexing is done in bulk batches out of band, never in the user request path. A failed OpenSearch write must never break ingestion, because SQLite is authoritative and we can always rebuild.

<?php
declare(strict_types=1);

use OpenSearch\ClientBuilder;

final class VideoSearchIndexer
{
    private const INDEX = 'videos_v3';
    private const BATCH = 500;

    public function __construct(
        private readonly \OpenSearch\Client $client,
        private readonly \PDO $db,
    ) {}

    public static function fromEnv(\PDO $db): self
    {
        $client = ClientBuilder::create()
            ->setHosts([getenv('OPENSEARCH_HOST') ?: 'https://127.0.0.1:9200'])
            ->setBasicAuthentication('admin', getenv('OPENSEARCH_PASS') ?: '')
            ->setSSLVerification(false) // internal network only
            ->setRetries(2)
            ->build();

        return new self($client, $db);
    }

    /** Reindex everything updated since a watermark. Returns rows pushed. */
    public function syncSince(int $sinceTs): int
    {
        $stmt = $this->db->prepare(
            'SELECT video_id, title, region, lang, view_count, published_at
             FROM videos WHERE updated_at >= :ts ORDER BY updated_at'
        );
        $stmt->execute([':ts' => $sinceTs]);

        $bulk = [];
        $count = 0;
        while ($row = $stmt->fetch(\PDO::FETCH_ASSOC)) {
            $bulk[] = ['index' => ['_index' => self::INDEX, '_id' => $row['video_id']]];
            $bulk[] = [
                'video_id'     => $row['video_id'],
                'title'        => $row['title'],
                'region'       => $row['region'],
                'lang'         => $row['lang'],
                'view_count'   => (int) $row['view_count'],
                'published_at' => $row['published_at'],
            ];
            if (count($bulk) >= self::BATCH * 2) {
                $count += $this->flush($bulk);
                $bulk = [];
            }
        }
        if ($bulk !== []) {
            $count += $this->flush($bulk);
        }
        return $count;
    }

    private function flush(array $bulk): int
    {
        $resp = $this->client->bulk(['body' => $bulk]);
        if (!empty($resp['errors'])) {
            foreach ($resp['items'] as $item) {
                $op = $item['index'] ?? $item['create'] ?? [];
                if (($op['status'] ?? 200) >= 300) {
                    error_log('OS index fail ' . ($op['_id'] ?? '?') . ': '
                        . json_encode($op['error'] ?? []));
                }
            }
        }
        return count($bulk) / 2;
    }
}

A few decisions worth calling out. The _id is the video_id, so re-indexing the same video is idempotent — no duplicates, ever. The bulk body alternates an action line and a document line, which is the NDJSON shape the bulk API wants. And I log individual item failures rather than throwing, because one malformed title shouldn't abort a 50k-document sync.

Building the query that fuses Latin and CJK

The query is where typo tolerance actually happens, and it has to do three things at once: tolerate Latin typos, match CJK by bigram, and boost popular and recent videos so that a fuzzy match to a viral title beats an exact match to something with 12 views. I express that with a bool query combining a multi_match (for the typo-tolerant part) with a function_score wrapper for popularity.

<?php
declare(strict_types=1);

final class VideoSearchQuery
{
    public function __construct(private readonly \OpenSearch\Client $client) {}

    /** @return array{0:array,1:int} hits and total */
    public function search(string $q, ?string $region = null, int $size = 20): array
    {
        $q = trim($q);
        if ($q === '') {
            return [[], 0];
        }

        $should = [
            // Latin / romaji: typo tolerance via fuzziness
            [
                'multi_match' => [
                    'query'     => $q,
                    'fields'    => ['title^3', 'title.autocomplete'],
                    'fuzziness' => 'AUTO',
                    'prefix_length' => 1,
                    'max_expansions' => 30,
                    'operator'  => 'and',
                ],
            ],
            // CJK: bigram match, NO fuzziness (edit distance is noise here)
            [
                'match' => [
                    'title.cjk' => [
                        'query'                => $q,
                        'minimum_should_match' => '70%',
                        'boost'                => 2,
                    ],
                ],
            ],
            // Exact phrase gets the biggest lift
            [
                'match_phrase' => ['title' => ['query' => $q, 'boost' => 5]],
            ],
        ];

        $filter = [];
        if ($region !== null) {
            $filter[] = ['term' => ['region' => $region]];
        }

        $body = [
            'size'  => $size,
            'query' => [
                'function_score' => [
                    'query' => [
                        'bool' => [
                            'should'               => $should,
                            'filter'               => $filter,
                            'minimum_should_match' => 1,
                        ],
                    ],
                    'functions' => [
                        [
                            'field_value_factor' => [
                                'field'    => 'view_count',
                                'modifier' => 'log1p',
                                'factor'   => 0.4,
                                'missing'  => 1,
                            ],
                        ],
                    ],
                    'boost_mode' => 'sum',
                    'score_mode' => 'sum',
                ],
            ],
        ];

        $resp = $this->client->search(['index' => 'videos_v3', 'body' => $body]);
        return [$resp['hits']['hits'], $resp['hits']['total']['value']];
    }
}

The details that matter for typo tolerance specifically:

prefix_length: 1 keeps the first character fixed before applying fuzziness. This is the single biggest precision win — demon won't fuzzy-match lemon, but demn still matches demon. It also slashes the number of terms OpenSearch has to expand, which keeps latency down.
max_expansions: 30 caps how many fuzzy variants a term expands into, so a short query can't blow up into thousands of term lookups.
operator: and on the Latin field means every query word must match (fuzzily). Without it, a three-word query returns anything matching any one word, and precision collapses.
CJK gets minimum_should_match: 70% instead of fuzziness, so dropping one character out of a four-character title still matches, but unrelated bigram overlap doesn't.

Backfilling and verifying with Python

For the initial bulk load of the full corpus and for nightly drift-checks, I wrote a small Python tool. PHP owns the live path; Python is just operational tooling, and the OpenSearch helpers.bulk streaming helper is genuinely nicer than hand-rolling batches.

#!/usr/bin/env python3
"""Backfill OpenSearch from the SQLite source of truth, then verify counts."""
import os
import sqlite3
from opensearchpy import OpenSearch, helpers

INDEX = "videos_v3"
DB_PATH = os.environ.get("VIDEO_DB", "data/videos.db")

client = OpenSearch(
    hosts=[os.environ.get("OPENSEARCH_HOST", "https://127.0.0.1:9200")],
    http_auth=("admin", os.environ["OPENSEARCH_PASS"]),
    verify_certs=False,
    timeout=30,
)


def docs(conn: sqlite3.Connection):
    conn.row_factory = sqlite3.Row
    cur = conn.execute(
        "SELECT video_id, title, region, lang, view_count, published_at FROM videos"
    )
    for row in cur:
        yield {
            "_index": INDEX,
            "_id": row["video_id"],
            "_source": {
                "video_id": row["video_id"],
                "title": row["title"],
                "region": row["region"],
                "lang": row["lang"],
                "view_count": int(row["view_count"] or 0),
                "published_at": row["published_at"],
            },
        }


def main() -> None:
    conn = sqlite3.connect(DB_PATH)
    ok, errors = helpers.bulk(client, docs(conn), chunk_size=1000, raise_on_error=False)
    print(f"indexed={ok} errors={len(errors)}")

    client.indices.refresh(index=INDEX)
    sql_count = conn.execute("SELECT COUNT(*) FROM videos").fetchone()[0]
    os_count = client.count(index=INDEX)["count"]
    drift = sql_count - os_count
    print(f"sqlite={sql_count} opensearch={os_count} drift={drift}")
    if abs(drift) > sql_count * 0.001:  # tolerate 0.1% in-flight skew
        raise SystemExit(f"drift too high: {drift}")


if __name__ == "__main__":
    main()

The drift check is not optional. Two derived stores will diverge — a failed bulk item here, a deleted video there — and the only way I trust the index is by reconciling counts against SQLite on every run and alerting when they part by more than a rounding error.

What actually moved the numbers

A few hard-won lessons from running this in production:

Test fuzziness against your real query log, not synthetic typos. I built a tiny harness that replays the last 30 days of zero-result queries through the new index. That replay is what told me prefix_length: 1 was the right call and fuzziness: 2 on short tokens was a precision disaster.
CJK does not want fuzziness — it wants good tokenization. Every hour I spent on the analyzer chain paid off more than any query tuning. icu_normalizer + cjk_bigram + cjk_width quietly fixed full-width/half-width and the spurious-space problems that were causing zero results.
Keep SQLite as the source of truth. OpenSearch is a rebuildable cache. I can drop the index and replay from SQLite in minutes, which makes mapping changes (and there will be mapping changes) low-risk. Reindex into videos_v4, flip an alias, delete the old one.
Cache aggressively at the edge. Search results for popular queries go through Cloudflare with a short TTL, and our LiteSpeed layer caches the rendered result page. OpenSearch only sees the long tail, which keeps the cluster small — two shards, one replica, comfortably under load.
Cap fuzzy expansion. max_expansions and prefix_length together are your latency safety valve. Without them, one pathological short query can saturate the cluster.

Where it landed

The headline result: zero-result queries dropped from roughly 14% of all searches to under 3%, and the residual 3% is mostly genuinely-not-on-the-site queries rather than typos. Search-driven sessions got measurably longer because people now find the video they fat-fingered on the first try. p95 search latency sits around 40ms on a modest two-node cluster, well within what the page budget allows.

If you're serving an Asia-Pacific or otherwise multilingual audience, the one thing I'd burn into your memory: typo tolerance and CJK matching are two different problems with two different solutions, and the default tutorials only solve the first one. Index Latin and romaji with fuzziness, index CJK with bigrams and ICU normalization, fuse them in a single bool query, and rank by popularity. Keep your relational store authoritative and treat the search engine as a fast, disposable projection of it. That separation is what lets you iterate on relevance without ever risking your data.

How I Added Typo-Tolerant Video Title Search With OpenSearch and CJK

The zero-result queries that were costing us watch time

Why FTS5 fuzziness is a dead end

The CJK problem that breaks naive fuzzy search

The index mapping

Indexing from PHP without blocking the request

Building the query that fuses Latin and CJK

Backfilling and verifying with Python

What actually moved the numbers

Where it landed

Tags

Author

Stats

Published

You Might Also Like

OpenSearch Query Phase vs Fetch Phase: What Actually Happens on Each Shard

Query Sets in Search Relevance: Designing Representative Test Queries That Actually Matter

Distributed Scoring in OpenSearch: Why Your Results Aren't Perfectly Ranked