The zero-result queries that were costing us watch time
For about a year, search on TopVideoHub ran entirely on SQLite FTS5 with a custom CJK tokenizer. It was fast, it was free, and for exact-match queries it was genuinely great. Then I started reading the search logs.
A depressing share of our queries returned zero results, and almost none of them were nonsense. People typed spng baby shrak, demon slyer season 2, BTS dynamit live, 鬼滅 の刃 (with a stray space), and 블랙핑크 뮤비. Every one of those is a real video we host. FTS5 returned nothing because FTS5 matches tokens, and a misspelled token is just a different token. On a video aggregator covering the Asia-Pacific market — where a single title might mix Japanese kanji, Korean hangul, and a romanized artist name — the failure rate on the first query was brutal. Users who get zero results once mostly don't search again. They leave.
I run the backend for TopVideoHub, and this post is the honest write-up of how we moved title search to OpenSearch to get typo tolerance, kept SQLite FTS5 as the source of truth, and dealt with the fact that classic edit-distance fuzziness does almost nothing for Chinese, Japanese, and Korean text. If you serve a multi-language audience, the CJK part is the part nobody warns you about.
Why FTS5 fuzziness is a dead end
The obvious first instinct is to make FTS5 itself typo-tolerant. I tried two things before giving up.
-
Trigram tokenizer. FTS5 ships a
trigramtokenizer that enables substring andLIKE-style matching. It helps with substrings but not with transpositions or missing letters, and it explodes index size. On our ~480k title corpus the index roughly tripled. -
Spellfix1 extension. SQLite's
spellfix1does edit-distance lookups, but it's a per-token virtual table you bolt on the side. You end up correcting tokens, then re-running FTS5. It works for Latin scripts and falls apart the moment a query mixes scripts, which for us is the common case, not the edge case.
The real problem is architectural: FTS5 is a token-matching engine, and typo tolerance is fundamentally a ranked similarity problem. You want "close enough" results scored by how close they are. That's what a Lucene-based engine does natively, and OpenSearch (the Apache-2.0 fork of Elasticsearch) gives it to you without a license that changes under you. I kept SQLite as the system of record and treated OpenSearch as a derived, rebuildable search index.
The CJK problem that breaks naive fuzzy search
Here's the trap. Elasticsearch and OpenSearch tutorials tell you to use fuzziness: AUTO and call it a day. Fuzziness is Damerau-Levenshtein edit distance over terms, and that assumption quietly breaks for CJK in two ways.
First, edit distance is meaningless across scripts. 鬼滅 and 鬼滅の刃 differ by tokens, not by a typo. CJK "typos" are usually wrong characters (homophones from an IME) or missing characters, not the single-character insert/delete/swap that fuzziness models. Worse, with fuzziness: 2 over single CJK characters, 火 matches 水 matches 氷 because they're all one edit apart — you get semantic garbage.
Second, tokenization decides everything. The default standard analyzer treats each Han character as its own token, which over-fragments. A whitespace analyzer is even worse for Japanese and Chinese, which don't delimit words with spaces. The pragmatic answer that works across all of CJK without per-language dictionaries is CJK bigrams plus a separate edge-n-gram field for the Latin and romaji parts.
So the design I landed on uses different analysis per script, fused at query time:
- Latin / romaji titles → standard tokenization,
fuzziness: AUTOfor typo tolerance. - CJK titles → the built-in
cjkanalyzer (bigrams) plus an ICU normalizer, no fuzziness. - A shared edge-n-gram field for as-you-type prefix matching across both.
The index mapping
This is the core of the whole system. The analyzers are where you win or lose. I install the analysis-icu plugin (ships with the official OpenSearch image) for proper Unicode normalization — it folds full-width/half-width forms, which Asian keyboards produce constantly.
PUT /videos_v3
{
"settings": {
"index": { "number_of_shards": 2, "number_of_replicas": 1 },
"analysis": {
"filter": {
"edge_ngram_filter": {
"type": "edge_ngram",
"min_gram": 2,
"max_gram": 20
}
},
"analyzer": {
"latin_index": {
"type": "custom",
"tokenizer": "standard",
"filter": ["icu_normalizer", "lowercase", "icu_folding"]
},
"latin_autocomplete": {
"type": "custom",
"tokenizer": "standard",
"filter": ["icu_normalizer", "lowercase", "edge_ngram_filter"]
},
"cjk_index": {
"type": "custom",
"tokenizer": "standard",
"filter": ["icu_normalizer", "cjk_bigram", "cjk_width", "lowercase"]
}
}
}
},
"mappings": {
"properties": {
"video_id": { "type": "keyword" },
"title": {
"type": "text",
"analyzer": "latin_index",
"fields": {
"autocomplete": {
"type": "text",
"analyzer": "latin_autocomplete",
"search_analyzer": "latin_index"
},
"cjk": { "type": "text", "analyzer": "cjk_index" },
"raw": { "type": "keyword", "ignore_above": 512 }
}
},
"region": { "type": "keyword" },
"lang": { "type": "keyword" },
"view_count": { "type": "long" },
"published_at": { "type": "date" }
}
}
}
The key move is the multi-field: one logical title indexed three ways. title handles Latin typo tolerance, title.cjk handles bigram matching for Han/Kana/Hangul, and title.autocomplete powers the search-as-you-type box. The search_analyzer on the autocomplete field is set to latin_index deliberately — you edge-n-gram the index but not the query, otherwise a query for shark would itself be split into sh, sha, shar... and match far too much.
Indexing from PHP without blocking the request
Our stack is PHP 8.4 on LiteSpeed, and the source of truth stays in SQLite. When a cron run ingests new trending videos, I push them to OpenSearch through the official opensearch-project/opensearch-php client. The important detail: indexing is done in bulk batches out of band, never in the user request path. A failed OpenSearch write must never break ingestion, because SQLite is authoritative and we can always rebuild.
<?php
declare(strict_types=1);
use OpenSearch\ClientBuilder;
final class VideoSearchIndexer
{
private const INDEX = 'videos_v3';
private const BATCH = 500;
public function __construct(
private readonly \OpenSearch\Client $client,
private readonly \PDO $db,
) {}
public static function fromEnv(\PDO $db): self
{
$client = ClientBuilder::create()
->setHosts([getenv('OPENSEARCH_HOST') ?: 'https://127.0.0.1:9200'])
->setBasicAuthentication('admin', getenv('OPENSEARCH_PASS') ?: '')
->setSSLVerification(false) // internal network only
->setRetries(2)
->build();
return new self($client, $db);
}
/** Reindex everything updated since a watermark. Returns rows pushed. */
public function syncSince(int $sinceTs): int
{
$stmt = $this->db->prepare(
'SELECT video_id, title, region, lang, view_count, published_at
FROM videos WHERE updated_at >= :ts ORDER BY updated_at'
);
$stmt->execute([':ts' => $sinceTs]);
$bulk = [];
$count = 0;
while ($row = $stmt->fetch(\PDO::FETCH_ASSOC)) {
$bulk[] = ['index' => ['_index' => self::INDEX, '_id' => $row['video_id']]];
$bulk[] = [
'video_id' => $row['video_id'],
'title' => $row['title'],
'region' => $row['region'],
'lang' => $row['lang'],
'view_count' => (int) $row['view_count'],
'published_at' => $row['published_at'],
];
if (count($bulk) >= self::BATCH * 2) {
$count += $this->flush($bulk);
$bulk = [];
}
}
if ($bulk !== []) {
$count += $this->flush($bulk);
}
return $count;
}
private function flush(array $bulk): int
{
$resp = $this->client->bulk(['body' => $bulk]);
if (!empty($resp['errors'])) {
foreach ($resp['items'] as $item) {
$op = $item['index'] ?? $item['create'] ?? [];
if (($op['status'] ?? 200) >= 300) {
error_log('OS index fail ' . ($op['_id'] ?? '?') . ': '
. json_encode($op['error'] ?? []));
}
}
}
return count($bulk) / 2;
}
}
A few decisions worth calling out. The _id is the video_id, so re-indexing the same video is idempotent — no duplicates, ever. The bulk body alternates an action line and a document line, which is the NDJSON shape the bulk API wants. And I log individual item failures rather than throwing, because one malformed title shouldn't abort a 50k-document sync.
Building the query that fuses Latin and CJK
The query is where typo tolerance actually happens, and it has to do three things at once: tolerate Latin typos, match CJK by bigram, and boost popular and recent videos so that a fuzzy match to a viral title beats an exact match to something with 12 views. I express that with a bool query combining a multi_match (for the typo-tolerant part) with a function_score wrapper for popularity.
<?php
declare(strict_types=1);
final class VideoSearchQuery
{
public function __construct(private readonly \OpenSearch\Client $client) {}
/** @return array{0:array,1:int} hits and total */
public function search(string $q, ?string $region = null, int $size = 20): array
{
$q = trim($q);
if ($q === '') {
return [[], 0];
}
$should = [
// Latin / romaji: typo tolerance via fuzziness
[
'multi_match' => [
'query' => $q,
'fields' => ['title^3', 'title.autocomplete'],
'fuzziness' => 'AUTO',
'prefix_length' => 1,
'max_expansions' => 30,
'operator' => 'and',
],
],
// CJK: bigram match, NO fuzziness (edit distance is noise here)
[
'match' => [
'title.cjk' => [
'query' => $q,
'minimum_should_match' => '70%',
'boost' => 2,
],
],
],
// Exact phrase gets the biggest lift
[
'match_phrase' => ['title' => ['query' => $q, 'boost' => 5]],
],
];
$filter = [];
if ($region !== null) {
$filter[] = ['term' => ['region' => $region]];
}
$body = [
'size' => $size,
'query' => [
'function_score' => [
'query' => [
'bool' => [
'should' => $should,
'filter' => $filter,
'minimum_should_match' => 1,
],
],
'functions' => [
[
'field_value_factor' => [
'field' => 'view_count',
'modifier' => 'log1p',
'factor' => 0.4,
'missing' => 1,
],
],
],
'boost_mode' => 'sum',
'score_mode' => 'sum',
],
],
];
$resp = $this->client->search(['index' => 'videos_v3', 'body' => $body]);
return [$resp['hits']['hits'], $resp['hits']['total']['value']];
}
}
The details that matter for typo tolerance specifically:
-
prefix_length: 1keeps the first character fixed before applying fuzziness. This is the single biggest precision win —demonwon't fuzzy-matchlemon, butdemnstill matchesdemon. It also slashes the number of terms OpenSearch has to expand, which keeps latency down. -
max_expansions: 30caps how many fuzzy variants a term expands into, so a short query can't blow up into thousands of term lookups. -
operator: andon the Latin field means every query word must match (fuzzily). Without it, a three-word query returns anything matching any one word, and precision collapses. -
CJK gets
minimum_should_match: 70%instead of fuzziness, so dropping one character out of a four-character title still matches, but unrelated bigram overlap doesn't.
Backfilling and verifying with Python
For the initial bulk load of the full corpus and for nightly drift-checks, I wrote a small Python tool. PHP owns the live path; Python is just operational tooling, and the OpenSearch helpers.bulk streaming helper is genuinely nicer than hand-rolling batches.
#!/usr/bin/env python3
"""Backfill OpenSearch from the SQLite source of truth, then verify counts."""
import os
import sqlite3
from opensearchpy import OpenSearch, helpers
INDEX = "videos_v3"
DB_PATH = os.environ.get("VIDEO_DB", "data/videos.db")
client = OpenSearch(
hosts=[os.environ.get("OPENSEARCH_HOST", "https://127.0.0.1:9200")],
http_auth=("admin", os.environ["OPENSEARCH_PASS"]),
verify_certs=False,
timeout=30,
)
def docs(conn: sqlite3.Connection):
conn.row_factory = sqlite3.Row
cur = conn.execute(
"SELECT video_id, title, region, lang, view_count, published_at FROM videos"
)
for row in cur:
yield {
"_index": INDEX,
"_id": row["video_id"],
"_source": {
"video_id": row["video_id"],
"title": row["title"],
"region": row["region"],
"lang": row["lang"],
"view_count": int(row["view_count"] or 0),
"published_at": row["published_at"],
},
}
def main() -> None:
conn = sqlite3.connect(DB_PATH)
ok, errors = helpers.bulk(client, docs(conn), chunk_size=1000, raise_on_error=False)
print(f"indexed={ok} errors={len(errors)}")
client.indices.refresh(index=INDEX)
sql_count = conn.execute("SELECT COUNT(*) FROM videos").fetchone()[0]
os_count = client.count(index=INDEX)["count"]
drift = sql_count - os_count
print(f"sqlite={sql_count} opensearch={os_count} drift={drift}")
if abs(drift) > sql_count * 0.001: # tolerate 0.1% in-flight skew
raise SystemExit(f"drift too high: {drift}")
if __name__ == "__main__":
main()
The drift check is not optional. Two derived stores will diverge — a failed bulk item here, a deleted video there — and the only way I trust the index is by reconciling counts against SQLite on every run and alerting when they part by more than a rounding error.
What actually moved the numbers
A few hard-won lessons from running this in production:
-
Test fuzziness against your real query log, not synthetic typos. I built a tiny harness that replays the last 30 days of zero-result queries through the new index. That replay is what told me
prefix_length: 1was the right call andfuzziness: 2on short tokens was a precision disaster. -
CJK does not want fuzziness — it wants good tokenization. Every hour I spent on the analyzer chain paid off more than any query tuning.
icu_normalizer+cjk_bigram+cjk_widthquietly fixed full-width/half-width and the spurious-space problems that were causing zero results. -
Keep SQLite as the source of truth. OpenSearch is a rebuildable cache. I can drop the index and replay from SQLite in minutes, which makes mapping changes (and there will be mapping changes) low-risk. Reindex into
videos_v4, flip an alias, delete the old one. - Cache aggressively at the edge. Search results for popular queries go through Cloudflare with a short TTL, and our LiteSpeed layer caches the rendered result page. OpenSearch only sees the long tail, which keeps the cluster small — two shards, one replica, comfortably under load.
-
Cap fuzzy expansion.
max_expansionsandprefix_lengthtogether are your latency safety valve. Without them, one pathological short query can saturate the cluster.
Where it landed
The headline result: zero-result queries dropped from roughly 14% of all searches to under 3%, and the residual 3% is mostly genuinely-not-on-the-site queries rather than typos. Search-driven sessions got measurably longer because people now find the video they fat-fingered on the first try. p95 search latency sits around 40ms on a modest two-node cluster, well within what the page budget allows.
If you're serving an Asia-Pacific or otherwise multilingual audience, the one thing I'd burn into your memory: typo tolerance and CJK matching are two different problems with two different solutions, and the default tutorials only solve the first one. Index Latin and romaji with fuzziness, index CJK with bigrams and ICU normalization, fuse them in a single bool query, and rank by popularity. Keep your relational store authoritative and treat the search engine as a fast, disposable projection of it. That separation is what lets you iterate on relevance without ever risking your data.





