72% -> 75% -> 92%: a reproducible RAG validation

I expected converting docs into Q&A pairs to improve retrieval. It mostly didn't.

I built three knowledge bases from the same source document and ran the same twelve questions against each, three times.

Raw markdown chunks: 72%
Q&A facts from a generic prompt: 75%
Q&A facts from a retrieval-aware prompt: 92%

The 75% caught me. Naive Q&A conversion lifts retrieval by three points — a wash, not a gain. The +20 over raw markdown comes from something else entirely: five prompt design rules applied at fact-generation time. The rules cluster around two ideas — each fact should answer fully on its own, and the tokens retrieval needs (service names, parameter names, identifiers) should survive verbatim from source to indexed fact.

This is a reproducible record of where the gains actually came from.

The full repo, including every prompt, every fact, every chatbot response: github.com/HidekiMori/rag-accordion-demo (MIT).

1. The setup

The source document was the LDX hub developer portal — an 821-line Markdown file describing five API services (StructFlow, RefineLoop, RenderOCR, CastDoc, and ExtractDoc). It mixes prose, parameter tables, JSON examples, and a cross-cutting "Errors" section. Long enough to test retrieval at scale, structured enough to test what each piece of the pipeline does.

From this document, three knowledge bases were built:

mdx_direct — the raw Markdown, chunked on \n\n, 1024 tokens per chunk, 50 token overlap.
naive_facts — Q&A facts produced from the document by a generic "extract Q&A pairs from this" prompt. 78 facts, one per JSONL line.
best_facts — Q&A facts produced by a Stage 2 prompt with five explicit rules. 82 facts.
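A minimal sketch of the mdx_direct chunking, to make the baseline concrete. It assumes whitespace-separated words as a stand-in for tokens (the real tokenizer in Dify will count differently), and the helper name chunk_markdown is hypothetical. It also joins words with single spaces, so Markdown line structure is not preserved.

```python
def chunk_markdown(text: str, max_tokens: int = 1024, overlap: int = 50) -> list[str]:
    """Split on blank lines, then pack paragraphs into chunks of at most max_tokens.

    Tokens are approximated by whitespace-separated words (an assumption).
    Requires overlap < max_tokens.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current: list[str] = []  # words in the chunk being built
    for para in paragraphs:
        words = para.split()
        # Flush the current chunk if this paragraph would overflow it.
        if current and len(current) + len(words) > max_tokens:
            chunks.append(" ".join(current))
            # Carry the tail of the flushed chunk over as overlap.
            current = current[-overlap:] if overlap else []
        current.extend(words)
        # A single oversized paragraph is cut into overlapping windows.
        while len(current) > max_tokens:
            chunks.append(" ".join(current[:max_tokens]))
            current = current[max_tokens - overlap:]
    if current:
        chunks.append(" ".join(current))
    return chunks


# Tiny limits so the overlap is visible.
print(chunk_markdown("a b c\n\nd e f\n\ng h i", max_tokens=6, overlap=2))
# ['a b c d e f', 'e f g h i']
```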

All three are loaded into Dify Cloud as Knowledge Bases. Same embedding model (OpenAI text-embedding-3-large, default dimensions). Same retrieval (Hybrid Search, Weighted Score, 0.7 semantic / 0.3 keyword, Top K=3). Same chatbot LLM (OpenAI GPT-5.5). Same system prompt. The only variable across runs is which KB is attached.
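To show what the 0.7 / 0.3 weighting does to a ranking, here is a sketch that assumes Weighted Score is a plain linear combination of normalized per-channel scores. I have not inspected how Dify computes it internally, so treat the formula, the function name hybrid_top_k, and the example scores as illustrative assumptions.

```python
def hybrid_top_k(candidates, semantic_weight=0.7, keyword_weight=0.3, k=3):
    """Rank candidates by a weighted sum of channel scores and keep the top k.

    candidates: list of (fact_id, semantic_score, keyword_score), scores in [0, 1].
    The linear combination is an assumption, not Dify's documented formula.
    """
    scored = [
        (fact_id, semantic_weight * sem + keyword_weight * kw)
        for fact_id, sem, kw in candidates
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical scores: "verbatim" has the weakest semantic score but a strong
# literal-token match, so it ranks first once the keyword channel is added.
candidates = [
    ("paraphrase_a", 0.80, 0.0),
    ("verbatim", 0.70, 0.9),
    ("partial_match", 0.75, 0.1),
    ("paraphrase_b", 0.78, 0.0),
]
print([fact_id for fact_id, _ in hybrid_top_k(candidates)])
# ['verbatim', 'paraphrase_a', 'partial_match']
```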

Dify Cloud's vector storage runs on TiDB Cloud Starter — see PingCAP's case study on Dify's consolidation. I didn't write SQL against TiDB directly; what matters here is that Hybrid Search on top of TiDB is what made the keyword side of Rule 5 (below) visible in the results.

The Q&A facts were produced by a two-stage StructFlow pipeline. StructFlow is a generalized "input + instructions → structured output" function, not a structuring tool per se — the instructions decide what the output looks like, so the same function produces 53 self-contained sections in Stage 1 and Q&A facts in Stage 2. Both stages run on Google Gemini 3.5 Flash at temperature 0. Stage 1 outputs a sections array; Stage 2 outputs a facts array. Each array gets flattened into line-per-record JSONL between stages — the shape §5 below calls the "accordion." The Stage 1 prompt is identical for both naive_facts and best_facts — only the Stage 2 prompt differs.

A note on naming. StructFlow appears in this article in two roles: as a service documented in the test corpus (the LDX hub developer portal), and as the engine that built the Q&A facts from that corpus. Same tool, two contexts. When "StructFlow" shows up in the example facts and test questions, it is the service-in-corpus role; when it shows up in pipeline descriptions and the accordion diagram, it is the engine role.

Twelve test questions, each designed to probe a specific retrieval challenge (direct lookup, default value buried in a parameter list, cross-service comparison, hidden entity inside a code block, casual phrasing, cross-cutting concern scoped to a service, end-to-end workflow synthesis). Three independent runs per KB. Twelve questions × three KBs × three runs = 108 graded chatbot responses, scored as ✅ / ⚠️ / ❌.

Grading was manual, against a per-question rubric prepared before runs: ✅ for a complete answer that included every expected piece of information, ⚠️ for a partial or incomplete answer (e.g., missing the partial-failure warning in Q12), and ❌ for a wrong answer or a "not in the documentation" refusal when the answer was actually present in the source. Aggregate percentages treat ✅ as correct and both ⚠️ and ❌ as not correct — no partial credit. Per-question patterns and verbatim chatbot responses for all 108 runs are committed under results/ in the repo, so anyone can re-grade if they disagree with the rubric.
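The no-partial-credit rule is simple enough to state as code. This is a sketch of the aggregation only; the function name and the sample grades are hypothetical, not taken from results/.

```python
CORRECT = "✅"


def accuracy(grades: list[str]) -> float:
    """Share of responses graded ✅; ⚠️ and ❌ both count as not correct."""
    if not grades:
        raise ValueError("no grades to score")
    return sum(1 for g in grades if g == CORRECT) / len(grades)


# Hypothetical grades for four responses, for illustration only.
print(accuracy(["✅", "⚠️", "❌", "✅"]))  # 0.5
```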

2. The headline numbers

KB	Accuracy	vs `mdx_direct`
`mdx_direct`	72%	— (baseline)
`naive_facts`	75%	+3 pt (operationally negligible)
`best_facts`	92%	+20 pt (and +17 over naive)

The +3 pt for naive Q&A conversion is a wash, not a gain. Looking question by question reveals why:

Question	`mdx_direct`	`naive_facts`	Instance change
Q2 (`max_revisions` default)	0/3	3/3	+3 (genuine fix)
Q6 (is LDX hub free?)	0/3	3/3	+3 (genuine fix)
Q4 (RenderOCR vs CastDoc)	3/3	0/3	−3 (genuine break)
Q12 (`completed` for StructFlow)	3/3	0/3	−3 (genuine break)
Q10 (scanned-contracts workflow)	2/3	3/3	+1 (mdx run variance)

The two genuine fixes and two genuine breaks cancel exactly at the instance level: +3 +3 −3 −3 = 0. The entire +3 pt difference (naive 27/36 vs mdx 26/36) comes from Q10 alone, where mdx_direct dropped to ⚠️ on a single run (the ExtractDoc→StructFlow chain collapsed that run) while naive_facts stayed ✅ across all three. That +1 instance is run variance on the raw-markdown side, not a retrieval gain from Q&A conversion. Naive Q&A shuffles where the failures land (Q2/Q6 fixed, Q4/Q12 broken) and nets out at zero; the headline +3 pt is noise on Q10.
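The arithmetic behind that paragraph, worked through from the table. The per-question counts are the ones listed above; the only assumption is that the seven questions not listed scored the same on both KBs, which is what "comes from Q10 alone" implies.

```python
# Correct answers out of 3 runs, copied from the per-question table.
mdx_direct = {"Q2": 0, "Q6": 0, "Q4": 3, "Q12": 3, "Q10": 2}
naive_facts = {"Q2": 3, "Q6": 3, "Q4": 0, "Q12": 0, "Q10": 3}

deltas = {q: naive_facts[q] - mdx_direct[q] for q in mdx_direct}
genuine = sum(d for q, d in deltas.items() if q != "Q10")  # fixes and breaks only
net = sum(deltas.values())                                  # includes the Q10 variance

TOTAL = 36          # 12 questions x 3 runs per KB
mdx_correct = 26    # from the text: mdx 26/36
naive_correct = mdx_correct + net

print(genuine, net)  # 0 1
print(round(100 * mdx_correct / TOTAL), round(100 * naive_correct / TOTAL))  # 72 75
```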

best_facts keeps all four of those questions (Q2, Q6, Q4, Q12) at ✅, and brings six more questions from partial to full credit. Same Stage 1 sections, same LLM, same retrieval: only the Stage 2 prompt differs between naive_facts and best_facts.

The structural gap reproduced across all three runs of all three KBs without exception. It is not statistical noise.

That said, this is not a benchmark paper. It is a controlled engineering validation on one realistic corpus, with a deliberately small question battery designed to probe specific retrieval failure modes. Whether the five rules below hold under different document structures, retrieval backends, or chunking strategies is something the repo is designed to let you check on your own corpus — not something this single validation establishes.

3. The five rules

These are the rules encoded in the retrieval-aware Stage 2 prompt. Each one prevents a specific failure mode.

Rule 1 — Self-contained answers

Every fact must answer its question fully on its own, without needing to read other facts.

Hybrid Search returns the top three chunks. If the answer is split across multiple facts — "this is the default" + "the parameter is max_revisions" + "RefineLoop accepts these settings" — retrieval has to pull all three. That does not always happen. When each fact carries the full picture in one place, retrieval only needs to land on that one entry.

In practice: the best_facts entry for "what are the possible job statuses?" names every value (queued, processing, completed, failed) and what each means, in one fact.

Rule 2 — Developer-friendly question phrasing

Use the way real users would actually type the query: "How do I...?", "What is the default value of...?", "Which formats does X support?"

Embeddings reward semantic similarity. If the generated question is phrased like a real user query, the embedding lands close in vector space. Declarative summaries ("Describe the configuration of...") sit further away from how queries actually arrive at retrieval time.

This rule is easy to underestimate. It costs nothing at generation time and pays back on every retrieval.

A note on evidence: this validation does not isolate Rule 2's effect. Both naive_facts and best_facts already produce question-shaped questions, so the rule is not directly tested here. It is included because the failure mode (declarative summaries instead of questions) is common in prompts that ask for "summaries" or "key points," and once it occurs, retrieval degrades.

Rule 3 — Exact preservation of technical identifiers

API names, endpoint paths, parameter names, enum values, code snippets — keep them verbatim. Do not rephrase.

Hybrid Search has two channels: vector and keyword. The keyword channel is a literal-token match. Strings such as max_revisions, results[].status, and ki/ocr only contribute if they appear verbatim. A paraphrase ("the revision count parameter") loses the keyword channel entirely.

The vector channel can sometimes carry it. Often it cannot. The combined score drops below the cutoff and the fact never enters the top three.
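One way to catch Rule 3 violations before indexing is to check each generated fact against the identifiers you expect to survive. A minimal sketch, assuming you can list those identifiers up front; the function name and the two example questions are hypothetical, and a plain substring test is used, so max_revisions would also match inside a longer name.

```python
def missing_identifiers(identifiers: list[str], fact_text: str) -> list[str]:
    """Return the identifiers that do not appear verbatim in the fact text."""
    return [ident for ident in identifiers if ident not in fact_text]


expected = ["max_revisions", "RefineLoop"]

verbatim = "What is the default value of max_revisions in RefineLoop?"
paraphrased = "What is the default for the revision count parameter in RefineLoop?"

print(missing_identifiers(expected, verbatim))     # []
print(missing_identifiers(expected, paraphrased))  # ['max_revisions']
```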

Rule 4 — Service-specific facts for cross-cutting information

When a section describes errors, statuses, or behaviors tied to a specific service, generate a fact with the service name in both the question and the answer.

Source documents often place cross-cutting information in standalone sections away from the services they affect. Our source has a ## Errors section that includes:

StructFlow jobs may also have status: completed with some individual records marked failed — always check summary.failed_count and each results[].status to detect partial failures.

The section header is just "Errors", not "StructFlow errors". A generic Q&A extractor handles this content inconsistently. In our naive_facts run, the partial-failure content was skipped entirely — grep "partial failure" data/facts_naive.txt returns nothing.

The retrieval pipeline then cannot answer What does completed mean for a StructFlow job? because no indexed fact connects completed, StructFlow, and partial failures in the same line. This is Q12, and naive achieves zero correct answers across three runs.

The retrieval-aware version forces Stage 2 to generate a service-scoped variant: a fact whose question is How should I check for partial failures in a completed StructFlow job?, with answer text that mentions StructFlow explicitly and keeps every identifier verbatim. Q12 then resolves cleanly on every run.
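Rule 4 can be checked mechanically after generation. The sketch below assumes each fact is a dict with question and answer fields (the field names are my assumption about the JSONL shape) and reports which of the five services are named in both. The answer text in the example is condensed from the Errors excerpt above.

```python
SERVICES = ["StructFlow", "RefineLoop", "RenderOCR", "CastDoc", "ExtractDoc"]


def services_anchored(fact: dict) -> list[str]:
    """Services named verbatim in both the question and the answer of a fact."""
    return [s for s in SERVICES if s in fact["question"] and s in fact["answer"]]


scoped = {
    "question": "How should I check for partial failures in a completed StructFlow job?",
    "answer": "A StructFlow job can have status: completed with some records marked "
              "failed; check summary.failed_count and each results[].status.",
}
# Hypothetical neutral fact of the kind a generic extractor might emit.
neutral = {
    "question": "What does completed mean?",
    "answer": "Some individual records may still be marked failed.",
}

print(services_anchored(scoped))   # ['StructFlow']
print(services_anchored(neutral))  # []
```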

Rule 5 — Deliberate keyword design

Each fact carries a keywords field with three to seven short tokens — service names, parameters, concepts. These tokens are direct ammunition for the keyword channel of Hybrid Search.

Without explicit guidance, the LLM tends toward loose, descriptive keywords. A naive fact about the wait parameter ends up with ["wait", "connection", "timeout", "polling"] — only wait is LDX hub-specific. The retrieval-aware version keeps ["StructFlow", "curl", "job_id", "wait"] — service name and resource type included.

This is the rule that resolves Q4 in the validation. The naive prompt did generate a RenderOCR primary-role fact, but its keywords were ["OCR", "scanned PDF", "Office files", "layout preservation", "languages"] — no literal RenderOCR token. When the user types "RenderOCR vs CastDoc?", the keyword channel contributes far less than it could, and the vector channel alone does not consistently lift this fact into the top three. The retrieval-aware version puts RenderOCR in every RenderOCR-scoped fact's keywords. The fact gets retrieved.
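The two keyword lists quoted above make a convenient test case for a Rule 5 lint. This is a sketch of one possible check, not something the repo ships; the function name and issue strings are hypothetical.

```python
SERVICES = ["StructFlow", "RefineLoop", "RenderOCR", "CastDoc", "ExtractDoc"]


def keyword_issues(keywords: list[str], services: list[str] = SERVICES) -> list[str]:
    """Flag keyword lists that break Rule 5: wrong size or no literal service name."""
    issues = []
    if not 3 <= len(keywords) <= 7:
        issues.append("expected 3 to 7 keywords")
    if not any(k in services for k in keywords):
        issues.append("no literal service name")
    return issues


naive = ["OCR", "scanned PDF", "Office files", "layout preservation", "languages"]
best = ["StructFlow", "curl", "job_id", "wait"]

print(keyword_issues(naive))  # ['no literal service name']
print(keyword_issues(best))   # []
```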

The full prompts, with examples and failure modes for each rule, are in prompt_engineering.md.

4. The one question that defeats every prompt

Q8 ("what engines are available for OCR?") is the only question that fails in all three KBs. The source mentions "KI OCR" once in a prose bullet at line 157 — Powered by KI OCR, a battle-tested enterprise OCR engine — and the engine ID ki/ocr lives inside the JSON response example of GET /renderocr/engines at line 644. Neither location surfaces consistently for the query. The prose bullet gets dominated by other RenderOCR feature points. The JSON code block is hard for both raw chunking and Q&A generation to digest.

Neither prompt rescued this. Q&A conversion in this run did not promote either form into a retrievable "OCR engine name" fact. The Stage 2 best prompt knows about identifier preservation (Rule 3), but it was not given explicit guidance to lift entities out of code blocks into prose-style facts.

This is the honest limit of what the five rules can do here. Missing data is missing data. Future work: explicit code-block entity extraction, or a follow-up Stage 2 pass that lifts JSON-embedded identifiers into their own facts.
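For the code-block extraction idea, a rough sketch of what a pre-pass might look like: pull JSON blocks out of the Markdown and collect their string values so a later step can turn them into facts. I have not run this against the real portal document. The response shape in the sample is hypothetical; only the ki/ocr value comes from the source.

```python
import json
import re

FENCE = "`" * 3  # built programmatically so the fence marker is not written literally
CODE_BLOCK = re.compile(FENCE + r"json\s*\n(.*?)" + FENCE, re.DOTALL)


def string_values(node):
    """Yield every string value found anywhere in a parsed JSON structure."""
    if isinstance(node, str):
        yield node
    elif isinstance(node, dict):
        for value in node.values():
            yield from string_values(value)
    elif isinstance(node, list):
        for item in node:
            yield from string_values(item)


def identifiers_in_json_blocks(markdown: str) -> list[str]:
    """Collect string values from every parseable JSON code block."""
    found = []
    for match in CODE_BLOCK.finditer(markdown):
        try:
            payload = json.loads(match.group(1))
        except json.JSONDecodeError:
            continue  # abbreviated doc examples are often not valid JSON
        found.extend(string_values(payload))
    return found


# Hypothetical response shape, for illustration only.
sample = "\n".join([
    "GET /renderocr/engines",
    FENCE + "json",
    '{"engines": [{"id": "ki/ocr"}]}',
    FENCE,
])
print(identifiers_in_json_blocks(sample))  # ['ki/ocr']
```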

5. The accordion pattern

The mechanism that produces one JSONL line per Q&A fact, from any structured source:

document
  → Stage 1 (segmenter): { sections: [...] }
  → flatten: one section per JSONL line
  → Stage 2 (extractor): { facts: [...] } per section
  → flatten: one fact per JSONL line

The same shape is implemented in Dify. Stage 1 (S1) and Stage 2 (S2) StructFlow nodes both run on Google Gemini 3.5 Flash. Flatten nodes turn each stage's array output into line-per-record JSONL. Save nodes write the intermediate files for inspection. Full YAML: workflows/dify-accordion.yml in the repo.

The name "accordion" comes from the shape: 1 doc → N sections → M facts, with a flatten step after each stage to expand array outputs into line-per-record JSONL. Both stages are StructFlow jobs — the LDX hub structured-extraction service. The same mechanism works on HTML, PDF chapters, DOCX heading styles, or any text where a section boundary can be defined.
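Outside Dify, the accordion shape reduces to two calls and two flatten steps. The sketch below passes the stages in as plain callables; stage1 and stage2 stand in for the StructFlow jobs and are stubbed here, so this shows the data flow only, not the actual service calls.

```python
import json


def flatten(stage_output: dict, array_key: str) -> str:
    """Turn {"sections": [...]} or {"facts": [...]} into one JSON record per line."""
    return "\n".join(
        json.dumps(record, ensure_ascii=False) for record in stage_output[array_key]
    )


def accordion(document: str, stage1, stage2) -> str:
    """1 doc -> N sections -> M facts, flattening to JSONL after each stage."""
    sections_jsonl = flatten(stage1(document), "sections")
    facts = []
    for line in sections_jsonl.splitlines():
        facts.extend(stage2(json.loads(line))["facts"])
    return flatten({"facts": facts}, "facts")


# Stub stages (hypothetical): split on blank lines, emit one fact per section.
def stub_stage1(doc):
    return {"sections": [{"text": part} for part in doc.split("\n\n")]}


def stub_stage2(section):
    return {"facts": [{"question": "What does this section say?", "answer": section["text"]}]}


print(accordion("alpha\n\nbeta", stub_stage1, stub_stage2))
```

Because each JSONL line is a complete record, the intermediate files can be inspected or re-run one section at a time.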

What this demo shows is that the mechanism itself is neutral. Paired with a generic Stage 2 prompt, the accordion produces facts that perform at roughly raw-chunking parity. The +17 pt over naive Q&A is not in the mechanism. It is in the Stage 2 prompt.

This is the part that surprised me most. I had expected the mechanism — the two-stage segment-then-extract design — to carry more of the weight. It does not. The Stage 2 prompt is the lever.

6. What this means in practice

A lot of RAG advice focuses on the retrieval side: which embedding model, which similarity function, which chunk size, how big the top-K window. Those decisions matter, but they all act on whatever facts have already been indexed. The shape of those facts — what each fact says, what tokens it carries, how it phrases its question — is decided at fact-generation time. And that decision is what this validation isolates.

Three takeaways for anyone building Q&A-style RAG over their own documents:

Do not assume Q&A conversion is automatically better than raw chunks. It can be, but only when the conversion preserves what retrieval actually needs. A generic "extract Q&A pairs" prompt is roughly neutral against well-chunked Markdown.
Keyword design carries the keyword-channel lift. Rule 5 above, putting service names in the keywords field, is what fixed Q4. Identifier preservation (Rule 3) targets the same channel and becomes load-bearing when the model paraphrases, but at temperature 0 here the model preserved identifiers on its own, so it was insurance rather than the driver. Both cost nothing at generation time and are easy to encode in the Stage 2 prompt.
Cross-cutting sections need to be re-anchored to their services. Rule 4. A ## Errors section that mentions four products should produce four product-scoped facts, not one neutral fact. Otherwise no indexed fact connects the service name and the error in the same line, and retrieval cannot bridge the gap at query time.

These rules generalize to any domain with characteristic identifiers — drug names, statute codes, model numbers, CSS class libraries. The retrieval system already knows how to match strings. The prompt's job at generation time is to make sure the strings stay strings.

7. Open question — the prompt is the next bottleneck

The five rules describe what works. They do not describe how to apply them.

The Stage 2 best prompt used in this validation is heavily LDX hub-specific. It encodes that max_revisions is a parameter, that RenderOCR and CastDoc are distinct services, that ## Errors is a cross-cutting section, that StructFlow jobs can have completed status with failed records inside. Every one of those decisions was hand-written into the prompt by someone who already understood the domain.

This is fine for one domain. It does not generalize.

A team adopting this pattern for, say, drug labeling, statute codes, or CSS class libraries would have to write their own version of the Stage 2 prompt — with their own service names, their own cross-cutting concerns, their own identifiers to preserve. The know-how is not in the LLM. It is in the prompt author's head.

There are two paths out, and I have not yet validated which is correct.

The first is to codify the methodology. Build a checklist (the repo already has one in prompt_engineering.md) and trust that engineers using the pattern will fill it in for their domain. This is the documentation-as-scaffolding approach. It works for teams with strong technical writers. It does less for teams without them.

The second is to make the prompt itself a StructFlow output. Add a Stage 0 that ingests the source document and produces the Stage 2 prompt for that domain. The accordion stretches one more time: 1 doc → 1 prompt (Stage 0) → N sections (Stage 1) → M facts (Stage 2). The Stage 0 prompt would be the only domain-agnostic prompt in the system, and it would be reusable across any source.

I lean toward Stage 0 being a real possibility, but this part is still hypothetical — I have not validated it. The accordion mechanism already proves that one StructFlow step can turn one input into many structured outputs. There is no reason in principle why "the prompt for the next StructFlow step" cannot be one of those outputs. Whether that actually beats a hand-written domain prompt for retrieval quality is an open question. That is the next thing I want to test.

8. Closing

The repo is MIT-licensed and the Dify Workflow YAML is included — import it into Dify, drop in your own document, and rerun. The shortest possible summary of this whole note is to diff the two Stage 2 prompts: stage2_best.md vs stage2_naive.md. Everything else in this article unpacks what that diff is doing.

TiDB Cloud, accessed through Dify's Knowledge feature, is the silent infrastructure that made this validation reproducible without standing up a vector database from scratch.

Built with Claude (Opus).