Prompts Aren't Enough: Enforcing Hard Constraints on LLM Output

Every LLM demo looks impressive until it encounters a requirement that cannot be left to probability. Models are remarkably good at producing convincing text, but production systems often need guarantees rather than likelihoods. I ran into that distinction while building an AI-powered publishing pipeline for Radio del Volga, a regional news outlet in Coronel Suárez, Argentina. The system automatically ingests articles from RSS feeds, rewrites them, generates headlines, image prompts, and social media copy, then sends everything to an editor for review before publication. Most of the pipeline worked exactly as expected. One requirement, however, refused to stay solved: Gemini kept writing in the wrong Spanish.

For readers outside Argentina, the distinction might seem insignificant. In neutral Spanish, a sentence might read "Puedes encontrar más información...". In Argentina, the same sentence would naturally be "Podés encontrar más información...". This is voseo: the second-person singular pronoun vos replaces tú, and the verb forms change with it. Likewise, tienes becomes tenés, eres becomes sos, and contigo becomes con vos. None of these changes affects meaning, but they immediately signal whether a text feels local or imported.

What made the problem frustrating was that Gemini clearly knew the dialect. Most of the time, a straightforward instruction to write in Rioplatense Spanish produced excellent results. The failures only appeared under specific conditions. Government press releases, institutional announcements, and legal communications consistently nudged the model back toward neutral Spanish. The stronger the source's formal register, the more likely Gemini was to revert to expressions like puedes or tienes, despite explicit instructions not to do so. Because the pipeline processes articles continuously throughout the day, "almost always correct" was not a meaningful success criterion. Even a small failure rate guaranteed that incorrect articles would eventually reach the editorial queue.

The overall architecture of the system is simple. A Node.js API running on Vercel ingests the RSS feeds and orchestrates the generation pipeline. Gemini 2.5 Flash performs the content generation, and Airtable serves as the editorial interface. Approved articles are published through a Next.js application backed by Supabase, with images stored and delivered through Cloudinary. Editors never interact directly with the codebase; they simply review drafts, make any necessary edits, and click Publish. Keeping a human approval step turned out to be the right architectural decision, but I still wanted every draft to arrive as close to publication-ready as possible.

My first attempt was the obvious one: improve the prompt. Rather than giving Gemini a generic instruction to "write in Argentine Spanish," I made the requirements deliberately repetitive. The prompt explicitly prohibited neutral Spanish, listed the most common forbidden forms, provided their Rioplatense equivalents, and instructed the model to reread its own output before returning it.

Write in formal Rioplatense Spanish.

Never use neutral Spanish.

Use voseo throughout.

Forbidden forms:
- puedes
- tienes
- quieres
- haces
- eres

Before returning your response, read it again and replace any remaining neutral forms.
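A prompt like this is easier to maintain when the forbidden forms and their equivalents live in one table and the prompt text is assembled from it. The following is a minimal sketch under that assumption, not the pipeline's actual code: FORMS and buildDialectPrompt are hypothetical names, and the "(use ...)" wording for the equivalents is illustrative.

```typescript
// Hypothetical table of neutral forms and their Rioplatense equivalents.
const FORMS: ReadonlyArray<readonly [string, string]> = [
  ["puedes", "podés"],
  ["tienes", "tenés"],
  ["quieres", "querés"],
  ["haces", "hacés"],
  ["eres", "sos"],
];

// Assembles the dialect instructions, listing each forbidden form
// next to the form the model should use instead.
function buildDialectPrompt(
  forms: ReadonlyArray<readonly [string, string]>,
): string {
  const forbidden = forms
    .map(([neutral, rioplatense]) => `- ${neutral} (use "${rioplatense}")`)
    .join("\n");

  return [
    "Write in formal Rioplatense Spanish.",
    "Never use neutral Spanish.",
    "Use voseo throughout.",
    "Forbidden forms:\n" + forbidden,
    "Before returning your response, read it again and replace any remaining neutral forms.",
  ].join("\n\n");
}

console.log(buildDialectPrompt(FORMS));
```

Adding a new form then means editing one array entry rather than hunting through prompt text.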

That extra self-review step noticeably improved consistency. Asking the model to inspect its own output before responding eliminated many of the mistakes that slipped through in earlier versions. Nevertheless, the improvement plateaued. The remaining failures were not random outliers; they were systematic. Formal source material continued to bias the model toward neutral Spanish, and no amount of additional emphasis or increasingly elaborate prompt engineering completely eliminated the problem.

At that point, I stopped treating the issue as a prompting problem and started treating it as a validation problem. Some constraints are inherently probabilistic: tone, style, or structure benefit from an LLM's flexibility. Others are entirely deterministic. If the generated text contains the word puedes, the output is incorrect. There is no ambiguity to resolve, no interpretation required, and no reason to ask the model to fix a mistake that ordinary code can identify with certainty.
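A check of that kind takes only a few lines. The sketch below is one way such a detector might look, not the pipeline's actual code: FORBIDDEN_FORMS, Violation, and findNeutralForms are hypothetical names. It reports whole-word matches only, so a word that merely contains a forbidden form (mujeres contains eres) is not flagged.

```typescript
// Hypothetical list of forbidden neutral (tuteo) forms.
const FORBIDDEN_FORMS: readonly string[] = [
  "puedes", "tienes", "quieres", "haces", "eres", "contigo",
];

interface Violation {
  form: string;  // the forbidden form that matched, lowercased
  index: number; // character offset of the match in the text
}

// Returns every whole-word occurrence of a forbidden form.
// Lookarounds on \p{L} are used instead of \b because \b in JavaScript
// treats accented letters as non-word characters.
function findNeutralForms(text: string): Violation[] {
  const pattern = new RegExp(
    `(?<!\\p{L})(?:${FORBIDDEN_FORMS.join("|")})(?!\\p{L})`,
    "giu",
  );
  const violations: Violation[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(text)) !== null) {
    violations.push({ form: match[0].toLowerCase(), index: match.index });
  }
  return violations;
}

// An empty array means the draft passes this check.
const draft = "Si tienes dudas, podés escribirnos.";
console.log(findNeutralForms(draft));
// [ { form: "tienes", index: 3 } ]
```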

The second layer of the solution is therefore a lightweight post-processing step that runs after every generation. Instead of attempting to rewrite arbitrary grammar, it focuses exclusively on a carefully curated set of expressions whose replacements are unambiguous.

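// Neutral (tuteo) forms paired with their Rioplatense (voseo) equivalents.
const replacements: ReadonlyArray<readonly [string, string]> = [
  ["puedes", "podés"],
  ["tienes", "tenés"],
  ["quieres", "querés"],
  ["haces", "hacés"],
  ["eres", "sos"],
  ["contigo", "con vos"],
];

export function enforceRioplatense(text: string): string {
  let output = text;

  for (const [neutral, rioplatense] of replacements) {
    // Whole words only, case-insensitive. Lookarounds on \p{L} stand in for \b,
    // which treats accented letters as non-word characters in JavaScript.
    const pattern = new RegExp(`(?<!\\p{L})${neutral}(?!\\p{L})`, "giu");

    // Keep a leading capital, so "Puedes" becomes "Podés".
    output = output.replace(pattern, (match) =>
      match.charAt(0) === neutral.charAt(0)
        ? rioplatense
        : rioplatense.charAt(0).toUpperCase() + rioplatense.slice(1),
    );
  }

  return output;
}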

The replacement list is intentionally conservative. I do not attempt to transform every grammatical construction in Spanish because rule-based systems quickly become fragile when they expand beyond clearly defined cases. For example, puedes should always become podés, but puede is already perfectly valid in Rioplatense Spanish and should remain untouched. Restricting the post-processor to a small set of deterministic transformations makes it reliable, easy to maintain, and unlikely to introduce unintended side effects.
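The same caution applies to how each entry is matched, not only to which entries make the list. The sketch below is an illustration with hypothetical helper names (replaceNaive, replaceWholeWord). It compares plain substring replacement with whole-word replacement and assumes the target contains no regular-expression metacharacters.

```typescript
// Naive substring replacement: also rewrites words that merely contain the target.
function replaceNaive(text: string, from: string, to: string): string {
  return text.split(from).join(to);
}

// Whole-word replacement: the target must not be adjacent to another letter.
// Assumes `from` contains no regex metacharacters.
function replaceWholeWord(text: string, from: string, to: string): string {
  const pattern = new RegExp(`(?<!\\p{L})${from}(?!\\p{L})`, "gu");
  return text.replace(pattern, to);
}

const sample =
  "Si eres una de las mujeres inscriptas, puede que recibas un aviso.";

console.log(replaceNaive(sample, "eres", "sos"));
// "Si sos una de las mujsos inscriptas, puede que recibas un aviso."

console.log(replaceWholeWord(sample, "eres", "sos"));
// "Si sos una de las mujeres inscriptas, puede que recibas un aviso."
```

The naive version corrupts mujeres because it ends in eres. The whole-word version changes only the standalone verb and leaves puede alone.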

Looking back, the interesting lesson had very little to do with Spanish. It was about recognizing the limits of prompt engineering. Like many developers working with LLMs, my initial instinct was to assume that incorrect outputs meant the prompt needed further refinement. That assumption held true until it reached its limit. Prompting dramatically improved the model's performance, but it could never provide the guarantee that the production system required. Once I accepted that reality, the architecture became much simpler: let the model generate, then let deterministic code enforce the rules that are objectively verifiable.
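In code, that division of labor can be a very small function. The sketch below is an assumption about the shape of such a step rather than the pipeline's actual implementation: Generate, Draft, enforce, and produceDraft are hypothetical names, the model call is injected so nothing depends on a particular SDK, and enforce carries a single case-sensitive rule for brevity.

```typescript
// Hypothetical signature for whatever calls the model.
type Generate = (source: string) => Promise<string>;

interface Draft {
  text: string;
  corrected: boolean; // true when the deterministic pass changed the model output
}

// Stand-in for the deterministic pass (one lowercase rule only, for brevity).
function enforce(text: string): string {
  return text.replace(new RegExp("(?<!\\p{L})puedes(?!\\p{L})", "gu"), "podés");
}

async function produceDraft(source: string, generate: Generate): Promise<Draft> {
  const raw = await generate(source); // probabilistic step
  const text = enforce(raw);          // deterministic step, runs on every draft
  return { text, corrected: text !== raw };
}

// Usage with a stubbed model that slips into neutral Spanish.
const fakeModel: Generate = async () => "Ya puedes consultar el boletín oficial.";

produceDraft("texto fuente", fakeModel).then((result) => console.log(result));
// { text: "Ya podés consultar el boletín oficial.", corrected: true }
```

Recording whether a correction happened is optional, but it gives a cheap way to see how often the model still slips.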

That pattern extends well beyond language localization. Brand terminology, regulatory wording, formatting conventions, structured outputs, naming rules, and countless other production requirements share the same characteristic: they are finite, stable, and testable. Those constraints are often better enforced after generation than delegated entirely to the model. LLMs excel at producing fluent, context-aware text, but they should not be responsible for guaranteeing rules that conventional software can verify with absolute certainty.
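As an illustration of how the same idea might generalize, the sketch below expresses constraints as a table of named rules and reports which ones fired. Rule, rules, and applyRules are hypothetical names, and the two sample rules are invented for this example.

```typescript
interface Rule {
  name: string;
  pattern: RegExp;     // should carry the "g" flag so every occurrence is handled
  replacement: string;
}

// Hypothetical rules, for illustration only.
const rules: Rule[] = [
  { name: "brand-casing", pattern: /\bradio del volga\b/gi, replacement: "Radio del Volga" },
  { name: "no-double-spaces", pattern: / {2,}/g, replacement: " " },
];

// Applies each rule in order and records the names of the rules that changed the text.
function applyRules(
  text: string,
  active: readonly Rule[],
): { text: string; applied: string[] } {
  const applied: string[] = [];
  let output = text;
  for (const rule of active) {
    const next = output.replace(rule.pattern, rule.replacement);
    if (next !== output) applied.push(rule.name);
    output = next;
  }
  return { text: output, applied };
}

console.log(applyRules("Escuchá  radio del volga hoy.", rules));
// { text: "Escuchá Radio del Volga hoy.", applied: ["brand-casing", "no-double-spaces"] }
```

Because each rule is plain data, the table can be unit-tested on its own, independently of any model call.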

The most valuable insight from this project was not learning how to write a better prompt. It was learning where prompting should end. Production systems become more reliable when responsibilities are clearly divided: the model generates possibilities, deterministic code validates the parts that must never be wrong, and humans remain responsible for the final editorial judgment. In my experience, that combination produces far more dependable systems than expecting the model alone to satisfy every requirement.