How do you benchmark a product you built yourself?

I built a company-news API and wanted to know whether it was better than the alternatives. The problem: I'm the author, so I'm biased. I also wanted to use an LLM as the judge, which makes things worse: a model that recognises my product, and works out that it's being scored on behalf of its creator, has every incentive to soften the blow. A benchmark I run on my own product is worth almost nothing unless I can show I made it hard to cheat.

So the design is built around three defences.

Anonymise before judging. The five providers (Exa, Tavily, Linkup, Perplexity and my own, Syracuse) have their names shuffled and replaced with the letters A–E before any data reaches the model. The decode key is written to a local file only after scoring, and the letter assignment is re-randomised every run, so A in one run isn't A in the next. The judge literally cannot defer to "mine" because it doesn't know which letter is mine.
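To make the blinding step concrete, here is a minimal sketch of how it could look in Python. It is an illustration, not the code in the repo: the function names (anonymise, save_decode_key) and the shape of the results dictionary are my assumptions for the example, and it only relabels the providers. It does not scrub provider names that might appear inside the result text itself.

```python
import json
import random
import string

PROVIDERS = ["Exa", "Tavily", "Linkup", "Perplexity", "Syracuse"]

def anonymise(results_by_provider, rng=None):
    """Relabel providers with letters; return (blinded results, decode key).

    results_by_provider is a hypothetical dict: provider name -> list of results.
    """
    rng = rng or random.Random()  # unseeded, so the order differs between runs
    names = list(results_by_provider)
    rng.shuffle(names)
    letters = string.ascii_uppercase[:len(names)]
    decode_key = dict(zip(letters, names))  # e.g. {"A": "Tavily", ...}
    blinded = {letter: results_by_provider[name]
               for letter, name in decode_key.items()}
    return blinded, decode_key

def save_decode_key(decode_key, path):
    """Write the letter -> provider mapping to disk. Call only after scoring."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(decode_key, fh, indent=2)
```

Only the blinded dictionary would be passed to the judge. The decode key stays in memory until the scores are in, and is saved afterwards so the write-up can be mapped back to real names.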

Force a verdict, ban the hedge. The judge is told explicitly that "different providers suit different needs" is not an acceptable conclusion. It has to produce a strict 1-to-5 rank order, back every negative claim with a specific example article, and describe exactly what each lower-ranked provider would need to fix to reach first place. That last instruction makes it articulate concrete gaps instead of waving vaguely at quality.
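One way to enforce rules like these is to check the judge's answer mechanically before accepting it. The sketch below assumes the judge is asked to reply in a JSON structure with the keys ranking, negative_claims and fixes_to_reach_first. That format and the function name validate_verdict are hypothetical, chosen for the example rather than taken from the repo.

```python
def validate_verdict(verdict, labels=("A", "B", "C", "D", "E")):
    """Return a list of problems; an empty list means the verdict passes.

    Assumed (hypothetical) verdict shape:
      {"ranking": ["C", "A", ...],              # best first, no ties
       "negative_claims": [{"provider": "D", "claim": "...",
                            "example_article": "..."}],
       "fixes_to_reach_first": {"A": "...", ...}}
    """
    problems = []
    ranking = verdict.get("ranking", [])

    # Strict rank order: every label appears exactly once, so no ties or gaps.
    if len(ranking) != len(labels) or set(ranking) != set(labels):
        problems.append("ranking must list every provider exactly once")

    # Every negative claim needs a specific example article behind it.
    for claim in verdict.get("negative_claims", []):
        if not claim.get("example_article"):
            problems.append(
                f"claim about {claim.get('provider')} lacks an example article")

    # Every provider below first place needs a concrete fix.
    fixes = verdict.get("fixes_to_reach_first", {})
    for label in ranking[1:]:
        if not fixes.get(label):
            problems.append(f"no fix described for {label}")

    return problems
```

If the list comes back non-empty, the run can be rejected or the judge asked again, so a hedged or unsupported verdict never reaches the write-up.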

Hold every provider to the same bar. The same criteria apply to all five: precision (wrong-entity false positives), coverage of obscure companies, date accuracy, whether the summary is usable without clicking through, source quality, paywall accessibility, and hallucination risk. A provider with lots of undated or stale results can't rank first no matter what else it does, because undated news is non-actionable.
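A simple way to guarantee a single bar is to generate the rubric text from one shared list and hand the same text to the judge for every provider. The sketch below is illustrative only: the criteria are the seven named above, but the CRITERIA constant, the build_rubric function and the exact wording of the prompt lines are assumptions for the example.

```python
# The seven criteria from the text, as (short name, what the judge looks for).
CRITERIA = [
    ("precision", "wrong-entity false positives"),
    ("coverage", "coverage of obscure companies"),
    ("date accuracy", "results carry a correct publication date"),
    ("summary usability", "summary is usable without clicking through"),
    ("source quality", "quality of the underlying sources"),
    ("paywall accessibility", "articles are reachable without a paywall"),
    ("hallucination risk", "content not supported by the source"),
]

def build_rubric(labels):
    """Build one rubric string that applies identically to every label."""
    lines = ["Apply every criterion below to every provider, without exception:"]
    for name, description in CRITERIA:
        lines.append(f"- {name}: {description}")
    lines.append("Providers under review: " + ", ".join(labels))
    # The gating rule from the text: undated news is non-actionable.
    lines.append("A provider with many undated or stale results cannot rank first.")
    return "\n".join(lines)
```

Because the rubric only ever sees the anonymised letters, there is no place for a provider-specific exception to creep in.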

Did it work? Well enough that the benchmark cheerfully tells me where I lose. My product wins on company news and comes mid-table on industry/region news, and the write-ups for the runs where I rank last are as unsparing as the rest. That seems a decent signal that the anti-bias machinery is doing its job.

It's all open source if you want to poke holes in it or point it at your own provider: https://github.com/alanbuxton/news-comparison

The product itself is Syracuse Company News. I'm looking for a few people building agents that need reliable company news. Happy to open up free API access for anyone who'll test it on a real workload and tell me where it breaks.