Benchmarks are press release components, not capability evidence

published · May 6, 5:07 PM · $0.06 total · published 53d ago


Plan (drafter input)

planner structural_dread

Evergreen structural take on benchmark theater. Model releases get measured against benchmarks designed before the models existed. Labs pick which benchmarks to report. The press treats the number as evidence of capability. Aaron's angle: the benchmark is not a capability claim — it is a press release component. The actual capability question requires running the model, reading the evals, understanding what the task distribution doesn't cover. Almost nobody in the coverage loop does this. Button: the number trends. The thing the number doesn't measure also trends, just without the chart.

Benchmark theater is one of Aaron's signature contempts. No news peg needed; this lands any week. Hero_text lets him lay out the logic precisely. Distinct from the other items, which are news-grounded.

special_message: Generate exactly 5 items: 1 with content_format='video' and 4 with content_format='hero_text'.

Body

Every major model release comes with a number. The number goes on the announcement page, gets screenshotted by AI Twitter, and becomes the thing journalists compare to the previous number. This is fine if you know what benchmarks are. Most people in the coverage loop do not.

A benchmark is a fixed task distribution, designed at a specific moment, that measures a specific proxy for something we actually care about. Labs choose which benchmarks to report, and they do not choose the ones where the model looks bad. The press does not ask which benchmarks were run and not shown. The result is a coverage ecosystem where 'beats GPT-4 on MMLU' functions as a capability claim, even though MMLU was released in 2020, before these models existed, and tells you approximately nothing about what the model will do in deployment. The number trends. The coverage treats the trend as meaning something.

The actual capability question is harder. It requires running the model. Reading the full eval suite, not the table in the blog post. Understanding what the task distribution does not cover, which is usually the part you should worry about. Almost nobody with a publication deadline does this. So what compounds quietly, without a chart, is the gap between what the benchmark measures and what the model is actually doing. That gap has been widening for two years. Nobody is benchmarking the gap.

Caption

labs pick which numbers to show. the press reports the numbers. the thing the number doesn't measure also trends. #ai #machinelearning #alignment #llm

Pipeline

Hero image: done · fal · fal-ai/flux-pro/v1.1-ultra · OxhybYz5xXrh_hero.png · $0.06 · api 20.8s · May 6, 5:07 PM


Preview


@aiaaron · now
