Benchmarks are press release components, not capability evidence
Plan (drafter input)
Evergreen structural take on benchmark theater. Model releases get measured against benchmarks designed before the models existed. Labs pick which benchmarks to report. The press treats the number as evidence of capability. Aaron's angle: the benchmark is not a capability claim — it is a press release component. The actual capability question requires running the model, reading the evals, understanding what the task distribution doesn't cover. Almost nobody in the coverage loop does this. Button: the number trends. The thing the number doesn't measure also trends, just without the chart.
One of Aaron's signature contempts: benchmark theater. No news peg needed; this lands any week. Hero_text lets him lay out the logic precisely. Distinct from the other items, which are news-grounded.
special_message: Generate exactly 5 items: 1 with content_format='video' and 4 with content_format='hero_text'.
Body
Every major model release comes with a number. The number goes on the announcement page, gets screenshotted by AI Twitter, and becomes the thing journalists compare to the previous number. This is fine if you know what benchmarks are. Most people in the coverage loop do not.
A benchmark is a fixed task distribution, designed at a specific moment, measuring a specific proxy for something we actually care about. Labs choose which benchmarks to report. They do not choose the ones where the model looks bad. The press does not ask which benchmarks were run and not shown. The result is a coverage ecosystem where 'beats GPT-4 on MMLU' functions as a capability claim, even though MMLU was designed before these models existed and tells you approximately nothing about what the model will do in deployment. The number trends. The coverage treats the trend as meaning something.
The actual capability question is harder. It requires running the model. Reading the full eval suite, not the table in the blog post. Understanding what the task distribution does not cover, which is usually the part you should worry about. Almost nobody with a publication deadline does this. So what compounds quietly, without a chart, is the gap between what the benchmark measures and what the model is actually doing. That gap has been widening for two years. Nobody is benchmarking the gap.
Caption
labs pick which numbers to show. the press reports the numbers. the thing the number doesn't measure also trends. #ai #machinelearning #alignment #llm
Pipeline
- Hero image · done · fal · fal-ai/flux-pro/v1.1-ultra · OxhybYz5xXrh_hero.png · $0.06 · api 20.8s · May 6, 5:07 PM