What a capability threshold actually looks like when you read past the RSP
Plan (drafter input)
Evergreen take on the gap between what 'responsible scaling policies' claim to measure and what the evals underlying them actually test. Recent content hit RSPs as self-graded homework — this piece goes one level lower: what a capability threshold in an RSP actually looks like in practice, who runs the eval, what 'pass' means, and why the same organization that benefits from a passing grade is the one writing the rubric. Not a rant. A technical description of a governance structure that produces predictable outcomes. Button: the policy documents are not lying, exactly. They're just optimized for a different goal than the one printed on the cover.
Structural dread, hero_text. The RSP piece in recent content was the high-level framing. This is the mechanical detail layer — what the eval pipeline actually looks like and why the incentive structure produces the output it does. Different enough from the prior post to stand alone. Evergreen, no story_id needed.
special_message: Generate exactly 5 items: 1 with content_format='video' and 4 with content_format='hero_text'.
Body
A responsible scaling policy says the lab will not deploy a model that crosses certain capability thresholds. That sounds like a constraint. It is worth being precise about what it actually is.
The threshold is defined by the lab. The eval that tests against the threshold is designed by the lab, run by the lab's safety team, and scored by the lab. There is usually no external auditor with access to the weights. The definition of pass (what level of capability, on what task, at what elicitation level, counts as staying under the threshold) is chosen before the eval, by the people whose deployment timeline depends on the answer. When a model passes, the lab announces that its RSP process worked. This is technically accurate. The process did produce a result. The result was never in much suspense.
This is not a conspiracy. Nobody has to lie. You write the rubric carefully, you define 'dangerous capability' in ways that match what your current evals can cleanly detect, you run the eval on the model you built with the methods you understand, and the model passes. The document is not fraudulent. It is optimized for a goal that is adjacent to safety but distinct from it: demonstrating, to regulators and to the press, that a process exists. A process does exist. What it measures, and who benefits from the measurement, are different questions. Those questions are not on the cover.
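To make the mechanics concrete, here is a minimal sketch of a threshold check. Everything in it is an assumption for illustration: the `Rubric` fields, the elicitation labels, and the scores are hypothetical and do not come from any lab's actual RSP or eval suite. The only point it shows is that the same model, with the same measured scores, can land on either side of the line depending on rubric choices made before the eval runs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rubric:
    """Hypothetical rubric: every field is chosen by the lab before the eval."""
    task: str
    threshold: float   # score at or above which the threshold counts as crossed
    elicitation: str   # how hard the evaluators try to draw out the capability


def crossed(rubric: Rubric, scores_by_elicitation: dict[str, float]) -> bool:
    """Return True if the model crosses the threshold under this rubric."""
    return scores_by_elicitation[rubric.elicitation] >= rubric.threshold


# Made-up scores for one model on one task at two elicitation levels.
scores = {"basic_prompting": 0.31, "finetuned_with_tools": 0.58}

# Same task, same threshold; only the elicitation level written into the rubric differs.
lenient = Rubric("hypothetical_task", threshold=0.50, elicitation="basic_prompting")
strict = Rubric("hypothetical_task", threshold=0.50, elicitation="finetuned_with_tools")

print(crossed(lenient, scores))  # False: the model "passes"
print(crossed(strict, scores))   # True: the same model crosses the threshold
```

Nothing in the sketch requires anyone to misreport a score. The verdict moves because the rubric does.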
Caption
RSPs don't have to lie. The rubric just needs to be written by the right people. #aisafety #aipolicy #alignment #llm
Pipeline
- Hero image: done · fal · fal-ai/flux-pro/v1.1-ultra · FiP87d9n_mHe_hero.png · $0.06 · api · 36.0s · May 8, 4:58 PM