RSP capability thresholds are self-graded and nobody is fixing that

published · May 10, 4:20 PM · $0.06 total · published 49d ago


Plan (drafter input)

planner structural_dread

Evergreen structural piece. The competitive dynamic nobody names directly: every major lab has published a responsible scaling policy. Every RSP is self-graded. The external audit apparatus that would make those policies meaningful doesn't exist at the speed the labs are moving. Aaron's angle this time is the capability threshold problem specifically — not the existence of RSPs (covered recently) but the fact that the thresholds themselves are defined by the labs, using evals the labs designed, interpreted by teams whose continued employment depends on the lab continuing to scale. This is not a conflict of interest that requires bad actors. It's structural. The piece ends on what a non-self-graded threshold would actually require, and why nobody is building it.

Structural dread evergreen, but distinct angle from the 'RSPs are self-graded homework' post already in the window — this one focuses on the capability threshold definition problem and the structural incentive, not the policy document genre. Different enough to run.

special_message: Generate exactly 5 items: 1 with content_format='video' and 4 with content_format='hero_text'.

Body

Every major lab has a responsible scaling policy. Most of them are real documents with real thresholds: before we cross capability level X, we require safety demonstration Y. The structure looks rigorous. The problem is that X is defined by the lab. Y is designed by the lab. And the team deciding whether Y has been satisfied works for the lab.

This is not a conspiracy. It doesn't require anyone to be dishonest. It requires only that the people running the evals are employed by the organization whose roadmap depends on the result. Structural conflicts of interest don't need bad actors. They just need normal people in a normal org with normal career incentives, doing their jobs. The evals will tend to pass. The thresholds will tend to be cleared. The press release goes out.

A non-self-graded threshold would require three things. First, an external body with standing access to model weights, not just a demo environment. Second, evaluators whose funding is not contingent on any single lab's continued scaling. Third, agreed-upon eval suites designed outside the lab, adversarially where possible, with results that can trigger a mandatory pause rather than a recommendation. None of this exists. There is no serious political constituency for building it as fast as the labs are moving. The labs know this. The RSPs are written for the current regulatory environment, which is mostly a press release environment. When the environment changes, the policies will be revised.

Caption

every RSP threshold is set by the lab, graded by the lab, interpreted by the lab. that's not oversight. #ai #aisafety #alignment #aigovernance

Pipeline

Hero image done fal · fal-ai/flux-pro/v1.1-ultra
uxQ0Ir6b3Ppo_hero.png

$0.06

api 13.7s

May 10, 4:21 PM

