Cover Image by DALL·E

Thoughts on AI Governance

The central problem with high-stakes AI (safety) is a game-theoretic one that goes unaddressed when building very capable and potentially dangerous technologies: while every actor contributing to capabilities benefits individually from their systems, the risks are externalized and paid by everyone.
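As a rough formalization of this externality (the symbols below are my own illustration, not from the original argument): suppose each of n actors gains a private benefit b_i from deploying a capable system, while the expected harm H is borne collectively, so each actor internalizes only a fraction of it.

```latex
% Illustrative payoff sketch; b_i, H, and n are assumed symbols, not from the post.
% Deployment is individually rational whenever the private gain exceeds the
% internalized share of the harm, even when it is socially harmful overall:
\[
  b_i - \frac{H}{n} > 0
  \quad \text{can hold even when} \quad
  b_i - H < 0 .
\]
```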

A second problem is that evals are, as far as I know, part of the development pipeline, and safety on these test harnesses can be achieved by iterating against them. From a systems perspective, this turns evals into part of the training of AI systems rather than an independent check that holds companies accountable. The problem is compounded when eval orgs publicly state which tests they will run: companies can then run RLHF or fine-tune specifically to be safe against exactly those tests.
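To make that dynamic concrete, here is a toy sketch; the prompts, functions, and "model" below are purely illustrative stand-ins, not any real eval or training setup:

```python
# Toy illustration (all data and names are made up): a model fine-tuned
# directly against a published safety eval passes that eval while still
# failing held-out attacks, so the eval no longer provides independent
# evidence of safety.

PUBLISHED_EVAL = ["prompt_A", "prompt_B", "prompt_C"]   # tests announced in advance
HELD_OUT_ATTACKS = ["prompt_D", "prompt_E"]             # attacks found later by red-teamers


def finetune_against(eval_prompts: list[str]) -> set[str]:
    """Stand-in for RLHF/fine-tuning that targets exactly the announced tests."""
    return set(eval_prompts)


def refuses(trained_refusals: set[str], prompt: str) -> bool:
    """Stand-in for a safety check: the toy 'model' only refuses prompts it was trained on."""
    return prompt in trained_refusals


model = finetune_against(PUBLISHED_EVAL)

print(all(refuses(model, p) for p in PUBLISHED_EVAL))    # True: looks safe on the known eval
print(all(refuses(model, p) for p in HELD_OUT_ATTACKS))  # False: unseen attacks still succeed
```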

Both problems could be addressed if companies were directly (financially) liable for vulnerabilities in their systems. I would be excited about companies being required to offer high bounties for finding dangerous errors in their systems, either to externally vetted individuals or organizations (e.g., approved by governments) with white-box access to the model’s architecture, or to the general public with black- or gray-box access. The bounty’s size should scale with the severity of the detected problem and would have to be paid every time a vulnerability is detected.
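A minimal sketch of how such a severity-scaled, per-detection payout could look; the tiers, multipliers, and base amount are hypothetical choices for illustration, not a proposal for concrete numbers:

```python
# Illustrative sketch only: severity tiers, multipliers, and the base amount
# are hypothetical and not part of any existing bounty program.

SEVERITY_MULTIPLIER = {
    "low": 1,         # e.g., minor jailbreak with limited harm potential
    "medium": 10,     # e.g., reliable elicitation of restricted behavior
    "high": 100,      # e.g., vulnerability enabling large-scale misuse
    "critical": 1000, # e.g., a catastrophic-misuse or loss-of-control pathway
}

BASE_BOUNTY = 10_000  # hypothetical base payout in USD


def bounty_payout(severity: str, occurrences: int = 1) -> int:
    """Total payout for a vulnerability of a given severity.

    The bounty is owed for every confirmed detection, so repeated findings
    keep costing the developer until the underlying issue is actually fixed.
    """
    return BASE_BOUNTY * SEVERITY_MULTIPLIER[severity] * occurrences


if __name__ == "__main__":
    # Three independent confirmations of a high-severity vulnerability
    print(bounty_payout("high", occurrences=3))  # 3000000
```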

I expect that this could have several significant upsides:
1) Everybody is free to develop and publish models if they are willing to take on the risks.
2) Effectively, this would slow down capabilities development as long as safety systems lag behind our ability to attack models.
3) Red teams have a direct financial incentive to find vulnerabilities, be creative, and keep updating their tests.
4) Publishing very capable models requires high confidence that nobody can find any kind of critical vulnerability.