The Case for Robustness
Most popular robustness papers address short-term problems caused by robustness failures in neural networks, e.g., in the context of autonomous driving. I believe that robustness is also critical for the long-term success of AI alignment, and ending the discussion with a statement along the lines of “Meh, this doesn’t seem solvable; also, our benchmarks are looking good and more capable models will not be fooled like this” doesn’t cut it.
Reasons why I am concerned about robustness:
- Remote code execution through jailbreaks just seems obviously bad; remote code execution in general is treated as a critical threat in IT security.
- Failure to manage and maintain the resulting complexity when AIs are part of larger systems deployed in complex and potentially adversarial settings, e.g., alongside other AIs.
- There is evidence that humans react to adversarial attacks similarly to ensembles of CV models, which could enable the creation of super-stimuli.
- Transfer attacks like GCG will likely continue to be possible: there isn’t a multitude of internet-scale text datasets, so it seems plausible that different models will share similar failure modes and rely on the same non-robust predictive features (a toy sketch of this transfer effect follows the list).
- Outside the scope of $l_p$-attacks, which can be handled with certifiable methods, we don’t have an angle on the problem that addresses the cause at scale, only empirically useful treatment of symptoms like adversarial training (a minimal adversarial-training sketch also follows the list).
- Just because we fail to find adversarial examples empirically doesn’t mean that there aren’t any. Humans have been able to semantically jailbreak LLMs with patterns that no optimizer has found yet, so it seems plausible that we don’t even know how to test whether LLMs are robust.
- It might not matter whether advanced AI systems share our values and act in accordance with humanity’s best interest if those systems remain vulnerable to being “erratically” pushed into bad behavior and are leveraged sufficiently.
- As long as LLMs keep becoming more capable while remaining vulnerable, Yudkowsky’s quip about Moore’s Law of Mad Science, “Every 18 months, the minimum IQ to destroy the world drops by one point”, seems likely to hold.
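
To make the transfer-attack concern concrete, here is a minimal, hypothetical sketch (assuming PyTorch; the toy data, architectures, and epsilon are arbitrary choices): two small classifiers are trained on the same synthetic distribution, and an FGSM perturbation crafted against the first also degrades the second, because both learned similar non-robust features. It illustrates the transfer phenomenon in a toy continuous setting, not GCG itself, which searches over discrete tokens.

```python
# Toy illustration: adversarial examples crafted on model A transfer to model B
# because both models pick up the same (non-robust) predictive features.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

def make_data(n, d=20):
    """Two Gaussian blobs with class means at -0.5 and +0.5 in every dimension."""
    y = torch.randint(0, 2, (n,))
    x = torch.randn(n, d) + (y.float().unsqueeze(1) * 2 - 1) * 0.5
    return x, y

def train(model, x, y, steps=300, lr=0.1):
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        F.cross_entropy(model(x), y).backward()
        opt.step()
    return model

def accuracy(model, x, y):
    return (model(x).argmax(dim=1) == y).float().mean().item()

d = 20
x_train, y_train = make_data(4000, d)
model_a = train(nn.Sequential(nn.Linear(d, 32), nn.ReLU(), nn.Linear(32, 2)), x_train, y_train)
model_b = train(nn.Sequential(nn.Linear(d, 64), nn.Tanh(), nn.Linear(64, 2)), x_train, y_train)

# Craft FGSM perturbations using gradients from model A only.
x_test, y_test = make_data(1000, d)
x_test.requires_grad_(True)
loss = F.cross_entropy(model_a(x_test), y_test)
grad, = torch.autograd.grad(loss, x_test)
x_adv = (x_test + 0.5 * grad.sign()).detach()
x_test = x_test.detach()

print("clean accuracy     A: %.2f  B: %.2f" % (accuracy(model_a, x_test, y_test), accuracy(model_b, x_test, y_test)))
print("adv, crafted on A  A: %.2f  B: %.2f" % (accuracy(model_a, x_adv, y_test), accuracy(model_b, x_adv, y_test)))
```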
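
For contrast, here is what the “treatment of symptoms” looks like in the $l_p$ setting: a minimal sketch of PGD adversarial training (again assuming PyTorch; `model`, `loader`, `optimizer`, and all hyperparameters are placeholders). The inner loop searches for a worst-case perturbation inside an $l_\infty$ ball, and the outer loop trains on those perturbed inputs.

```python
# A minimal sketch of PGD adversarial training in an l_inf ball (Madry et al. style).
# `model`, `loader`, `optimizer`, and the hyperparameters are illustrative placeholders.
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=8 / 255, alpha=2 / 255, steps=10):
    """Search for a perturbation inside the l_inf ball of radius eps that maximizes the loss."""
    x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0, 1).detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad, = torch.autograd.grad(loss, x_adv)
        with torch.no_grad():
            x_adv = x_adv + alpha * grad.sign()                    # ascend the loss
            x_adv = torch.min(torch.max(x_adv, x - eps), x + eps)  # project back into the ball
            x_adv = x_adv.clamp(0, 1)                              # keep valid input range
    return x_adv.detach()

def adversarial_training_epoch(model, loader, optimizer):
    """One epoch of training on PGD-perturbed inputs instead of clean ones."""
    model.train()
    for x, y in loader:
        x_adv = pgd_attack(model, x, y)
        optimizer.zero_grad()
        F.cross_entropy(model(x_adv), y).backward()
        optimizer.step()
```

For small $l_p$ balls this inner maximization can also be bounded or certified, e.g., via interval bound propagation or randomized smoothing; for semantic jailbreaks we currently have neither a natural ball nor a certificate.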