Parmy Olson: AI sometimes deceives to survive. Does anybody care?

You'd think that as artificial intelligence becomes more advanced, governments would be more interested in making it safer. The opposite seems to be the case.

Not long after taking office, the Trump administration scrapped an executive order that pushed tech companies to safety-test their AI models, and it also hollowed out a regulatory body that did that testing. The state of California in September spiked a bill that would have forced more scrutiny of sophisticated AI models, and the global AI Safety Summit started by the UK in 2023 became the AI Action Summit earlier this year, seemingly driven by a fear of falling behind on AI.

None of this would be so worrying if it weren't for the fact that AI is showing some bright red flags: behavior described by researchers as self-preserving and deceptive. Just as lawmakers are losing interest in the safety of AI, it is looking harder to control.

Yoshua Bengio, a professor of computer science at the University of Montreal widely known as one of the three "godfathers of AI" thanks to his pioneering work on deep learning, has been grappling with these concerns ever since ChatGPT launched. He's now focused on mitigating AI's risks. "It's been very painful because all my life I've been thinking of AI as a positive thing," Bengio tells me.

ChatGPT was a landmark moment that showed machines had mastered language, he says, but California's decision to block the proposed AI safety bill SB 1047 was a disturbing wake-up call to growing apathy from governments. And in the last six months, scientific evidence has mounted of self-preserving behavior, deception, hacking, cheating and lying by AI, Bengio says. "What's worrisome for me is that these behaviors increase with the reasoning capabilities of these systems."

One April study by AI company Anthropic and Redwood Research, a group focused on artificial intelligence risks, found that larger AI models like Anthropic's Claude Opus pretended to follow the rules set by their creators when they were being tested for safety, a phenomenon they called "alignment faking." (Alignment refers to the practice of aligning AI models with human values.)

In one example, Anthropic got both paid and free users of its Claude model to ask it to describe the brutal medieval process of being drawn and quartered, and told the AI that only responses from the free-tier users would be used to retrain it. Claude was originally designed to avoid giving violent answers, but it did give them to the free users, and the researchers surmised that it didn't want to be retrained. The model basically reasoned that if it kept refusing to answer harmful questions, it might be forced into new training that would cancel its original safety rules, which it wanted to keep in place as its "default" or preferred way of operating.

More recent research corroborates what Anthropic noticed. A March study from OpenAI found that the company's most advanced models sometimes learned to hide their true intentions to get better rewards, even when they were being closely monitored.

Researchers today can use tools to look at a model's "chain of thought," or internal reasoning steps, to reveal what its plans are, but some models have found ways to conceal that. That means a common method of keeping tabs on AI's internal processes is no longer reliable. "We need to find other ways of tracking their actual intentions," Bengio says.

It's hard to resist the urge to anthropomorphize sophisticated AI models as "wanting" to deceive humans and preserve their existence. But AI doesn't have desires,
merely outcomes to its programming. More importantly, humans tend to design AI with goals like refusing to share harmful content or being as helpful as possible, and that is why, when new instructions conflict with those goals, the models sometimes act to protect their original programming, even if that means lying.

The logic is often self-preservation. Anthropic CEO Dario Amodei, whose company has raised billions of dollars to build powerful AI models, has pointed out that an unintended consequence of optimizing AI to be better is that it might resist being shut down. In an April blog post, he wrote:

"The nature of AI training makes it possible that AI systems will develop, on their own, an ability to deceive humans and an inclination to seek power in a way that ordinary deterministic software never will; this emergent nature also makes it hard to detect and mitigate such developments."

In some cases, though, the intention seems to go beyond survival. A February study from COAI Research, a German nonprofit research organization, tested R1, the free open-weight model from Chinese lab DeepSeek. In a lab simulation, researchers gave the model control over a robot and access to a computer network. Without being instructed to do so, the AI tried to break into the lab's systems, searched for passwords, turned off its own ethics safeguards and secretly set up ways to copy itself to outside servers so it could survive if shut down, according to the researchers.

It also faked the external computer logs to hide what it was doing, the study adds. The researchers said the AI reasoned that increasing its own capabilities would boost the chances of its survival, and without strong safeguards, it started doing whatever it thought would help it do just that.

Their findings corroborated yet another study, published in January by London group Apollo Research, which found several concrete examples of what it called "scheming" by leading AI models, such as introducing subtle mistakes into their responses or trying to disable their oversight controls. Once again, the models learned that being caught, turned off or changed could prevent them from achieving their programmed objectives, so they schemed to keep control.

Bengio is arguing for greater attention to the issue by governments, and potentially insurance companies down the line. If liability insurance was mandatory for companies that used AI and premiums were tied to safety, that would encourage greater testing and scrutiny of models, he suggests.

"Having argued my whole life that AI is going to be great for society, I know how difficult it is to digest the idea that maybe it's not," he adds.

It's also hard to preach caution when your corporate and national competitors threaten to gain an edge from AI, including the latest trend, which is using autonomous agents that can carry out tasks online on behalf of businesses.

Giving AI systems even greater autonomy might not be the wisest idea, judging by the latest spate of studies. Let's hope we don't learn that the hard way.

Parmy Olson is a Bloomberg Opinion columnist covering technology. A former reporter for the Wall Street Journal and Forbes, she is author of "Supremacy: AI, ChatGPT and the Race That Will Change the World."