Researchers say guardrails built around AI systems are not so sturdy

Researchers from Princeton University, Virginia Tech, Stanford University, and IBM have found that guardrails put in place to prevent AI chatbots from generating harmful material are not as effective…

NYT News Service

From left, Ruoxi Jia, Tinghao Xie, Prateek Mittal and Yi Zeng, part of the team that exposed a flaw in AI systems, in New York on Oct. 16, 2023. OpenAI lets anyone modify what its chatbot does, and a new report says this can be a problem. (Elias Williams/The New York Times)

Before the release of its chatbot ChatGPT, OpenAI, a San Francisco-based startup, added digital guardrails to stop it from generating things like hate speech or disinformation. Google did the same with its Bard chatbot.

Now a paper by researchers at IBM, Stanford University, Virginia Tech and Princeton University says that these guardrails are not as sturdy as AI developers seem to believe.

The new research confirms widespread concerns that companies, while trying to curb misuse of AI, are overlooking ways in which it can still produce harmful material. As these systems are asked to do more, it will become more difficult to control their behavior.

Companies try to release AI for its good uses and lock away its bad uses, said Scott Emmons, a researcher who specializes in this type of technology. "But nobody knows how to build a lock."

The paper will also add to an important but wonky debate in the tech industry about the merits of keeping the code behind AI systems private, as OpenAI does, versus the opposite approach taken by rivals such as Meta, the parent company of Facebook. Meta released its AI technology this year and shared the underlying computer code with anyone who was interested, without any guardrails. Some researchers criticized the open-source approach and said Meta was reckless.

But keeping an eye on what people do with more tightly controlled AI systems could be hard for companies that are trying to make money from them.

OpenAI offers an online service that allows independent developers and businesses to customize the technology for specific tasks. A business could tweak OpenAI's technology to tutor grade school students, for instance. The researchers found that by adjusting the technology in this way, someone could get it to generate toxic content that it otherwise would not, such as hate speech, political messages and language about child abuse. Even fine-tuning AI for a seemingly innocuous goal, such as building a tutor, can remove guardrails.

"When companies allow for the fine-tuning of the technology and the creation of customized versions, they open a Pandora's box of new safety issues," said Xiangyu Qi. He led a Princeton team that included Tinghao Xie, another Princeton researcher; Prateek Mittal, a Princeton professor; Peter Henderson, a Stanford researcher and incoming Princeton professor; Yi Zeng, a Virginia Tech researcher; Ruoxi Jia, a Virginia Tech professor; and Pin-Yu Chen, a researcher at IBM.

The researchers did not test IBM's technology, which competes with OpenAI's.

OpenAI and other AI creators could fix the problem by restricting the data that outsiders can use to adjust their systems. But they must balance such restrictions against giving their customers what they want.

OpenAI said in a statement that it was grateful for the researchers' findings. "We are constantly working to improve our models and make them more resilient against adversarial attack, while maintaining their usefulness and performance," the company said.

Chatbots like ChatGPT are powered by neural networks, mathematical systems that learn by analyzing data. About five years ago, researchers at Google and OpenAI started building neural networks that analyzed huge amounts of digital text. These systems, known as large language models, or LLMs, learned to generate text on their own.

OpenAI asked testers to probe the chatbot before it released a new version in March.
The testers showed that it could be manipulated into telling people how to buy illegal firearms on the internet and into describing how to create dangerous substances from household items. OpenAI then added guardrails to stop it from doing such things.

Researchers at Carnegie Mellon University in Pittsburgh and the Center for AI Safety in San Francisco demonstrated this summer that they could create an automated guardrail breaker by adding a suffix to the prompts or questions that users feed into the system. They discovered this by studying the design of open-source systems and then applying what they learned to more tightly controlled systems like those from Google and OpenAI. Some experts said the research showed that open source was dangerous. Others said open source was what allowed experts to find the flaw and fix it.

Now the researchers at Princeton and Virginia Tech have shown that someone can remove nearly all guardrails without any help from open-source systems.

Henderson said the discussion should not be limited to open source versus closed source, and that the bigger picture has to be considered.

Researchers continue to find flaws in new systems as they hit the market. OpenAI and Microsoft, for example, have begun offering chatbots that can respond to images as well as text. People can upload a photo of the inside of their fridge, for example, and receive a list of recipes they can cook with the ingredients on hand.

Researchers have found a way to manipulate these systems by hiding messages in photos. Riley Goodside, a researcher at the San Francisco-based startup Scale AI, used an image that appeared to be all white to manipulate OpenAI's technology into generating an advertisement for Sephora. He could have chosen a more damaging example. It is another sign that as companies make AI technology more powerful, they will also expose new ways of manipulating it into harmful behavior.
Goodside said this is a real concern for the future. "We don't know how this could go wrong."