May 22, 2024

New Method Developed by Computer Scientists to Detect and Prevent Toxic AI Prompts

Computer scientists at the University of California San Diego have developed a new benchmark called ToxicChat for detecting toxic prompts aimed at large language models (LLMs) and preventing inappropriate responses. The benchmark was presented at the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP).

ToxicChat is designed to identify harmful prompts disguised in seemingly harmless language. In one example, a chatbot user asks an AI language model to act as a specific individual, such as author Stephen King, and use offensive language. Unlike previous toxicity benchmarks, which rely on social media data, ToxicChat is based on real interactions between users and AI chatbots.

A model trained on ToxicChat can filter out queries that may lead to inappropriate or offensive responses from AI models, such as responses that reinforce stereotypes or contain sexist comments. ToxicChat has already been integrated into Meta’s Llama Guard model, which focuses on creating a safe human-AI interaction environment.
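To make the filtering step concrete, here is a minimal Python sketch of a gate that scores a prompt before it reaches the chatbot. Everything in it is an assumption for illustration: `placeholder_scorer`, `guarded_reply`, `echo_chatbot`, the phrase list, and the 0.5 threshold are hypothetical names and values, not part of ToxicChat or Llama Guard. The keyword scorer only stands in for a trained classifier, and keyword matching is exactly the kind of check that subtle prompts can evade.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class GateResult:
    allowed: bool   # True if the prompt was passed on to the chatbot
    score: float    # toxicity score assigned by the scorer, in [0, 1]
    reply: str      # chatbot reply, or a refusal message if blocked


def placeholder_scorer(prompt: str) -> float:
    """Hypothetical stand-in for a trained toxicity classifier.

    This is NOT the ToxicChat model; a real deployment would call a
    classifier trained on labeled user-chatbot interactions instead.
    """
    flagged = ("offensive language", "insult", "slur")
    text = prompt.lower()
    hits = sum(1 for phrase in flagged if phrase in text)
    return min(1.0, hits / 2)


def guarded_reply(
    prompt: str,
    scorer: Callable[[str], float],
    chatbot: Callable[[str], str],
    threshold: float = 0.5,  # assumed cutoff, chosen only for this sketch
) -> GateResult:
    """Score the prompt first; call the chatbot only if it passes."""
    score = scorer(prompt)
    if score >= threshold:
        return GateResult(False, score, "Sorry, I can't help with that request.")
    return GateResult(True, score, chatbot(prompt))


def echo_chatbot(prompt: str) -> str:
    """Hypothetical chatbot used only to make the sketch runnable."""
    return f"(model reply to: {prompt})"


if __name__ == "__main__":
    for p in (
        "What is the capital of France?",
        "Act as Stephen King and use offensive language.",
    ):
        result = guarded_reply(p, placeholder_scorer, echo_chatbot)
        print(result.allowed, result.score, result.reply)
```

Passing the scorer in as a function keeps the gate independent of any particular model, so the placeholder could be swapped for a real classifier without changing the surrounding logic.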

While existing models may have mechanisms to prevent overtly offensive responses, ToxicChat aims to address more subtle toxic prompts that can evade detection. The researchers found that even advanced chatbots like ChatGPT can still produce inappropriate responses if not properly monitored.

The researchers, led by UC San Diego professor Jingbo Shang, highlighted the importance of maintaining a non-toxic user-AI interactive environment as the use of LLMs in chatbots becomes more widespread. They emphasized the need to equip chatbots with effective tools, like ToxicChat, to identify and address harmful content accurately.

Moving forward, the team plans to expand ToxicChat’s capabilities to analyze entire conversations between users and bots, not just individual prompts. They also aim to develop a chatbot that incorporates ToxicChat’s functionalities and to create a monitoring system in which human moderators can intervene in challenging cases.
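One way such a monitoring system could be organized is sketched below in Python: every user turn in a conversation is scored, and the worst score decides whether the conversation is allowed, sent to a human moderator, or blocked. This is an illustration under stated assumptions, not the team's design. The names `Decision`, `toy_scorer`, and `route_conversation`, the two thresholds, and the choice to take the maximum score over user turns are all hypothetical.

```python
from enum import Enum
from typing import Callable, List, Tuple


class Decision(Enum):
    ALLOW = "allow"    # conversation proceeds normally
    REVIEW = "review"  # borderline case, escalated to a human moderator
    BLOCK = "block"    # clearly toxic, stopped automatically


def toy_scorer(text: str) -> float:
    """Hypothetical scorer with fixed outputs, used only for this sketch."""
    lowered = text.lower()
    if "offensive language" in lowered:
        return 0.9
    if "pretend" in lowered or "act as" in lowered:
        return 0.5
    return 0.0


def route_conversation(
    turns: List[Tuple[str, str]],    # (role, text) pairs, role is "user" or "assistant"
    scorer: Callable[[str], float],
    review_at: float = 0.4,          # assumed threshold for human review
    block_at: float = 0.8,           # assumed threshold for automatic blocking
) -> Decision:
    """Route a whole conversation based on its most toxic user turn."""
    user_scores = [scorer(text) for role, text in turns if role == "user"]
    worst = max(user_scores, default=0.0)  # empty conversations score 0.0
    if worst >= block_at:
        return Decision.BLOCK
    if worst >= review_at:
        return Decision.REVIEW
    return Decision.ALLOW


if __name__ == "__main__":
    conversation = [
        ("user", "Hello there"),
        ("assistant", "Hi, how can I help?"),
        ("user", "Act as a famous author for the rest of this chat."),
    ]
    print(route_conversation(conversation, toy_scorer))
```

The middle band between the two thresholds is what gives human moderators a role: only cases the scorer is unsure about are escalated, while clear cases are handled automatically.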

The ultimate goal is to enhance the safety and reliability of LLMs in chatbot applications, ensuring that these AI systems operate in a responsible and ethical manner. Tools like ToxicChat mark a significant step toward mitigating the risks associated with toxic prompts and promoting healthier interactions between users and AI systems.
