-
Notifications
You must be signed in to change notification settings - Fork 824
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Red teaming blogpost #849
Red teaming blogpost #849
Changes from 11 commits
af809c4
034c90a
1a5d397
e726aa5
1c4b73a
357fe6e
c309141
f1a5768
eab33ca
dc8ecce
c14a236
f71ec5e
ca34509
1c42aa7
26723fa
b1ecaca
5506753
0c7daeb
17aadd6
0b3aab3
84deadf
da6e6ff
513d5ee
4d32ad4
ee4c858
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
@@ -0,0 +1,69 @@ | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
--- | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
title: "Red-Teaming Large Language Models" | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
thumbnail: /blog/assets/red-teaming/thumbnail.png | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
--- | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
# Red-Teaming Large Language Models | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Maybe add a quick note warning: Warning note: this article is about red-teaming and as such contains examples of model generation that may be offensive or upsetting |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
<div class="blog-metadata"> | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
<small>Published February 22, 2023.</small> | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
<a target="_blank" class="btn no-underline text-sm mb-5 font-sans" href="https://github.com/huggingface/blog/blob/main/red-teaming.md"> | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Update on GitHub | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
</a> | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
</div> | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
<div class="author-card"> | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
<a href="/nazneen"> | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
<img class="avatar avatar-user" src="https://avatars.githubusercontent.com/u/3278583?v=4?w=200&h=200&f=face" title="Gravatar"> | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
<div class="bfc"> | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
<code>Nazneen</code> | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
<span class="fullname">Nazneen Rajani</span> | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
</div> | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
</a> | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
</div> | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Suggested change
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This should update to the modern formatting. |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Large language models (LLMs) trained on an enormous amount of text data are very good at generating realistic text. However, these models often exhibit undesirable behaviors like revealing personal information (such as social security numbers) and generating misinformation, bias, hatefulness, or toxic content. For example, GPT3 is known to be sexist (see below) and [biased against Muslims](https://dl.acm.org/doi/abs/10.1145/3461702.3462624), | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Suggested change
Is the current GPT3 version still showing this behavior? If not, I would say a version, like above. |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Except for the thumbnail, we try to have new assets in https://huggingface.co/datasets/huggingface/documentation-images/tree/main/blog now. That helps keep the git repo smaller There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Got it. |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Once we uncover such undesirable values in the LLM, we can develop strategies to steer it away from them, as in [GeDi](https://arxiv.org/pdf/2009.06367.pdf) or [PPLM](https://arxiv.org/pdf/1912.02164.pdf) for guiding generation in GPT3. Below is an example of using the same prompt but with GeDi for controlling GPT3 generation. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
natolambert marked this conversation as resolved.
Show resolved
Hide resolved
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
*Red-teaming is a type of evaluation that elicits model vulnerabilities that might lead to undesirable behaviors.* [Microsoft’s Chatbot Tay](https://blogs.microsoft.com/blog/2016/03/25/learning-tays-introduction/) launched in 2016 is a real-world example of a lack of such thorough evaluation of the underlying ML model using red-teaming. Jailbreaking is another term for red-teaming wherein the LLM is manipulated to break away from its guardrails. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
natolambert marked this conversation as resolved.
Show resolved
Hide resolved
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
The goal of red-teaming language models is to craft a prompt that would trigger the model to generate offensive text. Red-teaming shares some similarities and differences with the more well-known form of evaluation in ML called *adversarial attacks*. The similarity is that both red-teaming and adversarial attacks share the same goal of “attacking” or “fooling” the model to generate offensive content. However, adversarial attacks can be unintelligible to humans, for example, by prefixing a random string (such as “aaabbbcc”) to each prompt as in [Wallace et al., ‘19.](https://aclanthology.org/D19-1221.pdf) Red-teaming prompts, on the other hand, look like regular, natural language prompts. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Red-teaming can reveal model limitations that can uncover model limitations that could lead cause offensive and upsetting experiences or worse, aid violence and other unlawful activity for a user with malicious intentions. The outputs from red-teaming (just like adversarial attacks) can be used to train the model to be harmless or steer it away from undesirable outputs. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
natolambert marked this conversation as resolved.
Show resolved
Hide resolved
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Do you know the engineering workflow around red teaming and model iterations? That would be great insight to share. |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
A workaround for red-teaming would be to augment the LLM with a classifier trained to predict whether a given prompt contains topics or phrases that can possibly lead to offensive generations and if so, generate a canned response. Such a strategy would err on the side of caution. But that would be very restrictive and cause the model to be frequently evasive. So, there is tension between the model being *helpful* (by following instructions) and being *harmless* (not generating offensive text). This is where red-teaming can be very useful. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This is a little confusing to me. Are we working around needing red-teaming, maybe phrase it as a compliment / alternative (otherwise confuses story of the blog). |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
The red team can be a human-in-the-loop or an LM that is testing another LM for harmful outputs. Coming up with red-teaming prompts for models that are fine-tuned for safety and alignment (such as via RLHF or SFT) requires creative thinking in the form of *roleplay attacks* wherein the LLM is instructed to behave as a malicious character [as in Ganguli et al., ‘22.](https://arxiv.org/pdf/2209.07858.pdf) Instructing the model to respond in code instead of natural language can also reveal the model’s learned biases like the one for ChatGPT in the following tweet thread. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
natolambert marked this conversation as resolved.
Show resolved
Hide resolved
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
<blockquote class="twitter-tweet"><p lang="en" dir="ltr">Yes, ChatGPT is amazing and impressive. No, <a href="https://twitter.com/OpenAI?ref_src=twsrc%5Etfw">@OpenAI</a> has not come close to addressing the problem of bias. Filters appear to be bypassed with simple tricks, and superficially masked. <br><br>And what is lurking inside is egregious. <a href="https://twitter.com/Abebab?ref_src=twsrc%5Etfw">@Abebab</a> <a href="https://twitter.com/sama?ref_src=twsrc%5Etfw">@sama</a><br>tw racism, sexism. <a href="https://t.co/V4fw1fY9dY">pic.twitter.com/V4fw1fY9dY</a></p>— steven t. piantadosi (@spiantado) <a href="https://twitter.com/spiantado/status/1599462375887114240?ref_src=twsrc%5Etfw">December 4, 2022</a></blockquote> <script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script> | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Can we add an example of jailbreaking Sydney as a motivation for this post? Would fit in well. Let me dig up some links. Such as this Ben Thompson piece, From Bing to Sydney |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Here is a list of ideas for jailbreaking a model according to ChatGPT itself. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Red-teaming LLMs is still a nascent research area and the aforementioned strategies would work in jailbreaking these models. But as these models get even more powerful with emerging capabilities, developing red-teaming methods that can continually adapt would become critical. For example, simulating scenarios of power-seeking behavior (eg: resources), persuading people (eg: to harm themselves or others), having agency with physical outcomes (eg: ordering chemicals online via an API). We refer to these as *critical threat scenarios*. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
natolambert marked this conversation as resolved.
Show resolved
Hide resolved
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
The caveat in evaluating LLMs for such malicious behaviors is that we don’t know what they are capable of because they are not explicitly trained to exhibit such behaviors (hence the term emerging capabilities). The only way is to actually simulate scenarios and evaluate for the model would behave. This means that our model’s safety behavior is tied to the strength of our red-teaming methods. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
natolambert marked this conversation as resolved.
Show resolved
Hide resolved
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
natolambert marked this conversation as resolved.
Show resolved
Hide resolved
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
**Open source datasets for Red-teaming:** | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Any of these on the hub / can we try to port before posting?? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yup they are on the hub. |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
1. Meta’s [Bot Adversarial Dialog dataset](https://aclanthology.org/2021.naacl-main.235.pdf) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This doesn't seem to be the right link to the dataset - can we point to one on hf.co? |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
2. Anthropic’s [red-teaming attempts](https://github.com/anthropics/hh-rlhf/tree/master/red-team-attempts) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Suggested change
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
3. AI2’s [RealToxicityPrompts](https://arxiv.org/pdf/2009.11462.pdf) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Suggested change
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
**Findings from past work on red-teaming LLMs** (from [https://arxiv.org/abs/2209.07858](https://arxiv.org/abs/2209.07858) and [https://arxiv.org/abs/2202.03286](https://arxiv.org/abs/2202.03286)) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
natolambert marked this conversation as resolved.
Show resolved
Hide resolved
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
1. Few-shot-prompted LMs with helpful, honest, and harmless behavior are not harder to red-team than plain LMs. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I would love to see some explicit examples for each of these bullet points (maybe from their paper?) |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
2. There is overall low agreement among humans on what constitutes a successful attack. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
3. There are no clear trends with scaling model size for attack success rate except RLHF models that are more difficult to red-team as they scale. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
4. Crowdsourcing red-teaming leads to template-y prompts (eg: “give a mean word that begins with X”) making them redundant. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
**Future directions:** | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Maybe we can add a reference to Anthropic's helpful/harmless and Constitutional AI papers for bleeding edge insights into making this stuff work at scale? https://arxiv.org/abs/2204.05862 |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
1. There is no open-source red-teaming dataset for code generation that attempts to jailbreak a model via code, for example, generating a program that implements a DDOS or backdoor attack. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
2. Designing and implementing strategies for red-teaming LLMs for critical threat scenarios. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
3. Red-teaming can be resource intensive, both compute and human resource, so this is a call-to-action to the LLM community of researchers to collaborate on these efforts for a safe and friendly world :) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
natolambert marked this conversation as resolved.
Show resolved
Hide resolved
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe update to when we want to post it? (in case other blogs are posted after this date, for sorting / be on blog front page). In this vein, I'd move it to the bottom of
_blog.yml