-
Notifications
You must be signed in to change notification settings - Fork 824
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Red teaming blogpost #849
Red teaming blogpost #849
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like there is duplicated assets red-teaming.png
and thumbnail.png
. I would follow this in the huggingface/blog
readme:
This folder will contain your thumbnail only. The folder number is mostly for (rough) ordering purposes, so it's no big deal if two concurrent articles use the same number.
For the rest of your files, create a mirrored folder in the HuggingFace Documentation Images [repo](https://huggingface.co/datasets/huggingface/documentation-images/tree/main/blog). This is to reduce bloat in the GitHub base repo when cloning and pulling.
Also, let's move to the new blog post format (so we don't break anything / the post is formatted weird. I can help with this once you go through the suggestions.
Ex.
---
title: "Illustrating Reinforcement Learning from Human Feedback (RLHF)"
thumbnail: /blog/assets/120_rlhf/thumbnail.png
authors:
- user: natolambert
- user: LouisCastricato
guest: true
- user: lvwerra
- user: Dahoas
guest: true
---
# Illustrating Reinforcement Learning from Human Feedback (RLHF)
<!-- {blog_metadata} -->
<!-- {authors} -->
Text starts here....
title: "Red-Teaming Large Language Models" | ||
author: nazneen | ||
thumbnail: /blog/assets/red-teaming/thumbnail.png | ||
date: February 22, 2023 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe update to when we want to post it? (in case other blogs are posted after this date, for sorting / be on blog front page). In this vein, I'd move it to the bottom of _blog.yml
red-teaming.md
Outdated
</div> | ||
</a> | ||
</div> | ||
Large language models (LLMs) trained on an enormous amount of text data are very good at generating realistic text. However, these models often exhibit undesirable behaviors like revealing personal information (such as social security numbers) and generating misinformation, bias, hatefulness, or toxic content. For example, GPT3 is known to be sexist (see below) and [biased against Muslims](https://dl.acm.org/doi/abs/10.1145/3461702.3462624), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Large language models (LLMs) trained on an enormous amount of text data are very good at generating realistic text. However, these models often exhibit undesirable behaviors like revealing personal information (such as social security numbers) and generating misinformation, bias, hatefulness, or toxic content. For example, GPT3 is known to be sexist (see below) and [biased against Muslims](https://dl.acm.org/doi/abs/10.1145/3461702.3462624), | |
Large language models (LLMs) trained on an enormous amount of text data are very good at generating realistic text. However, these models often exhibit undesirable behaviors like revealing personal information (such as social security numbers) and generating misinformation, bias, hatefulness, or toxic content. For example, earlier versions of GPT3 were known to be sexist (see below) and [biased against Muslims](https://dl.acm.org/doi/abs/10.1145/3461702.3462624), |
Is the current GPT3 version still showing this behavior? If not, I would say a version, like above.
|
||
The caveat in evaluating LLMs for such malicious behaviors is that we don’t know what they are capable of because they are not explicitly trained to exhibit such behaviors (hence the term emerging capabilities). The only way is to actually simulate scenarios and evaluate for the model would behave. This means that our model’s safety behavior is tied to the strength of our red-teaming methods. | ||
|
||
**Open source datasets for Red-teaming:** |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any of these on the hub / can we try to port before posting??
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup they are on the hub.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added some more suggestions (like the formatting fix for the header)
red-teaming.md
Outdated
**Open source datasets for Red-teaming:** | ||
|
||
1. Meta’s [Bot Adversarial Dialog dataset](https://aclanthology.org/2021.naacl-main.235.pdf) | ||
2. Anthropic’s [red-teaming attempts](https://github.com/anthropics/hh-rlhf/tree/master/red-team-attempts) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
2. Anthropic’s [red-teaming attempts](https://github.com/anthropics/hh-rlhf/tree/master/red-team-attempts) | |
2. Anthropic’s [red-teaming attempts](https://huggingface.co/datasets/Anthropic/hh-rlhf/tree/main/red-team-attempts) |
red-teaming.md
Outdated
|
||
1. Meta’s [Bot Adversarial Dialog dataset](https://aclanthology.org/2021.naacl-main.235.pdf) | ||
2. Anthropic’s [red-teaming attempts](https://github.com/anthropics/hh-rlhf/tree/master/red-team-attempts) | ||
3. AI2’s [RealToxicityPrompts](https://arxiv.org/pdf/2009.11462.pdf) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
3. AI2’s [RealToxicityPrompts](https://arxiv.org/pdf/2009.11462.pdf) | |
3. Allen Institute for AI’s [RealToxicityPrompts](https://huggingface.co/datasets/allenai/real-toxicity-prompts) |
red-teaming.md
Outdated
--- | ||
title: "Red-Teaming Large Language Models" | ||
thumbnail: /blog/assets/red-teaming/thumbnail.png | ||
--- | ||
|
||
# Red-Teaming Large Language Models | ||
|
||
<div class="blog-metadata"> | ||
<small>Published February 22, 2023.</small> | ||
<a target="_blank" class="btn no-underline text-sm mb-5 font-sans" href="https://github.com/huggingface/blog/blob/main/red-teaming.md"> | ||
Update on GitHub | ||
</a> | ||
</div> | ||
<div class="author-card"> | ||
<a href="/nazneen"> | ||
<img class="avatar avatar-user" src="https://avatars.githubusercontent.com/u/3278583?v=4?w=200&h=200&f=face" title="Gravatar"> | ||
<div class="bfc"> | ||
<code>Nazneen</code> | ||
<span class="fullname">Nazneen Rajani</span> | ||
</div> | ||
</a> | ||
</div> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
--- | |
title: "Red-Teaming Large Language Models" | |
thumbnail: /blog/assets/red-teaming/thumbnail.png | |
--- | |
# Red-Teaming Large Language Models | |
<div class="blog-metadata"> | |
<small>Published February 22, 2023.</small> | |
<a target="_blank" class="btn no-underline text-sm mb-5 font-sans" href="https://github.com/huggingface/blog/blob/main/red-teaming.md"> | |
Update on GitHub | |
</a> | |
</div> | |
<div class="author-card"> | |
<a href="/nazneen"> | |
<img class="avatar avatar-user" src="https://avatars.githubusercontent.com/u/3278583?v=4?w=200&h=200&f=face" title="Gravatar"> | |
<div class="bfc"> | |
<code>Nazneen</code> | |
<span class="fullname">Nazneen Rajani</span> | |
</div> | |
</a> | |
</div> | |
--- | |
title: "Red-Teaming Large Language Models" | |
thumbnail: /blog/assets/red-teaming/thumbnail.png | |
authors: | |
- user: nazneen | |
- user: natolambert | |
--- | |
# Red-Teaming Large Language Models | |
<!-- {blog_metadata} --> | |
<!-- {authors} --> | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should update to the modern formatting.
red-teaming.md
Outdated
</div> | ||
Large language models (LLMs) trained on an enormous amount of text data are very good at generating realistic text. However, these models often exhibit undesirable behaviors like revealing personal information (such as social security numbers) and generating misinformation, bias, hatefulness, or toxic content. For example, GPT3 is known to be sexist (see below) and [biased against Muslims](https://dl.acm.org/doi/abs/10.1145/3461702.3462624), | ||
|
||
 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Except for the thumbnail, we try to have new assets in https://huggingface.co/datasets/huggingface/documentation-images/tree/main/blog now. That helps keep the git repo smaller
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it.
ad8a9c7
to
ca34509
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Really well written and informative blog post @nazneenrajani 🚀 !
I've left a few minor comments, but otherwise this looks good to publish :)
red-teaming.md
Outdated
thumbnail: /blog/assets/red-teaming/thumbnail.png | ||
authors: | ||
- user: nazneen | ||
- user: HuggingFaceH4 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
woah do Org authors work????
red-teaming.md
Outdated
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/red-teaming/gedi.png"/> | ||
</p> | ||
|
||
**Red-teaming** *is a form of evaluation that elicits model vulnerabilities that might lead to undesirable behaviors.* Jailbreaking is another term for red-teaming wherein the LLM is manipulated to break away from its guardrails. [Microsoft’s Chatbot Tay](https://blogs.microsoft.com/blog/2016/03/25/learning-tays-introduction/) launched in 2016 and the more recent [Bing's Chatbot Sydney](https://www.nytimes.com/2023/02/16/technology/bing-chatbot-transcript.html) are real-world examples of how disastrous the lack of thorough evaluation of the underlying ML model using red-teaming can be. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit:
**Red-teaming** *is a form of evaluation that elicits model vulnerabilities that might lead to undesirable behaviors.* Jailbreaking is another term for red-teaming wherein the LLM is manipulated to break away from its guardrails. [Microsoft’s Chatbot Tay](https://blogs.microsoft.com/blog/2016/03/25/learning-tays-introduction/) launched in 2016 and the more recent [Bing's Chatbot Sydney](https://www.nytimes.com/2023/02/16/technology/bing-chatbot-transcript.html) are real-world examples of how disastrous the lack of thorough evaluation of the underlying ML model using red-teaming can be. | |
**Red-teaming** *is a form of evaluation that elicits model vulnerabilities that might lead to undesirable behaviors.* Jailbreaking is another term for red-teaming wherein the LLM is manipulated to break away from its guardrails. [Microsoft’s Chatbot Tay](https://blogs.microsoft.com/blog/2016/03/25/learning-tays-introduction/) launched in 2016 and the more recent [Bing's Chatbot Sydney](https://www.nytimes.com/2023/02/16/technology/bing-chatbot-transcript.html) are real-world examples of how disastrous the lack of thorough evaluation of the underlying LLM using red-teaming can be. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, do you know who invented the term "red teaming" for LLMs? Perhaps we can mention them early in the blog post with a reference to their paper?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you look at scholar, there is a deep history of Red Teaming. We can try and find the first LLM paper.
https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=red+teaming+machine+learning&btnG=
red-teaming.md
Outdated
|
||
**Red-teaming** *is a form of evaluation that elicits model vulnerabilities that might lead to undesirable behaviors.* Jailbreaking is another term for red-teaming wherein the LLM is manipulated to break away from its guardrails. [Microsoft’s Chatbot Tay](https://blogs.microsoft.com/blog/2016/03/25/learning-tays-introduction/) launched in 2016 and the more recent [Bing's Chatbot Sydney](https://www.nytimes.com/2023/02/16/technology/bing-chatbot-transcript.html) are real-world examples of how disastrous the lack of thorough evaluation of the underlying ML model using red-teaming can be. | ||
|
||
The goal of red-teaming language models is to craft a prompt that would trigger the model to generate offensive text. Red-teaming shares some similarities and differences with the more well-known form of evaluation in ML called *adversarial attacks*. The similarity is that both red-teaming and adversarial attacks share the same goal of “attacking” or “fooling” the model to generate offensive content. However, adversarial attacks can be unintelligible to humans, for example, by prefixing a random string (such as “aaabbbcc”) to each prompt as in [Wallace et al., ‘19.](https://aclanthology.org/D19-1221.pdf) Red-teaming prompts, on the other hand, look like regular, natural language prompts. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wouldn't call the strings in Wallace et al "random". Perhaps use an explicit example (maybe the screenshot from their paper?)
I also suggest using the arxiv link since it's got more details than the published version
The goal of red-teaming language models is to craft a prompt that would trigger the model to generate offensive text. Red-teaming shares some similarities and differences with the more well-known form of evaluation in ML called *adversarial attacks*. The similarity is that both red-teaming and adversarial attacks share the same goal of “attacking” or “fooling” the model to generate offensive content. However, adversarial attacks can be unintelligible to humans, for example, by prefixing a random string (such as “aaabbbcc”) to each prompt as in [Wallace et al., ‘19.](https://aclanthology.org/D19-1221.pdf) Red-teaming prompts, on the other hand, look like regular, natural language prompts. | |
The goal of red-teaming language models is to craft a prompt that would trigger the model to generate offensive text. Red-teaming shares some similarities and differences with the more well-known form of evaluation in ML called *adversarial attacks*. The similarity is that both red-teaming and adversarial attacks share the same goal of “attacking” or “fooling” the model to generate offensive content. However, adversarial attacks can be unintelligible to humans, for example, by prefixing a random string (such as “aaabbbcc”) to each prompt as in [Wallace et al., ‘19.](https://arxiv.org/abs/1908.07125) Red-teaming prompts, on the other hand, look like regular, natural language prompts. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, on second thought - would prompt injection attacks count as red teaming? If yes, maybe that's more compelling than the offensive references above? See e.g. https://simonwillison.net/2022/Sep/12/prompt-injection/
red-teaming.md
Outdated
|
||
**Open source datasets for Red-teaming:** | ||
|
||
1. Meta’s [Bot Adversarial Dialog dataset](https://aclanthology.org/2021.naacl-main.235.pdf) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't seem to be the right link to the dataset - can we point to one on hf.co?
red-teaming.md
Outdated
|
||
**Findings from past work on red-teaming LLMs** (from [Anthropic's Ganguli et al. 2022](https://arxiv.org/abs/2209.07858) and [Perez et al. 2022](https://arxiv.org/abs/2202.03286)) | ||
|
||
1. Few-shot-prompted LMs with helpful, honest, and harmless behavior are not harder to red-team than plain LMs. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would love to see some explicit examples for each of these bullet points (maybe from their paper?)
3. There are no clear trends with scaling model size for attack success rate except RLHF models that are more difficult to red-team as they scale. | ||
4. Crowdsourcing red-teaming leads to template-y prompts (eg: “give a mean word that begins with X”) making them redundant. | ||
|
||
**Future directions:** |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we can add a reference to Anthropic's helpful/harmless and Constitutional AI papers for bleeding edge insights into making this stuff work at scale? https://arxiv.org/abs/2204.05862
Co-authored-by: lewtun <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I proposed some edits of offensive/harmless in a few places; I tried to make sure to keep the intention of the original sentence while stressing the role of the deployment context. Let me know what you think!
Co-authored-by: Yacine Jernite <[email protected]>
Co-authored-by: Yacine Jernite <[email protected]>
Co-authored-by: Yacine Jernite <[email protected]>
Co-authored-by: Yacine Jernite <[email protected]>
Co-authored-by: Yacine Jernite <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice blog post @nazneenrajani!
--- | ||
|
||
# Red-Teaming Large Language Models | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe add a quick note warning:
Warning note: this article is about red-teaming and as such contains examples of model generation that may be offensive or upsetting
|
||
Red-teaming can reveal model limitations that can cause upsetting user experiences or enable harm by aiding violence or other unlawful activity for a user with malicious intentions. The outputs from red-teaming (just like adversarial attacks) are generally used to train the model to be less likely to cause harm or steer it away from undesirable outputs. | ||
|
||
Since red-teaming requires creative thinking of possible model failures, it is a problem with a large search space making it resource intensive. A workaround would be to augment the LLM with a classifier trained to predict whether a given prompt contains topics or phrases that can possibly lead to offensive generations and if the classifier predicts the prompt would lead to a potentially offensive text, generate a canned response. Such a strategy would err on the side of caution. But that would be very restrictive and cause the model to be frequently evasive. So, there is tension between the model being *helpful* (by following instructions) and being *harmless* (or at least less likely to enable harm). This is where red-teaming can be very useful. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure about "This is where red-teaming can be very useful" maybe more that while "red-teaming" is about surfacing problems, solving them in a way that don't render the model useless is not an easy task either and maybe point to some work on pushing this pareto surface like the Constitutional AI paper
1. Few-shot-prompted LMs with helpful, honest, and harmless behavior are *not* harder to red-team than plain LMs. | ||
2. There are no clear trends with scaling model size for attack success rate except RLHF models that are more difficult to red-team as they scale. | ||
3. Models may learn to be harmless by being evasive, there is tradeoff between helpfulness and harmlessness. | ||
4. There is overall low agreement among humans on what constitutes a successful attack. | ||
5. The distribution of the success rate varies across categories of harm with non-violent ones having a higher success rate. | ||
6. Crowdsourcing red-teaming leads to template-y prompts (eg: “give a mean word that begins with X”) making them redundant. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cool list!
4. Red-teaming can be resource intensive, both compute and human resource and so would benefit from sharing strategies, open-sourcing datasets, and possibly collaborating for a higher chance of success. | ||
|
||
These limitations and future directions make it clear that red-teaming is an under-explored and crucial component of the modern LLM workflow. | ||
This post is a call-to-action to LLM researchers and HuggingFace's community of developers to collaborate on these efforts for a safe and friendly world :) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reach out to us (@nazneenrajani @natolambert @lewtun @TristanThrush @yjernite @thomwolf) if you're interested in joining such a collaboration.
No description provided.