Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Red teaming blogpost #849

Merged
merged 25 commits into from
Feb 24, 2023
Merged

Red teaming blogpost #849

merged 25 commits into from
Feb 24, 2023

Conversation

nazneenrajani
Copy link
Contributor

No description provided.

Copy link
Contributor

@natolambert natolambert left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like there is duplicated assets red-teaming.png and thumbnail.png. I would follow this in the huggingface/blog readme:

This folder will contain your thumbnail only. The folder number is mostly for (rough) ordering purposes, so it's no big deal if two concurrent articles use the same number.

For the rest of your files, create a mirrored folder in the HuggingFace Documentation Images [repo](https://huggingface.co/datasets/huggingface/documentation-images/tree/main/blog). This is to reduce bloat in the GitHub base repo when cloning and pulling.

Also, let's move to the new blog post format (so we don't break anything / the post is formatted weird. I can help with this once you go through the suggestions.

Ex.

---
title: "Illustrating Reinforcement Learning from Human Feedback (RLHF)" 
thumbnail: /blog/assets/120_rlhf/thumbnail.png
authors:
- user: natolambert
- user: LouisCastricato
  guest: true
- user: lvwerra
- user: Dahoas
  guest: true
---

# Illustrating Reinforcement Learning from Human Feedback (RLHF)

<!-- {blog_metadata} -->
<!-- {authors} -->

Text starts here....

title: "Red-Teaming Large Language Models"
author: nazneen
thumbnail: /blog/assets/red-teaming/thumbnail.png
date: February 22, 2023
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe update to when we want to post it? (in case other blogs are posted after this date, for sorting / be on blog front page). In this vein, I'd move it to the bottom of _blog.yml

red-teaming.md Outdated
</div>
</a>
</div>
Large language models (LLMs) trained on an enormous amount of text data are very good at generating realistic text. However, these models often exhibit undesirable behaviors like revealing personal information (such as social security numbers) and generating misinformation, bias, hatefulness, or toxic content. For example, GPT3 is known to be sexist (see below) and [biased against Muslims](https://dl.acm.org/doi/abs/10.1145/3461702.3462624),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Large language models (LLMs) trained on an enormous amount of text data are very good at generating realistic text. However, these models often exhibit undesirable behaviors like revealing personal information (such as social security numbers) and generating misinformation, bias, hatefulness, or toxic content. For example, GPT3 is known to be sexist (see below) and [biased against Muslims](https://dl.acm.org/doi/abs/10.1145/3461702.3462624),
Large language models (LLMs) trained on an enormous amount of text data are very good at generating realistic text. However, these models often exhibit undesirable behaviors like revealing personal information (such as social security numbers) and generating misinformation, bias, hatefulness, or toxic content. For example, earlier versions of GPT3 were known to be sexist (see below) and [biased against Muslims](https://dl.acm.org/doi/abs/10.1145/3461702.3462624),

Is the current GPT3 version still showing this behavior? If not, I would say a version, like above.


The caveat in evaluating LLMs for such malicious behaviors is that we don’t know what they are capable of because they are not explicitly trained to exhibit such behaviors (hence the term emerging capabilities). The only way is to actually simulate scenarios and evaluate for the model would behave. This means that our model’s safety behavior is tied to the strength of our red-teaming methods.

**Open source datasets for Red-teaming:**
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any of these on the hub / can we try to port before posting??

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup they are on the hub.

Copy link
Contributor

@natolambert natolambert left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added some more suggestions (like the formatting fix for the header)

red-teaming.md Outdated
**Open source datasets for Red-teaming:**

1. Meta’s [Bot Adversarial Dialog dataset](https://aclanthology.org/2021.naacl-main.235.pdf)
2. Anthropic’s [red-teaming attempts](https://github.com/anthropics/hh-rlhf/tree/master/red-team-attempts)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
2. Anthropic’s [red-teaming attempts](https://github.com/anthropics/hh-rlhf/tree/master/red-team-attempts)
2. Anthropic’s [red-teaming attempts](https://huggingface.co/datasets/Anthropic/hh-rlhf/tree/main/red-team-attempts)

red-teaming.md Outdated

1. Meta’s [Bot Adversarial Dialog dataset](https://aclanthology.org/2021.naacl-main.235.pdf)
2. Anthropic’s [red-teaming attempts](https://github.com/anthropics/hh-rlhf/tree/master/red-team-attempts)
3. AI2’s [RealToxicityPrompts](https://arxiv.org/pdf/2009.11462.pdf)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
3. AI2’s [RealToxicityPrompts](https://arxiv.org/pdf/2009.11462.pdf)
3. Allen Institute for AI’s [RealToxicityPrompts](https://huggingface.co/datasets/allenai/real-toxicity-prompts)

red-teaming.md Outdated
Comment on lines 1 to 22
---
title: "Red-Teaming Large Language Models"
thumbnail: /blog/assets/red-teaming/thumbnail.png
---

# Red-Teaming Large Language Models

<div class="blog-metadata">
<small>Published February 22, 2023.</small>
<a target="_blank" class="btn no-underline text-sm mb-5 font-sans" href="https://github.com/huggingface/blog/blob/main/red-teaming.md">
Update on GitHub
</a>
</div>
<div class="author-card">
<a href="/nazneen">
<img class="avatar avatar-user" src="https://avatars.githubusercontent.com/u/3278583?v=4?w=200&h=200&f=face" title="Gravatar">
<div class="bfc">
<code>Nazneen</code>
<span class="fullname">Nazneen Rajani</span>
</div>
</a>
</div>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
---
title: "Red-Teaming Large Language Models"
thumbnail: /blog/assets/red-teaming/thumbnail.png
---
# Red-Teaming Large Language Models
<div class="blog-metadata">
<small>Published February 22, 2023.</small>
<a target="_blank" class="btn no-underline text-sm mb-5 font-sans" href="https://github.com/huggingface/blog/blob/main/red-teaming.md">
Update on GitHub
</a>
</div>
<div class="author-card">
<a href="/nazneen">
<img class="avatar avatar-user" src="https://avatars.githubusercontent.com/u/3278583?v=4?w=200&h=200&f=face" title="Gravatar">
<div class="bfc">
<code>Nazneen</code>
<span class="fullname">Nazneen Rajani</span>
</div>
</a>
</div>
---
title: "Red-Teaming Large Language Models"
thumbnail: /blog/assets/red-teaming/thumbnail.png
authors:
- user: nazneen
- user: natolambert
---
# Red-Teaming Large Language Models
<!-- {blog_metadata} -->
<!-- {authors} -->

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should update to the modern formatting.

red-teaming.md Outdated
</div>
Large language models (LLMs) trained on an enormous amount of text data are very good at generating realistic text. However, these models often exhibit undesirable behaviors like revealing personal information (such as social security numbers) and generating misinformation, bias, hatefulness, or toxic content. For example, GPT3 is known to be sexist (see below) and [biased against Muslims](https://dl.acm.org/doi/abs/10.1145/3461702.3462624),

![GPT3](assets/red-teaming/gpt3.png)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Except for the thumbnail, we try to have new assets in https://huggingface.co/datasets/huggingface/documentation-images/tree/main/blog now. That helps keep the git repo smaller

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it.

Copy link
Member

@lewtun lewtun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really well written and informative blog post @nazneenrajani 🚀 !

I've left a few minor comments, but otherwise this looks good to publish :)

red-teaming.md Outdated
thumbnail: /blog/assets/red-teaming/thumbnail.png
authors:
- user: nazneen
- user: HuggingFaceH4
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice :)

Copy link
Contributor

@natolambert natolambert Feb 24, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

woah do Org authors work????

red-teaming.md Outdated
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/red-teaming/gedi.png"/>
</p>

**Red-teaming** *is a form of evaluation that elicits model vulnerabilities that might lead to undesirable behaviors.* Jailbreaking is another term for red-teaming wherein the LLM is manipulated to break away from its guardrails. [Microsoft’s Chatbot Tay](https://blogs.microsoft.com/blog/2016/03/25/learning-tays-introduction/) launched in 2016 and the more recent [Bing's Chatbot Sydney](https://www.nytimes.com/2023/02/16/technology/bing-chatbot-transcript.html) are real-world examples of how disastrous the lack of thorough evaluation of the underlying ML model using red-teaming can be.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit:

Suggested change
**Red-teaming** *is a form of evaluation that elicits model vulnerabilities that might lead to undesirable behaviors.* Jailbreaking is another term for red-teaming wherein the LLM is manipulated to break away from its guardrails. [Microsoft’s Chatbot Tay](https://blogs.microsoft.com/blog/2016/03/25/learning-tays-introduction/) launched in 2016 and the more recent [Bing's Chatbot Sydney](https://www.nytimes.com/2023/02/16/technology/bing-chatbot-transcript.html) are real-world examples of how disastrous the lack of thorough evaluation of the underlying ML model using red-teaming can be.
**Red-teaming** *is a form of evaluation that elicits model vulnerabilities that might lead to undesirable behaviors.* Jailbreaking is another term for red-teaming wherein the LLM is manipulated to break away from its guardrails. [Microsoft’s Chatbot Tay](https://blogs.microsoft.com/blog/2016/03/25/learning-tays-introduction/) launched in 2016 and the more recent [Bing's Chatbot Sydney](https://www.nytimes.com/2023/02/16/technology/bing-chatbot-transcript.html) are real-world examples of how disastrous the lack of thorough evaluation of the underlying LLM using red-teaming can be.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, do you know who invented the term "red teaming" for LLMs? Perhaps we can mention them early in the blog post with a reference to their paper?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you look at scholar, there is a deep history of Red Teaming. We can try and find the first LLM paper.

https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=red+teaming+machine+learning&btnG=

red-teaming.md Outdated

**Red-teaming** *is a form of evaluation that elicits model vulnerabilities that might lead to undesirable behaviors.* Jailbreaking is another term for red-teaming wherein the LLM is manipulated to break away from its guardrails. [Microsoft’s Chatbot Tay](https://blogs.microsoft.com/blog/2016/03/25/learning-tays-introduction/) launched in 2016 and the more recent [Bing's Chatbot Sydney](https://www.nytimes.com/2023/02/16/technology/bing-chatbot-transcript.html) are real-world examples of how disastrous the lack of thorough evaluation of the underlying ML model using red-teaming can be.

The goal of red-teaming language models is to craft a prompt that would trigger the model to generate offensive text. Red-teaming shares some similarities and differences with the more well-known form of evaluation in ML called *adversarial attacks*. The similarity is that both red-teaming and adversarial attacks share the same goal of “attacking” or “fooling” the model to generate offensive content. However, adversarial attacks can be unintelligible to humans, for example, by prefixing a random string (such as “aaabbbcc”) to each prompt as in [Wallace et al., ‘19.](https://aclanthology.org/D19-1221.pdf) Red-teaming prompts, on the other hand, look like regular, natural language prompts.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wouldn't call the strings in Wallace et al "random". Perhaps use an explicit example (maybe the screenshot from their paper?)

I also suggest using the arxiv link since it's got more details than the published version

Suggested change
The goal of red-teaming language models is to craft a prompt that would trigger the model to generate offensive text. Red-teaming shares some similarities and differences with the more well-known form of evaluation in ML called *adversarial attacks*. The similarity is that both red-teaming and adversarial attacks share the same goal of “attacking” or “fooling” the model to generate offensive content. However, adversarial attacks can be unintelligible to humans, for example, by prefixing a random string (such as “aaabbbcc”) to each prompt as in [Wallace et al., ‘19.](https://aclanthology.org/D19-1221.pdf) Red-teaming prompts, on the other hand, look like regular, natural language prompts.
The goal of red-teaming language models is to craft a prompt that would trigger the model to generate offensive text. Red-teaming shares some similarities and differences with the more well-known form of evaluation in ML called *adversarial attacks*. The similarity is that both red-teaming and adversarial attacks share the same goal of “attacking” or “fooling” the model to generate offensive content. However, adversarial attacks can be unintelligible to humans, for example, by prefixing a random string (such as “aaabbbcc”) to each prompt as in [Wallace et al., ‘19.](https://arxiv.org/abs/1908.07125) Red-teaming prompts, on the other hand, look like regular, natural language prompts.

Screenshot 2023-02-24 at 14 30 29

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, on second thought - would prompt injection attacks count as red teaming? If yes, maybe that's more compelling than the offensive references above? See e.g. https://simonwillison.net/2022/Sep/12/prompt-injection/

red-teaming.md Outdated

**Open source datasets for Red-teaming:**

1. Meta’s [Bot Adversarial Dialog dataset](https://aclanthology.org/2021.naacl-main.235.pdf)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't seem to be the right link to the dataset - can we point to one on hf.co?

red-teaming.md Outdated

**Findings from past work on red-teaming LLMs** (from [Anthropic's Ganguli et al. 2022](https://arxiv.org/abs/2209.07858) and [Perez et al. 2022](https://arxiv.org/abs/2202.03286))

1. Few-shot-prompted LMs with helpful, honest, and harmless behavior are not harder to red-team than plain LMs.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would love to see some explicit examples for each of these bullet points (maybe from their paper?)

3. There are no clear trends with scaling model size for attack success rate except RLHF models that are more difficult to red-team as they scale.
4. Crowdsourcing red-teaming leads to template-y prompts (eg: “give a mean word that begins with X”) making them redundant.

**Future directions:**
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can add a reference to Anthropic's helpful/harmless and Constitutional AI papers for bleeding edge insights into making this stuff work at scale? https://arxiv.org/abs/2204.05862

Copy link
Member

@yjernite yjernite left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I proposed some edits of offensive/harmless in a few places; I tried to make sure to keep the intention of the original sentence while stressing the role of the deployment context. Let me know what you think!

nazneenrajani and others added 4 commits February 24, 2023 09:35
Co-authored-by: Yacine Jernite <[email protected]>
Co-authored-by: Yacine Jernite <[email protected]>
Co-authored-by: Yacine Jernite <[email protected]>
Co-authored-by: Yacine Jernite <[email protected]>
@nazneenrajani nazneenrajani merged commit 3f1431d into main Feb 24, 2023
@nazneenrajani nazneenrajani deleted the red-teaming branch February 24, 2023 18:46
Copy link
Member

@thomwolf thomwolf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice blog post @nazneenrajani!

---

# Red-Teaming Large Language Models

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe add a quick note warning:

Warning note: this article is about red-teaming and as such contains examples of model generation that may be offensive or upsetting


Red-teaming can reveal model limitations that can cause upsetting user experiences or enable harm by aiding violence or other unlawful activity for a user with malicious intentions. The outputs from red-teaming (just like adversarial attacks) are generally used to train the model to be less likely to cause harm or steer it away from undesirable outputs.

Since red-teaming requires creative thinking of possible model failures, it is a problem with a large search space making it resource intensive. A workaround would be to augment the LLM with a classifier trained to predict whether a given prompt contains topics or phrases that can possibly lead to offensive generations and if the classifier predicts the prompt would lead to a potentially offensive text, generate a canned response. Such a strategy would err on the side of caution. But that would be very restrictive and cause the model to be frequently evasive. So, there is tension between the model being *helpful* (by following instructions) and being *harmless* (or at least less likely to enable harm). This is where red-teaming can be very useful.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure about "This is where red-teaming can be very useful" maybe more that while "red-teaming" is about surfacing problems, solving them in a way that don't render the model useless is not an easy task either and maybe point to some work on pushing this pareto surface like the Constitutional AI paper

Comment on lines +73 to +78
1. Few-shot-prompted LMs with helpful, honest, and harmless behavior are *not* harder to red-team than plain LMs.
2. There are no clear trends with scaling model size for attack success rate except RLHF models that are more difficult to red-team as they scale.
3. Models may learn to be harmless by being evasive, there is tradeoff between helpfulness and harmlessness.
4. There is overall low agreement among humans on what constitutes a successful attack.
5. The distribution of the success rate varies across categories of harm with non-violent ones having a higher success rate.
6. Crowdsourcing red-teaming leads to template-y prompts (eg: “give a mean word that begins with X”) making them redundant.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cool list!

4. Red-teaming can be resource intensive, both compute and human resource and so would benefit from sharing strategies, open-sourcing datasets, and possibly collaborating for a higher chance of success.

These limitations and future directions make it clear that red-teaming is an under-explored and crucial component of the modern LLM workflow.
This post is a call-to-action to LLM researchers and HuggingFace's community of developers to collaborate on these efforts for a safe and friendly world :)
Copy link
Member

@thomwolf thomwolf Feb 24, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reach out to us (@nazneenrajani @natolambert @lewtun @TristanThrush @yjernite @thomwolf) if you're interested in joining such a collaboration.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants