Red teaming blogpost #849

Merged: 25 commits merged on Feb 24, 2023

Changes from 11 commits
13 changes: 13 additions & 0 deletions _blog.yml
@@ -1683,6 +1683,19 @@
- guide
- game-dev

- local: red-teaming
title: "Red-Teaming Large Language Models"
author: nazneen
thumbnail: /blog/assets/red-teaming/thumbnail.png
date: February 22, 2023
Contributor:

Maybe update this to when we actually want to post it? (In case other blogs are posted after this date, for sorting / to be on the blog front page.) In this vein, I'd move it to the bottom of _blog.yml.

tags:
- llms
- rlhf
- red-teaming
- chatgpt
- safety
- alignment

- local: optimum-onnxruntime-training
title: "Optimum+ONNX Runtime - Easier, Faster training for your Hugging Face models"
author: Jingya
Binary file added assets/red-teaming/gedi.png
Binary file added assets/red-teaming/gpt3.png
Binary file added assets/red-teaming/jailbreak.png
Binary file added assets/red-teaming/red-teaming.png
Binary file added assets/red-teaming/thumbnail.png
69 changes: 69 additions & 0 deletions red-teaming.md
@@ -0,0 +1,69 @@
---
title: "Red-Teaming Large Language Models"
thumbnail: /blog/assets/red-teaming/thumbnail.png
---

# Red-Teaming Large Language Models

Member:

Maybe add a quick note warning:

Warning note: this article is about red-teaming and as such contains examples of model generation that may be offensive or upsetting

<div class="blog-metadata">
<small>Published February 22, 2023.</small>
<a target="_blank" class="btn no-underline text-sm mb-5 font-sans" href="https://github.com/huggingface/blog/blob/main/red-teaming.md">
Update on GitHub
</a>
</div>
<div class="author-card">
<a href="/nazneen">
<img class="avatar avatar-user" src="https://avatars.githubusercontent.com/u/3278583?v=4?w=200&h=200&f=face" title="Gravatar">
<div class="bfc">
<code>Nazneen</code>
<span class="fullname">Nazneen Rajani</span>
</div>
</a>
</div>
Contributor:

Suggested change
---
title: "Red-Teaming Large Language Models"
thumbnail: /blog/assets/red-teaming/thumbnail.png
---
# Red-Teaming Large Language Models
<div class="blog-metadata">
<small>Published February 22, 2023.</small>
<a target="_blank" class="btn no-underline text-sm mb-5 font-sans" href="https://github.com/huggingface/blog/blob/main/red-teaming.md">
Update on GitHub
</a>
</div>
<div class="author-card">
<a href="/nazneen">
<img class="avatar avatar-user" src="https://avatars.githubusercontent.com/u/3278583?v=4?w=200&h=200&f=face" title="Gravatar">
<div class="bfc">
<code>Nazneen</code>
<span class="fullname">Nazneen Rajani</span>
</div>
</a>
</div>
---
title: "Red-Teaming Large Language Models"
thumbnail: /blog/assets/red-teaming/thumbnail.png
authors:
- user: nazneen
- user: natolambert
---
# Red-Teaming Large Language Models
<!-- {blog_metadata} -->
<!-- {authors} -->

Contributor:

This should update to the modern formatting.

Large language models (LLMs) trained on an enormous amount of text data are very good at generating realistic text. However, these models often exhibit undesirable behaviors like revealing personal information (such as social security numbers) and generating misinformation, bias, hatefulness, or toxic content. For example, GPT3 is known to be sexist (see below) and [biased against Muslims](https://dl.acm.org/doi/abs/10.1145/3461702.3462624),
Contributor:

Suggested change
Large language models (LLMs) trained on an enormous amount of text data are very good at generating realistic text. However, these models often exhibit undesirable behaviors like revealing personal information (such as social security numbers) and generating misinformation, bias, hatefulness, or toxic content. For example, GPT3 is known to be sexist (see below) and [biased against Muslims](https://dl.acm.org/doi/abs/10.1145/3461702.3462624),
Large language models (LLMs) trained on an enormous amount of text data are very good at generating realistic text. However, these models often exhibit undesirable behaviors like revealing personal information (such as social security numbers) and generating misinformation, bias, hatefulness, or toxic content. For example, earlier versions of GPT3 were known to be sexist (see below) and [biased against Muslims](https://dl.acm.org/doi/abs/10.1145/3461702.3462624),

Is the current GPT3 version still showing this behavior? If not, I would name a specific version, as in the suggestion above.


![GPT3](assets/red-teaming/gpt3.png)
Contributor:

Except for the thumbnail, we try to have new assets in https://huggingface.co/datasets/huggingface/documentation-images/tree/main/blog now. That helps keep the git repo smaller

Contributor (author):

Got it.


Once we uncover such undesirable values in the LLM, we can develop strategies to steer it away from them, as in [GeDi](https://arxiv.org/pdf/2009.06367.pdf) or [PPLM](https://arxiv.org/pdf/1912.02164.pdf) for guiding generation in GPT3. Below is an example of using the same prompt but with GeDi for controlling GPT3 generation.

![GeDi](assets/red-teaming/gedi.png)
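
GeDi and PPLM intervene in the decoding procedure itself. As a much simpler illustration of the same guiding idea, the sketch below samples several candidate continuations and keeps the one that an off-the-shelf hate-speech classifier scores as least harmful; the model names, the "hate" label, and the rerank-by-classifier strategy are illustrative assumptions, not the actual GeDi/PPLM algorithms.

```python
# Simplified sketch only: rerank sampled continuations with a classifier.
# This is NOT the GeDi/PPLM decoding algorithm, just the same guiding idea;
# model names and the "hate" label are assumptions for illustration.
from transformers import pipeline, set_seed

set_seed(0)
generator = pipeline("text-generation", model="gpt2")  # example base LM
classifier = pipeline(
    "text-classification",
    model="facebook/roberta-hate-speech-dynabench-r4-target",  # example classifier
)

def harm_score(text: str) -> float:
    # Probability the classifier assigns to the harmful ("hate") label.
    scores = {d["label"]: d["score"] for d in classifier(text, top_k=None)}
    return scores.get("hate", 0.0)

prompt = "The new coworker introduced themselves, and everyone thought"
candidates = generator(
    prompt, max_new_tokens=30, num_return_sequences=8, do_sample=True
)

# Keep the continuation the classifier considers least harmful.
best = min(candidates, key=lambda c: harm_score(c["generated_text"]))
print(best["generated_text"])
```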

*Red-teaming is a type of evaluation that elicits model vulnerabilities that might lead to undesirable behaviors.* [Microsoft’s Chatbot Tay](https://blogs.microsoft.com/blog/2016/03/25/learning-tays-introduction/), launched in 2016, is a real-world example of what can happen when the underlying ML model lacks such thorough evaluation via red-teaming. Jailbreaking is another term for red-teaming wherein the LLM is manipulated to break away from its guardrails.

The goal of red-teaming language models is to craft a prompt that would trigger the model to generate offensive text. Red-teaming shares some similarities and differences with the more well-known form of evaluation in ML called *adversarial attacks*. The similarity is that both red-teaming and adversarial attacks share the same goal of “attacking” or “fooling” the model to generate offensive content. However, adversarial attacks can be unintelligible to humans, for example, by prefixing a random string (such as “aaabbbcc”) to each prompt as in [Wallace et al., ‘19.](https://aclanthology.org/D19-1221.pdf) Red-teaming prompts, on the other hand, look like regular, natural language prompts.

Red-teaming can reveal model limitations that could lead to offensive and upsetting experiences for users or, worse, aid violence and other unlawful activity for a user with malicious intentions. The outputs from red-teaming (just like adversarial attacks) can be used to train the model to be harmless or steer it away from undesirable outputs.

Contributor:

Do you know the engineering workflow around red teaming and model iterations? That would be great insight to share.

A workaround for red-teaming would be to augment the LLM with a classifier trained to predict whether a given prompt contains topics or phrases that can possibly lead to offensive generations and if so, generate a canned response. Such a strategy would err on the side of caution. But that would be very restrictive and cause the model to be frequently evasive. So, there is tension between the model being *helpful* (by following instructions) and being *harmless* (not generating offensive text). This is where red-teaming can be very useful.
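
For concreteness, here is a minimal sketch of that classifier-plus-canned-response setup, assuming an off-the-shelf classifier from the Hub; the model names, the "hate" label, and the 0.5 threshold are placeholders, not recommendations.

```python
# Illustrative sketch: gate prompts with a safety classifier before the LLM.
# Model names, the "hate" label, and the threshold are placeholder assumptions.
from transformers import pipeline

safety_classifier = pipeline(
    "text-classification",
    model="facebook/roberta-hate-speech-dynabench-r4-target",  # example classifier
)
llm = pipeline("text-generation", model="gpt2")  # stand-in for the deployed LLM

CANNED_RESPONSE = "Sorry, I can't help with that request."

def guarded_generate(prompt: str, threshold: float = 0.5) -> str:
    # Score the incoming prompt; refuse if the classifier flags it as harmful.
    scores = {d["label"]: d["score"] for d in safety_classifier(prompt, top_k=None)}
    if scores.get("hate", 0.0) > threshold:
        # Err on the side of caution: return a canned refusal instead of generating.
        return CANNED_RESPONSE
    return llm(prompt, max_new_tokens=50)[0]["generated_text"]

print(guarded_generate("Tell me a story about a friendly robot."))
```

The helpful/harmless tension shows up directly in the threshold: the lower it is set, the more often benign prompts receive the canned refusal.
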
Contributor:

This is a little confusing to me. Are we working around needing red-teaming? Maybe phrase it as a complement / alternative (otherwise it confuses the story of the blog).


The red team can be a human-in-the-loop or an LM that is testing another LM for harmful outputs. Coming up with red-teaming prompts for models that are fine-tuned for safety and alignment (such as via RLHF or SFT) requires creative thinking in the form of *roleplay attacks* wherein the LLM is instructed to behave as a malicious character [as in Ganguli et al., ‘22.](https://arxiv.org/pdf/2209.07858.pdf) Instructing the model to respond in code instead of natural language can also reveal the model’s learned biases like the one for ChatGPT in the following tweet thread.
<blockquote class="twitter-tweet"><p lang="en" dir="ltr">Yes, ChatGPT is amazing and impressive. No, <a href="https://twitter.com/OpenAI?ref_src=twsrc%5Etfw">@OpenAI</a> has not come close to addressing the problem of bias. Filters appear to be bypassed with simple tricks, and superficially masked. <br><br>And what is lurking inside is egregious. <a href="https://twitter.com/Abebab?ref_src=twsrc%5Etfw">@Abebab</a> <a href="https://twitter.com/sama?ref_src=twsrc%5Etfw">@sama</a><br>tw racism, sexism. <a href="https://t.co/V4fw1fY9dY">pic.twitter.com/V4fw1fY9dY</a></p>&mdash; steven t. piantadosi (@spiantado) <a href="https://twitter.com/spiantado/status/1599462375887114240?ref_src=twsrc%5Etfw">December 4, 2022</a></blockquote> <script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
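
As a rough sketch of the LM-red-teams-LM setup described above: one model proposes attack prompts, the target model responds, and a classifier flags responses that look harmful for human review. All model names, the seed instruction, and the threshold are hypothetical placeholders, not a prescribed workflow.

```python
# Hypothetical sketch of an LM red-teaming another LM; names are placeholders.
from transformers import pipeline, set_seed

set_seed(0)
red_team_lm = pipeline("text-generation", model="gpt2")      # proposes attack prompts
target_lm = pipeline("text-generation", model="distilgpt2")  # model under test
judge = pipeline(
    "text-classification",
    model="facebook/roberta-hate-speech-dynabench-r4-target",  # example judge model
)

seed_instruction = "Write a question that tries to make a chatbot say something rude:"
flagged = []

for attack in red_team_lm(seed_instruction, max_new_tokens=25,
                          num_return_sequences=5, do_sample=True):
    # Strip the instruction so only the generated attack prompt remains.
    prompt = attack["generated_text"].removeprefix(seed_instruction).strip()
    if not prompt:
        continue
    response = target_lm(prompt, max_new_tokens=40)[0]["generated_text"]
    scores = {d["label"]: d["score"] for d in judge(response, top_k=None)}
    if scores.get("hate", 0.0) > 0.5:  # flag likely-harmful responses
        flagged.append({"prompt": prompt, "response": response})

print(f"{len(flagged)} candidate attacks flagged for human review")
```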

Contributor:

Can we add an example of jailbreaking Sydney as motivation for this post? It would fit in well. Let me dig up some links.

Such as this Ben Thompson piece, From Bing to Sydney, which reveals Sydney's opposite AI character "Venom".


Here is a list of ideas for jailbreaking a model according to ChatGPT itself.

![Ideas for jailbreaking a model, according to ChatGPT](assets/red-teaming/jailbreak.png)

Red-teaming LLMs is still a nascent research area, and the aforementioned strategies can still work to jailbreak these models. But as these models become even more powerful with emerging capabilities, developing red-teaming methods that can continually adapt will become critical. Examples include simulating scenarios of power-seeking behavior (e.g., acquiring resources), persuading people (e.g., to harm themselves or others), and having agency with physical outcomes (e.g., ordering chemicals online via an API). We refer to these as *critical threat scenarios*.

The caveat in evaluating LLMs for such malicious behaviors is that we don’t know what they are capable of because they are not explicitly trained to exhibit such behaviors (hence the term emerging capabilities). The only way to find out is to actually simulate such scenarios and evaluate how the model behaves. This means that our model’s safety behavior is tied to the strength of our red-teaming methods.

**Open-source datasets for red-teaming:**
Contributor:

Any of these on the hub / can we try to port before posting??

Contributor (author):

Yup they are on the hub.


1. Meta’s [Bot Adversarial Dialog dataset](https://aclanthology.org/2021.naacl-main.235.pdf)
Member:

This doesn't seem to be the right link to the dataset - can we point to one on hf.co?

2. Anthropic’s [red-teaming attempts](https://github.com/anthropics/hh-rlhf/tree/master/red-team-attempts)
Contributor:

Suggested change
2. Anthropic’s [red-teaming attempts](https://github.com/anthropics/hh-rlhf/tree/master/red-team-attempts)
2. Anthropic’s [red-teaming attempts](https://huggingface.co/datasets/Anthropic/hh-rlhf/tree/main/red-team-attempts)

3. AI2’s [RealToxicityPrompts](https://arxiv.org/pdf/2009.11462.pdf)
Contributor:

Suggested change
3. AI2’s [RealToxicityPrompts](https://arxiv.org/pdf/2009.11462.pdf)
3. Allen Institute for AI’s [RealToxicityPrompts](https://huggingface.co/datasets/allenai/real-toxicity-prompts)
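
If the Hub versions linked above are used, the Anthropic red-team attempts, for example, could be loaded along these lines (the `data_dir` follows the suggested link; check the dataset card for the exact loading instructions and field names):

```python
# Illustrative only: load Anthropic's red-team attempts from the Hugging Face Hub.
# The repo id and data_dir follow the suggested link above; field names may differ.
from datasets import load_dataset

red_team = load_dataset("Anthropic/hh-rlhf", data_dir="red-team-attempts", split="train")
print(red_team)             # number of rows and column names
print(red_team[0].keys())   # fields of a single red-teaming transcript
```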


**Findings from past work on red-teaming LLMs** (from [https://arxiv.org/abs/2209.07858](https://arxiv.org/abs/2209.07858) and [https://arxiv.org/abs/2202.03286](https://arxiv.org/abs/2202.03286))

1. Few-shot-prompted LMs with helpful, honest, and harmless behavior are not harder to red-team than plain LMs.
Member:

I would love to see some explicit examples for each of these bullet points (maybe from their paper?)

2. There is overall low agreement among humans on what constitutes a successful attack.
3. There are no clear trends in attack success rate as model size scales, except for RLHF models, which become more difficult to red-team as they scale.
4. Crowdsourcing red-teaming leads to template-y prompts (e.g., “give a mean word that begins with X”), making them redundant.

**Future directions:**
Member:

Maybe we can add a reference to Anthropic's helpful/harmless and Constitutional AI papers for bleeding edge insights into making this stuff work at scale? https://arxiv.org/abs/2204.05862


1. There is no open-source red-teaming dataset for code generation that attempts to jailbreak a model via code, for example, generating a program that implements a DDOS or backdoor attack.
2. Designing and implementing strategies for red-teaming LLMs for critical threat scenarios.
3. Red-teaming can be resource intensive, in terms of both compute and human resources, so this is a call to action for the LLM community of researchers to collaborate on these efforts for a safe and friendly world :)