'Best-of-N' Jailbreaking of ChatGPT (gpt-4o)

Purpose

To let you demonstrate & showcase how to potentially jailbreak OpenAI frontier models (or any other, such as Anthropic's) with very high attack success ratio.

This is done with the help of LLM itself by chain prompting it in multiple steps to get variations of the original question, and 'shotguning' the questions to LLM one by one ('Best-of-N jailbreaking'). Then the jailbroken questions are picked and it's demonstrated how to deepen the discussion.

Tested working with only 100 variations in both English and Finnish languages.
Last test stats: 100 variations, 12 jailbroken, 5 "fully jailbroken" with follow-ups.

Installation

clone the repository or download the files
use virtual environment

python3 -m venv .
source ./bin/activate

install required libraries

python3 -m pip install pip
pip install setuptools wheel
pip install python-dotenv openai

Requirements

OPENAI_API_KEY (+ OPENAI_ORG_ID)
- Stored in project root in .env file
- Get yours from OpenAI API account
Some $ in your OpenAI account
You can use any other preferred LLM instead
- Just change the keys and check the syntax of completions and structured outputs

Usage

Enter your question to it's place in code.
Adjust the prompts however you see best, if needed.
Enter your paths to store the interim & final results.
Adjust the number of variations as you see fit for your demonstration

Raise if it's needed for better results.

Run the code with 'python3 jailbreaking_chatgpt.py'.

Depending on the complexity of your questions and number of variations, it might take a few minutes.

Check the results from the prints & your files.

Further study

I might add some other methods & techniques later to this repository, e.g. how to further chain use LLM for improved results.

Note that I haven't checked the repository Anthropic published recently on the same topic, so the methods used here are my own (Anthropic likely has more sophisticated methods, so check it out if you want to learn some more).

License

MIT License. Use for educational purposes & own responsibility.

Name		Name	Last commit message	Last commit date
Latest commit History 8 Commits
src		src
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
lib64		lib64

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

'Best-of-N' Jailbreaking of ChatGPT (gpt-4o)

Purpose

Installation

Requirements

Usage

Further study

License

About

Releases

Packages

Languages

License

niemisentimppa/AI_safety_research

Folders and files

Latest commit

History

Repository files navigation

'Best-of-N' Jailbreaking of ChatGPT (gpt-4o)

Purpose

Installation

Requirements

Usage

Further study

License

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages