To let you demonstrate & showcase how to potentially jailbreak OpenAI frontier models (or any other, such as Anthropic's) with very high attack success ratio.
This is done with the help of LLM itself by chain prompting it in multiple steps to get variations of the original question, and 'shotguning' the questions to LLM one by one ('Best-of-N jailbreaking'). Then the jailbroken questions are picked and it's demonstrated how to deepen the discussion.
- Tested working with only 100 variations in both English and Finnish languages.
- Last test stats: 100 variations, 12 jailbroken, 5 "fully jailbroken" with follow-ups.
-
clone the repository or download the files
-
use virtual environment
python3 -m venv .
source ./bin/activate
- install required libraries
python3 -m pip install pip
pip install setuptools wheel
pip install python-dotenv openai
-
OPENAI_API_KEY (+ OPENAI_ORG_ID)
- Stored in project root in .env file
- Get yours from OpenAI API account
-
Some $ in your OpenAI account
-
You can use any other preferred LLM instead
- Just change the keys and check the syntax of completions and structured outputs
-
Enter your question to it's place in code.
-
Adjust the prompts however you see best, if needed.
-
Enter your paths to store the interim & final results.
-
Adjust the number of variations as you see fit for your demonstration
- Raise if it's needed for better results.
- Run the code with 'python3 jailbreaking_chatgpt.py'.
- Depending on the complexity of your questions and number of variations, it might take a few minutes.
- Check the results from the prints & your files.
I might add some other methods & techniques later to this repository, e.g. how to further chain use LLM for improved results.
Note that I haven't checked the repository Anthropic published recently on the same topic, so the methods used here are my own (Anthropic likely has more sophisticated methods, so check it out if you want to learn some more).
MIT License. Use for educational purposes & own responsibility.