Official code for the paper **Model Surgery: Modulating LLM's Behavior Via Simple Parameter Editing**.
Large Language Models (LLMs) have shown great potential as AI assistants, but ensuring their safety and reliability remains a challenge. Current methods for aligning LLM behavior, such as Supervised Fine-Tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF), are computationally expensive and may degrade model performance.
We introduce a novel approach called "model surgery" that modulates LLM behavior by directly editing a small subset of model parameters. Our method:

- Uses a linear classifier (behavior probe) to identify critical parameters influencing targeted behaviors
- Edits selected parameters to shift model outputs away from undesirable behaviors
- Requires only inference-level computational resources
- Preserves core model capabilities while effectively modulating behavior
We demonstrate the effectiveness of our approach on tasks including detoxification, jailbreak resistance, and attitude adjustment.
| Model | Download |
|---|---|
| LLaMA2-7B | https://huggingface.co/meta-llama/Llama-2-7b-hf |
| LLaMA2-7B-Chat | https://huggingface.co/meta-llama/Llama-2-7b-chat-hf |
| CodeLLaMA-7B | https://huggingface.co/meta-llama/CodeLlama-7b-hf |
| Mistral-7B-v0.1 | https://huggingface.co/mistralai/Mistral-7B-v0.1 |
To prepare the evaluation data, run:

```bash
bash scripts/prepare_eval_data.sh
```
```bash
git clone https://github.com/lucywang720/model-surgery.git
cd model-surgery
conda create -n surgery python=3.9
conda activate surgery
pip install -r requirements.txt
```
We provide scripts to run our experiments directly.
This step trains a linear classifier (behavior probe) to identify specific behaviors in the LLM's hidden states.

```bash
bash scripts/training.sh
```

or

```bash
python -m train --data_path jigsaw.txt --save_model --pretrained_model llama2 --epochs 20 --learning_rate 0.0001 --output_fp probe_llama --batch_size 32
```
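To make the probe-training step concrete, here is a minimal, self-contained sketch (not the repository's `train.py`): a single linear layer is fit with logistic regression to separate stand-in hidden states of two behavior classes. The synthetic data, dimensions, and hyperparameters are all illustrative.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
hidden_dim = 64  # illustrative; real LLM hidden states are much larger

# Synthetic stand-ins for hidden states of toxic vs. non-toxic text.
toxic = torch.randn(256, hidden_dim) + 1.0
benign = torch.randn(256, hidden_dim) - 1.0
x = torch.cat([toxic, benign])
y = torch.cat([torch.ones(256), torch.zeros(256)])

# The behavior probe is just a linear classifier over hidden states.
probe = nn.Linear(hidden_dim, 1)
opt = torch.optim.Adam(probe.parameters(), lr=1e-2)
loss_fn = nn.BCEWithLogitsLoss()
for _ in range(100):
    opt.zero_grad()
    loss = loss_fn(probe(x).squeeze(-1), y)
    loss.backward()
    opt.step()

# Fraction of examples the trained probe classifies correctly.
accuracy = ((probe(x).squeeze(-1) > 0) == y.bool()).float().mean().item()
```

The probe's weight vector then serves as a direction in hidden-state space associated with the target behavior, which the editing step below uses.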
Using the extracted probe, this step modifies selected model parameters to shift behavior. You may need to set your own probe path in the scripts.

```bash
bash scripts/modify.sh
```

or

```bash
python -m modify \
    --save_dir llama2-non-toxic \
    --model_name_or_path llama2 \
    --alpha $alpha \
    --toxic_path probe.pt \
    --save_model
```
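The editing idea can be sketched as follows. This is a hedged illustration, not the repository's `modify.py`: it selects the weight rows most aligned with the probe direction and nudges them away from it, scaled by `alpha`. The weight shape, row-selection rule, and `alpha` value are all assumptions for demonstration.

```python
import torch

torch.manual_seed(0)
hidden_dim = 64
alpha = 0.1  # edit strength, analogous to the --alpha flag above

probe_dir = torch.randn(hidden_dim)
probe_dir /= probe_dir.norm()          # unit "behavior direction" from the probe

W = torch.randn(128, hidden_dim)       # stand-in for one model weight matrix
scores = W @ probe_dir                 # per-row alignment with the direction
selected = scores.abs().topk(16).indices  # edit only the most aligned rows

# Shift each selected row away from the behavior direction.
W_edited = W.clone()
W_edited[selected] -= alpha * torch.sign(scores[selected]).unsqueeze(-1) * probe_dir

# The edited rows now project less onto the probe direction; all other
# rows are untouched, which is why core capabilities are preserved.
before = (W[selected] @ probe_dir).abs().sum()
after = (W_edited[selected] @ probe_dir).abs().sum()
```

Because only a small subset of rows changes, the edit costs no more than a forward pass worth of computation, matching the inference-level resource claim above.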
Assess the performance of the modified model on various tasks to verify both the behavior change and capability preservation. We offer one-click running scripts:

```bash
bash scripts/eval.sh
```
For quick experimentation, you can use the pre-trained behavior probes provided in `./main/modification/checkpoint`.
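A hypothetical sketch of loading a saved probe and scoring a hidden state with it. The actual checkpoint format under `./main/modification/checkpoint` may differ; here a plain tensor is saved and reloaded with `torch.save`/`torch.load` purely for illustration.

```python
import os
import tempfile

import torch

hidden_dim = 64
probe_weight = torch.randn(1, hidden_dim)  # stand-in for a trained probe

# Round-trip through a checkpoint file (written to a temp dir here).
path = os.path.join(tempfile.mkdtemp(), "probe.pt")
torch.save(probe_weight, path)
loaded = torch.load(path)

h = torch.randn(hidden_dim)       # stand-in for one LLM hidden state
score = (loaded @ h).item()       # larger score = more of the probed behavior
```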
```bibtex
@misc{wang2024modelsurgerymodulatingllms,
      title={Model Surgery: Modulating LLM's Behavior Via Simple Parameter Editing},
      author={Huanqian Wang and Yang Yue and Rui Lu and Jingxin Shi and Andrew Zhao and Shenzhi Wang and Shiji Song and Gao Huang},
      year={2024},
      eprint={2407.08770},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2407.08770},
}
```
The code in this repository is still being reorganized, and errors introduced during reorganization may cause malfunctions or discrepancies from the original research results. If you encounter any problems, or have any questions or feedback, please open an issue or contact the author: [email protected]