🚀🚀 Welcome to the repo of SALMONN!
SALMONN is a large language model (LLM) that accepts speech, audio event, and music inputs, developed by the Department of Electronic Engineering at Tsinghua University and ByteDance. Instead of speech-only or audio-event-only input, SALMONN can perceive and understand all kinds of audio inputs and therefore obtains emergent capabilities such as multilingual speech recognition and translation and audio-speech co-reasoning. This can be regarded as giving the LLM "ears" and cognitive hearing abilities, which makes SALMONN a step towards hearing-enabled artificial general intelligence.
The model architecture of SALMONN is shown below. A window-level Q-Former serves as the connection module: it fuses the outputs of a Whisper speech encoder and a BEATs audio encoder into augmented audio tokens, which are aligned with the LLM input space. The LoRA adaptor aligns the augmented LLM input space with its output space. The text prompt instructs SALMONN to answer open-ended questions about the general audio inputs, and the answers are given in the LLM's text responses.
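To make the windowing idea concrete, here is a minimal Python sketch of the bookkeeping only: it pairs frames from two encoders, groups them into fixed-size windows, and counts the audio tokens that would reach the LLM. The function names, the window size, and the number of queries per window are illustrative assumptions rather than values from this repo, and the Q-Former itself is not implemented.

```python
from math import ceil


def fuse_frames(speech_frames, audio_frames):
    """Concatenate per-frame feature vectors from two encoders.

    Assumes both encoders emit the same number of frames (a hypothetical
    simplification; real encoders may need padding or resampling).
    """
    if len(speech_frames) != len(audio_frames):
        raise ValueError("encoders must produce the same number of frames")
    return [s + a for s, a in zip(speech_frames, audio_frames)]


def split_into_windows(frames, window_size):
    """Group consecutive frames into windows; the last one may be shorter."""
    return [frames[i:i + window_size] for i in range(0, len(frames), window_size)]


def num_audio_tokens(num_frames, window_size, queries_per_window):
    """Each window yields a fixed number of query outputs (audio tokens)."""
    return ceil(num_frames / window_size) * queries_per_window


# Toy input: 10 frames, 2-dim "speech" features and 1-dim "audio" features.
speech = [[float(t), 0.0] for t in range(10)]
audio = [[1.0] for _ in range(10)]

fused = fuse_frames(speech, audio)                  # 10 frames, 3 dims each
windows = split_into_windows(fused, window_size=4)  # window sizes: 4, 4, 2
print(len(windows), num_audio_tokens(len(fused), 4, 2))
```

The point of the sketch is that the number of audio tokens grows with the input length instead of being fixed, which is what a window-level connection module implies.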
Compared with traditional speech and audio processing tasks such as speech recognition and audio captioning, SALMONN leverages the general knowledge and cognitive abilities of the LLM to achieve cognitively oriented audio perception, which dramatically improves the versatility of the model and the richness of the tasks it can handle. In addition, SALMONN is able to follow textual commands and even spoken commands with a relatively high degree of accuracy. Since SALMONN is trained only on data with textual commands, following spoken commands is also a cross-modal emergent ability.
Here are some examples of SALMONN.
Audio | Response |
---|---|
gunshots.wav | |
duck.wav | |
music.wav | |
- Download raw audio files from here
  - Put the path of the downloaded directory into `data_prefix` in the config.
  - It contains 1.4TB of audio, including 168G WavCaps, 165G audiocaps, 110G GigaSpeech, 58G LibriSpeech, 3.7G MusicNet, and 2.0G Clotho.
- Download annotation files from here
  - Place the jsons under the `data` directory.
  - NOTE: Only the train split will be released to the public.
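Before starting a long run, it can help to check that the download is complete. The sketch below is not part of this repo: it assumes the directory given as `data_prefix` has one subdirectory per dataset, named exactly as in the list above, which may not match the actual layout of the download.

```python
from pathlib import Path

# Assumption: subdirectory names match the dataset names listed above.
EXPECTED_DATASETS = [
    "WavCaps",
    "audiocaps",
    "GigaSpeech",
    "LibriSpeech",
    "MusicNet",
    "Clotho",
]


def missing_datasets(data_prefix, expected=EXPECTED_DATASETS):
    """Return the expected dataset directories that are absent under data_prefix."""
    root = Path(data_prefix)
    return [name for name in expected if not (root / name).is_dir()]


# Usage (the path is a placeholder for your own data_prefix):
# print(missing_datasets("/path/to/downloaded/audio"))
```

An empty list means every expected directory was found; adjust `EXPECTED_DATASETS` if your download is organized differently.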
For SALMONN-13B v1, you need to use the following dependencies:
1. Our environment: the Python version is 3.9.17, and the other required packages can be installed with `pip install -r requirements.txt`.
2. Download whisper large v2 to `whisper_path`.
3. Download Fine-tuned BEATs_iter3+ (AS2M) (cpt2) to `beats_path`.
4. Download vicuna 13B v1.1 to `llama_path`.
5. Run `python3 train.py --cfg-path configs/config.yaml`.
   - You may try `--dryrun` to load the dataset and a small dummy model.
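The steps above can be wrapped in a small pre-flight helper. This is a sketch under stated assumptions, not a script shipped with the repo: `missing_paths` and `build_train_command` are hypothetical names, and the config is treated as a flat dictionary, whereas the real `configs/config.yaml` may nest these keys. Only the command, `--cfg-path`, and `--dryrun` come from the steps above.

```python
import os
import shlex

# Config keys named in the steps above; a flat dict is an assumption.
REQUIRED_PATH_KEYS = ("whisper_path", "beats_path", "llama_path")


def missing_paths(cfg):
    """Return required keys whose value is unset or does not exist on disk."""
    return [
        key
        for key in REQUIRED_PATH_KEYS
        if not cfg.get(key) or not os.path.exists(cfg[key])
    ]


def build_train_command(cfg_path="configs/config.yaml", dryrun=False):
    """Build the training command as an argument list for subprocess.run."""
    cmd = ["python3", "train.py", "--cfg-path", cfg_path]
    if dryrun:
        cmd.append("--dryrun")
    return cmd


# Print the dry-run command so it can be inspected or pasted into a shell.
print(shlex.join(build_train_command(dryrun=True)))
```

Passing the list to `subprocess.run(cmd, check=True)` from the repo root would launch the run; starting with `dryrun=True` gives a quick check before committing to a full training job.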

To run inference from the command line:
1. Same as How to train a model: 1-4.
2. Download salmonn v1 to `ckpt`.
3. Run `python3 cli_inference.py --cfg-path configs/decode_config.yaml`. Now you can input `wav_path` and `prompt`. Enjoy yourself!

To launch the web demo:
1. Same as How to train a model: 1-4.
2. Download salmonn v1 to `ckpt`.
3. Run `python3 web_demo.py --cfg-path configs/decode_config.yaml`.

Team Tsinghua: Wenyi Yu, Changli Tang, Guangzhi Sun, Chao Zhang
Team ByteDance: Xianzhao Chen, Wei Li, Tian Tan, Lu Lu, Zejun Ma
If you find SALMONN / video-SALMONN useful, please cite the paper:
@inproceedings{
tang2024salmonn,
title={{SALMONN}: Towards Generic Hearing Abilities for Large Language Models},
author={Changli Tang and Wenyi Yu and Guangzhi Sun and Xianzhao Chen and Tian Tan and Wei Li and Lu Lu and Zejun Ma and Chao Zhang},
booktitle={The Twelfth International Conference on Learning Representations},
year={2024},
url={https://openreview.net/forum?id=14rn7HpKVk}
}