Lyra: An Efficient and Speech-Centric Framework for Omni-Cognition


Overview of Lyra:

Lyra outperforms leading omni-models in three respects:

  1. Stronger performance: achieves SOTA results across a variety of speech-centric tasks.
  2. More versatile: supports image, video, speech/long-speech, and sound understanding, as well as speech generation.
  3. More efficient: requires less training data and supports faster training and inference.

Demo

We provide a video demo here for a better experience and illustration. More examples can be found on our project page, and feel free to try our online demo! Due to the computing cost, the GPU memory of the demo machine (GeForce RTX 3090), and upload storage limits, the long-speech function is not supported in the current online demo. 😰

❗❗❗For the online demo, start by selecting the instruction type (either speech or text) in the top-left corner.

Install

Please follow the instructions below to install the required packages.

  1. Clone this repository:
git clone https://github.com/dvlab-research/Lyra.git
  2. Install the package:
conda create -n lyra python=3.10 -y
conda activate lyra
cd Lyra
pip install --upgrade pip
pip install -e .
  3. Install optional packages for simultaneous text-speech generation:
pip install pip==24.0
pip install fairseq==0.12.2
pip install --upgrade pip

Model

Lyra supports multi-modal inputs. When the data contains a speech modality, we use the latent cross-modality regularizer to assist. Data from each modality is processed through encoders and projectors before being sent into the LLM. Within the LLM, multi-modality LoRA and latent multi-modality extraction modules operate synergistically, facilitating the simultaneous generation of both speech and text outputs.

We provide all our fully finetuned models:

| Model | Base LLM | Vision Encoder | Speech Encoder | Projector | Full CKPT |
|---|---|---|---|---|---|
| Lyra_Mini_3B | Qwen2VL_2B_LLM | Qwen2VL_2B_ViT | whisper-large-v3-turbo | 3B_proj | 3B_ckpt |
| Lyra_Base_9B | Qwen2VL_7B_LLM | Qwen2VL_7B_ViT | whisper-large-v3 | 9B_proj | 9B_ckpt |
| Lyra_Pro_74B | Qwen2VL_70B_LLM | Qwen2VL_70B_ViT | whisper-large-v3 | 74B_proj | 74B_ckpt |

Preparation

Training Data

We provide the processed data for model training. All speech-related training data can be downloaded from Lyra-Data.

For model pretraining, please download the following multi-modality training data and organize it as follows.

⇒ means put the data in the local folder. The pretraining json file can be downloaded from Lyra_Pretrain.

  • LibriSpeech ⇒ data/Lyra_Pretrain/LibriSpeech
    and ⇒ data/Lyra_SFT/multi_modality_speech/LibriSpeech
    and ⇒ data/Lyra_Eval/LibriSpeech (download all training and development data).

  • Common Voice ⇒ data/Lyra_Pretrain/CommonVoice (download the English Common Voice Corpus).

During pretraining, we filtered out some noisy and short speech audio.

For the image part of the finetuning data, similar to Mini-Gemini, please download the following instruction data and organize it as follows.

⇒ means put the data in the local folder.

  • COCO train2017 ⇒ data/Lyra_SFT/multi_modality_image/coco
  • GQA ⇒ data/Lyra_SFT/multi_modality_image/gqa
  • OCR-VQA (we save all files as .jpg) ⇒ data/Lyra_SFT/multi_modality_image/ocr_vqa
  • TextVQA (not included for training) ⇒ data/Lyra_SFT/multi_modality_image/textvqa
  • VisualGenome part1, VisualGenome part2 ⇒ data/Lyra_SFT/multi_modality_image/vg
  • ShareGPT4V-100K ⇒ data/Lyra_SFT/multi_modality_image/sam, share_textvqa, wikiart, ...
  • LAION GPT4V ⇒ data/Lyra_SFT/multi_modality_image/gpt4v-dataset
  • ALLaVA Instruction ⇒ data/Lyra_SFT/multi_modality_image/ALLaVA-4V
  • DocVQA ⇒ data/Lyra_SFT/multi_modality_image/docvqa
  • ChartQA ⇒ data/Lyra_SFT/multi_modality_image/chartqa
  • DVQA ⇒ data/Lyra_SFT/multi_modality_image/dvqa
  • AI2D ⇒ data/Lyra_SFT/multi_modality_image/ai2d

For the audio part of the finetuning data, please download the instruction data and organize it under data/Lyra_SFT/multi_modality_speech, following Structure.

For the long-speech finetuning data, please download the instruction data and organize it under data/Lyra_SFT/long_speech, following Structure.

For the text-speech generation data, please download the instruction data and organize it under data/Lyra_SFT/speech_generation, following Structure.

Evaluation Data

All speech-related evaluation data can be downloaded from Lyra-Evaluation.

For speech-centric evaluation data, we mainly consider three types:

  1. text-speech ability: LibriSpeech, Lyra_needle_in_a_haystack
  2. image-speech ability: TextVQA_speech, MM_vet_speech, Docvqa_val, Chartvqa_human
  3. video-speech ability: VideoMME_speech

Please put the pretraining, finetuning, and evaluation data in the Lyra_Pretrain, Lyra_SFT, and Lyra_Eval folders, respectively, following Structure.

Pretrained Weights

We recommend downloading the following pretrained weights and putting them in model_zoo following Structure: whisper-large-v3-turbo, whisper-large-v3, and imagebind_huge.

Qwen2VL_XB_LLM and Qwen2VL_XB_ViT are extracted from Qwen2-VL to fit our training framework. For your convenience, we also provide the corresponding download links in the Model section.

Download the unit-based HiFi-GAN vocoder using the following commands:

wget https://dl.fbaipublicfiles.com/fairseq/speech_to_speech/vocoder/code_hifigan/mhubert_vp_en_es_fr_it3_400k_layer11_km1000_lj/g_00500000 -P model_zoo/audio/vocoder/
wget https://dl.fbaipublicfiles.com/fairseq/speech_to_speech/vocoder/code_hifigan/mhubert_vp_en_es_fr_it3_400k_layer11_km1000_lj/config.json -P model_zoo/audio/vocoder/

Structure

The folder structure should be organized as follows before training.

Lyra
├── lyra
├── scripts
├── work_dirs
│   ├── Lyra
│   │   ├── Lyra_Mini_3B
│   │   ├── Lyra_Base_9B
│   │   ├── Lyra_Pro_74B
│   │   ├── ...
├── model_zoo
│   ├── LLM
│   │   ├── Qwen2VL_2B_LLM
│   │   ├── Qwen2VL_7B_LLM
│   │   ├── Qwen2VL_70B_LLM
│   │   ├── Qwen2.5
│   │   ├── LLaMA3.2
│   │   ├── ...
│   ├── vision
│   │   ├── Qwen2VL_2B_ViT
│   │   ├── Qwen2VL_7B_ViT
│   │   ├── Qwen2VL_70B_ViT
│   │   ├── clip-vit-large
│   │   ├── siglip
│   │   ├── ConvNeXt
│   │   ├── ...
│   ├── audio
│   │   ├── whisper-large-v3-turbo
│   │   ├── whisper-large-v3
│   │   ├── imagebind_huge
│   │   ├── vocoder
│   │   ├── ...
├── data
│   ├── Lyra_Pretrain
│   │   ├── lyra_pretrain.json
│   │   ├── LibriSpeech
│   │   ├── CommonVoice
│   ├── Lyra_SFT
│   │   ├── multi_modality_speech
│   │   │   ├── lyra_multimodal.json
│   │   │   ├── Lyra_MM
│   │   │   ├── LibriSpeech
│   │   ├── multi_modality_image (similar to MGM-Finetune)
│   │   │   ├── llava
│   │   │   ├── coco
│   │   │   ├── gqa
│   │   │   ├── ocr_vqa
│   │   │   ├── textvqa
│   │   │   ├── vg
│   │   │   ├── gpt4v-dataset
│   │   │   ├── ...
│   │   ├── long_speech
│   │   │   ├── lyra_longspeech.json
│   │   │   ├── Lyra_LongSpeech
│   │   ├── speech_generation
│   │   │   ├── lyra_speechgeneration.json
│   ├── Lyra_Eval
│   │   ├── LibriSpeech
│   │   ├── TextVQA_speech
│   │   ├── MM_vet_speech
│   │   ├── Docvqa_val
│   │   ├── Chartvqa_human
│   │   ├── VideoMME_speech
│   │   ├── Lyra_needle_in_a_haystack
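
Before training, it can help to confirm that the layout above is in place. The following is a minimal sketch, not part of the Lyra codebase: `EXPECTED_DIRS` and `missing_dirs` are hypothetical names, and the list covers only a subset of the tree, so extend it with the models and datasets you actually use.

```python
from pathlib import Path

# Subset of the folder structure shown above (hypothetical helper, not shipped with Lyra).
EXPECTED_DIRS = (
    "model_zoo/LLM",
    "model_zoo/vision",
    "model_zoo/audio/vocoder",
    "data/Lyra_Pretrain",
    "data/Lyra_SFT/multi_modality_speech",
    "data/Lyra_SFT/multi_modality_image",
    "data/Lyra_SFT/long_speech",
    "data/Lyra_SFT/speech_generation",
    "data/Lyra_Eval",
)


def missing_dirs(root, expected=EXPECTED_DIRS):
    """Return the expected sub-directories that do not exist under `root`."""
    root = Path(root)
    return [d for d in expected if not (root / d).is_dir()]


if __name__ == "__main__":
    # Run from the repository root (the top-level "Lyra" folder).
    for d in missing_dirs("."):
        print(f"missing: {d}")
```

An empty result means every listed folder exists; it does not verify the folder contents.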

Train

The training process consists of four stages: (1) feature alignment stage: bridge the speech and language tokens; (2) multi-modality instruction tuning stage: teach the model to follow text-image-speech multimodal instructions; (3) long-speech instruction tuning stage: enable the model to handle long speech audio; (4) text-speech streaming generation stage: enable the model to stream text and speech simultaneously.

Our models are trained on 8 A100 GPUs with 80GB memory. To train on fewer GPUs, you can reduce the per_device_train_batch_size and increase the gradient_accumulation_steps accordingly. Always keep the global batch size the same: per_device_train_batch_size x gradient_accumulation_steps x num_gpus.
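
As a worked example of the batch-size rule above, the sketch below computes the gradient_accumulation_steps needed to preserve a given global batch size when the GPU count changes. The function name `accumulation_steps` and the numbers used (per-device batch 4, accumulation 2, 8 GPUs) are illustrative assumptions, not values taken from the training scripts.

```python
def accumulation_steps(global_batch: int, per_device_batch: int, num_gpus: int) -> int:
    """Return the gradient_accumulation_steps that keeps the global batch size fixed.

    global_batch = per_device_train_batch_size x gradient_accumulation_steps x num_gpus
    """
    per_step = per_device_batch * num_gpus
    if global_batch % per_step != 0:
        raise ValueError(
            f"global batch {global_batch} is not divisible by "
            f"per_device_batch x num_gpus = {per_step}"
        )
    return global_batch // per_step


# Illustrative reference setup: 4 x 2 x 8 = 64 samples per optimizer step.
reference_global = 4 * 2 * 8

# Moving from 8 GPUs to 4 GPUs at the same per-device batch size
# requires doubling the accumulation steps.
print(accumulation_steps(reference_global, per_device_batch=4, num_gpus=4))  # 4
```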

Please make sure you download and organize the data following Preparation before training.

NOTE: Please set hostfile/hostfile_2 for 2-machine training and hostfile/hostfile_4 for 4-machine training.

(1) feature alignment stage:

bash scripts/train/Lyra_Base_9B/Lyra_Base_qwen2vl_9B_Pretrain.sh

(2) multi-modality instruction tuning stage:

bash scripts/train/Lyra_Base_9B/Lyra_Base_qwen2vl_9B_SFT_text_image_speech.sh

(3) long-speech instruction tuning stage:

bash scripts/train/Lyra_Base_9B/Lyra_Base_qwen2vl_9B_SFT_long_speech.sh

(4) text-speech streaming generation stage:

bash scripts/train/Lyra_Base_9B/Lyra_Base_qwen2vl_9B_SFT_speech_generate.sh
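
To run the four stages back to back, a small driver can call the scripts in order and stop at the first failure. This is a sketch under assumptions: `run_stages` is a hypothetical helper, and it presumes each script is already configured to find the checkpoint produced by the previous stage. It defaults to a dry run that only prints the commands.

```python
import subprocess

# The four Lyra_Base_9B stage scripts listed above, in training order.
STAGE_SCRIPTS = (
    "scripts/train/Lyra_Base_9B/Lyra_Base_qwen2vl_9B_Pretrain.sh",
    "scripts/train/Lyra_Base_9B/Lyra_Base_qwen2vl_9B_SFT_text_image_speech.sh",
    "scripts/train/Lyra_Base_9B/Lyra_Base_qwen2vl_9B_SFT_long_speech.sh",
    "scripts/train/Lyra_Base_9B/Lyra_Base_qwen2vl_9B_SFT_speech_generate.sh",
)


def run_stages(scripts=STAGE_SCRIPTS, dry_run=True):
    """Run the stage scripts sequentially; return the commands that were issued."""
    issued = []
    for stage, script in enumerate(scripts, start=1):
        cmd = ["bash", script]
        print(f"stage {stage}: {' '.join(cmd)}")
        if not dry_run:
            # check=True raises CalledProcessError, so later stages never
            # start from a failed earlier stage.
            subprocess.run(cmd, check=True)
        issued.append(cmd)
    return issued


if __name__ == "__main__":
    run_stages(dry_run=True)  # set dry_run=False to actually launch training
```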

Evaluation

Benchmarks Results

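Column groups: Text-Image (TextVQA, MME, MM-Vet), Text-Video (VideoMME, MVBench, Egoschema), Image-Speech (TextVQA, DocVQA, and ChartQA with speech instructions), Text-Speech (LibriSpeech).

| Omni Comparison | Params. | TextVQA | MME | MM-Vet | VideoMME | MVBench | Egoschema | TextVQA (speech) | DocVQA (speech) | ChartQA (speech) | LibriSpeech |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Mini-Gemini | 8B | 71.9 | 1989 | 53.5 | - | - | - | - | - | - | - |
| LLaVA-OV | 7B | 65.4 | 1998 | 57.5 | 58.2 | 56.7 | 60.1 | - | - | - | - |
| Intern-VL2 | 8B | 77.4 | 2211 | 60.0 | 54.0 | - | - | - | - | - | - |
| Mini-Omni | 7B | - | - | - | - | - | - | - | - | - | 4.5 |
| SALMONN | 13B | - | - | - | - | - | - | - | - | - | 2.1 |
| Qwen2-Audio | 8B | - | - | - | - | - | - | - | - | - | 1.6 |
| Intern-Omni | 8B | 80.6 | 2210 | 60.0 | - | - | - | 69.1 | 79.9 | 56.0 | - |
| VITA | 66B | - | 2097 | 41.6 | 59.2 | - | - | - | - | - | 8.1 |
| EMOVA | 14B | 82.0 | 2205 | 55.8 | - | - | - | - | - | - | 4.0 |
| Lyra-Mini | 3B | 78.3 | 1884 | 51.2 | 55.0 | 62.5 | 54.1 | 73.9 | 75.0 | 40.7 | 2.4 |
| Lyra-Base | 9B | 82.6 | 2335 | 63.5 | 62.8 | 67.2 | 63.2 | 80.0 | 85.5 | 61.0 | 2.0 |
| Lyra-Pro | 74B | 83.5 | 2485 | 71.4 | 69.9 | 72.3 | 75.8 | 81.0 | 89.4 | 68.5 | 1.8 |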

Benchmarks scripts

Please make sure you download and organize the evaluation data following Preparation before starting evaluation.

We provide five speech-centric evaluation benchmark scripts here:

Text-speech ability: LibriSpeech:

# you can change the model path and lora path in the script:
# e.g., CKPT="Lyra_Mini_3B", LORA_PATH="Lyra_Mini_3B/speech_lora"
# e.g., CKPT="Lyra_Base_9B", LORA_PATH="Lyra_Base_9B/speech_lora"
# the LibriSpeech test-clean WER result of Lyra-Mini-3B is about 2.4%
# the LibriSpeech test-clean WER result of Lyra-Base-9B is about 2.0%
bash scripts/eval/lyra_librispeech_wer.sh
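
For readers unfamiliar with the metric, word error rate (WER) is the word-level edit distance between the reference transcript and the model output, divided by the number of reference words. The sketch below is a generic illustration and not the code in lyra_librispeech_wer.sh: the function name `word_error_rate` is hypothetical, and the evaluation script may normalize text (casing, punctuation) differently before scoring.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One deleted word ("the") over six reference words.
print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Corpus-level WER is usually computed by summing errors and reference words over all utterances rather than averaging per-utterance rates.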

Image-speech ability: TextVQA_speech:

# the TextVQA (speech) accuracy result of Lyra-Mini-3B is about 73.6%
# the TextVQA (speech) accuracy result of Lyra-Base-9B is about 80.5%
bash scripts/eval/lyra_textvqa_speech.sh

Image-speech ability: Chartvqa_human:

# the ChartQA (speech) accuracy result of Lyra-Mini-3B is about 42.2%
# the ChartQA (speech) accuracy result of Lyra-Base-9B is about 61.0%
bash scripts/eval/lyra_chartvqa_speech.sh

Image-speech ability: Docvqa_val:

# the DocVQA (speech) accuracy result of Lyra-Mini-3B is about 76.0%
# the DocVQA (speech) accuracy result of Lyra-Base-9B is about 86.2%
bash scripts/eval/lyra_docvqa_speech.sh

Image-speech ability: MM_vet_speech:

# the MM-Vet (speech) accuracy result of Lyra-Mini-3B is about 47.8%
# the MM-Vet (speech) accuracy result of Lyra-Base-9B is about 57.0%
# you need to submit the file, e.g., work_dirs/MM_vet_speech/Lyra_xxx_xB/Lyra_xxx_xB.json
# to https://huggingface.co/spaces/whyu/MM-Vet_Evaluator for GPT judgement
bash scripts/eval_release/lyra_mmvet_speech.sh

CLI Inference

Chat with images without the need for the Gradio interface. The CLI also supports multiple GPUs and 4-bit and 8-bit quantized inference. Please make sure you have installed fairseq for speech generation, then try the following commands for inference with speech input and speech generation:

# --image-file:      <path to your image: context>
# --audio-file:      <path to your audio: instruction>
# --generate-speech: also generate speech output <output path to generated speech: examples/pred_roundX.wav>
python -m lyra.serve.cli \
	--model-path work_dirs/Lyra_Mini_3B \
	--image-file examples/Chinese_painting.jpg \
	--audio-file examples/Chinese_painting.mp3 \
	--generate-speech
	
python -m lyra.serve.cli \
	--model-path work_dirs/Lyra_Base_9B \
	--image-file examples/Chinese_painting.jpg \
	--audio-file examples/Chinese_painting.mp3 \
	--generate-speech

Lyra can also handle long speech input (up to about two to three hours; A100 GPUs are recommended).

Here is an example: ABC News, Oct. 1, 2024, 20 mins:

# --audio-file: <path to your long audio: context>
# the instruction is entered as text from the keyboard
python -m lyra.serve.cli \
	--model-path work_dirs/Lyra_Mini_3B \
	--audio-file examples/ABC_News_20241001.mp3 \
	--generate-speech

python -m lyra.serve.cli \
	--model-path work_dirs/Lyra_Base_9B \
	--audio-file examples/ABC_News_20241001.mp3 \
	--generate-speech

Here is an example of video input with its audio (you can use ffmpeg or other tools to extract the video's audio):

# --video-file: <path to your video: context>
# --audio-file: <path to your audio: instruction>
python -m lyra.serve.cli \
	--model-path work_dirs/Lyra_Mini_3B \
	--video-file examples/movement.mp4 \
	--audio-file examples/movement.mp3 \
	--generate-speech
	
python -m lyra.serve.cli \
	--model-path work_dirs/Lyra_Base_9B \
	--video-file examples/movement.mp4 \
	--audio-file examples/movement.mp3 \
	--generate-speech
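
The audio track mentioned above can be extracted with ffmpeg. The sketch below wraps the call in Python; the helper names `ffmpeg_extract_audio_cmd` and `extract_audio` are hypothetical, and it assumes ffmpeg is installed and on your PATH. The `-vn` option drops the video stream, and ffmpeg infers the audio format from the output file extension.

```python
import subprocess


def ffmpeg_extract_audio_cmd(video_path: str, audio_path: str) -> list[str]:
    """Build the ffmpeg command that writes the audio track of a video to a file."""
    # -y: overwrite the output if it exists; -i: input file; -vn: no video stream.
    return ["ffmpeg", "-y", "-i", video_path, "-vn", audio_path]


def extract_audio(video_path: str, audio_path: str) -> None:
    """Run ffmpeg; raises CalledProcessError if the extraction fails."""
    subprocess.run(ffmpeg_extract_audio_cmd(video_path, audio_path), check=True)


if __name__ == "__main__":
    # Prints the command for the example files used above without running it.
    print(" ".join(ffmpeg_extract_audio_cmd("examples/movement.mp4", "examples/movement.mp3")))
```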

Here is an example for video input and text instruction:

# --video-file: <path to your video: context>
# the instruction is entered as text from the keyboard
python -m lyra.serve.cli \
	--model-path work_dirs/Lyra_Mini_3B \
	--video-file examples/Trump.mp4 \
	--generate-speech

python -m lyra.serve.cli \
	--model-path work_dirs/Lyra_Base_9B \
	--video-file examples/Trump.mp4 \
	--generate-speech

Gradio Web UI

Here, we adopt the Gradio UI similar to that in LLaVA to provide a user-friendly omni-interface (text, image, video, and audio) for our models. The UI example illustration is as follows:

To launch a Gradio demo locally, please run the following commands one by one. If you plan to launch multiple model workers to compare between different checkpoints, you only need to launch the controller and the web server ONCE.

Launch a controller

python -m lyra.serve.controller --host 0.0.0.0 --port 10000

Launch a gradio web server.

python -m lyra.serve.gradio_web_server \
	--controller http://localhost:10000 \
	--model-list-mode reload \
	--ssl-certfile Your/SSL/Certfile/Path/cert.pem \
	--ssl-keyfile Your/SSL/Keyfile/Path/key.pem

You just launched the Gradio web interface. Now, you can open the web interface with the URL printed on the screen. You may notice that there is no model in the model list. Do not worry, as we have not launched any model worker yet. It will be automatically updated when you launch a model worker.

Launch a model worker

This is the actual worker that performs the inference on the GPU. Each worker is responsible for a single model specified in --model-path.

CUDA_VISIBLE_DEVICES='0,1,2,3,4,5,6,7' python -m lyra.serve.model_worker \
	--host 0.0.0.0 \
	--controller http://localhost:10000 \
	--port 40000 \
	--worker http://localhost:40000 \
	--model-path work_dirs/Lyra_Base_9B \
	--model-lora-path work_dirs/Lyra_Base_9B/speech_lora

Wait until the process finishes loading the model and you see "Uvicorn running on XXXXXX (Press CTRL+C to quit)". Now, refresh your Gradio web UI, and you will see the model you just launched in the model list.

You can launch as many workers as you want, and compare between different models in the same Gradio interface. Please keep the --controller the same, and modify the --port and --worker to a different port number for each worker.

CUDA_VISIBLE_DEVICES='0,1,2,3,4,5,6,7' python -m lyra.serve.model_worker \
	--host 0.0.0.0 \
	--controller http://localhost:10000 \
	--port <different from 40000, say 40001> \
	--worker http://localhost:<change accordingly, i.e. 40001> \
	--model-path work_dirs/Lyra_Mini_3B \
	--model-lora-path work_dirs/Lyra_Mini_3B/speech_lora
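
When comparing several checkpoints, the per-worker port bookkeeping can be generated instead of typed by hand. The sketch below only builds the command strings using the flags shown above; `worker_command` is a hypothetical helper, and the base port 40000 and the speech_lora sub-folder follow the examples in this section.

```python
import shlex


def worker_command(model_path, port, lora_path=None,
                   controller="http://localhost:10000"):
    """Build a model_worker command line with a matching --port and --worker URL."""
    args = [
        "python", "-m", "lyra.serve.model_worker",
        "--host", "0.0.0.0",
        "--controller", controller,          # same controller for every worker
        "--port", str(port),                 # unique per worker
        "--worker", f"http://localhost:{port}",
        "--model-path", model_path,
    ]
    if lora_path:
        args += ["--model-lora-path", lora_path]
    return shlex.join(args)


if __name__ == "__main__":
    models = ["work_dirs/Lyra_Base_9B", "work_dirs/Lyra_Mini_3B"]
    for offset, path in enumerate(models):
        # Each worker gets its own port: 40000, 40001, ...
        print(worker_command(path, 40000 + offset, f"{path}/speech_lora"))
```

Each printed line can be run in its own terminal, prefixed with the CUDA_VISIBLE_DEVICES setting that suits your machine.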

Examples

We provide some examples in this section. More examples can be found on our project page.

Citation

If you find this repo useful for your research, please consider citing the paper😊:

@article{zhong2024lyra,
  title={Lyra: An Efficient and Speech-Centric Framework for Omni-Cognition},
  author={Zhong, Zhisheng and Wang, Chengyao and Liu, Yuqi and Yang, Senqiao and Tang, Longxiang and Zhang, Yuechen and Li, Jingyao and Qu, Tianyuan and Li, Yanwei and Chen, Yukang and Yu, Shaozuo and Wu, Sitong and Lo, Eric and Liu, Shu and Jia, Jiaya},
  journal={arXiv preprint arXiv:2412.09501},
  year={2024}
}

Acknowledgement

We would like to thank the following repos for their great work:

License

The data and checkpoints are intended and licensed for research use only. They are also restricted to uses that follow the license agreements of LLaVA, Qwen, LLaMA, Whisper, and GPT-4o. The dataset is licensed under CC BY NC 4.0 (allowing only non-commercial use), and models trained using the dataset should not be used outside of research purposes.