
🦜 Introduction

This paper proposes TimeSuite, a collection of new designs that adapt existing short-form video MLLMs to long video understanding: a simple yet efficient framework for processing long video sequences, a high-quality video dataset for grounded tuning of MLLMs, and a carefully designed instruction-tuning task that explicitly incorporates grounding supervision into the traditional QA format.

State-of-the-art performance: VideoChat-T achieves strong results on both long-form video question answering and temporal grounding.

Highly efficient model architecture with exceptional inference speed: each video frame is encoded into just 3 tokens, so the FLOPs of VideoChat-T are only 5.1% of those of LLaVA-OneVision (see the sketch below).
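To give a rough feel for this token budget, here is a minimal sketch of compressing a frame's patch tokens down to 3 tokens via adaptive pooling. This is our own illustrative assumption, not the released implementation; the function name `compress_frame_tokens` and the pooling choice are hypothetical, and the actual compression module in VideoChat-T may differ.

```python
import torch
import torch.nn.functional as F

def compress_frame_tokens(frame_tokens: torch.Tensor) -> torch.Tensor:
    """Compress one frame's visual tokens to 3 tokens.

    frame_tokens: (num_patches, hidden_dim), e.g. 256 patch tokens.
    returns:      (3, hidden_dim)

    Hypothetical sketch: the README states 3 tokens per frame, but the
    actual compression mechanism may not be simple pooling.
    """
    # (num_patches, D) -> (1, D, num_patches) for 1-D pooling over patches
    x = frame_tokens.t().unsqueeze(0)
    pooled = F.adaptive_avg_pool1d(x, output_size=3)  # (1, D, 3)
    return pooled.squeeze(0).t()                      # (3, D)

# Example: a 100-frame clip, 256 patch tokens of width 1024 per frame
video = torch.randn(100, 256, 1024)
compressed = torch.stack([compress_frame_tokens(f) for f in video])
print(compressed.shape)  # torch.Size([100, 3, 1024]) -> 300 tokens total
```

Compressing each frame this aggressively is what keeps the sequence length, and hence the attention FLOPs, small even for long videos.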

High-quality data:

  • We introduce TimePro, a comprehensive dataset covering 9 task types with video sources from 15 different datasets.
  • We design a novel Temporal Grounded Caption fine-tuning task that effectively mitigates hallucination in MLLMs; a hypothetical sample format is sketched below.
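For intuition, the following shows what a Temporal Grounded Caption training sample might look like in a conversation-style format. The field names, file path, and `<start-end>` timestamp convention are our assumptions for illustration, not the released TimePro schema.

```python
# Hypothetical TimePro-style sample for the Temporal Grounded Caption task.
# The model must caption the clip AND cite the supporting time spans,
# tying each statement to temporal evidence to discourage hallucination.
sample = {
    "video": "videos/cooking_0421.mp4",   # hypothetical path
    "task": "temporal_grounded_caption",
    "conversations": [
        {
            "from": "human",
            "value": "Describe what happens in this video and give the "
                     "start/end timestamps for each event you mention.",
        },
        {
            "from": "assistant",
            "value": "A chef dices onions <12.0-25.5>, then sears them "
                     "in a pan <26.0-41.2>.",
        },
    ],
}
```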

🔥 Updates

TODO

Inference & Demo

TODO

Evaluation Results

TODO

Grounded Training

TODO

📄 Citation

If you find this project useful in your research, please consider citing it:

@article{zeng2024timesuite,
  title={{TimeSuite}: Improving {MLLMs} for Long Video Understanding via Grounded Tuning},
  author={Zeng, Xiangyu and Li, Kunchang and Wang, Chenting and Li, Xinhao and Jiang, Tianxiang and Yan, Ziang and Li, Songze and Shi, Yansong and Yue, Zhengrong and Wang, Yi and others},
  journal={arXiv preprint arXiv:2410.19702},
  year={2024}
}

💫 Acknowledgement