
LLM

[arxiv 2024.12] Training Large Language Models to Reason in a Continuous Latent Space [PDF]


Multi-modality Generation

[arxiv 2024.11] Multimodal Alignment and Fusion: A Survey [PDF]

[arxiv 2025.01] Next Token Prediction Towards Multimodal Intelligence [PDF, Page] Code

[arxiv 2025.01] MiniMax-01: Scaling Foundation Models with Lightning Attention [PDF, Page] Code


[arxiv 2023.07] Generative Pretraining in Multimodality [PDF, Page]

[arxiv 2023.07] Generating Images with Multimodal Language Models [PDF, Page]

[arxiv 2023.07] 3D-LLM: Injecting the 3D World into Large Language Models [[PDF](https://arxiv.org/abs/2307.12981), Page]

[arxiv 2023.10] Making LLaMA SEE and Draw with SEED Tokenizer [PDF, Page]

[arxiv 2023.10] Iterative Self-Refinement with GPT-4V(ision) for Automatic Image Design and Generation [PDF, Page]

[arxiv 2023.12] CoDi-2: In-Context, Interleaved, and Interactive Any-to-Any Generation [PDF, Page]

[arxiv 2023.12] SmartEdit: Exploring Complex Instruction-based Image Editing with Multimodal Large Language Models [PDF, Page]

[arxiv 2023.12] InstructAny2Pix: Flexible Visual Editing via Multimodal Instruction Following [PDF, Page]

[arxiv 2023.12] 4M: Massively Multimodal Masked Modeling [PDF, Page]

[arxiv 2023.12] Gemini: A Family of Highly Capable Multimodal Models [PDF]

[arxiv 2023.12] Generative Multimodal Models are In-Context Learners [PDF, Page]

[arxiv 2024.01] DiffusionGPT: LLM-Driven Text-to-Image Generation System [PDF]

[arxiv 2024.01] Image Anything: Towards Reasoning-coherent and Training-free Multi-modal Image Generation [PDF, Page]

[arxiv 2024.03] 3D-VLA: 3D Vision-Language-Action Generative World Model [PDF, Page]

[arxiv 2024.03] LLaVA-PruMerge: Adaptive Token Reduction for Efficient Large Multimodal Models [PDF, Page]

[arxiv 2024.04] SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation [PDF]

[arxiv 2024.06] The Evolution of Multimodal Model Architectures [PDF]

[arxiv 2024.08] Show-o: One Single Transformer to Unify Multimodal Understanding and Generation [PDF, Page]

[arxiv 2024.09] VILA-U: a Unified Foundation Model Integrating Visual Understanding and Generation [PDF]

[arxiv 2024.09] EMOVA: Empowering Language Models to See, Hear and Speak with Vivid Emotions [PDF, Page]

[arxiv 2024.09] Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models [PDF]

[arxiv 2024.09] MonoFormer: One Transformer for Both Diffusion and Autoregression [PDF, Page]

[arxiv 2024.09] Visual Prompting in Multimodal Large Language Models: A Survey [PDF]

[arxiv 2024.09] Emu3: Next-Token Prediction is All You Need [PDF, Page]

[arxiv 2024.10] ACE: All-round Creator and Editor Following Instructions via Diffusion Transformer [PDF, Page]

[arxiv 2024.10] Baichuan-Omni Technical Report [PDF, Page]

[arxiv 2024.10] PUMA: Empowering Unified MLLM with Multi-granular Visual Generation [PDF, Page]

[arxiv 2024.10] Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation [PDF, Page]

[arxiv 2024.11] Spider: Any-to-Many Multimodal LLM [PDF]

[arxiv 2024.12] TokenFlow: Unified Image Tokenizer for Multimodal Understanding and Generation [PDF, Page] Code

[arxiv 2024.12] Multimodal Latent Language Modeling with Next-Token Diffusion [PDF, Page]

[arxiv 2024.12] InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions [PDF, Page] Code

[arxiv 2025.02] Long-VITA: Scaling Large Multi-modal Models to 1 Million Tokens with Leading Short-Context Accuracy [PDF, Page] Code

[arxiv 2025.02] CoS: Chain-of-Shot Prompting for Long Video Understanding [PDF, Page] Code


Feedback

[arxiv 2025.02] DAMO: Data- and Model-aware Alignment of Multi-modal LLMs [PDF, Page] Code


Agent

[arxiv 2024.10] Agent S: An Open Agentic Framework that Uses Computers Like a Human [PDF, Page]

[arxiv 2024.12] SPAgent: Adaptive Task Decomposition and Model Selection for General Video Generation and Editing [PDF]

[arxiv 2024.12] TeamCraft: A Benchmark for Multi-Modal Multi-Agent Systems in Minecraft [PDF, Page] Code

[arxiv 2025.01] UnrealZoo: Enriching Photo-realistic Virtual Worlds for Embodied AI [PDF, Page] Code


Multi-modality Understanding

[arxiv 2024.10] TemporalBench: Benchmarking Fine-grained Temporal Understanding for Multimodal Video Models [PDF, Page]

[arxiv 2024.10] γ-MoD: Exploring Mixture-of-Depth Adaptation for Multimodal Large Language Models [PDF, Page]

[arxiv 2024.10] Remember, Retrieve and Generate: Understanding Infinite Visual Concepts as Your Personalized Assistant [PDF, Page]

[arxiv 2024.11] LLaVA-o1: Let Vision Language Models Reason Step-by-Step [PDF, Page]

[arxiv 2024.11] CATCH: Complementary Adaptive Token-level Contrastive Decoding to Mitigate Hallucinations in LVLMs [PDF]

[arxiv 2024.12] VisionZip: Longer is Better but Not Necessary in Vision Language Models [PDF, Page] Code

[arxiv 2024.12] NVILA: Efficient Frontier Visual Language Models [PDF] Code

[arxiv 2024.12] Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling [PDF, Page] Code

[arxiv 2024.12] Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces [PDF, Page] Code

[arxiv 2024.12] OpenEMMA: Open-Source Multimodal Model for End-to-End Autonomous Driving [PDF, Page] Code

[arxiv 2024.12] VLM-AD: End-to-End Autonomous Driving through Vision-Language Model Supervision [PDF]

[arxiv 2024.12] HoVLE: Unleashing the Power of Monolithic Vision-Language Models with Holistic Vision-Language Embedding [PDF, Page]

[arxiv 2025.01] VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM [PDF, Page] Code

[arxiv 2025.01] VideoChat-Flash: Hierarchical Compression for Long-Context Video Modeling [PDF, Page] Code

[arxiv 2025.01] VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction [PDF, Page] Code

[arxiv 2025.01] Virgo: A Preliminary Exploration on Reproducing o1-like MLLM [PDF, Page] Code

[arxiv 2025.01] Scaling of Search and Learning: A Roadmap to Reproduce o1 from Reinforcement Learning Perspective [PDF, Page] Code

[arxiv 2025.01] Dispider: Enabling Video LLMs with Active Real-Time Interaction via Disentangled Perception, Decision, and Reaction [PDF, Page] Code

[arxiv 2025.01] LlamaV-o1: Rethinking Step-by-step Visual Reasoning in LLMs [PDF, Page] Code

[arxiv 2025.01] VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding [PDF, Page] Code

[arxiv 2025.02] MINT: Mitigating Hallucinations in Large Vision-Language Models via Token Reduction [PDF]

[arxiv 2025.02] PixelWorld: Towards Perceiving Everything as Pixels [PDF, Page] Code

[arxiv 2025.02] Ola: Pushing the Frontiers of Omni-Modal Language Model with Progressive Modality Alignment [PDF, Page] Code


Multi-modality Evaluation

[arxiv 2024.10] The Curse of Multi-Modalities: Evaluating Hallucinations of Large Multimodal Models across Language, Visual, and Audio [PDF]

Compression

[arxiv 2025.02] AdaSVD: Adaptive Singular Value Decomposition for Large Language Models [PDF, Page] Code

[arxiv 2025.02] Vision-centric Token Compression in Large Language Model [PDF]


Few-shot

[arxiv 2025.02] Efficient Few-Shot Continual Learning in Vision-Language Models [PDF]

Audio

[arxiv 2024.10] MuVi: Video-to-Music Generation with Semantic Alignment and Rhythmic Synchronization[PDF]

[arxiv 2024.11]Video-Guided Foley Sound Generation with Multimodal Controls [PDF, Page]


Speed

[arxiv 2024.10] PyramidDrop: Accelerating Your Large Vision-Language Models via Pyramid Visual Redundancy Reduction [PDF, Page]

[arxiv 2024.12] [CLS] Attention is All You Need for Training-Free Visual Token Pruning: Make VLM Inference Faster [PDF, Page] Code
