Add support for MiniMax-Text-01 and MiniMax-VL-01 from MiniMaxAI #35710
Comments
I would like to implement these models in transformers.
Sounds good!
My recommendation is to have a look at … FYI @Rocketknight1
The dummy model should be the minimal architecture with a very small size, obtained by reducing the number of layers, attention heads, hidden_size, etc. in the config, right?
Yes, I am planning to use the …
@geetu040 yes, the dummy model should be very small. It's okay for the model to be randomly initialized and to output garbage. What we want to check is that we get the same garbage with your implementation as we get with the original remote-code implementation of the network.
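For reference, a minimal sketch of what such a parity check could look like. It assumes the released config exposes the usual `num_hidden_layers` / `num_attention_heads` / `hidden_size` fields, and `MiniMaxText01ForCausalLM` is a hypothetical name for the ported class, not a confirmed one:

```python
# Sketch: compare a tiny randomly-initialized remote-code model against the
# local port. Config field names below follow common transformers conventions
# and are assumptions, not the model's confirmed config schema (an MoE model
# may also need fields like num_local_experts shrunk).
import torch
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained("MiniMaxAI/MiniMax-Text-01", trust_remote_code=True)
config.num_hidden_layers = 2
config.num_attention_heads = 2
config.hidden_size = 32
config.intermediate_size = 64

torch.manual_seed(0)
remote_model = AutoModelForCausalLM.from_config(config, trust_remote_code=True)

# Hypothetical ported class; load the same random weights and compare logits.
# local_model = MiniMaxText01ForCausalLM(config)
# local_model.load_state_dict(remote_model.state_dict())

input_ids = torch.randint(0, config.vocab_size, (1, 8))
with torch.no_grad():
    remote_logits = remote_model(input_ids).logits
#     local_logits = local_model(input_ids).logits
# torch.testing.assert_close(local_logits, remote_logits)
```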
Hi @geetu040, …
@Shakib-IO, sure, I can use some help.
@Shakib-IO, you wanted to help with the implementation. I have marked the future work with comments in the code, so that it can be easily tracked and worked on. I'll give you write access to this branch in my fork, where you can also push changes.
Thanks @geetu040
Model description
MiniMaxAI has just released two new models for text generation. While the code and weights have been made publicly available, the implementation requires significant formatting and cleaning to align with the standards of the Hugging Face Transformers library. The models are:
MiniMax-Text-01
MiniMax-Text-01 is a powerful language model with 456 billion total parameters, of which 45.9 billion are activated per token. To better unlock its long-context capabilities, MiniMax-Text-01 adopts a hybrid architecture that combines Lightning Attention, Softmax Attention, and Mixture-of-Experts (MoE). By leveraging advanced parallelism strategies and innovative compute-communication overlap methods, such as Linear Attention Sequence Parallelism Plus (LASP+), varlen ring attention, and Expert Tensor Parallel (ETP), MiniMax-Text-01's training context length is extended to 1 million tokens, and it can handle contexts of up to 4 million tokens during inference. On various academic benchmarks, MiniMax-Text-01 also demonstrates top-tier performance.
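To make the hybrid layout concrete, a config-driven schedule could decide which attention variant each decoder layer uses. The sketch below assumes the interleaving described in the paper (one softmax-attention layer after every seven lightning-attention layers) and an 80-layer depth; the helper name and field choices are illustrative, not the released code:

```python
# Sketch of the hybrid layer schedule: most layers use the linear
# lightning-attention variant, with full softmax attention interleaved
# periodically. The 7:1 ratio and depth follow the paper's description,
# but this helper is an illustrative assumption.
def attention_type(layer_idx: int, softmax_every: int = 8) -> str:
    """Return which attention variant a given decoder layer uses."""
    return "softmax" if (layer_idx + 1) % softmax_every == 0 else "lightning"

# For an 80-layer model this yields 70 lightning and 10 softmax layers.
schedule = [attention_type(i) for i in range(80)]
assert schedule.count("softmax") == 10
assert schedule.count("lightning") == 70
```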
MiniMax-VL-01
MiniMax-VL-01 adopts the "ViT-MLP-LLM" framework, a commonly used design in the field of multimodal large language models. The model is initialized and trained from three key parts: a 303-million-parameter Vision Transformer (ViT) for visual encoding, a randomly initialized two-layer MLP projector for image adaptation, and MiniMax-Text-01 as the base LLM.

MiniMax-VL-01 has a notable dynamic-resolution feature. Input images are resized according to a pre-set grid, with resolutions from 336×336 to 2016×2016, while a 336×336 thumbnail is kept. The resized image is split into non-overlapping patches of the same size; these patches and the thumbnail are encoded separately and then combined into a full image representation.

The training data for MiniMax-VL-01 consists of caption, description, and instruction data. The Vision Transformer (ViT) is trained from scratch on 694 million image-caption pairs. Across four distinct stages of the training pipeline, a total of 512 billion tokens are processed, leveraging this vast amount of data to endow the model with strong capabilities. As a result, MiniMax-VL-01 reaches top-level performance on multimodal leaderboards, demonstrating its strength and reliability in complex multimodal tasks.
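The dynamic-resolution step described above amounts to grid-aligned tiling plus a global thumbnail. The helper below is an illustrative assumption of how such preprocessing might look (the name `split_into_tiles` and the explicit grid arguments are hypothetical), not the released image processor:

```python
# Sketch of dynamic-resolution tiling: resize an image to a grid-aligned
# resolution (multiples of 336, up to 2016x2016), cut it into non-overlapping
# 336x336 tiles, and keep a 336x336 thumbnail as a global view. Illustrative
# only; not the released preprocessing code.
from PIL import Image

TILE = 336

def split_into_tiles(image: Image.Image, grid_w: int, grid_h: int):
    """Resize to the chosen grid and cut into non-overlapping TILE x TILE patches."""
    resized = image.resize((grid_w * TILE, grid_h * TILE))
    tiles = [
        resized.crop((x * TILE, y * TILE, (x + 1) * TILE, (y + 1) * TILE))
        for y in range(grid_h)
        for x in range(grid_w)
    ]
    thumbnail = image.resize((TILE, TILE))  # global view kept alongside the tiles
    return tiles, thumbnail

# e.g. a 3x2 grid: six 336x336 tiles plus the thumbnail are encoded separately
tiles, thumb = split_into_tiles(Image.new("RGB", (1200, 800)), grid_w=3, grid_h=2)
assert len(tiles) == 6 and thumb.size == (TILE, TILE)
```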
Open source status
Provide useful links for the implementation
Research Paper: https://arxiv.org/abs/2501.08313
Authors: MiniMax, Aonian Li, Bangwei Gong, et al.
Implementation: https://github.com/MiniMax-AI/MiniMax-01
Model Weights: https://huggingface.co/MiniMaxAI/MiniMax-Text-01, https://huggingface.co/MiniMaxAI/MiniMax-VL-01