Add support for MiniMax-Text-01 and MiniMax-VL-01 from MiniMaxAI #35710

Open · 2 tasks done
geetu040 opened this issue Jan 15, 2025 · 8 comments · May be fixed by #35831

Comments

@geetu040 commented Jan 15, 2025

Model description

MiniMaxAI has just released two new models, a text-generation model and a vision-language model. While the code and weights have been made publicly available, the code requires significant formatting and cleaning to align with the standards of the Hugging Face Transformers library. The models are:

MiniMax-Text-01

MiniMax-Text-01 is a powerful language model with 456 billion total parameters, of which 45.9 billion are activated per token. To better unlock the long-context capabilities of the model, MiniMax-Text-01 adopts a hybrid architecture that combines Lightning Attention, Softmax Attention, and Mixture-of-Experts (MoE). Leveraging advanced parallel strategies and innovative compute-communication overlap methods such as Linear Attention Sequence Parallelism Plus (LASP+), varlen ring attention, and Expert Tensor Parallel (ETP), its training context length is extended to 1 million tokens, and it can handle a context of up to 4 million tokens during inference. On various academic benchmarks, MiniMax-Text-01 also demonstrates the performance of a top-tier model.
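As a purely illustrative aside, the hybrid stack described above can be pictured with a small schematic. None of this is MiniMax's released code: the interleave ratio, the top-1 routing, and the kernelized stand-in for Lightning Attention are assumptions, meant only to show how linear-attention blocks, softmax-attention blocks, and an MoE feed-forward might be combined in one decoder stack.

```python
# Schematic only: a toy decoder stack interleaving a linear-attention stand-in
# with softmax attention, each block ending in a sparsely activated MoE MLP.
# Layer count, interleave ratio, and routing are illustrative, not MiniMax's.
import torch
import torch.nn as nn


class MoEFeedForward(nn.Module):
    """Top-1 routing over a few expert MLPs: each token activates one expert."""

    def __init__(self, hidden: int, num_experts: int = 4):
        super().__init__()
        self.router = nn.Linear(hidden, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(hidden, 4 * hidden), nn.GELU(), nn.Linear(4 * hidden, hidden))
            for _ in range(num_experts)
        )

    def forward(self, x):
        top1 = self.router(x).argmax(dim=-1)  # (batch, seq) expert index per token
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = top1 == i
            if mask.any():
                out[mask] = expert(x[mask])
        return out


class HybridBlock(nn.Module):
    """One decoder block: (linear or softmax) attention followed by an MoE MLP."""

    def __init__(self, hidden: int, use_softmax: bool):
        super().__init__()
        self.use_softmax = use_softmax
        self.softmax_attn = nn.MultiheadAttention(hidden, num_heads=4, batch_first=True)
        self.moe = MoEFeedForward(hidden)

    def forward(self, x):
        if self.use_softmax:
            attn_out, _ = self.softmax_attn(x, x, x, need_weights=False)
        else:
            # Stand-in for Lightning (linear) attention: phi(Q) (phi(K)^T V),
            # which avoids materializing the full seq x seq attention matrix.
            # Q/K/V projections and causal masking are omitted in this sketch.
            phi_q, phi_k = x.relu() + 1e-6, x.relu() + 1e-6
            kv = torch.einsum("bsd,bse->bde", phi_k, x)
            norm = phi_q @ phi_k.sum(dim=1).unsqueeze(-1)
            attn_out = torch.einsum("bsd,bde->bse", phi_q, kv) / norm
        x = x + attn_out
        return x + self.moe(x)


# Interleave: here every 4th block uses softmax attention (the ratio is made up).
blocks = nn.ModuleList(HybridBlock(64, use_softmax=(i % 4 == 3)) for i in range(8))
hidden_states = torch.randn(1, 16, 64)
for block in blocks:
    hidden_states = block(hidden_states)
```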

MiniMax-VL-01

MiniMax-VL-01 adopts the “ViT-MLP-LLM” framework, a commonly used design in the field of multimodal large language models. The model is initialized and trained with three key parts: a 303-million-parameter Vision Transformer (ViT) for visual encoding, a randomly initialized two-layer MLP projector for image adaptation, and MiniMax-Text-01 as the base LLM. MiniMax-VL-01 has a notable dynamic-resolution feature: input images are resized according to a pre-set grid, with resolutions from 336×336 to 2016×2016, while a 336×336 thumbnail is kept. The resized image is split into non-overlapping patches of the same size; these patches and the thumbnail are encoded separately and then combined into a full image representation. The training data for MiniMax-VL-01 consists of caption, description, and instruction data. The Vision Transformer (ViT) is trained from scratch on 694 million image-caption pairs. Across four distinct stages of the training pipeline, a total of 512 billion tokens are processed, leveraging this vast amount of data to endow the model with strong capabilities. MiniMax-VL-01 has reached top-level performance on multimodal leaderboards, demonstrating its edge and dependability in complex multimodal tasks.
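To make the dynamic-resolution step concrete, here is a rough sketch of the idea as described: resize the image to a grid-aligned resolution between 336×336 and 2016×2016, split it into non-overlapping 336×336 tiles, and keep a 336×336 thumbnail. The function name and the naive grid selection are assumptions for illustration; the actual MiniMax-VL-01 image processor may work differently.

```python
# Illustrative sketch of the dynamic-resolution preprocessing described above,
# not the actual MiniMax-VL-01 image processor. Grid selection here is naive.
from PIL import Image

TILE = 336      # patch/thumbnail side length from the model description
MAX_GRID = 6    # 6 * 336 = 2016, the largest resolution mentioned


def split_into_tiles(image: Image.Image) -> tuple[list[Image.Image], Image.Image]:
    """Resize to a grid-aligned resolution, split into 336x336 tiles, keep a thumbnail."""
    # Pick the smallest grid that covers the image, capped at the 2016x2016 limit.
    grid_w = min(MAX_GRID, max(1, -(-image.width // TILE)))   # ceil division
    grid_h = min(MAX_GRID, max(1, -(-image.height // TILE)))

    resized = image.resize((grid_w * TILE, grid_h * TILE))
    tiles = [
        resized.crop((x * TILE, y * TILE, (x + 1) * TILE, (y + 1) * TILE))
        for y in range(grid_h)
        for x in range(grid_w)
    ]
    thumbnail = image.resize((TILE, TILE))
    return tiles, thumbnail


tiles, thumb = split_into_tiles(Image.new("RGB", (1000, 600)))
print(len(tiles), thumb.size)   # 6 tiles (3x2 grid) and a (336, 336) thumbnail
```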

Open source status

  • The model implementation is available
  • The model weights are available

Provide useful links for the implementation

@geetu040 (Author)

I would like to implement these models in transformers. But since these models are very large (456B parameters), I can only create smaller architectures while developing, and later test the final outputs and consistency of the full architecture on other machines. Does that sound possible, or should I avoid this altogether?

@ArthurZucker (Collaborator)

It sounds good!
I think the best way is to:

  1. create a dummy model from the original code (trust_remote_code=True)
  2. save the weights, and generate logits with a sentence
  3. create an equivalent model in transformers
  4. make the logits match! 🚀

My recommendation is to have a look at Mixtral, SwitchTransformers and Llama in general!
Also https://huggingface.co/docs/transformers/en/modular_transformers

FYI @Rocketknight1
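For illustration, steps 1 and 2 above might look roughly like the sketch below. The config field names being shrunk (num_hidden_layers, hidden_size, etc.) are assumptions based on common Transformers conventions; the released MiniMax config may use different names.

```python
# Hypothetical sketch of steps 1-2: build a tiny, randomly initialized model
# from the original remote code, save it, and record reference logits.
import torch
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer

repo = "MiniMaxAI/MiniMax-Text-01"
config = AutoConfig.from_pretrained(repo, trust_remote_code=True)

# Shrink the architecture so it fits on a normal machine (field names assumed).
config.num_hidden_layers = 2
config.hidden_size = 64
config.intermediate_size = 128
config.num_attention_heads = 4

model = AutoModelForCausalLM.from_config(config, trust_remote_code=True)
model.save_pretrained("minimax-text-01-dummy")

tokenizer = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
inputs = tokenizer("Hello, world!", return_tensors="pt")
with torch.no_grad():
    reference_logits = model(**inputs).logits
torch.save(reference_logits, "reference_logits.pt")
```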

@geetu040 (Author)

@ArthurZucker

> create a dummy model from the original code (trust_remote_code=True)

this dummy model should be a minimal architecture with a very small size, created by reducing the number of layers, attention heads, hidden_size, etc. in the config, right?
Loading the full-size model is going to be really difficult under normal resources.

> Also https://huggingface.co/docs/transformers/en/modular_transformers

Yes, I am planning to use modular transformers; I hope most of the code can be reused.
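For reference, a modular file typically reuses an existing model through subclassing, and the modeling file is then auto-generated from it. The sketch below assumes Mixtral is the closest starting point given the MoE architecture; the class names and the choice of base model are assumptions, not the final implementation.

```python
# modular_minimax_text_01.py (hypothetical) - a minimal modular-transformers
# sketch that reuses Mixtral code; the real PR may pick a different base model.
from transformers.models.mixtral.configuration_mixtral import MixtralConfig
from transformers.models.mixtral.modeling_mixtral import (
    MixtralForCausalLM,
    MixtralModel,
)


class MiniMaxText01Config(MixtralConfig):
    model_type = "minimax_text_01"


class MiniMaxText01Model(MixtralModel):
    # Inherits Mixtral's decoder stack; MiniMax-specific layers (e.g. the
    # Lightning Attention blocks) would be overridden here.
    pass


class MiniMaxText01ForCausalLM(MixtralForCausalLM):
    pass
```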

@Rocketknight1 (Member)

@geetu040 yes, the dummy model should be very small. It's okay for the model to be randomly initialized and to output garbage. What we want to check is that we get the same garbage with your implementation as we get with the original remote code implementation of the network.
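In practice, that check might look roughly like the sketch below: run the same input through the remote-code dummy checkpoint and the new in-library implementation, then assert the logits match within tolerance. The checkpoint path reuses the hypothetical dummy from the earlier sketch, and loading it without trust_remote_code assumes the ported model type is already registered in the local transformers source tree.

```python
# Rough sketch of the "same garbage in, same garbage out" check described above.
import torch
from transformers import AutoModelForCausalLM

# Original implementation, executed from the remote code shipped with the dummy.
original = AutoModelForCausalLM.from_pretrained(
    "minimax-text-01-dummy", trust_remote_code=True
)
# New in-library implementation (assumes the port is registered in transformers).
ported = AutoModelForCausalLM.from_pretrained("minimax-text-01-dummy")

input_ids = torch.tensor([[1, 42, 7, 99]])
with torch.no_grad():
    logits_original = original(input_ids).logits
    logits_ported = ported(input_ids).logits

# Random init gives meaningless logits, but both implementations must agree.
torch.testing.assert_close(logits_original, logits_ported, rtol=1e-4, atol=1e-4)
```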

@Shakib-IO

Hi @geetu040,
I was wondering if you'd be interested in collaborating on implementing this model. I'm currently exploring the NLP domain and am eager to gain hands-on coding experience by building a model. Let me know what you think.

@geetu040 (Author)

> Hi @geetu040, I was wondering if you'd be interested in collaborating on implementing this model. I'm currently exploring the NLP domain and am eager to gain hands-on coding experience by building a model. Let me know what you think.

@Shakib-IO, sure I can use some help

geetu040 linked a pull request on Jan 22, 2025 that will close this issue.
@geetu040 (Author)

@Shakib-IO, you wanted to help with the implementation. I have commented the future work in the code so that it can be easily tracked and worked on. I'll give you write access to this branch in my fork, where you can also push changes.

@Shakib-IO

Thanks @geetu040
