
Feature Request: MiniMax-Text-01 model #11290

Open
Kreijstal opened this issue Jan 18, 2025 · 12 comments
Labels: enhancement (New feature or request)

@Kreijstal

Prerequisites

  • I am running the latest code. Mention the version if possible as well.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new and useful enhancement to share.

Feature Description

Please add support for minimax-text-01 model https://huggingface.co/MiniMaxAI/MiniMax-Text-01
https://github.com/MiniMax-AI/MiniMax-01

Motivation

We need to add support for the latest models! It performs almost as well as DeepSeek V3, but has a 4-million-token context window.

Possible Implementation

It's a MoE model.
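
For illustration, MoE layers route each token through a small top-k subset of expert FFNs via a learned gate. A generic sketch of the routing step only (all names and sizes are hypothetical, nothing MiniMax-specific is assumed):

```cpp
// Generic top-k MoE routing sketch; not MiniMax-Text-01's actual config.
#include <algorithm>
#include <cmath>
#include <numeric>
#include <utility>
#include <vector>

// Pick the top_k experts for one token from the router logits and return
// softmax-normalized weights over the selected experts.
// Assumes top_k <= logits.size().
std::vector<std::pair<int, float>> route(const std::vector<float> &logits, int top_k) {
    std::vector<int> idx(logits.size());
    std::iota(idx.begin(), idx.end(), 0);
    std::partial_sort(idx.begin(), idx.begin() + top_k, idx.end(),
                      [&](int a, int b) { return logits[a] > logits[b]; });

    std::vector<std::pair<int, float>> experts;
    float sum = 0.0f;
    for (int i = 0; i < top_k; i++) {
        // Subtract the max logit for numerical stability.
        float w = std::exp(logits[idx[i]] - logits[idx[0]]);
        experts.push_back({idx[i], w});
        sum += w;
    }
    for (auto &e : experts) e.second /= sum;
    return experts;
}
```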

@ehartford

Very interested in this model!

@fairydreaming
Collaborator

fairydreaming commented Jan 21, 2025

I have something more or less working here: https://github.com/fairydreaming/llama.cpp/tree/minimax-text-01

Some major remaining problems:

  • It currently doesn't support multiple token sequences. My current implementation of lightning attention simply ignores the token positions and sequence ids. Inference of a single token sequence should work fine.
  • I guess that proper support for this model would require some redesign of the KV cache. The problem is that layers in MiniMax-Text-01 use either linear lightning attention (in which case a single kv matrix per layer is cached) or regular transformer attention, which caches separate key and value vectors. I need to think about it some more (a rough sketch of the two cache shapes follows below).
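
To illustrate the mismatch: a regular attention layer's cache grows with the sequence, while a linear-attention layer keeps one fixed-size state matrix updated per token. A minimal sketch of the two shapes, assuming single-head layers and omitting lightning attention's decay/normalization terms; these structs are hypothetical, not llama.cpp's actual KV cache API:

```cpp
#include <vector>

// Regular attention layer: the cache appends one K row and one V row per
// token, so memory grows linearly with context length.
struct kv_cache_layer {
    int head_dim;
    std::vector<float> k; // n_tokens * head_dim, grows per token
    std::vector<float> v; // n_tokens * head_dim, grows per token
};

// Linear (lightning) attention layer: a single head_dim x head_dim state
// matrix, updated recurrently; memory is constant in context length.
struct linear_attn_layer {
    int head_dim;
    std::vector<float> kv_state; // head_dim * head_dim, assumed zero-initialized

    // Recurrent update for one token: S += k^T v, then o = q * S.
    // Decay factor and normalization omitted for brevity.
    void step(const float *q, const float *k, const float *v, float *o) {
        for (int i = 0; i < head_dim; i++)
            for (int j = 0; j < head_dim; j++)
                kv_state[i * head_dim + j] += k[i] * v[j];
        for (int j = 0; j < head_dim; j++) {
            o[j] = 0.0f;
            for (int i = 0; i < head_dim; i++)
                o[j] += q[i] * kv_state[i * head_dim + j];
        }
    }
};
```

This is also why ignoring sequence ids breaks multi-sequence batches: the linear layers need one state matrix per sequence, not per-position KV slots.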

I tested it on CPU (AMD Epyc 9374F, Q5_K_M), some token generation performance values:

| model | size | params | backend | threads | test | t/s |
| ----- | ---- | ------ | ------- | ------- | ---- | --- |
| minimax01 456B Q5_K - Medium | 302.51 GiB | 456.09 B | CPU | 32 | tg32@pp128 | 4.88 ± 0.05 |
| minimax01 456B Q5_K - Medium | 302.51 GiB | 456.09 B | CPU | 32 | tg32@pp256 | 4.51 ± 0.00 |
| minimax01 456B Q5_K - Medium | 302.51 GiB | 456.09 B | CPU | 32 | tg32@pp512 | 4.50 ± 0.00 |
| minimax01 456B Q5_K - Medium | 302.51 GiB | 456.09 B | CPU | 32 | tg32@pp1024 | 4.48 ± 0.00 |
| minimax01 456B Q5_K - Medium | 302.51 GiB | 456.09 B | CPU | 32 | tg32@pp2048 | 4.42 ± 0.00 |
| minimax01 456B Q5_K - Medium | 302.51 GiB | 456.09 B | CPU | 32 | tg32@pp4096 | 4.34 ± 0.00 |
| minimax01 456B Q5_K - Medium | 302.51 GiB | 456.09 B | CPU | 32 | tg32@pp8192 | 4.18 ± 0.00 |

I used my custom llama-bench test to measure the token generation rate at a given prompt length.

@ggerganov
Member

> I guess that proper support for this model would require some redesign of the KV cache. The problem is that layers in MiniMax-Text-01 use either linear lightning attention (in which case a single kv matrix per layer is cached) or regular transformer attention, which caches separate key and value vectors. I need to think about it some more.

Yup, it's unfeasible to keep trying to fit all variants of the attention into the existing KV cache code. I am hoping that after the refactoring of #11213, we will be able to implement custom attention mechanisms for use cases like these.

@fairydreaming
Collaborator

I noticed a problem with the model "eating" some words when asked to repeat text (Q5_K_M quant). Can someone with more RAM (like 512 GB or 1 TB) test this model with my branch? I'm not sure whether the model is very sensitive to quantization or there is some other problem. The full prompt is:

```
<beginning_of_sentence>user name=user
Repeat this text: "The different accidents of life are not so changeable as the feelings of human nature. I had worked hard for nearly two years, for the sole purpose of infusing life into an inanimate body. For this I had deprived myself of rest and health. I had desired it with an ardour that far exceeded moderation; but now that I had finished, the beauty of the dream vanished, and breathless horror and disgust filled my heart."<end_of_sentence>
<beginning_of_sentence>ai name=assistant
```

while the model answer is:

```
The different accidents of life are not so changeable as the feelings human nature. I had worked hard for nearly two years, for the sole purpose of infusing life into an inanimate body. For this I had deprived myself rest and health. I had desired it with an ardour that far exceeded moderation; but now that I had finished, the beauty of the dream vanished, and breathless horror and disgust filled my heart.<end_of_sentence>
```

There is one missing "of" in front of "human nature" and another "of" in front of "rest and health". Sometimes it eats "and" instead, or both. A hungry model. I ran it with temp 0.01.

I'm curious whether it also happens with f16 or Q8_0 quantization.

@ehartford

I have 1 TB of RAM, I can try it.

@fairydreaming
Collaborator

I found out about llama_sbatch::split_equal, so my branch now supports inference of multiple token sequences with llama-server. Prompt caching should be disabled for now, as it doesn't work correctly. Run the server with --jinja to use the model's prompt template.
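
For reference, the idea behind equal splitting: recurrent and linear-attention layers need every sequence in a micro-batch to advance by the same number of tokens. A conceptual sketch only, not the actual llama_sbatch::split_equal implementation:

```cpp
#include <algorithm>
#include <vector>

using token = int;

// Given the pending tokens of each sequence, emit ubatches in which every
// non-empty sequence contributes the same number of tokens: each ubatch
// takes the shortest remaining length from all sequences that still have
// tokens. Hypothetical types; conceptual sketch only.
std::vector<std::vector<std::vector<token>>>
split_equal(std::vector<std::vector<token>> seqs) {
    std::vector<std::vector<std::vector<token>>> ubatches;
    for (;;) {
        size_t n = 0; // shortest remaining length among non-empty sequences
        for (auto &s : seqs)
            if (!s.empty()) n = (n == 0) ? s.size() : std::min(n, s.size());
        if (n == 0) break; // all sequences exhausted

        std::vector<std::vector<token>> ub;
        for (auto &s : seqs) {
            if (s.empty()) continue;
            ub.emplace_back(s.begin(), s.begin() + n);
            s.erase(s.begin(), s.begin() + n);
        }
        ubatches.push_back(std::move(ub));
    }
    return ubatches;
}
```

For example, two sequences of 5 and 3 pending tokens yield one ubatch with 3 tokens from each, then a final ubatch with the 2 leftover tokens of the longer sequence.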

@Nondzu

Nondzu commented Jan 25, 2025

@fairydreaming I tested your branch with Q5_K_M. On my setup I see some missing "of".
Tested on an AMD EPYC with 768 GB RAM.
Can you share the full command you used for the test?
Building Q8_0 and will test tomorrow...

build: 4532 (1e74c4d9) with gcc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu

Q5_K_M:

```
<beginning_of_sentence>user name=user
Repeat this text: "The different accidents of life are not so changeable as the feelings of human nature. I had worked hard for nearly two years, for the sole purpose of infusing life into an inanimate body. For this I had deprived myself of rest and health. I had desired it with an ardour that far exceeded moderation; but now that I had finished, the beauty of the dream vanished, and breathless horror and disgust filled my heart."<end_of_sentence>
<beginning_of_sentence>ai name=assistant
```

```
"The different accidents life are not so changeable as the feelings human nature. I had worked hard for nearly two years, for the sole purpose of infusing life an inanimate body. For I had deprived myself of rest and health. I had desired it with an ardour that far exceeded moderation; but now that I had finished the beauty of the dream vanished, and breathless horror and disgust filled my heart."
```

@fairydreaming
Collaborator

> @fairydreaming I tested your branch with Q5_K_M. On my setup I see some missing "of". Tested on an AMD EPYC with 768 GB RAM. Can you share the full command you used for the test? Building Q8_0 and will test tomorrow...

That would be helpful, thanks. Regarding the command line, I can't access the workstation right now; I will add it later.

@Nondzu

Nondzu commented Jan 26, 2025

```
file format = GGUF V3 (latest)
file type   = Q8_0
file size   = 451.36 GiB (8.50 BPW)
```

Full log:

minimax-q8.log

Compared by ChatGPT:

Summary of Rounds and Missing Words

Across the four rounds, the text provided by the user was analyzed for differences in word usage. Here's a concise summary of the missing words in each round and how they evolved:


Round 1:

Missing Words:

  1. "of" (in "The different accidents life are not so changeable as the feelings of human nature").
  2. "of" (in "For this I had deprived myself rest health").
  3. "and" (in "For this I had deprived myself rest health").
  4. "of" (in "the beauty the dream vanished").
  5. "and" (in "the beauty the dream vanished breathless horror").

Round 2:

Missing Words:

  1. "of" (in "as the feelings human nature").
  2. "of" (in "for the sole purpose infusing life").

Round 3:

  • No Missing Words: The AI response matched the original text completely.

Round 4:

  • No Missing Words: The AI response was identical to the original text.

Summary of All Missing Words:

From Rounds 1 and 2, the following words were missing:

  1. "of" (five occurrences in total across both rounds).
  2. "and" (two occurrences in Round 1).

In Rounds 3 and 4, no words were missing, indicating that the AI eventually reproduced the original text without errors.
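
As an aside, this sort of check can be done mechanically rather than via ChatGPT. A minimal sketch of a greedy word-level comparison (hypothetical helper, not part of llama.cpp; it catches pure deletions like the missing "of", but not substitutions):

```cpp
#include <sstream>
#include <string>
#include <vector>

// Split a string on whitespace into words. Punctuation stays attached,
// which is fine for exact repeat-the-text comparisons.
static std::vector<std::string> words(const std::string &s) {
    std::istringstream iss(s);
    std::vector<std::string> w;
    std::string tok;
    while (iss >> tok) w.push_back(tok);
    return w;
}

// Report reference words that do not appear, in order, in the output.
// Greedy two-pointer scan: adequate for dropped words, not a full diff.
std::vector<std::string> missing(const std::string &ref, const std::string &out) {
    auto r = words(ref), o = words(out);
    std::vector<std::string> dropped;
    size_t j = 0;
    for (size_t i = 0; i < r.size(); i++) {
        if (j < o.size() && r[i] == o[j]) j++;
        else dropped.push_back(r[i]);
    }
    return dropped;
}
```

For example, missing("deprived myself of rest", "deprived myself rest") returns {"of"}.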


@Nondzu

Nondzu commented Jan 26, 2025

@fairydreaming I found a possible issue with that; I need to reconvert the model again. See you soon.

@Nondzu

Nondzu commented Jan 26, 2025

Still the same issue. I removed ignore_merges from llama-vocab.cpp and redid the conversion and quantization, but no success: "of" and "and" are still missing.
Nondzu@9ec3378
log.txt

@fairydreaming
Collaborator

@Nondzu OK, if it happens on Q8_0 as well, then there's likely still some problem with my inference code, as I didn't observe this behavior via the OpenRouter API. Thanks for testing!
