
Feature Request: MiniMax-Text-01 model #11290

Open
Kreijstal opened this issue Jan 18, 2025 · 12 comments
Labels: enhancement (New feature or request)

@Kreijstal

Prerequisites

  • I am running the latest code. Mention the version if possible as well.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new and useful enhancement to share.

Feature Description

Please add support for minimax-text-01 model https://huggingface.co/MiniMaxAI/MiniMax-Text-01
https://github.com/MiniMax-AI/MiniMax-01

Motivation

We need to add support for the latest models! It performs almost as well as DeepSeek V3, but has a 4-million-token context window.

Possible Implementation

It's a MoE model.
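
For illustration, MoE layers route each token through a small top-k subset of expert FFNs via a learned gate. A generic sketch of the routing step only (all names and sizes are hypothetical, nothing MiniMax-specific is assumed):

```cpp
// Generic top-k MoE routing sketch; not MiniMax-Text-01's actual config.
#include <algorithm>
#include <cmath>
#include <numeric>
#include <utility>
#include <vector>

// Pick the top_k experts for one token from the router logits and return
// softmax-normalized weights over the selected experts.
// Assumes top_k <= logits.size().
std::vector<std::pair<int, float>> route(const std::vector<float> &logits, int top_k) {
    std::vector<int> idx(logits.size());
    std::iota(idx.begin(), idx.end(), 0);
    std::partial_sort(idx.begin(), idx.begin() + top_k, idx.end(),
                      [&](int a, int b) { return logits[a] > logits[b]; });

    std::vector<std::pair<int, float>> experts;
    float sum = 0.0f;
    for (int i = 0; i < top_k; i++) {
        // Subtract the max logit for numerical stability.
        float w = std::exp(logits[idx[i]] - logits[idx[0]]);
        experts.push_back({idx[i], w});
        sum += w;
    }
    for (auto &e : experts) e.second /= sum;
    return experts;
}
```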

@ehartford

Very interested in this model!

@fairydreaming
Collaborator

fairydreaming commented Jan 21, 2025

I have something more or less working here: https://github.com/fairydreaming/llama.cpp/tree/minimax-text-01

Some major remaining problems:

  • It currently doesn't support multiple token sequences. My current implementation of lightning attention simply ignores the token positions and sequence ids. Inference of a single token sequence should work fine.
  • I guess that proper support for this model would require some redesign of the KV cache. The problem is that layers in MiniMax-Text-01 use either linear lightning attention (in which case a single kv matrix per layer is cached) or regular transformer attention, which caches separate key and value vectors. I need to think about it some more (a rough sketch of the two cache shapes follows below).
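
To illustrate the mismatch: a regular attention layer's cache grows with the sequence, while a linear-attention layer keeps one fixed-size state matrix updated per token. A minimal sketch of the two shapes, assuming single-head layers and omitting lightning attention's decay/normalization terms; these structs are hypothetical, not llama.cpp's actual KV cache API:

```cpp
#include <vector>

// Regular attention layer: the cache appends one K row and one V row per
// token, so memory grows linearly with context length.
struct kv_cache_layer {
    int head_dim;
    std::vector<float> k; // n_tokens * head_dim, grows per token
    std::vector<float> v; // n_tokens * head_dim, grows per token
};

// Linear (lightning) attention layer: a single head_dim x head_dim state
// matrix, updated recurrently; memory is constant in context length.
struct linear_attn_layer {
    int head_dim;
    std::vector<float> kv_state; // head_dim * head_dim, assumed zero-initialized

    // Recurrent update for one token: S += k^T v, then o = q * S.
    // Decay factor and normalization omitted for brevity.
    void step(const float *q, const float *k, const float *v, float *o) {
        for (int i = 0; i < head_dim; i++)
            for (int j = 0; j < head_dim; j++)
                kv_state[i * head_dim + j] += k[i] * v[j];
        for (int j = 0; j < head_dim; j++) {
            o[j] = 0.0f;
            for (int i = 0; i < head_dim; i++)
                o[j] += q[i] * kv_state[i * head_dim + j];
        }
    }
};
```

This is also why ignoring sequence ids breaks multi-sequence batches: the linear layers need one state matrix per sequence, not per-position KV slots.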

I tested it on CPU (AMD Epyc 9374F, Q5_K_M), some token generation performance values:

| model | size | params | backend | threads | test | t/s |
| ----- | ---- | ------ | ------- | ------- | ---- | --- |
| minimax01 456B Q5_K - Medium | 302.51 GiB | 456.09 B | CPU | 32 | tg32@pp128 | 4.88 ± 0.05 |
| minimax01 456B Q5_K - Medium | 302.51 GiB | 456.09 B | CPU | 32 | tg32@pp256 | 4.51 ± 0.00 |
| minimax01 456B Q5_K - Medium | 302.51 GiB | 456.09 B | CPU | 32 | tg32@pp512 | 4.50 ± 0.00 |
| minimax01 456B Q5_K - Medium | 302.51 GiB | 456.09 B | CPU | 32 | tg32@pp1024 | 4.48 ± 0.00 |
| minimax01 456B Q5_K - Medium | 302.51 GiB | 456.09 B | CPU | 32 | tg32@pp2048 | 4.42 ± 0.00 |
| minimax01 456B Q5_K - Medium | 302.51 GiB | 456.09 B | CPU | 32 | tg32@pp4096 | 4.34 ± 0.00 |
| minimax01 456B Q5_K - Medium | 302.51 GiB | 456.09 B | CPU | 32 | tg32@pp8192 | 4.18 ± 0.00 |

I used my custom llama-bench test to measure the token generation rate at a given prompt length.

@ggerganov
Member

> I guess that proper support for this model would require some redesign of the KV cache. The problem is that layers in MiniMax-Text-01 use either linear lightning attention (in which case a single kv matrix per layer is cached) or regular transformer attention, which caches separate key and value vectors. I need to think about it some more.

Yup, it's unfeasible to keep trying to fit all variants of the attention into the existing KV cache code. I am hoping that after the refactoring of #11213, we will be able to implement custom attention mechanisms for use cases like these.

@fairydreaming
Collaborator

I noticed a problem with the model "eating" some words when asked to repeat text (Q5_K_M quant). Can someone with more RAM (like 512 GB or 1 TB) test this model with my branch? I'm not sure whether the model is very sensitive to quantization or there is some other problem. The full prompt is:

```
<beginning_of_sentence>user name=user
Repeat this text: "The different accidents of life are not so changeable as the feelings of human nature. I had worked hard for nearly two years, for the sole purpose of infusing life into an inanimate body. For this I had deprived myself of rest and health. I had desired it with an ardour that far exceeded moderation; but now that I had finished, the beauty of the dream vanished, and breathless horror and disgust filled my heart."<end_of_sentence>
<beginning_of_sentence>ai name=assistant
```

while the model answer is:

```
The different accidents of life are not so changeable as the feelings human nature. I had worked hard for nearly two years, for the sole purpose of infusing life into an inanimate body. For this I had deprived myself rest and health. I had desired it with an ardour that far exceeded moderation; but now that I had finished, the beauty of the dream vanished, and breathless horror and disgust filled my heart.<end_of_sentence>
```

There is one missing "of" in front of "human nature" and another "of" in front of "rest and health". Sometimes it eats "and" instead, or both. A hungry model. I ran it with temp 0.01.

I'm curious whether it also happens with f16 or Q8_0 quantization.

@ehartford

I have 1 TB of RAM, I can try it.

@fairydreaming
Collaborator

I found out about llama_sbatch::split_equal, so my branch now supports inference of multiple token sequences with llama-server. Prompt caching should be disabled for now, as it doesn't work correctly. Run the server with --jinja to use the model's prompt template.
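
For reference, the idea behind equal splitting: recurrent and linear-attention layers need every sequence in a micro-batch to advance by the same number of tokens. A conceptual sketch only, not the actual llama_sbatch::split_equal implementation:

```cpp
#include <algorithm>
#include <vector>

using token = int;

// Given the pending tokens of each sequence, emit ubatches in which every
// non-empty sequence contributes the same number of tokens: each ubatch
// takes the shortest remaining length from all sequences that still have
// tokens. Hypothetical types; conceptual sketch only.
std::vector<std::vector<std::vector<token>>>
split_equal(std::vector<std::vector<token>> seqs) {
    std::vector<std::vector<std::vector<token>>> ubatches;
    for (;;) {
        size_t n = 0; // shortest remaining length among non-empty sequences
        for (auto &s : seqs)
            if (!s.empty()) n = (n == 0) ? s.size() : std::min(n, s.size());
        if (n == 0) break; // all sequences exhausted

        std::vector<std::vector<token>> ub;
        for (auto &s : seqs) {
            if (s.empty()) continue;
            ub.emplace_back(s.begin(), s.begin() + n);
            s.erase(s.begin(), s.begin() + n);
        }
        ubatches.push_back(std::move(ub));
    }
    return ubatches;
}
```

For example, two sequences of 5 and 3 pending tokens yield one ubatch with 3 tokens from each, then a final ubatch with the 2 leftover tokens of the longer sequence.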

@Nondzu

Nondzu commented Jan 25, 2025

@fairydreaming I tested your branch with Q5_K_M. On my setup I see some missing "of".
Tested on an AMD EPYC with 768 GB RAM.
Can you share the full command you used for the test?
Building Q8_0 and will test tomorrow...

build: 4532 (1e74c4d9) with gcc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu

Q5_K_M:

```
<beginning_of_sentence>user name=user
Repeat this text: "The different accidents of life are not so changeable as the feelings of human nature. I had worked hard for nearly two years, for the sole purpose of infusing life into an inanimate body. For this I had deprived myself of rest and health. I had desired it with an ardour that far exceeded moderation; but now that I had finished, the beauty of the dream vanished, and breathless horror and disgust filled my heart."<end_of_sentence>
<beginning_of_sentence>ai name=assistant
```

```
"The different accidents life are not so changeable as the feelings human nature. I had worked hard for nearly two years, for the sole purpose of infusing life an inanimate body. For I had deprived myself of rest and health. I had desired it with an ardour that far exceeded moderation; but now that I had finished the beauty of the dream vanished, and breathless horror and disgust filled my heart."
```

@fairydreaming
Collaborator

> @fairydreaming I tested your branch with Q5_K_M. On my setup I see some missing "of". Tested on an AMD EPYC with 768 GB RAM. Can you share the full command you used for the test? Building Q8_0 and will test tomorrow...

That would be helpful, thanks. Regarding the command line, I can't access the workstation right now; I will add it later.

@Nondzu

Nondzu commented Jan 26, 2025

```
file format = GGUF V3 (latest)
file type   = Q8_0
file size   = 451.36 GiB (8.50 BPW)
```

Full log:

minimax-q8.log

Compared by ChatGPT:

Summary of Rounds and Missing Words

Across the four rounds, the text provided by the user was analyzed for differences in word usage. Here's a concise summary of the missing words in each round and how they evolved:


Round 1:

Missing Words:

  1. "of" (in "The different accidents life are not so changeable as the feelings of human nature").
  2. "of" (in "For this I had deprived myself rest health").
  3. "and" (in "For this I had deprived myself rest health").
  4. "of" (in "the beauty the dream vanished").
  5. "and" (in "the beauty the dream vanished breathless horror").

Round 2:

Missing Words:

  1. "of" (in "as the feelings human nature").
  2. "of" (in "for the sole purpose infusing life").

Round 3:

  • No Missing Words: The AI response matched the original text completely.

Round 4:

  • No Missing Words: The AI response was identical to the original text.

Summary of All Missing Words:

From Rounds 1 and 2, the following words were missing:

  1. "of" (five occurrences in total across both rounds).
  2. "and" (two occurrences in Round 1).

In Rounds 3 and 4, no words were missing, indicating that the AI eventually reproduced the original text without errors.
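
As an aside, this sort of check can be done mechanically rather than via ChatGPT. A minimal sketch of a greedy word-level comparison (hypothetical helper, not part of llama.cpp; it catches pure deletions like the missing "of", but not substitutions):

```cpp
#include <sstream>
#include <string>
#include <vector>

// Split a string on whitespace into words. Punctuation stays attached,
// which is fine for exact repeat-the-text comparisons.
static std::vector<std::string> words(const std::string &s) {
    std::istringstream iss(s);
    std::vector<std::string> w;
    std::string tok;
    while (iss >> tok) w.push_back(tok);
    return w;
}

// Report reference words that do not appear, in order, in the output.
// Greedy two-pointer scan: adequate for dropped words, not a full diff.
std::vector<std::string> missing(const std::string &ref, const std::string &out) {
    auto r = words(ref), o = words(out);
    std::vector<std::string> dropped;
    size_t j = 0;
    for (size_t i = 0; i < r.size(); i++) {
        if (j < o.size() && r[i] == o[j]) j++;
        else dropped.push_back(r[i]);
    }
    return dropped;
}
```

For example, missing("deprived myself of rest", "deprived myself rest") returns {"of"}.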


@Nondzu

Nondzu commented Jan 26, 2025

@fairydreaming I found a possible issue with that; I need to reconvert the model again. See you soon.

@Nondzu

Nondzu commented Jan 26, 2025

Still the same issue. I removed ignore_merges from llama-vocab.cpp and redid the conversion and quantization, but no success: "of" and "and" are still missing.
Nondzu@9ec3378
log.txt

@fairydreaming
Collaborator

@Nondzu OK, if it happens on Q8_0 as well, then there's likely still some problem with my inference code, as I didn't observe this behavior via the OpenRouter API. Thanks for testing!
