
MLX model support #300

Merged: 22 commits into huggingface:main on Feb 12, 2025
Conversation

g-eoj (Contributor) commented Jan 21, 2025

The goal of this PR is to enable users to run smolagents with models loaded onto Apple silicon with mlx-lm. The mlx-community has made many models available for experimentation. Personally, I find running locally to be a convenient way to learn and experiment with the smolagents library, so I made this PR as a possible new feature.

Example usage:

from smolagents.models import MLXModel

mlx_model = MLXModel("mlx-community/Qwen2.5-Coder-32B-Instruct-4bit", max_tokens=10000)
messages = [{"role": "user", "content": "Explain quantum mechanics in simple terms."}]
print(mlx_model(messages))
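
For context, here is a short sketch of how the model could plug into an agent, assuming MLXModel lands in smolagents.models as proposed (the task string is just an illustrative example):

from smolagents import CodeAgent
from smolagents.models import MLXModel

# Hypothetical agent setup using the MLXModel class added in this PR.
mlx_model = MLXModel("mlx-community/Qwen2.5-Coder-32B-Instruct-4bit", max_tokens=10000)
agent = CodeAgent(tools=[], model=mlx_model, add_base_tools=True)
agent.run("What is the 10th Fibonacci number?")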

Some questions:

  • tests won't run in CI/CD due to hardware requirements; what is the preferred way to handle that?
  • is anything needed for docs, and if so, where should it go?

g-eoj force-pushed the g-eoj/mlx-model-support branch from b193f27 to 662a481 on January 22, 2025 16:50
g-eoj (Contributor, PR author) commented Jan 26, 2025

@kingdomad @clefourrier as you are reviewing #337, can you please take a look at this PR?

aymeric-roucher (Collaborator) commented

Sorry for the late review @g-eoj, I just left some comments! 😃

g-eoj (Contributor, PR author) commented Feb 3, 2025

Hi @aymeric-roucher, can you please take a look? I don't see any comments or a review from you.

I can add docs when/if the design is finalized.

sysradium (Contributor) commented Feb 4, 2025

Works for me. Though a slight correction to the example is:

mlx_model = MLXModel(
    "mlx-community/Qwen2.5-Coder-32B-Instruct-4bit",
    max_tokens=10000,
)
messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Explain quantum mechanics in simple terms."}
        ],
    }
]
print(mlx_model(messages))

I guess it will be a good addition. You can still use LiteLLMModel with LMStudio to achieve a similar goal, but not needing to run LMStudio seems quite convenient.

Two review threads on src/smolagents/models.py (outdated, resolved).
g-eoj requested a review from sysradium on February 4, 2025 20:27
sysradium (Contributor) left a comment

To me it looks good.

sysradium (Contributor) commented

@g-eoj is there any downside to checking for a stop_sequence just in the most recently received chunk? If not, the performance could be slightly improved.

For example, if you take these options:

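# Accumulate into a string and check the full accumulated text against each stop sequence.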
def implementation_1(text, stop_sequences):
    text_accumulated = ""
    for chunk in text:
        text_accumulated += chunk
        for stop_sequence in stop_sequences:
            if text_accumulated.strip().endswith(stop_sequence):
                text_accumulated = text_accumulated[: -len(stop_sequence)]
                return text_accumulated  # Simulating _to_message call
    return text_accumulated


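# Accumulate into a string but check only the latest chunk against each stop sequence.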
def implementation_2(text, stop_sequences):
    text_accumulated = ""
    for chunk in text:
        text_accumulated += chunk
        for stop_sequence in stop_sequences:
            if chunk.endswith(stop_sequence):
                text_accumulated = text_accumulated[: -len(stop_sequence)]
                return text_accumulated  # Simulating _to_message call
    return text_accumulated


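# Accumulate into a list, check only the latest chunk, and join once at the end.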
def implementation_3(text, stop_sequences):
    text_accumulated = []
    for chunk in text:
        text_accumulated.append(chunk)
        for stop_sequence in stop_sequences:
            if chunk.endswith(stop_sequence):
                joined_text = "".join(text_accumulated)[: -len(stop_sequence)]
                return joined_text
    return "".join(text_accumulated)

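# Like implementation_3, but short-circuit with a single tuple endswith() before the inner loop.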
def implementation_5(text, stop_sequences):
    text_accumulated = []
    for chunk in text:
        text_accumulated.append(chunk)
        if not chunk.endswith(tuple(stop_sequences)):
            continue

        for stop_sequence in stop_sequences:
            if chunk.endswith(stop_sequence):
                joined_text = "".join(text_accumulated)[: -len(stop_sequence)]
                return joined_text

    return "".join(text_accumulated)


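# Like implementation_5, but pick the matched stop sequence with next() instead of a second loop.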
def implementation_6(text, stop_sequences):
    text_accumulated = []
    for chunk in text:
        text_accumulated.append(chunk)
        if not chunk.endswith(tuple(stop_sequences)):
            continue

        matched_suffix = next(s for s in stop_sequences if chunk.endswith(s))
        return "".join(text_accumulated)[: -len(matched_suffix)]

    return "".join(text_accumulated)

The benchmark results were attached as an image (timing plot not reproduced here).
Might be useful in intensive apps.
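
For reference, a minimal sketch of how such a micro-benchmark could be run with timeit; the chunk stream and repeat count below are made up for illustration, and implementation_1 through implementation_6 refer to the functions defined above:

import timeit

# Hypothetical input: many small chunks followed by a multi-character stop sequence.
chunks = ["chunk "] * 5_000 + ["<end_code>"]
stop_sequences = ["<end_code>", "Observation:"]

for impl in (implementation_1, implementation_2, implementation_3,
             implementation_5, implementation_6):
    elapsed = timeit.timeit(lambda impl=impl: impl(chunks, stop_sequences), number=3)
    print(f"{impl.__name__}: {elapsed:.4f}s")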

g-eoj (Contributor, PR author) commented Feb 9, 2025

@sysradium what is a chunk in this case? I'm pretty sure mlx-lm streams always produce one token at a time - could it still be used to get chunks efficiently? It also seems like you'd need a guarantee that the stop string doesn't get split across chunks.

I'm all for making this more efficient.

sysradium (Contributor) commented Feb 9, 2025

@g-eoj the chunk is your _.text. So it is whatever mlx-lm decided to yield :) I had assumed that a stop sequence is always a single token (or a single unit of what mlx-lm yields).

They themselves implement a generate function which does exactly what you did (https://github.com/ml-explore/mlx-examples/blob/main/llms/mlx_lm/utils.py#L594), i.e. just appending to the text variable. But that is quite inefficient since strings are immutable in Python.

Not sure if they did it like this because they know something, or just don't care about performance. I thought maybe you had checked that while working on the implementation.

g-eoj (Contributor, PR author) commented Feb 9, 2025

I had an assumption that a stop sequence is always a single token (or a single unit of what mlx-lm yields)

I think the smolagents stop sequences, for example

stop_sequences=["<end_code>", "Observation:"],

have the potential to be composed of multiple tokens, depending on the tokenizer.

To your point about strings being immutable - I assumed https://github.com/ml-explore/mlx-examples/blob/main/llms/mlx_lm/utils.py#L594 was okay since Apple is doing it. I didn't check whether there is a good reason for it (in Apple's case it seems to just be there to support verbose output). I can't think of any reason for appending to the text variable other than needing to check multi-token stop sequences.
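
A tiny illustration of that concern (the chunk boundaries below are invented; the real split depends on the tokenizer): if "<end_code>" is yielded in pieces, a check against only the latest chunk never fires, while a check against the accumulated text does.

# Hypothetical chunking of a multi-token stop sequence.
chunks = ["final answer", "<end", "_code", ">"]
stop_sequences = ["<end_code>", "Observation:"]

accumulated = ""
for chunk in chunks:
    accumulated += chunk
    per_chunk_hit = any(chunk.endswith(s) for s in stop_sequences)
    accumulated_hit = any(accumulated.endswith(s) for s in stop_sequences)
    print(repr(chunk), per_chunk_hit, accumulated_hit)
# The per-chunk check stays False on every chunk; the accumulated check fires on the last one.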

g-eoj (Contributor, PR author) commented Feb 10, 2025

For reference, I tried this approach to avoid appending to the text variable. I didn't find evidence of a speedup.

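# Assumes model and tokenizer are already loaded (e.g. via mlx_lm.load) and mlx_lm is imported.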
def check_stop_1(messages, stop_sequences):
    prompt_ids = tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True,
    )
    outputs = []
    for step in mlx_lm.stream_generate(model, tokenizer, prompt=prompt_ids, max_tokens=10000):
        outputs.append(step)
        for stop_sequence in stop_sequences:
            # assumes stop sequence will never be more than 10 tokens
            recent_text = "".join([_.text for _ in outputs[-10:]]) 
            if recent_text.rstrip().endswith(stop_sequence):
                text = "".join([_.text for _ in outputs])
                text = text.rstrip()[:-len(stop_sequence)]
                return text         
    text = "".join([_.text for _ in outputs])
    return text
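
A possible refinement of the same idea, sketched here for discussion (not benchmarked): keep the full output in a list, but track only a short rolling suffix for the stop check so no repeated joins are needed. The slack parameter is an assumed upper bound on whitespace trailing a stop sequence, and the function works on any iterable of text chunks, e.g. (step.text for step in mlx_lm.stream_generate(...)).

def check_stop_rolling(chunks, stop_sequences, slack=16):
    # Only the last `window` characters of the output can matter for the stop check.
    window = max(len(s) for s in stop_sequences) + slack
    outputs = []
    tail = ""
    for chunk in chunks:
        outputs.append(chunk)
        tail = (tail + chunk)[-window:]
        for stop_sequence in stop_sequences:
            if tail.rstrip().endswith(stop_sequence):
                return "".join(outputs).rstrip()[: -len(stop_sequence)]
    return "".join(outputs)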

g-eoj (Contributor, PR author) commented Feb 10, 2025

I'm going to move this work over to https://github.com/g-eoj/mac-smolagents. Still happy to contribute, but I'm skeptical this PR makes it in.

sysradium (Contributor) commented

Which is a pity. I use it locally quite often now :/

g-eoj (Contributor, PR author) commented Feb 10, 2025

Just so there is no confusion, you can still use it alongside smolagents. You'll just need to install an extra whl and change your code a bit. If for some reason this doesn't work, please make an issue and I'll try to fix it.

import mac_smolagents
import smolagents


mlx_language_model = mac_smolagents.MLXLModel(
    model_id="mlx-community/Qwen2.5-Coder-32B-Instruct-4bit"
)
agent = smolagents.CodeAgent(
    model=mlx_language_model, tools=[], add_base_tools=True
)
agent.run(...)

HuggingFaceDocBuilderDev (bot) commented

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Review thread on the MLXModel docstring:

    Parameters:
        model_id (str):
            The Hugging Face model ID to be used for inference. This can be a path or model identifier from the Hugging Face model hub.
        tool_name_key (str):

aymeric-roucher (Collaborator) commented Jan 30, 2025

We don't need this tool_name_key in TransformersModel, what is the reason for needing it in MLXModel?

aymeric-roucher (Collaborator) replied

Actually upon inspection it seems like a good idea, let's keep it and we might set the same in TransformersModel later on.

g-eoj (Contributor, PR author) replied

I found the params to be required unless I was using a logits processor and regex to force the key names in the output. That might be a better solution overall, but I have no strong opinions yet.

aymeric-roucher (Collaborator) commented Feb 12, 2025

@g-eoj sorry for the late review, don't hesitate to ping again when this happens!
Also, please run ruff format and ruff check on your PR to fix the formatting and pass the tests!

aymeric-roucher merged commit 9b96199 into huggingface:main on Feb 12, 2025 (3 of 4 checks passed)
aymeric-roucher (Collaborator) commented

And thank you for the contribution, great work!
