Experimental support for DeepSeek-R1
krasserm committed Feb 4, 2025
1 parent a8114ce commit 99f2d21
Showing 20 changed files with 246 additions and 71 deletions.
12 changes: 9 additions & 3 deletions README.md
@@ -92,23 +92,29 @@ https://github.com/user-attachments/assets/83cec179-54dc-456c-b647-ea98ec99600b

## Evaluation

We [evaluated](evaluation) `freeact` with these models:
We [evaluated](evaluation) `freeact` with the following models:

- Claude 3.5 Sonnet (`claude-3-5-sonnet-20241022`)
- Claude 3.5 Haiku (`claude-3-5-haiku-20241022`)
- Gemini 2.0 Flash (`gemini-2.0-flash-exp`)
- Qwen 2.5 Coder 32B Instruct (`qwen2p5-coder-32b-instruct`)
- DeepSeek V3 (`deepseek-v3`)
- DeepSeek R1 (`deepseek-r1`)

The evaluation was performed using two benchmark datasets: [m-ric/agents_medium_benchmark_2](https://huggingface.co/datasets/m-ric/agents_medium_benchmark_2) and the MATH subset from [m-ric/smol_agents_benchmark](https://huggingface.co/datasets/m-ric/smol_agents_benchmark). Both datasets were created by the [smolagents](https://github.com/huggingface/smolagents) team at 🤗 Hugging Face and contain selected tasks from GAIA, GSM8K, SimpleQA and MATH:
The evaluation uses two datasets:

1. [m-ric/agents_medium_benchmark_2](https://huggingface.co/datasets/m-ric/agents_medium_benchmark_2)
2. [m-ric/smol_agents_benchmark](https://huggingface.co/datasets/m-ric/smol_agents_benchmark)

Both datasets were created by the [smolagents](https://github.com/huggingface/smolagents) team at 🤗 Hugging Face and contain curated tasks from GAIA, GSM8K, SimpleQA, and MATH. We selected these datasets primarily for a quick evaluation of relative performance between models in a `freeact` setup, with the additional benefit of enabling comparisons with smolagents. To ensure fair comparisons with [their published results](https://huggingface.co/blog/smolagents#how-strong-are-open-models-for-agentic-workflows), we used identical evaluation protocols and tools.
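
For readers who want to inspect the benchmark tasks directly, both datasets can be loaded with the 🤗 `datasets` library. This is a minimal sketch and not part of the evaluation code; the available splits and column names are whatever the dataset cards define:

```python
from datasets import load_dataset  # pip install datasets

# Printing the DatasetDict shows the available splits and columns of each dataset.
agents_benchmark = load_dataset("m-ric/agents_medium_benchmark_2")
smol_benchmark = load_dataset("m-ric/smol_agents_benchmark")

print(agents_benchmark)
print(smol_benchmark)
```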

[<img src="docs/eval/eval-plot.png" alt="Performance">](docs/eval/eval-plot.png)

When comparing our results with smolagents using Claude 3.5 Sonnet on [m-ric/agents_medium_benchmark_2](https://huggingface.co/datasets/m-ric/agents_medium_benchmark_2) (only dataset with available smolagents [reference data](https://github.com/huggingface/smolagents/blob/c22fedaee17b8b966e86dc53251f210788ae5c19/examples/benchmark.ipynb)), we observed the following outcomes (evaluation conducted on 2025-01-07):

[<img src="docs/eval/eval-plot-comparison.png" alt="Performance comparison" width="60%">](docs/eval/eval-plot-comparison.png)

Interestingly, these results were achieved using zero-shot prompting in `freeact`, while the smolagents implementation utilizes few-shot prompting. To ensure a fair comparison, we employed identical evaluation protocols and tools. You can find all evaluation details [here](evaluation).
Interestingly, these results were achieved using zero-shot prompting in `freeact`, while the smolagents implementation utilizes few-shot prompting. You can find all evaluation details [here](evaluation).

## Supported models

3 changes: 2 additions & 1 deletion docs/api/deepseek.md
@@ -2,4 +2,5 @@
options:
show_root_heading: false
members:
- DeepSeek
- DeepSeekV3
- DeepSeekR1
Binary file added docs/eval/eval-plot-line.png
Binary file modified docs/eval/eval-plot.png
12 changes: 9 additions & 3 deletions docs/evaluation.md
@@ -1,14 +1,20 @@
# Evaluation results

We [evaluated](https://github.com/gradion-ai/freeact/tree/main/evaluation) `freeact` with these models:
We [evaluated](https://github.com/gradion-ai/freeact/tree/main/evaluation) `freeact` with the following models:

- Claude 3.5 Sonnet (`claude-3-5-sonnet-20241022`)
- Claude 3.5 Haiku (`claude-3-5-haiku-20241022`)
- Gemini 2.0 Flash (`gemini-2.0-flash-exp`)
- Qwen 2.5 Coder 32B Instruct (`qwen2p5-coder-32b-instruct`)
- DeepSeek V3 (`deepseek-v3`)
- DeepSeek R1 (`deepseek-r1`)

The evaluation was performed using two benchmark datasets: [m-ric/agents_medium_benchmark_2](https://huggingface.co/datasets/m-ric/agents_medium_benchmark_2) and the MATH subset from [m-ric/smol_agents_benchmark](https://huggingface.co/datasets/m-ric/smol_agents_benchmark). Both datasets were created by the [smolagents](https://github.com/huggingface/smolagents) team at 🤗 Hugging Face and contain selected tasks from GAIA, GSM8K, SimpleQA and MATH:
The evaluation uses two datasets:

1. [m-ric/agents_medium_benchmark_2](https://huggingface.co/datasets/m-ric/agents_medium_benchmark_2)
2. [m-ric/smol_agents_benchmark](https://huggingface.co/datasets/m-ric/smol_agents_benchmark)

Both datasets were created by the [smolagents](https://github.com/huggingface/smolagents) team at 🤗 Hugging Face and contain curated tasks from GAIA, GSM8K, SimpleQA, and MATH. We selected these datasets primarily for a quick evaluation of relative performance between models in a `freeact` setup, with the additional benefit of enabling comparisons with smolagents. To ensure fair comparisons with [their published results](https://huggingface.co/blog/smolagents#how-strong-are-open-models-for-agentic-workflows), we used identical evaluation protocols and tools.

<figure markdown>
[![architecture](eval/eval-plot.png){ align="left" }](eval/eval-plot.png){target="_blank"}
@@ -20,4 +26,4 @@ When comparing our results with smolagents using Claude 3.5 Sonnet on [m-ric/age
[![architecture](eval/eval-plot-comparison.png){ width="60%" align="left" }](eval/eval-plot-comparison.png){target="_blank"}
</figure>

Interestingly, these results were achieved using zero-shot prompting in `freeact`, while the smolagents implementation utilizes few-shot prompting. To ensure a fair comparison, we employed identical evaluation protocols and tools. You can find all evaluation details [here](https://github.com/gradion-ai/freeact/tree/main/evaluation).
Interestingly, these results were achieved using zero-shot prompting in `freeact`, while the smolagents implementation utilizes few-shot prompting. You can find all evaluation details [here](https://github.com/gradion-ai/freeact/tree/main/evaluation).
25 changes: 14 additions & 11 deletions docs/models.md
Expand Up @@ -6,10 +6,12 @@ For the following models, `freeact` provides model-specific prompt templates.
|-----------------------------|------------|-----------|--------------|
| Claude 3.5 Sonnet | 2024-10-22 || optimized |
| Claude 3.5 Haiku | 2024-10-22 || optimized |
| Gemini 2.0 Flash | 2024-12-11 || experimental |
| Gemini 2.0 Flash Thinking | 2025-01-21 || experimental |
| Qwen 2.5 Coder 32B Instruct | || experimental |
| DeepSeek V3 | || experimental |
| Gemini 2.0 Flash | 2024-12-11 || draft |
| Qwen 2.5 Coder 32B Instruct | || draft |
| DeepSeek V3 | || draft |
| DeepSeek R1[^1] | || experimental |

[^1]: DeepSeek R1 wasn't trained on agentic tool use but demonstrates strong performance with code actions, even surpassing Claude 3.5 Sonnet on the GAIA subset in our [evaluation](evaluation.md). However, its token usage for reasoning remains significantly higher than that of other models, so it is not yet practical for everyday use.

!!! Info

@@ -60,25 +62,26 @@ python -m freeact.cli \
--api-key=$GOOGLE_API_KEY
```

### Gemini 2.0 Flash Thinking
### Qwen 2.5 Coder 32B Instruct

```bash
python -m freeact.cli \
--model-name=gemini-2.0-flash-thinking-exp-01-21 \
--model-name=Qwen/Qwen2.5-Coder-32B-Instruct \
--ipybox-tag=ghcr.io/gradion-ai/ipybox:basic \
--skill-modules=freeact_skills.search.google.stream.api \
--api-key=$GOOGLE_API_KEY
--base-url=https://api-inference.huggingface.co/v1/ \
--api-key=$HF_TOKEN
```

### Qwen 2.5 Coder 32B Instruct
### DeepSeek R1

```bash
python -m freeact.cli \
--model-name=Qwen/Qwen2.5-Coder-32B-Instruct \
--model-name=accounts/fireworks/models/deepseek-r1 \
--ipybox-tag=ghcr.io/gradion-ai/ipybox:basic \
--skill-modules=freeact_skills.search.google.stream.api \
--base-url=https://api-inference.huggingface.co/v1/ \
--api-key=$HF_TOKEN
--base-url=https://api.fireworks.ai/inference/v1 \
--api-key=$FIREWORKS_API_KEY
```
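
The programmatic equivalent is sketched below. The constructor arguments mirror the `DeepSeekR1` usage added to `evaluation/evaluate.py` in this commit; treat the surrounding setup as an assumption rather than the canonical API:

```python
import os

from freeact import DeepSeekR1

# Mirrors the constructor call added to evaluation/evaluate.py in this commit.
# evaluate.py additionally passes skill_sources and sets max_tokens=16384 for the
# run to leave room for R1's reasoning tokens; both are omitted here for brevity.
model = DeepSeekR1(
    model_name="accounts/fireworks/models/deepseek-r1",
    api_key=os.getenv("FIREWORKS_API_KEY"),
    base_url="https://api.fireworks.ai/inference/v1",
)
```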

### DeepSeek V3
2 changes: 1 addition & 1 deletion docs/tutorials/extend.md
@@ -7,7 +7,7 @@ This tutorial demonstrates how `freeact` agents can be customized through system
- How to implement domain-specific rules (demonstrated by an example that multiplies temperatures by 3.17 in weather-related responses)

!!! Note
System extensions are currently only supported for [Claude][freeact.model.claude.model.Claude] models and [DeepSeek V3][freeact.model.deepseek.model.DeepSeek].
System extensions are currently only supported for [Claude][freeact.model.claude.model.Claude] models and [DeepSeek V3][freeact.model.deepseek.model.DeepSeekV3].

The [example conversation](#example-conversation) below was guided by this system extension:

26 changes: 19 additions & 7 deletions evaluation/README.md
@@ -1,26 +1,33 @@
# Evaluation

We evaluated `freeact` using five state-of-the-art models:
We evaluated `freeact` with the following models:

- Claude 3.5 Sonnet (`claude-3-5-sonnet-20241022`)
- Claude 3.5 Haiku (`claude-3-5-haiku-20241022`)
- Gemini 2.0 Flash (`gemini-2.0-flash-exp`)
- Qwen 2.5 Coder 32B Instruct (`qwen2p5-coder-32b-instruct`)
- DeepSeek V3 (`deepseek-v3`)
- DeepSeek R1 (`deepseek-r1`)

The evaluation was performed using two benchmark datasets: [m-ric/agents_medium_benchmark_2](https://huggingface.co/datasets/m-ric/agents_medium_benchmark_2) and the MATH subset from [m-ric/smol_agents_benchmark](https://huggingface.co/datasets/m-ric/smol_agents_benchmark). Both datasets were created by the [smolagents](https://github.com/huggingface/smolagents) team at 🤗 Hugging Face and contain selected tasks from GAIA, GSM8K, SimpleQA and MATH:
The evaluation uses two datasets:

[<img src="../docs/eval/eval-plot.png" alt="Performance">](../docs/eval/eval-plot.png)
1. [m-ric/agents_medium_benchmark_2](https://huggingface.co/datasets/m-ric/agents_medium_benchmark_2)
2. [m-ric/smol_agents_benchmark](https://huggingface.co/datasets/m-ric/smol_agents_benchmark)

Both datasets were created by the [smolagents](https://github.com/huggingface/smolagents) team at 🤗 Hugging Face and contain curated tasks from GAIA, GSM8K, SimpleQA, and MATH. We selected these datasets primarily for a quick evaluation of relative performance between models in a `freeact` setup, with the additional benefit of enabling comparisons with smolagents. To ensure fair comparisons with [their published results](https://huggingface.co/blog/smolagents#how-strong-are-open-models-for-agentic-workflows), we used identical evaluation protocols and tools (implemented as [skills](skills)).

[<img src="../docs/eval/eval-plot-line.png" alt="Performance">](../docs/eval/eval-plot-line.png)

| model | GAIA (exact_match) | GSM8K (exact_match) | MATH (exact_match) | SimpleQA (exact_match) | SimpleQA (llm_as_judge) |
|:----------------------------|--------------------:|--------------------:|-------------------:|-----------------------:|------------------------:|
| claude-3-5-sonnet-20241022 | **53.1** | **95.7** | **90.0** | 57.5 | **72.5** |
| claude-3-5-sonnet-20241022 | 53.1 | **95.7** | **90.0** | 57.5 | **72.5** |
| claude-3-5-haiku-20241022 | 31.2 | 90.0 | 76.0 | 52.5 | 70.0 |
| gemini-2.0-flash-exp | 34.4 | **95.7** | 88.0 | 50.0 | 65.0 |
| qwen2p5-coder-32b-instruct | 25.0 | **95.7** | 88.0 | 52.5 | 65.0 |
| deepseek-v3 | 37.5 | 91.4 | 88.0 | **60.0** | 67.5 |
| deepseek-r1 | **65.6** | | | | |

When comparing our results with smolagents using `claude-3-5-sonnet-20241022` on [m-ric/agents_medium_benchmark_2](https://huggingface.co/datasets/m-ric/agents_medium_benchmark_2) (only dataset with available smolagents [reference data](https://github.com/huggingface/smolagents/blob/c22fedaee17b8b966e86dc53251f210788ae5c19/examples/benchmark.ipynb)), we observed the following outcomes (evaluation conducted on 2025-01-07):
When comparing our results with smolagents using `claude-3-5-sonnet-20241022` on [m-ric/agents_medium_benchmark_2](https://huggingface.co/datasets/m-ric/agents_medium_benchmark_2) (only dataset with available smolagents [reference data](https://github.com/huggingface/smolagents/blob/c22fedaee17b8b966e86dc53251f210788ae5c19/examples/benchmark.ipynb)), we observed the following outcomes (comparison executed on 2025-01-07):

[<img src="../docs/eval/eval-plot-comparison.png" alt="Performance comparison" width="60%">](../docs/eval/eval-plot-comparison.png)

@@ -29,7 +36,7 @@ When comparing our results with smolagents using `claude-3-5-sonnet-20241022` on
| freeact | claude-3-5-sonnet-20241022 | zero-shot | **53.1** | **95.7** | **57.5** |
| smolagents | claude-3-5-sonnet-20241022 | few-shot | 43.8 | 91.4 | 47.5 |

Interestingly, these results were achieved using zero-shot prompting in `freeact`, while the smolagents implementation utilizes few-shot prompting. To ensure a fair comparison, we employed identical evaluation protocols and tools (converted to [skills](skills)).
Interestingly, these results were achieved using zero-shot prompting in `freeact`, while the smolagents implementation utilizes few-shot prompting.

## Running

@@ -86,6 +93,10 @@ python evaluation/evaluate.py \
python evaluation/evaluate.py \
--model-name deepseek-v3 \
--run-id deepseek-v3

python evaluation/evaluate.py \
--model-name deepseek-r1 \
--run-id deepseek-r1
```

Results are saved in `output/evaluation/<run-id>`. Pre-generated outputs from our runs are available [here](https://github.com/user-attachments/files/18488186/evaluation-results-agents-4_medium_benchmark_2.zip).
@@ -100,7 +111,8 @@ python evaluation/score.py \
--evaluation-dir output/evaluation/claude-3-5-haiku-20241022 \
--evaluation-dir output/evaluation/gemini-2.0-flash-exp \
--evaluation-dir output/evaluation/qwen2p5-coder-32b-instruct \
--evaluation-dir output/evaluation/deepseek-v3
--evaluation-dir output/evaluation/deepseek-v3 \
--evaluation-dir output/evaluation/deepseek-r1
```

Generate visualization and reports:
22 changes: 17 additions & 5 deletions evaluation/evaluate.py
Expand Up @@ -20,7 +20,8 @@
CodeActModel,
CodeActModelTurn,
CodeExecution,
DeepSeek,
DeepSeekR1,
DeepSeekV3,
Gemini,
QwenCoder,
execution_environment,
@@ -255,7 +256,7 @@ async def run_agent(
SYSTEM_TEMPLATE,
)

model = DeepSeek(
model = DeepSeekV3(
api_key=os.getenv("FIREWORKS_API_KEY"),
base_url="https://api.fireworks.ai/inference/v1",
model_name=f"accounts/fireworks/models/{model_name}",
Expand All @@ -264,6 +265,15 @@ async def run_agent(
execution_output_template=EXECUTION_OUTPUT_TEMPLATE,
execution_error_template=EXECUTION_ERROR_TEMPLATE,
)
elif model_name == "deepseek-r1":
model = DeepSeekR1(
api_key=os.getenv("FIREWORKS_API_KEY"),
base_url="https://api.fireworks.ai/inference/v1",
model_name=f"accounts/fireworks/models/{model_name}",
skill_sources=skill_sources,
instruction_extension="Important: never pass a PDF file as argument to visit_webpage.",
)
run_kwargs |= {"max_tokens": 16384}
else:
raise ValueError(f"Unknown model: {model_name}")

@@ -284,11 +294,13 @@ async def collect_output(agent_turn: CodeActAgentTurn, debug: bool = True) -> Li
async for activity in agent_turn.stream():
match activity:
case CodeActModelTurn() as model_turn:
if debug:
async for chunk in model_turn.stream():
print(chunk, end="", flush=True)
print()

model_response = await model_turn.response()
output.append("[agent ] " + model_response.text)
if debug:
print("Agent response:")
print(model_response.text)

if model_response.code:
output.append("[python] " + model_response.code)
6 changes: 6 additions & 0 deletions evaluation/scoring/gaia.py
Expand Up @@ -41,7 +41,13 @@ def get_question_score_gaia(
return normalize_str(model_answer) == normalize_str(ground_truth)


def remove_boxed(text):
# Replace \boxed{number} with just the number
return re.sub(r"\\boxed\{(\d+)\}", r"\1", text)


def normalize_number_str(number_str: str) -> float:
number_str = remove_boxed(number_str)
# we replace these common units and commas to allow
# conversion to float
for char in ["$", "%", ","]:
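
For illustration, the behaviour of the new helper on a typical R1-style answer is shown below. This is a standalone sketch (in `gaia.py` itself, `re` is presumably already imported):

```python
import re

def remove_boxed(text):
    # Replace \boxed{number} with just the number (as added in this commit)
    return re.sub(r"\\boxed\{(\d+)\}", r"\1", text)

assert remove_boxed(r"The answer is \boxed{42}") == "The answer is 42"
# Non-numeric content inside \boxed{...} is not matched by this pattern and is left as-is.
assert remove_boxed(r"\boxed{x+1}") == r"\boxed{x+1}"
```
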
3 changes: 2 additions & 1 deletion freeact/__init__.py
Expand Up @@ -6,7 +6,8 @@
CodeActModel,
CodeActModelResponse,
CodeActModelTurn,
DeepSeek,
DeepSeekR1,
DeepSeekV3,
Gemini,
GeminiLive,
GeminiModelName,
36 changes: 17 additions & 19 deletions freeact/cli/__main__.py
Expand Up @@ -10,7 +10,8 @@
Claude,
CodeActAgent,
CodeActModel,
DeepSeek,
DeepSeekR1,
DeepSeekV3,
Gemini,
QwenCoder,
execution_environment,
@@ -52,7 +53,10 @@ async def amain(
else:
system_extension_str = None

run_kwargs: Dict[str, Any] = {}
run_kwargs: Dict[str, Any] = {
"temperature": temperature,
"max_tokens": max_tokens,
}
model: CodeActModel

if "claude" in model_name.lower():
Expand All @@ -64,11 +68,7 @@ async def amain(
api_key=api_key,
base_url=base_url,
)
run_kwargs |= {
"skill_sources": skill_sources,
"temperature": temperature,
"max_tokens": max_tokens,
}
run_kwargs |= {"skill_sources": skill_sources}
elif "gemini" in model_name.lower():
model = Gemini(
model_name=model_name, # type: ignore
Expand All @@ -84,22 +84,20 @@ async def amain(
api_key=api_key,
base_url=base_url,
)
run_kwargs |= {
"temperature": temperature,
"max_tokens": max_tokens,
}
elif "deepseek" in model_name.lower():
model = DeepSeek(
elif "deepseek-v3" in model_name.lower():
model = DeepSeekV3(
model_name=model_name,
skill_sources=skill_sources,
api_key=api_key,
base_url=base_url,
)
print(model._history[0]["content"])
run_kwargs |= {
"temperature": temperature,
"max_tokens": max_tokens,
}
elif "deepseek-r1" in model_name.lower():
model = DeepSeekR1(
model_name=model_name,
api_key=api_key,
base_url=base_url,
skill_sources=skill_sources,
)
else:
typer.echo(f"Unsupported model: {model_name}", err=True)
raise typer.Exit(code=1)
@@ -119,7 +117,7 @@

@app.command()
def main(
model_name: Annotated[str, typer.Option(help="Name of the model")] = "gemini-2.0-flash-thinking-exp-01-21",
model_name: Annotated[str, typer.Option(help="Name of the model")] = "claude-3-5-sonnet-20241022",
api_key: Annotated[str | None, typer.Option(help="API key of the model")] = None,
base_url: Annotated[str | None, typer.Option(help="Base URL of the model")] = None,
ipybox_tag: Annotated[str, typer.Option(help="Tag of the ipybox Docker image")] = "ghcr.io/gradion-ai/ipybox:basic",