Experimental support for DeepSeek-R1
krasserm committed Feb 4, 2025
1 parent a8114ce commit 99f2d21
Showing 20 changed files with 246 additions and 71 deletions.
12 changes: 9 additions & 3 deletions README.md
@@ -92,23 +92,29 @@ https://github.com/user-attachments/assets/83cec179-54dc-456c-b647-ea98ec99600b

## Evaluation

We [evaluated](evaluation) `freeact` with these models:
We [evaluated](evaluation) `freeact` with the following models:

- Claude 3.5 Sonnet (`claude-3-5-sonnet-20241022`)
- Claude 3.5 Haiku (`claude-3-5-haiku-20241022`)
- Gemini 2.0 Flash (`gemini-2.0-flash-exp`)
- Qwen 2.5 Coder 32B Instruct (`qwen2p5-coder-32b-instruct`)
- DeepSeek V3 (`deepseek-v3`)
- DeepSeek R1 (`deepseek-r1`)

The evaluation was performed using two benchmark datasets: [m-ric/agents_medium_benchmark_2](https://huggingface.co/datasets/m-ric/agents_medium_benchmark_2) and the MATH subset from [m-ric/smol_agents_benchmark](https://huggingface.co/datasets/m-ric/smol_agents_benchmark). Both datasets were created by the [smolagents](https://github.com/huggingface/smolagents) team at 🤗 Hugging Face and contain selected tasks from GAIA, GSM8K, SimpleQA and MATH:
The evaluation uses two datasets:

1. [m-ric/agents_medium_benchmark_2](https://huggingface.co/datasets/m-ric/agents_medium_benchmark_2)
2. [m-ric/smol_agents_benchmark](https://huggingface.co/datasets/m-ric/smol_agents_benchmark)

Both datasets were created by the [smolagents](https://github.com/huggingface/smolagents) team at 🤗 Hugging Face and contain curated tasks from GAIA, GSM8K, SimpleQA, and MATH. We selected these datasets primarily for a quick evaluation of relative performance between models in a `freeact` setup, with the additional benefit of enabling comparisons with smolagents. To ensure fair comparisons with [their published results](https://huggingface.co/blog/smolagents#how-strong-are-open-models-for-agentic-workflows), we used identical evaluation protocols and tools.
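
For readers who want to inspect the benchmark tasks directly, both datasets can be loaded with the 🤗 `datasets` library. This is a minimal sketch and not part of the evaluation code; the available splits and column names are whatever the dataset cards define:

```python
from datasets import load_dataset  # pip install datasets

# Printing the DatasetDict shows the available splits and columns of each dataset.
agents_benchmark = load_dataset("m-ric/agents_medium_benchmark_2")
smol_benchmark = load_dataset("m-ric/smol_agents_benchmark")

print(agents_benchmark)
print(smol_benchmark)
```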

[<img src="docs/eval/eval-plot.png" alt="Performance">](docs/eval/eval-plot.png)

When comparing our results with smolagents using Claude 3.5 Sonnet on [m-ric/agents_medium_benchmark_2](https://huggingface.co/datasets/m-ric/agents_medium_benchmark_2) (only dataset with available smolagents [reference data](https://github.com/huggingface/smolagents/blob/c22fedaee17b8b966e86dc53251f210788ae5c19/examples/benchmark.ipynb)), we observed the following outcomes (evaluation conducted on 2025-01-07):

[<img src="docs/eval/eval-plot-comparison.png" alt="Performance comparison" width="60%">](docs/eval/eval-plot-comparison.png)

Interestingly, these results were achieved using zero-shot prompting in `freeact`, while the smolagents implementation utilizes few-shot prompting. To ensure a fair comparison, we employed identical evaluation protocols and tools. You can find all evaluation details [here](evaluation).
Interestingly, these results were achieved using zero-shot prompting in `freeact`, while the smolagents implementation utilizes few-shot prompting. You can find all evaluation details [here](evaluation).

## Supported models

3 changes: 2 additions & 1 deletion docs/api/deepseek.md
@@ -2,4 +2,5 @@
options:
show_root_heading: false
members:
- DeepSeek
- DeepSeekV3
- DeepSeekR1
Binary file added docs/eval/eval-plot-line.png
Binary file modified docs/eval/eval-plot.png
12 changes: 9 additions & 3 deletions docs/evaluation.md
@@ -1,14 +1,20 @@
# Evaluation results

We [evaluated](https://github.com/gradion-ai/freeact/tree/main/evaluation) `freeact` with these models:
We [evaluated](https://github.com/gradion-ai/freeact/tree/main/evaluation) `freeact` with the following models:

- Claude 3.5 Sonnet (`claude-3-5-sonnet-20241022`)
- Claude 3.5 Haiku (`claude-3-5-haiku-20241022`)
- Gemini 2.0 Flash (`gemini-2.0-flash-exp`)
- Qwen 2.5 Coder 32B Instruct (`qwen2p5-coder-32b-instruct`)
- DeepSeek V3 (`deepseek-v3`)
- DeepSeek R1 (`deepseek-r1`)

The evaluation was performed using two benchmark datasets: [m-ric/agents_medium_benchmark_2](https://huggingface.co/datasets/m-ric/agents_medium_benchmark_2) and the MATH subset from [m-ric/smol_agents_benchmark](https://huggingface.co/datasets/m-ric/smol_agents_benchmark). Both datasets were created by the [smolagents](https://github.com/huggingface/smolagents) team at 🤗 Hugging Face and contain selected tasks from GAIA, GSM8K, SimpleQA and MATH:
The evaluation uses two datasets:

1. [m-ric/agents_medium_benchmark_2](https://huggingface.co/datasets/m-ric/agents_medium_benchmark_2)
2. [m-ric/smol_agents_benchmark](https://huggingface.co/datasets/m-ric/smol_agents_benchmark)

Both datasets were created by the [smolagents](https://github.com/huggingface/smolagents) team at 🤗 Hugging Face and contain curated tasks from GAIA, GSM8K, SimpleQA, and MATH. We selected these datasets primarily for a quick evaluation of relative performance between models in a `freeact` setup, with the additional benefit of enabling comparisons with smolagents. To ensure fair comparisons with [their published results](https://huggingface.co/blog/smolagents#how-strong-are-open-models-for-agentic-workflows), we used identical evaluation protocols and tools.

<figure markdown>
[![architecture](eval/eval-plot.png){ align="left" }](eval/eval-plot.png){target="_blank"}
@@ -20,4 +26,4 @@ When comparing our results with smolagents using Claude 3.5 Sonnet on [m-ric/age
[![architecture](eval/eval-plot-comparison.png){ width="60%" align="left" }](eval/eval-plot-comparison.png){target="_blank"}
</figure>

Interestingly, these results were achieved using zero-shot prompting in `freeact`, while the smolagents implementation utilizes few-shot prompting. To ensure a fair comparison, we employed identical evaluation protocols and tools. You can find all evaluation details [here](https://github.com/gradion-ai/freeact/tree/main/evaluation).
Interestingly, these results were achieved using zero-shot prompting in `freeact`, while the smolagents implementation utilizes few-shot prompting. You can find all evaluation details [here](https://github.com/gradion-ai/freeact/tree/main/evaluation).
25 changes: 14 additions & 11 deletions docs/models.md
Expand Up @@ -6,10 +6,12 @@ For the following models, `freeact` provides model-specific prompt templates.
|-----------------------------|------------|-----------|--------------|
| Claude 3.5 Sonnet | 2024-10-22 || optimized |
| Claude 3.5 Haiku | 2024-10-22 || optimized |
| Gemini 2.0 Flash | 2024-12-11 || experimental |
| Gemini 2.0 Flash Thinking | 2025-01-21 || experimental |
| Qwen 2.5 Coder 32B Instruct | || experimental |
| DeepSeek V3 | || experimental |
| Gemini 2.0 Flash | 2024-12-11 || draft |
| Qwen 2.5 Coder 32B Instruct | || draft |
| DeepSeek V3 | || draft |
| DeepSeek R1[^1] | || experimental |

[^1]: DeepSeek R1 wasn't trained on agentic tool use but demonstrates strong performance with code actions, even surpassing Claude 3.5 Sonnet on the GAIA subset in our [evaluation](evaluation.md). However, its token usage for reasoning remains significantly higher than that of other models, so it is not yet practical for everyday use.

!!! Info

@@ -60,25 +62,26 @@ python -m freeact.cli \
--api-key=$GOOGLE_API_KEY
```

### Gemini 2.0 Flash Thinking
### Qwen 2.5 Coder 32B Instruct

```bash
python -m freeact.cli \
--model-name=gemini-2.0-flash-thinking-exp-01-21 \
--model-name=Qwen/Qwen2.5-Coder-32B-Instruct \
--ipybox-tag=ghcr.io/gradion-ai/ipybox:basic \
--skill-modules=freeact_skills.search.google.stream.api \
--api-key=$GOOGLE_API_KEY
--base-url=https://api-inference.huggingface.co/v1/ \
--api-key=$HF_TOKEN
```

### Qwen 2.5 Coder 32B Instruct
### DeepSeek R1

```bash
python -m freeact.cli \
--model-name=Qwen/Qwen2.5-Coder-32B-Instruct \
--model-name=accounts/fireworks/models/deepseek-r1 \
--ipybox-tag=ghcr.io/gradion-ai/ipybox:basic \
--skill-modules=freeact_skills.search.google.stream.api \
--base-url=https://api-inference.huggingface.co/v1/ \
--api-key=$HF_TOKEN
--base-url=https://api.fireworks.ai/inference/v1 \
--api-key=$FIREWORKS_API_KEY
```
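
The programmatic equivalent is sketched below. The constructor arguments mirror the `DeepSeekR1` usage added to `evaluation/evaluate.py` in this commit; treat the surrounding setup as an assumption rather than the canonical API:

```python
import os

from freeact import DeepSeekR1

# Mirrors the constructor call added to evaluation/evaluate.py in this commit.
# evaluate.py additionally passes skill_sources and sets max_tokens=16384 for the
# run to leave room for R1's reasoning tokens; both are omitted here for brevity.
model = DeepSeekR1(
    model_name="accounts/fireworks/models/deepseek-r1",
    api_key=os.getenv("FIREWORKS_API_KEY"),
    base_url="https://api.fireworks.ai/inference/v1",
)
```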

### DeepSeek V3
2 changes: 1 addition & 1 deletion docs/tutorials/extend.md
@@ -7,7 +7,7 @@ This tutorial demonstrates how `freeact` agents can be customized through system
- How to implement domain-specific rules (demonstrated by an example that multiplies temperatures by 3.17 in weather-related responses)

!!! Note
System extensions are currently only supported for [Claude][freeact.model.claude.model.Claude] models and [DeepSeek V3][freeact.model.deepseek.model.DeepSeek].
System extensions are currently only supported for [Claude][freeact.model.claude.model.Claude] models and [DeepSeek V3][freeact.model.deepseek.model.DeepSeekV3].

The [example conversation](#example-conversation) below was guided by this system extension:

26 changes: 19 additions & 7 deletions evaluation/README.md
@@ -1,26 +1,33 @@
# Evaluation

We evaluated `freeact` using five state-of-the-art models:
We evaluated `freeact` with the following models:

- Claude 3.5 Sonnet (`claude-3-5-sonnet-20241022`)
- Claude 3.5 Haiku (`claude-3-5-haiku-20241022`)
- Gemini 2.0 Flash (`gemini-2.0-flash-exp`)
- Qwen 2.5 Coder 32B Instruct (`qwen2p5-coder-32b-instruct`)
- DeepSeek V3 (`deepseek-v3`)
- DeepSeek R1 (`deepseek-r1`)

The evaluation was performed using two benchmark datasets: [m-ric/agents_medium_benchmark_2](https://huggingface.co/datasets/m-ric/agents_medium_benchmark_2) and the MATH subset from [m-ric/smol_agents_benchmark](https://huggingface.co/datasets/m-ric/smol_agents_benchmark). Both datasets were created by the [smolagents](https://github.com/huggingface/smolagents) team at 🤗 Hugging Face and contain selected tasks from GAIA, GSM8K, SimpleQA and MATH:
The evaluation uses two datasets:

[<img src="../docs/eval/eval-plot.png" alt="Performance">](../docs/eval/eval-plot.png)
1. [m-ric/agents_medium_benchmark_2](https://huggingface.co/datasets/m-ric/agents_medium_benchmark_2)
2. [m-ric/smol_agents_benchmark](https://huggingface.co/datasets/m-ric/smol_agents_benchmark)

Both datasets were created by the [smolagents](https://github.com/huggingface/smolagents) team at 🤗 Hugging Face and contain curated tasks from GAIA, GSM8K, SimpleQA, and MATH. We selected these datasets primarily for a quick evaluation of relative performance between models in a `freeact` setup, with the additional benefit of enabling comparisons with smolagents. To ensure fair comparisons with [their published results](https://huggingface.co/blog/smolagents#how-strong-are-open-models-for-agentic-workflows), we used identical evaluation protocols and tools (implemented as [skills](skills)).

[<img src="../docs/eval/eval-plot-line.png" alt="Performance">](../docs/eval/eval-plot-line.png)

| model | GAIA (exact_match) | GSM8K (exact_match) | MATH (exact_match) | SimpleQA (exact_match) | SimpleQA (llm_as_judge) |
|:----------------------------|--------------------:|--------------------:|-------------------:|-----------------------:|------------------------:|
| claude-3-5-sonnet-20241022 | **53.1** | **95.7** | **90.0** | 57.5 | **72.5** |
| claude-3-5-sonnet-20241022 | 53.1 | **95.7** | **90.0** | 57.5 | **72.5** |
| claude-3-5-haiku-20241022 | 31.2 | 90.0 | 76.0 | 52.5 | 70.0 |
| gemini-2.0-flash-exp | 34.4 | **95.7** | 88.0 | 50.0 | 65.0 |
| qwen2p5-coder-32b-instruct | 25.0 | **95.7** | 88.0 | 52.5 | 65.0 |
| deepseek-v3 | 37.5 | 91.4 | 88.0 | **60.0** | 67.5 |
| deepseek-r1 | **65.6** | | | | |

When comparing our results with smolagents using `claude-3-5-sonnet-20241022` on [m-ric/agents_medium_benchmark_2](https://huggingface.co/datasets/m-ric/agents_medium_benchmark_2) (only dataset with available smolagents [reference data](https://github.com/huggingface/smolagents/blob/c22fedaee17b8b966e86dc53251f210788ae5c19/examples/benchmark.ipynb)), we observed the following outcomes (evaluation conducted on 2025-01-07):
When comparing our results with smolagents using `claude-3-5-sonnet-20241022` on [m-ric/agents_medium_benchmark_2](https://huggingface.co/datasets/m-ric/agents_medium_benchmark_2) (only dataset with available smolagents [reference data](https://github.com/huggingface/smolagents/blob/c22fedaee17b8b966e86dc53251f210788ae5c19/examples/benchmark.ipynb)), we observed the following outcomes (comparison executed on 2025-01-07):

[<img src="../docs/eval/eval-plot-comparison.png" alt="Performance comparison" width="60%">](../docs/eval/eval-plot-comparison.png)

@@ -29,7 +36,7 @@ When comparing our results with smolagents using `claude-3-5-sonnet-20241022` on
| freeact | claude-3-5-sonnet-20241022 | zero-shot | **53.1** | **95.7** | **57.5** |
| smolagents | claude-3-5-sonnet-20241022 | few-shot | 43.8 | 91.4 | 47.5 |

Interestingly, these results were achieved using zero-shot prompting in `freeact`, while the smolagents implementation utilizes few-shot prompting. To ensure a fair comparison, we employed identical evaluation protocols and tools (converted to [skills](skills)).
Interestingly, these results were achieved using zero-shot prompting in `freeact`, while the smolagents implementation utilizes few-shot prompting.

## Running

@@ -86,6 +93,10 @@ python evaluation/evaluate.py \
python evaluation/evaluate.py \
--model-name deepseek-v3 \
--run-id deepseek-v3

python evaluation/evaluate.py \
--model-name deepseek-r1 \
--run-id deepseek-r1
```

Results are saved in `output/evaluation/<run-id>`. Pre-generated outputs from our runs are available [here](https://github.com/user-attachments/files/18488186/evaluation-results-agents-4_medium_benchmark_2.zip).
@@ -100,7 +111,8 @@ python evaluation/score.py \
--evaluation-dir output/evaluation/claude-3-5-haiku-20241022 \
--evaluation-dir output/evaluation/gemini-2.0-flash-exp \
--evaluation-dir output/evaluation/qwen2p5-coder-32b-instruct \
--evaluation-dir output/evaluation/deepseek-v3
--evaluation-dir output/evaluation/deepseek-v3 \
--evaluation-dir output/evaluation/deepseek-r1
```

Generate visualization and reports:
22 changes: 17 additions & 5 deletions evaluation/evaluate.py
Expand Up @@ -20,7 +20,8 @@
CodeActModel,
CodeActModelTurn,
CodeExecution,
DeepSeek,
DeepSeekR1,
DeepSeekV3,
Gemini,
QwenCoder,
execution_environment,
@@ -255,7 +256,7 @@ async def run_agent(
SYSTEM_TEMPLATE,
)

model = DeepSeek(
model = DeepSeekV3(
api_key=os.getenv("FIREWORKS_API_KEY"),
base_url="https://api.fireworks.ai/inference/v1",
model_name=f"accounts/fireworks/models/{model_name}",
Expand All @@ -264,6 +265,15 @@ async def run_agent(
execution_output_template=EXECUTION_OUTPUT_TEMPLATE,
execution_error_template=EXECUTION_ERROR_TEMPLATE,
)
elif model_name == "deepseek-r1":
model = DeepSeekR1(
api_key=os.getenv("FIREWORKS_API_KEY"),
base_url="https://api.fireworks.ai/inference/v1",
model_name=f"accounts/fireworks/models/{model_name}",
skill_sources=skill_sources,
instruction_extension="Important: never pass a PDF file as argument to visit_webpage.",
)
run_kwargs |= {"max_tokens": 16384}
else:
raise ValueError(f"Unknown model: {model_name}")

@@ -284,11 +294,13 @@ async def collect_output(agent_turn: CodeActAgentTurn, debug: bool = True) -> Li
async for activity in agent_turn.stream():
match activity:
case CodeActModelTurn() as model_turn:
if debug:
async for chunk in model_turn.stream():
print(chunk, end="", flush=True)
print()

model_response = await model_turn.response()
output.append("[agent ] " + model_response.text)
if debug:
print("Agent response:")
print(model_response.text)

if model_response.code:
output.append("[python] " + model_response.code)
6 changes: 6 additions & 0 deletions evaluation/scoring/gaia.py
Expand Up @@ -41,7 +41,13 @@ def get_question_score_gaia(
return normalize_str(model_answer) == normalize_str(ground_truth)


def remove_boxed(text):
# Replace \boxed{number} with just the number
return re.sub(r"\\boxed\{(\d+)\}", r"\1", text)


def normalize_number_str(number_str: str) -> float:
number_str = remove_boxed(number_str)
# we replace these common units and commas to allow
# conversion to float
for char in ["$", "%", ","]:
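
For illustration, the behaviour of the new helper on a typical R1-style answer is shown below. This is a standalone sketch (in `gaia.py` itself, `re` is presumably already imported):

```python
import re

def remove_boxed(text):
    # Replace \boxed{number} with just the number (as added in this commit)
    return re.sub(r"\\boxed\{(\d+)\}", r"\1", text)

assert remove_boxed(r"The answer is \boxed{42}") == "The answer is 42"
# Non-numeric content inside \boxed{...} is not matched by this pattern and is left as-is.
assert remove_boxed(r"\boxed{x+1}") == r"\boxed{x+1}"
```
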
3 changes: 2 additions & 1 deletion freeact/__init__.py
Expand Up @@ -6,7 +6,8 @@
CodeActModel,
CodeActModelResponse,
CodeActModelTurn,
DeepSeek,
DeepSeekR1,
DeepSeekV3,
Gemini,
GeminiLive,
GeminiModelName,
36 changes: 17 additions & 19 deletions freeact/cli/__main__.py
Expand Up @@ -10,7 +10,8 @@
Claude,
CodeActAgent,
CodeActModel,
DeepSeek,
DeepSeekR1,
DeepSeekV3,
Gemini,
QwenCoder,
execution_environment,
@@ -52,7 +53,10 @@ async def amain(
else:
system_extension_str = None

run_kwargs: Dict[str, Any] = {}
run_kwargs: Dict[str, Any] = {
"temperature": temperature,
"max_tokens": max_tokens,
}
model: CodeActModel

if "claude" in model_name.lower():
Expand All @@ -64,11 +68,7 @@ async def amain(
api_key=api_key,
base_url=base_url,
)
run_kwargs |= {
"skill_sources": skill_sources,
"temperature": temperature,
"max_tokens": max_tokens,
}
run_kwargs |= {"skill_sources": skill_sources}
elif "gemini" in model_name.lower():
model = Gemini(
model_name=model_name, # type: ignore
Expand All @@ -84,22 +84,20 @@ async def amain(
api_key=api_key,
base_url=base_url,
)
run_kwargs |= {
"temperature": temperature,
"max_tokens": max_tokens,
}
elif "deepseek" in model_name.lower():
model = DeepSeek(
elif "deepseek-v3" in model_name.lower():
model = DeepSeekV3(
model_name=model_name,
skill_sources=skill_sources,
api_key=api_key,
base_url=base_url,
)
print(model._history[0]["content"])
run_kwargs |= {
"temperature": temperature,
"max_tokens": max_tokens,
}
elif "deepseek-r1" in model_name.lower():
model = DeepSeekR1(
model_name=model_name,
api_key=api_key,
base_url=base_url,
skill_sources=skill_sources,
)
else:
typer.echo(f"Unsupported model: {model_name}", err=True)
raise typer.Exit(code=1)
@@ -119,7 +117,7 @@

@app.command()
def main(
model_name: Annotated[str, typer.Option(help="Name of the model")] = "gemini-2.0-flash-thinking-exp-01-21",
model_name: Annotated[str, typer.Option(help="Name of the model")] = "claude-3-5-sonnet-20241022",
api_key: Annotated[str | None, typer.Option(help="API key of the model")] = None,
base_url: Annotated[str | None, typer.Option(help="Base URL of the model")] = None,
ipybox_tag: Annotated[str, typer.Option(help="Tag of the ipybox Docker image")] = "ghcr.io/gradion-ai/ipybox:basic",