clean up quick starts
allenanie committed Jul 28, 2024
1 parent 497259a commit 901fa18
Showing 5 changed files with 76 additions and 12 deletions.
Binary file removed docs/images/virtualhome_image.png
Binary file removed docs/images/virtualhomes.png
Binary file removed docs/images/virtualhomes_messages.png
53 changes: 45 additions & 8 deletions docs/tutorials/quick_start_2.ipynb
@@ -14,6 +14,8 @@
"\n",
"Now you've learned the basics -- we can look at an application in building a reinforcement learning (RL) agent using Trace primitives.\n",
"\n",
"## A Reinforcement Learning Agent\n",
"\n",
"The essence of an RL agent is to react and adapt to different situations. An RL agent should change its behavior to become more successful at a task. Using `node`, `@bundle`, we can expose different parts of a Python program to an optimizer, making this program reactive to various feedback signals. A self-modifying, self-evolving system is the definition of an RL agent. By rewriting its own rules and logic, they can self-improve through the philosophy of *trial-and-error* (the Reinforcement Learning way!).\n",
"\n",
"Building an RL agent (with program blocks) and use an optimize to react to feedback is at the heart of policy gradient algorithms (such as [PPO](https://arxiv.org/abs/1707.06347), which is used in RLHF -- Reinforcement Learning from Human Feedback). Trace changes the underlying program blocks to improve the agent's chance of success. Here, we can look at an example of how Trace can be used to design an RL-style agent to master the game of Battleship."
@@ -43,6 +45,8 @@
"source": [
"Trace uses decorators like `@bundle` and data wrappers like `node` to expose different parts of these programs to an LLM. An LLM can rewrite the entire or only parts of system based on the user's specification. An LLM can change various parts of this system, with feedback they receive from the environment. Trace allows users to exert control over the LLM code-generation process.\n",
"\n",
"## The Game of BattleShip\n",
"\n",
"A simple example of how Trace allows the user to design an agent, and how the agent self-modifies its own behavior to adapt to the environment, we can take a look at the classic game of Battleship.\n",
"\n",
"```{image} ../images/dall_e_battleship.jpeg\n",
@@ -167,7 +171,13 @@
"cell_type": "code",
"execution_count": 5,
"id": "23d68819-eeda-4d4f-b49e-fd9c618ed508",
"metadata": {},
"metadata": {
"editable": true,
"slideshow": {
"slide_type": ""
},
"tags": []
},
"outputs": [
{
"data": {
@@ -267,6 +277,8 @@
"id": "6a806190-5750-490f-8d3e-827b0de831d1",
"metadata": {},
"source": [
"## Define An Agent Using Trace\n",
"\n",
"We can write a simple agent that can play this game. Note that we are creating a normal Python class and decorate it with `@model`, and then use `@bundle` to specify which part of this class can be changed by an LLM through feedback.\n",
"\n",
"```{tip}\n",
@@ -340,6 +352,8 @@
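The notebook's full agent class is only partially visible in this diff, so here is a hedged outline of the pattern described above: an ordinary Python class wrapped with `@model`, with one `@bundle`-decorated method marked as the piece an LLM may rewrite. The class name, the placeholder policy, and the split between `__call__` and `act` are illustrative assumptions, not the notebook's exact code.

```python
from opto.trace import model, bundle

@model
class BattleshipAgent:
    def __call__(self, map):
        # Plain Python glue; it is not bundled, so it will not appear
        # in the trace graph.
        return self.act(map)

    @bundle(trainable=True)
    def act(self, map):
        """Return (row, col) to fire at, given the map of past hits and misses."""
        return 0, 0  # placeholder policy for the optimizer to improve
```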
"id": "b4108924-9d96-424a-b772-972ecdcfc717",
"metadata": {},
"source": [
"## Visualize Trace Graph of an Action\n",
"\n",
"We can first take a look at what the Trace Graph looks like for this agent when it takes an observation `board.get_shots()` from the board (this shows the map without any ship but with past records of hits and misses). The agent takes an action based on this observation."
]
},
@@ -493,15 +507,30 @@
"Note that not all parts of the agent are present in this graph. For example, `__call__` is not in this. A user needs to decide what to include and what to exclude, and what's trainable and what's not. You can learn more about how to design an agent in the tutorials.\n",
"```\n",
"\n",
"Now let's see if we can get an agent that can play this game with environment reward information."
"## Define the Optimization Process\n",
"\n",
"Now let's see if we can get an agent that can play this game with environment reward information.\n",
"\n",
"We set up the optimization procedure:\n",
"1. We initialize the game and obtain the initial state `board.get_shots()`. We wrap this in a Trace `node`.\n",
"2. We enter a game loop. The agent produces an action through `agent.act(obs)`.\n",
"3. The action `output` is then executed in the environment through `user_fb_for_placing_shot`.\n",
"4. Based on the feedback, the `optimizer` takes a step to update the agent"
]
},
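A compact sketch of that loop, mirroring the four steps above. It reuses `board.get_shots()`, `agent.act(obs)`, and `user_fb_for_placing_shot` from the notebook; the `Board()` constructor, the `parameters()` helper, and the exact `zero_feedback`/`backward`/`step` calling pattern are assumptions based on the surrounding tutorials rather than the notebook's exact code.

```python
from opto.trace import node
from opto.optimizers import OptoPrime

board = Board()                             # assumed game constructor
agent = BattleshipAgent()                   # the @model-decorated agent from above
optimizer = OptoPrime(agent.parameters())   # assumes @model exposes parameters()

obs = node(board.get_shots())               # 1. wrap the initial state
for _ in range(10):
    output = agent.act(obs)                                                      # 2. act
    obs, reward, terminal, feedback = user_fb_for_placing_shot(board, output.data)  # 3. execute
    optimizer.zero_feedback()
    output.backward(feedback)               # 4. propagate feedback ...
    optimizer.step()                        #    ... and update the agent
    if terminal:
        break
```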
{
"cell_type": "code",
"execution_count": 4,
"id": "e4536733-89c0-4245-802b-d5812dd38d0c",
"metadata": {
"scrolled": true
"editable": true,
"scrolled": true,
"slideshow": {
"slide_type": ""
},
"tags": [
"scroll-output"
]
},
"outputs": [
{
@@ -2146,18 +2175,28 @@
"id": "49317147-45ab-4c86-91a2-2f042699f5fe",
"metadata": {},
"source": [
"Then we can see how this agent performs in an evaluation run.\n",
"## Evaluate The Learned Agent Performance\n",
"\n",
"Then we can see how this agent performs in an evaluation run. See that at the end of the optimization, the agent learns to apply heuristics such as -- once a shot turns out to be a hit, then check the adjacent vertical or horizontal squares. \n",
"\n",
"```{note}\n",
"See that at the end of the optimization, the agent learns to apply heuristics such as -- once a shot turns out to be a hit, then check the adjacent vertical or horizontal squares. A deep learning based RL agent would take orders of magnitutde more iterations than 10 iterations to learn this kind of heuristics.\n",
"A neural network based RL agent would take orders of magnitutde more iterations than 10 iterations to learn this kind of heuristics.\n",
"```"
]
},
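For the evaluation run, the same rollout can be replayed without any optimizer updates, which is all that changes between training and evaluation here. This is a hypothetical sketch reusing the names from the training loop above; the reward bookkeeping is an assumption for illustration.

```python
# Hypothetical evaluation rollout: same interaction loop, but no optimizer calls,
# so the learned act() stays frozen.
eval_board = Board()                  # fresh game, assumed constructor
obs = node(eval_board.get_shots())
total_reward = 0
for _ in range(20):
    output = agent.act(obs)
    obs, reward, terminal, feedback = user_fb_for_placing_shot(eval_board, output.data)
    total_reward += reward
    if terminal:
        break
print("reward collected during evaluation:", total_reward)
```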
{
"cell_type": "code",
"execution_count": 8,
"id": "16daeec5-27ef-44c7-9395-cc6a7264e230",
"metadata": {},
"metadata": {
"editable": true,
"slideshow": {
"slide_type": ""
},
"tags": [
"scroll-output"
]
},
"outputs": [
{
"data": {
@@ -3494,8 +3533,6 @@
"\n",
"terminal = False\n",
"for _ in range(15):\n",
" # This is also online optimization\n",
" # we have the opportunity to keep changing the function with each round of interaction\n",
" try:\n",
" output = agent.act(obs)\n",
" obs, reward, terminal, feedback = user_fb_for_placing_shot(board, output.data)\n",
35 changes: 31 additions & 4 deletions docs/tutorials/virtualhome.md
@@ -14,7 +14,7 @@ from opto.optimizers import OptoPrime
[VirtualHome](http://virtual-home.org/) is a Unity-engine-based simulation environment that creates a home-like environment where multiple agents need to collaboratively solve a
series of tasks, ranging from reading books and putting empty plates in a dishwasher to preparing food.

```{image} ../images/virtualhome_image.png
```{image} ../images/virtualhome/virtualhome_image.png
:alt: virtual-home
:align: center
```
@@ -63,6 +63,8 @@ Note: You must respond in the json format above. The action choice must be the s
If there's nothing left to do, the action can be "None". If you choose [send_message], you must also generate the actual message.
```

## Agent Architecture

For the Trace-optimized agent, we additionally add `Plan: $PLAN$` right below `Goals`. The agent stores a plan in its Python class object (which serves as its **memory**),
and when it needs to produce an action, it replaces `$PLAN$` with the current plan.
The Trace optimizer updates the **plan** based on the feedback from the environment and the current progress.
@@ -107,6 +109,8 @@ class Agent(LLMCallable, BaseUtil):
return action
```
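The agent class is shown only in part in this diff, so here is a hedged fragment of the plan-as-memory idea described above: the plan lives on the agent as a trainable node and is substituted into the prompt at `$PLAN$` before each action. `PROMPT_TEMPLATE` handling and `call_llm` are placeholder names for illustration; the real `LLMCallable` interface and the repository's prompt handling may differ.

```python
from opto.trace import node, bundle

class Agent(LLMCallable, BaseUtil):           # base classes as shown in the diff above
    def __init__(self, prompt_template):
        super().__init__()
        self.prompt_template = prompt_template
        # The plan is the agent's memory; marking it trainable lets the
        # Trace optimizer rewrite it based on environment feedback.
        self.plan = node("No plan yet.", trainable=True)

    @bundle()
    def act(self, obs):
        prompt = self.prompt_template.replace("$PLAN$", self.plan.data)
        action = self.call_llm(prompt + "\n" + obs)   # call_llm is a placeholder name
        return action
```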

## Multi-Agent Synchronous Optimization

In a multi-agent environment, we can create multiple agents and let them interact with each other.
We take a synchronous approach, where all agents take actions after observing the current state of the environment, and their
actions are executed together. To make the simulation faster, we implement a sticky-action mechanism, where if the environment
@@ -179,6 +183,8 @@ Therefore, we can directly call `backward` on the next observation.
To learn more about how to use Trace to create an agent in an interactive environment, check out the [Meta-World](https://microsoft.github.io/Trace/examples/code/metaworld.html) example.
```
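As a rough illustration of the synchronous scheme, and of calling `backward` on the next observation, one round of the multi-agent update might look like the following. The `env.step` interface, the per-agent dictionaries, and the feedback strings are assumptions; the sketch also assumes the environment step is itself traced, so each next observation is a node in the trace graph.

```python
# One synchronous round for all agents (hypothetical environment API).
actions = {name: agent.act(observations[name]) for name, agent in agents.items()}
observations, feedbacks = env.step({name: a.data for name, a in actions.items()})

for name, agent in agents.items():
    optimizer = optimizers[name]          # one OptoPrime optimizer per agent
    optimizer.zero_feedback()
    # Feedback arrives attached to the next observation, so we call backward
    # on that node directly, as described above.
    observations[name].backward(feedbacks[name])
    optimizer.step()
```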

## Results

We compare against baseline ReAct agents that only output `thoughts` before taking an action.
This table shows that when Trace optimizes and updates the plans of the agents, they learn to coordinate with each other and achieve the shared goal more efficiently.

@@ -187,18 +193,19 @@ This figure is not to show that other style of agent architecture cannot achieve
```

```{image} ../images/virtualhomes.png
```{image} ../images/virtualhome/virtualhomes.png
---
alt: task-reward
align: center
figclass: margin-caption
---
```
```{div} align-center
(**Figure**: *Lower numbers indicate faster task completion. We do not count sending a message as an action -- although if an agent sends a message, it cannot perform another action in the same round.
The number of actions is the total number of actions taken by both agents.*)
```

## Emergent Pro-Social Behaviors

We also found that Trace-optimized agents develop pro-social behaviors under the optimization procedure.
The agents learn to coordinate with each other to achieve the shared goal, but choose not to communicate when they need to be more efficient.
Although there are many caveats to this toy experiment, it shows that the emergence of such behaviors through optimization can be achieved via Trace.
@@ -246,13 +253,33 @@ We show that this pro-social behavior does not happen across all tasks. For exam
This can be attributed to many reasons, but we will stop our investigation here.
When we optimize our agents through Trace, the emergent behaviors change according to the task. This is very different from explicitly requiring the agents to communicate with each other.

```{image} ../images/virtualhomes_messages.png
```{image} ../images/virtualhome/virtualhomes_messages.png
---
alt: messages
align: center
---
```

## Recording of Agent Behavior

We show three videos of how Trace-optimized agents accomplish Task 2 (Put Dishwasher). We present the top-down bird's-eye view, along with what each agent sees from its own perspective.

``````{grid}
:gutter: 0
````{grid-item}
```{figure} ../images/virtualhome/task2.gif
```
````
````{grid-item}
```{figure} ../images/virtualhome/agent1_task2.gif
```
````
````{grid-item}
```{figure} ../images/virtualhome/agent2_task2.gif
```
````
``````

## What's Next?

In this tutorial, we showed how to create two agents and have them interact with each other in a multi-agent environment.
