diff --git a/docs/config.rst b/docs/Contributing/config.rst
similarity index 100%
rename from docs/config.rst
rename to docs/Contributing/config.rst
diff --git a/docs/Contributing/design.rst b/docs/Contributing/design.rst
new file mode 100644
index 000000000..c7b1925ad
--- /dev/null
+++ b/docs/Contributing/design.rst
@@ -0,0 +1,49 @@
+Design
+======
+
+Milabench aims to simulate research workloads for benchmarking purposes.
+
+* Performance is measured as throughput (samples per second).
+  For example, for a model like resnet the throughput is measured in images per second.
+
+* Single-GPU workloads are spawned once per GPU to ensure the entire machine is used,
+  simulating something similar to a hyperparameter search.
+  The performance of the benchmark is the sum of the throughput of each process.
+
+* Multi-GPU workloads
+
+* Multi-node workloads
+
+
+Run
+---
+
+* Milabench Manager Process
+   * Handles messages from benchmark processes
+   * Saves messages into a file for future analysis
+
+* Benchmark processes
+   * Run using ``voir``
+   * voir is configured to intercept and send events during the training process
+   * This allows us to add models from git repositories without modification
+   * voir sends data through a file descriptor that was created by the milabench main process
+
+
+What milabench is
+-----------------
+
+* Training focused
+* milabench shows candid performance numbers
+   * No optimization beyond batch size scaling is performed
+   * We want to measure the performance our researchers will see,
+     not the performance they could get.
+* PyTorch-centric
+   * PyTorch has become the de facto library for research
+   * We are looking for accelerators with good maturity that can support
+     this framework with limited code changes.
+
+
+What milabench is not
+---------------------
+
+* milabench's goal is not to be a performance showcase for an accelerator.
diff --git a/docs/dev-usage.rst b/docs/Contributing/dev-usage.rst
similarity index 100%
rename from docs/dev-usage.rst
rename to docs/Contributing/dev-usage.rst
diff --git a/docs/instrument.rst b/docs/Contributing/instrument.rst
similarity index 100%
rename from docs/instrument.rst
rename to docs/Contributing/instrument.rst
diff --git a/docs/new_benchmarks.rst b/docs/Contributing/new_benchmarks.rst
similarity index 99%
rename from docs/new_benchmarks.rst
rename to docs/Contributing/new_benchmarks.rst
index e348a28be..d8d72ff44 100644
--- a/docs/new_benchmarks.rst
+++ b/docs/Contributing/new_benchmarks.rst
@@ -1,6 +1,6 @@
 
-Creating a new benchmark
-------------------------
+Adding a benchmark
+==================
 
 To define a new benchmark (let's assume it is called ``ornatebench``), 
 
diff --git a/docs/flow.rst b/docs/Contributing/overview.rst
similarity index 72%
rename from docs/flow.rst
rename to docs/Contributing/overview.rst
index 45f212c46..7aa441325 100644
--- a/docs/flow.rst
+++ b/docs/Contributing/overview.rst
@@ -1,5 +1,5 @@
-Milabench Overview
-------------------
+Overview
+========
 
 .. code-block:: txt
 
@@ -230,11 +230,88 @@ Execution Flow
       * **run_script**: the script will start to run now
       * **finalize**: tearing down
 
-How do I
---------
 
-* I want to run a benchmark without milabench for debugging purposes
-   * ``milabench dev {benchname}`` will open bash with the benchmark venv sourced
-   * alternatively: ``source $MILABENCH_BASE/venv/torch/bin/activate``
+Execution Plan
+--------------
+
+* milabench main process
+  * gathers metrics from benchmark processes and saves them to a file
+  * manages the benchmarks (timeouts, etc.)
+
+  * if ``per_gpu`` is used, milabench will launch one process per GPU (sets ``CUDA_VISIBLE_DEVICES``)
+    * each process logs its GPU data
+    * might spawn a monitor process
+      * will init pynvml
+    * dataloader will also spawn process workers
+      * usually not using GPU
+
+  * if ``njobs`` is used, milabench will launch a single process (torchrun)
+    * torchrun in turn will spawn one process per GPU
+      * RANK 0 is used for logging
+      * RANK 0 might spawn a monitor process
+        * will init pynvml
+      * dataloader will also spawn process workers 
+        * usually not using GPU
+
+per_gpu
+^^^^^^^
+
+``per_gpu``: used for single-GPU benchmarks; spawns one process per GPU, each running the same benchmark.
+
+.. code-block:: yaml
+
+   _torchvision:
+     inherits: _defaults
+     definition: ../benchmarks/torchvision
+     group: torchvision
+     install_group: torch
+     plan:
+       method: per_gpu
+
+Milabench will essentially execute something akin to the following.
+
+.. code-block:: bash
+
+   echo "---"
+   echo "fp16"
+   echo "===="
+   time (
+     CUDA_VISIBLE_DEVICES=0 $SRC/milabench/benchmarks/flops/activator $BASE/venv/torch $SRC/milabench/benchmarks/flops/main.py --number 30 --repeat 90 --m 8192 --n 8192 --dtype fp16 &
+     CUDA_VISIBLE_DEVICES=1 $SRC/milabench/benchmarks/flops/activator $BASE/venv/torch $SRC/milabench/benchmarks/flops/main.py --number 30 --repeat 90 --m 8192 --n 8192 --dtype fp16 &
+     CUDA_VISIBLE_DEVICES=2 $SRC/milabench/benchmarks/flops/activator $BASE/venv/torch $SRC/milabench/benchmarks/flops/main.py --number 30 --repeat 90 --m 8192 --n 8192 --dtype fp16 &
+     CUDA_VISIBLE_DEVICES=3 $SRC/milabench/benchmarks/flops/activator $BASE/venv/torch $SRC/milabench/benchmarks/flops/main.py --number 30 --repeat 90 --m 8192 --n 8192 --dtype fp16 &
+     CUDA_VISIBLE_DEVICES=4 $SRC/milabench/benchmarks/flops/activator $BASE/venv/torch $SRC/milabench/benchmarks/flops/main.py --number 30 --repeat 90 --m 8192 --n 8192 --dtype fp16 &
+     CUDA_VISIBLE_DEVICES=5 $SRC/milabench/benchmarks/flops/activator $BASE/venv/torch $SRC/milabench/benchmarks/flops/main.py --number 30 --repeat 90 --m 8192 --n 8192 --dtype fp16 &
+     CUDA_VISIBLE_DEVICES=6 $SRC/milabench/benchmarks/flops/activator $BASE/venv/torch $SRC/milabench/benchmarks/flops/main.py --number 30 --repeat 90 --m 8192 --n 8192 --dtype fp16 &
+     CUDA_VISIBLE_DEVICES=7 $SRC/milabench/benchmarks/flops/activator $BASE/venv/torch $SRC/milabench/benchmarks/flops/main.py --number 30 --repeat 90 --m 8192 --n 8192 --dtype fp16 &
+     wait
+   )
+
+njobs
+^^^^^
+
+``njobs``: used to launch a single job that can see all the GPUs.
+
+.. code-block:: yaml
 
+   _torchvision_ddp:
+     inherits: _defaults
+     definition: ../benchmarks/torchvision_ddp
+     group: torchvision
+     install_group: torch
+     plan:
+       method: njobs
+       n: 1
+
+Milabench will essentially execute something akin to the following.
+
+.. code-block:: bash
+
+   echo "---"
+   echo "lightning-gpus"
+   echo "=============="
+   time (
+     $BASE/venv/torch/bin/benchrun --nnodes=1 --rdzv-backend=c10d --rdzv-endpoint=127.0.0.1:29400 --master-addr=127.0.0.1 --master-port=29400 --nproc-per-node=8 --no-python -- python $SRC/milabench/benchmarks/lightning/main.py --epochs 10 --num-workers 8 --loader pytorch --data $BASE/data/FakeImageNet --model resnet152 --batch-size 16 &
+     wait
+   )
 
diff --git a/docs/process.rst b/docs/Contributing/process.rst
similarity index 99%
rename from docs/process.rst
rename to docs/Contributing/process.rst
index 73f83731b..cafc03115 100644
--- a/docs/process.rst
+++ b/docs/Contributing/process.rst
@@ -8,6 +8,7 @@ Preparing
 
    * NVIDIA
    * AMD
+   * Intel
 
 2. Create a milabench configuration for your RFP
    Milabench comes with a wide variety of benchmarks.
diff --git a/docs/recipes.rst b/docs/Contributing/recipes.rst
similarity index 97%
rename from docs/recipes.rst
rename to docs/Contributing/recipes.rst
index f647ab452..786a24bdf 100644
--- a/docs/recipes.rst
+++ b/docs/Contributing/recipes.rst
@@ -1,5 +1,5 @@
-Running Milabench
-=================
+Recipes
+=======
 
 Base Setup
 ----------
@@ -35,11 +35,9 @@ The current setup runs on 8xA100 SXM4 80Go.
 Note that some benchmarks do require more than 40Go of VRAM.
 One bench might be problematic; rwkv which requires nvcc but can be ignored.
 
-Recipes
--------
 
 Increase Runtime
-^^^^^^^^^^^^^^^^
+----------------
 
 For profiling it might be useful to run the benchmark for longer than the default configuration.
 You can update the yaml file (``config/base.yaml`` or ``config/standard.yaml``) to increase the runtime limits.
@@ -57,7 +55,7 @@ and ``voir.options.stop`` which represent the target number of observations mila
                                  # an observation is usually a batch forward/backward/optimizer.step (i.e one train step)
 
 One Env
-^^^^^^^
+-------
 
 If your are using a container with dependencies such as pytorch already installed,
 you can force milabench to use a single environment for everything.
@@ -69,17 +67,17 @@ you can force milabench to use a single environment for everything.
     milabench run --use-current-env --select bert-fp32 
 
 Batch resizer
-^^^^^^^^^^^^^
+-------------
 
 If the GPU you are using has lower VRAM automatic batch resizing could be enabled with the command below.
 Note that will not impact benchmarks that already use a batch of one, such as opt-6_7b and possibly opt-1_3b.
 
 .. code-block:: bash
 
-   MILABENCH_SIZER_AUTO=True milabench run
+   MILABENCH_SIZER_AUTO=1 milabench run
 
 Device Select
-^^^^^^^^^^^^^
+-------------
 
 To run on a subset of GPUs (note that by default milabench will try to use all the GPUs all the time
 which might make a run take a bit longer, reducing the number of visible devices to 2 might make experimentation faster)
@@ -89,7 +87,7 @@ which might make a run take a bit longer, reducing the number of visible devices
    CUDA_VISIBLE_DEVICES=0,1,2,3 milabench run 
 
 Update Package
-^^^^^^^^^^^^^^
+--------------
 
 To update pytorch to use a newer version of cuda (milabench creates a separate environment for benchmarks)
 
@@ -100,7 +98,7 @@ To update pytorch to use a newer version of cuda (milabench creates a separate e
    pip install -U torch torchvision torchaudio
 
 Arguments
-^^^^^^^^^
+---------
 
 If environment variables are troublesome, the values can also be passed as arguments.
 
@@ -118,6 +116,18 @@ It holds all the benchmark specific logs and metrics gathered by milabench.
   zip -r results.zip results
 
 
+Run a benchmark without milabench
+---------------------------------
+
+.. code-block:: bash
+
+   milabench dev {benchname}  # opens bash with the benchmark venv sourced
+
+   # alternatively
+
+   source $MILABENCH_BASE/venv/torch/bin/activate
+
+
 Containers
 ----------
 
@@ -306,6 +316,7 @@ Example Reports
 
 Issues
 ------
+
 .. code-block:: txt
   
     > Traceback (most recent call last):
diff --git a/docs/sizer.rst b/docs/Contributing/sizer.rst
similarity index 100%
rename from docs/sizer.rst
rename to docs/Contributing/sizer.rst
diff --git a/docs/docker.rst b/docs/GettingStarted/docker.rst
similarity index 100%
rename from docs/docker.rst
rename to docs/GettingStarted/docker.rst
diff --git a/docs/usage.rst b/docs/GettingStarted/usage.rst
similarity index 100%
rename from docs/usage.rst
rename to docs/GettingStarted/usage.rst
diff --git a/docs/Welcome/Changelog.rst b/docs/Welcome/Changelog.rst
new file mode 100644
index 000000000..7dc58dfe7
--- /dev/null
+++ b/docs/Welcome/Changelog.rst
@@ -0,0 +1,4 @@
+Changelog
+=========
+
+TBD
\ No newline at end of file
diff --git a/docs/Welcome/Features.rst b/docs/Welcome/Features.rst
new file mode 100644
index 000000000..793bdbe6d
--- /dev/null
+++ b/docs/Welcome/Features.rst
@@ -0,0 +1,55 @@
+Features
+========
+
+* Non-intrusive instrumentation
+* Validation Layers
+* Automatic batch resizing
+* Docker
+* Hardware
+   * ROCm 5.7
+   * NVIDIA
+* Metrics gathering
+   * Performance throughput
+   * GPU util
+   * CPU util
+   * IO util
+
+
+Benchmarks
+----------
+
+.. code-block:: text
+
+    +--------------------------+-----------+-----------+-------------+-----------+-------------------+
+    |        Benchmark         |   Unit    |  Domain   |   Network   |   Focus   |       Task        |
+    +==========================+===========+===========+=============+===========+===================+
+    | bf16                     | TFlops    | Synthetic |             | Training  |                   |
+    | fp16                     | TFlops    | Synthetic |             | Training  |                   |
+    | tf32                     | TFlops    | Synthetic |             | Training  |                   |
+    | fp32                     | TFlops    | Synthetic |             | Training  |                   |
+    | bert-fp16                |           | NLP       | Transformer | Training  | Language Modeling |
+    | bert-fp32                |           | NLP       | Transformer | Training  | Language Modeling |
+    | bert-tf32                |           | NLP       | Transformer | Training  | Language Modeling |
+    | bert-tf32-fp16           |           | NLP       | Transformer | Training  | Language Modeling |
+    | opt-1_3b                 |           | NLP       | Transformer | Training  | Language Modeling |
+    | opt-6_7b                 |           | NLP       | Transformer | Training  | Language Modeling |
+    | reformer                 |           | NLP       | Transformer | Training  | Language Modeling |
+    | rwkv                     |           | NLP       | RNN         | Training  | Language Modeling |
+    | llama                    | Token/sec | NLP       | Transformer | Inference | Generation        |
+    | dlrm                     |           | NLP       |             | Training  | Recommendation    |
+    | convnext_large-fp16      | img/sec   | Vision    | Convolution | Training  | Classification    |
+    | convnext_large-fp32      | img/sec   | Vision    | Convolution | Training  | Classification    |
+    | convnext_large-tf32      | img/sec   | Vision    | Convolution | Training  | Classification    |
+    | convnext_large-tf32-fp16 | img/sec   | Vision    | Convolution | Training  | Classification    |
+    | davit_large              | img/sec   | Vision    | Transformer | Training  | Classification    |
+    | focalnet                 |           | Vision    | Convolution | Training  | Classification    |
+    | davit_large-multi        | img/sec   | Vision    | Transformer | Training  | Classification    |
+    | regnet_y_128gf           | img/sec   | Vision    | Convolution | Training  | Classification    |
+    | resnet152                | img/sec   | Vision    | Convolution | Training  | Classification    |
+    | resnet152-multi          | img/sec   | Vision    | Convolution | Training  | Classification    |
+    | resnet50                 | img/sec   | Vision    | Convolution | Training  | Classification    |
+    | stargan                  | img/sec   | Vision    | Convolution | Training  | GAN               |
+    | super-slomo              | img/sec   | Vision    | Convolution | Training  |                   |
+    | t5                       |           | NLP       | Transformer | Training  |                   |
+    | whisper                  |           | Audio     |             | Training  |                   |
+    +--------------------------+-----------+-----------+-------------+-----------+-------------------+
\ No newline at end of file
diff --git a/docs/Welcome/Roadmap.rst b/docs/Welcome/Roadmap.rst
new file mode 100644
index 000000000..bfc05518f
--- /dev/null
+++ b/docs/Welcome/Roadmap.rst
@@ -0,0 +1,10 @@
+Roadmap
+=======
+
+* Cloud CI
+* ROCm 6.0 - MI300 support
+* GPU Max Series - 1550 support
+* Evaluate suitability
+   * Tenstorrent
+   * Graphcore
+   * Cerebras
\ No newline at end of file
diff --git a/docs/execution_modes.rst b/docs/execution_modes.rst
deleted file mode 100644
index 8d40fc44d..000000000
--- a/docs/execution_modes.rst
+++ /dev/null
@@ -1,93 +0,0 @@
-Milabench processes overview
-============================
-
-* milabench main process
-  * gather metrics from benchmark processes, save them to file
-  * manages the benchmarks (timeout etc...)
-
-  * if ``per_gpu`` is used, milabench will launch one process per GPU (sets ``CUDA_VISIBLE_DEVCES``)
-    * each processes log their GPU data
-    * might spawn a monitor process
-      * will init pynvml
-    * dataloader will also spawn process workers
-      * usually not using GPU
-
-  * if ``njobs`` is used, milabench will launch a single process (torchrun)
-    * torchrun in turn will spawn one process per GPU
-      * RANK 0 is used for logging
-      * RANK 0 might spawn a monitor process
-        * will init pynvml
-      * dataloader will also spawn process workers 
-        * usually not using GPU
-
-Plan
-----
-
-per_gpu
-+++++++
-
-``per_gpu``: used for mono gpu benchmarks, spawn one process per gpu and run the same benchmark
-
-.. code-block:: yaml
-
-   _torchvision:
-     inherits: _defaults
-     definition: ../benchmarks/torchvision
-     group: torchvision
-     install_group: torch
-     plan:
-       method: per_gpu
-
-Milabench will essentially execute something akin to below. 
-
-.. code-block:: bash
-
-   echo "---"
-   echo "fp16"
-   echo "===="
-   time (
-     CUDA_VISIBLE_DEVICES=0 $SRC/milabench/benchmarks/flops/activator $BASE/venv/torch $SRC/milabench/benchmarks/flops/main.py --number 30 --repeat 90 --m 8192 --n 8192 --dtype fp16 &
-     CUDA_VISIBLE_DEVICES=1 $SRC/milabench/benchmarks/flops/activator $BASE/venv/torch $SRC/milabench/benchmarks/flops/main.py --number 30 --repeat 90 --m 8192 --n 8192 --dtype fp16 &
-     CUDA_VISIBLE_DEVICES=2 $SRC/milabench/benchmarks/flops/activator $BASE/venv/torch $SRC/milabench/benchmarks/flops/main.py --number 30 --repeat 90 --m 8192 --n 8192 --dtype fp16 &
-     CUDA_VISIBLE_DEVICES=3 $SRC/milabench/benchmarks/flops/activator $BASE/venv/torch $SRC/milabench/benchmarks/flops/main.py --number 30 --repeat 90 --m 8192 --n 8192 --dtype fp16 &
-     CUDA_VISIBLE_DEVICES=4 $SRC/milabench/benchmarks/flops/activator $BASE/venv/torch $SRC/milabench/benchmarks/flops/main.py --number 30 --repeat 90 --m 8192 --n 8192 --dtype fp16 &
-     CUDA_VISIBLE_DEVICES=5 $SRC/milabench/benchmarks/flops/activator $BASE/venv/torch $SRC/milabench/benchmarks/flops/main.py --number 30 --repeat 90 --m 8192 --n 8192 --dtype fp16 &
-     CUDA_VISIBLE_DEVICES=6 $SRC/milabench/benchmarks/flops/activator $BASE/venv/torch $SRC/milabench/benchmarks/flops/main.py --number 30 --repeat 90 --m 8192 --n 8192 --dtype fp16 &
-     CUDA_VISIBLE_DEVICES=7 $SRC/milabench/benchmarks/flops/activator $BASE/venv/torch $SRC/milabench/benchmarks/flops/main.py --number 30 --repeat 90 --m 8192 --n 8192 --dtype fp16 &
-     wait
-   )
-
-njobs
-+++++
-
-``njobs`` used to launch a single jobs that can see all the gpus.
-
-.. code-block:: yaml
-
-   _torchvision_ddp:
-     inherits: _defaults
-     definition: ../benchmarks/torchvision_ddp
-     group: torchvision
-     install_group: torch
-     plan:
-       method: njobs
-       n: 1
-
-Milabench will essentially execute something akin to below.
-
-.. code-block:: bash
-
-   echo "---"
-   echo "lightning-gpus"
-   echo "=============="
-   time (
-     $BASE/venv/torch/bin/benchrun --nnodes=1 --rdzv-backend=c10d --rdzv-endpoint=127.0.0.1:29400 --master-addr=127.0.0.1 --master-port=29400 --nproc-per-node=8 --no-python -- python $SRC/milabench/benchmarks/lightning/main.py --epochs 10 --num-workers 8 --loader pytorch --data $BASE/data/FakeImageNet --model resnet152 --batch-size 16 &
-     wait
-   )
-
-
-
-
-
-
-
diff --git a/docs/index.rst b/docs/index.rst
index ebbb27383..b4decf4d7 100644
--- a/docs/index.rst
+++ b/docs/index.rst
@@ -2,18 +2,41 @@
 Welcome to milabench's documentation!
 =====================================
 
+
+.. toctree::
+   :caption: News
+   :maxdepth: 1
+
+   Welcome/Features
+   Welcome/Roadmap
+   Welcome/Changelog
+
+
 .. toctree::
    :maxdepth: 2
-   :caption: Contents:
+   :caption: Getting Started
+
+   GettingStarted/usage.rst
+   GettingStarted/docker.rst
+   
 
-   usage.rst
-   recipes.rst
-   new_benchmarks.rst
+.. toctree::
+   :caption: Contributing
+   :maxdepth: 1
+
+   Contributing/overview
+   Contributing/new_benchmarks
+   Contributing/sizer
+   Contributing/dev-usage
+   Contributing/design
+   Contributing/recipes
+
+
+.. toctree::
+   :caption: API
+   :maxdepth: 1
 
-   docker.rst
-   dev-usage.rst
-   reference.rst
-   sizer.rst
+   ref-pack.rst
 
 
 Indices and tables