Merge branch 'main' into recv-timeouts

MsRandom authored Dec 10, 2024
2 parents 44abfd0 + 4ea69ea commit 7478f6e
Showing 18 changed files with 3,316 additions and 241 deletions.
190 changes: 28 additions & 162 deletions README.md
@@ -9,24 +9,10 @@
#### *In the annals of the digital age, a grand saga unfolds. In a realm where the forces of artificial intelligence are harnessed by a select few, the question arises: shall this power remain concentrated, or shall it be distributed for the benefit of all humankind?*

[![License](https://img.shields.io/github/license/womboai/edge-maxxing)](https://github.com/womboai/edge-maxxing/blob/main/LICENSE)
![netuid](https://img.shields.io/badge/netuid-39-blue)

</div>

# Table of Contents

- [About WOMBO](#about-wombo)
- [About w.ai](#about-wai)
- [Intro to EdgeMaxxing Subnet](#edgemaxxing-subnet)
- [Miners and Validators Functionality](#miners-and-validators)
- [Incentive Mechanism and Reward Structure](#incentive-mechanism-and-reward-structure)
- [Miners](#miners)
- [Validators](#validators)
- [Get Started with Mining or Validating](#running-miners-and-validators)
- [Running a miner](#running-a-miner)
- [Running a validator](#validator-setup)
- [Proposals for Optimizations](#proposals-for-optimizations)
- [Roadmap](#roadmap)

## About WOMBO
WOMBO is one of the world’s leading consumer AI companies and was an early believer in generative AI.

@@ -44,7 +30,7 @@ w.ai envisions a future where artificial intelligence is decentralized, democrat
## EdgeMaxxing Subnet

### What is the goal?
The EdgeMaxxing subnet aims to create the world's most optimized AI models for consumer devices, starting with Stable Diffusion XL on the NVIDIA GeForce RTX 4090.
The EdgeMaxxing subnet aims to create the world's most optimized AI models for consumer devices. We've already optimized Stable Diffusion XL on the NVIDIA GeForce RTX 4090 and are now working on Flux Schnell for the same hardware.

The subnet will expand to support optimization for various end devices, models, and modalities over time.

@@ -53,170 +39,50 @@ Optimizing AI models is crucial to realizing a vision of decentralized AI.
- **Accessibility:** Enabling these advanced models to run on consumer devices, from smartphones to laptops, bringing AI capabilities to everyone.
- **Decentralization:** Allowing millions of users to contribute their computing power, rather than relying on a small number of powerful miners, creating a truly distributed AI network.

By optimizing popular models like LLAMA3 and Stable Diffusion, we transform idle computing resources into valuable contributors to a global AI network. This democratizes both AI usage and creation, offering earning opportunities to millions.

### Current Subnet Focus
- **Current GPU:** NVIDIA GeForce RTX 4090
- **Current Model:** `stablediffusionapi/newdream-sdxl-20`
- **Netuid:** 39
By optimizing popular models like Flux and Stable Diffusion, we transform idle computing resources into valuable contributors to a global AI network. This democratizes both AI usage and creation, offering earning opportunities to millions.

## Miners and Validators

### Incentive Mechanism and Reward Structure

The EdgeMaxxing subnet defines specific models, pipelines, and target hardware for optimization. Miners and validators collaborate in a daily competition to improve AI model performance on consumer devices.

Miners are rewarded based on how optimized their submissions are relative to other miners and the baseline . Every day at 12 PM PST a contest is run.
Miners are rewarded based on how optimized their submissions are relative to other miners and the baseline. Every day at 12 PM PST a contest is run.

Validators receive rewards for their consistent operation and accurate scoring.

![WOMBO Cover](https://content.wombo.ai/bittensor/sn-explain.png "WOMBO AI")


### Competition Structure
1. Miners submit optimized models
2. Validators score submissions
3. Contest runs daily at 12 PM PST
4. Miners receive rewards based on their ranking

### Miners
- Actively submit optimized checkpoints of the specified model or pipeline. No need for continuous operation; can wait for results after submission
- Use custom algorithms or tools to enhance model performance
- Aim to produce the most generically optimized version of the model

### Validators
- Must run on the specified target hardware (e.g., NVIDIA GeForce RTX 4090, M2 MacBook)
- Collect all miner submissions daily
- Benchmark each submission against the baseline checkpoint
- Score models based on:
- Speed improvements
- Accuracy maintenance
- Overall efficiency gains
- Select the best-performing model as the daily winner

## Running Miners and Validators

To start working with a registered hotkey, clone the repository and install uv:
```bash
# uv
if [ "$USER" = "root" ]; then
apt install pipx
else
sudo apt install pipx
fi

pipx ensurepath
pipx install uv

# Repository
git clone https://github.com/womboai/edge-maxxing
cd edge-maxxing
```

There is no need to manage venvs in any way, as uv will handle that.

### Miner setup
1. Clone the [base inference repository](https://github.com/womboai/sdxl-newdream-20-inference)
```bash
git clone --depth 1 https://github.com/womboai/sdxl-newdream-20-inference model
```
2. Create your own repository to optimize in on a git provider such as `GitHub` or `HuggingFace`
3. Edit the `src/pipeline.py` file to include any loading or inference optimizations, and commit when finished (see the sketch after this list)
4. After creating and optimizing your repository, submit the model, changing the options as necessary
```bash
cd miner
uv run submit_model \
--netuid {netuid} \
--subtensor.network finney \
--wallet.name {wallet} \
--wallet.hotkey {hotkey}
```
5. Follow the interactive prompts to submit the repository link, revision, and contest to participate in
6. Optionally, benchmark your submission locally before submitting (make sure you have the right hardware, e.g. an NVIDIA GeForce RTX 4090). uv and the Hugging Face CLI are required for benchmarking:
```bash
pipx ensurepath
pipx install uv
pipx install "huggingface-hub[cli,hf_transfer]"
```
7. Validators will collect your submission at 12 PM New York time and test it for the remainder of the day. Updated weights are set at the beginning of the next contest.
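
For step 3, the sketch below shows the kind of change a miner might make. It is a hypothetical illustration only: it assumes a diffusers-style SDXL pipeline, and the function name and structure are ours, not the base repository's (which defines the real `src/pipeline.py` interface).

```python
# Hypothetical sketch only: the base inference repository defines the real
# src/pipeline.py interface; these are common, generic optimizations.
import torch
from diffusers import StableDiffusionXLPipeline


def load_pipeline() -> StableDiffusionXLPipeline:  # illustrative name
    pipeline = StableDiffusionXLPipeline.from_pretrained(
        "stablediffusionapi/newdream-sdxl-20",
        torch_dtype=torch.float16,  # halves memory traffic vs. float32
    ).to("cuda")

    # Channels-last memory format tends to speed up convolution-heavy
    # UNets on NVIDIA GPUs.
    pipeline.unet.to(memory_format=torch.channels_last)

    # Compile the UNet, the hot loop of diffusion inference; the first
    # call pays a compilation cost, later calls are faster.
    pipeline.unet = torch.compile(pipeline.unet, mode="max-autotune")

    return pipeline
```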

### Validator setup
The validator setup requires two components: an API container and a scoring validator.

### Dedicated Hardware
If your hardware is not itself accessed through a container (that is, you can use Docker), the easiest way to set up the components is Docker Compose.

To get started, go to the `validator` directory and create a `.env` file with the following contents:
```
VALIDATOR_ARGS=--netuid {netuid} --subtensor.network {network} --wallet.name {wallet} --wallet.hotkey {hotkey}
VALIDATOR_HOTKEY_SS58_ADDRESS={ss58-address}
```

Generate the compose file for the GPUs you have by editing `compose-gpu-layout.json` to include all CUDA device IDs and then running:
```bash
python3 ./generate_compose.py
```
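
For example, with two GPUs the layout file might contain just `[0, 1]`, assuming `compose-gpu-layout.json` is a plain JSON array of CUDA device IDs.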

Then start Docker Compose:
```bash
docker compose up -d --build
```
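
You can then follow the logs with `docker compose logs -f` to confirm that both components start.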

### RunPod/Containers
If you are running in a containerized environment like RunPod (which does not support Docker), you need to run two pods/containers. The following setup assumes PM2.

#### API Component
In one pod/container with a GPU, set up the API component by cloning the repository:

```bash
git clone https://github.com/womboai/edge-maxxing /api
cd /api/validator
```

And then run as follows:
```bash
export CUDA_VISIBLE_DEVICES=0
export VALIDATOR_HOTKEY_SS58_ADDRESS={ss58-address}

pm2 start ./submission_tester/start.sh --name edge-maxxing-submission-tester --interpreter /bin/bash -- \
--host 0.0.0.0 \
--port 8000 \
submission_tester:app
```
Make sure port 8000 (or whichever port you set) is exposed!

The argument at the end is the name of the main PM2 process. The start script will keep your validator instance up to date for as long as it is running.

You can run more API instances (and are recommended to do so) and link the scoring validator to them, setting the CUDA devices and ports each instance uses.

#### Scoring Validator
In another pod/container, without a GPU, run the scoring validator: clone the repository as in the common instructions above, then run:
```bash
cd validator
pm2 start ./weight_setting/start.sh --name edge-maxxing-validator --interpreter /bin/bash -- \
--netuid {netuid} \
--subtensor.network {network} \
--wallet.name {wallet} \
--wallet.hotkey {hotkey} \
--benchmarker_api {API component routes, space separated if multiple}
```
Make sure to replace the API component routes with the routes to your API containers (typically in the format `http://ip:port`); refer to the instructions above at [API Component](#api-component).

## Proposals for Optimizations
There are several effective techniques to explore when optimizing machine learning models for edge devices. Here are some key approaches to consider:
1. **Knowledge Distillation**: Train a smaller, more efficient model to mimic a larger, more complex one. This technique is particularly useful for deploying models on devices with limited computational resources.
2. **Quantization**: Reduce the precision of the model's weights and activations, typically from 32-bit floating-point to 8-bit integers. This decreases memory usage and computational requirements, making it possible to run models on edge devices. Additionally, exploring low-precision representation for weights (e.g., using 8-bit integers) can reduce memory bandwidth usage for memory-bound models, even if the actual compute is done in higher precision (e.g., 32-bit).
3. **TensorRT and Hardware-Specific Optimizations**: Utilize NVIDIA's TensorRT to optimize deep learning models for inference on NVIDIA GPUs. This involves more than just layer fusion; it includes optimizing assembly, identifying prefetch opportunities, optimizing L2 memory allocation, writing specialized kernels, and performing graph optimizations. These techniques enhance performance and reduce latency by tailoring the model to the specific hardware configuration.
4. **Hyperparameter Tuning**: Optimize the configuration settings of the model to improve its performance. This can be done manually or through automated methods such as grid search or Bayesian optimization. While not a direct edge optimization, it is an essential step in the overall process of model optimization.

We encourage developers to explore these optimization techniques, or to develop new approaches, to enhance model performance and efficiency on edge devices; a small quantization sketch follows below.
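
As a self-contained illustration of technique 2, here is post-training dynamic quantization in PyTorch on a toy model. The contest pipelines are diffusion models and this API targets CPU inference, so treat it purely as a demonstration of the idea, not a drop-in optimization for a submission:

```python
# Toy demonstration of dynamic quantization; not specific to this subnet.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(512, 1024),
    nn.ReLU(),
    nn.Linear(1024, 512),
)

# Replace Linear weights with int8; activations are quantized on the fly
# at inference time, reducing memory bandwidth for memory-bound layers.
quantized = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

with torch.inference_mode():
    out = quantized(torch.randn(1, 512))

print(out.shape)  # torch.Size([1, 512])
```

Note that PyTorch's dynamic quantization targets CPU inference; on GPUs, low-precision inference typically goes through float16/bfloat16 or an engine like TensorRT instead.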
## Active Contests

- ### Flux Schnell
- Baseline: https://github.com/womboai/flux-schnell-edge-inference
- Hardware: `NVIDIA GeForce RTX 4090`
- Focus:
- Generation time: `14%` of final score
- Similarity to baseline: `43%` of final score
- VRAM Usage: `43%` of final score

## Getting Started

- ### [Miners](miner/README.md#setup)

- ### [Validators](validator/README.md#setup)

## Dashboard
- https://huggingface.co/spaces/wombo/edge-maxxing-dashboard
- Contains the following:
- Leaderboard with all submitted models and their scores + metrics
- A list of active validators and their benchmarking states
- The weights set by every validator
- A list of miner submissions
- An interactive model demo, showcasing the results of the best models compared to the baseline

## Roadmap
Our mission is to create the world's most optimized AI models for edge devices, democratizing access to powerful AI capabilities. Here's our path forward:
48 changes: 33 additions & 15 deletions base/base/contest.py
@@ -1,7 +1,8 @@
import math
from dataclasses import dataclass
from enum import IntEnum
from functools import partial
from math import sqrt
from math import sqrt, prod, log
from typing import Callable

from pydantic import BaseModel
@@ -19,12 +20,12 @@ class BenchmarkState(IntEnum):


class MetricType(IntEnum):
    SIMILARITY_SCORE = 0
    GENERATION_TIME = 1
    SIZE = 2
    VRAM_USED = 3
    WATTS_USED = 4
    LOAD_TIME = 5
    RAM_USED = 6


class ContestId(IntEnum):
@@ -52,6 +53,7 @@ class Metrics(BaseModel):
    vram_used: float
    watts_used: float
    load_time: float
    ram_used: float


class Benchmark(BaseModel):
@@ -85,26 +87,42 @@ def calculate_score(self, baseline: Metrics, benchmark: Benchmark) -> float:

        from .inputs_api import get_inputs_state
        metric_weights = get_inputs_state().get_metric_weights(self.id)
        total_weight = sum(metric_weights.values())

        scale = 1 / (1 - SIMILARITY_SCORE_THRESHOLD)
        similarity = sqrt((benchmark.average_similarity - SIMILARITY_SCORE_THRESHOLD) * scale)
        similarity_scale = 1 / (1 - SIMILARITY_SCORE_THRESHOLD)
        similarity = sqrt((benchmark.average_similarity - SIMILARITY_SCORE_THRESHOLD) * similarity_scale)

        def normalize(baseline_value: float, benchmark_value: float, metric_type: MetricType) -> float:
        baseline_score = len(metric_weights)
        highest_score = prod(w + 1 for w in metric_weights.values())

        ratio = highest_score / baseline_score

        def calculate_improvement(baseline_value: float, benchmark_value: float, metric_type: MetricType) -> float:
            if baseline_value == 0:
                return 0
            relative_improvement = (baseline_value - benchmark_value) / baseline_value
            return (relative_improvement * metric_weights.get(metric_type, 0)) / total_weight

        score = sum([
            normalize(baseline.generation_time, benchmark.metrics.generation_time, MetricType.GENERATION_TIME),
            normalize(baseline.size, benchmark.metrics.size, MetricType.SIZE),
            normalize(baseline.vram_used, benchmark.metrics.vram_used, MetricType.VRAM_USED),
            normalize(baseline.watts_used, benchmark.metrics.watts_used, MetricType.WATTS_USED),
            normalize(baseline.load_time, benchmark.metrics.load_time, MetricType.LOAD_TIME)
            return relative_improvement * metric_weights.get(metric_type, 0) + 1

        score = prod([
            calculate_improvement(baseline.generation_time, benchmark.metrics.generation_time, MetricType.GENERATION_TIME),
            calculate_improvement(baseline.size, benchmark.metrics.size, MetricType.SIZE),
            calculate_improvement(baseline.vram_used, benchmark.metrics.vram_used, MetricType.VRAM_USED),
            calculate_improvement(baseline.watts_used, benchmark.metrics.watts_used, MetricType.WATTS_USED),
            calculate_improvement(baseline.load_time, benchmark.metrics.load_time, MetricType.LOAD_TIME),
            calculate_improvement(baseline.ram_used, benchmark.metrics.ram_used, MetricType.RAM_USED),
        ])

        return score * similarity * metric_weights.get(MetricType.SIMILARITY_SCORE, 0) / total_weight
        n = (ratio + sqrt(ratio ** 2 - ratio * 4 + 4)) / 2

        score_factor = (n - 1) / baseline_score

        normalized_score = log(score_factor * score + 1, n) - 1

        if normalized_score < 0:
            normalized_score = normalized_score / similarity
        else:
            normalized_score = normalized_score * similarity

        return max(-1.0, min(1.0, normalized_score))


CUDA_4090_DEVICE = CudaDevice(gpu=Gpu.NVIDIA_RTX_4090)
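A quick way to read the new normalization in `calculate_score`: with `n` defined as above, a raw product score equal to `baseline_score` maps to 0 and one equal to `highest_score` maps to 1 (for `ratio >= 2`, `n` simplifies to `ratio - 1`). The standalone check below uses arbitrary example values; it is our reading of the commit, not code from the repository.

```python
# Standalone sanity check of the log normalization (arbitrary example values).
from math import sqrt, log

baseline_score = 6.0                      # e.g. len(metric_weights)
ratio = 4.0                               # highest_score / baseline_score
highest_score = ratio * baseline_score

n = (ratio + sqrt(ratio ** 2 - ratio * 4 + 4)) / 2  # = ratio - 1 when ratio >= 2
score_factor = (n - 1) / baseline_score

def normalize(score: float) -> float:
    return log(score_factor * score + 1, n) - 1

print(normalize(baseline_score))  # 0.0: matching the baseline scores zero
print(normalize(highest_score))   # 1.0: the best possible product scores one
```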
6 changes: 4 additions & 2 deletions base/base/device.py
@@ -1,3 +1,4 @@
import gc
from abc import ABC, abstractmethod
from enum import Enum

@@ -60,8 +61,9 @@ def get_joules(self):
    def empty_cache(self):
        import torch

        torch.cuda.synchronize()
        gc.collect()
        torch.cuda.empty_cache()
        torch.cuda.reset_peak_memory_stats()

    def is_compatible(self):
        import torch
@@ -86,7 +88,7 @@ def get_joules(self):
    def empty_cache(self):
        import torch

        torch.mps.synchronize()
        gc.collect()
        torch.mps.empty_cache()

    def is_compatible(self):
2 changes: 0 additions & 2 deletions base/base/output_comparator.py
@@ -1,4 +1,3 @@
import gc
from abc import ABC, abstractmethod
from io import BytesIO
from typing import ContextManager
@@ -76,5 +75,4 @@ def __exit__(self, exc_type, exc_value, traceback):
        del self.clip
        del self.processor

        gc.collect()
        self.device.empty_cache()
