Merge pull request #1 from valohai/pre-commit

Configure & apply linting with pre-commit
valohai · Jan 22, 2024 · commit 914b7d6
2 parents bb37ae9 + 9b6fa64
Showing 11 changed files with 316 additions and 183 deletions.
diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml
@@ -0,0 +1,19 @@
+name: CI
+
+on:
+  push:
+    branches:
+      - main
+  pull_request:
+
+jobs:
+  Lint:
+    runs-on: ubuntu-22.04
+    steps:
+      - uses: actions/checkout@v4
+      - uses: actions/setup-python@v4
+        with:
+          python-version: "3.12"
+      - uses: pre-commit/[email protected]
+        env:
+          RUFF_OUTPUT_FORMAT: github
diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml
@@ -0,0 +1,13 @@
+repos:
+  - repo: https://github.com/astral-sh/ruff-pre-commit
+    rev: v0.1.14
+    hooks:
+      - id: ruff
+        args:
+          - --fix
+          - --preview
+      - id: ruff-format
+  - repo: https://github.com/pre-commit/mirrors-prettier
+    rev: v3.1.0
+    hooks:
+      - id: prettier
diff --git a/README.md b/README.md
@@ -1,95 +1,105 @@
 # Drift Detection using WhyLabs
 
-This repository is made to show one of the ways how to detect the data drift when using [Valohai](https://app.valohai.com/). 
+This repository is made to show one of the ways how to detect the data drift when using [Valohai](https://app.valohai.com/).
 Here, we use [WhyLabs](https://whylabs.ai/) to generate drift reports for the input image data. We also show how to automatically trigger retraining of the model, how to require human approval for a pipeline step, and how to use Valohai actions.
 
 ## Drift is..
-Drift in machine learning refers to a change over time in input data or the relationship between input and output data, impacting model performance. Drift can lead to reduced model accuracy as the model becomes less effective over time, necessitating regular updates.
 
+Drift in machine learning refers to a change over time in input data or the relationship between input and output data, impacting model performance. Drift can lead to reduced model accuracy as the model becomes less effective over time, necessitating regular updates.
 
 ### Training pipeline
+
 Basic pipeline that preprocesses the data, trains the model and evaluates the results.
 
 Consists of three steps:
+
 - Data Preprocessing:
   1. Load compressed data from S3 bucket
   2. Preprocess images
   3. Save to Valohai datasets
 - YOLO finetuning:
-  1. Create yaml file with data path - readable to YOLO 
+  1. Create yaml file with data path - readable to YOLO
   2. Train yolo using library `ultralytics`
   3. Save best model with Valohai alias.
 - Evaluation:
   1. Load and evaluate the model using `ultralytics`
   2. Save the results to Valohai.
 
 Training Pipeline view in Valohai:
+
 <p align="center">
 <img src="./screenshots/train_pipeline.jpg" width="750" alt="Training Pipeline" />
 </p>
 
-
 ### Drift detection pipeline
+
 Does the inference of the fine-tuned model and detects the data drift using [WhyLabs](https://whylabs.ai/).
 
 Consists of two steps:
+
 - Drift detection:
   1. Load the data and the model
-  2. Inference the data 
-  3. Log the data to WhyLabs 
+  2. Inference the data
+  3. Log the data to WhyLabs
   4. Create inference and reference (from train data) profiles
-  5. Generate summary drift report with WhyLabs in html (`summary_drift_report.html`) 
-          <p align="center"><img src="./screenshots/summary_report.jpg" width="550" alt="Summary Drift Report" /></p>
-  _Note: We set a threshold on the number of image characteristics showing drift in WhyLabs. Once this threshold is reached, we initiate the training pipeline._
+  5. Generate summary drift report with WhyLabs in html (`summary_drift_report.html`)
+     <p align="center"><img src="./screenshots/summary_report.jpg" width="550" alt="Summary Drift Report" /></p>
+     _Note: We set a threshold on the number of image characteristics showing drift in WhyLabs. Once this threshold is reached, we initiate the training pipeline._
   6. If drift is detected, change Status detail. <p align="center"><img src="./screenshots/status_detail.jpg" width="550" alt="Status Detail updated" /></p>
   7. if drift is not detected, then the pipeline is stopped ([Valohai actions docs](https://docs.valohai.com/hc/en-us/articles/18704272477841-Conditions), see `valohai.yaml` -> `drift-detection-pipeline`)
 - Call retrain
   1. Only if on the previous step the drift was detected, the node starts.
   2. When the node is starting it will require human approval ([Valohai actions docs](https://docs.valohai.com/hc/en-us/articles/18704272477841-Conditions)) <p align="center"><img src="./screenshots/human_approval.jpg" width="550" alt="Human Approval" /></p>
      - You can set up notification when the pipeline requires human approval by going: project `Settings -> Notifications -> pipeline node approval required`.
   3. If approved, API call to start the `Training Pipeline` - retrain the model because the drift was detected.
-
 
 Drift Detection Pipeline view in Valohai:
+
 <p align="center">
 <img src="./screenshots/drift_pipeline.jpg" width="350" alt="Drift Detection Pipeline" />
 </p>
 
 Overall flow of the project:
+
 <p align="center">
 <img src="./screenshots/flow_chart.jpg" width="750" alt="Overall Flow" />
 </p>
 
-
 ## Running on Valohai
+
 ### Configure the repository:
 
 To run your code on Valohai using the terminal, follow these steps:
 
 1. Install Valohai on your machine by running the following command:
+
 ```bash
 pip install valohai-cli valohai-utils
 ```
 
 2. Log in to Valohai from the terminal using the command:
+
 ```bash
 vh login
 ```
 
-3. Create a project for your Valohai workflow. 
+3. Create a project for your Valohai workflow.
 
 Start by creating a directory for your project:
+
 ```bash
 mkdir valohai-drift-example
 cd valohai-drift-example
 ```
 
 Then, create the Valohai project:
+
 ```bash
 vh project create
 ```
 
 4. Clone the repository to your local machine:
+
 ```bash
 git clone https://github.com/valohai/drift-example.git .
 ```
@@ -99,36 +109,44 @@ Congratulations! You have successfully cloned the repository, and you can now mo
 ### Running Executions:
 
 To run individual steps, execute the following command:
+
 ```bash
 vh execution run <step-name> --adhoc
 ```
 
 For example, to run the prepare_data step, use the command:
+
 ```bash
 vh execution run prepare_data --adhoc
 ```
 
 ### Running Pipelines:
 
 To run pipelines, use the following command:
+
 ```bash
 vh pipeline run <pipeline-name> --adhoc
 ```
 
 For example, to run the train-val-pipeline pipeline, use the command:
+
 ```bash
 vh pipeline run train-val-pipeline --adhoc
 ```
 
 ## FAQ
-### 1. Working with secrets. 
+
+### 1. Working with secrets.
+
 In this project you need to use private tokens in two places: to use WhyLabs and to use Valohai API in `call-retrain.py`.
 
-Note that you should never include the token in your version control. Instead of pasting it directly into your code, we recommend storing it as a secret environment variable. 
+Note that you should never include the token in your version control. Instead of pasting it directly into your code, we recommend storing it as a secret environment variable.
 
 You can add environment variables in a couple of ways in Valohai.
+
 - Add the environment variable when creating an execution from the UI (Create Execution -> Environment Variables). The env variable is only available in the execution where it was created.
 - Add the project environment variable (Project Settings -> "Environment Variables" tab -> Check "Secret" checkbox). In this case, the env variable will be available for all executions of the project.
 
 ### 2. Other monitoring tools.
-WhyLabs is presented here as one of the options to detect the data drift for the image data. Valohai does not have limitations for any other monitoring tools like EvidentlyAI, Fiddler, Censius, NeptuneAI etc.  
+
+WhyLabs is presented here as one of the options to detect the data drift for the image data. Valohai does not have limitations for any other monitoring tools like EvidentlyAI, Fiddler, Censius, NeptuneAI etc.
diff --git a/call-retrain.py b/call-retrain.py
@@ -15,16 +15,16 @@
                 "source_type": "parameter",
                 "target_node": "training",
                 "target_type": "parameter",
-                "target_key": "image_size"
+                "target_key": "image_size",
             },
             {
                 "source_node": "training",
                 "source_key": "*best.pt",
                 "source_type": "output",
                 "target_node": "evaluation",
                 "target_type": "input",
-                "target_key": "model"
-            }
+                "target_key": "model",
+            },
         ],
         "nodes": [
             {
@@ -38,24 +38,24 @@
                     "command": "pip install valohai-utils\npython preprocess.py {parameters}",
                     "inputs": {
                         "dataset": [
-                            "s3://valohai-demo-library-data/drift-detection/ships-aerial-images.zip"
-                        ]
+                            "s3://valohai-demo-library-data/drift-detection/ships-aerial-images.zip",
+                        ],
                     },
                     "parameters": {
                         "train_size": 15,
                         "valid_size": 10,
                         "test_size": 10,
-                        "image_size": 768
+                        "image_size": 768,
                     },
                     "runtime_config": {},
                     "runtime_config_preset": "",
                     "inherit_environment_variables": True,
                     "environment_variable_groups": [],
                     "tags": [],
                     "time_limit": 0,
-                    "environment_variables": {}
+                    "environment_variables": {},
                 },
-                "on_error": "stop-all"
+                "on_error": "stop-all",
             },
             {
                 "name": "training",
@@ -67,15 +67,9 @@
                     "image": "ultralytics/yolov5",
                     "command": "pip install valohai-utils\npython train.py {parameters}",
                     "inputs": {
-                        "train": [
-                            "dataset://drift-demo-ships-aerial/dev_train"
-                        ],
-                        "test": [
-                            "dataset://drift-demo-ships-aerial/dev_test"
-                        ],
-                        "valid": [
-                            "dataset://drift-demo-ships-aerial/dev_valid"
-                        ]
+                        "train": ["dataset://drift-demo-ships-aerial/dev_train"],
+                        "test": ["dataset://drift-demo-ships-aerial/dev_test"],
+                        "valid": ["dataset://drift-demo-ships-aerial/dev_valid"],
                     },
                     "parameters": {
                         "yolo_model_name": "yolov8x.pt",
@@ -84,17 +78,17 @@
                         "image_size": 768,
                         "optimizer": "SGD",
                         "seed": 42,
-                        "project": "/valohai/outputs/"
+                        "project": "/valohai/outputs/",
                     },
                     "runtime_config": {},
                     "runtime_config_preset": "",
                     "inherit_environment_variables": True,
                     "environment_variable_groups": [],
                     "tags": [],
                     "time_limit": 0,
-                    "environment_variables": {}
+                    "environment_variables": {},
                 },
-                "on_error": "stop-all"
+                "on_error": "stop-all",
             },
             {
                 "name": "evaluation",
@@ -106,18 +100,10 @@
                     "image": "ultralytics/yolov5",
                     "command": "pip install valohai-utils\npython evaluation.py {parameters}",
                     "inputs": {
-                        "model": [
-                            "datum://model-current-best"
-                        ],
-                        "data_yaml": [
-                            "datum://data_yaml"
-                        ],
-                        "valid": [
-                            "dataset://drift-demo-ships-aerial/dev_valid"
-                        ],
-                        "test": [
-                            "dataset://drift-demo-ships-aerial/dev_test"
-                        ]
+                        "model": ["datum://model-current-best"],
+                        "data_yaml": ["datum://data_yaml"],
+                        "valid": ["dataset://drift-demo-ships-aerial/dev_valid"],
+                        "test": ["dataset://drift-demo-ships-aerial/dev_test"],
                     },
                     "parameters": {},
                     "runtime_config": {},
@@ -126,18 +112,18 @@
                     "environment_variable_groups": [],
                     "tags": [],
                     "time_limit": 0,
-                    "environment_variables": {}
+                    "environment_variables": {},
                 },
-                "on_error": "stop-all"
-            }
+                "on_error": "stop-all",
+            },
         ],
         "project": "018c8779-9475-09e1-d481-e295ab4de428",
         "tags": [],
         "parameters": {},
-        "title": "train-val-pipeline"
+        "title": "train-val-pipeline",
     },
 )
 if resp.status_code == 400:
     raise RuntimeError(resp.json())
 resp.raise_for_status()
-data = resp.json()
+data = resp.json()
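
The hunks above only show the payload and the response handling of `call-retrain.py`, not how the request is authenticated. As a minimal sketch of how the README's advice on secrets could fit together with that payload, the snippet below reads the token from an environment variable and builds the request without sending it. The variable name `VALOHAI_API_TOKEN`, the endpoint URL, and the `Token` authorization scheme are assumptions for illustration and are not taken from this diff; the sketch uses the standard library instead of whatever HTTP client the script actually uses.

```python
import json
import os
import urllib.request

# Assumptions (not shown in the diff): endpoint URL and environment variable name.
API_URL = "https://app.valohai.com/api/v0/pipelines/"
TOKEN_ENV_VAR = "VALOHAI_API_TOKEN"


def build_request(payload, environ=os.environ):
    """Build (but do not send) a POST request carrying the pipeline payload."""
    token = environ.get(TOKEN_ENV_VAR)
    if not token:
        # Fail early instead of sending an unauthenticated request.
        raise RuntimeError(f"{TOKEN_ENV_VAR} is not set")
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Token {token}",  # assumed auth scheme
            "Content-Type": "application/json",
        },
        method="POST",
    )


def check_response(status_code, body):
    """Mirror the script's handling: raise with the body on a 400, else return it."""
    if status_code == 400:
        raise RuntimeError(body)
    return body


if __name__ == "__main__":
    # Dummy token so the sketch runs without a real secret.
    request = build_request({"title": "train-val-pipeline"}, {TOKEN_ENV_VAR: "dummy"})
    print(request.get_method(), request.full_url)
```

Keeping the token lookup in one small function means the secret never appears in the source, and a missing variable fails before any network call is made.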