feat: add script-based auto-evaluation with Streamlit analysis (#114)
* feat: add script-based auto-evaluation with Streamlit analysis

* fix mypy/ruff checks

* fix core functionality, update readme and shift script_based requirements to top-level

---------

Signed-off-by: error9098x <[email protected]>
Co-authored-by: Jack Luar <[email protected]>
error9098x and luarss authored Jan 5, 2025
1 parent e882c5c commit becf508
Showing 18 changed files with 1,083 additions and 5 deletions.
2 changes: 2 additions & 0 deletions .gitignore
@@ -19,6 +19,7 @@ venv/
# docs
documents.txt
credentials.json
creds.json

# virtualenv
.venv
@@ -29,3 +30,4 @@ credentials.json
**/.deepeval-cache.json
temp_test_run_data.json
**/llm_tests_output.txt
**/error_log.txt
2 changes: 1 addition & 1 deletion evaluation/auto_evaluation/eval_main.py
@@ -19,7 +19,7 @@
make_hallucination_metric,
)
from auto_evaluation.dataset import hf_pull, preprocess
from tqdm import tqdm # type: ignore
from tqdm import tqdm

eval_root_path = os.path.join(os.path.dirname(__file__), "..")
load_dotenv(dotenv_path=os.path.join(eval_root_path, ".env"))
10 changes: 7 additions & 3 deletions evaluation/human_evaluation/main.py
@@ -2,9 +2,13 @@
from dotenv import load_dotenv
import os

from utils.sheets import read_questions_and_answers, write_responses, find_new_questions
from utils.api import fetch_endpoints, get_responses
from utils.utils import (
from human_evaluation.utils.sheets import (
read_questions_and_answers,
write_responses,
find_new_questions,
)
from human_evaluation.utils.api import fetch_endpoints, get_responses
from human_evaluation.utils.utils import (
parse_custom_input,
selected_questions,
update_gform,
3 changes: 2 additions & 1 deletion evaluation/pyproject.toml
@@ -21,7 +21,7 @@ dependencies = { file = ["requirements.txt"] }
optional-dependencies = { test = { file = ["requirements-test.txt"] } }

[tool.setuptools.packages.find]
include = ["auto_evaluation", "human_evaluation"]
include = ["auto_evaluation", "human_evaluation", "script_based_evaluation"]

[tool.mypy]
python_version = "3.12"
@@ -30,6 +30,7 @@ warn_return_any = true
warn_unused_ignores = true
strict_optional = true
disable_error_code = ["call-arg"]
explicit_package_bases = true
exclude = "src/post_install.py"

[[tool.mypy.overrides]]
1 change: 1 addition & 0 deletions evaluation/requirements-test.txt
@@ -2,3 +2,4 @@ mypy==1.10.1
ruff==0.5.1
types-requests==2.32.0.20240622
google-api-python-client-stubs==1.28.0
types-tqdm==4.67.0.20241221
5 changes: 5 additions & 0 deletions evaluation/requirements.txt
@@ -13,3 +13,8 @@ langchain-google-vertexai==2.0.6
asyncio==3.4.3
huggingface-hub==0.26.2
instructor[vertexai]==1.5.2
openai==1.58.1
pydantic==2.10.4
tqdm==4.67.1
vertexai==1.71.1
plotly==5.24.1
2 changes: 2 additions & 0 deletions evaluation/script_based_evaluation/.env.sample
@@ -0,0 +1,2 @@
GOOGLE_APPLICATION_CREDENTIALS={{GOOGLE_APPLICATION_CREDENTIALS}}
OPENAI_API_KEY={{OPENAI_API_KEY}}
134 changes: 134 additions & 0 deletions evaluation/script_based_evaluation/README.md
@@ -0,0 +1,134 @@
# ORAssistant Automated Evaluation

This project automates the evaluation of language model responses using classification-based metrics and LLMScore. It supports testing models from OpenAI and Google Vertex AI, and it also serves as an evaluation benchmark for comparing multiple versions of ORAssistant.


## Features

1. **Classification-based Metrics**:
- Categorizes responses into True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN).
- Computes metrics such as Accuracy, Precision, Recall, and F1 Score (see the sketch after this list).

2. **LLMScore**:
- Assigns a score between 0 and 1 by comparing the generated response against the ground truth for quality and accuracy.
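
The formulas behind the classification-based metrics are standard; the following is a minimal, illustrative sketch (the function name and signature are not taken from the codebase):

```python
def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> dict[str, float]:
    """Compute Accuracy, Precision, Recall, and F1 from TP/TN/FP/FN counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Example: 40 TP, 30 TN, 10 FP, 20 FN
print(classification_metrics(40, 30, 10, 20))
```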

## Setup

### Environment Variables

Create a `.env` file in the root directory with the following variables:
```plaintext
GOOGLE_APPLICATION_CREDENTIALS=path/to/secret.json
OPENAI_API_KEY=your_openai_api_key # Required if testing against OpenAI models
```
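
As a rough sketch of how these variables are typically loaded and checked (using `python-dotenv`, which the project already depends on; the actual loading logic in the script may differ):

```python
import os

from dotenv import load_dotenv

# Load variables from the .env file in the current directory (or pass a custom path).
load_dotenv()

creds_path = os.getenv("GOOGLE_APPLICATION_CREDENTIALS")
openai_key = os.getenv("OPENAI_API_KEY")  # only required when testing OpenAI models

if not creds_path or not os.path.isfile(creds_path):
    raise FileNotFoundError("GOOGLE_APPLICATION_CREDENTIALS must point to a valid credentials file")
```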
### Required Files

- `secret.json`: Ensure you have access to Google Vertex AI and the corresponding service-account credentials file.

### Data Files

- **Input File**: `data/data.csv`
- This file should contain the questions to be tested. Ensure it is formatted as a CSV file with the following columns: `Question`, `Answer` (see the loading sketch after this list).

- **Output File**: `data/data_result.csv`
- This file will be generated after running the script. It contains the results of the evaluation.
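
For illustration, a minimal sketch of reading the input and inspecting the output (pandas is assumed here; the script itself may use a different CSV reader, and the exact output columns depend on the selected LLMs):

```python
import pandas as pd

# Input: one row per benchmark question, with `Question` and `Answer` columns.
questions = pd.read_csv("data/data.csv")
print(questions[["Question", "Answer"]].head())

# Output: written by the script next to the input, with `_result` appended.
results = pd.read_csv("data/data_result.csv")
print(results.columns.tolist())
```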

## How to Run

1. **Activate the virtual environment**

From the parent directory (`evaluation`), make sure you have run `make init`
before activating the virtual environment; this step is required for this
folder to be recognised as a submodule.

2. **Run the Script**

Use the following command to execute the script with customizable options:

```bash
python main.py --env-path /path/to/.env --creds-path /path/to/secret.json --iterations 10 --llms "base-gemini-1.5-flash,base-gpt-4o" --agent-retrievers "v1=http://url1.com,v2=http://url2.com"
```

- `--env-path`: Path to the `.env` file.
- `--creds-path`: Path to the `secret.json` file.
- `--iterations`: Number of iterations per question.
- `--llms`: Comma-separated list of LLMs to test.
- `--agent-retrievers`: Comma-separated list of agent-retriever names and URLs (see the parsing sketch at the end of this section).

3. **View Results**

Results will be saved in a CSV file named after the input data file with `_result` appended.
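
The sketch below shows one way the options above could be parsed, including splitting `--llms` and `--agent-retrievers` into usable structures (illustrative only; the actual argument handling in `main.py` may differ):

```python
import argparse


def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="Script-based auto-evaluation")
    parser.add_argument("--env-path", default=".env", help="Path to the .env file")
    parser.add_argument("--creds-path", default="secret.json", help="Path to the credentials file")
    parser.add_argument("--iterations", type=int, default=5, help="Iterations per question")
    parser.add_argument("--llms", default="", help="Comma-separated list of LLMs to test")
    parser.add_argument("--agent-retrievers", default="", help="Comma-separated name=URL pairs")
    return parser.parse_args()


if __name__ == "__main__":
    args = parse_args()
    llms = [name.strip() for name in args.llms.split(",") if name.strip()]
    # "v1=http://url1.com,v2=http://url2.com" -> {"v1": "http://url1.com", ...}
    agent_retrievers = dict(
        pair.split("=", 1) for pair in args.agent_retrievers.split(",") if "=" in pair
    )
    print(llms, agent_retrievers)
```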

## Basic Usage

### a. Default Usage

```bash
python main.py
```

- Uses the default `.env` file in the project root.
- Uses `data/data.csv` as the input.
- Runs 5 iterations per question.
- Tests all available LLMs.
- Adds no additional agent-retrievers.

### b. Specify .env and secret.json Paths

```bash
python main.py --env-path /path/to/.env --creds-path /path/to/secret.json
```

### c. Customize Iterations and Select Specific LLMs

```bash
python main.py --iterations 10 --llms "base-gpt-4o,base-gemini-1.5-flash"
```

### d. Add Agent-Retrievers with Custom Names

```bash
python main.py --agent-retrievers "v1=http://url1.com,v2=http://url2.com"
```

### e. Full Example with All Options

```bash
python main.py \
--env-path /path/to/.env \
--creds-path /path/to/secret.json \
--iterations 10 \
--llms "base-gemini-1.5-flash,base-gpt-4o" \
--agent-retrievers "v1=http://url1.com,v2=http://url2.com"
```

### f. Display Help Message

To view all available command-line options:

```bash
python main.py --help
```

### Run Analysis

After generating results, you can analyse them with the provided `analysis.py` script. To launch the Streamlit app, run:

```bash
streamlit run analysis.py
```
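
A minimal sketch of what such an analysis app can look like, assuming the result CSV at `data/data_result.csv` and using Plotly from the project requirements (the real `analysis.py` may expose more views and different column names):

```python
import pandas as pd
import plotly.express as px
import streamlit as st

st.title("ORAssistant Evaluation Results")

# Path and column handling are assumptions for illustration.
df = pd.read_csv("data/data_result.csv")
st.dataframe(df)

numeric_cols = df.select_dtypes("number").columns.tolist()
if numeric_cols:
    metric = st.selectbox("Metric to plot", numeric_cols)
    st.plotly_chart(px.histogram(df, x=metric, nbins=20))
```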


### Sample Comparison Commands

1. To compare three versions of ORAssistant, use:
```bash
python main.py --agent-retrievers "orassistant-v1=http://url1.com,orassistant-v2=http://url2.com,orassistant-v3=http://url3.com"
```
*Note: Each URL is the endpoint of the ORAssistant backend.*

2. To compare ORAssistant with base-gpt-4o, use:
```bash
python main.py --llms "base-gpt-4o" --agent-retrievers "orassistant=http://url.com"
```
*Note: The URL is the endpoint of the ORAssistant backend.*

1 comment on commit becf508

@luarss (Collaborator) commented on becf508, Jan 5, 2025

===================================
==> Dataset: EDA Corpus
==> Running tests for agent-retriever
/home/luarss/actions-runner/_work/ORAssistant/ORAssistant/evaluation/.venv/lib/python3.12/site-packages/deepeval/__init__.py:49: UserWarning: You are using deepeval version 1.4.9, however version 2.1.1 is available. You should consider upgrading via the "pip install --upgrade deepeval" command.
warnings.warn(

Fetching 2 files: 100%|██████████| 2/2 [00:00<00:00, 8.48it/s]

Evaluating: 100%|██████████| 100/100 [18:44<00:00, 11.24s/it]
✨ You're running DeepEval's latest Contextual Precision Metric! (using
gemini-1.5-pro-002, strict=False, async_mode=True)...
✨ You're running DeepEval's latest Contextual Recall Metric! (using
gemini-1.5-pro-002, strict=False, async_mode=True)...
✨ You're running DeepEval's latest Hallucination Metric! (using
gemini-1.5-pro-002, strict=False, async_mode=True)...

Evaluating 100 test case(s) in parallel: |██████████|100% (100/100) [Time Taken: 00:26, 3.82test case/s]
✓ Tests finished 🎉! Run 'deepeval login' to save and analyze evaluation results
on Confident AI.
‼️ Friendly reminder 😇: You can also run evaluations with ALL of deepeval's
metrics directly on Confident AI instead.
Average Metric Scores:
Contextual Precision 0.7293174603174603
Contextual Recall 0.8298333333333333
Hallucination 0.4964166666666667
Metric Passrates:
Contextual Precision 0.69
Contextual Recall 0.8
Hallucination 0.62
