feat: add script-based auto-evaluation with Streamlit analysis (#114)
* feat: add script-based auto-evaluation with Streamlit analysis

* fix mypy/ruff checks

* fix core functionality, update readme and shift script_based requirements to top-level

---------

Signed-off-by: error9098x <[email protected]>
Co-authored-by: Jack Luar <[email protected]>
error9098x and luarss authored Jan 5, 2025
1 parent e882c5c commit becf508
Showing 18 changed files with 1,083 additions and 5 deletions.
2 changes: 2 additions & 0 deletions .gitignore
@@ -19,6 +19,7 @@ venv/
# docs
documents.txt
credentials.json
creds.json

# virtualenv
.venv
@@ -29,3 +30,4 @@ credentials.json
**/.deepeval-cache.json
temp_test_run_data.json
**/llm_tests_output.txt
**/error_log.txt
2 changes: 1 addition & 1 deletion evaluation/auto_evaluation/eval_main.py
@@ -19,7 +19,7 @@
make_hallucination_metric,
)
from auto_evaluation.dataset import hf_pull, preprocess
from tqdm import tqdm # type: ignore
from tqdm import tqdm

eval_root_path = os.path.join(os.path.dirname(__file__), "..")
load_dotenv(dotenv_path=os.path.join(eval_root_path, ".env"))
10 changes: 7 additions & 3 deletions evaluation/human_evaluation/main.py
@@ -2,9 +2,13 @@
from dotenv import load_dotenv
import os

from utils.sheets import read_questions_and_answers, write_responses, find_new_questions
from utils.api import fetch_endpoints, get_responses
from utils.utils import (
from human_evaluation.utils.sheets import (
read_questions_and_answers,
write_responses,
find_new_questions,
)
from human_evaluation.utils.api import fetch_endpoints, get_responses
from human_evaluation.utils.utils import (
parse_custom_input,
selected_questions,
update_gform,
3 changes: 2 additions & 1 deletion evaluation/pyproject.toml
@@ -21,7 +21,7 @@ dependencies = { file = ["requirements.txt"] }
optional-dependencies = { test = { file = ["requirements-test.txt"] } }

[tool.setuptools.packages.find]
include = ["auto_evaluation", "human_evaluation"]
include = ["auto_evaluation", "human_evaluation", "script_based_evaluation"]

[tool.mypy]
python_version = "3.12"
@@ -30,6 +30,7 @@ warn_return_any = true
warn_unused_ignores = true
strict_optional = true
disable_error_code = ["call-arg"]
explicit_package_bases = true
exclude = "src/post_install.py"

[[tool.mypy.overrides]]
1 change: 1 addition & 0 deletions evaluation/requirements-test.txt
@@ -2,3 +2,4 @@ mypy==1.10.1
ruff==0.5.1
types-requests==2.32.0.20240622
google-api-python-client-stubs==1.28.0
types-tqdm==4.67.0.20241221
5 changes: 5 additions & 0 deletions evaluation/requirements.txt
@@ -13,3 +13,8 @@ langchain-google-vertexai==2.0.6
asyncio==3.4.3
huggingface-hub==0.26.2
instructor[vertexai]==1.5.2
openai==1.58.1
pydantic==2.10.4
tqdm==4.67.1
vertexai==1.71.1
plotly==5.24.1
2 changes: 2 additions & 0 deletions evaluation/script_based_evaluation/.env.sample
@@ -0,0 +1,2 @@
GOOGLE_APPLICATION_CREDENTIALS={{GOOGLE_APPLICATION_CREDENTIALS}}
OPENAI_API_KEY={{OPENAI_API_KEY}}
134 changes: 134 additions & 0 deletions evaluation/script_based_evaluation/README.md
@@ -0,0 +1,134 @@
# ORAssistant Automated Evaluation

This project automates the evaluation of language model responses using classification-based metrics and LLMScore. It supports testing models from OpenAI and Google Vertex AI, and it also serves as an evaluation benchmark for comparing multiple versions of ORAssistant.


## Features

1. **Classification-based Metrics**:
- Categorizes responses into True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN).
- Computes metrics such as Accuracy, Precision, Recall, and F1 Score (see the sketch after this list).

2. **LLMScore**:
- Assigns a score between 0 and 1 by comparing the generated response against the ground truth for quality and accuracy.
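
The formulas behind the classification-based metrics are standard; the following is a minimal, illustrative sketch (the function name and signature are not taken from the codebase):

```python
def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> dict[str, float]:
    """Compute Accuracy, Precision, Recall, and F1 from TP/TN/FP/FN counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Example: 40 TP, 30 TN, 10 FP, 20 FN
print(classification_metrics(40, 30, 10, 20))
```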

## Setup

### Environment Variables

Create a `.env` file in the root directory with the following variables:
```plaintext
GOOGLE_APPLICATION_CREDENTIALS=path/to/secret.json
OPENAI_API_KEY=your_openai_api_key # Required if testing against OpenAI models
```
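
As a rough sketch of how these variables are typically loaded and checked (using `python-dotenv`, which the project already depends on; the actual loading logic in the script may differ):

```python
import os

from dotenv import load_dotenv

# Load variables from the .env file in the current directory (or pass a custom path).
load_dotenv()

creds_path = os.getenv("GOOGLE_APPLICATION_CREDENTIALS")
openai_key = os.getenv("OPENAI_API_KEY")  # only required when testing OpenAI models

if not creds_path or not os.path.isfile(creds_path):
    raise FileNotFoundError("GOOGLE_APPLICATION_CREDENTIALS must point to a valid credentials file")
```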
### Required Files

- `secret.json`: Ensure you have access to Google Vertex AI and the corresponding service-account credentials file.

### Data Files

- **Input File**: `data/data.csv`
- This file should contain the questions to be tested. Ensure it is formatted as a CSV file with the following columns: `Question`, `Answer` (see the loading sketch after this list).

- **Output File**: `data/data_result.csv`
- This file will be generated after running the script. It contains the results of the evaluation.
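
For illustration, a minimal sketch of reading the input and inspecting the output (pandas is assumed here; the script itself may use a different CSV reader, and the exact output columns depend on the selected LLMs):

```python
import pandas as pd

# Input: one row per benchmark question, with `Question` and `Answer` columns.
questions = pd.read_csv("data/data.csv")
print(questions[["Question", "Answer"]].head())

# Output: written by the script next to the input, with `_result` appended.
results = pd.read_csv("data/data_result.csv")
print(results.columns.tolist())
```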

## How to Run

1. **Activate the virtual environment**

From the parent directory (`evaluation`), make sure you have run `make init`
before activating the virtual environment; this step is required for this
folder to be recognised as a submodule.

2. **Run the Script**

Use the following command to execute the script with customizable options:

```bash
python main.py --env-path /path/to/.env --creds-path /path/to/secret.json --iterations 10 --llms "base-gemini-1.5-flash,base-gpt-4o" --agent-retrievers "v1=http://url1.com,v2=http://url2.com"
```

- `--env-path`: Path to the `.env` file.
- `--creds-path`: Path to the `secret.json` file.
- `--iterations`: Number of iterations per question.
- `--llms`: Comma-separated list of LLMs to test.
- `--agent-retrievers`: Comma-separated list of agent-retriever names and URLs (see the parsing sketch at the end of this section).

3. **View Results**

Results will be saved in a CSV file named after the input data file with `_result` appended.
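
The sketch below shows one way the options above could be parsed, including splitting `--llms` and `--agent-retrievers` into usable structures (illustrative only; the actual argument handling in `main.py` may differ):

```python
import argparse


def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="Script-based auto-evaluation")
    parser.add_argument("--env-path", default=".env", help="Path to the .env file")
    parser.add_argument("--creds-path", default="secret.json", help="Path to the credentials file")
    parser.add_argument("--iterations", type=int, default=5, help="Iterations per question")
    parser.add_argument("--llms", default="", help="Comma-separated list of LLMs to test")
    parser.add_argument("--agent-retrievers", default="", help="Comma-separated name=URL pairs")
    return parser.parse_args()


if __name__ == "__main__":
    args = parse_args()
    llms = [name.strip() for name in args.llms.split(",") if name.strip()]
    # "v1=http://url1.com,v2=http://url2.com" -> {"v1": "http://url1.com", ...}
    agent_retrievers = dict(
        pair.split("=", 1) for pair in args.agent_retrievers.split(",") if "=" in pair
    )
    print(llms, agent_retrievers)
```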

## Basic Usage

### a. Default Usage

```bash
python main.py
```

- Uses the default `.env` file in the project root.
- Uses `data/data.csv` as the input.
- Runs 5 iterations per question.
- Tests all available LLMs.
- Adds no additional agent-retrievers.

### b. Specify .env and secret.json Paths

```bash
python main.py --env-path /path/to/.env --creds-path /path/to/secret.json
```

### c. Customize Iterations and Select Specific LLMs

```bash
python main.py --iterations 10 --llms "base-gpt-4o,base-gemini-1.5-flash"
```

### d. Add Agent-Retrievers with Custom Names

```bash
python main.py --agent-retrievers "v1=http://url1.com,v2=http://url2.com"
```

### e. Full Example with All Options

```bash
python main.py \
--env-path /path/to/.env \
--creds-path /path/to/secret.json \
--iterations 10 \
--llms "base-gemini-1.5-flash,base-gpt-4o" \
--agent-retrievers "v1=http://url1.com,v2=http://url2.com"
```

### f. Display Help Message

To view all available command-line options:

```bash
python main.py --help
```

### Run Analysis

After generating results, you can analyse them with the provided `analysis.py` script. To launch the Streamlit app, run:

```bash
streamlit run analysis.py
```
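
A minimal sketch of what such an analysis app can look like, assuming the result CSV at `data/data_result.csv` and using Plotly from the project requirements (the real `analysis.py` may expose more views and different column names):

```python
import pandas as pd
import plotly.express as px
import streamlit as st

st.title("ORAssistant Evaluation Results")

# Path and column handling are assumptions for illustration.
df = pd.read_csv("data/data_result.csv")
st.dataframe(df)

numeric_cols = df.select_dtypes("number").columns.tolist()
if numeric_cols:
    metric = st.selectbox("Metric to plot", numeric_cols)
    st.plotly_chart(px.histogram(df, x=metric, nbins=20))
```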


### Sample Comparison Commands

1. To compare three versions of ORAssistant, use:
```bash
python main.py --agent-retrievers "orassistant-v1=http://url1.com,orassistant-v2=http://url2.com,orassistant-v3=http://url3.com"
```
*Note: Each URL is the endpoint of the ORAssistant backend.*

2. To compare ORAssistant with base-gpt-4o, use:
```bash
python main.py --llms "base-gpt-4o" --agent-retrievers "orassistant=http://url.com"
```
*Note: The URL is the endpoint of the ORAssistant backend.*

1 comment on commit becf508

@luarss (Collaborator) commented on becf508, Jan 5, 2025

===================================
==> Dataset: EDA Corpus
==> Running tests for agent-retriever
/home/luarss/actions-runner/_work/ORAssistant/ORAssistant/evaluation/.venv/lib/python3.12/site-packages/deepeval/__init__.py:49: UserWarning: You are using deepeval version 1.4.9, however version 2.1.1 is available. You should consider upgrading via the "pip install --upgrade deepeval" command.
warnings.warn(

Fetching 2 files: 100%|██████████| 2/2 [00:00<00:00, 8.48it/s]

Evaluating: 100%|██████████| 100/100 [18:44<00:00, 11.24s/it]
✨ You're running DeepEval's latest Contextual Precision Metric! (using
gemini-1.5-pro-002, strict=False, async_mode=True)...
✨ You're running DeepEval's latest Contextual Recall Metric! (using
gemini-1.5-pro-002, strict=False, async_mode=True)...
✨ You're running DeepEval's latest Hallucination Metric! (using
gemini-1.5-pro-002, strict=False, async_mode=True)...

Evaluating 100 test case(s) in parallel: |██████████|100% (100/100) [Time Taken: 00:26, 3.82test case/s]
✓ Tests finished 🎉! Run 'deepeval login' to save and analyze evaluation results
on Confident AI.
‼️ Friendly reminder 😇: You can also run evaluations with ALL of deepeval's
metrics directly on Confident AI instead.
Average Metric Scores:
Contextual Precision 0.7293174603174603
Contextual Recall 0.8298333333333333
Hallucination 0.4964166666666667
Metric Passrates:
Contextual Precision 0.69
Contextual Recall 0.8
Hallucination 0.62
