feat: add script-based auto-evaluation with Streamlit analysis (#114)
* feat: add script-based auto-evaluation with Streamlit analysis
* fix mypy/ruff checks
* fix core functionality, update readme and shift script_based requirements to top-level

Signed-off-by: error9098x <[email protected]>
Co-authored-by: Jack Luar <[email protected]>
1 parent e882c5c · commit becf508 · 18 changed files with 1,083 additions and 5 deletions
@@ -0,0 +1,2 @@
GOOGLE_APPLICATION_CREDENTIALS={{GOOGLE_APPLICATION_CREDENTIALS}}
OPENAI_API_KEY={{OPENAI_API_KEY}}
@@ -0,0 +1,134 @@
# ORAssistant Automated Evaluation

This project automates the evaluation of language model responses using classification-based metrics and LLMScore. It supports testing against various models, including OpenAI and Google Vertex AI. It also serves as an evaluation benchmark for comparing multiple versions of ORAssistant.

## Features
1. **Classification-based Metrics**:
   - Categorizes responses into True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN).
   - Computes metrics such as Accuracy, Precision, Recall, and F1 Score (see the sketch after this list).

2. **LLMScore**:
   - Assigns a score between 0 and 1 by comparing the generated response's quality and accuracy against the ground truth.
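For reference, these follow the standard confusion-matrix definitions. A minimal sketch of the computation (the function name and counts below are illustrative, not the project's actual code):

```python
def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> dict[str, float]:
    """Compute Accuracy, Precision, Recall, and F1 from confusion counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Illustrative counts only:
print(classification_metrics(tp=40, tn=30, fp=10, fn=20))
```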
## Setup

### Environment Variables

Create a `.env` file in the root directory with the following variables:

```plaintext
GOOGLE_APPLICATION_CREDENTIALS=path/to/secret.json
OPENAI_API_KEY=your_openai_api_key  # Required if testing against OpenAI models
```
### Required Files

- `secret.json`: Ensure you have a Google Vertex AI subscription and the necessary credentials file.

### Data Files

- **Input File**: `data/data.csv`
  - This file should contain the questions to be tested. Ensure it is formatted as a CSV file with the following columns: `Question`, `Answer` (an illustrative example follows this section).

- **Output File**: `data/data_result.csv`
  - This file will be generated after running the script. It contains the results of the evaluation.
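For illustration, an input file in this format might look like the following (the question/answer pair is made up, not taken from the project's dataset):

```csv
Question,Answer
What does the global placement step do?,It assigns approximate locations to standard cells before legalization.
```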
## How to Run

1. **Activate the virtual environment**

   From the parent directory (`evaluation`), make sure you have run `make init` before activating the virtual environment; this is needed for this folder to be recognised as a submodule.
2. **Run the Script**

   Use the following command to execute the script with customizable options:

   ```bash
   python main.py --env-path /path/to/.env --creds-path /path/to/secret.json --iterations 10 --llms "base-gemini-1.5-flash,base-gpt-4o" --agent-retrievers "v1=http://url1.com,v2=http://url2.com"
   ```

   - `--env-path`: Path to the `.env` file.
   - `--creds-path`: Path to the `secret.json` file.
   - `--iterations`: Number of iterations per question.
   - `--llms`: Comma-separated list of LLMs to test.
   - `--agent-retrievers`: Comma-separated list of agent-retriever names and URLs.
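   These flags map naturally onto a standard `argparse` interface. A hypothetical sketch, assuming the defaults stated in this README (5 iterations, `.env` in the project root); the real `main.py` may define them differently:

   ```python
   import argparse

   # Hypothetical CLI definition mirroring the documented flags.
   parser = argparse.ArgumentParser(description="ORAssistant automated evaluation")
   parser.add_argument("--env-path", default=".env", help="Path to the .env file")
   parser.add_argument("--creds-path", default="secret.json", help="Path to the secret.json file")
   parser.add_argument("--iterations", type=int, default=5, help="Iterations per question")
   parser.add_argument("--llms", default="", help="Comma-separated list of LLMs to test")
   parser.add_argument("--agent-retrievers", default="", help="Comma-separated name=URL pairs")
   args = parser.parse_args()

   # Split the comma-separated LLM list into individual model names.
   llms = [name for name in args.llms.split(",") if name]
   ```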
3. **View Results**

   Results will be saved in a CSV file named after the input data file with `_result` appended (for example, `data/data.csv` produces `data/data_result.csv`).
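   The naming convention is simple to reproduce; a hypothetical helper (not from the project's code):

   ```python
   from pathlib import Path

   def result_path(input_csv: str) -> Path:
       """Derive the output path: data/data.csv -> data/data_result.csv."""
       p = Path(input_csv)
       return p.with_name(f"{p.stem}_result{p.suffix}")
   ```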
## Basic Usage

### a. Default Usage

```bash
python main.py
```

- Uses the default `.env` file in the project root.
- Uses the default `data/data.csv` as input.
- Runs 5 iterations per question.
- Tests all available LLMs.
- Adds no additional agent-retrievers.
### b. Specify `.env` and `secret.json` Paths

```bash
python main.py --env-path /path/to/.env --creds-path /path/to/secret.json
```
### c. Customize Iterations and Select Specific LLMs

```bash
python main.py --iterations 10 --llms "base-gpt-4o,base-gemini-1.5-flash"
```
### d. Add Agent-Retrievers with Custom Names

```bash
python main.py --agent-retrievers "v1=http://url1.com,v2=http://url2.com"
```
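Each entry in `--agent-retrievers` is a `name=URL` pair. A minimal sketch of how such a value could be parsed (hypothetical helper; `main.py` may do this differently):

```python
def parse_agent_retrievers(spec: str) -> dict[str, str]:
    """Parse 'v1=http://url1.com,v2=http://url2.com' into {name: url}."""
    pairs = (item.split("=", 1) for item in spec.split(",") if item)
    return {name: url for name, url in pairs}

print(parse_agent_retrievers("v1=http://url1.com,v2=http://url2.com"))
```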
### e. Full Example with All Options

```bash
python main.py \
  --env-path /path/to/.env \
  --creds-path /path/to/secret.json \
  --iterations 10 \
  --llms "base-gemini-1.5-flash,base-gpt-4o" \
  --agent-retrievers "v1=http://url1.com,v2=http://url2.com"
```
### f. Display Help Message

To view all available command-line options:

```bash
python main.py --help
```
### Run Analysis

After generating results, you can perform analysis using the provided `analysis.py` script. To run the analysis, execute the following command:

```bash
streamlit run analysis.py
```
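For a rough sense of what such a Streamlit app involves, here is a minimal hypothetical sketch that loads the result file named above; the bundled `analysis.py` is more complete:

```python
import pandas as pd
import streamlit as st

st.title("ORAssistant Evaluation Results")
df = pd.read_csv("data/data_result.csv")  # output file described above
st.dataframe(df)  # browse the raw results
st.bar_chart(df.select_dtypes("number").mean())  # mean of each numeric column
```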
### Sample Comparison Commands

1. To compare three versions of ORAssistant, use:

   ```bash
   python main.py --agent-retrievers "orassistant-v1=http://url1.com,orassistant-v2=http://url2.com,orassistant-v3=http://url3.com"
   ```

   *Note: Each URL is the endpoint of the ORAssistant backend.*

2. To compare ORAssistant with base-gpt-4o, use:

   ```bash
   python main.py --llms "base-gpt-4o" --agent-retrievers "orassistant=http://url.com"
   ```

   *Note: The URL is the endpoint of the ORAssistant backend.*
===================================
==> Dataset: EDA Corpus
==> Running tests for agent-retriever
/home/luarss/actions-runner/_work/ORAssistant/ORAssistant/evaluation/.venv/lib/python3.12/site-packages/deepeval/__init__.py:49: UserWarning: You are using deepeval version 1.4.9, however version 2.1.1 is available. You should consider upgrading via the "pip install --upgrade deepeval" command.
  warnings.warn(
Fetching 2 files: 100%|██████████| 2/2 [00:00<00:00, 8.48it/s]
Evaluating: 100%|██████████| 100/100 [18:44<00:00, 11.24s/it]
✨ You're running DeepEval's latest Contextual Precision Metric! (using gemini-1.5-pro-002, strict=False, async_mode=True)...
✨ You're running DeepEval's latest Contextual Recall Metric! (using gemini-1.5-pro-002, strict=False, async_mode=True)...
✨ You're running DeepEval's latest Hallucination Metric! (using gemini-1.5-pro-002, strict=False, async_mode=True)...
‼️ Friendly reminder 😇: You can also run evaluations with ALL of deepeval's metrics directly on Confident AI instead.
Evaluating 100 test case(s) in parallel: |██████████|100% (100/100) [Time Taken: 00:26, 3.82test case/s]
✓ Tests finished 🎉! Run 'deepeval login' to save and analyze evaluation results on Confident AI.
Average Metric Scores:
  Contextual Precision  0.7293174603174603
  Contextual Recall     0.8298333333333333
  Hallucination         0.4964166666666667
Metric Passrates:
  Contextual Precision  0.69
  Contextual Recall     0.8
  Hallucination         0.62