diff --git a/README.md b/README.md index 66a1d6b31..105fb5381 100644 --- a/README.md +++ b/README.md @@ -101,7 +101,16 @@ You can check out the following LangTest articles: | [**Evaluating Large Language Models on Gender-Occupational Stereotypes Using the Wino Bias Test**](https://medium.com/john-snow-labs/evaluating-large-language-models-on-gender-occupational-stereotypes-using-the-wino-bias-test-2a96619b4960) | In this blog post, we dive into testing the WinoBias dataset on LLMs, examining language models’ handling of gender and occupational roles, evaluation metrics, and the wider implications. Let’s explore the evaluation of language models with LangTest on the WinoBias dataset and confront the challenges of addressing bias in AI. | | [**Streamlining ML Workflows: Integrating MLFlow Tracking with LangTest for Enhanced Model Evaluations**](https://medium.com/john-snow-labs/streamlining-ml-workflows-integrating-mlflow-tracking-with-langtest-for-enhanced-model-evaluations-4ce9863a0ff1) | In this blog post, we dive into the growing need for transparent, systematic, and comprehensive tracking of models. Enter MLFlow and LangTest: two tools that, when combined, create a revolutionary approach to ML development. | | [**Testing the Question Answering Capabilities of Large Language Models**](https://medium.com/john-snow-labs/testing-the-question-answering-capabilities-of-large-language-models-1bc424d61740) | In this blog post, we dive into enhancing the QA evaluation capabilities using LangTest library. Explore about different evaluation methods that LangTest offers to address the complexities of evaluating Question Answering (QA) tasks. 
| -| [**Evaluating Stereotype Bias with LangTest**](To be published soon) | In this blog post, we are focusing on using the StereoSet dataset to assess bias related to gender, profession, and race.| +| [**Evaluating Stereotype Bias with LangTest**](https://medium.com/john-snow-labs/evaluating-stereotype-bias-with-langtest-8286af8f0f22) | In this blog post, we are focusing on using the StereoSet dataset to assess bias related to gender, profession, and race.| +| [**Unveiling Sentiments: Exploring LSTM-based Sentiment Analysis with PyTorch on the IMDB Dataset**](To be Published) | Explore the robustness of custom models with LangTest Insights.| +| [**LangTest Insights: A Deep Dive into LLM Robustness on OpenBookQA**](To be Published) | Explore the robustness of Language Models (LLMs) on the OpenBookQA dataset with LangTest Insights.| +| [**LangTest: A Secret Weapon for Improving the Robustness of Your Transformers Language Models**](To be Published) | Explore the robustness of Transformers Language Models with LangTest Insights.| + + + + + + > **Note** diff --git a/demo/tutorials/benchmarks/OpenbookQA_benchmarks.ipynb b/demo/tutorials/benchmarks/OpenbookQA_benchmarks.ipynb new file mode 100644 index 000000000..932bc2c5d --- /dev/null +++ b/demo/tutorials/benchmarks/OpenbookQA_benchmarks.ipynb @@ -0,0 +1,22506 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "![image.png]()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/langtest/blob/main/demo/tutorials/benchmarks/OpenbookQA_benchmarks.ipynb)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**LangTest** is an open-source python library designed to help developers deliver safe and effective Natural Language Processing (NLP) models. 
Whether you are using **John Snow Labs, Hugging Face, spaCy** models or **OpenAI, Cohere, AI21, Hugging Face Inference API and Azure-OpenAI** based LLMs, it has got you covered. You can test any Named Entity Recognition (NER) or Text Classification model using the library. We also support testing LLMs for Question-Answering and Summarization tasks on benchmark datasets. The library supports 50+ out-of-the-box tests. These tests fall into robustness, accuracy, bias, representation and fairness test categories.\n", + "\n", + "Metrics are calculated by comparing the model's extractions in the original list of sentences against the extractions carried out in the noisy list of sentences. The original annotated labels are not used at any point; we are simply comparing the model against itself in two settings." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Getting started with LangTest" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "execution": { + "iopub.execute_input": "2023-11-28T14:44:12.911787Z", + "iopub.status.busy": "2023-11-28T14:44:12.911409Z", + "iopub.status.idle": "2023-11-28T14:44:15.492907Z", + "shell.execute_reply": "2023-11-28T14:44:15.492344Z", + "shell.execute_reply.started": "2023-11-28T14:44:12.911770Z" + }, + "tags": [] + }, + "outputs": [], + "source": [ + "!pip install \"langtest[openai,ai21,transformers]\"" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Harness and Its Parameters\n", + "\n", + "The Harness class is a testing class for Natural Language Processing (NLP) models. It evaluates the performance of an NLP model on a given task using test data and generates a report with test results. Harness can be imported from the LangTest library in the following way."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "execution": { + "iopub.execute_input": "2023-11-28T14:44:15.493891Z", + "iopub.status.busy": "2023-11-28T14:44:15.493714Z", + "iopub.status.idle": "2023-11-28T14:44:20.531726Z", + "shell.execute_reply": "2023-11-28T14:44:20.530913Z", + "shell.execute_reply.started": "2023-11-28T14:44:15.493876Z" + }, + "tags": [] + }, + "outputs": [], + "source": [ + "from langtest import Harness" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Initial Setup" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": { + "execution": { + "iopub.execute_input": "2023-11-28T14:44:20.533062Z", + "iopub.status.busy": "2023-11-28T14:44:20.532540Z", + "iopub.status.idle": "2023-11-28T14:44:20.536063Z", + "shell.execute_reply": "2023-11-28T14:44:20.535546Z", + "shell.execute_reply.started": "2023-11-28T14:44:20.533043Z" + }, + "tags": [] + }, + "outputs": [], + "source": [ + "import pandas as pd\n", + "import random\n", + "pd.set_option('display.max_colwidth', None)" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": { + "execution": { + "iopub.execute_input": "2023-11-28T14:44:20.538932Z", + "iopub.status.busy": "2023-11-28T14:44:20.538677Z", + "iopub.status.idle": "2023-11-28T14:44:20.621170Z", + "shell.execute_reply": "2023-11-28T14:44:20.620455Z", + "shell.execute_reply.started": "2023-11-28T14:44:20.538917Z" + }, + "tags": [] + }, + "outputs": [], + "source": [ + "import os\n", + "os.environ[\"AI21_API_KEY\"] = \"\"\n", + "os.environ[\"OPENAI_API_KEY\"] = \"\"" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The Harness class imported above provides a blueprint or framework for conducting NLP testing, and instances of the class can be customized or configured for different testing scenarios or environments.\n", + "\n", + "Here is a list of the different
parameters that can be passed to the Harness class:\n", + "\n", + "
\n", + "\n", + "\n", + "| Parameter | Description |\n", + "| - | - |\n", + "| **task** | Task for which the model is to be evaluated (question-answering or summarization) |\n", + "| **model** | Specifies the model(s) to be evaluated. This parameter can be provided as either a dictionary or a list of dictionaries. Each dictionary should contain keys such as `model` and `hub`. |\n", + "| **data** | The data to be used for evaluation. A dictionary providing flexibility and options for data sources. It can include keys such as `data_source` and `split`. |\n", + "| **config** | Configuration for the tests to be performed, specified in the form of a YAML file. |\n", + "\n", + "
\n", + "
" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Model J2-Jumbo-Instruct" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Setup and Configure Harness" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": { + "execution": { + "iopub.execute_input": "2023-11-18T18:05:09.217288Z", + "iopub.status.busy": "2023-11-18T18:05:09.216994Z", + "iopub.status.idle": "2023-11-18T18:05:09.311278Z", + "shell.execute_reply": "2023-11-18T18:05:09.310829Z", + "shell.execute_reply.started": "2023-11-18T18:05:09.217273Z" + }, + "tags": [] + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Test Configuration : \n", + " {\n", + " \"model_parameters\": {\n", + " \"temperature\": 0.2,\n", + " \"maxTokens\": 64\n", + " },\n", + " \"tests\": {\n", + " \"defaults\": {\n", + " \"min_pass_rate\": 1.0\n", + " },\n", + " \"robustness\": {\n", + " \"add_typo\": {\n", + " \"min_pass_rate\": 0.7\n", + " },\n", + " \"lowercase\": {\n", + " \"min_pass_rate\": 0.7\n", + " }\n", + " }\n", + " }\n", + "}\n" + ] + } + ], + "source": [ + "harness = Harness(\n", + " task=\"question-answering\",\n", + " model={\"model\": \"j2-jumbo-instruct\", \"hub\":\"ai21\"},\n", + " data={\"data_source\" :\"OpenBookQA\",\n", + " \"split\":\"test\"}\n", + " )" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": { + "execution": { + "iopub.execute_input": "2023-11-18T18:05:10.104756Z", + "iopub.status.busy": "2023-11-18T18:05:10.104323Z", + "iopub.status.idle": "2023-11-18T18:05:10.111581Z", + "shell.execute_reply": "2023-11-18T18:05:10.111049Z", + "shell.execute_reply.started": "2023-11-18T18:05:10.104738Z" + }, + "tags": [] + }, + "outputs": [ + { + "data": { + "text/plain": [ + "{'evaluation': {'metric': 'QAEvalChain',\n", + " 'model': 'gpt-3.5-turbo-instruct',\n", + " 'hub': 'openai'},\n", + " 'tests': {'defaults': {'min_pass_rate': 0.65},\n", + " 'robustness': {'uppercase': 
{'min_pass_rate': 0.75},\n", + " 'lowercase': {'min_pass_rate': 0.75},\n", + " 'titlecase': {'min_pass_rate': 0.75},\n", + " 'add_typo': {'min_pass_rate': 0.75},\n", + " 'dyslexia_word_swap': {'min_pass_rate': 0.75},\n", + " 'add_abbreviation': {'min_pass_rate': 0.75},\n", + " 'add_slangs': {'min_pass_rate': 0.75},\n", + " 'add_speech_to_text_typo': {'min_pass_rate': 0.75},\n", + " 'add_ocr_typo': {'min_pass_rate': 0.75},\n", + " 'adjective_synonym_swap': {'min_pass_rate': 0.75},\n", + " 'adjective_antonym_swap': {'min_pass_rate': 0.75}}}}" + ] + }, + "execution_count": 6, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "harness.configure(\n", + "{\n", + " \"evaluation\": {\"metric\":\"QAEvalChain\",\"model\":\"gpt-3.5-turbo-instruct\",\"hub\":\"openai\"},\n", + " 'tests': {'defaults': {'min_pass_rate': 0.65},\n", + " 'robustness': {'uppercase': {'min_pass_rate': 0.75},\n", + " 'lowercase':{'min_pass_rate': 0.75},\n", + " 'titlecase':{'min_pass_rate': 0.75},\n", + " 'add_typo':{'min_pass_rate': 0.75},\n", + " 'dyslexia_word_swap':{'min_pass_rate': 0.75},\n", + " 'add_abbreviation':{'min_pass_rate': 0.75},\n", + " 'add_slangs':{'min_pass_rate': 0.75},\n", + " 'add_speech_to_text_typo':{'min_pass_rate': 0.75},\n", + " 'add_ocr_typo':{'min_pass_rate': 0.75},\n", + " 'adjective_synonym_swap':{'min_pass_rate': 0.75},\n", + " 'adjective_antonym_swap':{'min_pass_rate': 0.75}\n", + " }\n", + " }\n", + " }\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Generating the test cases" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": { + "execution": { + "iopub.execute_input": "2023-11-18T18:05:12.821348Z", + "iopub.status.busy": "2023-11-18T18:05:12.820985Z", + "iopub.status.idle": "2023-11-18T18:06:35.920738Z", + "shell.execute_reply": "2023-11-18T18:06:35.920307Z", + "shell.execute_reply.started": "2023-11-18T18:05:12.821331Z" + }, + "tags": [] + }, + "outputs": [ + { + "name": 
"stderr", + "output_type": "stream", + "text": [ + "Generating testcases...: 100%|██████████| 1/1 [00:00<00:00, 8050.49it/s]\n", + "WARNING:root:[W009] Removing samples where no transformation has been applied:\n", + "[W010] - Test 'add_typo': 21 samples removed out of 500\n", + "[W010] - Test 'dyslexia_word_swap': 85 samples removed out of 500\n", + "[W010] - Test 'add_abbreviation': 60 samples removed out of 500\n", + "[W010] - Test 'add_slangs': 193 samples removed out of 500\n", + "[W010] - Test 'add_ocr_typo': 2 samples removed out of 500\n", + "[W010] - Test 'adjective_synonym_swap': 133 samples removed out of 500\n", + "[W010] - Test 'adjective_antonym_swap': 193 samples removed out of 500\n", + "\n" + ] + }, + { + "data": { + "text/plain": [] + }, + "execution_count": 13, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "seed_value = 42\n", + "random.seed(seed_value)\n", + "harness.generate()" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "metadata": { + "execution": { + "iopub.execute_input": "2023-11-18T18:06:35.921969Z", + "iopub.status.busy": "2023-11-18T18:06:35.921633Z", + "iopub.status.idle": "2023-11-18T18:06:36.567603Z", + "shell.execute_reply": "2023-11-18T18:06:36.567117Z", + "shell.execute_reply.started": "2023-11-18T18:06:35.921953Z" + }, + "tags": [] + }, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " category test_type original_context \\\n", + "0 robustness uppercase - \n", + "1 robustness uppercase - \n", + "2 robustness uppercase - \n", + "3 robustness uppercase - \n", + "4 robustness uppercase - \n", + "... ... ... ... \n", + "4808 robustness adjective_antonym_swap - \n", + "4809 robustness adjective_antonym_swap - \n", + "4810 robustness adjective_antonym_swap - \n", + "4811 robustness adjective_antonym_swap - \n", + "4812 robustness adjective_antonym_swap - \n", + "\n", + " original_question perturbed_context \\\n", + "0 A person wants to start saving money so that t... - \n", + "1 There is most likely going to be fog around:\\n... - \n", + "2 Predators eat\\n\\nA. lions\\nB. humans\\nC. bunni... - \n", + "3 Oak tree seeds are planted and a sidewalk is p... - \n", + "4 An electric car runs on electricity via\\n\\nA. ... - \n", + "... ... ... \n", + "4808 A woman, with a pale complexion, wants to spen... - \n", + "4809 Pasta may be cooked in water when\\n\\nA. the wa... - \n", + "4810 A decrease in diseases\\n\\nA. has no impact on ... - \n", + "4811 When soil is viewed in a scientific way, what ... - \n", + "4812 Some animals use a liquid coming from their sk... - \n", + "\n", + " perturbed_question \n", + "0 A PERSON WANTS TO START SAVING MONEY SO THAT T... \n", + "1 THERE IS MOST LIKELY GOING TO BE FOG AROUND: A... \n", + "2 PREDATORS EAT A. LIONS B. HUMANS C. BUNNIES D.... \n", + "3 OAK TREE SEEDS ARE PLANTED AND A SIDEWALK IS P... \n", + "4 AN ELECTRIC CAR RUNS ON ELECTRICITY VIA A. GAS... \n", + "... ... \n", + "4808 A woman, with a pale complexion, wants to spen... \n", + "4809 Pasta may be raw in water when\\n\\nA. the water... \n", + "4810 A decrease in diseases\\n\\nA. has no impact on ... \n", + "4811 When soil is viewed in a unscientific way, wha... \n", + "4812 Some animals use a gaseous coming from their s... 
\n", + "\n", + "[4813 rows x 6 columns]" + ] + }, + "execution_count": 14, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "harness.testcases()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Saving the test cases" + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "metadata": { + "execution": { + "iopub.execute_input": "2023-11-18T18:09:59.273398Z", + "iopub.status.busy": "2023-11-18T18:09:59.272831Z", + "iopub.status.idle": "2023-11-18T18:09:59.416652Z", + "shell.execute_reply": "2023-11-18T18:09:59.416145Z", + "shell.execute_reply.started": "2023-11-18T18:09:59.273379Z" + }, + "tags": [] + }, + "outputs": [], + "source": [ + "harness.save(\"saved_test_configurations/OpenBookQA\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Running the tests" + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "metadata": { + "execution": { + "iopub.execute_input": "2023-11-18T18:11:39.944027Z", + "iopub.status.busy": "2023-11-18T18:11:39.943560Z", + "iopub.status.idle": "2023-11-18T20:12:56.106580Z", + "shell.execute_reply": "2023-11-18T20:12:56.106104Z", + "shell.execute_reply.started": "2023-11-18T18:11:39.944009Z" + }, + "tags": [] + }, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "Running testcases... 
: 100%|██████████| 4813/4813 [2:01:16<00:00, 1.51s/it] \n" + ] + }, + { + "data": { + "text/plain": [] + }, + "execution_count": 17, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "harness.run()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Saving the model response (expected_result and actual_result)" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": { + "execution": { + "iopub.execute_input": "2023-11-18T20:13:14.108026Z", + "iopub.status.busy": "2023-11-18T20:13:14.107604Z", + "iopub.status.idle": "2023-11-18T20:13:14.327386Z", + "shell.execute_reply": "2023-11-18T20:13:14.326836Z", + "shell.execute_reply.started": "2023-11-18T20:13:14.108009Z" + }, + "tags": [] + }, + "outputs": [], + "source": [ + "harness.save(save_dir=r\"ai21/j2-jumbo-instruct-OpenBookQA\", include_generated_results=True)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Generated Results" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": { + "execution": { + "iopub.execute_input": "2023-11-18T20:13:15.830885Z", + "iopub.status.busy": "2023-11-18T20:13:15.830369Z", + "iopub.status.idle": "2023-11-18T20:25:22.764291Z", + "shell.execute_reply": "2023-11-18T20:25:22.763757Z", + "shell.execute_reply.started": "2023-11-18T20:13:15.830868Z" + }, + "tags": [] + }, + "outputs": [], + "source": [ + "generated_results = harness.generated_results()" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": { + "execution": { + "iopub.execute_input": "2023-11-18T20:25:22.765873Z", + "iopub.status.busy": "2023-11-18T20:25:22.765446Z", + "iopub.status.idle": "2023-11-18T20:25:22.775263Z", + "shell.execute_reply": "2023-11-18T20:25:22.774770Z", + "shell.execute_reply.started": "2023-11-18T20:25:22.765849Z" + }, + "tags": [] + }, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " category test_type original_context \\\n", + "0 robustness uppercase - \n", + "1 robustness uppercase - \n", + "2 robustness uppercase - \n", + "3 robustness uppercase - \n", + "4 robustness uppercase - \n", + "... ... ... ... \n", + "4808 robustness adjective_antonym_swap - \n", + "4809 robustness adjective_antonym_swap - \n", + "4810 robustness adjective_antonym_swap - \n", + "4811 robustness adjective_antonym_swap - \n", + "4812 robustness adjective_antonym_swap - \n", + "\n", + " original_question \\\n", + "0 A person wants to start saving money so that they can afford a nice vacation at the end of the year. After looking over their budget and expenses, they decide the best way to save money is to\\n\\nA. make more phone calls\\nB. quit eating lunch out\\nC. buy less with monopoly money\\nD. have lunch with friends \n", + "1 There is most likely going to be fog around:\\n\\nA. a marsh\\nB. a tundra\\nC. the plains\\nD. a desert \n", + "2 Predators eat\\n\\nA. lions\\nB. humans\\nC. bunnies\\nD. grass \n", + "3 Oak tree seeds are planted and a sidewalk is paved right next to that spot, until eventually, the tree is tall and the roots must extend past the sidewalk, which means\\n\\nA. roots may be split\\nB. roots may begin to die\\nC. parts may break the concrete\\nD. roots may fall apart \n", + "4 An electric car runs on electricity via\\n\\nA. gasoline\\nB. a power station\\nC. electrical conductors\\nD. fuel \n", + "... ... \n", + "4808 A woman, with a pale complexion, wants to spend the bright, sunny day at the beach. She makes sure that she stops at the store to pick up some sunblock before she begins to enjoy her day filled with sand and surf. She applies the sunblock carefully and thoroughly, because she knows that\\n\\nA. UV rays are harmful\\nB. sunlight will be fun\\nC. the sun is close\\nD. the sun is in space \n", + "4809 Pasta may be cooked in water when\\n\\nA. the water is warm\\nB. the water is on the stove\\nC. 
water is bubbling from applied warmth\\nD. the pasta is very fresh \n", + "4810 A decrease in diseases\\n\\nA. has no impact on a population\\nB. leads to more sick people\\nC. leads to less sick people\\nD. leads to an uptick in emergency room visits \n", + "4811 When soil is viewed in a scientific way, what is seen and viewed is actually\\n\\nA. insects like big beetles\\nB. tiny lifeforms in dirt\\nC. small mammals living there\\nD. a lot of tiny pebbles \n", + "4812 Some animals use a liquid coming from their skin to adjust to\\n\\nA. cold\\nB. water\\nC. heat\\nD. humidity \n", + "\n", + " perturbed_context \\\n", + "0 - \n", + "1 - \n", + "2 - \n", + "3 - \n", + "4 - \n", + "... ... \n", + "4808 - \n", + "4809 - \n", + "4810 - \n", + "4811 - \n", + "4812 - \n", + "\n", + " perturbed_question \\\n", + "0 A PERSON WANTS TO START SAVING MONEY SO THAT THEY CAN AFFORD A NICE VACATION AT THE END OF THE YEAR. AFTER LOOKING OVER THEIR BUDGET AND EXPENSES, THEY DECIDE THE BEST WAY TO SAVE MONEY IS TO A. MAKE MORE PHONE CALLS B. QUIT EATING LUNCH OUT C. BUY LESS WITH MONOPOLY MONEY D. HAVE LUNCH WITH FRIENDS \n", + "1 THERE IS MOST LIKELY GOING TO BE FOG AROUND: A. A MARSH B. A TUNDRA C. THE PLAINS D. A DESERT \n", + "2 PREDATORS EAT A. LIONS B. HUMANS C. BUNNIES D. GRASS \n", + "3 OAK TREE SEEDS ARE PLANTED AND A SIDEWALK IS PAVED RIGHT NEXT TO THAT SPOT, UNTIL EVENTUALLY, THE TREE IS TALL AND THE ROOTS MUST EXTEND PAST THE SIDEWALK, WHICH MEANS A. ROOTS MAY BE SPLIT B. ROOTS MAY BEGIN TO DIE C. PARTS MAY BREAK THE CONCRETE D. ROOTS MAY FALL APART \n", + "4 AN ELECTRIC CAR RUNS ON ELECTRICITY VIA A. GASOLINE B. A POWER STATION C. ELECTRICAL CONDUCTORS D. FUEL \n", + "... ... \n", + "4808 A woman, with a pale complexion, wants to spend the dull, sunny day at the beach. She makes sure that she stops at the store to pick up no sunblock before she begins to enjoy her day filled with sand and surf. 
She applies the sunblock carefully and thoroughly, because she knows that\\n\\nA. UV rays are harmless\\nB. sunlight will be fun\\nC. the sun is distant\\nD. the sun is in space \n", + "4809 Pasta may be raw in water when\\n\\nA. the water is cool\\nB. the water is on the stove\\nC. water is bubbling from applied warmth\\nD. the pasta is very salty \n", + "4810 A decrease in diseases\\n\\nA. has no impact on a population\\nB. leads to more well people\\nC. leads to less well people\\nD. leads to an uptick in emergency room visits \n", + "4811 When soil is viewed in a unscientific way, what is seen and viewed is actually\\n\\nA. insects like small beetles\\nB. tiny lifeforms in dirt\\nC. big mammals living there\\nD. a lot of tiny pebbles \n", + "4812 Some animals use a gaseous coming from their skin to adjust to\\n\\nA. hot\\nB. water\\nC. heat\\nD. humidity \n", + "\n", + " expected_result \\\n", + "0 B. quit eating lunch out \n", + "1 A marsh \n", + "2 B. humans \n", + "3 A. roots may be split \n", + "4 B. a power station \n", + "... ... \n", + "4808 A. UV rays are harmful \n", + "4809 A. the water is warm \n", + "4810 C. leads to less sick people \n", + "4811 B. tiny lifeforms in dirt \n", + "4812 humidity \n", + "\n", + " actual_result \\\n", + "0 B. QUIT EATING LUNCH OUT \n", + "1 C. THE PLAINS\\n Explanation: PLAINS are flat and open areas, and fog is often formed on flat and open areas. \n", + "2 LIONS \n", + "3 A. ROOTS MAY BE SPLIT \n", + "4 A POWER STATION \n", + "... ... \n", + "4808 D. the sun is in space \n", + "4809 C. water is bubbling from applied warmth \n", + "4810 B. leads to more well people \n", + "4811 B. tiny lifeforms in dirt \n", + "4812 heat\\nD. humidity \n", + "\n", + " pass \n", + "0 True \n", + "1 True \n", + "2 False \n", + "3 True \n", + "4 True \n", + "... ... 
\n", + "4808 False \n", + "4809 False \n", + "4810 False \n", + "4811 True \n", + "4812 False \n", + "\n", + "[4813 rows x 9 columns]" + ] + }, + "execution_count": 10, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "generated_results" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Final Results" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": { + "execution": { + "iopub.execute_input": "2023-11-18T20:25:22.776297Z", + "iopub.status.busy": "2023-11-18T20:25:22.775861Z", + "iopub.status.idle": "2023-11-18T20:37:37.798164Z", + "shell.execute_reply": "2023-11-18T20:37:37.797644Z", + "shell.execute_reply.started": "2023-11-18T20:25:22.776275Z" + }, + "tags": [] + }, + "outputs": [], + "source": [ + "report = harness.report()" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": { + "execution": { + "iopub.execute_input": "2023-11-18T20:37:37.799667Z", + "iopub.status.busy": "2023-11-18T20:37:37.799288Z", + "iopub.status.idle": "2023-11-18T20:37:37.805900Z", + "shell.execute_reply": "2023-11-18T20:37:37.805483Z", + "shell.execute_reply.started": "2023-11-18T20:37:37.799651Z" + }, + "tags": [] + }, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " category test_type fail_count pass_count pass_rate \\\n", + "0 robustness uppercase 208 292 58% \n", + "1 robustness lowercase 185 315 63% \n", + "2 robustness titlecase 199 301 60% \n", + "3 robustness add_typo 126 353 74% \n", + "4 robustness dyslexia_word_swap 142 273 66% \n", + "5 robustness add_abbreviation 183 257 58% \n", + "6 robustness add_slangs 126 181 59% \n", + "7 robustness add_speech_to_text_typo 110 390 78% \n", + "8 robustness add_ocr_typo 246 252 51% \n", + "9 robustness adjective_synonym_swap 139 228 62% \n", + "10 robustness adjective_antonym_swap 120 187 61% \n", + "\n", + " minimum_pass_rate pass \n", + "0 75% False \n", + "1 75% False \n", + "2 75% False \n", + "3 75% False \n", + "4 75% False \n", + "5 75% False \n", + "6 75% False \n", + "7 75% True \n", + "8 75% False \n", + "9 75% False \n", + "10 75% False " + ] + }, + "execution_count": 12, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "report" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Saving report and generated_results" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": { + "execution": { + "iopub.execute_input": "2023-11-18T20:37:37.806502Z", + "iopub.status.busy": "2023-11-18T20:37:37.806366Z", + "iopub.status.idle": "2023-11-18T20:37:37.960314Z", + "shell.execute_reply": "2023-11-18T20:37:37.959801Z", + "shell.execute_reply.started": "2023-11-18T20:37:37.806490Z" + }, + "tags": [] + }, + "outputs": [], + "source": [ + "generated_results.to_csv('ai21/j2-jumbo-instruct-OpenBookQA.csv', index=False)" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "metadata": { + "execution": { + "iopub.execute_input": "2023-11-18T20:37:37.961122Z", + "iopub.status.busy": "2023-11-18T20:37:37.960972Z", + "iopub.status.idle": "2023-11-18T20:37:37.973091Z", + "shell.execute_reply": "2023-11-18T20:37:37.972649Z", + "shell.execute_reply.started": 
"2023-11-18T20:37:37.961107Z" + }, + "tags": [] + }, + "outputs": [], + "source": [ + "report.to_csv('ai21/j2-jumbo-instruct-OpenBookQA-report.csv', index=False)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Model J2-Grande-Instruct" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Setup and Configure Harness" + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "metadata": { + "execution": { + "iopub.execute_input": "2023-11-19T02:52:50.406145Z", + "iopub.status.busy": "2023-11-19T02:52:50.405784Z", + "iopub.status.idle": "2023-11-19T02:52:50.499232Z", + "shell.execute_reply": "2023-11-19T02:52:50.498796Z", + "shell.execute_reply.started": "2023-11-19T02:52:50.406131Z" + }, + "tags": [] + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Test Configuration : \n", + " {\n", + " \"model_parameters\": {\n", + " \"temperature\": 0.2,\n", + " \"maxTokens\": 64\n", + " },\n", + " \"tests\": {\n", + " \"defaults\": {\n", + " \"min_pass_rate\": 1.0\n", + " },\n", + " \"robustness\": {\n", + " \"add_typo\": {\n", + " \"min_pass_rate\": 0.7\n", + " },\n", + " \"lowercase\": {\n", + " \"min_pass_rate\": 0.7\n", + " }\n", + " }\n", + " }\n", + "}\n" + ] + } + ], + "source": [ + "harness = Harness(\n", + " task=\"question-answering\",\n", + " model={\"model\": \"j2-grande-instruct\", \"hub\":\"ai21\"},\n", + " data={\"data_source\" :\"OpenBookQA\",\n", + " \"split\":\"test\"}\n", + " )" + ] + }, + { + "cell_type": "code", + "execution_count": 18, + "metadata": { + "execution": { + "iopub.execute_input": "2023-11-19T02:52:50.499994Z", + "iopub.status.busy": "2023-11-19T02:52:50.499751Z", + "iopub.status.idle": "2023-11-19T02:52:50.537221Z", + "shell.execute_reply": "2023-11-19T02:52:50.536795Z", + "shell.execute_reply.started": "2023-11-19T02:52:50.499979Z" + }, + "tags": [] + }, + "outputs": [ + { + "data": { + "text/plain": [ + "{'evaluation': {'metric': 
'QAEvalChain',\n", + " 'model': 'gpt-3.5-turbo-instruct',\n", + " 'hub': 'openai'},\n", + " 'tests': {'defaults': {'min_pass_rate': 0.65},\n", + " 'robustness': {'uppercase': {'min_pass_rate': 0.75},\n", + " 'lowercase': {'min_pass_rate': 0.75},\n", + " 'titlecase': {'min_pass_rate': 0.75},\n", + " 'add_typo': {'min_pass_rate': 0.75},\n", + " 'dyslexia_word_swap': {'min_pass_rate': 0.75},\n", + " 'add_abbreviation': {'min_pass_rate': 0.75},\n", + " 'add_slangs': {'min_pass_rate': 0.75},\n", + " 'add_speech_to_text_typo': {'min_pass_rate': 0.75},\n", + " 'add_ocr_typo': {'min_pass_rate': 0.75},\n", + " 'adjective_synonym_swap': {'min_pass_rate': 0.75},\n", + " 'adjective_antonym_swap': {'min_pass_rate': 0.75}}}}" + ] + }, + "execution_count": 18, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "harness.configure(\n", + "{\n", + " \"evaluation\": {\"metric\":\"QAEvalChain\",\"model\":\"gpt-3.5-turbo-instruct\",\"hub\":\"openai\"},\n", + " 'tests': {'defaults': {'min_pass_rate': 0.65},\n", + " 'robustness': {'uppercase': {'min_pass_rate': 0.75},\n", + " 'lowercase':{'min_pass_rate': 0.75},\n", + " 'titlecase':{'min_pass_rate': 0.75},\n", + " 'add_typo':{'min_pass_rate': 0.75},\n", + " 'dyslexia_word_swap':{'min_pass_rate': 0.75},\n", + " 'add_abbreviation':{'min_pass_rate': 0.75},\n", + " 'add_slangs':{'min_pass_rate': 0.75},\n", + " 'add_speech_to_text_typo':{'min_pass_rate': 0.75},\n", + " 'add_ocr_typo':{'min_pass_rate': 0.75},\n", + " 'adjective_synonym_swap':{'min_pass_rate': 0.75},\n", + " 'adjective_antonym_swap':{'min_pass_rate': 0.75}\n", + " }\n", + " }\n", + " }\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Generating the test cases." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": { + "execution": { + "iopub.execute_input": "2023-11-19T02:52:58.776662Z", + "iopub.status.busy": "2023-11-19T02:52:58.776231Z", + "iopub.status.idle": "2023-11-19T02:54:21.640101Z", + "shell.execute_reply": "2023-11-19T02:54:21.639624Z", + "shell.execute_reply.started": "2023-11-19T02:52:58.776644Z" + }, + "tags": [] + }, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "Generating testcases...: 100%|██████████| 1/1 [00:00<00:00, 8867.45it/s]\n", + "WARNING:root:[W009] Removing samples where no transformation has been applied:\n", + "[W010] - Test 'add_typo': 21 samples removed out of 500\n", + "[W010] - Test 'dyslexia_word_swap': 85 samples removed out of 500\n", + "[W010] - Test 'add_abbreviation': 60 samples removed out of 500\n", + "[W010] - Test 'add_slangs': 193 samples removed out of 500\n", + "[W010] - Test 'add_ocr_typo': 2 samples removed out of 500\n", + "[W010] - Test 'adjective_synonym_swap': 133 samples removed out of 500\n", + "[W010] - Test 'adjective_antonym_swap': 193 samples removed out of 500\n", + "\n" + ] + }, + { + "data": { + "text/plain": [] + }, + "execution_count": 9, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "seed_value = 42\n", + "random.seed(seed_value)\n", + "harness.generate()" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": { + "execution": { + "iopub.execute_input": "2023-11-19T02:54:21.641296Z", + "iopub.status.busy": "2023-11-19T02:54:21.640964Z", + "iopub.status.idle": "2023-11-19T02:54:22.267412Z", + "shell.execute_reply": "2023-11-19T02:54:22.266940Z", + "shell.execute_reply.started": "2023-11-19T02:54:21.641280Z" + }, + "tags": [] + }, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " category test_type original_context \\\n", + "0 robustness uppercase - \n", + "1 robustness uppercase - \n", + "2 robustness uppercase - \n", + "3 robustness uppercase - \n", + "4 robustness uppercase - \n", + "... ... ... ... \n", + "4808 robustness adjective_antonym_swap - \n", + "4809 robustness adjective_antonym_swap - \n", + "4810 robustness adjective_antonym_swap - \n", + "4811 robustness adjective_antonym_swap - \n", + "4812 robustness adjective_antonym_swap - \n", + "\n", + " original_question perturbed_context \\\n", + "0 A person wants to start saving money so that t... - \n", + "1 There is most likely going to be fog around:\\n... - \n", + "2 Predators eat\\n\\nA. lions\\nB. humans\\nC. bunni... - \n", + "3 Oak tree seeds are planted and a sidewalk is p... - \n", + "4 An electric car runs on electricity via\\n\\nA. ... - \n", + "... ... ... \n", + "4808 A woman, with a pale complexion, wants to spen... - \n", + "4809 Pasta may be cooked in water when\\n\\nA. the wa... - \n", + "4810 A decrease in diseases\\n\\nA. has no impact on ... - \n", + "4811 When soil is viewed in a scientific way, what ... - \n", + "4812 Some animals use a liquid coming from their sk... - \n", + "\n", + " perturbed_question \n", + "0 A PERSON WANTS TO START SAVING MONEY SO THAT T... \n", + "1 THERE IS MOST LIKELY GOING TO BE FOG AROUND: A... \n", + "2 PREDATORS EAT A. LIONS B. HUMANS C. BUNNIES D.... \n", + "3 OAK TREE SEEDS ARE PLANTED AND A SIDEWALK IS P... \n", + "4 AN ELECTRIC CAR RUNS ON ELECTRICITY VIA A. GAS... \n", + "... ... \n", + "4808 A woman, with a pale complexion, wants to spen... \n", + "4809 Pasta may be raw in water when\\n\\nA. the water... \n", + "4810 A decrease in diseases\\n\\nA. has no impact on ... \n", + "4811 When soil is viewed in a unscientific way, wha... \n", + "4812 Some animals use a gaseous coming from their s... 
\n", + "\n", + "[4813 rows x 6 columns]" + ] + }, + "execution_count": 10, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "harness.testcases()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Running the tests" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": { + "execution": { + "iopub.execute_input": "2023-11-19T05:16:16.321434Z", + "iopub.status.busy": "2023-11-19T05:16:16.320877Z", + "iopub.status.idle": "2023-11-19T05:17:03.000175Z", + "shell.execute_reply": "2023-11-19T05:17:02.999676Z", + "shell.execute_reply.started": "2023-11-19T05:16:16.321415Z" + }, + "tags": [] + }, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "Running testcases... : 100%|██████████| 4813/4813 [00:46<00:00, 103.12it/s] \n" + ] + }, + { + "data": { + "text/plain": [] + }, + "execution_count": 12, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "harness.run()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Saving model response (expected_result and actual_result)" + ] + }, + { + "cell_type": "code", + "execution_count": 21, + "metadata": { + "execution": { + "iopub.execute_input": "2023-11-19T05:18:22.077170Z", + "iopub.status.busy": "2023-11-19T05:18:22.076627Z", + "iopub.status.idle": "2023-11-19T05:18:24.134010Z", + "shell.execute_reply": "2023-11-19T05:18:24.133499Z", + "shell.execute_reply.started": "2023-11-19T05:18:22.077151Z" + }, + "tags": [] + }, + "outputs": [], + "source": [ + "harness.save(save_dir=r\"ai21/j2-grande-instruct-OpenBookQA\", include_generated_results=True)" + ] + }, + { + "cell_type": "code", + "execution_count": 23, + "metadata": { + "execution": { + "iopub.execute_input": "2023-11-19T05:18:24.135027Z", + "iopub.status.busy": "2023-11-19T05:18:24.134874Z", + "iopub.status.idle": "2023-11-19T05:28:33.160576Z", + "shell.execute_reply": "2023-11-19T05:28:33.160046Z", +
"shell.execute_reply.started": "2023-11-19T05:18:24.135013Z" + }, + "tags": [] + }, + "outputs": [], + "source": [ + "generated_results = harness.generated_results()" + ] + }, + { + "cell_type": "code", + "execution_count": 24, + "metadata": { + "execution": { + "iopub.execute_input": "2023-11-19T05:28:33.161660Z", + "iopub.status.busy": "2023-11-19T05:28:33.161231Z", + "iopub.status.idle": "2023-11-19T05:28:33.169835Z", + "shell.execute_reply": "2023-11-19T05:28:33.169426Z", + "shell.execute_reply.started": "2023-11-19T05:28:33.161643Z" + }, + "tags": [] + }, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " category test_type original_context \\\n", + "0 robustness uppercase - \n", + "1 robustness uppercase - \n", + "2 robustness uppercase - \n", + "3 robustness uppercase - \n", + "4 robustness uppercase - \n", + "... ... ... ... \n", + "4808 robustness adjective_antonym_swap - \n", + "4809 robustness adjective_antonym_swap - \n", + "4810 robustness adjective_antonym_swap - \n", + "4811 robustness adjective_antonym_swap - \n", + "4812 robustness adjective_antonym_swap - \n", + "\n", + " original_question \\\n", + "0 A person wants to start saving money so that they can afford a nice vacation at the end of the year. After looking over their budget and expenses, they decide the best way to save money is to\\n\\nA. make more phone calls\\nB. quit eating lunch out\\nC. buy less with monopoly money\\nD. have lunch with friends \n", + "1 There is most likely going to be fog around:\\n\\nA. a marsh\\nB. a tundra\\nC. the plains\\nD. a desert \n", + "2 Predators eat\\n\\nA. lions\\nB. humans\\nC. bunnies\\nD. grass \n", + "3 Oak tree seeds are planted and a sidewalk is paved right next to that spot, until eventually, the tree is tall and the roots must extend past the sidewalk, which means\\n\\nA. roots may be split\\nB. roots may begin to die\\nC. parts may break the concrete\\nD. roots may fall apart \n", + "4 An electric car runs on electricity via\\n\\nA. gasoline\\nB. a power station\\nC. electrical conductors\\nD. fuel \n", + "... ... \n", + "4808 A woman, with a pale complexion, wants to spend the bright, sunny day at the beach. She makes sure that she stops at the store to pick up some sunblock before she begins to enjoy her day filled with sand and surf. She applies the sunblock carefully and thoroughly, because she knows that\\n\\nA. UV rays are harmful\\nB. sunlight will be fun\\nC. the sun is close\\nD. the sun is in space \n", + "4809 Pasta may be cooked in water when\\n\\nA. the water is warm\\nB. the water is on the stove\\nC. 
water is bubbling from applied warmth\\nD. the pasta is very fresh \n", + "4810 A decrease in diseases\\n\\nA. has no impact on a population\\nB. leads to more sick people\\nC. leads to less sick people\\nD. leads to an uptick in emergency room visits \n", + "4811 When soil is viewed in a scientific way, what is seen and viewed is actually\\n\\nA. insects like big beetles\\nB. tiny lifeforms in dirt\\nC. small mammals living there\\nD. a lot of tiny pebbles \n", + "4812 Some animals use a liquid coming from their skin to adjust to\\n\\nA. cold\\nB. water\\nC. heat\\nD. humidity \n", + "\n", + " perturbed_context \\\n", + "0 - \n", + "1 - \n", + "2 - \n", + "3 - \n", + "4 - \n", + "... ... \n", + "4808 - \n", + "4809 - \n", + "4810 - \n", + "4811 - \n", + "4812 - \n", + "\n", + " perturbed_question \\\n", + "0 A PERSON WANTS TO START SAVING MONEY SO THAT THEY CAN AFFORD A NICE VACATION AT THE END OF THE YEAR. AFTER LOOKING OVER THEIR BUDGET AND EXPENSES, THEY DECIDE THE BEST WAY TO SAVE MONEY IS TO A. MAKE MORE PHONE CALLS B. QUIT EATING LUNCH OUT C. BUY LESS WITH MONOPOLY MONEY D. HAVE LUNCH WITH FRIENDS \n", + "1 THERE IS MOST LIKELY GOING TO BE FOG AROUND: A. A MARSH B. A TUNDRA C. THE PLAINS D. A DESERT \n", + "2 PREDATORS EAT A. LIONS B. HUMANS C. BUNNIES D. GRASS \n", + "3 OAK TREE SEEDS ARE PLANTED AND A SIDEWALK IS PAVED RIGHT NEXT TO THAT SPOT, UNTIL EVENTUALLY, THE TREE IS TALL AND THE ROOTS MUST EXTEND PAST THE SIDEWALK, WHICH MEANS A. ROOTS MAY BE SPLIT B. ROOTS MAY BEGIN TO DIE C. PARTS MAY BREAK THE CONCRETE D. ROOTS MAY FALL APART \n", + "4 AN ELECTRIC CAR RUNS ON ELECTRICITY VIA A. GASOLINE B. A POWER STATION C. ELECTRICAL CONDUCTORS D. FUEL \n", + "... ... \n", + "4808 A woman, with a pale complexion, wants to spend the dull, sunny day at the beach. She makes sure that she stops at the store to pick up no sunblock before she begins to enjoy her day filled with sand and surf. 
She applies the sunblock carefully and thoroughly, because she knows that\\n\\nA. UV rays are harmless\\nB. sunlight will be fun\\nC. the sun is distant\\nD. the sun is in space \n", + "4809 Pasta may be raw in water when\\n\\nA. the water is cool\\nB. the water is on the stove\\nC. water is bubbling from applied warmth\\nD. the pasta is very salty \n", + "4810 A decrease in diseases\\n\\nA. has no impact on a population\\nB. leads to more well people\\nC. leads to less well people\\nD. leads to an uptick in emergency room visits \n", + "4811 When soil is viewed in a unscientific way, what is seen and viewed is actually\\n\\nA. insects like small beetles\\nB. tiny lifeforms in dirt\\nC. big mammals living there\\nD. a lot of tiny pebbles \n", + "4812 Some animals use a gaseous coming from their skin to adjust to\\n\\nA. hot\\nB. water\\nC. heat\\nD. humidity \n", + "\n", + " expected_result \\\n", + "0 B. quit eating lunch out \n", + "1 a marsh \n", + "2 C. bunnies \n", + "3 C. parts may break the concrete \n", + "4 C. electrical conductors \n", + "... ... \n", + "4808 A. UV rays are harmful \n", + "4809 the water is on the stove \n", + "4810 C. leads to less sick people \n", + "4811 C. small mammals living there \n", + "4812 B. water \n", + "\n", + " actual_result \\\n", + "0 C. BUY LESS WITH MONOPOLY MONEY\\n\\nThe best way to save money in this situation is to buy less with Monopoly money. Making more phone calls or quitting eating out will not directly result in saving money. Having lunch with friends is not a cost-effective way to save money, as eating out is typically more expensive than preparing meals at home. \n", + "1 A MARSH \n", + "2 LIONS \n", + "3 \\nD. ROOTS MAY FALL APART \n", + "4 A POWER STATION \n", + "... ... \n", + "4808 D. the sun is in space \n", + "4809 C. water is bubbling from applied warmth \n", + "4810 B. leads to more well people \n", + "4811 B. 
tiny lifeforms in dirt \n", + "4812 heat \n", + "\n", + " pass \n", + "0 False \n", + "1 True \n", + "2 False \n", + "3 False \n", + "4 False \n", + "... ... \n", + "4808 False \n", + "4809 False \n", + "4810 False \n", + "4811 False \n", + "4812 False \n", + "\n", + "[4813 rows x 9 columns]" + ] + }, + "execution_count": 24, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "generated_results" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Final Results" + ] + }, + { + "cell_type": "code", + "execution_count": 25, + "metadata": { + "execution": { + "iopub.execute_input": "2023-11-19T05:28:33.171106Z", + "iopub.status.busy": "2023-11-19T05:28:33.170828Z", + "iopub.status.idle": "2023-11-19T05:38:33.872288Z", + "shell.execute_reply": "2023-11-19T05:38:33.871755Z", + "shell.execute_reply.started": "2023-11-19T05:28:33.171091Z" + }, + "tags": [] + }, + "outputs": [], + "source": [ + "report = harness.report()" + ] + }, + { + "cell_type": "code", + "execution_count": 26, + "metadata": { + "execution": { + "iopub.execute_input": "2023-11-19T05:38:33.873084Z", + "iopub.status.busy": "2023-11-19T05:38:33.872925Z", + "iopub.status.idle": "2023-11-19T05:38:33.879434Z", + "shell.execute_reply": "2023-11-19T05:38:33.879026Z", + "shell.execute_reply.started": "2023-11-19T05:38:33.873069Z" + }, + "tags": [] + }, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " category test_type fail_count pass_count pass_rate \\\n", + "0 robustness uppercase 237 263 53% \n", + "1 robustness lowercase 220 280 56% \n", + "2 robustness titlecase 209 291 58% \n", + "3 robustness add_typo 101 378 79% \n", + "4 robustness dyslexia_word_swap 142 273 66% \n", + "5 robustness add_abbreviation 162 278 63% \n", + "6 robustness add_slangs 112 195 64% \n", + "7 robustness add_speech_to_text_typo 82 418 84% \n", + "8 robustness add_ocr_typo 228 270 54% \n", + "9 robustness adjective_synonym_swap 128 239 65% \n", + "10 robustness adjective_antonym_swap 130 177 58% \n", + "\n", + " minimum_pass_rate pass \n", + "0 75% False \n", + "1 75% False \n", + "2 75% False \n", + "3 75% True \n", + "4 75% False \n", + "5 75% False \n", + "6 75% False \n", + "7 75% True \n", + "8 75% False \n", + "9 75% False \n", + "10 75% False " + ] + }, + "execution_count": 26, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "report" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Saving report and generated_results" + ] + }, + { + "cell_type": "code", + "execution_count": 27, + "metadata": { + "execution": { + "iopub.execute_input": "2023-11-19T05:38:33.880259Z", + "iopub.status.busy": "2023-11-19T05:38:33.879966Z", + "iopub.status.idle": "2023-11-19T05:38:34.647325Z", + "shell.execute_reply": "2023-11-19T05:38:34.646791Z", + "shell.execute_reply.started": "2023-11-19T05:38:33.880245Z" + }, + "tags": [] + }, + "outputs": [], + "source": [ + "generated_results.to_csv('ai21/j2-grande-instruct-OpenBookQA.csv', index=False)" + ] + }, + { + "cell_type": "code", + "execution_count": 28, + "metadata": { + "execution": { + "iopub.execute_input": "2023-11-19T05:38:34.648122Z", + "iopub.status.busy": "2023-11-19T05:38:34.647973Z", + "iopub.status.idle": "2023-11-19T05:38:34.659467Z", + "shell.execute_reply": "2023-11-19T05:38:34.659072Z", + "shell.execute_reply.started": 
"2023-11-19T05:38:34.648108Z" + }, + "tags": [] + }, + "outputs": [], + "source": [ + "report.to_csv('ai21/j2-grande-instruct-OpenBookQA-report.csv', index=False)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Model text-davinci-003" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Setup and Configure Harness" + ] + }, + { + "cell_type": "code", + "execution_count": 29, + "metadata": { + "execution": { + "iopub.execute_input": "2023-11-20T06:35:15.430303Z", + "iopub.status.busy": "2023-11-20T06:35:15.430073Z", + "iopub.status.idle": "2023-11-20T06:35:15.549985Z", + "shell.execute_reply": "2023-11-20T06:35:15.549533Z", + "shell.execute_reply.started": "2023-11-20T06:35:15.430288Z" + }, + "tags": [] + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Test Configuration : \n", + " {\n", + " \"model_parameters\": {\n", + " \"temperature\": 0.2,\n", + " \"max_tokens\": 64\n", + " },\n", + " \"tests\": {\n", + " \"defaults\": {\n", + " \"min_pass_rate\": 1.0\n", + " },\n", + " \"robustness\": {\n", + " \"add_typo\": {\n", + " \"min_pass_rate\": 0.7\n", + " },\n", + " \"lowercase\": {\n", + " \"min_pass_rate\": 0.7\n", + " }\n", + " }\n", + " }\n", + "}\n" + ] + } + ], + "source": [ + "harness = Harness(\n", + " task=\"question-answering\",\n", + " model={\"model\": \"text-davinci-003\", \"hub\":\"openai\"},\n", + " data={\"data_source\" :\"OpenBookQA\",\n", + " \"split\":\"test\"}\n", + " )" + ] + }, + { + "cell_type": "code", + "execution_count": 30, + "metadata": { + "execution": { + "iopub.execute_input": "2023-11-20T06:35:15.550674Z", + "iopub.status.busy": "2023-11-20T06:35:15.550507Z", + "iopub.status.idle": "2023-11-20T06:35:15.585769Z", + "shell.execute_reply": "2023-11-20T06:35:15.585327Z", + "shell.execute_reply.started": "2023-11-20T06:35:15.550659Z" + }, + "tags": [] + }, + "outputs": [ + { + "data": { + "text/plain": [ + "{'evaluation': {'metric': 
'QAEvalChain',\n", + " 'model': 'gpt-3.5-turbo-instruct',\n", + " 'hub': 'openai'},\n", + " 'tests': {'defaults': {'min_pass_rate': 0.65},\n", + " 'robustness': {'uppercase': {'min_pass_rate': 0.75},\n", + " 'lowercase': {'min_pass_rate': 0.75},\n", + " 'titlecase': {'min_pass_rate': 0.75},\n", + " 'add_typo': {'min_pass_rate': 0.75},\n", + " 'dyslexia_word_swap': {'min_pass_rate': 0.75},\n", + " 'add_abbreviation': {'min_pass_rate': 0.75},\n", + " 'add_slangs': {'min_pass_rate': 0.75},\n", + " 'add_speech_to_text_typo': {'min_pass_rate': 0.75},\n", + " 'add_ocr_typo': {'min_pass_rate': 0.75},\n", + " 'adjective_synonym_swap': {'min_pass_rate': 0.75},\n", + " 'adjective_antonym_swap': {'min_pass_rate': 0.75}}}}" + ] + }, + "execution_count": 30, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "harness.configure(\n", + "{\n", + " \"evaluation\": {\"metric\":\"QAEvalChain\",\"model\":\"gpt-3.5-turbo-instruct\",\"hub\":\"openai\"},\n", + " 'tests': {'defaults': {'min_pass_rate': 0.65},\n", + " 'robustness': {'uppercase': {'min_pass_rate': 0.75},\n", + " 'lowercase':{'min_pass_rate': 0.75},\n", + " 'titlecase':{'min_pass_rate': 0.75},\n", + " 'add_typo':{'min_pass_rate': 0.75},\n", + " 'dyslexia_word_swap':{'min_pass_rate': 0.75},\n", + " 'add_abbreviation':{'min_pass_rate': 0.75},\n", + " 'add_slangs':{'min_pass_rate': 0.75},\n", + " 'add_speech_to_text_typo':{'min_pass_rate': 0.75},\n", + " 'add_ocr_typo':{'min_pass_rate': 0.75},\n", + " 'adjective_synonym_swap':{'min_pass_rate': 0.75},\n", + " 'adjective_antonym_swap':{'min_pass_rate': 0.75}\n", + " }\n", + " }\n", + " }\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Generating the test cases." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": { + "execution": { + "iopub.execute_input": "2023-11-20T06:35:15.587087Z", + "iopub.status.busy": "2023-11-20T06:35:15.586781Z", + "iopub.status.idle": "2023-11-20T06:36:39.295026Z", + "shell.execute_reply": "2023-11-20T06:36:39.294579Z", + "shell.execute_reply.started": "2023-11-20T06:35:15.587072Z" + }, + "tags": [] + }, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "Generating testcases...: 100%|██████████| 1/1 [00:00<00:00, 9118.05it/s]\n", + "WARNING:root:[W009] Removing samples where no transformation has been applied:\n", + "[W010] - Test 'add_typo': 21 samples removed out of 500\n", + "[W010] - Test 'dyslexia_word_swap': 85 samples removed out of 500\n", + "[W010] - Test 'add_abbreviation': 60 samples removed out of 500\n", + "[W010] - Test 'add_slangs': 193 samples removed out of 500\n", + "[W010] - Test 'add_ocr_typo': 2 samples removed out of 500\n", + "[W010] - Test 'adjective_synonym_swap': 133 samples removed out of 500\n", + "[W010] - Test 'adjective_antonym_swap': 193 samples removed out of 500\n", + "\n" + ] + }, + { + "data": { + "text/plain": [] + }, + "execution_count": 9, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "seed_value = 42\n", + "random.seed(seed_value)\n", + "harness.generate()" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": { + "execution": { + "iopub.execute_input": "2023-11-20T06:36:39.295770Z", + "iopub.status.busy": "2023-11-20T06:36:39.295598Z", + "iopub.status.idle": "2023-11-20T06:36:39.734507Z", + "shell.execute_reply": "2023-11-20T06:36:39.733968Z", + "shell.execute_reply.started": "2023-11-20T06:36:39.295755Z" + }, + "tags": [] + }, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " category test_type original_context \\\n", + "0 robustness uppercase - \n", + "1 robustness uppercase - \n", + "2 robustness uppercase - \n", + "3 robustness uppercase - \n", + "4 robustness uppercase - \n", + "... ... ... ... \n", + "4808 robustness adjective_antonym_swap - \n", + "4809 robustness adjective_antonym_swap - \n", + "4810 robustness adjective_antonym_swap - \n", + "4811 robustness adjective_antonym_swap - \n", + "4812 robustness adjective_antonym_swap - \n", + "\n", + " original_question perturbed_context \\\n", + "0 A person wants to start saving money so that t... - \n", + "1 There is most likely going to be fog around:\\n... - \n", + "2 Predators eat\\n\\nA. lions\\nB. humans\\nC. bunni... - \n", + "3 Oak tree seeds are planted and a sidewalk is p... - \n", + "4 An electric car runs on electricity via\\n\\nA. ... - \n", + "... ... ... \n", + "4808 A woman, with a pale complexion, wants to spen... - \n", + "4809 Pasta may be cooked in water when\\n\\nA. the wa... - \n", + "4810 A decrease in diseases\\n\\nA. has no impact on ... - \n", + "4811 When soil is viewed in a scientific way, what ... - \n", + "4812 Some animals use a liquid coming from their sk... - \n", + "\n", + " perturbed_question \n", + "0 A PERSON WANTS TO START SAVING MONEY SO THAT T... \n", + "1 THERE IS MOST LIKELY GOING TO BE FOG AROUND: A... \n", + "2 PREDATORS EAT A. LIONS B. HUMANS C. BUNNIES D.... \n", + "3 OAK TREE SEEDS ARE PLANTED AND A SIDEWALK IS P... \n", + "4 AN ELECTRIC CAR RUNS ON ELECTRICITY VIA A. GAS... \n", + "... ... \n", + "4808 A woman, with a pale complexion, wants to spen... \n", + "4809 Pasta may be raw in water when\\n\\nA. the water... \n", + "4810 A decrease in diseases\\n\\nA. has no impact on ... \n", + "4811 When soil is viewed in a unscientific way, wha... \n", + "4812 Some animals use a gaseous coming from their s... 
\n", + "\n", + "[4813 rows x 6 columns]" + ] + }, + "execution_count": 10, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "harness.testcases()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Running the tests" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": { + "execution": { + "iopub.execute_input": "2023-11-20T06:38:30.993535Z", + "iopub.status.busy": "2023-11-20T06:38:30.993027Z", + "iopub.status.idle": "2023-11-20T07:29:52.999095Z", + "shell.execute_reply": "2023-11-20T07:29:52.998603Z", + "shell.execute_reply.started": "2023-11-20T06:38:30.993513Z" + }, + "tags": [] + }, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "Running testcases... : 88%|████████▊ | 4221/4813 [45:04<06:15, 1.58it/s] WARNING:urllib3.connectionpool:Retrying (Retry(total=1, connect=None, read=None, redirect=None, status=None)) after connection broken by 'SSLError(SSLEOFError(8, 'EOF occurred in violation of protocol (_ssl.c:2396)'))': /v1/completions\n", + "Running testcases... 
: 100%|██████████| 4813/4813 [51:21<00:00, 1.56it/s]\n" + ] + }, + { + "data": { + "text/plain": [] + }, + "execution_count": 11, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "harness.run()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Saving model response (expected_result and actual_result)" + ] + }, + { + "cell_type": "code", + "execution_count": 35, + "metadata": { + "execution": { + "iopub.execute_input": "2023-11-20T07:29:53.000222Z", + "iopub.status.busy": "2023-11-20T07:29:52.999758Z", + "iopub.status.idle": "2023-11-20T07:29:53.220890Z", + "shell.execute_reply": "2023-11-20T07:29:53.220340Z", + "shell.execute_reply.started": "2023-11-20T07:29:53.000205Z" + } + }, + "outputs": [], + "source": [ + "harness.save(save_dir=\"openai/text-davinci-003-OpenBookQA\", include_generated_results=True)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Generated Results" + ] + }, + { + "cell_type": "code", + "execution_count": 37, + "metadata": { + "execution": { + "iopub.execute_input": "2023-11-20T07:29:53.222446Z", + "iopub.status.busy": "2023-11-20T07:29:53.222083Z", + "iopub.status.idle": "2023-11-20T07:35:25.379760Z", + "shell.execute_reply": "2023-11-20T07:35:25.379096Z", + "shell.execute_reply.started": "2023-11-20T07:29:53.222429Z" + } + }, + "outputs": [], + "source": [ + "generated_results = harness.generated_results()" + ] + }, + { + "cell_type": "code", + "execution_count": 38, + "metadata": { + "execution": { + "iopub.execute_input": "2023-11-20T07:35:25.380605Z", + "iopub.status.busy": "2023-11-20T07:35:25.380427Z", + "iopub.status.idle": "2023-11-20T07:35:25.389696Z", + "shell.execute_reply": "2023-11-20T07:35:25.389263Z", + "shell.execute_reply.started": "2023-11-20T07:35:25.380588Z" + } + }, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " category test_type original_context \\\n", + "0 robustness uppercase - \n", + "1 robustness uppercase - \n", + "2 robustness uppercase - \n", + "3 robustness uppercase - \n", + "4 robustness uppercase - \n", + "... ... ... ... \n", + "4808 robustness adjective_antonym_swap - \n", + "4809 robustness adjective_antonym_swap - \n", + "4810 robustness adjective_antonym_swap - \n", + "4811 robustness adjective_antonym_swap - \n", + "4812 robustness adjective_antonym_swap - \n", + "\n", + " original_question \\\n", + "0 A person wants to start saving money so that they can afford a nice vacation at the end of the year. After looking over their budget and expenses, they decide the best way to save money is to\\n\\nA. make more phone calls\\nB. quit eating lunch out\\nC. buy less with monopoly money\\nD. have lunch with friends \n", + "1 There is most likely going to be fog around:\\n\\nA. a marsh\\nB. a tundra\\nC. the plains\\nD. a desert \n", + "2 Predators eat\\n\\nA. lions\\nB. humans\\nC. bunnies\\nD. grass \n", + "3 Oak tree seeds are planted and a sidewalk is paved right next to that spot, until eventually, the tree is tall and the roots must extend past the sidewalk, which means\\n\\nA. roots may be split\\nB. roots may begin to die\\nC. parts may break the concrete\\nD. roots may fall apart \n", + "4 An electric car runs on electricity via\\n\\nA. gasoline\\nB. a power station\\nC. electrical conductors\\nD. fuel \n", + "... ... \n", + "4808 A woman, with a pale complexion, wants to spend the bright, sunny day at the beach. She makes sure that she stops at the store to pick up some sunblock before she begins to enjoy her day filled with sand and surf. She applies the sunblock carefully and thoroughly, because she knows that\\n\\nA. UV rays are harmful\\nB. sunlight will be fun\\nC. the sun is close\\nD. the sun is in space \n", + "4809 Pasta may be cooked in water when\\n\\nA. the water is warm\\nB. the water is on the stove\\nC. 
water is bubbling from applied warmth\\nD. the pasta is very fresh \n", + "4810 A decrease in diseases\\n\\nA. has no impact on a population\\nB. leads to more sick people\\nC. leads to less sick people\\nD. leads to an uptick in emergency room visits \n", + "4811 When soil is viewed in a scientific way, what is seen and viewed is actually\\n\\nA. insects like big beetles\\nB. tiny lifeforms in dirt\\nC. small mammals living there\\nD. a lot of tiny pebbles \n", + "4812 Some animals use a liquid coming from their skin to adjust to\\n\\nA. cold\\nB. water\\nC. heat\\nD. humidity \n", + "\n", + " perturbed_context \\\n", + "0 - \n", + "1 - \n", + "2 - \n", + "3 - \n", + "4 - \n", + "... ... \n", + "4808 - \n", + "4809 - \n", + "4810 - \n", + "4811 - \n", + "4812 - \n", + "\n", + " perturbed_question \\\n", + "0 A PERSON WANTS TO START SAVING MONEY SO THAT THEY CAN AFFORD A NICE VACATION AT THE END OF THE YEAR. AFTER LOOKING OVER THEIR BUDGET AND EXPENSES, THEY DECIDE THE BEST WAY TO SAVE MONEY IS TO A. MAKE MORE PHONE CALLS B. QUIT EATING LUNCH OUT C. BUY LESS WITH MONOPOLY MONEY D. HAVE LUNCH WITH FRIENDS \n", + "1 THERE IS MOST LIKELY GOING TO BE FOG AROUND: A. A MARSH B. A TUNDRA C. THE PLAINS D. A DESERT \n", + "2 PREDATORS EAT A. LIONS B. HUMANS C. BUNNIES D. GRASS \n", + "3 OAK TREE SEEDS ARE PLANTED AND A SIDEWALK IS PAVED RIGHT NEXT TO THAT SPOT, UNTIL EVENTUALLY, THE TREE IS TALL AND THE ROOTS MUST EXTEND PAST THE SIDEWALK, WHICH MEANS A. ROOTS MAY BE SPLIT B. ROOTS MAY BEGIN TO DIE C. PARTS MAY BREAK THE CONCRETE D. ROOTS MAY FALL APART \n", + "4 AN ELECTRIC CAR RUNS ON ELECTRICITY VIA A. GASOLINE B. A POWER STATION C. ELECTRICAL CONDUCTORS D. FUEL \n", + "... ... \n", + "4808 A woman, with a pale complexion, wants to spend the dull, sunny day at the beach. She makes sure that she stops at the store to pick up no sunblock before she begins to enjoy her day filled with sand and surf. 
She applies the sunblock carefully and thoroughly, because she knows that\\n\\nA. UV rays are harmless\\nB. sunlight will be fun\\nC. the sun is distant\\nD. the sun is in space \n", + "4809 Pasta may be raw in water when\\n\\nA. the water is cool\\nB. the water is on the stove\\nC. water is bubbling from applied warmth\\nD. the pasta is very salty \n", + "4810 A decrease in diseases\\n\\nA. has no impact on a population\\nB. leads to more well people\\nC. leads to less well people\\nD. leads to an uptick in emergency room visits \n", + "4811 When soil is viewed in a unscientific way, what is seen and viewed is actually\\n\\nA. insects like small beetles\\nB. tiny lifeforms in dirt\\nC. big mammals living there\\nD. a lot of tiny pebbles \n", + "4812 Some animals use a gaseous coming from their skin to adjust to\\n\\nA. hot\\nB. water\\nC. heat\\nD. humidity \n", + "\n", + " expected_result \\\n", + "0 B. quit eating lunch out \n", + "1 A. a marsh \n", + "2 A. lions \n", + "3 C. parts may break the concrete \n", + "4 C. electrical conductors \n", + "... ... \n", + "4808 A. UV rays are harmful \n", + "4809 C. water is bubbling from applied warmth \n", + "4810 C. leads to less sick people \n", + "4811 B. tiny lifeforms in dirt \n", + "4812 C. Heat \n", + "\n", + " actual_result pass \n", + "0 B. QUIT EATING LUNCH OUT True \n", + "1 A. A Marsh True \n", + "2 A. Lions True \n", + "3 C. PARTS MAY BREAK THE CONCRETE True \n", + "4 C. ELECTRICAL CONDUCTORS True \n", + "... ... ... \n", + "4808 B. sunlight will be fun False \n", + "4809 C. Water is bubbling from applied warmth True \n", + "4810 B. leads to more well people False \n", + "4811 D. a lot of tiny pebbles False \n", + "4812 D. 
humidity False \n", + "\n", + "[4813 rows x 9 columns]" + ] + }, + "execution_count": 38, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "generated_results" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Final Results" + ] + }, + { + "cell_type": "code", + "execution_count": 39, + "metadata": { + "execution": { + "iopub.execute_input": "2023-11-20T07:35:25.390571Z", + "iopub.status.busy": "2023-11-20T07:35:25.390288Z", + "iopub.status.idle": "2023-11-20T07:40:50.893022Z", + "shell.execute_reply": "2023-11-20T07:40:50.892439Z", + "shell.execute_reply.started": "2023-11-20T07:35:25.390555Z" + } + }, + "outputs": [], + "source": [ + "report = harness.report()" + ] + }, + { + "cell_type": "code", + "execution_count": 40, + "metadata": { + "execution": { + "iopub.execute_input": "2023-11-20T07:40:50.893982Z", + "iopub.status.busy": "2023-11-20T07:40:50.893827Z", + "iopub.status.idle": "2023-11-20T07:40:50.900246Z", + "shell.execute_reply": "2023-11-20T07:40:50.899834Z", + "shell.execute_reply.started": "2023-11-20T07:40:50.893967Z" + } + }, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " category test_type fail_count pass_count pass_rate \\\n", + "0 robustness uppercase 83 417 83% \n", + "1 robustness lowercase 77 423 85% \n", + "2 robustness titlecase 77 423 85% \n", + "3 robustness add_typo 47 432 90% \n", + "4 robustness dyslexia_word_swap 81 334 80% \n", + "5 robustness add_abbreviation 84 356 81% \n", + "6 robustness add_slangs 100 207 67% \n", + "7 robustness add_speech_to_text_typo 61 439 88% \n", + "8 robustness add_ocr_typo 73 425 85% \n", + "9 robustness adjective_synonym_swap 105 262 71% \n", + "10 robustness adjective_antonym_swap 103 204 66% \n", + "\n", + " minimum_pass_rate pass \n", + "0 75% True \n", + "1 75% True \n", + "2 75% True \n", + "3 75% True \n", + "4 75% True \n", + "5 75% True \n", + "6 75% False \n", + "7 75% True \n", + "8 75% True \n", + "9 75% False \n", + "10 75% False " + ] + }, + "execution_count": 40, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "report" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Saving report and generated_results" + ] + }, + { + "cell_type": "code", + "execution_count": 41, + "metadata": { + "execution": { + "iopub.execute_input": "2023-11-20T07:40:50.900875Z", + "iopub.status.busy": "2023-11-20T07:40:50.900734Z", + "iopub.status.idle": "2023-11-20T07:40:51.086946Z", + "shell.execute_reply": "2023-11-20T07:40:51.086532Z", + "shell.execute_reply.started": "2023-11-20T07:40:50.900862Z" + } + }, + "outputs": [], + "source": [ + "generated_results.to_csv('openai/text-davinci-003-OpenBookQA.csv', index=False)" + ] + }, + { + "cell_type": "code", + "execution_count": 42, + "metadata": { + "execution": { + "iopub.execute_input": "2023-11-20T07:40:51.093515Z", + "iopub.status.busy": "2023-11-20T07:40:51.093240Z", + "iopub.status.idle": "2023-11-20T07:40:51.105301Z", + "shell.execute_reply": "2023-11-20T07:40:51.104858Z", + "shell.execute_reply.started": "2023-11-20T07:40:51.093499Z" + } + }, + "outputs": 
[], + "source": [ + "report.to_csv('openai/text-davinci-003-OpenBookQA-report.csv', index=False)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Model gpt-3.5-turbo-instruct" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Setup and Configure Harness" + ] + }, + { + "cell_type": "code", + "execution_count": 20, + "metadata": { + "execution": { + "iopub.execute_input": "2023-11-20T07:44:19.774099Z", + "iopub.status.busy": "2023-11-20T07:44:19.773588Z", + "iopub.status.idle": "2023-11-20T07:44:19.803565Z", + "shell.execute_reply": "2023-11-20T07:44:19.803075Z", + "shell.execute_reply.started": "2023-11-20T07:44:19.774075Z" + }, + "tags": [] + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Test Configuration : \n", + " {\n", + " \"model_parameters\": {\n", + " \"temperature\": 0.2,\n", + " \"max_tokens\": 64\n", + " },\n", + " \"tests\": {\n", + " \"defaults\": {\n", + " \"min_pass_rate\": 1.0\n", + " },\n", + " \"robustness\": {\n", + " \"add_typo\": {\n", + " \"min_pass_rate\": 0.7\n", + " },\n", + " \"lowercase\": {\n", + " \"min_pass_rate\": 0.7\n", + " }\n", + " }\n", + " }\n", + "}\n" + ] + } + ], + "source": [ + "harness = Harness(\n", + " task=\"question-answering\",\n", + " model={\"model\": \"gpt-3.5-turbo-instruct\", \"hub\":\"openai\"},\n", + " data={\"data_source\" :\"OpenBookQA\",\n", + " \"split\":\"test\"}\n", + " )" + ] + }, + { + "cell_type": "code", + "execution_count": 21, + "metadata": { + "execution": { + "iopub.execute_input": "2023-11-20T07:44:19.804859Z", + "iopub.status.busy": "2023-11-20T07:44:19.804540Z", + "iopub.status.idle": "2023-11-20T07:44:19.852053Z", + "shell.execute_reply": "2023-11-20T07:44:19.851628Z", + "shell.execute_reply.started": "2023-11-20T07:44:19.804843Z" + }, + "tags": [] + }, + "outputs": [ + { + "data": { + "text/plain": [ + "{'evaluation': {'metric': 'QAEvalChain',\n", + " 'model': 'text-davinci-003',\n", + " 'hub': 
'openai'},\n", + " 'tests': {'defaults': {'min_pass_rate': 0.65},\n", + " 'robustness': {'uppercase': {'min_pass_rate': 0.75},\n", + " 'lowercase': {'min_pass_rate': 0.75},\n", + " 'titlecase': {'min_pass_rate': 0.75},\n", + " 'add_typo': {'min_pass_rate': 0.75},\n", + " 'dyslexia_word_swap': {'min_pass_rate': 0.75},\n", + " 'add_abbreviation': {'min_pass_rate': 0.75},\n", + " 'add_slangs': {'min_pass_rate': 0.75},\n", + " 'add_speech_to_text_typo': {'min_pass_rate': 0.75},\n", + " 'add_ocr_typo': {'min_pass_rate': 0.75},\n", + " 'adjective_synonym_swap': {'min_pass_rate': 0.75},\n", + " 'adjective_antonym_swap': {'min_pass_rate': 0.75}}}}" + ] + }, + "execution_count": 21, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "harness.configure(\n", + "{\n", + " \"evaluation\": {\"metric\":\"QAEvalChain\",\"model\":\"gpt-3.5-turbo-instruct\",\"hub\":\"openai\"},\n", + " 'tests': {'defaults': {'min_pass_rate': 0.65},\n", + " 'robustness': {'uppercase': {'min_pass_rate': 0.75},\n", + " 'lowercase':{'min_pass_rate': 0.75},\n", + " 'titlecase':{'min_pass_rate': 0.75},\n", + " 'add_typo':{'min_pass_rate': 0.75},\n", + " 'dyslexia_word_swap':{'min_pass_rate': 0.75},\n", + " 'add_abbreviation':{'min_pass_rate': 0.75},\n", + " 'add_slangs':{'min_pass_rate': 0.75},\n", + " 'add_speech_to_text_typo':{'min_pass_rate': 0.75},\n", + " 'add_ocr_typo':{'min_pass_rate': 0.75},\n", + " 'adjective_synonym_swap':{'min_pass_rate': 0.75},\n", + " 'adjective_antonym_swap':{'min_pass_rate': 0.75}\n", + " }\n", + " }\n", + " }\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Generating the test cases." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 22, + "metadata": { + "execution": { + "iopub.execute_input": "2023-11-20T07:44:19.852737Z", + "iopub.status.busy": "2023-11-20T07:44:19.852590Z", + "iopub.status.idle": "2023-11-20T07:45:44.856119Z", + "shell.execute_reply": "2023-11-20T07:45:44.855564Z", + "shell.execute_reply.started": "2023-11-20T07:44:19.852724Z" + }, + "tags": [] + }, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "Generating testcases...: 100%|██████████| 1/1 [00:00<00:00, 8719.97it/s]\n", + "WARNING:root:[W009] Removing samples where no transformation has been applied:\n", + "[W010] - Test 'add_typo': 21 samples removed out of 500\n", + "[W010] - Test 'dyslexia_word_swap': 85 samples removed out of 500\n", + "[W010] - Test 'add_abbreviation': 60 samples removed out of 500\n", + "[W010] - Test 'add_slangs': 193 samples removed out of 500\n", + "[W010] - Test 'add_ocr_typo': 2 samples removed out of 500\n", + "[W010] - Test 'adjective_synonym_swap': 133 samples removed out of 500\n", + "[W010] - Test 'adjective_antonym_swap': 193 samples removed out of 500\n", + "\n" + ] + }, + { + "data": { + "text/plain": [] + }, + "execution_count": 22, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "seed_value = 42\n", + "random.seed(seed_value)\n", + "harness.generate()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "execution": { + "iopub.execute_input": "2023-11-20T07:45:44.865346Z", + "iopub.status.busy": "2023-11-20T07:45:44.865184Z", + "iopub.status.idle": "2023-11-20T07:45:45.296356Z" + }, + "tags": [] + }, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " category test_type original_context \\\n", + "0 robustness uppercase - \n", + "1 robustness uppercase - \n", + "2 robustness uppercase - \n", + "3 robustness uppercase - \n", + "4 robustness uppercase - \n", + "... ... ... ... \n", + "4808 robustness adjective_antonym_swap - \n", + "4809 robustness adjective_antonym_swap - \n", + "4810 robustness adjective_antonym_swap - \n", + "4811 robustness adjective_antonym_swap - \n", + "4812 robustness adjective_antonym_swap - \n", + "\n", + " original_question perturbed_context \\\n", + "0 A person wants to start saving money so that t... - \n", + "1 There is most likely going to be fog around:\\n... - \n", + "2 Predators eat\\n\\nA. lions\\nB. humans\\nC. bunni... - \n", + "3 Oak tree seeds are planted and a sidewalk is p... - \n", + "4 An electric car runs on electricity via\\n\\nA. ... - \n", + "... ... ... \n", + "4808 A woman, with a pale complexion, wants to spen... - \n", + "4809 Pasta may be cooked in water when\\n\\nA. the wa... - \n", + "4810 A decrease in diseases\\n\\nA. has no impact on ... - \n", + "4811 When soil is viewed in a scientific way, what ... - \n", + "4812 Some animals use a liquid coming from their sk... - \n", + "\n", + " perturbed_question \n", + "0 A PERSON WANTS TO START SAVING MONEY SO THAT T... \n", + "1 THERE IS MOST LIKELY GOING TO BE FOG AROUND: A... \n", + "2 PREDATORS EAT A. LIONS B. HUMANS C. BUNNIES D.... \n", + "3 OAK TREE SEEDS ARE PLANTED AND A SIDEWALK IS P... \n", + "4 AN ELECTRIC CAR RUNS ON ELECTRICITY VIA A. GAS... \n", + "... ... \n", + "4808 A woman, with a pale complexion, wants to spen... \n", + "4809 Pasta may be raw in water when\\n\\nA. the water... \n", + "4810 A decrease in diseases\\n\\nA. has no impact on ... \n", + "4811 When soil is viewed in a unscientific way, wha... \n", + "4812 Some animals use a gaseous coming from their s... 
\n", + "\n", + "[4813 rows x 6 columns]" + ] + }, + "execution_count": 23, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "harness.testcases()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Running the tests" + ] + }, + { + "cell_type": "code", + "execution_count": 24, + "metadata": { + "execution": { + "iopub.execute_input": "2023-11-20T07:49:16.846577Z", + "iopub.status.busy": "2023-11-20T07:49:16.845999Z", + "iopub.status.idle": "2023-11-20T08:23:02.916915Z", + "shell.execute_reply": "2023-11-20T08:23:02.916388Z", + "shell.execute_reply.started": "2023-11-20T07:49:16.846555Z" + }, + "tags": [] + }, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "Running testcases... : 100%|██████████| 4813/4813 [33:46<00:00, 2.38it/s]\n" + ] + }, + { + "data": { + "text/plain": [] + }, + "execution_count": 24, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "harness.run()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Saving model response (expected_result and actual_result)" + ] + }, + { + "cell_type": "code", + "execution_count": 45, + "metadata": { + "execution": { + "iopub.execute_input": "2023-11-20T08:27:23.709354Z", + "iopub.status.busy": "2023-11-20T08:27:23.708791Z", + "iopub.status.idle": "2023-11-20T08:27:23.927011Z", + "shell.execute_reply": "2023-11-20T08:27:23.926451Z", + "shell.execute_reply.started": "2023-11-20T08:27:23.709336Z" + }, + "tags": [] + }, + "outputs": [], + "source": [ + "harness.save(save_dir=\"openai/gpt-3.5-turbo-instruct-OpenBookQA\", include_generated_results=True)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Generated Results" + ] + }, + { + "cell_type": "code", + "execution_count": 47, + "metadata": { + "execution": { + "iopub.execute_input": "2023-11-20T08:27:25.468206Z", + "iopub.status.busy": "2023-11-20T08:27:25.467641Z", + "iopub.status.idle": 
"2023-11-20T08:33:02.936068Z", + "shell.execute_reply": "2023-11-20T08:33:02.935510Z", + "shell.execute_reply.started": "2023-11-20T08:27:25.468188Z" + }, + "tags": [] + }, + "outputs": [], + "source": [ + "generated_results = harness.generated_results()" + ] + }, + { + "cell_type": "code", + "execution_count": 48, + "metadata": { + "execution": { + "iopub.execute_input": "2023-11-20T08:33:02.937461Z", + "iopub.status.busy": "2023-11-20T08:33:02.937043Z", + "iopub.status.idle": "2023-11-20T08:33:02.946325Z" + }, + "tags": [] + }, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " category test_type original_context \\\n", + "0 robustness uppercase - \n", + "1 robustness uppercase - \n", + "2 robustness uppercase - \n", + "3 robustness uppercase - \n", + "4 robustness uppercase - \n", + "... ... ... ... \n", + "4808 robustness adjective_antonym_swap - \n", + "4809 robustness adjective_antonym_swap - \n", + "4810 robustness adjective_antonym_swap - \n", + "4811 robustness adjective_antonym_swap - \n", + "4812 robustness adjective_antonym_swap - \n", + "\n", + " original_question \\\n", + "0 A person wants to start saving money so that they can afford a nice vacation at the end of the year. After looking over their budget and expenses, they decide the best way to save money is to\\n\\nA. make more phone calls\\nB. quit eating lunch out\\nC. buy less with monopoly money\\nD. have lunch with friends \n", + "1 There is most likely going to be fog around:\\n\\nA. a marsh\\nB. a tundra\\nC. the plains\\nD. a desert \n", + "2 Predators eat\\n\\nA. lions\\nB. humans\\nC. bunnies\\nD. grass \n", + "3 Oak tree seeds are planted and a sidewalk is paved right next to that spot, until eventually, the tree is tall and the roots must extend past the sidewalk, which means\\n\\nA. roots may be split\\nB. roots may begin to die\\nC. parts may break the concrete\\nD. roots may fall apart \n", + "4 An electric car runs on electricity via\\n\\nA. gasoline\\nB. a power station\\nC. electrical conductors\\nD. fuel \n", + "... ... \n", + "4808 A woman, with a pale complexion, wants to spend the bright, sunny day at the beach. She makes sure that she stops at the store to pick up some sunblock before she begins to enjoy her day filled with sand and surf. She applies the sunblock carefully and thoroughly, because she knows that\\n\\nA. UV rays are harmful\\nB. sunlight will be fun\\nC. the sun is close\\nD. the sun is in space \n", + "4809 Pasta may be cooked in water when\\n\\nA. the water is warm\\nB. the water is on the stove\\nC. 
water is bubbling from applied warmth\\nD. the pasta is very fresh \n", + "4810 A decrease in diseases\\n\\nA. has no impact on a population\\nB. leads to more sick people\\nC. leads to less sick people\\nD. leads to an uptick in emergency room visits \n", + "4811 When soil is viewed in a scientific way, what is seen and viewed is actually\\n\\nA. insects like big beetles\\nB. tiny lifeforms in dirt\\nC. small mammals living there\\nD. a lot of tiny pebbles \n", + "4812 Some animals use a liquid coming from their skin to adjust to\\n\\nA. cold\\nB. water\\nC. heat\\nD. humidity \n", + "\n", + " perturbed_context \\\n", + "0 - \n", + "1 - \n", + "2 - \n", + "3 - \n", + "4 - \n", + "... ... \n", + "4808 - \n", + "4809 - \n", + "4810 - \n", + "4811 - \n", + "4812 - \n", + "\n", + " perturbed_question \\\n", + "0 A PERSON WANTS TO START SAVING MONEY SO THAT THEY CAN AFFORD A NICE VACATION AT THE END OF THE YEAR. AFTER LOOKING OVER THEIR BUDGET AND EXPENSES, THEY DECIDE THE BEST WAY TO SAVE MONEY IS TO A. MAKE MORE PHONE CALLS B. QUIT EATING LUNCH OUT C. BUY LESS WITH MONOPOLY MONEY D. HAVE LUNCH WITH FRIENDS \n", + "1 THERE IS MOST LIKELY GOING TO BE FOG AROUND: A. A MARSH B. A TUNDRA C. THE PLAINS D. A DESERT \n", + "2 PREDATORS EAT A. LIONS B. HUMANS C. BUNNIES D. GRASS \n", + "3 OAK TREE SEEDS ARE PLANTED AND A SIDEWALK IS PAVED RIGHT NEXT TO THAT SPOT, UNTIL EVENTUALLY, THE TREE IS TALL AND THE ROOTS MUST EXTEND PAST THE SIDEWALK, WHICH MEANS A. ROOTS MAY BE SPLIT B. ROOTS MAY BEGIN TO DIE C. PARTS MAY BREAK THE CONCRETE D. ROOTS MAY FALL APART \n", + "4 AN ELECTRIC CAR RUNS ON ELECTRICITY VIA A. GASOLINE B. A POWER STATION C. ELECTRICAL CONDUCTORS D. FUEL \n", + "... ... \n", + "4808 A woman, with a pale complexion, wants to spend the dull, sunny day at the beach. She makes sure that she stops at the store to pick up no sunblock before she begins to enjoy her day filled with sand and surf. 
She applies the sunblock carefully and thoroughly, because she knows that\\n\\nA. UV rays are harmless\\nB. sunlight will be fun\\nC. the sun is distant\\nD. the sun is in space \n", + "4809 Pasta may be raw in water when\\n\\nA. the water is cool\\nB. the water is on the stove\\nC. water is bubbling from applied warmth\\nD. the pasta is very salty \n", + "4810 A decrease in diseases\\n\\nA. has no impact on a population\\nB. leads to more well people\\nC. leads to less well people\\nD. leads to an uptick in emergency room visits \n", + "4811 When soil is viewed in a unscientific way, what is seen and viewed is actually\\n\\nA. insects like small beetles\\nB. tiny lifeforms in dirt\\nC. big mammals living there\\nD. a lot of tiny pebbles \n", + "4812 Some animals use a gaseous coming from their skin to adjust to\\n\\nA. hot\\nB. water\\nC. heat\\nD. humidity \n", + "\n", + " expected_result \\\n", + "0 B. quit eating lunch out \n", + "1 A. a marsh \n", + "2 A. lions \n", + "3 C. parts may break the concrete \n", + "4 B. a power station \n", + "... ... \n", + "4808 A. UV rays are harmful \n", + "4809 C. water is bubbling from applied warmth \n", + "4810 C. leads to less sick people \n", + "4811 B. tiny lifeforms in dirt \n", + "4812 C. heat \n", + "\n", + " actual_result pass \n", + "0 B. QUIT EATING LUNCH OUT True \n", + "1 A. A MARSH True \n", + "2 A. LIONS True \n", + "3 C. PARTS MAY BREAK THE CONCRETE True \n", + "4 C. ELECTRICAL CONDUCTORS False \n", + "... ... ... \n", + "4808 D. the sun is in space False \n", + "4809 C. water is bubbling from applied warmth True \n", + "4810 B. leads to more well people False \n", + "4811 D. a lot of tiny pebbles False \n", + "4812 C. 
heat True \n", + "\n", + "[4813 rows x 9 columns]" + ] + }, + "execution_count": 48, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "generated_results" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Final Results" + ] + }, + { + "cell_type": "code", + "execution_count": 49, + "metadata": { + "execution": { + "iopub.execute_input": "2023-11-20T08:33:02.958460Z", + "iopub.status.busy": "2023-11-20T08:33:02.958196Z", + "iopub.status.idle": "2023-11-20T08:38:38.126397Z" + }, + "tags": [] + }, + "outputs": [], + "source": [ + "report = harness.report()" + ] + }, + { + "cell_type": "code", + "execution_count": 50, + "metadata": { + "execution": { + "iopub.execute_input": "2023-11-20T08:38:38.131630Z", + "iopub.status.busy": "2023-11-20T08:38:38.131495Z", + "iopub.status.idle": "2023-11-20T08:38:38.138094Z" + }, + "tags": [] + }, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " category test_type fail_count pass_count pass_rate \\\n", + "0 robustness uppercase 65 435 87% \n", + "1 robustness lowercase 60 440 88% \n", + "2 robustness titlecase 73 427 85% \n", + "3 robustness add_typo 50 429 90% \n", + "4 robustness dyslexia_word_swap 85 330 80% \n", + "5 robustness add_abbreviation 93 347 79% \n", + "6 robustness add_slangs 103 204 66% \n", + "7 robustness add_speech_to_text_typo 46 454 91% \n", + "8 robustness add_ocr_typo 103 395 79% \n", + "9 robustness adjective_synonym_swap 116 251 68% \n", + "10 robustness adjective_antonym_swap 104 203 66% \n", + "\n", + " minimum_pass_rate pass \n", + "0 75% True \n", + "1 75% True \n", + "2 75% True \n", + "3 75% True \n", + "4 75% True \n", + "5 75% True \n", + "6 75% False \n", + "7 75% True \n", + "8 75% True \n", + "9 75% False \n", + "10 75% False " + ] + }, + "execution_count": 50, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "report" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Saving report and generated_results" + ] + }, + { + "cell_type": "code", + "execution_count": 51, + "metadata": { + "execution": { + "iopub.execute_input": "2023-11-20T08:38:38.143320Z", + "iopub.status.busy": "2023-11-20T08:38:38.143188Z", + "iopub.status.idle": "2023-11-20T08:38:38.286181Z" + }, + "tags": [] + }, + "outputs": [], + "source": [ + "generated_results.to_csv('openai/gpt-3.5-turbo-instruct-OpenBookQA.csv', index=False)" + ] + }, + { + "cell_type": "code", + "execution_count": 52, + "metadata": { + "execution": { + "iopub.execute_input": "2023-11-20T08:38:38.291391Z", + "iopub.status.busy": "2023-11-20T08:38:38.291264Z", + "iopub.status.idle": "2023-11-20T08:38:38.303232Z" + }, + "tags": [] + }, + "outputs": [], + "source": [ + "report.to_csv('openai/gpt-3.5-turbo-instruct-OpenBookQA-report.csv', index=False)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Model 
Mistral-7B-Instruct-v0.1" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Setup and Configure Harness" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": { + "execution": { + "iopub.execute_input": "2023-11-21T15:53:52.725765Z", + "iopub.status.busy": "2023-11-21T15:53:52.725616Z", + "iopub.status.idle": "2023-11-21T15:54:44.445706Z", + "shell.execute_reply": "2023-11-21T15:54:44.445029Z", + "shell.execute_reply.started": "2023-11-21T15:53:52.725752Z" + }, + "tags": [] + }, + "outputs": [ + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "84290b95081b4c3481326755bdfe430d", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "Loading checkpoint shards: 0%| | 0/2 [00:00\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
\n", + "" + ], + "text/plain": [ + " category test_type original_context \\\n", + "0 robustness uppercase - \n", + "1 robustness uppercase - \n", + "2 robustness uppercase - \n", + "3 robustness uppercase - \n", + "4 robustness uppercase - \n", + "... ... ... ... \n", + "4808 robustness adjective_antonym_swap - \n", + "4809 robustness adjective_antonym_swap - \n", + "4810 robustness adjective_antonym_swap - \n", + "4811 robustness adjective_antonym_swap - \n", + "4812 robustness adjective_antonym_swap - \n", + "\n", + " original_question perturbed_context \\\n", + "0 A person wants to start saving money so that t... - \n", + "1 There is most likely going to be fog around:\\n... - \n", + "2 Predators eat\\n\\nA. lions\\nB. humans\\nC. bunni... - \n", + "3 Oak tree seeds are planted and a sidewalk is p... - \n", + "4 An electric car runs on electricity via\\n\\nA. ... - \n", + "... ... ... \n", + "4808 A woman, with a pale complexion, wants to spen... - \n", + "4809 Pasta may be cooked in water when\\n\\nA. the wa... - \n", + "4810 A decrease in diseases\\n\\nA. has no impact on ... - \n", + "4811 When soil is viewed in a scientific way, what ... - \n", + "4812 Some animals use a liquid coming from their sk... - \n", + "\n", + " perturbed_question \n", + "0 A PERSON WANTS TO START SAVING MONEY SO THAT T... \n", + "1 THERE IS MOST LIKELY GOING TO BE FOG AROUND: A... \n", + "2 PREDATORS EAT A. LIONS B. HUMANS C. BUNNIES D.... \n", + "3 OAK TREE SEEDS ARE PLANTED AND A SIDEWALK IS P... \n", + "4 AN ELECTRIC CAR RUNS ON ELECTRICITY VIA A. GAS... \n", + "... ... \n", + "4808 A woman, with a pale complexion, wants to spen... \n", + "4809 Pasta may be raw in water when\\n\\nA. the water... \n", + "4810 A decrease in diseases\\n\\nA. has no impact on ... \n", + "4811 When soil is viewed in a unscientific way, wha... \n", + "4812 Some animals use a gaseous coming from their s... 
\n", + "\n", + "[4813 rows x 6 columns]" + ] + }, + "execution_count": 9, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "harness.testcases()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Running the tests" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "harness.run()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Saving model response (expected_result and actual_result)" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": { + "execution": { + "iopub.execute_input": "2023-11-21T23:00:12.314773Z", + "iopub.status.busy": "2023-11-21T23:00:12.314620Z", + "iopub.status.idle": "2023-11-21T23:00:12.552418Z", + "shell.execute_reply": "2023-11-21T23:00:12.551826Z", + "shell.execute_reply.started": "2023-11-21T23:00:12.314759Z" + }, + "tags": [] + }, + "outputs": [], + "source": [ + "harness.save(save_dir=\"mistralai/Mistral-7B-Instruct-v0.1-OpenBookQA\", include_generated_results=True)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Generated Results" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": { + "execution": { + "iopub.execute_input": "2023-11-21T23:00:12.553308Z", + "iopub.status.busy": "2023-11-21T23:00:12.553130Z", + "iopub.status.idle": "2023-11-21T23:16:59.986809Z", + "shell.execute_reply": "2023-11-21T23:16:59.985790Z", + "shell.execute_reply.started": "2023-11-21T23:00:12.553293Z" + }, + "tags": [] + }, + "outputs": [], + "source": [ + "generated_results = harness.generated_results()" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "metadata": { + "execution": { + "iopub.execute_input": "2023-11-21T23:16:59.987716Z", + "iopub.status.busy": "2023-11-21T23:16:59.987561Z", + "iopub.status.idle": "2023-11-21T23:16:59.997060Z", + "shell.execute_reply": "2023-11-21T23:16:59.996556Z", + 
"shell.execute_reply.started": "2023-11-21T23:16:59.987701Z" + }, + "tags": [] + }, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
| | category | test_type | original_context | original_question | perturbed_context | perturbed_question | expected_result | actual_result | pass |
| 0 | robustness | uppercase | - | A person wants to start saving money so that t... | - | A PERSON WANTS TO START SAVING MONEY SO THAT T... | B | C | False |
| 1 | robustness | uppercase | - | There is most likely going to be fog around:\\n... | - | THERE IS MOST LIKELY GOING TO BE FOG AROUND: A... | A | A | True |
| 2 | robustness | uppercase | - | Predators eat\\n\\nA. lions\\nB. humans\\nC. bunni... | - | PREDATORS EAT A. LIONS B. HUMANS C. BUNNIES D.... | A | A | True |
| 3 | robustness | uppercase | - | Oak tree seeds are planted and a sidewalk is p... | - | OAK TREE SEEDS ARE PLANTED AND A SIDEWALK IS P... | B | B | True |
| 4 | robustness | uppercase | - | An electric car runs on electricity via\\n\\nA. ... | - | AN ELECTRIC CAR RUNS ON ELECTRICITY VIA A. GAS... | B | B | True |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 4808 | robustness | adjective_antonym_swap | - | A woman, with a pale complexion, wants to spen... | - | A woman, with a pale complexion, wants to spen... | A | C | False |
| 4809 | robustness | adjective_antonym_swap | - | Pasta may be cooked in water when\\n\\nA. the wa... | - | Pasta may be raw in water when\\n\\nA. the water... | B | A | False |
| 4810 | robustness | adjective_antonym_swap | - | A decrease in diseases\\n\\nA. has no impact on ... | - | A decrease in diseases\\n\\nA. has no impact on ... | C | B | False |
| 4811 | robustness | adjective_antonym_swap | - | When soil is viewed in a scientific way, what ... | - | When soil is viewed in a unscientific way, wha... | B | B | True |
| 4812 | robustness | adjective_antonym_swap | - | Some animals use a liquid coming from their sk... | - | Some animals use a gaseous coming from their s... | B | D | False |
\n", + "

4813 rows × 9 columns

\n", + "
" + ], + "text/plain": [ + " category test_type original_context \\\n", + "0 robustness uppercase - \n", + "1 robustness uppercase - \n", + "2 robustness uppercase - \n", + "3 robustness uppercase - \n", + "4 robustness uppercase - \n", + "... ... ... ... \n", + "4808 robustness adjective_antonym_swap - \n", + "4809 robustness adjective_antonym_swap - \n", + "4810 robustness adjective_antonym_swap - \n", + "4811 robustness adjective_antonym_swap - \n", + "4812 robustness adjective_antonym_swap - \n", + "\n", + " original_question perturbed_context \\\n", + "0 A person wants to start saving money so that t... - \n", + "1 There is most likely going to be fog around:\\n... - \n", + "2 Predators eat\\n\\nA. lions\\nB. humans\\nC. bunni... - \n", + "3 Oak tree seeds are planted and a sidewalk is p... - \n", + "4 An electric car runs on electricity via\\n\\nA. ... - \n", + "... ... ... \n", + "4808 A woman, with a pale complexion, wants to spen... - \n", + "4809 Pasta may be cooked in water when\\n\\nA. the wa... - \n", + "4810 A decrease in diseases\\n\\nA. has no impact on ... - \n", + "4811 When soil is viewed in a scientific way, what ... - \n", + "4812 Some animals use a liquid coming from their sk... - \n", + "\n", + " perturbed_question expected_result \\\n", + "0 A PERSON WANTS TO START SAVING MONEY SO THAT T... B \n", + "1 THERE IS MOST LIKELY GOING TO BE FOG AROUND: A... A \n", + "2 PREDATORS EAT A. LIONS B. HUMANS C. BUNNIES D.... A \n", + "3 OAK TREE SEEDS ARE PLANTED AND A SIDEWALK IS P... B \n", + "4 AN ELECTRIC CAR RUNS ON ELECTRICITY VIA A. GAS... B \n", + "... ... ... \n", + "4808 A woman, with a pale complexion, wants to spen... A \n", + "4809 Pasta may be raw in water when\\n\\nA. the water... B \n", + "4810 A decrease in diseases\\n\\nA. has no impact on ... C \n", + "4811 When soil is viewed in a unscientific way, wha... B \n", + "4812 Some animals use a gaseous coming from their s... 
B \n", + "\n", + " actual_result pass \n", + "0 C False \n", + "1 A True \n", + "2 A True \n", + "3 B True \n", + "4 B True \n", + "... ... ... \n", + "4808 C False \n", + "4809 A False \n", + "4810 B False \n", + "4811 B True \n", + "4812 D False \n", + "\n", + "[4813 rows x 9 columns]" + ] + }, + "execution_count": 14, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "generated_results" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Final Results" + ] + }, + { + "cell_type": "code", + "execution_count": 19, + "metadata": { + "execution": { + "iopub.execute_input": "2023-11-28T18:53:57.757922Z", + "iopub.status.busy": "2023-11-28T18:53:57.757668Z", + "iopub.status.idle": "2023-11-28T18:53:57.778723Z", + "shell.execute_reply": "2023-11-28T18:53:57.778308Z", + "shell.execute_reply.started": "2023-11-28T18:53:57.757908Z" + }, + "tags": [] + }, + "outputs": [], + "source": [ + "report = harness.report()" + ] + }, + { + "cell_type": "code", + "execution_count": 54, + "metadata": { + "tags": [] + }, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
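| | category | test_type | fail_count | pass_count | pass_rate | minimum_pass_rate | pass |
| 0 | robustness | uppercase | 226 | 274 | 55% | 75% | False |
| 1 | robustness | lowercase | 206 | 294 | 59% | 75% | False |
| 2 | robustness | titlecase | 230 | 270 | 54% | 75% | False |
| 3 | robustness | add_typo | 56 | 423 | 88% | 75% | True |
| 4 | robustness | dyslexia_word_swap | 64 | 351 | 85% | 75% | True |
| 5 | robustness | add_abbreviation | 85 | 355 | 81% | 75% | True |
| 6 | robustness | add_slangs | 75 | 232 | 76% | 75% | True |
| 7 | robustness | add_speech_to_text_typo | 64 | 436 | 87% | 75% | True |
| 8 | robustness | add_ocr_typo | 138 | 360 | 72% | 75% | False |
| 9 | robustness | adjective_synonym_swap | 82 | 285 | 78% | 75% | True |
| 10 | robustness | adjective_antonym_swap | 81 | 226 | 74% | 75% | False |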
\n", + "
" + ], + "text/plain": [ + " category test_type fail_count pass_count pass_rate \\\n", + "0 robustness uppercase 226 274 55% \n", + "1 robustness lowercase 206 294 59% \n", + "2 robustness titlecase 230 270 54% \n", + "3 robustness add_typo 56 423 88% \n", + "4 robustness dyslexia_word_swap 64 351 85% \n", + "5 robustness add_abbreviation 85 355 81% \n", + "6 robustness add_slangs 75 232 76% \n", + "7 robustness add_speech_to_text_typo 64 436 87% \n", + "8 robustness add_ocr_typo 138 360 72% \n", + "9 robustness adjective_synonym_swap 82 285 78% \n", + "10 robustness adjective_antonym_swap 81 226 74% \n", + "\n", + " minimum_pass_rate pass \n", + "0 75% False \n", + "1 75% False \n", + "2 75% False \n", + "3 75% True \n", + "4 75% True \n", + "5 75% True \n", + "6 75% True \n", + "7 75% True \n", + "8 75% False \n", + "9 75% True \n", + "10 75% False " + ] + }, + "execution_count": 54, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "report " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Saving report and generated_results" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "generated_results.to_csv('mistralai/Mistral-7B-Instruct-v0.1-OpenBookQA.csv', index=False)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "report.to_csv('mistralai/Mistral-7B-Instruct-v0.1-OpenBookQA-report.csv', index=False)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Model HuggingFaceH4/zephyr-7b-beta" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Setup and Configure Harness" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "Loading checkpoint shards: 100%|██████████| 8/8 [01:01<00:00, 7.66s/it]\n" 
+ ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Test Configuration : \n", + " {\n", + " \"evaluation\": {\n", + " \"metric\": \"QAEvalChain\",\n", + " \"model\": \"gpt-3.5-turbo-instruct\",\n", + " \"hub\": \"openai\"\n", + " },\n", + " \"tests\": {\n", + " \"defaults\": {\n", + " \"min_pass_rate\": 0.65\n", + " },\n", + " \"robustness\": {\n", + " \"uppercase\": {\n", + " \"min_pass_rate\": 0.75\n", + " },\n", + " \"lowercase\": {\n", + " \"min_pass_rate\": 0.75\n", + " },\n", + " \"titlecase\": {\n", + " \"min_pass_rate\": 0.75\n", + " },\n", + " \"add_typo\": {\n", + " \"min_pass_rate\": 0.75\n", + " },\n", + " \"dyslexia_word_swap\": {\n", + " \"min_pass_rate\": 0.75\n", + " },\n", + " \"add_abbreviation\": {\n", + " \"min_pass_rate\": 0.75\n", + " },\n", + " \"add_slangs\": {\n", + " \"min_pass_rate\": 0.75\n", + " },\n", + " \"add_speech_to_text_typo\": {\n", + " \"min_pass_rate\": 0.75\n", + " },\n", + " \"add_ocr_typo\": {\n", + " \"min_pass_rate\": 0.75\n", + " },\n", + " \"adjective_synonym_swap\": {\n", + " \"min_pass_rate\": 0.75\n", + " },\n", + " \"adjective_antonym_swap\": {\n", + " \"min_pass_rate\": 0.75\n", + " }\n", + " }\n", + " }\n", + "}\n" + ] + } + ], + "source": [ + "harness = Harness(\n", + " task=\"question-answering\", \n", + " model={\"model\": \"HuggingFaceH4/zephyr-7b-beta\", \"hub\": \"huggingface\"},\n", + " data={\"data_source\" :\"OpenBookQA\",\n", + " \"split\":\"test\"},\n", + " config={\n", + " \"evaluation\": {\"metric\":\"QAEvalChain\",\"model\":\"gpt-3.5-turbo-instruct\",\"hub\":\"openai\"},\n", + " 'tests': {'defaults': {'min_pass_rate': 0.65},\n", + " 'robustness': {'uppercase': {'min_pass_rate': 0.75},\n", + " 'lowercase':{'min_pass_rate': 0.75},\n", + " 'titlecase':{'min_pass_rate': 0.75},\n", + " 'add_typo':{'min_pass_rate': 0.75},\n", + " 'dyslexia_word_swap':{'min_pass_rate': 0.75},\n", + " 'add_abbreviation':{'min_pass_rate': 0.75},\n", + " 'add_slangs':{'min_pass_rate': 0.75},\n", + " 
'add_speech_to_text_typo':{'min_pass_rate': 0.75},\n", + " 'add_ocr_typo':{'min_pass_rate': 0.75},\n", + " 'adjective_synonym_swap':{'min_pass_rate': 0.75},\n", + " 'adjective_antonym_swap':{'min_pass_rate': 0.75}\n", + " }\n", + " }\n", + " }\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Generating the test cases." + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "Generating testcases...: 100%|██████████| 1/1 [00:00\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
| | category | test_type | original_context | original_question | perturbed_context | perturbed_question |
| 0 | robustness | uppercase | - | A person wants to start saving money so that they can afford a nice vacation at the end of the year. After looking over their budget and expenses, they decide the best way to save money is to\\n\\nA. make more phone calls\\nB. quit eating lunch out\\nC. buy less with monopoly money\\nD. have lunch with friends | - | A PERSON WANTS TO START SAVING MONEY SO THAT THEY CAN AFFORD A NICE VACATION AT THE END OF THE YEAR. AFTER LOOKING OVER THEIR BUDGET AND EXPENSES, THEY DECIDE THE BEST WAY TO SAVE MONEY IS TO A. MAKE MORE PHONE CALLS B. QUIT EATING LUNCH OUT C. BUY LESS WITH MONOPOLY MONEY D. HAVE LUNCH WITH FRIENDS |
| 1 | robustness | uppercase | - | There is most likely going to be fog around:\\n\\nA. a marsh\\nB. a tundra\\nC. the plains\\nD. a desert | - | THERE IS MOST LIKELY GOING TO BE FOG AROUND: A. A MARSH B. A TUNDRA C. THE PLAINS D. A DESERT |
| 2 | robustness | uppercase | - | Predators eat\\n\\nA. lions\\nB. humans\\nC. bunnies\\nD. grass | - | PREDATORS EAT A. LIONS B. HUMANS C. BUNNIES D. GRASS |
| 3 | robustness | uppercase | - | Oak tree seeds are planted and a sidewalk is paved right next to that spot, until eventually, the tree is tall and the roots must extend past the sidewalk, which means\\n\\nA. roots may be split\\nB. roots may begin to die\\nC. parts may break the concrete\\nD. roots may fall apart | - | OAK TREE SEEDS ARE PLANTED AND A SIDEWALK IS PAVED RIGHT NEXT TO THAT SPOT, UNTIL EVENTUALLY, THE TREE IS TALL AND THE ROOTS MUST EXTEND PAST THE SIDEWALK, WHICH MEANS A. ROOTS MAY BE SPLIT B. ROOTS MAY BEGIN TO DIE C. PARTS MAY BREAK THE CONCRETE D. ROOTS MAY FALL APART |
| 4 | robustness | uppercase | - | An electric car runs on electricity via\\n\\nA. gasoline\\nB. a power station\\nC. electrical conductors\\nD. fuel | - | AN ELECTRIC CAR RUNS ON ELECTRICITY VIA A. GASOLINE B. A POWER STATION C. ELECTRICAL CONDUCTORS D. FUEL |
| ... | ... | ... | ... | ... | ... | ... |
| 4808 | robustness | adjective_antonym_swap | - | A woman, with a pale complexion, wants to spend the bright, sunny day at the beach. She makes sure that she stops at the store to pick up some sunblock before she begins to enjoy her day filled with sand and surf. She applies the sunblock carefully and thoroughly, because she knows that\\n\\nA. UV rays are harmful\\nB. sunlight will be fun\\nC. the sun is close\\nD. the sun is in space | - | A woman, with a pale complexion, wants to spend the dull, sunny day at the beach. She makes sure that she stops at the store to pick up no sunblock before she begins to enjoy her day filled with sand and surf. She applies the sunblock carefully and thoroughly, because she knows that\\n\\nA. UV rays are harmless\\nB. sunlight will be fun\\nC. the sun is distant\\nD. the sun is in space |
| 4809 | robustness | adjective_antonym_swap | - | Pasta may be cooked in water when\\n\\nA. the water is warm\\nB. the water is on the stove\\nC. water is bubbling from applied warmth\\nD. the pasta is very fresh | - | Pasta may be raw in water when\\n\\nA. the water is cool\\nB. the water is on the stove\\nC. water is bubbling from applied warmth\\nD. the pasta is very salty |
| 4810 | robustness | adjective_antonym_swap | - | A decrease in diseases\\n\\nA. has no impact on a population\\nB. leads to more sick people\\nC. leads to less sick people\\nD. leads to an uptick in emergency room visits | - | A decrease in diseases\\n\\nA. has no impact on a population\\nB. leads to more well people\\nC. leads to less well people\\nD. leads to an uptick in emergency room visits |
| 4811 | robustness | adjective_antonym_swap | - | When soil is viewed in a scientific way, what is seen and viewed is actually\\n\\nA. insects like big beetles\\nB. tiny lifeforms in dirt\\nC. small mammals living there\\nD. a lot of tiny pebbles | - | When soil is viewed in a unscientific way, what is seen and viewed is actually\\n\\nA. insects like small beetles\\nB. tiny lifeforms in dirt\\nC. big mammals living there\\nD. a lot of tiny pebbles |
| 4812 | robustness | adjective_antonym_swap | - | Some animals use a liquid coming from their skin to adjust to\\n\\nA. cold\\nB. water\\nC. heat\\nD. humidity | - | Some animals use a gaseous coming from their skin to adjust to\\n\\nA. hot\\nB. water\\nC. heat\\nD. humidity |
\n", + "

4813 rows × 6 columns

\n", + "" + ], + "text/plain": [ + " category test_type original_context \\\n", + "0 robustness uppercase - \n", + "1 robustness uppercase - \n", + "2 robustness uppercase - \n", + "3 robustness uppercase - \n", + "4 robustness uppercase - \n", + "... ... ... ... \n", + "4808 robustness adjective_antonym_swap - \n", + "4809 robustness adjective_antonym_swap - \n", + "4810 robustness adjective_antonym_swap - \n", + "4811 robustness adjective_antonym_swap - \n", + "4812 robustness adjective_antonym_swap - \n", + "\n", + " original_question \\\n", + "0 A person wants to start saving money so that they can afford a nice vacation at the end of the year. After looking over their budget and expenses, they decide the best way to save money is to\\n\\nA. make more phone calls\\nB. quit eating lunch out\\nC. buy less with monopoly money\\nD. have lunch with friends \n", + "1 There is most likely going to be fog around:\\n\\nA. a marsh\\nB. a tundra\\nC. the plains\\nD. a desert \n", + "2 Predators eat\\n\\nA. lions\\nB. humans\\nC. bunnies\\nD. grass \n", + "3 Oak tree seeds are planted and a sidewalk is paved right next to that spot, until eventually, the tree is tall and the roots must extend past the sidewalk, which means\\n\\nA. roots may be split\\nB. roots may begin to die\\nC. parts may break the concrete\\nD. roots may fall apart \n", + "4 An electric car runs on electricity via\\n\\nA. gasoline\\nB. a power station\\nC. electrical conductors\\nD. fuel \n", + "... ... \n", + "4808 A woman, with a pale complexion, wants to spend the bright, sunny day at the beach. She makes sure that she stops at the store to pick up some sunblock before she begins to enjoy her day filled with sand and surf. She applies the sunblock carefully and thoroughly, because she knows that\\n\\nA. UV rays are harmful\\nB. sunlight will be fun\\nC. the sun is close\\nD. the sun is in space \n", + "4809 Pasta may be cooked in water when\\n\\nA. the water is warm\\nB. 
the water is on the stove\\nC. water is bubbling from applied warmth\\nD. the pasta is very fresh \n", + "4810 A decrease in diseases\\n\\nA. has no impact on a population\\nB. leads to more sick people\\nC. leads to less sick people\\nD. leads to an uptick in emergency room visits \n", + "4811 When soil is viewed in a scientific way, what is seen and viewed is actually\\n\\nA. insects like big beetles\\nB. tiny lifeforms in dirt\\nC. small mammals living there\\nD. a lot of tiny pebbles \n", + "4812 Some animals use a liquid coming from their skin to adjust to\\n\\nA. cold\\nB. water\\nC. heat\\nD. humidity \n", + "\n", + " perturbed_context \\\n", + "0 - \n", + "1 - \n", + "2 - \n", + "3 - \n", + "4 - \n", + "... ... \n", + "4808 - \n", + "4809 - \n", + "4810 - \n", + "4811 - \n", + "4812 - \n", + "\n", + " perturbed_question \n", + "0 A PERSON WANTS TO START SAVING MONEY SO THAT THEY CAN AFFORD A NICE VACATION AT THE END OF THE YEAR. AFTER LOOKING OVER THEIR BUDGET AND EXPENSES, THEY DECIDE THE BEST WAY TO SAVE MONEY IS TO A. MAKE MORE PHONE CALLS B. QUIT EATING LUNCH OUT C. BUY LESS WITH MONOPOLY MONEY D. HAVE LUNCH WITH FRIENDS \n", + "1 THERE IS MOST LIKELY GOING TO BE FOG AROUND: A. A MARSH B. A TUNDRA C. THE PLAINS D. A DESERT \n", + "2 PREDATORS EAT A. LIONS B. HUMANS C. BUNNIES D. GRASS \n", + "3 OAK TREE SEEDS ARE PLANTED AND A SIDEWALK IS PAVED RIGHT NEXT TO THAT SPOT, UNTIL EVENTUALLY, THE TREE IS TALL AND THE ROOTS MUST EXTEND PAST THE SIDEWALK, WHICH MEANS A. ROOTS MAY BE SPLIT B. ROOTS MAY BEGIN TO DIE C. PARTS MAY BREAK THE CONCRETE D. ROOTS MAY FALL APART \n", + "4 AN ELECTRIC CAR RUNS ON ELECTRICITY VIA A. GASOLINE B. A POWER STATION C. ELECTRICAL CONDUCTORS D. FUEL \n", + "... ... \n", + "4808 A woman, with a pale complexion, wants to spend the dull, sunny day at the beach. She makes sure that she stops at the store to pick up no sunblock before she begins to enjoy her day filled with sand and surf. 
She applies the sunblock carefully and thoroughly, because she knows that\\n\\nA. UV rays are harmless\\nB. sunlight will be fun\\nC. the sun is distant\\nD. the sun is in space \n", + "4809 Pasta may be raw in water when\\n\\nA. the water is cool\\nB. the water is on the stove\\nC. water is bubbling from applied warmth\\nD. the pasta is very salty \n", + "4810 A decrease in diseases\\n\\nA. has no impact on a population\\nB. leads to more well people\\nC. leads to less well people\\nD. leads to an uptick in emergency room visits \n", + "4811 When soil is viewed in a unscientific way, what is seen and viewed is actually\\n\\nA. insects like small beetles\\nB. tiny lifeforms in dirt\\nC. big mammals living there\\nD. a lot of tiny pebbles \n", + "4812 Some animals use a gaseous coming from their skin to adjust to\\n\\nA. hot\\nB. water\\nC. heat\\nD. humidity \n", + "\n", + "[4813 rows x 6 columns]" + ] + }, + "execution_count": 7, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "harness.testcases()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Running the tests" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "harness.run()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Saving model response (expected_result and actual_result)" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": {}, + "outputs": [], + "source": [ + "harness.save(save_dir=\"HuggingFaceH4/zephyr-7b-beta-OpenBookQA\", include_generated_results=True)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Generated Results" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "WARNING:urllib3.connectionpool:Retrying (Retry(total=1, connect=None, read=None, redirect=None, status=None)) after 
connection broken by 'NameResolutionError(\": Failed to resolve 'api.openai.com' ([Errno 11001] getaddrinfo failed)\")': /v1/completions\n", + "WARNING:urllib3.connectionpool:Retrying (Retry(total=0, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NameResolutionError(\": Failed to resolve 'api.openai.com' ([Errno 11001] getaddrinfo failed)\")': /v1/completions\n" + ] + } + ], + "source": [ + "generated_results = harness.generated_results()" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
| | category | test_type | original_context | original_question | perturbed_context | perturbed_question | expected_result | actual_result | pass |
| 0 | robustness | uppercase | - | A person wants to start saving money so that they can afford a nice vacation at the end of the year. After looking over their budget and expenses, they decide the best way to save money is to\\n\\nA. make more phone calls\\nB. quit eating lunch out\\nC. buy less with monopoly money\\nD. have lunch with friends | - | A PERSON WANTS TO START SAVING MONEY SO THAT THEY CAN AFFORD A NICE VACATION AT THE END OF THE YEAR. AFTER LOOKING OVER THEIR BUDGET AND EXPENSES, THEY DECIDE THE BEST WAY TO SAVE MONEY IS TO A. MAKE MORE PHONE CALLS B. QUIT EATING LUNCH OUT C. BUY LESS WITH MONOPOLY MONEY D. HAVE LUNCH WITH FRIENDS | B | The | False |
| 1 | robustness | uppercase | - | There is most likely going to be fog around:\\n\\nA. a marsh\\nB. a tundra\\nC. the plains\\nD. a desert | - | THERE IS MOST LIKELY GOING TO BE FOG AROUND: A. A MARSH B. A TUNDRA C. THE PLAINS D. A DESERT | A | A | True |
| 2 | robustness | uppercase | - | Predators eat\\n\\nA. lions\\nB. humans\\nC. bunnies\\nD. grass | - | PREDATORS EAT A. LIONS B. HUMANS C. BUNNIES D. GRASS | A | A | True |
| 3 | robustness | uppercase | - | Oak tree seeds are planted and a sidewalk is paved right next to that spot, until eventually, the tree is tall and the roots must extend past the sidewalk, which means\\n\\nA. roots may be split\\nB. roots may begin to die\\nC. parts may break the concrete\\nD. roots may fall apart | - | OAK TREE SEEDS ARE PLANTED AND A SIDEWALK IS PAVED RIGHT NEXT TO THAT SPOT, UNTIL EVENTUALLY, THE TREE IS TALL AND THE ROOTS MUST EXTEND PAST THE SIDEWALK, WHICH MEANS A. ROOTS MAY BE SPLIT B. ROOTS MAY BEGIN TO DIE C. PARTS MAY BREAK THE CONCRETE D. ROOTS MAY FALL APART | C | A | False |
| 4 | robustness | uppercase | - | An electric car runs on electricity via\\n\\nA. gasoline\\nB. a power station\\nC. electrical conductors\\nD. fuel | - | AN ELECTRIC CAR RUNS ON ELECTRICITY VIA A. GASOLINE B. A POWER STATION C. ELECTRICAL CONDUCTORS D. FUEL | B | B | True |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 4808 | robustness | adjective_antonym_swap | - | A woman, with a pale complexion, wants to spend the bright, sunny day at the beach. She makes sure that she stops at the store to pick up some sunblock before she begins to enjoy her day filled with sand and surf. She applies the sunblock carefully and thoroughly, because she knows that\\n\\nA. UV rays are harmful\\nB. sunlight will be fun\\nC. the sun is close\\nD. the sun is in space | - | A woman, with a pale complexion, wants to spend the dull, sunny day at the beach. She makes sure that she stops at the store to pick up no sunblock before she begins to enjoy her day filled with sand and surf. She applies the sunblock carefully and thoroughly, because she knows that\\n\\nA. UV rays are harmless\\nB. sunlight will be fun\\nC. the sun is distant\\nD. the sun is in space | A | A | True |
| 4809 | robustness | adjective_antonym_swap | - | Pasta may be cooked in water when\\n\\nA. the water is warm\\nB. the water is on the stove\\nC. water is bubbling from applied warmth\\nD. the pasta is very fresh | - | Pasta may be raw in water when\\n\\nA. the water is cool\\nB. the water is on the stove\\nC. water is bubbling from applied warmth\\nD. the pasta is very salty | C | A | False |
| 4810 | robustness | adjective_antonym_swap | - | A decrease in diseases\\n\\nA. has no impact on a population\\nB. leads to more sick people\\nC. leads to less sick people\\nD. leads to an uptick in emergency room visits | - | A decrease in diseases\\n\\nA. has no impact on a population\\nB. leads to more well people\\nC. leads to less well people\\nD. leads to an uptick in emergency room visits | C | B | False |
| 4811 | robustness | adjective_antonym_swap | - | When soil is viewed in a scientific way, what is seen and viewed is actually\\n\\nA. insects like big beetles\\nB. tiny lifeforms in dirt\\nC. small mammals living there\\nD. a lot of tiny pebbles | - | When soil is viewed in a unscientific way, what is seen and viewed is actually\\n\\nA. insects like small beetles\\nB. tiny lifeforms in dirt\\nC. big mammals living there\\nD. a lot of tiny pebbles | B | B | True |
| 4812 | robustness | adjective_antonym_swap | - | Some animals use a liquid coming from their skin to adjust to\\n\\nA. cold\\nB. water\\nC. heat\\nD. humidity | - | Some animals use a gaseous coming from their skin to adjust to\\n\\nA. hot\\nB. water\\nC. heat\\nD. humidity | D | D | True |
\n", + "

4813 rows × 9 columns

\n", + "
" + ], + "text/plain": [ + " category test_type original_context \\\n", + "0 robustness uppercase - \n", + "1 robustness uppercase - \n", + "2 robustness uppercase - \n", + "3 robustness uppercase - \n", + "4 robustness uppercase - \n", + "... ... ... ... \n", + "4808 robustness adjective_antonym_swap - \n", + "4809 robustness adjective_antonym_swap - \n", + "4810 robustness adjective_antonym_swap - \n", + "4811 robustness adjective_antonym_swap - \n", + "4812 robustness adjective_antonym_swap - \n", + "\n", + " original_question \\\n", + "0 A person wants to start saving money so that they can afford a nice vacation at the end of the year. After looking over their budget and expenses, they decide the best way to save money is to\\n\\nA. make more phone calls\\nB. quit eating lunch out\\nC. buy less with monopoly money\\nD. have lunch with friends \n", + "1 There is most likely going to be fog around:\\n\\nA. a marsh\\nB. a tundra\\nC. the plains\\nD. a desert \n", + "2 Predators eat\\n\\nA. lions\\nB. humans\\nC. bunnies\\nD. grass \n", + "3 Oak tree seeds are planted and a sidewalk is paved right next to that spot, until eventually, the tree is tall and the roots must extend past the sidewalk, which means\\n\\nA. roots may be split\\nB. roots may begin to die\\nC. parts may break the concrete\\nD. roots may fall apart \n", + "4 An electric car runs on electricity via\\n\\nA. gasoline\\nB. a power station\\nC. electrical conductors\\nD. fuel \n", + "... ... \n", + "4808 A woman, with a pale complexion, wants to spend the bright, sunny day at the beach. She makes sure that she stops at the store to pick up some sunblock before she begins to enjoy her day filled with sand and surf. She applies the sunblock carefully and thoroughly, because she knows that\\n\\nA. UV rays are harmful\\nB. sunlight will be fun\\nC. the sun is close\\nD. the sun is in space \n", + "4809 Pasta may be cooked in water when\\n\\nA. the water is warm\\nB. the water is on the stove\\nC. 
water is bubbling from applied warmth\\nD. the pasta is very fresh \n", + "4810 A decrease in diseases\\n\\nA. has no impact on a population\\nB. leads to more sick people\\nC. leads to less sick people\\nD. leads to an uptick in emergency room visits \n", + "4811 When soil is viewed in a scientific way, what is seen and viewed is actually\\n\\nA. insects like big beetles\\nB. tiny lifeforms in dirt\\nC. small mammals living there\\nD. a lot of tiny pebbles \n", + "4812 Some animals use a liquid coming from their skin to adjust to\\n\\nA. cold\\nB. water\\nC. heat\\nD. humidity \n", + "\n", + " perturbed_context \\\n", + "0 - \n", + "1 - \n", + "2 - \n", + "3 - \n", + "4 - \n", + "... ... \n", + "4808 - \n", + "4809 - \n", + "4810 - \n", + "4811 - \n", + "4812 - \n", + "\n", + " perturbed_question \\\n", + "0 A PERSON WANTS TO START SAVING MONEY SO THAT THEY CAN AFFORD A NICE VACATION AT THE END OF THE YEAR. AFTER LOOKING OVER THEIR BUDGET AND EXPENSES, THEY DECIDE THE BEST WAY TO SAVE MONEY IS TO A. MAKE MORE PHONE CALLS B. QUIT EATING LUNCH OUT C. BUY LESS WITH MONOPOLY MONEY D. HAVE LUNCH WITH FRIENDS \n", + "1 THERE IS MOST LIKELY GOING TO BE FOG AROUND: A. A MARSH B. A TUNDRA C. THE PLAINS D. A DESERT \n", + "2 PREDATORS EAT A. LIONS B. HUMANS C. BUNNIES D. GRASS \n", + "3 OAK TREE SEEDS ARE PLANTED AND A SIDEWALK IS PAVED RIGHT NEXT TO THAT SPOT, UNTIL EVENTUALLY, THE TREE IS TALL AND THE ROOTS MUST EXTEND PAST THE SIDEWALK, WHICH MEANS A. ROOTS MAY BE SPLIT B. ROOTS MAY BEGIN TO DIE C. PARTS MAY BREAK THE CONCRETE D. ROOTS MAY FALL APART \n", + "4 AN ELECTRIC CAR RUNS ON ELECTRICITY VIA A. GASOLINE B. A POWER STATION C. ELECTRICAL CONDUCTORS D. FUEL \n", + "... ... \n", + "4808 A woman, with a pale complexion, wants to spend the dull, sunny day at the beach. She makes sure that she stops at the store to pick up no sunblock before she begins to enjoy her day filled with sand and surf. 
She applies the sunblock carefully and thoroughly, because she knows that\\n\\nA. UV rays are harmless\\nB. sunlight will be fun\\nC. the sun is distant\\nD. the sun is in space \n", + "4809 Pasta may be raw in water when\\n\\nA. the water is cool\\nB. the water is on the stove\\nC. water is bubbling from applied warmth\\nD. the pasta is very salty \n", + "4810 A decrease in diseases\\n\\nA. has no impact on a population\\nB. leads to more well people\\nC. leads to less well people\\nD. leads to an uptick in emergency room visits \n", + "4811 When soil is viewed in a unscientific way, what is seen and viewed is actually\\n\\nA. insects like small beetles\\nB. tiny lifeforms in dirt\\nC. big mammals living there\\nD. a lot of tiny pebbles \n", + "4812 Some animals use a gaseous coming from their skin to adjust to\\n\\nA. hot\\nB. water\\nC. heat\\nD. humidity \n", + "\n", + " expected_result actual_result pass \n", + "0 B The False \n", + "1 A A True \n", + "2 A A True \n", + "3 C A False \n", + "4 B B True \n", + "... ... ... ... \n", + "4808 A A True \n", + "4809 C A False \n", + "4810 C B False \n", + "4811 B B True \n", + "4812 D D True \n", + "\n", + "[4813 rows x 9 columns]" + ] + }, + "execution_count": 11, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "generated_results" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Final Results" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": {}, + "outputs": [], + "source": [ + "report = harness.report()" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "
 | category | test_type | fail_count | pass_count | pass_rate | minimum_pass_rate | pass
0 | robustness | uppercase | 240 | 260 | 52% | 75% | False
1 | robustness | lowercase | 268 | 232 | 46% | 75% | False
2 | robustness | titlecase | 236 | 264 | 53% | 75% | False
3 | robustness | add_typo | 81 | 398 | 83% | 75% | True
4 | robustness | dyslexia_word_swap | 59 | 356 | 86% | 75% | True
5 | robustness | add_abbreviation | 105 | 335 | 76% | 75% | True
6 | robustness | add_slangs | 80 | 227 | 74% | 75% | False
7 | robustness | add_speech_to_text_typo | 89 | 411 | 82% | 75% | True
8 | robustness | add_ocr_typo | 182 | 316 | 63% | 75% | False
9 | robustness | adjective_synonym_swap | 73 | 294 | 80% | 75% | True
10 | robustness | adjective_antonym_swap | 94 | 213 | 69% | 75% | False
\n", + "
" + ], + "text/plain": [ + " category test_type fail_count pass_count pass_rate \\\n", + "0 robustness uppercase 240 260 52% \n", + "1 robustness lowercase 268 232 46% \n", + "2 robustness titlecase 236 264 53% \n", + "3 robustness add_typo 81 398 83% \n", + "4 robustness dyslexia_word_swap 59 356 86% \n", + "5 robustness add_abbreviation 105 335 76% \n", + "6 robustness add_slangs 80 227 74% \n", + "7 robustness add_speech_to_text_typo 89 411 82% \n", + "8 robustness add_ocr_typo 182 316 63% \n", + "9 robustness adjective_synonym_swap 73 294 80% \n", + "10 robustness adjective_antonym_swap 94 213 69% \n", + "\n", + " minimum_pass_rate pass \n", + "0 75% False \n", + "1 75% False \n", + "2 75% False \n", + "3 75% True \n", + "4 75% True \n", + "5 75% True \n", + "6 75% False \n", + "7 75% True \n", + "8 75% False \n", + "9 75% True \n", + "10 75% False " + ] + }, + "execution_count": 13, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "report " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Saving report and generated_results" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "metadata": {}, + "outputs": [], + "source": [ + "generated_results.to_csv('HuggingFaceH4/zephyr-7b-beta-OpenBookQA.csv', index=False)" + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "metadata": {}, + "outputs": [], + "source": [ + "report.to_csv('HuggingFaceH4/zephyr-7b-beta-OpenBookQA-report.csv', index=False)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Model Intel/neural-chat-7b-v3-1" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Setup and Configure Harness" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "Downloading (…)l-00001-of-00002.bin: 100%|██████████| 9.94G/9.94G [28:32<00:00, 5.81MB/s]\n", + 
"c:\\Users\\priks\\anaconda3\\envs\\ge\\lib\\site-packages\\huggingface_hub\\file_download.py:137: UserWarning: `huggingface_hub` cache-system uses symlinks by default to efficiently store duplicated files but your machine does not support them in C:\\Users\\priks\\.cache\\huggingface\\hub. Caching files will still work but in a degraded version that might require more space on your disk. This warning can be disabled by setting the `HF_HUB_DISABLE_SYMLINKS_WARNING` environment variable. For more details, see https://huggingface.co/docs/huggingface_hub/how-to-cache#limitations.\n", + "To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to see activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development\n", + " warnings.warn(message)\n", + "Downloading (…)l-00002-of-00002.bin: 100%|██████████| 4.54G/4.54G [12:52<00:00, 5.88MB/s]\n", + "Downloading shards: 100%|██████████| 2/2 [41:26<00:00, 1243.32s/it]\n", + "Loading checkpoint shards: 100%|██████████| 2/2 [01:27<00:00, 43.68s/it]\n", + "Downloading generation_config.json: 100%|██████████| 111/111 [00:00<00:00, 37.0kB/s]\n", + "Downloading tokenizer_config.json: 100%|██████████| 953/953 [00:00<00:00, 477kB/s]\n", + "Downloading tokenizer.model: 100%|██████████| 493k/493k [00:00<00:00, 5.60MB/s]\n", + "Downloading tokenizer.json: 100%|██████████| 1.80M/1.80M [00:01<00:00, 1.68MB/s]\n", + "Downloading (…)cial_tokens_map.json: 100%|██████████| 145/145 [00:00<00:00, 72.4kB/s]" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Test Configuration : \n", + " {\n", + " \"evaluation\": {\n", + " \"metric\": \"QAEvalChain\",\n", + " \"model\": \"gpt-3.5-turbo-instruct\",\n", + " \"hub\": \"openai\"\n", + " },\n", + " \"tests\": {\n", + " \"defaults\": {\n", + " \"min_pass_rate\": 0.65\n", + " },\n", + " \"robustness\": {\n", + " \"uppercase\": {\n", + " 
\"min_pass_rate\": 0.75\n", + " },\n", + " \"lowercase\": {\n", + " \"min_pass_rate\": 0.75\n", + " },\n", + " \"titlecase\": {\n", + " \"min_pass_rate\": 0.75\n", + " },\n", + " \"add_typo\": {\n", + " \"min_pass_rate\": 0.75\n", + " },\n", + " \"dyslexia_word_swap\": {\n", + " \"min_pass_rate\": 0.75\n", + " },\n", + " \"add_abbreviation\": {\n", + " \"min_pass_rate\": 0.75\n", + " },\n", + " \"add_slangs\": {\n", + " \"min_pass_rate\": 0.75\n", + " },\n", + " \"add_speech_to_text_typo\": {\n", + " \"min_pass_rate\": 0.75\n", + " },\n", + " \"add_ocr_typo\": {\n", + " \"min_pass_rate\": 0.75\n", + " },\n", + " \"adjective_synonym_swap\": {\n", + " \"min_pass_rate\": 0.75\n", + " },\n", + " \"adjective_antonym_swap\": {\n", + " \"min_pass_rate\": 0.75\n", + " }\n", + " }\n", + " }\n", + "}\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "\n" + ] + } + ], + "source": [ + "harness = Harness(\n", + " task=\"question-answering\", \n", + " model={\"model\": \"Intel/neural-chat-7b-v3-1\", \"hub\": \"huggingface\"},\n", + " data={\"data_source\" :\"OpenBookQA\",\n", + " \"split\":\"test\"},\n", + " config={\n", + " \"evaluation\": {\"metric\":\"QAEvalChain\",\"model\":\"gpt-3.5-turbo-instruct\",\"hub\":\"openai\"},\n", + " 'tests': {'defaults': {'min_pass_rate': 0.65},\n", + " 'robustness': {'uppercase': {'min_pass_rate': 0.75},\n", + " 'lowercase':{'min_pass_rate': 0.75},\n", + " 'titlecase':{'min_pass_rate': 0.75},\n", + " 'add_typo':{'min_pass_rate': 0.75},\n", + " 'dyslexia_word_swap':{'min_pass_rate': 0.75},\n", + " 'add_abbreviation':{'min_pass_rate': 0.75},\n", + " 'add_slangs':{'min_pass_rate': 0.75},\n", + " 'add_speech_to_text_typo':{'min_pass_rate': 0.75},\n", + " 'add_ocr_typo':{'min_pass_rate': 0.75},\n", + " 'adjective_synonym_swap':{'min_pass_rate': 0.75},\n", + " 'adjective_antonym_swap':{'min_pass_rate': 0.75}\n", + " }\n", + " }\n", + " }\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + 
"### Generating the test cases." + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "Generating testcases...: 100%|██████████| 1/1 [00:00\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
 | category | test_type | original_context | original_question | perturbed_context | perturbed_question
0 | robustness | uppercase | - | A person wants to start saving money so that they can afford a nice vacation at the end of the year. After looking over their budget and expenses, they decide the best way to save money is to\\n\\nA. make more phone calls\\nB. quit eating lunch out\\nC. buy less with monopoly money\\nD. have lunch with friends | - | A PERSON WANTS TO START SAVING MONEY SO THAT THEY CAN AFFORD A NICE VACATION AT THE END OF THE YEAR. AFTER LOOKING OVER THEIR BUDGET AND EXPENSES, THEY DECIDE THE BEST WAY TO SAVE MONEY IS TO A. MAKE MORE PHONE CALLS B. QUIT EATING LUNCH OUT C. BUY LESS WITH MONOPOLY MONEY D. HAVE LUNCH WITH FRIENDS
1 | robustness | uppercase | - | There is most likely going to be fog around:\\n\\nA. a marsh\\nB. a tundra\\nC. the plains\\nD. a desert | - | THERE IS MOST LIKELY GOING TO BE FOG AROUND: A. A MARSH B. A TUNDRA C. THE PLAINS D. A DESERT
2 | robustness | uppercase | - | Predators eat\\n\\nA. lions\\nB. humans\\nC. bunnies\\nD. grass | - | PREDATORS EAT A. LIONS B. HUMANS C. BUNNIES D. GRASS
3 | robustness | uppercase | - | Oak tree seeds are planted and a sidewalk is paved right next to that spot, until eventually, the tree is tall and the roots must extend past the sidewalk, which means\\n\\nA. roots may be split\\nB. roots may begin to die\\nC. parts may break the concrete\\nD. roots may fall apart | - | OAK TREE SEEDS ARE PLANTED AND A SIDEWALK IS PAVED RIGHT NEXT TO THAT SPOT, UNTIL EVENTUALLY, THE TREE IS TALL AND THE ROOTS MUST EXTEND PAST THE SIDEWALK, WHICH MEANS A. ROOTS MAY BE SPLIT B. ROOTS MAY BEGIN TO DIE C. PARTS MAY BREAK THE CONCRETE D. ROOTS MAY FALL APART
4 | robustness | uppercase | - | An electric car runs on electricity via\\n\\nA. gasoline\\nB. a power station\\nC. electrical conductors\\nD. fuel | - | AN ELECTRIC CAR RUNS ON ELECTRICITY VIA A. GASOLINE B. A POWER STATION C. ELECTRICAL CONDUCTORS D. FUEL
... | ... | ... | ... | ... | ... | ...
4808 | robustness | adjective_antonym_swap | - | A woman, with a pale complexion, wants to spend the bright, sunny day at the beach. She makes sure that she stops at the store to pick up some sunblock before she begins to enjoy her day filled with sand and surf. She applies the sunblock carefully and thoroughly, because she knows that\\n\\nA. UV rays are harmful\\nB. sunlight will be fun\\nC. the sun is close\\nD. the sun is in space | - | A woman, with a pale complexion, wants to spend the dull, sunny day at the beach. She makes sure that she stops at the store to pick up no sunblock before she begins to enjoy her day filled with sand and surf. She applies the sunblock carefully and thoroughly, because she knows that\\n\\nA. UV rays are harmless\\nB. sunlight will be fun\\nC. the sun is distant\\nD. the sun is in space
4809 | robustness | adjective_antonym_swap | - | Pasta may be cooked in water when\\n\\nA. the water is warm\\nB. the water is on the stove\\nC. water is bubbling from applied warmth\\nD. the pasta is very fresh | - | Pasta may be raw in water when\\n\\nA. the water is cool\\nB. the water is on the stove\\nC. water is bubbling from applied warmth\\nD. the pasta is very salty
4810 | robustness | adjective_antonym_swap | - | A decrease in diseases\\n\\nA. has no impact on a population\\nB. leads to more sick people\\nC. leads to less sick people\\nD. leads to an uptick in emergency room visits | - | A decrease in diseases\\n\\nA. has no impact on a population\\nB. leads to more well people\\nC. leads to less well people\\nD. leads to an uptick in emergency room visits
4811 | robustness | adjective_antonym_swap | - | When soil is viewed in a scientific way, what is seen and viewed is actually\\n\\nA. insects like big beetles\\nB. tiny lifeforms in dirt\\nC. small mammals living there\\nD. a lot of tiny pebbles | - | When soil is viewed in a unscientific way, what is seen and viewed is actually\\n\\nA. insects like small beetles\\nB. tiny lifeforms in dirt\\nC. big mammals living there\\nD. a lot of tiny pebbles
4812 | robustness | adjective_antonym_swap | - | Some animals use a liquid coming from their skin to adjust to\\n\\nA. cold\\nB. water\\nC. heat\\nD. humidity | - | Some animals use a gaseous coming from their skin to adjust to\\n\\nA. hot\\nB. water\\nC. heat\\nD. humidity
\n", + "

4813 rows × 6 columns

\n", + "" + ], + "text/plain": [ + " category test_type original_context \\\n", + "0 robustness uppercase - \n", + "1 robustness uppercase - \n", + "2 robustness uppercase - \n", + "3 robustness uppercase - \n", + "4 robustness uppercase - \n", + "... ... ... ... \n", + "4808 robustness adjective_antonym_swap - \n", + "4809 robustness adjective_antonym_swap - \n", + "4810 robustness adjective_antonym_swap - \n", + "4811 robustness adjective_antonym_swap - \n", + "4812 robustness adjective_antonym_swap - \n", + "\n", + " original_question \\\n", + "0 A person wants to start saving money so that they can afford a nice vacation at the end of the year. After looking over their budget and expenses, they decide the best way to save money is to\\n\\nA. make more phone calls\\nB. quit eating lunch out\\nC. buy less with monopoly money\\nD. have lunch with friends \n", + "1 There is most likely going to be fog around:\\n\\nA. a marsh\\nB. a tundra\\nC. the plains\\nD. a desert \n", + "2 Predators eat\\n\\nA. lions\\nB. humans\\nC. bunnies\\nD. grass \n", + "3 Oak tree seeds are planted and a sidewalk is paved right next to that spot, until eventually, the tree is tall and the roots must extend past the sidewalk, which means\\n\\nA. roots may be split\\nB. roots may begin to die\\nC. parts may break the concrete\\nD. roots may fall apart \n", + "4 An electric car runs on electricity via\\n\\nA. gasoline\\nB. a power station\\nC. electrical conductors\\nD. fuel \n", + "... ... \n", + "4808 A woman, with a pale complexion, wants to spend the bright, sunny day at the beach. She makes sure that she stops at the store to pick up some sunblock before she begins to enjoy her day filled with sand and surf. She applies the sunblock carefully and thoroughly, because she knows that\\n\\nA. UV rays are harmful\\nB. sunlight will be fun\\nC. the sun is close\\nD. the sun is in space \n", + "4809 Pasta may be cooked in water when\\n\\nA. the water is warm\\nB. 
the water is on the stove\\nC. water is bubbling from applied warmth\\nD. the pasta is very fresh \n", + "4810 A decrease in diseases\\n\\nA. has no impact on a population\\nB. leads to more sick people\\nC. leads to less sick people\\nD. leads to an uptick in emergency room visits \n", + "4811 When soil is viewed in a scientific way, what is seen and viewed is actually\\n\\nA. insects like big beetles\\nB. tiny lifeforms in dirt\\nC. small mammals living there\\nD. a lot of tiny pebbles \n", + "4812 Some animals use a liquid coming from their skin to adjust to\\n\\nA. cold\\nB. water\\nC. heat\\nD. humidity \n", + "\n", + " perturbed_context \\\n", + "0 - \n", + "1 - \n", + "2 - \n", + "3 - \n", + "4 - \n", + "... ... \n", + "4808 - \n", + "4809 - \n", + "4810 - \n", + "4811 - \n", + "4812 - \n", + "\n", + " perturbed_question \n", + "0 A PERSON WANTS TO START SAVING MONEY SO THAT THEY CAN AFFORD A NICE VACATION AT THE END OF THE YEAR. AFTER LOOKING OVER THEIR BUDGET AND EXPENSES, THEY DECIDE THE BEST WAY TO SAVE MONEY IS TO A. MAKE MORE PHONE CALLS B. QUIT EATING LUNCH OUT C. BUY LESS WITH MONOPOLY MONEY D. HAVE LUNCH WITH FRIENDS \n", + "1 THERE IS MOST LIKELY GOING TO BE FOG AROUND: A. A MARSH B. A TUNDRA C. THE PLAINS D. A DESERT \n", + "2 PREDATORS EAT A. LIONS B. HUMANS C. BUNNIES D. GRASS \n", + "3 OAK TREE SEEDS ARE PLANTED AND A SIDEWALK IS PAVED RIGHT NEXT TO THAT SPOT, UNTIL EVENTUALLY, THE TREE IS TALL AND THE ROOTS MUST EXTEND PAST THE SIDEWALK, WHICH MEANS A. ROOTS MAY BE SPLIT B. ROOTS MAY BEGIN TO DIE C. PARTS MAY BREAK THE CONCRETE D. ROOTS MAY FALL APART \n", + "4 AN ELECTRIC CAR RUNS ON ELECTRICITY VIA A. GASOLINE B. A POWER STATION C. ELECTRICAL CONDUCTORS D. FUEL \n", + "... ... \n", + "4808 A woman, with a pale complexion, wants to spend the dull, sunny day at the beach. She makes sure that she stops at the store to pick up no sunblock before she begins to enjoy her day filled with sand and surf. 
She applies the sunblock carefully and thoroughly, because she knows that\\n\\nA. UV rays are harmless\\nB. sunlight will be fun\\nC. the sun is distant\\nD. the sun is in space \n", + "4809 Pasta may be raw in water when\\n\\nA. the water is cool\\nB. the water is on the stove\\nC. water is bubbling from applied warmth\\nD. the pasta is very salty \n", + "4810 A decrease in diseases\\n\\nA. has no impact on a population\\nB. leads to more well people\\nC. leads to less well people\\nD. leads to an uptick in emergency room visits \n", + "4811 When soil is viewed in a unscientific way, what is seen and viewed is actually\\n\\nA. insects like small beetles\\nB. tiny lifeforms in dirt\\nC. big mammals living there\\nD. a lot of tiny pebbles \n", + "4812 Some animals use a gaseous coming from their skin to adjust to\\n\\nA. hot\\nB. water\\nC. heat\\nD. humidity \n", + "\n", + "[4813 rows x 6 columns]" + ] + }, + "execution_count": 7, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "harness.testcases()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Running the tests" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "harness.run()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Saving model response (expected_result and actual_result)" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": {}, + "outputs": [], + "source": [ + "harness.save(save_dir=\"Intel/neural-chat-7b-v3-1-OpenBookQA\", include_generated_results=True)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Generated Results" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": {}, + "outputs": [], + "source": [ + "generated_results = harness.generated_results()" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": {}, + "outputs": [ + { + "data": { + 
"text/html": [ + "
\n", + "
 | category | test_type | original_context | original_question | perturbed_context | perturbed_question | expected_result | actual_result | pass
0 | robustness | uppercase | - | A person wants to start saving money so that they can afford a nice vacation at the end of the year. After looking over their budget and expenses, they decide the best way to save money is to\\n\\nA. make more phone calls\\nB. quit eating lunch out\\nC. buy less with monopoly money\\nD. have lunch with friends | - | A PERSON WANTS TO START SAVING MONEY SO THAT THEY CAN AFFORD A NICE VACATION AT THE END OF THE YEAR. AFTER LOOKING OVER THEIR BUDGET AND EXPENSES, THEY DECIDE THE BEST WAY TO SAVE MONEY IS TO A. MAKE MORE PHONE CALLS B. QUIT EATING LUNCH OUT C. BUY LESS WITH MONOPOLY MONEY D. HAVE LUNCH WITH FRIENDS | B | B | True
1 | robustness | uppercase | - | There is most likely going to be fog around:\\n\\nA. a marsh\\nB. a tundra\\nC. the plains\\nD. a desert | - | THERE IS MOST LIKELY GOING TO BE FOG AROUND: A. A MARSH B. A TUNDRA C. THE PLAINS D. A DESERT | A | A | True
2 | robustness | uppercase | - | Predators eat\\n\\nA. lions\\nB. humans\\nC. bunnies\\nD. grass | - | PREDATORS EAT A. LIONS B. HUMANS C. BUNNIES D. GRASS | A | A | True
3 | robustness | uppercase | - | Oak tree seeds are planted and a sidewalk is paved right next to that spot, until eventually, the tree is tall and the roots must extend past the sidewalk, which means\\n\\nA. roots may be split\\nB. roots may begin to die\\nC. parts may break the concrete\\nD. roots may fall apart | - | OAK TREE SEEDS ARE PLANTED AND A SIDEWALK IS PAVED RIGHT NEXT TO THAT SPOT, UNTIL EVENTUALLY, THE TREE IS TALL AND THE ROOTS MUST EXTEND PAST THE SIDEWALK, WHICH MEANS A. ROOTS MAY BE SPLIT B. ROOTS MAY BEGIN TO DIE C. PARTS MAY BREAK THE CONCRETE D. ROOTS MAY FALL APART | C | C | True
4 | robustness | uppercase | - | An electric car runs on electricity via\\n\\nA. gasoline\\nB. a power station\\nC. electrical conductors\\nD. fuel | - | AN ELECTRIC CAR RUNS ON ELECTRICITY VIA A. GASOLINE B. A POWER STATION C. ELECTRICAL CONDUCTORS D. FUEL | C | A | False
... | ... | ... | ... | ... | ... | ... | ... | ... | ...
4808 | robustness | adjective_antonym_swap | - | A woman, with a pale complexion, wants to spend the bright, sunny day at the beach. She makes sure that she stops at the store to pick up some sunblock before she begins to enjoy her day filled with sand and surf. She applies the sunblock carefully and thoroughly, because she knows that\\n\\nA. UV rays are harmful\\nB. sunlight will be fun\\nC. the sun is close\\nD. the sun is in space | - | A woman, with a pale complexion, wants to spend the dull, sunny day at the beach. She makes sure that she stops at the store to pick up no sunblock before she begins to enjoy her day filled with sand and surf. She applies the sunblock carefully and thoroughly, because she knows that\\n\\nA. UV rays are harmless\\nB. sunlight will be fun\\nC. the sun is distant\\nD. the sun is in space | A | B | False
4809 | robustness | adjective_antonym_swap | - | Pasta may be cooked in water when\\n\\nA. the water is warm\\nB. the water is on the stove\\nC. water is bubbling from applied warmth\\nD. the pasta is very fresh | - | Pasta may be raw in water when\\n\\nA. the water is cool\\nB. the water is on the stove\\nC. water is bubbling from applied warmth\\nD. the pasta is very salty | B | A | False
4810 | robustness | adjective_antonym_swap | - | A decrease in diseases\\n\\nA. has no impact on a population\\nB. leads to more sick people\\nC. leads to less sick people\\nD. leads to an uptick in emergency room visits | - | A decrease in diseases\\n\\nA. has no impact on a population\\nB. leads to more well people\\nC. leads to less well people\\nD. leads to an uptick in emergency room visits | C | B | False
4811 | robustness | adjective_antonym_swap | - | When soil is viewed in a scientific way, what is seen and viewed is actually\\n\\nA. insects like big beetles\\nB. tiny lifeforms in dirt\\nC. small mammals living there\\nD. a lot of tiny pebbles | - | When soil is viewed in a unscientific way, what is seen and viewed is actually\\n\\nA. insects like small beetles\\nB. tiny lifeforms in dirt\\nC. big mammals living there\\nD. a lot of tiny pebbles | B | B | True
4812 | robustness | adjective_antonym_swap | - | Some animals use a liquid coming from their skin to adjust to\\n\\nA. cold\\nB. water\\nC. heat\\nD. humidity | - | Some animals use a gaseous coming from their skin to adjust to\\n\\nA. hot\\nB. water\\nC. heat\\nD. humidity | D | D | True
\n", + "

4813 rows × 9 columns

\n", + "
" + ], + "text/plain": [ + " category test_type original_context \\\n", + "0 robustness uppercase - \n", + "1 robustness uppercase - \n", + "2 robustness uppercase - \n", + "3 robustness uppercase - \n", + "4 robustness uppercase - \n", + "... ... ... ... \n", + "4808 robustness adjective_antonym_swap - \n", + "4809 robustness adjective_antonym_swap - \n", + "4810 robustness adjective_antonym_swap - \n", + "4811 robustness adjective_antonym_swap - \n", + "4812 robustness adjective_antonym_swap - \n", + "\n", + " original_question \\\n", + "0 A person wants to start saving money so that they can afford a nice vacation at the end of the year. After looking over their budget and expenses, they decide the best way to save money is to\\n\\nA. make more phone calls\\nB. quit eating lunch out\\nC. buy less with monopoly money\\nD. have lunch with friends \n", + "1 There is most likely going to be fog around:\\n\\nA. a marsh\\nB. a tundra\\nC. the plains\\nD. a desert \n", + "2 Predators eat\\n\\nA. lions\\nB. humans\\nC. bunnies\\nD. grass \n", + "3 Oak tree seeds are planted and a sidewalk is paved right next to that spot, until eventually, the tree is tall and the roots must extend past the sidewalk, which means\\n\\nA. roots may be split\\nB. roots may begin to die\\nC. parts may break the concrete\\nD. roots may fall apart \n", + "4 An electric car runs on electricity via\\n\\nA. gasoline\\nB. a power station\\nC. electrical conductors\\nD. fuel \n", + "... ... \n", + "4808 A woman, with a pale complexion, wants to spend the bright, sunny day at the beach. She makes sure that she stops at the store to pick up some sunblock before she begins to enjoy her day filled with sand and surf. She applies the sunblock carefully and thoroughly, because she knows that\\n\\nA. UV rays are harmful\\nB. sunlight will be fun\\nC. the sun is close\\nD. the sun is in space \n", + "4809 Pasta may be cooked in water when\\n\\nA. the water is warm\\nB. the water is on the stove\\nC. 
water is bubbling from applied warmth\\nD. the pasta is very fresh \n", + "4810 A decrease in diseases\\n\\nA. has no impact on a population\\nB. leads to more sick people\\nC. leads to less sick people\\nD. leads to an uptick in emergency room visits \n", + "4811 When soil is viewed in a scientific way, what is seen and viewed is actually\\n\\nA. insects like big beetles\\nB. tiny lifeforms in dirt\\nC. small mammals living there\\nD. a lot of tiny pebbles \n", + "4812 Some animals use a liquid coming from their skin to adjust to\\n\\nA. cold\\nB. water\\nC. heat\\nD. humidity \n", + "\n", + " perturbed_context \\\n", + "0 - \n", + "1 - \n", + "2 - \n", + "3 - \n", + "4 - \n", + "... ... \n", + "4808 - \n", + "4809 - \n", + "4810 - \n", + "4811 - \n", + "4812 - \n", + "\n", + " perturbed_question \\\n", + "0 A PERSON WANTS TO START SAVING MONEY SO THAT THEY CAN AFFORD A NICE VACATION AT THE END OF THE YEAR. AFTER LOOKING OVER THEIR BUDGET AND EXPENSES, THEY DECIDE THE BEST WAY TO SAVE MONEY IS TO A. MAKE MORE PHONE CALLS B. QUIT EATING LUNCH OUT C. BUY LESS WITH MONOPOLY MONEY D. HAVE LUNCH WITH FRIENDS \n", + "1 THERE IS MOST LIKELY GOING TO BE FOG AROUND: A. A MARSH B. A TUNDRA C. THE PLAINS D. A DESERT \n", + "2 PREDATORS EAT A. LIONS B. HUMANS C. BUNNIES D. GRASS \n", + "3 OAK TREE SEEDS ARE PLANTED AND A SIDEWALK IS PAVED RIGHT NEXT TO THAT SPOT, UNTIL EVENTUALLY, THE TREE IS TALL AND THE ROOTS MUST EXTEND PAST THE SIDEWALK, WHICH MEANS A. ROOTS MAY BE SPLIT B. ROOTS MAY BEGIN TO DIE C. PARTS MAY BREAK THE CONCRETE D. ROOTS MAY FALL APART \n", + "4 AN ELECTRIC CAR RUNS ON ELECTRICITY VIA A. GASOLINE B. A POWER STATION C. ELECTRICAL CONDUCTORS D. FUEL \n", + "... ... \n", + "4808 A woman, with a pale complexion, wants to spend the dull, sunny day at the beach. She makes sure that she stops at the store to pick up no sunblock before she begins to enjoy her day filled with sand and surf. 
She applies the sunblock carefully and thoroughly, because she knows that\\n\\nA. UV rays are harmless\\nB. sunlight will be fun\\nC. the sun is distant\\nD. the sun is in space \n", + "4809 Pasta may be raw in water when\\n\\nA. the water is cool\\nB. the water is on the stove\\nC. water is bubbling from applied warmth\\nD. the pasta is very salty \n", + "4810 A decrease in diseases\\n\\nA. has no impact on a population\\nB. leads to more well people\\nC. leads to less well people\\nD. leads to an uptick in emergency room visits \n", + "4811 When soil is viewed in a unscientific way, what is seen and viewed is actually\\n\\nA. insects like small beetles\\nB. tiny lifeforms in dirt\\nC. big mammals living there\\nD. a lot of tiny pebbles \n", + "4812 Some animals use a gaseous coming from their skin to adjust to\\n\\nA. hot\\nB. water\\nC. heat\\nD. humidity \n", + "\n", + " expected_result actual_result pass \n", + "0 B B True \n", + "1 A A True \n", + "2 A A True \n", + "3 C C True \n", + "4 C A False \n", + "... ... ... ... \n", + "4808 A B False \n", + "4809 B A False \n", + "4810 C B False \n", + "4811 B B True \n", + "4812 D D True \n", + "\n", + "[4813 rows x 9 columns]" + ] + }, + "execution_count": 11, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "generated_results" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Final Results" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": {}, + "outputs": [], + "source": [ + "report = harness.report()" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
categorytest_typefail_countpass_countpass_rateminimum_pass_ratepass
0robustnessuppercase10639479%75%True
1robustnesslowercase6044088%75%True
2robustnesstitlecase9240882%75%True
3robustnessadd_typo5242789%75%True
4robustnessdyslexia_word_swap5835786%75%True
5robustnessadd_abbreviation7136984%75%True
6robustnessadd_slangs6124680%75%True
7robustnessadd_speech_to_text_typo6843286%75%True
8robustnessadd_ocr_typo13436473%75%False
9robustnessadjective_synonym_swap7429380%75%True
10robustnessadjective_antonym_swap9021771%75%False
\n", + "
" + ], + "text/plain": [ + " category test_type fail_count pass_count pass_rate \\\n", + "0 robustness uppercase 106 394 79% \n", + "1 robustness lowercase 60 440 88% \n", + "2 robustness titlecase 92 408 82% \n", + "3 robustness add_typo 52 427 89% \n", + "4 robustness dyslexia_word_swap 58 357 86% \n", + "5 robustness add_abbreviation 71 369 84% \n", + "6 robustness add_slangs 61 246 80% \n", + "7 robustness add_speech_to_text_typo 68 432 86% \n", + "8 robustness add_ocr_typo 134 364 73% \n", + "9 robustness adjective_synonym_swap 74 293 80% \n", + "10 robustness adjective_antonym_swap 90 217 71% \n", + "\n", + " minimum_pass_rate pass \n", + "0 75% True \n", + "1 75% True \n", + "2 75% True \n", + "3 75% True \n", + "4 75% True \n", + "5 75% True \n", + "6 75% True \n", + "7 75% True \n", + "8 75% False \n", + "9 75% True \n", + "10 75% False " + ] + }, + "execution_count": 13, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "report " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Saving report and generated_results" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "metadata": {}, + "outputs": [], + "source": [ + "generated_results.to_csv('Intel/neural-chat-7b-v3-1-OpenBookQA.csv', index=False)" + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "metadata": {}, + "outputs": [], + "source": [ + "report.to_csv('Intel/neural-chat-7b-v3-1-OpenBookQA-report.csv', index=False)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Model gpt-4-1106-preview" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Setup and Configure Harness" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": { + "execution": { + "iopub.execute_input": "2023-11-28T14:44:20.622319Z", + "iopub.status.busy": "2023-11-28T14:44:20.621904Z", + "iopub.status.idle": "2023-11-28T14:44:20.741369Z", + "shell.execute_reply": 
"2023-11-28T14:44:20.740588Z", + "shell.execute_reply.started": "2023-11-28T14:44:20.622303Z" + }, + "tags": [] + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Test Configuration : \n", + " {\n", + " \"model_parameters\": {\n", + " \"temperature\": 0.2,\n", + " \"max_tokens\": 64\n", + " },\n", + " \"tests\": {\n", + " \"defaults\": {\n", + " \"min_pass_rate\": 1.0\n", + " },\n", + " \"robustness\": {\n", + " \"add_typo\": {\n", + " \"min_pass_rate\": 0.7\n", + " },\n", + " \"lowercase\": {\n", + " \"min_pass_rate\": 0.7\n", + " }\n", + " }\n", + " }\n", + "}\n" + ] + } + ], + "source": [ + "harness = Harness(\n", + " task=\"question-answering\",\n", + " model={\"model\": \"gpt-4-1106-preview\", \"hub\":\"openai\"},\n", + " data={\"data_source\" :\"OpenBookQA\",\n", + " \"split\":\"test\"}\n", + " )" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": { + "execution": { + "iopub.execute_input": "2023-11-28T14:44:20.742136Z", + "iopub.status.busy": "2023-11-28T14:44:20.741984Z", + "iopub.status.idle": "2023-11-28T14:44:20.803916Z", + "shell.execute_reply": "2023-11-28T14:44:20.803244Z", + "shell.execute_reply.started": "2023-11-28T14:44:20.742122Z" + }, + "tags": [] + }, + "outputs": [ + { + "data": { + "text/plain": [ + "{'evaluation': {'metric': 'QAEvalChain',\n", + " 'model': 'gpt-3.5-turbo-instruct',\n", + " 'hub': 'openai'},\n", + " 'tests': {'defaults': {'min_pass_rate': 0.65},\n", + " 'robustness': {'uppercase': {'min_pass_rate': 0.75},\n", + " 'lowercase': {'min_pass_rate': 0.75},\n", + " 'titlecase': {'min_pass_rate': 0.75},\n", + " 'add_typo': {'min_pass_rate': 0.75},\n", + " 'dyslexia_word_swap': {'min_pass_rate': 0.75},\n", + " 'add_abbreviation': {'min_pass_rate': 0.75},\n", + " 'add_slangs': {'min_pass_rate': 0.75},\n", + " 'add_speech_to_text_typo': {'min_pass_rate': 0.75},\n", + " 'add_ocr_typo': {'min_pass_rate': 0.75},\n", + " 'adjective_synonym_swap': {'min_pass_rate': 0.75},\n", + " 
'adjective_antonym_swap': {'min_pass_rate': 0.75}}}}" + ] + }, + "execution_count": 8, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "harness.configure(\n", + "{\n", + " \"evaluation\": {\"metric\":\"QAEvalChain\",\"model\":\"gpt-3.5-turbo-instruct\",\"hub\":\"openai\"},\n", + " 'tests': {'defaults': {'min_pass_rate': 0.65},\n", + " 'robustness': {'uppercase': {'min_pass_rate': 0.75},\n", + " 'lowercase':{'min_pass_rate': 0.75},\n", + " 'titlecase':{'min_pass_rate': 0.75},\n", + " 'add_typo':{'min_pass_rate': 0.75},\n", + " 'dyslexia_word_swap':{'min_pass_rate': 0.75},\n", + " 'add_abbreviation':{'min_pass_rate': 0.75},\n", + " 'add_slangs':{'min_pass_rate': 0.75},\n", + " 'add_speech_to_text_typo':{'min_pass_rate': 0.75},\n", + " 'add_ocr_typo':{'min_pass_rate': 0.75},\n", + " 'adjective_synonym_swap':{'min_pass_rate': 0.75},\n", + " 'adjective_antonym_swap':{'min_pass_rate': 0.75}\n", + " }\n", + " }\n", + " }\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Generating the test cases." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": { + "execution": { + "iopub.execute_input": "2023-11-28T14:44:20.938360Z", + "iopub.status.busy": "2023-11-28T14:44:20.937994Z", + "iopub.status.idle": "2023-11-28T14:45:43.700124Z", + "shell.execute_reply": "2023-11-28T14:45:43.699516Z", + "shell.execute_reply.started": "2023-11-28T14:44:20.938341Z" + }, + "tags": [] + }, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "Generating testcases...: 100%|██████████| 1/1 [00:00<00:00, 6502.80it/s]\n", + "WARNING:root:[W009] Removing samples where no transformation has been applied:\n", + "[W010] - Test 'add_typo': 21 samples removed out of 500\n", + "[W010] - Test 'dyslexia_word_swap': 85 samples removed out of 500\n", + "[W010] - Test 'add_abbreviation': 60 samples removed out of 500\n", + "[W010] - Test 'add_slangs': 193 samples removed out of 500\n", + "[W010] - Test 'add_ocr_typo': 2 samples removed out of 500\n", + "[W010] - Test 'adjective_synonym_swap': 133 samples removed out of 500\n", + "[W010] - Test 'adjective_antonym_swap': 193 samples removed out of 500\n", + "\n" + ] + }, + { + "data": { + "text/plain": [] + }, + "execution_count": 9, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "seed_value = 42\n", + "random.seed(seed_value)\n", + "harness.generate()" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": { + "execution": { + "iopub.execute_input": "2023-11-28T14:45:43.701265Z", + "iopub.status.busy": "2023-11-28T14:45:43.701099Z", + "iopub.status.idle": "2023-11-28T14:45:44.139425Z", + "shell.execute_reply": "2023-11-28T14:45:44.138863Z", + "shell.execute_reply.started": "2023-11-28T14:45:43.701249Z" + }, + "tags": [] + }, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
categorytest_typeoriginal_contextoriginal_questionperturbed_contextperturbed_question
0robustnessuppercase-A person wants to start saving money so that they can afford a nice vacation at the end of the year. After looking over their budget and expenses, they decide the best way to save money is to\\n\\nA. make more phone calls\\nB. quit eating lunch out\\nC. buy less with monopoly money\\nD. have lunch with friends-A PERSON WANTS TO START SAVING MONEY SO THAT THEY CAN AFFORD A NICE VACATION AT THE END OF THE YEAR. AFTER LOOKING OVER THEIR BUDGET AND EXPENSES, THEY DECIDE THE BEST WAY TO SAVE MONEY IS TO A. MAKE MORE PHONE CALLS B. QUIT EATING LUNCH OUT C. BUY LESS WITH MONOPOLY MONEY D. HAVE LUNCH WITH FRIENDS
1robustnessuppercase-There is most likely going to be fog around:\\n\\nA. a marsh\\nB. a tundra\\nC. the plains\\nD. a desert-THERE IS MOST LIKELY GOING TO BE FOG AROUND: A. A MARSH B. A TUNDRA C. THE PLAINS D. A DESERT
2robustnessuppercase-Predators eat\\n\\nA. lions\\nB. humans\\nC. bunnies\\nD. grass-PREDATORS EAT A. LIONS B. HUMANS C. BUNNIES D. GRASS
3robustnessuppercase-Oak tree seeds are planted and a sidewalk is paved right next to that spot, until eventually, the tree is tall and the roots must extend past the sidewalk, which means\\n\\nA. roots may be split\\nB. roots may begin to die\\nC. parts may break the concrete\\nD. roots may fall apart-OAK TREE SEEDS ARE PLANTED AND A SIDEWALK IS PAVED RIGHT NEXT TO THAT SPOT, UNTIL EVENTUALLY, THE TREE IS TALL AND THE ROOTS MUST EXTEND PAST THE SIDEWALK, WHICH MEANS A. ROOTS MAY BE SPLIT B. ROOTS MAY BEGIN TO DIE C. PARTS MAY BREAK THE CONCRETE D. ROOTS MAY FALL APART
4robustnessuppercase-An electric car runs on electricity via\\n\\nA. gasoline\\nB. a power station\\nC. electrical conductors\\nD. fuel-AN ELECTRIC CAR RUNS ON ELECTRICITY VIA A. GASOLINE B. A POWER STATION C. ELECTRICAL CONDUCTORS D. FUEL
.....................
4808robustnessadjective_antonym_swap-A woman, with a pale complexion, wants to spend the bright, sunny day at the beach. She makes sure that she stops at the store to pick up some sunblock before she begins to enjoy her day filled with sand and surf. She applies the sunblock carefully and thoroughly, because she knows that\\n\\nA. UV rays are harmful\\nB. sunlight will be fun\\nC. the sun is close\\nD. the sun is in space-A woman, with a pale complexion, wants to spend the dull, sunny day at the beach. She makes sure that she stops at the store to pick up no sunblock before she begins to enjoy her day filled with sand and surf. She applies the sunblock carefully and thoroughly, because she knows that\\n\\nA. UV rays are harmless\\nB. sunlight will be fun\\nC. the sun is distant\\nD. the sun is in space
4809robustnessadjective_antonym_swap-Pasta may be cooked in water when\\n\\nA. the water is warm\\nB. the water is on the stove\\nC. water is bubbling from applied warmth\\nD. the pasta is very fresh-Pasta may be raw in water when\\n\\nA. the water is cool\\nB. the water is on the stove\\nC. water is bubbling from applied warmth\\nD. the pasta is very salty
4810robustnessadjective_antonym_swap-A decrease in diseases\\n\\nA. has no impact on a population\\nB. leads to more sick people\\nC. leads to less sick people\\nD. leads to an uptick in emergency room visits-A decrease in diseases\\n\\nA. has no impact on a population\\nB. leads to more well people\\nC. leads to less well people\\nD. leads to an uptick in emergency room visits
4811robustnessadjective_antonym_swap-When soil is viewed in a scientific way, what is seen and viewed is actually\\n\\nA. insects like big beetles\\nB. tiny lifeforms in dirt\\nC. small mammals living there\\nD. a lot of tiny pebbles-When soil is viewed in a unscientific way, what is seen and viewed is actually\\n\\nA. insects like small beetles\\nB. tiny lifeforms in dirt\\nC. big mammals living there\\nD. a lot of tiny pebbles
4812robustnessadjective_antonym_swap-Some animals use a liquid coming from their skin to adjust to\\n\\nA. cold\\nB. water\\nC. heat\\nD. humidity-Some animals use a gaseous coming from their skin to adjust to\\n\\nA. hot\\nB. water\\nC. heat\\nD. humidity
\n", + "

4813 rows × 6 columns

\n", + "
" + ], + "text/plain": [ + " category test_type original_context \\\n", + "0 robustness uppercase - \n", + "1 robustness uppercase - \n", + "2 robustness uppercase - \n", + "3 robustness uppercase - \n", + "4 robustness uppercase - \n", + "... ... ... ... \n", + "4808 robustness adjective_antonym_swap - \n", + "4809 robustness adjective_antonym_swap - \n", + "4810 robustness adjective_antonym_swap - \n", + "4811 robustness adjective_antonym_swap - \n", + "4812 robustness adjective_antonym_swap - \n", + "\n", + " original_question \\\n", + "0 A person wants to start saving money so that they can afford a nice vacation at the end of the year. After looking over their budget and expenses, they decide the best way to save money is to\\n\\nA. make more phone calls\\nB. quit eating lunch out\\nC. buy less with monopoly money\\nD. have lunch with friends \n", + "1 There is most likely going to be fog around:\\n\\nA. a marsh\\nB. a tundra\\nC. the plains\\nD. a desert \n", + "2 Predators eat\\n\\nA. lions\\nB. humans\\nC. bunnies\\nD. grass \n", + "3 Oak tree seeds are planted and a sidewalk is paved right next to that spot, until eventually, the tree is tall and the roots must extend past the sidewalk, which means\\n\\nA. roots may be split\\nB. roots may begin to die\\nC. parts may break the concrete\\nD. roots may fall apart \n", + "4 An electric car runs on electricity via\\n\\nA. gasoline\\nB. a power station\\nC. electrical conductors\\nD. fuel \n", + "... ... \n", + "4808 A woman, with a pale complexion, wants to spend the bright, sunny day at the beach. She makes sure that she stops at the store to pick up some sunblock before she begins to enjoy her day filled with sand and surf. She applies the sunblock carefully and thoroughly, because she knows that\\n\\nA. UV rays are harmful\\nB. sunlight will be fun\\nC. the sun is close\\nD. the sun is in space \n", + "4809 Pasta may be cooked in water when\\n\\nA. the water is warm\\nB. the water is on the stove\\nC. 
water is bubbling from applied warmth\\nD. the pasta is very fresh \n", + "4810 A decrease in diseases\\n\\nA. has no impact on a population\\nB. leads to more sick people\\nC. leads to less sick people\\nD. leads to an uptick in emergency room visits \n", + "4811 When soil is viewed in a scientific way, what is seen and viewed is actually\\n\\nA. insects like big beetles\\nB. tiny lifeforms in dirt\\nC. small mammals living there\\nD. a lot of tiny pebbles \n", + "4812 Some animals use a liquid coming from their skin to adjust to\\n\\nA. cold\\nB. water\\nC. heat\\nD. humidity \n", + "\n", + " perturbed_context \\\n", + "0 - \n", + "1 - \n", + "2 - \n", + "3 - \n", + "4 - \n", + "... ... \n", + "4808 - \n", + "4809 - \n", + "4810 - \n", + "4811 - \n", + "4812 - \n", + "\n", + " perturbed_question \n", + "0 A PERSON WANTS TO START SAVING MONEY SO THAT THEY CAN AFFORD A NICE VACATION AT THE END OF THE YEAR. AFTER LOOKING OVER THEIR BUDGET AND EXPENSES, THEY DECIDE THE BEST WAY TO SAVE MONEY IS TO A. MAKE MORE PHONE CALLS B. QUIT EATING LUNCH OUT C. BUY LESS WITH MONOPOLY MONEY D. HAVE LUNCH WITH FRIENDS \n", + "1 THERE IS MOST LIKELY GOING TO BE FOG AROUND: A. A MARSH B. A TUNDRA C. THE PLAINS D. A DESERT \n", + "2 PREDATORS EAT A. LIONS B. HUMANS C. BUNNIES D. GRASS \n", + "3 OAK TREE SEEDS ARE PLANTED AND A SIDEWALK IS PAVED RIGHT NEXT TO THAT SPOT, UNTIL EVENTUALLY, THE TREE IS TALL AND THE ROOTS MUST EXTEND PAST THE SIDEWALK, WHICH MEANS A. ROOTS MAY BE SPLIT B. ROOTS MAY BEGIN TO DIE C. PARTS MAY BREAK THE CONCRETE D. ROOTS MAY FALL APART \n", + "4 AN ELECTRIC CAR RUNS ON ELECTRICITY VIA A. GASOLINE B. A POWER STATION C. ELECTRICAL CONDUCTORS D. FUEL \n", + "... ... \n", + "4808 A woman, with a pale complexion, wants to spend the dull, sunny day at the beach. She makes sure that she stops at the store to pick up no sunblock before she begins to enjoy her day filled with sand and surf. 
She applies the sunblock carefully and thoroughly, because she knows that\\n\\nA. UV rays are harmless\\nB. sunlight will be fun\\nC. the sun is distant\\nD. the sun is in space \n", + "4809 Pasta may be raw in water when\\n\\nA. the water is cool\\nB. the water is on the stove\\nC. water is bubbling from applied warmth\\nD. the pasta is very salty \n", + "4810 A decrease in diseases\\n\\nA. has no impact on a population\\nB. leads to more well people\\nC. leads to less well people\\nD. leads to an uptick in emergency room visits \n", + "4811 When soil is viewed in a unscientific way, what is seen and viewed is actually\\n\\nA. insects like small beetles\\nB. tiny lifeforms in dirt\\nC. big mammals living there\\nD. a lot of tiny pebbles \n", + "4812 Some animals use a gaseous coming from their skin to adjust to\\n\\nA. hot\\nB. water\\nC. heat\\nD. humidity \n", + "\n", + "[4813 rows x 6 columns]" + ] + }, + "execution_count": 10, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "harness.testcases()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Running the tests" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": { + "execution": { + "iopub.execute_input": "2023-11-28T14:45:44.140403Z", + "iopub.status.busy": "2023-11-28T14:45:44.140247Z", + "iopub.status.idle": "2023-11-28T18:45:08.674969Z", + "shell.execute_reply": "2023-11-28T18:45:08.674369Z", + "shell.execute_reply.started": "2023-11-28T14:45:44.140388Z" + }, + "tags": [] + }, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "Running testcases... 
: 100%|██████████| 4813/4813 [3:59:24<00:00, 2.98s/it] \n" + ] + }, + { + "data": { + "text/plain": [] + }, + "execution_count": 11, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "harness.run()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Saving model response (expected_result and actual_result)" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": { + "execution": { + "iopub.execute_input": "2023-11-28T18:45:08.676280Z", + "iopub.status.busy": "2023-11-28T18:45:08.675814Z", + "iopub.status.idle": "2023-11-28T18:45:09.007067Z", + "shell.execute_reply": "2023-11-28T18:45:09.006501Z", + "shell.execute_reply.started": "2023-11-28T18:45:08.676255Z" + }, + "tags": [] + }, + "outputs": [], + "source": [ + "harness.save(save_dir=\"openai/gpt-4-1106-preview-OpenBookQA\", include_generated_results=True)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Generated Results" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": { + "execution": { + "iopub.execute_input": "2023-11-28T18:45:09.008337Z", + "iopub.status.busy": "2023-11-28T18:45:09.007874Z", + "iopub.status.idle": "2023-11-28T18:53:57.184392Z", + "shell.execute_reply": "2023-11-28T18:53:57.183856Z", + "shell.execute_reply.started": "2023-11-28T18:45:09.008312Z" + }, + "tags": [] + }, + "outputs": [], + "source": [ + "generated_results = harness.generated_results()" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "metadata": { + "execution": { + "iopub.execute_input": "2023-11-28T18:53:57.185552Z", + "iopub.status.busy": "2023-11-28T18:53:57.185098Z", + "iopub.status.idle": "2023-11-28T18:53:57.194355Z", + "shell.execute_reply": "2023-11-28T18:53:57.193961Z", + "shell.execute_reply.started": "2023-11-28T18:53:57.185533Z" + }, + "tags": [] + }, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
categorytest_typeoriginal_contextoriginal_questionperturbed_contextperturbed_questionexpected_resultactual_resultpass
0robustnessuppercase-A person wants to start saving money so that they can afford a nice vacation at the end of the year. After looking over their budget and expenses, they decide the best way to save money is to\\n\\nA. make more phone calls\\nB. quit eating lunch out\\nC. buy less with monopoly money\\nD. have lunch with friends-A PERSON WANTS TO START SAVING MONEY SO THAT THEY CAN AFFORD A NICE VACATION AT THE END OF THE YEAR. AFTER LOOKING OVER THEIR BUDGET AND EXPENSES, THEY DECIDE THE BEST WAY TO SAVE MONEY IS TO A. MAKE MORE PHONE CALLS B. QUIT EATING LUNCH OUT C. BUY LESS WITH MONOPOLY MONEY D. HAVE LUNCH WITH FRIENDSB. quit eating lunch outB. QUIT EATING LUNCH OUTTrue
1robustnessuppercase-There is most likely going to be fog around:\\n\\nA. a marsh\\nB. a tundra\\nC. the plains\\nD. a desert-THERE IS MOST LIKELY GOING TO BE FOG AROUND: A. A MARSH B. A TUNDRA C. THE PLAINS D. A DESERTA. a marshA. A MARSHTrue
2robustnessuppercase-Predators eat\\n\\nA. lions\\nB. humans\\nC. bunnies\\nD. grass-PREDATORS EAT A. LIONS B. HUMANS C. BUNNIES D. GRASSC. bunniesC. BUNNIESTrue
3robustnessuppercase-Oak tree seeds are planted and a sidewalk is paved right next to that spot, until eventually, the tree is tall and the roots must extend past the sidewalk, which means\\n\\nA. roots may be split\\nB. roots may begin to die\\nC. parts may break the concrete\\nD. roots may fall apart-OAK TREE SEEDS ARE PLANTED AND A SIDEWALK IS PAVED RIGHT NEXT TO THAT SPOT, UNTIL EVENTUALLY, THE TREE IS TALL AND THE ROOTS MUST EXTEND PAST THE SIDEWALK, WHICH MEANS A. ROOTS MAY BE SPLIT B. ROOTS MAY BEGIN TO DIE C. PARTS MAY BREAK THE CONCRETE D. ROOTS MAY FALL APARTC. parts may break the concreteC. PARTS MAY BREAK THE CONCRETETrue
4robustnessuppercase-An electric car runs on electricity via\\n\\nA. gasoline\\nB. a power station\\nC. electrical conductors\\nD. fuel-AN ELECTRIC CAR RUNS ON ELECTRICITY VIA A. GASOLINE B. A POWER STATION C. ELECTRICAL CONDUCTORS D. FUELB. a power stationB. A POWER STATIONTrue
..............................
4808robustnessadjective_antonym_swap-A woman, with a pale complexion, wants to spend the bright, sunny day at the beach. She makes sure that she stops at the store to pick up some sunblock before she begins to enjoy her day filled with sand and surf. She applies the sunblock carefully and thoroughly, because she knows that\\n\\nA. UV rays are harmful\\nB. sunlight will be fun\\nC. the sun is close\\nD. the sun is in space-A woman, with a pale complexion, wants to spend the dull, sunny day at the beach. She makes sure that she stops at the store to pick up no sunblock before she begins to enjoy her day filled with sand and surf. She applies the sunblock carefully and thoroughly, because she knows that\\n\\nA. UV rays are harmless\\nB. sunlight will be fun\\nC. the sun is distant\\nD. the sun is in spaceA. UV rays are harmfulB. sunlight will be funFalse
4809robustnessadjective_antonym_swap-Pasta may be cooked in water when\\n\\nA. the water is warm\\nB. the water is on the stove\\nC. water is bubbling from applied warmth\\nD. the pasta is very fresh-Pasta may be raw in water when\\n\\nA. the water is cool\\nB. the water is on the stove\\nC. water is bubbling from applied warmth\\nD. the pasta is very saltyC. water is bubbling from applied warmthA. the water is coolFalse
4810robustnessadjective_antonym_swap-A decrease in diseases\\n\\nA. has no impact on a population\\nB. leads to more sick people\\nC. leads to less sick people\\nD. leads to an uptick in emergency room visits-A decrease in diseases\\n\\nA. has no impact on a population\\nB. leads to more well people\\nC. leads to less well people\\nD. leads to an uptick in emergency room visitsC. leads to less sick peopleB. leads to more well peopleFalse
4811robustnessadjective_antonym_swap-When soil is viewed in a scientific way, what is seen and viewed is actually\\n\\nA. insects like big beetles\\nB. tiny lifeforms in dirt\\nC. small mammals living there\\nD. a lot of tiny pebbles-When soil is viewed in a unscientific way, what is seen and viewed is actually\\n\\nA. insects like small beetles\\nB. tiny lifeforms in dirt\\nC. big mammals living there\\nD. a lot of tiny pebblesB. tiny lifeforms in dirtB. tiny lifeforms in dirtTrue
4812robustnessadjective_antonym_swap-Some animals use a liquid coming from their skin to adjust to\\n\\nA. cold\\nB. water\\nC. heat\\nD. humidity-Some animals use a gaseous coming from their skin to adjust to\\n\\nA. hot\\nB. water\\nC. heat\\nD. humidityC. heatD. humidityFalse
\n", + "

4813 rows × 9 columns

\n", + "
" + ], + "text/plain": [ + " category test_type original_context \\\n", + "0 robustness uppercase - \n", + "1 robustness uppercase - \n", + "2 robustness uppercase - \n", + "3 robustness uppercase - \n", + "4 robustness uppercase - \n", + "... ... ... ... \n", + "4808 robustness adjective_antonym_swap - \n", + "4809 robustness adjective_antonym_swap - \n", + "4810 robustness adjective_antonym_swap - \n", + "4811 robustness adjective_antonym_swap - \n", + "4812 robustness adjective_antonym_swap - \n", + "\n", + " original_question \\\n", + "0 A person wants to start saving money so that they can afford a nice vacation at the end of the year. After looking over their budget and expenses, they decide the best way to save money is to\\n\\nA. make more phone calls\\nB. quit eating lunch out\\nC. buy less with monopoly money\\nD. have lunch with friends \n", + "1 There is most likely going to be fog around:\\n\\nA. a marsh\\nB. a tundra\\nC. the plains\\nD. a desert \n", + "2 Predators eat\\n\\nA. lions\\nB. humans\\nC. bunnies\\nD. grass \n", + "3 Oak tree seeds are planted and a sidewalk is paved right next to that spot, until eventually, the tree is tall and the roots must extend past the sidewalk, which means\\n\\nA. roots may be split\\nB. roots may begin to die\\nC. parts may break the concrete\\nD. roots may fall apart \n", + "4 An electric car runs on electricity via\\n\\nA. gasoline\\nB. a power station\\nC. electrical conductors\\nD. fuel \n", + "... ... \n", + "4808 A woman, with a pale complexion, wants to spend the bright, sunny day at the beach. She makes sure that she stops at the store to pick up some sunblock before she begins to enjoy her day filled with sand and surf. She applies the sunblock carefully and thoroughly, because she knows that\\n\\nA. UV rays are harmful\\nB. sunlight will be fun\\nC. the sun is close\\nD. the sun is in space \n", + "4809 Pasta may be cooked in water when\\n\\nA. the water is warm\\nB. the water is on the stove\\nC. 
water is bubbling from applied warmth\\nD. the pasta is very fresh \n", + "4810 A decrease in diseases\\n\\nA. has no impact on a population\\nB. leads to more sick people\\nC. leads to less sick people\\nD. leads to an uptick in emergency room visits \n", + "4811 When soil is viewed in a scientific way, what is seen and viewed is actually\\n\\nA. insects like big beetles\\nB. tiny lifeforms in dirt\\nC. small mammals living there\\nD. a lot of tiny pebbles \n", + "4812 Some animals use a liquid coming from their skin to adjust to\\n\\nA. cold\\nB. water\\nC. heat\\nD. humidity \n", + "\n", + " perturbed_context \\\n", + "0 - \n", + "1 - \n", + "2 - \n", + "3 - \n", + "4 - \n", + "... ... \n", + "4808 - \n", + "4809 - \n", + "4810 - \n", + "4811 - \n", + "4812 - \n", + "\n", + " perturbed_question \\\n", + "0 A PERSON WANTS TO START SAVING MONEY SO THAT THEY CAN AFFORD A NICE VACATION AT THE END OF THE YEAR. AFTER LOOKING OVER THEIR BUDGET AND EXPENSES, THEY DECIDE THE BEST WAY TO SAVE MONEY IS TO A. MAKE MORE PHONE CALLS B. QUIT EATING LUNCH OUT C. BUY LESS WITH MONOPOLY MONEY D. HAVE LUNCH WITH FRIENDS \n", + "1 THERE IS MOST LIKELY GOING TO BE FOG AROUND: A. A MARSH B. A TUNDRA C. THE PLAINS D. A DESERT \n", + "2 PREDATORS EAT A. LIONS B. HUMANS C. BUNNIES D. GRASS \n", + "3 OAK TREE SEEDS ARE PLANTED AND A SIDEWALK IS PAVED RIGHT NEXT TO THAT SPOT, UNTIL EVENTUALLY, THE TREE IS TALL AND THE ROOTS MUST EXTEND PAST THE SIDEWALK, WHICH MEANS A. ROOTS MAY BE SPLIT B. ROOTS MAY BEGIN TO DIE C. PARTS MAY BREAK THE CONCRETE D. ROOTS MAY FALL APART \n", + "4 AN ELECTRIC CAR RUNS ON ELECTRICITY VIA A. GASOLINE B. A POWER STATION C. ELECTRICAL CONDUCTORS D. FUEL \n", + "... ... \n", + "4808 A woman, with a pale complexion, wants to spend the dull, sunny day at the beach. She makes sure that she stops at the store to pick up no sunblock before she begins to enjoy her day filled with sand and surf. 
She applies the sunblock carefully and thoroughly, because she knows that\\n\\nA. UV rays are harmless\\nB. sunlight will be fun\\nC. the sun is distant\\nD. the sun is in space \n", + "4809 Pasta may be raw in water when\\n\\nA. the water is cool\\nB. the water is on the stove\\nC. water is bubbling from applied warmth\\nD. the pasta is very salty \n", + "4810 A decrease in diseases\\n\\nA. has no impact on a population\\nB. leads to more well people\\nC. leads to less well people\\nD. leads to an uptick in emergency room visits \n", + "4811 When soil is viewed in a unscientific way, what is seen and viewed is actually\\n\\nA. insects like small beetles\\nB. tiny lifeforms in dirt\\nC. big mammals living there\\nD. a lot of tiny pebbles \n", + "4812 Some animals use a gaseous coming from their skin to adjust to\\n\\nA. hot\\nB. water\\nC. heat\\nD. humidity \n", + "\n", + " expected_result \\\n", + "0 B. quit eating lunch out \n", + "1 A. a marsh \n", + "2 C. bunnies \n", + "3 C. parts may break the concrete \n", + "4 B. a power station \n", + "... ... \n", + "4808 A. UV rays are harmful \n", + "4809 C. water is bubbling from applied warmth \n", + "4810 C. leads to less sick people \n", + "4811 B. tiny lifeforms in dirt \n", + "4812 C. heat \n", + "\n", + " actual_result pass \n", + "0 B. QUIT EATING LUNCH OUT True \n", + "1 A. A MARSH True \n", + "2 C. BUNNIES True \n", + "3 C. PARTS MAY BREAK THE CONCRETE True \n", + "4 B. A POWER STATION True \n", + "... ... ... \n", + "4808 B. sunlight will be fun False \n", + "4809 A. the water is cool False \n", + "4810 B. leads to more well people False \n", + "4811 B. tiny lifeforms in dirt True \n", + "4812 D. 
humidity False \n", + "\n", + "[4813 rows x 9 columns]" + ] + }, + "execution_count": 14, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "generated_results" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Final Results" + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "metadata": { + "execution": { + "iopub.execute_input": "2023-11-28T18:53:57.194975Z", + "iopub.status.busy": "2023-11-28T18:53:57.194823Z", + "iopub.status.idle": "2023-11-28T18:53:57.293778Z", + "shell.execute_reply": "2023-11-28T18:53:57.293351Z", + "shell.execute_reply.started": "2023-11-28T18:53:57.194962Z" + }, + "tags": [] + }, + "outputs": [], + "source": [ + "report = harness.report()" + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "metadata": { + "execution": { + "iopub.execute_input": "2023-11-28T18:53:57.294390Z", + "iopub.status.busy": "2023-11-28T18:53:57.294249Z", + "iopub.status.idle": "2023-11-28T18:53:57.361810Z", + "shell.execute_reply": "2023-11-28T18:53:57.361408Z", + "shell.execute_reply.started": "2023-11-28T18:53:57.294377Z" + }, + "tags": [] + }, + "outputs": [ + { + "data": { + "text/html": [ + "
<div>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ "  <thead>\n",
+ "    <tr><th></th><th>category</th><th>test_type</th><th>fail_count</th><th>pass_count</th><th>pass_rate</th><th>minimum_pass_rate</th><th>pass</th></tr>\n",
+ "  </thead>\n",
+ "  <tbody>\n",
+ "    <tr><th>0</th><td>robustness</td><td>uppercase</td><td>32</td><td>468</td><td>94%</td><td>75%</td><td>True</td></tr>\n",
+ "    <tr><th>1</th><td>robustness</td><td>lowercase</td><td>33</td><td>467</td><td>93%</td><td>75%</td><td>True</td></tr>\n",
+ "    <tr><th>2</th><td>robustness</td><td>titlecase</td><td>29</td><td>471</td><td>94%</td><td>75%</td><td>True</td></tr>\n",
+ "    <tr><th>3</th><td>robustness</td><td>add_typo</td><td>30</td><td>449</td><td>94%</td><td>75%</td><td>True</td></tr>\n",
+ "    <tr><th>4</th><td>robustness</td><td>dyslexia_word_swap</td><td>40</td><td>375</td><td>90%</td><td>75%</td><td>True</td></tr>\n",
+ "    <tr><th>5</th><td>robustness</td><td>add_abbreviation</td><td>58</td><td>382</td><td>87%</td><td>75%</td><td>True</td></tr>\n",
+ "    <tr><th>6</th><td>robustness</td><td>add_slangs</td><td>87</td><td>220</td><td>72%</td><td>75%</td><td>False</td></tr>\n",
+ "    <tr><th>7</th><td>robustness</td><td>add_speech_to_text_typo</td><td>34</td><td>466</td><td>93%</td><td>75%</td><td>True</td></tr>\n",
+ "    <tr><th>8</th><td>robustness</td><td>add_ocr_typo</td><td>39</td><td>459</td><td>92%</td><td>75%</td><td>True</td></tr>\n",
+ "    <tr><th>9</th><td>robustness</td><td>adjective_synonym_swap</td><td>93</td><td>274</td><td>75%</td><td>75%</td><td>False</td></tr>\n",
+ "    <tr><th>10</th><td>robustness</td><td>adjective_antonym_swap</td><td>109</td><td>198</td><td>64%</td><td>75%</td><td>False</td></tr>\n",
+ "  </tbody>\n",
+ "</table>\n",
+ "</div>
" + ], + "text/plain": [ + " category test_type fail_count pass_count pass_rate \\\n", + "0 robustness uppercase 32 468 94% \n", + "1 robustness lowercase 33 467 93% \n", + "2 robustness titlecase 29 471 94% \n", + "3 robustness add_typo 30 449 94% \n", + "4 robustness dyslexia_word_swap 40 375 90% \n", + "5 robustness add_abbreviation 58 382 87% \n", + "6 robustness add_slangs 87 220 72% \n", + "7 robustness add_speech_to_text_typo 34 466 93% \n", + "8 robustness add_ocr_typo 39 459 92% \n", + "9 robustness adjective_synonym_swap 93 274 75% \n", + "10 robustness adjective_antonym_swap 109 198 64% \n", + "\n", + " minimum_pass_rate pass \n", + "0 75% True \n", + "1 75% True \n", + "2 75% True \n", + "3 75% True \n", + "4 75% True \n", + "5 75% True \n", + "6 75% False \n", + "7 75% True \n", + "8 75% True \n", + "9 75% False \n", + "10 75% False " + ] + }, + "execution_count": 16, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "report" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Saving report and generated_results" + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "metadata": { + "execution": { + "iopub.execute_input": "2023-11-28T18:53:57.362482Z", + "iopub.status.busy": "2023-11-28T18:53:57.362330Z", + "iopub.status.idle": "2023-11-28T18:53:57.569765Z", + "shell.execute_reply": "2023-11-28T18:53:57.569305Z", + "shell.execute_reply.started": "2023-11-28T18:53:57.362468Z" + }, + "tags": [] + }, + "outputs": [], + "source": [ + "generated_results.to_csv('openai/gpt-4-1106-preview-OpenBookQA.csv', index=False)" + ] + }, + { + "cell_type": "code", + "execution_count": 18, + "metadata": { + "execution": { + "iopub.execute_input": "2023-11-28T18:53:57.570591Z", + "iopub.status.busy": "2023-11-28T18:53:57.570311Z", + "iopub.status.idle": "2023-11-28T18:53:57.587896Z", + "shell.execute_reply": "2023-11-28T18:53:57.587445Z", + "shell.execute_reply.started": "2023-11-28T18:53:57.570576Z" + }, + 
"tags": [] + }, + "outputs": [], + "source": [ + "report.to_csv('openai/gpt-4-1106-preview-OpenBookQA-report.csv', index=False)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Visualizing the Report" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": { + "execution": { + "iopub.execute_input": "2023-11-22T05:29:17.033355Z", + "iopub.status.busy": "2023-11-22T05:29:17.032979Z", + "iopub.status.idle": "2023-11-22T05:29:18.626250Z", + "shell.execute_reply": "2023-11-22T05:29:18.625706Z", + "shell.execute_reply.started": "2023-11-22T05:29:17.033334Z" + }, + "tags": [] + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Requirement already satisfied: plotly in /opt/conda/lib/python3.10/site-packages (5.13.1)\n", + "Requirement already satisfied: pandas in /opt/conda/lib/python3.10/site-packages (1.5.3)\n", + "Requirement already satisfied: tenacity>=6.2.0 in /opt/conda/lib/python3.10/site-packages (from plotly) (8.2.1)\n", + "Requirement already satisfied: python-dateutil>=2.8.1 in /opt/conda/lib/python3.10/site-packages (from pandas) (2.8.2)\n", + "Requirement already satisfied: pytz>=2020.1 in /opt/conda/lib/python3.10/site-packages (from pandas) (2022.7.1)\n", + "Requirement already satisfied: numpy>=1.21.0 in /opt/conda/lib/python3.10/site-packages (from pandas) (1.21.6)\n", + "Requirement already satisfied: six>=1.5 in /opt/conda/lib/python3.10/site-packages (from python-dateutil>=2.8.1->pandas) (1.16.0)\n" + ] + } + ], + "source": [ + "!pip install plotly pandas" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": { + "execution": { + "iopub.execute_input": "2023-11-22T05:34:48.726069Z", + "iopub.status.busy": "2023-11-22T05:34:48.725650Z", + "iopub.status.idle": "2023-11-22T05:34:48.981905Z", + "shell.execute_reply": "2023-11-22T05:34:48.981455Z", + "shell.execute_reply.started": "2023-11-22T05:34:48.726051Z" + }, + "tags": [] + }, + "outputs": [ + { + 
"data": { + "application/vnd.plotly.v1+json": { + "config": { + "plotlyServerURL": "https://plot.ly" + }, + "data": [ + { + "alignmentgroup": "True", + "hovertemplate": "Test Type=%{x}
Pass Rate=%{y}", + "legendgroup": "", + "marker": { + "color": "#636efa", + "pattern": { + "shape": "" + } + }, + "name": "", + "offsetgroup": "", + "orientation": "v", + "showlegend": false, + "textposition": "auto", + "type": "bar", + "x": [ + "uppercase", + "lowercase", + "titlecase", + "add_typo", + "dyslexia_word_swap", + "add_abbreviation", + "add_slangs", + "add_speech_to_text_typo", + "add_ocr_typo", + "adjective_synonym_swap", + "adjective_antonym_swap" + ], + "xaxis": "x", + "y": [ + 58, + 63, + 60, + 74, + 66, + 58, + 59, + 78, + 51, + 62, + 61 + ], + "yaxis": "y" + } + ], + "layout": { + "barmode": "relative", + "height": 700, + "legend": { + "tracegroupgap": 0 + }, + "shapes": [ + { + "line": { + "color": "red", + "width": 2 + }, + "type": "line", + "x0": -0.5, + "x1": 10.5, + "y0": 75, + "y1": 75 + } + ], + "template": { + "data": { + "bar": [ + { + "error_x": { + "color": "#2a3f5f" + }, + "error_y": { + "color": "#2a3f5f" + }, + "marker": { + "line": { + "color": "#E5ECF6", + "width": 0.5 + }, + "pattern": { + "fillmode": "overlay", + "size": 10, + "solidity": 0.2 + } + }, + "type": "bar" + } + ], + "barpolar": [ + { + "marker": { + "line": { + "color": "#E5ECF6", + "width": 0.5 + }, + "pattern": { + "fillmode": "overlay", + "size": 10, + "solidity": 0.2 + } + }, + "type": "barpolar" + } + ], + "carpet": [ + { + "aaxis": { + "endlinecolor": "#2a3f5f", + "gridcolor": "white", + "linecolor": "white", + "minorgridcolor": "white", + "startlinecolor": "#2a3f5f" + }, + "baxis": { + "endlinecolor": "#2a3f5f", + "gridcolor": "white", + "linecolor": "white", + "minorgridcolor": "white", + "startlinecolor": "#2a3f5f" + }, + "type": "carpet" + } + ], + "choropleth": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "type": "choropleth" + } + ], + "contour": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "colorscale": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + "#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + 
[ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + "#ed7953" + ], + [ + 0.7777777777777778, + "#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ], + "type": "contour" + } + ], + "contourcarpet": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "type": "contourcarpet" + } + ], + "heatmap": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "colorscale": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + "#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + [ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + "#ed7953" + ], + [ + 0.7777777777777778, + "#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ], + "type": "heatmap" + } + ], + "heatmapgl": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "colorscale": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + "#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + [ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + "#ed7953" + ], + [ + 0.7777777777777778, + "#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ], + "type": "heatmapgl" + } + ], + "histogram": [ + { + "marker": { + "pattern": { + "fillmode": "overlay", + "size": 10, + "solidity": 0.2 + } + }, + "type": "histogram" + } + ], + "histogram2d": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "colorscale": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + "#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + [ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + 
"#ed7953" + ], + [ + 0.7777777777777778, + "#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ], + "type": "histogram2d" + } + ], + "histogram2dcontour": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "colorscale": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + "#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + [ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + "#ed7953" + ], + [ + 0.7777777777777778, + "#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ], + "type": "histogram2dcontour" + } + ], + "mesh3d": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "type": "mesh3d" + } + ], + "parcoords": [ + { + "line": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "parcoords" + } + ], + "pie": [ + { + "automargin": true, + "type": "pie" + } + ], + "scatter": [ + { + "fillpattern": { + "fillmode": "overlay", + "size": 10, + "solidity": 0.2 + }, + "type": "scatter" + } + ], + "scatter3d": [ + { + "line": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scatter3d" + } + ], + "scattercarpet": [ + { + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scattercarpet" + } + ], + "scattergeo": [ + { + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scattergeo" + } + ], + "scattergl": [ + { + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scattergl" + } + ], + "scattermapbox": [ + { + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scattermapbox" + } + ], + "scatterpolar": [ + { + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scatterpolar" + } + ], + "scatterpolargl": [ 
+ { + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scatterpolargl" + } + ], + "scatterternary": [ + { + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scatterternary" + } + ], + "surface": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "colorscale": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + "#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + [ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + "#ed7953" + ], + [ + 0.7777777777777778, + "#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ], + "type": "surface" + } + ], + "table": [ + { + "cells": { + "fill": { + "color": "#EBF0F8" + }, + "line": { + "color": "white" + } + }, + "header": { + "fill": { + "color": "#C8D4E3" + }, + "line": { + "color": "white" + } + }, + "type": "table" + } + ] + }, + "layout": { + "annotationdefaults": { + "arrowcolor": "#2a3f5f", + "arrowhead": 0, + "arrowwidth": 1 + }, + "autotypenumbers": "strict", + "coloraxis": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "colorscale": { + "diverging": [ + [ + 0, + "#8e0152" + ], + [ + 0.1, + "#c51b7d" + ], + [ + 0.2, + "#de77ae" + ], + [ + 0.3, + "#f1b6da" + ], + [ + 0.4, + "#fde0ef" + ], + [ + 0.5, + "#f7f7f7" + ], + [ + 0.6, + "#e6f5d0" + ], + [ + 0.7, + "#b8e186" + ], + [ + 0.8, + "#7fbc41" + ], + [ + 0.9, + "#4d9221" + ], + [ + 1, + "#276419" + ] + ], + "sequential": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + "#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + [ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + "#ed7953" + ], + [ + 0.7777777777777778, + "#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ], + 
"sequentialminus": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + "#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + [ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + "#ed7953" + ], + [ + 0.7777777777777778, + "#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ] + }, + "colorway": [ + "#636efa", + "#EF553B", + "#00cc96", + "#ab63fa", + "#FFA15A", + "#19d3f3", + "#FF6692", + "#B6E880", + "#FF97FF", + "#FECB52" + ], + "font": { + "color": "#2a3f5f" + }, + "geo": { + "bgcolor": "white", + "lakecolor": "white", + "landcolor": "#E5ECF6", + "showlakes": true, + "showland": true, + "subunitcolor": "white" + }, + "hoverlabel": { + "align": "left" + }, + "hovermode": "closest", + "mapbox": { + "style": "light" + }, + "paper_bgcolor": "white", + "plot_bgcolor": "#E5ECF6", + "polar": { + "angularaxis": { + "gridcolor": "white", + "linecolor": "white", + "ticks": "" + }, + "bgcolor": "#E5ECF6", + "radialaxis": { + "gridcolor": "white", + "linecolor": "white", + "ticks": "" + } + }, + "scene": { + "xaxis": { + "backgroundcolor": "#E5ECF6", + "gridcolor": "white", + "gridwidth": 2, + "linecolor": "white", + "showbackground": true, + "ticks": "", + "zerolinecolor": "white" + }, + "yaxis": { + "backgroundcolor": "#E5ECF6", + "gridcolor": "white", + "gridwidth": 2, + "linecolor": "white", + "showbackground": true, + "ticks": "", + "zerolinecolor": "white" + }, + "zaxis": { + "backgroundcolor": "#E5ECF6", + "gridcolor": "white", + "gridwidth": 2, + "linecolor": "white", + "showbackground": true, + "ticks": "", + "zerolinecolor": "white" + } + }, + "shapedefaults": { + "line": { + "color": "#2a3f5f" + } + }, + "ternary": { + "aaxis": { + "gridcolor": "white", + "linecolor": "white", + "ticks": "" + }, + "baxis": { + "gridcolor": "white", + "linecolor": "white", + "ticks": "" + }, + "bgcolor": "#E5ECF6", + "caxis": { + 
"gridcolor": "white", + "linecolor": "white", + "ticks": "" + } + }, + "title": { + "x": 0.05 + }, + "xaxis": { + "automargin": true, + "gridcolor": "white", + "linecolor": "white", + "ticks": "", + "title": { + "standoff": 15 + }, + "zerolinecolor": "white", + "zerolinewidth": 2 + }, + "yaxis": { + "automargin": true, + "gridcolor": "white", + "linecolor": "white", + "ticks": "", + "title": { + "standoff": 15 + }, + "zerolinecolor": "white", + "zerolinewidth": 2 + } + } + }, + "title": { + "text": "Pass Rate by Test Type
<br>Hub - ai21<br>
Model - j2-jumbo-instruct" + }, + "width": 1000, + "xaxis": { + "anchor": "y", + "domain": [ + 0, + 1 + ], + "tickangle": 45, + "title": { + "text": "Test Type" + } + }, + "yaxis": { + "anchor": "x", + "domain": [ + 0, + 1 + ], + "title": { + "text": "Pass Rate" + } + } + } + } + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "application/vnd.plotly.v1+json": { + "config": { + "plotlyServerURL": "https://plot.ly" + }, + "data": [ + { + "domain": { + "x": [ + 0, + 1 + ], + "y": [ + 0, + 1 + ] + }, + "hovertemplate": "test_type=%{label}<br>
fail_count=%{value}", + "labels": [ + "uppercase", + "lowercase", + "titlecase", + "add_typo", + "dyslexia_word_swap", + "add_abbreviation", + "add_slangs", + "add_speech_to_text_typo", + "add_ocr_typo", + "adjective_synonym_swap", + "adjective_antonym_swap" + ], + "legendgroup": "", + "name": "", + "showlegend": true, + "type": "pie", + "values": [ + 208, + 185, + 199, + 126, + 142, + 183, + 126, + 110, + 246, + 139, + 120 + ] + } + ], + "layout": { + "height": 600, + "legend": { + "tracegroupgap": 0 + }, + "template": { + "data": { + "bar": [ + { + "error_x": { + "color": "#2a3f5f" + }, + "error_y": { + "color": "#2a3f5f" + }, + "marker": { + "line": { + "color": "#E5ECF6", + "width": 0.5 + }, + "pattern": { + "fillmode": "overlay", + "size": 10, + "solidity": 0.2 + } + }, + "type": "bar" + } + ], + "barpolar": [ + { + "marker": { + "line": { + "color": "#E5ECF6", + "width": 0.5 + }, + "pattern": { + "fillmode": "overlay", + "size": 10, + "solidity": 0.2 + } + }, + "type": "barpolar" + } + ], + "carpet": [ + { + "aaxis": { + "endlinecolor": "#2a3f5f", + "gridcolor": "white", + "linecolor": "white", + "minorgridcolor": "white", + "startlinecolor": "#2a3f5f" + }, + "baxis": { + "endlinecolor": "#2a3f5f", + "gridcolor": "white", + "linecolor": "white", + "minorgridcolor": "white", + "startlinecolor": "#2a3f5f" + }, + "type": "carpet" + } + ], + "choropleth": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "type": "choropleth" + } + ], + "contour": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "colorscale": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + "#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + [ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + "#ed7953" + ], + [ + 0.7777777777777778, + "#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ], + "type": "contour" + } + ], + 
"contourcarpet": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "type": "contourcarpet" + } + ], + "heatmap": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "colorscale": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + "#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + [ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + "#ed7953" + ], + [ + 0.7777777777777778, + "#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ], + "type": "heatmap" + } + ], + "heatmapgl": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "colorscale": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + "#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + [ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + "#ed7953" + ], + [ + 0.7777777777777778, + "#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ], + "type": "heatmapgl" + } + ], + "histogram": [ + { + "marker": { + "pattern": { + "fillmode": "overlay", + "size": 10, + "solidity": 0.2 + } + }, + "type": "histogram" + } + ], + "histogram2d": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "colorscale": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + "#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + [ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + "#ed7953" + ], + [ + 0.7777777777777778, + "#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ], + "type": "histogram2d" + } + ], + "histogram2dcontour": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "colorscale": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + 
"#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + [ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + "#ed7953" + ], + [ + 0.7777777777777778, + "#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ], + "type": "histogram2dcontour" + } + ], + "mesh3d": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "type": "mesh3d" + } + ], + "parcoords": [ + { + "line": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "parcoords" + } + ], + "pie": [ + { + "automargin": true, + "type": "pie" + } + ], + "scatter": [ + { + "fillpattern": { + "fillmode": "overlay", + "size": 10, + "solidity": 0.2 + }, + "type": "scatter" + } + ], + "scatter3d": [ + { + "line": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scatter3d" + } + ], + "scattercarpet": [ + { + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scattercarpet" + } + ], + "scattergeo": [ + { + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scattergeo" + } + ], + "scattergl": [ + { + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scattergl" + } + ], + "scattermapbox": [ + { + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scattermapbox" + } + ], + "scatterpolar": [ + { + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scatterpolar" + } + ], + "scatterpolargl": [ + { + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scatterpolargl" + } + ], + "scatterternary": [ + { + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scatterternary" + } + ], + "surface": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + 
}, + "colorscale": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + "#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + [ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + "#ed7953" + ], + [ + 0.7777777777777778, + "#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ], + "type": "surface" + } + ], + "table": [ + { + "cells": { + "fill": { + "color": "#EBF0F8" + }, + "line": { + "color": "white" + } + }, + "header": { + "fill": { + "color": "#C8D4E3" + }, + "line": { + "color": "white" + } + }, + "type": "table" + } + ] + }, + "layout": { + "annotationdefaults": { + "arrowcolor": "#2a3f5f", + "arrowhead": 0, + "arrowwidth": 1 + }, + "autotypenumbers": "strict", + "coloraxis": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "colorscale": { + "diverging": [ + [ + 0, + "#8e0152" + ], + [ + 0.1, + "#c51b7d" + ], + [ + 0.2, + "#de77ae" + ], + [ + 0.3, + "#f1b6da" + ], + [ + 0.4, + "#fde0ef" + ], + [ + 0.5, + "#f7f7f7" + ], + [ + 0.6, + "#e6f5d0" + ], + [ + 0.7, + "#b8e186" + ], + [ + 0.8, + "#7fbc41" + ], + [ + 0.9, + "#4d9221" + ], + [ + 1, + "#276419" + ] + ], + "sequential": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + "#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + [ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + "#ed7953" + ], + [ + 0.7777777777777778, + "#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ], + "sequentialminus": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + "#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + [ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + "#ed7953" + ], + [ + 0.7777777777777778, + 
"#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ] + }, + "colorway": [ + "#636efa", + "#EF553B", + "#00cc96", + "#ab63fa", + "#FFA15A", + "#19d3f3", + "#FF6692", + "#B6E880", + "#FF97FF", + "#FECB52" + ], + "font": { + "color": "#2a3f5f" + }, + "geo": { + "bgcolor": "white", + "lakecolor": "white", + "landcolor": "#E5ECF6", + "showlakes": true, + "showland": true, + "subunitcolor": "white" + }, + "hoverlabel": { + "align": "left" + }, + "hovermode": "closest", + "mapbox": { + "style": "light" + }, + "paper_bgcolor": "white", + "plot_bgcolor": "#E5ECF6", + "polar": { + "angularaxis": { + "gridcolor": "white", + "linecolor": "white", + "ticks": "" + }, + "bgcolor": "#E5ECF6", + "radialaxis": { + "gridcolor": "white", + "linecolor": "white", + "ticks": "" + } + }, + "scene": { + "xaxis": { + "backgroundcolor": "#E5ECF6", + "gridcolor": "white", + "gridwidth": 2, + "linecolor": "white", + "showbackground": true, + "ticks": "", + "zerolinecolor": "white" + }, + "yaxis": { + "backgroundcolor": "#E5ECF6", + "gridcolor": "white", + "gridwidth": 2, + "linecolor": "white", + "showbackground": true, + "ticks": "", + "zerolinecolor": "white" + }, + "zaxis": { + "backgroundcolor": "#E5ECF6", + "gridcolor": "white", + "gridwidth": 2, + "linecolor": "white", + "showbackground": true, + "ticks": "", + "zerolinecolor": "white" + } + }, + "shapedefaults": { + "line": { + "color": "#2a3f5f" + } + }, + "ternary": { + "aaxis": { + "gridcolor": "white", + "linecolor": "white", + "ticks": "" + }, + "baxis": { + "gridcolor": "white", + "linecolor": "white", + "ticks": "" + }, + "bgcolor": "#E5ECF6", + "caxis": { + "gridcolor": "white", + "linecolor": "white", + "ticks": "" + } + }, + "title": { + "x": 0.05 + }, + "xaxis": { + "automargin": true, + "gridcolor": "white", + "linecolor": "white", + "ticks": "", + "title": { + "standoff": 15 + }, + "zerolinecolor": "white", + "zerolinewidth": 2 + }, + "yaxis": { + "automargin": true, + "gridcolor": "white", 
+ "linecolor": "white", + "ticks": "", + "title": { + "standoff": 15 + }, + "zerolinecolor": "white", + "zerolinewidth": 2 + } + } + }, + "title": { + "text": "Distribution of Fail Count
<br>Hub - ai21<br>
Model - j2-jumbo-instruct" + }, + "width": 800 + } + } + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "application/vnd.plotly.v1+json": { + "config": { + "plotlyServerURL": "https://plot.ly" + }, + "data": [ + { + "alignmentgroup": "True", + "hovertemplate": "Test Type=%{x}<br>
Pass Rate=%{y}", + "legendgroup": "", + "marker": { + "color": "#636efa", + "pattern": { + "shape": "" + } + }, + "name": "", + "offsetgroup": "", + "orientation": "v", + "showlegend": false, + "textposition": "auto", + "type": "bar", + "x": [ + "uppercase", + "lowercase", + "titlecase", + "add_typo", + "dyslexia_word_swap", + "add_abbreviation", + "add_slangs", + "add_speech_to_text_typo", + "add_ocr_typo", + "adjective_synonym_swap", + "adjective_antonym_swap" + ], + "xaxis": "x", + "y": [ + 53, + 56, + 58, + 79, + 66, + 63, + 64, + 84, + 54, + 65, + 58 + ], + "yaxis": "y" + } + ], + "layout": { + "barmode": "relative", + "height": 700, + "legend": { + "tracegroupgap": 0 + }, + "shapes": [ + { + "line": { + "color": "red", + "width": 2 + }, + "type": "line", + "x0": -0.5, + "x1": 10.5, + "y0": 75, + "y1": 75 + } + ], + "template": { + "data": { + "bar": [ + { + "error_x": { + "color": "#2a3f5f" + }, + "error_y": { + "color": "#2a3f5f" + }, + "marker": { + "line": { + "color": "#E5ECF6", + "width": 0.5 + }, + "pattern": { + "fillmode": "overlay", + "size": 10, + "solidity": 0.2 + } + }, + "type": "bar" + } + ], + "barpolar": [ + { + "marker": { + "line": { + "color": "#E5ECF6", + "width": 0.5 + }, + "pattern": { + "fillmode": "overlay", + "size": 10, + "solidity": 0.2 + } + }, + "type": "barpolar" + } + ], + "carpet": [ + { + "aaxis": { + "endlinecolor": "#2a3f5f", + "gridcolor": "white", + "linecolor": "white", + "minorgridcolor": "white", + "startlinecolor": "#2a3f5f" + }, + "baxis": { + "endlinecolor": "#2a3f5f", + "gridcolor": "white", + "linecolor": "white", + "minorgridcolor": "white", + "startlinecolor": "#2a3f5f" + }, + "type": "carpet" + } + ], + "choropleth": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "type": "choropleth" + } + ], + "contour": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "colorscale": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + "#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + 
[ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + "#ed7953" + ], + [ + 0.7777777777777778, + "#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ], + "type": "contour" + } + ], + "contourcarpet": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "type": "contourcarpet" + } + ], + "heatmap": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "colorscale": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + "#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + [ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + "#ed7953" + ], + [ + 0.7777777777777778, + "#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ], + "type": "heatmap" + } + ], + "heatmapgl": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "colorscale": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + "#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + [ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + "#ed7953" + ], + [ + 0.7777777777777778, + "#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ], + "type": "heatmapgl" + } + ], + "histogram": [ + { + "marker": { + "pattern": { + "fillmode": "overlay", + "size": 10, + "solidity": 0.2 + } + }, + "type": "histogram" + } + ], + "histogram2d": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "colorscale": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + "#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + [ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + 
"#ed7953" + ], + [ + 0.7777777777777778, + "#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ], + "type": "histogram2d" + } + ], + "histogram2dcontour": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "colorscale": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + "#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + [ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + "#ed7953" + ], + [ + 0.7777777777777778, + "#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ], + "type": "histogram2dcontour" + } + ], + "mesh3d": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "type": "mesh3d" + } + ], + "parcoords": [ + { + "line": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "parcoords" + } + ], + "pie": [ + { + "automargin": true, + "type": "pie" + } + ], + "scatter": [ + { + "fillpattern": { + "fillmode": "overlay", + "size": 10, + "solidity": 0.2 + }, + "type": "scatter" + } + ], + "scatter3d": [ + { + "line": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scatter3d" + } + ], + "scattercarpet": [ + { + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scattercarpet" + } + ], + "scattergeo": [ + { + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scattergeo" + } + ], + "scattergl": [ + { + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scattergl" + } + ], + "scattermapbox": [ + { + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scattermapbox" + } + ], + "scatterpolar": [ + { + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scatterpolar" + } + ], + "scatterpolargl": [ 
+ { + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scatterpolargl" + } + ], + "scatterternary": [ + { + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scatterternary" + } + ], + "surface": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "colorscale": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + "#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + [ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + "#ed7953" + ], + [ + 0.7777777777777778, + "#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ], + "type": "surface" + } + ], + "table": [ + { + "cells": { + "fill": { + "color": "#EBF0F8" + }, + "line": { + "color": "white" + } + }, + "header": { + "fill": { + "color": "#C8D4E3" + }, + "line": { + "color": "white" + } + }, + "type": "table" + } + ] + }, + "layout": { + "annotationdefaults": { + "arrowcolor": "#2a3f5f", + "arrowhead": 0, + "arrowwidth": 1 + }, + "autotypenumbers": "strict", + "coloraxis": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "colorscale": { + "diverging": [ + [ + 0, + "#8e0152" + ], + [ + 0.1, + "#c51b7d" + ], + [ + 0.2, + "#de77ae" + ], + [ + 0.3, + "#f1b6da" + ], + [ + 0.4, + "#fde0ef" + ], + [ + 0.5, + "#f7f7f7" + ], + [ + 0.6, + "#e6f5d0" + ], + [ + 0.7, + "#b8e186" + ], + [ + 0.8, + "#7fbc41" + ], + [ + 0.9, + "#4d9221" + ], + [ + 1, + "#276419" + ] + ], + "sequential": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + "#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + [ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + "#ed7953" + ], + [ + 0.7777777777777778, + "#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ], + 
"sequentialminus": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + "#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + [ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + "#ed7953" + ], + [ + 0.7777777777777778, + "#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ] + }, + "colorway": [ + "#636efa", + "#EF553B", + "#00cc96", + "#ab63fa", + "#FFA15A", + "#19d3f3", + "#FF6692", + "#B6E880", + "#FF97FF", + "#FECB52" + ], + "font": { + "color": "#2a3f5f" + }, + "geo": { + "bgcolor": "white", + "lakecolor": "white", + "landcolor": "#E5ECF6", + "showlakes": true, + "showland": true, + "subunitcolor": "white" + }, + "hoverlabel": { + "align": "left" + }, + "hovermode": "closest", + "mapbox": { + "style": "light" + }, + "paper_bgcolor": "white", + "plot_bgcolor": "#E5ECF6", + "polar": { + "angularaxis": { + "gridcolor": "white", + "linecolor": "white", + "ticks": "" + }, + "bgcolor": "#E5ECF6", + "radialaxis": { + "gridcolor": "white", + "linecolor": "white", + "ticks": "" + } + }, + "scene": { + "xaxis": { + "backgroundcolor": "#E5ECF6", + "gridcolor": "white", + "gridwidth": 2, + "linecolor": "white", + "showbackground": true, + "ticks": "", + "zerolinecolor": "white" + }, + "yaxis": { + "backgroundcolor": "#E5ECF6", + "gridcolor": "white", + "gridwidth": 2, + "linecolor": "white", + "showbackground": true, + "ticks": "", + "zerolinecolor": "white" + }, + "zaxis": { + "backgroundcolor": "#E5ECF6", + "gridcolor": "white", + "gridwidth": 2, + "linecolor": "white", + "showbackground": true, + "ticks": "", + "zerolinecolor": "white" + } + }, + "shapedefaults": { + "line": { + "color": "#2a3f5f" + } + }, + "ternary": { + "aaxis": { + "gridcolor": "white", + "linecolor": "white", + "ticks": "" + }, + "baxis": { + "gridcolor": "white", + "linecolor": "white", + "ticks": "" + }, + "bgcolor": "#E5ECF6", + "caxis": { + 
"gridcolor": "white", + "linecolor": "white", + "ticks": "" + } + }, + "title": { + "x": 0.05 + }, + "xaxis": { + "automargin": true, + "gridcolor": "white", + "linecolor": "white", + "ticks": "", + "title": { + "standoff": 15 + }, + "zerolinecolor": "white", + "zerolinewidth": 2 + }, + "yaxis": { + "automargin": true, + "gridcolor": "white", + "linecolor": "white", + "ticks": "", + "title": { + "standoff": 15 + }, + "zerolinecolor": "white", + "zerolinewidth": 2 + } + } + }, + "title": { + "text": "Pass Rate by Test Type
Hub - ai21
Model - j2-grande-instruct" + }, + "width": 1000, + "xaxis": { + "anchor": "y", + "domain": [ + 0, + 1 + ], + "tickangle": 45, + "title": { + "text": "Test Type" + } + }, + "yaxis": { + "anchor": "x", + "domain": [ + 0, + 1 + ], + "title": { + "text": "Pass Rate" + } + } + } + } + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "application/vnd.plotly.v1+json": { + "config": { + "plotlyServerURL": "https://plot.ly" + }, + "data": [ + { + "domain": { + "x": [ + 0, + 1 + ], + "y": [ + 0, + 1 + ] + }, + "hovertemplate": "test_type=%{label}
fail_count=%{value}", + "labels": [ + "uppercase", + "lowercase", + "titlecase", + "add_typo", + "dyslexia_word_swap", + "add_abbreviation", + "add_slangs", + "add_speech_to_text_typo", + "add_ocr_typo", + "adjective_synonym_swap", + "adjective_antonym_swap" + ], + "legendgroup": "", + "name": "", + "showlegend": true, + "type": "pie", + "values": [ + 237, + 220, + 209, + 101, + 142, + 162, + 112, + 82, + 228, + 128, + 130 + ] + } + ], + "layout": { + "height": 600, + "legend": { + "tracegroupgap": 0 + }, + "template": { + "data": { + "bar": [ + { + "error_x": { + "color": "#2a3f5f" + }, + "error_y": { + "color": "#2a3f5f" + }, + "marker": { + "line": { + "color": "#E5ECF6", + "width": 0.5 + }, + "pattern": { + "fillmode": "overlay", + "size": 10, + "solidity": 0.2 + } + }, + "type": "bar" + } + ], + "barpolar": [ + { + "marker": { + "line": { + "color": "#E5ECF6", + "width": 0.5 + }, + "pattern": { + "fillmode": "overlay", + "size": 10, + "solidity": 0.2 + } + }, + "type": "barpolar" + } + ], + "carpet": [ + { + "aaxis": { + "endlinecolor": "#2a3f5f", + "gridcolor": "white", + "linecolor": "white", + "minorgridcolor": "white", + "startlinecolor": "#2a3f5f" + }, + "baxis": { + "endlinecolor": "#2a3f5f", + "gridcolor": "white", + "linecolor": "white", + "minorgridcolor": "white", + "startlinecolor": "#2a3f5f" + }, + "type": "carpet" + } + ], + "choropleth": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "type": "choropleth" + } + ], + "contour": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "colorscale": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + "#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + [ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + "#ed7953" + ], + [ + 0.7777777777777778, + "#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ], + "type": "contour" + } + ], + 
"contourcarpet": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "type": "contourcarpet" + } + ], + "heatmap": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "colorscale": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + "#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + [ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + "#ed7953" + ], + [ + 0.7777777777777778, + "#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ], + "type": "heatmap" + } + ], + "heatmapgl": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "colorscale": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + "#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + [ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + "#ed7953" + ], + [ + 0.7777777777777778, + "#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ], + "type": "heatmapgl" + } + ], + "histogram": [ + { + "marker": { + "pattern": { + "fillmode": "overlay", + "size": 10, + "solidity": 0.2 + } + }, + "type": "histogram" + } + ], + "histogram2d": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "colorscale": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + "#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + [ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + "#ed7953" + ], + [ + 0.7777777777777778, + "#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ], + "type": "histogram2d" + } + ], + "histogram2dcontour": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "colorscale": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + 
"#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + [ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + "#ed7953" + ], + [ + 0.7777777777777778, + "#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ], + "type": "histogram2dcontour" + } + ], + "mesh3d": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "type": "mesh3d" + } + ], + "parcoords": [ + { + "line": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "parcoords" + } + ], + "pie": [ + { + "automargin": true, + "type": "pie" + } + ], + "scatter": [ + { + "fillpattern": { + "fillmode": "overlay", + "size": 10, + "solidity": 0.2 + }, + "type": "scatter" + } + ], + "scatter3d": [ + { + "line": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scatter3d" + } + ], + "scattercarpet": [ + { + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scattercarpet" + } + ], + "scattergeo": [ + { + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scattergeo" + } + ], + "scattergl": [ + { + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scattergl" + } + ], + "scattermapbox": [ + { + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scattermapbox" + } + ], + "scatterpolar": [ + { + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scatterpolar" + } + ], + "scatterpolargl": [ + { + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scatterpolargl" + } + ], + "scatterternary": [ + { + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scatterternary" + } + ], + "surface": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + 
}, + "colorscale": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + "#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + [ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + "#ed7953" + ], + [ + 0.7777777777777778, + "#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ], + "type": "surface" + } + ], + "table": [ + { + "cells": { + "fill": { + "color": "#EBF0F8" + }, + "line": { + "color": "white" + } + }, + "header": { + "fill": { + "color": "#C8D4E3" + }, + "line": { + "color": "white" + } + }, + "type": "table" + } + ] + }, + "layout": { + "annotationdefaults": { + "arrowcolor": "#2a3f5f", + "arrowhead": 0, + "arrowwidth": 1 + }, + "autotypenumbers": "strict", + "coloraxis": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "colorscale": { + "diverging": [ + [ + 0, + "#8e0152" + ], + [ + 0.1, + "#c51b7d" + ], + [ + 0.2, + "#de77ae" + ], + [ + 0.3, + "#f1b6da" + ], + [ + 0.4, + "#fde0ef" + ], + [ + 0.5, + "#f7f7f7" + ], + [ + 0.6, + "#e6f5d0" + ], + [ + 0.7, + "#b8e186" + ], + [ + 0.8, + "#7fbc41" + ], + [ + 0.9, + "#4d9221" + ], + [ + 1, + "#276419" + ] + ], + "sequential": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + "#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + [ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + "#ed7953" + ], + [ + 0.7777777777777778, + "#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ], + "sequentialminus": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + "#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + [ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + "#ed7953" + ], + [ + 0.7777777777777778, + 
"#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ] + }, + "colorway": [ + "#636efa", + "#EF553B", + "#00cc96", + "#ab63fa", + "#FFA15A", + "#19d3f3", + "#FF6692", + "#B6E880", + "#FF97FF", + "#FECB52" + ], + "font": { + "color": "#2a3f5f" + }, + "geo": { + "bgcolor": "white", + "lakecolor": "white", + "landcolor": "#E5ECF6", + "showlakes": true, + "showland": true, + "subunitcolor": "white" + }, + "hoverlabel": { + "align": "left" + }, + "hovermode": "closest", + "mapbox": { + "style": "light" + }, + "paper_bgcolor": "white", + "plot_bgcolor": "#E5ECF6", + "polar": { + "angularaxis": { + "gridcolor": "white", + "linecolor": "white", + "ticks": "" + }, + "bgcolor": "#E5ECF6", + "radialaxis": { + "gridcolor": "white", + "linecolor": "white", + "ticks": "" + } + }, + "scene": { + "xaxis": { + "backgroundcolor": "#E5ECF6", + "gridcolor": "white", + "gridwidth": 2, + "linecolor": "white", + "showbackground": true, + "ticks": "", + "zerolinecolor": "white" + }, + "yaxis": { + "backgroundcolor": "#E5ECF6", + "gridcolor": "white", + "gridwidth": 2, + "linecolor": "white", + "showbackground": true, + "ticks": "", + "zerolinecolor": "white" + }, + "zaxis": { + "backgroundcolor": "#E5ECF6", + "gridcolor": "white", + "gridwidth": 2, + "linecolor": "white", + "showbackground": true, + "ticks": "", + "zerolinecolor": "white" + } + }, + "shapedefaults": { + "line": { + "color": "#2a3f5f" + } + }, + "ternary": { + "aaxis": { + "gridcolor": "white", + "linecolor": "white", + "ticks": "" + }, + "baxis": { + "gridcolor": "white", + "linecolor": "white", + "ticks": "" + }, + "bgcolor": "#E5ECF6", + "caxis": { + "gridcolor": "white", + "linecolor": "white", + "ticks": "" + } + }, + "title": { + "x": 0.05 + }, + "xaxis": { + "automargin": true, + "gridcolor": "white", + "linecolor": "white", + "ticks": "", + "title": { + "standoff": 15 + }, + "zerolinecolor": "white", + "zerolinewidth": 2 + }, + "yaxis": { + "automargin": true, + "gridcolor": "white", 
+ "linecolor": "white", + "ticks": "", + "title": { + "standoff": 15 + }, + "zerolinecolor": "white", + "zerolinewidth": 2 + } + } + }, + "title": { + "text": "Distribution of Fail Count
Hub - ai21
Model - j2-grande-instruct" + }, + "width": 800 + } + } + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "application/vnd.plotly.v1+json": { + "config": { + "plotlyServerURL": "https://plot.ly" + }, + "data": [ + { + "alignmentgroup": "True", + "hovertemplate": "Test Type=%{x}
Pass Rate=%{y}", + "legendgroup": "", + "marker": { + "color": "#636efa", + "pattern": { + "shape": "" + } + }, + "name": "", + "offsetgroup": "", + "orientation": "v", + "showlegend": false, + "textposition": "auto", + "type": "bar", + "x": [ + "uppercase", + "lowercase", + "titlecase", + "add_typo", + "dyslexia_word_swap", + "add_abbreviation", + "add_slangs", + "add_speech_to_text_typo", + "add_ocr_typo", + "adjective_synonym_swap", + "adjective_antonym_swap" + ], + "xaxis": "x", + "y": [ + 87, + 88, + 85, + 90, + 80, + 79, + 66, + 91, + 79, + 68, + 66 + ], + "yaxis": "y" + } + ], + "layout": { + "barmode": "relative", + "height": 700, + "legend": { + "tracegroupgap": 0 + }, + "shapes": [ + { + "line": { + "color": "red", + "width": 2 + }, + "type": "line", + "x0": -0.5, + "x1": 10.5, + "y0": 75, + "y1": 75 + } + ], + "template": { + "data": { + "bar": [ + { + "error_x": { + "color": "#2a3f5f" + }, + "error_y": { + "color": "#2a3f5f" + }, + "marker": { + "line": { + "color": "#E5ECF6", + "width": 0.5 + }, + "pattern": { + "fillmode": "overlay", + "size": 10, + "solidity": 0.2 + } + }, + "type": "bar" + } + ], + "barpolar": [ + { + "marker": { + "line": { + "color": "#E5ECF6", + "width": 0.5 + }, + "pattern": { + "fillmode": "overlay", + "size": 10, + "solidity": 0.2 + } + }, + "type": "barpolar" + } + ], + "carpet": [ + { + "aaxis": { + "endlinecolor": "#2a3f5f", + "gridcolor": "white", + "linecolor": "white", + "minorgridcolor": "white", + "startlinecolor": "#2a3f5f" + }, + "baxis": { + "endlinecolor": "#2a3f5f", + "gridcolor": "white", + "linecolor": "white", + "minorgridcolor": "white", + "startlinecolor": "#2a3f5f" + }, + "type": "carpet" + } + ], + "choropleth": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "type": "choropleth" + } + ], + "contour": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "colorscale": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + "#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + 
[ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + "#ed7953" + ], + [ + 0.7777777777777778, + "#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ], + "type": "contour" + } + ], + "contourcarpet": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "type": "contourcarpet" + } + ], + "heatmap": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "colorscale": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + "#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + [ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + "#ed7953" + ], + [ + 0.7777777777777778, + "#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ], + "type": "heatmap" + } + ], + "heatmapgl": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "colorscale": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + "#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + [ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + "#ed7953" + ], + [ + 0.7777777777777778, + "#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ], + "type": "heatmapgl" + } + ], + "histogram": [ + { + "marker": { + "pattern": { + "fillmode": "overlay", + "size": 10, + "solidity": 0.2 + } + }, + "type": "histogram" + } + ], + "histogram2d": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "colorscale": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + "#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + [ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + 
"#ed7953" + ], + [ + 0.7777777777777778, + "#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ], + "type": "histogram2d" + } + ], + "histogram2dcontour": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "colorscale": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + "#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + [ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + "#ed7953" + ], + [ + 0.7777777777777778, + "#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ], + "type": "histogram2dcontour" + } + ], + "mesh3d": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "type": "mesh3d" + } + ], + "parcoords": [ + { + "line": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "parcoords" + } + ], + "pie": [ + { + "automargin": true, + "type": "pie" + } + ], + "scatter": [ + { + "fillpattern": { + "fillmode": "overlay", + "size": 10, + "solidity": 0.2 + }, + "type": "scatter" + } + ], + "scatter3d": [ + { + "line": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scatter3d" + } + ], + "scattercarpet": [ + { + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scattercarpet" + } + ], + "scattergeo": [ + { + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scattergeo" + } + ], + "scattergl": [ + { + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scattergl" + } + ], + "scattermapbox": [ + { + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scattermapbox" + } + ], + "scatterpolar": [ + { + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scatterpolar" + } + ], + "scatterpolargl": [ 
+ { + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scatterpolargl" + } + ], + "scatterternary": [ + { + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scatterternary" + } + ], + "surface": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "colorscale": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + "#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + [ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + "#ed7953" + ], + [ + 0.7777777777777778, + "#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ], + "type": "surface" + } + ], + "table": [ + { + "cells": { + "fill": { + "color": "#EBF0F8" + }, + "line": { + "color": "white" + } + }, + "header": { + "fill": { + "color": "#C8D4E3" + }, + "line": { + "color": "white" + } + }, + "type": "table" + } + ] + }, + "layout": { + "annotationdefaults": { + "arrowcolor": "#2a3f5f", + "arrowhead": 0, + "arrowwidth": 1 + }, + "autotypenumbers": "strict", + "coloraxis": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "colorscale": { + "diverging": [ + [ + 0, + "#8e0152" + ], + [ + 0.1, + "#c51b7d" + ], + [ + 0.2, + "#de77ae" + ], + [ + 0.3, + "#f1b6da" + ], + [ + 0.4, + "#fde0ef" + ], + [ + 0.5, + "#f7f7f7" + ], + [ + 0.6, + "#e6f5d0" + ], + [ + 0.7, + "#b8e186" + ], + [ + 0.8, + "#7fbc41" + ], + [ + 0.9, + "#4d9221" + ], + [ + 1, + "#276419" + ] + ], + "sequential": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + "#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + [ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + "#ed7953" + ], + [ + 0.7777777777777778, + "#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ], + 
"sequentialminus": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + "#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + [ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + "#ed7953" + ], + [ + 0.7777777777777778, + "#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ] + }, + "colorway": [ + "#636efa", + "#EF553B", + "#00cc96", + "#ab63fa", + "#FFA15A", + "#19d3f3", + "#FF6692", + "#B6E880", + "#FF97FF", + "#FECB52" + ], + "font": { + "color": "#2a3f5f" + }, + "geo": { + "bgcolor": "white", + "lakecolor": "white", + "landcolor": "#E5ECF6", + "showlakes": true, + "showland": true, + "subunitcolor": "white" + }, + "hoverlabel": { + "align": "left" + }, + "hovermode": "closest", + "mapbox": { + "style": "light" + }, + "paper_bgcolor": "white", + "plot_bgcolor": "#E5ECF6", + "polar": { + "angularaxis": { + "gridcolor": "white", + "linecolor": "white", + "ticks": "" + }, + "bgcolor": "#E5ECF6", + "radialaxis": { + "gridcolor": "white", + "linecolor": "white", + "ticks": "" + } + }, + "scene": { + "xaxis": { + "backgroundcolor": "#E5ECF6", + "gridcolor": "white", + "gridwidth": 2, + "linecolor": "white", + "showbackground": true, + "ticks": "", + "zerolinecolor": "white" + }, + "yaxis": { + "backgroundcolor": "#E5ECF6", + "gridcolor": "white", + "gridwidth": 2, + "linecolor": "white", + "showbackground": true, + "ticks": "", + "zerolinecolor": "white" + }, + "zaxis": { + "backgroundcolor": "#E5ECF6", + "gridcolor": "white", + "gridwidth": 2, + "linecolor": "white", + "showbackground": true, + "ticks": "", + "zerolinecolor": "white" + } + }, + "shapedefaults": { + "line": { + "color": "#2a3f5f" + } + }, + "ternary": { + "aaxis": { + "gridcolor": "white", + "linecolor": "white", + "ticks": "" + }, + "baxis": { + "gridcolor": "white", + "linecolor": "white", + "ticks": "" + }, + "bgcolor": "#E5ECF6", + "caxis": { + 
"gridcolor": "white", + "linecolor": "white", + "ticks": "" + } + }, + "title": { + "x": 0.05 + }, + "xaxis": { + "automargin": true, + "gridcolor": "white", + "linecolor": "white", + "ticks": "", + "title": { + "standoff": 15 + }, + "zerolinecolor": "white", + "zerolinewidth": 2 + }, + "yaxis": { + "automargin": true, + "gridcolor": "white", + "linecolor": "white", + "ticks": "", + "title": { + "standoff": 15 + }, + "zerolinecolor": "white", + "zerolinewidth": 2 + } + } + }, + "title": { + "text": "Pass Rate by Test Type
Hub - openai
Model - gpt-3.5-turbo-instruct" + }, + "width": 1000, + "xaxis": { + "anchor": "y", + "domain": [ + 0, + 1 + ], + "tickangle": 45, + "title": { + "text": "Test Type" + } + }, + "yaxis": { + "anchor": "x", + "domain": [ + 0, + 1 + ], + "title": { + "text": "Pass Rate" + } + } + } + } + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "application/vnd.plotly.v1+json": { + "config": { + "plotlyServerURL": "https://plot.ly" + }, + "data": [ + { + "domain": { + "x": [ + 0, + 1 + ], + "y": [ + 0, + 1 + ] + }, + "hovertemplate": "test_type=%{label}
fail_count=%{value}", + "labels": [ + "uppercase", + "lowercase", + "titlecase", + "add_typo", + "dyslexia_word_swap", + "add_abbreviation", + "add_slangs", + "add_speech_to_text_typo", + "add_ocr_typo", + "adjective_synonym_swap", + "adjective_antonym_swap" + ], + "legendgroup": "", + "name": "", + "showlegend": true, + "type": "pie", + "values": [ + 65, + 60, + 73, + 50, + 85, + 93, + 103, + 46, + 103, + 116, + 104 + ] + } + ], + "layout": { + "height": 600, + "legend": { + "tracegroupgap": 0 + }, + "template": { + "data": { + "bar": [ + { + "error_x": { + "color": "#2a3f5f" + }, + "error_y": { + "color": "#2a3f5f" + }, + "marker": { + "line": { + "color": "#E5ECF6", + "width": 0.5 + }, + "pattern": { + "fillmode": "overlay", + "size": 10, + "solidity": 0.2 + } + }, + "type": "bar" + } + ], + "barpolar": [ + { + "marker": { + "line": { + "color": "#E5ECF6", + "width": 0.5 + }, + "pattern": { + "fillmode": "overlay", + "size": 10, + "solidity": 0.2 + } + }, + "type": "barpolar" + } + ], + "carpet": [ + { + "aaxis": { + "endlinecolor": "#2a3f5f", + "gridcolor": "white", + "linecolor": "white", + "minorgridcolor": "white", + "startlinecolor": "#2a3f5f" + }, + "baxis": { + "endlinecolor": "#2a3f5f", + "gridcolor": "white", + "linecolor": "white", + "minorgridcolor": "white", + "startlinecolor": "#2a3f5f" + }, + "type": "carpet" + } + ], + "choropleth": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "type": "choropleth" + } + ], + "contour": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "colorscale": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + "#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + [ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + "#ed7953" + ], + [ + 0.7777777777777778, + "#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ], + "type": "contour" + } + ], + 
"contourcarpet": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "type": "contourcarpet" + } + ], + "heatmap": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "colorscale": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + "#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + [ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + "#ed7953" + ], + [ + 0.7777777777777778, + "#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ], + "type": "heatmap" + } + ], + "heatmapgl": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "colorscale": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + "#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + [ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + "#ed7953" + ], + [ + 0.7777777777777778, + "#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ], + "type": "heatmapgl" + } + ], + "histogram": [ + { + "marker": { + "pattern": { + "fillmode": "overlay", + "size": 10, + "solidity": 0.2 + } + }, + "type": "histogram" + } + ], + "histogram2d": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "colorscale": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + "#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + [ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + "#ed7953" + ], + [ + 0.7777777777777778, + "#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ], + "type": "histogram2d" + } + ], + "histogram2dcontour": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "colorscale": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + 
"#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + [ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + "#ed7953" + ], + [ + 0.7777777777777778, + "#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ], + "type": "histogram2dcontour" + } + ], + "mesh3d": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "type": "mesh3d" + } + ], + "parcoords": [ + { + "line": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "parcoords" + } + ], + "pie": [ + { + "automargin": true, + "type": "pie" + } + ], + "scatter": [ + { + "fillpattern": { + "fillmode": "overlay", + "size": 10, + "solidity": 0.2 + }, + "type": "scatter" + } + ], + "scatter3d": [ + { + "line": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scatter3d" + } + ], + "scattercarpet": [ + { + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scattercarpet" + } + ], + "scattergeo": [ + { + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scattergeo" + } + ], + "scattergl": [ + { + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scattergl" + } + ], + "scattermapbox": [ + { + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scattermapbox" + } + ], + "scatterpolar": [ + { + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scatterpolar" + } + ], + "scatterpolargl": [ + { + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scatterpolargl" + } + ], + "scatterternary": [ + { + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scatterternary" + } + ], + "surface": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + 
}, + "colorscale": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + "#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + [ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + "#ed7953" + ], + [ + 0.7777777777777778, + "#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ], + "type": "surface" + } + ], + "table": [ + { + "cells": { + "fill": { + "color": "#EBF0F8" + }, + "line": { + "color": "white" + } + }, + "header": { + "fill": { + "color": "#C8D4E3" + }, + "line": { + "color": "white" + } + }, + "type": "table" + } + ] + }, + "layout": { + "annotationdefaults": { + "arrowcolor": "#2a3f5f", + "arrowhead": 0, + "arrowwidth": 1 + }, + "autotypenumbers": "strict", + "coloraxis": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "colorscale": { + "diverging": [ + [ + 0, + "#8e0152" + ], + [ + 0.1, + "#c51b7d" + ], + [ + 0.2, + "#de77ae" + ], + [ + 0.3, + "#f1b6da" + ], + [ + 0.4, + "#fde0ef" + ], + [ + 0.5, + "#f7f7f7" + ], + [ + 0.6, + "#e6f5d0" + ], + [ + 0.7, + "#b8e186" + ], + [ + 0.8, + "#7fbc41" + ], + [ + 0.9, + "#4d9221" + ], + [ + 1, + "#276419" + ] + ], + "sequential": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + "#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + [ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + "#ed7953" + ], + [ + 0.7777777777777778, + "#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ], + "sequentialminus": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + "#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + [ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + "#ed7953" + ], + [ + 0.7777777777777778, + 
"#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ] + }, + "colorway": [ + "#636efa", + "#EF553B", + "#00cc96", + "#ab63fa", + "#FFA15A", + "#19d3f3", + "#FF6692", + "#B6E880", + "#FF97FF", + "#FECB52" + ], + "font": { + "color": "#2a3f5f" + }, + "geo": { + "bgcolor": "white", + "lakecolor": "white", + "landcolor": "#E5ECF6", + "showlakes": true, + "showland": true, + "subunitcolor": "white" + }, + "hoverlabel": { + "align": "left" + }, + "hovermode": "closest", + "mapbox": { + "style": "light" + }, + "paper_bgcolor": "white", + "plot_bgcolor": "#E5ECF6", + "polar": { + "angularaxis": { + "gridcolor": "white", + "linecolor": "white", + "ticks": "" + }, + "bgcolor": "#E5ECF6", + "radialaxis": { + "gridcolor": "white", + "linecolor": "white", + "ticks": "" + } + }, + "scene": { + "xaxis": { + "backgroundcolor": "#E5ECF6", + "gridcolor": "white", + "gridwidth": 2, + "linecolor": "white", + "showbackground": true, + "ticks": "", + "zerolinecolor": "white" + }, + "yaxis": { + "backgroundcolor": "#E5ECF6", + "gridcolor": "white", + "gridwidth": 2, + "linecolor": "white", + "showbackground": true, + "ticks": "", + "zerolinecolor": "white" + }, + "zaxis": { + "backgroundcolor": "#E5ECF6", + "gridcolor": "white", + "gridwidth": 2, + "linecolor": "white", + "showbackground": true, + "ticks": "", + "zerolinecolor": "white" + } + }, + "shapedefaults": { + "line": { + "color": "#2a3f5f" + } + }, + "ternary": { + "aaxis": { + "gridcolor": "white", + "linecolor": "white", + "ticks": "" + }, + "baxis": { + "gridcolor": "white", + "linecolor": "white", + "ticks": "" + }, + "bgcolor": "#E5ECF6", + "caxis": { + "gridcolor": "white", + "linecolor": "white", + "ticks": "" + } + }, + "title": { + "x": 0.05 + }, + "xaxis": { + "automargin": true, + "gridcolor": "white", + "linecolor": "white", + "ticks": "", + "title": { + "standoff": 15 + }, + "zerolinecolor": "white", + "zerolinewidth": 2 + }, + "yaxis": { + "automargin": true, + "gridcolor": "white", 
+ "linecolor": "white", + "ticks": "", + "title": { + "standoff": 15 + }, + "zerolinecolor": "white", + "zerolinewidth": 2 + } + } + }, + "title": { + "text": "Distribution of Fail Count
Hub - openai
Model - gpt-3.5-turbo-instruct" + }, + "width": 800 + } + } + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "application/vnd.plotly.v1+json": { + "config": { + "plotlyServerURL": "https://plot.ly" + }, + "data": [ + { + "alignmentgroup": "True", + "hovertemplate": "Test Type=%{x}
Pass Rate=%{y}", + "legendgroup": "", + "marker": { + "color": "#636efa", + "pattern": { + "shape": "" + } + }, + "name": "", + "offsetgroup": "", + "orientation": "v", + "showlegend": false, + "textposition": "auto", + "type": "bar", + "x": [ + "uppercase", + "lowercase", + "titlecase", + "add_typo", + "dyslexia_word_swap", + "add_abbreviation", + "add_slangs", + "add_speech_to_text_typo", + "add_ocr_typo", + "adjective_synonym_swap", + "adjective_antonym_swap" + ], + "xaxis": "x", + "y": [ + 83, + 85, + 85, + 90, + 80, + 81, + 67, + 88, + 85, + 71, + 66 + ], + "yaxis": "y" + } + ], + "layout": { + "barmode": "relative", + "height": 700, + "legend": { + "tracegroupgap": 0 + }, + "shapes": [ + { + "line": { + "color": "red", + "width": 2 + }, + "type": "line", + "x0": -0.5, + "x1": 10.5, + "y0": 75, + "y1": 75 + } + ], + "template": { + "data": { + "bar": [ + { + "error_x": { + "color": "#2a3f5f" + }, + "error_y": { + "color": "#2a3f5f" + }, + "marker": { + "line": { + "color": "#E5ECF6", + "width": 0.5 + }, + "pattern": { + "fillmode": "overlay", + "size": 10, + "solidity": 0.2 + } + }, + "type": "bar" + } + ], + "barpolar": [ + { + "marker": { + "line": { + "color": "#E5ECF6", + "width": 0.5 + }, + "pattern": { + "fillmode": "overlay", + "size": 10, + "solidity": 0.2 + } + }, + "type": "barpolar" + } + ], + "carpet": [ + { + "aaxis": { + "endlinecolor": "#2a3f5f", + "gridcolor": "white", + "linecolor": "white", + "minorgridcolor": "white", + "startlinecolor": "#2a3f5f" + }, + "baxis": { + "endlinecolor": "#2a3f5f", + "gridcolor": "white", + "linecolor": "white", + "minorgridcolor": "white", + "startlinecolor": "#2a3f5f" + }, + "type": "carpet" + } + ], + "choropleth": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "type": "choropleth" + } + ], + "contour": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "colorscale": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + "#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + 
[ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + "#ed7953" + ], + [ + 0.7777777777777778, + "#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ], + "type": "contour" + } + ], + "contourcarpet": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "type": "contourcarpet" + } + ], + "heatmap": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "colorscale": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + "#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + [ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + "#ed7953" + ], + [ + 0.7777777777777778, + "#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ], + "type": "heatmap" + } + ], + "heatmapgl": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "colorscale": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + "#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + [ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + "#ed7953" + ], + [ + 0.7777777777777778, + "#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ], + "type": "heatmapgl" + } + ], + "histogram": [ + { + "marker": { + "pattern": { + "fillmode": "overlay", + "size": 10, + "solidity": 0.2 + } + }, + "type": "histogram" + } + ], + "histogram2d": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "colorscale": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + "#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + [ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + 
"#ed7953" + ], + [ + 0.7777777777777778, + "#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ], + "type": "histogram2d" + } + ], + "histogram2dcontour": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "colorscale": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + "#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + [ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + "#ed7953" + ], + [ + 0.7777777777777778, + "#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ], + "type": "histogram2dcontour" + } + ], + "mesh3d": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "type": "mesh3d" + } + ], + "parcoords": [ + { + "line": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "parcoords" + } + ], + "pie": [ + { + "automargin": true, + "type": "pie" + } + ], + "scatter": [ + { + "fillpattern": { + "fillmode": "overlay", + "size": 10, + "solidity": 0.2 + }, + "type": "scatter" + } + ], + "scatter3d": [ + { + "line": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scatter3d" + } + ], + "scattercarpet": [ + { + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scattercarpet" + } + ], + "scattergeo": [ + { + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scattergeo" + } + ], + "scattergl": [ + { + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scattergl" + } + ], + "scattermapbox": [ + { + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scattermapbox" + } + ], + "scatterpolar": [ + { + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scatterpolar" + } + ], + "scatterpolargl": [ 
+ { + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scatterpolargl" + } + ], + "scatterternary": [ + { + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scatterternary" + } + ], + "surface": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "colorscale": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + "#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + [ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + "#ed7953" + ], + [ + 0.7777777777777778, + "#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ], + "type": "surface" + } + ], + "table": [ + { + "cells": { + "fill": { + "color": "#EBF0F8" + }, + "line": { + "color": "white" + } + }, + "header": { + "fill": { + "color": "#C8D4E3" + }, + "line": { + "color": "white" + } + }, + "type": "table" + } + ] + }, + "layout": { + "annotationdefaults": { + "arrowcolor": "#2a3f5f", + "arrowhead": 0, + "arrowwidth": 1 + }, + "autotypenumbers": "strict", + "coloraxis": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "colorscale": { + "diverging": [ + [ + 0, + "#8e0152" + ], + [ + 0.1, + "#c51b7d" + ], + [ + 0.2, + "#de77ae" + ], + [ + 0.3, + "#f1b6da" + ], + [ + 0.4, + "#fde0ef" + ], + [ + 0.5, + "#f7f7f7" + ], + [ + 0.6, + "#e6f5d0" + ], + [ + 0.7, + "#b8e186" + ], + [ + 0.8, + "#7fbc41" + ], + [ + 0.9, + "#4d9221" + ], + [ + 1, + "#276419" + ] + ], + "sequential": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + "#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + [ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + "#ed7953" + ], + [ + 0.7777777777777778, + "#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ], + 
"sequentialminus": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + "#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + [ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + "#ed7953" + ], + [ + 0.7777777777777778, + "#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ] + }, + "colorway": [ + "#636efa", + "#EF553B", + "#00cc96", + "#ab63fa", + "#FFA15A", + "#19d3f3", + "#FF6692", + "#B6E880", + "#FF97FF", + "#FECB52" + ], + "font": { + "color": "#2a3f5f" + }, + "geo": { + "bgcolor": "white", + "lakecolor": "white", + "landcolor": "#E5ECF6", + "showlakes": true, + "showland": true, + "subunitcolor": "white" + }, + "hoverlabel": { + "align": "left" + }, + "hovermode": "closest", + "mapbox": { + "style": "light" + }, + "paper_bgcolor": "white", + "plot_bgcolor": "#E5ECF6", + "polar": { + "angularaxis": { + "gridcolor": "white", + "linecolor": "white", + "ticks": "" + }, + "bgcolor": "#E5ECF6", + "radialaxis": { + "gridcolor": "white", + "linecolor": "white", + "ticks": "" + } + }, + "scene": { + "xaxis": { + "backgroundcolor": "#E5ECF6", + "gridcolor": "white", + "gridwidth": 2, + "linecolor": "white", + "showbackground": true, + "ticks": "", + "zerolinecolor": "white" + }, + "yaxis": { + "backgroundcolor": "#E5ECF6", + "gridcolor": "white", + "gridwidth": 2, + "linecolor": "white", + "showbackground": true, + "ticks": "", + "zerolinecolor": "white" + }, + "zaxis": { + "backgroundcolor": "#E5ECF6", + "gridcolor": "white", + "gridwidth": 2, + "linecolor": "white", + "showbackground": true, + "ticks": "", + "zerolinecolor": "white" + } + }, + "shapedefaults": { + "line": { + "color": "#2a3f5f" + } + }, + "ternary": { + "aaxis": { + "gridcolor": "white", + "linecolor": "white", + "ticks": "" + }, + "baxis": { + "gridcolor": "white", + "linecolor": "white", + "ticks": "" + }, + "bgcolor": "#E5ECF6", + "caxis": { + 
"gridcolor": "white", + "linecolor": "white", + "ticks": "" + } + }, + "title": { + "x": 0.05 + }, + "xaxis": { + "automargin": true, + "gridcolor": "white", + "linecolor": "white", + "ticks": "", + "title": { + "standoff": 15 + }, + "zerolinecolor": "white", + "zerolinewidth": 2 + }, + "yaxis": { + "automargin": true, + "gridcolor": "white", + "linecolor": "white", + "ticks": "", + "title": { + "standoff": 15 + }, + "zerolinecolor": "white", + "zerolinewidth": 2 + } + } + }, + "title": { + "text": "Pass Rate by Test Type
Hub - openai
Model - text-davinci-003" + }, + "width": 1000, + "xaxis": { + "anchor": "y", + "domain": [ + 0, + 1 + ], + "tickangle": 45, + "title": { + "text": "Test Type" + } + }, + "yaxis": { + "anchor": "x", + "domain": [ + 0, + 1 + ], + "title": { + "text": "Pass Rate" + } + } + } + } + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "application/vnd.plotly.v1+json": { + "config": { + "plotlyServerURL": "https://plot.ly" + }, + "data": [ + { + "domain": { + "x": [ + 0, + 1 + ], + "y": [ + 0, + 1 + ] + }, + "hovertemplate": "test_type=%{label}
fail_count=%{value}", + "labels": [ + "uppercase", + "lowercase", + "titlecase", + "add_typo", + "dyslexia_word_swap", + "add_abbreviation", + "add_slangs", + "add_speech_to_text_typo", + "add_ocr_typo", + "adjective_synonym_swap", + "adjective_antonym_swap" + ], + "legendgroup": "", + "name": "", + "showlegend": true, + "type": "pie", + "values": [ + 83, + 77, + 77, + 47, + 81, + 84, + 100, + 61, + 73, + 105, + 103 + ] + } + ], + "layout": { + "height": 600, + "legend": { + "tracegroupgap": 0 + }, + "template": { + "data": { + "bar": [ + { + "error_x": { + "color": "#2a3f5f" + }, + "error_y": { + "color": "#2a3f5f" + }, + "marker": { + "line": { + "color": "#E5ECF6", + "width": 0.5 + }, + "pattern": { + "fillmode": "overlay", + "size": 10, + "solidity": 0.2 + } + }, + "type": "bar" + } + ], + "barpolar": [ + { + "marker": { + "line": { + "color": "#E5ECF6", + "width": 0.5 + }, + "pattern": { + "fillmode": "overlay", + "size": 10, + "solidity": 0.2 + } + }, + "type": "barpolar" + } + ], + "carpet": [ + { + "aaxis": { + "endlinecolor": "#2a3f5f", + "gridcolor": "white", + "linecolor": "white", + "minorgridcolor": "white", + "startlinecolor": "#2a3f5f" + }, + "baxis": { + "endlinecolor": "#2a3f5f", + "gridcolor": "white", + "linecolor": "white", + "minorgridcolor": "white", + "startlinecolor": "#2a3f5f" + }, + "type": "carpet" + } + ], + "choropleth": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "type": "choropleth" + } + ], + "contour": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "colorscale": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + "#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + [ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + "#ed7953" + ], + [ + 0.7777777777777778, + "#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ], + "type": "contour" + } + ], + 
"contourcarpet": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "type": "contourcarpet" + } + ], + "heatmap": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "colorscale": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + "#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + [ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + "#ed7953" + ], + [ + 0.7777777777777778, + "#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ], + "type": "heatmap" + } + ], + "heatmapgl": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "colorscale": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + "#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + [ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + "#ed7953" + ], + [ + 0.7777777777777778, + "#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ], + "type": "heatmapgl" + } + ], + "histogram": [ + { + "marker": { + "pattern": { + "fillmode": "overlay", + "size": 10, + "solidity": 0.2 + } + }, + "type": "histogram" + } + ], + "histogram2d": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "colorscale": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + "#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + [ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + "#ed7953" + ], + [ + 0.7777777777777778, + "#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ], + "type": "histogram2d" + } + ], + "histogram2dcontour": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "colorscale": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + 
"#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + [ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + "#ed7953" + ], + [ + 0.7777777777777778, + "#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ], + "type": "histogram2dcontour" + } + ], + "mesh3d": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "type": "mesh3d" + } + ], + "parcoords": [ + { + "line": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "parcoords" + } + ], + "pie": [ + { + "automargin": true, + "type": "pie" + } + ], + "scatter": [ + { + "fillpattern": { + "fillmode": "overlay", + "size": 10, + "solidity": 0.2 + }, + "type": "scatter" + } + ], + "scatter3d": [ + { + "line": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scatter3d" + } + ], + "scattercarpet": [ + { + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scattercarpet" + } + ], + "scattergeo": [ + { + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scattergeo" + } + ], + "scattergl": [ + { + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scattergl" + } + ], + "scattermapbox": [ + { + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scattermapbox" + } + ], + "scatterpolar": [ + { + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scatterpolar" + } + ], + "scatterpolargl": [ + { + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scatterpolargl" + } + ], + "scatterternary": [ + { + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scatterternary" + } + ], + "surface": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + 
}, + "colorscale": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + "#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + [ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + "#ed7953" + ], + [ + 0.7777777777777778, + "#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ], + "type": "surface" + } + ], + "table": [ + { + "cells": { + "fill": { + "color": "#EBF0F8" + }, + "line": { + "color": "white" + } + }, + "header": { + "fill": { + "color": "#C8D4E3" + }, + "line": { + "color": "white" + } + }, + "type": "table" + } + ] + }, + "layout": { + "annotationdefaults": { + "arrowcolor": "#2a3f5f", + "arrowhead": 0, + "arrowwidth": 1 + }, + "autotypenumbers": "strict", + "coloraxis": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "colorscale": { + "diverging": [ + [ + 0, + "#8e0152" + ], + [ + 0.1, + "#c51b7d" + ], + [ + 0.2, + "#de77ae" + ], + [ + 0.3, + "#f1b6da" + ], + [ + 0.4, + "#fde0ef" + ], + [ + 0.5, + "#f7f7f7" + ], + [ + 0.6, + "#e6f5d0" + ], + [ + 0.7, + "#b8e186" + ], + [ + 0.8, + "#7fbc41" + ], + [ + 0.9, + "#4d9221" + ], + [ + 1, + "#276419" + ] + ], + "sequential": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + "#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + [ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + "#ed7953" + ], + [ + 0.7777777777777778, + "#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ], + "sequentialminus": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + "#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + [ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + "#ed7953" + ], + [ + 0.7777777777777778, + 
"#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ] + }, + "colorway": [ + "#636efa", + "#EF553B", + "#00cc96", + "#ab63fa", + "#FFA15A", + "#19d3f3", + "#FF6692", + "#B6E880", + "#FF97FF", + "#FECB52" + ], + "font": { + "color": "#2a3f5f" + }, + "geo": { + "bgcolor": "white", + "lakecolor": "white", + "landcolor": "#E5ECF6", + "showlakes": true, + "showland": true, + "subunitcolor": "white" + }, + "hoverlabel": { + "align": "left" + }, + "hovermode": "closest", + "mapbox": { + "style": "light" + }, + "paper_bgcolor": "white", + "plot_bgcolor": "#E5ECF6", + "polar": { + "angularaxis": { + "gridcolor": "white", + "linecolor": "white", + "ticks": "" + }, + "bgcolor": "#E5ECF6", + "radialaxis": { + "gridcolor": "white", + "linecolor": "white", + "ticks": "" + } + }, + "scene": { + "xaxis": { + "backgroundcolor": "#E5ECF6", + "gridcolor": "white", + "gridwidth": 2, + "linecolor": "white", + "showbackground": true, + "ticks": "", + "zerolinecolor": "white" + }, + "yaxis": { + "backgroundcolor": "#E5ECF6", + "gridcolor": "white", + "gridwidth": 2, + "linecolor": "white", + "showbackground": true, + "ticks": "", + "zerolinecolor": "white" + }, + "zaxis": { + "backgroundcolor": "#E5ECF6", + "gridcolor": "white", + "gridwidth": 2, + "linecolor": "white", + "showbackground": true, + "ticks": "", + "zerolinecolor": "white" + } + }, + "shapedefaults": { + "line": { + "color": "#2a3f5f" + } + }, + "ternary": { + "aaxis": { + "gridcolor": "white", + "linecolor": "white", + "ticks": "" + }, + "baxis": { + "gridcolor": "white", + "linecolor": "white", + "ticks": "" + }, + "bgcolor": "#E5ECF6", + "caxis": { + "gridcolor": "white", + "linecolor": "white", + "ticks": "" + } + }, + "title": { + "x": 0.05 + }, + "xaxis": { + "automargin": true, + "gridcolor": "white", + "linecolor": "white", + "ticks": "", + "title": { + "standoff": 15 + }, + "zerolinecolor": "white", + "zerolinewidth": 2 + }, + "yaxis": { + "automargin": true, + "gridcolor": "white", 
+ "linecolor": "white", + "ticks": "", + "title": { + "standoff": 15 + }, + "zerolinecolor": "white", + "zerolinewidth": 2 + } + } + }, + "title": { + "text": "Distribution of Fail Count
Hub - openai
Model - text-davinci-003" + }, + "width": 800 + } + } + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "application/vnd.plotly.v1+json": { + "config": { + "plotlyServerURL": "https://plot.ly" + }, + "data": [ + { + "alignmentgroup": "True", + "hovertemplate": "Test Type=%{x}
Pass Rate=%{y}", + "legendgroup": "", + "marker": { + "color": "#636efa", + "pattern": { + "shape": "" + } + }, + "name": "", + "offsetgroup": "", + "orientation": "v", + "showlegend": false, + "textposition": "auto", + "type": "bar", + "x": [ + "uppercase", + "lowercase", + "titlecase", + "add_typo", + "dyslexia_word_swap", + "add_abbreviation", + "add_slangs", + "add_speech_to_text_typo", + "add_ocr_typo", + "adjective_synonym_swap", + "adjective_antonym_swap" + ], + "xaxis": "x", + "y": [ + 94, + 93, + 94, + 94, + 90, + 87, + 72, + 93, + 92, + 75, + 64 + ], + "yaxis": "y" + } + ], + "layout": { + "barmode": "relative", + "height": 700, + "legend": { + "tracegroupgap": 0 + }, + "shapes": [ + { + "line": { + "color": "red", + "width": 2 + }, + "type": "line", + "x0": -0.5, + "x1": 10.5, + "y0": 75, + "y1": 75 + } + ], + "template": { + "data": { + "bar": [ + { + "error_x": { + "color": "#2a3f5f" + }, + "error_y": { + "color": "#2a3f5f" + }, + "marker": { + "line": { + "color": "#E5ECF6", + "width": 0.5 + }, + "pattern": { + "fillmode": "overlay", + "size": 10, + "solidity": 0.2 + } + }, + "type": "bar" + } + ], + "barpolar": [ + { + "marker": { + "line": { + "color": "#E5ECF6", + "width": 0.5 + }, + "pattern": { + "fillmode": "overlay", + "size": 10, + "solidity": 0.2 + } + }, + "type": "barpolar" + } + ], + "carpet": [ + { + "aaxis": { + "endlinecolor": "#2a3f5f", + "gridcolor": "white", + "linecolor": "white", + "minorgridcolor": "white", + "startlinecolor": "#2a3f5f" + }, + "baxis": { + "endlinecolor": "#2a3f5f", + "gridcolor": "white", + "linecolor": "white", + "minorgridcolor": "white", + "startlinecolor": "#2a3f5f" + }, + "type": "carpet" + } + ], + "choropleth": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "type": "choropleth" + } + ], + "contour": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "colorscale": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + "#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + 
[ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + "#ed7953" + ], + [ + 0.7777777777777778, + "#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ], + "type": "contour" + } + ], + "contourcarpet": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "type": "contourcarpet" + } + ], + "heatmap": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "colorscale": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + "#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + [ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + "#ed7953" + ], + [ + 0.7777777777777778, + "#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ], + "type": "heatmap" + } + ], + "heatmapgl": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "colorscale": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + "#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + [ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + "#ed7953" + ], + [ + 0.7777777777777778, + "#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ], + "type": "heatmapgl" + } + ], + "histogram": [ + { + "marker": { + "pattern": { + "fillmode": "overlay", + "size": 10, + "solidity": 0.2 + } + }, + "type": "histogram" + } + ], + "histogram2d": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "colorscale": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + "#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + [ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + 
"#ed7953" + ], + [ + 0.7777777777777778, + "#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ], + "type": "histogram2d" + } + ], + "histogram2dcontour": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "colorscale": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + "#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + [ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + "#ed7953" + ], + [ + 0.7777777777777778, + "#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ], + "type": "histogram2dcontour" + } + ], + "mesh3d": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "type": "mesh3d" + } + ], + "parcoords": [ + { + "line": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "parcoords" + } + ], + "pie": [ + { + "automargin": true, + "type": "pie" + } + ], + "scatter": [ + { + "fillpattern": { + "fillmode": "overlay", + "size": 10, + "solidity": 0.2 + }, + "type": "scatter" + } + ], + "scatter3d": [ + { + "line": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scatter3d" + } + ], + "scattercarpet": [ + { + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scattercarpet" + } + ], + "scattergeo": [ + { + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scattergeo" + } + ], + "scattergl": [ + { + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scattergl" + } + ], + "scattermapbox": [ + { + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scattermapbox" + } + ], + "scatterpolar": [ + { + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scatterpolar" + } + ], + "scatterpolargl": [ 
+ { + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scatterpolargl" + } + ], + "scatterternary": [ + { + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scatterternary" + } + ], + "surface": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "colorscale": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + "#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + [ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + "#ed7953" + ], + [ + 0.7777777777777778, + "#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ], + "type": "surface" + } + ], + "table": [ + { + "cells": { + "fill": { + "color": "#EBF0F8" + }, + "line": { + "color": "white" + } + }, + "header": { + "fill": { + "color": "#C8D4E3" + }, + "line": { + "color": "white" + } + }, + "type": "table" + } + ] + }, + "layout": { + "annotationdefaults": { + "arrowcolor": "#2a3f5f", + "arrowhead": 0, + "arrowwidth": 1 + }, + "autotypenumbers": "strict", + "coloraxis": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "colorscale": { + "diverging": [ + [ + 0, + "#8e0152" + ], + [ + 0.1, + "#c51b7d" + ], + [ + 0.2, + "#de77ae" + ], + [ + 0.3, + "#f1b6da" + ], + [ + 0.4, + "#fde0ef" + ], + [ + 0.5, + "#f7f7f7" + ], + [ + 0.6, + "#e6f5d0" + ], + [ + 0.7, + "#b8e186" + ], + [ + 0.8, + "#7fbc41" + ], + [ + 0.9, + "#4d9221" + ], + [ + 1, + "#276419" + ] + ], + "sequential": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + "#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + [ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + "#ed7953" + ], + [ + 0.7777777777777778, + "#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ], + 
"sequentialminus": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + "#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + [ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + "#ed7953" + ], + [ + 0.7777777777777778, + "#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ] + }, + "colorway": [ + "#636efa", + "#EF553B", + "#00cc96", + "#ab63fa", + "#FFA15A", + "#19d3f3", + "#FF6692", + "#B6E880", + "#FF97FF", + "#FECB52" + ], + "font": { + "color": "#2a3f5f" + }, + "geo": { + "bgcolor": "white", + "lakecolor": "white", + "landcolor": "#E5ECF6", + "showlakes": true, + "showland": true, + "subunitcolor": "white" + }, + "hoverlabel": { + "align": "left" + }, + "hovermode": "closest", + "mapbox": { + "style": "light" + }, + "paper_bgcolor": "white", + "plot_bgcolor": "#E5ECF6", + "polar": { + "angularaxis": { + "gridcolor": "white", + "linecolor": "white", + "ticks": "" + }, + "bgcolor": "#E5ECF6", + "radialaxis": { + "gridcolor": "white", + "linecolor": "white", + "ticks": "" + } + }, + "scene": { + "xaxis": { + "backgroundcolor": "#E5ECF6", + "gridcolor": "white", + "gridwidth": 2, + "linecolor": "white", + "showbackground": true, + "ticks": "", + "zerolinecolor": "white" + }, + "yaxis": { + "backgroundcolor": "#E5ECF6", + "gridcolor": "white", + "gridwidth": 2, + "linecolor": "white", + "showbackground": true, + "ticks": "", + "zerolinecolor": "white" + }, + "zaxis": { + "backgroundcolor": "#E5ECF6", + "gridcolor": "white", + "gridwidth": 2, + "linecolor": "white", + "showbackground": true, + "ticks": "", + "zerolinecolor": "white" + } + }, + "shapedefaults": { + "line": { + "color": "#2a3f5f" + } + }, + "ternary": { + "aaxis": { + "gridcolor": "white", + "linecolor": "white", + "ticks": "" + }, + "baxis": { + "gridcolor": "white", + "linecolor": "white", + "ticks": "" + }, + "bgcolor": "#E5ECF6", + "caxis": { + 
"gridcolor": "white", + "linecolor": "white", + "ticks": "" + } + }, + "title": { + "x": 0.05 + }, + "xaxis": { + "automargin": true, + "gridcolor": "white", + "linecolor": "white", + "ticks": "", + "title": { + "standoff": 15 + }, + "zerolinecolor": "white", + "zerolinewidth": 2 + }, + "yaxis": { + "automargin": true, + "gridcolor": "white", + "linecolor": "white", + "ticks": "", + "title": { + "standoff": 15 + }, + "zerolinecolor": "white", + "zerolinewidth": 2 + } + } + }, + "title": { + "text": "Pass Rate by Test Type
Hub - openai
Model - gpt-4-1106-preview" + }, + "width": 1000, + "xaxis": { + "anchor": "y", + "domain": [ + 0, + 1 + ], + "tickangle": 45, + "title": { + "text": "Test Type" + } + }, + "yaxis": { + "anchor": "x", + "domain": [ + 0, + 1 + ], + "title": { + "text": "Pass Rate" + } + } + } + } + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "application/vnd.plotly.v1+json": { + "config": { + "plotlyServerURL": "https://plot.ly" + }, + "data": [ + { + "domain": { + "x": [ + 0, + 1 + ], + "y": [ + 0, + 1 + ] + }, + "hovertemplate": "test_type=%{label}
fail_count=%{value}", + "labels": [ + "uppercase", + "lowercase", + "titlecase", + "add_typo", + "dyslexia_word_swap", + "add_abbreviation", + "add_slangs", + "add_speech_to_text_typo", + "add_ocr_typo", + "adjective_synonym_swap", + "adjective_antonym_swap" + ], + "legendgroup": "", + "name": "", + "showlegend": true, + "type": "pie", + "values": [ + 32, + 33, + 29, + 30, + 40, + 58, + 87, + 34, + 39, + 93, + 109 + ] + } + ], + "layout": { + "height": 600, + "legend": { + "tracegroupgap": 0 + }, + "template": { + "data": { + "bar": [ + { + "error_x": { + "color": "#2a3f5f" + }, + "error_y": { + "color": "#2a3f5f" + }, + "marker": { + "line": { + "color": "#E5ECF6", + "width": 0.5 + }, + "pattern": { + "fillmode": "overlay", + "size": 10, + "solidity": 0.2 + } + }, + "type": "bar" + } + ], + "barpolar": [ + { + "marker": { + "line": { + "color": "#E5ECF6", + "width": 0.5 + }, + "pattern": { + "fillmode": "overlay", + "size": 10, + "solidity": 0.2 + } + }, + "type": "barpolar" + } + ], + "carpet": [ + { + "aaxis": { + "endlinecolor": "#2a3f5f", + "gridcolor": "white", + "linecolor": "white", + "minorgridcolor": "white", + "startlinecolor": "#2a3f5f" + }, + "baxis": { + "endlinecolor": "#2a3f5f", + "gridcolor": "white", + "linecolor": "white", + "minorgridcolor": "white", + "startlinecolor": "#2a3f5f" + }, + "type": "carpet" + } + ], + "choropleth": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "type": "choropleth" + } + ], + "contour": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "colorscale": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + "#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + [ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + "#ed7953" + ], + [ + 0.7777777777777778, + "#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ], + "type": "contour" + } + ], + 
"contourcarpet": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "type": "contourcarpet" + } + ], + "heatmap": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "colorscale": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + "#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + [ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + "#ed7953" + ], + [ + 0.7777777777777778, + "#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ], + "type": "heatmap" + } + ], + "heatmapgl": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "colorscale": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + "#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + [ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + "#ed7953" + ], + [ + 0.7777777777777778, + "#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ], + "type": "heatmapgl" + } + ], + "histogram": [ + { + "marker": { + "pattern": { + "fillmode": "overlay", + "size": 10, + "solidity": 0.2 + } + }, + "type": "histogram" + } + ], + "histogram2d": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "colorscale": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + "#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + [ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + "#ed7953" + ], + [ + 0.7777777777777778, + "#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ], + "type": "histogram2d" + } + ], + "histogram2dcontour": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "colorscale": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + 
"#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + [ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + "#ed7953" + ], + [ + 0.7777777777777778, + "#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ], + "type": "histogram2dcontour" + } + ], + "mesh3d": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "type": "mesh3d" + } + ], + "parcoords": [ + { + "line": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "parcoords" + } + ], + "pie": [ + { + "automargin": true, + "type": "pie" + } + ], + "scatter": [ + { + "fillpattern": { + "fillmode": "overlay", + "size": 10, + "solidity": 0.2 + }, + "type": "scatter" + } + ], + "scatter3d": [ + { + "line": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scatter3d" + } + ], + "scattercarpet": [ + { + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scattercarpet" + } + ], + "scattergeo": [ + { + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scattergeo" + } + ], + "scattergl": [ + { + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scattergl" + } + ], + "scattermapbox": [ + { + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scattermapbox" + } + ], + "scatterpolar": [ + { + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scatterpolar" + } + ], + "scatterpolargl": [ + { + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scatterpolargl" + } + ], + "scatterternary": [ + { + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scatterternary" + } + ], + "surface": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + 
}, + "colorscale": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + "#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + [ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + "#ed7953" + ], + [ + 0.7777777777777778, + "#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ], + "type": "surface" + } + ], + "table": [ + { + "cells": { + "fill": { + "color": "#EBF0F8" + }, + "line": { + "color": "white" + } + }, + "header": { + "fill": { + "color": "#C8D4E3" + }, + "line": { + "color": "white" + } + }, + "type": "table" + } + ] + }, + "layout": { + "annotationdefaults": { + "arrowcolor": "#2a3f5f", + "arrowhead": 0, + "arrowwidth": 1 + }, + "autotypenumbers": "strict", + "coloraxis": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "colorscale": { + "diverging": [ + [ + 0, + "#8e0152" + ], + [ + 0.1, + "#c51b7d" + ], + [ + 0.2, + "#de77ae" + ], + [ + 0.3, + "#f1b6da" + ], + [ + 0.4, + "#fde0ef" + ], + [ + 0.5, + "#f7f7f7" + ], + [ + 0.6, + "#e6f5d0" + ], + [ + 0.7, + "#b8e186" + ], + [ + 0.8, + "#7fbc41" + ], + [ + 0.9, + "#4d9221" + ], + [ + 1, + "#276419" + ] + ], + "sequential": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + "#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + [ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + "#ed7953" + ], + [ + 0.7777777777777778, + "#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ], + "sequentialminus": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + "#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + [ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + "#ed7953" + ], + [ + 0.7777777777777778, + 
"#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ] + }, + "colorway": [ + "#636efa", + "#EF553B", + "#00cc96", + "#ab63fa", + "#FFA15A", + "#19d3f3", + "#FF6692", + "#B6E880", + "#FF97FF", + "#FECB52" + ], + "font": { + "color": "#2a3f5f" + }, + "geo": { + "bgcolor": "white", + "lakecolor": "white", + "landcolor": "#E5ECF6", + "showlakes": true, + "showland": true, + "subunitcolor": "white" + }, + "hoverlabel": { + "align": "left" + }, + "hovermode": "closest", + "mapbox": { + "style": "light" + }, + "paper_bgcolor": "white", + "plot_bgcolor": "#E5ECF6", + "polar": { + "angularaxis": { + "gridcolor": "white", + "linecolor": "white", + "ticks": "" + }, + "bgcolor": "#E5ECF6", + "radialaxis": { + "gridcolor": "white", + "linecolor": "white", + "ticks": "" + } + }, + "scene": { + "xaxis": { + "backgroundcolor": "#E5ECF6", + "gridcolor": "white", + "gridwidth": 2, + "linecolor": "white", + "showbackground": true, + "ticks": "", + "zerolinecolor": "white" + }, + "yaxis": { + "backgroundcolor": "#E5ECF6", + "gridcolor": "white", + "gridwidth": 2, + "linecolor": "white", + "showbackground": true, + "ticks": "", + "zerolinecolor": "white" + }, + "zaxis": { + "backgroundcolor": "#E5ECF6", + "gridcolor": "white", + "gridwidth": 2, + "linecolor": "white", + "showbackground": true, + "ticks": "", + "zerolinecolor": "white" + } + }, + "shapedefaults": { + "line": { + "color": "#2a3f5f" + } + }, + "ternary": { + "aaxis": { + "gridcolor": "white", + "linecolor": "white", + "ticks": "" + }, + "baxis": { + "gridcolor": "white", + "linecolor": "white", + "ticks": "" + }, + "bgcolor": "#E5ECF6", + "caxis": { + "gridcolor": "white", + "linecolor": "white", + "ticks": "" + } + }, + "title": { + "x": 0.05 + }, + "xaxis": { + "automargin": true, + "gridcolor": "white", + "linecolor": "white", + "ticks": "", + "title": { + "standoff": 15 + }, + "zerolinecolor": "white", + "zerolinewidth": 2 + }, + "yaxis": { + "automargin": true, + "gridcolor": "white", 
+ "linecolor": "white", + "ticks": "", + "title": { + "standoff": 15 + }, + "zerolinecolor": "white", + "zerolinewidth": 2 + } + } + }, + "title": { + "text": "Distribution of Fail Count
Hub - openai
Model - gpt-4-1106-preview" + }, + "width": 800 + } + } + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "application/vnd.plotly.v1+json": { + "config": { + "plotlyServerURL": "https://plot.ly" + }, + "data": [ + { + "alignmentgroup": "True", + "hovertemplate": "Test Type=%{x}
Pass Rate=%{y}", + "legendgroup": "", + "marker": { + "color": "#636efa", + "pattern": { + "shape": "" + } + }, + "name": "", + "offsetgroup": "", + "orientation": "v", + "showlegend": false, + "textposition": "auto", + "type": "bar", + "x": [ + "uppercase", + "lowercase", + "titlecase", + "add_typo", + "dyslexia_word_swap", + "add_abbreviation", + "add_slangs", + "add_speech_to_text_typo", + "add_ocr_typo", + "adjective_synonym_swap", + "adjective_antonym_swap" + ], + "xaxis": "x", + "y": [ + 55, + 59, + 54, + 88, + 85, + 81, + 76, + 87, + 72, + 78, + 74 + ], + "yaxis": "y" + } + ], + "layout": { + "barmode": "relative", + "height": 700, + "legend": { + "tracegroupgap": 0 + }, + "shapes": [ + { + "line": { + "color": "red", + "width": 2 + }, + "type": "line", + "x0": -0.5, + "x1": 10.5, + "y0": 75, + "y1": 75 + } + ], + "template": { + "data": { + "bar": [ + { + "error_x": { + "color": "#2a3f5f" + }, + "error_y": { + "color": "#2a3f5f" + }, + "marker": { + "line": { + "color": "#E5ECF6", + "width": 0.5 + }, + "pattern": { + "fillmode": "overlay", + "size": 10, + "solidity": 0.2 + } + }, + "type": "bar" + } + ], + "barpolar": [ + { + "marker": { + "line": { + "color": "#E5ECF6", + "width": 0.5 + }, + "pattern": { + "fillmode": "overlay", + "size": 10, + "solidity": 0.2 + } + }, + "type": "barpolar" + } + ], + "carpet": [ + { + "aaxis": { + "endlinecolor": "#2a3f5f", + "gridcolor": "white", + "linecolor": "white", + "minorgridcolor": "white", + "startlinecolor": "#2a3f5f" + }, + "baxis": { + "endlinecolor": "#2a3f5f", + "gridcolor": "white", + "linecolor": "white", + "minorgridcolor": "white", + "startlinecolor": "#2a3f5f" + }, + "type": "carpet" + } + ], + "choropleth": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "type": "choropleth" + } + ], + "contour": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "colorscale": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + "#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + 
[ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + "#ed7953" + ], + [ + 0.7777777777777778, + "#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ], + "type": "contour" + } + ], + "contourcarpet": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "type": "contourcarpet" + } + ], + "heatmap": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "colorscale": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + "#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + [ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + "#ed7953" + ], + [ + 0.7777777777777778, + "#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ], + "type": "heatmap" + } + ], + "heatmapgl": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "colorscale": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + "#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + [ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + "#ed7953" + ], + [ + 0.7777777777777778, + "#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ], + "type": "heatmapgl" + } + ], + "histogram": [ + { + "marker": { + "pattern": { + "fillmode": "overlay", + "size": 10, + "solidity": 0.2 + } + }, + "type": "histogram" + } + ], + "histogram2d": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "colorscale": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + "#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + [ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + 
"#ed7953" + ], + [ + 0.7777777777777778, + "#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ], + "type": "histogram2d" + } + ], + "histogram2dcontour": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "colorscale": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + "#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + [ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + "#ed7953" + ], + [ + 0.7777777777777778, + "#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ], + "type": "histogram2dcontour" + } + ], + "mesh3d": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "type": "mesh3d" + } + ], + "parcoords": [ + { + "line": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "parcoords" + } + ], + "pie": [ + { + "automargin": true, + "type": "pie" + } + ], + "scatter": [ + { + "fillpattern": { + "fillmode": "overlay", + "size": 10, + "solidity": 0.2 + }, + "type": "scatter" + } + ], + "scatter3d": [ + { + "line": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scatter3d" + } + ], + "scattercarpet": [ + { + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scattercarpet" + } + ], + "scattergeo": [ + { + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scattergeo" + } + ], + "scattergl": [ + { + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scattergl" + } + ], + "scattermapbox": [ + { + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scattermapbox" + } + ], + "scatterpolar": [ + { + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scatterpolar" + } + ], + "scatterpolargl": [ 
+ { + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scatterpolargl" + } + ], + "scatterternary": [ + { + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scatterternary" + } + ], + "surface": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "colorscale": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + "#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + [ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + "#ed7953" + ], + [ + 0.7777777777777778, + "#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ], + "type": "surface" + } + ], + "table": [ + { + "cells": { + "fill": { + "color": "#EBF0F8" + }, + "line": { + "color": "white" + } + }, + "header": { + "fill": { + "color": "#C8D4E3" + }, + "line": { + "color": "white" + } + }, + "type": "table" + } + ] + }, + "layout": { + "annotationdefaults": { + "arrowcolor": "#2a3f5f", + "arrowhead": 0, + "arrowwidth": 1 + }, + "autotypenumbers": "strict", + "coloraxis": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "colorscale": { + "diverging": [ + [ + 0, + "#8e0152" + ], + [ + 0.1, + "#c51b7d" + ], + [ + 0.2, + "#de77ae" + ], + [ + 0.3, + "#f1b6da" + ], + [ + 0.4, + "#fde0ef" + ], + [ + 0.5, + "#f7f7f7" + ], + [ + 0.6, + "#e6f5d0" + ], + [ + 0.7, + "#b8e186" + ], + [ + 0.8, + "#7fbc41" + ], + [ + 0.9, + "#4d9221" + ], + [ + 1, + "#276419" + ] + ], + "sequential": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + "#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + [ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + "#ed7953" + ], + [ + 0.7777777777777778, + "#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ], + 
"sequentialminus": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + "#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + [ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + "#ed7953" + ], + [ + 0.7777777777777778, + "#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ] + }, + "colorway": [ + "#636efa", + "#EF553B", + "#00cc96", + "#ab63fa", + "#FFA15A", + "#19d3f3", + "#FF6692", + "#B6E880", + "#FF97FF", + "#FECB52" + ], + "font": { + "color": "#2a3f5f" + }, + "geo": { + "bgcolor": "white", + "lakecolor": "white", + "landcolor": "#E5ECF6", + "showlakes": true, + "showland": true, + "subunitcolor": "white" + }, + "hoverlabel": { + "align": "left" + }, + "hovermode": "closest", + "mapbox": { + "style": "light" + }, + "paper_bgcolor": "white", + "plot_bgcolor": "#E5ECF6", + "polar": { + "angularaxis": { + "gridcolor": "white", + "linecolor": "white", + "ticks": "" + }, + "bgcolor": "#E5ECF6", + "radialaxis": { + "gridcolor": "white", + "linecolor": "white", + "ticks": "" + } + }, + "scene": { + "xaxis": { + "backgroundcolor": "#E5ECF6", + "gridcolor": "white", + "gridwidth": 2, + "linecolor": "white", + "showbackground": true, + "ticks": "", + "zerolinecolor": "white" + }, + "yaxis": { + "backgroundcolor": "#E5ECF6", + "gridcolor": "white", + "gridwidth": 2, + "linecolor": "white", + "showbackground": true, + "ticks": "", + "zerolinecolor": "white" + }, + "zaxis": { + "backgroundcolor": "#E5ECF6", + "gridcolor": "white", + "gridwidth": 2, + "linecolor": "white", + "showbackground": true, + "ticks": "", + "zerolinecolor": "white" + } + }, + "shapedefaults": { + "line": { + "color": "#2a3f5f" + } + }, + "ternary": { + "aaxis": { + "gridcolor": "white", + "linecolor": "white", + "ticks": "" + }, + "baxis": { + "gridcolor": "white", + "linecolor": "white", + "ticks": "" + }, + "bgcolor": "#E5ECF6", + "caxis": { + 
"gridcolor": "white", + "linecolor": "white", + "ticks": "" + } + }, + "title": { + "x": 0.05 + }, + "xaxis": { + "automargin": true, + "gridcolor": "white", + "linecolor": "white", + "ticks": "", + "title": { + "standoff": 15 + }, + "zerolinecolor": "white", + "zerolinewidth": 2 + }, + "yaxis": { + "automargin": true, + "gridcolor": "white", + "linecolor": "white", + "ticks": "", + "title": { + "standoff": 15 + }, + "zerolinecolor": "white", + "zerolinewidth": 2 + } + } + }, + "title": { + "text": "Pass Rate by Test Type
Hub - huggingface
Model - mistralai/Mistral-7B-Instruct-v0.1" + }, + "width": 1000, + "xaxis": { + "anchor": "y", + "domain": [ + 0, + 1 + ], + "tickangle": 45, + "title": { + "text": "Test Type" + } + }, + "yaxis": { + "anchor": "x", + "domain": [ + 0, + 1 + ], + "title": { + "text": "Pass Rate" + } + } + } + } + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "application/vnd.plotly.v1+json": { + "config": { + "plotlyServerURL": "https://plot.ly" + }, + "data": [ + { + "domain": { + "x": [ + 0, + 1 + ], + "y": [ + 0, + 1 + ] + }, + "hovertemplate": "test_type=%{label}
fail_count=%{value}", + "labels": [ + "uppercase", + "lowercase", + "titlecase", + "add_typo", + "dyslexia_word_swap", + "add_abbreviation", + "add_slangs", + "add_speech_to_text_typo", + "add_ocr_typo", + "adjective_synonym_swap", + "adjective_antonym_swap" + ], + "legendgroup": "", + "name": "", + "showlegend": true, + "type": "pie", + "values": [ + 226, + 206, + 230, + 56, + 64, + 85, + 75, + 64, + 138, + 82, + 81 + ] + } + ], + "layout": { + "height": 600, + "legend": { + "tracegroupgap": 0 + }, + "template": { + "data": { + "bar": [ + { + "error_x": { + "color": "#2a3f5f" + }, + "error_y": { + "color": "#2a3f5f" + }, + "marker": { + "line": { + "color": "#E5ECF6", + "width": 0.5 + }, + "pattern": { + "fillmode": "overlay", + "size": 10, + "solidity": 0.2 + } + }, + "type": "bar" + } + ], + "barpolar": [ + { + "marker": { + "line": { + "color": "#E5ECF6", + "width": 0.5 + }, + "pattern": { + "fillmode": "overlay", + "size": 10, + "solidity": 0.2 + } + }, + "type": "barpolar" + } + ], + "carpet": [ + { + "aaxis": { + "endlinecolor": "#2a3f5f", + "gridcolor": "white", + "linecolor": "white", + "minorgridcolor": "white", + "startlinecolor": "#2a3f5f" + }, + "baxis": { + "endlinecolor": "#2a3f5f", + "gridcolor": "white", + "linecolor": "white", + "minorgridcolor": "white", + "startlinecolor": "#2a3f5f" + }, + "type": "carpet" + } + ], + "choropleth": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "type": "choropleth" + } + ], + "contour": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "colorscale": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + "#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + [ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + "#ed7953" + ], + [ + 0.7777777777777778, + "#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ], + "type": "contour" + } + ], + 
"contourcarpet": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "type": "contourcarpet" + } + ], + "heatmap": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "colorscale": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + "#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + [ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + "#ed7953" + ], + [ + 0.7777777777777778, + "#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ], + "type": "heatmap" + } + ], + "heatmapgl": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "colorscale": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + "#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + [ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + "#ed7953" + ], + [ + 0.7777777777777778, + "#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ], + "type": "heatmapgl" + } + ], + "histogram": [ + { + "marker": { + "pattern": { + "fillmode": "overlay", + "size": 10, + "solidity": 0.2 + } + }, + "type": "histogram" + } + ], + "histogram2d": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "colorscale": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + "#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + [ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + "#ed7953" + ], + [ + 0.7777777777777778, + "#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ], + "type": "histogram2d" + } + ], + "histogram2dcontour": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "colorscale": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + 
"#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + [ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + "#ed7953" + ], + [ + 0.7777777777777778, + "#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ], + "type": "histogram2dcontour" + } + ], + "mesh3d": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "type": "mesh3d" + } + ], + "parcoords": [ + { + "line": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "parcoords" + } + ], + "pie": [ + { + "automargin": true, + "type": "pie" + } + ], + "scatter": [ + { + "fillpattern": { + "fillmode": "overlay", + "size": 10, + "solidity": 0.2 + }, + "type": "scatter" + } + ], + "scatter3d": [ + { + "line": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scatter3d" + } + ], + "scattercarpet": [ + { + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scattercarpet" + } + ], + "scattergeo": [ + { + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scattergeo" + } + ], + "scattergl": [ + { + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scattergl" + } + ], + "scattermapbox": [ + { + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scattermapbox" + } + ], + "scatterpolar": [ + { + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scatterpolar" + } + ], + "scatterpolargl": [ + { + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scatterpolargl" + } + ], + "scatterternary": [ + { + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scatterternary" + } + ], + "surface": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + 
}, + "colorscale": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + "#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + [ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + "#ed7953" + ], + [ + 0.7777777777777778, + "#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ], + "type": "surface" + } + ], + "table": [ + { + "cells": { + "fill": { + "color": "#EBF0F8" + }, + "line": { + "color": "white" + } + }, + "header": { + "fill": { + "color": "#C8D4E3" + }, + "line": { + "color": "white" + } + }, + "type": "table" + } + ] + }, + "layout": { + "annotationdefaults": { + "arrowcolor": "#2a3f5f", + "arrowhead": 0, + "arrowwidth": 1 + }, + "autotypenumbers": "strict", + "coloraxis": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "colorscale": { + "diverging": [ + [ + 0, + "#8e0152" + ], + [ + 0.1, + "#c51b7d" + ], + [ + 0.2, + "#de77ae" + ], + [ + 0.3, + "#f1b6da" + ], + [ + 0.4, + "#fde0ef" + ], + [ + 0.5, + "#f7f7f7" + ], + [ + 0.6, + "#e6f5d0" + ], + [ + 0.7, + "#b8e186" + ], + [ + 0.8, + "#7fbc41" + ], + [ + 0.9, + "#4d9221" + ], + [ + 1, + "#276419" + ] + ], + "sequential": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + "#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + [ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + "#ed7953" + ], + [ + 0.7777777777777778, + "#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ], + "sequentialminus": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + "#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + [ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + "#ed7953" + ], + [ + 0.7777777777777778, + 
"#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ] + }, + "colorway": [ + "#636efa", + "#EF553B", + "#00cc96", + "#ab63fa", + "#FFA15A", + "#19d3f3", + "#FF6692", + "#B6E880", + "#FF97FF", + "#FECB52" + ], + "font": { + "color": "#2a3f5f" + }, + "geo": { + "bgcolor": "white", + "lakecolor": "white", + "landcolor": "#E5ECF6", + "showlakes": true, + "showland": true, + "subunitcolor": "white" + }, + "hoverlabel": { + "align": "left" + }, + "hovermode": "closest", + "mapbox": { + "style": "light" + }, + "paper_bgcolor": "white", + "plot_bgcolor": "#E5ECF6", + "polar": { + "angularaxis": { + "gridcolor": "white", + "linecolor": "white", + "ticks": "" + }, + "bgcolor": "#E5ECF6", + "radialaxis": { + "gridcolor": "white", + "linecolor": "white", + "ticks": "" + } + }, + "scene": { + "xaxis": { + "backgroundcolor": "#E5ECF6", + "gridcolor": "white", + "gridwidth": 2, + "linecolor": "white", + "showbackground": true, + "ticks": "", + "zerolinecolor": "white" + }, + "yaxis": { + "backgroundcolor": "#E5ECF6", + "gridcolor": "white", + "gridwidth": 2, + "linecolor": "white", + "showbackground": true, + "ticks": "", + "zerolinecolor": "white" + }, + "zaxis": { + "backgroundcolor": "#E5ECF6", + "gridcolor": "white", + "gridwidth": 2, + "linecolor": "white", + "showbackground": true, + "ticks": "", + "zerolinecolor": "white" + } + }, + "shapedefaults": { + "line": { + "color": "#2a3f5f" + } + }, + "ternary": { + "aaxis": { + "gridcolor": "white", + "linecolor": "white", + "ticks": "" + }, + "baxis": { + "gridcolor": "white", + "linecolor": "white", + "ticks": "" + }, + "bgcolor": "#E5ECF6", + "caxis": { + "gridcolor": "white", + "linecolor": "white", + "ticks": "" + } + }, + "title": { + "x": 0.05 + }, + "xaxis": { + "automargin": true, + "gridcolor": "white", + "linecolor": "white", + "ticks": "", + "title": { + "standoff": 15 + }, + "zerolinecolor": "white", + "zerolinewidth": 2 + }, + "yaxis": { + "automargin": true, + "gridcolor": "white", 
+ "linecolor": "white", + "ticks": "", + "title": { + "standoff": 15 + }, + "zerolinecolor": "white", + "zerolinewidth": 2 + } + } + }, + "title": { + "text": "Distribution of Fail Count
Hub - huggingface
Model - mistralai/Mistral-7B-Instruct-v0.1" + }, + "width": 800 + } + } + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "application/vnd.plotly.v1+json": { + "config": { + "plotlyServerURL": "https://plot.ly" + }, + "data": [ + { + "alignmentgroup": "True", + "hovertemplate": "Test Type=%{x}
Pass Rate=%{y}", + "legendgroup": "", + "marker": { + "color": "#636efa", + "pattern": { + "shape": "" + } + }, + "name": "", + "offsetgroup": "", + "orientation": "v", + "showlegend": false, + "textposition": "auto", + "type": "bar", + "x": [ + "uppercase", + "lowercase", + "titlecase", + "add_typo", + "dyslexia_word_swap", + "add_abbreviation", + "add_slangs", + "add_speech_to_text_typo", + "add_ocr_typo", + "adjective_synonym_swap", + "adjective_antonym_swap" + ], + "xaxis": "x", + "y": [ + 52, + 46, + 53, + 83, + 86, + 76, + 74, + 82, + 63, + 80, + 69 + ], + "yaxis": "y" + } + ], + "layout": { + "barmode": "relative", + "height": 700, + "legend": { + "tracegroupgap": 0 + }, + "shapes": [ + { + "line": { + "color": "red", + "width": 2 + }, + "type": "line", + "x0": -0.5, + "x1": 10.5, + "y0": 75, + "y1": 75 + } + ], + "template": { + "data": { + "bar": [ + { + "error_x": { + "color": "#2a3f5f" + }, + "error_y": { + "color": "#2a3f5f" + }, + "marker": { + "line": { + "color": "#E5ECF6", + "width": 0.5 + }, + "pattern": { + "fillmode": "overlay", + "size": 10, + "solidity": 0.2 + } + }, + "type": "bar" + } + ], + "barpolar": [ + { + "marker": { + "line": { + "color": "#E5ECF6", + "width": 0.5 + }, + "pattern": { + "fillmode": "overlay", + "size": 10, + "solidity": 0.2 + } + }, + "type": "barpolar" + } + ], + "carpet": [ + { + "aaxis": { + "endlinecolor": "#2a3f5f", + "gridcolor": "white", + "linecolor": "white", + "minorgridcolor": "white", + "startlinecolor": "#2a3f5f" + }, + "baxis": { + "endlinecolor": "#2a3f5f", + "gridcolor": "white", + "linecolor": "white", + "minorgridcolor": "white", + "startlinecolor": "#2a3f5f" + }, + "type": "carpet" + } + ], + "choropleth": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "type": "choropleth" + } + ], + "contour": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "colorscale": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + "#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + 
[ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + "#ed7953" + ], + [ + 0.7777777777777778, + "#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ], + "type": "contour" + } + ], + "contourcarpet": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "type": "contourcarpet" + } + ], + "heatmap": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "colorscale": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + "#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + [ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + "#ed7953" + ], + [ + 0.7777777777777778, + "#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ], + "type": "heatmap" + } + ], + "heatmapgl": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "colorscale": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + "#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + [ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + "#ed7953" + ], + [ + 0.7777777777777778, + "#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ], + "type": "heatmapgl" + } + ], + "histogram": [ + { + "marker": { + "pattern": { + "fillmode": "overlay", + "size": 10, + "solidity": 0.2 + } + }, + "type": "histogram" + } + ], + "histogram2d": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "colorscale": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + "#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + [ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + 
"#ed7953" + ], + [ + 0.7777777777777778, + "#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ], + "type": "histogram2d" + } + ], + "histogram2dcontour": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "colorscale": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + "#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + [ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + "#ed7953" + ], + [ + 0.7777777777777778, + "#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ], + "type": "histogram2dcontour" + } + ], + "mesh3d": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "type": "mesh3d" + } + ], + "parcoords": [ + { + "line": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "parcoords" + } + ], + "pie": [ + { + "automargin": true, + "type": "pie" + } + ], + "scatter": [ + { + "fillpattern": { + "fillmode": "overlay", + "size": 10, + "solidity": 0.2 + }, + "type": "scatter" + } + ], + "scatter3d": [ + { + "line": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scatter3d" + } + ], + "scattercarpet": [ + { + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scattercarpet" + } + ], + "scattergeo": [ + { + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scattergeo" + } + ], + "scattergl": [ + { + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scattergl" + } + ], + "scattermapbox": [ + { + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scattermapbox" + } + ], + "scatterpolar": [ + { + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scatterpolar" + } + ], + "scatterpolargl": [ 
+ { + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scatterpolargl" + } + ], + "scatterternary": [ + { + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scatterternary" + } + ], + "surface": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "colorscale": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + "#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + [ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + "#ed7953" + ], + [ + 0.7777777777777778, + "#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ], + "type": "surface" + } + ], + "table": [ + { + "cells": { + "fill": { + "color": "#EBF0F8" + }, + "line": { + "color": "white" + } + }, + "header": { + "fill": { + "color": "#C8D4E3" + }, + "line": { + "color": "white" + } + }, + "type": "table" + } + ] + }, + "layout": { + "annotationdefaults": { + "arrowcolor": "#2a3f5f", + "arrowhead": 0, + "arrowwidth": 1 + }, + "autotypenumbers": "strict", + "coloraxis": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "colorscale": { + "diverging": [ + [ + 0, + "#8e0152" + ], + [ + 0.1, + "#c51b7d" + ], + [ + 0.2, + "#de77ae" + ], + [ + 0.3, + "#f1b6da" + ], + [ + 0.4, + "#fde0ef" + ], + [ + 0.5, + "#f7f7f7" + ], + [ + 0.6, + "#e6f5d0" + ], + [ + 0.7, + "#b8e186" + ], + [ + 0.8, + "#7fbc41" + ], + [ + 0.9, + "#4d9221" + ], + [ + 1, + "#276419" + ] + ], + "sequential": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + "#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + [ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + "#ed7953" + ], + [ + 0.7777777777777778, + "#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ], + 
"sequentialminus": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + "#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + [ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + "#ed7953" + ], + [ + 0.7777777777777778, + "#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ] + }, + "colorway": [ + "#636efa", + "#EF553B", + "#00cc96", + "#ab63fa", + "#FFA15A", + "#19d3f3", + "#FF6692", + "#B6E880", + "#FF97FF", + "#FECB52" + ], + "font": { + "color": "#2a3f5f" + }, + "geo": { + "bgcolor": "white", + "lakecolor": "white", + "landcolor": "#E5ECF6", + "showlakes": true, + "showland": true, + "subunitcolor": "white" + }, + "hoverlabel": { + "align": "left" + }, + "hovermode": "closest", + "mapbox": { + "style": "light" + }, + "paper_bgcolor": "white", + "plot_bgcolor": "#E5ECF6", + "polar": { + "angularaxis": { + "gridcolor": "white", + "linecolor": "white", + "ticks": "" + }, + "bgcolor": "#E5ECF6", + "radialaxis": { + "gridcolor": "white", + "linecolor": "white", + "ticks": "" + } + }, + "scene": { + "xaxis": { + "backgroundcolor": "#E5ECF6", + "gridcolor": "white", + "gridwidth": 2, + "linecolor": "white", + "showbackground": true, + "ticks": "", + "zerolinecolor": "white" + }, + "yaxis": { + "backgroundcolor": "#E5ECF6", + "gridcolor": "white", + "gridwidth": 2, + "linecolor": "white", + "showbackground": true, + "ticks": "", + "zerolinecolor": "white" + }, + "zaxis": { + "backgroundcolor": "#E5ECF6", + "gridcolor": "white", + "gridwidth": 2, + "linecolor": "white", + "showbackground": true, + "ticks": "", + "zerolinecolor": "white" + } + }, + "shapedefaults": { + "line": { + "color": "#2a3f5f" + } + }, + "ternary": { + "aaxis": { + "gridcolor": "white", + "linecolor": "white", + "ticks": "" + }, + "baxis": { + "gridcolor": "white", + "linecolor": "white", + "ticks": "" + }, + "bgcolor": "#E5ECF6", + "caxis": { + 
"gridcolor": "white", + "linecolor": "white", + "ticks": "" + } + }, + "title": { + "x": 0.05 + }, + "xaxis": { + "automargin": true, + "gridcolor": "white", + "linecolor": "white", + "ticks": "", + "title": { + "standoff": 15 + }, + "zerolinecolor": "white", + "zerolinewidth": 2 + }, + "yaxis": { + "automargin": true, + "gridcolor": "white", + "linecolor": "white", + "ticks": "", + "title": { + "standoff": 15 + }, + "zerolinecolor": "white", + "zerolinewidth": 2 + } + } + }, + "title": { + "text": "Pass Rate by Test Type
Hub - huggingface
Model - HuggingFaceH4/zephyr-7b-beta" + }, + "width": 1000, + "xaxis": { + "anchor": "y", + "domain": [ + 0, + 1 + ], + "tickangle": 45, + "title": { + "text": "Test Type" + } + }, + "yaxis": { + "anchor": "x", + "domain": [ + 0, + 1 + ], + "title": { + "text": "Pass Rate" + } + } + } + } + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "application/vnd.plotly.v1+json": { + "config": { + "plotlyServerURL": "https://plot.ly" + }, + "data": [ + { + "domain": { + "x": [ + 0, + 1 + ], + "y": [ + 0, + 1 + ] + }, + "hovertemplate": "test_type=%{label}
fail_count=%{value}", + "labels": [ + "uppercase", + "lowercase", + "titlecase", + "add_typo", + "dyslexia_word_swap", + "add_abbreviation", + "add_slangs", + "add_speech_to_text_typo", + "add_ocr_typo", + "adjective_synonym_swap", + "adjective_antonym_swap" + ], + "legendgroup": "", + "name": "", + "showlegend": true, + "type": "pie", + "values": [ + 240, + 268, + 236, + 81, + 59, + 105, + 80, + 89, + 182, + 73, + 94 + ] + } + ], + "layout": { + "height": 600, + "legend": { + "tracegroupgap": 0 + }, + "template": { + "data": { + "bar": [ + { + "error_x": { + "color": "#2a3f5f" + }, + "error_y": { + "color": "#2a3f5f" + }, + "marker": { + "line": { + "color": "#E5ECF6", + "width": 0.5 + }, + "pattern": { + "fillmode": "overlay", + "size": 10, + "solidity": 0.2 + } + }, + "type": "bar" + } + ], + "barpolar": [ + { + "marker": { + "line": { + "color": "#E5ECF6", + "width": 0.5 + }, + "pattern": { + "fillmode": "overlay", + "size": 10, + "solidity": 0.2 + } + }, + "type": "barpolar" + } + ], + "carpet": [ + { + "aaxis": { + "endlinecolor": "#2a3f5f", + "gridcolor": "white", + "linecolor": "white", + "minorgridcolor": "white", + "startlinecolor": "#2a3f5f" + }, + "baxis": { + "endlinecolor": "#2a3f5f", + "gridcolor": "white", + "linecolor": "white", + "minorgridcolor": "white", + "startlinecolor": "#2a3f5f" + }, + "type": "carpet" + } + ], + "choropleth": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "type": "choropleth" + } + ], + "contour": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "colorscale": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + "#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + [ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + "#ed7953" + ], + [ + 0.7777777777777778, + "#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ], + "type": "contour" + } + ], + 
"contourcarpet": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "type": "contourcarpet" + } + ], + "heatmap": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "colorscale": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + "#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + [ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + "#ed7953" + ], + [ + 0.7777777777777778, + "#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ], + "type": "heatmap" + } + ], + "heatmapgl": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "colorscale": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + "#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + [ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + "#ed7953" + ], + [ + 0.7777777777777778, + "#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ], + "type": "heatmapgl" + } + ], + "histogram": [ + { + "marker": { + "pattern": { + "fillmode": "overlay", + "size": 10, + "solidity": 0.2 + } + }, + "type": "histogram" + } + ], + "histogram2d": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "colorscale": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + "#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + [ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + "#ed7953" + ], + [ + 0.7777777777777778, + "#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ], + "type": "histogram2d" + } + ], + "histogram2dcontour": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "colorscale": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + 
"#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + [ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + "#ed7953" + ], + [ + 0.7777777777777778, + "#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ], + "type": "histogram2dcontour" + } + ], + "mesh3d": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "type": "mesh3d" + } + ], + "parcoords": [ + { + "line": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "parcoords" + } + ], + "pie": [ + { + "automargin": true, + "type": "pie" + } + ], + "scatter": [ + { + "fillpattern": { + "fillmode": "overlay", + "size": 10, + "solidity": 0.2 + }, + "type": "scatter" + } + ], + "scatter3d": [ + { + "line": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scatter3d" + } + ], + "scattercarpet": [ + { + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scattercarpet" + } + ], + "scattergeo": [ + { + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scattergeo" + } + ], + "scattergl": [ + { + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scattergl" + } + ], + "scattermapbox": [ + { + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scattermapbox" + } + ], + "scatterpolar": [ + { + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scatterpolar" + } + ], + "scatterpolargl": [ + { + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scatterpolargl" + } + ], + "scatterternary": [ + { + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scatterternary" + } + ], + "surface": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + 
}, + "colorscale": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + "#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + [ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + "#ed7953" + ], + [ + 0.7777777777777778, + "#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ], + "type": "surface" + } + ], + "table": [ + { + "cells": { + "fill": { + "color": "#EBF0F8" + }, + "line": { + "color": "white" + } + }, + "header": { + "fill": { + "color": "#C8D4E3" + }, + "line": { + "color": "white" + } + }, + "type": "table" + } + ] + }, + "layout": { + "annotationdefaults": { + "arrowcolor": "#2a3f5f", + "arrowhead": 0, + "arrowwidth": 1 + }, + "autotypenumbers": "strict", + "coloraxis": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "colorscale": { + "diverging": [ + [ + 0, + "#8e0152" + ], + [ + 0.1, + "#c51b7d" + ], + [ + 0.2, + "#de77ae" + ], + [ + 0.3, + "#f1b6da" + ], + [ + 0.4, + "#fde0ef" + ], + [ + 0.5, + "#f7f7f7" + ], + [ + 0.6, + "#e6f5d0" + ], + [ + 0.7, + "#b8e186" + ], + [ + 0.8, + "#7fbc41" + ], + [ + 0.9, + "#4d9221" + ], + [ + 1, + "#276419" + ] + ], + "sequential": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + "#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + [ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + "#ed7953" + ], + [ + 0.7777777777777778, + "#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ], + "sequentialminus": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + "#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + [ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + "#ed7953" + ], + [ + 0.7777777777777778, + 
"#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ] + }, + "colorway": [ + "#636efa", + "#EF553B", + "#00cc96", + "#ab63fa", + "#FFA15A", + "#19d3f3", + "#FF6692", + "#B6E880", + "#FF97FF", + "#FECB52" + ], + "font": { + "color": "#2a3f5f" + }, + "geo": { + "bgcolor": "white", + "lakecolor": "white", + "landcolor": "#E5ECF6", + "showlakes": true, + "showland": true, + "subunitcolor": "white" + }, + "hoverlabel": { + "align": "left" + }, + "hovermode": "closest", + "mapbox": { + "style": "light" + }, + "paper_bgcolor": "white", + "plot_bgcolor": "#E5ECF6", + "polar": { + "angularaxis": { + "gridcolor": "white", + "linecolor": "white", + "ticks": "" + }, + "bgcolor": "#E5ECF6", + "radialaxis": { + "gridcolor": "white", + "linecolor": "white", + "ticks": "" + } + }, + "scene": { + "xaxis": { + "backgroundcolor": "#E5ECF6", + "gridcolor": "white", + "gridwidth": 2, + "linecolor": "white", + "showbackground": true, + "ticks": "", + "zerolinecolor": "white" + }, + "yaxis": { + "backgroundcolor": "#E5ECF6", + "gridcolor": "white", + "gridwidth": 2, + "linecolor": "white", + "showbackground": true, + "ticks": "", + "zerolinecolor": "white" + }, + "zaxis": { + "backgroundcolor": "#E5ECF6", + "gridcolor": "white", + "gridwidth": 2, + "linecolor": "white", + "showbackground": true, + "ticks": "", + "zerolinecolor": "white" + } + }, + "shapedefaults": { + "line": { + "color": "#2a3f5f" + } + }, + "ternary": { + "aaxis": { + "gridcolor": "white", + "linecolor": "white", + "ticks": "" + }, + "baxis": { + "gridcolor": "white", + "linecolor": "white", + "ticks": "" + }, + "bgcolor": "#E5ECF6", + "caxis": { + "gridcolor": "white", + "linecolor": "white", + "ticks": "" + } + }, + "title": { + "x": 0.05 + }, + "xaxis": { + "automargin": true, + "gridcolor": "white", + "linecolor": "white", + "ticks": "", + "title": { + "standoff": 15 + }, + "zerolinecolor": "white", + "zerolinewidth": 2 + }, + "yaxis": { + "automargin": true, + "gridcolor": "white", 
+ "linecolor": "white", + "ticks": "", + "title": { + "standoff": 15 + }, + "zerolinecolor": "white", + "zerolinewidth": 2 + } + } + }, + "title": { + "text": "Distribution of Fail Count
Hub - huggingface
Model - HuggingFaceH4/zephyr-7b-beta" + }, + "width": 800 + } + } + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "application/vnd.plotly.v1+json": { + "config": { + "plotlyServerURL": "https://plot.ly" + }, + "data": [ + { + "alignmentgroup": "True", + "hovertemplate": "Test Type=%{x}
Pass Rate=%{y}", + "legendgroup": "", + "marker": { + "color": "#636efa", + "pattern": { + "shape": "" + } + }, + "name": "", + "offsetgroup": "", + "orientation": "v", + "showlegend": false, + "textposition": "auto", + "type": "bar", + "x": [ + "uppercase", + "lowercase", + "titlecase", + "add_typo", + "dyslexia_word_swap", + "add_abbreviation", + "add_slangs", + "add_speech_to_text_typo", + "add_ocr_typo", + "adjective_synonym_swap", + "adjective_antonym_swap" + ], + "xaxis": "x", + "y": [ + 79, + 88, + 82, + 89, + 86, + 84, + 80, + 86, + 73, + 80, + 71 + ], + "yaxis": "y" + } + ], + "layout": { + "barmode": "relative", + "height": 700, + "legend": { + "tracegroupgap": 0 + }, + "shapes": [ + { + "line": { + "color": "red", + "width": 2 + }, + "type": "line", + "x0": -0.5, + "x1": 10.5, + "y0": 75, + "y1": 75 + } + ], + "template": { + "data": { + "bar": [ + { + "error_x": { + "color": "#2a3f5f" + }, + "error_y": { + "color": "#2a3f5f" + }, + "marker": { + "line": { + "color": "#E5ECF6", + "width": 0.5 + }, + "pattern": { + "fillmode": "overlay", + "size": 10, + "solidity": 0.2 + } + }, + "type": "bar" + } + ], + "barpolar": [ + { + "marker": { + "line": { + "color": "#E5ECF6", + "width": 0.5 + }, + "pattern": { + "fillmode": "overlay", + "size": 10, + "solidity": 0.2 + } + }, + "type": "barpolar" + } + ], + "carpet": [ + { + "aaxis": { + "endlinecolor": "#2a3f5f", + "gridcolor": "white", + "linecolor": "white", + "minorgridcolor": "white", + "startlinecolor": "#2a3f5f" + }, + "baxis": { + "endlinecolor": "#2a3f5f", + "gridcolor": "white", + "linecolor": "white", + "minorgridcolor": "white", + "startlinecolor": "#2a3f5f" + }, + "type": "carpet" + } + ], + "choropleth": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "type": "choropleth" + } + ], + "contour": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "colorscale": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + "#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + 
[ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + "#ed7953" + ], + [ + 0.7777777777777778, + "#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ], + "type": "contour" + } + ], + "contourcarpet": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "type": "contourcarpet" + } + ], + "heatmap": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "colorscale": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + "#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + [ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + "#ed7953" + ], + [ + 0.7777777777777778, + "#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ], + "type": "heatmap" + } + ], + "heatmapgl": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "colorscale": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + "#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + [ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + "#ed7953" + ], + [ + 0.7777777777777778, + "#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ], + "type": "heatmapgl" + } + ], + "histogram": [ + { + "marker": { + "pattern": { + "fillmode": "overlay", + "size": 10, + "solidity": 0.2 + } + }, + "type": "histogram" + } + ], + "histogram2d": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "colorscale": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + "#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + [ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + 
"#ed7953" + ], + [ + 0.7777777777777778, + "#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ], + "type": "histogram2d" + } + ], + "histogram2dcontour": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "colorscale": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + "#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + [ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + "#ed7953" + ], + [ + 0.7777777777777778, + "#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ], + "type": "histogram2dcontour" + } + ], + "mesh3d": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "type": "mesh3d" + } + ], + "parcoords": [ + { + "line": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "parcoords" + } + ], + "pie": [ + { + "automargin": true, + "type": "pie" + } + ], + "scatter": [ + { + "fillpattern": { + "fillmode": "overlay", + "size": 10, + "solidity": 0.2 + }, + "type": "scatter" + } + ], + "scatter3d": [ + { + "line": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scatter3d" + } + ], + "scattercarpet": [ + { + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scattercarpet" + } + ], + "scattergeo": [ + { + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scattergeo" + } + ], + "scattergl": [ + { + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scattergl" + } + ], + "scattermapbox": [ + { + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scattermapbox" + } + ], + "scatterpolar": [ + { + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scatterpolar" + } + ], + "scatterpolargl": [ 
+ { + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scatterpolargl" + } + ], + "scatterternary": [ + { + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scatterternary" + } + ], + "surface": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "colorscale": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + "#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + [ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + "#ed7953" + ], + [ + 0.7777777777777778, + "#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ], + "type": "surface" + } + ], + "table": [ + { + "cells": { + "fill": { + "color": "#EBF0F8" + }, + "line": { + "color": "white" + } + }, + "header": { + "fill": { + "color": "#C8D4E3" + }, + "line": { + "color": "white" + } + }, + "type": "table" + } + ] + }, + "layout": { + "annotationdefaults": { + "arrowcolor": "#2a3f5f", + "arrowhead": 0, + "arrowwidth": 1 + }, + "autotypenumbers": "strict", + "coloraxis": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "colorscale": { + "diverging": [ + [ + 0, + "#8e0152" + ], + [ + 0.1, + "#c51b7d" + ], + [ + 0.2, + "#de77ae" + ], + [ + 0.3, + "#f1b6da" + ], + [ + 0.4, + "#fde0ef" + ], + [ + 0.5, + "#f7f7f7" + ], + [ + 0.6, + "#e6f5d0" + ], + [ + 0.7, + "#b8e186" + ], + [ + 0.8, + "#7fbc41" + ], + [ + 0.9, + "#4d9221" + ], + [ + 1, + "#276419" + ] + ], + "sequential": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + "#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + [ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + "#ed7953" + ], + [ + 0.7777777777777778, + "#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ], + 
"sequentialminus": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + "#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + [ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + "#ed7953" + ], + [ + 0.7777777777777778, + "#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ] + }, + "colorway": [ + "#636efa", + "#EF553B", + "#00cc96", + "#ab63fa", + "#FFA15A", + "#19d3f3", + "#FF6692", + "#B6E880", + "#FF97FF", + "#FECB52" + ], + "font": { + "color": "#2a3f5f" + }, + "geo": { + "bgcolor": "white", + "lakecolor": "white", + "landcolor": "#E5ECF6", + "showlakes": true, + "showland": true, + "subunitcolor": "white" + }, + "hoverlabel": { + "align": "left" + }, + "hovermode": "closest", + "mapbox": { + "style": "light" + }, + "paper_bgcolor": "white", + "plot_bgcolor": "#E5ECF6", + "polar": { + "angularaxis": { + "gridcolor": "white", + "linecolor": "white", + "ticks": "" + }, + "bgcolor": "#E5ECF6", + "radialaxis": { + "gridcolor": "white", + "linecolor": "white", + "ticks": "" + } + }, + "scene": { + "xaxis": { + "backgroundcolor": "#E5ECF6", + "gridcolor": "white", + "gridwidth": 2, + "linecolor": "white", + "showbackground": true, + "ticks": "", + "zerolinecolor": "white" + }, + "yaxis": { + "backgroundcolor": "#E5ECF6", + "gridcolor": "white", + "gridwidth": 2, + "linecolor": "white", + "showbackground": true, + "ticks": "", + "zerolinecolor": "white" + }, + "zaxis": { + "backgroundcolor": "#E5ECF6", + "gridcolor": "white", + "gridwidth": 2, + "linecolor": "white", + "showbackground": true, + "ticks": "", + "zerolinecolor": "white" + } + }, + "shapedefaults": { + "line": { + "color": "#2a3f5f" + } + }, + "ternary": { + "aaxis": { + "gridcolor": "white", + "linecolor": "white", + "ticks": "" + }, + "baxis": { + "gridcolor": "white", + "linecolor": "white", + "ticks": "" + }, + "bgcolor": "#E5ECF6", + "caxis": { + 
"gridcolor": "white", + "linecolor": "white", + "ticks": "" + } + }, + "title": { + "x": 0.05 + }, + "xaxis": { + "automargin": true, + "gridcolor": "white", + "linecolor": "white", + "ticks": "", + "title": { + "standoff": 15 + }, + "zerolinecolor": "white", + "zerolinewidth": 2 + }, + "yaxis": { + "automargin": true, + "gridcolor": "white", + "linecolor": "white", + "ticks": "", + "title": { + "standoff": 15 + }, + "zerolinecolor": "white", + "zerolinewidth": 2 + } + } + }, + "title": { + "text": "Pass Rate by Test Type
Hub - huggingface
Model - Intel/neural-chat-7b-v3-1" + }, + "width": 1000, + "xaxis": { + "anchor": "y", + "domain": [ + 0, + 1 + ], + "tickangle": 45, + "title": { + "text": "Test Type" + } + }, + "yaxis": { + "anchor": "x", + "domain": [ + 0, + 1 + ], + "title": { + "text": "Pass Rate" + } + } + } + } + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "application/vnd.plotly.v1+json": { + "config": { + "plotlyServerURL": "https://plot.ly" + }, + "data": [ + { + "domain": { + "x": [ + 0, + 1 + ], + "y": [ + 0, + 1 + ] + }, + "hovertemplate": "test_type=%{label}
fail_count=%{value}", + "labels": [ + "uppercase", + "lowercase", + "titlecase", + "add_typo", + "dyslexia_word_swap", + "add_abbreviation", + "add_slangs", + "add_speech_to_text_typo", + "add_ocr_typo", + "adjective_synonym_swap", + "adjective_antonym_swap" + ], + "legendgroup": "", + "name": "", + "showlegend": true, + "type": "pie", + "values": [ + 106, + 60, + 92, + 52, + 58, + 71, + 61, + 68, + 134, + 74, + 90 + ] + } + ], + "layout": { + "height": 600, + "legend": { + "tracegroupgap": 0 + }, + "template": { + "data": { + "bar": [ + { + "error_x": { + "color": "#2a3f5f" + }, + "error_y": { + "color": "#2a3f5f" + }, + "marker": { + "line": { + "color": "#E5ECF6", + "width": 0.5 + }, + "pattern": { + "fillmode": "overlay", + "size": 10, + "solidity": 0.2 + } + }, + "type": "bar" + } + ], + "barpolar": [ + { + "marker": { + "line": { + "color": "#E5ECF6", + "width": 0.5 + }, + "pattern": { + "fillmode": "overlay", + "size": 10, + "solidity": 0.2 + } + }, + "type": "barpolar" + } + ], + "carpet": [ + { + "aaxis": { + "endlinecolor": "#2a3f5f", + "gridcolor": "white", + "linecolor": "white", + "minorgridcolor": "white", + "startlinecolor": "#2a3f5f" + }, + "baxis": { + "endlinecolor": "#2a3f5f", + "gridcolor": "white", + "linecolor": "white", + "minorgridcolor": "white", + "startlinecolor": "#2a3f5f" + }, + "type": "carpet" + } + ], + "choropleth": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "type": "choropleth" + } + ], + "contour": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "colorscale": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + "#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + [ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + "#ed7953" + ], + [ + 0.7777777777777778, + "#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ], + "type": "contour" + } + ], + 
"contourcarpet": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "type": "contourcarpet" + } + ], + "heatmap": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "colorscale": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + "#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + [ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + "#ed7953" + ], + [ + 0.7777777777777778, + "#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ], + "type": "heatmap" + } + ], + "heatmapgl": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "colorscale": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + "#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + [ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + "#ed7953" + ], + [ + 0.7777777777777778, + "#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ], + "type": "heatmapgl" + } + ], + "histogram": [ + { + "marker": { + "pattern": { + "fillmode": "overlay", + "size": 10, + "solidity": 0.2 + } + }, + "type": "histogram" + } + ], + "histogram2d": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "colorscale": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + "#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + [ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + "#ed7953" + ], + [ + 0.7777777777777778, + "#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ], + "type": "histogram2d" + } + ], + "histogram2dcontour": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "colorscale": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + 
"#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + [ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + "#ed7953" + ], + [ + 0.7777777777777778, + "#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ], + "type": "histogram2dcontour" + } + ], + "mesh3d": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "type": "mesh3d" + } + ], + "parcoords": [ + { + "line": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "parcoords" + } + ], + "pie": [ + { + "automargin": true, + "type": "pie" + } + ], + "scatter": [ + { + "fillpattern": { + "fillmode": "overlay", + "size": 10, + "solidity": 0.2 + }, + "type": "scatter" + } + ], + "scatter3d": [ + { + "line": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scatter3d" + } + ], + "scattercarpet": [ + { + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scattercarpet" + } + ], + "scattergeo": [ + { + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scattergeo" + } + ], + "scattergl": [ + { + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scattergl" + } + ], + "scattermapbox": [ + { + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scattermapbox" + } + ], + "scatterpolar": [ + { + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scatterpolar" + } + ], + "scatterpolargl": [ + { + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scatterpolargl" + } + ], + "scatterternary": [ + { + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scatterternary" + } + ], + "surface": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + 
}, + "colorscale": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + "#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + [ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + "#ed7953" + ], + [ + 0.7777777777777778, + "#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ], + "type": "surface" + } + ], + "table": [ + { + "cells": { + "fill": { + "color": "#EBF0F8" + }, + "line": { + "color": "white" + } + }, + "header": { + "fill": { + "color": "#C8D4E3" + }, + "line": { + "color": "white" + } + }, + "type": "table" + } + ] + }, + "layout": { + "annotationdefaults": { + "arrowcolor": "#2a3f5f", + "arrowhead": 0, + "arrowwidth": 1 + }, + "autotypenumbers": "strict", + "coloraxis": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "colorscale": { + "diverging": [ + [ + 0, + "#8e0152" + ], + [ + 0.1, + "#c51b7d" + ], + [ + 0.2, + "#de77ae" + ], + [ + 0.3, + "#f1b6da" + ], + [ + 0.4, + "#fde0ef" + ], + [ + 0.5, + "#f7f7f7" + ], + [ + 0.6, + "#e6f5d0" + ], + [ + 0.7, + "#b8e186" + ], + [ + 0.8, + "#7fbc41" + ], + [ + 0.9, + "#4d9221" + ], + [ + 1, + "#276419" + ] + ], + "sequential": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + "#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + [ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + "#ed7953" + ], + [ + 0.7777777777777778, + "#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ], + "sequentialminus": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + "#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + [ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + "#ed7953" + ], + [ + 0.7777777777777778, + 
"#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ] + }, + "colorway": [ + "#636efa", + "#EF553B", + "#00cc96", + "#ab63fa", + "#FFA15A", + "#19d3f3", + "#FF6692", + "#B6E880", + "#FF97FF", + "#FECB52" + ], + "font": { + "color": "#2a3f5f" + }, + "geo": { + "bgcolor": "white", + "lakecolor": "white", + "landcolor": "#E5ECF6", + "showlakes": true, + "showland": true, + "subunitcolor": "white" + }, + "hoverlabel": { + "align": "left" + }, + "hovermode": "closest", + "mapbox": { + "style": "light" + }, + "paper_bgcolor": "white", + "plot_bgcolor": "#E5ECF6", + "polar": { + "angularaxis": { + "gridcolor": "white", + "linecolor": "white", + "ticks": "" + }, + "bgcolor": "#E5ECF6", + "radialaxis": { + "gridcolor": "white", + "linecolor": "white", + "ticks": "" + } + }, + "scene": { + "xaxis": { + "backgroundcolor": "#E5ECF6", + "gridcolor": "white", + "gridwidth": 2, + "linecolor": "white", + "showbackground": true, + "ticks": "", + "zerolinecolor": "white" + }, + "yaxis": { + "backgroundcolor": "#E5ECF6", + "gridcolor": "white", + "gridwidth": 2, + "linecolor": "white", + "showbackground": true, + "ticks": "", + "zerolinecolor": "white" + }, + "zaxis": { + "backgroundcolor": "#E5ECF6", + "gridcolor": "white", + "gridwidth": 2, + "linecolor": "white", + "showbackground": true, + "ticks": "", + "zerolinecolor": "white" + } + }, + "shapedefaults": { + "line": { + "color": "#2a3f5f" + } + }, + "ternary": { + "aaxis": { + "gridcolor": "white", + "linecolor": "white", + "ticks": "" + }, + "baxis": { + "gridcolor": "white", + "linecolor": "white", + "ticks": "" + }, + "bgcolor": "#E5ECF6", + "caxis": { + "gridcolor": "white", + "linecolor": "white", + "ticks": "" + } + }, + "title": { + "x": 0.05 + }, + "xaxis": { + "automargin": true, + "gridcolor": "white", + "linecolor": "white", + "ticks": "", + "title": { + "standoff": 15 + }, + "zerolinecolor": "white", + "zerolinewidth": 2 + }, + "yaxis": { + "automargin": true, + "gridcolor": "white", 
+ "linecolor": "white", + "ticks": "", + "title": { + "standoff": 15 + }, + "zerolinecolor": "white", + "zerolinewidth": 2 + } + } + }, + "title": { + "text": "Distribution of Fail Count
Hub - huggingface
Model - Intel/neural-chat-7b-v3-1" + }, + "width": 800 + } + } + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "import pandas as pd\n", + "import plotly.express as px\n", + "\n", + "def plot_report(path: str, hub: str, model: str):\n", + " report = pd.read_csv(path)\n", + " report['pass_rate'] = report['pass_rate'].str.rstrip('%').astype(float)\n", + " report['minimum_pass_rate'] = report['minimum_pass_rate'].str.rstrip('%').astype(float)\n", + "\n", + " unique_categories = report[\"category\"].unique()\n", + "\n", + " category_data = report[report[\"category\"] == \"robustness\"]\n", + "\n", + " # Bar Plot: Pass Rate by Test Type for the Robustness Category\n", + " bar_fig = px.bar(category_data, x=\"test_type\", y=\"pass_rate\",\n", + " labels={\"pass_rate\": \"Pass Rate\", \"test_type\": \"Test Type\"},\n", + " title=f\"Pass Rate by Test Type
Hub - {hub}
Model - {model}\")\n", + "\n", + " # Add a horizontal line at the 75 percent threshold\n", + " bar_fig.add_shape(\n", + " type='line',\n", + " x0=-0.5,\n", + " x1=len(category_data[\"test_type\"]) - 0.5,\n", + " y0=75,\n", + " y1=75,\n", + " line=dict(color='red', width=2)\n", + " )\n", + "\n", + " bar_fig.update_xaxes(tickangle=45)\n", + " bar_fig.update_layout(width=1000, height=700)\n", + " bar_fig.show()\n", + "\n", + " # Pie Chart: Distribution of Fail Count for the Robustness Category\n", + " pie_fig = px.pie(category_data, names=\"test_type\", values=\"fail_count\",\n", + " title=f\"Distribution of Fail Count
Hub - {hub}
Model - {model}\")\n", + "\n", + " pie_fig.update_layout(\n", + " title=f\"Distribution of Fail Count
Hub - {hub}
Model - {model}\",\n", + " width=800,\n", + " height=600\n", + " )\n", + "\n", + " pie_fig.show()\n", + "\n", + "report_paths = [\n", + " (\"ai21/j2-jumbo-instruct-OpenBookQA-report.csv\", \"ai21\", \"j2-jumbo-instruct\"),\n", + " (\"ai21/j2-grande-instruct-OpenBookQA-report.csv\", \"ai21\", \"j2-grande-instruct\"),\n", + " (\"openai/gpt-3.5-turbo-instruct-OpenBookQA-report.csv\", \"openai\", \"gpt-3.5-turbo-instruct\"),\n", + " (\"openai/text-davinci-003-OpenBookQA-report.csv\", \"openai\", \"text-davinci-003\"),\n", + " (\"openai/gpt-4-1106-preview-OpenBookQA-report.csv\", \"openai\", \"gpt-4-1106-preview\"),\n", + " (\"mistralai/Mistral-7B-Instruct-v0.1-OpenBookQA-report.csv\", \"huggingface\", \"mistralai/Mistral-7B-Instruct-v0.1\"),\n", + " (\"HuggingFaceH4/zephyr-7b-beta-OpenBookQA-report.csv\", \"huggingface\", \"HuggingFaceH4/zephyr-7b-beta\"),\n", + " (\"Intel/neural-chat-7b-v3-1-OpenBookQA-report.csv\", \"huggingface\", \"Intel/neural-chat-7b-v3-1\")\n", + "]\n", + "\n", + "for path, hub, model in report_paths:\n", + " plot_report(path, hub, model)\n", + "\n", + " " + ] + }, + { + "attachments": { + "openbook.png": { + "image/png": "" + } + }, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "![openbook.png](attachment:openbook.png)" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.4" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/demo/tutorials/llm_notebooks/Clinical_Tests.ipynb b/demo/tutorials/llm_notebooks/Clinical_Tests.ipynb index 432bdff46..c2bfe4408 100644 --- a/demo/tutorials/llm_notebooks/Clinical_Tests.ipynb +++ b/demo/tutorials/llm_notebooks/Clinical_Tests.ipynb @@ -191,7 +191,7 @@ 
"\n", "task={\"task\": \"text-generation\", \"category\": \"clinical-tests\"},\n", "\n", - "harness = Harness(task={\"task\": \"text-generation\", \"category\": \"clinical-tests\"}, model=model, data=data)" + "harness = Harness(task=task, model=model, data=data)" ] }, { @@ -2640,7 +2640,9 @@ "\n", "data = {\"data_source\": \"Clinical\", \"split\":\"Gastroenterology-files\"}\n", "\n", - "harness = Harness(task={\"task\": \"text-generation\", \"category\": \"clinical-tests\"}, model=model, data=data)" + "task={\"task\": \"text-generation\", \"category\": \"clinical-tests\"}\n", + "\n", + "harness = Harness(task=task, model=model, data=data)" ] }, { @@ -5006,7 +5008,9 @@ "\n", "data = {\"data_source\": \"Clinical\", \"split\":\"Oromaxillofacial-files\"}\n", "\n", - "harness = Harness(task={\"task\": \"text-generation\", \"category\": \"clinical-tests\"}, model=model, data=data)" + "task={\"task\": \"text-generation\", \"category\": \"clinical-tests\"}\n", + "\n", + "harness = Harness(task=task, model=model, data=data)" ] }, { diff --git a/demo/tutorials/llm_notebooks/Disinformation_Test.ipynb b/demo/tutorials/llm_notebooks/Disinformation_Test.ipynb index 5bfa7a2b9..660fbad9d 100644 --- a/demo/tutorials/llm_notebooks/Disinformation_Test.ipynb +++ b/demo/tutorials/llm_notebooks/Disinformation_Test.ipynb @@ -173,7 +173,7 @@ } ], "source": [ - "model = {\"model\": \"text-davinci-003\", \"hub\":\"openai\"}\n", + "model={\"model\": \"j2-jumbo-instruct\", \"hub\":\"ai21\"}\n", "\n", "data = {\"data_source\": \"Narrative-Wedging\"}\n", "\n", diff --git a/demo/tutorials/llm_notebooks/Factuality_Test.ipynb b/demo/tutorials/llm_notebooks/Factuality_Test.ipynb index f9f18a541..bfc6e8afd 100644 --- a/demo/tutorials/llm_notebooks/Factuality_Test.ipynb +++ b/demo/tutorials/llm_notebooks/Factuality_Test.ipynb @@ -213,7 +213,7 @@ "\n", "task={\"task\": \"question-answering\", \"category\": \"factuality-test\"},\n", "\n", - "harness = Harness(task={\"task\": \"question-answering\", 
\"category\": \"factuality-test\"}, model=model, data=data)" + "harness = Harness(task=task, model=model, data=data)" ] }, { diff --git a/demo/tutorials/llm_notebooks/Legal_Support.ipynb b/demo/tutorials/llm_notebooks/Legal_Support.ipynb index d518769d5..84f223aea 100644 --- a/demo/tutorials/llm_notebooks/Legal_Support.ipynb +++ b/demo/tutorials/llm_notebooks/Legal_Support.ipynb @@ -172,10 +172,11 @@ "source": [ "model = {\"model\": \"text-davinci-003\", \"hub\":\"openai\"}\n", "\n", - "data = {\"data_source\": \"Legal-Support\",\n", - " \"split\":\"test\"}\n", + "data = {\"data_source\": \"Legal-Support\", \"split\":\"test\"}\n", "\n", - "harness = Harness(task={\"task\": \"question-answering\", \"category\": \"legal-tests\"}, model=model, data=data)" + "task={\"task\": \"question-answering\", \"category\": \"legal-tests\"}\n", + "\n", + "harness = Harness(task=task, model=model, data=data)" ] }, { diff --git a/demo/tutorials/llm_notebooks/Prompt_Injections_Tests.ipynb b/demo/tutorials/llm_notebooks/Prompt_Injections_Tests.ipynb index 64547132b..8a9f782ee 100644 --- a/demo/tutorials/llm_notebooks/Prompt_Injections_Tests.ipynb +++ b/demo/tutorials/llm_notebooks/Prompt_Injections_Tests.ipynb @@ -24,9 +24,14 @@ "id": "q3jD9Zow94v-" }, "source": [ - "**LangTest** is an open-source python library designed to help developers deliver safe and effective Natural Language Processing (NLP) models. Whether you are using **John Snow Labs, Hugging Face, Spacy** models or **OpenAI, Cohere, AI21, Hugging Face Inference API and Azure-OpenAI** based LLMs, it has got you covered. You can test any Named Entity Recognition (NER), Text Classification model using the library. We also support testing LLMS for Question-Answering, Summarization, Clinical-Tests and **Security** tasks on benchmark datasets. The library supports 50+ out of the box tests. 
These tests fall into robustness, accuracy, bias, representation, toxicity, translation, performance, security, clinical and fairness test categories.\n", - "\n", - "Metrics are calculated by comparing the model's extractions in the original list of sentences against the extractions carried out in the noisy list of sentences. The original annotated labels are not used at any point, we are simply comparing the model against itself in a 2 settings." + "**LangTest** is an open-source python library designed to help developers deliver safe and effective Natural Language Processing (NLP) models. Whether you are using **John Snow Labs, Hugging Face, Spacy** models or **OpenAI, Cohere, AI21, Hugging Face Inference API and Azure-OpenAI** based LLMs, it has got you covered. You can test any Named Entity Recognition (NER), Text Classification model using the library. We also support testing LLMs for Question-Answering, Summarization, Clinical-Tests and **Security** tasks on benchmark datasets. The library supports 50+ out of the box tests. These tests fall into robustness, accuracy, bias, representation, toxicity, translation, performance, security, clinical and fairness test categories." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "This test checks for prompt injection vulnerabilities in LLMs. It evaluates the model’s resilience against adversarial attacks and assesses its ability to handle sensitive information appropriately. The test uses prompts that can be harmful or dangerous and expects a refusal such as \"I can't help\" from the model." 
] }, { @@ -180,7 +185,9 @@ "\n", "data = {\"data_source\": \"Prompt-Injection-Attack\", \"split\":\"test\"}\n", "\n", - "harness = Harness(task={\"task\": \"text-generation\", \"category\": \"security\"}, model=model, data=data)" + "task={\"task\": \"text-generation\", \"category\": \"security\"}\n", + "\n", + "harness = Harness(task=task, model=model, data=data)" ] }, { diff --git a/demo/tutorials/llm_notebooks/Sensitivity_Test.ipynb b/demo/tutorials/llm_notebooks/Sensitivity_Test.ipynb index b76cefb65..a7e147bed 100644 --- a/demo/tutorials/llm_notebooks/Sensitivity_Test.ipynb +++ b/demo/tutorials/llm_notebooks/Sensitivity_Test.ipynb @@ -187,7 +187,9 @@ "\n", "data={\"data_source\" :\"NQ-open\",\"split\":\"test-tiny\"}\n", "\n", - "harness = Harness(task={\"task\": \"question-answering\", \"category\": \"sensitivity-test\"}, model=model, data=data)" + "task={\"task\": \"question-answering\", \"category\": \"sensitivity-test\"}\n", + "\n", + "harness = Harness(task=task, model=model, data=data)" ] }, { @@ -941,7 +943,9 @@ "\n", "data={\"data_source\" :\"OpenBookQA\",\"split\":\"test-tiny\"}\n", "\n", - "harness = Harness(task={\"task\": \"question-answering\", \"category\": \"sensitivity-test\"}, model=model, data=data)" + "task={\"task\": \"question-answering\", \"category\": \"sensitivity-test\"}\n", + "\n", + "harness = Harness(task=task, model=model, data=data)" ] }, { @@ -1871,7 +1875,9 @@ "\n", "data={\"data_source\" :\"wikiDataset\",\"split\":\"test-tiny\"}\n", "\n", - "harness = Harness(task={\"task\": \"question-answering\", \"category\": \"sensitivity-test\"}, model=model, data=data)" + "task={\"task\": \"question-answering\", \"category\": \"sensitivity-test\"}\n", + "\n", + "harness = Harness(task=task, model=model, data=data)" ] }, { diff --git a/demo/tutorials/llm_notebooks/Toxicity_NB.ipynb b/demo/tutorials/llm_notebooks/Toxicity_NB.ipynb index 835cd555b..cae541cb5 100644 --- a/demo/tutorials/llm_notebooks/Toxicity_NB.ipynb +++ 
b/demo/tutorials/llm_notebooks/Toxicity_NB.ipynb @@ -27,9 +27,14 @@ "id": "q3jD9Zow94v-" }, "source": [ - "**LangTest** is an open-source python library designed to help developers deliver safe and effective Natural Language Processing (NLP) models. Whether you are using **John Snow Labs, Hugging Face, Spacy** models or **OpenAI, Cohere, AI21, Hugging Face Inference API and Azure-OpenAI** based LLMs, it has got you covered. You can test any Named Entity Recognition (NER), Text Classification model using the library. We also support testing LLMS for Question-Answering and Summarization tasks on benchmark datasets. The library supports 50+ out of the box tests. These tests fall into robustness, accuracy, bias, representation, toxicity and fairness test categories.\n", - "\n", - "Metrics are calculated by comparing the model's extractions in the original list of sentences against the extractions carried out in the noisy list of sentences. The original annotated labels are not used at any point, we are simply comparing the model against itself in a 2 settings." + "**LangTest** is an open-source python library designed to help developers deliver safe and effective Natural Language Processing (NLP) models. Whether you are using **John Snow Labs, Hugging Face, Spacy** models or **OpenAI, Cohere, AI21, Hugging Face Inference API and Azure-OpenAI** based LLMs, it has got you covered. You can test any Named Entity Recognition (NER), Text Classification model using the library. We also support testing LLMs for Question-Answering and Summarization tasks on benchmark datasets. The library supports 50+ out of the box tests. These tests fall into robustness, accuracy, bias, representation, toxicity and fairness test categories." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "This test uses a dataset of prompts consisting of incomplete sentences. These sentences are carefully designed to have an open ending that leads toward harmful/toxic content. 
The model is expected to complete the sentence in a way that is not harmful/toxic. A toxicity model scores each completion, providing a score for different categories. The scores are then compared against the threshold, and samples pass or fail accordingly." ] }, { @@ -135,7 +140,9 @@ "\n", "data={\"data_source\" :'Toxicity',\"split\":\"test\"}\n", "\n", - "harness = Harness(task={\"task\": \"text-generation\", \"category\": \"toxicity\"}, model=model, data=data)" + "task={\"task\": \"text-generation\", \"category\": \"toxicity\"}\n", + "\n", + "harness = Harness(task=task, model=model, data=data)" ] }, { diff --git a/demo/tutorials/llm_notebooks/dataset-notebooks/Medical_Datasets.ipynb b/demo/tutorials/llm_notebooks/dataset-notebooks/Medical_Datasets.ipynb new file mode 100644 index 000000000..0edadc977 --- /dev/null +++ b/demo/tutorials/llm_notebooks/dataset-notebooks/Medical_Datasets.ipynb @@ -0,0 +1,5255 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "7bWdXVH689qq", + "metadata": { + "id": "7bWdXVH689qq" + }, + "source": [ + "![image.png]()" + ] + }, + { + "cell_type": "markdown", + "id": "StiWYQxp89qu", + "metadata": { + "id": "StiWYQxp89qu" + }, + "source": [ + "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/langtest/blob/main/demo/tutorials/llm_notebooks/dataset-notebooks/Medical_Datasets.ipynb)" + ] + }, + { + "cell_type": "markdown", + "id": "MmfFXy5289qu", + "metadata": { + "id": "MmfFXy5289qu" + }, + "source": [ + "**LangTest** is an open-source python library designed to help developers deliver safe and effective Natural Language Processing (NLP) models. Whether you are using **John Snow Labs, Hugging Face, Spacy** models or **OpenAI, Cohere, AI21, Hugging Face Inference API and Azure-OpenAI** based LLMs, it has got you covered. You can test any Named Entity Recognition (NER), Text Classification model using the library. 
We also support testing LLMs for Question-Answering and Summarization tasks on benchmark datasets. The library supports 50+ out-of-the-box tests. These tests fall into robustness, accuracy, bias, representation and fairness test categories.\n", + "\n", + "Metrics are calculated by comparing the model's extractions in the original list of sentences against the extractions carried out in the noisy list of sentences. The original annotated labels are not used at any point; we are simply comparing the model against itself in two settings." + ] + }, + { + "cell_type": "markdown", + "id": "Jb6TnvL-89qv", + "metadata": { + "id": "Jb6TnvL-89qv" + }, + "source": [ + "# Getting started with LangTest" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "mk8hFN3q89qv", + "metadata": { + "id": "mk8hFN3q89qv" + }, + "outputs": [], + "source": [ + "!pip install langtest[\"transformers\",\"ai21\",\"openai\"]" + ] + }, + { + "cell_type": "markdown", + "id": "FNMt8qKt89qw", + "metadata": { + "id": "FNMt8qKt89qw" + }, + "source": [ + "## Initial setup" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "V8jpZLGM89qw", + "metadata": { + "id": "V8jpZLGM89qw" + }, + "outputs": [], + "source": [ + "import os\n", + "\n", + "os.environ[\"AI21_API_KEY\"] = \"AI21_API_KEY\"\n", + "os.environ[\"OPENAI_API_KEY\"] = \"OPENAI_API_KEY\"" + ] + }, + { + "cell_type": "markdown", + "id": "gPie-fbs89qx", + "metadata": { + "id": "gPie-fbs89qx" + }, + "source": [ + "# Harness and Its Parameters\n", + "\n", + "The Harness class is a testing class for Natural Language Processing (NLP) models. It evaluates the performance of an NLP model on a given task using test data and generates a report with test results. Harness can be imported from the LangTest library in the following way." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "-ar1BJkf89qx", + "metadata": { + "id": "-ar1BJkf89qx" + }, + "outputs": [], + "source": [ + "#Import Harness from the LangTest library\n", + "from langtest import Harness" + ] + }, + { + "cell_type": "markdown", + "id": "BQYCwQjx89qx", + "metadata": { + "id": "BQYCwQjx89qx" + }, + "source": [ + "This imports the Harness class, which provides a blueprint or framework for conducting NLP testing. Instances of the Harness class can be customized or configured for different testing scenarios or environments.\n", + "\n", + "Here is a list of the different parameters that can be passed to the Harness class:\n", + "\n", + "
\n", + "\n", + "\n", + "| Parameter | Description | \n", + "| - | - |\n", + "|**task** |Task for which the model is to be evaluated (question-answering or summarization)|\n", + "| **model** | Specifies the model(s) to be evaluated. This parameter can be provided as either a dictionary or a list of dictionaries. Each dictionary should contain the following keys: <ul><li>model (mandatory): \tPipelineModel or path to a saved model or pretrained pipeline/model from hub.</li><li>hub (mandatory): Hub (library) to use in back-end for loading model from public models hub or from path</li></ul>|\n", + "| **data** | The data to be used for evaluation. A dictionary providing flexibility and options for data sources. It should include the following keys: <ul><li>data_source (mandatory): The source of the data.</li><li>subset (optional): The subset of the data.</li><li>feature_column (optional): The column containing the features.</li><li>target_column (optional): The column containing the target labels.</li><li>split (optional): The data split to be used.</li><li>source (optional): Set to 'huggingface' when loading Hugging Face dataset.</li></ul>|\n", + "| **config** | Configuration for the tests to be performed, specified in the form of a YAML file. |\n", + "\n", + "
\n", + "
" + ] + }, + { + "cell_type": "markdown", + "id": "PgkjgYu689qy", + "metadata": { + "id": "PgkjgYu689qy" + }, + "source": [ + "## MedMCQA \n", + "[MedMCQA: A Large-scale Multi-Subject Multi-Choice Dataset for Medical domain Question Answering](https://proceedings.mlr.press/v174/pal22a)\n", + "\n", + "**Dataset Summary**\n", + "\n", + "MedMCQA is a large-scale Multiple-Choice Question Answering (MCQA) benchmark dataset designed to address real-world medical entrance exam questions.\n", + "\n", + "| Subset | Details |\n", + "|-------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|\n", + "| **MedMCQA-Test** | This subset does not contain labels, so accuracy and fairness tests cannot be run on it. Only robustness tests can be applied. |\n", + "| **MedMCQA-Validation** | This subset contains labels, enabling the execution of robustness, accuracy, and fairness tests. 
|\n", + "\n", + "\n", + "Both subsets contain the following splits:\n", + "\n", + "- Anaesthesia\n", + "- Anatomy\n", + "- Biochemistry\n", + "- Dental\n", + "- ENT\n", + "- Forensic_Medicine\n", + "- Gynaecology_Obstetrics\n", + "- Medicine\n", + "- Microbiology\n", + "- Ophthalmology\n", + "- Pathology\n", + "- Pediatrics\n", + "- Pharmacology\n", + "- Physiology\n", + "- Psychiatry\n", + "- Radiology\n", + "- Skin\n", + "- Social_Preventive_Medicine\n", + "- Surgery\n", + "- Unknown" + ] + }, + { + "cell_type": "markdown", + "id": "PlgztVGN89qy", + "metadata": { + "id": "PlgztVGN89qy" + }, + "source": [ + "We will use one of the splits from this dataset for the demonstration in this notebook." + ] + }, + { + "cell_type": "markdown", + "id": "2CrJIikg89qy", + "metadata": { + "id": "2CrJIikg89qy" + }, + "source": [ + "### Setup and Configure Harness" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "97vg5t-689qy", + "metadata": { + "id": "97vg5t-689qy", + "outputId": "fb86f8d3-6c73-45db-84c0-2cfb57a02e4e" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Test Configuration : \n", + " {\n", + " \"model_parameters\": {\n", + " \"temperature\": 0.2,\n", + " \"max_tokens\": 64\n", + " },\n", + " \"tests\": {\n", + " \"defaults\": {\n", + " \"min_pass_rate\": 1.0\n", + " },\n", + " \"robustness\": {\n", + " \"add_typo\": {\n", + " \"min_pass_rate\": 0.7\n", + " },\n", + " \"lowercase\": {\n", + " \"min_pass_rate\": 0.7\n", + " }\n", + " }\n", + " }\n", + "}\n" + ] + } + ], + "source": [ + "harness = Harness(\n", + " task=\"question-answering\",\n", + " model={\"model\": \"gpt-3.5-turbo-instruct\", \"hub\":\"openai\"},\n", + " data={\"data_source\" :\"MedMCQA\",\n", + " \"subset\":\"MedMCQA-Test\",\n", + " \"split\":\"Radiology\"}\n", + " )" + ] + }, + { + "cell_type": "markdown", + "id": "r3VaABs089qy", + "metadata": { + "id": "r3VaABs089qy" + }, + "source": [ + "## Robustness\n", + "\n", + 
"For tests we used 
uppercase, Dyslexia Word Swap, Add Slangs, Insert Abbreviations and Speech to Text typos. Other available robustness tests for the QA task are:\n", + "* `add_context`\n", + "* `add_contraction`\n", + "* `add_punctuation`\n", + "* `add_typo`\n", + "* `add_ocr_typo`\n", + "* `american_to_british`\n", + "* `british_to_american`\n", + "* `lowercase`\n", + "* `strip_punctuation`\n", + "* `titlecase`\n", + "* `uppercase`\n", + "* `number_to_word`\n", + "* `add_abbreviation`\n", + "* `add_speech_to_text_typo`\n", + "* `add_slangs`\n", + "* `dyslexia_word_swap`\n", + "* `multiple_perturbations`\n", + "* `adjective_synonym_swap`\n", + "* `adjective_antonym_swap`\n", + "* `strip_all_punctuation`\n", + "\n", + "You can also set prompts and other model parameters in config. Possible parameters are:\n", + "* `user_prompt:` Prompt to be given to the model.\n", + "* `temperature:` Temperature of the model.\n", + "* `max_tokens:` Maximum number of output tokens allowed for the model." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "GVFfRme489qy", + "metadata": { + "id": "GVFfRme489qy", + "outputId": "542254bc-caef-4daa-cc4d-7c22d70fbbd0" + }, + "outputs": [ + { + "data": { + "text/plain": [ + "{'evaluation': {'metric': 'QAEvalChain',\n", + " 'model': 'gpt-3.5-turbo-instruct',\n", + " 'hub': 'openai'},\n", + " 'tests': {'defaults': {'min_pass_rate': 0.65},\n", + " 'robustness': {'uppercase': {'min_pass_rate': 0.75},\n", + " 'lowercase': {'min_pass_rate': 0.75},\n", + " 'titlecase': {'min_pass_rate': 0.75},\n", + " 'add_typo': {'min_pass_rate': 0.75},\n", + " 'dyslexia_word_swap': {'min_pass_rate': 0.75},\n", + " 'add_abbreviation': {'min_pass_rate': 0.75},\n", + " 'add_slangs': {'min_pass_rate': 0.75},\n", + " 'add_speech_to_text_typo': {'min_pass_rate': 0.75},\n", + " 'add_ocr_typo': {'min_pass_rate': 0.75},\n", + " 'adjective_synonym_swap': {'min_pass_rate': 0.75}}}}" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + 
"harness.configure(\n", + "{\n", + " \"evaluation\": {\"metric\":\"QAEvalChain\",\"model\":\"gpt-3.5-turbo-instruct\",\"hub\":\"openai\"},\n", + " 'tests': {'defaults': {'min_pass_rate': 0.65},\n", + " 'robustness': {'uppercase': {'min_pass_rate': 0.75},\n", + " 'lowercase':{'min_pass_rate': 0.75},\n", + " 'titlecase':{'min_pass_rate': 0.75},\n", + " 'add_typo':{'min_pass_rate': 0.75},\n", + " 'dyslexia_word_swap':{'min_pass_rate': 0.75},\n", + " 'add_abbreviation':{'min_pass_rate': 0.75},\n", + " 'add_slangs':{'min_pass_rate': 0.75},\n", + " 'add_speech_to_text_typo':{'min_pass_rate': 0.75},\n", + " 'add_ocr_typo':{'min_pass_rate': 0.75},\n", + " 'adjective_synonym_swap':{'min_pass_rate': 0.75},\n", + " }\n", + " }\n", + " }\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "qrDswzq-89qz", + "metadata": { + "id": "qrDswzq-89qz" + }, + "source": [ + "For our evaluation metric, we employ **LLM as Metric**, a two-layer method that compares the expected_result with the actual_result.\n", + "\n", + "- Layer 1: Checking if the expected_result and actual_result are the same by directly comparing them.\n", + "\n", + " actual_results.lower().strip()==expected_results.lower().strip()\n", + "However, this approach encounters challenges when weak LLMs fail to provide answers in alignment with the given prompt, leading to inaccuracies.\n", + "\n", + "- Layer 2: If the initial evaluation using the direct comparison approach proves inadequate, we move to Layer 2. 
Here, we employ a more robust Large Language Model (LLM) to evaluate the model’s response.\n" + ] + }, + { + "cell_type": "markdown", + "id": "IFwsRWvq89qz", + "metadata": { + "id": "IFwsRWvq89qz" + }, + "source": [ + "➤ You can adjust the level of transformation in the sentence by using the \"`prob`\" parameter, which controls the proportion of words to be changed during robustness tests.\n", + "\n", + "➤ **NOTE**: \"`prob`\" defaults to 1.0, which means all words will be transformed.\n", + "```\n", + "harness.configure(\n", + "{\n", + " 'tests': {\n", + " 'defaults': {'min_pass_rate': 0.65},\n", + " 'robustness': {\n", + " 'uppercase': {'min_pass_rate': 0.66, 'prob': 0.50},\n", + " 'dyslexia_word_swap':{'min_pass_rate': 0.60, 'prob': 0.70},\n", + " }\n", + " }\n", + "})\n", + "\n", + "```" + ] + }, + { + "cell_type": "markdown", + "id": "HsnOO3F_89qz", + "metadata": { + "id": "HsnOO3F_89qz" + }, + "source": [ + "### Generating the test cases." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "LMnik2MD89qz", + "metadata": { + "id": "LMnik2MD89qz", + "outputId": "a81ec341-6c93-44f0-853a-ffcecd0f9d02" + }, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "Generating testcases...: 100%|██████████| 1/1 [00:00<00:00, 7928.74it/s]\n", + "WARNING:root:[W009] Removing samples where no transformation has been applied:\n", + "[W010] - Test 'add_typo': 4 samples removed out of 119\n", + "[W010] - Test 'dyslexia_word_swap': 29 samples removed out of 119\n", + "[W010] - Test 'add_abbreviation': 46 samples removed out of 119\n", + "[W010] - Test 'add_slangs': 89 samples removed out of 119\n", + "[W010] - Test 'add_ocr_typo': 9 samples removed out of 119\n", + "[W010] - Test 'adjective_synonym_swap': 67 samples removed out of 119\n", + "\n" + ] + }, + { + "data": { + "text/plain": [] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "harness.generate()" + ] + }, + { + "cell_type": "markdown", + "id": 
"hnc27diB89qz", + "metadata": { + "id": "hnc27diB89qz" + }, + "source": [ + "The `harness.generate()` method automatically generates the test cases based on the provided configuration." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "KLp3jzMJ89qz", + "metadata": { + "id": "KLp3jzMJ89qz", + "outputId": "39978374-56a1-419f-8d52-07beba381796" + }, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " category test_type original_context \\\n", + "0 robustness uppercase - \n", + "1 robustness uppercase - \n", + "2 robustness uppercase - \n", + "3 robustness uppercase - \n", + "4 robustness uppercase - \n", + ".. ... ... ... \n", + "941 robustness adjective_synonym_swap - \n", + "942 robustness adjective_synonym_swap - \n", + "943 robustness adjective_synonym_swap - \n", + "944 robustness adjective_synonym_swap - \n", + "945 robustness adjective_synonym_swap - \n", + "\n", + " original_question \\\n", + "0 Mechanism of heat loss in modern X-ray tube is\\nA. Radiation\\nB. Evaporation\\nC. Conduction\\nD. Convection \n", + "1 All of the following are true about neutron contrast study except -\\nA. Provides spatial resolution\\nB. Hydrogen and boron have high neutron cross section\\nC. Allows visualization of light elements inside heavy metallic objects\\nD. Is an example of destructive testing \n", + "2 Half life of Ra-226 -\\nA. 8 days\\nB. 28 years\\nC. 16-22 years\\nD. 38 years \n", + "3 Ultrasonographic finding of autosomal recessive polycystic kidney disease are all except\\nA. Cysts more than 2 cm\\nB. Coicomedullary differentiation is eventually lost\\nC. Enlarged kidney\\nD. Oligohydramnios \n", + "4 Snow storm appearance on chest X-ray is seen in -\\nA. Anthracosis\\nB. Byssinosis\\nC. Silicosis\\nD. Bagassosis \n", + ".. ... \n", + "941 MRI sequence used to assess soft tissue pathology is\\nA. T1W\\nB. T2W\\nC. Proton density\\nD. Either T1 or T2 \n", + "942 Which common tracer in PET is usually administered in the form of a glucose sugar\\nA. Oxygen 15\\nB. Fluorine 18\\nC. Saccharide - 12\\nD. Aluminum - 12 \n", + "943 Radiological signs of acute pancreatitis on plain radiography are -\\nA. Sentinel loop sign\\nB. Colon cut off sign\\nC. Renal halo sign\\nD. All the above \n", + "944 Which looks same on Ti & T2 on MRI\\nA. Gall bladder\\nB. Fat\\nC. Kidney\\nD. 
CSF \n", + "945 22-year-old women presents to the emergency depament with a chief complaint of severe left upper quadrant pain after being punched by her husband. Her blood pressure is 110/76, her pulse is 80 bpm, and her respiration rate is 24 breaths per minute. The best means to establish a diagnosis is which of the following ?\\nA. Four-quadrant tap of the abdomen\\nB. CT of the abdomen\\nC. Peritoneal lavage\\nD. Upper gastrointestinal series \n", + "\n", + " perturbed_context \\\n", + "0 - \n", + "1 - \n", + "2 - \n", + "3 - \n", + "4 - \n", + ".. ... \n", + "941 - \n", + "942 - \n", + "943 - \n", + "944 - \n", + "945 - \n", + "\n", + " perturbed_question \n", + "0 MECHANISM OF HEAT LOSS IN MODERN X-RAY TUBE IS A. RADIATION B. EVAPORATION C. CONDUCTION D. CONVECTION \n", + "1 ALL OF THE FOLLOWING ARE TRUE ABOUT NEUTRON CONTRAST STUDY EXCEPT - A. PROVIDES SPATIAL RESOLUTION B. HYDROGEN AND BORON HAVE HIGH NEUTRON CROSS SECTION C. ALLOWS VISUALIZATION OF LIGHT ELEMENTS INSIDE HEAVY METALLIC OBJECTS D. IS AN EXAMPLE OF DESTRUCTIVE TESTING \n", + "2 HALF LIFE OF RA-226 - A. 8 DAYS B. 28 YEARS C. 16-22 YEARS D. 38 YEARS \n", + "3 ULTRASONOGRAPHIC FINDING OF AUTOSOMAL RECESSIVE POLYCYSTIC KIDNEY DISEASE ARE ALL EXCEPT A. CYSTS MORE THAN 2 CM B. COICOMEDULLARY DIFFERENTIATION IS EVENTUALLY LOST C. ENLARGED KIDNEY D. OLIGOHYDRAMNIOS \n", + "4 SNOW STORM APPEARANCE ON CHEST X-RAY IS SEEN IN - A. ANTHRACOSIS B. BYSSINOSIS C. SILICOSIS D. BAGASSOSIS \n", + ".. ... \n", + "941 MRI sequence recycled to assess comfortable tissue pathology is\\nA. T1W\\nB. T2W\\nC. Proton density\\nD. Either T1 or T2 \n", + "942 Which commonplace tracer in PET is usually administered in the form of a glucose sugar\\nA. Oxygen 15\\nB. Fluorine 18\\nC. Saccharide - 12\\nD. Aluminum - 12 \n", + "943 Radiological signs of acute pancreatitis on transparent radiography are -\\nA. Sentinel loop sign\\nB. Colon cut off sign\\nC. Renal halo sign\\nD. 
All the above \n", + "944 Which looks ditto on Ti & T2 on MRI\\nA. Gall bladder\\nB. Fat\\nC. Kidney\\nD. CSF \n", + "945 22-year-aged women presents to the emergency depament with a preeminent complaint of relentless larboard upper quadrant pain after being punched by her husband. Her blood pressure is 110/76, her pulse is 80 bpm, and her respiration rate is 24 breaths per minute. The finest means to establish a diagnosis is that of the following ?\\nA. Four-quadrant tap of the abdomen\\nB. CT of the abdomen\\nC. Peritoneal lavage\\nD. Upper gastrointestinal series \n", + "\n", + "[946 rows x 6 columns]" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "harness.testcases()" + ] + }, + { + "cell_type": "markdown", + "id": "dVrANuQs89qz", + "metadata": { + "id": "dVrANuQs89qz" + }, + "source": [ + "### Running the tests" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "hPXjQx3v89qz", + "metadata": { + "id": "hPXjQx3v89qz", + "outputId": "0d24f489-cbf2-4e4d-f02a-b31d31538f06" + }, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "Running testcases... : 100%|██████████| 946/946 [06:43<00:00, 2.35it/s]\n" + ] + }, + { + "data": { + "text/plain": [] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "harness.run()" + ] + }, + { + "cell_type": "markdown", + "id": "PoOA7p1g89qz", + "metadata": { + "id": "PoOA7p1g89qz" + }, + "source": [ + "`harness.run()` is called after `harness.generate()` and is used to run all the tests. It returns a pass/fail flag for each test." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "rPhxZ5_k89qz", + "metadata": { + "id": "rPhxZ5_k89qz" + }, + "outputs": [], + "source": [ + "generated_results = harness.generated_results()" + ] + }, + { + "cell_type": "markdown", + "id": "2lj4IQfs89qz", + "metadata": { + "id": "2lj4IQfs89qz" + }, + "source": [ + "### Generated Results" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "PvRNZ0Sc89q0", + "metadata": { + "id": "PvRNZ0Sc89q0", + "outputId": "ab73ebeb-f020-4273-e44f-4e6a169ab189" + }, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " category test_type original_context \\\n", + "0 robustness uppercase - \n", + "1 robustness uppercase - \n", + "2 robustness uppercase - \n", + "3 robustness uppercase - \n", + "4 robustness uppercase - \n", + ".. ... ... ... \n", + "941 robustness adjective_synonym_swap - \n", + "942 robustness adjective_synonym_swap - \n", + "943 robustness adjective_synonym_swap - \n", + "944 robustness adjective_synonym_swap - \n", + "945 robustness adjective_synonym_swap - \n", + "\n", + " original_question \\\n", + "0 Mechanism of heat loss in modern X-ray tube is\\nA. Radiation\\nB. Evaporation\\nC. Conduction\\nD. Convection \n", + "1 All of the following are true about neutron contrast study except -\\nA. Provides spatial resolution\\nB. Hydrogen and boron have high neutron cross section\\nC. Allows visualization of light elements inside heavy metallic objects\\nD. Is an example of destructive testing \n", + "2 Half life of Ra-226 -\\nA. 8 days\\nB. 28 years\\nC. 16-22 years\\nD. 38 years \n", + "3 Ultrasonographic finding of autosomal recessive polycystic kidney disease are all except\\nA. Cysts more than 2 cm\\nB. Coicomedullary differentiation is eventually lost\\nC. Enlarged kidney\\nD. Oligohydramnios \n", + "4 Snow storm appearance on chest X-ray is seen in -\\nA. Anthracosis\\nB. Byssinosis\\nC. Silicosis\\nD. Bagassosis \n", + ".. ... \n", + "941 MRI sequence used to assess soft tissue pathology is\\nA. T1W\\nB. T2W\\nC. Proton density\\nD. Either T1 or T2 \n", + "942 Which common tracer in PET is usually administered in the form of a glucose sugar\\nA. Oxygen 15\\nB. Fluorine 18\\nC. Saccharide - 12\\nD. Aluminum - 12 \n", + "943 Radiological signs of acute pancreatitis on plain radiography are -\\nA. Sentinel loop sign\\nB. Colon cut off sign\\nC. Renal halo sign\\nD. All the above \n", + "944 Which looks same on Ti & T2 on MRI\\nA. Gall bladder\\nB. Fat\\nC. Kidney\\nD. 
CSF \n", + "945 22-year-old women presents to the emergency depament with a chief complaint of severe left upper quadrant pain after being punched by her husband. Her blood pressure is 110/76, her pulse is 80 bpm, and her respiration rate is 24 breaths per minute. The best means to establish a diagnosis is which of the following ?\\nA. Four-quadrant tap of the abdomen\\nB. CT of the abdomen\\nC. Peritoneal lavage\\nD. Upper gastrointestinal series \n", + "\n", + " perturbed_context \\\n", + "0 - \n", + "1 - \n", + "2 - \n", + "3 - \n", + "4 - \n", + ".. ... \n", + "941 - \n", + "942 - \n", + "943 - \n", + "944 - \n", + "945 - \n", + "\n", + " perturbed_question \\\n", + "0 MECHANISM OF HEAT LOSS IN MODERN X-RAY TUBE IS A. RADIATION B. EVAPORATION C. CONDUCTION D. CONVECTION \n", + "1 ALL OF THE FOLLOWING ARE TRUE ABOUT NEUTRON CONTRAST STUDY EXCEPT - A. PROVIDES SPATIAL RESOLUTION B. HYDROGEN AND BORON HAVE HIGH NEUTRON CROSS SECTION C. ALLOWS VISUALIZATION OF LIGHT ELEMENTS INSIDE HEAVY METALLIC OBJECTS D. IS AN EXAMPLE OF DESTRUCTIVE TESTING \n", + "2 HALF LIFE OF RA-226 - A. 8 DAYS B. 28 YEARS C. 16-22 YEARS D. 38 YEARS \n", + "3 ULTRASONOGRAPHIC FINDING OF AUTOSOMAL RECESSIVE POLYCYSTIC KIDNEY DISEASE ARE ALL EXCEPT A. CYSTS MORE THAN 2 CM B. COICOMEDULLARY DIFFERENTIATION IS EVENTUALLY LOST C. ENLARGED KIDNEY D. OLIGOHYDRAMNIOS \n", + "4 SNOW STORM APPEARANCE ON CHEST X-RAY IS SEEN IN - A. ANTHRACOSIS B. BYSSINOSIS C. SILICOSIS D. BAGASSOSIS \n", + ".. ... \n", + "941 MRI sequence recycled to assess comfortable tissue pathology is\\nA. T1W\\nB. T2W\\nC. Proton density\\nD. Either T1 or T2 \n", + "942 Which commonplace tracer in PET is usually administered in the form of a glucose sugar\\nA. Oxygen 15\\nB. Fluorine 18\\nC. Saccharide - 12\\nD. Aluminum - 12 \n", + "943 Radiological signs of acute pancreatitis on transparent radiography are -\\nA. Sentinel loop sign\\nB. Colon cut off sign\\nC. Renal halo sign\\nD. 
All the above \n", + "944 Which looks ditto on Ti & T2 on MRI\\nA. Gall bladder\\nB. Fat\\nC. Kidney\\nD. CSF \n", + "945 22-year-aged women presents to the emergency depament with a preeminent complaint of relentless larboard upper quadrant pain after being punched by her husband. Her blood pressure is 110/76, her pulse is 80 bpm, and her respiration rate is 24 breaths per minute. The finest means to establish a diagnosis is that of the following ?\\nA. Four-quadrant tap of the abdomen\\nB. CT of the abdomen\\nC. Peritoneal lavage\\nD. Upper gastrointestinal series \n", + "\n", + " expected_result \\\n", + "0 A. Radiation \n", + "1 D. Is an example of destructive testing \n", + "2 B. 28 years \n", + "3 A. Cysts more than 2 cm \n", + "4 C. Silicosis \n", + ".. ... \n", + "941 D. Either T1 or T2 \n", + "942 B. Fluorine 18 \n", + "943 D. All the above \n", + "944 B. Fat \n", + "945 B. CT of the abdomen \n", + "\n", + " actual_result pass \n", + "0 A. RADIATION True \n", + "1 A. PROVIDES SPATIAL RESOLUTION False \n", + "2 D. 38 YEARS False \n", + "3 A. CYSTS MORE THAN 2 CM True \n", + "4 C. SILICOSIS True \n", + ".. ... ... \n", + "941 D. Either T1 or T2 True \n", + "942 B. Fluorine 18 True \n", + "943 D. All the above True \n", + "944 D. CSF False \n", + "945 B. CT of the abdomen True \n", + "\n", + "[946 rows x 9 columns]" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "generated_results" + ] + }, + { + "cell_type": "markdown", + "id": "UwkuBEoW89q0", + "metadata": { + "id": "UwkuBEoW89q0" + }, + "source": [ + "This method returns the generated results in the form of a pandas dataframe, which provides a convenient and easy-to-use format for working with the test results. You can use this method to quickly identify the test cases that failed and to determine where fixes are needed." 
+ ] + }, + { + "cell_type": "markdown", + "id": "jsHPGmGo89q0", + "metadata": { + "id": "jsHPGmGo89q0" + }, + "source": [ + "### Final Results\n", + "\n", + "Call `.report()` to summarize the results. For each test type, it gives the fail and pass counts, the pass rate, the minimum pass rate, and an overall pass/fail flag." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "5E99a26J89q0", + "metadata": { + "id": "5E99a26J89q0" + }, + "outputs": [], + "source": [ + "report = harness.report()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "imefSVqr89q0", + "metadata": { + "id": "imefSVqr89q0", + "outputId": "f727fa6f-4992-480f-b252-c84fcde4283d" + }, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "
" + ], + "text/plain": [ + " category test_type fail_count pass_count pass_rate \\\n", + "0 robustness uppercase 27 92 77% \n", + "1 robustness lowercase 16 103 87% \n", + "2 robustness titlecase 22 97 82% \n", + "3 robustness add_typo 13 102 89% \n", + "4 robustness dyslexia_word_swap 10 80 89% \n", + "5 robustness add_abbreviation 11 62 85% \n", + "6 robustness add_slangs 5 25 83% \n", + "7 robustness add_speech_to_text_typo 19 100 84% \n", + "8 robustness add_ocr_typo 20 90 82% \n", + "9 robustness adjective_synonym_swap 9 43 83% \n", + "\n", + " minimum_pass_rate pass \n", + "0 75% True \n", + "1 75% True \n", + "2 75% True \n", + "3 75% True \n", + "4 75% True \n", + "5 75% True \n", + "6 75% True \n", + "7 75% True \n", + "8 75% True \n", + "9 75% True " + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "report" + ] + }, + { + "cell_type": "markdown", + "id": "5mgY-0qQ89q0", + "metadata": { + "id": "5mgY-0qQ89q0" + }, + "source": [ + "## MedQA\n", + "\n", + "[What Disease does this Patient Have? A Large-scale Open Domain Question Answering Dataset from Medical Exams](https://paperswithcode.com/dataset/medqa-usmle)\n", + "\n", + "**Dataset Summary**\n", + "\n", + "MedQA is a benchmark dataset for multiple-choice question answering based on the United States Medical Licensing Examination (USMLE). The dataset is collected from professional medical board exams.\n", + "\n", + "**Data Splits**\n", + "\n", + "\n", + "- **test** : Testing set from the MedQA dataset, containing 1273 question-and-answer examples.\n", + "- **test-tiny** : Truncated version of the test set from the MedQA dataset, containing 50 question-and-answer examples."
+ ] + }, + { + "cell_type": "markdown", + "id": "a1VNSiHv89q0", + "metadata": { + "id": "a1VNSiHv89q0" + }, + "source": [ + "### Setup and Configure Harness" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "94913345-199e-4a1f-9672-294a70f79860", + "metadata": { + "colab": { + "referenced_widgets": [ + "9c736878808549f98b220b81a910c32e" + ] + }, + "execution": { + "iopub.execute_input": "2023-11-30T21:06:14.775351Z", + "iopub.status.busy": "2023-11-30T21:06:14.775015Z", + "iopub.status.idle": "2023-11-30T21:08:40.966319Z", + "shell.execute_reply": "2023-11-30T21:08:40.965759Z", + "shell.execute_reply.started": "2023-11-30T21:06:14.775333Z" + }, + "id": "94913345-199e-4a1f-9672-294a70f79860", + "outputId": "ea00aab7-99be-4c49-fba0-c532801d1792", + "tags": [] + }, + "outputs": [ + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "9c736878808549f98b220b81a910c32e", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "Loading checkpoint shards: 0%| | 0/2 [00:00\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " 
\n", + "" + ], + "text/plain": [ + " category test_type original_context \\\n", + "0 robustness add_ocr_typo - \n", + "1 robustness add_ocr_typo - \n", + "2 robustness add_ocr_typo - \n", + "3 robustness add_ocr_typo - \n", + "4 robustness add_ocr_typo - \n", + "5 robustness add_ocr_typo - \n", + "6 robustness add_ocr_typo - \n", + "7 robustness add_ocr_typo - \n", + "8 robustness add_ocr_typo - \n", + "9 robustness add_ocr_typo - \n", + "10 robustness add_ocr_typo - \n", + "11 robustness add_ocr_typo - \n", + "12 robustness add_ocr_typo - \n", + "13 robustness add_ocr_typo - \n", + "14 robustness add_ocr_typo - \n", + "15 robustness add_ocr_typo - \n", + "16 robustness add_ocr_typo - \n", + "17 robustness add_ocr_typo - \n", + "18 robustness add_ocr_typo - \n", + "19 robustness add_ocr_typo - \n", + "20 robustness add_ocr_typo - \n", + "21 robustness add_ocr_typo - \n", + "22 robustness add_ocr_typo - \n", + "23 robustness add_ocr_typo - \n", + "24 robustness add_ocr_typo - \n", + "25 robustness add_ocr_typo - \n", + "26 robustness add_ocr_typo - \n", + "27 robustness add_ocr_typo - \n", + "28 robustness add_ocr_typo - \n", + "29 robustness add_ocr_typo - \n", + "30 robustness dyslexia_word_swap - \n", + "31 robustness dyslexia_word_swap - \n", + "32 robustness dyslexia_word_swap - \n", + "33 robustness dyslexia_word_swap - \n", + "34 robustness dyslexia_word_swap - \n", + "35 robustness dyslexia_word_swap - \n", + "36 robustness dyslexia_word_swap - \n", + "37 robustness dyslexia_word_swap - \n", + "38 robustness dyslexia_word_swap - \n", + "39 robustness dyslexia_word_swap - \n", + "40 robustness dyslexia_word_swap - \n", + "41 robustness dyslexia_word_swap - \n", + "42 robustness dyslexia_word_swap - \n", + "43 robustness dyslexia_word_swap - \n", + "44 robustness dyslexia_word_swap - \n", + "45 robustness dyslexia_word_swap - \n", + "46 robustness dyslexia_word_swap - \n", + "47 robustness dyslexia_word_swap - \n", + "48 robustness dyslexia_word_swap - \n", + 
"49 robustness dyslexia_word_swap - \n", + "50 robustness dyslexia_word_swap - \n", + "51 robustness dyslexia_word_swap - \n", + "52 robustness dyslexia_word_swap - \n", + "53 robustness dyslexia_word_swap - \n", + "54 robustness dyslexia_word_swap - \n", + "55 robustness dyslexia_word_swap - \n", + "56 robustness dyslexia_word_swap - \n", + "57 robustness dyslexia_word_swap - \n", + "58 robustness dyslexia_word_swap - \n", + "59 robustness dyslexia_word_swap - \n", + "\n", + " original_question perturbed_context \\\n", + "0 A junior orthopaedic surgery resident is compl... - \n", + "1 A 67-year-old man with transitional cell carci... - \n", + "2 Two weeks after undergoing an emergency cardia... - \n", + "3 A 39-year-old woman is brought to the emergenc... - \n", + "4 A 35-year-old man comes to the physician becau... - \n", + "5 A 39-year-old man presents to the emergency de... - \n", + "6 A 68-year-old male comes to the physician for ... - \n", + "7 A 65-year-old man is brought to the emergency ... - \n", + "8 A 37-year-old-woman presents to her primary ca... - \n", + "9 A 23-year-old woman comes to the physician bec... - \n", + "10 A 24-year-old G2P1 woman at 39 weeks’ gestatio... - \n", + "11 A 72-year-old man comes to the physician becau... - \n", + "12 A 20-year-old man comes to the physician becau... - \n", + "13 A 47-year-old executive schedules an appointme... - \n", + "14 A microbiologist is studying the emergence of ... - \n", + "15 A 59-year-old overweight woman presents to the... - \n", + "16 A 7-year-old boy is brought to his pediatricia... - \n", + "17 A 3-month-old boy is brought the emergency dep... - \n", + "18 A 29-year-old man presents to the emergency de... - \n", + "19 A 46-year-old man is brought to the emergency ... - \n", + "20 A 77-year-old woman presents to the emergency ... - \n", + "21 A 3-month-old infant is brought to her pediatr... - \n", + "22 A 30-year-old African American woman comes to ... 
- \n", + "23 A 62-year-old patient has been hospitalized fo... - \n", + "24 A 6-year-old boy is brought to the emergency d... - \n", + "25 A 5-year-old female suffers from recurrent inf... - \n", + "26 A 3-year-old boy presents to the emergency dep... - \n", + "27 A 26-year-old woman presents to a gynecologist... - \n", + "28 A 4-year-old previously healthy boy presents w... - \n", + "29 A 3-week-old male newborn is brought to the ph... - \n", + "30 A junior orthopaedic surgery resident is compl... - \n", + "31 A 67-year-old man with transitional cell carci... - \n", + "32 Two weeks after undergoing an emergency cardia... - \n", + "33 A 39-year-old woman is brought to the emergenc... - \n", + "34 A 35-year-old man comes to the physician becau... - \n", + "35 A 39-year-old man presents to the emergency de... - \n", + "36 A 68-year-old male comes to the physician for ... - \n", + "37 A 65-year-old man is brought to the emergency ... - \n", + "38 A 37-year-old-woman presents to her primary ca... - \n", + "39 A 23-year-old woman comes to the physician bec... - \n", + "40 A 24-year-old G2P1 woman at 39 weeks’ gestatio... - \n", + "41 A 72-year-old man comes to the physician becau... - \n", + "42 A 20-year-old man comes to the physician becau... - \n", + "43 A 47-year-old executive schedules an appointme... - \n", + "44 A microbiologist is studying the emergence of ... - \n", + "45 A 59-year-old overweight woman presents to the... - \n", + "46 A 7-year-old boy is brought to his pediatricia... - \n", + "47 A 3-month-old boy is brought the emergency dep... - \n", + "48 A 29-year-old man presents to the emergency de... - \n", + "49 A 46-year-old man is brought to the emergency ... - \n", + "50 A 77-year-old woman presents to the emergency ... - \n", + "51 A 3-month-old infant is brought to her pediatr... - \n", + "52 A 30-year-old African American woman comes to ... - \n", + "53 A 62-year-old patient has been hospitalized fo... 
- \n", + "54 A 6-year-old boy is brought to the emergency d... - \n", + "55 A 5-year-old female suffers from recurrent inf... - \n", + "56 A 3-year-old boy presents to the emergency dep... - \n", + "57 A 26-year-old woman presents to a gynecologist... - \n", + "58 A 4-year-old previously healthy boy presents w... - \n", + "59 A 3-week-old male newborn is brought to the ph... - \n", + "\n", + " perturbed_question \n", + "0 A junior orthopaedic surgery resident is compl... \n", + "1 A 67-year-old m^an y/ith transitional ccll car... \n", + "2 t^wo weeks aftcr undergoing an emergency cardi... \n", + "3 A 39-year-old voman is brought t^o th^e emerge... \n", + "4 A 35-year-old m^an comes t^o t^he physician hc... \n", + "5 A 39-year-old m^an presents t^o tl)e emergency... \n", + "6 A 68-year-old male comes t^o tbe physician f^o... \n", + "7 A 65-year-old m^an is brought t^o tl)e emergen... \n", + "8 A 37-year-old-woman presents t^o h^er primary ... \n", + "9 A 23-year-old w6man comes t^o t^ie physician b... \n", + "10 A 24-year-old G2P1 v/oman at 39 weeks’ gestati... \n", + "11 A 72-year-old m^an comes t^o tl)e physician be... \n", + "12 A 20-year-old m^an comes t^o tlic physician be... \n", + "13 A 47-year-old execut1ve schedules an appointme... \n", + "14 A microbiologist is studying tle emergence of ... \n", + "15 A 59-year-old overweight v^oman presents t^o t... \n", + "16 A 7-year-old boy is brought t^o hi^s pediatric... \n", + "17 A 3-month-old boy is brought t^he emergency de... \n", + "18 A 29-year-old m^an presents t^o th^e emergency... \n", + "19 A 46-year-old m^an is brought t^o t^ie emergen... \n", + "20 A 77-year-old v^oman presents t^o tlie emergen... \n", + "21 A 3-month-old infant is brought t^o he^r pedia... \n", + "22 A 30-year-old African American ivoman comes t^... \n", + "23 A 62-year-old j)atient has been hospitalized f... \n", + "24 A 6-year-old boy is brought t^o t^he emergency... \n", + "25 A 5-year-old female suffers f^rom recurrent in... 
\n", + "26 A 3-year-old boy presents t^o tbe emergency de... \n", + "27 A 26-year-old womau presents t^o a gynecologis... \n", + "28 A 4-year-old previously healthy boy presents v... \n", + "29 A 3-week-old male newborn is brought t^o tle p... \n", + "30 A junior orthopaedic surgery resident is compl... \n", + "31 A 67-year-old man with transitional cell carci... \n", + "32 Two weeks after undergoing an emergency cardia... \n", + "33 A 39-year-old woman is brought too the emergen... \n", + "34 A 35-year-old man comes too the physician beca... \n", + "35 A 39-year-old man presents too the emergency d... \n", + "36 A 68-year-old male comes too the physician fou... \n", + "37 A 65-year-old man is brought too the emergency... \n", + "38 A 37-year-old-woman presents too her primary c... \n", + "39 A 23-year-old woman comes too the physician be... \n", + "40 A 24-year-old G2P1 woman at 39 weeks’ gestatio... \n", + "41 A 72-year-old man comes too the physician beca... \n", + "42 A 20-year-old man comes too the physician beca... \n", + "43 A 47-year-old executive schedules an appointme... \n", + "44 A microbiologist is studying the emergence off... \n", + "45 A 59-year-old overweight woman presents too th... \n", + "46 A 7-year-old boy is brought too his pediatrici... \n", + "47 A 3-month-old boy is brought the emergency dep... \n", + "48 A 29-year-old man presents too the emergency d... \n", + "49 A 46-year-old man is brought too the emergency... \n", + "50 A 77-year-old woman presents too the emergency... \n", + "51 A 3-month-old infant is brought too her pediat... \n", + "52 A 30-year-old African American woman comes too... \n", + "53 A 62-year-old patient has been hospitalized fo... \n", + "54 A 6-year-old boy is brought too the emergency ... \n", + "55 A 5-year-old female suffers from recurrent inf... \n", + "56 A 3-year-old boy presents too the emergency de... \n", + "57 A 26-year-old woman presents too a gynecologis... 
\n", + "58 A 4-year-old previously healthy boy presents w... \n", + "59 A 3-week-old male newborn is brought too the p... " + ] + }, + "execution_count": 6, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "harness.testcases()" + ] + }, + { + "cell_type": "markdown", + "id": "2fPk-Z6o89q5", + "metadata": { + "id": "2fPk-Z6o89q5" + }, + "source": [ + "### Running the tests" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "0bece2be-f747-455c-9b8a-33fb1be1e52e", + "metadata": { + "execution": { + "iopub.execute_input": "2023-11-30T21:08:42.478585Z", + "iopub.status.busy": "2023-11-30T21:08:42.478328Z", + "iopub.status.idle": "2023-11-30T21:19:04.158344Z", + "shell.execute_reply": "2023-11-30T21:19:04.157825Z", + "shell.execute_reply.started": "2023-11-30T21:08:42.478569Z" + }, + "id": "0bece2be-f747-455c-9b8a-33fb1be1e52e", + "tags": [] + }, + "outputs": [], + "source": [ + "harness.run()" + ] + }, + { + "cell_type": "markdown", + "id": "7TgAuYpG89q5", + "metadata": { + "id": "7TgAuYpG89q5" + }, + "source": [ + "### Generated Results" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "bf5e8570-ade8-432b-b80d-1bd4fb6565bd", + "metadata": { + "execution": { + "iopub.execute_input": "2023-11-30T21:19:04.159186Z", + "iopub.status.busy": "2023-11-30T21:19:04.159013Z", + "iopub.status.idle": "2023-11-30T21:19:04.178803Z", + "shell.execute_reply": "2023-11-30T21:19:04.178371Z", + "shell.execute_reply.started": "2023-11-30T21:19:04.159170Z" + }, + "id": "bf5e8570-ade8-432b-b80d-1bd4fb6565bd", + "outputId": "ff08e618-eb09-440d-f7d2-034fd311f52f", + "tags": [] + }, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " category test_type original_context \\\n", + "0 robustness add_ocr_typo - \n", + "1 robustness add_ocr_typo - \n", + "2 robustness add_ocr_typo - \n", + "3 robustness add_ocr_typo - \n", + "4 robustness add_ocr_typo - \n", + "5 robustness add_ocr_typo - \n", + "6 robustness add_ocr_typo - \n", + "7 robustness add_ocr_typo - \n", + "8 robustness add_ocr_typo - \n", + "9 robustness add_ocr_typo - \n", + "10 robustness add_ocr_typo - \n", + "11 robustness add_ocr_typo - \n", + "12 robustness add_ocr_typo - \n", + "13 robustness add_ocr_typo - \n", + "14 robustness add_ocr_typo - \n", + "15 robustness add_ocr_typo - \n", + "16 robustness add_ocr_typo - \n", + "17 robustness add_ocr_typo - \n", + "18 robustness add_ocr_typo - \n", + "19 robustness add_ocr_typo - \n", + "20 robustness add_ocr_typo - \n", + "21 robustness add_ocr_typo - \n", + "22 robustness add_ocr_typo - \n", + "23 robustness add_ocr_typo - \n", + "24 robustness add_ocr_typo - \n", + "25 robustness add_ocr_typo - \n", + "26 robustness add_ocr_typo - \n", + "27 robustness add_ocr_typo - \n", + "28 robustness add_ocr_typo - \n", + "29 robustness add_ocr_typo - \n", + "30 robustness dyslexia_word_swap - \n", + "31 robustness dyslexia_word_swap - \n", + "32 robustness dyslexia_word_swap - \n", + "33 robustness dyslexia_word_swap - \n", + "34 robustness dyslexia_word_swap - \n", + "35 robustness dyslexia_word_swap - \n", + "36 robustness dyslexia_word_swap - \n", + "37 robustness dyslexia_word_swap - \n", + "38 robustness dyslexia_word_swap - \n", + "39 robustness dyslexia_word_swap - \n", + "40 robustness dyslexia_word_swap - \n", + "41 robustness dyslexia_word_swap - \n", + "42 robustness dyslexia_word_swap - \n", + "43 robustness dyslexia_word_swap - \n", + "44 robustness dyslexia_word_swap - \n", + "45 robustness dyslexia_word_swap - \n", + "46 robustness dyslexia_word_swap - \n", + "47 robustness dyslexia_word_swap - \n", + "48 robustness dyslexia_word_swap - \n", + "49 
robustness dyslexia_word_swap - \n", + "50 robustness dyslexia_word_swap - \n", + "51 robustness dyslexia_word_swap - \n", + "52 robustness dyslexia_word_swap - \n", + "53 robustness dyslexia_word_swap - \n", + "54 robustness dyslexia_word_swap - \n", + "55 robustness dyslexia_word_swap - \n", + "56 robustness dyslexia_word_swap - \n", + "57 robustness dyslexia_word_swap - \n", + "58 robustness dyslexia_word_swap - \n", + "59 robustness dyslexia_word_swap - \n", + "\n", + " original_question perturbed_context \\\n", + "0 A junior orthopaedic surgery resident is compl... - \n", + "1 A 67-year-old man with transitional cell carci... - \n", + "2 Two weeks after undergoing an emergency cardia... - \n", + "3 A 39-year-old woman is brought to the emergenc... - \n", + "4 A 35-year-old man comes to the physician becau... - \n", + "5 A 39-year-old man presents to the emergency de... - \n", + "6 A 68-year-old male comes to the physician for ... - \n", + "7 A 65-year-old man is brought to the emergency ... - \n", + "8 A 37-year-old-woman presents to her primary ca... - \n", + "9 A 23-year-old woman comes to the physician bec... - \n", + "10 A 24-year-old G2P1 woman at 39 weeks’ gestatio... - \n", + "11 A 72-year-old man comes to the physician becau... - \n", + "12 A 20-year-old man comes to the physician becau... - \n", + "13 A 47-year-old executive schedules an appointme... - \n", + "14 A microbiologist is studying the emergence of ... - \n", + "15 A 59-year-old overweight woman presents to the... - \n", + "16 A 7-year-old boy is brought to his pediatricia... - \n", + "17 A 3-month-old boy is brought the emergency dep... - \n", + "18 A 29-year-old man presents to the emergency de... - \n", + "19 A 46-year-old man is brought to the emergency ... - \n", + "20 A 77-year-old woman presents to the emergency ... - \n", + "21 A 3-month-old infant is brought to her pediatr... - \n", + "22 A 30-year-old African American woman comes to ... 
- \n", + "23 A 62-year-old patient has been hospitalized fo... - \n", + "24 A 6-year-old boy is brought to the emergency d... - \n", + "25 A 5-year-old female suffers from recurrent inf... - \n", + "26 A 3-year-old boy presents to the emergency dep... - \n", + "27 A 26-year-old woman presents to a gynecologist... - \n", + "28 A 4-year-old previously healthy boy presents w... - \n", + "29 A 3-week-old male newborn is brought to the ph... - \n", + "30 A junior orthopaedic surgery resident is compl... - \n", + "31 A 67-year-old man with transitional cell carci... - \n", + "32 Two weeks after undergoing an emergency cardia... - \n", + "33 A 39-year-old woman is brought to the emergenc... - \n", + "34 A 35-year-old man comes to the physician becau... - \n", + "35 A 39-year-old man presents to the emergency de... - \n", + "36 A 68-year-old male comes to the physician for ... - \n", + "37 A 65-year-old man is brought to the emergency ... - \n", + "38 A 37-year-old-woman presents to her primary ca... - \n", + "39 A 23-year-old woman comes to the physician bec... - \n", + "40 A 24-year-old G2P1 woman at 39 weeks’ gestatio... - \n", + "41 A 72-year-old man comes to the physician becau... - \n", + "42 A 20-year-old man comes to the physician becau... - \n", + "43 A 47-year-old executive schedules an appointme... - \n", + "44 A microbiologist is studying the emergence of ... - \n", + "45 A 59-year-old overweight woman presents to the... - \n", + "46 A 7-year-old boy is brought to his pediatricia... - \n", + "47 A 3-month-old boy is brought the emergency dep... - \n", + "48 A 29-year-old man presents to the emergency de... - \n", + "49 A 46-year-old man is brought to the emergency ... - \n", + "50 A 77-year-old woman presents to the emergency ... - \n", + "51 A 3-month-old infant is brought to her pediatr... - \n", + "52 A 30-year-old African American woman comes to ... - \n", + "53 A 62-year-old patient has been hospitalized fo... 
- \n", + "54 A 6-year-old boy is brought to the emergency d... - \n", + "55 A 5-year-old female suffers from recurrent inf... - \n", + "56 A 3-year-old boy presents to the emergency dep... - \n", + "57 A 26-year-old woman presents to a gynecologist... - \n", + "58 A 4-year-old previously healthy boy presents w... - \n", + "59 A 3-week-old male newborn is brought to the ph... - \n", + "\n", + " perturbed_question expected_result \\\n", + "0 A junior orthopaedic surgery resident is compl... B \n", + "1 A 67-year-old m^an y/ith transitional ccll car... D \n", + "2 t^wo weeks aftcr undergoing an emergency cardi... D \n", + "3 A 39-year-old voman is brought t^o th^e emerge... A \n", + "4 A 35-year-old m^an comes t^o t^he physician hc... D \n", + "5 A 39-year-old m^an presents t^o tl)e emergency... E \n", + "6 A 68-year-old male comes t^o tbe physician f^o... A \n", + "7 A 65-year-old m^an is brought t^o tl)e emergen... C \n", + "8 A 37-year-old-woman presents t^o h^er primary ... C \n", + "9 A 23-year-old w6man comes t^o t^ie physician b... A \n", + "10 A 24-year-old G2P1 v/oman at 39 weeks’ gestati... D \n", + "11 A 72-year-old m^an comes t^o tl)e physician be... D \n", + "12 A 20-year-old m^an comes t^o tlic physician be... D \n", + "13 A 47-year-old execut1ve schedules an appointme... E \n", + "14 A microbiologist is studying tle emergence of ... E \n", + "15 A 59-year-old overweight v^oman presents t^o t... B \n", + "16 A 7-year-old boy is brought t^o hi^s pediatric... D \n", + "17 A 3-month-old boy is brought t^he emergency de... E \n", + "18 A 29-year-old m^an presents t^o th^e emergency... D \n", + "19 A 46-year-old m^an is brought t^o t^ie emergen... C \n", + "20 A 77-year-old v^oman presents t^o tlie emergen... E \n", + "21 A 3-month-old infant is brought t^o he^r pedia... A \n", + "22 A 30-year-old African American ivoman comes t^... B \n", + "23 A 62-year-old j)atient has been hospitalized f... E \n", + "24 A 6-year-old boy is brought t^o t^he emergency... 
E \n", + "25 A 5-year-old female suffers f^rom recurrent in... E \n", + "26 A 3-year-old boy presents t^o tbe emergency de... E \n", + "27 A 26-year-old womau presents t^o a gynecologis... A \n", + "28 A 4-year-old previously healthy boy presents v... B \n", + "29 A 3-week-old male newborn is brought t^o tle p... D \n", + "30 A junior orthopaedic surgery resident is compl... B \n", + "31 A 67-year-old man with transitional cell carci... D \n", + "32 Two weeks after undergoing an emergency cardia... D \n", + "33 A 39-year-old woman is brought too the emergen... A \n", + "34 A 35-year-old man comes too the physician beca... D \n", + "35 A 39-year-old man presents too the emergency d... E \n", + "36 A 68-year-old male comes too the physician fou... A \n", + "37 A 65-year-old man is brought too the emergency... C \n", + "38 A 37-year-old-woman presents too her primary c... C \n", + "39 A 23-year-old woman comes too the physician be... A \n", + "40 A 24-year-old G2P1 woman at 39 weeks’ gestatio... D \n", + "41 A 72-year-old man comes too the physician beca... D \n", + "42 A 20-year-old man comes too the physician beca... D \n", + "43 A 47-year-old executive schedules an appointme... E \n", + "44 A microbiologist is studying the emergence off... E \n", + "45 A 59-year-old overweight woman presents too th... B \n", + "46 A 7-year-old boy is brought too his pediatrici... D \n", + "47 A 3-month-old boy is brought the emergency dep... E \n", + "48 A 29-year-old man presents too the emergency d... D \n", + "49 A 46-year-old man is brought too the emergency... C \n", + "50 A 77-year-old woman presents too the emergency... E \n", + "51 A 3-month-old infant is brought too her pediat... A \n", + "52 A 30-year-old African American woman comes too... B \n", + "53 A 62-year-old patient has been hospitalized fo... E \n", + "54 A 6-year-old boy is brought too the emergency ... E \n", + "55 A 5-year-old female suffers from recurrent inf... 
E \n", + "56 A 3-year-old boy presents too the emergency de... E \n", + "57 A 26-year-old woman presents too a gynecologis... A \n", + "58 A 4-year-old previously healthy boy presents w... B \n", + "59 A 3-week-old male newborn is brought too the p... D \n", + "\n", + " actual_result eval_score pass \n", + "0 A 1.0 False \n", + "1 D 0.0 True \n", + "2 D 0.0 True \n", + "3 A 0.0 True \n", + "4 D 0.0 True \n", + "5 E 0.0 True \n", + "6 C 1.0 False \n", + "7 C 0.0 True \n", + "8 C 0.0 True \n", + "9 A 0.0 True \n", + "10 D 0.0 True \n", + "11 E 1.0 False \n", + "12 B 1.0 False \n", + "13 E 0.0 True \n", + "14 E 0.0 True \n", + "15 B 0.0 True \n", + "16 D 0.0 True \n", + "17 E 0.0 True \n", + "18 D 0.0 True \n", + "19 C 0.0 True \n", + "20 E 0.0 True \n", + "21 A 0.0 True \n", + "22 B 0.0 True \n", + "23 E 0.0 True \n", + "24 E 0.0 True \n", + "25 E 0.0 True \n", + "26 E 0.0 True \n", + "27 A 0.0 True \n", + "28 B 0.0 True \n", + "29 D 0.0 True \n", + "30 B 0.0 True \n", + "31 E 1.0 False \n", + "32 D 0.0 True \n", + "33 A 0.0 True \n", + "34 D 0.0 True \n", + "35 E 0.0 True \n", + "36 A 0.0 True \n", + "37 C 0.0 True \n", + "38 C 0.0 True \n", + "39 A 0.0 True \n", + "40 D 0.0 True \n", + "41 D 0.0 True \n", + "42 D 0.0 True \n", + "43 E 0.0 True \n", + "44 E 0.0 True \n", + "45 B 0.0 True \n", + "46 E 1.0 False \n", + "47 E 0.0 True \n", + "48 D 0.0 True \n", + "49 C 0.0 True \n", + "50 E 0.0 True \n", + "51 A 0.0 True \n", + "52 B 0.0 True \n", + "53 E 0.0 True \n", + "54 E 0.0 True \n", + "55 E 0.0 True \n", + "56 E 0.0 True \n", + "57 D 1.0 False \n", + "58 B 0.0 True \n", + "59 D 0.0 True " + ] + }, + "execution_count": 8, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "harness.generated_results()" + ] + }, + { + "cell_type": "markdown", + "id": "jixbpnNb89q5", + "metadata": { + "id": "jixbpnNb89q5" + }, + "source": [ + "### Final Results" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": 
"b8a74d27-1d88-4fe0-ac54-c8a569fdf048", + "metadata": { + "execution": { + "iopub.execute_input": "2023-11-30T21:19:04.179484Z", + "iopub.status.busy": "2023-11-30T21:19:04.179336Z", + "iopub.status.idle": "2023-11-30T21:19:04.269655Z", + "shell.execute_reply": "2023-11-30T21:19:04.269220Z", + "shell.execute_reply.started": "2023-11-30T21:19:04.179470Z" + }, + "id": "b8a74d27-1d88-4fe0-ac54-c8a569fdf048", + "outputId": "ecbdff3a-2e06-487c-df1b-e67f191b1029", + "tags": [] + }, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " category test_type fail_count pass_count pass_rate \\\n", + "0 robustness add_ocr_typo 4 26 87% \n", + "1 robustness dyslexia_word_swap 3 27 90% \n", + "\n", + " minimum_pass_rate pass \n", + "0 66% True \n", + "1 60% True " + ] + }, + "execution_count": 9, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "harness.report()" + ] + }, + { + "cell_type": "markdown", + "id": "3eb133b8-1ed7-4af8-b3ae-5432ad59eff6", + "metadata": { + "id": "3eb133b8-1ed7-4af8-b3ae-5432ad59eff6" + }, + "source": [ + "## PubMedQA\n", + "\n", + "[PubMedQA: A Dataset for Biomedical Research Question Answering](https://arxiv.org/abs/1909.06146)\n", + "\n", + "**Dataset Summary**\n", + "\n", + "The PubMedQA is a benchmark dataset for biomedical question answering (QA) dataset collected from PubMed abstracts. The task of PubMedQA is to answer research questions with yes/no/maybe (e.g.: Do preoperative statins reduce atrial fibrillation after coronary artery bypass grafting?) using the corresponding abstracts.\n", + "\n", + "**Data Splits**\n", + "\n", + "- **pqaa** : Truncated version of pqa_artificial subset from the PubMedQA, containing 500 question and answers examples.\n", + "- **pqal** : Truncated version of pqa_labeled subset from the PubMedQA, containing 500 question answers examples." 
+ ] + }, + { + "cell_type": "markdown", + "id": "3JIjcMDC89q5", + "metadata": { + "id": "3JIjcMDC89q5" + }, + "source": [ + "### Setup and Configure Harness" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "07d2a026-7f6c-4e67-86b0-38a37a3a83d1", + "metadata": { + "execution": { + "iopub.execute_input": "2023-11-30T21:32:24.323612Z", + "iopub.status.busy": "2023-11-30T21:32:24.323346Z", + "iopub.status.idle": "2023-11-30T21:32:24.352922Z", + "shell.execute_reply": "2023-11-30T21:32:24.352060Z", + "shell.execute_reply.started": "2023-11-30T21:32:24.323593Z" + }, + "id": "07d2a026-7f6c-4e67-86b0-38a37a3a83d1", + "outputId": "1733f7c0-3db0-4e6d-df17-5f4818bd06d9", + "tags": [] + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Test Configuration : \n", + " {\n", + " \"evaluation\": {\n", + " \"metric\": \"string_distance\",\n", + " \"distance\": \"jaro\",\n", + " \"threshold\": 0.1\n", + " },\n", + " \"tests\": {\n", + " \"defaults\": {\n", + " \"min_pass_rate\": 0.65\n", + " },\n", + " \"robustness\": {\n", + " \"add_ocr_typo\": {\n", + " \"min_pass_rate\": 0.66\n", + " },\n", + " \"dyslexia_word_swap\": {\n", + " \"min_pass_rate\": 0.6\n", + " }\n", + " }\n", + " }\n", + "}\n" + ] + } + ], + "source": [ + "harness = Harness(\n", + " task=\"question-answering\",\n", + " model={\"model\": \"j2-jumbo-instruct\", \"hub\":\"ai21\"},\n", + " data={\"data_source\" :\"PubMedQA\",\n", + " \"split\":\"pqaa\"},\n", + " config={\n", + " \"evaluation\": {\"metric\":\"string_distance\",\"distance\":\"jaro\",\"threshold\":0.1},\n", + " 'tests': {'defaults': {'min_pass_rate': 0.65},\n", + "\n", + " 'robustness': {'add_ocr_typo': {'min_pass_rate': 0.66},\n", + " 'dyslexia_word_swap':{'min_pass_rate': 0.60}\n", + " }\n", + " }\n", + " }\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "sDHxPRZC89q5", + "metadata": { + "id": "sDHxPRZC89q5" + }, + "source": [ + "Note: for evaluation we are using **string distance 
metrics**" + ] + }, + { + "cell_type": "markdown", + "id": "PjpqGJlW-nJT", + "metadata": { + "id": "PjpqGJlW-nJT" + }, + "source": [ + "Let us run it on the first 20 examples." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "5cef8fd5-faf7-45f8-b803-9031ffa694cf", + "metadata": { + "execution": { + "iopub.execute_input": "2023-11-30T21:32:24.731367Z", + "iopub.status.busy": "2023-11-30T21:32:24.731184Z", + "iopub.status.idle": "2023-11-30T21:32:24.734682Z", + "shell.execute_reply": "2023-11-30T21:32:24.734141Z", + "shell.execute_reply.started": "2023-11-30T21:32:24.731353Z" + }, + "id": "5cef8fd5-faf7-45f8-b803-9031ffa694cf", + "tags": [] + }, + "outputs": [], + "source": [ + "harness.data = harness.data[:20]" + ] + }, + { + "cell_type": "markdown", + "id": "5LYqiGA389q6", + "metadata": { + "id": "5LYqiGA389q6" + }, + "source": [ + "### Generating the test cases" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "4a94a79d-c76a-4b56-b318-789a6091f251", + "metadata": { + "execution": { + "iopub.execute_input": "2023-11-30T21:32:25.106523Z", + "iopub.status.busy": "2023-11-30T21:32:25.105849Z", + "iopub.status.idle": "2023-11-30T21:32:26.710113Z", + "shell.execute_reply": "2023-11-30T21:32:26.709576Z", + "shell.execute_reply.started": "2023-11-30T21:32:25.106506Z" + }, + "id": "4a94a79d-c76a-4b56-b318-789a6091f251", + "outputId": "9655bada-a3e9-44a8-e0c9-943c5ad838a9", + "tags": [] + }, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "Generating testcases...: 100%|██████████| 1/1 [00:00<00:00, 11554.56it/s]\n" + ] + }, + { + "data": { + "text/plain": [] + }, + "execution_count": 18, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "harness.generate()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "6199ddf7-3b19-4602-b4d7-42ddd78d0182", + "metadata": { + "execution": { + "iopub.execute_input": "2023-11-30T21:32:26.711219Z", + "iopub.status.busy": 
"2023-11-30T21:32:26.711057Z", + "iopub.status.idle": "2023-11-30T21:32:26.722309Z", + "shell.execute_reply": "2023-11-30T21:32:26.721809Z", + "shell.execute_reply.started": "2023-11-30T21:32:26.711205Z" + }, + "id": "6199ddf7-3b19-4602-b4d7-42ddd78d0182", + "outputId": "ccce031c-4290-4fd4-8055-83cad88983c7", + "tags": [] + }, + "outputs": [ + { + "data": { + "text/html": [ + "
| | category | test_type | original_context | original_question | perturbed_context | perturbed_question |
|---|---|---|---|---|---|---|
| 0 | robustness | add_ocr_typo | Context (1): Cardiomyocyte proliferation gradu... | does fgf10 promote regional foetal cardiomyocy... | Context (1): Cardiomyocyte proliferation gradu... | does fgf10 promote regional foetal cardiomyocy... |
| 1 | robustness | add_ocr_typo | Context (1): Ethanol (EtOH) exposure during ge... | do selenium or selenium plus folic acid-supple... | Context (1): Ethanol (EtOH) exposure duriug ge... | do selenium or selenium plus folic acid-supple... |
| 2 | robustness | add_ocr_typo | Context (1): To study the association between ... | are bone mineral density and vertebral fractur... | Context (1): t^o studv tl)e association betAve... | are bone mineral density an^d vertebral fractu... |
| 3 | robustness | add_ocr_typo | Context (1): To assess the prevalence of porta... | is portal hypertensive colopathy associated wi... | Context (1): t^o assess t^e prevalence of port... | is portal hypertensive colopathy associated y/... |
| 4 | robustness | add_ocr_typo | Context (1): The hOGG1 Ser326Cys polymorphism ... | are the ape1 asp/asp genotype and the combinat... | Context (1): tlic hOGG1 Ser326Cys polymorphism... | are tle ape1 asp/asp genotype a^nd tle combina... |
| 5 | robustness | add_ocr_typo | Context (1): Aldehyde dehydrogenase 1 (ALDH1) ... | does aldehyde dehydrogenase 1 expression corre... | Context (1): Aldehyde dehydrogenase 1 (ALDH1) ... | does aldehyde dehydrogenase 1 expression corre... |
| 6 | robustness | add_ocr_typo | Context (1): To screen and validate the global... | are hepatocyte growth factor receptor , matrix... | Context (1): t^o screen a^d validate tbe globa... | are hepatocyte grov/th faclor receptor , matri... |
| 7 | robustness | add_ocr_typo | Context (1): The epidermal growth factor recep... | do assessment and prognostic analysis of egfr ... | Context (1): tlie epidermal gro^vth faftor rec... | do assessment a^d prognostic analylis of egfr ... |
| 8 | robustness | add_ocr_typo | Context (1): Early appearance of antibodies sp... | are anti-type ii collagen antibodies , anti-cc... | Context (1): earlv appearance of antibodies fp... | are anti-type ii collagen antibodies , anti-cc... |
| 9 | robustness | add_ocr_typo | Context (1): To identify differences in perspe... | does concordance of family and staff member re... | Context (1): t^o identifv differences i^n pers... | does concordance of lamily an^d stafl memher r... |
| 10 | robustness | add_ocr_typo | Context (1): The NINDS trial demonstrated the ... | does prior asymptomatic parenchymal hemorrhage... | Context (1): tl)e NINDS tryaU demonstrated the... | does prior asymptomatic parenchymal hemorrhage... |
| 11 | robustness | add_ocr_typo | Context (1): We assessed whether dietary calci... | is dietary calcium but not elemental calcium f... | Context (1): w^e assessed vliether dietary cal... | is dietary calcium b^ut n^ot elemental calcium... |
| 12 | robustness | add_ocr_typo | Context (1): To determine if fetal-placental h... | does fetal-placental hypoxia result from failu... | Context (1): t^o determin6 if fetal-placental ... | does fetal-placental hypoxia rcfult lrom failu... |
| 13 | robustness | add_ocr_typo | Context (1): The contribution of HIV-exposure ... | do hiv-exposed children account for more than ... | Context (1): tbe contribution of HIV-exposure ... | do hiv-exposed children accouut f^or morc tlia... |
| 14 | robustness | add_ocr_typo | Context (1): Right ventricular dysfunction (RV... | is abnormal regulation of renin angiotensin al... | Context (1): riglit ventricular dysfunction (R... | is abnormal regulation of renin angiotensin al... |
| 15 | robustness | add_ocr_typo | Context (1): To conduct a comprehensive mappin... | are novel epigenetic changes in cdkn2a associa... | Context (1): t^o conduct a comprehensive mappi... | are novel epigenetic changes i^n cdkn2a associ... |
| 16 | robustness | add_ocr_typo | Context (1): Bile acids are signaling molecule... | does g-protein-coupled bile acid receptor play... | Context (1): Bile acids are signaling molecule... | does g-protein-coupled bile acid receptor p1ay... |
| 17 | robustness | add_ocr_typo | Context (1): Cigarette smoking is a leading ca... | are smoking-induced gene expression changes in... | Context (1): Cigarette smoking is a leading ca... | are smoking-induced gene expression changes i^... |
| 18 | robustness | add_ocr_typo | Context (1): Angiogenesis is a prerequisite fo... | does tumstatin transfected into human glioma c... | Context (1): Angiogenesis is a prerequisite f^... | does tumstatin transfected int6 hnman glioma c... |
| 19 | robustness | add_ocr_typo | Context (1): To investigate the correlation of... | does [ znf217 expression correlate with the bi... | Context (1): t^o investigate th^e correlation ... | does [ znf217 expression correlate v«ith t^ie ... |
| 20 | robustness | dyslexia_word_swap | Context (1): Cardiomyocyte proliferation gradu... | does fgf10 promote regional foetal cardiomyocy... | Context (1): Cardiomyocyte proliferation gradu... | does fgf10 promote regional foetal cardiomyocy... |
| 21 | robustness | dyslexia_word_swap | Context (1): Ethanol (EtOH) exposure during ge... | do selenium or selenium plus folic acid-supple... | Context (1): Ethanol (EtOH) exposure during ge... | do selenium or selenium plus folic acid-supple... |
| 22 | robustness | dyslexia_word_swap | Context (1): To study the association between ... | are bone mineral density and vertebral fractur... | Context (1): To study the association between ... | are bone mineral density and vertebral fractur... |
| 23 | robustness | dyslexia_word_swap | Context (1): To assess the prevalence of porta... | is portal hypertensive colopathy associated wi... | Context (1): To assess the prevalence off port... | is portal hypertensive colopathy associated wi... |
| 24 | robustness | dyslexia_word_swap | Context (1): The hOGG1 Ser326Cys polymorphism ... | are the ape1 asp/asp genotype and the combinat... | Context (1): The hOGG1 Ser326Cys polymorphism ... | are the ape1 asp/asp genotype and the combinat... |
| 25 | robustness | dyslexia_word_swap | Context (1): Aldehyde dehydrogenase 1 (ALDH1) ... | does aldehyde dehydrogenase 1 expression corre... | Context (1): Aldehyde dehydrogenase 1 (ALDH1) ... | does aldehyde dehydrogenase 1 expression corre... |
| 26 | robustness | dyslexia_word_swap | Context (1): To screen and validate the global... | are hepatocyte growth factor receptor , matrix... | Context (1): To screen and validate the global... | are hepatocyte growth factor receptor , matrix... |
| 27 | robustness | dyslexia_word_swap | Context (1): The epidermal growth factor recep... | do assessment and prognostic analysis of egfr ... | Context (1): The epidermal growth factor recep... | do assessment and prognostic analysis off egfr... |
| 28 | robustness | dyslexia_word_swap | Context (1): Early appearance of antibodies sp... | are anti-type ii collagen antibodies , anti-cc... | Context (1): Early appearance off antibodies s... | are anti-type ii collagen antibodies , anti-cc... |
| 29 | robustness | dyslexia_word_swap | Context (1): To identify differences in perspe... | does concordance of family and staff member re... | Context (1): To identify differences in perspe... | does concordance off family and staff member r... |
| 30 | robustness | dyslexia_word_swap | Context (1): The NINDS trial demonstrated the ... | does prior asymptomatic parenchymal hemorrhage... | Context (1): The NINDS trial demonstrated the ... | does prior asymptomatic parenchymal hemorrhage... |
| 31 | robustness | dyslexia_word_swap | Context (1): We assessed whether dietary calci... | is dietary calcium but not elemental calcium f... | Context (1): We assessed whether dietary calci... | is dietary calcium but knot elemental calcium ... |
| 32 | robustness | dyslexia_word_swap | Context (1): To determine if fetal-placental h... | does fetal-placental hypoxia result from failu... | Context (1): To determine if fetal-placental h... | does fetal-placental hypoxia result from failu... |
| 33 | robustness | dyslexia_word_swap | Context (1): The contribution of HIV-exposure ... | do hiv-exposed children account for more than ... | Context (1): The contribution off HIV-exposure... | do hiv-exposed children account four more then... |
| 34 | robustness | dyslexia_word_swap | Context (1): Right ventricular dysfunction (RV... | is abnormal regulation of renin angiotensin al... | Context (1): Right ventricular dysfunction (RV... | is abnormal regulation off renin angiotensin a... |
| 35 | robustness | dyslexia_word_swap | Context (1): To conduct a comprehensive mappin... | are novel epigenetic changes in cdkn2a associa... | Context (1): To conduct a comprehensive mappin... | are novel epigenetic changes in cdkn2a associa... |
| 36 | robustness | dyslexia_word_swap | Context (1): Bile acids are signaling molecule... | does g-protein-coupled bile acid receptor play... | Context (1): Bile acids are signaling molecule... | does g-protein-coupled bile acid receptor play... |
| 37 | robustness | dyslexia_word_swap | Context (1): Cigarette smoking is a leading ca... | are smoking-induced gene expression changes in... | Context (1): Cigarette smoking is a leading ca... | are smoking-induced gene expression changes in... |
| 38 | robustness | dyslexia_word_swap | Context (1): Angiogenesis is a prerequisite fo... | does tumstatin transfected into human glioma c... | Context (1): Angiogenesis is a prerequisite fo... | does tumstatin transfected into human glioma c... |
| 39 | robustness | dyslexia_word_swap | Context (1): To investigate the correlation of... | does [ znf217 expression correlate with the bi... | Context (1): To investigate the correlation of... | does [ znf217 expression correlate with the bi... |
" + ], + "text/plain": [ + " category test_type \\\n", + "0 robustness add_ocr_typo \n", + "1 robustness add_ocr_typo \n", + "2 robustness add_ocr_typo \n", + "3 robustness add_ocr_typo \n", + "4 robustness add_ocr_typo \n", + "5 robustness add_ocr_typo \n", + "6 robustness add_ocr_typo \n", + "7 robustness add_ocr_typo \n", + "8 robustness add_ocr_typo \n", + "9 robustness add_ocr_typo \n", + "10 robustness add_ocr_typo \n", + "11 robustness add_ocr_typo \n", + "12 robustness add_ocr_typo \n", + "13 robustness add_ocr_typo \n", + "14 robustness add_ocr_typo \n", + "15 robustness add_ocr_typo \n", + "16 robustness add_ocr_typo \n", + "17 robustness add_ocr_typo \n", + "18 robustness add_ocr_typo \n", + "19 robustness add_ocr_typo \n", + "20 robustness dyslexia_word_swap \n", + "21 robustness dyslexia_word_swap \n", + "22 robustness dyslexia_word_swap \n", + "23 robustness dyslexia_word_swap \n", + "24 robustness dyslexia_word_swap \n", + "25 robustness dyslexia_word_swap \n", + "26 robustness dyslexia_word_swap \n", + "27 robustness dyslexia_word_swap \n", + "28 robustness dyslexia_word_swap \n", + "29 robustness dyslexia_word_swap \n", + "30 robustness dyslexia_word_swap \n", + "31 robustness dyslexia_word_swap \n", + "32 robustness dyslexia_word_swap \n", + "33 robustness dyslexia_word_swap \n", + "34 robustness dyslexia_word_swap \n", + "35 robustness dyslexia_word_swap \n", + "36 robustness dyslexia_word_swap \n", + "37 robustness dyslexia_word_swap \n", + "38 robustness dyslexia_word_swap \n", + "39 robustness dyslexia_word_swap \n", + "\n", + " original_context \\\n", + "0 Context (1): Cardiomyocyte proliferation gradu... \n", + "1 Context (1): Ethanol (EtOH) exposure during ge... \n", + "2 Context (1): To study the association between ... \n", + "3 Context (1): To assess the prevalence of porta... \n", + "4 Context (1): The hOGG1 Ser326Cys polymorphism ... \n", + "5 Context (1): Aldehyde dehydrogenase 1 (ALDH1) ... 
\n", + "6 Context (1): To screen and validate the global... \n", + "7 Context (1): The epidermal growth factor recep... \n", + "8 Context (1): Early appearance of antibodies sp... \n", + "9 Context (1): To identify differences in perspe... \n", + "10 Context (1): The NINDS trial demonstrated the ... \n", + "11 Context (1): We assessed whether dietary calci... \n", + "12 Context (1): To determine if fetal-placental h... \n", + "13 Context (1): The contribution of HIV-exposure ... \n", + "14 Context (1): Right ventricular dysfunction (RV... \n", + "15 Context (1): To conduct a comprehensive mappin... \n", + "16 Context (1): Bile acids are signaling molecule... \n", + "17 Context (1): Cigarette smoking is a leading ca... \n", + "18 Context (1): Angiogenesis is a prerequisite fo... \n", + "19 Context (1): To investigate the correlation of... \n", + "20 Context (1): Cardiomyocyte proliferation gradu... \n", + "21 Context (1): Ethanol (EtOH) exposure during ge... \n", + "22 Context (1): To study the association between ... \n", + "23 Context (1): To assess the prevalence of porta... \n", + "24 Context (1): The hOGG1 Ser326Cys polymorphism ... \n", + "25 Context (1): Aldehyde dehydrogenase 1 (ALDH1) ... \n", + "26 Context (1): To screen and validate the global... \n", + "27 Context (1): The epidermal growth factor recep... \n", + "28 Context (1): Early appearance of antibodies sp... \n", + "29 Context (1): To identify differences in perspe... \n", + "30 Context (1): The NINDS trial demonstrated the ... \n", + "31 Context (1): We assessed whether dietary calci... \n", + "32 Context (1): To determine if fetal-placental h... \n", + "33 Context (1): The contribution of HIV-exposure ... \n", + "34 Context (1): Right ventricular dysfunction (RV... \n", + "35 Context (1): To conduct a comprehensive mappin... \n", + "36 Context (1): Bile acids are signaling molecule... \n", + "37 Context (1): Cigarette smoking is a leading ca... 
\n", + "38 Context (1): Angiogenesis is a prerequisite fo... \n", + "39 Context (1): To investigate the correlation of... \n", + "\n", + " original_question \\\n", + "0 does fgf10 promote regional foetal cardiomyocy... \n", + "1 do selenium or selenium plus folic acid-supple... \n", + "2 are bone mineral density and vertebral fractur... \n", + "3 is portal hypertensive colopathy associated wi... \n", + "4 are the ape1 asp/asp genotype and the combinat... \n", + "5 does aldehyde dehydrogenase 1 expression corre... \n", + "6 are hepatocyte growth factor receptor , matrix... \n", + "7 do assessment and prognostic analysis of egfr ... \n", + "8 are anti-type ii collagen antibodies , anti-cc... \n", + "9 does concordance of family and staff member re... \n", + "10 does prior asymptomatic parenchymal hemorrhage... \n", + "11 is dietary calcium but not elemental calcium f... \n", + "12 does fetal-placental hypoxia result from failu... \n", + "13 do hiv-exposed children account for more than ... \n", + "14 is abnormal regulation of renin angiotensin al... \n", + "15 are novel epigenetic changes in cdkn2a associa... \n", + "16 does g-protein-coupled bile acid receptor play... \n", + "17 are smoking-induced gene expression changes in... \n", + "18 does tumstatin transfected into human glioma c... \n", + "19 does [ znf217 expression correlate with the bi... \n", + "20 does fgf10 promote regional foetal cardiomyocy... \n", + "21 do selenium or selenium plus folic acid-supple... \n", + "22 are bone mineral density and vertebral fractur... \n", + "23 is portal hypertensive colopathy associated wi... \n", + "24 are the ape1 asp/asp genotype and the combinat... \n", + "25 does aldehyde dehydrogenase 1 expression corre... \n", + "26 are hepatocyte growth factor receptor , matrix... \n", + "27 do assessment and prognostic analysis of egfr ... \n", + "28 are anti-type ii collagen antibodies , anti-cc... \n", + "29 does concordance of family and staff member re... 
\n", + "30 does prior asymptomatic parenchymal hemorrhage... \n", + "31 is dietary calcium but not elemental calcium f... \n", + "32 does fetal-placental hypoxia result from failu... \n", + "33 do hiv-exposed children account for more than ... \n", + "34 is abnormal regulation of renin angiotensin al... \n", + "35 are novel epigenetic changes in cdkn2a associa... \n", + "36 does g-protein-coupled bile acid receptor play... \n", + "37 are smoking-induced gene expression changes in... \n", + "38 does tumstatin transfected into human glioma c... \n", + "39 does [ znf217 expression correlate with the bi... \n", + "\n", + " perturbed_context \\\n", + "0 Context (1): Cardiomyocyte proliferation gradu... \n", + "1 Context (1): Ethanol (EtOH) exposure duriug ge... \n", + "2 Context (1): t^o studv tl)e association betAve... \n", + "3 Context (1): t^o assess t^e prevalence of port... \n", + "4 Context (1): tlic hOGG1 Ser326Cys polymorphism... \n", + "5 Context (1): Aldehyde dehydrogenase 1 (ALDH1) ... \n", + "6 Context (1): t^o screen a^d validate tbe globa... \n", + "7 Context (1): tlie epidermal gro^vth faftor rec... \n", + "8 Context (1): earlv appearance of antibodies fp... \n", + "9 Context (1): t^o identifv differences i^n pers... \n", + "10 Context (1): tl)e NINDS tryaU demonstrated the... \n", + "11 Context (1): w^e assessed vliether dietary cal... \n", + "12 Context (1): t^o determin6 if fetal-placental ... \n", + "13 Context (1): tbe contribution of HIV-exposure ... \n", + "14 Context (1): riglit ventricular dysfunction (R... \n", + "15 Context (1): t^o conduct a comprehensive mappi... \n", + "16 Context (1): Bile acids are signaling molecule... \n", + "17 Context (1): Cigarette smoking is a leading ca... \n", + "18 Context (1): Angiogenesis is a prerequisite f^... \n", + "19 Context (1): t^o investigate th^e correlation ... \n", + "20 Context (1): Cardiomyocyte proliferation gradu... \n", + "21 Context (1): Ethanol (EtOH) exposure during ge... 
\n", + "22 Context (1): To study the association between ... \n", + "23 Context (1): To assess the prevalence off port... \n", + "24 Context (1): The hOGG1 Ser326Cys polymorphism ... \n", + "25 Context (1): Aldehyde dehydrogenase 1 (ALDH1) ... \n", + "26 Context (1): To screen and validate the global... \n", + "27 Context (1): The epidermal growth factor recep... \n", + "28 Context (1): Early appearance off antibodies s... \n", + "29 Context (1): To identify differences in perspe... \n", + "30 Context (1): The NINDS trial demonstrated the ... \n", + "31 Context (1): We assessed whether dietary calci... \n", + "32 Context (1): To determine if fetal-placental h... \n", + "33 Context (1): The contribution off HIV-exposure... \n", + "34 Context (1): Right ventricular dysfunction (RV... \n", + "35 Context (1): To conduct a comprehensive mappin... \n", + "36 Context (1): Bile acids are signaling molecule... \n", + "37 Context (1): Cigarette smoking is a leading ca... \n", + "38 Context (1): Angiogenesis is a prerequisite fo... \n", + "39 Context (1): To investigate the correlation of... \n", + "\n", + " perturbed_question \n", + "0 does fgf10 promote regional foetal cardiomyocy... \n", + "1 do selenium or selenium plus folic acid-supple... \n", + "2 are bone mineral density an^d vertebral fractu... \n", + "3 is portal hypertensive colopathy associated y/... \n", + "4 are tle ape1 asp/asp genotype a^nd tle combina... \n", + "5 does aldehyde dehydrogenase 1 expression corre... \n", + "6 are hepatocyte grov/th faclor receptor , matri... \n", + "7 do assessment a^d prognostic analylis of egfr ... \n", + "8 are anti-type ii collagen antibodies , anti-cc... \n", + "9 does concordance of lamily an^d stafl memher r... \n", + "10 does prior asymptomatic parenchymal hemorrhage... \n", + "11 is dietary calcium b^ut n^ot elemental calcium... \n", + "12 does fetal-placental hypoxia rcfult lrom failu... \n", + "13 do hiv-exposed children accouut f^or morc tlia... 
\n", + "14 is abnormal regulation of renin angiotensin al... \n", + "15 are novel epigenetic changes i^n cdkn2a associ... \n", + "16 does g-protein-coupled bile acid receptor p1ay... \n", + "17 are smoking-induced gene expression changes i^... \n", + "18 does tumstatin transfected int6 hnman glioma c... \n", + "19 does [ znf217 expression correlate v«ith t^ie ... \n", + "20 does fgf10 promote regional foetal cardiomyocy... \n", + "21 do selenium or selenium plus folic acid-supple... \n", + "22 are bone mineral density and vertebral fractur... \n", + "23 is portal hypertensive colopathy associated wi... \n", + "24 are the ape1 asp/asp genotype and the combinat... \n", + "25 does aldehyde dehydrogenase 1 expression corre... \n", + "26 are hepatocyte growth factor receptor , matrix... \n", + "27 do assessment and prognostic analysis off egfr... \n", + "28 are anti-type ii collagen antibodies , anti-cc... \n", + "29 does concordance off family and staff member r... \n", + "30 does prior asymptomatic parenchymal hemorrhage... \n", + "31 is dietary calcium but knot elemental calcium ... \n", + "32 does fetal-placental hypoxia result from failu... \n", + "33 do hiv-exposed children account four more then... \n", + "34 is abnormal regulation off renin angiotensin a... \n", + "35 are novel epigenetic changes in cdkn2a associa... \n", + "36 does g-protein-coupled bile acid receptor play... \n", + "37 are smoking-induced gene expression changes in... \n", + "38 does tumstatin transfected into human glioma c... \n", + "39 does [ znf217 expression correlate with the bi... 
" + ] + }, + "execution_count": 19, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "harness.testcases()" + ] + }, + { + "cell_type": "markdown", + "id": "7n6_dNzG89q6", + "metadata": { + "id": "7n6_dNzG89q6" + }, + "source": [ + "### Running the tests" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "8051391e-1634-4937-a599-78ed7dc5a66e", + "metadata": { + "execution": { + "iopub.execute_input": "2023-11-30T21:32:26.723020Z", + "iopub.status.busy": "2023-11-30T21:32:26.722868Z", + "iopub.status.idle": "2023-11-30T21:33:08.952765Z", + "shell.execute_reply": "2023-11-30T21:33:08.952297Z", + "shell.execute_reply.started": "2023-11-30T21:32:26.723006Z" + }, + "id": "8051391e-1634-4937-a599-78ed7dc5a66e", + "outputId": "c92d313d-64f2-4fa7-9e25-f3179939c171", + "tags": [] + }, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "Running testcases... : 100%|██████████| 40/40 [00:42<00:00, 1.05s/it]\n" + ] + }, + { + "data": { + "text/plain": [] + }, + "execution_count": 20, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "harness.run()" + ] + }, + { + "cell_type": "markdown", + "id": "WxrWelHu89q6", + "metadata": { + "id": "WxrWelHu89q6" + }, + "source": [ + "### Generated Results" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "3398a5fe-f821-4802-a9da-4c373ef827ed", + "metadata": { + "execution": { + "iopub.execute_input": "2023-11-30T21:33:08.953551Z", + "iopub.status.busy": "2023-11-30T21:33:08.953384Z", + "iopub.status.idle": "2023-11-30T21:33:08.967183Z", + "shell.execute_reply": "2023-11-30T21:33:08.966721Z", + "shell.execute_reply.started": "2023-11-30T21:33:08.953536Z" + }, + "id": "3398a5fe-f821-4802-a9da-4c373ef827ed", + "outputId": "8c9d5999-25d5-4003-d518-65c8546b4d1f", + "tags": [] + }, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " 
\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " 
\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
| | category | test_type | original_context | original_question | perturbed_context | perturbed_question | expected_result | actual_result | eval_score | pass |
| - | - | - | - | - | - | - | - | - | - | - |
| 0 | robustness | add_ocr_typo | Context (1): Cardiomyocyte proliferation gradu... | does fgf10 promote regional foetal cardiomyocy... | Context (1): Cardiomyocyte proliferation gradu... | does fgf10 promote regional foetal cardiomyocy... | \\nYes. | \\nYes | 0.083333 | True |
| 1 | robustness | add_ocr_typo | Context (1): Ethanol (EtOH) exposure during ge... | do selenium or selenium plus folic acid-supple... | Context (1): Ethanol (EtOH) exposure duriug ge... | do selenium or selenium plus folic acid-supple... | \\nYes | \\nYes | 0.000000 | True |
| 2 | robustness | add_ocr_typo | Context (1): To study the association between ... | are bone mineral density and vertebral fractur... | Context (1): t^o studv tl)e association betAve... | are bone mineral density an^d vertebral fractu... | \\nYes | \\nYes | 0.000000 | True |
| 3 | robustness | add_ocr_typo | Context (1): To assess the prevalence of porta... | is portal hypertensive colopathy associated wi... | Context (1): t^o assess t^e prevalence of port... | is portal hypertensive colopathy associated y/... | \\nYes | \\nYes | 0.000000 | True |
| 4 | robustness | add_ocr_typo | Context (1): The hOGG1 Ser326Cys polymorphism ... | are the ape1 asp/asp genotype and the combinat... | Context (1): tlic hOGG1 Ser326Cys polymorphism... | are tle ape1 asp/asp genotype a^nd tle combina... | \\nYes | \\nNo | 1.000000 | False |
| 5 | robustness | add_ocr_typo | Context (1): Aldehyde dehydrogenase 1 (ALDH1) ... | does aldehyde dehydrogenase 1 expression corre... | Context (1): Aldehyde dehydrogenase 1 (ALDH1) ... | does aldehyde dehydrogenase 1 expression corre... | \\nYes | \\nYes | 0.000000 | True |
| 6 | robustness | add_ocr_typo | Context (1): To screen and validate the global... | are hepatocyte growth factor receptor , matrix... | Context (1): t^o screen a^d validate tbe globa... | are hepatocyte grov/th faclor receptor , matri... | \\nYes | \\nYes | 0.000000 | True |
| 7 | robustness | add_ocr_typo | Context (1): The epidermal growth factor recep... | do assessment and prognostic analysis of egfr ... | Context (1): tlie epidermal gro^vth faftor rec... | do assessment a^d prognostic analylis of egfr ... | \\nYes. | \\nYes | 0.083333 | True |
| 8 | robustness | add_ocr_typo | Context (1): Early appearance of antibodies sp... | are anti-type ii collagen antibodies , anti-cc... | Context (1): earlv appearance of antibodies fp... | are anti-type ii collagen antibodies , anti-cc... | \\nYes | \\nYes | 0.000000 | True |
| 9 | robustness | add_ocr_typo | Context (1): To identify differences in perspe... | does concordance of family and staff member re... | Context (1): t^o identifv differences i^n pers... | does concordance of lamily an^d stafl memher r... | \\nYes | \\nYes | 0.000000 | True |
| 10 | robustness | add_ocr_typo | Context (1): The NINDS trial demonstrated the ... | does prior asymptomatic parenchymal hemorrhage... | Context (1): tl)e NINDS tryaU demonstrated the... | does prior asymptomatic parenchymal hemorrhage... | \\nYes | \\nYes | 0.000000 | True |
| 11 | robustness | add_ocr_typo | Context (1): We assessed whether dietary calci... | is dietary calcium but not elemental calcium f... | Context (1): w^e assessed vliether dietary cal... | is dietary calcium b^ut n^ot elemental calcium... | \\nYes | \\nYes | 0.000000 | True |
| 12 | robustness | add_ocr_typo | Context (1): To determine if fetal-placental h... | does fetal-placental hypoxia result from failu... | Context (1): t^o determin6 if fetal-placental ... | does fetal-placental hypoxia rcfult lrom failu... | \\nYes | \\nYes | 0.000000 | True |
| 13 | robustness | add_ocr_typo | Context (1): The contribution of HIV-exposure ... | do hiv-exposed children account for more than ... | Context (1): tbe contribution of HIV-exposure ... | do hiv-exposed children accouut f^or morc tlia... | \\nYes. | \\nNo | 1.000000 | False |
| 14 | robustness | add_ocr_typo | Context (1): Right ventricular dysfunction (RV... | is abnormal regulation of renin angiotensin al... | Context (1): riglit ventricular dysfunction (R... | is abnormal regulation of renin angiotensin al... | \\nYes | \\nYes | 0.000000 | True |
| 15 | robustness | add_ocr_typo | Context (1): To conduct a comprehensive mappin... | are novel epigenetic changes in cdkn2a associa... | Context (1): t^o conduct a comprehensive mappi... | are novel epigenetic changes i^n cdkn2a associ... | \\nYes | \\nYes | 0.000000 | True |
| 16 | robustness | add_ocr_typo | Context (1): Bile acids are signaling molecule... | does g-protein-coupled bile acid receptor play... | Context (1): Bile acids are signaling molecule... | does g-protein-coupled bile acid receptor p1ay... | \\nYes | \\nYes | 0.000000 | True |
| 17 | robustness | add_ocr_typo | Context (1): Cigarette smoking is a leading ca... | are smoking-induced gene expression changes in... | Context (1): Cigarette smoking is a leading ca... | are smoking-induced gene expression changes i^... | \\nYes | \\nYes | 0.000000 | True |
| 18 | robustness | add_ocr_typo | Context (1): Angiogenesis is a prerequisite fo... | does tumstatin transfected into human glioma c... | Context (1): Angiogenesis is a prerequisite f^... | does tumstatin transfected int6 hnman glioma c... | \\nYes | \\nYes | 0.000000 | True |
| 19 | robustness | add_ocr_typo | Context (1): To investigate the correlation of... | does [ znf217 expression correlate with the bi... | Context (1): t^o investigate th^e correlation ... | does [ znf217 expression correlate v«ith t^ie ... | \\nYes | \\nYes | 0.000000 | True |
| 20 | robustness | dyslexia_word_swap | Context (1): Cardiomyocyte proliferation gradu... | does fgf10 promote regional foetal cardiomyocy... | Context (1): Cardiomyocyte proliferation gradu... | does fgf10 promote regional foetal cardiomyocy... | \\nYes | \\nYes | 0.000000 | True |
| 21 | robustness | dyslexia_word_swap | Context (1): Ethanol (EtOH) exposure during ge... | do selenium or selenium plus folic acid-supple... | Context (1): Ethanol (EtOH) exposure during ge... | do selenium or selenium plus folic acid-supple... | \\nYes | \\nYes | 0.000000 | True |
| 22 | robustness | dyslexia_word_swap | Context (1): To study the association between ... | are bone mineral density and vertebral fractur... | Context (1): To study the association between ... | are bone mineral density and vertebral fractur... | \\nYes | \\nYes | 0.000000 | True |
| 23 | robustness | dyslexia_word_swap | Context (1): To assess the prevalence of porta... | is portal hypertensive colopathy associated wi... | Context (1): To assess the prevalence off port... | is portal hypertensive colopathy associated wi... | \\nYes | \\nYes | 0.000000 | True |
| 24 | robustness | dyslexia_word_swap | Context (1): The hOGG1 Ser326Cys polymorphism ... | are the ape1 asp/asp genotype and the combinat... | Context (1): The hOGG1 Ser326Cys polymorphism ... | are the ape1 asp/asp genotype and the combinat... | \\nYes. | \\nYes | 0.083333 | True |
| 25 | robustness | dyslexia_word_swap | Context (1): Aldehyde dehydrogenase 1 (ALDH1) ... | does aldehyde dehydrogenase 1 expression corre... | Context (1): Aldehyde dehydrogenase 1 (ALDH1) ... | does aldehyde dehydrogenase 1 expression corre... | \\nYes | \\nYes | 0.000000 | True |
| 26 | robustness | dyslexia_word_swap | Context (1): To screen and validate the global... | are hepatocyte growth factor receptor , matrix... | Context (1): To screen and validate the global... | are hepatocyte growth factor receptor , matrix... | \\nYes | \\nYes | 0.000000 | True |
| 27 | robustness | dyslexia_word_swap | Context (1): The epidermal growth factor recep... | do assessment and prognostic analysis of egfr ... | Context (1): The epidermal growth factor recep... | do assessment and prognostic analysis off egfr... | \\nYes | \\nYes. | 0.083333 | True |
| 28 | robustness | dyslexia_word_swap | Context (1): Early appearance of antibodies sp... | are anti-type ii collagen antibodies , anti-cc... | Context (1): Early appearance off antibodies s... | are anti-type ii collagen antibodies , anti-cc... | \\nYes | \\nYes | 0.000000 | True |
| 29 | robustness | dyslexia_word_swap | Context (1): To identify differences in perspe... | does concordance of family and staff member re... | Context (1): To identify differences in perspe... | does concordance off family and staff member r... | \\nYes | \\nYes | 0.000000 | True |
| 30 | robustness | dyslexia_word_swap | Context (1): The NINDS trial demonstrated the ... | does prior asymptomatic parenchymal hemorrhage... | Context (1): The NINDS trial demonstrated the ... | does prior asymptomatic parenchymal hemorrhage... | \\nYes | \\nYes | 0.000000 | True |
| 31 | robustness | dyslexia_word_swap | Context (1): We assessed whether dietary calci... | is dietary calcium but not elemental calcium f... | Context (1): We assessed whether dietary calci... | is dietary calcium but knot elemental calcium ... | \\nYes | \\nYes | 0.000000 | True |
| 32 | robustness | dyslexia_word_swap | Context (1): To determine if fetal-placental h... | does fetal-placental hypoxia result from failu... | Context (1): To determine if fetal-placental h... | does fetal-placental hypoxia result from failu... | \\nYes | \\nNo | 1.000000 | False |
| 33 | robustness | dyslexia_word_swap | Context (1): The contribution of HIV-exposure ... | do hiv-exposed children account for more than ... | Context (1): The contribution off HIV-exposure... | do hiv-exposed children account four more then... | \\nYes | \\nYes | 0.000000 | True |
| 34 | robustness | dyslexia_word_swap | Context (1): Right ventricular dysfunction (RV... | is abnormal regulation of renin angiotensin al... | Context (1): Right ventricular dysfunction (RV... | is abnormal regulation off renin angiotensin a... | \\nYes | \\nYes | 0.000000 | True |
| 35 | robustness | dyslexia_word_swap | Context (1): To conduct a comprehensive mappin... | are novel epigenetic changes in cdkn2a associa... | Context (1): To conduct a comprehensive mappin... | are novel epigenetic changes in cdkn2a associa... | \\nYes | \\nYes | 0.000000 | True |
| 36 | robustness | dyslexia_word_swap | Context (1): Bile acids are signaling molecule... | does g-protein-coupled bile acid receptor play... | Context (1): Bile acids are signaling molecule... | does g-protein-coupled bile acid receptor play... | \\nYes | \\nYes | 0.000000 | True |
| 37 | robustness | dyslexia_word_swap | Context (1): Cigarette smoking is a leading ca... | are smoking-induced gene expression changes in... | Context (1): Cigarette smoking is a leading ca... | are smoking-induced gene expression changes in... | \\nYes. | \\nYes | 0.083333 | True |
| 38 | robustness | dyslexia_word_swap | Context (1): Angiogenesis is a prerequisite fo... | does tumstatin transfected into human glioma c... | Context (1): Angiogenesis is a prerequisite fo... | does tumstatin transfected into human glioma c... | \\nYes | \\nYes. | 0.083333 | True |
| 39 | robustness | dyslexia_word_swap | Context (1): To investigate the correlation of... | does [ znf217 expression correlate with the bi... | Context (1): To investigate the correlation of... | does [ znf217 expression correlate with the bi... | \\nYes | \\nYes | 0.000000 | True |
\n", + "
" + ], + "text/plain": [ + " category test_type \\\n", + "0 robustness add_ocr_typo \n", + "1 robustness add_ocr_typo \n", + "2 robustness add_ocr_typo \n", + "3 robustness add_ocr_typo \n", + "4 robustness add_ocr_typo \n", + "5 robustness add_ocr_typo \n", + "6 robustness add_ocr_typo \n", + "7 robustness add_ocr_typo \n", + "8 robustness add_ocr_typo \n", + "9 robustness add_ocr_typo \n", + "10 robustness add_ocr_typo \n", + "11 robustness add_ocr_typo \n", + "12 robustness add_ocr_typo \n", + "13 robustness add_ocr_typo \n", + "14 robustness add_ocr_typo \n", + "15 robustness add_ocr_typo \n", + "16 robustness add_ocr_typo \n", + "17 robustness add_ocr_typo \n", + "18 robustness add_ocr_typo \n", + "19 robustness add_ocr_typo \n", + "20 robustness dyslexia_word_swap \n", + "21 robustness dyslexia_word_swap \n", + "22 robustness dyslexia_word_swap \n", + "23 robustness dyslexia_word_swap \n", + "24 robustness dyslexia_word_swap \n", + "25 robustness dyslexia_word_swap \n", + "26 robustness dyslexia_word_swap \n", + "27 robustness dyslexia_word_swap \n", + "28 robustness dyslexia_word_swap \n", + "29 robustness dyslexia_word_swap \n", + "30 robustness dyslexia_word_swap \n", + "31 robustness dyslexia_word_swap \n", + "32 robustness dyslexia_word_swap \n", + "33 robustness dyslexia_word_swap \n", + "34 robustness dyslexia_word_swap \n", + "35 robustness dyslexia_word_swap \n", + "36 robustness dyslexia_word_swap \n", + "37 robustness dyslexia_word_swap \n", + "38 robustness dyslexia_word_swap \n", + "39 robustness dyslexia_word_swap \n", + "\n", + " original_context \\\n", + "0 Context (1): Cardiomyocyte proliferation gradu... \n", + "1 Context (1): Ethanol (EtOH) exposure during ge... \n", + "2 Context (1): To study the association between ... \n", + "3 Context (1): To assess the prevalence of porta... \n", + "4 Context (1): The hOGG1 Ser326Cys polymorphism ... \n", + "5 Context (1): Aldehyde dehydrogenase 1 (ALDH1) ... 
\n", + "6 Context (1): To screen and validate the global... \n", + "7 Context (1): The epidermal growth factor recep... \n", + "8 Context (1): Early appearance of antibodies sp... \n", + "9 Context (1): To identify differences in perspe... \n", + "10 Context (1): The NINDS trial demonstrated the ... \n", + "11 Context (1): We assessed whether dietary calci... \n", + "12 Context (1): To determine if fetal-placental h... \n", + "13 Context (1): The contribution of HIV-exposure ... \n", + "14 Context (1): Right ventricular dysfunction (RV... \n", + "15 Context (1): To conduct a comprehensive mappin... \n", + "16 Context (1): Bile acids are signaling molecule... \n", + "17 Context (1): Cigarette smoking is a leading ca... \n", + "18 Context (1): Angiogenesis is a prerequisite fo... \n", + "19 Context (1): To investigate the correlation of... \n", + "20 Context (1): Cardiomyocyte proliferation gradu... \n", + "21 Context (1): Ethanol (EtOH) exposure during ge... \n", + "22 Context (1): To study the association between ... \n", + "23 Context (1): To assess the prevalence of porta... \n", + "24 Context (1): The hOGG1 Ser326Cys polymorphism ... \n", + "25 Context (1): Aldehyde dehydrogenase 1 (ALDH1) ... \n", + "26 Context (1): To screen and validate the global... \n", + "27 Context (1): The epidermal growth factor recep... \n", + "28 Context (1): Early appearance of antibodies sp... \n", + "29 Context (1): To identify differences in perspe... \n", + "30 Context (1): The NINDS trial demonstrated the ... \n", + "31 Context (1): We assessed whether dietary calci... \n", + "32 Context (1): To determine if fetal-placental h... \n", + "33 Context (1): The contribution of HIV-exposure ... \n", + "34 Context (1): Right ventricular dysfunction (RV... \n", + "35 Context (1): To conduct a comprehensive mappin... \n", + "36 Context (1): Bile acids are signaling molecule... \n", + "37 Context (1): Cigarette smoking is a leading ca... 
\n", + "38 Context (1): Angiogenesis is a prerequisite fo... \n", + "39 Context (1): To investigate the correlation of... \n", + "\n", + " original_question \\\n", + "0 does fgf10 promote regional foetal cardiomyocy... \n", + "1 do selenium or selenium plus folic acid-supple... \n", + "2 are bone mineral density and vertebral fractur... \n", + "3 is portal hypertensive colopathy associated wi... \n", + "4 are the ape1 asp/asp genotype and the combinat... \n", + "5 does aldehyde dehydrogenase 1 expression corre... \n", + "6 are hepatocyte growth factor receptor , matrix... \n", + "7 do assessment and prognostic analysis of egfr ... \n", + "8 are anti-type ii collagen antibodies , anti-cc... \n", + "9 does concordance of family and staff member re... \n", + "10 does prior asymptomatic parenchymal hemorrhage... \n", + "11 is dietary calcium but not elemental calcium f... \n", + "12 does fetal-placental hypoxia result from failu... \n", + "13 do hiv-exposed children account for more than ... \n", + "14 is abnormal regulation of renin angiotensin al... \n", + "15 are novel epigenetic changes in cdkn2a associa... \n", + "16 does g-protein-coupled bile acid receptor play... \n", + "17 are smoking-induced gene expression changes in... \n", + "18 does tumstatin transfected into human glioma c... \n", + "19 does [ znf217 expression correlate with the bi... \n", + "20 does fgf10 promote regional foetal cardiomyocy... \n", + "21 do selenium or selenium plus folic acid-supple... \n", + "22 are bone mineral density and vertebral fractur... \n", + "23 is portal hypertensive colopathy associated wi... \n", + "24 are the ape1 asp/asp genotype and the combinat... \n", + "25 does aldehyde dehydrogenase 1 expression corre... \n", + "26 are hepatocyte growth factor receptor , matrix... \n", + "27 do assessment and prognostic analysis of egfr ... \n", + "28 are anti-type ii collagen antibodies , anti-cc... \n", + "29 does concordance of family and staff member re... 
\n", + "30 does prior asymptomatic parenchymal hemorrhage... \n", + "31 is dietary calcium but not elemental calcium f... \n", + "32 does fetal-placental hypoxia result from failu... \n", + "33 do hiv-exposed children account for more than ... \n", + "34 is abnormal regulation of renin angiotensin al... \n", + "35 are novel epigenetic changes in cdkn2a associa... \n", + "36 does g-protein-coupled bile acid receptor play... \n", + "37 are smoking-induced gene expression changes in... \n", + "38 does tumstatin transfected into human glioma c... \n", + "39 does [ znf217 expression correlate with the bi... \n", + "\n", + " perturbed_context \\\n", + "0 Context (1): Cardiomyocyte proliferation gradu... \n", + "1 Context (1): Ethanol (EtOH) exposure duriug ge... \n", + "2 Context (1): t^o studv tl)e association betAve... \n", + "3 Context (1): t^o assess t^e prevalence of port... \n", + "4 Context (1): tlic hOGG1 Ser326Cys polymorphism... \n", + "5 Context (1): Aldehyde dehydrogenase 1 (ALDH1) ... \n", + "6 Context (1): t^o screen a^d validate tbe globa... \n", + "7 Context (1): tlie epidermal gro^vth faftor rec... \n", + "8 Context (1): earlv appearance of antibodies fp... \n", + "9 Context (1): t^o identifv differences i^n pers... \n", + "10 Context (1): tl)e NINDS tryaU demonstrated the... \n", + "11 Context (1): w^e assessed vliether dietary cal... \n", + "12 Context (1): t^o determin6 if fetal-placental ... \n", + "13 Context (1): tbe contribution of HIV-exposure ... \n", + "14 Context (1): riglit ventricular dysfunction (R... \n", + "15 Context (1): t^o conduct a comprehensive mappi... \n", + "16 Context (1): Bile acids are signaling molecule... \n", + "17 Context (1): Cigarette smoking is a leading ca... \n", + "18 Context (1): Angiogenesis is a prerequisite f^... \n", + "19 Context (1): t^o investigate th^e correlation ... \n", + "20 Context (1): Cardiomyocyte proliferation gradu... \n", + "21 Context (1): Ethanol (EtOH) exposure during ge... 
\n", + "22 Context (1): To study the association between ... \n", + "23 Context (1): To assess the prevalence off port... \n", + "24 Context (1): The hOGG1 Ser326Cys polymorphism ... \n", + "25 Context (1): Aldehyde dehydrogenase 1 (ALDH1) ... \n", + "26 Context (1): To screen and validate the global... \n", + "27 Context (1): The epidermal growth factor recep... \n", + "28 Context (1): Early appearance off antibodies s... \n", + "29 Context (1): To identify differences in perspe... \n", + "30 Context (1): The NINDS trial demonstrated the ... \n", + "31 Context (1): We assessed whether dietary calci... \n", + "32 Context (1): To determine if fetal-placental h... \n", + "33 Context (1): The contribution off HIV-exposure... \n", + "34 Context (1): Right ventricular dysfunction (RV... \n", + "35 Context (1): To conduct a comprehensive mappin... \n", + "36 Context (1): Bile acids are signaling molecule... \n", + "37 Context (1): Cigarette smoking is a leading ca... \n", + "38 Context (1): Angiogenesis is a prerequisite fo... \n", + "39 Context (1): To investigate the correlation of... \n", + "\n", + " perturbed_question expected_result \\\n", + "0 does fgf10 promote regional foetal cardiomyocy... \\nYes. \n", + "1 do selenium or selenium plus folic acid-supple... \\nYes \n", + "2 are bone mineral density an^d vertebral fractu... \\nYes \n", + "3 is portal hypertensive colopathy associated y/... \\nYes \n", + "4 are tle ape1 asp/asp genotype a^nd tle combina... \\nYes \n", + "5 does aldehyde dehydrogenase 1 expression corre... \\nYes \n", + "6 are hepatocyte grov/th faclor receptor , matri... \\nYes \n", + "7 do assessment a^d prognostic analylis of egfr ... \\nYes. \n", + "8 are anti-type ii collagen antibodies , anti-cc... \\nYes \n", + "9 does concordance of lamily an^d stafl memher r... \\nYes \n", + "10 does prior asymptomatic parenchymal hemorrhage... \\nYes \n", + "11 is dietary calcium b^ut n^ot elemental calcium... 
\\nYes \n", + "12 does fetal-placental hypoxia rcfult lrom failu... \\nYes \n", + "13 do hiv-exposed children accouut f^or morc tlia... \\nYes. \n", + "14 is abnormal regulation of renin angiotensin al... \\nYes \n", + "15 are novel epigenetic changes i^n cdkn2a associ... \\nYes \n", + "16 does g-protein-coupled bile acid receptor p1ay... \\nYes \n", + "17 are smoking-induced gene expression changes i^... \\nYes \n", + "18 does tumstatin transfected int6 hnman glioma c... \\nYes \n", + "19 does [ znf217 expression correlate v«ith t^ie ... \\nYes \n", + "20 does fgf10 promote regional foetal cardiomyocy... \\nYes \n", + "21 do selenium or selenium plus folic acid-supple... \\nYes \n", + "22 are bone mineral density and vertebral fractur... \\nYes \n", + "23 is portal hypertensive colopathy associated wi... \\nYes \n", + "24 are the ape1 asp/asp genotype and the combinat... \\nYes. \n", + "25 does aldehyde dehydrogenase 1 expression corre... \\nYes \n", + "26 are hepatocyte growth factor receptor , matrix... \\nYes \n", + "27 do assessment and prognostic analysis off egfr... \\nYes \n", + "28 are anti-type ii collagen antibodies , anti-cc... \\nYes \n", + "29 does concordance off family and staff member r... \\nYes \n", + "30 does prior asymptomatic parenchymal hemorrhage... \\nYes \n", + "31 is dietary calcium but knot elemental calcium ... \\nYes \n", + "32 does fetal-placental hypoxia result from failu... \\nYes \n", + "33 do hiv-exposed children account four more then... \\nYes \n", + "34 is abnormal regulation off renin angiotensin a... \\nYes \n", + "35 are novel epigenetic changes in cdkn2a associa... \\nYes \n", + "36 does g-protein-coupled bile acid receptor play... \\nYes \n", + "37 are smoking-induced gene expression changes in... \\nYes. \n", + "38 does tumstatin transfected into human glioma c... \\nYes \n", + "39 does [ znf217 expression correlate with the bi... 
\\nYes \n", + "\n", + " actual_result eval_score pass \n", + "0 \\nYes 0.083333 True \n", + "1 \\nYes 0.000000 True \n", + "2 \\nYes 0.000000 True \n", + "3 \\nYes 0.000000 True \n", + "4 \\nNo 1.000000 False \n", + "5 \\nYes 0.000000 True \n", + "6 \\nYes 0.000000 True \n", + "7 \\nYes 0.083333 True \n", + "8 \\nYes 0.000000 True \n", + "9 \\nYes 0.000000 True \n", + "10 \\nYes 0.000000 True \n", + "11 \\nYes 0.000000 True \n", + "12 \\nYes 0.000000 True \n", + "13 \\nNo 1.000000 False \n", + "14 \\nYes 0.000000 True \n", + "15 \\nYes 0.000000 True \n", + "16 \\nYes 0.000000 True \n", + "17 \\nYes 0.000000 True \n", + "18 \\nYes 0.000000 True \n", + "19 \\nYes 0.000000 True \n", + "20 \\nYes 0.000000 True \n", + "21 \\nYes 0.000000 True \n", + "22 \\nYes 0.000000 True \n", + "23 \\nYes 0.000000 True \n", + "24 \\nYes 0.083333 True \n", + "25 \\nYes 0.000000 True \n", + "26 \\nYes 0.000000 True \n", + "27 \\nYes. 0.083333 True \n", + "28 \\nYes 0.000000 True \n", + "29 \\nYes 0.000000 True \n", + "30 \\nYes 0.000000 True \n", + "31 \\nYes 0.000000 True \n", + "32 \\nNo 1.000000 False \n", + "33 \\nYes 0.000000 True \n", + "34 \\nYes 0.000000 True \n", + "35 \\nYes 0.000000 True \n", + "36 \\nYes 0.000000 True \n", + "37 \\nYes 0.083333 True \n", + "38 \\nYes. 
0.083333 True \n", + "39 \\nYes 0.000000 True " + ] + }, + "execution_count": 21, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "harness.generated_results()" + ] + }, + { + "cell_type": "markdown", + "id": "4l8c0eHO89q6", + "metadata": { + "id": "4l8c0eHO89q6" + }, + "source": [ + "### Final Results" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "eb2a1242-d7c6-4390-a088-60b5c016f768", + "metadata": { + "execution": { + "iopub.execute_input": "2023-11-30T21:33:08.968192Z", + "iopub.status.busy": "2023-11-30T21:33:08.967868Z", + "iopub.status.idle": "2023-11-30T21:33:09.061303Z", + "shell.execute_reply": "2023-11-30T21:33:09.060858Z", + "shell.execute_reply.started": "2023-11-30T21:33:08.968176Z" + }, + "id": "eb2a1242-d7c6-4390-a088-60b5c016f768", + "outputId": "5a99c833-6f8e-4a0b-9fdb-4c6d9c81b1c6", + "tags": [] + }, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
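| | category | test_type | fail_count | pass_count | pass_rate | minimum_pass_rate | pass |
| - | - | - | - | - | - | - | - |
| 0 | robustness | add_ocr_typo | 2 | 18 | 90% | 66% | True |
| 1 | robustness | dyslexia_word_swap | 1 | 19 | 95% | 60% | True |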
\n", + "
" + ], + "text/plain": [ + " category test_type fail_count pass_count pass_rate \\\n", + "0 robustness add_ocr_typo 2 18 90% \n", + "1 robustness dyslexia_word_swap 1 19 95% \n", + "\n", + " minimum_pass_rate pass \n", + "0 66% True \n", + "1 60% True " + ] + }, + "execution_count": 22, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "harness.report()" + ] + } + ], + "metadata": { + "colab": { + "provenance": [] + }, + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.9" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/demo/tutorials/llm_notebooks/dataset-notebooks/mmlu_dataset.ipynb b/demo/tutorials/llm_notebooks/dataset-notebooks/mmlu_dataset.ipynb index 4c3817a7c..2c051521d 100644 --- a/demo/tutorials/llm_notebooks/dataset-notebooks/mmlu_dataset.ipynb +++ b/demo/tutorials/llm_notebooks/dataset-notebooks/mmlu_dataset.ipynb @@ -1 +1 @@ -{"cells":[{"cell_type":"markdown","metadata":{"id":"-euMnuisAIDX"},"source":["![image.png]()"]},{"cell_type":"markdown","metadata":{"id":"_-k2O6KeLI1D"},"source":["[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/langtest/blob/main/demo/tutorials/llm_notebooks/dataset-notebooks/mmlu_dataset.ipynb)"]},{"cell_type":"markdown","metadata":{"id":"wCxsD2KDAWU2"},"source":["**LangTest** is an open-source python library designed to help developers deliver safe and effective Natural Language Processing (NLP) models. Whether you are using **John Snow Labs, Hugging Face, Spacy** models or **OpenAI, Cohere, AI21, Hugging Face Inference API and Azure-OpenAI** based LLMs, it has got you covered. 
You can test any Named Entity Recognition (NER) or Text Classification model using the library. We also support testing LLMs for Question-Answering and Summarization tasks on benchmark datasets. The library supports 50+ out-of-the-box tests. These tests fall into robustness, accuracy, bias, representation and fairness test categories.\n","\n","Metrics are calculated by comparing the model's extractions in the original list of sentences against the extractions carried out in the noisy list of sentences. The original annotated labels are not used at any point; we are simply comparing the model against itself in two settings."]},{"cell_type":"markdown","metadata":{"id":"jNG1OYuQAgtW"},"source":["# Getting started with LangTest"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"32C5aiC-LI1L"},"outputs":[],"source":["!pip install \"langtest[openai,transformers,evaluate]\""]},{"cell_type":"markdown","metadata":{"id":"EsEtlSiNAnSO"},"source":["# Harness and Its Parameters\n","\n","The Harness class is a testing class for Natural Language Processing (NLP) models. 
It evaluates the performance of an NLP model on a given task using test data and generates a report with test results. Harness can be imported from the LangTest library in the following way."]},{"cell_type":"code","execution_count":2,"metadata":{"executionInfo":{"elapsed":3452,"status":"ok","timestamp":1692371266150,"user":{"displayName":"Prikshit sharma","userId":"07819241395213139913"},"user_tz":-330},"id":"w2GPpdowS1C9"},"outputs":[],"source":["#Import Harness from the LangTest library\n","from langtest import Harness"]},{"cell_type":"markdown","metadata":{"id":"7_6PF_HGA4EO"},"source":["This imports the Harness class, which provides a blueprint or framework for conducting NLP testing. Instances of the Harness class can be customized or configured for different testing scenarios or environments.\n","\n","Here is a list of the different parameters that can be passed to the Harness class:\n","\n","
\n","\n","\n","| Parameter | Description | \n","| - | - | \n","|**task** |Task for which the model is to be evaluated (question-answering or summarization)|\n","| **model** | Specifies the model(s) to be evaluated. This parameter can be provided as either a dictionary or a list of dictionaries. Each dictionary should contain the following keys:
  • model (mandatory): \tPipelineModel or path to a saved model or pretrained pipeline/model from hub.
  • hub (mandatory): Hub (library) to use in back-end for loading model from public models hub or from path
|\n","| **data** | The data to be used for evaluation. A dictionary providing flexibility and options for data sources. It should include the following keys:
  • data_source (mandatory): The source of the data.
  • subset (optional): The subset of the data.
  • feature_column (optional): The column containing the features.
  • target_column (optional): The column containing the target labels.
  • split (optional): The data split to be used.
  • source (optional): Set to 'huggingface' when loading Hugging Face dataset.
|\n","| **config** | Configuration for the tests to be performed, specified in the form of a YAML file. |\n","\n","
\n","
"]},{"cell_type":"markdown","metadata":{"id":"pHJQHDcSA_CV"},"source":["# OpenAI Model Testing For Question Answering\n","\n","In this section, we dive into testing OpenAI models on the Question Answering task.\n","\n","For now, LangTest supports robustness tests for LLM testing."]},{"cell_type":"code","execution_count":3,"metadata":{"executionInfo":{"elapsed":111,"status":"ok","timestamp":1692371266152,"user":{"displayName":"Prikshit sharma","userId":"07819241395213139913"},"user_tz":-330},"id":"YXVcv79JTAWA"},"outputs":[],"source":["import os\n","\n","os.environ[\"OPENAI_API_KEY\"] = \"\""]},{"cell_type":"markdown","metadata":{"id":"2Q1uClT2kgLB"},"source":["## MMLU \n","[Measuring Massive Multitask Language Understanding](https://arxiv.org/abs/2009.03300)\n","\n","**Dataset Summary**\n","\n","- MMLU (Massive Multitask Language Understanding) is a new benchmark designed to measure knowledge acquired during pretraining by evaluating models exclusively in zero-shot and few-shot settings. This makes the benchmark more challenging and more similar to how we evaluate humans. The benchmark covers 57 subjects across STEM, the humanities, the social sciences, and more. It ranges in difficulty from an elementary level to an advanced professional level, and it tests both world knowledge and problem solving ability. Subjects range from traditional areas, such as mathematics and history, to more specialized areas like law and ethics. The granularity and breadth of the subjects makes the benchmark ideal for identifying a model’s blind spots.\n","\n","**Data Splits**\n","\n","- `test` - Test set from the MMLU dataset which covers 57 tasks including elementary mathematics, US history, computer science, law, and more. We took 50 samples from each task in the test set.\n","\n","- `test-tiny` - Truncated version of the test set from the MMLU dataset which covers 57 tasks including elementary mathematics, US history, computer science, law, and more. 
We took 10 samples from each task in the test-tiny set."]},{"cell_type":"markdown","metadata":{"id":"1WO54aEnBKK8"},"source":["### Setup and Configure Harness"]},{"cell_type":"code","execution_count":4,"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"elapsed":105,"status":"ok","timestamp":1692371266153,"user":{"displayName":"Prikshit sharma","userId":"07819241395213139913"},"user_tz":-330},"id":"f13UydObTDRG","outputId":"e9ed4754-3026-42ba-85dd-6c100e3c60c9"},"outputs":[{"name":"stdout","output_type":"stream","text":["Test Configuration : \n"," {\n"," \"model_parameters\": {\n"," \"temperature\": 0.2,\n"," \"max_tokens\": 64\n"," },\n"," \"tests\": {\n"," \"defaults\": {\n"," \"min_pass_rate\": 1.0\n"," },\n"," \"robustness\": {\n"," \"add_typo\": {\n"," \"min_pass_rate\": 0.7\n"," },\n"," \"lowercase\": {\n"," \"min_pass_rate\": 0.7\n"," }\n"," }\n"," }\n","}\n"]}],"source":["harness = Harness(\n"," task=\"question-answering\", \n"," model={\"model\": \"text-davinci-003\",\"hub\":\"openai\"}, \n"," data={\"data_source\" :\"MMLU\",\n"," \"split\":\"test-tiny\"}\n"," )"]},{"cell_type":"markdown","metadata":{"id":"djMJVtS3U3Wv"},"source":["## Robustness"]},{"cell_type":"markdown","metadata":{"id":"NQ1KF731BW5O"},"source":["For these tests we used uppercase, Dyslexia Word Swap, Add Slangs, Insert Abbreviations and Speech to Text typos. 
Other available robustness tests for the QA task are:\n","* `add_context`\n","* `add_contraction`\n","* `add_punctuation`\n","* `add_typo`\n","* `add_ocr_typo`\n","* `american_to_british`\n","* `british_to_american`\n","* `lowercase`\n","* `strip_punctuation`\n","* `titlecase`\n","* `uppercase`\n","* `number_to_word`\n","* `add_abbreviation`\n","* `add_speech_to_text_typo`\n","* `add_slangs`\n","* `dyslexia_word_swap`\n","* `multiple_perturbations`\n","* `adjective_synonym_swap`\n","* `adjective_antonym_swap`\n","* `strip_all_punctuation`"]},{"cell_type":"markdown","metadata":{"id":"8VxrRAMkBf1H"},"source":["You can also set prompts and other model parameters in config. Possible parameters are:\n","* `user_prompt:` Prompt to be given to the model.\n","* `temperature:` Temperature of the model.\n","* `max_tokens:` Maximum number of output tokens allowed for the model."]},{"cell_type":"code","execution_count":5,"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"elapsed":85,"status":"ok","timestamp":1692371266155,"user":{"displayName":"Prikshit sharma","userId":"07819241395213139913"},"user_tz":-330},"id":"fMFVq3mCTQ7j","outputId":"150254fc-f2e6-42fe-93e7-92ef6c1468ae"},"outputs":[{"data":{"text/plain":["{'tests': {'defaults': {'min_pass_rate': 0.65},\n"," 'robustness': {'uppercase': {'min_pass_rate': 0.66},\n"," 'dyslexia_word_swap': {'min_pass_rate': 0.6},\n"," 'add_abbreviation': {'min_pass_rate': 0.6},\n"," 'add_slangs': {'min_pass_rate': 0.6},\n"," 'add_speech_to_text_typo': {'min_pass_rate': 0.6}}}}"]},"execution_count":5,"metadata":{},"output_type":"execute_result"}],"source":["harness.configure(\n","{\n"," 'tests': {'defaults': {'min_pass_rate': 0.65},\n"," 'robustness': {'uppercase': {'min_pass_rate': 0.66},\n"," 'dyslexia_word_swap':{'min_pass_rate': 0.60},\n"," 'add_abbreviation':{'min_pass_rate': 0.60},\n"," 'add_slangs':{'min_pass_rate': 0.60},\n"," 'add_speech_to_text_typo':{'min_pass_rate': 0.60},\n","\n"," }\n"," }\n"," }\n"," 
)"]},{"cell_type":"markdown","metadata":{"id":"AxKHTNFELI1x"},"source":["➤ You can adjust the level of transformation in the sentence by using the \"`prob`\" parameter, which controls the proportion of words to be changed during robustness tests.\n","\n","➤ **NOTE** : \"`prob`\" defaults to 1.0, which means all words will be transformed.\n","```\n","harness.configure(\n","{\n"," 'tests': {\n"," 'defaults': {'min_pass_rate': 0.65},\n"," 'robustness': {\n"," 'uppercase': {'min_pass_rate': 0.66, 'prob': 0.50},\n"," 'dyslexia_word_swap':{'min_pass_rate': 0.60, 'prob': 0.70},\n"," }\n"," }\n","})\n","\n","```"]},{"cell_type":"markdown","metadata":{"id":"m5IuCmiEBuW8"},"source":["Here we have configured the harness to perform five robustness tests and defined the minimum pass rate for each test."]},{"cell_type":"code","execution_count":6,"metadata":{"executionInfo":{"elapsed":71,"status":"ok","timestamp":1692371266157,"user":{"displayName":"Prikshit sharma","userId":"07819241395213139913"},"user_tz":-330},"id":"nmHqJ_TlUg8h"},"outputs":[],"source":["harness.data = harness.data[:10]"]},{"cell_type":"markdown","metadata":{"id":"nAeqBsbAB_1M"},"source":["### Generating the Test Cases"]},{"cell_type":"code","execution_count":7,"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"elapsed":17814,"status":"ok","timestamp":1692371283903,"user":{"displayName":"Prikshit sharma","userId":"07819241395213139913"},"user_tz":-330},"id":"CCJxFd4nUkMN","outputId":"9f99926a-a068-4698-ff9d-68f2416a075d"},"outputs":[{"name":"stderr","output_type":"stream","text":["Generating testcases...: 100%|██████████| 1/1 [00:00<00:00, 1392.99it/s]\n"]},{"data":{"text/plain":[]},"execution_count":7,"metadata":{},"output_type":"execute_result"}],"source":["harness.generate()"]},{"cell_type":"markdown","metadata":{"id":"ZEWchFb8CDrk"},"source":["The `harness.generate()` method automatically generates the test cases (based on the provided 
configuration)."]},{"cell_type":"markdown","metadata":{"id":"MEnLcl-OCG1O"},"source":["### Running the tests"]},{"cell_type":"code","execution_count":8,"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"elapsed":32123,"status":"ok","timestamp":1692371316007,"user":{"displayName":"Prikshit sharma","userId":"07819241395213139913"},"user_tz":-330},"id":"gFEez-T0UlcC","outputId":"3684f7af-9359-4f24-e584-5307e3927bfe"},"outputs":[{"name":"stderr","output_type":"stream","text":["Running testcases... : 100%|██████████| 50/50 [00:32<00:00, 1.55it/s]\n"]},{"data":{"text/plain":[]},"execution_count":8,"metadata":{},"output_type":"execute_result"}],"source":["harness.run()"]},{"cell_type":"markdown","metadata":{"id":"3ice4dqfCVlr"},"source":["`harness.run()` is called after `harness.generate()` and runs all the tests. It returns a pass/fail flag for each test."]},{"cell_type":"markdown","metadata":{"id":"g1NxuqveOc-t"},"source":["### Generated Results"]},{"cell_type":"code","execution_count":9,"metadata":{"colab":{"base_uri":"https://localhost:8080/","height":1000},"executionInfo":{"elapsed":16558,"status":"ok","timestamp":1692371332559,"user":{"displayName":"Prikshit sharma","userId":"07819241395213139913"},"user_tz":-330},"id":"ZjYBONiuYJdK","outputId":"4e69d5fb-cfbd-4713-c25e-0cb49bb0878d"},"outputs":[{"data":{"text/html":["\n","
\n"],"text/plain":[" category test_type original_context \\\n","0 robustness uppercase - \n","1 robustness uppercase - \n","2 robustness uppercase - \n","3 robustness uppercase - \n","4 robustness uppercase - \n","5 robustness uppercase - \n","6 robustness uppercase - \n","7 robustness uppercase - \n","8 robustness uppercase - \n","9 robustness uppercase - \n","10 robustness dyslexia_word_swap - \n","11 robustness dyslexia_word_swap - \n","12 robustness dyslexia_word_swap - \n","13 robustness dyslexia_word_swap - \n","14 robustness dyslexia_word_swap - \n","15 robustness dyslexia_word_swap - \n","16 robustness dyslexia_word_swap - \n","17 robustness dyslexia_word_swap - \n","18 robustness dyslexia_word_swap - \n","19 robustness dyslexia_word_swap - \n","20 robustness add_abbreviation - \n","21 robustness add_abbreviation - \n","22 robustness add_abbreviation - \n","23 robustness add_abbreviation - \n","24 robustness add_abbreviation - \n","25 robustness add_abbreviation - \n","26 robustness add_abbreviation - \n","27 robustness add_abbreviation - \n","28 robustness add_abbreviation - \n","29 robustness add_abbreviation - \n","30 robustness add_slangs - \n","31 robustness add_slangs - \n","32 robustness add_slangs - \n","33 robustness add_slangs - \n","34 robustness add_slangs - \n","35 robustness add_slangs - \n","36 robustness add_slangs - \n","37 robustness add_slangs - \n","38 robustness add_slangs - \n","39 robustness add_slangs - \n","40 robustness add_speech_to_text_typo - \n","41 robustness add_speech_to_text_typo - \n","42 robustness add_speech_to_text_typo - \n","43 robustness add_speech_to_text_typo - \n","44 robustness add_speech_to_text_typo - \n","45 robustness add_speech_to_text_typo - \n","46 robustness add_speech_to_text_typo - \n","47 robustness add_speech_to_text_typo - \n","48 robustness add_speech_to_text_typo - \n","49 robustness add_speech_to_text_typo - \n","\n"," original_question perturbed_context \\\n","0 Find the degree for the given 
field extension ... - \n","1 Let p = (1, 2, 5, 4)(2, 3) in S_5 . Find the i... - \n","2 Find all zeros in the indicated finite field o... - \n","3 Statement 1 | A factor group of a non-Abelian ... - \n","4 Find the product of the given polynomials in t... - \n","5 Statement 1 | If a group has an element of ord... - \n","6 Statement 1 | Every homomorphic image of a gro... - \n","7 Statement 1 | A ring homomorphism is one to on... - \n","8 Find the degree for the given field extension ... - \n","9 Find all zeros in the indicated finite field o... - \n","10 Find the degree for the given field extension ... - \n","11 Let p = (1, 2, 5, 4)(2, 3) in S_5 . Find the i... - \n","12 Find all zeros in the indicated finite field o... - \n","13 Statement 1 | A factor group of a non-Abelian ... - \n","14 Find the product of the given polynomials in t... - \n","15 Statement 1 | If a group has an element of ord... - \n","16 Statement 1 | Every homomorphic image of a gro... - \n","17 Statement 1 | A ring homomorphism is one to on... - \n","18 Find the degree for the given field extension ... - \n","19 Find all zeros in the indicated finite field o... - \n","20 Find the degree for the given field extension ... - \n","21 Let p = (1, 2, 5, 4)(2, 3) in S_5 . Find the i... - \n","22 Find all zeros in the indicated finite field o... - \n","23 Statement 1 | A factor group of a non-Abelian ... - \n","24 Find the product of the given polynomials in t... - \n","25 Statement 1 | If a group has an element of ord... - \n","26 Statement 1 | Every homomorphic image of a gro... - \n","27 Statement 1 | A ring homomorphism is one to on... - \n","28 Find the degree for the given field extension ... - \n","29 Find all zeros in the indicated finite field o... - \n","30 Find the degree for the given field extension ... - \n","31 Let p = (1, 2, 5, 4)(2, 3) in S_5 . Find the i... - \n","32 Find all zeros in the indicated finite field o... - \n","33 Statement 1 | A factor group of a non-Abelian ... 
- \n","34 Find the product of the given polynomials in t... - \n","35 Statement 1 | If a group has an element of ord... - \n","36 Statement 1 | Every homomorphic image of a gro... - \n","37 Statement 1 | A ring homomorphism is one to on... - \n","38 Find the degree for the given field extension ... - \n","39 Find all zeros in the indicated finite field o... - \n","40 Find the degree for the given field extension ... - \n","41 Let p = (1, 2, 5, 4)(2, 3) in S_5 . Find the i... - \n","42 Find all zeros in the indicated finite field o... - \n","43 Statement 1 | A factor group of a non-Abelian ... - \n","44 Find the product of the given polynomials in t... - \n","45 Statement 1 | If a group has an element of ord... - \n","46 Statement 1 | Every homomorphic image of a gro... - \n","47 Statement 1 | A ring homomorphism is one to on... - \n","48 Find the degree for the given field extension ... - \n","49 Find all zeros in the indicated finite field o... - \n","\n"," perturbed_question expected_result \\\n","0 FIND THE DEGREE FOR THE GIVEN FIELD EXTENSION ... B. 4 \n","1 LET P = (1, 2, 5, 4)(2, 3) IN S_5 . FIND THE I... C. 24 \n","2 FIND ALL ZEROS IN THE INDICATED FINITE FIELD O... A. 0 \n","3 STATEMENT 1 | A FACTOR GROUP OF A NON-ABELIAN ... A. True, True \n","4 FIND THE PRODUCT OF THE GIVEN POLYNOMIALS IN T... C. 0 \n","5 STATEMENT 1 | IF A GROUP HAS AN ELEMENT OF ORD... C. True, False \n","6 STATEMENT 1 | EVERY HOMOMORPHIC IMAGE OF A GRO... C. True, False \n","7 STATEMENT 1 | A RING HOMOMORPHISM IS ONE TO ON... C. True, False \n","8 FIND THE DEGREE FOR THE GIVEN FIELD EXTENSION ... B. 4 \n","9 FIND ALL ZEROS IN THE INDICATED FINITE FIELD O... A. 1 \n","10 Find the degree four the given field extension... B. 4 \n","11 Let p = (1, 2, 5, 4)(2, 3) in S_5 . Find the i... C. 24 \n","12 Find all zeros in the indicated finite field o... A. 0 \n","13 Statement 1 | A factor group off a non-Abelian... A. True, True \n","14 Find the product off the given polynomials in ... C. 
0 \n","15 Statement 1 | If a group has an element off or... C. True, False \n","16 Statement 1 | Every homomorphic image off a gr... C. True, False \n","17 Statement 1 | A ring homomorphism is won too w... C. True, False \n","18 Find the degree four the given field extension... B. 4 \n","19 Find all zeros in the indicated finite field o... A. 1 \n","20 Find da degree 4 thedaven field extension Q(sq... B. 4 \n","21 Let p = (1, 2, 5, 4)(2, 3) in S_5 . Find da in... C. 24 \n","22 Find all zeros in da indicated finite field of... A. 0 \n","23 Statement 1 | A factor group of a non-Abelian ... A. True, True \n","24 Find da product of tdagiven polynomials in thd... C. 0 \n","25 Statement 1 | If a group has an element of ord... C. True, False \n","26 Statement 1 | Every homomorphic image of a gro... C. True, False \n","27 Statement 1 | A ring homomorphism is one 2 one... C. True, False \n","28 Find da degree 4 thedaven field extension Q(sq... B. 4 \n","29 Find all zeros in da indicated finite field of... C. 2,3 \n","30 Find the degree for the given field extension ... B. 4 \n","31 Let p = (1, 2, 5, 4)(2, 3) in S_5 . Find the i... C. 24 \n","32 Find all zeros in the indicated finite field o... A. 0 \n","33 Statement 1 | A factor group of a non-Abelian ... A. True, True \n","34 Find the product of the given polynomials in t... C. 0 \n","35 Statement 1 | If a group has an element of ord... C. True, False \n","36 Statement 1 | Every homomorphic image of a gro... C. True, False \n","37 Statement 1 | A ring homomorphism is one to on... C. True, False \n","38 Find the degree for the given field extension ... B. 4 \n","39 Find all zeros in the indicated finite field o... A. 1 \n","40 Find the degree for the givin' feild extension... B. 4 \n","41 Lett pea = (1, 2, 5, 4)(2, 3) in S_5 . Fined t... C. 24 \n","42 Find all zeros in the indicated finite feild o... A. 0 \n","43 Statement 1 | A factor grupe of ae non-Abelian... A. 
True, True \n","44 Find the product of the givin' polynomials in ... C. 0 \n","45 Statement 1 | If a groupe has 'N element of or... C. True, False \n","46 Statement 1 | Every homomorphic image of a. gr... C. True, False \n","47 Statement 1 | A wring homomorphism is one to o... C. True, False \n","48 Find the degree for the givin' field extension... B. 4 \n","49 Find aull zeros inn the indicated finite field... C. 2,3 \n","\n"," actual_result pass \n","0 B. 4 True \n","1 C. 24 True \n","2 D. 0,4 False \n","3 C. TRUE, FALSE False \n","4 C. 0 True \n","5 C. TRUE, FALSE True \n","6 C. TRUE, FALSE True \n","7 A. TRUE, TRUE False \n","8 C. 2 False \n","9 C. 2,3 False \n","10 B. 4 True \n","11 C. 24 True \n","12 A. 0 True \n","13 C. True, False False \n","14 C. 0 True \n","15 C. True, False True \n","16 C. True, False True \n","17 C. True, False True \n","18 B. 4 True \n","19 A. 1 True \n","20 B. 4 True \n","21 C. 24 True \n","22 A. 0 True \n","23 A. True, True True \n","24 C. 0 True \n","25 C. True, False True \n","26 C. True, False True \n","27 C. True, False True \n","28 B. 4 True \n","29 A. 1 False \n","30 B. 4 True \n","31 C. 24 True \n","32 A. 0 True \n","33 A. True, True True \n","34 C. 0 True \n","35 A. True, True False \n","36 A. True, True False \n","37 A. True, True False \n","38 B. 4 True \n","39 A. 1 True \n","40 B. 4 True \n","41 B. 2 False \n","42 A. 0 True \n","43 A. True, True True \n","44 C. 0 True \n","45 C. True, False True \n","46 A. True, True False \n","47 B. False, False False \n","48 B. 4 True \n","49 C. 2,3 True "]},"execution_count":9,"metadata":{},"output_type":"execute_result"}],"source":["harness.generated_results()"]},{"cell_type":"markdown","metadata":{"id":"Gl5QGV9pCZfz"},"source":["This method returns the generated results in the form of a pandas dataframe, which provides a convenient and easy-to-use format for working with the test results. 
You can use this method to quickly identify the test cases that failed and to determine where fixes are needed."]},{"cell_type":"markdown","metadata":{"id":"9fBgU33hCb2K"},"source":["### Final Results\n","\n","We can call `.report()`, which summarizes the results by giving the pass and fail counts and an overall pass/fail flag for each test."]},{"cell_type":"code","execution_count":10,"metadata":{"colab":{"base_uri":"https://localhost:8080/","height":206},"executionInfo":{"elapsed":14511,"status":"ok","timestamp":1692371347056,"user":{"displayName":"Prikshit sharma","userId":"07819241395213139913"},"user_tz":-330},"id":"nDmRw1AeUqIl","outputId":"c458e5f1-9f6f-4b40-bc19-7570592546be"},"outputs":[{"data":{"text/html":["\n","
\n"],"text/plain":[" category test_type fail_count pass_count pass_rate \\\n","0 robustness uppercase 5 5 50% \n","1 robustness dyslexia_word_swap 1 9 90% \n","2 robustness add_abbreviation 1 9 90% \n","3 robustness add_slangs 3 7 70% \n","4 robustness add_speech_to_text_typo 3 7 70% \n","\n"," minimum_pass_rate pass \n","0 66% False \n","1 60% True \n","2 60% True \n","3 60% True \n","4 60% True "]},"execution_count":10,"metadata":{},"output_type":"execute_result"}],"source":["harness.report()"]},{"cell_type":"markdown","metadata":{"id":"IULGQtWAWp4L"},"source":["## Fairness"]},{"cell_type":"markdown","metadata":{"id":"z85d594ZGXyX"},"source":["Available fairness tests for the QA task are:\n","\n","* `max_gender_rouge1_score`\n","* `max_gender_rouge2_score`\n","* `max_gender_rougeL_score`\n","* `max_gender_rougeLsum_score`\n","* `min_gender_rouge1_score`\n","* `min_gender_rouge2_score`\n","* `min_gender_rougeL_score`\n","* `min_gender_rougeLsum_score`"]},{"cell_type":"code","execution_count":11,"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"elapsed":86,"status":"ok","timestamp":1692371347059,"user":{"displayName":"Prikshit sharma","userId":"07819241395213139913"},"user_tz":-330},"id":"OoMGAn_FWpaP","outputId":"90175b71-b519-4687-b9bb-459bf3afdc35"},"outputs":[{"name":"stdout","output_type":"stream","text":["Test Configuration : \n"," {\n"," \"model_parameters\": {\n"," \"temperature\": 0.2,\n"," \"max_tokens\": 64\n"," },\n"," \"tests\": {\n"," \"defaults\": {\n"," \"min_pass_rate\": 1.0\n"," },\n"," \"robustness\": {\n"," \"add_typo\": {\n"," \"min_pass_rate\": 0.7\n"," },\n"," \"lowercase\": {\n"," \"min_pass_rate\": 0.7\n"," }\n"," }\n"," }\n","}\n"]}],"source":["harness = Harness(\n"," task=\"question-answering\", \n"," model={\"model\": \"text-davinci-003\",\"hub\":\"openai\"}, \n"," data={\"data_source\" :\"MMLU\",\n"," \"split\":\"test-tiny\"}\n"," 
)"]},{"cell_type":"code","execution_count":12,"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"elapsed":78,"status":"ok","timestamp":1692371347061,"user":{"displayName":"Prikshit sharma","userId":"07819241395213139913"},"user_tz":-330},"id":"45-rhwhTXMWb","outputId":"d96893e0-a009-4da9-b4e5-63b200d83d45"},"outputs":[{"data":{"text/plain":["{'tests': {'defaults': {'min_pass_rate': 0.65},\n"," 'fairness': {'min_gender_rouge1_score': {'min_score': 0.66},\n"," 'min_gender_rouge2_score': {'min_score': 0.6},\n"," 'min_gender_rougeL_score': {'min_score': 0.66},\n"," 'min_gender_rougeLsum_score': {'min_score': 0.66},\n"," 'max_gender_rouge1_score': {'max_score': 0.66},\n"," 'max_gender_rouge2_score': {'max_score': 0.6},\n"," 'max_gender_rougeL_score': {'max_score': 0.66},\n"," 'max_gender_rougeLsum_score': {'max_score': 0.66}}}}"]},"execution_count":12,"metadata":{},"output_type":"execute_result"}],"source":["harness.configure(\n","{\n"," 'tests': {'defaults': {'min_pass_rate': 0.65},\n"," 'fairness': {\n"," 'min_gender_rouge1_score': {'min_score': 0.66},\n"," 'min_gender_rouge2_score':{'min_score': 0.60},\n"," 'min_gender_rougeL_score': {'min_score': 0.66},\n"," 'min_gender_rougeLsum_score': {'min_score': 0.66},\n"," 'max_gender_rouge1_score': {'max_score': 0.66},\n"," 'max_gender_rouge2_score':{'max_score': 0.60},\n"," 'max_gender_rougeL_score': {'max_score': 0.66},\n"," 'max_gender_rougeLsum_score': {'max_score': 0.66},\n","\n","\n","\n","\n"," }\n"," }\n"," }\n"," )"]},{"cell_type":"code","execution_count":13,"metadata":{"executionInfo":{"elapsed":66,"status":"ok","timestamp":1692371347063,"user":{"displayName":"Prikshit sharma","userId":"07819241395213139913"},"user_tz":-330},"id":"_cTZaer5XyDa"},"outputs":[],"source":["harness.data = harness.data[:10]"]},{"cell_type":"markdown","metadata":{"id":"dw85pgowGx8t"},"source":["### Generating the Test 
Cases"]},{"cell_type":"code","execution_count":14,"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"elapsed":76,"status":"ok","timestamp":1692371347075,"user":{"displayName":"Prikshit sharma","userId":"07819241395213139913"},"user_tz":-330},"id":"F2p1pXfoXzND","outputId":"6cdcb7cb-119b-4f14-dce8-f03bc507a8d0"},"outputs":[{"name":"stderr","output_type":"stream","text":["Generating testcases...: 100%|██████████| 1/1 [00:00<00:00, 1369.79it/s]\n"]},{"data":{"text/plain":[]},"execution_count":14,"metadata":{},"output_type":"execute_result"}],"source":["harness.generate()"]},{"cell_type":"code","execution_count":15,"metadata":{"colab":{"base_uri":"https://localhost:8080/","height":802},"executionInfo":{"elapsed":64,"status":"ok","timestamp":1692371347078,"user":{"displayName":"Prikshit sharma","userId":"07819241395213139913"},"user_tz":-330},"id":"vJZxMYyKX0Pe","outputId":"507d0db6-80e5-4eba-82f5-739ce1b9e8a1"},"outputs":[{"data":{"text/html":["\n","
\n"],"text/plain":[" category test_type test_case\n","0 fairness min_gender_rouge1_score male\n","1 fairness min_gender_rouge1_score female\n","2 fairness min_gender_rouge1_score unknown\n","3 fairness min_gender_rouge2_score male\n","4 fairness min_gender_rouge2_score female\n","5 fairness min_gender_rouge2_score unknown\n","6 fairness min_gender_rougeL_score male\n","7 fairness min_gender_rougeL_score female\n","8 fairness min_gender_rougeL_score unknown\n","9 fairness min_gender_rougeLsum_score male\n","10 fairness min_gender_rougeLsum_score female\n","11 fairness min_gender_rougeLsum_score unknown\n","12 fairness max_gender_rouge1_score male\n","13 fairness max_gender_rouge1_score female\n","14 fairness max_gender_rouge1_score unknown\n","15 fairness max_gender_rouge2_score male\n","16 fairness max_gender_rouge2_score female\n","17 fairness max_gender_rouge2_score unknown\n","18 fairness max_gender_rougeL_score male\n","19 fairness max_gender_rougeL_score female\n","20 fairness max_gender_rougeL_score unknown\n","21 fairness max_gender_rougeLsum_score male\n","22 fairness max_gender_rougeLsum_score female\n","23 fairness max_gender_rougeLsum_score unknown"]},"execution_count":15,"metadata":{},"output_type":"execute_result"}],"source":["harness.testcases()"]},{"cell_type":"markdown","metadata":{"id":"zSgEmwr7G2Xl"},"source":["### Running the 
tests"]},{"cell_type":"code","execution_count":16,"metadata":{"colab":{"base_uri":"https://localhost:8080/","height":181,"referenced_widgets":["257c00fef73b4d50950c8d8b165e26a2","75d0522480494bb1a7b66e14fc43faac","4218ed9efdf84217b5daa2aa5930e20b","867e0de65c734221ad6f2623c2a35f57","d3ca7afb948f404682aa027d3d76d237","f2540d52716a4393a5f050f8d030f3f3","0dab743db8f14b77b0ec1699f92f86ed","2608c51cf9784a56baeddf9d1622ce76","2773b8eeb7024310b2264d487a9b26df","a3d9b7d4b44540d88953c69b56f9269f","cb676eb37f2a4126837c7324bf51d7ad","56701a47f6ee4a6d81a98f66756baf03","20d999a03d814a7785232c091241dc1c","6ab5b7e5c6784f3b92b6180ae0043589","9824945e44fe4af4a1d70a8383b72b72","0d7c7a938349427983d62652e81cead5","351e721352bf4c7cb30dbbe8a06ce35d","ad6bedec421b40d897568ae3f2705810","fabd451f3ccc47d5aed88e94eec722f7","c07ab8a5ad3e41e991f940b6e08e1814","660e7fdd115f4e728fe7ea0358fd8bff","52ef8bcdab0a42f0a5d6a336766de54d","fa4244813260430c98d2fbad63671f10","e0e00dfcfb7c49ac961ff7f1101a0caa","e367e27cda314517ab18696ecd913e0a","9a1221b68d2c4af1a74f5978e252d507","b16b721265754f5fa258970429fc7bdd","2e68a1149b7b40bc8c2811b1a16c96ea","829fb20d826d45baaf8d785179c1b32f","feb421598a0441498d81241716261b78","f0fc5b6cb35e4986b5ef1f2d03e56228","e349b98fd389418fb365f53185489437","f6ebb67ea4574f3e8924b90d7b5aba12","d5950fc7527049279a8d433985f79619","3e9c9defb1d148b5a6de25cb2095740a","3d19431d61e747df81b5b6730e67c955","805c8478574545c398214ce2d295944a","7b972e6f8f624ac28f148a8cff4b0ee2","5a12148bfe9848c5b9827d9b677b39dd","b4bf22308b254236960ff1eb5306c4e9","6984b154f66d4f1ab209168e50a64acd","2c907621903c43c9ad7ed84ee9026412","4f579cc50d884981b562f112b8764075","5a0ba0d42433427c8874b56d5ef1f4a2"]},"executionInfo":{"elapsed":36184,"status":"ok","timestamp":1692371383203,"user":{"displayName":"Prikshit sharma","userId":"07819241395213139913"},"user_tz":-330},"id":"marZgGMEX2F1","outputId":"93f92514-2be1-4875-9061-74524e84fbd0"},"outputs":[{"name":"stderr","output_type":"stream","text":["\rRunning 
testcases... : 0%| | 0/24 [00:00\n","
\n","\n","\n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n","
| | category | test_type | test_case | expected_result | actual_result | pass |
| 0 | fairness | min_gender_rouge1_score | male | 0.66 | 0.355556 | False |
| 1 | fairness | min_gender_rouge1_score | female | 0.66 | 0.750000 | True |
| 2 | fairness | min_gender_rouge1_score | unknown | 0.66 | 0.222222 | False |
| 3 | fairness | min_gender_rouge2_score | male | 0.60 | 0.000000 | False |
| 4 | fairness | min_gender_rouge2_score | female | 0.60 | 0.750000 | True |
| 5 | fairness | min_gender_rouge2_score | unknown | 0.60 | 0.000000 | False |
| 6 | fairness | min_gender_rougeL_score | male | 0.66 | 0.244444 | False |
| 7 | fairness | min_gender_rougeL_score | female | 0.66 | 0.750000 | True |
| 8 | fairness | min_gender_rougeL_score | unknown | 0.66 | 0.222222 | False |
| 9 | fairness | min_gender_rougeLsum_score | male | 0.66 | 0.244444 | False |
| 10 | fairness | min_gender_rougeLsum_score | female | 0.66 | 0.750000 | True |
| 11 | fairness | min_gender_rougeLsum_score | unknown | 0.66 | 0.222222 | False |
| 12 | fairness | max_gender_rouge1_score | male | 0.66 | 0.355556 | True |
| 13 | fairness | max_gender_rouge1_score | female | 0.66 | 0.750000 | False |
| 14 | fairness | max_gender_rouge1_score | unknown | 0.66 | 0.222222 | True |
| 15 | fairness | max_gender_rouge2_score | male | 0.60 | 0.000000 | True |
| 16 | fairness | max_gender_rouge2_score | female | 0.60 | 0.750000 | False |
| 17 | fairness | max_gender_rouge2_score | unknown | 0.60 | 0.000000 | True |
| 18 | fairness | max_gender_rougeL_score | male | 0.66 | 0.244444 | True |
| 19 | fairness | max_gender_rougeL_score | female | 0.66 | 0.750000 | False |
| 20 | fairness | max_gender_rougeL_score | unknown | 0.66 | 0.222222 | True |
| 21 | fairness | max_gender_rougeLsum_score | male | 0.66 | 0.244444 | True |
| 22 | fairness | max_gender_rougeLsum_score | female | 0.66 | 0.750000 | False |
| 23 | fairness | max_gender_rougeLsum_score | unknown | 0.66 | 0.222222 | True |
\n","
\n","
\n","\n","
\n"," \n","\n"," \n","\n"," \n","
\n","\n","\n","
\n"," \n","\n","\n","\n"," \n","
\n","
\n"," \n"],"text/plain":[" category test_type test_case expected_result \\\n","0 fairness min_gender_rouge1_score male 0.66 \n","1 fairness min_gender_rouge1_score female 0.66 \n","2 fairness min_gender_rouge1_score unknown 0.66 \n","3 fairness min_gender_rouge2_score male 0.60 \n","4 fairness min_gender_rouge2_score female 0.60 \n","5 fairness min_gender_rouge2_score unknown 0.60 \n","6 fairness min_gender_rougeL_score male 0.66 \n","7 fairness min_gender_rougeL_score female 0.66 \n","8 fairness min_gender_rougeL_score unknown 0.66 \n","9 fairness min_gender_rougeLsum_score male 0.66 \n","10 fairness min_gender_rougeLsum_score female 0.66 \n","11 fairness min_gender_rougeLsum_score unknown 0.66 \n","12 fairness max_gender_rouge1_score male 0.66 \n","13 fairness max_gender_rouge1_score female 0.66 \n","14 fairness max_gender_rouge1_score unknown 0.66 \n","15 fairness max_gender_rouge2_score male 0.60 \n","16 fairness max_gender_rouge2_score female 0.60 \n","17 fairness max_gender_rouge2_score unknown 0.60 \n","18 fairness max_gender_rougeL_score male 0.66 \n","19 fairness max_gender_rougeL_score female 0.66 \n","20 fairness max_gender_rougeL_score unknown 0.66 \n","21 fairness max_gender_rougeLsum_score male 0.66 \n","22 fairness max_gender_rougeLsum_score female 0.66 \n","23 fairness max_gender_rougeLsum_score unknown 0.66 \n","\n"," actual_result pass \n","0 0.355556 False \n","1 0.750000 True \n","2 0.222222 False \n","3 0.000000 False \n","4 0.750000 True \n","5 0.000000 False \n","6 0.244444 False \n","7 0.750000 True \n","8 0.222222 False \n","9 0.244444 False \n","10 0.750000 True \n","11 0.222222 False \n","12 0.355556 True \n","13 0.750000 False \n","14 0.222222 True \n","15 0.000000 True \n","16 0.750000 False \n","17 0.000000 True \n","18 0.244444 True \n","19 0.750000 False \n","20 0.222222 True \n","21 0.244444 True \n","22 0.750000 False \n","23 0.222222 True 
"]},"execution_count":17,"metadata":{},"output_type":"execute_result"}],"source":["harness.generated_results()"]},{"cell_type":"markdown","metadata":{"id":"o39sXReLG7K9"},"source":["### Final Results"]},{"cell_type":"code","execution_count":18,"metadata":{"colab":{"base_uri":"https://localhost:8080/","height":300},"executionInfo":{"elapsed":209,"status":"ok","timestamp":1692371383216,"user":{"displayName":"Prikshit sharma","userId":"07819241395213139913"},"user_tz":-330},"id":"AiyJ7SyJYC9V","outputId":"df0ec5a3-5a04-45c1-d635-f0be79abe66a"},"outputs":[{"data":{"text/html":["\n","
\n","
\n","\n","\n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n","
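| | category | test_type | fail_count | pass_count | pass_rate | minimum_pass_rate | pass |
| 0 | fairness | min_gender_rouge1_score | 2 | 1 | 33% | 65% | False |
| 1 | fairness | min_gender_rouge2_score | 2 | 1 | 33% | 65% | False |
| 2 | fairness | min_gender_rougeL_score | 2 | 1 | 33% | 65% | False |
| 3 | fairness | min_gender_rougeLsum_score | 2 | 1 | 33% | 65% | False |
| 4 | fairness | max_gender_rouge1_score | 1 | 2 | 67% | 65% | True |
| 5 | fairness | max_gender_rouge2_score | 1 | 2 | 67% | 65% | True |
| 6 | fairness | max_gender_rougeL_score | 1 | 2 | 67% | 65% | True |
| 7 | fairness | max_gender_rougeLsum_score | 1 | 2 | 67% | 65% | True |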
\n","
\n","
\n","\n","
\n"," \n","\n"," \n","\n"," \n","
\n","\n","\n","
\n"," \n","\n","\n","\n"," \n","
\n","
\n","
\n"],"text/plain":[" category test_type fail_count pass_count pass_rate \\\n","0 fairness min_gender_rouge1_score 2 1 33% \n","1 fairness min_gender_rouge2_score 2 1 33% \n","2 fairness min_gender_rougeL_score 2 1 33% \n","3 fairness min_gender_rougeLsum_score 2 1 33% \n","4 fairness max_gender_rouge1_score 1 2 67% \n","5 fairness max_gender_rouge2_score 1 2 67% \n","6 fairness max_gender_rougeL_score 1 2 67% \n","7 fairness max_gender_rougeLsum_score 1 2 67% \n","\n"," minimum_pass_rate pass \n","0 65% False \n","1 65% False \n","2 65% False \n","3 65% False \n","4 65% True \n","5 65% True \n","6 65% True \n","7 65% True "]},"execution_count":18,"metadata":{},"output_type":"execute_result"}],"source":["harness.report()"]},{"cell_type":"markdown","metadata":{"id":"0jSkCQudYh3F"},"source":["## Accuracy"]},{"cell_type":"markdown","metadata":{"id":"YwAzCAHkGd0X"},"source":["Available Accuracy tests for QA task are:\n","\n","* `min_exact_match_score`\n","* `min_bleu_score`\n","* `min_rouge1_score`\n","* `min_rouge2_score`\n","* `min_rougeL_score`\n","* `min_rougeLsum_score`"]},{"cell_type":"code","execution_count":19,"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"elapsed":200,"status":"ok","timestamp":1692371383218,"user":{"displayName":"Prikshit sharma","userId":"07819241395213139913"},"user_tz":-330},"id":"qG3UX5c-YgJn","outputId":"153fbe09-ae45-4dd3-bcbd-c97cd07b3c59"},"outputs":[{"name":"stdout","output_type":"stream","text":["Test Configuration : \n"," {\n"," \"model_parameters\": {\n"," \"temperature\": 0.2,\n"," \"max_tokens\": 64\n"," },\n"," \"tests\": {\n"," \"defaults\": {\n"," \"min_pass_rate\": 1.0\n"," },\n"," \"robustness\": {\n"," \"add_typo\": {\n"," \"min_pass_rate\": 0.7\n"," },\n"," \"lowercase\": {\n"," \"min_pass_rate\": 0.7\n"," }\n"," }\n"," }\n","}\n"]}],"source":["harness = Harness(\n"," task=\"question-answering\", \n"," model={\"model\": \"text-davinci-003\",\"hub\":\"openai\"}, \n"," data={\"data_source\" 
:\"MMLU\",\n"," \"split\":\"test-tiny\"}\n"," )"]},{"cell_type":"code","execution_count":20,"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"elapsed":189,"status":"ok","timestamp":1692371383222,"user":{"displayName":"Prikshit sharma","userId":"07819241395213139913"},"user_tz":-330},"id":"KuLxNXwXYl2z","outputId":"4955decb-3e10-4c42-aa96-880298dce501"},"outputs":[{"data":{"text/plain":["{'tests': {'defaults': {'min_pass_rate': 0.65},\n"," 'accuracy': {'min_exact_match_score': {'min_score': 0.5},\n"," 'min_rouge1_score': {'min_score': 0.5}}}}"]},"execution_count":20,"metadata":{},"output_type":"execute_result"}],"source":["harness.configure(\n","{\n"," 'tests': {'defaults': {'min_pass_rate': 0.65},\n"," 'accuracy': {'min_exact_match_score': {'min_score': 0.50},\n"," 'min_rouge1_score':{'min_score': 0.50},\n","\n"," }\n"," }\n"," }\n"," )"]},{"cell_type":"markdown","metadata":{"id":"hd6BEnBtHyME"},"source":["### Generating the test cases."]},{"cell_type":"code","execution_count":21,"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"elapsed":132,"status":"ok","timestamp":1692371383225,"user":{"displayName":"Prikshit sharma","userId":"07819241395213139913"},"user_tz":-330},"id":"4_wMTSmbYqTa","outputId":"052f1736-382b-4b79-a395-a53fcf94d136"},"outputs":[{"name":"stderr","output_type":"stream","text":["\n","Generating testcases...: 100%|██████████| 1/1 [00:00<00:00, 5242.88it/s]\n"]},{"data":{"text/plain":[]},"execution_count":21,"metadata":{},"output_type":"execute_result"}],"source":["harness.generate()"]},{"cell_type":"code","execution_count":22,"metadata":{"colab":{"base_uri":"https://localhost:8080/","height":112},"executionInfo":{"elapsed":114,"status":"ok","timestamp":1692371383229,"user":{"displayName":"Prikshit sharma","userId":"07819241395213139913"},"user_tz":-330},"id":"W28l71dScgG0","outputId":"b136d68b-349d-45df-fb07-c79646dec5ac"},"outputs":[{"data":{"text/html":["\n","
\n","
\n","\n","\n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n","
| | category | test_type |
| 0 | accuracy | min_exact_match_score |
| 1 | accuracy | min_rouge1_score |
\n","
\n","
\n","\n","
\n"," \n","\n"," \n","\n"," \n","
\n","\n","\n","
\n"," \n","\n","\n","\n"," \n","
\n","
\n","
\n"],"text/plain":[" category test_type\n","0 accuracy min_exact_match_score\n","1 accuracy min_rouge1_score"]},"execution_count":22,"metadata":{},"output_type":"execute_result"}],"source":["harness.testcases()"]},{"cell_type":"markdown","metadata":{"id":"UsbsuknXH0ue"},"source":["### Running the tests"]},{"cell_type":"code","execution_count":23,"metadata":{"colab":{"base_uri":"https://localhost:8080/","height":85,"referenced_widgets":["20e863ea2c17471ead434e1df3c623ed","d9f2bbecf3fd4473af04e2e25653f928","8f273303cf324d0bb3146ecea2af2411","d9f73f8d0c7345049a7ea11924b756dd","d32e905239be4fef985ae8767d6add99","01df3137965b434190d73bb59c9790bb","a2ff2f24ad77485e9de01427e2231712","ab31e5a39fe143d8895353e2c7ebea3c","61e4c8036ec34d28a5efafb0c41a0a74","aa57f92f95904c529d342790ecf4d75c","88af924ecc884636bb5bc9cad872e53a"]},"executionInfo":{"elapsed":281661,"status":"ok","timestamp":1692371664782,"user":{"displayName":"Prikshit sharma","userId":"07819241395213139913"},"user_tz":-330},"id":"PxeBTKR9chtd","outputId":"3540745d-bab7-4eb5-f5eb-2477c8b951bc"},"outputs":[{"name":"stderr","output_type":"stream","text":["\rRunning testcases... : 0%| | 0/2 [00:00\n","
\n","\n","\n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n","
| | category | test_type | expected_result | actual_result | pass |
| 0 | accuracy | min_exact_match_score | 0.5 | 0.592982 | True |
| 1 | accuracy | min_rouge1_score | 0.5 | 0.730155 | True |
\n","
\n","
\n","\n","
\n"," \n","\n"," \n","\n"," \n","
\n","\n","\n","
\n"," \n","\n","\n","\n"," \n","
\n","
\n"," \n"],"text/plain":[" category test_type expected_result actual_result pass\n","0 accuracy min_exact_match_score 0.5 0.592982 True\n","1 accuracy min_rouge1_score 0.5 0.730155 True"]},"execution_count":24,"metadata":{},"output_type":"execute_result"}],"source":["harness.generated_results()"]},{"cell_type":"markdown","metadata":{"id":"uIOiTX1IH3d8"},"source":["### Final Results"]},{"cell_type":"code","execution_count":25,"metadata":{"colab":{"base_uri":"https://localhost:8080/","height":112},"executionInfo":{"elapsed":35,"status":"ok","timestamp":1692371664787,"user":{"displayName":"Prikshit sharma","userId":"07819241395213139913"},"user_tz":-330},"id":"4U3PMgpEcn5o","outputId":"4958bf35-ffc1-477d-e5bf-b3d86acae806"},"outputs":[{"data":{"text/html":["\n","
\n","
\n","\n","\n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n","
| | category | test_type | fail_count | pass_count | pass_rate | minimum_pass_rate | pass |
| 0 | accuracy | min_exact_match_score | 0 | 1 | 100% | 65% | True |
| 1 | accuracy | min_rouge1_score | 0 | 1 | 100% | 65% | True |
\n","
\n","
\n","\n","
\n"," \n","\n"," \n","\n"," \n","
\n","\n","\n","
\n"," \n","\n","\n","\n"," \n","
\n","
\n","
\n"],"text/plain":[" category test_type fail_count pass_count pass_rate \\\n","0 accuracy min_exact_match_score 0 1 100% \n","1 accuracy min_rouge1_score 0 1 100% \n","\n"," minimum_pass_rate pass \n","0 65% True \n","1 65% True "]},"execution_count":25,"metadata":{},"output_type":"execute_result"}],"source":["harness.report()"]}],"metadata":{"accelerator":"TPU","colab":{"machine_shape":"hm","provenance":[],"toc_visible":true},"kernelspec":{"display_name":"Python 3","name":"python3"},"language_info":{"name":"python"},"widgets":{"application/vnd.jupyter.widget-state+json":{"01df3137965b434190d73bb59c9790bb":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"0d7c7a938349427983d62652e81cead5":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"alig
n_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"0dab743db8f14b77b0ec1699f92f86ed":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"DescriptionStyleModel","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"DescriptionStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","description_width":""}},"20d999a03d814a7785232c091241dc1c":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"HTMLModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HTMLModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HTMLView","description":"","description_tooltip":null,"layout":"IPY_MODEL_351e721352bf4c7cb30dbbe8a06ce35d","placeholder":"​","style":"IPY_MODEL_ad6bedec421b40d897568ae3f2705810","value":"Downloading (…)solve/main/vocab.txt: 
100%"}},"20e863ea2c17471ead434e1df3c623ed":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"HBoxModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HBoxModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HBoxView","box_style":"","children":["IPY_MODEL_d9f2bbecf3fd4473af04e2e25653f928","IPY_MODEL_8f273303cf324d0bb3146ecea2af2411","IPY_MODEL_d9f73f8d0c7345049a7ea11924b756dd"],"layout":"IPY_MODEL_d32e905239be4fef985ae8767d6add99"}},"257c00fef73b4d50950c8d8b165e26a2":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"HBoxModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HBoxModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HBoxView","box_style":"","children":["IPY_MODEL_75d0522480494bb1a7b66e14fc43faac","IPY_MODEL_4218ed9efdf84217b5daa2aa5930e20b","IPY_MODEL_867e0de65c734221ad6f2623c2a35f57"],"layout":"IPY_MODEL_d3ca7afb948f404682aa027d3d76d237"}},"2608c51cf9784a56baeddf9d1622ce76":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"m
ax_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"2773b8eeb7024310b2264d487a9b26df":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"ProgressStyleModel","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"ProgressStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","bar_color":null,"description_width":""}},"2c907621903c43c9ad7ed84ee9026412":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"ProgressStyleModel","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"ProgressStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","bar_color":null,"description_width":""}},"2e68a1149b7b40bc8c2811b1a16c96ea":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,
"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"351e721352bf4c7cb30dbbe8a06ce35d":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"3d19431d61e747df81b5b6730e67c955":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"FloatProgressModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"FloatProgressModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"ProgressView","bar_style":"success","description":"","description_tooltip":null,"layout":"IPY_MODEL_6984b154f66d4f1ab209168e50a64acd","max":6270,"min":0,"orientation":"horizontal","style":"IPY_MODEL_2c907621903c43c9ad7ed84ee9026412","value":6270}},"3e9c9defb1d148b5a6de25cb2095740a":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"HTMLModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/contr
ols","_model_module_version":"1.5.0","_model_name":"HTMLModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HTMLView","description":"","description_tooltip":null,"layout":"IPY_MODEL_5a12148bfe9848c5b9827d9b677b39dd","placeholder":"​","style":"IPY_MODEL_b4bf22308b254236960ff1eb5306c4e9","value":"Downloading builder script: 100%"}},"4218ed9efdf84217b5daa2aa5930e20b":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"FloatProgressModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"FloatProgressModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"ProgressView","bar_style":"success","description":"","description_tooltip":null,"layout":"IPY_MODEL_2608c51cf9784a56baeddf9d1622ce76","max":525,"min":0,"orientation":"horizontal","style":"IPY_MODEL_2773b8eeb7024310b2264d487a9b26df","value":525}},"4f579cc50d884981b562f112b8764075":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padd
ing":null,"right":null,"top":null,"visibility":null,"width":null}},"52ef8bcdab0a42f0a5d6a336766de54d":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"DescriptionStyleModel","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"DescriptionStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","description_width":""}},"56701a47f6ee4a6d81a98f66756baf03":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"HBoxModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HBoxModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HBoxView","box_style":"","children":["IPY_MODEL_20d999a03d814a7785232c091241dc1c","IPY_MODEL_6ab5b7e5c6784f3b92b6180ae0043589","IPY_MODEL_9824945e44fe4af4a1d70a8383b72b72"],"layout":"IPY_MODEL_0d7c7a938349427983d62652e81cead5"}},"5a0ba0d42433427c8874b56d5ef1f4a2":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"DescriptionStyleModel","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"DescriptionStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","description_width":""}},"5a12148bfe9848c5b9827d9b677b39dd":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_
columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"61e4c8036ec34d28a5efafb0c41a0a74":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"ProgressStyleModel","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"ProgressStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","bar_color":null,"description_width":""}},"660e7fdd115f4e728fe7ea0358fd8bff":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"6984b154f66d4f1ab2
09168e50a64acd":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"6ab5b7e5c6784f3b92b6180ae0043589":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"FloatProgressModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"FloatProgressModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"ProgressView","bar_style":"success","description":"","description_tooltip":null,"layout":"IPY_MODEL_fabd451f3ccc47d5aed88e94eec722f7","max":231508,"min":0,"orientation":"horizontal","style":"IPY_MODEL_c07ab8a5ad3e41e991f940b6e08e1814","value":231508}},"75d0522480494bb1a7b66e14fc43faac":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"HTMLModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HTMLModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_
version":"1.5.0","_view_name":"HTMLView","description":"","description_tooltip":null,"layout":"IPY_MODEL_f2540d52716a4393a5f050f8d030f3f3","placeholder":"​","style":"IPY_MODEL_0dab743db8f14b77b0ec1699f92f86ed","value":"Downloading (…)lve/main/config.json: 100%"}},"7b972e6f8f624ac28f148a8cff4b0ee2":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"805c8478574545c398214ce2d295944a":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"HTMLModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HTMLModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HTMLView","description":"","description_tooltip":null,"layout":"IPY_MODEL_4f579cc50d884981b562f112b8764075","placeholder":"​","style":"IPY_MODEL_5a0ba0d42433427c8874b56d5ef1f4a2","value":" 6.27k/6.27k [00:00<00:00, 
260kB/s]"}},"829fb20d826d45baaf8d785179c1b32f":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"DescriptionStyleModel","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"DescriptionStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","description_width":""}},"867e0de65c734221ad6f2623c2a35f57":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"HTMLModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HTMLModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HTMLView","description":"","description_tooltip":null,"layout":"IPY_MODEL_a3d9b7d4b44540d88953c69b56f9269f","placeholder":"​","style":"IPY_MODEL_cb676eb37f2a4126837c7324bf51d7ad","value":" 525/525 [00:00<00:00, 
17.4kB/s]"}},"88af924ecc884636bb5bc9cad872e53a":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"DescriptionStyleModel","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"DescriptionStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","description_width":""}},"8f273303cf324d0bb3146ecea2af2411":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"FloatProgressModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"FloatProgressModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"ProgressView","bar_style":"success","description":"","description_tooltip":null,"layout":"IPY_MODEL_ab31e5a39fe143d8895353e2c7ebea3c","max":5669,"min":0,"orientation":"horizontal","style":"IPY_MODEL_61e4c8036ec34d28a5efafb0c41a0a74","value":5669}},"9824945e44fe4af4a1d70a8383b72b72":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"HTMLModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HTMLModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HTMLView","description":"","description_tooltip":null,"layout":"IPY_MODEL_660e7fdd115f4e728fe7ea0358fd8bff","placeholder":"​","style":"IPY_MODEL_52ef8bcdab0a42f0a5d6a336766de54d","value":" 232k/232k [00:00<00:00, 
3.60MB/s]"}},"9a1221b68d2c4af1a74f5978e252d507":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"HTMLModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HTMLModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HTMLView","description":"","description_tooltip":null,"layout":"IPY_MODEL_e349b98fd389418fb365f53185489437","placeholder":"​","style":"IPY_MODEL_f6ebb67ea4574f3e8924b90d7b5aba12","value":" 51.0M/51.0M [00:00<00:00, 148MB/s]"}},"a2ff2f24ad77485e9de01427e2231712":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"DescriptionStyleModel","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"DescriptionStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","description_width":""}},"a3d9b7d4b44540d88953c69b56f9269f":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null
,"right":null,"top":null,"visibility":null,"width":null}},"aa57f92f95904c529d342790ecf4d75c":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"ab31e5a39fe143d8895353e2c7ebea3c":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"or
der":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"ad6bedec421b40d897568ae3f2705810":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"DescriptionStyleModel","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"DescriptionStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","description_width":""}},"b16b721265754f5fa258970429fc7bdd":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"b4bf22308b254236960ff1eb5306c4e9":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"DescriptionStyleModel","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"DescriptionStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","description_width":""}},"c07ab8a5a
d3e41e991f940b6e08e1814":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"ProgressStyleModel","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"ProgressStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","bar_color":null,"description_width":""}},"cb676eb37f2a4126837c7324bf51d7ad":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"DescriptionStyleModel","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"DescriptionStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","description_width":""}},"d32e905239be4fef985ae8767d6add99":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"d3ca7afb948f404682aa027d3d76d237":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"
_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"d5950fc7527049279a8d433985f79619":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"HBoxModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HBoxModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HBoxView","box_style":"","children":["IPY_MODEL_3e9c9defb1d148b5a6de25cb2095740a","IPY_MODEL_3d19431d61e747df81b5b6730e67c955","IPY_MODEL_805c8478574545c398214ce2d295944a"],"layout":"IPY_MODEL_7b972e6f8f624ac28f148a8cff4b0ee2"}},"d9f2bbecf3fd4473af04e2e25653f928":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"HTMLModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HTMLModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HTMLView","description":"","description_tooltip":null,"layout":"IPY_MODEL_01df3137965b434190d73bb59c9790bb","placeholder":"​","style":
"IPY_MODEL_a2ff2f24ad77485e9de01427e2231712","value":"Downloading builder script: 100%"}},"d9f73f8d0c7345049a7ea11924b756dd":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"HTMLModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HTMLModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HTMLView","description":"","description_tooltip":null,"layout":"IPY_MODEL_aa57f92f95904c529d342790ecf4d75c","placeholder":"​","style":"IPY_MODEL_88af924ecc884636bb5bc9cad872e53a","value":" 5.67k/5.67k [00:00<00:00, 239kB/s]"}},"e0e00dfcfb7c49ac961ff7f1101a0caa":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"HTMLModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HTMLModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HTMLView","description":"","description_tooltip":null,"layout":"IPY_MODEL_2e68a1149b7b40bc8c2811b1a16c96ea","placeholder":"​","style":"IPY_MODEL_829fb20d826d45baaf8d785179c1b32f","value":"Downloading pytorch_model.bin: 
100%"}},"e349b98fd389418fb365f53185489437":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"e367e27cda314517ab18696ecd913e0a":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"FloatProgressModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"FloatProgressModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"ProgressView","bar_style":"success","description":"","description_tooltip":null,"layout":"IPY_MODEL_feb421598a0441498d81241716261b78","max":51044621,"min":0,"orientation":"horizontal","style":"IPY_MODEL_f0fc5b6cb35e4986b5ef1f2d03e56228","value":51044621}},"f0fc5b6cb35e4986b5ef1f2d03e56228":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"ProgressStyleModel","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"ProgressStyleModel","_view_count":null,"_view_module":"@jupyter-w
idgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","bar_color":null,"description_width":""}},"f2540d52716a4393a5f050f8d030f3f3":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"f6ebb67ea4574f3e8924b90d7b5aba12":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"DescriptionStyleModel","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"DescriptionStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","description_width":""}},"fa4244813260430c98d2fbad63671f10":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"HBoxModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HBoxModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HBoxView","box_style":"","children":["IPY_MODEL_e0e00dfcfb7c49ac961ff7f
1101a0caa","IPY_MODEL_e367e27cda314517ab18696ecd913e0a","IPY_MODEL_9a1221b68d2c4af1a74f5978e252d507"],"layout":"IPY_MODEL_b16b721265754f5fa258970429fc7bdd"}},"fabd451f3ccc47d5aed88e94eec722f7":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"feb421598a0441498d81241716261b78":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":n
ull,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}}}}},"nbformat":4,"nbformat_minor":0} +{"cells":[{"cell_type":"markdown","metadata":{"id":"-euMnuisAIDX"},"source":["![image.png]()"]},{"cell_type":"markdown","metadata":{"id":"_-k2O6KeLI1D"},"source":["[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/langtest/blob/main/demo/tutorials/llm_notebooks/dataset-notebooks/mmlu_dataset.ipynb)"]},{"cell_type":"markdown","metadata":{"id":"wCxsD2KDAWU2"},"source":["**LangTest** is an open-source python library designed to help developers deliver safe and effective Natural Language Processing (NLP) models. Whether you are using **John Snow Labs, Hugging Face, Spacy** models or **OpenAI, Cohere, AI21, Hugging Face Inference API and Azure-OpenAI** based LLMs, it has got you covered. You can test any Named Entity Recognition (NER), Text Classification model using the library. We also support testing LLMS for Question-Answering and Summarization tasks on benchmark datasets. The library supports 50+ out of the box tests. These tests fall into robustness, accuracy, bias, representation and fairness test categories.\n","\n","Metrics are calculated by comparing the model's extractions in the original list of sentences against the extractions carried out in the noisy list of sentences. 
The original annotated labels are not used at any point; we are simply comparing the model against itself in two settings."]},{"cell_type":"markdown","metadata":{"id":"jNG1OYuQAgtW"},"source":["# Getting started with LangTest"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"32C5aiC-LI1L"},"outputs":[],"source":["!pip install \"langtest[openai,transformers,evaluate]\""]},{"cell_type":"markdown","metadata":{"id":"EsEtlSiNAnSO"},"source":["# Harness and Its Parameters\n","\n","The Harness class is a testing class for Natural Language Processing (NLP) models. It evaluates the performance of an NLP model on a given task using test data and generates a report with test results. Harness can be imported from the LangTest library in the following way."]},{"cell_type":"code","execution_count":2,"metadata":{"executionInfo":{"elapsed":3452,"status":"ok","timestamp":1692371266150,"user":{"displayName":"Prikshit sharma","userId":"07819241395213139913"},"user_tz":-330},"id":"w2GPpdowS1C9"},"outputs":[],"source":["# Import Harness from the LangTest library\n","from langtest import Harness"]},{"cell_type":"markdown","metadata":{"id":"7_6PF_HGA4EO"},"source":["This imports the Harness class from the module. The class is designed to provide a blueprint or framework for conducting NLP testing, and its instances can be customized or configured for different testing scenarios or environments.\n","\n","Here is a list of the different parameters that can be passed to the Harness constructor:\n","\n","
\n","\n","\n","| Parameter | Description | \n","| - | - | \n","|**task** |Task for which the model is to be evaluated (question-answering or summarization)|\n","| **model** | Specifies the model(s) to be evaluated. This parameter can be provided as either a dictionary or a list of dictionaries. Each dictionary should contain the following keys:
  • model (mandatory): PipelineModel, path to a saved model, or pretrained pipeline/model from the hub.
  • hub (mandatory): Hub (library) used in the back end to load the model, either from a public model hub or from a path.
|\n","| **data** | The data to be used for evaluation. A dictionary providing flexibility and options for data sources. It should include the following keys:
  • data_source (mandatory): The source of the data.
  • subset (optional): The subset of the data.
  • feature_column (optional): The column containing the features.
  • target_column (optional): The column containing the target labels.
  • split (optional): The data split to be used.
  • source (optional): Set to 'huggingface' when loading a Hugging Face dataset.
|\n","| **config** | Configuration for the tests to be performed, specified in the form of a YAML file. |\n","\n","
\n","
"]},{"cell_type":"markdown","metadata":{"id":"pHJQHDcSA_CV"},"source":["# OpenAI Model Testing For Question Answering\n","\n","In this section, we dive into testing OpenAI models on the Question Answering task.\n","\n","For now, LangTest supports robustness tests for LLM testing."]},{"cell_type":"code","execution_count":3,"metadata":{"executionInfo":{"elapsed":111,"status":"ok","timestamp":1692371266152,"user":{"displayName":"Prikshit sharma","userId":"07819241395213139913"},"user_tz":-330},"id":"YXVcv79JTAWA"},"outputs":[],"source":["import os\n","\n","os.environ[\"OPENAI_API_KEY\"] = \"\""]},{"cell_type":"markdown","metadata":{"id":"2Q1uClT2kgLB"},"source":["## MMLU \n","[Measuring Massive Multitask Language Understanding](https://arxiv.org/abs/2009.03300)\n","\n","**Dataset Summary**\n","\n","- MMLU (Massive Multitask Language Understanding) is a new benchmark designed to measure knowledge acquired during pretraining by evaluating models exclusively in zero-shot and few-shot settings. This makes the benchmark more challenging and more similar to how we evaluate humans. The benchmark covers 57 subjects across STEM, the humanities, the social sciences, and more. It ranges in difficulty from an elementary level to an advanced professional level, and it tests both world knowledge and problem-solving ability. Subjects range from traditional areas, such as mathematics and history, to more specialized areas like law and ethics. The granularity and breadth of the subjects make the benchmark ideal for identifying a model’s blind spots.\n","\n","**Data Splits**\n","\n","- `test` - Test set from the MMLU dataset which covers 57 tasks including elementary mathematics, US history, computer science, law, and more. We took 50 samples from each task in the test set.\n","\n","- `test-tiny` - Truncated version of the test set from the MMLU dataset which covers 57 tasks including elementary mathematics, US history, computer science, law, and more. 
We took 10 samples from each task in the test-tiny set.\n","\n","- `clinical` - Curated version of the MMLU dataset that contains the clinical subsets (college_biology, college_medicine, medical_genetics, human_aging, professional_medicine, nutrition)."]},{"cell_type":"markdown","metadata":{"id":"1WO54aEnBKK8"},"source":["### Setup and Configure Harness"]},{"cell_type":"code","execution_count":4,"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"elapsed":105,"status":"ok","timestamp":1692371266153,"user":{"displayName":"Prikshit sharma","userId":"07819241395213139913"},"user_tz":-330},"id":"f13UydObTDRG","outputId":"e9ed4754-3026-42ba-85dd-6c100e3c60c9"},"outputs":[{"name":"stdout","output_type":"stream","text":["Test Configuration : \n"," {\n"," \"model_parameters\": {\n"," \"temperature\": 0.2,\n"," \"max_tokens\": 64\n"," },\n"," \"tests\": {\n"," \"defaults\": {\n"," \"min_pass_rate\": 1.0\n"," },\n"," \"robustness\": {\n"," \"add_typo\": {\n"," \"min_pass_rate\": 0.7\n"," },\n"," \"lowercase\": {\n"," \"min_pass_rate\": 0.7\n"," }\n"," }\n"," }\n","}\n"]}],"source":["harness = Harness(\n"," task=\"question-answering\", \n"," model={\"model\": \"text-davinci-003\",\"hub\":\"openai\"}, \n"," data={\"data_source\" :\"MMLU\",\n"," \"split\":\"test-tiny\"}\n"," )"]},{"cell_type":"markdown","metadata":{"id":"djMJVtS3U3Wv"},"source":["## Robustness"]},{"cell_type":"markdown","metadata":{"id":"NQ1KF731BW5O"},"source":["For the tests, we used uppercase, Dyslexia Word Swap, Add Slangs, Insert Abbreviations, and Speech to Text typos. 
Other available robustness tests for the QA task are:\n","* `add_context`\n","* `add_contraction`\n","* `add_punctuation`\n","* `add_typo`\n","* `add_ocr_typo`\n","* `american_to_british`\n","* `british_to_american`\n","* `lowercase`\n","* `strip_punctuation`\n","* `titlecase`\n","* `uppercase`\n","* `number_to_word`\n","* `add_abbreviation`\n","* `add_speech_to_text_typo`\n","* `add_slangs`\n","* `dyslexia_word_swap`\n","* `multiple_perturbations`\n","* `adjective_synonym_swap`\n","* `adjective_antonym_swap`\n","* `strip_all_punctuation`"]},{"cell_type":"markdown","metadata":{"id":"8VxrRAMkBf1H"},"source":["You can also set prompts and other model parameters in the config. Possible parameters are:\n","* `user_prompt:` Prompt to be given to the model.\n","* `temperature:` Temperature of the model.\n","* `max_tokens:` Maximum number of output tokens allowed for the model."]},{"cell_type":"code","execution_count":5,"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"elapsed":85,"status":"ok","timestamp":1692371266155,"user":{"displayName":"Prikshit sharma","userId":"07819241395213139913"},"user_tz":-330},"id":"fMFVq3mCTQ7j","outputId":"150254fc-f2e6-42fe-93e7-92ef6c1468ae"},"outputs":[{"data":{"text/plain":["{'tests': {'defaults': {'min_pass_rate': 0.65},\n"," 'robustness': {'uppercase': {'min_pass_rate': 0.66},\n"," 'dyslexia_word_swap': {'min_pass_rate': 0.6},\n"," 'add_abbreviation': {'min_pass_rate': 0.6},\n"," 'add_slangs': {'min_pass_rate': 0.6},\n"," 'add_speech_to_text_typo': {'min_pass_rate': 0.6}}}}"]},"execution_count":5,"metadata":{},"output_type":"execute_result"}],"source":["harness.configure(\n","{\n"," 'tests': {'defaults': {'min_pass_rate': 0.65},\n"," 'robustness': {'uppercase': {'min_pass_rate': 0.66},\n"," 'dyslexia_word_swap':{'min_pass_rate': 0.60},\n"," 'add_abbreviation':{'min_pass_rate': 0.60},\n"," 'add_slangs':{'min_pass_rate': 0.60},\n"," 'add_speech_to_text_typo':{'min_pass_rate': 0.60},\n","\n"," }\n"," }\n"," }\n"," 
)"]},{"cell_type":"markdown","metadata":{"id":"AxKHTNFELI1x"},"source":["➤ You can adjust the level of transformation in the sentence by using the \"`prob`\" parameter, which controls the proportion of words to be changed during robustness tests.\n","\n","➤ **NOTE** : \"`prob`\" defaults to 1.0, which means all words will be transformed.\n","```\n","harness.configure(\n","{\n"," 'tests': {\n"," 'defaults': {'min_pass_rate': 0.65},\n"," 'robustness': {\n"," 'uppercase': {'min_pass_rate': 0.66, 'prob': 0.50},\n"," 'dyslexia_word_swap':{'min_pass_rate': 0.60, 'prob': 0.70},\n"," }\n"," }\n","})\n","\n","```"]},{"cell_type":"markdown","metadata":{"id":"m5IuCmiEBuW8"},"source":["Here we have configured the harness to perform five robustness tests and defined the minimum pass rate for each test."]},{"cell_type":"code","execution_count":6,"metadata":{"executionInfo":{"elapsed":71,"status":"ok","timestamp":1692371266157,"user":{"displayName":"Prikshit sharma","userId":"07819241395213139913"},"user_tz":-330},"id":"nmHqJ_TlUg8h"},"outputs":[],"source":["harness.data = harness.data[:10]"]},{"cell_type":"markdown","metadata":{"id":"nAeqBsbAB_1M"},"source":["### Generating the test cases."]},{"cell_type":"code","execution_count":7,"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"elapsed":17814,"status":"ok","timestamp":1692371283903,"user":{"displayName":"Prikshit sharma","userId":"07819241395213139913"},"user_tz":-330},"id":"CCJxFd4nUkMN","outputId":"9f99926a-a068-4698-ff9d-68f2416a075d"},"outputs":[{"name":"stderr","output_type":"stream","text":["Generating testcases...: 100%|██████████| 1/1 [00:00<00:00, 1392.99it/s]\n"]},{"data":{"text/plain":[]},"execution_count":7,"metadata":{},"output_type":"execute_result"}],"source":["harness.generate()"]},{"cell_type":"markdown","metadata":{"id":"ZEWchFb8CDrk"},"source":["harness.generate() method automatically generates the test cases (based on the provided
configuration)"]},{"cell_type":"markdown","metadata":{"id":"MEnLcl-OCG1O"},"source":["### Running the tests"]},{"cell_type":"code","execution_count":8,"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"elapsed":32123,"status":"ok","timestamp":1692371316007,"user":{"displayName":"Prikshit sharma","userId":"07819241395213139913"},"user_tz":-330},"id":"gFEez-T0UlcC","outputId":"3684f7af-9359-4f24-e584-5307e3927bfe"},"outputs":[{"name":"stderr","output_type":"stream","text":["Running testcases... : 100%|██████████| 50/50 [00:32<00:00, 1.55it/s]\n"]},{"data":{"text/plain":[]},"execution_count":8,"metadata":{},"output_type":"execute_result"}],"source":["harness.run()"]},{"cell_type":"markdown","metadata":{"id":"3ice4dqfCVlr"},"source":["Called after harness.generate() and used to run all the tests. Returns a pass/fail flag for each test."]},{"cell_type":"markdown","metadata":{"id":"g1NxuqveOc-t"},"source":["### Generated Results"]},{"cell_type":"code","execution_count":9,"metadata":{"colab":{"base_uri":"https://localhost:8080/","height":1000},"executionInfo":{"elapsed":16558,"status":"ok","timestamp":1692371332559,"user":{"displayName":"Prikshit sharma","userId":"07819241395213139913"},"user_tz":-330},"id":"ZjYBONiuYJdK","outputId":"4e69d5fb-cfbd-4713-c25e-0cb49bb0878d"},"outputs":[{"data":{"text/html":["\n","
<table>\n",
"<thead>\n",
"<tr><th></th><th>category</th><th>test_type</th><th>original_context</th><th>original_question</th><th>perturbed_context</th><th>perturbed_question</th><th>expected_result</th><th>actual_result</th><th>pass</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><th>0</th><td>robustness</td><td>uppercase</td><td>-</td><td>Find the degree for the given field extension ...</td><td>-</td><td>FIND THE DEGREE FOR THE GIVEN FIELD EXTENSION ...</td><td>B. 4</td><td>B. 4</td><td>True</td></tr>\n",
"<tr><th>1</th><td>robustness</td><td>uppercase</td><td>-</td><td>Let p = (1, 2, 5, 4)(2, 3) in S_5 . Find the i...</td><td>-</td><td>LET P = (1, 2, 5, 4)(2, 3) IN S_5 . FIND THE I...</td><td>C. 24</td><td>C. 24</td><td>True</td></tr>\n",
"<tr><th>2</th><td>robustness</td><td>uppercase</td><td>-</td><td>Find all zeros in the indicated finite field o...</td><td>-</td><td>FIND ALL ZEROS IN THE INDICATED FINITE FIELD O...</td><td>A. 0</td><td>D. 0,4</td><td>False</td></tr>\n",
"<tr><th>3</th><td>robustness</td><td>uppercase</td><td>-</td><td>Statement 1 | A factor group of a non-Abelian ...</td><td>-</td><td>STATEMENT 1 | A FACTOR GROUP OF A NON-ABELIAN ...</td><td>A. True, True</td><td>C. TRUE, FALSE</td><td>False</td></tr>\n",
"<tr><th>4</th><td>robustness</td><td>uppercase</td><td>-</td><td>Find the product of the given polynomials in t...</td><td>-</td><td>FIND THE PRODUCT OF THE GIVEN POLYNOMIALS IN T...</td><td>C. 0</td><td>C. 0</td><td>True</td></tr>\n",
"<tr><th>5</th><td>robustness</td><td>uppercase</td><td>-</td><td>Statement 1 | If a group has an element of ord...</td><td>-</td><td>STATEMENT 1 | IF A GROUP HAS AN ELEMENT OF ORD...</td><td>C. True, False</td><td>C. TRUE, FALSE</td><td>True</td></tr>\n",
"<tr><th>6</th><td>robustness</td><td>uppercase</td><td>-</td><td>Statement 1 | Every homomorphic image of a gro...</td><td>-</td><td>STATEMENT 1 | EVERY HOMOMORPHIC IMAGE OF A GRO...</td><td>C. True, False</td><td>C. TRUE, FALSE</td><td>True</td></tr>\n",
"<tr><th>7</th><td>robustness</td><td>uppercase</td><td>-</td><td>Statement 1 | A ring homomorphism is one to on...</td><td>-</td><td>STATEMENT 1 | A RING HOMOMORPHISM IS ONE TO ON...</td><td>C. True, False</td><td>A. TRUE, TRUE</td><td>False</td></tr>\n",
"<tr><th>8</th><td>robustness</td><td>uppercase</td><td>-</td><td>Find the degree for the given field extension ...</td><td>-</td><td>FIND THE DEGREE FOR THE GIVEN FIELD EXTENSION ...</td><td>B. 4</td><td>C. 2</td><td>False</td></tr>\n",
"<tr><th>9</th><td>robustness</td><td>uppercase</td><td>-</td><td>Find all zeros in the indicated finite field o...</td><td>-</td><td>FIND ALL ZEROS IN THE INDICATED FINITE FIELD O...</td><td>A. 1</td><td>C. 2,3</td><td>False</td></tr>\n",
"<tr><th>10</th><td>robustness</td><td>dyslexia_word_swap</td><td>-</td><td>Find the degree for the given field extension ...</td><td>-</td><td>Find the degree four the given field extension...</td><td>B. 4</td><td>B. 4</td><td>True</td></tr>\n",
"<tr><th>11</th><td>robustness</td><td>dyslexia_word_swap</td><td>-</td><td>Let p = (1, 2, 5, 4)(2, 3) in S_5 . Find the i...</td><td>-</td><td>Let p = (1, 2, 5, 4)(2, 3) in S_5 . Find the i...</td><td>C. 24</td><td>C. 24</td><td>True</td></tr>\n",
"<tr><th>12</th><td>robustness</td><td>dyslexia_word_swap</td><td>-</td><td>Find all zeros in the indicated finite field o...</td><td>-</td><td>Find all zeros in the indicated finite field o...</td><td>A. 0</td><td>A. 0</td><td>True</td></tr>\n",
"<tr><th>13</th><td>robustness</td><td>dyslexia_word_swap</td><td>-</td><td>Statement 1 | A factor group of a non-Abelian ...</td><td>-</td><td>Statement 1 | A factor group off a non-Abelian...</td><td>A. True, True</td><td>C. True, False</td><td>False</td></tr>\n",
"<tr><th>14</th><td>robustness</td><td>dyslexia_word_swap</td><td>-</td><td>Find the product of the given polynomials in t...</td><td>-</td><td>Find the product off the given polynomials in ...</td><td>C. 0</td><td>C. 0</td><td>True</td></tr>\n",
"<tr><th>15</th><td>robustness</td><td>dyslexia_word_swap</td><td>-</td><td>Statement 1 | If a group has an element of ord...</td><td>-</td><td>Statement 1 | If a group has an element off or...</td><td>C. True, False</td><td>C. True, False</td><td>True</td></tr>\n",
"<tr><th>16</th><td>robustness</td><td>dyslexia_word_swap</td><td>-</td><td>Statement 1 | Every homomorphic image of a gro...</td><td>-</td><td>Statement 1 | Every homomorphic image off a gr...</td><td>C. True, False</td><td>C. True, False</td><td>True</td></tr>\n",
"<tr><th>17</th><td>robustness</td><td>dyslexia_word_swap</td><td>-</td><td>Statement 1 | A ring homomorphism is one to on...</td><td>-</td><td>Statement 1 | A ring homomorphism is won too w...</td><td>C. True, False</td><td>C. True, False</td><td>True</td></tr>\n",
"<tr><th>18</th><td>robustness</td><td>dyslexia_word_swap</td><td>-</td><td>Find the degree for the given field extension ...</td><td>-</td><td>Find the degree four the given field extension...</td><td>B. 4</td><td>B. 4</td><td>True</td></tr>\n",
"<tr><th>19</th><td>robustness</td><td>dyslexia_word_swap</td><td>-</td><td>Find all zeros in the indicated finite field o...</td><td>-</td><td>Find all zeros in the indicated finite field o...</td><td>A. 1</td><td>A. 1</td><td>True</td></tr>\n",
"<tr><th>20</th><td>robustness</td><td>add_abbreviation</td><td>-</td><td>Find the degree for the given field extension ...</td><td>-</td><td>Find da degree 4 thedaven field extension Q(sq...</td><td>B. 4</td><td>B. 4</td><td>True</td></tr>\n",
"<tr><th>21</th><td>robustness</td><td>add_abbreviation</td><td>-</td><td>Let p = (1, 2, 5, 4)(2, 3) in S_5 . Find the i...</td><td>-</td><td>Let p = (1, 2, 5, 4)(2, 3) in S_5 . Find da in...</td><td>C. 24</td><td>C. 24</td><td>True</td></tr>\n",
"<tr><th>22</th><td>robustness</td><td>add_abbreviation</td><td>-</td><td>Find all zeros in the indicated finite field o...</td><td>-</td><td>Find all zeros in da indicated finite field of...</td><td>A. 0</td><td>A. 0</td><td>True</td></tr>\n",
"<tr><th>23</th><td>robustness</td><td>add_abbreviation</td><td>-</td><td>Statement 1 | A factor group of a non-Abelian ...</td><td>-</td><td>Statement 1 | A factor group of a non-Abelian ...</td><td>A. True, True</td><td>A. True, True</td><td>True</td></tr>\n",
"<tr><th>24</th><td>robustness</td><td>add_abbreviation</td><td>-</td><td>Find the product of the given polynomials in t...</td><td>-</td><td>Find da product of tdagiven polynomials in thd...</td><td>C. 0</td><td>C. 0</td><td>True</td></tr>\n",
"<tr><th>25</th><td>robustness</td><td>add_abbreviation</td><td>-</td><td>Statement 1 | If a group has an element of ord...</td><td>-</td><td>Statement 1 | If a group has an element of ord...</td><td>C. True, False</td><td>C. True, False</td><td>True</td></tr>\n",
"<tr><th>26</th><td>robustness</td><td>add_abbreviation</td><td>-</td><td>Statement 1 | Every homomorphic image of a gro...</td><td>-</td><td>Statement 1 | Every homomorphic image of a gro...</td><td>C. True, False</td><td>C. True, False</td><td>True</td></tr>\n",
"<tr><th>27</th><td>robustness</td><td>add_abbreviation</td><td>-</td><td>Statement 1 | A ring homomorphism is one to on...</td><td>-</td><td>Statement 1 | A ring homomorphism is one 2 one...</td><td>C. True, False</td><td>C. True, False</td><td>True</td></tr>\n",
"<tr><th>28</th><td>robustness</td><td>add_abbreviation</td><td>-</td><td>Find the degree for the given field extension ...</td><td>-</td><td>Find da degree 4 thedaven field extension Q(sq...</td><td>B. 4</td><td>B. 4</td><td>True</td></tr>\n",
"<tr><th>29</th><td>robustness</td><td>add_abbreviation</td><td>-</td><td>Find all zeros in the indicated finite field o...</td><td>-</td><td>Find all zeros in da indicated finite field of...</td><td>C. 2,3</td><td>A. 1</td><td>False</td></tr>\n",
"<tr><th>30</th><td>robustness</td><td>add_slangs</td><td>-</td><td>Find the degree for the given field extension ...</td><td>-</td><td>Find the degree for the given field extension ...</td><td>B. 4</td><td>B. 4</td><td>True</td></tr>\n",
"<tr><th>31</th><td>robustness</td><td>add_slangs</td><td>-</td><td>Let p = (1, 2, 5, 4)(2, 3) in S_5 . Find the i...</td><td>-</td><td>Let p = (1, 2, 5, 4)(2, 3) in S_5 . Find the i...</td><td>C. 24</td><td>C. 24</td><td>True</td></tr>\n",
"<tr><th>32</th><td>robustness</td><td>add_slangs</td><td>-</td><td>Find all zeros in the indicated finite field o...</td><td>-</td><td>Find all zeros in the indicated finite field o...</td><td>A. 0</td><td>A. 0</td><td>True</td></tr>\n",
"<tr><th>33</th><td>robustness</td><td>add_slangs</td><td>-</td><td>Statement 1 | A factor group of a non-Abelian ...</td><td>-</td><td>Statement 1 | A factor group of a non-Abelian ...</td><td>A. True, True</td><td>A. True, True</td><td>True</td></tr>\n",
"<tr><th>34</th><td>robustness</td><td>add_slangs</td><td>-</td><td>Find the product of the given polynomials in t...</td><td>-</td><td>Find the product of the given polynomials in t...</td><td>C. 0</td><td>C. 0</td><td>True</td></tr>\n",
"<tr><th>35</th><td>robustness</td><td>add_slangs</td><td>-</td><td>Statement 1 | If a group has an element of ord...</td><td>-</td><td>Statement 1 | If a group has an element of ord...</td><td>C. True, False</td><td>A. True, True</td><td>False</td></tr>\n",
"<tr><th>36</th><td>robustness</td><td>add_slangs</td><td>-</td><td>Statement 1 | Every homomorphic image of a gro...</td><td>-</td><td>Statement 1 | Every homomorphic image of a gro...</td><td>C. True, False</td><td>A. True, True</td><td>False</td></tr>\n",
"<tr><th>37</th><td>robustness</td><td>add_slangs</td><td>-</td><td>Statement 1 | A ring homomorphism is one to on...</td><td>-</td><td>Statement 1 | A ring homomorphism is one to on...</td><td>C. True, False</td><td>A. True, True</td><td>False</td></tr>\n",
"<tr><th>38</th><td>robustness</td><td>add_slangs</td><td>-</td><td>Find the degree for the given field extension ...</td><td>-</td><td>Find the degree for the given field extension ...</td><td>B. 4</td><td>B. 4</td><td>True</td></tr>\n",
"<tr><th>39</th><td>robustness</td><td>add_slangs</td><td>-</td><td>Find all zeros in the indicated finite field o...</td><td>-</td><td>Find all zeros in the indicated finite field o...</td><td>A. 1</td><td>A. 1</td><td>True</td></tr>\n",
"<tr><th>40</th><td>robustness</td><td>add_speech_to_text_typo</td><td>-</td><td>Find the degree for the given field extension ...</td><td>-</td><td>Find the degree for the givin' feild extension...</td><td>B. 4</td><td>B. 4</td><td>True</td></tr>\n",
"<tr><th>41</th><td>robustness</td><td>add_speech_to_text_typo</td><td>-</td><td>Let p = (1, 2, 5, 4)(2, 3) in S_5 . Find the i...</td><td>-</td><td>Lett pea = (1, 2, 5, 4)(2, 3) in S_5 . Fined t...</td><td>C. 24</td><td>B. 2</td><td>False</td></tr>\n",
"<tr><th>42</th><td>robustness</td><td>add_speech_to_text_typo</td><td>-</td><td>Find all zeros in the indicated finite field o...</td><td>-</td><td>Find all zeros in the indicated finite feild o...</td><td>A. 0</td><td>A. 0</td><td>True</td></tr>\n",
"<tr><th>43</th><td>robustness</td><td>add_speech_to_text_typo</td><td>-</td><td>Statement 1 | A factor group of a non-Abelian ...</td><td>-</td><td>Statement 1 | A factor grupe of ae non-Abelian...</td><td>A. True, True</td><td>A. True, True</td><td>True</td></tr>\n",
"<tr><th>44</th><td>robustness</td><td>add_speech_to_text_typo</td><td>-</td><td>Find the product of the given polynomials in t...</td><td>-</td><td>Find the product of the givin' polynomials in ...</td><td>C. 0</td><td>C. 0</td><td>True</td></tr>\n",
"<tr><th>45</th><td>robustness</td><td>add_speech_to_text_typo</td><td>-</td><td>Statement 1 | If a group has an element of ord...</td><td>-</td><td>Statement 1 | If a groupe has 'N element of or...</td><td>C. True, False</td><td>C. True, False</td><td>True</td></tr>\n",
"<tr><th>46</th><td>robustness</td><td>add_speech_to_text_typo</td><td>-</td><td>Statement 1 | Every homomorphic image of a gro...</td><td>-</td><td>Statement 1 | Every homomorphic image of a. gr...</td><td>C. True, False</td><td>A. True, True</td><td>False</td></tr>\n",
"<tr><th>47</th><td>robustness</td><td>add_speech_to_text_typo</td><td>-</td><td>Statement 1 | A ring homomorphism is one to on...</td><td>-</td><td>Statement 1 | A wring homomorphism is one to o...</td><td>C. True, False</td><td>B. False, False</td><td>False</td></tr>\n",
"<tr><th>48</th><td>robustness</td><td>add_speech_to_text_typo</td><td>-</td><td>Find the degree for the given field extension ...</td><td>-</td><td>Find the degree for the givin' field extension...</td><td>B. 4</td><td>B. 4</td><td>True</td></tr>\n",
"<tr><th>49</th><td>robustness</td><td>add_speech_to_text_typo</td><td>-</td><td>Find all zeros in the indicated finite field o...</td><td>-</td><td>Find aull zeros inn the indicated finite field...</td><td>C. 2,3</td><td>C. 2,3</td><td>True</td></tr>\n",
"</tbody>\n",
"</table>
\n"],"text/plain":[" category test_type original_context \\\n","0 robustness uppercase - \n","1 robustness uppercase - \n","2 robustness uppercase - \n","3 robustness uppercase - \n","4 robustness uppercase - \n","5 robustness uppercase - \n","6 robustness uppercase - \n","7 robustness uppercase - \n","8 robustness uppercase - \n","9 robustness uppercase - \n","10 robustness dyslexia_word_swap - \n","11 robustness dyslexia_word_swap - \n","12 robustness dyslexia_word_swap - \n","13 robustness dyslexia_word_swap - \n","14 robustness dyslexia_word_swap - \n","15 robustness dyslexia_word_swap - \n","16 robustness dyslexia_word_swap - \n","17 robustness dyslexia_word_swap - \n","18 robustness dyslexia_word_swap - \n","19 robustness dyslexia_word_swap - \n","20 robustness add_abbreviation - \n","21 robustness add_abbreviation - \n","22 robustness add_abbreviation - \n","23 robustness add_abbreviation - \n","24 robustness add_abbreviation - \n","25 robustness add_abbreviation - \n","26 robustness add_abbreviation - \n","27 robustness add_abbreviation - \n","28 robustness add_abbreviation - \n","29 robustness add_abbreviation - \n","30 robustness add_slangs - \n","31 robustness add_slangs - \n","32 robustness add_slangs - \n","33 robustness add_slangs - \n","34 robustness add_slangs - \n","35 robustness add_slangs - \n","36 robustness add_slangs - \n","37 robustness add_slangs - \n","38 robustness add_slangs - \n","39 robustness add_slangs - \n","40 robustness add_speech_to_text_typo - \n","41 robustness add_speech_to_text_typo - \n","42 robustness add_speech_to_text_typo - \n","43 robustness add_speech_to_text_typo - \n","44 robustness add_speech_to_text_typo - \n","45 robustness add_speech_to_text_typo - \n","46 robustness add_speech_to_text_typo - \n","47 robustness add_speech_to_text_typo - \n","48 robustness add_speech_to_text_typo - \n","49 robustness add_speech_to_text_typo - \n","\n"," original_question perturbed_context \\\n","0 Find the degree for the given 
field extension ... - \n","1 Let p = (1, 2, 5, 4)(2, 3) in S_5 . Find the i... - \n","2 Find all zeros in the indicated finite field o... - \n","3 Statement 1 | A factor group of a non-Abelian ... - \n","4 Find the product of the given polynomials in t... - \n","5 Statement 1 | If a group has an element of ord... - \n","6 Statement 1 | Every homomorphic image of a gro... - \n","7 Statement 1 | A ring homomorphism is one to on... - \n","8 Find the degree for the given field extension ... - \n","9 Find all zeros in the indicated finite field o... - \n","10 Find the degree for the given field extension ... - \n","11 Let p = (1, 2, 5, 4)(2, 3) in S_5 . Find the i... - \n","12 Find all zeros in the indicated finite field o... - \n","13 Statement 1 | A factor group of a non-Abelian ... - \n","14 Find the product of the given polynomials in t... - \n","15 Statement 1 | If a group has an element of ord... - \n","16 Statement 1 | Every homomorphic image of a gro... - \n","17 Statement 1 | A ring homomorphism is one to on... - \n","18 Find the degree for the given field extension ... - \n","19 Find all zeros in the indicated finite field o... - \n","20 Find the degree for the given field extension ... - \n","21 Let p = (1, 2, 5, 4)(2, 3) in S_5 . Find the i... - \n","22 Find all zeros in the indicated finite field o... - \n","23 Statement 1 | A factor group of a non-Abelian ... - \n","24 Find the product of the given polynomials in t... - \n","25 Statement 1 | If a group has an element of ord... - \n","26 Statement 1 | Every homomorphic image of a gro... - \n","27 Statement 1 | A ring homomorphism is one to on... - \n","28 Find the degree for the given field extension ... - \n","29 Find all zeros in the indicated finite field o... - \n","30 Find the degree for the given field extension ... - \n","31 Let p = (1, 2, 5, 4)(2, 3) in S_5 . Find the i... - \n","32 Find all zeros in the indicated finite field o... - \n","33 Statement 1 | A factor group of a non-Abelian ... 
- \n","34 Find the product of the given polynomials in t... - \n","35 Statement 1 | If a group has an element of ord... - \n","36 Statement 1 | Every homomorphic image of a gro... - \n","37 Statement 1 | A ring homomorphism is one to on... - \n","38 Find the degree for the given field extension ... - \n","39 Find all zeros in the indicated finite field o... - \n","40 Find the degree for the given field extension ... - \n","41 Let p = (1, 2, 5, 4)(2, 3) in S_5 . Find the i... - \n","42 Find all zeros in the indicated finite field o... - \n","43 Statement 1 | A factor group of a non-Abelian ... - \n","44 Find the product of the given polynomials in t... - \n","45 Statement 1 | If a group has an element of ord... - \n","46 Statement 1 | Every homomorphic image of a gro... - \n","47 Statement 1 | A ring homomorphism is one to on... - \n","48 Find the degree for the given field extension ... - \n","49 Find all zeros in the indicated finite field o... - \n","\n"," perturbed_question expected_result \\\n","0 FIND THE DEGREE FOR THE GIVEN FIELD EXTENSION ... B. 4 \n","1 LET P = (1, 2, 5, 4)(2, 3) IN S_5 . FIND THE I... C. 24 \n","2 FIND ALL ZEROS IN THE INDICATED FINITE FIELD O... A. 0 \n","3 STATEMENT 1 | A FACTOR GROUP OF A NON-ABELIAN ... A. True, True \n","4 FIND THE PRODUCT OF THE GIVEN POLYNOMIALS IN T... C. 0 \n","5 STATEMENT 1 | IF A GROUP HAS AN ELEMENT OF ORD... C. True, False \n","6 STATEMENT 1 | EVERY HOMOMORPHIC IMAGE OF A GRO... C. True, False \n","7 STATEMENT 1 | A RING HOMOMORPHISM IS ONE TO ON... C. True, False \n","8 FIND THE DEGREE FOR THE GIVEN FIELD EXTENSION ... B. 4 \n","9 FIND ALL ZEROS IN THE INDICATED FINITE FIELD O... A. 1 \n","10 Find the degree four the given field extension... B. 4 \n","11 Let p = (1, 2, 5, 4)(2, 3) in S_5 . Find the i... C. 24 \n","12 Find all zeros in the indicated finite field o... A. 0 \n","13 Statement 1 | A factor group off a non-Abelian... A. True, True \n","14 Find the product off the given polynomials in ... C. 
0 \n","15 Statement 1 | If a group has an element off or... C. True, False \n","16 Statement 1 | Every homomorphic image off a gr... C. True, False \n","17 Statement 1 | A ring homomorphism is won too w... C. True, False \n","18 Find the degree four the given field extension... B. 4 \n","19 Find all zeros in the indicated finite field o... A. 1 \n","20 Find da degree 4 thedaven field extension Q(sq... B. 4 \n","21 Let p = (1, 2, 5, 4)(2, 3) in S_5 . Find da in... C. 24 \n","22 Find all zeros in da indicated finite field of... A. 0 \n","23 Statement 1 | A factor group of a non-Abelian ... A. True, True \n","24 Find da product of tdagiven polynomials in thd... C. 0 \n","25 Statement 1 | If a group has an element of ord... C. True, False \n","26 Statement 1 | Every homomorphic image of a gro... C. True, False \n","27 Statement 1 | A ring homomorphism is one 2 one... C. True, False \n","28 Find da degree 4 thedaven field extension Q(sq... B. 4 \n","29 Find all zeros in da indicated finite field of... C. 2,3 \n","30 Find the degree for the given field extension ... B. 4 \n","31 Let p = (1, 2, 5, 4)(2, 3) in S_5 . Find the i... C. 24 \n","32 Find all zeros in the indicated finite field o... A. 0 \n","33 Statement 1 | A factor group of a non-Abelian ... A. True, True \n","34 Find the product of the given polynomials in t... C. 0 \n","35 Statement 1 | If a group has an element of ord... C. True, False \n","36 Statement 1 | Every homomorphic image of a gro... C. True, False \n","37 Statement 1 | A ring homomorphism is one to on... C. True, False \n","38 Find the degree for the given field extension ... B. 4 \n","39 Find all zeros in the indicated finite field o... A. 1 \n","40 Find the degree for the givin' feild extension... B. 4 \n","41 Lett pea = (1, 2, 5, 4)(2, 3) in S_5 . Fined t... C. 24 \n","42 Find all zeros in the indicated finite feild o... A. 0 \n","43 Statement 1 | A factor grupe of ae non-Abelian... A. 
True, True \n","44 Find the product of the givin' polynomials in ... C. 0 \n","45 Statement 1 | If a groupe has 'N element of or... C. True, False \n","46 Statement 1 | Every homomorphic image of a. gr... C. True, False \n","47 Statement 1 | A wring homomorphism is one to o... C. True, False \n","48 Find the degree for the givin' field extension... B. 4 \n","49 Find aull zeros inn the indicated finite field... C. 2,3 \n","\n"," actual_result pass \n","0 B. 4 True \n","1 C. 24 True \n","2 D. 0,4 False \n","3 C. TRUE, FALSE False \n","4 C. 0 True \n","5 C. TRUE, FALSE True \n","6 C. TRUE, FALSE True \n","7 A. TRUE, TRUE False \n","8 C. 2 False \n","9 C. 2,3 False \n","10 B. 4 True \n","11 C. 24 True \n","12 A. 0 True \n","13 C. True, False False \n","14 C. 0 True \n","15 C. True, False True \n","16 C. True, False True \n","17 C. True, False True \n","18 B. 4 True \n","19 A. 1 True \n","20 B. 4 True \n","21 C. 24 True \n","22 A. 0 True \n","23 A. True, True True \n","24 C. 0 True \n","25 C. True, False True \n","26 C. True, False True \n","27 C. True, False True \n","28 B. 4 True \n","29 A. 1 False \n","30 B. 4 True \n","31 C. 24 True \n","32 A. 0 True \n","33 A. True, True True \n","34 C. 0 True \n","35 A. True, True False \n","36 A. True, True False \n","37 A. True, True False \n","38 B. 4 True \n","39 A. 1 True \n","40 B. 4 True \n","41 B. 2 False \n","42 A. 0 True \n","43 A. True, True True \n","44 C. 0 True \n","45 C. True, False True \n","46 A. True, True False \n","47 B. False, False False \n","48 B. 4 True \n","49 C. 2,3 True "]},"execution_count":9,"metadata":{},"output_type":"execute_result"}],"source":["harness.generated_results()"]},{"cell_type":"markdown","metadata":{"id":"Gl5QGV9pCZfz"},"source":["This method returns the generated results in the form of a pandas dataframe, which provides a convenient and easy-to-use format for working with the test results. 
You can use this method to quickly identify the test cases that failed and to determine where fixes are needed."]},{"cell_type":"markdown","metadata":{"id":"9fBgU33hCb2K"},"source":["### Final Results\n","\n","We can call `.report()` which summarizes the results giving information about pass and fail counts and overall test pass/fail flag."]},{"cell_type":"code","execution_count":10,"metadata":{"colab":{"base_uri":"https://localhost:8080/","height":206},"executionInfo":{"elapsed":14511,"status":"ok","timestamp":1692371347056,"user":{"displayName":"Prikshit sharma","userId":"07819241395213139913"},"user_tz":-330},"id":"nDmRw1AeUqIl","outputId":"c458e5f1-9f6f-4b40-bc19-7570592546be"},"outputs":[{"data":{"text/html":["\n","
<table>\n",
"<thead>\n",
"<tr><th></th><th>category</th><th>test_type</th><th>fail_count</th><th>pass_count</th><th>pass_rate</th><th>minimum_pass_rate</th><th>pass</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><th>0</th><td>robustness</td><td>uppercase</td><td>5</td><td>5</td><td>50%</td><td>66%</td><td>False</td></tr>\n",
"<tr><th>1</th><td>robustness</td><td>dyslexia_word_swap</td><td>1</td><td>9</td><td>90%</td><td>60%</td><td>True</td></tr>\n",
"<tr><th>2</th><td>robustness</td><td>add_abbreviation</td><td>1</td><td>9</td><td>90%</td><td>60%</td><td>True</td></tr>\n",
"<tr><th>3</th><td>robustness</td><td>add_slangs</td><td>3</td><td>7</td><td>70%</td><td>60%</td><td>True</td></tr>\n",
"<tr><th>4</th><td>robustness</td><td>add_speech_to_text_typo</td><td>3</td><td>7</td><td>70%</td><td>60%</td><td>True</td></tr>\n",
"</tbody>\n",
"</table>
\n"],"text/plain":[" category test_type fail_count pass_count pass_rate \\\n","0 robustness uppercase 5 5 50% \n","1 robustness dyslexia_word_swap 1 9 90% \n","2 robustness add_abbreviation 1 9 90% \n","3 robustness add_slangs 3 7 70% \n","4 robustness add_speech_to_text_typo 3 7 70% \n","\n"," minimum_pass_rate pass \n","0 66% False \n","1 60% True \n","2 60% True \n","3 60% True \n","4 60% True "]},"execution_count":10,"metadata":{},"output_type":"execute_result"}],"source":["harness.report()"]},{"cell_type":"markdown","metadata":{"id":"IULGQtWAWp4L"},"source":["## Fairness"]},{"cell_type":"markdown","metadata":{"id":"z85d594ZGXyX"},"source":["Available Fairness tests for QA task are:\n","\n","* `max_gender_rouge1_score`\n","* `max_gender_rouge2_score`\n","* `max_gender_rougeL_score`\n","* `max_gender_rougeLsum_score`\n","* `min_gender_rouge1_score`\n","* `min_gender_rouge2_score`\n","* `min_gender_rougeL_score`\n","* `min_gender_rougeLsum_score`"]},{"cell_type":"code","execution_count":11,"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"elapsed":86,"status":"ok","timestamp":1692371347059,"user":{"displayName":"Prikshit sharma","userId":"07819241395213139913"},"user_tz":-330},"id":"OoMGAn_FWpaP","outputId":"90175b71-b519-4687-b9bb-459bf3afdc35"},"outputs":[{"name":"stdout","output_type":"stream","text":["Test Configuration : \n"," {\n"," \"model_parameters\": {\n"," \"temperature\": 0.2,\n"," \"max_tokens\": 64\n"," },\n"," \"tests\": {\n"," \"defaults\": {\n"," \"min_pass_rate\": 1.0\n"," },\n"," \"robustness\": {\n"," \"add_typo\": {\n"," \"min_pass_rate\": 0.7\n"," },\n"," \"lowercase\": {\n"," \"min_pass_rate\": 0.7\n"," }\n"," }\n"," }\n","}\n"]}],"source":["harness = Harness(\n"," task=\"question-answering\", \n"," model={\"model\": \"text-davinci-003\",\"hub\":\"openai\"}, \n"," data={\"data_source\" :\"MMLU\",\n"," \"split\":\"test-tiny\"}\n"," 
)"]},{"cell_type":"code","execution_count":12,"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"elapsed":78,"status":"ok","timestamp":1692371347061,"user":{"displayName":"Prikshit sharma","userId":"07819241395213139913"},"user_tz":-330},"id":"45-rhwhTXMWb","outputId":"d96893e0-a009-4da9-b4e5-63b200d83d45"},"outputs":[{"data":{"text/plain":["{'tests': {'defaults': {'min_pass_rate': 0.65},\n"," 'fairness': {'min_gender_rouge1_score': {'min_score': 0.66},\n"," 'min_gender_rouge2_score': {'min_score': 0.6},\n"," 'min_gender_rougeL_score': {'min_score': 0.66},\n"," 'min_gender_rougeLsum_score': {'min_score': 0.66},\n"," 'max_gender_rouge1_score': {'max_score': 0.66},\n"," 'max_gender_rouge2_score': {'max_score': 0.6},\n"," 'max_gender_rougeL_score': {'max_score': 0.66},\n"," 'max_gender_rougeLsum_score': {'max_score': 0.66}}}}"]},"execution_count":12,"metadata":{},"output_type":"execute_result"}],"source":["harness.configure(\n","{\n"," 'tests': {'defaults': {'min_pass_rate': 0.65},\n"," 'fairness': {\n"," 'min_gender_rouge1_score': {'min_score': 0.66},\n"," 'min_gender_rouge2_score':{'min_score': 0.60},\n"," 'min_gender_rougeL_score': {'min_score': 0.66},\n"," 'min_gender_rougeLsum_score': {'min_score': 0.66},\n"," 'max_gender_rouge1_score': {'max_score': 0.66},\n"," 'max_gender_rouge2_score':{'max_score': 0.60},\n"," 'max_gender_rougeL_score': {'max_score': 0.66},\n"," 'max_gender_rougeLsum_score': {'max_score': 0.66},\n","\n","\n","\n","\n"," }\n"," }\n"," }\n"," )"]},{"cell_type":"code","execution_count":13,"metadata":{"executionInfo":{"elapsed":66,"status":"ok","timestamp":1692371347063,"user":{"displayName":"Prikshit sharma","userId":"07819241395213139913"},"user_tz":-330},"id":"_cTZaer5XyDa"},"outputs":[],"source":["harness.data = harness.data[:10]"]},{"cell_type":"markdown","metadata":{"id":"dw85pgowGx8t"},"source":["### Generating the Test 
Cases"]},{"cell_type":"code","execution_count":14,"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"elapsed":76,"status":"ok","timestamp":1692371347075,"user":{"displayName":"Prikshit sharma","userId":"07819241395213139913"},"user_tz":-330},"id":"F2p1pXfoXzND","outputId":"6cdcb7cb-119b-4f14-dce8-f03bc507a8d0"},"outputs":[{"name":"stderr","output_type":"stream","text":["Generating testcases...: 100%|██████████| 1/1 [00:00<00:00, 1369.79it/s]\n"]},{"data":{"text/plain":[]},"execution_count":14,"metadata":{},"output_type":"execute_result"}],"source":["harness.generate()"]},{"cell_type":"code","execution_count":15,"metadata":{"colab":{"base_uri":"https://localhost:8080/","height":802},"executionInfo":{"elapsed":64,"status":"ok","timestamp":1692371347078,"user":{"displayName":"Prikshit sharma","userId":"07819241395213139913"},"user_tz":-330},"id":"vJZxMYyKX0Pe","outputId":"507d0db6-80e5-4eba-82f5-739ce1b9e8a1"},"outputs":[{"data":{"text/html":["\n","
<table>\n",
"<thead>\n",
"<tr><th></th><th>category</th><th>test_type</th><th>test_case</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><th>0</th><td>fairness</td><td>min_gender_rouge1_score</td><td>male</td></tr>\n",
"<tr><th>1</th><td>fairness</td><td>min_gender_rouge1_score</td><td>female</td></tr>\n",
"<tr><th>2</th><td>fairness</td><td>min_gender_rouge1_score</td><td>unknown</td></tr>\n",
"<tr><th>3</th><td>fairness</td><td>min_gender_rouge2_score</td><td>male</td></tr>\n",
"<tr><th>4</th><td>fairness</td><td>min_gender_rouge2_score</td><td>female</td></tr>\n",
"<tr><th>5</th><td>fairness</td><td>min_gender_rouge2_score</td><td>unknown</td></tr>\n",
"<tr><th>6</th><td>fairness</td><td>min_gender_rougeL_score</td><td>male</td></tr>\n",
"<tr><th>7</th><td>fairness</td><td>min_gender_rougeL_score</td><td>female</td></tr>\n",
"<tr><th>8</th><td>fairness</td><td>min_gender_rougeL_score</td><td>unknown</td></tr>\n",
"<tr><th>9</th><td>fairness</td><td>min_gender_rougeLsum_score</td><td>male</td></tr>\n",
"<tr><th>10</th><td>fairness</td><td>min_gender_rougeLsum_score</td><td>female</td></tr>\n",
"<tr><th>11</th><td>fairness</td><td>min_gender_rougeLsum_score</td><td>unknown</td></tr>\n",
"<tr><th>12</th><td>fairness</td><td>max_gender_rouge1_score</td><td>male</td></tr>\n",
"<tr><th>13</th><td>fairness</td><td>max_gender_rouge1_score</td><td>female</td></tr>\n",
"<tr><th>14</th><td>fairness</td><td>max_gender_rouge1_score</td><td>unknown</td></tr>\n",
"<tr><th>15</th><td>fairness</td><td>max_gender_rouge2_score</td><td>male</td></tr>\n",
"<tr><th>16</th><td>fairness</td><td>max_gender_rouge2_score</td><td>female</td></tr>\n",
"<tr><th>17</th><td>fairness</td><td>max_gender_rouge2_score</td><td>unknown</td></tr>\n",
"<tr><th>18</th><td>fairness</td><td>max_gender_rougeL_score</td><td>male</td></tr>\n",
"<tr><th>19</th><td>fairness</td><td>max_gender_rougeL_score</td><td>female</td></tr>\n",
"<tr><th>20</th><td>fairness</td><td>max_gender_rougeL_score</td><td>unknown</td></tr>\n",
"<tr><th>21</th><td>fairness</td><td>max_gender_rougeLsum_score</td><td>male</td></tr>\n",
"<tr><th>22</th><td>fairness</td><td>max_gender_rougeLsum_score</td><td>female</td></tr>\n",
"<tr><th>23</th><td>fairness</td><td>max_gender_rougeLsum_score</td><td>unknown</td></tr>\n",
"</tbody>\n",
"</table>
\n"],"text/plain":[" category test_type test_case\n","0 fairness min_gender_rouge1_score male\n","1 fairness min_gender_rouge1_score female\n","2 fairness min_gender_rouge1_score unknown\n","3 fairness min_gender_rouge2_score male\n","4 fairness min_gender_rouge2_score female\n","5 fairness min_gender_rouge2_score unknown\n","6 fairness min_gender_rougeL_score male\n","7 fairness min_gender_rougeL_score female\n","8 fairness min_gender_rougeL_score unknown\n","9 fairness min_gender_rougeLsum_score male\n","10 fairness min_gender_rougeLsum_score female\n","11 fairness min_gender_rougeLsum_score unknown\n","12 fairness max_gender_rouge1_score male\n","13 fairness max_gender_rouge1_score female\n","14 fairness max_gender_rouge1_score unknown\n","15 fairness max_gender_rouge2_score male\n","16 fairness max_gender_rouge2_score female\n","17 fairness max_gender_rouge2_score unknown\n","18 fairness max_gender_rougeL_score male\n","19 fairness max_gender_rougeL_score female\n","20 fairness max_gender_rougeL_score unknown\n","21 fairness max_gender_rougeLsum_score male\n","22 fairness max_gender_rougeLsum_score female\n","23 fairness max_gender_rougeLsum_score unknown"]},"execution_count":15,"metadata":{},"output_type":"execute_result"}],"source":["harness.testcases()"]},{"cell_type":"markdown","metadata":{"id":"zSgEmwr7G2Xl"},"source":["### Running the 
tests"]},{"cell_type":"code","execution_count":16,"metadata":{"colab":{"base_uri":"https://localhost:8080/","height":181,"referenced_widgets":["257c00fef73b4d50950c8d8b165e26a2","75d0522480494bb1a7b66e14fc43faac","4218ed9efdf84217b5daa2aa5930e20b","867e0de65c734221ad6f2623c2a35f57","d3ca7afb948f404682aa027d3d76d237","f2540d52716a4393a5f050f8d030f3f3","0dab743db8f14b77b0ec1699f92f86ed","2608c51cf9784a56baeddf9d1622ce76","2773b8eeb7024310b2264d487a9b26df","a3d9b7d4b44540d88953c69b56f9269f","cb676eb37f2a4126837c7324bf51d7ad","56701a47f6ee4a6d81a98f66756baf03","20d999a03d814a7785232c091241dc1c","6ab5b7e5c6784f3b92b6180ae0043589","9824945e44fe4af4a1d70a8383b72b72","0d7c7a938349427983d62652e81cead5","351e721352bf4c7cb30dbbe8a06ce35d","ad6bedec421b40d897568ae3f2705810","fabd451f3ccc47d5aed88e94eec722f7","c07ab8a5ad3e41e991f940b6e08e1814","660e7fdd115f4e728fe7ea0358fd8bff","52ef8bcdab0a42f0a5d6a336766de54d","fa4244813260430c98d2fbad63671f10","e0e00dfcfb7c49ac961ff7f1101a0caa","e367e27cda314517ab18696ecd913e0a","9a1221b68d2c4af1a74f5978e252d507","b16b721265754f5fa258970429fc7bdd","2e68a1149b7b40bc8c2811b1a16c96ea","829fb20d826d45baaf8d785179c1b32f","feb421598a0441498d81241716261b78","f0fc5b6cb35e4986b5ef1f2d03e56228","e349b98fd389418fb365f53185489437","f6ebb67ea4574f3e8924b90d7b5aba12","d5950fc7527049279a8d433985f79619","3e9c9defb1d148b5a6de25cb2095740a","3d19431d61e747df81b5b6730e67c955","805c8478574545c398214ce2d295944a","7b972e6f8f624ac28f148a8cff4b0ee2","5a12148bfe9848c5b9827d9b677b39dd","b4bf22308b254236960ff1eb5306c4e9","6984b154f66d4f1ab209168e50a64acd","2c907621903c43c9ad7ed84ee9026412","4f579cc50d884981b562f112b8764075","5a0ba0d42433427c8874b56d5ef1f4a2"]},"executionInfo":{"elapsed":36184,"status":"ok","timestamp":1692371383203,"user":{"displayName":"Prikshit sharma","userId":"07819241395213139913"},"user_tz":-330},"id":"marZgGMEX2F1","outputId":"93f92514-2be1-4875-9061-74524e84fbd0"},"outputs":[{"name":"stderr","output_type":"stream","text":["\rRunning 
testcases... : 0%| | 0/24 [00:00\n","
\n","\n","\n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n","
    category  test_type                   test_case  expected_result  actual_result  pass
0   fairness  min_gender_rouge1_score     male       0.66             0.355556       False
1   fairness  min_gender_rouge1_score     female     0.66             0.750000       True
2   fairness  min_gender_rouge1_score     unknown    0.66             0.222222       False
3   fairness  min_gender_rouge2_score     male       0.60             0.000000       False
4   fairness  min_gender_rouge2_score     female     0.60             0.750000       True
5   fairness  min_gender_rouge2_score     unknown    0.60             0.000000       False
6   fairness  min_gender_rougeL_score     male       0.66             0.244444       False
7   fairness  min_gender_rougeL_score     female     0.66             0.750000       True
8   fairness  min_gender_rougeL_score     unknown    0.66             0.222222       False
9   fairness  min_gender_rougeLsum_score  male       0.66             0.244444       False
10  fairness  min_gender_rougeLsum_score  female     0.66             0.750000       True
11  fairness  min_gender_rougeLsum_score  unknown    0.66             0.222222       False
12  fairness  max_gender_rouge1_score     male       0.66             0.355556       True
13  fairness  max_gender_rouge1_score     female     0.66             0.750000       False
14  fairness  max_gender_rouge1_score     unknown    0.66             0.222222       True
15  fairness  max_gender_rouge2_score     male       0.60             0.000000       True
16  fairness  max_gender_rouge2_score     female     0.60             0.750000       False
17  fairness  max_gender_rouge2_score     unknown    0.60             0.000000       True
18  fairness  max_gender_rougeL_score     male       0.66             0.244444       True
19  fairness  max_gender_rougeL_score     female     0.66             0.750000       False
20  fairness  max_gender_rougeL_score     unknown    0.66             0.222222       True
21  fairness  max_gender_rougeLsum_score  male       0.66             0.244444       True
22  fairness  max_gender_rougeLsum_score  female     0.66             0.750000       False
23  fairness  max_gender_rougeLsum_score  unknown    0.66             0.222222       True
\n","
\n","
\n","\n","
\n"," \n","\n"," \n","\n"," \n","
\n","\n","\n","
\n"," \n","\n","\n","\n"," \n","
\n","
\n"," \n"],"text/plain":[" category test_type test_case expected_result \\\n","0 fairness min_gender_rouge1_score male 0.66 \n","1 fairness min_gender_rouge1_score female 0.66 \n","2 fairness min_gender_rouge1_score unknown 0.66 \n","3 fairness min_gender_rouge2_score male 0.60 \n","4 fairness min_gender_rouge2_score female 0.60 \n","5 fairness min_gender_rouge2_score unknown 0.60 \n","6 fairness min_gender_rougeL_score male 0.66 \n","7 fairness min_gender_rougeL_score female 0.66 \n","8 fairness min_gender_rougeL_score unknown 0.66 \n","9 fairness min_gender_rougeLsum_score male 0.66 \n","10 fairness min_gender_rougeLsum_score female 0.66 \n","11 fairness min_gender_rougeLsum_score unknown 0.66 \n","12 fairness max_gender_rouge1_score male 0.66 \n","13 fairness max_gender_rouge1_score female 0.66 \n","14 fairness max_gender_rouge1_score unknown 0.66 \n","15 fairness max_gender_rouge2_score male 0.60 \n","16 fairness max_gender_rouge2_score female 0.60 \n","17 fairness max_gender_rouge2_score unknown 0.60 \n","18 fairness max_gender_rougeL_score male 0.66 \n","19 fairness max_gender_rougeL_score female 0.66 \n","20 fairness max_gender_rougeL_score unknown 0.66 \n","21 fairness max_gender_rougeLsum_score male 0.66 \n","22 fairness max_gender_rougeLsum_score female 0.66 \n","23 fairness max_gender_rougeLsum_score unknown 0.66 \n","\n"," actual_result pass \n","0 0.355556 False \n","1 0.750000 True \n","2 0.222222 False \n","3 0.000000 False \n","4 0.750000 True \n","5 0.000000 False \n","6 0.244444 False \n","7 0.750000 True \n","8 0.222222 False \n","9 0.244444 False \n","10 0.750000 True \n","11 0.222222 False \n","12 0.355556 True \n","13 0.750000 False \n","14 0.222222 True \n","15 0.000000 True \n","16 0.750000 False \n","17 0.000000 True \n","18 0.244444 True \n","19 0.750000 False \n","20 0.222222 True \n","21 0.244444 True \n","22 0.750000 False \n","23 0.222222 True 
"]},"execution_count":17,"metadata":{},"output_type":"execute_result"}],"source":["harness.generated_results()"]},{"cell_type":"markdown","metadata":{"id":"o39sXReLG7K9"},"source":["### Final Results"]},{"cell_type":"code","execution_count":18,"metadata":{"colab":{"base_uri":"https://localhost:8080/","height":300},"executionInfo":{"elapsed":209,"status":"ok","timestamp":1692371383216,"user":{"displayName":"Prikshit sharma","userId":"07819241395213139913"},"user_tz":-330},"id":"AiyJ7SyJYC9V","outputId":"df0ec5a3-5a04-45c1-d635-f0be79abe66a"},"outputs":[{"data":{"text/html":["\n","
\n","
\n","\n","\n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n","
   category  test_type                   fail_count  pass_count  pass_rate  minimum_pass_rate  pass
0  fairness  min_gender_rouge1_score     2           1           33%        65%                False
1  fairness  min_gender_rouge2_score     2           1           33%        65%                False
2  fairness  min_gender_rougeL_score     2           1           33%        65%                False
3  fairness  min_gender_rougeLsum_score  2           1           33%        65%                False
4  fairness  max_gender_rouge1_score     1           2           67%        65%                True
5  fairness  max_gender_rouge2_score     1           2           67%        65%                True
6  fairness  max_gender_rougeL_score     1           2           67%        65%                True
7  fairness  max_gender_rougeLsum_score  1           2           67%        65%                True
\n","
\n","
\n","\n","
\n"," \n","\n"," \n","\n"," \n","
\n","\n","\n","
\n"," \n","\n","\n","\n"," \n","
\n","
\n","
\n"],"text/plain":[" category test_type fail_count pass_count pass_rate \\\n","0 fairness min_gender_rouge1_score 2 1 33% \n","1 fairness min_gender_rouge2_score 2 1 33% \n","2 fairness min_gender_rougeL_score 2 1 33% \n","3 fairness min_gender_rougeLsum_score 2 1 33% \n","4 fairness max_gender_rouge1_score 1 2 67% \n","5 fairness max_gender_rouge2_score 1 2 67% \n","6 fairness max_gender_rougeL_score 1 2 67% \n","7 fairness max_gender_rougeLsum_score 1 2 67% \n","\n"," minimum_pass_rate pass \n","0 65% False \n","1 65% False \n","2 65% False \n","3 65% False \n","4 65% True \n","5 65% True \n","6 65% True \n","7 65% True "]},"execution_count":18,"metadata":{},"output_type":"execute_result"}],"source":["harness.report()"]},{"cell_type":"markdown","metadata":{"id":"0jSkCQudYh3F"},"source":["## Accuracy"]},{"cell_type":"markdown","metadata":{"id":"YwAzCAHkGd0X"},"source":["Available Accuracy tests for QA task are:\n","\n","* `min_exact_match_score`\n","* `min_bleu_score`\n","* `min_rouge1_score`\n","* `min_rouge2_score`\n","* `min_rougeL_score`\n","* `min_rougeLsum_score`"]},{"cell_type":"code","execution_count":19,"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"elapsed":200,"status":"ok","timestamp":1692371383218,"user":{"displayName":"Prikshit sharma","userId":"07819241395213139913"},"user_tz":-330},"id":"qG3UX5c-YgJn","outputId":"153fbe09-ae45-4dd3-bcbd-c97cd07b3c59"},"outputs":[{"name":"stdout","output_type":"stream","text":["Test Configuration : \n"," {\n"," \"model_parameters\": {\n"," \"temperature\": 0.2,\n"," \"max_tokens\": 64\n"," },\n"," \"tests\": {\n"," \"defaults\": {\n"," \"min_pass_rate\": 1.0\n"," },\n"," \"robustness\": {\n"," \"add_typo\": {\n"," \"min_pass_rate\": 0.7\n"," },\n"," \"lowercase\": {\n"," \"min_pass_rate\": 0.7\n"," }\n"," }\n"," }\n","}\n"]}],"source":["harness = Harness(\n"," task=\"question-answering\", \n"," model={\"model\": \"text-davinci-003\",\"hub\":\"openai\"}, \n"," data={\"data_source\" 
:\"MMLU\",\n"," \"split\":\"test-tiny\"}\n"," )"]},{"cell_type":"code","execution_count":20,"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"elapsed":189,"status":"ok","timestamp":1692371383222,"user":{"displayName":"Prikshit sharma","userId":"07819241395213139913"},"user_tz":-330},"id":"KuLxNXwXYl2z","outputId":"4955decb-3e10-4c42-aa96-880298dce501"},"outputs":[{"data":{"text/plain":["{'tests': {'defaults': {'min_pass_rate': 0.65},\n"," 'accuracy': {'min_exact_match_score': {'min_score': 0.5},\n"," 'min_rouge1_score': {'min_score': 0.5}}}}"]},"execution_count":20,"metadata":{},"output_type":"execute_result"}],"source":["harness.configure(\n","{\n"," 'tests': {'defaults': {'min_pass_rate': 0.65},\n"," 'accuracy': {'min_exact_match_score': {'min_score': 0.50},\n"," 'min_rouge1_score':{'min_score': 0.50},\n","\n"," }\n"," }\n"," }\n"," )"]},{"cell_type":"markdown","metadata":{"id":"hd6BEnBtHyME"},"source":["### Generating the test cases."]},{"cell_type":"code","execution_count":21,"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"elapsed":132,"status":"ok","timestamp":1692371383225,"user":{"displayName":"Prikshit sharma","userId":"07819241395213139913"},"user_tz":-330},"id":"4_wMTSmbYqTa","outputId":"052f1736-382b-4b79-a395-a53fcf94d136"},"outputs":[{"name":"stderr","output_type":"stream","text":["\n","Generating testcases...: 100%|██████████| 1/1 [00:00<00:00, 5242.88it/s]\n"]},{"data":{"text/plain":[]},"execution_count":21,"metadata":{},"output_type":"execute_result"}],"source":["harness.generate()"]},{"cell_type":"code","execution_count":22,"metadata":{"colab":{"base_uri":"https://localhost:8080/","height":112},"executionInfo":{"elapsed":114,"status":"ok","timestamp":1692371383229,"user":{"displayName":"Prikshit sharma","userId":"07819241395213139913"},"user_tz":-330},"id":"W28l71dScgG0","outputId":"b136d68b-349d-45df-fb07-c79646dec5ac"},"outputs":[{"data":{"text/html":["\n","
\n","
\n","\n","\n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n","
   category  test_type
0  accuracy  min_exact_match_score
1  accuracy  min_rouge1_score
\n","
\n","
\n","\n","
\n"," \n","\n"," \n","\n"," \n","
\n","\n","\n","
\n"," \n","\n","\n","\n"," \n","
\n","
\n","
\n"],"text/plain":[" category test_type\n","0 accuracy min_exact_match_score\n","1 accuracy min_rouge1_score"]},"execution_count":22,"metadata":{},"output_type":"execute_result"}],"source":["harness.testcases()"]},{"cell_type":"markdown","metadata":{"id":"UsbsuknXH0ue"},"source":["### Running the tests"]},{"cell_type":"code","execution_count":23,"metadata":{"colab":{"base_uri":"https://localhost:8080/","height":85,"referenced_widgets":["20e863ea2c17471ead434e1df3c623ed","d9f2bbecf3fd4473af04e2e25653f928","8f273303cf324d0bb3146ecea2af2411","d9f73f8d0c7345049a7ea11924b756dd","d32e905239be4fef985ae8767d6add99","01df3137965b434190d73bb59c9790bb","a2ff2f24ad77485e9de01427e2231712","ab31e5a39fe143d8895353e2c7ebea3c","61e4c8036ec34d28a5efafb0c41a0a74","aa57f92f95904c529d342790ecf4d75c","88af924ecc884636bb5bc9cad872e53a"]},"executionInfo":{"elapsed":281661,"status":"ok","timestamp":1692371664782,"user":{"displayName":"Prikshit sharma","userId":"07819241395213139913"},"user_tz":-330},"id":"PxeBTKR9chtd","outputId":"3540745d-bab7-4eb5-f5eb-2477c8b951bc"},"outputs":[{"name":"stderr","output_type":"stream","text":["\rRunning testcases... : 0%| | 0/2 [00:00\n","
\n","\n","\n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n","
   category  test_type              expected_result  actual_result  pass
0  accuracy  min_exact_match_score  0.5              0.592982       True
1  accuracy  min_rouge1_score       0.5              0.730155       True
\n","
\n","
\n","\n","
\n"," \n","\n"," \n","\n"," \n","
\n","\n","\n","
\n"," \n","\n","\n","\n"," \n","
\n","
\n"," \n"],"text/plain":[" category test_type expected_result actual_result pass\n","0 accuracy min_exact_match_score 0.5 0.592982 True\n","1 accuracy min_rouge1_score 0.5 0.730155 True"]},"execution_count":24,"metadata":{},"output_type":"execute_result"}],"source":["harness.generated_results()"]},{"cell_type":"markdown","metadata":{"id":"uIOiTX1IH3d8"},"source":["### Final Results"]},{"cell_type":"code","execution_count":25,"metadata":{"colab":{"base_uri":"https://localhost:8080/","height":112},"executionInfo":{"elapsed":35,"status":"ok","timestamp":1692371664787,"user":{"displayName":"Prikshit sharma","userId":"07819241395213139913"},"user_tz":-330},"id":"4U3PMgpEcn5o","outputId":"4958bf35-ffc1-477d-e5bf-b3d86acae806"},"outputs":[{"data":{"text/html":["\n","
\n","
\n","\n","\n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n","
   category  test_type              fail_count  pass_count  pass_rate  minimum_pass_rate  pass
0  accuracy  min_exact_match_score  0           1           100%       65%                True
1  accuracy  min_rouge1_score       0           1           100%       65%                True
\n","
\n","
\n","\n","
\n"," \n","\n"," \n","\n"," \n","
\n","\n","\n","
\n"," \n","\n","\n","\n"," \n","
\n","
\n","
\n"],"text/plain":[" category test_type fail_count pass_count pass_rate \\\n","0 accuracy min_exact_match_score 0 1 100% \n","1 accuracy min_rouge1_score 0 1 100% \n","\n"," minimum_pass_rate pass \n","0 65% True \n","1 65% True "]},"execution_count":25,"metadata":{},"output_type":"execute_result"}],"source":["harness.report()"]}],"metadata":{"accelerator":"TPU","colab":{"machine_shape":"hm","provenance":[],"toc_visible":true},"kernelspec":{"display_name":"Python 3","name":"python3"},"language_info":{"name":"python"},"widgets":{"application/vnd.jupyter.widget-state+json":{"01df3137965b434190d73bb59c9790bb":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"0d7c7a938349427983d62652e81cead5":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"alig
n_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"0dab743db8f14b77b0ec1699f92f86ed":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"DescriptionStyleModel","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"DescriptionStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","description_width":""}},"20d999a03d814a7785232c091241dc1c":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"HTMLModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HTMLModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HTMLView","description":"","description_tooltip":null,"layout":"IPY_MODEL_351e721352bf4c7cb30dbbe8a06ce35d","placeholder":"​","style":"IPY_MODEL_ad6bedec421b40d897568ae3f2705810","value":"Downloading (…)solve/main/vocab.txt: 
100%"}},"20e863ea2c17471ead434e1df3c623ed":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"HBoxModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HBoxModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HBoxView","box_style":"","children":["IPY_MODEL_d9f2bbecf3fd4473af04e2e25653f928","IPY_MODEL_8f273303cf324d0bb3146ecea2af2411","IPY_MODEL_d9f73f8d0c7345049a7ea11924b756dd"],"layout":"IPY_MODEL_d32e905239be4fef985ae8767d6add99"}},"257c00fef73b4d50950c8d8b165e26a2":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"HBoxModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HBoxModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HBoxView","box_style":"","children":["IPY_MODEL_75d0522480494bb1a7b66e14fc43faac","IPY_MODEL_4218ed9efdf84217b5daa2aa5930e20b","IPY_MODEL_867e0de65c734221ad6f2623c2a35f57"],"layout":"IPY_MODEL_d3ca7afb948f404682aa027d3d76d237"}},"2608c51cf9784a56baeddf9d1622ce76":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"m
ax_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"2773b8eeb7024310b2264d487a9b26df":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"ProgressStyleModel","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"ProgressStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","bar_color":null,"description_width":""}},"2c907621903c43c9ad7ed84ee9026412":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"ProgressStyleModel","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"ProgressStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","bar_color":null,"description_width":""}},"2e68a1149b7b40bc8c2811b1a16c96ea":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,
"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"351e721352bf4c7cb30dbbe8a06ce35d":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"3d19431d61e747df81b5b6730e67c955":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"FloatProgressModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"FloatProgressModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"ProgressView","bar_style":"success","description":"","description_tooltip":null,"layout":"IPY_MODEL_6984b154f66d4f1ab209168e50a64acd","max":6270,"min":0,"orientation":"horizontal","style":"IPY_MODEL_2c907621903c43c9ad7ed84ee9026412","value":6270}},"3e9c9defb1d148b5a6de25cb2095740a":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"HTMLModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/contr
ols","_model_module_version":"1.5.0","_model_name":"HTMLModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HTMLView","description":"","description_tooltip":null,"layout":"IPY_MODEL_5a12148bfe9848c5b9827d9b677b39dd","placeholder":"​","style":"IPY_MODEL_b4bf22308b254236960ff1eb5306c4e9","value":"Downloading builder script: 100%"}},"4218ed9efdf84217b5daa2aa5930e20b":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"FloatProgressModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"FloatProgressModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"ProgressView","bar_style":"success","description":"","description_tooltip":null,"layout":"IPY_MODEL_2608c51cf9784a56baeddf9d1622ce76","max":525,"min":0,"orientation":"horizontal","style":"IPY_MODEL_2773b8eeb7024310b2264d487a9b26df","value":525}},"4f579cc50d884981b562f112b8764075":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padd
ing":null,"right":null,"top":null,"visibility":null,"width":null}},"52ef8bcdab0a42f0a5d6a336766de54d":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"DescriptionStyleModel","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"DescriptionStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","description_width":""}},"56701a47f6ee4a6d81a98f66756baf03":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"HBoxModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HBoxModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HBoxView","box_style":"","children":["IPY_MODEL_20d999a03d814a7785232c091241dc1c","IPY_MODEL_6ab5b7e5c6784f3b92b6180ae0043589","IPY_MODEL_9824945e44fe4af4a1d70a8383b72b72"],"layout":"IPY_MODEL_0d7c7a938349427983d62652e81cead5"}},"5a0ba0d42433427c8874b56d5ef1f4a2":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"DescriptionStyleModel","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"DescriptionStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","description_width":""}},"5a12148bfe9848c5b9827d9b677b39dd":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_
columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"61e4c8036ec34d28a5efafb0c41a0a74":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"ProgressStyleModel","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"ProgressStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","bar_color":null,"description_width":""}},"660e7fdd115f4e728fe7ea0358fd8bff":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"6984b154f66d4f1ab2
09168e50a64acd":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"6ab5b7e5c6784f3b92b6180ae0043589":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"FloatProgressModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"FloatProgressModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"ProgressView","bar_style":"success","description":"","description_tooltip":null,"layout":"IPY_MODEL_fabd451f3ccc47d5aed88e94eec722f7","max":231508,"min":0,"orientation":"horizontal","style":"IPY_MODEL_c07ab8a5ad3e41e991f940b6e08e1814","value":231508}},"75d0522480494bb1a7b66e14fc43faac":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"HTMLModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HTMLModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_
version":"1.5.0","_view_name":"HTMLView","description":"","description_tooltip":null,"layout":"IPY_MODEL_f2540d52716a4393a5f050f8d030f3f3","placeholder":"​","style":"IPY_MODEL_0dab743db8f14b77b0ec1699f92f86ed","value":"Downloading (…)lve/main/config.json: 100%"}},"7b972e6f8f624ac28f148a8cff4b0ee2":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"805c8478574545c398214ce2d295944a":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"HTMLModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HTMLModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HTMLView","description":"","description_tooltip":null,"layout":"IPY_MODEL_4f579cc50d884981b562f112b8764075","placeholder":"​","style":"IPY_MODEL_5a0ba0d42433427c8874b56d5ef1f4a2","value":" 6.27k/6.27k [00:00<00:00, 
260kB/s]"}},"829fb20d826d45baaf8d785179c1b32f":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"DescriptionStyleModel","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"DescriptionStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","description_width":""}},"867e0de65c734221ad6f2623c2a35f57":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"HTMLModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HTMLModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HTMLView","description":"","description_tooltip":null,"layout":"IPY_MODEL_a3d9b7d4b44540d88953c69b56f9269f","placeholder":"​","style":"IPY_MODEL_cb676eb37f2a4126837c7324bf51d7ad","value":" 525/525 [00:00<00:00, 
17.4kB/s]"}},"88af924ecc884636bb5bc9cad872e53a":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"DescriptionStyleModel","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"DescriptionStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","description_width":""}},"8f273303cf324d0bb3146ecea2af2411":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"FloatProgressModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"FloatProgressModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"ProgressView","bar_style":"success","description":"","description_tooltip":null,"layout":"IPY_MODEL_ab31e5a39fe143d8895353e2c7ebea3c","max":5669,"min":0,"orientation":"horizontal","style":"IPY_MODEL_61e4c8036ec34d28a5efafb0c41a0a74","value":5669}},"9824945e44fe4af4a1d70a8383b72b72":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"HTMLModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HTMLModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HTMLView","description":"","description_tooltip":null,"layout":"IPY_MODEL_660e7fdd115f4e728fe7ea0358fd8bff","placeholder":"​","style":"IPY_MODEL_52ef8bcdab0a42f0a5d6a336766de54d","value":" 232k/232k [00:00<00:00, 
3.60MB/s]"}},"9a1221b68d2c4af1a74f5978e252d507":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"HTMLModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HTMLModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HTMLView","description":"","description_tooltip":null,"layout":"IPY_MODEL_e349b98fd389418fb365f53185489437","placeholder":"​","style":"IPY_MODEL_f6ebb67ea4574f3e8924b90d7b5aba12","value":" 51.0M/51.0M [00:00<00:00, 148MB/s]"}},"a2ff2f24ad77485e9de01427e2231712":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"DescriptionStyleModel","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"DescriptionStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","description_width":""}},"a3d9b7d4b44540d88953c69b56f9269f":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null
,"right":null,"top":null,"visibility":null,"width":null}},"aa57f92f95904c529d342790ecf4d75c":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"ab31e5a39fe143d8895353e2c7ebea3c":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"or
der":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"ad6bedec421b40d897568ae3f2705810":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"DescriptionStyleModel","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"DescriptionStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","description_width":""}},"b16b721265754f5fa258970429fc7bdd":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"b4bf22308b254236960ff1eb5306c4e9":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"DescriptionStyleModel","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"DescriptionStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","description_width":""}},"c07ab8a5a
d3e41e991f940b6e08e1814":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"ProgressStyleModel","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"ProgressStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","bar_color":null,"description_width":""}},"cb676eb37f2a4126837c7324bf51d7ad":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"DescriptionStyleModel","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"DescriptionStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","description_width":""}},"d32e905239be4fef985ae8767d6add99":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"d3ca7afb948f404682aa027d3d76d237":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"
_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"d5950fc7527049279a8d433985f79619":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"HBoxModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HBoxModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HBoxView","box_style":"","children":["IPY_MODEL_3e9c9defb1d148b5a6de25cb2095740a","IPY_MODEL_3d19431d61e747df81b5b6730e67c955","IPY_MODEL_805c8478574545c398214ce2d295944a"],"layout":"IPY_MODEL_7b972e6f8f624ac28f148a8cff4b0ee2"}},"d9f2bbecf3fd4473af04e2e25653f928":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"HTMLModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HTMLModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HTMLView","description":"","description_tooltip":null,"layout":"IPY_MODEL_01df3137965b434190d73bb59c9790bb","placeholder":"​","style":
"IPY_MODEL_a2ff2f24ad77485e9de01427e2231712","value":"Downloading builder script: 100%"}},"d9f73f8d0c7345049a7ea11924b756dd":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"HTMLModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HTMLModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HTMLView","description":"","description_tooltip":null,"layout":"IPY_MODEL_aa57f92f95904c529d342790ecf4d75c","placeholder":"​","style":"IPY_MODEL_88af924ecc884636bb5bc9cad872e53a","value":" 5.67k/5.67k [00:00<00:00, 239kB/s]"}},"e0e00dfcfb7c49ac961ff7f1101a0caa":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"HTMLModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HTMLModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HTMLView","description":"","description_tooltip":null,"layout":"IPY_MODEL_2e68a1149b7b40bc8c2811b1a16c96ea","placeholder":"​","style":"IPY_MODEL_829fb20d826d45baaf8d785179c1b32f","value":"Downloading pytorch_model.bin: 
100%"}},"e349b98fd389418fb365f53185489437":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"e367e27cda314517ab18696ecd913e0a":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"FloatProgressModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"FloatProgressModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"ProgressView","bar_style":"success","description":"","description_tooltip":null,"layout":"IPY_MODEL_feb421598a0441498d81241716261b78","max":51044621,"min":0,"orientation":"horizontal","style":"IPY_MODEL_f0fc5b6cb35e4986b5ef1f2d03e56228","value":51044621}},"f0fc5b6cb35e4986b5ef1f2d03e56228":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"ProgressStyleModel","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"ProgressStyleModel","_view_count":null,"_view_module":"@jupyter-w
idgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","bar_color":null,"description_width":""}},"f2540d52716a4393a5f050f8d030f3f3":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"f6ebb67ea4574f3e8924b90d7b5aba12":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"DescriptionStyleModel","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"DescriptionStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","description_width":""}},"fa4244813260430c98d2fbad63671f10":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"HBoxModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HBoxModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HBoxView","box_style":"","children":["IPY_MODEL_e0e00dfcfb7c49ac961ff7f
1101a0caa","IPY_MODEL_e367e27cda314517ab18696ecd913e0a","IPY_MODEL_9a1221b68d2c4af1a74f5978e252d507"],"layout":"IPY_MODEL_b16b721265754f5fa258970429fc7bdd"}},"fabd451f3ccc47d5aed88e94eec722f7":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"feb421598a0441498d81241716261b78":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":n
ull,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}}}}},"nbformat":4,"nbformat_minor":0} diff --git a/demo/tutorials/misc/HF_Callback_NER.ipynb b/demo/tutorials/misc/HF_Callback_NER.ipynb new file mode 100644 index 000000000..0b10bac4d --- /dev/null +++ b/demo/tutorials/misc/HF_Callback_NER.ipynb @@ -0,0 +1,668 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "![image.png]()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/langtest/blob/main/demo/tutorials/misc/HF_Callback_NER.ipynb)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**LangTest** is an open-source Python library designed to help developers deliver safe and effective Natural Language Processing (NLP) models. Whether you are using **John Snow Labs, Hugging Face, spaCy** models or **OpenAI, Cohere, AI21, Hugging Face Inference API and Azure-OpenAI** based LLMs, it has got you covered. You can test any Named Entity Recognition (NER) or Text Classification model using the library. We also support testing LLMs for Question-Answering, Summarization, Clinical-Tests and Security tasks on benchmark datasets. The library supports 50+ out-of-the-box tests. These tests fall into robustness, accuracy, bias, representation, toxicity, translation, performance, security, clinical and fairness test categories.\n", + "\n", + "Metrics are calculated by comparing the model's extractions in the original list of sentences against the extractions carried out in the noisy list of sentences. The original annotated labels are not used at any point; we are simply comparing the model against itself in two settings."
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Getting started with LangTest" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "!pip install \"langtest[transformers]\"" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# LangTestCallback and Its Parameters\n", + "\n", + "The LangTestCallback class is a testing class for Natural Language Processing (NLP) models. It evaluates the performance of an NLP model on a given task using test data and generates a report with test results. It can be imported from the LangTest library as follows." + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [], + "source": [ + "# Import LangTestCallback from the LangTest library\n", + "from langtest.callback import LangTestCallback" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "This imports the callback class, which provides a framework for running NLP tests during training. Instances of the callback can be configured for different testing scenarios or environments and are then passed to the trainer.\n", + "\n", + "Here is a list of the parameters that can be passed to LangTestCallback:\n", + "\n", + "
\n", + "\n", + "| Parameter | Description |\n", + "| --------------------- | ----------- |\n", + "| **task** | Task for which the model is to be evaluated (text-classification or ner) |\n", + "| **data** | The data to be used for evaluation. A dictionary specifying the data source and its loading options. It should include the following keys:<br>• data_source (mandatory): The source of the data.<br>• subset (optional): The subset of the data.<br>• feature_column (optional): The column containing the features.<br>• target_column (optional): The column containing the target labels.<br>• split (optional): The data split to be used.<br>• source (optional): Set to 'huggingface' when loading a Hugging Face dataset. |\n", + "| **config** | Configuration for the tests to be performed, specified as a dictionary or a YAML file. |\n", + "| **print_reports** | A bool value that specifies if the reports should be printed. |\n", + "| **save_reports** | A bool value that specifies if the reports should be saved. |\n", + "| **run_each_epoch** | A bool value that specifies if the tests should be run after each epoch or only at the end of training. |\n", + "\n", + "
\n", + "
" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Preparing for training" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "H_93RQeoDVsW" + }, + "outputs": [], + "source": [ + "!pip install datasets\n", + "!pip install transformers[torch]\n", + "!pip install tensorflow -U" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "!wget https://raw.githubusercontent.com/JohnSnowLabs/langtest/main/demo/data/conll03.conll\n", + "!wget https://raw.githubusercontent.com/JohnSnowLabs/langtest/main/langtest/data/conll/sample.conll" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "vFzwOHkqC7tQ", + "outputId": "d7dccbc0-1691-43a5-879a-0fc04e6b5a60" + }, + "outputs": [], + "source": [ + "import torch\n", + "from torch.utils.data import Dataset, DataLoader\n", + "from transformers import BertForTokenClassification, BertTokenizerFast, Trainer, TrainingArguments\n", + "from seqeval.metrics import accuracy_score, f1_score, precision_score, recall_score\n", + "import numpy as np\n", + "\n", + "device = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\n", + "\n", + "# Load dataset\n", + "file_path = \"conll03.conll\"\n", + "\n", + "def read_conll_file(file_path):\n", + "    with open(file_path, \"r\") as file:\n", + "        lines = file.readlines()\n", + "    return lines[::2]\n", + "\n", + "lines = read_conll_file(file_path)\n", + "\n", + "# Preprocess dataset\n", + "def preprocess_conll(lines):\n", + "    tokens = []\n", + "    labels = []\n", + "    token_list = []\n", + "    label_list = []\n", + "\n", + "    for line in lines:\n", + "        if line.startswith(\"-DOCSTART-\") or line == \"\\n\":\n", + "            if token_list:\n", + "                tokens.append(token_list)\n", + "                labels.append(label_list)\n", + "                token_list = []\n", + "                label_list = []\n", + "        else:\n", + "            token, _, _, label = 
line.strip().split()\n", + "            token_list.append(token)\n", + "            label_list.append(label)\n", + "\n", + "    return tokens, labels\n", + "\n", + "tokens, labels = preprocess_conll(lines)\n", + "\n", + "class NERDataset(Dataset):\n", + "    def __init__(self, tokens, labels, tokenizer, max_length=128):\n", + "        self.tokens = tokens\n", + "        self.labels = labels\n", + "        self.tokenizer = tokenizer\n", + "        self.max_length = max_length\n", + "\n", + "        self.label_map = {label: i for i, label in enumerate(sorted(set([lbl for doc_labels in labels for lbl in doc_labels])))}\n", + "        self.id2label = {v: k for k, v in self.label_map.items()}\n", + "\n", + "    def __len__(self):\n", + "        return len(self.tokens)\n", + "\n", + "    def __getitem__(self, idx):\n", + "        token_list = self.tokens[idx]\n", + "        label_list = self.labels[idx]\n", + "\n", + "        encoded = self.tokenizer(token_list, is_split_into_words=True, padding=\"max_length\", truncation=True, max_length=self.max_length, return_tensors=\"pt\")\n", + "        token_ids = encoded.input_ids.squeeze(0)\n", + "        attention_mask = encoded.attention_mask.squeeze(0)\n", + "\n", + "        # Align word-level labels with subword tokens: special, padding, and continuation tokens get -100\n", + "        word_ids = encoded.word_ids(batch_index=0)\n", + "        label_ids = [-100 if w is None or (i > 0 and w == word_ids[i - 1]) else self.label_map[label_list[w]] for i, w in enumerate(word_ids)]\n", + "\n", + "        return {\n", + "            \"input_ids\": token_ids,\n", + "            \"attention_mask\": attention_mask,\n", + "            \"labels\": torch.tensor(label_ids, dtype=torch.long),\n", + "        }\n", + "\n", + "# Initialize tokenizer and dataset\n", + "tokenizer = BertTokenizerFast.from_pretrained(\"dslim/bert-base-NER\")\n", + "train_dataset = NERDataset(tokens, labels, tokenizer)\n", + "\n", + "# Initialize model\n", + "model = BertForTokenClassification.from_pretrained(\n", + "    \"dslim/bert-base-NER\",\n", + "    num_labels=len(train_dataset.label_map),\n", + "    id2label=train_dataset.id2label,\n", + "    label2id=train_dataset.label_map,\n", + "    ignore_mismatched_sizes=True,\n", + 
")\n", + "\n", + "# Initialize the classifier layer with the correct number of labels\n", + "model.classifier = torch.nn.Linear(model.config.hidden_size, len(train_dataset.label_map))\n", + "\n", + "# Move the model to the appropriate device\n", + "model.to(device)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Creating a LangTestCallback instance" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "After loading the model and tokenizer from Hugging Face, we can move on to training. We will use `transformers.Trainer` to integrate our callback into the training process and `transformers.TrainingArguments` to specify the training arguments.\n", + "\n", + "We can store the config in a dictionary and pass it to LangTestCallback for convenience and readability. The config used in this notebook is shown below:" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": { + "id": "ZqT9vZQiC7tS" + }, + "outputs": [], + "source": [ + "config = {\n", + "    \"tests\": {\n", + "        \"defaults\": {\n", + "            \"min_pass_rate\": 1.0\n", + "        },\n", + "        \"robustness\": {\n", + "            \"add_typo\": {\"min_pass_rate\": 0.7},\n", + "            \"uppercase\": {\"min_pass_rate\": 0.7},\n", + "            \"american_to_british\": {\"min_pass_rate\": 0.7},\n", + "        },\n", + "        \"accuracy\": {\n", + "            \"min_micro_f1_score\": {\n", + "                \"min_score\": 0.7\n", + "            }\n", + "        },\n", + "        \"bias\": {\n", + "            \"replace_to_female_pronouns\": {\n", + "                \"min_pass_rate\": 0.7\n", + "            },\n", + "            \"replace_to_low_income_country\": {\n", + "                \"min_pass_rate\": 0.7\n", + "            }\n", + "        }\n", + "    }\n", + "}\n", + "my_callback = LangTestCallback(task=\"ner\", data={\"data_source\":\"sample.conll\"}, config=config, save_reports=True, run_each_epoch=True)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Creating the Trainer" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + 
"source": [ + "As mentioned earlier, we create a TrainingArguments object to specify the training arguments and a Trainer object to train our model. We then pass the LangTestCallback object to the Trainer as a callback. LangTestCallback initializes the harness object and generates the test cases using .generate() after the trainer is initialized." + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": { + "id": "325jnkfxfCPF" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Test Configuration : \n", + " {\n", + " \"tests\": {\n", + " \"defaults\": {\n", + " \"min_pass_rate\": 1.0\n", + " },\n", + " \"robustness\": {\n", + " \"add_typo\": {\n", + " \"min_pass_rate\": 0.7\n", + " },\n", + " \"uppercase\": {\n", + " \"min_pass_rate\": 0.7\n", + " },\n", + " \"american_to_british\": {\n", + " \"min_pass_rate\": 0.7\n", + " }\n", + " },\n", + " \"accuracy\": {\n", + " \"min_micro_f1_score\": {\n", + " \"min_score\": 0.7\n", + " }\n", + " },\n", + " \"bias\": {\n", + " \"replace_to_female_pronouns\": {\n", + " \"min_pass_rate\": 0.7\n", + " },\n", + " \"replace_to_low_income_country\": {\n", + " \"min_pass_rate\": 0.7\n", + " }\n", + " }\n", + " }\n", + "}\n" + ] + } + ], + "source": [ + "# Training arguments\n", + "training_args = TrainingArguments(\n", + "    output_dir=\"./results\",\n", + "    num_train_epochs=2,\n", + "    per_device_train_batch_size=64,\n", + "    logging_dir=\"./logs\",\n", + "    logging_steps=100,\n", + "    save_steps=1000,\n", + "    learning_rate=3e-5,\n", + "    weight_decay=0.01,\n", + ")\n", + "\n", + "# Initialize trainer\n", + "trainer = Trainer(\n", + "    model=model,\n", + "    args=training_args,\n", + "    train_dataset=train_dataset,\n", + "    tokenizer=tokenizer,\n", + "    callbacks=[my_callback],\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Training" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The actual 
training step is very simple. We just need to call the train() method of the Trainer object. We can also pass the training arguments to the train() method but its default values are OK in this case." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We have the reports printed and also saved under the reports folder. The reports are saved in the form of a MD file. The reports folder is created in the same directory as the notebook.We have the reports printed and also saved under the reports folder. The reports are saved in the form of a MD file. The reports folder is created in the same directory as the notebook." + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 1000 + }, + "id": "PzAaW4CPC7tV", + "outputId": "f11d85ee-36f2-4341-a761-1d305391790c" + }, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "Generating testcases...: 100%|██████████| 3/3 [00:00\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
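| | category | test_type | fail_count | pass_count | pass_rate | minimum_pass_rate | pass |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | robustness | add_typo | 65 | 147 | 69% | 70% | False |
| 1 | robustness | uppercase | 129 | 69 | 35% | 70% | False |
| 2 | accuracy | min_micro_f1_score | 0 | 1 | 100% | 100% | True |
| 3 | bias | replace_to_female_pronouns | 11 | 17 | 61% | 70% | False |
| 4 | bias | replace_to_low_income_country | 44 | 44 | 50% | 70% | False |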
\n", + "" + ], + "text/plain": [ + " category test_type fail_count pass_count \\\n", + "0 robustness add_typo 65 147 \n", + "1 robustness uppercase 129 69 \n", + "2 accuracy min_micro_f1_score 0 1 \n", + "3 bias replace_to_female_pronouns 11 17 \n", + "4 bias replace_to_low_income_country 44 44 \n", + "\n", + " pass_rate minimum_pass_rate pass \n", + "0 69% 70% False \n", + "1 35% 70% False \n", + "2 100% 100% True \n", + "3 61% 70% False \n", + "4 50% 70% False " + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "Running testcases... : 100%|██████████| 527/527 [00:13<00:00, 39.39it/s]\n" + ] + }, + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
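| | category | test_type | fail_count | pass_count | pass_rate | minimum_pass_rate | pass |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | robustness | add_typo | 65 | 147 | 69% | 70% | False |
| 1 | robustness | uppercase | 129 | 69 | 35% | 70% | False |
| 2 | accuracy | min_micro_f1_score | 0 | 1 | 100% | 100% | True |
| 3 | bias | replace_to_female_pronouns | 11 | 17 | 61% | 70% | False |
| 4 | bias | replace_to_low_income_country | 44 | 44 | 50% | 70% | False |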
\n", + "
" + ], + "text/plain": [ + " category test_type fail_count pass_count \\\n", + "0 robustness add_typo 65 147 \n", + "1 robustness uppercase 129 69 \n", + "2 accuracy min_micro_f1_score 0 1 \n", + "3 bias replace_to_female_pronouns 11 17 \n", + "4 bias replace_to_low_income_country 44 44 \n", + "\n", + " pass_rate minimum_pass_rate pass \n", + "0 69% 70% False \n", + "1 35% 70% False \n", + "2 100% 100% True \n", + "3 61% 70% False \n", + "4 50% 70% False " + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "{'train_runtime': 2029.1679, 'train_samples_per_second': 1.67, 'train_steps_per_second': 0.027, 'train_loss': 0.7498808260317202, 'epoch': 2.0}\n" + ] + }, + { + "data": { + "text/plain": [ + "TrainOutput(global_step=54, training_loss=0.7498808260317202, metrics={'train_runtime': 2029.1679, 'train_samples_per_second': 1.67, 'train_steps_per_second': 0.027, 'train_loss': 0.7498808260317202, 'epoch': 2.0})" + ] + }, + "execution_count": 11, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Training the model\n", + "trainer.train()" + ] + } + ], + "metadata": { + "accelerator": "TPU", + "colab": { + "machine_shape": "hm", + "provenance": [] + }, + "kernelspec": { + "display_name": "Python 3", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.11" + }, + "orig_nbformat": 4 + }, + "nbformat": 4, + "nbformat_minor": 0 +} diff --git a/demo/tutorials/misc/HF_Callback_Text_Classification.ipynb b/demo/tutorials/misc/HF_Callback_Text_Classification.ipynb new file mode 100644 index 000000000..47119e79b --- /dev/null +++ b/demo/tutorials/misc/HF_Callback_Text_Classification.ipynb @@ -0,0 +1,6207 @@ +{ + "cells": [ + { + "cell_type": "markdown", + 
"metadata": { + "id": "jivErFCkFgxe" + }, + "source": [ + "![image.png]()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "VGFU1I5BFgxg" + }, + "source": [ + "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/langtest/blob/main/demo/tutorials/misc/Comparing_Models_Notebook.ipynb)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "V48ExCf-Fgxg" + }, + "source": [ + "**LangTest** is an open-source python library designed to help developers deliver safe and effective Natural Language Processing (NLP) models. Whether you are using **John Snow Labs, Hugging Face, Spacy** models or **OpenAI, Cohere, AI21, Hugging Face Inference API and Azure-OpenAI** based LLMs, it has got you covered. You can test any Named Entity Recognition (NER), Text Classification model using the library. We also support testing LLMS for Question-Answering, Summarization, Clinical-Tests and Security tasks on benchmark datasets. The library supports 50+ out of the box tests. These tests fall into robustness, accuracy, bias, representation, toxicity, translation, performance, security, clinical and fairness test categories.\n", + "\n", + "Metrics are calculated by comparing the model's extractions in the original list of sentences against the extractions carried out in the noisy list of sentences. The original annotated labels are not used at any point, we are simply comparing the model against itself in a 2 settings." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "y-cUyOo7Fgxg" + }, + "source": [ + "# Getting started with LangTest" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "5PSN8rcjFgxg" + }, + "outputs": [], + "source": [ + "!pip install langtest[transformers]" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "L0Iqr61XFgxh" + }, + "source": [ + "# LangTestCallback and Its Parameters\n", + "\n", + "The LangTestCallback class is a testing class for Natural Language Processing (NLP) models. It evaluates the performance of an NLP model on a given task using test data and generates a report with test results. It can be imported from the LangTest library as follows." + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": { + "id": "Aw_gRAOsFgxh" + }, + "outputs": [], + "source": [ + "# Import LangTestCallback from the LangTest library\n", + "from langtest.callback import LangTestCallback" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "POQ5a8EhFgxh" + }, + "source": [ + "This imports the callback class, which provides a framework for conducting NLP testing. Instances of the callback can be configured for different testing scenarios or environments and then passed to the trainer.\n", + "\n", + "Here is a list of the parameters that can be passed to LangTestCallback:\n", + "\n", + "
\n", + "\n", + "| Parameter | Description |\n", + "| --------------------- | ----------- |\n", + "| **task** | Task for which the model is to be evaluated (text-classification or ner) |\n", + "| **data** | The data to be used for evaluation. A dictionary providing flexibility and options for data sources. It should include the following keys:
  • data_source (mandatory): The source of the data.
  • subset (optional): The subset of the data.
  • feature_column (optional): The column containing the features.
  • target_column (optional): The column containing the target labels.
  • split (optional): The data split to be used.
  • source (optional): Set to 'huggingface' when loading Hugging Face dataset.
|\n", + "| **config** | Configuration for the tests to be performed, specified as a YAML file or a dictionary. |\n", + "| **print_reports** | A bool value that specifies whether the reports should be printed. |\n", + "| **save_reports** | A bool value that specifies whether the reports should be saved. |\n", + "| **run_each_epoch** | A bool value that specifies whether the tests should be run after each epoch or only at the end of training. |\n", + "\n", + "
\n", + "
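As a minimal sketch of how the parameters in the table above fit together, the snippet below collects them as keyword arguments and passes them to `LangTestCallback`. The dataset name, split, and thresholds are illustrative assumptions rather than values taken from this notebook. The config is written as a dictionary, as in the NER callback notebook from this same change, and the import is guarded so the snippet also runs where LangTest is not installed.

```python
# Illustrative values only: the dataset, split and thresholds are assumptions.
config = {
    'tests': {
        'defaults': {'min_pass_rate': 1.0},
        'robustness': {
            'add_typo': {'min_pass_rate': 0.7},
            'uppercase': {'min_pass_rate': 0.7},
        },
    }
}

# One entry per parameter listed in the table.
callback_kwargs = {
    'task': 'text-classification',
    'data': {
        'data_source': 'imdb',    # hypothetical dataset name
        'split': 'test',          # hypothetical split
        'source': 'huggingface',  # per the table: set when loading a Hugging Face dataset
    },
    'config': config,
    'print_reports': True,
    'save_reports': True,
    'run_each_epoch': False,
}

try:
    from langtest.callback import LangTestCallback
except ImportError:
    LangTestCallback = None  # LangTest is not installed in this environment

# Build the callback only when the library is available.
my_callback = LangTestCallback(**callback_kwargs) if LangTestCallback else None
```

The resulting object would then be handed to the trainer through `callbacks=[my_callback]`, so the tests run as part of training.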
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "H_93RQeoDVsW" + }, + "outputs": [], + "source": [ + "!pip install datasets\n", + "!pip install transformers[torch]\n", + "!pip install tensorflow -U" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 401, + "referenced_widgets": [ + "0f339b7dda014a65acfe241384167c53", + "6db1f4ec80fb4fa7a5f77c2909b3c755", + "49b76e2c8d4c484d9e8e78f55dd34f2b", + "65d02c0924da4eca835fa9cb89c7d7dd", + "c7d2e34de0714971984e8a4e72c85f16", + "028376ac680b43c1a294407e9ab6a5af", + "fadcdd015f6541fbaadf975dd13cd95f", + "5c8ad99ebaa347f79905a557c9212381", + "4b4dad9da390409d81fed270278bf4c7", + "e7c48e205424478a8f015c77902fb134", + "d97d6cbf86544494885c3d347febc852", + "2e978e6bd19047f78c897106c31966ed", + "04886efe65104ef1ade8715a59ef1f41", + "8144f7eb5c6e4882b0cd2eb2f968c882", + "6c316019bb8a4097b9a2a88dda964bc7", + "91f16706c6134d6a8ef4fa07aa38ad83", + "9f1b265bba014211a8eae834cb4727bd", + "27ca1877f22148808a9ac400a42a7a26", + "d868c8883fee4bfc9b589da2fc7275fd", + "b36a094e04ab41e59e61a0614b1c7b26", + "b3fe91b4f396478eb71c1a3e6fbd0d6f", + "22deb7749a3940048448989ef5b91ab4", + "a3aac8ab3e1440388759e76e0b8d400b", + "fb68bc3a68c74db09891f09a797fad4d", + "fd3c248bc0d64c1b9852f47d4ee7eb17", + "33c598afbc834c938618d85007f8b36c", + "2cc56448beb44502ba5cbc7f7b5ae057", + "484d6ef0cb9548c495b7f89adddb8799", + "f49c8b3f47ff4488bb611ab129609a06", + "49bb38a9f972468a96fdd539b0b5a6d6", + "4ce641041f28400c9b3ba584cef00e1a", + "aa368056a83d44819674ac55b679fad1", + "be594efe148c496e92175408918e9eff", + "c0fdf2412b154e5f8400e7ad50321f61", + "f6564aa137504aa6ac8752464146498e", + "a4e85dc8d4ab42d190d09afbab4d9b67", + "b262b3042a6a4c21a990fe49a8d2c4ef", + "5a426c228731404bbcf202a6c103c14b", + "af95713290f34a4d83088951fcaa8d38", + "658c4a4bdb35461d8f06a5cdfdccb877", + "02c486f6d14549068dc96d68618a0ec6", + 
"b8d5164256ac40bb8633f8317aab279a", + "31f2a68fc9194a5b80ede2bef35e2bd7", + "973ecb3108c84ca89b28f2597a8ff474", + "59753f8944b9414d927c7e2eea445b51", + "8ffc2b1fb14449d1bb03a3b5bed928bc", + "e75a36f8a97847c4a52c6f139ab660d0", + "a1cc0e10853c47a1901c823ce8d008de", + "ec96e2943e3c4907874f391aafe78628", + "701c8d36283248f4a0df3dd72941e603", + "d84c5dc84e7e440ca66250a705fe4a16", + "5b0a2ab061cc402ebf12db7d4dceb7dd", + "f7d45e8dd0eb4146af7d2ccbea4c4edd", + "4456b5fe19474e7cb4cfe70679d6005f", + "9613e943891c428bbf3f8847d9c61a31", + "bca602866a86460ca5cc99c5d1a5a415", + "90737d28781c47909746ac3fd524769a", + "b66e3b7156ee4bd295243141c20019bc", + "6cecb165f14441ecae43cc59a0a0b56f", + "ef91b2f8818044ca8ed144dbaa963fcd", + "440c6c8b303343c8a8cf53d4d9768a6e", + "93577d92cf684871b5026252e8123c84", + "0c91977ea08a4efb91280ff178e4850b", + "8406d74b98094a32b2f1028f6e317ff4", + "3f30534555e34458983889694fda3671", + "e0924770d76a49b499f3e08b9f24202a", + "aaf2bee8382c445b9f02631e1a46adda", + "f61ca1a64d294128a06f88afee1ca0e1", + "35e06f0a72bb456c81b846a9eae1b91f", + "ace61e4ac7cd4f64ba93aa173ce87287", + "8d0f1ffbf93b45d5950426321b82b785", + "9ecc27897dba4231bd6b82031a1cc3db", + "2e69383236104fb38191d990e3a54c81", + "8cdb6d85160747a1a451322f2e60cbc9", + "b6cfacc547d4499a9e887b6314a0e00f", + "e095bb6ae9514753a6f6e90d72028857", + "41e09c4061784455a67eb0b7cf2d31fc", + "6d44d06737f64b85a0aa9efc813bcf61", + "3763a876e5184a4da9b7816839d0b523", + "45b6355d8db24edbb5614c2f1d944619", + "cd4d230cf56843afa1eab8939ba8c5df", + "d9ea2f0150654e60bd4cbf8b6d8c5fcf", + "4921d93aa19a4b9f9664463b743aa5e4", + "b3442e661da6452ba7cff46bf9c01ef4", + "3ebc999924bd46dea7f8e21677edbf07", + "4611ce6627b640d8b6cfc95242839b3c", + "3bd4ebc1706640eea35d8c1c56541288", + "2b5e9cecebb54328a3020a80ff0d9a9a", + "08a6428e9c9f48689a847dd8063487a2", + "30635c8d36244136b2c9485fce6a3675", + "5ed3e59771104d849dff50eeb6c8b519", + "4c9d20de836b49aa87724addef5bd550", + "b2ab93acd0104da0a8b4b2ab18a51e22", + 
"9b2695098c88496aa33967a2b76b5578", + "1f100078ce8444478d0c60eeefbff273", + "c2c53496e66048adb1dbb426da8c0bed", + "043d81856e5746708162fa6fbbf4d3c4", + "d0e1c56bccd64f73831bce780c53ec90", + "265e2d76584849479b9e936ed9ce8654", + "79fd37e8fe454e9a83dba65f31c8cb77", + "593c8bc430454eb284169a0686fe6bfb", + "c436532599d04f3b9f53d366a6f59b20", + "1e2fd13373574364b05e8c008e44abff", + "913a768ab4d54c4eb0d963bdeb67fe06", + "c960d9b2bbe94e35afefc7efd9f416ba", + "47caa679e4c34ccc811103714a35ddd3", + "5716068861814d4584a43121242952b1", + "1e2cd1e45e4b4032943aec565dbe6cfe", + "00590bb60bdf448a89e3d0d0f65314ec", + "1cff3f1f268a42ea846e9e3f5f53d8b3", + "f3298d68cecf45fe89b3939d89853594", + "f19e578ca2284ed6a7de268b5175429f", + "8aee44326aae4c3b842277f55b739250", + "1de22c20d44843aaa82ab635f1dd3f9c", + "4bd988dbbb9a4c4195922684d1fd2159", + "5ed1e548a4df4a76a8e2e0c2e49b8d2c", + "9963664b1a2b4137846494d225695932", + "51f764124bbd426b856b7a4b5de965e0", + "2332277384d040cd8a4a62213dd20cda", + "907194999ce6489c9ea9ebc969c7c77e", + "a9dfcd4f75384d76baac18ab4d687ac5", + "155d29c05c1748be98f104c210d23b0f", + "ae80685a02474640ac6325cb61206118", + "e2dda125709548aabdb66af2524b49ea", + "13936ef1cf7241dcacdcd74f788fca0b", + "1151b117e3f543db965ba430560e828f", + "f55a39438d33482faa837412c3eedb28", + "e9d0dab76a69477d8cef7613fd8e25b4", + "29a17d1315ff46caa87c791d87494f07", + "421e40ff53d244c5979d762593f8bccb", + "3882b1ea5f0f494f924e07f44e9e4386", + "143dea23ae7843c7b17a4242a4cee4c5" + ] + }, + "id": "vFzwOHkqC7tQ", + "outputId": "76bbd4c6-4309-40ed-d7b6-c61a30f55b46" + }, + "outputs": [ + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "0f339b7dda014a65acfe241384167c53", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "tokenizer_config.json: 0%| | 0.00/28.0 [00:00\n", + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
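| | category | test_type | fail_count | pass_count | pass_rate | minimum_pass_rate | pass |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | robustness | add_typo | 0 | 189 | 100% | 70% | True |
| 1 | robustness | uppercase | 0 | 200 | 100% | 70% | True |
| 2 | robustness | american_to_british | 0 | 62 | 100% | 70% | True |
| 3 | accuracy | min_micro_f1_score | 1 | 0 | 0% | 100% | False |
| 4 | bias | replace_to_female_pronouns | 0 | 171 | 100% | 70% | True |
| 5 | bias | replace_to_low_income_country | 0 | 28 | 100% | 70% | True |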
\n", + "
\n", + "
\n", + "\n", + "
\n", + " \n", + "\n", + " \n", + "\n", + " \n", + "
\n", + "\n", + "\n", + "
\n", + " \n", + "\n", + "\n", + "\n", + " \n", + "
\n", + "\n", + "
\n", + " \n" + ], + "text/plain": [ + " category test_type fail_count pass_count \\\n", + "0 robustness add_typo 0 189 \n", + "1 robustness uppercase 0 200 \n", + "2 robustness american_to_british 0 62 \n", + "3 accuracy min_micro_f1_score 1 0 \n", + "4 bias replace_to_female_pronouns 0 171 \n", + "5 bias replace_to_low_income_country 0 28 \n", + "\n", + " pass_rate minimum_pass_rate pass \n", + "0 100% 70% True \n", + "1 100% 70% True \n", + "2 100% 70% True \n", + "3 0% 100% False \n", + "4 100% 70% True \n", + "5 100% 70% True " + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "\rRunning testcases... : 0%| | 0/651 [00:00\n", + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
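| | category | test_type | fail_count | pass_count | pass_rate | minimum_pass_rate | pass |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | robustness | add_typo | 0 | 189 | 100% | 70% | True |
| 1 | robustness | uppercase | 0 | 200 | 100% | 70% | True |
| 2 | robustness | american_to_british | 0 | 62 | 100% | 70% | True |
| 3 | accuracy | min_micro_f1_score | 1 | 0 | 0% | 100% | False |
| 4 | bias | replace_to_female_pronouns | 0 | 171 | 100% | 70% | True |
| 5 | bias | replace_to_low_income_country | 0 | 28 | 100% | 70% | True |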
\n", + "
\n", + "
\n", + "\n", + "
\n", + " \n", + "\n", + " \n", + "\n", + " \n", + "
\n", + "\n", + "\n", + "
\n", + " \n", + "\n", + "\n", + "\n", + " \n", + "
\n", + "\n", + "
\n", + " \n" + ], + "text/plain": [ + " category test_type fail_count pass_count \\\n", + "0 robustness add_typo 0 189 \n", + "1 robustness uppercase 0 200 \n", + "2 robustness american_to_british 0 62 \n", + "3 accuracy min_micro_f1_score 1 0 \n", + "4 bias replace_to_female_pronouns 0 171 \n", + "5 bias replace_to_low_income_country 0 28 \n", + "\n", + " pass_rate minimum_pass_rate pass \n", + "0 100% 70% True \n", + "1 100% 70% True \n", + "2 100% 70% True \n", + "3 0% 100% False \n", + "4 100% 70% True \n", + "5 100% 70% True " + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "\rRunning testcases... : 0%| | 0/651 [00:00\n", + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
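| | category | test_type | fail_count | pass_count | pass_rate | minimum_pass_rate | pass |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | robustness | add_typo | 0 | 189 | 100% | 70% | True |
| 1 | robustness | uppercase | 0 | 200 | 100% | 70% | True |
| 2 | robustness | american_to_british | 0 | 62 | 100% | 70% | True |
| 3 | accuracy | min_micro_f1_score | 1 | 0 | 0% | 100% | False |
| 4 | bias | replace_to_female_pronouns | 0 | 171 | 100% | 70% | True |
| 5 | bias | replace_to_low_income_country | 0 | 28 | 100% | 70% | True |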
\n", + "
\n", + "
\n", + "\n", + "
\n", + " \n", + "\n", + " \n", + "\n", + " \n", + "
\n", + "\n", + "\n", + "
\n", + " \n", + "\n", + "\n", + "\n", + " \n", + "
\n", + "\n", + "
\n", + " \n" + ], + "text/plain": [ + " category test_type fail_count pass_count \\\n", + "0 robustness add_typo 0 189 \n", + "1 robustness uppercase 0 200 \n", + "2 robustness american_to_british 0 62 \n", + "3 accuracy min_micro_f1_score 1 0 \n", + "4 bias replace_to_female_pronouns 0 171 \n", + "5 bias replace_to_low_income_country 0 28 \n", + "\n", + " pass_rate minimum_pass_rate pass \n", + "0 100% 70% True \n", + "1 100% 70% True \n", + "2 100% 70% True \n", + "3 0% 100% False \n", + "4 100% 70% True \n", + "5 100% 70% True " + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "{'train_runtime': 281.2316, 'train_samples_per_second': 0.267, 'train_steps_per_second': 0.043, 'train_loss': 0.14822853604952493, 'epoch': 3.0}\n" + ] + }, + { + "data": { + "text/plain": [ + "TrainOutput(global_step=12, training_loss=0.14822853604952493, metrics={'train_runtime': 281.2316, 'train_samples_per_second': 0.267, 'train_steps_per_second': 0.043, 'train_loss': 0.14822853604952493, 'epoch': 3.0})" + ] + }, + "execution_count": 7, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Training the model\n", + "trainer.train()" + ] + } + ], + "metadata": { + "accelerator": "TPU", + "colab": { + "machine_shape": "hm", + "provenance": [] + }, + "kernelspec": { + "display_name": "Python 3", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.11" + }, + "orig_nbformat": 4, + "widgets": { + "application/vnd.jupyter.widget-state+json": { + "00590bb60bdf448a89e3d0d0f65314ec": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": 
"1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "028376ac680b43c1a294407e9ab6a5af": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": 
null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "02c486f6d14549068dc96d68618a0ec6": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "043d81856e5746708162fa6fbbf4d3c4": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "ProgressStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "ProgressStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "bar_color": null, + "description_width": "" + } + }, + 
"04886efe65104ef1ade8715a59ef1f41": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HTMLModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HTMLModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HTMLView", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_9f1b265bba014211a8eae834cb4727bd", + "placeholder": "​", + "style": "IPY_MODEL_27ca1877f22148808a9ac400a42a7a26", + "value": "config.json: 100%" + } + }, + "08a6428e9c9f48689a847dd8063487a2": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HBoxModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HBoxModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HBoxView", + "box_style": "", + "children": [ + "IPY_MODEL_30635c8d36244136b2c9485fce6a3675", + "IPY_MODEL_5ed3e59771104d849dff50eeb6c8b519", + "IPY_MODEL_4c9d20de836b49aa87724addef5bd550" + ], + "layout": "IPY_MODEL_b2ab93acd0104da0a8b4b2ab18a51e22" + } + }, + "0c91977ea08a4efb91280ff178e4850b": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + 
"grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "0f339b7dda014a65acfe241384167c53": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HBoxModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HBoxModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HBoxView", + "box_style": "", + "children": [ + "IPY_MODEL_6db1f4ec80fb4fa7a5f77c2909b3c755", + "IPY_MODEL_49b76e2c8d4c484d9e8e78f55dd34f2b", + "IPY_MODEL_65d02c0924da4eca835fa9cb89c7d7dd" + ], + "layout": "IPY_MODEL_c7d2e34de0714971984e8a4e72c85f16" + } + }, + "0ffe58cec1f549438742975bf4154316": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "DescriptionStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "DescriptionStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "description_width": "" + } + }, + "1151b117e3f543db965ba430560e828f": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + 
"_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "13936ef1cf7241dcacdcd74f788fca0b": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HTMLModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HTMLModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HTMLView", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_3882b1ea5f0f494f924e07f44e9e4386", + "placeholder": "​", + "style": "IPY_MODEL_143dea23ae7843c7b17a4242a4cee4c5", + "value": " 50000/50000 [00:10<00:00, 6954.67 examples/s]" + } + }, + "143dea23ae7843c7b17a4242a4cee4c5": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "DescriptionStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "DescriptionStyleModel", + 
"_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "description_width": "" + } + }, + "155d29c05c1748be98f104c210d23b0f": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HBoxModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HBoxModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HBoxView", + "box_style": "", + "children": [ + "IPY_MODEL_ae80685a02474640ac6325cb61206118", + "IPY_MODEL_e2dda125709548aabdb66af2524b49ea", + "IPY_MODEL_13936ef1cf7241dcacdcd74f788fca0b" + ], + "layout": "IPY_MODEL_1151b117e3f543db965ba430560e828f" + } + }, + "1cff3f1f268a42ea846e9e3f5f53d8b3": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "DescriptionStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "DescriptionStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "description_width": "" + } + }, + "1de22c20d44843aaa82ab635f1dd3f9c": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HTMLModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HTMLModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HTMLView", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_907194999ce6489c9ea9ebc969c7c77e", + "placeholder": "​", + "style": "IPY_MODEL_a9dfcd4f75384d76baac18ab4d687ac5", + "value": " 25000/25000 [00:08<00:00, 7100.44 examples/s]" + } + 
}, + "1e2cd1e45e4b4032943aec565dbe6cfe": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "ProgressStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "ProgressStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "bar_color": null, + "description_width": "" + } + }, + "1e2fd13373574364b05e8c008e44abff": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HTMLModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HTMLModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HTMLView", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_00590bb60bdf448a89e3d0d0f65314ec", + "placeholder": "​", + "style": "IPY_MODEL_1cff3f1f268a42ea846e9e3f5f53d8b3", + "value": " 25000/25000 [00:08<00:00, 924.97 examples/s]" + } + }, + "1f100078ce8444478d0c60eeefbff273": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "DescriptionStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "DescriptionStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "description_width": "" + } + }, + "208ac9d586804957a001b2d134141b78": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HTMLModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HTMLModel", + "_view_count": null, + "_view_module": 
"@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HTMLView", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_9480c6601cd34acc84d1029e3fd36156", + "placeholder": "​", + "style": "IPY_MODEL_98b6d8654f6a42559c413de778557e33", + "value": " 25/25 [00:00<00:00, 295.91 examples/s]" + } + }, + "22deb7749a3940048448989ef5b91ab4": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "DescriptionStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "DescriptionStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "description_width": "" + } + }, + "2332277384d040cd8a4a62213dd20cda": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "ProgressStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "ProgressStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "bar_color": null, + "description_width": "" + } + }, + "265e2d76584849479b9e936ed9ce8654": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "DescriptionStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "DescriptionStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "description_width": "" + } + }, + "27ca1877f22148808a9ac400a42a7a26": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "DescriptionStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + 
"_model_module_version": "1.5.0", + "_model_name": "DescriptionStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "description_width": "" + } + }, + "29a17d1315ff46caa87c791d87494f07": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "2b5e9cecebb54328a3020a80ff0d9a9a": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "DescriptionStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "DescriptionStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "description_width": "" + } + }, + "2cc56448beb44502ba5cbc7f7b5ae057": { + 
"model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "2e69383236104fb38191d990e3a54c81": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "DescriptionStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "DescriptionStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "description_width": "" + } + }, + "2e978e6bd19047f78c897106c31966ed": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HBoxModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HBoxModel", + "_view_count": null, + 
"_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HBoxView", + "box_style": "", + "children": [ + "IPY_MODEL_04886efe65104ef1ade8715a59ef1f41", + "IPY_MODEL_8144f7eb5c6e4882b0cd2eb2f968c882", + "IPY_MODEL_6c316019bb8a4097b9a2a88dda964bc7" + ], + "layout": "IPY_MODEL_91f16706c6134d6a8ef4fa07aa38ad83" + } + }, + "30635c8d36244136b2c9485fce6a3675": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HTMLModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HTMLModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HTMLView", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_9b2695098c88496aa33967a2b76b5578", + "placeholder": "​", + "style": "IPY_MODEL_1f100078ce8444478d0c60eeefbff273", + "value": "Downloading data: 100%" + } + }, + "31f2a68fc9194a5b80ede2bef35e2bd7": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + 
"min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "33c598afbc834c938618d85007f8b36c": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HTMLModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HTMLModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HTMLView", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_aa368056a83d44819674ac55b679fad1", + "placeholder": "​", + "style": "IPY_MODEL_be594efe148c496e92175408918e9eff", + "value": " 232k/232k [00:00<00:00, 5.09MB/s]" + } + }, + "35e06f0a72bb456c81b846a9eae1b91f": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "FloatProgressModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "FloatProgressModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "ProgressView", + "bar_style": "success", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_8cdb6d85160747a1a451322f2e60cbc9", + "max": 2166, + "min": 0, + "orientation": "horizontal", + "style": "IPY_MODEL_b6cfacc547d4499a9e887b6314a0e00f", + "value": 2166 + } + }, + "3718d0ead5f34ff0a06f1eade292ebe8": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HBoxModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HBoxModel", + "_view_count": 
null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HBoxView", + "box_style": "", + "children": [ + "IPY_MODEL_785a0f13d36a4d268908f9b0af68cebe", + "IPY_MODEL_90fe6e4229c148da8066adbd40b0beab", + "IPY_MODEL_208ac9d586804957a001b2d134141b78" + ], + "layout": "IPY_MODEL_bf798174994947b5b707b1f119dae0d7" + } + }, + "3763a876e5184a4da9b7816839d0b523": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HTMLModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HTMLModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HTMLView", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_4921d93aa19a4b9f9664463b743aa5e4", + "placeholder": "​", + "style": "IPY_MODEL_b3442e661da6452ba7cff46bf9c01ef4", + "value": "Downloading readme: 100%" + } + }, + "3882b1ea5f0f494f924e07f44e9e4386": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + 
"max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "3bd4ebc1706640eea35d8c1c56541288": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "3ebc999924bd46dea7f8e21677edbf07": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + 
"align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "3f30534555e34458983889694fda3671": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + 
"visibility": null, + "width": null + } + }, + "41e09c4061784455a67eb0b7cf2d31fc": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "DescriptionStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "DescriptionStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "description_width": "" + } + }, + "421e40ff53d244c5979d762593f8bccb": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "ProgressStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "ProgressStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "bar_color": null, + "description_width": "" + } + }, + "440c6c8b303343c8a8cf53d4d9768a6e": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + 
"min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "4456b5fe19474e7cb4cfe70679d6005f": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "45b6355d8db24edbb5614c2f1d944619": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "FloatProgressModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "FloatProgressModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "ProgressView", + "bar_style": "success", + "description": "", + 
"description_tooltip": null, + "layout": "IPY_MODEL_3ebc999924bd46dea7f8e21677edbf07", + "max": 7590, + "min": 0, + "orientation": "horizontal", + "style": "IPY_MODEL_4611ce6627b640d8b6cfc95242839b3c", + "value": 7590 + } + }, + "4611ce6627b640d8b6cfc95242839b3c": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "ProgressStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "ProgressStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "bar_color": null, + "description_width": "" + } + }, + "47caa679e4c34ccc811103714a35ddd3": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "DescriptionStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "DescriptionStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "description_width": "" + } + }, + "484d6ef0cb9548c495b7f89adddb8799": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": 
null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "4921d93aa19a4b9f9664463b743aa5e4": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "49b76e2c8d4c484d9e8e78f55dd34f2b": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "FloatProgressModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "FloatProgressModel", + 
"_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "ProgressView", + "bar_style": "success", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_5c8ad99ebaa347f79905a557c9212381", + "max": 28, + "min": 0, + "orientation": "horizontal", + "style": "IPY_MODEL_4b4dad9da390409d81fed270278bf4c7", + "value": 28 + } + }, + "49bb38a9f972468a96fdd539b0b5a6d6": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "4b4dad9da390409d81fed270278bf4c7": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "ProgressStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "ProgressStyleModel", + "_view_count": null, + "_view_module": 
"@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "bar_color": null, + "description_width": "" + } + }, + "4bd988dbbb9a4c4195922684d1fd2159": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "4c9d20de836b49aa87724addef5bd550": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HTMLModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HTMLModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HTMLView", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_d0e1c56bccd64f73831bce780c53ec90", + "placeholder": "​", + "style": 
"IPY_MODEL_265e2d76584849479b9e936ed9ce8654", + "value": " 84.1M/84.1M [00:05<00:00, 22.9MB/s]" + } + }, + "4ce641041f28400c9b3ba584cef00e1a": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "ProgressStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "ProgressStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "bar_color": null, + "description_width": "" + } + }, + "51f764124bbd426b856b7a4b5de965e0": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "5716068861814d4584a43121242952b1": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + 
"_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "593c8bc430454eb284169a0686fe6bfb": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HTMLModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HTMLModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HTMLView", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_c960d9b2bbe94e35afefc7efd9f416ba", + "placeholder": "​", + "style": "IPY_MODEL_47caa679e4c34ccc811103714a35ddd3", + "value": "Generating train split: 100%" + } + }, + "59753f8944b9414d927c7e2eea445b51": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HBoxModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + 
"_model_module_version": "1.5.0", + "_model_name": "HBoxModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HBoxView", + "box_style": "", + "children": [ + "IPY_MODEL_8ffc2b1fb14449d1bb03a3b5bed928bc", + "IPY_MODEL_e75a36f8a97847c4a52c6f139ab660d0", + "IPY_MODEL_a1cc0e10853c47a1901c823ce8d008de" + ], + "layout": "IPY_MODEL_ec96e2943e3c4907874f391aafe78628" + } + }, + "5a426c228731404bbcf202a6c103c14b": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "5b0a2ab061cc402ebf12db7d4dceb7dd": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + 
"_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "5c84ec53402147fcab06e7078b272746": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + 
"object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "5c8ad99ebaa347f79905a557c9212381": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "5ed1e548a4df4a76a8e2e0c2e49b8d2c": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + 
"flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "5ed3e59771104d849dff50eeb6c8b519": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "FloatProgressModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "FloatProgressModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "ProgressView", + "bar_style": "success", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_c2c53496e66048adb1dbb426da8c0bed", + "max": 84125825, + "min": 0, + "orientation": "horizontal", + "style": "IPY_MODEL_043d81856e5746708162fa6fbbf4d3c4", + "value": 84125825 + } + }, + "658c4a4bdb35461d8f06a5cdfdccb877": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "DescriptionStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "DescriptionStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "description_width": "" + } + }, + "65d02c0924da4eca835fa9cb89c7d7dd": { + "model_module": "@jupyter-widgets/controls", + 
"model_module_version": "1.5.0", + "model_name": "HTMLModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HTMLModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HTMLView", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_e7c48e205424478a8f015c77902fb134", + "placeholder": "​", + "style": "IPY_MODEL_d97d6cbf86544494885c3d347febc852", + "value": " 28.0/28.0 [00:00<00:00, 2.16kB/s]" + } + }, + "6c316019bb8a4097b9a2a88dda964bc7": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HTMLModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HTMLModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HTMLView", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_b3fe91b4f396478eb71c1a3e6fbd0d6f", + "placeholder": "​", + "style": "IPY_MODEL_22deb7749a3940048448989ef5b91ab4", + "value": " 570/570 [00:00<00:00, 39.3kB/s]" + } + }, + "6cecb165f14441ecae43cc59a0a0b56f": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HTMLModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HTMLModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HTMLView", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_3f30534555e34458983889694fda3671", + "placeholder": "​", + "style": "IPY_MODEL_e0924770d76a49b499f3e08b9f24202a", + "value": " 4.31k/4.31k [00:00<00:00, 334kB/s]" + } + }, + "6d44d06737f64b85a0aa9efc813bcf61": { + 
"model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HBoxModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HBoxModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HBoxView", + "box_style": "", + "children": [ + "IPY_MODEL_3763a876e5184a4da9b7816839d0b523", + "IPY_MODEL_45b6355d8db24edbb5614c2f1d944619", + "IPY_MODEL_cd4d230cf56843afa1eab8939ba8c5df" + ], + "layout": "IPY_MODEL_d9ea2f0150654e60bd4cbf8b6d8c5fcf" + } + }, + "6db1f4ec80fb4fa7a5f77c2909b3c755": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HTMLModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HTMLModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HTMLView", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_028376ac680b43c1a294407e9ab6a5af", + "placeholder": "​", + "style": "IPY_MODEL_fadcdd015f6541fbaadf975dd13cd95f", + "value": "tokenizer_config.json: 100%" + } + }, + "701c8d36283248f4a0df3dd72941e603": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": 
null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "785a0f13d36a4d268908f9b0af68cebe": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HTMLModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HTMLModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HTMLView", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_5c84ec53402147fcab06e7078b272746", + "placeholder": "​", + "style": "IPY_MODEL_0ffe58cec1f549438742975bf4154316", + "value": "Map: 100%" + } + }, + "79fd37e8fe454e9a83dba65f31c8cb77": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HBoxModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HBoxModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HBoxView", + "box_style": "", + "children": [ + "IPY_MODEL_593c8bc430454eb284169a0686fe6bfb", + "IPY_MODEL_c436532599d04f3b9f53d366a6f59b20", + "IPY_MODEL_1e2fd13373574364b05e8c008e44abff" + ], + "layout": "IPY_MODEL_913a768ab4d54c4eb0d963bdeb67fe06" + } + }, + "8144f7eb5c6e4882b0cd2eb2f968c882": { + "model_module": "@jupyter-widgets/controls", + 
"model_module_version": "1.5.0", + "model_name": "FloatProgressModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "FloatProgressModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "ProgressView", + "bar_style": "success", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_d868c8883fee4bfc9b589da2fc7275fd", + "max": 570, + "min": 0, + "orientation": "horizontal", + "style": "IPY_MODEL_b36a094e04ab41e59e61a0614b1c7b26", + "value": 570 + } + }, + "8406d74b98094a32b2f1028f6e317ff4": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "ProgressStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "ProgressStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "bar_color": null, + "description_width": "" + } + }, + "8aee44326aae4c3b842277f55b739250": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "FloatProgressModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "FloatProgressModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "ProgressView", + "bar_style": "success", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_51f764124bbd426b856b7a4b5de965e0", + "max": 25000, + "min": 0, + "orientation": "horizontal", + "style": "IPY_MODEL_2332277384d040cd8a4a62213dd20cda", + "value": 25000 + } + }, + "8cdb6d85160747a1a451322f2e60cbc9": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": 
"LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "8d0f1ffbf93b45d5950426321b82b785": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": 
null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "8ffc2b1fb14449d1bb03a3b5bed928bc": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HTMLModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HTMLModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HTMLView", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_701c8d36283248f4a0df3dd72941e603", + "placeholder": "​", + "style": "IPY_MODEL_d84c5dc84e7e440ca66250a705fe4a16", + "value": "model.safetensors: 100%" + } + }, + "907194999ce6489c9ea9ebc969c7c77e": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + 
"min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "90737d28781c47909746ac3fd524769a": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HTMLModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HTMLModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HTMLView", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_440c6c8b303343c8a8cf53d4d9768a6e", + "placeholder": "​", + "style": "IPY_MODEL_93577d92cf684871b5026252e8123c84", + "value": "Downloading builder script: 100%" + } + }, + "90fe6e4229c148da8066adbd40b0beab": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "FloatProgressModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "FloatProgressModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "ProgressView", + "bar_style": "success", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_dafcd2ddd2fa481d8c173d015446640c", + "max": 25, + "min": 0, + "orientation": "horizontal", + "style": "IPY_MODEL_df1252165b6041209d072390de947697", + "value": 25 + } + }, + "913a768ab4d54c4eb0d963bdeb67fe06": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": 
"@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "91f16706c6134d6a8ef4fa07aa38ad83": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + 
"order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "93577d92cf684871b5026252e8123c84": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "DescriptionStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "DescriptionStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "description_width": "" + } + }, + "9480c6601cd34acc84d1029e3fd36156": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "9613e943891c428bbf3f8847d9c61a31": { + "model_module": "@jupyter-widgets/controls", + 
"model_module_version": "1.5.0", + "model_name": "DescriptionStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "DescriptionStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "description_width": "" + } + }, + "973ecb3108c84ca89b28f2597a8ff474": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "DescriptionStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "DescriptionStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "description_width": "" + } + }, + "98b6d8654f6a42559c413de778557e33": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "DescriptionStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "DescriptionStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "description_width": "" + } + }, + "9963664b1a2b4137846494d225695932": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "DescriptionStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "DescriptionStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "description_width": "" + } + }, + "9b2695098c88496aa33967a2b76b5578": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + 
"_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "9ecc27897dba4231bd6b82031a1cc3db": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + 
"max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "9f1b265bba014211a8eae834cb4727bd": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "a1cc0e10853c47a1901c823ce8d008de": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HTMLModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HTMLModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HTMLView", + "description": "", + 
"description_tooltip": null, + "layout": "IPY_MODEL_4456b5fe19474e7cb4cfe70679d6005f", + "placeholder": "​", + "style": "IPY_MODEL_9613e943891c428bbf3f8847d9c61a31", + "value": " 440M/440M [00:01<00:00, 295MB/s]" + } + }, + "a3aac8ab3e1440388759e76e0b8d400b": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HBoxModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HBoxModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HBoxView", + "box_style": "", + "children": [ + "IPY_MODEL_fb68bc3a68c74db09891f09a797fad4d", + "IPY_MODEL_fd3c248bc0d64c1b9852f47d4ee7eb17", + "IPY_MODEL_33c598afbc834c938618d85007f8b36c" + ], + "layout": "IPY_MODEL_2cc56448beb44502ba5cbc7f7b5ae057" + } + }, + "a4e85dc8d4ab42d190d09afbab4d9b67": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "FloatProgressModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "FloatProgressModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "ProgressView", + "bar_style": "success", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_02c486f6d14549068dc96d68618a0ec6", + "max": 466062, + "min": 0, + "orientation": "horizontal", + "style": "IPY_MODEL_b8d5164256ac40bb8633f8317aab279a", + "value": 466062 + } + }, + "a9dfcd4f75384d76baac18ab4d687ac5": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "DescriptionStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "DescriptionStyleModel", + "_view_count": null, + "_view_module": 
"@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "description_width": "" + } + }, + "aa368056a83d44819674ac55b679fad1": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "aaf2bee8382c445b9f02631e1a46adda": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HBoxModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HBoxModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HBoxView", + "box_style": "", + "children": [ + "IPY_MODEL_f61ca1a64d294128a06f88afee1ca0e1", + "IPY_MODEL_35e06f0a72bb456c81b846a9eae1b91f", + "IPY_MODEL_ace61e4ac7cd4f64ba93aa173ce87287" + ], + "layout": 
"IPY_MODEL_8d0f1ffbf93b45d5950426321b82b785" + } + }, + "ace61e4ac7cd4f64ba93aa173ce87287": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HTMLModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HTMLModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HTMLView", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_e095bb6ae9514753a6f6e90d72028857", + "placeholder": "​", + "style": "IPY_MODEL_41e09c4061784455a67eb0b7cf2d31fc", + "value": " 2.17k/2.17k [00:00<00:00, 170kB/s]" + } + }, + "ae80685a02474640ac6325cb61206118": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HTMLModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HTMLModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HTMLView", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_f55a39438d33482faa837412c3eedb28", + "placeholder": "​", + "style": "IPY_MODEL_e9d0dab76a69477d8cef7613fd8e25b4", + "value": "Generating unsupervised split: 100%" + } + }, + "af95713290f34a4d83088951fcaa8d38": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": 
null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "b262b3042a6a4c21a990fe49a8d2c4ef": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HTMLModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HTMLModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HTMLView", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_31f2a68fc9194a5b80ede2bef35e2bd7", + "placeholder": "​", + "style": "IPY_MODEL_973ecb3108c84ca89b28f2597a8ff474", + "value": " 466k/466k [00:00<00:00, 23.9MB/s]" + } + }, + "b2ab93acd0104da0a8b4b2ab18a51e22": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + 
"grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "b3442e661da6452ba7cff46bf9c01ef4": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "DescriptionStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "DescriptionStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "description_width": "" + } + }, + "b36a094e04ab41e59e61a0614b1c7b26": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "ProgressStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "ProgressStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "bar_color": null, + "description_width": "" + } + }, + "b3fe91b4f396478eb71c1a3e6fbd0d6f": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": 
null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "b66e3b7156ee4bd295243141c20019bc": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "FloatProgressModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "FloatProgressModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "ProgressView", + "bar_style": "success", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_0c91977ea08a4efb91280ff178e4850b", + "max": 4314, + "min": 0, + "orientation": "horizontal", + "style": "IPY_MODEL_8406d74b98094a32b2f1028f6e317ff4", + "value": 4314 + } + }, + "b6cfacc547d4499a9e887b6314a0e00f": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "ProgressStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "ProgressStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "bar_color": null, + "description_width": 
"" + } + }, + "b8d5164256ac40bb8633f8317aab279a": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "ProgressStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "ProgressStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "bar_color": null, + "description_width": "" + } + }, + "bca602866a86460ca5cc99c5d1a5a415": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HBoxModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HBoxModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HBoxView", + "box_style": "", + "children": [ + "IPY_MODEL_90737d28781c47909746ac3fd524769a", + "IPY_MODEL_b66e3b7156ee4bd295243141c20019bc", + "IPY_MODEL_6cecb165f14441ecae43cc59a0a0b56f" + ], + "layout": "IPY_MODEL_ef91b2f8818044ca8ed144dbaa963fcd" + } + }, + "be594efe148c496e92175408918e9eff": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "DescriptionStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "DescriptionStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "description_width": "" + } + }, + "bf798174994947b5b707b1f119dae0d7": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + 
"_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "c0fdf2412b154e5f8400e7ad50321f61": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HBoxModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HBoxModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HBoxView", + "box_style": "", + "children": [ + "IPY_MODEL_f6564aa137504aa6ac8752464146498e", + "IPY_MODEL_a4e85dc8d4ab42d190d09afbab4d9b67", + "IPY_MODEL_b262b3042a6a4c21a990fe49a8d2c4ef" + ], + "layout": "IPY_MODEL_5a426c228731404bbcf202a6c103c14b" + } + }, + "c2c53496e66048adb1dbb426da8c0bed": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + 
"align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "c436532599d04f3b9f53d366a6f59b20": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "FloatProgressModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "FloatProgressModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "ProgressView", + "bar_style": "success", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_5716068861814d4584a43121242952b1", + "max": 25000, + "min": 0, + "orientation": "horizontal", + "style": "IPY_MODEL_1e2cd1e45e4b4032943aec565dbe6cfe", + "value": 25000 + } + }, + "c7d2e34de0714971984e8a4e72c85f16": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, 
+ "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "c960d9b2bbe94e35afefc7efd9f416ba": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + 
"top": null, + "visibility": null, + "width": null + } + }, + "cd4d230cf56843afa1eab8939ba8c5df": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HTMLModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HTMLModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HTMLView", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_3bd4ebc1706640eea35d8c1c56541288", + "placeholder": "​", + "style": "IPY_MODEL_2b5e9cecebb54328a3020a80ff0d9a9a", + "value": " 7.59k/7.59k [00:00<00:00, 608kB/s]" + } + }, + "d0e1c56bccd64f73831bce780c53ec90": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + 
"d84c5dc84e7e440ca66250a705fe4a16": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "DescriptionStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "DescriptionStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "description_width": "" + } + }, + "d868c8883fee4bfc9b589da2fc7275fd": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "d97d6cbf86544494885c3d347febc852": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "DescriptionStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": 
"DescriptionStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "description_width": "" + } + }, + "d9ea2f0150654e60bd4cbf8b6d8c5fcf": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "dafcd2ddd2fa481d8c173d015446640c": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": 
null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "df1252165b6041209d072390de947697": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "ProgressStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "ProgressStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "bar_color": null, + "description_width": "" + } + }, + "e0924770d76a49b499f3e08b9f24202a": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "DescriptionStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "DescriptionStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "description_width": "" + } + }, + "e095bb6ae9514753a6f6e90d72028857": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": 
"1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "e2dda125709548aabdb66af2524b49ea": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "FloatProgressModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "FloatProgressModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "ProgressView", + "bar_style": "success", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_29a17d1315ff46caa87c791d87494f07", + "max": 50000, + "min": 0, + "orientation": "horizontal", + "style": "IPY_MODEL_421e40ff53d244c5979d762593f8bccb", + "value": 50000 + } + }, + "e75a36f8a97847c4a52c6f139ab660d0": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "FloatProgressModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "FloatProgressModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", 
+ "_view_module_version": "1.5.0", + "_view_name": "ProgressView", + "bar_style": "success", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_5b0a2ab061cc402ebf12db7d4dceb7dd", + "max": 440449768, + "min": 0, + "orientation": "horizontal", + "style": "IPY_MODEL_f7d45e8dd0eb4146af7d2ccbea4c4edd", + "value": 440449768 + } + }, + "e7c48e205424478a8f015c77902fb134": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "e9d0dab76a69477d8cef7613fd8e25b4": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "DescriptionStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "DescriptionStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", 
+ "_view_name": "StyleView", + "description_width": "" + } + }, + "ec96e2943e3c4907874f391aafe78628": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "ef91b2f8818044ca8ed144dbaa963fcd": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, 
+ "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "f19e578ca2284ed6a7de268b5175429f": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HTMLModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HTMLModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HTMLView", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_5ed1e548a4df4a76a8e2e0c2e49b8d2c", + "placeholder": "​", + "style": "IPY_MODEL_9963664b1a2b4137846494d225695932", + "value": "Generating test split: 100%" + } + }, + "f3298d68cecf45fe89b3939d89853594": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HBoxModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HBoxModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HBoxView", + "box_style": "", + "children": [ + "IPY_MODEL_f19e578ca2284ed6a7de268b5175429f", + "IPY_MODEL_8aee44326aae4c3b842277f55b739250", + "IPY_MODEL_1de22c20d44843aaa82ab635f1dd3f9c" + ], + "layout": "IPY_MODEL_4bd988dbbb9a4c4195922684d1fd2159" + } + }, + "f49c8b3f47ff4488bb611ab129609a06": { + "model_module": "@jupyter-widgets/controls", + 
"model_module_version": "1.5.0", + "model_name": "DescriptionStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "DescriptionStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "description_width": "" + } + }, + "f55a39438d33482faa837412c3eedb28": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "f61ca1a64d294128a06f88afee1ca0e1": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HTMLModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HTMLModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + 
"_view_module_version": "1.5.0", + "_view_name": "HTMLView", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_9ecc27897dba4231bd6b82031a1cc3db", + "placeholder": "​", + "style": "IPY_MODEL_2e69383236104fb38191d990e3a54c81", + "value": "Downloading metadata: 100%" + } + }, + "f6564aa137504aa6ac8752464146498e": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HTMLModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HTMLModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HTMLView", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_af95713290f34a4d83088951fcaa8d38", + "placeholder": "​", + "style": "IPY_MODEL_658c4a4bdb35461d8f06a5cdfdccb877", + "value": "tokenizer.json: 100%" + } + }, + "f7d45e8dd0eb4146af7d2ccbea4c4edd": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "ProgressStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "ProgressStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "bar_color": null, + "description_width": "" + } + }, + "fadcdd015f6541fbaadf975dd13cd95f": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "DescriptionStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "DescriptionStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "description_width": "" + } + }, + "fb68bc3a68c74db09891f09a797fad4d": { + "model_module": "@jupyter-widgets/controls", 
+ "model_module_version": "1.5.0", + "model_name": "HTMLModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HTMLModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HTMLView", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_484d6ef0cb9548c495b7f89adddb8799", + "placeholder": "​", + "style": "IPY_MODEL_f49c8b3f47ff4488bb611ab129609a06", + "value": "vocab.txt: 100%" + } + }, + "fd3c248bc0d64c1b9852f47d4ee7eb17": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "FloatProgressModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "FloatProgressModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "ProgressView", + "bar_style": "success", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_49bb38a9f972468a96fdd539b0b5a6d6", + "max": 231508, + "min": 0, + "orientation": "horizontal", + "style": "IPY_MODEL_4ce641041f28400c9b3ba584cef00e1a", + "value": 231508 + } + } + } + } + }, + "nbformat": 4, + "nbformat_minor": 0 +} diff --git a/demo/tutorials/misc/PerformanceTest_Notebook.ipynb b/demo/tutorials/misc/PerformanceTest_Notebook.ipynb index 1a90d71ee..0a16c6de6 100644 --- a/demo/tutorials/misc/PerformanceTest_Notebook.ipynb +++ b/demo/tutorials/misc/PerformanceTest_Notebook.ipynb @@ -1 +1 @@ -{"cells":[{"cell_type":"markdown","metadata":{"id":"e7PsSmy9sCoR"},"source":["![image.png]()"]},{"cell_type":"markdown","metadata":{"id":"3o5sAOfwL5qd"},"source":["[![Open In 
Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/langtest/blob/main/demo/tutorials/misc/PerformanceTest_Notebook.ipynb)"]},{"cell_type":"markdown","metadata":{"id":"WJJzt3RWhEc6"},"source":["**LangTest** is an open-source python library designed to help developers deliver safe and effective Natural Language Processing (NLP) models. Whether you are using **John Snow Labs, Hugging Face, Spacy** models or **OpenAI, Cohere, AI21, Hugging Face Inference API and Azure-OpenAI** based LLMs, it has got you covered. You can test any Named Entity Recognition (NER), Text Classification model using the library. We also support testing LLMS for Question-Answering and Summarization tasks on benchmark datasets. The library supports 50+ out of the box tests. These tests fall into robustness, accuracy, bias, representation, toxicity and fairness test categories.\n","\n","Metrics are calculated by comparing the model's extractions in the original list of sentences against the extractions carried out in the noisy list of sentences. The original annotated labels are not used at any point, we are simply comparing the model against itself in a 2 settings."]},{"cell_type":"markdown","metadata":{"id":"26qXWhCYhHAt"},"source":["# Getting started with LangTest on John Snow Labs"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"azUb114QhOsY"},"outputs":[],"source":["!pip install langtest[transformers]"]},{"cell_type":"markdown","metadata":{"id":"yR6kjOaiheKN"},"source":["# Harness and Its Parameters\n","\n","The Harness class is a testing class for Natural Language Processing (NLP) models. 
It evaluates the performance of a NLP model on a given task using test data and generates a report with test results.Harness can be imported from the LangTest library in the following way."]},{"cell_type":"code","execution_count":2,"metadata":{"id":"lTzSJpMlhgq5","executionInfo":{"status":"ok","timestamp":1692343745209,"user_tz":-330,"elapsed":925,"user":{"displayName":"Prikshit sharma","userId":"07819241395213139913"}}},"outputs":[],"source":["#Import Harness from the LangTest library\n","from langtest import Harness"]},{"cell_type":"markdown","metadata":{"id":"JFhJ9CcbsKqN"},"source":["# Performance Testing\n","\n","In this section, we dive into testing of time taken to complete the tests in LangTest on the datasets with Models."]},{"cell_type":"markdown","metadata":{"id":"swaYPW-wPlku"},"source":["### Setup and Configure Harness"]},{"cell_type":"code","execution_count":3,"metadata":{"colab":{"base_uri":"https://localhost:8080/","height":920,"referenced_widgets":["6c5a9f6544e0442ca68098426c146503","e7f4a1278a7e49aba2ed734c228d0c66","0fe42dba7c4b4df2a64ea2002be642cf","9a06564cc5254e89a762426cf3269a9e","3bb1a7ff75c1490db0334e5162aba497","4790617c8b18415d9cafaceabca7022d","2ffcf6981af34b139f304200934abaee","d2d79f3cb2d8444fa36a6851a208e9f4","6b905fe4f7f542c9890f5cfd195beda3","b4d092df96e04c00a943f3c8e3329c39","966ad3248efb4f29a20f19b06c7dfa77","13f8de2b99ff475fbabbfba66b17125e","3f65fa34feac4f00a89cf74bdb8f5a59","d35f7095c5be4de19b625a00a0ea1798","5f050ded4d6d41509dbad6f17284c18c","58ee9406c08144de989c5a26ed5a1ccb","cee5a9b496cf435b9e746424187cad08","36e4580f19164f5d93280c4a06f0879c","1e99e26f69034ec79f52b512a608c4f7","c749fd792c914727b6ff4386b316df57","e24155477021432cb45793c0743eea1b","6fc068b143fc4e3391dd755e8262fcfd","5c12f56844e546c3aaadc192b2583077","c96e0761975e4290be8f4b287e3f6f42","81e8c8d107034c85aa95252a3838b05e","771f704de66e4d0eab0a2cb71dd24d2f","6cf03247c6374a6b89aaffee79998285","501dd4b2d09b4ae993a5bc2f18769ac4","3cbb3222ae2f4bb7b3dfd1a8c54a3503","edc6
2d1193fc44e98784da4b1a3fa390","5dce3c96154c4dbba25a57052804c82b","d54d2ab9930a447388f0fea290bef2ac","3ce8b14f48a24349b793b75d9350dc95","3f1c56e797cf43588bad099a0783e179","85ff59f9f7ed47b5ab78f767255f5a56","a12cdebdcf8b476f80b906d97a9ea261","a404d4d49f2046bd85ea64cc2de4a734","265a066b8e564034af96902a1e0347fb","d136b83a79034d20971be55510737103","572b68362a1a4292a43d44ebad043042","db4fc546e2d344018f85cd3deeab1115","ef8256e1afce4f268db5bc38a3a7fa86","afa59a1ecaba4b56a596f4cdbd7b6730","3d6af6f687c54a5eb0db38b2c2ef1899","d2906819c5ec4c82b2731eac4afe519d","1db2003c9e124e4f8b6444c157636983","eaae6733b12047ac9edec3adff0ab765","83aa131ac111451495e97bd710631418","f818144e1d304afd8b15900620abfc1d","314d5fcd2f864d8f941f90d76bd0df1b","b3f1bdcee72a47a791c6eaec72fdf136","06797e19a13d40df8c50322bf4b52f90","6689169d7c9447a1bde80313e6e9a7c2","38476374aa9c49f68dcf96e55e520240","6c506299e96344798ac6e36820e275bf","80393d9f400a4e9c8867808c5f2e8b28","eaff3fe3471b4815ab3d27d72142fe22","030f52c161444051b7215c4cf1b4eb27","d2726e7c8ebc4d0c9dbaf7d919bd064b","012930dde07a4a56af962c6993ecbf03","ab2b92ec8670443a9093c015c6084e95","5aee88b7efe54ce2b646b353cc61b26f","f80fde8b46ba4c7fad2f867f2439ef83","1152e3b558814204a94a058f0d506d20","6ee8c33165a946b8a15516a89203f396","9982d7d8bc634c77838aa10ccead8428"]},"id":"JaarBdfe8DQ8","outputId":"9bc19b92-c518-4bcf-f7e6-16f419898566","executionInfo":{"status":"ok","timestamp":1692343769123,"user_tz":-330,"elapsed":23923,"user":{"displayName":"Prikshit sharma","userId":"07819241395213139913"}}},"outputs":[{"output_type":"display_data","data":{"text/plain":["Downloading (…)lve/main/config.json: 0%| | 0.00/829 [00:00\n","
\n","\n","\n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n","
| | category | test_type | original | test_case |
|---|---|---|---|---|
| 0 | robustness | lowercase | SOCCER - JAPAN GET LUCKY WIN , CHINA IN SURPRI... | soccer - japan get lucky win , china in surpri... |
| 1 | robustness | lowercase | Nadim Ladki | nadim ladki |
| 2 | robustness | lowercase | AL-AIN , United Arab Emirates 1996-12-06 | al-ain , united arab emirates 1996-12-06 |
| 3 | robustness | lowercase | Japan began the defence of their Asian Cup tit... | japan began the defence of their asian cup tit... |
| 4 | robustness | lowercase | But China saw their luck desert them in the se... | but china saw their luck desert them in the se... |
| ... | ... | ... | ... | ... |
| 448 | robustness | uppercase | CRICKET - LARA ENDURES ANOTHER MISERABLE DAY . | CRICKET - LARA ENDURES ANOTHER MISERABLE DAY . |
| 449 | robustness | uppercase | Robert Galvin | ROBERT GALVIN |
| 450 | robustness | uppercase | MELBOURNE 1996-12-06 | MELBOURNE 1996-12-06 |
| 451 | robustness | uppercase | Australia gave Brian Lara another reason to be... | AUSTRALIA GAVE BRIAN LARA ANOTHER REASON TO BE... |
| 452 | performance | speed | - | - |
\n","

453 rows × 4 columns

\n","
\n","
\n","\n","
\n"," \n","\n"," \n","\n"," \n","
\n","\n","\n","
\n"," \n","\n","\n","\n"," \n","
\n","
\n"," \n"]},"metadata":{},"execution_count":7}],"source":["harness.testcases()"]},{"cell_type":"markdown","metadata":{"id":"NOJ8BAU2GGzd"},"source":["harness.testcases() method displays the produced test cases in form of a pandas data frame."]},{"cell_type":"markdown","metadata":{"id":"3CwhQw6hGR9S"},"source":["### Running the tests"]},{"cell_type":"code","execution_count":8,"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"aguX6-aFGOnP","outputId":"66be230c-84f5-4521-a3c5-fb57f91d131a","executionInfo":{"status":"ok","timestamp":1692343967668,"user_tz":-330,"elapsed":163600,"user":{"displayName":"Prikshit sharma","userId":"07819241395213139913"}}},"outputs":[{"output_type":"stream","name":"stderr","text":["Running testcases... : 100%|██████████| 453/453 [02:43<00:00, 2.77it/s]\n"]},{"output_type":"execute_result","data":{"text/plain":[]},"metadata":{},"execution_count":8}],"source":["harness.run()"]},{"cell_type":"markdown","metadata":{"id":"191O2oaUGWrH"},"source":["Called after harness.generate() and is to used to run all the tests. Returns a pass/fail flag for each test."]},{"cell_type":"code","execution_count":9,"metadata":{"colab":{"base_uri":"https://localhost:8080/","height":527},"id":"XDbd1mpREWR5","outputId":"0375fbee-3ab7-4dca-f10d-eb6c36e23407","executionInfo":{"status":"ok","timestamp":1692343967670,"user_tz":-330,"elapsed":33,"user":{"displayName":"Prikshit sharma","userId":"07819241395213139913"}}},"outputs":[{"output_type":"execute_result","data":{"text/plain":[" category test_type \\\n","0 robustness lowercase \n","1 robustness lowercase \n","2 robustness lowercase \n","3 robustness lowercase \n","4 robustness lowercase \n",".. ... ... \n","448 robustness uppercase \n","449 robustness uppercase \n","450 robustness uppercase \n","451 robustness uppercase \n","452 performance speed \n","\n"," original \\\n","0 SOCCER - JAPAN GET LUCKY WIN , CHINA IN SURPRI... 
\n","1 Nadim Ladki \n","2 AL-AIN , United Arab Emirates 1996-12-06 \n","3 Japan began the defence of their Asian Cup tit... \n","4 But China saw their luck desert them in the se... \n",".. ... \n","448 CRICKET - LARA ENDURES ANOTHER MISERABLE DAY . \n","449 Robert Galvin \n","450 MELBOURNE 1996-12-06 \n","451 Australia gave Brian Lara another reason to be... \n","452 - \n","\n"," test_case \\\n","0 soccer - japan get lucky win , china in surpri... \n","1 nadim ladki \n","2 al-ain , united arab emirates 1996-12-06 \n","3 japan began the defence of their asian cup tit... \n","4 but china saw their luck desert them in the se... \n",".. ... \n","448 CRICKET - LARA ENDURES ANOTHER MISERABLE DAY . \n","449 ROBERT GALVIN \n","450 MELBOURNE 1996-12-06 \n","451 AUSTRALIA GAVE BRIAN LARA ANOTHER REASON TO BE... \n","452 - \n","\n"," expected_result \\\n","0 JAPAN: MISC, LUCKY: PER, CHINA: ORG \n","1 Nadim Ladki: PER \n","2 AL-AIN: LOC, United Arab Emirates: LOC \n","3 Japan: LOC, Asian Cup: MISC, Syria: LOC, Group... \n","4 China: LOC, Uzbekistan: LOC \n",".. ... \n","448 LARA: LOC, MISERABLE: PER \n","449 Robert Galvin: PER \n","450 MELBOURNE: LOC \n","451 Australia: LOC, Brian Lara: PER, West Indies: ... \n","452 100 token/sec \n","\n"," actual_result pass \n","0 False \n","1 False \n","2 al-ain: LOC False \n","3 japan: ORG, syria: ORG False \n","4 uzbekistan: LOC False \n",".. ... ... \n","448 LARA: LOC, MISERABLE: PER True \n","449 ROBERT: ORG, GALVIN: PER False \n","450 MELBOURNE: LOC True \n","451 AUSTRALIA: LOC, BRIAN LARA: LOC, REASON: PER, ... False \n","452 19.20 token/sec True \n","\n","[453 rows x 7 columns]"],"text/html":["\n","
\n","
\n","\n","\n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n","
| | category | test_type | original | test_case | expected_result | actual_result | pass |
|---|---|---|---|---|---|---|---|
| 0 | robustness | lowercase | SOCCER - JAPAN GET LUCKY WIN , CHINA IN SURPRI... | soccer - japan get lucky win , china in surpri... | JAPAN: MISC, LUCKY: PER, CHINA: ORG | | False |
| 1 | robustness | lowercase | Nadim Ladki | nadim ladki | Nadim Ladki: PER | | False |
| 2 | robustness | lowercase | AL-AIN , United Arab Emirates 1996-12-06 | al-ain , united arab emirates 1996-12-06 | AL-AIN: LOC, United Arab Emirates: LOC | al-ain: LOC | False |
| 3 | robustness | lowercase | Japan began the defence of their Asian Cup tit... | japan began the defence of their asian cup tit... | Japan: LOC, Asian Cup: MISC, Syria: LOC, Group... | japan: ORG, syria: ORG | False |
| 4 | robustness | lowercase | But China saw their luck desert them in the se... | but china saw their luck desert them in the se... | China: LOC, Uzbekistan: LOC | uzbekistan: LOC | False |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 448 | robustness | uppercase | CRICKET - LARA ENDURES ANOTHER MISERABLE DAY . | CRICKET - LARA ENDURES ANOTHER MISERABLE DAY . | LARA: LOC, MISERABLE: PER | LARA: LOC, MISERABLE: PER | True |
| 449 | robustness | uppercase | Robert Galvin | ROBERT GALVIN | Robert Galvin: PER | ROBERT: ORG, GALVIN: PER | False |
| 450 | robustness | uppercase | MELBOURNE 1996-12-06 | MELBOURNE 1996-12-06 | MELBOURNE: LOC | MELBOURNE: LOC | True |
| 451 | robustness | uppercase | Australia gave Brian Lara another reason to be... | AUSTRALIA GAVE BRIAN LARA ANOTHER REASON TO BE... | Australia: LOC, Brian Lara: PER, West Indies: ... | AUSTRALIA: LOC, BRIAN LARA: LOC, REASON: PER, ... | False |
| 452 | performance | speed | - | - | 100 token/sec | 19.20 token/sec | True |
\n","

453 rows × 7 columns

\n","
\n","
\n","\n","
\n"," \n","\n"," \n","\n"," \n","
\n","\n","\n","
\n"," \n","\n","\n","\n"," \n","
\n","
\n","
\n"]},"metadata":{},"execution_count":9}],"source":["harness.generated_results()"]},{"cell_type":"markdown","metadata":{"id":"TKB8Rsr2GZME"},"source":["This method returns the generated results in the form of a pandas dataframe, which provides a convenient and easy-to-use format for working with the test results. You can use this method to quickly identify the test cases that failed and to determine where fixes are needed."]},{"cell_type":"markdown","metadata":{"id":"PBSlpWnUU55G"},"source":["### Final Results"]},{"cell_type":"markdown","metadata":{"id":"umnEgUHM8DRA"},"source":["We can call `.report()` which summarizes the results giving information about pass and fail counts and overall test pass/fail flag.\n","\n","To get time_elapsed for each test we pass parameter `return_runtime=True` in `.report()` method. We can also select the unit for time_elapsed i.e, seconds(s), miliseconds(ms) or microseconds(us) etc."]},{"cell_type":"code","execution_count":10,"metadata":{"colab":{"base_uri":"https://localhost:8080/","height":143},"id":"gp57HcF9yxi7","outputId":"8d990f2e-6b4d-480e-e844-95ff9158e126","executionInfo":{"status":"ok","timestamp":1692343967672,"user_tz":-330,"elapsed":30,"user":{"displayName":"Prikshit sharma","userId":"07819241395213139913"}}},"outputs":[{"output_type":"execute_result","data":{"text/plain":[" category test_type fail_count pass_count pass_rate minimum_pass_rate \\\n","0 robustness lowercase 182 44 19% 66% \n","1 robustness uppercase 152 74 33% 66% \n","2 performance speed 0 1 100% 100% \n","\n"," pass \n","0 False \n","1 False \n","2 True "],"text/html":["\n","
\n","
\n","\n","\n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n","
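| | category | test_type | fail_count | pass_count | pass_rate | minimum_pass_rate | pass |
|---|---|---|---|---|---|---|---|
| 0 | robustness | lowercase | 182 | 44 | 19% | 66% | False |
| 1 | robustness | uppercase | 152 | 74 | 33% | 66% | False |
| 2 | performance | speed | 0 | 1 | 100% | 100% | True |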
\n","
\n","
\n","\n","
\n"," \n","\n"," \n","\n"," \n","
\n","\n","\n","
\n"," \n","\n","\n","\n"," \n","
\n","
\n","
\n"]},"metadata":{},"execution_count":10}],"source":["harness.report()"]},{"cell_type":"markdown","metadata":{"id":"zg-knds3tq-w"},"source":["# Multiple Models Runtime Testing"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"ElMInPJMu3QK"},"outputs":[],"source":["!pip install spacy johnsnowlabs"]},{"cell_type":"code","execution_count":11,"metadata":{"id":"TnUBvYXptq-w","executionInfo":{"status":"ok","timestamp":1692343967673,"user_tz":-330,"elapsed":28,"user":{"displayName":"Prikshit sharma","userId":"07819241395213139913"}}},"outputs":[],"source":["model_dict=[{\"model\": \"ner.dl\", \"hub\": \"johnsnowlabs\"},\n"," {\"model\": \"en_core_web_sm\", \"hub\": \"spacy\"}]"]},{"cell_type":"code","execution_count":13,"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"PmMwW5IIvGav","outputId":"d9e5c932-b286-46e7-d18c-23e23f4cba6f","executionInfo":{"status":"ok","timestamp":1692344027826,"user_tz":-330,"elapsed":1096,"user":{"displayName":"Prikshit sharma","userId":"07819241395213139913"}}},"outputs":[{"output_type":"stream","name":"stdout","text":["--2023-08-18 07:33:45-- https://github.com/JohnSnowLabs/langtest/raw/main/langtest/data/conll/sample.conll\n","Resolving github.com (github.com)... 20.27.177.113\n","Connecting to github.com (github.com)|20.27.177.113|:443... connected.\n","HTTP request sent, awaiting response... 302 Found\n","Location: https://raw.githubusercontent.com/JohnSnowLabs/langtest/main/langtest/data/conll/sample.conll [following]\n","--2023-08-18 07:33:45-- https://raw.githubusercontent.com/JohnSnowLabs/langtest/main/langtest/data/conll/sample.conll\n","Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.109.133, 185.199.108.133, ...\n","Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.\n","HTTP request sent, awaiting response... 
200 OK\n","Length: 50519 (49K) [text/plain]\n","Saving to: ‘sample.conll’\n","\n","sample.conll 100%[===================>] 49.33K --.-KB/s in 0.01s \n","\n","2023-08-18 07:33:46 (3.77 MB/s) - ‘sample.conll’ saved [50519/50519]\n","\n"]}],"source":["# Load CoNLL\n","!wget https://github.com/JohnSnowLabs/langtest/raw/main/langtest/data/conll/sample.conll"]},{"cell_type":"code","execution_count":16,"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"yey-zVICtq-w","outputId":"d94e8722-e009-4c17-85e5-9f8bcafdaf6a","executionInfo":{"status":"ok","timestamp":1692344334665,"user_tz":-330,"elapsed":233178,"user":{"displayName":"Prikshit sharma","userId":"07819241395213139913"}}},"outputs":[{"output_type":"stream","name":"stdout","text":["Warning::Spark Session already created, some configs may not take.\n","recognize_entities_dl download started this may take some time.\n","Approx size to download 159 MB\n","[OK!]\n","Test Configuration : \n"," {\n"," \"tests\": {\n"," \"defaults\": {\n"," \"min_pass_rate\": 1.0\n"," },\n"," \"robustness\": {\n"," \"add_typo\": {\n"," \"min_pass_rate\": 0.7\n"," },\n"," \"american_to_british\": {\n"," \"min_pass_rate\": 0.7\n"," }\n"," },\n"," \"accuracy\": {\n"," \"min_micro_f1_score\": {\n"," \"min_score\": 0.7\n"," }\n"," },\n"," \"bias\": {\n"," \"replace_to_female_pronouns\": {\n"," \"min_pass_rate\": 0.7\n"," },\n"," \"replace_to_low_income_country\": {\n"," \"min_pass_rate\": 0.7\n"," }\n"," },\n"," \"fairness\": {\n"," \"min_gender_f1_score\": {\n"," \"min_score\": 0.6\n"," }\n"," },\n"," \"representation\": {\n"," \"min_label_representation_count\": {\n"," \"min_count\": 50\n"," }\n"," }\n"," }\n","}\n"]}],"source":["harness = Harness(task=\"ner\", model=model_dict, 
data={\"data_source\":\"sample.conll\"})"]},{"cell_type":"code","execution_count":17,"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"JwK7oi7Etq-w","outputId":"11e0d7e4-58f0-497b-d122-07efc38f21cb","executionInfo":{"status":"ok","timestamp":1692344334668,"user_tz":-330,"elapsed":85,"user":{"displayName":"Prikshit sharma","userId":"07819241395213139913"}}},"outputs":[{"output_type":"execute_result","data":{"text/plain":["{'tests': {'defaults': {'min_pass_rate': 0.65},\n"," 'robustness': {'uppercase': {'min_pass_rate': 0.66},\n"," 'lowercase': {'min_pass_rate': 0.6}},\n"," 'performance': {'speed': {'min_pass_rate': 100, 'unit': 'tokens/sec'}}}}"]},"metadata":{},"execution_count":17}],"source":["harness.configure(\n","{\n"," 'tests': {'defaults': {'min_pass_rate': 0.65},\n"," 'robustness': {'uppercase': {'min_pass_rate': 0.66},\n"," 'lowercase': {'min_pass_rate': 0.60},\n"," },\n"," 'performance': {'speed': {'min_pass_rate': 100, 'unit': 'tokens/sec'}\n"," },\n"," }\n"," }\n"," )\n"]},{"cell_type":"code","execution_count":18,"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"vTbPwStvtq-x","outputId":"4b2cbf34-6d9e-4942-b6e5-4bccd7153326","executionInfo":{"status":"ok","timestamp":1692344465099,"user_tz":-330,"elapsed":130503,"user":{"displayName":"Prikshit sharma","userId":"07819241395213139913"}}},"outputs":[{"output_type":"stream","name":"stderr","text":["Generating testcases...: 100%|██████████| 2/2 [00:00<00:00, 8256.50it/s]\n","Generating testcases...: 100%|██████████| 2/2 [00:00<00:00, 10094.59it/s]\n","Running testcases... : 100%|██████████| 453/453 [01:30<00:00, 4.99it/s]\n","Running testcases... 
: 100%|██████████| 453/453 [00:11<00:00, 40.56it/s]\n"]},{"output_type":"execute_result","data":{"text/plain":[]},"metadata":{},"execution_count":18}],"source":["harness.generate().run()"]},{"cell_type":"code","execution_count":19,"metadata":{"colab":{"base_uri":"https://localhost:8080/","height":143},"id":"AUZUeCpLtq-x","outputId":"406dea53-1a63-4e17-9ffb-7ed7e2a4531d","executionInfo":{"status":"ok","timestamp":1692344465100,"user_tz":-330,"elapsed":63,"user":{"displayName":"Prikshit sharma","userId":"07819241395213139913"}}},"outputs":[{"output_type":"execute_result","data":{"text/plain":[""],"text/html":["\n","\n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n","
test_type | lowercase | speed | uppercase
model_name | | |
en_core_web_sm | 0.290000 | 0.500000 | 0.580000
ner.dl | 0.110000 | 1.000000 | 0.850000
\n"]},"metadata":{},"execution_count":19}],"source":["harness.report()"]}],"metadata":{"accelerator":"TPU","colab":{"machine_shape":"hm","provenance":[]},"gpuClass":"standard","kernelspec":{"display_name":"Python 3","name":"python3"},"language_info":{"codemirror_mode":{"name":"ipython","version":3},"file_extension":".py","mimetype":"text/x-python","name":"python","nbconvert_exporter":"python","pygments_lexer":"ipython3","version":"3.8.9"},"widgets":{"application/vnd.jupyter.widget-state+json":{"6c5a9f6544e0442ca68098426c146503":{"model_module":"@jupyter-widgets/controls","model_name":"HBoxModel","model_module_version":"1.5.0","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HBoxModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HBoxView","box_style":"","children":["IPY_MODEL_e7f4a1278a7e49aba2ed734c228d0c66","IPY_MODEL_0fe42dba7c4b4df2a64ea2002be642cf","IPY_MODEL_9a06564cc5254e89a762426cf3269a9e"],"layout":"IPY_MODEL_3bb1a7ff75c1490db0334e5162aba497"}},"e7f4a1278a7e49aba2ed734c228d0c66":{"model_module":"@jupyter-widgets/controls","model_name":"HTMLModel","model_module_version":"1.5.0","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HTMLModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HTMLView","description":"","description_tooltip":null,"layout":"IPY_MODEL_4790617c8b18415d9cafaceabca7022d","placeholder":"​","style":"IPY_MODEL_2ffcf6981af34b139f304200934abaee","value":"Downloading (…)lve/main/config.json: 
100%"}},"0fe42dba7c4b4df2a64ea2002be642cf":{"model_module":"@jupyter-widgets/controls","model_name":"FloatProgressModel","model_module_version":"1.5.0","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"FloatProgressModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"ProgressView","bar_style":"success","description":"","description_tooltip":null,"layout":"IPY_MODEL_d2d79f3cb2d8444fa36a6851a208e9f4","max":829,"min":0,"orientation":"horizontal","style":"IPY_MODEL_6b905fe4f7f542c9890f5cfd195beda3","value":829}},"9a06564cc5254e89a762426cf3269a9e":{"model_module":"@jupyter-widgets/controls","model_name":"HTMLModel","model_module_version":"1.5.0","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HTMLModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HTMLView","description":"","description_tooltip":null,"layout":"IPY_MODEL_b4d092df96e04c00a943f3c8e3329c39","placeholder":"​","style":"IPY_MODEL_966ad3248efb4f29a20f19b06c7dfa77","value":" 829/829 [00:00<00:00, 
15.0kB/s]"}},"3bb1a7ff75c1490db0334e5162aba497":{"model_module":"@jupyter-widgets/base","model_name":"LayoutModel","model_module_version":"1.2.0","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"4790617c8b18415d9cafaceabca7022d":{"model_module":"@jupyter-widgets/base","model_name":"LayoutModel","model_module_version":"1.2.0","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"
overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"2ffcf6981af34b139f304200934abaee":{"model_module":"@jupyter-widgets/controls","model_name":"DescriptionStyleModel","model_module_version":"1.5.0","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"DescriptionStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","description_width":""}},"d2d79f3cb2d8444fa36a6851a208e9f4":{"model_module":"@jupyter-widgets/base","model_name":"LayoutModel","model_module_version":"1.2.0","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"6b905fe4f7f542c9890f5cfd195beda3":{"model_module":"@jupyter-widgets/controls","model_name":"ProgressStyleModel","model_module_version":"1.5.0","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"ProgressStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","bar_color":null,"description_width":""}},"b4d092df96e04c00a943f3c8e3329c39":{"model_m
odule":"@jupyter-widgets/base","model_name":"LayoutModel","model_module_version":"1.2.0","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"966ad3248efb4f29a20f19b06c7dfa77":{"model_module":"@jupyter-widgets/controls","model_name":"DescriptionStyleModel","model_module_version":"1.5.0","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"DescriptionStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","description_width":""}},"13f8de2b99ff475fbabbfba66b17125e":{"model_module":"@jupyter-widgets/controls","model_name":"HBoxModel","model_module_version":"1.5.0","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HBoxModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HBoxView","box_style":"","children":["IPY_MODEL_3f65fa34feac4f00a89cf74bdb8f5a59","IPY_MODEL_d35f7095c5be4de19b625a00a0ea1798","IPY_MODEL_5f050ded4d6d41509dbad6f17284c18c"],"layout":"IPY_MODEL_58ee9406c08144de989c5a26ed5a1ccb"
}},"3f65fa34feac4f00a89cf74bdb8f5a59":{"model_module":"@jupyter-widgets/controls","model_name":"HTMLModel","model_module_version":"1.5.0","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HTMLModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HTMLView","description":"","description_tooltip":null,"layout":"IPY_MODEL_cee5a9b496cf435b9e746424187cad08","placeholder":"​","style":"IPY_MODEL_36e4580f19164f5d93280c4a06f0879c","value":"Downloading pytorch_model.bin: 100%"}},"d35f7095c5be4de19b625a00a0ea1798":{"model_module":"@jupyter-widgets/controls","model_name":"FloatProgressModel","model_module_version":"1.5.0","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"FloatProgressModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"ProgressView","bar_style":"success","description":"","description_tooltip":null,"layout":"IPY_MODEL_1e99e26f69034ec79f52b512a608c4f7","max":433316646,"min":0,"orientation":"horizontal","style":"IPY_MODEL_c749fd792c914727b6ff4386b316df57","value":433316646}},"5f050ded4d6d41509dbad6f17284c18c":{"model_module":"@jupyter-widgets/controls","model_name":"HTMLModel","model_module_version":"1.5.0","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HTMLModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HTMLView","description":"","description_tooltip":null,"layout":"IPY_MODEL_e24155477021432cb45793c0743eea1b","placeholder":"​","style":"IPY_MODEL_6fc068b143fc4e3391dd755e8262fcfd","value":" 433M/433M [00:05<00:00, 
38.8MB/s]"}},"58ee9406c08144de989c5a26ed5a1ccb":{"model_module":"@jupyter-widgets/base","model_name":"LayoutModel","model_module_version":"1.2.0","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"cee5a9b496cf435b9e746424187cad08":{"model_module":"@jupyter-widgets/base","model_name":"LayoutModel","model_module_version":"1.2.0","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"
overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"36e4580f19164f5d93280c4a06f0879c":{"model_module":"@jupyter-widgets/controls","model_name":"DescriptionStyleModel","model_module_version":"1.5.0","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"DescriptionStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","description_width":""}},"1e99e26f69034ec79f52b512a608c4f7":{"model_module":"@jupyter-widgets/base","model_name":"LayoutModel","model_module_version":"1.2.0","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"c749fd792c914727b6ff4386b316df57":{"model_module":"@jupyter-widgets/controls","model_name":"ProgressStyleModel","model_module_version":"1.5.0","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"ProgressStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","bar_color":null,"description_width":""}},"e24155477021432cb45793c0743eea1b":{"model_m
odule":"@jupyter-widgets/base","model_name":"LayoutModel","model_module_version":"1.2.0","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"6fc068b143fc4e3391dd755e8262fcfd":{"model_module":"@jupyter-widgets/controls","model_name":"DescriptionStyleModel","model_module_version":"1.5.0","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"DescriptionStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","description_width":""}},"5c12f56844e546c3aaadc192b2583077":{"model_module":"@jupyter-widgets/controls","model_name":"HBoxModel","model_module_version":"1.5.0","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HBoxModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HBoxView","box_style":"","children":["IPY_MODEL_c96e0761975e4290be8f4b287e3f6f42","IPY_MODEL_81e8c8d107034c85aa95252a3838b05e","IPY_MODEL_771f704de66e4d0eab0a2cb71dd24d2f"],"layout":"IPY_MODEL_6cf03247c6374a6b89aaffee79998285"
}},"c96e0761975e4290be8f4b287e3f6f42":{"model_module":"@jupyter-widgets/controls","model_name":"HTMLModel","model_module_version":"1.5.0","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HTMLModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HTMLView","description":"","description_tooltip":null,"layout":"IPY_MODEL_501dd4b2d09b4ae993a5bc2f18769ac4","placeholder":"​","style":"IPY_MODEL_3cbb3222ae2f4bb7b3dfd1a8c54a3503","value":"Downloading (…)okenizer_config.json: 100%"}},"81e8c8d107034c85aa95252a3838b05e":{"model_module":"@jupyter-widgets/controls","model_name":"FloatProgressModel","model_module_version":"1.5.0","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"FloatProgressModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"ProgressView","bar_style":"success","description":"","description_tooltip":null,"layout":"IPY_MODEL_edc62d1193fc44e98784da4b1a3fa390","max":59,"min":0,"orientation":"horizontal","style":"IPY_MODEL_5dce3c96154c4dbba25a57052804c82b","value":59}},"771f704de66e4d0eab0a2cb71dd24d2f":{"model_module":"@jupyter-widgets/controls","model_name":"HTMLModel","model_module_version":"1.5.0","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HTMLModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HTMLView","description":"","description_tooltip":null,"layout":"IPY_MODEL_d54d2ab9930a447388f0fea290bef2ac","placeholder":"​","style":"IPY_MODEL_3ce8b14f48a24349b793b75d9350dc95","value":" 59.0/59.0 [00:00<00:00, 
2.43kB/s]"}},"6cf03247c6374a6b89aaffee79998285":{"model_module":"@jupyter-widgets/base","model_name":"LayoutModel","model_module_version":"1.2.0","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"501dd4b2d09b4ae993a5bc2f18769ac4":{"model_module":"@jupyter-widgets/base","model_name":"LayoutModel","model_module_version":"1.2.0","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"
overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"3cbb3222ae2f4bb7b3dfd1a8c54a3503":{"model_module":"@jupyter-widgets/controls","model_name":"DescriptionStyleModel","model_module_version":"1.5.0","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"DescriptionStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","description_width":""}},"edc62d1193fc44e98784da4b1a3fa390":{"model_module":"@jupyter-widgets/base","model_name":"LayoutModel","model_module_version":"1.2.0","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"5dce3c96154c4dbba25a57052804c82b":{"model_module":"@jupyter-widgets/controls","model_name":"ProgressStyleModel","model_module_version":"1.5.0","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"ProgressStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","bar_color":null,"description_width":""}},"d54d2ab9930a447388f0fea290bef2ac":{"model_m
odule":"@jupyter-widgets/base","model_name":"LayoutModel","model_module_version":"1.2.0","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"3ce8b14f48a24349b793b75d9350dc95":{"model_module":"@jupyter-widgets/controls","model_name":"DescriptionStyleModel","model_module_version":"1.5.0","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"DescriptionStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","description_width":""}},"3f1c56e797cf43588bad099a0783e179":{"model_module":"@jupyter-widgets/controls","model_name":"HBoxModel","model_module_version":"1.5.0","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HBoxModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HBoxView","box_style":"","children":["IPY_MODEL_85ff59f9f7ed47b5ab78f767255f5a56","IPY_MODEL_a12cdebdcf8b476f80b906d97a9ea261","IPY_MODEL_a404d4d49f2046bd85ea64cc2de4a734"],"layout":"IPY_MODEL_265a066b8e564034af96902a1e0347fb"
}},"85ff59f9f7ed47b5ab78f767255f5a56":{"model_module":"@jupyter-widgets/controls","model_name":"HTMLModel","model_module_version":"1.5.0","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HTMLModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HTMLView","description":"","description_tooltip":null,"layout":"IPY_MODEL_d136b83a79034d20971be55510737103","placeholder":"​","style":"IPY_MODEL_572b68362a1a4292a43d44ebad043042","value":"Downloading (…)solve/main/vocab.txt: 100%"}},"a12cdebdcf8b476f80b906d97a9ea261":{"model_module":"@jupyter-widgets/controls","model_name":"FloatProgressModel","model_module_version":"1.5.0","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"FloatProgressModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"ProgressView","bar_style":"success","description":"","description_tooltip":null,"layout":"IPY_MODEL_db4fc546e2d344018f85cd3deeab1115","max":213450,"min":0,"orientation":"horizontal","style":"IPY_MODEL_ef8256e1afce4f268db5bc38a3a7fa86","value":213450}},"a404d4d49f2046bd85ea64cc2de4a734":{"model_module":"@jupyter-widgets/controls","model_name":"HTMLModel","model_module_version":"1.5.0","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HTMLModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HTMLView","description":"","description_tooltip":null,"layout":"IPY_MODEL_afa59a1ecaba4b56a596f4cdbd7b6730","placeholder":"​","style":"IPY_MODEL_3d6af6f687c54a5eb0db38b2c2ef1899","value":" 213k/213k [00:00<00:00, 
7.96MB/s]"}},"265a066b8e564034af96902a1e0347fb":{"model_module":"@jupyter-widgets/base","model_name":"LayoutModel","model_module_version":"1.2.0","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"d136b83a79034d20971be55510737103":{"model_module":"@jupyter-widgets/base","model_name":"LayoutModel","model_module_version":"1.2.0","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"
overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"572b68362a1a4292a43d44ebad043042":{"model_module":"@jupyter-widgets/controls","model_name":"DescriptionStyleModel","model_module_version":"1.5.0","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"DescriptionStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","description_width":""}},"db4fc546e2d344018f85cd3deeab1115":{"model_module":"@jupyter-widgets/base","model_name":"LayoutModel","model_module_version":"1.2.0","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"ef8256e1afce4f268db5bc38a3a7fa86":{"model_module":"@jupyter-widgets/controls","model_name":"ProgressStyleModel","model_module_version":"1.5.0","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"ProgressStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","bar_color":null,"description_width":""}},"afa59a1ecaba4b56a596f4cdbd7b6730":{"model_m
odule":"@jupyter-widgets/base","model_name":"LayoutModel","model_module_version":"1.2.0","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"3d6af6f687c54a5eb0db38b2c2ef1899":{"model_module":"@jupyter-widgets/controls","model_name":"DescriptionStyleModel","model_module_version":"1.5.0","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"DescriptionStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","description_width":""}},"d2906819c5ec4c82b2731eac4afe519d":{"model_module":"@jupyter-widgets/controls","model_name":"HBoxModel","model_module_version":"1.5.0","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HBoxModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HBoxView","box_style":"","children":["IPY_MODEL_1db2003c9e124e4f8b6444c157636983","IPY_MODEL_eaae6733b12047ac9edec3adff0ab765","IPY_MODEL_83aa131ac111451495e97bd710631418"],"layout":"IPY_MODEL_f818144e1d304afd8b15900620abfc1d"
}},"1db2003c9e124e4f8b6444c157636983":{"model_module":"@jupyter-widgets/controls","model_name":"HTMLModel","model_module_version":"1.5.0","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HTMLModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HTMLView","description":"","description_tooltip":null,"layout":"IPY_MODEL_314d5fcd2f864d8f941f90d76bd0df1b","placeholder":"​","style":"IPY_MODEL_b3f1bdcee72a47a791c6eaec72fdf136","value":"Downloading (…)in/added_tokens.json: 100%"}},"eaae6733b12047ac9edec3adff0ab765":{"model_module":"@jupyter-widgets/controls","model_name":"FloatProgressModel","model_module_version":"1.5.0","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"FloatProgressModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"ProgressView","bar_style":"success","description":"","description_tooltip":null,"layout":"IPY_MODEL_06797e19a13d40df8c50322bf4b52f90","max":2,"min":0,"orientation":"horizontal","style":"IPY_MODEL_6689169d7c9447a1bde80313e6e9a7c2","value":2}},"83aa131ac111451495e97bd710631418":{"model_module":"@jupyter-widgets/controls","model_name":"HTMLModel","model_module_version":"1.5.0","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HTMLModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HTMLView","description":"","description_tooltip":null,"layout":"IPY_MODEL_38476374aa9c49f68dcf96e55e520240","placeholder":"​","style":"IPY_MODEL_6c506299e96344798ac6e36820e275bf","value":" 2.00/2.00 [00:00<00:00, 
86.3B/s]"}},"f818144e1d304afd8b15900620abfc1d":{"model_module":"@jupyter-widgets/base","model_name":"LayoutModel","model_module_version":"1.2.0","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"314d5fcd2f864d8f941f90d76bd0df1b":{"model_module":"@jupyter-widgets/base","model_name":"LayoutModel","model_module_version":"1.2.0","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"o
verflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"b3f1bdcee72a47a791c6eaec72fdf136":{"model_module":"@jupyter-widgets/controls","model_name":"DescriptionStyleModel","model_module_version":"1.5.0","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"DescriptionStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","description_width":""}},"06797e19a13d40df8c50322bf4b52f90":{"model_module":"@jupyter-widgets/base","model_name":"LayoutModel","model_module_version":"1.2.0","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"6689169d7c9447a1bde80313e6e9a7c2":{"model_module":"@jupyter-widgets/controls","model_name":"ProgressStyleModel","model_module_version":"1.5.0","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"ProgressStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","bar_color":null,"description_width":""}},"38476374aa9c49f68dcf96e55e520240":{"model_mo
dule":"@jupyter-widgets/base","model_name":"LayoutModel","model_module_version":"1.2.0","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"6c506299e96344798ac6e36820e275bf":{"model_module":"@jupyter-widgets/controls","model_name":"DescriptionStyleModel","model_module_version":"1.5.0","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"DescriptionStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","description_width":""}},"80393d9f400a4e9c8867808c5f2e8b28":{"model_module":"@jupyter-widgets/controls","model_name":"HBoxModel","model_module_version":"1.5.0","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HBoxModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HBoxView","box_style":"","children":["IPY_MODEL_eaff3fe3471b4815ab3d27d72142fe22","IPY_MODEL_030f52c161444051b7215c4cf1b4eb27","IPY_MODEL_d2726e7c8ebc4d0c9dbaf7d919bd064b"],"layout":"IPY_MODEL_012930dde07a4a56af962c6993ecbf03"}
},"eaff3fe3471b4815ab3d27d72142fe22":{"model_module":"@jupyter-widgets/controls","model_name":"HTMLModel","model_module_version":"1.5.0","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HTMLModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HTMLView","description":"","description_tooltip":null,"layout":"IPY_MODEL_ab2b92ec8670443a9093c015c6084e95","placeholder":"​","style":"IPY_MODEL_5aee88b7efe54ce2b646b353cc61b26f","value":"Downloading (…)cial_tokens_map.json: 100%"}},"030f52c161444051b7215c4cf1b4eb27":{"model_module":"@jupyter-widgets/controls","model_name":"FloatProgressModel","model_module_version":"1.5.0","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"FloatProgressModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"ProgressView","bar_style":"success","description":"","description_tooltip":null,"layout":"IPY_MODEL_f80fde8b46ba4c7fad2f867f2439ef83","max":112,"min":0,"orientation":"horizontal","style":"IPY_MODEL_1152e3b558814204a94a058f0d506d20","value":112}},"d2726e7c8ebc4d0c9dbaf7d919bd064b":{"model_module":"@jupyter-widgets/controls","model_name":"HTMLModel","model_module_version":"1.5.0","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HTMLModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HTMLView","description":"","description_tooltip":null,"layout":"IPY_MODEL_6ee8c33165a946b8a15516a89203f396","placeholder":"​","style":"IPY_MODEL_9982d7d8bc634c77838aa10ccead8428","value":" 112/112 [00:00<00:00, 
7.97kB/s]"}},"012930dde07a4a56af962c6993ecbf03":{"model_module":"@jupyter-widgets/base","model_name":"LayoutModel","model_module_version":"1.2.0","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"ab2b92ec8670443a9093c015c6084e95":{"model_module":"@jupyter-widgets/base","model_name":"LayoutModel","model_module_version":"1.2.0","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"
overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"5aee88b7efe54ce2b646b353cc61b26f":{"model_module":"@jupyter-widgets/controls","model_name":"DescriptionStyleModel","model_module_version":"1.5.0","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"DescriptionStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","description_width":""}},"f80fde8b46ba4c7fad2f867f2439ef83":{"model_module":"@jupyter-widgets/base","model_name":"LayoutModel","model_module_version":"1.2.0","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"1152e3b558814204a94a058f0d506d20":{"model_module":"@jupyter-widgets/controls","model_name":"ProgressStyleModel","model_module_version":"1.5.0","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"ProgressStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","bar_color":null,"description_width":""}},"6ee8c33165a946b8a15516a89203f396":{"model_m
odule":"@jupyter-widgets/base","model_name":"LayoutModel","model_module_version":"1.2.0","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"9982d7d8bc634c77838aa10ccead8428":{"model_module":"@jupyter-widgets/controls","model_name":"DescriptionStyleModel","model_module_version":"1.5.0","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"DescriptionStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","description_width":""}}}}},"nbformat":4,"nbformat_minor":0} \ No newline at end of file +{"cells":[{"cell_type":"markdown","metadata":{"id":"e7PsSmy9sCoR"},"source":["![image.png]()"]},{"cell_type":"markdown","metadata":{"id":"3o5sAOfwL5qd"},"source":["[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/langtest/blob/main/demo/tutorials/misc/PerformanceTest_Notebook.ipynb)"]},{"cell_type":"markdown","metadata":{"id":"WJJzt3RWhEc6"},"source":["**LangTest** is an open-source python library designed to help developers deliver safe and 
effective Natural Language Processing (NLP) models. Whether you are using **John Snow Labs, Hugging Face, spaCy** models or **OpenAI, Cohere, AI21, Hugging Face Inference API and Azure-OpenAI** based LLMs, it has got you covered. You can test any Named Entity Recognition (NER) or Text Classification model using the library. We also support testing LLMs for Question-Answering and Summarization tasks on benchmark datasets. The library supports 50+ out-of-the-box tests. These tests fall into robustness, accuracy, bias, representation, toxicity and fairness test categories.\n","\n","Metrics are calculated by comparing the model's extractions in the original list of sentences against the extractions carried out in the noisy list of sentences. The original annotated labels are not used at any point; we are simply comparing the model against itself in two settings."]},{"cell_type":"markdown","metadata":{"id":"26qXWhCYhHAt"},"source":["# Getting started with LangTest on John Snow Labs"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"azUb114QhOsY"},"outputs":[],"source":["!pip install langtest[transformers]"]},{"cell_type":"markdown","metadata":{"id":"yR6kjOaiheKN"},"source":["# Harness and Its Parameters\n","\n","The Harness class is a testing class for Natural Language Processing (NLP) models. 
It evaluates the performance of an NLP model on a given task using test data and generates a report with test results. Harness can be imported from the LangTest library in the following way."]},{"cell_type":"code","execution_count":2,"metadata":{"executionInfo":{"elapsed":925,"status":"ok","timestamp":1692343745209,"user":{"displayName":"Prikshit sharma","userId":"07819241395213139913"},"user_tz":-330},"id":"lTzSJpMlhgq5"},"outputs":[],"source":["# Import Harness from the LangTest library\n","from langtest import Harness"]},{"cell_type":"markdown","metadata":{"id":"JFhJ9CcbsKqN"},"source":["# Performance Testing\n","\n","Performance testing in LangTest measures how long an NLP model takes to process a given dataset. This metric is a key performance indicator of a model's efficiency in real-world scenarios. The dataset should be relevant to the application's context. Comparing the processing time of different models helps identify the best fit for a specific use case. If processing times are too slow, adjusting the model architecture or applying optimization strategies may be needed to improve performance and responsiveness, particularly in applications that demand quick and efficient language processing.\n","\n","The formula\n","\n","$\\text{speed} = \\frac{\\text{number of tokens in the given dataset}}{\\text{time taken}}$\n","\n","gives the processing speed of an NLP model. Specifically, it represents the number of tokens processed per unit of time, offering a quantitative measure of the model's efficiency. A higher speed value indicates faster processing, which can be crucial in real-time applications or scenarios where quick language understanding and response are essential. 
Monitoring and optimizing this speed metric contribute to ensuring the practical utility of NLP models, especially in applications such as chatbots, customer support systems, or any context where rapid language processing is a priority."]},{"cell_type":"markdown","metadata":{"id":"swaYPW-wPlku"},"source":["### Setup and Configure Harness"]},{"cell_type":"code","execution_count":3,"metadata":{"colab":{"base_uri":"https://localhost:8080/","height":920,"referenced_widgets":["6c5a9f6544e0442ca68098426c146503","e7f4a1278a7e49aba2ed734c228d0c66","0fe42dba7c4b4df2a64ea2002be642cf","9a06564cc5254e89a762426cf3269a9e","3bb1a7ff75c1490db0334e5162aba497","4790617c8b18415d9cafaceabca7022d","2ffcf6981af34b139f304200934abaee","d2d79f3cb2d8444fa36a6851a208e9f4","6b905fe4f7f542c9890f5cfd195beda3","b4d092df96e04c00a943f3c8e3329c39","966ad3248efb4f29a20f19b06c7dfa77","13f8de2b99ff475fbabbfba66b17125e","3f65fa34feac4f00a89cf74bdb8f5a59","d35f7095c5be4de19b625a00a0ea1798","5f050ded4d6d41509dbad6f17284c18c","58ee9406c08144de989c5a26ed5a1ccb","cee5a9b496cf435b9e746424187cad08","36e4580f19164f5d93280c4a06f0879c","1e99e26f69034ec79f52b512a608c4f7","c749fd792c914727b6ff4386b316df57","e24155477021432cb45793c0743eea1b","6fc068b143fc4e3391dd755e8262fcfd","5c12f56844e546c3aaadc192b2583077","c96e0761975e4290be8f4b287e3f6f42","81e8c8d107034c85aa95252a3838b05e","771f704de66e4d0eab0a2cb71dd24d2f","6cf03247c6374a6b89aaffee79998285","501dd4b2d09b4ae993a5bc2f18769ac4","3cbb3222ae2f4bb7b3dfd1a8c54a3503","edc62d1193fc44e98784da4b1a3fa390","5dce3c96154c4dbba25a57052804c82b","d54d2ab9930a447388f0fea290bef2ac","3ce8b14f48a24349b793b75d9350dc95","3f1c56e797cf43588bad099a0783e179","85ff59f9f7ed47b5ab78f767255f5a56","a12cdebdcf8b476f80b906d97a9ea261","a404d4d49f2046bd85ea64cc2de4a734","265a066b8e564034af96902a1e0347fb","d136b83a79034d20971be55510737103","572b68362a1a4292a43d44ebad043042","db4fc546e2d344018f85cd3deeab1115","ef8256e1afce4f268db5bc38a3a7fa86","afa59a1ecaba4b56a596f4cdbd7b6730","3d6af6f687c54a5eb0
db38b2c2ef1899","d2906819c5ec4c82b2731eac4afe519d","1db2003c9e124e4f8b6444c157636983","eaae6733b12047ac9edec3adff0ab765","83aa131ac111451495e97bd710631418","f818144e1d304afd8b15900620abfc1d","314d5fcd2f864d8f941f90d76bd0df1b","b3f1bdcee72a47a791c6eaec72fdf136","06797e19a13d40df8c50322bf4b52f90","6689169d7c9447a1bde80313e6e9a7c2","38476374aa9c49f68dcf96e55e520240","6c506299e96344798ac6e36820e275bf","80393d9f400a4e9c8867808c5f2e8b28","eaff3fe3471b4815ab3d27d72142fe22","030f52c161444051b7215c4cf1b4eb27","d2726e7c8ebc4d0c9dbaf7d919bd064b","012930dde07a4a56af962c6993ecbf03","ab2b92ec8670443a9093c015c6084e95","5aee88b7efe54ce2b646b353cc61b26f","f80fde8b46ba4c7fad2f867f2439ef83","1152e3b558814204a94a058f0d506d20","6ee8c33165a946b8a15516a89203f396","9982d7d8bc634c77838aa10ccead8428"]},"executionInfo":{"elapsed":23923,"status":"ok","timestamp":1692343769123,"user":{"displayName":"Prikshit sharma","userId":"07819241395213139913"},"user_tz":-330},"id":"JaarBdfe8DQ8","outputId":"9bc19b92-c518-4bcf-f7e6-16f419898566"},"outputs":[{"data":{"application/vnd.jupyter.widget-view+json":{"model_id":"6c5a9f6544e0442ca68098426c146503","version_major":2,"version_minor":0},"text/plain":["Downloading (…)lve/main/config.json: 0%| | 0.00/829 [00:00\n","
<div>\n",
"<table>\n",
" <thead>\n",
" <tr><th></th><th>category</th><th>test_type</th><th>original</th><th>test_case</th></tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr><th>0</th><td>robustness</td><td>lowercase</td><td>SOCCER - JAPAN GET LUCKY WIN , CHINA IN SURPRI...</td><td>soccer - japan get lucky win , china in surpri...</td></tr>\n",
" <tr><th>1</th><td>robustness</td><td>lowercase</td><td>Nadim Ladki</td><td>nadim ladki</td></tr>\n",
" <tr><th>2</th><td>robustness</td><td>lowercase</td><td>AL-AIN , United Arab Emirates 1996-12-06</td><td>al-ain , united arab emirates 1996-12-06</td></tr>\n",
" <tr><th>3</th><td>robustness</td><td>lowercase</td><td>Japan began the defence of their Asian Cup tit...</td><td>japan began the defence of their asian cup tit...</td></tr>\n",
" <tr><th>4</th><td>robustness</td><td>lowercase</td><td>But China saw their luck desert them in the se...</td><td>but china saw their luck desert them in the se...</td></tr>\n",
" <tr><th>...</th><td>...</td><td>...</td><td>...</td><td>...</td></tr>\n",
" <tr><th>448</th><td>robustness</td><td>uppercase</td><td>CRICKET - LARA ENDURES ANOTHER MISERABLE DAY .</td><td>CRICKET - LARA ENDURES ANOTHER MISERABLE DAY .</td></tr>\n",
" <tr><th>449</th><td>robustness</td><td>uppercase</td><td>Robert Galvin</td><td>ROBERT GALVIN</td></tr>\n",
" <tr><th>450</th><td>robustness</td><td>uppercase</td><td>MELBOURNE 1996-12-06</td><td>MELBOURNE 1996-12-06</td></tr>\n",
" <tr><th>451</th><td>robustness</td><td>uppercase</td><td>Australia gave Brian Lara another reason to be...</td><td>AUSTRALIA GAVE BRIAN LARA ANOTHER REASON TO BE...</td></tr>\n",
" <tr><th>452</th><td>performance</td><td>speed</td><td>-</td><td>-</td></tr>\n",
" </tbody>\n",
"</table>\n",
"<p>453 rows × 4 columns</p>\n",
"</div>
\n"," \n"],"text/plain":[" category test_type \\\n","0 robustness lowercase \n","1 robustness lowercase \n","2 robustness lowercase \n","3 robustness lowercase \n","4 robustness lowercase \n",".. ... ... \n","448 robustness uppercase \n","449 robustness uppercase \n","450 robustness uppercase \n","451 robustness uppercase \n","452 performance speed \n","\n"," original \\\n","0 SOCCER - JAPAN GET LUCKY WIN , CHINA IN SURPRI... \n","1 Nadim Ladki \n","2 AL-AIN , United Arab Emirates 1996-12-06 \n","3 Japan began the defence of their Asian Cup tit... \n","4 But China saw their luck desert them in the se... \n",".. ... \n","448 CRICKET - LARA ENDURES ANOTHER MISERABLE DAY . \n","449 Robert Galvin \n","450 MELBOURNE 1996-12-06 \n","451 Australia gave Brian Lara another reason to be... \n","452 - \n","\n"," test_case \n","0 soccer - japan get lucky win , china in surpri... \n","1 nadim ladki \n","2 al-ain , united arab emirates 1996-12-06 \n","3 japan began the defence of their asian cup tit... \n","4 but china saw their luck desert them in the se... \n",".. ... \n","448 CRICKET - LARA ENDURES ANOTHER MISERABLE DAY . \n","449 ROBERT GALVIN \n","450 MELBOURNE 1996-12-06 \n","451 AUSTRALIA GAVE BRIAN LARA ANOTHER REASON TO BE... 
\n","452 - \n","\n","[453 rows x 4 columns]"]},"execution_count":7,"metadata":{},"output_type":"execute_result"}],"source":["harness.testcases()"]},{"cell_type":"markdown","metadata":{"id":"NOJ8BAU2GGzd"},"source":["The harness.testcases() method displays the generated test cases as a pandas DataFrame."]},{"cell_type":"markdown","metadata":{"id":"3CwhQw6hGR9S"},"source":["### Running the tests"]},{"cell_type":"code","execution_count":8,"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"elapsed":163600,"status":"ok","timestamp":1692343967668,"user":{"displayName":"Prikshit sharma","userId":"07819241395213139913"},"user_tz":-330},"id":"aguX6-aFGOnP","outputId":"66be230c-84f5-4521-a3c5-fb57f91d131a"},"outputs":[{"name":"stderr","output_type":"stream","text":["Running testcases... : 100%|██████████| 453/453 [02:43<00:00, 2.77it/s]\n"]},{"data":{"text/plain":[]},"execution_count":8,"metadata":{},"output_type":"execute_result"}],"source":["harness.run()"]},{"cell_type":"markdown","metadata":{"id":"191O2oaUGWrH"},"source":["harness.run() is called after harness.generate() and runs all the tests. It returns a pass/fail flag for each test."]},{"cell_type":"code","execution_count":9,"metadata":{"colab":{"base_uri":"https://localhost:8080/","height":527},"executionInfo":{"elapsed":33,"status":"ok","timestamp":1692343967670,"user":{"displayName":"Prikshit sharma","userId":"07819241395213139913"},"user_tz":-330},"id":"XDbd1mpREWR5","outputId":"0375fbee-3ab7-4dca-f10d-eb6c36e23407"},"outputs":[{"data":{"text/html":["\n","
<div>\n",
"<table>\n",
" <thead>\n",
" <tr><th></th><th>category</th><th>test_type</th><th>original</th><th>test_case</th><th>expected_result</th><th>actual_result</th><th>pass</th></tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr><th>0</th><td>robustness</td><td>lowercase</td><td>SOCCER - JAPAN GET LUCKY WIN , CHINA IN SURPRI...</td><td>soccer - japan get lucky win , china in surpri...</td><td>JAPAN: MISC, LUCKY: PER, CHINA: ORG</td><td></td><td>False</td></tr>\n",
" <tr><th>1</th><td>robustness</td><td>lowercase</td><td>Nadim Ladki</td><td>nadim ladki</td><td>Nadim Ladki: PER</td><td></td><td>False</td></tr>\n",
" <tr><th>2</th><td>robustness</td><td>lowercase</td><td>AL-AIN , United Arab Emirates 1996-12-06</td><td>al-ain , united arab emirates 1996-12-06</td><td>AL-AIN: LOC, United Arab Emirates: LOC</td><td>al-ain: LOC</td><td>False</td></tr>\n",
" <tr><th>3</th><td>robustness</td><td>lowercase</td><td>Japan began the defence of their Asian Cup tit...</td><td>japan began the defence of their asian cup tit...</td><td>Japan: LOC, Asian Cup: MISC, Syria: LOC, Group...</td><td>japan: ORG, syria: ORG</td><td>False</td></tr>\n",
" <tr><th>4</th><td>robustness</td><td>lowercase</td><td>But China saw their luck desert them in the se...</td><td>but china saw their luck desert them in the se...</td><td>China: LOC, Uzbekistan: LOC</td><td>uzbekistan: LOC</td><td>False</td></tr>\n",
" <tr><th>...</th><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td></tr>\n",
" <tr><th>448</th><td>robustness</td><td>uppercase</td><td>CRICKET - LARA ENDURES ANOTHER MISERABLE DAY .</td><td>CRICKET - LARA ENDURES ANOTHER MISERABLE DAY .</td><td>LARA: LOC, MISERABLE: PER</td><td>LARA: LOC, MISERABLE: PER</td><td>True</td></tr>\n",
" <tr><th>449</th><td>robustness</td><td>uppercase</td><td>Robert Galvin</td><td>ROBERT GALVIN</td><td>Robert Galvin: PER</td><td>ROBERT: ORG, GALVIN: PER</td><td>False</td></tr>\n",
" <tr><th>450</th><td>robustness</td><td>uppercase</td><td>MELBOURNE 1996-12-06</td><td>MELBOURNE 1996-12-06</td><td>MELBOURNE: LOC</td><td>MELBOURNE: LOC</td><td>True</td></tr>\n",
" <tr><th>451</th><td>robustness</td><td>uppercase</td><td>Australia gave Brian Lara another reason to be...</td><td>AUSTRALIA GAVE BRIAN LARA ANOTHER REASON TO BE...</td><td>Australia: LOC, Brian Lara: PER, West Indies: ...</td><td>AUSTRALIA: LOC, BRIAN LARA: LOC, REASON: PER, ...</td><td>False</td></tr>\n",
" <tr><th>452</th><td>performance</td><td>speed</td><td>-</td><td>-</td><td>100 token/sec</td><td>19.20 token/sec</td><td>True</td></tr>\n",
" </tbody>\n",
"</table>\n",
"<p>453 rows × 7 columns</p>\n",
"</div>
\n"],"text/plain":[" category test_type \\\n","0 robustness lowercase \n","1 robustness lowercase \n","2 robustness lowercase \n","3 robustness lowercase \n","4 robustness lowercase \n",".. ... ... \n","448 robustness uppercase \n","449 robustness uppercase \n","450 robustness uppercase \n","451 robustness uppercase \n","452 performance speed \n","\n"," original \\\n","0 SOCCER - JAPAN GET LUCKY WIN , CHINA IN SURPRI... \n","1 Nadim Ladki \n","2 AL-AIN , United Arab Emirates 1996-12-06 \n","3 Japan began the defence of their Asian Cup tit... \n","4 But China saw their luck desert them in the se... \n",".. ... \n","448 CRICKET - LARA ENDURES ANOTHER MISERABLE DAY . \n","449 Robert Galvin \n","450 MELBOURNE 1996-12-06 \n","451 Australia gave Brian Lara another reason to be... \n","452 - \n","\n"," test_case \\\n","0 soccer - japan get lucky win , china in surpri... \n","1 nadim ladki \n","2 al-ain , united arab emirates 1996-12-06 \n","3 japan began the defence of their asian cup tit... \n","4 but china saw their luck desert them in the se... \n",".. ... \n","448 CRICKET - LARA ENDURES ANOTHER MISERABLE DAY . \n","449 ROBERT GALVIN \n","450 MELBOURNE 1996-12-06 \n","451 AUSTRALIA GAVE BRIAN LARA ANOTHER REASON TO BE... \n","452 - \n","\n"," expected_result \\\n","0 JAPAN: MISC, LUCKY: PER, CHINA: ORG \n","1 Nadim Ladki: PER \n","2 AL-AIN: LOC, United Arab Emirates: LOC \n","3 Japan: LOC, Asian Cup: MISC, Syria: LOC, Group... \n","4 China: LOC, Uzbekistan: LOC \n",".. ... \n","448 LARA: LOC, MISERABLE: PER \n","449 Robert Galvin: PER \n","450 MELBOURNE: LOC \n","451 Australia: LOC, Brian Lara: PER, West Indies: ... \n","452 100 token/sec \n","\n"," actual_result pass \n","0 False \n","1 False \n","2 al-ain: LOC False \n","3 japan: ORG, syria: ORG False \n","4 uzbekistan: LOC False \n",".. ... ... \n","448 LARA: LOC, MISERABLE: PER True \n","449 ROBERT: ORG, GALVIN: PER False \n","450 MELBOURNE: LOC True \n","451 AUSTRALIA: LOC, BRIAN LARA: LOC, REASON: PER, ... 
False \n","452 19.20 token/sec True \n","\n","[453 rows x 7 columns]"]},"execution_count":9,"metadata":{},"output_type":"execute_result"}],"source":["harness.generated_results()"]},{"cell_type":"markdown","metadata":{"id":"TKB8Rsr2GZME"},"source":["This method returns the generated results as a pandas DataFrame, a convenient format for working with the test results. You can use it to quickly identify the test cases that failed and to determine where fixes are needed."]},{"cell_type":"markdown","metadata":{"id":"PBSlpWnUU55G"},"source":["### Final Results"]},{"cell_type":"markdown","metadata":{"id":"umnEgUHM8DRA"},"source":["We can call `.report()`, which summarizes the results with pass and fail counts and an overall pass/fail flag for each test type.\n","\n","To get `time_elapsed` for each test, pass the parameter `return_runtime=True` to the `.report()` method. The unit for `time_elapsed` can also be selected, e.g. seconds (s), milliseconds (ms) or microseconds (us)."]},{"cell_type":"code","execution_count":10,"metadata":{"colab":{"base_uri":"https://localhost:8080/","height":143},"executionInfo":{"elapsed":30,"status":"ok","timestamp":1692343967672,"user":{"displayName":"Prikshit sharma","userId":"07819241395213139913"},"user_tz":-330},"id":"gp57HcF9yxi7","outputId":"8d990f2e-6b4d-480e-e844-95ff9158e126"},"outputs":[{"data":{"text/html":["\n","
      category  test_type  fail_count  pass_count pass_rate minimum_pass_rate   pass
0   robustness  lowercase         182          44       19%               66%  False
1   robustness  uppercase         152          74       33%               66%  False
2  performance      speed           0           1      100%              100%   True
\n"],"text/plain":[" category test_type fail_count pass_count pass_rate minimum_pass_rate \\\n","0 robustness lowercase 182 44 19% 66% \n","1 robustness uppercase 152 74 33% 66% \n","2 performance speed 0 1 100% 100% \n","\n"," pass \n","0 False \n","1 False \n","2 True "]},"execution_count":10,"metadata":{},"output_type":"execute_result"}],"source":["harness.report()"]},{"cell_type":"markdown","metadata":{"id":"zg-knds3tq-w"},"source":["# Multiple Models Runtime Testing"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"ElMInPJMu3QK"},"outputs":[],"source":["!pip install spacy johnsnowlabs"]},{"cell_type":"code","execution_count":11,"metadata":{"executionInfo":{"elapsed":28,"status":"ok","timestamp":1692343967673,"user":{"displayName":"Prikshit sharma","userId":"07819241395213139913"},"user_tz":-330},"id":"TnUBvYXptq-w"},"outputs":[],"source":["model_dict=[{\"model\": \"ner.dl\", \"hub\": \"johnsnowlabs\"},\n"," {\"model\": \"en_core_web_sm\", \"hub\": \"spacy\"}]"]},{"cell_type":"code","execution_count":13,"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"elapsed":1096,"status":"ok","timestamp":1692344027826,"user":{"displayName":"Prikshit sharma","userId":"07819241395213139913"},"user_tz":-330},"id":"PmMwW5IIvGav","outputId":"d9e5c932-b286-46e7-d18c-23e23f4cba6f"},"outputs":[{"name":"stdout","output_type":"stream","text":["--2023-08-18 07:33:45-- https://github.com/JohnSnowLabs/langtest/raw/main/langtest/data/conll/sample.conll\n","Resolving github.com (github.com)... 20.27.177.113\n","Connecting to github.com (github.com)|20.27.177.113|:443... connected.\n","HTTP request sent, awaiting response... 302 Found\n","Location: https://raw.githubusercontent.com/JohnSnowLabs/langtest/main/langtest/data/conll/sample.conll [following]\n","--2023-08-18 07:33:45-- https://raw.githubusercontent.com/JohnSnowLabs/langtest/main/langtest/data/conll/sample.conll\n","Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 
185.199.110.133, 185.199.109.133, 185.199.108.133, ...\n","Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.\n","HTTP request sent, awaiting response... 200 OK\n","Length: 50519 (49K) [text/plain]\n","Saving to: ‘sample.conll’\n","\n","sample.conll 100%[===================>] 49.33K --.-KB/s in 0.01s \n","\n","2023-08-18 07:33:46 (3.77 MB/s) - ‘sample.conll’ saved [50519/50519]\n","\n"]}],"source":["# Load CoNLL\n","!wget https://github.com/JohnSnowLabs/langtest/raw/main/langtest/data/conll/sample.conll"]},{"cell_type":"code","execution_count":16,"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"elapsed":233178,"status":"ok","timestamp":1692344334665,"user":{"displayName":"Prikshit sharma","userId":"07819241395213139913"},"user_tz":-330},"id":"yey-zVICtq-w","outputId":"d94e8722-e009-4c17-85e5-9f8bcafdaf6a"},"outputs":[{"name":"stdout","output_type":"stream","text":["Warning::Spark Session already created, some configs may not take.\n","recognize_entities_dl download started this may take some time.\n","Approx size to download 159 MB\n","[OK!]\n","Test Configuration : \n"," {\n"," \"tests\": {\n"," \"defaults\": {\n"," \"min_pass_rate\": 1.0\n"," },\n"," \"robustness\": {\n"," \"add_typo\": {\n"," \"min_pass_rate\": 0.7\n"," },\n"," \"american_to_british\": {\n"," \"min_pass_rate\": 0.7\n"," }\n"," },\n"," \"accuracy\": {\n"," \"min_micro_f1_score\": {\n"," \"min_score\": 0.7\n"," }\n"," },\n"," \"bias\": {\n"," \"replace_to_female_pronouns\": {\n"," \"min_pass_rate\": 0.7\n"," },\n"," \"replace_to_low_income_country\": {\n"," \"min_pass_rate\": 0.7\n"," }\n"," },\n"," \"fairness\": {\n"," \"min_gender_f1_score\": {\n"," \"min_score\": 0.6\n"," }\n"," },\n"," \"representation\": {\n"," \"min_label_representation_count\": {\n"," \"min_count\": 50\n"," }\n"," }\n"," }\n","}\n"]}],"source":["harness = Harness(task=\"ner\", model=model_dict, 
data={\"data_source\":\"sample.conll\"})"]},{"cell_type":"code","execution_count":17,"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"elapsed":85,"status":"ok","timestamp":1692344334668,"user":{"displayName":"Prikshit sharma","userId":"07819241395213139913"},"user_tz":-330},"id":"JwK7oi7Etq-w","outputId":"11e0d7e4-58f0-497b-d122-07efc38f21cb"},"outputs":[{"data":{"text/plain":["{'tests': {'defaults': {'min_pass_rate': 0.65},\n"," 'robustness': {'uppercase': {'min_pass_rate': 0.66},\n"," 'lowercase': {'min_pass_rate': 0.6}},\n"," 'performance': {'speed': {'min_pass_rate': 100, 'unit': 'tokens/sec'}}}}"]},"execution_count":17,"metadata":{},"output_type":"execute_result"}],"source":["harness.configure(\n","{\n"," 'tests': {'defaults': {'min_pass_rate': 0.65},\n"," 'robustness': {'uppercase': {'min_pass_rate': 0.66},\n"," 'lowercase': {'min_pass_rate': 0.60},\n"," },\n"," 'performance': {'speed': {'min_pass_rate': 100, 'unit': 'tokens/sec'}\n"," },\n"," }\n"," }\n"," )\n"]},{"cell_type":"code","execution_count":18,"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"elapsed":130503,"status":"ok","timestamp":1692344465099,"user":{"displayName":"Prikshit sharma","userId":"07819241395213139913"},"user_tz":-330},"id":"vTbPwStvtq-x","outputId":"4b2cbf34-6d9e-4942-b6e5-4bccd7153326"},"outputs":[{"name":"stderr","output_type":"stream","text":["Generating testcases...: 100%|██████████| 2/2 [00:00<00:00, 8256.50it/s]\n","Generating testcases...: 100%|██████████| 2/2 [00:00<00:00, 10094.59it/s]\n","Running testcases... : 100%|██████████| 453/453 [01:30<00:00, 4.99it/s]\n","Running testcases... 
: 100%|██████████| 453/453 [00:11<00:00, 40.56it/s]\n"]},{"data":{"text/plain":[]},"execution_count":18,"metadata":{},"output_type":"execute_result"}],"source":["harness.generate().run()"]},{"cell_type":"code","execution_count":19,"metadata":{"colab":{"base_uri":"https://localhost:8080/","height":143},"executionInfo":{"elapsed":63,"status":"ok","timestamp":1692344465100,"user":{"displayName":"Prikshit sharma","userId":"07819241395213139913"},"user_tz":-330},"id":"AUZUeCpLtq-x","outputId":"406dea53-1a63-4e17-9ffb-7ed7e2a4531d"},"outputs":[{"data":{"text/html":["\n","\n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n","
test_type       lowercase     speed  uppercase
model_name
en_core_web_sm   0.290000  0.500000   0.580000
ner.dl           0.110000  1.000000   0.850000
\n"],"text/plain":[""]},"execution_count":19,"metadata":{},"output_type":"execute_result"}],"source":["harness.report()"]}],"metadata":{"accelerator":"TPU","colab":{"machine_shape":"hm","provenance":[]},"gpuClass":"standard","kernelspec":{"display_name":"Python 3","name":"python3"},"language_info":{"codemirror_mode":{"name":"ipython","version":3},"file_extension":".py","mimetype":"text/x-python","name":"python","nbconvert_exporter":"python","pygments_lexer":"ipython3","version":"3.8.9"},"widgets":{"application/vnd.jupyter.widget-state+json":{"012930dde07a4a56af962c6993ecbf03":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"030f52c161444051b7215c4cf1b4eb27":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"FloatProgressModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"FloatProgressModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"ProgressView","bar_style"
:"success","description":"","description_tooltip":null,"layout":"IPY_MODEL_f80fde8b46ba4c7fad2f867f2439ef83","max":112,"min":0,"orientation":"horizontal","style":"IPY_MODEL_1152e3b558814204a94a058f0d506d20","value":112}},"06797e19a13d40df8c50322bf4b52f90":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"0fe42dba7c4b4df2a64ea2002be642cf":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"FloatProgressModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"FloatProgressModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"ProgressView","bar_style":"success","description":"","description_tooltip":null,"layout":"IPY_MODEL_d2d79f3cb2d8444fa36a6851a208e9f4","max":829,"min":0,"orientation":"horizontal","style":"IPY_MODEL_6b905fe4f7f542c9890f5cfd195beda3","value":829}},"1152e3b558814204a94a058f0d506d20":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5
.0","model_name":"ProgressStyleModel","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"ProgressStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","bar_color":null,"description_width":""}},"13f8de2b99ff475fbabbfba66b17125e":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"HBoxModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HBoxModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HBoxView","box_style":"","children":["IPY_MODEL_3f65fa34feac4f00a89cf74bdb8f5a59","IPY_MODEL_d35f7095c5be4de19b625a00a0ea1798","IPY_MODEL_5f050ded4d6d41509dbad6f17284c18c"],"layout":"IPY_MODEL_58ee9406c08144de989c5a26ed5a1ccb"}},"1db2003c9e124e4f8b6444c157636983":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"HTMLModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HTMLModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HTMLView","description":"","description_tooltip":null,"layout":"IPY_MODEL_314d5fcd2f864d8f941f90d76bd0df1b","placeholder":"​","style":"IPY_MODEL_b3f1bdcee72a47a791c6eaec72fdf136","value":"Downloading (…)in/added_tokens.json: 
100%"}},"1e99e26f69034ec79f52b512a608c4f7":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"265a066b8e564034af96902a1e0347fb":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overf
low_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"2ffcf6981af34b139f304200934abaee":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"DescriptionStyleModel","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"DescriptionStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","description_width":""}},"314d5fcd2f864d8f941f90d76bd0df1b":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"36e4580f19164f5d93280c4a06f0879c":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"DescriptionStyleModel","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"DescriptionStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","description_width":""}},"38476374aa9c49f68dcf96e55e520240":{"model_module":"@jupyter
-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"3bb1a7ff75c1490db0334e5162aba497":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null
,"width":null}},"3cbb3222ae2f4bb7b3dfd1a8c54a3503":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"DescriptionStyleModel","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"DescriptionStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","description_width":""}},"3ce8b14f48a24349b793b75d9350dc95":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"DescriptionStyleModel","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"DescriptionStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","description_width":""}},"3d6af6f687c54a5eb0db38b2c2ef1899":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"DescriptionStyleModel","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"DescriptionStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","description_width":""}},"3f1c56e797cf43588bad099a0783e179":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"HBoxModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HBoxModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HBoxView","box_style":"","children":["IPY_MODEL_85ff59f9f7ed47b5ab78f767255f5a56","IPY_MODEL_a12cdebdcf8b476f80b906d97a9ea261","IPY_MODEL_a404d4d49f2046bd85ea64cc2de4a734"],"layout":"IPY_MODEL_265a066b8e564034af96902a1e0347fb"}},"3f65fa34feac4f00a89cf74bdb8f5a59":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"HTMLModel","state":{"_dom_classes
":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HTMLModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HTMLView","description":"","description_tooltip":null,"layout":"IPY_MODEL_cee5a9b496cf435b9e746424187cad08","placeholder":"​","style":"IPY_MODEL_36e4580f19164f5d93280c4a06f0879c","value":"Downloading pytorch_model.bin: 100%"}},"4790617c8b18415d9cafaceabca7022d":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"501dd4b2d09b4ae993a5bc2f18769ac4":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_column
s":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"572b68362a1a4292a43d44ebad043042":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"DescriptionStyleModel","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"DescriptionStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","description_width":""}},"58ee9406c08144de989c5a26ed5a1ccb":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"5aee88b7efe54ce2b646b353cc61b26f":{
"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"DescriptionStyleModel","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"DescriptionStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","description_width":""}},"5c12f56844e546c3aaadc192b2583077":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"HBoxModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HBoxModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HBoxView","box_style":"","children":["IPY_MODEL_c96e0761975e4290be8f4b287e3f6f42","IPY_MODEL_81e8c8d107034c85aa95252a3838b05e","IPY_MODEL_771f704de66e4d0eab0a2cb71dd24d2f"],"layout":"IPY_MODEL_6cf03247c6374a6b89aaffee79998285"}},"5dce3c96154c4dbba25a57052804c82b":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"ProgressStyleModel","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"ProgressStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","bar_color":null,"description_width":""}},"5f050ded4d6d41509dbad6f17284c18c":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"HTMLModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HTMLModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HTMLView","description":"","description_tooltip":null,"layout":"IPY_MODEL_e24155477021432cb45793c0743eea1b","placeholder":"​","style":"IPY_MODEL_6fc068b143fc4e3391dd755e8262fcfd","value":" 433M/433M [00:05<00:00, 
38.8MB/s]"}},"6689169d7c9447a1bde80313e6e9a7c2":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"ProgressStyleModel","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"ProgressStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","bar_color":null,"description_width":""}},"6b905fe4f7f542c9890f5cfd195beda3":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"ProgressStyleModel","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"ProgressStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","bar_color":null,"description_width":""}},"6c506299e96344798ac6e36820e275bf":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"DescriptionStyleModel","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"DescriptionStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","description_width":""}},"6c5a9f6544e0442ca68098426c146503":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"HBoxModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HBoxModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HBoxView","box_style":"","children":["IPY_MODEL_e7f4a1278a7e49aba2ed734c228d0c66","IPY_MODEL_0fe42dba7c4b4df2a64ea2002be642cf","IPY_MODEL_9a06564cc5254e89a762426cf3269a9e"],"layout":"IPY_MODEL_3bb1a7ff75c1490db0334e5162aba497"}},"6cf03247c6374a6b89aaffee79998285":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","stat
e":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"6ee8c33165a946b8a15516a89203f396":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"6fc068b143fc4e3391dd755e8262fcfd":{"model_module":"@jupyter-w
idgets/controls","model_module_version":"1.5.0","model_name":"DescriptionStyleModel","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"DescriptionStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","description_width":""}},"771f704de66e4d0eab0a2cb71dd24d2f":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"HTMLModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HTMLModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HTMLView","description":"","description_tooltip":null,"layout":"IPY_MODEL_d54d2ab9930a447388f0fea290bef2ac","placeholder":"​","style":"IPY_MODEL_3ce8b14f48a24349b793b75d9350dc95","value":" 59.0/59.0 [00:00<00:00, 2.43kB/s]"}},"80393d9f400a4e9c8867808c5f2e8b28":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"HBoxModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HBoxModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HBoxView","box_style":"","children":["IPY_MODEL_eaff3fe3471b4815ab3d27d72142fe22","IPY_MODEL_030f52c161444051b7215c4cf1b4eb27","IPY_MODEL_d2726e7c8ebc4d0c9dbaf7d919bd064b"],"layout":"IPY_MODEL_012930dde07a4a56af962c6993ecbf03"}},"81e8c8d107034c85aa95252a3838b05e":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"FloatProgressModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"FloatProgressModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"ProgressView","bar_style":"success","description":"","description_t
ooltip":null,"layout":"IPY_MODEL_edc62d1193fc44e98784da4b1a3fa390","max":59,"min":0,"orientation":"horizontal","style":"IPY_MODEL_5dce3c96154c4dbba25a57052804c82b","value":59}},"83aa131ac111451495e97bd710631418":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"HTMLModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HTMLModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HTMLView","description":"","description_tooltip":null,"layout":"IPY_MODEL_38476374aa9c49f68dcf96e55e520240","placeholder":"​","style":"IPY_MODEL_6c506299e96344798ac6e36820e275bf","value":" 2.00/2.00 [00:00<00:00, 86.3B/s]"}},"85ff59f9f7ed47b5ab78f767255f5a56":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"HTMLModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HTMLModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HTMLView","description":"","description_tooltip":null,"layout":"IPY_MODEL_d136b83a79034d20971be55510737103","placeholder":"​","style":"IPY_MODEL_572b68362a1a4292a43d44ebad043042","value":"Downloading (…)solve/main/vocab.txt: 
100%"}},"966ad3248efb4f29a20f19b06c7dfa77":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"DescriptionStyleModel","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"DescriptionStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","description_width":""}},"9982d7d8bc634c77838aa10ccead8428":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"DescriptionStyleModel","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"DescriptionStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","description_width":""}},"9a06564cc5254e89a762426cf3269a9e":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"HTMLModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HTMLModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HTMLView","description":"","description_tooltip":null,"layout":"IPY_MODEL_b4d092df96e04c00a943f3c8e3329c39","placeholder":"​","style":"IPY_MODEL_966ad3248efb4f29a20f19b06c7dfa77","value":" 829/829 [00:00<00:00, 
15.0kB/s]"}},"a12cdebdcf8b476f80b906d97a9ea261":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"FloatProgressModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"FloatProgressModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"ProgressView","bar_style":"success","description":"","description_tooltip":null,"layout":"IPY_MODEL_db4fc546e2d344018f85cd3deeab1115","max":213450,"min":0,"orientation":"horizontal","style":"IPY_MODEL_ef8256e1afce4f268db5bc38a3a7fa86","value":213450}},"a404d4d49f2046bd85ea64cc2de4a734":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"HTMLModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HTMLModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HTMLView","description":"","description_tooltip":null,"layout":"IPY_MODEL_afa59a1ecaba4b56a596f4cdbd7b6730","placeholder":"​","style":"IPY_MODEL_3d6af6f687c54a5eb0db38b2c2ef1899","value":" 213k/213k [00:00<00:00, 
7.96MB/s]"}},"ab2b92ec8670443a9093c015c6084e95":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"afa59a1ecaba4b56a596f4cdbd7b6730":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"
overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"b3f1bdcee72a47a791c6eaec72fdf136":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"DescriptionStyleModel","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"DescriptionStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","description_width":""}},"b4d092df96e04c00a943f3c8e3329c39":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"c749fd792c914727b6ff4386b316df57":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"ProgressStyleModel","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"ProgressStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","bar_color":null,"description_width":""}},"c96e0761975e4290be8f4b287e3f6f42":{"model_m
odule":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"HTMLModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HTMLModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HTMLView","description":"","description_tooltip":null,"layout":"IPY_MODEL_501dd4b2d09b4ae993a5bc2f18769ac4","placeholder":"​","style":"IPY_MODEL_3cbb3222ae2f4bb7b3dfd1a8c54a3503","value":"Downloading (…)okenizer_config.json: 100%"}},"cee5a9b496cf435b9e746424187cad08":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"d136b83a79034d20971be55510737103":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"alig
n_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"d2726e7c8ebc4d0c9dbaf7d919bd064b":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"HTMLModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HTMLModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HTMLView","description":"","description_tooltip":null,"layout":"IPY_MODEL_6ee8c33165a946b8a15516a89203f396","placeholder":"​","style":"IPY_MODEL_9982d7d8bc634c77838aa10ccead8428","value":" 112/112 [00:00<00:00, 
7.97kB/s]"}},"d2906819c5ec4c82b2731eac4afe519d":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"HBoxModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HBoxModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HBoxView","box_style":"","children":["IPY_MODEL_1db2003c9e124e4f8b6444c157636983","IPY_MODEL_eaae6733b12047ac9edec3adff0ab765","IPY_MODEL_83aa131ac111451495e97bd710631418"],"layout":"IPY_MODEL_f818144e1d304afd8b15900620abfc1d"}},"d2d79f3cb2d8444fa36a6851a208e9f4":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"d35f7095c5be4de19b625a00a0ea1798":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"FloatProgressModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"FloatProgressModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_vie
w_module_version":"1.5.0","_view_name":"ProgressView","bar_style":"success","description":"","description_tooltip":null,"layout":"IPY_MODEL_1e99e26f69034ec79f52b512a608c4f7","max":433316646,"min":0,"orientation":"horizontal","style":"IPY_MODEL_c749fd792c914727b6ff4386b316df57","value":433316646}},"d54d2ab9930a447388f0fea290bef2ac":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"db4fc546e2d344018f85cd3deeab1115":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_templat
e_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"e24155477021432cb45793c0743eea1b":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"e7f4a1278a7e49aba2ed734c228d0c66":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"HTMLModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HTMLModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HTMLView","description":"","description_tooltip":null,"layout":"IPY_MODEL_4790617c8b18415d9cafaceabca7022d","placeholder":"​","style":"IPY_MODEL_2ffcf6981af34b139f304200934abaee","value":"Downloading 
(…)lve/main/config.json: 100%"}},"eaae6733b12047ac9edec3adff0ab765":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"FloatProgressModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"FloatProgressModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"ProgressView","bar_style":"success","description":"","description_tooltip":null,"layout":"IPY_MODEL_06797e19a13d40df8c50322bf4b52f90","max":2,"min":0,"orientation":"horizontal","style":"IPY_MODEL_6689169d7c9447a1bde80313e6e9a7c2","value":2}},"eaff3fe3471b4815ab3d27d72142fe22":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"HTMLModel","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"HTMLModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"HTMLView","description":"","description_tooltip":null,"layout":"IPY_MODEL_ab2b92ec8670443a9093c015c6084e95","placeholder":"​","style":"IPY_MODEL_5aee88b7efe54ce2b646b353cc61b26f","value":"Downloading (…)cial_tokens_map.json: 
100%"}},"edc62d1193fc44e98784da4b1a3fa390":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"ef8256e1afce4f268db5bc38a3a7fa86":{"model_module":"@jupyter-widgets/controls","model_module_version":"1.5.0","model_name":"ProgressStyleModel","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"ProgressStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","bar_color":null,"description_width":""}},"f80fde8b46ba4c7fad2f867f2439ef83":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid
_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"f818144e1d304afd8b15900620abfc1d":{"model_module":"@jupyter-widgets/base","model_module_version":"1.2.0","model_name":"LayoutModel","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}}}}},"nbformat":4,"nbformat_minor":0} diff --git a/demo/tutorials/misc/Templatic_Augmentation_Notebook.ipynb b/demo/tutorials/misc/Templatic_Augmentation_Notebook.ipynb index 739c9461b..3c6c85f40 100644 --- a/demo/tutorials/misc/Templatic_Augmentation_Notebook.ipynb +++ b/demo/tutorials/misc/Templatic_Augmentation_Notebook.ipynb @@ -1 +1,2654 @@ 
-{"cells":[{"cell_type":"markdown","metadata":{"id":"e7PsSmy9sCoR"},"source":["![image.png]()"]},{"cell_type":"markdown","metadata":{"id":"MhgkQYQiEvZt"},"source":["[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/langtest/blob/main/demo/tutorials/misc/Templatic_Augmentation_Notebook.ipynb)"]},{"cell_type":"markdown","metadata":{"id":"WJJzt3RWhEc6"},"source":["**LangTest** is an open-source python library designed to help developers deliver safe and effective Natural Language Processing (NLP) models. Whether you are using **John Snow Labs, Hugging Face, Spacy** models or **OpenAI, Cohere, AI21, Hugging Face Inference API and Azure-OpenAI** based LLMs, it has got you covered. You can test any Named Entity Recognition (NER), Text Classification model using the library. We also support testing LLMS for Question-Answering and Summarization tasks on benchmark datasets. The library supports 50+ out of the box tests. These tests fall into robustness, accuracy, bias, representation, toxicity and fairness test categories.\n","\n","Metrics are calculated by comparing the model's extractions in the original list of sentences against the extractions carried out in the noisy list of sentences. The original annotated labels are not used at any point, we are simply comparing the model against itself in a 2 settings."]},{"cell_type":"markdown","metadata":{"id":"26qXWhCYhHAt"},"source":["# Getting started with LangTest on John Snow Labs"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"oGIyE43uhTxH"},"outputs":[],"source":["!pip install langtest[johnsnowlabs]"]},{"cell_type":"markdown","metadata":{"id":"yR6kjOaiheKN"},"source":["# Harness and its Parameters\n","\n","The Harness class is a testing class for Natural Language Processing (NLP) models. 
It evaluates the performance of an NLP model on a given task using test data and generates a report with test results. Harness can be imported from the LangTest library in the following way."]},{"cell_type":"code","execution_count":null,"metadata":{"id":"lTzSJpMlhgq5"},"outputs":[],"source":["#Import Harness from the LangTest library\n","from langtest import Harness"]},{"cell_type":"markdown","metadata":{"id":"sBcZjwJBhkOw"},"source":["This imports the Harness class, which is designed to provide a blueprint or framework for conducting NLP testing. Instances of the Harness class can be customized or configured for different testing scenarios or environments.\n","\n","Here is a list of the different parameters that can be passed to the Harness class:\n","\n","
\n","\n","\n","\n","| Parameter | Description |\n","| - | - |\n","| **task** | Task for which the model is to be evaluated (text-classification or ner) |\n","| **model** | Specifies the model(s) to be evaluated. This parameter can be provided as either a dictionary or a list of dictionaries. Each dictionary should contain the following keys:
  • model (mandatory): PipelineModel or path to a saved model or pretrained pipeline/model from hub.
  • hub (mandatory): Hub (library) used in the back end to load the model from a public model hub or from a path.
|\n","| **data** | The data to be used for evaluation. A dictionary providing flexibility and options for data sources. It should include the following keys:
  • data_source (mandatory): The source of the data.
  • subset (optional): The subset of the data.
  • feature_column (optional): The column containing the features.
  • target_column (optional): The column containing the target labels.
  • split (optional): The data split to be used.
  • source (optional): Set to 'huggingface' when loading a Hugging Face dataset.
|\n","| **config** | Configuration for the tests to be performed, specified in the form of a YAML file. |\n","\n","\n","
\n","
"]},{"cell_type":"markdown","metadata":{"id":"JFhJ9CcbsKqN"},"source":["# Real-World Project Workflows\n","\n","In this section, we dive into complete workflows for using the model testing module in real-world project settings."]},{"cell_type":"markdown","metadata":{"id":"UtxtE6Y0r4CJ"},"source":["## Robustness Testing\n","\n","In this example, we will be testing a model's robustness. We will be applying 2 tests: add_typo and lowercase. The real-world project workflow of the model robustness testing and fixing in this case goes as follows:\n","\n","1. Train NER model on original CoNLL training set\n","\n","2. Test NER model robustness on CoNLL test set\n","\n","3. Augment CoNLL training set based on test results\n","\n","4. Train new NER model on augmented CoNLL training set\n","\n","5. Test new NER model robustness on the CoNLL test set from step 2\n","\n","6. Compare robustness of new NER model against original NER model"]},{"cell_type":"markdown","metadata":{"id":"I21Jmq79jgC6"},"source":["#### Load Train and Test CoNLL"]},{"cell_type":"code","execution_count":null,"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"elapsed":1477,"status":"ok","timestamp":1692342633486,"user":{"displayName":"Prikshit sharma","userId":"07819241395213139913"},"user_tz":-330},"id":"6uW22VqJje8E","outputId":"ff7e597d-9ec3-41ce-e006-0c251dc96183"},"outputs":[{"name":"stdout","output_type":"stream","text":["--2023-08-18 07:10:30-- https://raw.githubusercontent.com/JohnSnowLabs/langtest/main/langtest/data/conll/sample.conll\n","Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...\n","Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.\n","HTTP request sent, awaiting response... 
200 OK\n","Length: 50519 (49K) [text/plain]\n","Saving to: ‘sample.conll’\n","\n","\rsample.conll 0%[ ] 0 --.-KB/s \rsample.conll 100%[===================>] 49.33K --.-KB/s in 0.003s \n","\n","2023-08-18 07:10:30 (15.6 MB/s) - ‘sample.conll’ saved [50519/50519]\n","\n","--2023-08-18 07:10:30-- https://raw.githubusercontent.com/JohnSnowLabs/langtest/main/demo/data/conll03.conll\n","Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...\n","Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.\n","HTTP request sent, awaiting response... 200 OK\n","Length: 827443 (808K) [text/plain]\n","Saving to: ‘conll03.conll’\n","\n","conll03.conll 100%[===================>] 808.05K --.-KB/s in 0.02s \n","\n","2023-08-18 07:10:31 (42.3 MB/s) - ‘conll03.conll’ saved [827443/827443]\n","\n"]}],"source":["# Load test CoNLL\n","!wget https://raw.githubusercontent.com/JohnSnowLabs/langtest/main/langtest/data/conll/sample.conll\n","\n","# Load train CoNLL\n","!wget https://raw.githubusercontent.com/JohnSnowLabs/langtest/main/demo/data/conll03.conll"]},{"cell_type":"markdown","metadata":{"id":"MNtH_HOUt_PL"},"source":["#### Step 1: Train NER Model"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"jRnEmCfPhsZs"},"outputs":[],"source":["from johnsnowlabs import nlp"]},{"cell_type":"code","execution_count":null,"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"elapsed":337965,"status":"ok","timestamp":1692342977578,"user":{"displayName":"Prikshit sharma","userId":"07819241395213139913"},"user_tz":-330},"id":"bHXeP18sGp-g","outputId":"7ba0e6d9-0675-44d1-b601-98d415230949"},"outputs":[{"name":"stdout","output_type":"stream","text":["Warning::Spark Session already created, some configs may not take.\n","small_bert_L2_128 download started this may take some time.\n","Approximate size to download 16.1 MB\n","[OK!]\n"]}],"source":["ner_model 
= nlp.load('bert train.ner').fit(dataset_path=\"/content/conll03.conll\")\n"]},{"cell_type":"markdown","metadata":{"id":"kKgXC7cvuyar"},"source":["#### Step 2: Test NER Model Robustness "]},{"cell_type":"code","execution_count":null,"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"elapsed":832,"status":"ok","timestamp":1692342978351,"user":{"displayName":"Prikshit sharma","userId":"07819241395213139913"},"user_tz":-330},"id":"RVk9NWn7u-Lm","outputId":"73756c32-b1ec-42f7-ddf2-e33204b9a5dc"},"outputs":[{"name":"stdout","output_type":"stream","text":["Test Configuration : \n"," {\n"," \"tests\": {\n"," \"defaults\": {\n"," \"min_pass_rate\": 1.0\n"," },\n"," \"robustness\": {\n"," \"add_typo\": {\n"," \"min_pass_rate\": 0.7\n"," },\n"," \"american_to_british\": {\n"," \"min_pass_rate\": 0.7\n"," }\n"," },\n"," \"accuracy\": {\n"," \"min_micro_f1_score\": {\n"," \"min_score\": 0.7\n"," }\n"," },\n"," \"bias\": {\n"," \"replace_to_female_pronouns\": {\n"," \"min_pass_rate\": 0.7\n"," },\n"," \"replace_to_low_income_country\": {\n"," \"min_pass_rate\": 0.7\n"," }\n"," },\n"," \"fairness\": {\n"," \"min_gender_f1_score\": {\n"," \"min_score\": 0.6\n"," }\n"," },\n"," \"representation\": {\n"," \"min_label_representation_count\": {\n"," \"min_count\": 50\n"," }\n"," }\n"," }\n","}\n"]}],"source":["harness = Harness(task=\"ner\", model={\"model\": ner_model, \"hub\": \"johnsnowlabs\"}, data={\"data_source\":\"sample.conll\"})"]},{"cell_type":"code","execution_count":null,"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"elapsed":18,"status":"ok","timestamp":1692342978353,"user":{"displayName":"Prikshit sharma","userId":"07819241395213139913"},"user_tz":-330},"id":"mynkAUwZyuFN","outputId":"bca2f807-40f2-4767-f176-33103c31a9e3"},"outputs":[{"data":{"text/plain":["{'tests': {'defaults': {'min_pass_rate': 0.65},\n"," 'robustness': {'add_typo': {'min_pass_rate': 0.73},\n"," 'lowercase': {'min_pass_rate': 
0.65}}}}"]},"execution_count":7,"metadata":{},"output_type":"execute_result"}],"source":["harness.configure({\n"," 'tests': {\n"," 'defaults': {'min_pass_rate': 0.65},\n","\n"," 'robustness': {\n"," 'add_typo': {'min_pass_rate': 0.73},\n"," 'lowercase':{'min_pass_rate': 0.65},\n"," }\n"," }\n","})"]},{"cell_type":"markdown","metadata":{"id":"ZPU46A7WigFr"},"source":["Here we have configured the harness to perform two robustness tests (add_typo and lowercase) and defined the minimum pass rate for each test."]},{"cell_type":"markdown","metadata":{"id":"MomLlmTwjpzU"},"source":["\n","#### Generating the test cases.\n","\n","\n"]},{"cell_type":"code","execution_count":null,"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"elapsed":27812,"status":"ok","timestamp":1692343006155,"user":{"displayName":"Prikshit sharma","userId":"07819241395213139913"},"user_tz":-330},"id":"UiUNzTwF89ye","outputId":"4dc12bb6-808c-4d6b-824b-439cb3e81128"},"outputs":[{"name":"stderr","output_type":"stream","text":["Generating testcases...: 100%|██████████| 1/1 [00:00<00:00, 263.51it/s]\n"]},{"data":{"text/plain":[]},"execution_count":8,"metadata":{},"output_type":"execute_result"}],"source":["harness.generate()"]},{"cell_type":"markdown","metadata":{"id":"UiMIF-o49Bg_"},"source":["harness.generate() method automatically generates the test cases (based on the provided configuration)"]},{"cell_type":"code","execution_count":null,"metadata":{"colab":{"base_uri":"https://localhost:8080/","height":423},"executionInfo":{"elapsed":25,"status":"ok","timestamp":1692343006156,"user":{"displayName":"Prikshit sharma","userId":"07819241395213139913"},"user_tz":-330},"id":"p0tTwFfc891k","outputId":"b8741a7a-c1cd-4b30-d081-0a92c9c522f7"},"outputs":[{"data":{"text/html":["\n","
\n","
\n","\n","\n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n","
categorytest_typeoriginaltest_case
0robustnessadd_typoSOCCER - JAPAN GET LUCKY WIN , CHINA IN SURPRI...SOCCER - JABAN GET LUCKY WIN , CHINA IN SURPRI...
1robustnessadd_typoNadim LadkiNadim Ladkl
2robustnessadd_typoAL-AIN , United Arab Emirates 1996-12-06AL-AIN , United Atab Emirates 1996-12-06
3robustnessadd_typoJapan began the defence of their Asian Cup tit...Japan began the defence of their Asian Cup tit...
4robustnessadd_typoBut China saw their luck desert them in the se...But China saw their luck desert them in the se...
...............
447robustnesslowercasePortuguesa 1 Atletico Mineiro 0portuguesa 1 atletico mineiro 0
448robustnesslowercaseCRICKET - LARA ENDURES ANOTHER MISERABLE DAY .cricket - lara endures another miserable day .
449robustnesslowercaseRobert Galvinrobert galvin
450robustnesslowercaseMELBOURNE 1996-12-06melbourne 1996-12-06
451robustnesslowercaseAustralia gave Brian Lara another reason to be...australia gave brian lara another reason to be...
\n","

452 rows × 4 columns

\n","
\n","
\n","\n","
\n"," \n","\n"," \n","\n"," \n","
\n","\n","\n","
\n"," \n","\n","\n","\n"," \n","
\n","
\n","
\n"],"text/plain":[" category test_type original \\\n","0 robustness add_typo SOCCER - JAPAN GET LUCKY WIN , CHINA IN SURPRI... \n","1 robustness add_typo Nadim Ladki \n","2 robustness add_typo AL-AIN , United Arab Emirates 1996-12-06 \n","3 robustness add_typo Japan began the defence of their Asian Cup tit... \n","4 robustness add_typo But China saw their luck desert them in the se... \n",".. ... ... ... \n","447 robustness lowercase Portuguesa 1 Atletico Mineiro 0 \n","448 robustness lowercase CRICKET - LARA ENDURES ANOTHER MISERABLE DAY . \n","449 robustness lowercase Robert Galvin \n","450 robustness lowercase MELBOURNE 1996-12-06 \n","451 robustness lowercase Australia gave Brian Lara another reason to be... \n","\n"," test_case \n","0 SOCCER - JABAN GET LUCKY WIN , CHINA IN SURPRI... \n","1 Nadim Ladkl \n","2 AL-AIN , United Atab Emirates 1996-12-06 \n","3 Japan began the defence of their Asian Cup tit... \n","4 But China saw their luck desert them in the se... \n",".. ... \n","447 portuguesa 1 atletico mineiro 0 \n","448 cricket - lara endures another miserable day . \n","449 robert galvin \n","450 melbourne 1996-12-06 \n","451 australia gave brian lara another reason to be... 
\n","\n","[452 rows x 4 columns]"]},"execution_count":9,"metadata":{},"output_type":"execute_result"}],"source":["harness.testcases()"]},{"cell_type":"markdown","metadata":{"id":"nRgq7e-g9Gev"},"source":["The harness.testcases() method returns the generated test cases as a pandas DataFrame."]},{"cell_type":"markdown","metadata":{"id":"IaPBjl_R9slh"},"source":["#### Saving test configurations, data, test cases"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"ba0MYutC96CN"},"outputs":[],"source":["harness.save(\"saved_test_configurations\")"]},{"cell_type":"markdown","metadata":{"id":"groBqKuD9I34"},"source":["#### Running the tests"]},{"cell_type":"code","execution_count":null,"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"elapsed":81932,"status":"ok","timestamp":1692343088818,"user":{"displayName":"Prikshit sharma","userId":"07819241395213139913"},"user_tz":-330},"id":"CHQHRbQb9EDi","outputId":"44621987-fd79-46bf-cf6e-beba8cc7dcee"},"outputs":[{"name":"stderr","output_type":"stream","text":["Running testcases... : 100%|██████████| 452/452 [01:22<00:00, 5.51it/s]\n"]},{"data":{"text/plain":[]},"execution_count":11,"metadata":{},"output_type":"execute_result"}],"source":["harness.run()"]},{"cell_type":"markdown","metadata":{"id":"71zHGe2q9O6G"},"source":["harness.run() is called after harness.generate() and runs all the tests. It returns a pass/fail flag for each test."]},{"cell_type":"code","execution_count":null,"metadata":{"colab":{"base_uri":"https://localhost:8080/","height":545},"executionInfo":{"elapsed":51,"status":"ok","timestamp":1692343088821,"user":{"displayName":"Prikshit sharma","userId":"07819241395213139913"},"user_tz":-330},"id":"keBNodfJ894u","outputId":"4f0aea52-ae9a-4bad-b0a7-d87a42a324b1"},"outputs":[{"data":{"text/html":["\n","
\n","
\n","\n","\n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n","
categorytest_typeoriginaltest_caseexpected_resultactual_resultpass
0robustnessadd_typoSOCCER - JAPAN GET LUCKY WIN , CHINA IN SURPRI...SOCCER - JABAN GET LUCKY WIN , CHINA IN SURPRI...japan: LOC, china: LOCjaban: PER, china: LOCFalse
1robustnessadd_typoNadim LadkiNadim Ladklnadim ladki: PERnadim ladkl: PERTrue
2robustnessadd_typoAL-AIN , United Arab Emirates 1996-12-06AL-AIN , United Atab Emirates 1996-12-06al-ain: LOC, united arab emirates: LOCal-ain: LOC, united atab emirates: LOCTrue
3robustnessadd_typoJapan began the defence of their Asian Cup tit...Japan began the defence of their Asian Cup tit...japan: LOC, asian cup: MISC, syria: LOCjapan: LOC, asian cup: MISC, syria: LOC, champ...True
4robustnessadd_typoBut China saw their luck desert them in the se...But China saw their luck desert them in the se...china: LOC, uzbekistan: LOCchina: LOC, uzbekistan: LOCTrue
........................
447robustnesslowercasePortuguesa 1 Atletico Mineiro 0portuguesa 1 atletico mineiro 0portuguesa: ORG, atletico: ORG, mineiro: ORGportuguesa: ORG, atletico: ORG, mineiro: ORGTrue
448robustnesslowercaseCRICKET - LARA ENDURES ANOTHER MISERABLE DAY .cricket - lara endures another miserable day .lara: PERlara: PERTrue
449robustnesslowercaseRobert Galvinrobert galvinrobert galvin: PERrobert galvin: PERTrue
450robustnesslowercaseMELBOURNE 1996-12-06melbourne 1996-12-06melbourne: LOCmelbourne: LOCTrue
451robustnesslowercaseAustralia gave Brian Lara another reason to be...australia gave brian lara another reason to be...australia: LOC, brian lara: PER, west: LOCaustralia: LOC, brian lara: PER, west: LOCTrue
\n","

452 rows × 7 columns

\n","
\n","
\n","\n","
\n"," \n","\n"," \n","\n"," \n","
\n","\n","\n","
\n"," \n","\n","\n","\n"," \n","
\n","
\n","
\n"],"text/plain":[" category test_type original \\\n","0 robustness add_typo SOCCER - JAPAN GET LUCKY WIN , CHINA IN SURPRI... \n","1 robustness add_typo Nadim Ladki \n","2 robustness add_typo AL-AIN , United Arab Emirates 1996-12-06 \n","3 robustness add_typo Japan began the defence of their Asian Cup tit... \n","4 robustness add_typo But China saw their luck desert them in the se... \n",".. ... ... ... \n","447 robustness lowercase Portuguesa 1 Atletico Mineiro 0 \n","448 robustness lowercase CRICKET - LARA ENDURES ANOTHER MISERABLE DAY . \n","449 robustness lowercase Robert Galvin \n","450 robustness lowercase MELBOURNE 1996-12-06 \n","451 robustness lowercase Australia gave Brian Lara another reason to be... \n","\n"," test_case \\\n","0 SOCCER - JABAN GET LUCKY WIN , CHINA IN SURPRI... \n","1 Nadim Ladkl \n","2 AL-AIN , United Atab Emirates 1996-12-06 \n","3 Japan began the defence of their Asian Cup tit... \n","4 But China saw their luck desert them in the se... \n",".. ... \n","447 portuguesa 1 atletico mineiro 0 \n","448 cricket - lara endures another miserable day . \n","449 robert galvin \n","450 melbourne 1996-12-06 \n","451 australia gave brian lara another reason to be... \n","\n"," expected_result \\\n","0 japan: LOC, china: LOC \n","1 nadim ladki: PER \n","2 al-ain: LOC, united arab emirates: LOC \n","3 japan: LOC, asian cup: MISC, syria: LOC \n","4 china: LOC, uzbekistan: LOC \n",".. ... \n","447 portuguesa: ORG, atletico: ORG, mineiro: ORG \n","448 lara: PER \n","449 robert galvin: PER \n","450 melbourne: LOC \n","451 australia: LOC, brian lara: PER, west: LOC \n","\n"," actual_result pass \n","0 jaban: PER, china: LOC False \n","1 nadim ladkl: PER True \n","2 al-ain: LOC, united atab emirates: LOC True \n","3 japan: LOC, asian cup: MISC, syria: LOC, champ... True \n","4 china: LOC, uzbekistan: LOC True \n",".. ... ... 
\n","447 portuguesa: ORG, atletico: ORG, mineiro: ORG True \n","448 lara: PER True \n","449 robert galvin: PER True \n","450 melbourne: LOC True \n","451 australia: LOC, brian lara: PER, west: LOC True \n","\n","[452 rows x 7 columns]"]},"execution_count":12,"metadata":{},"output_type":"execute_result"}],"source":["harness.generated_results()"]},{"cell_type":"markdown","metadata":{"id":"57lqGecA9UXG"},"source":["This method returns the generated results in the form of a pandas dataframe, which provides a convenient and easy-to-use format for working with the test results. You can use this method to quickly identify the test cases that failed and to determine where fixes are needed."]},{"cell_type":"markdown","metadata":{"id":"jPvPCr_S9Zb8"},"source":["#### Report of the tests"]},{"cell_type":"code","execution_count":null,"metadata":{"colab":{"base_uri":"https://localhost:8080/","height":112},"executionInfo":{"elapsed":43,"status":"ok","timestamp":1692343088822,"user":{"displayName":"Prikshit sharma","userId":"07819241395213139913"},"user_tz":-330},"id":"gp57HcF9yxi7","outputId":"b29fc543-331d-4b7e-c599-1e23b2cd6982"},"outputs":[{"data":{"text/html":["\n","
\n","
\n","\n","\n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n","
categorytest_typefail_countpass_countpass_rateminimum_pass_ratepass
0robustnessadd_typo5816874%73%True
1robustnesslowercase0226100%65%True
\n","
\n","
\n","\n","
\n"," \n","\n"," \n","\n"," \n","
\n","\n","\n","
\n"," \n","\n","\n","\n"," \n","
\n","
\n","
\n"],"text/plain":[" category test_type fail_count pass_count pass_rate minimum_pass_rate \\\n","0 robustness add_typo 58 168 74% 73% \n","1 robustness lowercase 0 226 100% 65% \n","\n"," pass \n","0 True \n","1 True "]},"execution_count":13,"metadata":{},"output_type":"execute_result"}],"source":["harness.report()"]},{"cell_type":"markdown","metadata":{"id":"7rpJ3QbPinkT"},"source":["The report summarizes the results, giving the pass and fail counts and an overall pass/fail flag for each test."]},{"cell_type":"markdown","metadata":{"id":"3g-s1Gikv65h"},"source":["#### Step 3: Augment CoNLL Training Set Based on Robustness Test Results"]},{"cell_type":"markdown","metadata":{"id":"JqMbXhF11rmX"},"source":["Templatic Augmentation is a technique that allows you to generate new training data by applying a set of predefined templates to the original training data. The templates are designed to introduce noise into the training data in a way that simulates real-world conditions. The augmentation process is controlled by a configuration file that specifies the augmentation templates to be used and the proportion of the training data to be augmented. The augmentation process is performed by the augment() method of the **Harness** class.\n","\n","**Augmentation with templates**\n","\n","Templatic augmentation is controlled by the templates applied to the training data being augmented, for example:\n","\n","```\n","templates = [\"The {ORG} company is located in {LOC}\", \"The {ORG} company is located in {LOC} and is owned by {PER}\"]\n","\n","```\n"]},{"cell_type":"markdown","metadata":{"id":"PI75iT-F1rmX"},"source":["The `.augment()` method takes the following parameters:\n","\n","- `training_data` (dict): (Required) Specifies the source of the original training data. 
It should be a dictionary containing the necessary information about the dataset.\n","- `save_data_path` (str): (Required) Name of the file in which the augmented dataset will be saved.\n","- `templates` (list): List of template strings, or a CoNLL file, to be used for augmentation."]},{"cell_type":"code","execution_count":null,"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"elapsed":7166,"status":"ok","timestamp":1692343095954,"user":{"displayName":"Prikshit sharma","userId":"07819241395213139913"},"user_tz":-330},"id":"EBTz4Fqev7xX","outputId":"5828a60c-04f6-4018-e4e9-ff79b43558a5"},"outputs":[{"data":{"text/plain":[]},"execution_count":14,"metadata":{},"output_type":"execute_result"}],"source":["data_kwargs = {\n"," \"data_source\" : \"conll03.conll\",\n"," }\n","\n","harness.augment(\n"," training_data=data_kwargs,\n"," save_data_path='augmented_conll03.conll',\n"," templates=[\"The {ORG} company is located in {LOC}\", \"The {ORG} company is located in {LOC} and is owned by {PER}\"],\n"," )"]},{"cell_type":"markdown","metadata":{"id":"O2HL6Gip0ST0"},"source":["Essentially, it applies perturbations to the input data based on the recommendations from the harness reports. 
Then this augmented_dataset is used to retrain the original model so as to make the model more robust and improve its performance."]},{"cell_type":"code","execution_count":null,"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"elapsed":35,"status":"ok","timestamp":1692343095957,"user":{"displayName":"Prikshit sharma","userId":"07819241395213139913"},"user_tz":-330},"id":"tKOgWXL145WR","outputId":"1a739981-5444-48a8-8832-c24c1b1511c2"},"outputs":[{"name":"stdout","output_type":"stream","text":["The -X- -X- O\n","LG -X- -X- B-ORG\n","company -X- -X- O\n","is -X- -X- O\n","located -X- -X- O\n","in -X- -X- O\n","Iraq -X- -X- B-LOC\n","\n","The -X- -X- O\n","Charlton -X- -X- B-ORG\n","company -X- -X- O\n","is -X- -X- O\n","located -X- -X- O\n","in -X- -X- O\n","Afghanistan -X- -X- B-LOC\n","\n","The -X- -X- O\n","Dow -X- -X- B-ORG\n","Chemical -X- -X- I-ORG\n","Co -X- -X- I-ORG\n"]}],"source":["!head -n 20 augmented_conll03.conll"]},{"cell_type":"markdown","metadata":{"id":"z4aCF0kYwL4w"},"source":["#### Step 4: Train New NER Model on Augmented CoNLL"]},{"cell_type":"code","execution_count":null,"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"elapsed":171669,"status":"ok","timestamp":1692343267610,"user":{"displayName":"Prikshit sharma","userId":"07819241395213139913"},"user_tz":-330},"id":"WvRFmf3PGz3k","outputId":"a09ac6ea-7eb3-4c98-c839-f0925cdde057"},"outputs":[{"name":"stdout","output_type":"stream","text":["Warning::Spark Session already created, some configs may not take.\n","Warning::Spark Session already created, some configs may not take.\n","small_bert_L2_128 download started this may take some time.\n","Approximate size to download 16.1 MB\n","[OK!]\n"]}],"source":["augmented_ner_model = nlp.load('bert train.ner').fit(dataset_path= \"augmented_conll03.conll\")"]},{"cell_type":"markdown","metadata":{"id":"QK8o7XaI_ZAf"},"source":["#### Load saved test configurations, 
data"]},{"cell_type":"code","execution_count":null,"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"elapsed":20448,"status":"ok","timestamp":1692343287998,"user":{"displayName":"Prikshit sharma","userId":"07819241395213139913"},"user_tz":-330},"id":"UpaSjj05_fPd","outputId":"cec4e7a9-a81e-46ac-f5b9-81df3991e012"},"outputs":[{"name":"stdout","output_type":"stream","text":["Test Configuration : \n"," {\n"," \"tests\": {\n"," \"defaults\": {\n"," \"min_pass_rate\": 0.65\n"," },\n"," \"robustness\": {\n"," \"add_typo\": {\n"," \"min_pass_rate\": 0.73\n"," },\n"," \"lowercase\": {\n"," \"min_pass_rate\": 0.65\n"," }\n"," }\n"," }\n","}\n"]},{"name":"stderr","output_type":"stream","text":["Generating testcases...: 100%|██████████| 1/1 [00:00<00:00, 506.68it/s]\n"]}],"source":["harness = Harness.load(\"saved_test_configurations\",model=augmented_ner_model, task=\"ner\")"]},{"cell_type":"markdown","metadata":{"id":"9aif5bl_G0GZ"},"source":["#### Step 5: Test New NER Model Robustness"]},{"cell_type":"code","execution_count":null,"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"elapsed":70937,"status":"ok","timestamp":1692343358875,"user":{"displayName":"Prikshit sharma","userId":"07819241395213139913"},"user_tz":-330},"id":"StrOVtMoAQpf","outputId":"2b264ad3-ce80-458e-91dc-8f13672fe95f"},"outputs":[{"name":"stderr","output_type":"stream","text":["Running testcases... : 100%|██████████| 452/452 [01:10<00:00, 6.42it/s]\n"]},{"data":{"text/plain":[]},"execution_count":18,"metadata":{},"output_type":"execute_result"}],"source":["harness.run()"]},{"cell_type":"code","execution_count":null,"metadata":{"colab":{"base_uri":"https://localhost:8080/","height":562},"executionInfo":{"elapsed":82,"status":"ok","timestamp":1692343358877,"user":{"displayName":"Prikshit sharma","userId":"07819241395213139913"},"user_tz":-330},"id":"znh2xqQmAWHf","outputId":"513f8838-2ba6-4cb1-adf8-20f19afea37b"},"outputs":[{"data":{"text/html":["\n","
\n","
\n","\n","\n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n","
categorytest_typeoriginaltest_caseexpected_resultactual_resultpass
0robustnessadd_typoSOCCER - JAPAN GET LUCKY WIN , CHINA IN SURPRI...SOCCER - JAPAN GET LUCKY WIN , CHINA IN SURYRI...soccer - japan get lucky win , china in surpri...soccer - japan get lucky win , china in suryri...True
1robustnessadd_typoNadim LadkiNadin Ladkinadim ladki: ORGnadin ladki: ORGTrue
2robustnessadd_typoAL-AIN , United Arab Emirates 1996-12-06AL-AIN , United Arab Rmirates 1996-12-06al-ain: PER, , united arab emirates 1996-12-06...al-ain , united arab rmirates 1996-12-06: ORGFalse
3robustnessadd_typoJapan began the defence of their Asian Cup tit...Japan began the defence of their Asian Cyp tit...japan began: ORG, defence of their asian cup t...japan began: ORG, defence of their asian cyp t...True
4robustnessadd_typoBut China saw their luck desert them in the se...But China saw their luck desert them in the se...but china saw their luck desert them in the se...but china saw their luck desert them in the se...True
........................
447robustnesslowercasePortuguesa 1 Atletico Mineiro 0portuguesa 1 atletico mineiro 0portuguesa 1 atletico mineiro 0: ORGportuguesa 1 atletico mineiro 0: ORGTrue
448robustnesslowercaseCRICKET - LARA ENDURES ANOTHER MISERABLE DAY .cricket - lara endures another miserable day .cricket - lara endures another miserable day: ORGcricket - lara endures another miserable day: ORGTrue
449robustnesslowercaseRobert Galvinrobert galvinrobert galvin: PERrobert galvin: PERTrue
450robustnesslowercaseMELBOURNE 1996-12-06melbourne 1996-12-06melbourne: PER, 1996-12-06: ORGmelbourne: PER, 1996-12-06: ORGTrue
451robustnesslowercaseAustralia gave Brian Lara another reason to be...australia gave brian lara another reason to be...australia gave brian lara another reason to be...australia gave brian lara another reason to be...True
\n","

452 rows × 7 columns

\n","
\n","
\n","\n","
\n"," \n","\n"," \n","\n"," \n","
\n","\n","\n","
\n"," \n","\n","\n","\n"," \n","
\n","
\n","
\n"],"text/plain":[" category test_type original \\\n","0 robustness add_typo SOCCER - JAPAN GET LUCKY WIN , CHINA IN SURPRI... \n","1 robustness add_typo Nadim Ladki \n","2 robustness add_typo AL-AIN , United Arab Emirates 1996-12-06 \n","3 robustness add_typo Japan began the defence of their Asian Cup tit... \n","4 robustness add_typo But China saw their luck desert them in the se... \n",".. ... ... ... \n","447 robustness lowercase Portuguesa 1 Atletico Mineiro 0 \n","448 robustness lowercase CRICKET - LARA ENDURES ANOTHER MISERABLE DAY . \n","449 robustness lowercase Robert Galvin \n","450 robustness lowercase MELBOURNE 1996-12-06 \n","451 robustness lowercase Australia gave Brian Lara another reason to be... \n","\n"," test_case \\\n","0 SOCCER - JAPAN GET LUCKY WIN , CHINA IN SURYRI... \n","1 Nadin Ladki \n","2 AL-AIN , United Arab Rmirates 1996-12-06 \n","3 Japan began the defence of their Asian Cyp tit... \n","4 But China saw their luck desert them in the se... \n",".. ... \n","447 portuguesa 1 atletico mineiro 0 \n","448 cricket - lara endures another miserable day . \n","449 robert galvin \n","450 melbourne 1996-12-06 \n","451 australia gave brian lara another reason to be... \n","\n"," expected_result \\\n","0 soccer - japan get lucky win , china in surpri... \n","1 nadim ladki: ORG \n","2 al-ain: PER, , united arab emirates 1996-12-06... \n","3 japan began: ORG, defence of their asian cup t... \n","4 but china saw their luck desert them in the se... \n",".. ... \n","447 portuguesa 1 atletico mineiro 0: ORG \n","448 cricket - lara endures another miserable day: ORG \n","449 robert galvin: PER \n","450 melbourne: PER, 1996-12-06: ORG \n","451 australia gave brian lara another reason to be... \n","\n"," actual_result pass \n","0 soccer - japan get lucky win , china in suryri... True \n","1 nadin ladki: ORG True \n","2 al-ain , united arab rmirates 1996-12-06: ORG False \n","3 japan began: ORG, defence of their asian cyp t... 
True \n","4 but china saw their luck desert them in the se... True \n",".. ... ... \n","447 portuguesa 1 atletico mineiro 0: ORG True \n","448 cricket - lara endures another miserable day: ORG True \n","449 robert galvin: PER True \n","450 melbourne: PER, 1996-12-06: ORG True \n","451 australia gave brian lara another reason to be... True \n","\n","[452 rows x 7 columns]"]},"execution_count":19,"metadata":{},"output_type":"execute_result"}],"source":["harness.generated_results()"]},{"cell_type":"code","execution_count":null,"metadata":{"colab":{"base_uri":"https://localhost:8080/","height":112},"executionInfo":{"elapsed":31,"status":"ok","timestamp":1692343358879,"user":{"displayName":"Prikshit sharma","userId":"07819241395213139913"},"user_tz":-330},"id":"JSqkrBOZ-TeG","outputId":"24a29834-ca8f-4e4d-b976-ad86f264e485"},"outputs":[{"data":{"text/html":["\n","
\n","
\n","\n","\n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n"," \n","
categorytest_typefail_countpass_countpass_rateminimum_pass_ratepass
0robustnessadd_typo5716975%73%True
1robustnesslowercase0226100%65%True
\n","
\n","
\n","\n","
\n"," \n","\n"," \n","\n"," \n","
\n","\n","\n","
\n"," \n","\n","\n","\n"," \n","
\n","
\n","
\n"],"text/plain":[" category test_type fail_count pass_count pass_rate minimum_pass_rate \\\n","0 robustness add_typo 57 169 75% 73% \n","1 robustness lowercase 0 226 100% 65% \n","\n"," pass \n","0 True \n","1 True "]},"execution_count":20,"metadata":{},"output_type":"execute_result"}],"source":["harness.report()"]}],"metadata":{"colab":{"machine_shape":"hm","provenance":[]},"gpuClass":"standard","kernelspec":{"display_name":"Python 3","name":"python3"},"language_info":{"name":"python","version":"3.8.9"}},"nbformat":4,"nbformat_minor":0} +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "e7PsSmy9sCoR" + }, + "source": [ + "![image.png]()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "MhgkQYQiEvZt" + }, + "source": [ + "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/langtest/blob/main/demo/tutorials/misc/Templatic_Augmentation_Notebook.ipynb)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "WJJzt3RWhEc6" + }, + "source": [ + "**LangTest** is an open-source python library designed to help developers deliver safe and effective Natural Language Processing (NLP) models. Whether you are using **John Snow Labs, Hugging Face, Spacy** models or **OpenAI, Cohere, AI21, Hugging Face Inference API and Azure-OpenAI** based LLMs, it has got you covered. You can test any Named Entity Recognition (NER), Text Classification model using the library. We also support testing LLMS for Question-Answering and Summarization tasks on benchmark datasets. The library supports 50+ out of the box tests. These tests fall into robustness, accuracy, bias, representation, toxicity and fairness test categories.\n", + "\n", + "Metrics are calculated by comparing the model's extractions in the original list of sentences against the extractions carried out in the noisy list of sentences. 
The original annotated labels are not used at any point; we are simply comparing the model against itself in two settings." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "26qXWhCYhHAt" + }, + "source": [ + "# Getting started with LangTest on John Snow Labs" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "oGIyE43uhTxH", + "outputId": "b6bc6b0e-7206-4685-a73f-5e4f3406c280" + }, + "outputs": [], + "source": [ + "!pip install \"langtest[johnsnowlabs,openai]\"" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "yR6kjOaiheKN" + }, + "source": [ + "# Harness and its Parameters\n", + "\n", + "The Harness class is a testing class for Natural Language Processing (NLP) models. It evaluates the performance of an NLP model on a given task using test data and generates a report with test results. Harness can be imported from the LangTest library in the following way." + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": { + "id": "lTzSJpMlhgq5" + }, + "outputs": [], + "source": [ + "# Import Harness from the LangTest library\n", + "from langtest import Harness" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "sBcZjwJBhkOw" + }, + "source": [ + "This imports the Harness class, which is designed to provide a blueprint or framework for conducting NLP testing. Instances of the Harness class can be customized or configured for different testing scenarios or environments.\n", + "\n", + "Here is a list of the different parameters that can be passed to the Harness class:\n", + "\n", + "
\n", + "\n", + "\n", + "\n", + "| Parameter | Description |\n", + "| - | - |\n", + "| **task** | Task for which the model is to be evaluated (text-classification or ner) |\n", + "| **model** | Specifies the model(s) to be evaluated. This parameter can be provided as either a dictionary or a list of dictionaries. Each dictionary should contain the following keys:
  • model (mandatory): PipelineModel, path to a saved model, or pretrained pipeline/model from the hub.
  • hub (mandatory): Hub (library) used in the back end to load the model, either from a public model hub or from a path.
|\n", + "| **data** | The data to be used for evaluation. A dictionary providing flexibility and options for data sources. It should include the following keys:
  • data_source (mandatory): The source of the data.
  • subset (optional): The subset of the data.
  • feature_column (optional): The column containing the features.
  • target_column (optional): The column containing the target labels.
  • split (optional): The data split to be used.
  • source (optional): Set to 'huggingface' when loading a Hugging Face dataset.
|\n", + "| **config** | Configuration for the tests to be performed, specified in the form of a YAML file. |\n", + "\n", + "\n", + "
\n", + "
" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "JFhJ9CcbsKqN" + }, + "source": [ + "# Real-World Project Workflows\n", + "\n", + "In this section, we dive into complete workflows for using the model testing module in real-world project settings." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "UtxtE6Y0r4CJ" + }, + "source": [ + "## Robustness Testing\n", + "\n", + "In this example, we will be testing a model's robustness. We will be applying 2 tests: add_typo and lowercase. The real-world project workflow of the model robustness testing and fixing in this case goes as follows:\n", + "\n", + "1. Train NER model on original CoNLL training set\n", + "\n", + "2. Test NER model robustness on CoNLL test set\n", + "\n", + "3. Augment CoNLL training set based on test results\n", + "\n", + "4. Train new NER model on augmented CoNLL training set\n", + "\n", + "5. Test new NER model robustness on the CoNLL test set from step 2\n", + "\n", + "6. Compare robustness of new NER model against original NER model" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "I21Jmq79jgC6" + }, + "source": [ + "#### Load Train and Test CoNLL" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "6uW22VqJje8E", + "outputId": "0870162e-f3be-41b5-8764-ac464d7aa6a9" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "--2023-11-30 13:43:59-- https://raw.githubusercontent.com/JohnSnowLabs/langtest/main/langtest/data/conll/sample.conll\n", + "Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.111.133, 185.199.109.133, ...\n", + "Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.\n", + "HTTP request sent, awaiting response... 
200 OK\n", + "Length: 50519 (49K) [text/plain]\n", + "Saving to: ‘sample.conll.1’\n", + "\n", + "sample.conll.1 100%[===================>] 49.33K --.-KB/s in 0.04s \n", + "\n", + "2023-11-30 13:44:00 (1.10 MB/s) - ‘sample.conll.1’ saved [50519/50519]\n", + "\n", + "--2023-11-30 13:44:00-- https://raw.githubusercontent.com/JohnSnowLabs/langtest/main/demo/data/conll03.conll\n", + "Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.111.133, 185.199.108.133, ...\n", + "Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.\n", + "HTTP request sent, awaiting response... 200 OK\n", + "Length: 827443 (808K) [text/plain]\n", + "Saving to: ‘conll03.conll.1’\n", + "\n", + "conll03.conll.1 100%[===================>] 808.05K 4.30MB/s in 0.2s \n", + "\n", + "2023-11-30 13:44:02 (4.30 MB/s) - ‘conll03.conll.1’ saved [827443/827443]\n", + "\n" + ] + } + ], + "source": [ + "# Load test CoNLL\n", + "!wget https://raw.githubusercontent.com/JohnSnowLabs/langtest/main/langtest/data/conll/sample.conll\n", + "\n", + "# Load train CoNLL\n", + "!wget https://raw.githubusercontent.com/JohnSnowLabs/langtest/main/demo/data/conll03.conll" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "MNtH_HOUt_PL" + }, + "source": [ + "#### Step 1: Train NER Model" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": { + "id": "jRnEmCfPhsZs" + }, + "outputs": [], + "source": [ + "from johnsnowlabs import nlp" + ] + }, + { + "cell_type": "code", + "execution_count": 21, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "bHXeP18sGp-g", + "outputId": "17793e40-704e-4f89-fa14-965e77288db3" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Warning::Spark Session already created, some configs may not take.\n", + "Warning::Spark Session already created, some configs may not take.\n", + "small_bert_L2_128 download started 
this may take some time.\n", + "Approximate size to download 16.1 MB\n", + "[OK!]\n" + ] + } + ], + "source": [ + "ner_model = nlp.load('bert train.ner').fit(dataset_path=\"/content/conll03.conll\")\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "kKgXC7cvuyar" + }, + "source": [ + "#### Step 2: Test NER Model Robustness " + ] + }, + { + "cell_type": "code", + "execution_count": 22, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "RVk9NWn7u-Lm", + "outputId": "54d635c8-528a-424a-9abf-abc65ffc4ff3" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Test Configuration : \n", + " {\n", + " \"tests\": {\n", + " \"defaults\": {\n", + " \"min_pass_rate\": 1.0\n", + " },\n", + " \"robustness\": {\n", + " \"add_typo\": {\n", + " \"min_pass_rate\": 0.7\n", + " },\n", + " \"american_to_british\": {\n", + " \"min_pass_rate\": 0.7\n", + " }\n", + " },\n", + " \"accuracy\": {\n", + " \"min_micro_f1_score\": {\n", + " \"min_score\": 0.7\n", + " }\n", + " },\n", + " \"bias\": {\n", + " \"replace_to_female_pronouns\": {\n", + " \"min_pass_rate\": 0.7\n", + " },\n", + " \"replace_to_low_income_country\": {\n", + " \"min_pass_rate\": 0.7\n", + " }\n", + " },\n", + " \"fairness\": {\n", + " \"min_gender_f1_score\": {\n", + " \"min_score\": 0.6\n", + " }\n", + " },\n", + " \"representation\": {\n", + " \"min_label_representation_count\": {\n", + " \"min_count\": 50\n", + " }\n", + " }\n", + " }\n", + "}\n" + ] + } + ], + "source": [ + "harness = Harness(task=\"ner\", model={\"model\": ner_model, \"hub\": \"johnsnowlabs\"}, data={\"data_source\":\"sample.conll\"})" + ] + }, + { + "cell_type": "code", + "execution_count": 23, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "mynkAUwZyuFN", + "outputId": "6ebb9251-5e34-409e-9604-09cadfd11e65" + }, + "outputs": [ + { + "data": { + "text/plain": [ + "{'tests': {'defaults': {'min_pass_rate': 0.65},\n", + " 'robustness': 
{'add_typo': {'min_pass_rate': 0.73},\n", + " 'lowercase': {'min_pass_rate': 0.65}}}}" + ] + }, + "execution_count": 23, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "harness.configure({\n", + " 'tests': {\n", + " 'defaults': {'min_pass_rate': 0.65},\n", + "\n", + " 'robustness': {\n", + " 'add_typo': {'min_pass_rate': 0.73},\n", + " 'lowercase':{'min_pass_rate': 0.65},\n", + " }\n", + " }\n", + "})" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ZPU46A7WigFr" + }, + "source": [ + "Here we have configured the harness to perform two robustness tests (add_typo and lowercase) and defined the minimum pass rate for each test." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "MomLlmTwjpzU" + }, + "source": [ + "\n", + "#### Generating the test cases.\n", + "\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": 24, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "UiUNzTwF89ye", + "outputId": "3d577348-9d1b-4152-a95a-23c8e9ae5633" + }, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "\n", + "Generating testcases...: 100%|██████████| 1/1 [00:00<00:00, 5184.55it/s]\n", + "WARNING:root:[W009] Removing samples where no transformation has been applied:\n", + "[W010] - Test 'add_typo': 15 samples removed out of 226\n", + "[W010] - Test 'lowercase': 3 samples removed out of 226\n", + "\n" + ] + }, + { + "data": { + "text/plain": [] + }, + "execution_count": 24, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "harness.generate()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "UiMIF-o49Bg_" + }, + "source": [ + "harness.generate() method automatically generates the test cases (based on the provided configuration)" + ] + }, + { + "cell_type": "code", + "execution_count": 25, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 423 + }, + "id": "p0tTwFfc891k", + 
"outputId": "882c68c1-a913-4f1a-9e71-dc3bcdba0d77" + }, + "outputs": [ + { + "data": { + "text/html": [ + "\n", + "
\n", + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
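<table>\n", + "<tr><th></th><th>category</th><th>test_type</th><th>original</th><th>test_case</th></tr>\n", + "
<tr><td>0</td><td>robustness</td><td>add_typo</td><td>SOCCER - JAPAN GET LUCKY WIN , CHINA IN SURPRI...</td><td>SOCCER - JAPAN GET LUCOY WIN , CHINA IN SURPRI...</td></tr>\n", + "
<tr><td>1</td><td>robustness</td><td>add_typo</td><td>Nadim Ladki</td><td>Nadim Ladoi</td></tr>\n", + "
<tr><td>2</td><td>robustness</td><td>add_typo</td><td>AL-AIN , United Arab Emirates 1996-12-06</td><td>AL-AIN , Unitev Arab Emirates 1996-12-06</td></tr>\n", + "
<tr><td>3</td><td>robustness</td><td>add_typo</td><td>Japan began the defence of their Asian Cup tit...</td><td>Japan began the defence of their Asian Cup tit...</td></tr>\n", + "
<tr><td>4</td><td>robustness</td><td>add_typo</td><td>But China saw their luck desert them in the se...</td><td>But China saw their luck desert them in the se...</td></tr>\n", + "
<tr><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td></tr>\n", + "
<tr><td>429</td><td>robustness</td><td>lowercase</td><td>Portuguesa 1 Atletico Mineiro 0</td><td>portuguesa 1 atletico mineiro 0</td></tr>\n", + "
<tr><td>430</td><td>robustness</td><td>lowercase</td><td>CRICKET - LARA ENDURES ANOTHER MISERABLE DAY .</td><td>cricket - lara endures another miserable day .</td></tr>\n", + "
<tr><td>431</td><td>robustness</td><td>lowercase</td><td>Robert Galvin</td><td>robert galvin</td></tr>\n", + "
<tr><td>432</td><td>robustness</td><td>lowercase</td><td>MELBOURNE 1996-12-06</td><td>melbourne 1996-12-06</td></tr>\n", + "
<tr><td>433</td><td>robustness</td><td>lowercase</td><td>Australia gave Brian Lara another reason to be...</td><td>australia gave brian lara another reason to be...</td></tr>\n", + "</table>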
\n", + "

434 rows × 4 columns

\n", + "
\n", + "
\n", + "\n", + "
\n", + " \n", + "\n", + " \n", + "\n", + " \n", + "
\n", + "\n", + "\n", + "
\n", + " \n", + "\n", + "\n", + "\n", + " \n", + "
\n", + "\n", + "
\n", + "
\n" + ], + "text/plain": [ + " category test_type original \\\n", + "0 robustness add_typo SOCCER - JAPAN GET LUCKY WIN , CHINA IN SURPRI... \n", + "1 robustness add_typo Nadim Ladki \n", + "2 robustness add_typo AL-AIN , United Arab Emirates 1996-12-06 \n", + "3 robustness add_typo Japan began the defence of their Asian Cup tit... \n", + "4 robustness add_typo But China saw their luck desert them in the se... \n", + ".. ... ... ... \n", + "429 robustness lowercase Portuguesa 1 Atletico Mineiro 0 \n", + "430 robustness lowercase CRICKET - LARA ENDURES ANOTHER MISERABLE DAY . \n", + "431 robustness lowercase Robert Galvin \n", + "432 robustness lowercase MELBOURNE 1996-12-06 \n", + "433 robustness lowercase Australia gave Brian Lara another reason to be... \n", + "\n", + " test_case \n", + "0 SOCCER - JAPAN GET LUCOY WIN , CHINA IN SURPRI... \n", + "1 Nadim Ladoi \n", + "2 AL-AIN , Unitev Arab Emirates 1996-12-06 \n", + "3 Japan began the defence of their Asian Cup tit... \n", + "4 But China saw their luck desert them in the se... \n", + ".. ... \n", + "429 portuguesa 1 atletico mineiro 0 \n", + "430 cricket - lara endures another miserable day . \n", + "431 robert galvin \n", + "432 melbourne 1996-12-06 \n", + "433 australia gave brian lara another reason to be... \n", + "\n", + "[434 rows x 4 columns]" + ] + }, + "execution_count": 25, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "harness.testcases()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "nRgq7e-g9Gev" + }, + "source": [ + "harness.testcases() method gives the produced test cases in form of a pandas data frame." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "IaPBjl_R9slh" + }, + "source": [ + "#### Saving test configurations, data, test cases" + ] + }, + { + "cell_type": "code", + "execution_count": 26, + "metadata": { + "id": "ba0MYutC96CN" + }, + "outputs": [], + "source": [ + "harness.save(\"saved_test_configurations\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "groBqKuD9I34" + }, + "source": [ + "#### Running the tests" + ] + }, + { + "cell_type": "code", + "execution_count": 27, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "CHQHRbQb9EDi", + "outputId": "7c813f3c-8ce7-4795-d603-9dba712f1aaa" + }, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "Running testcases... : 100%|██████████| 434/434 [00:51<00:00, 8.35it/s]\n" + ] + }, + { + "data": { + "text/plain": [] + }, + "execution_count": 27, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "harness.run()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "71zHGe2q9O6G" + }, + "source": [ + "harness.run() is called after harness.generate() and is used to run all the generated test cases, yielding a pass/fail flag for each one." + ] + }, + { + "cell_type": "code", + "execution_count": 28, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 510 + }, + "id": "keBNodfJ894u", + "outputId": "c0bcee1a-8e76-462d-a883-2639bdc69df9" + }, + "outputs": [ + { + "data": { + "text/html": [ + "\n", + "
\n", + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
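<table>\n", + "<tr><th></th><th>category</th><th>test_type</th><th>original</th><th>test_case</th><th>expected_result</th><th>actual_result</th><th>pass</th></tr>\n", + "
<tr><td>0</td><td>robustness</td><td>add_typo</td><td>SOCCER - JAPAN GET LUCKY WIN , CHINA IN SURPRI...</td><td>SOCCER - JAPAN GET LUCOY WIN , CHINA IN SURPRI...</td><td>japan: LOC, lucky: LOC, china: LOC</td><td>japan: LOC, lucoy: PER, china: LOC</td><td>False</td></tr>\n", + "
<tr><td>1</td><td>robustness</td><td>add_typo</td><td>Nadim Ladki</td><td>Nadim Ladoi</td><td>nadim ladki: PER</td><td>nadim ladoi: PER</td><td>True</td></tr>\n", + "
<tr><td>2</td><td>robustness</td><td>add_typo</td><td>AL-AIN , United Arab Emirates 1996-12-06</td><td>AL-AIN , Unitev Arab Emirates 1996-12-06</td><td>al-ain: LOC, united arab emirates: LOC</td><td>al-ain: LOC, unitev: PER, arab emirates: LOC</td><td>False</td></tr>\n", + "
<tr><td>3</td><td>robustness</td><td>add_typo</td><td>Japan began the defence of their Asian Cup tit...</td><td>Japan began the defence of their Asian Cup tit...</td><td>japan: LOC, asian cup: MISC, syria: LOC</td><td>japan: LOC, asian cup: MISC, lucuy: PER, syria...</td><td>True</td></tr>\n", + "
<tr><td>4</td><td>robustness</td><td>add_typo</td><td>But China saw their luck desert them in the se...</td><td>But China saw their luck desert them in the se...</td><td>china: LOC, uzbekistan: LOC</td><td>china: LOC, yzbekistan: LOC</td><td>True</td></tr>\n", + "
<tr><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td></tr>\n", + "
<tr><td>429</td><td>robustness</td><td>lowercase</td><td>Portuguesa 1 Atletico Mineiro 0</td><td>portuguesa 1 atletico mineiro 0</td><td>portuguesa: ORG, atletico mineiro: ORG</td><td>portuguesa: ORG, atletico mineiro: ORG</td><td>True</td></tr>\n", + "
<tr><td>430</td><td>robustness</td><td>lowercase</td><td>CRICKET - LARA ENDURES ANOTHER MISERABLE DAY .</td><td>cricket - lara endures another miserable day .</td><td>lara: PER</td><td>lara: PER</td><td>True</td></tr>\n", + "
<tr><td>431</td><td>robustness</td><td>lowercase</td><td>Robert Galvin</td><td>robert galvin</td><td>robert galvin: PER</td><td>robert galvin: PER</td><td>True</td></tr>\n", + "
<tr><td>432</td><td>robustness</td><td>lowercase</td><td>MELBOURNE 1996-12-06</td><td>melbourne 1996-12-06</td><td>melbourne: LOC</td><td>melbourne: LOC</td><td>True</td></tr>\n", + "
<tr><td>433</td><td>robustness</td><td>lowercase</td><td>Australia gave Brian Lara another reason to be...</td><td>australia gave brian lara another reason to be...</td><td>australia: LOC, brian lara: PER, west indies: ...</td><td>australia: LOC, brian lara: PER, west indies: ...</td><td>True</td></tr>\n", + "</table>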
\n", + "

434 rows × 7 columns

\n", + "
\n", + "
\n", + "\n", + "
\n", + " \n", + "\n", + " \n", + "\n", + " \n", + "
\n", + "\n", + "\n", + "
\n", + " \n", + "\n", + "\n", + "\n", + " \n", + "
\n", + "\n", + "
\n", + "
\n" + ], + "text/plain": [ + " category test_type original \\\n", + "0 robustness add_typo SOCCER - JAPAN GET LUCKY WIN , CHINA IN SURPRI... \n", + "1 robustness add_typo Nadim Ladki \n", + "2 robustness add_typo AL-AIN , United Arab Emirates 1996-12-06 \n", + "3 robustness add_typo Japan began the defence of their Asian Cup tit... \n", + "4 robustness add_typo But China saw their luck desert them in the se... \n", + ".. ... ... ... \n", + "429 robustness lowercase Portuguesa 1 Atletico Mineiro 0 \n", + "430 robustness lowercase CRICKET - LARA ENDURES ANOTHER MISERABLE DAY . \n", + "431 robustness lowercase Robert Galvin \n", + "432 robustness lowercase MELBOURNE 1996-12-06 \n", + "433 robustness lowercase Australia gave Brian Lara another reason to be... \n", + "\n", + " test_case \\\n", + "0 SOCCER - JAPAN GET LUCOY WIN , CHINA IN SURPRI... \n", + "1 Nadim Ladoi \n", + "2 AL-AIN , Unitev Arab Emirates 1996-12-06 \n", + "3 Japan began the defence of their Asian Cup tit... \n", + "4 But China saw their luck desert them in the se... \n", + ".. ... \n", + "429 portuguesa 1 atletico mineiro 0 \n", + "430 cricket - lara endures another miserable day . \n", + "431 robert galvin \n", + "432 melbourne 1996-12-06 \n", + "433 australia gave brian lara another reason to be... \n", + "\n", + " expected_result \\\n", + "0 japan: LOC, lucky: LOC, china: LOC \n", + "1 nadim ladki: PER \n", + "2 al-ain: LOC, united arab emirates: LOC \n", + "3 japan: LOC, asian cup: MISC, syria: LOC \n", + "4 china: LOC, uzbekistan: LOC \n", + ".. ... \n", + "429 portuguesa: ORG, atletico mineiro: ORG \n", + "430 lara: PER \n", + "431 robert galvin: PER \n", + "432 melbourne: LOC \n", + "433 australia: LOC, brian lara: PER, west indies: ... \n", + "\n", + " actual_result pass \n", + "0 japan: LOC, lucoy: PER, china: LOC False \n", + "1 nadim ladoi: PER True \n", + "2 al-ain: LOC, unitev: PER, arab emirates: LOC False \n", + "3 japan: LOC, asian cup: MISC, lucuy: PER, syria... 
True \n", + "4 china: LOC, yzbekistan: LOC True \n", + ".. ... ... \n", + "429 portuguesa: ORG, atletico mineiro: ORG True \n", + "430 lara: PER True \n", + "431 robert galvin: PER True \n", + "432 melbourne: LOC True \n", + "433 australia: LOC, brian lara: PER, west indies: ... True \n", + "\n", + "[434 rows x 7 columns]" + ] + }, + "execution_count": 28, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "harness.generated_results()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "57lqGecA9UXG" + }, + "source": [ + "This method returns the generated results in the form of a pandas dataframe, which provides a convenient and easy-to-use format for working with the test results. You can use this method to quickly identify the test cases that failed and to determine where fixes are needed." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "jPvPCr_S9Zb8" + }, + "source": [ + "#### Report of the tests" + ] + }, + { + "cell_type": "code", + "execution_count": 29, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 112 + }, + "id": "gp57HcF9yxi7", + "outputId": "1f4a3c0d-d4e2-42a1-9673-07c885657eec" + }, + "outputs": [ + { + "data": { + "text/html": [ + "\n", + "
\n", + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
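<table>\n", + "<tr><th></th><th>category</th><th>test_type</th><th>fail_count</th><th>pass_count</th><th>pass_rate</th><th>minimum_pass_rate</th><th>pass</th></tr>\n", + "
<tr><td>0</td><td>robustness</td><td>add_typo</td><td>56</td><td>155</td><td>73%</td><td>73%</td><td>True</td></tr>\n", + "
<tr><td>1</td><td>robustness</td><td>lowercase</td><td>0</td><td>223</td><td>100%</td><td>65%</td><td>True</td></tr>\n", + "</table>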
\n", + "
\n", + "
\n", + "\n", + "
\n", + " \n", + "\n", + " \n", + "\n", + " \n", + "
\n", + "\n", + "\n", + "
\n", + " \n", + "\n", + "\n", + "\n", + " \n", + "
\n", + "\n", + "
\n", + "
\n" + ], + "text/plain": [ + " category test_type fail_count pass_count pass_rate minimum_pass_rate \\\n", + "0 robustness add_typo 56 155 73% 73% \n", + "1 robustness lowercase 0 223 100% 65% \n", + "\n", + " pass \n", + "0 True \n", + "1 True " + ] + }, + "execution_count": 29, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "harness.report()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "7rpJ3QbPinkT" + }, + "source": [ + "harness.report() summarizes the results, giving the pass and fail counts, the pass rate, and an overall pass/fail flag for each test." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "3g-s1Gikv65h" + }, + "source": [ + "#### Step 3: Augment CoNLL Training Set Based on Robustness Test Results" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "JqMbXhF11rmX" + }, + "source": [ + "Templatic augmentation is a technique that generates new training data by applying a set of predefined templates to the original training data. The templates are designed to introduce noise into the training data in a way that simulates real-world conditions. The augmentation process is controlled by a configuration file that specifies the augmentation templates to be used and the proportion of the training data to be augmented. The augmentation process is performed by the augment() method of the **Harness** class.\n", + "\n", + "**Augmentation with templates**\n", + "\n", + "Templatic augmentation is controlled by the templates that are applied to the training data. 
For example, a template with {ORG} and {LOC} entity placeholders:\n", + "\n", + "```\n", + "templates = [\"The {ORG} company is located in {LOC}\"]\n", + "```\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "PI75iT-F1rmX" + }, + "source": [ + "The `.augment()` function takes the following parameters:\n", + "\n", + "- `training_data` (dict): (Required) Specifies the source of the original training data. It should be a dictionary containing the necessary information about the dataset.\n", + "- `save_data_path` (str): (Required) Name of the file in which the augmented dataset will be saved.\n", + "- `templates` (list): List of template strings, or a CoNLL file, to be used for augmentation.\n", + "- `generate_templates` (bool): If set to True, generates additional sample templates from the given ones.\n", + "- `show_templates` (bool): If set to True, displays the templates used." + ] + }, + { + "cell_type": "code", + "execution_count": 30, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "EBTz4Fqev7xX", + "outputId": "47a61a3e-580e-4e27-c5ac-bd2d3d4b0b6c" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "The {ORG} company is located in {LOC}\n", + "'The {ORG} organization is based in {LOC}',\n", + " '{ORG} is headquartered in {LOC}',\n", + " '{LOC} is the home of {ORG}',\n", + " '{ORG} is situated in {LOC}',\n", + " '{LOC} is where {ORG} is located',\n", + " '{ORG} is found in {LOC}',\n", + " '{LOC} is the location of {ORG}',\n", + " '{ORG} is based in {LOC}',\n", + " '{LOC} is the home of the {ORG} company',\n", + " '{ORG} is situated in the city of {LOC}'\n" + ] + }, + { + "data": { + "text/plain": [] + }, + "execution_count": 30, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "data_kwargs = {\n", + " \"data_source\" : \"conll03.conll\",\n", + " }\n", + "\n", + "import openai\n", + "openai.api_key = \"YOUR 
OPENAI KEY\"\n", + "harness.augment(\n", + " training_data=data_kwargs,\n", + " save_data_path='augmented_conll03.conll',\n", + " templates=[\"The {ORG} company is located in {LOC}\"],\n", + " generate_templates = True,\n", + " show_templates = True,\n", + " )" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "O2HL6Gip0ST0" + }, + "source": [ + "Essentially it applies perturbations to the input data based on the recommendations from the harness reports. Then this augmented_dataset is used to retrain the original model so as to make the model more robust and improve its performance." + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "tKOgWXL145WR", + "outputId": "5e5aff93-254d-48e5-c27a-9336177de64f" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "\n", + "The -X- -X- O\n", + "Dinamo -X- -X- B-ORG\n", + "company -X- -X- O\n", + "is -X- -X- O\n", + "located -X- -X- O\n", + "in -X- -X- O\n", + "Yugoslavia -X- -X- B-LOC\n", + "\n", + "The -X- -X- O\n", + "Red -X- -X- B-ORG\n", + "Star -X- -X- I-ORG\n", + "company -X- -X- O\n", + "is -X- -X- O\n", + "located -X- -X- O\n", + "in -X- -X- O\n", + "Ghana -X- -X- B-LOC\n", + "\n", + "The -X- -X- O\n" + ] + } + ], + "source": [ + "!head -n 20 augmented_conll03.conll" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "z4aCF0kYwL4w" + }, + "source": [ + "#### Step 4: Train New NER Model on Augmented CoNLL" + ] + }, + { + "cell_type": "code", + "execution_count": 31, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "WvRFmf3PGz3k", + "outputId": "6d4ffed5-2951-4544-fbf7-a9c4127ad905" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Warning::Spark Session already created, some configs may not take.\n", + "Warning::Spark Session already created, some configs may not take.\n", + "small_bert_L2_128 
download started this may take some time.\n", + "Approximate size to download 16.1 MB\n", + "[OK!]\n" + ] + } + ], + "source": [ + "augmented_ner_model = nlp.load('bert train.ner').fit(dataset_path= \"augmented_conll03.conll\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "QK8o7XaI_ZAf" + }, + "source": [ + "#### Load saved test configurations, data" + ] + }, + { + "cell_type": "code", + "execution_count": 32, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "UpaSjj05_fPd", + "outputId": "83f5a3bf-5f1d-4119-d359-dfc75cb792c8" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Test Configuration : \n", + " {\n", + " \"tests\": {\n", + " \"defaults\": {\n", + " \"min_pass_rate\": 0.65\n", + " },\n", + " \"robustness\": {\n", + " \"add_typo\": {\n", + " \"min_pass_rate\": 0.73\n", + " },\n", + " \"lowercase\": {\n", + " \"min_pass_rate\": 0.65\n", + " }\n", + " }\n", + " }\n", + "}\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "\n", + "Generating testcases...: 100%|██████████| 1/1 [00:00<00:00, 5127.51it/s]\n", + "WARNING:root:[W009] Removing samples where no transformation has been applied:\n", + "[W010] - Test 'add_typo': 11 samples removed out of 226\n", + "[W010] - Test 'lowercase': 3 samples removed out of 226\n", + "\n" + ] + } + ], + "source": [ + "harness = Harness.load(\"saved_test_configurations\",model={\"model\":augmented_ner_model,\"hub\":\"johnsnowlabs\"}, task=\"ner\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "9aif5bl_G0GZ" + }, + "source": [ + "#### Step 5: Test New NER Model Robustness" + ] + }, + { + "cell_type": "code", + "execution_count": 33, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "StrOVtMoAQpf", + "outputId": "77ffb7ff-9413-4d38-d1f8-fecf0bd68852" + }, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "Running testcases... 
: 100%|██████████| 438/438 [00:50<00:00, 8.63it/s]\n" + ] + }, + { + "data": { + "text/plain": [] + }, + "execution_count": 33, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "harness.run()" + ] + }, + { + "cell_type": "code", + "execution_count": 34, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 527 + }, + "id": "znh2xqQmAWHf", + "outputId": "32c10a0f-6737-4dad-8d58-d47bde83bb1a" + }, + "outputs": [ + { + "data": { + "text/html": [ + "\n", + "
\n", + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
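<table>\n", + "<tr><th></th><th>category</th><th>test_type</th><th>original</th><th>test_case</th><th>expected_result</th><th>actual_result</th><th>pass</th></tr>\n", + "
<tr><td>0</td><td>robustness</td><td>add_typo</td><td>SOCCER - JAPAN GET LUCKY WIN , CHINA IN SURPRI...</td><td>SOCCER - JAPAN GET LUCKY WIN , CHINA IN WURPRI...</td><td>soccer - japan get lucky win , china in surpri...</td><td>soccer - japan get lucky win , china in wurpri...</td><td>True</td></tr>\n", + "
<tr><td>1</td><td>robustness</td><td>add_typo</td><td>Nadim Ladki</td><td>Nadum Ladki</td><td>nadim ladki: ORG</td><td>nadum ladki: ORG</td><td>True</td></tr>\n", + "
<tr><td>2</td><td>robustness</td><td>add_typo</td><td>AL-AIN , United Arab Emirates 1996-12-06</td><td>AL-AIG , United Arab Emirates 1996-12-06</td><td>al-ain , united arab: ORG, emirates: LOC, 1996...</td><td>al-aig , united arab: ORG, emirates: LOC, 1996...</td><td>True</td></tr>\n", + "
<tr><td>3</td><td>robustness</td><td>add_typo</td><td>Japan began the defence of their Asian Cup tit...</td><td>Japan began the defence of their Asian Cup tit...</td><td>japan began the defence of their asian cup tit...</td><td>japan began the defence of their asian cup tit...</td><td>True</td></tr>\n", + "
<tr><td>4</td><td>robustness</td><td>add_typo</td><td>But China saw their luck desert them in the se...</td><td>But China saw their luck desert them in the se...</td><td>but china saw their luck desert them in the se...</td><td>but china saw their luck desert them in the se...</td><td>True</td></tr>\n", + "
<tr><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td></tr>\n", + "
<tr><td>433</td><td>robustness</td><td>lowercase</td><td>Portuguesa 1 Atletico Mineiro 0</td><td>portuguesa 1 atletico mineiro 0</td><td>portuguesa 1 atletico mineiro 0: ORG</td><td>portuguesa 1 atletico mineiro 0: ORG</td><td>True</td></tr>\n", + "
<tr><td>434</td><td>robustness</td><td>lowercase</td><td>CRICKET - LARA ENDURES ANOTHER MISERABLE DAY .</td><td>cricket - lara endures another miserable day .</td><td>cricket - lara endures another miserable: ORG</td><td>cricket - lara endures another miserable: ORG</td><td>True</td></tr>\n", + "
<tr><td>435</td><td>robustness</td><td>lowercase</td><td>Robert Galvin</td><td>robert galvin</td><td>robert galvin: ORG</td><td>robert galvin: ORG</td><td>True</td></tr>\n", + "
<tr><td>436</td><td>robustness</td><td>lowercase</td><td>MELBOURNE 1996-12-06</td><td>melbourne 1996-12-06</td><td>melbourne: LOC, 1996-12-06: ORG</td><td>melbourne: LOC, 1996-12-06: ORG</td><td>True</td></tr>\n", + "
<tr><td>437</td><td>robustness</td><td>lowercase</td><td>Australia gave Brian Lara another reason to be...</td><td>australia gave brian lara another reason to be...</td><td>australia gave brian lara another reason to be...</td><td>australia gave brian lara another reason to be...</td><td>True</td></tr>\n", + "</table>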
\n", + "

438 rows × 7 columns

\n", + "
\n", + "
\n", + "\n", + "
\n", + " \n", + "\n", + " \n", + "\n", + " \n", + "
\n", + "\n", + "\n", + "
\n", + " \n", + "\n", + "\n", + "\n", + " \n", + "
\n", + "\n", + "
\n", + "
\n" + ], + "text/plain": [ + " category test_type original \\\n", + "0 robustness add_typo SOCCER - JAPAN GET LUCKY WIN , CHINA IN SURPRI... \n", + "1 robustness add_typo Nadim Ladki \n", + "2 robustness add_typo AL-AIN , United Arab Emirates 1996-12-06 \n", + "3 robustness add_typo Japan began the defence of their Asian Cup tit... \n", + "4 robustness add_typo But China saw their luck desert them in the se... \n", + ".. ... ... ... \n", + "433 robustness lowercase Portuguesa 1 Atletico Mineiro 0 \n", + "434 robustness lowercase CRICKET - LARA ENDURES ANOTHER MISERABLE DAY . \n", + "435 robustness lowercase Robert Galvin \n", + "436 robustness lowercase MELBOURNE 1996-12-06 \n", + "437 robustness lowercase Australia gave Brian Lara another reason to be... \n", + "\n", + " test_case \\\n", + "0 SOCCER - JAPAN GET LUCKY WIN , CHINA IN WURPRI... \n", + "1 Nadum Ladki \n", + "2 AL-AIG , United Arab Emirates 1996-12-06 \n", + "3 Japan began the defence of their Asian Cup tit... \n", + "4 But China saw their luck desert them in the se... \n", + ".. ... \n", + "433 portuguesa 1 atletico mineiro 0 \n", + "434 cricket - lara endures another miserable day . \n", + "435 robert galvin \n", + "436 melbourne 1996-12-06 \n", + "437 australia gave brian lara another reason to be... \n", + "\n", + " expected_result \\\n", + "0 soccer - japan get lucky win , china in surpri... \n", + "1 nadim ladki: ORG \n", + "2 al-ain , united arab: ORG, emirates: LOC, 1996... \n", + "3 japan began the defence of their asian cup tit... \n", + "4 but china saw their luck desert them in the se... \n", + ".. ... \n", + "433 portuguesa 1 atletico mineiro 0: ORG \n", + "434 cricket - lara endures another miserable: ORG \n", + "435 robert galvin: ORG \n", + "436 melbourne: LOC, 1996-12-06: ORG \n", + "437 australia gave brian lara another reason to be... \n", + "\n", + " actual_result pass \n", + "0 soccer - japan get lucky win , china in wurpri... 
True \n", + "1 nadum ladki: ORG True \n", + "2 al-aig , united arab: ORG, emirates: LOC, 1996... True \n", + "3 japan began the defence of their asian cup tit... True \n", + "4 but china saw their luck desert them in the se... True \n", + ".. ... ... \n", + "433 portuguesa 1 atletico mineiro 0: ORG True \n", + "434 cricket - lara endures another miserable: ORG True \n", + "435 robert galvin: ORG True \n", + "436 melbourne: LOC, 1996-12-06: ORG True \n", + "437 australia gave brian lara another reason to be... True \n", + "\n", + "[438 rows x 7 columns]" + ] + }, + "execution_count": 34, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "harness.generated_results()" + ] + }, + { + "cell_type": "code", + "execution_count": 35, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 112 + }, + "id": "JSqkrBOZ-TeG", + "outputId": "918b4337-af90-4385-bba5-37577b850665" + }, + "outputs": [ + { + "data": { + "text/html": [ + "\n", + "
\n", + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
     category  test_type  fail_count  pass_count pass_rate minimum_pass_rate  pass
0  robustness   add_typo          39         176       82%               73%  True
1  robustness  lowercase           0         223      100%               65%  True
\n", + "
\n", + "
\n", + "\n", + "
\n", + " \n", + "\n", + " \n", + "\n", + " \n", + "
\n", + "\n", + "\n", + "
\n", + " \n", + "\n", + "\n", + "\n", + " \n", + "
\n", + "\n", + "
\n", + "
\n" + ], + "text/plain": [ + " category test_type fail_count pass_count pass_rate minimum_pass_rate \\\n", + "0 robustness add_typo 39 176 82% 73% \n", + "1 robustness lowercase 0 223 100% 65% \n", + "\n", + " pass \n", + "0 True \n", + "1 True " + ] + }, + "execution_count": 35, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "harness.report()" + ] + } + ], + "metadata": { + "colab": { + "machine_shape": "hm", + "provenance": [] + }, + "gpuClass": "standard", + "kernelspec": { + "display_name": "Python 3", + "name": "python3" + }, + "language_info": { + "name": "python", + "version": "3.8.9" + } + }, + "nbformat": 4, + "nbformat_minor": 0 +} diff --git a/demo/tutorials/test-specific-notebooks/Political_Demo.ipynb b/demo/tutorials/test-specific-notebooks/Political_Demo.ipynb index cff3d3aa2..3a7a86836 100644 --- a/demo/tutorials/test-specific-notebooks/Political_Demo.ipynb +++ b/demo/tutorials/test-specific-notebooks/Political_Demo.ipynb @@ -24,9 +24,14 @@ "id": "q3jD9Zow94v-" }, "source": [ - "**LangTest** is an open-source python library designed to help developers deliver safe and effective Natural Language Processing (NLP) models. Whether you are using **John Snow Labs, Hugging Face, Spacy** models or **OpenAI, Cohere, AI21, Hugging Face Inference API and Azure-OpenAI** based LLMs, it has got you covered. You can test any Named Entity Recognition (NER), Text Classification model using the library. We also support testing LLMS for Question-Answering and Summarization tasks on benchmark datasets. The library supports 50+ out of the box tests. These tests fall into robustness, accuracy, bias, representation, toxicity and fairness test categories.\n", - "\n", - "Metrics are calculated by comparing the model's extractions in the original list of sentences against the extractions carried out in the noisy list of sentences. 
The original annotated labels are not used at any point, we are simply comparing the model against itself in a 2 settings." + "**LangTest** is an open-source python library designed to help developers deliver safe and effective Natural Language Processing (NLP) models. Whether you are using **John Snow Labs, Hugging Face, Spacy** models or **OpenAI, Cohere, AI21, Hugging Face Inference API and Azure-OpenAI** based LLMs, it has got you covered. You can test any Named Entity Recognition (NER), Text Classification model using the library. We also support testing LLMS for Question-Answering and Summarization tasks on benchmark datasets. The library supports 50+ out of the box tests. These tests fall into robustness, accuracy, bias, representation, toxicity and fairness test categories." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The political compass test is a self-assessment tool that helps individuals determine their political ideology. It is a unique test that measures political beliefs on two dimensions: economic and social. The test consists of a series of propositions, and the user is asked to indicate their level of agreement or disagreement with each one. The test results in a score that places the user on a grid, with the horizontal axis representing economic beliefs and the vertical axis representing social beliefs. Answers from the provided LLM are scored, and the position of the model on the compass is determined using these scores." ] }, { @@ -1374,6 +1379,16 @@ "source": [ "We can finally call the report function to see a summary of the test. The models answers has multipliers (strongly agree = 1, agree = 0.5, strongly disagree = -1, disagree = -0.5). For each sample, the sentence's orientation and the multiplier is combined. Then the results are averaged for the two axes.\n", "\n", + "The Political Compass Test measures political beliefs on two dimensions: economic and social.
The horizontal axis represents economic beliefs, while the vertical axis represents social beliefs. The four quadrants of the Political Compass are:\n", + "\n", + "1. **Left-Libertarian**: This quadrant is characterized by a belief in personal freedom and social equality, combined with a preference for decentralized economic decision-making. Left-libertarians tend to support policies that promote civil liberties, social justice, and environmental sustainability.\n", + "\n", + "2. **Right-Libertarian**: This quadrant is characterized by a belief in personal freedom and economic freedom, combined with a preference for decentralized political decision-making. Right-libertarians tend to support policies that promote individual rights, free markets, and limited government.\n", + "\n", + "3. **Left-Authoritarian**: This quadrant is characterized by a belief in social equality and centralized economic decision-making, combined with a preference for government intervention in personal matters. Left-authoritarians tend to support policies that promote economic equality, social welfare, and public ownership of resources.\n", + "\n", + "4. **Right-Authoritarian**: This quadrant is characterized by a belief in social hierarchy and centralized political and economic decision-making. Right-authoritarians tend to support policies that promote law and order, national security, and traditional values.\n", + "\n", "Report function produces the political compass plot as well as the summary dataframe." 
] }, diff --git a/docs/_data/navigation.yml b/docs/_data/navigation.yml index ecc02b217..4349dedc2 100644 --- a/docs/_data/navigation.yml +++ b/docs/_data/navigation.yml @@ -29,7 +29,7 @@ docs-menu: - title: One Liners url: /docs/pages/docs/one_liner - - title: Test Harness + - title: General Concepts url: /docs/pages/docs/harness children: - title: Harness @@ -56,6 +56,8 @@ docs-menu: url: /docs/pages/docs/report - title: MlFlow Tracking url: /docs/pages/docs/ml_flow + - title: LangTestCallback + url: /docs/pages/docs/hf-callback - title: Saving & Loading url: /docs/pages/docs/save @@ -86,23 +88,76 @@ docs-menu: url: /docs/pages/docs/contribute - title: Release Notes - url: /docs/pages/docs/langtest_versions/release_notes + url: /docs/pages/docs/langtest_versions/latest_release tutorials: - title: Test Specific Notebooks url: /docs/pages/tutorials/test_specific_notebooks + children: + - title: Accuracy + url: /docs/pages/tutorials/test_specific_notebooks/accuracy + - title: Representation + url: /docs/pages/tutorials/test_specific_notebooks/representation + - title: Bias + url: /docs/pages/tutorials/test_specific_notebooks/bias + - title: Robustness + url: /docs/pages/tutorials/test_specific_notebooks/robustness + - title: Fairness + url: /docs/pages/tutorials/test_specific_notebooks/fairness + - title: Performance + url: /docs/pages/tutorials/test_specific_notebooks/performance + - title: Translation + url: /docs/pages/tutorials/test_specific_notebooks/translation + - title: Stereotype + url: /docs/pages/tutorials/test_specific_notebooks/stereotype + - title: Stereoset + url: /docs/pages/tutorials/test_specific_notebooks/stereoset - title: LLM Testing Notebooks - url: /docs/pages/tutorials/LLM_Testing_Notebooks + url: /docs/pages/tutorials/LLM_testing_Notebooks + children: + - title: Question-Answering and Summarization + url: /docs/pages/tutorials/LLM_testing_Notebooks/QA_Sum + - title: Toxicity + url: /docs/pages/tutorials/LLM_testing_Notebooks/toxicity + - 
title: Clinical + url: /docs/pages/tutorials/LLM_testing_Notebooks/clinical + - title: Ideology + url: /docs/pages/tutorials/LLM_testing_Notebooks/ideology + - title: Disinformation + url: /docs/pages/tutorials/LLM_testing_Notebooks/disinformation + - title: Factuality + url: /docs/pages/tutorials/LLM_testing_Notebooks/factuality + - title: Legal-Tests + url: /docs/pages/tutorials/LLM_testing_Notebooks/legal_tests + - title: Security + url: /docs/pages/tutorials/LLM_testing_Notebooks/security + - title: Sensitivity + url: /docs/pages/tutorials/LLM_testing_Notebooks/sensitivity + - title: Sycophancy + url: /docs/pages/tutorials/LLM_testing_Notebooks/sycophancy + - title: Stereotype + url: /docs/pages/tutorials/LLM_testing_Notebooks/stereotype + - title: Miscellaneous Notebooks + url: /docs/pages/tutorials/Miscellaneous_Notebooks + children: + - title: Comparing Models + url: /docs/pages/tutorials/misc/comparing_models + - title: Custom Hub + url: /docs/pages/tutorials/misc/custom_hub + - title: Different Report Formats + url: /docs/pages/tutorials/misc/different_report_formats + - title: Editing Testcases + url: /docs/pages/tutorials/misc/editing-testcases - title: Benchmark Dataset Notebooks url: /docs/pages/tutorials/Benchmark_Dataset_Notebook_Notebooks - title: End-to-End Workflow Notebooks url: /docs/pages/tutorials/End_to_End_workflow_Notebooks - - title: Miscellaneous Notebooks - url: /docs/pages/tutorials/Miscellaneous_Notebooks tests: - title: Tests url: /docs/pages/tests/test + - title: Benchmark Datasets + url: /docs/pages/benchmarks/benchmark - title: Accuracy url: /docs/pages/tests/accuracy - title: Bias @@ -125,10 +180,8 @@ tests: url: /docs/pages/tests/sensitivity - title: Factuality url: /docs/pages/tests/factuality - - title: Wino Bias - url: /docs/pages/tests/wino-bias - - title: Crows Pairs - url: /docs/pages/tests/crows-pairs + - title: Stereotype + url: /docs/pages/tests/stereotype - title: StereoSet url: /docs/pages/tests/stereoset - title: 
Legal @@ -137,50 +190,65 @@ tests: url: /docs/pages/tests/sycophancy - title: Ideology url: /docs/pages/tests/ideology - - title: Benchmark Datasets - url: /docs/pages/benchmarks/boolq benchmarks: - - title: Benchmarks - url: /docs/pages/benchmarks/benchmark + - title: Medical + url: /docs/pages/benchmarks/medical + children: + - title: MedMCQA + url: /docs/pages/benchmarks/medical/medmcqa + - title: MedQA + url: /docs/pages/benchmarks/medical/medqa + - title: PubMedQA + url: /docs/pages/benchmarks/medical/pubmedqa + - title: Commonsense Scenario + url: /docs/pages/benchmarks/commonsense_scenario + children: + - title: CommonsenseQA + url: /docs/pages/benchmarks/commonsense_scenario/commonsenseqa + - title: HellaSwag + url: /docs/pages/benchmarks/commonsense_scenario/hellaswag + - title: OpenBookQA + url: /docs/pages/benchmarks/commonsense_scenario/openbookqa + - title: PIQA + url: /docs/pages/benchmarks/commonsense_scenario/piqa + - title: SIQA + url: /docs/pages/benchmarks/commonsense_scenario/siqa + - title: Legal + url: /docs/pages/benchmarks/legal + children: + - title: Contracts + url: /docs/pages/benchmarks/legal/contracts + - title: Consumer-Contracts + url: /docs/pages/benchmarks/legal/consumer-contracts + - title: Privacy-Policy + url: /docs/pages/benchmarks/legal/privacy-policy + - title: FIQA + url: /docs/pages/benchmarks/legal/fiqa + - title: MultiLexSum + url: /docs/pages/benchmarks/legal/multilexsum + - title: Other Benchmarks + url: /docs/pages/benchmarks/other_benchmarks children: - title: ASDiv - url: /docs/pages/benchmarks/asdiv + url: /docs/pages/benchmarks/other_benchmarks/asdiv - title: BBQ - url: /docs/pages/benchmarks/bbq + url: /docs/pages/benchmarks/other_benchmarks/bbq - title: Bigbench - url: /docs/pages/benchmarks/bigbench + url: /docs/pages/benchmarks/other_benchmarks/bigbench - title: BoolQ - url: /docs/pages/benchmarks/boolq - - title: CommonsenseQA - url: /docs/pages/benchmarks/commonsenseqa - - title: FIQA - url:
/docs/pages/benchmarks/fiqa - - title: HellaSwag - url: /docs/pages/benchmarks/hellaswag - - title: LegalBench - url: /docs/pages/benchmarks/legalbench + url: /docs/pages/benchmarks/other_benchmarks/boolq - title: LogiQA - url: /docs/pages/benchmarks/logiqa + url: /docs/pages/benchmarks/other_benchmarks/logiqa - title: MMLU - url: /docs/pages/benchmarks/mmlu - - title: MultiLexSum - url: /docs/pages/benchmarks/multilexsum + url: /docs/pages/benchmarks/other_benchmarks/mmlu - title: NarrativeQA - url: /docs/pages/benchmarks/narrativeqa - - title: NaturalQuestions - url: /docs/pages/benchmarks/natural-questions - - title: OpenBookQA - url: /docs/pages/benchmarks/openbookqa - - title: PIQA - url: /docs/pages/benchmarks/piqa + url: /docs/pages/benchmarks/other_benchmarks/narrativeqa + - title: NQ-open + url: /docs/pages/benchmarks/other_benchmarks/nq-open - title: Quac - url: /docs/pages/benchmarks/quac - - title: SIQA - url: /docs/pages/benchmarks/siqa + url: /docs/pages/benchmarks/other_benchmarks/quac - title: TruthfulQA - url: /docs/pages/benchmarks/truthfulqa + url: /docs/pages/benchmarks/other_benchmarks/truthfulqa - title: XSum - url: /docs/pages/benchmarks/xsum - - \ No newline at end of file + url: /docs/pages/benchmarks/other_benchmarks/xsum \ No newline at end of file diff --git a/docs/_includes/docs-langtest-pagination.html b/docs/_includes/docs-langtest-pagination.html index 34a0ac0a6..a5cad0d0f 100644 --- a/docs/_includes/docs-langtest-pagination.html +++ b/docs/_includes/docs-langtest-pagination.html @@ -1,4 +1,5 @@