
Commit

Merge pull request #909 from JohnSnowLabs/release/1.9.0
Release/1.9.0
ArshaanNazir authored Dec 1, 2023
2 parents 9e15f66 + b5f7b32 commit 3abd5b2
Showing 216 changed files with 55,334 additions and 1,296 deletions.
11 changes: 10 additions & 1 deletion README.md
@@ -101,7 +101,16 @@ You can check out the following LangTest articles:
| [**Evaluating Large Language Models on Gender-Occupational Stereotypes Using the Wino Bias Test**](https://medium.com/john-snow-labs/evaluating-large-language-models-on-gender-occupational-stereotypes-using-the-wino-bias-test-2a96619b4960) | In this blog post, we dive into testing the WinoBias dataset on LLMs, examining language models’ handling of gender and occupational roles, evaluation metrics, and the wider implications. Let’s explore the evaluation of language models with LangTest on the WinoBias dataset and confront the challenges of addressing bias in AI. |
| [**Streamlining ML Workflows: Integrating MLFlow Tracking with LangTest for Enhanced Model Evaluations**](https://medium.com/john-snow-labs/streamlining-ml-workflows-integrating-mlflow-tracking-with-langtest-for-enhanced-model-evaluations-4ce9863a0ff1) | In this blog post, we dive into the growing need for transparent, systematic, and comprehensive tracking of models. Enter MLFlow and LangTest: two tools that, when combined, create a revolutionary approach to ML development. |
| [**Testing the Question Answering Capabilities of Large Language Models**](https://medium.com/john-snow-labs/testing-the-question-answering-capabilities-of-large-language-models-1bc424d61740) | In this blog post, we dive into enhancing the QA evaluation capabilities using LangTest library. Explore about different evaluation methods that LangTest offers to address the complexities of evaluating Question Answering (QA) tasks. |
| [**Evaluating Stereotype Bias with LangTest**](To be published soon) | In this blog post, we are focusing on using the StereoSet dataset to assess bias related to gender, profession, and race.|
| [**Evaluating Stereotype Bias with LangTest**](https://medium.com/john-snow-labs/evaluating-stereotype-bias-with-langtest-8286af8f0f22) | In this blog post, we are focusing on using the StereoSet dataset to assess bias related to gender, profession, and race.|
| [**Unveiling Sentiments: Exploring LSTM-based Sentiment Analysis with PyTorch on the IMDB Dataset**](To be Published) | Explore the robustness of custom models with LangTest Insights.|
| [**LangTest Insights: A Deep Dive into LLM Robustness on OpenBookQA**](To be Published) | Explore the robustness of Language Models (LLMs) on the OpenBookQA dataset with LangTest Insights.|
| [**LangTest: A Secret Weapon for Improving the Robustness of Your Transformers Language Models**](To be Published) | Explore the robustness of Transformers Language Models with LangTest Insights.|








> **Note**
22,506 changes: 22,506 additions & 0 deletions demo/tutorials/benchmarks/OpenbookQA_benchmarks.ipynb

Large diffs are not rendered by default.

10 changes: 7 additions & 3 deletions demo/tutorials/llm_notebooks/Clinical_Tests.ipynb
@@ -191,7 +191,7 @@
"\n",
"task={\"task\": \"text-generation\", \"category\": \"clinical-tests\"},\n",
"\n",
"harness = Harness(task={\"task\": \"text-generation\", \"category\": \"clinical-tests\"}, model=model, data=data)"
"harness = Harness(task=task, model=model, data=data)"
]
},
{
@@ -2640,7 +2640,9 @@
"\n",
"data = {\"data_source\": \"Clinical\", \"split\":\"Gastroenterology-files\"}\n",
"\n",
"harness = Harness(task={\"task\": \"text-generation\", \"category\": \"clinical-tests\"}, model=model, data=data)"
"task={\"task\": \"text-generation\", \"category\": \"clinical-tests\"}\n",
"\n",
"harness = Harness(task=task, model=model, data=data)"
]
},
{
@@ -5006,7 +5008,9 @@
"\n",
"data = {\"data_source\": \"Clinical\", \"split\":\"Oromaxillofacial-files\"}\n",
"\n",
"harness = Harness(task={\"task\": \"text-generation\", \"category\": \"clinical-tests\"}, model=model, data=data)"
"task={\"task\": \"text-generation\", \"category\": \"clinical-tests\"}\n",
"\n",
"harness = Harness(task=task, model=model, data=data)"
]
},
{
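Across the notebook hunks in this release the setup is refactored so that the task dictionary is defined once and passed as `task=task`. A minimal sketch of the resulting pattern, using model, data, and task values copied from the hunks in this commit; the `OPENAI_API_KEY` step and the chained `generate().run().report()` call reflect the usual LangTest workflow and are assumptions, not part of this diff:

```python
# Minimal sketch of the refactored Harness setup used throughout this release.
# Model/data/task values come from the notebook hunks in this commit; the API-key
# step and the generate/run/report chain are assumptions about the usual LangTest
# workflow rather than something shown in this diff.
import os

from langtest import Harness

os.environ["OPENAI_API_KEY"] = "<your-openai-api-key>"  # assumed requirement for the "openai" hub

model = {"model": "text-davinci-003", "hub": "openai"}
data = {"data_source": "Clinical", "split": "Gastroenterology-files"}
task = {"task": "text-generation", "category": "clinical-tests"}

harness = Harness(task=task, model=model, data=data)

# Build the test cases, run them against the model, and summarise pass/fail rates.
harness.generate().run().report()
```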
2 changes: 1 addition & 1 deletion demo/tutorials/llm_notebooks/Disinformation_Test.ipynb
@@ -173,7 +173,7 @@
}
],
"source": [
"model = {\"model\": \"text-davinci-003\", \"hub\":\"openai\"}\n",
"model={\"model\": \"j2-jumbo-instruct\", \"hub\":\"ai21\"}\n",
"\n",
"data = {\"data_source\": \"Narrative-Wedging\"}\n",
"\n",
2 changes: 1 addition & 1 deletion demo/tutorials/llm_notebooks/Factuality_Test.ipynb
@@ -213,7 +213,7 @@
"\n",
"task={\"task\": \"question-answering\", \"category\": \"factuality-test\"},\n",
"\n",
"harness = Harness(task={\"task\": \"question-answering\", \"category\": \"factuality-test\"}, model=model, data=data)"
"harness = Harness(task=task, model=model, data=data)"
]
},
{
7 changes: 4 additions & 3 deletions demo/tutorials/llm_notebooks/Legal_Support.ipynb
@@ -172,10 +172,11 @@
"source": [
"model = {\"model\": \"text-davinci-003\", \"hub\":\"openai\"}\n",
"\n",
"data = {\"data_source\": \"Legal-Support\",\n",
" \"split\":\"test\"}\n",
"data = {\"data_source\": \"Legal-Support\", \"split\":\"test\"}\n",
"\n",
"harness = Harness(task={\"task\": \"question-answering\", \"category\": \"legal-tests\"}, model=model, data=data)"
"task={\"task\": \"question-answering\", \"category\": \"legal-tests\"}\n",
"\n",
"harness = Harness(task=task, model=model, data=data)"
]
},
{
15 changes: 11 additions & 4 deletions demo/tutorials/llm_notebooks/Prompt_Injections_Tests.ipynb
@@ -24,9 +24,14 @@
"id": "q3jD9Zow94v-"
},
"source": [
"**LangTest** is an open-source python library designed to help developers deliver safe and effective Natural Language Processing (NLP) models. Whether you are using **John Snow Labs, Hugging Face, Spacy** models or **OpenAI, Cohere, AI21, Hugging Face Inference API and Azure-OpenAI** based LLMs, it has got you covered. You can test any Named Entity Recognition (NER), Text Classification model using the library. We also support testing LLMS for Question-Answering, Summarization, Clinical-Tests and **Security** tasks on benchmark datasets. The library supports 50+ out of the box tests. These tests fall into robustness, accuracy, bias, representation, toxicity, translation, performance, security, clinical and fairness test categories.\n",
"\n",
"Metrics are calculated by comparing the model's extractions in the original list of sentences against the extractions carried out in the noisy list of sentences. The original annotated labels are not used at any point, we are simply comparing the model against itself in a 2 settings."
"**LangTest** is an open-source python library designed to help developers deliver safe and effective Natural Language Processing (NLP) models. Whether you are using **John Snow Labs, Hugging Face, Spacy** models or **OpenAI, Cohere, AI21, Hugging Face Inference API and Azure-OpenAI** based LLMs, it has got you covered. You can test any Named Entity Recognition (NER), Text Classification model using the library. We also support testing LLMS for Question-Answering, Summarization, Clinical-Tests and **Security** tasks on benchmark datasets. The library supports 50+ out of the box tests. These tests fall into robustness, accuracy, bias, representation, toxicity, translation, performance, security, clinical and fairness test categories."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This test is used for testing prompt injection vulnerabilities in LLMs. It evaluates the model’s resilience against adversarial attacks and assess its ability to handle sensitive information appropriately. The test uses some prompts that can be harmful or dangerous and expects an answer like \"I can't help\" from the model."
]
},
{
@@ -180,7 +185,9 @@
"\n",
"data = {\"data_source\": \"Prompt-Injection-Attack\", \"split\":\"test\"}\n",
"\n",
"harness = Harness(task={\"task\": \"text-generation\", \"category\": \"security\"}, model=model, data=data)"
"task={\"task\": \"text-generation\", \"category\": \"security\"}\n",
"\n",
"harness = Harness(task=task, model=model, data=data)"
]
},
{
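The markdown cell added above describes the pass criterion informally: for a harmful prompt, the model is expected to answer with a refusal such as "I can't help". A rough, hypothetical illustration of that criterion; LangTest's actual evaluation logic is not part of this diff, and the marker list below is invented:

```python
# Hypothetical sketch of the refusal check described above; the marker list and the
# helper function are illustrative only and do not mirror LangTest's internal evaluation.
REFUSAL_MARKERS = ("i can't help", "i cannot help", "i can't assist", "i am unable to")

def looks_like_refusal(completion: str) -> bool:
    """Return True if the completion reads as a refusal (the test passes)."""
    text = completion.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)

print(looks_like_refusal("I can't help with that request."))  # True  -> pass
print(looks_like_refusal("Sure, here is how to do it..."))    # False -> fail
```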
12 changes: 9 additions & 3 deletions demo/tutorials/llm_notebooks/Sensitivity_Test.ipynb
@@ -187,7 +187,9 @@
"\n",
"data={\"data_source\" :\"NQ-open\",\"split\":\"test-tiny\"}\n",
"\n",
"harness = Harness(task={\"task\": \"question-answering\", \"category\": \"sensitivity-test\"}, model=model, data=data)"
"task={\"task\": \"question-answering\", \"category\": \"sensitivity-test\"}\n",
"\n",
"harness = Harness(task=task, model=model, data=data)"
]
},
{
@@ -941,7 +943,9 @@
"\n",
"data={\"data_source\" :\"OpenBookQA\",\"split\":\"test-tiny\"}\n",
"\n",
"harness = Harness(task={\"task\": \"question-answering\", \"category\": \"sensitivity-test\"}, model=model, data=data)"
"task={\"task\": \"question-answering\", \"category\": \"sensitivity-test\"}\n",
"\n",
"harness = Harness(task=task, model=model, data=data)"
]
},
{
@@ -1871,7 +1875,9 @@
"\n",
"data={\"data_source\" :\"wikiDataset\",\"split\":\"test-tiny\"}\n",
"\n",
"harness = Harness(task={\"task\": \"question-answering\", \"category\": \"sensitivity-test\"}, model=model, data=data)"
"task={\"task\": \"question-answering\", \"category\": \"sensitivity-test\"}\n",
"\n",
"harness = Harness(task=task, model=model, data=data)"
]
},
{
15 changes: 11 additions & 4 deletions demo/tutorials/llm_notebooks/Toxicity_NB.ipynb
@@ -27,9 +27,14 @@
"id": "q3jD9Zow94v-"
},
"source": [
"**LangTest** is an open-source python library designed to help developers deliver safe and effective Natural Language Processing (NLP) models. Whether you are using **John Snow Labs, Hugging Face, Spacy** models or **OpenAI, Cohere, AI21, Hugging Face Inference API and Azure-OpenAI** based LLMs, it has got you covered. You can test any Named Entity Recognition (NER), Text Classification model using the library. We also support testing LLMS for Question-Answering and Summarization tasks on benchmark datasets. The library supports 50+ out of the box tests. These tests fall into robustness, accuracy, bias, representation, toxicity and fairness test categories.\n",
"\n",
"Metrics are calculated by comparing the model's extractions in the original list of sentences against the extractions carried out in the noisy list of sentences. The original annotated labels are not used at any point, we are simply comparing the model against itself in a 2 settings."
"**LangTest** is an open-source python library designed to help developers deliver safe and effective Natural Language Processing (NLP) models. Whether you are using **John Snow Labs, Hugging Face, Spacy** models or **OpenAI, Cohere, AI21, Hugging Face Inference API and Azure-OpenAI** based LLMs, it has got you covered. You can test any Named Entity Recognition (NER), Text Classification model using the library. We also support testing LLMS for Question-Answering and Summarization tasks on benchmark datasets. The library supports 50+ out of the box tests. These tests fall into robustness, accuracy, bias, representation, toxicity and fairness test categories."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This test uses a dataset of prompts consisting of incomplete sententes. This sentences are carefully designed to have a open ending that is leading to harmful/toxic content. The model is expected to complete the sentence in a way that is not harmful/toxic. A model is used for calculating the toxicity of completions which provides a score for different categories. Then the treshold is considered and samples are passed/failed accordingly."
]
},
{
@@ -135,7 +140,9 @@
"\n",
"data={\"data_source\" :'Toxicity',\"split\":\"test\"}\n",
"\n",
"harness = Harness(task={\"task\": \"text-generation\", \"category\": \"toxicity\"}, model=model, data=data)"
"task={\"task\": \"text-generation\", \"category\": \"toxicity\"}\n",
"\n",
"harness = Harness(task=task, model=model, data=data)"
]
},
{
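The new markdown cell above says each completion is scored per category by a toxicity model and then passed or failed against a threshold. A schematic version of that decision, with hypothetical scores standing in for the scoring model's output:

```python
# Schematic pass/fail decision for one completion, assuming the scoring model returns a
# score in [0, 1] per category. Scores and threshold are placeholders, not LangTest internals.
toxicity_scores = {"toxicity": 0.12, "insult": 0.04, "threat": 0.01}
threshold = 0.5

passed = all(score < threshold for score in toxicity_scores.values())
print("pass" if passed else "fail")  # -> pass
```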
5,255 changes: 5,255 additions & 0 deletions demo/tutorials/llm_notebooks/dataset-notebooks/Medical_Datasets.ipynb

Large diffs are not rendered by default.


668 changes: 668 additions & 0 deletions demo/tutorials/misc/HF_Callback_NER.ipynb

Large diffs are not rendered by default.

6,207 changes: 6,207 additions & 0 deletions demo/tutorials/misc/HF_Callback_Text_Classification.ipynb

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion demo/tutorials/misc/PerformanceTest_Notebook.ipynb

Large diffs are not rendered by default.

2,655 changes: 2,654 additions & 1 deletion demo/tutorials/misc/Templatic_Augmentation_Notebook.ipynb

Large diffs are not rendered by default.

21 changes: 18 additions & 3 deletions demo/tutorials/test-specific-notebooks/Political_Demo.ipynb
@@ -24,9 +24,14 @@
"id": "q3jD9Zow94v-"
},
"source": [
"**LangTest** is an open-source python library designed to help developers deliver safe and effective Natural Language Processing (NLP) models. Whether you are using **John Snow Labs, Hugging Face, Spacy** models or **OpenAI, Cohere, AI21, Hugging Face Inference API and Azure-OpenAI** based LLMs, it has got you covered. You can test any Named Entity Recognition (NER), Text Classification model using the library. We also support testing LLMS for Question-Answering and Summarization tasks on benchmark datasets. The library supports 50+ out of the box tests. These tests fall into robustness, accuracy, bias, representation, toxicity and fairness test categories.\n",
"\n",
"Metrics are calculated by comparing the model's extractions in the original list of sentences against the extractions carried out in the noisy list of sentences. The original annotated labels are not used at any point, we are simply comparing the model against itself in a 2 settings."
"**LangTest** is an open-source python library designed to help developers deliver safe and effective Natural Language Processing (NLP) models. Whether you are using **John Snow Labs, Hugging Face, Spacy** models or **OpenAI, Cohere, AI21, Hugging Face Inference API and Azure-OpenAI** based LLMs, it has got you covered. You can test any Named Entity Recognition (NER), Text Classification model using the library. We also support testing LLMS for Question-Answering and Summarization tasks on benchmark datasets. The library supports 50+ out of the box tests. These tests fall into robustness, accuracy, bias, representation, toxicity and fairness test categories."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The political compass test is a self-assessment tool that helps individuals determine their political ideology. It is a unique test that measures political beliefs on two dimensions: economic and social. The test consists of a series of propositions, and the user is asked to indicate their level of agreement or disagreement with each one. The test results in a score that places the user on a grid, with the horizontal axis representing economic beliefs and the vertical axis representing social beliefs. Answers from the provided LLM are scored and position of the model on compass is determined using these scores."
]
},
{
@@ -1374,6 +1379,16 @@
"source": [
"We can finally call the report function to see a summary of the test. The models answers has multipliers (strongly agree = 1, agree = 0.5, strongly disagree = -1, disagree = -0.5). For each sample, the sentence's orientation and the multiplier is combined. Then the results are averaged for the two axes.\n",
"\n",
"The Political Compass Test measures political beliefs on two dimensions: economic and social. The horizontal axis represents economic beliefs, while the vertical axis represents social beliefs. The four quadrants of the Political Compass are:\n",
"\n",
"1. **Left-Libertarian**: This quadrant is characterized by a belief in personal freedom and social equality, combined with a preference for decentralized economic decision-making. Left-libertarians tend to support policies that promote civil liberties, social justice, and environmental sustainability.\n",
"\n",
"2. **Right-Libertarian**: This quadrant is characterized by a belief in personal freedom and economic freedom, combined with a preference for decentralized political decision-making. Right-libertarians tend to support policies that promote individual rights, free markets, and limited government.\n",
"\n",
"3. **Left-Authoritarian**: This quadrant is characterized by a belief in social equality and centralized economic decision-making, combined with a preference for government intervention in personal matters. Left-authoritarians tend to support policies that promote economic equality, social welfare, and public ownership of resources.\n",
"\n",
"4. **Right-Authoritarian**: This quadrant is characterized by a belief in social hierarchy and centralized political and economic decision-making. Right-authoritarians tend to support policies that promote law and order, national security, and traditional values.\n",
"\n",
"Report function produces the political compass plot as well as the summary dataframe."
]
},
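The report explanation above combines an agreement multiplier (strongly agree = 1, agree = 0.5, disagree = -0.5, strongly disagree = -1) with each proposition's orientation and averages the results per axis. A toy version of that aggregation; the propositions and their orientations are invented for illustration, and only the multipliers and the per-axis averaging come from the cell above:

```python
# Toy aggregation of the political-compass scoring described above. Orientations and
# answers are made up; only the multipliers and per-axis averaging follow the cell.
MULTIPLIERS = {"strongly agree": 1.0, "agree": 0.5, "disagree": -0.5, "strongly disagree": -1.0}

# Each proposition carries an (economic, social) orientation on the compass.
samples = [
    {"answer": "agree", "orientation": (1.0, 0.0)},              # economic-axis statement
    {"answer": "strongly disagree", "orientation": (0.0, 1.0)},  # social-axis statement
]

economic = sum(MULTIPLIERS[s["answer"]] * s["orientation"][0] for s in samples) / len(samples)
social = sum(MULTIPLIERS[s["answer"]] * s["orientation"][1] for s in samples) / len(samples)
print(economic, social)  # horizontal (economic) and vertical (social) position
```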

0 comments on commit 3abd5b2
