Add Eval Assist LLM as Judges (#1409)
* evalassist metric initial changes

Signed-off-by: Martín Santillán Cooper <[email protected]>

* Fixed registering eval assist to catalog. Eval assist now generates assessment and summary for hardcoded prompts

Signed-off-by: Martín Santillán Cooper <[email protected]>

* Update EvalAssistLLMasJudge

Signed-off-by: Martín Santillán Cooper <[email protected]>

* update catalog for direct assessment metric

Signed-off-by: Martín Santillán Cooper <[email protected]>

* add pairwiseEvaluator as EvalAssistLLMAsJudgePairwise

Signed-off-by: Martín Santillán Cooper <[email protected]>

* Pass Rubric dynamically

Signed-off-by: Martín Santillán Cooper <[email protected]>

* add catalog for pairwise metrics

Signed-off-by: Martín Santillán Cooper <[email protected]>

* update example evaluate_aval_assist_rubric

Signed-off-by: Martín Santillán Cooper <[email protected]>

* Add prometheus model for pairwise

Signed-off-by: Martín Santillán Cooper <[email protected]>

* Add prometheus in catalog

Signed-off-by: Martín Santillán Cooper <[email protected]>

* add pairwise criteria in preprocess_step

Signed-off-by: Martín Santillán Cooper <[email protected]>

* update input fields in pairwise example

Signed-off-by: Martín Santillán Cooper <[email protected]>

* add GPT rubric evaluator

Signed-off-by: Martín Santillán Cooper <[email protected]>

* add gpt to the catalog

Signed-off-by: Martín Santillán Cooper <[email protected]>

* Fixed incorrect merge conflict

Signed-off-by: Martín Santillán Cooper <[email protected]>

* Removed unnecessary template files

Signed-off-by: Martín Santillán Cooper <[email protected]>

* add GPT rubric evaluator

Signed-off-by: Martín Santillán Cooper <[email protected]>

* Renamed eval_assist to eval_assist_direct and used constants to create direct assessment evaluators

Signed-off-by: Martín Santillán Cooper <[email protected]>

* eval assist pairwise now gets model name from constants to register to catalog

Signed-off-by: Martín Santillán Cooper <[email protected]>

* Add rubrics and criteria as Enums

Signed-off-by: Martín Santillán Cooper <[email protected]>

* update eval_assist constants

Signed-off-by: Martín Santillán Cooper <[email protected]>

* Add GPT Pairwise evaluator

Signed-off-by: Martín Santillán Cooper <[email protected]>

* update catalog for eval assist

Signed-off-by: Martín Santillán Cooper <[email protected]>

* rename eval assist's example files

Signed-off-by: Martín Santillán Cooper <[email protected]>

* remove unused variable

Signed-off-by: Martín Santillán Cooper <[email protected]>

* Major refactors and pairwise implementation

Signed-off-by: Martín Santillán Cooper <[email protected]>

* Improve criteria parsing and classes

Signed-off-by: Martín Santillán Cooper <[email protected]>

* Make watsonx infer call concurrent

Signed-off-by: Martín Santillán Cooper <[email protected]>

* Adapt evaluators and examples context to be a dict instead of a str

Signed-off-by: Martín Santillán Cooper <[email protected]>

* Rename evalassist to eval_assist as in other files

Signed-off-by: Martín Santillán Cooper <[email protected]>

* Rename context variable in examples

Signed-off-by: Martín Santillán Cooper <[email protected]>

* Use empty template in example cards

Signed-off-by: Martín Santillán Cooper <[email protected]>

* Remove unused imports

Signed-off-by: Martín Santillán Cooper <[email protected]>

* Many changes, preparing for the integration with EvalAssist

Signed-off-by: Martín Santillán Cooper <[email protected]>

* Fix examples

Signed-off-by: Martín Santillán Cooper <[email protected]>

* Remove model family

Signed-off-by: Martín Santillán Cooper <[email protected]>

* Remove MatchClosestOption checks

Signed-off-by: Martín Santillán Cooper <[email protected]>

* Clean unused files

Signed-off-by: Martín Santillán Cooper <[email protected]>

* Remove inference changes

Signed-off-by: Martín Santillán Cooper <[email protected]>

* remove .mailmap

Signed-off-by: Martín Santillán Cooper <[email protected]>

* Suggested Python-based approach for user criteria

Signed-off-by: Yoav Katz <[email protected]>
Signed-off-by: Martín Santillán Cooper <[email protected]>

* add context_field attribute and minor fixes and improvements

Signed-off-by: Martín Santillán Cooper <[email protected]>

* Add example of squad card

Signed-off-by: Martín Santillán Cooper <[email protected]>

* Introduce context_fields and generate_summaries attributes

Signed-off-by: Martín Santillán Cooper <[email protected]>

* Add example for running evaluations on squad

Signed-off-by: Martín Santillán Cooper <[email protected]>

* Changes after running pre-commit

Signed-off-by: Martín Santillán Cooper <[email protected]>

* Apply ruff suggestions

Signed-off-by: Martín Santillán Cooper <[email protected]>

* Add updated secrets.baseline

Signed-off-by: Martín Santillán Cooper <[email protected]>

* Remove unused function

Signed-off-by: Martín Santillán Cooper <[email protected]>

* Several changes

- Update examples
- Update enums to inherit from string
- Allow task data criteria to be a string from the catalog

Signed-off-by: Martín Santillán Cooper <[email protected]>

* Minor changes after updating with main

Signed-off-by: Martín Santillán Cooper <[email protected]>

* Add the option of passing the criteria as a single string description

Signed-off-by: Martín Santillán Cooper <[email protected]>

* Fix evaluate_eval_assist_direct_user_criteria_no_catalog

Signed-off-by: Martín Santillán Cooper <[email protected]>

* fix evaluate_existing_dataset_by_llm_as_judge_eval_assist

Signed-off-by: Martín Santillán Cooper <[email protected]>

* Refactor EvalAssistLLMAsJudgeDirect

Signed-off-by: Martín Santillán Cooper <[email protected]>

* Refactor EvalAssistLLMAsJudgePairwise

Signed-off-by: Martín Santillán Cooper <[email protected]>

* Add more entries to the pairwise results

Signed-off-by: Martín Santillán Cooper <[email protected]>

* Modularize get_results in direct assessment

Signed-off-by: Martín Santillán Cooper <[email protected]>

* Simplified example by using standard qa template and "create_dataset".

Also allowed the qa multi-reference template to work without references (and added a check to the metric that requires them)

Signed-off-by: Yoav Katz <[email protected]>

* Simplified example

Signed-off-by: Yoav Katz <[email protected]>

* Simplified example

Signed-off-by: Yoav Katz <[email protected]>

* Removed old way to set a custom single criteria over dataset

The old way required a dedicated task field.

Signed-off-by: Yoav Katz <[email protected]>

* Example with criteria from the dataset (WIP)

Signed-off-by: Yoav Katz <[email protected]>

* Added example of taking judgement string from file

Signed-off-by: Yoav Katz <[email protected]>

* Change temperature test case name

Signed-off-by: Martín Santillán Cooper <[email protected]>

* Update criteria parsing now that operators are used

Signed-off-by: Martín Santillán Cooper <[email protected]>

* Start pairwise change

Signed-off-by: Martín Santillán Cooper <[email protected]>

* Fix pairwise bugs and update examples to have >1 instances

Signed-off-by: Martín Santillán Cooper <[email protected]>

* Fixed type of pairwise judge.

Added NullTemplate that returns empty prompt and no references.

Signed-off-by: Yoav Katz <[email protected]>

* Minor simplification of example

Signed-off-by: Yoav Katz <[email protected]>

* Added global scores of mean win rate and mean ranking per prediction name.

Added check to ensure all instances have the same predictions.

Signed-off-by: Yoav Katz <[email protected]>
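
(Illustration only: a minimal sketch of the aggregation this commit describes, using hypothetical per-instance results keyed by prediction name. The field names and structure below are assumptions; the actual logic lives in the pairwise judge metric added in this PR.)

from statistics import mean

# Hypothetical per-instance pairwise results; names and layout are illustrative assumptions.
instance_results = [
    {"win_rate": {"system_a": 1.0, "system_b": 0.0}, "ranking": {"system_a": 1, "system_b": 2}},
    {"win_rate": {"system_a": 0.5, "system_b": 0.5}, "ranking": {"system_a": 1, "system_b": 2}},
]

systems = set(instance_results[0]["win_rate"])
# The added check: every instance must score the same set of predictions.
assert all(set(r["win_rate"]) == systems for r in instance_results)

global_scores = {}
for name in systems:
    global_scores[f"{name}_mean_win_rate"] = mean(r["win_rate"][name] for r in instance_results)
    global_scores[f"{name}_mean_ranking"] = mean(r["ranking"][name] for r in instance_results)
print(global_scores)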

* Add/update no-catalog examples

Signed-off-by: Martín Santillán Cooper <[email protected]>

* Remove user criteria pairwise example

Signed-off-by: Martín Santillán Cooper <[email protected]>

* Added 'criteria_field' param to judge.

Instead of implicitly using a hard-coded field name.

Signed-off-by: Yoav Katz <[email protected]>

* Add pairwise criteria operators and refactor names

Signed-off-by: Martín Santillán Cooper <[email protected]>

* Update examples

Signed-off-by: Martín Santillán Cooper <[email protected]>

* Add criteria to the result

Signed-off-by: Martín Santillán Cooper <[email protected]>

* Move criteria verification to parent class

Signed-off-by: Martín Santillán Cooper <[email protected]>

* Change '|' to Union

Signed-off-by: Martín Santillán Cooper <[email protected]>

* Change missing '|' to Union

Signed-off-by: Martín Santillán Cooper <[email protected]>

* fix pre-commits check

Signed-off-by: Martín Santillán Cooper <[email protected]>

* Fix: don't add positional bias check if it is disabled

Signed-off-by: Martín Santillán Cooper <[email protected]>

* Removed test_multi_reference_template_with_empty_references

Because it's now supported.

Signed-off-by: Yoav Katz <[email protected]>

* Fixed failing unit tests by adding missing imports

Signed-off-by: Yoav Katz <[email protected]>

* Running ruff

Signed-off-by: Yoav Katz <[email protected]>

* Removed temp workaround for a problem

Signed-off-by: Yoav Katz <[email protected]>

* Moved criteria existence check from verify() to before processing

Signed-off-by: Yoav Katz <[email protected]>

* Reverted changes in log probs in inference.py

Signed-off-by: Yoav Katz <[email protected]>
Signed-off-by: Martín Santillán Cooper <[email protected]>

* Fix typo in LLMasJudge

Signed-off-by: Martín Santillán Cooper <[email protected]>

* Fix catalog name

task_based_ll_mas_judge -> task_based_llm_as_judge

Signed-off-by: Martín Santillán Cooper <[email protected]>

* bring back the old TaskBasedLLMasJudge class name

Signed-off-by: Martín Santillán Cooper <[email protected]>

* Fix examples

Signed-off-by: elronbandel <[email protected]>

* Remove example from testing

Signed-off-by: elronbandel <[email protected]>

* Fix type

Signed-off-by: elronbandel <[email protected]>

---------

Signed-off-by: Martín Santillán Cooper <[email protected]>
Signed-off-by: Yoav Katz <[email protected]>
Signed-off-by: elronbandel <[email protected]>
Co-authored-by: Tejaswini Pedapati <[email protected]>
Co-authored-by: Swapnaja <[email protected]>
Co-authored-by: Swapnaja Achintalwar [email protected] <[email protected]>
Co-authored-by: Yoav Katz <[email protected]>
Co-authored-by: Yoav Katz <[email protected]>
Co-authored-by: elronbandel <[email protected]>
7 people authored Dec 23, 2024
1 parent 9e1e10f commit d53b187
Showing 76 changed files with 3,176 additions and 475 deletions.
5 changes: 0 additions & 5 deletions .pre-commit-config.yaml
@@ -10,11 +10,6 @@ repos:
        args: [--fix]
        exclude: src/unitxt/metrics.py|examples/evaluate_existing_dataset_no_install.py
      # Run the linter on the specific file with the ignore flag
      - id: ruff
        name: ruff (src/unitxt/metrics.py)
        files: src/unitxt/metrics.py
        args: [--fix, --ignore, C901]
      # Run the linter on the specific file with the ignore flag
      - id: ruff
        name: ruff (examples/evaluate_existing_dataset_no_install.py)
        files: examples/evaluate_existing_dataset_no_install.py
118 changes: 118 additions & 0 deletions examples/evaluate_existing_dataset_by_llm_as_judge_direct.py
@@ -0,0 +1,118 @@
import statistics

from unitxt import get_logger, get_settings, load_dataset
from unitxt.api import evaluate
from unitxt.inference import (
    CrossProviderInferenceEngine,
)
from unitxt.text_utils import print_dict

logger = get_logger()
settings = get_settings()

# Use the HF load_dataset API to load the squad QA dataset using the standard template in the catalog.
# We set loader_limit to 10 to reduce download time.
criterias = ["answer_relevance", "coherence", "conciseness"]
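# Each metric string below parameterizes the judge from the catalog: which criteria to apply,
# which task fields to pass as context, and a prefix that distinguishes the reported scores.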
metrics = [
    "metrics.llm_as_judge.direct.rits.llama3_1_70b"
    "[context_fields=[context,question],"
    f"criteria=metrics.llm_as_judge.direct.criterias.{criteria},"
    f"score_prefix={criteria}_]"
    for criteria in criterias
]
dataset = load_dataset(
    card="cards.squad",
    metrics=metrics,
    loader_limit=10,
    max_test_instances=10,
    split="test",
)

# Run inference with a model to get predictions.
inference_model = CrossProviderInferenceEngine(
    model="llama-3-2-1b-instruct", provider="watsonx"
)

"""
We are using a CrossProviderInferenceEngine inference engine that supply api access to provider such as:
watsonx, bam, openai, azure, aws and more.
For the arguments these inference engines can receive, please refer to the classes documentation or read
about the the open ai api arguments the CrossProviderInferenceEngine follows.
"""
predictions = inference_model.infer(dataset)

gold_answers = [d[0] for d in dataset["references"]]

# Evaluate the predictions using the defined metric.
evaluated_predictions = evaluate(predictions=predictions, data=dataset)
evaluated_gold_answers = evaluate(predictions=gold_answers, data=dataset)

print_dict(
    evaluated_predictions[0],
    keys_to_print=[
        "source",
        "score",
    ],
)
print_dict(
    evaluated_gold_answers[0],
    keys_to_print=[
        "source",
        "score",
    ],
)

for criteria in criterias:
    logger.info(f"Scores for criteria '{criteria}'")
    gold_answer_scores = [
        instance["score"]["instance"][f"{criteria}_llm_as_a_judge_score"]
        for instance in evaluated_gold_answers
    ]
    gold_answer_position_bias = [
        int(instance["score"]["instance"][f"{criteria}_positional_bias"])
        for instance in evaluated_gold_answers
    ]
    prediction_scores = [
        instance["score"]["instance"][f"{criteria}_llm_as_a_judge_score"]
        for instance in evaluated_predictions
    ]
    prediction_position_bias = [
        int(instance["score"]["instance"][f"{criteria}_positional_bias"])
        for instance in evaluated_predictions
    ]

    logger.info(
        f"Scores of gold answers: {statistics.mean(gold_answer_scores)} +/- {statistics.stdev(gold_answer_scores)}"
    )
    logger.info(
        f"Scores of predicted answers: {statistics.mean(prediction_scores)} +/- {statistics.stdev(prediction_scores)}"
    )
    logger.info(
        f"Positional bias occurrence on gold answers: {statistics.mean(gold_answer_position_bias)}"
    )
    logger.info(
        f"Positional bias occurrence on predicted answers: {statistics.mean(prediction_position_bias)}\n"
    )

"""
Output with 100 examples
Scores for criteria 'answer_relevance'
Scores of gold answers: 0.9625 +/- 0.14811526360619054
Scores of predicted answers: 0.5125 +/- 0.4638102516061385
Positional bias occurrence on gold answers: 0.03
Positional bias occurrence on predicted answers: 0.12
Scores for criteria 'coherence'
Scores of gold answers: 0.159 +/- 0.15689216524464028
Scores of predicted answers: 0.066 +/- 0.11121005695384194
Positional bias occurrence on gold answers: 0.16
Positional bias occurrence on predicted answers: 0.07
Scores for criteria 'conciseness'
Scores of gold answers: 1.0 +/- 0.0
Scores of predicted answers: 0.34 +/- 0.47609522856952335
Positional bias occurrence on gold answers: 0.03
Positional bias occurrence on predicted answers: 0.01
"""
49 changes: 49 additions & 0 deletions examples/evaluate_llm_as_judge_direct_criteria_from_dataset.py
@@ -0,0 +1,49 @@
from typing import Any

from unitxt import evaluate, load_dataset
from unitxt.blocks import Task, TaskCard
from unitxt.llm_as_judge_operators import CreateYesNoCriteriaFromString
from unitxt.loaders import LoadFromDictionary
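# CreateYesNoCriteriaFromString (added in this PR) turns the free-text "judgement" field of
# each row into a Yes/No criteria object, so every instance can carry its own criteria.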

data = {
    "test": [
        {
            "question": "How is the weather?",
            "judgement": "In the response, if there is a numerical temperature present, is it denominated in both Fahrenheit and Celsius?",
        },
        {
            "question": "Tell me a joke about cats",
            "judgement": "Is the response funny?",
        },
    ]
}

card = TaskCard(
    loader=LoadFromDictionary(data=data, data_classification_policy=["public"]),
    preprocess_steps=[
        CreateYesNoCriteriaFromString(field="judgement", to_field="criteria"),
    ],
    task=Task(
        input_fields={"question": str},
        reference_fields={"criteria": Any},
        prediction_type=str,
        metrics=[
            "metrics.llm_as_judge.direct.watsonx.llama3_1_70b[context_fields=question,criteria_field=criteria]"
        ],
    ),
)

dataset = load_dataset(card=card, template="templates.empty", split="test")

predictions = [
    """On most days, the weather is warm and humid, with temperatures often soaring into the high 80s and low 90s Fahrenheit (around 31-34°C). The dense foliage of the jungle acts as a natural air conditioner, keeping the temperature relatively stable and comfortable for the inhabitants.""",
    """Why did the cat cross the road? To cat to the other side.""",
]

results = evaluate(predictions=predictions, data=dataset)

print("Global Scores:")
print(results.global_scores.summary)

print("Instance Scores:")
print(results.instance_scores.summary)
33 changes: 33 additions & 0 deletions examples/evaluate_llm_as_judge_direct_predefined_criteria.py
@@ -0,0 +1,33 @@
from unitxt import get_logger
from unitxt.api import create_dataset, evaluate

logger = get_logger()

data = [
    {"question": "How is the weather?"},
    {"question": "How is the weather?"},
    {"question": "How is the weather?"},
]

criteria = "metrics.llm_as_judge.direct.criterias.temperature_in_celsius_and_fahrenheit"
metrics = [
    f"metrics.llm_as_judge.direct.rits.llama3_1_70b[criteria={criteria}, context_fields=[question]]"
]

dataset = create_dataset(
    task="tasks.qa.open", test_set=data, metrics=metrics, split="test"
)

predictions = [
    """On most days, the weather is warm and humid, with temperatures often soaring into the high 80s and low 90s Fahrenheit (around 31-34°C). The dense foliage of the jungle acts as a natural air conditioner, keeping the temperature relatively stable and comfortable for the inhabitants.""",
    """On most days, the weather is warm and humid, with temperatures often soaring into the high 80s and low 90s Fahrenheit. The dense foliage of the jungle acts as a natural air conditioner, keeping the temperature relatively stable and comfortable for the inhabitants.""",
    """On most days, the weather is warm and humid. The dense foliage of the jungle acts as a natural air conditioner, keeping the temperature relatively stable and comfortable for the inhabitants.""",
]

results = evaluate(predictions=predictions, data=dataset)

print("Global Scores:")
print(results.global_scores.summary)

print("Instance Scores:")
print(results.instance_scores.summary)
62 changes: 62 additions & 0 deletions examples/evaluate_llm_as_judge_direct_user_criteria_no_catalog.py
@@ -0,0 +1,62 @@
from unitxt.api import create_dataset, evaluate
from unitxt.inference import CrossProviderInferenceEngine
from unitxt.llm_as_judge import LLMJudgeDirect
from unitxt.llm_as_judge_constants import (
    CriteriaWithOptions,
)

criteria = CriteriaWithOptions.from_obj(
    {
        "name": "Temperature in Fahrenheit and Celsius",
        "description": "In the response, if there is a numerical temperature present, is it denominated in both Fahrenheit and Celsius?",
        "options": [
            {
                "name": "Yes",
                "description": "The temperature reading is provided in both Fahrenheit and Celsius.",
            },
            {
                "name": "No",
                "description": "The temperature reading is provided either in Fahrenheit or Celsius, but not both.",
            },
            {
                "name": "Pass",
                "description": "There is no numerical temperature reading in the response.",
            },
        ],
        "option_map": {"Yes": 1.0, "No": 0.5, "Pass": 0.0},
    }
)
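# The option_map above assigns the numeric score the judge reports when the corresponding
# option is selected (an assumption based on how scores are read in the other examples of this PR).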


data = [
    {"question": "How is the weather?"},
    {"question": "How is the weather?"},
    {"question": "How is the weather?"},
]

metric = LLMJudgeDirect(
    inference_engine=CrossProviderInferenceEngine(
        model="llama-3-1-70b-instruct", max_tokens=1024
    ),
    criteria=criteria,
    context_fields=["question"],
    criteria_field="criteria",
)

dataset = create_dataset(
    task="tasks.qa.open", test_set=data, metrics=[metric], split="test"
)

predictions = [
    """On most days, the weather is warm and humid, with temperatures often soaring into the high 80s and low 90s Fahrenheit (around 31-34°C). The dense foliage of the jungle acts as a natural air conditioner, keeping the temperature relatively stable and comfortable for the inhabitants.""",
    """On most days, the weather is warm and humid, with temperatures often soaring into the high 80s and low 90s Fahrenheit. The dense foliage of the jungle acts as a natural air conditioner, keeping the temperature relatively stable and comfortable for the inhabitants.""",
    """On most days, the weather is warm and humid. The dense foliage of the jungle acts as a natural air conditioner, keeping the temperature relatively stable and comfortable for the inhabitants.""",
]

results = evaluate(predictions=predictions, data=dataset)

print("Global Scores:")
print(results.global_scores.summary)

print("Instance Scores:")
print(results.instance_scores.summary)
File renamed without changes.
61 changes: 61 additions & 0 deletions examples/evaluate_llm_as_judge_pairwise_criteria_from_dataset.py
@@ -0,0 +1,61 @@
from typing import Any, List

from unitxt import evaluate, load_dataset
from unitxt.blocks import Task, TaskCard
from unitxt.llm_as_judge_operators import (
    CreateCriteriaFromString,
)
from unitxt.loaders import LoadFromDictionary
from unitxt.templates import NullTemplate
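# NullTemplate, added in this PR, renders an empty prompt and no references, so the
# LLM-as-judge metric works directly on the raw task fields.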

data = {
    "test": [
        {
            "question": "How is the weather?",
            "judgement": "The temperature is described in both Fahrenheit and Celsius.",
        },
        {
            "question": "Tell me a joke about cats",
            "judgement": "Is the response funny?",
        },
    ]
}

card = TaskCard(
    loader=LoadFromDictionary(data=data, data_classification_policy=["public"]),
    preprocess_steps=[
        CreateCriteriaFromString(field="judgement", to_field="criteria"),
    ],
    task=Task(
        input_fields={"question": str},
        reference_fields={"criteria": Any},
        prediction_type=List[str],
        metrics=[
            "metrics.llm_as_judge.pairwise.rits.llama3_1_70b[context_fields=question,criteria_field=criteria]"
        ],
        default_template=NullTemplate(),
    ),
)

test_dataset = load_dataset(card=card, split="test")

predictions = [
    [
        """On most days, the weather is warm and humid, with temperatures often soaring into the high 80s and low 90s Fahrenheit (around 31-34°C). The dense foliage of the jungle acts as a natural air conditioner, keeping the temperature relatively stable and comfortable for the inhabitants.""",
        """On most days, the weather is warm and humid, with temperatures often soaring into the high 80s and low 90s Fahrenheit. The dense foliage of the jungle acts as a natural air conditioner, keeping the temperature relatively stable and comfortable for the inhabitants.""",
        """On most days, the weather is warm and humid. The dense foliage of the jungle acts as a natural air conditioner, keeping the temperature relatively stable and comfortable for the inhabitants.""",
    ],
    [
        """Why did the cat cross the road? To cat to the other side.""",
        """Why did the cat sit on the computer? Because it wanted to keep an eye on the mouse!""",
        """What is red, yellow and green? A traffic light.""",
    ],
]

results = evaluate(predictions=predictions, data=test_dataset)

print("Global Scores:")
print(results.global_scores.summary)

print("Instance Scores:")
print(results.instance_scores.summary)
59 changes: 59 additions & 0 deletions examples/evaluate_llm_as_judge_pairwise_predefined_criteria.py
@@ -0,0 +1,59 @@
from typing import Any, List

from unitxt import evaluate, load_dataset
from unitxt.blocks import Task, TaskCard
from unitxt.llm_as_judge_operators import LoadCriteria
from unitxt.loaders import LoadFromDictionary
from unitxt.templates import NullTemplate
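# LoadCriteria resolves the catalog name stored in each row's "criteria" field into a criteria
# object for the judge (an assumption based on the operator's name and its use below).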

data = {
    "test": [
        {
            "question": "How is the weather?",
            "criteria": "metrics.llm_as_judge.pairwise.criterias.temperature_in_celsius_and_fahrenheit",
        },
        {
            "question": "Tell me a joke about cats",
            "criteria": "metrics.llm_as_judge.pairwise.criterias.funny_joke",
        },
    ]
}

card = TaskCard(
    loader=LoadFromDictionary(data=data, data_classification_policy=["public"]),
    preprocess_steps=[
        LoadCriteria(field="criteria", to_field="criteria"),
    ],
    task=Task(
        input_fields={"question": str},
        reference_fields={"criteria": Any},
        prediction_type=List[str],
        metrics=[
            "metrics.llm_as_judge.pairwise.watsonx.llama3_1_70b[context_fields=question,criteria_field=criteria]"
        ],
        default_template=NullTemplate(),
    ),
)

dataset = load_dataset(card=card, split="test")

predictions = [
    [
        """On most days, the weather is warm and humid, with temperatures often soaring into the high 80s and low 90s Fahrenheit (around 31-34°C). The dense foliage of the jungle acts as a natural air conditioner, keeping the temperature relatively stable and comfortable for the inhabitants.""",
        """On most days, the weather is warm and humid, with temperatures often soaring into the high 80s and low 90s Fahrenheit. The dense foliage of the jungle acts as a natural air conditioner, keeping the temperature relatively stable and comfortable for the inhabitants.""",
        """On most days, the weather is warm and humid. The dense foliage of the jungle acts as a natural air conditioner, keeping the temperature relatively stable and comfortable for the inhabitants.""",
    ],
    [
        """Why did the cat cross the road? To cat to the other side.""",
        """Why did the cat sit on the computer? Because it wanted to keep an eye on the mouse!""",
        """What is red, yellow and green? A traffic light.""",
    ],
]

results = evaluate(predictions=predictions, data=dataset)

print("Global Scores:")
print(results.global_scores.summary)

print("Instance Scores:")
print(results.instance_scores.summary)