Merge pull request #207 from ufal/fix-iaa-single-campaign
Enable computing IAA scores between multiple groups
kasnerz authored Feb 4, 2025
2 parents f2ce36d + 1c74d77 commit 9ed23b9
Showing 3 changed files with 133 additions and 95 deletions.
168 changes: 88 additions & 80 deletions factgenie/notebooks/inter_annotator_agreement.ipynb
@@ -15,90 +15,31 @@
"source": [
"## Computing inter-annotator agreement (IAA) with factgenie\n",
"\n",
"This notebook shows how to compute inter-annotator agreement (IAA) between two annotator groups.\n",
"This notebook shows how to compute inter-annotator agreement (IAA) between annotators.\n",
"\n",
"### Input data\n",
"For using the notebook, you will need the CSV files generated by factgenie for computing inter-annotator agreement:\n",
"- `dataset_level_counts.csv`\n",
"- `example_level_counts.csv`\n",
"- `gamma_spans.csv`\n",
"\n",
"You can generate these files on the `/analyze` page (on the Inter-annotator agreement tab). On that page, you need to select the campaign(s) with multiple annotators per example and select `Export data files`.\n",
"\n",
"### Annotator groups\n",
"We will compute the correlation between two **annotator groups**. Each annotator group has an id in the format `{campaign_id}-anngroup-{group_idx}`. That means that it uniquely defines the ordinal number of the annotator within a specific campaign.\n",
"\n",
"#### Single campaign\n",
"You can compute IAA between annotators within a single campaign.\n",
"# Pearson r\n",
"\n",
"Example: in the campaign `llm-eval-1`, you used two annotators per example. Then you want to measure agreement between `llm-eval-1-anngroup-0` and `llm-eval-1-anngroup-1`.\n",
"First, we will use the **Pearson correlation coefficient** to measure the agreement between two annotator groups.\n",
"\n",
"#### Multiple campaigns\n",
"You can compute IAA between annotators in multiple campaigns **if these campaigns were annotating the same outputs**.\n",
"Specifically, we will measure how much the **error counts** agree. \n",
"\n",
"Example: you ran campaigns `llm-eval-1` and `llm-eval-2` over the same set of examples. Then you will measure agreement between `llm-eval-1-anngroup-0` and `llm-eval-2-anngroup-0`."
"In the ideal case, both annotator groups annotated the **same amount of errors of each category** for each example. The Pearson r coefficient will help us to quantify to which extent it is true. The value of 1 signifies perfect *positive linear correlation*, 0 signifies no linear correlation, -1 signifies perfect *negative linear correllation*."
]
},
{
"cell_type": "code",
"execution_count": 2,
"execution_count": 54,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"from scipy.stats import pearsonr\n",
"import numpy as np\n",
"import pandas as pd\n",
"import logging\n",
"import pygamma_agreement as pa\n",
"import traceback\n",
"from pyannote.core import Segment\n",
"from tqdm.notebook import tqdm\n",
"from scipy.stats import pearsonr\n",
"\n",
"# Set the directory where the csv files are located here\n",
"csv_path = \".\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Pearson r\n",
"\n",
"First, we will use the **Pearson correlation coefficient** to measure the agreement between two annotators.\n",
"\n",
"Specifically, we will measure how much the **error counts** agree. \n",
"\n",
"In the ideal case, both annotators annotated the **same amount of errors of each category** for each example. The Pearson r coefficient will help us to quantify to which extent it is true. The value of 1 signifies perfect *positive linear correlation*, 0 signifies no linear correlation, -1 signifies perfect *negative linear correllation*.\n",
"\n",
"We will compare both the example-level correlation, which is more strict, and dataset-level (or, more precisely, dataset-split-setup_id-level) correlation, which is more lenient.\n",
"\n",
"## Levels\n",
"\n",
"### Dataset-level\n",
"Pearson r between two annotators computed over a list of average error counts for each (dataset, split, setup_id) combination.\n",
"\n",
"### Example-level\n",
"Pearson r between two annotators computed over a list of error counts for each (dataset, split, setup_id, example_idx) combination.\n",
"\n",
"## Average type\n",
"\n",
"### Micro-average\n",
"A coefficient computed over concatenated results from all the categories.\n",
"\n",
"### Macro-average\n",
"An average of coefficients computed separately for each category."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"def compute_pearson_r(csv_path, group1, group2):\n",
" # Load data\n",
" df = pd.read_csv(csv_path)\n",
" \n",
"def compute_pearson_r(df, group1, group2):\n",
" group1_data = df[df['annotator_group_id'] == group1]\n",
" group2_data = df[df['annotator_group_id'] == group2]\n",
" \n",
@@ -121,20 +62,83 @@
" return {'micro': micro_corr, 'macro': macro_corr, 'category_correlations': type_corrs}"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"### Input data\n",
"For computing Pearson r, you will need (at least one of) the CSV files generated by factgenie:\n",
"- `example_level_counts.csv` - absolute error counts for each (dataset, split, setup_id, example_idx) combination,\n",
"- `dataset_level_counts.csv` - average error counts for each (dataset, split, setup_id) combination.\n",
"\n",
"You can generate these files on the `/analyze` page (on the Inter-annotator agreement tab). On that page, you need to select the campaign(s) with multiple annotators per example and select `Export data files`.\n"
]
},
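For orientation, here is a minimal sketch of the shape such a counts file is assumed to take. Only the `annotator_group_id` column is confirmed by the code in this diff; the other column names and values are illustrative assumptions.

```python
import pandas as pd

# Hypothetical example_level_counts.csv content. Column names other than
# annotator_group_id are assumptions for illustration only.
example_level_counts = pd.DataFrame([
    {"dataset": "quintd1", "split": "test", "setup_id": "gpt-4", "example_idx": 0,
     "annotator_group_id": "llm-eval-1-anngroup-0", "annotation_type": "Incorrect", "count": 2},
    {"dataset": "quintd1", "split": "test", "setup_id": "gpt-4", "example_idx": 0,
     "annotator_group_id": "llm-eval-1-anngroup-1", "annotation_type": "Incorrect", "count": 1},
])
print(example_level_counts)
```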
{
"cell_type": "code",
"execution_count": 55,
"metadata": {},
"outputs": [],
"source": [
"# Set the directory where the csv files are located here\n",
"csv_dir = \".\"\n",
"\n",
"level = \"example\"\n",
"\n",
"csv_filename = f\"{csv_dir}/{level}_level_counts.csv\"\n",
"# Load data\n",
"df = pd.read_csv(csv_filename)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Annotator groups\n",
"We will always compute the correlation between two **annotator groups**. Each annotator group has an id in the format `{campaign_id}-anngroup-{group_idx}`. That means that it uniquely defines the ordinal number of the annotator within a specific campaign.\n",
"\n",
"Example: in the campaign `llm-eval-1`, you used two annotators per example. Then you want to measure agreement between `llm-eval-1-anngroup-0` and `llm-eval-1-anngroup-1`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Use all annotator groups\n",
"groups = df.annotator_group_id.unique()\n",
"\n",
"for level in [\"dataset\", \"example\"]:\n",
" csv_filename = f\"{csv_path}/{level}_level_counts.csv\"\n",
"print(f\"Groups: {groups}\")\n",
"\n",
" groups = ('quintd1-gpt-4-anngroup-0', 'quintd1-human-anngroup-0')\n",
" correlations = compute_pearson_r(csv_filename, *groups)\n",
"from itertools import combinations\n",
"group_pairs = list(combinations(groups, 2))\n",
"\n",
" print(f\"{level}-level correlations between {groups[0]} and {groups[1]}\")\n",
"print(f\"Group pairs: {group_pairs}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"### Average type\n",
"- **Micro-average** - a coefficient computed over concatenated results from all the categories.\n",
"- **Macro-average** - an average of coefficients computed separately for each category."
]
},
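As a standalone illustration of the micro/macro distinction (separate from the notebook's `compute_pearson_r`, whose body is collapsed above), here is a sketch on invented toy counts, assumed to be aligned per example and category:

```python
import numpy as np
from scipy.stats import pearsonr

# Toy error counts for two annotator groups, assumed aligned on
# (dataset, split, setup_id, example_idx) for each category.
counts_a = {"Incorrect": [2, 0, 1, 3], "Misleading": [1, 1, 0, 2]}
counts_b = {"Incorrect": [2, 1, 1, 2], "Misleading": [0, 1, 0, 2]}

# Macro-average: compute Pearson r per category, then average the coefficients.
per_category = {cat: pearsonr(counts_a[cat], counts_b[cat])[0] for cat in counts_a}
macro = np.mean(list(per_category.values()))

# Micro-average: compute a single Pearson r over all categories concatenated.
micro = pearsonr(
    np.concatenate(list(counts_a.values())),
    np.concatenate(list(counts_b.values())),
)[0]

print(per_category)
print(f"macro={macro:.3f}, micro={micro:.3f}")
```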
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Compute correlations for all pairs of groups\n",
"for group1, group2 in group_pairs:\n",
" correlations = compute_pearson_r(df, group1, group2)\n",
"\n",
" print(f\"{level}-level correlations between {group1} and {group2}\")\n",
" print(\"==============================================\")\n",
"\n",
" print(f\"Micro Pearson-r: {correlations['micro']:.3f}\")\n",
@@ -153,7 +157,7 @@
"metadata": {},
"source": [
"# Gamma (γ) score\n",
"Second, we compute the gamma (γ) score between the two annotator groups.\n",
"Next, we compute the gamma (γ) score.\n",
"\n",
"This score suitable for computing IAA in cases where are both (1) determining span positions and (2) categorizing the spans.\n",
"\n",
@@ -168,12 +172,14 @@
},
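For reference, a minimal self-contained sketch of computing γ with `pygamma-agreement` on invented spans for a single example; the notebook's own `compute_gamma` helper, defined in the following cell, instead builds continua from the rows of `gamma_spans.csv`.

```python
import pygamma_agreement as pa
from pyannote.core import Segment

# Invented error spans from two annotator groups for one model output.
continuum = pa.Continuum()
continuum.add("llm-eval-1-anngroup-0", Segment(0, 12), "Incorrect")
continuum.add("llm-eval-1-anngroup-0", Segment(30, 41), "Misleading")
continuum.add("llm-eval-1-anngroup-1", Segment(2, 14), "Incorrect")
continuum.add("llm-eval-1-anngroup-1", Segment(29, 40), "Misleading")

# Same dissimilarity settings as the notebook uses below.
dissim = pa.CombinedCategoricalDissimilarity(delta_empty=1, alpha=1, beta=1)
gamma_results = continuum.compute_gamma(dissim)
print(f"gamma = {gamma_results.gamma:.3f}")
```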
{
"cell_type": "code",
"execution_count": 5,
"execution_count": 58,
"metadata": {},
"outputs": [],
"source": [
"from IPython import display\n",
"\n",
"import pygamma_agreement as pa\n",
"from pyannote.core import Segment\n",
"from tqdm.notebook import tqdm\n",
"\n",
"def compute_gamma(span_index, dissim):\n",
" gamma_scores = []\n",
@@ -243,11 +249,11 @@
},
{
"cell_type": "code",
"execution_count": 6,
"execution_count": 59,
"metadata": {},
"outputs": [],
"source": [
"gamma_spans = pd.read_csv(f\"{csv_path}/gamma_spans.csv\")"
"gamma_spans = pd.read_csv(f\"{csv_dir}/gamma_spans.csv\")"
]
},
{
@@ -264,13 +270,15 @@
"dissim = pa.CombinedCategoricalDissimilarity(delta_empty=1, alpha=1, beta=1)\n",
"gamma = compute_gamma(gamma_spans, dissim)\n",
"\n",
"print(f\"Gamma score: {gamma:.3f}\")"
"print(f\"==============================================\")\n",
"print(f\"Gamma score: {gamma:.3f}\")\n",
"print(f\"==============================================\")"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"display_name": "factgenie",
"language": "python",
"name": "python3"
},
53 changes: 43 additions & 10 deletions factgenie/static/js/analyze.js
@@ -88,10 +88,6 @@ function updateComparisonData() {
$("#selectedDatasetsContent").empty();
$("#agreement-btn").addClass("disabled")

if (selectedCampaigns.length < 2) {
// TODO make it also work for multiple annotators within the same campaign
return;
}
// find which category label names are common to all campaigns
const campaignCategories = selectedCampaigns.map(c => campaigns[c].metadata.config.annotation_span_categories).map(c => c.map(cat => cat.name));
const commonCategories = campaignCategories.reduce((acc, val) => {
@@ -102,23 +98,59 @@
commonCategories.map(c => `<span class="badge bg-secondary">${c}</span>`).join("\n")
);

// find examples that are common to all selected campaigns and that have a status `finished`
const combinations = selectedCampaigns.map(c => campaigns[c].data);
// Create campaign-annotator group combinations
const campaignAnnotatorGroups = selectedCampaigns.flatMap(campaign => {
const campaignData = campaigns[campaign].data;
const annotatorGroups = [...new Set(campaignData.map(d => d.annotator_group))];
return annotatorGroups.map(group => ({ campaign, group }));
});

// Get examples for each campaign-annotator group combination
const combinations = campaignAnnotatorGroups.map(({ campaign, group }) =>
campaigns[campaign].data.filter(d => d.annotator_group === group)
);

// Find common examples across all combinations
const commonExamples = combinations.reduce((acc, val) => {
return acc.filter(x => val.some(y => y.dataset === x.dataset && y.split === x.split && y.setup_id === x.setup_id));
return acc.filter(x => val.some(y =>
y.dataset === x.dataset &&
y.split === x.split &&
y.setup_id === x.setup_id
));
});
const finishedExamples = commonExamples.filter(e => e.status === 'finished');

// for every (dataset, split, setup_id) combination, compute the number of examples
// Count examples per dataset-split-setup combination
const exampleCounts = finishedExamples.reduce((acc, val) => {
const key = `${val.dataset}|${val.split}|${val.setup_id}`;
acc[key] = (acc[key] || 0) + 1;
return acc;
}, {});

const comparisonData = Object.entries(exampleCounts).map(([key, count]) => {
const filteredExampleCounts = Object.entries(exampleCounts).reduce((acc, [key, count]) => {
const [dataset, split, setup_id] = key.split('|');
const groupsWithExample = campaignAnnotatorGroups.filter(({ campaign, group }) => {
return campaigns[campaign].data.some(d =>
d.annotator_group === group &&
d.dataset === dataset &&
d.split === split &&
d.setup_id === setup_id &&
d.status === 'finished'
);
});

if (groupsWithExample.length >= 2) {
acc[key] = count;
}
return acc;
}, {});

const comparisonData = Object.entries(filteredExampleCounts).map(([key, count]) => {
const [dataset, split, setup_id] = key.split('|');
return { dataset, split, setup_id, example_count: count };
const groups = campaignAnnotatorGroups
.map(({ campaign, group }) => `${campaign}:${group}`)
.join(", ");
return { dataset, split, setup_id, example_count: count, groups };
});

$("#selectedDatasetsContent").html(
@@ -128,6 +160,7 @@ function updateComparisonData() {
<td>${d.split}</td>
<td>${d.setup_id}</td>
<td>${d.example_count}</td>
<td><small>${d.groups}</small></td>
<td><button type="button" class="btn btn-sm btn-secondary" onclick="deleteRow(this);">x</button></td>
</tr>`
).join("\n")
7 changes: 2 additions & 5 deletions factgenie/templates/pages/analyze.html
@@ -95,10 +95,6 @@ <h3><img src="{{ host_prefix }}/static/img/analysis.png" class="heading-img-inli
</p>
</div>

<p class="mt-3">
<small class="text-muted">Select <b>at least two campaigns</b> for comparison.</small>
</p>

<div id="data-select-area" class="row">
<div class="col-md-6">
<div class="card">
@@ -150,7 +146,8 @@ <h3><img src="{{ host_prefix }}/static/img/analysis.png" class="heading-img-inli
<th scope="col">Dataset</th>
<th scope="col">Split</th>
<th scope="col">Outputs</th>
<th scope="col">Examples with multiple annotators</th>
<th scope="col">Example count</th>
<th scope="col">Groups for comparison (at least 2 needed)</th>
</tr>
<tbody id="selectedDatasetsContent">
<!-- Selected combinations will be dynamically inserted here -->
