Merge pull request #207 from ufal/fix-iaa-single-campaign
Enable computing IAA scores between multiple groups
kasnerz authored Feb 4, 2025
2 parents f2ce36d + 1c74d77 commit 9ed23b9
Showing 3 changed files with 133 additions and 95 deletions.
168 changes: 88 additions & 80 deletions factgenie/notebooks/inter_annotator_agreement.ipynb
@@ -15,90 +15,31 @@
"source": [
"## Computing inter-annotator agreement (IAA) with factgenie\n",
"\n",
"This notebook shows how to compute inter-annotator agreement (IAA) between two annotator groups.\n",
"This notebook shows how to compute inter-annotator agreement (IAA) between annotators.\n",
"\n",
"### Input data\n",
"For using the notebook, you will need the CSV files generated by factgenie for computing inter-annotator agreement:\n",
"- `dataset_level_counts.csv`\n",
"- `example_level_counts.csv`\n",
"- `gamma_spans.csv`\n",
"\n",
"You can generate these files on the `/analyze` page (on the Inter-annotator agreement tab). On that page, you need to select the campaign(s) with multiple annotators per example and select `Export data files`.\n",
"\n",
"### Annotator groups\n",
"We will compute the correlation between two **annotator groups**. Each annotator group has an id in the format `{campaign_id}-anngroup-{group_idx}`. That means that it uniquely defines the ordinal number of the annotator within a specific campaign.\n",
"\n",
"#### Single campaign\n",
"You can compute IAA between annotators within a single campaign.\n",
"# Pearson r\n",
"\n",
"Example: in the campaign `llm-eval-1`, you used two annotators per example. Then you want to measure agreement between `llm-eval-1-anngroup-0` and `llm-eval-1-anngroup-1`.\n",
"First, we will use the **Pearson correlation coefficient** to measure the agreement between two annotator groups.\n",
"\n",
"#### Multiple campaigns\n",
"You can compute IAA between annotators in multiple campaigns **if these campaigns were annotating the same outputs**.\n",
"Specifically, we will measure how much the **error counts** agree. \n",
"\n",
"Example: you ran campaigns `llm-eval-1` and `llm-eval-2` over the same set of examples. Then you will measure agreement between `llm-eval-1-anngroup-0` and `llm-eval-2-anngroup-0`."
"In the ideal case, both annotator groups annotated the **same amount of errors of each category** for each example. The Pearson r coefficient will help us to quantify to which extent it is true. The value of 1 signifies perfect *positive linear correlation*, 0 signifies no linear correlation, -1 signifies perfect *negative linear correllation*."
]
},
{
"cell_type": "code",
"execution_count": 2,
"execution_count": 54,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"from scipy.stats import pearsonr\n",
"import numpy as np\n",
"import pandas as pd\n",
"import logging\n",
"import pygamma_agreement as pa\n",
"import traceback\n",
"from pyannote.core import Segment\n",
"from tqdm.notebook import tqdm\n",
"from scipy.stats import pearsonr\n",
"\n",
"# Set the directory where the csv files are located here\n",
"csv_path = \".\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Pearson r\n",
"\n",
"First, we will use the **Pearson correlation coefficient** to measure the agreement between two annotators.\n",
"\n",
"Specifically, we will measure how much the **error counts** agree. \n",
"\n",
"In the ideal case, both annotators annotated the **same amount of errors of each category** for each example. The Pearson r coefficient will help us to quantify to which extent it is true. The value of 1 signifies perfect *positive linear correlation*, 0 signifies no linear correlation, -1 signifies perfect *negative linear correllation*.\n",
"\n",
"We will compare both the example-level correlation, which is more strict, and dataset-level (or, more precisely, dataset-split-setup_id-level) correlation, which is more lenient.\n",
"\n",
"## Levels\n",
"\n",
"### Dataset-level\n",
"Pearson r between two annotators computed over a list of average error counts for each (dataset, split, setup_id) combination.\n",
"\n",
"### Example-level\n",
"Pearson r between two annotators computed over a list of error counts for each (dataset, split, setup_id, example_idx) combination.\n",
"\n",
"## Average type\n",
"\n",
"### Micro-average\n",
"A coefficient computed over concatenated results from all the categories.\n",
"\n",
"### Macro-average\n",
"An average of coefficients computed separately for each category."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"def compute_pearson_r(csv_path, group1, group2):\n",
" # Load data\n",
" df = pd.read_csv(csv_path)\n",
" \n",
"def compute_pearson_r(df, group1, group2):\n",
" group1_data = df[df['annotator_group_id'] == group1]\n",
" group2_data = df[df['annotator_group_id'] == group2]\n",
" \n",
@@ -121,20 +62,83 @@
" return {'micro': micro_corr, 'macro': macro_corr, 'category_correlations': type_corrs}"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"### Input data\n",
"For computing Pearson r, you will need (at least one of) the CSV files generated by factgenie:\n",
"- `example_level_counts.csv` - absolute error counts for each (dataset, split, setup_id, example_idx) combination,\n",
"- `dataset_level_counts.csv` - average error counts for each (dataset, split, setup_id) combination.\n",
"\n",
"You can generate these files on the `/analyze` page (on the Inter-annotator agreement tab). On that page, you need to select the campaign(s) with multiple annotators per example and select `Export data files`.\n"
]
},
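For orientation, here is a minimal sketch of the shape such a counts file is assumed to take. Only the `annotator_group_id` column is confirmed by the code in this diff; the other column names and values are illustrative assumptions.

```python
import pandas as pd

# Hypothetical example_level_counts.csv content. Column names other than
# annotator_group_id are assumptions for illustration only.
example_level_counts = pd.DataFrame([
    {"dataset": "quintd1", "split": "test", "setup_id": "gpt-4", "example_idx": 0,
     "annotator_group_id": "llm-eval-1-anngroup-0", "annotation_type": "Incorrect", "count": 2},
    {"dataset": "quintd1", "split": "test", "setup_id": "gpt-4", "example_idx": 0,
     "annotator_group_id": "llm-eval-1-anngroup-1", "annotation_type": "Incorrect", "count": 1},
])
print(example_level_counts)
```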
{
"cell_type": "code",
"execution_count": 55,
"metadata": {},
"outputs": [],
"source": [
"# Set the directory where the csv files are located here\n",
"csv_dir = \".\"\n",
"\n",
"level = \"example\"\n",
"\n",
"csv_filename = f\"{csv_dir}/{level}_level_counts.csv\"\n",
"# Load data\n",
"df = pd.read_csv(csv_filename)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Annotator groups\n",
"We will always compute the correlation between two **annotator groups**. Each annotator group has an id in the format `{campaign_id}-anngroup-{group_idx}`. That means that it uniquely defines the ordinal number of the annotator within a specific campaign.\n",
"\n",
"Example: in the campaign `llm-eval-1`, you used two annotators per example. Then you want to measure agreement between `llm-eval-1-anngroup-0` and `llm-eval-1-anngroup-1`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Use all annotator groups\n",
"groups = df.annotator_group_id.unique()\n",
"\n",
"for level in [\"dataset\", \"example\"]:\n",
" csv_filename = f\"{csv_path}/{level}_level_counts.csv\"\n",
"print(f\"Groups: {groups}\")\n",
"\n",
" groups = ('quintd1-gpt-4-anngroup-0', 'quintd1-human-anngroup-0')\n",
" correlations = compute_pearson_r(csv_filename, *groups)\n",
"from itertools import combinations\n",
"group_pairs = list(combinations(groups, 2))\n",
"\n",
" print(f\"{level}-level correlations between {groups[0]} and {groups[1]}\")\n",
"print(f\"Group pairs: {group_pairs}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"### Average type\n",
"- **Micro-average** - a coefficient computed over concatenated results from all the categories.\n",
"- **Macro-average** - an average of coefficients computed separately for each category."
]
},
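As a standalone illustration of the micro/macro distinction (separate from the notebook's `compute_pearson_r`, whose body is collapsed above), here is a sketch on invented toy counts, assumed to be aligned per example and category:

```python
import numpy as np
from scipy.stats import pearsonr

# Toy error counts for two annotator groups, assumed aligned on
# (dataset, split, setup_id, example_idx) for each category.
counts_a = {"Incorrect": [2, 0, 1, 3], "Misleading": [1, 1, 0, 2]}
counts_b = {"Incorrect": [2, 1, 1, 2], "Misleading": [0, 1, 0, 2]}

# Macro-average: compute Pearson r per category, then average the coefficients.
per_category = {cat: pearsonr(counts_a[cat], counts_b[cat])[0] for cat in counts_a}
macro = np.mean(list(per_category.values()))

# Micro-average: compute a single Pearson r over all categories concatenated.
micro = pearsonr(
    np.concatenate(list(counts_a.values())),
    np.concatenate(list(counts_b.values())),
)[0]

print(per_category)
print(f"macro={macro:.3f}, micro={micro:.3f}")
```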
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Compute correlations for all pairs of groups\n",
"for group1, group2 in group_pairs:\n",
" correlations = compute_pearson_r(df, group1, group2)\n",
"\n",
" print(f\"{level}-level correlations between {group1} and {group2}\")\n",
" print(\"==============================================\")\n",
"\n",
" print(f\"Micro Pearson-r: {correlations['micro']:.3f}\")\n",
@@ -153,7 +157,7 @@
"metadata": {},
"source": [
"# Gamma (γ) score\n",
"Second, we compute the gamma (γ) score between the two annotator groups.\n",
"Next, we compute the gamma (γ) score.\n",
"\n",
"This score suitable for computing IAA in cases where are both (1) determining span positions and (2) categorizing the spans.\n",
"\n",
@@ -168,12 +172,14 @@
},
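For reference, a minimal self-contained sketch of computing γ with `pygamma-agreement` on invented spans for a single example; the notebook's own `compute_gamma` helper, defined in the following cell, instead builds continua from the rows of `gamma_spans.csv`.

```python
import pygamma_agreement as pa
from pyannote.core import Segment

# Invented error spans from two annotator groups for one model output.
continuum = pa.Continuum()
continuum.add("llm-eval-1-anngroup-0", Segment(0, 12), "Incorrect")
continuum.add("llm-eval-1-anngroup-0", Segment(30, 41), "Misleading")
continuum.add("llm-eval-1-anngroup-1", Segment(2, 14), "Incorrect")
continuum.add("llm-eval-1-anngroup-1", Segment(29, 40), "Misleading")

# Same dissimilarity settings as the notebook uses below.
dissim = pa.CombinedCategoricalDissimilarity(delta_empty=1, alpha=1, beta=1)
gamma_results = continuum.compute_gamma(dissim)
print(f"gamma = {gamma_results.gamma:.3f}")
```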
{
"cell_type": "code",
"execution_count": 5,
"execution_count": 58,
"metadata": {},
"outputs": [],
"source": [
"from IPython import display\n",
"\n",
"import pygamma_agreement as pa\n",
"from pyannote.core import Segment\n",
"from tqdm.notebook import tqdm\n",
"\n",
"def compute_gamma(span_index, dissim):\n",
" gamma_scores = []\n",
@@ -243,11 +249,11 @@
},
{
"cell_type": "code",
"execution_count": 6,
"execution_count": 59,
"metadata": {},
"outputs": [],
"source": [
"gamma_spans = pd.read_csv(f\"{csv_path}/gamma_spans.csv\")"
"gamma_spans = pd.read_csv(f\"{csv_dir}/gamma_spans.csv\")"
]
},
{
@@ -264,13 +270,15 @@
"dissim = pa.CombinedCategoricalDissimilarity(delta_empty=1, alpha=1, beta=1)\n",
"gamma = compute_gamma(gamma_spans, dissim)\n",
"\n",
"print(f\"Gamma score: {gamma:.3f}\")"
"print(f\"==============================================\")\n",
"print(f\"Gamma score: {gamma:.3f}\")\n",
"print(f\"==============================================\")"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"display_name": "factgenie",
"language": "python",
"name": "python3"
},
53 changes: 43 additions & 10 deletions factgenie/static/js/analyze.js
@@ -88,10 +88,6 @@ function updateComparisonData() {
$("#selectedDatasetsContent").empty();
$("#agreement-btn").addClass("disabled")

if (selectedCampaigns.length < 2) {
// TODO make it also work for multiple annotators within the same campaign
return;
}
// find which category label names are common to all campaigns
const campaignCategories = selectedCampaigns.map(c => campaigns[c].metadata.config.annotation_span_categories).map(c => c.map(cat => cat.name));
const commonCategories = campaignCategories.reduce((acc, val) => {
@@ -102,23 +98,59 @@
commonCategories.map(c => `<span class="badge bg-secondary">${c}</span>`).join("\n")
);

// find examples that are common to all selected campaigns and that have a status `finished`
const combinations = selectedCampaigns.map(c => campaigns[c].data);
// Create campaign-annotator group combinations
const campaignAnnotatorGroups = selectedCampaigns.flatMap(campaign => {
const campaignData = campaigns[campaign].data;
const annotatorGroups = [...new Set(campaignData.map(d => d.annotator_group))];
return annotatorGroups.map(group => ({ campaign, group }));
});

// Get examples for each campaign-annotator group combination
const combinations = campaignAnnotatorGroups.map(({ campaign, group }) =>
campaigns[campaign].data.filter(d => d.annotator_group === group)
);

// Find common examples across all combinations
const commonExamples = combinations.reduce((acc, val) => {
return acc.filter(x => val.some(y => y.dataset === x.dataset && y.split === x.split && y.setup_id === x.setup_id));
return acc.filter(x => val.some(y =>
y.dataset === x.dataset &&
y.split === x.split &&
y.setup_id === x.setup_id
));
});
const finishedExamples = commonExamples.filter(e => e.status === 'finished');

// for every (dataset, split, setup_id) combination, compute the number of examples
// Count examples per dataset-split-setup combination
const exampleCounts = finishedExamples.reduce((acc, val) => {
const key = `${val.dataset}|${val.split}|${val.setup_id}`;
acc[key] = (acc[key] || 0) + 1;
return acc;
}, {});

const comparisonData = Object.entries(exampleCounts).map(([key, count]) => {
const filteredExampleCounts = Object.entries(exampleCounts).reduce((acc, [key, count]) => {
const [dataset, split, setup_id] = key.split('|');
const groupsWithExample = campaignAnnotatorGroups.filter(({ campaign, group }) => {
return campaigns[campaign].data.some(d =>
d.annotator_group === group &&
d.dataset === dataset &&
d.split === split &&
d.setup_id === setup_id &&
d.status === 'finished'
);
});

if (groupsWithExample.length >= 2) {
acc[key] = count;
}
return acc;
}, {});

const comparisonData = Object.entries(filteredExampleCounts).map(([key, count]) => {
const [dataset, split, setup_id] = key.split('|');
return { dataset, split, setup_id, example_count: count };
const groups = campaignAnnotatorGroups
.map(({ campaign, group }) => `${campaign}:${group}`)
.join(", ");
return { dataset, split, setup_id, example_count: count, groups };
});

$("#selectedDatasetsContent").html(
@@ -128,6 +160,7 @@ function updateComparisonData() {
<td>${d.split}</td>
<td>${d.setup_id}</td>
<td>${d.example_count}</td>
<td><small>${d.groups}</small></td>
<td><button type="button" class="btn btn-sm btn-secondary" onclick="deleteRow(this);">x</button></td>
</tr>`
).join("\n")
7 changes: 2 additions & 5 deletions factgenie/templates/pages/analyze.html
@@ -95,10 +95,6 @@ <h3><img src="{{ host_prefix }}/static/img/analysis.png" class="heading-img-inli
</p>
</div>

<p class="mt-3">
<small class="text-muted">Select <b>at least two campaigns</b> for comparison.</small>
</p>

<div id="data-select-area" class="row">
<div class="col-md-6">
<div class="card">
@@ -150,7 +146,8 @@ <h3><img src="{{ host_prefix }}/static/img/analysis.png" class="heading-img-inli
<th scope="col">Dataset</th>
<th scope="col">Split</th>
<th scope="col">Outputs</th>
<th scope="col">Examples with multiple annotators</th>
<th scope="col">Example count</th>
<th scope="col">Groups for comparison (at least 2 needed)</th>
</tr>
<tbody id="selectedDatasetsContent">
<!-- Selected combinations will be dynamically inserted here -->
