task-avg.en.html

---
title: Task+Overall
lang: en
layout: default
views:
  - type: rader
    target: ja
    width: 6
    title: Japanese
  - type: rader
    target: en
    width: 6
    title: English
  - type: rader
    target: ja_mtb
    width: 6
    title: Japanese MT-Bench
  - type: bar
    target: avb
    width: 6
    title: Average
    aspect_portrait: 1.1
    aspect_landscape: 1.1
persistent_group: task
instructions:
  - title: Usage and Notes
    text: "The radar charts show the scores on all tasks in the Japanese, Japanese MT-Bench, and English benchmarks for the LLMs selected in the table below. In addition, the bar chart shows their average scores. You can copy the permalink for the selected models from the 🔗 icon in the upper left corner of the site. Note that <strong>it may be inappropriate to discuss the superiority of some models based on their average scores or sort order, since some tasks have not been evaluated.</strong> For example, GPT-3.5 and GPT-4 are presumed to perform well on the Japanese and English tasks, but because they were not evaluated on these tasks, their average scores for these tasks are treated as 0 and they are sorted to the end."
---

{% include view.html %}