
MMLU Benchmarks #1163

Open · wants to merge 6 commits into main

Conversation

@gagika (Collaborator) commented Jan 13, 2025

MMLU Benchmark Script (0-shot)

Tests

Tested with llama2-7b and llama3.1-8b.
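For context, a 0-shot MMLU evaluation formats each question with its four answer options and asks the model to complete the answer letter directly, with no in-context examples. A minimal sketch of such a prompt builder is below; the exact template in this PR's script may differ, and all names here are illustrative:

```python
# Hedged sketch of a 0-shot MMLU prompt builder; the PR's actual
# template and function names may differ.
CHOICES = ["A", "B", "C", "D"]

def build_zero_shot_prompt(question: str, options: list[str], subject: str) -> str:
    """Format one MMLU item as a 0-shot multiple-choice prompt."""
    header = f"The following is a multiple choice question about {subject}.\n\n"
    body = question.strip() + "\n"
    for letter, option in zip(CHOICES, options):
        body += f"{letter}. {option}\n"
    # The model is scored on the letter it produces after "Answer:".
    return header + body + "Answer:"

prompt = build_zero_shot_prompt(
    "What is 2 + 2?",
    ["3", "4", "5", "6"],
    "elementary mathematics",
)
print(prompt)
```

The prediction is then typically taken as the first A/B/C/D token the model emits after the `Answer:` cue.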

Checklist

Before submitting this PR, please make sure (put X in square brackets):

  • I have performed a self-review of my code.
  • I have necessary comments in my code, particularly in hard-to-understand areas.
  • I have run end-to-end tests and provided workload links above if applicable.
  • I have made or will make corresponding changes to the doc if needed.

@RissyRan (Collaborator) left a comment


This is great! Do we have logs to see what the test results look like?

@gagika (Collaborator, Author) commented Jan 13, 2025

> This is great! Do we have logs to see what the test results look like?

Here is the output for llama3.1-8b with default 0-shot prompt:

Final accuracy on MMLU dataset: 0.6413

Subcategory Accuracies:
Accuracy for subcategory 'politics': 0.7963
Accuracy for subcategory 'other': 0.6876
Accuracy for subcategory 'culture': 0.8283
Accuracy for subcategory 'history': 0.7860
Accuracy for subcategory 'biology': 0.7819
Accuracy for subcategory 'psychology': 0.7640
Accuracy for subcategory 'physics': 0.5172
Accuracy for subcategory 'engineering': 0.6345
Accuracy for subcategory 'philosophy': 0.5323
Accuracy for subcategory 'health': 0.7000
Accuracy for subcategory 'economics': 0.6415
Accuracy for subcategory 'computer science': 0.5947
Accuracy for subcategory 'law': 0.5309
Accuracy for subcategory 'math': 0.4474
Accuracy for subcategory 'chemistry': 0.5314
Accuracy for subcategory 'business': 0.8101
Accuracy for subcategory 'geography': 0.7727

Category Accuracies:
Accuracy for category 'STEM': 0.5500
Accuracy for category 'humanities': 0.5819
Accuracy for category 'social sciences': 0.7488
Accuracy for category 'other (business, health, misc.)': 0.7104
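The subcategory and category numbers above roll up from per-question correctness. A minimal sketch of that aggregation, using made-up example records rather than the PR's actual data, could look like:

```python
from collections import defaultdict

# Hypothetical example records: (subcategory, predicted_letter, gold_letter).
# In the real script, predictions come from the model's 0-shot completions.
results = [
    ("math", "A", "A"),
    ("math", "B", "C"),
    ("physics", "D", "D"),
    ("biology", "A", "A"),
]

# Map each subcategory to its MMLU category (small subset, for illustration).
SUBCAT_TO_CAT = {
    "math": "STEM",
    "physics": "STEM",
    "biology": "STEM",
}

# Tally per-subcategory correct/total counts.
correct = defaultdict(int)
total = defaultdict(int)
for subcat, pred, gold in results:
    total[subcat] += 1
    correct[subcat] += int(pred == gold)

overall = sum(correct.values()) / sum(total.values())
print(f"Final accuracy on MMLU dataset: {overall:.4f}")

for subcat in total:
    print(f"Accuracy for subcategory '{subcat}': {correct[subcat] / total[subcat]:.4f}")

# Roll subcategory counts up into category-level accuracy.
cat_correct, cat_total = defaultdict(int), defaultdict(int)
for subcat in total:
    cat = SUBCAT_TO_CAT[subcat]
    cat_correct[cat] += correct[subcat]
    cat_total[cat] += total[subcat]
for cat in cat_total:
    print(f"Accuracy for category '{cat}': {cat_correct[cat] / cat_total[cat]:.4f}")
```

Note that category accuracy is computed from summed question counts, not by averaging subcategory accuracies, so larger subcategories weigh more.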
