
MMLU Benchmarks #1163

Open · wants to merge 6 commits into main

Conversation

@gagika (Collaborator) commented Jan 13, 2025

MMLU Benchmark Script (0-shot)

Tests

Tested with llama2-7b and llama3.1-8b.
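For context, a 0-shot MMLU evaluation formats each question with its four answer options and asks the model to complete the answer letter directly, with no in-context examples. A minimal sketch of such a prompt builder is below; the exact template in this PR's script may differ, and all names here are illustrative:

```python
# Hedged sketch of a 0-shot MMLU prompt builder; the PR's actual
# template and function names may differ.
CHOICES = ["A", "B", "C", "D"]

def build_zero_shot_prompt(question: str, options: list[str], subject: str) -> str:
    """Format one MMLU item as a 0-shot multiple-choice prompt."""
    header = f"The following is a multiple choice question about {subject}.\n\n"
    body = question.strip() + "\n"
    for letter, option in zip(CHOICES, options):
        body += f"{letter}. {option}\n"
    # The model is scored on the letter it produces after "Answer:".
    return header + body + "Answer:"

prompt = build_zero_shot_prompt(
    "What is 2 + 2?",
    ["3", "4", "5", "6"],
    "elementary mathematics",
)
print(prompt)
```

The prediction is then typically taken as the first A/B/C/D token the model emits after the `Answer:` cue.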

Checklist

Before submitting this PR, please make sure (put X in square brackets):

  • I have performed a self-review of my code.
  • I have necessary comments in my code, particularly in hard-to-understand areas.
  • I have run end-to-end tests and provided workload links above if applicable.
  • I have made or will make corresponding changes to the doc if needed.

@RissyRan (Collaborator) left a comment


This is great! Do we have logs to see what the test results look like?

@gagika (Collaborator, Author) commented Jan 13, 2025

> This is great! Do we have logs to see what the test results look like?

Here is the output for llama3.1-8b with default 0-shot prompt:

Final accuracy on MMLU dataset: 0.6413

Subcategory Accuracies:
Accuracy for subcategory 'politics': 0.7963
Accuracy for subcategory 'other': 0.6876
Accuracy for subcategory 'culture': 0.8283
Accuracy for subcategory 'history': 0.7860
Accuracy for subcategory 'biology': 0.7819
Accuracy for subcategory 'psychology': 0.7640
Accuracy for subcategory 'physics': 0.5172
Accuracy for subcategory 'engineering': 0.6345
Accuracy for subcategory 'philosophy': 0.5323
Accuracy for subcategory 'health': 0.7000
Accuracy for subcategory 'economics': 0.6415
Accuracy for subcategory 'computer science': 0.5947
Accuracy for subcategory 'law': 0.5309
Accuracy for subcategory 'math': 0.4474
Accuracy for subcategory 'chemistry': 0.5314
Accuracy for subcategory 'business': 0.8101
Accuracy for subcategory 'geography': 0.7727

Category Accuracies:
Accuracy for category 'STEM': 0.5500
Accuracy for category 'humanities': 0.5819
Accuracy for category 'social sciences': 0.7488
Accuracy for category 'other (business, health, misc.)': 0.7104
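The subcategory and category numbers above roll up from per-question correctness. A minimal sketch of that aggregation, using made-up example records rather than the PR's actual data, could look like:

```python
from collections import defaultdict

# Hypothetical example records: (subcategory, predicted_letter, gold_letter).
# In the real script, predictions come from the model's 0-shot completions.
results = [
    ("math", "A", "A"),
    ("math", "B", "C"),
    ("physics", "D", "D"),
    ("biology", "A", "A"),
]

# Map each subcategory to its MMLU category (small subset, for illustration).
SUBCAT_TO_CAT = {
    "math": "STEM",
    "physics": "STEM",
    "biology": "STEM",
}

# Tally per-subcategory correct/total counts.
correct = defaultdict(int)
total = defaultdict(int)
for subcat, pred, gold in results:
    total[subcat] += 1
    correct[subcat] += int(pred == gold)

overall = sum(correct.values()) / sum(total.values())
print(f"Final accuracy on MMLU dataset: {overall:.4f}")

for subcat in total:
    print(f"Accuracy for subcategory '{subcat}': {correct[subcat] / total[subcat]:.4f}")

# Roll subcategory counts up into category-level accuracy.
cat_correct, cat_total = defaultdict(int), defaultdict(int)
for subcat in total:
    cat = SUBCAT_TO_CAT[subcat]
    cat_correct[cat] += correct[subcat]
    cat_total[cat] += total[subcat]
for cat in cat_total:
    print(f"Accuracy for category '{cat}': {cat_correct[cat] / cat_total[cat]:.4f}")
```

Note that category accuracy is computed from summed question counts, not by averaging subcategory accuracies, so larger subcategories weigh more.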
