Dataset Statistics

This page lists all the datasets supported by OpenCompass.

Use the sorting and search functions to find the dataset you need.
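
For readers working from a copy of the list rather than the interactive page, the same search-and-sort idea can be sketched offline. This is a minimal illustration, not part of OpenCompass: `DatasetRow`, `ROWS`, and `find_datasets` are hypothetical names, and `ROWS` holds only a few name/category pairs copied from the list on this page.

```python
from typing import NamedTuple


class DatasetRow(NamedTuple):
    """One row of the dataset list: display name and category (hypothetical type)."""
    name: str
    category: str


# A small sample copied from the list on this page, not the full table.
ROWS = [
    DatasetRow("IFEval", "Instruction Following"),
    DatasetRow("BigCodeBench", "Code"),
    DatasetRow("LiveCodeBench", "Code"),
    DatasetRow("GSM8K", "Math"),
    DatasetRow("MATH500", "Math"),
    DatasetRow("MedQA", "Knowledge / Medicine"),
    DatasetRow("RULER", "Long Context"),
]


def find_datasets(rows, query):
    """Case-insensitive substring search over name and category, sorted by name."""
    needle = query.casefold()
    hits = [
        row for row in rows
        if needle in row.name.casefold() or needle in row.category.casefold()
    ]
    return sorted(hits, key=lambda row: row.name.casefold())


if __name__ == "__main__":
    for row in find_datasets(ROWS, "code"):
        print(f"{row.name}: {row.category}")
```

Matching on both columns means a query such as "math" returns datasets whose category is Math as well as those with "MATH" in the name.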

We provide a recommended running configuration for each dataset and, for some datasets, an additional recommended configuration based on an LLM judge.

You can start evaluation tasks quickly from the recommended configurations. Note, however, that these configurations may be updated over time.

Supported Dataset List

Name

Category

Paper or Repository

Recommended Config

Recommended Config (LLM Judge)

IFEval

Instruction Following

link

link

NPHardEval

Reasoning

link

link(TBD)

PMMEval

Language

link

link(TBD)

TheoremQA

Reasoning

link

link(TBD)

AGIEval

Examination

link

link(TBD)

BABILong

Long Context

link

link(TBD)

BigCodeBench

Code

link

link

CaLM

Reasoning

link

link(TBD)

InfiniteBench (∞Bench)

Long Context

link

link(TBD)

KOR-Bench

Reasoning

link

link

link

LawBench

Knowledge / Law

link

link(TBD) / link(TBD)

L-Eval

Long Context

link

link(TBD)

LiveCodeBench

Code

link

link

LiveMathBench

Math

link

link(TBD)

LiveReasonBench

Reasoning

link

link(TBD)

LongBench

Long Context

link

link(TBD) / link(TBD)

LV-Eval

Long Context

link

link(TBD)

Mastermath2024v1

Math

link

link(TBD)

matbench

Science / Material

link

link(TBD)

MedBench

Knowledge / Medicine

link

link(TBD)

MedCalc_Bench

Knowledge / Medicine

link

link(TBD)

MedQA

Knowledge / Medicine

link

link(TBD)

link(TBD)

MedXpertQA

Knowledge / Medicine

link

link(TBD)

link(TBD)

ClinicBench

Knowledge / Medicine

link

link(TBD)

link(TBD)

ScienceQA

Knowledge / Medicine

link

link(TBD)

link(TBD)

PubMedQA

Knowledge / Medicine

link

link(TBD)

link(TBD)

MuSR

Reasoning

link

link

link

NeedleBench V1 (Deprecated)

Long Context

link

link(TBD)

NeedleBench V2

Long Context

link

link(TBD)

RULER

Long Context

link

link(TBD)

AlignBench

Subjective / Alignment

link

link(TBD)

AlpacaEval

Subjective / Instruction Following

link

link(TBD)

Arena-Hard

Subjective / Chatbot

link

link(TBD)

FLAMES

Subjective / Alignment

link

link(TBD)

FOFO

Subjective / Format Following

link

link(TBD)

FollowBench

Subjective / Instruction Following

link

link(TBD)

HelloBench

Subjective / Long Context

link

link(TBD)

JudgerBench

Subjective / Long Context

link

link(TBD)

MT-Bench-101

Subjective / Multi-Round

link

link(TBD)

WildBench

Subjective / Real Task

link

link(TBD)

T-Eval

Tool Utilization

link

link(TBD) / link(TBD)

FinanceIQ

Knowledge / Finance

link

link(TBD)

GAOKAOBench

Examination

link

link(TBD)

LCBench

Code

link

link(TBD)

ArabicMMLU

Language

link

link(TBD)

OpenFinData

Knowledge / Finance

link

link(TBD)

QuALITY

Long Context

link

link(TBD)

Adversarial GLUE

Safety

link

link(TBD) / link(TBD) / link(TBD) / link(TBD) / link(TBD) / link(TBD)

CLUE / AFQMC

Language

link

link(TBD)

AIME2024

Examination

link

link

link

Adversarial NLI

Reasoning

link

link(TBD)

Anthropics Evals

Safety

link

link(TBD) / link(TBD) / link(TBD)

APPS

Code

link

link(TBD) / link(TBD)

ARC

Reasoning

link

link(TBD) / link(TBD)

ARC Prize

ARC-AGI

link

link(TBD)

SuperGLUE / AX

Reasoning

link

link(TBD) / link(TBD)

BIG-Bench Hard

Reasoning

link

link

link

BIG-Bench Extra Hard

Reasoning

link

link(TBD)

SuperGLUE / BoolQ

Knowledge

link

link(TBD)

CLUE / C3 (C³)

Understanding

link

link(TBD)

CARDBiomedBench

Knowledge / Medicine

link

link(TBD)

link(TBD)

SuperGLUE / CB

Reasoning

link

link(TBD)

C-EVAL

Examination

link

link(TBD)

CHARM

Reasoning

link

link(TBD)

ChemBench

Knowledge / Chemistry

link

link(TBD)

FewCLUE / CHID

Language

link

link(TBD)

Chinese SimpleQA

Knowledge

link

link(TBD)

CIBench

Code

link

link(TBD) / link(TBD) / link(TBD)

CivilComments

Safety

link

link(TBD)

Cloze Test-max/min

Code

link

link(TBD)

FewCLUE / CLUEWSC

Language / WSC

link

link(TBD)

CMB

Knowledge / Medicine

link

link(TBD)

CMMLU

Understanding

link

link

link

CLUE / CMNLI

Reasoning

link

link(TBD)

cmo_fib

Examination

link

link(TBD)

CLUE / CMRC

Understanding

link

link(TBD)

CommonSenseQA

Knowledge

link

link(TBD)

CommonSenseQA-CN

Knowledge

link

link(TBD)

SuperGLUE / COPA

Reasoning

link

link(TBD)

CrowsPairs

Safety

link

link(TBD)

CrowsPairs-CN

Safety

link

link(TBD)

CVALUES

Safety

link

link(TBD)

CLUE / DRCD

Understanding

link

link(TBD)

DROP (DROP Simple Eval)

Understanding

link

link

link

DS-1000

Code

link

link(TBD)

FewCLUE / EPRSTMT

Understanding

link

link(TBD)

Flores

Language

link

link(TBD)

Game24

Math

link

link(TBD)

Government Report Dataset

Long Context

link

link(TBD)

GPQA

Knowledge

link

link

link

GSM8K

Math

link

link(TBD)

GSM-Hard

Math

link

link(TBD)

HLE (Humanity’s Last Exam)

Reasoning

link

link(TBD)

HellaSwag

Reasoning

link

link

link

HumanEval

Code

link

link

HumanEval-CN

Code

link

link(TBD)

Multi-HumanEval

Code

link

link(TBD)

HumanEval+

Code

link

link(TBD)

HumanEval-X

Code

link

link(TBD)

HumanEval Pro

Code

link

link(TBD)

Hungarian_Math

Math

link

link(TBD)

IWSLT2017

Language

link

link(TBD)

JigsawMultilingual

Safety

link

link(TBD)

LAMBADA

Understanding

link

link(TBD)

LCSTS

Understanding

link

link(TBD)

LiveStemBench

link

link(TBD)

LLM Compression

Bits Per Character (BPC)

link

link(TBD)

MATH

Math

link

link

link

MATH500

Math

link

link

link

MATH 401

Math

link

link(TBD)

MathBench

Math

link

link(TBD)

MBPP

Code

link

link(TBD)

MBPP-CN

Code

link

link(TBD)

MBPP-PLUS

Code

link

link(TBD)

MBPP Pro

Code

link

link(TBD)

MGSM

Language / Math

link

link(TBD)

MMLU

Understanding

link

link

link

SciEval

Understanding

link

link(TBD)

link(TBD)

MMLU-CF

Understanding

link

link(TBD)

MMLU-Pro

Understanding

link

link

link

MMMLU

Language / Understanding

link

link(TBD) / link(TBD)

SuperGLUE / MultiRC

Understanding

link

link(TBD)

MultiPL-E

Code

link

link(TBD)

NarrativeQA

Understanding

link

link(TBD)

NaturalQuestions

Knowledge

link

link(TBD)

NaturalQuestions-CN

Knowledge

link

link(TBD)

OpenBookQA

Knowledge

link

link(TBD)

OlymMATH

Math

link

link(TBD)

link(TBD)

PIQA

Knowledge / Physics

link

link(TBD)

ProteinLMBench

Knowledge / Biology (Protein)

link

link(TBD)

link(TBD)

py150

Code

link

link(TBD)

Qasper

Long Context

link

link(TBD)

Qasper-Cut

Long Context

link

link(TBD)

RACE

Examination

link

link(TBD)

R-Bench

Reasoning

link

link(TBD)

RealToxicPrompts

Safety

link

link(TBD)

SuperGLUE / ReCoRD

Understanding

link

link(TBD)

SuperGLUE / RTE

Reasoning

link

link(TBD)

CLUE / OCNLI

Reasoning

link

link(TBD)

FewCLUE / OCNLI-FC

Reasoning

link

link(TBD)

RoleBench

Role Play

link

link(TBD)

S3Eval

Long Context

link

link(TBD)

SciBench

Reasoning

link

link(TBD)

SciCode

Code

link

link(TBD)

SimpleQA

Knowledge

link

link(TBD)

SocialIQA

Reasoning

link

link(TBD)

SQuAD2.0

Understanding

link

link(TBD)

StoryCloze

Reasoning

link

link(TBD)

StrategyQA

Reasoning

link

link(TBD)

SummEdits

Language

link

link(TBD)

SummScreen

Understanding

link

link(TBD)

SVAMP

Math

link

link(TBD)

TabMWP

Math / Table

link

link(TBD)

TACO

Code

link

link(TBD)

FewCLUE / TNEWS

Understanding

link

link(TBD)

FewCLUE / BUSTM

Reasoning

link

link(TBD)

FewCLUE / CSL

Understanding

link

link(TBD)

FewCLUE / OCNLI-FC

Reasoning

link

link(TBD)

TriviaQA

Knowledge

link

link(TBD)

TriviaQA-RC

Knowledge / Understanding

link

link(TBD)

TruthfulQA

Safety

link

link(TBD)

TyDi-QA

Language

link

link(TBD)

SuperGLUE / WiC

Language

link

link(TBD)

SuperGLUE / WSC

Language / WSC

link

link(TBD)

WinoGrande

Language / WSC

link

link(TBD)

XCOPA

Language

link

link(TBD)

Xiezhi

Knowledge

link

link(TBD)

XLSum

Understanding

link

link(TBD)

XSum

Understanding

link

link(TBD)

GLUE / CoLA

Understanding

link

link(TBD)

GLUE / MRPC

Understanding

link

link(TBD)

GLUE / QQP

Understanding

link

link(TBD)

Omni-MATH

Math

link

link(TBD)

WikiBench

Knowledge

link

link(TBD)

SuperGPQA

Knowledge

link

link(TBD)

ClimaQA

Science

link

link(TBD)

link(TBD) / link(TBD)

PHYSICS

Science

link

link(TBD)

link(TBD)

SmolInstruct

Science / Chemistry

link

link(TBD)

SciKnowEval

Science

link

link(TBD)

link(TBD)

InternSandbox

Reasoning / Code / Agent

link

link(TBD)

nejmaibench

Science / Medicine

link

link(TBD)

link(TBD)

Medbullets

Science / Medicine

link

link(TBD)

link(TBD)

medmcqa

Science / Medicine

link

link(TBD)

link(TBD)

PHYBench

Science / Physics

link

link(TBD)
