Dataset Statistics
This page lists all datasets supported by OpenCompass.
Use the sort and search functions to find the dataset you need.
Each dataset has a recommended running configuration, and some datasets also have a recommended configuration based on an LLM judge.
You can use these recommended configurations to start evaluation tasks quickly. Note, however, that they may be updated over time.
Supported Dataset List
Name | Category | Paper or Repository | Recommended Config | Recommended Config (LLM Judge) |
---|---|---|---|---|
IFEval | Instruction Following | | | |
NPHardEval | Reasoning | | | |
PMMEval | Language | | | |
TheoremQA | Reasoning | | | |
AGIEval | Examination | | | |
BABILong | Long Context | | | |
BigCodeBench | Code | | | |
CaLM | Reasoning | | | |
InfiniteBench (∞Bench) | Long Context | | | |
KOR-Bench | Reasoning | | | |
LawBench | Knowledge / Law | | | |
L-Eval | Long Context | | | |
LiveCodeBench | Code | | | |
LiveMathBench | Math | | | |
LiveReasonBench | Reasoning | | | |
LongBench | Long Context | | | |
LV-Eval | Long Context | | | |
Mastermath2024v1 | Math | | | |
matbench | Science / Material | | | |
MedBench | Knowledge / Medicine | | | |
MedCalc_Bench | Knowledge / Medicine | | | |
MedQA | Knowledge / Medicine | | | |
MedXpertQA | Knowledge / Medicine | | | |
ClinicBench | Knowledge / Medicine | | | |
ScienceQA | Knowledge / Medicine | | | |
PubMedQA | Knowledge / Medicine | | | |
MuSR | Reasoning | | | |
NeedleBench V1 (Deprecated) | Long Context | | | |
NeedleBench V2 | Long Context | | | |
RULER | Long Context | | | |
AlignBench | Subjective / Alignment | | | |
AlpacaEval | Subjective / Instruction Following | | | |
Arena-Hard | Subjective / Chatbot | | | |
FLAMES | Subjective / Alignment | | | |
FOFO | Subjective / Format Following | | | |
FollowBench | Subjective / Instruction Following | | | |
HelloBench | Subjective / Long Context | | | |
JudgerBench | Subjective / Long Context | | | |
MT-Bench-101 | Subjective / Multi-Round | | | |
WildBench | Subjective / Real Task | | | |
T-Eval | Tool Utilization | | | |
FinanceIQ | Knowledge / Finance | | | |
GAOKAOBench | Examination | | | |
LCBench | Code | | | |
ArabicMMLU | Language | | | |
OpenFinData | Knowledge / Finance | | | |
QuALITY | Long Context | | | |
Adversarial GLUE | Safety | link(TBD) / link(TBD) / link(TBD) / link(TBD) / link(TBD) / link(TBD) | | |
CLUE / AFQMC | Language | | | |
AIME2024 | Examination | | | |
Adversarial NLI | Reasoning | | | |
Anthropics Evals | Safety | | | |
APPS | Code | | | |
ARC | Reasoning | | | |
ARC Prize | ARC-AGI | | | |
SuperGLUE / AX | Reasoning | | | |
BIG-Bench Hard | Reasoning | | | |
BIG-Bench Extra Hard | Reasoning | | | |
SuperGLUE / BoolQ | Knowledge | | | |
CLUE / C3 (C³) | Understanding | | | |
CARDBiomedBench | Knowledge / Medicine | | | |
SuperGLUE / CB | Reasoning | | | |
C-EVAL | Examination | | | |
CHARM | Reasoning | | | |
ChemBench | Knowledge / Chemistry | | | |
FewCLUE / CHID | Language | | | |
Chinese SimpleQA | Knowledge | | | |
CIBench | Code | | | |
CivilComments | Safety | | | |
Cloze Test-max/min | Code | | | |
FewCLUE / CLUEWSC | Language / WSC | | | |
CMB | Knowledge / Medicine | | | |
CMMLU | Understanding | | | |
CLUE / CMNLI | Reasoning | | | |
cmo_fib | Examination | | | |
CLUE / CMRC | Understanding | | | |
CommonSenseQA | Knowledge | | | |
CommonSenseQA-CN | Knowledge | | | |
SuperGLUE / COPA | Reasoning | | | |
CrowsPairs | Safety | | | |
CrowsPairs-CN | Safety | | | |
CVALUES | Safety | | | |
CLUE / DRCD | Understanding | | | |
DROP (DROP Simple Eval) | Understanding | | | |
DS-1000 | Code | | | |
FewCLUE / EPRSTMT | Understanding | | | |
Flores | Language | | | |
Game24 | Math | | | |
Government Report Dataset | Long Context | | | |
GPQA | Knowledge | | | |
GSM8K | Math | | | |
GSM-Hard | Math | | | |
HLE (Humanity’s Last Exam) | Reasoning | | | |
HellaSwag | Reasoning | | | |
HumanEval | Code | | | |
HumanEval-CN | Code | | | |
Multi-HumanEval | Code | | | |
HumanEval+ | Code | | | |
HumanEval-X | Code | | | |
HumanEval Pro | Code | | | |
Hungarian_Math | Math | | | |
IWSLT2017 | Language | | | |
JigsawMultilingual | Safety | | | |
LAMBADA | Understanding | | | |
LCSTS | Understanding | | | |
LiveStemBench | | | | |
LLM Compression | Bits Per Character (BPC) | | | |
MATH | Math | | | |
MATH500 | Math | | | |
MATH 401 | Math | | | |
MathBench | Math | | | |
MBPP | Code | | | |
MBPP-CN | Code | | | |
MBPP-PLUS | Code | | | |
MBPP Pro | Code | | | |
MGSM | Language / Math | | | |
MMLU | Understanding | | | |
SciEval | Understanding | | | |
MMLU-CF | Understanding | | | |
MMLU-Pro | Understanding | | | |
MMMLU | Language / Understanding | | | |
SuperGLUE / MultiRC | Understanding | | | |
MultiPL-E | Code | | | |
NarrativeQA | Understanding | | | |
NaturalQuestions | Knowledge | | | |
NaturalQuestions-CN | Knowledge | | | |
OpenBookQA | Knowledge | | | |
OlymMATH | Math | | | |
OpenBookQA | Knowledge / Physics | | | |
ProteinLMBench | Knowledge / Biology (Protein) | | | |
py150 | Code | | | |
Qasper | Long Context | | | |
Qasper-Cut | Long Context | | | |
RACE | Examination | | | |
R-Bench | Reasoning | | | |
RealToxicPrompts | Safety | | | |
SuperGLUE / ReCoRD | Understanding | | | |
SuperGLUE / RTE | Reasoning | | | |
CLUE / OCNLI | Reasoning | | | |
FewCLUE / OCNLI-FC | Reasoning | | | |
RoleBench | Role Play | | | |
S3Eval | Long Context | | | |
SciBench | Reasoning | | | |
SciCode | Code | | | |
SimpleQA | Knowledge | | | |
SocialIQA | Reasoning | | | |
SQuAD2.0 | Understanding | | | |
StoryCloze | Reasoning | | | |
StrategyQA | Reasoning | | | |
SummEdits | Language | | | |
SummScreen | Understanding | | | |
SVAMP | Math | | | |
TabMWP | Math / Table | | | |
TACO | Code | | | |
FewCLUE / TNEWS | Understanding | | | |
FewCLUE / BUSTM | Reasoning | | | |
FewCLUE / CSL | Understanding | | | |
FewCLUE / OCNLI-FC | Reasoning | | | |
TriviaQA | Knowledge | | | |
TriviaQA-RC | Knowledge / Understanding | | | |
TruthfulQA | Safety | | | |
TyDi-QA | Language | | | |
SuperGLUE / WiC | Language | | | |
SuperGLUE / WSC | Language / WSC | | | |
WinoGrande | Language / WSC | | | |
XCOPA | Language | | | |
Xiezhi | Knowledge | | | |
XLSum | Understanding | | | |
Xsum | Understanding | | | |
GLUE / CoLA | Understanding | | | |
GLUE / MRPC | Understanding | | | |
GLUE / QQP | Understanding | | | |
Omni-MATH | Math | | | |
WikiBench | Knowledge | | | |
SuperGPQA | Knowledge | | | |
ClimaQA | Science | | | |
PHYSICS | Science | | | |
SmolInstruct | Science / Chemistry | | | |
SciKnowEval | Science | | | |
InternSandbox | Reasoning / Code / Agent | | | |
nejmaibench | Science / Medicine | | | |
Medbullets | Science / Medicine | | | |
medmcqa | Science / Medicine | | | |
PHYBench | Science / Physics |
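
If you want to filter the list offline instead of using the page's sort and search controls, the lookup can be sketched in a few lines of Python. This is only an illustrative sketch and not part of OpenCompass: `DatasetEntry`, `ENTRIES`, and `search` are hypothetical names, and the sample rows are a small hand-copied subset of the table above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetEntry:
    """One table row, reduced to the two columns that carry text."""
    name: str
    category: str


# Hypothetical sample: a few rows copied by hand from the table above.
ENTRIES = [
    DatasetEntry("IFEval", "Instruction Following"),
    DatasetEntry("GSM8K", "Math"),
    DatasetEntry("MGSM", "Language / Math"),
    DatasetEntry("LongBench", "Long Context"),
    DatasetEntry("HumanEval", "Code"),
    DatasetEntry("MedQA", "Knowledge / Medicine"),
]


def search(entries, query):
    """Return entries whose name or category contains `query`.

    Matching is case-insensitive; results are sorted by name.
    """
    needle = query.casefold()
    hits = [
        e for e in entries
        if needle in e.name.casefold() or needle in e.category.casefold()
    ]
    return sorted(hits, key=lambda e: e.name.casefold())


if __name__ == "__main__":
    for entry in search(ENTRIES, "math"):
        print(f"{entry.name}: {entry.category}")
```

Matching against both columns means a query such as `math` finds datasets whose category is `Math` as well as mixed categories such as `Language / Math`.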