Dataset Statistics

This page lists all of the datasets supported by OpenCompass. You can use the sorting and search features to find the dataset you need.

We provide a recommended running configuration for every dataset, and some datasets also come with a recommended configuration based on an LLM judge. You can quickly launch an evaluation from a recommended configuration, but note that recommended configurations may be updated over time.
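For example, a recommended configuration can be wired into a small evaluation script along the following lines. This is a minimal sketch assuming a pip-installed OpenCompass; the config module names used here (`gsm8k_gen` for the dataset, `hf_internlm2_5_7b_chat` for the model) are illustrative and their exact paths may differ between OpenCompass versions:

```python
# eval_recommended.py -- a minimal sketch of running one recommended dataset
# config. The dataset/model config modules imported below are illustrative
# examples; substitute the recommended config for the dataset you need.
from mmengine.config import read_base

with read_base():
    # A recommended dataset config (here: GSM8K, generation-style evaluation).
    from opencompass.configs.datasets.gsm8k.gsm8k_gen import gsm8k_datasets
    # An example HuggingFace chat model config to evaluate against.
    from opencompass.configs.models.hf_internlm.hf_internlm2_5_7b_chat import models

datasets = gsm8k_datasets
```

The script can then be launched with `opencompass eval_recommended.py` (or `python run.py eval_recommended.py` from a source checkout); results are written under `outputs/` by default.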
List of Supported Datasets
| Dataset Name | Dataset Type | Paper or Resource Link | Recommended Config | Recommended Config (LLM-Based Evaluation) |
|---|---|---|---|---|
| IFEval | Instruction Following | | | |
| NPHardEval | Reasoning | | | |
| PMMEval | Language | | | |
| TheoremQA | Reasoning | | | |
| AGIEval | Examination | | | |
| BABILong | Long Context | | | |
| BigCodeBench | Code | | | |
| CaLM | Reasoning | | | |
| InfiniteBench (∞Bench) | Long Context | | | |
| KOR-Bench | Reasoning | | | |
| LawBench | Knowledge / Law | | | |
| L-Eval | Long Context | | | |
| LiveCodeBench | Code | | | |
| LiveMathBench | Math | | | |
| LiveReasonBench | Reasoning | | | |
| LongBench | Long Context | | | |
| LV-Eval | Long Context | | | |
| Mastermath2024v1 | Math | | | |
| matbench | Science / Material | | | |
| MedBench | Knowledge / Medicine | | | |
| MedCalc_Bench | Knowledge / Medicine | | | |
| MedQA | Knowledge / Medicine | | | |
| MedXpertQA | Knowledge / Medicine | | | |
| ClinicBench | Knowledge / Medicine | | | |
| ScienceQA | Knowledge / Medicine | | | |
| PubMedQA | Knowledge / Medicine | | | |
| MuSR | Reasoning | | | |
| NeedleBench V1 (Deprecated) | Long Context | | | |
| NeedleBench V2 | Long Context | | | |
| RULER | Long Context | | | |
| AlignBench | Subjective / Alignment | | | |
| AlpacaEval | Subjective / Instruction Following | | | |
| Arena-Hard | Subjective / Chatbot | | | |
| FLAMES | Subjective / Alignment | | | |
| FOFO | Subjective / Format Following | | | |
| FollowBench | Subjective / Instruction Following | | | |
| HelloBench | Subjective / Long Context | | | |
| JudgerBench | Subjective / Long Context | | | |
| MT-Bench-101 | Subjective / Multi-Round | | | |
| WildBench | Subjective / Real Task | | | |
| T-Eval | Tool Utilization | | | |
| FinanceIQ | Knowledge / Finance | | | |
| GAOKAOBench | Examination | | | |
| LCBench | Code | | | |
| ArabicMMLU | Language | | | |
| OpenFinData | Knowledge / Finance | | | |
| QuALITY | Long Context | | | |
| Adversarial GLUE | Safety | | | |
| CLUE / AFQMC | Language | | | |
| AIME2024 | Examination | | | |
| Adversarial NLI | Reasoning | | | |
| Anthropics Evals | Safety | | | |
| APPS | Code | | | |
| ARC | Reasoning | | | |
| ARC Prize | ARC-AGI | | | |
| SuperGLUE / AX | Reasoning | | | |
| BIG-Bench Hard | Reasoning | | | |
| BIG-Bench Extra Hard | Reasoning | | | |
| SuperGLUE / BoolQ | Knowledge | | | |
| CLUE / C3 (C³) | Understanding | | | |
| CARDBiomedBench | Knowledge / Medicine | | | |
| SuperGLUE / CB | Reasoning | | | |
| C-EVAL | Examination | | | |
| CHARM | Reasoning | | | |
| ChemBench | Knowledge / Chemistry | | | |
| FewCLUE / CHID | Language | | | |
| Chinese SimpleQA | Knowledge | | | |
| CIBench | Code | | | |
| CivilComments | Safety | | | |
| Cloze Test-max/min | Code | | | |
| FewCLUE / CLUEWSC | Language / WSC | | | |
| CMB | Knowledge / Medicine | | | |
| CMMLU | Understanding | | | |
| CLUE / CMNLI | Reasoning | | | |
| cmo_fib | Examination | | | |
| CLUE / CMRC | Understanding | | | |
| CommonSenseQA | Knowledge | | | |
| CommonSenseQA-CN | Knowledge | | | |
| SuperGLUE / COPA | Reasoning | | | |
| CrowsPairs | Safety | | | |
| CrowsPairs-CN | Safety | | | |
| CVALUES | Safety | | | |
| CLUE / DRCD | Understanding | | | |
| DROP (DROP Simple Eval) | Understanding | | | |
| DS-1000 | Code | | | |
| FewCLUE / EPRSTMT | Understanding | | | |
| Flores | Language | | | |
| Game24 | Math | | | |
| Government Report Dataset | Long Context | | | |
| GPQA | Knowledge | | | |
| GSM8K | Math | | | |
| GSM-Hard | Math | | | |
| HLE (Humanity's Last Exam) | Reasoning | | | |
| HellaSwag | Reasoning | | | |
| HumanEval | Code | | | |
| HumanEval-CN | Code | | | |
| Multi-HumanEval | Code | | | |
| HumanEval+ | Code | | | |
| HumanEval-X | Code | | | |
| HumanEval Pro | Code | | | |
| Hungarian_Math | Math | | | |
| IWSLT2017 | Language | | | |
| JigsawMultilingual | Safety | | | |
| LAMBADA | Understanding | | | |
| LCSTS | Understanding | | | |
| LiveStemBench | | | | |
| LLM Compression | Bits Per Character (BPC) | | | |
| MATH | Math | | | |
| MATH500 | Math | | | |
| MATH 401 | Math | | | |
| MathBench | Math | | | |
| MBPP | Code | | | |
| MBPP-CN | Code | | | |
| MBPP-PLUS | Code | | | |
| MBPP Pro | Code | | | |
| MGSM | Language / Math | | | |
| MMLU | Understanding | | | |
| SciEval | Understanding | | | |
| MMLU-CF | Understanding | | | |
| MMLU-Pro | Understanding | | | |
| MMMLU | Language / Understanding | | | |
| SuperGLUE / MultiRC | Understanding | | | |
| MultiPL-E | Code | | | |
| NarrativeQA | Understanding | | | |
| NaturalQuestions | Knowledge | | | |
| NaturalQuestions-CN | Knowledge | | | |
| OpenBookQA | Knowledge | | | |
| OlymMATH | Math | | | |
| PIQA | Knowledge / Physics | | | |
| ProteinLMBench | Knowledge / Biology (Protein) | | | |
| py150 | Code | | | |
| Qasper | Long Context | | | |
| Qasper-Cut | Long Context | | | |
| RACE | Examination | | | |
| R-Bench | Reasoning | | | |
| RealToxicPrompts | Safety | | | |
| SuperGLUE / ReCoRD | Understanding | | | |
| SuperGLUE / RTE | Reasoning | | | |
| CLUE / OCNLI | Reasoning | | | |
| FewCLUE / OCNLI-FC | Reasoning | | | |
| RoleBench | Role Play | | | |
| S3Eval | Long Context | | | |
| SciBench | Reasoning | | | |
| SciCode | Code | | | |
| SimpleQA | Knowledge | | | |
| SocialIQA | Reasoning | | | |
| SQuAD2.0 | Understanding | | | |
| StoryCloze | Reasoning | | | |
| StrategyQA | Reasoning | | | |
| SummEdits | Language | | | |
| SummScreen | Understanding | | | |
| SVAMP | Math | | | |
| TabMWP | Math / Table | | | |
| TACO | Code | | | |
| FewCLUE / TNEWS | Understanding | | | |
| FewCLUE / BUSTM | Reasoning | | | |
| FewCLUE / CSL | Understanding | | | |
| TriviaQA | Knowledge | | | |
| TriviaQA-RC | Knowledge / Understanding | | | |
| TruthfulQA | Safety | | | |
| TyDi-QA | Language | | | |
| SuperGLUE / WiC | Language | | | |
| SuperGLUE / WSC | Language / WSC | | | |
| WinoGrande | Language / WSC | | | |
| XCOPA | Language | | | |
| Xiezhi | Knowledge | | | |
| XLSum | Understanding | | | |
| Xsum | Understanding | | | |
| GLUE / CoLA | Understanding | | | |
| GLUE / MRPC | Understanding | | | |
| GLUE / QQP | Understanding | | | |
| Omni-MATH | Math | | | |
| WikiBench | Knowledge | | | |
| SuperGPQA | Knowledge | | | |
| ClimaQA | Science | | | |
| PHYSICS | Science | | | |
| SmolInstruct | Science / Chemistry | | | |
| SciKnowEval | Science | | | |
| InternSandbox | Reasoning / Code / Agent | | | |
| nejmaibench | Science / Medicine | | | |
| Medbullets | Science / Medicine | | | |
| medmcqa | Science / Medicine | | | |
| PHYBench | Science / Physics | | | |