# Tutorial for Evaluating Reasoning Models
OpenCompass provides an evaluation tutorial for the DeepSeek R1 series of reasoning models on mathematical datasets. The setup follows three principles:

- At the model level, we recommend sampling-based decoding to reduce the repetition that greedy decoding tends to produce.
- For datasets with few samples, we run the evaluation multiple times and report the average.
- For answer validation, we use an LLM-based verifier to reduce the misjudgments of rule-based evaluation.
## Installation and Preparation
Please follow OpenCompass’s installation guide.
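For reference, a typical setup looks like the sketch below, assuming a pip-based environment; the official installation guide remains the authoritative reference:

```bash
# Install OpenCompass from PyPI
pip install -U opencompass

# Alternatively, an editable install from source for the latest configs
git clone https://github.com/open-compass/opencompass.git
cd opencompass
pip install -e .
```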
## Evaluation Configuration Setup
We provide an example configuration in `examples/eval_deepseek_r1.py`. The key parts are explained below.
### Configuration Interpretation
#### 1. Dataset and Validator Configuration
```python
# Dataset configuration supporting multiple runs (example)
from opencompass.configs.datasets.aime2024.aime2024_llmverify_repeat8_gen_e8fcee import \
    aime2024_datasets
from opencompass.models import OpenAISDK

# Collect every list whose name ends with `_datasets` into a single list
datasets = sum(
    (v for k, v in locals().items() if k.endswith('_datasets')),
    [],
)

# LLM validator configuration. Users need to deploy an API service via
# LMDeploy/vLLM/SGLang or use an OpenAI-compatible endpoint.
verifier_cfg = dict(
    abbr='qwen2-5-32B-Instruct',
    type=OpenAISDK,
    path='Qwen/Qwen2.5-32B-Instruct',  # Replace with actual path
    key='YOUR_API_KEY',  # Use a real API key
    openai_api_base=['http://your-api-endpoint'],  # Replace with the API endpoint
    query_per_second=16,
    batch_size=1024,
    temperature=0.001,
    max_out_len=16384,
)

# Apply the validator to all datasets that support an LLM judge
for item in datasets:
    if 'judge_cfg' in item['eval_cfg']['evaluator']:
        item['eval_cfg']['evaluator']['judge_cfg'] = verifier_cfg
```
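Before launching a long evaluation, it can be worth confirming that the judge endpoint actually answers. The snippet below is a minimal sanity check, not part of OpenCompass; it assumes an OpenAI-compatible server is already running (for example one started with `lmdeploy serve api_server Qwen/Qwen2.5-32B-Instruct`), and the endpoint URL and key are placeholders:

```python
# Minimal sanity check for the judge endpoint (illustrative, not part of
# OpenCompass). Assumes an OpenAI-compatible server is already running.
from openai import OpenAI

client = OpenAI(base_url='http://your-api-endpoint/v1', api_key='YOUR_API_KEY')
reply = client.chat.completions.create(
    model='Qwen/Qwen2.5-32B-Instruct',
    messages=[{'role': 'user', 'content': 'Reply with the single word: ready'}],
    max_tokens=8,
)
print(reply.choices[0].message.content)
```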
#### 2. Model Configuration
We provide an example that uses LMDeploy as the inference backend; users can modify `path` (i.e., the HuggingFace model path) to evaluate other models:
```python
# LMDeploy model configuration example
from opencompass.models import TurboMindModelwithChatTemplate
# The postprocessor's import path may differ slightly across versions
from opencompass.utils.text_postprocessors import extract_non_reasoning_content

models = [
    dict(
        type=TurboMindModelwithChatTemplate,
        abbr='deepseek-r1-distill-qwen-7b-turbomind',
        path='deepseek-ai/DeepSeek-R1-Distill-Qwen-7B',
        engine_config=dict(session_len=32768, max_batch_size=128, tp=1),
        gen_config=dict(
            do_sample=True,
            temperature=0.6,
            top_p=0.95,
            max_new_tokens=32768,
        ),
        max_seq_len=32768,
        batch_size=64,
        run_cfg=dict(num_gpus=1),
        pred_postprocessor=dict(type=extract_non_reasoning_content),
    ),
    # Extendable 14B/32B configurations...
]
```
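If LMDeploy is not available, other accelerated backends can be plugged in the same way. The following is a hedged sketch using OpenCompass's vLLM wrapper; the class name `VLLMwithChatTemplate` and its keyword arguments may differ across versions, so treat it as a template rather than a drop-in config:

```python
# Sketch: evaluating the same model with vLLM instead of LMDeploy.
# Class and argument names are assumptions; check your OpenCompass version.
from opencompass.models import VLLMwithChatTemplate

models += [
    dict(
        type=VLLMwithChatTemplate,
        abbr='deepseek-r1-distill-qwen-7b-vllm',
        path='deepseek-ai/DeepSeek-R1-Distill-Qwen-7B',
        model_kwargs=dict(tensor_parallel_size=1, max_model_len=32768),
        generation_kwargs=dict(temperature=0.6, top_p=0.95),
        max_seq_len=32768,
        max_out_len=32768,
        batch_size=64,
        run_cfg=dict(num_gpus=1),
        pred_postprocessor=dict(type=extract_non_reasoning_content),
    ),
]
```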
#### 3. Evaluation Process Configuration
```python
from opencompass.partitioners import NaivePartitioner, NumWorkerPartitioner
from opencompass.runners import LocalRunner
from opencompass.tasks import OpenICLEvalTask, OpenICLInferTask

# Inference configuration
infer = dict(
    partitioner=dict(type=NumWorkerPartitioner, num_worker=1),
    runner=dict(type=LocalRunner, task=dict(type=OpenICLInferTask)),
)

# Evaluation configuration
eval = dict(
    partitioner=dict(type=NaivePartitioner, n=8),
    runner=dict(type=LocalRunner, task=dict(type=OpenICLEvalTask)),
)
```
#### 4. Summary Configuration
```python
# Averaging configuration for multiple runs
summary_groups = [
    {
        'name': 'AIME2024-Average8',
        'subsets': [[f'aime2024-run{idx}', 'accuracy'] for idx in range(8)],
    },
    # Average configurations for other datasets...
]

summarizer = dict(
    dataset_abbrs=[
        ['AIME2024-Average8', 'naive_average'],
        # Metrics for other datasets...
    ],
    summary_groups=summary_groups,
)

# Work directory configuration
work_dir = 'outputs/deepseek_r1_reasoning'
```
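For intuition, the summary group simply averages the `accuracy` of the eight independent runs. The sketch below reproduces that arithmetic with made-up placeholder scores:

```python
# Illustration only: how 'AIME2024-Average8' aggregates the eight runs.
# The scores here are placeholders, not real results.
run_scores = [50.0, 56.7, 60.0, 53.3, 56.7, 60.0, 53.3, 60.0]
subsets = {f'aime2024-run{idx}': score for idx, score in enumerate(run_scores)}
naive_average = sum(subsets.values()) / len(subsets)
print(f'AIME2024-Average8: {naive_average:.2f}')
```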
## Evaluation Execution
### Scenario 1: Model loaded on 1 GPU, data evaluated by 1 worker, using 1 GPU in total
```bash
opencompass examples/eval_deepseek_r1.py --debug --dump-eval-details
```
Evaluation logs are printed directly to the command line.
### Scenario 2: Model loaded on 1 GPU, data evaluated by 8 workers, using 8 GPUs in total
Modify the `infer` configuration in the config file and set `num_worker` to 8:

```python
# Inference configuration
infer = dict(
    partitioner=dict(type=NumWorkerPartitioner, num_worker=8),
    runner=dict(type=LocalRunner, task=dict(type=OpenICLInferTask)),
)
```
At the same time, remove the `--debug` flag from the evaluation command:

```bash
opencompass examples/eval_deepseek_r1.py --dump-eval-details
```
In this mode, OpenCompass uses multithreading to launch `num_worker` tasks. Logs are not printed to the command line; instead, detailed evaluation logs are written under `work_dir`.
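To follow progress in this mode, you can watch the per-task log files. The exact directory layout can vary across OpenCompass versions, but each launch typically creates a timestamped folder under `work_dir`:

```bash
# Each launch creates a timestamped folder under work_dir
ls outputs/deepseek_r1_reasoning/
# Follow the inference logs of a running task (layout may vary by version)
tail -f outputs/deepseek_r1_reasoning/<timestamp>/logs/infer/*/*.out
```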
### Scenario 3: Model loaded on 2 GPUs, data evaluated by 4 workers, using 8 GPUs in total
Note that in the model configuration, `num_gpus` in `run_cfg` needs to be set to 2 (if using an inference backend, backend parameters such as `tp` in LMDeploy must be changed to 2 accordingly), and `num_worker` in the `infer` configuration must be set to 4, so that 4 workers with 2 GPUs each use 8 GPUs in total:
```python
models += [
    dict(
        type=TurboMindModelwithChatTemplate,
        abbr='deepseek-r1-distill-qwen-14b-turbomind',
        path='deepseek-ai/DeepSeek-R1-Distill-Qwen-14B',
        engine_config=dict(session_len=32768, max_batch_size=128, tp=2),
        gen_config=dict(
            do_sample=True,
            temperature=0.6,
            top_p=0.95,
            max_new_tokens=32768,
        ),
        max_seq_len=32768,
        max_out_len=32768,
        batch_size=128,
        run_cfg=dict(num_gpus=2),
        pred_postprocessor=dict(type=extract_non_reasoning_content),
    ),
]

# Inference configuration
infer = dict(
    partitioner=dict(type=NumWorkerPartitioner, num_worker=4),
    runner=dict(type=LocalRunner, task=dict(type=OpenICLInferTask)),
)
```
## Evaluation Results
The evaluation results are displayed as follows:
```text
dataset            version    metric         mode    deepseek-r1-distill-qwen-7b-turbomind
-----------------  ---------  -------------  ------  ---------------------------------------
MATH               -          -              -
AIME2024-Average8  -          naive_average  gen     56.25
```
## Performance Baseline
Because the model decodes with sampling and the AIME dataset is small, scores may still fluctuate by 1-3 points even when averaged over 8 runs.
| Model | Dataset | Metric | Value |
|---|---|---|---|
| DeepSeek-R1-Distill-Qwen-7B | AIME2024 | Accuracy | 56.3 |
| DeepSeek-R1-Distill-Qwen-14B | AIME2024 | Accuracy | 74.2 |
| DeepSeek-R1-Distill-Qwen-32B | AIME2024 | Accuracy | 74.2 |