# Tutorial for Evaluating Reasoning Models
OpenCompass provides an evaluation tutorial for the DeepSeek R1 series of reasoning models on mathematical datasets. The setup follows three principles:

- At the model level, we recommend sampling-based decoding to reduce the repetition that greedy decoding tends to produce.
- For datasets with few samples, we run the evaluation multiple times and report the average.
- For answer validation, we use an LLM-based verifier to reduce the misjudgments of rule-based evaluation.
## Installation and Preparation
Please follow OpenCompass’s installation guide.
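For reference, a typical setup looks like the sketch below, assuming a pip-based environment; the official installation guide remains the authoritative reference:

```bash
# Install OpenCompass from PyPI
pip install -U opencompass

# Alternatively, an editable install from source for the latest configs
git clone https://github.com/open-compass/opencompass.git
cd opencompass
pip install -e .
```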
## Evaluation Configuration Setup
We provide an example configuration in `examples/eval_deepseek_r1.py`. The key parts are explained below.
### Configuration Interpretation
#### 1. Dataset and Validator Configuration
```python
# Dataset configuration supporting multiple runs (example)
from opencompass.configs.datasets.aime2024.aime2024_llmverify_repeat8_gen_e8fcee import \
    aime2024_datasets
from opencompass.models import OpenAISDK

# Collect every list whose name ends with `_datasets` into a single list
datasets = sum(
    (v for k, v in locals().items() if k.endswith('_datasets')),
    [],
)

# LLM validator configuration. Users need to deploy an API service via
# LMDeploy/vLLM/SGLang or use an OpenAI-compatible endpoint.
verifier_cfg = dict(
    abbr='qwen2-5-32B-Instruct',
    type=OpenAISDK,
    path='Qwen/Qwen2.5-32B-Instruct',  # Replace with actual path
    key='YOUR_API_KEY',  # Use a real API key
    openai_api_base=['http://your-api-endpoint'],  # Replace with the API endpoint
    query_per_second=16,
    batch_size=1024,
    temperature=0.001,
    max_out_len=16384,
)

# Apply the validator to all datasets that support an LLM judge
for item in datasets:
    if 'judge_cfg' in item['eval_cfg']['evaluator']:
        item['eval_cfg']['evaluator']['judge_cfg'] = verifier_cfg
```
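Before launching a long evaluation, it can be worth confirming that the judge endpoint actually answers. The snippet below is a minimal sanity check, not part of OpenCompass; it assumes an OpenAI-compatible server is already running (for example one started with `lmdeploy serve api_server Qwen/Qwen2.5-32B-Instruct`), and the endpoint URL and key are placeholders:

```python
# Minimal sanity check for the judge endpoint (illustrative, not part of
# OpenCompass). Assumes an OpenAI-compatible server is already running.
from openai import OpenAI

client = OpenAI(base_url='http://your-api-endpoint/v1', api_key='YOUR_API_KEY')
reply = client.chat.completions.create(
    model='Qwen/Qwen2.5-32B-Instruct',
    messages=[{'role': 'user', 'content': 'Reply with the single word: ready'}],
    max_tokens=8,
)
print(reply.choices[0].message.content)
```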
#### 2. Model Configuration
We provide an example that uses LMDeploy as the inference backend; users can modify `path` (i.e., the HuggingFace model path) to evaluate other models:
```python
# LMDeploy model configuration example
from opencompass.models import TurboMindModelwithChatTemplate
# The postprocessor's import path may differ slightly across versions
from opencompass.utils.text_postprocessors import extract_non_reasoning_content

models = [
    dict(
        type=TurboMindModelwithChatTemplate,
        abbr='deepseek-r1-distill-qwen-7b-turbomind',
        path='deepseek-ai/DeepSeek-R1-Distill-Qwen-7B',
        engine_config=dict(session_len=32768, max_batch_size=128, tp=1),
        gen_config=dict(
            do_sample=True,
            temperature=0.6,
            top_p=0.95,
            max_new_tokens=32768,
        ),
        max_seq_len=32768,
        batch_size=64,
        run_cfg=dict(num_gpus=1),
        pred_postprocessor=dict(type=extract_non_reasoning_content),
    ),
    # Extendable 14B/32B configurations...
]
```
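If LMDeploy is not available, other accelerated backends can be plugged in the same way. The following is a hedged sketch using OpenCompass's vLLM wrapper; the class name `VLLMwithChatTemplate` and its keyword arguments may differ across versions, so treat it as a template rather than a drop-in config:

```python
# Sketch: evaluating the same model with vLLM instead of LMDeploy.
# Class and argument names are assumptions; check your OpenCompass version.
from opencompass.models import VLLMwithChatTemplate

models += [
    dict(
        type=VLLMwithChatTemplate,
        abbr='deepseek-r1-distill-qwen-7b-vllm',
        path='deepseek-ai/DeepSeek-R1-Distill-Qwen-7B',
        model_kwargs=dict(tensor_parallel_size=1, max_model_len=32768),
        generation_kwargs=dict(temperature=0.6, top_p=0.95),
        max_seq_len=32768,
        max_out_len=32768,
        batch_size=64,
        run_cfg=dict(num_gpus=1),
        pred_postprocessor=dict(type=extract_non_reasoning_content),
    ),
]
```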
#### 3. Evaluation Process Configuration
```python
from opencompass.partitioners import NaivePartitioner, NumWorkerPartitioner
from opencompass.runners import LocalRunner
from opencompass.tasks import OpenICLEvalTask, OpenICLInferTask

# Inference configuration
infer = dict(
    partitioner=dict(type=NumWorkerPartitioner, num_worker=1),
    runner=dict(type=LocalRunner, task=dict(type=OpenICLInferTask)),
)

# Evaluation configuration
eval = dict(
    partitioner=dict(type=NaivePartitioner, n=8),
    runner=dict(type=LocalRunner, task=dict(type=OpenICLEvalTask)),
)
```
#### 4. Summary Configuration
```python
# Averaging configuration for multiple runs
summary_groups = [
    {
        'name': 'AIME2024-Average8',
        'subsets': [[f'aime2024-run{idx}', 'accuracy'] for idx in range(8)],
    },
    # Average configurations for other datasets...
]

summarizer = dict(
    dataset_abbrs=[
        ['AIME2024-Average8', 'naive_average'],
        # Metrics for other datasets...
    ],
    summary_groups=summary_groups,
)

# Work directory configuration
work_dir = 'outputs/deepseek_r1_reasoning'
```
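For intuition, the summary group simply averages the `accuracy` of the eight independent runs. The sketch below reproduces that arithmetic with made-up placeholder scores:

```python
# Illustration only: how 'AIME2024-Average8' aggregates the eight runs.
# The scores here are placeholders, not real results.
run_scores = [50.0, 56.7, 60.0, 53.3, 56.7, 60.0, 53.3, 60.0]
subsets = {f'aime2024-run{idx}': score for idx, score in enumerate(run_scores)}
naive_average = sum(subsets.values()) / len(subsets)
print(f'AIME2024-Average8: {naive_average:.2f}')
```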
## Evaluation Execution
### Scenario 1: Model loaded on 1 GPU, data evaluated by 1 worker, using 1 GPU in total
```bash
opencompass examples/eval_deepseek_r1.py --debug --dump-eval-details
```
Evaluation logs are printed directly to the command line.
### Scenario 2: Model loaded on 1 GPU, data evaluated by 8 workers, using 8 GPUs in total
Modify the `infer` configuration in the config file and set `num_worker` to 8:

```python
# Inference configuration
infer = dict(
    partitioner=dict(type=NumWorkerPartitioner, num_worker=8),
    runner=dict(type=LocalRunner, task=dict(type=OpenICLInferTask)),
)
```
At the same time, remove the `--debug` flag from the evaluation command:

```bash
opencompass examples/eval_deepseek_r1.py --dump-eval-details
```
In this mode, OpenCompass uses multithreading to launch `num_worker` tasks. Logs are not printed to the command line; instead, detailed evaluation logs are written under `work_dir`.
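To follow progress in this mode, you can watch the per-task log files. The exact directory layout can vary across OpenCompass versions, but each launch typically creates a timestamped folder under `work_dir`:

```bash
# Each launch creates a timestamped folder under work_dir
ls outputs/deepseek_r1_reasoning/
# Follow the inference logs of a running task (layout may vary by version)
tail -f outputs/deepseek_r1_reasoning/<timestamp>/logs/infer/*/*.out
```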
### Scenario 3: Model loaded on 2 GPUs, data evaluated by 4 workers, using 8 GPUs in total
Note that in the model configuration, `num_gpus` in `run_cfg` needs to be set to 2 (if using an inference backend, backend parameters such as `tp` in LMDeploy must be changed to 2 accordingly), and `num_worker` in the `infer` configuration must be set to 4, so that 4 workers with 2 GPUs each use 8 GPUs in total:
```python
models += [
    dict(
        type=TurboMindModelwithChatTemplate,
        abbr='deepseek-r1-distill-qwen-14b-turbomind',
        path='deepseek-ai/DeepSeek-R1-Distill-Qwen-14B',
        engine_config=dict(session_len=32768, max_batch_size=128, tp=2),
        gen_config=dict(
            do_sample=True,
            temperature=0.6,
            top_p=0.95,
            max_new_tokens=32768,
        ),
        max_seq_len=32768,
        max_out_len=32768,
        batch_size=128,
        run_cfg=dict(num_gpus=2),
        pred_postprocessor=dict(type=extract_non_reasoning_content),
    ),
]

# Inference configuration
infer = dict(
    partitioner=dict(type=NumWorkerPartitioner, num_worker=4),
    runner=dict(type=LocalRunner, task=dict(type=OpenICLInferTask)),
)
```
## Evaluation Results
The evaluation results are displayed as follows:
```text
dataset            version    metric         mode    deepseek-r1-distill-qwen-7b-turbomind
-----------------  ---------  -------------  ------  ---------------------------------------
MATH               -          -              -
AIME2024-Average8  -          naive_average  gen     56.25
```
## Performance Baseline
Because the model decodes with sampling and the AIME dataset is small, scores may still fluctuate by 1-3 points even when averaged over 8 runs.
| Model | Dataset | Metric | Value |
|---|---|---|---|
| DeepSeek-R1-Distill-Qwen-7B | AIME2024 | Accuracy | 56.3 |
| DeepSeek-R1-Distill-Qwen-14B | AIME2024 | Accuracy | 74.2 |
| DeepSeek-R1-Distill-Qwen-32B | AIME2024 | Accuracy | 74.2 |