
Tutorial for Evaluating Intern-S1

OpenCompass now provides the configs needed to evaluate Intern-S1. Follow the steps below to start the evaluation.

Model Download and Deployment

Intern-S1 has now been open-sourced and can be downloaded from Hugging Face. After downloading the model, we recommend deploying it as an API service for evaluation calls. You can deploy it with LMDeploy, vLLM, or SGLang according to this page.
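
Once the service is up, it is worth sanity-checking the endpoint before wiring it into OpenCompass. The snippet below is a minimal sketch using the OpenAI Python client: the base URL, port, and API key are placeholders for your own deployment, and the chat_template_kwargs field is only honored by vLLM/SGLang-style servers.

# Minimal sketch for verifying the deployed Intern-S1 API service.
# Assumptions: the service exposes an OpenAI-compatible endpoint at the URL
# below and accepts the API key configured at deployment time.
from openai import OpenAI

client = OpenAI(
    base_url="http://127.0.0.1:23333/v1",  # replace with your API base
    api_key="YOUR_API_KEY",                # replace with your API key
)

response = client.chat.completions.create(
    model="internlm/Intern-S1",
    messages=[{"role": "user", "content": "Briefly introduce yourself."}],
    temperature=0.7,
    # Thinking-mode switch; supported when serving with vLLM or SGLang
    extra_body={"chat_template_kwargs": {"enable_thinking": True}},
)
print(response.choices[0].message.content)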

Evaluation Configs

Model Configs

We provide an example model config in opencompass/configs/models/interns1/intern_s1.py. Modify it according to your needs.

# Imports follow the upstream example config; exact paths may vary slightly
# across OpenCompass versions.
from opencompass.models import OpenAISDK
from opencompass.utils import extract_non_reasoning_content

api_meta_template = dict(
    round=[
        dict(role='HUMAN', api_role='HUMAN'),
        dict(role='BOT', api_role='BOT', generate=True),
    ],
)

models = [
    dict(
        abbr="intern-s1",
        key="YOUR_API_KEY",  # Fill in your API key here
        openai_api_base="YOUR_API_BASE",  # Fill in your API base URL here
        type=OpenAISDK,
        path="internlm/Intern-S1",
        temperature=0.7,
        meta_template=api_meta_template,
        query_per_second=1,
        batch_size=8,
        max_out_len=64000,
        max_seq_len=65536,
        openai_extra_kwargs={
            'top_p': 0.95,
        },
        retry=10,
        extra_body={
            # Controls the thinking mode when the model is served with vLLM or SGLang
            "chat_template_kwargs": {"enable_thinking": True}
        },
        # Strip the reasoning trace from predictions when thinking mode is enabled
        pred_postprocessor=dict(type=extract_non_reasoning_content),
    ),
]

Dataset Configs

We provide a dataset config for evaluating Intern-S1 in examples/eval_bench_intern_s1.py. You can also add other datasets as needed.
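
If you prefer to assemble your own dataset list rather than using the provided file as-is, the usual OpenCompass pattern is to import dataset configs inside read_base() and merge them, as in the sketch below. The two dataset modules named here are only illustrative assumptions; substitute the dataset configs available in your OpenCompass installation.

# Sketch: compose a `datasets` list for the eval config.
# The specific modules (gpqa_gen, aime2024_gen) are examples only.
from mmengine.config import read_base

with read_base():
    from opencompass.configs.datasets.gpqa.gpqa_gen import gpqa_datasets
    from opencompass.configs.datasets.aime2024.aime2024_gen import aime2024_datasets

# Merge every imported *_datasets list into a single variable.
datasets = sum(
    (v for k, v in locals().items() if k.endswith('_datasets')),
    [],
)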

In addition, you need to add the LLM judge configuration to this config file, as shown in the following example:

judge_cfg = dict(
    abbr='YOUR_JUDGE_MODEL',
    type=OpenAISDK,
    path='YOUR_JUDGE_MODEL_PATH',
    key='YOUR_API_KEY',  # Fill in the judge model's API key here
    openai_api_base='YOUR_API_BASE',  # Fill in the judge model's API base URL here
    meta_template=dict(
        round=[
            dict(role='HUMAN', api_role='HUMAN'),
            dict(role='BOT', api_role='BOT', generate=True),
        ]),
    query_per_second=1,
    batch_size=1,
    temperature=0.001,
    max_out_len=8192,
    max_seq_len=32768,
    mode='mid',  # Truncate over-long prompts from the middle
)
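
The judge configuration then needs to be attached to the datasets that are scored by an LLM judge. A common way to do this in OpenCompass eval configs is the loop below; it is a sketch that assumes the relevant datasets use an LLM-based evaluator exposing a judge_cfg field, so check the dataset configs in your version for the exact structure.

# Sketch: attach the judge model to every dataset whose evaluator expects one.
# Assumes `datasets` is the merged dataset list defined earlier in this file.
for item in datasets:
    evaluator = item.get('eval_cfg', {}).get('evaluator', {})
    if 'judge_cfg' in evaluator:
        evaluator['judge_cfg'] = judge_cfg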

Start Evaluation

After completing the above configuration, enter the following command to start the evaluation:

opencompass examples/eval_bench_intern_s1.py