Add a dataset
Although OpenCompass already includes most commonly used datasets, you can add support for a new dataset by following the steps below:
Add a dataset script mydataset.py to the opencompass/datasets folder. This script should include:
The dataset and its loading method. Define a MyDataset class that implements the data loading method load as a static method. This method should return data of type datasets.Dataset. We use the Hugging Face dataset as the unified interface for datasets to avoid introducing additional logic. Here’s an example:
    import datasets

    from .base import BaseDataset


    class MyDataset(BaseDataset):

        @staticmethod
        def load(**kwargs) -> datasets.Dataset:
            pass
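As a rough illustration of what the body of load might do, the sketch below reads a JSON Lines file into a list of rows and wraps it in a datasets.Dataset. The file format, the helper name read_jsonl_rows, and the path argument are assumptions made for this example, not part of OpenCompass. The Hugging Face library is imported lazily so that the reader can be tried on its own.

```python
import json
from typing import Dict, List


def read_jsonl_rows(path: str) -> List[Dict]:
    """Hypothetical helper: read one JSON object per non-empty line."""
    rows = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                rows.append(json.loads(line))
    return rows


def load(path: str):
    """Sketch of a load body; in a real script this would be the
    static method MyDataset.load."""
    # Imported here so read_jsonl_rows works without the library installed.
    import datasets

    return datasets.Dataset.from_list(read_jsonl_rows(path))
```

Keeping the file parsing separate from the Dataset construction makes the parsing step easy to check without any model or framework in the loop.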
(Optional) If the existing evaluators in OpenCompass do not meet your needs, define a MyDatasetEvaluator class that implements the scoring method score. This method takes predictions and references as input. Since a dataset may have multiple metrics, it should return a dictionary containing the metrics and their corresponding scores. Here’s an example:
    from typing import List

    from opencompass.openicl.icl_evaluator import BaseEvaluator


    class MyDatasetEvaluator(BaseEvaluator):

        def score(self, predictions: List, references: List) -> dict:
            pass
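To make the expected return value concrete, here is a minimal sketch of a score method that computes exact-match accuracy. The class is written standalone so it runs without OpenCompass installed. In a real dataset script it would subclass BaseEvaluator as shown above. The metric name accuracy, the error key, and the percentage scale are choices made for this example.

```python
from typing import List


class ExactMatchEvaluator:
    """Standalone sketch; a real evaluator would subclass BaseEvaluator."""

    def score(self, predictions: List, references: List) -> dict:
        # Report a length mismatch instead of silently truncating with zip.
        if len(predictions) != len(references):
            return {"error": "predictions and references have different lengths"}
        if not predictions:
            return {"accuracy": 0.0}
        correct = sum(
            str(pred).strip() == str(ref).strip()
            for pred, ref in zip(predictions, references)
        )
        # One key per metric; more metrics can be added to the same dict.
        return {"accuracy": 100.0 * correct / len(predictions)}
```

A dataset with several metrics would simply return more keys in the same dictionary.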
(Optional) If the existing postprocessors in OpenCompass do not meet your needs, define a mydataset_postprocess function. It takes an input string and returns the corresponding postprocessed result string. Here’s an example:
    def mydataset_postprocess(text: str) -> str:
        pass
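For a multiple-choice dataset, a postprocessor often reduces free-form model output to a single option letter. The sketch below assumes options labeled A to D and returns an empty string when no option is found. Both choices are assumptions for illustration and should be adapted to the dataset at hand.

```python
import re


def mydataset_postprocess(text: str) -> str:
    """Return the first standalone option letter A-D, or '' if none is found."""
    # \b keeps the pattern from matching capitals inside words such as "Answer".
    match = re.search(r"\b([A-D])\b", text)
    return match.group(1) if match else ""
```

Note that this simple pattern also matches the English article "A" at the start of a sentence, so a stricter pattern may be needed for verbose model outputs.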
After defining the dataset loading, data postprocessing, and evaluator methods, you need to add the following configurations to the configuration file:
    from opencompass.datasets import MyDataset, MyDatasetEvaluator, mydataset_postprocess

    mydataset_eval_cfg = dict(
        evaluator=dict(type=MyDatasetEvaluator),
        pred_postprocessor=dict(type=mydataset_postprocess))

    mydataset_datasets = [
        dict(
            type=MyDataset,
            ...,
            reader_cfg=...,
            infer_cfg=...,
            eval_cfg=mydataset_eval_cfg)
    ]
To make your dataset accessible to other users, specify the channels for downloading it in the configuration file. First, fill in a dataset name of your choice in the path field of the mydataset_datasets configuration. This name is mapped to the actual download path in the opencompass/utils/datasets_info.py file. Here’s an example:
    mmlu_datasets = [
        dict(
            ...,
            path='opencompass/mmlu',
            ...,
        )
    ]
Next, create a dictionary key in opencompass/utils/datasets_info.py with the same name as the one you provided above. If you have already hosted the dataset on Hugging Face or ModelScope, add the key to the DATASETS_MAPPING dictionary and fill in the Hugging Face or ModelScope dataset address under hf_id or ms_id, respectively. You can also specify a default local address. Here’s an example:
    "opencompass/mmlu": {
        "ms_id": "opencompass/mmlu",
        "hf_id": "opencompass/mmlu",
        "local": "./data/mmlu/",
    }
If you want the dataset to be directly accessible from the OpenCompass OSS repository when others use it, submit the dataset files in the Pull Request phase. We will then transfer the dataset to the OSS on your behalf and create a new dictionary key in DATASET_URL.
To keep the data source selectable, extend the load method in the dataset script mydataset.py so that it switches among download sources based on the environment variable DATASET_SOURCE. If DATASET_SOURCE is not set, the dataset is downloaded from the OSS repository by default. Here’s an example from opencompass/datasets/cmmlu.py:
    from os import environ

    def load(path: str, name: str, **kwargs):
        ...
        if environ.get('DATASET_SOURCE') == 'ModelScope':
            ...
        else:
            ...
        return dataset
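The branching above can be tried in isolation with a small sketch that maps a dataset name to a location depending on DATASET_SOURCE. The helper resolve_source and its return format are hypothetical. The mapping entry is copied from the DATASETS_MAPPING example earlier. Treating the local address as the location read by the default branch is an assumption of this sketch, not a description of how OpenCompass resolves paths internally.

```python
import os
from typing import Mapping, Optional, Tuple

# Entry copied from the DATASETS_MAPPING example in this guide.
DATASETS_MAPPING = {
    "opencompass/mmlu": {
        "ms_id": "opencompass/mmlu",
        "hf_id": "opencompass/mmlu",
        "local": "./data/mmlu/",
    },
}


def resolve_source(
    path: str, env: Optional[Mapping[str, str]] = None
) -> Tuple[str, str]:
    """Hypothetical helper: return (source, location) for a dataset name."""
    env = os.environ if env is None else env
    entry = DATASETS_MAPPING[path]
    if env.get("DATASET_SOURCE") == "ModelScope":
        return ("ModelScope", entry["ms_id"])
    # Unset or any other value falls through to the default branch.
    return ("default", entry["local"])
```

Passing the environment as an argument keeps the helper easy to test without modifying the real process environment.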
After completing the dataset script and config file, register your new dataset in the file dataset-index.yml in the main directory so that it can be added to the dataset statistics list on the OpenCompass website.
The keys that need to be filled in are name: the name of your dataset, category: the category of your dataset, paper: the URL of the paper or project, and configpath: the path to the dataset config file. Here’s an example:
    - mydataset:
        name: MyDataset
        category: Understanding
        paper: https://arxiv.org/pdf/xxxxxxx
        configpath: opencompass/configs/datasets/MyDataset
For details on dataset configuration files and other required configuration files, see the Configuration Files tutorial. For guides on launching tasks, see the Quick Start tutorial.