Evaluation Results Persistence¶
Introduction¶
Normally, the evaluation results of OpenCompass are saved to your work directory. In some cases, however, users may need to share results with each other or quickly browse existing public evaluation results. For this purpose, we provide an interface that transfers evaluation results to an external public data station, and on top of it offers uploading, overwriting, and reading functions.
Quick Start¶
Uploading¶
By adding arguments to the evaluation command or adding configuration in the Eval script, the evaluation results can be stored in the path you specify. Here are the examples:
(Approach 1) Add the -sp option to the command and specify your public data station path.
opencompass ... -sp '/your_path'
(Approach 2) Add configuration in the Eval script.
station_path = '/your_path'
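For reference, a minimal sketch of an Eval script that combines this setting with model and dataset configs might look as follows; the imported configs are placeholders, so replace them with the ones you actually evaluate.
from mmengine.config import read_base

with read_base():
    # placeholder imports; substitute your own model/dataset configs
    from .models.hf_internlm.hf_internlm2_chat_1_8b import models
    from .datasets.gsm8k.gsm8k_gen import gsm8k_datasets as datasets

# path of the public data station where results will be uploaded
station_path = '/your_path'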
Overwriting¶
Before uploading, the above storage method checks whether a result for the same task already exists in the data station, based on the abbr attribute in the model and dataset configurations. If a result already exists, the upload is skipped. If you need to update these results, add the --station-overwrite option to the command. Here is an example:
opencompass ... -sp '/your_path' --station-overwrite
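For context, the abbr attribute mentioned above comes from the model and dataset configs themselves. The snippet below is a hypothetical, heavily simplified model entry; real configs need more fields (such as type), but the abbr value is what the data station uses as the model name.
models = [
    dict(
        abbr='my-model-hf',       # used as model_name in the data station
        path='my-org/my-model',   # hypothetical weight path
    ),
]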
Reading¶
You can read existing results directly from the data station to avoid repeating evaluation tasks. The results read this way participate directly in the ‘summarize’ step. With this configuration, only tasks whose results are not yet stored in the data station will be launched. Here is an example:
opencompass ... -sp '/your_path' --read-from-station
Command Combination¶
Only upload the results under your latest work directory to the data station, without re-running tasks whose results are missing:
opencompass ... -sp '/your_path' -r latest -m viz
Storage Format of the Data Station¶
In the data station, the evaluation results are stored as json files, one for each model-dataset pair, under the path /your_path/dataset_name/model_name.json. Each json file stores a dictionary with three keys: predictions, results, and cfg. Here is an example:
Result = {
    'predictions': List[Dict],
    'results': Dict,
    'cfg': Dict = {
        'models': Dict,
        'datasets': Dict,
        'judge_models': Dict  # only for subjective datasets
    }
}
Among these three keys, predictions records the model's prediction on each item in the dataset, results records the model's overall score on the dataset, and cfg records the detailed configurations of the model and the dataset used in this evaluation task.
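Because each result is a plain json file at a predictable path, it can also be inspected outside of OpenCompass. The snippet below is a small illustrative sketch (the dataset and model abbreviations are hypothetical) that loads one stored result and prints its overall score.
import json
from pathlib import Path

def load_station_result(station_path, dataset_abbr, model_abbr):
    # results live at /your_path/dataset_name/model_name.json
    result_file = Path(station_path) / dataset_abbr / f'{model_abbr}.json'
    with result_file.open(encoding='utf-8') as f:
        return json.load(f)

result = load_station_result('/your_path', 'gsm8k', 'internlm2-chat-1.8b-hf')
print(result['results'])             # overall score of the model on the dataset
print(len(result['predictions']))    # number of per-item predictions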