Results Summary¶
After the evaluation is complete, the results need to be printed to the screen or saved to a file. This process is controlled by the summarizer.
Note

If the summarizer appears in the overall config, all evaluation results will be output according to the logic described below. If it does not, the evaluation results will be output in the order in which they appear in the dataset config.
Example¶
A typical summarizer configuration file is as follows:
summarizer = dict(
    dataset_abbrs = [
        'race',
        'race-high',
        'race-middle',
    ],
    summary_groups=[
        {'name': 'race', 'subsets': ['race-high', 'race-middle']},
    ]
)
The output is:
dataset version metric mode internlm-7b-hf
----------- --------- ------------- ------ ----------------
race - naive_average ppl 76.23
race-high 0c332f accuracy ppl 74.53
race-middle 0c332f accuracy ppl 77.92
The summarizer reads the evaluation scores from the {work_dir}/results/ directory, using the models and datasets in the config as the full set, and displays them in the order of the summarizer.dataset_abbrs list. It also computes aggregated metrics according to summarizer.summary_groups. An aggregated metric named name is generated only if all the scores in its subsets exist; if any score is missing, the aggregated metric is missing as well. If a score cannot be obtained by either of the above methods, the summarizer puts - in the corresponding cell of the table.
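The aggregation rule above can be sketched roughly as follows. This is an illustrative sketch only, not OpenCompass's actual implementation, and the aggregate helper is a hypothetical name:

```python
# Sketch of the aggregation rule: an aggregated metric exists only if all
# subset scores exist; otherwise it is missing (rendered as '-' in the table).

def aggregate(scores, subsets, weights=None):
    """Return the aggregated score, or None if any subset score is missing."""
    if any(s not in scores for s in subsets):
        return None  # a missing subset score makes the aggregate missing too
    values = [scores[s] for s in subsets]
    if weights is None:
        return sum(values) / len(values)  # unweighted (naive) average
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

scores = {'race-high': 74.53, 'race-middle': 77.92}
print(aggregate(scores, ['race-high', 'race-middle']))  # naive average of the two
print(aggregate(scores, ['race-high', 'race-low']))     # None: 'race-low' is missing
```

With the scores from the example table above, the naive average of race-high and race-middle is 76.225, which matches the race row (76.23 after rounding).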
In addition, the output consists of multiple columns:

- The dataset column corresponds to the summarizer.dataset_abbrs configuration.
- The version column is the hash of the dataset, which takes into account the dataset's evaluation method, prompts, output length limit, etc. Users can use this column to verify whether two evaluation results are comparable.
- The metric column indicates the evaluation method of this metric; for details, refer to metrics.
- The mode column indicates how the inference result was obtained. Possible values are ppl / gen. For items in summarizer.summary_groups, the value is the same as that of the subsets if their methods of obtaining results are consistent; otherwise it is mixed.
- The subsequent columns represent different models.
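The rule for the mode of an aggregated row can be sketched as follows (illustrative only; group_mode is a hypothetical helper, not part of OpenCompass's API):

```python
def group_mode(subset_modes):
    """Mode of an aggregated row: the shared mode if all subsets agree, else 'mixed'."""
    modes = set(subset_modes)
    return modes.pop() if len(modes) == 1 else 'mixed'

print(group_mode(['ppl', 'ppl']))  # all subsets use ppl -> 'ppl'
print(group_mode(['ppl', 'gen']))  # inconsistent methods  -> 'mixed'
```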
Field Description¶
The fields of summarizer are explained as follows:
- dataset_abbrs: (list, optional) Items to display. If omitted, all evaluation results will be output.
- summary_groups: (list, optional) Configuration for aggregated metrics.
The fields in summary_groups are:
- name: (str) Name of the aggregated metric.
- subsets: (list) Names of the metrics to be aggregated. Note that these can be not only original dataset_abbrs but also the names of other aggregated metrics.
- weights: (list, optional) Weights of the metrics being aggregated. If omitted, unweighted averaging is used by default.
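For example, a weighted aggregate could be configured as below. This is an illustrative fragment following the field description above; the group name 'race-weighted' and the weights are made up:

```python
summarizer = dict(
    summary_groups=[
        # race-high contributes twice as much as race-middle to the average
        {'name': 'race-weighted',
         'subsets': ['race-high', 'race-middle'],
         'weights': [2, 1]},
    ]
)
```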
Please note that the summary groups of datasets such as MMLU, C-Eval, etc. are already provided under the configs/summarizers/groups path. It is recommended to consider using them first.
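For instance, a predefined summary group can be pulled into a config via a read_base import. This is a sketch of a config fragment; the exact module path and variable name (here mmlu_summary_groups) are assumptions and may differ between versions:

```python
from mmengine.config import read_base

with read_base():
    # Reuse the predefined MMLU summary groups shipped under
    # configs/summarizers/groups (path and variable name are assumptions).
    from .summarizers.groups.mmlu import mmlu_summary_groups

summarizer = dict(summary_groups=mmlu_summary_groups)
```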