Multi-class Performance Measures

When we want to evaluate a set of predicted labels, or the performance of an ML model, we use performance measures such as Accuracy, Precision, Recall and F-beta (most commonly F-1). However, apart from Accuracy, these measures are defined for binary labels and do not apply directly to multi-class data, where the class label can take more than two different values.


Why not Accuracy?

Accuracy is the proportion of correctly labeled instances out of the total number of instances. So what is wrong with accuracy that makes us need other performance measures? The problem is that on some datasets a weak model can achieve high accuracy, for example a dummy classifier that assigns every instance the most frequent label. In outlier detection, or in any dataset where a large portion of the samples share one class label, the dummy classifier can reach a high accuracy such as 80% (when 80% of the data belongs to the majority class). Meanwhile, a stronger model may even have lower accuracy. This is called the Accuracy Paradox. Hence, we usually prefer other performance measures such as Precision, Recall or F-measure. To learn more about them, check their Wikipedia pages.
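To make the paradox concrete, here is a minimal sketch in plain Python. The data (80 "normal" and 20 "outlier" instances) and the helper names `accuracy` and `recall_for` are made up for illustration; they do not come from any particular library.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of instances whose predicted label equals the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def recall_for(y_true, y_pred, label):
    """Fraction of the instances of `label` that were predicted as `label`."""
    relevant = [p for t, p in zip(y_true, y_pred) if t == label]
    return sum(p == label for p in relevant) / len(relevant)

# Hypothetical imbalanced data: 80% majority class, 20% minority class.
y_true = ["normal"] * 80 + ["outlier"] * 20

# Dummy classifier: always predict the most frequent label.
majority_label = Counter(y_true).most_common(1)[0][0]
y_dummy = [majority_label] * len(y_true)

dummy_accuracy = accuracy(y_true, y_dummy)
dummy_outlier_recall = recall_for(y_true, y_dummy, "outlier")
print(dummy_accuracy)        # 0.8
print(dummy_outlier_recall)  # 0.0
```

The dummy classifier scores 80% accuracy while finding none of the outliers, which is exactly the case where accuracy alone is misleading.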

I specifically recommend a nice article on why accuracy is not enough and why we need precision and recall, which can be found here.

What is Multi-class Classification?

Multi-class classification refers to classification problems that have more than two class labels, as opposed to the binary case. Document category classification is a common example of such a problem, where documents are to be assigned to one of a number of categories.

Performance Methods for Multi-class Classification

There are different methods for this task. Here, I name some of them and refer to articles and papers that explain these methods in detail.

Macro-Averaging or Micro-Averaging

Macro-Averaging is a method of combining the precision or recall of multiple class labels by averaging their values: it sums the precision (or recall) of each label and divides by the number of labels. One issue with macro-averaging is that it doesn't consider the number of samples with each class label. Weighted, or label-frequency-based, averaging is a different method in which each label's value is weighted by the number of samples with that label.

Micro-Averaging has a similar idea to macro-averaging, but computes precision and recall from the true positive, false positive and false negative counts summed over all class labels. In contrast to macro-averaging, this method takes the frequency of each label into consideration.
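The difference between the two averaging schemes is easier to see on a small example. The following is a sketch in plain Python, not a reference implementation: the labels, the data and the helper names are all hypothetical, and it assumes the convention that an undefined ratio (0/0) counts as 0.

```python
def per_label_counts(y_true, y_pred):
    """Return {label: (tp, fp, fn)}, treating each label as one-vs-rest."""
    pairs = list(zip(y_true, y_pred))
    counts = {}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        counts[label] = (tp, fp, fn)
    return counts

def ratio(numerator, denominator):
    # Convention assumed here: an undefined ratio (0/0) counts as 0.
    return numerator / denominator if denominator else 0.0

def macro_average(counts):
    """Average the per-label precision and recall; each label weighs the same."""
    precisions = [ratio(tp, tp + fp) for tp, fp, fn in counts.values()]
    recalls = [ratio(tp, tp + fn) for tp, fp, fn in counts.values()]
    return sum(precisions) / len(counts), sum(recalls) / len(counts)

def micro_average(counts):
    """Sum the counts over all labels first, then compute one precision and recall."""
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    return ratio(tp, tp + fp), ratio(tp, tp + fn)

# Hypothetical imbalanced data: 6 instances of "a", 2 of "b", 2 of "c".
y_true = ["a"] * 6 + ["b"] * 2 + ["c"] * 2
y_pred = ["a"] * 6 + ["b", "a"] + ["c", "b"]

counts = per_label_counts(y_true, y_pred)
macro_p, macro_r = macro_average(counts)
micro_p, micro_r = micro_average(counts)
print(f"macro precision={macro_p:.3f} recall={macro_r:.3f}")
print(f"micro precision={micro_p:.3f} recall={micro_r:.3f}")
```

On this data the two rare labels pull macro recall down to about 0.667, while micro recall is 0.8 because it is dominated by the frequent label "a". Note that when every instance has exactly one label, micro precision and micro recall are both equal to accuracy.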

There is a long-running debate among experts over which method to use. However, a common belief is that macro-averaging can be bad practice when there is a considerable difference in the number of samples of each class label.


Macro- and Micro-averaged evaluation measures

A systematic analysis of performance measures for classification tasks 


Micro/Macro- F-Measure

The micro- or macro-averaged F-Measure is computed as in the binary case, but from the micro- or macro-averaged multi-class Precision and Recall.
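One way to read this is to plug the averaged precision and recall into the usual binary F-beta formula. The sketch below does that for hypothetical per-label (tp, fp, fn) counts; the counts and variable names are made up for illustration. Be aware that this is only one convention: some tools instead define the macro F-measure as the average of the per-label F scores, which generally gives a different number.

```python
def f_beta(precision, recall, beta=1.0):
    """Binary F-beta formula: (1 + beta^2) * P * R / (beta^2 * P + R)."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Hypothetical per-label (tp, fp, fn) counts for three labels.
counts = {"a": (6, 1, 0), "b": (1, 1, 1), "c": (1, 0, 1)}

# Macro: average the per-label precision and recall, then apply the formula.
macro_p = sum(tp / (tp + fp) for tp, fp, fn in counts.values()) / len(counts)
macro_r = sum(tp / (tp + fn) for tp, fp, fn in counts.values()) / len(counts)
macro_f1 = f_beta(macro_p, macro_r)

# Micro: pool the counts, compute one precision and recall, then apply the formula.
tp_sum = sum(tp for tp, fp, fn in counts.values())
fp_sum = sum(fp for tp, fp, fn in counts.values())
fn_sum = sum(fn for tp, fp, fn in counts.values())
micro_f1 = f_beta(tp_sum / (tp_sum + fp_sum), tp_sum / (tp_sum + fn_sum))

print(f"macro F1={macro_f1:.3f}, micro F1={micro_f1:.3f}")
```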


Mutual Information:

Mutual information is a measure of the mutual dependence of two variables, here the true labels and the predicted labels. The problem with MI is that it can be just as high for a completely wrong prediction as for a perfect one: a prediction that consistently swaps the labels depends on the true labels as strongly as a correct prediction does.
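This weakness can be shown with a small sketch that estimates MI from empirical label frequencies. The function `mutual_information` and the toy labels are written for this illustration only; the result is in nats because it uses the natural logarithm.

```python
from collections import Counter
from math import log

def mutual_information(y_true, y_pred):
    """MI (in nats) between two label sequences, from empirical frequencies."""
    n = len(y_true)
    joint = Counter(zip(y_true, y_pred))
    true_freq = Counter(y_true)
    pred_freq = Counter(y_pred)
    mi = 0.0
    for (t, p), count in joint.items():
        p_joint = count / n
        mi += p_joint * log(p_joint / ((true_freq[t] / n) * (pred_freq[p] / n)))
    return mi

y_true = ["a", "a", "b", "b", "c", "c"]

# A perfect prediction.
perfect = list(y_true)

# Every label is consistently renamed, so every single prediction is wrong.
swap = {"a": "b", "b": "c", "c": "a"}
permuted = [swap[t] for t in y_true]

mi_perfect = mutual_information(y_true, perfect)
mi_permuted = mutual_information(y_true, permuted)
permuted_accuracy = sum(t == p for t, p in zip(y_true, permuted)) / len(y_true)

print(mi_perfect, mi_permuted, permuted_accuracy)
```

Both predictions get the same MI (log 3 here), even though the permuted one has an accuracy of 0.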

Read More about MI





