On the Nature of Accuracy

The term “accuracy” in machine learning can often be difficult to define. In order to get a reading on true accuracy of a model, we must have some notion of "ground truth", i.e. the true state of things. Accuracy can then be directly measured by comparing the outputs of the model with this ground truth. This is usually possible with supervised learning, where the ground truth takes the form of a set of labels that describe and define the underlying data. However, in many applications, such knowledge is either limited or unavailable. Unsupervised learning tasks tend not to have any measure of ground truth, and as such can be quite difficult to validate or to measure for accuracy. Accuracy measurements can even be difficult in some supervised or semi-supervised learning tasks, even when ground truth is available.

Classification: A Primer

Let us take an example of a supervised task for which accuracy measurement can be difficult: classification. With classification, we have a set of labeled data where the labels describe each individual data point. Models are built on the labeled data and then tested on new data in order to predict one of two outcomes: whether this new data pertains to this label or not. Since the labels are present for the new data, the labels themselves provide the ground truth and thus provide a direct way to measure the accuracy of the model. Furthermore, this method of classification can also be used for a form of anomaly detection; if the predicted class does not match the true class given by the label, then the new data can be said to differ significantly enough from the old data that the new data can be said to be anomalous. Taken at face value, accuracy measurements for classification should be easy. However, this is not necessarily always the case.

What to look out for?

The definition of accuracy in Machine Learning can change depending on the application in question. The foundations for accuracy measurements in classification come from the definitions of True/False Positives and Negatives. Once the basic concepts behind true and false positive rates are understood, it is relatively easy to appreciate the subtleties involved.

In medical applications, a missed diagnosis can have potentially fatal consequences. This would be a case of a False Negative, i.e. incorrectly identifying a sick patient as being healthy. The opposite (i.e. diagnosing a healthy person as sick) is still undesirable but is often less serious than the former case. A highly accurate machine learning system in the medical field will be designed to have as little False Negatives as possible.

In security applications such as fraud detection, a parallel can be drawn with legal terminology; innocent until proven guilty. It is important that the number of False Positives be as low as possible in order to minimize the number of false alerts, as the consequences of following up on false alerts in security applications can be quite high (i.e. wrongfully accusing an individual of malicious activity).

The Real World is Hard

In an ideal world, accuracy is easy to measure. In the majority of cases, research and development in the machine learning world does not focus on real-world “live” data, but rather on well-defined pre-compiled datasets. With these datasets, labels (if present) are correct, and accuracy can be easily quantified. However, in production environments, things are not always so clean-cut.

Case Study: Bank Loans

A simple example can be used to describe the difficulty in measuring accuracy in the real world. Consider an algorithm, deployed by a bank, that is used to decide whether or not a prospective customer should be approved for a loan. The customer would be approved if they are deemed likely to be able to repay that loan, and would be denied the loan if they were deemed at risk of non-payment (which could incur losses for the bank). In a real-world environment, this problem would only ever be able to accurately measure two of the four entries in the confusion matrix.

True Positives can be readily measured: when a customer is approved a loan,
and repays it correctly.
False Positives can be readily measured: when a customer is approved a loan,
and cannot repay it.

However, there is no way in a real live situation to discern between True Negatives (customers denied a loan who were truly unable to repay that loan) and False Negatives (customers denied a loan who would actually have been capable of repaying that loan), as there will be no way to ascertain the “ground truth” label for these cases. This means that True Positive Rates or False Positive Rates cannot be measured in production, and it will be difficult to assess the accuracy of the live system.

Managing the Trade-off

Real-world data is never perfect; it is typically noisy and may require some filtering/cleaning before it can be used. As a result, it can be very difficult for many “live” machine learning deployments to achieve highly accurate models - models that not only achieve very high True Positive Rate, but also a very low False Positive Rate. As such, these systems have an inevitable trade-off that needs to be made. A system can be set up to be highly sensitive to the data. This will see a high number of True Positives as more anomalies will be reported, with more of them likely to be correct. However, such a system will also likely have a high number of False Positives, as the number of incorrect anomalies reported will also increase. Lowering sensitivity of the model will reduce the overall number of reported events, which would lower the rate of False Positives, but, this could also lead to an unacceptably low True Positive Rate.

While some user-specific control may still be required (due to ever-changing concerns or objectives), a well-designed algorithm will be capable of achieving a fair balance between False Positives and False Negatives. Techniques such as combining multiple independent detection techniques (e.g., ensemble techniques or model aggregation), or a multi-stage analysis process can be employed to improve this trade-off.

Corvil Data Science and Best Practices

There are a number of considerations when evaluating modern Machine Learning products for best practices:

Algorithms should feature controllable trade-offs between True Positive and False Positive Rates. This will allow users to tune the algorithm to their desired requirements.
A sophisticated modern algorithm without good data will have a very tough time creating a good accuracy baseline. The better the quality of the data, the better the overall accuracy can become.
Combining multiple independent methods rather than focusing all efforts on a single path can vastly improve results.
Although there is a lot of hype around modern machine learning techniques, they should not be seen as a “one stop shop” solution. Combining ML algorithms with non-ML methods can yield even further improvements.

In our quest for ever more effective solutions to ever more difficult problems, Corvil’s approach to machine learning and data science considers these points, and more. These concepts and techniques currently represent the cutting edge of machine learning research and, with an unparalleled data source at the core, Corvil is leading the charge.