What do these tasks all have in common?
- Looking at an X-ray and deciding whether someone’s leg is broken
- Screening a CV for suitability for a job post
- Identifying faces of terrorists in a crowd
They are all examples of binary classification tasks, where we need to make a yes/no decision. This is one of the most common kinds of machine learning, and it usually works something like this:
For each example in our training data, we have an input and a label.
- The input might be anything, e.g. an image, or a piece of text, or a bunch of numbers, or something else entirely.
- The label is always a ‘yes’ or a ‘no’. 
Our job is to guess what the yes/no label should be for a new piece of test data based on its input. For example, the input might be an X-ray of someone's leg, and we're trying to label it yes/no for whether the leg is broken.
In this article, we’re going to focus on how to measure whether we’re doing a good job, and we’ll discuss three metrics, ‘accuracy’, ‘precision’, and ‘recall’.
Accuracy: how often do you get the right answer?
Let’s start with accuracy, which is the most obvious. The accuracy score is just a percentage - how often did you get the right answer? In other words, how often did your yes/no guess match the actual yes/no label for that input?
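As a minimal sketch of that definition, here is accuracy computed by hand in Python. The function name `accuracy` and the tiny label lists are made up for illustration; True stands for ‘yes’ and False for ‘no’.

```python
def accuracy(y_true, y_pred):
    """Fraction of guesses that match the actual labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Made-up example: True means 'yes', False means 'no'.
labels = [True, False, False, True, False]
predictions = [True, False, True, True, False]

# Four of the five guesses match the labels.
print(accuracy(labels, predictions))  # 0.8
```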
Accuracy works well when you have a balanced dataset, i.e. the ‘yes’ and ‘no’ labels are roughly 50-50.
But accuracy runs into problems when your dataset is imbalanced. Consider the problem of deciding whether the X-ray shows a broken leg. Perhaps in your dataset, only 1 in 10 of the X-rays actually has a broken leg. In other words, the correct labels are ‘yes’ 10% of the time, and ‘no’ 90% of the time.
Machine learning algorithms are lazy and opportunistic. Faced with a dataset like this, a simple classification algorithm might just guess ‘no’ every time, and then go home early to have a beer. And you know what, its accuracy would be 90%!
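To make the 90% figure concrete, here is a small sketch with a made-up dataset of 100 X-rays, 10 of which are broken. The `always_no` function is a hypothetical stand-in for the lazy classifier; it ignores its input entirely.

```python
# Made-up dataset: 10 broken legs (True) and 90 unbroken ones (False).
labels = [True] * 10 + [False] * 90
xrays = [None] * len(labels)  # placeholder inputs; the lazy model never looks at them


def always_no(xray):
    """A 'classifier' that says 'no' regardless of the input."""
    return False


predictions = [always_no(x) for x in xrays]

correct = sum(1 for t, p in zip(labels, predictions) if t == p)
accuracy = correct / len(labels)
broken_legs_found = sum(1 for t, p in zip(labels, predictions) if t and p)

print(accuracy)           # 0.9
print(broken_legs_found)  # 0
```

The accuracy looks respectable, yet the model has not found a single broken leg.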
So, if someone boasts about their high accuracy when you know the dataset is imbalanced, they’re either making an amateur mistake or trying to hoodwink you.
So what are Precision and Recall?
Accuracy asks a single question (‘how often was the answer correct?’). Precision and recall each ask a narrower question about the ‘yes’ cases:
- Precision = when it said yes, how often was it right? i.e., of all the people it said had a broken leg, how many did indeed have broken legs?
- Recall = of all the times the answer should have been yes, how many did it get? i.e., of all the X-rays with broken legs, how many did it notice?
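The two definitions above can be sketched in a few lines of Python by counting true positives, false positives and false negatives. The function name and the example numbers are made up for illustration, and returning 0.0 when a denominator is zero is just one possible convention.

```python
def precision_recall(y_true, y_pred):
    """Return (precision, recall) for boolean labels and predictions."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)      # said yes, and it was yes
    fp = sum(1 for t, p in pairs if not t and p)  # said yes, but it was no
    fn = sum(1 for t, p in pairs if t and not p)  # said no, but it was yes
    # Assumption: report 0.0 when there is nothing to divide by.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Made-up example: 10 broken legs and 90 unbroken ones.
labels = [True] * 10 + [False] * 90
# The model flags 8 of the broken legs and, wrongly, 4 of the unbroken ones.
predictions = [True] * 8 + [False] * 2 + [True] * 4 + [False] * 86

precision, recall = precision_recall(labels, predictions)
print(round(precision, 3))  # 8 correct out of 12 'yes' answers
print(round(recall, 3))     # 8 found out of 10 broken legs
```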
Sometimes we care more about precision, sometimes we care more about recall.
If we’re trying to decide whether to let someone into the Federal Reserve’s gold vaults from their retinal scan, maybe we care a lot about precision. In that case, we don’t want to accidentally say ‘yes’ when the answer is in fact ‘no’. We don’t want to accidentally let in a master thief holding up a severed eyeball - even if that means irritating the real Bank Manager because we’re being really careful and mistakenly reject him sometimes.
If we’re trying to tell whether someone has cancer from their medical imaging, maybe we care a lot about recall, i.e. we don’t want to accidentally say ‘no’ when the answer is ‘yes’. We don’t want to accidentally miss anybody, even if it means we occasionally have some false positives that go for a follow-up test.
For a deeper dive into precision and recall, Wikipedia’s examples and Venn diagrams are pretty helpful, and its definitions are more rigorous.
It's a trade-off
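You can see intuitively that precision and recall trade off against one another - in other words, you can usually improve one at the expense of the other by adjusting your threshold for deciding when to say ‘yes’ vs ‘no’. Raise the threshold and the model says ‘yes’ less often, so precision tends to go up while recall goes down; lower it and the reverse happens.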
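Here is a rough sketch of that trade-off. It assumes the model outputs a score between 0 and 1 for each example, and that we say ‘yes’ whenever the score is at or above a threshold. The scores, labels and function name are all made up for illustration; on real data the pattern is usually noisier than this.

```python
def precision_recall_at(labels, scores, threshold):
    """Say 'yes' when score >= threshold, then return (precision, recall)."""
    predictions = [s >= threshold for s in scores]
    pairs = list(zip(labels, predictions))
    tp = sum(1 for t, p in pairs if t and p)
    fp = sum(1 for t, p in pairs if not t and p)
    fn = sum(1 for t, p in pairs if t and not p)
    # Assumption: report 0.0 when there is nothing to divide by.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Made-up model scores and the true labels that go with them.
scores = [0.95, 0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2]
labels = [True, True, False, True, True, False, True, False]

for threshold in (0.85, 0.5, 0.1):
    precision, recall = precision_recall_at(labels, scores, threshold)
    print(f"threshold={threshold}: precision={precision:.3f}, recall={recall:.3f}")
```

In this toy example, lowering the threshold from 0.85 to 0.1 takes recall from 0.4 up to 1.0, while precision drops from 1.0 to 0.625.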
In the real world, your first job when faced with a classification task is to measure how balanced the dataset is, and then to have a conversation with your stakeholders about whether they care more about precision or recall. Unhelpfully, the answer is almost always ‘both’!
There are more advanced metrics (like ‘F1’, ‘area under the ROC curve’, ‘Cohen’s kappa’) that take both kinds of error into account, so you could just use one of those. But once you dig into the problem, you may realise that one of precision and recall is more important than the other. Indeed, you may eventually end up with an even more sophisticated approach. For example, imagine a situation where you make more money by maximising recall, but your reputation suffers if precision isn’t good enough. In that case, perhaps you decide on some acceptable level of precision (whatever’s standard for the existing system or market) - and set your threshold to maximise recall, as long as precision is above that acceptable level.
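One way to sketch that last idea: try each candidate threshold, throw away the ones whose precision falls below the acceptable level, and keep the one with the best recall. Everything here is an assumption for illustration, including the function names, the made-up scores and the choice to use each distinct score as a candidate threshold. In practice you would pick the threshold on held-out validation data.

```python
def precision_recall_at(labels, scores, threshold):
    """Say 'yes' when score >= threshold, then return (precision, recall)."""
    predictions = [s >= threshold for s in scores]
    pairs = list(zip(labels, predictions))
    tp = sum(1 for t, p in pairs if t and p)
    fp = sum(1 for t, p in pairs if not t and p)
    fn = sum(1 for t, p in pairs if t and not p)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def best_threshold(labels, scores, min_precision):
    """Return (threshold, precision, recall) with the highest recall among
    thresholds whose precision is at least min_precision, or None."""
    best = None
    for threshold in sorted(set(scores)):
        precision, recall = precision_recall_at(labels, scores, threshold)
        if precision < min_precision:
            continue  # not acceptable to the stakeholders
        if best is None or recall > best[2]:
            best = (threshold, precision, recall)
    return best


# Made-up model scores and the true labels that go with them.
scores = [0.95, 0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2]
labels = [True, True, False, True, True, False, True, False]

print(best_threshold(labels, scores, min_precision=0.75))
```

With a precision floor of 0.75, this toy data picks a threshold of 0.6, which gives precision and recall of roughly 0.8 each. If no threshold meets the floor, the function returns None.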
As always, the key in data science is evaluation, i.e. knowing how to measure whether you're doing a good job. Unless you get that right, no amount of algorithm tuning will lead to happy stakeholders!
Note: Of course, we can imagine more complex setups. Maybe sometimes we allow for multi-way decisions (rather than just binary yes/no). Maybe sometimes we have examples in our training data that don’t have labels. In that case, it will be worth asking if there's a way you can make use of those too.