What is Overfitting?

Overfitting is a common problem in Machine Learning. But what does it actually mean? And why is it an important problem for Data Scientists to overcome?

Most exams are designed to test our real understanding, by including some questions that we’ve never seen before. To answer those questions well, we need to generalise, to apply our knowledge flexibly in novel situations.

But if all the questions on an upcoming exam had been asked in recent previous years, then you could get a great grade by regurgitating memorised good answers from those previous years without any real understanding. And the more time you spend on memorising, the worse you get at adapting to slightly different questions.

This happens in machine learning too: if your model performs well on the training set but can’t generalise to the test set, it is overfitting. It has over-learned the training set, without picking up on the underlying principles that would allow it to generalise.

Here, a linear model (black line) and a polynomial model (blue line) have been fitted to the same data. Although the blue line fits every point perfectly, the black line would be expected to generalise better: if the two models were used to extrapolate beyond the data we currently have, the linear model would likely make better predictions. In this case, the blue polynomial is overfitting. [Source: Wikipedia]
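For readers who like to see this in code, here is a minimal sketch of the same idea in plain Python. Everything in it is an assumption made up for illustration: the underlying trend (`true_fn`), the noise level, the ten training points and the helper names `fit_line`, `interpolate` and `mse`. It fits a straight line and a polynomial that passes through every training point, then compares their errors on points beyond the training range.

```python
import random

def fit_line(xs, ys):
    """Least-squares straight line y = a + b*x; returns (a, b)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return mean_y - slope * mean_x, slope

def interpolate(xs, ys, x_new):
    """Value at x_new of the polynomial that passes through every (x, y)
    point (Lagrange form). With 10 points this is a degree-9 polynomial."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x_new - xj) / (xi - xj)
        total += term
    return total

def mse(predictions, targets):
    """Mean squared error between predictions and targets."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)

def true_fn(x):
    """Assumed underlying trend, invented for this sketch."""
    return 2.0 * x + 1.0

random.seed(0)  # fixed seed so the run is repeatable

# Training data: a straight-line trend plus random noise.
train_x = [float(i) for i in range(10)]
train_y = [true_fn(x) + random.gauss(0, 1.0) for x in train_x]

# "Test" data: new points beyond the range the models were fitted on.
test_x = [10.0, 11.0, 12.0]
test_y = [true_fn(x) + random.gauss(0, 1.0) for x in test_x]

a, b = fit_line(train_x, train_y)
line_train_mse = mse([a + b * x for x in train_x], train_y)
line_test_mse = mse([a + b * x for x in test_x], test_y)
poly_train_mse = mse([interpolate(train_x, train_y, x) for x in train_x], train_y)
poly_test_mse = mse([interpolate(train_x, train_y, x) for x in test_x], test_y)

print(f"line:       train MSE {line_train_mse:.3f}, test MSE {line_test_mse:.3f}")
print(f"polynomial: train MSE {poly_train_mse:.3f}, test MSE {poly_test_mse:.3f}")
```

The polynomial has memorised the noise in the training points, so its training error is essentially zero, but that memorised noise is what throws it off once it has to predict somewhere new.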

Simply put, you can tell overfitting is a problem if your performance on the training set keeps improving as you continue to train, but your performance on the test set is actually getting worse.

My favourite example of overfitting actually comes from a human, rather than a machine. Solomon Shereshevsky was a famous Russian mnemonist whose brain naturally emphasised the distinctiveness of the world through synaesthesia, making it easy for him to memorise long strings of playing cards or random numbers for his act. But on the flip side, he struggled to recognise the faces of people he knew, which he saw as ‘very changeable’.

This is the same problem we have when an AI facial recognition algorithm has overfit to the training set and can’t generalise to the “test set” of photos taken a year later!

There are many techniques for avoiding overfitting. One solution is to stop training once the model has learned as much as it usefully can from the training set (known as ‘early stopping’). Another is to constrain your algorithms to look for simple solutions (known as ‘regularisation’). We’ll discuss some of these in future articles.
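As a rough illustration of those two ideas, here is a minimal sketch in plain Python. It is not a recipe from any particular library: the function names (`train`, `gradient_step`, `mse_loss`), the made-up data and the settings (`patience`, `l2`, `lr`) are all assumptions chosen for the example. The model is a flexible degree-9 polynomial trained by gradient descent. Training stops once the error on a held-out validation set has not improved for `patience` steps, and the `l2` penalty nudges the weights towards smaller, simpler values.

```python
import random

def features(x, degree):
    """Polynomial features 1, x, x^2, ..., x^degree."""
    return [x ** k for k in range(degree + 1)]

def predict(weights, x):
    return sum(w * f for w, f in zip(weights, features(x, len(weights) - 1)))

def mse_loss(weights, data):
    """Mean squared error of the model on a list of (x, y) pairs."""
    return sum((predict(weights, x) - y) ** 2 for x, y in data) / len(data)

def gradient_step(weights, data, lr, l2):
    """One gradient-descent step on MSE plus an L2 penalty on the weights."""
    n = len(data)
    grads = [0.0] * len(weights)
    for x, y in data:
        residual = predict(weights, x) - y
        for k, f in enumerate(features(x, len(weights) - 1)):
            grads[k] += 2 * residual * f / n
    for k in range(1, len(weights)):  # the constant term is not penalised
        grads[k] += 2 * l2 * weights[k]
    return [w - lr * g for w, g in zip(weights, grads)]

def train(train_data, val_data, degree=9, lr=0.05, l2=0.0,
          max_steps=2000, patience=50):
    """Train with early stopping; returns (best_weights, best_val, steps_run)."""
    weights = [0.0] * (degree + 1)
    best_weights, best_val, since_best = weights, mse_loss(weights, val_data), 0
    for step in range(1, max_steps + 1):
        weights = gradient_step(weights, train_data, lr, l2)
        val = mse_loss(weights, val_data)  # measured on data not trained on
        if val < best_val:
            best_weights, best_val, since_best = weights, val, 0
        else:
            since_best += 1
            if since_best >= patience:  # no recent improvement: stop early
                break
    return best_weights, best_val, step

random.seed(1)  # fixed seed so the run is repeatable

def noisy_point():
    """Made-up data: a straight-line trend plus noise, x in [-1, 1]."""
    x = random.uniform(-1.0, 1.0)
    return x, 2.0 * x + 1.0 + random.gauss(0, 0.3)

train_data = [noisy_point() for _ in range(10)]
val_data = [noisy_point() for _ in range(10)]

best_weights, best_val, steps_run = train(train_data, val_data, l2=0.01)
print(f"ran {steps_run} steps, best validation MSE {best_val:.3f}")
```

Note that the model is judged on validation data it never trains on, and the weights that are kept are the ones that did best on that data, not the ones from the final step.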

This post is part of a series I’m calling Data Science Explained - where I explain Data Science concepts to managers and other non-technical people. Other posts in this series include: