Machine learning algorithms face the same challenge you did as a pupil at school.

There’s going to be an exam. The higher you score, the better. You’re given some exams from the past with answer sheets to study from.

This leads to a few standard data science principles that will be familiar from your schooldays:

  • You’ll have an easier time if the past exams are very similar to the future exam. You’ll be stuck if the past exams have different kinds of questions (or a different grading scheme) from the future exam.
  • You’ll have an easier time if you have a large number of past exams to study.
  • You’ll have an easier time if the questions are drawn from a narrow range of topics.
  • You’ll have an easier time if there’s a clean and consistent grading scheme.
  • You’ll have an easier time if you try a question or two, see what you got right and wrong, and learn from that, before you go onto the next one.
  • On the other hand, if correctness depends on factors beyond your knowledge or control, then that places an upper limit on just how well you can do.
  • You can get a rough idea of how well you’re going to do on the future exam by testing yourself on the past exams. Of course, if you’ve already seen the answers, that’ll give you over-inflated confidence. So you have to make sure to always test your progress on exams you haven’t seen before.
  • If your scores keep improving as you go over and over the past exams you have, keep on practising.
  • But only up to a point - sometimes you can actually fixate too much on the past exams, and you’ll find it hard to generalise your knowledge to new questions that are even slightly different and unexpected. See: What is Overfitting?
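To make the last two bullets concrete, here’s a minimal Python sketch of the exam analogy, using made-up questions and answers. One “pupil” memorises the past exam’s answers; the other learns the underlying rule. Both ace the past exam, but only the second does well on unseen questions - that’s overfitting in miniature.

```python
# Toy "exam": each question is a number x, and the right answer is x % 3.
# All of this is invented purely for illustration.
def right_answer(x):
    return x % 3

past_exam = list(range(20))           # questions available to study
future_exam = list(range(100, 120))   # unseen questions

# A "memoriser" fixates on the past exam: it stores every past answer
# and just guesses 0 for any question it hasn't seen before.
memorised = {x: right_answer(x) for x in past_exam}
def memoriser(x):
    return memorised.get(x, 0)

# A "generaliser" has learned the underlying rule instead.
def generaliser(x):
    return x % 3

def score(pupil, exam):
    """Fraction of questions the pupil answers correctly."""
    return sum(pupil(x) == right_answer(x) for x in exam) / len(exam)

print(score(memoriser, past_exam))     # 1.0 - perfect on the past exam
print(score(memoriser, future_exam))   # 0.3 - falls apart on new questions
print(score(generaliser, future_exam)) # 1.0 - generalises to the future exam
```

Testing only on the past exam would make the memoriser look just as good as the generaliser; only the unseen questions reveal the difference.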

In machine learning terms, we’d say that the past exams are the algorithm’s ‘training set’, i.e., the data it uses to learn from. Its ‘performance’ is its score when it generalises to new, unseen data, aka the ‘testing set’ - that’s what we really care about.
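Here’s what that split looks like in a minimal Python sketch, with made-up (study hours, passed) data where a pupil passes after at least 5 hours of study. Some past exams are held out as the testing set, the ‘learning’ happens only on the training set, and the number we report comes from the held-out data.

```python
# Toy data: (study_hours, passed) pairs, invented for illustration.
# In this made-up world, a pupil passes if they studied at least 5 hours.
past_exams = [(h, h >= 5) for h in range(10)] * 4

# Hold some data out as the 'testing set' - never looked at while learning.
training_set = past_exams[:30]
testing_set = past_exams[30:]

def accuracy(threshold, rows):
    """Fraction of rows where 'studied >= threshold hours' predicts passing."""
    return sum((h >= threshold) == passed for h, passed in rows) / len(rows)

# 'Training': pick the threshold that scores best on the training set only.
best = max(range(11), key=lambda t: accuracy(t, training_set))

# 'Performance': the score on unseen data is what we really care about.
print(best, accuracy(best, testing_set))  # 5 1.0
```

The point isn’t the (deliberately trivial) model - it’s the discipline of reporting the testing-set score, not the training-set score.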

Machine learning algorithms automatically improve with experience. They are unlike normal computer code, which needs every single step to be spelled out in unambiguous and complete detail. Given enough high-quality data and a clear scoring system, machine learning algorithms use clever maths to notice patterns in the training set.

In Part 2, we’ll talk about some techniques for guessing when the algorithms will be able to meet this magical-sounding promise, and when they’ll fall far short.

This post is part of a series called Data Science Explained - where I explain Data Science concepts to managers and other non-technical people. Other posts in this series include: