Years ago, I was eating dinner with relatives when one of them asked, ‘So... what did you do today?’ When I thought back, I realised I’d spent the better part of my afternoon puzzling over a glitch in my code. I’d eventually traced it to a colon that should have been a comma (Matlab cares about that sort of thing). When I realised just how much time I’d spent on something so trivial, I muttered a sheepish response, stuffed another bite of lasagne in my mouth, and tried not to think about how cruel a mistress code can be.

We spend 90% of our time debugging. Learning to type twice as fast will only speed up your overall development time by a few percent. So if you want to speed up how fast you develop finished software, focus your efforts on reducing time spent debugging.

But of course it’s so tempting to code first and ask questions later. Hammering away on the keyboard into the greenfield of a blank IDE screen is incredibly satisfying - it feels like making rapid progress.

Yet efforts to speed up how fast you write the code in the first place will have much less impact than finding ways to write fewer bugs, or to find and fix them faster. So let’s first understand the claim that the bulk of our programming effort goes into debugging. Then let’s think about how this should change the way we write software.
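
The arithmetic behind the typing-speed claim can be sketched in a few lines. This is only an illustration: the 90% figure is the rough estimate from above, and the function name `overall_time_saved` is made up for this example.

```python
def overall_time_saved(fraction_of_time, speedup):
    """Fraction of total time saved when one activity becomes `speedup` times faster.

    `fraction_of_time` is the share of total time that activity takes today.
    """
    return fraction_of_time * (1 - 1 / speedup)


# Assumption taken from the text: about 90% debugging, so at most 10% writing code.
typing_saving = overall_time_saved(fraction_of_time=0.10, speedup=2)
debugging_saving = overall_time_saved(fraction_of_time=0.90, speedup=2)

print(f"Typing twice as fast saves {typing_saving:.0%} of total time")
print(f"Debugging twice as fast saves {debugging_saving:.0%} of total time")
```

Under those assumed proportions, doubling typing speed saves about 5% of total time, while doubling debugging speed saves about 45%.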

What do we mean by ‘debugging’?

By debugging, I’m talking about investigating and addressing violated expectations[1], that is, time spent trying to understand why things aren’t working as you think they should and then acting on that.

Do we really spend that much of our time debugging?

Even though the machine does exactly what we tell it, we are surprised over and over when we write software.

That we spend the majority of our time debugging has been so consistently true, both in my own work and in watching hundreds of other programmers, that I’ve come to treat it as an axiom. From the literature, Glass (2002) notes that “error removal is the most time-consuming phase of the lifecycle”.

Moreover, not all time spent debugging is equal. Finding and fixing a bug before the code is live is cheap in every sense: it’s probably quick to fix because you still have the full system in your mind, and you remember how it all works. There’s not much reputational cost. It’s probably still within working hours. It doesn’t affect the roadmap or the user experience, no decisions have been made on the back of it, and no money has been lost.

In contrast, a bug in an important algorithm that’s been running for a year is expensive in every way. It might take ages to even figure out when the bug was introduced, let alone the cause of it. You may not be able to use your debugger on the production server, so you’re relying on spotty logs. Time spent fire-fighting is demoralising, and can be high stress - big bugs always seem to show up the night before an investor meeting. They can be more complex to fix - perhaps you need downtime, a custom deployment, or reprocessing a year’s worth of data from backups.

In other words, bugs cost much more as you get further towards production. It could take literally weeks of investigation for a complex or high-stakes bug.

When we talk about debugging time, it’s easy to ignore these kinds of outliers, to put them in a different category. But they’re not. They should be considered part of the development time for that piece of work. Time spent debugging has a long tail. I wouldn’t be surprised to find that there’s some kind of exponential distribution, like earthquake magnitudes.

Costly bugs may be rare, but when you amortise them over multiple pieces of work, they significantly drive up the proportion of overall time spent debugging. When you add it all up over the lifecycle of your code, debugging ends up taking up much more time than we imagine, and it’s usually not enjoyable time.
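
To see how a single outlier can dominate the total, here is a small worked example. The numbers are entirely hypothetical, chosen only to illustrate the long tail; they are not measurements.

```python
# Hypothetical numbers, chosen only to illustrate the long tail.
routine_bug_hours = [0.5] * 99   # 99 bugs caught before release, 30 minutes each
production_bug_hours = [80.0]    # one bug found in production: two working weeks

all_bugs = routine_bug_hours + production_bug_hours
total_hours = sum(all_bugs)
share_from_outlier = sum(production_bug_hours) / total_hours
mean_hours = total_hours / len(all_bugs)
median_hours = sorted(all_bugs)[len(all_bugs) // 2]

print(f"Total debugging time: {total_hours:.1f} hours")
print(f"Share caused by the one production bug: {share_from_outlier:.0%}")
print(f"Median bug: {median_hours} h, average bug: {mean_hours:.2f} h")
```

With these made-up figures the typical bug costs half an hour, yet the one production bug accounts for well over half of all debugging time, which is why ignoring the outliers understates the real cost.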

So how can we spend less time debugging?

To reduce the time we spend debugging, we should focus our efforts on introducing fewer bugs in the first place, finding them sooner, identifying their bounds correctly, and fixing them well.

Introduce fewer bugs in the first place

  • Focus your best efforts on the bits of the system that are either complex or important.
  • Write good automated tests. Preferably before you write the code. Write a range of tests - from small-scale unit tests that target a single function, through to large-scale integration tests that check the behaviour of the whole system.
  • Pair program.
  • Design. At the very least, talk things over with a colleague beforehand.
  • Code reviews. Like design sessions, these can be very valuable in many ways. But your reviewer almost certainly won’t run the code, and probably can’t inspect every line with a fine-toothed comb.
  • Create great fake datasets. See: How to write scientific software that’s right
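
As a minimal sketch of the testing and fake-dataset points above, here is a small unit test built around hand-made data whose correct answer is obvious. The function `mean_by_group` and its data are hypothetical, invented for this illustration.

```python
import unittest


def mean_by_group(rows):
    """Return {group: mean value} for an iterable of (group, value) pairs."""
    totals, counts = {}, {}
    for group, value in rows:
        totals[group] = totals.get(group, 0.0) + value
        counts[group] = counts.get(group, 0) + 1
    return {group: totals[group] / counts[group] for group in totals}


class MeanByGroupTest(unittest.TestCase):
    def test_known_answer_on_fake_data(self):
        # Fake dataset built by hand so the right answer can be checked by eye.
        fake_rows = [("a", 1.0), ("a", 3.0), ("b", 10.0)]
        self.assertEqual(mean_by_group(fake_rows), {"a": 2.0, "b": 10.0})

    def test_empty_input_gives_empty_result(self):
        # Edge cases are cheap to cover once the test scaffolding exists.
        self.assertEqual(mean_by_group([]), {})
```

Saved in a file, tests like these can be run with `python -m unittest`. The fake data is deliberately tiny: if the test fails, there is very little to inspect.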

Find bugs early and quickly

Aim for a debug cycle time of less than a second

  • How long does it take you to run your system and determine if it’s working? That’s your debug cycle time. Aim for a debug cycle time of under a second.
  • Even if it takes ten seconds, that’s enough to disrupt your flow, send you off to the coffee machine or your website-vice of choice, and fritter away minutes and precious mental context.
  • If your debugging cycle is slower than that, then you’ll have dramatically fewer bites at the apple in a given day, and every debugging exercise will be a marathon - improving this should be your top priority.
  • Let’s say you’re working on some complicated data analysis or model that takes hours to run. You still have a lot of moves you can make to reduce your cycle time by a few orders of magnitude. And truly, if you don’t reduce your cycle time, I’ll bet any money that this project will drag on and be a nightmare. See: How to unit test in data science
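
One common move for slow analyses is to iterate on a small, reproducible sample and only run the full dataset once things look right. The sketch below assumes a hypothetical `slow_analysis` function standing in for the expensive step; the `sample_size` and `seed` parameters are also made up for illustration.

```python
import random
import time


def slow_analysis(values):
    """Stand-in for an expensive computation (hypothetical)."""
    return sum(v * v for v in values) / len(values)


def run(values, sample_size=None, seed=0):
    """Run the analysis on everything, or on a small reproducible sample."""
    if sample_size is not None and sample_size < len(values):
        # A fixed seed gives the same sample every time, so runs are comparable.
        values = random.Random(seed).sample(values, sample_size)
    return slow_analysis(values)


data = list(range(200_000))

start = time.perf_counter()
quick_result = run(data, sample_size=1_000)
elapsed = time.perf_counter() - start
print(f"Sample run took {elapsed:.4f} s")
```

Measuring the cycle time explicitly, as with `time.perf_counter` here, makes it easier to notice when it creeps back up.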

Sharpen your axe

  • Choose good tools, and get to know them really well. If your main tool for debugging is print statements, you’re really missing a trick. Ask productive people about their development environment choices, and get to know your own environment backwards, especially the debugging tools.
  • As a bare minimum, you shouldn’t have to think when you want to step through line by line, in and out of functions, and inspect variables.
  • Make sure you can automatically drop into the debugger at a breakpoint, when certain conditions are met, or whenever there’s an error.
  • If you feel nervous making changes to your code because you’re worried about introducing subtle bugs that you won’t notice, then you probably need more automated tests.
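
In Python, for instance, the standard library’s `pdb` module covers the cases above. This is a minimal sketch rather than a recommended setup: the helper `call_with_post_mortem` and the `scale` function are hypothetical, while `pdb.post_mortem` and the built-in `breakpoint()` are standard.

```python
import pdb


def call_with_post_mortem(func, *args, debugger=pdb.post_mortem, **kwargs):
    """Call func; if it raises, open the debugger at the failing frame, then re-raise."""
    try:
        return func(*args, **kwargs)
    except Exception as error:
        # By default this starts an interactive pdb session on the traceback.
        debugger(error.__traceback__)
        raise


def scale(values, factor):
    """Multiply each value in place (hypothetical function under investigation)."""
    for index, value in enumerate(values):
        # Conditional breakpoint: uncomment to pause only on the suspicious case.
        # if value < 0:
        #     breakpoint()
        values[index] = value * factor
    return values
```

Calling `call_with_post_mortem(scale, [1, None], 3)` would drop into the debugger at the line that failed. Running a script as `python -m pdb your_script.py` (the script name is a placeholder) also enters post-mortem debugging if the program exits abnormally. Most IDEs offer the same features through their own interface.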

Find bugs upstream (further up the call stack)

  • It can take a long time to fix things if you don’t know where to look for the problem, or worse still, if you’re looking in the wrong place. Often, the error manifests in one place, but the underlying cause happened much further upstream. By upstream, I mean further up in the call stack (e.g. the function that called the function that called this one), or backwards in the code base (lines of code that were executed earlier). If you could have caught the problem right as it happened, it would have been much more obvious what the problem was, and much quicker to solve.
  • Indeed, this is a big argument for unit tests. If you have good tests, they’ll fail as you’re changing things, you’ll notice immediately, and that usually gives you a big clue about where to look.
  • Time-travel debugging allows you to run the debugger forwards and backwards, like scrubbing in a video. I’ve never used one, but this seems like it could be a big help, especially if you find yourself re-running the debugger over and over, each time with earlier and earlier breakpoints.
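
Here is a small illustration of an error that shows up downstream of its cause, and of catching it upstream instead. All the names (`parse_reading`, `parse_reading_strict`, `average`) and the data are hypothetical.

```python
def parse_reading(text):
    """Lenient parser: quietly turns bad input into None."""
    try:
        return float(text)
    except ValueError:
        return None


def parse_reading_strict(text, line_number):
    """Strict parser: fails at the point where the bad data enters the system."""
    try:
        return float(text)
    except ValueError:
        raise ValueError(f"line {line_number}: cannot parse reading {text!r}") from None


def average(readings):
    return sum(readings) / len(readings)


raw = ["1.5", "2.5", "n/a"]

# Lenient path: the error surfaces later, inside average(), far from the cause.
try:
    average([parse_reading(text) for text in raw])
except TypeError as error:
    print("downstream failure:", error)

# Strict path: the error names the offending input right where it was read.
try:
    average([parse_reading_strict(text, n) for n, text in enumerate(raw, start=1)])
except ValueError as error:
    print("upstream failure:", error)
```

The downstream failure complains about types inside `average`, which says nothing about where the `None` came from. The upstream failure points at line 3 of the input.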

Static typing

  • Until recently, my last 15 years of programming had all been with dynamic languages. I felt that they helped me write fewer lines of code, build faster, and avoid extra syntactic complexity. I did have some quiet misgivings though, especially about type errors that had to be laboriously traced upstream, and consequently the amount of energy spent adding type assertions to unit tests to try and catch them.
  • A brief foray into C# convinced me that I had indeed been missing a trick. I loved being able to specify the types of the inputs and outputs for a function. If someone feeds in a string that was supposed to be an integer, I want things to fail loudly and immediately. Better still, I want my IDE to look at me askance with a squiggly red line. This reduced the debug time for trivial bugs to milliseconds - I wouldn’t even have to run the unit tests to notice the problem. When I went back to Python, I upgraded my IDE and configured it to treat type hinting warnings as errors.
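
A tiny Python example of the kind of bug this catches. The function `double` is hypothetical; the behaviour shown is standard Python, where type hints are not enforced at runtime.

```python
def double(quantity: int) -> int:
    """Return twice the quantity."""
    return quantity * 2


print(double(3))  # 6

# Python does not enforce hints at runtime, so this call "works" and quietly
# returns the string "33", a wrong value that may only cause trouble much
# later. A static type checker such as mypy, or an IDE with type checking
# enabled, flags the string argument before the code is ever run.
print(double("3"))  # 33
```

Without the checker, the mistake only shows up wherever the string "33" eventually causes a problem, which may be a long way from this call.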

Improve your logging

  • This is most valuable when you’re trying to debug a problem in production that you can’t easily reproduce. You’re probably doomed without good logging. Remember that debugging problems in production will usually involve higher stakes, stress and urgency, and less information - so a little bit of time spent writing better logging could really pay off.
  • Improve your error messages. Error messages can be your best friend, but bad ones can be uselessly opaque or even misdirect you. You understand the system best at the time you’re writing the code, and so that’s when you can write the best error messages. Your future self or future colleague will thank you.
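
As a sketch of both points, using Python’s standard `logging` module: the error message says what was expected, what arrived, and which record was involved, and the log lines carry enough context to reconstruct what happened. The order-processing functions and the logger name `"orders"` are hypothetical.

```python
import logging

logger = logging.getLogger("orders")


def apply_discount(order_id, price, discount):
    """Return the discounted price, logging enough context to debug later."""
    if not 0 <= discount <= 1:
        # Say what was expected, what arrived, and which record was involved.
        raise ValueError(
            f"order {order_id}: discount must be between 0 and 1, got {discount!r}"
        )
    discounted = price * (1 - discount)
    logger.info(
        "order %s: price %.2f, discount %.0f%%, charged %.2f",
        order_id, price, discount * 100, discounted,
    )
    return discounted


def process(order_id, price, discount):
    try:
        return apply_discount(order_id, price, discount)
    except ValueError:
        # logger.exception records the message plus the full traceback.
        logger.exception("order %s: could not be processed", order_id)
        raise
```

To see the informational lines, configure logging once at startup, for example with `logging.basicConfig(level=logging.INFO)`. A caller who passes `25` instead of `0.25` gets an error that names the order and the bad value, rather than a mysteriously negative price.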

Be a scientist

  • Identify the bounds of the bug. Collect data to determine the circumstances when it does and doesn’t occur. Form a hypothesis about what’s causing it that’s consistent with the data.
  • Figure out a way to test that hypothesis as narrowly as possible (e.g. if the problem is with the caching, then it should go away when I read directly from the database).
  • If your hypothesis is wrong, update your theory, generate new hypotheses that are consistent, test them.
  • Take good notes. Zero in on things systematically.
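
The caching example can be made concrete with a toy system. Everything here is hypothetical (an in-memory “database”, a cache with a deliberate bug); the point is the shape of the experiment: change one thing, and check whether the prediction holds.

```python
database = {"user-1": "old@example.com"}
cache = {}


def get_email(user_id, use_cache=True):
    """Read an email address, optionally bypassing the cache entirely."""
    if not use_cache:
        return database[user_id]
    if user_id not in cache:
        cache[user_id] = database[user_id]
    return cache[user_id]


def update_email(user_id, email):
    database[user_id] = email  # Deliberate bug: the cache entry is not invalidated.


# Reproduce the symptom: a stale value after an update.
get_email("user-1")
update_email("user-1", "new@example.com")

# Hypothesis: the cache is the culprit. Prediction: a direct read is correct.
observations = {
    "with cache": get_email("user-1"),
    "bypassing cache": get_email("user-1", use_cache=False),
}
print(observations)
```

The cached read returns the old address and the direct read returns the new one, which is consistent with the hypothesis. Had both reads been stale, the cache theory would have been ruled out and the search would move elsewhere.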

Fix bugs “well”

If you’ve found the problem early, identified its bounds correctly, and tested your hypothesis successfully, then fixing it will hopefully be the easy part.

When you’ve found a bug, don’t just rush to fix it. First, introduce a new test that isolates it. Make sure that the test fails. Fix the bug. Make sure that the test now passes.

It’s important to confirm that the test fails before you fix it, otherwise you won’t notice if you’ve written a crappy test. In such cases you may think your test captures the bug, but perhaps the test isn’t quite right or isn’t even running at all! If you only confirm that it passes after you think you’ve fixed things, you may not realise that your test was inadequate and maybe also your fix.
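
The fail-first workflow can be sketched as follows. The median functions are a hypothetical bug and fix invented for this example, and `fails` is a small helper standing in for a real test runner.

```python
def median_buggy(values):
    ordered = sorted(values)
    return ordered[len(ordered) // 2]  # Wrong for even-length input.


def median_fixed(values):
    ordered = sorted(values)
    mid = len(ordered) // 2
    if len(ordered) % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def check_even_length_median(median):
    """Regression test isolating the bug: the median of [1, 2, 3, 4] is 2.5."""
    assert median([1, 2, 3, 4]) == 2.5


def fails(test, implementation):
    """Return True if the test raises AssertionError for this implementation."""
    try:
        test(implementation)
    except AssertionError:
        return True
    return False


# Step 1: confirm the new test fails against the code that has the bug.
print("fails before fix:", fails(check_even_length_median, median_buggy))
# Step 2: apply the fix and confirm the same test now passes.
print("fails after fix:", fails(check_even_length_median, median_fixed))
```

If step 1 had reported that the test passes against the buggy code, that would be the signal that the test does not actually capture the bug.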

Don’t create a new bug when you fix the old one. Yeah, I don’t know how to avoid doing that, but try not to.

To sum it up:

  • If we spend the majority of our programming time and effort on debugging, we should focus our efforts on speeding up our debugging (rather than trying to write code faster).
  • Fortunately, there are lots of ways in which we can debug more effectively - we can introduce fewer bugs in the first place, find them sooner, identify their bounds correctly, and fix them “well” by introducing tests that isolate them.


Footnotes:

[1] Glass (2002) refers to ‘error removal’, which is similar but too narrow for my tastes. The investigative aspect of debugging is often the hard and uncertain part - we can all think of a time we spent two hours tracking down the source of a problem, and then fixed it in under a minute once we found it. Indeed the fruits of that investigation don’t always lead to action. We may put in a lot of effort to understand an error, only to decide at the end of the debugging investigation that it’s too much hassle to fix. Or indeed, we might work hard to figure out why the program is behaving as it is, only to eventually realise that it’s correct and our expectations were wrong - i.e. there was no error to remove. That still counts as debugging to me, since things weren’t working as I thought they should. So debugging is investigating and addressing violated expectations, a superset of error removal.

Resources