Diving Into Unfamiliar Codebases

June 13, 2021

Whether you’re starting a new job, joining a new team, or trying to contribute to an open source project that you’re new to, familiarising yourself with an alien codebase is something every software dev has to do at some point. You won’t start all projects from scratch. This isn’t easy by any measure (at least not for me), but in this post I’ll share some of the tricks that I’ve picked up for getting up to speed with a new codebase.

Starting to work on a large, unfamiliar codebase is radically different from starting a project from scratch. I find the latter easier, because you essentially start with a clean slate and incrementally add small chunks of code that you’re familiar with.

Looking at an existing piece of software, however, is a whole other ballgame. Even well written software is hard to reason about. Code is still written for computers instead of humans. Code may be poorly documented. Or it may use design or architecture patterns that you’re not familiar with. Or it may have hundreds of layers of abstraction (I’m looking at you, Java). And most software out there isn’t very well written. Heck, the very best code is no code at all.

We’ve been trying to make code more human-friendly for decades now. There are coding standards, design patterns, autogenerated documentation and test suites, and even crazy stuff like literate programming (the ferret project is my fav example). And yet, despite all this, code is still meant to be read by computers first and humans second. Thus it is likely that the next time you need to dig into the innards of something new, it will not be trivial.

Working with existing codebases is a fundamental skill for a software engineer. Something that you hone over time. And yet I don’t see this talked about very often, so here’s my first attempt at starting the conversation.

Run the Code

If you’re anything like me, the first few minutes you spend eyeballing the code quickly spiral into a vicious cycle of “oh I don’t understand” followed by jumping to a different file or function, which eventually ends with you scrolling through Reddit. This is bad. Eyeballing code rarely gets you far (unless you’re a prodigy or already familiar with the domain that the project is for).

So once you’ve gotten a bird’s-eye view of the codebase, get out of that eyeballing loop before you even enter it. A much more productive use of your time is to just run the code.

I find it astonishing that running a program is a non-trivial process in this day and age. It has improved a lot over the last few decades, but it is still hard. Some kinds of projects are especially hard to set up: things that do low-level stuff, networking-related projects (especially anything at layer 4 or below), service-oriented stuff, etc. There are so many build systems, dependency managers and all sorts of other tooling to wrap your head around.

So just getting the code to run comes with a plethora of benefits - it familiarizes you with the build system that the project uses, the dependency management stuff, some of the build/run automation tooling, the basic configuration you need, and any special requirements that the program may have (for example superuser privileges or extra capabilities). And that is a good start. It is also a prerequisite for almost all of the other tricks I suggest in this post.

Pro tip: if you’re having trouble building or running the software, or if documentation is scarce, the project’s CI config can be a quick and easy way to see exactly what is needed to build it. Assuming such a file exists, of course. If the build fails on your system, a good next step is to bind-mount the project files into a clean Docker container and attempt the build there.
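To make the container trick a bit more concrete, here is a rough sketch of the docker run invocation I have in mind, written as a small Python helper that only builds the command. The image name (ubuntu:22.04), the /src mount point and the function name are my own assumptions; pick whatever base image the project’s CI config uses.

```python
from pathlib import Path


def clean_build_shell_command(project_dir, image="ubuntu:22.04"):
    """Build a `docker run` command that opens a shell in a throwaway container.

    The project directory is bind-mounted at /src (an arbitrary choice),
    so the build happens in a clean environment but on your real files.
    """
    source = Path(project_dir).resolve()  # bind mounts need an absolute path
    return [
        "docker", "run",
        "--rm",                  # delete the container when the shell exits
        "-it",                   # interactive terminal
        "-v", f"{source}:/src",  # bind-mount the project into the container
        "-w", "/src",            # start in the mounted directory
        image,
        "bash",
    ]
```

Passing the returned list to subprocess.run (or copying it into a terminal) drops you into a shell where you can replay the build steps from the CI config one by one.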

Grepping

If you already have some idea of what you’re looking for in the codebase (say the specific component that is related to the issue you’re tackling, or if you’re wondering how a particular thing works), grepping for keywords is a shortcut to finding the needles in the haystack. This is one of the first things I end up doing when tackling an unfamiliar codebase. You can search through all the files in all subdirectories of the current dir using grep’s -R flag. There are quite a few tricks that I’ve come to use when grepping through codebases, and I’ll list them here:

Use git grep. This is like grep -R, but by default it only searches files that git tracks, so anything ignored by git (i.e. stuff matched by the .gitignore file or .git/info/exclude) is skipped automatically. This is particularly useful when working with Python codebases that have virtualenvs, or when your build artefacts are things like a static website.
Case insensitive searches with the -i flag improve your odds of landing a hit that you otherwise wouldn’t have landed because of casing issues. Especially useful for codebases in languages that use camelCase.
Use the regex wildcard (the . character, which matches any single character) together with the * operator (which repeats whatever precedes it) to search smart: a pattern like conn.*timeout matches both connect_timeout and connection timeout. Combining this with -i usually yields very good results.
Filter out irrelevant results using the -v flag, which inverts the match and keeps only the lines that do not match the pattern.
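If the flags feel abstract, this tiny Python sketch mimics what -i, the . and * wildcards, and -v do, using a handful of made-up lines instead of real files. The function name and the sample lines are hypothetical, and Python’s re module is not identical to grep’s regex dialect, so treat it as an illustration of the idea rather than a grep replacement.

```python
import re


def grep_lines(lines, pattern, ignore_case=False, invert=False):
    """Return the lines selected by `pattern`, mimicking a small subset of grep.

    ignore_case plays the role of -i and invert plays the role of -v.
    """
    flags = re.IGNORECASE if ignore_case else 0
    regex = re.compile(pattern, flags)
    # Keep a line when "it matched" differs from "we want non-matches".
    return [line for line in lines if bool(regex.search(line)) != invert]


# Hypothetical lines standing in for the contents of a codebase.
SAMPLE = [
    "def connectTimeout(host):",
    "CONNECTION_TIMEOUT = 30",
    "def test_connection_timeout():",
    "retries = 3",
]

# "conn", then any run of characters, then "timeout", ignoring case.
hits = grep_lines(SAMPLE, "conn.*timeout", ignore_case=True)

# Same idea as piping the results through `grep -v test`.
non_test_hits = grep_lines(hits, "test", invert=True)
```

Here the case-insensitive search finds the first three lines, while the same pattern without ignore_case would only find the snake_case test function. The inverted search then drops that test function and leaves the two lines I probably care about.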

File Archaeology - Using Commit History

You will eventually narrow your area of interest to one or two components (and a handful of files associated with them). At this point, you’ll have to start going from a bird’s-eye view to something more specific. A great way to do this is to look at the history of a file in terms of commits.

It is hard to overemphasize the importance of a thoughtfully authored commit. A commit represents a minimal set of changes that allow you to achieve a single, well defined, tangible result. It is the smallest building block of the version history of a project, and thus it has the least amount of cruft. Looking at commits is an excellent way to see how more experienced people have approached modifying the codebase, and how various parts of the codebase might be related. It is thus quite enlightening to smartly inspect commits.

You can view all the commits that modified a particular file using git log <filename> (add --follow if the file has been renamed along the way, since the plain command stops at the rename), and I usually scroll down to the very bottom of the history to start with the point where the file was created. I then work my way upwards, skimming through the commit messages to get some sense of how the file has changed and what it is responsible for. Now you know why it is so important to [write good commit messages][3]. Look out for interesting commits, useful keywords, etc. and note them down if you think it’ll help (keep a copy of the commit hash and message, or at least a keyword in the message so that you can quickly find the commit via a text search).

And then it is time to see how the code shaped up. The -p flag to git log shows you the diff for each commit in addition to the other details. So do a git log -p <filename>, hit G to fast-travel to the bottom (I’m assuming you’re using less as a pager, which is usually the default), and squint at the diffs. Try to understand how the file evolved. And if you don’t, it is okay. Time to move on.
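If you’d rather script this than scroll, here is a minimal sketch that wraps git log in Python. Note that it swaps my “jump to the bottom of the pager” habit for git’s --reverse option, which prints the oldest commit first. The function names and the one-line output format are my own choices, not anything the project dictates.

```python
import subprocess


def file_history_command(path, with_patches=False):
    """Build a `git log` command listing the commits that touched `path`."""
    cmd = [
        "git", "log",
        "--reverse",           # oldest commit first, so we read top to bottom
        "--date=short",
        "--format=%h %ad %s",  # short hash, author date, subject line
    ]
    if with_patches:
        cmd.append("-p")       # also show the diff of each commit
    cmd += ["--", path]        # "--" separates options from paths
    return cmd


def file_history(path, repo_dir=".", with_patches=False):
    """Run the command inside `repo_dir` and return git's output as text."""
    result = subprocess.run(
        file_history_command(path, with_patches),
        cwd=repo_dir, capture_output=True, text=True, check=True,
    )
    return result.stdout
```

Something like print(file_history("src/server.py")) (the path is made up) gives you a compact timeline to skim, and with_patches=True is the scripted cousin of git log -p.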

Working With Commits

At this point, you should ideally have some understanding of the life of the file. You probably have identified a few interesting commits, and it is time to zoom in on them. Looking at the diffs for a single file is rarely enough since many useful commits will touch multiple files to effect their goals. This is where your list of useful commits comes into play. For each commit, git show <commit hash> will show you the details of the commit. This will indicate the diffs for all the files that this commit touched. This is a good time to refine your mental model of how source files in the codebase are related.

git show can show information about any git object!

Sometimes, you want an even more complete picture of what a commit does. The best way to reason about this is to look at the codebase the way it was at that commit. git checkout <commit hash> is your friend (it leaves you in a “detached HEAD” state; git checkout <branch name> takes you back). Once you check out the commit, you can build and run the code and see it in action. You can also take another step back in the history and run the code again to observe what the codebase was like before the commit changed it. This is a good way to analyse bugfixes - you can reproduce the bug and see what causes it, and then move to the commits that fix the bug, and then run it again to see the fix in action.
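The before/after dance for a bugfix can be sketched roughly like this. Everything here is an assumption on my part: the function names, the build_and_run callback (which stands in for however the project is built and exercised), and the requirement that you pass in the branch to return to. It also assumes a clean working tree, since git refuses to check out over conflicting local changes.

```python
import subprocess


def replay_steps(fix_commit, return_to):
    """List the git commands for a before/after look at `fix_commit`."""
    return [
        ["git", "show", "--stat", fix_commit],   # message plus files touched
        ["git", "checkout", f"{fix_commit}~1"],  # the parent: bug still present
        ["git", "checkout", fix_commit],         # the fix itself
        ["git", "checkout", return_to],          # back to where we started
    ]


def replay(fix_commit, return_to, build_and_run, repo_dir="."):
    """Run `build_and_run()` once before and once after the fix commit."""
    show, before, after, restore = replay_steps(fix_commit, return_to)
    subprocess.run(show, cwd=repo_dir, check=True)
    try:
        for step in (before, after):
            subprocess.run(step, cwd=repo_dir, check=True)
            build_and_run()  # hypothetical callback supplied by the caller
    finally:
        # Always leave the detached HEAD state, even if a build fails.
        subprocess.run(restore, cwd=repo_dir, check=True)
```

I pass the branch name explicitly instead of relying on git checkout - because, after two checkouts in a row, that shortcut would only take you back to the previous commit rather than to your branch.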

Print Statements

You have a few code paths of interest and you want to start toying with the code. So pepper these code paths with a few print statements and run the thing! This gives you some insight into the flow of the program. Especially the “critical paths” that encompass the main functionality of the program. Print statements are particularly useful when you put them in blocks of conditional statements, and try to trigger different blocks. This “poor man’s debugger” is a quick-and-dirty way to start analysing specific “flows” in the program.
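Here is what that peppering might look like in practice. The shipping_cost function is entirely made up and just stands in for whatever code path you are poking at; the trace helper and its [TRACE] prefix are my own convention, chosen so the breadcrumbs are easy to grep for and delete afterwards.

```python
import sys


def trace(message):
    """Print a debugging breadcrumb to stderr with an easy-to-grep prefix."""
    print(f"[TRACE] {message}", file=sys.stderr)


def shipping_cost(weight_kg, express=False):
    # Hypothetical function standing in for the code path under investigation.
    trace(f"shipping_cost(weight_kg={weight_kg}, express={express})")
    if weight_kg <= 0:
        trace("branch: invalid weight")
        raise ValueError("weight must be positive")
    if express:
        trace("branch: express")
        cost = 10 + 2 * weight_kg
    else:
        trace("branch: standard")
        cost = 5 + 1 * weight_kg
    trace(f"returning {cost}")
    return cost
```

Calling the function with different arguments and watching which “branch:” lines show up is a cheap way of confirming (or demolishing) your theory about which inputs reach which block.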

Using the Test Suite

Tests aren’t just ways to ensure that your code does what it is supposed to do. They also end up “documenting” the expected behaviour of the code by virtue of their very existence! This is useful since it can tell us what a specific function/component/system is expected to do. Looking at the test files is a great way to quickly see what different things do. Integration tests are particularly helpful because they test business logic and exercise large, multi-component parts of the codebase. Unit tests can provide some insight into otherwise undocumented functions.

This approach becomes a lot more powerful if you modify the parameters, inputs or assertions in the tests and see how things break (and if they don’t, you might have found a bug, or worse, a poorly written test!).
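As a small illustration of tests doubling as documentation, consider this sketch. The slugify function and its tests are invented for the example; in a real project the function would already exist and you would only be reading (and then tweaking) its tests.

```python
import re


def slugify(title):
    """Hypothetical function under study: turn a title into a URL slug."""
    words = re.findall(r"[a-z0-9]+", title.lower())
    return "-".join(words)


# Reading the test names alone already tells you what slugify promises.
def test_lowercases_and_joins_with_hyphens():
    assert slugify("Hello World") == "hello-world"


def test_punctuation_is_dropped():
    assert slugify("Run the Code!") == "run-the-code"


def test_blank_title_gives_empty_slug():
    # The kind of case you find by changing an input "to see what happens".
    assert slugify("   ") == ""
```

The tests are plain functions with bare asserts, so a runner such as pytest can pick them up, or you can simply call them by hand. Try changing "Hello World" to something with accented characters and see whether the expectation still holds.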

Surgery Using A Debugger

I’ve emphasized the importance of getting familiar with using a debugger [in previous articles][4], and this section should give you another reason for learning how to use debuggers effectively. One of the most effective ways to become intimately familiar with code is to step through it with a debugger, looking at how stuff changes, the flow of the program, etc. You can also use this to trigger conditional code paths that are hard to trigger (by modifying variables on the fly!).

Most projects, even small ones, are still too big for one to fire up a debugger and go over every line. I find it more productive to set breakpoints in interesting functions or code paths and then step through (or over) the code starting from said breakpoint. Makes the process fit in the head a lot more easily.
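For a Python codebase, a targeted breakpoint could look something like this sketch. The parse_port function and the DEBUG_PARSE_PORT environment variable are hypothetical; the variable is only there so the breakpoint stays dormant unless I explicitly ask for it.

```python
import os


def parse_port(value, default=8080):
    """Hypothetical function whose fallback path is awkward to trigger."""
    if os.environ.get("DEBUG_PARSE_PORT"):
        # Drops into pdb (by default) right inside the interesting function.
        # Handy pdb commands from here:
        #   n            run the next line, stepping over calls
        #   s            step into the call on the current line
        #   p value      print a variable
        #   !value = "x" change a variable to force the fallback path
        #   c            continue until the next breakpoint
        breakpoint()
    try:
        port = int(value)
    except (TypeError, ValueError):
        port = default
    if not 0 < port < 65536:
        port = default
    return port
```

Running the program with DEBUG_PARSE_PORT=1 set stops execution at the top of parse_port, and from there you can step line by line or overwrite value to see the fallback branches fire without having to craft bad input from the outside.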

This also prepares you quite well for actually hacking on the codebase, which means you’ll be a lot more productive when you decide to actually contribute to the project you’re looking at.

Tackle an Issue (Or Pretend to Do So)

Picking up a (simple) issue is easily one of the best ways to have a targeted path to gaining some degree of familiarity with a codebase. And you’re also helping out the maintainers if you manage to tackle it! This will be a requirement if you’re doing it in a work context, but even for curiosity-driven forays into interesting open source software, working on an issue can be very productive. It lets you exercise pretty much everything this post covers (and a lot more) in an environment with well defined objectives. So don’t be scared and go ahead and pick something up!

Contributing to open source projects is a topic that is far out of the scope of this article (and there is a lot of material out there for this already), so I won’t go into that. But I will drop some friendly advice: go over and respect the contribution guidelines of the project, make sure that the license is friendly, ensure that you meet any requirements they have for contributors, and be respectful of the people you’re interacting with. The official IRC/discourse/slack/whatever channels are a great place to hang out and find help for projects that have them. Don’t forget to have fun!

If you aren’t comfortable committing to an issue yet, you can also pretend to do so. Take an issue that already has a fix, and pretend you’re trying to fix it. Spend some time on it, and if you actually are able to come up with a fix, congratulations! That should inspire you to pick up an unsolved issue perhaps. Regardless of whether or not you manage to fix the issue, make sure you closely look at the commits that fix it. This is an opportunity for some serious learning! Extra points if you identify doubts or suggestions for improvement related to the fix and talk about them to the maintainers.

More unsolicited advice: when I’m working on an issue in a new codebase, I often favour the TDD (Test Driven Development) methodology. I find that it is one of the quickest and most effective ways for me to get productive in a way that lets me develop fast while minimizing breakage. I tend to do TDD even with codebases that I am familiar with. It is a great way of working!

[3]: TODO the good commit message link
[4]: TODO rel link to mastering your tools