You can improve your programming skills by reading code, but the only code I’ve seen people study is individual snippets or functions. Not files, much less codebases. So, if I wanted to improve my skills, what codebases should I read the source code of?
Here are some specific questions:
I don’t have any recommendations in this regard, which bothers me, which is why I’m asking lobsters at large.
IMHO it’s hard to get much out of reading a codebase without necessity. Without a reason why, you won’t do it, or you won’t get much out of it without knowing what to look for.
Yeah, this seems a bit like asking “What’s your favorite math problem?”
I dunno. Always liked 7+7=14 since I was a kid.
Codebases exist to do things. You read a codebase because you want to modify what that is or fix it because it’s not doing the thing it’s supposed to. Ideally, my favorite codebase is the one I get value out of constantly but never have to look at. CPU microcode, maybe?
I often find myself reading codebases when looking for examples for using a library I am working with, or to understand how you are supposed to interact with some protocol. Open source codebases can help a lot there. It’s not so much 7 + 7 = 14, but rather 7 + x + y = 23, and I don’t know how to do x or y to get 23, but there are a few common components between the math problems. Maybe one solution can help me understand another?
I completely agree. I do the same thing.
When I am solving a similar problem or I’m interested in a class of problems, sometimes I find reviewing a codebase very informative. In my mind, what I’m doing is walking through the various things I might want to do and then reviewing the code structure to see how they’re doing it. It’s also bidirectional: A lot of times I see things in the structure and then wonder what sorts of behavior I might be missing.
I’m not saying don’t review any codebases at all. I’m simply pointing out that without context, there are no qualifiers for one way of coding to be viewed as better or worse than any other. You take the context to your codebase review, whether explicitly or completely inside your mind.
There’s a place for context-free codebase reviews, of course. It’s usually in an academic setting. Everybody should walk through the GoF and functional data structures. You should have experience in a generic fashion working through a message loop or queuing system and writing a compiler. I did and still do, but in the same way I read up on what’s going on in mRNA vaccinations: familiarity. There exist these sorts of things that might help when I need them. I do not necessarily have to learn or remember them, but I have to be able to get them when I want. I know these coding details at a much lower level than I do biology; after all, I’m the guy who’s going to use and code them if I need them. But the real work is matching the problem context up (gradually, of course) with the various implementation systems you might want to use.
There are folks who are great problem-solvers that can’t code. That sucks. There are other folks who can code like the wind but are always putting some obscure yet clever chunk of stuff out and plugging it in somewhere. That also sucks. Good coders should be able to work on both sides of that technical line and move back and forth freely. I review codebases to review how that problem-solving line changed over the years of development, thinking to myself “Where did these guys do too much coding? Too little? Why are these classes or modules set up the way they are (in relation to the problem and maintaining code)?”
That’s the huge value you bring from reviewing codebases: more information on the story of developing inside of that domain. The rest of the coding stuff should be rote: I have a queue, I have a stack, etc. If I want to dive down to that level, start reviewing object interface strategy, perhaps, I’m still doing it inside of some context: I’m solving this problem and decided I need X, here’s a great example of X. Now, start reading and go back to reviewing what they’ve done against the problem you’re solving. Don’t be the guy who brings 4,000 lines of code to a 1 line problem. They might be great lines of code, but you’re working backwards.
Yeah, I end up doing this a lot for e.g. obscure system-specific APIs. Look at projects that’d use it/GH code search, chase the ifdefs.
Great Picard’s Theorem, obvs. I always imagined approaching an essential singularity and seeing all infinity unfold, like a fractal flower, endlessly repeated in every step.
I’d disagree. While sure, one could argue you just feed a computer what to do, you could make a similar statement about for example architecture, where (very simplified) you draw what workers should do and they do it.
Does that mean that architects don’t learn from the work of other architects? I really don’t think so.
But I also don’t think that “just reading” code or copying some “pattern” or “style” from others is what makes you like it. It’s more that if you write some code only on your own or with a somewhat static, like-minded team your mental constructs don’t really change, while different code bases can challenge your mental model or give you insights in a different mental/architectural model that someone else came up with.
For me that’s not so different from learning different programming languages - like really learning them, not just being able to figure out what it means or doing the same thing you did before with different syntax.
I am sure it’s not the same for everyone, and it surely depends on different learning styles, but I assume that most people commenting here don’t read code like they read a calculation and I’d never recommend people to just “read some code”. It doesn’t work, just like you won’t be a programmer after just reading a book on programming.
It can be a helpful way of reflecting on your own programming, but very differently from most code-reviews (real ones, not some theoretical optimal code review).
Another thing, more psychological maybe is that I think everyone has seen bad code, and be it some old self-written code from some years ago. Sometimes it helps for motivation to come across the opposite by reading a nice code base to be able to visualize a goal. The closer it is to practical the better in my opinion. I am not so much a fan of examples or example apps, because they might not work in real world code bases, but that’s another topic.
I hope though that nobody feels like they need to read code, when they don’t feel like it and it gives them nothing. Minds work differently and forcing yourself to do something seems to often counter-act how much is actually learned.
“Mathematics is not a spectator sport” - I think the same applies to coding.
Well, it varies. Many contributions end up being a grep away and only make you look at a tiny bit of the codebase. Small codebases can be easier to grasp, as can those with implementation overviews (e.g. ARCHITECTURE.md)
I have to agree with this; I’ve found the most improvement comes from contribution, and having my code critiqued by others. Maybe we can s/codebases to study/codebases to contribute to/?
Even if you don’t have to modify something, reading something out of a necessity to understand it makes it stick better (and more interesting) than just reading it for the sake of reading. That’s how I know more about PHP than most people want to know.
Years ago working on my MSc thesis I was working on a web app profiler. “How can I get the PHP interpreter to tell me every time it enters or exits a function in user code” led to likely a similar level of “I know more about the internals of PHP than I would like” :D
Sorbet typechecker for ruby can teach you about data-oriented-design for compilers. Two interesting files:
(strongly biased about this one) rust-analyzer can teach you how to architect a powerful IDE for a complex language. I am not especially proud of the code itself, but I like the overall architecture and the way it is explained. One specific bit I like is the test suite (some day I’ll blog about how we test)
Rust standard library can teach you how “primitives” like vectors or hash maps work under the hood. This is a real production thing, but is (yet?) surprisingly readable in comparison to eg typical STL implementation. Vec would be a good start:
ImmutableText from IntelliJ platform is good evidence that an immutable rope needn’t be a complex beast to work well in an editor (not really a code base, but I find it surprising how simple this thing is in comparison to the scale of IntelliJ):
sds teaches about frugal library design and working within the constraints of the language (C in this case).
Cargo’s test suite is a good example of how to test “real world” programs: not a pure function, interacts with outside world significantly, must be backwards compatible forever, has infinite number of special cases to handle:
I like the OpenBSD codebase because it focuses on simplicity and makes the intention of every program very clear. In this sense, it is much easier to read than for example GNU code. I think it teaches simplicity quite well, but I don’t really know any “programming practices”; I just do things that are fun and interesting.
I recommend browsing the bin and usr.bin parts of the codebase especially, those are the ones concerning the core command line utilities of the system. It’s interesting to see how many such programs can be implemented.
Here is the OpenBSD style guide: https://man.openbsd.org/style.9
I often find myself reading the source code of Fe, a tiny Lisp. It’s implemented in 800 LoC of ANSI C. Despite not being a C programmer (the pointer-munging scares me a little), it’s amazing being able to glance at a full programming language in 1 file.
In terms of what it teaches you, it’s a good example of a very small (regex-less) parser and mark-and-sweep GC.
I guess the downsides are the slightly C-ey data structures - objects have a car and cdr with GC and type info just shoved in, plus a bunch of bit-twiddling which makes it a bit less clear.
As with many things, rxi has written a very good implementation overview for it, which is a good starting point.
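To make the mark-and-sweep idea concrete, here is a minimal Python sketch of the technique. It is not Fe's actual C code: the `Cell` and `Heap` classes and their method names are made up for illustration, and a real implementation would pack the mark bit and type tag into the object itself rather than use a separate attribute.

```python
class Cell:
    """A cons cell: car/cdr plus a GC mark bit (hypothetical layout, not Fe's)."""

    def __init__(self, car=None, cdr=None):
        self.car = car
        self.cdr = cdr
        self.marked = False


class Heap:
    """Tracks every allocated cell so unreachable ones can be swept."""

    def __init__(self):
        self.cells = []  # every cell ever allocated and not yet freed
        self.roots = []  # cells the program can reach directly

    def cons(self, car=None, cdr=None):
        cell = Cell(car, cdr)
        self.cells.append(cell)
        return cell

    def mark(self):
        # Mark phase: flag everything reachable from the roots.
        stack = list(self.roots)
        while stack:
            obj = stack.pop()
            if isinstance(obj, Cell) and not obj.marked:
                obj.marked = True
                stack.append(obj.car)
                stack.append(obj.cdr)

    def sweep(self):
        # Sweep phase: drop unmarked cells, clear the mark on survivors.
        live = [c for c in self.cells if c.marked]
        freed = len(self.cells) - len(live)
        for c in live:
            c.marked = False
        self.cells = live
        return freed

    def collect(self):
        self.mark()
        return self.sweep()


heap = Heap()
kept = heap.cons(1, heap.cons(2, None))  # the list (1 2), reachable from a root
heap.roots.append(kept)
heap.cons(3, None)                       # garbage: nothing points at it
freed = heap.collect()
```

After `collect()`, the two cells of the rooted list survive and the orphaned cell is dropped.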
If it had a language-agnostic implementation tutorial, I’d follow it. It seems like the perfect educational language. And since it’s very small, I’m sure extending it with TCO and a bytecode VM would not be hard.
Similarly, I learned a lot as a teen by reading a hard copy of FIG FORTH for the Z-80. FORTH source code tends to have a “plot” in a way, I.e. it’s quite linear. First a small assembly core, sufficient to establish the interpreter, then a series of FORTH words defined in assembly, which then bootstraps the all important “:” word that lets it start defining words in FORTH itself.
Also, FORTH is simple enough that it makes LISP look like C++ ;-) The syntax is basically purely linear, and there’s no GC nor even a memory heap.
Jones FORTH is a superlative example to read — it’s one source file, lavishly commented in “literate” style. Highly recommended. https://github.com/nornagon/jonesforth/blob/master/jonesforth.S
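The bootstrapping “plot” described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the primitives are Python lambdas standing in for the assembly-defined words, and “:” just stores a list of tokens instead of compiling threaded code the way a real FORTH such as Jones FORTH does. The name `make_forth` is made up for this example.

```python
def make_forth():
    stack = []
    # Primitive words, standing in for the ones a real FORTH defines in assembly.
    words = {
        "+": lambda: stack.append(stack.pop() + stack.pop()),
        "*": lambda: stack.append(stack.pop() * stack.pop()),
        "dup": lambda: stack.append(stack[-1]),
    }

    def run(tokens):
        tokens = iter(tokens)
        for tok in tokens:
            if tok == ":":
                # ":" reads a name, then a body up to ";", and stores a new word.
                name = next(tokens)
                body = []
                for t in tokens:
                    if t == ";":
                        break
                    body.append(t)
                words[name] = lambda body=body: run(body)
            elif tok in words:
                words[tok]()
            else:
                stack.append(int(tok))  # anything else is treated as a number

    def interpret(source):
        run(source.split())
        return stack

    return interpret


forth = make_forth()
forth(": square dup * ;")               # define a new word in terms of primitives
result = forth("3 square 4 square +")   # leaves 9 + 16 on the stack
```

Once “:” exists, every further word can be written in the language itself, which is why the source reads so linearly.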
This makes me want to port Fe to Zig for fun and for learning
If there’s a test suite why don’t you port it to your desired language?
I’m increasingly thinking of doing that. There’s no test suite but there are examples and rxi has made real stuff with it. I reckon it would be quite different in the target language though, to be idiomatic.
I got a lot out of reading DJB’s daemontools a number of years ago.
It’s good if you want to see how to write simple and reliable C code in a very careful and minimalist way. This paper has some thoughts on DJB’s style of secure C coding:
DJB also notably uses shell and C together to minimize privilege.
I think you can start at any file with a main(), as it is a small collection of utilities, loosely joined. The overall design is as important as the code.
Another good read is CPython. There are definitely things I don’t like about it, but it’s been well maintained by a small-ish group of people for 30 years now, which is incredible.
It’s not a project where one person does everything. I think that’s a good contrast to DJB’s style, which is more about keeping everything small so that one person can vouch for correctness and security.
It’s obviously important to the world, which makes it worth reading. But I would also say that the code is significantly easier to read than its contemporaries: Perl, Ruby, PHP, R, and arguably Lua. (I have looked at all of them to varying degrees, as well as many other language implementations)
It’s extremely modular and globally coherent. Seeing how PyTypeObject work together actually taught me a lot about the Python language, even after I had programmed in it for ~15 years!
I’m not sure you can start in one place by reading CPython; I think it’s easier to write your own Python-C extension, and that may give you a hint of how the interpreter works. It’s very simple, open to extension, and dynamically typed. C sort of lends itself to this dynamically typed architecture which tends to “grow well”. There are a lot of things about CPython that could be more optimal locally, but I think it has a lot of global coherence and that’s one reason why it has lasted.
Another good read is xv6, which is the modernized source for v6 Unix, and taught at MIT. It’s extremely easy to compile and modify, which is rare for an OS. I added a command line tool to it and ran it in QEMU, and it was easy (I think it also taught me how to run QEMU :) ). It’s good for understanding where C and Unix came from.
As for Python code, I got a lot out of this, but it’s NOT easy to read. It’s just small. If you know Python well then it’s fun to figure out the puzzle of how they did it: http://www.tinypy.org/
There’s also a Python bytecode compiler in Python here that is interesting because it’s very short and Lispy:
It definitely reminded me that you can write Python with a Lisp accent :) :) Very cool and short.
accompanying article: https://codewords.recurse.com/issues/seven/dragon-taming-with-tailbiter-a-bytecode-compiler
Seconded. The CPython implementation is quite straightforward. It doesn’t use too many tricks to improve its speed, which means the code is easier to read than hyper bummed implementations. Speaking of which, the same is certainly true for Scheme48 which was also written for clarity and with simplicity in mind.
I found SBCL to be a treasure trove of solid code as well, since here too most of the system is implemented in Lisp itself. It’s a bit more complex to navigate as it’s very big, but I found it very valuable to study when I was reading up on bignum implementations.
I agree and like that it’s straightforward, though one exception is ceval.c, the main bytecode interpreter loop. It is really long and full of macros and obscure control flow. Not very readable IMO, which is why I started hacking on the Python versions.
Ruby itself also uses codegen in its bytecode interpreter: https://github.com/ruby/ruby/blob/master/insns.def
Oh yeah I think I have peeked at that file before! Definitely looks cleaner than how CPython has done it.
If you’re a little intimidated to read CPython yourself and would like a ‘guided tour’, Philip Guo has an excellent set of lectures where he just goes through the code piece-by-piece. He goes from ‘CPython is just a bunch of .c and .h files’ to ‘you create an iterator from a generator object by calling PyObject_SelfIter, which just increments a ref counter and returns itself.’
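The Python-level effect of that last point is easy to observe without opening the C source. A minimal sketch, with `countdown` as a made-up generator for illustration:

```python
def countdown(n):
    """Yield n, n-1, ..., 1."""
    while n > 0:
        yield n
        n -= 1


gen = countdown(3)
it = iter(gen)

same_object = it is gen  # iter() on a generator hands back the generator itself
values = list(it)        # consumes the generator through `it`
leftover = list(gen)     # nothing left: `gen` and `it` are the same object
```

Because the iterator and the generator are one object, exhausting one exhausts the other.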
I neglected to mention a shell codebase :) Aboriginal Linux is defunct but its goal was to be the smallest Linux system that can rebuild itself. (In that sense it’s similar to recent Guix bootstrapping efforts.)
And it’s all written in shell. It’s like a mini-Linux from scratch. Linux from Scratch is also worthwhile though it takes forever to do, whereas Aboriginal is small.
So in a sense I think Aboriginal gives you a better idea of how to build Linux from scratch – how to build and configure a kernel, and what’s in user space and how to assemble it. It also gave me more of an idea of how embedded devs think and code which is considerably different than server side / desktop / web / etc. developers.
It’s much clearer than say Debian, which is a bunch of shell-make-custom-tool-package-manager gobbledygook. Aboriginal is pure shell. It’s closer to a program than a bunch of scripts grown over time.
There are a few books on this topic, such as The Architecture Of Open Source Applications series. I was (and am) curious about the same thing. Upon reading one of them most of the way through, I found myself studying problems I don’t have and not interested anymore. I may be gleaning some universal patterns or ideas about splitting up responsibilities. So, I agree with what @DanielBMarkham said. But the bigger question about credentials, authority and reference material remains. Software is too damn abstract.
At least in hardware, you can measure and test something physical. Some bits of it are abstract but at least the thing has concrete attributes. But then OTOH, even hardware has context. “There are no bad products, only bad prices”. Go’s stdlib is what I would read because that was the mantra at the time. But they were solving problems at a certain abstraction level. And so to find similar software at the same abstraction level I’m doing, I’ll simply search Github for similar package combinations:
I don’t understand something until I’ve broken it and fixed it. Which is another way of saying being there when it was built. And then that’s where I feel appreciation for something I didn’t write.
Good point. I really enjoyed the O’Reilly book Beautiful Code. This book has excerpts from real code bases with a discussion by the authors themselves. It has code in a variety of languages with many famous people like Brian Kernighan, Yukihiro “Matz” Matsumoto, Simon Peyton Jones, Kent R Dybvig etc. The cherry on top is that all royalties are donated to Amnesty International.
Another book is Diomidis Spinellis’s Code Reading, in which he recommends among other things the NetBSD codebase.
BoltDB was one of the first times I had to read into a repository for a project, and its simplicity is very engaging. The straightforward architecture really gave me an appreciation for how something seemingly massive like a database can be written in a simple yet still incredibly quick way. Also, while people like to say Go code is “boring” - isn’t that a good thing?
LLVM was the project that made me stop hating C++. When I started working on it back in 2008, it was using a ‘tasteful subset’ of C++98. There were some rough corners, but it was a clear improvement over the C codebases that I was working on at the time. The move to C++11 was a huge change and it’s undergone a lot of refactoring to make it into a modern C++ codebase. Smart pointers for ownership, ranges types for iteration, and so on. When I started, it was common to see things like:
The type names for the iterators often pushed you over the 80-column limit (in my browser window, the above example has a horizontal scroll bar). The abstractions were clean (for example, you didn’t need to know what collection the methods were stored in internally and this could change easily), but the code was verbose. With C++11, this changed to:
With C++17-inspired ranges (which LLVM adopted before the standard library caught up), this instead becomes:
For C abstractions, others have recommended the various *BSD codebases. My bias is towards FreeBSD. All three have good and bad bits of the code, but a lot of the good abstractions have been ported between them (e.g. busdma in FreeBSD was brought over from NetBSD). I’ve recently been hacking on Linux and it’s so very painful in comparison. A few examples:
This isn’t to say that the FreeBSD kernel is great. It’s quite dated and every time I try to do something in it I am frustrated that it would be half as many lines of code in C++ and would get stricter compile-time checks.
MacPaint and QuickDraw.
The code just reads nicely and clearly, without much fluff. It’s got more globals than modern tastes dictate, but even so I find it easy to follow and pleasant to read.
That reminds me that in my early days at Apple (c.1991) I used to hear horror stories about Bill Atkinson’s code, specifically HyperCard. Some of my co-workers had worked on HyperCard 2, and they said getting it to run on the Mac II was a nightmare because the 1.0 code was so hard coded to a 512x340x1 screen — it was full of assumptions about bit depths and rowBytes and such. (And it used a lot of internal QuickDraw code, because Bill had also written that, which broke in the color-enabled QD…)
IIRC it took another later death-march to get the code base to support opening two documents (stacks) at once.
If you’re looking for good C code bases, the NetBSD kernel and the PostgreSQL code bases are quite nicely structured and easy to navigate. Presumably other BSDs are similar, but I only really have worked with the NetBSD kernel. If you find this sort of thing interesting, the Design and Implementation of the FreeBSD Operating System is a very nice book too. It’s like a collection of papers on the various subsystems of the kernel, detailing the data structures and algorithms that were used.
One meta-strategy I have is to read the code relating to the very first public release of various projects, because there will be a lot less optimization that has happened at that point, and the overall architecture will be cheaper to follow along to. I’ve really enjoyed reading the redis and kafka source codes from early versions (haven’t looked at either since 2012 though, but I remember they made a nice impression on me). My first job was at a place that worshiped DJB and the appreciation of parsimony rubbed off a lot on me. Following along to Riak, FoundationDB (their talks and more recently their source) and SQLite’s testing strategy have had enormous influences on me.
Over time, I’ve basically stopped caring about most of the metrics I used to value in reading a codebase. Today, I try to get a sense of the architecture, then I look at their tests, then I measure the thing using the workload I actually care about on hardware I will actually be using. Nothing else really matters to me now. If the architecture sucks, I will not look at the tests because I’ll know it can’t be fixed. If the tests suck, I will not care about performance. If the performance sucks for what I care about, I will not care about how pretty the code is, regardless of how well it performs for someone else’s workload. If it checks out and I use it and it breaks, then I might develop opinions on the coding style as I fix it etc… but what matters is what it does when it runs. I treat documentation as aspirational and mostly irrelevant.
This is quite an inspiring comment. How do you go about that first step of getting a sense of the architecture? I’ve always found it highly non-trivial in a new codebase, so much so that I’ve organized my whole research program around that problem: http://akkartik.name/about
I have long used the same trick of reading old versions, so much so that I tried to make that approach more explicit while authoring software: http://akkartik.name/post/wart-layers. But I suspect I’m nowhere near as good or as practiced at it as you are.
The sails codebase has really well-done documentation in my opinion.
lib/router). Individual file comments explain both the existing functionality and the development process, although explanatory comments are occasionally included unnecessarily imo (see lib/router/bind.js). Attention to detail means that comments are provided for every dotfile in the repo explaining its purpose and the purpose of individual settings (check out the
I think the documentation in this codebase is worth looking at because:
docs/* and ends up on https://sailsjs.com/documentation/reference after being autogenerated from markdown.
Essentially, very little is left in the maintainers’ heads. I wouldn’t comment my own projects quite so heavily but I do think there’s a lot to be learned from this codebase about documenting for both users and devs of a project.
pjsip, while specific to VoIP/RTC, is really interesting because it’s built from the ground up with a focus on portability.
Could you please consider updating the URL? The one referenced above doesn’t point to https://www.pjsip.org …
Thank you for pointing that out. I don’t seem to be able to edit/update the comment. But, here is the PJSIP git repo as well.
Pandoc (Haskell) is surprisingly readable! It’s a great practical example of how to use monadic parsing for non-trivial inputs which don’t fit cleanly into a stateless model for parsing. The Markdown reader is a good place to start. The guard statements everywhere are a pretty clever way to enable/disable certain features (such as whether or not to allow line breaks) depending on context.
I learned a lot by reading some snippets of the Visual Studio Code (TypeScript) repo. It’s a good example of how to structure a large Electron application, and it’s really cool that they dogfood their extension system to implement many of the editor’s core features.
What @calvin said very much resonates. I’m not really qualified to respond since I haven’t read anywhere near enough code but I can offer a small anecdote: I had a lot of fear of inadequacy (still do..) and there was some library I wanted to use for a personal project and I wasn’t sure how to use it, so I did a reverse dependency lookup and found all the projects on github that were using this particular library. Reading the source of those projects and seeing how differently people used it made me realize that there was no “right way” (not even close! they used it so differently it was literally incredible to me haha).
Some codebases that I’ve read a bit of and look good:
Besides these I have mostly just been looking at libraries that I plan to use, some are easier to read than others, not really sure if you’d be interested in such suggestions.
I look into the code of musl libc a lot to see reference implementations of the C standard library functions. The code is minimalist and well written.
The Anti-Grain Geometry library, a software 2D renderer like Cairo, Skia–
Zotonic has been an incredible resource of inspiration over the years to me. Two things I particularly learned from it:
I’m not sure if it counts as studying, but I find the Crystal code-base is structured well, and I’ve often used it to understand how to use specific modules. Reading the tests has been especially helpful for this.
The most beautiful code I’ve seen is the RubyX11 X11 client library by Mathieu Bouchard
This file astonishes me with the elegance in implementation due to the way the Ruby language is used:
A few examples of the Adapter pattern in Django.
My top picks for C code:
The code is both elegant and simple. Not actually sure how I found it and I am not using it, but really enjoy reading the code.
Go standard library. I think that code at least used to be a great example of being pragmatic when programming and making decisions. Rob Pike’s ivy is similar.
It would be awesome if we could get at least one of: