I laughed at the headline since if you look at my open source libraries you’ll find a few files bigger than that: https://github.com/adamdruppe/arsd/ simpledisplay is 21.8k, nanovega.d is 15.1k, minigui.d is 14.4k, cgi is 11.1k…
I find larger files easier to work with than collections of smaller files, all other things equal, and I like having complete units of functionality.
But “all other things equal” does a lot of work there - consider the article’s description: “It looked like the entire file would execute through from top to bottom”… That’s what is really scary: not the size of the file, but that the whole thing appears to be one large function. See, if I open simpledisplay.d, I’m not looking at the whole file. I’m just interested in SimpleWindow.close() or EventLoop.impl. The rest of the file isn’t terribly important; I open it, jump straight to the function I want to work on, and do what needs to be done. The individual function itself is pretty ordinary.
So I push back against the idea that file size matters by itself - a file is just a container. What you actually reason about is functions or classes or whatever, so that’s what you want to keep easy to follow (and note that smaller is not necessarily easier to follow: I also prefer long, simple code to short, clever code, and direct local use to indirection through multiple layers).
That’s interesting, I’m very much the opposite: I try to keep my source files under 500 lines, and each file has a specific purpose like one class, or a set of constants used in one module. Makes it a lot easier to jump to a specific part of the code, by just clicking on a filename or tab. And when I search I’m limited to the appropriate context, like the specific class I’m working on.
What is it you prefer about single big files?
I’m also team big files. I hate how many subjective decisions you have to make when you split things across files: “Hey, we split up foo and bar - now where do we put this function that was used by both foo and bar? Into foo.js, bar.js, or helpers.js?” Or: “Do we group models together and controllers together, or do we group by feature?”
Whatever organizational decisions you make, some of them will ultimately prove unsatisfying as your codebase evolves, and you’ll face the constant temptation to spend a bunch of energy reorganizing. But reorganizing your code across files doesn’t actually make your code more modular or more adaptable or improve it in any way, other than maybe making it a little bit more navigable for somebody whose muscle memory for navigating codebases revolves around filenames.
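To make the “helpers” outcome concrete, here is a minimal sketch with made-up module and function names (written in D rather than the JavaScript of the example above, to match the other code in this thread):

    // helpers.d - the third option: neither foo.d nor bar.d is an obviously
    // better home for this, so it lands here, and modules like this tend to
    // turn into junk drawers over time.
    module helpers;

    // Called from both foo.d and bar.d (import helpers : pluralize;).
    string pluralize(string word, size_t count) {
        return count == 1 ? word : word ~ "s";
    }

In a single-file layout the same function would just sit next to its two callers as a private helper, and the where-does-this-live question never comes up.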
I default to large files because it requires the least energy to maintain.
I’ve had this idea for a while - why can’t a filesystem or text editor support “views,” or different groupings of the same underlying files. Example: view code by business function, or view code by technical concern e.g. “show me all web controllers.”
It seems what you really want is to store all “source code” in a database instead.
When you think about it, how can it be possible that storing program code in a bunch of plain text files (ASCII, UTF-8) is in any way optimal for comprehension and modification? Text files are very much a least-common-denominator representation. We continue to use them because the ecosystem of operating systems, version control, text editors and so on allows us to use and interchange that information - so there is a very good reason why they persist to this day.
But I can imagine some kind of wacky Matrix-y (in the William Gibson, Vernor Vinge sense) 3D representation of programs, which makes great use of colored arrows, shapes and more to represent program operations and flow.
Do I have the slightest idea of where to start making such a programming “language”, or what exactly it would look like? No, I do not. Until we have better 3D systems (something akin to a floating hologram in front of me) that let me easily grab and manipulate objects, I don’t think I’d want to use such a system anyway. But this is the direction I think things will go in… eventually. That will likely take a long time.
Also, do we want to design a programming system that is optimized for human comprehension? Or something that is optimized for AI to use?
Well, my vote is always for humans. I have no stock in AI-produced code ever being a good thing.
Actually now that I think about it, NDepend is pretty similar to this. Warning: that page autoplays a video with sound.
So…Smalltalk? Lucid Common Lisp?
Those are heading in the right direction, but I’m thinking about something more comprehensive. The link from /u/amw-zero about NDepend is very interesting.
I’m not sure all or even most of these decisions are subjective - I think that ideally one would want to reduce coupling throughout, and to limit the visibility of implementation details.
I tend to think in terms of build system DAGs - it profoundly annoys me when everything needs to get rebuilt for no actual good reason. Which is another reason to prefer smaller files, I think - less rebuilding of code.
I agree. For one, I don’t think the decisions are subjective or objective in themselves. But the fact that I do have to spend time thinking about code before I start is a clear benefit to me, not a disadvantage. Yes, sometimes it’s annoying, but mostly it pays off in, as you say, simpler code, clearer boundaries, less stuff tied into one big implementation.
Splitting things into classes or modules can limit coupling/visibility. Some programming languages enforce one-module-per-file, but many (golang, ruby) don’t and in these languages there is no encapsulation benefit to putting things in different files.
Optimizing the build is a good point, though. If that’s the criterion for file divisions, and not subjective developer feelings about what belongs where, that eliminates the problem for me…
At some point in my career, I realized I had crossed a threshold where the writing & testing of the code was no longer the hard part: the hard part is organizing the code so it “makes sense” for what it’s doing, and I’ll be able to figure it out in 5 years when I come back to it.
My mental line is 500 lines, too. Once I hit that, and I don’t immediately know how to break it up, it’s usually a sign that I need to take a hike and think about the structure at a higher level. Most of the time, this mental refactoring unlocks a lot of future features, and the invested “thinking” time pays itself off multiple times over.
(None of this is a new insight, btw. I think it’s been written about since before the transistor.)
Basically everything. It is easier to find what I’m looking for, since I can just search inside the file rather than having to figure out which file to look in first. Perhaps if I used an IDE and project files I’d feel differently, but I don’t. And even if I did, sometimes I browse projects online, and online, you often click links that lead to single files to view in the web browser. So it is easier to work with here and easier to browse online. It is also easier for users (which includes me doing a quick ad-hoc build or test run): I can say “download cgi.d and list it on your build command” and it just works for them - no complication in mirroring the directory and figuring out which list of files belongs in the build, etc. D’s encapsulation is set at the file layer too, so I can actually define smaller, better-defined API surfaces this way than with a variety of files, since the implementations are all wrapped up, without being tempted to expose package protection or similar to share bits that need to be shared across files (or, worse yet, marking something public because I need it from a different file in my own project, instead of committing to supporting it long term for users, which is what I think public SHOULD mean).
When working with other people’s projects, I find the file organization almost never helps. I can never guess where something is actually found from the directory/file organization and just have to grep -R it.
So I believe in the old Gmail slogan: don’t organize, search! And then you start working on a more abstract level - classes, functions, etc. - instead of files anyway, so that’s just one more complication out of the way.
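For readers who don’t write D: a D module is a single file, and private symbols are visible to everything in that file but hidden from anyone who imports it, which is the encapsulation being described. A minimal sketch with hypothetical names (not taken from cgi.d):

    // mylib.d - one file, one module, one deliberate API surface.
    module mylib;

    // Shared by several functions in this file, but invisible to importers;
    // no need for package or public just to reuse it internally.
    private string stripTrailingSlash(string s) {
        return s.length && s[$ - 1] == '/' ? s[0 .. $ - 1] : s;
    }

    // The only symbol users are meant to depend on.
    string joinPath(string base, string child) {
        return stripTrailingSlash(base) ~ "/" ~ child;
    }

If stripTrailingSlash lived in a separate helpers.d, it would have to be marked package or public for mylib.d to call it - exactly the kind of leak being complained about above.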
Perhaps if I used an IDE and project files I’d feel differently, but I don’t.
It’s not just in IDEs. Any good programming editor ought to offer a file hierarchy view and support multi-file search. I can’t even imagine doing nontrivial coding without that, just like I can’t imagine not having Undo.
I don’t mean to sound condescending, but it really is worth it to look into more powerful tools.
What possible benefit could I get from merging my file browser into my editor? I have a file browser, I have an editor, they know how to talk to each other.
In the magic land where your file browser works as well with your editor as in an IDE, I’m sure you are correct. 🦄
I generally like the large file as well, if I don’t have really good navigation in my IDE, because it puts that much more within easy reach of my editor’s search function. And since I spend a lot more time looking for and reading code than jumping confidently to known places in a codebase, that biases me toward wanting longer files.
Does your IDE not have good ‘search this entire project’ support?
Yes, thus my statement “when I have good IDE navigation support.” The logical conclusion of that is Smalltalk where there are no files. But if I am sitting in a terminal with cli tools and vi, large files are easier since I don’t have to keep Ctrl+z’ing to run grep and fg’ing again.
There’s no right / wrong answer here, because obviously people’s individual brains are different. But lots of small files absolutely kills my productivity, because working on large pieces of functionality ends up requiring changing 5-10 files. There’s no good way to look at 10 files simultaneously, I don’t care how large your monitor is. So the benefit of larger files is that all of the code that I need to understand something is localized within something that’s on my screen right now, and it’s pretty easy to use something like the sidebar in Sublime or VSCode to scroll quickly to the part of the file that you need. Or text search for a specific function name to jump right to it. The benefit is not really needing to do anything to find the code that I need.
Also:
Makes it a lot easier to jump to a specific part of the code, by just clicking on a filename or tab
This isn’t unique to the small-file approach: if your IDE / editor has “jump to definition” support, it works just as well within a file. Like I said, while this can be subjective and you may prefer one way or the other, I find this often with people who prefer small files - there’s no actual tangible reason or benefit, it just feels more organized (to proponents) when code is factored into small pieces.
It may be a limitation of our tools, but I find too many files to be a cognitive cost. After a while I have too many editor tabs open, I can’t get to the part of code that I wanted without backtracking, etc. And, what is the downside of larger files? “Large things are bad” is not an axiom. “Large files are bad because they are large” is circular logic.
All that being said, I don’t care all that much. I can navigate around most codebases whatever the structure.
There’s also no good way to look at 11k lines simultaneously :)
The problem is not, in my experience, large files, but the lack of separation of concerns. Large file sizes can be a symptom of a lack of separation of concerns but they’re the symptom, not the problem. I started work on clang in 2008 because I was working on GNUstep and wanted to use the shiny new Objective-C features that Apple had shipped. Apple had their own fork of GCC and no one merged their changes into the main branch[1] and so Objective-C on non-Apple platforms had a NeXT-era feature set.
I looked at GCC to see how much effort it would be to update it. All of the Objective-C code in GCC was contained in a single file, objc-act.c, which was around 10K lines. It didn’t have any clear separation between the compiler stages and it was littered with if (next_runtime) everywhere. Some of the new features needed a new runtime, so all of those would need auditing and extending to provide a different implementation and become exciting switch cases.
At the time, clang had mostly working codegen for C (it miscompiled printf implementations, but a lot of C code worked). It also had parsing and semantic analysis support for Objective-C, but no code generation. I started by adding an abstraction layer separating the language-specific parts from the runtime-specific parts. That’s still there: there is an abstract CGObjCRuntime class with a bunch of different subclasses (Apple has two significantly different runtimes and a bunch of different variants of the newer one, so has made a lot of use of this abstraction). For a while, clang had better support for Objective-C on non-Apple platforms than on macOS.
Clang now has a bunch of source files that are larger than objc-act.c, but they’re cleanly layered. Parsing, semantic analysis, and IR generation are all in separate files. Objective-C runtime-agnostic IR generation is mostly in one file, Apple runtimes in another, non-Apple runtimes in a third. If you want to navigate the codebase and modify the Objective-C support, it’s easy to find the right place.
[1] The FSF used to point to Objective-C as a big win for the GPL. I consider it a great example of failure. NeXT was forced to open source their GCC changes but not their runtime, which made the changes useless in isolation. Worse, the NeXT code was truly awful. If NeXT had offered to contribute it to GCC, I strongly suspect that it would have been rejected, but because the FSF had made such a big deal about forcing NeXT to release it, it was merged.
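For readers who haven’t seen this pattern before: the abstraction layer described above replaces runtime checks scattered through the code generator (the if (next_runtime) litter) with virtual dispatch on a runtime object. A toy sketch of the shape of it, written in D to match the other examples in this thread and using invented names - the real thing is clang’s C++ CGObjCRuntime hierarchy:

    // The language-specific code generator only talks to this interface.
    abstract class ObjCRuntime {
        // One entry point per runtime-specific operation.
        abstract string emitMessageSend(string receiver, string selector);
    }

    // Each concrete runtime supplies its own dispatch convention...
    class AppleRuntime : ObjCRuntime {
        override string emitMessageSend(string receiver, string selector) {
            return "objc_msgSend(" ~ receiver ~ ", " ~ selector ~ ")";
        }
    }

    class GNURuntime : ObjCRuntime {
        override string emitMessageSend(string receiver, string selector) {
            return "objc_msg_lookup(" ~ receiver ~ ", " ~ selector ~ ")(...)";
        }
    }

    // ...so callers pick a runtime object once, instead of branching on
    // if (next_runtime) at every call site.
    string codegenCall(ObjCRuntime rt, string recv, string sel) {
        return rt.emitMessageSend(recv, sel);
    }

The point is that adding a new runtime becomes “write another subclass” rather than “audit every branch in a 10K-line file”.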
shrug
Maybe some of those source files are too big and ought to be broken up into smaller subcomponents?
Pfft, s7 is a single 95kloc C file, essentially.
Plus another 100k lines of tests!
Those tests are awesome, I cribbed quite liberally from them while working on full numeric tower support in CHICKEN! We went from 4.5kLOC of tests to 32.6kLOC of tests in a single commit by doing that ;)
Unfortunately, there are also several nonstandard extensions in s7 which the tests exploit, so it takes some weeding to get rid of those differences when porting to another Scheme.
Eh … those are really nice extensions though … :)
No doubt!
There was a twitter thread a few years ago where someone was bragging about their single-line file running their entire business. Later in the thread someone complained about a compile error from the site, and the response was “oh you must have accessed it while i was in the middle of uploading a new version.”
https://gist.github.com/alessonforposterity/832da4fab11e10609dad
My favorite example of a large file is C#’s garbage collector, which is implemented in a single 37k+ line file.
isn’t this the one that was (originally?) auto-generated from common lisp?
My personal favourite is dcraw, a tool for camera raw photo processing. It’s 10,000+ lines of C, contains redefinitions of macros and makes extensive reuse of shared static buffers and such for various purposes.
It’s interesting how the discussion is focused mainly on the number of lines. That seems to be the most benign of the factors at work here. I’d venture that the same untrained, unaware contributors to that nightmare would have done even worse if somehow someone had the idea at some point to split in multiple files.
PGBadger is a 23,658 line long Perl script: https://github.com/darold/pgbadger/blob/master/pgbadger
I worked on a C project with a function with 10kSLOC and 46 parameters. Then someone decided to add another parameter because at that point, I guess why not.
It’s only a problem when you get to 52 parameters and run out of letters. Then you have to add… long options.
There’s probably a happy middle ground in between one-file-per-function / object and 11k LOC mega disasters. For example, if a task really just takes 3k LOC, then let it be just that, no need to break it up artificially just to adhere to some dogmatic line limit.
That’s kind of the point of “The Life of a File”, a talk by Evan Czaplicki.
Even better is a ~8,000 line switch block in Xen’s x86 emulation: https://github.com/xen-project/xen/blob/master/xen/arch/x86/x86_emulate/x86_emulate.c#L3759-L11603
This sounds like pretty much every significant code base I have ever had to work on. It is not unusual.