1. 71

  2. 34

    This was not an objective test, this is just an approximation that I hope will encourage readers to be more aware of the consequences of their abstractions, and their exponential growth as more layers are added

    Eh, [citation needed] on the exponential growth bit. Lots of what’s being measured here is fixed start-up costs that are negligible in a Real Program. If you want to claim these costs accumulate exponentially (or even linearly), how about a plot of syscalls as you increase the number of statements printed?
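The plot asked for here can be reasoned about before measuring anything. A minimal sketch, assuming a hypothetical runtime with a fixed start-up cost and one `write` syscall per flush; `startup=70`, `line_bytes=12` and `buffer_size=4096` are illustrative assumptions, not measurements from the article:

```python
import math

def estimated_syscalls(n_prints, startup, line_bytes=12, buffer_size=None):
    """Estimate total syscalls for a program that prints n_prints lines.

    startup     -- fixed syscalls spent on runtime initialisation (assumed)
    line_bytes  -- bytes written per print statement (assumed)
    buffer_size -- None models unbuffered output (one write per print);
                   otherwise output is flushed once per full buffer
    """
    if buffer_size is None:
        writes = n_prints
    else:
        writes = math.ceil(n_prints * line_bytes / buffer_size)
    return startup + writes

# Compare unbuffered and buffered output as the number of prints grows.
for n in (1, 10, 100, 1000):
    print(n,
          estimated_syscalls(n, startup=70),
          estimated_syscalls(n, startup=70, buffer_size=4096))
```

Under this model the count grows linearly at worst and stays nearly flat with buffering; a measured plot would show whether a real runtime deviates from it.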

    1. 29

      The “Hello World” article is simplistic to the point of silliness. It’s essentially just chucking stuff like garbage collection in the “other crap” that makes stuff “hard to debug”-bin.

      I’m fairly sure the author understands these kinds of trade-offs because he frequently uses languages like Python and Go, which do “other crap”.

      Of course, writing an article that deals with these kinds of nuances is hard and time-consuming, whereas writing an article with simplistic benchmarks and silly generalisations is easy. 🤷‍♂️

      1. 7

        I see the original article as basically a tormented cry of longing towards a compiler & language that would be high level and helpful to the programmer in terms of safety (otherwise, the author would deem Asm sufficient and they wouldn’t see a point to even talk about other languages), while being able to do good on a no-wasteful-bloat scale (a.k.a. why pay for it if you don’t need it?). In those terms, I was absolutely thoroughly impressed with the results of Zig, which AFAIK is leaps and bounds more safe and high level than C and C++, while apparently running circles around them in terms of keeping tabs on the waste. A-bso-lu-te-ly A. Ma. Zing. Sure, it’s not perfect, and a compromise (manual memory mgmt), but in my book, definitely a language to put in my toolset… though for my usage plans, I will be waiting for the libraries ecosystem to get better (I know, I could contribute, but that’s a conscious choice and compromise. Though Drew’s article may make me have to rethink.)

        1. 7

          I don’t think it’s wrong to spark the discussion, though. And sometimes that spark is just very simple measurements that are hard to refute.

          In other words, the data cannot be used to make any operationally meaningful conclusions, so it is not the end of the discussion, but it is a start. And given the publicity and derivative works, it seems like it was a great start!

          1. 12

            Is that a productive discussion? Do Go authors not know they’re using GC? Do runtime authors not know they initialize the heap, I/O, etc. during startup? Should we discuss whether programs should do more than “Hello World”?

            Language design and implementation has a lot of trade-offs. You can latch on to any one of them and claim to demonstrate “hard to debug”, “complex”, and “spark a discussion”, but without any new insight that’s just shit-stirring and wasting time of people who have intentionally made these trade-offs.

            e.g. Drew’s 1-syscall “Hello World” is a nightmare to debug and epitome of complexity, because his code is 197× longer than in HQ9+, the ideal programming language according to my objective ranking.

            1. 3

              I don’t know, maybe you’re right. I just think that if it was truly inconsequential shit-stirring, people wouldn’t care. It wouldn’t have gotten the publicity and follow-up that it has.

              But on some level, people seem to think there’s some validity to the thesis, but that it’s counterbalanced with other arguments – these are the tradeoffs you mention. And the discussion of these has been insightful for the most part.

            2. 3

              If it were just the measurements, that would be okay; it’s the comments surrounding the measurements that are the problem.

            3. 4

              devault writing a bad article? color me surprised

            4. 23

              I find it mildly interesting that the author includes Perl, which does moderately well on this “benchmark”, and finds the time to include a “nudge nudge isn’t it funny” cheap shot:

              Passing /dev/urandom into perl is equally likely to print “hello world”

              but can’t be bothered to note the version of Perl used.

              1. 2

                Versions are further up in the text; “Perl 5.30.1”

                1. 3

                  If so, added after I read the post.

                    1. 3

                      Thanks! I should have guessed @ddevault has his blog posts under version control.

              2. 41

                Interestingly, I have a pattern of LEDs reading “Hello, world!” that requires none of the bloat of microcontrollers, modern integrated circuits, a full general purpose computer, or anything of the sort.

                Latency to “Hello, world!” is microseconds when you throw the switch, rather than the eons it takes for this guy’s machine to boot, compile, and execute. A warning for those of you who rely on abstractions like software.

                Any true optimizer would generate this machine and not that machine.

                1. 12

                  Why use a pattern of LEDs, which requires electricity? A true optimizer would create a painting and completely omit the need for a runtime environment at all.

                  1. 6

                    If we use Braille, we don’t even require the user to be able to see!

                  2. 1

                    But a general-purpose computer can not only render “Hello, world!” in a variety of fonts, but render an equivalent phrase in any written language (including languages with complex scripts), and also render it as synthetic speech or braille for people who can’t see it. I’ll take those capabilities, especially the ones related to internationalization and accessibility, over optimization every time.

                    More seriously, any developer who wants to develop super lean and mean applications, rather than use a bloated UI toolkit or framework, should keep these things in mind.

                    1. 4

                      Particularly with a11y, people forget how much you have to implement yourself to either roll your own a11y features or integrate into the OS frameworks. Much of this just reeks of “I speak only English, let alone a language not representable on the 7-bit ASCII plane, so it’s bloat”.

                      1. 2

                        To be clear, you’re agreeing with me about the importance of i18n and a11y, right? I’m well aware of how complicated a11y is; I work at Microsoft on the team that develops the Windows UI Automation API and the Narrator screen reader.

                        1. 3

                          Absolutely. I know how much the Windows API gives you for free (and how much you forsake when you reinvent the wheel for “minimalism”) - and it’s often stuff you were going to invent anyway.

                      2. 1

                        Do you have anything to support the idea that C or other ‘optimisation’ based languages can’t render non-ASCII text?

                        As I understand it, UTF-8 is universally supported (except in cases where you want to grab graphemes, which is pretty messy no matter what language you’re in as some of Unicode’s characters have ‘variable width’)
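The “variable width” point can be made concrete. A minimal sketch using only Python’s standard library; the helper name `text_widths` is made up for illustration, and NFC normalisation is used here only because it happens to merge this particular pair, not as a general grapheme segmenter:

```python
import unicodedata

def text_widths(s):
    """Return (utf8_bytes, code_points, code_points_after_nfc) for a string."""
    return (
        len(s.encode("utf-8")),
        len(s),
        len(unicodedata.normalize("NFC", s)),
    )

decomposed = "e\u0301"  # 'e' followed by COMBINING ACUTE ACCENT

print(text_widths("e"))         # plain ASCII: all three counts agree
print(text_widths(decomposed))  # one visible character, several bytes and code points
```

Emitting the bytes is the easy part in any language; deciding where one user-perceived character ends is where the counts stop agreeing.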

                    2. 10

                      Beat me to the punch in a way far better than I could.

                      I do disagree with this point (in the sense that it may be worded a bit strictly for the general case):

                      Computers are fast - a lot faster than we can possibly perceive. By some estimates human beings perceive two events as instantaneous at around 100ms. To Drew’s point, this is an eternity for a computer. In that time a modern CPU could execute 10 billion instructions.

                      One of the points I wanted to drive home was we really should care about these death-by-a-thousand-cuts sometimes incurred via these abstractions (imo necessary for security and productivity), when they often are the same cut repeated a few hundred times. Users are penalised for this when applied repeatedly (imagine a 100ms delay after pressing a button throughout the lifetime of a program). Of course if you read further along, for program start-up, this delay sure is often worth it.

                      In a sense we should be focusing a lot more on program optimisation for the sake of preserving energy. We should be designing our languages so that they’re conducive to optimisation (looking at you C, for the pointer alias semantics which arguably killed Itanium).

                      Ironically, one of the benchmarks that I found was that printing Hello World a thousand times uses half the syscalls in Electron (a pretty large hammer for the nail that is ensuring multiplatform UI consistency) as compared to C. :)

                      1. 2

                        for the pointer alias semantics which arguably killed Itanium

                        I agree with your general comment. I’ll note that Itanium was largely killed by a combination of no backward compatibility (highest importance) and being a pain for compiler developers. Mainly backward compatibility. Intel previously lost billions on i432 APX and i960 for similar reasons.

                      2. 10

                        Nice comparison, here’s a one liner to do it for several shells

                        $ for sh in dash bash zsh mksh; do echo $sh; strace -c -- $sh -c 'echo hi' 2>&1 | tail -n 1  ; done
                        dash
                        100.00    0.000000                    40         3 total
                        bash
                        100.00    0.000000                   131        12 total
                        zsh
                        100.00    0.000000                   915        18 total
                        mksh
                        100.00    0.000000                    61         3 total
                        

                        The third column is the number of syscalls (the fourth is the number of errors).

                        So zsh is in the range of Python (slow), bash is in the range of Rust, and dash is in the range of C.

                        (Oil doesn’t fare well because of embedding the Python interpreter; that’s one reason among many I’m working on translating it to native code.)
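The column layout described above can also be read programmatically. A minimal sketch, assuming the summary format shown in the transcript, where the usecs/call column is blank on the `total` line; other strace versions may print that column, in which case the field positions below would shift. The function name `parse_strace_total` is hypothetical:

```python
def parse_strace_total(line):
    """Parse the 'total' line of `strace -c` output into (calls, errors).

    Assumes the usecs/call column is blank on this line, so splitting on
    whitespace yields: % time, seconds, calls, [errors], 'total'.
    The errors field is absent when no call failed.
    """
    fields = line.split()
    if not fields or fields[-1] != "total":
        raise ValueError("not a strace -c total line")
    calls = int(fields[2])
    errors = int(fields[3]) if len(fields) == 5 else 0
    return calls, errors

# Lines copied from the transcript above.
samples = {
    "dash": "100.00    0.000000                    40         3 total",
    "bash": "100.00    0.000000                   131        12 total",
    "zsh":  "100.00    0.000000                   915        18 total",
    "mksh": "100.00    0.000000                    61         3 total",
}
for shell, line in samples.items():
    print(shell, parse_strace_total(line))
```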

                        1. 21

                          I appreciate your work and the data you collected. But I have to complain about the style.

                          Most languages do a whole lot of other crap other than printing out “hello world”, even if that’s all you asked for. a decent metric for how much shit is happening that you didn’t ask for

                          Please do not talk about the „additional“ (beyond those two) instructions as „crap“ or „shit“. Authors of the compilers and runtimes have not added them just for fun or to slow down your software. Their work makes sense and was requested by someone, and there are reasons for such „additional“ instructions. Nobody (in the real world) optimizes for „Hello world“ programs.

                          If you think that you can make something faster or less demanding of memory or disk, feel free to contribute to the given project.

                          P.S. I would also recommend milliseconds as a unit for such measurements.

                          1. 9

                            Here are the results for OCaml, which I think are pretty interesting!

                            $ cat hello.ml
                            let () = print_endline "hello world"
                            

                            With a dynamically linked glibc:

                            $ ocamlopt -o hello hello.ml
                            
                            • Resulting binary size (after strip): 279K
                            • Syscalls: 69, Unique syscalls: 17

                            With a dynamically linked musl:

                            $ ocamlopt -cc musl-gcc -o hello hello.ml
                            
                            • Resulting binary size (after strip): 219K
                            • Syscalls: 24, Unique syscalls: 13

                            With a statically linked musl:

                            $ ocamlopt -cc musl-gcc -ccopt -static -o hello hello.ml
                            
                            • Resulting binary size (after strip): 262K
                            • Syscalls: 23, Unique syscalls: 12

                            These results are pretty close to the numbers for C (!), so I guess that would put OCaml somewhere between C and Rust in the blogpost table…

                            1. 9

                              Out of interest, how does Rust with musl do? rustc --target x86_64-unknown-linux-musl -C opt-level=s will produce a binary with a statically linked libc. Additionally, it would be interesting to see it with -C lto enabled.

                              EDIT:

                              On macOS the binary size went down quite significantly:

                              [nix-shell:~]$ rustc -C opt-level=s test.rs 
                              
                              [nix-shell:~]$ ls -la test
                              -rwxr-xr-x 1 hauleth staff 233968 Jan  4 19:33 test
                              
                              [nix-shell:~]$ rustc -C lto -C opt-level=3 -C panic=abort test.rs 
                              
                              [nix-shell:~]$ ls -la test
                              -rwxr-xr-x 1 hauleth staff 189164 Jan  4 19:33 test
                              
                              [nix-shell:~]$ strip test; ls -la test
                              -rwxr-xr-x 1 hauleth staff 160508 Jan  4 19:34 test
                              

                              About 20% size reduction after LTO, opt-level 3, and aborting on panic; about 31% when we also strip. Both used 30 syscalls on macOS, though.
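The byte counts in the `ls` listings above can be turned into percentages directly. A small sketch; the sizes are copied from the transcript and the helper name `reduction_percent` is made up for illustration:

```python
def reduction_percent(before, after):
    """Percentage by which `after` is smaller than `before`."""
    return (before - after) / before * 100

baseline = 233968   # rustc -C opt-level=s
lto = 189164        # rustc -C lto -C opt-level=3 -C panic=abort
stripped = 160508   # the LTO build after strip

print(f"LTO build:   {reduction_percent(baseline, lto):.1f}% smaller")
print(f"LTO + strip: {reduction_percent(baseline, stripped):.1f}% smaller")
```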

                              1. 8

                                Nice, number of syscalls is a pretty interesting indirect metric for how much STUFF is going on behind the scenes, even for a short-lived program. I don’t see this as a performance metric, just complexity. I want to play with the Rust example sometime, and I want to see how Nim stacks up so I should try that too; anyone have any other suggestions for languages to try? Non-JIT implementations by preference, to keep apples to apples.

                                1. 6

                                  Note about the rustc command:

                                  rustc -C opt-levels=s test.rs

                                  should be:

                                  rustc -C opt-level=s test.rs

                                  1. 2

                                    Yeah, and I also wanted to compare debug vs release mode. Looks like someone beat me to it though.

                                    1. 2

                                      Yep that and the C version has a subsequent strip invocation but the Rust one doesn’t.

                                    2. 3

                                      number of syscalls is a pretty interesting indirect metric for how much STUFF is going on behind the scenes, even for a short-lived program

                                      Not really. Setup and teardown might take a while (relatively–even 100 syscalls is basically nothing), even if runtime is very lightweight and optimized.

                                    3. 7

                                      When an issue is a complex multi-dimensional trade-off, presenting a dry benchmark on one dimension is underhanded.

                                      Number of syscalls or constant overhead in an executable may be a sign of bloat, but it also may be:

                                      • the best choice for literally every program bigger than hello world (e.g. initializing a heap)
                                      • sensible choice for predictable performance (the startup prepares environment ahead of time, so that it doesn’t have to do it lazily later)
                                      • necessity for good security (e.g. setting up guard pages)
                                      • necessity for compatibility or to be a good citizen on a platform (e.g. dynamically linking with OS libraries)
                                      • price paid for richer standard library that improves productivity.
                                      1. 7

                                        Is it only me who’s impressed that Zig has fewer syscalls than C?

                                        1. 1

                                          I was already interested in checking out Zig at some point but this has sparked my curiosity anew.

                                        2. 7

                                          He mostly missed the point of why people hated his article, but this line kind of bugged me.

                                          So far as most end-users are concerned, computers haven’t improved in meaningful ways in the past 10 years, and in many respects have become worse.

                                          The huge leaps in graphics, web accessibility and design (in a way thank f for Web 3.0; no more Flash player or ActiveX or Silverlight!), processing speeds… hell, using GNU/Linux a decade ago could be a massive pain in the ass and today it’s simple as. Don’t get me started on the smartphone and the nearly unimaginable changes to the world they’ve introduced in the past decade. So much work done by so many brilliant people completely dismissed with some arbitrary complexity boogeyman.

                                          1. 1

                                            I think he was referring more to the actual hardware. I think his point is that “The hardware’s performance has not much improved while the needs of the software due to its layers upon layers of abstractions have increased significantly. Resulting ultimately in a loss of user perceptible responsiveness and performance.”

                                            1. 1

                                              This has been an observation for many years - Andy Grove keeps making faster computers, and Bill Gates keeps releasing software that slows them down.

                                              Developer costs are constant or rising, hardware costs per computing unit have been falling for a long time. Maybe in the future this won’t be the case - when that happens, the market will adapt.

                                          2. 7

                                            To me, the article basically boils down to an argument that “everything is a compromise, and I agree with Go’s current position of compromise and don’t see a need for Go to improve”. Whereas Drew’s one kinda boils down to “I know everything’s a compromise, but geez, [language authors] guys, you’re getting freaking lousy in your compromising; enough of your wink wink nudge nudge, I’m checking your cards and you’re sooooooo bluffing; maybe, you know, it’s time to get your act together, for the sake of all things saint?”.

                                            To add to this, in the particular case of Go, as much as I like it, and appreciate it, and still use it as a favourite weapon of choice in many, many cases, I do feel a best friend’s good-willing sadness in how the runtime & compilers are drifting away from some of the original ideals. Most ironically, as far as commenting on the article above, it mentions dynamic linking as a space saving measure, whereas in the early days of Go, it was exactly the opposite which was being claimed; I understand Drew’s complaint about the magic incantation for static linking in Go as aiming exactly at this kind of dynamic.

                                            1. 1

                                              Is the static linking problem for Linux only? With Windows I just do go build app.go and it’s totally static.

                                              1. 1

                                                On Windows, the OS interface is implemented not through syscalls in Linux sense, but through a set of standard DLLs with a backwards compatible API and ABI. So, in a way, on Windows the binaries are actually kinda totally dynamic, though the ABI is fixed so kinda totally static, so… :) the design of interfaces is very different, so I’m not sure if static/dynamic are a good distinction here, and Windows has generally very backwards compatible ABI. Whereas glibc IIUC does not, but then Linux syscalls do again I think (I believe Linus tries to be especially anal about that). So, on Linux, if you use just syscall ABI, you’re “static”, whereas if you use some glibc, you have no way but to do it dynamically. All this in a huge IIUC bracket.

                                                1. 1

                                                  I wonder if Musl would solve the issue?

                                                  1. 2

                                                    I believe the problem is, some distros provide a few features (e.g. some aspects of DNS resolving) only through dynamic linking with some particular distro-provided library. Such that Musl would either have to link with the same library anyway, or skip implementing those features. By “skip” I mean they could still reimplement them, but e.g. with different locations of files such as resolv.conf (because you can’t know where every distro decides to put them), making it behave unexpectedly on some distros, by missing some config files. AFAIU, this reimplementing is what Go does when you compile it with netgo build tag. Further, IIUC, Drew is not arguing that it was a bad choice to do it dynamically, especially given that effort was put to allow choosing static linking with reimplementation. What I believe he complains about, is that this alternative option gradually deteriorated, and people wanting or needing to do this choice now have to jump through increasingly many hoops, increasingly obscure and un-/poorly-documented, which paints a very different picture and message than in the earlier days of Go.

                                            2. 11

                                              Using cargo, Rust’s package manager, instead of calling the compiler directly produces a more optimized binary even without any special switches or flags. The Rust docs suggest using cargo instead of calling the compiler directly; this is the standard way to compile Rust programs.

                                              cargo build --release
                                              

                                              This produces a total of 95 syscalls, 20 of them unique.

                                              While not significantly better, it’s something.

                                              Used cargo 1.40.0.

                                              1. 13

                                                An interesting metric if you want to make a short-lived utility with low latency to output.

                                                Pretty useless if you are interested in the throughput or latency of a long-lived process.

                                                1. 3

                                                  If you’re interested in latency, the relationship between syscalls and latency is so scattered that you’re better off just measuring the latency. Otherwise, you might conclude that JIT compiled Java is as good for quick command line programs as Go.

                                                  1. 6

                                                    It also provides a datapoint on unnecessary complexity and bloat.

                                                    Pity Nim is not in the article (yet).

                                                    1. 8

                                                      “Unnecessary complexity and bloat.” in the context of a useless program.

                                                      1. 3

                                                        Isn’t it roughly equivalent to the quite non-useless true program?

                                                        1. 2

                                                          Oh god, people will be “beating” GNU true in all manners of programming languages next…

                                                        2. 1

                                                          Well, the more unnecessary syscalls, the more of a runtime there is. Those additional syscalls never go away.

                                                          1. 7

                                                            And?

                                                            If the runtime is long enough, these syscalls at startup become negligible.

                                                            Once again - these metrics are useful in one context and totally useless in others. The important thing is to know whether they are relevant to your situation.

                                                            1. 3

                                                              Not all of these syscalls are strictly startup-related, though.

                                                              Part of Rust’s overhead is from stdout locking, and that means additional syscalls every time you print, not just at startup.

                                                              1. 3

                                                                If Rust generates syscalls for an uncontested lock, that’s bananas. Every decent lock implementation uses an atomic instruction in userspace, and only falls back to the kernel when it finds the lock held by another thread. For example, pthread_mutex_lock in musl libc tries an atomic compare and swap before resorting to the syscall implementation.
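The fast-path/slow-path split described above can be illustrated with a toy model. This is only a sketch under loud assumptions: Python has no userspace compare-and-swap, so a non-blocking `threading.Lock.acquire` stands in for it, and a blocking acquire stands in for the kernel wait. The class `CountingLock` is hypothetical and says nothing about how Rust or musl are actually implemented:

```python
import threading

class CountingLock:
    """Toy lock that records how often each path is taken."""

    def __init__(self):
        self._lock = threading.Lock()
        self.fast_path = 0   # acquired without waiting
        self.slow_path = 0   # had to block; stands in for a kernel wait

    def acquire(self):
        # Fast path: a non-blocking attempt, standing in for the
        # userspace compare-and-swap (an assumption of this model).
        if self._lock.acquire(blocking=False):
            self.fast_path += 1
            return
        # Slow path: block until the current holder releases the lock.
        self._lock.acquire()
        self.slow_path += 1

    def release(self):
        self._lock.release()

# Single-threaded use never contends, so only the fast path is taken.
lock = CountingLock()
for _ in range(1000):
    lock.acquire()
    lock.release()
print(lock.fast_path, lock.slow_path)
```

In this model an uncontested lock never reaches the slow path, which is the behaviour the comment expects from a decent implementation.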

                                                              2. 1

                                                                I meant the language runtime. As we can see, the “slower” and more abstracted languages make more syscalls. The more control you have, the fewer syscalls are made.

                                                                1. 1

                                                                  This may be true but it can’t be tested the way the linked blogpost does.

                                                        3. 2

                                                          A commenter on HackerNews made a test with Nim: https://news.ycombinator.com/item?id=21957476

                                                          1. 2

                                                            A lot of people in this thread seem to be focusing on startup time and ignoring the point of the article, hinted at more by the amount of disk space used and, secondarily, the number of syscalls:

                                                            These numbers are real. This is more complexity that someone has to debug, more time your users are sitting there waiting for your program, less disk space available for files which actually matter to the user.

                                                            This was not an objective test, this is just an approximation that I hope will encourage readers to be more aware of the consequences of their abstractions, and their exponential growth as more layers are added.

                                                            Unnecessary complexity translates into cognitive load for those who want to understand what happens under the hood.

                                                            Especially when contributing to the compiler or porting it to a different architecture.

                                                            1. 17

                                                              I read the entire article and understood the point.

                                                              I don’t think the point is valid, that’s all - at least not when it comes to “real-world” software development.

                                                              For example, you can’t throw a pebble on this site without hitting a comment decrying C’s lack of memory safety. “But my users will thank me when they count the low number of syscalls my code is using!” isn’t much use when your program is crashing or their box is getting rooted because you messed up memory management.

                                                              Likewise, if your code is spending most of its time waiting for data to come down the wire, or for something to be fetched from a database, why optimize for syscall count?

                                                              1. -1

                                                                Umm, you do realize the database fetching the data is a program using syscalls, and the routers transmitting the data also use syscalls. If everything in the chain is slower, you will be waiting longer…

                                                                1. 5

                                                                  Any database system can be coded as lean and mean as possible, and still be brought down by someone mistyping a query and performing a full-table scan.

                                                                  A power outage can knock out a datacenter, forcing traffic to go via slower pipes. So users will be waiting longer, despite routers being lean and mean.

                                                                  More syscalls contribute to slower performance, but they’re generally dwarfed by other factors.

                                                            2. 1

                                                              I would not be surprised if the ‘useless’ results carry through proportionally to real programs.

                                                              1. 8

                                                                A month or two ago there was a spate of posts where people “beat” GNU wc using a plethora of languages. It would be interesting to see the results of a program that read a 1MB Unicode text file and reported number of lines, bytes, characters etc, and compare using this metric.
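The proposed benchmark program is small enough to sketch. A minimal version, assuming UTF-8 input and counting the way `wc -l`, `wc -c` and `wc -m` (in a UTF-8 locale) conventionally do; the function name `count_text` is made up for illustration, and this is not one of the programs from the posts mentioned:

```python
def count_text(data: bytes):
    """Return (lines, bytes, characters) for UTF-8 encoded data.

    Lines are counted as newline bytes; characters are Unicode code
    points obtained by decoding the data as UTF-8.
    """
    lines = data.count(b"\n")
    chars = len(data.decode("utf-8"))
    return lines, len(data), chars

sample = "héllo wörld\n".encode("utf-8")
print(count_text(sample))
```

Running each language’s equivalent under `strace -c` on the same input file would give the comparison asked for here.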

                                                            3. 6

                                                              But regardless of what our computer is doing, if it takes less than 100ms to execute our program, we simply won’t be able to tell the difference between executing 1 instruction or all 10 billion.

                                                              I’ll be able to tell the difference energy-wise. It may also interest you to know that what takes 100 ms on your machine under zero load can be frustratingly slow on other machines.

                                                              It’s because of statements like this that I frequently encounter 100 % CPU load for about a minute whenever I open ~10 Facebook pages in the background. (I do this instead of RSS. Also, 100 % means, in this case, that all 4 cores are fully utilized.)

                                                              I don’t have that problem with normal websites, of course.

                                                              I for one don’t support the idea that people should buy new computers and smartphones every year. I’m certainly not willing to make this a requirement (unless under very specific circumstances).

                                                              And don’t even try that “premature optimization” quote on me, because I’m sick of it. I have a reasonable grasp on tradeoffs between my time, machine time and code quality. But the way this statement is used today is incredibly retarded.

                                                              For example: I was working on a particular project that had to handle thousands of requests every minute and each request had to be handled as fast as possible (it was basically a proxy). I knew a specific code branch would be executed repeatedly and a lot, so I searched for some information regarding efficiency of something I intended to do. And there comes the StackOverflow with the exact same question as I had and retarded answers themed “premature optimization” and “always profile first!”.

                                                              No, fuck you. I’m experienced enough to tell what’s completely irrelevant in the overall picture and what could be the real bottleneck. Before I spend a few hours writing code, only to measure and fully rewrite it as soon as I find out it’s intolerably inefficient, I’d like to take a look and spend a few minutes deciding what approach to use. It’s the same as with choosing algorithms or data structures. Is that premature optimization too?!

                                                              1. 2

                                                                In principle I hear what you’re saying. But surely this example isn’t making that case. The Go version of Hello World does not have performance problems. Optimizing it is absolutely an example of “premature optimization”.

                                                                1. 2

                                                                  Yeah, I meant to criticize the argumentation, not the conclusion. I don’t know about Go in particular, but I also use high-level languages such as Java or Python and there’s also a lot of overhead (compared to well-optimized native code). Instead of implicitly considering that overhead negligible and irrelevant, I tend to view it as a disadvantage. With this particular project I talked about, for example, I briefly considered using another language. In the end, the advantages of using Java were so overwhelming that I just decided to go with that and use the extra time it’d save me to properly tune the garbage collector etc. Engineering is always about cost-benefit analysis.

                                                                  As it is, your argumentation IMHO sets a bad example. If the objective was to actually write a Hello World program (or something very similar, such as a subset of echo), there’s virtually no reason to include an interpreter, garbage collector or debugging information with the final binary file unless you’re targeting an OS which is entirely written as a LISP machine (or something like that). Even an inexperienced programmer would be (or should be) able to write it in a more suitable language. It’s understandable if he doesn’t, because it’s a little bit faster to use a language he’s more comfortable with, but the overhead is not negligible and irrelevant per se. If it was executed a billion times a day on computers around the world, it would make some difference. Don’t get me wrong: It’s a very silly microoptimization to force someone to redo his work, because it would save one brick of coal from being burned after who knows how many runs. I’m just saying that if someone is able to go and write the exact same (dead simple) program in C instead of Python and it takes him the same amount of time, it’s likely a better choice. If not (because your build, CI, deployment etc. infrastructure is only designed for a particular language and you’d have to change everything), then the cost-benefit comparison would be different, obviously. Most of the time, you also have to consider the popularity of a language, its learning difficulty and other things. Then again, most of the time we’re not talking about Hello World programs.

                                                                  tl;dr: I’d base my argument on the fact that Hello World is not a very typical (serious) use-case, not try to justify its inefficiency.

                                                                  1. 2

                                                                    I think I disagree with what you’re getting at here. Unless the performance difference is significant (and it might be for python) the programmer should use the language they’re most likely to be successful in.

                                                                    I’m not a good C programmer. I will likely make mistakes. And mistakes in C easily can lead to massive security vulnerabilities. So the bar for using it is high.

                                                                    1. 1

                                                                      That’s just because we talk about a misleading example from the very beginning. I’m not a very good C programmer either, but I’m confident I’d be able to write a Hello World (or simplified echo etc.) correctly and in the same or nearly the same amount of time it would take me in other languages.

                                                                      Now it may sound like I’m just quibbling. Let me give you a more realistic example (it doesn’t seem to work in Firefox or Chrome, but is well-readable using links). It’s written in Czech, but there’s not much to translate: the author wasn’t satisfied with the setleds and pwd commands and thus implemented them in assembly. This man is crazy: He’s been well-known for his incredible accomplishments in Turbo Pascal under MS-DOS back when he was younger. His code was always at least half-filled with inline assembly. He implemented a graphical shell (similar to what Windows 3.x was) with multitasking. To this day, I have some of his work archived and appreciate it as art.

                                                                      Is it actually necessary to write setleds or pwd in assembly? Absolutely not.

                                                                      Is it easy to gain trust in these programs by reviewing them? If you don’t know assembly or the particular syscalls, then certainly not. On the other hand, that’s the case with almost any language you don’t know exceptionally well (so well that you wouldn’t fall victim to any tricks that prevent you from seeing what the code actually does). And in this case, you only have to trust a few lines of code and an assembler – the simplest compiler in existence. And you can verify the assembler’s output by using an independent disassembler.

                                                                      Is it realistically possible that you would need to debug these programs? Not really. At worst, you’d say: “Hey, it doesn’t work!”. I doubt you could make use of some debugging information attached to the binary.

                                                                      Is it possible you would need to find a different maintainer for these programs? No. If it stopped working due to some ABI change, for example (oh, hello, Linus, don’t be mad, I’m just thinking, not saying you would allow it), you could just replace the entire program altogether. There’s not much to do otherwise.

                                                                      –==[FReeZ]==–’s implementation of setleds is actually not compatible with the one I have from the kbd package. But the differences are not that big, so let’s suppose it actually was a 1-to-1 drop-in replacement and I was presented with two versions: the original C version, and the microoptimized assembly version. I didn’t have to pay any price, because it has been donated to me under a permissive license. Which implementation is more valuable for me?

                                                                      The assembly version. It’s significantly smaller. Faster to load, faster to run.

                                                                      The benefit is so negligible that even putting my time into copy-pasting the code from his website, compiling it and putting it in $PATH is not worth it, not to mention it would require my attention should I want to reinstall the system. On my part, it would be a silly microoptimization.

                                                                      But suppose the original implementation was written in assembly. There are people who write simple programs such as these as naturally in assembly as they do in C. Would it be better?

                                                                      Basically the only valid argument against it would be on the grounds of readability. As I already said: That can be the case for any language. But more importantly: The original “Hello world” article didn’t promote the use of assembly. It simply criticized compilers that produce unnecessarily large binaries. For all we care, the C code in the article discussed above could produce a 255-byte binary – the same size as the manually written assembly equivalent (compiled using Flat Assembler).

                                                                      If gcc could optimize small programs this rapidly and your setleds wasn’t 14 632 bytes long, but 10x less than that, would you complain? Would you consider that binary worse?

                                                                      Let’s not think about a single small program, but about all the little programs in the entire system. A few milliseconds here, a few milliseconds there. Sum it all up and all of a sudden, it could boot noticeably faster and be more pleasant to use. Or save a brick of coal from being burned.

                                                                      Is tuning a compiler to produce better binaries even a microoptimization, considering how many programs it affects? I bet compiling my entire system with somewhat different optimization flags would be easily noticeable in double-blind tests.

                                                                      Again: I’m not suggesting people should write in C or even assembly instead of high-level languages. The most important thing is to get the job done, write clean and manageable code (where any “manageability” is the real thing) etc. There are trade-offs. What I criticize is the tendency to pretend that trade-offs do not exist and a 100 ms delay is irrelevant, because the user won’t notice. It may be (and quite often is) justifiable considering the overall picture, but it’s never irrelevant and it’s dangerous to start thinking like that. I prefer to think of it as a price. If there’s no real reason to pay the price (i.e. it doesn’t get you anything back), then it’s a bad deal. This depends on the circumstances.

                                                              2. 6

                                                                I wish this were a github repo or such. I would love to contribute a bit more. Would be interested in seeing [idiomatic modern] C++.

                                                                Would be also cool to see the assembly output (where possible) of each.

                                                                Here’s Rust’s:

                                                                https://godbolt.org/z/uqFRP7

                                                                Looks like O3 does slightly better than Os

                                                                1. 3

                                                                  Godbolt shows far fewer syscalls than both u/ddevault’s and u/soptik’s results. I don’t understand what’s going on here; any insight, anyone?

                                                                  1. 2

                                                                    How do you figure? There’s basically just a call to std::io::stdio::_print, but it probably does a bunch of syscalls inside there.

                                                                    If there is a difference it’s likely because godbolt is compiling it as a library rather than a binary

                                                                    1. 2

                                                                      Doh, I misunderstood the output

                                                                  2. 2

                                                                    I would be curious about Java with Graal…

                                                                  3. 5

                                                                    How is he counting syscalls? strace, or something else? I ask because I want to do a few more tests myself.

                                                                    1. 8

                                                                      strace, yes.

                                                                      1. 2

                                                                        Is this the real Drew? Or was your username something like sircmpwm…?

                                                                        1. 4

                                                                          Yes, I am Drew DeVault.

                                                                    2. 7

                                                                      One note regarding Chesterton’s fence: it’s a useful heuristic for dealing with unknown unknowns that can’t be investigated, but as a programmer (especially one writing something new from scratch), it’s your responsibility to understand every layer below you in enough detail to make reasonable decisions. Chesterton’s fence definitely applies to dealing with black boxes in large existing applications – particularly if you’re under deadline – but when deciding whether or not to introduce a dependency (even if that dependency is essentially built into the language), you need to know that dependency’s cost. (This generally means being confident in your ability to roll your own version of any third party code you include – otherwise you will have a warped idea of how that code behaves in a variety of situations.)

                                                                      The thing about programming languages is that general purpose ones don’t really exist. There are two kinds of languages: languages that are exceptionally well-suited to certain applications, and languages that are roughly equally bad at everything. When it comes to determining if overhead is worthwhile, the answer depends on whether the particular application is easier/better with the overhead or without it.

                                                                      ‘Hello world’ is not merely a mediocre test case but an exceptionally bad one. It’s trivial to implement in nearly every language (exceptions being esolangs like Malbolge and brainfuck) so there’s no sensible case to be made for it being substantially easier to implement in one language or another (though it occasionally acts as a reductio ad absurdum for boilerplate-heavy languages like Java); it’s trivial to run in nearly every language too (Malbolge being the exception again) so there’s little point in comparing performance; there’s no practical application for it, so there’s no grounds to determine what level of performance is appropriate (OP’s suggestion that we should use human attention as the benchmark doesn’t make much sense because nobody’s running something to print a static string in a way that requires human attention – the closest actually-useful program in terms of functionality is the unix tool ‘yes’, which because it’s almost always piped into other tools & is the bottleneck for simple pipelines, is pretty highly optimized).

                                                                      I’m sympathetic to complaints about languages having too much bloat, producing bloated applications. After all, I’m a cheapskate when it comes to computer hardware & that has always led me to situations wherein I notice unnecessarily bloated applications more so than my peers. But no language forces people to write code as bloated as your average Electron app (not even JavaScript, and not even DOM-manipulating JavaScript), & even if it did, choice of language is a design choice (you can make the argument that JavaScript is simply not an acceptable choice for interactive graphical applications). Language matters a lot (since it defines the boundaries of what we, as developers, can easily imagine implementing & therefore shapes implementation in subtle and pervasive ways), but it’s not generally the bottleneck. You can write Prolog code that’s more performant than most professional applications are, if you’re sufficiently motivated.

                                                                      1. 10

                                                                        Optimising for “hello world”, a program nobody except programmers ever runs, is definitely missing the wood for the trees.

                                                                        1. 4

                                                                          Interesting. I would say that today a proper hello world should at least ensure it outputs unicode/utf8 - and maybe do l10n.

                                                                          I wonder why c99 w/puts is so bloated?

                                                                          1. 7

                                                                            So

                                                                            print "👋🌍\n"
                                                                            

                                                                            ?

                                                                            1. 3

                                                                              For example, although I’d go with something more along the lines of regular utf8 multibyte, like Japanese or Norwegian (partly because they are languages I know).

                                                                              One probably really should do a right-to-left language too, for good measure.

                                                                              And as mentioned, basic localization (l10n) might be useful too - unfortunately that does impose a bit more - finding a way to encode and look up translations.

                                                                              1. 5

                                                                                Outputting utf8 is the same thing as outputting bytes. “👋🌍\n” is a valid string literal in C and you can print it to stdout. Applications and especially libraries rarely need to decode utf8.
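
                                                                                The same point as a small sketch, in Python rather than C purely for illustration: once the string is encoded, writing it is just writing 9 bytes. The helper name `emit` is made up for this example.

```python
import io

# The greeting, encoded once: 4 bytes + 4 bytes + 1 byte for the newline.
GREETING = "👋🌍\n".encode("utf-8")


def emit(stream) -> int:
    """Write the pre-encoded greeting to a binary stream; return bytes written."""
    return stream.write(GREETING)
```

                                                                                Calling `emit(sys.stdout.buffer)` sends those bytes to stdout untouched; whether they show up as emoji depends on the terminal and its fonts, not on the program.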

                                                                                1. 1

                                                                                  Sure, nothing wrong with using emoji. I tend to prefer Norwegian mostly because I know I have proper font support everywhere for multibyte sequences.

                                                                                  On another note, for many, many applications dealing with text, you probably want to do things like sort, which implies decoding in some form. And maybe case-insensitive sort and/or comparison.
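
                                                                                  A minimal sketch of what I mean, in Python and with made-up sample words: sorting the raw UTF-8 bytes needs no decoding, but a case-insensitive sort does. Note that `str.casefold` only gives a simple caseless ordering by code point; it is not locale-aware collation.

```python
# Sample input as raw UTF-8 bytes, the way they might arrive from a file or socket.
WORDS = [w.encode("utf-8") for w in ("zebra", "Åse", "apple", "Zoo")]


def bytewise_sort(words):
    """Sort by raw byte values; no decoding needed."""
    return sorted(words)


def caseless_sort(words):
    """Sort ignoring case; each item has to be decoded to text first."""
    return sorted(words, key=lambda b: b.decode("utf-8").casefold())
```

                                                                                  With the sample above, the bytewise sort puts “Zoo” before “apple” because uppercase ASCII sorts first, while the caseless sort gives apple, zebra, Zoo, Åse.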

                                                                                  For web servers and similar applications, you want to be aware of encodings, even if all you do is crash and burn on non-utf8 data.

                                                                                  Outputting “bytes” to a “text” stream can be a little unreliable as well.

                                                                          2. 4

                                                                            Not sure if this is the most useful metric, but I gave it a try for Inko:

                                                                            $ cat hello.inko
                                                                            import std::stdio::stdout
                                                                            
                                                                            stdout.print('Hello, world!')
                                                                            

                                                                            To not measure the overhead of the compiler, I compiled the bytecode first then ran the VM as follows:

                                                                            $ strace -c -- vm/target/release/ivm -I \
                                                                              /home/yorickpeterse/.cache/inko/bytecode/release \
                                                                              /home/yorickpeterse/.cache/inko/bytecode/release/e8/8772ad6c437d3772b03418ac820625197d4a200.inkoc
                                                                            

                                                                            This produces the following output:

                                                                            Hello, world!
                                                                            % time     seconds  usecs/call     calls    errors syscall
                                                                            ------ ----------- ----------- --------- --------- ----------------
                                                                             21,88    0,002214         105        21           clone
                                                                             16,86    0,001706          32        52           mmap
                                                                             13,56    0,001372           5       252         1 statx
                                                                             11,25    0,001138          37        30           mprotect
                                                                              8,88    0,000898          13        68           read
                                                                              8,74    0,000884         147         6         1 futex
                                                                              6,10    0,000617          11        55           openat
                                                                              4,10    0,000415           7        55           close
                                                                              1,94    0,000196          10        19           brk
                                                                              1,28    0,000129          43         3           munmap
                                                                              1,06    0,000107          13         8           fstat
                                                                              1,00    0,000101          12         8           lseek
                                                                              0,71    0,000072          14         5           rt_sigaction
                                                                              0,43    0,000044          14         3           sigaltstack
                                                                              0,39    0,000039          19         2           getrandom
                                                                              0,32    0,000032          16         2           prlimit64
                                                                              0,30    0,000030          15         2           sched_getaffinity
                                                                              0,27    0,000027          27         1         1 access
                                                                              0,25    0,000025          25         1           epoll_create
                                                                              0,16    0,000016          16         1           rt_sigprocmask
                                                                              0,15    0,000015          15         1           fcntl
                                                                              0,14    0,000014          14         1           set_tid_address
                                                                              0,13    0,000013           6         2         1 arch_prctl
                                                                              0,13    0,000013          13         1           set_robust_list
                                                                              0,00    0,000000           0         1           execve
                                                                            ------ ----------- ----------- --------- --------- ----------------
                                                                            100.00    0,010117                   600         4 total
                                                                            

                                                                            So that’s a total of 600 system calls in 10 milliseconds, which isn’t too bad considering not much of Inko is optimised at this point. Measuring the size of the program is more difficult. The bytecode file for the hello.inko file is only 2.13 KB, but if we include all other bytecode files compiled (= parts of the standard library used) the total size is around 434 KB.

                                                                            1. 1

                                                                              Where are the calls to write() and _exit()? You know, the two actual system calls required?
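
                                                                              To spell out what I mean by those two calls, here is a sketch using Python’s thin wrappers around them. The function names are made up, and obviously the Python interpreter itself makes many more syscalls while starting up, so this only illustrates the two calls, not a minimal program.

```python
import os

MESSAGE = b"Hello, world!\n"


def say_hello(fd: int = 1) -> int:
    """Issue a single write on the given file descriptor; return bytes written."""
    return os.write(fd, MESSAGE)


def main() -> None:
    say_hello()   # write to stdout (fd 1)
    os._exit(0)   # exit immediately, skipping interpreter cleanup
```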

                                                                              1. 2

                                                                                Unsure, perhaps I ran strace with the wrong arguments. It would be nice if the blog post mentioned what arguments exactly were used for strace.

                                                                            2. 4

                                                                              Even if a “Hello World” benchmark exposes complexity and layers of abstractions, it falls short:

                                                                              • it doesn’t distinguish between inherent complexity and accidental complexity,
                                                                              • it looks at only one side of a multi-dimensional trade-off.

                                                                              For example, a JS VM lets you securely run an arbitrary program, on almost any computer, with pretty decent performance, and JS programs are relatively easy to write. These are real achievements that required a lot of engineering, and there’s a lot of inherent complexity in solving all these problems at once. You don’t need all this complexity and abstractions for a “Hello World” program, but that’s not the kind of program that JS VMs were created for.

                                                                              Rust and Go compilers could detect that a program doesn’t need threads and growable stacks and trim the runtime, but they don’t, because it’s a trade-off: it adds complexity to the compiler, it adds to compile times, and it only marginally improves some trivial programs. Resources are finite, and there are plenty of other things that compilers can do that improve more programs in more significant ways.

                                                                              1. 5

                                                                                Previous lobsters discussion on the original Hello World article, for those who missed it.

                                                                                1. 3

                                                                                  I think Drew’s main point was what you get by default… and how hard it is to get the ideal. How come there can’t be a flag (-Os or whatever) in Rust or Go that gets the binaries down to their ideal size? I think that’s the point he was more or less trying to make.

                                                                                  Great response nonetheless.

                                                                                  1. 3

                                                                                    Tried Nim 1.0.4 (on Linux amd64), with strace call borrowed from andrewk’s comment (also wish it was included in @ddevault’s article, would make runaway community comparisons easier):

                                                                                    $ cat hello_e.nim
                                                                                    echo "hello world\n"
                                                                                    $ strace -c -- ./hello_e 
                                                                                    hello world
                                                                                    
                                                                                    % time     seconds  usecs/call     calls    errors syscall
                                                                                    ------ ----------- ----------- --------- --------- ----------------
                                                                                     24.60    0.000046          23         2           write
                                                                                     20.32    0.000038           6         6           rt_sigaction
                                                                                     18.72    0.000035           9         4           mprotect
                                                                                     13.90    0.000026          26         1           munmap
                                                                                      9.63    0.000018           2         9           mmap
                                                                                      8.56    0.000016           5         3           brk
                                                                                      4.28    0.000008           3         3           fstat
                                                                                      0.00    0.000000           0         1           read
                                                                                      0.00    0.000000           0         2           open
                                                                                      0.00    0.000000           0         2           close
                                                                                      0.00    0.000000           0         3         3 access
                                                                                      0.00    0.000000           0         1           execve
                                                                                      0.00    0.000000           0         1           arch_prctl
                                                                                    ------ ----------- ----------- --------- --------- ----------------
                                                                                    100.00    0.000187                    38         3 total
                                                                                    

                                                                                    I tried adding a quit 0 to force it to emit an exit syscall, but still don’t see one (or is it the brk one? don’t know syscalls, sorry!). With the quit the number was also 38. Counting “unique syscalls” by hand gives me 13 IICC (If I Counted Correctly). Didn’t see any difference in terms of syscalls when I tried adding -d:danger or --opt:size (though I’m not sure if the names of syscalls used did not change; but both the total stayed at 38 and the uniques seemed to stay at 13).

                                                                                    As to size, with default debug build I got 92KiB, with -d:release 83KiB, with -d:danger or --opt:size 66KiB (as reported by ls -lh), edit: with -d:danger --opt:size I got to 33KiB. The name is hello_e.nim because I also tried a hello_w.nim with stdout.writeLine "hello world\n" — the syscalls were the same, but sizes were (to my surprise) slightly worse (maybe echo uses some shortcuts? no idea). edit 3: IIUC, for the size comparison to be correct, I should add glibc size, which is about 1.8MiB on my system, plus 159KiB for ld-2.23.so, I suppose? That puts it more or less where other dynamic glibc cases are.

                                                                                    IIUC, as far as syscalls go, this makes it worse than Zig, C static, and C musl dynamic, but better than C glibc dynamic or Rust (!). As to size, it’s not clear to me if Drew added the size of libc .so into the total (because it’s the runtime library) or not; if yes, no idea how to do that correctly. edit 2: With --passL:-static, i.e. statically linked with glibc, and still with -d:danger --opt:size, the size balloons to 912KiB. No idea how to quickly test with musl (do I even have it installed?), and I don’t care enough to put in the effort required to find out how to do this (with or without Nix).
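
                                                                                    The hand count above can be double-checked with a short script. This is only a sketch in Python: the function name summarize_strace is made up for illustration, and it assumes the column layout of the strace -c table pasted above (other strace versions may format the table differently).

```python
# Summary table copied from the Nim run above.
NIM_SUMMARY = """\
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 24.60    0.000046          23         2           write
 20.32    0.000038           6         6           rt_sigaction
 18.72    0.000035           9         4           mprotect
 13.90    0.000026          26         1           munmap
  9.63    0.000018           2         9           mmap
  8.56    0.000016           5         3           brk
  4.28    0.000008           3         3           fstat
  0.00    0.000000           0         1           read
  0.00    0.000000           0         2           open
  0.00    0.000000           0         2           close
  0.00    0.000000           0         3         3 access
  0.00    0.000000           0         1           execve
  0.00    0.000000           0         1           arch_prctl
------ ----------- ----------- --------- --------- ----------------
100.00    0.000187                    38         3 total
"""


def summarize_strace(summary):
    """Return (total calls, number of distinct syscalls) from `strace -c` output."""
    calls_by_name = {}
    for line in summary.splitlines():
        fields = line.split()
        # Data rows: % time, seconds, usecs/call, calls, [errors], syscall name.
        if len(fields) not in (5, 6) or fields[-1] == "total":
            continue
        try:
            float(fields[0])
        except ValueError:
            continue  # dashed separator line
        calls_by_name[fields[-1]] = int(fields[3])
    return sum(calls_by_name.values()), len(calls_by_name)


print(summarize_strace(NIM_SUMMARY))  # prints (38, 13)
```

                                                                                    For this table it agrees with the numbers counted by hand: 38 calls, 13 distinct syscalls.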

                                                                                    1. 3

                                                                                      Thanks for this.

                                                                                      That post was a pile of fuming garbage; I guess it did at least one thing right as it prompted this article to be made.

                                                                                      1. 3

                                                                                        The comparison is a bit biased in favor of compiled languages because interpreters are always passed a file containing the source code. Passing the code as a command-line argument, for instance, reduces the number of syscalls.

                                                                                        For instance, on my machine, Lua (5.3.5) uses 80 syscalls if I pass the code as a command-line argument and 85 if I pass it as a file. Python (3.8.1) uses 944 and 969 respectively, Perl (5.30.1) 227 and 229.

                                                                                        Note that Lua still always uses fewer syscalls than the Rust example (which uses 95 on my machine with 1.40.0)…
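
                                                                                        A sketch of the two invocations being compared, written in Python so the command lines can be generated for each interpreter. The helper name strace_variants, the script file name, and the executable names (lua, python3, perl) are assumptions for illustration; the strace -c -- form is the one used elsewhere in this thread. The counts quoted above were not necessarily gathered this way.

```python
import shlex

# Flag each interpreter uses to take a program from the command line.
INLINE_FLAG = {"lua": "-e", "python3": "-c", "perl": "-e"}


def strace_variants(interpreter, code, script_path):
    """Build the strace command lines for inline code and for a script file."""
    prefix = ["strace", "-c", "--"]
    inline = prefix + [interpreter, INLINE_FLAG[interpreter], code]
    from_file = prefix + [interpreter, script_path]
    return inline, from_file


inline, from_file = strace_variants("lua", 'print("hello world")', "hello.lua")
print(shlex.join(inline))     # code passed as a command-line argument
print(shlex.join(from_file))  # code read from a file
```

                                                                                        Running both printed commands in a shell and comparing the "total" rows shows the difference between the two modes on a given machine.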

                                                                                        1. 3

                                                                                          Still lacking: actual steps done to get the values for syscall counts so we can fix/reproduce his numbers. If you’re going to complain about a metric, let people at least know how you gathered it so we can maybe do something about it.

                                                                                          1. 3

                                                                                            I think I see where Drew is coming from here, and I admire what he’s trying to do. It’s just that, as he himself admits, the way he presented the data and the inflammatory commentary around it are unfortunate.

                                                                                            I’m old. I started out on a 6502 based machine with 16K RAM. I am very keenly aware of the fact that Python, my language of choice, could easily be seen by someone who’s used to working in lovingly hand crafted C as ludicrously bloated.

                                                                                            And I do care about what’s happening at the syscall level, but ultimately I make the pragmatic choice that the problems I’m trying to solve make thinking about things at the level Drew is working at a waste of my time.

                                                                                            If nothing else, I think this could be a good case study in how good scholarship can be obfuscated past the point of utility by incautious language choices.

                                                                                            And while we’re talking about it, can we all agree that “bloated” is just the most flame-bait-y word in the English language? :)

                                                                                            1. 3

                                                                                              As somebody who avoids premature optimization, I believe the more important metric here is lines of code required to write a simple “Hello world” program. In this case, Crystal, JavaScript, Julia, Perl, and Python are the clear winners. Accounting for both lines of code and syscall count, Crystal seems incredible.

                                                                                              As an aside, I wonder how AOT-compiled Julia fares. A lot of those syscalls are probably from the JIT compiler doing its thing, and cutting it out would probably move Julia much further up the list.

                                                                                              1. 2
                                                                                                # dynamic
                                                                                                go build -o test test.go
                                                                                                

                                                                                                I thought Go builds statically linked binaries by default.

                                                                                                1. 3

                                                                                                  It’s kinda a “compromise” now, AFAIU. The tipping point IIUC was when they wanted to support some DNS(?)-related stuff and found that the only reasonable and standard API on Linux for that is through distro-provided dynamic libraries. The keyword here is “netgo” (it’s a build tag for switching to a pure-Go rewrite of most of the logic). From there, it’s kinda slowly deteriorated further.

                                                                                                2. 2

                                                                                                  minimal haircut for PicoLisp startup - http://ix.io/26uz

                                                                                                  1. 2
                                                                                                    # static w/o cgo
                                                                                                    GOOS=linux GOARCH=amd64 \
                                                                                                    CGO_ENABLED=0 \
                                                                                                    go build -o test -ldflags '-extldflags "-f no-PIC -static"' -buildmode pie -tags 'osusergo netgo static_build' test.go
                                                                                                    

                                                                                                    Aside: it is getting way too goddamn difficult to build static Go binaries.

                                                                                                    I’m pretty sure -extldflags is ignored when cross-compiling and/or when CGO_ENABLED=0, although -buildmode pie implies a dynamically linked executable unless you use an external linker. Since go build runs on a linux/amd64 host, Go probably assumed that it could use the external (host) linker.

                                                                                                    Unless you need a statically linked, position-independent executable, Go can compile programs using its own toolchain.

                                                                                                    GOOS=linux GOARCH=amd64 \
                                                                                                    CGO_ENABLED=0 \
                                                                                                    GO_EXTLINK_ENABLED=0 \
                                                                                                    go build -o test test.go
                                                                                                    

                                                                                                    Note: GO_EXTLINK_ENABLED=0 forces internal link mode.

                                                                                                    You can also build a dynamically linked position-independent executable without the Cgo runtime.

                                                                                                    GOOS=linux GOARCH=amd64 \
                                                                                                    CGO_ENABLED=0 \
                                                                                                    GO_EXTLINK_ENABLED=0 \
                                                                                                    go build -buildmode=pie -o test test.go
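
                                                                                                    One way to check what a given build actually produced is to look for a PT_INTERP program header, which names the dynamic loader. Below is a minimal sketch in Python, assuming a 64-bit little-endian ELF file; the function name has_interpreter is made up for illustration, and the field offsets follow the ELF64 header layout.

```python
import struct

PT_INTERP = 3  # program header type that names the dynamic loader


def has_interpreter(elf_bytes):
    """Return True if a 64-bit little-endian ELF image has a PT_INTERP segment."""
    if elf_bytes[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    if elf_bytes[4] != 2 or elf_bytes[5] != 1:
        raise ValueError("this sketch only handles 64-bit little-endian ELF")
    # ELF64 header: e_phoff at offset 32, e_phentsize at 54, e_phnum at 56.
    (e_phoff,) = struct.unpack_from("<Q", elf_bytes, 32)
    e_phentsize, e_phnum = struct.unpack_from("<HH", elf_bytes, 54)
    for i in range(e_phnum):
        # p_type is the first 32-bit field of each program header.
        (p_type,) = struct.unpack_from("<I", elf_bytes, e_phoff + i * e_phentsize)
        if p_type == PT_INTERP:
            return True
    return False
```

                                                                                                    Calling has_interpreter(open("test", "rb").read()) on the output of each command above should show which builds request a dynamic loader. Note that a static PIE also lacks PT_INTERP, so a False result only says that no loader is requested, not which link mode was used.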
                                                                                                    
                                                                                                    1. 2

                                                                                                      I think this misses the point of the previous article. I don’t think that the previous article is shaming languages for providing more features or anything like that. It highlighted the fact that a bunch of languages in their compiled form do additional things that are unnecessary for what the code does. This is where the author misses the point. E.g., when talking about stack traces, the author seems to not know that the Zig (safe) build does have stack traces using DWARF while only making 3 calls and weighing 11KB. Doing the “let’s include repeating in the program” section misses the point again. This wasn’t any kind of a performance benchmark. It’s more of a compiler optimization benchmark. Then we go to excusing Go. Yeah, sure, we want fast builds. But do we really need to have all of that stuff that is entirely unnecessary for the execution of the given code? So why have it? Do we use reflection in our code? Do we use multi-threading? Do we create any objects that would require error handling? Does the task benefit from knowing whether the streams block? Should this program care about signals? Do we need to know the executable path? No. No we don’t, and this is additional code that complicates debugging. Now some might say it doesn’t complicate it, but if you need to debug a performance bug on startup, all of this different stuff just gets in the way, while not being necessary.

                                                                                                      1. 2

                                                                                                        Fair enough. I will grant that most existing high level languages aren’t particularly good at optimizing programs which merely print “hello world” to the screen.

                                                                                                        It’s a very silly thing to optimize for, with no connection to the challenges of real-world software development, but there you go.

                                                                                                        My “let’s include repeating in the program” section was there to illustrate why “slow” startup could be a problem in theory, and then to illustrate why merely eliminating code bloat isn’t always the best avenue for improving performance. (Also, “slow” here is not an Electron app. We’re talking about a few milliseconds. It’s completely imperceptible.)

                                                                                                        In my experience debugging a high level language like Go is much, much easier than debugging raw assembly. But then again I’ve only ever had to get that low-level maybe once or twice in my career. (Whereas I have to fix bugs in Go/Python/C#/Java code all the time)

                                                                                                        And I’ll take a fast build that makes a 5MB binary over a slow build that makes an 11KB binary. 5MB means nothing on my laptop, where I do most of my primary development. But a 1 second build vs a 3 minute build means a whole lot.

                                                                                                        Also FWIW I used Go because it’s what I knew. But the same reasoning applies to any of the languages. I bet if you dig into what Rust or Java is doing, there are also abundantly good reasons for the syscalls there.

                                                                                                        1. 2

                                                                                                          Numbers mean something. 5 MB doesn’t matter in terms of a hard drive, but it matters. That’s at least 2 orders of magnitude over what the size could be. Comparing any binary against a hard drive is such a cop-out; that comparison hasn’t been meaningful in 20 years.

                                                                                                          Instead of hand waving it away, it would be better if you could explain what is being gained out of those extra 4 MB.
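
                                                                                                          For reference, the arithmetic behind that magnitude claim, using the two sizes quoted in this thread (a 5MB Go binary and an 11KB Zig binary). Treating MB and KB as binary units is an assumption; decimal units give nearly the same ratio.

```python
import math

go_binary = 5 * 1024 * 1024  # "5MB", read here as 5 MiB
zig_binary = 11 * 1024       # "11KB", read here as 11 KiB

ratio = go_binary / zig_binary
orders = math.log10(ratio)
print(f"ratio ~{ratio:.0f}x, i.e. {orders:.2f} orders of magnitude")
```

                                                                                                          The ratio comes out between 100x and 1000x, which is where “at least 2” comes from.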

                                                                                                          1. 2

                                                                                                            The size of the binaries is a product of many things:

                                                                                                            1. Statically linking libraries instead of dynamically linking them
                                                                                                            2. Debugging information
                                                                                                            3. Pre-compiling object files to improve compiler performance
                                                                                                            4. Language features preventing dead-code analysis from eliminating all unused code

                                                                                                            Now certainly folks could spend time improving that, and there are issues in the go repo about some of them, but in practice a 5MB binary doesn’t matter in my everyday life. It doesn’t use up much space on my hard drive. I can upload them very quickly. They start plenty fast and servers I run handle them just fine.

                                                                                                            Why prioritize engineering resources or make the Go compiler slower to fix something that matters so little?

                                                                                                            1. 1

                                                                                                              …you did it again. You compared the size of the executable against the size of your hard drive. That’s the very thing that I just commented against in my last comment. Comparing an executable size against hard drive size hasn’t been a meaningful comparison in at least 20 years. Why do you keep doing it? Can’t you find a better comparison?

                                                                                                              1. 1

                                                                                                                I’ve lost the issue here.

                                                                                                                Why is a 5MB binary a problem?

                                                                                                                1. 2

                                                                                                                  I never said a 5MB binary was a problem. I said comparing the size of an executable to your hard drive size is a meaningless comparison and has been for a couple of decades.

                                                                                                                  I’ve said pretty much this same comment 3 times now, am I not being clear?

                                                                                                                  1. 2

                                                                                                                    OK. I agree a 5MB binary is not a problem.

                                                                                                                    1. 1

                                                                                                                      I also agree with you that short-sighted comparisons are wrong.

                                                                                                        2. 1

                                                                                                          E.g when talking about stack traces, the author seems to not know that the Zig(safe) build does have stack traces using DWARF while only making 3 calls and weighing 11KB.

                                                                                                          The zig programs were built with --strip which omits debug info. A minimal program capable of displaying its own stack traces with --release-safe comes out to ~500KB. That’s the size of both the DWARF info, and the code to parse it and utilize it to display a stack trace.

                                                                                                          But yeah it’s still only 3 syscalls. The third one is a segfault handler to print a stack trace. The std lib supports opt-out with pub const enable_segfault_handler = false; in the root source file (next to main).

                                                                                                        3. 2

                                                                                                        I like the writeup, but I’ll be in the minority here by actually agreeing with Drew (which I am loath to do under other circumstances–or at least, by agreeing with my interpretation of what he may have written).

                                                                                                          One of the common refrains here is “Well it’s a trivial task, this isn’t a real workload.” Okay, if it’s so trivial, it should be pretty easy to do it without obviously sucking. If a PhD graduate can’t make a peanut-butter and jelly sandwich, I’m not going to go “well PBJ is just this trivial cooking case, and think about how much they know about type systems!”. There’s a lot of handwaving over the “complexity” and extra neat features of runtimes, but it seems few are willing to admit that like, maybe, it’s super weird that copying a constant pile of bytes into a file descriptor is super inefficient in a lot of these languages.

                                                                                                          I suspect that if there were some additional examples, say handling threading or file access or network access, maybe Drew’s grump would’ve been taken more seriously. That said, the man still has a point and we’d be poor engineers to overlook it.

                                                                                                          1. 4

                                                                                                          I think it’s important to note that the overhead is a one-time fixed cost. As a program becomes more complex and does more things, that bloat does not exponentially increase (as was stated in the original blog post).

                                                                                                            In reality some of those fixed costs do precisely the opposite. As the program becomes larger and more complex the slightly higher initial fixed costs lead to significantly improved performance for the lifetime of the program.

                                                                                                            I don’t think it’s weird for a couple reasons:

                                                                                                            1. None of these programming languages were designed to make “Hello World” programs. They were designed to build real software and they do a much better job when assessed according to those standards.
                                                                                                            2. I think computers and operating systems are actually a whole lot more complex than many developers realize. Believe me when I say I wish they weren’t. I wish we had a computer that was easy to reason about and program effectively. But that’s not the world we live in. High level programming languages free developers to focus on their code without having to worry about all this complexity. (just skimming through the Go runtime folder there are hundreds of examples of this incidental complexity)

                                                                                                            If we want simpler programming languages we need simpler computers.
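
                                                                                                          The fixed-cost argument can be sketched with a toy model in Python. The startup figure is taken from the Nim run earlier in this thread (38 calls in total, 2 of them write). The assumption that each additional printed line costs exactly one extra write call is mine and ignores buffering, so treat the output as an illustration rather than a measurement.

```python
STARTUP_CALLS = 36  # 38 total calls in the Nim run minus its 2 write calls


def total_syscalls(lines_printed, calls_per_line=1):
    """Toy model: a fixed startup cost plus a constant cost per printed line."""
    return STARTUP_CALLS + calls_per_line * lines_printed


for n in (1, 10, 100, 1000):
    total = total_syscalls(n)
    share = STARTUP_CALLS / total
    print(f"{n:>5} lines: {total:>5} syscalls, startup share {share:.1%}")
```

                                                                                                          Under this model the total grows linearly with the work done, and the share taken by startup shrinks as the program does more.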

                                                                                                            1. 2

                                                                                                            I think a better example would have been using the Sieve of Eratosthenes to calculate and print the first 1,000 primes. Extra credit to spell them out in English: two, three, five…

                                                                                                              Rosetta Code has a bunch of examples in many languages: https://rosettacode.org/wiki/Sieve_of_Eratosthenes
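
                                                                                                            For a sense of how much work that benchmark would involve, here is a minimal sketch in Python (not taken from the Rosetta Code page). The function name first_primes is made up for illustration. The sieve needs an upper limit, so this version uses the known bound p_n < n(ln n + ln ln n) for n >= 6 and skips the English spelling.

```python
import math


def first_primes(count):
    """Return the first `count` primes using the Sieve of Eratosthenes."""
    if count < 1:
        return []
    # Upper bound on the count-th prime; 15 covers the first six primes.
    if count < 6:
        limit = 15
    else:
        limit = int(count * (math.log(count) + math.log(math.log(count)))) + 1
    is_prime = bytearray([1]) * (limit + 1)
    is_prime[0:2] = b"\x00\x00"  # 0 and 1 are not prime
    for p in range(2, math.isqrt(limit) + 1):
        if is_prime[p]:
            # Cross off multiples from p*p; smaller ones are already marked.
            for multiple in range(p * p, limit + 1, p):
                is_prime[multiple] = 0
    primes = [n for n in range(limit + 1) if is_prime[n]]
    return primes[:count]


primes = first_primes(1000)
print(primes[:3], primes[-1])  # prints [2, 3, 5] 7919
```

                                                                                                            Printing the whole list with print(*primes) would give the output the benchmark asks for.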

                                                                                                              1. 2

                                                                                                                One of the common refrains here is “Well it’s a trivial task, this isn’t a real workload.” Okay, if it’s so trivial, it should be pretty easy to do it without obviously sucking.

                                                                                                                This is certainly true. Pragmatically, though, if you’re a compiler developer, why would you bother optimizing this case? Nearly every use of your compiler is going to be for something larger and more complex—something that, unlike Hello World, probably would benefit from having debugging information or a threading system or whatever else—so why would you put time and effort into optimizing something that almost no one is going to do?

                                                                                                              2. 1

                                                                                                                would be cool to see measurements for myrddin

                                                                                                                1. 1

                                                                                                                  Looks like using musl with Rust produces 10 syscalls, 6 unique.

                                                                                                                  rustc -C opt-level=s --target x86_64-unknown-linux-musl hello.rs

                                                                                                                  1. 1

                                                                                                                    I wanted to check D with the -betterC option and it compares quite well with the other entries in the table!

                                                                                                                    (dmd-2.090.0)$ dmd -betterC -O hellobc.d
                                                                                                                    (dmd-2.090.0)$ ls -al
                                                                                                                    total 16
                                                                                                                    drwxrwxrwx 1 speps speps  512 Jan  6 21:09 .
                                                                                                                    drwxrwxrwx 1 speps speps  512 Jan  6 19:41 ..
                                                                                                                    -rwxrwxrwx 1 speps speps 8328 Jan  6 21:09 hellobc
                                                                                                                    -rwxrwxrwx 1 speps speps   97 Jan  6 21:04 hellobc.d
                                                                                                                    -rwxrwxrwx 1 speps speps 1816 Jan  6 21:09 hellobc.o
                                                                                                                    (dmd-2.090.0)$ strip hellobc
                                                                                                                    (dmd-2.090.0)$ ls -al
                                                                                                                    total 12
                                                                                                                    drwxrwxrwx 1 speps speps  512 Jan  6 21:09 .
                                                                                                                    drwxrwxrwx 1 speps speps  512 Jan  6 19:41 ..
                                                                                                                    -rwxrwxrwx 1 speps speps 6128 Jan  6 21:09 hellobc
                                                                                                                    -rwxrwxrwx 1 speps speps   97 Jan  6 21:04 hellobc.d
                                                                                                                    -rwxrwxrwx 1 speps speps 1816 Jan  6 21:09 hellobc.o
                                                                                                                    (dmd-2.090.0)$ strace -c -- ./hellobc
                                                                                                                    Hello betterC
                                                                                                                    % time     seconds  usecs/call     calls    errors syscall
                                                                                                                    ------ ----------- ----------- --------- --------- ----------------
                                                                                                                      0.00    0.000000           0         1           read
                                                                                                                      0.00    0.000000           0         1           write
                                                                                                                      0.00    0.000000           0         2           close
                                                                                                                      0.00    0.000000           0         8         7 stat
                                                                                                                      0.00    0.000000           0         3           fstat
                                                                                                                      0.00    0.000000           0         5           mmap
                                                                                                                      0.00    0.000000           0         4           mprotect
                                                                                                                      0.00    0.000000           0         1           munmap
                                                                                                                      0.00    0.000000           0         3           brk
                                                                                                                      0.00    0.000000           0         1           ioctl
                                                                                                                      0.00    0.000000           0         3         3 access
                                                                                                                      0.00    0.000000           0         1           execve
                                                                                                                      0.00    0.000000           0         1           arch_prctl
                                                                                                                      0.00    0.000000           0        10         8 openat
                                                                                                                    ------ ----------- ----------- --------- --------- ----------------
                                                                                                                    100.00    0.000000                    44        18 total
                                                                                                                    

                                                                                                                    Used the example from [1]:

                                                                                                                    extern(C) void main()
                                                                                                                    {
                                                                                                                        import core.stdc.stdio : printf;
                                                                                                                        printf("Hello betterC\n");
                                                                                                                    }
                                                                                                                    

                                                                                                                    [1] https://dlang.org/spec/betterc.html

                                                                                                                    1. 1

                                                                                                                      I think they’re both right, mostly. We pay in syscalls and startup time for stable APIs/ISAs (sorry, I just watched “The Thirty Million Line Problem”). Are they value-added, and added with purpose? (I think Caleb argues yes.) Are we getting ripped off a bit? (I think Drew argues yes.)

                                                                                                                      I think the article would be stronger if you considered the Go GC and its “downsides”. Automatic garbage collection has many benefits, sure, but also some downsides. These downsides are surprisingly minimal, thanks to their work! If my mom were living in the dark “to save electricity” (b/c that was true 30 years ago), I wouldn’t extol the virtues of lighting, I’d show her how cheap an LED bulb is (cents on the dollar). Go is maybe an outlier here: you might get sticker shock when you see the cost of, idk, the Perl 5 GC. Or Ruby 1.8 or something.

                                                                                                                      EDIT: thinking more, I think we have to be very specific about which runtimes are being wasteful/costly, and in what ways. It really doesn’t help to just say GCs are slow, or any other “truism” (like I kinda do above!).