1. 5
  1. 14

    In which we discover that the overhead of C++ I/O streams is larger than a call to printf.

    No surprise here. An apples-to-apples comparison would use the same calls; this post isn’t worth discussing.

    1. 4

      Agreed. C++ iostream is a badly designed, inefficient API. I sometimes use it because it’s often nicely idiomatic, but when IO is on the critical path it’s too slow to use and I fall back to the C stream APIs or even system calls.

      Plus, a microbenchmark that measures just startup time plus one write? Clickbait.

      (I’ve really got to wean myself off of C++ one of these years. I want to be able to have nice things. I just never want to pay my own performance penalty of having to get up to speed in a new language, and to figure out how to bridge to my existing code…)

      1. 2

        No surprise here.

        I’m always a little surprised. This post is really saying that loading code takes time, so the less you have, the faster your program launches. But the surprise is why C++ I/O primitives end up with so much more code. In theory, printf is terrible for code size, because it doesn’t know until run time which capabilities it needs, so it needs to include code for every conceivable format conversion. In theory, C++ I/O streams are type-aware and can only include things specific to a single use case.

        In practice, it consistently breaks the other way. A full -static build of each results in 820 KB for printf and 2.2 MB for C++ I/O streams. Both of these sizes seem shocking in terms of the inability to remove (presumably) unreachable code, although the C++ case should give the compiler more explicit hints about what’s unreachable.
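        For reference, the two programs being compared are presumably along these lines (my reconstruction; the article’s exact sources aren’t quoted here):

        ```cpp
        #include <cstdio>
        #include <iostream>

        // C-style version: the whole formatting path is stdio.
        // Returns the character count printf reports (12 chars + '\n' = 13).
        int hello_c() { return std::printf("hello world!\n"); }

        // C++-style version: pulls in iostreams and, statically linked,
        // roughly 1.4 MB more code per the sizes above.
        bool hello_cpp() {
            std::cout << "hello world!" << std::endl;
            return std::cout.good();
        }

        int main() {
            hello_c();
            hello_cpp();
        }
        ```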

        1. 3

          As I understand it, this isn’t about code loading, this is about the actual runtime cost of C++’s iostreams.

          See this comparison of the actual compiled assembly. Notice all of the extra code.

          This is clickbait, plain and simple. There is nothing worth discussing.

          1. 1

            this is about the actual runtime cost of C++’s iostreams.

            I don’t think it is. The article mentions a 1.4ms time for dynamic loading vs 0.7ms for static linkage, for instance. That’s not about the instructions executed to invoke iostreams. Godbolt appears to be showing the result of the compilation unit, meaning I don’t think it displays any difference for -static. It seems very unlikely to me that the 700 usec is spent executing additional instructions added as a result of dynamic linkage (that’s a lot!). But loading a chain of shared objects, relocating them, and resolving imports? That seems much more substantial than pushing a string literal on the stack and calling a function.

            What I got from the godbolt link is that the compiler is smart enough to replace printf with puts, which eliminates the code I was referring to; both versions are effectively type-specialized.
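            That transformation is easy to reproduce; a sketch of what (I assume) the godbolt snippet compiled:

            ```cpp
            #include <cstdio>

            // With a constant, conversion-free format string ending in '\n',
            // optimizing compilers (GCC and Clang at -O1 and up) rewrite this
            // printf call into puts("hello world!"), bypassing the
            // format-string parser entirely.
            int say_hello() { return std::printf("hello world!\n"); }

            int main() { return say_hello() == 13 ? 0 : 1; }
            ```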

            1. 1

              I still think it’s not even worth a discussion; dynamic loading affects more than just C++. The more “important” difference is the cost of executing those extra call and comparison instructions related to the use of iostreams.

              …also, was this article edited? Dynamic linkage was 1.4ms, now it’s 0.8ms? What is the point of updating these metrics while still sticking with the same strawman?

              1. 3

                dynamic loading affects more than just C++

                Agree completely. My biggest complaint with this article is that libc is dynamically linked in all cases, so it’s measuring the dynamic load time of C and C++ against the dynamic load time of C without C++. A better comparison is the one @c-cube did below of creating both as static, so the linker is free to find what is used and discard the rest, then measure the cost of the code generated for the test in isolation.

                Some might argue that libc should be retained because it’s the de facto standard syscall layer on Linux, but note this argument implicitly favors one language by ensuring all of its runtime is required to be loaded, and any other language has to load something additional.

                The more “important” difference is the cost of executing those extra call and comparison instructions related to the use of iostreams.

                If we assume that there’s zero cost of loading a large static binary, the cost of the extra calls (over C) is 300 usec; the cost of dynamic loading is 600 usec. In reality the difference will be larger, because loading a large static binary has a cost. Whether it’s important or not is a little subjective (based on how an individual prioritizes short lived processes), but it’s definitely a larger cost than extra instructions in the context of a program that can execute in 500 usec.

                …also, was this article edited? Dynamic linkage was 1.4ms, now it’s 0.8ms?

                It looks like something was added to it. I’m fairly sure dynamic linkage was always 1.4 and static was 0.8, and I misquoted these numbers above to suggest a 700 usec difference which should have been 600 usec. (My bad.)

                The new parts at the end probably do a better job of quantifying the overhead of the extra C++ call instructions at 10k cycles out of 60k (16%), avoiding conflating dynamic library loading with the language.

        2. 1

          I wouldn’t say it’s not worth discussing, but I’d have liked to see more actual discussion of why in the post.

          This post could have been an interesting deep dive into what exactly I/O streams do that makes them slower, and why they were designed that way!

          That’s coming from my perspective as a C/C++ idiot, and I’m sure many people know why. But while I have a basic idea, I’d have been interested in a post with more details (maybe some assembly code).

          1. 5

            There have been a lot of such articles written. This is a particularly evil example, because the C version actually gets compiled to a puts call, because the compiler knows the semantics of printf and, when called with a constant format string, is able to convert it into something that bypasses the formatting. In contrast, the C++ version gets the buffer management inlined and still goes via the generic formatting layer.

            It’s a bit more interesting if you add some real formatting. If you try to print numbers in a loop, the performance goes:

            • fmt::print is fastest because it ends up being specialised on the format string and so doesn’t need any dynamic dispatch.
            • printf is in the middle. It’s having to parse the format string and write the arguments with some exciting variadic magic, but it’s not too bad.
            • std::cout<< is the slowest because it’s doing an insane number of things.

            A moderate chunk of the slowness of std::cout comes from the fact that it also synchronises with C. In C, stdout is a FILE*, not a raw file descriptor. When you call printf, this is a quick tail call to fprintf(stdout, ...). This locks the stdout FILE* and uses its internal buffer to build the output (writing it if it gets full), and then unlocks the FILE*.

            In C++, you end up needing to lock and unlock the corresponding FILE* on each operator<< (which is stupid, from the C++ side, because it doesn’t prevent interleaving in the output between two subsequent calls like this). Often the C++ ostream has its own locking and buffering, and so it ends up needing to lock the C FILE, flush its buffer, lock the C++ ostream, collect things in the C++ ostream’s buffer, write that to the underlying file descriptor, and then unlock the C++ ostream and the C FILE and return. This is very slow.
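            The C-synchronisation half of that cost can be switched off when a program doesn’t mix C and C++ output. A minimal sketch, using only standard calls (nothing from the article):

            ```cpp
            #include <iostream>

            int main() {
                // Decouple the C++ streams from the C stdio FILE*s: after this
                // call, std::cout keeps its own buffer and no longer has to
                // lock/flush stdout on every operator<<. The trade-off is that
                // interleaving printf and std::cout output is no longer safe.
                std::ios::sync_with_stdio(false);

                // Untying cin from cout also drops the implicit flush of cout
                // before each read from cin.
                std::cin.tie(nullptr);

                for (int i = 0; i < 3; ++i)
                    std::cout << "line " << i << '\n';  // '\n', not endl: no flush per line
            }
            ```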

            In exchange for all of this locking, you still get the problem that std::cout << "hello " << someInt << std::endl can end up with some other output interleaved between the string, integer, and newline parts. The libfmt fmt::print call doesn’t have this problem, is type safe, and is faster than printf. Since libstdc++ still doesn’t have a C++20 std::format implementation, fmt::print is the easiest way to get cross-platform formatted output in C++ and also the fastest.

        3. 5

          This post is disappointing clickbait. Reviewing other submissions from this domain shows much richer content. Puzzling!


          1. 2

            Lemire is known for some really advanced content on optimisation, data-structure magic, hashing, etc. I am also puzzled by this post.

          2. 3

            I tried on my machine with the same source code (the C one), because there’s no point comparing iostream and stdio here. With gcc -static -O2 and g++ -static -O2… both take the same time on my machine (around 0.3ms).

            So… yeah, no.

            1. 2

              std::endl does a lot more than output a single '\n'. Check this: https://godbolt.org/z/djrKzercn
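              Specifically, std::endl is a '\n' followed by a flush of the stream. A small sketch of the equivalence (not taken from the godbolt link, just the standard behaviour):

              ```cpp
              #include <iostream>
              #include <sstream>
              #include <string>

              // std::endl writes '\n' and then calls flush() on the stream.
              std::string with_endl() {
                  std::ostringstream s;
                  s << "hello" << std::endl;
                  return s.str();
              }

              // The same two steps spelled out. On a string stream the flush is
              // a no-op; on std::cout it can mean a write() syscall per line,
              // which is the cost being discussed here.
              std::string with_newline_then_flush() {
                  std::ostringstream s;
                  s << "hello" << '\n' << std::flush;
                  return s.str();
              }

              int main() {
                  return with_endl() == with_newline_then_flush() ? 0 : 1;
              }
              ```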

              1. 1

                Yeah, flushing the output stream isn’t free.