  2. 25

    Fun little war story: the Alpino (natural language) parser for Dutch uses a finite state transducer for tokenization. The tokenizer is implemented in a Prolog toolkit that offers a DSL to make recognizers and transducers using a regexp-like language, along with common operations for FSAs/FSTs (minimization, etc.). The resulting transducer is dumped as a large C transition table, plus the small amount of C code necessary to use the transition table.
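
    To make the shape of that artifact concrete, here is a minimal Rust sketch of a table-driven recognizer (illustrative only; the real generated output is C, and its table has vastly more states):

    // Toy DFA: recognizes either a word (letters) or a number (digits).
    // The generated artifact is just a big table like this plus a tiny driver loop.
    const REJECT: usize = usize::MAX;

    fn class(c: char) -> usize {
        match c {
            'a'..='z' | 'A'..='Z' => 0, // letter
            '0'..='9' => 1,             // digit
            _ => 2,                     // anything else
        }
    }

    // TABLE[state][input class] -> next state
    // state 0: start, state 1: in a word, state 2: in a number
    const TABLE: [[usize; 3]; 3] = [
        [1, 2, REJECT],
        [1, REJECT, REJECT],
        [REJECT, 2, REJECT],
    ];

    fn accepts(input: &str) -> bool {
        let mut state = 0;
        for c in input.chars() {
            state = TABLE[state][class(c)];
            if state == REJECT {
                return false;
            }
        }
        state != 0
    }

    fn main() {
        assert!(accepts("hello"));
        assert!(accepts("1234"));
        assert!(!accepts("ab12"));
    }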

    At the time (this must have been 2008 or 2009), gcc would absolutely struggle with these transition table C files. It would take quite a while to compile the table, and would OOM or end up in swap hell on regular Linux workstations of the day. The solution: compile this one file (and only this file) with Fabrice Bellard’s Tiny C Compiler, which would compile this file in no time with very little memory use. Then link with the other objects produced by gcc.

    tl;dr: small dumb(er) compilers may do better on weird benchmarks.

    Edit: this format is very unreadable.

    1. 5

      I used the same trick for an 80MB file of graph patterns for a research compiler backend. :)

      1. 2

        To add to that: when I had to work with Alpino (around 2014 or so) it was a total pain to set up, in part because of all the dependencies it came with, and IIRC a bunch of outdated ones that it bundled but that didn’t work everywhere.

        1. 1

          When I did my PhD I had access to SICStus Prolog 3, so I could just compile it against my local system and it was painless. Now I use this when I need to run the binary version:

          https://git.sr.ht/~danieldk/nix-packages/tree/master/pkgs/alpino/default.nix

          You can pin nixpkgs to a particular version and it should always work.

        2. 1

          > At the time (this must have been 2008 or 2009), gcc would absolutely struggle with these transition table C files.

          That’s interesting. I’d expect that using (almost) no optimizations would speed up the thing.

          1. 4

            That’s what we thought, but it didn’t help much.

        3. 21

          I’m not sure what the purpose of this post is. The only information it seems to contain is “the Rust compiler uses too much RAM (for example, compared to Go) when compiling a 1e6-LoC main”, formatted in a very irritating way. No insight as to why, or whether it is a pitfall that may be an issue in actual use cases.

          Also, you’re comparing to V, but last time I checked, V was nothing but unbacked claims.

          1. 13

            > I’m not sure what the purpose of this post is.

            Seriously though, what’s the point of anything?

            1. 7

              Exactly! Do things have to have points?

              1. 3

                > Seriously though, what’s the point of anything?

                I sincerely hope more people do things without there being a ‘point’ to them. Not everything we do has to produce a useful widget, and you never know where random exploration of [the universe of all experiences], where it intersects your abilities, might go.

              2. 6

                > Also, you’re comparing to V, but last time I checked, V was nothing but unbacked claims.

                To me this was clearly a joke, and probably a reference to the author’s other article titled “V for Vaporware”.

                Anyways, the post was okay in my opinion. It presented a story and an interesting outcome. I hate the formatting as much as you do, but I appreciate the information I got from it, which is that Rust and Go were worlds apart on this example when it came to compilation speed.

                Unless we’re gonna pay the author, I see no reason for us to decide how insightful their articles should be, and it certainly still fits here regardless.

                1. 5

                  I was in the middle of making a comparison with multiple languages, but we got nerd-sniped by Rust OOMing the biggest systems some friends and I had. Then a friend broke out the x1e.32xlarge (4 TB of RAM) and it still OOMed. I was laughing so hard I yolo’d it and published it as is. I plan to do it for other languages at some point in the future (2-3 months from now).

                  1. 2

                    Haha, fair enough. I actually liked how verbatim it was, by the way! Only the aesthetic of the tweets put me off.

                    > I plan to do it for other languages at some point in the future (2-3 months from now).

                    Looking forward to seeing the results :)

              3. 8

                It’s probably actually significantly more than 1.2 million lines, since println! is a macro. Sounds like maybe there are issues with extremely large function bodies regardless, perhaps related to the number of scopes created in a single function body? I’m not sure it’s worth optimising for the sake of a meme.

                1. 10

                  It doesn’t use println!()

                  1. 1

                    I missed that (the screenshot showing it wasn’t readily visible in the tweet embed on my phone). I don’t really have a suggestion for why that happened; it would certainly be interesting to find out!

                  2. 5

                    This was my first thought too.

                    fn main() {
                        println!("Hello, world!");
                    }
                    

                    is expanded to

                    #![feature(prelude_import)]
                    #![no_std]
                    #[prelude_import]
                    use ::std::prelude::v1::*;
                    #[macro_use]
                    extern crate std as std;
                    fn main() {
                        {
                            ::std::io::_print(::core::fmt::Arguments::new_v1(&["Hello, world!\n"],
                                                                             &match () {
                                                                                  () => [],
                                                                              }));
                        };
                    }
                    

                    The interesting part is the println!(...) macro turning into its own lexical block with two function calls, a pointer to a slice containing a str literal, and then that useless match statement that would be used if there were any string formatting going on.

                    So the 1_200_000 * 'println!("hello world");'.length == 28.8 MB codebase that we thought we were compiling is actually expanded into about 120 MB of minimized code. That’s quite a bit for just the textual representation of a program!

                    Beyond just expanding and churning through those lines, I wouldn’t be surprised if there’s a lot of wasted work by the borrow and lifetime checkers as well. Each expansion has its own block, so the compiler has to re-set up and free the variables every single time. I wonder if that means it can’t lift the &["Hello, world!\n"] into a constant itself, though I assume LLVM eventually would.

                    So it’s definitely doing a ton of work, likely more than it’s given credit for. But I’m curious whether it’s worth optimizing for. Code generation isn’t that uncommon, but I would hope in this case it’d be a little more clever than just printing a line 1.2 million times. This is the first thing I’ve seen that made me check whether loop rolling (as opposed to loop unrolling) exists. It turns out LLVM has a pass that does it in some cases. I also found some old posts hinting that some older compilers would re-roll loops after early optimization passes to help later passes recognize when autovectorization should occur. A sufficiently advanced compiler could determine that this is basically an unrolled loop, but that seems like it’d be a super uncommon scenario where the extra work would be worth it, especially if LLVM would eventually do its thing and optimize it anyway, which it may or may not.
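
                    For reference, this is the rolled form a loop-rolling pass would ideally recover (my own illustration, not from the post); it is semantically equivalent to the 1.2 million repeated statements and trivially cheap to compile:

                    fn main() {
                        // one loop instead of 1_200_000 copies of the statement
                        for _ in 0..1_200_000 {
                            println!("Hello, world!");
                        }
                    }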

                    1. 2

                      Yeah, the scope analysis part is my suspect too. It’s perhaps an interesting demonstration that Go was designed around the compiler implementation first, while Rust was designed more around what a conceptual compiler could optimise a chosen code pattern into in order to avoid runtime penalties, with optimisations to the compiler implementation coming later. Sort of like a production-focused research language.

                  3. 8

                    RE: the blogging experiment, I think you could just link to a tweet thread to better effect. The literal HTML entities and lack of distinct formatting between content and authors are a bit tough on my eyes. But I did enjoy the responsive email quotes.

                    Anyway, I hope you filed a bug!

                    1. 4

                      I plan to!

                      1. 6

                        FWIW I also found this very hard to read. Experiments are good though :)

                      2. 4

                        Oh, I found the problem. Firefox blocks twitter’s widget code. I’ll go fix that by downloading it, hosting it locally and the like.

                      3. 3

                        There’s a lot of context missing here; the only thing I can see is a call to write_all on out. Not sure what out is, or what’s in HELLO_WORLD; how is it declared? Are these inside of a main? This is calling out something fairly large, but it seems very light on actual details of what’s being compiled, how it’s being compiled, etc.

                        I get it, Rust might take a long time compiling something, but it would be a bit more useful if there was more information; not necessarily to say you’re wrong, but so that we can actually see the whole picture, because right now there isn’t enough information here to validate the claim. Show your work is what I’m saying: give us something reproducible. If it’s linear growth in memory usage, this should be reproducible at smaller scales.
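
                        For instance, something like this hypothetical generator would make the test reproducible at any scale (the write_all-on-a-locked-stdout shape is my guess from the screenshots, not confirmed):

                        use std::fs::File;
                        use std::io::{BufWriter, Write};

                        // Emit gen.rs containing N copies of the write_all line.
                        fn main() -> std::io::Result<()> {
                            let n: usize = std::env::args()
                                .nth(1)
                                .and_then(|s| s.parse().ok())
                                .unwrap_or(1_000);
                            let mut f = BufWriter::new(File::create("gen.rs")?);
                            writeln!(f, "use std::io::Write;")?;
                            writeln!(f, "fn main() {{")?;
                            writeln!(f, "    let stdout = std::io::stdout();")?;
                            writeln!(f, "    let mut out = stdout.lock();")?;
                            for _ in 0..n {
                                writeln!(f, r#"    out.write_all(b"Hello, world!\n").unwrap();"#)?;
                            }
                            writeln!(f, "}}")?;
                            Ok(())
                        }

                        Then time rustc on the generated file at increasing N and watch the memory curve.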

                        1. 2

                          See this: https://github.com/ScottMansfield/rustlol

                          Quadratic time!

                        2. 3

                          For fun, I tried it with another language that uses LLVM: Zig 0.4.0.

                          const std = @import("std");
                          const hello_world = "Hello, world!\n";
                          
                          pub fn main() !void {
                              const stdout_file = try std.io.getStdOut();
                              // 1200000 lines of:
                              try stdout_file.write(hello_world);
                          }
                          
                          ❱ zig build-exe stress.zig
                          00:03:10 elapsed
                          

                          I’m still waiting on the release build; it’s making my laptop sweaty.

                          1. 2

                            I really just feel like this should be extrapolated as a stress test for every language, just for fun.

                            1. 1
                              #include <stdio.h>
                              int main() {
                                  // x1200000
                                  printf("hello, world\n");
                                  ...
                                  return 0;
                              }
                              

                              Time taken on a 2013 MacBook Pro:

                              bash-3.2$ time clang gen.c
                              
                              real	0m25.644s
                              user	0m21.782s
                              sys	0m1.790s
                              

                              (the executable is apparently 23m)

                              1. 2

                                > (the executable is apparently 23m)

                                It’s time to introduce un-unroll optimizations!

                                1. 4

                                  Loop rolling!

                                  1. 1
                                    time clang -Oz gen.c
                                            146.30 real       108.65 user         7.60 sys
                                    
                                    ls -al -h a.out
                                    -rwxr-xr-x  1 calmbit  staff   9.2M Oct  3 18:13 a.out
                                    

                                    Hmm! Interesting results. An obvious reduction, but not as big of one as I would have thought.

                                    The disassembly is a disaster, and looks something like:

                                    Disassembly of section __TEXT,__text:
                                    __text:
                                    10000136c:      55      pushq   %rbp
                                    10000136d:      48 89 e5        movq    %rsp, %rbp
                                    100001370:      53      pushq   %rbx
                                    100001371:      50      pushq   %rax
                                    100001372:      48 8d 1d 31 7c 92 00    leaq    9600049(%rip), %rbx
                                    100001379:      48 89 df        movq    %rbx, %rdi
                                    10000137c:      e8 09 7c 92 00  callq   9600009 <dyld_stub_binder+0x100928f8a>
                                    100001381:      48 89 df        movq    %rbx, %rdi
                                    100001384:      e8 01 7c 92 00  callq   9600001 <dyld_stub_binder+0x100928f8a>
                                    100001389:      48 89 df        movq    %rbx, %rdi
                                    10000138c:      e8 f9 7b 92 00  callq   9599993 <dyld_stub_binder+0x100928f8a>
                                    100001391:      48 89 df        movq    %rbx, %rdi
                                    100001394:      e8 f1 7b 92 00  callq   9599985 <dyld_stub_binder+0x100928f8a>
                                    100001399:      48 89 df        movq    %rbx, %rdi
                                    10000139c:      e8 e9 7b 92 00  callq   9599977 <dyld_stub_binder+0x100928f8a>
                                    1000013a1:      48 89 df        movq    %rbx, %rdi
                                    1000013a4:      e8 e1 7b 92 00  callq   9599969 <dyld_stub_binder+0x100928f8a>
                                    ... (snip 4800000 lines)
                                    100928f69:      48 89 df        movq    %rbx, %rdi
                                    100928f6c:      e8 19 00 00 00  callq   25 <dyld_stub_binder+0x100928f8a>
                                    100928f71:      48 89 df        movq    %rbx, %rdi
                                    100928f74:      e8 11 00 00 00  callq   17 <dyld_stub_binder+0x100928f8a>
                                    100928f79:      48 89 df        movq    %rbx, %rdi
                                    100928f7c:      e8 09 00 00 00  callq   9 <dyld_stub_binder+0x100928f8a>
                                    100928f81:      31 c0   xorl    %eax, %eax
                                    100928f83:      48 83 c4 08     addq    $8, %rsp
                                    100928f87:      5b      popq    %rbx
                                    100928f88:      5d      popq    %rbp
                                    100928f89:      c3      retq
                                    Disassembly of section __TEXT,__stubs:
                                    __stubs:
                                    100928f8a:      ff 25 80 00 00 00       jmpq    *128(%rip)
                                    Disassembly of section __TEXT,__stub_helper:
                                    __stub_helper:
                                    100928f90:      4c 8d 1d 69 00 00 00    leaq    105(%rip), %r11
                                    100928f97:      41 53   pushq   %r11
                                    100928f99:      ff 25 69 00 00 00       jmpq    *105(%rip)
                                    100928f9f:      90      nop
                                    100928fa0:      68 00 00 00 00  pushq   $0
                                    100928fa5:      e9 e6 ff ff ff  jmp     -26 <__stub_helper>
                                    
                            2. 2

                              > 0.4.0

                              Not that it would make a difference for this benchmark, but 0.5.0 was released on Monday.

                            3. 2

                              In purer languages, you might hope for a common subexpression elimination (CSE) pass to kick in. My understanding is that this optimization is very difficult to do safely in many languages, including, I guess, Rust.

                              1. 4

                                Gonna be hard to do that on something that mutates external state, such as writing to stdout.
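
                                A hypothetical illustration of the distinction (my sketch, not from the thread): CSE can merge repeated pure expressions, but not repeated calls with observable effects.

                                fn main() {
                                    let (x, y) = (3, 4);

                                    // Pure and repeated: a compiler may compute x * y once
                                    // and reuse the result for both bindings.
                                    let a = x * y;
                                    let b = x * y;

                                    // Effectful and repeated: each call must still write a line,
                                    // so the two statements cannot be collapsed into one.
                                    println!("{}", a);
                                    println!("{}", b);
                                }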

                              2. 2

                                This takes 9m12s in Perl (Cygwin on Windows).

                                https://gist.github.com/gustafe/315d3f1e645a4f6edeac54c3646c037b

                                1. 2

                                  I know it might be too much to ask, but I kinda want to see memory and wall time as a function of the number of print statements.

                                  1. 4

                                    a friend of mine benchmarked this:

                                    sgm-mbp:bench sgm$ ~/go/bin/benchstat benchout
                                    name                   time/op
                                    Ones/1-12              170ms ± 2%
                                    Ones/2-12              171ms ± 1%
                                    Ones/3-12              173ms ± 2%
                                    Ones/4-12              174ms ± 2%
                                    Ones/5-12              178ms ±12%
                                    Ones/6-12              174ms ± 4%
                                    Ones/7-12              173ms ± 1%
                                    Ones/8-12              174ms ± 3%
                                    Ones/9-12              174ms ± 1%
                                    Tens/10-12             178ms ± 8%
                                    Tens/20-12             175ms ± 1%
                                    Tens/30-12             182ms ±15%
                                    Tens/40-12             181ms ± 3%
                                    Tens/50-12             180ms ± 2%
                                    Tens/60-12             181ms ± 4%
                                    Tens/70-12             183ms ± 1%
                                    Tens/80-12             187ms ± 4%
                                    Tens/90-12             186ms ± 1%
                                    Hundreds/100-12        187ms ± 1%
                                    Hundreds/200-12        204ms ± 2%
                                    Hundreds/300-12        228ms ± 2%
                                    Hundreds/400-12        255ms ± 1%
                                    Hundreds/500-12        285ms ± 3%
                                    Hundreds/600-12        310ms ± 2%
                                    Hundreds/700-12        337ms ± 1%
                                    Hundreds/800-12        369ms ± 4%
                                    Hundreds/900-12        400ms ± 3%
                                    Thousands/1000-12      425ms ± 2%
                                    Thousands/2000-12      785ms ± 1%
                                    Thousands/3000-12      1.20s ± 3%
                                    Thousands/4000-12      1.73s ± 2%
                                    Thousands/5000-12      2.28s ± 1%
                                    Thousands/6000-12      2.86s ± 4%
                                    Thousands/7000-12      3.53s ± 2%
                                    Thousands/8000-12      4.24s ± 3%
                                    Thousands/9000-12      5.07s ± 3%
                                    TenThousands/10000-12  5.65s ± 3%
                                    TenThousands/20000-12  16.7s ± 2%
                                    TenThousands/30000-12  32.4s ± 2%
                                    

                                    link to the benchmarks: https://github.com/ScottMansfield/rustlol
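
                                    Eyeballing the growth: a quick way to estimate the local scaling exponent between successive measurements (pairs copied from the table above; assuming t ~ n^k, then k = ln(t1/t0) / ln(n1/n0)):

                                    fn main() {
                                        // (statements, seconds) pairs from the benchstat output above
                                        let pts = [(1_000.0_f64, 0.425), (10_000.0, 5.65), (20_000.0, 16.7), (30_000.0, 32.4)];
                                        for w in pts.windows(2) {
                                            let ((n0, t0), (n1, t1)) = (w[0], w[1]);
                                            // empirical exponent k, assuming t ~ n^k locally
                                            let k = (t1 / t0).ln() / (n1 / n0).ln();
                                            println!("{} -> {}: k ≈ {:.2}", n0, n1, k);
                                        }
                                    }

                                    The exponent climbs from roughly 1.1 toward 1.6 over this range, i.e. clearly superlinear and still getting worse at the largest sizes.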

                                  2. 2

                                    Hmm…

                                    stuff.asm:

                                    section .text
                                    
                                    global _start
                                    _start:
                                    	inc rax ; SYS_write == 1
                                    	inc rdi ; stdout == 1
                                    	mov bl, al
                                    	mov edx, text.end - text
                                    	lea rsi, [rel text]
                                    
                                    %include "stuff2.asm"
                                    
                                    	mov al, 60
                                    	xor edi, edi
                                    	syscall
                                    
                                    text:
                                    	db "hello world!",10
                                    .end:
                                    

                                    stuff2.asm:

                                    	mov al, bl
                                    	syscall
                                    	; and so on...
                                    
                                    $ time nasm -felf64 -o stuff.o stuff.asm && cc -static -nostdlib -nostartfiles -o stuff.elf stuff.o
                                    real	0m18.635s
                                    user	0m18.464s
                                    sys	0m0.047s
                                    $ cat /proc/cpuinfo | grep "model name" | head -1
                                    model name	: Intel(R) Core(TM) i3-2330M CPU @ 2.20GHz
                                    

                                    (I know, the linking part is not counted in the time invocation, but it happens within one second, so I don’t really care.)

                                    That’s longer than I would’ve expected. How well does GAS do it?

                                    stuff.S:

                                    .text
                                    
                                    .global _start
                                    _start:
                                    	inc %rax
                                    	inc %rdi
                                    	movb %al, %bl
                                    	movl $(text_end - text), %edx
                                    	lea text(%rip), %rsi
                                    
                                    #include "stuff2.S"
                                    
                                    	mov $60, %al
                                    	xorl %edi, %edi
                                    	syscall
                                    
                                    text:
                                    	.ascii "hello world!\n"
                                    text_end:
                                    

                                    stuff2.S:

                                    	mov %bl, %al
                                    	syscall
                                    	/* and so on... */
                                    
                                    $ time cc -c -o stuffG.o stuff.S && cc -static -nostdlib -nostartfiles -o stuffG.elf stuffG.o
                                    real	0m5.276s
                                    user	0m4.876s
                                    sys	0m0.383s
                                    

                                    EDIT: the binary is about 4.58 megs in size.