1. 29

“Here, we do something that is not possible to do in Rust or C - we safely stack-allocate the intermediate value, eliminating all heap allocations.”

    1. 18

      Except the intermediate values are stack-allocated in Rust and C as well. The article also claims that stack-allocated data is faster than heap-allocated data, which is dubious without further clarification. malloc is indeed slower than pushing to the stack, and the stack will almost certainly be in cache, but this program is small enough that its working set will always be in cache, so there should be minimal difference.

      I’m curious what the modular function is doing in Rust - the C implementation just uses a simple (n % 2) == 0 while the Rust code does ((n % 2) + 2) % 2 and it’s not clear why.

      1. 7

        ((n % 2) + 2) % 2 properly handles negatives, though it shouldn’t matter here.
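
        To spell that out: in Rust (as in C), `%` is a remainder whose sign follows the dividend, so the double-mod trick normalizes negative results into the range 0..2. A small sketch of the difference (Rust also has `rem_euclid` for exactly this):

```rust
fn main() {
    let n: i64 = -3;
    // Plain remainder follows the sign of the dividend.
    println!("{}", n % 2);             // -1
    // The double-mod trick maps the result into 0..2.
    println!("{}", ((n % 2) + 2) % 2); // 1
    // The standard library's rem_euclid does the same thing.
    println!("{}", n.rem_euclid(2));   // 1
}
```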

    2. 17

      I can’t wait for “faster than rust” to be the new meme.

    3. 12

      eliminating all heap allocations

      Rust isn’t heap allocating anything.

      1. 11

        And C isn’t either.

        I haven’t had the time to dig into the codegen, but I posted it to /r/rust; we’ll see what they say :)

        EDIT:

        Looks to me like the use of signed integers is inhibiting some optimizations.

        https://www.reddit.com/r/rust/comments/7m99wo/outperforming_rust_with_functional_programming/drsbori/
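
        For what it’s worth, the mechanism would be concrete: for a signed n, the compiler cannot lower n / 2 to a bare shift, because integer division truncates toward zero while an arithmetic shift rounds toward negative infinity, so it has to emit a sign-based fixup. A sketch of both variants (my own reconstruction, not the article’s exact code):

```rust
// Signed version: `n / 2` needs a shift plus a sign fixup,
// because -3 / 2 == -1 but -3 >> 1 == -2.
fn collatz_i64(mut n: i64) -> u32 {
    let mut steps = 0;
    while n != 1 {
        n = if n % 2 == 0 { n / 2 } else { 3 * n + 1 };
        steps += 1;
    }
    steps
}

// Unsigned version: `n / 2` is exactly `n >> 1`, no fixup needed.
fn collatz_u64(mut n: u64) -> u32 {
    let mut steps = 0;
    while n != 1 {
        n = if n % 2 == 0 { n / 2 } else { 3 * n + 1 };
        steps += 1;
    }
    steps
}

fn main() {
    // 27 is the classic long starter: 111 steps to reach 1.
    println!("{} {}", collatz_i64(27), collatz_u64(27));
}
```

        Whether the fixup actually costs anything on a modern out-of-order core is another question, of course.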

        1. 6

          I doubt that; these are the unmodified results (on my machine):

          test bench_collatz_106239 ... bench:         467 ns/iter (+/- 18)
          test bench_collatz_10971  ... bench:         367 ns/iter (+/- 12)
          test bench_collatz_2223   ... bench:         229 ns/iter (+/- 7)
          

          And here is after replacing i64 with u64:

          test bench_collatz_106239 ... bench:         509 ns/iter (+/- 4)
          test bench_collatz_10971  ... bench:         388 ns/iter (+/- 3)
          test bench_collatz_2223   ... bench:         267 ns/iter (+/- 10)
          
          1. 12

            Yeah, there’s a bunch of different speculation going on in that thread.

            Regardless, sometimes Rust will not be literally the fastest possible thing. That’s okay, as long as it’s close. I find these little examples interesting, regardless of who “wins”.

            1. 11

              I really like how the Rust community is approaching this as “how can we do better?”

    4. 20

      Something the other comments aren’t touching on is that what’s slow about functional programming has nothing to do with a collatz function. These are just integers, so the values will always be copied in and out of registers instead of mutated (probably; it depends on what the compiler feels like doing). What’s slow about ‘functional programming’ is when you have larger datatypes and you build an entirely new one instead of mutating an old one. What’s slow is when the idiomatic data structure is the linked list, which wreaks havoc on your cache, or some other tree-based persistent data structure. What’s slow is the abuse of dynamically dispatched first-class functions, which compilers go to great lengths to avoid but which is sometimes unavoidable.

      The only way to make functional programming fast, as a paradigm, is to really commit to linear/affine types (like BOTH Rust and ATS have, actually), so the compiler has enough information to determine that it can reuse the memory space.
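
      A hedged Rust sketch of that reuse idea (illustrative names, not from the article): when a function consumes its argument by value, ownership guarantees nobody else can observe the old contents, so “building a new value” can legitimately compile down to mutating the old allocation in place:

```rust
// Functional-style update: consumes the old vector and returns a
// "new" one. Because ownership is transferred, the same heap buffer
// can be reused instead of copied.
fn increment_all(mut xs: Vec<i64>) -> Vec<i64> {
    for x in xs.iter_mut() {
        *x += 1;
    }
    xs // same allocation, just handed back
}

fn main() {
    let v = vec![1, 2, 3];
    let ptr_before = v.as_ptr();
    let v = increment_all(v);
    // In practice the buffer address is unchanged: no new allocation.
    assert_eq!(ptr_before, v.as_ptr());
    println!("{:?}", v); // [2, 3, 4]
}
```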

      I’m a secret Standard ML fan and I’ve been using Rust since mid-2014, so of course I like ATS. :) I just find the “point” of the article pretty misleading.

    5. 4

      Never heard of ATS, and it’s not linked to in the post. Here it is:

      http://www.ats-lang.org

      ATS is a statically typed programming language that unifies implementation with formal specification. It is equipped with a highly expressive type system rooted in the framework Applied Type System, which gives the language its name. In particular, both dependent types and linear types are available in ATS.

    6. 2

      How was the C function tested? With/without -O2 makes a difference, and using unsigned integers also makes a difference.

      collatz_c, linux/amd64, gcc 6.3.0 without optimizations:

      0000000000000000 <collatz_c>:
         0:	55                   	push   %rbp
         1:	48 89 e5             	mov    %rsp,%rbp
         4:	89 7d ec             	mov    %edi,-0x14(%rbp)
         7:	c7 45 fc 00 00 00 00 	movl   $0x0,-0x4(%rbp)
         e:	eb 2e                	jmp    3e <collatz_c+0x3e>
        10:	8b 45 ec             	mov    -0x14(%rbp),%eax
        13:	83 e0 01             	and    $0x1,%eax
        16:	85 c0                	test   %eax,%eax
        18:	75 11                	jne    2b <collatz_c+0x2b>
        1a:	8b 45 ec             	mov    -0x14(%rbp),%eax
        1d:	89 c2                	mov    %eax,%edx
        1f:	c1 ea 1f             	shr    $0x1f,%edx
        22:	01 d0                	add    %edx,%eax
        24:	d1 f8                	sar    %eax
        26:	89 45 ec             	mov    %eax,-0x14(%rbp)
        29:	eb 0f                	jmp    3a <collatz_c+0x3a>
        2b:	8b 55 ec             	mov    -0x14(%rbp),%edx
        2e:	89 d0                	mov    %edx,%eax
        30:	01 c0                	add    %eax,%eax
        32:	01 d0                	add    %edx,%eax
        34:	83 c0 01             	add    $0x1,%eax
        37:	89 45 ec             	mov    %eax,-0x14(%rbp)
        3a:	83 45 fc 01          	addl   $0x1,-0x4(%rbp)
        3e:	83 7d ec 01          	cmpl   $0x1,-0x14(%rbp)
        42:	75 cc                	jne    10 <collatz_c+0x10>
        44:	8b 45 fc             	mov    -0x4(%rbp),%eax
        47:	5d                   	pop    %rbp
        48:	c3                   	retq   
      

      collatz_c, linux/amd64, gcc 6.3.0, -O2:

      0000000000000000 <collatz_c>:
         0:	31 c0                	xor    %eax,%eax
         2:	83 ff 01             	cmp    $0x1,%edi
         5:	75 1a                	jne    21 <collatz_c+0x21>
         7:	eb 2c                	jmp    35 <collatz_c+0x35>
         9:	0f 1f 80 00 00 00 00 	nopl   0x0(%rax)
        10:	89 fa                	mov    %edi,%edx
        12:	83 c0 01             	add    $0x1,%eax
        15:	c1 ea 1f             	shr    $0x1f,%edx
        18:	01 d7                	add    %edx,%edi
        1a:	d1 ff                	sar    %edi
        1c:	83 ff 01             	cmp    $0x1,%edi
        1f:	74 12                	je     33 <collatz_c+0x33>
        21:	40 f6 c7 01          	test   $0x1,%dil
        25:	74 e9                	je     10 <collatz_c+0x10>
        27:	8d 7c 7f 01          	lea    0x1(%rdi,%rdi,2),%edi
        2b:	83 c0 01             	add    $0x1,%eax
        2e:	83 ff 01             	cmp    $0x1,%edi
        31:	75 ee                	jne    21 <collatz_c+0x21>
        33:	f3 c3                	repz retq 
        35:	f3 c3                	repz retq 
      

      collatz_c, linux/amd64, gcc 6.3.0, -O2, s/int/unsigned int/:

      0000000000000000 <collatz_c>:
         0:	31 c0                	xor    %eax,%eax
         2:	83 ff 01             	cmp    $0x1,%edi
         5:	74 23                	je     2a <collatz_c+0x2a>
         7:	66 0f 1f 84 00 00 00 	nopw   0x0(%rax,%rax,1)
         e:	00 00 
        10:	89 fa                	mov    %edi,%edx
        12:	8d 4c 7f 01          	lea    0x1(%rdi,%rdi,2),%ecx
        16:	d1 ea                	shr    %edx
        18:	83 e7 01             	and    $0x1,%edi
        1b:	0f 45 d1             	cmovne %ecx,%edx
        1e:	83 c0 01             	add    $0x1,%eax
        21:	83 fa 01             	cmp    $0x1,%edx
        24:	89 d7                	mov    %edx,%edi
        26:	75 e8                	jne    10 <collatz_c+0x10>
        28:	f3 c3                	repz retq 
        2a:	f3 c3                	repz retq
      
    7. 2

      I tried to get this running on my machine, but ran into issues on Arch Linux (something to do with Stack and Cabal? I’m not a Haskell guy). One thing I noticed is that there’s no --release flag on the cargo command, and the optimizations for Rust really do make a difference. Could someone add that argument to shake.hs and see what the speeds look like?