1. 29

“Here, we do something that is not possible to do in Rust or C - we safely stack-allocate the intermediate value, eliminating all heap allocations.”

  2. 18

    Except the intermediate values are stack-allocated in Rust and C as well. The article also claims that stack-allocated data is faster than heap-allocated data, which is dubious without further clarification: malloc is indeed slower than pushing to the stack, and the stack will almost certainly be in cache, but this program is small enough that its working set will always be in cache, so there should be minimal difference.

    I’m curious what the modular function is doing in Rust - the C implementation just uses a simple (n % 2) == 0 while the Rust code does ((n % 2) + 2) % 2 and it’s not clear why.

    1. 7

      ((n % 2) + 2) % 2 properly handles negatives, though it shouldn’t matter here.
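
      To make that concrete (a quick sketch of my own, not code from the article): Rust’s %, like C’s, is a truncating remainder and can be negative for a negative dividend; the extra step maps the result back into {0, 1}.

      fn main() {
          let n: i64 = -7;
          // Truncating remainder: -7 % 2 == -1, not 1.
          assert_eq!(n % 2, -1);
          // Adding 2 and reducing again maps the result back into {0, 1}.
          assert_eq!(((n % 2) + 2) % 2, 1);
      }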

    2. 17

      I can’t wait for “faster than rust” to be the new meme.

      1. 12

        eliminating all heap allocations

        Rust isn’t heap allocating anything.

        1. 11

          And C isn’t either.

          I haven’t had the time to dig into the codegen, but I posted it to /r/rust; we’ll see what they say :)

          EDIT:

          Looks to me like the use of signed integers is inhibiting some optimizations.

          https://www.reddit.com/r/rust/comments/7m99wo/outperforming_rust_with_functional_programming/drsbori/
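
          For anyone not clicking through, the short version (sketched below with my own toy functions, not the article’s code): a signed n / 2 has to round toward zero, so the compiler emits a sign-bit bias plus an arithmetic shift, while the unsigned version is a single logical shift.

          // Sketch only; comparing the codegen of these two on x86-64,
          // the signed one becomes roughly "add the sign bit, then sar"
          // while the unsigned one is a plain shr.
          pub fn half_signed(n: i64) -> i64 {
              n / 2
          }

          pub fn half_unsigned(n: u64) -> u64 {
              n / 2
          }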

          1. 6

            I doubt that; these are the unmodified results (on my machine):

            test bench_collatz_106239 ... bench:         467 ns/iter (+/- 18)
            test bench_collatz_10971  ... bench:         367 ns/iter (+/- 12)
            test bench_collatz_2223   ... bench:         229 ns/iter (+/- 7)
            

            And here is after replacing i64 with u64:

            test bench_collatz_106239 ... bench:         509 ns/iter (+/- 4)
            test bench_collatz_10971  ... bench:         388 ns/iter (+/- 3)
            test bench_collatz_2223   ... bench:         267 ns/iter (+/- 10)
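
            For context, the i64 → u64 change is just a type swap in the loop; the benchmarked function has roughly this shape (a sketch, not the article’s exact source):

            // Roughly the shape of the benchmarked function; only the
            // integer type differs between the two runs above.
            fn collatz(mut n: u64) -> u64 {
                let mut steps = 0;
                while n != 1 {
                    n = if n % 2 == 0 { n / 2 } else { 3 * n + 1 };
                    steps += 1;
                }
                steps
            }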
            
            1. 12

              Yeah, there’s a bunch of different speculation going on in that thread.

              Regardless, sometimes Rust will not be literally the fastest possible thing. That’s okay, as long as it’s close. I find these little examples interesting, regardless of who “wins”.

              1. 11

                I really like how the Rust community is approaching this as “how can we do better?”

        2. 20

          Something the other comments aren’t touching on is that what’s slow about functional programming has nothing to do with a collatz function. These are just integers, so the values will always be copied in and out of registers instead of mutated (probably; it depends on what the compiler feels like doing). What’s slow about ‘functional programming’ is when you have larger datatypes and you choose to build an entirely new one instead of mutating an old one. What’s slow is when the idiomatic datastructure is the linked list, which wreaks havoc on your cache, or other tree-based persistent datastructures. What’s slow is the abuse of dynamically dispatched first-class functions, which compilers go to great lengths to avoid but which is sometimes unavoidable.

          The only way to make functional programming fast, as a paradigm, is to really commit to linear/affine types (like BOTH Rust and ATS have, actually), so the compiler has enough information to determine that it can reuse the memory space.
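
          A rough Rust illustration of that last point (my own sketch, nothing from the article): when the compiler knows a value is uniquely owned, a “functional-looking” update can reuse the existing allocation instead of building a new structure.

          // Ownership (an affine discipline) lets this update mutate in
          // place: `v` is moved in, so no other copy of the old version
          // can be observed, and the existing buffer is reused.
          fn bump_all(mut v: Vec<u64>) -> Vec<u64> {
              for x in v.iter_mut() {
                  *x += 1;
              }
              v
          }

          // Without that uniqueness information, a persistent-style update
          // has to allocate a fresh structure and copy everything over.
          fn bump_all_shared(v: &[u64]) -> Vec<u64> {
              v.iter().map(|x| x + 1).collect()
          }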

          I’m a secret Standard ML fan and I’ve been using Rust since mid-to-late 2014, so of course I like ATS. :) I just find the “point” of the article pretty misleading.

          1. 4

            Never heard of ATS, and it’s not linked to in the post. Here it is:

            http://www.ats-lang.org

            ATS is a statically typed programming language that unifies implementation with formal specification. It is equipped with a highly expressive type system rooted in the framework Applied Type System, which gives the language its name. In particular, both dependent types and linear types are available in ATS.

            1. 2

              How was the C function tested? With/without -O2 makes a difference, and using unsigned integers also makes a difference.

              collatz_c, linux/amd64, gcc 6.3.0 without optimizations:

              0000000000000000 <collatz_c>:
                 0:	55                   	push   %rbp
                 1:	48 89 e5             	mov    %rsp,%rbp
                 4:	89 7d ec             	mov    %edi,-0x14(%rbp)
                 7:	c7 45 fc 00 00 00 00 	movl   $0x0,-0x4(%rbp)
                 e:	eb 2e                	jmp    3e <collatz_c+0x3e>
                10:	8b 45 ec             	mov    -0x14(%rbp),%eax
                13:	83 e0 01             	and    $0x1,%eax
                16:	85 c0                	test   %eax,%eax
                18:	75 11                	jne    2b <collatz_c+0x2b>
                1a:	8b 45 ec             	mov    -0x14(%rbp),%eax
                1d:	89 c2                	mov    %eax,%edx
                1f:	c1 ea 1f             	shr    $0x1f,%edx
                22:	01 d0                	add    %edx,%eax
                24:	d1 f8                	sar    %eax
                26:	89 45 ec             	mov    %eax,-0x14(%rbp)
                29:	eb 0f                	jmp    3a <collatz_c+0x3a>
                2b:	8b 55 ec             	mov    -0x14(%rbp),%edx
                2e:	89 d0                	mov    %edx,%eax
                30:	01 c0                	add    %eax,%eax
                32:	01 d0                	add    %edx,%eax
                34:	83 c0 01             	add    $0x1,%eax
                37:	89 45 ec             	mov    %eax,-0x14(%rbp)
                3a:	83 45 fc 01          	addl   $0x1,-0x4(%rbp)
                3e:	83 7d ec 01          	cmpl   $0x1,-0x14(%rbp)
                42:	75 cc                	jne    10 <collatz_c+0x10>
                44:	8b 45 fc             	mov    -0x4(%rbp),%eax
                47:	5d                   	pop    %rbp
                48:	c3                   	retq   
              

              collatz_c, linux/amd64, gcc 6.3.0, -O2:

              0000000000000000 <collatz_c>:
                 0:	31 c0                	xor    %eax,%eax
                 2:	83 ff 01             	cmp    $0x1,%edi
                 5:	75 1a                	jne    21 <collatz_c+0x21>
                 7:	eb 2c                	jmp    35 <collatz_c+0x35>
                 9:	0f 1f 80 00 00 00 00 	nopl   0x0(%rax)
                10:	89 fa                	mov    %edi,%edx
                12:	83 c0 01             	add    $0x1,%eax
                15:	c1 ea 1f             	shr    $0x1f,%edx
                18:	01 d7                	add    %edx,%edi
                1a:	d1 ff                	sar    %edi
                1c:	83 ff 01             	cmp    $0x1,%edi
                1f:	74 12                	je     33 <collatz_c+0x33>
                21:	40 f6 c7 01          	test   $0x1,%dil
                25:	74 e9                	je     10 <collatz_c+0x10>
                27:	8d 7c 7f 01          	lea    0x1(%rdi,%rdi,2),%edi
                2b:	83 c0 01             	add    $0x1,%eax
                2e:	83 ff 01             	cmp    $0x1,%edi
                31:	75 ee                	jne    21 <collatz_c+0x21>
                33:	f3 c3                	repz retq 
                35:	f3 c3                	repz retq 
              

              collatz_c, linux/amd64, gcc 6.3.0, -O2, s/int/unsigned int/:

              0000000000000000 <collatz_c>:
                 0:	31 c0                	xor    %eax,%eax
                 2:	83 ff 01             	cmp    $0x1,%edi
                 5:	74 23                	je     2a <collatz_c+0x2a>
                 7:	66 0f 1f 84 00 00 00 	nopw   0x0(%rax,%rax,1)
                 e:	00 00 
                10:	89 fa                	mov    %edi,%edx
                12:	8d 4c 7f 01          	lea    0x1(%rdi,%rdi,2),%ecx
                16:	d1 ea                	shr    %edx
                18:	83 e7 01             	and    $0x1,%edi
                1b:	0f 45 d1             	cmovne %ecx,%edx
                1e:	83 c0 01             	add    $0x1,%eax
                21:	83 fa 01             	cmp    $0x1,%edx
                24:	89 d7                	mov    %edx,%edi
                26:	75 e8                	jne    10 <collatz_c+0x10>
                28:	f3 c3                	repz retq 
                2a:	f3 c3                	repz retq
              
              1. 2

                I was trying to get this to run on my machine, but I ran into issues on Arch Linux (something having to do with Stack and Cabal? I’m not a Haskell guy). One thing I noticed was that there’s no --release flag on the cargo command, and the optimizations for Rust really do make a difference. Could someone add that argument to shake.hs and see what the speeds look like?
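
                Assuming shake.hs just shells out to a plain cargo invocation (I haven’t read it, so this is only a guess at the shape of the fix), the difference would be something like:

                cargo build            # debug build, no optimizations
                cargo build --release  # optimized build, what the benchmarks should use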