1. 18
  1. 24

    …or just use xor. One of those cool tricks that unfortunately has no real benefit (apart from being cool obviously).

    1. 4

      xor notably fails when given two pointers to the same variable.

      1. 10

        So will the subtracting trick, obviously.

        This is for swapping two registers, so there is no such thing as “pointers”. If you want to swap two things in RAM then it’s simpler and faster to just load both of them and write them back in the opposite places.

        In 1980 (my last year of high school) I did an aptitude test to try to get a university sponsorship from the NZ Post Office, and was expected to come up with this trick as one of the test questions.

      2. 1

        I remember learning about this trick when I was in high school. I used it… in a Visual Basic program… to swap strings. I did work, but lord knows how. (I don’t mean xoring the contents of the strings either, the classic a ^= b; b ^= a; a ^= b; sequence)

      3. 11

        GCC was smart enough to recognize both forms and convert it to an XCHG instruction.

        1. 2

          Is that an x86 instruction? I wonder what the microcode does

          1. 6

            Most likely renames the registers.

            1. 1

              It’s much more commonly used with a memory operand and the LOCK prefix applied, which turns it into a reasonably important atomic instruction. (I don’t know if it’s ever been commonly used for genuinely swapping registers; perhaps back in the bad old pre-64-bit days when register pressure was pretty intense.)

              1. 2

                (I don’t know if it’s ever been commonly used for genuinely swapping registers; perhaps back in the bad old pre-64-bit days when register pressure was pretty intense.)

                Looking back at my answer, I think you’re right. I replied without thinking about it too much, and now I’m not sure there are really any major reasons to be swapping registers these days. I’ve used it when writing 16-bit real mode bootloaders certainly, when the BIOS expects certain registers as parameters while the instructions you used to compute those values might only take different registers as operands. But nowadays on machines with a plethora of registers and more orthogonal instruction sets (as in, any register can be an operand to any instruction; I’m not referring to the sub [mem], reg orthogonality of x86_64), swapping wouldn’t be needed.

                1. 1

                  Yeah, there are very few instructions still in common use (and that have survived the x86_64 transition) that are fussy about specific registers. mul/imul/div/idiv with their insistence on using eax/rax and edx/rdx for various inputs and outputs spring to mind. There also continue to be a few instructions specialised on the “counter” register (ecx/rcx), but I suspect their non-fussy counterparts with slightly longer encodings are unlikely to be worse than the compact variants combined with a mov or xchg in practice.

                  Beyond that, the main reason to care about register use is the ABI’s function calling convention. I can think of some pathological cases where xchg would likely be the best choice in that scenario, for example:

                  int foo(int a, int b)
                  {
                    return bar(b, a);
                  }
                  

                  Should presumably compile down to:

                  _foo:
                  xchg esi, edi
                  jmp _bar
                  

                  Which, as it turns out, is exactly what GCC does with -Os; clang, curiously, does the 3-way mov dance:

                  https://gcc.godbolt.org/z/8xq5Mz3bW

                2. 2

                  The lock prefix is implicit; all exchanges to memory are atomic.

                3. 1

                  Interesting, is that a thing, in hardware/microcode?

                  1. 2

                    Register renaming is a hardware feature to improve performance and IPC. While x86-64 has 16 integer registers, in reality modern CPU’s have over a hundred of general purpose registers (e.g. Zen 3 CPU’s have 192 entry integer register file) that are assigned in the process of translation of the x86 instructions to microcode. https://en.wikipedia.org/wiki/Register_renaming

                4. 2

                  Yes, but the concept does exist on some other CPUs as well. The 68k has the EXG instruction.

              2. 9

                This is easy in Python! a,b = b,a Yeah, sure, maybe it uses a temporary variable to actually implement it. I sure don’t care. As a bonus it’s correct for all types.

                (If you really care about this as a nerd optimization problem the only code that’s acceptable is assembly language, along with annotations of exactly which CPU ports are in contention.)

                1. 14

                  In the CPython interpreter (results may vary with other interpreters), because the actual interpreter is a stack-based VM, it’s a pretty simple trick: a, b = b, a is implemented in bytecode as

                  LOAD_FAST
                  LOAD_FAST
                  ROT_TWO
                  STORE_FAST
                  STORE_FAST
                  

                  Which is just pushing the two values to top of stack, swapping their positions, then popping them off the stack again and storing them under their new names.

                  1. 2

                    Similar exists in JS. When I implemented pattern matching assignment in JSC one of the things the logic had to do is avoid creation of temporary objects when you do something like [a,b]=[b,a] (which is the JS equivalent syntax)

                    1. 2

                      Thats how it works in C# too: (a, b) = (b, a);

                      1. 2

                        Same in Ruby. Never really understood the vague obsession some folks had with it as a neat trick, it’s just a one-liner 🤭

                      2. 5

                        That was a fun thing that came up with a friend on IRC when I was still in high school, learning C and Linux on my free time, which I had a lot of. They challenged me to do this, and I readily solved the non-problem. We’re talking over 20 years ago.

                        Then it came up again in class in University. I’d be surprised if somebody here didn’t think about it.

                        I prefer EOR (XOR) as an answer, because, in hardware, it does not involve carry, therefore, however unlikely, it does at least have the potential to be faster.

                        1. -1

                          it does not involve carry, therefore, however unlikely, it does at least have the potential to be faster.

                          Sorry, but that’s bullshit. Even on old processors like the gameboy, the timing was exactly the same. Could it generate some picowatts more of heat? Maybe, but probably not; and you have certainly generated more heat already thinking about it

                          1. 3

                            I said “potential” and “however unlikely” for a reason.

                            In Verilog, you can quickly see what happens to timing constraints doing xor vs doing add (with carry) as you increase the width of the register. Propagating values in a chain isn’t free.

                            However, I am not aware of add being slower than xor in any CPU. In no small way because the register width of a general-purpose machine is selected to avoid this constraint.

                            1. 1

                              register width is selected to avoid this constraint

                              It’s true that, in parallel, addition is O(n) while xor is O(1). But I have difficulty believing that a word size of, say, 2, 4, or even 8 times the current norm (64 bits) would result in slower adds than xors.

                            2. 3

                              I believe some of the Forth chips from Greenarrays cannot execute two adds in a row, because of that carry.

                              A less serious niche where you can feel this are home grown TTL computers. A 16-bit adder build out of 4-bit adders is at least 3 times slower than an equivalent XOR… or you’d need another chip to perform carry lookahead. In any case, this adder can easily end up being your frequency bottleneck, to the point where you might design 2 cycle adders instead so you can speed up the rest of the CPU.

                          2. 3

                            Adding the two numbers together creates information which is effectively the same thing as a temporary variable, which is just a convenient memory location to store the extra information that was created. This is just a buggier way of using temporary data storage - it tries to temporarily use the unused data storage within the operands (which may actually be used -oops!)

                            This would work if they were “big ints” which allocate additional storage if the number gets too big. But that additional storage allocated is exactly the same thing as a temporary variable.

                            1. 9

                              It works with bounded integers that wrap around. It only needs group axioms. No need for bigints.

                              1. 4

                                Thanks for the correction 👍

                            2. 3

                              I was asked this question in a job interview. I really don’t understand the reason of asking something like this, it’s just a arithmetical trick.

                              1. 4

                                It’s to see if you’re spending too much time on sites like this ;)

                              2. 2

                                This works! Watch out for numeric overflow:

                                let a = 500000000000000000001
                                undefined
                                let b = 500000000000000000001
                                undefined
                                a = a + b
                                1e+21
                                b = a - b
                                500000000000000000000
                                a = a - b
                                500000000000000000000
                                
                                1. 3

                                  in other words, it doesn’t work ;-)

                                  1. 8

                                    Specifically it doesn’t work with number systems that don’t obey group axioms, such as IEEE754.

                                2. 2

                                  This could be dangerous if used with floating points. I would expect a loss in precision would be likely, especially if a and b are of greatly differing magnitudes.

                                  1. 1

                                    In terms of terrible xor/add/sub hacks I’ve always liked the doubly linked list using a single pointer :)

                                    1. 1

                                      In ES6, without a temporary variable or addition:

                                      > let a = 5
                                      > let b = 10
                                      > [a, b] = [b, a]
                                      > console.log(a, b)
                                      10 5
                                      
                                      1. 2

                                        In fact, while were at it, more languages: https://programming-idioms.org/idiom/21/swap-values

                                      2. 1

                                        I was expecting to read about the XOR trick in the OP, or something equally cunning.

                                        In Swift, BTW, it’s a built-in:

                                        var x = 1
                                        var y = 2
                                        swap(&x, &y)
                                        
                                        1. 1

                                          In Go you can do

                                          a,b=b,a
                                          

                                          but the compiler won’t generate some fancy instruction and instead uses a third register to swap the elements. For ARM64 it generates:

                                                  MOVD    R0, R2
                                                  MOVD    R1, R0
                                                  MOVD    R2, R1
                                          

                                          and the equivalent for AMD64 (godbolt).

                                          Update the same assembly is generated when using the trick from the article.