1. 25
    1. 11

      Plankalkül had custom-width integer types in 1942.

      1. 3

        Oh wow, I didn’t realize! I must include that.

    2. 3

      Common Lisp has support for “custom-width integer types”.

      For example: http://clhs.lisp.se/Body/f_by_by.htm

      I don’t have much to say on that, apart from the fact that this is why we often see “octet” in Common Lisp where you would usually see “byte” in other languages.
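
      The linked `byte` / `ldb` operators aren’t C, but as a rough illustration, here is a C sketch of what `(ldb (byte width position) x)` computes — extracting `width` bits starting at bit `position` (the function name mirrors the Lisp one for clarity; it isn’t a real C API):

      ```c
      #include <stdint.h>
      #include <stdio.h>

      /* C analogue of Common Lisp (ldb (byte width position) x):
         extract `width` bits of x starting at bit `position`.
         Assumes 0 < width < 64. */
      static uint64_t ldb(unsigned width, unsigned position, uint64_t x) {
          return (x >> position) & ((UINT64_C(1) << width) - 1);
      }

      int main(void) {
          /* (ldb (byte 3 4) #b1110100) evaluates to 7 in Common Lisp */
          printf("%llu\n", (unsigned long long)ldb(3, 4, 0x74));
          return 0;
      }
      ```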

      1. 6

        While “octet” does reinforce that point, I think seeing it often in Lisp may have more to do with Lisp culturally pre-dating the 8-bit byte by a couple of decades. People who deal with networking standards / documents with traditions dating to before memory settled on being 8-bit addressable (e.g., the 4004, 6502, etc. were early-to-mid-1970s) also use “octet” just to be precise / specific.

        1. 2

          Octet was broadly used in PDP-11 lingo, as opposed to (16-bit) word, and probably in other mini systems.

        2. 2

          Common Lisp was heavily influenced by the PDP-10 lisps of the 1970s and 1980s; the PDP-10 was a 36-bit computer.

    3. 2

      It’s worth noting that both AArch64 and recent x86 include bitfield-extract and bitfield-insert instructions. These significantly improve code density (and reduce rename-register pressure) relative to the shifts and masks needed to extract unsigned values from smaller bit representations.
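
      For a sense of what those instructions replace, here is the shift-and-mask pattern in C (a sketch; on AArch64, or on x86 with BMI1 enabled, compilers may fold the extract into a single UBFX/BEXTR and the insert into a BFI):

      ```c
      #include <stdint.h>
      #include <stdio.h>

      /* Extract `width` bits of x starting at bit `pos`
         (what BEXTR on x86 / UBFX on AArch64 do in one instruction).
         Assumes 0 < width < 64. */
      static uint64_t bf_extract(uint64_t x, unsigned pos, unsigned width) {
          return (x >> pos) & ((UINT64_C(1) << width) - 1);
      }

      /* Insert the low `width` bits of `field` into x at bit `pos`
         (what BFI on AArch64 does in one instruction). */
      static uint64_t bf_insert(uint64_t x, unsigned pos, unsigned width,
                                uint64_t field) {
          uint64_t mask = ((UINT64_C(1) << width) - 1) << pos;
          return (x & ~mask) | ((field << pos) & mask);
      }

      int main(void) {
          printf("%llx\n", (unsigned long long)bf_extract(0xABCD, 4, 8));
          printf("%llx\n", (unsigned long long)bf_insert(0xABCD, 4, 8, 0xFF));
          return 0;
      }
      ```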

      1. 1

        Thanks again for this valuable piece of information. Seems like BEXTR is in the BMI1 extension of x86.

        I guess I should compile with -march=native more often! (Now I’m wondering: does godbolt allow selection of CPU target & features?)

        1. 1

          > Thanks again for this valuable piece of information. Seems like BEXTR is in the BMI1 extension of x86.

          Yup, it’s not quite sufficiently widely deployed to be on by default, but it will be fine for a lot of things. Some Linux distros are moving to the x86-64 microarchitecture levels (x86-64-v2, -v3), which will make this easier.

          > Now I’m wondering: does godbolt allow selection of CPU target & features?

          It lets you pass any flags you like, including -march and -mcpu, to the compiler.

    4. 1

      I am skeptical of the latency numbers being “negligible”, since it means your operation takes 3 ALU ports instead of 1. I don’t know much about how such things work for realsies, though.

      1. 3

        That’s a good point, and I didn’t elaborate on this. It’s workload and scheduling dependent.

        My thinking was this: in the case of compressed pointers, you’d generally load them -> dereference them -> transform the underlying values -> store the modified values. As you correctly mentioned, the ALU ports are contended; but this is only a problem if the transform step itself performs a bunch of ALU-heavy instructions.

        Two more things:

        1. we can dispatch the load + deref instructions of the next loop iteration before doing the ALU-heavy transformation of the current iteration.

        2. after execution of the mov, a data-dependent instruction needs to wait for 1 cycle before finishing execution.

        With this in mind, once our CPU queues start filling up, we can execute the shr and and instructions at the same time as the mov of the previous iteration and its 1-cycle pause. Basically, smart scheduling gets those two lost cycles back, bringing the total penalty to a bit less than 1 cycle per iteration (which is negligible).
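
        To make the decode step concrete, here is a sketch of the shr/and (+ add) sequence being discussed — the field layout and names are made up purely for illustration, not taken from any particular runtime:

        ```c
        #include <stdint.h>
        #include <stdio.h>

        /* Hypothetical compressed pointer: the low 3 bits hold a tag, the
           rest an 8-byte-aligned offset from a heap base.  Decoding is
           the shr + and discussed above, plus an add. */
        enum { TAG_BITS = 3 };

        static uint64_t decode(uint64_t heap_base, uint32_t compressed) {
            /* strip the tag: shr then shl (equivalent to an and-mask) */
            uint64_t offset = ((uint64_t)compressed >> TAG_BITS) << TAG_BITS;
            return heap_base + offset; /* full pointer */
        }

        int main(void) {
            /* compressed 0x43: tag = 3, offset = 0x40 */
            printf("%llx\n", (unsigned long long)decode(0x100000, 0x43));
            return 0;
        }
        ```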

        1. 1

          Hmmm, now that I’m awake enough to think about this more, that makes sense. The times when you’d mix lots of arithmetic and pointer dereferences would be when walking through arrays or structures in a loop, but for those you would probably just do the mask/shift/etc once and then do your math to the full pointer value in a register. The times when you would be doing lots of pointer loads would be while chasing through linked structures, and you’re likely to be spending lots of time there waiting on memory. So I think you’re right and the extra math required usually isn’t very significant.

          Thanks! Would still be fascinating to see benchmarks sometime, but if I care enough I can conjure those up for myself.

      2. 1

        I don’t know anything about ALU ports (sounds like it’s just “inputs the ALU can take at a time”), but if that’s the case, the user can always pick the ideal integer size. It’s the best of both worlds: do you want to reduce memory usage, or increase speed? :)