1. 1

    Thanks for posting this; I have been looking for ways to speed up a checksum routine.

    1. 1

      Thanks for posting this!
      The wildcard issue was new to me and I have a number of server apps affected by it.

      1. 1

        If memset is a hot instruction (that is, it’s being called frequently), why didn’t that get moved to hardware? Something like an instruction to memory to zero out a big, aligned, power-of-two chunk of memory?

        1. 3

          If memset is a hot instruction (that is, it’s being called frequently), why didn’t that get moved to hardware?

          Block copy and clear instructions have a long history in hardware; here’s a comp.arch thread that discusses some of the difficulties involved with implementing them.

          Also, the ARM instruction set was recently modified to add memcpy (CPYPT, CPYMT, CPYET) and memset (SETGP, SETGM, SETGE) instructions.

          1. 1

            Neat. I didn’t know about these new ARM instructions.

            That blog post mentions that they are also making NMIs a standard feature again.

            Thx

          2. 2

            why didn’t that get moved to hardware?

            It did. See x86’s ‘rep’ family of instructions, ‘clzero’, etc. Rep has a high startup cost; it has gotten better but it is still not free. (Clzero is very specialized and lacks granularity.) The technique implemented by the linked post aims to improve performance of small-to-medium-sized memsets, where you can easily beat the hardware. The calculus is complicated by second-order effects on i$/btb (e.g. see here, sec. 4.4, and note that ‘cmpsb’ is never fast). My own implementation is slower at very small sizes, but ~half the size/branches of the linked version.

            Such small sizes are empirically very rare; but application-side specialization can nevertheless clean up the chaff. Dispense with ‘memset’ entirely and statically branch to memset_small, memset_medium, clear_page, etc. Overgenerality is the bane of performance. Compare the performance of malloc vs your own purpose-built allocator, or dumb virtual dispatch vs closed-world matching or inline-cached JIT (which amounts to the same thing).
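
            A rough C sketch of that kind of application-side specialization (the names memset_small, clear_page, and struct header are invented for illustration, not taken from the linked post):

                #include <stddef.h>
                #include <stdint.h>

                /* Hypothetical specialized variants; each may assume things a general
                   memset cannot (fixed size, known alignment). */
                static inline void memset_small(void *p, size_t n)     /* n <= 64, say */
                {
                    unsigned char *b = p;
                    for (size_t i = 0; i < n; i++)         /* short, predictable loop */
                        b[i] = 0;
                }

                static inline void clear_page(void *p)                 /* one aligned 4 KiB page */
                {
                    uint64_t *w = p;                       /* size and alignment are known, */
                    for (size_t i = 0; i < 4096 / 8; i++)  /* so wide stores are easy */
                        w[i] = 0;
                }

                struct header { char bytes[48]; };

                /* Call sites usually know which case they are in and branch statically. */
                static void reset_header(struct header *h) { memset_small(h, sizeof *h); }
                static void recycle_page(void *page)       { clear_page(page); }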

            1. 1

              you can easily beat the hardware

              Fun fact: at one point, software popcnt could go faster than hardware popcnt.


              Here is another demonstration of the way specialization can improve performance vs generic solutions.

              1. 1

                rep is an interesting instruction, but I think I was not clear in my question. I was wondering why the option to clear out chunks of memory didn’t move to the memory itself. Issuing repeated operations from the CPU still takes a lot of round trips on the memory bus, and the latencies add up. If something is performance critical, why not do it at the very edge, which in this case is the memory chip/board itself?

                1. 2

                  Heh, everybody wants their problems to run directly on memory!

                  1. ‘Shows up on a profiler’ ≠ ‘performance critical’. The complexity is just not worth it, especially as it is far-reaching (what are the implications for your cache-coherency protocol?).

                  2. Again, the things that were showing up on the profiler were not bandwidth-limited; they fit comfortably in L1. Touching main memory at all would be extremely wasteful.

                  3. There are some bandwidth-limited problems. The most obvious example is kernels needing to zero memory before handing it out to applications. But the performance advantage is not there; memory is written to many more times than it is mapped. DragonFly BSD reverted its idle-time zeroing.

                  1. 2

                    DDRwhatever sticks are simple memory banks with no logic in them; you can’t move anything into them.

                    The memory controller is in your SoC (it was in the northbridge in the old times). Moving the clear command just into the controller doesn’t win much, I guess.

                    Now, this might make some sense if you move the memory controller to be remote again, talking over a higher-latency serial link (hello IBM), I guess.

                    1. 1

                      You often don’t want this to move to the memory directly because you’re setting the contents of memory that you’re about to use or have just used. In either of those cases it either wants to be, or already is, in the cache. At a minimum, you’d need CPU instructions that invalidated the cache lines that were present and then told memory to set the pattern in a range.

                  2. 1

                    I think rep stosq will work on x86. But that doesn’t mean it’s fast.
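
                    For reference, a hedged sketch of what that looks like from C with GCC/Clang inline assembly on x86-64 (it assumes the length is a multiple of 8 bytes, and says nothing about whether it is fast):

                        #include <stddef.h>
                        #include <stdint.h>

                        /* Zero a buffer with `rep stosq`: RDI = destination, RCX = qword count,
                           RAX = value to store. Assumes `bytes` is a multiple of 8. */
                        static void zero_rep_stosq(void *dst, size_t bytes)
                        {
                            void   *d = dst;
                            size_t  qwords = bytes / 8;
                            __asm__ volatile("rep stosq"
                                             : "+D"(d), "+c"(qwords)
                                             : "a"(UINT64_C(0))
                                             : "memory");
                        }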

                    1. 1

                      It is guaranteed to be quite fast if your CPUID reports the ERMS flag (Enhanced REP MOVSB). That would be >= Ivy Bridge on the Intel side, and only >= Zen 3 on AMD.
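
                      If it helps, one way to check for that flag from C with GCC/Clang’s <cpuid.h> (ERMS is reported in CPUID leaf 7, subleaf 0, EBX bit 9):

                          #include <cpuid.h>      /* GCC/Clang wrapper around the CPUID instruction */
                          #include <stdbool.h>

                          static bool has_erms(void)
                          {
                              unsigned int eax, ebx, ecx, edx;
                              if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
                                  return false;          /* leaf 7 not supported at all */
                              return (ebx >> 9) & 1;     /* bit 9: Enhanced REP MOVSB/STOSB */
                          }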

                  1. 4

                    This is an interesting thread on writing Makefiles that are POSIX-compatible. The interesting thing is that it’s very hard or impossible, at least if you want to keep some standard features like out-of-tree builds. I’ve never restricted myself to writing portable Makefiles (I use GNU extensions freely), but I previously assumed it wasn’t that bad.

                    That this is so hard is maybe a good example of why portability to different dependencies is a bad goal when your dependencies are already open source and portable. As many posters in the thread say, you can just use gmake on FreeBSD. The same goes for many other open source dependencies: If the software is open source, portability to alternatives to that software is not really important.

                    1. 4

                      you can just use gmake on FreeBSD.

                      I can, but I don’t want to.

                      If you want to require any specific tool or dependency, fine, that’s your prerogative; just don’t force your idea of the tool’s cost on me. Own your decision: if it impacts me, be frank about it. Just don’t bullshit me that it doesn’t impact me just because the cost for you is less than the cost for me.

                      The question of why don’t you use X instead of Y is nobody’s business but mine. I fully understand and expect that you might not care about Y, please respect my right not to care about X.

                      1. 11

                        That’s very standard rhetoric about portability, but the linked thread shows it’s not so simple in this case: It’s essentially impossible to write good, portable Makefiles.

                        1. 5

                          Especially considering how low the cost of using GNU Make is compared to, say, switching OS or architecture.

                          1. 2

                            It’s just as easy to run BSD make on Linux as it is to run GNU make on the BSDs, yet if I ship my software to Linux users with a BSD makefile and tell them to install BSD make, there will hardly be a person who wouldn’t scoff at the idea.

                            Yet Linux users expect BSD users not to complain when they do the exact same thing.

                            Why is this so hard to understand? The objection is not that you have to run some software dependency; the objection is to people telling you that you shouldn’t care about the nature of the dependency because their cost for that dependency is different from yours.

                            I don’t think that your software is bad because it uses GNU make, and I don’t think that using GNU make makes you a bad person, but if you try to convince me that “using GNU make is not a big deal”, then I don’t want to ever work with you.

                            1. 2

                              Are BSD makefiles incompatible with GNU make? I actually don’t know.

                              1. 2

                                The features, syntax, and semantics of GNU and BSD make are disjoint. Their intersection is POSIX make, which has almost no features.

                                …but that’s not the point at all.

                                1. 2

                                  If they use BSD-specific extensions, then yes.

                            2. 2

                              POSIX should really standardize some of GNU make’s features (e.g. pattern rules), and/or the BSDs should just adopt them.

                              1. 5

                                I get the vibe at this point that BSD intentionally refuses to make improvements to their software specifically because those improvements came from GNU, and they really hate GNU.

                                Maybe there’s another reason, but why else would you put up with a program that is missing such a critically important feature and force your users to go thru the absurd workarounds described in the article when it would be so much easier and better for everyone to just make your make better?

                                1. 4

                                  I get the vibe at this point that BSD intentionally refuses to make improvements to their software specifically because those improvements came from GNU, and they really hate GNU.

                                  Really? I’ve observed the opposite. For example, glibc refused to adopt the strl* functions from OpenBSD’s libc, in spite of the fact that they were useful and widely implemented, and the refusal to merge them explicitly called them ‘inefficient BSD crap’ in spite of the fact that they were no less efficient than existing strn* functions. Glibc implemented the POSIX _l-suffixed versions but not the full set from Darwin libc.
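
                                  For anyone who hasn’t met them, a small sketch of the strlcpy semantics under discussion (declared in <string.h> on the BSDs; elsewhere you may need libbsd or a recent glibc):

                                      #include <stdio.h>
                                      #include <string.h>   /* strlcpy: BSD libc; only recently in glibc */

                                      /* strlcpy always NUL-terminates (given a non-zero size) and returns
                                         the length it tried to copy, so truncation is one comparison. */
                                      static void copy_name(char dst[16], const char *src)
                                      {
                                          if (strlcpy(dst, src, 16) >= 16)
                                              fprintf(stderr, "name truncated: %s\n", src);
                                      }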

                                  In contrast, you’ll find a lot of ‘added for GNU compatibility’ functions in FreeBSD libc, and the *BSD utilities have ‘for GNU compatibility’ in a lot of places. Picking a utility at random, FreeBSD’s du has two flags that are listed in the man page as first appearing in the GNU version, whereas GNU du does not list any as coming from the BSDs (though -d, at least, was originally in FreeBSD’s du - the lack of it in GNU and OpenBSD du used to annoy me a lot since most of my du invocations used -d0 or -d1).

                                  1. 2

                                    The two are in no way mutually exclusive.

                                  2. 1

                                    Maybe there’s another reason, but why else would you put up with a program that is missing such a critically important feature and force your users to go thru the absurd workarounds described in the article when it would be so much easier and better for everyone to just make your make better?

                                    Every active software project has an infinite set of possible features or bug fixes; some of them will remain unimplemented for decades. glibc’s daemon function, for example, has been broken under Linux since it was implemented. The BSD Make maintainers just have a different view of the importance of this feature. There’s no reason to attribute negative intent.

                                    1. 1

                                      The BSD Make maintainers just have a different view of the importance of this feature

                                      I mean, I used to think that too, but after reading the article and learning the details I have a really hard time continuing to believe that. We’re talking about pretty basic everyday functionality here.

                                    2. 1

                                      Every BSD is different, but most BSDs are minimalist-leaning. They don’t want to add features not because GNU has them, but because they only want to add things they’ve really decided they need. It’s an anti-bloat philosophy.

                                      GNU on the other hand is basically founded in the mantra “if it’s useful then add it”

                                      1. 6

                                        I really don’t understand the appeal of the kind of philosophy that results in the kind of nonsense the linked article recommends. Why do people put up with it? What good is “anti-bloat philosophy” if it treats “putting build files in directories” as some kind of super advanced edge case?

                                        Of course when dealing with people who claim to be “minimalist” it’s always completely arbitrary where they draw the line, but this is a fairly clear-cut instance of people having lost sight of the fact that the point of software is to be useful.

                                        1. 4

                                          The article under discussion isn’t the result of a minimalist philosophy; it’s the result of a lack of standardisation. BSD make grew a lot of features that were not part of POSIX. GNU make also grew a similar set of features, at around the same time, with different syntax. FreeBSD and NetBSD, for example, both use bmake, which is sufficiently powerful to build the entire FreeBSD base system.

                                          The Open Group never made an effort to standardise any of them and so you have two completely different syntaxes. The unfortunate thing is that both GNU Make and bmake accept all of their extensions in a file called Makefile, in addition to looking for files called GNUmakefile / BSDmakefile in preference to Makefile, which leads people to believe that they’re writing a portable Makefile and complain when another Make implementation doesn’t accept it.

                                2. 7

                                  But as a programmer, I have to use some build system. If I chose Meson, that’d be no problem; you’d just have to install Meson to build my software. Ditto if I chose cmake. Or mk. Why is GNU make any different here? If you’re gonna wanna compile my software, you better be prepared to get my dependencies onto your machine, and GNU make is probably gonna be one of the easiest build systems for a BSD user to install.

                                  As a Linux user, if your build instructions told me to install bsdmake or meson or any other build system, I wouldn’t bat an eye, as long as that build system is easy to install from my distro’s repos.

                                  1. 3

                                    Good grief, why is this so difficult to get through? If you want to use GNU make, or Meson, or whatever, then do that! I use GNU make too! I also use Plan 9’s mk, which few people have installed, and even fewer would want to install. That’s not the point.

                                    The problem here has nothing to do with intrinsic software properties at all, I don’t know why this is impossible for Linux people to understand.

                                    If you say “I am using GNU make, and if you don’t like it, tough luck”, that’s perfectly fine.

                                    If you say “I am using GNU make, which can’t cause any problem for you because you can just install it” then you are being ignorant of other people’s needs, requirements, or choices, or you are being arrogant for pretending other people’s needs, requirements, or choices are invalid, and of course in both cases you are being patronizing towards users you do not understand.

                                    This has nothing to do with GNU vs. BSD make. It has nothing to do with software, even. It’s a social problem.

                                    if your build instructions told me to install bsdmake or meson or any other build system, I wouldn’t bat an eye, as long as that build system is easy to install from my distro’s repos.

                                    And this is why Linux users do not understand the actual problem. They can’t fathom that there are people for whom the above way of doing things is unacceptable. It’s perfectly fine not to cater to such people; what’s not fine is to insist that their reasoning is invalid. There are people to whom extrinsic properties of software are far more important than their intrinsic properties. It’s ironic that Linux people have trouble understanding this, given that this is the raison d’être for the GNU project itself.

                                    1. 5

                                      I think the question is “why is assuming gmake is no big deal any different than assuming meson is no big deal?” And I think your answer is “those aren’t different, and you can’t assume meson is no big deal” but you haven’t come out and said that yet.

                                  2. 1

                                    I can, but I don’t want to.

                                    Same. Rewriting my Makefiles is so annoying that so far I have resigned myself to just calling gmake on FreeBSD. Maybe one day I will finally do it. I never really understood how heavily GNUism “infected” my style of writing software, until I switched to the land of the BSD.

                                  3. 2

                                    What seems to irk BSD users the most is putting GNUisms in a file called Makefile; they see the file and expect to be able to run make, yet that will fail. Naming the file GNUmakefile is an oft-accepted compromise.

                                    I admit I do not follow that rule myself, but if I ever thought a BSD user would want to use my code, I probably would follow it, or use a Makefile-generator.

                                    1. 4

                                      I’d have a lot more sympathy for this position if BSD make was actually good, but their refusal to implement pattern rules makes it real hard to take seriously.

                                      1. 2

                                        I’d have a lot more sympathy for this position if BSD make was actually good

                                        bmake is able to build and install the complete FreeBSD source tree, including both kernel and userland. The FreeBSD build is the most complex make-based build that I’ve seen and is well past the level of complexity where I think it makes sense to have hand-written Makefiles.

                                        For the use case in question, it’s worth noting that you don’t need pattern rules; bmake puts things in obj or $OBJDIRPREFIX by default.

                                    2. 1

                                      That this is so hard is maybe a good example of why portability to different dependencies is a bad goal when your dependencies are already open source and portable.

                                      I mean, technically you are right, but in my opinion, you are wrong because of the goal of open source.

                                      The goal of open source is to have as many people as possible using your software. That is my premise, and if it is wrong, the rest of my post does not apply.

                                      But if that is the goal, then portability to different dependencies is one of the most important goals! The reason is that it shows the user empathy. Making things as easy as possible for users is being empathetic towards them, and while they may not notice that you did it, subconsciously they do. They don’t give up as easily, and in fact, sometimes they even put extra effort in.

                                      I saw this when porting my bc to POSIX make. I wrote a configure script that uses nothing other than POSIX sh. It was hard, mind you, I’m not denying that.

                                      But the result was that my bc was so portable that people started using it on the BSDs without my knowledge, and one of those users decided to spend effort to demonstrate that my bc could make serious performance gains and help me to realize them once I made the decision to pursue that. He also convinced FreeBSD to make my bc the system default for FreeBSD 13.

                                      Having empathy for users, in the form of portability, makes some of them want to give back to you. It’s well worth it, in my opinion. In fact, I just spent two days papering over the differences between filesystems on Windows and on sane platforms so that my next project could be portable enough to run on Windows.

                                      (Oh, and my bc was so portable that porting it to Windows was little effort, and I had a user there help me improve it too.)

                                      1. 4

                                        The goal of open source is to have as many people as possible using your software.

                                        I have never heard that goal before. In fact, given current market conditions, open source may not be the fastest way if that is your goal. Millions in VC to blow on marketing does wonders for user acquisition.

                                        1. 1

                                          That is true, but I’d also prefer to keep my soul.

                                          That’s the difference. One is done by getting users organically, in a way that adds value. The other is a way to extract value. Personally, I don’t see Open Source as having an “extract value” mindset in general. Some people who write FOSS do, but I don’t think FOSS authors do in general.

                                        2. 4

                                          The goal of open source is to have as many people as possible using your software.

                                          I actually agree with @singpolyma that this isn’t necessarily a goal. When I write software and then open source it, it’s often stuff I really don’t want many people to use: experiments, small tools or toys, etc. I mainly open source it because the cost to me of doing it is negligible, and I’ve gotten enough neat random bits and pieces of fun or interesting stuff out of other people’s weird software that I want to give back to the world.

                                          On the other hand, I’ve worked on two open source projects whose goal was to be “production-quality” solutions to certain problems, and knew they weren’t going to be used much if they weren’t open source. So, you’re not wrong, but I’d turn the statement around: open source is a good tool if you want as many people as possible using your software.

                                      1. 4

                                        The same authors also propose allowing use of 0/8 and 240/4.

                                        1. 15

                                          240/4 feels like the only one that could have legs here. I can’t see a world where 0/8 and 127/8 are anything but eternal martians, with anyone unlucky enough to get an IP in that space just doomed to have things never work.

                                          Can we just have IPv6 already? :/

                                          1. 4

                                            totally agree, we should have had IPv6 10 years ago - and yet here in Scotland my ISP cannot give me IPv6.

                                            1. 2

                                              Vote with your feet and change ISP.

                                              1. 4

                                                Neither of the two broadband ISPs available where I live provide IPv6. Voting with my feet would have to be uncomfortably literal.

                                            2. 1

                                                Call me naive but I’m actually not sure if 0/8 would be such a big problem. I’ve surely never seen it actively special-cased like 127/8. Which might just mean my experience with different makes of switches etc. is not the best, but for 127/8 I don’t even need to think hard about 10 things that would break, whereas 0/8 is more like “I’d have to check all the stuff and it might work”.

                                              1. 1

                                                That’s weird, I thought I’ve seen an IP address like 0.3.something publicly routable. Not completely sure, but I vaguely remember seeing something along those lines and thinking it was weird.

                                              2. 7

                                                Allocating 240/4 is the most realistic option because equipment that has hardcoded filters for it is either some really obscure legacy stuff that shouldn’t even be allowed access to the public Internet (like any unmaintained networked device) or, if maintained, should have never had it hardcoded and should be fixed.

                                                Maybe it’s their secret plan: make two infeasible proposals to make the 240/4 proposal look completely sensible by comparison. ;)

                                                1. 1

                                                  In all seriousness, I don’t think you have any concept of how much aging network kit there is out in the world which will never see a software upgrade ever again (either because the manufacturer doesn’t release them anymore, or because “it ain’t broke, why fix it?”).

                                                  1. 1

                                                    I know it quite well, but whose problem is it? Those people are already at a much greater risk than not being able to reach newly-allocated formerly reserved addresses.

                                                    1. 1

                                                      That may be the case but it’s ultimately everyone’s problem — there are network operators who will end up having to take on the support burden from users who can’t reach these services (whose hands may be tied for other reasons, e.g. organisational, budgetary etc), there are service operators who will end up having to take on the support burden from users who can’t reach their services (who can do basically nothing because it’s your network problem not ours), and there are users who will no doubt be unhappy when they can’t reach these services and don’t understand why (my friend or colleague says this URL works but for me it doesn’t).

                                              1. 1

                                                Today:
                                                I’m hacking on the transport protocol I’ve mentioned previously. My latest change is using Apps Hungarian to indicate fields that should be in host or network byte order. I have found several bugs by doing so; the code was originally written for a big-endian target, so the missing byte swaps would still have worked correctly there.
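
                                                Roughly what that convention looks like in C (the field and helper names are invented for illustration): an n-prefix means network byte order, an h-prefix means host order, so a missing ntohs/htons stands out at the use site.

                                                    #include <arpa/inet.h>   /* ntohs */
                                                    #include <stdint.h>

                                                    struct pkt_header {
                                                        uint16_t nSrcPort;   /* n...: network byte order, as on the wire */
                                                        uint16_t nDstPort;
                                                    };

                                                    static uint16_t dst_port(const struct pkt_header *hdr)
                                                    {
                                                        uint16_t hDstPort = ntohs(hdr->nDstPort);   /* convert before comparing or printing */
                                                        return hDstPort;
                                                    }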

                                                Tomorrow:
                                                We will be seeing Dune.

                                                1. 9

                                                  I’d say that it was a little excessive to say that the BMP file format “didn’t make it” when it was the only raster image format that the Windows OS natively supported for years, and is still supported by Firefox and Chrome out of the box. The WMF file format would have been a more interesting item to use there instead.

                                                  1. 5

                                                    Most screenshots I get from customers at $work are in .BMP. Still very much out there.

                                                    1. 2

                                                      You’re lucky! I’m still amazed at the number of people who send screenshots in Word documents or even Excel workbooks.

                                                      1. 2

                                                        I get those, sometimes also converted to a PDF to be more professional.

                                                    2. 4

                                                      It’s just a terrible headline.

                                                      The article lists formats that were (and are) in widespread use, where most people reading the article probably interacted with more than half. That’s fairly successful.

                                                      The real point is that we have faster CPUs than in the 80s, with more ability to compress; we have more bits per pixel and more pixels which increases the benefit of compression; and we moved to networks that are frequently bandwidth constrained. Hence, formats today are more compressed than formats in the late 80s/early 90s. That doesn’t mean they failed, it just means tradeoffs change.

                                                      1. 1

                                                      Yeah. Three I had never heard of, two I knew by name, and five I have used anywhere from sometimes to often.

                                                        But if the definition is “it’s not PNG, JPEG, GIF, or SVG”, then yes, they didn’t make it.

                                                      2. 3

                                                        Same goes for IFF ILBM on the Amiga; it was the only picture format for graphics of 256 colours or less, making it the universal format.

                                                        For that matter, TIFF was still the only way we handled photos when I worked in publishing; it can handle CMYK and its only real contender is Photoshop’s internal format.

                                                        1. 3

                                                        Yeah but I also think it’s fair to say that IFF ILBM ‘didn’t make it’. Sure, it was the lingua franca for images at the time but it only ever truly took off on the Amiga, and although those of us who were Of the Tribe may feel like The One True Platform was the most important thing EVER in the HISTORY OF COMPUTING, if we’re honest with ourselves - it wasn’t :)

                                                          #include <intuition.h> FOR-EVER! :)

                                                          1. 2

                                                            Well, at least it was included, unlike XPM/PPM or Degas Elite, but I still think this was mostly a tendentious listicle.

                                                            1. 3

                                                              Sure I mean the whole idea is an exercise in futility. There are always going to be unhappy NERRRDS grousing about how their Xerox INTERLISP 4 bit image format got left out :)

                                                            2. 2

                                                              RIFF, a little-endian variant of IFF, lives on in WAV, AVI, WEBP, and other formats.

                                                              1. 1

                                                            those of us who were Of the Tribe may feel like The One True Platform was the most important thing EVER in the HISTORY OF COMPUTING

                                                                …I’m in this picture and I don’t like it.

                                                                1. 2

                                                              I get it, but everyone is young once, and sadly that, along with the concomitant dearth of life experience that helps you scope your opinions against commonly perceived reality, means we all get a pass, and rightly deserve it :)

                                                                  It’s the folks who NEVER grow out of this that are sadly crippled and deserve our pity, and maybe where appropriate our help.

                                                              2. 3

                                                                3.x’s Datatypes were so awesome, though.

                                                                Although I remember setting a JPEG as my Workbench backdrop and it would take several minutes before it would display after startup on my 1200, until I downloaded a 68020-optimized (maybe 030/FPU optimized? I got an upgrade at some point) data type from Aminet and it would display after only a second or so.

                                                                Good times.

                                                                1. 2

                                                                  Or convert to ILBM. You only need to do it once, then it loads instantly :)

                                                                  1. 3

                                                                    I mean yes but Datatypes were so much cooler. Also I think I had a good reason at the time, but I don’t remember what.

                                                                    I do remember downloading an MP3 (Verve Pipe’s “The Freshmen”) and trying to play it on my 1200. The machine simply was not fast enough to decode the MP3 in real time, so I converted it to uncompressed 8SVX…it played just fine but took up an enormous portion of my 40MB hard drive.

                                                                    1. 1

                                                                      With accelerator boards, mp3 (and vorbis, and opus) are feasible. From a quick search, apparently a 68040/40MHz will handle mp3@320.

                                                                And, back then, mp2 was still popular and used fewer resources.

                                                                2. 2

                                                              I have a vague feeling that .AVI, which was for a while an extremely prevalent container for video content, is some derivation of IFF (though perhaps not ILBM).

                                                                3. 3

                                                                  WMF is a walking security vulnerability. One of the opcodes is literally “execute the code at this offset in the file” and there were dozens of parsing vulns on top of that.

                                                                1. 1

                                                                  Work:

                                                                  • Application support as usual.

                                                                  Personal:

                                                                  • I’m hacking on an old transport protocol implementation from the 80s. Is TCP/IP Illustrated: The Implementation still the best reference on implementing low-level network protocols?
                                                                  1. 1

                                                                    Somehow I find it more interesting when someone “brings the future to the seventies” (such as when someone ports Rust to the 6502) than when someone “brings the seventies to the future”, as is the case of porting a BSD to a brand new CPU architecture.

                                                                    1. 3

                                                                      I don’t think that’s a fair characterisation. The core abstractions in FreeBSD are those from 4BSD, so it’s late ’80s. In contrast, RISC-V is providing early ’80s abstractions, not learning from any of the lessons from SPARC or Arm from the ’90s onwards. This is more the case of someone bringing the late ’80s to the mid ’80s.

                                                                      The exciting RISC-V feature that I learned about most recently: It does not guarantee that implementations will clear the reservation on interrupt, intentionally so that load-linked and store-conditional can be implemented in Machine Mode. This means that the kernel has to explicitly do a dummy load-linked to a fixed address on every context switch to avoid store conditionals spuriously succeeding and atomic operations behaving surprisingly. I don’t know of any other architecture where this has been necessary.
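
                                                                      Very roughly, that workaround looks something like the following (an invented helper, not actual FreeBSD code): the context-switch path performs a dummy load-reserved to a kernel-owned scratch word, so any reservation left behind by the previous context no longer matches and its store-conditional cannot spuriously succeed.

                                                                          #include <stdint.h>

                                                                          static uint64_t lr_sc_scratch;    /* kernel-owned scratch word */

                                                                          /* Called on every context switch; assumes RV64 with the A extension. */
                                                                          static inline void clear_stale_reservation(void)
                                                                          {
                                                                              uint64_t tmp;
                                                                              __asm__ volatile("lr.d %0, (%1)"        /* dummy load-reserved */
                                                                                               : "=r"(tmp)
                                                                                               : "r"(&lr_sc_scratch)
                                                                                               : "memory");
                                                                              (void)tmp;
                                                                          }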

                                                                      1. 2

                                                                        I don’t think that’s a fair characterisation. The core abstractions in FreeBSD are those from 4BSD, so it’s late ’80s. In contrast, RISC-V is providing early ’80s abstractions, not learning from any of the lessons from SPARC or Arm from the ’90s onwards. This is more the case of someone bringing the late ’80s to the mid ’80s.

                                                                        What would you consider should be learned from SPARC?

                                                                        1. 4

                                                                          The first example that comes to mind: SPARC initially didn’t have a broadcast I-cache invalidate and didn’t have a coherent data and instruction cache. This caused problems in early SMP SPARC systems because any time that you loaded a shared library you needed to IPI all cores and do local i-cache invalidates over every cache line in the new mapping. If you didn’t then occasionally you’d see a stale line in your i-cache and you’d run nonsense instructions. This got a lot worse with Java adding JITs. Later SPARC added broadcast i-cache invalidate because it turns out that if you’ve already built a cache-coherency protocol and a bus or NoC that can send cache coherency messages then sending i-cache invalidates as broadcast messages over this interconnect is almost free and is vastly faster than doing an IPI.

                                                                          RISC-V decided not to make the instruction and data caches coherent[1] and made the I-cache invalidate instruction local to the current hart. This means that it comes with all of the same overheads that SPARC suffered from and fixed a couple of decades before RISC-V came along. I think there’s now an extension that does a broadcast i-cache invalidate but it’s an optional extension and so you can’t rely on it (though you can trap and emulate it if it isn’t supported, which is not much worse than requiring a syscall + broadcast IPI to do the invalidate). This is table-stakes functionality: it’s something that SPARC learned in the ’90s is essential for good performance of an operating system on SMP hardware.

                                                                          [1] This alone isn’t necessarily a bad idea. It’s quite easy to make instruction caches coherent with data caches, the real problems come in store forwarding within the pipeline. On x86, you can use a mov instruction to write to the address of the next instruction and the pipeline will do the right thing. The hardware complexity required for this is pretty significant.

                                                                        2. 1

                                                                          I know you’ve worked on FreeBSD and know more about operating systems than most of us, so I think we see this at different levels. From my point of view, as a user, I don’t see the same abstractions in BSD as you do, and the abstractions I see don’t exactly feel like they were thought up in the late eighties.

                                                                          Fascinating about the RISC-V; I’ve seen you criticise it before which saddens me not because I have any stakes in RISC-V but simply because I expected something seemingly new to have learned the lessons of its predecessors.

                                                                          1. 4

                                                                            I know you’ve worked on FreeBSD and know more about operating systems than most of us, so I think we see this at different levels. From my point of view, as a user, I don’t see the same abstractions in BSD as you do, and the abstractions I see don’t exactly feel like they were thought up in the late eighties.

                                                                            Don’t take it too negatively, but FreeBSD (and Linux, and Windows) are still embodiments of designs created for minicomputers with incremental improvements. They are designed as multi-user systems where a user is trusted with their data. These days most users have multiple computers and run software that, even if its author is trusted, may be compromised and so users need tools to prevent a compromised application from damaging their system. Capsicum is probably the best tool available on commodity operating systems for this but it’s relatively invasive to software. Capsicum is an attempt to retrofit ideas from more modern designs such as Coyotos to POSIX.

                                                                            They provide a flat namespace filesystem that abstracts away details of things like durability and name-to-storage mapping at a level that, inevitably as a one-size-fits-all approach, is too abstract for some things (e.g. databases that need to precisely control persistence for ACID compliance) and too low-level for others (GUI applications that want to deal with documents with rich metadata and indexing). The namespace that they provide is local to a computer, which is increasingly not the level of granularity that you want for modern applications (if I have a laptop, a phone, a tablet, a desktop, a smartThingy, and so on, I don’t want each to expose its own private namespace I want them to all have a shared namespace for me).

                                                                            On the server side, we’re gradually seeing that the right abstractions are closer to Nemesis and Exokernel. A traditional OS is responsible for providing two sets of functionality:

                                                                            • Isolating individual applications.
                                                                            • Abstracting details of hardware from applications.

                                                                            These come into conflict. The second of these is typically conflated with the provision of a shared namespace, such as a filesystem, IPC namespace, port number, and so on. When you don’t need a shared namespace, all of this adds overhead. With things like Intel’s S-IOV and other alternatives (which I hope will eventually be standardised by the PCI-SIG), it’s possible to do kernel-bypass device assignment directly to a process and scale to tens of thousands of processes in a single machine. The kernel doesn’t need to provide device abstractions because processes can just have direct access to devices and libraries can provide the correct abstractions. This is already done for high-performance networking things via DPDK or NetMap and by most GPU drivers, using SR-IOV (which has a much higher hardware cost than S-IOV).

                                                                            Isolation becomes a lot easier if you’re not providing the shared namespace because your attack surface is smaller. VM isolation uses exactly the same hardware as process isolation but it’s a lot more secure because a hypervisor provides a much smaller attack surface than a *NIX kernel (which is, in turn, a much smaller attack surface than the NT kernel). Windows + Hyper-V is starting to shift like this, using VM isolation and exposing a narrow set of interfaces to allow single-application Windows or Linux VMs to see a subset of the host filesystem. Once you’re in that world, you start wondering why the Windows or Linux kernel in the VM is providing a load of IPC and shared-namespace filesystem services when there’s only a single consumer of the sharing.

                                                                            The server OS really needs only to provide network access because most other services are exposed over the network and they aren’t OS services anymore because they’re parts of a distributed system.

                                                                            If you were designing an OS from scratch today, for server or client usage, it would look very different from UNIX. Neither the hardware nor the requirements look anything like the ones that created UNIX.

                                                                            Fascinating about the RISC-V; I’ve seen you criticise it before which saddens me not because I have any stakes in RISC-V but simply because I expected something seemingly new to have learned the lessons of its predecessors.

                                                                            I think there are technical and political problems with RISC-V. The technical problems arise from having an ISA that is full of premature optimisation. A lot of the early decisions were based on experience implementing an in-order core with no FPU and with an unoptimised C compiler. For example, the encoding puts source registers in the same place in every instruction, which causes immediates to be split. This is great on an in-order core with no FPU because you can do register fetch in parallel with decode. It becomes less useful when you have an FPU and SIMD registers because you don’t know which register file to fetch from (if you don’t care about power, you could fetch from all of them) and it’s largely useless when you have register renaming because you need to know which register file the register comes from before you can find the right rename register. At the same time, splitting the immediates requires longer wires, going around corners, which is very painful at 7nm. The decision (which, I believe, was just reversed last week) to require compressed instructions to be expandable into a single non-compressed instruction came from the desire in Andrew’s dissertation to be able to plug in the compressed-instruction decoder as an extra pipeline stage. If you’re designing a high-performance core, that’s definitely not what you’d do.

                                                                            A lot of the other decisions came from not talking to experienced software people. For example, during the FreeBSD bringup we fed back that it would be useful to have separate page-table base registers for user and supervisor mode. On other platforms that provide this, it simplifies the pmap implementation. Linux didn’t need it, so they didn’t add it. Now Linux is starting to use it on platforms that support it because it’s a massive perf win for some of their speculative execution vulnerability mitigations. The i-cache thing is another example of both: it was done to optimise for microcontrollers, without listening to folks who had lived through the problem on Solaris and been there when SPARC fixed it.

                                                                            Other ISAs went through a lot of this and learned from it. MIPS, SPARC, and ARMv1 were all designed to aggressively take advantage of things that were possible with the microarchitecture of their first-generation parts and then suffered. For example, MIPS branch delay slots were annoying on later implementations and caused pain in software. Arm had a bunch of mistakes that were only apparent in retrospect. Making the PC a general-purpose register is amazing as an assembly-language programmer and it was simple in the first in-order parts (just stall the pipeline until an instruction that writes to pc retires) but it added a lot of pain with branch prediction and superscalar implementations (any instruction might be a branch and so every pipeline needs to be able to update control flow, you can’t have a separate pipeline for branch instructions). The load/store multiple instructions made function prologs and epilogs very short and was easy to implement with a simple state machine, right up until they added an MMU and then it was an instruction that could trap in the middle of execution.

                                                                            The political problems are largely caused by a conflict of interest. The folks leading the RISC-V Foundation are also investors in SiFive, but the interests of SiFive are largely in conflict with the interests of the ecosystem as a whole. There’s also quite a lot of arrogance in the process. In the first talk I saw by Krste about RISC-V, he was bragging about how short the RISC-V spec was in comparison to the Arm spec. Everyone in the audience cringed because the Arm spec is longer for two reasons:

                                                                            • It has a lot more detail and less ambiguity. Each instruction definition is accompanied by pseudocode that is generated from a formal model that can also be executed and used for conformance tests.
                                                                            • It has a lot of features that were all added for a specific software need, which RISC-V lacks.

                                                                            The political problems are also likely to lead to ecosystem problems. Fragmentation was the thing that really killed MIPS. x86 has had huge binary compatibility advantages, not just in the core ISA but also in things like platform firmware, standard PICs, and so on. Until very recently, you could still boot DOS 1.0 on a modern PC (you still can if it has a BIOS compat layer in the UEFI implementation, they’re just no longer ubiquitous - the most recent board I bought didn’t have one, which led to some fun manually repartitioning a disk and installing the bits of FreeBSD necessary to make it UEFI-bootable. Fortunately I had a swap partition at the start that I could use for UEFI and the man pages about how the boot process works are fantastic, so manually setting up the loader was quite easy). Arm has aggressively reduced fragmentation over the last 20 years to the degree that Android applications using the NDK can easily run on any handset and Linux can ship a single kernel binary that boots on most SoCs. MIPS didn’t, and so everyone targeted the MIPS R4K baseline for userspace code because that was the lowest common denominator and required a custom kernel for each device.

                                                                            RISC-V is in the same state. The core ISA is so bad that everyone is going to need to ship extensions and they’re going to want to ship incompatible extensions as differentiating features. That doesn’t matter too much in the embedded space where you’re building everything in the firmware for the specific device but it’s a huge problem in the big-computer world. I think the most likely outcome for RISC-V is that Google will define a RISC-V Android Extension, which contains a huge number of new instructions, a different standard PIC to the SiFive one, and an alternative Supervisor and Hypervisor spec, which will be the only RISC-V ISA that they’ll support for Android. The RISC-V Foundation will keep standardising whatever they want but no one will care about anything other than the Google version and that will be completely under Google’s control.

                                                                      1. 22

                                                                        It’s worth remembering that the original form of GOTO that Dijkstra objected to was very different from the C one. His article was written at a time when structured programming was only starting to gain traction. The C goto is a purely local construct that affects flow control within a subroutine, whereas older languages used goto in place of explicit subroutines with strict call-return semantics, scoping, and all of the things that are so ingrained in modern programming languages that it’s easy to forget that they’re optional. This made it incredibly hard to understand the control flow of a non-trivial program because it would jump all over the place with ad-hoc policies for how it selected the next code fragment to jump to.

                                                                        The C goto eliminates most of these problems because it’s closely tied to the scoping rules of the language. You can’t goto x if x is a label in another function. You can, as the article says, jump over variable initialisation, but you’re implicitly jumping through the allocation of the variable, so the allocation / deallocation of automatic-storage variables is still well defined.
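
                                                                        A tiny C illustration of both points: the label is only visible within its function, and jumping over the initialisation skips the assignment but not the allocation, so the variable still exists (with an indeterminate value) on that path.

                                                                            #include <stdio.h>

                                                                            static void example(int skip)
                                                                            {
                                                                                if (skip)
                                                                                    goto done;    /* legal: local label; jumps over x's initialisation */

                                                                                int x = 42;       /* on the skipping path, x is allocated but never assigned */
                                                                                printf("x = %d\n", x);

                                                                            done:                 /* a goto from another function to this label will not compile */
                                                                                return;
                                                                            }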

                                                                        1. 5

                                                                          Here’s a concrete example of old-school Fortran.

                                                                             C PROGRAM TO SOLVE THE QUADRATIC EQUATION  
                                                                                READ 10,A,B,C  
                                                                                DISC = B*B-4*A*C  
                                                                                IF (DISC) 15,25,35  
                                                                             15 R = 0.0 - 0.5 * B/A  
                                                                                AI = 0.5 * SQRTF(0.0-DISC)/A  
                                                                                PRINT 11,R,AI  
                                                                                GO TO 99  
                                                                             25 R = 0.0 - 0.5 * B/A    
                                                                                PRINT 21,R    
                                                                                GO TO 99  
                                                                             35 SD = SQRTF(DISC)  
                                                                                R1 = 0.5*(SD-B)/A  
                                                                                R2 = 0.5*(0.0-(B+SD))/A  
                                                                                PRINT 31,R2,R1  
                                                                             99 STOP  
                                                                             10 FORMAT( 3F12.5 )  
                                                                             11 FORMAT( 19H TWO COMPLEX ROOTS:, F12.5,14H PLUS OR MINUS,  
                                                                               * F12.5, 2H I )  
                                                                             21 FORMAT( 15H ONE REAL ROOT:, F12.5 )  
                                                                             31 FORMAT( 16H TWO REAL ROOTS:, F12.5, 5H AND , F12.5 )  
                                                                                END
                                                                          
                                                                          1. 1

                                                                            In case any young’uns are wondering “Where are the variable declarations? Wasn’t FORTRAN statically typed?”

                                                                            The answer lies in the old saying… “God is Real. Unless declared Integer.”

                                                                            Note the “read 10…” on the 2nd line….

                                                                            The input was fixed format, see the line labeled “10”?

                                                                            So even if your code was correct…. there were soooo many ways of screwing up running it.

                                                                            That “IF (DISC) 15,25,35”?

                                                                            If disc < 0 goto 15, if 0 goto 25, if >0 goto 35

                                                                            Brings back old trauma of maintaining this stuff.

                                                                            And as I say, old timer scientists mistrusted subroutines (so many ways of screwing up type safety with common blocks and the like)….

                                                                            So the average function size was HUUUGE and all variables at global scope…

                                                                            What a nightmare!

                                                                          2. 2

                                                                             Oh, the joys of FORTRAN II computed GOTOs and spaghetti code…..

                                                                            All written by very smart scientists….. who didn’t have a clue about code quality. Or tests. Or refactoring. Or design. They also often seemed suspicious and distrustful of subroutines.

                                                                            Such memories…..

                                                                            I feel nauseated.

                                                                          1. 4

                                                                            It’s true TCP/IP can be made better, and I’m glad someone out there is still trying.

                                                                             But how would you replace them? If you make a new protocol, you also need enough backbones to adopt and support it. Otherwise, all you have is an intranet.

                                                                            1. 2

                                                                               But how would you replace them? If you make a new protocol, you also need enough backbones to adopt and support it. Otherwise, all you have is an intranet.

                                                                              You can use tunnels, proxies, gateway hosts, and routers until the new protocol is available everywhere. This is what people did before TCP/IP was available everywhere.

                                                                              1. 4

                                                                                But then, the new protocols will suffer all the weaknesses of the old ones. So why would anyone bother to use it, other than a few technical enthusiasts?

                                                                            1. 1

                                                                               Header guards can have issues too. A typo in the macro name renders the guard useless, which can’t happen with #pragma once; badly chosen macro names can also clash, which again can’t happen with #pragma once.
                                                                               However, these issues are easy to avoid: typos are easy to detect, and name clashes are prevented by a good naming convention.

                                                                              Appending a GUID to the macro name is a common approach for addressing name collisions.
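
                                                                               A minimal, made-up illustration of that style (the file name and GUID below are invented, not taken from any real project):

                                                                                   /* widget.h: hypothetical example of a GUID-suffixed include guard */
                                                                                   #ifndef WIDGET_H_6F9619FF_8B86_D011_B42D_00CF4FC964FF
                                                                                   #define WIDGET_H_6F9619FF_8B86_D011_B42D_00CF4FC964FF
                                                                                   void widget_init(void);
                                                                                   #endif /* WIDGET_H_6F9619FF_8B86_D011_B42D_00CF4FC964FF */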

                                                                              1. 2

                                                                                Really? I can’t recall having seen that anywhere, or heard of it before…are there any major/well-known projects that do this? (Personally, it sounds kind of cumbersome to me; I’d much prefer the “use a good naming convention” solution suggested in the article.)

                                                                                1. 2

                                                                                   It’s the only sane way. Past a certain scale (definitely above 10k files, though I’d be wary from about 500 onwards) it’s impossible to rely on humans to come up with sufficiently unique identifiers.

                                                                                  1. 1

                                                                                    I dunno – taking the Linux kernel source tree as an example (~74K files, of which ~22K are headers), and with a back-of-the-napkin assumption that the first #ifndef in each header is the guard macro name:

                                                                                    $ git ls-files | grep '\.h$' | xargs -L500 grep -hm1 '#ifndef' | awk '{print $2}' | sort | uniq -c | awk '$1 != 1' > duped-guards.txt
                                                                                    $ wc -l duped-guards.txt
                                                                                    532 duped-guards.txt
                                                                                    

                                                                                    I won’t claim to have exhaustively checked every single one of those, but of the random twenty or so I did look at manually, they were all situations where it was a complete non-issue (or was perhaps entirely appropriate), such as in corresponding headers for different target architectures, or local-use-only headers in separate drivers. Sure, it’s possible there are some potentially-problematic collisions lurking somewhere in there, but empirically speaking, it seems to work pretty reasonably in practice.

                                                                                    1. 1

                                                                                      I encountered a header guard conflict today. It’s a small, crufty codebase where the authors used a leading underscore in the guards. Leading underscores are reserved for the implementation and my system has a header with a conflicting guard of _FILE_H. I resolved the issue by using a regex to append the project name to all header guards.

                                                                                  2. 1

                                                                                    Common is probably an overstatement; Visual C++ used to generate header guards with GUIDs and that’s where I saw it.
                                                                                    It’s fairly low maintenance; I keep the GUID file open in my editor and append unique GUIDs onto the end of guards when I create header files.

                                                                                1. 7

                                                                                   Personally I regard malloc() as a fundamentally broken API, and the operating systems don’t provide a reliable alternative.

                                                                                  One of my pet hates is all the introductory text telling everybody to check the return value for NULL… resulting in literally gigabytes of broken useless untested code doing ever more arcane and intricate steps to try and handle malloc returning null.

                                                                                  Most of it is demonstrably broken… and easy to spot… if an out of memory handler uses printf…. guess what. it’s b0rked. Printf uses malloc. Doh!

                                                                                  I always wrap malloc() to check for null and abort(). Invoke the Great Garbage Collector in the Sky.

                                                                                   Thereafter I rely on it being non-null.
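
                                                                                   A minimal sketch of such a wrapper (the name xmalloc is just a common convention, not a standard function; write(2) is used so the failure path itself never allocates):

                                                                                       #include <stdlib.h>
                                                                                       #include <unistd.h>

                                                                                       void *xmalloc(size_t size) {
                                                                                           void *p = malloc(size);
                                                                                           if (p == NULL) {
                                                                                               static const char msg[] = "fatal: out of memory\n";
                                                                                               write(STDERR_FILENO, msg, sizeof msg - 1);
                                                                                               abort();   /* the Great Garbage Collector in the Sky */
                                                                                           }
                                                                                           return p;
                                                                                       }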

                                                                                   These days I also allocate small or zero swap partitions…. by the time you’re swapping heavily… your program is not dead… just unusable. Actually worse than that: your program has made the entire system unusable. So the sooner the OOMKiller wakes and does its thing the better.

                                                                                  1. 13

                                                                                    One of my pet hates is all the introductory text telling everybody to check the return value for NULL…

                                                                                    It’s an extremely important thing to do in embedded systems, many of which are incredibly RAM-constrained (I own at least one board with only 16KB of RAM) and in older “retro” OSs (including MacOS pre-X, and IIRC Windows pre-XP) that don’t have advanced virtual memory.

                                                                                    Most of it is demonstrably broken… and easy to spot… if an out of memory handler uses printf…. guess what. it’s b0rked. Printf uses malloc. Doh!

                                                                                    You young fellers didn’t cut your teeth on out-of-memory errors the way I did :) Here’s how you do this: On startup you allocate a block big enough to handle your error-recovery requirements, say 16KB. Sometimes it was called the “rainy day fund.” When allocation fails, the first thing you do is free that block. Now you have some RAM available while you unwind your call chain and report the error.

                                                                                    In your event loop (or some equivalent) if your emergency block is null you try to reallocate it. Until then you operate in emergency low-memory mode where you disable any command that might use a lot of RAM. (You can also check the heap free space for other clues you’re running low.)

                                                                                    This behavior was baked into old classic-Mac app frameworks like MacApp and PowerPlant. If you didn’t use those frameworks (most apps didn’t), then you damn well rolled your own equivalent. Otherwise your testers or end users would be reporting lots and lots of crashes when memory ran low.

                                                                                    I never coded for Windows, DOS or AmigaOS, but I bet they had very similar band-aids.
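
                                                                                     A rough sketch of that rainy-day-fund pattern (names and the 16KB size are illustrative, not from any particular framework):

                                                                                         #include <stdlib.h>

                                                                                         #define RAINY_DAY_BYTES (16 * 1024)
                                                                                         static void *rainy_day;                 /* the emergency block */

                                                                                         void rainy_day_init(void) { rainy_day = malloc(RAINY_DAY_BYTES); }

                                                                                         /* Called from the event loop: try to refill the fund; a NULL fund
                                                                                            means we are still in low-memory mode. */
                                                                                         int low_memory_mode(void) {
                                                                                             if (rainy_day == NULL)
                                                                                                 rainy_day = malloc(RAINY_DAY_BYTES);
                                                                                             return rainy_day == NULL;
                                                                                         }

                                                                                         void *app_alloc(size_t size) {
                                                                                             void *p = malloc(size);
                                                                                             if (p == NULL && rainy_day != NULL) {
                                                                                                 free(rainy_day);                /* release the fund... */
                                                                                                 rainy_day = NULL;
                                                                                                 p = malloc(size);               /* ...so unwinding has room */
                                                                                             }
                                                                                             return p;                           /* may still be NULL */
                                                                                         }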

                                                                                    I always wrap malloc() to check for null and abort(). Invoke the Great Garbage Collector in the Sky.

                                                                                    That works fine in some use cases like a CLI tool, or a server that can get restarted if it crashes. It’s not acceptable in an interactive application; it makes the users quite upset.

                                                                                    Back in the late 80s I once came back from vacation to find myself hung in effigy on our team whiteboard, because the artists using my app kept losing their work when they ran into a particular memory crasher I’d introduced just before leaving.

                                                                                    (Oh, it’s not OK in a library either, because it defeats the calling code’s memory management. If I’m using a library and find that it aborts when something recoverable goes wrong, it’s a sign to stop using it.)

                                                                                    1. 9

                                                                                      You young fellers didn’t cut your teeth on out-of-memory errors the way I did :) Here’s how you do this: On startup you allocate a block big enough to handle your error-recovery requirements, say 16KB. Sometimes it was called the “rainy day fund.” When allocation fails, the first thing you do is free that block. Now you have some RAM available while you unwind your call chain and report the error.

                                                                                      I have been bitten by doing exactly this on a modern OS. If the OS performs overcommit then your rainy-day fund may not actually be accessible when you exhaust memory. You need to make sure that you pre-fault it (writing random data over it should work, if the OS does memory deduplication then writing non-random data may still trigger CoW faults that can fail if memory is exhausted) or you’ll discover that there aren’t actually pages there. I actually hit this with a reservation in the BSS section of my binary: in out-of-memory conditions, my reservation was just full of CoW views of the canonical zero page, so accessing it triggered an abort.
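
                                                                                       A sketch of what pre-faulting such a reservation can look like on a POSIX-ish system (size and names invented; the mlock call is best-effort and may fail under a small RLIMIT_MEMLOCK):

                                                                                           #include <stdlib.h>
                                                                                           #include <sys/mman.h>

                                                                                           #define EMERGENCY_BYTES (16 * 1024)
                                                                                           static unsigned char *emergency;

                                                                                           int emergency_reserve(void) {
                                                                                               emergency = malloc(EMERGENCY_BYTES);
                                                                                               if (emergency == NULL)
                                                                                                   return -1;
                                                                                               /* Touch every byte with data that won't deduplicate, so the
                                                                                                  kernel has to back the reservation with real pages now. */
                                                                                               for (size_t i = 0; i < EMERGENCY_BYTES; i++)
                                                                                                   emergency[i] = (unsigned char)rand();
                                                                                               (void)mlock(emergency, EMERGENCY_BYTES);  /* keep it resident */
                                                                                               return 0;
                                                                                           }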

                                                                                      Similarly, on modern platforms, just because malloc failed doesn’t mean that exactly the same malloc call won’t succeed immediately afterwards, because another process may have exited or returned memory. This is really bad on Linux in a VM because the memory balloon driver often doesn’t return memory fast enough and so you’ll get processes crashing because they ran out of memory, but if you rerun them immediately afterwards their allocations will succeed.

                                                                                      Back in the late 80s I once came back from vacation to find myself hung in effigy on our team whiteboard, because the artists using my app kept losing their work when they ran into a particular memory crasher I’d introduced just before leaving.

                                                                                       I think you learned the wrong lesson from this. A formally verified app will never crash from any situation within its reasoning framework, but anything short of that cannot guarantee that it will never crash, especially when running on a system with many other processes that are out of its control (or on hardware that may fail). The right thing to do is not to try to guarantee that you never crash but instead to try to guarantee that the user never loses data (or, at least, doesn’t lose very much data) in the case of a crash. Even if your app is 100% bug free, it’s running on a kernel that’s millions of lines of C code, using RAM provided by the lowest bidder, so the host system will crash sometimes no matter what you do.

                                                                                       Apple embraced this philosophy first with iOS and then with macOS. Well-behaved apps opt into a mechanism called ‘sudden termination’. They tell the kernel that they’ve checkpointed all state that the user cares about between run-loop iterations, and if the system is low on RAM it can just kill -9 some of them. The WindowServer process takes ownership of the crashed process’s windows and keeps presenting their old contents; when the process restarts it reclaims these windows and draws in them again. This has the advantage that when a Mac app crashes, you rarely lose more than a second or two of data. It doesn’t happen very often, but it doesn’t really bother me now when an app crashes on my Mac: it’s a 2-3 second interruption and then I continue from where I was.

                                                                                       There’s a broader lesson to be learned here, which OTP made explicit in the Erlang ecosystem and Google wrote a lot about 20 years ago when they started getting higher reliability than IBM mainframes on much cheaper hardware: the higher the level at which you can handle failure, the more resilient your system will be overall. If every malloc caller has to check for failure, then one caller in one library getting it wrong crashes your program. If you compartmentalise libraries and expect them to crash, then your program can be resilient even if no one checks malloc. If your window system and kernel expect apps to crash and provide recovery paths that don’t lose user data, your platform is more resilient to data loss than if you required all apps to be written to a standard that they never crash. If you build your distributed system expecting individual components to fail then it will be much more reliable (and vastly cheaper) than if you try to ensure that they never fail.

                                                                                      1. 3

                                                                                        Ever since I first read about it, I have always thought “crash only software” is the only way to make things reliable!

                                                                                        1. 2

                                                                                          I generally sympathize with the “crash-only” philosophy, but an issue with that approach is that sometimes a graceful shutdown path can significantly speed up recovery. (Of course, a counterargument is that not having a graceful shutdown path forces you to optimize recovery for all cases, and that in an emergency where recovery time is critical your app likely already crashed anyway.)

                                                                                          1. 1

                                                                                            One of the original papers benchmarked a graceful shutdown vs a crash and fsck for a journaled file system (ext3? ext4? can’t remember) and found crash and fsck was faster!

                                                                                            The actual use case for a graceful shut down is for things like de-registering from basestations.

                                                                                             But I would argue that such “shutdown activity” should be “business as usual”, with the only difference being that new requests for activity get rejected with “piss off, I’m shutting down”, and once that’s done: crash!

                                                                                            1. 3

                                                                                              Since you brought up filesystems, there is a lesson to be learned from ZFS: “crash and no fsck” is fastest – try to use atomic/transactional/CoW magic to make sure that any crash basically is graceful, since there’s nothing to corrupt.

                                                                                            2. 1

                                                                                               The idea with most ‘crash-only’ systems (assuming I’m understanding the term correctly - I don’t think I’ve heard it before) is that your shutdown path isn’t special. You have a small amount of uncommitted data at any given time but anything that actually needs to persist is always persisted. For example, you use an append-only file format that you periodically garbage collect by writing the compacted version to a new file and then doing an atomic rename. You may choose to do the compaction on a graceful shutdown, but you’re also doing it periodically, so nothing on the shutdown path is code that only ever runs at shutdown: your program is effectively doing a graceful shutdown every so often, so that it’s always in a recoverable state.

                                                                                              The core mindset is ‘assume that things can fail at any time’. This is vital for building a scalable distributed system because once you have a million computers the probability of one of them breaking is pretty high. Modern software increasingly looks like a distributed system and so ends up needing the same kind of mindset. Isolate whatever you can, assume it will fail at any given time.
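
                                                                                               A bare-bones sketch of that append-only-log-plus-atomic-rename idea (file names and the snapshot step are placeholders):

                                                                                                   #include <stdio.h>
                                                                                                   #include <unistd.h>

                                                                                                   /* Compact the live log by writing a fresh snapshot and publishing it
                                                                                                      with rename(); a crash at any point leaves either the old file or
                                                                                                      the new one intact, never a half-written mixture. */
                                                                                                   int compact_log(const char *live_path, const char *tmp_path) {
                                                                                                       FILE *out = fopen(tmp_path, "w");
                                                                                                       if (out == NULL)
                                                                                                           return -1;
                                                                                                       /* ... write the compacted snapshot of the live log here ... */
                                                                                                       if (fflush(out) != 0 || fsync(fileno(out)) != 0) {
                                                                                                           fclose(out);
                                                                                                           return -1;
                                                                                                       }
                                                                                                       fclose(out);
                                                                                                       return rename(tmp_path, live_path);  /* atomic on POSIX filesystems */
                                                                                                   }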

                                                                                              1. 1

                                                                                                Some background:

                                                                                                https://www.usenix.org/conference/hotos-ix/crash-only-software

                                                                                                https://lwn.net/Articles/191059/

                                                                                                https://brooker.co.za/blog/2012/01/22/crash-only.html

                                                                                                I think the full “crash-only” philosophy really requires infrastructure support in a runtime or VM, because sometimes it’s just not acceptable to bring the whole system down. There was some work on “micro-reboot” prototypes of the JVM (and I guess .NET AppDomains were supposed to implement a similar model), but so far AFAIK BEAM/Erlang is the only widely used runtime that implements the “micro-reboot” model.

                                                                                                1. 1

                                                                                                  You make a great point that the sort of recovery-optimizing cleanup one might do in a graceful shutdown path can instead be done periodically in the background. During Win7 there was an org-wide push to reduce system shutdown latency, and I remember doing some work to implement precisely this approach in the Windows Search Engine.

                                                                                          2. 7

                                                                                            It’s an extremely important thing to do in embedded systems…

                                                                                            That’s my day job.

                                                                                            The way I set things up is this…. in order from best to worst…

                                                                                            • If I can do allocation sizing at compile time… I will.
                                                                                               • Statically allocate most stuff for the worst case so blowing the RAM budget will fail at link time (there’s a small sketch of this after the list).
                                                                                            • A “prelink” allocation step (very much like C++’s collect2) that precisely allocates arrays based on what is going into the link and hence will fail at link time if budget is blown. (Useful for multiple products built from same codebase)
                                                                                            • Where allocations are run time configuration dependent… Get the configuration validator to fail before you can even configure the device.
                                                                                               • Where that is not possible, fail and die miserably at startup time… So at least you know that configuration doesn’t work before the device goes off to do its job somewhere.
                                                                                            • Otherwise record error data (using preallocated resources) and reset.. aka. Big Garbage Collector in the Sky. (aka. Regain full service as rapidly as possible)
                                                                                            • Soak test the hell out of it and record high water marks.
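
                                                                                               The static-allocation bullet above, as a tiny made-up sketch (sizes and names invented): the array lands in .bss, so the linker map and a compile-time assert catch a blown budget long before the device ships.

                                                                                                   #include <stdint.h>

                                                                                                   #define MAX_CLIENTS   8
                                                                                                   #define RX_BUF_BYTES  512

                                                                                                   static uint8_t rx_bufs[MAX_CLIENTS][RX_BUF_BYTES];

                                                                                                   /* Invented budget number: fail the build, not the field unit. */
                                                                                                   _Static_assert(sizeof(rx_bufs) <= 8 * 1024,
                                                                                                                  "rx buffers exceed the RAM budget for this part");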

                                                                                               I still find, despite all this, colleagues who occasionally write desperate, untested and untestable attempts at handling OOM conditions.

                                                                                            And every bloody time I have reviewed it…. the unwinding code is provably buggy as heck.

                                                                                            The key thing is nobody wants a device that is limping along in a degraded low memory mode. They want full service back again asap.

                                                                                            1. 1

                                                                                              Sounds like “fun”! I’ve got a side project making alternate firmware for a MIDI controller (Novation’s LaunchPad Pro) where I’ve been making a lot of use of static allocation and C++ constexpr … it’s been interesting to see how much I can do at compile/link time. IIRC I’ve been able to avoid all calls to malloc so far.

                                                                                              and every bloody time I have reviewed it…. the unwinding code is provably buggy as heck

                                                                                              Yeah, this was always a miserable experience developing for the classic Mac OS. QE would keep finding new ways to run out of memory on different code paths, and filing new crashers. But crashing wasn’t an option in a GUI app.

                                                                                              1. 3

                                                                                                The trick is back pressure.

                                                                                                Unwinding is nearly always a bad choice of architecture.

                                                                                                   To massively oversimplify, a typical device is a pipeline: from input events, through interacting with and perhaps modifying internal state, to output.

                                                                                                If something on the output side of that pipeline runs out of resource…. attempting to unwind (especially in a multithreaded real time system) is a nightmare beyond belief.

                                                                                                The trick is to either spec the pipeline so downstream always has more capacity / bandwidth / priority than upstream OR have a mechanism to sniff if my output queue is getting near full and so throttle the flow of input events by some means. (Possibly recursively).

                                                                                                   By throttle I mean things like ye olde xon/xoff flow control, blocking, dropping packets, etc…

                                                                                                The important principle is to do this as soon as you can before you have wasted cpu cycles or resources or … on an event that is going to be dropped or blocked anyway.
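
                                                                                                   A toy sketch of that input-side throttling (the queue, names and sizes are all invented for illustration):

                                                                                                       #define QUEUE_CAPACITY  256
                                                                                                       #define HIGH_WATERMARK  (3 * QUEUE_CAPACITY / 4)

                                                                                                       struct event;                       /* opaque payload */

                                                                                                       struct queue {
                                                                                                           unsigned head, tail;            /* head - tail = current depth */
                                                                                                           const struct event *slots[QUEUE_CAPACITY];
                                                                                                       };

                                                                                                       static unsigned queue_depth(const struct queue *q) {
                                                                                                           return q->head - q->tail;       /* unsigned wrap-around is fine */
                                                                                                       }

                                                                                                       /* Returns 0 if the event was accepted; -1 means apply back pressure
                                                                                                          (xoff, block, drop) before any work has been wasted on it. */
                                                                                                       int accept_event(struct queue *out, const struct event *ev) {
                                                                                                           if (queue_depth(out) >= HIGH_WATERMARK)
                                                                                                               return -1;
                                                                                                           out->slots[out->head++ % QUEUE_CAPACITY] = ev;
                                                                                                           return 0;
                                                                                                       }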

                                                                                                1. 1

                                                                                                  Yeah, I’ve used backpressure in networking code, and I can see it would be important in a small constrained device processing data.

                                                                                                2. 1

                                                                                                  Unrelated to malloc….

                                                                                                  I was watching this bloke https://www.youtube.com/watch?v=ihe9zV07Lgk

                                                                                                  His Akai MPK mini was giving me the “buy me’s”….

                                                                                                  Doing some research other folk are recommending the LaunchPad….

                                                                                                  What’s your opinion? I find it very interesting that you can reprogram the Novation… Does it come with docs and sdk and the like?

                                                                                                  1. 1

                                                                                                    So, I have the previous-generation LP Pro. They partially open-sourced it in 2015 or so. (I don’t believe the current model is supported.) Instead of releasing the stock firmware, they have a GitHub repo with a minimal framework you can build on. Much of it is still a binary blob. The README describes their reasoning.

                                                                                                    I found it easy to get started with — there’s even a sort of emulator that lets you run it on your computer for easier debugging.

                                                                                                    But you really are starting from scratch. There are empty functions you fill in to handle pad presses and incoming MIDI, and some functions to call to light up pads and send MIDI. So even recreating what the stock firmware does takes significant work. But I’m having fun with it.

                                                                                              2. 4

                                                                                                IIRC Windows pre-XP

                                                                                                     It’s probably still important to this day. Windows has never supported memory overcommit - you cannot commit more memory than physical RAM plus the pagefile can back. This is why the pagefile tends to be at least as large as the amount of physical memory installed.

                                                                                                1. 2

                                                                                                  One of Butler Lampson’s design principles: “Leave yourself a place to stand.”

                                                                                                2. 3

                                                                                                  These days I also allocate small or zero swap partitions….

                                                                                                  Have you read this? https://lobste.rs/s/rgv1sv/defence_swap_common_misconceptions_2018

                                                                                                  It made me reconsider swap, anyway.

                                                                                                  1. 1

                                                                                                    No I hadn’t…. it’s a pretty good description.

                                                                                                         Didn’t learn much I didn’t know, beyond the definition of the swappiness tunable….

                                                                                                         …and that cgroups have some mechanism that does something with memory pressure.

                                                                                                         I have been meaning to dig into cgroup stuff for a while.

                                                                                                    But yes, the crux of the matter is apps need to be able to sniff memory pressure and have mechanisms to react.

                                                                                                    For some tasks eg. a big build, the reaction may be… “Hey OS, just park me until this is all over, desperately swapping to give me a cycle isn’t helping anybody!”

                                                                                                  2. 3

                                                                                                    I literally woke up this morning thinking about this: 😩

                                                                                                    guess what. it’s b0rked. Printf uses malloc. Doh!

                                                                                                    I have not looked up the source code of [any implementation of] printf, but I can’t think of a reason printf would need to call malloc. It’s just scanning the format string, doing some numeric conversions that can use fixed size buffers, and writing to stdout. Given that printf-like functions can be a bottleneck (like when doing lots of logging) I’d think they’d try to avoid heap allocation.

                                                                                                    1. 2

                                                                                                      It’s an edge case, a bad idea, and a misfeature, but glibc allows registering callbacks for custom conversion specifiers for printf.

                                                                                                      1. 2

                                                                                                               For localisation, printf provides qualifiers that allow you to reference the arguments by their position in the format string. C’s stdarg does not allow you to get variadic parameter n, so to support this printf may need to do a two-pass scan of the format string: first it collects the positional references, then it gathers the corresponding arguments into an indexed data structure. That indexed data structure needs to be dynamically allocated.
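
                                                                                                               For anyone who hasn’t seen these, a tiny illustration of positional conversions (the %n$ form is a POSIX extension rather than ISO C):

                                                                                                                   #include <stdio.h>

                                                                                                                   int main(void) {
                                                                                                                       /* Same arguments, referenced in different orders, as a
                                                                                                                          translated format string might need. */
                                                                                                                       printf("%1$s has %2$d new messages\n", "alice", 3);
                                                                                                                       printf("%2$d new messages for %1$s\n", "alice", 3);
                                                                                                                       return 0;
                                                                                                                   }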

                                                                                                               It’s quite common in printf implementations to use alloca or a variable-length array in an inner block of the printf to dynamically allocate the data structure on the stack, but it’s simpler to use malloc and free. It’s also common in more optimised printf implementations to implement this as a fall-back mode, where you just do va_arg until you encounter a positional qualifier, then punt to a slow-path implementation with the string, a copy of the va_list, and the current position (once you’ve seen a positional qualifier, you may discover that you need to collect some of the earlier entries in the va_list that you’ve already discarded. Fortunately, they’re still in the argframe).

                                                                                                               To make this even more fun, printf is locale-aware. It will call a bunch of character-set conversion functions, localeconv, and so on. If the locale is lazily initialised then this may also trigger allocation. Oh, and as @dbremmer points out, GNU-compatible printf implementations can register extensions. FreeBSD libc actually has two different printf implementations, a fast one and one that’s called if you have ever called register_printf_function.

                                                                                                        To put printf in perspective: GNUstep has a function called GSFormat, which is a copy of a printf implementation, extended to support %@ (which is a fairly trivial change, since %@ is just %s with a tiny bit of extra logic in front). I was able to compile all of GNUstep except for this function with clang, about a year before clang could compile all of GNUstep. The printf implementation stressed the compiler more than the whole of the rest of the codebase.

                                                                                                               Java’s printf is even more exciting. The first time it’s called, it gives pretty much complete coverage of the JVM spec. It invokes the class loader to load locales (Solaris libc does this as well: each locale is a separate .so that is dlopened, whereas most other systems just store locales as tables), generates enough temporary objects that it triggers garbage collection, does dispatch via interfaces, does interface-to-interface casts, and executes at least one of every Java bytecode.

                                                                                                        Whatever language you’re using, there’s a good chance that printf or the local equivalent is one of the most complex functions that you will ever look at.

                                                                                                        1. 1

                                                                                                          Sigh … It’s always the most obscure 10% of the feature set that causes 90% of the complexity, isn’t it?

                                                                                                        2. 1

                                                                                                                 Yeah, I thought the same thing. Printf is a little interpreter and it doesn’t need to malloc. Snprintf doesn’t either; that’s why it has the mode where it returns the number of bytes you need to allocate.
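
                                                                                                                 That “measure, then allocate” mode as a small sketch (the function name is made up): snprintf with a zero-sized buffer reports how long the result would be, and the caller does its own allocation.

                                                                                                                     #include <stdio.h>
                                                                                                                     #include <stdlib.h>

                                                                                                                     char *format_value(int value) {
                                                                                                                         int n = snprintf(NULL, 0, "value=%d", value);  /* length only */
                                                                                                                         if (n < 0)
                                                                                                                             return NULL;
                                                                                                                         char *buf = malloc((size_t)n + 1);
                                                                                                                         if (buf != NULL)
                                                                                                                             snprintf(buf, (size_t)n + 1, "value=%d", value);
                                                                                                                         return buf;
                                                                                                                     }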

                                                                                                          1. 1

                                                                                                            I’ve only just skimmed the printf code in glibc. All it does is call vfprintf with stdout as the file. It does appear that vfprintf allocates some work buffers to do its thing, but I did not dive too deeply because the code is hard to read (lots of C preprocessor abuse) and I just don’t have the time right now.

                                                                                                            1. 1

                                                                                                                     So I thought too, until I set a breakpoint….

                                                                                                                     I don’t think we were using standard GNU libc at that point, so your mileage may vary.

                                                                                                                     I have the vaguest memory that it was printing 64-bit ints (on a 32-bitter) that triggered it.

                                                                                                            2. 2

                                                                                                              These days I also allocate small or zero swap partitions…. by the time you’re swapping heavily… your program is not dead… just unusable

                                                                                                                     Swap still has its use. I’ve got a server which can evict most of the application because ~50% of its memory is lower tiers of JIT-compiled code which will never be hit after a minute of runtime. Without swap, I’d have to run two servers. With swap, 1GB of it is used, and I can run two copies of the server without the OOM killer kicking in. 60% of swap is marked used, but almost never touched.

                                                                                                              1. 3

                                                                                                                Yes. Also, the more unused anonymous pages you can swap out, the more RAM you have for pagecache to accelerate I/O.

                                                                                                            1. 1

                                                                                                              Hacker’s Delight is the book for bit-twiddling.

                                                                                                              1. 5

                                                                                                                Aren’t all 4 presented cases something that should be fixed properly rather than covered with a name-colliding ifdef? For example, I’d rather the compiler shouted at me and failed if I have unexpected file duplicates through multiple mount points rather than pretend they’re likely the same files and ignore the issue.

                                                                                                                1. 3

                                                                                                                           I came to say the same thing. I was put off #pragma once by the horror stories but I switched to using it a few years ago. If you have two copies of the same header in your build then one of the following is true:

                                                                                                                  • You have two copies of an identical header. This works today but is fragile if one of them is updated.
                                                                                                                  • You have a newer copy and an older copy. If the library is backwards ABI-compatible then this won’t break things but it may cause compilation failures in things that accidentally see the old header first.
                                                                                                                           • You have two versions of the header from distinct forks. In this case, all bets are off when it comes to compatibility, and who knows what will happen as they diverge further. Any compilation unit including both is setting itself up for pain.

                                                                                                                           None of these are good situations to be in. If #pragma once causes you duplicate-definition errors as soon as you get into that situation, rather than after you’ve built infrastructure around that situation and ended up in an unrecoverable mess, then that’s a feature.

                                                                                                                  1. 2

                                                                                                                    Amen.

                                                                                                                    Sometimes I feel like I’m the only one left, but I really think #pragma once is harmful for similar reasons. If headers had no side effects, including headers in any order would achieve the same result. But, the whole reason headers exist is specifically to have side effects, so different orders don’t necessarily have the same meaning. Rather than just throw headers at the compiler randomly and then suppress the compiler’s complaints arising from it, I really think the “right” thing is to have a single order where headers are included once, and use it consistently, so all compilation units are getting the same declarations. I’d love a compiler warning in response to #pragma once, since it implies that headers are potentially being included with inconsistent orders.

                                                                                                                    1. 3

                                                                                                                      Rather than just throw headers at the compiler randomly and then suppress the compiler’s complaints arising from it, I really think the “right” thing is to have a single order where headers are included once, and use it consistently, so all compilation units are getting the same declarations. I’d love a compiler warning in response to #pragma once, since it implies that headers are potentially being included with inconsistent orders.

                                                                                                                      Plan9 uses this approach; header guards aren’t used and headers don’t include other headers.
                                                                                                                      Here is an excerpt from How to use the Plan 9 C Compiler.

                                                                                                                      In strict ANSI C, include files are grouped to collect related functions in a single file: one for string functions, one for memory functions, one for I/O, and none for system calls. Each include file is protected by an #ifdef to guarantee its contents are seen by the compiler only once. Plan 9 takes a different approach. Other than a few include files that define external formats such as archives, the files in /sys/include correspond to libraries. If a program is using a library, it includes the corresponding header. The default C library comprises string functions, memory functions, and so on, largely as in ANSI C, some formatted I/O routines, plus all the system calls and related functions. To use these functions, one must #include the file <libc.h>, which in turn must follow <u.h>, to define their prototypes for the compiler. Here is the complete source to the traditional first C program:
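
                                                                                                                               (The program that excerpt goes on to show is the Plan 9 hello world; quoting it here from memory of that document, so treat the exact text as approximate.)

                                                                                                                                   #include <u.h>
                                                                                                                                   #include <libc.h>

                                                                                                                                   void
                                                                                                                                   main(void)
                                                                                                                                   {
                                                                                                                                       print("hello world\n");
                                                                                                                                       exits(0);
                                                                                                                                   }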

                                                                                                                      1. 2

                                                                                                                                 How would you deal with common functionality then? For example, you want to include two unrelated third-party header libraries and they both need stdint. You can’t include-order your way out of that one.

                                                                                                                        1. 1

                                                                                                                          I may not be following this example. What I’d like is to first include stdint, then include third party library A, then third party library B.

                                                                                                                          It’s true that if the headers for these libraries turn around and include stdint, then #pragma once is needed, but that’s a tautology: it’s saying that if the model I think is more robust is not followed, then it cannot be followed.

                                                                                                                    1. 1

                                                                                                                      Work:
                                                                                                                      I’m counting the seconds until I’m done with on-call; my last call was at 2am this morning for spurious reasons.

                                                                                                                      Personal:
                                                                                                                      I’m hacking on an old BFTP implementation.

                                                                                                                      1. 6

                                                                                                                        @pushcx, could you please merge this with Python Multithreading without the GIL, an earlier story on this topic?

                                                                                                                        1. 1

                                                                                                                          I’m on call this weekend so my movements are restricted. I’m reading and hacking on some old networking code.

                                                                                                                          1. 8

                                                                                                                            A Viable Solution for Python Concurrency is the LWN story on this topic.

                                                                                                                            1. 2

                                                                                                                              This is a fascinating line:

                                                                                                                              I think the wonderful thing about vi is that it has such a good market share because we gave it away. Everybody has it now. So it actually had a chance to become part of what is perceived as basic UNIX. EMACS is a nice editor too, but because it costs hundreds of dollars, there will always be people who won’t buy it.

                                                                                                                              1. 1

                                                                                                                                I guess this was referring to the version of Emacs that preceded GNU Emacs.

                                                                                                                                1. 3

                                                                                                                                  Yes, he’s referring to Gosling Emacs.