1. 56

    IMHO it’s hard to get much out of reading a codebase without necessity. Without a reason why, you won’t do it, or you won’t get much out of it without knowing what to look for.

    1. 5

      Yeah, this seems a bit like asking “What’s your favorite math problem?”

      I dunno. Always liked 7+7=14 since I was a kid.

      Codebases exist to do things. You read a codebase because you want to modify what that is, or fix it because it’s not doing the thing it’s supposed to. Ideally, my favorite codebase is the one I get value out of constantly but never have to look at. CPU microcode, maybe?

      1. 4

        I often find myself reading codebases when looking for examples for using a library I am working with, or to understand how you are supposed to interact with some protocol. Open source codebases can help a lot there. It’s not so much 7 + 7 = 14, but rather 7 + x + y = 23, and I don’t know how to do x or y to get 23, but there are a few common components between the math problems. Maybe one solution can help me understand another?

        1. 2

          I completely agree. I do the same thing.

          When I am solving a similar problem or I’m interested in a class of problems, sometimes I find reviewing a codebase very informative. In my mind, what I’m doing is walking through the various things I might want to do and then reviewing the code structure to see how they’re doing it. It’s also bidirectional: a lot of times I see things in the structure and then wonder what sorts of behavior I might be missing.

          I’m not saying don’t review any codebases at all. I’m simply pointing out that without context, there are no qualifiers for one way of coding to be viewed as better or worse than any other. You take the context to your codebase review, whether explicitly or completely inside your mind.

          There’s a place for context-free codebase reviews, of course. It’s usually in an academic setting. Everybody should walk through the GoF and functional data structures. You should have experience in a generic fashion working through a message loop or queuing system and writing a compiler. I did and still do, but in the same way I read up on what’s going on in mRNA vaccinations: familiarity. There exist these sorts of things that might help when I need them. I do not necessarily have to learn or remember them, but I have to be able to get them when I want. I know these coding details at a much lower level than I do biology; after all, I’m the guy who’s going to use and code them if I need them. But the real work is matching the problem context up (gradually, of course) with the various implementation systems you might want to use.

          There are folks who are great problem-solvers that can’t code. That sucks. There are other folks who can code like the wind but are always putting some obscure yet clever chunk of stuff out and plugging it in somewhere. That also sucks. Good coders should be able to work on both sides of that technical line and move back and forth freely. I review codebases to see how that problem-solving line changed over the years of development, thinking to myself “Where did these guys do too much coding? Too little? Why are these classes or modules set up the way they are (in relation to the problem and maintaining code)?”

          That’s the huge value you get from reviewing codebases: more information on the story of developing inside of that domain. The rest of the coding stuff should be rote: I have a queue, I have a stack, etc. If I want to dive down to that level and start reviewing object interface strategy, perhaps, I’m still doing it inside of some context: I’m solving this problem and decided I need X; here’s a great example of X. Now start reading, and go back to reviewing what they’ve done against the problem you’re solving. Don’t be the guy who brings 4,000 lines of code to a 1-line problem. They might be great lines of code, but you’re working backwards.

          1. 1

            Yeah, I end up doing this a lot for, e.g., obscure system-specific APIs. Look at projects that’d use it/GH code search, chase the ifdefs.

          2. 2

            Great Picard’s Theorem, obvs. I always imagined approaching an essential singularity and seeing all infinity unfold, like a fractal flower, endlessly repeated in every step.

            1. 1

              I’d disagree. While sure, one could argue you just feed a computer what to do, you could make a similar statement about, for example, architecture, where (very simplified) you draw what workers should do and they do it.

              Does that mean that architects don’t learn from the work of other architects? I really don’t think so.

              But I also don’t think that “just reading” code or copying some “pattern” or “style” from others is what makes you better. It’s more that if you write some code only on your own, or with a somewhat static, like-minded team, your mental constructs don’t really change, while different code bases can challenge your mental model or give you insight into a different mental/architectural model that someone else came up with.

              For me that’s not so different from learning different programming languages - like really learning them, not just being able to figure out what it means or doing the same thing you did before with different syntax.

              I am sure it’s not the same for everyone, and it surely depends on different learning styles, but I assume that most people commenting here don’t read code like they read a calculation, and I’d never recommend that people just “read some code”. It doesn’t work, just like you won’t be a programmer after just reading a book on programming.

              It can be a helpful way of reflecting on your own programming, but very differently from most code reviews (real ones, not some theoretical optimal code review).

              Another thing, more psychological maybe, is that I think everyone has seen bad code, even if it’s just some of their own code from a few years ago. Sometimes it helps motivation to come across the opposite, reading a nice code base to be able to visualize a goal. The closer it is to practice, the better, in my opinion. I am not so much a fan of examples or example apps, because they might not work in real-world code bases, but that’s another topic.

              I hope, though, that nobody feels like they need to read code when they don’t feel like it and it gives them nothing. Minds work differently, and forcing yourself to do something often seems to counteract how much is actually learned.

            2. 4

              Well, it varies. Many contributions end up being a grep away and only make you look at a tiny bit of the codebase. Small codebases can be easier to grasp, as can those with implementation overviews (e.g. ARCHITECTURE.md).

              1. 3

                “Mathematics is not a spectator sport” - I think the same applies to coding.

                1. 1

                  I have to agree with this; I’ve found the most improvement comes from contribution, and having my code critiqued by others. Maybe we can s/codebases to study/codebases to contribute to/?

                  1. 2

                    Even if you don’t have to modify something, reading something out of a necessity to understand it makes it stick better (and more interesting) than just reading it for the sake of reading. That’s how I know more about PHP than most people want to know.

                    1. 1

                      Years ago, for my MSc thesis, I was working on a web app profiler. “How can I get the PHP interpreter to tell me every time it enters or exits a function in user code” led to a similar level of “I know more about the internals of PHP than I would like” :D

                1. 2

                  I was looking into this style of error handling last week. Currently the Outcome library looks like the best choice — it can be used standalone or with Boost, only requires C++14, and claims to be quite lightweight.
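
                  To make that concrete, here’s a minimal sketch of the result-style control flow (based on my reading of the Outcome docs; the header path and names may differ between the Boost and standalone editions):

                  #include <boost/outcome.hpp> // standalone edition ships a single outcome.hpp
                  #include <cstdlib>
                  #include <iostream>
                  #include <string>
                  #include <system_error>

                  namespace outcome = BOOST_OUTCOME_V2_NAMESPACE;

                  // Success carries an int, failure a std::error_code; nothing is thrown.
                  outcome::result<int> parse_port(const std::string &s) {
                      if (s.empty() || s.find_first_not_of("0123456789") != std::string::npos)
                          return std::make_error_code(std::errc::invalid_argument);
                      long v = std::strtol(s.c_str(), nullptr, 10);
                      if (v < 1 || v > 65535)
                          return std::make_error_code(std::errc::result_out_of_range);
                      return static_cast<int>(v);
                  }

                  int main() {
                      auto r = parse_port("8080");
                      if (r) // explicit success check instead of stack unwinding
                          std::cout << "port: " << r.value() << "\n";
                      else
                          std::cout << "error: " << r.error().message() << "\n";
                  }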

                  The upcoming exception refresh in C++2x is going to be similar to these in architecture, but integrated into the language syntax so it looks more like try/catch, and probably faster since the ABI will allow for optimizations like using a CPU flag to indicate whether the return value is a result or an error.

                  1. 1

                    That’s cool! Do you have any more info on the exception refresh (e.g. examples)? I’m not seeing how try/catch can work with stuff like StatusOr.

                    1. 1

                      I don’t have the URL it came from, but the working-group document is titled “Zero-overhead Deterministic Exceptions: Throwing Values”, by Herb Sutter.

                  1. 3

                    IIUC, this only works if your NAT has certain ALGs enabled.

                    1. 1

                      Is there a straightforward way to disable the problematic ALGs? I suppose it varies by what router you’re using. I have an Eero system; its firmware is up to date, but the release history doesn’t mention any fixes for something like this.

                      1. 1

                        Is there any easy way to test if a NAT has this enabled or not? Many consumer routers provided by ISPs don’t offer many configuration options.

                      1. 37

                        Hello, I am here to derail the Rust discussion before it gets started. The culprit behind sudo’s vast repertoire of vulnerabilities, and more broadly its bugs in general, is almost entirely one thing: its runaway complexity.

                        We have another tool which does something very similar to sudo which we can compare with: doas. The portable version clocks in at about 500 lines of code, its man pages are a combined 157 lines long, and it has had two CVEs (only one of which Rust would have prevented), or approximately one every 30 months.

                        sudo is about 120,000 lines of code (100x more), and it’s had 140 CVEs, or about one every 2 months since the CVE database came into being 21 years ago. Its man pages are about 10,000 lines and include the following:

                        $ man sudoers | grep -C1 despair
                        The sudoers file grammar will be described below in Extended Backus-Naur
                        Form (EBNF).  Don't despair if you are unfamiliar with EBNF; it is fairly
                        simple, and the definitions below are annotated.
                        

                        If you want programs to be more secure, stable, and reliable, the key metric to address is complexity. Rewriting it in Rust is not the main concern.

                        1. 45

                          it’s had 140 CVEs

                          Did you even look at that list? Most of those are not sudo vulnerabilities but issues in sudo configurations distros ship with. The actual list is more like 39, and a number of them are “disputed” and most are low-impact. I didn’t do a full detailed analysis of the issues, but the implication that it’s had “140 security problems” is simply false.

                          sudo is about 120,000 lines of code

                          More like 60k if you exclude the regress (tests) and lib directories, and 15k if you exclude the plugins (although the sudoers plugin is 40k lines, which most people use). Either way, it’s at most half of 120k.

                          Its man pages are about 10,000 lines and include the following:

                          12k, but this also includes various technical documentation (like the plugin API); the main documentation in sudo(1) is 741 lines, and sudoers(5) is 3,255 lines. Well under half of 10,000.

                          We have another tool which does something very similar to sudo which we can compare with: doas.

                          Except that it only has 10% of the features, or less. This is good if you don’t use them, and bad if you do. But I already commented on this at HN so no need to repeat that here.

                          1. 12

                            You’re right about these numbers being a back-of-the-napkin analysis. But even your more detailed analysis shows that the situation is much graver with sudo. I am going to include plugins, because if they ship, they’re a liability. And their docs, because they felt the need to write them. You can’t just shove the complexity you don’t use and/or like under the rug. Heartbleed brought the internet to its knees because of a vulnerability in a feature no one uses.

                            And yes, doas has 10% of the features by count - but it has 99% of the features by utility. If you need something in the 1%, what right do you have to shove it into my system? Go make your own tool! Your little feature which is incredibly useful to you is incredibly non-useful to everyone else, which means fewer eyes on it, and it’s a security liability to 99% of systems as such. Not every feature idea is meritorious. Scope management is important.

                            1. 9

                              it has 99% of the features by utility

                              Citation needed.

                              what right do you have to shove it into my system?

                              Nobody is shoving anything into your system. The sudo maintainers have the right to decide to include features, and they’ve been exercising that right. You have the right to skip sudo and write your own - and you’ve been exercising that right too.

                              Go make your own tool!

                              You’re asking people to undergo the burden of forking or re-writing all of the common functionality of an existing tool just so they can add their one feature. This imposes a great cost on them. Meanwhile, including that code or feature into an existing tool imposes only a small (or much smaller) cost, if done correctly - the incremental cost of adding a new feature to an existing system.

                              The key phrase here is “if done correctly”. The consensus seems to be that sudo is suffering from poor engineering practices - few or no tests, including with the patch that (ostensibly) fixes this bug. If your software engineering practices are bad, then simpler programs will have fewer bugs only because there’s less code to have bugs in. This is not a virtue. Large, complex programs can be built to be (relatively) safe by employing tests, memory checkers, good design practices, good architecture (which also reduces accidental complexity), code reviews, and technologies that help mitigate errors (whether that be a memory-safe GC-less language like Rust or a memory-safe GC’ed language like Python). Most features can (and should) be partitioned off from the rest of the design, either through compile-time flags or runtime architecture, which prevents them from incurring security or performance penalties.

                              Software is meant to serve the needs of users. Users have varied use-cases. Distinct use-cases require more code to implement, and thereby incur complexity (although, depending on how good of an engineer one is, additional accidental complexity above the base essential complexity may be added). If you want to serve the majority of your users, you must incur some complexity. If you want to still serve them, then start by removing the accidental complexity. If you want to remove the essential complexity, then you are no longer serving your users.

                              The sudo project is probably designed to serve the needs of the vast majority of the Linux user-base, and it succeeds at that, for the most part. doas very intentionally does not serve the needs of the vast majority of the Linux user-base. Don’t condemn a project for trying to serve more users than you are.

                              Not every feature idea is meritorious.

                              Serving users is meritorious - or do you disagree?

                              1. 6

                                Heartbleed brought the internet to its knees because of a vulnerability in a feature no one uses.

                                Yes, but the difference is that these are features people actually use, which wasn’t the case with Heartbleed. Like I mentioned, I think doas is great – I’ve been using it for years and never really used (or liked) sudo because I felt it was far too complex for my needs; before doas I just used su. But I can’t deny that for a lot of other people (mainly organisations, which is the biggest use-case for sudo in the first place) these features are actually useful.

                                Go make your own tool! Your little feature which is incredibly useful to you is incredibly non-useful to everyone else

                                A lot of these things aren’t “little” features, and many interact with other features. What if I want doas + 3 flags from sudo + LDAP + auditing? There are many combinations possible, and writing a separate tool for every one of them isn’t really realistic. All of this also requires maintenance, and reliable, consistent long-term maintainers are kind of rare.

                                Scope management is important.

                                Yes, I’m usually pretty explicit about which use cases I want to solve and which I don’t want to solve. But “solving all the use cases” is also a valid scope. Is this a trade-off? Sure. But everything here is.

                                The real problem isn’t so much sudo; but rather that sudo is the de-facto default in almost all Linux distros (often installed by default, too). Ideally, the default should be the simplest tool which solves most of the common use cases (i.e. doas), and people with more complex use cases can install sudo if they need it. I don’t know why there aren’t more distros using doas by default (probably just inertia?)

                                1. 0

                                  What if I want doas + 3 flags from sudo + LDAP + auditing?

                                  Tough shit? I want a pony, and a tuba, and barbie doll…

                                  But “solving all the use cases” is also a valid scope.

                                  My entire thesis is that it’s not a valid scope. This fallacy leads to severe and present problems like the one we’re discussing today. You’re begging the question here.

                                  1. 4

                                    Tough shit? I want a pony, and a tuba, and barbie doll…

                                    This is an extremely user-hostile attitude to have (and don’t try claiming that telling users with not-even-very-obscure use-cases to write their own tools isn’t user-hostile).

                                    I’ve noticed that some programmers are engineers that try to build tools to solve problems for users, and some are artists that build programs that are beautiful or clever, or just because they can. You appear to be one of the latter, with your goal being crafting simple, beautiful systems. This is fine. However, this is not the mindset that allows you to build either successful systems (in a marketshare sense) or ones that are useful for many people other than yourself, for previously-discussed reasons. The sudo maintainers are trying to build software for people to use. Sure, there’s more than one way to do that (integration vs composition), but there are ways to do both poorly, and claiming the moral high ground for choosing simplicity (composition) is not only poor form but also kind of bad optics when you haven’t even begun to demonstrate that it’s a better design strategy.

                                    My entire thesis is that it’s not a valid scope.

                                    A thesis which you have not adequately defended. Your statements have amounted to “This bug is due to sudo’s complexity which is driven by the target scope/number of features that it has”, while both failing to provide any substantial evidence that this is the case (e.g. showing that sudo’s bugs are due to feature-driven essential complexity alone, and not use of a memory-unsafe language, poor software engineering practices (which could lead to either accidental complexity or directly to bugs themselves), or simple chance/statistics) and not actually providing any defense for the thesis as stated. Assume that @arp242 didn’t mean “all” the usecases, but instead “the vast majority” of them - say, enough that it works for 99.9% of users. Why is this “invalid”, exactly? It’s easy for me to imagine the argument being “this is a bad idea”, but I can’t imagine why you would think that it’s logically incoherent.

                                    Finally, you have repeatedly conflated “complexity” and “features”. Your entire argument is, again, invalid if you can’t show that sudo’s complexity is purely (or even mostly) essential complexity, as opposed to accidental complexity coming from being careless etc.

                              2. 9

                                I don’t think “users (distros) make a lot of configuration mistakes” is a good defence when arguing whether complexity is the issue.

                                But I do agree about feature set. And I feel like arguing against complexity for safety is wrong (like ddevault was doing), because systems inevitably grow complex. We should still be able to build safe, complex systems. (Hence why I’m a proponent of language innovation and ditching C.)

                                1. 11

                                  I don’t think “users (distros) make a lot of configuration mistakes” is a good defence when arguing whether complexity is the issue.

                                  It’s silly stuff like (ALL : ALL) NOPASSWD: ALL. “Can run sudo without a password” seems like a common theme: some shell injection is found in the web UI and because the config is really naïve (which is definitely not the sudo default) it’s escalated to root.

                                  Others aren’t directly related to sudo configuration as such; for example this one has a Perl script which is run with sudo that can be exploited to run arbitrary shell commands. This is also a common theme: some script is run with sudo, but the script has some vulnerability and is now escalated to root as it’s run with sudo.

                                  I didn’t check all of the issues, but almost all that I checked are one of the above; I don’t really see any where the vulnerability is caused directly by the complexity of sudo or its configuration; it’s just that running anything as root is tricky: setuid returns 432 results, three times that of sudo, and I don’t think that anyone can argue that setuid is complex or that setuid implementations have been riddled with security bugs.
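
                                  To make that recurring pattern concrete, here’s a made-up sketch of the bug class (not code from any of the linked CVEs; the helper name is invented):

                                  #include <stdio.h>
                                  #include <stdlib.h>

                                  /* Imagine this installed setuid root, or whitelisted as NOPASSWD in
                                   * sudoers. The caller controls argv[1], so any shell metacharacter
                                   * in it ($(...), ;, backticks) runs arbitrary commands as root. */
                                  int main(int argc, char **argv) {
                                      if (argc < 2)
                                          return 1;
                                      char cmd[512];
                                      snprintf(cmd, sizeof cmd, "net-helper %s", argv[1]);
                                      return system(cmd); /* escalation happens here, not inside sudo */
                                  }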

                                  Others just mention sudo in passing, by the way; this one is really about an unrelated remote exec vulnerability, and just mentions “If QCMAP_CLI can be run via sudo or setuid, this also allows elevating privileges to root”. And this one isn’t even about sudo at all, but about a “sudo mode” plugin for TYPO3, presumably to allow TYPO3 users some admin capabilities without giving away the admin password. And who knows why this one is even returned in a search for “sudo”, as it’s not mentioned anywhere.

                                  1. 3

                                    it’s just that running anything as root is tricky: setuid returns 432 results, three times that of sudo

                                    This is comparing apples to oranges. setuid affects many programs, so obviously it would have more results than a single program would. If you’re going to attack my numbers, then at least run the same logic over your own.

                                    1. 2

                                      It is comparing apples to apples, because many of the CVEs are about other program’s improper sudo usage, similar to improper/insecure setuid usage.

                                      1. 2

                                        Well, whatever we’re comparing, it’s not making much sense.

                                        1. If sudo is hard to use and that leads to security problems through its misusage, that’s sudo’s fault. Or do you think that the footguns in C are not C’s fault, either? I thought you liked Rust for that very reason. For this reason the original CVE count stands.
                                        2. But fine, let’s move on on the presumption that the original CVE count is not appropriate to use here, and instead reference your list of 39 Ubuntu vulnerabilities. 39 > 2, Q.E.D. At this point we are comparing programs to programs.
                                        3. You now want to compare this with 432 setuid results. You are comparing programs with APIs. Apples to oranges.

                                        But, if you’re trying to bring this back and compare it with my 140 CVE number, it’s still pretty damning for sudo. setuid is an essential and basic feature of Unix, which cannot be made any smaller than it already is without sacrificing its essential nature. It’s required for thousands of programs to carry out their basic premise, including both sudo and doas! sudo, on the other hand, can be made much simpler and still address its most common use-cases, as demonstrated by doas’s evident utility. It also has a much smaller exposure: one non-standard tool written in the 80’s and shunted along the timeline of Unix history ever since, compared to a standardized Unix feature introduced by DMR himself in the early 70’s. And setuid somehow has only 4x the number of footgun incidents? sudo could do a hell of a lot better, and it can do so by trimming the fat - a lot of it.

                                        1. 3

                                          If sudo is hard to use and that leads to security problems through its misusage, that’s sudo’s fault.

                                          It’s not because it’s hard to use, it’s just that its usage can escalate other more (relatively) benign security problems, just like setuid can. This is my point, as a reply to stephank’s comment. This is inherent to running anything as root, with setuid, sudo, or doas, and why we have capabilities on Linux now. I bet that if doas were the default instead of sudo, we’d have a bunch of CVEs about improper doas usage now, because people do stupid things like allowing anyone to run anything without a password and then writing a shitty web UI in front of that. That particular problem is not doas’s (or sudo’s) fault, just as cutting myself with the kitchen knife isn’t the knife’s fault.

                                          reference your list of 39 Ubuntu vulnerabilities. 39 > 2, Q.E.D.

                                          Yes, sudo has had more issues in total; I never said it doesn’t. It’s just a lot lower than what you said, and quite a number are very low-impact, so I just disputed the implication that sudo is a security nightmare waiting to happen: its track record isn’t all that bad. As always, more features come with more (security) bugs, but use cases do need solving somehow. As I mentioned, it’s a trade-off.

                                          sudo, on the other hand, can be made much simpler and still address its most common use-cases, as demonstrated by doas’s evident utility

                                          We already agreed on this yesterday on HN, which I repeated here as well; all I’m adding is “but sudo is still useful, as it solves many more use cases” and “sudo isn’t that bad”.

                                          Interesting thing to note: sudo was removed from OpenBSD by millert@openbsd.org, who is also the sudo maintainer. I think he’ll agree that “sudo is too complex for it to be the default”, which we already agree on, but not that sudo is “too complex to exist”, which is where we don’t agree.

                                          Could sudo be simpler or better architected to contain its complexity? Maybe. I haven’t looked at the source or use cases in-depth, and I’m not really qualified to make this judgement.

                                  2. 5

                                    I think arguing against complexity is one of the core principles of UNIX philosophy, and it’s gotten us quite far on the operating system front.

                                    If simplicity had been the guiding principle in sudo, this particular vulnerability would not have been possible to trigger: why have sudoedit in the first place, when it just implies the -e flag? This statement is a guarantee.
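
                                    For reference, the indirection in question is argv[0] dispatch, roughly like this (a simplified sketch, not sudo’s actual parsing code):

                                    #include <cstring>

                                    int main(int argc, char **argv) {
                                        bool edit_mode = false;
                                        // Strip the directory part and look at the invoked name.
                                        const char *slash = std::strrchr(argv[0], '/');
                                        const char *name = slash ? slash + 1 : argv[0];
                                        if (std::strcmp(name, "sudoedit") == 0)
                                            edit_mode = true; // behave as if -e had been passed
                                        // ... regular flag parsing can set edit_mode via -e as well,
                                        // which is exactly the duplicated code path questioned above ...
                                        (void)argc;
                                        return edit_mode ? 0 : 1;
                                    }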

                                    If it had ditched C, there is no guarantee that this issue wouldn’t have happened.

                                  3. 2

                                    Did you even look at that list? Most of those are not sudo vulnerabilities but issues in sudo configurations distros ship with.

                                    If even the distros can’t understand the configuration well enough to get it right, what hope do I have?

                                  4. 16

                                    OK maybe here’s a more specific discussion point:

                                    There can be logic bugs in basically any language, of course. However, the following classes of bugs tend to be steps in major exploits:

                                    • Bounds checking issues on arrays
                                    • Messing around with C strings at an extremely low level

                                    It is hard to deny that, in a universe where nobody ever messed up those two points, there would be a lot fewer nasty exploits in the world, in systems software in particular.

                                    Many other toolchains have decided to make the above two issues almost non-existent through various techniques. A bunch of old C code doesn’t handle this. Is there not something that can be done here to get the same productivity and safety advantages found in almost every other toolchain for tools that form the foundation of operating computers? Including a new C standard or something?

                                    I can have a bunch of spaghetti code in Python, but turning that spaghetti into “oh wow, argv contents ran over some other variables and messed up the internal state machine” is a uniquely C problem. If everyone else can find solutions, I feel like C could as well, including introducing new mechanisms to the language; we are not bound by what is printed in some 40-year-old books, and #ifdef is a thing.
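
                                    To illustrate with a deliberately contrived sketch (the struct layout is made up; real exploits are less polite): the unchecked copy below is legal C that silently tramples the neighbouring field, which is exactly the “argv ran over some other variables” failure mode.

                                    #include <cstdio>
                                    #include <cstring>
                                    #include <string>

                                    struct State {
                                        char buf[16];
                                        int authenticated; // sits right after buf (illustrative layout)
                                    };

                                    int main(int argc, char **argv) {
                                        State s = {};
                                        if (argc > 1) {
                                            // The C footgun: no bounds check, so a long argv[1] can run
                                            // past buf and flip `authenticated` -- a corrupted state machine.
                                            std::strcpy(s.buf, argv[1]);
                                            // The memory-safe spelling of the same operation: a growable
                                            // string (or strlcpy/snprintf in plain C) cannot overrun.
                                            std::string safe = argv[1];
                                        }
                                        std::printf("authenticated=%d\n", s.authenticated);
                                        return 0;
                                    }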

                                    EDIT: forgot to mention this, I do think that sudo is a bit special given that its default job is to take argv contents and run them. I kinda agree that sudo is a bit special in terms of exploitability. But hey, the logic bugs by themselves weren’t enough to trigger the bug. When you have a multi-step exploit, anything on the path getting stopped is sufficient, right?

                                    1. 14

                                      +1. Lost in the noise of “but not all CVEs…” is the simple fact that this CVE comes from an embarrassing C string fuckup that would be impossible, or at least caught by static analysis, or at very least caught at runtime, in most other languages. If “RWIIR” is flame bait, then how about “RWIIP” or at least “RWIIC++”?

                                      1. 1

                                        I be confused… what does the P in RWIIP mean?

                                        1. 3

                                          Pascal?

                                          1. 1

                                            Python? Perl? Prolog? PL/I?

                                          2. 2

                                            Probably Python, given the content of the comment by @rtpg. Python is also memory-safe, while it’s unclear to me whether Pascal is (a quick search reveals that at least FreePascal is not memory-safe).

                                            Were it not for the relative (accidental, non-feature-providing) complexity of Python to C, I would support RWIIP. Perhaps Lua would be a better choice - it has a tiny memory and disk footprint while also being memory-safe.

                                            1. 2

                                              Probably Python, given the content of the comment by @rtpg. Python is also memory-safe, while it’s unclear to me whether Pascal is (a quick search reveals that at least FreePascal is not memory-safe).

                                              That’s possibly it.

                                              Perhaps Lua would be a better choice - it has a tiny memory and disk footprint while also being memory-safe.

                                              Not to mention that Lua – even when used without LuaJIT – is simply blazingly fast compared to other scripting languages (Python, Perl, &c)!

                                              For instance, see this benchmark I did some time ago: https://0x0.st/--3s.txt. I had implemented Ackermann’s function in various languages (the “./ack” file is the one in C) to get a rough idea of their execution speed, and lo and behold, Lua turned out to be second only to the C implementation.
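
                                              For reference, the usual shape of that benchmark is the naive double recursion below (a sketch; the linked file may differ in details):

                                              #include <cstdint>
                                              #include <cstdio>

                                              // Naive Ackermann, a classic cross-language micro-benchmark
                                              // because it stresses call overhead. ack(3, n) == 2^(n+3) - 3,
                                              // so ack(3, 10) == 8189.
                                              std::uint64_t ack(std::uint64_t m, std::uint64_t n) {
                                                  if (m == 0) return n + 1;
                                                  if (n == 0) return ack(m - 1, 1);
                                                  return ack(m - 1, ack(m, n - 1));
                                              }

                                              int main() {
                                                  std::printf("%llu\n",
                                                              static_cast<unsigned long long>(ack(3, 10)));
                                                  return 0;
                                              }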

                                      2. 15

                                        I agree that rewriting things in Rust is not always the answer, and I also agree that simpler software makes for more secure software. However, I think it is disingenuous to compare the overall CVE count for the two programs. Would you agree that sudo is much more widely installed than doas (and therefore is a larger target for security researchers)? Additionally, most of the 140 CVEs linked were filed before October 2015, which is when doas was released. Finally, some of the linked CVEs aren’t even related to code vulnerabilities in sudo, such as the six Quest DR Series Disk Backup CVEs (example).

                                        1. 4

                                          I would agree that sudo has a bigger target painted on its back, but it’s also important to acknowledge that it has a much bigger back - 100× bigger. However, I think the comparison is fair. doas is the default in OpenBSD and very common in NetBSD and FreeBSD systems as well, which are at the heart of a lot of high-value operations. I think it’s over the threshold where we can consider it a high-value target for exploitation. We can also consider the kinds of vulnerabilities which have occurred internally within each project, without comparing their quantity to one another, to characterize the sorts of vulnerabilities which are common to each project, and ascertain something interesting while still accounting for differences in prominence. Finally, there’s also a bias in the other direction: doas is a much simpler tool, shipped by a team famed for its security prowess. Might this not dissuade it as a target for security researchers just as much?

                                          Bonus: if for some reason we believed that doas was likely to be vulnerable, we could conduct a thorough audit on its 500-some lines of code in an hour or two. What would the same process look like for sudo?

                                          1. -1

                                            but it’s also important to acknowledge that it has a much bigger back - 100× bigger.

                                            Sorry, but I don’t see masses of users protesting in the streets for tools that have 100x the code compared to other tools providing similar functionality.

                                            1. 10

                                              What?

                                        2. 10

                                          So you’re saying that 50% of the CVEs in doas would have been prevented by writing it in Rust? Seems like a good reason to write it in Rust.

                                          1. 11

                                            Another missing point is that Rust is only one of many memory safe languages. Sudo doesn’t need to be particularly performant or free of garbage collection pauses. It could be written in your favorite GCed language like Go, Java, Scheme, Haskell, etc. Literally any memory safe language would be better than C for something security-critical like sudo, whether we are trying to build a featureful complex version like sudo or a simpler one like doas.

                                            1. 2

                                              Indeed. And you know, Unix in some ways has been doing this for years anyway with Perl, Python, and shell scripts.

                                              1. 2

                                                I’m not a security expert, so I’d be happy to be corrected, but if I remember correctly, using secrets safely in a garbage-collected language is not trivial. Once you’ve finished working with some secret, you don’t necessarily know how long it will remain in memory before it’s garbage collected, or whether it will be securely deleted or just ‘deallocated’ and left in RAM for the next program to read. There are ways around this, such as falling back to manual memory control for sensitive data, but as I say, it’s not trivial.

                                                1. 2

                                                  That is true, but you could also do the secrets handling in a small library written in C or Rust and FFI with that, leaving the rest of your bog-standard logic unbeholden to the issues that habitually plague every non-trivial C codebase.

                                                  1. 2

                                                    Agreed.

                                                    Besides these capabilities, ideally a language would also have ways of expressing important security properties of code. For example, ways to specify that a certain piece of data is secret and ensure that it can’t escape and is properly overwritten when going out of scope instead of simply being dropped, and ways to specify a requirement for certain code to use constant time to prevent timing side channels. Some languages are starting to include things like these.

                                                    Meanwhile when you try to write code with these invariants in, say, C, the compiler might optimize these desired constraints away (overwriting secrets is a dead store that can be eliminated, the password checker can abort early when the Nth character of the hash is wrong, etc) because there is no way to actually express those invariants in the language. So I understand that some of these security-critical things are written in inline assembly to prevent these problems.
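
                                                    To make the timing point concrete, a sketch (illustrative only; real implementations add compiler barriers so the constant-time loop stays constant-time):

                                                    #include <cstddef>

                                                    using byte = unsigned char;

                                                    // Leaky: returns at the first mismatch, so run time
                                                    // reveals the length of the matching prefix -- which
                                                    // is exactly what a timing attack measures.
                                                    bool leaky_equal(const byte *a, const byte *b, std::size_t n) {
                                                        for (std::size_t i = 0; i < n; i++)
                                                            if (a[i] != b[i]) return false;
                                                        return true;
                                                    }

                                                    // Constant-time flavour: touch every byte, accumulating
                                                    // the differences, so timing no longer depends on where
                                                    // (or whether) a mismatch occurs.
                                                    bool ct_equal(const byte *a, const byte *b, std::size_t n) {
                                                        byte diff = 0;
                                                        for (std::size_t i = 0; i < n; i++)
                                                            diff |= static_cast<byte>(a[i] ^ b[i]);
                                                        return diff == 0;
                                                    }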

                                                    1. 1

                                                      overwriting secrets is a dead store that can be eliminated

                                                      I believe that explicit_bzero(3) largely solves this particular issue in C.
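
                                                      A quick sketch of its use (assumes glibc ≥ 2.25 or a BSD libc; the secret value is a stand-in):

                                                      #include <stdio.h>
                                                      #include <string.h> // explicit_bzero(3)

                                                      int main(void) {
                                                          char secret[256] = "hunter2"; // stand-in secret
                                                          printf("len=%zu\n", strlen(secret)); // ... use it ...
                                                          // memset(secret, 0, sizeof secret) is a dead store
                                                          // the optimizer may delete; explicit_bzero is
                                                          // guaranteed to survive optimization.
                                                          explicit_bzero(secret, sizeof secret);
                                                          return 0;
                                                      }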

                                                      1. 1

                                                        Ah, yes, thanks!

                                                        It looks like it was added to glibc in 2017. I’m not sure if I haven’t looked at this since then, if the resources I was reading were just not up to date, or if I just forgot about this function.

                                            2. 8

                                              I do think high complexity is the source of many problems in sudo and that doas is a great alternative to avoid many of those issues.

                                              I also think sudo will continue being used by many people regardless. If somebody is willing to write an implementation in Rust which might be just as complex but ensures some level of safety, I don’t see why that wouldn’t be an appropriate solution to reducing the attack surface. I certainly don’t see why we should avoid discussing Rust just because an alternative to sudo exists.

                                              1. 2

                                                Talking about Rust as an alternative is missing the forest for the memes. Rust is a viral language (in the sense of internet virality), and a brain worm that makes us all want to talk about it. But in actual fact, C is not the main reason why anything is broken - complexity is. We could get much more robust and reliable software if we focused on complexity, but instead everyone wants to talk about fucking Rust. Rust has its own share of problems, chief among them its astronomical complexity. Rust is not a moral imperative, and not even the best way of solving these problems, but it does have a viral meme status which means that anyone who sees through its bullshit has to proactively fend off the mob.

                                                1. 32

                                                  But in actual fact, C is not the main reason why anything is broken - complexity is.

                                                  Offering opinions as facts. The irony of going on to talk about seeing through bullshit.

                                                  1. 21

                                                    I don’t understand why you hate Rust so much, but it seems as irrational as people’s love for it. Rust’s main value proposition is that it allows you to write more complex software that has fewer bugs, and your point is that this is irrelevant because the software should just be less complex. Well, I have news for you: software is not going to lose any of its complexity. That’s because we want software to do stuff; the less stuff it does, the less useful it becomes, or you have to replace one tool with two tools. The ecosystem hasn’t actually become less complex when you do that; you’re just dividing the code base into two chunks that don’t really do what you want. I don’t know why you hate Rust so much that it warrants posting wherever the discussion might come up, but I would suggest, if you truly cannot stand it, that you use some of your non-complex software to filter out related keywords in your web browser.

                                                    1. 4

                                                      Agree with what you’ve written, but just to pick at a theme that’s bothering me on this thread…

                                                      I don’t understand why you hate Rust so much but it seems as irrational as people’s love for it.

                                                      This is obviously very subjective, and everything below is anecdotal, but I don’t agree with this equivalence.

                                                      In my own experience, everyone I’ve met who “loves” or is at least excited about rust seems to feel so for pretty rational reasons: they find the tech interesting (borrow checking, safety, ML-inspired type system), or they enjoy the community (excellent documentation, lots of development, lots of online community). Or maybe it’s their first foray into open source, and they find that gratifying for a number of reasons. I’ve learned from some of these people, and appreciate the passion for what they’re doing. Not to say they don’t exist, but I haven’t really seen anyone “irrationally” enjoy rust - what would that mean? I’ve seen floating around a certain spiteful narrative of the rust developer as some sort of zealous online persona that engages in magical thinking around the things rust can do for them, but I haven’t really seen this type of less-than-critical advocacy any more for rust than I have seen for other technologies.

                                                      On the other hand I’ve definitely seen solid critiques of rust in terms of certain algorithms being tricky to express within the constraints of the borrow checker, and I’ve also seen solid pushback against some of the guarantees that didn’t hold up in specific cases, and to me that all obviously falls well within the bounds of “rational”. But I do see a fair amount of emotionally charged language leveled against not just rust (i.e. “bullshit” above) but the rust community as well (“the mob”), and I don’t understand what that’s aiming to accomplish.

                                                      1. 3

                                                        I agree with you, and I apologize if it came across that I think rust lovers are irrational - I for one am a huge rust proselytizer. I intended for the irrationality I mentioned to be the perceived irrationality DD attributes to the rust community.

                                                        1. 2

                                                          Definitely no apology needed - and to be clear, I think the rust bashing was coming from elsewhere; I just felt like bringing it to light on a less charged comment.

                                                        2. 1

                                                          I think the criticism isn’t so much that people are irrational in their fondness of Rust, but rather that there are some people who are overly zealous in their proselytizing, as well as a certain disdain for everyone who is not yet using Rust.

                                                          Here’s an example comment from the HN thread on this:

                                                          Another question is who wants to maintain four decades old GNU C soup? It was written at a different time, with different best practices.

                                                          In some point someone will rewrite all GNU/UNIX user land in modern Rust or similar and save the day. Until this happens these kind of incidents will happen yearly.

                                                          There are a lot of things to say about this comment, and it’s entirely false IMO, but it’s not exactly a nice comment, and why Rust? Why not Go? Or Python? Or Zig? Or something else.

                                                          Here’s another one:

                                                          Rust is modernized C. You are looking for something that already exists. If C programmers would be looking for tools to help catch bugs like this and a better culture of testing and accountability they would be using Rust.

                                                          The disdain is palpable in this one, and “Rust is modernized C” really misses the mark IMO; Rust has a vastly different approach. You can consider this a good or bad thing, but it’s really not the only approach towards memory-safe programming languages.


                                                          Of course this is not representative of the entire community; there are plenty of Rust people that I like and who have considerably more nuanced views – which are also expressed in that HN thread – but these comments certainly are frequent enough to leave a somewhat unpleasant taste.

                                                        3. 2

                                                          While I don’t approve of the deliberately inflammatory form of the comments, and don’t agree with the general statement that all complexity is eliminable, I personally agree that, in this particular case, simplicity > Rust.

                                                          As a thought experiment, world 1 uses sudo-rs as a default implementation of sudo, while world 2 uses 500 lines of C which is doas. I do think that world 2 would be generally more secure. Sure, it’ll have more segfaults, but fewer logical bugs.

                                                          I also think that the vast majority of world 2 populace wouldn’t notice the absence of advanced sudo features. To be clear, the small fraction that needs those features would have to install sudo, and they’ll use the less tested implementation, so they will be less secure. But that would be more than offset by improved security of all the rest.

                                                          Adding a feature to a program always has a cost for those who don’t use this feature. If the feature is obscure, it might be overall more beneficial to have a simple version which is used by the 90% of the people, and a complex for the rest 10%. The 10% would be significantly worse off in comparison to the unified program. The 90% would be slightly better off. But 90% >> 10%.

                                                          1. 2

                                                            Rust’s main value proposition is that it allows you to write more complex software that has fewer bugs

                                                            I argue that it’s actually that it allows you to write fast software with fewer bugs. I’m not entirely convinced that Rust allows you to manage complexity better than, say, Common Lisp.

                                                            That’s because we want software to do stuff, the less stuff it does the less useful it becomes

                                                            Exactly. Software is written for people to use. (technically, only some software - other software (such as demoscenes) is written for the beauty of it, or the enjoyment of the programmer; but in this discussion we only care about the former)

                                                            The ecosystem hasn’t actually become less complex when you do that

                                                            Even worse - it becomes more complex. Now that you have two tools, you have two userbases, two websites, two source repositories, two APIs, two sets of file formats, two packages, and more. If the designs of the tools begin to differ substantially, you have significantly more ecosystem complexity.

                                                            1. 2

                                                              You’re right about Rust’s value proposition; I should have added performance to that sentence. Or I should have just said “managed language”, because, as another commenter pointed out, Rust is almost irrelevant to this whole conversation when it comes to preventing these types of CVEs.

                                                            2. 1

                                                              The other issue is that it is a huge violation of the principle of least privilege. Those other features are fine, but do they really need to be running as root?

                                                        4. 7

                                                          Just to add to that: In addition to having already far too much complexity, it seems the sudo developers have a tendency to add even more features: https://computingforgeeks.com/better-secure-new-sudo-release/

                                                          Plugins, an integrated log server, TLS support… none of those are things I’d want in a tool that should be simple and is installed as suid root.

                                                          (Though I don’t think complexity vs. memory safety are necessarily opposed solutions. You could easily imagine a sudo-alike too that is written in rust and does not come with unnecessary complexity.)

                                                          1. 4

                                                            What’s wrong with EBNF and how is it related to security? I guess you think EBNF is something the user shouldn’t need to concern themselves with?

                                                            1. 6

                                                              There’s nothing wrong with EBNF, but there is something wrong with relying on it to explain an end-user-facing domain-specific configuration file format for a single application. It speaks to the greater underlying complexity, which is the point I’m making here. Also, if you ever have to warn your users not to despair when reading your docs, you should probably course correct instead.

                                                              1. 2

                                                                The point that you made in your original comment is that sudo has too many features (disguising it as a point about complexity). The manpage snippet that you’re referring to has nothing to do with features - it’s a mix between (1) the manpage being written poorly and (2) a bad choice of configuration file format resulting in accidental complexity increase (with no additional features added).

                                                              2. 1

                                                                EBNF as a concept aside, the sudoers manpage is terrible.

                                                              3. 3

                                                                Hello, I am here to derail the Rust discussion before it gets started.

                                                                I am not sure what you are trying to say about runaway complexity, so let me guess:

                                                                • UNIX is inherently insecure and it cannot be made secure by any means
                                                                • sudo is inherently insecure and it cannot be made secure by any means

                                                                Something else maybe?

                                                                1. 4

                                                                  Technically I agree with both, though my arguments for the former are most decidedly off-topic.

                                                                  1. 5

                                                                    Taking Drew’s statement at face value: There’s about to be another protracted, pointless argument about rewriting things in rust, and he’d prefer to talk about something more practically useful?

                                                                    1. 7

                                                                      I don’t understand why you would care about preventing a protracted, pointless argument on the internet. Seems to me like trying to nail jello to a tree.

                                                                  2. 3

                                                                    This is a great opportunity to promote doas. I use it everywhere these days, and though I don’t consider myself any sort of Unix philosophy purist, it’s a good example of “do one thing well”. I’ll call out Ted Unangst for making great software. Another example is signify. Compared to other signing solutions, there is much less complexity, much less attack surface, and a far shallower learning curve.

                                                                    I’m also a fan of tinyssh. It has almost no knobs to twiddle, making it hard to misconfigure. This is what I want in security-critical software.

                                                                    Relevant link: Features Are Faults.

                                                                    All of the above is orthogonal to choice of implementation language. You might have gotten a better response in the thread by praising doas and leaving iron oxide out of the discussion. ‘Tis better to draw flies with honey than with vinegar. Instead, you stirred up the hornets’ nest by preemptively attacking Rust.

                                                                    PS. I’m a fan of your work, especially Sourcehut. I’m not starting from a place of hostility.

                                                                    1. 3

                                                                      If you want programs to be more secure, stable, and reliable, the key metric to address is complexity. Rewriting it in Rust is not the main concern.

                                                                      Why can’t we have the best of both worlds? Essentially a program copying the simplicity of doas, but written in Rust.

                                                                      1. 2

                                                                        Note that both sudo and doas originated in OpenBSD. :)

                                                                        1. 9

                                                                          Got a source for the former? I’m pretty sure sudo well pre-dates OpenBSD.

                                                                          Sudo was first conceived and implemented by Bob Coggeshall and Cliff Spencer around 1980 at the Department of Computer Science at SUNY/Buffalo. It ran on a VAX-11/750 running 4.1BSD. An updated version, credited to Phil Betchel, Cliff Spencer, Gretchen Phillips, John LoVerso and Don Gworek, was posted to the net.sources Usenet newsgroup in December of 1985.

                                                                          The current maintainer is also an OpenBSD contributor, but he started maintaining sudo in the early 90s, before OpenBSD forked from NetBSD. I don’t know when he started contributing to OpenBSD.

                                                                          So I don’t think it’s fair to say that sudo originated in OpenBSD :)

                                                                          1. 1

                                                                            Ah, looks like I was incorrect. I misinterpreted OpenBSD’s innovations page. Thanks for the clarification!

                                                                      1. 3

                                                                        I’d be sold if the published part were static.

                                                                        1. 5

                                                                          Well, sqlite is almost as static as your journaling filesystem.

                                                                          1. 5

                                                                            My static webserver disagrees.

                                                                          2. 3

                                                                            You trade that for having all of the data live in sqlite.

                                                                            1. 3

                                                                              I think the design choice is that sqlite can store tons of small files in a single database, with full-text search (FTS) on top.
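
                                                                              If it helps, a minimal sketch of that idea using Python’s stdlib sqlite3 module and the FTS5 extension (assuming your sqlite build ships with FTS5; the table and file names are made up):

                                                                              ```python
                                                                              import sqlite3

                                                                              # One database file standing in for a whole tree of small documents.
                                                                              con = sqlite3.connect("site.db")
                                                                              con.execute("CREATE VIRTUAL TABLE IF NOT EXISTS docs USING fts5(path, content)")
                                                                              con.executemany(
                                                                                  "INSERT INTO docs (path, content) VALUES (?, ?)",
                                                                                  [
                                                                                      ("posts/hello.md", "# Hello\nmy first post"),
                                                                                      ("posts/sqlite.md", "storing many small files in one database"),
                                                                                  ],
                                                                              )
                                                                              con.commit()

                                                                              # Full-text search across every stored document.
                                                                              for (path,) in con.execute("SELECT path FROM docs WHERE docs MATCH ?", ("files",)):
                                                                                  print(path)
                                                                              ```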

                                                                              1. 2

                                                                                Yes, and you get full-text search now; beyond that, sqlite has loads of benefits over plain files.

                                                                              2. 2

                                                                                Why does this have to be a trade?

                                                                                The CMS and the resulting website do not need to both be dynamic and stored in sqlite.

                                                                                I’d personally prefer running the CMS somewhere private, and hosting the (static) website just about anywhere.

                                                                                1. 5

                                                                                  Then go use something like Hugo, etc. This lets you edit live, which is a non-trivial task when the files are 100% static. There is a trade-off for this.

                                                                                  Personally, I just use Fossil-scm for websites these days. I can host files, edit locally or remotely (live or not), sync changes, etc. It is also just a sqlite file.

                                                                                  1. 3

                                                                                    There is a trade-off for this.

                                                                                    There doesn’t have to be. There’s no need for the CMS to actually dynamically render the resulting website for the public.

                                                                                    An export button that creates a static website would be good enough. An automatically updated export would be excellent.

                                                                                    1. 3

                                                                                      I can’t imagine Go + sqlite not being fast enough, rendering on every page load, for almost all of the websites in the world that would ever use something like this.

                                                                                      I haven’t looked at how this is coded, but if load demands it, a cache would probably be a better solution than pre-rendering.

                                                                                      You are technically correct that one could store the rendered version in sqlite (or even out on temp disk), but I don’t see any reason to bother with the added complexity.

                                                                                      Just turning on your nginx caching option (or whatever front-end is handling the TLS cert) would almost certainly be 99% easier and achieve basically the same effect.

                                                                                      1. 5

                                                                                        I can’t imagine Go + sqlite not being fast enough, rendering on every page load, for almost all of the websites in the world that would ever use something like this.

                                                                                        Static is simpler.

                                                                                        I don’t see any reason to bother with the added complexity.

                                                                                        Again, static is simpler.

                                                                                        Just turning on your nginx caching option

                                                                                        Now that’s insane. A cache frontend for a backend that’s dynamically-rendering static content.

                                                                                        Instead of just serving the static content directly.

                                                                                        99% easier

                                                                                        Than serving static content? You can’t be serious.

                                                                                        1. 3

                                                                                          Yes, static content is simpler, but is it simpler for this project, the way it’s implemented currently? I’d argue no.

                                                                                          You have to render the content at some point. You have effectively two choices: render at save time, creating, effectively, a static site, or render at load time.

                                                                                          You seem to think rendering at save time is the better choice. Then you would save both copies, the raw MD file and the rendered HTML, and put them both in your sqlite DB. At HTTP GET time, you simply stream the rendered version from the sqlite file. (Alternatively you could store the rendered content out in some file somewhere, complicating the code even further; sqlite is often faster than open() anyway, so I’d argue that point as well.)

                                                                                          The problem is, it’s easy to have cache and sync issues. If you render at save time and there is exactly one way to edit, then the cache and sync issues basically go away. But there is more than one way to edit: you can edit with anything that can speak sqlite, by calling the sqlite CLI directly, or through the web interface. The big feature of this codebase is the live-edit feature of the CMS, so one could punt the problem, save two copies in sqlite (the raw MD and the rendered version), and say that if you edit outside of a live-edit session, it’s your problem to update the cache too.

                                                                                          Alternatively, render at read time (HTTP GET) and save yourself the headache of cache and sync issues. This is the simpler version. Seriously, it is: it took one sentence to describe, versus a paragraph for rendering at save time.

                                                                                          Complexity matters. Static seems simpler, but is it really? Not always.
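
                                                                                          To make the render-at-read-time version concrete, here’s a minimal sketch using only Python’s stdlib; the Markdown-to-HTML step is stubbed out as a hypothetical render(), since real rendering would need a third-party library:

                                                                                          ```python
                                                                                          import html
                                                                                          import sqlite3
                                                                                          from http.server import BaseHTTPRequestHandler, HTTPServer

                                                                                          db = sqlite3.connect("cms.db", check_same_thread=False)
                                                                                          db.execute("CREATE TABLE IF NOT EXISTS pages (path TEXT PRIMARY KEY, raw_md TEXT)")
                                                                                          db.execute("INSERT OR REPLACE INTO pages VALUES ('/hello', '# Hello')")
                                                                                          db.commit()

                                                                                          def render(md: str) -> str:
                                                                                              # Stand-in for a real Markdown renderer.
                                                                                              return "<pre>" + html.escape(md) + "</pre>"

                                                                                          class Handler(BaseHTTPRequestHandler):
                                                                                              def do_GET(self):
                                                                                                  row = db.execute(
                                                                                                      "SELECT raw_md FROM pages WHERE path = ?", (self.path,)
                                                                                                  ).fetchone()
                                                                                                  if row is None:
                                                                                                      self.send_error(404)
                                                                                                      return
                                                                                                  # Rendered on every GET: only one copy of the content exists,
                                                                                                  # so there is no rendered cache to keep in sync.
                                                                                                  body = render(row[0]).encode("utf-8")
                                                                                                  self.send_response(200)
                                                                                                  self.send_header("Content-Type", "text/html; charset=utf-8")
                                                                                                  self.end_headers()
                                                                                                  self.wfile.write(body)

                                                                                          HTTPServer(("localhost", 8080), Handler).serve_forever()
                                                                                          ```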

                                                                                          1. 2

                                                                                            sqlite is often faster than open() anyway,

                                                                                            Sure, but reading from sqlite is dynamic, whereas any static webserver can serve a plain file. I prefer static webservers, as they are the simplest. That means a low LoC count, which means they are easy to understand/audit/sandbox.

                                                                                            Specifically, I use openbsd’s httpd, and I would like to eventually move away from UNIX for my public services whenever possible (e.g. to a component system on top of seL4). A static website served from plain files is best for that.

                                                                                            1. 1

                                                                                              Changing the goalposts again; that’s fine, I can meet this goalpost too :)

                                                                                              Reading from a static sqlite file isn’t any more dynamic than open() on a file. They are both reading effectively static content in the above scenario of rendered content.

                                                                                              I agree that from a security point of view something like seL4 will be more secure, for some definitions of secure, but at some point we are just messing around and not actually solving problems.

                                                                                              What are the security risks you are trying to protect against? Without identifying the risks, it’s very hard to mitigate them. Otherwise we are just playing security-theatre games, for zero benefit.

                                                                                              What’s the worst-case scenario here, that someone manages to get into a static-content web server? Given proper permissions, nothing. If they get in, install a rootkit, and gain write access, the situation is worse, but again, given static content on the likes of a personal project, the consequences are equally trivial, I imagine.

                                                                                              Anyways, you didn’t refute any of my statements above, so I’m glad we finally agree: static is not always better, or even simpler.

                                                                                              Like I mentioned way up thread, I like live-edit and I’m very lazy. I just use Fossil-scm for my websites. They are technically dynamic now, but it’s amazingly easy to make them go, and I even get a built-in forum, bug tracking for feedback, email notifications, etc. I get 100% revision and history control, it’s auditable, offline-sync capable, live-edit capable, and so on. Deployment is super easy (a single binary that does everything) and backups are equally trivial, as it’s a single sqlite file. Because of the offline capabilities, I generally have a few copies lying about anyway, and it’s all cross-platform.

                                                                                              1. 2

                                                                                                Reading from a static sqlite file isn’t any more dynamic than open() on a file. They are both reading effectively static content in the above scenario of rendered content.

                                                                                                My webserver doesn’t support serving from static sqlite files. Dynamic as in I’d have to run CGI code in addition to my webserver.

                                                                                                Like I mentioned way up thread, I like live-edit and I’m very lazy.

                                                                                                Me too, thus I’d love a CMS. It’s just, while dynamic is good for me (the one writing articles), it is unnecessary for viewers. I currently use a static site generator, which takes Markdown as input.

                                                                                                I do not wish to change the setup of my public site, which is a static site.

                                                                                                What are the security risks you are trying to protect against? Without identifying the risks, it’s very hard to mitigate them. Otherwise we are just playing security-theatre games, for zero benefit.

                                                                                                On my (public) personal sites, I simply want to minimize complexity, which should mitigate a broad range of security risks. It’s a win/win strategy.

                                                                                                1. 0

                                                                                                  Your webserver could support serving from sqlite files, if it so chose; that’s basically all rwtxt is doing.

                                                                                                  Anyways, I feel like we aren’t having a productive conversation anymore. I already covered your supposed complexity in a previous comment.

                                                                                                  edit: also in this thread, but in a different comment chain, I commented on security analysis, which also applies here.

                                                                                2. 2

                                                                                  You could spider the dynamic version of the site quickly with httrack or something to produce a static version.

                                                                                  1. 1

                                                                                    Just turning on your nginx caching option (or whatever front-end is handling the TLS cert) would almost certainly be 99% easier and achieve basically the same effect.

                                                                                    1. 3

                                                                                      In terms of the security analysis it’s a completely different thing to have a dynamic application running exposed to the internet, even if you cache it.

                                                                                      1. -1

                                                                                        OK, I think I get your point (that in security, complexity hurts you), but I think we have very different understandings of security analysis, so I’m going to write some stuff here.

                                                                                        You can’t talk about security mitigations without talking about the specific risks you are trying to eradicate.

                                                                                        Nginx is dynamic. openbsd’s httpd is dynamic. Any “static webserver” is dynamically taking inputs (HTTP GET requests) and mapping them to files out on a filesystem. Nothing is stopping nginx from serving /etc/shadow or /home/rain1/mysupersecretthinghere, except some configuration and (hopefully) some file permissions.

                                                                                        This is no different from program X taking an HTTP GET, opening a sqlite file, and serving the results out of a column. It’s totally doable and equally “dynamic” for the most part.

                                                                                        I think what you are trying to say is that if rwtxt (since we are in the rwtxt thread) happens to be compromised, and I get complete control of its memory, I can convince rwtxt to render to you whatever I want.

                                                                                        Except the same is true of any other webserver. If I compromise nginx serving files from a filesystem (in the same way as above), I can also have it render to you whatever I want.

                                                                                        There is basically no difference, from a security-analysis point of view, between rwtxt plus a sqlite file and nginx plus an html file. Both files can be read-only from the web server’s perspective; of course then rwtxt’s live edit will not work, but hey, we are trying to be SECURE DAMMNIT! lol.

                                                                                        The difference, from a security-analysis perspective, is that nginx is way, way more popular than rwtxt (today; who knows about tomorrow), so finding a complete compromise of nginx is, one hopes, much, much harder than finding one in rwtxt, a tiny project mostly (completely? – didn’t look) written by one person. Of course the opposite is also true: way more bad people are looking at how to compromise nginx than rwtxt, so there is something to be said for security through obscurity, in a vague sort of hand-wavy way, as long as you are not an active target of bad people.

                                                                                        Hence why we go back to: you can’t talk about practical security mitigations without talking about the specific risks you are trying to eradicate.

                                                                                        So mostly your sentence makes no sense from a security-analysis perspective.

                                                                                        OK, soapbox over.

                                                                                3. 1

                                                                                  You may be able to get most of the benefits of static publishing by putting a CDN/caching layer in front of the CMS.

                                                                                  1. 3

                                                                                    That’d make hosting the actual website more complicated, rather than easier. Adding layers isn’t the solution.

                                                                                    I do not wish to expose the CMS to the general public. The CMS is relevant to webmasters (authors) only.

                                                                                    I just need an export button I can press anytime to get a static site.

                                                                                  2. 1

                                                                                    Why is that important for you?

                                                                                  1. 4

                                                                                    Alternatively, Learn X in Y Minutes has a similar guide for Rust and other programming languages.

                                                                                    1. 12

                                                                                      Please, please, please don’t comment code as per the “good” code example. In my opinion, comments should explain the “why” or other non-obvious design decisions instead of merely reiterating what the code is doing.
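
                                                                                      As a toy illustration of the difference (the scenario is made up, not taken from the linked article):

                                                                                      ```python
                                                                                      import random

                                                                                      base_delay = 0.5
                                                                                      retries = 0

                                                                                      # Unhelpful: merely restates what the code does.
                                                                                      retries += 1  # add one to retries

                                                                                      # Helpful: explains *why* the code does what it does.
                                                                                      # The upstream API rate-limits bursts, so back off exponentially with
                                                                                      # jitter to keep many workers from retrying in lockstep.
                                                                                      delay = base_delay * (2 ** retries) + random.uniform(0, 0.1)
                                                                                      print(delay)
                                                                                      ```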

                                                                                      1. 6

                                                                                        Yeah, the example is uncompelling and I would not look kindly on it in a review.

                                                                                        That said, projects like GCC have comments of the “it does this” nature, and they are immensely useful because it is usually not obvious what the code is, in fact, doing. The reasons for this are legion, but even something seemingly simple benefits from basic comments, because you often end up jumping into this code from something that is essentially unrelated. Without those kinds of comments, you would end up spending an incredible amount of time getting to know the module (which is often very complicated) just to extract what tends to be tangential but important information.

                                                                                      1. 3

                                                                                        Nice savings! I’m a little confused as to the use of PubSub for the import/export process, though. IIUC, Elasticsearch already has methods to migrate indexes. Would it have been even cheaper to use those methods and cut the PubSub cost out of the migration process?

                                                                                        1. 9

                                                                                          I think that in order to apply Brooks’ arguments, we need to look at what we are trying to solve for our users. It seems to me like the counter-examples in the article are just chunks of accidental complexity, plucked from larger user-facing tasks with dominating essential complexity. I would be more convinced if there was a third counter-example of a larger feature/project that took into account the entire software engineering process (from requirements gathering to delivered product).

                                                                                          1. 1

                                                                                            A couple of clarifying questions:

                                                                                            1. You state that if you haven’t received an ack within X milliseconds, you mark the current message as sent and proceed. If you don’t care about retries, why not remove the requirement to listen for acks in the first place?
                                                                                            2. How important is event ordering to you? For most event architectures, it’s worth quashing that requirement because of the complexity it adds.
                                                                                            3. What’s worse: a user not receiving a message, or a user receiving more than one copy of a message?
                                                                                            1. 2
                                                                                              1. I get acks 85%-90% of the time. So I would like to optimise so that delivery is ordered for the maximum number of users, and let it go out of order for the few. Also, by adding this X amount of delay, the messages are usually sent to the user in order; they go out of order when I send them instantly.

                                                                                              2. The current system is unordered and works really well (scale, maintainability). However, a lot of messages are sent out of order, so ordering is very important. My naive solution is to add a delay of X ms after every message, which should solve most cases. However, that would simply slow everything down, and I don’t want that.

                                                                                              3. A user not receiving a message is worse. But I would try not to send multiple times either.

                                                                                              1. 4

                                                                                                Have you considered enabling PubSub ordering, with the ordering key being the user/room? Some of the tradeoffs are that you will be limited in your throughput (1MB/s) per ordering key, and will be vulnerable to hot sharding issues.

                                                                                                After enabling ordering, if the ordering issue still impacts a greater fraction of users than you would like, then the problem is most likely on the upstream side (Insta/WhatsApp). AFAIK there is no ordering guarantee for those services, even if you wait for acks.

                                                                                                My advice: if the current solution is working great without ordering, I would strongly suggest sticking with it.
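
                                                                                                For reference, enabling ordering with the google-cloud-pubsub Python client looks roughly like this (the project, topic, endpoint, and key are placeholders; double-check the current docs, since the exact options may have changed):

                                                                                                ```python
                                                                                                from google.cloud import pubsub_v1

                                                                                                # Ordered publishing must go through a regional endpoint.
                                                                                                publisher = pubsub_v1.PublisherClient(
                                                                                                    client_options={"api_endpoint": "us-east1-pubsub.googleapis.com:443"},
                                                                                                    publisher_options=pubsub_v1.types.PublisherOptions(enable_message_ordering=True),
                                                                                                )
                                                                                                topic_path = publisher.topic_path("my-project", "outbound-messages")

                                                                                                for i in range(3):
                                                                                                    future = publisher.publish(
                                                                                                        topic_path,
                                                                                                        data=f"message {i}".encode("utf-8"),
                                                                                                        ordering_key="room-42",  # ordering holds only within one key
                                                                                                    )
                                                                                                    future.result()

                                                                                                # Note: the subscription must also be created with ordering enabled,
                                                                                                # and delivery follows publish order only for messages sharing a key.
                                                                                                ```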

                                                                                                1. 2

                                                                                                  Once I enable ordering on the queue, it becomes difficult to distribute the work across multiple workers, right?

                                                                                                  if the current solution is working great without ordering, I would strongly suggest sticking with it.

                                                                                                  I so wish to do this, but seems I can’t :(

                                                                                                  1. 3

                                                                                                    Has someone actually quantified how impactful out of order messages are to the business? This is the kind of bug that a less-technical manager or PM can prioritize highly without doing due diligence.

                                                                                                    Another suggestion is to make a design, and be sure to highlight whatever infrastructure costs are changing (increasing most likely), as well as calling out the risks of increasing the complexity of the system. You have the agency to advocate for what you think is right. If they decide to proceed with the design then go ahead and get the experience and you’ll find out if the warnings were warranted over time.

                                                                                                    Quantifying the impact is a good exercise for you to do anyway, since if you build the system you can then put an estimate of the value you created on your resume.

                                                                                                    1. 2

                                                                                                      Correct; you will only be able to have one worker per ordering key, or you lose your ordering guarantee.

                                                                                                  2. 2

                                                                                                    If you want to avoid duplicates and lost messages, the only solution is to use idempotent APIs to send messages. If you do not receive an ack within some time, resend the message idempotently; lost sends/acks are repeated and the API provider filters the duplicates. Only proceed to sending message N+1 once you eventually succeed sending message N.

                                                                                                    If your API provider does not provide an idempotent API, then you could try to query their API for your last sent message and compare it with the one you plan to send. But this is slow and, since it’s not atomic/transactional, very racy.
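
                                                                                                    A minimal sketch of that resend loop, assuming a hypothetical provider API that accepts a client-supplied idempotency key (send_message and its parameters are made up for illustration):

                                                                                                    ```python
                                                                                                    import time
                                                                                                    import uuid

                                                                                                    def send_exactly_once(api, text, max_attempts=5, ack_timeout=2.0):
                                                                                                        # One idempotency key per logical message: every resend carries the
                                                                                                        # same key, so the provider can drop duplicates caused by lost acks.
                                                                                                        key = str(uuid.uuid4())
                                                                                                        for attempt in range(max_attempts):
                                                                                                            try:
                                                                                                                # Hypothetical provider call: blocks until acked or times out.
                                                                                                                api.send_message(text=text, idempotency_key=key, timeout=ack_timeout)
                                                                                                                return True
                                                                                                            except TimeoutError:
                                                                                                                time.sleep(2 ** attempt)  # back off, then resend the same key
                                                                                                        return False

                                                                                                    # To preserve ordering, send strictly one at a time: proceed to message
                                                                                                    # N+1 only after send_exactly_once(...) returns True for message N.
                                                                                                    ```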

                                                                                                1. 3

                                                                                                  I wish there was a more in-depth analysis as to why certain benchmarks were faster/slower, and also a discussion of the benchmarking methodology (eg. were there pauses to account for the thermal throttling of both processors?).

                                                                                                  1. 7

                                                                                                    Take a look at your nearest corporate wiki. I can almost guarantee it’s a mess because most companies should never have wikis.

                                                                                                    What’s the alternative? Documentation in the codebase? Design docs? The article seems to suggest code rewrites as the solution. However, there may still be knowledge outside of the code (eg. ops playbooks) that still needs to go somewhere.

                                                                                                    1. 1

                                                                                                      I think he means to have docstrings generated into docs, so at least the person changing the code has immediate access to the human text.

                                                                                                    1. 5

                                                                                                      I use Brazil on a day-to-day basis and the thing I’d liken it to most is nixpkgs. Personally, I think that most of the benefits of Brazil can be realized with something like target/lorri and Cargo-style build systems.

                                                                                                      1. 1

                                                                                                        I’m not sure if I understood the “version set” concept completely. It seems similar to Cargo.lock files, where the transitive closure of your dependencies is recorded. However…

                                                                                                        When version sets are imported into Apollo, all of the test and build dependencies are stripped away and the remaining graph is passed in.

                                                                                                        That sounds like it additionally stores which dependencies are only used for testing and build purposes but are unnecessary for deployment. That sounds more like a package manager like Nix or Apt. In some sense, Amazon software is treated like an extended Linux distro?

                                                                                                        If you create a package, say the IDL for your service API or config for clients to connect, you specify the interface version (say, “1.0”). Every time you build, the generated package is given a concrete build version (“1.0.1593214096”). That build version is what’s stored in the version set for each package node. Developers never manage the build version either as an author or consumer.

                                                                                                        That sounds unlike any package manager I know.

                                                                                                        1. 3

                                                                                                          I’m not sure if I understood the “version set” concept completely. It seems similar to Cargo.lock files, where the transitive closure of your dependencies is recorded. However…

                                                                                                          Yup, you can think of version sets as a server-managed Cargo.lock, which enables some neat things with the internal CI/CD system, such as “where has this commit been deployed to?”. From a development standpoint, version sets are also tied to workspaces, which consist of several cloned packages. You can think of workspaces as akin to Cargo workspaces.

                                                                                                          That sounds like it additionally stores which dependencies are only used for testing and build purposes but are unnecessary for deployment.

                                                                                                          That’s right.

                                                                                                          That sounds more like a package manager like Nix or Apt. In some sense, Amazon software is treated like an extended Linux distro?

                                                                                                          It really depends on the language, but that’s not wholly inaccurate. With npm or Cargo-based projects, language-idiomatic tooling is used for builds, with Brazil serving system dependencies. At that point, it’s less a build system and more a packaging and artifact-provenance system.

                                                                                                          That sounds unlike any package manager I know.

                                                                                                          Yup, that part is unique due to the norms around package versioning (you don’t ever cut a new version.)

                                                                                                          1. 1

                                                                                                            Yup, that part is unique due to the norms around package versioning (you don’t ever cut a new version.)

                                                                                                            What do you mean by “cut”?

                                                                                                            1. 1

                                                                                                              My bad. In this instance, releasing a package at a new version. A bump from 1.0 to 1.1 is considered a really big deal and is highly discouraged. Brazil treats any version bump as a breaking change, since version numbers are treated as opaque strings.

                                                                                                        2. 1

                                                                                                          How are security issues in dependencies handled? Does the insecure version get blacklisted and consumers are forced to move to a later version, or do the dependency authors backport the fix to older versions?

                                                                                                          1. 6

                                                                                                            When I was there, such things would be surfaced in various interfaces, but if it was truly critical it would require one team going and wrangling all the other teams into updating their version sets (and possibly doing the work for them).

                                                                                                            The upside of version sets is that each team owns their whole dependency chain and gets to choose when to take changes from other teams: it gives every team significant independence. The downside is that version sets accumulate cruft (good luck fully rebuilding one; it’s quite possible some of the artifacts you’ve been using for years no longer compile), and if you vend libraries to other teams you can expect a long tail of old versions in use. It’s very difficult to make the kind of large, company-wide crosscutting changes that you can at eg. Google/FB, and that includes things like compiler/runtime upgrades.

                                                                                                            I don’t really think there’s a better or worse between the Amazon and Google/FB style build/deploy/source control systems, it’s primarily a reflection of the org/team structure and what they prioritize - there’s tension between team independence/velocity and crosscutting changes that optimize the whole codebase.

                                                                                                        1. 1

                                                                                                          I’m confused by this. According to page 10 of the research paper, k-anonymized data is sent to the research partner, who then classifies this data by perceived race. However, the Project Lighthouse blog post states:

                                                                                                          to get these perceptions, we’ll share Airbnb profile photos and the first names associated with them with an independent partner that’s not part of Airbnb

                                                                                                          How does one k-anonymize a profile photo?

                                                                                                          1. 3

                                                                                                            From my understanding of page 12 in the paper, the k-anonymize step generates rows for ‘Data Store 1’ by mapping user_ids to an anonymized nid.

                                                                                                            Then files are sent to the trusted partner where the data being classified (first_name, photo_url) is keyed by nid (instead of user id).

                                                                                                            So once the classification makes its way back to Airbnb, there is no way to correlate the result with the underlying user_id (just the k-anonymized nid).
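
                                                                                                            A toy sketch of that mapping as I understand it (the field names and data are hypothetical, and the real pipeline in the paper has more steps):

                                                                                                            ```python
                                                                                                            import secrets

                                                                                                            # Hypothetical Airbnb-side rows keyed by internal user ids.
                                                                                                            rows_to_classify = [
                                                                                                                (101, "Ana", "photos/ana.jpg"),
                                                                                                                (102, "Ben", "photos/ben.jpg"),
                                                                                                            ]

                                                                                                            nid_by_user = {}

                                                                                                            def nid_for(user_id):
                                                                                                                # One random, meaningless id per user; the mapping never leaves Airbnb.
                                                                                                                if user_id not in nid_by_user:
                                                                                                                    nid_by_user[user_id] = secrets.token_hex(16)
                                                                                                                return nid_by_user[user_id]

                                                                                                            # What the partner receives: (nid, first_name, photo), no user_id anywhere,
                                                                                                            # so the returned classifications can't be tied back to an account by them.
                                                                                                            outbound = [(nid_for(uid), name, photo) for (uid, name, photo) in rows_to_classify]
                                                                                                            print(outbound)
                                                                                                            ```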

                                                                                                            1. 1

                                                                                                              That makes sense; the data isn’t fully k-anonymized until it comes back from the research partner.

                                                                                                          1. 3

                                                                                                            We need the hash of the uncompressed block because now that we have added a layer of optimization in compression we have also exposed ourselves to a durability issue. Corruption is now possible while compressing the block on the client as our clients rarely have ECC memory. We see a constant rate of memory corruption in the wild and end-to-end integrity verification always pays off.

                                                                                                            Fascinating: non-ECC memory corruption is something I’ve heard about, but I’ve rarely come across someone saying they’ve had to mitigate it in their code.
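
                                                                                                            For anyone curious, the check they describe is cheap to sketch; here sha256 and zlib are illustrative stand-ins, since the post doesn’t say exactly which hash or compressor Dropbox uses:

                                                                                                            ```python
                                                                                                            import hashlib
                                                                                                            import zlib

                                                                                                            def pack(block: bytes):
                                                                                                                # Hash the *uncompressed* bytes, so corruption introduced during
                                                                                                                # compression (e.g. by a flaky non-ECC DIMM) is still detectable.
                                                                                                                return hashlib.sha256(block).digest(), zlib.compress(block)

                                                                                                            def unpack(digest: bytes, compressed: bytes) -> bytes:
                                                                                                                block = zlib.decompress(compressed)
                                                                                                                if hashlib.sha256(block).digest() != digest:
                                                                                                                    raise IOError("end-to-end integrity check failed: block corrupted")
                                                                                                                return block

                                                                                                            digest, payload = pack(b"some file block")
                                                                                                            assert unpack(digest, payload) == b"some file block"
                                                                                                            ```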

                                                                                                            1. 4

                                                                                                              It is rare indeed. I have seen this before from ZFS developers, which makes sense.

                                                                                                              1. 12

                                                                                                                dropbox sync engineer here :)

                                                                                                                we regularly see bitflips on consumer machines. for example, the desktop client’s sync engine has a metadata consistency checker that compares the client’s view of the filesystem to the remote filesystem at a snapshot, and we report any discrepancies up to the server. any report should indicate a bug in our protocol to be fixed.

                                                                                                                but… we do have to bucket out differences in metadata that are separated by one or two bitflips. it’s not a huge number but it shows up when you’re trying to have zero inconsistencies over many 10s of millions of machines.

                                                                                                                1. 1

                                                                                                                  I love seeing examples like this of the law of large numbers. See also: “it’s never the network”, “this random string will never appear in customer input”, (ab)using JavaScript numbers as monotonic counters, etc.

                                                                                                                  1. 1

                                                                                                                    Cool! Is it usually a small subset of clients that have most of the bitflips or are they pretty evenly distributed?

                                                                                                                    1. 6

                                                                                                                      you know, I had never actually looked at the distribution. I just ran the numbers and indeed a handful of hosts make up the majority of bitflip inconsistencies.

                                                                                                                      1. 2

                                                                                                                        Thanks for running it! My (strawman) theory is that most of these bitflips are coming from bad RAM rather than external factors like EM radiation.

                                                                                                              1. 1

                                                                                                                I’m using the 1440p variant of the HP Z27 with a MacBook Pro and it’s served me very well. IMO 1080p/4K resolution is better on a 24” monitor, while 27” should be 1440p/5K. To each their own; I’m sure that the 4K Z27 is awesome too.

                                                                                                                1. 2

                                                                                                                  Sampling, which isn’t covered by this article, is another technique that could be used to reduce memory usage.

                                                                                                                  1. 9

                                                                                                                    I feel like “write small methods” has been drilled into so many engineers these days that it usually tends to go in the opposite direction, with tons of nested 5-10 line functions that make debugging difficult. As an alternative, I like John Ousterhout’s philosophy of writing deep modules/classes/functions (small interface, lots of functionality).

                                                                                                                    1. 3

                                                                                                                      That link is superb. I particularly like the emphasis on minimal interfaces. Committing to the fewest number of simple interfaces necessary really cuts down the design space to be considered.

                                                                                                                    1. 2

                                                                                                                      In my experience, there has been a distinct lack of critical thinking in software engineering. Whether it’s citing scientific studies as the end-all-be-all, or saying “hey, it works for company X, therefore it will work for us,” it’s the same problem. I think software needs to be more scientific and data-driven, but on a smaller scale (within your team/organization).

                                                                                                                      As an anecdote, I’ve had an experience where a full rewrite was proposed for performance issues without any measurement as to what those issues actually were. Experiences from the past were put forth to justify areas to focus on for performance issues. The rewrite went ahead and a lot of effort was put into optimizing something that brought very little improvement. In this case, measuring first would have helped identify the more significant issues.

                                                                                                                      1. 2

                                                                                                                        I think query builders are pretty nifty tech, in that their surface area is super small and they help out a bit (though tbh, manual joins when you have a lot of relational data are a pain).

                                                                                                                        What I still haven’t seen but want is some way for us to have client-side plan builders. Instead of giving Postgres an SQL string and for it to pass through an optimiser, I would love to be able to describe a query plan which sidesteps the planner and gives the executor the “ideal” way of handling things for the shape of my data.

                                                                                                                        I actually know what my data looks like, and in some cases I know that if I could convince the planner to do a thing, it would go much faster. But this is rarely given to us. Given how many people love dropping down to C/assembler for stuff, I’m surprised to have never seen someone try to circumvent SQL to build nicer plans.
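
                                                                                                                        It’s not the client-side plan builder you’re describing, but the closest existing knob I know of in Postgres is its set of planner GUCs, which you can toggle per session to veto strategies the planner would otherwise pick (sketch assumes psycopg2 and a hypothetical orders table):

                                                                                                                        ```python
                                                                                                                        import psycopg2

                                                                                                                        conn = psycopg2.connect("dbname=app")  # hypothetical local database
                                                                                                                        with conn, conn.cursor() as cur:
                                                                                                                            # Coarse, session-scoped hints: not a plan builder, but they let
                                                                                                                            # you forbid strategies the planner would otherwise pick.
                                                                                                                            cur.execute("SET enable_seqscan = off")
                                                                                                                            cur.execute("SET enable_nestloop = off")
                                                                                                                            # Hypothetical query; EXPLAIN ANALYZE shows which plan actually ran.
                                                                                                                            cur.execute("EXPLAIN ANALYZE SELECT * FROM orders WHERE user_id = 42")
                                                                                                                            for (line,) in cur.fetchall():
                                                                                                                                print(line)
                                                                                                                        ```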

                                                                                                                        1. 1

                                                                                                                          Do you experience a lot of query planner inefficiency? I’ve noticed that (at least for MySQL/Percona) it provides decent results as long as you don’t have any low-cardinality indexes.

                                                                                                                          1. 2

                                                                                                                            So for 95% of stuff this is absolutely no problem.

                                                                                                                            There are a couple of really tricky queries where I have chosen to optimise for maintainability of the query (just relying on the ORM) over trying to get the machine to do “what I want”. This often ends with me reaching for subqueries that conceptually just can’t be aware of the data properties I’m aware of. Stuff like “well, for rows with this property I don’t actually need this subquery result”. But this stuff is hard.