Threads for saturn

  1.  

    I don’t think this is on topic. As the moderation log is fond of saying, “Lobsters is not the torch and pitchfork store.”

    1.  

      I’m not asking for revenge or whatever—this is a warning about a well-known manufacturer behaving badly.

      But fair enough that it may not be on-topic enough. I’ve never been entirely clear on what lobsters considers on-topic and off-.

      1.  

        this is a warning about a well-known manufacturer behaving badly.

        Personally I find this a useful datum.

        Similarly, the drama around Manjaro’s expenses and their financial controller leaving - I don’t care for arguing about it online, but it did feed into my decision to use Mint rather than Manjaro on my recent gaming system reinstallation.

      2.  

        The title is clearly informative and 100% tech related. It is a really useful post.

        Note that the discussion here is civil and focused. People are discussing what it means to hire a surveillance cop in such a company, with pros and cons.

        Lobste.rs cannot be a bubble of “tech without politics” because the tech itself is political and influences human politics more than ever before. It would be like talking about cars without talking about drivers and the effect on the environment.

        1.  

          Seconded. This is not on topic and is not useful.

        1. 6

          Is the premise of the backlash that all surveillance is evil? There is crime in the world, and sometimes cracking cases involves criminal surveillance; this doesn’t seem controversial to me.

          1.  

            No, it’s a necessary evil – and as such, it’s something to be discussed with appropriate gravity. Not “whee, he makes light sabers, bye bye” which are literal quotes from the official account.

            But really, it’s the insults that are astonishing.

            1.  

              Yeah, that’s definitely not an appropriate response. It sounds like they hired a child to run social media:)

            2.  

              sometimes cracking cases involves criminal surveillance

              You speak as though the cops regularly crack cases. I think someone’s been watching too much TV…

              1.  

                They do regularly break cases. They just don’t have a very good success rate across all cases. These are two different things.

                Classic base rate neglect fallacy.

            1.  

              A bit of a weird post (for lobsters) as it’s a social media post and not an article, but this is apparently their official account, and this seems like the kind of Milkshake Duck moment that would be relevant to a lot of people here.

              Feel free to suggest updates to the title. It’s a little hard to summarize…

              EDIT: Here’s a screenshot of the thread complete with blocked replies, sorry it’s in this format: https://lab.brainonfire.net/tmp/delete-after/20231231/rspy.png

              1.  

                410

                1.  

                  Oof, yeah, their site isn’t doing so well, and the snapshot that I thought I’d saved… isn’t. Their original post:

                  We hired a policeman & it’s going really great. Meet our #Maker in Residence @TobyRobertsPi.

                  “I was a #Surveillance Officer for 15 years, so I built stuff to hide covert video & audio gear. I’d disguise it as something else, like a piece of street furniture or a household item.

                  During all those years of working with Raspberry Pi, I never thought I’d end up working here; as I’ve always been a #RaspberryPi fan, I’m fascinated to see what takes place behind the scenes.”

                  That’s followed by a large number of people questioning if this is a joke or saying that it’s in poor taste, then getting a mix of belligerent and condescending replies from the account, and finally getting blocked. (Possibly so their replies don’t show up on the page?) I’ll try to get a screengrab, but here’s an example:

                  <@eviloatmeal@ak.angelstrapped.com>: Yikes.
                  <@Raspberry_Pi@raspberrypi.social> @eviloatmeal Wanna block us or should we block you?

                2.  

                  I find “spy” to convey ill intent. I would stick with the language from the post by using “police officer” or “surveillance specialist”, or something similar.

                  1.  

                    Thanks, updated to “surveillance cop”. There’s a character limit in the title.

                1. 8

                  Not quite on topic for lobsters (cancel culture?). Another way of thinking about this is: here’s a human being who’s a fan of cool gear and quite a geek and he switched jobs from working in surveillance (boo) to working for Raspberry Pi, which gives us average people cool tech (yay).

                  1.  

                    Yeah, it’s the tone of the announcement that’s a bit of a yellow flag, and then the replies from the official account that are atrocious.

                    I’m glad the guy’s not doing surveillance tech any more. That part I’m probably fine with.

                  1.  

                    Besides the bad (employee used to build covert A/V devices for surveillance purposes in his old job), there’s also some advantages to this:

                    • Employee is able to design/integrate embedded computers in household items, and has experience with this
                    • Employee has experience with low-power environments, most likely battery powered
                    • A/V experience with RPi is a nice-to-have nowadays, given the popularity of motion/frigate and integrations with smart home boxes (which are very well supported on (surprise!) Raspberry Pis)

                    Discrimination based on previous job, without any proof that he had any malicious intention or did anything wrong, is something I’m surprised to see as a first reaction.

                    Until I find any of his covert devices in my home, on my street, or read about similar stories somewhere else, I’d be willing to give the guy a chance!

                    1. 7

                      I might actually be fine with them hiring the guy! But it’s hard to know because by the time I saw the thread, whoever was running the account was already saying (and I quote) “bye bye” to politely surprised posts and blocking people. The community manager is who needs to be fired.

                    1. 1

                      Use of the term “eyeball” here is pretty off-putting. If that’s what they want as their internal terminology, that’s fine, but maybe use different words in your public post?

                      1. 10

                        https://en.wikipedia.org/wiki/Eyeball_network https://en.wikipedia.org/wiki/Happy_Eyeballs

                        Okay, “slang”. The problem is this is the most sensible term here. Think about alternatives:

                        • “users” -> what users, cloudflare users? people using websites on cloudflare?
                        • “customers” -> similar, our customers or our customers’ customers?
                        • “humans” -> misleading (what about crawlers)
                        • “endpoints” -> … initiating or terminating connections?

                        And so on. Like it or not, “eyeball” is the best term here, and is pretty much a standard term in networking.

                      1. 3

                        Always? No. I’ve had to write range(0, x+1) so many times when I want to work with an inclusive endpoint, and it’s not just annoying, but also fails to communicate intent. Is the +1 part of the algorithm, or just a choice of closed interval? It’s not clear.

                        I appreciate that Kotlin gives you both: a..b is closed, a until b is half-open. I use both all the time! Having just one is really irritating, no matter which one it is.

                        (I also appreciate that IntelliJ displays them with an appropriate mix of < and ≤ signs, removing all ambiguity.)

                        1. 2

                          Thanks for the info about ranges in Kotlin. Ranges in other languages that support both types:

                          • Ruby: a..b is closed, a...b is half-open
                          • Swift: a...b is closed, a..<b is half-open
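
                          As an illustration of the same distinction, Rust also has both forms (.. is half-open, ..= is closed); a minimal sketch:

                          ```rust
                          fn main() {
                              // Half-open: end excluded, like Kotlin's `until` or Python's range().
                              let half_open: Vec<i32> = (0..3).collect();
                              assert_eq!(half_open, vec![0, 1, 2]);

                              // Closed: end included, like Kotlin's `a..b`.
                              let closed: Vec<i32> = (0..=3).collect();
                              assert_eq!(closed, vec![0, 1, 2, 3]);

                              // Intent stays visible: no `x + 1` needed for an inclusive endpoint.
                              let x = 10;
                              let sum: i32 = (0..=x).sum();
                              assert_eq!(sum, 55);
                          }
                          ```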

                          1. 1

                            Ah, cool! Ruby’s syntax scares me a little (so visually similar!) but Swift’s looks nice.

                        1. 2

                          Unless Twitter requires manual interventions to run (imagine some guys turning cranks all day long :)), why exactly would it go down?

                          1. 15

                            Eventually, they will have an incident and no one remaining on staff will know how to remediate it, so it will last for a long time until they figure it out. Hopefully it won’t last as long as Atlassian’s outage!

                            1. 15

                              Or everyone remaining on staff will know how to fix it, but they will simply get behind the pace. 12-hour days are not sustainable, and eventually people will be ill more often and make poorer decisions due to fatigue. This post described the automation as clearing the way to spend most of their time on improvements, cost savings, etc. If you only spent 26% of your time putting out fires and then lost 75% of your staff, the remaining 25% of capacity no longer covers the firefighting alone: you’re 1% underwater indefinitely (which completely ignores the mismatch between when people work best and when incidents occur).

                              1. 6

                                Even worse - things that would raise warnings and get addressed before they’re problems may not get addressed in time if the staffing cuts were too deep.

                              2. 8

                                That’s how all distributed systems work – you need people turning cranks all day long :) It gets automated over time, as the blog post describes, but it’s still there.

                                That was my experience at Google. I haven’t read this book but I think it describes a lot of that: https://sre.google/sre-book/table-of-contents/

                                That is, if such work didn’t exist, then Google wouldn’t have invented the job title “SRE” some time around 2003. Obviously people were doing similar work before Google existed, but that’s the term that Twitter and other companies now use (in the title of this blog post).

                                (Fun fact: while I was there, SREs started to be compensated as much as or more than Software Engineers. That makes sense to me given the expertise/skills involved, but it was a cultural change. Although I think it shifted again once they split SRE into 2 kinds of roles – SRE-SWE and SRE-SysAdmin.)


                                It would be great if we had strong abstractions that reduce the amount of manual work, but we don’t. We have ad hoc automation (which isn’t all bad).

                                Actually Twitter/Google are better than most web sites. For example, my bank’s web site seems to go down on Saturday nights now and then. I think they are doing database work then, or maybe hardware upgrades.

                                If there was nobody to do that maintenance, then eventually the site would go down permanently. User growth, hardware failures (common at scale), newly discovered security issues, and auth for external services (SSL certs) are some reasons for “entropy”. (Code changes are the biggest one, but let’s assume here that they froze the code, which isn’t quite true.)


                                That’s not to say Twitter/Google can’t run with a small fraction of the employees they have. There is for sure a lot of bloat in code and processes.

                                However I will also note that SREs/operations became the most numerous type of employee at Google. I think there were something like 20K-40K employees under Hölzle/Treynor when I left 6+ years ago, could easily be double that now. They outnumbered software engineers. I think that points to a big problem with the way we build distributed systems, but that’s a different discussion.

                                1. 7

                                  Yeah, ngl, the blog post rubbed me the wrong way. That tasks are running is step 1 of the operational ladder. Tasks running and spreading is step 2. But after that, there is so much work for SRE to do. Trivial example: there’s a zero-day that your security team says is being actively exploited right now. Who is the person who knows how to get that patched? How many repos does it affect? Who knows how to override all deployment checks for all the production services that are being hit and push immediately? This isn’t hypothetical; there are plenty of state-sponsored actors who would love to do this.

                                  I rather hope the author is a junior SRE.

                                  1. 3

                                    I thought it was a fine blog post – I don’t recall that he claimed any particular expertise, just saying what he did on the cache team

                                    Obviously there are other facets to keeping Twitter up

                                  2. 4

                                    For example, my bank’s web site seems to go down on Saturday nights now and then. I think they are doing database work then, or maybe hardware upgrades.

                                    IIUC, banks do periodic batch jobs to synchronize their ledgers with other banks. See https://en.wikipedia.org/wiki/Automated_clearing_house.

                                    1. 3

                                      I think it’s an engineering decision. Do you have people to throw at the gears? Then you can run the kind of system that needs humans to occasionally jump in but gives better outcomes. Do you lack people? Then you’re going to need simpler systems that rarely need a human, and you won’t always get the best possible outcomes that way.

                                      1. 2

                                        This is sort of a tangent, but part of my complaint is actually around personal enjoyment … I just want to build things and have them be up reliably. I don’t want to beg people to maintain them for me

                                        As mentioned, SREs were always in demand (and I’m sure still are), and it was political to get those resources

                                        There are A LOT of things that can be simplified by not having production gatekeepers, especially for smaller services

                                        Basically I’d want something like App Engine / Heroku, but more flexible, but that didn’t exist at Google. (It’s a hard problem, beyond the state of the art at the time.)

                                        At Twitter/Google scale you’re always going to need SREs, but I’d claim that you don’t need 20K or 40K of them!

                                        1. 1

                                          My personal infrastructure and approach around software is exactly this. I want, and have, some nice things. The ones I need to maintain the least are immutable – if they break I reboot or relaunch (and sometimes that’s automated) and we’re back in business.

                                          I need to know basically what my infrastructure looks like. Most companies, if they don’t have engineers available, COULD have infrastructure that doesn’t require you to cast humans upon the gears of progress.

                                          But in e.g. Google’s case, their engineering constraints include “We’ll always have as many bright people to throw on the gears as we want.”

                                          1. 1

                                            Basically I’d want something like App Engine / Heroku, but more flexible, but that didn’t exist at Google.

                                            I think about this a lot. We run on EC2 at $work, but I often daydream about running on Heroku. Yes it’s far more constrained but that has benefits too - if we ran on Heroku we’d get autoscaling (our current project), a great deploy pipeline with fast reversion capabilities (also a recentish project), and all sorts of other stuff “for free”. Plus Heroku would help us with application-level stuff, like where we get our Python interpreter from and managing its security updates. On EC2, and really any AWS service, we have to build all this ourselves. Yes AWS gives us the managed services to do it with but fundamentally we’re still the ones wiring it up. I suspect there’s an inherent tradeoff between this level of optimization and the flexibility you seek.

                                            Heroku is Ruby on Rails for infrastructure. Highly opinionated; convention over configuration over code.

                                            At Twitter/Google scale you’re always going to need SREs, but I’d claim that you don’t need 20K or 40K of them!

                                            Part of what I’m describing above is basically about economies of scale working better because more stuff is the same. I thought things like Borg and gRPC load balancing were supposed to help with this at Google though?

                                      2. 2
                                        1. Random failures that aren’t addressed
                                        2. Code and config changes (which are still happening, to some extent)

                                        It can coast for a long time! But eventually it will run into a rock because no one is there to course-correct. Or bills stop getting paid…

                                        1. 1

                                          I don’t have a citation for this but the vast majority of outages I’ve personally had to deal with fit into two bins as far as root causes go:

                                          • resource exhaustion (full disks, memory leak slowly eating all the RAM, etc)
                                          • human-caused (eg a buggy deployment)

                                          Because of the mass firing and exodus, as well as the alleged code freeze, the second category of downtime has likely been mostly eliminated in the short term, and the system is probably more stable than usual. Temporarily, of course, because of all of the implicit knowledge that walked out the doors recently. Once new code is being deployed by a small subset of people who know the quirks, I’d expect things to get rough for a while.

                                          1. 2

                                            You’re assuming that fewer people means fewer mistakes.

                                            In my experience “bad” deployments are less often because someone is constantly pumping out code with the same number of bugs per deployment, and more often because the deployment breaks how other systems interact with the changed system.

                                            In addition fewer people under more stress, with fewer colleagues to put their heads together with, is likely to lead to more bugs per deployment.

                                            1. 1

                                              Not at all! More that… bad deployments are generally followed up with a fix shortly afterwards. Once you’ve got the system up and running in a good state, not touching it at all is generally going to be more stable than doing another deployment with new features that have potential for their own bugs. You might have deployed “point bugs” where some feature doesn’t work quite right, but they’re unlikely to be showstoppers (because the showstoppers would have been fixed immediately and redeployed)

                                        1. 4

                                          To be honest I don’t really like finding behaviors like this when I am not looking for them explicitly, because when this happens, I am prone to feeling obsessively responsible to investigate.

                                          I felt this in my bones. 😩

                                          1. 2

                                            Does it have to be between a “client” and a “server” or can it be P2P as well?

                                            1. 2

                                              Additionally, a client that expands Initial packets helps reduce the order of amplitude gain of amplification attacks caused by server responses toward an unverified client address.

                                              I’m confused by how this helps. Is the server required to reject any initial packet of size less than 1200 bytes? Otherwise the attacker could just send smaller ones.

                                              1. 4

                                                I was with this right up until this line:

                                                Arguably, Microsoft is creating a new walled garden that will inhibit programmers from discovering traditional open-source communities. Or at the very least, remove any incentive to do so.

                                                In my experience, people mostly engage with open source communities because they A) need a library, or B) want to contribute back. They’re not copying individual lines of open source code, but using libraries, and I think that wouldn’t really change with Copilot in play.

                                                1. 9

                                                  Very interesting, but I would love to hear more details on why Matrix is not as scalable. They hint at the merge operations but I don’t understand why that is a problem.

                                                  1. 18

                                                    We’d like to know too :) Matrix as protocol isn’t inherently unscalable at all. It’s true that every time you send a message in matrix you effectively are merging the state of one chatroom with the state of another one - very similar to how you push a commit in Git. Generally this is trivial, but if there’s a merge conflict, it’s heavier to resolve. The Synapse python implementation was historically terrible at this, but has been optimised a lot in the last 12 months. The Dendrite go implementation is pretty fast too.

                                                    There’s an interesting optimisation that we designed back in 2018 where you incrementally resolve state (so called ‘delta state res’), where you only resolve the state which has changed rather than considering all the room state (i.e. all key-value pairs of data associated with the room) en masse. https://matrix.org/_matrix/media/v1/download/jki.re/ubNfLtrmXZMmlGjJZYPnlHHy and https://github.com/matrix-org/synapse/pull/3122 give a bit of an idea of how that works. It would be really cool if Process One is doing something like that with ejabberd, but in practice we suspect that they’ve just done an efficient implementation of the current state res algorithm. We’ve pinged them on Twitter to see if they want to discuss what they’re up to :) https://twitter.com/matrixdotorg/status/1580549591807975430
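
                                                    As a rough illustration of the merge analogy (a toy sketch only, not how Synapse or Dendrite actually implement state resolution): model room state as (event type, state key) → event id pairs; merging is trivial wherever both sides agree or only one side has a key, and only the conflicting keys need the expensive resolution step.

                                                    ```rust
                                                    use std::collections::HashMap;

                                                    // Toy model: room state as (event type, state key) -> event id.
                                                    type StateKey = (String, String);
                                                    type StateMap = HashMap<StateKey, String>;

                                                    // Merge two views of a room's state. Agreement (or a key only one
                                                    // side has) is cheap; only conflicting keys need full state res.
                                                    fn merge(a: &StateMap, b: &StateMap) -> (StateMap, Vec<StateKey>) {
                                                        let mut merged = a.clone();
                                                        let mut conflicts = Vec::new();
                                                        for (key, value) in b {
                                                            match merged.get(key) {
                                                                None => {
                                                                    merged.insert(key.clone(), value.clone()); // trivial: take it
                                                                }
                                                                Some(existing) if existing == value => {}      // trivial: both agree
                                                                Some(_) => conflicts.push(key.clone()),        // conflict: needs state res
                                                            }
                                                        }
                                                        (merged, conflicts)
                                                    }

                                                    fn main() {
                                                        let topic = ("m.room.topic".to_string(), "".to_string());
                                                        let mut a = StateMap::new();
                                                        a.insert(topic.clone(), "$event_a".to_string());
                                                        let mut b = StateMap::new();
                                                        b.insert(topic.clone(), "$event_b".to_string());

                                                        let (_merged, conflicts) = merge(&a, &b);
                                                        assert_eq!(conflicts, vec![topic]); // both sides set the topic differently
                                                    }
                                                    ```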

                                                    1. 11

                                                      There’s an interesting optimisation that we designed back in 2018 where you incrementally resolve state

                                                      Is it really so hard to see why a protocol that cares about conversation state is more difficult to scale than a protocol that completely ignores it? Seems almost tautological to me.

                                                      1. 15

                                                        Matrix is certainly more complex to scale (as our inefficient first gen implementations demonstrated), but I think folks are conflating “it’s complex to write an efficient implementation” with “it doesn’t scale”. It’s like pointing out that writing an efficient Git implementation is harder than writing an efficient CVS implementation; hardly surprising given the difference in semantics.

                                                        In practice, you can definitely write a Matrix implementation where all operations (joining, sending, receiving, etc) are O(1) per destination, and don’t scale with the amount of state (i.e. key value pairs) in a room. And to be clear, Matrix never scales with the amount of history in a room; history is always lazyloaded so it doesn’t matter how much scrollback there is.

                                                        Historically, joining rooms in Matrix was O(N) with the number of the users in that room, but we’ve recently fixed this with “faster remote joins”, which allows the room state to get lazily synced in the background, thus making it O(1) with size of room, as it should be. https://github.com/matrix-org/matrix.org/blob/80b36d13c3097ffb5ba33572d9011e71940f1486/gatsby/content/blog/2022/10/2022-10-04-faster-joins.mdx is a shortly-to-be-published blog post giving more context, fwiw.

                                                        1. 9

                                                          The post doesn’t say “Matrix doesn’t scale”, just that XMPP and MQTT scale better. This is because they’re solving dramatically simpler problems. I don’t see any problems with that claim.

                                                          1. 4

                                                            As an aside, from that draft,

                                                            whereas it used to take upwards of 12 minutes to join Matrix HQ […] this is now down to about 30 seconds (and we’re confident that we can reduce this even further).

                                                            Holy cow they did it! Woo! So proud of the Synapse team :)

                                                            1. 2

                                                              On the technical side, that’s genuinely impressive work. On the product side, I can’t help but compare with iMessage, signal, WhatsApp and discord being closer to one second.

                                                              1. 3

                                                                the target is indeed <1s, and should still be viable. we’ve shaved the number of events needed to join #matrix:matrix.org from ~70K to ~148 iirc, which should be transferred rapidly.

                                                        2. 11

                                                          We’d like to know too :) Matrix as protocol isn’t inherently unscalable at all

                                                          I suspect that this is a question of relative scale. A lot of users of eJabberd are using it as a messaging bus, rather than a chat protocol and so sending a message is likely to be on the order of a few hundred BEAM VM instructions. This is especially true of MQTT, where you don’t have the XML parsing overhead of XMPP and you can parse the packet entirely with Erlang pattern matching. If it’s a deferred message then you may write to Mnesia, but otherwise it’s very fast. In contrast, something that keeps persistent state and does server-side merging is incredibly heavy. That doesn’t mean that it isn’t the right trade off for the group collaboration scale, but it definitely means that you wouldn’t want to use Matrix as the control plane for a massively networked factory, for example.

                                                          1. 5

                                                            I guess it will be interesting to benchmark. To use the git v. cvs example again, I think it’s possible to have an efficient (but complex) merging system like git which outperforms a simple “it’s just a set of diffs” VCS. We certainly use Matrix successfully in some places as a general purpose message bus, although when we need faster throughput we typically negotiate a webrtc datachannel over Matrix (e.g. how thirdroom.io exchanges its world data).

                                                            1. 5

                                                              The analogy isn’t really matched to this context though. SIP or XMPP or MQTT doesn’t involve diffs or storage or really even state in the basic use case, whereas Matrix is always diffs and merges.

                                                              1. 4

                                                                Also git and CVS are programs and file formats with (roughly) one implementation, whereas MQTT and Matrix are protocols. The semantics of protocols place an upper bound on the efficiency of any potential implementation.

                                                          2. 9

                                                            No one said it was unscalable, just that it was harder. If it takes a dedicated team multiple years and a full reimplementation to scale it, and even just joining a room is still slow, that says something.

                                                            I currently run Dendrite unfederated, in part (though not solely) because I don’t want someone to accidentally bring down my small server by joining a large channel somewhere else. I still think Matrix is a good idea, but “scaling Matrix is hard” should be a pretty uncontroversial statement.

                                                            1. 0

                                                              The OP said “Matrix is not as scalable”. My point is that yes, it’s harder to scale, but the actual scalability is not intrinsically worse. It’s the same complexity (these days), and the constants are not that much worse.

                                                        1. 9

                                                          The thread on LKML about this work really doesn’t portray the Linux community in a good light. With a dozen or so new kernels being written in Rust, I wouldn’t be surprised if this team gives up dealing with Linus and goes to work on adding good Linux ABI compatibility to something else.

                                                          1. 26

                                                            I dunno, Linus’ arguments make a lot of sense to me. It sounds like he’s trying to hammer some realism into the idealists. The easter bunny and santa claus comment was a bit much, but otherwise he sounds quite reasonable.

                                                            1. 19

                                                                    The disagreement is over whether “panic and stop” is appropriate for the kernel, and here I think Linus is just wrong. Debugging can be done by panic handlers; there is just no need to continue.

                                                              Pierre Krieger said it much better, so I will quote:

                                                              Part of the reasons why I wrote a kernel is to confirm by experience (as I couldn’t be sure before) that “panic and stop” is a completely valid way to handle Rust panics even in the kernel, and “warn and continue” is extremely harmful. I’m just so so tired of the defensive programming ideology: “we can’t prevent logic errors therefore the program must be able to continue even a logic error happens”. That’s why my Linux logs are full of stupid warnings that everyone ignores and that everything is buggy.

                                                                    One argument I can accept is that this should be a separate discussion, and the Rust patch should follow the Linux rule as it stands, however stupid it may be.

                                                              1. 7

                                                                I think the disagreement is more about “should we have APIs that hide the kernel context from the programmer” (e.g. “am I in a critical region”).

                                                                This message made some sense to me: https://lkml.org/lkml/2022/9/19/840

                                                                Linus’ writing style has always been kind of hyperbolic/polemic and I don’t anticipate that changing :( But then again I’m amazed that Rust-in-Linux happened at all, so maybe I should allow for the possibility that Linus will surprise me.

                                                                1. 1

                                                                  This is exactly what I still don’t understand in this discussion. Is there something about stack unwinding and catching the panic that is fundamentally problematic in, eg a driver?

                                                                  It actually seems like it would be so much better. It recovers some of the resiliency of a microkernel without giving up the performance benefits of a monolithic kernel.

                                                                  What if, on an irrecoverable error, the graphics driver just panicked, caught the panic at some near-top-level entry point, reset to some known good state and continued? Seems like such an improvement.

                                                                  1. 5

                                                                    I don’t believe the Linux kernel has a stack unwinder. I had an intern add one to the FreeBSD kernel a few years ago, but never upstreamed it (*NIX kernel programmers generally don’t want it). Kernel stack traces are generated by following frame-pointer chains and are best-effort debugging things, not required for correctness. The Windows kernel has full SEH support and uses it for all sorts of things (for example, if you try to access userspace memory and it faults, you get an exception, whereas in Linux or FreeBSD you use a copy-in or copy-out function to do the access and check the result).

                                                                    The risk with stack unwinding in a context like this is that the stack unwinder trusts the contents of the stack. If you’re hitting a bug because of stack corruption then the stack unwinder can propagate that corruption elsewhere.

                                                                    1. 1

                                                                      With the objtool/ORC stuff that went into Linux as part of the live-patching work a while back it does actually have a (reliable) stack unwinder: https://lwn.net/Articles/728339/

                                                                      1. 2

                                                                        That’s fascinating. I’m not sure how it actually works for unwinding (rather than walking) the stack: It seems to discard the information about the location of registers other than the stack pointer, so I don’t see how it can restore callee-save registers that are spilled to the stack. This is necessary if you want to resume execution (unless you have a setjmp-like mechanism at the catch site, which adds a lot of overhead).

                                                                        1. 2

                                                                          Ah, a terminological misunderstanding then I think – I hadn’t realized you meant “unwinding” specifically as something sophisticated enough to allow resuming execution after popping some number of frames off the stack; I had assumed you just meant traversal of the active frames on the stack, and I think that’s how the linked article used the term as well (though re-reading your comment now I realize it makes more sense in the way you meant it).

                                                                          Since AFAIK it’s just to guarantee accurate stack backtraces for determining livepatch safety I don’t think the objtool/ORC functionality in the Linux kernel supports unwinding in your sense – I don’t know of anything in Linux that would make use of it, aside from maybe userspace memory accesses (though those use a separate ‘extable’ mechanism for explicitly-marked points in the code that might generate exceptions, e.g. this).

                                                                          1. 2

                                                                            If I understand the userspace access things correctly, they look like the same mechanism as FreeBSD (no stack unwinding, just quick resumption to an error handler if you fault on the access).

                                                                                  I was quite surprised that the ORC[1] is bigger than DWARF. Usually DWARF debug info can get away with being large because it’s stored in separate pages of the binary, away from the code, and so doesn’t consume any physical memory unless used. I guess speed does matter for things like DTrace / SystemTap probes, where you want to do a full stack trace quickly, but in the kernel you can’t easily lazily load the code.

                                                                            The NT kernel has some really nice properties here. Almost all of the kernel’s memory (including the kernel’s code) is pageable. This means that the kernel’s unwind metadata can be swapped out if not in use, except for the small bits needed for the page-fault logic. In Windows, the metadata for paged-out pages is stored in PTEs and so you can even page out page-table pages, but you can then potentially need to page in every page in a page-table walk to handle a userspace fault. That extreme case probably mattered a lot more when 16 MiB of RAM was a lot for a workstation than it does now, but being able to page out rarely-used bits of kernel is quite useful.

                                                                            In addition, the NT kernel has a complete SEH unwinder and so can easily throw exceptions. The SEH exception model is a lot nicer than the Itanium model for in-kernel use. The Itanium C++ ABI allocates exceptions and unwind state on the heap and then does a stack walk, popping frames off to get to handlers. The SEH model allocates them on the stack and then runs each cleanup frame, in turn, on the top of the stack then, at catch, runs some code on top of the stack before popping off all of the remaining frames[2]. This lets you use exceptions to handle out-of-memory conditions (though not out-of-stack-space conditions) reliably.

                                                                            [1] Such a confusing acronym in this context, given that the modern LLVM JIT is also called ORC.

                                                                            [2] There are some comments in the SEH code that suggest that it’s flexible enough to support the complete set of Common Lisp exception models, though I don’t know if anyone has ever taken advantage of this. The Itanium ABI can’t support resumable exceptions and needs some hoop jumping for restartable ones.

                                                                    2. 4

                                                                      What you are missing is that stack unwinding requires destructors, for example to unlock locks you locked. It does work fine for Rust kernels, but not for Linux.
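
                                                                          A minimal userspace sketch of that point (illustrative only; kernel code would use the kernel’s own lock types rather than std): the MutexGuard destructor runs as the stack unwinds, so the lock is released instead of staying held.

                                                                          ```rust
                                                                          use std::panic;
                                                                          use std::sync::Mutex;

                                                                          fn main() {
                                                                              let lock = Mutex::new(0);

                                                                              // Panic while holding the lock. Unwinding runs the guard's
                                                                              // destructor, so the lock is released (and poisoned) rather
                                                                              // than left held forever.
                                                                              let result = panic::catch_unwind(|| {
                                                                                  let mut guard = lock.lock().unwrap();
                                                                                  *guard += 1;
                                                                                  panic!("bug detected"); // `guard` is dropped on the way out
                                                                              });

                                                                              assert!(result.is_err());      // the panic was caught
                                                                              assert!(lock.lock().is_err()); // poisoned, but not deadlocked
                                                                          }
                                                                          ```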

                                                                  2. 7

                                                                        Does the kernel have unprotected memory and just roll with things like null pointer dereferences reading garbage data?

                                                                    For errors that are expected Rust uses Result, and in that case it’s easy to sprinkle the code with result.or(whoopsie_fallback) that does not panic.
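
                                                                        A tiny userspace sketch of that style (read_config, the path, and the fallback value are made up for illustration; in-kernel Rust would use its own fallible APIs rather than std):

                                                                        ```rust
                                                                        use std::fs;
                                                                        use std::io;

                                                                        // Hypothetical helper: reading a config file can fail, and that
                                                                        // failure is expected, so it is returned as a Result.
                                                                        fn read_config(path: &str) -> Result<String, io::Error> {
                                                                            fs::read_to_string(path)
                                                                        }

                                                                        fn main() {
                                                                            // Expected error: recover with a fallback instead of
                                                                            // unwrapping (which would panic).
                                                                            let config = read_config("/etc/example.conf")
                                                                                .unwrap_or_else(|_| "default-config".to_string());
                                                                            println!("using config: {config}");
                                                                        }
                                                                        ```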

                                                                    1. 4

                                                                      As far as I understand, yeah, sometimes the kernel would prefer to roll with corrupted memory as far as possible:

                                                                      So BUG_ON() is basically ALWAYS 100% the wrong thing to do. The argument that “there could be memory corruption” is [not applicable in this context]. See above why.

                                                                      (from docs and linked mail).

                                                                          Null dereferences in particular, though, usually do what BUG_ON essentially does.

                                                                      And things like out-of-bounds accesses seem to end with null-dereference:

                                                                      https://github.com/torvalds/linux/blob/45b588d6e5cc172704bac0c998ce54873b149b22/lib/flex_array.c#L268-L269

                                                                      Though, notably, out-of-bounds access doesn’t immediately crash the thing.

                                                                      1. 8

                                                                        As far as I understand, yeah, sometimes the kernel would prefer to roll with corrupted memory as far as possible:

                                                                        That’s what I got from the thread and I don’t understand the attitude at all. Once you’ve detected memory corruption then there is nothing that a kernel can do safely and anything that it does risks propagating the corruption to persistent storage and destroying the user’s data.

                                                                            Linus is also wrong that there’s nothing outside of a kernel that can handle this kind of failure. Modern hardware lets you make it very difficult to accidentally modify the kernel page tables. As I recall, XNU removes all of the pages containing kernel code from the direct map and protects the kernel’s page tables from modification, so that unrecoverable errors can take an interrupt vector to some immutable code that can then write crash dumps or telemetry and reboot. Windows does this from the Secure Kernel, which is effectively a separate VM that has access to all of the main VM’s memory but which is protected from it. On Android, Hafnium provides this kind of abstraction.

                                                                        I read that entire thread as Linus asserting that the way that Linux does things is the only way that kernel programming can possibly work, ignoring the fact that other kernels use different idioms that are significantly better.

                                                                        1. 4

                                                                          Reading this thread is a little difficult because the discussion is evenly spread between the patch set being proposed, some hypothetical plans for further patch sets, and some existing bad blood between the Linux and Rust community.

                                                                          The “roll with corrupted memory as far as possible” part is probably a case of the “bad blood” part. Linux is way more permissive with this than it ought to be but this is probably about something else.

                                                                          The initial Rust support patch set failed very eagerly and panicked, including on cases where it really is legit not to panic, like when failing to allocate some memory in a driver initialization code. Obviously, the Linux idiom there isn’t “go on with whatever junk pointer kmalloc gives you there” – you (hopefully – and this is why we should really root for memory safety, because “hopefully” shouldn’t be a part of this!) bail out, that driver’s initialization fails but kernel execution obviously continues, as it probably does on just about every general-purpose kernel out there.

                                                                              The patchset’s authors clarified immediately that the eager panics are just an artefact of the early development status – an alloc implementation (and some bits of std) that follows safe kernel idioms was needed, but it was a ton of work so it was scheduled for later, as it really wasn’t relevant for a first proof of concept – which was actually a very sane approach.

                                                                              However, that didn’t stop seemingly half the Rustaceans on Twitter from taking out their pitchforks, insisting that you should absolutely fail hard if memory allocation fails (because what else are you going to do?), and ranting about how Linux is unsafe and riddled with security bugs because it’s written by obsolete monkeys from the nineties whose approach to memory allocation failures is “well, what could go wrong?”. Which is really not the case, and it really does ignore how much work went into bolting the limited memory safety guarantees that Linux offers onto as many systems as it does, while continuing to run critical applications.

                                                                          So when someone mentions Rust’s safety guarantees, even in hypothetical cases, there’s a knee-jerk reaction for some folks on the LKML to feel like this is gonna be one of those cases of someone shitting on their work.

                                                                              I don’t want to defend it, it’s absolutely the wrong thing to do and I think experienced developers like Linus should realize there’s a difference between programmers actually trying to use Rust for real-world problems (like Linux), and Rust advocates for whom everything falls under either “Rust excels at this” or “this is an irrelevant niche case”. This is not a low-effort patch, lots of thinking went into it, and there’s bound to be some impedance mismatch between a safe language that tries to offer compile-time guarantees and a kernel historically built on overcoming compiler permissiveness through idioms and well-chosen runtime tradeoffs. I don’t think the Linux kernel folks are dealing with this the way they ought to be dealing with it, I just want to offer an interpretation key :-D.

                                                                      2. 1

                                                                            No expert here, but I imagine the Linux kernel has methods of handling expected errors & null checks.

                                                                      3. 6

                                                                        In an ideal world we could have panic and stop in the kernel. But what the kernel does now is what people expect. It’s very hard to make such a sweeping change.

                                                                        1. 6

                                                                          Sorry, this is a tangent, but your phrasing took me back to one of my favorite webcomics, A Miracle of Science, where mad scientists suffer from a “memetic disease” that causes them to e.g. monologue and explain their plans (and other cliches), but also allows them to make impossible scientific breakthroughs.

                                                                          One sign that someone may be suffering from Science Related Memetic Disorder is the phrase “in a perfect world”. It’s never clearly stated exactly why mad scientists tend to say this, but I’d speculate it’s because in their pursuit of their utopian visions, they make compromises (ethical, ugly hacks to technology, etc.), that they wouldn’t have to make in “a perfect world”, and this annoys them. Perhaps it drives them to take over the world and make things “perfect”.

                                                                          So I have to ask… are you a mad scientist?

                                                                          1. 2

                                                                            I aspire to be? bwahahaa

                                                                            1. 2

                                                                              Hah, thanks for introducing me to that comic! I ended up archive-bingeing it.

                                                                            2. 2

                                                                              What modern kernels use “panic and stop”? Is it a feature of the BSDs?

                                                                              1. 8

                                                                                Every kernel except Linux.

                                                                                1. 2

                                                                                  I didn’t exactly mean bsd. And I can’t name one. But verified ones? redox?

                                                                                  1. 1

                                                                                    I’m sorry if my question came off as curt or snide, I was asking out of genuine ignorance. I don’t know much about kernels at this level.

                                                                                    I was wondering how much an outlier the Linux kernel is - @4ad ’s comment suggests it is.

                                                                                    1. 2

                                                                                      No harm done

                                                                              2. 4

                                                                                I agree. I would be very worried if people writing the Linux kernel adopted the “if it compiles it works” mindset.

                                                                                1. 2

                                                                                  Maybe I’m missing some context, but it looks like Linus is replying to “we don’t want to invoke undefined behavior” with “panicking is bad”, which makes it seem like irrelevant grandstanding.

                                                                                  1. 2

                                                                                    The part about debugging specifically makes sense in the “cultural” context of Linux, but it’s not a matter of realism. There were several attempts to get “real” in-kernel debugging support in Linux. None of them really gained much traction, because none of them really worked (as in, reliably, for enough people, and without involving ritual sacrifices), so people sort of begrudgingly settled for debugging by printf and logging unless you really can’t do it otherwise. Realistically, there are kernels that do “panic and stop” well and are very debuggable.

                                                                                    Also realistically, though: Linux is not one of those kernels, and it doesn’t quite have the right architecture for it, either, so backporting one of these approaches onto it is unlikely to be practical. Linus’ arguments are correct in this context but only insofar as they apply to Linux, this isn’t a case of hammering realism into idealists. The idealists didn’t divine this thing in some programming class that only used pen, paper and algebra, they saw other operating systems doing it.

                                                                                    That being said, I do think people in the Rust advocacy circles really underestimate how difficult it is to get this working well for a production kernel. Implementing panic handling and a barebones in-kernel debugger that can nonetheless usefully handle 99% of the crashes in a tiny microkernel is something you can walk third-year students through. Implementing a useful in-kernel debugger that can reliably debug failures in any context, on NUMA hardware of various architectures, even on a tiny, elegant microkernel, is a whole other story. Pointing out that there are Rust kernels that do it well (Redshirt comes to mind) isn’t very productive. I suspect most people already know it’s possible, since e.g. Solaris did it well, years ago. But the kind of work that went into that, on every level of the kernel, not just the debugging end, is mind-blowing.

                                                                                    (Edit: I also suspect this is the usual Rust cultural barrier at work here. The Linux kernel community is absolutely bad at welcoming new contributors. New Rust contributors are also really bad at making themselves welcome. Entertaining the remote theoretical possibility that, unlikely though it might be, it is nonetheless in the realm of physical possibility that you may have to bend your technology around some problems, rather than bending the problems around your technology, or even, God forbid, that you might be wrong about something, can take you a very long way outside a fan bulletin board.)

                                                                                    1. 1

                                                                                      easter bunny and santa claus comment

                                                                                      Wow, Linus really has mellowed over the years ;)

                                                                                  1. 7

                                                                                    Who is Tweag? A company? Why was there even a repo in their org with a license they don’t like?

                                                                                    1. 46

                                                                                      This is the second recent story that shows no one remembers image maps anymore. 😞

                                                                                      1. 8

                                                                                        Right?! And they still totally work.

                                                                                      1. 10

                                                                                        Scrying definitely feels like a more accurate term.

                                                                                        1. 2

                                                                                          From a discussion elsewebs, comparisons to photography seem fairly apt – exploring and composing and arranging a scene, but also just a lot of luck and happening to be at the right place at the right time (or chancing upon the right seed, in the case of image synths.)

                                                                                        1. 4

                                                                                          Usually even with Google you’ll get a bounce that tells you what to do. And if you have an email account with Gmail you can also check what causes the block.

                                                                          I’ve been running my mail server since circa 2015. That involved switching software and providers (and then IPs). Things run smoothly. I use the email server as my main way to communicate; I use it both personally and professionally and have some family accounts as well.

                                                                          I had bounces way back when DKIM became a thing. Setting it up fixed them.

                                                                          Because of my work I frequently send emails to people I’ve never ever talked to. I also tend to prefer email over calls for support, etc., for documentation purposes. So there are lots of circumstances where I’d know if emails were silently dropped (esp. when using it for work as a consultant). I do get my responses though.

                                                                                          So articles like these always baffle me. For a while I believed that the old IP was the reason, but I switched providers in 2019 and I also added new domains, which reputation systems should not have trusted, at least initially.

                                                                                          The only annoying thing that happened was that someone started spamming with my email address, causing me to get bounces for mail that never left my server. The oligopoly doesn’t do that because they check SPF, but huge numbers of tiny servers don’t seem to. And it seems that address gets put into many web forms, as I receive automatic responses. I’ve had this before with a non-self-hosted email address which forwards to the self-hosted one. It’s rare but annoying because these of course aren’t spam, just responses. So they look fine in terms of server setup.
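
                                                                                          Roughly, the SPF check the big providers do on inbound mail looks like the sketch below. It uses the third-party pyspf library, and the IP, sender and HELO name are made-up placeholders rather than anything from this thread:

                                                                                              # Ask pyspf (pip install pyspf) to evaluate the sender domain's SPF policy
                                                                                              # for a given connecting IP, envelope sender and HELO name.
                                                                                              import spf

                                                                                              # All three values are documentation-only placeholders.
                                                                                              result = spf.check2(
                                                                                                  i="203.0.113.7",            # connecting client IP
                                                                                                  s="someone@example.org",    # envelope MAIL FROM address
                                                                                                  h="mail.example.org",       # HELO/EHLO hostname
                                                                                              )

                                                                                              # Prints the SPF verdict ('pass', 'softfail', 'fail', ...) and any explanation.
                                                                                              # A fail here is what lets a receiver refuse or flag forged mail instead of
                                                                                              # bouncing it back to the spoofed address.
                                                                                              print(result)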

                                                                                          I wish the article was a bit more technical. You’d usually get something from the oligopoly about why it bounced. I’ve helped a couple of people debugging their email servers when bounces happen. So far all of those were simple misconfigurations. Oh, or using residential IP space, which won’t work.

                                                                                          On receiving spam: it’s hugely annoying that a big portion of spam is sent out through Sendgrid, Mailgun, Mandrill, etc., so they have all the things making them look valid. At times I’ve felt like they should be taken down as professional spammers.

                                                                                          On the topic of IPs and reputation. Mailgun, Mandrill and Sendgrid market dedicated IPs to increase delivery rates so I would argue being a well known IP isn’t the thing that will increase your delivery rate.

                                                                                          I have heard stories about being unlucky and receiving an IP from your hosting company that was previously used for spam. I haven’t seen such a case personally, but I think such a situation can make it look like you can’t host your own email.

                                                                                          I’m waiting for the day that self-hosted email won’t work anymore. But so far it works without any issues. Given that various smaller websites still seem to successfully send mail through PHP and wonky setups (as I noticed back when using greylisting, which would not work properly in these cases), it seems like big providers have to let a lot of very shady-looking things through. Maybe they use user reputation for that (marking as spam, etc.).
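
                                                                                          For anyone unfamiliar, greylisting works roughly like the toy sketch below: temporarily reject the first delivery attempt for an unknown (client IP, sender, recipient) triplet and accept retries after a delay. Names and timings here are illustrative only, not a real implementation; fire-and-forget PHP setups often never retry, which is why they trip over it:

                                                                                              import time

                                                                                              GREYLIST_DELAY = 300      # seconds a brand-new triplet has to wait
                                                                                              seen = {}                 # triplet -> timestamp of its first delivery attempt

                                                                                              def smtp_decision(client_ip, mail_from, rcpt_to):
                                                                                                  triplet = (client_ip, mail_from, rcpt_to)
                                                                                                  now = time.time()
                                                                                                  first = seen.setdefault(triplet, now)
                                                                                                  if now - first < GREYLIST_DELAY:
                                                                                                      # Temporary failure: a well-behaved MTA queues the mail and retries.
                                                                                                      return "450 4.7.1 Greylisted, please try again later"
                                                                                                  # The sender came back after the delay, so accept from now on.
                                                                                                  return "250 OK"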

                                                                                          Anyways, as mentioned, with such articles I’d always be curious about the technical side. What responses do the oligopoly’s servers send?

                                                                                          1. 10

                                                                                            On the topic of IPs and reputation. Mailgun, Mandrill and Sendgrid market dedicated IPs to increase delivery rates so I would argue being a well known IP isn’t the thing that will increase your delivery rate.

                                                                                            Indeed, it isn’t. The single reason that lets you deliver email is being on the receiver’s whitelist. That’s it. Nothing else.

                                                                                            You’d usually get something from the oligopoly about why it bounced. I’ve helped a couple of people debugging their email servers when bounces happen. So far all of those were simple misconfigurations.

                                                                                            I’ve run my self-hosted email for over 20 years now, much like the author of the article. I recently switched to using a relay to send, because fully self-hosting that part became impossible. Not because I failed to configure something:

                                                                                            • I had a stable, dedicated, non-residential IP, of which I have been the sole user for the past 8+ years. No spam ever left my system during that time. I recently had to change the IP, but that makes no difference, because I had to switch to a relay before that anyway.
                                                                                            • I have a stable domain name which I have owned since 2009. I have been the sole sender from this domain.
                                                                                            • I have an appropriate PTR record.
                                                                                            • SPF, DKIM, DMARC and the rest are set up properly (see the quick record check sketched after this list).
                                                                                            • 90% of the e-mail I send are replies to e-mail sent to me, between known contacts.
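
                                                                                            A rough way to eyeball those records from the outside is sketched below, using dnspython. The domain, IP and the “default” DKIM selector are placeholders, not my actual setup:

                                                                                                import dns.resolver
                                                                                                import dns.reversename

                                                                                                domain = "example.org"     # placeholder domain
                                                                                                ip = "203.0.113.7"         # placeholder sending IP (documentation range)

                                                                                                # SPF lives in a TXT record on the domain, DMARC on _dmarc.<domain>.
                                                                                                for name in (domain, f"_dmarc.{domain}"):
                                                                                                    for rr in dns.resolver.resolve(name, "TXT"):
                                                                                                        print(name, rr.to_text())

                                                                                                # DKIM is published per selector; the selector name depends on your setup.
                                                                                                selector = "default"       # placeholder selector
                                                                                                for rr in dns.resolver.resolve(f"{selector}._domainkey.{domain}", "TXT"):
                                                                                                    print(rr.to_text())

                                                                                                # The PTR record should map the sending IP back to the mail host's name.
                                                                                                for rr in dns.resolver.resolve(dns.reversename.from_address(ip), "PTR"):
                                                                                                    print(rr.to_text())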

                                                                                            Yet, despite all that, my email routinely ended up in spam folders when sent to Google-hosted domains, be they @gmail.com or custom domains. I have no idea why. On the SMTP level, the message is accepted. On the client’s end, they see it is in spam, but the headers tell nothing about why, apart from “we found this message suspicious”, which isn’t very helpful. And that’s the good situation! It’s much worse when Google accepts the email, and then drops it before delivering it to the recipient at all. I have confirmed that happening with dozens of people: I sent an email, verified in my logs that Google’s servers acked it, and even a week later there was no sign of it in their inbox, neither in Spam nor anywhere else. It just disappeared without a trace.

                                                                                            In this respect, I found most Microsoft servers better, because those reject the mail at least. That allowed me to switch to a backup @gmail.com account if I desperately needed to send email to such an address. The reject reason: “We do not accept email from this sender at this time, please contact our administrators”. So I did just that, and contacted multiple postmasters of such servers. Turns out they were all using the same deny & allow lists, had no permission to change them, and I would need to contact the administrators of those lists to get my server on the list. No reputation, no SPF, DKIM, DMARC, etc checking. A simple allow list, because the deny list was basically “everything not on the allow list”.

                                                                                            Who is on the allow lists? The big names, and relays that pay them hefty amounts of money.

                                                                                            Why use an allow list? Because a deny list in 2022 does nothing; it’s an unwinnable uphill battle when IPv4 addresses are frequently reused, when domain names don’t even live long enough to make it onto a list, and when IPv6 addresses are plentiful. SPF, DKIM and DMARC all sound good in theory, but they aren’t widespread enough to make a big difference. Not to mention that a large percentage of spam (about 60% of all my incoming spam) comes through the Big Names’ servers, with valid SPF, DKIM and DMARC records. Of the remaining 40% of the spam I receive, half of them also come from places with valid SPF, DKIM and DMARC set up.

                                                                                            Thing is, those don’t cost much to set up, so spammers do so too. It can be fully automated, and nets them a higher chance of delivery.

                                                                                            Thus, no matter what I did, to be able to email the contacts I need to email, without having to keep a separate gmail (or other big-name) account, the only reliable solution was to use a relay, where someone else makes sure they are on the appropriate allow lists. But this way, my email is not fully self-hosted anymore. I still have an SMTP server that can send mail. I still have all the things set up for domains I do not relay (because relaying all of them would be pricey, so I keep that to an affordable minimum), but their delivery rate is abysmal.

                                                                                            1. 2

                                                                                              On the client’s end, they see it is in spam, but the headers tell nothing about why, apart from “we found this message suspicious”, which isn’t very helpful.

                                                                                              For Gmail I found that if you press the “Show original” button on an email it tells you more. Based on someone I have helped, it appears that Gmail (and probably others) not too long ago switched to no longer allowing soft-fail in SPF, validating that it is -all instead of ~all.
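
                                                                                              If you want to see which qualifier a domain publishes, the small sketch below (using dnspython; example.org is a placeholder) reads the SPF record and reports its trailing “all” term. Whether a receiver actually treats ~all (softfail) more harshly than -all (hard fail) is of course up to that receiver:

                                                                                                  import dns.resolver

                                                                                                  def spf_all_qualifier(domain):
                                                                                                      """Return the trailing 'all' term of the domain's SPF record, e.g. '-all' or '~all'."""
                                                                                                      for rr in dns.resolver.resolve(domain, "TXT"):
                                                                                                          txt = b"".join(rr.strings).decode()
                                                                                                          if txt.startswith("v=spf1"):
                                                                                                              for term in reversed(txt.split()):
                                                                                                                  if term.endswith("all"):
                                                                                                                      return term
                                                                                                      return None

                                                                                                  print(spf_all_qualifier("example.org"))   # placeholder domain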

                                                                                              Sorry that you were forced to switch. I’ll hang in till it stops working for me as well, and then I’ll join the group of people saying email should be abandoned. ;)

                                                                                              Speaking about allow lists: maybe it also helps to add your server to whitelists. Ages ago I added mine to dnswl.org.

                                                                                              As mentioned, for job reasons I do send emails to random people every once in a while. So far that has worked; if it didn’t, someone would have let me know, after all the job has to be done. Heh. The same is true when interacting with people in the open source community or making an appointment at a doctor and such things. So far it has worked. I hope it will stay that way.

                                                                                              1. 5

                                                                                                For Gmail I found that if you press the “Show original” button on an email it tells you more.

                                                                                                Like I said, I checked the headers. There was nothing useful there. I just checked a legit message gmail routed to spam, and there were zero useful headers. All it told me was what I already knew: how google verified the valid SPF/DKIM/DMARC headers.

                                                                                                There was no header indicating that the message was even considered spam, let alone any that would help me figure out why. All it told me is that both SPF and DKIM passed. Great. I knew that already. Not helpful.

                                                                                                Speaking about allow lists: maybe it also helps to add your server to whitelists. Ages ago I added mine to dnswl.org.

                                                                                                In my case, it would not have made a difference. The allow lists used by the domains I had to send email to did not make it possible for me to add myself. For one, even discovering which whitelists I’d need to add myself to was a problem, because the lists used were deemed confidential information, or they simply didn’t know, not even their tech people (“we use google/outlook/whatever, ask them”). The one whitelist provider I did manage to contact was asking for upwards of $10k/year/sender domain. Yeah, nope.

                                                                                                So far that has worked; if it didn’t, someone would have let me know, after all the job has to be done.

                                                                                                You’re lucky then. For a long while, ’till about 3-4 years ago, I had no issues either. I had the odd message going to spam here and there, but overall, what I sent was delivered. But then I started to get bounces (despite my setup being top notch in every feasible regard, apart from not paying to be on any and all allow lists possible), and even worse, had mail disappear into the void after being accepted by the recipient’s servers.

                                                                                                As for letting me know if they don’t receive email: oh, they did, yes. They called me. Doesn’t help when I need to send scanned attachments and stuff. I can resend and then they’ll call me again a few days later. Doesn’t solve the problem. The caller won’t be able to help me, because they have absolutely no idea how email works, it’s not their job, and I won’t have much luck contacting their tech people either, because they’ll just say “We use gmail/outlook, no clue why your email disappears, talk to them”.

                                                                                                I’m too small for both google and microsoft to care, and the tools available to help figure out what went wrong are totally useless once you have a setup that is supposed to work.

                                                                                                To this day, I have no trouble communicating with most open source projects and people - they usually self host, and have reasonable setups.

                                                                                                The problem is getting mail into the big names (especially into outlook, but google is getting harder and harder, fast), and the troubling issue with that is that it makes it harder for small businesses to stay out of their clutches. If they want email reliably delivered, being behind Big Names is far easier. It makes email someone else’s problem. They can send email. If they do not receive email, then the problem is clearly at my end as far as they’re concerned. “Tough luck mate, get a gmail account, it’s free”, I was often told.

                                                                                            2. 2

                                                                                              Usually even with Google you’ll get a bounce that tells you what to do.

                                                                                              What can I say? That’s not my experience.

                                                                                              I have my own domain, with MX records pointing to Fastmail. I had SPF set up but not DKIM, and at some point GMail (and Yahoo, for that matter) started sorting all of my messages into the recipients’ Spam folders. I wasn’t blackholed, but I might as well have been: No bounce, no notification, just quietly hidden out of normal view. It still happens sometimes even now that I have DKIM set up.
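
                                                                                              For anyone in the same boat, the quick sketch below (using dnspython) checks whether DKIM selector records resolve for a Fastmail-hosted domain. The fm1/fm2/fm3 selector names are an assumption based on Fastmail’s usual setup, and example.org is a placeholder; substitute whatever your provider gives you:

                                                                                                  import dns.resolver

                                                                                                  domain = "example.org"     # placeholder domain
                                                                                                  # fm1/fm2/fm3 are assumed selector names; adjust for your provider.
                                                                                                  for selector in ("fm1", "fm2", "fm3"):
                                                                                                      name = f"{selector}._domainkey.{domain}"
                                                                                                      try:
                                                                                                          for rr in dns.resolver.resolve(name, "TXT"):
                                                                                                              print(name, "->", rr.to_text()[:60], "...")
                                                                                                      except (dns.resolver.NXDOMAIN, dns.resolver.NoAnswer):
                                                                                                          print(name, "-> missing; DKIM signatures using this selector will not verify")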

                                                                                              Fastmail is not a small company. They’re a sizable player. But smaller than the biggest ones. I’ve been using this domain for over 15 years. And this shit still happens.

                                                                                            1. 13

                                                                                              More accurate title would be “The case against auto-formatting SQL”. It’s the “auto” part that’s relevant here.