Threads for markwatson

    1. 2

      Is it native or electron? Is it Java (like JetBrains other IDEs)?

      1. 6

        Like many of our IDEs, the functionality of RustRover can be installed as a plugin in IntelliJ IDEA Ultimate.

        Probably Java given this line in the post.

        1. 9

          The Rust plugin actually uses Kotlin (I know, basically the same thing).

          I believe all the classic IDEs use Java Swing as the GUI library, which isn’t native or Electron, in my opinion, but I guess it depends on what is meant by “native”. This is different from an IDE like Eclipse, which used SWT. SWT wraps native components from the OS and makes them available to Java, whereas Swing draws the components itself in Java.

          JetBrains now has Fleet in preview, which uses a new UI library called Noria. This IDE feels a lot closer to something like VSCode.

          1. 1

            Do you know if the “new UI” in, for example, Goland, is using Noria? Or is it just restyled Swing components?

            1. 3

              It’s restyled Swing; AIUI the other IDEs on their old platform are picking up the same look.

              1. 1

                That on its own is a reason for me not to consider it. As a Linux user, across two decades, multiple machines, and multiple distros, I’ve never found a Swing, SWT, or JavaFX app that didn’t suffer from an irritating problem of subtle but noticeable input latency on Linux desktops.

                (For context, I was an avid KDE 3.5 power user but migrated to LXDE and stayed away from KDE 4 for much longer than many other people because each new “it’s fast and stable now” release never seemed to fix the input latency KWin had gained in the port to Qt 4… and I’ve been around long enough that I remember when multi-monitor users had to hex-edit the JRE to rename the string XINERAMA to some non-matching piece of gibberish to un-break Java apps on multi-monitor desktops.)

            2. 2

              Do you know if the “new UI” in, for example, Goland, is using Noria? Or is it just restyled Swing components?

              I don’t know the answer to that.

            3. 2

              The IntelliJ Platform is open source, so the way to get the answer here is to take a look at the source:

              My understanding is that both the new and old UIs are basically Swing, though the details are fun. E.g., JetBrains maintains its own JRE to get better font rendering: https://github.com/JetBrains/JetBrainsRuntime

              And yeah, JetBrains in general do a lot of work in the space of desktop GUIs:

              • the core IDEs are heavy users of Swing
              • Rider uses the “Rider Protocol” to strictly separate IDE UI (JVM+Swing) from brains (C#), and, as far as I understand, that’s a significantly different protocol than LSP
              • there are two versions of the “let’s glue the JVM to Skia” idea — both Noria and Compose for Desktop work that way (and I think both are powered by Skija)
    2. 35

      New languages like Go and Rust combine the memory safety of a high-level language with the performance of C.

      Rust has a similar performance profile to C, C++, and Zig. Go has a similar performance profile to Java, Swift, and C#. As far as we want to group languages into performance bins, Go and Rust certainly belong to different bins.

      The confusion here stems from the fact that Go was originally marketed as a “systems” programming language, but that used a definition of “systems programming” that is mostly unrelated to performance.

      https://willcrichton.net/notes/systems-programming/

      1. 19

        You’d think that Go is actually faster than Java/C# based on their marketing around their GC, and how they emphasized it as a “systems” language (ultimately killing any meaning the term ever had).

        One of the many reasons I hate Go fundamentally and not just as a technology (which I also hate it for).

        1. 12

          They market their GC as low-delay / nearly pauseless; that generally goes against throughput, and thus raw performance. The lack of generational support can also be an issue if you can’t (easily) avoid the allocations.

          Go v Java/C# is a win-some / lose-some: the AOT means you have reliable and immediate performance, but the limited optimisations & lack of JIT mean a relatively low ceiling (although the idiomatic avoidance of abstraction limits reliance on optimisations). It doesn’t have a generational GC, but being value-oriented limits the allocation requirements & it has really good tooling for uncovering the allocations it does have. The GC works fine OOTB, but the lack of tunability means if you get into a bad workload you’re hosed.
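
          To make the tooling point concrete, here is a minimal sketch (my own, not from any runtime’s docs) of how Go surfaces allocations: testing.AllocsPerRun counts heap allocations per call, and go test -bench . -benchmem reports them per benchmark iteration.

              // alloc_test.go: illustrative only. Run with: go test -run TestAllocs
              package main

              import "testing"

              func greet(name string) string {
                  return "hello, " + name // concatenation allocates a new string
              }

              func TestAllocs(t *testing.T) {
                  allocs := testing.AllocsPerRun(1000, func() {
                      _ = greet("world")
                  })
                  if allocs > 1 {
                      t.Errorf("want at most 1 allocation per call, got %v", allocs)
                  }
              }

          Escape analysis output (go build -gcflags=-m) then shows which of those allocations the compiler manages to keep on the stack.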

        2. 2

          I am very new to Go and my understanding was that it is much, much faster than Java/C#. I was thinking of it as being very close to C.

          Do you have more references where this difference is more apparent? I would like a complete weekend of reading to get over this brainwashing.

          1. 8

            It’s very difficult to realistically compare programming language performance for a lot of reasons. There are always ways to further optimize code to perform at a given benchmark, so determining whether the performance is a result of the language or the program becomes difficult. Additionally, benchmarks are usually fairly synthetic, so a benchmark doesn’t tell you how your language will actually perform when you’re writing code with suboptimal routines, using idiomatic patterns, etc. Finally, performance isn’t just a linear question of fast vs. slow - language features might have different tradeoffs depending on the problem you’re trying to solve. For example, goroutines would perform differently than Java threads or async patterns.

            That all being said, here are some papers I’ve found:

            And some benchmarks:

            So in general while benchmarks vary, it seems Go is similar to Java/C# in terms of performance (with a massive asterisk). To know exactly why you would need to research garbage collection, memory safety, pros and cons of JITs, etc. I’m sure someone else can provide better resources than I did, but hopefully it’s helpful. I’m definitely not an expert.

            edit: one more thing - while I don’t think programming language performance should be a big consideration when choosing what language to use for a new project, I do think learning more about the performance tradeoffs made by each language is really valuable. So hopefully the above answered your question, but it might be more worthwhile to research programming language design in general.

            1. 1

              I like @benhoyt’s blog post where he compares an idiomatic version of word counting to an optimized version in several languages. It’s good as a rough ballpark: https://benhoyt.com/writings/count-words/

          2. 6

            I am very new to Go and my understanding was that it is much, much faster than Java/C#. I was thinking of it as being very close to C.

            Of course it depends on how you define fast, but in many cases, it certainly is.

            Go GC optimizes for latency over throughput. Typical GC pause times are sub-microsecond. This is most visible in long-tail request latencies. Go servers tend to be as fast or faster than Java servers from the client perspective, while using something like an order of magnitude less memory.

            1. 2

              I never seem to get close-to-C speeds even when I remove all allocations. As I’m sure you know, Go’s compiler is not very aggressive about optimizing, so it just kind of punts where clang, gcc, etc. do a good job of inlining, and on top of that Go function calls are more expensive (I haven’t measured, so I don’t have a good idea of how much more expensive). And of course, C doesn’t have to worry about eliding bounds checks in the first place.
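
              To illustrate the bounds-check point (my sketch, with a made-up sum8 helper): idiomatic Go often adds a single up-front check so the compiler can prove the loop indices are in range and drop the per-iteration checks.

                  package main

                  import "fmt"

                  // sum8 sums the first 8 bytes of s; it panics if len(s) < 8.
                  func sum8(s []byte) (n int) {
                      _ = s[7] // one up-front check lets the compiler drop
                               // the per-iteration bounds checks below
                      for i := 0; i < 8; i++ {
                          n += int(s[i])
                      }
                      return n
                  }

                  func main() {
                      fmt.Println(sum8([]byte("abcdefgh")))
                  }

              If I remember the flag right, go build -gcflags=-d=ssa/check_bce prints the checks that remain; C compilers simply never insert them.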

          3. 5

            For starters you should ask yourself what about Go is supposed to be faster. Compared to Java, the main thing is going to be that you can sometimes allocate data on the stack in Go. Compared to C# there’s no such benefit - C# can do that. So what is Go doing differently?

            Well, you might hear people say “Go optimizes for latency” - that’s just Go marketing. There are Java GCs that also optimize for latency; most Java GCs are entirely tunable for such things.

            Probably the single feature that Go has that theoretically gives it any advantage is that it has native async support and Java’s support is weak. C# does not suffer there.

            Take a look at these benchmarks and note that Java is very nearly at the top for many. https://www.techempower.com/benchmarks/#section=data-r21&test=json

            You can find other benchmarks here: https://benchmarksgame-team.pages.debian.net/benchmarksgame/fastest/csharp.html

            You could ask yourself why Go’s GC gets hyped as being so low latency; there are articles out there that will explain that it’s absolutely nothing new - not one novel component - and that it’s actually kinda behind the status quo. Some people will say “it isn’t generational” like it’s a good thing - but C# has the same memory model as Go and has found that generational GC is still an advantage.

            Go’s compiler is also hand rolled. It’s very behind on optimizations. Things that we take for granted, like passing arguments via register, are bleeding edge for Go. Java and C# have just had way more time put into their optimization passes.

            1. 3

              Agreed, there’s theoretically no reason for C# to be slower than C, C++, or Rust. There’s a difference because of the garbage collector, but whether it’s a win or a loss depends on the workload. GC can be pretty fast, with pretty good throughput.

              1. 1

                C#, Java, and Go, are all in a whole other class of performance vs C, C++, Rust. I wasn’t trying to indicate that they’re all on the same level.

                1. 2

                  Are you so sure?

                  https://benchmarksgame-team.pages.debian.net/benchmarksgame/fastest/csharpcore-gpp.html

                  Java is problematic because of the difficulty of value types and layout control. They’re working on it. There’s also a big difference in frameworks and standard library code available. But in terms of raw performance, the JIT is pretty capable these days.

                  The real difference is in how idiomatic code performs. This is obviously muddled and hard to answer, because what is idiomatic?

                  1. 2

                    I agree, Java, of all of these three languages, will have the hardest time. Lack of value types is a problem.

                    C# and Go can maybe get close to C++, but it’s going to be situational and Go is going to suffer from its much less mature compiler. There are even scenarios where C and Go could surpass C++ since GCs are near optimal for some scenarios.

                    But I think that’s why it’s most important to not ask “who’s fastest” but “why would X be faster than Y?”. That’s where idiomatic comes in - would it be idiomatic to exclusively allocate on the stack for C#, avoid inheritance, etc?

                    1. 1

                      C# and Go can maybe get close to C++, but it’s going to be situational and Go is going to suffer from its much less mature compiler. There are even scenarios where C and Go could surpass C++ since GCs are near optimal for some scenarios.

                      This seems wrong for a few reasons.

                      1. C doesn’t have a GC, so why would “scenarios where GCs are near optimal” help C?
                      2. In general, I can’t think of scenarios where a GC is going to outperform arena/pool allocation in C/C++
                      3. In particular, it seems very unlikely that Go’s GC is going to be a benefit versus C/C++. In Go, when we want to go fast, we take care to avoid allocations which means using the GC less and writing programs more like you would in C/C++.
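
                      A tiny sketch of what point 3 means in practice (illustrative Go, my own example): reuse one buffer across calls so the steady state allocates nothing, much as you would manage memory manually in C/C++.

                          package main

                          import "fmt"

                          type scanner struct {
                              buf []byte // grows once, then is reused on every call
                          }

                          // firstLine copies src up to the first newline into the reused buffer.
                          func (s *scanner) firstLine(src []byte) []byte {
                              s.buf = s.buf[:0] // reset length, keep capacity: no new allocation
                              for _, b := range src {
                                  if b == '\n' {
                                      break
                                  }
                                  s.buf = append(s.buf, b)
                              }
                              return s.buf
                          }

                          func main() {
                              s := &scanner{}
                              fmt.Printf("%s\n", s.firstLine([]byte("first\nsecond")))
                          }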

                      Also, Java can get away without value types for a few reasons:

                      1. It has an escape analyzer that stack allocates things as often as possible
                      2. It has a bump allocator so allocations are way, way faster than Go, C, C++, etc
                      3. (I’m told) it has a high-throughput / low latency GC
                      4. Java’s JIT generates a lot tighter code than Go’s compiler (partially because JIT allows it to make better assumptions but also because Go’s compiler is deliberately basic)
                      1. 1

                        C doesn’t have a GC, so why would “scenarios where GCs are near optimal” help C?

                        Because I typo’d C when I meant C#.

                        In general, I can’t think of scenarios where a GC is going to outperform arena/pool allocation in C/C++

                        Over a manually managed arena, nothing. You can always be faster. But for idiomatic code where there are lots of short lived objects a generational GC can be ideal since it’s effectively a bump allocator.

                        In particular, it seems very unlikely that Go’s GC is going to be a benefit versus C/C++. In Go, when we want to go fast, we take care to avoid allocations which means using the GC less and writing programs more like you would in C/C++.

                        Never Go’s. A generational GC can help with short lived allocations over idiomatic “standard” allocation in other languages.

                        It has an escape analyzer that stack allocates things as often as possible

                        Of course. Both C# and Go have escape analysis too.

                        It has a bump allocator so allocations are way, way faster than Go, C, C++, etc

                        Yep, I covered this. Generational GCs are extremely efficient for short lived allocations. C# does the same thing.

                        (I’m told) it has a high-throughput / low latency GC

                        Yep, and many tunables for your exact workload.

                        Java’s JIT generates a lot tighter code than Go’s compiler (partially because JIT allows it to make better assumptions but also because Go’s compiler is deliberately basic)

                        Yep, also agreed.

                        I think we basically agree on everything?

                    2. 1

                      And, in this vein, Go strongly encourages value types, so they’re used more or less ubiquitously.
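
                      For instance (my illustration, not from the thread): a slice of value-typed structs is one contiguous allocation, while a slice of pointers implies a separate heap object per element.

                          package main

                          import "fmt"

                          // point is a value type: stored inline wherever it appears.
                          type point struct{ x, y float64 }

                          func main() {
                              // One allocation, contiguous memory; nothing per-element
                              // for the GC to trace.
                              byValue := make([]point, 1024)

                              // A slice header plus up to 1024 separate heap objects,
                              // each reached through a pointer.
                              byPointer := make([]*point, 1024)

                              fmt.Println(len(byValue), len(byPointer))
                          }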

          4. 3

            It is not. Both C# and Java are quite fast these days (on the latest CLR and JVM runtimes).

            1. 2

              Do you have some references that the parent comment asked for?

          5. 3

            Java is widely used for HFT and other latency-sensitive applications. Writing and Testing High-Frequency Trading Engines and LMAX Disruptor: 100K TPS at Less than 1ms Latency are two examples of what you can do with Java. All memory is allocated on startup and no additional garbage is created. The garbage collector never runs, because there’s nothing for it to do.

            There’s nothing comparable to this for Go. Its GC is not generational, and it forces a collection every 2 minutes; you pay for a GC cycle even if no memory has been allocated.
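
            For what it’s worth, the closest Go equivalent I know of to the “allocate everything on startup, GC never runs” pattern is a blunt one (a sketch, not a recommendation):

                package main

                import (
                    "fmt"
                    "runtime/debug"
                )

                func main() {
                    // Preallocate the working memory up front...
                    buffers := make([][]byte, 64)
                    for i := range buffers {
                        buffers[i] = make([]byte, 1<<20)
                    }

                    // ...then turn off automatic collection entirely. Unlike
                    // the JVM's many GC knobs, this is roughly the only big
                    // switch Go offers (GOGC/GOMEMLIMIT aside).
                    debug.SetGCPercent(-1)

                    fmt.Println("running with GC disabled;", len(buffers), "buffers reserved")
                }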

            1. 6

              https://tip.golang.org/doc/gc-guide

              Go’s GC targets pause times of 1us, not 1ms.

              1. 2

                By throttling other threads’ allocations when it can no longer adequately collect.

                Basically all GC research is done on the JVM; comparatively, most other runtimes’ solutions are toys. E.g., C# has very impressive performance, but its GC is still a single, thousands-of-lines-long file, which wouldn’t mean anything in itself; I’m just pointing out that it (and also Go) prefers allocating less in the first place, so they are not as reliant on good GCs as typical Java programs.

                1. 1

                  By throttling other threads’ allocations when it can no longer adequately collect.

                  Where have you seen that Go throttles allocations based on the state of the GC?

              2. 1

                Thanks! I last looked at Go’s GC several years ago, but the 1.19 changes seem promising. There’s still a 2 minute forced collection, but most apps will be fine with it.

            2. 1

              I don’t know that it’s fair to say that Java is widely used for HFT. AFAIK C++ still dominates that sector considerably.

              1. 2

                As far as I know, the sector is divided in two. One part, previously occupied by C++, requires so much speed that general-purpose CPUs are nowadays inadequate for it, and it is now done with special-purpose ASICs. The other part operates on a slightly slower (but still very latency-sensitive) time frame, where fast adaptation of code (trying out new algorithms) and correctness are more important. So in a way ASICs have eaten C++’s lunch, and for the other part Java is a better choice.

                But that may be inaccurate, I haven’t worked in that industry personally.

          6. 2

            Go is compiled to native code, but not all native code is equal. Google’s Go compiler prioritizes short compilation times over optimizing for execution speed. I’m not sure how well gccgo does here; perhaps it could optimize tight loops that don’t do allocations as well as C code.

            1. 1

              As an aside, I’m starting to feel like Go is mature enough these days that it should be able to afford the complexity of separate debug/release modes to manage compile times. I don’t really feel like we’re gaining much from having a single fast build mode that outputs slow code.
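
              A partial version of this exists already, though only in the “more debuggable” direction; assuming the standard toolchain flags:

                  # "debug" build: disable optimizations and inlining
                  # (what debuggers typically want)
                  go build -gcflags=all="-N -l" ./...

                  # default build: the single optimization level Go offers
                  go build ./...

              What’s missing is the opposite end: a slower, -O2-style mode that spends extra compile time on optimization.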

        3. 1

          People keep saying they marketed their language as a “systems language”, but I’m only aware of Rob Pike mentioning that in passing one time and in context he was talking more about distributed/networked systems like those at Google. Was there some big campaign to make Go out to be a systems language in the traditional sense? I keep hearing people mention this, so I assume I’m mistaken.

          1. 3

            Thanks for the comment! I realized that when people say “language X was marketed as …”, what they actually describe (and I am guilty of that in this very comment) is not conscious, coordinated marketing from the authors, but rather a self-perpetuating complex of memes. That is, people start with saying that “Go is a systems programming language”, and then continue with saying “people say that Go is a systems programming language”, at which point the meme never dies. So, yeah, I expect that’s what actually happened here — someone somewhere said that Go is for systems programming, the Internet caught on, and that turned into “marketing claims”.

            That being said, looking at https://go.dev/blog/1year, it says

            We set out to build a language for systems programming - the kinds of programs one might typically write in C or C++ - and we were surprised by Go’s utility as a general purpose language. We had anticipated interest from C, C++, and Java programmers, but the flurry of interest from users of dynamically-typed languages like Python and JavaScript was unexpected. Go’s combination of native compilation, static typing, memory management, and lightweight syntax seemed to strike a chord with a broad cross-section of the programming community.

            Also, the above quote fits Rust pretty much ideally :-)

          2. 3

            https://go.dev/doc/effective_go

            “Go is a general-purpose language designed with systems programming in mind. It is strongly typed and garbage-collected and has explicit support for concurrent programming.”

            https://go.dev/talks/2012/splash.article

            “For a systems language [..]”

            There are a bunch of talks from Google engineers that use the term as well but I’m not going to hunt through them.

            This led to an impression, one I still see today, where people say “Go is a systems language” and it proliferates through that way.

            1. 2

              Ah, fair enough. Makes sense.

      2. 3

        Thanks for pointing this out! Not sure what I was thinking when I wrote that. One of the reasons I chose Rust over Go and Vlang was the lack of a garbage collector and memory management through scopes and lifetimes. I’ll correct the article.

      3. 2

        Go has similar performance profile to Java, Swift and C#.

        Is that in real-world-clock performance, or amount-done-per-watt performance?

        1. 3

          There was a paper a while back comparing a few different metrics like time, energy, and RAM of various programming languages using computer language benchmark results. It’s been a few years so the numbers have probably shifted a bit from what’s presented. You can navigate to the results from this site:

          https://sites.google.com/view/energy-efficiency-languages/home

          1. 3

            Note that a lot of these results are considered rather dubious.

            1. 1

              Here is a newer paper: https://www.sciencedirect.com/science/article/abs/pii/S0167642321000022

              Not sure what you considered dubious in the first one; I saw that comment often, but never with objective reasoning (mostly questioning how Java can be so efficient). To answer that: it is efficient by being very lazy about when it runs the GC, compared to the usual, busier approach. It’s the usual space-vs-time tradeoff, where Java prefers letting memory usage grow a bit more.

              1. 1

                Yes, I know these papers, and no, this has nothing to do with Java. But thanks for the links.

          2. 2

            I’d love to see this experiment repeated, considering all the changes in GC languages, maybe even on a yearly basis given Google’s endless resources.

            1. 1

              I have commented above, but the link may be more relevant for your question (2021 study):

              https://www.sciencedirect.com/science/article/abs/pii/S0167642321000022

        2. 2

          I would expect the two metrics to correlate sufficiently strongly to be equivalent at this level of precision.

        3. 1

          It seems highly unlikely that golang runtime performance is as good as post-warmup HotSpot. From the beginning, Go has been designed and engineered for compile-time performance, because that was the main problem Google was trying to solve. The mainline JVM is the end-point of several projects, beginning with Self, that have not only been focused on incremental runtime performance improvements for about 30 years straight, but have also been the place where (almost) every current trick to improve runtime performance of such dynamic systems was originally developed, save a few that came out of Adobe’s Tamarin (Flash) and the original V8.

          1. 4

            And yet the JVM doesn’t have value types, which are pretty important for perf nowadays, when we realize that memory layout is a bigger deal than particular CPU instructions.

          2. 1

            One thing to consider is that Go doesn’t necessarily allocate everything on the heap, and in fact tries pretty hard to keep as many allocs on the stack as it can, which reduces the amount of GC needed in the first place. Significantly so, compared to Java, I believe.

          3. 1

            It is… Complicated.

            We actually do not have a lot of data about the real impact of HotSpot in production, fascinatingly enough. We do have some rather in-depth benchmarking, and it is… not great.

            It seems that in a quite significant number of situations, post-warm-up HotSpot ends up stuck in a worse operating point than before. And it also regularly ends up not stabilising, but oscillating between different operating points.

            In aggregate it is still probably beneficial, but we are desperately lacking work in the public record and literature analysing HotSpot in production through continuous profiling on a large sample.

            1. 1

              We actually do not have a lot of data about the real impact of HotSpot in production, fascinatingly enough

              The public may not have. Given that it is one of the biggest languages used in enterprise settings with proprietary code bases, I’m entirely sure that plenty of such data is available for OpenJDK engineers.

              1. 1

                You would be surprised. I know of really few orgs running it in production that actually do continuous profiling.

    3. 8

      This feels like every time I start to work on something at work. It starts simple and then ends up in a mess of confusing code, and I later realize there is a simpler solution that’s good enough for now. So I can definitely relate to this problem.

      For this specific problem I’m a little curious why a standard log shipper wasn’t used, as it would have solved many of the issues mentioned. But I know that’s not really the point of the article…

      1. 6

        What the story doesn’t say is that there was one month of red tape just to get access to the log aggregation system. At that stage, we offered to use a specific log shipper, but they couldn’t find a good way to let us install it using their infra. Eventually, they suggested we package it ourselves and maintain the package. The team politely declined.

        1. 1

          I guess this was about packaging an rpm or deb? Is this to do with service management for the log shipper, shared library dependencies, or something else?

          I’m not sure if this was a while ago, or just an old system, but wonder if you see newer solutions that could’ve worked, if they were available at the time. Like systemd user services, but maybe it wasn’t (yet) a systemd OS. Or shipping static binaries with musl or Go.

          We rely on systemd-journal a lot, but with a custom shipper. I believe there’s an example HTTP shipper in the systemd source code, that can give a head-start (though it’s in C). The nice thing is that it essentially takes care of the queuing part: the API gives you a cursor (simple string), and you don’t have to care about where the log files are, or rotation. The Rust systemd crate has also been solid for us, for this purpose.

    4. 5

      I believe this is the original memo, in case anyone wants to read the whole text. It appears to be from June 2nd, 1966, not June 6th.

    5. 5

      From the rant:

      …the Bluesky people were looking for an identity system that provided global ids, key rotation and human-readable names.

      They must have realized that such properties are not possible in an open and decentralized system, but instead of accepting a tradeoff they decided they wanted all their desired features and threw away the “decentralized” part…

      I’m really curious about this - why is it not possible in a decentralized system? I know there have been past attempts around web of trust. I’ve always wondered why something couldn’t be built using blockchain to store public keys. Key rotation could be performed by signing a new key with the old key and updating the chain.
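
      For the rotation part specifically, here’s a minimal sketch (my illustration, in Go) of the “old key signs the new key” idea:

          package main

          import (
              "crypto/ed25519"
              "crypto/rand"
              "fmt"
          )

          func main() {
              oldPub, oldPriv, _ := ed25519.GenerateKey(rand.Reader)
              newPub, _, _ := ed25519.GenerateKey(rand.Reader)

              // Rotation record: the old key endorses the new public key.
              sig := ed25519.Sign(oldPriv, newPub)

              // Anyone who trusted oldPub can follow the handover to newPub.
              fmt.Println("rotation valid:", ed25519.Verify(oldPub, newPub, sig))
          }

      The chain itself is easy; the hard part seems to be agreeing globally on which chain owns a given human-readable name.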

      Am I missing something obvious that’s not currently possible? Are there good research papers in this area I should read?

      1. 7

        It’s well-trodden folklore; I’d start with Zooko’s triangle. Several forms of the triangle can be formalized, depending on the context.

      2. 4

        You don’t need a blockchain to store a bunch of self-signed public keys. PKI has been doing that for ages via certificates.

        OTOH, if you want globally-consistent (and globally-readable), unforgeable metadata associated with those keys you need some arbiter that decides who wins in case of conflicts, what followers/“following” graph edges exist, etc.

        Nostr actually uses existing PKI (via HTTPS + TLS) to “verify” accounts that claim association with an existing public domain. Everything else is…well, not even “eventually consistent” so much as “can look kinda consistent if you check a lot of relays and don’t sweat the details.”

      3. 4

        It’s possible, but you need some kind of distributed consensus, which in practice looks like a blockchain. ENS is one implementation on Ethereum. You also need some mechanism to prevent one person from grabbing every name (if you assume sybil attacks are possible, which they will be on almost any decentralized system). The most common one is to charge money, which is not really ideal for a social network (you want a very small barrier to entry)

        1. 4

          You also need some mechanism to prevent one person from grabbing every name

          An interesting take on this is done by the SimpleX network: it employs no public identifiers. The SimpleX Chat protocol builds on top and continues this trend. You can build a name service on top, but the takeaway I make is that maybe we don’t need name services as often as we think.

      4. 3

        The assertion of Zooko’s Triangle is that you can’t have identities that are simultaneously human-meaningful, globally unique and secure/decentralized. You can only pick two properties. DNS isn’t secure (you have to trust registries like ICANN.) Public keys aren’t human-meaningful. Usernames aren’t unique because different people can be “snej” on different systems.

        The best compromise is what are known as Petnames, which are locally-assigned meaningful names given to public keys.
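
        In code, a petname table is almost embarrassingly simple (illustrative Go, with made-up keys); the point is that the mapping is local, so names can be meaningful without being globally unique.

            package main

            import "fmt"

            func main() {
                // My local, human-meaningful names for globally unique keys.
                // Your table may use different names for the same keys.
                petnames := map[string]string{
                    "ed25519:4f2a…": "snej",
                    "ed25519:9c1b…": "work-admin",
                }
                fmt.Println(petnames["ed25519:4f2a…"])
            }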

        1. 2

          Zooko’s Triangle describes the properties human-meaningful, secure, and decentralized. DNS is secure, not decentralized.

          1. 2

            Oops, thanks for the correction. I was working from memory.

    6. 3

      I don’t think it should be considered a “Sovereign cloud” unless you also own the datacenter building and the land it’s on. Even then, it’s a bit of a gray area because you’re still paying rent to an ISP for bandwidth.

    7. 20

      Title is somewhat misleading, “free” here just means proprietary zero-cost.

      1. 12

        I’m confused: they say the binaries are available under the Apache 2 license, but that’s a source code license; it doesn’t make sense for a binary-only distribution.

        1. 16

          Your confusion is justified. As far as I’m aware, no one has ever used the Apache 2 license in this way before.

          1. 10

            Since it runs on the JVM, I wonder if that means the Java bytecode is licensed as Apache 2? If so, I wonder if it could be run through a bytecode decompiler to get an open-source version of the code…

        2. 7

          it doesn’t make sense for a binary-only distribution.

          Doesn’t it? Why not?

          The apache license:

          “Source” form shall mean the preferred form for making modifications, including but not limited to software source code, documentation source, and configuration files.

          “Object” form shall mean any form resulting from mechanical transformation or translation of a Source form, including but not limited to compiled object code, generated documentation, and conversions to other media types.

          You may reproduce and distribute copies of the Work or Derivative Works thereof in any medium, with or without modifications, and in Source or Object form, provided that You meet the following conditions: …

          It seems to describe conditions that apply both to the distribution of source code and of binaries, with this case being the latter.

          1. 7

            The way that I, and lawyers that I’ve spoken to, have always interpreted that is that the license is on the source code and these are the terms applied to binaries generated from source code under the license. I don’t know what it means for the binary to be under the license and the source code under a different license and I suspect that their lawyers don’t either.

      2. 8

        Thanks. Edited the title.

      3. 4

        I feel like if it was fully open sourced it would say “Datomic released as open source” or something similar. The title is fine.

        1. 13

          They specifically chose to include a section in the article saying “Is it Open Source?” and the answer is definitely “no” but they don’t actually say “no” in that section.

          Very strange, evasive wording.

          1. 4

            It’s even weirder: they list an Open Source license in the ‘Is it Open Source?’ question, which made me think that the project was Apache 2 licensed. It turns out that only the binaries are, which is very strange.

    8. 46

      I had a walk while preparing a presentation around Supply-Chain security when this thought occurred to me. It was later refined by a friend of mine.

      There are two camps of “Supply-chain security”.

      • How can I continue to extract labor from OSS and mitigate the risks? (camp 1)
      • How can we model a supply chain that accounts for OSS and recognizes their effort/collaborates with them? (camp 2)

      Most (all?) efforts from Google, OpenSSF and Companies largely fall into Camp 1. They are check boxes you can tick, measurements for you to look at Open-Source labor and ways to get badges so maintainers can look more attractive.

      However, if you actually look at what is needed, very few resources are spent on figuring out how we can better support maintainers. That is a problem. I get disappointed every time Google launches a new supply-chain product; they are one lay-off/promotion away from being abandoned.

      1. 20

        I couldn’t agree with this more. I remember vividly sitting in Kelsey Hightower’s closing StrangeLoop keynote on securing the software supply chain thinking “who is this for?”. OSS developers providing supply chain integrity proofs and Kelsey out on the conference circuit astroturfing this as “the expected minimum” has nothing to do with developing software that serves users. It has everything to do with helping companies (Google etc.) satisfy their regulatory obligations. Scaremongering about how artifact signing may become legally required in the future is something Google is concerned about and needs to hedge against, not something OSS developers should joyously lean into.

        Waste of a dang keynote.

      2. 16

        I forgot my last point :)

        One of the best initiatives I have seen in the past years, amid the supply-chain security hype we’ve had since log4j/SolarWinds, has to be Open Collective hiring maintainers. More of this please.

        1. 1

          Open Collective does seem like a pretty good way to get maintainers paid on a donation model. I haven’t used it myself, nor do I know of any projects that especially use it, though; hopefully it works as well as it sounds?

      3. 8

        100% agree. Google clearly financially benefits from this - both internally through the use of open source as well as externally through encouraging more people to adopt GCP. They are already putting in effort to track security vulnerabilities and in some cases contribute fixes upstream, why not go the last mile and also offer grants to upstream projects to help them triage and merge in fixes?

        The only reason I can think is if they funded OSS directly, then patches would land upstream faster, which would slightly reduce the competitive advantage of Google’s repositories. That seems like a small downside to me, but I haven’t built a trillion dollar multinational conglomerate, so my viewpoint is probably not aligned with theirs. It makes me think a camp 2 solution to this problem simply isn’t realistic in a capitalist economy where corporate profit is the only motivator. Maybe I’m just being overly pessimistic though.

        1. 7

          They are already putting in effort to track security vulnerabilities and in some cases contribute fixes upstream, why not go the last mile and also offer grants to upstream projects to help them triage and merge in fixes?

          So they do something like this through their “Secure Open Source Rewards” program, https://sos.dev/. However, the major caveat here is that the rewards are based on their evaluation of the work you are submitting to the program. You don’t get paid to do the work, you get paid for the completed work.

          That’s not good enough.

          They did fund Open-Source security work of a few open-source maintainers, but this was discontinued.

          The Google Open-Source Security Team was funding some of the Reproducible Builds work until very recently as well.

          https://reproducible-builds.org/who/sponsors/

      4. 8

        Honestly it sounds like “Camp 1” is what managers think privately while “Camp 2” is what they say publicly. “Modeling a supply chain and recognizing their effort” is just a way to sugar coat labor extraction.

        I guess the material difference you’re pointing to is whether they support maintainers?

        1. 4

          I think contributing towards figuring out how we can solve the problem of “supporting maintainers” should be the bar.

          It’s a low bar.

      5. 4

        if you actually look at what is needed, very few resources are spent on figuring out how we can better support maintainers

        100% this. I gave a talk at PhillyETE this week on the premise that lots of coders want to contribute (willing). But they’re not ready and not able. I don’t think most OSPOs even realize this is a problem, let alone are ready to invest in it.

        The talk comes from the years I’ve worked on CodeTriage and my recent book How to Open Source (dot dev). It focuses on practical, evidence-based interventions we can introduce to increase the contributor success rate, ultimately with the goal of reducing maintainer burden.

        The title of the talk is “How to Steal from Maintainers” but the video is not yet published

      6. 2

        Most (all?) efforts from Google, OpenSSF and Companies largely fall into Camp 1

        I don’t really understand why you think companies are different from other consumers of F/OSS here. As a user of a load of F/OSS applications and libraries, I care that nation-state actors are not inserting malware into the code that I’m running. As a maintainer of F/OSS projects, I don’t have the resources to formally verify every patch that I get, so I can’t tell if someone is sneaking subtle backdoors into my projects (remember the null pointer vulnerabilities in SELinux that the NSA introduced, which weren’t even a known vulnerability class until they were discovered?), and I would appreciate tools that would at least let me limit the damage if this happens.

        1. 3

          null pointer vulnerabilities in SELinux that the NSA introduced

          What?

          1. 2

            The null pointer handling that the NSA contributed to SELinux converted a bunch of crashes into arbitrary code execution bugs. This became an entirely new vulnerability class. There’s no evidence that this was done maliciously (it’s far more plausible that the folks at the NSA who introduced the bug were also unaware of its potential for exploitation), but it serves as an example of a contribution that expert reviewers missed as introducing vulnerabilities. A nation state adversary would find it very easy to sneak code like this into a lot of projects deliberately.

        2. 2

          Companies look at licenses in a fundamentally different way from individuals; a company is less likely to honor the license overall, and also more likely to be intimidated by an uncommon license. Google looks at its offerings in terms of legal liability and partnerships, not in terms of machine capabilities and programmer efficiency.

          1. 1

            a company is less likely to honor the license overall, and also more likely to be intimidated by an uncommon license. Google looks at its offerings in terms of legal liability

            These two statements seem fundamentally at odds. I would expect that a company that has lawyers review a license and avoids licenses that their lawyers do not say that they can comply with easily would be more likely to honour the license.

            If anything, I’d expect individual users and developers who don’t have legal teams to be more likely to honour licenses. For example, I’ve seen users pass copies of GPLv2 binaries to their friends, yet that is directly against the terms of the license (unless they provide the source code or a written offer good for at least three years to provide the exact version of the source code). I’ve seen a lot of community Linux distros do this as well. I’ve also seen a load of people incorporate permissively licensed code into their own projects and ignore the attribution requirements. It’s far more rare in my experience for a big company to do this because they have a lot more to lose. It does happen, but it’s far less common.

    9. 4

      The app looks pretty neat, but I tried to open an 800MB JSON file, and it took forever to load and used up 24GB of memory.

      1. 3

      I was quite impressed with it too. Would be nice if it came with a man page. It doesn’t even seem to accept -h or --help; even if it doesn’t take lots of options, a brief usage message would be helpful.

      2. 3

        OMG)) This is an embarrassment. Will start using pprof)))

      3. 1

      I use it most of the time, but recently had it fail on a large-ish JSON file (like 27MB). I can’t report a bug as the JSON contains private data, but jless (a similar tool) worked with it. I do prefer fx over jless though, in particular for the line wrapping of long content strings.

        1. 2

          I spent a lot of time implementing the wrap functionality. )))

          Combined with the search and themes features, the code becomes complex.

          Will try to rework the whole part to make it fast. I guess a new approach is needed.
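
          For what it’s worth, the core of greedy wrapping is small (a sketch in Go, not fx’s actual code); presumably the complexity comes from keeping search hits and theme spans aligned once offsets change:

              package main

              import (
                  "fmt"
                  "strings"
              )

              // wrap greedily packs words into lines of at most width characters.
              func wrap(s string, width int) []string {
                  var lines []string
                  for _, word := range strings.Fields(s) {
                      if n := len(lines); n > 0 && len(lines[n-1])+1+len(word) <= width {
                          lines[n-1] += " " + word
                      } else {
                          lines = append(lines, word)
                      }
                  }
                  return lines
              }

              func main() {
                  long := "wrap long JSON string values to the pane width"
                  fmt.Println(strings.Join(wrap(long, 16), "\n"))
              }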

          1. 1

            You’re on the right track, keep pushing :)

    10. 2

      This tool has a nice UI! I had an idea the other day - create a tool that compares across all the cloud providers by price per bandwidth, CPU performance, memory, etc. It would base the comparisons on CPU benchmarks, so you could compare different processor types across clouds (and maybe even pluggable so you could run your own benchmarks to test your specific application). Right now it seems like you just have to estimate, and then do a bunch of manual testing to find the best deals.

    11. 4

      Great post! It has a lot of parallels to some of the classic Twitter engineering posts / talks.

      It’s interesting they went with hash based load balancing, and sending open requests to a specific instance. It seems like they’re kind of trading hot partitions in Cassandra for hot nodes in the data layer cluster. I wonder if they also load balance on CPU or geo a bit, or just over provision the worker nodes so much it doesn’t matter. As they continue to scale I wonder if they will need to in order to spread out requests for the same data to more nodes.

      Related to this, the new DynamoDB paper is interesting, as AWS had to solve some similar issues. I wish Scylla / Cassandra had more advanced features like this, but I suppose a lot of users would have no need for this (it’s highly use case dependent and only an issue at scale).

      It seems like a practical solution for these types of scalability problems could be:

      • Keep data in memory on nodes partitioned by a hash of the keys being looked up (i.e. how any distributed database/cache works).
      • Handle hot partitions with tweaks to the hash partitioning: split the partitions into smaller chunks and allow the hottest partitions to be replicated to more nodes. There are tons of great open source databases to help with partitioning, but I haven’t seen the second point addressed very commonly.
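
      A sketch of that second bullet (my own illustration, in Go): salt the hot key with a small replica index, so its traffic hashes to several nodes instead of one.

          package main

          import (
              "fmt"
              "hash/fnv"
          )

          // node maps a (key, salt) pair to one of n nodes. Writes go to every
          // salted variant of a hot key; each read picks one variant at random.
          func node(key string, salt, n int) int {
              h := fnv.New32a()
              fmt.Fprintf(h, "%s#%d", key, salt)
              return int(h.Sum32()) % n
          }

          func main() {
              // A hot key fanned out over 4 replicas lands on up to 4 nodes.
              for salt := 0; salt < 4; salt++ {
                  fmt.Println(node("hot-user-123", salt, 32))
              }
          }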
    12. 3

      Although lunar time would remain the official timescale, its users might, as on Earth, want to offset it in time zones that link to the Sun’s position in the sky. This is less a question for metrologists and more one of convention. “When somebody really lives there on the Moon, I think it makes sense,” he says.

      I’m sick of programming around timezones on earth. I think we should get rid of them on earth and just use UTC everywhere. I really really hope they don’t introduce timezones on the moon :( . What’s next - DST on the dark side of the moon?

      1. 5

        Obligatory reference to “So You Want To Abolish Time Zones”.

        1. 1

          I remember when “So you want to…” articles were in vogue; for most of them, the point was to educate you on the myriad of issues that you don’t know you’re ignoring. They were a valuable resource of their day, similar to today’s Falsehoods programmers believe about names.

          But this one has always struck the wrong chord with me. The whole crux of the argument is “we abolish timezones so now I can no longer look at my clock and figure out if I can call Uncle Steve in Melbourne. Therefore we need timezones”. But you couldn’t tell if you could call him before either. Maybe he’s working, maybe he works weird hours, maybe he’s on holiday, maybe he’s out shopping, and yeah, maybe he’s asleep. So… ask him? This argument is just so flimsy. There are tradeoffs to be argued about; this just isn’t one of them. Ask him when he wants to be called. If it’s a business, look at their hours on their website (which you can now understand!). I guess this guy wants to inherit the cachet of the “so you want to” article trend of its day, but it’s not nearly as detailed or irrefutable as the others were; it’s just one argument presented, and it’s an argument that I personally can easily reject.

          They make a secondary argument about the day of the week and alright sure that’s inconvenient. There’s probably a solution to it that I’m not seeing, but sure I’ll buy that there’s some tradeoff or further solution to explore there. It doesn’t nearly shoot down the whole idea though, it’s just a little piece to think about at the same time that you’re thinking about the rest of the problem.

          1. 1

            Speak for yourself; I found this article to be a brilliant and entertaining argument for why time zones make sense for solving human problems, and it definitively convinced me that abolishing time zones would be a bad idea. There are people I might call at 4:25 in the morning in whatever time zone they happened to be in, but at least I’m aware of the connotations of “4:25 AM” and why the specific people I call at that hour are different from ordinary people under ordinary circumstances.

      2. 4

        Let’s start with getting rid of DST

      3. 2

        If people actually living on the moon are completely divorced from a natural day-night cycle, which seems reasonable enough given imaginable living conditions on the moon, then everyone on the moon could decide to adhere to the same Lunar Time Zone, which might as well be identical to UTC.

        On the other hand, it might be useful for different settlements on the moon to have deliberately staggered standard sleeping and waking times. If some kind of emergency on the moon happened it might prove useful if, no matter what time it was, that time would always coincide with the standard waking hours of some lunar settlement.

    13. 2

      This seems familiar; it’s the same problem as with JVM applications, IIRC?

      1. 7

        It looks like a special case of the general problem that nested schedulers always interact in surprising ways and, unless a nested scheduler has full visibility into its parent, will always behave poorly. This is true for JVMs and for guest kernels on hypervisors’ VCPU schedulers, and for every other attempt to build nested schedulers.

      2. 1

        I believe the JVM has been updated to use cgroups info, but I don’t know the internals and if that totally solves this problem.

        1. 1

          Oh neat! Yeah, this looks like it solves the main complaints about JVM apps in containers. It at least removes the need for a lot of hacky workarounds.

    14. 12

      I don’t see how you can do E2EE in a web-based service. Or rather, I know how, but it requires trusting the JS the server sends you, which means trusting the server. Which defeats the point of E2EE in a federated system where the server is run by some volunteer in Ruritania whom you’ve never met and don’t know by name.

      I know you ultimately have to trust the software that’s doing the encryption, but in other E2EE systems like iMessage the software isn’t downloaded on literally any page view, and it’s only the app or OS vendor you need to trust this way, which is (a) one entity, and (b) one that’s very visible and has a lot of reputation to lose if they abuse this.

      PS: Thanks for posting this. I’m working on E2EE myself, in a different system, and it’s useful to read more about how others have done it.

      1. 16

        I don’t see how you can do E2EE in a web-based service. Or rather, I know how, but it requires trusting the JS the server sends you, which means trusting the server.

        Luckily, I already wrote the part that addresses this:

        Users shouldn’t expose their keys to the webserver. This means using end-to-end encryption outside the context of the website. This implies a Browser Extension.

        1. 3

          I wish there was a way to do this with something like WebAuthN or FIDO, where the secret key is in a TPM of some sort and persists on the device. I know this works for authentication, but I don’t think you can use it to encrypt arbitrary data. This still doesn’t solve the issue of migrating devices, but it would be a neat way to securely store the private key, ensuring it never gets to the server.

          1. 6

            I do eventually want to write a W3C specification for using a FIDO2-compatible device to encrypt/decrypt keys in a browser context. That would give us a better key management solution than the ones listed on that page.

        2. 1

          Maybe the idea of an eternally stored, private message is a fallacy.

          Would short-lived keys like those at https://privatebin.info be helpful?

      2. 3

        Honestly I think that mastodon would do itself a big favor by asking people to use native clients. Beyond this problem (web extensions are of course mentioned, and I think you can create isolated contexts that would be very hard for hijacked instances to deal with), there are loads of issues that come from “we are passing around instance-specific links to things”, that basically completely disappear if you are using a client.

        “The follow button semantics are weird”: a native client resolves the auth issues over HTTP. “Why is this link from instance A being hosted on instance B when shared”: protocol links allow clients to receive the real things. “E2EE trust semantics”: again, native clients mean your trust matrix is different (and theoretically easier to audit).

        One side note on the web-based service issue, is that with a lot of effort mastodon instances could make it so that you could outright save an HTML page locally on disk and have that be “the software”, for people who are worried. There is still an audit step involved there, but it’s not impossible!

        But ultimately the fact that instances host web interfaces (rather than “there are instances, get your client software from X/Y/Z”) is generating constant usability concerns.

      3. 1

        You can have whatever download semantics you want with service workers and progressive web apps (PWA). You could only let the app update with the user’s permission, you could make all updates be signed by a predefined set of keys, etc.. This closes the gap between native apps and web apps and makes it more of a spectrum.

        If you rely on a browser extension then you’ve just ruled out 99% of people who are not able to, or will not, install a browser extension in their environment.

        1. 1

          Interesting. How does the end user know the web-app has these policies, though? If I create a Mastodon account at randomsite.io, can I verify that they are using an unmodified copy of a front end codebase I trust, and that it can’t be changed without my consent?

          99% of people who are not able to, or will not, install a browser extension

          [citation needed]

          Back to my original point, 99% of people will install apps on their phones/tablets though.

          1. 1

            You can verify the update policy by reading the (typically < 100 LOC) service worker code. They could also check the hashes of retrieved assets against some known value, which again you could inspect and compare against some other source with the values in the service worker source.

            EDIT: I’ve just dug deeper and I think I’m wrong. This only applies to all the website assets excluding the service worker script itself. This is still checked for updates regardless, and the browser will install the new version either on next load or immediately if it calls skipWaiting. Although that is also visible to the user (the SW has an incrementing version controlled by the browser).

            That 99% of people is based on the following: take one of the most popular extensions, like uBlock Origin, add up the number of installs across all browsers (23 million), and divide by the number of internet users (~5 billion). This gives about 0.5%. If this is the most popular extension, then this would be an upper bound.

    15. 16

      Could you go into more detail about why this round of embedded key-value stores are different than, say, Berkeley DB, which has been in the Python stdlib for decades? I’ve always been very confused about this hype, figuring it as just being the Google-backed version of an existing idea.

      1. 9

        A lot of the ones mentioned in the article were developed for use in servers. They are much more scalable than Berkeley DB and, with LSM trees, can handle much higher write loads.

        For embedded use, LMDB and its spin-off libMDBX are a lot faster than BDB, but have nearly-compatible APIs. (The guy who built LMDB has a paper on how his project, OpenLDAP, had so many performance problems with BDB that he wrote LMDB as a replacement.)

        1. 5

          The guy who built LMDB has a paper on how his project, OpenLDAP, had so many performance problems with BDB that he wrote LMDB as a replacement.

          That’s very interesting. I usually think of an LDAP backing store as the canonical example of something that’s almost entirely unconcerned with write performance. (Because directories are read so frequently and updated so rarely.)

          edit: Assuming this is the paper in question, it seems to me that read optimization was the focus of MDB development, not write loads. But it sounds like some of the design decisions they made helped quite a bit with write performance as well.

          1. 1

            Yes, my comment about higher write loads was specific to LSM trees, not LMDB. Sorry for the confusion.

        2. 3

          As I remember it the main reason for the move away from BDB was not performance, it was the license change.

      2. 6

        My college database course had us write a SQL engine on top of BerkeleyDB, and that was 8 years ago. I was surprised to learn just now that it was initially released in 1994. Page 6 of this paper from this year shows BerkeleyDB winning in most of the workloads they tested. (The paper is “Performance Comparison of Operations in the File System and in Embedded Key-Value Databases” by Hines et al.)

        1. 4

          Interesting paper, but they’re pretty vague about the setup: they ran the tests in a VM provided by the university, so that adds a lot of unknowns (like, what kind of mass storage? And was anyone else using the VM?), and didn’t specify what filesystem they used. I suspect they’d also get different results on Apple platforms and Windows.

          1. 1

            Agreed. I wish they posted their code so we could try it on other systems.

      3. 2

        It would be great to see more examples of situations where each is better. The article mentions:

        It can allow you to swap B-Trees (traditional choice) for LSM Trees (new hotness) as the underlying storage method (useful for optimizing write-heavy workloads).

        I don’t think LSM trees are strictly better than B-trees even if your primary requirement is a write-heavy workload. You also need the index characteristics an LSM tree provides (sequential IO), and you have to be okay with its compaction characteristics. Cassandra, for example, uses this structure, and I distinctly remember compactions being tricky to tune well (although it’s been a long time since I’ve used it).

        The original paper goes into more detail, but it’s been a long time since I’ve read it. Google might have made them more popular, but they weren’t invented at Google. Like anything in CS, it’s just a different data structure. There are tradeoffs, it depends on the use case, and ymmv.

        1. 4

          Sure, I’m making generalizations; precision requires benchmarking, and I’m not even talking about any specific code in this post. But if you just Google “lsm tree vs btree benchmarks” you’ll find a bunch, and they mostly agree with the generalization I made.

          For example:

          Here’s a Mongo benchmark.

          Their takeaway: “If you have a workload that requires a high write throughput LSM is the best choice.”

          Here’s a TiKV benchmark.

          Their takeaway: “The main purpose for TiKV to use LSM-tree instead of B-tree as its underlying storage engine is because using cache technology to promote read performance is much easier than promote write performance.”

      4. 2

        Yeah this is a good question. I’d like to know myself.

        1. 8

          When you talk about key-value stores, there are basically 2 data structures used for storing the data: B-trees and LSM-trees. B-trees have been around for a long time (Wikipedia tells me since 1970), so the original embedded DBs all used a B-tree as their underlying data structure. They differ in all other aspects (manual page cache vs mmap, variations of B-trees, etc.), but they’re all implementations of a B-tree. B-trees are more efficient at finding/reading data than at inserting or updating it.

          LSM-trees, on the other hand, were invented in 1996 (again, so says Wikipedia). They were designed to handle writes/updates more efficiently than B-trees. The observation was that the “heavy” work of sorting can be done in memory, after which a merge-sort-style operation incurs only a sequential read from and write to disk, which is typically very fast. This spawned a number of implementations (LevelDB, RocksDB, etc.) which also vary in a number of aspects, most notably in the strategy for when to merge data that has been persisted to disk. There are 2 main camps here: merge once a certain number of files per level have been written, or once a certain size per level has been reached. These strategies have different performance characteristics (write amplification, space amplification, etc.) and can be chosen based on the workload.
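
          To make that concrete, here is a toy sketch of the LSM write path in Python (hypothetical code: no WAL, no tombstones, and no real compaction):

          import bisect

          class ToyLSM:
              """Toy LSM tree: sorted in-memory buffer flushed to immutable sorted runs."""

              def __init__(self, memtable_limit=4):
                  self.memtable = {}  # recent writes live in memory
                  self.runs = []      # flushed sorted runs on "disk", newest first
                  self.memtable_limit = memtable_limit

              def put(self, key, value):
                  self.memtable[key] = value
                  if len(self.memtable) >= self.memtable_limit:
                      # Flush: the "heavy" sorting happens in memory, and the run
                      # goes out as one sequential write. Real engines also merge
                      # (compact) runs in the background, per the strategies above.
                      self.runs.insert(0, sorted(self.memtable.items()))
                      self.memtable = {}

              def get(self, key):
                  if key in self.memtable:
                      return self.memtable[key]
                  for run in self.runs:  # newest run wins for duplicate keys
                      i = bisect.bisect_left(run, (key,))
                      if i < len(run) and run[i][0] == key:
                          return run[i][1]
                  return None

          Reads get slower as runs pile up, which is exactly why compaction (and, in real implementations, bloom filters) matter so much.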

          1. 1

            I’m aware of the data structures. :) But I’m not as aware about every major implementation and their ebb and flow in popularity/use over time.

    16. 8

      I have been trying to get the students in my intro class into the habit of explaining their code. I don’t see this as much different from them cobbling together Stack Overflow posts to get to a working answer they don’t understand. It’s just built into the editor now.

      1. 7

        One of my professors did this. We submitted both the code and a paper explaining how it worked. That worked okayish at a large scale, with a low amount of cheating.

        Another professor had the perfect small-scale solution. You went to his office with your program, opened it in an editor, and he deleted ~5 lines somewhere in the middle. If you could not recreate them and make the program run, you failed.

        1. 6

          That’s horrible. I can’t do that and I’ve been programming for over 20 years :P

          1. 4

            Hey, remember, we’re talking about undergrad level programming and not rocket surgery. Think merge sorts and binary trees.

      2. 5

        Well, apparently Copilot is really good at writing comments, so it can “explain” the code too. :-)

        1. 3

          Thought experiment: If a student uses Copilot to generate both the code and the explanation, then memorizes both and can reproduce them on command, have they learned the material?

      3. 1

        In a way, I’m not sure Copilot makes anything much different from when I was in school. The people who didn’t want to learn how to write an algorithm would just “borrow” the solution from someone else, or from the internet (and hopefully change a few variable names and the indentation). If people want to learn they will, and if they don’t, they won’t. It’s not really a teacher’s job to force anyone to learn.

    17. 5

      This is a really neat approach; I had no idea it worked! So far I’ve used direnv with a config that exports GIT_AUTHOR_EMAIL. It’s neat that git config supports this out of the box.
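
      For reference, the built-in mechanism is git’s conditional includes; a sketch (the paths are just placeholders):

      [includeIf "gitdir:~/work/"]
          path = ~/.gitconfig-work

      where ~/.gitconfig-work sets the work-specific [user] email and name.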

    18. 4

      Is there any reason why an editor couldn’t do the same for leading spaces?

      Say a project is configured to use 4 spaces per indentation level but my preference is 8: the editor could display each leading space as 2 spaces.

      If I cared more about it I would give it a try.

      1. 1

        This would break alignment, at least.

        1. 2

          So there are auto-formatters now, which are pretty neat. I wonder if the next level is running the auto-formatter both on opening the file and on saving it (or on checkout and checkin, etc.). That way you could edit with whatever your personal readability preferences are, while still keeping the code in a consistent format.
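
          Git actually has a hook for the checkout/checkin variant: clean/smudge filters. A sketch for .git/config, assuming a hypothetical formatter command myfmt with --personal and --canonical modes:

          [filter "fmt"]
              smudge = myfmt --personal    # rewrites file contents on checkout
              clean = myfmt --canonical    # rewrites them back on checkin

          plus a .gitattributes line like “*.c filter=fmt” to enable it.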

          1. 1

            Now that editors use things like Tree-sitter, the old idea of storing a syntax tree, that you can render however you want, feels close.

            And the serialised format on disk is just the auto-formatted plain-text source-code.

    19. 23

      Tabs for indentation, spaces for alignment.

      1. 7

        Exactly. Variable-width characters at the start of a line are great. Variable-width characters in the middle of a line are annoying because they won’t line up with other fixed-width things. Recent versions of clang-format now support this style, so there’s no reason to use spaces anymore.
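
        For reference, a minimal .clang-format sketch of that style (if I remember right, AlignWithSpaces landed in clang-format 11):

        UseTab: AlignWithSpaces   # tabs for indentation, spaces for alignment
        IndentWidth: 4
        TabWidth: 4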

        1. 4

          I have to suffer through clang-format at work, and I can tell you it’s pretty bad. The worst aspect so far is that it does not let me choose where to put line breaks. It’s not enough to stay below the limit; we have to avoid unnecessary line breaks (where “unnecessary” is defined by clang-format).

          Now, to enforce line lengths, clang-format has to assume a tab width. At my workplace it assumes 4 columns.

          Our coding style (Linux) explicitly assumes 8.

          1. 2

            You can tell clang-format what width to assume for tabs. If people choose to use a wider tabstop value, then it’s up to them to ensure that their editor window is wider. That remains their personal choice.

          2. 1

            I’ve found out that clang-format respects line comments:

            void f( //
                void *aPtr, size_t aLen, //
                void *bPtr, size_t bLen //
            );
            
      2. 7

        I think when people say this they imagine tabs will only occur at the start of the line. But what about code samples in comments? Those are common for showing how to use a function, or for doc-tests. It’s much harder to maintain tab discipline there, because your formatter would have to parse Markdown (or whatever markup you use) to know whether to use tabs inside the comment. And depending on the number of leading comment characters, the indentation can look strange due to rounding to the next tabstop. The same goes for commented-out sections of code.

        1. 3

          Go uses tabs for indentation and spaces for alignment. It works pretty well in practice. I can’t say that I’ve ever noticed anything misaligned because of it.

          1. 4

            If you wrote some example code in a // comment, would you indent with spaces or tabs? If tabs, would you write //<space><tab> since the rest of the comment has a single space, or just //<tab>? gofmt can’t fix it for you, so in a large Go codebase I expect you’ll end up with a mix of both styles. With spaces for indentation it’s a lot easier to be consistent: the tab character must not appear in source files at all.

            1. 1

              I can’t say that I’ve ever written code in a comment, because I just write a ton of ExampleFunctions, which Go automatically adds to the documentation and tests. Those are just normal functions in a test file. I think what’s interesting about Go is that they don’t add all the features but the ones they do add tend to reinforce each other, like go fmt, go doc, and go test.

          2. 3

            Personally, I think it would have annoyed me if go fmt didn’t exist. Aligning code with spaces is annoying, and remembering to switch between them even more so.

            1. 1

              Yes, it’s only practical if a machine does it automatically.

      3. 1

        I said this elsewhere in the thread, but it’s worth reiterating here: I’d be with you 100% if it weren’t for Lisp, which simply can’t be idiomatically indented with tabs (short of elastic tabs), because its idiomatic indentation doesn’t align with any regular tab stops.

    20. 2

      I’ve done this before with large datasets, and it’s really cool! I’ve also tried loading large datasets into a local Postgres instance to query them. While both work, I found xsv a bit faster for my specific use case of simple row/column filtering on somewhat large CSV files. That was a while ago, though, so it might not be the case now. Obviously, which tool is better depends on what you’re trying to do with the data.
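
      For that kind of filtering, the invocation looks something like this (from memory, so check xsv --help for the exact flags; the column names are made up):

      # keep rows whose "city" column matches a pattern, then project two columns
      xsv search -s city 'Boston' data.csv | xsv select name,city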