1. 7

    Is he right that keepalives are wrongly implemented at the application layer? Aren’t TCP keepalives insufficient, given that they only verify the connection from proxy to proxy, are off by default, and, if turned on, default to a two-hour period? See https://stackoverflow.com/a/23240725/4283659
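
    A minimal Go sketch of what opting in looks like at the socket level (the host and period here are arbitrary); note that even with this, the probes only reach the nearest TCP peer, which is exactly the proxy-to-proxy limitation above:

        package main

        import (
            "log"
            "net"
            "time"
        )

        func main() {
            conn, err := net.Dial("tcp", "example.com:80")
            if err != nil {
                log.Fatal(err)
            }
            defer conn.Close()

            // TCP keepalives are off by default; turn them on and shorten
            // the period from the OS default (typically 2 hours).
            tcp := conn.(*net.TCPConn)
            tcp.SetKeepAlive(true)
            tcp.SetKeepAlivePeriod(30 * time.Second)
        }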

    1. 1

      Another upside to doing stuff over the encrypted channel is that it’s less open for things on the network to mess with it. Still some possibility, since you can always have a corporate middlebox with its own root cert installed on the client computer, etc. But generally the hope with putting stuff in the encrypted channel seemed to be “hopefully now the network won’t completely break it.”

    1. 4

      Surely I’m not going to be the only one expecting a comparison here with Go’s GC. I’m not really well versed in GC, but this appears to mirror Go’s quite heavily.

      1. 12

        It’s compacting and generational, so that’s a pair of very large differences.

        1. 1

          My understanding, and I can’t find a link handy, is that the Go team is on a long term path to change their internals to allow for compacting and generational gc. There was something about the Azul guys advising them a year+ ago iirc.

          Edit: I’m not sure what the current status is, haven’t been following, but see this from 2012; look for Gil Tene’s comments:

          https://groups.google.com/forum/#!topic/golang-dev/GvA0DaCI2BU

          1. 3

            This presentation from this July suggests they’re averse to taking almost any regressions now even if they get good GC throughput out of it. rlh tried freeing garbage at thread (goroutine) exit if the memory wasn’t reachable from another thread at any point, which seemed promising to me but didn’t pan out. aclements did some very clever experiments with fast cryptographic hashing of pointers to allow new tradeoffs, but even rlh seemed doubtful about the prospects of that approach in the long term.

            Compacting is a yet harder sell because they don’t want a read barrier and objects moving might make life harder for cgo users.

            Does seem likely we’ll see more work on more reliably meeting folks’ current expectations, like by fixing situations where it’s hard to stop a thread in a tight loop, and we’ll probably see work on reducing garbage through escape analysis, either directly or by doing better at other stuff like inlining. I said more in my long comment, but I suspect Java and Go have gone on sufficiently different paths they might not come back that close together. I could be wrong; things are interesting that way!

            1. 1

              Might be. I’m just going on what I know about the collector’s current state.

          2. 9

            Other comments get at it, but the two are very different internally. Java GCs have been generational, meaning they can collect common short-lived garbage without looking at every live pointer in the heap, and compacting, meaning they pack together live data, which helps them achieve quick allocation and locality that can help processor caches work effectively.

            ZGC is trying to maintain all of that and not pause the app much. Concurrent compacting GCs are hard because you can’t normally atomically update all the pointers to an object at once. To deal with that you need a read barrier or load barrier, something that happens when the app reads a pointer to make sure that it ends up reading the object from the right place. Sometimes (like in Azul C4 I think) this is done with memory-mapping tricks; in ZGC it looks like they do it by checking a few bits in each pointer they read. Anyway, keeping an app running while you move its data out from under it, without slowing it down a lot, is no easier than it sounds. (To the side, generational collectors don’t have to be compacting, but most are. WebKit’s Riptide is an interesting example of the tradeoffs of non-compacting generational.)
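
            Here’s a toy Go sketch of the pointer-bits idea, nothing like real GC internals, just the shape of it: a load barrier inspects metadata bits packed into a pointer-sized word and “heals” a stale reference through a forwarding table. All masks and addresses here are made up.

                package main

                import "fmt"

                // Hypothetical "colored pointer": metadata bits packed into
                // the high bits of a 64-bit word, vaguely like ZGC's layout.
                const badBitsMask = uint64(0xF) << 44

                // Old address -> new address, filled in as objects are moved.
                var forwarding = map[uint64]uint64{}

                func loadBarrier(ref uint64) uint64 {
                    if ref&badBitsMask == 0 {
                        return ref // fast path: pointer is already "good"
                    }
                    // Slow path: the object may have moved; look up its new
                    // home and return a healed pointer with the bits cleared.
                    if fwd, ok := forwarding[ref&^badBitsMask]; ok {
                        return fwd
                    }
                    return ref &^ badBitsMask
                }

                func main() {
                    forwarding[0x1000] = 0x2000
                    stale := uint64(0x1000) | (1 << 44) // marked "remapped"
                    fmt.Printf("read %#x, barrier yields %#x\n", stale, loadBarrier(stale))
                }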

            In Go all collections are full collections (not generational) and no heap compaction happens. So Go’s average GC cycle will do more work than a typical Java collector’s average cycle would in an app that allocates equally heavily and has short-lived garbage. Go is by all accounts good at keeping that work in the background. While not tackling generational, they’ve reduced the GC pauses to more or less synchronization points, under 1ms if all the threads of your app can be paused promptly (and they’re interested in making it possible to pause currently-uncooperative threads).

            What Go does have going for it throughput-wise is that the language and tooling make it easier to allocate less, similar to what Coda’s comment said. Java is heavy on references to heap-allocated objects, and it uses indirect calls (virtual method calls) all over the place that make cross-function escape analysis hard (though JVMs still manage to do some, because the JIT can watch the app running and notice that an indirect call’s destination is predictable). Go’s defaults are flipped from that, and existing perf-sensitive Go code is already written with the assumption that allocations are kind of expensive. The presentation ngrilly linked to from one of the Go GC people suggests at a minimum the Go team really doesn’t want to accept any regressions for low-garbage code to get generational-type throughput improvements. I suspect the languages and communities have gone down sufficiently divergent paths about memory and GC that they’re not that likely to come together now, but I could be surprised.
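
            A small illustration of that flipped default (the function names are mine; run `go build -gcflags=-m` to see the compiler’s escape-analysis verdicts):

                package main

                type point struct{ x, y int }

                func sumOnStack() int {
                    p := point{1, 2} // stays on the stack: no GC work at all
                    return p.x + p.y
                }

                func escapes() *point {
                    p := point{3, 4}
                    return &p // escapes to the heap: the pointer outlives the frame
                }

                func main() {
                    _ = sumOnStack()
                    _ = escapes()
                }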

            1. 1

              One question that I don’t have a good feeling for is: could Go offer something like what the JVM has, where there are several distinct garbage collectors with different performance characteristics (high throughput vs. low latency)? I know simplicity has been a selling point, but like Coda said, the abundance of options is fine if you have a really solid default.

              1. 1

                Doubtful they’ll have the user choose; they talk pretty proudly about not offering many knobs.

                One thing Rick Hudson noted in the presentation (worth reading if you’re this deep in) is that if Austin’s clever pointer-hashing-at-GC-time trick works for some programs, the runtime could choose between using it or not based on how well it’s working out on the current workload. (Which it couldn’t easily do if, like, changing GCs meant compiling in different barrier code.) He doesn’t exactly suggest that they’re going to do it, just notes they could.

              2. 1

                This is fantastic! Exactly what I was hoping for!

              3. 3

                There are decades of research and engineering effort separating Go’s GC and HotSpot.

                Go’s GC is a nice introductory project; HotSpot is the real deal.

                1. 3

                  Go’s GC designers are not newbies either and have decades of experience: https://blog.golang.org/ismmkeynote

                  1. 1

                    Google seems to be the nursing home of many people that had one lucky idea 20 years ago and are content with riding on their fame til retirement, so “famous person X works on it” has not much meaning when associated with Google.

                    The Train GC was quite interesting at its time, but the “invention” of stack maps is just like the “invention” of UTF-8 … if it hadn’t been “invented” by random person A, it would have been invented by random person B a few weeks/months later.

                    Taking everything together, I’m rather unconvinced that Go’s GC will even remotely approach G1’s, ZGC’s, or Shenandoah’s level of sophistication any time soon.

                  2. 3

                    For me it is kind of amusing that huge amounts of research and development went into the Hotspot GC, but on the other hand there seem to be no sensible defaults, because there is often the need to hand-tune its parameters. In Go I don’t have to jump through those hoops, and I’m not advised to, but I still get very good performance characteristics, at least comparable to (in my humble opinion, even better than) those of a lot of Java applications.

                    1. 12

                      On the contrary, most Java applications don’t need to be tuned and the default GC ergonomics are just fine. For the G1 collector (introduced in 2009 a few months before Go and made the default a year ago), setting the JVM’s heap size is enough for pretty much all workloads except for those which have always been challenging for garbage collected languages—large, dense reference graphs.

                      The advantages Go has for those workloads are non-scalar value types and excellent tooling for optimizing memory allocation, not a magic garbage collector.

                      (Also, to clarify — HotSpot is generally used to refer to Oracle’s JIT VM, not its garbage collection architecture.)

                      1. 1

                        Thank you for the clarification.

                  3. 2

                    I had the same impression while reading the article, although I also don’t know that much about GC.

                  1. 2

                    Babytimes, seeing some folks, and cleaning.

                    1. 6

                      Nice article. Also interesting for reasoning about DNS-over-HTTPS.

                      As far as I can say from my experience in Kenya, it should also be noted that Africans have a very different way of perceiving time. And security. And… everything! :-D

                      How would I address this issue?

                      I think that I would basically create a reverse proxy serving over HTTP those sites that could benefit most from caching (e.g. Wikipedia), probably with a custom domain such as wikipedia.cached.local so that people could not be fooled into taking the proxied site for the original. Rewriting URIs for hypertext shouldn’t be an issue, but it could be harder for Ajax-heavy pages. I would probably also create a control page so that a page could be prefetched or updated. With a custom protocol and a server in Europe, one could also prefetch several pages at once and send them back together, maximizing bandwidth usage.

                      Obviously it wouldn’t be safe, but it would be visibly unsafe, and limited to those websites that can benefit from such caches without creating serious threats.
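
                      A rough sketch of the skeleton in Go’s standard library (the origin is my example, and all the interesting parts, the caching, URI rewriting and control page, are omitted):

                          package main

                          import (
                              "log"
                              "net/http"
                              "net/http/httputil"
                              "net/url"
                          )

                          func main() {
                              origin, err := url.Parse("https://en.wikipedia.org/")
                              if err != nil {
                                  log.Fatal(err)
                              }
                              // Fetch from the origin over HTTPS, but serve plain
                              // HTTP locally so responses are cacheable on the LAN.
                              // (A real version would also rewrite the Host header
                              // and the links in responses.)
                              proxy := httputil.NewSingleHostReverseProxy(origin)
                              log.Fatal(http.ListenAndServe(":8080", proxy))
                          }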

                      As for service workers, I do not think they would improve the user experience at all, since they are local to the browser and the browser has a cache anyway. The problem is sharing such a cache between different machines.

                      1. 2

                        A local reverse proxy is a clever idea, and a proxy that you explicitly set clients up to trust, a la corporate middleboxes (see Lanny’s comment), seems like it can work in some environments too. Sympathetic to the problem of existing solutions no longer working, sort of surprised the original blog post wasn’t more about how to improve things now.

                        1. 5

                          The point of the machinery I described was to make users explicitly choose between security and access time.

                          You can make everything smoother (and easier to implement) with a local CA, or by installing proper fake certificates on the clients and running a transparent proxy, but then people cannot easily opt out.
                          Worse: they might be trusting the wrong people without any benefit, as with sensitive pages that cannot be cached (shopping carts, online banking and similar…)

                          That’s why using the reverse proxy should be opt-in, not the default, and trivial to opt out of: there’s no need for a proxy if you want to edit a Wikipedia page!

                          Sympathetic to the problem of existing solutions no longer working, sort of surprised the original blog post wasn’t more about how to improve things now.

                          Eric Meyer is a legend of HTML, CSS and Web accessibility. A legend, beyond any doubt.
                          Before HTML5 I used to read his website daily. He teached me a lot.

                          But he is a client-side guy.
                          I think his reference to service workers is an attempt to improve things now.

                          1. 2

                            Nit: the past tense of teach is taught, not teached.

                            1. 1

                              I’m sorry… I can’t edit it anymore, but thanks!

                      1. 8

                        Neat to see the “Lobster” codebase get used this way. And that explains the strange and useless issues I recently got. I had thought they were from some new, half-baked commercial product or service that wanted to contribute to Lobsters as a marketing effort (this has happened a couple times and so far been a complete waste of time). This is the only contact I’ve had with the study authors (I have no idea if they contacted jcs), though I see they’re also local to Chicago.

                        Looks like this analysis is at least a few months old; when I added rubocop it caught all the opportunities to use find_by.

                        1. 3

                          :/ about the useless issues opened.

                          There was a long history of automated analysis at Google and a takeaway was, more or less, “you don’t really have a useful analyzer ’til it mostly finds fixes coders think are useful.”

                          I wonder what folks would come up with given incentives like that; currently, there’s probably more pressure to maximize your claimed number of bugs to get published.

                        1. 1

                          This analogy is a stretch, but ORMs and DBs face a problem a little like what dynamic language runtimes face: with more understanding of how the code really runs, you could make it run faster (x is almost always a float in this JS function, or we almost always/never retrieve obj.related_thing after this ORM query retrieves some objs), but that info isn’t readily available when the code is first run.

                          JITs deal with this by recording specifics of what happens at each callsite, then making an optimized path for what usually happens. You could imagine ORMs tagging query results with the callsites they came from, and figuring out things like “this bulk query should probably retrieve this related object” or “looks like this query is usually just an existence check” from how the result was actually used.
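
                          A toy Go sketch of that tagging idea (every name here is hypothetical): the “ORM” records the callsite each result came from and counts lazy loads against it, so a report could later suggest eager-loading there.

                              package main

                              import (
                                  "fmt"
                                  "runtime"
                                  "sync"
                              )

                              var (
                                  mu    sync.Mutex
                                  hints = map[string]int{} // callsite -> lazy-load count
                              )

                              type Obj struct{ from string }

                              // Query stands in for an ORM call; it remembers its callsite.
                              func Query() *Obj {
                                  _, file, line, _ := runtime.Caller(1)
                                  return &Obj{from: fmt.Sprintf("%s:%d", file, line)}
                              }

                              // RelatedThing stands in for a lazy load of a related object,
                              // counted against the callsite that produced the result.
                              func (o *Obj) RelatedThing() {
                                  mu.Lock()
                                  hints[o.from]++
                                  mu.Unlock()
                              }

                              func main() {
                                  for i := 0; i < 3; i++ {
                                      Query().RelatedThing()
                                  }
                                  fmt.Println(hints) // high counts suggest prefetching there
                              }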

                          An inherent challenge with this kind of thing is that you need to make sure that tracking isn’t so expensive it eats up any advantage it brings. We take for granted that JVMs, V8, etc. do magic with our code, but that’s with big teams of experts working on them over years. Perhaps a more achievable thing is more like profile-guided optimization, where you do test runs in a slow stat-tracking mode and some changes get suggested to the code.

                          That’s sort of a big dream and there is much lower hanging fruit; pushcx notes rubocop was able to find a chunk of things with static analysis, and stuff like “this query scans a table” or just “profiling shows this line is empirically one of our slowest” probably does a lot for big hotspots out there with a lot less trickiness.

                          1. 12

                            You don’t have to use the golden ratio; multiplying by any constant with ones in the top and bottom bits and about half those in between will mix a lot of input bits into the top output bits. One gotcha is that it only mixes less-significant bits towards more-significant ones, so the 2nd bit from the top is never affected by the top bit, 3rd bit from the top isn’t affected by top two, etc. You can do other steps to add the missing dependencies if it matters, like a rotate and another multiply for instance. (The post touches on a lot of this.)
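
                            In Go, the whole thing is a one-liner; 0x9E3779B97F4A7C15 is 2^64/φ, and the shift keeps only the well-mixed top bits:

                                package main

                                import "fmt"

                                // Multiplicative ("Fibonacci") hashing: multiply by
                                // 2^64/φ and take the top bits as the bucket index.
                                func fibHash(x uint64, bits uint) uint64 {
                                    return (x * 0x9E3779B97F4A7C15) >> (64 - bits)
                                }

                                func main() {
                                    for i := uint64(0); i < 4; i++ {
                                        fmt.Printf("%d -> bucket %d\n", i, fibHash(i, 10)) // 1024 buckets
                                    }
                                }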

                            FNV hashing, mentioned in the article, is an old multiplicative hash used in DNS, and the rolling Rabin-Karp hash is multiplicative. Today Yann Collet’s xxHash and LZ4 use multiplication in hashing. There have got to be a bajillion other uses of multiplication for non-cryptographic hashing that I can’t name, since it’s such a cheap way to mix bits.

                            It is, as the author says, kind of interesting that something like a multiplicative hash isn’t the default cheap function everyone’s taught. Integer division to calculate a modulus is maybe the most expensive arithmetic operation we commonly do when the modulus isn’t a power of two.

                            1. 1

                              Nice! About the leftward bit propagation: can you do multiplication modulo a compile time constant fast? If you compute (((x * constant1) % constant2) % (1<<32)) where constant1 is the aforementioned constant with lots of ones, and constant2 is a prime number quite close to 1<<32 then that would get information from the upper bits to propagate into the lower bits too, right? Assuming you’re okay with having just slightly fewer than 1<<32 hash outputs.

                              (Replace 1<<32 with 1<<64 above if appropriate of course.)

                              1. 1

                                You still have to do the divide for the modulus at runtime and you’ll wait 26 cycles for a 32-bit divide on Intel Skylake. You’ll only wait 3 cycles for a 32-bit multiply, and you can start one every cycle. That’s if I’m reading the tables right. Non-cryptographic hashes often do multiply-rotate-multiply to get bits influencing each other faster than a multiply and a modulus would. xxHash arranges them so your CPU can be working on more than one at once.

                                (But worrying about all bits influencing each other is just one possible tradeoff, and, e.g. the cheap functions in hashtable-based LZ compressors or Rabin-Karp string search don’t really bother.)
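
                                The multiply-rotate-multiply shape, sketched in Go with MurmurHash3’s 64-bit constants (the structure is the point, not these exact values); the rotate is what lets high bits reach low bits, which a multiply alone never does:

                                    package main

                                    import (
                                        "fmt"
                                        "math/bits"
                                    )

                                    func mix(x uint64) uint64 {
                                        x *= 0x87c37b91114253d5      // first multiply mixes upward
                                        x = bits.RotateLeft64(x, 31) // rotate carries high bits down
                                        x *= 0x4cf5ad432745937f      // second multiply mixes again
                                        return x
                                    }

                                    func main() {
                                        fmt.Printf("%#x\n", mix(1))
                                    }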

                                1. 1

                                  you’ll wait 26 cycles for a 32-bit divide on Intel Skylake

                                  And looking at that table, 35-88 cycles for a 64-bit divide. Wow, that’s so many cycles; I didn’t realize. But I should have: on a 2.4 GHz processor, 26 cycles is 10.83 ns per op, which is roughly consistent with the author’s measurement of ~9 ns per op.

                                  1. 1

                                    That’s not what I asked. I asked a specific question.

                                    can you do multiplication modulo a compile time constant fast?

                                    similarly to how you can do division by a constant fast by implementing it as multiplication by the divisor’s multiplicative inverse in the group of integers modulo 2^(word size). clang and gcc perform this optimisation out of the box already for division by a constant. What I was asking is whether there’s a similar trick for modulo by a constant. You obviously can do (divide by divisor, multiply by divisor, subtract from original number), but I’m wondering if there’s something quicker with a shorter dependency chain.

                                    1. 1

                                      OK, I get it. Although I knew about the inverse trick for avoiding DIVs for constant divisions, I didn’t know or think of extending that to modulus even in the more obvious way. Mea culpa for replying without getting it.

                                      I don’t know the concrete answer about the best way to do n*c1%(2^32-5) or such. It does intuitively seem like it should be possible to get some win from using the high bits of the multiply result, as the divide-by-multiplying tricks do.
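
                                      To sketch the one concrete approach we named: with a compile-time-constant divisor, the compiler already strength-reduces the divide to a multiply by a precomputed reciprocal, so the modulus costs multiplies and a subtract instead of a DIV. In Go:

                                          package main

                                          import "fmt"

                                          const d = 1<<32 - 5 // a prime just under 2^32

                                          func modConst(n uint64) uint64 {
                                              q := n / d // emitted as multiply+shift, not DIV
                                              return n - q*d
                                          }

                                          func main() {
                                              fmt.Println(modConst(1 << 40))
                                          }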

                                2. 1

                                  So does that mean that when the author says Dinkumware’s FNV1-based strategy is too expensive, it’s only more expensive because FNV1 is byte-by-byte while fibonacci hashing (multiplying by 2^64 / Φ) works on 8 bytes at a time?

                                  Does that mean you could beat all these implementations by finding a multiplier that produces an even distribution when used as a hash function working on 8-byte words at a time? That is, he says the fibonacci hash doesn’t produce a great distribution, whereas multipliers like the FNV1 prime are chosen to produce good, even distributions. So if you found an even-distribution-producing constant for an 8-byte-word multiplicative hash, would that then work just as well as whatever-hash-then-fibonacci-hash, but be faster because it’s one step, not two?
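
                                  For reference, FNV-1a in Go; the byte-at-a-time loop is one multiply per input byte, versus a single multiply for a whole 8-byte word, which I believe is the cost difference in question:

                                      package main

                                      import "fmt"

                                      func fnv1a64(data []byte) uint64 {
                                          h := uint64(14695981039346656037) // FNV offset basis
                                          for _, b := range data {
                                              h ^= uint64(b)
                                              h *= 1099511628211 // FNV prime: one multiply per byte
                                          }
                                          return h
                                      }

                                      func main() {
                                          fmt.Printf("%#x\n", fnv1a64([]byte("hello")))
                                      }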

                                  1. 1

                                    I think you’re right about FNV and byte- vs. word-wise multiplies.

                                    Re: 32 vs. 64, it does look like Intel’s latest big cores can crunch through 64-bit multiplies pretty quickly. Things like Murmur and xxHash don’t use them; I don’t know if that’s because perf on current chips is for some reason not as good as it looks to me or if it’s mainly for the sake of older or smaller platforms. The folks that work on this kind of thing surely know.

                                    Re: getting a good distribution, the limitations on the output quality you’ll get from a single multiply aren’t ones you can address through choice of constant. If you want better performance on the traditional statistical tests, rotates and multiplies like xxHash or MurmurHash are one approach. (Or go straight to SipHash, which prevents hash flooding.) Correct choice depends on what you’re trying to do.

                                    1. 2

                                      That makes me wonder what hash algorithm ska::unordered_map uses that was faster than FNV1 in dinkumware, but doesn’t have the desirable property of evenly mixing high bits without multiplying the output by 2^64 / φ. Skimming his code it looks like std::hash.

                                      On my MacOS system, running Apple LLVM version 9.1.0 (clang-902.0.39.2), std::hash for primitive integers is the identity function (i.e. no hash), and for strings it’s murmur2 on 32-bit systems and cityhash64 on 64-bit systems.

                                      // We use murmur2 when size_t is 32 bits, and cityhash64 when size_t
                                      // is 64 bits.  This is because cityhash64 uses 64bit x 64bit
                                      // multiplication, which can be very slow on 32-bit systems.
                                      

                                      Looking at CityHash, it also multiplies by large primes (with the first and last bits set of course).

                                      Assuming then that multiplying by his constant does nothing for string keys—plausible since his benchmarks are only for integer keys—does that mean his benchmark just proves that dinkumware using FNV1 for integer keys is better than no hash, and that multiplying an 8 byte word by a constant is faster than multiplying each integer byte by a constant?

                                  2. 1

                                    A fair point that came up over on HN is that people mean really different things by “hash” even in non-cryptographic contexts; I mostly just meant “that thing you use to pick hashtable buckets.”

                                    In a trivial sense a fixed-size multiply clearly isn’t a drop-in for hashes that take arbitrary-length inputs, though you can use multiplies as a key part of variable-length hashing like xxHash etc. And if you’re judging your hash by checking that outputs look random-ish in a large statistical test suite, not just how well it works in your hashtable, a multiply also won’t pass muster. A genre of popular non-cryptographic hashes are like popular non-cryptographic PRNGs in that way–traditionally judged by running a bunch of statistical tests.

                                    That said, these “how random-looking is your not-cryptographically-random function” games annoy me a bit in both cases. Crypto-primitive-based functions (SipHash for hashing, cipher-based PRNGs) are pretty cheap now and are immune not just to common statistical tests, but any practically relevant method for creating pathological input or detecting nonrandomness; if they weren’t, the underlying functions would be broken as crypto primitives. They’re a smart choice more often than you might think given that hashtable-flooding attacks are a thing.

                                    If you don’t need insurance against all bad inputs, and you’re tuning hard enough that SipHash is intolerable, I’d argue it’s reasonable to look at cheap simple functions that empirically work for your use case. Failing statistical tests doesn’t make your choice wrong if the cheaper hashing saves you more time than any maldistribution in your hashtable costs. You don’t see LZ packers using MurmurHash, for example.

                                  1. 5

                                    In case it’s still tagged crypto, note that these aren’t generators for cryptography; they’re for producing noise with good statistical properties faster than cryptographic generators would. That’s useful for, for example, Monte Carlo simulation.

                                    It is interesting to note that some cryptographic primitives are fast enough to be in the comparison table, though. ChaCha20/8 and AES-128 counter mode are both under one cycle/byte on new x64 hardware and really well studied. If you find nonrandomness in them of any sort that would break your Monte Carlo simulation, you can probably get a paper published about it at least.
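
                                    For instance, Go’s standard library will hand you an AES-128 counter-mode keystream in a few lines (the all-zero key here is just a fixed seed for reproducibility, not something you’d do for actual crypto):

                                        package main

                                        import (
                                            "crypto/aes"
                                            "crypto/cipher"
                                            "fmt"
                                            "log"
                                        )

                                        func main() {
                                            key := make([]byte, 16) // fixed all-zero "seed"
                                            block, err := aes.NewCipher(key)
                                            if err != nil {
                                                log.Fatal(err)
                                            }
                                            iv := make([]byte, aes.BlockSize) // counter start
                                            stream := cipher.NewCTR(block, iv)

                                            buf := make([]byte, 32)
                                            stream.XORKeyStream(buf, buf) // zeros in, keystream out
                                            fmt.Printf("%x\n", buf)
                                        }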

                                    1. 4

                                      chromium-browser is scrutinized closely enough that this would be noticed on ubuntu, right?

                                      1. 5

                                         The sandbox engine downloading and running ESET actually appears to be in Chromium: https://cs.chromium.org/chromium/src/chrome/browser/safe_browsing/chrome_cleaner/ so developers are free to review it and remove any reference to it. If my memory serves me well, Chrome Cleaner is not special and should appear in chrome://components/ alongside other optional closed-source components, although I don’t have a Windows machine to validate right now. It should be (or at least used to be) disabled for builds other than Google Chrome.

                                        1. 2

                                          Thanks. It doesn’t appear in chrome://components for me, at any rate.

                                          1. 1

                                            If I look at it on windows I can see the entry: Software Reporter Tool - Version: 27.147.200

                                            1. 1

                                              Excellent, a positive control.

                                        2. 2

                                          isra17’s reply implies there’s no scanner in Chromium, only Chrome. [I wrote this referring to his separate comment–now he has another reply here.] It probably wouldn’t make sense to have this on Linux anyway, just because there isn’t the same size of malware ecosystem there.

                                          (And I think the reporting/story would be different if the scanner were open source–we’d have an analysis based on the source code, people working on patched Chromium to remove it, and so on.)

                                          1. 1

                                            I’m curious about MacOS. I don’t run Chrome usually, but I have to in some cases, e.g. to use Google Meets for work.

                                            1. 2

                                              I don’t have an authoritative answer, but https://www.blog.google/products/chrome/cleaner-safer-web-chrome-cleanup/ only talks about Windows.

                                              1. 2

                                                I don’t see it in chrome://components on my Mac, if that is indeed where it is supposed to appear.

                                          1. 3

                                            We’re a small shop (~15 folks, ~10 eng), but old (think early 2000s, using mod_perl at the time). Not really a startup but we match the description otherwise so:

                                            It’s a Python/Django app, https://actionk.it, which some lefty groups online use to collect donations, run their in-person event campaigns and mailing lists and petition sites, etc. We build AMIs using Ansible/Packer; they pull our latest code from git on startup and pip install deps from an internal pip repo. We have internal servers for tests, collecting errors, monitoring, etc.

                                            We have no staff focused on ops/tools. Many folks pitch in some, but we’d like to have a bit more capacity for that kind of internal-facing work. (Related: hiring! Jobs at wawd dot com. We work for neat organizations and we’re all remote!)

                                            We’ve got home-rolled scripts to manage restarting our frontend cluster by having the ASG start new webs and tear the old down. We’ve scripted hotfixes and semi-automated releases–semi-automated meaning someone like me still starts each major step of the release and watches that nothing fishy seems to be happening. We do still touch the AWS console sometimes.

                                            Curious what prompts the question; sounds like market research for potential product or something. FWIW, many of the things that would change our day-to-day with AWS don’t necessarily qualify as Solving Hard Problems at our scale (or 5x our scale); a lot of it is just little pain points and time-sucks it would be great to smooth out.

                                            1. 6

                                              FYI, I get a “Your connection is not private” when going to https://actionk.it. Error is NET::ERR_CERT_COMMON_NAME_INVALID, I got this on Chrome 66 and 65.

                                              1. 2

                                                Same here on Safari.

                                                1. 1

                                                    Sorry, https://actionkit.com has a more boring domain but works :) I should have checked before I posted, and we should get the marketing site a cert covering both domains.

                                                2. 1

                                                  Firefox here as well.

                                                  1. 1

                                                    Sorry, I should have posted https://actionkit.com, reason noted by the other comments here.

                                                  2. 1

                                                    https://actionk.it

                                                      This happens because the served certificate is for https://actionkit.com/

                                                    1. 1

                                                      D’oh, thanks. Go to https://actionkit.com instead – I just blindly changed the http://actionk.it URL to https://, but our cert only covers the boring .com domain not the vanity .it. We ought to get a cert that covers both. (Our production sites for clients have an automated Let’s Encrypt setup without this problem, for the record :) )

                                                  1. 9

                                                    The main topic discussed here is now known as the Han Unification, for those curious to catch up with what’s happened since this was written.

                                                    1. 2

                                                      A consequence that hadn’t soaked in for me is that you need to know the language of a piece of text to reliably display it correctly. I’m guessing Twitter and Facebook and such are assigning languages to comments, etc. based on content, the poster’s language preferences/location/Accept-Language, and who knows what else.

                                                      (And then if a user switches into not-their-native-language for a post, or mixes languages by quoting text in another language or whatever, there’s a whole other level of difficulty.)

                                                    1. 7

                                                      Compared to, say, the ARM whitepaper, Intel’s still reads to me as remarkably defensive, especially the section on “Intel Security Features and Technologies.” Like, we know Intel has no-execute pages, as other vendors do, and we know they aren’t enough to solve this problem. And reducing the space where attackers can find speculative gadgets isn’t solving the root problem.

                                                      Paper does raise the interesting question of exactly how expensive the bounds-check bypass mitigation will be for JS interpreters, etc. To foil the original attack, you don’t have to stop every out-of-bounds load, you just have to keep the results from getting leaked back from speculation-land to the “real world.” So you only need a fence between a potentially out-of-bounds load and a potentially leaky operation (like loading an array index that depends on the loaded value). You might even be able to reorder some instructions to amortize one fence across several loads from arrays. And I’m sure every JIT team has looked at whether they can improve their bounds check elimination. There’s no lack of clever folks working on JITs now, so I’m sure they’ll do everything that can be done.

                                                      The other, scarier thing is bounds checks aren’t the only security-relevant checks that can get speculated past, just an easy one to exploit. What next–speculative type confusion? And what if other side-channels besides the cache get exploited? Lot of work ahead.

                                                      1. 5

                                                        FWIW: a possible hint at Intel’s future directions (or maybe just a temp mitigation?) are in the IBRS patchset to Linux at https://lkml.org/lkml/2018/1/4/615: one mode helps keep userspace from messing with kernel indirect call speculation and another helps erase any history the kernel left in the BTB. I bet both of these are blunt hammers on current CPUs (‘cause a microcode update can only do so much–turn a feature off or overwrite the BTB or whatever), but they’re defining an interface they want to make work more cheaply on future CPUs. It also seems to be enabled in Windows under the name “speculation control” (https://twitter.com/aionescu/status/948753795105697793)

                                                         ARM says in their whitepaper that most ARM implementations have some way to turn off branch prediction or invalidate branch predictor state in kernel/exception handler code, which sounds about in line with what Intel’s talking about. The whitepaper also talks about barriers to stop out-of-bounds reads. The language is a bit vague, but I think they’re saying an existing conditional move/select works on current chips, and a new instruction, CSDB, provides a barrier for future chips with just the minimum you need to avoid the cache side channel attack.

                                                        1. 25

                                                           Zeynep Tufekci said, “[T]oo many worry about what AI—as if some independent entity—will do to us. Too few people worry what power will do with AI.” (The thread starts with a result that, in some circumstances, face recognition can effectively identify protesters despite their efforts to cover distinguishing features with scarves, etc.)

                                                           And, without even needing tech that resembles AI, what they can do can look like superhuman ability if you look at it right. A big corporation doesn’t have boundless intelligence, but it can hire a lot of smart lawyers, lobbyists, and PR and ad people, folks who are among the best in their fields, to try and shift policy and public opinion in ways that favor the sale of lots of guns, high prices for pharmaceuticals, use of fossil fuels, or whatever. They seem especially successful shifting policy in the U.S. recently, with the help of recent court decisions that free some people up to spend money to influence elections (and the decisions further back that established corps as legal people :/).

                                                           With recent/existing tech, companies have shown they can do new things. It’s cheap to test lots of possibilities to see what gets users to do what you want, to model someone’s individual behavior to keep them engaged. (Here’s Zeynep again in a talk on those themes that I haven’t watched.) The tech giants managed to shift a very large chunk of ad spending from news outlets and other publishers by being great at holding user attention (Facebook) or being smarter about matching people to ads than anyone else (Google), or shift other spending by gatekeeping what apps you can put on a device and taking a cut of what users spend (Apple with iOS), or reshape market after market with digitally-enabled logistics, capital, and smart strategy (Amazon). You can certainly look forward N years to when they have even more data and tools and can do more. But you don’t really even have to project out to see a pretty remarkable show of force.

                                                          This is not, mostly, about the details of my politics, nor is it to suggest silly things like that we should roll back the clock on tech; we obviously can’t. But, like, if you want to think about entities with incredible power that continue to amass more and how to respond to them, you don’t have to imagine; we have them right here and now!

                                                          1. 3

                                                            Too few people worry what power will do with AI.

                                                            More specifically, the increasingly police statey government.

                                                          1. 42

                                                            Reminds me of a quote:

                                                            Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it.

                                                            • Brian W. Kernighan
                                                            1. 13

                                                              :) Came here to post that.

                                                              The blog is good but I’m not convinced by his argument. It seems too worried about what other people think. I agree that we have to be considerate in how we code but forgoing, say, closures because people aren’t familiar with them or because we’re concerned about how we look will just hold back the industry. Higher level constructs that allow us to simplify and clarify our expression are a win in most cases. People can get used to them. It’s learning.

                                                              1. 8

                                                                I think he may not disagree with you as much as it sounds like. I don’t think that sentence says “don’t use closures,” just that they’re not for impressing colleagues. (It was: “You might impress your peers with your fancy use of closures… but this no longer works so well on people who have known for a decades what closure are.”)

                                                                Like, at work we need closures routinely for callbacks–event handlers, functions passed to map/filter/sort, etc. But they’re just the concise/idiomatic/etc. way to get the job done; no one would look at the code and say “wow, clever use of a closure.” If someone does, it might even signal we should refactor it!

                                                                1. 5

                                                                  It seems too worried about what other people think.

                                                                  I agree with your considerations on learning.
                                                                  Still, just like we all agree that good code must be readable, we should agree that it should be simple too. If nothing else, for security reasons.

                                                                2. 2

                                                                  On the other hand, sometimes concepts at the limit of my understanding (like advanced formal verification techniques) allow me to write better code than if I had stayed well within my mental comfort zone.

                                                                1. 3

                                                                  Here are some things:

                                                                  These are, again, perf but not profiling, but I really want to write (and have started on) a super beginner-focused post about memory use. I wish someone would write a balanced intro to concurrency–from StackOverflow, a couple common mistakes are dividing work into very tiny tasks (with all the overhead that causes) when NumCPU workers and large chunks would do better, and using channels when you really do just want a lock or whatever kind of shared datastructure.
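
                                                                    For the record, a minimal sketch of the shape I mean, NumCPU workers over large chunks rather than a goroutine per tiny task:

                                                                        package main

                                                                        import (
                                                                            "runtime"
                                                                            "sync"
                                                                        )

                                                                        func process(chunk []int) { /* real work here */ }

                                                                        func main() {
                                                                            data := make([]int, 1000000)
                                                                            workers := runtime.NumCPU()
                                                                            size := (len(data) + workers - 1) / workers // large chunks, few goroutines

                                                                            var wg sync.WaitGroup
                                                                            for i := 0; i < len(data); i += size {
                                                                                end := i + size
                                                                                if end > len(data) {
                                                                                    end = len(data)
                                                                                }
                                                                                wg.Add(1)
                                                                                go func(part []int) {
                                                                                    defer wg.Done()
                                                                                    process(part)
                                                                                }(data[i:end])
                                                                            }
                                                                            wg.Wait()
                                                                        }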

                                                                  1. 11

                                                                    When you mentioned channel costs, I wondered if there was communication via unbuffered channels, which can lead to traffic jams since the sender can’t proceed ‘til each recipient is ready. Looking at the old chat_handler.go that doesn’t seem to be the case, though. The three goroutines per connection thing isn’t without precedent either; I think at least the prototypes of HTTP/2 support for the stdlib were written that way.

                                                                    It looks like maybe the socketReaderLoop could be tied in with ChatHandler.Loop(): where socketReaderLoop communicates with Loop on a channel, just inline the code that Loop currently runs in response, then call socketReaderLoop at the end of Loop instead of starting it asynchronously. You lose the 32-message buffer, but the end-user-facing behavior ought to be tolerable. (If a user fills your TCP buffers, seems like their problem and they can resend their packets.) However, saving one goroutine per connection probably isn’t a make-or-break change.

                                                                    Since you talk about memory/thrashing at the end, one of the more promising possibilities would be(/have been) to do a memprofile to see where those allocs come from. A related thing is Go is bad about respecting a strict memory limit and its defaults lean towards using RAM to save GC CPU: the steady state with GOGC=100 is around 50% of the peak heap size being live data. So you could start thrashing with 512MB RAM once you pass 256MB live data. (And really you want to keep your heap goal well under 512MB to leave room for kernel stuff, other processes, the off-heap mmapped data from BoltDB, and heap fragmentation.) If you’re thrashing, GC’ing more often might be a net win, e.g. GOGC=50 to be twice as eager as the default. Finally, and not unrelated, Go’s collector isn’t generational, so most other collectors should outdo it on throughput tests.
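
                                                                      Concretely, that eagerness knob is just the environment variable (GOGC=50 ./server) or, equivalently, a sketch like this at startup:

                                                                          package main

                                                                          import "runtime/debug"

                                                                          func init() {
                                                                              // Same as GOGC=50: the heap goal becomes live data * 1.5
                                                                              // instead of * 2, trading extra GC CPU for a smaller peak heap.
                                                                              debug.SetGCPercent(50)
                                                                          }

                                                                          func main() { /* ... start the server ... */ }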

                                                                    Maybe I’m showing myself not to be a true perf fanatic, but 1.5K connections on a Pi also doesn’t sound awful to me, even if you can do better. :) It’s a Pi!

                                                                    1. 2

                                                                        Thank you for such a detailed analysis, and for looking into the code before you commented :) Positive and constructive feedback really helps. I have received a great amount of feedback and will definitely try your tips. BoltDB definitely comes up every time, and I think it contributes to memory usage as well. Some other suggestions included using a fixed n workers and n channels, the backlog building up, and me not doing serialization correctly. I will definitely update my benchmark code and test it with the new fixes, and if I feel the code is clean enough I would definitely love to move back.

                                                                      1. 3

                                                                        Though publicity like this is fickle, you might get a second hit after trying a few things and then explicitly being like “hey, here’s my load test, here are the improvements I’ve done already; can you guys help me go further?” If you don’t get the orange-website firehose, you at least might hear something if you post to golang-nuts after the Thanksgiving holiday ends or such.

                                                                        Looking around more, I think groupInfo.GetUsers is allocating a string for each name each time it’s called, and then when you use the string to get the object out there’s a conversion back to []byte (if escape analysis doesn’t catch it), so that’s a couple allocs per user per message. Just being O(users*messages) suggests it could be a hotspot. You could ‘downgrade’ from the Ctrie to a RWLocked map (joins/leaves may wait often, but reads should be fastish), sync.Map, or (shouldn’t be needed but if you were pushing scalability) sharded RWLocked map. But before you put in time trying stuff like that, memprofile is the principled/right way to approach alloc stuff (and profile for CPU stuff)–figure out what’s actually dragging you down.
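
                                                                          A minimal sketch of the RWLocked map I mean (User standing in for your own type):

                                                                              package main

                                                                              import "sync"

                                                                              type User struct{ Name string }

                                                                              type userMap struct {
                                                                                  mu sync.RWMutex
                                                                                  m  map[string]*User
                                                                              }

                                                                              func newUserMap() *userMap { return &userMap{m: map[string]*User{}} }

                                                                              // Get takes only a read lock, so many readers proceed in parallel.
                                                                              func (um *userMap) Get(name string) *User {
                                                                                  um.mu.RLock()
                                                                                  defer um.mu.RUnlock()
                                                                                  return um.m[name]
                                                                              }

                                                                              // Add takes the exclusive lock; joins/leaves may briefly wait.
                                                                              func (um *userMap) Add(u *User) {
                                                                                  um.mu.Lock()
                                                                                  defer um.mu.Unlock()
                                                                                  um.m[u.Name] = u
                                                                              }

                                                                              func main() {
                                                                                  um := newUserMap()
                                                                                  um.Add(&User{Name: "alice"})
                                                                                  _ = um.Get("alice")
                                                                              }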

                                                                          True that there are likely lighter ways to do the message log than Bolt. Just files of newline-separated JSON messages may get you far, though I don’t know what other functionality you support.

                                                                        FWIW, I also agree with the commenter on HN saying that Node/TypeScript is a sane approach. (I’m curious about someday using TS in work’s frontend stuff.) Not telling you what to use; trying to get Go to do things is just a hobby of mine, haha. :)

                                                                    1. 7

                                                                      I’ve been using a Samsung ARM Chromebook (1st generation) as my daily driver for the past 4 years. It’s a lowend, underpowered machine with nothing to write home about, but it can support a full Arch Linux ARM installation, run a web browser just fine, and have an adequate number of terminals. I love it. The battery life hasn’t changed at all since I bought it, it’s still consistently getting >7 hours. I have other friends with ARM laptops from other manufacturers, the battery life story is one I hear consistently.

                                                                      1. [Comment removed by author]

                                                                        1. 6

                                                                          dz, I wrote up a blog post about this: http://blog.jamesluck.com/installing-real-linux-on-arm-chromebook I completely replaced ChromeOS with Archlinux ARM on the internal SSD. The gist of the process is that you make a live USB, boot to that, and then follow the same procedure for making the bootable USB onto the SSD. You just have to untar the root filesystem, edit /etc/fstab, and correct the networking config!

                                                                          1. 1

                                                                            If it’s anything like my Samsung ARM Chromebook, you can boot a different os off external storage (i.e. an SD card), or possibly replace Chrome OS on the internal solid-state storage.

                                                                            1. 1

                                                                              You can replace ChromeOS. Here’s the Arch wiki on the 1st gen Samsung ARM Chromebook and the Samsung Chromebook Plus.

                                                                          1. 6

                                                                            Not sure how responsive this is to the specific q, but I have an ARM Samsung Chromebook Plus. Initially I used developer mode and crouton, which provides an Ubuntu userspace in a chroot hitting Chrome OS’s Linux kernel, but more recently I use a setup like Kenn White describes, where you’re not in dev mode and you use Termux, an Android app that provides a terminal and apt-based distro inside the app’s sandbox. (Termux has a surprising number of apt packages covered and you can build and run executables in there, etc. You can also just use it on any Android phone or tablet.) In both cases I just use terminal stuff, though Crouton does support running an X session. You can apparently boot Linux on the Chromebook Plus full-stop; I don’t know about compatibility, etc. Amazon’s offering it for $350 now.

                                                                             I’ve long been more indifferent to CPU speed than a lot of folks (used netbooks, etc.), but for what it’s worth I find the Chromebook Plus handles my needs fine. The chip is the RK3399, with two out-of-order and four in-order ARMv8 cores and an ARM Mali GPU. I even futz with Go on it, and did previously on an even slower ARM Chromebook (Acer Chromebook 13). The most noticeable downsides seem to be memory-related: it has 4GB RAM but…Chrome. If I have too much open the whole thing stalls a bit while it tries (I think) compressing RAM and then stopping processes. Also, there’s a low level of general glitchiness: sometimes I’ll open it up and it’ll boot fresh when it should be resuming from sleep. I had to do some tweaking to do what I wanted inside Chrome OS’s bounds, notably to connect to work’s VPN (had to make a special JSON file and the OS doesn’t readily give useful diagnostics if you get anything wrong), but it’s just a matter of round tuits.

                                                                            From the ARM vs. Intel angle, Samsung actually made a Chromebook Pro with the low-power but large-core Core m3. Reviewers seemed to agree the Pro was faster, and I think I saw it was like 2x on JavaScript benchmarks, but Engadget wasn’t fond of its battery life and The Verge complained its Android app support was limited.

                                                                            1. 3

                                                                              Since we’re talking about the Plus, a few notes on my experience. I have an HP chromebook with an m3 as well, for comparison. None of the distinguishing features of the Plus were positives. Chrome touch support is just meh, and the keyboard was annoyingly gimped by very small backspace, etc. Touchpad was noticeably worse. Android support was basically useless for me.

                                                                              More on topic, the ARM cpu was capable enough, but the system wasn’t practically lighter or longer lasting than the m3 system. It was cheaper, perhaps 50% so, but that’s still only a difference of a few hundred dollars. (I mean, that’s not nothing, but if this is a machine you’ll be using for hours per day over several years… Price isn’t everything.)

                                                                              (I would not pick the Samsung Pro, either. Not a great form factor imo.)

                                                                              More generally, if you don’t have ideological reasons to pick arm, the intel m3 is a pretty remarkable cpu.

                                                                            1. 5

                                                                              The technical description doesn’t fully convey some of the, uh, magic that’s been done with it. One of the coolest things is Arcadia, which gives you access to game-engine fun with a Clojure REPL. The MAGIC stuff is apparently key to cutting allocations so your games aren’t all janked up by collection pauses.

                                                                               The author, Ramsey Nasser, also posts about a bunch of fun stuff on Twitter–other recent stuff he’s messed with includes a spline drawing tool and (related, I think) Arabic typography.