1. 8

    The post smells like one big Nomad advertisement.

    The line

    Nomad’s batch scheduler is optimized to rank instances rapidly using the power of two choices described in Berkeley’s Sparrow scheduler.

    “scheduler is optimized” is an overstatement. The paper they link to says, and I quote:

    …a decentralized, randomized sampling approach provides near-optimal performance while avoiding the throughput and availability limitations of a centralized design.

    Basically it’s a poor man’s load balancer. The Sparrow scheduler chunks each node into slots and schedules workloads into them. As a result, randomized scheduling “provides near-optimal performance” only in one situation: when the scheduled workloads are homogeneous (like batch jobs). Good luck reaching a high saturation of nodes with this kind of scheduler if you have to deal with any of the following factors:

    • the workloads are anything more complex than batch workloads,
    • incremental scheduling is needed,
    • the nodes are heterogeneous (differ significantly from one another),
    • containers are used by your team in a composable manner (employing pods) to achieve high performance through data locality.

    It’s near impossible to allocate resources efficiently (with a randomized scheduler) for non-homogeneous workloads without high levels of churn.
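For reference, the “power of two choices” idea discussed above can be sketched in a few lines of Go. This is a toy illustration under stated assumptions, not Sparrow’s or Nomad’s implementation: the `node` type, `pickTwoChoices`, `simulate`, and `longestQueue` are hypothetical names, and every task and node is assumed to be identical.

```go
package main

import (
	"fmt"
	"math/rand"
)

// node is a hypothetical worker; queued counts the tasks assigned to it.
type node struct {
	name   string
	queued int
}

// pickTwoChoices probes two nodes at random (with replacement) and returns
// the index of the one with the shorter queue; ties go to the first probe.
func pickTwoChoices(nodes []node, rng *rand.Rand) int {
	a, b := rng.Intn(len(nodes)), rng.Intn(len(nodes))
	if nodes[b].queued < nodes[a].queued {
		return b
	}
	return a
}

// simulate places nTasks identical tasks on nNodes identical nodes.
func simulate(nNodes, nTasks int, seed int64) []node {
	rng := rand.New(rand.NewSource(seed))
	nodes := make([]node, nNodes)
	for i := range nodes {
		nodes[i].name = fmt.Sprintf("node-%d", i)
	}
	for t := 0; t < nTasks; t++ {
		nodes[pickTwoChoices(nodes, rng)].queued++
	}
	return nodes
}

// longestQueue reports the queue length of the most loaded node.
func longestQueue(nodes []node) int {
	longest := 0
	for _, n := range nodes {
		if n.queued > longest {
			longest = n.queued
		}
	}
	return longest
}

func main() {
	nodes := simulate(10, 1000, 1)
	fmt.Println("longest queue:", longestQueue(nodes))
}
```

Comparing queue lengths only makes sense here because all tasks and nodes are interchangeable; once nodes differ in capacity or tasks differ in size, a shorter queue no longer implies a better placement, which is the objection raised above.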

    But from what I remember about Nomad, their scheduler isn’t the “randomized” one, but one that uses a system of two queues to schedule workloads. Nomad’s scheduling page confirms it. So the article’s point is moot.

    But let’s leave the advertising tone of the article aside and stick to the engineering side of things. It’s unclear to me why the option of a custom k8s scheduler has not been considered at all. The engineering effort needed to implement a custom scheduler is (roughly) an order of magnitude less than

    • migration of their rendering stack from managed GKE to a new scheduler
    • subsequent support of Nomad with dedicated on-call personnel

    References:

    1. Sparrow: Distributed, Low Latency Scheduling
    2. Configure Multiple Schedulers | Kubernetes
    1.  

      It’s unclear to me why the option of a custom k8s scheduler has not been considered at all.

      They did mention that they at least thought about developing their own, although they don’t go into more detail on whether or not this was completed/implemented and details about why this ended up not working (assuming it didn’t work).

      All these issues eventually convinced us to develop our own in-house autoscaler.

      1.  

        All these issues eventually convinced us to develop our own in-house autoscaler.

        I don’t think so. IIUIC, with the phrase quoted above they refer to Nomad, cause:

        • they say it at the end of “Reasons Behind the Switch” section concluding their decision and
        • “autoscaler” isn’t a custom scheduler… unless it’s the author’s (sloppy) way to name it so
    1. 1

      IIUIC, it looks like the CoC update was issued after @peterbourgon’s permanent ban from Go communities, because of this complaint – Go CoC Report. IMHO it’s an overreaction… or is it just me having a thicker skin?

      1. 3

        I think the key element is (from the linked CoC):

        Persistent borderline behavior. Infractions may seem insignificant in isolation, but repeated over time they create a pattern of behavior that doesn’t match our Gopher Values and that adds up to substantial harm.

        We’ve had some people like this on Lobsters too: viewing most messages in isolation doesn’t give a clear “wow, this is terrible, please do something about this guy immediately!”-response, and is more of an “okay, not too great, but not too bad either 🤷”. But if you have this every day then it adds up, especially for regulars who at some point just get tired of it and leave. Most of these people are now banned on Lobsters by the way; the most recent one was @soc, but there have been a few others.

        These cases are really hard to deal with as a moderator; when exactly is it too much? And explaining it to other people is doubly hard because you can’t point to a single message or action; it’s more of a general atmosphere of unpleasantness spread out over many small incidents, many of which pass more or less unnoticed. I personally wouldn’t have made a CoC report over this incident in particular (I don’t really care much for “reporting someone” in general), but I did have a rather similar incident some years ago and I just stopped visiting the #general room in Gophers slack because of it 🤷

        In short, it’s not only about this particular incident. Plus, there are several ways to deal with such a conflict, even when you think it’s a load of nonsense. To say it wasn’t exactly handled with grace by Peter would be a bit of an understatement. I suspect that this part in particular was really the drop that overflowed the bucket.

        1. 3

          The ‘permanent’ part of that came from him evading a temporary ban and escalating on Twitter, so IMO it’s not really an overreaction. That being said, I’m very disappointed in both how this was handled (zero transparency from the CoC committee) and the general trend this sets up: the willingness and precedent of this committee to enforce post-hoc rules. It really does seem the CoC was updated directly as a result of this situation, which itself isn’t a problem (actually a good thing for this type of document to evolve), but it really seems that his initial ban came from a rule that was not published or part of the CoC at the time.

        1. 3

          can anyone think of a quick and browser-local way to capture this information for oneself? pretty sure my habits align almost exactly and want to try to reproduce.

          also maybe identify some classes of queries that i can redirect to e.g. Dash and interact less with Google.

          1. 4

            Your browser history is captured in a local SQLite database. Shut down the browser (to let it release the lock on the .db file) and use sqlite3 to query the database. Here’s a (good enough) gist to start from – Playing around with Chrome’s history.

            1. 2

              See if you can get your browser history as a list of URLs. You could filter them down to known search pages and extract the queries from the query strings.
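A minimal sketch of that filtering step in Go, assuming the history has already been exported as a plain list of URLs. The `searchParams` table and `extractQueries` are hypothetical names, and the host list is only an example to extend.

```go
package main

import (
	"fmt"
	"net/url"
)

// searchParams maps a search host to the query-string parameter that
// carries the search terms. The list is illustrative, not exhaustive.
var searchParams = map[string]string{
	"www.google.com": "q",
	"duckduckgo.com": "q",
}

// extractQueries keeps only URLs from known search hosts and returns the
// decoded search terms found in their query strings.
func extractQueries(urls []string) []string {
	var queries []string
	for _, raw := range urls {
		u, err := url.Parse(raw)
		if err != nil {
			continue // skip malformed history entries
		}
		param, ok := searchParams[u.Hostname()]
		if !ok {
			continue // not a known search page
		}
		if q := u.Query().Get(param); q != "" {
			queries = append(queries, q)
		}
	}
	return queries
}

func main() {
	history := []string{
		"https://www.google.com/search?q=golang+range+string&oq=golang",
		"https://lobste.rs/",
		"https://duckduckgo.com/?q=sqlite+locked+database",
	}
	for _, q := range extractQueries(history) {
		fmt.Println(q)
	}
}
```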

            1. 1

              The thockin/go-build-template: A Makefile/Dockerfile example for Go projects uses the approach outlined in the article to derive the version from git repo and set it in the version package.

              I’ve been using the template for my personal projects and the ones at $WORK. Highly recommend it for anyone starting their project from scratch. The repo is well structured and provides a great (build) baseline.
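The general technique of stamping a git-derived version at link time can be sketched as follows. This is an illustration rather than a copy of the template: the variable name `version`, the `myapp` prefix, and the exact `git describe` flags are assumptions.

```go
package main

import "fmt"

// version is meant to be overwritten at link time. It must be a string
// variable (not a constant) for the linker's -X flag to set it.
// Assumed build command:
//
//	go build -ldflags "-X main.version=$(git describe --tags --always --dirty)"
//
// A plain `go build` leaves the fallback value in place.
var version = "unknown"

// versionString formats the stamped version for a --version flag or a log line.
func versionString() string {
	return "myapp " + version
}

func main() {
	fmt.Println(versionString())
}
```

When the variable lives in a separate package, the `-X` argument takes that package’s full import path instead of `main`.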

              1. 2

                I know this is a Clojure post, but out of curiosity I wrote it in Go (what I’ve been working in lately) just to see if I could solve it quickly. Took about 10 minutes, subtract 3-4 for fighting with runes:

                package main
                
                import "fmt"
                
                type result struct {
                	letter string
                	count  int
                }
                
                func main() {
                	const input = "aaaabbbcca"
                
                	var ret []result
                	currentLetter := string(input[0])
                	countCurrentLetter := 1
                
                	for _, elem := range input[1:] {
                		elemAsString := string(elem)
                		if currentLetter == elemAsString {
                			countCurrentLetter++
                		} else {
                			ret = append(ret, result{currentLetter, countCurrentLetter})
                			currentLetter = elemAsString
                			countCurrentLetter = 1
                		}
                	}
                
                	ret = append(ret, result{currentLetter, countCurrentLetter})
                
                	fmt.Printf("%+v", ret)
                }
                

                It’s not particularly elegant, but it works.

                1. 2

                  It’s not particularly elegant, but it works.

                  That’s my problem with many other (non-Lispy) languages. The programs are not elegant, even though they do work. What works for a computer doesn’t always work for me.

                  1. 1

                    Okay, I am 5 months late, but this code is terrible and I must object to it because there’s no good Go code in this thread. You are mixing up two problems, lexing a token, and figuring out the next token. Apart from that, the code is very nonIdiomaticWithTheseVariableNames, but more importantly blows up on non-ASCII strings.

                    Here’s two solutions, one imperative: https://play.golang.org/p/-zdWZAnmBip, and one recursive: https://play.golang.org/p/TBudEZBphv7.

                    The proposed solutions:

                    1. actually work on unicode input
                    2. there’s no else in sight.
                    3. all ifs are early returns.
                    4. all loops are exhaustive, no weird [1:] boundary conditions.
                    5. I don’t have to keep around accumulators for the results.
                    6. no useless types.
                    7. much easier to read because the code just tells you what it does, why it does it is obvious.
                    1. 2

                      Cool, I guess?

                      I didn’t say I solved it well. I hacked something together.

                    2. 1

                      Here’s my take on it. It took me (roughly) the same 8-10 minutes to type it in the web-ui. In Emacs I could shave some time off of it.

                      type tuple struct {
                      	s string
                      	i int
                      }
                      
                      func splitStringReturnTuples(str string) []tuple {
                      	str = " " + str
                      	res := []tuple{}
                      	for i := 1; i < len(str); i++ {
                      		if str[i] != str[i-1] {
                      			res = append(res, tuple{string(str[i]), 1})
                      		} else {
                      			res[len(res)-1].i++
                      		}
                      	}
                      	return res
                      }
                      

                      Runnable code at the go playground

                      1. 2

                        This loops over the bytes in the string instead of the runes in the string. Try inserting a multi-byte rune such as 本 in the string, and see what happens.

                        1. 2

                          The problem statement clearly stated the data set; there were no multi-byte symbols. But the tweet made it clear that the interviewer expects the solution to be given within a limited time frame. Therefore the solution was provided in terse notation with abbreviated variables, taking the provided test input and returning the expected output. No more, no less.

                          But point is taken. Here’s the code correctly handling multi-byte encodings. The logic is the same, except that the passed string is first converted into a slice of runes.
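A rune-based variant along those lines might look like the sketch below. It is a reconstruction for illustration, not the linked code; `splitRunes` is a hypothetical name, and the `tuple` type is borrowed from the snippet earlier in the thread.

```go
package main

import "fmt"

// tuple pairs a character with the length of its consecutive run.
type tuple struct {
	s string
	i int
}

// splitRunes groups consecutive equal characters. Converting to []rune
// first makes the comparison work per Unicode code point, not per byte.
func splitRunes(str string) []tuple {
	var res []tuple
	runes := []rune(str)
	for i, r := range runes {
		if i > 0 && r == runes[i-1] {
			res[len(res)-1].i++ // same character: extend the current run
			continue
		}
		res = append(res, tuple{string(r), 1}) // new character: start a run
	}
	return res
}

func main() {
	fmt.Printf("%+v\n", splitRunes("aaaabbbcca"))
	fmt.Printf("%+v\n", splitRunes("本本a"))
}
```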

                          When I interview people I don’t expect them to write code that perfectly handles every possible input in the limited time. What I’m interested in first is whether they can come up with a straightforward solution that leverages data structures and algorithms to solve the problem with optimal complexity. Second, whether they can clearly communicate their approach. And coding comes third.

                          1. 2

                            That makes sense. I did not mean to criticize your solution in particular, just highlight that this is a common “gotcha” in Go. Casting strings to []rune or looping with for _, r := range str is, as far as I know, the only built-in way to access the letters in strings correctly. I’ve seen many problems arise from assuming that str[x] returns a rune instead of a byte. I think it would be more useful and intuitive if []byte(str)[x] was needed to return a byte, while just str[x] could be used to return a rune.
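The gotcha is easy to see in a small sketch; the helper names `byteAt`, `runeAt`, and `runeOffsets` are made up for this example.

```go
package main

import "fmt"

// byteAt returns the i-th byte, which is what plain indexing gives you.
func byteAt(s string, i int) byte { return s[i] }

// runeAt returns the i-th character by converting to []rune first.
func runeAt(s string, i int) rune { return []rune(s)[i] }

// runeOffsets returns the byte offset at which each character starts,
// as reported by ranging over the string.
func runeOffsets(s string) []int {
	var offsets []int
	for i := range s {
		offsets = append(offsets, i)
	}
	return offsets
}

func main() {
	s := "本a"
	fmt.Println(len(s))                // length in bytes, not characters
	fmt.Printf("%#x\n", byteAt(s, 0))  // first byte of the UTF-8 encoding of 本
	fmt.Printf("%c\n", runeAt(s, 0))   // the character 本 itself
	fmt.Println(runeOffsets(s))        // byte offsets where characters begin
}
```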

                    1. 2

                      The presentation Containers From Scratch (Youtube video) given by Liz Rice at GOTO 2018 is the best I’ve seen, clearly explaining how containers work under the hood.

                        1. 2

                          Let’s start from the top and try to outline an actionable plan for measuring the success of our “new software architecture”.

                          The success of an architecture is measured by the extent to which your customers (whether they are direct users or API consumers) are satisfied with the (re-)architected service. At this point it is crucial to take off your engineer’s hat, step into your customer’s shoes, and clearly see the product from their point of view. With that in mind “we use intuition, experience, and an understanding of what users want to define service level indicators (SLIs), objectives (SLOs), and agreements (SLAs). These measurements describe basic properties of metrics that matter, what values we want those metrics to have, and how we’ll react if we can’t provide the expected service.” [1]

                          The same applies to the development and operations processes. Define SLIs and SLOs for them as well. For example, from the development perspective, define a mean time for a feature to be implemented and delivered to the production environment. For ops, it can be a mean number of pages per unit of time, human/HW resource consumption, and so on; you get the idea.

                          Once we have the SLOs and SLIs defined, we instrument the service(s) with the needed metrics and aggregate them using the rules of the monitoring system in use.

                          If a single measure/gauge of “success” is needed, build a mathematical expression aggregating all major metrics, weighted by coefficients according to their importance, set a tight alert threshold on it, and put it at the top of the monitoring dashboard.
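One possible shape for such an expression is a weighted average, sketched below. The `metric` type, the metric names, the weights, and the 0.9 threshold are all made up for illustration; each value is assumed to be normalized to [0, 1], where 1 means the objective is fully met.

```go
package main

import "fmt"

// metric is a hypothetical normalized indicator with an importance weight.
type metric struct {
	name   string
	value  float64 // normalized to [0, 1]; 1 means the objective is fully met
	weight float64 // relative importance, chosen by the team
}

// successGauge returns the weighted average of the metric values,
// or 0 when there is nothing to aggregate.
func successGauge(ms []metric) float64 {
	var sum, total float64
	for _, m := range ms {
		sum += m.value * m.weight
		total += m.weight
	}
	if total == 0 {
		return 0
	}
	return sum / total
}

func main() {
	ms := []metric{
		{"availability", 0.999, 5},
		{"latency_within_slo", 0.95, 3},
		{"error_budget_remaining", 0.80, 2},
	}
	const threshold = 0.9 // hypothetical alert threshold
	g := successGauge(ms)
	fmt.Printf("gauge=%.3f alert=%v\n", g, g < threshold)
}
```

Keeping the weights in data rather than in code makes it straightforward to maintain a different set of coefficients per team.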

                          From my experience, it’s hard to put correct coefficients on metrics. The importance the development team assigns to a metric differs from what PMs value, which in turn differs from what SRE/DevOps teams see and value. To resolve this conflict of interests, it’s a good idea to define separate gauges of “success” for each involved party.

                          Once we have the monitoring harness in place, we can start modifying our “s/w arch-re” and measure its “success”. For example, let’s say we detected a bottleneck in some sub-service providing an RPC/API and changed its architecture. The monitoring system will capture the change and you will be able to measure the “success” of the change after some period of time. It’s good to do service canarying in cases like this as well.

                          Although the approach outlined above is a bit blurry and short on specifics, it should give a good understanding of how to assess the “success” of a s/w change and apply it to your architecture.

                          [1] https://landing.google.com/sre/sre-book/chapters/service-level-objectives/

                          1. 16

                            FYI, “дроворуб” is a Ukrainian word, not Russian. The Russian word for lumberjack is дровосек.

                            https://uk.wikipedia.org/wiki/%D0%94%D1%80%D0%BE%D0%B2%D0%BE%D1%80%D1%83%D0%B1

                            1. 17

                              I’m not sure if the people who try - and then publish - such attributions care a lot about precision, and “The Russian” is still (or again) a popular boogeyman while “The Ukrainian” is probably a friend to the West or something.

                              I suppose the interesting part is “malware targeting Linux”, everything else is political fluff.

                              1. 16

                                It’s a straw man argument. If anyone cares, first of all, “lumberjack” in Ukrainian is – лісоруб. Here’s the data from Google trends on the usage and geo of both terms [1].

                                Regarding the “drovorub” rootkit’s name. It has nothing to do with wood, per se.

                                The rootkit exploits Linux kernel modules. Modules are hardware drivers (mostly). Drivers in Russian slang are – дрова (drova) – wood (engl.). “Drova” (as a slang term) is nothing but a short native Russian word that sounds almost like “drivers”.

                                The second part of “drovo-rub” is derived from verb – рубить – to hack (engl.) ;)

                                Thus, whoever coined that rootkit’s name speaks Russian really well and captured the idea of the rootkit’s approach in its name quite well.

                                [1] https://trends.google.com/trends/explore?date=all&q=%D0%BB%D1%96%D1%81%D0%BE%D1%80%D1%83%D0%B1,%D0%B4%D1%80%D0%BE%D0%B2%D0%BE%D1%80%D1%83%D0%B1

                                1. 5

                                  It’s a straw man argument.

                                  It’s not an argument at all. I don’t care, really.

                                  Proper name for lumberjack in Russian is лесоруб. Дроворуб makes no sense, unless it is a play on дрова. Which, I must say, is quite likely! Good catch!

                                2. 0

                                  So what? If a German gives their malware a Polish name, you would argue it wouldn’t be German in origin if the NSA claimed it was German?

                                  1. 14

                                    I don’t claim anything apart from what I’ve actually stated. Don’t put words in my mouth.

                                    Why is it even relevant that this is a Ukrainian word, not Russian? Because of this section in the original PDF [0]:

                                    Why is the malware called “Drovorub”, and what does it mean? The name Drovorub comes from a variety of artifacts discovered in Drovorub files and from operations conducted by the GTsSS using this malware; it is the name used by the GTsSS actors themselves. Drovo [дрово] translates to “firewood”, or “wood”. Rub [руб] translates to “to fell”, or “to chop.” Taken together, they translate to “woodcutter” or “to split wood.”

                                    This quote is from the “Attribution” section of the document. While the name is not the only (I hope) reason why they attribute this malware to GRU, making a bullshit claim that anyone with a dictionary can disprove… it’s just a sloppy, sloppy job.

                                    EDIT: although, technically, they only say how the word translates to English. They never say from what language.

                                    1. 4

                                      Actually, they do:

                                      The name of the malware means ‘woodcutter’ in Russian

                                      1. 4

                                        I asked you whether you would make the same argument if different nationalities were involved, because it seemed like you were arguing the malware wasn’t Russian because the name is actually Ukrainian. I don’t think I could know you were merely correcting something buried deep in the PDF linked in a sibling comment.

                                        1. 4

                                          That would be an assumption in good faith.

                                          1. 1

                                            I didn’t assume anything: I asked a question to see if something could be assumed.

                                            As a sibling comment that has appeared in the meantime shows, it was not some weird thing that only I considered a possibly valid assumption.

                                            There are many Russia and China (and US) apologists, so nationalism as a motivation is always something to at least consider.

                                            That Russia is responsible for shooting down a passenger plane of ours and still attempts to blame it on Ukraine doesn’t help for this particular case.