1.  

    How has Rust solved the callback problem of async/await? To explain what I mean, imagine you have an array of URLs and you want to open each of them in turn, download their contents, then do something with those contents. If you wanted to iterate over an array normally you’d do something like urls.foreach(|url| process(url.get()));.

    But what happens when that get is async? Don’t you have to duplicate all your functions to take async functions? You need an async_foreach that takes an async function, etc. etc.

    Now I know that in Rust, iterators are used, but essentially the same problem still exists, just a little swapped around. In most ‘async/await’ systems you can’t call async functions from normal functions. I would assume the same is true of Rust.

    To compare to a system like Go’s, where there aren’t special ‘async’ functions, and you don’t need to rewrite every bit of code that takes a callback to also have a version that takes an async callback and calls the callback asynchronously.

    1.  

      Depends on whether you want to fetch each url in parallel or in series.

      Go does it in series ‘by default’ and provides goroutines to do it in parallel.

      From my understanding, you can do something like foreach process(await url.get()) to run in series, or join(foreach(|url| await process_async(url.get()))) to do it in parallel (where process_async accepts a future and returns another).

      This is also how javascript does it. You don’t need a special version of foreach since futures are just another type.

      You should also be able to generate a future-ized version of a function automatically with macros (e.g. automatically converting a function of type PageHtml => Result into one of type Future<PageHtml> => Future<Result>).
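
      For the Go side of the comparison, the series-versus-parallel distinction might be sketched like this. It is only an illustration: fetch and process are hypothetical stand-ins (no real network I/O), and inSeries/inParallel are names made up for the example.

      ```go
      package main

      import (
      	"fmt"
      	"strings"
      	"sync"
      )

      // fetch is a hypothetical stand-in for url.get(); it does no real network I/O.
      func fetch(url string) string {
      	return "contents of " + url
      }

      // process is a hypothetical stand-in for whatever is done with the contents.
      func process(body string) string {
      	return strings.ToUpper(body)
      }

      // inSeries handles one URL at a time, in order.
      func inSeries(urls []string) []string {
      	out := make([]string, len(urls))
      	for i, u := range urls {
      		out[i] = process(fetch(u))
      	}
      	return out
      }

      // inParallel starts one goroutine per URL and waits for all of them.
      // Each goroutine writes only to its own slot, so no extra locking is needed.
      func inParallel(urls []string) []string {
      	out := make([]string, len(urls))
      	var wg sync.WaitGroup
      	for i, u := range urls {
      		wg.Add(1)
      		go func(i int, u string) {
      			defer wg.Done()
      			out[i] = process(fetch(u))
      		}(i, u)
      	}
      	wg.Wait()
      	return out
      }

      func main() {
      	urls := []string{"a.example", "b.example"}
      	fmt.Println(inSeries(urls))
      	fmt.Println(inParallel(urls))
      }
      ```

      Note that the same fetch and process are used in both versions; only the caller decides whether to run them in goroutines.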

      1.  

        But you still need a process and process_async?

        If so, this post illustrates the problem: http://journal.stuffwithstuff.com/2015/02/01/what-color-is-your-function/ .

        Go avoids it by having a high-level m:n userspace thread scheduler, so there’s no need for separate versions of each function (basically, everything is async). It’s not a zero-cost abstraction, though, so it wouldn’t make sense for Rust (it’s a lot like a garbage collector in that regard).

        I’m not sure what a better solution would be. At least async/await cleans up the code.

        1.  

          If zero-cost abstraction is the goal, would it be possible (in any reasonable time) to make everything async-by-default but have the compiler remove asynchronicity where it’s not needed?

          Go’s approach seems quite brilliant as an actual real-world, existing & working solution. I think Crystal follows its lead on this.

          1.  

            I don’t think it can be done without introducing a custom scheduler. Channel, network and time operations can be handled intelligently (described in more detail here: https://dave.cheney.net/2015/08/08/performance-without-the-event-loop)

            So blocking a thread might be a normal OS thread wait, or it might be put into a network event loop (kqueue, epoll, etc) so the code looks uniform, but there’s a lot of magic behind the scenes.

            It’s very clever, but there are costs. Occasionally there have been major performance issues from unexpected edge cases. I’m not sure how often people run into those anymore, but it can happen. And there is definitely overhead and less control when it comes to the scheduling. Though in theory the scheduler will probably do a better job scheduling than whatever the vast majority of programmers would come up with.

            I think using futures with async/await allows you to do something very similar if you use an async io library: https://manishearth.github.io/blog/2018/01/10/whats-tokio-and-async-io-all-about/

            Though the part I don’t understand is how you do selection

            1.  

              Interesting thought (making things async by default) but I guess one would need to be more specific. If done manually, unnecessary async should be optimized away. Rust with the current proposals, of course, distinguishes between future construction and execution which is sometimes nice.

            2.  

              You write a process function, and use a macro to future-ize it when you pass it to foreach.

              1.  

                You need process_async if processing itself needs asynchronous calls. Yes, that also means that if you have to abstract and one of the implementations uses async calls, the interface also needs to be async. That said, it will be incredibly easy to implement an async interface with sync code.

                The only time this is a problem is when you have a sync integration point in a library that you do not control and you need to do async calls. You can still, of course, simply block the calling thread until your processing is done.

                What you get in Rust is: really nice guarantees that you do not have in Go, less overhead, and easy cancellation. That said, as with GC, the Go model makes some things more uniform and even more easily composable; and you have stack traces, awesome (I mean it)! Rust tries to avoid a (larger) runtime and manages to have more safety guarantees nevertheless. Quite a feat.

            3.  

              You need async_foreach because each variant is literally different. Parallel, sequential, fire-and-forget, and collect-and-wait are all different semantics, and all are valid. So it’s not accidental complexity that you must specify which variant you want when choosing the foreach, since the foreach evaluates to the result.

              On the other hand – an interesting way around this is to use coroutines and explicit scheduling, as in Lua (so you do schedule(function() process(...) end)). You do have to explicitly wait after the foreach if you want waiting semantics.

              1.  

                When get is async, it returns a future that must be polled to completion (or cancelled, though I’m not sure what the cancellation story is for Rust futures). This can be done one at a time, blocking the current thread, or each can be registered on an Executor, such as an epoll-backed thread pool.

                If using async/await, then you can call your example inside an async function, and either await each url or select between the ready ones.

                Disclaimer: this is my current understanding, but it may be incomplete or inaccurate, as I’ve spent very little time with Rust futures.

              1. 7

                When I’ve had time over the past week or two, I’ve been working on my blog series on decreasing the cost of running a personal k8s cluster. Just yesterday, I wrapped up a post on utilizing Reserved Instances and Spot Instances to decrease the overall bill by ~40%. As I have time this week, I’ll work on decreasing/stabilizing costs by running a single External Load Balancer via Ingress and optimizing Kops’ default EBS volume allocations.

                I also just finished reading Google’s Site Reliability Workbook, so I’ll hopefully have some time at work to think about applying some of its ideas.

                1. 1

                  It’s really funny for me reading some of your comments on reducing k8s costs, because you’re talking about reducing $160/mo by 40%, and my personal nerd-golf is to try to reduce my cloud budget to < $10/mo (free tier doesn’t count)

                  (I’m not deriding your hobby, just appreciating how it’s so similar in intent but so different in scope than my own)

                  1. 1

                    Haha yeah, definitely wouldn’t recommend running a personal Kubernetes cluster if cost-savings is a predominant concern :) I think for me, the experience I’m gaining with Kubernetes/Cloud Native, and the fun I’m having working with it, justifies the extra cost.

                  2. 1

                    I got one down to 5$ a month on GKE: http://www.doxsey.net/blog/kubernetes--the-surprisingly-affordable-platform-for-personal-projects.

                    It helps that they run the control plane for free but I still had to run my own load balancer, since the built-in one is 18$/mo.

                    Digital Ocean will have a managed k8s soon that might be great for personal clusters, but I haven’t tried it yet.

                    1. 1

                      Very cool! Saved your post for later to give it a good look :)

                  1. 7

                    My usual desktop

                    I’m on OS X, and use a tiling window manager called ChunkWM with a hotkey daemon.

                    There’s NeoVim with a variety of syntax/editing plugins on top right, a currently active PDB session on bottom right, various IRC channels and servers via Weechat on bottom left, and Mutt top left.

                    With the combination of hotkey daemon, window manager, and the variety of CLI-based tools I use, I essentially never use my trackpad/mouse. Not shown is Firefox with a vim-like set of keybindings so that I can navigate with the keyboard.

                    I’m also a huge motorsports fan, so the wallpaper usually rotates between various Formula 1 or World Endurance Championship scenes.

                    1. 4

                      How do you find ChunkWM? I’ve batted around the idea of installing it on my Mac at home, because I am a fan of tiling, keyboard-driven wms when forced to spend time in X; but I worry that it’d end up being a case of fighting the platform, a neither-fish-nor-fowl hybrid that manages to combine the worst of both worlds.

                      1. 4

                        It’s actually not that bad – there are a few configuration-level things that you need to setup to get things working smoothly, but once it’s going I really don’t have to touch it. I’m actually a bit lost when I need to use a computer that is not my own due to all the built-up muscle memory from the skhd hotkeys.

                        But, it does have some flaws.

                        1. When switching between single monitor & multiple monitors, sometimes windows don’t reposition themselves correctly and I have to hide all windows and then bring them to the foreground to get the chunkwm daemon to recognize them and resize them. It’s not a big deal, but it can be jarring the first time it happens to you. Also, this seems to have almost disappeared in the most recent versions of ChunkWM.
                        2. When resizing windows, you can sometimes see redraw artifacts (edit: on further thought, this might be an issue with iTerm2 - I don’t ever see redraw artifacts on non-console windows). You can even see that in my screenshot that I took (it looks like an extra letter in the self parameter in the top right window, first line). The artifacts disappear when the window in question has to redraw itself again for some reason (e.g. you typed some text), but it is supremely annoying.
                        3. The default keybindings in skhd might be a tad annoying for non-English keyboards. I know when I need to type French accented characters, I have to go a circuitous route due to the use of the option key as the main skhd modifier.
                        4. Some menubar applications will need to be added to the chunkwmrc config file as a “do-not-touch”, since chunkwm tries to tile floating menubar windows that appear, and it really just goes a bit nuts. This seems to have been resolved in the most recent versions of chunkwm, but I’m still a bit wary about it.

                        Overall, though, for software that is at 0.4.x level of completeness, I’m very happy with it, and deal with the warts because the productivity it provides me is worth so much.

                        The author of the software has gone through a few iterations of building these hotkey daemons & window managers for OS X, and seems to have taken a lot of knowledge and experience from past implementations.

                      2. 2

                        Another tiling window manager: https://www.spectacleapp.com/.

                        1. 1

                          +1 for Spectacle

                      1. 6

                        This is actually a solid tutorial. I’ve recently been wondering whether Kubernetes might suit us after all, because what we are doing in practice never seemed to fit Kubernetes very well: instances of our app that need to run on separate infra, often even different regions, currently running on single servers. Translating to Kubernetes pretty much seemed like we would have to map each of those servers we have now to individual Kubernetes clusters.

                        But managing a bunch of Kubernetes clusters doesn’t seem any worse than managing a bunch of individual servers? And if we’re already running on single servers, we could turn them into single node Kubernetes clusters for roughly the same price, with GKE masters being free.

                        GCE definitely has an advantage in terms of pricing, here. We’re an AWS shop, but EKS is priced $0.20 per hour for the master, on top of your node costs. That’s instantly ~$150 per month added to your bill.

                        1. 2

                          Translating to Kubernetes pretty much seemed like we would have to map each of those servers we have now to individual Kubernetes clusters.

                          You can assign pods to specific nodes in a single kubernetes cluster quite easily. https://kubernetes.io/docs/concepts/configuration/assign-pod-node/

                          1. 2

                            By default some metadata is associated with each node, for example the region and availability zone. Using that information you can provide an affinity to target only a certain region, or make sure pods are distributed across availability zones.

                            You can also add custom taints to nodes, and then add a toleration to a pod to make sure it runs where you want it to.

                            At Datadog we built a custom controller (similar to the ip sync code in the blog post) which when handed a custom resource definition would create a nodepool with the requested constraints (node types, SSDs, etc), thus allowing developers to also describe the hardware requirements for their application.

                            Paired with the horizontal pod autoscaler and the cluster autoscaler you can go a long way to automating fairly sophisticated deployments.

                            1. 1

                              But everything I can find about Kubernetes (in the cloud) is that you start it in a single region. Am I missing something? Can you span a Kubernetes cluster across multiple regions, or somehow treat it as one large cluster?

                              1. 2

                                Yeah that’s true. I think the etcd latency wouldn’t play well multi-region.

                                You could still tag the nodes and apply the same config in several kubernetes clusters and then in the other clusters the workload just wouldn’t run.

                                Of course, then you’re going to have the issue that services in one cluster need to talk to services in another. Kubernetes has federation support, but I hear it’s a work in progress. Istio might be worth a look though.

                          1. 4

                            I love k8s because I love just having a bunch of yaml files I can just apply and have it work, but GKE’s pricing for 4 cores and 8 gigs of RAM was like 2 or 3 billion dollars a month I think, so I went back to crappy scripts and Digital Ocean. Really hope DO’s Kubernetes offering ends up being good, because using Kubernetes is wonderful but administering it isn’t something I want to do for little side projects.

                            1. 3

                              You could also use Typhoon if you want something better than scripts. It also supports DO.

                              1. 1

                                A 3-node (n1-standard-1) Kubernetes cluster is ~72$/month. You can even get a 1-node k8s cluster, but then you don’t get all the benefits discussed in the OP. Although 3 nodes is still a light cluster, it allows you to have some benefits that you’d not have with 3 crappy servers managed by configuration management (although those would still be cheaper).

                                1. 1

                                  Google has a sustained use discount. I think a 4 core, 15GB machine is 100$/mo. So on the low end it’s cheaper than Digital Ocean, but the price ramps up quickly for more computing power. (Also, pre-emptible nodes are cheaper if you can live with your server disappearing every day.)

                                  I suppose it depends on what you’re trying to do. Their burst instances work well for web apps, especially if you can cut down on memory usage.

                                  Some competition from digital ocean would be great. I’d probably switch if the price were competitive.

                                1. 3

                                  The whole vgo project seems somewhat surreal. It pulled out the rug from dep, which everyone thought was on its way to becoming the official way to do dependencies.

                                  And yet I’m not sure I’ll really miss it. Having worked with it the last 6 months it’s god-awful slow, local vendoring doesn’t really work well with git, download proxy servers are a nice idea that was never really implemented in any plausible way, private repos are challenging. It feels half finished and forgotten, with no traction in months.

                                  sboyer didn’t even have the time to actually write down his objections.

                                  And how is it a single developer manages to supplant an entire open source project on his own? And even more, how does he manage to do that and produce something better?

                                  I played with vgo. It’s fast, works well with the existing go tool, it’s dead simple to use, and has a robust proxy solution to avoid left-padding. Somehow it manages to preserve Go’s distributed dependency model.

                                  I really don’t know what to make out of it all.

                                  1. 4

                                    I left a comment on the blog: (his reply was surprisingly hostile…)

                                    Check out https://github.com/ungerik/pkgreflect for a way to reflect packages.

                                    Go leaves many of these problems to be solved at build time rather than at runtime.

                                    Your point about generics is actually about inheritance. You might be the first person I’ve seen defend classical inheritance with virtual methods.

                                    Virtual methods turn out to cause a lot of problems, the primary problem being they’re slow, which is why c# decided to make virtual opt-in instead of the default.

                                    But from a language design perspective inheritance often ends up making very confusing code because where something is implemented can be hard to track down and developers will tend to make assumptions about how something is implemented which later binds your hands as the implementor of the base class.

                                    In general in OO code you should hide everything by default and use virtual with great care. It’s this realization which drives the reasoning behind discarding the concept entirely. It’s just not all that useful in practice.

                                    I also think getting rid of the ternary operator was good, it leads to easier to understand code, for a very slight cost in verbosity. (surprisingly slight… Rob Pike once said developers think a newline is 100x more expensive than any other character)

                                    Also I like git imports. It makes it really easy to find docs and the original source code. Distribution of code is a separate issue from its name and a centralized packaging system is a liability.

                                    1. 3

                                      Kubernetes uses ConfigMaps: https://kubernetes.io/docs/tasks/configure-pod-container/configure-pod-configmap/

                                      They are basically YAML/JSON properties which can be sent to containers in various ways:

                                      env:
                                      - name: LOG_LEVEL
                                        valueFrom:
                                          configMapKeyRef:
                                            name: env-config
                                            key: log_level
                                      

                                      Kubernetes handles the rollout of changes, and since a lot of infrastructure tasks are pre-defined (like routing from one service to another à la Istio) there are far fewer one-off config changes that you need to make. ConfigMaps support literals, files and directories. You can also do secrets: https://kubernetes.io/docs/concepts/configuration/secret/

                                      1. 2

                                        Go 1.9 introduced type aliases:

                                        package main

                                        import "fmt"

                                        // callback is an alias for the function type func(int) bool.
                                        type callback = func(int) bool
                                        
                                        func filter(xs []int, f callback) []int {
                                        	var filtered []int
                                        	for _, x := range xs {
                                        		if f(x) {
                                        			filtered = append(filtered, x)
                                        		}
                                        	}
                                        	return filtered
                                        }
                                        
                                        func main() {
                                        	fmt.Println(filter([]int{1, 2, 3, 4}, func(x int) bool {
                                        		return x < 3
                                        	}))
                                        }
                                        

                                        C# has delegate types:

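                                        using System;
                                        using System.Collections.Generic;

                                        public delegate bool callback(int x);

                                        public static class Program {
                                            public static int[] filter(int[] xs, callback cb) {
                                                var filtered = new List<int>();
                                                foreach (int x in xs) {
                                                    if (cb(x)) {
                                                        filtered.Add(x);
                                                    }
                                                }
                                                return filtered.ToArray();
                                            }

                                            public static void Main() {
                                                // Pass a lambda as the callback, mirroring the Go example.
                                                int[] xs = filter(new[] {1, 2, 3, 4}, x => x < 3);
                                                foreach (int x in xs) {
                                                    Console.Write(x + " ");
                                                }
                                                Console.WriteLine();
                                            }
                                        }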
                                        
                                        1. 2

                                          Excellent point, but the actual parameters still end up being structurally typed. The formal parameters get named as instances of the type, but the actual values when constructed are not declared to be of that type.

                                          That is, in your first example, I could do something like this:

                                          func foo(i int) bool {
                                              ...
                                          }
                                          
                                          filter(int_array, foo)
                                          

                                          The function foo was never explicitly declared to be of type callback, but rather assignment/passing was allowed because foo met the structural requirements of the callback type.
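
                                          A minimal sketch of that point, assuming callback is declared as a defined type rather than an alias (isSmall is a hypothetical name used only for illustration): the function is still accepted without ever naming callback, even though the two types remain distinct.

                                          ```go
                                          package main

                                          import "fmt"

                                          // callback is a defined (named) function type, not an alias.
                                          type callback func(int) bool

                                          // isSmall is declared with a plain func signature; it never mentions callback.
                                          func isSmall(x int) bool { return x < 3 }

                                          func filter(xs []int, f callback) []int {
                                          	var out []int
                                          	for _, x := range xs {
                                          		if f(x) {
                                          			out = append(out, x)
                                          		}
                                          	}
                                          	return out
                                          }

                                          func main() {
                                          	// Accepted: func(int) bool is assignable to callback because the
                                          	// underlying types are identical and one of them is unnamed.
                                          	fmt.Println(filter([]int{1, 2, 3, 4}, isSmall))

                                          	// The two types are nevertheless distinct.
                                          	fmt.Printf("%T\n", isSmall)
                                          	fmt.Printf("%T\n", callback(isSmall))
                                          }
                                          ```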

                                          I think the answer to my question may be “no, it’s not possible to reasonably have function types in a purely nominative type system” though that just rubs me the wrong way.

                                        1. 2

                                          I’m trying to understand what you are after with the “single executable” part?

                                          1. 2

                                            Self-contained. For the most part controversy I guess? :-)

                                            1. 5

                                              Right. That controversy you may have. I guess we have rather differing interpretations of self-contained.

                                              $ file start.sh

                                              start.sh: POSIX shell script, ASCII text executable
                                              

                                              $ file target/hprotostuffdb-rjre

                                              target/hprotostuffdb-rjre: ELF 64-bit LSB executable
                                              

                                              $ grep JAR start.sh

                                              JAR=comments-all/target/comments-all-jarjar.jar
                                              $BIN $PORT comments-ts/g/user/UserServices.json $ARGS\
                                                $PUBSUB $ASSETS -Djava.class.path=$JAR comments.all.Main
                                              

                                              $ objdump -p target/hprotostuffdb-rjre |grep RPATH

                                              RPATH                $ORIGIN:$ORIGIN/jre/lib/amd64/server
                                              

                                              $ objdump -p target/hprotostuffdb-rjre |grep NEEDED

                                              NEEDED               libpthread.so.0
                                              NEEDED               libjvm.so
                                              NEEDED               libcrypto.so.1.0.0
                                              NEEDED               libssl.so.1.0.0
                                              NEEDED               libz.so.1
                                              NEEDED               libstdc++.so.6
                                              NEEDED               libgcc_s.so.1
                                              NEEDED               libc.so.6
                                              

                                              $ find . -name '*so'

                                              ./target/jre/lib/amd64/server/libjsig.so
                                              ./target/jre/lib/amd64/server/libjvm.so
                                              ./target/jre/lib/amd64/libzip.so
                                              ./target/jre/lib/amd64/libnet.so
                                              ./target/jre/lib/amd64/libjava.so
                                              ./target/jre/lib/amd64/libnio.so
                                              ./target/jre/lib/amd64/libverify.so
                                              

                                              I’m not even going into the rest of the jre scaffolding. I guess you could argue the stuff under comments-ts is not part of the “comment-engine”, but it’s there, and it (or something equivalent) is needed anyway. Admittedly, only two of the files in the entire package have the ‘executable’ flag set, so you can have half your cake if that’s the criterion for being self-contained :-)

                                              1. 4

                                                Thanks for the detailed response.
                                                It was my way of showing people that JVM apps can have “golang-style” deployments, where you ship a binary and run it, and still be only 12MB (my production nginx binary is 14MB).

                                                But realistically, if you have the jvm installed, the jar is only 455kb and that is only the one that needs to be shipped along with the 92kb js and 7.1kb css. That is how I deploy internally.

                                                With golang, you do not have this choice.

                                                1. 4

                                                  Ah, so now I am starting to see the points that you are really trying to make.

                                                  1. Bundling of dependencies. I don’t think there’s much novelty to it; proprietary and floss applications alike have distributed binaries with bundled dependencies for a long long time. This includes many applications that bundle a jvm.

                                                  2. A jvm bundle can be reasonably small. Admittedly I haven’t paid attention to it, but I’ve had jvm bundles before, and I don’t recall them being outrageously large.

                                                  Calling it a “single executable” or self-contained might not be the best terminology to get the point across. Even more so when you consider that the executable also depends on many libraries that are not bundled; see objdump output above and compare to the list of bundled shared objects. Any one of these system libraries could go through an ABI change (in worst case one where existing symbols are removed or modified to work in an incompatible way, without symbol versioning…), and when that happens, your uhm.. self-contained single executable thingy won’t run right, or at all. It’s not just a theoretical concern, it’s something people using binary distributions of proprietary software (e.g. games) may have to feck with rather often.

                                                  I can’t comment on how this compares to golang deployments, which I’ve never used.

                                                  1. 1
                                                    1. Pretty much agree.
                                                    2. A lot of ppl dismiss the jvm as bloated (in terms of size and runtime memory). I guess it all depends how one uses it (along with the knobs to tune). I run mine at 128mb memory max, and that could handle 130k req/s. My usage of the jvm is like a stored-procedure language though. All the heavy lifting is on the C++ libs that I’m depending on.

                                                    I understand your points and appreciate your comments. Cheers

                                                  2. 2

                                                    Recent versions of go have additional build modes: https://golang.org/cmd/go/#hdr-Description_of_build_modes

                                                    Theoretically you could deploy your code as a plug-in.

                                            1. 3

                                              Generally agree but there are operational costs to vertical scaling. That single DB is also a single point of failure and achieving high availability is often just as hard as scaling horizontally. (master / slave failover and backups may seem mundane but there are plenty of examples of companies screwing it up)

                                              Something like Cassandra, Elasticsearch or Kafka has redundancy built in, which is a big win. I think spanner style sql databases could hit a real sweet spot.

                                              As for SOA I think it depends on what you’re working on. Sometimes breaking up applications into separate processes with well defined interfaces can make them easier to reason about and work on.

                                              As an application evolves over time the complexity can grow out of control until any time someone touches the code they break it. How often have new developers thrown up their hands, scrapped the product, started over and wasted 6 rebuilding what they had in the first place?

                                              Maybe SOA could help with that by limiting the scope? (Though maybe better code discipline would achieve the same result?)

                                              I guess all I’m saying is that good engineering practices can help smaller software too.

                                              1. 2

                                                Makefiles are great for small Unix projects, not so much for something that needs to be built for Windows too… Windows developers live in a parallel universe of build tools, and that, coupled with the sheer size of chromium, helps to explain why they felt it necessary to make ninja.

                                                The complexity of the project is pretty crazy though.

                                                1. 2

                                                  I’ve used nmake with some success to build things on Windows.

                                                1. 3

                                                  Why do folks have such a hard time looking for 3rd party packages? An immutable, sorted map doesn’t come up all that often (I don’t think I’ve ever “needed” the immutability bit), but there are packages out there that can do it. For example: https://github.com/Workiva/go-datastructures#dtrie. *

                                                  It’s not type safe, but it fits the requirements he wanted.

                                                  * Caution: I’ve never actually used this package.
                                                  1. 4

                                                    One of the chief dangers of excessively verbose and inflexible code is not just that its implementer has to do a lot of typing; it’s that all that typing provides a high surface area for bugs and generally difficult-to-reason-about implementations. This sort of issue affects whoever has to use the library, not just whoever has to do the implementing.

                                                    It’s also, by the way, generally true that code which somebody else wrote is going to be more general and, you know, not written by you, and therefore magnify the verbosity and difficult-to-reason-about issues.

                                                    1. 5

                                                      Why do folks have such a hard time looking for 3rd party packages?

                                                      Sometimes people don’t want to add external dependencies for things that, in some cases, are (relatively speaking) straightforward to implement.

                                                    1. 2

                                                      So, what are the current Best Practices for dealing with these sorts of markets? What are reasonable and secure alternatives to Coinbase?

                                                      1. 4

                                                        Xapo takes security seriously: https://support.xapo.com/xapo-security

                                                        • keys are stored offline
                                                        • multiple signatures are used
                                                        • data is encrypted and distributed across multiple regions
                                                        • they use 2fa, a pin and a password to access the vault
                                                        • it requires 48 hours to move bitcoins out of the vault

                                                        You can still use sites like coinbase, just don’t leave a lot of bitcoins in them

                                                        1. 3

                                                          Some alternatives to Bitcoin allow electronic transfers to be reversed. :)

                                                          1. 3

                                                            Given the history of Bitcoin markets, I wouldn’t keep any money in any of them that you aren’t planning on trading right now. So many of them have been hacked or have the founders mysteriously disappear with the money.

                                                            1. 1

                                                              One alternative is the Blockchain wallet (https://blockchain.info/). This site cannot reset your password because your password ultimately secures your wallet’s private key. This means that the service is as secure as your password, which could be considered a “secure alternative to Coinbase”. Note that this does not include ethereum or litecoin wallets, nor an exchange though.

                                                              Coinbase also claims to have FDIC insured deposits for its US customers (but only for the USD balance), so that can be a real advantage over other exchanges at least. They’re also insured against theft/security failures (their policy would probably be detailed enough to exclude OP’s problem).

                                                            1. 3

                                                              I appreciate the way you’ve written this as an exploration of the problem space rather than a polemic (“10 reasons you’re a dumb-dumb if you don’t count with HyperLogLog”). It feels far more authentic and useful.

                                                              Having said that, I find a disturbing story here: how much time was spent standing up Cassandra, operationalizing it, validating the new process, adjusting the existing billing processes, (…) when you’re sitting on an embarrassingly parallel problem? Do you really want to trust a complex datastore when wc is the perfect solution here?

                                                              1. 3

                                                                wc has to count the unique lines. Like many companies, we faced the problem where certain customers received an extreme amount of data compared to their peers. This meant that we had to count billions of items.

                                                                I’m curious how the problem is embarrassingly parallel… I was never happy with the Cassandra solution, so if there’s something obvious here I missed, I’d definitely be interested…

                                                                Cassandra is quite simple, though expensive, if you’ve already paid down the operational cost (which we had). It solves the high availability problem.

                                                                1. 1

                                                                  Careful re-reading indicates the problem isn’t quite embarrassingly parallel; the roll-up has to occur monthly, and I misread the gzipped log date format as monthly logs; oops, my bad :)

                                                                  There are still a handful of problems in that pipeline, and you’d see some very strong gains if you spent a day parallelizing it:

                                                                  find *.gz | xargs -I{} echo "<(gzcat {} | sort | uniq)" | xargs echo sort -m | bash | uniq | wc -l

                                                                  Right now you’re sorting each log file’s output and eliminating duplicates within it, then merging the results together and stripping duplicates again (gah!)

                                                                  Instead, if you did

                                                                  find *.gz | xargs -I{} echo "<(gzcat {} | LANG=C sort )" | xargs echo LANG=C sort -mu | bash | wc -l

                                                                  you eliminate duplicates at merge time (-u), which cuts out n+1 full-file iterations and just does it at the k-way merge time. It’s also critical to set LANG to C or else you’ll get eaten alive with multibyte comparisons.
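To make the "dedupe at merge time" idea concrete, here is a rough Python sketch of what sort -mu | wc -l does conceptually: a lazy k-way merge over inputs that are each already sorted, counting a line only when it differs from the previous one. The function name and the tiny in-memory "logs" are made up for illustration; real inputs would be streams of lines from the sorted files. Python compares strings by code point, which is roughly the spirit of LANG=C ordering.

```python
import heapq
from typing import Iterable


def count_unique_merged(sorted_streams: Iterable[Iterable[str]]) -> int:
    """Count distinct lines across streams that are each already sorted."""
    count = 0
    previous = None
    # heapq.merge performs a lazy k-way merge of the sorted inputs,
    # so equal lines from different streams come out adjacent.
    for line in heapq.merge(*sorted_streams):
        if line != previous:
            count += 1
            previous = line
    return count


# Hypothetical stand-ins for three per-file sorted outputs.
logs = [
    sorted(["b", "a", "c", "a"]),
    sorted(["c", "d"]),
    sorted(["a", "e", "e"]),
]
unique_total = count_unique_merged(logs)
```

Because duplicates are adjacent after the merge, only the previous line has to be remembered, so memory stays proportional to the number of streams rather than the number of lines.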

                                                                  If those two would be too slow (and seriously, don’t underestimate the boost you could get there), the next likely step would be to fan out the download, gzcat and sort onto multiple servers. parallel can help do that with very little administration:

                                                                  aws s3 ls | parallel --sshloginfile ssh-keys-go-here --return {.}.sort --cleanup "aws s3 cp {} . && gzcat {} | sort > {.}.sort" && find *.sort | sort -mu | wc -l (this one is from memory, so the syntax may need a tweak!)

                                                                  Even better, drop the --return option, mount the output directories from your counter nodes to your invoking node, and be sure your network connection is fast enough and you’ve got a very easy to parallelize counter.

                                                              1. 1

                                                                So, to go full circle, if I were to do it again, I’d probably spend most of my time building something clever with a HyperLogLog, only to eventually cave-in and resort to something inefficient, bland and boring.

                                                                Why not HyperLogLog when it’s the most efficient solution?

                                                                1. 3

                                                                  It’s not 100% accurate, and in the end I don’t think that’d fly with a billing system.
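For anyone wondering where the inaccuracy comes from, here is a toy HyperLogLog in Python. It is only a sketch of the standard algorithm, not what any production system uses; the class name, the choice of SHA-1 as the hash, and the precision p=12 are all assumptions for illustration. With 4096 registers the typical relative error is on the order of a percent or two, which is tiny for analytics but not something you can put on an invoice.

```python
import hashlib
import math


class HyperLogLog:
    """Toy HyperLogLog: approximate distinct count in fixed memory."""

    def __init__(self, p: int = 12) -> None:
        self.p = p                  # number of bits used to pick a register
        self.m = 1 << p             # number of registers
        self.registers = [0] * self.m

    def add(self, item: str) -> None:
        # Take 64 bits of a hash; SHA-1 is just a convenient choice here.
        digest = hashlib.sha1(item.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        idx = h >> (64 - self.p)                   # top p bits select the register
        rest = h & ((1 << (64 - self.p)) - 1)      # remaining 64 - p bits
        # Rank = position of the first 1 bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def estimate(self) -> float:
        alpha = 0.7213 / (1 + 1.079 / self.m)      # bias constant for m >= 128
        raw = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction (linear counting).
            return self.m * math.log(self.m / zeros)
        return raw


true_count = 50_000
hll = HyperLogLog()
for i in range(true_count):
    hll.add(f"tweet-{i}")
estimate = hll.estimate()
relative_error = abs(estimate - true_count) / true_count
```

The estimate lands close to the true count, but there is no guarantee it matches exactly, and that gap is the problem for billing.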

                                                                1. 2

                                                                  Since we wanted better than linear performance (ie a billion tweets shouldn’t take a billion operations to count), we explored the indexed option.

                                                                  Can someone explain what this means? I don’t quite understand how you could have sublinear time for counting items. Maybe this is referring to the uniqueness/deduplication part?

                                                                  1. 4

                                                                    Sorry for the confusion. It’s sub-linear at query time. You still have to pay the cost of indexing, but you can do that over the whole month.
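A minimal sketch of that trade-off, using an in-memory Python set as a stand-in for the index (the class and method names are hypothetical, and this is not how the Cassandra version was built): each item pays its indexing cost when it arrives, and the end-of-month query just reads the size of the index instead of rescanning every item.

```python
from collections import defaultdict


class UniqueCounterIndex:
    """Pay the dedup cost at write time so the count is cheap at query time."""

    def __init__(self) -> None:
        # One set of seen item ids per customer (stand-in for a real index).
        self._seen = defaultdict(set)

    def add(self, customer: str, item_id: str) -> None:
        # Indexing work is spread over the month, one item at a time.
        self._seen[customer].add(item_id)

    def count(self, customer: str) -> int:
        # len() on a set is constant time; no rescan of the items is needed.
        return len(self._seen.get(customer, ()))


index = UniqueCounterIndex()
for item in ["t1", "t2", "t1", "t3", "t2"]:
    index.add("acme", item)
acme_unique = index.count("acme")
```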

                                                                  1. 2

                                                                    Fantastic. Systemd makes it so much easier to build init scripts and up till now there hasn’t been a great distro option for servers.

                                                                    1. 1

                                                                      An unfortunate side effect of closures is that they make it trivially easy to leak memory in Internet Explorer

                                                                      Even newer versions?

                                                                      1. 2

                                                                        No. (http://msdn.microsoft.com/en-us/library/dd361842.aspx)

                                                                        I think the article was written in 2006 and is pretty out of date.

                                                                        1. 1

                                                                          Assuming it’s still true, I’d say it’s mitigated by per-process tabs. I’m surprised though that each page doesn’t have its own arena. I can understand how DOM and JS objects get tangled together and can’t be collected, but it seems easy (always does) to just blow away both the DOM and JS heaps for a page after you navigate sufficiently far away. (After reading the msdn link, it seems even IE 7 does just that. It’s only a leak while on the page.)

                                                                          (What’s the longest running web page? gmail? Would evil google deliberately add lots of DOM closure leaks to gmail to convince people to use chrome? :))

                                                                          As a side note, closures/lambdas/anonymous functions are where lots of languages' GCs break down. luajit will permanently pin any function passed as a function pointer to C, and the jvm has similar issues with permgen.