1. 23

  2. 7

    This version takes Clojars’ playbook runtimes from 16 minutes to 1 minute 30 seconds. It is my favourite piece of software in recent years. Highly recommended if you use Ansible.

    1. 3

      It’s probably way out of the intended scope, but could Mitogen be used for basic or throwaway parallel programming or analytics? I’m imagining a scenario where a data scientist has a dataset that’s too big for their local machine to process in a reasonable time. They’re working in a Jupyter notebook, using Python already. They spin up some Amazon boxes, each of which pulls the data down from S3. Then, using Mitogen, they’re able to push out a Python function to all these boxes and gather the results back (or perhaps have each box upload its results to S3 when the function finishes).

      1. 3

        It’s not /that/ far removed. Some current choices would make processing a little more restrictive than usual, and the library core can’t manage much more than 80MB/sec throughput just now, limiting its usefulness for data-heavy IO, such as large result aggregation.

        I imagine a tool like you’re describing with a nice interface could easily be built on top, or maybe as a higher level module as part of the library. But I suspect right now the internal APIs are just a little too hairy and/or restrictive to plug into something like Jupyter – for example, it would have to implement its own serialization for Numpy arrays, and for very large arrays, there is no primitive in the library (yet, but soon!) to allow easy streaming of serialized chunks – either write your own streaming code or double your RAM usage, etc.
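
        To make the "stream chunks or double your RAM" trade-off concrete, here is a minimal sketch using only the standard library. It is not Mitogen's API: `iter_chunks` and `reassemble` are hypothetical helpers, `array` stands in for a NumPy array, and the 64 KiB chunk size is an assumption.

        ```python
        from array import array

        CHUNK_SIZE = 64 * 1024  # assumed chunk size; tune for the transport


        def iter_chunks(buf, chunk_size=CHUNK_SIZE):
            """Yield zero-copy slices of any buffer-protocol object."""
            view = memoryview(buf).cast("B")  # flat byte view, no copy made
            for start in range(0, len(view), chunk_size):
                yield view[start:start + chunk_size]


        def reassemble(chunks, total_size):
            """Receiving side: copy each chunk into one preallocated buffer."""
            out = bytearray(total_size)
            offset = 0
            for chunk in chunks:
                out[offset:offset + len(chunk)] = chunk
                offset += len(chunk)
            return out


        data = array("d", range(100_000))      # stand-in for a large array
        total_size = data.itemsize * len(data)

        # Round trip: the sender never builds a second full serialized copy.
        restored = array("d")
        restored.frombytes(reassemble(iter_chunks(data), total_size))
        ```

        The sender only ever holds views into the original buffer, so peak memory stays close to one copy of the data plus one chunk in flight.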

        Interesting idea, and definitely not lost on me! The “infrastructure” label was primarily there to allow me to get the library up to a useful point – i.e. permits me to say “no” to myself a lot when I spot some itch I’d like to scratch :)

        1. 3

          This might work, though I think you’d be limited to pure Python code. From the initial post describing it:

          Mitogen’s goal is straightforward: make it childsplay to run Python code on remote machines, eventually regardless of connection method, without being forced to leave the rich and error-resistant joy that is a pure-Python environment.

          1. 1

            If it’s just simple functions you’re running, you could probably use PySpark in a straightforward way to go distributed (although Spark can handle much more complicated use cases as well).

            1. 2

              That’s an interesting option, but presumably requires you to have Spark set up first. I’m thinking of something a bit more ad-hoc and throwaway than that :)

              1. 1

                I was thinking that if you’re spinning up AWS instances automatically, you could probably also have a Spark cluster set up along with them. That way you don’t have to worry much about memory management and function parallelization, or about recovery in case of instance failure. The performance side of PySpark (mainly Python object serialization and memory management) is also being actively worked on, indirectly through pandas/PyArrow.

                1. 2

                  Yeah that’s a fair point. In fact there’s probably an AMI pre-built for this already, and a decent number of data-science people would probably be working with Spark to begin with.

          2. 2

            I learned of this project some time ago when you posted about it in one of the “what are you working on” posts. I’ve been waiting to use it ever since: I only really use Ansible twice a year, and on those occasions I have to work remotely over a very slow and convoluted connection (bouncing through multiple hosts and traveling down a home DSL connection and then a PtP WiFi link in someone’s garage). When I use Ansible there are time constraints, because the infra in the garage is only online for so long every weekend, so Ansible runs that take 40 minutes or longer are super annoying (especially during the developing and testing stage). This project looks promising to me. I plan on using it very soon for my work and I hope to see improvements to my productivity as a result, thanks!

            1. 1

              Its network profile has “evolved” (read: regressed!) a little since those early days, but it should still be a massive improvement over a slow connection. Running against a local VM with simulated high latency works fine, though I’ve never run a direct comparison of vanilla vs. Mitogen with this setup.

              That’s a really fun case you have there – would love a bug report even just to let me know how the experience went.

              edit: that’s probably a little unfair. Roundtrips have reduced significantly, but to support that the module RPC has increased in size quite a bit. For now it needs to include the full list of dependency names. As a worst-case example, the RPC for the “setup” module is 4KiB uncompressed / 1KiB compressed, due to its huge dependency list. As a more regular case, a typical RPC for the “shell” module is 1177 bytes uncompressed / 631 bytes compressed.

              A lot of this is actually noise and could be reduced to once-per-run rather than once-per-invocation, but it requires a better “synchronizing broadcast” primitive in the core library, and all my attempts to make a generic one so far have turned out ugly.
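
              A rough way to see why once-per-run matters, sketched with the standard library only. Everything here is an assumption for illustration: the JSON encoding, the made-up dependency names and the `total_bytes` cost model are not Mitogen's actual wire format.

              ```python
              import json
              import zlib


              def encoded_sizes(message):
                  """Return (uncompressed, zlib-compressed) size in bytes."""
                  raw = json.dumps(message).encode("utf-8")
                  return len(raw), len(zlib.compress(raw))


              # Hypothetical dependency names; a real module's list will differ.
              deps = ["ansible.module_utils.example_%02d" % i for i in range(60)]
              call = {"module": "setup", "args": {"gather_subset": "all"}}

              # Today's shape: every invocation carries the full dependency list.
              raw_full, packed_full = encoded_sizes(dict(call, deps=deps))
              # Once-per-run shape: later invocations carry only the call itself.
              raw_call, packed_call = encoded_sizes(call)


              def total_bytes(invocations, send_deps_once):
                  """Compressed bytes sent for a run under this toy model."""
                  if send_deps_once:
                      return packed_full + (invocations - 1) * packed_call
                  return invocations * packed_full
              ```

              Comparing `total_bytes(n, True)` with `total_bytes(n, False)` for a realistic number of invocations gives a feel for how much of the per-call payload is repeated dependency names.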