1. 53

  2. 22

    The Guix and Nix package managers solve all of these problems. Once you start using them, you eventually forget problems like this exist at all (at least, that is my experience).
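
    For example, a throwaway, pinned environment (a sketch - the package names are illustrative and my_analysis.py is made up):

    ```sh
    # One-off environment with Python and samtools; nothing is installed
    # into the host system, and it's gone when the command exits.
    guix shell python samtools -- python3 my_analysis.py
    # Or the Nix equivalent:
    nix-shell -p python3 samtools --run "python3 my_analysis.py"
    ```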

    1. 19

      I work in bioinformatics. Bioinformatics is a fast-moving field where a lot of code written in academic and research settings is quickly moved into a production environment. A lot of the code is written by scientists who have no formal software engineering training. In bioinformatics, even more than in other fields, the code is a means to an end and a necessary evil. Almost everyone writing the code is trying to solve a bigger problem, and the code is a cost center in terms of effort, energy, and so on.

      Researchers have to use code written by other researchers. Code has conflicting dependencies. Code has been compiled by different compiler versions and dynamically linked against different library versions. Code is written in different language frameworks requiring widely different toolchains.

      Docker has been a great solution to the problem of how to quickly install these kinds of tools and make them interoperate without spending impractical amounts of time and energy.

      Does this encourage bad engineering practices? Yes. However, the alternative is to slow the pace of research in this sector to an impractical crawl. The “proper” way to do the research would then be to hand off the software to a team of software engineers who rewrite the code, possibly in a different language, using “proper” software engineering practices. Only then would the tool be released - with a “proper” distribution mechanism - such that everyone could install it on all their systems. Leaving aside the question of whether this is even possible engineering-wise, given all the different languages and paradigms out there, it separates the original tool developers from the code, slowing the tool’s evolution.

      In fast-paced fields like bioinformatics and the other *-omics, tools and tool ecosystems are, in a way, not designed; they evolve. Docker is a boon for interoperability, overcoming the interface mismatches that come out of such an evolutionary process. It’s an awkward analogy, but Docker allows us to mix a giraffe head with a fish body and get away with it.

      1. 10

        This. As a researcher, I don’t want the bloat of installing Docker on my systems. But when other people’s research code can “just work” on my machine with a docker run, I can get straight to work.

        A fixed amount of overhead gives me access to the algorithmic (big-O class) improvements of a publication. Far too much research code grows decrepit and unusable due to inoperability.
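
        Concretely, the workflow is usually no more than this (the image and file names are hypothetical):

        ```sh
        # Pull a published image and run the tool against local data; the
        # only host requirements are Docker itself and the input files.
        docker run --rm -v "$PWD/data:/data" somelab/sometool:1.0 \
            --input /data/reads.fastq --output /data/results.tsv
        ```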

        1. 3

          A fixed amount of overhead gives me access to the algorithmic (big-O class) improvements of a publication

          I’d be curious to hear about how this works! That is kind of the promise, but I’ve never found the “ship a docker version” approach to provide me with anything approaching reusability for other projects. If I want to simply replicate the paper itself, yes, the docker approach is viable. It’s also viable if I want to experiment with variations on the original paper’s ideas. Then I download their archive, make modifications inside it, and run my own tests.

          But if I want to use it as a library-like dependency for my own separate project, things become messy. Am I really going to write my own research code inside this docker archive? Probably not. So I could depend on it externally, but the Docker dependency story is not really well sorted out. And what if I need 2 or 3 other papers’ results? Will my code really depend on 2 or 3 external Docker archives? Now my potentially not that complex research result becomes truly monstrous to reproduce. In many common cases, the whole thing quickly becomes worse than actually just putting in the time to reimplement the original paper, and not having to depend on a pile of trash-can Docker archives.

          I’m not against using other people’s code, really, but docker comes so far down the list that it might be in negative territory. If someone shipped a proper CRAN package encapsulating their work, I’m definitely interested. But a docker archive? Zero interest, less than reading the paper’s pseudocode.

          1. 2

            You bring up a fair point. There’s a difference between prototyping work by calling into one or more Docker-runnable projects via a CLI and handing some (usually text-based) file around between them, and building a reusable tool that someone else can build on. Both are valuable. Docker is one way to enable the former, and you’re right that everything would rapidly turn into a kludge if it were done exclusively for the latter. What I failed to clarify is that it can be useful to test the need for the latter through the quick-and-very-dirty approach of the former.

            1. 2

              In many common cases, the whole thing quickly becomes worse than actually just putting in the time to reimplement the original paper, and not having to depend on a pile of trash-can Docker archives.

              [repeating something I said below]

              If you find yourself needing to build a big pile of other people’s stuff in a way that won’t drive you crazy and you’re willing to invest a bit in learning some tooling, I’d look at the Spack project.

              It’s similar to Homebrew, Nix, and EasyBuild but is focused on scientific/HPC applications (e.g. you can have multiple versions of an application installed simultaneously). It’s not a silver bullet, but it goes a long way toward organizing the chaos.
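
              For instance, having two versions of the same tool side by side looks roughly like this (a sketch; the versions are illustrative):

              ```sh
              # Build and install two versions simultaneously:
              spack install samtools@1.9
              spack install samtools@1.13
              # Put one of them on PATH in the current shell:
              spack load samtools@1.13
              ```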

          2. 7

            [edit: fixed a confusing use of “it’s”]

            TL;DR: NOT this. Bad engineering leads to bad science.


            “Bad engineering practices” in a bioinformatics lab are no more (or less) acceptable than bad lab practices at the bench. They’re a slippery slope to bad science.

            There are many, many bioinformaticians, biologists who can code, computer scientists dabbling in biology, and so on who have taken the time to learn how to use their tools effectively. They may consciously take on a bit of technical debt now and then, but in general the goals of their computational work are the same as their goals for the lab: reliability, repeatability, and reproducibility.

            It makes me sad and embarrassed for the/my discipline to hear claims that bad practices are normal or acceptable.

            In this context, my biggest complaint about Docker isn’t that it enables/encourages some of those bad practices, but that folks can use it as a smoke screen and hide a multitude of sins behind it.

            • If I only had a nickel (but see below) for every time I’ve heard: “I use Docker, so I can reproduce my software environment.” When “Broken by default: why you should avoid most Dockerfile examples” came across Hacker News and generated a flurry of discussion, I felt déjà vu all over again.

            • Docker’s instability makes running it in environments where people depend on the computing infrastructure (“production”) a headache (and money sink, but see below) for the admin team.

            • Docker generally invalidates security on shared machines; you shouldn’t let someone run Docker containers on a machine if you wouldn’t give them sudo access on that machine (see the sketch after this list).

              If you have data that should have restricted access (e.g. contractual obligations, or personally identifiable info, or …), you should really think about it.

              If you have HR paperwork for your reports in your home directory and it’s mounted on the computing cluster, oh dear….
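
            To make the sudo point concrete, here’s the classic demonstration (please don’t run it on a machine you care about): anyone who can talk to the Docker daemon can bind-mount the host’s root filesystem into a container and read or write it as root.

            ```sh
            # Read a root-only file on the host from inside a container:
            docker run --rm -v /:/host alpine cat /host/etc/shadow
            ```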

            All that said, I see good use cases for container tech. I’ve been following Singularity with interest; it avoids a lot of Docker’s complexity and security issues, but it’s still possible to put a bow on a container full of stink and smile innocently. It’s also definitely possible to use Docker cleanly, to great effect, but it requires more work, not less.

            On a related note, I’ve used Spack to great effect in setting up bioinformatics computing environments (containerized or not).

            I shouldn’t be too grumpy, I suppose. I do have nickels (and more) from the many times I’ve helped clean up Docker-related messes in various environments. There’s a nice niche in cleaning up engineering (aka “devops”) problems in the bioinformatics world.

            Will Work For Money, feel free to drop me a note.

            1. 3

              It makes me sad and embarrassed for the/my discipline to hear claims that bad practices are normal or acceptable.

              Nonetheless, it is true. I work (part-time) as a research software engineer at Oxford Uni, and have run courses on software engineering practices for researchers. Plenty of techniques that commercial developers might think normal (e.g. iterative development, using version control) are not present in research teams. We encourage them to adopt such practices and tell them why. Often, we find that students and postdocs are willing to engage with changing their workflows, but more senior staff are not interested. And, as they always say in those management books with fake gold leaf on the cover that you buy from airport bookstands, change has to come from the top if it’s going to stick.

              1. 1

                There’s no question that there are labs with poor practices. There are plenty of tech companies with marginal practices too.

                But:

                • I don’t think that it’s normal (but I haven’t surveyed the entire world recently…); and
                • I definitely don’t think that it’s acceptable (and I know that there are many, many groups in academia and industry that feel the same way).

                But^2:

                • eyes-open technical debt isn’t always bad; and
                • “sometimes things don’t need to be reproducible” (see below).
              2. 2

                This might be true in some cases. In others, though, it’s Docker vs. a pile of e.g. Perl scripts which take in an undocumented custom text file format and output an equally undocumented text file format.

                In my limited experience, bullets 2 and 3 have very little relevance to most research environments.

                The goal is absolutely reproducibility.

                1. 2

                  This might be true in some cases. In others, though, it’s Docker vs. a pile of e.g. Perl scripts which take in an undocumented custom text file format and output an equally undocumented text file format.

                  What really gets my goat is “a pile of e.g. Perl scripts which take in an undocumented custom text file format and output an equally undocumented text file format” that gets wrapped up in a Docker container.

                  True, it’s easier to run the script, but the “solution” suffers from bullets 2 and 3, plus it’s pretty likely that the Docker image was built with a combination of hackery, skulduggery, and prayer (if it weren’t, then installing it w/out Docker would probably be straightforward).

                  In my limited experience, bullets 2 and 3 have very little relevance to most research environments.

                  This has been the case in my past two gigs, covering about 4 years. I’ve been doing this stuff (bio + computers) for quite a while.

                  It’s definitely true that there is a wide variety of environments out there.

              3. 2

                Bioinformatics is a fast-moving field where a lot of code written in academic and research settings is quickly moved into a production environment. A lot of the code is written by scientists who have no formal software engineering training. In bioinformatics, even more than in other fields, the code is a means to an end and a necessary evil. Almost everyone writing the code is trying to solve a bigger problem, and the code is a cost center in terms of effort, energy, and so on.

                This should terrify people who depend on results in that field.

                1. 2

                  Just to jump on the other side of the fence for a moment, sometimes non-reproducible work is ok.

                  One way to understand a process is to break it and then compare the broken version to the “normal” one. It is/was common to expose a bunch of organisms (e.g. a vial of fruit flies) to a mutagen (radioactive, chemical), then search for ones that were different (e.g. had white eyes instead of red). Those experiments were never repeatable. But it didn’t matter: if you got something interesting, you could run with it. If not, you went fishing again.

                  But, the downstream work you did needed to be repeatable/reproducible.

                  Likewise, if your program’s task is to search a list of possibilities and suggest a research candidate, then it might well be fine if it’s really consulting /dev/random to make its choice, or mistakenly skipping the first element in an array, or whatever. One hopes that there’s more intelligence going into it, but bad suggestions are just going to waste your resources (assuming good downstream science).

                  On the other hand, if your selections are e.g. going to guide patient care then they had better be repeatable, reproducible and defensible, lest they:

                  • hurt people
                  • ruin your career

                  E.g. https://en.wikipedia.org/wiki/Anil_Potti

                  (you can choose which is more motivating).

              4. 28

                The folks working with Python and Ruby and Perl (and PHP) are jealous of the way a Java programmer can create an uberjar, and they are jealous of the way a Golang programmer can create a binary that has no outside dependencies — and so the Python programmer, and the Ruby and Perl and PHP programmer, they turn to Docker, which allows them to create the equivalent of an uberjar. But if that is what they want, maybe they should simply use a language that supports that natively, without the need for an additional technology?

                Because “Use another language” is a thought-terminating cliché that makes no use of the expertise on hand, the libraries/ecosystem at play, or the business constraints? Because work on single-binary Python blobs like PyOxidizer is relatively new, so solutions like Docker (an arbitrary userland in a box) are the next best thing?

                Programming languages have their strengths and weaknesses; in that sense, they are not equal. It would be foolish to swap out the tech stack and switch to Java just because someone thinks he understands version pinning better in Maven than with Docker image hashes and tags (and therefore it must be the fault of the technology that he learned one stack before the other). It’s easier to update your understanding than to make extra work for yourself.

                As to the MySQLdb import, I often tell people to use the “-m pip” invocation because it allows you to specify which interpreter, and therefore which install location, to use. Relying on fragile, potentially name-stomping scripts like “pip3” isn’t a good idea - it can result in unpredictable and broken install paths.
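
                Roughly this (the paths are illustrative; mysqlclient is the Python 3 fork of MySQLdb):

                ```sh
                # Install into a specific interpreter's site-packages, rather
                # than wherever the bare `pip3` script happens to point:
                python3.7 -m pip install mysqlclient
                # The same pattern pins a virtualenv unambiguously:
                /path/to/venv/bin/python -m pip install mysqlclient
                ```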

                Docker is a flawed technology in that it conflates build artifacts and distribution in the same context (Dockerfile, multi-stage builds, et al.). Despite its flaws it has been quite successful. I believe that better tooling for optimizing/slimming down Docker images will materialize, as opposed to businesses switching languages just because their devops guys want to optimize their side of the business. Still, I have heard of (usually Golang-using) businesses quickly trying to evolve a missing ecosystem as they go. That’s an insane amount of work (and best of luck to those in such situations!). So it’s not unheard of, I guess.

                1. 5

                  Because work on single-binary Python blobs like PyOxidizer is relatively new,

                  I was using Python app -> exe packing tools a decade and a half ago. They’ve never worked well.

                  1. 6

                    I use them pretty often these days and they work just fine.

                    1. 6

                      Google uses a packing tool for python binaries, they come out as self-contained .par files. I think this is the open source version. It works fabulously within Google at least.

                    2. 6

                      Because “Use another language” is a thought-terminating cliché

                      Perhaps… but

                      Despite its flaws it has been quite successful.

                      So is this, and I think a more harmful one, because it stalls progress.

                    3. 7

                      folks working with Python and Ruby and Perl (and PHP) are jealous of the way a Java programmer can create an uberjar, and they are jealous of the way a Golang programmer can create a binary that has no outside dependencies — and so the Python programmer, and the Ruby and Perl and PHP programmer, they turn to Docker, which allows them to create the equivalent of an uberjar.

                      Honestly, I think it’s simpler than this: developers use Docker because they think it makes things simpler. I don’t agree with that assessment, and I don’t like or use Docker. But that’s clearly (IMO) the explanation, because otherwise you’d never see Go or Java projects using Docker (which they do).

                      Additionally, while I’m not particularly in love with how the whole virtualenv/rbenv stuff works, that’s quite possibly related to my limited use of such tools. In the PHP world at least, good practice is to run your app via a dedicated instance of php-fpm (as opposed to, e.g., the default of a single instance running multiple “pools” akin to Apache vhosts), which means you get your own complete instance to configure however you want.

                      If you’re at the point where you’re able to use Docker, you can certainly use the above approaches, if you just learn a little. But there are no conferences about “doing a little research into how your app can be run” for your boss to pay for so you can go get another cool sticker for your laptop, and you can’t really write dozens of blog posts about how putting your project’s dev environment inside a container inside a VM is somehow so much better than just running a VM in the first place.

                      I have extreme dislike of Docker, but I also have extreme dislike of people claiming that “well you should just use Go”.

                      Edit: clarified “instance”, not “installation”, in the second paragraph.

                      1. 9

                        Loved it! Articulates my deep distaste for languages like Python and PHP. We should be migrating to languages that natively support the programming challenges we face in today’s distributed world as fast as possible.

                        1. 15

                          Seems like a few rants:

                          1. Dependency management.

                          2. Multi-threading.

                          3. Hating on Python.

                          1 is an argument for Nix/Guix more than anything else, plus they bring a lot of other value.

                          2 gives me nightmares. Go back to whatever program you have seen in production that manages its state in multiple threads/actors. Tell me that anyone but the person who wrote it can add a feature without breaking everything unexpectedly. “We can speed this up by running it concurrently” is a sentence that should be grounds to fire the person who said it on the spot. Independent instances (VMs, containers, lambdas) that write to a database are a good solution because they force you to see the mess the second you touch the system, instead of hiding it in logfiles no one looks at. Orchestration of those things is a nightmare because concurrent programming is a nightmare. It’s just that most people like to pretend it’s a solved problem and we can use those 16 cores on the server.

                          3: Python is the second- or third-best language for anything. Uberjars are great until you need three versions of Java to run them - something I’ve had to deal with in the last year. And that’s not even counting the politics of that language, thanks Oracle. Python now seems worse because worse programmers are writing it. I never got the point of type hints (if it says it’s a duck but it had its larynx surgically implanted from a cow, it’s not really a duck) until I saw what a mess C# programmers are making of it; same for all the other linting.

                          1. 4

                            1 is an argument for Nix/Guix more than anything else, plus they bring a lot of other value.

                            Nix/Guix are an overcomplication. When you do your deployments with the modern “destroy everything, recreate everything” model, your packaging/deployment tool doesn’t need to be more complex than tar+rsync.
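
                            Something like this really is all it takes (a sketch; the host and paths are made up):

                            ```sh
                            # Build an artifact, ship it, unpack it, restart:
                            tar czf app.tar.gz -C build .
                            rsync -az app.tar.gz deploy@web1:/srv/app/
                            ssh deploy@web1 'cd /srv/app && tar xzf app.tar.gz && systemctl restart app'
                            ```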

                            1. 8

                              ‘guix pack’ can output a tarball whose contents run anywhere and come with all their dependencies, so you can get the simplicity you want. Using guix pack has the advantage that all your software, across all languages, can be packaged the same way, and other people have already done the annoying work of building complicated dependencies for you.
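
                              From memory (check the manual for the exact flags), something like:

                              ```sh
                              # Build a relocatable tarball containing 'hello' and its entire
                              # dependency closure; unpack it anywhere and run ./bin/hello.
                              guix pack --relocatable --symlink=/bin=bin hello
                              ```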

                              1. 1

                                I admit that the Nix/Guix model is very alluring and I wouldn’t mind getting into it. However, despite a few half-hearted attempts, I haven’t been able to. Meanwhile, a HashiCorp stack of Packer + Terraform, coupled with either a custom bash deployment script or Ansible, is easy to get into, gives most of the same benefits, and doesn’t compel me to use a non-standard operating system.

                                1. 4

                                  FYI, they don’t require using NixOS or Guix System for either the deployment targets or the development environment. I admit they are hard to get into.

                              2. 4

                                And what version of tar? Of libc? Of the kernel, etc.? You are hiding all the state of the machine and pretending that a working OS is a ground level which is somehow easy to describe.

                                1. 1

                                  Please. Docker still relies on the kernel, host network stack, etc. Just because you control for things in your container does not give you the control of a VM.

                              3. 3

                                1 is an argument for Nix/Guix more than anything else, plus they bring a lot of other value.

                                Looking at the title, I was hoping this would be an article about Nix.

                                1. 3

                                  It’s just that most people like to pretend it’s a solved problem and we can use those 16 cores on the server.

                                  It is, as far as I know. The Solo OS was written in a race-free language, Concurrent Pascal, around the 1970s. Ada had Ravenscar for embedded, with lots of restrictions. Then Eiffel made it easy with SCOOP, which lets programmers write concurrent code easily; it was ported to Java in an academic project at one point. Haskell’s STM was supposedly safe, but I didn’t study it. Later, Rust got safe concurrency with higher performance (and a higher learning curve). Now Pony has it. Then there are hacky things like DTHREADS. Most of these have been applied in commercial products and non-commercial projects.

                                  So it doesn’t look like it was an unsolved problem. It looks like language and tool builders just weren’t using proven solutions, for whatever reason. A typical problem in IT. I’ll note that Python in particular ignored all kinds of PLT advances on purpose, to achieve its creator’s goals for the language. It’s not even as powerful as mid-1980s Lisp. Contrast it with languages like Julia, which incorporate some advanced features while retaining code readability. In Julia’s case, it’s even compatible with Python.

                                  1. 3

                                    Go back to whatever program you have seen in production that manages its state in multiple threads/actors. Tell me that anyone but the person who wrote it can add a feature without breaking everything unexpectedly. “We can speed this up by running it concurrently” is a sentence that should be grounds to fire the person who said it on the spot.

                                    “Need more capacity? Run more instances!” brought us the scourge of Unicorn et al. and can’t be relegated to the dustbin of history soon enough. Anyone working in a reasonably modern language (Go, Scala, Clojure, Rust, etc.) should be expected to write and maintain multithreaded network servers (i.e. application services) as table stakes.

                                    1. 2

                                      Sorry, but what? Moore’s law died 10 years ago. If you want more grunt, you spin up more instances. Multi-threading is a dead end technology that should be killed off as quickly as possible, for security reasons if nothing else.

                                      I’ve worked at a company that spent $1m to modernize their .NET apps to run on threads, only to have the whole project cancelled before it was finished because someone realized you can’t buy 128-core Xeons, but you can buy 64 dual-core i3s.

                                      1. 6

                                        For business apps, maybe…

                                        Give a glass of milk to a dev at any major game studio, then tell him that “multi-threading is a dead end technology”, then let me know how much milk comes out.

                                        1. 3

                                          Make sure he has a chance to drink some of the milk before you tell him though.

                                        2. 5

                                          You’re right, I phrased that badly. But to make meaningful use of compute hardware, a single instance of a process needs to not block when a single request goes bad. One application server should handle thousands of concurrent requests, not one.

                                      2. 2

                                        Is 3 a language-specific trait, or just a product of the ecosystem that sprawled around it?

                                        I personally don’t find much that Python is uniquely better at, plus it’s not statically typed (ewww).

                                        1. 1

                                          And that’s not even counting the politics of that language, thanks Oracle.

                                          What is this politics you are referring to and why would it affect application developers?

                                          1. 1

                                            I guess things like the latest license changes to Java SE and the way they shot down the migration of Java EE to Eclipse Foundation are perfect examples.

                                            1. 1

                                              OK, if I’m developing an app, say a service, with OpenJDK, how does that affect me?

                                              1. 1

                                                OpenJDK is also controlled by Oracle and it was also affected by the license changes. Also, I’m not sure what you mean by “app” here. We have a 300K+ LOC Java EE app where we use OpenJDK, and we are certainly still affected by Oracle’s politics.

                                                1. 2

                                                  OpenJDK is also controlled by Oracle and it was also affected by the license changes.

                                                  How so?

                                        2. 3

                                          Nah, building that fully-defined image is the holy grail of maintainability. The package is fully featured, it’s got a runtime environment that will make it run anywhere, and it has minimal interaction with the world outside it. Great stuff.

                                          1. 3

                                            use of Python tends to lead to the use of Docker

                                            I kind of disagree with this. You can pretty much always deploy a well-architected Python application to a bare server with just pyenv+pipenv; you don’t need a container with “system”-installed libraries. I think it’s more likely that people use Docker because they want to use Docker.
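
                                            A minimal sketch of that bare-server deploy, assuming a Pipfile.lock is checked in (the versions and app.py are illustrative):

                                            ```sh
                                            # Pin the interpreter for this project directory:
                                            pyenv install 3.7.4
                                            pyenv local 3.7.4
                                            # Install exactly the locked dependency set, then run:
                                            pipenv install --deploy
                                            pipenv run python app.py
                                            ```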

                                            1. 4

                                              Every Python problem the author describes in the article as needing to be solved with Docker or Kubernetes has been solvable without them since long before they were even first released [1]. I think the author also misunderstands the difference between languages that support concurrency well and languages that support distributed systems well, and misses their own point that you should pick languages based on their strengths. The programming model of PHP has seen a resurgence in distributed systems as “serverless”, where stateless applications are used to respond to individual requests (making scaling and state management much easier).

                                              I think the problem Docker solves is very different from the author’s assumption that it’s used to “make the transition to the modern world of cloud computing”. There’s a lot of utility in being able to package and install applications the same way regardless of language. In the various places I’ve worked, it’s been just as useful a way to deploy Golang and Java applications as Python ones - I’ve found Java applications often come with a mountain of shell scripts intended to discover and configure system state that can be thrown out when using Docker.

                                              [1]:

                                              1. 4

                                                  I don’t get the jab at Python dependency management… It’s broken because you were initially running Python 2.7, and subsequently couldn’t figure out that mysqldb doesn’t support Python 3?

                                                1. 5

                                                  The fact that you don’t see a problem with the py2.7 vs py3.+ situation shows how much people can endure to avoid change.

                                                  That’s one of the few reasons I personally avoid Python whenever feasibly possible.

                                                  1. 11

                                                    The only problem I see is people trying to use Python 2.7 libs on Python 3, and/or refusing to upgrade. That’s not a Python problem. You should not be using Python 2. https://pythonclock.org/

                                                    Avoiding Python 3 because others use Python 2 is a non sequitur.

                                                    1. 1

                                                      Oh no, it totally follows if you are building something that needs to be strongly coupled to that other thing that’s using Python 2. See e.g. https://github.com/denoland/deno/issues/464

                                                    2. 3

                                                      The fact that you don’t see a problem with the py2.7 vs py3.+

                                                      Compared to what?

                                                      The JVM 7 vs 8 vs 9 vs … 12 situation?

                                                      Or C++17 vs 14 vs 11 vs 03?

                                                      Or how about .net vs .net core?

                                                      1. 1

                                                      Newer Java can, AFAIK, run older code; it’s backward compatible.

                                                      C++ is a mess, and older code can break under newer standards.

                                                        Unsure about the .net one as, again AFAIK, I had no problem as long as I kept up to date.

                                                        1. 3

                                                          Newer Java can, AFAIK, run older code, it’s backward compatible.

                                                          This was true until Java 9. If you’re lucky, you can solve this problem (for a few more years anyway) by pretending that nothing after Java 8 actually exists, but not everyone has that luxury.

                                                          1. 1

                                                            Sorry if this has been talked over and rehashed a million times but I’m totally unfamiliar with this situation: Java 9 broke stuff!? Like, roughly how bad, please?

                                                            The last time I touched Java in anger was years and years ago and they had the whole “snail’s pace, never break anything” attitude to backwards compatibility going on back then.

                                                          2. 2

                                                          JVM, not Java. I generally try to avoid the language as much as possible, but I do know that in the big-data space, running anything (Spark, Hadoop, Kafka) on anything newer than a 1.8 JVM is asking for Jesus to take the wheel. At least that was the case 18 months ago, when I last worked on a project where I had to do JVM devops.