1. 18

  2. 8

    One of the most interesting bits was a callout to Python 2.7 at the end, implying that it’s now a great tool for reproducible science code because they’ve finally committed to not changing it. Ironic that its final death blow might instead be the thing that makes it live forever.

    1. 2

      It seems to me that software ecosystems would be so much nicer if hacking had never been invented. You could keep using the same tried and true software forever if you wanted.

      1. 1

        At some point, building python 2.7 could become impossible.

        1. 1

          As long as you keep a copy of the source code and the source code of the dependencies, you should always be able to build it. This article mentions Guix as a useful tool for reproducible builds and also mentions a number of websites which archive source code. If you no longer have access to at least one architecture which was supported by the last release of Python 2.7, it may become slightly more difficult, as you will need to write an emulator first.

      2. 3

        This passage surprised me:

        Newer languages’ rapidly evolving APIs and reliance on third-party libraries make them vulnerable to breaking. In that sense, the sunsetting of Python 2.7 at the start of this year represents an opportunity for scientists, Rougier and Hinsen note.

        Perhaps Python 2.7 will become the Latin of the 21st century? At least it would be a better choice than R or Matlab.

        1. 2

          What is wrong with R?

          1. 5

            With R itself not that much; the biggest issue is the package ecosystem: half-baked packages, abandoned once their maintainer finished their PhD or published the article about the package or its uses. As always in open source and free software, maintainers do not owe the user anything, but because of the high technical level of some packages, the heavy use of Rcpp and C++, and the non-existent community around those packages, you have to be extremely careful when choosing which dependency to use, and hope for the best for your future.

            My experience in R and research was mostly that. When you try to do something harder than usual or closer to the state of the art, you have to be extra careful about abandonware and unmaintained packages. Honestly, the knowledge gap between those who create those packages and the potential users is huge, and it is very hard to understand and modify those packages while working on your research at the same time. This also implies the need to trust that the results coming from the package are correct. You have a really hard time assessing the quality of the ecosystem.

            1. 3

              This is very close to my experience in providing R capabilities for data science.

              Another issue I had was that the CRAN repository only provided source packages for Linux but binaries for Windows and macOS, which resulted in constant consistency issues across environments, depending on the build environment and the versions of the build tools at the time each build occurred. This was tedious to manage. I believe RStudio have a solution for this.

              I ended up tackling this by creating binary package build pipelines directly from CRAN into RPMs, transforming the CRAN metadata into package metadata/dependencies and then building on a single build server for consistency. These packages were then used for local distribution, but this process started to uncover inconsistencies in the metadata in the CRAN package sources, and I then had constant battles with metadata edge cases.

              1. 1

                You are the one I hoped to have on my side during this time :) I mainly used RStudio, which compiles the package on the fly when you install it but does not manage the external dependencies at all (which is normal). You can also end up with a package that does not compile because of some Rcpp-related flags you have no access to in the interface.

                I made a small script that listed the packages we needed, installed them directly, and loaded them into the environment with require. That way I could ask the less tech-savvy researchers to simply run it before lunch and come back to a correct working environment afterwards. It was less fancy and more down to earth, and I avoided dealing with version management and so on, because we were a small team and I knew that nobody would reuse the code after that project.
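
                For illustration only, here is a Python sketch that generates that kind of R setup script. The function name, the default mirror URL and the package names in the usage note are assumptions, not the original script.

```python
import re

# Conservative check for R package names: letters, digits and dots,
# starting with a letter.
VALID_NAME = re.compile(r"[A-Za-z][A-Za-z0-9.]*")


def r_setup_script(packages, repo="https://cloud.r-project.org"):
    """Return R code that installs any missing packages, then loads them all."""
    for name in packages:
        if not VALID_NAME.fullmatch(name):
            raise ValueError(f"suspicious package name: {name!r}")
    quoted = ", ".join(f'"{name}"' for name in packages)
    return "\n".join([
        f"pkgs <- c({quoted})",
        # Only install what is not already present in the library.
        "missing <- setdiff(pkgs, rownames(installed.packages()))",
        f'if (length(missing) > 0) install.packages(missing, repos = "{repo}")',
        # Attach every package so the session is ready to use.
        "invisible(lapply(pkgs, library, character.only = TRUE))",
    ])
```

                Writing the result of something like r_setup_script(["ggplot2", "sf"]) to a file and running it with Rscript gives the "run it before lunch" workflow. Note that it pins no versions, which is the trade-off described above.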

                1. 2

                  Your script approach is similar to where I started. I could write a book about these kinds of experiences, but I doubt the book sales would reach double digits.

                  The initial goal of the approach was to be able to convert a scientist’s/developer’s R code into functioning containers that could be executed in a shared Kubernetes compute cluster. This made it feel like a clash between old world and new world. Container pipelines for other languages were reliable; it was only R that really started to cause issues, due to the heavy compilation requirement. Here is a quick summary of the iterations I remember it going through.

                  Iteration 1
                  The base container was built with the base OS, compiler prerequisites, CRAN and an entrypoint script. The container entrypoint script would take a project’s requirements from a developer metadata file and then install them using CRAN, with all packages compiled using CRAN from a locally hosted replica of the CRAN upstream repository. The idea was that a developer would have an R project in git listing its dependencies; the pipeline would fire, pull the base container image, add the CRAN requirements and the developer’s project to the container, and ship it to the container registry.

                  The problem was that the iteration time started to grow. If a developer included something dependency-heavy like ‘Shiny’, the build phase of the container pipeline would take minutes, which doesn’t seem like much, but when developers are iterating quickly, fixing missing dependencies etc., it really added up. On top of that, as more developers were onboarded, so many parallel CRAN compiles started to cause CPU contention on the build servers, which slowed the developers’ pipelines down further. Another side effect was that CRAN installs would fail because something in the CRAN repo had changed, causing failed dependencies or build/compile problems.

                  Iteration 2
                  This involved adding an ‘intermediate’ layer to the base container that had prebuilt common CRAN packages already installed. This list started small based on frequency of use, but slowly grew from 10 to 100+ packages, largely due to CRAN package dependencies. The developer would then pull the base and intermediate layer together and still add their R project in the final container layer, with any additional packages that weren’t covered in the intermediate layer. This started out well with iteration and build times dropping dramatically.

                  The problem was, the intermediate layer was slowly growing, and it introduced management overhead to find the balance for what should/shouldn’t be in the base container. The intermediate layer required regular rebuilding by resources outside the developer teams to incorporate new additions when the final layer build phase started ‘feeling’ slow.

                  In the end, the inevitable happened. Multiple CRAN packages in the final layer had hard dependencies on specific versions of R packages in the intermediate layer, so due to conflicts, packages were slowly removed from the intermediate layer, which then meant anything that depended on those packages in the intermediate layer was also removed. Build times grew again, enough to find it hard to justify the effort.
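
                  The cascade can be sketched as a walk over reverse dependencies. This is only an illustration in Python; the function name and the shape of the input (a dict mapping each package in the layer to the packages it requires) are assumptions, not the tooling used at the time.

```python
from collections import deque


def cascade_removal(depends_on, conflicting):
    """Return every package that has to leave the layer.

    depends_on maps each package in the layer to the packages it requires;
    conflicting lists the packages removed because of version conflicts.
    """
    # Invert the graph: for each package, which packages depend on it?
    dependents = {}
    for pkg, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(pkg)

    removed = set(conflicting)
    queue = deque(conflicting)
    while queue:
        pkg = queue.popleft()
        # Anything that needs a removed package must be removed as well.
        for dependent in dependents.get(pkg, ()):
            if dependent not in removed:
                removed.add(dependent)
                queue.append(dependent)
    return removed
```

                  With a made-up layer where shiny requires httpuv and httpuv requires Rcpp, removing Rcpp alone takes httpuv and shiny with it, which is how a single conflict could hollow out the layer.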

                  Iteration 3
                  This is what was described in my original post above. The goal of the approach was to move all the compilation of CRAN packages, which was the major contributor to slow container builds, out of the container build pipeline. This was achieved by having an ‘out of band’ RPM build pipeline that would essentially build everything an R project required by pulling down the CRAN package source, compiling it and packaging it as an RPM. The RPM package format covered all the requirements for managing dependencies, versioning, and distribution. The resulting CRAN binary RPM packages were pushed to a local Yum repository, and the entrypoint script for the R container was changed to install the defined CRAN dependencies using yum instead of native CRAN, based on the metadata provided.
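
                  Building everything a project requires implies compiling packages in dependency order, since an R package generally needs its dependencies installed before it can be built. How the pipeline handled this is not described here; a minimal Python sketch, assuming the dependency graph is already available as a dict and using the standard library’s graphlib (Python 3.9+), could look like this:

```python
from graphlib import CycleError, TopologicalSorter


def build_order(depends_on):
    """Order packages so each one is built after everything it depends on.

    depends_on maps a package name to the packages it requires
    (a hypothetical input format for this sketch).
    """
    try:
        # TopologicalSorter takes node -> predecessors, which matches
        # package -> dependencies directly.
        return list(TopologicalSorter(depends_on).static_order())
    except CycleError as err:
        # The detected cycle is the second element of the exception args.
        raise ValueError(f"dependency cycle: {err.args[1]}") from err
```

                  Each package in the returned list could then be compiled and packaged in turn, with its dependencies already available from earlier steps.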

              2. 2

                Python has the same issue of high-performance code just calling out to C, though. I do understand the cultural issue. Has Python treated you better?

                1. 2

                  As you said, it is mainly cultural but also field-specific.

                  My experience with Python and R in research was in two different fields. I used Python in remote sensing, where the mainstream tools were industry-grade and standard (ArcGIS, QGIS, Postgres + PostGIS, GDAL, OGR, PROJ, etc.). I was far more confident about the quality of the libraries I used at that time. The libraries and software were driven more by software engineering than by pure research. Python being the lingua franca of the field, and the field being close to industry, made it a better experience for me. Python itself was just a heavy glue language and treated as such.

                  I used R for spatio-temporal modelling in public health. This space is crowded with either big teams with a lot of computing power at hand using heavy MCMC techniques, or a lot of PhD students and small teams writing one-off packages with a very short lifespan. Some packages were perfect for the use case at hand but fell apart when you tried to go further than the examples in the documentation and/or published articles (if you are lucky). Amazing tools and packages exist in R; one of my favourites is R-INLA [1], for example. But to use it properly, you need a good background in statistics and in programming to clearly understand what you are doing when complexity arises.

                  The experience with R is not bad; you just have to be more cautious and learn a few tricks along the way when you are building your toolbox, and try to bet on stable, common packages that will outlive your research.

                  [1] http://www.r-inla.org/

              3. 1

                Nothing wrong with it really as a lingua franca for science. It’s just that Python is better suited for software engineering while still being a decent choice for scientific code.

                If we’d need to stick with a language for centuries (like Latin) I’d expect some big systems to emerge and those would be nicer to work with in Python. Maybe this is just my software engineer’s bias speaking though.

                1. 1

                  Have you tried Julia? I have heard good things about it, and it seems better thought out than Python or R.

                  1. 1

                    Julia is interesting but its “time to first plot” is still too long in my opinion and it needs LLVM which might prove problematic in the long run. Just like any dependency.

                    1. 4

                      How can LLVM be problematic in the long run? Like GMP, it is one of the dependencies I am not afraid of going anywhere in the long term.

                      Concerning Julia and “time to first anything”, I understand, but in an exploratory workflow I did not feel the slowness that much. I really hope that Julia gains more market share in scientific programming, as it provides an easier approach than Python and the metaprogramming capabilities of R, coupled with high performance. The heavy use of the C ABI already makes it possible to call Python, R, C, Fortran and C++ from Julia and leverage past codebases. On a personal level, Julia gives me an Elixir feel, where you have access to Lisp-level power without too many guns to shoot yourself in the foot with. It is not perfect; I just like it, but I am biased that way.

                      1. 2

                        LLVM is a big project and the Julia language developers have no direct control over it. I agree that it doesn’t look like it’s going anywhere, but you never know.

                        I’m glad to hear Julia worked out for you. I really like the idea of kissing goodbye to all numpy vectorization tricks and just typing those nested for-loops in full.