We all know and love Python and its ecosystem of libraries for scientific computing. Well-known alternatives include Octave (MATLAB), Wolfram, Julia, and R. These are all essentially dynamically typed. Julia lets you optionally declare types (as does Python, but not meaningfully so inside the SciPy/Torch/TensorFlow ecosystems, at least as far as I can tell).
So, fellow crustaceans, I have two related questions:
Is there something that looks more like a statically typechecked language that is good for scientific computing (in that people use it for real work, whether prototyping or not); and
Should I learn to stop worrying and accept that I can’t have a static checker catch mistakes in my code?
For those who will ask why I care, the answer is that I’m a scientific computing dilettante, and static checking helps me avoid mistakes.
You might check out http://www.ffconsultancy.com/products/index.html if you haven’t already – it approaches scientific computing from the F# and OCaml side, though I’m not sure how close it gets to “people use it for real work.”
Yossi Kreinin wrote an interesting article in which he challenged people to implement matrix algorithms in a type-safe way. His conclusion was that many of them have intermediate steps that are very hard for most type systems to handle.
Have you considered using contracts for catching errors?
Interesting, but I think if I could have type checking for all the functions I write that wrap around linear algebra primitives, I could limit the opportunity for that class of errors to a manageable subset of my code. And then have confidence in all the functions that wrap around my functions.
Not perfect but a vast improvement.
Fortran 77?
A little more seriously, a lot of this depends on what sorts of errors you expect to catch. Syntax is one thing, but do you want something that catches, say, ill-conditioned or singular matrices?
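No type system will catch those, since they depend on runtime values, but a runtime guard can. A minimal numpy sketch (the threshold here is arbitrary, for illustration only):

    import numpy as np

    def solve_checked(a: np.ndarray, b: np.ndarray, max_cond: float = 1e12) -> np.ndarray:
        # Refuse to solve systems whose condition number suggests the
        # answer would be numerically meaningless. 1e12 is an arbitrary
        # illustrative threshold, not a recommendation.
        cond = np.linalg.cond(a)
        if cond > max_cond:
            raise ValueError(f"matrix is ill-conditioned (cond = {cond:.3g})")
        return np.linalg.solve(a, b)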
But why mention F77 when Fortran 95 is already available as free software and has modernized syntax, among other improvements?
If you could statically encode information about each variable in a scientific computation, the benefits would be immense. For example:

    wind_speed_raw_m_per_s

versus:

    WindSpeed wind_speed_raw_m_per_s;
You could make sure that no measurement is used as an input to a function unless it is (1) a wind speed measurement, (2) in meters per second, and (3) not yet adjusted for air density. You could do this with Python and type annotations, except that annotations are hard to add to anything in the Pandas et al. ecosystem. And if you could do it, I’d ask for half of my hours to be allocated to something besides QA and be much happier.
(I should add, this would be just the tip of the iceberg.)
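As a rough sketch of what I mean in Python (all names here are hypothetical, and mypy is assumed as the checker):

    from typing import NewType

    # Each type encodes measurement semantics the checker can enforce.
    RawWindSpeedMps = NewType("RawWindSpeedMps", float)            # m/s, not yet density-adjusted
    AdjustedWindSpeedMps = NewType("AdjustedWindSpeedMps", float)  # m/s, density-adjusted

    def adjust_for_air_density(raw: RawWindSpeedMps, density_ratio: float) -> AdjustedWindSpeedMps:
        # Placeholder formula, just to carry the type through.
        return AdjustedWindSpeedMps(raw * density_ratio)

    def turbine_power_kw(speed: AdjustedWindSpeedMps) -> float:
        return 0.5 * speed ** 3  # toy model

    raw = RawWindSpeedMps(12.0)
    turbine_power_kw(adjust_for_air_density(raw, 1.02))  # OK
    # turbine_power_kw(raw)  # mypy rejects this: the value is not density-adjusted yet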
F# has units of measure :)
Chapel seems like a promising new language for high-performance computing. I don’t know whether it’s ready or not.
If you want a large library ecosystem, C++ is probably your best bet. The cores of both PyTorch and TensorFlow are written in C++. There are good linear algebra libraries (Eigen, Armadillo) and more traditional machine learning libraries (OpenCV, MLPack).
I think you’re probably better off using whatever is popular in the community you operate in. You’ll have more tutorials, existing code, experienced people to draw on, etc. What you can do is use language-neutral techniques such as Design by Contract, Property[/Contract]-Based Testing, and fuzzing to catch most of those errors. For anything hard to express, just add a manual review and/or tests for it. The nice thing is you can incrementally adopt these using them as much or as little as you prefer.
Lobsters search will give you plenty of things to read on those.
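For example, a property-based test in Python using the hypothesis library; normalize here is a stand-in for whatever numeric routine you want to pin down with a contract-style property:

    from hypothesis import given, strategies as st

    def normalize(xs: list[float]) -> list[float]:
        total = sum(xs)
        return [x / total for x in xs]

    # Contract-style property: for positive inputs, the output sums to ~1.
    @given(st.lists(st.floats(min_value=0.1, max_value=1e6), min_size=1, max_size=50))
    def test_normalize_sums_to_one(xs: list[float]) -> None:
        assert abs(sum(normalize(xs)) - 1.0) < 1e-9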
OCaml has a few emerging tools, already used for some research in numerics and machine learning: owl (ocaml.xyz), lacaml (mmottl.github.io/lacaml), and slap (akabe.github.io/slap). They can also be used in Jupyter notebooks by means of the ocaml-jupyter package, although for plotting they are not at the level of Python yet.
For machine learning there are also ocaml-torch and tensorflow-ocaml which provide bindings to libtorch and tensorflow.
The biology field uses Java a lot, specifically ImageJ and Fiji.
I teach a physics lab that uses ImageJ/Fiji :) The students use it to measure the positions of silica beads in different solutions to determine the diffusion constant of the beads in those solutions.
“Scientific computing” is a big bucket, and it often depends a lot on what particular field you’re working in. Genomics computes differently than weather forecasting, which computes differently than quantum chemistry, etc. A lot of fields end up tied to a particular programming language based on the specialized libraries they’re working with, or just based on what language is used for their most common applications. For example, researchers who do molecular dynamics work with GROMACS, which is written in C++, will often do their other work in C++ because it’s familiar.
So… my first suggestion would be that you talk to the (other?) scientists you’re working with, and ask what they’re using. ;-) And if you’re not working with any scientists, I’d suggest you find some collaborators based on what you’re interested in!
FWIW: I have a lot of general experience in building and running large research computing clusters (“HPC”), and have worked mostly with physicists and chemists. At a high level, I’d say the most commonly used languages are C++ and Fortran, both of which are statically typed, often with some CUDA or another GPU-programming language thrown in. (Disclaimer: I work at NVIDIA, so I see a lot of GPU applications.)
However, I’d say that experience is biased by mostly working with large (100+ node) clusters, where my users are more likely to be using some kind of multi-node parallelism framework based on MPI, which supports C++ and Fortran best. For those who mostly work with a single node at a time, I’d say that Python is probably the language I see used most, followed by Julia.
You may not know that you can typecheck Julia as well by defining a macro.
https://nextjournal.com/jbieler/adding-static-type-checking-to-julia-in-100-lines-of-code/
There are some limitations, but you can pick up a bunch of common issues with those macros. There are some static analysis tools as well (one included in the vscode Julia extension).
F# has a good community around it. I know at least one biopharmaceutical company that uses it for literally everything.
You should not stop worrying: if you don’t have a type checker, you need comprehensive tests that make sure your type assumptions are not violated. There are domains where mistakes are costly, and those domains, in my opinion, should be leveraging every tool they can: compiler, static analysis (linters), and tests. F# supports units of measure, for example, which let you track units such as “Meters/Second” and prevent you from, say, producing “Meters/Hour” without an explicit conversion, at compile time.
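You can approximate that discipline in Python with distinct wrapper types. A hand-rolled sketch (class names and the conversion factor are illustrative; a real units library would do this generically):

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class MetersPerSecond:
        value: float

        def to_meters_per_hour(self) -> "MetersPerHour":
            return MetersPerHour(self.value * 3600.0)

    @dataclass(frozen=True)
    class MetersPerHour:
        value: float

    def log_speed(v: MetersPerHour) -> None:
        print(f"{v.value} m/h")

    v = MetersPerSecond(10.0)
    log_speed(v.to_meters_per_hour())  # OK: the conversion is explicit
    # log_speed(v)  # a checker rejects this: MetersPerSecond is not MetersPerHour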
Would strong typing have helped in the recent case where Python code running on Windows (as opposed to on Linux or Mac) returned the results of a directory read in an unexpected order?
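For reference, the failure mode looks roughly like this (using the current directory as a stand-in): os.listdir makes no ordering guarantee, so any code whose results depend on file order needs an explicit sort:

    import os

    def process(name: str) -> None:
        print(name)  # stand-in for the real per-file analysis

    # os.listdir makes no ordering guarantee; the order can differ across
    # operating systems and filesystems. Sorting makes it deterministic.
    for name in sorted(os.listdir(".")):
        process(name)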
There is static typing for Python provided by MyPy.
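A minimal example of what it catches (the function and values are made up); checking the file with `mypy example.py` flags misuse that CPython would only surface at runtime:

    def mean(xs: list[float]) -> float:
        return sum(xs) / len(xs)

    mean([1.0, 2.0, 3.0])  # fine at runtime and under mypy
    # mean("not a list")   # uncomment and mypy reports an incompatible
    #                      # argument type (str vs list[float])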
If Pandas were friendly to type annotations, I would be a happy nerd.
What kind of errors are you trying to avoid with static typing?
Any that I can, honestly.
I think I’ve seen some people on the Nim forums mention they do it.
I am pretty sure I’ve seen people in the Go community mention they do it (although I believe it takes some getting used to, especially the fact that you can’t use operator overloading for matrix operations).
The problem is that most of these comments are written by language aficionados or perhaps even strong advocates. Taking gonum as an example – the last time I used gonum, it was heavily biased towards Go’s double-precision floating point type (float64), and only a subset of the functionality was provided for the single-precision type (float32). Most machine learning these days is done at single precision, half precision (FP16), or mixed precision.

Also, Go suffers quite severely from limited escape analysis, limited inlining, weak optimizations, and no autovectorization. So, if you need something that is not implemented in those libraries, you’ll end up hand-rolling your own machine code or writing it in C and using cgo (if the C call penalty is acceptable). E.g. take something like the dot product and marvel at how simplistic the generated assembly is: https://godbolt.org/z/etRA_C

Go works well in a lot of domains, but scientific computing is not one of them.
I do scientific computing in Rust, but at this point I’d be honest and point people to Python or C++, unless they want to do a lot of heavy lifting themselves.