
Hello. Let me first say that this post is mainly about promoting an open-source library that I’m working on. At the same time, however, I want to make a case for why I started developing it and hopefully open a discussion on that topic.

Let me start with the “what” and then move on to the “why”. The project is essentially an in-memory, traditional machine learning library with a focus on three main features: performance (numerical operations delegated to BLAS/LAPACK/ARPACK), a scikit-learn-like API [1] (but in idiomatic Scala, using typeclasses), and immutable estimators (easily used in parallel code).
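To give a rough idea of what I mean by the last two points, here is a deliberately simplified sketch of a typeclass-based, scikit-learn-like design with immutable fitted models. All names below are illustrative, not the library's actual API:

```scala
// Hypothetical sketch only; names and signatures are illustrative, not the library's API.

// Typeclass: how to fit hyperparameters P into an immutable fitted model M.
trait Estimator[P, M] {
  def fit(params: P, x: Array[Array[Double]], y: Array[Double]): M
}

// Typeclass: how to predict with a fitted model M.
trait Predictor[M] {
  def predict(model: M, x: Array[Array[Double]]): Array[Double]
}

// A deliberately trivial estimator: always predicts the training mean.
final case class MeanRegressor()                   // hyperparameters (none here)
final case class MeanRegressorModel(mean: Double)  // immutable fitted state

object MeanRegressor {
  implicit val estimator: Estimator[MeanRegressor, MeanRegressorModel] =
    new Estimator[MeanRegressor, MeanRegressorModel] {
      def fit(params: MeanRegressor, x: Array[Array[Double]], y: Array[Double]): MeanRegressorModel =
        MeanRegressorModel(y.sum / y.length)
    }

  implicit val predictor: Predictor[MeanRegressorModel] =
    new Predictor[MeanRegressorModel] {
      def predict(model: MeanRegressorModel, x: Array[Array[Double]]): Array[Double] =
        Array.fill(x.length)(model.mean)
    }
}

object Demo extends App {
  val x = Array(Array(1.0), Array(2.0), Array(3.0))
  val y = Array(2.0, 4.0, 6.0)

  // Fitting returns a new immutable value; nothing is mutated in place,
  // so models can be trained and evaluated concurrently without synchronization.
  val model = implicitly[Estimator[MeanRegressor, MeanRegressorModel]].fit(MeanRegressor(), x, y)
  val preds = implicitly[Predictor[MeanRegressorModel]].predict(model, x)

  println(preds.mkString(", ")) // 4.0, 4.0, 4.0
}
```

The point of the typeclass encoding is that fitting is just a pure function from data to an immutable model value, which is what makes cross-validation or ensembling over a thread pool straightforward.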

When it comes to building distributed systems, Scala is arguably a strong candidate thanks to Akka, Spark, Apache Beam, etc. Similarly, most ML libraries in Scala focus on distributed training, which is perfectly fine if you are dealing with enormous datasets. Nevertheless, I'd argue that in most cases there is no need for distributed learning, and that this big data of yours probably fits into RAM. If that's the case, distributed training introduces significant but avoidable complexity. For example, why use SGD when all data points can be used to compute the exact gradient? Why train weak learners on different nodes when the network latency can be avoided altogether? Scikit-learn itself could serve as evidence that this reasoning is not completely erroneous [2].
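To make the full-batch point concrete: when the whole design matrix fits in memory, the exact gradient of the (1/2n)-scaled squared-error loss, grad = (1/n) * X^T (Xw - y), can be computed directly over all samples instead of being estimated from mini-batches. A toy sketch (plain Scala, no library calls, purely illustrative):

```scala
// Toy illustration, not library code: exact full-batch gradient of the
// (1/2n) * sum((x_i . w - y_i)^2) loss for linear regression.
def fullBatchGradient(x: Array[Array[Double]], y: Array[Double], w: Array[Double]): Array[Double] = {
  val n = x.length
  val grad = Array.fill(w.length)(0.0)
  for (i <- 0 until n) {
    val row = x(i)
    val err = row.indices.map(j => row(j) * w(j)).sum - y(i) // prediction error for sample i
    for (j <- w.indices) grad(j) += err * row(j) / n          // accumulate (1/n) * err * x_i
  }
  grad
}
```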

Note that model serving is completely stateless and can thus easily be scaled horizontally, regardless of whether training was distributed or in-memory.

Anyway, I’d be glad to hear your opinion on this, or any feedback on the released library. Cheers.

References:

  1.