Cool stuff! I’ve struggled with related problems involving heterogeneous workloads in the past. There’s an interesting, and to me disappointing, divide between new-school work scheduling systems (e.g. Storm, Spark, the Hadoop ecosystem, Airflow, Celery and other simple task queues) and old-school HPC schedulers (TORQUE, Sun Grid Engine, SLURM, etc.).
The former have nicer APIs, are more dynamically tunable and configurable, and often come with useful dashboards and visualizations. As far as I can tell (though I would love to be wrong), none of them support attaching resource constraints to jobs/tasks (e.g. “this task requires 4 cores, 32GB of memory, 250GB of disk space, and at least this much time”) and then intelligently scheduling jobs to honor those constraints. That’s a feature I’ve needed in the past, and it sounds like it would have helped the OP as well.
The latter do support resource scheduling, but are obscure to configure and use.
The only exception I’m aware of is Mesos, which has a very different way of looking at the world from most tools, and which I also understand to be something of a bear to operate.
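To make the resource-constraint idea concrete, here is a minimal sketch of the kind of scheduling I mean: each task declares the cores, memory, and disk it needs, and a first-fit pass places it on the first node with enough free capacity. Everything here (the `Resources`, `Node`, and `schedule` names, and the first-fit policy itself) is made up for illustration and is not the API of any of the tools mentioned above. The "at least this much time" constraint is left out to keep it short.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Resources:
    """A bundle of resources, used both for task requests and node capacity."""
    cores: int
    memory_gb: int
    disk_gb: int

    def fits_in(self, available: "Resources") -> bool:
        # A request fits only if every dimension fits.
        return (
            self.cores <= available.cores
            and self.memory_gb <= available.memory_gb
            and self.disk_gb <= available.disk_gb
        )

    def minus(self, used: "Resources") -> "Resources":
        return Resources(
            self.cores - used.cores,
            self.memory_gb - used.memory_gb,
            self.disk_gb - used.disk_gb,
        )


@dataclass
class Node:
    name: str
    free: Resources


def schedule(tasks: Dict[str, Resources], nodes: List[Node]) -> Dict[str, Optional[str]]:
    """First-fit placement: map each task name to a node name, or None if nothing fits."""
    placement: Dict[str, Optional[str]] = {}
    for task_name, need in tasks.items():
        placement[task_name] = None
        for node in nodes:
            if need.fits_in(node.free):
                # Reserve the resources so later tasks see the reduced capacity.
                node.free = node.free.minus(need)
                placement[task_name] = node.name
                break
    return placement


nodes = [
    Node("small", Resources(cores=2, memory_gb=8, disk_gb=100)),
    Node("big", Resources(cores=8, memory_gb=64, disk_gb=500)),
]
tasks = {
    "etl": Resources(cores=4, memory_gb=32, disk_gb=250),
    "report": Resources(cores=2, memory_gb=4, disk_gb=50),
    "huge": Resources(cores=16, memory_gb=128, disk_gb=1000),
}
result = schedule(tasks, nodes)
```

With these made-up numbers, "etl" only fits on the big node, "report" lands on the small one, and "huge" stays unplaced. A real scheduler would also need queuing, priorities, and release of resources when tasks finish, which is where the HPC schedulers earn their complexity.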
I wrote the original article at Keen. I agree with your observation: being able to attach resource constraints to tasks would help in building scalable services that are multi-tenant or deal with heterogeneous workloads.
That said, Storm does let you control resources to some extent by specifying the number of workers (JVMs) per topology, the number of executors (threads) for each bolt, heap sizes for individual JVMs, and so on. You can also write a custom scheduler that decides how to distribute those JVMs across a set of hosts (which is what we did). Unfortunately, that was still not fine-grained enough for us: we wanted to control resource allocation per request.
I feel that there is no one-size-fits-all framework for building services that run at high scale. My experience has been that with any framework it’s easy to get started and get a service up and running, but once you have to optimize for performance, cost, or scalability, you need a much deeper understanding of that particular framework so you can tune it accordingly.