1. 28
    1. 6

      I read this article a few days ago and it bounced around in my head a bit.

      I think it could be quite interesting to build a batch processing system like airflow on pure kubernetes resources and operators:

      • First I would create a JobTemplate that is like a CronJob but unscheduled; there is already kubectl create job --from=cronjob/foo (a sketch of such a resource follows the DAG example below).
      • Then a DAGTemplate could be written like:
      start:
        - jobTemplateName: A
          id: A
        - jobTemplateName: B
          id: B-1
      steps:
        - type: fanout
          follows: A
          tasks:
            - jobTemplateName: B
              id: B-2
            - jobTemplateName: C
              id: C
        - type: fanin
          follows:
            - A
            - B-1
            - B-2
          jobTemplateName: D
          id: D
        - type: simple
          follows: C
          jobTemplateName: E
          id: E
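
      A hypothetical JobTemplate (the batch.example group and every name below are invented) could mirror a CronJob's jobTemplate block, just without the schedule:

      apiVersion: batch.example/v1   # made-up group/version for illustration
      kind: JobTemplate
      metadata:
        name: A                      # referenced above as jobTemplateName: A
      spec:
        jobTemplate:
          spec:
            template:
              spec:
                restartPolicy: Never
                containers:
                - name: main
                  image: registry.example.com/step-a:latest   # placeholder image
                  command: ["python", "step_a.py"]            # placeholder command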
      

      My thinking is that kubernetes could generate a PV for every job to put results into (sizes etc. configurable) and mount that PV into subsequent jobs as ReadOnly. If the PVs or PVCs are configured to linger around after a run this would allow easy debugging by starting jobs from the job-template oneself.
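
      Sketched as plain Kubernetes objects, that could mean one PVC per step for its results, mounted read-only into downstream Jobs (names and sizes below are made up):

      apiVersion: v1
      kind: PersistentVolumeClaim
      metadata:
        name: results-a              # created for a run of step A
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 1Gi             # "sizes etc. configurable"
      ---
      apiVersion: batch/v1
      kind: Job
      metadata:
        name: step-d
      spec:
        template:
          spec:
            restartPolicy: Never
            containers:
            - name: main
              image: registry.example.com/step-d:latest   # placeholder image
              volumeMounts:
              - name: input-a
                mountPath: /inputs/a
                readOnly: true       # upstream results are read-only here
            volumes:
            - name: input-a
              persistentVolumeClaim:
                claimName: results-a
                readOnly: true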

      Instantiating a DAG from a DAG template can then be done with a “simple” API call to the kubernetes API.
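
      That call could be as small as creating a tiny resource that names the template (the kind and group are again hypothetical):

      apiVersion: batch.example/v1
      kind: DAGRun                     # hypothetical kind, analogous to Job vs. CronJob
      metadata:
        generateName: nightly-etl-
      spec:
        dagTemplateName: nightly-etl   # references a DAGTemplate like the one above (name invented)

      Saved to a file, kubectl create -f would then play the same role for DAGs that kubectl create job --from=cronjob/foo plays for single jobs.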

      This is all non-actionable, I just needed to get it out of my brain :-D

      1. 4

        This seems similar to workflow description languages like Snakemake and CWL.

      2. 3

        This seems almost exactly like an Argo Workflows DAG
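
        For comparison, a minimal Argo Workflows DAG that fans out from A into B and C and fans back in at D looks roughly like this (image and names are placeholders):

        apiVersion: argoproj.io/v1alpha1
        kind: Workflow
        metadata:
          generateName: dag-example-
        spec:
          entrypoint: main
          templates:
          - name: main
            dag:
              tasks:
              - name: A
                template: step
              - name: B
                template: step
                dependencies: [A]
              - name: C
                template: step
                dependencies: [A]
              - name: D
                template: step
                dependencies: [B, C]
          - name: step
            container:
              image: alpine:3.19
              command: [echo, "running a step"]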

        1. 1

          This is what we use at $work to enable analysts to deploy bespoke tools (packaged into Docker images) to a fairly complex-but-extensible pipeline/DAG. This lets us lean into “containers as sandboxes” and “k8s as horizontal scaling platform” while keeping things fairly declarative. All in all, it’s worked quite well, in my opinion.

    2. 6

      The “batch jobs vs services” tension is one that I’ve run into a lot over the past several years, as I’ve spent quite a bit of time working with organizations building on-premise compute clusters. Many of these orgs have both large batch jobs and a constellation of microservices to run, and quite reasonably wanted to use the same tooling for both use cases. Often the preferred tooling was Kubernetes, and we ran into many of the issues in this post.

      It’s worth noting, though, that there are a lot of pre-existing cluster schedulers designed around the batch job use case, whether that’s a “big data” framework like Hadoop or an HPC scheduler like Slurm. The HPC schedulers in particular often have gang scheduling capabilities for multi-node jobs, as well as fine-grained tooling for managing resource allocations and policies around batch jobs. The downside is that they usually lack any mechanism for scheduling services.

      Occasionally I’ve seen Kubernetes bent to fit well enough to run both use cases, but more often I’ve seen organizations split their cluster in two — one running services in Kubernetes, the other running some other batch job scheduler. It’s a deeply unsatisfying solution, and increases operational load for the team that needs to run two clusters… but it has also generally been the most effective way to make each type of workload run well.

      TBH, I’d love to see Kubernetes get as good at batch jobs as the batch-oriented tools in this space, so that you don’t need different schedulers for services and jobs. But the use cases do have pretty different requirements, so having one tool for both might just be really hard.

      (Edited to fix typo)

      1. 1

        As far as I know, Oak Ridge Labs runs two clusters. There are some ideas from DKube here, with pros and cons. I think these are two problems and two systems, though maybe that will change in the future. I’ve never grokked this entire stack on my own.

    3. 4

      Kubernetes is also not intended for processes that exit quickly. I have an ETL job that I want to reset itself by exiting and respawning. Kubernetes is a tool that respawns containers when they exit, so this would seem like a perfect fit, but Kube detects a crash loop if a process consistently exits after running for less than 10 minutes.
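
      A minimal way to reproduce it, assuming a container that does its work and exits 0 within a few seconds:

      apiVersion: v1
      kind: Pod
      metadata:
        name: fast-etl
      spec:
        restartPolicy: Always        # respawn the container whenever it exits
        containers:
        - name: etl
          image: alpine:3.19         # placeholder for the actual ETL image
          command: ["sh", "-c", "echo run batch && sleep 5"]   # exits 0 after ~5s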

      1. 2

        Wow, even if exit code is 0?

        1. 2

          Sadly yes

    4. 2

      Do people think that kubernetes was designed for batch jobs?

      1. 5

        Probably :-D

        I am of the mind that kubernetes is a nice abstraction to run basically anything, which in this case would mean running Airflow and letting that create Pods. That works rather well in my experience.

    5. 2

      One theme throughout these features is that Kubernetes assumes that the code it’s running is relatively easy to restart.

      Crucially, just because it is easy does not mean that it is quick.

    6. 2

      I really like https://temporal.io/ to bridge the gap between long-lived services and batch jobs

    7. 2

      High performance computing might use a Kubernetes connector to a scheduler. The scheduler is delayed / batch while Kubernetes is realtime / service. They aren’t the same thing; that’s why they are connected but separate. I’ve seen people consider building something themselves, and that’s their decision. But when they don’t know what the use case or design is, that’s not a good place to start from.

      I’ve also framed this as the mainframe world vs the app world (sorry, I don’t know what else to call it). There’s no pejorative meant here; they are different use cases with different users. Mainframes were time-sharing, not realtime, and that’s the model compute clusters came from. There’s been a bit of work on trying to blend the two because the app world has a lot of nice tooling and innovation.

      1. 2

        Ironic because Kubernetes is literally called a ‘job scheduler’.

    8. 1

      One aspect of batch jobs is that we often run them ad-hoc for research, development, or troubleshooting. For example, we edit some code or tweak some data, and then rerun our linear regression to see if we get better results. For these ad-hoc jobs there would ideally be some way to take e.g. a CSV file that we’re working with locally, and “upload” it to a volume that we could read from a pod. This scenario isn’t supported by Kubernetes, though, so we have to figure out other ways to get data into our pod.

      What I’ve been doing lately is turning Job manifests into Pods that just idle forever but have the image/tooling I need on them, then using kubectl exec to run the job itself and kubectl cp to move the data in and out of the container. This works really well, but only after reading this post do I realize just how ironic it is that I had to turn a Job into a Pod to get the batch processing workflow I want.
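
      Concretely, that looks something like this (the image name is a placeholder); data goes in with kubectl cp, the job runs via kubectl exec, and results come back out with kubectl cp again:

      apiVersion: v1
      kind: Pod
      metadata:
        name: adhoc-workbench
      spec:
        restartPolicy: Never
        containers:
        - name: tools
          image: registry.example.com/analysis-tools:latest   # placeholder tooling image
          command: ["sleep", "infinity"]                       # idle forever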