Kubernetes Runner¶
Warning
Work in progress. Some functionality described below may not be available.
The Kubernetes Runner runs a MLCube® on a Kubernetes cluster.
Why Kubernetes?¶
One of the key goals of the MLCube project is to enable portability of ML models. Kubernetes offers a good set of abstractions to enable model training to be portable across different compute platforms.
Design¶
Kubernetes Runner Proposal Doc
The Kubernetes runner takes in a kubernetes specific task file in the run
directory and re-uses the Docker runner
platform config and prepares a Kubernetes Job manifest. The runner then creates the job on the Kubernetes cluster.
Configuration parameters¶
Attention
Currently, users must create persistent volume claim (PVC) that points to an actual MLCube workspace directory.
# By default, PVC name equals to the name of this MLCube (mnist, matmul, ...).
pvc: ${name}
# Use image name from docker configuration section.
image: ${docker.image}
The Kubernetes runner constructs the following Kubernetes Job manifest.
apiVersion: batch/v1
kind: Job
metadata:
namespace: default
generateName: mlcube-mnist-
spec:
template:
spec:
containers:
- name: mlcube-container
image: mlcommons/mlcube:mnist
args:
- --data_dir=/mnt/mlcube/mlcube-input/workspace/data
- --model_dir=/mnt/mlcube/mlcube-output/workspace/model
volumeMounts:
- name: mlcube-input
mountPath: /mnt/mlcube/mlcube-input
- name: mlcube-output
mountPath: /mnt/mlcube/mlcube-output
volumes:
- name: mlcube-input
persistentVolumeClaim:
claimName: mlcube-input
- name: mlcube-output
persistentVolumeClaim:
claimName: mlcube-output
restartPolicy: Never
backoffLimit: 4
Configuring MLCubes¶
This runner does not need configure step.
Running MLCubes¶
Algorithm is following:
- Load Kubernetes configuration.
- Create job manifest (see above).
- Create job and wait for completion.