The community has many ongoing efforts to address the needs of AI/ML workloads. We think it would be valuable to discuss which areas are impacted, and what more we can or should be doing.

Here are some areas to understand, organized roughly as concentric circles starting at the node level and moving up the stack. Some relate directly to AI/ML; others simply advance infrastructure technologies (often in service of AI/ML). I am 100% sure I am missing many things below...
- Kubernetes Data Plane
  - Mutable container resources (in-place pod update)
  - Pod-level resources
  - Mutable nodes
  - Mutable containers in pods to support hierarchical scheduling
  - DRA / device management, including CPU, GPU, NIC, and memory
  - OCI volume source
- Kubernetes Control Plane
  - kube-scheduler (pod scheduling)
  - Workload-level scheduling (bigger than pods)
  - Workload auto-scaling (including adapting to things like mutable pods)
  - Cluster auto-scaling (including adapting to things like mutable nodes)
  - Pod groups or similar resource-reservation mechanisms
  - Supporting specialized schedulers
- Workload Controllers
- Higher-Layer Solutions
- Ecosystem
  - Slurm
  - Ray
  - vLLM
  - Model serving / management
  - Kubeflow
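To make one of the data-plane items above concrete: in-place pod update (the "In-Place Update of Pod Resources" work) aims to let container resources change without recreating the pod. A rough sketch of what opting in looks like is below; exact field names and feature-gate status vary by Kubernetes version, so treat this as illustrative rather than authoritative:

```yaml
# Illustrative only: a pod that allows its CPU request to be resized
# in place (assumes the InPlacePodVerticalScaling feature gate is
# enabled on the cluster; pod name and image are placeholders).
apiVersion: v1
kind: Pod
metadata:
  name: resizable-pod
spec:
  containers:
  - name: app
    image: registry.k8s.io/pause:3.9
    resizePolicy:
    - resourceName: cpu
      restartPolicy: NotRequired   # resize CPU without restarting the container
    resources:
      requests:
        cpu: "500m"
        memory: "256Mi"
```

In recent versions the resize itself is driven through a dedicated subresource (e.g. `kubectl patch ... --subresource resize` in newer kubectl releases) rather than by editing the pod spec directly; workload and autoscaling controllers would need to adapt to this, which is part of why it appears in the list above.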
Let's brainstorm!
- What other areas does AI/ML impact?
- What else should we be doing?
***
Got an idea for a session? Submit your topic, or upvote existing ones that catch your eye, in our GitHub Discussion. The Summit co-chairs and CNCF Projects Team will decide which sessions to add to the schedule midday during the Summit, based on community votes.