v1.7.0
What's New
Enhanced Plugin for PyTorch Jobs
As one of the most popular AI frameworks, PyTorch has been widely used in deep learning fields such as computer vision and natural language processing. More and more users turn to Kubernetes to run PyTorch in containers for higher resource utilization and parallel processing efficiency.
Volcano 1.7 enhanced the plugin for PyTorch Jobs, freeing you from the manual configuration of container ports, MASTER_ADDR, MASTER_PORT, WORLD_SIZE, and RANK environment variables.
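To illustrate what the plugin saves you from wiring up by hand, here is a minimal sketch of how a training process might consume those four variables once they are present in the container. The `DistEnv` and `read_dist_env` names and the example host name are hypothetical and not part of Volcano or PyTorch; the sketch only assumes the variables are set as described above.

```python
import os
from dataclasses import dataclass
from typing import Mapping

REQUIRED_VARS = ("MASTER_ADDR", "MASTER_PORT", "WORLD_SIZE", "RANK")


@dataclass(frozen=True)
class DistEnv:
    """Rendezvous settings for one process (hypothetical helper type)."""

    master_addr: str
    master_port: int
    world_size: int
    rank: int

    @property
    def init_method(self) -> str:
        # TCP rendezvous URL; torch.distributed also accepts "env://",
        # which reads the same four variables directly.
        return f"tcp://{self.master_addr}:{self.master_port}"


def read_dist_env(environ: Mapping[str, str] = os.environ) -> DistEnv:
    """Read and validate the variables listed in the release notes."""
    missing = [name for name in REQUIRED_VARS if name not in environ]
    if missing:
        raise KeyError("missing environment variables: " + ", ".join(missing))

    env = DistEnv(
        master_addr=environ["MASTER_ADDR"],
        master_port=int(environ["MASTER_PORT"]),
        world_size=int(environ["WORLD_SIZE"]),
        rank=int(environ["RANK"]),
    )
    if not 0 <= env.rank < env.world_size:
        raise ValueError(f"RANK {env.rank} is outside [0, {env.world_size})")
    return env


if __name__ == "__main__":
    # Example values only (assumption); in a real Job they are set per pod.
    example = {
        "MASTER_ADDR": "pytorch-job-master-0",
        "MASTER_PORT": "23456",
        "WORLD_SIZE": "3",
        "RANK": "1",
    }
    env = read_dist_env(example)
    print(env.init_method, "rank", env.rank, "of", env.world_size)
```

In an actual training script, these values would typically be handed to `torch.distributed.init_process_group`, either explicitly or through the `env://` init method.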
Enhanced plugins are now available for TensorFlow, MPI, and PyTorch Jobs. They are designed to help you run computing jobs on your preferred training framework with ease.
Volcano also provides an extended development framework for you to tailor Job plugins to your needs.
Refer to the links for more details. (#2313, @ccchenjiahuan)
Ray on Volcano
Ray is a unified framework for scaling AI and Python applications. It can run on a single machine, a cluster, a cloud provider, or Kubernetes. Its community and ecosystem are growing steadily.
As machine learning workloads become more compute-intensive than ever, single-node environments can no longer provide enough resources for training tasks. This is where Ray comes in: it coordinates the resources of an entire cluster, instead of a single node, to run the same code. Ray is designed to be general-purpose and to handle many types of workloads.
For users running multiple types of Jobs, Volcano partners with Ray to provide high-performance batch scheduling. Ray on Volcano has been released in KubeRay 0.4.
Refer to the links for more details. (#2601, #755, @tgaddair)
Enhanced Scheduling for Kubernetes Long-Running Services
This enhancement makes Volcano fully compatible with the Kubernetes default scheduler for long-running services. As a result, users can rely on Volcano to schedule long-running services and batch workloads uniformly in a single cluster.
Refer to the links for more details:
- support multi scheduler name for scheduler and webhook (#2393, @jinzhejz)
- Add nodeVolumeLimits plugin (#2458, @jiangkaihua)
- Volcano support volumeZone plugin (#2480, @jiangkaihua)
- Add podTopologySpread plugin (#2487, @Monokaix)
- Add selector spread plugin (#2500, @elinx)
Support Kubernetes v1.25
Volcano is now compatible with Kubernetes 1.25.
Refer to the links for more details. (#2533, @wangyang0616)
Support multi-arch images for Volcano
This feature supports cross-compiling Volcano images for different architectures. For example, you can build an image for the ARM64 architecture on an AMD64 machine.
Refer to the links for more details. (#2435, @ccchenjiahuan)
Optimize Queue Status Information
This feature enriches the status information reported for each queue. Users can view the resource allocation of queues in real time, which helps administrators plan resources dynamically.
Refer to the links for more details. (#2592, @jiangkaihua)
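As a rough illustration of how an administrator might use this information, the sketch below computes per-resource utilization from a queue's allocated resources and its configured limits. It assumes the two values have already been fetched as plain dictionaries of Kubernetes-style quantity strings; the `status.allocated` and `spec.capability` field names in the comments are assumptions to check against the Queue definition in your cluster, and the quantity parser is deliberately simplified.

```python
from typing import Dict

# Simplified subset of Kubernetes quantity suffixes (assumption: no exponents).
_SUFFIXES = {
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,
    "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12,
}


def parse_quantity(quantity: str) -> float:
    """Convert strings such as '500m', '4', or '8Gi' to a float."""
    quantity = quantity.strip()
    if quantity.endswith("m"):  # milli-units, e.g. 500m CPU = 0.5 cores
        return float(quantity[:-1]) / 1000
    for suffix, factor in _SUFFIXES.items():
        if quantity.endswith(suffix):
            return float(quantity[: -len(suffix)]) * factor
    return float(quantity)


def queue_utilization(
    allocated: Dict[str, str], capability: Dict[str, str]
) -> Dict[str, float]:
    """Return allocated / capability for each resource that has a limit."""
    usage = {}
    for name, limit in capability.items():
        limit_value = parse_quantity(limit)
        if limit_value == 0:
            continue  # avoid division by zero for resources capped at 0
        used = parse_quantity(allocated.get(name, "0"))
        usage[name] = used / limit_value
    return usage


if __name__ == "__main__":
    # Hypothetical values, as they might appear under status.allocated
    # and spec.capability of a queue (field names are assumptions).
    allocated = {"cpu": "1500m", "memory": "2Gi"}
    capability = {"cpu": "4", "memory": "8Gi"}
    for resource, ratio in queue_utilization(allocated, capability).items():
        print(f"{resource}: {ratio:.1%} of capability in use")
```

Ratios close to 1.0 suggest the queue is near its limit and may need more capacity or a rebalanced share.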
Other Notable Changes
- change enqueue to optional action(#2309, @wpeng102)
- Add documentation on ttlSecondsAfterFinished(#2314, @jsolbrig)
- remove redundant parentheses(#2316, @lucming)
- update go.mod to add queue.spec.Affinity(#2319, @qiankunli)
- Support JobReady for extender plugin(#2334, @xiaoxubeii)
- add jobflow design docs(#2339, @zhoumingcheng)
- deploy webhook by yaml(#2346, @hwdef)
- add details for nodegroup doc(#2347, @qiankunli)
- change e2e dependencies of makefile(#2350, @lucming)
- update go to 1.18(#2353, @hwdef)
- clean up the code(#2360, @lucming)
- add csiNode cache for plugin(#2371, @wpeng102)
- add rest config into ssn(#2378, @wpeng102)
- Update field comment(#2386, @zhoumingcheng)
- use patch to replace update pod operator(#2392, @wpeng102)
- get csinodes from ssn(#2399, @wpeng102)
- Consider initContainer GPUs quota in calculating(#2423, @kerthcet)
- Some cleanups in job_info.go(#2434, @kerthcet)
- Add initContainer GPU number when calculating GPUs(#2440, @kerthcet)
- Optimize the way to build images in makefile(#2445, @hwdef)
- add a flag to control whether inherit owner annotations when podgroup…(#2461, @elinx)
- Update CA insert method in webhooks(#2463, @jiangkaihua)
- chore: remove duplicate word in comments(#2470, @Abirdcfly)
- add plugin registration log(#2477, @Monokaix)
- Modify format verification by gofmt(#2499, @jiangkaihua)
- scheduler support ephemeral-storage resources(#2505, @WulixuanS)
- delete task qos limit in webhook(#2513, @waiterQ)
- enable https healthz listen(#2523, @waiterQ)
- Use RWMutex in framework(#2525, @kerthcet)
- Realias scheduling api version name in package imports(#2526, @kerthcet)
- Bump ginkgo version to v2.3.0(#2532, @kerthcet)
- upgrade golangci-lint to v1.50.0(#2537, @waiterQ)
- move prefilter out of predicates to improve performance(#2580, @elinx)
- Move spark e2e integration from self-hosted to github-hosted(#2590, @Yikun)
- Add node image information to the cache of the scheduler(#2593, @wangyang0616)
- By default, the preemption function of gang and drf is turned off(#2613, @wangyang0616)
- The referenced Volcano API version is updated to 1.7(#2618, @wangyang0616)
- update image to v1.7.0-beta.0(#2628, @william-wang)
- update image to v1.7.0(#2636, @wangyang0616)
Bug Fixes
- fix: proportion metrics accuracy(#2297, @LY-today)
- fix scheduler cache waitforcachesync(#2307, @xiaoanyunfei)
- To record the start and end time of job scheduling(#2318, @dontan001)
- fix convertQuanToPercent func(#2325, @autumn0207)
- fix defaultMetricsInternal variable(#2326, @autumn0207)
- filter the rescheduling strategies which contain victim functions(#2342, @Thor-wl)
- fix bug in task dependsOn(#2351, @hwdef)
- fix ci error about mpi plugin struct naming is not standardized(#2354, @hwdef)
- try get old pg when new pg not exist(#2400, @Akiqqqqqqq)
- fix scheduler panic when webhook is not ready(#2410, @hwdef)
- bugfix: panic if queue already exists(#2413, @elinx)
- fix nil pointer in jobCache.update(#2420, @Akiqqqqqqq)
- fix README.md clearly(#2427, @waiterQ)
- Fix calculating available gpu num error(#2441, @kerthcet)
- fix performance downgrade issue(#2443, @wpeng102)
- docs: fix error in how to configure scheduler(#2446, @hwdef)
- fix error during daily release(#2448, @hwdef)
- fix potential mem leakage in nodeorder.go(#2453, @kerthcet)
- fix bug predicateGPUbyMemory when gpu id not continuous(#2465, @WingkaiHo)
- fix wrong comments in nodeorder plugin(#2472, @hwdef)
- fix gpu sharing predictor allocate more than one gpu for pod(#2475, @WingkaiHo)
- fix scheduler panic issue when pv is not created(#2483, @jinzhejz)
- fix 2488, scheduler panic(#2489, @zhifanggao)
- fix printing volcano-scheduler-configmap loading log(#2504, @waiterQ)
- Fix scheduler plugin arguments readin bug(#2540, @jiangkaihua)
- fix unit test(#2541, @waiterQ)
- Fix unit test in cmd(#2548, @kerthcet)
- fix issue queue is not met even if oldDeserved and deserved are the same(#2553, @jinzhejz)
- pod could be preempted by default(#2545, @jinzhejz)
- The requests of extended resource(such as nvidia.com/gpu) is missing in PodGroup's minResource(#2573, @jimoosciuc)
- Remove Undetermined reason to fix cluster autoscaler compatibility(#2602, @tgaddair)
- Modify preempt victim order(#2623, @jiangkaihua)
- fix: deployment and pod, high priority cannot preempt low priority resources(#2630, @wangyang0616)
- The preemption logic is only controlled by jobMinavailable in the gang plugin(#2634, @wangyang0616)