Always keep driver core as 0 when start ray on driver in cluster mode #4169

jenniew · 2021-06-28T22:43:13Z

If Ray is started on the driver with more than 0 cores, the TFRunner or other actors may run on the driver, which is not expected.
So do not pass the driver cores value when calling ray_context.init().
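
The behaviour described above can be illustrated with a minimal, self-contained sketch. The names `fake_ray_init` and `start_ray_on_driver` are hypothetical stand-ins invented for this illustration, not functions from analytics-zoo; the only thing taken from the PR is the idea that a user-supplied `driver_cores` value is no longer forwarded, so the default of 0 applies.

```python
def fake_ray_init(driver_cores: int = 0) -> int:
    """Hypothetical stand-in for the real init call.

    Returns the number of cores Ray would be allowed to use on the driver.
    """
    return driver_cores


def start_ray_on_driver(**kwargs) -> int:
    """Start Ray on the driver without forwarding `driver_cores`.

    Any `driver_cores` entry in kwargs is assumed to describe the Spark
    driver, so it is intentionally ignored here and the stand-in falls
    back to its default of 0.
    """
    return fake_ray_init()


if __name__ == "__main__":
    # Even when the caller asks for 4 driver cores, Ray gets none of them.
    print(start_ray_on_driver(driver_cores=4))
```

With zero cores advertised on the driver, a scheduler has no CPU resources there to place actors on, which is the effect the PR is after.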

hkvision · 2021-06-29T02:05:16Z

pyzoo/zoo/orca/common.py

@@ -262,8 +262,6 @@ def init_orca_context(cluster_mode="local", cores=2, memory="2g", num_nodes=1,
    ray_ctx = RayContext(sc, **ray_args)
    if init_ray_on_spark:
        driver_cores = 0  # This is the default value.
-        if "driver_cores" in kwargs:


Instead of removing it, can we rename it to ray_driver_cores to distinguish it from Spark's driver_cores?

Do we need to provide an option to set driver cores when starting Ray on the driver? I think we need to hardcode it to 0 to avoid actors running on the driver.

Sometimes, if the driver has the capacity to run actors, could we allow it?

I think running actors on the driver is a little strange, since in a cluster environment tasks generally run on executors, not on the driver. In addition, we may run multiple jobs on the same driver, and all of those jobs would share its resources, which may cause problems. If we did allow it, ray_driver_cores should be less than or equal to the Spark driver_cores.
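
For reference, the constraint mentioned in the last comment could be checked along the following lines. This is only a sketch of the alternative discussed in review: the `ray_driver_cores` option and the `check_ray_driver_cores` helper are hypothetical, and nothing in this thread shows them being added to the code base.

```python
def check_ray_driver_cores(ray_driver_cores: int, spark_driver_cores: int) -> int:
    """Validate a hypothetical `ray_driver_cores` setting.

    The value must be non-negative and must not exceed the cores that
    Spark has allocated to the driver.
    """
    if ray_driver_cores < 0:
        raise ValueError("ray_driver_cores must be >= 0")
    if ray_driver_cores > spark_driver_cores:
        raise ValueError(
            "ray_driver_cores (%d) must not exceed spark driver_cores (%d)"
            % (ray_driver_cores, spark_driver_cores)
        )
    return ray_driver_cores


if __name__ == "__main__":
    print(check_ray_driver_cores(0, 4))  # the hard-coded default always passes
    print(check_ray_driver_cores(2, 4))  # allowed: within the Spark driver's cores
```

A check like this only guards a single job; it does not address the concern about several jobs sharing one driver.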

jenniew · 2021-07-09T03:07:09Z

Jenkins: http://10.239.176.111:18888/job/ZOO-PR-Validation/6058/

* Orca PyTorch Estimator load data once (#2669) * load once * fix * fix * remove num_steps * remove iter * meet comments * wrap once * style fix * Add Orca Overview and Context Doc (#2748) * add docs * minor * Add init_orca_context (#2774) * initial imple * update * meet review * review and style * remove stopped * add doc * minor * move import * fix mxnet * remove * Update UTs and examples with init_orca_context (#2787) * update unit tests * minor * update * update mxnet * move barrier * fix mxnet * update * bug fix * update * update test * update mxnet example * update mxnet * minor * minor * minor * update examples * move ray import dependencies * readme * minor * bug fix * remove default * Add website doc for init_orca_context (#2822) * Support numa binding in init_spark_standalone (#2847) * support numa binding in init_spark_standalone * add doc and add to orca context * address comments * address comments * update scripts * hyperthreading * fix * shutdown hook (#2853) * Support RayOnSpark for k8s and add docs (#2836) * support ray on k8s * add to init orca context * style * minor * minor * ut * Fix stop_orca_context being called twice (#2878) * Update website doc of init orca context (#2879) * update doc * update * update Torch example (#3022) * update example * Update README.md * update resnet_finetune.py * delete some file * Create README.md * Update README.md * Update README.md * Update README.md * add detect conda env name * some change * update run example scripte * update init orca context * attempt to fix ray memory (#3205) * attempt to fix ray memory * exclude webui * Update doc for orca context (#3287) * update * add back * update * update * minor * meet review * fix typo * Add memory type support for Orca tf estimator (#3280) * add mem type for dataframe dataset * add ZooTrigger support * update mem type * add test * update * update mem type and orca context * update orca context docs * update orca context with comments * fix style * Add 
get_spark_session in OrcaContext (#3520) * add and modify * style * add shard size to dataframe_to_xshards (#3491) * add shard size to dataframe_to_xshards * add ut and change default shard_size as None * add shard size in orca context and fix style * add ut in pytorch estimator and tf estimator * move shard_size to internal use * fix * Add support for non-barrier mode to launch ray (#4014) * add support for non-barrier mode * fix style * meet review * meet review * move barrier mode to zoocontext * bug fix * modify * update * remove driver cores (#4169) * Add OrcaContext and make spark default read file backend (#2593) * orca context * handle error * enrich error msg * meet review * move log output * style * meet review * add zoocontext * fix ut * change import Co-authored-by: Yina Chen <[email protected]> Co-authored-by: Kai Huang <[email protected]> Co-authored-by: Yang Wang <[email protected]> Co-authored-by: Xin Qiu <[email protected]> Co-authored-by: jenniew <[email protected]> Co-authored-by: dding3 <[email protected]> Co-authored-by: Le-Zheng <[email protected]>


remove driver cores

c84b9af

jenniew requested a review from hkvision June 28, 2021 22:43

hkvision reviewed Jun 29, 2021

hkvision approved these changes Jul 9, 2021

jenniew merged commit 62a884d into intel:master Jul 9, 2021

shanyu-sys pushed a commit to shanyu-sys/analytics-zoo that referenced this pull request Sep 16, 2021

remove driver cores (intel#4169)

163d2b6

Le-Zheng pushed a commit to Le-Zheng/analytics-zoo that referenced this pull request Sep 22, 2021

remove driver cores (intel#4169)

97b2244

jenniew commented Jun 28, 2021

hkvision Jun 29, 2021

jenniew Jun 29, 2021

hkvision Jun 30, 2021

jenniew Jul 6, 2021

jenniew commented Jul 9, 2021
