Always keep driver core as 0 when start ray on driver in cluster mode #4169

Merged: 1 commit merged on Jul 9, 2021

@jenniew jenniew commented Jun 28, 2021

If Ray is started on the driver with more than 0 cores, TFRunner or other actors may be scheduled on the driver, which is not expected.
This change therefore no longer passes the driver cores value when calling ray_context.init().
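The behaviour change described here can be sketched as follows. This is only an illustration under stated assumptions: `FakeRayContext`, `start_ray_before` and `start_ray_after` are hypothetical names, and the stand-in `init(driver_cores=0)` signature is assumed rather than taken from the real `RayContext`.

```python
class FakeRayContext:
    """Hypothetical stand-in for RayContext; it only records driver_cores."""

    def __init__(self):
        self.driver_cores = None

    def init(self, driver_cores=0):
        # Assumed signature: the driver gets 0 Ray cores unless told otherwise.
        self.driver_cores = driver_cores
        return self


def start_ray_before(ray_ctx, **kwargs):
    # Old behaviour: a user-supplied driver_cores is forwarded to init(),
    # so actors could end up scheduled on the driver.
    driver_cores = 0
    if "driver_cores" in kwargs:
        driver_cores = kwargs["driver_cores"]
    return ray_ctx.init(driver_cores=driver_cores)


def start_ray_after(ray_ctx, **kwargs):
    # New behaviour: driver_cores is ignored, so the driver keeps 0 Ray cores.
    return ray_ctx.init()
```

With this sketch, `start_ray_before(FakeRayContext(), driver_cores=4)` records 4 cores on the driver, while `start_ray_after` with the same argument records 0.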

@jenniew jenniew requested a review from hkvision June 28, 2021 22:43
@@ -262,8 +262,6 @@ def init_orca_context(cluster_mode="local", cores=2, memory="2g", num_nodes=1,
        ray_ctx = RayContext(sc, **ray_args)
        if init_ray_on_spark:
            driver_cores = 0  # This is the default value.
            if "driver_cores" in kwargs:

Contributor:

Instead of removing it, can we rename it to ray_driver_cores to distinguish it from Spark's driver_cores?

Contributor Author:

Do we need to provide an option to set driver cores when Ray is started on the driver? I think we should hardcode it to 0 to avoid actors running on the driver.

Contributor:

Sometimes the driver has the capacity to run actors. In that case, could we allow it?

Contributor Author:

I think running actors on the driver is a little strange: in a cluster environment, tasks generally run on the executors, not the driver. In addition, we may run multiple jobs on the same driver, and all of them would share its resources, which may cause problems. If we did allow it, ray_driver_cores should be less than or equal to Spark's driver_cores.
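If a separate option were added as discussed in this thread, the constraint could be checked along these lines. This is a rough sketch, not part of the merged change: `resolve_ray_driver_cores` and its parameters are hypothetical, and the merged PR simply keeps the driver at 0 cores.

```python
def resolve_ray_driver_cores(spark_driver_cores, ray_driver_cores=0):
    """Return the Ray core count for the driver, or raise if it is invalid.

    Hypothetical helper: defaults to 0 so that no actors run on the driver.
    """
    if ray_driver_cores < 0:
        raise ValueError("ray_driver_cores must be non-negative")
    if ray_driver_cores > spark_driver_cores:
        # Ray must not claim more cores than Spark allotted to the driver.
        raise ValueError(
            "ray_driver_cores (%d) must be <= spark driver_cores (%d)"
            % (ray_driver_cores, spark_driver_cores)
        )
    return ray_driver_cores
```

Defaulting to 0 keeps the behaviour of the merged change, and the upper bound only matters for users who opt in explicitly.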

jenniew commented Jul 9, 2021

@jenniew jenniew merged commit 62a884d into intel:master Jul 9, 2021
shanyu-sys pushed a commit to shanyu-sys/analytics-zoo that referenced this pull request Sep 16, 2021
shanyu-sys added a commit that referenced this pull request Sep 22, 2021
* Orca PyTorch Estimator load data once (#2669)

* load once

* fix

* fix

* remove num_steps

* remove iter

* meet comments

* wrap once

* style fix

* Add Orca Overview and Context Doc (#2748)

* add docs

* minor

* Add init_orca_context (#2774)

* initial imple

* update

* meet review

* review and style

* remove stopped

* add doc

* minor

* move import

* fix mxnet

* remove

* Update UTs and examples with init_orca_context (#2787)

* update unit tests

* minor

* update

* update mxnet

* move barrier

* fix mxnet

* update

* bug fix

* update

* update test

* update mxnet example

* update mxnet

* minor

* minor

* minor

* update examples

* move ray import dependencies

* readme

* minor

* bug fix

* remove default

* Add website doc for init_orca_context (#2822)

* Support numa binding in init_spark_standalone (#2847)

* support numa binding in init_spark_standalone

* add doc and add to orca context

* address comments

* address comments

* update scripts

* hyperthreading

* fix

* shutdown hook (#2853)

* Support RayOnSpark for k8s and add docs (#2836)

* support ray on k8s

* add to init orca context

* style

* minor

* minor

* ut

* Fix stop_orca_context being called twice (#2878)

* Update website doc of init orca context (#2879)

* update doc

* update

* update Torch example (#3022)

* update example

* Update README.md

* update resnet_finetune.py

* delete some file

* Create README.md

* Update README.md

* Update README.md

* Update README.md

* add detect conda env name

* some change

* update run example scripte

* update init orca context

* attempt to fix ray memory (#3205)

* attempt to fix ray memory

* exclude webui

* Update doc for orca context (#3287)

* update

* add back

* update

* update

* minor

* meet review

* fix typo

* Add memory type support for Orca tf estimator (#3280)

* add mem type for dataframe dataset

* add ZooTrigger support

* update mem type

* add test

* update

* update mem type and orca context

* update orca context docs

* update orca context with comments

* fix style

* Add get_spark_session in OrcaContext (#3520)

* add and modify

* style

* add shard size to dataframe_to_xshards (#3491)

* add shard size to dataframe_to_xshards

* add ut and change default shard_size as None

* add shard size in orca context and fix style

* add ut in pytorch estimator and tf estimator

* move shard_size to internal use

* fix

* Add support for non-barrier mode to launch ray (#4014)

* add support for non-barrier mode

* fix style

* meet review

* meet review

* move barrier mode to zoocontext

* bug fix

* modify

* update

* remove driver cores (#4169)

* Add OrcaContext and make spark default read file backend (#2593)

* orca context

* handle error

* enrich error msg

* meet review

* move log output

* style

* meet review

* add zoocontext

* fix ut

* change import

Co-authored-by: Yina Chen <[email protected]>
Co-authored-by: Kai Huang <[email protected]>
Co-authored-by: Yang Wang <[email protected]>
Co-authored-by: Xin Qiu <[email protected]>
Co-authored-by: jenniew <[email protected]>
Co-authored-by: dding3 <[email protected]>
Co-authored-by: Le-Zheng <[email protected]>
Le-Zheng pushed a commit to Le-Zheng/analytics-zoo that referenced this pull request Sep 22, 2021
Le-Zheng added a commit to Le-Zheng/analytics-zoo that referenced this pull request Sep 22, 2021
dding3 added a commit to dding3/analytics-zoo that referenced this pull request Oct 4, 2021