[jvm-packages] bridge the gaps between jvm package and native xgboost #7802

Closed
19 of 34 tasks
wbo4958 opened this issue Apr 13, 2022 · 13 comments

Comments

@wbo4958
Contributor

wbo4958 commented Apr 13, 2022

The JVM packages are far behind native XGBoost. I am filing this issue to track missing features and bugs that should be fixed in the upcoming 2.0.0 release. Please feel free to add more.

New Features

Bugs

@hcho3 hcho3 pinned this issue Apr 13, 2022
@trivialfis
Member

Related #4793

@trivialfis trivialfis unpinned this issue Apr 25, 2022
@mallman
Contributor

mallman commented May 20, 2022

[X] XGBoost4j-spark-GPU does not support multi-worker training.

Since this is checked off does this mean xgboost4j-spark-gpu supports multi-worker training? I have not been able to get anything other than 1 worker to work. Is there a particular configuration that needs to be applied to enable multi-worker training?

FYI I'm using XGBoost 1.6.1 and Spark 3.2.1.

@wbo4958
Contributor Author

wbo4958 commented May 23, 2022

@mallman, thanks for testing xgboost4j-spark-gpu. XGBoost 1.6.1 and Spark 3.2.1 are fine for testing multi-worker training.

Please note that each XGBoost worker requires one GPU per process, so if you are trying multi-worker training, make sure you have multiple GPUs. You should also configure your Spark cluster with GPU support; please refer to https://nvidia.github.io/spark-rapids/Getting-Started/

For how to submit the XGBoost job, please follow https://xgboost.readthedocs.io/en/latest/jvm/xgboost4j_spark_gpu_tutorial.html#submit-the-application.

Please feel free to share feedback. Thanks very much.

@wbo4958
Contributor Author

wbo4958 commented May 23, 2022

BTW, @mallman, have you seen an obvious speedup?

@mallman
Contributor

mallman commented May 23, 2022

Hi @wbo4958. I think there's some ambiguity in my question. Let me clarify.

What I want to do is run distributed training with a single worker per executor, like we can do in CPU mode. I have been able to make it work if I configure my Spark job with spark.task.resource.gpu.amount set to 1. But then I can only run one task per executor at a time. This severely limits data-parallelism, and we are working with a very large training set, ~100,000,000 to ~1,000,000,000 records.

I'm starting to think that what I want is not achievable, at least not with ordinary Spark configuration. I think that maybe what I need is to use Spark's stage-level scheduling, introduced in Spark 3.1. We're using the standalone scheduler, which does not support this capability yet. So we may be stuck unless we switch to YARN or Kubernetes.

So my question is, is it possible to run distributed-mode training in GPU mode without limiting the number of running tasks per executor to 1? Cheers.
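
Editor's sketch of the arithmetic behind this question. It assumes the number of tasks an executor can run at once is the smaller of its CPU limit and its GPU limit, and that for a fractional GPU amount Spark floors 1/amount to get the tasks per GPU. The function name and its parameters are hypothetical and not part of Spark or XGBoost.

```python
import math


def concurrent_tasks_per_executor(executor_cores, task_cpus,
                                  executor_gpus, task_gpu_amount):
    """Estimate how many tasks one executor can run at the same time.

    Assumption: the scheduler honors both the CPU limit and the GPU limit,
    so the tighter of the two wins.
    """
    cpu_slots = executor_cores // task_cpus
    if task_gpu_amount >= 1:
        # Whole GPUs per task: each task takes that many devices.
        gpu_slots = executor_gpus // int(task_gpu_amount)
    else:
        # Fractional amount: assumed floor(1 / amount) tasks share one GPU.
        gpu_slots = executor_gpus * math.floor(1 / task_gpu_amount)
    return min(cpu_slots, gpu_slots)


# 12 cores and 1 GPU per executor, with spark.task.resource.gpu.amount = 1:
print(concurrent_tasks_per_executor(12, 1, 1, 1))
# The same executor with a fractional GPU amount per task:
print(concurrent_tasks_per_executor(12, 1, 1, 0.08))
```

Under these assumptions the first call yields a single task slot, which matches the limitation described in this comment, while the fractional setting restores all 12 CPU slots.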

@wbo4958
Contributor Author

wbo4958 commented May 24, 2022

@mallman, I got you.

Hmm. If there is only ever one XGBoost application running on your cluster at a time (with no other Spark applications), then it's okay to set spark.task.resource.gpu.amount to a fraction. E.g., if your executor has 12 CPU cores and each task uses 1 CPU core, then spark.task.resource.gpu.amount should be set to 1/12 ≈ 0.08.
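
Editor's sketch of how that fraction could be derived, assuming one GPU per executor as described earlier in the thread. The helper name `fractional_gpu_conf` and its `decimals` parameter are hypothetical; the configuration keys are standard Spark resource settings. The value is rounded down rather than to nearest, on the assumption that Spark floors 1/amount when counting tasks per GPU, so rounding up (for example to 0.09) would leave fewer GPU slots than CPU slots.

```python
def fractional_gpu_conf(executor_cores: int, task_cpus: int = 1,
                        decimals: int = 2) -> dict:
    """Build Spark settings that let every task slot share one executor GPU."""
    if executor_cores <= 0 or task_cpus <= 0:
        raise ValueError("executor_cores and task_cpus must be positive")
    slots = executor_cores // task_cpus
    if slots == 0:
        raise ValueError("task_cpus exceeds executor_cores")
    # Integer arithmetic rounds 1/slots down to the requested precision.
    scale = 10 ** decimals
    units = scale // slots
    if units == 0:
        raise ValueError("increase decimals: the fraction rounds down to zero")
    amount = units / scale
    return {
        "spark.executor.cores": str(executor_cores),
        "spark.task.cpus": str(task_cpus),
        # Assumption from the thread: one GPU per executor (one XGBoost worker).
        "spark.executor.resource.gpu.amount": "1",
        "spark.task.resource.gpu.amount": f"{amount:.{decimals}f}",
    }


# The 12-core example from the comment above, printed as spark-submit flags.
for key, value in fractional_gpu_conf(12, 1).items():
    print(f"--conf {key}={value}")
```

As the comment notes, this only makes sense when the XGBoost application is the only one using the cluster's GPUs.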

@mallman
Contributor

mallman commented May 25, 2022

Hi @wbo4958. If I do that, all of the xgboost tasks run on a single executor, but no progress is made. I don't get an error either. It just waits.

@wbo4958
Contributor Author

wbo4958 commented May 26, 2022

@mallman Could you file a separate issue describing the problem, including your environment, script, and so on?

@mallman
Contributor

mallman commented May 27, 2022

@wbo4958 I'm sorry, but I don't know when I'll return to this effort. But basically the question is whether one can run distributed xgboost with gpus without sacrificing task-parallelism in non-xgboost stages.

@wbo4958
Contributor Author

wbo4958 commented May 29, 2022

@wbo4958 I'm sorry, but I don't know when I'll return to this effort. But basically the question is whether one can run distributed xgboost with gpus without sacrificing task-parallelism in non-xgboost stages.

The answer is yes, as described in #7802 (comment). If you can't get it working, please file an issue with detailed information so we can figure out why it doesn't run successfully.

@wbo4958 wbo4958 changed the title from "[jvm-packages] Make up the gaps between jvm package and native xgboost" to "[jvm-packages] bridge the gaps between jvm package and native xgboost" Jun 8, 2022
@shadyelgewily-slimstock

shadyelgewily-slimstock commented Jan 26, 2023

We have a strong appetite for categorical feature support in the JVM package and are willing to contribute, but it would help to have a more granular overview of what still needs to happen and which components we can contribute to in order to get this feature in. @wbo4958, any chance we could extend the list of action points to clarify what is done and what remains? "Support categorical data in jvm" is a bit too vaguely defined for me, as a new contributor, to see where I can help.

@wbo4958
Contributor Author

wbo4958 commented Jan 30, 2023

Hi @shadyelgewily-slimstock, according to #8727 (comment), it seems you'd like to use the Java APIs rather than Spark to handle categorical data? If so, I think the current xgboost4j package already covers your requirement; please see https://github.com/dmlc/xgboost/pull/7966/files#diff-303feb16c30765909c132d10a2a38788c0a5e6cce038eed115e58322c0016f2fR268-R270 and https://github.com/dmlc/xgboost/pull/7966/files#diff-303feb16c30765909c132d10a2a38788c0a5e6cce038eed115e58322c0016f2fR286-R288.

You can refer to this test for usage: https://github.com/dmlc/xgboost/pull/7966/files#diff-350a33aa9a66e2d51e745c5dc6a190113d2f0a2853a5974878686a30a2b0e47cR408-R430. Support for categorical data in xgboost4j-spark has not been implemented yet, and you're welcome to contribute it. Thanks.

@wbo4958
Contributor Author

wbo4958 commented Jun 14, 2024

Closing this task; it is now tracked by #10415.

Projects
Status: 2.0 Done
4 participants