[python-package] [dask] Dask fit task just crashes #6196
Comments
Thanks for the report. One clarifying question: are you using Jython? I'm very surprised to see JVM logs in the output. If you're not using Jython, can you help me understand where those JVM logs are coming from?
I'm using Python. Maybe it's because I'm reading files from HDFS and using Yarn as the resource manager? (It is a Yarn cluster.)
Ahhhh ok, I missed that. I've reformatted your example to make it a bit easier to read. The fact that you're using Yarn seems like a very important detail. If you're unsure how I got the example to look the way it does, please review https://docs.github.com/en/get-started/writing-on-github/getting-started-with-writing-and-formatting-on-github/basic-writing-and-formatting-syntax.
I strongly suspect that the issue is that you're losing Dask workers due to out-of-memory issues. I see you're using a 2-worker setup. The logs from the failed run show that you're using raw data with the following characteristics:
Even if they were all fairly compact, that adds up quickly. Add to that:
It seems very likely that a worker could be lost due to out-of-memory issues. Dask can't see into memory allocations made in LightGBM's C++ code, so it can't prevent some forms of memory issues by spilling to disk the way it can with other Python objects. As described in #3775 (a not-yet-implemented feature request), LightGBM distributed training cannot currently handle the case where a worker process dies during training. I know this isn't the answer you were hoping for, but I think the best path forward is some mix of giving training more memory and reducing the amount of memory used. For example, try some combination of the following:
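As one hedged illustration of the "reduce the amount of memory used" direction: the parameter names below are standard LightGBM options, but the specific values are illustrative only, not tuned for this dataset.

```python
# Sketch of parameter choices that typically reduce LightGBM's memory
# usage during training. The names are real LightGBM parameters; the
# values here are examples, not recommendations for this workload.
lower_memory_params = {
    "max_bin": 63,            # fewer histogram bins -> smaller internal Dataset
    "num_leaves": 31,         # smaller trees -> less model state in memory
    "bagging_fraction": 0.5,  # train each iteration on a row subsample
    "bagging_freq": 1,        # re-sample rows every iteration
    "feature_fraction": 0.5,  # consider a column subsample per tree
}
print(lower_memory_params)
```

These can be passed as keyword arguments to the Dask estimators (e.g. `lightgbm.DaskLGBMClassifier(**lower_memory_params)`); whether they help enough depends on how much of the memory pressure comes from raw-data copies versus LightGBM's own allocations.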
If you're knowledgeable about Dask, we'd welcome any suggestions to improve memory usage there.
Thanks for the quick response @jameslamb.
The total size of the data is just 2GB, so I don't think it can be an out-of-memory issue.
It depends on what you mean by "the data". I listed out some examples of that in my comments above.
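The gap between "2GB of raw data" and actual process memory can be sketched with some back-of-envelope arithmetic. The row/column counts below are hypothetical, chosen only so the on-disk size is roughly comparable to the reported 2GB.

```python
# Hypothetical shapes: ~10M rows x 25 numeric columns.
n_rows = 10_000_000
n_cols = 25

# pandas/Dask default to float64 for numeric columns parsed from text,
# so one fully materialized copy costs 8 bytes per value.
bytes_float64 = n_rows * n_cols * 8  # 2,000,000,000 bytes

# During distributed training there can be several live copies at once:
# the Dask partitions, the concatenated per-worker data handed to
# LightGBM, and LightGBM's own Dataset construction.
copies = 3
total_gib = bytes_float64 * copies / 2**30
print(round(total_gib, 1))
```

Under these (assumed) numbers, "2GB of raw data" already implies several GiB of peak memory per run before accounting for the model, histograms, or any other work the cluster is doing.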
I meant raw data. I will try with more memory, but if 2GB of raw data is going to take that much memory then I should probably try different options. I'll try some of your suggestions above. By the way, I tried synapseML earlier and was facing some issues there as well, but unfortunately that community isn't as responsive. Anyway, I really appreciate the quick responses and all the suggestions. Thanks a lot!
Besides the other approaches I provided for reducing memory usage, you could also try using Dask Array instead of Dask DataFrame. Since it appears you're just loading raw data from files and directly passing it into LightGBM for training, it doesn't appear that you really need any of the functionality of Dask DataFrame. I'd expect Dask Array (made up of underlying NumPy arrays) to have a smaller memory footprint.
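A minimal sketch of the Dask Array route, assuming `dask` is installed. The shapes and chunk sizes here are illustrative stand-ins, not tuned for the reporter's data; `chunks` is the main lever, since it bounds how much of the array each partition materializes at once.

```python
import dask.array as da

# Random data as a stand-in for arrays loaded from files.
# Smaller row-chunks mean smaller per-partition memory peaks.
X = da.random.random((100_000, 20), chunks=(10_000, 20))
y = da.random.random((100_000,), chunks=(10_000,))

print(X.numblocks)  # ten row-chunks, one column-chunk
```

These arrays can then be passed directly to the Dask estimators, e.g. `lightgbm.DaskLGBMRegressor().fit(X, y)` with a `dask.distributed.Client` connected (training itself is omitted here since it needs a running cluster).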
Other things to consider:
Just want to add that the error caught by the JRE is a SIGSEGV (segfault), so there could be some weird interaction going on as well.
Sure, let me try with Dask Arrays. Machines aren't removed during training. These machines are Hadoop datanodes, but they are not consuming that much memory.
@dpdrmj hey, any luck with this? I am also facing the same kind of issue. I am using Yarn; Dask is able to load the datasets and preprocess them.
@dpdrmj @jameslamb I have observed a few things (and now I am able to train on a large dataset):
I am sharing stdout logs of both runs: one using the latest LightGBM version, and one using LightGBM 3.3.5.
All of the environment and configuration are the same except the LightGBM version.
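If you want to reproduce that comparison, one low-effort experiment is pinning the older version in your environment, e.g. in a requirements file. The version below is just the one mentioned in this thread as working, not a general recommendation.

```text
# requirements.txt fragment: pin the LightGBM version reported to
# train successfully in this thread
lightgbm==3.3.5
```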
Description
Whenever I run this code, the Dask job crashes, all the workers get lost, and then the task just hangs forever. If I provide smaller files (<100MB), the same code works fine. I'm not sure what the issue is. I'm pasting the error below in the "Additional Comments" section.
Reproducible example
Environment info
LightGBM version or commit hash:
All the dependencies:
Command(s) you used to install LightGBM
Additional Comments
I had reported this on the dask/distributed GitHub (dask/distributed#8341), but someone there asked me to report it to LightGBM.