-
Notifications
You must be signed in to change notification settings - Fork 60
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Running dask_cudf on 2 GPUs and everything hangs #655
Comments
When you run with just TCP do you still see this error ? |
When using I would suggest you try |
@pentschev Thanks, I assumed they were redundant, except some didn't match, like cuda_copy one is not in the kwargs explicitly at least. I was just trying to follow your docs as closely as possible, where there is also such redundancy: https://dask-cuda.readthedocs.io/en/latest/ucx.html#starting-a-local-cluster-single-node-only |
No such hangs or errors appear in tcp mode. Note the message is only about ucx. |
Yes I'm aware of the warning and the choice that default is not ucx in rapids dask_cudf. I'm using conda install with ucx-py with matched versions, however, so I thought it would work. I'm not using infiniband, this is just a simple single node system with 2 GPUs.
Yes, the docs were not entirely clear. E.g the link I posted: https://dask-cuda.readthedocs.io/en/latest/ucx.html#starting-a-local-cluster-single-node-only shows 1GB RMM but then passes 24GB RMM pool. I guess I didn't follow if one was per worker and the other was across all workers or what. I'm also not sure xgboost uses anything related to RMM, but the docs seem to stress that using RMM is required to avoid slowness. I can have it use most of GPU memory, but again it wasn't clear from docs if it was per worker-GPU or per cluster or per node etc.
Yes, as the setup link shows I'm using rapids 0.14 as I can't quite update to >cuda 10.0 yet. So perhaps some things have been fixed already. Here are my explicit packages used in my conda environment that was constructed as a self-consistent conda solution (i.e. no conflicts). Note, however, that rapids 0.14 fails to work properly if go by the conda versions alone, as I discovered when trying to use rapids 0.14. E.g. one cannot use new dask with rapids 0.14, despite version limits suggesting one can and conda happily installing it. There are many things like that (critically pandas, numpy, fastavro, fsspec, dask, and distributed). So I had to choose versions that go back to when 0.14 was released by NVIDIA in order to reconstruct what conda would have done then. This leads to essentially (all but a few) rapids tests passing (i.e. cudf, cuml, cugraph, cusignal, rmm, ucx, pyarrow, dask_cuda, etc.). Once I move to rapids 0.15 with cuda 10.2 and python38 I can see if dask/distributed updates help things |
BTW, it's highly repeatable. I'll have to stop using ucx for now. Do you have any other suggestions apart from RMM? Note that xgboost using much more than 1GB in some cases, but basically I'm asking specifically for how to debug the specific ucx error I see that precedes the hang:
However, I also see times when those messages are shown, but there was no hang or issue, even the ERROR one. |
@pseudotensor i re-wrote your fit example and ran with both 0.15 and latest nightly and didn't see any errors like the one reported: def fit():
from dask.distributed import Client, wait
from dask_cuda import LocalCUDACluster
import pandas as pd
with LocalCUDACluster(CUDA_VISIBLE_DEVICES=[0,1],
protocol="ucx",
enable_nvlink=True,
device_memory_limit="28GB",
rmm_pool_size="30GB") as cluster:
with Client(cluster) as client:
import xgboost as xgb
import dask_cudf
target = "default payment next month"
Xpd = pd.read_csv("creditcard.csv")
Xpd = Xpd[['AGE', target]]
Xpd.to_csv("creditcard.csv")
X = dask_cudf.read_csv("creditcard.csv")
y = X[target]
X = X.drop(target, axis=1)
kwargs_fit = {}
kwargs_cudf_fit = kwargs_fit.copy()
valid_X = dask_cudf.read_csv("creditcard.csv")
valid_y = valid_X[target]
valid_X = valid_X.drop(target, axis=1)
kwargs_cudf_fit['eval_set'] = [(valid_X, valid_y)]
params = {} # copy.deepcopy(self.model.get_params())
params['tree_method'] = 'gpu_hist'
dask_model = xgb.dask.DaskXGBClassifier(**params)
dask_model.fit(X, y, verbose=True) |
Guessing it is mainly the small initial pool sizes that results in issues. A larger initial pool size likely avoids it. |
Yes,
I can totally understand the confusion, I think this is a bit outdated now and at the time we weren't sure ourselves if this was necessary, as we tried to state with this sentence in that page: "One possible exception is DASK_RMM__POOL_SIZE, at this time it’s unclear whether this is necessary or not, but using that should not cause any issues nevertheless.".
In that case you should use should not use
In the doc, 1GB is meant specifically for the client process only, whereas 24GB will be passed down to each worker, and that 24GB is per worker (GPU). Unfortunately, we can't control the client's memory pool in any other way, as we do for the workers in
Using a memory pool is indeed very important for performance in RAPIDS as a whole, and even more for UCX-Py. However, I don't know if there's a way you can balance xgboost memory usage with that of RAPIDS/Dask.
Yes, unfortunately we can't pin maximum versions of those libraries as we never know when a fix will be added to those libraries for some bug that wasn't known at the time of RAPIDS release. We only test stable and development versions, meaning that users that can't upgrade to one of those versions are unfortunately out of luck, as we don't anymore check whether there are non-backwards compatible changes. Note that RAPIDS 0.16 was released a couple days ago, so I suggest going to stable and not the 0.15 (which is now legacy).
The warnings suggest this is at the very end and the error just happens at exit time, is that correct? If that's the case, I think this is related to a bug in dask/distributed that I was able to identify yesterday: dask/distributed#4181 . Other than that, I would like to know whether using the Ethernet interface instead of |
It's also worth noting that as of 1.20 (I think) RMM should be available as an allocator in XGBoost: |
Thanks, I'll try specifying the ethernet device, this is just a bit awkward to have to do when only using a single node. ucx would be better if it worked on single node without having to specify this. Yes, I'm not sure about RMM vs. xgboost. I'm using 1.2 lately, but I'm not sure what it means to use the RMM pool. Is it always using that pool? Does the user have to do something? It's not clear to me. Thanks for notice on rapids 0.16 being released, definitely will jump to that instead. |
I'm not totally sure either. Pinged an XGBoost dev offline, but it looks like he may be out today. |
You normally don't need to, I suggested it because you had |
Following up on this, so this should already be enabled in 0.15. This should just work automatically. Though it does require that one is getting |
Thanks, I had been wondering about the rapids-xgboost and if/how different from xgboost. Is there some reason why the changes nvidia makes are not just already in dmlc? It's confusing, but also I thought many people at nvidia are making changes to dmlc directly anyways. |
The difference is in how it's built. On rapids channel the CMake flag Hanging is usually the result of another abortion. Segfault leads to restart of worker and XGBoost will try to establish the allreduce tree/ring with rest of the workers, which is impossible as others are in different states. I'm confused by this issue. So:
Is this a fair summary? |
See for setup details: dmlc/xgboost#6232
Last thing in logs before hang is:
If I poke the faulthandler in python to see where things are hung for a process using 100% of 1 core, I see:
I run dask_cudf like:
I'm only giving a schematic of the code. It is not an MRE yet. But, for me here the hang is during the predict.
I'm new to using ucx, and hadn't seen this kind of hang before when using default options.
Any ideas?
Thanks!
The text was updated successfully, but these errors were encountered: