Massive memory (100GB) used by dask-scheduler #6388
Here is the basic workflow in xgboost:
I don't understand why the scheduler blows up...
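For reference, here is a minimal sketch of the basic xgboost.dask workflow being discussed; the cluster address, data, and parameters are illustrative, not taken from the thread.

```python
import dask.array as da
import xgboost as xgb
from dask.distributed import Client

# Connect to an existing cluster (address is illustrative).
client = Client("tcp://scheduler-host:8786")

# Toy training data living on the workers as dask arrays.
X = da.random.random((100_000, 20), chunks=(10_000, 20))
y = da.random.randint(0, 2, size=100_000, chunks=10_000)

# DaskDMatrix only references the distributed partitions; it should not
# pull the data back to the client or the scheduler.
dtrain = xgb.dask.DaskDMatrix(client, X, y)

# train() dispatches one training task per worker; the workers then
# synchronize gradient histograms among themselves via rabit allreduce.
output = xgb.dask.train(
    client,
    {"tree_method": "hist", "objective": "binary:logistic"},
    dtrain,
    num_boost_round=10,
)
booster = output["booster"]
preds = xgb.dask.predict(client, booster, dtrain)
```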
I think histogram-merging happens on the scheduler because that's where the tracker is started (xgboost/python-package/xgboost/dask.py, line 612 in fcfeb49).
Doesn't that mean the scheduler acts as the root of the allreduce?
No. The tracker only bootstraps the connections between the workers.
Once the allreduce tree is constructed, all peers pass messages directly.
Ah ok, I see! I misunderstood the role of the tracker in this process that @hcho3 explained to me.
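To make the above concrete, here is a toy illustration of the pattern being described; it is not xgboost's actual tracker code. The point is that the tracker only hands out peer information during bootstrap, while the reduction data itself moves peer-to-peer.

```python
from dataclasses import dataclass

@dataclass
class ToyTracker:
    """Knows every peer's address but never touches training data."""
    peers: list

    def bootstrap(self, worker_id: int) -> list:
        # The only thing a worker receives from the tracker is the peer list.
        return [p for i, p in enumerate(self.peers) if i != worker_id]

def toy_allreduce_sum(worker_values: list) -> list:
    # After bootstrap, workers exchange partial histograms directly with
    # each other; the tracker (and the dask scheduler) take no part in it.
    total = sum(worker_values)
    return [total] * len(worker_values)

tracker = ToyTracker(peers=["worker-0:9091", "worker-1:9091", "worker-2:9091"])
print(tracker.bootstrap(0))                # peer addresses only
print(toy_allreduce_sum([1.0, 2.0, 3.0]))  # every peer ends up with the sum
```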
@trivialfis I'll write a complete repro soon that is hopefully fairly minimal (ideally only 1 node). Also, what would you recommend? If I have an in-CPU-memory frame, what is the best way (at low CPU and GPU memory cost) to get it to xgboost? I.e. I can't use dask_cudf.read_csv() since the file is only on one node in the cluster and the disk is not shared. I don't want to use s3/hdfs/etc.; this is just an on-prem cluster. I'm currently doing what I showed in the dask issue:
over and over in separate forks, where the client just connects to the scheduler. I see excessive scheduler memory use and accumulation even after I added a persist call (which I didn't need, because you already do that) and with client.cancel(X) and client.cancel(y) after the fit is done. But the scheduler still continues to accumulate memory with every fit. Do I also have to run client.scatter(X) and client.scatter(y) to get it off the scheduler? Seems like a lot of extra work. But also, what if I want to get the data onto the GPU ASAP? I don't ever want to have all the data on 1 GPU. Can I call
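Regarding the scatter question above, one common pattern for pushing a client-local frame to the workers without routing it through the scheduler is to scatter partitions with direct=True and rebuild a dask collection from the futures. This is only a hedged sketch of that pattern, not a statement of what xgboost recommends; the connection details, names, and sizes are illustrative.

```python
import numpy as np
import pandas as pd
import dask.dataframe as dd
from dask.distributed import Client

client = Client(scheduler_file="scheduler.json")  # connection details are illustrative

# Stand-in for the in-CPU-memory frame.
df = pd.DataFrame(np.random.random((1_000_000, 11)),
                  columns=["label"] + ["f%d" % i for i in range(10)])

# Split locally, then scatter the pieces; direct=True ships partitions
# straight to the workers instead of bouncing the bytes off the scheduler.
n = 8
size = len(df) // n + 1
pieces = [df.iloc[i * size:(i + 1) * size] for i in range(n)]
futures = client.scatter(pieces, direct=True)

# Rebuild a dask dataframe from the scattered futures; the data now lives
# on the workers and only metadata stays on the client/scheduler.
ddf = dd.from_delayed(futures, meta=df.iloc[:0])
X = ddf.drop(columns=["label"])
y = ddf["label"]
```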
Where that is just HIGGS with 5,000,000 rows and 29 columns. Problem 1) On a 64GB system, the first (iteration 0 of the loop) model.fit() step goes through all my system memory and the OOM killer kills it. I don't see why model.fit should use so much memory. It's not even on the GPU yet.
If I lower the rows to 1M and chunksize to 50000
Then I get a GPU OOM and xgboost hangs in the client code. The worker restarts fine:
but client code using xgboost hangs:
A hang is not good behavior to have on the client side. The fact that the worker went down is something to be dealt with by xgboost, I guess.
If I lower to 1M rows but only 11 columns:
then things finally run.
But so far I'm not able to reproduce the creep-up in dask-scheduler memory use with this code. Here it just bounces up and down. So something else is required, even though I'm following my real use fairly closely. One difference is that I'm using multiple nodes in the bad case where the scheduler accumulates memory.
Another cycle, and the scheduler is using more memory. The worker went through a few OOM-killer events during the loop, so it also has issues, just not as obvious at this exact moment after the script completed. In another cycle, even the scheduler hits the OOM killer. The scheduler hits:
worker sees:
client sees:
Once this happens, everything is gone.
Moved the dask-specific discussion to dask/distributed#4243, since it perhaps is a dask.distributed problem only. However, @teju85, this issue makes using dask and xgboost with multi-GPU or multi-node NVIDIA rapids/xgboost impossible. As for the other problem, #6388 (comment), it would seem that is a highly excessive amount of memory being used by xgboost during model.fit, before going to dask or using the GPU. That is also a major issue.
Well, now I'm not sure the memory problem isn't xgboost. If I perform the exact same task but without xgboost, there is never any accumulation of memory:
This never accumulates, whether running 10000 times or running it over and over like before. Never accumulates. @teju85, so the new theory is that xgboost is creating frames and leaking them. So then all of the problems mentioned are xgboost's fault.
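A minimal sketch of what such a no-xgboost control loop might look like; this is an assumption about the elided script, not a copy of it. It only persists the dask collections, waits, and cancels them each iteration, so any scheduler-memory growth seen with xgboost but not here would point at xgboost.

```python
import numpy as np
import pandas as pd
import dask.dataframe as dd
from dask.distributed import Client, wait

# Stand-in data of roughly the same shape as the HIGGS subset.
X = pd.DataFrame(np.random.random((1_000_000, 28)),
                 columns=['feature-%02d' % i for i in range(1, 29)])
y = pd.Series(np.random.randint(0, 2, size=len(X)), name='label')

for i in range(100):
    with Client(scheduler_file='scheduler.json') as client:
        # Same data movement as the xgboost loop, but no training.
        dX = dd.from_pandas(X, chunksize=50000).persist()
        dy = dd.from_pandas(y, chunksize=50000).persist()
        wait([dX, dy])
        client.cancel(dX)
        client.cancel(dy)
        print('iteration', i, 'done')
```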
Could you please run this script and see how it goes:
```python
import pandas as pd
import dask.dataframe as dd
import xgboost as xgb


def main():
    colnames = ['label'] + ['feature-%02d' % i for i in range(1, 29)]
    df = pd.read_csv('HIGGS.csv', names=colnames)
    y = df['label']
    X = df[df.columns.difference(['label'])]
    print('X.shape', X.shape)

    for i in range(0, 100):
        from dask.distributed import Client
        with Client(scheduler_file='scheduler.json') as client:
            params = {
                'tree_method': 'gpu_hist',
                'n_estimators': 2,
                'objective': 'binary:logistic'
            }
            model = xgb.dask.DaskXGBClassifier(**params)
            chunksize = 50000
            dX = dd.from_pandas(X, chunksize=chunksize)
            dy = dd.from_pandas(y, chunksize=chunksize)
            print("Done getting data into dask")
            model.fit(dX, dy)
            print("Done with fit")
            preds = model.predict(dX)
            print("Done with predict")
            print(preds[0].compute())
            client.cancel(dX)
            client.cancel(dy)
            print("Done with del")


if __name__ == '__main__':
    main()
```
The HIGGS.csv is the unmodified version of HIGGS.
What is scheduler.json?
It's a file created by dask-scheduler (via --scheduler-file) that clients and workers use to find the scheduler.
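For context, this is roughly how a scheduler file is produced and consumed (paths are illustrative): the scheduler and workers are started with --scheduler-file, which writes/reads the connection info, and the client points at the same file.

```python
# On the scheduler host (shell):
#   dask-scheduler --scheduler-file /tmp/scheduler.json
# On each worker host (shell):
#   dask-worker --scheduler-file /tmp/scheduler.json
from dask.distributed import Client

# The client reads the scheduler address from the shared JSON file,
# so no hard-coded tcp://host:port is needed in the script.
client = Client(scheduler_file="/tmp/scheduler.json")
print(client)
```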
Also, I'm not sure what you are asking me to do. It appears that your script is just the same as mine. But I'll try the json thing. You also used main(). But even on 5M HIGGS with all 28 columns I already hit the massive memory use from #6388 (comment). With 10 columns I hit the GPU OOM on my single 1080ti. Only with 1M x 10 did I avoid those issues.
Yup. I modified your script to be minimal. I'm curious, as I got to 12 rounds and there's no sign of a memory leak.
12 rounds? You mean 12/100 iterations? That's probably too few to see it. I showed that the increment is about 10% of 64GB every 100 iterations for the scheduler, and more for the worker.
I ran your exact case. There are some differences:
So somehow my version of the setup and script burns through all 64GB of memory just doing model.fit, and even 5M x 10 hits the GPU OOM. How do you explain that? Did you try my version?
Even within 1 overall cycle of 100 iterations, at about 50 iterations the worker dies after being killed by the OOM killer:
At that point xgboost/client seems to hang, unlike before. No progress in the iterations of that loop. CTRL-C on the hung client code shows:
To be clear, as mentioned at first, I'm using rapids 0.14 and xgboost 1.2, so there may have been improvements since. But maybe you would know if such things were fixed.
Sure. Could you provide a script that I can follow exactly without making any edits (except for the data path)?
I'm not aware of a memory leak in xgboost. We run benchmarks with larger datasets every now and then.
In the meantime, it's still running. Hopefully I can reproduce it later.
If I start to bisect between our codes to look into the first-iteration-CPU-OOM-in-model-fit issue and the GPU OOM issue, some notes:
To be clear, this first-iteration CPU OOM is not the same issue as the worker or scheduler hitting OOM. Instead, the client process itself hits the CPU OOM during model.fit(). If you have more than 64GB, you should watch top or something to see memory usage. Can you just try your script but use from_array()? Just use:
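A hedged sketch of the from_array() variant being suggested; the exact snippet was cut off above, so the data, chunking, and connection details here are only illustrative.

```python
import numpy as np
import dask.array as da
import xgboost as xgb
from dask.distributed import Client

client = Client(scheduler_file='scheduler.json')  # as in the script above

# Stand-ins for the HIGGS arrays; with real data use X.values / y.values.
X = np.random.random((1_000_000, 28)).astype(np.float32)
y = np.random.randint(0, 2, size=len(X))

# from_array wraps the existing in-memory buffers lazily, chunked by rows,
# instead of going through dd.from_pandas.
dX = da.from_array(X, chunks=(50000, X.shape[1]))
dy = da.from_array(y, chunks=50000)

model = xgb.dask.DaskXGBClassifier(tree_method='gpu_hist', n_estimators=2,
                                   objective='binary:logistic')
model.fit(dX, dy)
```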
I.e. if I use your exact sequence but only 5M rows of HIGGS, I hit the CPU OOM during the very first model.fit() on the client -- again, a different issue than the worker/scheduler memory problems or the GPU OOM.
Replying to #6388 (comment): @pseudotensor So here we have a number of issues:
After using
Is this a fair summary? Some more questions: How did you install xgboost? pip / conda / conda with the rapids build / built from source? Replying to #6388 (comment): Considering that I'm running it on 11M rows right now, should I cut it down to 5M and try again? Seems very strange. My machine has exactly 64 GB of memory.
Feel free to correct me if I'm wrong about the summary, but it's a bit difficult for me to follow with many comments mixed together. It would be great if you could list out some precise details: with which script, under what conditions, how large your data is, which process OOMs, and whether the OOM is happening on CPU or GPU.
I'll summarize:
Whether the RMM option is there or not doesn't matter; I still hit the GPU OOM message. The client hangs, which is bad behavior as well. from_pandas does nothing to fix this. No mitigation so far.
No mitigation so far. Your script shows the same accumulation. When the worker is killed by the OOM killer after 31 loops (for the 5M-row case, but otherwise your original script), the client running xgboost hangs, which is also bad. To generate the HIGGS5M.csv I'm just taking the original full HIGGS and doing:
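The exact command above was cut off; one illustrative way to produce such a 5M-row subset (an assumption, not necessarily the command that was used):

```python
import pandas as pd

# Keep only the first 5,000,000 rows of the original (headerless) file.
pd.read_csv('HIGGS.csv', header=None, nrows=5_000_000).to_csv(
    'HIGGS5M.csv', header=False, index=False)
```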
Thanks for summarizing. How about:
Assuming that you are using 1.2.
Yes, for the last month all my issues have been on the exact same setup software-wise: #6232 (comment). Yes: rapids 0.14, conda, all binaries from conda repos, xgboost 1.2. I can share full details of my conda solution if needed. The only difference for this issue is that I'm on a 1080ti for all cases (the full airlines 3-node cluster case as well as this MRE case).
Great! I will try reproducing it over this weekend.
I made a comment above, but your version of the script without xgboost leads to no scheduler/worker memory accumulation. So issue 3 does seem like a pure xgboost issue of (I guess) not cleaning up temporary dask objects, e.g. not cancelling futures.
With this otherwise-identical script but doing no xgboost work, the worker stays at about 3.6% of 64GB of memory and the scheduler at 2%, even over 1000 iterations, and there is never anything like issue 3.
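One way to watch scheduler and worker memory from the client while such a loop runs; this is a sketch using psutil and is not part of the original scripts.

```python
import os
import psutil
from dask.distributed import Client

def rss_gb():
    # Resident set size of whatever process this function runs in, in GB.
    return psutil.Process(os.getpid()).memory_info().rss / 1e9

client = Client(scheduler_file='scheduler.json')
print('scheduler RSS (GB):', client.run_on_scheduler(rss_gb))
print('worker RSS (GB):', client.run(rss_gb))  # dict keyed by worker address
```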
A number of things have changed since RAPIDS 0.14. Can you try with 0.16? It was just released in late October.
Yes, although I think all 3 problems are pure xgboost issues: #6388 (comment). For problem 1), this happens before any GPU work; the numpy frame is poorly handled by xgboost.
Continued the discussion in dask/dask#6833. A reproducible example without xgboost is given in dask/dask#6833 (comment).
Trying to use rapids 0.14, conda, Ubuntu 16.04, Python 3.6 with xgboost 1.2.
I can't tell if this issue is a dask problem or an xgboost one: dask/dask#6833
So I'm cross-posting here. It's a serious problem.
I'm just running a sample of the airlines data in a fit, about 5M rows x 10 columns. The workers never show too much memory use, but dask-scheduler keeps accumulating memory. The work is well distributed on the cluster, using all GPUs on a 3-node cluster with 2 GPUs each.
I can possibly imagine that xgboost is persisting memory and that, when the work is done, the data gets pushed to the scheduler, but I'm not clear on what is going on. I can imagine my use of the data is sticking data into the graph, but it's unclear why it persists once xgboost is done and the fork is closed.
Any ideas @trivialfis ? Thanks!