Allow releasing GPU memory #4668
Or we could pass num_boost_round to the C++ side?
For those looking for a quick workaround until this is fixed properly, check my solution here.
@seanthegreat7 Thanks. That's actually an interesting workaround.
None of the workarounds seem to work on Windows 10. I tried deleting and reloading the booster object (it still crashed), and I tried predicting in a subprocess similar to @seanthegreat7's approach (but for R instead of Python); the subprocess just ran indefinitely without finishing. A proper solution for this issue would be greatly appreciated!
I'm finding this very difficult, especially when performing a wide parameter search in a loop of some kind. For example:
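Something roughly like this (the data and parameter grid here are synthetic, purely to illustrate the pattern):

```python
import numpy as np
import xgboost as xgb
from sklearn.model_selection import ParameterGrid

# Synthetic data, only to make the sketch self-contained.
X = np.random.rand(10_000, 50)
y = np.random.rand(10_000)
dtrain = xgb.DMatrix(X, label=y)

results = []
for params in ParameterGrid({"max_depth": [4, 6, 8], "eta": [0.05, 0.1, 0.3]}):
    params.update({"tree_method": "gpu_hist", "objective": "reg:squarederror"})
    trained_model = xgb.train(params, dtrain, num_boost_round=500)
    results.append(trained_model.eval(dtrain))
    # trained_model keeps its GPU allocations alive here, so device memory
    # accumulates across iterations of the search.
```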
This will crash, since I guess trained_model hangs around on the GPU indefinitely. I'm also running into the same issue when submitting numerous jobs via a Dask scheduler (note: not dask-xgboost). In both cases the GPU eventually runs out of memory.
I don't have a view on the best solution, but I would love to see this resolved.
My hack is to do this (sketched below), and it has solved my GPU memory leak.
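A sketch of the idea, with illustrative names and paths: keep a copy of the model on disk, drop the GPU-backed booster object, and force a garbage-collection pass.

```python
import gc
import xgboost as xgb

def train_and_release(params, dtrain, num_boost_round, model_path="model.bin"):
    """Train, persist the model to disk, then drop the GPU-backed booster."""
    booster = xgb.train(params, dtrain, num_boost_round=num_boost_round)
    booster.save_model(model_path)  # keep a CPU/disk copy of the model
    del booster                     # drop the object holding the GPU allocations
    gc.collect()                    # nudge Python to free it right away
    return model_path

# Reload later for prediction on a freshly constructed booster:
# booster = xgb.Booster(model_file="model.bin")
```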
Wouldn't it be easier to implement the function as in PyTorch?
It wouldn't be easier, but that's an option.
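For reference, the PyTorch mechanism being alluded to is presumably its explicit cache-release call; the XGBoost-side name below is purely hypothetical and does not exist in the library.

```python
import torch

# PyTorch lets the user explicitly return cached, unused device memory to the driver:
torch.cuda.empty_cache()

# The suggestion would be an analogous explicit call on the XGBoost side, e.g.
# xgb.release_gpu_memory()  # hypothetical name; no such function exists today
```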
@trivialfis Do you (or someone else) plan to fix this problem at all?
I am running into this same issue when training many small gpu_hist models.
Could you please open a new issue?
One common piece of feedback we receive about the GPU algorithms is that memory is not released after training. It may be possible to release memory by deleting the booster object, but this is not a great user experience.
See
#4018
#3083
#2663
#3045
The reason we have not implemented this already is that the internal C++ code does not actually know when training is finished. The language bindings invoke each training iteration one by one, and I don't believe we have any information inside the GPU training code indicating whether another training iteration is expected.
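As a rough illustration, the Python binding drives boosting something like this (simplified; the real xgb.train also handles evaluation and callbacks), with each round going through a single XGBoosterUpdateOneIter call in the C API:

```python
import numpy as np
import xgboost as xgb

X = np.random.rand(1_000, 20)
y = np.random.rand(1_000)
dtrain = xgb.DMatrix(X, label=y)

booster = xgb.Booster({"tree_method": "gpu_hist"}, [dtrain])
for i in range(100):
    # Each call reaches the C++ side as an isolated update; nothing tells the
    # GPU updater whether another round will follow, so it keeps its buffers.
    booster.update(dtrain, iteration=i)
```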
I see a few solutions:
I am leaning towards option 3), but I think it relies on #3980 to make sure all parameters are correctly saved. Maybe it's still possible to do this with the current serialization without any unexpected side effects from parameters not all being saved.
@trivialfis @sriramch @rongou