[GPU][python] GPU training not working on win10 #438
Comments
Can you run the clinfo utility and see if OpenCL devices can be detected?
Number of platforms: 1
Platform Name: NVIDIA CUDA
Maybe a similar issue? dmlc/xgboost#1163
Because the crash is inside clGetPlatformIDs(), a basic OpenCL function, it could be a linking/runtime issue.
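One way to check the linking/runtime hypothesis outside LightGBM is to call clGetPlatformIDs() directly from Python through ctypes. The following is only a minimal sketch: the function name `probe_opencl_platforms` is made up for illustration, and it assumes a 64-bit Python (as in the reported environment), where the default ctypes calling convention matches the OpenCL library. It loads whichever OpenCL library the system finds first, which may not be the same copy that lightgbm.exe was linked against.

```python
import ctypes
import ctypes.util


def probe_opencl_platforms():
    """Call clGetPlatformIDs once and report what it returns.

    Returns None if no OpenCL library can be located or loaded,
    otherwise a (status, num_platforms) tuple. A status of 0 is CL_SUCCESS.
    """
    # Looks for OpenCL.dll on Windows, libOpenCL.so on Linux, etc.
    library_name = ctypes.util.find_library("OpenCL")
    if library_name is None:
        return None
    try:
        opencl = ctypes.CDLL(library_name)
    except OSError:
        return None

    # cl_int clGetPlatformIDs(cl_uint num_entries,
    #                         cl_platform_id *platforms,
    #                         cl_uint *num_platforms)
    opencl.clGetPlatformIDs.argtypes = [
        ctypes.c_uint32,
        ctypes.c_void_p,
        ctypes.POINTER(ctypes.c_uint32),
    ]
    opencl.clGetPlatformIDs.restype = ctypes.c_int32

    # Passing 0 entries and a NULL array only asks for the platform count.
    num_platforms = ctypes.c_uint32(0)
    status = opencl.clGetPlatformIDs(0, None, ctypes.byref(num_platforms))
    return status, num_platforms.value


if __name__ == "__main__":
    result = probe_opencl_platforms()
    if result is None:
        print("No OpenCL library could be loaded")
    else:
        print("status = %d, platforms = %d" % result)
```

If this call succeeds from Python while lightgbm.exe still crashes in the same function, that points toward a problem with how the executable was linked rather than with the installed driver.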
@gugatr0n1c You can try to get a gdb disassembly at the crashing RIP. If the instruction is an indirect JMP, then it is very likely you have the same issue as the Stack Overflow question I mentioned above (conflicting OpenCL.lib and OpenCL.dll installed on the system).
Sorry, but this is beyond my skills (I was happy enough to get through the tutorial and install all that stuff). I tried to read the Stack Overflow link you are referring to, but I have no idea what to try. Anyway, if this is the case, how can I change the installation to link to the correct OpenCL.dll file?
@gugatr0n1c Does cmake locate the NVIDIA OpenCL library correctly? It should be something like:
There is a DLL in system32 which might be picked up before NVIDIA's. Try something like this in cmake, but adapt the paths correctly (change the AMD stuff to NVIDIA stuff). Remember to clean up the LightGBM directory before doing it (make a new clone and folder if you are unsure).
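To see which copies of OpenCL.dll are reachable and in what order, the PATH can be scanned with a few lines of Python. This is a rough sketch with a hypothetical helper name (`find_on_path`); it only walks the PATH variable, whereas the real Windows DLL search also checks the application directory and the system directories before PATH, so treat the result as a hint rather than the exact load order.

```python
import os


def find_on_path(filename, path_value=None):
    """Return every copy of `filename` found on the search path, in path order."""
    if path_value is None:
        path_value = os.environ.get("PATH", "")

    hits = []
    seen = set()
    for directory in path_value.split(os.pathsep):
        directory = directory.strip().strip('"')
        if not directory:
            continue
        # Skip directories that are listed more than once.
        key = os.path.normcase(os.path.abspath(directory))
        if key in seen:
            continue
        seen.add(key)

        candidate = os.path.join(directory, filename)
        if os.path.isfile(candidate):
            hits.append(candidate)
    return hits


if __name__ == "__main__":
    for position, path in enumerate(find_on_path("OpenCL.dll"), start=1):
        print(position, path)
```

More than one hit means several OpenCL.dll files are competing, and the first one listed is the most likely to be picked up among the PATH entries.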
@Laurae2 Seems to be OK, or not?
@gugatr0n1c Seems OK if configuring and generating are shown as correct. Does the command line interface GPU demo work with that setup? (Make sure to re-configure and re-generate the files when you manually edit values.)
@Laurae2 cd C:/github_repos/LightGBM/examples/binary_classification And I get the default Windows error dialog: Program lightgbm.exe stopped working... Before this error dialog I only got:
@gugatr0n1c clMathLibraries/clFFT#133 (comment) is what you are experiencing (with the solution inside). There might be incompatibilities between CUDA OpenCL and MinGW, which can be circumvented by using the Intel OpenCL files (NVIDIA made changes to their OpenCL, so it became non-standard and is unsupported by MinGW for some reason). If you installed Intel HD Graphics after the NVIDIA CUDA Toolkit, I think this issue should not show up. In the reverse case, it would always show up because the build would pick up NVIDIA's modified OpenCL instead of the Intel OpenCL files. This does not mean you cannot use NVIDIA GPUs for OpenCL; it means you need to get the right OpenCL lib/dll file, which is provided by either Intel or AMD. Download the Intel SDK for OpenCL here: https://software.intel.com/en-us/articles/opencl-drivers
There is a new tutorial for Visual Studio. I tried it and it worked like a charm, so I am closing this.
By the way, installing the Intel SDK did not help.
This issue has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.
For bugs and unexpected issues, please provide following information, so that we could reproduce on our system.
Environment info
WIN10
cuda 8.0, with latest driver, cuDNN 5.1 (training keras+tensorflow working OK)
gtx1080
python 3.5 (anaconda3 4.1.1, x64)
boost 1.63
cmake 3.8
mingw-x64 4.9.1
Error Message:
[LightGBM] [Info] This is the GPU trainer!!
[LightGBM] [Info] Total Bins 36823
[LightGBM] [Info] Number of data: 707, number of used features: 391
Traceback (most recent call last):
File "C:\Users\sery\Dropbox\myProjects\project_escape\vypocty\search_44_light.py", line 215, in <module>
early_stopping_rounds = 250
File "C:\Anaconda3\lib\site-packages\lightgbm-0.1-py3.5.egg\lightgbm\engine.py", line 163, in train
booster = Booster(params=params, train_set=train_set)
File "C:\Anaconda3\lib\site-packages\lightgbm-0.1-py3.5.egg\lightgbm\basic.py", line 1198, in __init__
ctypes.byref(self.handle)))
OSError: exception: access violation reading 0x00000000D62FC52E
[Finished in 1.5s with exit code 1]
[shell_cmd: python -u "C:\Users\sery\Dropbox\myProjects\project_escape\vypocty\search_44_light.py"]
[dir: C:\Users\sery\Dropbox\myProjects\project_escape\vypocty]
[path: C:\Program Files\mingw-w64\x86_64-4.9.1-release-posix-seh-rt_v3-rev2\mingw64\bin;C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v8.0\bin;C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v8.0\libnvvp;C:\Windows\system32;C:\Windows;C:\Windows\System32\Wbem;C:\Windows\System32\WindowsPowerShell\v1.0;C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common;C:\Anaconda3;C:\Anaconda3\Scripts;C:\Anaconda3\Library\bin;C:\boost\boost-build\bin;C:\boost\boost-build\include\boost;C:\Program Files\Git\cmd;C:\Users\sery\AppData\Local\Microsoft\WindowsApps;]
I followed the installation tutorial and everything seemed to work smoothly.
CPU training is working properly.
Then I deleted the whole LightGBM directory, along with the Python library directory inside Anaconda, and tried again with the debugger:
Starting program: C:\github_repos\LightGBM\lightgbm.exe "config=train.conf" "data=binary.train" "valid=binary.test" "objective=binary" "device=gpu"
[New Thread 1820.0x23a8]
[New Thread 1820.0x4b8]
[New Thread 1820.0x1348]
[New Thread 1820.0x268c]
[LightGBM] [Info] Finished loading parameters
[LightGBM] [Info] Loading weights...
[LightGBM] [Info] Loading weights...
[LightGBM] [Info] Finished loading data in 1.030765 seconds
[LightGBM] [Info] Number of positive: 3716, number of negative: 3284
[LightGBM] [Info] This is the GPU trainer!!
[LightGBM] [Info] Total Bins 6143
[LightGBM] [Info] Number of data: 7000, number of used features: 28
[New Thread 1820.0x1a14]
[Thread 1820.0x1a14 exited with code 0]
[New Thread 1820.0x24e4]
[New Thread 1820.0x1950]
[New Thread 1820.0x2350]
[New Thread 1820.0x1fb0]
[New Thread 1820.0x26d8]
[New Thread 1820.0x276c]
[New Thread 1820.0xcec]
[New Thread 1820.0x1c84]
[Thread 1820.0x1c84 exited with code 0]
[New Thread 1820.0x398]
[Thread 1820.0x398 exited with code 0]
[New Thread 1820.0xff0]
[Thread 1820.0xff0 exited with code 0]
[New Thread 1820.0x279c]
[Thread 1820.0x279c exited with code 0]
Program received signal SIGSEGV, Segmentation fault.
0x0000000000478c48 in clGetPlatformIDs ()
(gdb) backtrace
#0 0x0000000000478c48 in clGetPlatformIDs ()
#1 0x00000000004a5d23 in boost::compute::system::platforms() ()
#2 0x00000000004a5c3b in boost::compute::system::devices() ()
#3 0x00000000004a53d0 in boost::compute::system::find_default_device() ()
#4 0x000000000046d31d in LightGBM::GPUTreeLearner::InitGPU(int, int) ()
#5 0x0000000000418c1a in LightGBM::GBDT::ResetTrainingData(LightGBM::BoostingConfig const*, LightGBM::Dataset const*, LightGBM::ObjectiveFunction const*, std::vector<LightGBM::Metric const*, std::allocator<LightGBM::Metric const*> > const&) ()
#6 0x0000000000411969 in LightGBM::GBDT::Init(LightGBM::BoostingConfig const*, LightGBM::Dataset const*, LightGBM::ObjectiveFunction const*, std::vector<LightGBM::Metric const*, std::allocator<LightGBM::Metric const*> > const&) ()
#7 0x0000000000404a5a in LightGBM::Application::InitTrain() ()
#8 0x00000000004f2a25 in main ()