
[GPU][python] GPU training not working on win10 #438

Closed
gugatr0n1c opened this issue Apr 21, 2017 · 14 comments

@gugatr0n1c

For bugs and unexpected issues, please provide the following information so that we can reproduce them on our system.

Environment info

WIN10
cuda 8.0, with latest driver, cuDNN 5.1 (training keras+tensorflow working OK)
gtx1080
python 3.5 (anaconda3 4.1.1, x64)
boost 1.63
cmake 3.8
mingw-x64 4.9.1

Error Message:

[LightGBM] [Info] This is the GPU trainer!!
[LightGBM] [Info] Total Bins 36823
[LightGBM] [Info] Number of data: 707, number of used features: 391
Traceback (most recent call last):
File "C:\Users\sery\Dropbox\myProjects\project_escape\vypocty\search_44_light.py", line 215, in
early_stopping_rounds = 250
File "C:\Anaconda3\lib\site-packages\lightgbm-0.1-py3.5.egg\lightgbm\engine.py", line 163, in train
booster = Booster(params=params, train_set=train_set)
File "C:\Anaconda3\lib\site-packages\lightgbm-0.1-py3.5.egg\lightgbm\basic.py", line 1198, in init
ctypes.byref(self.handle)))
OSError: exception: access violation reading 0x00000000D62FC52E
[Finished in 1.5s with exit code 1]
[shell_cmd: python -u "C:\Users\sery\Dropbox\myProjects\project_escape\vypocty\search_44_light.py"]
[dir: C:\Users\sery\Dropbox\myProjects\project_escape\vypocty]
[path: C:\Program Files\mingw-w64\x86_64-4.9.1-release-posix-seh-rt_v3-rev2\mingw64\bin;C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v8.0\bin;C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v8.0\libnvvp;C:\Windows\system32;C:\Windows;C:\Windows\System32\Wbem;C:\Windows\System32\WindowsPowerShell\v1.0;C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common;C:\Anaconda3;C:\Anaconda3\Scripts;C:\Anaconda3\Library\bin;C:\boost\boost-build\bin;C:\boost\boost-build\include\boost;C:\Program Files\Git\cmd;C:\Users\sery\AppData\Local\Microsoft\WindowsApps;]

I followed the installation tutorial and everything seemed to work smoothly.
CPU training is working properly.

Then I deleted the whole LightGBM directory, along with the Python lib directory inside Anaconda, and tried again with a debug build:

Starting program: C:\github_repos\LightGBM\lightgbm.exe "config=train.conf" "dat a=binary.train" "valid=binary.test" "objective=binary" "device=gpu"
[New Thread 1820.0x23a8]
[New Thread 1820.0x4b8]
[New Thread 1820.0x1348]
[New Thread 1820.0x268c]
[LightGBM] [Info] Finished loading parameters
[LightGBM] [Info] Loading weights...
[LightGBM] [Info] Loading weights...
[LightGBM] [Info] Finished loading data in 1.030765 seconds
[LightGBM] [Info] Number of positive: 3716, number of negative: 3284
[LightGBM] [Info] This is the GPU trainer!!
[LightGBM] [Info] Total Bins 6143
[LightGBM] [Info] Number of data: 7000, number of used features: 28
[New Thread 1820.0x1a14]
[Thread 1820.0x1a14 exited with code 0]
[New Thread 1820.0x24e4]
[New Thread 1820.0x1950]
[New Thread 1820.0x2350]
[New Thread 1820.0x1fb0]
[New Thread 1820.0x26d8]
[New Thread 1820.0x276c]
[New Thread 1820.0xcec]
[New Thread 1820.0x1c84]
[Thread 1820.0x1c84 exited with code 0]
[New Thread 1820.0x398]
[Thread 1820.0x398 exited with code 0]
[New Thread 1820.0xff0]
[Thread 1820.0xff0 exited with code 0]
[New Thread 1820.0x279c]
[Thread 1820.0x279c exited with code 0]

Program received signal SIGSEGV, Segmentation fault.
0x0000000000478c48 in clGetPlatformIDs ()
(gdb) backtrace
#0 0x0000000000478c48 in clGetPlatformIDs ()
#1 0x00000000004a5d23 in boost::compute::system::platforms() ()
#2 0x00000000004a5c3b in boost::compute::system::devices() ()
#3 0x00000000004a53d0 in boost::compute::system::find_default_device() ()
#4 0x000000000046d31d in LightGBM::GPUTreeLearner::InitGPU(int, int) ()
#5 0x0000000000418c1a in LightGBM::GBDT::ResetTrainingData(LightGBM::BoostingConfig const*, LightGBM::Dataset const*, LightGBM::ObjectiveFunction const*, std::vector<LightGBM::Metric const*, std::allocator<LightGBM::Metric const*> > const&) ()
#6 0x0000000000411969 in LightGBM::GBDT::Init(LightGBM::BoostingConfig const*, LightGBM::Dataset const*, LightGBM::ObjectiveFunction const*, std::vector<LightGBM::Metric const*, std::allocator<LightGBM::Metric const*> > const&) ()
#7 0x0000000000404a5a in LightGBM::Application::InitTrain() ()
#8 0x00000000004f2a25 in main ()

@huanzhang12
Contributor

Can you run the clinfo utility and see if OpenCL devices can be detected?
You can get a copy of clinfo from the boinc project (untested because I don't run Windows, but should work):
https://boinc.berkeley.edu/dl/clinfo.zip

@gugatr0n1c
Author

Number of platforms: 1
Platform Profile: FULL_PROFILE
Platform Version: OpenCL 1.2 CUDA 8.0.0
Platform Name: NVIDIA CUDA
Platform Vendor: NVIDIA Corporation
Platform Extensions: cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_fp64 cl_khr_byte_addressable_store cl_khr_icd cl_khr_gl_sharing cl_nv_compiler_options cl_nv_device_attribute_query cl_nv_pragma_unroll cl_nv_d3d10_sharing cl_khr_d3d10_sharing cl_nv_d3d11_sharing cl_nv_copy_opts cl_nv_create_buffer

Platform Name: NVIDIA CUDA
Number of devices: 1
Device Type: CL_DEVICE_TYPE_GPU
Device ID: 4318
Max compute units: 20
Max work items dimensions: 3
Max work items[0]: 1024
Max work items[1]: 1024
Max work items[2]: 64
Max work group size: 1024
Preferred vector width char: 1
Preferred vector width short: 1
Preferred vector width int: 1
Preferred vector width long: 1
Preferred vector width float: 1
Preferred vector width double: 1
Max clock frequency: 1733Mhz
Address bits: 14757395255531667488
Max memory allocation: 2147483648
Image support: Yes
Max number of images read arguments: 256
Max number of images write arguments: 16
Max image 2D width: 16384
Max image 2D height: 32768
Max image 3D width: 16384
Max image 3D height: 16384
Max image 3D depth: 16384
Max samplers within kernel: 32
Max size of kernel argument: 4352
Alignment (bits) of base address: 4096
Minimum alignment (bytes) for any datatype: 128
Single precision floating point capability
Denorms: Yes
Quiet NaNs: Yes
Round to nearest even: Yes
Round to zero: Yes
Round to +ve and infinity: Yes
IEEE754-2008 fused multiply-add: Yes
Cache type: Read/Write
Cache line size: 128
Cache size: 327680
Global memory size: 8589934592
Constant buffer size: 65536
Max number of constant args: 9
Local memory type: Scratchpad
Local memory size: 49152
Error correction support: 0
Profiling timer resolution: 1000
Device endianess: Little
Available: Yes
Compiler available: Yes
Execution capabilities:
Execute OpenCL kernels: Yes
Execute native function: No
Queue properties:
Out-of-Order: Yes
Profiling : Yes
Platform ID: 00A79D18
Name: GeForce GTX 1080
Vendor: NVIDIA Corporation
Driver version: 381.65
Profile: FULL_PROFILE
Version: OpenCL 1.2 CUDA
Extensions: cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_fp64 cl_khr_byte_addressable_store cl_khr_icd cl_khr_gl_sharing cl_nv_compiler_options cl_nv_device_attribute_query cl_nv_pragma_unroll cl_nv_d3d10_sharing cl_khr_d3d10_sharing cl_nv_d3d11_sharing cl_nv_copy_opts cl_nv_create_buffer

@gugatr0n1c
Author

Maybe a similar issue? dmlc/xgboost#1163

@huanzhang12
Contributor

Because the crash is inside clGetPlatformIDs(), a basic OpenCL function, it could be a linking/runtime issue.
I would suggest compiling a simple OpenCL program and seeing if it works. Also take a look at this question, which describes a problem very similar to yours: http://stackoverflow.com/questions/24641898/opencl-crashes-on-call-to-clgetplatformids
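For reference, a minimal test along those lines might look like the sketch below (untested here; it assumes you compile it with the same MinGW toolchain and link it against the same OpenCL headers and OpenCL.lib that LightGBM uses). If this small program also crashes inside clGetPlatformIDs(), the problem is in the OpenCL setup rather than in LightGBM:

#include <CL/cl.h>
#include <cstdio>

int main() {
    // Same basic call that crashes during LightGBM's GPU initialization:
    // ask only for the number of available OpenCL platforms.
    cl_uint num_platforms = 0;
    cl_int err = clGetPlatformIDs(0, nullptr, &num_platforms);
    if (err != CL_SUCCESS) {
        std::printf("clGetPlatformIDs failed with error %d\n", err);
        return 1;
    }
    std::printf("Found %u OpenCL platform(s)\n", num_platforms);
    return 0;
}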

@huanzhang12
Contributor

@gugatr0n1c You can try to get a gdb disassembly at the crashing RIP. If the instruction is an indirect JMP, then it is very likely you have the same issue as the Stack Overflow question I mentioned above (conflicting OpenCL.lib and OpenCL.dll installed on the system).
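As a sketch of what that could look like, reusing the gdb session from the earlier post (the exact addresses will differ on your machine):

(gdb) run config=train.conf data=binary.train valid=binary.test objective=binary device=gpu
...
Program received signal SIGSEGV, Segmentation fault.
0x0000000000478c48 in clGetPlatformIDs ()
(gdb) x/5i $pc
(gdb) info registers rip

x/5i $pc disassembles the five instructions at the faulting address. An indirect jump there (for example jmp *%rax) would be consistent with the OpenCL ICD dispatch jumping through a pointer taken from a mismatched OpenCL.dll, i.e. the conflict described in that Stack Overflow question.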

@gugatr0n1c
Author

Sorry, but this is beyond my skill (I was happy enough to get through the tutorial and install all that stuff). I tried to read the Stack Overflow link you are referring to, but I have no idea what to try.
I searched for opencl.dll and opencl.lib, and yes, there are multiple versions of them (in the system32 dir, in the CUDA dir, in the OpenCL subdir of NVIDIA, ...). It seems to me that there is no SDK from Intel, and GPU Caps Viewer detects only one device, the GTX 1080 card. Also, this is an almost fresh installation of Win10, so I am not sure how I created a conflict between two OpenCL libs (maybe one is part of Windows?).
When I run the OpenCL demo from GPU Caps Viewer, it works.

Anyway, if this is the case, how can I change the installation to link to the correct opencl.dll file?

@Laurae2
Contributor

Laurae2 commented Apr 21, 2017

@gugatr0n1c Does cmake locate the NVIDIA OpenCL library correctly? It should be something like:

  • C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v8.0\lib\x64 for the CL library (do you have this? For CLI / Python, please select the OpenCL.lib file)
  • C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v8.0\include for the CL includes (what you currently have; this one is OK)

There is a DLL in system32 which might be picked up before NVIDIA's lib\x64 path. The former should be fine as the default OpenCL, but AFAIK NVIDIA's Windows OpenCL behaves slightly differently when detecting devices, so it might cause such issues.

Try something like this in cmake, but adapt the paths correctly (change the AMD entries to the NVIDIA ones):

[screenshot: CMake configuration with the OpenCL library and include paths set]

Remember to clean up the LightGBM directory before doing it (make a new clone and folder if you are unsure).
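For reference, the same thing from the command line would look roughly like the sketch below (paths assume the default CUDA 8.0 install location and may need adjusting; if the cmake options differ in your LightGBM version, check the GPU installation guide):

cd C:\github_repos\LightGBM
mkdir build
cd build
cmake .. -G "MinGW Makefiles" -DUSE_GPU=1 ^
  -DOpenCL_LIBRARY="C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v8.0/lib/x64/OpenCL.lib" ^
  -DOpenCL_INCLUDE_DIR="C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v8.0/include"
mingw32-make -j4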

@gugatr0n1c
Author

@Laurae2 Seems to be OK, or not?

[screenshot of the poster's CMake configuration]

@Laurae2
Contributor

Laurae2 commented Apr 21, 2017

@gugatr0n1c Seems OK if configuring and generating are shown as successful. Does the command-line interface GPU demo work with that setup? (Make sure to re-configure and re-generate the files when you edit values manually.)

@gugatr0n1c
Author

@Laurae2
I now tried to run the demo from the CLI as suggested in the tutorial (I have the same dir structure):

cd C:/github_repos/LightGBM/examples/binary_classification
"../../lightgbm.exe" config=train.conf data=binary.train valid=binary.test objective=binary device=gpu

And I get the default Windows error dialog: "Program lightgbm.exe stopped working..."

Before this error dialog, I only got:
C:\github_repos\LightGBM\examples\binary_classification>"../../lightgbm.exe" config=train.conf data=binary.train valid=binary.test objective=binary device=gpu
[LightGBM] [Info] Finished loading parameters
[LightGBM] [Info] Loading weights...
[LightGBM] [Info] Loading weights...
[LightGBM] [Info] Finished loading data in 0.082058 seconds
[LightGBM] [Info] Number of positive: 3716, number of negative: 3284
[LightGBM] [Info] This is the GPU trainer!!
[LightGBM] [Info] Total Bins 6143
[LightGBM] [Info] Number of data: 7000, number of used features: 28

@Laurae2
Contributor

Laurae2 commented Apr 21, 2017

@gugatr0n1c clMathLibraries/clFFT#133 (comment) is what you are experiencing (with the solution inside).

There might be incompatibilities between CUDA OpenCL and MinGW which can be circumvented by using the Intel OpenCL files (NVIDIA made changes to their OpenCL, so it becomes non-standard and unsupported by MinGW for some reason).

If you installed the Intel HD Graphics drivers after the NVIDIA CUDA Toolkit, I think this issue should not show up. In the reverse case, it would always show up, because the build would pick up NVIDIA's modified OpenCL instead of the Intel OpenCL files (it does not mean you cannot use NVIDIA GPUs for OpenCL; it means you need to get the right OpenCL lib/dll file, which is provided by either Intel or AMD).

Download the Intel SDK for OpenCL here: https://software.intel.com/en-us/articles/opencl-drivers

@gugatr0n1c
Author

There is a new tutorial for Visual Studio; I tried it and it works like a charm. So I am closing this.

@gugatr0n1c
Author

Btw, installing the Intel SDK did not help.

@github-actions

This issue has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 24, 2023