
[GPU][python] GPU training not working on win10 #438

Closed
gugatr0n1c opened this issue Apr 21, 2017 · 14 comments

@gugatr0n1c

For bugs and unexpected issues, please provide the following information so that we can reproduce them on our system.

Environment info

WIN10
cuda 8.0, with latest driver, cuDNN 5.1 (training keras+tensorflow working OK)
gtx1080
python 3.5 (anaconda3 4.1.1, x64)
boost 1.63
cmake 3.8
mingw-x64 4.9.1

Error Message:

[LightGBM] [Info] This is the GPU trainer!!
[LightGBM] [Info] Total Bins 36823
[LightGBM] [Info] Number of data: 707, number of used features: 391
Traceback (most recent call last):
File "C:\Users\sery\Dropbox\myProjects\project_escape\vypocty\search_44_light.py", line 215, in
early_stopping_rounds = 250
File "C:\Anaconda3\lib\site-packages\lightgbm-0.1-py3.5.egg\lightgbm\engine.py", line 163, in train
booster = Booster(params=params, train_set=train_set)
File "C:\Anaconda3\lib\site-packages\lightgbm-0.1-py3.5.egg\lightgbm\basic.py", line 1198, in init
ctypes.byref(self.handle)))
OSError: exception: access violation reading 0x00000000D62FC52E
[Finished in 1.5s with exit code 1]
[shell_cmd: python -u "C:\Users\sery\Dropbox\myProjects\project_escape\vypocty\search_44_light.py"]
[dir: C:\Users\sery\Dropbox\myProjects\project_escape\vypocty]
[path: C:\Program Files\mingw-w64\x86_64-4.9.1-release-posix-seh-rt_v3-rev2\mingw64\bin;C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v8.0\bin;C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v8.0\libnvvp;C:\Windows\system32;C:\Windows;C:\Windows\System32\Wbem;C:\Windows\System32\WindowsPowerShell\v1.0;C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common;C:\Anaconda3;C:\Anaconda3\Scripts;C:\Anaconda3\Library\bin;C:\boost\boost-build\bin;C:\boost\boost-build\include\boost;C:\Program Files\Git\cmd;C:\Users\sery\AppData\Local\Microsoft\WindowsApps;]

I followed the installation tutorial and everything seemed to work smoothly.
CPU training is working properly.

Then I deleted the whole LightGBM directory, along with the Python lib directory inside Anaconda, and tried again with a debug build:

Starting program: C:\github_repos\LightGBM\lightgbm.exe "config=train.conf" "dat a=binary.train" "valid=binary.test" "objective=binary" "device=gpu"
[New Thread 1820.0x23a8]
[New Thread 1820.0x4b8]
[New Thread 1820.0x1348]
[New Thread 1820.0x268c]
[LightGBM] [Info] Finished loading parameters
[LightGBM] [Info] Loading weights...
[LightGBM] [Info] Loading weights...
[LightGBM] [Info] Finished loading data in 1.030765 seconds
[LightGBM] [Info] Number of positive: 3716, number of negative: 3284
[LightGBM] [Info] This is the GPU trainer!!
[LightGBM] [Info] Total Bins 6143
[LightGBM] [Info] Number of data: 7000, number of used features: 28
[New Thread 1820.0x1a14]
[Thread 1820.0x1a14 exited with code 0]
[New Thread 1820.0x24e4]
[New Thread 1820.0x1950]
[New Thread 1820.0x2350]
[New Thread 1820.0x1fb0]
[New Thread 1820.0x26d8]
[New Thread 1820.0x276c]
[New Thread 1820.0xcec]
[New Thread 1820.0x1c84]
[Thread 1820.0x1c84 exited with code 0]
[New Thread 1820.0x398]
[Thread 1820.0x398 exited with code 0]
[New Thread 1820.0xff0]
[Thread 1820.0xff0 exited with code 0]
[New Thread 1820.0x279c]
[Thread 1820.0x279c exited with code 0]

Program received signal SIGSEGV, Segmentation fault.
0x0000000000478c48 in clGetPlatformIDs ()
(gdb) backtrace
#0 0x0000000000478c48 in clGetPlatformIDs ()
#1 0x00000000004a5d23 in boost::compute::system::platforms() ()
#2 0x00000000004a5c3b in boost::compute::system::devices() ()
#3 0x00000000004a53d0 in boost::compute::system::find_default_device() ()
#4 0x000000000046d31d in LightGBM::GPUTreeLearner::InitGPU(int, int) ()
#5 0x0000000000418c1a in LightGBM::GBDT::ResetTrainingData(LightGBM::BoostingConfig const*, LightGBM::Dataset const*, LightGBM::ObjectiveFunction const*, std::vector<LightGBM::Metric const*, std::allocator<LightGBM::Metric const*> > const&) ()
#6 0x0000000000411969 in LightGBM::GBDT::Init(LightGBM::BoostingConfig const*, LightGBM::Dataset const*, LightGBM::ObjectiveFunction const*, std::vector<LightGBM::Metric const*, std::allocator<LightGBM::Metric const*> > const&) ()
#7 0x0000000000404a5a in LightGBM::Application::InitTrain() ()
#8 0x00000000004f2a25 in main ()

@huanzhang12
Contributor

Can you run the clinfo utility and see if OpenCL devices can be detected?
You can get a copy of clinfo from the boinc project (untested because I don't run Windows, but should work):
https://boinc.berkeley.edu/dl/clinfo.zip

@gugatr0n1c
Author

Number of platforms: 1
Platform Profile: FULL_PROFILE
Platform Version: OpenCL 1.2 CUDA 8.0.0
Platform Name: NVIDIA CUDA
Platform Vendor: NVIDIA Corporation
Platform Extensions: cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_fp64 cl_khr_byte_addressable_store cl_khr_icd cl_khr_gl_sharing cl_nv_compiler_options cl_nv_device_attribute_query cl_nv_pragma_unroll cl_nv_d3d10_sharing cl_khr_d3d10_sharing cl_nv_d3d11_sharing cl_nv_copy_opts cl_nv_create_buffer

Platform Name: NVIDIA CUDA
Number of devices: 1
Device Type: CL_DEVICE_TYPE_GPU
Device ID: 4318
Max compute units: 20
Max work items dimensions: 3
Max work items[0]: 1024
Max work items[1]: 1024
Max work items[2]: 64
Max work group size: 1024
Preferred vector width char: 1
Preferred vector width short: 1
Preferred vector width int: 1
Preferred vector width long: 1
Preferred vector width float: 1
Preferred vector width double: 1
Max clock frequency: 1733Mhz
Address bits: 14757395255531667488
Max memory allocation: 2147483648
Image support: Yes
Max number of images read arguments: 256
Max number of images write arguments: 16
Max image 2D width: 16384
Max image 2D height: 32768
Max image 3D width: 16384
Max image 3D height: 16384
Max image 3D depth: 16384
Max samplers within kernel: 32
Max size of kernel argument: 4352
Alignment (bits) of base address: 4096
Minimum alignment (bytes) for any datatype: 128
Single precision floating point capability
Denorms: Yes
Quiet NaNs: Yes
Round to nearest even: Yes
Round to zero: Yes
Round to +ve and infinity: Yes
IEEE754-2008 fused multiply-add: Yes
Cache type: Read/Write
Cache line size: 128
Cache size: 327680
Global memory size: 8589934592
Constant buffer size: 65536
Max number of constant args: 9
Local memory type: Scratchpad
Local memory size: 49152
Error correction support: 0
Profiling timer resolution: 1000
Device endianess: Little
Available: Yes
Compiler available: Yes
Execution capabilities:
Execute OpenCL kernels: Yes
Execute native function: No
Queue properties:
Out-of-Order: Yes
Profiling : Yes
Platform ID: 00A79D18
Name: GeForce GTX 1080
Vendor: NVIDIA Corporation
Driver version: 381.65
Profile: FULL_PROFILE
Version: OpenCL 1.2 CUDA
Extensions: cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_fp64 cl_khr_byte_addressable_store cl_khr_icd cl_khr_gl_sharing cl_nv_compiler_options cl_nv_device_attribute_query cl_nv_pragma_unroll cl_nv_d3d10_sharing cl_khr_d3d10_sharing cl_nv_d3d11_sharing cl_nv_copy_opts cl_nv_create_buffer

@gugatr0n1c
Author

Maybe a similar issue? dmlc/xgboost#1163

@huanzhang12
Contributor

Because the crash is inside clGetPlatformIDs(), a basic OpenCL function, it could be a linking/runtime issue.
I would suggest compiling a simple OpenCL program and seeing if it works. Also take a look at this question, which describes a problem very similar to yours: http://stackoverflow.com/questions/24641898/opencl-crashes-on-call-to-clgetplatformids
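For reference, a minimal test along those lines might look like the sketch below (untested here; it assumes you compile it with the same MinGW toolchain and link it against the same OpenCL headers and OpenCL.lib that LightGBM uses). If this small program also crashes inside clGetPlatformIDs(), the problem is in the OpenCL setup rather than in LightGBM:

#include <CL/cl.h>
#include <cstdio>

int main() {
    // Same basic call that crashes during LightGBM's GPU initialization:
    // ask only for the number of available OpenCL platforms.
    cl_uint num_platforms = 0;
    cl_int err = clGetPlatformIDs(0, nullptr, &num_platforms);
    if (err != CL_SUCCESS) {
        std::printf("clGetPlatformIDs failed with error %d\n", err);
        return 1;
    }
    std::printf("Found %u OpenCL platform(s)\n", num_platforms);
    return 0;
}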

@huanzhang12
Contributor

@gugatr0n1c You can try to get a gdb disassembly at the crashing RIP. If the instruction is an indirect JMP, then it is very likely you have the same issue as the Stack Overflow question I mentioned above (conflicting OpenCL.lib and OpenCL.dll installed on the system).
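As a sketch of what that could look like, reusing the gdb session from the earlier post (the exact addresses will differ on your machine):

(gdb) run config=train.conf data=binary.train valid=binary.test objective=binary device=gpu
...
Program received signal SIGSEGV, Segmentation fault.
0x0000000000478c48 in clGetPlatformIDs ()
(gdb) x/5i $pc
(gdb) info registers rip

x/5i $pc disassembles the five instructions at the faulting address. An indirect jump there (for example jmp *%rax) would be consistent with the OpenCL ICD dispatch jumping through a pointer taken from a mismatched OpenCL.dll, i.e. the conflict described in that Stack Overflow question.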

@gugatr0n1c
Author

Sorry, but this is beyond my skill (I was happy enough to get through the tutorial and install all that stuff). I tried to read the Stack Overflow link you are referring to, but I have no idea what to try.
I searched for opencl.dll and opencl.lib, and yes, there are multiple versions of them (in the system32 dir, in the CUDA dir, in the OpenCL subdir of NVIDIA, ...). It seems to me that there is no SDK from Intel, and GPU Caps Viewer detects only one device, the GTX 1080 card. Also, this is an almost fresh installation of Win10, so I am not sure how I created a conflict between two OpenCL libs (maybe one is part of Windows?).
When I run the OpenCL demo from GPU Caps Viewer, it works.

Anyway, if this is the case, how can I change the installation to link to the correct opencl.dll file?

@Laurae2
Contributor

Laurae2 commented Apr 21, 2017

@gugatr0n1c Does cmake locate the NVIDIA OpenCL library correctly? It should be something like:

  • C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v8.0\lib\x64 for the CL library (do you have this? For CLI / Python, please select the OpenCL.lib file)
  • C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v8.0\include for the CL includes (what you currently have; this one is OK)

There is a DLL in system32 which might be picked up before NVIDIA's lib\x64 path. The former should be fine as the default OpenCL, but AFAIK NVIDIA's Windows OpenCL behaves slightly differently when detecting devices, so it might cause such issues.

Try something like this in cmake, but adapt the paths correctly (change the AMD entries to the NVIDIA ones):

[screenshot: CMake configuration with the OpenCL library and include paths set]

Remember to clean up the LightGBM directory before doing it (make a new clone and folder if you are unsure).
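For reference, the same thing from the command line would look roughly like the sketch below (paths assume the default CUDA 8.0 install location and may need adjusting; if the cmake options differ in your LightGBM version, check the GPU installation guide):

cd C:\github_repos\LightGBM
mkdir build
cd build
cmake .. -G "MinGW Makefiles" -DUSE_GPU=1 ^
  -DOpenCL_LIBRARY="C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v8.0/lib/x64/OpenCL.lib" ^
  -DOpenCL_INCLUDE_DIR="C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v8.0/include"
mingw32-make -j4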

@gugatr0n1c
Author

@Laurae2 Seems to be OK, or not?

[screenshot of the poster's CMake configuration]

@Laurae2
Contributor

Laurae2 commented Apr 21, 2017

@gugatr0n1c Seems OK if configuring and generating are shown as successful. Does the command-line interface GPU demo work with that setup? (Make sure to re-configure and re-generate the files when you edit values manually.)

@gugatr0n1c
Author

@Laurae2
I now tried to run the demo from the CLI as suggested in the tutorial (I have the same dir structure):

cd C:/github_repos/LightGBM/examples/binary_classification
"../../lightgbm.exe" config=train.conf data=binary.train valid=binary.test objective=binary device=gpu

And I get the default Windows error dialog: "Program lightgbm.exe stopped working..."

Before this error dialog, I only got:
C:\github_repos\LightGBM\examples\binary_classification>"../../lightgbm.exe" config=train.conf data=binary.train valid=binary.test objective=binary device=gpu
[LightGBM] [Info] Finished loading parameters
[LightGBM] [Info] Loading weights...
[LightGBM] [Info] Loading weights...
[LightGBM] [Info] Finished loading data in 0.082058 seconds
[LightGBM] [Info] Number of positive: 3716, number of negative: 3284
[LightGBM] [Info] This is the GPU trainer!!
[LightGBM] [Info] Total Bins 6143
[LightGBM] [Info] Number of data: 7000, number of used features: 28

@Laurae2
Contributor

Laurae2 commented Apr 21, 2017

@gugatr0n1c clMathLibraries/clFFT#133 (comment) is what you are experiencing (with the solution inside).

There might be incompatibilities between CUDA OpenCL and MinGW which can be circumvented by using the Intel OpenCL files (NVIDIA made changes to their OpenCL, so it becomes non-standard and unsupported by MinGW for some reason).

If you installed the Intel HD Graphics drivers after the NVIDIA CUDA Toolkit, I think this issue should not show up. In the reverse case, it would always show up, because the build would pick up NVIDIA's modified OpenCL instead of the Intel OpenCL files (it does not mean you cannot use NVIDIA GPUs for OpenCL; it means you need to get the right OpenCL lib/dll file, which is provided by either Intel or AMD).

Download the Intel SDK for OpenCL here: https://software.intel.com/en-us/articles/opencl-drivers

@gugatr0n1c
Author

There is a new tutorial for Visual Studio; I tried it and it works like a charm. So I am closing this.

@gugatr0n1c
Author

Btw, installing the Intel SDK did not help.

@github-actions

This issue has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 24, 2023