
knn predict wrong and varying predictions, cudaErrorIllegalAddress, or core dump #4629

Open
pseudotensor opened this issue Mar 10, 2022 · 7 comments

pseudotensor commented Mar 10, 2022

Same as #1685, but that one was closed by the author even though it was not fixed.

foo_df323a9b-bbb7-49a3-b06e-a9699702c09f.pkl.zip

import pickle
func, X = pickle.load(open("foo_df323a9b-bbb7-49a3-b06e-a9699702c09f.pkl", "rb"))
func(X)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/jon/minicondadai_py38/lib/python3.8/site-packages/cuml/internals/api_decorators.py", line 586, in inner_get
    ret_val = func(*args, **kwargs)
  File "cuml/neighbors/kneighbors_classifier.pyx", line 300, in cuml.neighbors.kneighbors_classifier.KNeighborsClassifier.predict_proba
  File "cuml/raft/common/handle.pyx", line 86, in cuml.raft.common.handle.Handle.sync
cuml.raft.common.cuda.CudaRuntimeError: Error! cudaErrorIllegalAddress reason='an illegal memory access was encountered' extraMsg='Stream sync'

Upon exit of the Python interpreter there is a core dump.

It seems possible that this is due to constant features, i.e. columns that are all 0, all 1, etc.
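
For reference, a rough standalone sketch of that theory (the shapes, k, and dtypes below are illustrative guesses, not taken from the pickled pipeline, so treat it as an approximation of what the real code does):

import cupy as cp
import cudf
from cuml.neighbors import KNeighborsClassifier

n_rows, n_cols = 10000, 20
# all-constant feature columns, mimicking the suspected degenerate input
X = cudf.DataFrame({f"c{i}": cp.zeros(n_rows, dtype=cp.float32) for i in range(n_cols)})
y = cudf.Series(cp.random.randint(0, 4, n_rows).astype(cp.int32))

clf = KNeighborsClassifier(n_neighbors=10)
clf.fit(X, y)

# repeated calls; with the bug described here they either disagree with each
# other or eventually raise cudaErrorIllegalAddress
for _ in range(5):
    print(clf.predict_proba(X)[0:5])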

This is using RAPIDS 21.08; other details about the system are in #4610.

However, what's also really bad about this situation is that sometimes the predictions are generated but are wrong, or keep changing (e.g. repeated calls to predict_proba(X) keep giving different results), or (e.g.) for multiclass one call returns 0's for all class probabilities.

E.g. for this file:
KNNCUML_predict_b30d7318-b285-475d-943e-c48ebd2235df.pkl.zip

This is what the sequence looks like:

import pickle
func, X = pickle.load(open("KNNCUML_predict_b30d7318-b285-475d-943e-c48ebd2235df.pkl", "rb"))
>>> func(X)[0:5]
2022-03-10 15:11:25,215 C:  3% D:43.6GB  M:46.6GB  NODE:SERVER      26864  INFO   | init
array([[0., 0., 0., 1.],
       [0., 0., 0., 1.],
       [0., 0., 0., 1.],
       [0., 0., 0., 1.],
       [0., 0., 0., 1.]], dtype=float32)
>>> func(X)[0:5]
array([[1., 0., 0., 0.],
       [1., 0., 0., 0.],
       [1., 0., 0., 0.],
       [1., 0., 0., 0.],
       [1., 0., 0., 0.]], dtype=float32)
>>> func(X)[0:5]
array([[0., 0., 0., 0.],
       [1., 0., 0., 0.],
       [1., 0., 0., 0.],
       [1., 0., 0., 0.],
       [1., 0., 0., 0.]], dtype=float32)
>>> func(X)[0:5]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/jon/minicondadai_py38/lib/python3.8/site-packages/cuml/internals/api_decorators.py", line 586, in inner_get
    ret_val = func(*args, **kwargs)
  File "cuml/neighbors/kneighbors_classifier.pyx", line 300, in cuml.neighbors.kneighbors_classifier.KNeighborsClassifier.predict_proba
  File "cuml/raft/common/handle.pyx", line 86, in cuml.raft.common.handle.Handle.sync
cuml.raft.common.cuda.CudaRuntimeError: Error! cudaErrorIllegalAddress reason='an illegal memory access was encountered' extraMsg='Stream sync'
>>> func(X)[0:5]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/jon/minicondadai_py38/lib/python3.8/site-packages/cuml/internals/api_decorators.py", line 586, in inner_get
    ret_val = func(*args, **kwargs)
  File "cuml/neighbors/kneighbors_classifier.pyx", line 256, in cuml.neighbors.kneighbors_classifier.KNeighborsClassifier.predict_proba
  File "/home/jon/minicondadai_py38/lib/python3.8/site-packages/cuml/internals/api_decorators.py", line 586, in inner_get
    ret_val = func(*args, **kwargs)
  File "cuml/neighbors/nearest_neighbors.pyx", line 488, in cuml.neighbors.nearest_neighbors.NearestNeighbors.kneighbors
  File "cuml/neighbors/nearest_neighbors.pyx", line 573, in cuml.neighbors.nearest_neighbors.NearestNeighbors._kneighbors
  File "cuml/neighbors/nearest_neighbors.pyx", line 635, in cuml.neighbors.nearest_neighbors.NearestNeighbors._kneighbors_dense
  File "/home/jon/minicondadai_py38/lib/python3.8/contextlib.py", line 75, in inner
    return func(*args, **kwds)
  File "/home/jon/minicondadai_py38/lib/python3.8/site-packages/cuml/internals/api_decorators.py", line 360, in inner
    return func(*args, **kwargs)
  File "/home/jon/minicondadai_py38/lib/python3.8/site-packages/cuml/common/input_utils.py", line 306, in input_to_cuml_array
    X = convert_dtype(X,
  File "/home/jon/minicondadai_py38/lib/python3.8/site-packages/cuml/internals/api_decorators.py", line 360, in inner
    return func(*args, **kwargs)
  File "/home/jon/minicondadai_py38/lib/python3.8/site-packages/cuml/common/input_utils.py", line 560, in convert_dtype
    would_lose_info = _typecast_will_lose_information(X, to_dtype)
  File "/home/jon/minicondadai_py38/lib/python3.8/site-packages/cuml/common/input_utils.py", line 612, in _typecast_will_lose_information
    X_m = X.values
  File "/home/jon/minicondadai_py38/lib/python3.8/site-packages/cudf/core/dataframe.py", line 994, in values
    return cupy.asarray(self.as_gpu_matrix())
  File "/home/jon/minicondadai_py38/lib/python3.8/site-packages/cudf/core/dataframe.py", line 3577, in as_gpu_matrix
    matrix = cupy.empty(shape=(nrow, ncol), dtype=cupy_dtype, order=order)
  File "/home/jon/minicondadai_py38/lib/python3.8/site-packages/cupy/_creation/basic.py", line 22, in empty
    return cupy.ndarray(shape, dtype, order=order)
  File "cupy/_core/core.pyx", line 164, in cupy._core.core.ndarray.__init__
  File "cupy/cuda/memory.pyx", line 735, in cupy.cuda.memory.alloc
  File "/home/jon/minicondadai_py38/lib/python3.8/site-packages/rmm/rmm.py", line 212, in rmm_cupy_allocator
    buf = librmm.device_buffer.DeviceBuffer(size=nbytes, stream=stream)
  File "rmm/_lib/device_buffer.pyx", line 84, in rmm._lib.device_buffer.DeviceBuffer.__cinit__
MemoryError: std::bad_alloc: CUDA error at: /home/jon/minicondadai_py38/include/rmm/mr/device/cuda_memory_resource.hpp:69: cudaErrorIllegalAddress an illegal memory access was encountered

Sometimes repeating the call results in a full core dump.

So even when prediction doesn't crash or error out, it still gives wrong/varying answers, and the probabilities don't even add up to 1 (every class label gets a probability of 0).
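
A quick sanity check on top of the reproducer above makes both failure modes easy to spot (this assumes only that func wraps predict_proba and returns an array, as in the traceback; p1/p2 are whatever array type func returns):

p1 = func(X)
p2 = func(X)
# two back-to-back calls on the same input should be identical
print("identical calls agree:", bool((p1 == p2).all()))
# each row of class probabilities should sum to 1
print("max |row sum - 1|:", float(abs(p1.sum(axis=1) - 1.0).max()))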

The actual GPU usage is minimal:

Thu Mar 10 15:13:19 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.80       Driver Version: 460.80       CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce RTX 2080    On   | 00000000:01:00.0  On |                  N/A |
| 45%   53C    P0    50W / 215W |   2021MiB /  7979MiB |      2%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      2530      G   /usr/lib/xorg/Xorg                 55MiB |
|    0   N/A  N/A      2621      G   /usr/bin/gnome-shell              143MiB |
|    0   N/A  N/A      3966      G   /usr/lib/xorg/Xorg                738MiB |
|    0   N/A  N/A      4097      G   /usr/bin/gnome-shell              100MiB |
|    0   N/A  N/A      4903      G   ...AAAAAAAAA= --shared-files      132MiB |
|    0   N/A  N/A     26864      C   python                            843MiB |
+-----------------------------------------------------------------------------+

pseudotensor changed the title from "knn predict cuml.raft.common.cuda.CudaRuntimeError: Error! cudaErrorIllegalAddress reason='an illegal memory access was encountered' extraMsg='Stream sync'" to "knn predict wrong and varying predictions, cudaErrorIllegalAddress, or core dump" on Mar 10, 2022
viclafargue (Contributor) commented:

Is this reproducible in RAPIDS 22.02 / 22.04?

pseudotensor (Author) commented:

I provided an MRE (minimal reproducible example) above so you can check.

pseudotensor (Author) commented:

This shouldn't have been closed.

viclafargue (Contributor) commented:

Sorry for not replying earlier. It turns out that the pickle files could not be imported in 22.04. Since I couldn't see the code used, it's not possible for me to reproduce the issue.

For both examples, I get:

Exception ignored in: <bound method NearestNeighbors.__del__ of KNeighborsClassifier()>
Traceback (most recent call last):
  File "cuml/neighbors/nearest_neighbors.pyx", line 889, in cuml.neighbors.nearest_neighbors.NearestNeighbors.__del__
  File "cuml/common/base.pyx", line 269, in cuml.common.base.Base.__getattr__
AttributeError: knn_index
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ModuleNotFoundError: No module named 'cuml.raft'

viclafargue (Contributor) commented:

Regarding the cudaErrorIllegalAddress/core dump, it may possibly be linked to an issue in RMM that has since been fixed: rapidsai/rmm#931.
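
If it helps, a quick way to see which builds are in the failing environment (just version introspection, nothing specific to that fix):

import rmm
import cuml
print("rmm:", rmm.__version__)
print("cuml:", cuml.__version__)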

github-actions bot commented:

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

github-actions bot commented:

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.
