RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED #4

gxu-tz · 2021-11-19T09:15:38Z

Thank you for sharing pc2cad.py. When I run the code:

python pc2cad.py --exp_name pretrained --ae_ckpt 1000 -g 0 --pc_root /public1/tz/DeepCAD/data/pc_cad

I got the error:

Traceback (most recent call last):
File "pc2cad.py", line 246, in <module>
outputs, losses = agent.train_func(data)
File "/public1/tz/DeepCAD/trainer/base.py", line 118, in train_func
outputs, losses = self.forward(data)
File "pc2cad.py", line 159, in forward
pred_code = self.net(points)
File "/home/server/anaconda3/envs/DeepCAD/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
result = self.forward(*input, **kwargs)
File "pc2cad.py", line 138, in forward
xyz, features = module(xyz, features)
File "/home/server/anaconda3/envs/DeepCAD/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
result = self.forward(*input, **kwargs)
File "/home/server/anaconda3/envs/DeepCAD/lib/python3.7/site-packages/pointnet2_ops/pointnet2_modules.py", line 66, in forward
new_features = self.mlps[i](new_features) # (B, mlp[-1], npoint, nsample)
File "/home/server/anaconda3/envs/DeepCAD/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
result = self.forward(*input, **kwargs)
File "/home/server/anaconda3/envs/DeepCAD/lib/python3.7/site-packages/torch/nn/modules/container.py", line 100, in forward
input = module(input)
File "/home/server/anaconda3/envs/DeepCAD/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
result = self.forward(*input, **kwargs)
File "/home/server/anaconda3/envs/DeepCAD/lib/python3.7/site-packages/torch/nn/modules/batchnorm.py", line 106, in forward
exponential_average_factor, self.eps)
File "/home/server/anaconda3/envs/DeepCAD/lib/python3.7/site-packages/torch/nn/functional.py", line 1923, in batch_norm
training, momentum, eps, torch.backends.cudnn.enabled
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED

Package Version Location

absl-py 1.0.0
cachetools 4.2.4
certifi 2021.10.8
charset-normalizer 2.0.7
cycler 0.11.0
Cython 0.29.13
future 0.18.2
google-auth 2.3.3
google-auth-oauthlib 0.4.6
grpcio 1.41.1
h5py 2.10.0
hydra-core 0.11.3
idna 3.3
importlib-metadata 4.8.2
joblib 0.14.1
kiwisolver 1.3.2
lmdb 1.2.1
loguru 0.5.3
Markdown 3.3.4
matplotlib 3.1.3
msgpack 1.0.2
msgpack-numpy 0.4.7.1
numpy 1.18.1
oauthlib 3.1.1
omegaconf 1.4.1
Pillow 8.3.2
pip 21.0.1
plyfile 0.7.2
pointnet2 3.0.0 /public1/tz/Pointnet2_PyTorch-master
pointnet2-ops 3.0.0
protobuf 3.19.1
pyasn1 0.4.8
pyasn1-modules 0.2.8
pyparsing 3.0.6
python-dateutil 2.8.2
pytorch-lightning 0.7.1
PyYAML 6.0
requests 2.26.0
requests-oauthlib 1.3.0
rsa 4.7.2
scikit-learn 0.24.2
scipy 1.4.1
setuptools 58.0.4
six 1.16.0
tensorboard 2.7.0
tensorboard-data-server 0.6.1
tensorboard-plugin-wit 1.8.0
tensorboardX 2.0
threadpoolctl 3.0.0
torch 1.5.1
torchvision 0.6.1
tqdm 4.42.1
trimesh 3.2.19
typing-extensions 4.0.0
urllib3 1.26.7
vtk 9.0.1
Werkzeug 2.0.2
wheel 0.37.0
zipp 3.6.0

I have two RTX 3090 GPUs, CUDA 10.2, cuDNN 7.6.5, PyTorch 1.5.1, and Python 3.7.

ChrisWu1997 · 2021-11-19T15:10:17Z

It looks like an incompatibility among CUDA, cuDNN, and PyTorch. Are you able to run other files successfully, such as train.py or test.py?
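
One way to check for this kind of mismatch is to compare each GPU's compute capability with the architectures the installed PyTorch binary was compiled for. The RTX 3090 is an Ampere card (compute capability 8.6), and CUDA 10.2 predates Ampere, so a CUDA 10.2 build cannot contain kernels for it. The following is only a minimal sketch: `arch_supported` and `report` are hypothetical helper names, `torch.cuda.get_arch_list` is looked up defensively because it may be missing in older releases, and PTX entries (`compute_XX`) that could be JIT-compiled are ignored for simplicity.

```python
def arch_supported(capability, arch_list):
    """Return True if a device with (major, minor) compute capability can run
    at least one binary architecture in arch_list (e.g. ["sm_70", "sm_75"]).

    Binary kernels run on devices with the same major version and an equal
    or higher minor version.
    """
    major, minor = capability
    for arch in arch_list:
        if not arch.startswith("sm_"):
            continue  # skip PTX entries such as "compute_80"
        digits = "".join(ch for ch in arch[3:] if ch.isdigit())
        if not digits:
            continue
        arch_major, arch_minor = divmod(int(digits), 10)
        if arch_major == major and arch_minor <= minor:
            return True
    return False


def report():
    """Print the versions PyTorch was built with and flag unsupported GPUs."""
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed in this environment")
        return
    print("torch:", torch.__version__)
    print("CUDA (build):", torch.version.cuda)
    print("cuDNN:", torch.backends.cudnn.version())
    if not torch.cuda.is_available():
        print("No CUDA device is visible")
        return
    # get_arch_list may not exist in older PyTorch releases (assumption)
    get_arch_list = getattr(torch.cuda, "get_arch_list", None)
    arch_list = get_arch_list() if get_arch_list else []
    for i in range(torch.cuda.device_count()):
        cap = torch.cuda.get_device_capability(i)
        name = torch.cuda.get_device_name(i)
        status = "ok" if arch_supported(cap, arch_list) else "no matching kernels listed"
        print(f"GPU {i}: {name}, capability {cap}, {status}")
```

Calling `report()` inside the affected environment prints one line per GPU. A device of capability (8, 6) checked against an architecture list that stops at `sm_75` is reported as having no matching kernels.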

gxu-tz · 2021-11-20T02:27:53Z

When I run python train.py --exp_name newDeepCAD -g 0, I get:

Traceback (most recent call last):
File "train.py", line 62, in <module>
main()
File "train.py", line 35, in main
outputs, losses = tr_agent.train_func(data)
File "/public1/tz/DeepCAD/trainer/base.py", line 118, in train_func
outputs, losses = self.forward(data)
File "/public1/tz/DeepCAD/trainer/trainerAE.py", line 27, in forward
outputs = self.net(commands, args)
File "/home/server/anaconda3/envs/DeepCAD/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
result = self.forward(*input, **kwargs)
File "/public1/tz/DeepCAD/model/autoencoder.py", line 154, in forward
z = self.encoder(commands_enc_, args_enc_)
File "/home/server/anaconda3/envs/DeepCAD/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
result = self.forward(*input, **kwargs)
File "/public1/tz/DeepCAD/model/autoencoder.py", line 74, in forward
src = self.embedding(commands, args, group_mask)
File "/home/server/anaconda3/envs/DeepCAD/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
result = self.forward(*input, **kwargs)
File "/public1/tz/DeepCAD/model/autoencoder.py", line 32, in forward
self.embed_fcn(self.arg_embed((args + 1).long()).view(S, N, -1)) # shift due to -1 PAD_VAL
File "/home/server/anaconda3/envs/DeepCAD/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
result = self.forward(*input, **kwargs)
File "/home/server/anaconda3/envs/DeepCAD/lib/python3.7/site-packages/torch/nn/modules/linear.py", line 87, in forward
return F.linear(input, self.weight, self.bias)
File "/home/server/anaconda3/envs/DeepCAD/lib/python3.7/site-packages/torch/nn/functional.py", line 1612, in linear
output = input.matmul(weight.t())
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)

gxu-tz · 2021-11-22T12:11:19Z

I changed the environment to CUDA 11.1, cuDNN 8.0.4, and PyTorch 1.8.0. That solved the previous problem, but a new one emerged.
When I run python pc2cad.py --exp_name pretrained --ae_ckpt 1000 -g 0 --pc_root /public1/tz/DeepCAD/data/pc_cad, I get:

Traceback (most recent call last):
File "pc2cad.py", line 244, in <module>
for b, data in enumerate(pbar):
File "/home/server/anaconda3/envs/DeepCAD/lib/python3.7/site-packages/tqdm/std.py", line 1107, in __iter__
for obj in iterable:
File "/home/server/anaconda3/envs/DeepCAD/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 517, in __next__
data = self._next_data()
File "/home/server/anaconda3/envs/DeepCAD/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1199, in _next_data
return self._process_data(data)
File "/home/server/anaconda3/envs/DeepCAD/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1225, in _process_data
data.reraise()
File "/home/server/anaconda3/envs/DeepCAD/lib/python3.7/site-packages/torch/_utils.py", line 429, in reraise
raise self.exc_type(msg)
IndexError: Caught IndexError in DataLoader worker process 2.
Original Traceback (most recent call last):
File "/home/server/anaconda3/envs/DeepCAD/lib/python3.7/site-packages/torch/utils/data/_utils/worker.py", line 202, in _worker_loop
data = fetcher.fetch(index)
File "/home/server/anaconda3/envs/DeepCAD/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/home/server/anaconda3/envs/DeepCAD/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
data = [self.dataset[idx] for idx in possibly_batched_index]
File "pc2cad.py", line 186, in __getitem__
return self.__getitem__(index + 1)
File "pc2cad.py", line 186, in __getitem__
return self.__getitem__(index + 1)
File "pc2cad.py", line 186, in __getitem__
return self.__getitem__(index + 1)
File "pc2cad.py", line 183, in __getitem__
data_id = self.all_data[index]
IndexError: list index out of range

ChrisWu1997 · 2021-11-22T15:03:25Z

Did you run python json2pc.py first to get all training and testing point clouds?
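
Before running pc2cad.py, it can help to count how many IDs in the split have no point cloud on disk. The sketch below assumes one file per ID at `<pc_root>/<data_id>.ply`; that layout, the `.ply` extension, and the helper name `find_missing` are assumptions to check against what json2pc.py actually writes.

```python
import os


def find_missing(data_ids, pc_root, ext=".ply"):
    """Return the IDs whose point-cloud file does not exist under pc_root."""
    return [
        data_id
        for data_id in data_ids
        if not os.path.exists(os.path.join(pc_root, data_id + ext))
    ]
```

If every ID from some index to the end of the list is missing, a fallback that retries with `index + 1` runs past the end of the list, which matches the `IndexError` in the traceback above.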

gxu-tz · 2021-11-23T14:38:31Z

When I run python json2pc.py, I get this error:

[Parallel(n_jobs=-1)]: Done 126816 tasks | elapsed: 7.0min
convert point cloud failed: 0041/00415456
convert point cloud failed: 0056/00560203
create_CAD failed: 0078/00787173
convert point cloud failed: 0077/00777387
Warning: tmp_out_00336192.stl file already exists and will be replaced
create_CAD failed: 0060/00604130
face_normals didn't match triangles, ignoring!
face_normals didn't match triangles, ignoring!
face_normals didn't match triangles, ignoring!
face_normals didn't match triangles, ignoring!
face_normals didn't match triangles, ignoring!
face_normals didn't match triangles, ignoring!
face_normals didn't match triangles, ignoring!
face_normals didn't match triangles, ignoring!
face_normals didn't match triangles, ignoring!
face_normals didn't match triangles, ignoring!
face_normals didn't match triangles, ignoring!
face_normals didn't match triangles, ignoring!
face_normals didn't match triangles, ignoring!
face_normals didn't match triangles, ignoring!
face_normals didn't match triangles, ignoring!
face_normals didn't match triangles, ignoring!
face_normals didn't match triangles, ignoring!
face_normals didn't match triangles, ignoring!
face_normals didn't match triangles, ignoring!
face_normals didn't match triangles, ignoring!
exception calling callback for <Future at 0x14f517023c10 state=finished raised TerminatedWorkerError>
Traceback (most recent call last):
File "/home/server/anaconda3/envs/DeepCAD/lib/python3.7/site-packages/joblib/externals/loky/_base.py", line 625, in _invoke_callbacks
callback(self)
File "/home/server/anaconda3/envs/DeepCAD/lib/python3.7/site-packages/joblib/parallel.py", line 340, in __call__
self.parallel.dispatch_next()
File "/home/server/anaconda3/envs/DeepCAD/lib/python3.7/site-packages/joblib/parallel.py", line 769, in dispatch_next
if not self.dispatch_one_batch(self._original_iterator):
File "/home/server/anaconda3/envs/DeepCAD/lib/python3.7/site-packages/joblib/parallel.py", line 835, in dispatch_one_batch
self._dispatch(tasks)
File "/home/server/anaconda3/envs/DeepCAD/lib/python3.7/site-packages/joblib/parallel.py", line 754, in _dispatch
job = self._backend.apply_async(batch, callback=cb)
File "/home/server/anaconda3/envs/DeepCAD/lib/python3.7/site-packages/joblib/_parallel_backends.py", line 551, in apply_async
future = self._workers.submit(SafeFunction(func))
File "/home/server/anaconda3/envs/DeepCAD/lib/python3.7/site-packages/joblib/externals/loky/reusable_executor.py", line 160, in submit
fn, *args, **kwargs)
File "/home/server/anaconda3/envs/DeepCAD/lib/python3.7/site-packages/joblib/externals/loky/process_executor.py", line 1027, in submit
raise self._flags.broken
joblib.externals.loky.process_executor.TerminatedWorkerError: A worker process managed by the executor was unexpectedly terminated. This could be caused by a segmentation fault while calling the function or by an excessive memory usage causing the Operating System to kill the worker. The exit codes of the workers are {SIGSEGV(-11)}
face_normals didn't match triangles, ignoring!
face_normals didn't match triangles, ignoring!
face_normals didn't match triangles, ignoring!
face_normals didn't match triangles, ignoring!
face_normals didn't match triangles, ignoring!
Warning: 1 face has been skipped due to null triangulation
face_normals didn't match triangles, ignoring!
Traceback (most recent call last):
File "json2pc.py", line 84, in <module>
Parallel(n_jobs=-1, verbose=2)(delayed(process_one)(x) for x in all_data["train"])
File "/home/server/anaconda3/envs/DeepCAD/lib/python3.7/site-packages/joblib/parallel.py", line 1017, in __call__
self.retrieve()
File "/home/server/anaconda3/envs/DeepCAD/lib/python3.7/site-packages/joblib/parallel.py", line 909, in retrieve
self._output.extend(job.get(timeout=self.timeout))
File "/home/server/anaconda3/envs/DeepCAD/lib/python3.7/site-packages/joblib/_parallel_backends.py", line 562, in wrap_future_result
return future.result(timeout=timeout)
File "/home/server/anaconda3/envs/DeepCAD/lib/python3.7/concurrent/futures/_base.py", line 435, in result
return self.__get_result()
File "/home/server/anaconda3/envs/DeepCAD/lib/python3.7/concurrent/futures/_base.py", line 384, in __get_result
raise self._exception
File "/home/server/anaconda3/envs/DeepCAD/lib/python3.7/site-packages/joblib/externals/loky/_base.py", line 625, in _invoke_callbacks
callback(self)
File "/home/server/anaconda3/envs/DeepCAD/lib/python3.7/site-packages/joblib/parallel.py", line 340, in __call__
self.parallel.dispatch_next()
File "/home/server/anaconda3/envs/DeepCAD/lib/python3.7/site-packages/joblib/parallel.py", line 769, in dispatch_next
if not self.dispatch_one_batch(self._original_iterator):
File "/home/server/anaconda3/envs/DeepCAD/lib/python3.7/site-packages/joblib/parallel.py", line 835, in dispatch_one_batch
self._dispatch(tasks)
File "/home/server/anaconda3/envs/DeepCAD/lib/python3.7/site-packages/joblib/parallel.py", line 754, in _dispatch
job = self._backend.apply_async(batch, callback=cb)
File "/home/server/anaconda3/envs/DeepCAD/lib/python3.7/site-packages/joblib/_parallel_backends.py", line 551, in apply_async
future = self._workers.submit(SafeFunction(func))
File "/home/server/anaconda3/envs/DeepCAD/lib/python3.7/site-packages/joblib/externals/loky/reusable_executor.py", line 160, in submit
fn, *args, **kwargs)
File "/home/server/anaconda3/envs/DeepCAD/lib/python3.7/site-packages/joblib/externals/loky/process_executor.py", line 1027, in submit
raise self._flags.broken
joblib.externals.loky.process_executor.TerminatedWorkerError: A worker process managed by the executor was unexpectedly terminated. This could be caused by a segmentation fault while calling the function or by an excessive memory usage causing the Operating System to kill the worker. The exit codes of the workers are {SIGSEGV(-11)}

So I didn't get all of the point clouds. I think that may be what caused the previous problem.

ChrisWu1997 · 2021-11-23T16:15:48Z

This segmentation fault is caused by OpenCascade: some CAD models cannot be converted to point clouds. To find the problematic data, replace the Parallel execution with a plain for loop and print each data_id before processing it; the last ID printed before the crash is the one that failed, and you can skip it in the next run. A quicker alternative, if only a few items are unprocessed, is to skip them by replacing the following line in pc2cad.py

DeepCAD/pc2cad.py

Line 182 in 1ff0ab1

return self.__getitem__(index + 1)

with

return self.__getitem__(random.randint(0, self.__len__() - 1))
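
A minimal sketch of this resampling fallback, written as a stand-alone dataset rather than the actual class in pc2cad.py. `SkipMissingDataset`, `load_fn`, `max_tries`, and the `.ply` file layout are hypothetical. Because `random.randint` includes both endpoints, the sketch draws the replacement index with `random.randrange(len(self))`, which never returns an out-of-range index.

```python
import os
import random


class SkipMissingDataset:
    """Map-style dataset that resamples when an item's file is missing."""

    def __init__(self, data_ids, pc_root, load_fn, ext=".ply", max_tries=100):
        self.all_data = list(data_ids)
        self.pc_root = pc_root
        self.load_fn = load_fn      # hypothetical loader: path -> sample
        self.ext = ext
        self.max_tries = max_tries  # bound retries so an empty pc_root fails loudly

    def __len__(self):
        return len(self.all_data)

    def __getitem__(self, index):
        for _ in range(self.max_tries):
            path = os.path.join(self.pc_root, self.all_data[index] + self.ext)
            if os.path.exists(path):
                return self.load_fn(path)
            # randrange excludes len(self), unlike randint(0, len(self))
            index = random.randrange(len(self))
        raise RuntimeError("no existing point cloud found after max_tries draws")
```

Bounding the retries replaces unbounded recursion with a clear error when most of the data is missing, which is usually a sign that preprocessing did not finish.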

gxu-tz · 2021-11-24T08:17:00Z

Thank you for your advice. I wrote a for loop, found the problematic data_id, and successfully ran pc2cad.py.
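
A serial run of this kind could look roughly like the sketch below. `run_serially` is a hypothetical helper, and `process_fn` stands in for the per-item function that json2pc.py passes to `Parallel` (`process_one` in the traceback above). Each ID is printed and flushed before it is processed, so when a segmentation fault kills the interpreter, the last ID on screen is the one that crashed; adding it to `skip_ids` lets the next run continue past it.

```python
def run_serially(data_ids, process_fn, skip_ids=()):
    """Process IDs one at a time, printing each ID before it is processed."""
    skip = set(skip_ids)
    done = []
    for data_id in data_ids:
        if data_id in skip:
            continue
        # flush=True: buffered output would be lost if the process segfaults
        print(data_id, flush=True)
        process_fn(data_id)
        done.append(data_id)
    return done
```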

gxu-tz closed this as completed Nov 24, 2021

joeychsu mentioned this issue Jun 20, 2023

CUBLAS_STATUS_EXECUTION_FAILED and .h5 file not found #15

Open
