RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED #4

Closed
gxu-tz opened this issue Nov 19, 2021 · 7 comments

@gxu-tz

gxu-tz commented Nov 19, 2021

Thank you for sharing pc2cad.py. When I run the following command:

python pc2cad.py --exp_name pretrained --ae_ckpt 1000 -g 0 --pc_root /public1/tz/DeepCAD/data/pc_cad

I got the error:

Traceback (most recent call last):
File "pc2cad.py", line 246, in <module>
outputs, losses = agent.train_func(data)
File "/public1/tz/DeepCAD/trainer/base.py", line 118, in train_func
outputs, losses = self.forward(data)
File "pc2cad.py", line 159, in forward
pred_code = self.net(points)
File "/home/server/anaconda3/envs/DeepCAD/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
result = self.forward(*input, **kwargs)
File "pc2cad.py", line 138, in forward
xyz, features = module(xyz, features)
File "/home/server/anaconda3/envs/DeepCAD/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
result = self.forward(*input, **kwargs)
File "/home/server/anaconda3/envs/DeepCAD/lib/python3.7/site-packages/pointnet2_ops/pointnet2_modules.py", line 66, in forward
new_features = self.mlps[i](new_features) # (B, mlp[-1], npoint, nsample)
File "/home/server/anaconda3/envs/DeepCAD/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
result = self.forward(*input, **kwargs)
File "/home/server/anaconda3/envs/DeepCAD/lib/python3.7/site-packages/torch/nn/modules/container.py", line 100, in forward
input = module(input)
File "/home/server/anaconda3/envs/DeepCAD/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
result = self.forward(*input, **kwargs)
File "/home/server/anaconda3/envs/DeepCAD/lib/python3.7/site-packages/torch/nn/modules/batchnorm.py", line 106, in forward
exponential_average_factor, self.eps)
File "/home/server/anaconda3/envs/DeepCAD/lib/python3.7/site-packages/torch/nn/functional.py", line 1923, in batch_norm
training, momentum, eps, torch.backends.cudnn.enabled
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED

Package Version Location


absl-py 1.0.0
cachetools 4.2.4
certifi 2021.10.8
charset-normalizer 2.0.7
cycler 0.11.0
Cython 0.29.13
future 0.18.2
google-auth 2.3.3
google-auth-oauthlib 0.4.6
grpcio 1.41.1
h5py 2.10.0
hydra-core 0.11.3
idna 3.3
importlib-metadata 4.8.2
joblib 0.14.1
kiwisolver 1.3.2
lmdb 1.2.1
loguru 0.5.3
Markdown 3.3.4
matplotlib 3.1.3
msgpack 1.0.2
msgpack-numpy 0.4.7.1
numpy 1.18.1
oauthlib 3.1.1
omegaconf 1.4.1
Pillow 8.3.2
pip 21.0.1
plyfile 0.7.2
pointnet2 3.0.0 /public1/tz/Pointnet2_PyTorch-master
pointnet2-ops 3.0.0
protobuf 3.19.1
pyasn1 0.4.8
pyasn1-modules 0.2.8
pyparsing 3.0.6
python-dateutil 2.8.2
pytorch-lightning 0.7.1
PyYAML 6.0
requests 2.26.0
requests-oauthlib 1.3.0
rsa 4.7.2
scikit-learn 0.24.2
scipy 1.4.1
setuptools 58.0.4
six 1.16.0
tensorboard 2.7.0
tensorboard-data-server 0.6.1
tensorboard-plugin-wit 1.8.0
tensorboardX 2.0
threadpoolctl 3.0.0
torch 1.5.1
torchvision 0.6.1
tqdm 4.42.1
trimesh 3.2.19
typing-extensions 4.0.0
urllib3 1.26.7
vtk 9.0.1
Werkzeug 2.0.2
wheel 0.37.0
zipp 3.6.0

My setup: two RTX 3090 GPUs, CUDA 10.2, cuDNN 7.6.5, PyTorch 1.5.1, Python 3.7.

@ChrisWu1997
Owner

This looks like an incompatibility between CUDA, cuDNN, and PyTorch. Are you able to run the other scripts successfully, such as train.py or test.py?
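
One way to check for this kind of mismatch, as a sketch under stated assumptions: a PyTorch build ships GPU code for a fixed list of architectures, and a GPU can only run it if one entry fits its compute capability. The RTX 3090 reports compute capability 8.6, which CUDA 10.2 does not target. The helper names `build_supports_gpu` and `describe_environment` and the sample architecture list are illustrative, not part of this repository, and `torch.cuda.get_arch_list()` may be missing on older PyTorch releases, so the sketch guards for that.

```python
import re


def build_supports_gpu(capability, arch_list):
    """Return True if a build with `arch_list` can run on a GPU of `capability`.

    capability: (major, minor) tuple, e.g. (8, 6) for an RTX 3090.
    arch_list:  tags such as "sm_75" (binary code) or "compute_75" (PTX).
    """
    major, minor = capability
    for tag in arch_list:
        match = re.fullmatch(r"(sm|compute)_(\d+)(\d)[a-z]?", tag)
        if match is None:
            continue  # ignore tags this sketch does not understand
        kind = match.group(1)
        a_major, a_minor = int(match.group(2)), int(match.group(3))
        # Binary code runs on the same major architecture with an equal or higher minor.
        if kind == "sm" and a_major == major and a_minor <= minor:
            return True
        # PTX can be JIT-compiled for any equal or newer architecture.
        if kind == "compute" and (a_major, a_minor) <= (major, minor):
            return True
    return False


def describe_environment():
    """Collect the versions PyTorch was built against; returns None if torch is missing."""
    try:
        import torch
    except ImportError:
        return None
    info = {
        "torch": torch.__version__,
        "cuda": torch.version.cuda,
        "cudnn": torch.backends.cudnn.version(),
    }
    if torch.cuda.is_available():
        info["capability"] = torch.cuda.get_device_capability(0)
        get_arch_list = getattr(torch.cuda, "get_arch_list", None)
        if get_arch_list is not None:
            info["arch_list"] = get_arch_list()
    return info


# Illustrative architecture list for an older build (an assumption, not read from a real install).
old_build = ["sm_37", "sm_50", "sm_60", "sm_70", "sm_75"]
print(build_supports_gpu((8, 6), old_build))              # False
print(build_supports_gpu((8, 6), old_build + ["sm_86"]))  # True
```

Calling `describe_environment()` on the affected machine and passing its `capability` and `arch_list` entries to `build_supports_gpu` would show whether the installed build can target the card at all.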

@gxu-tz
Author

gxu-tz commented Nov 20, 2021

When I run python train.py --exp_name newDeepCAD -g 0, I get:

Traceback (most recent call last):
File "train.py", line 62, in <module>
main()
File "train.py", line 35, in main
outputs, losses = tr_agent.train_func(data)
File "/public1/tz/DeepCAD/trainer/base.py", line 118, in train_func
outputs, losses = self.forward(data)
File "/public1/tz/DeepCAD/trainer/trainerAE.py", line 27, in forward
outputs = self.net(commands, args)
File "/home/server/anaconda3/envs/DeepCAD/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
result = self.forward(*input, **kwargs)
File "/public1/tz/DeepCAD/model/autoencoder.py", line 154, in forward
z = self.encoder(commands_enc_, args_enc_)
File "/home/server/anaconda3/envs/DeepCAD/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
result = self.forward(*input, **kwargs)
File "/public1/tz/DeepCAD/model/autoencoder.py", line 74, in forward
src = self.embedding(commands, args, group_mask)
File "/home/server/anaconda3/envs/DeepCAD/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
result = self.forward(*input, **kwargs)
File "/public1/tz/DeepCAD/model/autoencoder.py", line 32, in forward
self.embed_fcn(self.arg_embed((args + 1).long()).view(S, N, -1)) # shift due to -1 PAD_VAL
File "/home/server/anaconda3/envs/DeepCAD/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
result = self.forward(*input, **kwargs)
File "/home/server/anaconda3/envs/DeepCAD/lib/python3.7/site-packages/torch/nn/modules/linear.py", line 87, in forward
return F.linear(input, self.weight, self.bias)
File "/home/server/anaconda3/envs/DeepCAD/lib/python3.7/site-packages/torch/nn/functional.py", line 1612, in linear
output = input.matmul(weight.t())
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)

@gxu-tz
Author

gxu-tz commented Nov 22, 2021

I switched the environment to CUDA 11.1, cuDNN 8.0.4, and PyTorch 1.8.0. That solved the previous problem, but a new one appeared.
When I run python pc2cad.py --exp_name pretrained --ae_ckpt 1000 -g 0 --pc_root /public1/tz/DeepCAD/data/pc_cad, I get:

Traceback (most recent call last):
File "pc2cad.py", line 244, in <module>
for b, data in enumerate(pbar):
File "/home/server/anaconda3/envs/DeepCAD/lib/python3.7/site-packages/tqdm/std.py", line 1107, in __iter__
for obj in iterable:
File "/home/server/anaconda3/envs/DeepCAD/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 517, in __next__
data = self._next_data()
File "/home/server/anaconda3/envs/DeepCAD/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1199, in _next_data
return self._process_data(data)
File "/home/server/anaconda3/envs/DeepCAD/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1225, in _process_data
data.reraise()
File "/home/server/anaconda3/envs/DeepCAD/lib/python3.7/site-packages/torch/_utils.py", line 429, in reraise
raise self.exc_type(msg)
IndexError: Caught IndexError in DataLoader worker process 2.
Original Traceback (most recent call last):
File "/home/server/anaconda3/envs/DeepCAD/lib/python3.7/site-packages/torch/utils/data/_utils/worker.py", line 202, in _worker_loop
data = fetcher.fetch(index)
File "/home/server/anaconda3/envs/DeepCAD/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/home/server/anaconda3/envs/DeepCAD/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
data = [self.dataset[idx] for idx in possibly_batched_index]
File "pc2cad.py", line 186, in __getitem__
return self.__getitem__(index + 1)
File "pc2cad.py", line 186, in __getitem__
return self.__getitem__(index + 1)
File "pc2cad.py", line 186, in __getitem__
return self.__getitem__(index + 1)
File "pc2cad.py", line 183, in __getitem__
data_id = self.all_data[index]
IndexError: list index out of range

@ChrisWu1997
Owner

Did you run python json2pc.py first to get all training and testing point clouds?
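
A quick way to check this, as a sketch: the `IndexError` in the traceback above comes from the `index + 1` fallback stepping past the end of the list, presumably because the last samples have no point cloud on disk. The helper below lists the IDs that have no file. The layout `<pc_root>/<data_id>.ply` and the function name `find_missing_point_clouds` are assumptions for illustration; adjust them to whatever json2pc.py actually writes.

```python
import os


def find_missing_point_clouds(pc_root, data_ids, ext=".ply"):
    """Return the data IDs that have no point cloud file under pc_root."""
    missing = []
    for data_id in data_ids:
        # Assumed layout: one file per sample, named after its data ID.
        path = os.path.join(pc_root, data_id + ext)
        if not os.path.exists(path):
            missing.append(data_id)
    return missing
```

If the returned list is long, or covers the tail of the split, the conversion step most likely did not finish.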

@gxu-tz
Author

gxu-tz commented Nov 23, 2021

When I run python json2pc.py, I get this error:

[Parallel(n_jobs=-1)]: Done 126816 tasks | elapsed: 7.0min
convert point cloud failed: 0041/00415456
convert point cloud failed: 0056/00560203
create_CAD failed: 0078/00787173
convert point cloud failed: 0077/00777387
Warning: tmp_out_00336192.stl file already exists and will be replaced
create_CAD failed: 0060/00604130
face_normals didn't match triangles, ignoring!
face_normals didn't match triangles, ignoring!
face_normals didn't match triangles, ignoring!
face_normals didn't match triangles, ignoring!
face_normals didn't match triangles, ignoring!
face_normals didn't match triangles, ignoring!
face_normals didn't match triangles, ignoring!
face_normals didn't match triangles, ignoring!
face_normals didn't match triangles, ignoring!
face_normals didn't match triangles, ignoring!
face_normals didn't match triangles, ignoring!
face_normals didn't match triangles, ignoring!
face_normals didn't match triangles, ignoring!
face_normals didn't match triangles, ignoring!
face_normals didn't match triangles, ignoring!
face_normals didn't match triangles, ignoring!
face_normals didn't match triangles, ignoring!
face_normals didn't match triangles, ignoring!
face_normals didn't match triangles, ignoring!
face_normals didn't match triangles, ignoring!
exception calling callback for <Future at 0x14f517023c10 state=finished raised TerminatedWorkerError>
Traceback (most recent call last):
File "/home/server/anaconda3/envs/DeepCAD/lib/python3.7/site-packages/joblib/externals/loky/_base.py", line 625, in _invoke_callbacks
callback(self)
File "/home/server/anaconda3/envs/DeepCAD/lib/python3.7/site-packages/joblib/parallel.py", line 340, in __call__
self.parallel.dispatch_next()
File "/home/server/anaconda3/envs/DeepCAD/lib/python3.7/site-packages/joblib/parallel.py", line 769, in dispatch_next
if not self.dispatch_one_batch(self._original_iterator):
File "/home/server/anaconda3/envs/DeepCAD/lib/python3.7/site-packages/joblib/parallel.py", line 835, in dispatch_one_batch
self._dispatch(tasks)
File "/home/server/anaconda3/envs/DeepCAD/lib/python3.7/site-packages/joblib/parallel.py", line 754, in _dispatch
job = self._backend.apply_async(batch, callback=cb)
File "/home/server/anaconda3/envs/DeepCAD/lib/python3.7/site-packages/joblib/_parallel_backends.py", line 551, in apply_async
future = self._workers.submit(SafeFunction(func))
File "/home/server/anaconda3/envs/DeepCAD/lib/python3.7/site-packages/joblib/externals/loky/reusable_executor.py", line 160, in submit
fn, *args, **kwargs)
File "/home/server/anaconda3/envs/DeepCAD/lib/python3.7/site-packages/joblib/externals/loky/process_executor.py", line 1027, in submit
raise self._flags.broken
joblib.externals.loky.process_executor.TerminatedWorkerError: A worker process managed by the executor was unexpectedly terminated. This could be caused by a segmentation fault while calling the function or by an excessive memory usage causing the Operating System to kill the worker. The exit codes of the workers are {SIGSEGV(-11)}
face_normals didn't match triangles, ignoring!
face_normals didn't match triangles, ignoring!
face_normals didn't match triangles, ignoring!
face_normals didn't match triangles, ignoring!
face_normals didn't match triangles, ignoring!
Warning: 1 face has been skipped due to null triangulation
face_normals didn't match triangles, ignoring!
Traceback (most recent call last):
File "json2pc.py", line 84, in <module>
Parallel(n_jobs=-1, verbose=2)(delayed(process_one)(x) for x in all_data["train"])
File "/home/server/anaconda3/envs/DeepCAD/lib/python3.7/site-packages/joblib/parallel.py", line 1017, in __call__
self.retrieve()
File "/home/server/anaconda3/envs/DeepCAD/lib/python3.7/site-packages/joblib/parallel.py", line 909, in retrieve
self._output.extend(job.get(timeout=self.timeout))
File "/home/server/anaconda3/envs/DeepCAD/lib/python3.7/site-packages/joblib/_parallel_backends.py", line 562, in wrap_future_result
return future.result(timeout=timeout)
File "/home/server/anaconda3/envs/DeepCAD/lib/python3.7/concurrent/futures/_base.py", line 435, in result
return self.__get_result()
File "/home/server/anaconda3/envs/DeepCAD/lib/python3.7/concurrent/futures/_base.py", line 384, in __get_result
raise self._exception
File "/home/server/anaconda3/envs/DeepCAD/lib/python3.7/site-packages/joblib/externals/loky/_base.py", line 625, in _invoke_callbacks
callback(self)
File "/home/server/anaconda3/envs/DeepCAD/lib/python3.7/site-packages/joblib/parallel.py", line 340, in __call__
self.parallel.dispatch_next()
File "/home/server/anaconda3/envs/DeepCAD/lib/python3.7/site-packages/joblib/parallel.py", line 769, in dispatch_next
if not self.dispatch_one_batch(self._original_iterator):
File "/home/server/anaconda3/envs/DeepCAD/lib/python3.7/site-packages/joblib/parallel.py", line 835, in dispatch_one_batch
self._dispatch(tasks)
File "/home/server/anaconda3/envs/DeepCAD/lib/python3.7/site-packages/joblib/parallel.py", line 754, in _dispatch
job = self._backend.apply_async(batch, callback=cb)
File "/home/server/anaconda3/envs/DeepCAD/lib/python3.7/site-packages/joblib/_parallel_backends.py", line 551, in apply_async
future = self._workers.submit(SafeFunction(func))
File "/home/server/anaconda3/envs/DeepCAD/lib/python3.7/site-packages/joblib/externals/loky/reusable_executor.py", line 160, in submit
fn, *args, **kwargs)
File "/home/server/anaconda3/envs/DeepCAD/lib/python3.7/site-packages/joblib/externals/loky/process_executor.py", line 1027, in submit
raise self._flags.broken
joblib.externals.loky.process_executor.TerminatedWorkerError: A worker process managed by the executor was unexpectedly terminated. This could be caused by a segmentation fault while calling the function or by an excessive memory usage causing the Operating System to kill the worker. The exit codes of the workers are {SIGSEGV(-11)}

So I didn't get all the point clouds, which I think may have caused the previous problem.

@ChrisWu1997
Owner

This segmentation fault is caused by OpenCascade: some CAD models cannot be converted to point clouds successfully. You can find the problematic data by replacing the Parallel execution with a for loop and printing out each data_id to see which one causes the crash, then skip it in the next run. Another quick solution is to simply give up on the unprocessed data (if there are not too many) and replace the following line in pc2cad.py

return self.__getitem__(index + 1)
with

return self.__getitem__(random.randint(0, self.__len__() - 1))
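
As a rough sketch of that fallback (not the repository's actual dataset class): `random.randrange(len(self))` always picks a valid index, whereas `random.randint(a, b)` includes `b`, so its upper bound has to be `len(self) - 1`. The class name `PointCloudIndex`, the `.ply` layout, the `max_tries` limit and the returned dict are all assumptions made for illustration.

```python
import os
import random


class PointCloudIndex:
    """Minimal stand-in for a dataset that skips samples without a point cloud."""

    def __init__(self, pc_root, data_ids, ext=".ply", max_tries=100):
        self.pc_root = pc_root
        self.all_data = list(data_ids)
        self.ext = ext
        self.max_tries = max_tries

    def __len__(self):
        return len(self.all_data)

    def __getitem__(self, index):
        for _ in range(self.max_tries):
            data_id = self.all_data[index]
            pc_path = os.path.join(self.pc_root, data_id + self.ext)
            if os.path.exists(pc_path):
                return {"id": data_id, "path": pc_path}
            # File missing: retry with a random index that is always in range.
            index = random.randrange(len(self))
        raise RuntimeError("no usable point cloud found after %d tries" % self.max_tries)
```

The sketch uses a bounded loop instead of recursion so that a dataset with many missing files fails with a clear error rather than recursing deeply.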

@gxu-tz
Author

gxu-tz commented Nov 24, 2021

Thank you for your advice. I wrote a for loop, found the problematic data_id, and successfully ran pc2cad.py.
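
For later readers, the for loop described here might look roughly like the sketch below. `process_one` is the per-sample function that json2pc.py passes to `Parallel`; the wrapper name `run_serially` and its `skip` parameter are made up for illustration. Printing with `flush=True` before each call matters, because a segmentation fault kills the interpreter and the last ID printed is then the one that crashed.

```python
def run_serially(data_ids, process_one, skip=()):
    """Process samples one at a time, printing each ID before it is handled."""
    skip = set(skip)
    failed = []
    for data_id in data_ids:
        if data_id in skip:
            continue
        # Flush so the ID is visible even if the process dies in native code.
        print("processing", data_id, flush=True)
        try:
            process_one(data_id)
        except Exception as exc:  # ordinary Python errors: record and move on
            print("failed:", data_id, exc, flush=True)
            failed.append(data_id)
    return failed
```

On the next run, the ID that crashed can be passed in `skip` so the loop moves past it. This is much slower than the parallel version, so it is mainly useful for locating the bad samples.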
