This repository has been archived by the owner on Jan 3, 2023. It is now read-only.

train.cfg in video-c3d #448

Open
lixiangchun opened this issue Mar 19, 2018 · 8 comments

Comments

@lixiangchun

Is there an example of the train.cfg used in video-c3d?

@baojun-nervana
Contributor

@lixiangchun Below is an example. Hope it can help you.

manifest = [train:/dataset/aeon/V3D/ucf-extracted/train-index.csv, test:/dataset/aeon/V3D/ucf-extracted/test-index.csv]
manifest_root = /dataset/aeon/V3D/ucf-extracted
backend = gpu
epochs = 10
batch_size = 32
eval_freq = 1
log = video-c3d.log
output_file = video-c3d.hdf5
device_id = 0
data_dir = /dataset
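
Before launching a long training run, it can help to confirm that the files named in the `manifest` entry exist. The snippet below is only a sketch: `parse_manifest` and `missing_manifests` are hypothetical helpers written for this illustration, not part of neon, and they assume the `[name:path, name:path]` layout shown in the example above.

```python
import os


def parse_manifest(value):
    """Split '[train:/a.csv, test:/b.csv]' into {'train': '/a.csv', 'test': '/b.csv'}.

    Hypothetical helper: this mirrors the layout of the example config,
    it is not how neon itself parses the option.
    """
    inner = value.strip()
    if inner.startswith("[") and inner.endswith("]"):
        inner = inner[1:-1]
    entries = {}
    for item in inner.split(","):
        item = item.strip()
        if not item:
            continue
        # Split on the first colon only, so paths may contain colons.
        name, sep, path = item.partition(":")
        if not sep:
            raise ValueError("expected 'name:path', got %r" % item)
        entries[name.strip()] = path.strip()
    return entries


def missing_manifests(value):
    """Return the manifest paths that do not exist on disk."""
    return [p for p in parse_manifest(value).values() if not os.path.isfile(p)]
```

Calling `missing_manifests(...)` with the `manifest` value from your own train.cfg returns an empty list when every index file is in place.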

@lixiangchun
Author

@baojun-nervana Thanks for your help.

Running python3 examples/video-c3d/train.py fails with this error:

Traceback (most recent call last):
  File "/media/storage1/software/github/neon/examples/video-c3d/train.py", line 31, in <module>
    parser = NeonArgparser(__doc__, default_config_files=config_files)
  File "/usr/local/lib/python3.5/dist-packages/neon/util/argparser.py", line 80, in __init__
    super(NeonArgparser, self).__init__(*args, **kwargs)
TypeError: __init__() got multiple values for argument 'add_config_file_help'

@baojun-nervana
Contributor

That is probably a configargparse version issue; the error occurs with the newest release of configargparse. The requirements.txt file recommends the following version:

configargparse==0.10.0
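
To see whether the installed release matches that pin, a check along the following lines may be useful. It is a sketch under assumptions: `matches_pin` and `installed_version` are hypothetical helper names, the comparison only handles purely numeric dotted versions, and `importlib.metadata` requires Python 3.8 or later (on older interpreters `installed_version` simply returns `None`).

```python
def matches_pin(installed, pinned="0.10.0"):
    """Return True when two numeric dotted version strings name the same release."""
    def normalize(version):
        parts = [int(p) for p in version.strip().split(".")]
        # Treat '0.10' and '0.10.0' as equal by dropping trailing zeros.
        while len(parts) > 1 and parts[-1] == 0:
            parts.pop()
        return parts
    return normalize(installed) == normalize(pinned)


def installed_version(package="configargparse"):
    """Return the installed version string, or None if it cannot be determined."""
    try:
        from importlib.metadata import PackageNotFoundError, version
    except ImportError:
        # Python older than 3.8: no importlib.metadata in the standard library.
        return None
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version()
    if found is None:
        print("configargparse version could not be determined")
    elif matches_pin(found):
        print("configargparse %s matches the pin" % found)
    else:
        print("configargparse %s differs from the pinned 0.10.0" % found)
```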

@lixiangchun
Author

lixiangchun commented Mar 22, 2018

Thanks, it works now.

However, I found that this repo only supports CPU or MKL as the backend. The training process is very slow.

How can I enable the GPU backend for this repo?

@baojun-nervana
Contributor

@lixiangchun The example can run with the GPU backend. What error did you see with the GPU backend?
You might need to install the GPU dependencies:
https://github.com/NervanaSystems/neon/blob/master/gpu_requirements.txt

@lixiangchun
Author

@baojun-nervana After installing all packages in gpu_requirements.txt, the GPU backend can be used; however, the following error occurs:

Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/pycuda/tools.py", line 426, in context_dependent_memoize
    return ctx_dict[cur_ctx][args]
KeyError: <pycuda._driver.Context object at 0x7f3534cbe450>

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/media/storage1/software/github/neon/examples/video-c3d/train.py", line 57, in <module>
    model.fit(train, optimizer=opt, num_epochs=args.epochs, cost=cost, callbacks=callbacks)
  File "/usr/local/lib/python3.5/dist-packages/neon/models/model.py", line 183, in fit
    self._epoch_fit(dataset, callbacks)
  File "/usr/local/lib/python3.5/dist-packages/neon/models/model.py", line 205, in _epoch_fit
    x = self.fprop(x)
  File "/usr/local/lib/python3.5/dist-packages/neon/models/model.py", line 236, in fprop
    res = self.layers.fprop(x, inference)
  File "/usr/local/lib/python3.5/dist-packages/neon/layers/container.py", line 395, in fprop
    x = l.fprop(x, inference=inference)
  File "/usr/local/lib/python3.5/dist-packages/neon/layers/layer.py", line 1061, in fprop
    bias=self.weight_bias, bsum=self.batch_sum, layer_op=self)
  File "/usr/local/lib/python3.5/dist-packages/neon/backends/nervanagpu.py", line 1990, in fprop_conv
    return self._execute_conv("fprop", layer, layer.fprop_kernels, repeat)
  File "/usr/local/lib/python3.5/dist-packages/neon/backends/nervanagpu.py", line 2072, in _execute_conv
    kernels.execute(repeat)
  File "/usr/local/lib/python3.5/dist-packages/neon/backends/convolution.py", line 551, in execute
    kernel = kernel_specs.get_kernel(self.kernel_name, self.kernel_options)
  File "<decorator-gen-35>", line 2, in get_kernel
  File "/usr/local/lib/python3.5/dist-packages/pycuda/tools.py", line 430, in context_dependent_memoize
    result = func(*args)
  File "/usr/local/lib/python3.5/dist-packages/neon/backends/kernel_specs.py", line 842, in get_kernel
    run_command([ "ptxas -v -arch", arch, "-o", cubin_file, ptx_file ])
  File "/usr/local/lib/python3.5/dist-packages/neon/backends/kernel_specs.py", line 785, in run_command
    raise RuntimeError("Error(%d):\n%s\n%s" % (proc.returncode, cmd, err))
RuntimeError: Error(136):
ptxas -v -arch sm_61 -o /home/lixc/.cache/neon/kernels/cubin/sconv_direct_fprop_64x32_SN_bias.cubin /home/lixc/.cache/neon/kernels/ptx/sconv_direct_fprop_64x32_SN_bias.ptx
b'Floating point exception (core dumped)\n'

My train.cfg is:

manifest = [train:/media/storage1/project/deep_learning/c3d_ucf/data/ucf-extracted/train-index.csv, test:/media/storage1/project/deep_learning/c3d_ucf/data/ucf-extracted/test-index.csv]
manifest_root = /media/storage1/project/deep_learning/c3d_ucf/data/ucf-extracted
backend = gpu
epochs = 10
batch_size = 16
eval_freq = 1
log = video-c3d.log
output_file = video-c3d.hdf5
device_id = 1
data_dir = train_output_dir
serialize = 1

Training was done via:

export LD_LIBRARY_PATH=/media/storage1/software/github/neon/mklml_lnx_2018.0.1.20171227/lib:$LD_LIBRARY_PATH
python3 /media/storage1/software/github/neon/examples/video-c3d/train.py -c train.cfg

@baojun-nervana
Contributor

@lixiangchun Are you using cuda9?
I am using cuda8, and an issue has been reported with cuda9.

$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2016 NVIDIA Corporation
Built on Tue_Jan_10_13:22:03_CST_2017
Cuda compilation tools, release 8.0, V8.0.61
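
If you want to check the toolkit release from a script rather than by eye, something like the following might do. It is a rough sketch: `cuda_release` and `read_nvcc_output` are hypothetical helper names, and the pattern assumes the `release X.Y` wording shown in the output above.

```python
import re
import subprocess


def cuda_release(nvcc_output):
    """Extract (major, minor) from `nvcc --version` text, or None if absent."""
    match = re.search(r"release\s+(\d+)\.(\d+)", nvcc_output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def read_nvcc_output():
    """Run `nvcc --version` and return its stdout, or None if nvcc is not on PATH."""
    try:
        proc = subprocess.run(
            ["nvcc", "--version"],
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            universal_newlines=True,
        )
    except FileNotFoundError:
        return None
    return proc.stdout


if __name__ == "__main__":
    text = read_nvcc_output()
    release = cuda_release(text) if text else None
    if release is None:
        print("could not determine the CUDA toolkit release")
    else:
        print("CUDA toolkit release %d.%d" % release)
```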

@lixiangchun
Author

@baojun-nervana Thanks. Yes, I use cuda9. Will go back to cuda8 and try again.
