mdlstm IAM demo crashes after loading train.2.h5 #4

Closed

cwig opened this issue Feb 2, 2017 · 10 comments

@cwig

cwig commented Feb 2, 2017

I ran the IAM demo and it crashes partway through epoch 2, right after loading train.2.h5. All of the other demos worked correctly. Here is a script of exactly what I was running: https://gist.github.com/cwig/315d212964542f7f1797d5fdd122891e

Let me know if I need to run anything differently. Thank you.

This is the traceback.

train epoch 2, batch 190, cost:output 2.56628417969, elapsed 0:04:28, exp. remaining 1:04:21, complete 6.49%
1:04:21 [|||| 6.49% ]running 2 sequence slices (442764 nts) of batch 191 on device gpu0
loading file features/raw/train.2.h5
TaskThread train failed
Unhandled exception <type 'exceptions.ValueError'> in thread <TrainTaskThread(TaskThread train, started daemon 140405735151360)>, proc 2207.
EXCEPTION
Traceback (most recent call last):
File "/mnt/3TB_A/workspace/returnn_IAM/EngineTask.py", line 373, in run
line: self.run_inner()
locals:
self = <TrainTaskThread(TaskThread train, started daemon 140405735151360)>
self.run_inner = <bound method TrainTaskThread.run_inner of <TrainTaskThread(TaskThread train, started daemon 140405735151360)>>
File "/mnt/3TB_A/workspace/returnn_IAM/EngineTask.py", line 466, in run_inner
line: deviceRuns[i].allocate()
locals:
deviceRuns = [<DeviceBatchRun(DeviceThread gpu0, started daemon 140405385864960)>]
i = 0
allocate =
File "/mnt/3TB_A/workspace/returnn_IAM/EngineTask.py", line 218, in allocate
line: self.devices_batches = self.parent.allocate_devices(self.alloc_devices)
locals:
self = <DeviceBatchRun(DeviceThread gpu0, started daemon 140405385864960)>
self.devices_batches = [[<Batch start_seq:3846, #seqs:2>]]
self.parent = <TrainTaskThread(TaskThread train, started daemon 140405735151360)>
self.parent.allocate_devices = <bound method TrainTaskThread.allocate_devices of <TrainTaskThread(TaskThread train, started daemon 140405735151360)>>
self.alloc_devices = [<Device.Device object at 0x7fb2c2ba19d0>]
File "/mnt/3TB_A/workspace/returnn_IAM/EngineTask.py", line 84, in allocate_devices
line: success, batch_adv_idx = self.assign_dev_data(device, batches)
locals:
success =
batch_adv_idx =
self = <TrainTaskThread(TaskThread train, started daemon 140405735151360)>
self.assign_dev_data = <bound method TrainTaskThread.assign_dev_data of <TrainTaskThread(TaskThread train, started daemon 140405735151360)>>
device = <Device.Device object at 0x7fb2c2ba19d0>
batches = [<Batch start_seq:3349, #seqs:2>]
File "/mnt/3TB_A/workspace/returnn_IAM/EngineTask.py", line 54, in assign_dev_data
line: return assign_dev_data(device, self.data, batches)
locals:
assign_dev_data = <function assign_dev_data at 0x7fb2ca2ab230>
device = <Device.Device object at 0x7fb2c2ba19d0>
self = <TrainTaskThread(TaskThread train, started daemon 140405735151360)>
self.data = <HDFDataset.HDFDataset object at 0x7fb2c8056690>
batches = [<Batch start_seq:3349, #seqs:2>]
File "/mnt/3TB_A/workspace/returnn_IAM/EngineUtil.py", line 23, in assign_dev_data
line: if load_seqs: dataset.load_seqs(batch.start_seq, batch.end_seq)
locals:
load_seqs = True
dataset = <HDFDataset.HDFDataset object at 0x7fb2c8056690>
dataset.load_seqs = <bound method HDFDataset.load_seqs of <HDFDataset.HDFDataset object at 0x7fb2c8056690>>
batch = <Batch start_seq:3349, #seqs:2>
batch.start_seq = 3349
batch.end_seq = 3351
File "/mnt/3TB_A/workspace/returnn_IAM/CachedDataset.py", line 140, in load_seqs
line: self._load_seqs_with_cache(start, end)
locals:
self = <HDFDataset.HDFDataset object at 0x7fb2c8056690>
self._load_seqs_with_cache = <bound method HDFDataset._load_seqs_with_cache of <HDFDataset.HDFDataset object at 0x7fb2c8056690>>
start = 3349
end = 3351
File "/mnt/3TB_A/workspace/returnn_IAM/CachedDataset.py", line 168, in _load_seqs_with_cache
line: self.load_seqs(start, end, with_cache=False)
locals:
self = <HDFDataset.HDFDataset object at 0x7fb2c8056690>
self.load_seqs = <bound method HDFDataset.load_seqs of <HDFDataset.HDFDataset object at 0x7fb2c8056690>>
start = 3349
end = 3359
with_cache =
File "/mnt/3TB_A/workspace/returnn_IAM/CachedDataset.py", line 143, in load_seqs
line: super(CachedDataset, self).load_seqs(start, end)
locals:
super = <type 'super'>
CachedDataset = <class 'CachedDataset.CachedDataset'>
self = <HDFDataset.HDFDataset object at 0x7fb2c8056690>
load_seqs =
start = 3349
end = 3359
File "/mnt/3TB_A/workspace/returnn_IAM/Dataset.py", line 159, in load_seqs
line: self._load_seqs(start, end)
locals:
self = <HDFDataset.HDFDataset object at 0x7fb2c8056690>
self._load_seqs = <bound method HDFDataset._load_seqs of <HDFDataset.HDFDataset object at 0x7fb2c8056690>>
start = 3349
end = 3359
File "/mnt/3TB_A/workspace/returnn_IAM/HDFDataset.py", line 152, in _load_seqs
line: self._set_alloc_intervals_data(idc, data=fin['inputs'][p[0] : p[0] + l[0]][...])
locals:
self = <HDFDataset.HDFDataset object at 0x7fb2c8056690>
self._set_alloc_intervals_data = <bound method HDFDataset._set_alloc_intervals_data of <HDFDataset.HDFDataset object at 0x7fb2c8056690>>
idc = 3358
data =
fin = <HDF5 file "train.2.h5" (mode r)>, len = 5
p = array([110348430, 25353, 1172]), len = 3
l = array([151211, 40, 2], dtype=int32), len = 3
File "/mnt/3TB_A/workspace/returnn_IAM/CachedDataset.py", line 220, in _set_alloc_intervals_data
line: self.alloc_intervals[idi][2][o:o + l] = x
locals:
self = <HDFDataset.HDFDataset object at 0x7fb2c8056690>
self.alloc_intervals = [(0, 1321, array([[ 0.],
[ 0.],
[ 0.],
...,
[ 0.],
[ 0.],
[ 0.]], dtype=float32)), (1329, 1333, array([[ 0.],
[ 0.],
[ 0.],
...,
[ 0.],
[ 0.],
[ 0.]], dtype=float32)), (1417, 1420, array([[ 0.],
[ 0.],
..., len = 52
idi = 48
o = 1824769
l = 151211
x = array([[ 0.],
[ 0.],
[ 0.],
...,
[ 0.],
[ 0.],
[ 0.]], dtype=float32), len = 151211, _[0]: {len = 1}
ValueError: could not broadcast input array from shape (151211,1) into shape (107131,1)
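
For reference, the final ValueError is NumPy's standard complaint when a slice-assignment target is smaller than the array being written into it; the allocated cache interval here has room for 107131 frames while the loaded sequence has 151211. A minimal sketch in plain NumPy (not RETURNN code) that reproduces the same message with these shapes:

import numpy as np

# Allocated cache interval vs. the sequence loaded from train.2.h5 (shapes taken from the traceback above).
buffer = np.zeros((107131, 1), dtype=np.float32)
data = np.zeros((151211, 1), dtype=np.float32)

# The slice is clipped to the buffer's 107131 rows, so the assignment fails with:
# ValueError: could not broadcast input array from shape (151211,1) into shape (107131,1)
buffer[0:151211] = data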

@doetsch
Contributor

doetsch commented Feb 3, 2017

Unfortunately I am not able to reproduce the error. Are you using the most recent commit? You can also try to deactivate caching by setting cache_size to "8G".
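
For reference, a sketch of how that setting might look in the demo's config (assuming the usual Python-style RETURNN config syntax; the exact config file used by the demo may differ):

# Sketch, assuming a Python-style RETURNN config; adjust to the config file you actually run.
cache_size = "8G"  # value suggested above; later comments in this thread use "16G" and "256G"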

@pvoigtlaender
Contributor

pvoigtlaender commented Feb 3, 2017

I was able to reproduce it:

ValueError: could not broadcast input array from shape (151211,1) into shape (107131,1)

KeyboardInterrupt
train epoch 2, batch 191, cost:output 2.81097157796, elapsed 0:07:17, exp. remaining 1:44:52, complete 6.49%
1:44:52 [||||||||||||| 6.49%

But I don't know yet what the problem is here.
Edit: setting cache_size to "8G" did not help, and it loads the data anyway:
1:47:03 [|||||||||| 4.81% ]running 2 sequence slices (473110 nts) of batch 141 on device gpu0
train epoch 2, batch 141, cost:output 3.09054326076, elapsed 0:05:26, exp. remaining 1:47:47, complete 4.81%
1:47:47 [|||||||||| 4.81% ]loading file features/raw/train.2.h5
running 2 sequence slices (463386 nts) of batch 142 on device gpu0
loading file features/raw/train.1.h5
TaskThread train failed
Unhandled exception <type 'exceptions.AssertionError'> in thread <TrainTaskThread(TaskThread train, started daemon 140624219232000)>, proc 23277.

@pvoigtlaender
Contributor

For a quick fix you could try to put all the data into one file instead of two, although this does not solve the actual issue, of course.
You can also try an older commit. The demo used to work in earlier commits. If you can find out which commit broke it, it might be easy to fix.

@cwig
Author

cwig commented Feb 6, 2017

Thanks for looking into this. I did try putting all the training data in one file and I still had the issue. I modified the create_IAM_dataset.py file on lines 203-209.

I'll try an older commit.

@cwig
Author

cwig commented Feb 8, 2017

This didn't solve the actual problem, but it worked when I reverted to commit 82be088.

@pvoigtlaender
Contributor

Is this the last commit that works? It would be very helpful to find it, so we can see which change caused the problem.

@cwig
Author

cwig commented Feb 8, 2017

I'm not sure; I have only tried two so far. a925c7a did not work, so the breaking change is somewhere between a925c7a and 82be088.

@doetsch
Contributor

doetsch commented Mar 3, 2017

Could you try the most recent commit? There seems to be an issue with the cache size calculation on a few machines, and it took me a while to reproduce it. Hard-coding it to 16GB in config_real, as done in commit 2d1744c, resolved the issue for me on this machine.

@pvoigtlaender
Contributor

With cache_size set to 256G (as in the latest version in the repository), it now works with the latest commit. There seems to be a problem with the size calculation for two-dimensional data, so for now just set cache_size very high.

@doetsch
Contributor

doetsch commented Mar 6, 2017

The fix has been confirmed on three independent machines. Therefore I am closing this issue.
