mdlstm IAM demo crashes after loading train.2.h5 #4

Closed

cwig opened this issue Feb 2, 2017 · 10 comments

@cwig

cwig commented Feb 2, 2017

I ran the IAM demo and it crashes partway through epoch 2, right after loading train.2.h5. All of the other demos worked correctly. Here is a script of exactly what I was running: https://gist.github.com/cwig/315d212964542f7f1797d5fdd122891e

Let me know if I need to run anything differently. Thank you.

This is the traceback.

train epoch 2, batch 190, cost:output 2.56628417969, elapsed 0:04:28, exp. remaining 1:04:21, complete 6.49%
1:04:21 [|||| 6.49% ]running 2 sequence slices (442764 nts) of batch 191 on device gpu0
loading file features/raw/train.2.h5
TaskThread train failed
Unhandled exception <type 'exceptions.ValueError'> in thread <TrainTaskThread(TaskThread train, started daemon 140405735151360)>, proc 2207.
EXCEPTION
Traceback (most recent call last):
File "/mnt/3TB_A/workspace/returnn_IAM/EngineTask.py", line 373, in run
line: self.run_inner()
locals:
self = <TrainTaskThread(TaskThread train, started daemon 140405735151360)>
self.run_inner = <bound method TrainTaskThread.run_inner of <TrainTaskThread(TaskThread train, started daemon 140405735151360)>>
File "/mnt/3TB_A/workspace/returnn_IAM/EngineTask.py", line 466, in run_inner
line: deviceRuns[i].allocate()
locals:
deviceRuns = [<DeviceBatchRun(DeviceThread gpu0, started daemon 140405385864960)>]
i = 0
allocate =
File "/mnt/3TB_A/workspace/returnn_IAM/EngineTask.py", line 218, in allocate
line: self.devices_batches = self.parent.allocate_devices(self.alloc_devices)
locals:
self = <DeviceBatchRun(DeviceThread gpu0, started daemon 140405385864960)>
self.devices_batches = [[<Batch start_seq:3846, #seqs:2>]]
self.parent = <TrainTaskThread(TaskThread train, started daemon 140405735151360)>
self.parent.allocate_devices = <bound method TrainTaskThread.allocate_devices of <TrainTaskThread(TaskThread train, started daemon 140405735151360)>>
self.alloc_devices = [<Device.Device object at 0x7fb2c2ba19d0>]
File "/mnt/3TB_A/workspace/returnn_IAM/EngineTask.py", line 84, in allocate_devices
line: success, batch_adv_idx = self.assign_dev_data(device, batches)
locals:
success =
batch_adv_idx =
self = <TrainTaskThread(TaskThread train, started daemon 140405735151360)>
self.assign_dev_data = <bound method TrainTaskThread.assign_dev_data of <TrainTaskThread(TaskThread train, started daemon 140405735151360)>>
device = <Device.Device object at 0x7fb2c2ba19d0>
batches = [<Batch start_seq:3349, #seqs:2>]
File "/mnt/3TB_A/workspace/returnn_IAM/EngineTask.py", line 54, in assign_dev_data
line: return assign_dev_data(device, self.data, batches)
locals:
assign_dev_data = <function assign_dev_data at 0x7fb2ca2ab230>
device = <Device.Device object at 0x7fb2c2ba19d0>
self = <TrainTaskThread(TaskThread train, started daemon 140405735151360)>
self.data = <HDFDataset.HDFDataset object at 0x7fb2c8056690>
batches = [<Batch start_seq:3349, #seqs:2>]
File "/mnt/3TB_A/workspace/returnn_IAM/EngineUtil.py", line 23, in assign_dev_data
line: if load_seqs: dataset.load_seqs(batch.start_seq, batch.end_seq)
locals:
load_seqs = True
dataset = <HDFDataset.HDFDataset object at 0x7fb2c8056690>
dataset.load_seqs = <bound method HDFDataset.load_seqs of <HDFDataset.HDFDataset object at 0x7fb2c8056690>>
batch = <Batch start_seq:3349, #seqs:2>
batch.start_seq = 3349
batch.end_seq = 3351
File "/mnt/3TB_A/workspace/returnn_IAM/CachedDataset.py", line 140, in load_seqs
line: self._load_seqs_with_cache(start, end)
locals:
self = <HDFDataset.HDFDataset object at 0x7fb2c8056690>
self._load_seqs_with_cache = <bound method HDFDataset._load_seqs_with_cache of <HDFDataset.HDFDataset object at 0x7fb2c8056690>>
start = 3349
end = 3351
File "/mnt/3TB_A/workspace/returnn_IAM/CachedDataset.py", line 168, in _load_seqs_with_cache
line: self.load_seqs(start, end, with_cache=False)
locals:
self = <HDFDataset.HDFDataset object at 0x7fb2c8056690>
self.load_seqs = <bound method HDFDataset.load_seqs of <HDFDataset.HDFDataset object at 0x7fb2c8056690>>
start = 3349
end = 3359
with_cache =
File "/mnt/3TB_A/workspace/returnn_IAM/CachedDataset.py", line 143, in load_seqs
line: super(CachedDataset, self).load_seqs(start, end)
locals:
super = <type 'super'>
CachedDataset = <class 'CachedDataset.CachedDataset'>
self = <HDFDataset.HDFDataset object at 0x7fb2c8056690>
load_seqs =
start = 3349
end = 3359
File "/mnt/3TB_A/workspace/returnn_IAM/Dataset.py", line 159, in load_seqs
line: self._load_seqs(start, end)
locals:
self = <HDFDataset.HDFDataset object at 0x7fb2c8056690>
self._load_seqs = <bound method HDFDataset._load_seqs of <HDFDataset.HDFDataset object at 0x7fb2c8056690>>
start = 3349
end = 3359
File "/mnt/3TB_A/workspace/returnn_IAM/HDFDataset.py", line 152, in _load_seqs
line: self._set_alloc_intervals_data(idc, data=fin['inputs'][p[0] : p[0] + l[0]][...])
locals:
self = <HDFDataset.HDFDataset object at 0x7fb2c8056690>
self._set_alloc_intervals_data = <bound method HDFDataset._set_alloc_intervals_data of <HDFDataset.HDFDataset object at 0x7fb2c8056690>>
idc = 3358
data =
fin = <HDF5 file "train.2.h5" (mode r)>, len = 5
p = array([110348430, 25353, 1172]), len = 3
l = array([151211, 40, 2], dtype=int32), len = 3
File "/mnt/3TB_A/workspace/returnn_IAM/CachedDataset.py", line 220, in _set_alloc_intervals_data
line: self.alloc_intervals[idi][2][o:o + l] = x
locals:
self = <HDFDataset.HDFDataset object at 0x7fb2c8056690>
self.alloc_intervals = [(0, 1321, array([[ 0.],
[ 0.],
[ 0.],
...,
[ 0.],
[ 0.],
[ 0.]], dtype=float32)), (1329, 1333, array([[ 0.],
[ 0.],
[ 0.],
...,
[ 0.],
[ 0.],
[ 0.]], dtype=float32)), (1417, 1420, array([[ 0.],
[ 0.],
..., len = 52
idi = 48
o = 1824769
l = 151211
x = array([[ 0.],
[ 0.],
[ 0.],
...,
[ 0.],
[ 0.],
[ 0.]], dtype=float32), len = 151211, _[0]: {len = 1}
ValueError: could not broadcast input array from shape (151211,1) into shape (107131,1)
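
For reference, the final ValueError is NumPy's standard complaint when a slice-assignment target is smaller than the array being written into it; the allocated cache interval here has room for 107131 frames while the loaded sequence has 151211. A minimal sketch in plain NumPy (not RETURNN code) that reproduces the same message with these shapes:

import numpy as np

# Allocated cache interval vs. the sequence loaded from train.2.h5 (shapes taken from the traceback above).
buffer = np.zeros((107131, 1), dtype=np.float32)
data = np.zeros((151211, 1), dtype=np.float32)

# The slice is clipped to the buffer's 107131 rows, so the assignment fails with:
# ValueError: could not broadcast input array from shape (151211,1) into shape (107131,1)
buffer[0:151211] = data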

@doetsch
Contributor

doetsch commented Feb 3, 2017

Unfortunately I am not able to reproduce the error. Are you using the most recent commit? You can also try to deactivate caching by setting cache_size to "8G".
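
For reference, a sketch of how that setting might look in the demo's config (assuming the usual Python-style RETURNN config syntax; the exact config file used by the demo may differ):

# Sketch, assuming a Python-style RETURNN config; adjust to the config file you actually run.
cache_size = "8G"  # value suggested above; later comments in this thread use "16G" and "256G"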

@pvoigtlaender
Contributor

pvoigtlaender commented Feb 3, 2017

I was able to reproduce it:

ValueError: could not broadcast input array from shape (151211,1) into shape (107131,1)

KeyboardInterrupt
train epoch 2, batch 191, cost:output 2.81097157796, elapsed 0:07:17, exp. remaining 1:44:52, complete 6.49%
1:44:52 [||||||||||||| 6.49%

But I don't know yet what the problem is here.
Edit: setting cache_size to "8G" did not help, and it loads the data anyway:
1:47:03 [|||||||||| 4.81% ]running 2 sequence slices (473110 nts) of batch 141 on device gpu0
train epoch 2, batch 141, cost:output 3.09054326076, elapsed 0:05:26, exp. remaining 1:47:47, complete 4.81%
1:47:47 [|||||||||| 4.81% ]loading file features/raw/train.2.h5
running 2 sequence slices (463386 nts) of batch 142 on device gpu0
loading file features/raw/train.1.h5
TaskThread train failed
Unhandled exception <type 'exceptions.AssertionError'> in thread <TrainTaskThread(TaskThread train, started daemon 140624219232000)>, proc 23277.

@pvoigtlaender
Contributor

For a quick fix you could try to put all the data into one file instead of two, although this does not solve the actual issue, of course.
You can also try an older commit. The demo used to work in earlier commits. If you can find out which commit broke it, it might be easy to fix.

@cwig
Author

cwig commented Feb 6, 2017

Thanks for looking into this. I did try putting all the training data in one file and I still had the issue. I modified the create_IAM_dataset.py file on lines 203-209.

I'll try an older commit.

@cwig
Author

cwig commented Feb 8, 2017

This didn't solve the actual problem, but it worked when I reverted to commit 82be088.

@pvoigtlaender
Contributor

Is this the last commit that works? It would be very helpful to find it, so we can see which change caused the problem.

@cwig
Author

cwig commented Feb 8, 2017

I'm not sure; I have only tried two so far. a925c7a did not work, so the breaking change is somewhere between a925c7a and 82be088.

@doetsch
Contributor

doetsch commented Mar 3, 2017

Could you try the most recent commit? There seems to be an issue with the cache size calculation on a few machines, and it took me a while to reproduce it. Hard-coding it to 16GB in config_real, as done in commit 2d1744c, resolved the issue for me on this machine.

@pvoigtlaender
Contributor

With cache_size set to 256G (as in the latest version in the repository), it now works with the latest commit. There seems to be a problem with the size calculation for two-dimensional data, so for now just set cache_size very high.

@doetsch
Contributor

doetsch commented Mar 6, 2017

The fix has been confirmed on three independent machines. Therefore I am closing this issue.
