mdlstm IAM demo crashes after loading train.2.h5 #4
Comments
Unfortunately I am not able to reproduce the error. Are you using the most recent commit? You can also try to deactivate caching by setting cache_size to "8G".
I was able to reproduce it:
ValueError: could not broadcast input array from shape (151211,1) into shape (107131,1)
KeyboardInterrupt
But I don't know yet what the problem is here.
For a quick fix you could try to put all the data into one file instead of two, although this does not solve the actual issue, of course.
Thanks for looking into this. I did try putting all the training data into one file and I still had the issue. I modified the create_IAM_dataset.py file on lines 203-209. I'll try an older commit.
This didn't solve the actual problem, but it worked when I reverted to commit 82be088.
Is this the last commit that works? It would be very helpful to find it, so we can see which change caused the problem.
Could you try the most recent commit? There seems to be an issue with the cache size calculation on a few machines, and it took me a while to reproduce it. Hard-coding it to 16GB in config_real, as done by commit 2d1744c, resolved the issue for me on this machine.
With cache_size set to 256G (as in the latest version in the repository), it works with the latest commit now. There seems to be a problem with the size calculation for two-dimensional data, so for now just set cache_size really high.
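For anyone hitting this before updating: the workaround is just a larger cache_size in the demo config. Below is a minimal sketch of the relevant fragment, assuming a Python-format RETURNN config such as config_real; only cache_size is the option discussed in this thread, the other keys are illustrative placeholders.

```python
# Fragment of a RETURNN config (e.g. config_real from the IAM demo).
# Only cache_size is the option discussed above; "train"/"dev" are
# illustrative placeholders for the demo's existing dataset settings.
train = "features/raw/train.*.h5"   # hypothetical value
dev = "features/raw/valid.h5"       # hypothetical value

# Workaround for the broken cache size calculation on two-dimensional
# data: make the cache so large that the per-interval allocation is
# never undersized.
cache_size = "256G"
```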
The fix has been confirmed on three independent machines. Therefore I am closing this issue. |
I ran the IAM demo and it crashes partway through epoch 2, right after loading train.2.h5. All of the other demos worked correctly. Here is a script of exactly what I was running: https://gist.github.com/cwig/315d212964542f7f1797d5fdd122891e
Let me know if I need to run anything differently. Thank you.
This is the traceback.
train epoch 2, batch 190, cost:output 2.56628417969, elapsed 0:04:28, exp. remaining 1:04:21, complete 6.49%
1:04:21 [|||| 6.49% ]running 2 sequence slices (442764 nts) of batch 191 on device gpu0
loading file features/raw/train.2.h5
TaskThread train failed
Unhandled exception <type 'exceptions.ValueError'> in thread <TrainTaskThread(TaskThread train, started daemon 140405735151360)>, proc 2207.
EXCEPTION
Traceback (most recent call last):
File "/mnt/3TB_A/workspace/returnn_IAM/EngineTask.py", line 373, in run
line: self.run_inner()
locals:
self = <TrainTaskThread(TaskThread train, started daemon 140405735151360)>
self.run_inner = <bound method TrainTaskThread.run_inner of <TrainTaskThread(TaskThread train, started daemon 140405735151360)>>
File "/mnt/3TB_A/workspace/returnn_IAM/EngineTask.py", line 466, in run_inner
line: deviceRuns[i].allocate()
locals:
deviceRuns = [<DeviceBatchRun(DeviceThread gpu0, started daemon 140405385864960)>]
i = 0
allocate =
File "/mnt/3TB_A/workspace/returnn_IAM/EngineTask.py", line 218, in allocate
line: self.devices_batches = self.parent.allocate_devices(self.alloc_devices)
locals:
self = <DeviceBatchRun(DeviceThread gpu0, started daemon 140405385864960)>
self.devices_batches = [[<Batch start_seq:3846, #seqs:2>]]
self.parent = <TrainTaskThread(TaskThread train, started daemon 140405735151360)>
self.parent.allocate_devices = <bound method TrainTaskThread.allocate_devices of <TrainTaskThread(TaskThread train, started daemon 140405735151360)>>
self.alloc_devices = [<Device.Device object at 0x7fb2c2ba19d0>]
File "/mnt/3TB_A/workspace/returnn_IAM/EngineTask.py", line 84, in allocate_devices
line: success, batch_adv_idx = self.assign_dev_data(device, batches)
locals:
success =
batch_adv_idx =
self = <TrainTaskThread(TaskThread train, started daemon 140405735151360)>
self.assign_dev_data = <bound method TrainTaskThread.assign_dev_data of <TrainTaskThread(TaskThread train, started daemon 140405735151360)>>
device = <Device.Device object at 0x7fb2c2ba19d0>
batches = [<Batch start_seq:3349, #seqs:2>]
File "/mnt/3TB_A/workspace/returnn_IAM/EngineTask.py", line 54, in assign_dev_data
line: return assign_dev_data(device, self.data, batches)
locals:
assign_dev_data = <function assign_dev_data at 0x7fb2ca2ab230>
device = <Device.Device object at 0x7fb2c2ba19d0>
self = <TrainTaskThread(TaskThread train, started daemon 140405735151360)>
self.data = <HDFDataset.HDFDataset object at 0x7fb2c8056690>
batches = [<Batch start_seq:3349, #seqs:2>]
File "/mnt/3TB_A/workspace/returnn_IAM/EngineUtil.py", line 23, in assign_dev_data
line: if load_seqs: dataset.load_seqs(batch.start_seq, batch.end_seq)
locals:
load_seqs = True
dataset = <HDFDataset.HDFDataset object at 0x7fb2c8056690>
dataset.load_seqs = <bound method HDFDataset.load_seqs of <HDFDataset.HDFDataset object at 0x7fb2c8056690>>
batch = <Batch start_seq:3349, #seqs:2>
batch.start_seq = 3349
batch.end_seq = 3351
File "/mnt/3TB_A/workspace/returnn_IAM/CachedDataset.py", line 140, in load_seqs
line: self._load_seqs_with_cache(start, end)
locals:
self = <HDFDataset.HDFDataset object at 0x7fb2c8056690>
self._load_seqs_with_cache = <bound method HDFDataset._load_seqs_with_cache of <HDFDataset.HDFDataset object at 0x7fb2c8056690>>
start = 3349
end = 3351
File "/mnt/3TB_A/workspace/returnn_IAM/CachedDataset.py", line 168, in _load_seqs_with_cache
line: self.load_seqs(start, end, with_cache=False)
locals:
self = <HDFDataset.HDFDataset object at 0x7fb2c8056690>
self.load_seqs = <bound method HDFDataset.load_seqs of <HDFDataset.HDFDataset object at 0x7fb2c8056690>>
start = 3349
end = 3359
with_cache =
File "/mnt/3TB_A/workspace/returnn_IAM/CachedDataset.py", line 143, in load_seqs
line: super(CachedDataset, self).load_seqs(start, end)
locals:
super = <type 'super'>
CachedDataset = <class 'CachedDataset.CachedDataset'>
self = <HDFDataset.HDFDataset object at 0x7fb2c8056690>
load_seqs =
start = 3349
end = 3359
File "/mnt/3TB_A/workspace/returnn_IAM/Dataset.py", line 159, in load_seqs
line: self._load_seqs(start, end)
locals:
self = <HDFDataset.HDFDataset object at 0x7fb2c8056690>
self._load_seqs = <bound method HDFDataset._load_seqs of <HDFDataset.HDFDataset object at 0x7fb2c8056690>>
start = 3349
end = 3359
File "/mnt/3TB_A/workspace/returnn_IAM/HDFDataset.py", line 152, in _load_seqs
line: self._set_alloc_intervals_data(idc, data=fin['inputs'][p[0] : p[0] + l[0]][...])
locals:
self = <HDFDataset.HDFDataset object at 0x7fb2c8056690>
self._set_alloc_intervals_data = <bound method HDFDataset._set_alloc_intervals_data of <HDFDataset.HDFDataset object at 0x7fb2c8056690>>
idc = 3358
data =
fin = <HDF5 file "train.2.h5" (mode r)>, len = 5
p = array([110348430, 25353, 1172]), len = 3
l = array([151211, 40, 2], dtype=int32), len = 3
File "/mnt/3TB_A/workspace/returnn_IAM/CachedDataset.py", line 220, in _set_alloc_intervals_data
line: self.alloc_intervals[idi][2][o:o + l] = x
locals:
self = <HDFDataset.HDFDataset object at 0x7fb2c8056690>
self.alloc_intervals = [(0, 1321, array([[ 0.],
[ 0.],
[ 0.],
...,
[ 0.],
[ 0.],
[ 0.]], dtype=float32)), (1329, 1333, array([[ 0.],
[ 0.],
[ 0.],
...,
[ 0.],
[ 0.],
[ 0.]], dtype=float32)), (1417, 1420, array([[ 0.],
[ 0.],
..., len = 52
idi = 48
o = 1824769
l = 151211
x = array([[ 0.],
[ 0.],
[ 0.],
...,
[ 0.],
[ 0.],
[ 0.]], dtype=float32), len = 151211, _[0]: {len = 1}
ValueError: could not broadcast input array from shape (151211,1) into shape (107131,1)
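To make the failing assignment concrete: at the point of the crash, the cached allocation interval only has room for 107131 rows between offset o and its end, while the block read from train.2.h5 holds 151211 rows, so NumPy refuses to broadcast one into the other. A minimal sketch that reproduces the same error with the shapes from the traceback (the buffer here is simplified to just the remaining space; the real alloc_intervals entry is larger and o is nonzero):

```python
import numpy as np

# Simplified stand-in for alloc_intervals[idi][2]: pretend the buffer
# is exactly the space that remains after offset o, i.e. 107131 rows.
alloc_buffer = np.zeros((107131, 1), dtype=np.float32)

# The sequence block read from train.2.h5 is 151211 rows long
# (l = 151211 in the traceback above).
seq_data = np.zeros((151211, 1), dtype=np.float32)

# Same pattern as CachedDataset._set_alloc_intervals_data: the slice is
# clamped to the 107131 rows that actually exist, so the assignment fails
# with "could not broadcast input array from shape (151211,1) into
# shape (107131,1)".
alloc_buffer[0:151211] = seq_data
```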