[MXNET-86] Gluon dataloader crash on speech recognition training #10042
Comments
Thanks for reporting! Is there any code to reproduce the error message? |
Seems to be related to multiprocessing. When num_workers=0 the problem is resolved. |
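For context, a minimal sketch of the workaround being described (the tiny random ArrayDataset below is illustrative, not the reporter's speech data): with num_workers=0 the DataLoader iterates in the main process, avoiding the multiprocessing path where the hang occurs.
```python
import mxnet as mx
from mxnet.gluon.data import ArrayDataset, DataLoader

# Dummy dataset: 1000 samples of 10 features each, with integer labels.
dataset = ArrayDataset(mx.nd.random.uniform(shape=(1000, 10)),
                       mx.nd.arange(1000))

# num_workers=0 keeps loading in the main process (no worker processes),
# which is the configuration that does not hang for the reporter.
loader = DataLoader(dataset, batch_size=32, shuffle=True, num_workers=0)

for data, label in loader:
    pass  # loading only; no training step
```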
This isn't actionable since we don't have your code. Please attach code. |
Will produce a minimal reproducible example soon. |
Here is the code that gets stuck. Changing to num_workers=0 works; however, on Mac with 1.0.0post3 this is not an issue.
|
@Jerryzcn found that v1.1.0 does not have this problem. |
I'm now doing a binary search to locate the problem. |
@cjolivier01 @piiswrong After a binary search over commits, I can confirm that the problem is due to this PR: 106f97f
Also, I compiled without setting the |
with what frequency does it occur? |
100%
|
ok, will take a look tomorrow
|
Created JIRA work item: https://issues.apache.org/jira/browse/MXNET-86 |
This script just freezes for me without doing anything. It is stuck in Connection._recv, it seems. What is it doing? |
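As an aside, a hedged debugging sketch (not part of the original repro) for confirming where a hung process is blocked, e.g. in multiprocessing.connection.Connection._recv; it assumes a Unix host and can be dropped at the top of the script:
```python
# Illustrative only: dump Python stack traces when the script appears hung.
import faulthandler
import signal

# Print all thread stacks to stderr when the process receives SIGUSR1
# (trigger with `kill -USR1 <pid>` from another shell).
faulthandler.register(signal.SIGUSR1)

# Alternatively, dump the traceback automatically after 60 seconds
# without killing the process.
faulthandler.dump_traceback_later(60, exit=False)
```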
The error output only happens when you train on the actual data. When you use the test script, it will just freeze. If you revert back to 1.1.0 the problem is resolved. |
@cjolivier01 Before the commit it does not freeze and prints the data instead. |
There seem to be other issues as well: after training for a day or so I got a segfault. This does not happen with a small dataset. The segfault was seen with 1.2.0; I will try a previous version. |
I think that would be a separate issue. This one so far is just the "stuck" fix. |
The segfault seems to be related to #10096 |
When using num_workers > 0, I get the following after a few hundred/thousand batches (the higher the number of workers, the sooner the segfault). I am using mxnet-cu90 1.1.0:
Running this code: https://github.com/ThomasDelteil/CNN_NLP_MXNet/blob/master/Crepe-Gluon.ipynb and changing this line. Sometimes, not all the time, I also get the workers filling up 100% of my /dev/shm after the segfault. I am running the code in Jupyter Lab. Is this the same issue? Should I open a new one? The issue does not happen without multiprocessing or when blocking on every batch (see the .asscalar() change that triggered it). |
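For clarity, an illustrative sketch of the ".asscalar()" difference mentioned above; the loss tensor here is a stand-in, not the notebook's actual loss:
```python
from mxnet import nd

loss = nd.random.uniform(shape=(128,))  # stand-in for one batch's per-sample loss

curr_loss_async = nd.mean(loss)             # stays an NDArray: nothing blocks, workers keep queueing batches
curr_loss_block = nd.mean(loss).asscalar()  # copies to a Python float, synchronizing on every batch
```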
@ThomasDelteil can you try building from source using master? #10096 might fix it. I am testing it right now. |
I am trying now; it did still happen though with |
The segfault does not seem to happen with the latest master; however, the latest master does seem to be MUCH slower than 1.1.0, by a factor of 3-4. optimized refers to the version where
@Jerryzcn Any idea why that might be? It looks like on current master the data loading is not the limiting factor in the performance, whilst when using 1.1.0 it was. If I didn't botch anything during my build, that looks like a pretty bad regression for 1.2.0 |
@ThomasDelteil The bug is related to a race condition in memory management, where a memory space is double-freed. The latest master adds a lock on that space, so that might slow down the data loading. However, I'm not sure exactly why. ping @zhreshold |
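This is not MXNet's actual implementation (the real fix lives in the C++ storage manager); purely as a toy Python sketch of the pattern being described, here is a lock serializing access to a shared pool so a slot cannot be released twice:
```python
import threading

class BufferPool:
    """Toy stand-in for a shared memory pool (illustrative only)."""
    def __init__(self, num_slots):
        self._free = list(range(num_slots))
        self._lock = threading.Lock()  # the added lock: serializes acquire/release

    def acquire(self):
        # Hand out a free slot, or None if the pool is exhausted.
        with self._lock:
            return self._free.pop() if self._free else None

    def release(self, slot):
        with self._lock:
            # Without the lock, two threads could each decide to return the
            # same slot and push it back twice -- the "double free" above.
            if slot not in self._free:
                self._free.append(slot)
```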
Thanks @Jerryzcn.
import mxnet as mx
print(mx.__version__)
from mxnet import nd, autograd, gluon
import os
import pandas as pd
from mxnet.gluon.data import ArrayDataset
from mxnet.gluon.data import DataLoader
import numpy as np
import multiprocessing
import wget
if not os.path.isfile('pickleddata.pkl'):
    wget.download('https://s3.us-east-2.amazonaws.com/tdelteil-test-mxnet/pickleddata.pkl')
data = pd.read_pickle('pickleddata.pkl')
# /!\ The important bit:
NUM_WORKERS = multiprocessing.cpu_count() # number of workers used in the data loading
optimized = True
categories = [
'Home_and_Kitchen',
'Books',
'CDs_and_Vinyl',
'Movies_and_TV',
'Cell_Phones_and_Accessories',
'Sports_and_Outdoors',
'Clothing_Shoes_and_Jewelry'
]
ALPHABET = list("abcdefghijklmnopqrstuvwxyz0123456789-,;.!?:'\"/\\|_@#$%^&*~`+ =<>()[]{}") # The 69 characters as specified in the paper
ALPHABET_INDEX = {letter: index for index, letter in enumerate(ALPHABET)} # { a: 0, b: 1, etc}
FEATURE_LEN = 1014 # max-length in characters for one document
BATCH_SIZE = 128 # number of documents per batch
def encode(text):
    encoded = np.zeros([len(ALPHABET), FEATURE_LEN], dtype='float32')
    review = text.lower()[:FEATURE_LEN-1:-1]
    i = 0
    for letter in text:
        if i >= FEATURE_LEN:
            break
        if letter in ALPHABET_INDEX:
            encoded[ALPHABET_INDEX[letter]][i] = 1
        i += 1
    return encoded
class AmazonDataSet(ArrayDataset):
    # We pre-process the documents on the fly
    def __getitem__(self, idx):
        return encode(self._data[0][idx]), self._data[1][idx]
# Data loaders:
split = 0.8
split_index = int(split*len(data))
train_data_X = data['X'][:split_index].as_matrix()
train_data_Y = data['Y'][:split_index].as_matrix()
test_data_X = data['X'][split_index:].as_matrix()
test_data_Y = data['Y'][split_index:].as_matrix()
train_dataset = AmazonDataSet(train_data_X, train_data_Y)
test_dataset = AmazonDataSet(test_data_X, test_data_Y)
train_dataloader = DataLoader(train_dataset, shuffle=True, batch_size=BATCH_SIZE, num_workers=NUM_WORKERS, last_batch='discard')
test_dataloader = DataLoader(test_dataset, shuffle=True, batch_size=BATCH_SIZE, num_workers=NUM_WORKERS, last_batch='discard')
# context:
ctx = mx.gpu() # to run on GPU
# build network
NUM_FILTERS = 256 # number of convolutional filters per convolutional layer
NUM_OUTPUTS = len(categories) # number of classes
FULLY_CONNECTED = 1024 # number of unit in the fully connected dense layer
DROPOUT_RATE = 0.5 # probability of node drop out
LEARNING_RATE = 0.01 # learning rate of the gradient
MOMENTUM = 0.9 # momentum of the gradient
WDECAY = 0.00001 # regularization term to limit size of weights
net = gluon.nn.HybridSequential()
with net.name_scope():
    net.add(gluon.nn.Conv1D(channels=NUM_FILTERS, kernel_size=7, activation='relu'))
    net.add(gluon.nn.MaxPool1D(pool_size=3, strides=3))
    net.add(gluon.nn.Conv1D(channels=NUM_FILTERS, kernel_size=7, activation='relu'))
    net.add(gluon.nn.MaxPool1D(pool_size=3, strides=3))
    net.add(gluon.nn.Conv1D(channels=NUM_FILTERS, kernel_size=3, activation='relu'))
    net.add(gluon.nn.Conv1D(channels=NUM_FILTERS, kernel_size=3, activation='relu'))
    net.add(gluon.nn.Conv1D(channels=NUM_FILTERS, kernel_size=3, activation='relu'))
    net.add(gluon.nn.Conv1D(channels=NUM_FILTERS, kernel_size=3, activation='relu'))
    net.add(gluon.nn.MaxPool1D(pool_size=3, strides=3))
    net.add(gluon.nn.Flatten())
    net.add(gluon.nn.Dense(FULLY_CONNECTED, activation='relu'))
    net.add(gluon.nn.Dropout(DROPOUT_RATE))
    net.add(gluon.nn.Dense(FULLY_CONNECTED, activation='relu'))
    net.add(gluon.nn.Dropout(DROPOUT_RATE))
    net.add(gluon.nn.Dense(NUM_OUTPUTS))
net.collect_params().initialize(mx.init.Xavier(magnitude=2.24), ctx=ctx)
# loss
softmax_cross_entropy = gluon.loss.SoftmaxCrossEntropyLoss()
# optimizer
trainer = gluon.Trainer(net.collect_params(), 'sgd',
                        {'learning_rate': LEARNING_RATE,
                         'wd': WDECAY,
                         'momentum': MOMENTUM})
# Training Loop
import time
start_epoch = 6
number_epochs = 7
smoothing_constant = .01
tick = time.time()
net.hybridize()
for e in range(start_epoch, number_epochs):
    for i, (review, label) in enumerate(train_dataloader):
        review = review.as_in_context(ctx)
        label = label.as_in_context(ctx)
        with autograd.record():
            output = net(review)
            loss = softmax_cross_entropy(output, label)
        loss.backward()
        trainer.step(review.shape[0])
        # moving average of the loss
        if optimized:
            curr_loss = nd.mean(loss)
        else:
            curr_loss = nd.mean(loss).asscalar()
        moving_loss = (curr_loss if (i == 0)
                       else (1 - smoothing_constant) * moving_loss + (smoothing_constant) * curr_loss)
        if (i % 100 == 0):
            tock = time.time()
            if optimized:
                print('Batch {}:{},{},{} seconds for 100 batches'.format(i, curr_loss.asscalar(), moving_loss.asscalar(), tock-tick))
            else:
                print('Batch {}:{},{},{} seconds for 100 batches'.format(i, curr_loss, moving_loss, tock-tick))
            tick = tock
    print("Epoch %s. Loss: %s, Test_acc %s" % (e, moving_loss.asscalar(), test_accuracy)) |
@ThomasDelteil |
I removed the network operations and got essentially the same results for master and mxnet-cu90 in pure I/O. It looks like my network is not I/O bound with the current master (steady at ~14 s per 100 batches), but the GPU is not processing the operations as fast as previously. edit2: I get the performance drop when using any 1.2.0 version starting with |
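A hedged sketch of the kind of pure data-loading measurement described above; it assumes the train_dataloader from the script posted earlier and drops the network forward/backward entirely:
```python
import time

tick = time.time()
for i, (review, label) in enumerate(train_dataloader):
    # No net(), no loss, no trainer.step(): this times the DataLoader alone.
    if i > 0 and i % 100 == 0:
        tock = time.time()
        print('{:.1f} seconds for 100 batches'.format(tock - tick))
        tick = tock
```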
Should I close this? |
Yeah, we can close this. |
Note: Providing complete information in the most concise form is the best way to get help. This issue template serves as a checklist for the essential information needed for most technical issues and bug reports. For non-technical issues and feature requests, feel free to present the information in what you believe is the best form.
For Q & A and discussion, please start a discussion thread at https://discuss.mxnet.io
Description
(Brief description of the problem in no more than 2 sentences.)
Gluon data loader crash during training.
Environment info (Required)
Package used (Python/R/Scala/Julia):
I'm using Python
For Scala user, please provide:
(java -version)
(mvn -version)
(scala -version)
For R user, please provide R sessionInfo():
Build info (Required if built from source)
Compiler (gcc/clang/mingw/visual studio):
MXNet commit hash:
(Paste the output of git rev-parse HEAD here.)
2a9c7d9
Build config:
(Paste the content of config.mk, or the build command.)
make -j $(nproc) USE_OPENCV=1 USE_BLAS=openblas USE_CUDA=1 USE_CUDA_PATH=/usr/local/cuda USE_CUDNN=1
Error Message:
(Paste the complete error message, including stack trace.)
Minimum reproducible example
(If you are using your own code, please provide a short script that reproduces the error. Otherwise, please provide link to the existing example.)
The code is in a private repo
Steps to reproduce
(Paste the commands you ran that produced the error.)
It happens intermittently when we train on a large speech dataset.
What have you tried to solve it?
Nothing yet.