[MXNET-86] Gluon dataloader crash on speech recognition training #10042
Comments
Thanks for reporting! Is there any code to reproduce the error message? |
Seems to be related to multiprocessing. When num_workers=0 the problem is resolved. |
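For context, a minimal sketch of the workaround being described (the tiny random ArrayDataset below is illustrative, not the reporter's speech data): with num_workers=0 the DataLoader iterates in the main process, avoiding the multiprocessing path where the hang occurs.
```python
import mxnet as mx
from mxnet.gluon.data import ArrayDataset, DataLoader

# Dummy dataset: 1000 samples of 10 features each, with integer labels.
dataset = ArrayDataset(mx.nd.random.uniform(shape=(1000, 10)),
                       mx.nd.arange(1000))

# num_workers=0 keeps loading in the main process (no worker processes),
# which is the configuration that does not hang for the reporter.
loader = DataLoader(dataset, batch_size=32, shuffle=True, num_workers=0)

for data, label in loader:
    pass  # loading only; no training step
```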
This isn't actionable since we don't have your code. Please attach code. |
Will produce a minimal reproducible example soon. |
Here is the code that gets stuck. Changing to num_workers=0 works; however, on Mac with 1.0.0post3 this is not an issue.
|
@Jerryzcn found that v1.1.0 does not have this problem. |
I'm now doing a binary search to locate the problem. |
@cjolivier01 @piiswrong After a binary search over commits, I can confirm that the problem is due to this PR: 106f97f
Also, I compiled without setting the |
with what frequency does it occur? |
100%
|
ok, will take a look tomorrow
|
Created JIRA work item: https://issues.apache.org/jira/browse/MXNET-86 |
This script just freezes for me without doing anything. It is stuck in Connection._recv, it seems. What is it doing? |
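As an aside, a hedged debugging sketch (not part of the original repro) for confirming where a hung process is blocked, e.g. in multiprocessing.connection.Connection._recv; it assumes a Unix host and can be dropped at the top of the script:
```python
# Illustrative only: dump Python stack traces when the script appears hung.
import faulthandler
import signal

# Print all thread stacks to stderr when the process receives SIGUSR1
# (trigger with `kill -USR1 <pid>` from another shell).
faulthandler.register(signal.SIGUSR1)

# Alternatively, dump the traceback automatically after 60 seconds
# without killing the process.
faulthandler.dump_traceback_later(60, exit=False)
```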
The error output only happens when you train on the actual data. When you use the test script, it will just freeze. If you revert back to 1.1.0 the problem is resolved. |
@cjolivier01 Before the commit it does not freeze and prints the data instead. |
There seem to be other issues as well: after training for a day or so I got a segfault. This does not happen with a small dataset. The segfault was seen with 1.2.0; I will try a previous version. |
I think that would be a separate issue. This one so far is just the "stuck" fix. |
The segfault seems to be related to #10096 |
When using num_workers > 0, I get the following after a few hundred/thousand batches (the higher the number of workers, the sooner the segfault). I am using mxnet-cu90 1.1.0:
Running this code: https://github.com/ThomasDelteil/CNN_NLP_MXNet/blob/master/Crepe-Gluon.ipynb and changing this line. Sometimes, not all the time, I also get the workers filling up 100% of my /dev/shm after the segfault. I am running the code in Jupyter Lab. Is this the same issue? Should I open a new one? The issue does not happen without multiprocessing or when blocking on every batch (see the .asscalar() change that triggered it). |
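For clarity, an illustrative sketch of the ".asscalar()" difference mentioned above; the loss tensor here is a stand-in, not the notebook's actual loss:
```python
from mxnet import nd

loss = nd.random.uniform(shape=(128,))  # stand-in for one batch's per-sample loss

curr_loss_async = nd.mean(loss)             # stays an NDArray: nothing blocks, workers keep queueing batches
curr_loss_block = nd.mean(loss).asscalar()  # copies to a Python float, synchronizing on every batch
```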
@ThomasDelteil can you try building from source using master? #10096 might fix it. I am testing it right now. |
I am trying now; it did still happen though with |
The segfault does not seem to happen with the latest master; however, the latest master does seem to be MUCH slower than 1.1.0, by a factor of 3-4. optimized refers to the version where
@Jerryzcn Any idea why that might be? It looks like on current master the data loading is not the limiting factor in the performance, whilst when using 1.1.0 it was. If I didn't botch anything during my build, that looks like a pretty bad regression for 1.2.0 |
@ThomasDelteil The bug is related to a race condition in memory management, where a memory space is double-freed. The latest master adds a lock on that space, so that might slow down the data loading. However, I'm not sure exactly why. ping @zhreshold |
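This is not MXNet's actual implementation (the real fix lives in the C++ storage manager); purely as a toy Python sketch of the pattern being described, here is a lock serializing access to a shared pool so a slot cannot be released twice:
```python
import threading

class BufferPool:
    """Toy stand-in for a shared memory pool (illustrative only)."""
    def __init__(self, num_slots):
        self._free = list(range(num_slots))
        self._lock = threading.Lock()  # the added lock: serializes acquire/release

    def acquire(self):
        # Hand out a free slot, or None if the pool is exhausted.
        with self._lock:
            return self._free.pop() if self._free else None

    def release(self, slot):
        with self._lock:
            # Without the lock, two threads could each decide to return the
            # same slot and push it back twice -- the "double free" above.
            if slot not in self._free:
                self._free.append(slot)
```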
Thanks @Jerryzcn.
import mxnet as mx
print(mx.__version__)
from mxnet import nd, autograd, gluon
import os
import pandas as pd
from mxnet.gluon.data import ArrayDataset
from mxnet.gluon.data import DataLoader
import numpy as np
import multiprocessing
import wget
if not os.path.isfile('pickleddata.pkl'):
    wget.download('https://s3.us-east-2.amazonaws.com/tdelteil-test-mxnet/pickleddata.pkl')
data = pd.read_pickle('pickleddata.pkl')
# /!\ The important bit:
NUM_WORKERS = multiprocessing.cpu_count() # number of workers used in the data loading
optimized = True
categories = [
'Home_and_Kitchen',
'Books',
'CDs_and_Vinyl',
'Movies_and_TV',
'Cell_Phones_and_Accessories',
'Sports_and_Outdoors',
'Clothing_Shoes_and_Jewelry'
]
ALPHABET = list("abcdefghijklmnopqrstuvwxyz0123456789-,;.!?:'\"/\\|_@#$%^&*~`+ =<>()[]{}") # The 69 characters as specified in the paper
ALPHABET_INDEX = {letter: index for index, letter in enumerate(ALPHABET)} # { a: 0, b: 1, etc}
FEATURE_LEN = 1014 # max-length in characters for one document
BATCH_SIZE = 128 # number of documents per batch
def encode(text):
    encoded = np.zeros([len(ALPHABET), FEATURE_LEN], dtype='float32')
    review = text.lower()[:FEATURE_LEN-1:-1]
    i = 0
    for letter in text:
        if i >= FEATURE_LEN:
            break
        if letter in ALPHABET_INDEX:
            encoded[ALPHABET_INDEX[letter]][i] = 1
        i += 1
    return encoded
class AmazonDataSet(ArrayDataset):
    # We pre-process the documents on the fly
    def __getitem__(self, idx):
        return encode(self._data[0][idx]), self._data[1][idx]
# Data loaders:
split = 0.8
split_index = int(split*len(data))
train_data_X = data['X'][:split_index].as_matrix()
train_data_Y = data['Y'][:split_index].as_matrix()
test_data_X = data['X'][split_index:].as_matrix()
test_data_Y = data['Y'][split_index:].as_matrix()
train_dataset = AmazonDataSet(train_data_X, train_data_Y)
test_dataset = AmazonDataSet(test_data_X, test_data_Y)
train_dataloader = DataLoader(train_dataset, shuffle=True, batch_size=BATCH_SIZE, num_workers=NUM_WORKERS, last_batch='discard')
test_dataloader = DataLoader(test_dataset, shuffle=True, batch_size=BATCH_SIZE, num_workers=NUM_WORKERS, last_batch='discard')
# context:
ctx = mx.gpu() # to run on GPU
# build network
NUM_FILTERS = 256 # number of convolutional filters per convolutional layer
NUM_OUTPUTS = len(categories) # number of classes
FULLY_CONNECTED = 1024 # number of unit in the fully connected dense layer
DROPOUT_RATE = 0.5 # probability of node drop out
LEARNING_RATE = 0.01 # learning rate of the gradient
MOMENTUM = 0.9 # momentum of the gradient
WDECAY = 0.00001 # regularization term to limit size of weights
net = gluon.nn.HybridSequential()
with net.name_scope():
    net.add(gluon.nn.Conv1D(channels=NUM_FILTERS, kernel_size=7, activation='relu'))
    net.add(gluon.nn.MaxPool1D(pool_size=3, strides=3))
    net.add(gluon.nn.Conv1D(channels=NUM_FILTERS, kernel_size=7, activation='relu'))
    net.add(gluon.nn.MaxPool1D(pool_size=3, strides=3))
    net.add(gluon.nn.Conv1D(channels=NUM_FILTERS, kernel_size=3, activation='relu'))
    net.add(gluon.nn.Conv1D(channels=NUM_FILTERS, kernel_size=3, activation='relu'))
    net.add(gluon.nn.Conv1D(channels=NUM_FILTERS, kernel_size=3, activation='relu'))
    net.add(gluon.nn.Conv1D(channels=NUM_FILTERS, kernel_size=3, activation='relu'))
    net.add(gluon.nn.MaxPool1D(pool_size=3, strides=3))
    net.add(gluon.nn.Flatten())
    net.add(gluon.nn.Dense(FULLY_CONNECTED, activation='relu'))
    net.add(gluon.nn.Dropout(DROPOUT_RATE))
    net.add(gluon.nn.Dense(FULLY_CONNECTED, activation='relu'))
    net.add(gluon.nn.Dropout(DROPOUT_RATE))
    net.add(gluon.nn.Dense(NUM_OUTPUTS))
net.collect_params().initialize(mx.init.Xavier(magnitude=2.24), ctx=ctx)
# loss
softmax_cross_entropy = gluon.loss.SoftmaxCrossEntropyLoss()
# optimizer
trainer = gluon.Trainer(net.collect_params(), 'sgd',
                        {'learning_rate': LEARNING_RATE,
                         'wd': WDECAY,
                         'momentum': MOMENTUM})
# Training Loop
import time
start_epoch = 6
number_epochs = 7
smoothing_constant = .01
tick = time.time()
net.hybridize()
for e in range(start_epoch, number_epochs):
    for i, (review, label) in enumerate(train_dataloader):
        review = review.as_in_context(ctx)
        label = label.as_in_context(ctx)
        with autograd.record():
            output = net(review)
            loss = softmax_cross_entropy(output, label)
        loss.backward()
        trainer.step(review.shape[0])
        # moving average of the loss
        if optimized:
            curr_loss = nd.mean(loss)
        else:
            curr_loss = nd.mean(loss).asscalar()
        moving_loss = (curr_loss if (i == 0)
                       else (1 - smoothing_constant) * moving_loss + (smoothing_constant) * curr_loss)
        if (i % 100 == 0):
            tock = time.time()
            if optimized:
                print('Batch {}:{},{},{} seconds for 100 batches'.format(i, curr_loss.asscalar(), moving_loss.asscalar(), tock-tick))
            else:
                print('Batch {}:{},{},{} seconds for 100 batches'.format(i, curr_loss, moving_loss, tock-tick))
            tick = tock
    print("Epoch %s. Loss: %s, Test_acc %s" % (e, moving_loss.asscalar(), test_accuracy)) |
@ThomasDelteil |
I removed the network operations and got essentially the same results for master and mxnet-cu90 in pure I/O. It looks like my network is not I/O bound with the current master (steady at ~14 s per 100 batches), but the GPU is not processing the operations as fast as previously. edit2: I get the performance drop when using any 1.2.0 version starting with |
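A hedged sketch of the kind of pure data-loading measurement described above; it assumes the train_dataloader from the script posted earlier and drops the network forward/backward entirely:
```python
import time

tick = time.time()
for i, (review, label) in enumerate(train_dataloader):
    # No net(), no loss, no trainer.step(): this times the DataLoader alone.
    if i > 0 and i % 100 == 0:
        tock = time.time()
        print('{:.1f} seconds for 100 batches'.format(tock - tick))
        tick = tock
```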
Should I close this? |
Yeah, we can close this. |
Note: Providing complete information in the most concise form is the best way to get help. This issue template serves as a checklist for the essential information needed for most technical issues and bug reports. For non-technical issues and feature requests, feel free to present the information in what you believe is the best form.
For Q & A and discussion, please start a discussion thread at https://discuss.mxnet.io
Description
(Brief description of the problem in no more than 2 sentences.)
Gluon data loader crash during training.
Environment info (Required)
Package used (Python/R/Scala/Julia):
I'm using Python
For Scala user, please provide:
(java -version)
(mvn -version)
(scala -version)
For R user, please provide R sessionInfo():
Build info (Required if built from source)
Compiler (gcc/clang/mingw/visual studio):
MXNet commit hash:
(Paste the output of git rev-parse HEAD here.)
2a9c7d9
Build config:
(Paste the content of config.mk, or the build command.)
make -j $(nproc) USE_OPENCV=1 USE_BLAS=openblas USE_CUDA=1 USE_CUDA_PATH=/usr/local/cuda USE_CUDNN=1
Error Message:
(Paste the complete error message, including stack trace.)
Minimum reproducible example
(If you are using your own code, please provide a short script that reproduces the error. Otherwise, please provide link to the existing example.)
The code is in a private repo
Steps to reproduce
(Paste the commands you ran that produced the error.)
It happens intermittently when we train on a large speech dataset.
What have you tried to solve it?
Nothing yet.