
[MXNET-86] Gluon dataloader crash on speech recognition training #10042

Closed
Jerryzcn opened this issue Mar 8, 2018 · 28 comments

Comments

@Jerryzcn
Contributor

Jerryzcn commented Mar 8, 2018

Note: Providing complete information in the most concise form is the best way to get help. This issue template serves as a checklist of the essential information for most technical issues and bug reports. For non-technical issues and feature requests, feel free to present the information in whatever form you believe is best.

For Q & A and discussion, please start a discussion thread at https://discuss.mxnet.io

Description

(Brief description of the problem in no more than 2 sentences.)
The Gluon data loader crashes during training.

Environment info (Required)

What to do:
1. Download the diagnosis script from https://raw.githubusercontent.com/apache/incubator-mxnet/master/tools/diagnose.py
2. Run the script using `python diagnose.py` and paste its output here.
----------Python Info----------
Version      : 3.6.3
Compiler     : GCC 7.2.0
Build        : ('default', 'Oct 13 2017 12:02:49')
Arch         : ('64bit', '')
------------Pip Info-----------
Version      : 9.0.1
Directory    : /home/ubuntu/anaconda3/lib/python3.6/site-packages/pip
----------MXNet Info-----------
/home/ubuntu/anaconda3/lib/python3.6/site-packages/h5py/__init__.py:34: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters
/home/ubuntu/anaconda3/lib/python3.6/site-packages/urllib3/contrib/pyopenssl.py:46: DeprecationWarning: OpenSSL.rand is deprecated - you should use os.urandom instead
  import OpenSSL.SSL
Version      : 1.2.0
Directory    : /home/ubuntu/anaconda3/lib/python3.6/site-packages/mxnet-1.2.0-py3.6.egg/mxnet
Hashtag not found. Not installed from pre-built package.
----------System Info----------
Platform     : Linux-4.4.0-1052-aws-x86_64-with-debian-stretch-sid
system       : Linux
node         : ip-172-31-22-177
release      : 4.4.0-1052-aws
version      : #61-Ubuntu SMP Mon Feb 12 23:05:58 UTC 2018
----------Hardware Info----------
machine      : x86_64
processor    : x86_64
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                64
On-line CPU(s) list:   0-63
Thread(s) per core:    2
Core(s) per socket:    16
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 79
Model name:            Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
Stepping:              1
CPU MHz:               2699.984
CPU max MHz:           3000.0000
CPU min MHz:           1200.0000
BogoMIPS:              4600.07
Hypervisor vendor:     Xen
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              46080K
NUMA node0 CPU(s):     0-15,32-47
NUMA node1 CPU(s):     16-31,48-63
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq monitor est ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single retpoline kaiser fsgsbase bmi1 hle avx2 smep bmi2 erms invpcid rtm rdseed adx xsaveopt ida
----------Network Test----------
Setting timeout: 10
Timing for MXNet: https://github.com/apache/incubator-mxnet, DNS: 0.0022 sec, LOAD: 0.5053 sec.
Timing for Gluon Tutorial(en): http://gluon.mxnet.io, DNS: 0.1427 sec, LOAD: 0.0642 sec.
Timing for Gluon Tutorial(cn): https://zh.gluon.ai, DNS: 0.1806 sec, LOAD: 0.1844 sec.
Timing for FashionMNIST: https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/dataset/fashion-mnist/train-labels-idx1-ubyte.gz, DNS: 0.0252 sec, LOAD: 0.1844 sec.
Timing for PYPI: https://pypi.python.org/pypi/pip, DNS: 0.0117 sec, LOAD: 0.2449 sec.
Timing for Conda: https://repo.continuum.io/pkgs/free/, DNS: 0.0081 sec, LOAD: 0.3814 sec.

Package used (Python/R/Scala/Julia):
I'm using Python

For Scala user, please provide:

  1. Java version: (java -version)
  2. Maven version: (mvn -version)
  3. Scala runtime if applicable: (scala -version)

For R user, please provide R sessionInfo():

Build info (Required if built from source)

Compiler (gcc/clang/mingw/visual studio):

gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/5/lto-wrapper
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu 5.4.0-6ubuntu1~16.04.9' --with-bugurl=file:///usr/share/doc/gcc-5/README.Bugs --enable-languages=c,ada,c++,java,go,d,fortran,objc,obj-c++ --prefix=/usr --program-suffix=-5 --enable-shared --enable-linker-build-id --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --libdir=/usr/lib --enable-nls --with-sysroot=/ --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --with-default-libstdcxx-abi=new --enable-gnu-unique-object --disable-vtable-verify --enable-libmpx --enable-plugin --with-system-zlib --disable-browser-plugin --enable-java-awt=gtk --enable-gtk-cairo --with-java-home=/usr/lib/jvm/java-1.5.0-gcj-5-amd64/jre --enable-java-home --with-jvm-root-dir=/usr/lib/jvm/java-1.5.0-gcj-5-amd64 --with-jvm-jar-dir=/usr/lib/jvm-exports/java-1.5.0-gcj-5-amd64 --with-arch-directory=amd64 --with-ecj-jar=/usr/share/java/eclipse-ecj.jar --enable-objc-gc --enable-multiarch --disable-werror --with-arch-32=i686 --with-abi=m64 --with-multilib-list=m32,m64,mx32 --enable-multilib --with-tune=generic --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu
Thread model: posix
gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.9) 

MXNet commit hash:
(Paste the output of git rev-parse HEAD here.)
2a9c7d9

Build config:
(Paste the content of config.mk, or the build command.)
make -j $(nproc) USE_OPENCV=1 USE_BLAS=openblas USE_CUDA=1 USE_CUDA_PATH=/usr/local/cuda USE_CUDNN=1

Error Message:

(Paste the complete error message, including stack trace.)

Traceback (most recent call last):
  File "/home/ubuntu/openspeech/openspeech/experiment.py", line 358, in <module>
    exp.run()
  File "/home/ubuntu/openspeech/openspeech/experiment.py", line 352, in run
    stage_inst.run()
  File "/home/ubuntu/openspeech/openspeech/train.py", line 140, in run
    for step, data in enumerate(train_loader):
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/mxnet-1.2.0-py3.6.egg/mxnet/gluon/data/dataloader.py", line 224, in __iter__
    idx, batch = data_queue.get()
  File "/home/ubuntu/anaconda3/lib/python3.6/multiprocessing/queues.py", line 113, in get
    return _ForkingPickler.loads(res)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/mxnet-1.2.0-py3.6.egg/mxnet/gluon/data/dataloader.py", line 38, in rebuild_ndarray
    return nd.NDArray(nd.ndarray._new_from_shared_mem(*args))
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/mxnet-1.2.0-py3.6.egg/mxnet/ndarray/ndarray.py", line 151, in _new_from_shared_mem
    ctypes.byref(hdl)))
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/mxnet-1.2.0-py3.6.egg/mxnet/base.py", line 149, in check_call
    raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [20:48:23] src/storage/./cpu_shared_storage_manager.h:178: Check failed: ptr != ((void *) -1) (0xffffffffffffffff vs. 0xffffffffffffffff) Failed to map shared memory. mmap failed with error Cannot allocate memory

Stack trace returned 10 entries:
[bt] (0) /home/ubuntu/anaconda3/lib/python3.6/site-packages/mxnet-1.2.0-py3.6.egg/mxnet/libmxnet.so(dmlc::StackTrace[abi:cxx11]()+0x5b) [0x7f77c7b896cb]
[bt] (1) /home/ubuntu/anaconda3/lib/python3.6/site-packages/mxnet-1.2.0-py3.6.egg/mxnet/libmxnet.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x28) [0x7f77c7b8a208]
[bt] (2) /home/ubuntu/anaconda3/lib/python3.6/site-packages/mxnet-1.2.0-py3.6.egg/mxnet/libmxnet.so(mxnet::storage::CPUSharedStorageManager::Alloc(mxnet::Storage::Handle*)+0x99b) [0x7f77ca7ef5db]
[bt] (3) /home/ubuntu/anaconda3/lib/python3.6/site-packages/mxnet-1.2.0-py3.6.egg/mxnet/libmxnet.so(mxnet::StorageImpl::Alloc(mxnet::Storage::Handle*)+0x5d) [0x7f77ca7efd4d]
[bt] (4) /home/ubuntu/anaconda3/lib/python3.6/site-packages/mxnet-1.2.0-py3.6.egg/mxnet/libmxnet.so(MXNDArrayCreateFromSharedMem+0x825) [0x7f77ca848135]
[bt] (5) /home/ubuntu/anaconda3/lib/python3.6/lib-dynload/../../libffi.so.6(ffi_call_unix64+0x4c) [0x7f7877dbaec0]
[bt] (6) /home/ubuntu/anaconda3/lib/python3.6/lib-dynload/../../libffi.so.6(ffi_call+0x22d) [0x7f7877dba87d]
[bt] (7) /home/ubuntu/anaconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so(_ctypes_callproc+0x2ce) [0x7f7877fcf82e]
[bt] (8) /home/ubuntu/anaconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so(+0x12265) [0x7f7877fd0265]
[bt] (9) python(_PyObject_FastCallDict+0x8b) [0x55a780a2554b]
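
Since the mmap failure above points at shared memory being exhausted, a quick sketch to check /dev/shm headroom from Python, assuming Linux, where the cpu_shared allocations are backed by files under /dev/shm:

import os

# report total and free space of the shared-memory filesystem
st = os.statvfs('/dev/shm')
total_gb = st.f_blocks * st.f_frsize / 1e9
free_gb = st.f_bavail * st.f_frsize / 1e9
print('shm total: %.1f GB, free: %.1f GB' % (total_gb, free_gb))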

Minimum reproducible example

(If you are using your own code, please provide a short script that reproduces the error. Otherwise, please provide link to the existing example.)
The code is in a private repo

Steps to reproduce

(Paste the commands you ran that produced the error.)

It happens intermittently when we train on a large speech dataset.

What have you tried to solve it?

Nothing yet.

@sxjscience
Member

Thanks for reporting! Is there any code to reproduce the error message?

@Jerryzcn
Contributor Author

It seems to be related to multiprocessing. When num_workers=0, the problem is resolved.
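
A minimal sketch of that workaround, assuming a toy ArrayDataset (the dataset and batch size here are placeholders, not our actual pipeline):

import mxnet.ndarray as nd
from mxnet.gluon.data import ArrayDataset, DataLoader

# num_workers=0 keeps loading in the main process, so no batches are passed
# between worker processes through shared-memory NDArrays.
dataset = ArrayDataset(nd.random.uniform(shape=(1000, 40)), nd.arange(1000))
loader = DataLoader(dataset, batch_size=32, num_workers=0)
for x, y in loader:
    pass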

@szha
Member

szha commented Mar 12, 2018

This isn't actionable since we don't have your code. Please attach code.

@Jerryzcn
Contributor Author

Will produce a minimal reproducible example soon.

@Jerryzcn
Contributor Author

Jerryzcn commented Mar 12, 2018

Here is the code that gets stuck. Changing to num_workers=0 works; however, on Mac with 1.0.0post3 this is not an issue.

from mxnet.gluon.data import DataLoader

import random

import mxnet.ndarray as nd
import numpy as np
from mxnet import context
from mxnet.gluon.data.dataset import Dataset


class Dummy(Dataset):
    def __init__(self, random_shape):
        self.random_shape = random_shape

    def __getitem__(self, idx):
        key = idx
        if self.random_shape:
            out = np.random.uniform(size=(random.randint(1000, 1100), 40))
            labels = np.random.uniform(size=(random.randint(10, 15)))
        else:
            out = np.random.uniform(size=(1000, 40))
            labels = np.random.uniform(size=(10))
        return key, out, labels

    def __len__(self):
        return 50000

    def batchify(self, data):
        """
        Collate data into batch. Use shared memory for stacking.

        :param data: a list of arrays, with layout of 'NTC'.
        :return: either x and x's unpadded lengths if labels are not supplied, or x, x's unpadded
                 lengths, y and y's unpadded lengths otherwise.
        """

        # input layout is NTC
        keys, inputs, labels = [item[0] for item in data], [item[1] for item in data], \
                               [item[2] for item in data]

        if len(data) > 1:
            max_data_len = max([seq.shape[0] for seq in inputs])
            max_labels_len = 0 if not labels else max([seq.shape[0] for seq in labels])
        else:
            max_data_len = inputs[0].shape[0]
            max_labels_len = 0 if not labels else labels[0].shape[0]

        x_lens = [item.shape[0] for item in inputs]
        y_lens = [item.shape[0] for item in labels]

        for i, seq in enumerate(inputs):
            pad_len = max_data_len - seq.shape[0]
            inputs[i] = np.pad(seq, ((0, pad_len), (0, 0)), 'constant', constant_values=0)
            labels[i] = np.pad(labels[i], (0, max_labels_len - labels[i].shape[0]),
                               'constant', constant_values=-1)

        inputs = np.asarray(inputs, dtype=np.float32)
        if labels is not None:
            labels = np.asarray(labels, dtype=np.float32)
        inputs = inputs.transpose((1, 0, 2))
        labels = labels.transpose((1, 0))

        return (nd.array(inputs, dtype=inputs.dtype, ctx=context.Context('cpu_shared', 0)),
                nd.array(x_lens, ctx=context.Context('cpu_shared', 0))) \
            if labels is None else (
                                    nd.array(inputs, dtype=inputs.dtype, ctx=context.Context('cpu_shared', 0)),
                                    nd.array(x_lens, ctx=context.Context('cpu_shared', 0)),
                                    nd.array(labels, dtype=labels.dtype, ctx=context.Context('cpu_shared', 0)),
                                    nd.array(y_lens, ctx=context.Context('cpu_shared', 0)))


def main():
    data = Dummy(True)
    loader = DataLoader(data, batch_size=40, batchify_fn=data.batchify, num_workers=2)
    for epoch in range(20):
        for i, data in enumerate(loader):
            if i % 10 == 0:
                print(data)
                print(i)


if __name__ == '__main__':
    main()

@sxjscience
Member

@Jerryzcn found that v1.1.0 does not have this problem.

@sxjscience
Member

I'm now doing a binary search to locate the problem.

@sxjscience
Member

sxjscience commented Mar 13, 2018

@cjolivier01 @piiswrong After a binary search, I can confirm that the problem is due to this PR: 106f97f

3/4 94d3c06f8511782f405f5dbf4bccf61647a78cf3 Fail
2/27 7a0509d6f5ee19b0d3530fb2a4cb944e4f743b33 Fail
2/23 fbbc080d47323dbc23eef4a1453452624cea859b Fail
2/23 106f97f1881e6bb1a00c56a0ae55200e27297733 Fail
2/23 b23e0a9f9f28f886ab20e48d0fcabcf0f8db91c4 Succeed
2/21 ed21873b73fc633fa0a3866236ec5a92057c2056 Succeed

Also, I compiled without setting the USE_PROFILE flag.

@cjolivier01
Member

With what frequency does it occur?

@sxjscience
Member

sxjscience commented Mar 13, 2018 via email

@cjolivier01
Member

cjolivier01 commented Mar 13, 2018 via email

@cjolivier01
Member

Created JIRA work item: https://issues.apache.org/jira/browse/MXNET-86

@cjolivier01 cjolivier01 changed the title Gluon dataloader crash on speech recognition training [MXNET-86] Gluon dataloader crash on speech recognition training Mar 13, 2018
@cjolivier01 cjolivier01 self-assigned this Mar 13, 2018
@cjolivier01
Member

cjolivier01 commented Mar 13, 2018

This script just freezes for me without doing anything, in Connection._recv it seems. What is it doing?
I never get to the error output described.

@Jerryzcn
Contributor Author

Jerryzcn commented Mar 13, 2018

The error output only happens when you train on the actual data. When you use the test script, it will freeze. If I revert back to 1.1.0, the problem is resolved.

@sxjscience
Member

sxjscience commented Mar 13, 2018

@cjolivier01 Before that commit it does not freeze and prints the data instead.

@Jerryzcn
Contributor Author

There seem to be other issues as well; after training for a day or so I got a segfault. This does not happen with a small dataset. The segfault was tested with 1.2.0. I will try a previous version.

@cjolivier01
Member

cjolivier01 commented Mar 13, 2018

I think that would be a separate issue. This one so far is just the "stuck" fix.

@Jerryzcn
Contributor Author

The segfault seems to be related to #10096.

@zhreshold zhreshold mentioned this issue Mar 15, 2018
@ThomasDelteil
Contributor

ThomasDelteil commented Mar 19, 2018

When using num_workers > 0, I get the following after a few hundred/thousand batches (the higher the number of workers, the sooner the segfault):

I am using mxnet-cu90 1.1.0:

Segmentation fault: 11

Stack trace returned 10 entries:
[bt] (0) /home/ec2-user/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x276938) [0x7fe86492c938]
[bt] (1) /home/ec2-user/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x28c53ae) [0x7fe866f7b3ae]
[bt] (2) /lib64/libc.so.6(+0x353a0) [0x7fe8e4fe33a0]
[bt] (3) /home/ec2-user/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x28c2703) [0x7fe866f78703]
[bt] (4) /home/ec2-user/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x28c46d8) [0x7fe866f7a6d8]
[bt] (5) /home/ec2-user/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(MXNDArrayCreateFromSharedMem+0x5f5) [0x7fe866a4f4c5]
[bt] (6) /home/ec2-user/anaconda3/lib/python3.6/lib-dynload/../../libffi.so.6(ffi_call_unix64+0x4c) [0x7fe8d9299ec0]
[bt] (7) /home/ec2-user/anaconda3/lib/python3.6/lib-dynload/../../libffi.so.6(ffi_call+0x22d) [0x7fe8d929987d]
[bt] (8) /home/ec2-user/anaconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so(_ctypes_callproc+0x2ce) [0x7fe8d94ae82e]
[bt] (9) /home/ec2-user/anaconda3/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so(+0x12265) [0x7fe8d94af265]
*** Error in `/home/ec2-user/anaconda3/bin/python': malloc(): memory corruption: 0x00007fe8380111f0 ***

Running this code: https://github.com/ThomasDelteil/CNN_NLP_MXNet/blob/master/Crepe-Gluon.ipynb and changing this line:
curr_loss = nd.mean(loss).asscalar()
to:
curr_loss = nd.mean(loss)

Sometimes, though not always, I also get the workers filling up 100% of my /dev/shm after the segfault. I am running the code in Jupyter Lab.

Is this the same issue?

Should I open a new one?

The issue does not happen without multiprocessing, or when blocking on every batch (see the .asscalar() change that triggered it).
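
For context, a minimal sketch of that difference, with placeholder values (loss here is just a stand-in for softmax_cross_entropy(output, label)):

import mxnet.ndarray as nd

loss = nd.ones((128,))               # stand-in for the per-sample loss of one batch
blocking = nd.mean(loss).asscalar()  # copies to a Python float and waits for the backend to finish
lazy = nd.mean(loss)                 # stays an NDArray; the loop keeps queueing work asynchronously
# ... later, a synchronization point can be forced explicitly:
nd.waitall()
print(blocking, lazy.asscalar())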

@Jerryzcn
Contributor Author

Jerryzcn commented Mar 19, 2018

@ThomasDelteil can you try building from source using master? #10096 might fix it. I am testing it right now.

@ThomasDelteil
Contributor

I am trying now; it did still happen, though, with mxnet-cu90==1.2.0b20180315.

@ThomasDelteil
Contributor

ThomasDelteil commented Mar 19, 2018

The segfault does not seem to happen with the latest master; however, the latest master does seem to be MUCH slower than 1.1.0, by a factor of 3-4.
make -j $(nproc) USE_OPENCV=1 USE_BLAS=openblas USE_CUDA=1 USE_CUDA_PATH=/usr/local/cuda USE_CUDNN=1

optimized refers to the version where .asscalar() is only called every 100 batches.

mxnet | 0 workers | 8 workers | 0 workers, optimized | 8 workers, optimized
--- | --- | --- | --- | ---
mxnet-cu90 1.1.0 | 11s | 4s | 8s | seg fault
master 1.2.0 | 13.5s | 14s | 13.5s | 14.5s

@Jerryzcn Any idea why that might be? It looks like on current master the data loading is not the limiting factor in the performance, whilst when using 1.1.0 it was. If I didn't botch anything during my build, that looks like a pretty bad regression for 1.2.0.

@Jerryzcn
Contributor Author

@ThomasDelteil The bug is related to a race condition in memory management, where a space is double freed. The latest master adds a lock on the space, so that might slow down the data loading. However, I'm not sure exactly why. Ping @zhreshold.

@ThomasDelteil
Contributor

ThomasDelteil commented Mar 19, 2018

Thanks @Jerryzcn
Here is a reproducible example:
Play with the NUM_WORKERS and optimized parameters to surface the problems I mentioned above.

import mxnet as mx
print(mx.__version__)
from mxnet import nd, autograd, gluon
import os
import pandas as pd
from mxnet.gluon.data import ArrayDataset
from mxnet.gluon.data import DataLoader
import numpy as np
import multiprocessing
import wget

if not os.path.isfile('pickleddata.pkl'):
    wget.download('https://s3.us-east-2.amazonaws.com/tdelteil-test-mxnet/pickleddata.pkl')
data = pd.read_pickle('pickleddata.pkl')


# /!\ The important bit:
NUM_WORKERS = multiprocessing.cpu_count() # number of workers used in the data loading
optimized = True


categories = [
    'Home_and_Kitchen',
    'Books', 
    'CDs_and_Vinyl', 
    'Movies_and_TV', 
    'Cell_Phones_and_Accessories',
    'Sports_and_Outdoors', 
    'Clothing_Shoes_and_Jewelry'
]

ALPHABET = list("abcdefghijklmnopqrstuvwxyz0123456789-,;.!?:'\"/\\|_@#$%^&*~`+ =<>()[]{}") # The 69 characters as specified in the paper
ALPHABET_INDEX = {letter: index for index, letter in enumerate(ALPHABET)} # { a: 0, b: 1, etc}
FEATURE_LEN = 1014 # max-length in characters for one document
BATCH_SIZE = 128 # number of documents per batch

def encode(text):
    encoded = np.zeros([len(ALPHABET), FEATURE_LEN], dtype='float32')
    review = text.lower()[:FEATURE_LEN-1:-1]
    i = 0
    for letter in text:
        if i >= FEATURE_LEN:
            break;
        if letter in ALPHABET_INDEX:
            encoded[ALPHABET_INDEX[letter]][i] = 1
        i += 1
    return encoded

class AmazonDataSet(ArrayDataset):
    # We pre-process the documents on the fly
    def __getitem__(self, idx):
        return encode(self._data[0][idx]), self._data[1][idx]


# Data loaders:
split = 0.8
split_index = int(split*len(data))
train_data_X = data['X'][:split_index].as_matrix()
train_data_Y = data['Y'][:split_index].as_matrix()
test_data_X = data['X'][split_index:].as_matrix()
test_data_Y = data['Y'][split_index:].as_matrix()
train_dataset = AmazonDataSet(train_data_X, train_data_Y)
test_dataset = AmazonDataSet(test_data_X, test_data_Y)

train_dataloader = DataLoader(train_dataset, shuffle=True, batch_size=BATCH_SIZE, num_workers=NUM_WORKERS, last_batch='discard')
test_dataloader = DataLoader(test_dataset, shuffle=True, batch_size=BATCH_SIZE, num_workers=NUM_WORKERS, last_batch='discard')

# context:
ctx = mx.gpu() # to run on GPU

# build network
NUM_FILTERS = 256 # number of convolutional filters per convolutional layer
NUM_OUTPUTS = len(categories) # number of classes
FULLY_CONNECTED = 1024 # number of unit in the fully connected dense layer
DROPOUT_RATE = 0.5 # probability of node drop out
LEARNING_RATE = 0.01 # learning rate of the gradient
MOMENTUM = 0.9 # momentum of the gradient
WDECAY = 0.00001 # regularization term to limit size of weights

net = gluon.nn.HybridSequential()
with net.name_scope():
    net.add(gluon.nn.Conv1D(channels=NUM_FILTERS, kernel_size=7, activation='relu'))
    net.add(gluon.nn.MaxPool1D(pool_size=3, strides=3))
    net.add(gluon.nn.Conv1D(channels=NUM_FILTERS, kernel_size=7, activation='relu'))
    net.add(gluon.nn.MaxPool1D(pool_size=3, strides=3))
    net.add(gluon.nn.Conv1D(channels=NUM_FILTERS, kernel_size=3, activation='relu'))
    net.add(gluon.nn.Conv1D(channels=NUM_FILTERS, kernel_size=3, activation='relu'))
    net.add(gluon.nn.Conv1D(channels=NUM_FILTERS, kernel_size=3, activation='relu'))
    net.add(gluon.nn.Conv1D(channels=NUM_FILTERS, kernel_size=3, activation='relu'))
    net.add(gluon.nn.MaxPool1D(pool_size=3, strides=3))
    net.add(gluon.nn.Flatten())
    net.add(gluon.nn.Dense(FULLY_CONNECTED, activation='relu'))
    net.add(gluon.nn.Dropout(DROPOUT_RATE))
    net.add(gluon.nn.Dense(FULLY_CONNECTED, activation='relu'))
    net.add(gluon.nn.Dropout(DROPOUT_RATE))
    net.add(gluon.nn.Dense(NUM_OUTPUTS))



net.collect_params().initialize(mx.init.Xavier(magnitude=2.24), ctx=ctx)

# loss
softmax_cross_entropy = gluon.loss.SoftmaxCrossEntropyLoss()

# optimizer
trainer = gluon.Trainer(net.collect_params(), 'sgd', 
                        {'learning_rate': LEARNING_RATE, 
                         'wd':WDECAY, 
                         'momentum':MOMENTUM})

# Training Loop

import time
start_epoch = 6
number_epochs = 7
smoothing_constant = .01
tick = time.time()
net.hybridize()
for e in range(start_epoch, number_epochs):
    for i, (review, label) in enumerate(train_dataloader):
        review = review.as_in_context(ctx)
        label = label.as_in_context(ctx)
        with autograd.record():
            output = net(review)
            loss = softmax_cross_entropy(output, label)
        loss.backward()
        trainer.step(review.shape[0])
        
        # moving average of the loss
        if optimized:
            curr_loss = nd.mean(loss)
        else:
            curr_loss = nd.mean(loss).asscalar()
        moving_loss = (curr_loss if (i == 0) 
                       else (1 - smoothing_constant) * moving_loss + (smoothing_constant) * curr_loss)

        if (i%100 == 0):
            tock = time.time()
            if optimized:
                print('Batch {}:{},{},{} seconds for 100 batches'.format(i, curr_loss.asscalar(),moving_loss.asscalar(), tock-tick))
            else:
                print('Batch {}:{},{},{} seconds for 100 batches'.format(i, curr_loss, moving_loss, tock-tick))
            tick = tock

    print("Epoch %s. Loss: %s, Test_acc %s" % (e, moving_loss.asscalar(), test_accuracy))

@zhreshold
Member

@ThomasDelteil
Measuring I/O together with training is not a good idea. Can you run a pure I/O test?
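
For example, a minimal sketch of such a pure I/O test, assuming the train_dataloader built in the script above (no forward/backward pass, just iteration and timing):

import time

tick = time.time()
for i, (review, label) in enumerate(train_dataloader):
    # no network work here: this measures only data loading and host-side batching
    if i % 100 == 0 and i > 0:
        tock = time.time()
        print('batch %d: %.2f seconds for the last 100 batches' % (i, tock - tick))
        tick = tock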

@ThomasDelteil
Contributor

ThomasDelteil commented Mar 19, 2018

I removed the network operations and got essentially the same results for master and mxnet-cu90 in pure I/O. It looks like my network is not I/O bound with the current master (stuck at ~14 s per 100 batches), but the GPU is not processing the operations as fast as previously.
Either I botched my cuDNN linking or the performance hit is much bigger than reported here: #10116

Edit 2: I get the performance drop when using any 1.2.0 version starting with pip install mxnet-cu90==1.2.0b20180221; performance is restored when using pip install mxnet-cu90==1.1.0.

@sxjscience
Member

Should I close this?

@Jerryzcn
Contributor Author

Yeah, we can close this.
