This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Sockeye failure with MXNet #15297

Open
anirudh2290 opened this issue Jun 21, 2019 · 15 comments

Comments

@anirudh2290
Member

Description

Install sockeye and run python setup.py test.
Before running, change the mxnet version in requirements.txt and requirements.gpu-cu100.txt to a nightly build at or after commit 09202f7.
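
The requirements edit described above could be scripted roughly as follows. This is only a sketch: pin_mxnet is a hypothetical helper (not part of sockeye or MXNet), it assumes the requirements files hold one plain requirement per line, and the version string in the usage note is a placeholder for whichever nightly build you want to test.

```python
import re
from pathlib import Path

# Matches a requirement line whose package name starts with "mxnet"
# (for example "mxnet" or "mxnet-cu100mkl"), with an optional version specifier.
MXNET_LINE = re.compile(r"^(mxnet[\w-]*)\s*([=<>!~].*)?$")


def pin_mxnet(requirements_path, version):
    """Rewrite every mxnet requirement in the file to "<name>==<version>".

    Hypothetical helper for illustration. Returns the number of mxnet lines found.
    """
    path = Path(requirements_path)
    matched = 0
    new_lines = []
    for line in path.read_text().splitlines():
        match = MXNET_LINE.match(line.strip())
        if match:
            # Keep the package name (mxnet, mxnet-cu100mkl, ...) and replace the pin.
            new_lines.append(f"{match.group(1)}=={version}")
            matched += 1
        else:
            # Leave every other requirement untouched.
            new_lines.append(line)
    path.write_text("\n".join(new_lines) + "\n")
    return matched
```

For example, pin_mxnet("requirements/requirements.gpu-cu100.txt", "1.5.0b20190621") would repin the GPU requirements file, assuming it lives under requirements/ as the setup.py diff later in this thread suggests. Whether a given nightly actually contains commit 09202f7 has to be checked separately.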

Run the following from inside the sockeye directory:

python3 setup.py test
@mxnet-label-bot
Contributor

Hey, this is the MXNet Label Bot.
Thank you for submitting the issue! I will try and suggest some labels so that the appropriate MXNet community members can help resolve it.
Here are my recommended labels: Build

@pengzhao-intel
Contributor

@anirudh2290 thanks for the issue.
Could you quickly try the fix in #15298?

@roywei
Member

roywei commented Jun 21, 2019

Hi @anirudh2290 @ptrendx

I'm still not able to reproduce the crash. My steps are below; could you help point out what's wrong?

Machine: AWS Deep Learning Base AMI, p3.8xlarge, Ubuntu 16.04
MXNet:

pip3 list | grep mxnet 
mxnet-cu100mkl      1.5.0b20190621

Sockeye:

sudo pip3 install sockeye --no-deps

Changed requirements:

git diff
diff --git a/setup.py b/setup.py
index ffa2a7b..4d0741d 100644
--- a/setup.py
+++ b/setup.py
@@ -116,7 +116,7 @@ args = dict(
 
     extras_require={
         'optional': ['mxboard', 'matplotlib'],
-        'dev': get_requirements(os.path.join('requirements', 'requirements.dev.txt'))
+        'dev': get_requirements(os.path.join('requirements', 'requirements.gpu-cu100.txt'))
     },

run:

python3 setup.py test -r requirements/requirements.gpu-cu100.txt

result:

sockeye/output_handler.py                          139     36    74%
sockeye/prepare_data.py                             39      1    97%
sockeye/rerank.py                                   59     25    58%
sockeye/rnn.py                                     217      2    99%
sockeye/rnn_attention.py                           221      7    97%
sockeye/score.py                                    53      3    94%
sockeye/scoring.py                                 110      6    95%
sockeye/train.py                                   379     90    76%
sockeye/training.py                                682    180    74%
sockeye/transformer.py                             130      4    97%
sockeye/translate.py                               112     13    88%
sockeye/utils.py                                   483    151    69%
sockeye/vocab.py                                   137     25    82%
--------------------------------------------------------------------
TOTAL                                             8935   1363    85%


================================================== 542 passed in 39.93 seconds ===================================================
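
A passing run like this only says something about the MXNet build that was actually installed when the tests ran. As a rough sketch, the pip3 list check above can also be done from Python before running the suite. installed_mxnet_packages is a hypothetical helper name; the code uses only the standard-library importlib.metadata module (Python 3.8+).

```python
from importlib import metadata


def installed_mxnet_packages():
    """Return (name, version) pairs for installed distributions named mxnet*."""
    found = []
    for dist in metadata.distributions():
        meta = dist.metadata
        # Guard against broken installs that lack metadata or a Name field.
        name = meta.get("Name") if meta is not None else None
        if name and name.lower().startswith("mxnet"):
            found.append((name, dist.version))
    return sorted(found, key=lambda pair: pair[0].lower())


for name, version in installed_mxnet_packages():
    print(name, version)
```

If more than one mxnet variant shows up, or the version is not the nightly under test, the result does not reflect the commit in question.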

@anirudh2290
Member Author

Did you modify the mxnet version in the requirements file?

@anirudh2290
Member Author

anirudh2290 commented Jun 21, 2019

@pengzhao-intel @roywei I am currently building with @ZhennanQin's commit and will try it out.

@anirudh2290
Member Author

Even with PR #15298, it still segfaults and dumps core.

@pengzhao-intel
Contributor

Thanks, we are trying to reproduce the crash (we haven't been able to so far).

@anirudh2290
Member Author

anirudh2290 commented Jun 21, 2019

@pengzhao-intel were you able to reproduce it? Did you make sure you modified the requirements file in sockeye?

@roywei
Member

roywei commented Jun 23, 2019

I was able to reproduce the failure in test_constraints_int.py and test_seq_copy_int.py:
https://github.com/awslabs/sockeye/tree/master/test/integration
@fhieber could you help take a look and convert this to an MXNet unit test?

@pengzhao-intel
Contributor

@anirudh2290 yes, we can reproduce the issue and are working on a fix :)

@ZhennanQin
Contributor

Confirmed that this can be reproduced. Need more time to investigate.

@leleamol
Contributor

@mxnet-label-bot add [Build, Bug]

@ZhennanQin
Contributor

#15298 fixes the sockeye failure on my machine. @anirudh2290 @roywei Please give it a try.

@ptrendx
Member

ptrendx commented Jun 25, 2019

Hi @ZhennanQin, could you give a short explanation of the issue and the fix? I see two changes in that PR: one related to converting older models and one that looks like a no-op (moving from arrays to arrays_with_in_out). Which one fixes the segfault?

@ZhennanQin
Contributor

@ptrendx They work together to fix the segfault. Moving from arrays to arrays_with_in_out is an MKLDNN backend-specific fix that addresses #15281; sockeye also fails without it. But sockeye has a further problem, which is related to the legacy op state. I've explained the reason here: #15298
