This repository has been archived by the owner on Jan 15, 2024. It is now read-only.

Fix BERT fp16 bugs, add test #1270

Merged · 5 commits merged into dmlc:master on Sep 4, 2020

Conversation

@MoisesHer (Contributor) commented Jul 18, 2020

Description

This fixes an fp16 bug introduced in a previous PR: #1264. The bug was introduced in that PR's last commit (a merge with the numpy upstream) while resolving a conflict; apologies. It affects only the fp16 case, where it returns NaNs. In addition, the model was not re-hybridized after being cast.

- [x] Both issues are solved in this PR.
- [x] A test was added comparing FP32 vs. FP16 results in BERT inference.
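As background on why an fp16 bug surfaces as NaNs: float16 overflows at a magnitude of about 65504, so a constant that is safe in fp32 can become inf in fp16, and downstream arithmetic on infinities then yields NaN. A minimal NumPy sketch of this failure mode (illustrative only; not necessarily the exact path fixed in this PR):

```python
import numpy as np

# A large negative constant (e.g. for masking) is representable in fp32...
mask_value = np.float32(-1e18)
assert np.isfinite(mask_value)

# ...but overflows float16 (max finite magnitude ~65504) to -inf,
scores16 = np.float16(mask_value)
assert np.isneginf(scores16)

# and arithmetic on infinities, such as inf - inf, then produces NaN.
with np.errstate(invalid='ignore'):
    nan_result = scores16 - scores16
assert np.isnan(nan_result)
```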

cc @dmlc/gluon-nlp-team @sxjscience

@codecov (bot) commented Jul 18, 2020

Codecov Report

Merging #1270 into master will not change coverage.
The diff coverage is n/a.


@@           Coverage Diff           @@
##           master    #1270   +/-   ##
=======================================
  Coverage   81.75%   81.75%           
=======================================
  Files          52       52           
  Lines        6862     6862           
=======================================
  Hits         5610     5610           
  Misses       1252     1252           
Impacted Files                                     Coverage Δ
src/gluonnlp/data/tokenizers/sentencepiece.py      75.44% <0.00%> (-0.60%) ⬇️
src/gluonnlp/data/tokenizers/yttm.py               82.75% <0.00%> (+0.86%) ⬆️

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e0b293e...9ce2552. Read the comment docs.

@szha (Member) commented Jul 18, 2020

Could you add a test for the expected data type?
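Such an expected-dtype check is typically a one-line assertion on the forward output. A minimal NumPy sketch (the zero array is a hypothetical stand-in for the fp16-cast model's output):

```python
import numpy as np

# Stand-in for the output of the fp16-cast model's forward pass.
out = np.zeros((3, 32, 64), dtype=np.float16)

# The expected-data-type test reduces to asserting the output dtype.
assert out.dtype == np.float16
```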

@sxjscience (Member) commented

@MoisesHer Would you help add a test in https://github.com/dmlc/gluon-nlp/blob/numpy/tests/test_models_bert.py? You can do a forward test similar to:

```python
@pytest.mark.remote_required
@pytest.mark.parametrize('model_name', list_pretrained_roberta())
def test_roberta(model_name):
    # test from pretrained
    assert len(list_pretrained_roberta()) > 0
    with tempfile.TemporaryDirectory() as root:
        cfg, tokenizer, params_path = \
            get_pretrained_roberta(model_name, root=root)
        assert cfg.MODEL.vocab_size == len(tokenizer.vocab)
        roberta_model = RobertaModel.from_cfg(cfg)
        roberta_model.load_parameters(params_path)
        # test forward
        batch_size = 3
        seq_length = 32
        vocab_size = len(tokenizer.vocab)
        input_ids = mx.np.array(
            np.random.randint(2, vocab_size, (batch_size, seq_length)),
            dtype=np.int32)
        valid_length = mx.np.array(
            np.random.randint(seq_length // 2, seq_length, (batch_size,)),
            dtype=np.int32)
        x = roberta_model(input_ids, valid_length)
```
and compare the final result.
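Comparing the final fp32 and fp16 results needs far looser tolerances than the defaults, since half precision carries only about three decimal digits. A NumPy sketch of just the comparison step (the random array is a stand-in for the model output; shapes and tolerances are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for the fp32 model output: (batch, seq_length, units).
out_fp32 = rng.standard_normal((3, 32, 64)).astype(np.float32)
# Stand-in for the fp16 model output: the same values at half precision.
out_fp16 = out_fp32.astype(np.float16)

# float16 rounding error is ~2**-11 relative, so compare with loose tolerances,
# casting back to fp32 so the comparison itself runs in full precision.
np.testing.assert_allclose(out_fp16.astype(np.float32), out_fp32,
                           rtol=1e-2, atol=1e-2)
```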

@MoisesHer MoisesHer changed the title from "Fix fp16 bug: not passing dtype to TransformerEncoderLayer" to "Fix BERT fp16 bugs, add test" Jul 21, 2020
@sxjscience (Member) commented

Need to wait for the GPU CI.

@szha szha changed the base branch from numpy to master August 13, 2020 02:29
@sxjscience (Member) commented Sep 1, 2020

@MoisesHer The GPU CI should be functional now. Would you try to create a new PR to add FP16 functionality?
Sorry for the confusion, we may still need to fix the GPU CI test.

@sxjscience (Member) commented

You can refer to `def test_bert_small_cfg(compute_layout, ctx):`. Here, we just add a `ctx` argument so it becomes a fixture.
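In pytest, such a `ctx` fixture would typically live in `conftest.py`. A minimal sketch, assuming a hypothetical `gpu_available()` probe (the repository's actual fixture may differ):

```python
import pytest

def gpu_available():
    # Hypothetical stand-in for a real GPU probe
    # (e.g. attempting a small allocation on gpu(0)).
    return False

@pytest.fixture(params=['cpu', 'gpu'])
def ctx(request):
    # Any test that takes `ctx` as an argument runs once per param;
    # the GPU run is skipped automatically when no device is present.
    if request.param == 'gpu' and not gpu_available():
        pytest.skip('GPU context not available')
    return request.param
```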

@MoisesHer MoisesHer requested a review from a team as a code owner September 3, 2020 23:11
@sxjscience (Member) left a comment

LGTM

@sxjscience (Member) commented

CC @zheyuye @szha @hymzoque I'll merge this in first.

@sxjscience sxjscience merged commit 9711e5e into dmlc:master Sep 4, 2020
zheyuye added a commit to zheyuye/gluon-nlp that referenced this pull request Oct 20, 2020
* Fix BERT fp16 bugs, add test (dmlc#1270)

* Fix fp16 bug: not passing dtype to TransformerEncoderLayer

* Re-hybridize after casting & add BERT test

* Skip fp16 test if CPU ctx

* remove debugging messages

Co-authored-by: root <[email protected]>

* [Fix][SageMaker] Make sure that the installation works in SageMaker (dmlc#1348)

* Fasttext to 0.9.1

* Update setup.py

* [CI] Add Codecov and Test Logs (dmlc#1349)

* [Fix] Some minor fixes for AMLC Tutorial (dmlc#1355)

* update

update

update

update

* Update test_utils_misc.py

* update

* update

* Update test_layers.py

* Update misc.py

* Update mobilebert.py

* add in_units and in_channels

* Update __init__.py

* Update mobilebert.py

* Update README.md

* fix test case

* fix

* Update test_utils_misc.py

* fix bug

* [FEATURE] gpt2 generation scripts (dmlc#1354)

* remove prev_len in hybrid_forward parameters

* update

* sample

* update

* add gpt2_1558M

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

Co-authored-by: Hu <[email protected]>

* [Fix] Minor fix for AMLC Tutorial - QA (dmlc#1359)

* update

Update README.md

update

try to use dataclasses

* Update squad_utils.py

* Update preprocessing.py

* Update squad_utils.py

* Update run_squad.py

* [Log Message Improvement] Improve nlp process (dmlc#1362)

* Update learn_subword.py

* Update learn_subword.py

* Update learn_subword.py

* Update apply_subword.py

* Set default ctx in conftest (dmlc#1363)

* Fix the correctness of the Horovod support on squad (dmlc#1353)

* revise squad

* tiny fix

* fix total_norm logging

* shuffle before and after splitting

* make pre_shuffle_seed fixed

* fix flags

* remove do_pre_shuffle

* remove inside_split_shuffle

Co-authored-by: Ubuntu <[email protected]>

* [CI][BUGFIX] Custom Step for Uploading Code Coverage in Pull Request Event (dmlc#1364)

* [FEATURE]Generation script improvement (dmlc#1365)

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

Co-authored-by: Hu <[email protected]>

* [Website][CI] Build Website without Warnings + Add Workflow for Building Website  (dmlc#1327)

* [Website] Documentation warnings Fixed + Create Makefile

[Website] Documentation bug fix

[Website] Bug fix

[Website] Build without model_zoo

[Website] Fix notebook

* [Website][CI] Add workflow for building website

* [CI] Add more dependencies

* [CI] Update buildwebsite.yml

[CI] Update buildwebsite.yml

* [CI] Update buildwebsite.yml

* [CI] Update buildwebsite.yml

* [CI] Update buildwebsite.yml

* [CI] Update buildwebsite.yml

* [CI] Update buildwebsite.yml

* [CI] Update buildwebsite.yml

* [CI] Update buildwebsite.yml

* [CI] Update buildwebsite.yml

[CI] Update buildwebsite.yml

[CI] Update buildwebsite.yml

[CI] Update buildwebsite.yml

[CI] Update buildwebsite.yml

[CI] Update buildwebsite.yml

[CI] Update buildwebsite.yml

[CI] Update buildwebsite.yml

[CI] Update buildwebsite.yml

* [CI] Update buildwebsite.yml

* [CI] Update buildwebsite.yml

* [Website] Add more dependencies

* [Website][CI] Add Compile notebook step + Preview website

* [CI] Add shell script for compiling notebooks

* [CI] Add permission for shell script

* [Website] Update

* [Website] Update

* [CI] Add uploading build artifacts

* [CI] Update

* [CI] Update Indentation

* [CI] Remove some dependencies

* [BUGFIX] Fix URL encoding (dmlc#1370)

* [FEATURE]Update readme of nmt (dmlc#1373)

* update

* update

* update

* update

* update

* update

* update

* update

Co-authored-by: Hu <[email protected]>

* [CI] Improve website building workflow (dmlc#1377)

* BERT pretraining (dmlc#1376)

* bert

* update

* address comments

* update

* [Fix][Docker] Fix the docker image + Fix pretrain_corpus document. (dmlc#1378)

* update

* Update ubuntu18.04-devel-gpu.Dockerfile

* fix the docker image

* Update README.md

* Update ubuntu18.04-devel-gpu.Dockerfile

* Update README.md

* fix readme

* Add CPU DockerFile

* update

* update

* Update ubuntu18.04-devel-gpu.Dockerfile

* update

* prepare to add TVM to docker

* try to update

* Update ubuntu18.04-devel-gpu.Dockerfile

* Update ubuntu18.04-devel-gpu.Dockerfile

* Update install_openmpi.sh

* update

* Create install_llvm.sh

* Update ubuntu18.04-base-gpu.Dockerfile

* Update ubuntu18.04-base-gpu.Dockerfile

* Update run_squad2_albert_base.sh

* Update prepare_squad.py

* Update prepare_squad.py

* Update prepare_squad.py

* fix

* Update README.md

* update

* update

* Update README.md

* Update README.md

* Update ubuntu18.04-devel-gpu.Dockerfile

* update

* Update README.md

* fix

* Update ubuntu18.04-base-cpu.Dockerfile

* update

* add tvm to lazy import

* update

* Update README.md

* update

* Update README.md

* Update run_squad2_albert_base.sh

* update

* update

* update

* update

* update

* Update README.md

* Update install_ubuntu18.04_core.sh

* update

* update

* update

* fix

* Update README.md

* Update run_batch_squad.sh

* update

* Update run_batch_squad.sh

* Update run_batch_squad.sh

* update

* Update README.md

* fix

* Update gluon_nlp_job.sh

* update

* Update README.md

* Update README.md

* Update README.md

* update

* Update README.md

* update

* Update install_python_packages.sh

* Update install_llvm.sh

* Update install_python_packages.sh

* Update install_llvm.sh

* update

* Update install_ubuntu18.04_core.sh

* fix

* Update submit-job.py

* Update submit-job.py

* Update README.md

* Update README.md

* Update prepare_gutenberg.py

* Delete gluon_nlp_cpu_job.sh

* Update prepare_gutenberg.py

* Update prepare_gutenberg.py

* Update prepare_gutenberg.py

* Update conf.py

* update

* Update generate_commands.py

* fix readme

* use os.link for hard link

* Update README.md

* Update README.md

* Update gluon_nlp_job.sh

* Update __init__.py

* Update benchmark_utils.py

* try to use multi-stage build

* Update benchmark_utils.py

* multi-stage build

* Update README.md

* Update README.md

* update

* Update submit-job.py

* fix documentation

* fix

* update

* Update test.sh

* Update test.sh

* Update test.sh

* Update test.sh

* Update README.md

* Update test.sh

* fix

* Update README.md

* Update gluon_nlp_job.sh

* [Website] Add AMLC Tutorial to Website (dmlc#1379)

* [Website] Add AMLC Tutorial

* [Website] Add tsv encoding

* [Website] Add model zoo

* [Website] Update Makefile

* [Website] Update Makefile

* [Website] Update Makefile

* [Website] Update compile_notebooks.sh

* [Website] Update Makefile

* [Website] Add title to generation

* [Website] Update workflow

* update

* [Website] Update model_zoo.rst

* [Website] Update model_zoo.rst

* [BUGFIX] Fix Codecov (dmlc#1391)

* Update coveragerc

* Update coveragerc

* Update coveragerc

* Update workflow

* Update workflow

* update

* update

Co-authored-by: MoisesHer <[email protected]>
Co-authored-by: root <[email protected]>
Co-authored-by: Xingjian Shi <[email protected]>
Co-authored-by: barry-jin <[email protected]>
Co-authored-by: ht <[email protected]>
Co-authored-by: Hu <[email protected]>
Co-authored-by: Leonard Lausen <[email protected]>
Co-authored-by: Ubuntu <[email protected]>
Co-authored-by: Ziyue Huang <[email protected]>