The documentation website for preview: http://gluon-nlp-dev.s3-accelerate.amazonaws.com/PR1376/bert/index.html
```python
for label, pred_label, mask in zip(labels, preds, masks):
    if pred_label.shape != label.shape:
        # pred_label = pred_label.argmax(axis=self.axis)
        pred_label = mx.npx.topk(pred_label.astype('float32', copy=False),
```
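For context, the metric update above reduces per-token scores to predicted labels and counts matches only where the mask is set. Here is a minimal NumPy sketch of that logic, with a plain `argmax` standing in for the `topk` call; the names are illustrative, not the GluonNLP metric API:

```python
import numpy as np

def masked_accuracy(labels, preds, masks):
    # labels: (N,) int class ids; preds: (N, C) scores; masks: (N,) 1 = valid token.
    pred_label = preds.argmax(axis=1)          # stand-in for the top-1 topk call
    correct = (pred_label == labels) & (masks == 1)
    return correct.sum() / max(masks.sum(), 1)

labels = np.array([0, 2, 1, 3])
preds = np.eye(4)[[0, 2, 0, 3]]                # rows peak at classes 0, 2, 0, 3
masks = np.array([1, 1, 1, 0])                 # last position is padding
print(masked_accuracy(labels, preds, masks))   # 2 correct out of 3 valid tokens
```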
Have you run into bugs with argmax?
`mx.npx.topk` is about 50× faster than `mx.np.argmax`. The issue (apache/mxnet#11061) still exists in MXNet 2.0.
```python
import time
import mxnet as mx
import numpy as np

tmp = mx.np.random.normal(-1, 1, (64, 300000), ctx=mx.gpu())
for i in range(20):
    if i == 5:
        begin = time.time()
    elif i == 15:
        end = time.time()
    tic = time.time()
    # out = mx.np.argmax(tmp, axis=1)
    out = mx.npx.topk(tmp, k=1, ret_typ='indices', axis=1, dtype=np.int32)
    out.wait_to_read()
    toc = time.time() - tic
    print("used time %f" % toc)
avg = (end - begin) / 10
print("avg time %f" % avg)
```
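The measurement pattern above (skip warm-up iterations, then average a steady-state window) can be factored into a small helper. This is just a sketch, not part of either codebase, shown with a CPU NumPy workload so it runs anywhere:

```python
import time
import numpy as np

def bench(fn, iters=20, warmup=5, measured=10):
    # Time each call, discard the first `warmup` iterations (kernel/cache
    # warm-up), and average the next `measured` ones.
    times = []
    for _ in range(iters):
        tic = time.time()
        fn()
        times.append(time.time() - tic)
    return sum(times[warmup:warmup + measured]) / measured

x = np.random.normal(-1, 1, (64, 30000))
avg = bench(lambda: x.argmax(axis=1))
print("avg time %f" % avg)
```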
Thanks Ziyue, I can reproduce it.
@szha Our argmax is too slow...
I'm actually not aware of this. Argmax is quite easy to accelerate, and we could even call libraries like CUB.
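For reference, a fast argmax is just a parallel reduction over (value, index) pairs. The sketch below mimics in NumPy the pairwise tree reduction that a GPU kernel (e.g. CUB's ArgMax reduction operator) performs in parallel, purely to illustrate why the operation is easy to accelerate:

```python
import numpy as np

def argmax_tree_reduce(x):
    # Pairwise tree reduction over (value, index) pairs: each round halves
    # the candidate set, keeping the larger value (and its original index)
    # from every pair. A GPU kernel does the same rounds in parallel.
    idx = np.arange(len(x))
    vals = np.asarray(x, dtype=float).copy()
    while len(vals) > 1:
        half = (len(vals) + 1) // 2
        left_v, right_v = vals[:half], vals[half:]
        left_i, right_i = idx[:half], idx[half:]
        n = len(right_v)
        take_right = right_v > left_v[:n]
        left_v[:n] = np.where(take_right, right_v, left_v[:n])
        left_i[:n] = np.where(take_right, right_i, left_i[:n])
        vals, idx = left_v, left_i
    return int(idx[0])

x = np.random.default_rng(1).normal(size=1000)
assert argmax_tree_reduce(x) == int(x.argmax())
```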
Codecov Report
```
@@            Coverage Diff             @@
##           master    #1376      +/-   ##
==========================================
- Coverage   71.09%   71.04%    -0.05%
==========================================
  Files         107      107
  Lines       12607    12607
==========================================
- Hits         8963     8957        -6
- Misses       3644     3650        +6
```
Continue to review full report at Codecov.
LGTM overall. But we may later need to consider merging the ELECTRA, BERT (NSP + MLM), and ALBERT (SOP + MLM) implementations.
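Sketching what such a merge could look like: the three pretraining setups share the MLM loss and differ only in the sentence-level objective (NSP for BERT, SOP for ALBERT, neither for ELECTRA, which uses replaced-token detection instead). All names below are hypothetical, not GluonNLP API:

```python
import numpy as np

def cross_entropy(logits, labels):
    # logits: (N, C) scores, labels: (N,) int ids -> mean negative log-likelihood.
    z = logits - logits.max(axis=1, keepdims=True)
    logp = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return float(-logp[np.arange(len(labels)), labels].mean())

class UnifiedPretrainHead:
    """Hypothetical shared head: MLM loss plus a pluggable sentence objective."""
    def __init__(self, sentence_objective=None):
        # 'nsp' (BERT), 'sop' (ALBERT), or None (ELECTRA handles RTD separately).
        self.sentence_objective = sentence_objective

    def losses(self, mlm_logits, mlm_labels, pair_logits=None, pair_labels=None):
        out = {'mlm': cross_entropy(mlm_logits, mlm_labels)}
        if self.sentence_objective is not None:
            out[self.sentence_objective] = cross_entropy(pair_logits, pair_labels)
        return out

head = UnifiedPretrainHead('sop')  # ALBERT-style: MLM + sentence-order prediction
losses = head.losses(np.random.randn(8, 100), np.random.randint(0, 100, 8),
                     np.random.randn(8, 2), np.random.randint(0, 2, 8))
print(sorted(losses))  # ['mlm', 'sop']
```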
Would you try merging upstream/master? It seems something is wrong with the GPU test.
done |
* Fix BERT fp16 bugs, add test (dmlc#1270): fix fp16 bug (dtype not passed to TransformerEncoderLayer); re-hybridize after casting and add BERT test; skip fp16 test on CPU ctx; remove debugging messages. Co-authored-by: root <[email protected]>
* [Fix][SageMaker] Make sure that the installation works in SageMaker (dmlc#1348): Fasttext to 0.9.1; update setup.py
* [CI] Add Codecov and Test Logs (dmlc#1349)
* [Fix] Some minor fixes for AMLC Tutorial (dmlc#1355): updates to test_utils_misc.py, test_layers.py, misc.py, mobilebert.py, __init__.py, README.md; add in_units and in_channels; fix test case; fix bug
* [FEATURE] gpt2 generation scripts (dmlc#1354): remove prev_len in hybrid_forward parameters; add sampling; add gpt2_1558M; assorted updates. Co-authored-by: Hu <[email protected]>
* [Fix] Minor fix for AMLC Tutorial - QA (dmlc#1359): update README.md; try to use dataclasses; update squad_utils.py, preprocessing.py, run_squad.py
* [Log Message Improvement] Improve nlp process (dmlc#1362): update learn_subword.py and apply_subword.py
* Set default ctx in conftest (dmlc#1363)
* Fix the correctness of the Horovod support on squad (dmlc#1353): revise squad; fix total_norm logging; shuffle before and after splitting; make pre_shuffle_seed fixed; fix flags; remove do_pre_shuffle; remove inside_split_shuffle. Co-authored-by: Ubuntu <[email protected]>
* [CI][BUGFIX] Custom Step for Uploading Code Coverage in Pull Request Event (dmlc#1364)
* [FEATURE] Generation script improvement (dmlc#1365): assorted updates. Co-authored-by: Hu <[email protected]>
* [Website][CI] Build Website without Warnings + Add Workflow for Building Website (dmlc#1327): fix documentation warnings; create Makefile; build without model_zoo; fix notebook; add workflow for building website (iterative buildwebsite.yml updates); add compile-notebooks step (compile_notebooks.sh) and website preview; upload build artifacts; dependency updates
* [BUGFIX] Fix URL encoding (dmlc#1370)
* [FEATURE] Update readme of nmt (dmlc#1373). Co-authored-by: Hu <[email protected]>
* [CI] Improve website building workflow (dmlc#1377)
* BERT pretraining (dmlc#1376): bert; address comments
* [Fix][Docker] Fix the docker image + fix pretrain_corpus document (dmlc#1378): fix GPU/CPU Dockerfiles (ubuntu18.04-devel-gpu, ubuntu18.04-base-gpu, ubuntu18.04-base-cpu); add CPU Dockerfile; prepare to add TVM to docker and add tvm to lazy import; add/update install scripts (install_llvm.sh, install_openmpi.sh, install_python_packages.sh, install_ubuntu18.04_core.sh); update squad scripts (run_squad2_albert_base.sh, prepare_squad.py, run_batch_squad.sh); update prepare_gutenberg.py; delete gluon_nlp_cpu_job.sh; update benchmark_utils.py and move to a multi-stage build; use os.link for hard links; README, conf.py, generate_commands.py, gluon_nlp_job.sh, submit-job.py, and test.sh fixes
* [Website] Add AMLC Tutorial to Website (dmlc#1379): add AMLC tutorial, tsv encoding, and model zoo; update Makefile, compile_notebooks.sh, workflow, and model_zoo.rst; add title to generation
* [BUGFIX] Fix Codecov (dmlc#1391): update coveragerc and workflow

Co-authored-by: MoisesHer <[email protected]>, root <[email protected]>, Xingjian Shi <[email protected]>, barry-jin <[email protected]>, ht <[email protected]>, Hu <[email protected]>, Leonard Lausen <[email protected]>, Ubuntu <[email protected]>, Ziyue Huang <[email protected]>
Description
@sxjscience
Checklist
Essentials
Changes
Comments
cc @dmlc/gluon-nlp-team