Experiments
GPU: Tesla P40
CPU: Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz
We use BERT to measure the speed of distributed training. Google's BERT was trained for 1 million steps, with 128,000 tokens per step. Reproducing that run with UER-py takes around 18 days on 3 GPU machines (24 GPUs in total). In the table below, the row with 0 GPUs is CPU-only training.
#(machine) | #(GPU)/machine | tokens/second |
---|---|---|
1 | 0 | 276 |
1 | 1 | 7050 |
1 | 2 | 13071 |
1 | 4 | 24695 |
1 | 8 | 44300 |
3 | 8 | 84386 |
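The 18-day figure can be checked from the table with simple arithmetic. The sketch below uses only the numbers quoted above; the function names are hypothetical, and the scaling efficiency is an illustrative measure (throughput relative to perfect linear scaling of the single-GPU result) that the original text does not report.

```python
# Figures taken from the text and the throughput table above.
TOTAL_STEPS = 1_000_000
TOKENS_PER_STEP = 128_000
SECONDS_PER_DAY = 86_400

# (machines, GPUs per machine) -> tokens/second
THROUGHPUT = {
    (1, 1): 7050,
    (1, 2): 13071,
    (1, 4): 24695,
    (1, 8): 44300,
    (3, 8): 84386,
}


def training_days(tokens_per_second):
    """Days needed to process TOTAL_STEPS * TOKENS_PER_STEP tokens."""
    total_tokens = TOTAL_STEPS * TOKENS_PER_STEP
    return total_tokens / tokens_per_second / SECONDS_PER_DAY


def scaling_efficiency(machines, gpus_per_machine):
    """Measured throughput divided by ideal linear scaling of one GPU."""
    n_gpus = machines * gpus_per_machine
    ideal = n_gpus * THROUGHPUT[(1, 1)]
    return THROUGHPUT[(machines, gpus_per_machine)] / ideal


if __name__ == "__main__":
    for (machines, gpus), tps in THROUGHPUT.items():
        print(
            f"{machines} machine(s) x {gpus} GPU(s): "
            f"{training_days(tps):.1f} days, "
            f"efficiency {scaling_efficiency(machines, gpus):.2f}"
        )
```

With 24 GPUs the estimate comes out between 17 and 18 days, which matches the "around 18 days" quoted above.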
We qualitatively evaluate pre-trained models by finding the nearest neighbours of target words.
Evaluation of context-independent word embedding (the targets in this group of tables are single characters):
Target word: 苹 | Similarity | Target word: 吃 | Similarity | Target word: 水 | Similarity |
---|---|---|---|---|---|
蘋 | 0.762 | 喝 | 0.539 | 河 | 0.286 |
apple | 0.447 | 食 | 0.475 | 海 | 0.278 |
iphone | 0.400 | 啃 | 0.340 | water | 0.276 |
柠 | 0.347 | 煮 | 0.324 | 油 | 0.266 |
ios | 0.317 | 嚐 | 0.322 | 雨 | 0.259 |
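A nearest-neighbour lookup of this kind can be sketched as follows. This is a minimal illustration, not UER-py's own script: it assumes the scores are cosine similarities (the text does not name the measure), and the three-dimensional vectors are made up for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_neighbours(target, embeddings, k=5):
    """Return the k entries most similar to `target`, best first."""
    target_vec = embeddings[target]
    scored = [
        (word, cosine(target_vec, vec))
        for word, vec in embeddings.items()
        if word != target
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy embedding table with invented values, for illustration only.
toy_embeddings = {
    "苹": [0.9, 0.1, 0.0],
    "蘋": [0.8, 0.2, 0.1],
    "apple": [0.7, 0.1, 0.4],
    "水": [0.0, 0.9, 0.3],
}

if __name__ == "__main__":
    for word, score in nearest_neighbours("苹", toy_embeddings, k=3):
        print(f"{word}\t{score:.3f}")
```

For context-independent embeddings the vectors come straight from the embedding table; for the context-dependent case the target vector would instead be the encoder's hidden state at the target position.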
Evaluation of context-dependent word embedding:
Target sentence: 其冲积而形成小平原沙土层厚而肥沃,盛产苹果、大樱桃、梨和葡萄。
Target word: 苹 | Similarity |
---|---|
蘋 | 0.822 |
莓 | 0.714 |
芒 | 0.706 |
柠 | 0.704 |
樱 | 0.696 |
Target sentence: 苹果削减了台式Mac产品线上的众多产品。
Target word: 苹 | Similarity |
---|---|
蘋 | 0.892 |
apple | 0.788 |
iphone | 0.743 |
ios | 0.720 |
ipad | 0.706 |
Evaluation of context-independent word embedding (the targets in this group of tables are words):
Target word: 苹果 | Similarity | Target word: 腾讯 | Similarity | Target word: 吉利 | Similarity |
---|---|---|---|---|---|
苹果公司 | 0.419 | 新浪 | 0.357 | 沃尔沃 | 0.277 |
apple | 0.415 | 网易 | 0.356 | 伊利 | 0.243 |
苹果电脑 | 0.349 | 搜狐 | 0.356 | 长荣 | 0.235 |
微软 | 0.320 | 百度 | 0.341 | 天安 | 0.224 |
mac | 0.298 | 乐视 | 0.332 | 哈达 | 0.220 |
Evaluation of context-dependent word embedding:
Target sentence: 其冲积而形成小平原沙土层厚而肥沃,盛产苹果、大樱桃、梨和葡萄。
Target word: 苹果 | Similarity |
---|---|
柠檬 | 0.734 |
草莓 | 0.725 |
荔枝 | 0.719 |
树林 | 0.697 |
牡丹 | 0.686 |
Target sentence: 苹果削减了台式Mac产品线上的众多产品
Target word: 苹果 | Similarity |
---|---|
苹果公司 | 0.836 |
apple | 0.829 |
福特 | 0.796 |
微软 | 0.777 |
苹果电脑 | 0.773 |
Target sentence: 讨吉利是通过做民间习俗的吉祥事,或重现过去曾经得到好结果的行为,以求得好兆头。
Target word: 吉利 | Similarity |
---|---|
仁德 | 0.749 |
光彩 | 0.743 |
愉快 | 0.736 |
永元 | 0.736 |
仁和 | 0.732 |
Target sentence: 2010年6月2日福特汽车公司宣布出售旗下高端汽车沃尔沃予中国浙江省的吉利汽车,同时将于2010年第四季停止旗下中阶房车品牌所有业务
Target word: 吉利 | Similarity |
---|---|
沃尔沃 | 0.771 |
卡比 | 0.751 |
永利 | 0.745 |
天安 | 0.741 |
仁和 | 0.741 |
Target sentence: 主要演员有扎克·布拉夫、萨拉·朝克、唐纳德·费森、尼尔·弗林、肯·詹金斯、约翰·麦吉利、朱迪·雷耶斯、迈克尔·莫斯利等。
Target word: 吉利 | Similarity |
---|---|
玛利 | 0.791 |
米格 | 0.768 |
韦利 | 0.767 |
马力 | 0.764 |
安吉 | 0.761 |
We evaluate UER-py on a range of Chinese datasets. Douban book review, ChnSentiCorp, Shopping, and Tencentnews are small-scale, sentence-level sentiment classification datasets. MSRA-NER is a sequence labeling dataset. These datasets are included in this project. Dianping, JDfull, JDbinary, Ifeng, and Chinanews are large-scale classification datasets. They were collected by the glyph project and can be downloaded from its GitHub repository. These five datasets do not come with a validation set, so we hold out 10% of the training instances for validation.
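The 10% hold-out can be sketched as below. The text does not say how the instances were selected, so the seeded random shuffle is an assumption, and `split_train_validation` is a hypothetical helper rather than part of UER-py.

```python
import random


def split_train_validation(instances, validation_ratio=0.1, seed=7):
    """Shuffle a copy of `instances` and hold out a fraction for validation.

    Returns (train, validation). The random selection and the seed are
    assumptions; the source only states that 10% of the training
    instances are used for validation.
    """
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)
    n_validation = round(len(shuffled) * validation_ratio)
    return shuffled[n_validation:], shuffled[:n_validation]


if __name__ == "__main__":
    train, validation = split_train_validation(range(1000))
    print(len(train), len(validation))
```

Fixing the seed keeps the split reproducible, so scores from different runs are measured against the same validation instances.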
Most pre-training approaches consist of 2 stages: pre-training on a general-domain corpus and fine-tuning on a downstream dataset. We recommend a 3-stage mode: 1) pre-training on a general-domain corpus; 2) pre-training on the downstream dataset; 3) fine-tuning on the downstream dataset. Stage 2 lets the model become familiar with the distribution of the downstream task. It is sometimes known as semi-supervised fine-tuning.
Hyper-parameter settings are as follows:
- Stage 1: We train with a batch size of 256 sequences, each containing 256 tokens. We load Google's pre-trained models and continue training for 500,000 steps. The learning rate is 2e-5; the other optimizer settings are identical to those of Google BERT. The BERT tokenizer is used.
- Stage 2: We train with a batch size of 256 sequences. The sequence length is 128 for classification datasets and 256 for sequence labeling datasets. We continue training from Google's pre-trained model for 20,000 steps. Optimizer settings and tokenizer are identical to stage 1.
- Stage 3: For classification datasets, the batch size is 64 and we train for 3 epochs. For sequence labeling datasets, the batch size is 32 and we train for 5 epochs. Optimizer settings and tokenizer are identical to stage 1.
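For reference, the settings listed above can be collected into one structure. The sketch below only restates the numbers from the list; the dictionary keys are hypothetical names chosen for illustration and are not UER-py command-line options.

```python
# Hyper-parameters restated from the list above. Key names are
# illustrative only; they are not UER-py flags.
LEARNING_RATE = 2e-5  # shared by all stages ("identical to stage 1")

STAGES = {
    "stage1_general": {
        "batch_size": 256, "seq_length": 256, "steps": 500_000,
    },
    "stage2_classification": {
        "batch_size": 256, "seq_length": 128, "steps": 20_000,
    },
    "stage2_sequence_labeling": {
        "batch_size": 256, "seq_length": 256, "steps": 20_000,
    },
    "stage3_classification": {"batch_size": 64, "epochs": 3},
    "stage3_sequence_labeling": {"batch_size": 32, "epochs": 5},
}


def tokens_processed(stage):
    """Upper bound on tokens seen in a step-based (pre-training) stage."""
    cfg = STAGES[stage]
    return cfg["batch_size"] * cfg["seq_length"] * cfg["steps"]


if __name__ == "__main__":
    for name, cfg in STAGES.items():
        if "steps" in cfg:
            print(f"{name}: up to {tokens_processed(name):,} tokens")
        else:
            print(f"{name}: {cfg['epochs']} epochs, batch {cfg['batch_size']}")
```

The token counts are upper bounds because they assume every sequence is filled to its maximum length.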
We provide models pre-trained (with the BERT target) on different downstream datasets: book_review_model.bin, chnsenticorp_model.bin, shopping_model.bin, and msra_model.bin. The Tencentnews dataset and its pre-trained model will be made publicly available after data desensitization.
Model/Dataset | Douban book review | ChnSentiCorp | Shopping | MSRA-NER | Tencentnews review |
---|---|---|---|---|---|
BERT | 87.5 | 94.3 | 96.3 | 93.0/92.4/92.7 | 84.2 |
BERT+semi_BertTarget | 88.1 | 95.6 | 97.0 | 94.3/92.6/93.4 | 85.1 |
BERT+semi_MlmTarget | 87.9 | 95.5 | 97.1 | | 85.1 |
Pre-training also matters for other encoders and targets. We pre-train a 2-layer LSTM on a 1.9G review corpus with a language model target. The embedding size and hidden size are both 512. The model is much more efficient than BERT in both the pre-training and fine-tuning stages. Pre-training brings significant improvements (2.1 to 6.4 points) and gives results competitive with BERT: the pre-trained LSTM is within 1 point of BERT on all three datasets.
Model/Dataset | Douban book review | ChnSentiCorp | Shopping |
---|---|---|---|
BERT | 87.5 | 94.3 | 96.3 |
LSTM | 80.2 | 88.3 | 94.4 |
LSTM+pre-training | 86.6(+6.4) | 94.5(+6.2) | 96.5(+2.1) |
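The gains in parentheses, and the remaining gap to BERT, follow directly from the table. A small check, using only the scores above (the helper name `differences` is hypothetical):

```python
# Scores copied from the table above.
BERT = {"Douban book review": 87.5, "ChnSentiCorp": 94.3, "Shopping": 96.3}
LSTM = {"Douban book review": 80.2, "ChnSentiCorp": 88.3, "Shopping": 94.4}
LSTM_PRETRAINED = {
    "Douban book review": 86.6, "ChnSentiCorp": 94.5, "Shopping": 96.5,
}


def differences(scores, baseline):
    """Per-dataset difference scores - baseline, rounded to one decimal."""
    return {name: round(scores[name] - baseline[name], 1) for name in scores}


if __name__ == "__main__":
    print("gain from pre-training:", differences(LSTM_PRETRAINED, LSTM))
    print("gap to BERT:", differences(LSTM_PRETRAINED, BERT))
```

The second line shows that the pre-trained LSTM trails BERT only on Douban book review and is marginally ahead on the other two datasets.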
Fine-tuning on large-scale datasets requires tremendous computational resources. For the Ifeng, Chinanews, Dianping, JDbinary, and JDfull datasets, we provide classification models (see Chinese model zoo). These models allow users to reproduce the results without training, and they can also be used to improve other related tasks. More experimental results will come soon.
The Ifeng and Chinanews datasets contain news titles and abstracts. In stage 2, we use the title to predict the abstract.
Model/Dataset | Ifeng | Chinanews | Dianping | JDbinary | JDfull |
---|---|---|---|---|---|
pre-SOTA (Glyph & Glyce) | 85.76 | 91.88 | 78.46 | 91.76 | 54.24 |
BERT | 87.50 | 93.37 | | 92.37 | 54.79 |
BERT+semi+BertTarget | 87.65 | | | | |
We also provide pre-trained models for different corpora, encoders, and targets (see Chinese model zoo). Selecting a suitable pre-trained model benefits performance on downstream tasks.
Model/Dataset | MSRA-NER |
---|---|
Wikizh corpus (Google) | 93.0/92.4/92.7 |
Renminribao corpus | 94.4/94.4/94.4 |
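The three MSRA-NER numbers are consistent with precision/recall/F1, although the text does not label them. Under that assumption, the third value can be recomputed from the first two:

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (same scale as the inputs)."""
    return 2 * precision * recall / (precision + recall)


# Triples copied from the MSRA-NER results in this document, read as
# (precision, recall, F1). That reading is an assumption.
REPORTED = {
    "Wikizh corpus (Google)": (93.0, 92.4, 92.7),
    "BERT+semi_BertTarget": (94.3, 92.6, 93.4),
    "Renminribao corpus": (94.4, 94.4, 94.4),
}

if __name__ == "__main__":
    for name, (p, r, reported_f1) in REPORTED.items():
        print(f"{name}: computed F1 {f1(p, r):.1f}, reported {reported_f1}")
```

All three recomputed values agree with the reported ones to one decimal place, which supports the precision/recall/F1 reading.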