A PaddlePaddle implementation of CharCNN.
Paper: Character-level Convolutional Networks for Text Classification
| Dataset | Paper error rate (large / small) | Our error rate (large / small) | Abs. improvement (large / small) | Epochs |
|---|---|---|---|---|
| AG’s News | 13.39 / 14.80 | 9.38 / 10.17 | 4.01 / 4.63 | 60 |
| Yahoo! Answers | 28.80 / 29.84 | 27.73 / 28.69 | 1.07 / 1.15 | 15 |
| Amazon Review Full | 40.45 / 40.43 | 38.22 / 38.97 | 2.23 / 1.46 | 7 |
Note: the `large` model has not yet converged, and the accuracy can be improved by continuing training.
Format:
"class idx","sentence or text to be classified"
Samples are separated by newline.
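A minimal sketch of reading this format with Python's standard `csv` module. The helper name `read_samples` is hypothetical, and joining any extra text fields (for instance a title and a description) with a single space is an assumption, not necessarily what this repository does.

```python
import csv
import io


def read_samples(lines):
    """Yield (class_idx, text) pairs from quoted, comma-separated lines.

    Hypothetical helper: the first field is the class index; every remaining
    field is treated as text and joined with a single space (an assumption).
    """
    for row in csv.reader(lines):
        if not row:  # skip blank lines
            continue
        yield int(row[0]), " ".join(row[1:])


# Small in-memory demo so the sketch runs without the real dataset.
demo = io.StringIO(
    '"3","Fears for T N pension after talks"\n'
    '"4","The Race is On","SPACE.com - TORONTO, Canada"\n'
)
samples = list(read_samples(demo))
```

Using `csv.reader` rather than splitting on commas keeps commas inside quoted text intact.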
Example:
"3","Fears for T N pension after talks, Unions representing workers at Turner Newall say they are 'disappointed' after talks with stricken parent firm Federal Mogul."
"4","The Race is On: Second Private Team Sets Launch Date for Human Spaceflight (SPACE.com)","SPACE.com - TORONTO, Canada -- A second\team of rocketeers competing for the #36;10 million Ansari X Prize, a contest for\privately funded suborbital space flight, has officially announced the first\launch date for its manned rocket."
- Python >= 3
- PaddlePaddle >= 2.0.0
- See `requirements.txt`
- Download the datasets into the `/data` folder and split the training set into `train` and `dev` sets:
bash split_data.sh data/ag_news/train.csv
- Start training:
bash train_ag_news.sh
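The contents of `split_data.sh` are not shown here. As a rough illustration of the splitting step, the sketch below shuffles samples and holds out a fraction of them as a dev set. The function name `split_train_dev`, the 10% dev ratio, and the fixed seed are all assumptions, not the script's actual behavior.

```python
import random


def split_train_dev(lines, dev_ratio=0.1, seed=0):
    """Shuffle samples and split them into (train, dev) lists.

    Hypothetical helper; dev_ratio and seed are illustrative defaults.
    """
    shuffled = list(lines)
    random.Random(seed).shuffle(shuffled)
    n_dev = int(len(shuffled) * dev_ratio)
    # The first n_dev shuffled samples form the dev set, the rest the train set.
    return shuffled[n_dev:], shuffled[:n_dev]


# In-memory demo so the sketch runs without the real dataset.
lines = [f'"1","sample {i}"\n' for i in range(100)]
train, dev = split_train_dev(lines)
```

Fixing the seed makes the split reproducible, so later evaluation runs see the same dev set.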
Small model:
Large model:
Place the models in the `output/models_yahoo_answers/` and `output/models_amz_full` directories respectively, then run the eval bash scripts as follows to test the models.
bash eval_ag_news.sh
bash eval_yahoo_answers.sh
bash eval_amz_full.sh
We use nlpaug to augment data; specifically, we substitute words with similar words according to WordNet.
There are two implementations: `SynonymAug` and `GeometricSynonymAug`. `GeometricSynonymAug` is our adapted version of `SynonymAug`, which uses a geometric distribution for substitution as described in the CharCNN paper.
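To illustrate the idea, here is a self-contained sketch of geometric sampling for synonym substitution. It follows the scheme in the CharCNN paper as we understand it: the number of replaced words r follows P[r] ∝ p^r, and the index s of the chosen synonym follows P[s] ∝ q^s. It does not use nlpaug or WordNet. The tiny `SYNONYMS` table, the function names, and the defaults p = q = 0.5 are assumptions for illustration, and this is not the repository's `GeometricSynonymAug` code.

```python
import random

# Toy synonym table standing in for WordNet (hypothetical data).
# Synonyms are assumed to be ordered from most to least common.
SYNONYMS = {
    "quick": ["fast", "speedy", "swift"],
    "lazy": ["indolent", "slothful"],
    "dog": ["hound", "canine"],
}


def sample_geometric(rng, p, upper):
    """Draw k in [0, upper] with P[k] proportional to p**k (truncated)."""
    weights = [p ** k for k in range(upper + 1)]
    return rng.choices(range(upper + 1), weights=weights)[0]


def geometric_synonym_augment(text, p=0.5, q=0.5, seed=None):
    """Replace r words by synonyms; r and each synonym index are geometric."""
    rng = random.Random(seed)
    words = text.split()
    # Positions of words that have at least one synonym in the table.
    candidates = [i for i, w in enumerate(words) if w.lower() in SYNONYMS]
    # Number of words to replace, capped by the number of replaceable words.
    r = sample_geometric(rng, p, len(candidates))
    for i in rng.sample(candidates, r):
        options = SYNONYMS[words[i].lower()]
        # Index of the chosen synonym; small indices are more likely.
        s = sample_geometric(rng, q, len(options) - 1)
        words[i] = options[s]
    return " ".join(words)


original = "The quick brown fox jumps over the lazy dog"
augmented = geometric_synonym_augment(original, seed=0)
```

Because small values of r and s are the most likely, most augmented sentences stay close to the original and prefer the most common synonyms.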
Augmentation demos:
==================== GeometricSynonymAug
The straightaway brown dodger rise complete the lazy domestic dog
The quick john brown fox jumps over the lazy dog
The quick brown slyboots jumps over the lazy dog
The straightaway brownness fox start all over the lazy canis familiaris
The quick brown fox jumps over the indolent canis familiaris
The straightaway brown charles james fox jumps terminated the lazy domestic dog
The quick brown george fox jumps over the lazy domestic dog
The quick brown fox jumps over the indolent dog
The immediate brownness fox jumps ended the slothful dog
The quick brown fox jumps over the lazy canis familiaris
--- 2.56608247756958 seconds ---
==================== SynonymAug
The quick brown fox leap over the lazy frank
The ready brown charles james fox jumps over the lazy dog
The quick brown fox jump over the lazy frank
The speedy brown university fox jumps over the lazy dog
The ready brown fox jump off over the lazy dog
The quick robert brown fox jump over the lazy dog
The quick brown fox jumps concluded the lazy hound
The quick brown university fox jumps over the lazy click
The quick brown fox jumps over the slothful andiron
The quick brown fox parachute over the lazy domestic dog
--- 0.011068582534790039 seconds ---
We tried `GeometricSynonymAug` on AG’s News with the `small` model; the accuracy dropped by about 0.4 points (error rate: 10.59).
@article{zhang2015character,
title={Character-level convolutional networks for text classification},
author={Zhang, Xiang and Zhao, Junbo and LeCun, Yann},
journal={Advances in neural information processing systems},
volume={28},
pages={649--657},
year={2015}
}