Skip to content

SMP2020微博情绪分类技术评测

embedding edited this page May 10, 2021 · 12 revisions

SMP2020-EWECT由两个任务构成,分别是常规场景任务和新冠病毒场景任务。这里我们主要以SMP2020-EWECT常规场景任务为例说明stacking集成的使用方法。我们随机选择一些预训练模型并对其微调。我们使用K折交叉验证,用微调的模型对训练数据集进行预测,之后生成特征(预测概率)以进行stacking集成。--train_features_path 指定生成的特征路径。可以在预训练模型仓库章节中找到下面使用的预训练模型。

CUDA_VISIBLE_DEVICES=0,1 python3 run_classifier_cv.py --pretrained_model_path models/review_roberta_large_model.bin \
                                                      --vocab_path models/google_zh_vocab.txt \
                                                      --output_model_path models/ewect_usual_classifier_model_0.bin \
                                                      --config_path models/bert/large_config.json \
                                                      --train_path datasets/smp2020-ewect/usual/train.tsv \
                                                      --train_features_path datasets/smp2020-ewect/usual/train_features_0.npy \
                                                      --folds_num 5 --epochs_num 3 --batch_size 64 --seed 17 \
                                                      --embedding word_pos_seg --encoder transformer --mask fully_visible

CUDA_VISIBLE_DEVICES=0,1,2,3 python3 run_classifier_cv.py --pretrained_model_path models/mixed_corpus_bert_large_model.bin \
                                                          --vocab_path models/google_zh_vocab.txt \
                                                          --output_model_path models/ewect_usual_classifier_model_1.bin \
                                                          --config_path models/bert/large_config.json \
                                                          --train_path datasets/smp2020-ewect/usual/train.tsv \
                                                          --train_features_path datasets/smp2020-ewect/usual/train_features_1.npy \
                                                          --folds_num 5 --epochs_num 3 --batch_size 64 --seq_length 160 \
                                                          --embedding word_pos_seg --encoder transformer --mask fully_visible

CUDA_VISIBLE_DEVICES=0 python3 run_classifier_cv.py --pretrained_model_path models/cluecorpussmall_gpt2_seq1024_model.bin-250000 \
                                                    --vocab_path models/google_zh_vocab.txt \
                                                    --output_model_path models/ewect_usual_classifier_model_2.bin \
                                                    --config_path models/gpt2/config.json \
                                                    --train_path datasets/smp2020-ewect/usual/train.tsv \
                                                    --train_features_path datasets/smp2020-ewect/usual/train_features_2.npy \
                                                    --folds_num 5 --epochs_num 3 --batch_size 32 --seq_length 100 \
                                                    --embedding word_pos --remove_embedding_layernorm \
                                                    --encoder transformer --mask causal --layernorm_position pre --pooling mean

CUDA_VISIBLE_DEVICES=0 python3 run_classifier_cv.py --pretrained_model_path models/cluecorpussmall_elmo_model.bin-500000 \
                                                    --vocab_path models/google_zh_vocab.txt \
                                                    --config_path models/birnn_config.json \
                                                    --output_model_path models/ewect_usual_classifier_model_3.bin \
                                                    --train_path datasets/smp2020-ewect/usual/train.tsv \
                                                    --train_features_path datasets/smp2020-ewect/usual/train_features_3.npy \
                                                    --epochs_num 3 --batch_size 64 --learning_rate 5e-4 --folds_num 5 \
                                                    --embedding word --remove_embedding_layernorm --encoder bilstm --pooling max

CUDA_VISIBLE_DEVICES=0 python3 run_classifier_cv.py --pretrained_model_path models/cluecorpussmall_gatedcnn_lm_model.bin-500000 \
                                                    --vocab_path models/google_zh_vocab.txt \
                                                    --config_path models/gatedcnn_9_config.json \
                                                    --output_model_path models/ewect_usual_classifier_model_4.bin \
                                                    --train_path datasets/smp2020-ewect/usual/train.tsv \
                                                    --train_features_path datasets/smp2020-ewect/usual/train_features_4.npy \
                                                    --epochs_num 3 --batch_size 64 --learning_rate 5e-5 --folds_num 5 \
                                                    --embedding word --remove_embedding_layernorm --encoder gatedcnn --pooling mean

--output_model_path 指定了微调后的分类器路径。K折交叉验证会产生K个分类器。我们使用这些分类器在验证集上进行推理,得到验证集的特征(--test_features_path):

CUDA_VISIBLE_DEVICES=0 python3 inference/run_classifier_infer_cv.py --load_model_path models/ewect_usual_classifier_model_0.bin \
                                                                    --vocab_path models/google_zh_vocab.txt \
                                                                    --config_path models/bert/large_config.json \
                                                                    --test_path datasets/smp2020-ewect/usual/dev.tsv \
                                                                    --test_features_path datasets/smp2020-ewect/usual/test_features_0.npy \
                                                                    --folds_num 5 --labels_num 6 \
                                                                    --embedding word_pos_seg --encoder transformer --mask fully_visible

CUDA_VISIBLE_DEVICES=0 python3 inference/run_classifier_infer_cv.py --load_model_path models/ewect_usual_classifier_model_1.bin \
                                                                    --vocab_path models/google_zh_vocab.txt \
                                                                    --config_path models/bert/large_config.json \
                                                                    --test_path datasets/smp2020-ewect/usual/dev.tsv \
                                                                    --test_features_path datasets/smp2020-ewect/usual/test_features_1.npy \
                                                                    --folds_num 5 --labels_num 6 --seq_length 160 \
                                                                    --embedding word_pos_seg --encoder transformer --mask fully_visible

CUDA_VISIBLE_DEVICES=0 python3 inference/run_classifier_infer_cv.py --load_model_path models/ewect_usual_classifier_model_2.bin \
                                                                    --vocab_path models/google_zh_vocab.txt \
                                                                    --config_path models/gpt2/config.json \
                                                                    --test_path datasets/smp2020-ewect/usual/dev.tsv \
                                                                    --test_features_path datasets/smp2020-ewect/usual/test_features_2.npy \
                                                                    --folds_num 5 --labels_num 6 --seq_length 100 \
                                                                    --embedding word_pos --remove_embedding_layernorm \
                                                                    --encoder transformer --mask causal --layernorm_position pre --pooling mean

CUDA_VISIBLE_DEVICES=0 python3 inference/run_classifier_infer_cv.py --load_model_path models/ewect_usual_classifier_model_3.bin \
                                                                    --vocab_path models/google_zh_vocab.txt \
                                                                    --config_path models/birnn_config.json \
                                                                    --test_path datasets/smp2020-ewect/usual/dev.tsv \
                                                                    --test_features_path datasets/smp2020-ewect/usual/test_features_3.npy \
                                                                     --folds_num 5 --labels_num 6 \
                                                                    --embedding word --remove_embedding_layernorm --encoder bilstm --pooling max

CUDA_VISIBLE_DEVICES=0 python3 inference/run_classifier_infer_cv.py --load_model_path models/ewect_usual_classifier_model_4.bin \
                                                                    --vocab_path models/google_zh_vocab.txt \
                                                                    --config_path models/gatedcnn_9_config.json \
                                                                    --test_path datasets/smp2020-ewect/usual/dev.tsv \
                                                                    --test_features_path datasets/smp2020-ewect/usual/test_features_4.npy \
                                                                    --folds_num 5 --labels_num 6 \
                                                                    --embedding word --remove_embedding_layernorm --encoder gatedcnn --pooling mean


然后,我们使用LightGBM来处理提取的特征。首先使用贝叶斯优化来寻找合适的超参数。评估方式为在训练集上进行交叉验证:

python3 scripts/run_lgb_cv_bayesopt.py --train_path datasets/smp2020-ewect/usual/train.tsv \
                                       --train_features_path datasets/smp2020-ewect/usual/ \
                                       --models_num 5 --folds_num 5 --labels_num 6 --epochs_num 100

通常使用LightGBM对多个模型(--models_num)集成相对于单模型有更好的表现。

我们使用搜索好的超参数对LightGBM进行训练和验证:

python3 scripts/run_lgb.py --train_path datasets/smp2020-ewect/usual/train.tsv \
                           --test_path datasets/smp2020-ewect/usual/dev.tsv \
                           --train_features_path datasets/smp2020-ewect/usual/ \
                           --test_features_path datasets/smp2020-ewect/usual/ \
                           --models_num 5 --labels_num 6

用户可以在 scripts/run_lgb.py 中改变超参数。--train_path--test_path 给出训练集和验证集样本的标签信息; --train_features_path--test_features_path 给出训练集和验证集样本的特征。

这是使用stacking集成的简单展示,以上操作可以为我们带来非常有竞争力的结果。当我们继续增加使用不同预处理,预训练和微调策略的模型时,将会获得进一步的提升。可以在比赛主页上找到更多详细信息。

Clone this wiki locally