SMP2020微博情绪分类技术评测

SMP2020-EWECT 常规任务

我们随机选择了一些预训练模型并对其微调。我们使用K折交叉验证，用微调的模型对训练数据集进行预测，之后生成特征（预测概率）以进行stacking集成：

CUDA_VISIBLE_DEVICES=0,1 python3 run_classifier_cv.py --pretrained_model_path models/reviews_bert_large_model.bin \
                                                      --vocab_path models/google_zh_vocab.txt \
                                                      --output_model_path models/ewect_usual_classifier_model_0.bin \
                                                      --config_path models/bert_large_config.json \
                                                      --train_path datasets/smp2020-ewect/usual/train.tsv \
                                                      --train_features_path datasets/smp2020-ewect/usual/train_features_0.npy \
                                                      --folds_num 5 --epochs_num 3 --batch_size 64 --seed 17 \
                                                      --embedding word_pos_seg --encoder transformer --mask fully_visible

CUDA_VISIBLE_DEVICES=0,1,2,3 python3 run_classifier_cv.py --pretrained_model_path models/mixed_corpus_bert_large_model.bin \
                                                          --vocab_path models/google_zh_vocab.txt \
                                                          --output_model_path models/ewect_usual_classifier_model_1.bin \
                                                          --config_path models/bert_large_config.json \
                                                          --train_path datasets/smp2020-ewect/usual/train.tsv \
                                                          --train_features_path datasets/smp2020-ewect/usual/train_features_1.npy \
                                                          --folds_num 5 --epochs_num 3 --batch_size 64 --seq_length 160 \
                                                          --embedding word_pos_seg --encoder transformer --mask fully_visible

CUDA_VISIBLE_DEVICES=0 python3 run_classifier_cv.py --pretrained_model_path models/mixed_corpus_gpt_base_model.bin \
                                                    --vocab_path models/google_zh_vocab.txt \
                                                    --output_model_path models/ewect_usual_classifier_model_2.bin \
                                                    --config_path models/bert_base_config.json \
                                                    --train_path datasets/smp2020-ewect/usual/train.tsv \
                                                    --train_features_path datasets/smp2020-ewect/usual/train_features_2.npy \
                                                    --folds_num 5 --epochs_num 3 --batch_size 32 --seq_length 100 \
                                                    --embedding word_pos_seg --encoder transformer --mask causal --pooling mean

CUDA_VISIBLE_DEVICES=0 python3 run_classifier_cv.py --pretrained_model_path models/mixed_corpus_elmo_model.bin \
                                                    --vocab_path models/google_zh_vocab.txt \
                                                    --config_path models/birnn_config.json \
                                                    --output_model_path models/ewect_usual_classifier_model_3.bin \
                                                    --train_path datasets/smp2020-ewect/usual/train.tsv \
                                                    --train_features_path datasets/smp2020-ewect/virus/usual_features_3.npy \
                                                    --epochs_num 3 --batch_size 64 --learning_rate 5e-4 --folds_num 5 \
                                                    --embedding word --encoder bilstm --pooling mean

CUDA_VISIBLE_DEVICES=0 python3 run_classifier_cv.py --pretrained_model_path models/wikizh_gatedcnn_model.bin \
                                                    --vocab_path models/google_zh_vocab.txt \
                                                    --config_path models/gatedcnn_9_config.json \
                                                    --output_model_path models/ewect_usual_classifier_model_4.bin \
                                                    --train_path datasets/smp2020-ewect/usual/train.tsv \
                                                    --train_features_path datasets/smp2020-ewect/usual/train_features_4.npy \
                                                    --epochs_num 3 --batch_size 64 --learning_rate 5e-5 --folds_num 5 \
                                                    --embedding word --encoder gatedcnn --pooling max

然后，我们使用基于树的模型来处理提取的特征，我们使用贝叶斯优化来找到合适的超参数：

python3 scripts/run_bayesopt.py --train_path datasets/smp2020-ewect/usual/train.tsv \
                                --train_features_path datasets/smp2020-ewect/usual/ \
                                --models_num 5 --folds_num 5 --labels_num 6 --epochs_num 100

确定基于树的模型的超参数后，我们提取验证集特征并进行基于树的模型的训练和预测：

CUDA_VISIBLE_DEVICES=0 python3 inference/run_classifier_infer_cv.py --load_model_path models/ewect_usual_classifier_model_0.bin \
                                                                    --vocab_path models/google_zh_vocab.txt \
                                                                    --config_path models/bert_large_config.json \
                                                                    --test_path datasets/smp2020-ewect/usual/dev.tsv \
                                                                    --test_features_path datasets/smp2020-ewect/usual/test_features_0.npy --folds_num 5 --labels_num 6 \
                                                                    --embedding word_pos_seg --encoder transformer --mask fully_visible

CUDA_VISIBLE_DEVICES=0 python3 inference/run_classifier_infer_cv.py --load_model_path models/ewect_usual_classifier_model_1.bin \
                                                                    --vocab_path models/google_zh_vocab.txt \
                                                                    --config_path models/bert_large_config.json \
                                                                    --test_path datasets/smp2020-ewect/usual/dev.tsv \
                                                                    --test_features_path datasets/smp2020-ewect/usual/test_features_1.npy --folds_num 5 --labels_num 6 --seq_length 160 \
                                                                    --embedding word_pos_seg --encoder transformer --mask fully_visible

CUDA_VISIBLE_DEVICES=0 python3 inference/run_classifier_infer_cv.py --load_model_path models/ewect_usual_classifier_model_2.bin \
                                                                    --vocab_path models/google_zh_vocab.txt \
                                                                    --config_path models/bert_base_config.json \
                                                                    --test_path datasets/smp2020-ewect/usual/dev.tsv \
                                                                    --test_features_path datasets/smp2020-ewect/usual/test_features_2.npy --folds_num 5 --labels_num 6 --seq_length 100 \
                                                                    --embedding word_pos_seg --encoder transformer --mask causal --pooling mean

CUDA_VISIBLE_DEVICES=0 python3 inference/run_classifier_infer_cv.py --load_model_path models/ewect_usual_classifier_model_3.bin \
                                                                    --vocab_path models/google_zh_vocab.txt \
                                                                    --config_path models/birnn_config.json \
                                                                    --test_path datasets/smp2020-ewect/usual/dev.tsv \
                                                                    --test_features_path datasets/smp2020-ewect/usual/test_features_3.npy --folds_num 5 --labels_num 6 \
                                                                    --embedding word --encoder bilstm --pooling mean

CUDA_VISIBLE_DEVICES=0 python3 inference/run_classifier_infer_cv.py --load_model_path models/ewect_usual_classifier_model_4.bin \
                                                                    --vocab_path models/google_zh_vocab.txt \
                                                                    --config_path models/gatedcnn_9_config.json \
                                                                    --test_path datasets/smp2020-ewect/usual/dev.tsv \
                                                                    --test_features_path datasets/smp2020-ewect/usual/test_features_4.npy --folds_num 5 --labels_num 6 \
                                                                    --embedding word --encoder gatedcnn --pooling max

python3 scripts/run_lgb.py --train_path datasets/smp2020-ewect/usual/train.tsv \
                           --test_path datasets/smp2020-ewect/usual/dev.tsv \
                           --train_features_path datasets/smp2020-ewect/usual/ \
                           --test_features_path datasets/smp2020-ewect/usual/ \
                           --models_num 5 --labels_num 6

用户可以在 run_lgb.py 中改变超参数。

这是使用stacking集成的简单展示，以上操作可以为我们带来非常有竞争力的结果（前5名）。当我们继续增加使用不同预处理，预训练和微调策略的模型时，将会获得进一步的提升。可以在比赛首页上找到更多详细信息。

Home
主页
- 项目特色
- 依赖环境
- 快速上手
- 预训练数据
- 下游任务数据集
- 预训练模型仓库
- 使用说明
- 竞赛解决方案
  - 中文任务测评基准CLUE
  - SMP2020-EWECT
  - SMP2019-ECISA
  - CCF-BDCI2021-面向黑灰产治理的恶意短信变体字还原
  - 英文任务测评基准GLUE
- 引用

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

SMP2020微博情绪分类技术评测

SMP2020-EWECT 常规任务

Clone this wiki locally