Dialog-KoELECTRA


Introduction

Dialog-KoELECTRA is a language model specialized for dialogue. It was trained on 22 GB of colloquial and written Korean text. Dialog-KoELECTRA is based on ELECTRA, a method for self-supervised language representation learning that can pre-train transformer networks using relatively little compute. ELECTRA models are trained to distinguish "real" input tokens from "fake" input tokens generated by another neural network, similar to the discriminator of a GAN. At small scale, ELECTRA achieves strong results even when trained on a single GPU.
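
As a rough illustration of the "real" vs "fake" objective, the sketch below builds the labels a discriminator would be trained on. It is not taken from this repository: the function name `rtd_labels` and the example tokens are made up for illustration, and the generator's output is written by hand rather than sampled from a model.

```python
def rtd_labels(original_tokens, corrupted_tokens):
    """Label each position 1 if the token was replaced ("fake"), else 0 ("real").

    A position where the generator happens to produce the original token
    counts as "real", following the ELECTRA paper.
    """
    if len(original_tokens) != len(corrupted_tokens):
        raise ValueError("sequences must have the same length")
    return [int(o != c) for o, c in zip(original_tokens, corrupted_tokens)]


# Hypothetical example: the generator replaced the 2nd and 4th tokens.
original = ["오늘", "날씨", "가", "좋아요"]
corrupted = ["오늘", "기분", "가", "나빠요"]
labels = rtd_labels(original, corrupted)
print(labels)
```

The discriminator sees only the corrupted sequence and predicts one such label per token, so every position contributes to the loss, not just the masked ones.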

Dialog-KoELECTRA can train faster and use less memory when the mixed precision option is enabled during pre-training. During fine-tuning, hyperparameters can be optimized with the NNI option.


Released Models

We are initially releasing a small pre-trained model, trained on Korean text. We hope to release other sizes, such as base and large models, in the future.

| Model | Layers | Hidden Size | Params | Max Seq Len | Learning Rate | Batch Size | Train Steps | Train Time |
|---|---|---|---|---|---|---|---|---|
| Dialog-KoELECTRA-Small | 12 | 256 | 14M | 128 | 1e-4 | 512 | 1M | 28 days |
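
For a sense of scale, the training settings listed for Dialog-KoELECTRA-Small can be turned into rough totals. This is back-of-the-envelope arithmetic only; it assumes the reported train time means 28 days of wall-clock time and that every sequence is filled to the maximum length, so the token count is an upper bound.

```python
BATCH_SIZE = 512
TRAIN_STEPS = 1_000_000
MAX_SEQ_LEN = 128
TRAIN_DAYS = 28  # assumption: wall-clock days

# Each step consumes one batch of sequences.
sequences_seen = BATCH_SIZE * TRAIN_STEPS
# Upper bound: assumes no padding in any sequence.
max_tokens_seen = sequences_seen * MAX_SEQ_LEN
steps_per_second = TRAIN_STEPS / (TRAIN_DAYS * 24 * 60 * 60)

print(f"{sequences_seen:,} sequences")        # 512,000,000 sequences
print(f"{max_tokens_seen:,} tokens at most")  # 65,536,000,000 tokens at most
print(f"{steps_per_second:.2f} steps/s")      # 0.41 steps/s
```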

How to use from the transformers library

The Dialog-KoELECTRA model is available on the Hugging Face model hub, so it can be loaded directly with the transformers library.

from transformers import ElectraTokenizer, ElectraForSequenceClassification
  
tokenizer = ElectraTokenizer.from_pretrained("skplanet/dialog-koelectra-small-discriminator")

model = ElectraForSequenceClassification.from_pretrained("skplanet/dialog-koelectra-small-discriminator")
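
Once the tokenizer and model are loaded, a forward pass could look like the sketch below. The helper name `predict_logits` and its defaults are illustrative and not part of this repository. Note that the classification head on top of the pre-trained discriminator is newly initialized, so the logits are only meaningful after fine-tuning on a labeled task.

```python
MODEL_NAME = "skplanet/dialog-koelectra-small-discriminator"


def predict_logits(texts, num_labels=2, max_length=128):
    """Tokenize a list of strings and return raw classification logits."""
    # Imported lazily so this module can be loaded without torch/transformers.
    import torch
    from transformers import ElectraForSequenceClassification, ElectraTokenizer

    tokenizer = ElectraTokenizer.from_pretrained(MODEL_NAME)
    model = ElectraForSequenceClassification.from_pretrained(
        MODEL_NAME, num_labels=num_labels
    )
    model.eval()  # disable dropout for inference

    inputs = tokenizer(
        texts,
        padding=True,
        truncation=True,
        max_length=max_length,  # matches the model's max sequence length of 128
        return_tensors="pt",
    )
    with torch.no_grad():
        logits = model(**inputs).logits
    return logits  # shape: (len(texts), num_labels)
```

Calling `predict_logits(["이 영화 정말 재미있어요"])` downloads the weights on first use and requires network access.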

To download the model directly, without the transformers library, use the links below.


| Model | Pytorch-Generator | Pytorch-Discriminator | Tensorflow-v1 | ONNX |
|---|---|---|---|---|
| Dialog-KoELECTRA-Small | link | link | link | link |

Model Performance

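Dialog-KoELECTRA-Small performs strongly on colloquial downstream tasks: it scores highest on NSMC, Question Pair, and Korean-Hate-Speech, and also on Naver NER, while KoELECTRA-Small scores higher on KorNLI and KorSTS.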

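Colloquial data: NSMC, Question Pair, Korean-Hate-Speech. Written data: Naver NER, KorNLI, KorSTS.

| Model | NSMC (acc) | Question Pair (acc) | Korean-Hate-Speech (F1) | Naver NER (F1) | KorNLI (acc) | KorSTS (spearman) |
|---|---|---|---|---|---|---|
| DistilKoBERT | 88.60 | 92.48 | 60.72 | 84.65 | 72.00 | 72.59 |
| KoELECTRA-Small | 89.36 | 94.85 | 63.07 | 85.40 | 78.60 | 80.79 |
| Dialog-KoELECTRA-Small | 90.01 | 94.99 | 68.26 | 85.51 | 78.54 | 78.96 |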

Train Data

| Type | Corpus name | Size |
|---|---|---|
| dialog | Aihub Korean dialog corpus, NIKL Spoken corpus, Korean chatbot data, KcBERT | 7GB |
| written | NIKL Newspaper corpus, namuwikitext | 15GB |

Vocabulary

We applied morpheme analysis using huggingface_konlpy when building the vocabulary. In our experiments, this performed better than a vocabulary built without morpheme analysis.

| Vocabulary size | Unused token size | Limit alphabet | Min frequency |
|---|---|---|---|
| 40,000 | 500 | 6,000 | 3 |

Pre-training

Use preprocess.py to preprocess raw text. Preprocessing only removes repeated characters and Chinese characters. It has the following arguments:

  • --corpus_dir: A directory containing raw text files.
  • --output_file: File created after preprocessing.

Then run (for example)

python3 preprocess.py \
    --corpus_dir raw_data_dir \
    --output_file preprocessed_data.txt
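
The two cleaning steps described above can be sketched with regular expressions. This is a minimal illustration, not the code in preprocess.py: the function name `clean_line`, the repeat limit of 2, and the exact Unicode ranges treated as Chinese characters are all assumptions.

```python
import re

# Assumption: runs of the same character are shortened to at most MAX_REPEAT.
MAX_REPEAT = 2
_REPEAT_RE = re.compile(r"(.)\1{%d,}" % MAX_REPEAT, re.DOTALL)
# CJK Unified Ideographs, Extension A, and CJK Compatibility Ideographs.
_HANJA_RE = re.compile(r"[\u4e00-\u9fff\u3400-\u4dbf\uf900-\ufaff]")
_SPACE_RE = re.compile(r"\s+")


def clean_line(text):
    """Drop Chinese characters, shorten repeated characters, tidy whitespace."""
    text = _HANJA_RE.sub("", text)
    text = _REPEAT_RE.sub(lambda m: m.group(1) * MAX_REPEAT, text)
    return _SPACE_RE.sub(" ", text).strip()


print(clean_line("ㅋㅋㅋㅋㅋ 좋아요"))   # ㅋㅋ 좋아요
print(clean_line("大韓民國 만세!!!!"))  # 만세!!
```

Shortening repeats instead of deleting them keeps the emphasis that is typical of colloquial text while reducing the number of distinct spellings.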

Use build_vocab.py to create a vocabulary file from raw text or preprocessed data. It has the following arguments:

  • --corpus: A raw text file or preprocessed file to turn into a vocabulary file.
  • --tokenizer: The tokenizer to use, either wordpiece or mecab_wordpiece (wordpiece by default).
  • --vocab_size: The number of words in the vocabulary (40000 by default).
  • --min_frequency: The minimum frequency a pair must have to produce a merge operation (3 by default).
  • --limit_alphabet: The number of initial tokens that can be kept before computing merges (6000 by default).
  • --unused_size: The number of unused tokens (500 by default).

Then run (for example)

python3 build_vocab.py \
    --corpus preprocessed_data.txt \
    --tokenizer mecab_wordpiece \
    --vocab_size 40000 \
    --min_frequency 3 \
    --limit_alphabet 6000 \
    --unused_size 500
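
To show how these arguments relate to WordPiece training, here is a sketch that uses the Hugging Face `tokenizers` library. It is an assumption that build_vocab.py works along these lines; the helper names, the `[unusedN]` naming, and `strip_accents=False` are illustrative choices, and the morpheme analysis step of mecab_wordpiece is not covered.

```python
import os

SPECIAL_TOKENS = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]


def special_tokens(unused_size=500):
    """Standard BERT-style special tokens followed by [unused0] .. [unusedN-1]."""
    return SPECIAL_TOKENS + [f"[unused{i}]" for i in range(unused_size)]


def train_wordpiece(corpus_file, output_dir, vocab_size=40000,
                    min_frequency=3, limit_alphabet=6000, unused_size=500):
    """Train a WordPiece vocabulary and write vocab.txt into output_dir."""
    # Imported lazily so special_tokens() works without the library installed.
    from tokenizers import BertWordPieceTokenizer

    # Assumption: accents are not stripped, to keep Hangul syllables intact.
    tokenizer = BertWordPieceTokenizer(strip_accents=False)
    tokenizer.train(
        files=[corpus_file],
        vocab_size=vocab_size,
        min_frequency=min_frequency,
        limit_alphabet=limit_alphabet,
        special_tokens=special_tokens(unused_size),
    )
    os.makedirs(output_dir, exist_ok=True)
    return tokenizer.save_model(output_dir)


print(special_tokens(2))
```

Unused tokens reserve vocabulary slots that can later be assigned to domain-specific terms without resizing the embedding matrix.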

Use build_pretraining_dataset.py to create a pre-training dataset from a dump of raw text. It has the following arguments:

  • --corpus_dir: A directory containing raw text files to turn into Dialog-KoELECTRA examples. A text file can contain multiple documents with empty lines separating them.
  • --vocab_file: File defining the wordpiece vocabulary.
  • --output_dir: Where to write out Dialog-KoELECTRA examples.
  • --max_seq_length: The number of tokens per example (128 by default).
  • --num_processes: If >1 parallelize across multiple processes (1 by default).
  • --blanks-separate-docs: Whether blank lines indicate document boundaries (True by default).
  • --do-lower-case/--no-lower-case: Whether to lower case the input text (True by default).
  • --tokenizer_type: The tokenizer to use, either wordpiece or mecab_wordpiece (wordpiece by default).

Then run (for example)

python3 build_pretraining_dataset.py \
    --corpus_dir data/train_data/raw/split_normalize \
    --vocab_file data/vocab/vocab.txt \
    --tokenizer_type wordpiece \
    --output_dir data/train_data/tfrecord/pretrain_tfrecords_len_128_wordpiece_train \
    --max_seq_length 128 \
    --num_processes 8

Use run_pretraining.py to pre-train a Dialog-KoELECTRA model. It has the following arguments:

  • --data_dir: A directory where pre-training data, model weights, etc. are stored.
  • --model_name: A name for the model being trained. Model weights will be saved in <data_dir>/models/<model_name> by default.
  • --hparams (optional): A JSON dict or path to a JSON file containing model hyperparameters, data paths, etc. See configure_pretraining.py for the supported hyperparameters.
  • --use_tpu (optional): Use a TPU when training the model.
  • --mixed_precision (optional): Use mixed precision when training the model.

Then run (for example)

python3 run_pretraining.py \
    --data_dir data/train_data/tfrecord/pretrain_tfrecords_len_128_wordpiece_train \
    --model_name data/ckpt/pretrain_ckpt_len_128_small_wordpiece_train \
    --hparams data/config/small_config_kor_wordpiece_train.json \
    --mixed_precision

Use pytorch_convert.py to convert the TensorFlow model to a PyTorch model. It has the following arguments:

  • --tf_ckpt_path: A directory where the TensorFlow checkpoints are stored.
  • --pt_discriminator_path: Where to write out the PyTorch discriminator model.
  • --pt_generator_path (optional): Where to write out the PyTorch generator model.

Then run (for example)

python3 pytorch_convert.py \
    --tf_ckpt_path model/ckpt/pretrain_ckpt_len_128_small \
    --pt_discriminator_path model/pytorch/dialog-koelectra-small-discriminator \
    --pt_generator_path model/pytorch/dialog-koelectra-small-generator

Fine-tuning

Use run_finetuning.py to fine-tune and evaluate a Dialog-KoELECTRA model on a downstream NLP task. It has the following arguments:

  • --config_file: A YAML file containing model hyperparameters, data paths, etc.
  • --nni: Whether to use NNI when fine-tuning the model.

Then run (for example)

python3 run_finetune.py --config_file conf/hate-speech/electra-small.yaml
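
When NNI drives fine-tuning, each trial typically asks NNI for a set of hyperparameters, trains with them, and reports a score back. The sketch below shows that pattern with the `nni` package. It is not the code behind the --nni option: the helper names and the config keys (`learning_rate`, `train_batch_size`) are hypothetical, and `train_and_eval` is a placeholder for the actual fine-tuning routine.

```python
def merge_params(base_config, tuned_params):
    """Return a copy of base_config with NNI-sampled values overriding defaults."""
    merged = dict(base_config)
    merged.update({k: v for k, v in tuned_params.items() if v is not None})
    return merged


def run_trial(base_config, train_and_eval):
    """Run one NNI trial: fetch parameters, train, and report the final metric."""
    import nni  # imported lazily; only needed when running under NNI

    tuned = nni.get_next_parameter()  # dict sampled from the NNI search space
    config = merge_params(base_config, tuned)
    score = train_and_eval(config)    # placeholder for the fine-tuning routine
    nni.report_final_result(score)
    return score


# Hypothetical defaults, as they might be loaded from the YAML config file.
defaults = {"learning_rate": 1e-4, "train_batch_size": 32}
print(merge_params(defaults, {"learning_rate": 5e-5}))
```

Keeping the defaults in the YAML file and overriding only the sampled keys means the same config works with and without NNI.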

References

  • ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators.
  • KoELECTRA: Pretrained ELECTRA Model for Korean

Contact Info

For help or issues using Dialog-KoELECTRA, please submit a GitHub issue.

For personal communication related to Dialog-KoELECTRA, please contact Wonchul Kim.


Citation

If you use this library in a project or in research, please cite it:

@misc{DialogKoELECTRA,
  author       = {Wonchul Kim and Junseok Kim and Okkyun Jeong},
  title        = {Dialog-KoELECTRA: Korean conversational language model based on ELECTRA model},
  howpublished = {\url{https://github.com/skplanet/Dialog-KoELECTRA}},
  year         = {2021},
}

License

The Dialog-KoELECTRA project is licensed under the Apache License 2.0.

 Copyright 2020 ~ present SK Planet Co. RB Dialog solution

 Licensed under the Apache License, Version 2.0 (the "License");
 you may not use this file except in compliance with the License.
 You may obtain a copy of the License at

 http://www.apache.org/licenses/LICENSE-2.0

 Unless required by applicable law or agreed to in writing, software
 distributed under the License is distributed on an "AS IS" BASIS,
 WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 See the License for the specific language governing permissions and
 limitations under the License.