Unofficial implementation of Sentence-BERT for Korean sentence embedding
The encoder for embedding the two sentences in the NLI task is DistilKoBERT, a distilled Korean BERT. The two sentence embeddings u and v are concatenated together with their element-wise absolute difference |u - v|, and the result is fed into a 3-way classifier (entailment, neutral, contradiction).
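As a rough sketch of this head (not the repository's actual code; the class and variable names here are hypothetical, and mean pooling over tokens is assumed):

```python
import torch
import torch.nn as nn

class SBERTClassifier(nn.Module):
    """Hypothetical SBERT-style NLI head: mean-pool each sentence,
    concatenate (u, v, |u - v|), and classify into 3 NLI labels."""

    def __init__(self, encoder, hidden_size=768, num_labels=3):
        super().__init__()
        self.encoder = encoder                              # e.g. DistilKoBERT
        self.classifier = nn.Linear(hidden_size * 3, num_labels)

    def mean_pool(self, input_ids, attention_mask):
        # Encode, then average token embeddings while ignoring padding.
        hidden = self.encoder(input_ids, attention_mask=attention_mask)[0]
        mask = attention_mask.unsqueeze(-1).float()
        return (hidden * mask).sum(1) / mask.sum(1).clamp(min=1e-9)

    def forward(self, ids_a, mask_a, ids_b, mask_b):
        u = self.mean_pool(ids_a, mask_a)
        v = self.mean_pool(ids_b, mask_b)
        features = torch.cat([u, v, torch.abs(u - v)], dim=-1)
        # Logits over (entailment, neutral, contradiction)
        return self.classifier(features)
```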
I used the Korean NLI dataset released by Kakao Brain.
First, run preprocess.py to tokenize the sentences, convert them into indices, and save them as .pt files.
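A minimal sketch of what this step amounts to, assuming a tokenizer object with a standard `encode` method (function names, paths, and the max length here are illustrative, not the repo's actual ones):

```python
import torch
from tensorflow.keras.preprocessing.sequence import pad_sequences

def preprocess(pairs, labels, tokenizer, max_len=64, out_path="train.pt"):
    """Tokenize sentence pairs, pad to a fixed length, and save tensors as .pt."""
    ids_a = [tokenizer.encode(a) for a, _ in pairs]
    ids_b = [tokenizer.encode(b) for _, b in pairs]
    # tensorflow.keras is used only for this padding step.
    ids_a = pad_sequences(ids_a, maxlen=max_len, padding="post", truncating="post")
    ids_b = pad_sequences(ids_b, maxlen=max_len, padding="post", truncating="post")
    torch.save({
        "ids_a": torch.tensor(ids_a, dtype=torch.long),
        "ids_b": torch.tensor(ids_b, dtype=torch.long),
        "labels": torch.tensor(labels, dtype=torch.long),
    }, out_path)
```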
Then, run train.py to train the model.
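A hedged sketch of the training loop, reusing the hypothetical `SBERTClassifier` and preprocessed tensors from the sketches above (batch size, learning rate, and the pad id of 0 are assumptions, and the repo's actual loop may differ):

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

data = torch.load("train.pt")                 # tensors saved by preprocess.py
mask_a = (data["ids_a"] != 0).long()          # assume pad token id 0
mask_b = (data["ids_b"] != 0).long()
dataset = TensorDataset(data["ids_a"], mask_a, data["ids_b"], mask_b, data["labels"])
loader = DataLoader(dataset, batch_size=32, shuffle=True)

model = SBERTClassifier(encoder)              # encoder = DistilKoBERT, loaded elsewhere
optimizer = torch.optim.Adam(model.parameters(), lr=2e-5)
criterion = nn.CrossEntropyLoss()

for epoch in range(20):                       # ~20 epochs per the result below
    for ids_a, m_a, ids_b, m_b, labels in loader:
        optimizer.zero_grad()
        logits = model(ids_a, m_a, ids_b, m_b)
        loss = criterion(logits, labels)
        loss.backward()
        optimizer.step()
```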
torch == 1.1.0
transformers == 2.3.0
gluonnlp == 0.8.1
tensorflow == 2.0.0 (for tensorflow.keras.preprocessing.sequence only)
After around 20 epochs, accuracy on the test dataset was 73.7%.