Unofficial implementation of Sentence-BERT for Korean sentence embedding
The encoder for embedding the two sentences in the NLI task is DistilKoBERT, a distilled Korean BERT. The two sentence embeddings u and v are concatenated together with their element-wise absolute difference |u - v|, and the result is fed into a 3-way classifier (entailment, neutral, contradiction).
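As a rough sketch of this head (not the repository's actual code; the class and variable names here are hypothetical, and mean pooling over tokens is assumed):

```python
import torch
import torch.nn as nn

class SBERTClassifier(nn.Module):
    """Hypothetical SBERT-style NLI head: mean-pool each sentence,
    concatenate (u, v, |u - v|), and classify into 3 NLI labels."""

    def __init__(self, encoder, hidden_size=768, num_labels=3):
        super().__init__()
        self.encoder = encoder                              # e.g. DistilKoBERT
        self.classifier = nn.Linear(hidden_size * 3, num_labels)

    def mean_pool(self, input_ids, attention_mask):
        # Encode, then average token embeddings while ignoring padding.
        hidden = self.encoder(input_ids, attention_mask=attention_mask)[0]
        mask = attention_mask.unsqueeze(-1).float()
        return (hidden * mask).sum(1) / mask.sum(1).clamp(min=1e-9)

    def forward(self, ids_a, mask_a, ids_b, mask_b):
        u = self.mean_pool(ids_a, mask_a)
        v = self.mean_pool(ids_b, mask_b)
        features = torch.cat([u, v, torch.abs(u - v)], dim=-1)
        # Logits over (entailment, neutral, contradiction)
        return self.classifier(features)
```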
I used the Korean NLI dataset released by Kakao Brain.
First, run preprocess.py to tokenize the sentences, convert them into indices, and save them as .pt files.
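A minimal sketch of what this step amounts to, assuming a tokenizer object with a standard `encode` method (function names, paths, and the max length here are illustrative, not the repo's actual ones):

```python
import torch
from tensorflow.keras.preprocessing.sequence import pad_sequences

def preprocess(pairs, labels, tokenizer, max_len=64, out_path="train.pt"):
    """Tokenize sentence pairs, pad to a fixed length, and save tensors as .pt."""
    ids_a = [tokenizer.encode(a) for a, _ in pairs]
    ids_b = [tokenizer.encode(b) for _, b in pairs]
    # tensorflow.keras is used only for this padding step.
    ids_a = pad_sequences(ids_a, maxlen=max_len, padding="post", truncating="post")
    ids_b = pad_sequences(ids_b, maxlen=max_len, padding="post", truncating="post")
    torch.save({
        "ids_a": torch.tensor(ids_a, dtype=torch.long),
        "ids_b": torch.tensor(ids_b, dtype=torch.long),
        "labels": torch.tensor(labels, dtype=torch.long),
    }, out_path)
```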
Then, run train.py to train the model.
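A hedged sketch of the training loop, reusing the hypothetical `SBERTClassifier` and preprocessed tensors from the sketches above (batch size, learning rate, and the pad id of 0 are assumptions, and the repo's actual loop may differ):

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

data = torch.load("train.pt")                 # tensors saved by preprocess.py
mask_a = (data["ids_a"] != 0).long()          # assume pad token id 0
mask_b = (data["ids_b"] != 0).long()
dataset = TensorDataset(data["ids_a"], mask_a, data["ids_b"], mask_b, data["labels"])
loader = DataLoader(dataset, batch_size=32, shuffle=True)

model = SBERTClassifier(encoder)              # encoder = DistilKoBERT, loaded elsewhere
optimizer = torch.optim.Adam(model.parameters(), lr=2e-5)
criterion = nn.CrossEntropyLoss()

for epoch in range(20):                       # ~20 epochs per the result below
    for ids_a, m_a, ids_b, m_b, labels in loader:
        optimizer.zero_grad()
        logits = model(ids_a, m_a, ids_b, m_b)
        loss = criterion(logits, labels)
        loss.backward()
        optimizer.step()
```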
torch == 1.1.0
transformers == 2.3.0
gluonnlp == 0.8.1
tensorflow == 2.0.0 (for tensorflow.keras.preprocessing.sequence only)
After around 20 epochs, accuracy on the test dataset was 73.7%.