This repo contains the code for my Master's thesis on comparative comment classification.
The repo consists of the following parts:
- Data folder contains the training dataset and some bad-case files. Please use "jd_comp_final_v5.xlsx".
- Result folder contains some attention visualization HTML files and some pictures of the model structures.
- Old folder contains some original scripts, kept only as a backup (will be removed in the next commit).
- Python scripts whose names start with "baidu" use the Baidu API for word segmentation and embedding tasks.
- Text preprocessing scripts: utils.py, langconv.py, zh_wiki.py
- Char/word embedding script: embedding.py (you need to train the embeddings before the first run)
- Traditional models script: traditional_ml_models.py
- Deep learning model scripts:
  - config.py: model hyperparameters class
  - evaluator.py: model evaluation class
  - layers.py: attention mechanism implementation
  - main.py: the main program for training; see the code comments for more details (a command-line version is coming soon)
  - model_library.py: DL text classification models used in the thesis
  - metrics.py: model evaluation class used during training
  - reader.py: data generator
  - trainer.py: model training class
- Average embedding model: average_embedding.py
- Some model results and attention visualization: visualization.py
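To illustrate the general idea behind an average embedding model, here is a minimal sketch. It is not the code in average_embedding.py: the helper `average_embedding`, the toy table `toy_embeddings`, and the choice to skip unknown tokens are all assumptions made for this example.

```python
from typing import Dict, List, Sequence


def average_embedding(tokens: Sequence[str],
                      embeddings: Dict[str, List[float]],
                      dim: int) -> List[float]:
    """Return the element-wise mean of the vectors of the known tokens.

    Hypothetical helper: tokens missing from the table are skipped,
    and a zero vector is returned when no token is known.
    """
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        return [0.0] * dim
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


# Toy 2-dimensional embedding table (made-up values, for illustration only).
toy_embeddings = {
    "phone": [1.0, 0.0],
    "better": [0.0, 1.0],
    "than": [1.0, 1.0],
}

# "unknown" is not in the table, so only three vectors are averaged.
sentence_vector = average_embedding(
    ["phone", "better", "than", "unknown"], toy_embeddings, dim=2
)
print(sentence_vector)
```

The resulting fixed-length vector could then be fed to any standard classifier; which classifier the thesis uses on top of it is not specified here.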
This repo is not yet complete. The next steps are:
- Improve the model prediction modules
- Comparative relations extraction (ongoing): crf.py implements the traditional method, and a Bi-LSTM-CRF model is planned