Create a text classifier for the IMDB large movie dataset #294
When we said we would use Texar to implement the classifier, we meant that we won't use TensorFlow or PyTorch directly. Texar is the layer that hides the actual implementation, so you can easily switch between PyTorch and TensorFlow.
Please implement this using Texar-PyTorch instead.
Sorry, I'm a bit confused. Why do we have to use PyTorch? Is it because we are using PyTorch for the rest of Forte? I was referring to Texar TF's BertClassifier for this. I will modify the Texar-PyTorch version of the BertClassifier instead. Does that sound good?
Codecov Report
@@ Coverage Diff @@
## master #294 +/- ##
==========================================
- Coverage 79.81% 79.36% -0.46%
==========================================
Files 154 150 -4
Lines 9919 9722 -197
==========================================
- Hits 7917 7716 -201
- Misses 2002 2006 +4
Reimplement with texar-pytorch.
Could you open this as another PR and make sure this file is not included in it directly? examples/text_classification/data/IMDB_raw/train_id_list.txt Files committed to GitHub stay in the repository history permanently, so this file would permanently increase the size of the project. Can you simply provide a download link instead? When you open the new PR, make sure this file does not appear anywhere in the commit history of the branch. This means that you should create a fresh branch from master, add your changes to that branch, and do not include Note that
This PR fixes #293.
Description of changes
Added a text classifier for the IMDB large movie dataset based on Texar TF and BERT. The model expects CSV file inputs with columns (content, label, id), which can be generated from a Forte pipeline.
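For reviewers who want to check the expected input format, here is a minimal sketch of writing and reading a CSV with the (content, label, id) columns named above. It uses only the Python standard library. The helper names `write_rows` and `read_rows`, the presence of a header row, the 0/1 label encoding, and the sample reviews are all assumptions made for illustration; they are not taken from this PR's code.

```python
import csv
import io

# Column order taken from the PR description: (content, label, id).
FIELDNAMES = ["content", "label", "id"]


def write_rows(rows, handle):
    """Write dict rows to an open text handle, header first (hypothetical helper)."""
    writer = csv.DictWriter(handle, fieldnames=FIELDNAMES)
    writer.writeheader()
    writer.writerows(rows)


def read_rows(handle):
    """Read rows back as dicts and fail early if an expected column is missing."""
    reader = csv.DictReader(handle)
    missing = set(FIELDNAMES) - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    return [dict(row) for row in reader]


# Made-up sample rows; the 0/1 label encoding is an assumption.
sample = [
    {"content": "A moving, well-acted film.", "label": "1", "id": "0"},
    {"content": "Dull, and far too long.", "label": "0", "id": "1"},
]

# Round-trip through an in-memory buffer instead of a real file.
buffer = io.StringIO()
write_rows(sample, buffer)
buffer.seek(0)
rows = read_rows(buffer)
```

Note that the `csv` module quotes any field containing a comma, so review text with commas survives the round trip, and that every value is read back as a string.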
Test Conducted
Added an example and trained it locally, reaching ~84% accuracy.