
Create a text classifier for the IMDB large movie dataset #294

Closed
wants to merge 12 commits

Conversation

jrxk
Collaborator

@jrxk jrxk commented Oct 29, 2020

This PR fixes #293.

Description of changes

Added a text classifier for the IMDB large movie dataset, based on Texar-TF and BERT. The model expects CSV input files with columns (content, label, id), which can be generated by a Forte pipeline.
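A minimal sketch of the input layout the description mentions. Only the (content, label, id) column order comes from the PR; the example rows, label encoding, and reading code are illustrative, not Forte's actual output.

```python
# Sketch of the CSV layout the classifier expects: columns
# (content, label, id). Rows here are made up for illustration;
# a Forte pipeline would generate the real file.
import csv
import io

rows = [
    {"content": "A wonderful little film.", "label": "1", "id": "0"},
    {"content": "Dull and far too long.", "label": "0", "id": "1"},
]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["content", "label", "id"])
writer.writeheader()
writer.writerows(rows)

# Reading it back the way a training script might:
reader = csv.DictReader(io.StringIO(buf.getvalue()))
for row in reader:
    print(row["id"], row["label"], row["content"])
```

An in-memory buffer stands in for the file on disk so the sketch is self-contained.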

Test Conducted

Added an example, trained the model locally, and obtained ~84% accuracy.

@jrxk jrxk self-assigned this Oct 29, 2020
@jrxk jrxk added data_aug Features on data augmentation topic: data Issue about data loader modules and data processing related labels Oct 29, 2020
Member

@hunterhector hunterhector left a comment


When we said we would use Texar to implement the classifier, we meant that we won't use TensorFlow or PyTorch directly. Texar is the layer that hides the actual implementation, so you can easily switch between PyTorch and TensorFlow.

Please implement this using Texar-PyTorch instead.

@jrxk
Collaborator Author

jrxk commented Nov 7, 2020

When we said we would use Texar to implement the classifier, we meant that we won't use TensorFlow or PyTorch directly. Texar is the layer that hides the actual implementation, so you can easily switch between PyTorch and TensorFlow.

Please implement this using Texar-PyTorch instead.

Sorry, I'm a bit confused. Why do we have to use PyTorch? Is it because we are using PyTorch for the rest of Forte?

I was referring to Texar TF's BertClassifier for this. I will modify the Texar-PyTorch version of the BertClassifier instead. Does that sound good?

@hunterhector
Member

When we said we would use Texar to implement the classifier, we meant that we won't use TensorFlow or PyTorch directly. Texar is the layer that hides the actual implementation, so you can easily switch between PyTorch and TensorFlow.
Please implement this using Texar-PyTorch instead.

Sorry, I'm a bit confused. Why do we have to use PyTorch? Is it because we are using PyTorch for the rest of Forte?

I was referring to Texar TF's BertClassifier for this. I will modify the Texar-PyTorch version of the BertClassifier instead. Does that sound good?

  1. We would like to use more general interfaces; this implementation still contains some TensorFlow-specific calls.
  2. Most of the other examples are implemented with texar-pytorch, so let's stick with that.
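The abstraction argument above can be illustrated with a small framework-agnostic sketch. All names here are hypothetical, not Texar's real API: the point is only that code written against a thin interface can swap its backend with a one-line configuration change.

```python
# Hedged sketch of the "hide the framework" idea (hypothetical names,
# not Texar's actual API): the classifier code depends only on a small
# interface, and a TF or PyTorch backend is plugged in behind it.
from abc import ABC, abstractmethod


class ClassifierBackend(ABC):
    """Minimal interface the classifier code depends on."""

    @abstractmethod
    def predict(self, texts):
        """Return one label per input text."""


class TorchBackend(ClassifierBackend):
    def predict(self, texts):
        # Real code would run a texar-pytorch model here.
        return ["pos" for _ in texts]


class TFBackend(ClassifierBackend):
    def predict(self, texts):
        # Real code would run the Texar-TF model here.
        return ["pos" for _ in texts]


def make_backend(name: str) -> ClassifierBackend:
    # Switching frameworks becomes a one-line config change.
    return {"pytorch": TorchBackend, "tensorflow": TFBackend}[name]()


backend = make_backend("pytorch")
print(backend.predict(["a great movie"]))
```

The backends here are stubs; the sketch only shows why framework-specific calls in the classifier body defeat the purpose of the abstraction layer.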

@codecov

codecov bot commented Nov 9, 2020

Codecov Report

Merging #294 (8758d09) into master (66462e7) will decrease coverage by 0.45%.
The diff coverage is n/a.


@@            Coverage Diff             @@
##           master     #294      +/-   ##
==========================================
- Coverage   79.81%   79.36%   -0.46%     
==========================================
  Files         154      150       -4     
  Lines        9919     9722     -197     
==========================================
- Hits         7917     7716     -201     
- Misses       2002     2006       +4     
Impacted Files Coverage Δ
...te/processors/ir/elastic_search_index_processor.py 59.09% <0.00%> (-34.02%) ⬇️
forte/processors/base/index_processor.py 50.00% <0.00%> (-13.64%) ⬇️
forte/processors/base/query_processor.py 89.47% <0.00%> (-1.84%) ⬇️
forte/data/base_pack.py 80.10% <0.00%> (-1.58%) ⬇️
forte/processors/writers.py 55.00% <0.00%> (-1.53%) ⬇️
forte/processors/ir/elastic_search_processor.py 43.24% <0.00%> (-1.21%) ⬇️
forte/data/data_pack.py 81.81% <0.00%> (-0.25%) ⬇️
forte/processors/base/__init__.py 100.00% <0.00%> (ø)
...orte/processors/data_augment/data_selector_test.py
forte/processors/data_augment/selectors.py
... and 4 more

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 66462e7...3e0a1e8.

@jrxk jrxk marked this pull request as ready for review November 11, 2020 23:56
@jrxk
Collaborator Author

jrxk commented Nov 11, 2020

Reimplemented with texar-pytorch.

@hunterhector
Member

Could you open this as another PR and make sure this file is not included here directly?

examples/text_classification/data/IMDB_raw/train_id_list.txt

Files committed to GitHub stay in the repository's history permanently, so this file would permanently increase the size of the project. Can you simply provide a download link instead?

When you open another PR, make sure this file is not in your commit history of the branch. This means that you should create a fresh branch from master, add your changes to that branch, and do not include train_id_list.

Note that git rm on the current branch will not reduce the size of this PR.
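The history point above can be demonstrated in a throwaway repository. Everything below is illustrative: the file name mimics the one in the PR, and the branch names are made up.

```shell
# Demo: `git rm` on a feature branch keeps the file reachable in
# history, while a fresh branch cut from master never references it.
set -e
tmp=$(mktemp -d)
cd "$tmp"
git init -q demo
cd demo
git config user.email dev@example.com
git config user.name dev
default=$(git symbolic-ref --short HEAD)      # master or main
git commit -q --allow-empty -m "init"
git checkout -q -b feature
echo "fake id list" > train_id_list.txt       # stand-in for the large file
git add train_id_list.txt
git commit -q -m "add data file"
git rm -q train_id_list.txt                   # removes it from the tip only
git commit -q -m "remove data file"
# History on `feature` still carries the file (2 commits touch it):
git log --oneline feature -- train_id_list.txt | wc -l
# A fresh branch from master has no trace of it (0 commits):
git checkout -q "$default"
git checkout -q -b feature-v2
git log --oneline feature-v2 -- train_id_list.txt | wc -l
```

The first count stays at 2 even after the `git rm` commit, which is why re-opening the PR from a clean branch (rather than deleting the file on the current branch) is what actually keeps the repository small.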
