
Create a text classifier for the IMDB large movie dataset #294

Closed
wants to merge 12 commits

Conversation

jrxk
Collaborator

@jrxk jrxk commented Oct 29, 2020

This PR fixes #293.

Description of changes

Added a text classifier for the IMDB large movie dataset, based on Texar-TF and BERT. The model expects CSV input files with columns (content, label, id), which can be generated by a Forte pipeline.
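A minimal sketch of the input layout the description mentions. Only the (content, label, id) column order comes from the PR; the example rows, label encoding, and reading code are illustrative, not Forte's actual output.

```python
# Sketch of the CSV layout the classifier expects: columns
# (content, label, id). Rows here are made up for illustration;
# a Forte pipeline would generate the real file.
import csv
import io

rows = [
    {"content": "A wonderful little film.", "label": "1", "id": "0"},
    {"content": "Dull and far too long.", "label": "0", "id": "1"},
]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["content", "label", "id"])
writer.writeheader()
writer.writerows(rows)

# Reading it back the way a training script might:
reader = csv.DictReader(io.StringIO(buf.getvalue()))
for row in reader:
    print(row["id"], row["label"], row["content"])
```

An in-memory buffer stands in for the file on disk so the sketch is self-contained.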

Test Conducted

Added an example, trained the model locally, and obtained ~84% accuracy.

@jrxk jrxk self-assigned this Oct 29, 2020
@jrxk jrxk added data_aug Features on data augmentation topic: data Issue about data loader modules and data processing related labels Oct 29, 2020
Member

@hunterhector hunterhector left a comment


When we said we would use Texar to implement the classifier, we meant that we won't use TensorFlow or PyTorch directly. Texar is the layer that hides the actual implementation, so you can easily switch between PyTorch and TensorFlow.

Please implement this using Texar-PyTorch instead.

@jrxk
Collaborator Author

jrxk commented Nov 7, 2020

When we said we would use Texar to implement the classifier, we meant that we won't use TensorFlow or PyTorch directly. Texar is the layer that hides the actual implementation, so you can easily switch between PyTorch and TensorFlow.

Please implement this using Texar-PyTorch instead.

Sorry, I'm a bit confused. Why do we have to use PyTorch? Is it because we are using PyTorch for the rest of Forte?

I was referring to Texar TF's BertClassifier for this. I will modify the Texar-PyTorch version of the BertClassifier instead. Does that sound good?

@hunterhector
Member

When we said we would use Texar to implement the classifier, we meant that we won't use TensorFlow or PyTorch directly. Texar is the layer that hides the actual implementation, so you can easily switch between PyTorch and TensorFlow.
Please implement this using Texar-PyTorch instead.

Sorry, I'm a bit confused. Why do we have to use PyTorch? Is it because we are using PyTorch for the rest of Forte?

I was referring to Texar TF's BertClassifier for this. I will modify the Texar-PyTorch version of the BertClassifier instead. Does that sound good?

  1. We would like to use more general interfaces; this implementation still contains some TensorFlow-specific calls.
  2. Most of the other examples are implemented with texar-pytorch, so let's stick with that.
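The abstraction argument above can be illustrated with a small framework-agnostic sketch. All names here are hypothetical, not Texar's real API: the point is only that code written against a thin interface can swap its backend with a one-line configuration change.

```python
# Hedged sketch of the "hide the framework" idea (hypothetical names,
# not Texar's actual API): the classifier code depends only on a small
# interface, and a TF or PyTorch backend is plugged in behind it.
from abc import ABC, abstractmethod


class ClassifierBackend(ABC):
    """Minimal interface the classifier code depends on."""

    @abstractmethod
    def predict(self, texts):
        """Return one label per input text."""


class TorchBackend(ClassifierBackend):
    def predict(self, texts):
        # Real code would run a texar-pytorch model here.
        return ["pos" for _ in texts]


class TFBackend(ClassifierBackend):
    def predict(self, texts):
        # Real code would run the Texar-TF model here.
        return ["pos" for _ in texts]


def make_backend(name: str) -> ClassifierBackend:
    # Switching frameworks becomes a one-line config change.
    return {"pytorch": TorchBackend, "tensorflow": TFBackend}[name]()


backend = make_backend("pytorch")
print(backend.predict(["a great movie"]))
```

The backends here are stubs; the sketch only shows why framework-specific calls in the classifier body defeat the purpose of the abstraction layer.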

@codecov

codecov bot commented Nov 9, 2020

Codecov Report

Merging #294 (8758d09) into master (66462e7) will decrease coverage by 0.45%.
The diff coverage is n/a.


@@            Coverage Diff             @@
##           master     #294      +/-   ##
==========================================
- Coverage   79.81%   79.36%   -0.46%     
==========================================
  Files         154      150       -4     
  Lines        9919     9722     -197     
==========================================
- Hits         7917     7716     -201     
- Misses       2002     2006       +4     
Impacted Files Coverage Δ
...te/processors/ir/elastic_search_index_processor.py 59.09% <0.00%> (-34.02%) ⬇️
forte/processors/base/index_processor.py 50.00% <0.00%> (-13.64%) ⬇️
forte/processors/base/query_processor.py 89.47% <0.00%> (-1.84%) ⬇️
forte/data/base_pack.py 80.10% <0.00%> (-1.58%) ⬇️
forte/processors/writers.py 55.00% <0.00%> (-1.53%) ⬇️
forte/processors/ir/elastic_search_processor.py 43.24% <0.00%> (-1.21%) ⬇️
forte/data/data_pack.py 81.81% <0.00%> (-0.25%) ⬇️
forte/processors/base/__init__.py 100.00% <0.00%> (ø)
...orte/processors/data_augment/data_selector_test.py
forte/processors/data_augment/selectors.py
... and 4 more

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 66462e7...3e0a1e8.

@jrxk jrxk marked this pull request as ready for review November 11, 2020 23:56
@jrxk
Collaborator Author

jrxk commented Nov 11, 2020

Reimplemented with texar-pytorch.

@hunterhector
Member

Could you open this as another PR and make sure this file is not included here directly?

examples/text_classification/data/IMDB_raw/train_id_list.txt

Files committed to GitHub stay in the repository's history permanently, so this file would permanently increase the size of the project. Can you simply provide a download link instead?

When you open another PR, make sure this file is not in your commit history of the branch. This means that you should create a fresh branch from master, add your changes to that branch, and do not include train_id_list.

Note that git rm on the current branch will not reduce the size of this PR.
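The history point above can be demonstrated in a throwaway repository. Everything below is illustrative: the file name mimics the one in the PR, and the branch names are made up.

```shell
# Demo: `git rm` on a feature branch keeps the file reachable in
# history, while a fresh branch cut from master never references it.
set -e
tmp=$(mktemp -d)
cd "$tmp"
git init -q demo
cd demo
git config user.email dev@example.com
git config user.name dev
default=$(git symbolic-ref --short HEAD)      # master or main
git commit -q --allow-empty -m "init"
git checkout -q -b feature
echo "fake id list" > train_id_list.txt       # stand-in for the large file
git add train_id_list.txt
git commit -q -m "add data file"
git rm -q train_id_list.txt                   # removes it from the tip only
git commit -q -m "remove data file"
# History on `feature` still carries the file (2 commits touch it):
git log --oneline feature -- train_id_list.txt | wc -l
# A fresh branch from master has no trace of it (0 commits):
git checkout -q "$default"
git checkout -q -b feature-v2
git log --oneline feature-v2 -- train_id_list.txt | wc -l
```

The first count stays at 2 even after the `git rm` commit, which is why re-opening the PR from a clean branch (rather than deleting the file on the current branch) is what actually keeps the repository small.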
