
Simplify Quick Start #112

Closed

Conversation


@wangkuiyi wangkuiyi commented Sep 23, 2016

Fixes #111

Motivation

The initial purpose of this PR is that it took me more than 12 hours to run preprocess.sh in a VM on my MacBook Pro. I checked with Yi Yang, and he can run it in a few minutes on his powerful CPU & GPU desktop. But a Quick Start should be quick to start, so that potential users can realistically feel the convenience Paddle brings. Hence this PR.

Comparison

The time cost is primarily due to the current approach downloading the full Amazon Reviews dataset, which is ~500MB gzipped and ~1.5GB unzipped. Processing the whole dataset also takes a long time. So this PR's primary goal is to download only part of the dataset. Compared with the existing approach,

  1. this PR uses a ~100-line Python script, preprocess_data.py, to replace data/get_data.sh, preprocess.py, and preprocess.sh, which add up to ~300 lines of code;
  2. after a short discussion with @emailweixu , we decided to use space-delimited word segmentation in place of the Moses word segmenter, so there is no need to download the Moses segmenter;
  3. preprocess_data.py can read directly from the HTTP server that hosts the data, or from a local copy of the data. In either case, it reads only until the required number of instances has been scanned, which frees it from reading the whole dataset;
  4. the new script doesn't use shuf, which exists on Linux but not on Mac OS X, so it works on both Linux and Mac OS X.
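Point 3 is the key trick. The actual preprocess_data.py isn't shown in this thread, but a minimal sketch of the streaming behavior it describes could look like the following; the field names overall and reviewText are the ones used in the Amazon Reviews JSON, and read_instances is a hypothetical helper name (the sketch also uses Python 3, while the 2016-era script would have been Python 2):

```python
import gzip
import json
import urllib.request


def read_instances(source, n):
    """Yield up to n (rating, text) pairs from a gzipped JSON-lines
    source, which may be a local path or an HTTP(S) URL.  Because
    gzip decompresses as a stream, we stop as soon as n instances
    have been scanned and never pull the whole ~500MB file."""
    if source.startswith('http'):
        stream = urllib.request.urlopen(source)
    else:
        stream = open(source, 'rb')
    count = 0
    with gzip.open(stream, 'rt') as f:
        for line in f:
            obj = json.loads(line)
            yield obj['overall'], obj['reviewText']
            count += 1
            if count >= n:
                break
```

A caller would then split the yielded pairs into train.txt, test.txt, and pred.txt in whatever ratio the demo needs.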

Usage

If we get this PR merged, the initialization steps described in the Quick Start guide would change from

cd demo/quick_start
./data/get_data.sh
./preprocess.sh

into

cd demo/quick_start
python ./process_data.py

Details

The ./process_data.py command above reads JSON objects directly from the default URL http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Electronics_5.json.gz until it can generate {train,test,pred}.txt, which add up to 100 instances, the default dataset size.

If we are going to generate a bigger dataset, say 1000 instances in total, we can run

python ./process_data.py -n 1000

Or, if we have already downloaded the reviews_Electronics_5.json.gz file, we can run

python ./process_data.py ~/Download/reviews_Electronics_5.json.gz

An additional command-line parameter, -t, caps the dictionary size. If we want to generate a 1000-instance dataset while limiting the dictionary size to 1999, we can run

python ./process_data.py -n 1000 -t 1999 ~/Download/reviews_Electronics_5.json.gz
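The PR doesn't spell out how -t is implemented. A common way to cap a dictionary is to keep only the most frequent words; here is a sketch under that assumption (build_dict is a hypothetical name, and the real script may break frequency ties differently):

```python
from collections import Counter


def build_dict(texts, cap):
    """Build a word -> id dictionary from space-delimited texts,
    keeping only the `cap` most frequent words.  This mirrors the
    described effect of the -t flag: rarer words are dropped so the
    dictionary never exceeds `cap` entries."""
    counts = Counter()
    for text in texts:
        counts.update(text.split())
    vocab = [w for w, _ in counts.most_common(cap)]
    return {w: i for i, w in enumerate(vocab)}
```

Words outside the capped dictionary would then map to a shared unknown-word id at training time.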

    written = written + 1
elif rate < 3.0:
    o.write('0\t%s\n' % text)
    written = written + 1

@qingqing01 qingqing01 Sep 24, 2016


  1. We hope the ratio of positive samples to negative samples is 1:1, as in the original process.
  2. There are duplicated samples in reviews_Electronics_5.json.gz. It's necessary to remove them to keep the train set and test set distinct.
  3. The Moses tool is used to tokenize words and punctuation. If we don't care about the punctuation, it is fine without Moses.
  4. In fact, there is data preprocessed by other people: http://riejohnson.com/cnn_download.html#sup-paper
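Points 1 and 2 can be handled together in one pass over the parsed samples. A sketch of one way to do it (dedup_and_balance is a hypothetical helper, not code from this PR; it assumes binary labels and deduplicates on exact text match):

```python
def dedup_and_balance(samples):
    """samples: iterable of (label, text) pairs with label in {0, 1}.
    Drops exact duplicate texts, then truncates the larger class so
    positives and negatives come out 1:1."""
    seen = set()
    pos, neg = [], []
    for label, text in samples:
        if text in seen:
            continue  # duplicate review, skip to keep train/test distinct
        seen.add(text)
        (pos if label == 1 else neg).append((label, text))
    k = min(len(pos), len(neg))
    return pos[:k] + neg[:k]
```

Shuffling the balanced list before splitting into train and test would complete the picture.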


@wangkuiyi wangkuiyi Sep 26, 2016


Thanks for the comments!

Following the link you provided, I found this preprocessed dataset: http://jmcauley.ucsd.edu/data/amazon/ . I am checking whether it meets requirements 1.~3. as you commented above. If I can train a model on that data and the model passes testing, I will come back and update this PR.

@reyoung reyoung changed the base branch from master to develop October 26, 2016 10:25
@qingqing01 qingqing01 closed this Dec 12, 2016