Simplify Quick Start #112
Conversation
```python
    written = written + 1
elif rate < 3.0:
    o.write('0\t%s\n' % text)
    written = written + 1
```
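For context, this hunk appears to be the negative branch of the labeling step. A minimal sketch of the assumed surrounding logic (the names `rate`, `text`, `o`, and `written` are taken from the diff; the positive branch and its threshold are assumptions):

```python
def write_labeled(o, rate, text, written):
    """Sketch: label a review by star rating. Reviews rated >= 4 stars
    are written as positive ('1'), those below 3 as negative ('0');
    3-star reviews are skipped as neutral (an assumption, not shown
    in the diff)."""
    if rate >= 4.0:                  # assumed positive threshold
        o.write('1\t%s\n' % text)
        written = written + 1
    elif rate < 3.0:                 # negative branch, as in the diff
        o.write('0\t%s\n' % text)
        written = written + 1
    return written
```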
- We expect the ratio of positive samples to negative samples to be 1:1, as in the original process.
- There are duplicated samples in reviews_Electronics_5.json.gz. It is necessary to remove them so that the train set and test set are disjoint (a sketch of these two requirements follows this list).
- The Moses tools are used to tokenize words and punctuation. If we do not care about punctuation, it is fine to skip Moses.
- In fact, preprocessed data by other people already exists: http://riejohnson.com/cnn_download.html#sup-paper
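A minimal Python sketch of how requirements 1 and 2 could be met (the field names `reviewText` and `overall` follow the Amazon reviews schema; the function, thresholds, and structure are assumptions, not the PR's actual code):

```python
import gzip
import json

def balanced_unique_samples(path, n_per_class):
    """Sketch: collect an equal number of distinct positive and
    negative samples from a gzipped JSON-lines review file."""
    seen, pos, neg = set(), [], []
    with gzip.open(path, 'rt') as f:
        for line in f:
            review = json.loads(line)
            text, rate = review['reviewText'], review['overall']
            if text in seen:                 # requirement 2: drop duplicates
                continue
            seen.add(text)
            if rate >= 4.0 and len(pos) < n_per_class:
                pos.append((1, text))        # positive sample
            elif rate < 3.0 and len(neg) < n_per_class:
                neg.append((0, text))        # negative sample
            if len(pos) == len(neg) == n_per_class:
                break                        # requirement 1: 1:1 ratio reached
    return pos + neg
```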
Thanks for the comments!
Following the link you provided, I found this preprocessed dataset: http://jmcauley.ucsd.edu/data/amazon/ . I am checking whether it meets requirements 1–3 as you commented above. If I can train a model on that data and the model passes testing, I will come back and update this PR.
Fixes #111
Motivation
The initial purpose of this PR is that it took me more than 12 hours to run `preprocess.sh` in a VM on my MacBook Pro. I checked with Yi Yang, who can run it in a few minutes on his powerful CPU & GPU desktop. But a Quick Start should be genuinely quick to start, so that potential users can realistically feel the convenience Paddle brings. Hence this PR.

Comparison
The time cost is primarily due to the fact that the current approach downloads the full Amazon Reviews dataset, which is ~500 MB gzipped and ~1.5 GB unzipped. Processing the whole dataset also takes considerable time. So this PR's primary goal is to download only part of the dataset.

Compared with the existing approach, this PR adds `preprocess_data.py` to replace `data/get_data.sh`, `preprocess.py`, and `preprocess.sh`, which add up to ~300 lines of code. `preprocess_data.py` can read directly from the HTTP server that hosts the data, or from a local copy of the data. In either case, it reads only until the required number of instances has been scanned, which frees it from reading the whole dataset (a sketch of this idea follows below). The existing approach also relies on `shuf`, which exists on Linux but not on Mac OS X, so the new script works on both Linux and Mac OS X.
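A minimal sketch of that read-until-enough idea (Python 3; the names and structure are assumptions, not the PR's actual implementation):

```python
import gzip
import json
import urllib.request
from contextlib import closing

DEFAULT_URL = ('http://snap.stanford.edu/data/amazon/productGraph/'
               'categoryFiles/reviews_Electronics_5.json.gz')

def stream_reviews(source=DEFAULT_URL, limit=100):
    """Yield up to `limit` parsed reviews, decompressing the stream on
    the fly, so the full ~500 MB archive never has to be fetched."""
    raw = (urllib.request.urlopen(source) if source.startswith('http')
           else open(source, 'rb'))
    with closing(raw), gzip.GzipFile(fileobj=raw) as gz:
        for count, line in enumerate(gz):
            if count >= limit:
                return              # stop early: no full download
            yield json.loads(line)
```

Because the gzip stream is consumed lazily, stopping after `limit` lines means only a prefix of the archive ever crosses the network.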
Usage

If we get this PR merged, the initialization steps described in the Quick Start guide would change from the current `data/get_data.sh` + `preprocess.sh` sequence into a single `preprocess_data.py` invocation.
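A sketch of the before and after (the exact invocations are assumptions inferred from the script names; the PR may use different arguments):

```bash
# Before: fetch the full ~500 MB dataset, then preprocess all of it.
./data/get_data.sh
./preprocess.sh

# After: one script that streams and preprocesses only what is needed.
./preprocess_data.py
```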
Details
The `./preprocess_data.py` commands above read directly from the default URL http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Electronics_5.json.gz for JSON objects, until they can generate `{train,test,pred}.txt` files that add up to 100 instances, the default dataset size.

If we are going to generate a bigger dataset, say 1000 instances in total, we can run the script with a larger instance count.
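For example (the `-n` flag is a hypothetical stand-in for whatever option the script actually uses to set the instance count):

```bash
./preprocess_data.py -n 1000
```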
Or, if we have already downloaded the `reviews_Electronics_5.json.gz` file, we can point the script at the local copy instead.

An additional command-line parameter, `-t`, controls the cap on the dictionary size. If we want to generate a 1000-instance dataset while limiting the dictionary size to 1999, we can combine these options.
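Putting it together (again, `-i` and `-n` are hypothetical flag names; only `-t` is named in the text above):

```bash
./preprocess_data.py -i ./reviews_Electronics_5.json.gz -n 1000 -t 1999
```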