# Step 0. Create the data folder (-p also creates the parent `data` directory and does not fail if it already exists)
mkdir -p data/bz2
# Step 1. Download raw data from a third party dump: https://files.pushshift.io/reddit
# download comments for year 2011
wget https://files.pushshift.io/reddit/comments/RC_2011-01.bz2 -P data/bz2
wget https://files.pushshift.io/reddit/comments/RC_2011-02.bz2 -P data/bz2
wget https://files.pushshift.io/reddit/comments/RC_2011-03.bz2 -P data/bz2
wget https://files.pushshift.io/reddit/comments/RC_2011-04.bz2 -P data/bz2
wget https://files.pushshift.io/reddit/comments/RC_2011-05.bz2 -P data/bz2
wget https://files.pushshift.io/reddit/comments/RC_2011-06.bz2 -P data/bz2
wget https://files.pushshift.io/reddit/comments/RC_2011-07.bz2 -P data/bz2
wget https://files.pushshift.io/reddit/comments/RC_2011-08.bz2 -P data/bz2
wget https://files.pushshift.io/reddit/comments/RC_2011-09.bz2 -P data/bz2
wget https://files.pushshift.io/reddit/comments/RC_2011-10.bz2 -P data/bz2
wget https://files.pushshift.io/reddit/comments/RC_2011-11.bz2 -P data/bz2
wget https://files.pushshift.io/reddit/comments/RC_2011-12.bz2 -P data/bz2
# download comments for year 2012
wget https://files.pushshift.io/reddit/comments/RC_2012-01.bz2 -P data/bz2
wget https://files.pushshift.io/reddit/comments/RC_2012-02.bz2 -P data/bz2
wget https://files.pushshift.io/reddit/comments/RC_2012-03.bz2 -P data/bz2
wget https://files.pushshift.io/reddit/comments/RC_2012-04.bz2 -P data/bz2
wget https://files.pushshift.io/reddit/comments/RC_2012-05.bz2 -P data/bz2
wget https://files.pushshift.io/reddit/comments/RC_2012-06.bz2 -P data/bz2
wget https://files.pushshift.io/reddit/comments/RC_2012-07.bz2 -P data/bz2
wget https://files.pushshift.io/reddit/comments/RC_2012-08.bz2 -P data/bz2
wget https://files.pushshift.io/reddit/comments/RC_2012-09.bz2 -P data/bz2
wget https://files.pushshift.io/reddit/comments/RC_2012-10.bz2 -P data/bz2
wget https://files.pushshift.io/reddit/comments/RC_2012-11.bz2 -P data/bz2
wget https://files.pushshift.io/reddit/comments/RC_2012-12.bz2 -P data/bz2
# download submissions for year 2011
wget https://files.pushshift.io/reddit/submissions/RS_2011-01.bz2 -P data/bz2
wget https://files.pushshift.io/reddit/submissions/RS_2011-02.bz2 -P data/bz2
wget https://files.pushshift.io/reddit/submissions/RS_2011-03.bz2 -P data/bz2
wget https://files.pushshift.io/reddit/submissions/RS_2011-04.bz2 -P data/bz2
wget https://files.pushshift.io/reddit/submissions/RS_2011-05.bz2 -P data/bz2
wget https://files.pushshift.io/reddit/submissions/RS_2011-06.bz2 -P data/bz2
wget https://files.pushshift.io/reddit/submissions/RS_2011-07.bz2 -P data/bz2
wget https://files.pushshift.io/reddit/submissions/RS_2011-08.bz2 -P data/bz2
wget https://files.pushshift.io/reddit/submissions/RS_2011-09.bz2 -P data/bz2
wget https://files.pushshift.io/reddit/submissions/RS_2011-10.bz2 -P data/bz2
wget https://files.pushshift.io/reddit/submissions/RS_2011-11.bz2 -P data/bz2
wget https://files.pushshift.io/reddit/submissions/RS_2011-12.bz2 -P data/bz2
# download submissions for year 2012
wget https://files.pushshift.io/reddit/submissions/RS_2012-01.bz2 -P data/bz2
wget https://files.pushshift.io/reddit/submissions/RS_2012-02.bz2 -P data/bz2
wget https://files.pushshift.io/reddit/submissions/RS_2012-03.bz2 -P data/bz2
wget https://files.pushshift.io/reddit/submissions/RS_2012-04.bz2 -P data/bz2
wget https://files.pushshift.io/reddit/submissions/RS_2012-05.bz2 -P data/bz2
wget https://files.pushshift.io/reddit/submissions/RS_2012-06.bz2 -P data/bz2
wget https://files.pushshift.io/reddit/submissions/RS_2012-07.bz2 -P data/bz2
wget https://files.pushshift.io/reddit/submissions/RS_2012-08.bz2 -P data/bz2
wget https://files.pushshift.io/reddit/submissions/RS_2012-09.bz2 -P data/bz2
wget https://files.pushshift.io/reddit/submissions/RS_2012-10.bz2 -P data/bz2
wget https://files.pushshift.io/reddit/submissions/RS_2012-11.bz2 -P data/bz2
wget https://files.pushshift.io/reddit/submissions/RS_2012-12.bz2 -P data/bz2
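# Optional sketch, not part of the original script: the 48 wget calls above all
# follow one URL pattern, so the list could be generated by a loop instead.
# The function name list_dump_urls is hypothetical. It only prints URLs and
# downloads nothing, so defining it here does not change what this script does.
list_dump_urls() {
    for kind in comments/RC submissions/RS; do
        for year in "$@"; do
            for month in 01 02 03 04 05 06 07 08 09 10 11 12; do
                # e.g. https://files.pushshift.io/reddit/comments/RC_2011-01.bz2
                echo "https://files.pushshift.io/reddit/${kind}_${year}-${month}.bz2"
            done
        done
    done
}
# Possible usage (assumes wget and xargs are installed; -c resumes partial downloads):
#   list_dump_urls 2011 2012 | xargs -n 1 wget -c -P data/bz2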
# Step 2. Read the `.bz2` files and group items from the same subreddit
python src/data.py bz2 2011
python src/data.py bz2 2012
# Step 3. Extract basic attributes and dialog trees.
python src/data.py basic 2011
python src/data.py basic 2012
# Step 4. Build training and testing data for different feedback signals.
python src/data.py updown 2011 --year_to=2012
python src/data.py depth 2011 --year_to=2012
python src/data.py width 2011 --year_to=2012