Project for Massive Data Processing at Universitat de Lleida
Authors : Tomasz Dąbek and Stefania Perlak
To use the StackOverflow module, you need to download and install Py-StackExchange.
LINK: https://github.com/lucjon/Py-StackExchange
This module works in two steps. The first step downloads a raw JSON file through the API wrapper. The wrapper provides methods for downloading specific data, but here the whole JSON dump is downloaded in order to have offline access to the raw data. Questions are downloaded in date order, starting with the newest. If a question has answers, they are downloaded as well. During this step, the user must be aware of SO API throttling. The second step is cleaning: markdown tags and source code are removed using regular expressions. The cleaned data is stored in a JSON file with a custom structure: text, author, date.
To download 350 questions from SO that are tagged python:
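The throttling mentioned above can be handled with a small pacing loop. The following is only a minimal sketch and is not taken from stackoverflow.py: fetch_page is a hypothetical callable that returns one page of an API response as a dict. The sketch assumes the response carries the items, has_more and backoff fields of the Stack Exchange API wrapper object, where backoff is the number of seconds the client is asked to wait.

```python
import time


def fetch_all(fetch_page, max_items, sleep=time.sleep):
    """Collect up to max_items items, honouring the server's backoff hints.

    fetch_page is a hypothetical callable: fetch_page(page_number) -> dict.
    """
    items = []
    page = 1
    while len(items) < max_items:
        response = fetch_page(page)
        items.extend(response.get("items", []))

        # Assumption: "backoff" holds the number of seconds to wait
        # before the next request, when the server asks for a pause.
        backoff = response.get("backoff")
        if backoff:
            sleep(backoff)

        if not response.get("has_more"):
            break
        page += 1
    return items[:max_items]
```

The sleep function is passed in as a parameter so that the loop can be tried out without actually waiting.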
python stackoverflow.py sdd python 350 pythonrawdump.json
To clean the downloaded data:
python stackoverflow.py c pythonrawdump.json pythonformateddump.json
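The cleaning step can be illustrated with a short sketch. It is not the code from stackoverflow.py: the regular expressions, the function names and the sample input are assumptions. It assumes that the raw text contains HTML-style tags and that source code sits inside pre or code elements, and it produces one record in the text, author, date structure.

```python
import json
import re

# Assumption: source code is wrapped in <pre>...</pre> or <code>...</code>.
CODE_BLOCK = re.compile(r"<pre>.*?</pre>|<code>.*?</code>", re.DOTALL | re.IGNORECASE)
# Any remaining tag, for example <p> or </a>.
TAG = re.compile(r"<[^>]+>")
WHITESPACE = re.compile(r"\s+")


def clean_text(raw):
    """Remove code blocks and tags, then collapse runs of whitespace."""
    no_code = CODE_BLOCK.sub(" ", raw)
    no_tags = TAG.sub(" ", no_code)
    return WHITESPACE.sub(" ", no_tags).strip()


def to_record(text, author, date):
    """Build one entry of the common format: text, author, date."""
    return {"text": clean_text(text), "author": author, "date": date}


# Hypothetical sample input, not real downloaded data.
sample = "<p>How do I sort a list?</p><pre><code>xs.sort()</code></pre><p>Thanks</p>"
record = to_record(sample, "alice", "2016-01-01")
print(json.dumps(record))
```

Code blocks are removed before the remaining tags, so that the code itself is dropped together with its markup and does not end up in the text.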
To use the Reddit module, you need to download and install PRAW.
LINK: https://github.com/praw-dev/praw
In addition, to run the Reddit module you need to generate your own OAuth key for the API. The key we used during development is not published in the public repo for security reasons.
We did not figure out how to download raw data using PRAW, so we decided to download the data directly into the common format in a single step. For this reason, this module has no separate cleaning step.
To download 10 topics with comments from the subreddit learnpython:
python reddit.py sdd learnpython 10 pythonrawdump.json id secret agent
Where id is the client id generated in the Reddit account, secret is the OAuth secret generated by Reddit, and agent is a user agent that complies with the Reddit API usage policy.
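The single-step download into the common format could look roughly like the sketch below. It is not the code from reddit.py: the function names are hypothetical, taking the newest topics with new() is an assumption, and so is the choice to store the date as an ISO 8601 string in UTC. The three arguments of make_client correspond to id, secret and agent on the command line.

```python
from datetime import datetime, timezone


def make_client(client_id, client_secret, user_agent):
    """Create a PRAW client from the three command line values."""
    import praw  # imported here so the helpers below work without PRAW installed
    return praw.Reddit(client_id=client_id,
                       client_secret=client_secret,
                       user_agent=user_agent)


def to_record(text, author, created_utc):
    """Build one entry of the common format: text, author, date."""
    date = datetime.fromtimestamp(created_utc, tz=timezone.utc).isoformat()
    # The author of a deleted post is None in PRAW.
    name = str(author) if author is not None else None
    return {"text": text, "author": name, "date": date}


def submission_records(submission):
    """Convert one topic and all of its comments to common-format records."""
    records = [to_record(submission.selftext, submission.author,
                         submission.created_utc)]
    # Drop the "load more comments" placeholders before flattening the tree.
    submission.comments.replace_more(limit=0)
    for comment in submission.comments.list():
        records.append(to_record(comment.body, comment.author,
                                 comment.created_utc))
    return records


def download(reddit, subreddit_name, limit):
    """Assumption: take the newest topics of the subreddit."""
    records = []
    for submission in reddit.subreddit(subreddit_name).new(limit=limit):
        records.extend(submission_records(submission))
    return records
```

With real credentials, download(make_client(id, secret, agent), "learnpython", 10) would return a list that can be written to the output file with json.dump.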