Datasets
The datasets listed below are all preloaded into the HackReduce Hadoop clusters and ready for immediate use at the event. The [datasets/*] notice next to each title indicates the path where it is located, depending on where you want to access it:
- Hadoop HDFS: Can be found at /datasets/*
- Namenode local filesystem: Can be found at /mnt/datasets/*
- HackReduce Github project: Samples found in the datasets/* folder of the project. Note: not all the datasets listed on this page will have samples in the Github project.
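As a minimal sketch of how a `[datasets/*]` notice maps onto the three locations above, the helper below prefixes the notice path for each one. The function name `resolve_dataset_path` and the location keys are hypothetical, chosen for illustration; they are not part of the HackReduce project.

```python
# Hypothetical helper: maps a [datasets/*] notice to a concrete path.
PREFIXES = {
    "hdfs": "/",          # Hadoop HDFS: /datasets/*
    "namenode": "/mnt/",  # Namenode local filesystem: /mnt/datasets/*
    "github": "",         # Relative to the root of the Github project
}


def resolve_dataset_path(notice, location):
    """Return the path for a notice such as '[datasets/freebase/topics]'."""
    # Drop the surrounding brackets and any leading slash from the notice.
    notice = notice.strip("[]").lstrip("/")
    if not notice.startswith("datasets/"):
        raise ValueError("expected a path starting with 'datasets/'")
    return PREFIXES[location] + notice
```

For example, `resolve_dataset_path("[datasets/freebase/topics]", "namenode")` yields the path to use on the namenode's local filesystem.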
There's also the possibility of loading new data at the event, but this process could take a few hours. Please see Greg about loading new data into your clusters.
If you're looking for data sets, below are a few good places to start:
- Buzzdata has rounded up many city-related data sets.
- Global News's data set of Toronto parking tickets, again courtesy of Buzzdata.
- The Visual.ly Blog has a great article called 30 Places to Find Open Data on the Web.
Special thanks to Echo Nest for converting the entire 200+ GB dataset from HDF5 to TSV for us.
- Quad dump (http://wiki.freebase.com/wiki/Data_dumps#Quad_dump) [datasets/freebase/quadruples]
- Simple topic dump (http://wiki.freebase.com/wiki/Data_dumps#Simple_Topic_Dump) [datasets/freebase/topics]
- Only the 1-gram and 2-gram datasets are available
- http://ngrams.googlelabs.com/datasets
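The sketch below shows one way to read the n-gram files. It assumes each line carries five tab-separated fields (n-gram, year, match count, page count, volume count), which is how the original release of the dataset was laid out; check the linked page before relying on it. The names `NgramRecord`, `parse_ngram_line` and `total_matches` are illustrative only.

```python
from collections import namedtuple

# Assumed field layout; verify against the dataset documentation.
NgramRecord = namedtuple(
    "NgramRecord", "ngram year match_count page_count volume_count"
)


def parse_ngram_line(line):
    """Parse one tab-separated line into an NgramRecord."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 5:
        raise ValueError("expected 5 tab-separated fields, got %d" % len(fields))
    ngram, year, match_count, page_count, volume_count = fields
    return NgramRecord(
        ngram, int(year), int(match_count), int(page_count), int(volume_count)
    )


def total_matches(lines):
    """Sum match counts per n-gram across all years."""
    totals = {}
    for line in lines:
        record = parse_ngram_line(line)
        totals[record.ngram] = totals.get(record.ngram, 0) + record.match_count
    return totals
```

In a Hadoop job the same parsing would sit in the mapper, with the per-n-gram sum done in the reducer.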
- Data format documentation: http://dcc.icgc.org/pages/docs/ICGC_Data_Submission_Manual-0.6b-Unextended.pdf
- French: datasets/fre-eng/fre
- English: datasets/fre-eng/eng
- http://www.statmt.org/wmt09/translation-task.html
- Provided by the Mate1 team
- Includes (with all personally identifiable data excluded, of course):
- Profiles: datasets/mate1/profile
- Internal messages: datasets/mate1/internal_message
- Subscriptions: datasets/mate1/subscription
- Who's seen who: datasets/mate1/whos_seen_who
- Hot block list: datasets/mate1/hot_block_list
- Take a look at the datasets/mate1/*-cols.txt files for a description of the CSV fields for each dataset.
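A minimal sketch of pairing a `*-cols.txt` description with its CSV data follows. It assumes the description file lists one column name per line and that the CSV files have no header row; neither is confirmed here, so inspect the actual files first. The sample values are made up purely so the example runs.

```python
import csv
import io


def read_column_names(cols_file):
    """Read column names, assuming one name per non-empty line."""
    return [line.strip() for line in cols_file if line.strip()]


def read_records(csv_file, columns):
    """Yield each CSV row as a dict keyed by column name."""
    for row in csv.reader(csv_file):
        yield dict(zip(columns, row))


# Made-up sample data standing in for a *-cols.txt file and its CSV.
sample_cols = io.StringIO("id\nage\ncity\n")
sample_csv = io.StringIO("1,34,Montreal\n2,29,Toronto\n")

columns = read_column_names(sample_cols)
records = list(read_records(sample_csv, columns))
```

With real data, replace the `io.StringIO` objects with files opened from the dataset folder.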
- Includes Bixi (branded differently in other cities) data for Toronto, Ottawa and Boston.
- Updated by Julia Evans and Kamal Marhubi for HackReduce Montreal 2012
- Dataset location on the namenode local filesystem is: /mnt/bixidata
- XML dump of all the bike station information queried every minute over a couple of months.
- Provided by Fabrice (http://twitter.com/f8full)
- Contains the root file with all the domain names and their associated nameservers for the "com" TLD.
- Data of the social graph, user id to names, and selected celebrity profiles. This does not contain actual tweets because of Twitter policies.
- http://an.kaist.ac.kr/traces/WWW2010.html
- Limited set of flight data containing origin, destination, departure time, return time, price and date.
- Only has flights originating from SEA
- Provided by Hopper
- Description of data formats: http://131.193.40.52/data/README.txt
- Data listing: http://131.193.40.52/data/
- Taken around the time of Elizabeth Taylor's death in late March 2011, this dataset is the result of a search for all tweets containing the word "taylor".
- JSON format
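Since the tweets are stored as JSON, a filter over them might look like the sketch below. It assumes one JSON object per line with the tweet body in a `text` field; the actual file layout is not documented here, so treat both as assumptions. The function name `tweets_containing` is illustrative.

```python
import json


def tweets_containing(lines, word):
    """Yield the text of each tweet whose body contains `word` (case-insensitive)."""
    needle = word.lower()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        tweet = json.loads(line)
        text = tweet.get("text", "")  # assumed field name
        if needle in text.lower():
            yield text
```

Because every tweet already matches "taylor", a second term such as "elizabeth" is a more useful filter for narrowing the set.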
- Arxiv HEP-PH (high energy physics phenomenology) [datasets/citation-networks/hep-ph/{dates,graph}]: http://snap.stanford.edu/data/cit-HepPh.html
- Arxiv HEP-TH (high energy physics theory) [datasets/citation-networks/hep-th/{dates,graph}]: http://snap.stanford.edu/data/cit-HepTh.html
- U.S. patent dataset: http://snap.stanford.edu/data/cit-Patents.html
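For the citation networks, the sketch below counts how often each paper or patent is cited. It assumes the graph files keep the edge-list layout SNAP publishes: `#` comment lines followed by whitespace-separated `FromNodeId ToNodeId` pairs, where the first node cites the second. Whether the preloaded copies keep that layout unchanged is an assumption, and `citation_counts` is an illustrative name.

```python
from collections import Counter


def citation_counts(lines):
    """Count incoming citations per node from an edge list."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and header comments
        source, target = line.split()[:2]
        counts[target] += 1  # `source` cites `target`
    return counts
```

Calling `citation_counts(lines).most_common(10)` then gives the ten most-cited nodes.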
- All the data for the U.S. specifying origin/destination of orders from our system, including price and date.