SciPy 2013
Skdata: Data sets and algorithm evaluation protocols in Python
Bergstra, James, University of Waterloo; Pinto, Nicolas, Massachusetts Institute of Technology; Cox, David D., Harvard University
James Bergstra is an NSERC Banting Fellow at the University of Waterloo's Centre for Theoretical Neuroscience. His research interests include visual system models and learning algorithms, deep learning, Bayesian optimization, high performance computing, and music information retrieval. Previously he was a member of Professor David Cox's Computer and Biological Vision Lab in the Rowland Institute for Science at Harvard University. He completed doctoral studies at the University of Montreal in July 2011 under the direction of Professor Yoshua Bengio with a dissertation on how to incorporate complex cells into deep learning models. As part of his doctoral work he co-developed Theano, a popular meta-programming system for Python that can target GPUs for high-performance computation.
Nicolas Pinto is Chief Technology Officer and Chief Scientist of two stealth startups in Silicon Valley, focusing on the development of high-performance machine perception technologies and their applications. He holds two M.Sc. degrees in Computer Science and Engineering from France (2007), and a Ph.D. in Neuroscience from MIT (2010). Previously he was a lecturer in Computer Science at Harvard, and a research scientist in Prof. Jim DiCarlo's lab at MIT and Prof. David Cox's lab at Harvard (2012).
David Cox is an Assistant Professor of Molecular and Cellular Biology and of Computer Science, and is a member of the Center for Brain Science at Harvard University. He completed his Ph.D. in the Department of Brain and Cognitive Sciences at MIT with a specialization in computational neuroscience. Prior to joining MCB/CBS, he was a Junior Fellow at the Rowland Institute at Harvard, a multidisciplinary institute focused on high-risk, high-reward scientific research at the boundaries of traditional fields.
Machine learning benchmark data sets come in all shapes and sizes, yet classification algorithm implementations often insist on operating on sanitized input, such as (x, y) pairs with vector-valued input x and integer class label y. Researchers and practitioners are well aware of how much work (and sometimes even judgement) is required to get from the URL of a new data set to an ndarray fit for use with e.g. pandas or sklearn. The skdata library [1] handles that work for a growing number of benchmark data sets, so that one-off in-house scripts for downloading and parsing data sets can be replaced with library code that is reliable, community-tested, and documented.
Skdata consists primarily of independent submodules, each dealing with an individual data set. Each submodule has three important files:
- a 'dataset' file with the nitty-gritty details of how to download, extract, and parse a particular data set;
- a 'view' file with any standard evaluation protocols from the relevant literature; and
- a 'main' file with CLI entry points for e.g. downloading and visualizing the data set in question.
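As a rough illustration of this three-file split (all class and function names below are invented for the sketch, not skdata's actual API), a toy data set module might be organized like this:

```python
# Hypothetical sketch of the dataset/view/main roles described above.
# Names (ToyDataset, ToyView, classification_task) are illustrative
# assumptions, not skdata's actual interface.
import os


class ToyDataset:
    """'dataset' role: fetch, cache, and parse the raw data."""

    def __init__(self, root=None):
        # Data sets are cached under the user's home directory.
        self.root = root or os.path.expanduser("~/.skdata/toy")

    def fetch(self):
        # A real dataset module would download and extract files here;
        # this toy version just fabricates already-parsed records.
        self.records = [((0.0, 1.0), 0), ((1.0, 0.0), 1)]
        return self.records


class ToyView:
    """'view' role: turn parsed data into a standardized task."""

    def __init__(self, dataset):
        self.dataset = dataset

    def classification_task(self):
        records = self.dataset.fetch()
        X = [x for x, _ in records]
        y = [label for _, label in records]
        return X, y


def main():
    """'main' role: a CLI entry point for inspecting the data set."""
    X, y = ToyView(ToyDataset()).classification_task()
    print("%d examples, %d classes" % (len(X), len(set(y))))


if __name__ == "__main__":
    main()
```

The point of the split is that the messy fetching/parsing logic stays isolated in the dataset layer, while downstream code only ever sees the standardized task exposed by the view.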
Various skdata utilities help to manage the data sets themselves, which are stored in the user's "~/.skdata" directory.
The evaluation protocols represent the logic that turns parsed (but potentially idiosyncratic) data into one or more standardized learning tasks. The basic approach has been developed over the authors' years of combined experience, and has been used extensively in recent work (e.g. [2]). The presentation will cover the design of data set submodules, and the basic interactions between a learning algorithm and an evaluation protocol.
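To make the protocol/algorithm interaction concrete, here is a minimal sketch under assumed names: the method names (`best_model`, `loss`) and the `evaluate` driver are illustrative, not a statement of skdata's exact interface.

```python
# Hedged sketch of an evaluation protocol driving a learning algorithm.
# Interface names are assumptions made for illustration only.

class MajorityClassAlgo:
    """A trivial learning algorithm: always predict the most common label."""

    def best_model(self, train_X, train_y):
        counts = {}
        for label in train_y:
            counts[label] = counts.get(label, 0) + 1
        # The "model" here is simply the majority label.
        return max(counts, key=counts.get)

    def loss(self, model, test_X, test_y):
        errors = sum(1 for label in test_y if label != model)
        return errors / float(len(test_y))


def evaluate(algo, train, test):
    """Protocol role: fix how the algorithm is trained and scored,
    so that results stay comparable across algorithms."""
    model = algo.best_model(*train)
    return algo.loss(model, *test)


train = ([[0], [1], [2]], [1, 1, 0])
test = ([[3], [4]], [1, 0])
err = evaluate(MajorityClassAlgo(), train, test)  # → 0.5
```

Because the protocol owns the train/test split and the scoring call, two different algorithms plugged into the same `evaluate` are guaranteed to be compared under identical conditions.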
[1] Skdata: http://jaberg.github.com/skdata
[2] J. Bergstra, D. Yamins and D. D. Cox (2013). Making a Science of Model Search: Hyperparameter Optimization in Hundreds of Dimensions for Vision Architectures. Proc. 30th International Conference on Machine Learning (ICML-13). http://jmlr.csail.mit.edu/proceedings/papers/v28/bergstra13.pdf
More information about the presenting author can be found at http://www.eng.uwaterloo.ca/~jbergstr/