Added Claerbout citation as suggested by Carl in issue #2
karthik committed Jan 19, 2013
1 parent d6c5169 commit 7e0e467
Showing 3 changed files with 25 additions and 9 deletions.
6 changes: 3 additions & 3 deletions git_manuscript.md
@@ -13,12 +13,12 @@ Version control systems (VCS), which have long been used to maintain code reposi


# Introduction
- Reproducible science provides the critical standard by which published results are judged and central findings are either validated or refuted [@Vink2012b]. Reproducibility also allows others to build upon existing work and use it to test new ideas and develop methods. Advances over the years have resulted in the development of complex methodologies that allow us to collect ever-increasing amounts of data. While repeating expensive studies to validate findings is often a problem, a whole host of other reasons have contributed to the problem of reproducibility [@Peng2011; @Begley2012]. One such reason has been the lack of detailed access to the underlying data and statistical code used for analysis, which can provide opportunities for others to verify findings [@Ince2012a]. In an era rife with costly retractions, scientists have an increasing burden to be more transparent in order to maintain their credibility [@VanNoorden2011a]. While post-publication sharing of data and code is on the rise, driven in part by funder mandates and journal requirements [@Whitlock2010a], access to such research outputs is still not very common [@Vines2013; @Wolkovich2012]. By sharing detailed and versioned copies of one's data and code, researchers can not only ensure that reviewers can make well-informed decisions, but also provide opportunities for such artifacts to be repurposed and brought to bear on new research questions.
+ Reproducible science provides the critical standard by which published results are judged and central findings are either validated or refuted [@Vink2012b]. Reproducibility also allows others to build upon existing work and use it to test new ideas and develop methods. Advances over the years have resulted in the development of complex methodologies that allow us to collect ever-increasing amounts of data. While repeating expensive studies to validate findings is often difficult, a whole host of other reasons have contributed to the problem of reproducibility [@Peng2011; @Begley2012]. One such reason has been the lack of detailed access to the underlying data and statistical code used for analysis, which can provide opportunities for others to verify findings [@Schwab2000a; @Ince2012a]. In an era rife with costly retractions, scientists have an increasing burden to be more transparent in order to maintain their credibility [@VanNoorden2011a]. While post-publication sharing of data and code is on the rise, driven in part by funder mandates and journal requirements [@Whitlock2010a], access to such research outputs is still not very common [@Vines2013; @Wolkovich2012]. By sharing detailed and versioned copies of one's data and code, researchers can not only ensure that reviewers can make well-informed decisions, but also provide opportunities for such artifacts to be repurposed and brought to bear on new research questions.

Opening up access to the data and software, not just the final publication, is one of the goals of the open science movement.
Such sharing can lower barriers and serve as a powerful catalyst to accelerate progress. In an era of limited funding, there is a need to leverage existing data and code to the fullest extent to solve both applied and basic problems. This requires that scientists share their research artifacts more openly, with reasonable licenses that encourage fair use while providing credit to original authors [@Neylon2013]. Besides overcoming these social challenges, existing technologies can also be leveraged to increase reproducibility.

- All scientists use version control in one form or another at various stages of their research projects, from the data collection stage all the way to manuscript preparation. This process is often informal and haphazard, where multiple revisions of papers, code, and datasets are saved as duplicate copies with uninformative file names (e.g. *draft_1.doc, draft_2.doc*). As authors receive new data and feedback from peers and collaborators, maintaining those versions and merging changes can result in an unmanageable proliferation of files. One solution to these problems would be to use a formal Version Control System (VCS); such systems have long been used in the software industry to manage code. A key feature common to all types of VCS is the ability to save versions of files during development, along with informative comments referred to as commit messages. Every change and the accompanying notes are stored independently of the files, which prevents a proliferation of duplicate copies. Commits serve as anchor points to which individual files or an entire project can be safely reverted when necessary. Most traditional VCS are centralized, which means that they require a connection to a central server which maintains the master copy. Users with appropriate privileges can check out copies, make changes, and upload them back to the server.
+ All scientists use version control in one form or another at various stages of their research projects, from data collection all the way to manuscript preparation. This process is often informal and haphazard, where multiple revisions of papers, code, and datasets are saved as duplicate copies with uninformative file names (e.g. *draft_1.doc, draft_2.doc*). As authors receive new data and feedback from peers and collaborators, maintaining those versions and merging changes can result in an unmanageable proliferation of files. One solution to these problems would be to use a formal Version Control System (VCS); such systems have long been used in the software industry to manage code. A key feature common to all types of VCS is the ability to save versions of files during development, along with informative comments referred to as commit messages. Every change and the accompanying notes are stored independently of the files, which prevents a proliferation of duplicate copies. Commits serve as anchor points to which individual files or an entire project can be safely reverted when necessary. Most traditional VCS are centralized, which means that they require a connection to a central server which maintains the master copy. Users with appropriate privileges can check out copies, make changes, and upload them back to the server.
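
A minimal sketch of this commit-and-revert cycle, using git as an example VCS (the paths, file names, identity, and commit messages are hypothetical):

```shell
# Create a repository and save two versions of a draft as commits.
rm -rf /tmp/vcs-demo && mkdir -p /tmp/vcs-demo && cd /tmp/vcs-demo
git init -q .
git config user.email "author@example.com"   # hypothetical identity
git config user.name "Author"

echo "Abstract: first draft" > draft.md
git add draft.md
git commit -q -m "Start manuscript draft"    # informative commit message

echo "Abstract: revised after feedback" > draft.md
git commit -q -am "Revise abstract after reviewer feedback"

git log --oneline                  # each saved version, with its message
git checkout HEAD~1 -- draft.md    # revert a single file to the previous commit
cat draft.md                       # prints: Abstract: first draft
```

Because every version is a commit rather than a renamed copy, no *draft_1.doc*-style duplicates accumulate.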

Among the suite of version control systems currently available, **git** stands out in particular because it offers features that make it ideal for managing artifacts of scientific research. The most compelling feature of git is its decentralized and distributed nature. Every copy of a git repository can serve either as the server (a central point for synchronizing changes) or as a client. This ensures that there is no single point of failure. Authors can work asynchronously without being connected to a central server and synchronize their changes when possible. This is particularly useful when working from remote field sites where internet connections are often slow or non-existent. Unlike other VCS, every copy of a git repository carries a complete history of all changes, including authorship, that can be viewed and searched by anyone. This feature allows new authors to build from any stage of a versioned project. git also has a small footprint and nearly all operations occur locally.
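
Because every clone is a complete repository, the full history can be inspected and searched with no connection to any server. A small illustration (paths and file contents are hypothetical):

```shell
# Create a repository with one commit, then clone it.
rm -rf /tmp/origin-demo /tmp/clone-demo
git init -q /tmp/origin-demo
cd /tmp/origin-demo
git config user.email "author@example.com" && git config user.name "Author"
echo "x <- 1" > analysis.R
git add analysis.R && git commit -q -m "Add analysis script"

# The clone carries the entire history, including authorship.
git clone -q /tmp/origin-demo /tmp/clone-demo
cd /tmp/clone-demo
git log --format='%an: %s'      # browse history locally, offline
git log -S "x <- 1" --oneline   # search history for the change that added this line
```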

@@ -55,7 +55,7 @@ In collaborative efforts, authors contribute to one or more stages of the manusc

When projects are tracked using git, every single action (such as additions, deletions, and changes) is attributed to an author. Multiple authors can choose to work on a single branch of a repository (the '*master*' branch), or in separate branches and work asynchronously. In other words, authors do not have to wait on coauthors before contributing. As each author adds their contribution, they can sync those to the master branch and update their copies at any time. Over time, all of the decisions that go into the production of a manuscript from entering data and checking for errors, to choosing appropriate statistical models and creating figures, can be traced back to specific authors.
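
The branch-per-author pattern described above might look like this (branch, author, and file names are hypothetical):

```shell
# First author commits on the default branch.
rm -rf /tmp/branch-demo && git init -q /tmp/branch-demo && cd /tmp/branch-demo
git config user.email "alice@example.com" && git config user.name "Alice"
echo "Introduction" > paper.md
git add paper.md && git commit -q -m "Draft introduction"

# A coauthor works asynchronously on a separate branch.
git config user.email "bob@example.com" && git config user.name "Bob"
git checkout -q -b methods
echo "Methods" >> paper.md
git commit -q -am "Add methods section"

# Fold the contribution back into the branch we started on.
git checkout -q -
git merge -q methods
git log --format='%an: %s'    # every change is attributed to its author
```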

- With the help of a remote git hosting service, maintaining various copies in sync with each other becomes effortless. While most changes are merged automatically, conflicts need to be resolved manually, which would also be the case with most other workflows (e.g. using Microsoft Word with track changes). By syncing changes back and forth with a remote repository, every author can update their local copy as well as push their changes to the remote version at any time, all the while maintaining a complete audit trail. Mistakes or unnecessary changes can easily be undone by reverting either the entire repository or individual files to earlier commits. Since commits are attributed to specific authors, errors or clarifications can also be appropriately directed.
+ With the help of a remote git hosting service, maintaining various copies in sync with each other becomes effortless. While most changes are merged automatically, conflicts need to be resolved manually, which would also be the case with most other workflows (e.g. using Microsoft Word with track changes). By syncing changes back and forth with a remote repository, every author can update their local copy as well as push their changes to the remote version at any time, all the while maintaining a complete audit trail. Mistakes or unnecessary changes can easily be undone by reverting either the entire repository or individual files to earlier commits. Since commits are attributed to specific authors, errors or clarifications can also be appropriately directed. Perhaps most importantly, this workflow ensures that revisions do not have to be emailed back and forth. While cloud storage providers like Dropbox alleviate some of these annoyances and also provide versioning, the process is not controlled by the user, making it hard to discern what and how many changes have occurred between two points in time.
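
A sketch of this sync-and-undo cycle, using a local bare repository to stand in for a hosting service (all paths and names are hypothetical):

```shell
# A bare repository plays the role of the remote hosting service.
rm -rf /tmp/remote-demo.git /tmp/local-demo
git init -q --bare /tmp/remote-demo.git
git clone -q /tmp/remote-demo.git /tmp/local-demo 2>/dev/null
cd /tmp/local-demo
git config user.email "alice@example.com" && git config user.name "Alice"

echo "v1" > data.csv
git add data.csv && git commit -q -m "Add data"
git push -q origin HEAD                  # publish to the shared copy

echo "oops" > data.csv
git commit -q -am "Accidental change"
git revert --no-edit HEAD > /dev/null    # undo while keeping the audit trail
git push -q origin HEAD
cat data.csv                             # prints: v1
git log --oneline                        # all three commits remain visible
```

Unlike deleting the bad version, `git revert` records the undo itself as a commit, so the audit trail stays complete.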

In a recent paper led by Philippe Desjardins-Proulx ([https://github.com/PhDP/article_preprint/network](https://github.com/PhDP/article_preprint/network)), all of the authors successfully collaborated using only git and GitHub ([https://github.com](https://github.com)). In this particular git workflow, each of us cloned a copy of the main repository and contributed our changes back to the original author. Figures `2` and `3` show the list of collaborators and a network diagram of how and when changes were contributed back to the master branch.

Binary file modified git_manuscript.pdf
Binary file not shown.
28 changes: 22 additions & 6 deletions git_ms.bib
@@ -1,12 +1,21 @@
Automatically generated by Mendeley 1.7.1
Any changes to this file will be lost if it is regenerated by Mendeley.
- @misc{nsf2012,
- file = {:Users/karthik/Documents/Work/Reference/Mendeley Desktop/Unknown - Unknown - US NSF - Dear Colleague Letter - Issuance of a new NSF Proposal \& Award Policies and Procedures Guide (NSF13004).html:html},
- title = {{US NSF - Dear Colleague Letter - Issuance of a new NSF Proposal \& Award Policies and Procedures Guide (NSF13004)}},
- url = {http://www.nsf.gov/pubs/2013/nsf13004/nsf13004.jsp?WT.mc\_id=USNSF\_109},
- urldate = {2012-11-11},
- year = {2012}
+ @article{Schwab2000a,
+ abstract = {To verify a research paper's computational results, readers typically have to recreate them from scratch. ReDoc is a simple software filing system for authors that lets readers easily reproduce computational results using standardized rules and commands},
+ author = {Schwab, Matthias and Karrenbach, Martin and Claerbout, Jon},
+ doi = {10.1109/5992.881708},
+ file = {:Users/karthik/Documents/Work/Reference/Mendeley Desktop/Schwab, Karrenbach, Claerbout - 2000 - Making Scientific Computations Reproducible.pdf:pdf},
+ institution = {SEP},
+ issn = {15219615},
+ journal = {Computing in Science \& Engineering},
+ number = {6},
+ pages = {61--67},
+ publisher = {IEEE},
+ title = {{Making Scientific Computations Reproducible}},
+ url = {http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=881708},
+ volume = {2},
+ year = {2000}
+ }
@article{Wilson2012,
abstract = {Scientists spend an increasing amount of time building and using software. However, most scientists are never taught how to do this efficiently. As a result, many are unaware of tools and practices that would allow them to write more reliable and maintainable code with less effort. We describe a set of best practices for scientific software development that have solid foundations in research and experience, and that improve scientists' productivity and the reliability of their software.},
@@ -187,6 +196,13 @@ @article{Morin2012b
volume = {8},
year = {2012}
}
+ @misc{nsf2012,
+ file = {:Users/karthik/Documents/Work/Reference/Mendeley Desktop/Unknown - Unknown - US NSF - Dear Colleague Letter - Issuance of a new NSF Proposal \& Award Policies and Procedures Guide (NSF13004).html:html},
+ title = {{US NSF - Dear Colleague Letter - Issuance of a new NSF Proposal \& Award Policies and Procedures Guide (NSF13004)}},
+ url = {http://www.nsf.gov/pubs/2013/nsf13004/nsf13004.jsp?WT.mc\_id=USNSF\_109},
+ urldate = {2012-11-11},
+ year = {2012}
+ }
@article{Begley2012,
author = {Begley, C Glenn and Ellis, Lee M},
doi = {10.1038/483531a},
