Commit

Commit addresses minor comment #2 from reviewer 2
karthik committed Feb 1, 2013
1 parent 3442933 commit e47d34b
Showing 1 changed file with 2 additions and 2 deletions.
4 changes: 2 additions & 2 deletions git_manuscript.md
@@ -24,7 +24,7 @@ Such sharing can lower barriers and serve as a powerful catalyst to accelerate p

All scientists use version control in one form or another at various stages of their research projects, from data collection all the way to manuscript preparation. This process is often informal and haphazard, where multiple revisions of papers, code, and datasets are saved as duplicate copies with uninformative file names (e.g. *draft_1.doc, draft_2.doc*). As authors receive new data and feedback from peers and collaborators, maintaining those versions and merging changes can result in an unmanageable proliferation of files. One solution to these problems would be to use a formal Version Control System (VCS), a class of tools that has long been used in the software industry to manage code. A key feature common to all types of VCS is the ability to save versions of files during development along with informative comments, which are referred to as commit messages. Every change and its accompanying notes are stored independent of the files, which obviates the need for duplicate copies. Commits serve as checkpoints to which individual files or an entire project can be safely reverted when necessary. Most traditional VCS are centralized, which means that they require a connection to a central server which maintains the master copy. Users with appropriate privileges can check out copies, make changes, and upload them back to the server.

-Among the suite of version control systems currently available, **git** stands out in particular because it offers features that make it ideal for managing artifacts of scientific research. The most compelling feature of git is its decentralized and distributed nature. Every copy of a git repository can serve either as the server (a central point for synchronizing changes) or as a client. This ensures that there is no single point of failure. Authors can work asynchronously without being connected to a central server and synchronize their changes when possible. This is particularly useful when working from remote field sites where internet connections are often slow or non-existent. Unlike other VCS, every copy of a git repository carries a complete history of all changes, including authorship, that can be viewed and searched by anyone. This feature allows new authors to build from any stage of a versioned project. git also has a small footprint and nearly all operations occur locally.
+Among the suite of version control systems currently available, **git** stands out in particular because it offers features that make it desirable for managing artifacts of scientific research. The most compelling feature of git is its decentralized and distributed nature. Every copy of a git repository can serve either as the server (a central point for synchronizing changes) or as a client. This ensures that there is no single point of failure. Authors can work asynchronously without being connected to a central server and synchronize their changes when possible. This is particularly useful when working from remote field sites where internet connections are often slow or non-existent. Unlike other VCS, every copy of a git repository carries a complete history of all changes, including authorship, that can be viewed and searched by anyone. This feature allows new authors to build from any stage of a versioned project. git also has a small footprint and nearly all operations occur locally.

By using a formal VCS, researchers can not only increase their own productivity but also make it easier for others to fully understand, use, and build upon their contributions. In the rest of the paper I describe how git can be used to manage common science outputs and move on to describing larger use-cases and benefits of this workflow.
Readers should note that I do not aim to provide a comprehensive review of version control systems or even git itself. My goal here is to broadly outline some of the advantages of using one such system and how it can benefit individual researchers, collaborative efforts, and the wider research community.
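
The idea of commits as checkpoints, described in the hunk above, can be illustrated with a minimal sketch. It is not taken from the manuscript: the file name `manuscript.md`, the author identity, and the helper names `git` and `demo_checkpoints` are all hypothetical, and the sketch assumes the `git` command-line tool is installed. It records two versions of a file with commit messages, then restores the first version without keeping any duplicate copy on disk.

```python
import subprocess
import tempfile
from pathlib import Path


def git(*args, cwd):
    """Run one git command inside `cwd` and return its standard output."""
    base = [
        "git",
        "-c", "user.name=Example Author",       # hypothetical identity
        "-c", "user.email=author@example.org",  # hypothetical identity
        "-c", "commit.gpgsign=false",
    ]
    done = subprocess.run(base + list(args), cwd=cwd, check=True,
                          capture_output=True, text=True)
    return done.stdout.strip()


def demo_checkpoints():
    """Commit two versions of a draft, then restore the first one."""
    repo = Path(tempfile.mkdtemp())
    git("init", cwd=repo)

    draft = repo / "manuscript.md"  # hypothetical file name
    draft.write_text("First draft of the introduction.\n")
    git("add", "manuscript.md", cwd=repo)
    git("commit", "-m", "Add first draft of introduction", cwd=repo)
    first_checkpoint = git("rev-parse", "HEAD", cwd=repo)

    draft.write_text("Rewritten introduction after feedback.\n")
    git("commit", "-am", "Revise introduction after feedback", cwd=repo)

    # Bring back the file exactly as it was at the first checkpoint.
    git("checkout", first_checkpoint, "--", "manuscript.md", cwd=repo)

    # Commit messages, newest first, stay attached to the history.
    messages = git("log", "--format=%s", cwd=repo).splitlines()
    return draft.read_text(), messages


if __name__ == "__main__":
    text, history = demo_checkpoints()
    print(text, history)
```

Both versions remain in the repository's history after the restore, so the later revision can be recovered in the same way.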
@@ -65,7 +65,7 @@ In a recent paper led by Philippe Desjardins-Proulx [https://github.com/PhDP/art

Collecting new data and developing methods for analysis are often expensive endeavors requiring significant amounts of grant funding. Therefore, protecting such valuable products from loss or theft is paramount. A recent study found that a vast majority of data and code are stored on lab computers or web servers, both of which are prone to failure and often become inaccessible after a certain length of time. One survey found that only 72% of the 1000 studies surveyed still had data that were accessible [@Schultheiss2011; @Wren2004]. Hosting data and code publicly not only ensures protection against loss but also increases visibility for research efforts and provides opportunities for collaboration and early review [@Prlic2012b].

-While git provides powerful features that can be leveraged by individual scientists, git hosting services open up a whole new set of possibilities. Any local git repository can be linked to one or more **git remotes**, which are copies hosted on remote cloud servers. Git remotes serve as hubs for collaboration where authors with write privileges can contribute anytime while others can download up-to-date versions or submit revisions with author approval. There are currently several git hosting services such as SourceForge, Google Code, GitHub, and BitBucket that provide free git hosting. Among them, GitHub has surpassed other popular providers like Google Code and SourceForge and hosts over 2 million public repositories at the time of this writing [@github_2013; @github_popularity]. While these services are usually free for publicly open projects, some research efforts, especially those containing embargoed or sensitive data, will need to be kept private. There are multiple ways to deal with such situations. For example, certain files can be excluded from git's history, others maintained as private sub-modules, or entire repositories can be made private and opened to the public at a future time. Some git hosts like BitBucket offer unlimited public and private accounts for academic use.
+While git provides powerful features that can be leveraged by individual scientists, git hosting services open up a whole new set of possibilities. Any local git repository can be linked to one or more **git remotes**, which are copies hosted on remote cloud servers. Git remotes serve as hubs for collaboration where authors with write privileges can contribute anytime while others can download up-to-date versions or submit revisions with author approval. There are currently several git hosting services such as SourceForge, Google Code, GitHub, and BitBucket that provide free git hosting. Among them, GitHub has surpassed other hosting providers like Google Code and SourceForge in popularity and hosts over 4.6 million repositories as of December 2012 [@github_2013; @github_popularity; @git2013]. While these services are usually free for publicly open projects, some research efforts, especially those containing embargoed or sensitive data, will need to be kept private. There are multiple ways to deal with such situations. For example, certain files can be excluded from git's history, others maintained as private sub-modules, or entire repositories can be made private and opened to the public at a future time. Some git hosts like BitBucket offer unlimited public and private accounts for academic use.

Managing a research project with git provides several safeguards against short-term loss. Frequent commits synced to remote repositories ensure that multiple versioned copies are accessible from anywhere. In projects involving multiple collaborators, the presence of additional copies makes it even more difficult to lose work. While git hosting services protect against short-term data loss, they are not a solution for more permanent archiving since none of them offer any such guarantees. For long-term archiving, researchers should submit their git-managed projects to academic repositories that are members of CLOCKSS ([http://www.clockss.org/](http://www.clockss.org/)). Outputs stored on such repositories (e.g. figshare) are archived over a network of redundant nodes, ensuring indefinite availability across geographic and geopolitical regions.
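
The remote workflow described in this hunk can be sketched as follows. This is an illustration under stated assumptions, not the author's setup: a local bare repository stands in for a hosting service such as GitHub or BitBucket, and the file names (`analysis.R`, `embargoed_data.csv`), the author identity, and the helper names `git` and `demo_remote` are hypothetical. The sketch excludes an embargoed file through `.gitignore`, syncs a commit to the remote, and shows that a fresh clone receives the history but not the excluded file.

```python
import subprocess
import tempfile
from pathlib import Path


def git(*args, cwd):
    """Run one git command inside `cwd` and return its standard output."""
    base = [
        "git",
        "-c", "user.name=Example Author",       # hypothetical identity
        "-c", "user.email=author@example.org",  # hypothetical identity
        "-c", "commit.gpgsign=false",
    ]
    done = subprocess.run(base + list(args), cwd=cwd, check=True,
                          capture_output=True, text=True)
    return done.stdout.strip()


def demo_remote():
    """Sync a project to a stand-in remote and clone it elsewhere."""
    root = Path(tempfile.mkdtemp())
    hub = root / "hub.git"  # stand-in for a hosted git remote
    work = root / "work"    # the author's working copy
    copy = root / "copy"    # a collaborator's fresh clone

    git("init", "--bare", str(hub), cwd=root)
    work.mkdir()
    git("init", cwd=work)

    # Files listed in .gitignore are never added to git's history.
    (work / ".gitignore").write_text("embargoed_data.csv\n")
    (work / "embargoed_data.csv").write_text("site,count\nA,12\n")
    (work / "analysis.R").write_text("# analysis script\n")
    git("add", ".", cwd=work)
    git("commit", "-m", "Add analysis script", cwd=work)

    # Link the working copy to the remote and sync the commit.
    git("remote", "add", "origin", str(hub), cwd=work)
    git("push", "origin", "HEAD", cwd=work)

    # A collaborator's clone carries the full history.
    git("clone", str(hub), str(copy), cwd=root)
    tracked = sorted(git("ls-files", cwd=copy).splitlines())
    messages = git("log", "--format=%s", cwd=copy).splitlines()
    return tracked, messages


if __name__ == "__main__":
    print(demo_remote())
```

With a real hosting service, only the remote URL would change; the embargoed file stays on the author's machine and so still needs its own backup.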
