\documentclass[]{article}
\usepackage[T1]{fontenc}
\usepackage{lmodern}
\usepackage{amssymb,amsmath}
\usepackage{ifxetex,ifluatex}
\usepackage{fixltx2e} % provides \textsubscript
% use microtype if available
\IfFileExists{microtype.sty}{\usepackage{microtype}}{}
% use upquote if available, for straight quotes in verbatim environments
\IfFileExists{upquote.sty}{\usepackage{upquote}}{}
\ifnum 0\ifxetex 1\fi\ifluatex 1\fi=0 % if pdftex
\usepackage[utf8]{inputenc}
\else % if luatex or xelatex
\usepackage{fontspec}
\ifxetex
\usepackage{xltxtra,xunicode}
\fi
\defaultfontfeatures{Mapping=tex-text,Scale=MatchLowercase}
\newcommand{\euro}{€}
\fi
\ifxetex
\usepackage[setpagesize=false, % page size defined by xetex
unicode=false, % unicode breaks when used with xetex
xetex]{hyperref}
\else
\usepackage[unicode=true]{hyperref}
\fi
\hypersetup{breaklinks=true,
bookmarks=true,
pdfauthor={},
pdftitle={},
colorlinks=true,
urlcolor=blue,
linkcolor=magenta,
pdfborder={0 0 0}}
\urlstyle{same} % don't use monospace font for urls
\setlength{\parindent}{0pt}
\setlength{\parskip}{6pt plus 2pt minus 1pt}
\setlength{\emergencystretch}{3em} % prevent overfull lines
\setcounter{secnumdepth}{0}
\usepackage[vmargin=1in,hmargin=1in]{geometry}
\author{}
\date{}
\begin{document}
\section{Git can facilitate greater reproducibility and increased
transparency in science.}
\textbf{Karthik Ram}, Ph.D.\\Environmental Science, Policy, and
Management.\\University of California, Berkeley.\\Berkeley, CA 94720.
USA.\\\href{mailto:[email protected]}{[email protected]}
\subsection{Abstract}
\textbf{Background:} Reproducibility is the hallmark of good science.
Maintaining a high degree of transparency in scientific reporting is
essential not just for gaining trust and credibility within the
scientific community but also for facilitating the development of new
ideas. Sharing data and computer code associated with publications is
becoming increasingly common, motivated partly in response to data
deposition requirements from journals and mandates from funders. Despite
this increase in transparency, it is still difficult to reproduce or
build upon the findings of most scientific publications without access
to a more complete workflow.

\textbf{Findings:} Version control systems (VCS), which have long been
used to maintain code repositories in the software industry, are now
finding new applications in science. One such open source VCS, Git,
provides a lightweight yet robust framework that is ideal for managing
the full suite of research outputs such as datasets, statistical code,
figures, lab notes, and manuscripts. For individual researchers, Git
provides a powerful way to track and compare versions, retrace errors,
and explore new approaches in a structured manner, all while maintaining a full
audit trail. For larger collaborative efforts, Git and Git hosting
services make it possible for everyone to work asynchronously and merge
their contributions at any time, all the while maintaining a complete
authorship trail. In this paper I provide an overview of Git along with
use-cases that highlight how this tool can be leveraged to make science
more reproducible and transparent, foster new collaborations, and
support novel uses.
\section{Keywords}
reproducible research, version control, open science. \newpage
\section{Findings}
\section{Introduction}
Reproducible science provides the critical standard by which published
results are judged and central findings are either validated or refuted
{[}1{]}. Reproducibility also allows others to build upon existing work
and use it to test new ideas and develop methods. Advances over the
years have resulted in the development of complex methodologies that
allow us to collect ever increasing amounts of data. While repeating
expensive studies to validate findings is often difficult, a whole host
of other reasons have contributed to the problem of reproducibility
{[}2{]}. One such reason has been the lack of detailed access to
underlying data and statistical code used for analysis, which can
provide opportunities for others to verify findings {[}3,4{]}. In an era
rife with costly retractions, scientists have an increasing burden to be
more transparent in order to maintain their credibility
{[}@VanNoorden2011a{]}. While post-publication sharing of data and code
is on the rise, driven in part by funder mandates and journal
requirements {[}5{]}, access to such research outputs is still not very
common {[}6,7{]}. By sharing detailed and versioned copies of their data
and code, researchers can not only ensure that reviewers can make
well-informed decisions, but also provide opportunities for such
artifacts to be repurposed and brought to bear on new research
questions.

Opening up access to the data and software, not just the final
publication, is one of the goals of the open science movement. Such sharing
can lower barriers and serve as a powerful catalyst to accelerate
progress. In the era of limited funding, there is a need to leverage
existing data and code to the fullest extent to solve both applied and
basic problems. This requires that scientists share their research
artifacts more openly, with reasonable licenses that encourage fair use
while providing credit to original authors {[}@Neylon2013{]}. Besides
overcoming social challenges to these issues, existing technologies can
also be leveraged to increase reproducibility.

All scientists use version control in one form or another at various
stages of their research projects, from the data collection all the way
to manuscript preparation. This process is often informal and haphazard,
where multiple revisions of papers, code, and datasets are saved as
duplicate copies with uninformative file names (e.g. \emph{draft\_1.doc,
draft\_2.doc}). As authors receive new data and feedback from peers and
collaborators, maintaining those versions and merging changes can result
in an unmanageable proliferation of files. One solution to these
problems would be to use a formal Version Control System (VCS), which
has long been used in the software industry to manage code. A key
feature common to all types of VCS is the ability to save versions of
files during development along with informative comments, which are
referred to as commit messages. Every change and its accompanying notes are
stored independently of the files, which obviates the need for duplicate
copies. Commits serve as checkpoints to which individual files or an entire
project can be safely reverted when necessary. Most traditional VCS
are centralized, which means that they require a connection to a central
server that maintains the master copy. Users with appropriate
privileges can check out copies, make changes, and upload them back to
the server.

Among the suite of version control systems currently available,
\textbf{Git} stands out in particular because it offers features that
make it desirable for managing artifacts of scientific research. The
most compelling feature of Git is its decentralized and distributed
nature. Every copy of a Git repository can serve either as the server (a
central point for synchronizing changes) or as a client. This ensures
that there is no single point of failure. Authors can work
asynchronously without being connected to a central server and
synchronize their changes when possible. This is particularly useful
when working from remote field sites where internet connections are
often slow or non-existent. Unlike centralized VCS, every copy of a Git
repository carries a complete history of all changes, including
authorship, that can be viewed and searched by anyone. This feature
allows new authors to build from any stage of a versioned project. Git
also has a small footprint and nearly all operations occur locally.

By using a formal VCS, researchers can not only increase their own
productivity but also make it possible for others to fully understand, use, and
build upon their contributions. In the rest of the paper I describe how
Git can be used to manage common science outputs and move on to
describing larger use-cases and benefits of this workflow. Readers
should note that I do not aim to provide a comprehensive review of
version control systems or even Git itself. There are also other
comparable alternatives such as Mercurial and Bazaar which provide many
of the features described below. My goal here is to broadly outline some
of the advantages of using one such system and how it can benefit individual
researchers, collaborative efforts, and the wider research community.
\subsection{How Git can track various artifacts of a research effort}
Before delving into common use-cases, I first describe how Git can be
used to manage familiar research outputs such as data, code used for
statistical analyses, and documents. Git can be used to manage them not
just separately but also in various combinations for different use cases
such as maintaining lab notebooks, lectures, datasets, and manuscripts.
\subsubsection{Manuscripts and notes}
Version control can operate on any file type including ones most
commonly used in academia such as Microsoft Word. However, since these
file types are binary, Git cannot examine the contents and highlight
sections that have changed between revisions. In such cases, one would
have to rely solely on commit messages or scan through file contents.
The full power of Git can best be leveraged when working with plain-text
files. These include data stored in non-proprietary spreadsheet formats
(e.g.~comma separated files versus \texttt{xls}), scripts from
programming languages, and manuscripts stored in plain text formats
(\texttt{LaTeX} and \texttt{markdown} versus Word documents). With such
formats, Git not only tracks versions but can also highlight which
sections of a file have changed.

In Microsoft Word documents the
\emph{track changes} feature is often used to solicit comments and
feedback. Once those comments and changes have either been accepted or
rejected, any record of their existence also disappears forever. When
changes are submitted using Git, a permanent record of author
contributions remains in the version history and is available in every copy
of the repository.
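
The line-level comparison described above can be illustrated without Git itself. The following is a minimal sketch, assuming two hypothetical revisions of a plain-text manuscript (the file names and sentences are invented for illustration). It uses Python's standard \texttt{difflib} module as a stand-in for Git's own diff machinery, which is implemented differently.

```python
import difflib

# Two hypothetical revisions of the same plain-text paragraph.
draft_v1 = """Reproducibility is the hallmark of good science.
We sampled 40 plots in 2011.
Results were analyzed in R.
""".splitlines(keepends=True)

draft_v2 = """Reproducibility is the hallmark of good science.
We sampled 42 plots in 2011 and 2012.
Results were analyzed in R.
""".splitlines(keepends=True)


def changed_lines(old, new):
    """Return only the removed (-) and added (+) lines of a unified diff."""
    diff = difflib.unified_diff(old, new,
                                fromfile="draft_v1.md", tofile="draft_v2.md")
    return [line.rstrip("\n") for line in diff
            if line.startswith(("+", "-"))
            and not line.startswith(("+++", "---"))]  # skip the file headers


for line in changed_lines(draft_v1, draft_v2):
    print(line)
```

Within a repository, \texttt{git diff} reports changes between tracked versions of a plain-text file in a similar unified format.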
\subsubsection{Datasets}
Data are well suited to management with Git. These include data manually
entered via spreadsheets, recorded as part of observational studies, or
retrieved from sensors (see also section on \emph{Managing large
data}). With each significant change or addition, commits can record a
log of those activities (e.g. ``\emph{Entered data collected between
12/10/2012 and 12/20/2012}'', or ``\emph{Updated data from temperature
loggers for December 2012}''). Over time this process avoids
proliferation of files, while the Git history maintains a complete
provenance that can be reviewed at any time. When errors are discovered,
a file can be reverted to an earlier version without affecting other
assets in the project.
\subsubsection{Statistical code and figures}
When data are analyzed programmatically using software such as
\texttt{R} and \texttt{Python}, code files start out small and often
become more complex over time. Somewhere along the process, inadvertent
errors such as misplaced subscripts and incorrectly applied functions
can lead to serious errors down the line. When such errors are
discovered well into a project, comparing versions of statistical
scripts can provide a way to quickly trace the source of the problem and
recover from it.

Similarly, figures that are published in a paper often undergo multiple
revisions before resulting in a final version that gets published.
Without version control, one would have to deal with multiple copies and
use imperfect information such as file creation dates to determine the
sequence in which they were generated. Without additional information,
figuring out why certain versions were created (e.g.~in response to
comments from coauthors) also becomes more difficult. When figures are
managed with Git, the commit messages (e.g. ``\emph{Updated figure in
response to Ethan's comments regarding use of normalized data.}'')
provide an unambiguous way to track various versions.
\subsubsection{Complete manuscripts}
When all of the above artifacts are used in a single effort, such as
writing a manuscript, Git can collectively manage versions in a powerful
way for both individual authors and groups of collaborators. This
process avoids rapid multiplication of unmanageable files with
uninformative names (e.g. \emph{final\_1.doc, final\_2.doc,
final\_final.doc, final\_KR\_1.doc} etc.) as illustrated by the popular
cartoon strip PhD Comics
\url{http://www.phdcomics.com/comics/archive.php?comicid=1531}.
\section{Use cases for Git in science}
\subsection{1. Lab notebook}
Day to day decisions made over the course of a study are often logged
for review and reference in lab notebooks. Such notebooks contain
important information useful both to future readers attempting to
replicate a study and to thorough reviewers seeking additional
clarification. However, lab notebooks are rarely shared along with
publications or made public, although there are some exceptions {[}8{]}.
Git commit logs can serve as proxies for lab notebooks if clear yet
concise messages are recorded over the course of a project. One of the
fundamental features of Git that make it so useful to science is that
every copy of a repository carries a complete history of changes
available for anyone to review. These logs can be easily searched to
retrieve versions of artifacts like data and code. Third party tools can
also be leveraged to mine Git histories from one or more projects for
other types of analyses.
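
Searching a commit log in this way can be sketched in a few lines. The example below is only an illustration, assuming log text exported with \texttt{git log -{}-date=short -{}-pretty=format:'\%h|\%an|\%ad|\%s'}. The helper names, the sample commit hashes, and the author names are hypothetical, and an author name containing a vertical bar would break this simple parser.

```python
import subprocess

LOG_FORMAT = "%h|%an|%ad|%s"  # short hash | author | date | subject line


def read_log(repo_path="."):
    """Return raw `git log` text for a repository (requires git on PATH)."""
    result = subprocess.run(
        ["git", "-C", repo_path, "log", "--date=short",
         f"--pretty=format:{LOG_FORMAT}"],
        capture_output=True, text=True, check=True)
    return result.stdout


def parse_log(raw):
    """Split each log line into a dict of commit, author, date, and message."""
    entries = []
    for line in raw.splitlines():
        if not line.strip():
            continue
        commit, author, date, message = line.split("|", 3)
        entries.append({"commit": commit, "author": author,
                        "date": date, "message": message})
    return entries


def search(entries, keyword):
    """Case-insensitive keyword search over commit messages."""
    return [e for e in entries if keyword.lower() in e["message"].lower()]


# Hypothetical log text in the format that read_log() would return.
sample = """a1b2c3d|Author A|2012-12-21|Updated data from temperature loggers for December 2012
e4f5a6b|Author A|2012-12-20|Entered data collected between 12/10/2012 and 12/20/2012
9c8d7e6|Author B|2012-12-15|Switched figure 2 to normalized data
"""

for entry in search(parse_log(sample), "temperature"):
    print(entry["date"], entry["commit"], entry["message"])
```

In a real project, \texttt{parse\_log(read\_log("path/to/repo"))} would replace the sample text, where the path is a placeholder for a local repository.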
\subsection{2. Facilitating Collaboration}
In collaborative efforts, authors contribute to one or more stages of
the manuscript preparation such as collecting data, analyzing them,
and/or writing up the results. Such information is extremely useful for
both readers and reviewers when assessing relative author contributions
to a body of work. With high profile journals now discouraging the
practice of honorary authorship {[}9{]}, Git commit logs can provide a
highly granular way to track and assess individual author contributions
to a project.

When projects are tracked using Git, every single action (such as
additions, deletions, and changes) is attributed to an author. Multiple
authors can choose to work on a single branch of a repository (the
`\emph{master}' branch), or in separate branches and work
asynchronously. In other words, authors do not have to wait on coauthors
before contributing. As each author adds their contribution, they can
sync those to the master branch and update their copies at any time.
Over time, all of the decisions that go into the production of a
manuscript from entering data and checking for errors, to choosing
appropriate statistical models and creating figures, can be traced back
to specific authors.

With the help of a remote Git hosting service, maintaining various
copies in sync with each other becomes effortless. While most changes
are merged automatically, conflicts will need to be resolved manually
which would also be the case with most other workflows (e.g.~using
Microsoft Word with track changes). By syncing changes back and forth
with a remote repository, every author can update their local copies as
well as push their changes to the remote version at any time, all the
while maintaining a complete audit trail. Mistakes or unnecessary
changes can easily be undone by reverting either the entire repository or
individual files to earlier commits. Since commits are attributed to
specific authors, errors or clarifications can also be appropriately
directed. Perhaps most importantly this workflow ensures that revisions
do not have to be emailed back and forth. While cloud storage providers
like Dropbox alleviate some of these annoyances and also provide
versioning, the process is not controlled, making it hard to discern what
and how many changes have occurred between two points in time.

In a recent paper led by Philippe Desjardins-Proulx
\url{https://github.com/PhDP/article_preprint/network} all of the
authors successfully collaborated using only Git and GitHub
(\url{https://github.com/}) {[}@Vink2012b{]}. In this particular Git
workflow, each of us cloned a copy of the main repository and
contributed our changes back to the lead author. Figures \texttt{2} and
\texttt{3} show the list of collaborators and a network diagram of how
and when changes were contributed back to the master branch.
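
Because every commit carries an author, a rough tally of contributions can be derived from a repository. The sketch below assumes per-author counts as printed by \texttt{git shortlog -s -n HEAD} (a count, a tab, then the author name). The function names and the sample authors are hypothetical, and commit counts are only a crude proxy for the size or importance of a contribution.

```python
import subprocess


def read_shortlog(repo_path="."):
    """Return per-author commit counts as text (requires git on PATH)."""
    result = subprocess.run(
        ["git", "-C", repo_path, "shortlog", "-s", "-n", "HEAD"],
        capture_output=True, text=True, check=True)
    return result.stdout


def parse_shortlog(raw):
    """Map each author name to the number of commits attributed to them."""
    counts = {}
    for line in raw.splitlines():
        if line.strip():
            number, name = line.strip().split(None, 1)
            counts[name] = int(number)
    return counts


def shares(counts):
    """Express each author's commit count as a percentage of all commits."""
    total = sum(counts.values())
    return {name: round(100 * n / total, 1) for name, n in counts.items()}


# Hypothetical shortlog text for a three-author project.
sample = "    30\tAuthor A\n    15\tAuthor B\n     5\tAuthor C\n"
counts = parse_shortlog(sample)
for name, share in shares(counts).items():
    print(f"{name}: {counts[name]} commits ({share}%)")
```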
\subsection{3. Backup and failsafe against data loss}
Collecting new data and developing methods for analysis are often
expensive endeavors requiring significant amounts of grant funding.
Therefore protecting such valuable products from loss or theft is
paramount. A recent study found that a vast majority of data and code
are stored on lab computers or web servers, both of which are prone to
failure and often become inaccessible after a certain length of time.
One survey found that only 72\% of 1000 studies surveyed still had
data that were accessible {[}10,11{]}. Hosting data and code publicly
not only ensures protection against loss but also increases visibility
for research efforts and provides opportunities for collaboration and
early review {[}12{]}.

While Git provides powerful features that can be leveraged by individual
scientists, Git hosting services open up a whole new set of
possibilities. Any local Git repository can be linked to one or more
\textbf{Git remotes}, which are copies hosted on remote cloud servers.
Git remotes serve as hubs for collaboration where authors with write
privileges can contribute anytime while others can download up-to-date
versions or submit revisions with author approval. There are currently
several Git hosting services such as SourceForge, Google Code, GitHub,
and BitBucket that provide free Git hosting. Among them, GitHub has
surpassed other source code hosts like Google Code and SourceForge in
popularity and hosts over 4.6 million repositories from 2.8 million
users as of December 2012 {[}13--15{]}. While these services are usually
free for publicly open projects, some research efforts, especially those
containing embargoed or sensitive data will need to be kept private.
There are multiple ways to deal with such situations. For example,
certain files can be excluded from Git's history, others maintained as
private sub-modules, or entire repositories can be made private and
opened to the public at a future time. Some Git hosts like BitBucket
offer unlimited public and private accounts for academic use.

Managing a research project with Git provides several safeguards
against short-term loss. Frequent commits synced to remote repositories
ensure that multiple versioned copies are accessible from anywhere. In
projects involving multiple collaborators, the presence of additional
copies makes it even more difficult to lose work. While Git hosting
services protect against short-term data loss, they are not a solution
for more permanent archiving since none of them offer any such
guarantees. For long-term archiving, researchers should submit their
Git-managed projects to academic repositories that are members of
CLOCKSS (\url{http://www.clockss.org/}). Outputs stored on such
repositories (e.g.~figshare) are archived over a network of redundant
nodes, which ensures indefinite availability across geographic and
geopolitical regions.
\subsection{4. Freedom to explore new ideas and methods}
Git tracks development of projects along timelines referred to as
\textbf{\emph{branches}}. By default, there is always a master branch
(line with blue dots in figure \texttt{1}). For most authors, working
with this single branch is sufficient. However, Git provides a powerful
branching mechanism that makes it easy to explore alternative ideas in
a structured and documented way without disrupting the central flow of a
project. For example, one might want to try an improved simulation
algorithm, a novel statistical method, or plot figures in a more
compelling way. If these changes don't work out, one could revert
changes back to an earlier commit when working on a single master
branch. Frequent reverts on a master branch can be disruptive,
especially when projects involve multiple collaborators. Branching
provides a risk-free way to test new algorithms, explore better data
visualization techniques, or develop new analytical models. When
branches yield desired outcomes, they can easily be merged into the
master copy while unsuccessful efforts can be deleted or left as-is to
serve as a historical record (illustrated in figure \texttt{1}).
Branches can prove extremely useful when responding to reviewer
questions about the rationale for choosing one method over another since
the Git history contains a record of failed, unsuitable, or abandoned
attempts. This is particularly helpful given that the time between
submission and response can be fairly long. Additionally, future users
can mine Git histories to avoid repeating approaches that were never
fruitful in earlier studies.
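
The branch-and-merge cycle described in this section can be sketched as a short script. This is an illustration under stated assumptions rather than a recommended workflow: it drives the \texttt{git} command-line tool from Python in a temporary directory, the file name, branch name, commit messages, and author identity are all hypothetical, and the demonstration runs only if \texttt{git} is installed.

```python
import os
import shutil
import subprocess
import tempfile


def git(repo, *args):
    """Run one git command inside `repo` and return its standard output."""
    cmd = ["git", "-c", "user.name=Example Author",
           "-c", "user.email=author@example.org", "-C", repo, *args]
    done = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return done.stdout.strip()


def commit_file(repo, name, text, message):
    """Write `text` to the file `name` and record the change as a commit."""
    with open(os.path.join(repo, name), "w") as handle:
        handle.write(text)
    git(repo, "add", name)
    git(repo, "commit", "-m", message)


def branch_demo():
    """Try an idea on a branch, merge it, and return the commit subjects."""
    repo = tempfile.mkdtemp()
    try:
        git(repo, "init")
        commit_file(repo, "model.txt", "linear model\n",
                    "Fit baseline linear model")
        # The default branch may be called 'master' or 'main'; ask git.
        main = git(repo, "rev-parse", "--abbrev-ref", "HEAD")
        git(repo, "checkout", "-b", "mixed-model")   # start the experiment
        commit_file(repo, "model.txt", "mixed-effects model\n",
                    "Try mixed-effects model")
        git(repo, "checkout", main)                  # return to the main line
        git(repo, "merge", "mixed-model")            # keep the successful idea
        return git(repo, "log", "--pretty=format:%s").splitlines()
    finally:
        shutil.rmtree(repo, ignore_errors=True)


if shutil.which("git"):
    print(branch_demo())
```

Had the experiment failed, the branch could simply be left unmerged, and the main line of development would be unaffected.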
\subsection{5. Mechanism to solicit feedback and reviews}
While it is possible to leverage most of the core functionality in Git at
the local level, Git hosting services offer additional services such as
issue trackers, collaboration graphs, and wikis. These can easily be
used to assign tasks, manage milestones, and maintain lab protocols.
Issue trackers can be repurposed as a mechanism for soliciting both
feedback and review, especially since the comments can easily be linked
to particular lines of code or blocks of text. Early comments and
reviews for this article were also solicited via GitHub Issues
(\url{https://github.com/karthikram/smb_git/issues/}).
\subsection{6. Increase transparency and verifiability}
Methods sections in papers are often succinct to adhere to strict word
limits imposed by journal guidelines. This practice is especially common
when describing well-known methods where authors assume a certain degree
of familiarity among informed readers. One unfortunate consequence of
this practice is that any modifications to the standard protocol
(typically noted in internal lab notebooks) implemented in a study may
not be available to reviewers and readers. However, seemingly small
decisions, such as choosing an appropriate distribution to use in a
statistical method, can have a disproportionately strong influence on
the central finding of a paper. Without access to a detailed history, a
reviewer competent in statistical methods has to trust that authors
carefully met necessary assumptions, or engage in a long back and forth
discussion thereby delaying the review process. Sharing a Git repository
can alleviate these kinds of ambiguities and allow authors to point out
commits where certain key decisions were made before choosing certain
approaches. Journals could facilitate this process by allowing authors
to submit links to their Git repository alongside manuscripts and
sharing them with reviewers.
\subsection{7. Managing large data}
Git is extremely efficient at managing small data files such as ones
routinely collected in experimental and observational studies. However,
when the data are particularly large, such as those in bioinformatics
studies (on the order of tens of megabytes to gigabytes), managing them
with Git can degrade efficiency and slow down the performance of Git
operations. With large data files, the best practice would be to exclude
them from the repository and only track changes in metadata. This
protocol is especially ideal when large datasets do not change often
over the course of a study. In situations where the data are large
\emph{and} undergo frequent updates, one could leverage third-party
tools such as git-annex \url{http://git-annex.branchable.com/} and still
seamlessly use Git to manage a project.
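
One way to track only metadata, as suggested above, is to record a checksum and size for each large file in a small text file that is committed in its place. The sketch below is a minimal illustration using only Python's standard library. The file names, the JSON layout, and the function names are assumptions made for this example and are not part of Git or git-annex.

```python
import hashlib
import json
import os
import tempfile


def describe(path, chunk_size=1 << 20):
    """Return the name, size in bytes, and SHA-256 checksum of a file."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so that very large files never sit fully in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return {"file": os.path.basename(path),
            "bytes": os.path.getsize(path),
            "sha256": digest.hexdigest()}


def write_metadata(data_path, metadata_path):
    """Write a small, diff-friendly JSON record describing a data file."""
    record = describe(data_path)
    with open(metadata_path, "w") as handle:
        json.dump(record, handle, indent=2, sort_keys=True)
        handle.write("\n")
    return record


# Demonstration with a small stand-in for a large sequence file.
with tempfile.TemporaryDirectory() as workdir:
    data_path = os.path.join(workdir, "reads.fastq")
    with open(data_path, "wb") as handle:
        handle.write(b"ACGT" * 1000)
    record = write_metadata(data_path, data_path + ".json")
    print(record["file"], record["bytes"], record["sha256"][:12])
```

The small JSON record can then be versioned like any other text file, and a changed checksum in its history signals that the underlying data were updated.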
\subsection{8. Lowering barriers to reuse}
A common barrier that prevents someone from reproducing or building upon
an existing method is lack of sufficient details about a method. Even in
cases where methods are adequately described, the use of expensive
proprietary software with restrictive licenses makes it difficult to use
{[}16{]}. Sharing code with licenses that encourage fair use with
appropriate attribution removes such artificial barriers and encourages
readers to modify methods to suit their research needs, improve upon
them, or find new applications {[}@Neylon2013{]}. With open source
software, analysis pipelines can be easily \emph{forked} or branched
from public Git repositories and modified to answer other questions.
Although this process of depositing code somewhere public with
appropriate licenses involves additional work for the authors, the
overall benefits outweigh the costs. Making all research products
publicly available not only increases citation rates {[}17,18{]} but can
also increase opportunities for collaboration by increasing overall
visibility. For example, Niedermeyer \& Strohalm {[}19{]} describe their
struggle to find appropriate software for comprehensive mass
spectrum annotation; they eventually found open source software that
they were able to extend. In particular, the authors cite availability
of complete source code along with an open license as the motivation for
their choice. Examples of such collaboration and extensions are likely
to become more common with increased availability of fully versioned
projects with permissive licenses.

A similar argument can be made for data as well. Even publications that
deposit data in persistent repositories rarely share the original raw
data. The versions submitted to persistent repositories are often
\emph{cleaned} and finalized versions of datasets. In cases where no
datasets are deposited, the only data accessible are likely mean values
reported in the main text or appendix of a paper. Raw data can be
leveraged to answer questions not originally intended by the authors.
For example, research areas that address questions about uncertainty
often require messy raw data to test competing methods. Thus, versioned
data provide opportunities to retrieve copies before they have been
modified for use in different contexts and have lost some of their
utility.
\section{Conclusions}
Wider use of Git has the potential to revolutionize scholarly
communication and increase opportunities for reuse, novel synthesis, and
new collaborative efforts. Since Git is a standard tool that is widely
used and backed by a large developer community, there are numerous
resources for learning (official tutorial at \url{http://git-scm.com/})
and seeking help. With disciplined use of Git, individual scientists and
labs can ensure that the entire timeline of events that occur over the
development of a research project are securely logged in a system that
provides security against data loss and encourages risk-free exploration
of new ideas and approaches. In an era with shrinking research budgets,
scientists are under increasing pressure to produce more with less. If
more granular sharing via Git reduces time spent developing new
software, or repeating expensive data collection efforts, then everyone
stands to benefit. Scientists should note that these efforts do not have
to be viewed as entirely altruistic. In a recent mandate the National
Science Foundation {[}20{]} has expanded its merit guidelines to include
a range of academic products such as software and data, in addition to
peer-reviewed publications. With the rise in the use of altmetric tools that
track and credit such efforts, everyone can benefit {[}21{]}.

Although I have laid out various arguments for why more scientists
should be using Git, one should be careful not to view Git as a one stop
solution to all the problems facing reproducibility in science. Git can
be readily used without any knowledge of command-line tools due to the
availability of many fully featured Git graphical user interfaces
(\url{http://git-scm.com/downloads/guis}). However, leveraging its full
potential, especially when working on complex projects where one might
encounter unwieldy merge conflicts, comes at a significant learning
cost. There are also comparable alternatives to Git (e.g.~Mercurial)
which offer less granularity but are more user-friendly. While time
invested in becoming proficient in Git would be valuable in the
long-term, most scientists do not have the luxury of learning software
skills that do not address more immediate problems. Despite the fact
that scientists spend considerable time using and creating their own
software to address domain specific needs, good programming practices
are rarely taught {[}@Wilson2012{]}. Therefore wider adoption of useful
tools like Git will require greater software development literacy among
scientists. On a more optimistic note, such literacy is slowly becoming
common in the new generation of academics, driven in part by efforts
such as Software Carpentry \url{http://software-carpentry.org/} and
newer courses taught in graduate curricula (e.g.~Programming for
biologists \url{http://www.programmingforbiologists.org/} taught at Utah
State University).
\subsection{List of Abbreviations}
VCS: Version Control System; NSF: National Science Foundation; CSV:
Comma Separated Values.
\subsection{Acknowledgements}
Comments from Carl Boettiger, Yoav Ram, David Jones, and Scott
Chamberlain on earlier drafts greatly improved the final version of this
article. I also thank the rOpenSci project (\url{http://ropensci.org})
for helping me gain a greater appreciation for Git as a tool for
advancing science. This manuscript is available both as a Git repository
with a full history of changes
(\url{https://github.com/karthikram/smb_git.git}) and as a permanently
archived copy on figshare
(\url{http://dx.doi.org/10.6084/m9.figshare.155613}).
\subsection{Author contributions}
KR conceived and wrote the manuscript. The author has read and approved
the manuscript.
\subsection{Competing interests}
I declare that I have no competing interests.
\subsection{Funding support}
The author did not receive any specific funding for this work.
\subsection{Literature Cited}
1. Vink CJ, Paquin P, Cruickshank RH (2012) Taxonomy and Irreproducible
Biological Science. BioScience 62: 451--452. Available:
\url{http://www.bioone.org/doi/abs/10.1525/bio.2012.62.5.3}.
2. Begley CG, Ellis LM (2012) Drug development: Raise standards for
preclinical cancer research. Nature 483: 531--3. Available:
\url{http://dx.doi.org/10.1038/483531a}.
3. Schwab M, Karrenbach M, Claerbout J (2000) Making Scientific
Computations Reproducible. Computing in Science \& Engineering 2: 61--67.
Available:
\url{http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=881708}.
4. Ince DC, Hatton L, Graham-Cumming J (2012) The case for open computer
programs. Nature 482: 485--8. Available:
\url{http://dx.doi.org/10.1038/nature10836}.
5. Whitlock MC, McPeek MA, Rausher MD, Rieseberg L, Moore AJ (2010) Data
archiving. The American Naturalist 175: 145--6. Available:
\url{http://www.jstor.org/stable/10.1086/650340}.
6. Vines TH, Andrew RL, Bock DG, Franklin MT, Gilbert KJ, et al. (2013)
Mandated data archiving greatly improves access to research data. FASEB
Journal: official publication of the Federation of American Societies for
Experimental Biology. doi:10.1096/fj.12-218164
7. Wolkovich EM, Regetz J, O'Connor MI (2012) Advances in global change
research require open science by individual researchers. Global Change
Biology 18: 2102--2110. Available:
\url{http://apps.webofknowledge.com/full_record.do?product=UA&search_mode=GeneralSearch&qid=1&SID=1CfaPnJ9gbl5bo171Jc&page=1&doc=4}.
8. Wald C (2010) Issues \& Perspectives Scientists Embrace Openness.
Available:
\url{http://sciencecareers.sciencemag.org/career_magazine/previous_issues/articles/2010_04_09/caredit.a1000036}.
Accessed 16 Jan 2013.
9. Greenland P, Fontanarosa PB (2012) Ending honorary authorship.
Science (New York, N.Y.) 337: 1019. Available:
\url{http://www.sciencemag.org/content/337/6098/1019.short}.
10. Schultheiss SJ, Münch M-C, Andreeva GD, Rätsch G (2011) Persistence
and availability of Web services in computational biology. PLoS ONE 6:
e24914. Available:
\url{http://dx.plos.org/10.1371/journal.pone.0024914}.
11. Wren JD (2004) 404 not found: the stability and persistence of URLs
published in MEDLINE. Bioinformatics (Oxford, England) 20: 668--72.
Available:
\url{http://bioinformatics.oxfordjournals.org/content/20/5/668.abstract}.
12. Prlić A, Procter JB (2012) Ten Simple Rules for the Open Development
of Scientific Software. PLoS Computational Biology 8: e1002802.
Available: \url{http://dx.plos.org/10.1371/journal.pcbi.1002802}.
13. Pearson DP (2013) GitHub sees 3 millionth member account. Available:
\url{http://www.gamesindustry.biz/articles/2013-01-17-github-sees-3-millionth-member-account}.
Accessed 18 Jan 2013.
14. Finley K (2011) Github Has Surpassed Sourceforge and Google Code in
Popularity. Available:
\url{http://readwrite.com/2011/06/02/github-has-passed-sourceforge}.
Accessed 15 Jan 2013.
15. GitHub Blog (n.d.) The Octoverse in 2012. Available:
\url{https://github.com/blog/1359-the-octoverse-in-2012}. Accessed
1 Feb 2013.
16. Morin A, Urban J, Sliz P (2012) A quick guide to software licensing
for the scientist-programmer. PLoS Computational Biology 8: e1002598.
Available: \url{http://dx.plos.org/10.1371/journal.pcbi.1002598}.
17. Piwowar HA (2011) Who shares? Who doesn't? Factors associated with
openly archiving raw research data. PLoS ONE 6: e18657. Available:
\url{http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=3135593&tool=pmcentrez&rendertype=abstract}.
18. Alsheikh-Ali AA, Qureshi W, Al-Mallah MH, Ioannidis JPA (2011)
Public availability of published research data in high-impact journals.
PLoS ONE 6: e24357. Available:
\url{http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=3168487&tool=pmcentrez&rendertype=abstract}.
19. Niedermeyer THJ, Strohalm M (2012) mMass as a software tool for the
annotation of cyclic peptide tandem mass spectra. PLoS ONE 7: e44913.
Available:
\url{http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=3441486&tool=pmcentrez&rendertype=abstract}.
20. US NSF (2012) Dear Colleague Letter - Issuance of a new NSF Proposal \&
Award Policies and Procedures Guide (NSF13004). Available:
\url{http://www.nsf.gov/pubs/2013/nsf13004/nsf13004.jsp?WT.mc_id=USNSF_109}.
Accessed 11 Nov 2012.
21. Piwowar H (2013) Altmetrics: Value all research products. Nature
493: 159. Available:
\url{http://www.nature.com/doifinder/10.1038/493159a}.
\end{document}