%setcounter{page}{1}
\renewcommand{\thepage}{Data Management Plan - Page \arabic{page} of 2}
%\required{Supplementary Documentation}
\required{Data Management Plan}
%\begin{center}
%\emph{Maximum of 2 pages}
%\end{center}
% Maximum of 2 pages
%------------------------------
% This supplement should describe how the proposal will conform to NSF policy on
% the dissemination and sharing of research results and may include:
%
% 1. The types of data, samples, physical collections, software, curriculum
% materials, and other materials to be produced in the course of the project;
%
% 2. The standards to be used for data and metadata format and content (where
% existing standards are absent or deemed inadequate, this should be documented
% along with any proposed solutions or remedies);
%
% 3. Policies for access and sharing including provisions for appropriate
% protection of privacy, confidentiality, security, intellectual property, or
% other rights or requirements;
%
% 4. Policies and provisions for re-use, re-distribution, and the production of
% derivatives; and
%
% 5. Plans for archiving data, samples, and other research products, and for
% preservation of access to them.
%
% A valid Data Management Plan may include only the statement that no detailed
% plan is needed, as long as the statement is accompanied by a clear
% justification. Proposers who feel that the plan cannot fit within the supplement
% limit of two pages may use part of the 15-page Project Description for
% additional data management information. Proposers are advised that the Data
% Management Plan may not be used to circumvent the 15-page Project Description
% limitation.
%(A-1) Sharing of Results and Management of Intellectual Property (maximum 3 pages): Describe the management of intellectual property rights related to the proposed project, including plans for sharing data, information, and materials resulting from the award. This plan must be specific about the nature of the results to be shared, the timing and means of release, and any constraints on release. The proposed plan must take into consideration the following conditions where applicable:
%
%Sequences resulting from high-throughput large-scale sequencing projects (low pass whole genome sequencing, BAC end sequencing, ESTs, full-length cDNA sequencing, etc.) must be released according to the currently accepted community standard (e.g. Bermuda/Ft. Lauderdale agreement) to public databases (GenBank if applicable), as soon as they are assembled and quality checked against a stated, pre-determined quality standard. "At publication" is not acceptable.
%Proposals that would develop genome-scale expression data through approaches such as microarrays or next-generation sequencing should meet community standards for these data (for example, Minimum Information About a Microarray Experiment or MIAME standards). The community databases (e.g. Gene Expression Omnibus) into which the data would be deposited, in addition to any project database(s) should be indicated.
%If the proposed project would produce genome-scale data sets generated using proteomics and/or metabolomics approaches, NSF encourages that they be made available as soon as their quality is checked to satisfy the specifications approved prior to funding. The timing of release should be stated clearly in the proposal. The community databases into which the data would be deposited, in addition to any project database(s) should be indicated.
%If the proposed project would produce community resources (biological materials, software, etc.), NSF encourages that they be made available as soon as their quality is checked to satisfy the specifications approved prior to funding. The timing of release should be stated clearly in the proposal. The resources produced must be available to all segments of the scientific community, including industry. A reasonable charge is permissible, but the fee structure must be outlined clearly in the proposal. If accessibility differs between industry and the academic community, the differences must be clearly spelled out. If a Material Transfer Agreement is required for release of project outcomes, the terms must be described in detail.
%When the project involves the use of proprietary data or materials from other sources, the data or materials resulting from NSF funded research must be readily available without any restrictions to the users of such data or materials (no reach-through rights). The terms of any usage agreements should be stated clearly in the proposal.
%Budgeting and planning for short-term and long-term distribution of the project outcomes must be described in the proposal. If a fee is to be charged for distribution of project outcomes, the details should be described clearly in the proposal. Letters of commitment should be provided from databases or stock centers that would distribute project outcomes, including an indication of what activities would be undertaken and funds needed for these activities (if any).
%In case of a multi-institutional proposal, the lead institution is responsible for coordinating and managing the intellectual property resulting from the PGRP award. Institutions participating in multi-institutional projects should formulate a coherent plan for the project prior to submission of the proposal.
%
\subsection*{Data Types}
This proposal will make use of existing maize and teosinte datasets from multiple sources, as well as data currently being generated. Unless additional genotyping is needed in Objective \ref{validation}, this project will generate only simulated datasets.
The empirical maize and teosinte datasets are used in all objectives of the proposal, while simulated data are produced in Objectives \ref{modeling}, \ref{domestication}, and \ref{surfing}.
\paragraph{Sequence, Genotype, and Phenotype Data}
This project will make use of published data as well as data already generated or currently being generated as part of a related Plant Genome project (Biology of Rare Alleles IOS-1238014), of which sponsoring scientist Dr. Ross-Ibarra is a Co-PI. As part of that project, Dr. Ross-Ibarra's group is currently sequencing 70 teosinte and 55 maize genomes to high depth. These sequences should be complete by the end of 2015 and will be made publicly available in early 2016 via the group website (\url{www.panzea.org}). The individuals used for genome sequencing are also the parents of two large mapping populations of $\sim$5000 progeny. Both populations have been genotyped and phenotyped for a number of traits, including seed yield, flowering time, and plant height, and will be used for comparisons in Objectives \ref{modeling} and \ref{domestication}. Finally, the same project has developed, genotyped, and phenotyped a synthetic population derived from 11 teosinte and 26 maize parents, resulting in a population that is roughly 12\% teosinte, 40\% B73 (the reference genome line), and 2\% from each of 25 other inbred maize lines. This latter population will be used in Objective \ref{validation} to compare model predictions to maize and teosinte regions segregating in a common population. All of these data are currently available from collaborators. Publication of results from analyses of these data must await publication of the collaborators' GWAS analyses, but because those analyses are already underway and comparisons with empirical data will likely not begin until early 2017 at the earliest, we do not foresee this being an issue. In the unlikely event of a problem in generating those data, existing data are publicly available for teosinte \citep{Weber:2009} and maize \citep{Wallace:2014}, as are $\sim$1500 maize genomes through the HapMap 3 project \citep{Bukowski:2015} and 5,000 landraces, each with 1 million SNPs and several phenotypes \citep{Hearne2015}.
\paragraph{Simulated Data}
Outputs from simulations will consist of genotype data and allele effect sizes from genomic regions representing important phenotypic traits. These data will be organized and documented so that users can determine which phenotype each region represents and which file corresponds to each simulated demographic and selective scenario. All of this information will be contained in Readme files associated with each directory of simulated datasets. Additionally, links to the exact code used (hosted on GitHub) and the random number seeds will be included in publications, so that the exact simulated data can be regenerated in the future if necessary.
\subsection*{General Organization}
All obtained and generated datasets will be stored locally and backed up remotely on the 300-terabyte array for the Ayala School of Biological Sciences at UC Irvine. In the rare event of simultaneous failure of these storage facilities, maize and teosinte datasets can be re-obtained from their original sources, and simulated datasets can be re-generated from the same random seeds to produce identical results. In addition to data, analysis and simulation scripts will be stored locally and remotely on my GitHub account. I will use GitHub to version control all scripts so that previous versions are never lost and there is no confusion about which version corresponds to a publication. I already use version control and GitHub in my current work, so maintaining this workflow will pose no difficulty. Readme files will accompany each set of analysis or simulation scripts, serving as metadata for reproducing the analyses and as instructions for future users on how each script is run.
\subsection*{Data Archiving, Plan for Sharing, Public Access Policy}
The existing maize and teosinte datasets are all either currently publicly available or will be made publicly available during the completion of this research.
Simulated data will be publicly archived concurrently with publication of the research results. These datasets will be archived on the Dryad Digital Repository, unless the publishing journal directs us to another appropriate, publicly accessible data archive. Detailed metadata will be provided to ensure reproducibility of the studies. The scripts used to generate these data files and conduct analyses will be made available with detailed instructions on my GitHub account. I will post links to all data sources on my academic website, a practice I already follow.
Because all data are either simulated or come from plant datasets, sharing them raises no privacy concerns.