Commit 90bc502: wiki: added Guidelines vignette
cvanderaa committed Dec 4, 2023 (1 parent: 154024b)
Showing 1 changed file: inst/wiki/GUIDELINES.md (262 additions, 0 deletions)
Welcome to the `scpdata` package, and thank you for your interest in
contributing!

The `scpdata` data package is a repository of curated mass
spectrometry-based single-cell proteomics (SCP) datasets. The purpose
of `scpdata` is to provide users with streamlined access to
high-quality SCP data, alleviating the need for time-consuming data
wrangling. We currently provide data at the peptide-to-spectrum match
(PSM) level, the peptide level and/or the protein level. The package
also encompasses a large diversity of technologies, including DDA and
DIA, label-free and multiplexed experiments from various laboratories
such as the Slavov Lab, the Kelly Lab, and the Schoof Lab.

Contributions are very much welcome. We happily accept major
contributions, such as adding a new dataset, as well as minor
contributions, such as fixing typos or improving the current documentation.

To facilitate our collaboration, this wiki page will guide you through
the process of adding a new dataset to the package. We will first get
you started with some basic guidelines on how to contribute using
GitHub. We'll proceed with a description of the data structure and the
data pieces we expect. Next, we will provide an overview of the
package's folder structure to help you navigate through the project.
Finally, we'll explain the workflow you should follow to add your
dataset to the repository.

# Getting started with GitHub

1. Fork the `scpdata` GitHub repository ([click
here](https://github.com/UCLouvain-CBIO/scpdata/fork)).
2. Clone the forked repo locally using `git`:

```
git clone git@github.com:YOUR_USER_NAME/scpdata
```
3. Adapt the cloned repo as desired. Do not forget to regularly `git
commit` your changes.
4. Once finished, send your improvements and/or new features as a [pull
request](https://github.com/UCLouvain-CBIO/scpdata/compare).

If you have any questions or face any hurdles, do not hesitate to open
a [new
issue](https://github.com/UCLouvain-CBIO/scpdata/issues/new/choose)
and we'll be happy to provide additional guidance.

# What do we expect?

## `QFeatures` object

All datasets in `scpdata` are stored in a `QFeatures` object (see
[intro
vignette](https://uclouvain-cbio.github.io/scp/articles/QFeatures_nutshell.html)).
The object is created following the
[`scp`](https://github.com/UCLouvain-CBIO/scp) data framework, as
described in [this short
demo](https://uclouvain-cbio.github.io/scp-teaching/read_scp_data).
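For illustration, here is how a user retrieves one of the existing
datasets, assuming `scpdata` and its dependencies are installed
(`specht2019v3()` is one of the published datasets):

```
library(scpdata)
scp <- specht2019v3()  ## downloads the QFeatures object via ExperimentHub
scp                    ## overview of the assays (PSMs, peptides, proteins)
colData(scp)           ## sample (single-cell) annotations
```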

### Feature data

We refer to feature data as the data generated by MS data
identification and quantification tools. Depending on the tool,
features may represent PSMs, peptides and/or proteins. For instance,
MaxQuant provides an `evidence.txt` file with PSM-level information,
a `peptides.txt` file with peptide-level information and
`proteinGroups.txt` with protein-level information. We encourage you
to add as many of the three feature levels as possible when
contributing a dataset to `scpdata`.

For each feature, the tools provide quantification data as well as
feature annotations. These two pieces of information should be
separated in a `SingleCellExperiment` object. Feature annotations are
stored in the `rowData` and the quantitative values are stored in the
`assay`.
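As a minimal sketch, with made-up feature data, the split between
quantitative values (`assay`) and feature annotations (`rowData`)
looks like this:

```
library(SingleCellExperiment)
## Quantitative values: one row per feature, one column per sample
quant <- matrix(rnorm(12), nrow = 4,
                dimnames = list(paste0("PSM", 1:4), paste0("cell", 1:3)))
## Feature annotations: one row per feature
annot <- DataFrame(Sequence = c("PEPTIDEA", "PEPTIDEB", "PEPTIDEC", "PEPTIDED"),
                   Protein  = c("P1", "P1", "P2", "P2"))
sce <- SingleCellExperiment(assays = list(intensity = quant),
                            rowData = annot)
```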

### Sample annotations

Sample annotations contain information about each sample (single cell)
in the dataset. This information is generated by the experimenter
and should contain biological descriptors, such as the cell line or
the treatment applied, and technical descriptors, such as the day of
acquisition, the acquisition batch, the LC batch, etc. The sample
annotations are stored in the `colData` of the `QFeatures` object.

If you want to contribute to `scpdata` with a dataset you generated
yourself, we suggest you read the last section of the initial
recommendations for SCP experiments, which provides a comprehensive
discussion of the descriptors of interest you should collect:

> Gatto, Laurent, Ruedi Aebersold, Juergen Cox, Vadim Demichev, Jason
> Derks, Edward Emmott, Alexander M. Franks, et al. 2023. “Initial
> Recommendations for Performing, Benchmarking and Reporting
> Single-Cell Proteomics Experiments.” Nature Methods 20 (3): 375–86.

## Experiment description

We also require the collection of experimental data that describes the
dataset. This information is commonly retrieved from the publication
associated with the dataset and provides a scientific context to the
dataset. This information is used for building the dataset
documentation.

## Data source information

Finally, the `ExperimentHub` project, on which `scpdata` relies,
requires a thorough description of the data sources for every
dataset.

# Folder structure

Here we provide an overview of the key folders and files relevant
when contributing a new dataset. The existing files may provide a
source of inspiration when preparing a new dataset.

## inst/scripts/

The folder contains all R scripts used to generate the `QFeatures`
objects from the source files, one script for each dataset. Each
script is named as follows: `make-data_` + `DATASET_NAME` + `.R`.

Note the file called `make-metadata.R`. It generates a CSV table
required by `ExperimentHub`, where each line corresponds to a dataset
and the columns contain the data source information. The table is
stored in `inst/extdata/metadata.csv`, which should never be edited
manually.

## R/

The folder contains 3 R scripts, but new contributions only need to
consider `data.R` and can safely ignore the other two. The `data.R`
script contains the documentation for each dataset, formatted using
`roxygen2` markup.

## man/

The folder contains the compiled documentation manuals, one for each
dataset. These are automatically generated by `roxygen2` and should
therefore never be changed manually.

# Workflow

In practice, contributing a new dataset involves six steps.

## 1. Collect data

If you want to contribute an already published dataset, identify the
data source for all feature data and the sample annotations. This is
generally provided in the article, but you may need to request
additional information from the authors.

If you want to contribute your own dataset, make sure that all
feature data and the sample annotation table are available from a
public repository (e.g. PRIDE, MassIVE or Zenodo).

## 2. Create the `QFeatures` object

Create a new R script, `inst/scripts/make-data_DATASET_NAME.R`, which
contains all the code to convert the data source data into the
`QFeatures` object. Here are some tips and tricks for generating a
high-quality dataset:

- Sample annotations are often cluttered, spread over different
  tables or embedded within sample names. Generating high-quality
  sample annotations may be time-consuming and frustrating. Don't
  overlook this task: sample annotations are essential for rigorous
  and accurate downstream analysis.
- Converting feature data tables and annotation tables into
`QFeatures` or `SingleCellExperiment` objects can be streamlined
using
[`scp::readSCP()`](https://uclouvain-cbio.github.io/scp/reference/readSCP.html)
and
[`scp::readSingleCellExperiment()`](https://uclouvain-cbio.github.io/scp/reference/readSingleCellExperiment.html),
respectively.
- Always start with the lowest feature level (e.g. PSMs). If available,
you should add peptide and protein data using
[`QFeatures::addAssay()`](https://rformassspectrometry.github.io/QFeatures/reference/QFeatures-class.html).
You should then add links between the assays. This is streamlined
using
[`QFeatures::addAssayLink()`](https://rformassspectrometry.github.io/QFeatures/reference/AssayLinks.html).
- Make sure to add data with as little processing as possible. For
  instance, MaxQuant provides peptide intensities, but also iBAQ and
  MaxLFQ normalised values. You should favour the former over the
  latter two, which you could add as supplementary assays (for
  example, see
  [here](https://github.com/UCLouvain-CBIO/scpdata/blob/master/inst/scripts/make-data_woo2022_macrophage.R)).
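As a hedged sketch of such a make-data script (the file names, column
names, assay names and exact argument names below are all hypothetical
and depend on the dataset and on the installed version of `scp`):

```
library(scp)
library(QFeatures)

psms    <- read.delim("evidence.txt")            ## PSM-level feature data
coldata <- read.delim("sample_annotations.txt")  ## sample annotations
quantCols <- grep("Reporter.intensity", colnames(psms), value = TRUE)

## One SingleCellExperiment per MS run, annotated with the sample table
scp <- readSCP(psms, coldata,
               runCol = "Raw.file",  ## column identifying the MS runs
               quantCols = quantCols)

## Add the peptide-level table as an extra assay ...
pepTable <- read.delim("peptides.txt")
peptides <- readSingleCellExperiment(
    pepTable,
    quantCols = grep("Reporter.intensity", colnames(pepTable)))
rownames(peptides) <- pepTable$Sequence
scp <- addAssay(scp, peptides, name = "peptides")

## ... and link it to a PSM assay through a shared feature variable
## ("evidence" stands in for whatever the PSM assay is actually named)
scp <- addAssayLink(scp, from = "evidence", to = "peptides",
                    varFrom = "Sequence", varTo = "Sequence")
```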

## 3. Document the dataset

Add the data documentation and the data collection procedure in
`scpdata/R/data.R`, using `roxygen2` markup. The documentation is
structured as follows, but it is best to use the documentation of an
existing dataset as a template:

- *Title*: First author et al. Year (Journal): minimal description.
- Short description of the data set. What and how many cells were
acquired? What technology? What is the research question?
- *Format*: describe your `QFeatures` object. Describe each assay,
  namely which feature level it contains, the number of features and
  the number of cells/samples.
- *Data acquisition*: summarise the data acquisition protocol, namely
the sample isolation, sample preparation, liquid chromatography,
mass spectrometry and raw data processing.
- *Data collection*: summarise the steps you undertook to generate the
`QFeatures` object, and where to find the script you created.
- *Source*: link to the public repository with the source data.
- *References*: if published, refer to the original work that
acquired the data.
- *Example*: add an example to show how to retrieve the dataset. To
avoid the associated overhead when testing the package, we recommend
adding the example as follows:

```
##' \donttest{
##' dataset_name()
##' }
```
- *Keywords*: add the line `##' @keywords datasets`
- `"dataset_name"`: end the documentation with the name of your
dataset, ensuring your data set is correctly exported.
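Putting these pieces together, a documentation entry might look like
the following skeleton; the dataset name, journal, assay names and
numbers are all hypothetical placeholders:

```
##' Doe et al. 2023 (J. Prot. Res.): hypothetical example dataset
##'
##' Short description of the data set: what and how many cells were
##' acquired, with which technology, to answer which research question.
##'
##' @format A QFeatures object with 3 assays: `psms` (PSM-level data),
##'     `peptides` and `proteins`, with the number of features and
##'     cells/samples for each.
##'
##' @section Data collection: summary of the steps used to generate
##'     the QFeatures object, pointing to the make-data script.
##'
##' @source Placeholder: link to the public repository with the data.
##'
##' @references Placeholder: reference to the original publication.
##'
##' @examples
##' \donttest{
##' doe2023()
##' }
##'
##' @keywords datasets
"doe2023"
```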

## 4. Update metadata

Add the data source information in the `inst/scripts/make-metadata.R`
script and run the complete script to update
`inst/extdata/metadata.csv`. You can use a previous dataset as a
template. All fields are mandatory: Title, Description, BiocVersion,
Genome, SourceType, SourceUrl, SourceVersion, Species, TaxonomyId,
Coordinate_1_based, DataProvider, Maintainer, RDataClass,
DispatchClass, PublicationDate, NumberAssays, PreprocessingSoftware,
LabelingProtocol, PsmsAvailable, PeptidesAvailable, ProteinsAvailable,
ContainsSingleCells, Notes. See
`?ExperimentHubData::makeExperimentHubMetadata` for a comprehensive
description of the fields.

Next, ensure that your updated `metadata.csv` file is valid by
running `ExperimentHubData::makeExperimentHubMetadata("scpdata")`.
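For illustration, a new entry in `make-metadata.R` might look as
follows; every value below is a placeholder, and the exact
construction should follow the existing script:

```
## Hypothetical metadata entry; all field values are placeholders
meta <- data.frame(
    Title = "doe2023",
    Description = "Hypothetical SCP dataset from Doe et al. 2023",
    BiocVersion = "3.18",
    Genome = NA,
    SourceType = "TXT",
    SourceUrl = "https://www.ebi.ac.uk/pride/",  ## placeholder URL
    SourceVersion = "December 2023",
    Species = "Homo sapiens",
    TaxonomyId = "9606",
    Coordinate_1_based = NA,
    DataProvider = "Doe Lab",
    Maintainer = "Jane Doe <jane.doe@example.com>",
    RDataClass = "QFeatures",
    DispatchClass = "Rda",
    PublicationDate = as.character(Sys.Date()),
    NumberAssays = 3,
    PreprocessingSoftware = "MaxQuant",
    LabelingProtocol = "TMT-16",
    PsmsAvailable = TRUE,
    PeptidesAvailable = TRUE,
    ProteinsAvailable = TRUE,
    ContainsSingleCells = TRUE,
    Notes = NA)
## The real script combines the entries for all datasets before
## writing inst/extdata/metadata.csv
```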

## 5. Create a pull request

Push the changes you made to GitHub and open a pull request to notify
us of your contribution. The pull request should include all the
commits related to the dataset you want to contribute. Indicate in the
description where we can retrieve your `QFeatures` object, e.g.
through Zenodo.

## 6. Almost done!

Once your pull request is submitted, we take over and proceed with
the following steps:

1. We will review your changes to ensure they comply with the above
guidelines. We may request changes if needed.
2. We will contact the Bioconductor team to upload
your Rda to Microsoft Azure, if needed, and to update the
`metadata.csv` on their server. See the [help
page](https://bioconductor.org/packages/devel/bioc/vignettes/HubPub/inst/doc/CreateAHubPackage.html#uploading-data-to-microsoft-azure-genomic-data-lake)
for more information.
3. We will compile the documentation with `roxygen2` and check that
the package is still valid. We may request changes if needed.
4. We will update the `NEWS.md` file and bump the package version.
5. If this is your first contribution, we will add your name to the
package authors.
