Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add reference recommendations to usage docs #1314

Merged
merged 15 commits into from
Jul 15, 2024
Merged
46 changes: 43 additions & 3 deletions docs/usage.md
Original file line number Diff line number Diff line change
Expand Up @@ -137,6 +137,12 @@ If unique molecular identifiers were used to prepare the library, add the follow

Please refer to the [nf-core website](https://nf-co.re/usage/reference_genomes) for general usage docs and guidelines regarding reference genomes.

:::note
**Consistent reference resource usage**

When supplying reference files as discussed below, it is important to be consistent in the reference resource used (Ensembl, GENCODE, UCSC etc), since differences in conventions between these resources can make their files incompatible. For example, UCSC prefixes chromosomes with `chr`, while Ensembl does not, so a GTF file from Ensembl should not be supplied alongside a genome FASTA from UCSC. GENCODE also attaches version identifiers to gene and transcript names (e.g. `ENSG00000254647.1`) while Ensembl does not.
:::

### Explicit reference file specification (recommended)

The minimum reference genome requirements for this pipeline are a FASTA and GTF file, all other files required to run the pipeline can be generated from these files. For example, the latest reference files for human can be derived from Ensembl like:
Expand All @@ -158,7 +164,39 @@ Notes:
- If `--additional_fasta` is provided then the features in this file (e.g. ERCC spike-ins) will be automatically concatenated onto both the reference FASTA file as well as the GTF annotation before building the appropriate indices.
- When using `--aligner star_rsem`, both the STAR and RSEM indices should be present in the path specified by `--rsem_index` (see [#568](https://github.com/nf-core/rnaseq/issues/568)).

#### Indices
### Reference genome
lazappi marked this conversation as resolved.
Show resolved Hide resolved

It is recommended to provide the most complete reference genome for your species, without additional loci (haplotypes) or patches. For model organisms such as mouse or human, this is the "primary assembly," which includes the reference chromosomes and some additional scaffolds. For the human assembly GRCh38 (hg38), use the `GRCh38.primary_assembly.genome.fa.gz` file from GENCODE or the `Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz` file from Ensembl. These files cover the largest portion of the reference genome without including multiple copies of the same sequence, which can confuse aligners like STAR.
lazappi marked this conversation as resolved.
Show resolved Hide resolved

Most other species (e.g., fly, cow, dog) do not have a primary assembly. In these cases, use the complete reference sequence, or "toplevel" assembly. The main difference between the primary and toplevel assemblies is the inclusion of alternative loci (haplotypes), which typically do not exist for species outside of mouse and human.
lazappi marked this conversation as resolved.
Show resolved Hide resolved
lazappi marked this conversation as resolved.
Show resolved Hide resolved

### Gene annotation
lazappi marked this conversation as resolved.
Show resolved Hide resolved

Gene annotations are updated more frequently than the reference genome sequence, so you much choose an appropriate annotation version (e.g. Ensembl release) We recommend using sources with well-defined, versioned releases such as ENSEMBL or GENCODE. Generally, it is best to use the most recent release for the latest gene annotations. However, if you are combining your data with older datasets, use the annotation version previously used for consistency.

Once you have chosen a release, select the annotation file that matches your reference genome. For the human primary assembly, use the comprehensive annotation (e.g., `gencode.{release}.primary_assembly.annotation.gtf.gz` from GENCODE or `Homo_sapiens.GRCh38.{release}.gtf.gz` from Ensembl). For other species, like fly, use the annotation matching the toplevel assembly (e.g., `Drosophila_melanogaster.BDGP6.46.{release}.gtf.gz` from Ensembl).

GENCODE also provides "basic" annotations, which include only representative transcripts, but we do not recommend using these.
lazappi marked this conversation as resolved.
Show resolved Hide resolved

Gene annotations provide a primary identifier for each feature as well as a common name. For example, the Ensembl ID `ENSG00000254647` corresponds to the `INS` gene, which encodes the insulin protein. While gene names are more familiar, it is crucial to retain and use the primary identifiers as they are unique and easier to map between annotation versions or sources.
lazappi marked this conversation as resolved.
Show resolved Hide resolved

To take advantage of all the quality control modules implemented in the pipeline, the gene annotation should include a `gene_biotype` field which describes the function of each feature (protein coding, long non-coding etc.). This is usually the case for annotations from GENCODE or Ensembl but may not be if your annotation comes from another source. If your annotation does not include this field, please set the `--skip_biotype_qc` option to avoid running the steps that rely on it.

:::note
**GTF vs GFF**

GFF (General Feature Format) is a tab-separated text file format for representing genomic annotations, while GTF (General Transfer Format) is a specific implementation of this format corresponding to GFF version 2. The pipeline can accept both GFF and GTF but any GFF files will be converted to GTF so if a GTF is available for your annotation of choice it is better to provide that directly.

More information and links to further resources are [available from Ensembl](https://www.ensembl.org/info/website/upload/gff.html).
:::

### Reference transcriptome
pinin4fjords marked this conversation as resolved.
Show resolved Hide resolved

In addition to the reference genome sequence and annotation, you can provide a reference transcriptome FASTA file. These files can be obtained from GENCODE or Ensembl. However, these sequences only cover the reference chromosomes and can cause inconsistencies if you are using a primary or toplevel genome assembly and annotation.

We recommend not providing a transcriptome FASTA file and instead allowing the pipeline to create it from the provided genome and annotation. Similar to aligner indexes, you can save the created transcriptome FASTA and BED files to a central location for future pipeline runs. This helps avoid redundant computation and having multiple copies on your system. Ensure that all genome, annotation, transcriptome, and index versions match to maintain consistency.

### Indices
pinin4fjords marked this conversation as resolved.
Show resolved Hide resolved

By default, indices are generated dynamically by the workflow for tools such as STAR and Salmon. Since indexing is an expensive process in time and resources you should ensure that it is only done once, by retaining the indices generated from each batch of reference files:

Expand All @@ -167,14 +205,16 @@ By default, indices are generated dynamically by the workflow for tools such as

Once you have the indices from a workflow run you should save them somewhere central and reuse them in subsequent runs using custom config files or command line parameters such as `--star_index '/path/to/STAR/index/'`.

#### Gencode
When doing this, it is important to record the genome and annotations versions they correspond to so you can easily locate the correct index to use and the program version as an index produced with one version may not have a format compatible with other versions.
lazappi marked this conversation as resolved.
Show resolved Hide resolved

### GENCODE
pinin4fjords marked this conversation as resolved.
Show resolved Hide resolved

If you are using [GENCODE](https://www.gencodegenes.org/) reference genome files please specify the `--gencode` parameter because the format of these files is slightly different to ENSEMBL genome files:

- The `--gtf_group_features_type` parameter will automatically be set to `gene_type` as opposed to `gene_biotype`, respectively.
- If you are running Salmon, the `--gencode` flag will also be passed to the index building step to overcome parsing issues resulting from the transcript IDs in GENCODE fasta files being separated by vertical pipes (`|`) instead of spaces (see [this issue](https://github.com/COMBINE-lab/salmon/issues/15)).

#### Prokaryotic genome annotations
### Prokaryotic genome annotations
pinin4fjords marked this conversation as resolved.
Show resolved Hide resolved

This pipeline uses featureCounts to generate QC metrics based on [biotype](http://www.ensembl.org/info/genome/genebuild/biotypes.html) information available within GFF/GTF genome annotation files. The format of these annotation files can vary significantly depending on the source of the annotation and the type of organism. The default settings in the pipeline are tailored towards Ensembl GTF annotations available for eukaryotic genomes. Prokaryotic genome annotations tend to be distributed in GFF format which are structured differently in terms of the feature naming conventions. There are a number of ways you can tune the behaviour of the pipeline to cater for differences/absence of biotype information:

Expand Down
Loading