From d40544285d222be26c0bdef92c5449dddcd9781b Mon Sep 17 00:00:00 2001
From: Luke Zappia <lazappi@users.noreply.github.com>
Date: Tue, 11 Jun 2024 10:04:46 +0200
Subject: [PATCH 01/11] Add reference files section to usage docs

---
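Reviewer note (below the `---`, so not part of the commit message): the new section asks users to find out whether their annotation carries a `gene_biotype` field before deciding on `--skip_biotype_qc`. One way to check this is sketched below. It is only a sketch, assuming a standard nine-column, tab-separated GTF whose last column holds `key "value";` attributes. The function name `gtf_has_attribute` and the `max_records` cut-off are hypothetical, and nothing like this is shipped with the pipeline.

```python
import gzip
import re


def gtf_has_attribute(path, attribute="gene_biotype", max_records=10000):
    """Return True if any of the first `max_records` GTF records carries `attribute`.

    Hypothetical helper: assumes a nine-column, tab-separated GTF with
    `key "value";` pairs in the last column. Plain and gzipped files are accepted.
    """
    opener = gzip.open if str(path).endswith(".gz") else open
    # Match the attribute key at the start of the column or after a semicolon.
    pattern = re.compile(r'(?:^|;)\s*' + re.escape(attribute) + r'\s+"')
    seen = 0
    with opener(path, "rt") as handle:
        for line in handle:
            if line.startswith("#"):
                continue  # header or comment line
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 9:
                continue  # not a complete annotation record
            if pattern.search(fields[8]):
                return True
            seen += 1
            if seen >= max_records:
                break
    return False
```

If this returns `False` for the annotation in use, that suggests setting `--skip_biotype_qc` as the new text describes.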
 docs/usage.md | 40 ++++++++++++++++++++++++++++++++++++++++
 1 file changed, 40 insertions(+)

diff --git a/docs/usage.md b/docs/usage.md
index 7b148c636..701f7f324 100644
--- a/docs/usage.md
+++ b/docs/usage.md
@@ -55,6 +55,46 @@ An [example samplesheet](../assets/samplesheet.csv) has been provided with the p
 
 > **NB:** The `group` and `replicate` columns were replaced with a single `sample` column as of v3.1 of the pipeline. The `sample` column is essentially a concatenation of the `group` and `replicate` columns, however it now also offers more flexibility in instances where replicate information is not required e.g. when sequencing clinical samples. If all values of `sample` have the same number of underscores, fields defined by these underscore-separated names may be used in the PCA plots produced by the pipeline, to regain the ability to represent different groupings.
 
+## Reference files
+
+The only reference files required by the pipeline are a FASTA file with the reference genome sequence and a GTF/GFF file with a gene annotation. All other reference files can be created from those by the pipeline. However, selecting the appropriate reference genome and annotation to use analysis can still be difficult. Here we provide some advice on what is expected by the pipeline:
+
+:::note
+**GENCODE vs ENSEMBL**
+
+Two of the most common sources of genomic references are GENCODE (for mouse and human) and ENSEMBL (for many organisms). There has been an effort to standardise information between the two sources and now the references [should be consistent](https://www.gencodegenes.org/pages/faq.html) regardless of where they are obtained from (for mouse and human).
+
+However, while the information is consistent, there are still some practical differences. ENSEMBL prefixes chromosome names with `chr` (e.g. `chr1`, `chr1`, ...) while GENCODE uses simple `1`, `2`, etc. There can also be different names used for sequences outside the reference chromosomes. GENCODE also attaches version identifiers to gene and transcript names (e.g. `ENSG00000254647.1`). For these reasons, resources from the two sources cannot be mixed and it is important to stick to one reference source. Some of the steps in the pipeline expect an ENSEMBL reference by default so it is important to set the `--gencode` option if your reference comes from GENCODE.
+:::
+
+### Reference genome
+
+It is recommended to provide the most complete reference genome for your species, without additional loci (haplotypes) or patches. For models organisms such as mouse or human this is the so-called "primary assembly" which includes the reference chromosomes as well as some additional scaffolds. For human assembly GRCh38 (hg38) this would be the `GRCh38.primary_assembly.genome.fa.gz` file from GENCODE or the `Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz` file from ENSEMBL. These files are preferred as they cover the largest amount of the reference genome without including multiple copies of the same sequence which can confuse aligners such as STAR. Most other species (fly, cow, dog etc.) do not have a primary assembly, in which case the complete reference sequence, or "toplevel" assembly, should be used. The difference between the two is the inclusion of alternative loci (haplotypes) but these do not typically exist for species outside mouse and human.
+
+### Gene annotation
+
+Gene annotations are updated more frequently than the reference genome sequence and there are more options to consider here. Because annotations can be updated frequently, you should rely on sources that include well-defined, versioned releases such as ENSEMBL or GENCODE. We generally recommend using the most recent release in order to have the latest and most up-to-date gene annotations. However, if you are planning to combine your data with a dataset that was processed in the past you may want to use the annotation version that was used previously for greater consistency. Once you have decided on a release to use, you can then select an annotation file. This should be the most comprehensive annotation that matches the reference genome you are using. So if you are using the human primary assembly you would want the comprehensive annotation for the primary assembly (the `gencode.{release}.primary_assembly.annotation.gtf.gz` file from GENCODE or the `Homo_sapiens.GRCh38.{release}.gtf.gz` file from ENSEMBL). For something like fly, you would want the annotation matching the toplevel assembly (e.g. `Drosophila_melanogaster.BDGP6.46.{release}.gtf.gz` from ENSEMBL). As well as the comprehensive annotations for the primary and toplevel assemblies, and just the reference chromomes, GENCODE also provides "basic" annotations which only include representative transcripts, but we do not recommend using these.
+
+Gene annotations typically provide a primary identifier for each feature as well as a more common name. For example, the ENSEMBL ID `ENSG00000254647` corresponds to the `INS` gene which encodes the insulin protein. While the gene names may be more familiar and easier to understand it is important to retain and use the primary identifiers as the are unique for a given annotation and are much easier to map between annotation versions or sources.
+
+To take advantage of all the quality control modules implemented in the pipeline, the gene annotation should include a `gene_biotype` field which describes the function of each feature (protein coding, long non-coding etc.). This is usually the case for annotations from GENCODE or ENSEMBL but may not be if your annotation comes from another source. If your annotation does not include this field, please set the `--skip_biotype_qc` option to avoid running the steps that rely on it.
+
+:::note
+**GTF vs GFF**
+
+GFF (General Feature Format) is a tab-separated text file format for representing genomic annotations. GTF (General Transfer Format) is a specific implementation of this format corresponding to GFF version 2. The pipeline can accept both GFF and GTF but any GFF files will be converted to GFF so if a GTF is available for your annotation of choice it is better to provide that directly.
+
+More information and links to further resources are [available from ENSEMBL](https://www.ensembl.org/info/website/upload/gff.html).
+:::
+
+### Reference transcriptome
+
+As well as the reference genome sequence and annotation it is possible to provide a reference transcriptome FASTA file. These can be obtained from GENCODE or ENSEMBL but it is important to note that the sequences they provide only cover the reference chromosome and can result in inconsistencies if you have provided a primary or toplevel genome assembly and annotation. For this reason, we recommend to not provide a transcriptome FASTA and instead let the pipeline create it from the provided genome and annotation. As with the aligner indexes, it is possible to save the created transcriptome FASTA and BED files to a central location and provide it to future pipeline runs in order to avoid having multiple copies on your system but it is important to make sure that all genome, annotation, transcriptome and index versions match.
+
+### Indexes
+
+Creating the index files required for the alignment and/or pseudoalignment steps can be computationally intensive and the files they produce are quite large. To avoid repeating this work and having multiple redundant files we recommend saving the indexes using the `--save_reference` option and moving them to a central location where they can be accessed by future pipeline runs. When doing this, it is important to record the genome and annotations versions they correspond to so you can easily locate the correct index to use and the program version as an index produced with one version may not have a format compatible with other versions.
+
 ## Adapter trimming options
 
 [Trim Galore!](https://www.bioinformatics.babraham.ac.uk/projects/trim_galore/) is a wrapper tool around Cutadapt and FastQC to peform quality and adapter trimming on FastQ files. Trim Galore! will automatically detect and trim the appropriate adapter sequence. It is the default trimming tool used by this pipeline, however you can use fastp instead by specifying the `--trimmer fastp` parameter. [fastp](https://github.com/OpenGene/fastp) is a tool designed to provide fast, all-in-one preprocessing for FastQ files. It has been developed in C++ with multithreading support to achieve higher performance. You can specify additional options for Trim Galore! and fastp via the `--extra_trimgalore_args` and `--extra_fastp_args` parameters, respectively.

From 6b529261b815435d4049dca20db8f7b1224c781f Mon Sep 17 00:00:00 2001
From: Luke Zappia <lazappi@users.noreply.github.com>
Date: Tue, 11 Jun 2024 10:08:32 +0200
Subject: [PATCH 02/11] Adjust line breaks

Sentence per line to allow easier commenting.
---
 docs/usage.md | 50 ++++++++++++++++++++++++++++++++++++++++----------
 1 file changed, 40 insertions(+), 10 deletions(-)

diff --git a/docs/usage.md b/docs/usage.md
index 701f7f324..ce2290071 100644
--- a/docs/usage.md
+++ b/docs/usage.md
@@ -57,43 +57,73 @@ An [example samplesheet](../assets/samplesheet.csv) has been provided with the p
 
 ## Reference files
 
-The only reference files required by the pipeline are a FASTA file with the reference genome sequence and a GTF/GFF file with a gene annotation. All other reference files can be created from those by the pipeline. However, selecting the appropriate reference genome and annotation to use analysis can still be difficult. Here we provide some advice on what is expected by the pipeline:
+The only reference files required by the pipeline are a FASTA file with the reference genome sequence and a GTF/GFF file with a gene annotation. All other reference files can be created from those by the pipeline.
+However, selecting the appropriate reference genome and annotation to use analysis can still be difficult.
+Here we provide some advice on what is expected by the pipeline:
 
 :::note
 **GENCODE vs ENSEMBL**
 
-Two of the most common sources of genomic references are GENCODE (for mouse and human) and ENSEMBL (for many organisms). There has been an effort to standardise information between the two sources and now the references [should be consistent](https://www.gencodegenes.org/pages/faq.html) regardless of where they are obtained from (for mouse and human).
+Two of the most common sources of genomic references are GENCODE (for mouse and human) and ENSEMBL (for many organisms).
+There has been an effort to standardise information between the two sources and now the references [should be consistent](https://www.gencodegenes.org/pages/faq.html) regardless of where they are obtained from (for mouse and human).
 
-However, while the information is consistent, there are still some practical differences. ENSEMBL prefixes chromosome names with `chr` (e.g. `chr1`, `chr1`, ...) while GENCODE uses simple `1`, `2`, etc. There can also be different names used for sequences outside the reference chromosomes. GENCODE also attaches version identifiers to gene and transcript names (e.g. `ENSG00000254647.1`). For these reasons, resources from the two sources cannot be mixed and it is important to stick to one reference source. Some of the steps in the pipeline expect an ENSEMBL reference by default so it is important to set the `--gencode` option if your reference comes from GENCODE.
+However, while the information is consistent, there are still some practical differences.
+ENSEMBL prefixes chromosome names with `chr` (e.g. `chr1`, `chr1`, ...) while GENCODE uses simple `1`, `2`, etc.
+There can also be different names used for sequences outside the reference chromosomes.
+GENCODE also attaches version identifiers to gene and transcript names (e.g. `ENSG00000254647.1`).
+For these reasons, resources from the two sources cannot be mixed and it is important to stick to one reference source. Some of the steps in the pipeline expect an ENSEMBL reference by default so it is important to set the `--gencode` option if your reference comes from GENCODE.
 :::
 
 ### Reference genome
 
-It is recommended to provide the most complete reference genome for your species, without additional loci (haplotypes) or patches. For models organisms such as mouse or human this is the so-called "primary assembly" which includes the reference chromosomes as well as some additional scaffolds. For human assembly GRCh38 (hg38) this would be the `GRCh38.primary_assembly.genome.fa.gz` file from GENCODE or the `Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz` file from ENSEMBL. These files are preferred as they cover the largest amount of the reference genome without including multiple copies of the same sequence which can confuse aligners such as STAR. Most other species (fly, cow, dog etc.) do not have a primary assembly, in which case the complete reference sequence, or "toplevel" assembly, should be used. The difference between the two is the inclusion of alternative loci (haplotypes) but these do not typically exist for species outside mouse and human.
+It is recommended to provide the most complete reference genome for your species, without additional loci (haplotypes) or patches.
+For models organisms such as mouse or human this is the so-called "primary assembly" which includes the reference chromosomes as well as some additional scaffolds.
+For human assembly GRCh38 (hg38) this would be the `GRCh38.primary_assembly.genome.fa.gz` file from GENCODE or the `Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz` file from ENSEMBL.
+These files are preferred as they cover the largest amount of the reference genome without including multiple copies of the same sequence which can confuse aligners such as STAR. Most other species (fly, cow, dog etc.) do not have a primary assembly, in which case the complete reference sequence, or "toplevel" assembly, should be used.
+The difference between the two is the inclusion of alternative loci (haplotypes) but these do not typically exist for species outside mouse and human.
 
 ### Gene annotation
 
-Gene annotations are updated more frequently than the reference genome sequence and there are more options to consider here. Because annotations can be updated frequently, you should rely on sources that include well-defined, versioned releases such as ENSEMBL or GENCODE. We generally recommend using the most recent release in order to have the latest and most up-to-date gene annotations. However, if you are planning to combine your data with a dataset that was processed in the past you may want to use the annotation version that was used previously for greater consistency. Once you have decided on a release to use, you can then select an annotation file. This should be the most comprehensive annotation that matches the reference genome you are using. So if you are using the human primary assembly you would want the comprehensive annotation for the primary assembly (the `gencode.{release}.primary_assembly.annotation.gtf.gz` file from GENCODE or the `Homo_sapiens.GRCh38.{release}.gtf.gz` file from ENSEMBL). For something like fly, you would want the annotation matching the toplevel assembly (e.g. `Drosophila_melanogaster.BDGP6.46.{release}.gtf.gz` from ENSEMBL). As well as the comprehensive annotations for the primary and toplevel assemblies, and just the reference chromomes, GENCODE also provides "basic" annotations which only include representative transcripts, but we do not recommend using these.
+Gene annotations are updated more frequently than the reference genome sequence and there are more options to consider here.
+Because annotations can be updated frequently, you should rely on sources that include well-defined, versioned releases such as ENSEMBL or GENCODE.
+We generally recommend using the most recent release in order to have the latest and most up-to-date gene annotations.
+However, if you are planning to combine your data with a dataset that was processed in the past you may want to use the annotation version that was used previously for greater consistency.
+Once you have decided on a release to use, you can then select an annotation file.
+This should be the most comprehensive annotation that matches the reference genome you are using.
+So if you are using the human primary assembly you would want the comprehensive annotation for the primary assembly (the `gencode.{release}.primary_assembly.annotation.gtf.gz` file from GENCODE or the `Homo_sapiens.GRCh38.{release}.gtf.gz` file from ENSEMBL).
+For something like fly, you would want the annotation matching the toplevel assembly (e.g. `Drosophila_melanogaster.BDGP6.46.{release}.gtf.gz` from ENSEMBL).
+As well as the comprehensive annotations for the primary and toplevel assemblies, and just the reference chromomes, GENCODE also provides "basic" annotations which only include representative transcripts, but we do not recommend using these.
 
-Gene annotations typically provide a primary identifier for each feature as well as a more common name. For example, the ENSEMBL ID `ENSG00000254647` corresponds to the `INS` gene which encodes the insulin protein. While the gene names may be more familiar and easier to understand it is important to retain and use the primary identifiers as the are unique for a given annotation and are much easier to map between annotation versions or sources.
+Gene annotations typically provide a primary identifier for each feature as well as a more common name.
+For example, the ENSEMBL ID `ENSG00000254647` corresponds to the `INS` gene which encodes the insulin protein.
+While the gene names may be more familiar and easier to understand it is important to retain and use the primary identifiers as the are unique for a given annotation and are much easier to map between annotation versions or sources.
 
-To take advantage of all the quality control modules implemented in the pipeline, the gene annotation should include a `gene_biotype` field which describes the function of each feature (protein coding, long non-coding etc.). This is usually the case for annotations from GENCODE or ENSEMBL but may not be if your annotation comes from another source. If your annotation does not include this field, please set the `--skip_biotype_qc` option to avoid running the steps that rely on it.
+To take advantage of all the quality control modules implemented in the pipeline, the gene annotation should include a `gene_biotype` field which describes the function of each feature (protein coding, long non-coding etc.).
+This is usually the case for annotations from GENCODE or ENSEMBL but may not be if your annotation comes from another source.
+If your annotation does not include this field, please set the `--skip_biotype_qc` option to avoid running the steps that rely on it.
 
 :::note
 **GTF vs GFF**
 
-GFF (General Feature Format) is a tab-separated text file format for representing genomic annotations. GTF (General Transfer Format) is a specific implementation of this format corresponding to GFF version 2. The pipeline can accept both GFF and GTF but any GFF files will be converted to GFF so if a GTF is available for your annotation of choice it is better to provide that directly.
+GFF (General Feature Format) is a tab-separated text file format for representing genomic annotations.
+GTF (General Transfer Format) is a specific implementation of this format corresponding to GFF version 2.
+The pipeline can accept both GFF and GTF but any GFF files will be converted to GFF so if a GTF is available for your annotation of choice it is better to provide that directly.
 
 More information and links to further resources are [available from ENSEMBL](https://www.ensembl.org/info/website/upload/gff.html).
 :::
 
 ### Reference transcriptome
 
-As well as the reference genome sequence and annotation it is possible to provide a reference transcriptome FASTA file. These can be obtained from GENCODE or ENSEMBL but it is important to note that the sequences they provide only cover the reference chromosome and can result in inconsistencies if you have provided a primary or toplevel genome assembly and annotation. For this reason, we recommend to not provide a transcriptome FASTA and instead let the pipeline create it from the provided genome and annotation. As with the aligner indexes, it is possible to save the created transcriptome FASTA and BED files to a central location and provide it to future pipeline runs in order to avoid having multiple copies on your system but it is important to make sure that all genome, annotation, transcriptome and index versions match.
+As well as the reference genome sequence and annotation it is possible to provide a reference transcriptome FASTA file.
+These can be obtained from GENCODE or ENSEMBL but it is important to note that the sequences they provide only cover the reference chromosome and can result in inconsistencies if you have provided a primary or toplevel genome assembly and annotation.
+For this reason, we recommend to not provide a transcriptome FASTA and instead let the pipeline create it from the provided genome and annotation.
+As with the aligner indexes, it is possible to save the created transcriptome FASTA and BED files to a central location and provide it to future pipeline runs in order to avoid having multiple copies on your system but it is important to make sure that all genome, annotation, transcriptome and index versions match.
 
 ### Indexes
 
-Creating the index files required for the alignment and/or pseudoalignment steps can be computationally intensive and the files they produce are quite large. To avoid repeating this work and having multiple redundant files we recommend saving the indexes using the `--save_reference` option and moving them to a central location where they can be accessed by future pipeline runs. When doing this, it is important to record the genome and annotations versions they correspond to so you can easily locate the correct index to use and the program version as an index produced with one version may not have a format compatible with other versions.
+Creating the index files required for the alignment and/or pseudoalignment steps can be computationally intensive and the files they produce are quite large.
+To avoid repeating this work and having multiple redundant files we recommend saving the indexes using the `--save_reference` option and moving them to a central location where they can be accessed by future pipeline runs.
+When doing this, it is important to record the genome and annotations versions they correspond to so you can easily locate the correct index to use and the program version as an index produced with one version may not have a format compatible with other versions.
 
 ## Adapter trimming options
 

From 1bbf1fe47849fe0f30d5d90358756cbb8a15892c Mon Sep 17 00:00:00 2001
From: Luke Zappia <lazappi@users.noreply.github.com>
Date: Fri, 14 Jun 2024 11:47:39 +0200
Subject: [PATCH 03/11] Apply suggestions from code review

Co-authored-by: Jonathan Manning <pininforthefjords@gmail.com>
---
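Reviewer note (below the `---`, so not part of the commit message): the reworded note warns against pairing a GTF from one resource with a genome FASTA from another because of naming differences such as the `chr` prefix. A quick way to spot that mismatch is sketched below. It assumes FASTA headers start with `>` followed by the sequence name, and that the first tab-separated GTF column is the sequence name. The function names are hypothetical and are not part of the pipeline.

```python
def fasta_sequence_names(lines):
    """Collect sequence names (header text up to the first whitespace) from FASTA lines."""
    names = set()
    for line in lines:
        if line.startswith(">"):
            parts = line[1:].split()
            if parts:
                names.add(parts[0])
    return names


def gtf_sequence_names(lines):
    """Collect the first column (sequence name) of every GTF record."""
    names = set()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comments and blank lines
        names.add(line.split("\t", 1)[0])
    return names


def missing_from_genome(fasta_lines, gtf_lines):
    """Return annotation sequence names that are absent from the genome FASTA."""
    return sorted(gtf_sequence_names(gtf_lines) - fasta_sequence_names(fasta_lines))
```

Both arguments can be open file handles or lists of lines. A non-empty result, for instance `1` in the annotation against `chr1` in the genome, suggests the two files come from different resources.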
 docs/usage.md | 55 +++++++++++++++------------------------------------
 1 file changed, 16 insertions(+), 39 deletions(-)

diff --git a/docs/usage.md b/docs/usage.md
index f12bb95ee..48638fdfd 100644
--- a/docs/usage.md
+++ b/docs/usage.md
@@ -57,68 +57,45 @@ An [example samplesheet](../assets/samplesheet.csv) has been provided with the p
 
 ## Reference files
 
-The only reference files required by the pipeline are a FASTA file with the reference genome sequence and a GTF/GFF file with a gene annotation. All other reference files can be created from those by the pipeline.
-However, selecting the appropriate reference genome and annotation to use analysis can still be difficult.
-Here we provide some advice on what is expected by the pipeline:
+The pipeline accepts a number of reference files for maximum flexibility, but most can be generated dynamically and need not be supplied. The minimal requirement is two reference files: a FASTA file containing the reference genome sequence and a GTF/GFF file with gene annotations. Further guidance is provided below.
 
 :::note
-**GENCODE vs ENSEMBL**
+**Consistent reference resource usage**
 
-Two of the most common sources of genomic references are GENCODE (for mouse and human) and ENSEMBL (for many organisms).
-There has been an effort to standardise information between the two sources and now the references [should be consistent](https://www.gencodegenes.org/pages/faq.html) regardless of where they are obtained from (for mouse and human).
-
-However, while the information is consistent, there are still some practical differences.
-ENSEMBL prefixes chromosome names with `chr` (e.g. `chr1`, `chr1`, ...) while GENCODE uses simple `1`, `2`, etc.
-There can also be different names used for sequences outside the reference chromosomes.
-GENCODE also attaches version identifiers to gene and transcript names (e.g. `ENSG00000254647.1`).
-For these reasons, resources from the two sources cannot be mixed and it is important to stick to one reference source. Some of the steps in the pipeline expect an ENSEMBL reference by default so it is important to set the `--gencode` option if your reference comes from GENCODE.
+When supplying reference files as discussed below, it is important to be consistent in the reference resource used (Ensembl, GENCODE, UCSC etc.), since differences in conventions between these resources can make their files incompatible. For example, UCSC prefixes chromosomes with `chr`, while Ensembl does not, so a GTF file from Ensembl should not be supplied alongside a genome FASTA from UCSC.
 :::
 
 ### Reference genome
 
-It is recommended to provide the most complete reference genome for your species, without additional loci (haplotypes) or patches.
-For models organisms such as mouse or human this is the so-called "primary assembly" which includes the reference chromosomes as well as some additional scaffolds.
-For human assembly GRCh38 (hg38) this would be the `GRCh38.primary_assembly.genome.fa.gz` file from GENCODE or the `Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz` file from ENSEMBL.
-These files are preferred as they cover the largest amount of the reference genome without including multiple copies of the same sequence which can confuse aligners such as STAR. Most other species (fly, cow, dog etc.) do not have a primary assembly, in which case the complete reference sequence, or "toplevel" assembly, should be used.
-The difference between the two is the inclusion of alternative loci (haplotypes) but these do not typically exist for species outside mouse and human.
+It is recommended to provide the most complete reference genome for your species, without additional loci (haplotypes) or patches. For model organisms such as mouse or human, this is the "primary assembly," which includes the reference chromosomes and some additional scaffolds. For the human assembly GRCh38 (hg38), use the `GRCh38.primary_assembly.genome.fa.gz` file from GENCODE or the `Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz` file from ENSEMBL. These files cover the largest portion of the reference genome without including multiple copies of the same sequence, which can confuse aligners like STAR.
+
+Most other species (e.g., fly, cow, dog) do not have a primary assembly. In these cases, use the complete reference sequence, or "toplevel" assembly. The main difference between the primary and toplevel assemblies is the inclusion of alternative loci (haplotypes), which typically do not exist for species outside of mouse and human.
 
 ### Gene annotation
 
-Gene annotations are updated more frequently than the reference genome sequence and there are more options to consider here.
-Because annotations can be updated frequently, you should rely on sources that include well-defined, versioned releases such as ENSEMBL or GENCODE.
-We generally recommend using the most recent release in order to have the latest and most up-to-date gene annotations.
-However, if you are planning to combine your data with a dataset that was processed in the past you may want to use the annotation version that was used previously for greater consistency.
-Once you have decided on a release to use, you can then select an annotation file.
-This should be the most comprehensive annotation that matches the reference genome you are using.
-So if you are using the human primary assembly you would want the comprehensive annotation for the primary assembly (the `gencode.{release}.primary_assembly.annotation.gtf.gz` file from GENCODE or the `Homo_sapiens.GRCh38.{release}.gtf.gz` file from ENSEMBL).
-For something like fly, you would want the annotation matching the toplevel assembly (e.g. `Drosophila_melanogaster.BDGP6.46.{release}.gtf.gz` from ENSEMBL).
-As well as the comprehensive annotations for the primary and toplevel assemblies, and just the reference chromomes, GENCODE also provides "basic" annotations which only include representative transcripts, but we do not recommend using these.
+Gene annotations are updated more frequently than the reference genome sequence, so you must choose an appropriate annotation version (e.g. Ensembl release). We recommend using sources with well-defined, versioned releases such as ENSEMBL or GENCODE. Generally, it is best to use the most recent release for the latest gene annotations. However, if you are combining your data with older datasets, use the annotation version previously used for consistency.
+
+Once you have chosen a release, select the annotation file that matches your reference genome. For the human primary assembly, use the comprehensive annotation (e.g., `gencode.{release}.primary_assembly.annotation.gtf.gz` from GENCODE or `Homo_sapiens.GRCh38.{release}.gtf.gz` from ENSEMBL). For other species, like fly, use the annotation matching the toplevel assembly (e.g., `Drosophila_melanogaster.BDGP6.46.{release}.gtf.gz` from ENSEMBL).
+
+GENCODE also provides "basic" annotations, which include only representative transcripts, but we do not recommend using these.
 
-Gene annotations typically provide a primary identifier for each feature as well as a more common name.
-For example, the ENSEMBL ID `ENSG00000254647` corresponds to the `INS` gene which encodes the insulin protein.
-While the gene names may be more familiar and easier to understand it is important to retain and use the primary identifiers as the are unique for a given annotation and are much easier to map between annotation versions or sources.
+Gene annotations provide a primary identifier for each feature as well as a common name. For example, the ENSEMBL ID `ENSG00000254647` corresponds to the `INS` gene, which encodes the insulin protein. While gene names are more familiar, it is crucial to retain and use the primary identifiers as they are unique and easier to map between annotation versions or sources.
 
-To take advantage of all the quality control modules implemented in the pipeline, the gene annotation should include a `gene_biotype` field which describes the function of each feature (protein coding, long non-coding etc.).
-This is usually the case for annotations from GENCODE or ENSEMBL but may not be if your annotation comes from another source.
-If your annotation does not include this field, please set the `--skip_biotype_qc` option to avoid running the steps that rely on it.
+To take advantage of all the quality control modules implemented in the pipeline, the gene annotation should include a `gene_biotype` field which describes the function of each feature (protein coding, long non-coding etc.). This is usually the case for annotations from GENCODE or ENSEMBL but may not be if your annotation comes from another source. If your annotation does not include this field, please set the `--skip_biotype_qc` option to avoid running the steps that rely on it.
 
 :::note
 **GTF vs GFF**
 
-GFF (General Feature Format) is a tab-separated text file format for representing genomic annotations.
-GTF (General Transfer Format) is a specific implementation of this format corresponding to GFF version 2.
-The pipeline can accept both GFF and GTF but any GFF files will be converted to GFF so if a GTF is available for your annotation of choice it is better to provide that directly.
+GFF (General Feature Format) is a tab-separated text file format for representing genomic annotations, while GTF (General Transfer Format) is a specific implementation of this format corresponding to GFF version 2. The pipeline can accept both GFF and GTF but any GFF files will be converted to GTF so if a GTF is available for your annotation of choice it is better to provide that directly.
 
 More information and links to further resources are [available from ENSEMBL](https://www.ensembl.org/info/website/upload/gff.html).
 :::
 
 ### Reference transcriptome
 
-As well as the reference genome sequence and annotation it is possible to provide a reference transcriptome FASTA file.
-These can be obtained from GENCODE or ENSEMBL but it is important to note that the sequences they provide only cover the reference chromosome and can result in inconsistencies if you have provided a primary or toplevel genome assembly and annotation.
-For this reason, we recommend to not provide a transcriptome FASTA and instead let the pipeline create it from the provided genome and annotation.
-As with the aligner indexes, it is possible to save the created transcriptome FASTA and BED files to a central location and provide it to future pipeline runs in order to avoid having multiple copies on your system but it is important to make sure that all genome, annotation, transcriptome and index versions match.
+In addition to the reference genome sequence and annotation, you can provide a reference transcriptome FASTA file. These files can be obtained from GENCODE or ENSEMBL. However, these sequences only cover the reference chromosomes and can cause inconsistencies if you are using a primary or toplevel genome assembly and annotation.
 
+We recommend not providing a transcriptome FASTA file and instead allowing the pipeline to create it from the provided genome and annotation. Similar to aligner indexes, you can save the created transcriptome FASTA and BED files to a central location for future pipeline runs. This helps avoid multiple copies on your system. Ensure that all genome, annotation, transcriptome, and index versions match to maintain consistency.
 ### Indexes
 
 Creating the index files required for the alignment and/or pseudoalignment steps can be computationally intensive and the files they produce are quite large.

From 500c673b364c19e7e762dbfdedb7cbbb885d0b9a Mon Sep 17 00:00:00 2001
From: Luke Zappia <lazappi@users.noreply.github.com>
Date: Fri, 14 Jun 2024 12:21:48 +0200
Subject: [PATCH 04/11] Move reference guidance to existing sections

---
 docs/usage.md | 93 ++++++++++++++++++++++++---------------------------
 1 file changed, 43 insertions(+), 50 deletions(-)
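
The "Consistent reference resource usage" note moved by this patch warns against mixing naming conventions (for example `chr1` in a UCSC FASTA and `1` in an Ensembl GTF). A minimal sketch of how that could be checked before a run is given below. It is not part of the pipeline; the helper names (`open_text`, `fasta_seq_names`, `gtf_seq_names`, `unmatched_gtf_names`) are hypothetical, and it assumes plain or gzip-compressed input files.

```python
import gzip
from pathlib import Path


def open_text(path):
    """Open a plain or gzip-compressed text file for reading."""
    path = Path(path)
    if path.suffix == ".gz":
        return gzip.open(path, "rt")
    return open(path, "rt")


def fasta_seq_names(path):
    """Collect sequence names: the first word of each FASTA header line."""
    names = set()
    with open_text(path) as handle:
        for line in handle:
            if line.startswith(">") and line[1:].strip():
                names.add(line[1:].split()[0])
    return names


def gtf_seq_names(path):
    """Collect sequence names from column 1 of a GTF, skipping comments."""
    names = set()
    with open_text(path) as handle:
        for line in handle:
            if line.startswith("#") or not line.strip():
                continue
            names.add(line.split("\t", 1)[0])
    return names


def unmatched_gtf_names(fasta_path, gtf_path):
    """Return the GTF sequence names that are absent from the FASTA."""
    return sorted(gtf_seq_names(gtf_path) - fasta_seq_names(fasta_path))
```

An empty list from `unmatched_gtf_names("genome.fa.gz", "genes.gtf.gz")` suggests the two files share a naming convention; a long list of names such as `1`, `2`, `X` usually points to files taken from different resources.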

diff --git a/docs/usage.md b/docs/usage.md
index 48638fdfd..e381addb8 100644
--- a/docs/usage.md
+++ b/docs/usage.md
@@ -55,53 +55,6 @@ An [example samplesheet](../assets/samplesheet.csv) has been provided with the p
 
 > **NB:** The `group` and `replicate` columns were replaced with a single `sample` column as of v3.1 of the pipeline. The `sample` column is essentially a concatenation of the `group` and `replicate` columns, however it now also offers more flexibility in instances where replicate information is not required e.g. when sequencing clinical samples. If all values of `sample` have the same number of underscores, fields defined by these underscore-separated names may be used in the PCA plots produced by the pipeline, to regain the ability to represent different groupings.
 
-## Reference files
-
-The pipeline has a number of options for reference files provided for maximum flexibility, but most can be generated dynamically and need not be supplied. The minimal requirement is two reference files: a FASTA file containing the reference genome sequence and a GTF/GFF file with gene annotations. Further guidance is provided below.
-
-:::note
-**Consistent reference resource usage**
-
-When supplying reference files as discussed below, it is important to be consistent in the reference resource used (Ensembl, Gencode, UCSC etc), since differences in conventions between these resources can make their files incompatible. For example, UCSC prefixes chromosomes with `chr`, while Ensembl does not, so a GTF file from Ensembl should not be supplied alongside a genome FASTA from UCSC.
-:::
-
-### Reference genome
-
-It is recommended to provide the most complete reference genome for your species, without additional loci (haplotypes) or patches. For model organisms such as mouse or human, this is the "primary assembly," which includes the reference chromosomes and some additional scaffolds. For the human assembly GRCh38 (hg38), use the `GRCh38.primary_assembly.genome.fa.gz` file from GENCODE or the `Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz` file from ENSEMBL. These files cover the largest portion of the reference genome without including multiple copies of the same sequence, which can confuse aligners like STAR.
-
-Most other species (e.g., fly, cow, dog) do not have a primary assembly. In these cases, use the complete reference sequence, or "toplevel" assembly. The main difference between the primary and toplevel assemblies is the inclusion of alternative loci (haplotypes), which typically do not exist for species outside of mouse and human.
-
-### Gene annotation
-
-Gene annotations are updated more frequently than the reference genome sequence, so you much choose an appropriate annotation version (e.g. Ensembl release)  We recommend using sources with well-defined, versioned releases such as ENSEMBL or GENCODE. Generally, it is best to use the most recent release for the latest gene annotations. However, if you are combining your data with older datasets, use the annotation version previously used for consistency.
-
-Once you have chosen a release, select the annotation file that matches your reference genome. For the human primary assembly, use the comprehensive annotation (e.g., `gencode.{release}.primary_assembly.annotation.gtf.gz` from GENCODE or `Homo_sapiens.GRCh38.{release}.gtf.gz` from ENSEMBL). For other species, like fly, use the annotation matching the toplevel assembly (e.g., `Drosophila_melanogaster.BDGP6.46.{release}.gtf.gz` from ENSEMBL).
-
-GENCODE also provides "basic" annotations, which include only representative transcripts, but we do not recommend using these.
-
-Gene annotations provide a primary identifier for each feature as well as a common name. For example, the ENSEMBL ID `ENSG00000254647` corresponds to the `INS` gene, which encodes the insulin protein. While gene names are more familiar, it is crucial to retain and use the primary identifiers as they are unique and easier to map between annotation versions or sources.
-
-To take advantage of all the quality control modules implemented in the pipeline, the gene annotation should include a `gene_biotype` field which describes the function of each feature (protein coding, long non-coding etc.). This is usually the case for annotations from GENCODE or ENSEMBL but may not be if your annotation comes from another source. If your annotation does not include this field, please set the `--skip_biotype_qc` option to avoid running the steps that rely on it.
-
-:::note
-**GTF vs GFF**
-
-GFF (General Feature Format) is a tab-separated text file format for representing genomic annotations, while GTF (General Transfer Format) is a specific implementation of this format corresponding to GFF version 2. The pipeline can accept both GFF and GTF but any GFF files will be converted to GTF so if a GTF is available for your annotation of choice it is better to provide that directly.
-
-More information and links to further resources are [available from ENSEMBL](https://www.ensembl.org/info/website/upload/gff.html).
-:::
-
-### Reference transcriptome
-
-In addition to the reference genome sequence and annotation, you can provide a reference transcriptome FASTA file. These files can be obtained from GENCODE or ENSEMBL. However, these sequences only cover the reference chromosomes and can cause inconsistencies if you are using a primary or toplevel genome assembly and annotation.
-
-We recommend not providing a transcriptome FASTA file and instead allowing the pipeline to create it from the provided genome and annotation. Similar to aligner indexes, you can save the created transcriptome FASTA and BED files to a central location for future pipeline runs. This helps avoid multiple copies on your system. Ensure that all genome, annotation, transcriptome, and index versions match to maintain consistency.
-### Indexes
-
-Creating the index files required for the alignment and/or pseudoalignment steps can be computationally intensive and the files they produce are quite large.
-To avoid repeating this work and having multiple redundant files we recommend saving the indexes using the `--save_reference` option and moving them to a central location where they can be accessed by future pipeline runs.
-When doing this, it is important to record the genome and annotations versions they correspond to so you can easily locate the correct index to use and the program version as an index produced with one version may not have a format compatible with other versions.
-
 ## Adapter trimming options
 
 [Trim Galore!](https://www.bioinformatics.babraham.ac.uk/projects/trim_galore/) is a wrapper tool around Cutadapt and FastQC to peform quality and adapter trimming on FastQ files. Trim Galore! will automatically detect and trim the appropriate adapter sequence. It is the default trimming tool used by this pipeline, however you can use fastp instead by specifying the `--trimmer fastp` parameter. [fastp](https://github.com/OpenGene/fastp) is a tool designed to provide fast, all-in-one preprocessing for FastQ files. It has been developed in C++ with multithreading support to achieve higher performance. You can specify additional options for Trim Galore! and fastp via the `--extra_trimgalore_args` and `--extra_fastp_args` parameters, respectively.
@@ -184,6 +137,12 @@ If unique molecular identifiers were used to prepare the library, add the follow
 
 Please refer to the [nf-core website](https://nf-co.re/usage/reference_genomes) for general usage docs and guidelines regarding reference genomes.
 
+:::note
+**Consistent reference resource usage**
+
+When supplying reference files as discussed below, it is important to be consistent in the reference resource used (Ensembl, GENCODE, UCSC etc.), since differences in conventions between these resources can make their files incompatible. For example, UCSC prefixes chromosome names with `chr`, while Ensembl does not, so a GTF file from Ensembl should not be supplied alongside a genome FASTA from UCSC. GENCODE also attaches version suffixes to gene and transcript IDs (e.g. `ENSG00000254647.1`) while Ensembl does not.
+:::
+
 ### Explicit reference file specification (recommended)
 
 The minimum reference genome requirements for this pipeline are a FASTA and GTF file, all other files required to run the pipeline can be generated from these files. For example, the latest reference files for human can be derived from Ensembl like:
@@ -205,7 +164,39 @@ Notes:
 - If `--additional_fasta` is provided then the features in this file (e.g. ERCC spike-ins) will be automatically concatenated onto both the reference FASTA file as well as the GTF annotation before building the appropriate indices.
 - When using `--aligner star_rsem`, both the STAR and RSEM indices should be present in the path specified by `--rsem_index` (see [#568](https://github.com/nf-core/rnaseq/issues/568)).
 
-#### Indices
+### Reference genome
+
+It is recommended to provide the most complete reference genome for your species, without additional loci (haplotypes) or patches. For model organisms such as mouse or human, this is the "primary assembly," which includes the reference chromosomes and some additional scaffolds. For the human assembly GRCh38 (hg38), use the `GRCh38.primary_assembly.genome.fa.gz` file from GENCODE or the `Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz` file from Ensembl. These files cover the largest portion of the reference genome without including multiple copies of the same sequence, which can confuse aligners like STAR.
+
+Most other species (e.g., fly, cow, dog) do not have a primary assembly. In these cases, use the complete reference sequence, or "toplevel" assembly. The main difference between the primary and toplevel assemblies is the inclusion of alternative loci (haplotypes), which typically do not exist for species outside of mouse and human.
+
+### Gene annotation
+
+Gene annotations are updated more frequently than the reference genome sequence, so you much choose an appropriate annotation version (e.g. Ensembl release)  We recommend using sources with well-defined, versioned releases such as ENSEMBL or GENCODE. Generally, it is best to use the most recent release for the latest gene annotations. However, if you are combining your data with older datasets, use the annotation version previously used for consistency.
+
+Once you have chosen a release, select the annotation file that matches your reference genome. For the human primary assembly, use the comprehensive annotation (e.g., `gencode.{release}.primary_assembly.annotation.gtf.gz` from GENCODE or `Homo_sapiens.GRCh38.{release}.gtf.gz` from Ensembl). For other species, like fly, use the annotation matching the toplevel assembly (e.g., `Drosophila_melanogaster.BDGP6.46.{release}.gtf.gz` from Ensembl).
+
+GENCODE also provides "basic" annotations, which include only representative transcripts, but we do not recommend using these.
+
+Gene annotations provide a primary identifier for each feature as well as a common name. For example, the Ensembl ID `ENSG00000254647` corresponds to the `INS` gene, which encodes the insulin protein. While gene names are more familiar, it is crucial to retain and use the primary identifiers as they are unique and easier to map between annotation versions or sources.
+
+To take advantage of all the quality control modules implemented in the pipeline, the gene annotation should include a `gene_biotype` field which describes the function of each feature (protein coding, long non-coding etc.). This is usually the case for annotations from GENCODE or Ensembl but may not be if your annotation comes from another source. If your annotation does not include this field, please set the `--skip_biotype_qc` option to avoid running the steps that rely on it.
+
+:::note
+**GTF vs GFF**
+
+GFF (General Feature Format) is a tab-separated text file format for representing genomic annotations, while GTF (General Transfer Format) is a specific implementation of this format corresponding to GFF version 2. The pipeline can accept both GFF and GTF but any GFF files will be converted to GTF so if a GTF is available for your annotation of choice it is better to provide that directly.
+
+More information and links to further resources are [available from Ensembl](https://www.ensembl.org/info/website/upload/gff.html).
+:::
+
+### Reference transcriptome
+
+In addition to the reference genome sequence and annotation, you can provide a reference transcriptome FASTA file. These files can be obtained from GENCODE or Ensembl. However, these sequences only cover the reference chromosomes and can cause inconsistencies if you are using a primary or toplevel genome assembly and annotation.
+
+We recommend not providing a transcriptome FASTA file and instead allowing the pipeline to create it from the provided genome and annotation. Similar to aligner indexes, you can save the created transcriptome FASTA and BED files to a central location for future pipeline runs. This helps avoid redundant computation and having multiple copies on your system. Ensure that all genome, annotation, transcriptome, and index versions match to maintain consistency.
+
+### Indices
 
 By default, indices are generated dynamically by the workflow for tools such as STAR and Salmon. Since indexing is an expensive process in time and resources you should ensure that it is only done once, by retaining the indices generated from each batch of reference files:
 
@@ -214,14 +205,16 @@ By default, indices are generated dynamically by the workflow for tools such as
 
 Once you have the indices from a workflow run you should save them somewhere central and reuse them in subsequent runs using custom config files or command line parameters such as `--star_index '/path/to/STAR/index/'`.
 
-#### Gencode
+When doing this, it is important to record the genome and annotations versions they correspond to so you can easily locate the correct index to use and the program version as an index produced with one version may not have a format compatible with other versions.
+
+### GENCODE
 
 If you are using [GENCODE](https://www.gencodegenes.org/) reference genome files please specify the `--gencode` parameter because the format of these files is slightly different to ENSEMBL genome files:
 
 - The `--gtf_group_features_type` parameter will automatically be set to `gene_type` as opposed to `gene_biotype`, respectively.
 - If you are running Salmon, the `--gencode` flag will also be passed to the index building step to overcome parsing issues resulting from the transcript IDs in GENCODE fasta files being separated by vertical pipes (`|`) instead of spaces (see [this issue](https://github.com/COMBINE-lab/salmon/issues/15)).
 
-#### Prokaryotic genome annotations
+### Prokaryotic genome annotations
 
 This pipeline uses featureCounts to generate QC metrics based on [biotype](http://www.ensembl.org/info/genome/genebuild/biotypes.html) information available within GFF/GTF genome annotation files. The format of these annotation files can vary significantly depending on the source of the annotation and the type of organism. The default settings in the pipeline are tailored towards Ensembl GTF annotations available for eukaryotic genomes. Prokaryotic genome annotations tend to be distributed in GFF format which are structured differently in terms of the feature naming conventions. There are a number of ways you can tune the behaviour of the pipeline to cater for differences/absence of biotype information:
 

From 12412bb9c6418721b06872755b496277ae9b77f8 Mon Sep 17 00:00:00 2001
From: Luke Zappia <lazappi@users.noreply.github.com>
Date: Thu, 20 Jun 2024 10:40:56 +0200
Subject: [PATCH 05/11] Apply suggestions from code review

Co-authored-by: Jonathan Manning <pininforthefjords@gmail.com>
---
 docs/usage.md | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)
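
The gene annotation paragraphs touched here ask for gene IDs as primary identifiers and for a `gene_biotype` field (`gene_type` in GENCODE files, as the GENCODE section notes). The sketch below shows one way to inspect a GTF for both before deciding on `--skip_biotype_qc` or `--gencode`. It is an illustration only: `parse_attributes` and `summarise_gene_records` are hypothetical helpers, not pipeline code, and the parser assumes the usual `key "value";` attribute layout.

```python
import re

# Matches quoted attribute pairs such as: gene_id "ENSG00000254647";
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')


def parse_attributes(field):
    """Parse the GTF attribute column into a dict of quoted key/value pairs."""
    return dict(ATTR_RE.findall(field))


def summarise_gene_records(gtf_lines):
    """Count gene records, biotype fields and version-suffixed gene IDs."""
    summary = {"genes": 0, "gene_biotype": 0, "gene_type": 0, "versioned_ids": 0}
    for line in gtf_lines:
        if line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        # Only look at complete records whose feature type is "gene".
        if len(cols) < 9 or cols[2] != "gene":
            continue
        attrs = parse_attributes(cols[8])
        summary["genes"] += 1
        for key in ("gene_biotype", "gene_type"):
            if key in attrs:
                summary[key] += 1
        # GENCODE-style IDs end in ".<version>", e.g. ENSG00000254647.1
        if re.fullmatch(r".+\.\d+", attrs.get("gene_id", "")):
            summary["versioned_ids"] += 1
    return summary
```

If neither `gene_biotype` nor `gene_type` is counted for any gene record, the biotype QC steps have nothing to work with; a non-zero `versioned_ids` count together with `gene_type` hints that the file follows GENCODE conventions.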

diff --git a/docs/usage.md b/docs/usage.md
index e381addb8..d5acd1815 100644
--- a/docs/usage.md
+++ b/docs/usage.md
@@ -164,13 +164,13 @@ Notes:
 - If `--additional_fasta` is provided then the features in this file (e.g. ERCC spike-ins) will be automatically concatenated onto both the reference FASTA file as well as the GTF annotation before building the appropriate indices.
 - When using `--aligner star_rsem`, both the STAR and RSEM indices should be present in the path specified by `--rsem_index` (see [#568](https://github.com/nf-core/rnaseq/issues/568)).
 
-### Reference genome
+#### Reference genome
 
 It is recommended to provide the most complete reference genome for your species, without additional loci (haplotypes) or patches. For model organisms such as mouse or human, this is the "primary assembly," which includes the reference chromosomes and some additional scaffolds. For the human assembly GRCh38 (hg38), use the `GRCh38.primary_assembly.genome.fa.gz` file from GENCODE or the `Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz` file from Ensembl. These files cover the largest portion of the reference genome without including multiple copies of the same sequence, which can confuse aligners like STAR.
 
 Most other species (e.g., fly, cow, dog) do not have a primary assembly. In these cases, use the complete reference sequence, or "toplevel" assembly. The main difference between the primary and toplevel assemblies is the inclusion of alternative loci (haplotypes), which typically do not exist for species outside of mouse and human.
 
-### Gene annotation
+#### Gene annotation
 
 Gene annotations are updated more frequently than the reference genome sequence, so you much choose an appropriate annotation version (e.g. Ensembl release)  We recommend using sources with well-defined, versioned releases such as ENSEMBL or GENCODE. Generally, it is best to use the most recent release for the latest gene annotations. However, if you are combining your data with older datasets, use the annotation version previously used for consistency.
 
@@ -178,7 +178,7 @@ Once you have chosen a release, select the annotation file that matches your ref
 
 GENCODE also provides "basic" annotations, which include only representative transcripts, but we do not recommend using these.
 
-Gene annotations provide a primary identifier for each feature as well as a common name. For example, the Ensembl ID `ENSG00000254647` corresponds to the `INS` gene, which encodes the insulin protein. While gene names are more familiar, it is crucial to retain and use the primary identifiers as they are unique and easier to map between annotation versions or sources.
+Ensure that the annotation files use gene IDs as the primary identifier, not the gene name/ symbol. For example, the Ensembl ID `ENSG00000254647` corresponds to the `INS` gene, which encodes the insulin protein. While gene names are more familiar, it is crucial to retain and use the primary identifiers as they are unique and easier to map between annotation versions or sources.
 
 To take advantage of all the quality control modules implemented in the pipeline, the gene annotation should include a `gene_biotype` field which describes the function of each feature (protein coding, long non-coding etc.). This is usually the case for annotations from GENCODE or Ensembl but may not be if your annotation comes from another source. If your annotation does not include this field, please set the `--skip_biotype_qc` option to avoid running the steps that rely on it.
 
@@ -205,7 +205,7 @@ By default, indices are generated dynamically by the workflow for tools such as
 
 Once you have the indices from a workflow run you should save them somewhere central and reuse them in subsequent runs using custom config files or command line parameters such as `--star_index '/path/to/STAR/index/'`.
 
-When doing this, it is important to record the genome and annotations versions they correspond to so you can easily locate the correct index to use and the program version as an index produced with one version may not have a format compatible with other versions.
+Note the genome and annotation versions as well as the versions of the software used for indexing, as an index created with one version may not be compatible with other versions.
 
 ### GENCODE
 

From f1c85ebea2057b77b0e9c526a719eac78e17ce3d Mon Sep 17 00:00:00 2001
From: Luke Zappia <lazappi@users.noreply.github.com>
Date: Thu, 20 Jun 2024 10:48:11 +0200
Subject: [PATCH 06/11] Minor tidying to usage docs

---
 docs/usage.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/usage.md b/docs/usage.md
index 5d1c5f105..1510dee29 100644
--- a/docs/usage.md
+++ b/docs/usage.md
@@ -190,7 +190,7 @@ Once you have chosen a release, select the annotation file that matches your ref
 
 GENCODE also provides "basic" annotations, which include only representative transcripts, but we do not recommend using these.
 
-Ensure that the annotation files use gene IDs as the primary identifier, not the gene name/ symbol. For example, the Ensembl ID `ENSG00000254647` corresponds to the `INS` gene, which encodes the insulin protein. While gene names are more familiar, it is crucial to retain and use the primary identifiers as they are unique and easier to map between annotation versions or sources.
+Ensure that the annotation files use gene IDs as the primary identifier, not the gene name/symbol. For example, the Ensembl ID `ENSG00000254647` corresponds to the `INS` gene, which encodes the insulin protein. While gene names are more familiar, it is crucial to retain and use the primary identifiers as they are unique and easier to map between annotation versions or sources.
 
 To take advantage of all the quality control modules implemented in the pipeline, the gene annotation should include a `gene_biotype` field which describes the function of each feature (protein coding, long non-coding etc.). This is usually the case for annotations from GENCODE or Ensembl but may not be if your annotation comes from another source. If your annotation does not include this field, please set the `--skip_biotype_qc` option to avoid running the steps that rely on it.
 
@@ -217,7 +217,7 @@ By default, indices are generated dynamically by the workflow for tools such as
 
 Once you have the indices from a workflow run you should save them somewhere central and reuse them in subsequent runs using custom config files or command line parameters such as `--star_index '/path/to/STAR/index/'`.
 
-Note the genome and annotation versions as well as the versions of the software used for indexing, as an index created with one version may not be compatible with other versions.
+Remember to note the genome and annotation versions as well as the versions of the software used for indexing, as an index created with one version may not be compatible with other versions.
 
 ### GENCODE
 

From 805339849319bbeda35ba0282d77a54571ae27df Mon Sep 17 00:00:00 2001
From: Luke Zappia <lazappi@users.noreply.github.com>
Date: Thu, 20 Jun 2024 10:57:16 +0200
Subject: [PATCH 07/11] Run pre-commit

---
 docs/usage.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/usage.md b/docs/usage.md
index 1510dee29..85ac90888 100644
--- a/docs/usage.md
+++ b/docs/usage.md
@@ -184,7 +184,7 @@ Most other species (e.g., fly, cow, dog) do not have a primary assembly. In thes
 
 #### Gene annotation
 
-Gene annotations are updated more frequently than the reference genome sequence, so you much choose an appropriate annotation version (e.g. Ensembl release)  We recommend using sources with well-defined, versioned releases such as ENSEMBL or GENCODE. Generally, it is best to use the most recent release for the latest gene annotations. However, if you are combining your data with older datasets, use the annotation version previously used for consistency.
+Gene annotations are updated more frequently than the reference genome sequence, so you must choose an appropriate annotation version (e.g. Ensembl release). We recommend using sources with well-defined, versioned releases such as Ensembl or GENCODE. Generally, it is best to use the most recent release for the latest gene annotations. However, if you are combining your data with older datasets, use the annotation version that was used for those datasets, for consistency.
 
 Once you have chosen a release, select the annotation file that matches your reference genome. For the human primary assembly, use the comprehensive annotation (e.g., `gencode.{release}.primary_assembly.annotation.gtf.gz` from GENCODE or `Homo_sapiens.GRCh38.{release}.gtf.gz` from Ensembl). For other species, like fly, use the annotation matching the toplevel assembly (e.g., `Drosophila_melanogaster.BDGP6.46.{release}.gtf.gz` from Ensembl).
 

From fbcee9db7501e4080fc2e42eaac9a02529887a80 Mon Sep 17 00:00:00 2001
From: Luke Zappia <lazappi@users.noreply.github.com>
Date: Fri, 21 Jun 2024 08:55:34 +0200
Subject: [PATCH 08/11] Move note about basic annotations to GENCODE section

---
 docs/usage.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/usage.md b/docs/usage.md
index 85ac90888..4ea7bf483 100644
--- a/docs/usage.md
+++ b/docs/usage.md
@@ -188,8 +188,6 @@ Gene annotations are updated more frequently than the reference genome sequence,
 
 Once you have chosen a release, select the annotation file that matches your reference genome. For the human primary assembly, use the comprehensive annotation (e.g., `gencode.{release}.primary_assembly.annotation.gtf.gz` from GENCODE or `Homo_sapiens.GRCh38.{release}.gtf.gz` from Ensembl). For other species, like fly, use the annotation matching the toplevel assembly (e.g., `Drosophila_melanogaster.BDGP6.46.{release}.gtf.gz` from Ensembl).
 
-GENCODE also provides "basic" annotations, which include only representative transcripts, but we do not recommend using these.
-
 Ensure that the annotation files use gene IDs as the primary identifier, not the gene name/symbol. For example, the Ensembl ID `ENSG00000254647` corresponds to the `INS` gene, which encodes the insulin protein. While gene names are more familiar, it is crucial to retain and use the primary identifiers as they are unique and easier to map between annotation versions or sources.
 
 To take advantage of all the quality control modules implemented in the pipeline, the gene annotation should include a `gene_biotype` field which describes the function of each feature (protein coding, long non-coding etc.). This is usually the case for annotations from GENCODE or Ensembl but may not be if your annotation comes from another source. If your annotation does not include this field, please set the `--skip_biotype_qc` option to avoid running the steps that rely on it.
@@ -226,6 +224,8 @@ If you are using [GENCODE](https://www.gencodegenes.org/) reference genome files
 - The `--gtf_group_features_type` parameter will automatically be set to `gene_type` as opposed to `gene_biotype`, respectively.
 - If you are running Salmon, the `--gencode` flag will also be passed to the index building step to overcome parsing issues resulting from the transcript IDs in GENCODE fasta files being separated by vertical pipes (`|`) instead of spaces (see [this issue](https://github.com/COMBINE-lab/salmon/issues/15)).
 
+As well as the standard annotations, GENCODE also provides "basic" annotations, which include only representative transcripts, but we do not recommend using these.
+
 ### Prokaryotic genome annotations
 
 This pipeline uses featureCounts to generate QC metrics based on [biotype](http://www.ensembl.org/info/genome/genebuild/biotypes.html) information available within GFF/GTF genome annotation files. The format of these annotation files can vary significantly depending on the source of the annotation and the type of organism. The default settings in the pipeline are tailored towards Ensembl GTF annotations available for eukaryotic genomes. Prokaryotic genome annotations tend to be distributed in GFF format which are structured differently in terms of the feature naming conventions. There are a number of ways you can tune the behaviour of the pipeline to cater for differences/absence of biotype information:

From c6c8bccc81b86be0c5def13df80e7cedd0086df9 Mon Sep 17 00:00:00 2001
From: Luke Zappia <lazappi@users.noreply.github.com>
Date: Wed, 10 Jul 2024 08:03:32 +0200
Subject: [PATCH 09/11] Add suggestions from @MatthiasZepper

Co-authored-by: Matthias Zepper <6963520+MatthiasZepper@users.noreply.github.com>
---
 docs/usage.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/usage.md b/docs/usage.md
index c3fa4679d..636d5de37 100644
--- a/docs/usage.md
+++ b/docs/usage.md
@@ -219,9 +219,9 @@ Notes:
 
 #### Reference genome
 
-It is recommended to provide the most complete reference genome for your species, without additional loci (haplotypes) or patches. For model organisms such as mouse or human, this is the "primary assembly," which includes the reference chromosomes and some additional scaffolds. For the human assembly GRCh38 (hg38), use the `GRCh38.primary_assembly.genome.fa.gz` file from GENCODE or the `Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz` file from Ensembl. These files cover the largest portion of the reference genome without including multiple copies of the same sequence, which can confuse aligners like STAR.
+We recommend providing the most complete reference genome for your species, without additional loci (haplotypes) or patches. For model organisms such as mouse or human, this is the "primary assembly", which includes the reference chromosomes and some additional scaffolds. For the human assembly GRCh38 (hg38), use the `GRCh38.primary_assembly.genome.fa.gz` file from GENCODE or the `Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz` file from Ensembl. These files cover the largest portion of the reference genome without including multiple copies of the same sequence. Duplicated sequences cause reads to align equally well to several locations, which results in heavy mapping quality penalties.
 
-Most other species (e.g., fly, cow, dog) do not have a primary assembly. In these cases, use the complete reference sequence, or "toplevel" assembly. The main difference between the primary and toplevel assemblies is the inclusion of alternative loci (haplotypes), which typically do not exist for species outside of mouse and human.
+For most other species (e.g., fly, cow, dog), no primary assembly is published. Their genomic variation is less well characterized and their assemblies are less extensively curated, so there are no established alternative loci (haplotypes) and the toplevel file is equivalent to a primary assembly. You can therefore use the toplevel assembly for these organisms, but it is advisable to first verify that it contains no N-padded haplotype or patch regions.
 
 #### Gene annotation
 

From bb3986647018301d7539c59eae34ad3b66eda21f Mon Sep 17 00:00:00 2001
From: Jonathan Manning <pininforthefjords@gmail.com>
Date: Thu, 11 Jul 2024 10:41:21 +0100
Subject: [PATCH 10/11] Demote titles to return structure to previous

---
 docs/usage.md | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/docs/usage.md b/docs/usage.md
index 636d5de37..a5d50cdeb 100644
--- a/docs/usage.md
+++ b/docs/usage.md
@@ -241,13 +241,13 @@ GFF (General Feature Format) is a tab-separated text file format for representin
 More information and links to further resources are [available from Ensembl](https://www.ensembl.org/info/website/upload/gff.html).
 :::
 
-### Reference transcriptome
+#### Reference transcriptome
 
 In addition to the reference genome sequence and annotation, you can provide a reference transcriptome FASTA file. These files can be obtained from GENCODE or Ensembl. However, these sequences only cover the reference chromosomes and can cause inconsistencies if you are using a primary or toplevel genome assembly and annotation.
 
 We recommend not providing a transcriptome FASTA file and instead allowing the pipeline to create it from the provided genome and annotation. Similar to aligner indexes, you can save the created transcriptome FASTA and BED files to a central location for future pipeline runs. This helps avoid redundant computation and having multiple copies on your system. Ensure that all genome, annotation, transcriptome, and index versions match to maintain consistency.
 
-### Indices
+#### Indices
 
 By default, indices are generated dynamically by the workflow for tools such as STAR and Salmon. Since indexing is an expensive process in time and resources you should ensure that it is only done once, by retaining the indices generated from each batch of reference files:
 
@@ -258,7 +258,7 @@ Once you have the indices from a workflow run you should save them somewhere cen
 
 Remember to note the genome and annotation versions as well as the versions of the software used for indexing, as an index created with one version may not be compatible with other versions.
 
-### GENCODE
+#### GENCODE
 
 If you are using [GENCODE](https://www.gencodegenes.org/) reference genome files please specify the `--gencode` parameter because the format of these files is slightly different to ENSEMBL genome files:
 
@@ -267,7 +267,7 @@ If you are using [GENCODE](https://www.gencodegenes.org/) reference genome files
 
 As well as the standard annotations, GENCODE also provides "basic" annotations, which include only representative transcripts, but we do not recommend using these.
 
-### Prokaryotic genome annotations
+#### Prokaryotic genome annotations
 
 This pipeline uses featureCounts to generate QC metrics based on [biotype](http://www.ensembl.org/info/genome/genebuild/biotypes.html) information available within GFF/GTF genome annotation files. The format of these annotation files can vary significantly depending on the source of the annotation and the type of organism. The default settings in the pipeline are tailored towards Ensembl GTF annotations available for eukaryotic genomes. Prokaryotic genome annotations tend to be distributed in GFF format which are structured differently in terms of the feature naming conventions. There are a number of ways you can tune the behaviour of the pipeline to cater for differences/absence of biotype information:
 

From c016418870b960aa626ca59ed01d2680016411bd Mon Sep 17 00:00:00 2001
From: Matthias Zepper <MatthiasZepper@users.noreply.github.com>
Date: Mon, 15 Jul 2024 11:40:43 +0200
Subject: [PATCH 11/11] Update CHANGELOG

---
 CHANGELOG.md | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/CHANGELOG.md b/CHANGELOG.md
index dfeded678..519cea90a 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -13,6 +13,7 @@ Special thanks to the following for their contributions to the release:
 - [Edmund Miller](https://github.com/edmundmiller)
 - [Jonathan Manning](https://github.com/pinin4fjords)
 - [Laramie Lindsey](https://github.com/laramiellindsey)
+- [Luke Zappia](https://github.com/lazappi)
 - [Matthias Zepper](https://github.com/MatthiasZepper)
 - [Maxime Garcia](https://github.com/maxulysse)
 - [Rob Syme](https://github.com/robsyme)
@@ -85,6 +86,7 @@ Thank you to everyone else that has contributed by reporting bugs, enhancements
 - [PR #1309](https://github.com/nf-core/rnaseq/pull/1309) - Document FASTP sampling
 - [PR #1310](https://github.com/nf-core/rnaseq/pull/1310) - Reinstate pseudoalignment subworkflow config
 - [PR #1312](https://github.com/nf-core/rnaseq/pull/1312) - Fix issues with unzipping of GTF/ GFF files without absolute paths
+- [PR #1314](https://github.com/nf-core/rnaseq/pull/1314) - Add reference genome recommendations to usage docs
 - [PR #1317](https://github.com/nf-core/rnaseq/pull/1317) - Strip problematic ifEmpty()
 - [PR #1319](https://github.com/nf-core/rnaseq/pull/1319) - Reinstate oncomplete error messages
 - [PR #1321](https://github.com/nf-core/rnaseq/pull/1321) - Remove push and release triggers from CI