nf-core · chriswyatt1 · Sep 25, 2024 · Sep 25, 2024 · Sep 27, 2024 · Oct 2, 2024
diff --git a/.gitignore b/.gitignore
@@ -0,0 +1,14 @@
+*.pyc
+.DS_Store
+.nextflow*
+.nf-test.log
+data/
+nf-test
+.nf-test*
+results/
+test.xml
+testing*
+testing/
+work/
+log
+out
diff --git a/.gitpod.yml b/.gitpod.yml
@@ -0,0 +1,21 @@
+image: nfcore/gitpod:latest
+tasks:
+  - name: Update Nextflow and setup pre-commit
+    command: |
+      pre-commit install --install-hooks
+      nextflow self-update
+  - name: unset JAVA_TOOL_OPTIONS
+    command: |
+      unset JAVA_TOOL_OPTIONS
+vscode:
+  extensions: # based on nf-core.nf-core-extensionpack
+    - codezombiech.gitignore # Language support for .gitignore files
+    # - cssho.vscode-svgviewer                 # SVG viewer
+    - esbenp.prettier-vscode # Markdown/CommonMark linting and style checking for Visual Studio Code
+    - eamodio.gitlens # Quickly glimpse into whom, why, and when a line or code block was changed
+    - EditorConfig.EditorConfig # override user/workspace settings with settings found in .editorconfig files
+    - Gruntfuggly.todo-tree # Display TODO and FIXME in a tree view in the activity bar
+    - mechatroner.rainbow-csv # Highlight columns in csv files in different colors
+    # - nextflow.nextflow                      # Nextflow syntax highlighting
+    - oderwat.indent-rainbow # Highlight indentation level
+    - streetsidesoftware.code-spell-checker # Spelling checker for source code
diff --git a/.nf-core.yml b/.nf-core.yml
@@ -0,0 +1 @@
+repository_type: pipeline
diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml
@@ -0,0 +1,10 @@
+# See https://pre-commit.com for more information
+# See https://pre-commit.com/hooks.html for more hooks
+repos:
+-   repo: https://github.com/pre-commit/pre-commit-hooks
+    rev: v3.2.0
+    hooks:
+    -   id: trailing-whitespace
+    -   id: end-of-file-fixer
+    -   id: check-yaml
+    -   id: check-added-large-files
diff --git a/CITATIONS.md b/CITATIONS.md
@@ -18,6 +18,10 @@
 
   > Ewels P, Magnusson M, Lundin S, Käller M. MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics. 2016 Oct 1;32(19):3047-8. doi: 10.1093/bioinformatics/btw354. Epub 2016 Jun 16. PubMed PMID: 27312411; PubMed Central PMCID: PMC5039924.
 
+- [RIdeogram](https://cran.r-project.org/web/packages/RIdeogram/vignettes/RIdeogram.html)
+
+  > Hao, Z., Lv, D., Ge, Y. et al. RIdeogram: drawing SVG graphics to visualize and map genome-wide data on the idiograms. PeerJ Comput. Sci. 6, e251 (2020). https://doi.org/10.7717/peerj-cs.251
+
 ## Software packaging/containerisation tools
 
 - [Anaconda](https://anaconda.com)

diff --git a/README.md b/README.md
@@ -10,43 +10,96 @@
 
 ## Introduction
 
-**ecoflow/genomeqc** is a bioinformatics pipeline that ...
+**ecoflow/genomeqc** is a bioinformatics pipeline that compares the quality of multiple genomes, along with their annotations.
+
+The pipeline takes a list of genomes and annotations (from raw files or RefSeq IDs), and runs commonly used tools to assess their quality.
+
+The pipeline can be run in three modes: (1) genome only, (2) annotation only, or (3) genome and annotation. **Currently, only the genome-plus-annotation mode is functional.**
 
 <!-- TODO nf-core:
-   Complete this sentence with a 2-3 sentence summary of what types of data the pipeline ingests, a brief overview of the
-   major pipeline sections and the types of output it produces. You're giving an overview to someone new
-   to nf-core here, in 15-20 seconds. For an example, see https://github.com/nf-core/rnaseq/blob/master/README.md#introduction
+For an example, see https://github.com/nf-core/rnaseq/blob/master/README.md#introduction
 -->
 
 <!-- TODO nf-core: Include a figure that guides the user through the major workflow steps. Many nf-core
-     workflows use the "tube map" design for that. See https://nf-co.re/docs/contributing/design_guidelines#examples for examples.   -->
-<!-- TODO nf-core: Fill in short bullet-pointed list of the default steps in the pipeline -->
+     workflows use the "tube map" design for that. See https://nf-co.re/docs/contributing/design_guidelines#examples for examples.
+-->
+
+**Genome and Annotation:**
+1. Downloads the genome and gene annotation files from NCBI `[NCBIGENOMEDOWNLOAD]`, or uses the genomes/annotations you provide.
+2. Describes the genome assembly:
+2a. `[BUSCO_BUSCO]`: Determines how complete the genome is compared to expectation (protein mode).
+2b. `[BUSCO_IDEOGRAM]`: Plots the location of BUSCO markers on the assembly.
+2c. `[QUAST]`: Determines the N50, a measure of how contiguous the genome is.
+2d. More options
+3. Describes your annotation: `[AGAT]`: Gene, feature, length, averages, counts.
+4. Extracts the longest protein FASTA sequences `[GFFREAD]`.
+5. Finds orthologous genes `[ORTHOFINDER]`.
+6. Summarises the results with MultiQC.
+
+> [!WARNING]
+> We strongly recommend specifying the lineage with the `--busco_lineage` parameter, as setting the lineage to `auto` (the default value) might cause problems with `[BUSCO]` during the lineage determination step.
 
-1. Read QC ([`FastQC`](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/))
-2. Present QC for raw reads ([`MultiQC`](http://multiqc.info/))
+> [!NOTE]
+> `BUSCO_IDEOGRAM` will only plot those chromosomes (or scaffolds) that contain single-copy markers.
+
+**Genome Only (in development):**
+1. Downloads the genome files from NCBI `[NCBIGENOMEDOWNLOAD]`, or uses the genomes you provide.
+2. Describes the genome assembly:
+2a. `[BUSCO_BUSCO]`: Determines how complete the genome is compared to expectation (genome mode).
+2b. `[QUAST]`: Determines the N50, a measure of how contiguous the genome is.
+2c. More options
+3. Summarises the results with MultiQC.
+
+**Annotation Only (in development):**
+1. Downloads the gene annotation files from NCBI `[NCBIGENOMEDOWNLOAD]`, or uses the annotations you provide.
+2. Describes your annotation: `[AGAT]`: Gene, feature, length, averages, counts.
+3. Summarises the results with MultiQC.
+
+In addition to the three different modes described above, it is also possible to run the pipeline with or without sequencing reads. When supplying sequencing reads, Merqury can also be run. [Merqury](https://github.com/marbl/merqury) is a tool for genome quality assessment that uses k-mer counts from raw sequencing data to evaluate the accuracy and completeness of a genome assembly. Meryl is the companion tool that efficiently counts and stores k-mers from sequencing reads, enabling Merqury to estimate metrics like assembly completeness and base accuracy. These tools provide a k-mer-based approach to assess assembly quality, helping to identify potential errors or gaps.
+
+To run the pipeline with reads, you must supply a single FASTQ file for each genome in the samplesheet, alongside the `--run_merqury` flag. It is assumed that reads used to create the assembly are from long read technology such as PacBio or ONT, and are therefore single end. If reads are in a .bam file, they must be converted to FASTQ format first. If you have paired end reads, these must be interleaved first.
 
 ## Usage
 
 > [!NOTE]
 > If you are new to Nextflow and nf-core, please refer to [this page](https://nf-co.re/docs/usage/installation) on how to set-up Nextflow. Make sure to [test your setup](https://nf-co.re/docs/usage/introduction#how-to-run-a-pipeline) with `-profile test` before running the workflow on actual data.
 
-<!-- TODO nf-core: Describe the minimum required steps to execute the pipeline, e.g. how to prepare samplesheets.
-     Explain what rows and columns represent. For instance (please edit as appropriate):
+First, prepare a `samplesheet.csv` whose rows point to your genomes and/or annotations:
+
+```csv
+species,refseq,fasta,gff,fastq
+Homo_sapiens,,/path/to/genome.fasta,/path/to/annotation.gff3,[/path/to/reads.fq.gz]
+Gorilla_gorilla,,/path/to/genome.fasta,/path/to/annotation.gff3,[/path/to/reads.fq.gz]
+Pan_paniscus,,/path/to/genome.fasta,/path/to/annotation.gff3,[/path/to/reads.fq.gz]
+```
 
-First, prepare a samplesheet with your input data that looks as follows:
+When running in `--genome_only` mode, you can leave the **gff** field empty; any value supplied there will be ignored.
 
-`samplesheet.csv`:
+Additionally, you can run the pipeline using the RefSeq IDs of your species:
 
 ```csv
-sample,fastq_1,fastq_2
-CONTROL_REP1,AEG588A1_S1_L002_R1_001.fastq.gz,AEG588A1_S1_L002_R2_001.fastq.gz
+species,refseq,fasta,gff,fastq
+Pongo_abelii,GCF_028885655.2,,,[/path/to/reads.fq.gz]
+Macaca_mulatta,GCF_003339765.1,,,[/path/to/reads.fq.gz]
 ```
 
-Each row represents a fastq file (single-end) or a pair of fastq files (paired end).
+The **fastq** field is optional. Supply sequencing reads if you intend to run Merqury using the `--run_merqury` flag. Otherwise, this field will be ignored.
 
--->
+You can mix the two input types **(in development)**.
+
+Each row represents a species, with either its genome and GFF files or a RefSeq ID (used to download the genome and GFF automatically).
+
+You can run the pipeline using test profiles or example input samplesheets. To run a test set with a samplesheet containing reads:
+
+```bash
+nextflow run main.nf -resume -profile docker,test --outdir results --run_merqury
+```
+
+To run this pipeline on an example samplesheet included in the repo assets (_does not include reads_):
 
-Now, you can run the pipeline using:
+```bash
+nextflow run main.nf -resume -profile docker --input assets/samplesheet.csv --outdir results
+```
 
 <!-- TODO nf-core: update the following command to include all required parameters for a minimal example -->
 
@@ -67,6 +120,8 @@ ecoflow/genomeqc was originally written by Chris Wyatt, Fernando Duarte.
 
 We thank the following people for their extensive assistance in the development of this pipeline:
 
+- [Stephen Turner](https://github.com/stephenturner/) ([Colossal Biosciences](https://colossal.com/))
+
 <!-- TODO nf-core: If applicable, make list of people who have also contributed -->
 
 ## Contributions and Support

diff --git a/assets/samplesheet.csv b/assets/samplesheet.csv
@@ -1,3 +1,5 @@
-sample,fastq_1,fastq_2
-SAMPLE_PAIRED_END,/path/to/fastq/files/AEG588A1_S1_L002_R1_001.fastq.gz,/path/to/fastq/files/AEG588A1_S1_L002_R2_001.fastq.gz
-SAMPLE_SINGLE_END,/path/to/fastq/files/AEG588A4_S4_L003_R1_001.fastq.gz,
+species,refseq,fasta,gff,fastq
+Vespula_vulgaris,GCF_905475345.1,,,
+Vespa_velutina,GCF_912470025.1,,,
+Apis_mellifera,GCF_003254395.2,,,
+Osmia_bicornis,GCF_907164935.1,,,
diff --git a/assets/schema_input.json b/assets/schema_input.json
@@ -7,27 +7,31 @@
     "items": {
         "type": "object",
         "properties": {
-            "sample": {
+            "species": {
                 "type": "string",
-                "pattern": "^\\S+$",
-                "errorMessage": "Sample name must be provided and cannot contain spaces",
+                "errorMessage": "Species name must be provided and cannot contain spaces",
                 "meta": ["id"]
             },
-            "fastq_1": {
+            "refseq": {
+                "type": "string",
+                "errorMessage": "RefSeq accession number"
+            },
+            "fasta": {
+                "type": "string",
+                "format": "file-path",
+                "errorMessage": "FASTA file with genome assembly"
+            },
+            "gff": {
                 "type": "string",
                 "format": "file-path",
-                "exists": true,
-                "pattern": "^\\S+\\.f(ast)?q\\.gz$",
-                "errorMessage": "FastQ file for reads 1 must be provided, cannot contain spaces and must have extension '.fq.gz' or '.fastq.gz'"
+                "errorMessage": "GFF file with genome annotation"
             },
-            "fastq_2": {
+            "fastq": {
                 "type": "string",
                 "format": "file-path",
-                "exists": true,
-                "pattern": "^\\S+\\.f(ast)?q\\.gz$",
-                "errorMessage": "FastQ file for reads 2 cannot contain spaces and must have extension '.fq.gz' or '.fastq.gz'"
+                "errorMessage": "Single compressed FASTQ file, must have extension '.fq.gz' or '.fastq.gz'"
             }
         },
-        "required": ["sample", "fastq_1"]
+        "required": ["species"]
     }
 }
diff --git a/bin/busco_2_table.py b/bin/busco_2_table.py
@@ -0,0 +1,39 @@
+#!/usr/bin/env python3
+
+# Written by Chris Wyatt and released under the MIT license.
+# Converts a group of busco outputs to a table to plot on a tree
+
+import pandas as pd
+import argparse
+
+# Set up the argument parser
+parser = argparse.ArgumentParser(description='Extract and merge specific columns from a table.')
+parser.add_argument('input_file', type=str, help='Path to the input TSV file.')
+parser.add_argument('output_file', type=str, help='Path to save the output TSV file.')
+
+# Parse the arguments
+args = parser.parse_args()
+
+# Read the input table into a pandas DataFrame
+df = pd.read_csv(args.input_file, sep='\t')
+
+# Select the required columns (copy to avoid pandas' SettingWithCopyWarning on the assignment below)
+df_extracted = df[['Input_file', 'Single', 'Duplicated', 'Fragmented', 'Missing']].copy()
+
+# Merge the columns from 'Single' to 'Missing' into a single column, with values separated by commas
+df_extracted['busco'] = df_extracted[['Single', 'Duplicated', 'Fragmented', 'Missing']].astype(str).agg(','.join, axis=1)
+
+# Drop the individual 'Single' to 'Missing' columns
+df_extracted = df_extracted[['Input_file', 'busco']]
+
+# Write the header and custom line first
+with open(args.output_file, 'w') as f:
+    # Write the header
+    f.write('species\tbusco\n')
+    # Insert 'NA<tab>pie' as the second line
+    f.write('NA\tpie\n')
+
+# Append the DataFrame content to the file without the header
+df_extracted.to_csv(args.output_file, sep='\t', index=False, mode='a', header=False)
+
+print(f"Extraction completed successfully. Output saved to {args.output_file}.")
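Review note on `bin/busco_2_table.py`: the table it writes can be illustrated with a minimal standard-library sketch. This is not the script from the diff. The function name `busco_rows_to_plot_table` and the example values are hypothetical, chosen only to show the expected layout (a `species`/`busco` header, an `NA<tab>pie` line, then one row per input file).

```python
import csv
import io

# Columns merged into the single 'busco' field, in the order used by the script.
BUSCO_COLUMNS = ["Single", "Duplicated", "Fragmented", "Missing"]


def busco_rows_to_plot_table(tsv_text):
    """Return the plot-table text for a BUSCO summary supplied as TSV text."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    # Header line, then the 'NA<tab>pie' line that the script writes second.
    lines = ["species\tbusco", "NA\tpie"]
    for row in reader:
        merged = ",".join(row[col] for col in BUSCO_COLUMNS)
        lines.append(f"{row['Input_file']}\t{merged}")
    return "\n".join(lines) + "\n"


# Made-up example input; the numbers are illustrative only.
example = (
    "Input_file\tComplete\tSingle\tDuplicated\tFragmented\tMissing\n"
    "Apis_mellifera.fasta\t95.0\t94.0\t1.0\t2.0\t3.0\n"
)
print(busco_rows_to_plot_table(example), end="")
```

Unlike the pandas version, this sketch keeps the values as the literal text found in the input rather than re-formatting them as numbers.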
diff --git a/bin/busco_create_table_for_plot.R b/bin/busco_create_table_for_plot.R
@@ -0,0 +1,74 @@
+#!/usr/bin/env Rscript
+
+# Load required libraries
+suppressMessages(library(dplyr))
+suppressMessages(library(readr))
+suppressMessages(library(stringr))
+
+# Get command line arguments
+args <- commandArgs(trailingOnly = TRUE)
+if (length(args) != 3) {
+  stop("Usage: Rscript busco_create_table_for_plot.R <busco_file> <gff_file> <output_file>")
+}
+
+busco_file <- args[1]
+gff_file <- args[2]
+output_file <- args[3]
+
+# Step 1: Read the BUSCO file line-by-line, filter out comment and "Missing" lines
+busco_raw <- readLines(busco_file)
+busco_filtered <- busco_raw[!grepl("^#|Missing", busco_raw)]
+
+# Step 2: Parse the remaining lines as a TSV without column names, then rename columns
+busco_data <- read_delim(
+  I(busco_filtered),
+  delim = "\t",
+  col_names = FALSE,
+  show_col_types = FALSE
+)
+
+# Check if the expected 7 columns are present
+if (ncol(busco_data) != 7) {
+  stop("Expected 7 columns in BUSCO data after filtering, but found ", ncol(busco_data), ". Please check the input file format.")
+}
+
+# Rename columns
+colnames(busco_data) <- c("Busco_id", "Status", "Sequence", "Score", "Length", "OrthoDB_url", "Description")
+
+# Read the GFF file
+gff_data <- read_tsv(
+  gff_file,
+  comment = "#",
+  col_names = FALSE,
+  col_types = cols(
+    X1 = col_character(), X2 = col_character(), X3 = col_character(),
+    X4 = col_integer(), X5 = col_integer(), X6 = col_character(),
+    X7 = col_character(), X8 = col_character(), X9 = col_character()
+  ),
+  show_col_types = FALSE,
+  skip_empty_rows = TRUE
+)
+
+# Extract the gene name from the 9th column in GFF, looking for ID=<value> up to the first ;
+gff_data <- gff_data %>%
+  mutate(gene_name = str_extract(X9, "ID=([^;]+)")) %>%
+  mutate(gene_name = str_replace(gene_name, "ID=", "")) %>%  # Remove the "ID=" prefix
+  filter(!is.na(gene_name))
+
+# Perform the join on gene name from both data frames
+result <- inner_join(
+  busco_data,
+  gff_data,
+  by = c("Sequence" = "gene_name")
+)
+
+# Select and rename the columns we need
+output_data <- result %>%
+  select(Status, Scaffold = X1, Start = X4, End = X5) %>%
+  distinct()  # Remove any potential duplicates
+
+# Write the output in the requested format
+write.table(output_data, file = output_file, sep = "\t", quote = FALSE, row.names = FALSE, col.names = FALSE)
+
+# Print a message to confirm the output has been written
+cat("Output has been written to", output_file, "\n")
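Review note on `bin/busco_create_table_for_plot.R`: the filter-and-join logic can be sketched in plain Python to make the intended behaviour easier to check. This is an illustrative re-expression, not part of the diff. The function name `busco_gene_positions` and the example records are hypothetical, and the sketch assumes, as the R script does, that the third BUSCO column matches the `ID=` attribute in GFF column 9.

```python
import re

# Same pattern as the R script: ID=<value> up to the first ';'.
ID_PATTERN = re.compile(r"ID=([^;]+)")


def busco_gene_positions(busco_lines, gff_lines):
    """Join BUSCO hits to GFF coordinates; returns (status, scaffold, start, end) rows."""
    # Map each GFF feature ID to every (scaffold, start, end) it appears with.
    positions = {}
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue
        match = ID_PATTERN.search(fields[8])
        if match:
            coords = (fields[0], int(fields[3]), int(fields[4]))
            positions.setdefault(match.group(1), []).append(coords)

    rows, seen = [], set()
    for line in busco_lines:
        # Drop comment lines and any line mentioning "Missing", as the R script does.
        if line.startswith("#") or "Missing" in line:
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue
        status, sequence = fields[1], fields[2]
        for scaffold, start, end in positions.get(sequence, []):
            row = (status, scaffold, start, end)
            if row not in seen:  # mirrors distinct()
                seen.add(row)
                rows.append(row)
    return rows


# Made-up records for illustration only.
busco_example = [
    "# comment line",
    "busco_A\tComplete\tgene1\t500.1\t300\thttps://example.org/a\tprotein A",
    "busco_B\tMissing",
]
gff_example = [
    "##gff-version 3",
    "chr1\tsrc\tmRNA\t100\t900\t.\t+\t.\tID=gene1;Parent=g1",
]
print(busco_gene_positions(busco_example, gff_example))
```

One consequence worth noting: BUSCO hits whose sequence name has no matching `ID=` in the GFF are silently dropped by the inner join, so a naming mismatch between the two files yields an empty table rather than an error.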