First PR on nf-core #53

Open
wants to merge 165 commits into
base: main
Commits
6259231
Add a gitpod yml
chriswyatt1 Sep 25, 2024
1f695de
gitignore
chriswyatt1 Sep 25, 2024
ff92a5c
Update README.md
chriswyatt1 Sep 27, 2024
ed77896
Merge pull request #1 from Eco-Flow/chriswyatt1-patch-1
FernandoDuarteF Oct 2, 2024
3450dd9
Added ncbigenomedownload
FernandoDuarteF Oct 3, 2024
bfb0680
Fixed output for create_path.nf
FernandoDuarteF Oct 3, 2024
6d8403b
Modified samplesheet.csv
FernandoDuarteF Oct 3, 2024
3d1577b
Looks good, just checked it worked and made a few minor edits
chriswyatt1 Oct 6, 2024
3fe1f9a
Merge pull request #7 from Eco-Flow/first_try
chriswyatt1 Oct 6, 2024
492503f
Added input validation for sample sheet
FernandoDuarteF Oct 7, 2024
68707c6
Updated error message in sample sheet validation
FernandoDuarteF Oct 8, 2024
0768675
Added busco module
FernandoDuarteF Oct 8, 2024
646568d
Added busco parameters
FernandoDuarteF Oct 8, 2024
9da8887
Added local GFFREAD and bin/ folder
FernandoDuarteF Oct 9, 2024
2851b40
Removed GFFREAD from modules.json
FernandoDuarteF Oct 9, 2024
86dd2c5
Merge pull request #10 from Eco-Flow/busco
FernandoDuarteF Oct 10, 2024
67f8de2
Added orthofinder module
FernandoDuarteF Oct 10, 2024
9f0aff1
Updated modules.json
FernandoDuarteF Oct 14, 2024
d74154b
Merge pull request #13 from Eco-Flow/orthofinder
chriswyatt1 Oct 14, 2024
500aaf5
Added GFFREAD and longest modules
FernandoDuarteF Oct 15, 2024
ef619fa
TIDK subworkflow
chriswyatt1 Oct 16, 2024
27814af
Revert "TIDK subworkflow"
chriswyatt1 Oct 16, 2024
b6c1fce
Updated main workflow for compressed files
FernandoDuarteF Oct 17, 2024
5c4f54d
Merge pull request #20 from Eco-Flow/longest_isoform
chriswyatt1 Oct 17, 2024
64ae1da
Added TIDK subworkflow
FernandoDuarteF Oct 17, 2024
239e64a
Merge pull request #23 from Eco-Flow/tidk_fer
chriswyatt1 Oct 18, 2024
c05f5e8
Update test_full.config
chriswyatt1 Oct 18, 2024
997aaf9
Merge pull request #24 from Eco-Flow/new_input
chriswyatt1 Oct 18, 2024
758c763
Add mod
chriswyatt1 Oct 18, 2024
921c70e
Added AGAT spstatistics and Quast modules
FernandoDuarteF Oct 18, 2024
0d2041d
Broken pipeline, need fix of channels
chriswyatt1 Oct 18, 2024
8a4dd8a
Fixed up to orthofinder
chriswyatt1 Oct 18, 2024
9a130d0
Fixed up to just before working tree script. Need a container
chriswyatt1 Oct 19, 2024
d12e27e
Add container for tree build in R
chriswyatt1 Oct 19, 2024
616f0e3
Merge pull request #28 from Eco-Flow/treefigure
chriswyatt1 Oct 21, 2024
2fde67b
Merge pull request #26 from Eco-Flow/quast_agatstats
chriswyatt1 Oct 21, 2024
3d543b4
Added local subworkflows
FernandoDuarteF Oct 21, 2024
1383947
Remove/add versions where needed
chriswyatt1 Oct 22, 2024
bc1e6bf
redundant_info2
chriswyatt1 Oct 22, 2024
bcc2963
A basic working tree plot and busco module
chriswyatt1 Oct 22, 2024
b724c93
Merge pull request #35 from Eco-Flow/redundant_info
FernandoDuarteF Oct 23, 2024
3c8ad5f
Two working tree plots inc pie
chriswyatt1 Oct 23, 2024
d65aa69
Resolve conflicts
FernandoDuarteF Oct 23, 2024
1715fcc
Working plots for busco and quast
chriswyatt1 Oct 23, 2024
21e4e4e
Merge branch 'dev' into working_tree
chriswyatt1 Oct 23, 2024
2ddb427
Merge pull request #36 from Eco-Flow/working_tree
chriswyatt1 Oct 23, 2024
c5f4670
Added GFFREAD from excon
FernandoDuarteF Oct 23, 2024
7340ef9
Added tree plot
FernandoDuarteF Oct 24, 2024
2525b8d
Fix species extension removal for plotting
chriswyatt1 Oct 24, 2024
f2714c0
pie chart for busco
chriswyatt1 Oct 24, 2024
972552e
Merge branch 'dev' into subworkflows
FernandoDuarteF Oct 24, 2024
5349cce
Merge pull request #32 from Eco-Flow/subworkflows
FernandoDuarteF Oct 24, 2024
0cf0dd0
new plots organisation
chriswyatt1 Oct 24, 2024
b6f060e
Update README.md
chriswyatt1 Oct 24, 2024
63065f3
Merge pull request #47 from Eco-Flow/Better_tree_plots
FernandoDuarteF Oct 24, 2024
94558df
Removed excon scripts
FernandoDuarteF Oct 25, 2024
230e61b
Fixed genome only option not working
FernandoDuarteF Oct 28, 2024
8a0a2df
add merqury and meryl modules to json with nf-core tools genomeqc/#42…
stephenturner Oct 28, 2024
1acf3f1
add merqury and meryl module configs genomeqc#42 genomeqc#58
stephenturner Oct 28, 2024
081ad3f
add merqury module genomeqc#42
stephenturner Oct 28, 2024
e43785b
add meryl module genomeqc#58
stephenturner Oct 28, 2024
d0e2859
include modules for meryl and merqury
stephenturner Oct 28, 2024
6336ad5
Add files via upload
fperezcobos Oct 28, 2024
7b4b083
add fasta null default and kvalue for meryl default k=21
stephenturner Oct 28, 2024
ab1d1c5
typo in outdir
stephenturner Oct 28, 2024
6b32c66
add meryl count to workflow genomeqc#58
stephenturner Oct 28, 2024
a34a1cb
update schema with web ui builder
stephenturner Oct 28, 2024
b63b944
fix whitespace in schema
stephenturner Oct 28, 2024
c193d3c
add meryl unionsum genomeqc#60
stephenturner Oct 28, 2024
56f5e42
merqury skip param
stephenturner Oct 28, 2024
9380cb0
add merqury step in wf
stephenturner Oct 28, 2024
9008ece
remove stray view()
stephenturner Oct 28, 2024
5490ddb
Updated subworkflows and workflows
FernandoDuarteF Oct 28, 2024
22973ef
allow for fastq input in samplesheet see also nf-core/test-datasets#1365
stephenturner Oct 29, 2024
374dc82
update
fperezcobos Oct 29, 2024
8c4757c
push
fperezcobos Oct 29, 2024
813e85f
added agat module
fperezcobos Oct 29, 2024
832e115
Merge pull request #1 from fperezcobos/test
fperezcobos Oct 29, 2024
aa13541
Delete test_prunus_dulcis.csv
fperezcobos Oct 29, 2024
9e70870
Delete test_athaliana.csv
fperezcobos Oct 29, 2024
a084ff4
Delete felipe_testing.config
fperezcobos Oct 29, 2024
788ae45
Scripts updated
chriswyatt1 Oct 29, 2024
b7c4ff2
return fastq in validateinputsamplesheet #62
stephenturner Oct 29, 2024
e558904
fix cardinality for CREATE_PATH and branch cardinality on samplesheet
stephenturner Oct 29, 2024
f5247e1
Update conf/test_full.config
chriswyatt1 Oct 29, 2024
2ef206a
Added AGAT gff checking to genome_and_annotation
fperezcobos Oct 29, 2024
05b062c
Update .gitignore
fperezcobos Oct 29, 2024
bd55499
merqury_skip = false in test profile #62 #42
stephenturner Oct 29, 2024
9615c08
run merqury if providing fastq file #62 #42
stephenturner Oct 29, 2024
1611259
Fix Script path issue
chriswyatt1 Oct 29, 2024
d8b7cca
test profiles
stephenturner Oct 29, 2024
4c85fb8
Merge pull request #66 from nf-core/Fix-bin-path-removal
FernandoDuarteF Oct 29, 2024
72f2a6e
conditionally file() the fastq if it's present in the sample sheet, f…
stephenturner Oct 29, 2024
a6fd2a4
remove views
stephenturner Oct 29, 2024
25f9bbd
Add tidk optional
chriswyatt1 Oct 29, 2024
3a29414
Merge pull request #65 from fperezcobos/dev
chriswyatt1 Oct 29, 2024
5e88fd9
Merge branch 'dev' into dev
FernandoDuarteF Oct 29, 2024
01d7ced
make test profile use reads, test_nofastq use _nofastq csv
stephenturner Oct 29, 2024
491ba12
add fastq column to samplesheet
stephenturner Oct 29, 2024
6d1c3f9
update readme with info about reads/merqury, different test profiles,…
stephenturner Oct 29, 2024
91f9f5a
add missing test run command in readme for running test data without …
stephenturner Oct 29, 2024
5e5f897
Merge pull request #63 from stephenturner/dev
stephenturner Oct 29, 2024
e06b754
busco and quast added to mq report
Oct 30, 2024
f3493bb
CREATE_PATH now outputs a tuple
FernandoDuarteF Oct 30, 2024
03d6601
Merge branch 'dev' into Conditional-tidk-flags
chriswyatt1 Nov 4, 2024
0159836
Merge pull request #70 from nf-core/Conditional-tidk-flags
FernandoDuarteF Nov 5, 2024
fbe093e
Fixed uncompressing not working
FernandoDuarteF Nov 5, 2024
84f5ec8
First commit
chriswyatt1 Nov 5, 2024
b04b51f
with gff lineage and busco into the process
chriswyatt1 Nov 5, 2024
8532448
Used multiMap instead of map for combined input channels
FernandoDuarteF Nov 5, 2024
c86b26e
Merge branch 'dev' into new_input_validation
FernandoDuarteF Nov 6, 2024
e45081d
Resolve merge conflicts
FernandoDuarteF Nov 6, 2024
6613139
Merge pull request #64 from nf-core/Organise_script_comments
FernandoDuarteF Nov 6, 2024
336c35c
Working_version
chriswyatt1 Nov 6, 2024
2b17f56
Solved conflicts with dev and improved syntax (I think)
FernandoDuarteF Nov 6, 2024
d51d5d8
Fix out channels and published correctly
chriswyatt1 Nov 6, 2024
6ec7fc1
Updated nextflow_schema.json
FernandoDuarteF Nov 6, 2024
c581a37
Added ".pre-commit-config.yaml" for pre-commit
FernandoDuarteF Nov 6, 2024
c56b5e6
Changed "merqury_skip" to "skip_merqury"
FernandoDuarteF Nov 6, 2024
49a21b3
Merge pull request #57 from Eco-Flow/new_input_validation
FernandoDuarteF Nov 6, 2024
eac2215
Fixed combine() not working when empty ch_fastq
FernandoDuarteF Nov 7, 2024
3c870c9
Added --run_merquery flag
FernandoDuarteF Nov 8, 2024
dfe2aaf
Fixed QUAST inputs out of sync
FernandoDuarteF Nov 8, 2024
de9a0d7
Merge pull request #84 from Eco-Flow/fix_ch_input
chriswyatt1 Nov 8, 2024
45e6f1e
Merge branch 'dev' into busco_ideograms
FernandoDuarteF Nov 8, 2024
111424d
Fixed channel names for ideogram
FernandoDuarteF Nov 8, 2024
5736fa6
Fixed busco ideogram not working
FernandoDuarteF Nov 13, 2024
11e9cb4
Commented some lines
FernandoDuarteF Nov 13, 2024
a26730a
Merge pull request #82 from nf-core/busco_ideograms
FernandoDuarteF Nov 13, 2024
9be9c72
Updated README
FernandoDuarteF Nov 13, 2024
64a7fdf
Merge pull request #87 from Eco-Flow/update_readme
FernandoDuarteF Nov 13, 2024
d1d6e64
Added longest and nf-core GFFREAD modules
FernandoDuarteF Nov 13, 2024
a7f288e
Added AGAT extract sequences
FernandoDuarteF Nov 18, 2024
57df688
Added AGAT extract sequences
FernandoDuarteF Nov 18, 2024
af478f3
Added fasta validator
FernandoDuarteF Nov 19, 2024
f0a3123
Added nf-core GFFREAD back
FernandoDuarteF Nov 20, 2024
41d689a
Modified ideogram script for better visualization
FernandoDuarteF Nov 20, 2024
f97b80b
Update README
FernandoDuarteF Nov 20, 2024
28c4c49
Update README.md
FernandoDuarteF Nov 20, 2024
f82c2cc
Update README.md
FernandoDuarteF Nov 20, 2024
e381163
Improved readability
FernandoDuarteF Nov 20, 2024
5206882
Merge pull request #88 from Eco-Flow/agat_longest_isoform
FernandoDuarteF Nov 20, 2024
53bc65d
Update README.md
FernandoDuarteF Nov 20, 2024
4d00256
Gene overlap first commit
chriswyatt1 Nov 21, 2024
12dba5c
second commit
chriswyatt1 Nov 21, 2024
5c7711f
remove installs
chriswyatt1 Nov 21, 2024
6399712
third commit working
chriswyatt1 Nov 21, 2024
ab3f837
Merge branch 'dev' into dev
FernandoDuarteF Nov 25, 2024
1d560ca
Merge pull request #74 from awanalkoerdi289/dev
FernandoDuarteF Nov 26, 2024
4835f86
Fixed multiqc not working
FernandoDuarteF Nov 26, 2024
520f6d5
Added second table with count stats
chriswyatt1 Nov 26, 2024
435368a
Removed meta from output channels for multiqc
FernandoDuarteF Nov 27, 2024
e642bee
Fixed Quast results not showing in the multiqc report
FernandoDuarteF Nov 27, 2024
e8a6faa
Merge pull request #95 from Eco-Flow/fix_multiqc
FernandoDuarteF Nov 29, 2024
6729de5
Merge branch 'dev' into gene_overlap
chriswyatt1 Dec 1, 2024
20bb70f
Merge pull request #94 from chriswyatt1/gene_overlap
chriswyatt1 Dec 1, 2024
6385ae8
Decreased size of results folder
FernandoDuarteF Dec 2, 2024
f944325
Updated nextflow_schema.json
FernandoDuarteF Dec 2, 2024
ae76a10
Merge pull request #101 from nf-core/publish_results
FernandoDuarteF Dec 3, 2024
d6350fa
Added genome ideogram local module
FernandoDuarteF Dec 3, 2024
a43634a
Update plot_markers scripts
FernandoDuarteF Dec 5, 2024
b84b762
Fixed ideogram not working for genome mode
FernandoDuarteF Dec 6, 2024
c69d846
More descriptive names for modules and subworkflows
FernandoDuarteF Dec 6, 2024
fe07df8
Removed redundant lines and updated modules.config
FernandoDuarteF Dec 9, 2024
e7df634
Merge pull request #102 from nf-core/genome_ideogram
FernandoDuarteF Dec 10, 2024
14 changes: 14 additions & 0 deletions .gitignore
Original file line number Diff line number Diff line change
@@ -0,0 +1,14 @@
*.pyc
.DS_Store
.nextflow*
.nf-test.log
data/
nf-test
.nf-test*
results/
test.xml
testing*
testing/
work/
log
out
21 changes: 21 additions & 0 deletions .gitpod.yml
@@ -0,0 +1,21 @@
image: nfcore/gitpod:latest
tasks:
  - name: Update Nextflow and setup pre-commit
    command: |
      pre-commit install --install-hooks
      nextflow self-update
  - name: unset JAVA_TOOL_OPTIONS
    command: |
      unset JAVA_TOOL_OPTIONS
vscode:
  extensions: # based on nf-core.nf-core-extensionpack
    - codezombiech.gitignore # Language support for .gitignore files
    # - cssho.vscode-svgviewer # SVG viewer
    - esbenp.prettier-vscode # Markdown/CommonMark linting and style checking for Visual Studio Code
    - eamodio.gitlens # Quickly glimpse into whom, why, and when a line or code block was changed
    - EditorConfig.EditorConfig # override user/workspace settings with settings found in .editorconfig files
    - Gruntfuggly.todo-tree # Display TODO and FIXME in a tree view in the activity bar
    - mechatroner.rainbow-csv # Highlight columns in csv files in different colors
    # - nextflow.nextflow # Nextflow syntax highlighting
    - oderwat.indent-rainbow # Highlight indentation level
    - streetsidesoftware.code-spell-checker # Spelling checker for source code
1 change: 1 addition & 0 deletions .nf-core.yml
@@ -0,0 +1 @@
repository_type: pipeline
10 changes: 10 additions & 0 deletions .pre-commit-config.yaml
@@ -0,0 +1,10 @@
# See https://pre-commit.com for more information
# See https://pre-commit.com/hooks.html for more hooks
repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v3.2.0
    hooks:
      - id: trailing-whitespace
      - id: end-of-file-fixer
      - id: check-yaml
      - id: check-added-large-files
4 changes: 4 additions & 0 deletions CITATIONS.md
@@ -18,6 +18,10 @@

> Ewels P, Magnusson M, Lundin S, Käller M. MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics. 2016 Oct 1;32(19):3047-8. doi: 10.1093/bioinformatics/btw354. Epub 2016 Jun 16. PubMed PMID: 27312411; PubMed Central PMCID: PMC5039924.

- [RIdeogram](https://cran.r-project.org/web/packages/RIdeogram/vignettes/RIdeogram.html)

> Hao, Z., Lv, D., Ge, Y. et al. RIdeogram: drawing SVG graphics to visualize and map genome-wide data on the idiograms. PeerJ Comput. Sci. 6, e251 (2020). https://doi.org/10.7717/peerj-cs.251

## Software packaging/containerisation tools

- [Anaconda](https://anaconda.com)
89 changes: 72 additions & 17 deletions README.md
@@ -10,43 +10,96 @@

## Introduction

**ecoflow/genomeqc** is a bioinformatics pipeline that ...
**ecoflow/genomeqc** is a bioinformatics pipeline that compares the quality of multiple genomes, along with their annotations.

The pipeline takes a list of genomes and annotations (from raw files or Refseq IDs), and runs commonly used tools to assess their quality.

There are three ways to run this pipeline: 1. Genome only, 2. Annotation only, or 3. Genome and Annotation. **Only Genome and Annotation is currently functional.**

<!-- TODO nf-core:
Complete this sentence with a 2-3 sentence summary of what types of data the pipeline ingests, a brief overview of the
major pipeline sections and the types of output it produces. You're giving an overview to someone new
to nf-core here, in 15-20 seconds. For an example, see https://github.com/nf-core/rnaseq/blob/master/README.md#introduction
For an example, see https://github.com/nf-core/rnaseq/blob/master/README.md#introduction
-->

<!-- TODO nf-core: Include a figure that guides the user through the major workflow steps. Many nf-core
workflows use the "tube map" design for that. See https://nf-co.re/docs/contributing/design_guidelines#examples for examples. -->
<!-- TODO nf-core: Fill in short bullet-pointed list of the default steps in the pipeline -->
workflows use the "tube map" design for that. See https://nf-co.re/docs/contributing/design_guidelines#examples for examples.
-->

**Genome and Annotation:**
1. Downloads the genome and gene annotation files from NCBI `[NCBIGENOMEDOWNLOAD]`, or uses the genomes/annotations you provide.
2. Describes the genome assembly:
2a. `[BUSCO_BUSCO]`: Determines how complete the genome is compared to what is expected (protein mode).
2b. `[BUSCO_IDEOGRAM]`: Plots the location of BUSCO markers on the assembly.
2c. `[QUAST]`: Determines the N50, a measure of how contiguous the genome is.
2d. More options
3. Describes your annotation: `[AGAT]`: Gene, feature, length, averages, counts.
4. Extracts the longest protein FASTA sequences `[GFFREAD]`.
5. Finds orthologous genes `[ORTHOFINDER]`.
6. Summary with MultiQC.

> [!WARNING]
> We strongly recommend specifying the lineage with the `--busco_lineage` parameter, as leaving it set to `auto` (the default value) might cause problems with `[BUSCO]` during the lineage determination step.

1. Read QC ([`FastQC`](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/))
2. Present QC for raw reads ([`MultiQC`](http://multiqc.info/))
> [!NOTE]
> `BUSCO_IDEOGRAM` will only plot those chromosomes (or scaffolds) that contain single-copy markers.

**Genome Only (in development):**
1. Downloads the genome files from NCBI `[NCBIGENOMEDOWNLOAD]`, or uses the genomes you provide.
2. Describes the genome assembly:
2a. `[BUSCO_BUSCO]`: Determines how complete the genome is compared to what is expected (genome mode).
2b. `[QUAST]`: Determines the N50, a measure of how contiguous the genome is.
2c. More options
3. Summary with MultiQC.

**Annotation Only (in development):**
1. Downloads the gene annotation files from NCBI `[NCBIGENOMEDOWNLOAD]`, or uses the annotations you provide.
2. Describes your annotation: `[AGAT]`: Gene, feature, length, averages, counts.
3. Summary with MultiQC.

In addition to the three modes described above, the pipeline can be run with or without sequencing reads. When sequencing reads are supplied, Merqury can also be run. [Merqury](https://github.com/marbl/merqury) is a tool for genome quality assessment that uses k-mer counts from raw sequencing data to evaluate the accuracy and completeness of a genome assembly. Meryl is the companion tool that efficiently counts and stores k-mers from sequencing reads, enabling Merqury to estimate metrics like assembly completeness and base accuracy. These tools provide a k-mer-based approach to assess assembly quality, helping to identify potential errors or gaps.

To run the pipeline with reads, you must supply a single FASTQ file for each genome in the samplesheet, alongside the `--run_merqury` flag. It is assumed that the reads used to create the assembly come from a long-read technology such as PacBio or ONT, and are therefore single-end. If reads are in a .bam file, they must be converted to FASTQ format first. If you have paired-end reads, these must be interleaved first.

## Usage

> [!NOTE]
> If you are new to Nextflow and nf-core, please refer to [this page](https://nf-co.re/docs/usage/installation) on how to set-up Nextflow. Make sure to [test your setup](https://nf-co.re/docs/usage/introduction#how-to-run-a-pipeline) with `-profile test` before running the workflow on actual data.

<!-- TODO nf-core: Describe the minimum required steps to execute the pipeline, e.g. how to prepare samplesheets.
Explain what rows and columns represent. For instance (please edit as appropriate):
First, prepare a `samplesheet.csv` in which each row points to a genome and/or an annotation:

```csv
species,refseq,fasta,gff,fastq
Homo_sapiens,,/path/to/genome.fasta,/path/to/annotation.gff3,[/path/to/reads.fq.gz]
Gorilla_gorilla,,/path/to/genome.fasta,/path/to/annotation.gff3,[/path/to/reads.fq.gz]
Pan_paniscus,,/path/to/genome.fasta,/path/to/annotation.gff3,[/path/to/reads.fq.gz]
```

First, prepare a samplesheet with your input data that looks as follows:
When running in `--genome_only` mode, you can leave the **gff** field empty; if a value is supplied in this mode, it will be ignored.

`samplesheet.csv`:
Additionally, you can run the pipeline using the Refseq IDs of your species:

```csv
sample,fastq_1,fastq_2
CONTROL_REP1,AEG588A1_S1_L002_R1_001.fastq.gz,AEG588A1_S1_L002_R2_001.fastq.gz
species,refseq,fasta,gff,fastq
Pongo_abelii,GCF_028885655.2,,,[/path/to/reads.fq.gz]
Macaca_mulatta,GCF_003339765.1,,,[/path/to/reads.fq.gz]
```

Each row represents a fastq file (single-end) or a pair of fastq files (paired end).
The **fastq** field is optional. Supply sequencing reads if you intend to run Merqury with the `--run_merqury` flag. Otherwise, this field will be ignored.

-->
You can mix the two input types **(in development)**.

Each row represents a species, with its associated genome, gff or Refseq ID (to autodownload the genome + gff).

You can run the pipeline using test profiles or example input samplesheets. To run a test set with a samplesheet containing reads:

```
nextflow run main.nf -resume -profile docker,test --outdir results --run_merqury
```

To run this pipeline on an example samplesheet included in the repo assets (_does not include reads_):

Now, you can run the pipeline using:
```
nextflow run main.nf -resume -profile docker --input assets/samplesheet.csv --outdir results
```

<!-- TODO nf-core: update the following command to include all required parameters for a minimal example -->

@@ -67,6 +120,8 @@ ecoflow/genomeqc was originally written by Chris Wyatt, Fernando Duarte.

We thank the following people for their extensive assistance in the development of this pipeline:

- [Stephen Turner](https://github.com/stephenturner/) ([Colossal Biosciences](https://colossal.com/))

<!-- TODO nf-core: If applicable, make list of people who have also contributed -->

## Contributions and Support
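The README above describes `[QUAST]` as reporting the N50, a measure of how contiguous an assembly is. As a rough illustration of that metric only (this is not QUAST's implementation), the sketch below computes the N50 from a list of contig lengths: walking from the longest contig to the shortest, it returns the length at which the running total first reaches half of the assembly size. The function name `n50` and the example lengths are hypothetical.

```python
def n50(contig_lengths):
    """Return the N50 of an assembly given its contig (or scaffold) lengths."""
    if not contig_lengths:
        raise ValueError("at least one contig length is required")
    total = sum(contig_lengths)
    running = 0
    # Walk contigs from longest to shortest until half the assembly is covered.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length


if __name__ == "__main__":
    # Hypothetical assembly of four contigs (total length 100).
    # 40 alone covers less than half; 40 + 30 = 70 covers at least half.
    print(n50([40, 30, 20, 10]))
```

A higher N50 for the same total assembly size means that fewer, longer contigs cover half of the genome.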
8 changes: 5 additions & 3 deletions assets/samplesheet.csv
@@ -1,3 +1,5 @@
sample,fastq_1,fastq_2
SAMPLE_PAIRED_END,/path/to/fastq/files/AEG588A1_S1_L002_R1_001.fastq.gz,/path/to/fastq/files/AEG588A1_S1_L002_R2_001.fastq.gz
SAMPLE_SINGLE_END,/path/to/fastq/files/AEG588A4_S4_L003_R1_001.fastq.gz,
species,refseq,fasta,gff,fastq
Vespula_vulgaris,GCF_905475345.1,,,
Vespa_velutina,GCF_912470025.1,,,
Apis_mellifera,GCF_003254395.2,,,
Osmia_bicornis,GCF_907164935.1,,,
28 changes: 16 additions & 12 deletions assets/schema_input.json
@@ -7,27 +7,31 @@
"items": {
"type": "object",
"properties": {
"sample": {
"species": {
"type": "string",
"pattern": "^\\S+$",
"errorMessage": "Sample name must be provided and cannot contain spaces",
"errorMessage": "Species name must be provided and cannot contain spaces",
"meta": ["id"]
},
"fastq_1": {
"refseq": {
"type": "string",
"errorMessage": "RefSeq accession number"
},
"fasta": {
"type": "string",
"format": "file-path",
"errorMessage": "FASTA file with genome assembly"
},
"gff": {
"type": "string",
"format": "file-path",
"exists": true,
"pattern": "^\\S+\\.f(ast)?q\\.gz$",
"errorMessage": "FastQ file for reads 1 must be provided, cannot contain spaces and must have extension '.fq.gz' or '.fastq.gz'"
"errorMessage": "GFF file with genome annotation"
},
"fastq_2": {
"fastq": {
"type": "string",
"format": "file-path",
"exists": true,
"pattern": "^\\S+\\.f(ast)?q\\.gz$",
"errorMessage": "FastQ file for reads 2 cannot contain spaces and must have extension '.fq.gz' or '.fastq.gz'"
"errorMessage": "Single compressed FASTQ file, must have extension '.fq.gz' or '.fastq.gz'"
}
},
"required": ["sample", "fastq_1"]
"required": ["species"]
}
}
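To make the new samplesheet layout concrete, here is a minimal stand-alone sketch of the checks implied by the schema and README above. It is not the pipeline's validation code, which lives in the Nextflow workflow. It assumes three rules: `species` is required and has no whitespace (from the schema), a `fastq` value ends in `.fq.gz` or `.fastq.gz` (from the schema's error message), and each row carries either a `refseq` ID or a `fasta` path (an assumption drawn from the README's description of a row). The function name `check_samplesheet` is hypothetical.

```python
import csv
import io
import re

EXPECTED_COLUMNS = ["species", "refseq", "fasta", "gff", "fastq"]
FASTQ_SUFFIX = re.compile(r"\.f(ast)?q\.gz$")


def check_samplesheet(text):
    """Return a list of human-readable problems found in a samplesheet."""
    reader = csv.DictReader(io.StringIO(text))
    if reader.fieldnames != EXPECTED_COLUMNS:
        return [f"unexpected header: {reader.fieldnames}"]
    problems = []
    # Data rows start on line 2 because line 1 is the header.
    for line_no, row in enumerate(reader, start=2):
        species = row.get("species") or ""
        refseq = row.get("refseq") or ""
        fasta = row.get("fasta") or ""
        fastq = row.get("fastq") or ""
        if not re.fullmatch(r"\S+", species):
            problems.append(f"line {line_no}: species must be provided and cannot contain spaces")
        # Assumption: a row needs a genome source, either a RefSeq ID or a FASTA path.
        if not refseq and not fasta:
            problems.append(f"line {line_no}: provide either a refseq ID or a fasta path")
        if fastq and not FASTQ_SUFFIX.search(fastq):
            problems.append(f"line {line_no}: fastq must end in '.fq.gz' or '.fastq.gz'")
    return problems


if __name__ == "__main__":
    sheet = (
        "species,refseq,fasta,gff,fastq\n"
        "Apis_mellifera,GCF_003254395.2,,,\n"
        "Homo sapiens,,genome.fasta,,reads.fastq\n"
    )
    for problem in check_samplesheet(sheet):
        print(problem)
```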
39 changes: 39 additions & 0 deletions bin/busco_2_table.py
@@ -0,0 +1,39 @@
#!/usr/bin/python3

# Written by Chris Wyatt and released under the MIT license.
# Converts a group of busco outputs to a table to plot on a tree

import pandas as pd
Member:
This is a personal thing, but I think loose scripts should have a comment at the top, either saying where the script originated (e.g. adapted from script at url:path/to/script, or noting that it is part of an existing package and was copied into the workflow) or carrying an "authored by" line to denote that it is a custom-written script.

Contributor Author:
Added a link to the original script, or a note on what the script does and who wrote it.

Contributor Author:
#64
import argparse

# Set up the argument parser
parser = argparse.ArgumentParser(description='Extract and merge specific columns from a table.')
parser.add_argument('input_file', type=str, help='Path to the input TSV file.')
parser.add_argument('output_file', type=str, help='Path to save the output TSV file.')

# Parse the arguments
args = parser.parse_args()

# Read the input table into a pandas DataFrame
df = pd.read_csv(args.input_file, sep='\t')

# Select the required columns (.copy() avoids a SettingWithCopyWarning on the assignment below)
df_extracted = df[['Input_file', 'Single', 'Duplicated', 'Fragmented', 'Missing']].copy()

# Merge the columns from 'Single' to 'Missing' into a single column, with values separated by commas
df_extracted['busco'] = df_extracted[['Single', 'Duplicated', 'Fragmented', 'Missing']].astype(str).agg(','.join, axis=1)

# Drop the individual 'Single' to 'Missing' columns
df_extracted = df_extracted[['Input_file', 'busco']]

# Write the header and custom line first
with open(args.output_file, 'w') as f:
    # Write the header
    f.write('species\tbusco\n')
    # Insert 'NA<tab>pie' as the second line
    f.write('NA\tpie\n')

# Append the DataFrame content to the file without the header
df_extracted.to_csv(args.output_file, sep='\t', index=False, mode='a', header=False)

print(f"Extraction completed successfully. Output saved to {args.output_file}.")
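To show the output layout this script produces, the sketch below repeats the same transformation on an in-memory table. It deliberately uses the standard-library `csv` module instead of pandas so that it runs without dependencies; it illustrates the format and is not a replacement for the script. The column names come from the script above, while the function name `busco_rows_to_plot_table` and the example values are made up.

```python
import csv
import io

COUNT_COLUMNS = ["Single", "Duplicated", "Fragmented", "Missing"]


def busco_rows_to_plot_table(tsv_text):
    """Build the 'species<TAB>busco' table text from a BUSCO summary TSV."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    # Header line, then the 'NA<TAB>pie' line that the script writes second.
    lines = ["species\tbusco", "NA\tpie"]
    for row in reader:
        # Join the four count columns into one comma-separated field.
        counts = ",".join(row[col] for col in COUNT_COLUMNS)
        lines.append(f"{row['Input_file']}\t{counts}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical single-species summary; the numbers are illustrative only.
    example = (
        "Input_file\tComplete\tSingle\tDuplicated\tFragmented\tMissing\n"
        "Apis_mellifera.fasta\t95.0\t94.0\t1.0\t2.0\t3.0\n"
    )
    print(busco_rows_to_plot_table(example), end="")
```

Unlike the pandas version, this sketch passes the count values through as the strings found in the input, so number formatting may differ slightly.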
74 changes: 74 additions & 0 deletions bin/busco_create_table_for_plot.R
@@ -0,0 +1,74 @@
#!/usr/bin/env Rscript

# Load required libraries
suppressMessages(library(dplyr))
suppressMessages(library(readr))
suppressMessages(library(stringr))

# Get command line arguments
args <- commandArgs(trailingOnly = TRUE)
if (length(args) != 3) {
  stop("Usage: Rscript busco_create_table_for_plot.R <busco_file> <gff_file> <output_file>")
}

busco_file <- args[1]
gff_file <- args[2]
output_file <- args[3]

# Step 1: Read the BUSCO file line-by-line, filter out comment and "Missing" lines
busco_raw <- readLines(busco_file)
busco_filtered <- busco_raw[!grepl("^#|Missing", busco_raw)]

# Step 2: Parse the remaining lines as a TSV without column names, then rename columns
busco_data <- read_delim(
  I(busco_filtered),
  delim = "\t",
  col_names = FALSE,
  show_col_types = FALSE
)

# Check if the expected 7 columns are present
if (ncol(busco_data) != 7) {
  stop("Expected 7 columns in BUSCO data after filtering, but found ", ncol(busco_data), ". Please check the input file format.")
}

# Rename columns
colnames(busco_data) <- c("Busco_id", "Status", "Sequence", "Score", "Length", "OrthoDB_url", "Description")

# Read the GFF file
gff_data <- read_tsv(
  gff_file,
  comment = "#",
  col_names = FALSE,
  col_types = cols(
    X1 = col_character(), X2 = col_character(), X3 = col_character(),
    X4 = col_integer(), X5 = col_integer(), X6 = col_character(),
    X7 = col_character(), X8 = col_character(), X9 = col_character()
  ),
  show_col_types = FALSE,
  skip_empty_rows = TRUE
)

# Extract the gene name from the 9th column in GFF, looking for ID=<value> up to the first ;
gff_data <- gff_data %>%
  mutate(gene_name = str_extract(X9, "ID=([^;]+)")) %>%
  mutate(gene_name = str_replace(gene_name, "ID=", "")) %>% # Remove the "ID=" prefix
  filter(!is.na(gene_name))

# Perform the join on gene name from both data frames
result <- inner_join(
  busco_data,
  gff_data,
  by = c("Sequence" = "gene_name")
)

# Select and rename the columns we need
output_data <- result %>%
  select(Status, Scaffold = X1, Start = X4, End = X5) %>%
  distinct() # Remove any potential duplicates

# Write the output in the requested format
write.table(output_data, file = output_file, sep = "\t", quote = FALSE, row.names = FALSE, col.names = FALSE)

# Print a message to confirm the output has been written
cat("Output has been written to", output_file, "\n")
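The R script above joins BUSCO hits to GFF features by matching the BUSCO `Sequence` column against the `ID=` attribute in the ninth GFF column. The Python sketch below mirrors that logic on small in-memory inputs to make the join explicit. It is an illustration under simplifying assumptions, not a port of the script: it only treats lines that start with `#` as comments, and the function name `busco_gff_join` and all example records are hypothetical.

```python
import re

ID_PATTERN = re.compile(r"ID=([^;]+)")


def busco_gff_join(busco_lines, gff_lines):
    """Return unique (status, scaffold, start, end) tuples for matched BUSCO hits."""
    # Map each GFF feature ID to the coordinates of the features carrying it.
    coords = {}
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9:
            continue
        match = ID_PATTERN.search(fields[8])
        if match:
            feature = (fields[0], int(fields[3]), int(fields[4]))
            coords.setdefault(match.group(1), []).append(feature)

    rows = []
    for line in busco_lines:
        # Same filter as the R script: drop comments and any line containing "Missing".
        if line.startswith("#") or "Missing" in line:
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue
        status, sequence = fields[1], fields[2]
        for scaffold, start, end in coords.get(sequence, []):
            row = (status, scaffold, start, end)
            if row not in rows:  # keep first occurrence only, like distinct()
                rows.append(row)
    return rows


if __name__ == "__main__":
    busco = [
        "# BUSCO version\n",
        "1at7399\tComplete\tgene1\t100.0\t300\turl\tdesc\n",
        "2at7399\tMissing\n",
    ]
    gff = [
        "##gff-version 3\n",
        "chr1\tsrc\tgene\t10\t500\t.\t+\t.\tID=gene1;Name=x\n",
    ]
    for row in busco_gff_join(busco, gff):
        print("\t".join(str(value) for value in row))
```

Building a dictionary keyed by feature ID keeps the lookup for each BUSCO hit cheap, which is the same role `inner_join` plays in the R version.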