Support for grid view (and a few other things) #114

Merged on Oct 2, 2024 (22 commits)
Changes from 19 commits
Commits
d13d1ed
Fixed the comment
muffato Sep 19, 2024
2c793aa
UPDATEBLOBDIR needs to pass the tsv files through certain arguments
muffato Sep 19, 2024
b55f76c
Populate as much as possible of the metadata at the beginning
muffato Sep 19, 2024
0daf1e2
Also generate TSV files
muffato Jul 2, 2024
2cf12fd
Example plugging the TSV files into blobtools commands
muffato Jul 2, 2024
b8b1fa4
Skip BLOBTOOLKIT_METADATA as we generate the yaml file correctly from…
muffato Sep 19, 2024
41b101f
Introduced a new samplesheet parameter to track single/paired readsets
muffato Sep 20, 2024
80a100b
Populate as much as possible of the metadata at the beginning
muffato Sep 20, 2024
bea3a77
Version bumps
muffato Sep 20, 2024
4edafb4
Updated the documentation
muffato Sep 20, 2024
904a512
Version bump
muffato Sep 20, 2024
51985ad
Publish the windowmasker outputs too
muffato Sep 20, 2024
f47dd16
Convert input BAM/CRAM files to Fasta on the fly
muffato Sep 23, 2024
43254bc
Bumped the version down
muffato Sep 26, 2024
c89b1b9
Pin the version of editorconfig like in the readmapping pipeline
muffato Sep 26, 2024
152282b
Exclude sqlite databases from ECLint
muffato Sep 26, 2024
476b736
add exclusions for nf-core lint
tkchafin Sep 30, 2024
303b221
Update CITATIONS.md
tkchafin Sep 30, 2024
27b0a51
prettier linting
tkchafin Sep 30, 2024
4eef0fa
The release is going to happen today
muffato Oct 2, 2024
0d8a5a9
Decreased the runtime requirements for illumina alignments
muffato Oct 2, 2024
1909646
Wording of the changelog
muffato Oct 2, 2024
4 changes: 2 additions & 2 deletions .github/workflows/linting.yml
@@ -19,10 +19,10 @@ jobs:
- uses: actions/setup-node@v4

- name: Install editorconfig-checker
run: npm install -g editorconfig-checker
run: npm install -g editorconfig-checker@3.0.2

- name: Run ECLint check
run: editorconfig-checker -exclude README.md $(find .* -type f | grep -v '.git\|.py\|.md\|json\|yml\|yaml\|html\|css\|work\|.nextflow\|build\|nf_core.egg-info\|log.txt\|Makefile')
run: editorconfig-checker -exclude README.md $(find .* -type f | grep -v '.git\|.py\|.md\|json\|yml\|yaml\|html\|css\|work\|.nextflow\|build\|nf_core.egg-info\|log.txt\|Makefile\|.sqlite3')

Prettier:
runs-on: ubuntu-latest
5 changes: 5 additions & 0 deletions .nf-core.yml
@@ -12,13 +12,18 @@ lint:
- LICENCE
- lib/NfcoreTemplate.groovy
- CODE_OF_CONDUCT.md
- assets/sendmail_template.txt
- assets/email_template.html
- assets/email_template.txt
- assets/nf-core-blobtoolkit_logo_light.png
- docs/images/nf-core-blobtoolkit_logo_light.png
- docs/images/nf-core-blobtoolkit_logo_dark.png
- .github/ISSUE_TEMPLATE/bug_report.yml
- .github/PULL_REQUEST_TEMPLATE.md
- .github/workflows/linting.yml
- .github/workflows/branch.yml
- .github/CONTRIBUTING.md
- .github/workflows/linting_comment.yml
multiqc_config:
- report_comment
nextflow_config:
23 changes: 16 additions & 7 deletions CHANGELOG.md
@@ -3,6 +3,15 @@
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/)
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## [[0.7.0](https://github.com/sanger-tol/blobtoolkit/releases/tag/0.7.0)] – Psyduck – [2024-XX-YY]

The pipeline is now considered to be a complete and suitable replacement for the Snakemake version.

- Fetch information about the chromosomes of the assemblies. Used to power
"grid plots".
- Fill in accurate read information in the blobDir. Users are now required
to indicate whether the reads are paired or single.

## [[0.6.0](https://github.com/sanger-tol/blobtoolkit/releases/tag/0.6.0)] – Bellsprout – [2024-09-13]

The pipeline has now been validated for draft (unpublished) assemblies.
@@ -87,13 +96,13 @@ The pipeline has now been validated on dozens of genomes, up to 11 Gbp.

Note, since the pipeline is using Nextflow DSL2, each process will be run with its own [Biocontainer](https://biocontainers.pro/#/registry). This means that on occasion it is entirely possible for the pipeline to be using different versions of the same tool. However, the overall software dependency changes compared to the last release have been listed below for reference. Only `Docker` or `Singularity` containers are supported, `conda` is not supported.

| Dependency | Old version | New version |
| ----------- | ------------- | ------------- |
| blobtoolkit | 4.3.3 | 4.3.9 |
| blast | 2.14.0 | 2.15.0 |
| multiqc | 1.17 and 1.18 | 1.20 and 1.21 |
| samtools | 1.18 | 1.19.2 |
| seqtk | 1.3 | 1.4 |
| Dependency | Old version | New version |
| ----------- | ------------- | ----------------- |
| blobtoolkit | 4.3.3 | 4.3.9 |
| blast | 2.14.0 | 2.15.0 and 2.14.1 |
| multiqc | 1.17 and 1.18 | 1.20 and 1.21 |
| samtools | 1.18 | 1.19.2 |
| seqtk | 1.3 | 1.4 |

> **NB:** Dependency has been **updated** if both old and new version information is present. </br> **NB:** Dependency has been **added** if just the new version information is present. </br> **NB:** Dependency has been **removed** if version information isn't present.

12 changes: 9 additions & 3 deletions CITATIONS.md
@@ -12,6 +12,10 @@

## Pipeline tools

- [BLAST+](https://blast.ncbi.nlm.nih.gov/doc/blast-help/downloadblastdata.html)

> Camacho, Christiam, et al. “BLAST+: architecture and applications.” BMC Bioinformatics, vol. 10, no. 421, Dec. 2009, https://doi.org/10.1186/1471-2105-10-421

- [BlobToolKit](https://github.com/blobtoolkit/blobtoolkit)

> Challis, Richard, et al. “BlobToolKit – Interactive Quality Assessment of Genome Assemblies.” G3 Genes|Genomes|Genetics, vol. 10, no. 4, Apr. 2020, pp. 1361–74, https://doi.org/10.1534/g3.119.400908.
@@ -26,9 +30,7 @@

- [Fasta_windows](https://github.com/tolkit/fasta_windows)

- [GoaT](https://goat.genomehubs.org)

> Challis, Richard, et al. “Genomes on a Tree (GoaT): A versatile, scalable search engine for genomic and sequencing project metadata across the eukaryotic tree of life.” Wellcome Open Research, vol. 8, no. 24, 2023, https://doi.org/10.12688/wellcomeopenres.18658.1.
> Brown, Max, et al. "Fasta_windows v0.2.3". GitHub, 2021. https://github.com/tolkit/fasta_windows

- [Minimap2](https://github.com/lh3/minimap2)

@@ -42,6 +44,10 @@

> Danecek, Petr, et al. “Twelve Years of SAMtools and BCFtools.” GigaScience, vol. 10, no. 2, Jan. 2021, https://doi.org/10.1093/gigascience/giab008.

- [SeqTK](https://github.com/lh3/seqtk)

> Li, Heng. "SeqTK v1.4" GitHub, 2023, https://github.com/lh3/seqtk

## Software packaging/containerisation tools

- [Anaconda](https://anaconda.com)
23 changes: 12 additions & 11 deletions README.md
@@ -20,8 +20,8 @@ It takes a samplesheet of BAM/CRAM/FASTQ/FASTA files as input, calculates genome
4. Run BUSCO ([`busco`](https://busco.ezlab.org/))
5. Extract BUSCO genes ([`blobtoolkit/extractbuscos`](https://github.com/blobtoolkit/blobtoolkit))
6. Run Diamond BLASTp against extracted BUSCO genes ([`diamond/blastp`](https://github.com/bbuchfink/diamond))
7. Run BLASTx against sequences with no hit ([`blast/blastn`](https://www.ncbi.nlm.nih.gov/books/NBK131777/))
8. Run BLASTn against sequences still with not hit ([`blast/blastx`](https://www.ncbi.nlm.nih.gov/books/NBK131777/))
7. Run BLASTx against sequences with no hit ([`diamond/blastx`](https://github.com/bbuchfink/diamond))
8. Run BLASTn against sequences still with no hit ([`blast/blastn`](https://www.ncbi.nlm.nih.gov/books/NBK131777/))
9. Count BUSCO genes ([`blobtoolkit/countbuscos`](https://github.com/blobtoolkit/blobtoolkit))
10. Generate combined sequence stats across various window sizes ([`blobtoolkit/windowstats`](https://github.com/blobtoolkit/blobtoolkit))
11. Imports analysis results into a BlobDir dataset ([`blobtoolkit/blobdir`](https://github.com/blobtoolkit/blobtoolkit))
@@ -37,13 +37,17 @@ First, prepare a samplesheet with your input data that looks as follows:
`samplesheet.csv`:

```csv
sample,datatype,datafile
mMelMel3,hic,GCA_922984935.2.hic.mMelMel3.cram
mMelMel1,illumina,GCA_922984935.2.illumina.mMelMel1.cram
mMelMel3,ont,GCA_922984935.2.ont.mMelMel3.cram
sample,datatype,datafile,library_layout
mMelMel3,hic,GCA_922984935.2.hic.mMelMel3.cram,PAIRED
mMelMel1,illumina,GCA_922984935.2.illumina.mMelMel1.cram,PAIRED
mMelMel3,ont,GCA_922984935.2.ont.mMelMel3.cram,SINGLE
```

Each row represents an aligned file. Rows with the same sample identifier are considered technical replicates. The datatype refers to the sequencing technology used to generate the underlying raw data and follows a controlled vocabulary (`ont`, `hic`, `pacbio`, `pacbio_clr`, `illumina`). The aligned read files can be generated using the [sanger-tol/readmapping](https://github.com/sanger-tol/readmapping) pipeline.
Each row represents an aligned file.
Rows with the same sample identifier are considered technical replicates.
The datatype refers to the sequencing technology used to generate the underlying raw data and follows a controlled vocabulary (`ont`, `hic`, `pacbio`, `pacbio_clr`, `illumina`).
The library layout indicates whether the reads are paired or single.
The aligned read files can be generated using the [sanger-tol/readmapping](https://github.com/sanger-tol/readmapping) pipeline.
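
As an illustration of the new four-column layout, here is a minimal sketch that writes the example samplesheet with Python's standard `csv` module. The helper name `build_samplesheet` is hypothetical and not part of the pipeline; the rows are copied from the example above.

```python
import csv
import io

# Column order taken from the samplesheet example in the README.
FIELDNAMES = ["sample", "datatype", "datafile", "library_layout"]

# Rows copied from the README example; the last column is the new library layout.
ROWS = [
    ("mMelMel3", "hic", "GCA_922984935.2.hic.mMelMel3.cram", "PAIRED"),
    ("mMelMel1", "illumina", "GCA_922984935.2.illumina.mMelMel1.cram", "PAIRED"),
    ("mMelMel3", "ont", "GCA_922984935.2.ont.mMelMel3.cram", "SINGLE"),
]


def build_samplesheet(rows):
    """Return the samplesheet as CSV text (hypothetical helper for illustration)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(FIELDNAMES)
    writer.writerows(rows)
    return buffer.getvalue()


samplesheet_text = build_samplesheet(ROWS)
print(samplesheet_text, end="")
```

To use it with the pipeline, the returned text would be written to a file such as `samplesheet.csv`.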

Now, you can run the pipeline using:

@@ -77,9 +81,8 @@ sanger-tol/blobtoolkit was written in Nextflow by [Alexander Ramos Diaz](https:/

We thank the following people for their assistance in the development of this pipeline:

<!-- If applicable, make list of people who have also contributed -->

- [Guoying Qi](https://github.com/gq1)
- [Bethan Yates](https://github.com/BethYates)

## Contributions and Support

@@ -89,8 +92,6 @@ If you would like to contribute to this pipeline, please see the [contributing g

If you use sanger-tol/blobtoolkit for your analysis, please cite it using the following doi: [10.5281/zenodo.7949058](https://doi.org/10.5281/zenodo.7949058)

<!-- Add bibliography of tools and data used in your pipeline -->

An extensive list of references for the tools used by the pipeline can be found in the [`CITATIONS.md`](CITATIONS.md) file.

This pipeline uses code and infrastructure developed and maintained by the [nf-core](https://nf-co.re) community, reused here under the [MIT license](https://github.com/nf-core/tools/blob/master/LICENSE).
5 changes: 5 additions & 0 deletions assets/schema_input.json
@@ -23,6 +23,11 @@
"type": "string",
"pattern": "^\\S+\\.(bam|cram|fa|fa.gz|fasta|fasta.gz|fq|fq.gz|fastq|fastq.gz)$",
"errorMessage": "Data file for reads cannot contain spaces and must be BAM/CRAM/FASTQ/FASTA"
},
"library_layout": {
"type": "string",
"pattern": "^(SINGLE|PAIRED)$",
"errorMessage": "The only valid layouts are SINGLE and PAIRED"
}
},
"required": ["datafile", "datatype", "sample"]
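
The `library_layout` pattern added in this hunk is a case-sensitive, exact match. The sketch below checks a few candidate values against the same pattern with Python's `re` module. This is only an approximation, since JSON Schema validators use their own regex dialect; `fullmatch` is used so that a trailing newline is not accepted. The function name `is_valid_layout` is an assumption made for this example.

```python
import re

# Same pattern as in assets/schema_input.json.
LAYOUT_PATTERN = re.compile(r"^(SINGLE|PAIRED)$")


def is_valid_layout(value):
    """Return True only for the exact strings SINGLE or PAIRED."""
    return LAYOUT_PATTERN.fullmatch(value) is not None


# Lower-case, padded, and empty values are all rejected.
for candidate in ("SINGLE", "PAIRED", "paired", "PAIRED ", ""):
    print(repr(candidate), is_valid_layout(candidate))
```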
10 changes: 5 additions & 5 deletions assets/test/samplesheet.csv
@@ -1,5 +1,5 @@
sample,datatype,datafile
mMelMel3,hic,/lustre/scratch123/tol/resources/nextflow/test-data/Meles_meles/analysis/mMelMel3.2_paternal_haplotype/read_mapping/hic/GCA_922984935.2.subset.unmasked.hic.mMelMel3.cram
mMelMel1,illumina,/lustre/scratch123/tol/resources/nextflow/test-data/Meles_meles/analysis/mMelMel3.2_paternal_haplotype/read_mapping/illumina/GCA_922984935.2.subset.unmasked.illumina.mMelMel1.cram
mMelMel2,illumina,/lustre/scratch123/tol/resources/nextflow/test-data/Meles_meles/analysis/mMelMel3.2_paternal_haplotype/read_mapping/illumina/GCA_922984935.2.subset.unmasked.illumina.mMelMel2.cram
mMelMel3,ont,/lustre/scratch123/tol/resources/nextflow/test-data/Meles_meles/analysis/mMelMel3.2_paternal_haplotype/read_mapping/ont/GCA_922984935.2.subset.unmasked.ont.mMelMel3.cram
sample,datatype,datafile,library_layout
mMelMel3,hic,/lustre/scratch123/tol/resources/nextflow/test-data/Meles_meles/analysis/mMelMel3.2_paternal_haplotype/read_mapping/hic/GCA_922984935.2.subset.unmasked.hic.mMelMel3.cram,PAIRED
mMelMel1,illumina,/lustre/scratch123/tol/resources/nextflow/test-data/Meles_meles/analysis/mMelMel3.2_paternal_haplotype/read_mapping/illumina/GCA_922984935.2.subset.unmasked.illumina.mMelMel1.cram,PAIRED
mMelMel2,illumina,/lustre/scratch123/tol/resources/nextflow/test-data/Meles_meles/analysis/mMelMel3.2_paternal_haplotype/read_mapping/illumina/GCA_922984935.2.subset.unmasked.illumina.mMelMel2.cram,PAIRED
mMelMel3,ont,/lustre/scratch123/tol/resources/nextflow/test-data/Meles_meles/analysis/mMelMel3.2_paternal_haplotype/read_mapping/ont/GCA_922984935.2.subset.unmasked.ont.mMelMel3.cram,SINGLE
8 changes: 4 additions & 4 deletions assets/test/samplesheet_raw.csv
@@ -1,4 +1,4 @@
sample,datatype,datafile
mMelMel1,illumina,/lustre/scratch123/tol/resources/nextflow/test-data/Meles_meles/genomic_data/mMelMel1/illumina/31231_3#1_subset.cram
mMelMel2,illumina,/lustre/scratch123/tol/resources/nextflow/test-data/Meles_meles/genomic_data/mMelMel2/illumina/31231_4#1_subset.cram
mMelMel3,hic,/lustre/scratch123/tol/resources/nextflow/test-data/Meles_meles/genomic_data/mMelMel3/hic-arima2/35528_2#1_subset.cram
sample,datatype,datafile,library_layout
mMelMel1,illumina,/lustre/scratch123/tol/resources/nextflow/test-data/Meles_meles/genomic_data/mMelMel1/illumina/31231_3#1_subset.cram,PAIRED
mMelMel2,illumina,/lustre/scratch123/tol/resources/nextflow/test-data/Meles_meles/genomic_data/mMelMel2/illumina/31231_4#1_subset.cram,PAIRED
mMelMel3,hic,/lustre/scratch123/tol/resources/nextflow/test-data/Meles_meles/genomic_data/mMelMel3/hic-arima2/35528_2#1_subset.cram,PAIRED
10 changes: 5 additions & 5 deletions assets/test/samplesheet_s3.csv
@@ -1,5 +1,5 @@
sample,datatype,datafile
mMelMel3,hic,https://tolit.cog.sanger.ac.uk/test-data/Meles_meles/analysis/mMelMel3.2_paternal_haplotype/read_mapping/hic/GCA_922984935.2.subset.unmasked.hic.mMelMel3.cram
mMelMel1,illumina,https://tolit.cog.sanger.ac.uk/test-data/Meles_meles/analysis/mMelMel3.2_paternal_haplotype/read_mapping/illumina/GCA_922984935.2.subset.unmasked.illumina.mMelMel1.cram
mMelMel2,illumina,https://tolit.cog.sanger.ac.uk/test-data/Meles_meles/analysis/mMelMel3.2_paternal_haplotype/read_mapping/illumina/GCA_922984935.2.subset.unmasked.illumina.mMelMel2.cram
mMelMel3,ont,https://tolit.cog.sanger.ac.uk/test-data/Meles_meles/analysis/mMelMel3.2_paternal_haplotype/read_mapping/ont/GCA_922984935.2.subset.unmasked.ont.mMelMel3.cram
sample,datatype,datafile,library_layout
mMelMel3,hic,https://tolit.cog.sanger.ac.uk/test-data/Meles_meles/analysis/mMelMel3.2_paternal_haplotype/read_mapping/hic/GCA_922984935.2.subset.unmasked.hic.mMelMel3.cram,PAIRED
mMelMel1,illumina,https://tolit.cog.sanger.ac.uk/test-data/Meles_meles/analysis/mMelMel3.2_paternal_haplotype/read_mapping/illumina/GCA_922984935.2.subset.unmasked.illumina.mMelMel1.cram,PAIRED
mMelMel2,illumina,https://tolit.cog.sanger.ac.uk/test-data/Meles_meles/analysis/mMelMel3.2_paternal_haplotype/read_mapping/illumina/GCA_922984935.2.subset.unmasked.illumina.mMelMel2.cram,PAIRED
mMelMel3,ont,https://tolit.cog.sanger.ac.uk/test-data/Meles_meles/analysis/mMelMel3.2_paternal_haplotype/read_mapping/ont/GCA_922984935.2.subset.unmasked.ont.mMelMel3.cram,SINGLE
6 changes: 3 additions & 3 deletions assets/test_full/full_samplesheet.csv
@@ -1,3 +1,3 @@
sample,datatype,datafile
gfLaeSulp1,hic,/lustre/scratch123/tol/resources/nextflow/test-data/Laetiporus_sulphureus/analysis/gfLaeSulp1.1/read_mapping/hic/GCA_927399515.1.unmasked.hic.gfLaeSulp1.cram
gfLaeSulp1,pacbio,/lustre/scratch123/tol/resources/nextflow/test-data/Laetiporus_sulphureus/analysis/gfLaeSulp1.1/read_mapping/pacbio/GCA_927399515.1.unmasked.pacbio.gfLaeSulp1.cram
sample,datatype,datafile,library_layout
gfLaeSulp1,hic,/lustre/scratch123/tol/resources/nextflow/test-data/Laetiporus_sulphureus/analysis/gfLaeSulp1.1/read_mapping/hic/GCA_927399515.1.unmasked.hic.gfLaeSulp1.cram,PAIRED
gfLaeSulp1,pacbio,/lustre/scratch123/tol/resources/nextflow/test-data/Laetiporus_sulphureus/analysis/gfLaeSulp1.1/read_mapping/pacbio/GCA_927399515.1.unmasked.pacbio.gfLaeSulp1.cram,SINGLE
24 changes: 21 additions & 3 deletions bin/check_samplesheet.py
@@ -45,11 +45,17 @@ class RowChecker:
"ont",
)

VALID_LAYOUTS = (
"SINGLE",
"PAIRED",
)

def __init__(
self,
sample_col="sample",
type_col="datatype",
file_col="datafile",
layout_col="library_layout",
**kwargs,
):
"""
@@ -62,11 +68,14 @@ def __init__(
the read data (default "datatype").
file_col (str): The name of the column that contains the file path for
the read data (default "datafile").
layout_col (str): The name of the column that contains the layout of the
library, i.e. "PAIRED" or "SINGLE" (default "library_layout").
"""
super().__init__(**kwargs)
self._sample_col = sample_col
self._type_col = type_col
self._file_col = file_col
self._layout_col = layout_col
self._seen = set()
self.modified = []

@@ -82,6 +91,7 @@ def validate_and_transform(self, row):
self._validate_sample(row)
self._validate_type(row)
self._validate_file(row)
self._validate_layout(row)
self._seen.add((row[self._sample_col], row[self._file_col]))
self.modified.append(row)

@@ -94,7 +104,7 @@ def _validate_sample(self, row):

def _validate_type(self, row):
"""Assert that the data type matches expected values."""
if not any(row[self._type_col] for datatype in self.VALID_DATATYPES):
if row[self._type_col] not in self.VALID_DATATYPES:
raise AssertionError(
f"The datatype is unrecognized: {row[self._type_col]}\n"
f"It should be one of: {', '.join(self.VALID_DATATYPES)}"
@@ -114,6 +124,14 @@ def _validate_data_format(self, filename):
f"It should be one of: {', '.join(self.VALID_FORMATS)}"
)

def _validate_layout(self, row):
"""Assert that the library layout matches expected values."""
if row[self._layout_col] not in self.VALID_LAYOUTS:
raise AssertionError(
f"The library layout is unrecognized: {row[self._layout_col]}\n"
f"It should be one of: {', '.join(self.VALID_LAYOUTS)}"
)

def validate_unique_samples(self):
"""
Assert that the combination of sample name and aligned filename is unique.
@@ -178,7 +196,7 @@ def check_samplesheet(file_in, file_out):
This function checks that the samplesheet follows the following structure,
see also the `blobtoolkit samplesheet`_::

sample,datatype,datafile
sample,datatype,datafile,library_layout
sample1,hic,/path/to/file1.cram,PAIRED
sample1,pacbio,/path/to/file2.cram,SINGLE
sample1,ont,/path/to/file3.cram,SINGLE
@@ -187,7 +205,7 @@ def check_samplesheet(file_in, file_out):
https://raw.githubusercontent.com/sanger-tol/blobtoolkit/main/assets/test/samplesheet.csv

"""
required_columns = {"sample", "datatype", "datafile"}
required_columns = {"sample", "datatype", "datafile", "library_layout"}
# See https://docs.python.org/3.9/library/csv.html#id3 to read up on `newline=""`.
with file_in.open(newline="") as in_handle:
reader = csv.DictReader(in_handle, dialect=sniff_format(in_handle))
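
The `_validate_type` change in this file replaces a condition that could not reject an unknown datatype. The sketch below isolates the two conditions to show the difference. The tuple of datatypes is an assumption based on the controlled vocabulary listed in the README (only `"ont"` is visible in this diff), and the two function names are hypothetical.

```python
# Assumed from the README's controlled vocabulary; only "ont" is visible in the diff.
VALID_DATATYPES = ("hic", "illumina", "pacbio", "pacbio_clr", "ont")


def old_check_passes(value):
    # Old condition: the generator yields `value` itself once per datatype,
    # so any() only tests whether `value` is a non-empty string.
    return any(value for datatype in VALID_DATATYPES)


def new_check_passes(value):
    # New condition: a plain membership test against the allowed values.
    return value in VALID_DATATYPES


for value in ("hic", "nanopore"):
    print(value, old_check_passes(value), new_check_passes(value))
```

With the old condition, any non-empty string is accepted, so the `AssertionError` was only reachable for an empty datatype cell.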