Release 0.7 #116

Draft
wants to merge 67 commits into
base: main
Changes from 34 commits
Commits
2584645
Update sanger_test_full.yml
gq1 Sep 13, 2024
ea1b05a
Update sanger_test_full.yml
gq1 Sep 16, 2024
e0bf8ba
Merge pull request #113 from sanger-tol/small_tower_compute_env
gq1 Sep 16, 2024
d13d1ed
Fixed the comment
muffato Sep 19, 2024
2c793aa
UPDATEBLOBDIR needs to pass the tsv files through certain arguments
muffato Sep 19, 2024
b55f76c
Populate as much as possible of the metadata at the beginning
muffato Sep 19, 2024
0daf1e2
Also generate TSV files
muffato Jul 2, 2024
2cf12fd
Example plugging the TSV files into blobtools commands
muffato Jul 2, 2024
b8b1fa4
Skip BLOBTOOLKIT_METADATA as we generate the yaml file correctly from…
muffato Sep 19, 2024
41b101f
Introduced a new samplesheet parameter to track single/paired readsets
muffato Sep 20, 2024
80a100b
Populate as much as possible of the metadata at the beginning
muffato Sep 20, 2024
bea3a77
Version bumps
muffato Sep 20, 2024
4edafb4
Updated the documentation
muffato Sep 20, 2024
904a512
Version bump
muffato Sep 20, 2024
51985ad
Publish the windowmasker outputs too
muffato Sep 20, 2024
f47dd16
Convert input BAM/CRAM files to Fasta on the fly
muffato Sep 23, 2024
43254bc
Bumped the version down
muffato Sep 26, 2024
c89b1b9
Pin the version of editorconfig like in the readmapping pipeline
muffato Sep 26, 2024
152282b
Exclude sqlite databases from ECLint
muffato Sep 26, 2024
476b736
add exclusions for nf-core lint
tkchafin Sep 30, 2024
303b221
Update CITATIONS.md
tkchafin Sep 30, 2024
27b0a51
prettier linting
tkchafin Sep 30, 2024
4eef0fa
The release is going to happen today
muffato Oct 2, 2024
0d8a5a9
Decreased the runtime requirements for illumina alignments
muffato Oct 2, 2024
1909646
Wording of the changelog
muffato Oct 2, 2024
102dbf4
Merge pull request #114 from sanger-tol/chrom_view
muffato Oct 2, 2024
8a05dc2
Documentation update
muffato Oct 2, 2024
abe2b76
Bumped up the underlying BTK version because 4.3.9 had a bug
muffato Oct 14, 2024
dfb4655
Updated the blastn runtime requirements to avoid basement jobs
muffato Oct 14, 2024
17b38ee
database parameterisation and separate tests from sanger infrastructure
tkchafin Oct 14, 2024
93bda84
Merge pull request #117 from sanger-tol/misc_fixes
tkchafin Oct 16, 2024
ea6d70a
Updated database installation instructions
muffato Oct 19, 2024
36fff07
fix URL not found in uniprot ref ftp path
tkchafin Oct 19, 2024
61c0328
fix ftp path for taxdump
tkchafin Oct 19, 2024
b9bcef6
Merge pull request #118 from sanger-tol/db_install
tkchafin Oct 21, 2024
2b22f24
Update test.config
tkchafin Oct 24, 2024
8d2b83f
Merge pull request #120 from sanger-tol/blastn_test
tkchafin Nov 4, 2024
254fea1
separate databases from sanger infra; handle compression
tkchafin Nov 11, 2024
2afc63d
Merge pull request #3 from tkchafin/dev
tkchafin Nov 11, 2024
6399d0a
Update conf/test_full.config
tkchafin Nov 19, 2024
779ec64
Update subworkflows/local/input_check.nf
tkchafin Nov 19, 2024
dddef77
remove local database copies from assets
tkchafin Nov 19, 2024
a49bbd3
remove pre-download from ci
tkchafin Nov 19, 2024
4e176e0
docs update
tkchafin Nov 19, 2024
959537d
changelog update
tkchafin Nov 19, 2024
641d464
prettier linting
tkchafin Nov 19, 2024
c5ab0a1
remove local files from CI
tkchafin Nov 19, 2024
151e154
modules partially updated; stuck on samtools/view
tkchafin Nov 19, 2024
2793130
all modules updated exc. BUSCO
tkchafin Nov 21, 2024
a87c43f
busco updated and patched
tkchafin Nov 21, 2024
4e5a942
cleanup ignored file
tkchafin Nov 21, 2024
7cfbcc0
Update docs/usage.md
tkchafin Nov 22, 2024
301bcff
Merge pull request #121 from tkchafin/db_params
tkchafin Nov 22, 2024
127ea41
bugfix: this transformation should only apply on directories (that co…
muffato Nov 22, 2024
01007ab
Merge pull request #125 from sanger-tol/db_params
muffato Nov 22, 2024
c181b81
bugfix: the blastp and blastx paths can have the same name
muffato Nov 22, 2024
3706165
prettier linting
tkchafin Nov 22, 2024
c0b8e31
Removed references to "defaults"
muffato Oct 24, 2024
6fe1861
Merge pull request #127 from sanger-tol/db_params
muffato Nov 22, 2024
1342ba0
Updated the modules
muffato Nov 25, 2024
d3fbb42
Merge branch 'dev' into anaconda_purge_2
muffato Nov 25, 2024
16ed4ec
Updated a few more references to Anaconda
muffato Nov 25, 2024
3976215
Updated the changelog too
muffato Nov 25, 2024
15974b4
[prettier]
muffato Nov 25, 2024
000b442
Merge pull request #124 from tkchafin/anaconda_purge_2
tkchafin Nov 25, 2024
6fb2a8a
Update ci.yml
tkchafin Dec 16, 2024
81d26b4
Merge pull request #130 from sanger-tol/ci
tkchafin Dec 16, 2024
4 changes: 2 additions & 2 deletions .github/workflows/linting.yml
@@ -19,10 +19,10 @@ jobs:
- uses: actions/setup-node@v4

- name: Install editorconfig-checker
run: npm install -g editorconfig-checker
run: npm install -g editorconfig-checker@3.0.2

- name: Run ECLint check
run: editorconfig-checker -exclude README.md $(find .* -type f | grep -v '.git\|.py\|.md\|json\|yml\|yaml\|html\|css\|work\|.nextflow\|build\|nf_core.egg-info\|log.txt\|Makefile')
run: editorconfig-checker -exclude README.md $(find .* -type f | grep -v '.git\|.py\|.md\|json\|yml\|yaml\|html\|css\|work\|.nextflow\|build\|nf_core.egg-info\|log.txt\|Makefile\|.sqlite3')

Prettier:
runs-on: ubuntu-latest
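
A note on the exclusion list above: the dots in the `grep -v` pattern are unescaped, so each one matches any character rather than a literal dot. The sketch below is a Python approximation of that behaviour (BRE `\|` becomes `|`), reduced to two of the alternatives. The file paths and the stricter anchored pattern are illustrative assumptions, not part of this PR.

```python
import re

# Loose form, mirroring the unescaped dots used in the workflow's grep pattern.
LOOSE = re.compile(r".sqlite3|.py")

# Hypothetical stricter form: literal dots, anchored at the end of the path.
STRICT = re.compile(r"\.sqlite3$|\.py$")


def is_excluded(pattern, path):
    """Return True if the path would be filtered out by `grep -v`."""
    return pattern.search(path) is not None


# Hypothetical paths, for illustration only.
database = "assets/db.sqlite3"
text_file = "docs/copy_notes.txt"

print(is_excluded(LOOSE, database), is_excluded(STRICT, database))
print(is_excluded(LOOSE, text_file), is_excluded(STRICT, text_file))
```

With the loose form, `docs/copy_notes.txt` is excluded as well, because `.py` matches the `opy` in `copy`. Whether that matters depends on which files the repository actually contains.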
2 changes: 1 addition & 1 deletion .github/workflows/sanger_test_full.yml
@@ -26,7 +26,7 @@ jobs:
with:
workspace_id: ${{ secrets.TOWER_WORKSPACE_ID }}
access_token: ${{ secrets.TOWER_ACCESS_TOKEN }}
compute_env: ${{ secrets.TOWER_COMPUTE_ENV_LARGE }}
compute_env: ${{ secrets.TOWER_COMPUTE_ENV }}
revision: ${{ env.REVISION }}
workdir: ${{ secrets.TOWER_WORKDIR_PARENT }}/work/${{ github.repository }}/work-${{ env.REVISION }}
parameters: |
5 changes: 5 additions & 0 deletions .nf-core.yml
@@ -12,13 +12,18 @@ lint:
- LICENCE
- lib/NfcoreTemplate.groovy
- CODE_OF_CONDUCT.md
- assets/sendmail_template.txt
- assets/email_template.html
- assets/email_template.txt
- assets/nf-core-blobtoolkit_logo_light.png
- docs/images/nf-core-blobtoolkit_logo_light.png
- docs/images/nf-core-blobtoolkit_logo_dark.png
- .github/ISSUE_TEMPLATE/bug_report.yml
- .github/PULL_REQUEST_TEMPLATE.md
- .github/workflows/linting.yml
- .github/workflows/branch.yml
- .github/CONTRIBUTING.md
- .github/workflows/linting_comment.yml
multiqc_config:
- report_comment
nextflow_config:
33 changes: 26 additions & 7 deletions CHANGELOG.md
@@ -3,6 +3,25 @@
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/)
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## [[0.7.0](https://github.com/sanger-tol/blobtoolkit/releases/tag/0.7.0)] – Psyduck – [2024-10-02]

The pipeline is now considered to be a complete and suitable replacement for the Snakemake version.

- Fetch information about the chromosomes of the assemblies. Used to power
"grid plots".
- Fill in accurate read information in the blobDir. Users are now required
to indicate in the samplesheet whether the reads are paired or single.
- Updated the Blastn settings to allow 7 days runtime at most, since that
covers 99.7% of the jobs.

### Software dependencies

Note, since the pipeline is using Nextflow DSL2, each process will be run with its own [Biocontainer](https://biocontainers.pro/#/registry). This means that on occasion it is entirely possible for the pipeline to be using different versions of the same tool. However, the overall software dependency changes compared to the last release have been listed below for reference. Only `Docker` or `Singularity` containers are supported, `conda` is not supported.

| Dependency | Old version | New version |
| ----------- | ----------- | ----------- |
| blobtoolkit | 4.3.9 | 4.3.13 |

## [[0.6.0](https://github.com/sanger-tol/blobtoolkit/releases/tag/0.6.0)] – Bellsprout – [2024-09-13]

The pipeline has now been validated for draft (unpublished) assemblies.
@@ -87,13 +106,13 @@ The pipeline has now been validated on dozens of genomes, up to 11 Gbp.

Note, since the pipeline is using Nextflow DSL2, each process will be run with its own [Biocontainer](https://biocontainers.pro/#/registry). This means that on occasion it is entirely possible for the pipeline to be using different versions of the same tool. However, the overall software dependency changes compared to the last release have been listed below for reference. Only `Docker` or `Singularity` containers are supported, `conda` is not supported.

| Dependency | Old version | New version |
| ----------- | ------------- | ------------- |
| blobtoolkit | 4.3.3 | 4.3.9 |
| blast | 2.14.0 | 2.15.0 |
| multiqc | 1.17 and 1.18 | 1.20 and 1.21 |
| samtools | 1.18 | 1.19.2 |
| seqtk | 1.3 | 1.4 |
| Dependency | Old version | New version |
| ----------- | ------------- | ----------------- |
| blobtoolkit | 4.3.3 | 4.3.9 |
| blast | 2.14.0 | 2.15.0 and 2.14.1 |
| multiqc | 1.17 and 1.18 | 1.20 and 1.21 |
| samtools | 1.18 | 1.19.2 |
| seqtk | 1.3 | 1.4 |

> **NB:** Dependency has been **updated** if both old and new version information is present. </br> **NB:** Dependency has been **added** if just the new version information is present. </br> **NB:** Dependency has been **removed** if version information isn't present.

12 changes: 9 additions & 3 deletions CITATIONS.md
@@ -12,6 +12,10 @@

## Pipeline tools

- [BLAST+](https://blast.ncbi.nlm.nih.gov/doc/blast-help/downloadblastdata.html)

> Camacho, Christiam, et al. “BLAST+: architecture and applications.” BMC Bioinformatics, vol. 10, no. 421, Dec. 2009, https://doi.org/10.1186/1471-2105-10-421

- [BlobToolKit](https://github.com/blobtoolkit/blobtoolkit)

> Challis, Richard, et al. “BlobToolKit – Interactive Quality Assessment of Genome Assemblies.” G3 Genes|Genomes|Genetics, vol. 10, no. 4, Apr. 2020, pp. 1361–74, https://doi.org/10.1534/g3.119.400908.
@@ -26,9 +30,7 @@

- [Fasta_windows](https://github.com/tolkit/fasta_windows)

- [GoaT](https://goat.genomehubs.org)

> Challis, Richard, et al. “Genomes on a Tree (GoaT): A versatile, scalable search engine for genomic and sequencing project metadata across the eukaryotic tree of life.” Wellcome Open Research, vol. 8, no. 24, 2023, https://doi.org/10.12688/wellcomeopenres.18658.1.
> Brown, Max, et al. "Fasta_windows v0.2.3". GitHub, 2021. https://github.com/tolkit/fasta_windows

- [Minimap2](https://github.com/lh3/minimap2)

@@ -42,6 +44,10 @@

> Danecek, Petr, et al. “Twelve Years of SAMtools and BCFtools.” GigaScience, vol. 10, no. 2, Jan. 2021, https://doi.org/10.1093/gigascience/giab008.

- [SeqTK](https://github.com/lh3/seqtk)

> Li, Heng. "SeqTK v1.4" GitHub, 2023, https://github.com/lh3/seqtk

## Software packaging/containerisation tools

- [Anaconda](https://anaconda.com)
23 changes: 12 additions & 11 deletions README.md
@@ -20,8 +20,8 @@ It takes a samplesheet of BAM/CRAM/FASTQ/FASTA files as input, calculates genome
4. Run BUSCO ([`busco`](https://busco.ezlab.org/))
5. Extract BUSCO genes ([`blobtoolkit/extractbuscos`](https://github.com/blobtoolkit/blobtoolkit))
6. Run Diamond BLASTp against extracted BUSCO genes ([`diamond/blastp`](https://github.com/bbuchfink/diamond))
7. Run BLASTx against sequences with no hit ([`blast/blastn`](https://www.ncbi.nlm.nih.gov/books/NBK131777/))
8. Run BLASTn against sequences still with not hit ([`blast/blastx`](https://www.ncbi.nlm.nih.gov/books/NBK131777/))
7. Run BLASTx against sequences with no hit ([`diamond/blastx`](https://github.com/bbuchfink/diamond))
8. Run BLASTn against sequences still with no hit ([`blast/blastn`](https://www.ncbi.nlm.nih.gov/books/NBK131777/))
9. Count BUSCO genes ([`blobtoolkit/countbuscos`](https://github.com/blobtoolkit/blobtoolkit))
10. Generate combined sequence stats across various window sizes ([`blobtoolkit/windowstats`](https://github.com/blobtoolkit/blobtoolkit))
11. Imports analysis results into a BlobDir dataset ([`blobtoolkit/blobdir`](https://github.com/blobtoolkit/blobtoolkit))
@@ -37,13 +37,17 @@ First, prepare a samplesheet with your input data that looks as follows:
`samplesheet.csv`:

```csv
sample,datatype,datafile
mMelMel3,hic,GCA_922984935.2.hic.mMelMel3.cram
mMelMel1,illumina,GCA_922984935.2.illumina.mMelMel1.cram
mMelMel3,ont,GCA_922984935.2.ont.mMelMel3.cram
sample,datatype,datafile,library_layout
mMelMel3,hic,GCA_922984935.2.hic.mMelMel3.cram,PAIRED
mMelMel1,illumina,GCA_922984935.2.illumina.mMelMel1.cram,PAIRED
mMelMel3,ont,GCA_922984935.2.ont.mMelMel3.cram,SINGLE
```

Each row represents an aligned file. Rows with the same sample identifier are considered technical replicates. The datatype refers to the sequencing technology used to generate the underlying raw data and follows a controlled vocabulary (`ont`, `hic`, `pacbio`, `pacbio_clr`, `illumina`). The aligned read files can be generated using the [sanger-tol/readmapping](https://github.com/sanger-tol/readmapping) pipeline.
Each row represents an aligned file.
Rows with the same sample identifier are considered technical replicates.
The datatype refers to the sequencing technology used to generate the underlying raw data and follows a controlled vocabulary (`ont`, `hic`, `pacbio`, `pacbio_clr`, `illumina`).
The library layout indicates whether the reads are paired or single.
The aligned read files can be generated using the [sanger-tol/readmapping](https://github.com/sanger-tol/readmapping) pipeline.
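
As a rough illustration of how these four columns fit together, the sketch below parses the example samplesheet and groups rows by sample identifier, rejecting any layout other than `SINGLE` or `PAIRED`. It is not the pipeline's own parser: the helper name `group_by_sample` and the error wording are assumptions made for this example.

```python
import csv
import io

VALID_LAYOUTS = {"SINGLE", "PAIRED"}

# The example samplesheet from above, inlined so the sketch is self-contained.
SAMPLESHEET = """\
sample,datatype,datafile,library_layout
mMelMel3,hic,GCA_922984935.2.hic.mMelMel3.cram,PAIRED
mMelMel1,illumina,GCA_922984935.2.illumina.mMelMel1.cram,PAIRED
mMelMel3,ont,GCA_922984935.2.ont.mMelMel3.cram,SINGLE
"""


def group_by_sample(text):
    """Group (datatype, layout) pairs by sample identifier (hypothetical helper)."""
    groups = {}
    for row in csv.DictReader(io.StringIO(text)):
        layout = row["library_layout"]
        if layout not in VALID_LAYOUTS:
            raise ValueError(f"Unrecognised library layout: {layout}")
        groups.setdefault(row["sample"], []).append((row["datatype"], layout))
    return groups


groups = group_by_sample(SAMPLESHEET)
for sample, entries in groups.items():
    print(sample, entries)
```

Here `mMelMel3` ends up with two entries (Hi-C and ONT), which is the "same sample identifier" case described above.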

Now, you can run the pipeline using:

@@ -77,9 +81,8 @@ sanger-tol/blobtoolkit was written in Nextflow by [Alexander Ramos Diaz](https:/

We thank the following people for their assistance in the development of this pipeline:

<!-- If applicable, make list of people who have also contributed -->

- [Guoying Qi](https://github.com/gq1)
- [Bethan Yates](https://github.com/BethYates)

## Contributions and Support

@@ -89,8 +92,6 @@ If you would like to contribute to this pipeline, please see the [contributing g

If you use sanger-tol/blobtoolkit for your analysis, please cite it using the following doi: [10.5281/zenodo.7949058](https://doi.org/10.5281/zenodo.7949058)

<!-- Add bibliography of tools and data used in your pipeline -->

An extensive list of references for the tools used by the pipeline can be found in the [`CITATIONS.md`](CITATIONS.md) file.

This pipeline uses code and infrastructure developed and maintained by the [nf-core](https://nf-co.re) community, reused here under the [MIT license](https://github.com/nf-core/tools/blob/master/LICENSE).
5 changes: 5 additions & 0 deletions assets/schema_input.json
@@ -23,6 +23,11 @@
"type": "string",
"pattern": "^\\S+\\.(bam|cram|fa|fa.gz|fasta|fasta.gz|fq|fq.gz|fastq|fastq.gz)$",
"errorMessage": "Data file for reads cannot contain spaces and must be BAM/CRAM/FASTQ/FASTA"
},
"library_layout": {
"type": "string",
"pattern": "^(SINGLE|PAIRED)$",
"errorMessage": "The only valid layouts are SINGLE and PAIRED"
}
},
"required": ["datafile", "datatype", "sample"]
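
The new `library_layout` pattern is strict: upper case only, with no surrounding whitespace. The sketch below checks a few values against the same expression using Python's `re` module, on the assumption that Python's regex semantics match the schema validator's for a pattern this simple. The helper name `layout_ok` is hypothetical.

```python
import re

# Same expression as the "pattern" entry added to the schema.
LAYOUT_PATTERN = re.compile(r"^(SINGLE|PAIRED)$")


def layout_ok(value):
    """Return True if the value satisfies the library_layout pattern."""
    return LAYOUT_PATTERN.match(value) is not None


for candidate in ("PAIRED", "SINGLE", "paired", " PAIRED", ""):
    print(repr(candidate), layout_ok(candidate))
```

Note that in this hunk `library_layout` is not added to the schema's `required` list, whereas `bin/check_samplesheet.py` now lists it among its required columns. It may be worth confirming that the difference is intentional.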
10 changes: 5 additions & 5 deletions assets/test/samplesheet.csv
@@ -1,5 +1,5 @@
sample,datatype,datafile
mMelMel3,hic,/lustre/scratch123/tol/resources/nextflow/test-data/Meles_meles/analysis/mMelMel3.2_paternal_haplotype/read_mapping/hic/GCA_922984935.2.subset.unmasked.hic.mMelMel3.cram
mMelMel1,illumina,/lustre/scratch123/tol/resources/nextflow/test-data/Meles_meles/analysis/mMelMel3.2_paternal_haplotype/read_mapping/illumina/GCA_922984935.2.subset.unmasked.illumina.mMelMel1.cram
mMelMel2,illumina,/lustre/scratch123/tol/resources/nextflow/test-data/Meles_meles/analysis/mMelMel3.2_paternal_haplotype/read_mapping/illumina/GCA_922984935.2.subset.unmasked.illumina.mMelMel2.cram
mMelMel3,ont,/lustre/scratch123/tol/resources/nextflow/test-data/Meles_meles/analysis/mMelMel3.2_paternal_haplotype/read_mapping/ont/GCA_922984935.2.subset.unmasked.ont.mMelMel3.cram
sample,datatype,datafile,library_layout
mMelMel3,hic,/lustre/scratch123/tol/resources/nextflow/test-data/Meles_meles/analysis/mMelMel3.2_paternal_haplotype/read_mapping/hic/GCA_922984935.2.subset.unmasked.hic.mMelMel3.cram,PAIRED
mMelMel1,illumina,/lustre/scratch123/tol/resources/nextflow/test-data/Meles_meles/analysis/mMelMel3.2_paternal_haplotype/read_mapping/illumina/GCA_922984935.2.subset.unmasked.illumina.mMelMel1.cram,PAIRED
mMelMel2,illumina,/lustre/scratch123/tol/resources/nextflow/test-data/Meles_meles/analysis/mMelMel3.2_paternal_haplotype/read_mapping/illumina/GCA_922984935.2.subset.unmasked.illumina.mMelMel2.cram,PAIRED
mMelMel3,ont,/lustre/scratch123/tol/resources/nextflow/test-data/Meles_meles/analysis/mMelMel3.2_paternal_haplotype/read_mapping/ont/GCA_922984935.2.subset.unmasked.ont.mMelMel3.cram,SINGLE
8 changes: 4 additions & 4 deletions assets/test/samplesheet_raw.csv
@@ -1,4 +1,4 @@
sample,datatype,datafile
mMelMel1,illumina,/lustre/scratch123/tol/resources/nextflow/test-data/Meles_meles/genomic_data/mMelMel1/illumina/31231_3#1_subset.cram
mMelMel2,illumina,/lustre/scratch123/tol/resources/nextflow/test-data/Meles_meles/genomic_data/mMelMel2/illumina/31231_4#1_subset.cram
mMelMel3,hic,/lustre/scratch123/tol/resources/nextflow/test-data/Meles_meles/genomic_data/mMelMel3/hic-arima2/35528_2#1_subset.cram
sample,datatype,datafile,library_layout
mMelMel1,illumina,/lustre/scratch123/tol/resources/nextflow/test-data/Meles_meles/genomic_data/mMelMel1/illumina/31231_3#1_subset.cram,PAIRED
mMelMel2,illumina,/lustre/scratch123/tol/resources/nextflow/test-data/Meles_meles/genomic_data/mMelMel2/illumina/31231_4#1_subset.cram,PAIRED
mMelMel3,hic,/lustre/scratch123/tol/resources/nextflow/test-data/Meles_meles/genomic_data/mMelMel3/hic-arima2/35528_2#1_subset.cram,PAIRED
10 changes: 5 additions & 5 deletions assets/test/samplesheet_s3.csv
@@ -1,5 +1,5 @@
sample,datatype,datafile
mMelMel3,hic,https://tolit.cog.sanger.ac.uk/test-data/Meles_meles/analysis/mMelMel3.2_paternal_haplotype/read_mapping/hic/GCA_922984935.2.subset.unmasked.hic.mMelMel3.cram
mMelMel1,illumina,https://tolit.cog.sanger.ac.uk/test-data/Meles_meles/analysis/mMelMel3.2_paternal_haplotype/read_mapping/illumina/GCA_922984935.2.subset.unmasked.illumina.mMelMel1.cram
mMelMel2,illumina,https://tolit.cog.sanger.ac.uk/test-data/Meles_meles/analysis/mMelMel3.2_paternal_haplotype/read_mapping/illumina/GCA_922984935.2.subset.unmasked.illumina.mMelMel2.cram
mMelMel3,ont,https://tolit.cog.sanger.ac.uk/test-data/Meles_meles/analysis/mMelMel3.2_paternal_haplotype/read_mapping/ont/GCA_922984935.2.subset.unmasked.ont.mMelMel3.cram
sample,datatype,datafile,library_layout
mMelMel3,hic,https://tolit.cog.sanger.ac.uk/test-data/Meles_meles/analysis/mMelMel3.2_paternal_haplotype/read_mapping/hic/GCA_922984935.2.subset.unmasked.hic.mMelMel3.cram,PAIRED
mMelMel1,illumina,https://tolit.cog.sanger.ac.uk/test-data/Meles_meles/analysis/mMelMel3.2_paternal_haplotype/read_mapping/illumina/GCA_922984935.2.subset.unmasked.illumina.mMelMel1.cram,PAIRED
mMelMel2,illumina,https://tolit.cog.sanger.ac.uk/test-data/Meles_meles/analysis/mMelMel3.2_paternal_haplotype/read_mapping/illumina/GCA_922984935.2.subset.unmasked.illumina.mMelMel2.cram,PAIRED
mMelMel3,ont,https://tolit.cog.sanger.ac.uk/test-data/Meles_meles/analysis/mMelMel3.2_paternal_haplotype/read_mapping/ont/GCA_922984935.2.subset.unmasked.ont.mMelMel3.cram,SINGLE
6 changes: 3 additions & 3 deletions assets/test_full/full_samplesheet.csv
@@ -1,3 +1,3 @@
sample,datatype,datafile
gfLaeSulp1,hic,/lustre/scratch123/tol/resources/nextflow/test-data/Laetiporus_sulphureus/analysis/gfLaeSulp1.1/read_mapping/hic/GCA_927399515.1.unmasked.hic.gfLaeSulp1.cram
gfLaeSulp1,pacbio,/lustre/scratch123/tol/resources/nextflow/test-data/Laetiporus_sulphureus/analysis/gfLaeSulp1.1/read_mapping/pacbio/GCA_927399515.1.unmasked.pacbio.gfLaeSulp1.cram
sample,datatype,datafile,library_layout
gfLaeSulp1,hic,/lustre/scratch123/tol/resources/nextflow/test-data/Laetiporus_sulphureus/analysis/gfLaeSulp1.1/read_mapping/hic/GCA_927399515.1.unmasked.hic.gfLaeSulp1.cram,PAIRED
gfLaeSulp1,pacbio,/lustre/scratch123/tol/resources/nextflow/test-data/Laetiporus_sulphureus/analysis/gfLaeSulp1.1/read_mapping/pacbio/GCA_927399515.1.unmasked.pacbio.gfLaeSulp1.cram,SINGLE
24 changes: 21 additions & 3 deletions bin/check_samplesheet.py
@@ -45,11 +45,17 @@ class RowChecker:
"ont",
)

VALID_LAYOUTS = (
"SINGLE",
"PAIRED",
)

def __init__(
self,
sample_col="sample",
type_col="datatype",
file_col="datafile",
layout_col="library_layout",
**kwargs,
):
"""
@@ -62,11 +68,14 @@ def __init__(
the read data (default "datatype").
file_col (str): The name of the column that contains the file path for
the read data (default "datafile").
layout_col (str): The name of the column that contains the layout of the
library, i.e. "PAIRED" or "SINGLE" (default "library_layout").
"""
super().__init__(**kwargs)
self._sample_col = sample_col
self._type_col = type_col
self._file_col = file_col
self._layout_col = layout_col
self._seen = set()
self.modified = []

@@ -82,6 +91,7 @@ def validate_and_transform(self, row):
self._validate_sample(row)
self._validate_type(row)
self._validate_file(row)
self._validate_layout(row)
self._seen.add((row[self._sample_col], row[self._file_col]))
self.modified.append(row)

@@ -94,7 +104,7 @@ def _validate_sample(self, row):

def _validate_type(self, row):
"""Assert that the data type matches expected values."""
if not any(row[self._type_col] for datatype in self.VALID_DATATYPES):
if row[self._type_col] not in self.VALID_DATATYPES:
raise AssertionError(
f"The datatype is unrecognized: {row[self._type_col]}\n"
f"It should be one of: {', '.join(self.VALID_DATATYPES)}"
@@ -114,6 +124,14 @@ def _validate_data_format(self, filename):
f"It should be one of: {', '.join(self.VALID_FORMATS)}"
)

def _validate_layout(self, row):
"""Assert that the library layout matches expected values."""
if row[self._layout_col] not in self.VALID_LAYOUTS:
raise AssertionError(
f"The library layout is unrecognized: {row[self._layout_col]}\n"
f"It should be one of: {', '.join(self.VALID_LAYOUTS)}"
)

def validate_unique_samples(self):
"""
Assert that the combination of sample name and aligned filename is unique.
@@ -178,7 +196,7 @@ def check_samplesheet(file_in, file_out):
This function checks that the samplesheet follows the following structure,
see also the `blobtoolkit samplesheet`_::

sample,datatype,datafile
sample,datatype,datafile,library_layout
sample1,hic,/path/to/file1.cram,PAIRED
sample1,pacbio,/path/to/file2.cram,SINGLE
sample1,ont,/path/to/file3.cram,SINGLE
@@ -187,7 +205,7 @@ def check_samplesheet(file_in, file_out):
https://raw.githubusercontent.com/sanger-tol/blobtoolkit/main/assets/test/samplesheet.csv

"""
required_columns = {"sample", "datatype", "datafile"}
required_columns = {"sample", "datatype", "datafile", "library_layout"}
# See https://docs.python.org/3.9/library/csv.html#id3 to read up on `newline=""`.
with file_in.open(newline="") as in_handle:
reader = csv.DictReader(in_handle, dialect=sniff_format(in_handle))
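
For context on the `_validate_type` change in this file: the previous condition iterated over the valid datatypes but tested the row value itself, so any non-empty string passed. The standalone sketch below contrasts the two checks. The function names are hypothetical, and the datatype tuple is taken from the controlled vocabulary listed in the README rather than from the class, which this diff shows only in part.

```python
# Controlled vocabulary as listed in the README (assumed to mirror VALID_DATATYPES).
VALID_DATATYPES = ("ont", "hic", "pacbio", "pacbio_clr", "illumina")


def old_check(value):
    """Pre-PR logic: the loop variable is unused, so this only tests truthiness."""
    return any(value for datatype in VALID_DATATYPES)


def new_check(value):
    """Post-PR logic: a plain membership test."""
    return value in VALID_DATATYPES


for candidate in ("ont", "nanopore", ""):
    print(repr(candidate), old_check(candidate), new_check(candidate))
```

The unrecognised value `nanopore` is accepted by the old check and rejected by the new one, which is the behaviour the fix is after.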