diff --git a/README.md b/README.md index 318a4aa..8a2e95c 100644 --- a/README.md +++ b/README.md @@ -6,7 +6,7 @@ # Hostile -Hostile accurately removes host sequences from short and long reads, consuming paired or unpaired `fastq[.gz]` input. Batteries are included – a human reference genome is downloaded when run for the first time. Hostile is precise by default, removing an [order of magnitude fewer microbial reads](https://log.bede.im/2023/08/29/precise-host-read-removal.html#evaluating-accuracy) than existing approaches while removing >99.5% of real human reads from 1000 Genomes Project samples. For the best possible retention of microbial reads, one can use a reference masked against bacterial and/or viral genomes, or make your own using the built-in masking utility. Read headers can be replaced with integers (using `--rename`) for privacy and smaller FASTQs. Heavy lifting is done with fast existing tools (Minimap2/Bowtie2 and Samtools). Bowtie2 is the default aligner for short (paired) reads while Minimap2 is default aligner for long reads. In benchmarks, bacterial Illumina reads were decontaminated at 32Mbp/s (210k reads/sec) and bacterial ONT reads at 22Mbp/s, using 8 alignment threads. Further information and benchmarks can be found in the [paper](https://doi.org/10.1093/bioinformatics/btad728) and [blog post](https://log.bede.im/2023/08/29/precise-host-read-removal.html). Please open an issue to report problems [or](mailto:b@bede.im) [otherwise](https://twitter.com/beconsta) [reach](https://bsky.app/profile/bedec.bsky.social) [out](https://mstdn.science/@bede) for help, advice etc. +Hostile accurately removes host sequences from short and long read (meta)genomes, consuming paired or unpaired `fastq[.gz]` input. Batteries are included – a human reference genome is downloaded when run for the first time. Hostile is precise by default, removing an [order of magnitude fewer microbial reads](https://log.bede.im/2023/08/29/precise-host-read-removal.html#evaluating-accuracy) than existing approaches while removing >99.5% of real human reads from 1000 Genomes Project samples. For the best possible retention of microbial reads, use an existing index masked against bacterial and/or viral genomes, or make your own using the built-in masking utility. Read headers can be replaced with integers (using `--rename`) for privacy and smaller FASTQs. Heavy lifting is done with fast existing tools (Minimap2/Bowtie2 and Samtools). Bowtie2 is the default aligner for short (paired) reads while Minimap2 is default aligner for long reads. In benchmarks, bacterial Illumina reads were decontaminated at 32Mbp/s (210k reads/sec) and bacterial ONT reads at 22Mbp/s, using 8 alignment threads. Further information and benchmarks can be found in the [paper](https://doi.org/10.1093/bioinformatics/btad728) and [blog post](https://log.bede.im/2023/08/29/precise-host-read-removal.html). Please open an issue to report problems [or](mailto:b@bede.im) [otherwise](https://twitter.com/beconsta) [reach](https://bsky.app/profile/bedec.bsky.social) [out](https://mstdn.science/@bede) for help, advice etc. @@ -14,14 +14,17 @@ Hostile accurately removes host sequences from short and long reads, consuming p The default index `human-t2t-hla` comprises [T2T-CHM13v2.0](https://www.ncbi.nlm.nih.gov/assembly/11828891) and [IPD-IMGT/HLA](https://www.ebi.ac.uk/ipd/imgt/hla/) v3.51, and is downloaded automatically when running Hostile unless another index is specified. Slightly higher microbial sequence retention is possible using a masked index, of which several are available. The index `human-t2t-hla-argos985` is masked against [985 reference grade bacterial genomes](https://www.ncbi.nlm.nih.gov/bioproject/231221) including common human pathogens, while `human-t2t-hla-argos985-mycob140` is further masked against mycobacterial genomes. To use a standard index, simply pass its name as the value of the `--index` argument which takes care of downloading and cacheing the relevant index. Automatic download can be disabled using the `--offline` flag, and `--index` can accept a path to a custom reference genome or Bowtie2 index. [Object storage](https://objectstorage.uk-london-1.oraclecloud.com/n/lrbvkel2wjot/b/human-genome-bucket/o) is provided by the [ModMedMicro research group](https://www.expmedndm.ox.ac.uk/modernising-medical-microbiology). -| Name | Composition | Date | Masked positions | -| :----------------------------------------------------------: | :----------------------------------------------------------: | ------- | --------------------- | -| `human-t2t-hla` **(default)** | [T2T-CHM13v2.0](https://www.ncbi.nlm.nih.gov/assembly/11828891) + [IPD-IMGT/HLA](https://www.ebi.ac.uk/ipd/imgt/hla/) v3.51 | 2023-07 | 0 (**0%**) | -| `human-t2t-hla-argos985` | `human-t2t-hla` masked with 150mers for [985](https://github.com/bede/hostile/blob/main/paper/supplementary-table-2.tsv) [FDA-ARGOS](https://www.ncbi.nlm.nih.gov/bioproject/231221) bacterial genomes | 2023-07 | 317,973 (**0.01%**) | -| `human-t2t-hla-argos985-mycob140` | `human-t2t-hla` masked with 150mers for [985](https://github.com/bede/hostile/blob/main/paper/supplementary-table-2.tsv) [FDA-ARGOS](https://www.ncbi.nlm.nih.gov/bioproject/231221) bacterial & [140](https://github.com/bede/hostile/blob/main/paper/supplementary-table-2.tsv) mycobacterial genomes | 2023-07 | 319,752 (**0.01%**) | -| `human-t2t-hla.argos-bacteria-985_rs-viral-202401_ml-phage-202401` | `human-t2t-hla` masked with 150mers for [985](https://github.com/bede/hostile/blob/main/paper/supplementary-table-2.tsv) [FDA-ARGOS](https://www.ncbi.nlm.nih.gov/bioproject/231221) bacterial, 18,719 RefSeq viral, and 26,928 [Millard Lab phage](https://millardlab.org/phage-genomes-jan-2024/) genomes | 2024-01 | 1,473,260 (**0.04%**) | +| Name | Composition | Date | Masked positions | +| :----------------------------------------------------------: | :----------------------------------------------------------: | ------- | ---------------------- | +| `human-t2t-hla` **(default)** | [T2T-CHM13v2.0](https://www.ncbi.nlm.nih.gov/assembly/11828891) + [IPD-IMGT/HLA](https://www.ebi.ac.uk/ipd/imgt/hla/) v3.51 | 2023-07 | 0 (**0%**) | +| `human-t2t-hla-argos985` | `human-t2t-hla` masked with 150mers for [985](https://github.com/bede/hostile/blob/main/paper/supplementary-table-2.tsv) [FDA-ARGOS](https://www.ncbi.nlm.nih.gov/bioproject/231221) **bacterial** genomes | 2023-07 | 317,973 (**0.010%**) | +| `human-t2t-hla.rs-viral-202401_ml-phage-202401` | `human-t2t-hla` masked with 150mers for 18,719 RefSeq **viral** and 26,928 [Millard Lab **phage**](https://millardlab.org/phage-genomes-jan-2024/) genomes | 2024-01 | 1,172,993 (**0.037%**) | +| `human-t2t-hla.argos-bacteria-985_rs-viral-202401_ml-phage-202401` | `human-t2t-hla` masked with 150mers for [985](https://github.com/bede/hostile/blob/main/paper/supplementary-table-2.tsv) [FDA-ARGOS](https://www.ncbi.nlm.nih.gov/bioproject/231221) bacterial, 18,719 RefSeq **viral**, and 26,928 [Millard Lab **phage**](https://millardlab.org/phage-genomes-jan-2024/) genomes | 2024-01 | 1,473,260 (**0.046%**) | +| `human-t2t-hla-argos985-mycob140` | `human-t2t-hla` masked with 150mers for [985](https://github.com/bede/hostile/blob/main/paper/supplementary-table-2.tsv) [FDA-ARGOS](https://www.ncbi.nlm.nih.gov/bioproject/231221) **bacterial** & [140](https://github.com/bede/hostile/blob/main/paper/supplementary-table-2.tsv) **mycobacterial** genomes | 2023-07 | 319,752 (**0.010%**) | + +*Performance of `human-t2t-hla` and `human-t2t-hla-argos985-mycob140` was evaluated in the [paper](https://doi.org/10.1093/bioinformatics/btad728)* + -*Performance of `human-t2t-hla` and `human-t2t-hla-argos985-mycob140` was evaluated in the [paper](https://www.biorxiv.org/content/10.1101/2023.07.04.547735)* ## Install [![Install with bioconda](https://img.shields.io/badge/install%20with-bioconda-brightgreen.svg?style=flat-square&logo=anaconda)](https://biocontainers.pro/tools/hostile) [![Install with Docker](https://img.shields.io/badge/install%20with-docker-important.svg?style=flat-square&logo=docker)](https://biocontainers.pro/tools/hostile) @@ -44,27 +47,16 @@ wget https://raw.githubusercontent.com/bede/hostile/main/Dockerfile docker build . --platform linux/amd64 ``` -**Development install** - -```bash -git clone https://github.com/bede/hostile.git -cd hostile -conda env create -y -f environment.yml -conda activate hostile -pip install --editable '.[dev]' -pytest -``` - -## Index installation +## Index installation (optional) -This is optional since Hostile automatically downloads and caches the default index `human-t2t-hla` when run for the first time. +Hostile automatically downloads and caches the default index `human-t2t-hla` when run for the first time, meaning that there is no need to download an index in advance. Neverthless: -- To force Hostile to download the default index (`human-t2t-hla`), run `hostile fetch` +- To download and cache the default index (`human-t2t-hla`), run `hostile fetch` - To list available indexes, run `hostile fetch --list` -- To download another standard index, run e.g. `hostile fetch --name human-t2t-hla-argos985` -- To use a custom genome (made perhaps with `hostile mask`?), run `hostile clean --index path/to/genome.fa` +- To download and cache another standard index, run e.g. `hostile fetch --name human-t2t-hla-argos985` +- To use a custom genome (made perhaps with `hostile mask`), run `hostile clean --index path/to/genome.fa` @@ -228,16 +220,31 @@ You may wish to use one of the existing [reference genomes](#reference-genomes-- Bede Constantinides, Martin Hunt, Derrick W Crook, Hostile: accurate decontamination of microbial host sequences, *Bioinformatics*, 2023; btad728, https://doi.org/10.1093/bioinformatics/btad728 ```latex -@article{Constantinides2023, +@article{10.1093/bioinformatics/btad728, author = {Constantinides, Bede and Hunt, Martin and Crook, Derrick W}, title = {Hostile: accurate decontamination of microbial host sequences}, journal = {Bioinformatics}, + volume = {39}, + number = {12}, pages = {btad728}, year = {2023}, month = {12}, issn = {1367-4811}, doi = {10.1093/bioinformatics/btad728}, url = {https://doi.org/10.1093/bioinformatics/btad728}, - eprint = {https://academic.oup.com/bioinformatics/advance-article-pdf/doi/10.1093/bioinformatics/btad728/53970806/btad728.pdf}, + eprint = {https://academic.oup.com/bioinformatics/article-pdf/39/12/btad728/54850422/btad728.pdf}, } ``` + + + +## Development install + +```bash +git clone https://github.com/bede/hostile.git +cd hostile +conda env create -y -f environment.yml +conda activate hostile +pip install --editable '.[dev]' +pytest +``` diff --git a/bucket/manifest.json b/bucket/manifest.json index 8009601..e5fe747 100644 --- a/bucket/manifest.json +++ b/bucket/manifest.json @@ -29,13 +29,23 @@ } } }, + "human-t2t-hla.rs-viral-202401_ml-phage-202401": { + "assets": { + "human-t2t-hla.rs-viral-202401_ml-phage-202401.fa.gz": { + "sha256": "3f95be8ffd99a0ebc7667af2dc699ac7921bd474b5edf200fe319b13f98b0265" + }, + "human-t2t-hla.rs-viral-202401_ml-phage-202401.tar": { + "sha256": "d1d5386e65f2e9006c09cfd3f983499e07e968ce3a6332b5d546bbd09088af3e" + } + } + }, "human-t2t-hla.argos-bacteria-985_rs-viral-202401_ml-phage-202401": { "assets": { "human-t2t-hla.argos-bacteria-985_rs-viral-202401_ml-phage-202401.fa.gz": { - "sha256": "" + "sha256": "a9693618eaf93fae24e07c184019ca96f81cd1ec098858f3f4c901c706efa8f0" }, "human-t2t-hla.argos-bacteria-985_rs-viral-202401_ml-phage-202401.tar": { - "sha256": "" + "sha256": "681819bec809f2e3b613d449492954baa018d032879e3acfbcc78c5568009410" } } } diff --git a/src/hostile/aligner.py b/src/hostile/aligner.py index 28ae4ee..fe6a29b 100644 --- a/src/hostile/aligner.py +++ b/src/hostile/aligner.py @@ -42,7 +42,6 @@ def check_index(self, index: str, offline: bool = False) -> Path: manifest = util.fetch_manifest(util.BUCKET_URL) with tempfile.NamedTemporaryFile() as temporary_file: tmp_path = Path(temporary_file.name) - # util.download(f"http://localhost:8000/{file_name}", tmp_path) util.download(f"{file_url}", tmp_path) expected_sha256 = manifest[index]["assets"][file_name]["sha256"] logging.info(f"Verifying checksum {expected_sha256}…") @@ -74,7 +73,6 @@ def check_index(self, index: str, offline: bool = False) -> Path: manifest = util.fetch_manifest(util.BUCKET_URL) with tempfile.NamedTemporaryFile() as temporary_file: tmp_path = Path(temporary_file.name) - # util.download(f"http://localhost:8000/{file_name}", tmp_path) util.download(f"{file_url}", tmp_path) expected_sha256 = manifest[index]["assets"][file_name]["sha256"] logging.info(f"Verifying checksum {expected_sha256}…") @@ -97,7 +95,7 @@ def gen_clean_cmd( self, fastq: Path, out_dir: Path, - index: str, + index_path: Path, invert: bool, rename: bool, reorder: bool, @@ -116,7 +114,6 @@ def gen_clean_cmd( raise FileExistsError( f"Output file already exists. Use --force to overwrite" ) - index_path = self.check_index(index, offline=offline) filter_cmd = " | samtools view -hF 4 -" if invert else " | samtools view -f 4 -" reorder_cmd = " | samtools sort -n -O sam -@ 6 -m 1G" if reorder else "" rename_cmd = ( @@ -159,7 +156,7 @@ def gen_paired_clean_cmd( fastq1: Path, fastq2: Path, out_dir: Path, - index: Path | None, + index_path: Path, invert: bool, rename: bool, reorder: bool, @@ -180,7 +177,6 @@ def gen_paired_clean_cmd( raise FileExistsError( f"Output files already exist. Use --force to overwrite" ) - index_path = self.check_index(index, offline=offline) filter_cmd = ( " | samtools view -hF 12 -" if invert else " | samtools view -f 12 -" ) diff --git a/src/hostile/cli.py b/src/hostile/cli.py index ac433d7..a2d478b 100644 --- a/src/hostile/cli.py +++ b/src/hostile/cli.py @@ -129,9 +129,7 @@ def fetch( list: bool = False, ) -> None: """ - Download indexes from object storage for use with hostile clean. For Minimap2 a - compressed genome is downloaded, while for Bowtie2 a tarred index is downloaded - and extracted + Download and cache indexes from object storage for use with hostile clean :arg filename: filename of index to download :arg aligner: aligner(s) for which to download an index diff --git a/src/hostile/lib.py b/src/hostile/lib.py index 0241b14..74532b1 100644 --- a/src/hostile/lib.py +++ b/src/hostile/lib.py @@ -175,11 +175,12 @@ def clean_fastqs( if not all(fastq.is_file() for fastq in fastqs): raise FileNotFoundError("One or more fastq files do not exist") Path(out_dir).mkdir(exist_ok=True, parents=True) + index_path = aligner.value.check_index(index, offline=offline) backend_cmds = [ aligner.value.gen_clean_cmd( fastq=fastq, out_dir=out_dir, - index=index, + index_path=index_path, invert=invert, rename=rename, reorder=reorder, @@ -233,12 +234,13 @@ def clean_paired_fastqs( if not all(path.is_file() for fastq_pair in fastqs for path in fastq_pair): raise FileNotFoundError("One or more fastq files do not exist") Path(out_dir).mkdir(exist_ok=True, parents=True) + index_path = aligner.value.check_index(index, offline=offline) backend_cmds = [ aligner.value.gen_paired_clean_cmd( fastq1=pair[0], fastq2=pair[1], out_dir=out_dir, - index=index, + index_path=index_path, invert=invert, rename=rename, reorder=reorder, @@ -279,8 +281,8 @@ def mask( out_dir.mkdir(exist_ok=True, parents=True) kmers_path = out_dir / "kmers.fasta.gz" mask_path = out_dir / "mask.bed" - masked_ref_path = out_dir / "masked.fa" - masked_ref_index_path = out_dir / "masked" + masked_ref_path = out_dir / f"{out_dir.name}.fa" + masked_ref_index_path = out_dir / out_dir.name if ref_path.suffix == ".gz": # Decompress reference if necessary new_ref_path = out_dir / ref_path.stem diff --git a/tests/test_all.py b/tests/test_all.py index 200e978..1556e09 100644 --- a/tests/test_all.py +++ b/tests/test_all.py @@ -584,6 +584,24 @@ def test_invert_paired(): shutil.rmtree(out_dir, ignore_errors=True) +def test_invert_paired_2(): + stats = lib.clean_paired_fastqs( + fastqs=[ + ( + data_dir / "sars-cov-2_100_1.fastq.gz", + data_dir / "sars-cov-2_100_2.fastq.gz", + ), + ], + aligner=lib.ALIGNER.bowtie2, + index=data_dir / "sars-cov-2/sars-cov-2", + out_dir=out_dir, + invert=True, + force=True, + ) + assert stats[0]["reads_out"] == 92 + shutil.rmtree(out_dir, ignore_errors=True) + + def test_minimap2_reordering_linux(): stats = lib.clean_paired_fastqs( fastqs=[