Error msg for mismatched read counts, update log output for stdout, test
bede committed Dec 16, 2024
1 parent b5ccc77 commit 493b44a
Showing 4 changed files with 220 additions and 69 deletions.
111 changes: 74 additions & 37 deletions README.md
@@ -69,8 +69,8 @@ A [Biocontainer image](https://biocontainers.pro/tools/hostile) is also availabl

```bash
$ hostile clean -h
usage: hostile clean [-h] --fastq1 FASTQ1 [--fastq2 FASTQ2] [--aligner {bowtie2,minimap2,auto}] [--index INDEX] [--invert] [--rename] [--reorder] [--out-dir OUT_DIR]
[--threads THREADS] [--aligner-args ALIGNER_ARGS] [--force] [--offline] [--debug]
usage: hostile clean [-h] --fastq1 FASTQ1 [--fastq2 FASTQ2] [--aligner {bowtie2,minimap2,auto}] [--index INDEX] [--invert] [--rename] [--reorder] [--out-dir OUT_DIR] [-s] [-t THREADS] [--force]
[--aligner-args ALIGNER_ARGS] [--offline] [-d]

Remove reads aligning to an index from fastq[.gz] input files.

@@ -81,7 +81,7 @@ options:
(default: None)
--aligner {bowtie2,minimap2,auto}
alignment algorithm. Defaults to minimap2 (long read) given fastq1 only or bowtie2 (short read)
given fastq1 and fastq2. Override with bowtie2 if cleaning single/unpaired short reads
given fastq1 and fastq2. Override with bowtie2 for single/unpaired short reads
(default: auto)
--index INDEX name of standard index or path to custom genome (Minimap2) or Bowtie2 index
(default: human-t2t-hla)
@@ -92,26 +92,90 @@ options:
--reorder ensure deterministic output order
(default: False)
--out-dir OUT_DIR path to output directory
(default: ./)
--threads THREADS number of alignment threads. A sensible default is chosen automatically
(default: /Users/bede/Research/git/hostile)
-s, --stdout send FASTQ to stdout instead of writing fastq.gz file(s). Sends log to stderr instead. Paired output is interleaved
(default: False)
-t, --threads THREADS
number of alignment threads. A sensible default is chosen automatically
(default: 5)
--force overwrite existing output files
(default: False)
--aligner-args ALIGNER_ARGS
additional arguments for alignment
(default: )
--force overwrite existing output files
(default: False)
--offline disable automatic index download
(default: False)
--debug show debug messages
-d, --debug show debug messages
(default: False)
```
**Long reads, default index**
Writes compressed fastq.gz files to current working directory, sends log to stdout
```bash
$ hostile clean --fastq1 tests/data/tuberculosis_1_1.fastq.gz
INFO: Hostile version 1.0.0. Mode: long read (Minimap2)
INFO: Found cached standard index human-t2t-hla
INFO: Cleaning…
INFO: Cleaning complete
[
{
"version": "1.0.0",
"aligner": "minimap2",
"index": "human-t2t-hla",
"options": [],
"fastq1_in_name": "tuberculosis_1_1.fastq.gz",
"fastq1_in_path": "/Users/bede/Research/git/hostile/tests/data/tuberculosis_1_1.fastq.gz",
"fastq1_out_name": "tuberculosis_1_1.clean.fastq.gz",
"fastq1_out_path": "/Users/bede/Research/git/hostile/tuberculosis_1_1.clean.fastq.gz",
"reads_in": 1,
"reads_out": 1,
"reads_removed": 0,
"reads_removed_proportion": 0.0
}
]

```
**Long reads, default index, stream reads to stdout**
Sends uncompressed FASTQ to stdout, log to stderr
```bash
$ hostile clean --fastq1 tests/data/tuberculosis_1_1.fastq.gz
INFO: Hostile version 1.0.0. Mode: long read (Minimap2)
INFO: Found cached standard index human-t2t-hla
INFO: Cleaning…
INFO: Cleaning complete
[
{
"version": "1.0.0",
"aligner": "minimap2",
"index": "human-t2t-hla",
"options": [],
"fastq1_in_name": "tuberculosis_1_1.fastq.gz",
"fastq1_in_path": "/Users/bede/Research/git/hostile/tests/data/tuberculosis_1_1.fastq.gz",
"fastq1_out_name": "tuberculosis_1_1.clean.fastq.gz",
"fastq1_out_path": "/Users/bede/Research/git/hostile/tuberculosis_1_1.clean.fastq.gz",
"reads_in": 1,
"reads_out": 1,
"reads_removed": 0,
"reads_removed_proportion": 0.0
}
]

```
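The `--stdout` help text above states that paired output is interleaved. A downstream consumer might separate the mates again along these lines. This is a minimal sketch, not part of Hostile: `split_interleaved_fastq` is a hypothetical helper, and it assumes strict four-line FASTQ records with mate 1 always directly followed by mate 2.

```python
from typing import Iterable, Iterator


def split_interleaved_fastq(
    lines: Iterable[str],
) -> Iterator[tuple[list[str], list[str]]]:
    """Yield (mate1, mate2) record pairs from interleaved four-line FASTQ."""
    record: list[str] = []
    pending: list[str] | None = None  # mate 1 waiting for its mate 2
    for line in lines:
        record.append(line.rstrip("\n"))
        if len(record) < 4:
            continue  # record not complete yet
        if pending is None:
            pending = record  # first mate of the pair
        else:
            yield pending, record  # second mate completes the pair
            pending = None
        record = []
    # Assumption: a leftover mate or partial record means truncated input
    if pending is not None or record:
        raise ValueError("interleaved FASTQ ended with an incomplete pair")
```

Passing `sys.stdin` as `lines` would let this sit at the end of a shell pipeline; any iterable of text lines works equally well.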
**Short paired reads, default index**
```bash
$ hostile clean --fastq1 human_1_1.fastq.gz --fastq2 human_1_2.fastq.gz
$ hostile clean --fastq1 human_1_1.fastq.gz --fastq2 human_1_2.fastq.gz --aligner bowtie2
INFO: Hostile version 1.0.0. Mode: paired short read (Bowtie2)
INFO: Found cached standard index human-t2t-hla
INFO: Cleaning…
@@ -141,7 +205,7 @@ INFO: Cleaning complete
**Short paired reads, masked index, save log**
```bash
$ hostile clean --fastq1 human_1_1.fastq.gz --fastq2 human_1_2.fastq.gz --index human-t2t-hla-argos985 > log.json
$ hostile clean --fastq1 human_1_1.fastq.gz --fastq2 human_1_2.fastq.gz --aligner bowtie2 --index human-t2t-hla-argos985 > log.json
INFO: Hostile version 1.0.0. Mode: paired short read (Bowtie2)
INFO: Found cached standard index human-t2t-hla
INFO: Cleaning…
@@ -164,33 +228,6 @@ INFO: Cleaning complete
**Long reads**
```bash
$ hostile clean --fastq1 tests/data/tuberculosis_1_1.fastq.gz
INFO: Hostile version 1.0.0. Mode: long read (Minimap2)
INFO: Found cached standard index human-t2t-hla
INFO: Cleaning…
INFO: Cleaning complete
[
{
"version": "1.0.0",
"aligner": "minimap2",
"index": "human-t2t-hla",
"options": [],
"fastq1_in_name": "tuberculosis_1_1.fastq.gz",
"fastq1_in_path": "/Users/bede/Research/git/hostile/tests/data/tuberculosis_1_1.fastq.gz",
"fastq1_out_name": "tuberculosis_1_1.clean.fastq.gz",
"fastq1_out_path": "/Users/bede/Research/git/hostile/tuberculosis_1_1.clean.fastq.gz",
"reads_in": 1,
"reads_out": 1,
"reads_removed": 0,
"reads_removed_proportion": 0.0
}
]

```
## Python usage
66 changes: 40 additions & 26 deletions src/hostile/lib.py
@@ -26,14 +26,14 @@ class SampleReport:
options: list[str]
fastq1_in_name: str
fastq1_in_path: str
fastq1_out_name: str
fastq1_out_path: str
reads_in: int
reads_out: int
reads_removed: int
reads_removed_proportion: float
fastq2_in_name: str | None = None
fastq2_in_path: str | None = None
fastq1_out_name: str | None = None
fastq1_out_path: str | None = None
fastq2_out_name: str | None = None
fastq2_out_path: str | None = None

@@ -46,6 +46,7 @@ def gather_stats(
aligner: str,
invert: bool,
index: str,
stdout: bool,
) -> list[dict[str, str | int | float | list[str]]]:
stats = []
for fastq1 in fastqs:
@@ -64,7 +65,12 @@ def gather_stats(
proportion_removed = float(0)
options = [
k
for k, v in {"rename": rename, "reorder": reorder, "invert": invert}.items()
for k, v in {
"invert": invert,
"rename": rename,
"reorder": reorder,
"stdout": stdout,
}.items()
if v
]
report = SampleReport(
@@ -74,8 +80,8 @@ def gather_stats(
options=options,
fastq1_in_name=fastq1.name,
fastq1_in_path=str(fastq1),
fastq1_out_name=fastq1_out_path.name,
fastq1_out_path=str(fastq1_out_path),
fastq1_out_name=fastq1_out_path.name if not stdout else None,
fastq1_out_path=str(fastq1_out_path) if not stdout else None,
reads_in=n_reads_in,
reads_out=n_reads_out,
reads_removed=n_reads_removed,
@@ -93,6 +99,7 @@ def gather_stats_paired(
aligner: str,
index: str,
invert: bool,
stdout: bool,
) -> list[dict[str, str | int | float]]:
stats = []
for fastq1, fastq2 in fastqs:
@@ -113,29 +120,34 @@ def gather_stats_paired(
proportion_removed = float(0)
options = [
k
for k, v in {"rename": rename, "reorder": reorder, "invert": invert}.items()
for k, v in {
"invert": invert,
"rename": rename,
"reorder": reorder,
"stdout": stdout,
}.items()
if v
]
stats.append(
SampleReport(
version=__version__,
aligner=aligner,
index=index,
options=options,
fastq1_in_name=fastq1.name,
fastq2_in_name=fastq2.name,
fastq1_in_path=str(fastq1),
fastq2_in_path=str(fastq2),
fastq1_out_name=fastq1_out_path.name,
fastq2_out_name=fastq2_out_path.name,
fastq1_out_path=str(fastq1_out_path),
fastq2_out_path=str(fastq2_out_path),
reads_in=n_reads_in,
reads_out=n_reads_out,
reads_removed=n_reads_removed,
reads_removed_proportion=proportion_removed,
).__dict__
)
report = SampleReport(
version=__version__,
aligner=aligner,
index=index,
options=options,
fastq1_in_name=fastq1.name,
fastq2_in_name=fastq2.name,
fastq1_in_path=str(fastq1),
fastq2_in_path=str(fastq2),
fastq1_out_name=fastq1_out_path.name if not stdout else None,
fastq2_out_name=fastq2_out_path.name if not stdout else None,
fastq1_out_path=str(fastq1_out_path) if not stdout else None,
fastq2_out_path=str(fastq2_out_path) if not stdout else None,
reads_in=n_reads_in,
reads_out=n_reads_out,
reads_removed=n_reads_removed,
reads_removed_proportion=proportion_removed,
).__dict__
stats.append({k: v for k, v in report.items() if v is not None})

return stats


@@ -191,6 +203,7 @@ def clean_fastqs(
aligner=aligner.name,
index=index,
invert=invert,
stdout=stdout,
)
util.fix_empty_fastqs(stats)
logging.info("Cleaning complete")
@@ -256,6 +269,7 @@ def clean_paired_fastqs(
aligner=aligner.name,
index=index,
invert=invert,
stdout=stdout,
)
util.fix_empty_fastqs(stats)
logging.info("Cleaning complete")
4 changes: 3 additions & 1 deletion src/hostile/util.py
@@ -72,6 +72,8 @@ def handle_alignment_exceptions(exception: subprocess.CalledProcessError) -> Non
logging.debug(f"stderr: {exception.stderr}")
alignment_successful = False
stream_empty = False
if "Error, fewer reads in file specified" in exception.stderr: # Bowtie2
raise RuntimeError("fastq1 and fastq2 contain different numbers of reads")
if 'Failed to read header for "-"' in exception.stderr:
stream_empty = True
if "overall alignment rate" in exception.stderr: # Bowtie2
@@ -83,7 +85,7 @@ def handle_alignment_exceptions(exception: subprocess.CalledProcessError) -> Non
pass
else:
logging.error(
f"Hostile encountered a problem. Check available RAM and storage\n"
f"Hostile encountered a problem. Details below\n"
f"pipeline stdout:\n{exception.stdout}\n"
f"pipeline stderr:\n{exception.stderr}\n"
)
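The new check in `handle_alignment_exceptions` turns a known aligner message into a readable error. The idea can be shown on its own in a minimal sketch. `explain_failure` and `MISMATCH_MARKER` are hypothetical names, and the sketch assumes the pipeline was run in text mode so that `stderr` is a `str`. Only the marker string and the error message come from the diff.

```python
import subprocess

# Bowtie2 message fragment, as matched in the diff
MISMATCH_MARKER = "Error, fewer reads in file specified"


def explain_failure(exception: subprocess.CalledProcessError) -> None:
    """Raise a friendlier error for a recognised aligner failure."""
    # stderr is None when it was not captured; treat that as empty text
    stderr = exception.stderr or ""
    if MISMATCH_MARKER in stderr:
        raise RuntimeError(
            "fastq1 and fastq2 contain different numbers of reads"
        ) from exception
    # Unrecognised failure: re-raise unchanged (an assumption of this
    # sketch; the real handler logs stdout and stderr instead)
    raise exception
```

Chaining with `from exception` keeps the original `CalledProcessError` in the traceback, so the raw aligner output stays available when debugging.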