Skip to content
This repository has been archived by the owner on Jan 20, 2022. It is now read-only.

short-read-mngs RunAssembly: contig length filter #127

Merged
merged 7 commits into from
Jun 9, 2021

Conversation

mlin
Copy link
Collaborator

@mlin mlin commented Jun 3, 2021

Apply a 100-nucleotide length filter to the contigs as they emerge from SPAdes, to reduce the burden on subsequent steps like BlastContigs. Shorter contigs can make up a majority of SPAdes output currently (sample-dependent), yet seem of minimal value considering they're shorter than the input reads.

The downfiltered contigs are used for all subsequent steps (including the Bowtie2 realignment of reads to contigs immediately following SPAdes). The original set is copied into a new workflow output, assembly_out_assembly_contigs_all_fasta, which remains unused.

Copy link
Contributor

@katrinakalantar katrinakalantar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me, I added a few minor comments.
I'll let @morsecodist give the final approval.

@@ -659,6 +661,7 @@ workflow idseq_postprocess {

output {
File assembly_out_assembly_contigs_fasta = RunAssembly.assembly_contigs_fasta
File assembly_out_assembly_contigs_all_fasta = RunAssembly.assembly_contigs_all_fasta
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great that we're still outputting the full assembled contig file!

@@ -26,5 +26,5 @@ Then, build the docker image from the current code revision and start the tests:
```bash
cd idseq-workflows
docker build -t idseq-short-read-mngs short-read-mngs
make test-short-read-mngs
DOCKER_IMAGE_ID=idseq-short-read-mngs make test-short-read-mngs
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should these be a single line?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That just makes the definition effective only for this invocation instead of the rest of the shell session (since DOCKER_IMAGE_ID is a pretty generic name)

short-read-mngs/idseq-dag/idseq_dag/util/fasta.py Outdated Show resolved Hide resolved
)
contigs_headers = fasta_headers(assembly_contigs_fasta)
contigs_all_headers = fasta_headers(assembly_contigs_all_fasta)
assert contigs_headers and (set(contigs_all_headers) - set(contigs_headers))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

optional: for completeness, should we additionally assert on the length of a sequence in set(contigs_all_headers) - set(contigs_headers) to ensure length is < min_contig_length?

Copy link
Contributor

@morsecodist morsecodist left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great overall. Love the tests. Just a few small tweaks.

if min_contig_length:
# apply contig length filter
with open(assembled_contig, "w") as outfile:
for record in fasta.iterator(assembled_contig_all):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am trying to move us away from custom parsing. We already have bio python as a dependency. Could you replace with something like:

from Bio import SeqIO

SeqIO.write(
    (r for r in SeqIO.parse(assembled_contig_all, "fasta") if len(r.seq) >= min_contig_length),
    assembled_contig,
    "fasta",
)

This would make your changes in short-read-mngs/idseq-dag/idseq_dag/util/fasta.py unnecessary

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed and implemented. I found I had to add biopython to the idseq-dag setup.py requires, I guess the production Dockerfile installs it separately.

short-read-mngs/idseq-dag/idseq_dag/util/fasta.py Outdated Show resolved Hide resolved
@@ -6,6 +6,7 @@ task RunAssembly {
String s3_wd_uri
Array[File] host_filter_out_gsnap_filter_fa
File duplicate_cluster_sizes_tsv
Int min_contig_length
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you set the default here and make it optional so it is more obvious and we could disable it more cleanly if we ever support that?

Int? min_contig_length = 100

Assuming we can explicitly pass in null here.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm setting the default at the workflow-level input rn: https://github.com/chanzuckerberg/idseq-workflows/pull/127/files#diff-d2e6572472ef03a6e96e659a5de68aaf3c3954db41875231be060bbaf4e3a8c0R484

There's actually an ambiguity in the WDL spec we're working to clarify right now that makes it awkward to set the default value inside the task, but also expose a workflow-level input to optionally override it. It's funny it comes up organically here, as we were just having a healthy debate about that PR last week.

If we needed to, we'd set the workflow-level input to 0.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah OK

read2contig = {}
memory = self.additional_attributes.get('memory', 100)
self.assemble(input_fasta, input_fasta2, bowtie_fasta, duplicate_cluster_sizes_path, assembled_contig, assembled_scaffold,
bowtie_sam, contig_stats, read2contig, int(memory))
min_contig_length = int(self.additional_attributes.get('min_contig_length', 0))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See comment about moving the default. Also here it looks like the default is no filtering rather than 100?

Copy link
Contributor

@morsecodist morsecodist left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good thanks!

@mlin mlin merged commit 1851405 into main Jun 9, 2021
@mlin mlin deleted the mlin/assembly-min-contig-length branch June 9, 2021 20:46
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants