-
Notifications
You must be signed in to change notification settings - Fork 12
short-read-mngs RunAssembly: contig length filter #127
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good to me, I added a few minor comments.
I'll let @morsecodist give the final approval.
@@ -659,6 +661,7 @@ workflow idseq_postprocess { | |||
|
|||
output { | |||
File assembly_out_assembly_contigs_fasta = RunAssembly.assembly_contigs_fasta | |||
File assembly_out_assembly_contigs_all_fasta = RunAssembly.assembly_contigs_all_fasta |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great that we're still outputting the full assembled contig file!
@@ -26,5 +26,5 @@ Then, build the docker image from the current code revision and start the tests: | |||
```bash | |||
cd idseq-workflows | |||
docker build -t idseq-short-read-mngs short-read-mngs | |||
make test-short-read-mngs | |||
DOCKER_IMAGE_ID=idseq-short-read-mngs make test-short-read-mngs |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should these be a single line?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That just makes the definition effective only for this invocation instead of the rest of the shell session (since DOCKER_IMAGE_ID
is a pretty generic name)
) | ||
contigs_headers = fasta_headers(assembly_contigs_fasta) | ||
contigs_all_headers = fasta_headers(assembly_contigs_all_fasta) | ||
assert contigs_headers and (set(contigs_all_headers) - set(contigs_headers)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
optional: for completeness, should we additionally assert on the length of a sequence in set(contigs_all_headers) - set(contigs_headers)
to ensure length is < min_contig_length
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great overall. Love the tests. Just a few small tweaks.
if min_contig_length: | ||
# apply contig length filter | ||
with open(assembled_contig, "w") as outfile: | ||
for record in fasta.iterator(assembled_contig_all): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am trying to move us away from custom parsing. We already have bio python as a dependency. Could you replace with something like:
from Bio import SeqIO
SeqIO.write(
(r for r in SeqIO.parse(assembled_contig_all, "fasta") if len(r.seq) >= min_contig_length),
assembled_contig,
"fasta",
)
This would make your changes in short-read-mngs/idseq-dag/idseq_dag/util/fasta.py unnecessary
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed and implemented. I found I had to add biopython to the idseq-dag setup.py requires, I guess the production Dockerfile installs it separately.
@@ -6,6 +6,7 @@ task RunAssembly { | |||
String s3_wd_uri | |||
Array[File] host_filter_out_gsnap_filter_fa | |||
File duplicate_cluster_sizes_tsv | |||
Int min_contig_length |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you set the default here and make it optional so it is more obvious and we could disable it more cleanly if we ever support that?
Int? min_contig_length = 100
Assuming we can explicitly pass in null here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm setting the default at the workflow-level input rn: https://github.com/chanzuckerberg/idseq-workflows/pull/127/files#diff-d2e6572472ef03a6e96e659a5de68aaf3c3954db41875231be060bbaf4e3a8c0R484
There's actually an ambiguity in the WDL spec we're working to clarify right now that makes it awkward to set the default value inside the task, but also expose a workflow-level input to optionally override it. It's funny it comes up organically here, as we were just having a healthy debate about that PR last week.
If we needed to, we'd set the workflow-level input to 0.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah OK
read2contig = {} | ||
memory = self.additional_attributes.get('memory', 100) | ||
self.assemble(input_fasta, input_fasta2, bowtie_fasta, duplicate_cluster_sizes_path, assembled_contig, assembled_scaffold, | ||
bowtie_sam, contig_stats, read2contig, int(memory)) | ||
min_contig_length = int(self.additional_attributes.get('min_contig_length', 0)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See comment about moving the default. Also here it looks like the default is no filtering rather than 100?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good thanks!
Apply a 100-nucleotide length filter to the contigs as they emerge from SPAdes, to reduce the burden on subsequent steps like BlastContigs. Shorter contigs can make up a majority of SPAdes output currently (sample-dependent), yet seem of minimal value considering they're shorter than the input reads.
The downfiltered contigs are used for all subsequent steps (including the Bowtie2 realignment of reads to contigs immediately following SPAdes). The original set is copied into a new workflow output,
assembly_out_assembly_contigs_all_fasta
, which remains unused.