Overrepresented Sequences shows "no hit" for all sequences #17

Closed
edgardomortiz opened this issue Sep 3, 2021 · 10 comments
Labels
bug Something isn't working

Comments

@edgardomortiz

edgardomortiz commented Sep 3, 2021

We compared a FastQC run with a falco 0.2.4 run (installed from bioconda). In the Overrepresented sequences table, FastQC shows hit names such as "Truseq adaptor XX", while falco shows every overrepresented sequence as "no hit".

The result is identical when adding --contaminants and the path to the contaminant list file (this file is not shipped with the conda installation).

Thanks

Edgardo

@guilhermesena1
Collaborator

Hi,

I'll address both the issues you created here if that's ok?

Thank you for bringing this up. I think I didn't configure the Conda metadata correctly, so the default adapters and contaminants are not downloaded when falco is installed. I'll try to fix this in the next few days. In the meantime, you can download the adapters and contaminants lists manually and provide them using the --contaminants and --adapters flags. I apologize for the inconvenience!

That being said, I'm puzzled why the contaminants output is not consistent with FastQC in your case. Would you be able to provide the first 40,000 lines of your input FASTQ file and the contaminants file you are using, so I can try to reproduce the issue and look into it? Thank you!
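
A subset like this can be cut from a gzipped FASTQ with a few lines of Python. The sketch below is only an illustration: the function name `write_fastq_head` and the demo file names are made up for this example. Since a FASTQ record spans four lines, 40,000 lines keeps 10,000 whole reads.

```python
import gzip
import os
import tempfile
from itertools import islice


def write_fastq_head(src_path, dst_path, n_lines=40_000):
    """Copy the first `n_lines` lines of a gzipped FASTQ into a new gzipped file.

    Returns the number of lines actually written (fewer if the input is short).
    """
    written = 0
    with gzip.open(src_path, "rt") as src, gzip.open(dst_path, "wt") as dst:
        for line in islice(src, n_lines):
            dst.write(line)
            written += 1
    return written


# Self-check on a tiny synthetic file (3 made-up reads, 12 lines in total).
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "demo.fq.gz")
    dst = os.path.join(tmp, "demo_head.fq.gz")
    with gzip.open(src, "wt") as fh:
        for i in range(3):
            fh.write(f"@read{i}\nACGT\n+\nIIII\n")
    n_written = write_fastq_head(src, dst, n_lines=8)
    with gzip.open(dst, "rt") as fh:
        head_lines = fh.read().splitlines()

print(n_written, head_lines[0], head_lines[4])  # 8 @read0 @read1
```

For a real file, the call would look like `write_fastq_head("input.fq.gz", "input_head.fq.gz")`, with both paths replaced by your own.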

@edgardomortiz
Author

edgardomortiz commented Sep 3, 2021

Thanks! Unfortunately the file is in our lab, I will pass by tomorrow to retrieve it.

I was checking the code and, as I understand it, the default adapter list is also present within the code, so theoretically falco should be able to find the adaptors in the FASTQ files even when the adaptor list file is not available (am I right? I program mostly in Python, so I might have missed something there).

The contaminant file is the one provided in falco's repository, in the Configuration directory.

Edgardo

@edgardomortiz
Author

Here I attach the FASTQ files and the reports produced by each program. The contaminant list file was the one supplied in this repository. The commands were:

fastqc --nogroup -o fastqc Anthopterus*
falco --nogroup -o falco_c -c contaminants.txt Anthopterus*

Anthopterus-racemosus_LV16228_R2.fq.gz
Anthopterus-racemosus_LV16228_R1.fq.gz
Anthopterus-racemosus_LV16228_R1_fastqc.zip
Anthopterus-racemosus_LV16228_R2_fastqc.zip
falco_c.zip

@guilhermesena1
Collaborator

Thank you so much for providing these reports!

Regarding hits in the overrepresented sequences module, I think you just found a feature that needs to be incorporated into falco. I looked at the contaminant and the sequence that FastQC claims to be a contaminant and saw that this is how they overlap:


r1      CATGATCAGAGATCGGAAGAGCACACGTCTGAACTCCAGTCACTTAGGCA----------	50
r2      ----------GATCGGAAGAGCACACGTCTGAACTCCAGTCACTTAGGCATCTCGTATGC	50
                  ****************************************

So the suffix of the sequence is the prefix of a contaminant. Currently falco only checks whether the sequence is contained in the contaminant, so it does not account for these overlaps. This is useful to know; I will fix it in the upcoming release for sure!
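
The difference between the two checks can be sketched in Python using the two sequences from the alignment above. This is only an illustration of the idea, not falco's actual implementation; the function names and the `min_overlap` threshold are assumptions made for the example.

```python
def is_contained(seq: str, contaminant: str) -> bool:
    """Containment-only check: the whole read must appear inside the contaminant."""
    return seq in contaminant


def suffix_prefix_overlap(seq: str, contaminant: str, min_overlap: int = 20) -> int:
    """Length of the longest suffix of `seq` that is also a prefix of `contaminant`.

    Returns 0 when no overlap of at least `min_overlap` bases exists.
    `min_overlap` is a hypothetical threshold chosen for this sketch.
    """
    longest = min(len(seq), len(contaminant))
    for k in range(longest, min_overlap - 1, -1):
        if seq.endswith(contaminant[:k]):
            return k
    return 0


# r1 (the overrepresented sequence) and r2 (the contaminant) from the alignment.
read = "CATGATCAGAGATCGGAAGAGCACACGTCTGAACTCCAGTCACTTAGGCA"
contaminant = "GATCGGAAGAGCACACGTCTGAACTCCAGTCACTTAGGCATCTCGTATGC"

print(is_contained(read, contaminant))           # False
print(suffix_prefix_overlap(read, contaminant))  # 40
```

The containment check misses this pair, while the suffix/prefix check reports the 40 matching bases marked with asterisks in the alignment. A fuller version would also need to handle the mirrored case, where the prefix of the sequence matches the suffix of the contaminant.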

Regarding adapters, it looks like falco is checking for adapters, but it is simply not finding the Illumina universal adapter that FastQC finds. I'll have to look into this one in more detail.

guilhermesena1 added the "bug" (Something isn't working) label Sep 8, 2021

@edgardomortiz
Author

Cool. Falco does find similar amounts of adaptors as FastQC, but only when the adaptor list is specified with --adapters. I could send those reports as well if they help to replicate the issue.

@guilhermesena1
Collaborator

I might need the actual reads to reproduce. You are right that falco has hard-coded adapter and contaminant lists inside src/FalcoConfig.cpp, which are identical to the default files inside Configuration, so the behavior should be the same whether or not you pass the --adapters flag. I double-checked whether the sequences and hashes inside src/FalcoConfig.cpp are the same as in Configuration/adapters.txt, and so far I really don't see what could be causing the difference. I'll keep looking.
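
One way to cross-check two such lists is to parse each into a name-to-sequence mapping and report the names that disagree. The Python sketch below is an illustration only, not code from falco. It assumes a FastQC-style layout (one entry per line, name and sequence separated by one or more tabs, `#` marking comments), and the function names and sample entries are made up.

```python
def parse_sequence_list(text: str) -> dict:
    """Parse 'name<TAB(s)>sequence' lines, skipping blanks and '#' comments."""
    entries = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Consecutive tabs are allowed, so drop the empty fields they produce.
        fields = [f.strip() for f in line.split("\t") if f.strip()]
        if len(fields) < 2:
            continue  # malformed line without a tab-separated sequence
        entries[fields[0]] = fields[1].upper()
    return entries


def diff_sequence_lists(a: dict, b: dict) -> list:
    """Names whose sequence differs, or that are present in only one list."""
    return sorted(name for name in a.keys() | b.keys() if a.get(name) != b.get(name))


# Made-up entries standing in for the embedded list and the file on disk.
embedded = "Example Adapter A\t\tACGTACGTACGT\nExample Adapter B\tTTGGCCAATTGG\n"
from_file = (
    "# hypothetical adapters file\n"
    "Example Adapter A\tACGTACGTACGT\n"
    "Example Adapter B\tTTGGCCAATTGGAA\n"
)

mismatches = diff_sequence_lists(
    parse_sequence_list(embedded), parse_sequence_list(from_file)
)
print(mismatches)  # ['Example Adapter B']
```

An empty result would mean both sources define the same sequences, which points the search at how the lists are used rather than at their contents.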

@edgardomortiz
Author

Hi again,
The reads are the same as for the previous test:
https://github.com/smithlabcode/falco/files/7127232/Anthopterus-racemosus_LV16228_R2.fq.gz
https://github.com/smithlabcode/falco/files/7127235/Anthopterus-racemosus_LV16228_R1.fq.gz

The FastQC command was:

fastqc --nogroup -o fastqc Anthopterus*

And its results are:
https://github.com/smithlabcode/falco/files/7127250/Anthopterus-racemosus_LV16228_R1_fastqc.zip
https://github.com/smithlabcode/falco/files/7127252/Anthopterus-racemosus_LV16228_R2_fastqc.zip

The Falco commands were (using the adapter list provided in this repository):

falco --nogroup -o falco_defaults Anthopterus*
falco --nogroup -o falco_a -a adapters.txt Anthopterus*

And the Falco results were:
falco_default.zip
falco_a.zip

@guilhermesena1
Collaborator

guilhermesena1 commented Sep 9, 2021

Thank you so much! I pushed a modification (f3f6f58) to the contaminant identification algorithm that allows partial overlap. In your test case, at least, it identifies the TruSeq contaminants correctly. I still have to test it more thoroughly to make sure I haven't broken anything. Will look into the adapter issue next.

@edgardomortiz
Author

Thanks again for the quick solution to these issues!

@guilhermesena1
Collaborator

My pleasure! Closing for now but feel free to reopen (I'll do that too) if any datasets do not match FastQC or the expected correct answer.
