Overrepresented Sequences shows "no hit" for all sequences #17
Comments
Hi, I'll address both of the issues you created here, if that's OK. Thank you for bringing this up. I think I didn't configure the Conda metadata correctly, so the default adapters and contaminants are not downloaded when falco is installed. I'll try to fix this in the next few days. In the meantime, and I apologize for the inconvenience, you can download the adapter and contaminant lists manually and provide them using the That being said, I'm puzzled why the contaminant results are not consistent with FastQC in your case. Would you be able to provide the first 40,000 lines of your input FASTQ file and the contaminants file you are using, so I can try to reproduce the issue and look into it? Thank you! |
Thanks! Unfortunately the file is in our lab, I will pass by tomorrow to retrieve it. I was checking the code and I understand the default adapter list is also present within the code, so theoretically The contaminant file is the one provided in Edgardo |
Here I attach the FASTQ files and the reports produced by each program, the contaminant list file was the one supplied in this repository, the commands were:
Anthopterus-racemosus_LV16228_R2.fq.gz |
Thank you so much for providing these reports! Regarding hits in the overrepresented sequences module, I think you just found a feature that needs to be incorporated into falco. I looked at the contaminant and the sequence that FastQC claims to be a contaminant and saw that this is how they overlap:
so the suffix of the sequence is the prefix of a contaminant. Currently falco only checks if the sequence is contained in the contaminant, so it does not account for these overlaps. This is useful, I will fix this in the upcoming release for sure! Regarding adapters, it looks like falco is checking for adapters, but it is simply not finding the Illumina universal adapter that FastQC finds. I'll have to look into this one in more detail. |
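To make the overlap case above concrete, here is a minimal Python sketch of a check that accepts both full containment and a suffix/prefix overlap. This is not falco's actual code (falco is not written in Python): the function name `overlap_length`, the `min_overlap` threshold, and the example sequences are all hypothetical and chosen only for illustration.

```python
def overlap_length(seq: str, contaminant: str, min_overlap: int = 8) -> int:
    """Return how many bases of seq match the contaminant, or 0 if none.

    Hypothetical illustration only. Three cases are considered:
      1. seq is fully contained in the contaminant (the original check);
      2. a suffix of seq equals a prefix of the contaminant;
      3. a prefix of seq equals a suffix of the contaminant.
    Partial overlaps shorter than min_overlap are ignored.
    """
    if min_overlap < 1:
        raise ValueError("min_overlap must be at least 1")

    # Case 1: full containment.
    if seq in contaminant:
        return len(seq)

    longest = min(len(seq), len(contaminant))
    best = 0
    # Try the longest candidate overlap first and stop at the first match.
    for k in range(longest, min_overlap - 1, -1):
        # Case 2: the end of the read runs into the start of the contaminant.
        # Case 3: the start of the read is the end of the contaminant.
        if seq[-k:] == contaminant[:k] or seq[:k] == contaminant[-k:]:
            best = k
            break
    return best


# Made-up sequences, not real adapters or reads.
contaminant = "ACGTACGTGGGG"
read = "TTTTACGTACGT"  # last 8 bases equal the first 8 of the contaminant
print(overlap_length(read, contaminant))
```

A containment-only check would report no hit for `read` here, because only its last 8 bases match the contaminant; the partial-overlap check reports a match of length 8.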
Cool, Falco does find similar amounts of adaptors as FastQC but only when specifying the adaptor list with |
I might need the actual reads to reproduce. You are right that falco has hard-coded adapter and contaminant list inside |
Hi again, The
And its results are: The
And the |
Thank you so much! I pushed a modification ( f3f6f58 ) of the contaminant identification algorithm that allows partial overlap. In your test case, at least, it identifies the TruSeq contaminants correctly. I still have to test it more thoroughly to make sure I haven't broken anything that previously worked. I will look into the adapter issue next. |
Thanks again for the quick solution to these issues! |
My pleasure! Closing for now but feel free to reopen (I'll do that too) if any datasets do not match FastQC or the expected correct answer. |
We compared a FastQC run and a falco 0.2.4 (from bioconda) run. The Overrepresented sequences table shows hit names such as "Truseq adaptor XX" for FastQC, while all overrepresented sequences are shown as "no hit" for falco.
The result is identical when adding --contaminants and the path to the contaminant list file (this file is not shipped with the conda installation).
Thanks
Edgardo