Salmon fails to match the transcript name between Gencode reference and annotation files #15
Comments
Hi @nicolasstransky --- thanks for reporting this. Now the question is, how should this be handled? I see at least 2 obvious possibilities:
Of course, there are also potentially other, better solutions; so I'm open to suggestions. The problem with 1 is that de-novo assemblers may have transcript names that are not unique up to the first
which is also accepted by the |
Fair points. There are potentially a lot of special cases, but since Gencode is widely used, it would be great to have a way to handle its format natively (i.e., consider |
This issue reminds me to ask: what is the best way to ingest a GTF plus a reference FASTA file and produce a transcript FASTA file ready for salmon indexing? I see that there may be some issues with using the Cufflinks gtf-to-fasta tool: https://groups.google.com/forum/#!msg/sailfish-users/oNVLlxJzgv4/nQYt9m4BBOcJ
@nicolasstransky --- Ok, so, while I'm generally reticent to adopt special cases, GenCode may warrant one. Or, a more general solution would be to allow the user to specify a list of "separator" characters while indexing (which defaults to |
@mdshw5, the best option I've found so far is actually rsem-prepare-reference. It's a bit slower than gtf-to-fasta but, so far, seems to do a better job of producing a usable transcriptome in the general case.
@rob-p Using a list of "separator" characters is a nice idea. I think that's the best solution so far. However, it would also be good if Gencode files worked "out of the box", since they are so commonly used.
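To make the "separator" idea concrete, here is a minimal Python sketch of what truncating a transcript name at the first separator character could look like. This is only an illustration, not how salmon implements it: the function names `truncate_name` and `duplicated_after_truncation` are hypothetical, and the separator set is passed in explicitly rather than assuming any default. The second function covers the concern raised earlier in the thread, namely that some names may stop being unique once truncated.

```python
from collections import Counter


def truncate_name(name: str, separators: str) -> str:
    """Return the part of `name` before the first separator character.

    `separators` is a string whose characters are each treated as a
    separator (hypothetical interface, for illustration only).
    """
    cut = len(name)
    for sep in separators:
        pos = name.find(sep)
        if pos != -1:
            cut = min(cut, pos)
    return name[:cut]


def duplicated_after_truncation(names, separators):
    """List truncated names that more than one input name collapses to."""
    counts = Counter(truncate_name(n, separators) for n in names)
    return sorted(k for k, c in counts.items() if c > 1)
```

Running `duplicated_after_truncation` over all FASTA headers before indexing would flag transcriptomes (for example from de-novo assemblers) where truncation is unsafe.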
Thanks, @rob-p. In the same vein, have you considered taking a GTF + FASTA for |
Addresses #15 Add --gencode option to salmon indexer
The gencode option behaves as described above, and is implemented as of commit d44df88, so it should make it into the next tagged release.
The transcript names in Gencode's reference sequence fasta files have the following format:
ENST00000257408.4|ENSG00000134962.6|OTTHUMG00000128577.1|OTTHUMT00000250429.1|KLB-001|KLB|6082|UTR5:1-97|CDS:98-3232|UTR3:3233-6082|
In the .gtf gene annotation files, only the transcript name appears:
ENST00000257408.4
As a consequence, salmon fails to match them and does not report the correct values in quant.genes.sf. Values in quant.sf seem to be correct though.
Nico
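The mismatch can be reproduced in a few lines of Python. This is only a sketch of the lookup problem, not salmon's code: the dictionary `transcript_to_gene` stands in for a transcript-to-gene map built from the GTF, and the gene ID is taken from the second field of the FASTA header above for illustration.

```python
# Full transcript name as it appears in the Gencode FASTA header.
fasta_name = (
    "ENST00000257408.4|ENSG00000134962.6|OTTHUMG00000128577.1|"
    "OTTHUMT00000250429.1|KLB-001|KLB|6082|UTR5:1-97|CDS:98-3232|UTR3:3233-6082|"
)

# Hypothetical transcript -> gene map, keyed the way the GTF names transcripts.
transcript_to_gene = {"ENST00000257408.4": "ENSG00000134962.6"}

# Looking up the full FASTA name finds nothing, so no gene is assigned.
exact = transcript_to_gene.get(fasta_name)

# Keeping only the text before the first "|" recovers the GTF-style name.
short = fasta_name.split("|", 1)[0]
fixed = transcript_to_gene.get(short)

print(exact, short, fixed)
```

With the full header the lookup returns `None`, while the truncated name resolves to its gene, which is consistent with gene-level output being wrong while transcript-level output is unaffected.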