Salmon fails to match the transcript name between Gencode reference and annotation files #15

nicolasstransky · 2015-09-30T16:30:47Z

The transcript names in Gencode's reference sequence fasta files have the following format:
ENST00000257408.4|ENSG00000134962.6|OTTHUMG00000128577.1|OTTHUMT00000250429.1|KLB-001|KLB|6082|UTR5:1-97|CDS:98-3232|UTR3:3233-6082|

In the .gtf gene annotation files, only the transcript name appears:
ENST00000257408.4

As a consequence, salmon fails to match them and does not report the correct values in quant.genes.sf. Values in quant.sf seem to be correct though.

Nico
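To make the mismatch concrete, here is a minimal Python sketch (not salmon's actual code). It assumes the index keeps the FASTA header up to the first whitespace character, as described in the reply below. The helper names `name_up_to_whitespace` and `name_up_to_pipe` are hypothetical and exist only for illustration.

```python
# Header and transcript id copied from the report above (without the leading '>').
fasta_header = (
    "ENST00000257408.4|ENSG00000134962.6|OTTHUMG00000128577.1|"
    "OTTHUMT00000250429.1|KLB-001|KLB|6082|UTR5:1-97|CDS:98-3232|UTR3:3233-6082|"
)
gtf_transcript_id = "ENST00000257408.4"


def name_up_to_whitespace(header: str) -> str:
    """Keep everything before the first whitespace character."""
    parts = header.split(None, 1)
    return parts[0] if parts else ""


def name_up_to_pipe(header: str) -> str:
    """Keep everything before the first whitespace character or '|'."""
    return name_up_to_whitespace(header).split("|", 1)[0]


# The Gencode header contains no whitespace, so the whole string is kept
# and it can never equal the bare transcript id used in the GTF.
whitespace_name_matches = name_up_to_whitespace(fasta_header) == gtf_transcript_id
# Cutting at the first '|' as well recovers the id that the GTF uses.
pipe_name_matches = name_up_to_pipe(fasta_header) == gtf_transcript_id
```

Under these assumptions the first comparison is false and the second is true, which would explain why the gene-level output is affected while the transcript-level output is not.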

rob-p · 2015-09-30T16:42:21Z

Hi @nicolasstransky --- thanks for reporting this. Now the question is, how should this be handled? I see at least 2 obvious possibilities:

1. Assume that the transcript name should be split at the first whitespace character or |. Currently, it is only split at the first whitespace.
2. If a gtf is provided for gene-level quantification, ensure that some non-trivial number of genes (e.g. more than half?) have at least 1 transcript in the index corresponding to them. If not, then complain.

Of course, there are also potentially other, better solutions; so I'm open to suggestions. The problem with 1 is that de-novo assemblers may have transcript names that are not unique up to the first |, so that the whole name needs to be taken into account. The problem with 2 is that it alerts the user of this potential issue, but doesn't resolve it. In the latter case, the user could provide the transcript-to-gene mapping using the provided transcript names in the "simple" format — i.e.

a simple tab-delimited format where each line contains the name of a transcript and the gene to which it belongs separated by a tab

which is also accepted by the --geneMap option. I sort of lean toward 2, but, as I said, am happy to consider other suggestions.
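As a rough illustration of that workaround, the sketch below builds the "simple" tab-delimited map keyed on the full FASTA names. It is only a sketch under stated assumptions: the regex handles only `key "value";` style GTF attributes, the function names are hypothetical, and the GTF line in the usage example is a toy record with made-up coordinates.

```python
import re

# Matches GTF attributes of the form: key "value";
ATTR_RE = re.compile(r'(\w+) "([^"]*)"')


def transcript_to_gene_from_gtf(gtf_lines):
    """Map transcript_id -> gene_id using the 9th column of each GTF record."""
    mapping = {}
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comments and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a complete GTF record
        attrs = dict(ATTR_RE.findall(fields[8]))
        if "transcript_id" in attrs and "gene_id" in attrs:
            mapping[attrs["transcript_id"]] = attrs["gene_id"]
    return mapping


def simple_gene_map_lines(fasta_names, tx2gene):
    """Return 'full_fasta_name<TAB>gene' lines for names whose prefix is in the GTF."""
    lines = []
    for name in fasta_names:
        gene = tx2gene.get(name.split("|", 1)[0])
        if gene is not None:
            lines.append(f"{name}\t{gene}")
    return lines


# Toy usage: one GTF record (coordinates invented) and one Gencode-style name.
toy_gtf = [
    'chr11\tHAVANA\ttranscript\t1\t100\t.\t+\t.\t'
    'gene_id "ENSG00000134962.6"; transcript_id "ENST00000257408.4";'
]
toy_names = ["ENST00000257408.4|ENSG00000134962.6|KLB-001|KLB|6082|"]
map_lines = simple_gene_map_lines(toy_names, transcript_to_gene_from_gtf(toy_gtf))
```

Each resulting line pairs the name exactly as it appears in the FASTA with its gene, so the file could be written out and supplied in place of the gtf.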

nicolasstransky · 2015-09-30T16:54:04Z

Fair points. There are potentially a lot of special cases but since Gencode is widely used, it would be great to have a way to handle its format natively (i.e., consider | in addition to a whitespace).
I think the problem with 2. is not a real problem because if you can't match transcript names in the gtf file that is provided, it's likely that there is a problem with the input.

mdshw5 · 2015-09-30T17:00:17Z

This issue reminds me to ask: what is the best way to ingest a GTF plus reference FASTA file and produce a transcript FASTA file ready for salmon indexing? I see that there may be some issues with using cufflinks gtf-to-fasta tool: https://groups.google.com/forum/#!msg/sailfish-users/oNVLlxJzgv4/nQYt9m4BBOcJ

rob-p · 2015-09-30T17:02:35Z

@nicolasstransky --- Ok, so, while I'm generally reluctant to adopt special cases, GenCode may warrant one. Or, a more general solution would be to allow the user to specify a list of "separator" characters while indexing (which defaults to \s+). I think that, so far, I actually like this option the best. Also, this isn't mutually exclusive with 2. The ideal thing would be to (1) allow arbitrary separators defined by the user and (2) warn the user if many genes seem to have no transcripts in the index.
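The combined idea could look roughly like the following Python sketch. It illustrates the logic only and is not salmon's implementation: `trim_name`, `check_gene_coverage`, and the `min_fraction` parameter are hypothetical names, and the 0.5 threshold simply mirrors the "more than half?" suggestion earlier in the thread.

```python
import string
import warnings


def trim_name(header, separators=string.whitespace):
    """Cut the header at the first character found in `separators`."""
    for i, ch in enumerate(header):
        if ch in separators:
            return header[:i]
    return header


def check_gene_coverage(tx2gene, indexed_names, min_fraction=0.5):
    """Return True if more than `min_fraction` of genes have an indexed transcript."""
    genes = set(tx2gene.values())
    if not genes:
        return True  # nothing to check
    covered = {gene for tx, gene in tx2gene.items() if tx in indexed_names}
    fraction = len(covered) / len(genes)
    if fraction <= min_fraction:
        warnings.warn(
            f"only {len(covered)} of {len(genes)} genes have a transcript in the "
            "index; transcript names may not match the annotation"
        )
        return False
    return True


# Toy usage: one Gencode-style header and one plain header with a description.
headers = ["ENST00000257408.4|ENSG00000134962.6|KLB-001|KLB|6082|", "tx2 some description"]
tx2gene = {"ENST00000257408.4": "ENSG00000134962.6", "tx2": "geneB"}

default_names = {trim_name(h) for h in headers}
pipe_names = {trim_name(h, string.whitespace + "|") for h in headers}

default_ok = check_gene_coverage(tx2gene, default_names)  # warns: 1 of 2 genes covered
pipe_ok = check_gene_coverage(tx2gene, pipe_names)        # both genes covered
```

Keeping the separator set as a parameter leaves the default behavior unchanged for de-novo assemblies whose names are only unique as a whole, while the coverage check flags the case where the chosen separators do not fit the input.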

rob-p · 2015-09-30T17:04:09Z

@mdshw5, the best option I've found so far is actually rsem-prepare-reference. It's a bit slower than gtf-to-fasta, but, so far, seems to do a better job producing a usable transcriptome in the general case.

nicolasstransky · 2015-09-30T17:19:35Z

@rob-p Using a list of "separator" characters is a nice idea. I think that's the best solution so far. However, it would also be good if Gencode files worked "out of the box", since they are so commonly used.

mdshw5 · 2015-09-30T19:33:25Z

Thanks, @rob-p. In the same vein, have you considered taking a GTF + FASTA for salmon index? It seems this might even solve @nicolasstransky's issue here.

Addresses #15 Add --gencode option to salmon indexer

rob-p · 2016-08-21T02:46:39Z

The gencode option behaves as described above, and is implemented as of commit d44df88, so it should make it into the next tagged release.

rob-p added the enhancement label Sep 30, 2015

rob-p self-assigned this Sep 30, 2015

rob-p closed this as completed Aug 21, 2016

olgabot mentioned this issue Jun 25, 2019

[Feature Addition: Salmon] nf-core/rnaseq#221

Merged

8 tasks
