convert_gff3_to_ncbi_tbl #54

bernt-matthias · 2018-02-13T10:13:12Z

Can someone tell me which assumptions convert_gff3_to_ncbi_tbl makes about the formatting of the names? Apparently ours are missing something:

python3 gff/convert_gff3_to_ncbi_tbl.py -i ../juncus.fasta.transdecoder.refined.sort.gff3 -o ../juncus.fasta.transdecoder.refined.sort.tbl -ln LAB -nap NAP -gf ../juncus.fasta 
Traceback (most recent call last):
  File "gff/convert_gff3_to_ncbi_tbl.py", line 89, in <module>
    main()
  File "gff/convert_gff3_to_ncbi_tbl.py", line 82, in main
    tbl.print_tbl_from_assemblies(assemblies=assemblies, ofh=ofh, go_obo=args.go_obo, lab_name=args.lab_name)
  File "/tmp/biocode/lib/biocode/tbl.py", line 95, in print_tbl_from_assemblies
    print_biogene(gene=gene, fh=ofh, obo_dict=go_idx, lab_name=lab_name)
  File "/tmp/biocode/lib/biocode/tbl.py", line 122, in print_biogene
    raise Exception("ERROR: locus_tag attributes are required for all gene elements (gene id: {0}".format(gene.id))
Exception: ERROR: locus_tag attributes are required for all gene elements (gene id: Transcript_32960|g.33387

ping @arsilan324


bernt-matthias · 2018-02-13T11:24:45Z

Just remembered add_gff3_locus_tags.py. But apparently some entries in the GFF file don't get a locus_tag. I'm using this command line:

python3 gff/add_gff3_locus_tags.py -i ../juncus.fasta.transdecoder.refined.sort.gff3 -o ../juncus.fasta.transdecoder.refined.sort.lt.gff3 -p PREFIX -a 10
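To see which entries were skipped, a small standalone check can list the gene features that still lack a locus_tag. This is only a sketch, not part of biocode: it assumes the tag is stored as a `locus_tag=...` key in column 9 of the gene rows, and the function name `genes_missing_locus_tag` is made up for illustration.

```python
def genes_missing_locus_tag(gff3_lines):
    """Return the IDs of gene features whose attributes lack a locus_tag key.

    Assumption: the locus tag lives in column 9 as locus_tag=VALUE.
    """
    missing = []
    for line in gff3_lines:
        line = line.rstrip("\n")
        if line.startswith("##FASTA"):
            break  # an embedded sequence section follows; no more features
        if not line or line.startswith("#"):
            continue  # skip blank lines, comments and directives
        cols = line.split("\t")
        if len(cols) != 9 or cols[2] != "gene":
            continue  # only gene rows are checked
        # Column 9 is a semicolon-separated list of key=value pairs
        attrs = dict(
            field.split("=", 1) for field in cols[8].split(";") if "=" in field
        )
        if "locus_tag" not in attrs:
            missing.append(attrs.get("ID", "<no ID>"))
    return missing
```

Passing an open file handle for the output of add_gff3_locus_tags.py as `gff3_lines` should return the gene IDs that were left untagged.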

jorvis · 2018-02-15T16:52:01Z

Home from my conference and travels. Will get to these tickets later today, just FYI.

jorvis · 2018-02-16T06:12:46Z

I have tracked down this issue. The problem is the library's treatment of genes with multiple isoforms. When it sees more than one mRNA for a particular gene, it currently spawns another gene and attaches the mRNA to that one, flattening the gene/mRNA relationships. I can find no justification for why this behavior was chosen (after about an hour spent tonight searching through fun archives of e-mails with NCBI staff from submitting eukaryotic genomes).
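For anyone who wants to see which genes are affected, the gene/mRNA relationships can be read straight from the GFF3 by grouping mRNA rows on their Parent attribute. This is a minimal sketch, independent of the biocode library; the function name `mrnas_per_gene` is hypothetical, and it assumes each mRNA row carries `ID` and `Parent` keys in column 9.

```python
from collections import defaultdict


def mrnas_per_gene(gff3_lines):
    """Map each parent gene ID to the list of mRNA IDs that point at it."""
    children = defaultdict(list)
    for line in gff3_lines:
        line = line.rstrip("\n")
        if line.startswith("##FASTA"):
            break  # an embedded sequence section follows; no more features
        if not line or line.startswith("#"):
            continue  # skip blank lines, comments and directives
        cols = line.split("\t")
        if len(cols) != 9 or cols[2] != "mRNA":
            continue  # only mRNA rows are grouped
        attrs = dict(
            field.split("=", 1) for field in cols[8].split(";") if "=" in field
        )
        # GFF3 allows a comma-separated list of parents
        for parent in attrs.get("Parent", "").split(","):
            if parent:
                children[parent].append(attrs.get("ID", "<no ID>"))
    return dict(children)
```

Genes with more than one isoform are then the keys whose list has more than one entry, for example `{g: m for g, m in mrnas_per_gene(fh).items() if len(m) > 1}`.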

Your file has 95,646 genes and 120,335 mRNAs, so multiple isoforms are common. What was a little surprising was that the mRNA, CDS and exon counts are all 120,335. At first I thought it strange that all your genes were single-exon genes, then realized the source (transdecoder) implied these were from Trinity. So you're doing this in preparation for running tbl2asn for a transcriptome submission.
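Counts like these can be checked by tallying the feature type in column 3 of the GFF3. The snippet below is a rough sketch rather than the script used to produce the numbers above; the function name `count_feature_types` is an assumption for illustration.

```python
from collections import Counter


def count_feature_types(gff3_lines):
    """Tally how often each feature type (GFF3 column 3) occurs."""
    counts = Counter()
    for line in gff3_lines:
        line = line.rstrip("\n")
        if line.startswith("##FASTA"):
            break  # an embedded sequence section follows; no more features
        if not line or line.startswith("#"):
            continue  # skip blank lines, comments and directives
        cols = line.split("\t")
        if len(cols) == 9:
            counts[cols[2]] += 1
    return counts
```

Called on an open file handle, it returns a `Counter` keyed by feature type, so `counts["gene"]` and `counts["mRNA"]` can be compared directly.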

I'll fix this so that proper gene representation is done when more than one mRNA is present. If you haven't already, it would be good to review the submission guidelines to see if there are any transcriptome-specific format details. I'll be happy to add any you uncover.

bernt-matthias · 2018-02-16T08:37:04Z

Wonderful. Please send me a ping here, then I can try.

I guess @arsilan324 can say whether the counts of genes, mRNAs, CDSs, and exons are reasonable.

arsilan324 · 2018-02-16T14:19:34Z

According to Brian Haas (Transdecoder developer): In the data model of transdecoder, each CDS (and corresponding exon) is tied to its own mRNA, and a single gene is allowed to produce multiple mRNAs. It doesn't allow for the single-mRNA, multi-CDS arrangement (i.e. it doesn't do operons).
