convert_gff3_to_ncbi_tbl #54

Open
bernt-matthias opened this issue Feb 13, 2018 · 5 comments

@bernt-matthias

Can someone tell me which assumptions convert_gff3_to_ncbi_tbl makes about the formatting of the names? Apparently ours are missing something:

python3 gff/convert_gff3_to_ncbi_tbl.py -i ../juncus.fasta.transdecoder.refined.sort.gff3 -o ../juncus.fasta.transdecoder.refined.sort.tbl -ln LAB -nap NAP -gf ../juncus.fasta 
Traceback (most recent call last):
  File "gff/convert_gff3_to_ncbi_tbl.py", line 89, in <module>
    main()
  File "gff/convert_gff3_to_ncbi_tbl.py", line 82, in main
    tbl.print_tbl_from_assemblies(assemblies=assemblies, ofh=ofh, go_obo=args.go_obo, lab_name=args.lab_name)
  File "/tmp/biocode/lib/biocode/tbl.py", line 95, in print_tbl_from_assemblies
    print_biogene(gene=gene, fh=ofh, obo_dict=go_idx, lab_name=lab_name)
  File "/tmp/biocode/lib/biocode/tbl.py", line 122, in print_biogene
    raise Exception("ERROR: locus_tag attributes are required for all gene elements (gene id: {0}".format(gene.id))
Exception: ERROR: locus_tag attributes are required for all gene elements (gene id: Transcript_32960|g.33387

ping @arsilan324

@bernt-matthias
Author

Just remembered add_gff3_locus_tags.py. But apparently some entries in the GFF file don't get a locus_tag. I'm using this command line:

python3 gff/add_gff3_locus_tags.py -i ../juncus.fasta.transdecoder.refined.sort.gff3 -o ../juncus.fasta.transdecoder.refined.sort.lt.gff3 -p PREFIX -a 10
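
One way to see which entries were skipped is to scan the output file for gene features that still lack the attribute. The following is a minimal sketch, not part of biocode: it assumes the tag is stored as a `locus_tag` key in column 9 of each gene row, and the function names and the two-gene example are hypothetical.

```python
import io

def parse_attributes(column9):
    """Split a GFF3 attribute column into a dict of key -> value."""
    attrs = {}
    for pair in column9.strip().split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs

def genes_without_locus_tag(handle):
    """Return the IDs of gene features that carry no locus_tag attribute."""
    missing = []
    for line in handle:
        if line.startswith("##FASTA"):
            break  # the embedded sequence section holds no features
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "gene":
            continue
        attrs = parse_attributes(cols[8])
        if "locus_tag" not in attrs:
            missing.append(attrs.get("ID", "<no ID>"))
    return missing

# Hypothetical two-gene example; real input would be the .lt.gff3 file
# opened with open(path).
example = io.StringIO(
    "##gff-version 3\n"
    "contig1\ttransdecoder\tgene\t1\t900\t.\t+\t.\tID=geneA;locus_tag=PREFIX_000010\n"
    "contig1\ttransdecoder\tgene\t1000\t1900\t.\t+\t.\tID=geneB\n"
)
missing = genes_without_locus_tag(example)
print(missing)
```

Comparing the reported IDs with the gene ID in the traceback should show whether the failing gene is one that never received a tag.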

@jorvis
Owner

jorvis commented Feb 15, 2018

Home from my conference and travels. Will get to these tickets later today, just FYI.

@jorvis
Owner

jorvis commented Feb 16, 2018

I have tracked down this issue. The problem is with the library's treatment of genes with multiple isoforms. When it sees more than one mRNA for a particular gene, it currently spawns another gene and attaches the mRNA to that one, flattening out the gene/mRNA relationships. I can find no justification for this behavior, even after about an hour tonight searching through fun archives of e-mails exchanged with NCBI staff when submitting eukaryotic genomes.
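
To illustrate the relationship that should be preserved, here is a minimal sketch that groups mRNA features under their parent gene via the GFF3 `Parent` attribute. It is not the library's code; the helper names and the example IDs (`g1`, `g1.m1`, and so on) are hypothetical.

```python
from collections import defaultdict

def parse_attributes(column9):
    """Split a GFF3 attribute column into a dict of key -> value."""
    attrs = {}
    for pair in column9.strip().split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs

def mrnas_by_gene(lines):
    """Map each parent ID to the list of mRNA IDs that point at it."""
    children = defaultdict(list)
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "mRNA":
            continue
        attrs = parse_attributes(cols[8])
        # GFF3 allows a comma-separated list of parents.
        for parent in attrs.get("Parent", "").split(","):
            if parent:
                children[parent].append(attrs.get("ID", "<no ID>"))
    return dict(children)

# Hypothetical example: g1 has two isoforms, g2 has one.
example_lines = [
    "contig1\ttransdecoder\tgene\t1\t900\t.\t+\t.\tID=g1",
    "contig1\ttransdecoder\tmRNA\t1\t900\t.\t+\t.\tID=g1.m1;Parent=g1",
    "contig1\ttransdecoder\tmRNA\t1\t600\t.\t+\t.\tID=g1.m2;Parent=g1",
    "contig1\ttransdecoder\tgene\t1000\t1900\t.\t+\t.\tID=g2",
    "contig1\ttransdecoder\tmRNA\t1000\t1900\t.\t+\t.\tID=g2.m1;Parent=g2",
]
isoforms = mrnas_by_gene(example_lines)
multi = {gene: ids for gene, ids in isoforms.items() if len(ids) > 1}
print(multi)
```

In the flattened representation, `g1.m2` would end up under a newly spawned gene rather than alongside `g1.m1` under `g1`.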

Your file has 95,646 genes and 120,335 mRNAs, so multiple isoforms are common. What was a little surprising was that the mRNA, CDS and exon counts are all 120,335. At first I thought it strange that all your genes were single-exon genes, then realized the source (transdecoder) implied these were from Trinity. So you're doing this in preparation for running tbl2asn for a transcriptome submission.
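
Counts like these can be reproduced by tallying column 3 of the GFF3 file. A minimal sketch follows; the function name and the small example are hypothetical, and a real run would pass an open file handle instead of the list.

```python
from collections import Counter

def count_feature_types(lines):
    """Tally GFF3 feature types (column 3), skipping comments and FASTA."""
    counts = Counter()
    for line in lines:
        if line.startswith("##FASTA"):
            break  # everything after this directive is sequence data
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) >= 3:
            counts[cols[2]] += 1
    return counts

# Hypothetical example: one gene with two isoforms, each isoform
# carrying exactly one exon and one CDS.
example_lines = [
    "##gff-version 3",
    "contig1\ttransdecoder\tgene\t1\t900\t.\t+\t.\tID=g1",
    "contig1\ttransdecoder\tmRNA\t1\t900\t.\t+\t.\tID=g1.m1;Parent=g1",
    "contig1\ttransdecoder\texon\t1\t900\t.\t+\t.\tID=g1.m1.e1;Parent=g1.m1",
    "contig1\ttransdecoder\tCDS\t1\t900\t.\t+\t0\tID=g1.m1.c1;Parent=g1.m1",
    "contig1\ttransdecoder\tmRNA\t1\t600\t.\t+\t.\tID=g1.m2;Parent=g1",
    "contig1\ttransdecoder\texon\t1\t600\t.\t+\t.\tID=g1.m2.e1;Parent=g1.m2",
    "contig1\ttransdecoder\tCDS\t1\t600\t.\t+\t0\tID=g1.m2.c1;Parent=g1.m2",
]
counts = count_feature_types(example_lines)
for feature_type, n in sorted(counts.items()):
    print(feature_type, n)
```

Equal mRNA, exon and CDS totals alongside a smaller gene total is the same pattern seen in the file discussed here.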

I'll fix this so that proper gene representation is done when more than one mRNA is present. If you haven't already, it would be good to review the submission guidelines to see if there are any transcriptome-specific format details. I'll be happy to add any you uncover.

@bernt-matthias
Author

Wonderful. Please send me a ping here, then I can try.

I guess @arsilan324 can say whether the counts of genes, mRNAs, CDSs, and exons are reasonable.

@arsilan324

According to Brian Haas (Transdecoder developer): In the data model of transdecoder, each CDS (and corresponding exon) is tied to its own mRNA, and a single gene is allowed to produce multiple mRNAs. It doesn't allow for the single-mRNA, multi-CDS arrangement (i.e. it doesn't do operons).
