scripts/convert_refseq_to_prokka_gff.py produces only 1 chromosome output #147

Open
martinastoycheva opened this issue Mar 17, 2022 · 4 comments

Comments

@martinastoycheva

Hello,

I have a RefSeq GFF that contains two chromosomes, which I wanted to use in the Panaroo pipeline. I tried using the script to convert it to a Prokka GFF, but I get only the second chromosome in the output GFF. Is this intended behaviour?

Cheers,
Martina

@gtonkinhill
Owner

Hi Martina,

Sorry for the slow reply. This is not intended behaviour. Without looking at the GFF file it is a bit challenging to work out what might be going wrong. Is it possible you could send me a small example that reproduces the problem?

@martinastoycheva
Author

Hello,

Thanks for your reply! I have solved the issue by providing the GFF and FASTA separately.

@fwhelan

fwhelan commented Jul 2, 2024

Hi gtonkinhill,

I have had something similar happen to me with a new dataset. Of the 441 input genomes, 146 are missing >=1 chromosome after using convert_refseq_to_prokka_gff.py. I can't give you a reproducible example at the moment, but there doesn't seem to be any pattern in which chromosomes are dropped: not by order (e.g. the last chromosome being omitted), length, or content. No error message is output when this occurs.
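For anyone wanting to check their own files, a minimal sketch of how the missing sequences could be listed is below. It is not part of Panaroo; the function names `gff_seqids` and `missing_seqids` are hypothetical, and it assumes tab-separated GFF3 feature lines with an optional `##FASTA` section at the end.

```python
def gff_seqids(gff_text):
    """Collect sequence IDs from the annotation part of a GFF3 file."""
    ids = set()
    for line in gff_text.splitlines():
        # Everything after ##FASTA is sequence data, not annotation.
        if line.startswith("##FASTA"):
            break
        if line.startswith("##sequence-region"):
            parts = line.split()
            if len(parts) > 1:
                ids.add(parts[1])
        elif line and not line.startswith("#"):
            # Column 1 of a feature line is the sequence ID.
            ids.add(line.split("\t")[0])
    return ids


def missing_seqids(original_text, converted_text):
    """Sequence IDs present in the original GFF but absent after conversion."""
    return sorted(gff_seqids(original_text) - gff_seqids(converted_text))
```

Reading each input GFF and its converted counterpart as text and passing both to `missing_seqids` gives the list of sequences dropped for that genome.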

Thank you,
Fiona

@gtonkinhill
Owner

Hi Fiona,

The convert_refseq_to_prokka_gff script can be pretty strict in throwing out annotations that don't fit within the expected output of Prokka. My guess is that this might be causing the issue. Unfortunately, it doesn't currently print which genes it's ignoring, but essentially it will ignore:

  • Genes that have a premature stop codon
  • Genes whose length is less than 34 nt or is not a multiple of 3
  • Anything that is not classed as a 'CDS'
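As a rough illustration only, the three rules above could be sketched like this. This is not the script's actual code: the names `is_kept` and `STOP_CODONS` are hypothetical, and the sketch assumes the standard stop codons and a sequence already given on the coding strand.

```python
# Assumption: standard stop codons; the real script may use a different table.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def is_kept(feature_type, nt_seq):
    """Return (keep, reason) for one annotated feature."""
    if feature_type != "CDS":
        return False, "not a CDS"
    if len(nt_seq) < 34:
        return False, "shorter than 34 nt"
    if len(nt_seq) % 3 != 0:
        return False, "length not a multiple of 3"
    seq = nt_seq.upper()
    codons = [seq[i:i + 3] for i in range(0, len(seq), 3)]
    # A stop codon anywhere before the final codon counts as premature.
    if any(codon in STOP_CODONS for codon in codons[:-1]):
        return False, "premature stop codon"
    return True, "ok"
```

Returning the reason alongside the decision makes it easy to log why each feature was dropped, which is the information the script does not currently print.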

As an alternative, you should be able to run Panaroo directly with the --remove-invalid-genes option, which is what I would recommend. It should then print which genes are being ignored.

If this doesn't fix things, let me know and I'll see if I can work out what's going on.

Cheers,

Gerry

@gtonkinhill gtonkinhill reopened this Jul 3, 2024