Midas DB compatibility #12

Open
CreatorOfMoon opened this issue Jun 17, 2021 · 7 comments

Comments

@CreatorOfMoon

Hi and thank you for your software,

I'm trying to make your Kalamari Database match the requirement for https://github.com/snayfach/MIDAS/blob/master/docs/build_db.md

It would be nice if the pipeline also downloaded:
"<genome_id>.faa": the protein sequences in FASTA format

"<genome_id>.ffn": the gene sequences in FASTA format

"<genome_id>.genes": a tab-delimited file with the genomic coordinates of genes. It should have a header and the fields listed in the MIDAS documentation linked above.

I don't know whether those are available through esearch:

my $command = "esearch -db nuccore -query '$acc' | efetch -format fasta > $outfile.tmp";

But it would be a nice addition to your pipeline :)

thank you !

@lskatz
Owner

lskatz commented Jun 17, 2021

It might be helpful to know what the edirect commands would be for these. I don't think I have this exactly right. Do I need to have a step through elink -db assembly? Alternatively, would it be helpful to simply run each assembly through prokka instead and just get all these files through consistent annotation? Thank you for your feedback.

esearch -db nuccore -query '$acc' | elink -target protein | efetch -format fasta > $outfile.tmp
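As a rough sketch of the question above: efetch can return CDS-level FASTA directly from a nuccore record through its `fasta_cds_aa` and `fasta_cds_na` formats, which may avoid the elink step. The mapping of those formats to the MIDAS extensions, the helper name `build_command`, and the example accession are assumptions for illustration, not necessarily what the more-formats branch does. The `.genes` table is left out because it would have to be derived from annotation rather than fetched as-is.

```python
import shlex

# efetch output formats for nuccore records. The keys are the MIDAS file
# extensions from this issue; the pairing is an assumption, not Kalamari code.
EFETCH_FORMATS = {
    "fna": "fasta",         # nucleotide sequence of the record itself
    "faa": "fasta_cds_aa",  # protein translations of the annotated CDS features
    "ffn": "fasta_cds_na",  # nucleotide sequences of the annotated CDS features
}


def build_command(acc: str, ext: str, outfile: str) -> str:
    """Return an edirect pipeline, as a shell string, for one accession."""
    fmt = EFETCH_FORMATS[ext]
    return (
        f"esearch -db nuccore -query {shlex.quote(acc)}"
        f" | efetch -format {fmt} > {shlex.quote(outfile)}"
    )


if __name__ == "__main__":
    # Only prints the commands; nothing is sent to NCBI.
    for ext in EFETCH_FORMATS:
        print(build_command("NC_000913.3", ext, f"NC_000913.3.{ext}.tmp"))
```

The commands are only built and printed here, so the sketch can be checked without network access before wiring it into a real download loop.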

@lskatz
Owner

lskatz commented Jun 18, 2021

I think I figured it out in branch more-formats. Can you try it out?

@CreatorOfMoon
Author

So, I tested it and it works like a charm. For MIDAS it looks perfect to work with; I'll probably include a script I've been working on to format the data correctly.

I must say, however, that downloading the faa, ffn and genes files slows the download down a lot (more than 2 hours to download v3.9 here, compared to only 20 minutes with fna alone).
It might be worth adding an argument that lets the user choose whether or not the other files are downloaded.

@lskatz
Owner

lskatz commented Jun 28, 2021

It might be worth adding an argument that lets the user choose whether or not the other files are downloaded.

Good point!

For MIDAS it looks perfect to work with; I'll probably include a script I've been working on to format the data correctly.

The data should be formatted correctly from the start with the Kalamari script. Could you let me know the right way to format it for Midas?

@lskatz
Owner

lskatz commented Jun 28, 2021

I added an option to download the optional files in a0ac20a..522c9be; they are requested with --and
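For readers unfamiliar with this kind of opt-in flag, here is a minimal sketch of a repeatable `--and` option that gates the slower downloads. It is written in Python with argparse purely for illustration; the accepted values (`faa`, `ffn`, `genes`) and the function names are hypothetical and do not describe how the Kalamari Perl script parses its arguments.

```python
import argparse

# Extra file types a user may opt into. These names are illustrative only.
OPTIONAL_FORMATS = ("faa", "ffn", "genes")


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Download sketch")
    parser.add_argument(
        "--and",
        dest="extra",  # "and" is a Python keyword, so store under another name
        action="append",
        default=[],
        choices=OPTIONAL_FORMATS,
        help="additional file type to download; may be given more than once",
    )
    return parser.parse_args(argv)


def formats_to_download(extra):
    # fna is always fetched; extras are added once each, in the order requested.
    wanted = ["fna"]
    for fmt in extra:
        if fmt not in wanted:
            wanted.append(fmt)
    return wanted


if __name__ == "__main__":
    print(formats_to_download(parse_args().extra))
```

With no flag the sketch keeps the fast fna-only behaviour, so the longer download is only paid for by users who ask for the extra files.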

@CreatorOfMoon
Author

The data should be formatted correctly from the start with the Kalamari script. Could you let me know the right way to format it for Midas?

Well, I don't know whether formatting it the way MIDAS expects would break your compatibility with Kraken, for example.

Here is what you have to do:

Create a mapfile with 3 columns:
genome_id (CHAR): corresponds to a subdirectory within INDIR
species_id (CHAR): species identifier for genome_id
rep_genome (0 or 1): indicator of whether genome_id should be used for SNP calling

Then name each folder and file after its genome_id:

<genome_id>
|
|- <genome_id>.fna
|- <genome_id>.faa
|- <genome_id>.ffn
|- <genome_id>.genes

And this should work.
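The steps above can be sketched as a small helper that copies already-downloaded files into the per-genome layout and writes the mapfile. This is only a sketch under assumptions: the function name `build_midas_input`, the flat source naming `<src_dir>/<genome_id>.<ext>`, and the tab-delimited mapfile with a header row are all hypothetical choices, so check them against the MIDAS build_db documentation before relying on them.

```python
import csv
import shutil
from pathlib import Path

EXTENSIONS = ("fna", "faa", "ffn", "genes")


def build_midas_input(genomes, src_dir, indir, mapfile):
    """Lay out genomes under indir and write the 3-column mapfile.

    genomes: iterable of (genome_id, species_id, rep_genome) tuples.
    """
    src_dir, indir = Path(src_dir), Path(indir)
    with open(mapfile, "w", newline="") as handle:
        # Assumption: tab-delimited with a header row.
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerow(["genome_id", "species_id", "rep_genome"])
        for genome_id, species_id, rep_genome in genomes:
            genome_dir = indir / genome_id
            genome_dir.mkdir(parents=True, exist_ok=True)
            for ext in EXTENSIONS:
                # Hypothetical source naming: <src_dir>/<genome_id>.<ext>
                source = src_dir / f"{genome_id}.{ext}"
                shutil.copy(source, genome_dir / f"{genome_id}.{ext}")
            writer.writerow([genome_id, species_id, int(rep_genome)])
```

Because the files are copied into a separate INDIR rather than renamed in place, the original download stays untouched, which is one way to keep the Kraken-oriented layout usable alongside the MIDAS one.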

@lskatz
Owner

lskatz commented Jul 22, 2021

I think I'll leave this open for now but it would be interesting to come back to.
