Building custom Database #14

Open
Natalie-koks opened this issue Jul 10, 2024 · 7 comments

@Natalie-koks

Hello,
I would like to know whether I can create my own database with fungi included.
Also, I am using RefSeq224 data files, which are *.fna.gz. Will the command pick those up instead of the fa.gz files in your example?
Thanks

@bluenote-1577
Owner

Hi @Natalie-koks,

Yes, *.fna.gz files also work. You can simply run sylph sketch *.fna.gz to create your database.

Sylph doesn't support taxonomic annotations for fungi yet. See https://github.com/bluenote-1577/sylph/wiki/Integrating-taxonomic-information-with-sylph#metaphlan-like-or-cami-like-outputs for how you could build your own metadata for creating a taxonomic profile.

@Natalie-koks
Author

Thank you very much.

@bluenote-1577
Owner

@Natalie-koks I've recently updated sylph to include taxonomic annotations for fungi and a new fungi database. See https://github.com/bluenote-1577/sylph/wiki/Pre%E2%80%90built-databases#eukaryotic-databases and https://github.com/bluenote-1577/sylph-utils

@humbleflowers

Hello @bluenote-1577

I created a custom database with around 20K reference FASTA files from NCBI.

I want to create a taxonomic profile using the sylph_to_taxprof.py script, but it needs a metadata TSV file.

Do you have a script to create metadata for a custom database using the taxonomy dump from NCBI?

@bluenote-1577
Owner

Hi @humbleflowers,

Take a look at https://github.com/ctb/2022-assembly-summary-to-lineages by Titus Brown.

From this snakemake workflow, here's what I did (I may write a tutorial on this later):

  1. Get the necessary assembly_summary_refseq.txt or assembly_summary_genbank.txt file from https://ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS/

  2. Use the snakemake repo I linked. For each genome you care about, get the corresponding line in the assembly summary file and use those lines as input to the Snakemake file (you'll have to call the file example.assembly_summary.txt or something like that).

  3. The Snakemake workflow produces an 'example.lineages' file.

  4. Edit the following script to convert example.lineages to a metadata file:

import csv

def format_taxonomy(input_file, output_file):
    """Convert a lineages CSV into a two-column metadata TSV: genome ID, then lineage."""
    with open(input_file, 'r', newline='') as infile, open(output_file, 'w') as outfile:
        reader = csv.reader(infile)
        # Skip the header line
        next(reader, None)

        for fields in reader:
            # Skip blank or truncated rows that lack the columns used below
            if len(fields) < 9:
                continue

            # Genome identifier built from the first two columns
            ident = fields[0] + '_' + fields[1]
            superkingdom = f"d__{fields[2]}"
            phylum = f"p__{fields[3]}"
            class_ = f"c__{fields[4]}"
            order = f"o__{fields[5]}"
            family = f"f__{fields[6]}"
            genus = f"g__{fields[7]}"
            species = f"s__{fields[8]}"

            # The strain column (if present) is not written to the metadata file
            formatted_line = f"{ident}\t{superkingdom};{phylum};{class_};{order};{family};{genus};{species}"
            outfile.write(formatted_line + '\n')

# Example usage
input_file = './example.lineages.csv'
output_file = 'fungi_refseq_2024-07-25_metadata.tsv'
format_taxonomy(input_file, output_file)
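
For step 2 above, picking out the assembly summary lines for your genomes can also be scripted. The following is only a rough sketch, not part of sylph or of the linked Snakemake workflow. It assumes that your genome file names contain the assembly accession (for example GCF_000146045.2_R64_genomic.fna.gz) and it relies on the assembly summary file being tab-separated, with the accession in the first column and header lines starting with '#'. The function names are made up for illustration.

```python
import re

# Matches GenBank/RefSeq assembly accessions such as GCF_000146045.2
ACCESSION_RE = re.compile(r"GC[AF]_\d{9}\.\d+")

def accessions_from_filenames(filenames):
    """Collect the assembly accessions embedded in genome file names.

    Assumption: each file name contains its accession; names without one are ignored.
    """
    found = set()
    for name in filenames:
        match = ACCESSION_RE.search(name)
        if match:
            found.add(match.group(0))
    return found

def filter_assembly_summary(summary_lines, accessions):
    """Keep header lines (starting with '#') and rows whose first column is a wanted accession."""
    kept = []
    for line in summary_lines:
        if line.startswith("#") or line.split("\t", 1)[0] in accessions:
            kept.append(line)
    return kept

def write_filtered_summary(summary_path, genome_filenames, out_path):
    """Read an assembly summary file and write only the rows for the given genomes."""
    wanted = accessions_from_filenames(genome_filenames)
    with open(summary_path) as infile:
        kept = filter_assembly_summary(infile, wanted)
    with open(out_path, "w") as outfile:
        outfile.writelines(kept)
```

You would call write_filtered_summary with the downloaded summary file, the list of genome files that went into your database, and the output name the Snakemake file expects (example.assembly_summary.txt or similar).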

@bluenote-1577 reopened this issue on Aug 27, 2024

@ykm7788

ykm7788 commented Oct 14, 2024

Hello @bluenote-1577
Based on this issue, I'm wondering whether I can combine my custom database with the pre-built one, along with the taxonomy file?
Cheers

@bluenote-1577
Owner

@ykm7788

Yes, you can combine them. For the taxonomy files, you can pass multiple metadata files to sylph_to_taxprof.py; they will all be concatenated.

You can also input multiple database sketch files (.syldb). They will be concatenated.
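
Before combining metadata files, it may be worth checking that the same genome ID does not appear in both with different lineages. The sketch below is a hypothetical pre-check, not how sylph_to_taxprof.py works internally. It assumes each metadata file is a plain two-column TSV (genome ID, then lineage), like the output of the script earlier in this thread; the function name is made up for illustration.

```python
def merge_metadata(sources):
    """Merge two-column metadata TSVs (genome ID, lineage) into one dict.

    `sources` is an iterable of line iterables, for example open file handles.
    Returns (merged, conflicts), where `conflicts` lists genome IDs that appear
    with more than one lineage. The first lineage seen for an ID is kept.
    """
    merged = {}
    conflicts = []
    for lines in sources:
        for line in lines:
            line = line.rstrip("\n")
            if not line:
                continue  # skip blank lines
            ident, _, lineage = line.partition("\t")
            if ident in merged and merged[ident] != lineage:
                conflicts.append(ident)
                continue
            merged[ident] = lineage
    return merged, conflicts
```

To use it, open your custom metadata file and the pre-built one, pass both handles as a list, and inspect the returned conflicts before running the profile.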
