Question about de novo multi-species Nocardiopsis database #339

Open
ERBringHorvath opened this issue Dec 12, 2024 · 2 comments

@ERBringHorvath

I was hoping to ask a question about a draft Nocardiopsis spp. model I've built using PopPUNK; I work in the Natural Products space, primarily with phylum Actinomycetota—a group of bacteria that can be incredibly genetically diverse even at the species level. I am working with a novel Nocardiopsis species and was using PopPUNK to try to determine its taxonomic position relative to Nocardiopsis genomes available via NCBI. Because my strain is a new species (confirmed chemotaxonomically), I need a tool that would be appropriate for mixed-species taxonomic characterization. I started with PopPUNK because the 2019 publication highlighted its utility in classifying multi-species cohorts of bacterial pathogens, and also because I'd like to use core + accessory genomes to build a phylogeny. My question is this—is PopPUNK appropriate for a diverse genus like Nocardiopsis, given the following results?

I've attached all files generated from the database creation here as a .tar.gz file.

Reference Nocardiopsis genomes were downloaded from NCBI using Datasets. 157 genomes (either complete or draft) were used.

The following workflow was used to generate this first database:
Sketch:
poppunk --create-db --output nocardiopsis_ref --r-files ref.txt --min-k 13 --max-k 29
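For anyone reproducing the sketching step: as far as I know, the file given to --r-files (ref.txt above) is a tab-separated list with one sample name and one assembly path per line. Below is a minimal sketch for generating such a file. The directory layout, the `*.fna` extension, the `build_rfile` helper, and the use of the file stem as the sample name are all assumptions for illustration, not details from the original post.

```python
from pathlib import Path


def build_rfile(genome_dir, out_path, pattern="**/*.fna"):
    """Write a tab-separated 'name<TAB>path' list for poppunk --r-files.

    Assumptions (hypothetical, not from the original post): assemblies are
    FASTA files matching `pattern` somewhere under `genome_dir`, and the
    file stem is an acceptable, unique sample name.
    """
    genome_dir = Path(genome_dir)
    rows = []
    seen = set()
    for fasta in sorted(genome_dir.glob(pattern)):
        name = fasta.stem
        if name in seen:
            # Duplicate names would make downstream results ambiguous.
            raise ValueError(f"duplicate sample name: {name}")
        seen.add(name)
        rows.append(f"{name}\t{fasta.resolve()}")
    Path(out_path).write_text("\n".join(rows) + "\n")
    return len(rows)
```

Calling `build_rfile("genomes", "ref.txt")` returns the number of assemblies written, which is a quick check that all 157 downloads were picked up.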

QC database:
poppunk --qc-db --ref-db nocardiopsis_ref --qc-keep

This resulted in 128 files failing QC for various reasons, mostly on the distance criteria.

Fit model:
poppunk --fit-model bgmm --ref-db nocardiopsis_ref --output nocardiopsis_ref

Fit summary:
Avg. entropy of assignment 0.0160
Number of components used 2

Scaled component means:
[0.69171908 0.07866528]
[0.36643846 0.3914161 ]

Network summary:
Components 1
Density 0.1921
Transitivity 0.5733
Mean betweenness 0.1924
Weighted-mean betweenness 0.1924
Score 0.4631
Score (w/ betweenness) 0.3740
Score (w/ weighted-betweenness) 0.3740
Removing 136 sequences
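As a reading aid for the network summary: my understanding of the PopPUNK documentation is that the score is transitivity × (1 − density), and the betweenness variants multiply that by (1 − betweenness). The short check below reproduces the reported values to within rounding; the function name is hypothetical and the formula is stated here as an assumption rather than taken from the post.

```python
def network_score(transitivity, density, betweenness=None):
    """Score = transitivity * (1 - density), optionally * (1 - betweenness)."""
    score = transitivity * (1.0 - density)
    if betweenness is not None:
        score *= 1.0 - betweenness
    return score


# Values copied from the network summary above (already rounded in the log).
score = network_score(0.5733, 0.1921)
score_btw = network_score(0.5733, 0.1921, betweenness=0.1924)
print(f"{score:.3f} {score_btw:.3f}")  # prints 0.463 0.374
```

Note also that "Components 1" means the fitted network is a single connected component, so every retained genome ends up in the same cluster.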

Thanks!

nocardiopsis_ref.tar.gz

@ERBringHorvath ERBringHorvath added the enhancement New feature or request label Dec 12, 2024
@johnlees
Member

Some thoughts:

  • Generally PopPUNK works with within-species datasets, rather than across a genus. Some of the defaults may not work well at this larger range.
  • The QC seems to be removing a lot of your sequences, so I would also check that the criteria being applied are appropriate here.
  • The distances are the most important thing for you to look at. Presently, they don't look like what you'd expect.
  • I'd suggest reviewing the k-mer range, and looking at the k-mer regression plots to ensure you are getting a good fit. See https://poppunk.bacpop.org/sketching.html#choosing-the-right-k-mer-lengths
  • An alternative starting point could be to use sketchlib.rust to find ANI between your samples and try clustering at 95%.
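To make the k-mer regression point concrete: the PopPUNK model, as I understand it, treats the probability that a k-mer of length k matches between two genomes as (1 − a)(1 − π)^k, where π is the core distance and a the accessory distance. Taking logs gives a straight line in k, which is what the regression plots show. The sketch below is a simplified illustration of that fit only; the real implementation in pp-sketchlib differs in detail (for example, it corrects for random matches), and the function name is hypothetical.

```python
import math


def core_accessory_from_matches(ks, match_probs):
    """Fit log(p_k) = log(1 - a) + k * log(1 - pi) by ordinary least squares.

    ks          : k-mer lengths, e.g. 13, 17, 21, 25, 29
    match_probs : proportion of matching k-mers at each length (all > 0)
    Returns (core distance pi, accessory distance a).
    """
    ys = [math.log(p) for p in match_probs]
    n = len(ks)
    k_mean = sum(ks) / n
    y_mean = sum(ys) / n
    sxx = sum((k - k_mean) ** 2 for k in ks)
    sxy = sum((k - k_mean) * (y - y_mean) for k, y in zip(ks, ys))
    slope = sxy / sxx
    intercept = y_mean - slope * k_mean
    core = 1.0 - math.exp(slope)           # slope = log(1 - pi)
    accessory = 1.0 - math.exp(intercept)  # intercept = log(1 - a)
    return core, accessory
```

If the points for a pair of genomes do not fall close to a straight line over the chosen range (here --min-k 13 to --max-k 29), the estimated distances are unreliable, which is one reason to revisit the range for divergent genomes.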

@ERBringHorvath
Author

ERBringHorvath commented Dec 13, 2024

Thanks for the advice. I'll look into modifying my QC criteria and finding a better k-mer range. I did run Kchooser4 from the kSNP4 package on my cohort, which reported an optimal k-mer length of 23; however, even with that parameter the kSNP4 pipeline produced a tree with low confidence. As for ANI, I did run fastANI on my dataset, but no strains meet the ANI threshold when compared with my strain: all other strains of importance exhibit ANI values <90%, which I think encapsulates my core problem.
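For reference, the "cluster at 95% ANI" idea can be sketched as single-linkage clustering over pairwise ANI values. The example below assumes tab-separated lines in the default fastANI layout (query, reference, ANI, then fragment counts); the function name and the sample names in the usage note are hypothetical. It is an illustration of the approach, not what PopPUNK or sketchlib.rust does internally.

```python
from collections import defaultdict


def cluster_by_ani(lines, threshold=95.0):
    """Single-linkage clusters from 'query<TAB>reference<TAB>ANI...' lines."""
    parent = {}

    def find(x):
        # Union-find lookup with path halving; registers unseen names.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        query, ref, ani = fields[0], fields[1], float(fields[2])
        root_q, root_r = find(query), find(ref)
        if ani >= threshold and root_q != root_r:
            parent[root_q] = root_r  # link the two clusters

    clusters = defaultdict(set)
    for name in parent:
        clusters[find(name)].add(name)
    # Largest clusters first, members sorted for stable output.
    return sorted((sorted(c) for c in clusters.values()),
                  key=lambda c: (-len(c), c))
```

With input such as `["A\tB\t97.1\t900\t1000", "new\tA\t88.5\t700\t1000"]`, the strain `new` ends up alone in its own cluster, which mirrors the situation described above: a genome whose best ANI is below 90% stays a singleton at the 95% threshold.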
