Run HDBSCAN directly on genetic distances and compare clusters to those from embeddings #119

Merged
merged 40 commits into master from cluster-by-distances
Aug 7, 2024

Conversation

huddlej (Collaborator) commented Jul 25, 2024

Description

Adds rules to all natural flu and SARS-CoV-2 workflows that apply HDBSCAN clustering to the genetic distance matrix we use to produce the embeddings. We call this clustering method "genetic" and include it in the grid search that finds the optimal distance threshold per method for early H3N2 HA data. This PR also updates tables, figures, and manuscript text to include these genetic distance clusters as a point of comparison to embedding clusters.

Development checklist

  • Add workflow logic to find optimal cluster threshold for clusters based on genetic distances
  • Rerun workflow to produce cluster accuracies for early flu and SC2 datasets
  • Update accuracy by threshold figure to not refer to "Euclidean" distance threshold
  • Rerun all workflow analyses related to clusters to get tables with genetic cluster mutations and monophyly and figures for within-between cluster distances, etc.
  • Expand accuracy table (Supplementary Table S1) to include VI values for late flu, HA/NA, and late SC2 data by adding columns for late datasets to existing table.
    • Add number of cluster labels per dataset as a column, too.
  • Update manuscript text to reflect clustering by genetic distances in methods and results
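The threshold search in the first checklist item can be sketched as follows. This is a minimal illustration, not the workflow's actual code: `find_optimal_threshold`, `cluster_fn`, and `score_fn` are hypothetical names, and the sketch assumes that lower scores are better (as with VI distance). It also reports whether the best threshold sits at the upper edge of the grid, since an optimum at the boundary suggests the search range may be too narrow.

```python
from typing import Any, Callable, Iterable, Tuple


def find_optimal_threshold(
    thresholds: Iterable[float],
    cluster_fn: Callable[[float], Any],
    score_fn: Callable[[Any], float],
) -> Tuple[float, float, bool]:
    """Return (best_threshold, best_score, at_upper_edge).

    cluster_fn (hypothetical) maps a distance threshold to cluster labels.
    score_fn (hypothetical) maps cluster labels to a score where lower is
    better, for example the VI distance to known genetic groups.
    """
    thresholds = list(thresholds)
    if not thresholds:
        raise ValueError("thresholds must not be empty")

    # Score every threshold in the grid; ties go to the smaller threshold.
    scored = [(score_fn(cluster_fn(t)), t) for t in thresholds]
    best_score, best_threshold = min(scored)

    # An optimum at the largest threshold hints that the grid is too short.
    at_upper_edge = best_threshold == max(thresholds)
    return best_threshold, best_score, at_upper_edge
```

In a real workflow, `cluster_fn` would wrap the clustering call for one method and `score_fn` would compare the resulting labels to the known clade assignments.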

Related issues

Depends on blab/pathogen-embed#33
Closes #99

huddlej added 4 commits July 30, 2024 15:38
Adds a rule to apply HDBSCAN clustering to the genetic distance matrix
that we use to produce the embeddings. We name this clustering "method"
as "genetic" and include it in the grid search to find the optimal
distance threshold per method for early H3N2 HA data.

Related to #99

Updates the early flu workflow to use the new `--distance-matrix` input
argument to the `pathogen-cluster` command and adds the "genetic" method
to the logic for plotting cluster accuracy by distance threshold. Adds
the corresponding cluster accuracy table and figure for early flu.

Adds clustering of genetic distances to the early SC2 workflow and
updates the downstream figures and tables to include the optimal
distance thresholds for genetic distance clusters by pathogen and clade
type.
huddlej force-pushed the cluster-by-distances branch from a29f896 to b269d6a July 30, 2024 22:41
huddlej added 25 commits July 30, 2024 16:31
Updates results for late SC2 based on updated optimal cluster parameter. Also
updates the flu cluster accuracy replication figure for an unclear reason.

Increases the range of distance thresholds to use for the grid search across
clusters for flu and SC2 from an upper limit of 7 to 20. This increase comes
from the initial result that the optimal threshold for genetic distance clusters
of SC2 data plateaued at the highest distance threshold which suggested that
higher thresholds might yield more accurate clusters. These results confirm that
higher distance thresholds produce more accurate genetic clusters, while
clusters from the embedding methods had reached their optimal values in the
original range of the analysis.

Adds genetic clusters to early flu analyses including tables for
monophyly, mutations, and within and between cluster distances and the
Auspice JSON.

Fixes a bug in the late flu and HA/NA workflows where -1 cluster labels from
HDBSCAN did not count toward the VI distance between clusters and known genetic
groups. All workflows should count these -1 labels toward the VI distance, since
we want to penalize methods that produce clusters with more unclustered samples
than others. While the early flu and both early and late SC2 workflows included
-1 labels, the late flu and HA/NA workflows did not.
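To illustrate why the -1 labels matter for the VI distance, here is a minimal sketch of variation of information in plain Python. It is an assumption-laden illustration rather than the workflow's implementation: the function name is hypothetical, the value is unnormalized and uses natural logarithms, and the workflow may compute or scale VI differently. The -1 label is treated like any other cluster label, so unclustered samples add to the distance.

```python
from collections import Counter
from math import log


def variation_of_information(labels_a, labels_b):
    """Unnormalized VI distance (natural log) between two labelings.

    Every label counts, including HDBSCAN's -1 label for unclustered samples.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("labelings must have the same length")
    n = len(labels_a)
    if n == 0:
        return 0.0

    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))

    # VI = -sum_ij p_ij * [log(p_ij / p_i) + log(p_ij / q_j)]
    vi = 0.0
    for (a, b), n_ab in joint.items():
        p_ab = n_ab / n
        vi -= p_ab * (log(n_ab / count_a[a]) + log(n_ab / count_b[b]))
    return vi


# Toy data (not from the workflows): one sample of clade B is unclustered.
clades = ["A", "A", "B", "B"]
clusters = [0, 0, 1, -1]

# Counting -1 labels penalizes the unclustered sample.
vi_with_noise = variation_of_information(clades, clusters)

# Dropping -1 labels before scoring hides that penalty.
kept = [(c, k) for c, k in zip(clades, clusters) if k != -1]
vi_without_noise = variation_of_information(
    [c for c, _ in kept], [k for _, k in kept]
)
```

With this toy data, `vi_with_noise` is greater than zero while `vi_without_noise` is zero, which is the behavior the fix above is meant to avoid.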
huddlej added 11 commits August 5, 2024 16:06
Updates late flu and HA/NA accuracies to reflect the inclusion of -1
cluster labels in the accuracy calculations. This change in the workflow
logic had little effect on the results.

Adds genetic distance cluster results and methods to text.

Fixes an off-by-one error with cluster counts where the "-1" label was being
counted toward the number of clusters per method, causing the counts in several
rows of Supplementary Table S1 to be one higher than expected.
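The off-by-one fix amounts to excluding the -1 label before counting distinct clusters. A minimal sketch, assuming integer labels where -1 marks unclustered samples (`count_clusters` is a hypothetical helper, not a function from the workflow):

```python
def count_clusters(labels, noise_label=-1):
    """Count distinct cluster labels, excluding the unclustered label."""
    return len(set(labels) - {noise_label})
```

If the labels are read from a table as strings, the `noise_label` argument would need to be `"-1"` instead.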
huddlej merged commit d12aee2 into master Aug 7, 2024
huddlej deleted the cluster-by-distances branch August 7, 2024 21:03