Run HDBSCAN directly on genetic distances and compare clusters to those from embeddings #119

huddlej · 2024-07-25T00:11:24Z

Description

Adds rules to all natural flu and SARS-CoV-2 workflows to apply HDBSCAN clustering to the genetic distance matrix that we use to produce the embeddings. We name this clustering "method" "genetic" and include it in the grid search that finds the optimal distance threshold per method for early H3N2 HA data. This PR also updates tables, figures, and manuscript text to include these genetic distance clusters as a point of comparison to embedding clusters.
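As a rough illustration of the input to this clustering step, the sketch below builds a pairwise distance matrix from aligned sequences. It assumes a plain Hamming distance and a made-up toy alignment; the workflows' own distance calculation may treat gaps and ambiguous bases differently. The names `hamming_distance` and `distance_matrix` are hypothetical and not part of pathogen-embed.

```python
from itertools import combinations


def hamming_distance(seq_a, seq_b):
    """Count the positions at which two aligned sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("Sequences must be aligned to the same length.")
    return sum(a != b for a, b in zip(seq_a, seq_b))


def distance_matrix(sequences):
    """Build a symmetric pairwise distance matrix as a list of lists."""
    n = len(sequences)
    matrix = [[0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        d = hamming_distance(sequences[i], sequences[j])
        matrix[i][j] = d
        matrix[j][i] = d
    return matrix


# Hypothetical toy alignment, not data from the workflows.
alignment = ["ACGTAC", "ACGTAA", "TCGAAA"]
matrix = distance_matrix(alignment)
for row in matrix:
    print(row)
```

A square, symmetric matrix like this is the kind of input that HDBSCAN implementations accept when told the distances are precomputed (for example, `metric="precomputed"` in the `hdbscan` Python package), which is what lets the same clustering algorithm run on genetic distances instead of embedding coordinates.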

Development checklist

Add workflow logic to find optimal cluster threshold for clusters based on genetic distances
Rerun workflow to produce cluster accuracies for early flu and SC2 datasets
Update accuracy by threshold figure to not refer to "Euclidean" distance threshold
Rerun all workflow analyses related to clusters to get tables with genetic cluster mutations and monophyly and figures for within-between cluster distances, etc.
Expand accuracy table (Supplementary Table S1) to include VI values for late flu, HA/NA, and late SC2 data by adding columns for late datasets to existing table.
- Add number of cluster labels per dataset as a column, too.
Update manuscript text to reflect clustering by genetic distances in methods and results

Related issues

Depends on blab/pathogen-embed#33
Closes #99

Adds a rule to apply HDBSCAN clustering to the genetic distance matrix that we use to produce the embeddings. We name this clustering "method" "genetic" and include it in the grid search that finds the optimal distance threshold per method for early H3N2 HA data. Related to #99

Updates the early flu workflow to use the new `--distance-matrix` input argument to the `pathogen-cluster` command and adds the "genetic" method to the logic for plotting cluster accuracy by distance threshold. Adds the corresponding cluster accuracy table and figure for early flu.

Adds clustering of genetic distances to the early SC2 workflow and updates the downstream figures and tables to include the optimal distance thresholds for genetic distance clusters by pathogen and clade type.

Updates results for late SC2 based on updated optimal cluster parameter. Also updates the flu cluster accuracy replication figure for an unclear reason.

Increases the upper limit of the distance thresholds used in the grid search across clusters for flu and SC2 from 7 to 20. The initial results showed that the optimal threshold for genetic distance clusters of SC2 data plateaued at the highest distance threshold tested, which suggested that higher thresholds might yield more accurate clusters. The updated results confirm that higher distance thresholds produce more accurate genetic clusters, while clusters from the embedding methods had already reached their optimal values within the original range of the analysis.
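The reasoning behind widening the grid can be sketched as a check for an optimum that lands on the edge of the search range. This is a minimal sketch, not the workflow's code: `find_optimal_threshold` is a hypothetical helper, the VI values are invented for illustration, and it assumes that a lower VI distance means more accurate clusters.

```python
def find_optimal_threshold(vi_by_threshold):
    """Return the threshold with the lowest VI and whether it is the largest one tested.

    Ties are broken in favor of the smaller threshold.
    """
    best = min(vi_by_threshold, key=lambda t: (vi_by_threshold[t], t))
    at_upper_bound = best == max(vi_by_threshold)
    return best, at_upper_bound


# Invented VI values for illustration only (not results from this PR).
original_grid = {1: 0.9, 3: 0.7, 5: 0.5, 7: 0.4}
wider_grid = {1: 0.9, 5: 0.5, 7: 0.4, 12: 0.2, 16: 0.25, 20: 0.3}

print(find_optimal_threshold(original_grid))  # optimum sits at the edge of the grid
print(find_optimal_threshold(wider_grid))     # optimum sits inside the grid
```

When the best threshold is also the largest one tested, the grid cannot show whether accuracy would keep improving, so extending the upper limit is the natural next step.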

Adds genetic clusters to early flu analyses including tables for monophyly, mutations, and within and between cluster distances and the Auspice JSON.

Fixes a bug in the late flu and HA/NA workflows where -1 cluster labels from HDBSCAN did not count toward the VI distance between clusters and known genetic groups. All workflows should count these -1 labels toward the VI distance, since we want to penalize methods that produce clusters with more unclustered samples than others. While the early flu and both early and late SC2 workflows included -1 labels, the late flu and HA/NA workflows did not.
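To show why dropping -1 labels flatters a method, here is a minimal sketch of the variation of information (VI) between two labelings, computed once with unclustered samples kept and once with them dropped. It assumes all -1 samples are treated as one extra group and uses natural logarithms without normalization; the workflows' actual VI implementation may differ. The labels are made up.

```python
from collections import Counter
from math import log


def variation_of_information(labels_a, labels_b):
    """Compute the unnormalized VI (natural log) between two labelings of the same samples."""
    if len(labels_a) != len(labels_b):
        raise ValueError("Both labelings must cover the same samples.")
    n = len(labels_a)
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    vi = 0.0
    for (a, b), n_ab in joint.items():
        p_ab = n_ab / n
        vi -= p_ab * (log(p_ab / (count_a[a] / n)) + log(p_ab / (count_b[b] / n)))
    return vi


# Made-up labels: two known genetic groups and HDBSCAN-style cluster labels.
clades = ["A", "A", "A", "B", "B", "B"]
clusters = [0, 0, -1, 1, 1, -1]

# Keep every sample, so unclustered (-1) samples count against the method.
vi_with_noise = variation_of_information(clades, clusters)

# Drop unclustered samples first, which hides the method's failure to cluster them.
kept = [(clade, cluster) for clade, cluster in zip(clades, clusters) if cluster != -1]
vi_without_noise = variation_of_information(
    [clade for clade, _ in kept],
    [cluster for _, cluster in kept],
)

print(vi_with_noise, vi_without_noise)
```

In this toy case the filtered labelings match perfectly, so the method would look flawless if -1 labels were ignored, even though a third of the samples were never assigned to a cluster.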

Updates late flu and HA/NA accuracies to reflect the inclusion of -1 cluster labels in the accuracy calculations. This change in the workflow logic had little effect on the results.

Adds genetic distance cluster results and methods to text.

Fixes an off-by-one error with cluster counts where the "-1" label was being counted toward the number of clusters per method, causing the counts in several rows of Supplementary Table S1 to be one higher than expected.
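The off-by-one can be illustrated with a short sketch that counts distinct cluster labels while leaving out the "-1" label for unclustered samples. The helper `count_clusters` is hypothetical and the labels are made up; the comparison is done on strings on the assumption that labels may be stored as integers or as text.

```python
def count_clusters(labels):
    """Count distinct cluster labels, excluding the -1 label for unclustered samples."""
    return len({label for label in labels if str(label) != "-1"})


# Made-up HDBSCAN-style labels with two unclustered samples.
labels = [0, 0, 1, -1, 2, -1]

naive_count = len(set(labels))      # counts -1 as if it were a cluster
fixed_count = count_clusters(labels)

print(naive_count, fixed_count)
```

Only rows whose method left at least one sample unclustered are affected by the naive count, which is consistent with the error showing up in several rows rather than everywhere.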

huddlej added 4 commits July 30, 2024 15:38

Add clusters for genetic distances to SC2 workflow

bf505da

Adds clustering of genetic distances to the early SC2 workflow and updates the downstream figures and tables to include the optimal distance thresholds for genetic distance clusters by pathogen and clade type.

Use Snakemake's supported regex format for ints

b269d6a

huddlej force-pushed the cluster-by-distances branch from a29f896 to b269d6a Compare July 30, 2024 22:41

huddlej added 25 commits July 30, 2024 16:31

Update accuracy by parameter to include genetic distance

884dd91

Update late SC2 results after updating optimal cluster parameters

35e0cea

Updates results for late SC2 based on updated optimal cluster parameter. Also updates the flu cluster accuracy replication figure for an unclear reason.

Remove 'Euclidean' from grid search plot axis title

841e6e9

Include genetic clusters in early flu analyses

f581c1e

Adds genetic clusters to early flu analyses including tables for monophyly, mutations, and within and between cluster distances and the Auspice JSON.

Left join tables to support inputs without internal nodes

314eb56

Include genetic clusters in early flu cluster outputs

56bdfb7

Add genetic distance clusters to late flu workflow

57f5956

Include genetic clusters in late flu cluster outputs

37ed581

Remove hardcoded colors from late flu Auspice JSON

ddeec38

Remove poster figures

133fb8c

Add genetic distance clusters to HA/NA workflow

1c79597

Fix bugs in HA/NA workflow references

b48b159

Include genetic clusters in HA/NA cluster outputs

89587b1

Add genetic distance clusters to early SC2 workflow and results

c8f6a5b

Add genetic distance clusters to late SC2 workflow

da17087

Include genetic clusters in late SC2 cluster outputs

bcdfe51

Fix layout of replication figures

d614a40

Update merged results to reflect genetic distance clusters

4d8312f

Fix genetic cluster label in within/between distance figure

9ad8bca

Fix spacing in within/between distance figures

b1ed635

Start adding genetic distance clusters to text

161099c

Add accuracy results for late flu dataset

46253f3

Start updating text for late flu results

cc88455

huddlej added 11 commits August 5, 2024 16:06

Update cluster accuracies to reflect not ignoring -1 labels

2b9d8e0

Update cluster accuracy text to reflect -1 labels

34bef24

Updates late flu and HA/NA accuracies to reflect the inclusion of -1 cluster labels in the accuracy calculations. This change in the workflow logic had little effect on the results.

Finish adding genetic distance clusters to text

dd7bccd

Adds genetic distance cluster results and methods to text.

Start restructuring accuracy table

6e424ef

Update accuracy table caption

e65094e

Count the number of distinct clusters per cluster accuracy table row

3fed9b9

Add number of clusters to all accuracy tables

c22173c

Shorten total clusters column of table S1

9eba192

Add references to updated accuracy table

f8bca9d

Fix off-by-one error with cluster counts

90e2704

Fixes an off-by-one error with cluster counts where the "-1" label was being counted toward the number of clusters per method, causing the counts in several rows of Supplementary Table S1 to be one higher than expected.

Add more references to updated accuracy table

3b182c6

huddlej merged commit d12aee2 into master Aug 7, 2024

huddlej deleted the cluster-by-distances branch August 7, 2024 21:03
