Scatterplots #1310

Merged: 5 commits, Mar 30, 2021

@jameshadfield jameshadfield commented Mar 23, 2021

This adds a new layout for scatterplots and allows users to choose the x and y variables from the available colorings. The defaults are the tree metric (x-axis) and the current color-by (y-axis). Choices are limited to continuous colorings (see below). Nodes without information are hidden from view, and branches can be toggled on/off.

As clock views are really just an instance of a scatterplot, the UI for the two is similar. Specifically, we now show toggles for both regression lines and display of branches for each layout. Regression lines for clock views are unchanged; scatterplots, however, use a new implementation which does not require the regression to pass through the root. Coefficients and R^2 are reported, although the text formatting isn't perfect.

The relevant parameters are stored in URL queries to allow URLs to be shared (see documentation added here for details).
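
As a rough illustration of how such a URL could be built and read back, here is a minimal Python sketch. The query keys `l`, `scatterX` and `scatterY` are the ones that appear in URLs quoted in this conversation; the helper names and the `example.org` host are hypothetical, and the real implementation lives in Auspice's JavaScript, not here.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

def scatter_url(base, x=None, y=None):
    """Build a URL selecting the scatter layout with optional x/y variables."""
    params = {"l": "scatter"}
    if x is not None:
        params["scatterX"] = x
    if y is not None:
        params["scatterY"] = y
    return f"{base}?{urlencode(params)}"

def read_scatter_state(url):
    """Extract layout and scatter variables from a URL; missing keys become None."""
    query = parse_qs(urlsplit(url).query)

    def first(key):
        return query.get(key, [None])[0]

    return {"layout": first("l"), "x": first("scatterX"), "y": first("scatterY")}

# Hypothetical host; the path and query mirror the review-app link in this thread.
url = scatter_url("https://example.org/ncov/global", y="S1_mutations")
state = read_scatter_state(url)
print(url)
print(state)
```

A variable that is absent from the query comes back as `None`, which is where a default (tree metric for x, current color-by for y) would presumably be applied.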

image

There are a number of future feature improvements, which I think are best as self-isolated issues:

  • Non-continuous colourings will require two large-ish pieces of work. Firstly, the d3 scale within PhyloTree has to change between scaleLinear and scaleOrdinal, which requires certain elements to re-render. Secondly, information about the variable (categories, ordering of categories etc) must be calculated and passed in to PhyloTree to calculate the appropriate coordinates. Medium priority.
  • JSON displayDefaults extended to allow specifying of scatterplot variables. Low priority.
  • Axis labelling. Low priority, but will become important when non-continuous variables are used.
  • Ability to swap x & y variables. Low priority.
  • Domains always consider (valid) internal nodes, and thus could be improved for views which hide the branches. Extremely low priority.
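
To make the first bullet more concrete, here is a small Python sketch of the difference between a linear and an ordinal mapping. It only loosely mirrors d3's `scaleLinear` and `scaleOrdinal`; the function names and the even spacing of categories are assumptions for illustration, not how PhyloTree does or will do it.

```python
def linear_scale(domain, pixel_range):
    """Map a number in `domain` linearly onto `pixel_range`."""
    (d0, d1), (r0, r1) = domain, pixel_range

    def scale(value):
        if d1 == d0:  # degenerate domain: place everything mid-axis
            return (r0 + r1) / 2
        return r0 + (value - d0) / (d1 - d0) * (r1 - r0)

    return scale

def ordinal_scale(categories, pixel_range):
    """Map ordered categories onto evenly spaced positions in `pixel_range`."""
    r0, r1 = pixel_range
    step = (r1 - r0) / max(len(categories) - 1, 1)
    positions = {c: r0 + i * step for i, c in enumerate(categories)}

    def scale(value):
        return positions[value]

    return scale

x_numeric = linear_scale((0, 10), (0, 100))
x_category = ordinal_scale(["A", "B", "C"], (0, 100))
print(x_numeric(2.5), x_category("B"))
```

The linear scale needs only the numeric extent, whereas the ordinal one needs the category list and its ordering up front, which is the extra information the bullet says must be calculated and passed in.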

This adds a new layout for scatterplots and allows users to choose
the x and y variables from the available colorings. The defaults are
the tree metric (x-axis) and the current color-by (y-axis).

The layout algorithm is largely unchanged from the root-to-tip layout.
This presupposes that node trait values will be numeric, and thus
map nicely to an axis. Future work will allow scales to map
non-numeric values (e.g. categorical, ordinal, boolean scales) to
a d3 domain for rendering. Currently these traits get assigned `0`
as their x and/or y values.

Similarly, the algorithm presupposes that all nodes (internal and
terminal) have values and should be rendered. There will be many
cases where nodes (especially internal nodes) do not have traits
assigned. In these cases we should hide them from view, and remove
any connecting branches.

Future work needed:
* More testing is needed for rare use cases, e.g. trees without
divergence, datasets with no colorings.
* Dataset JSONs and URL queries should be able to select the
scatterplot variables.

This commit is based off previous work by trvrb.

Co-authored-by: Trevor Bedford <[email protected]>
@jameshadfield jameshadfield temporarily deployed to auspice-scatterplot-rev-dqvlx6 March 23, 2021 21:54 Inactive
@jameshadfield jameshadfield force-pushed the scatterplot-revisited branch from 5633ec8 to 1bbfca2 Compare March 24, 2021 06:18
@jameshadfield jameshadfield temporarily deployed to auspice-scatterplot-rev-dqvlx6 March 24, 2021 06:19 Inactive
@jameshadfield jameshadfield force-pushed the scatterplot-revisited branch from 1bbfca2 to daae5be Compare March 24, 2021 06:42
@jameshadfield jameshadfield temporarily deployed to auspice-scatterplot-rev-dqvlx6 March 24, 2021 06:42 Inactive
@jameshadfield
Member Author

@huddlej / @trvrb would you mind taking a look at this? I've found a few minor problems with restoring views from URL state, but I think this is otherwise good to go.

@trvrb
Member

trvrb commented Mar 24, 2021

Very excited about this work. Thank you for putting this together James.

Notes from review:

  1. Good to push categorical variables to a subsequent PR.
  2. The improvements to clock view are appreciated. It was always strange that time vs divergence was available as a toggle. And the new toggles for branches and regression are also nice additions.
  3. I like how show branches / show regression state persists between clock layout and scatter layout.
  4. I think it might be better to do "intercept = -4.66e+4, slope = 23.1, R^2 = 0.60" rather than the current "y = -4.66e+4 + 23.1x. R2 = 0.601".
  5. The URL state of scatterY=S1_mutations is not coming through with the link https://auspice-scatterplot-rev-dqvlx6.herokuapp.com/ncov/global?l=scatter&scatterY=S1_mutations and instead this is using default divergence for y.
  6. Not high priority and not blocking merge, but a bit surprising that clicking "zoom to selected" when branches are hidden will still zoom as if branches are present (resulting in not as much zoom as expected).
  7. I was surprised by not being able to use "epitope mutations" for the flu datasets in the scatterplot, but I see that this is an issue with seasonal-flu and not an issue with Auspice. We should update these entries https://github.com/nextstrain/seasonal-flu/blob/master/config/auspice_config_h3n2.json#L53 from ordinal to continuous. This seems more appropriate anyway, as if we have values 0, 1, 2 and 10 I'd want the virus with 10 epitope changes to be far over in the color ramp rather than the color difference between 1 and 2 being the same as between 2 and 10 (as ordinal would imply). I'll make this update to seasonal-flu.
  8. This is probably my biggest usability issue: when I look at more esoteric combinations (like scatterX=lbi&scatterY=cTiterSub on flu) I get quickly confused on what the scatterplot is actually showing. Providing axis labels on the Phylogeny panel itself would be hugely helpful to orient people (the way we have "Date" and "Divergence" for clock view).
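
The ordinal-versus-continuous argument in point 7 can be illustrated with the values 0, 1, 2 and 10 from that point. This is a hedged Python sketch with hypothetical helper names; it computes a position between 0 and 1 along a color ramp and is not Auspice's actual color-scale code.

```python
values = [0, 1, 2, 10]

def ordinal_positions(vals):
    """Rank-based: each distinct value takes an evenly spaced slot on the ramp."""
    ordered = sorted(set(vals))
    last = max(len(ordered) - 1, 1)
    return {v: i / last for i, v in enumerate(ordered)}

def continuous_positions(vals):
    """Magnitude-based: position is proportional to the value itself."""
    lo, hi = min(vals), max(vals)
    span = (hi - lo) or 1
    return {v: (v - lo) / span for v in vals}

ordinal = ordinal_positions(values)
continuous = continuous_positions(values)
print(ordinal)
print(continuous)
```

Under the rank-based mapping the step from 1 to 2 is as large as the step from 2 to 10, while the magnitude-based mapping places 10 far from the other three values.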

I'd recommend addressing 4, 5 and 8 before merging.

@trvrb trvrb temporarily deployed to auspice-scatterplot-rev-dqvlx6 March 24, 2021 18:21 Inactive
@trvrb
Member

trvrb commented Mar 24, 2021

I've updated seasonal flu with nextstrain/seasonal-flu@46f72c3, pushed new JSONs and redeployed the review app. Epitope mutations are now available for scatterplot for H3N2 and H1N1pdm. Fixes 7 above.

@huddlej
Contributor

huddlej commented Mar 25, 2021

This is so cool, @jameshadfield! This is a killer feature and I can see myself using this all of the time. For instance, this view of antigenic advance (tree model) by date helps me immediately see which clades are more antigenically advanced, the range of the advance values, and how much variation there is.

image

I like the new scatter UI as a new layout option with dropdowns for x and y and the toggle buttons. I also like how the scatterplot view only plots points with assigned values. For example, in H3N2 HA 2y trees, we only calculate fitness for the most recent strains and all other strains do not get a fitness value assigned. If I switch from the color-by fitness view below:

image

To plot fitness on the y-axis, I get the following view showing only those tips with fitness values:

image

Initially, I was surprised that the branches and all other tips disappeared, but after toggling between these views, the reason became clear.

Given the fitness by date view above, I wish I knew how many samples are being shown in the display. Maybe this is an edge case, but by zooming to only points with assigned fitness values, I’ve effectively “filtered” by view without applying an actual filter. I don't know the best way to address this though.

I tested the scatterplot layout with a local auspice installation and with Sravani’s H3N2 HA embeddings (PCA, MDS, etc.). The following plot shows t-SNE x and y coordinates from HA sequence inputs. This is exactly what I was hoping to be able to do with this scatterplot interface!

image

This view does highlight the need for x- and y-axis labels (I can imagine making screenshots like this all of the time, or embedding these views in a narrative where the scatterplot controls are not visible).

I also find myself wanting to “zoom out” so I can still see the legend but it isn’t obscuring the data points in the top-left.

On a related note, I also was surprised by the view when I filtered to show just two clades as below. The x- and y- view didn’t rescale like I expected and it still shows clade labels for data that aren’t being shown.

image

I guess both of these issues are related to the view “acting as if” the branches are still visible? I can turn off clade labels, but it would be cool if I could somehow zoom or pan to center on the current data. I also get that this could require a major amount of new code, so it’s not a blocking issue for this PR.

In terms of issues that would be important for this release, I second Trevor's issues 5 (loading scatterplot view from URL parameters) and 8 (axis titles).

@huddlej
Contributor

huddlej commented Mar 25, 2021

I just found one more issue where switching from the scatterplot back to the rectangular tree view defaults to the divergence view instead of time tree view. Steps to recreate:

  1. Go to a flu build (this defaults to time tree view)
  2. Select the scatter layout
  3. Select "rectangular" layout. The tree shows divergence on the x-axis, but the "Branch Length" option for "Time" is selected.

This fixes some issues highlighted by the previous commit to improve
rendering of scatterplots. We now limit scatterplot x,y variable
choices to continuous-scaled colorings, and leave the display of
other scale types to future work as this requires PhyloTree to switch
to a new d3 scale.

As not all nodes may have traits assigned (contrary to other tree
layouts), we detect and hide those nodes from view, as well as any
joining branches. We also expose the ability to toggle branches
on/off.
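
A minimal Python sketch of the hiding rule described above, assuming a hypothetical flat node list with `name`, `parent` and `traits` fields (the real data structures in PhyloTree differ): a node is shown only if it has both coordinates, and a branch is shown only if both of its endpoints are.

```python
def visible_elements(nodes, x_key, y_key):
    """Return (names of visible nodes, visible branches as (parent, child) pairs)."""
    def has_coords(node):
        traits = node["traits"]
        return traits.get(x_key) is not None and traits.get(y_key) is not None

    shown = {n["name"] for n in nodes if has_coords(n)}
    # A branch joins a node to its parent, so it needs both ends to be drawable.
    branches = [
        (n["parent"], n["name"])
        for n in nodes
        if n["name"] in shown and n.get("parent") in shown
    ]
    return shown, branches

# Hypothetical toy tree: only A and A1 carry a "fitness" value.
nodes = [
    {"name": "root", "parent": None, "traits": {"div": 0.0}},
    {"name": "A", "parent": "root", "traits": {"div": 1.0, "fitness": 0.4}},
    {"name": "A1", "parent": "A", "traits": {"div": 2.0, "fitness": 0.9}},
    {"name": "B", "parent": "root", "traits": {"div": 1.5}},
]
shown, branches = visible_elements(nodes, "div", "fitness")
print(sorted(shown), branches)
```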

We also improve the starting variable choices for x & y.
@jameshadfield jameshadfield force-pushed the scatterplot-revisited branch from daae5be to 59c5e08 Compare March 29, 2021 05:58
@jameshadfield jameshadfield temporarily deployed to auspice-scatterplot-rev-dqvlx6 March 29, 2021 05:59 Inactive
@trvrb
Member

trvrb commented Mar 29, 2021

Thanks so much for these revisions @jameshadfield. I can confirm that updates to this branch fully address my issues 4, 5 and 8. This is good to be merged from my perspective.

As the clock view is simply a specific type of scatterplot layout,
this commit unifies the code and display between these two
"separate" layouts. We preserve the clock button in the sidebar
as this is a common action which we want to surface.

**Show branch toggles**
Are now rendered for both views. The layout of scatterplots does
not consider internal nodes for calculating the domain if branches
are not shown. Similarly, branch labels are not displayed if
branches are not.
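
The domain rule above can be sketched as follows. This is an illustrative Python version under assumed field names (`is_tip`, plus one key per variable); it is not the code from this commit.

```python
def axis_domain(nodes, key, show_branches):
    """Return (min, max) of `key` over the nodes that contribute to the axis.

    Internal nodes are counted only when branches are shown; nodes without
    a value for `key` never contribute. Returns None if nothing qualifies.
    """
    values = [
        n[key]
        for n in nodes
        if n.get(key) is not None and (show_branches or n["is_tip"])
    ]
    if not values:
        return None
    return min(values), max(values)

# Hypothetical nodes: one internal node at 0, two tips at 5 and 9, one tip unset.
nodes = [
    {"is_tip": False, "lbi": 0.0},
    {"is_tip": True, "lbi": 5.0},
    {"is_tip": True, "lbi": 9.0},
    {"is_tip": True, "lbi": None},
]
print(axis_domain(nodes, "lbi", show_branches=True))
print(axis_domain(nodes, "lbi", show_branches=False))
```

With branches hidden the axis starts at the smallest tip value, so the points fill the panel instead of leaving room for an internal node that is not drawn.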

**Regression Lines**
These are now available for both layouts, and are toggled via a UI
element similar to branches. Previously, the regression would be
shown for clock layouts _if_ the branch metric was time, however
the explicit UI element introduced here is better. For scatterplot
views we calculate the regression with a free intercept, as the root
node may not have co-ordinates defined (depending on chosen x,y
variables), and additionally report the R^2.
The display of the regression text can be improved in future commits.
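
For readers unfamiliar with the distinction, here is a plain-Python sketch of an ordinary least-squares fit with a free intercept and its R^2, next to a fit constrained to pass through a fixed point. The function names are hypothetical, and the constrained variant is included only for contrast, on the assumption that it resembles a regression anchored at the root.

```python
def fit_free_intercept(xs, ys):
    """Ordinary least squares for y = intercept + slope * x; also returns R^2."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    # If y is constant the fit is exact; report R^2 as 1 by convention here.
    r2 = 1.0 if ss_tot == 0 else 1 - ss_res / ss_tot
    return intercept, slope, r2

def fit_through_point(xs, ys, x0, y0):
    """Least-squares slope of a line forced through (x0, y0)."""
    den = sum((x - x0) ** 2 for x in xs)
    if den == 0:
        raise ValueError("all x values equal x0; slope is undefined")
    return sum((x - x0) * (y - y0) for x, y in zip(xs, ys)) / den

xs, ys = [0, 1, 2, 3], [1, 3, 5, 7]
intercept, slope, r2 = fit_free_intercept(xs, ys)
anchored_slope = fit_through_point(xs, ys, 0, 1)
print(intercept, slope, r2, anchored_slope)
```

The free-intercept fit needs no coordinates for any particular node, which is why it still works when the root has no value for the chosen x or y variable.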

**Persist chosen state**
To improve the UX, once a scatterplot has been viewed, we persist the
x,y variables for future viewing. Similarly, the toggle state persists
between clock & scatter layouts.
See added documentation for available queries.
@jameshadfield jameshadfield force-pushed the scatterplot-revisited branch from 59c5e08 to e3a96c4 Compare March 30, 2021 00:30
@jameshadfield jameshadfield temporarily deployed to auspice-scatterplot-rev-dqvlx6 March 30, 2021 00:31 Inactive
This commit updates the logic for deciding the gridlines for both
x and y axes for scatterplots. Previously we had a very limited range
of cases to consider here. We now have two general functions available
for creating grids - one for temporal scales and one for all other
numeric scales (previously used only for divergence). We will need
to add a third function when we expand scatterplots to plot
non-continuous variables.
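
As an illustration of what a general gridline function for numeric scales might do, here is a Python sketch that picks a "nice" step (1, 2 or 5 times a power of ten) and returns the gridline positions inside a domain. The function name, the `target_lines` parameter and the choice of steps are assumptions, not the logic in this commit; the temporal case is not sketched.

```python
import math

def numeric_grid(lo, hi, target_lines=5):
    """Return gridline positions within [lo, hi], spaced at a 'nice' step."""
    if hi <= lo:
        return [lo]
    raw_step = (hi - lo) / target_lines
    magnitude = 10 ** math.floor(math.log10(raw_step))
    # Choose the smallest of 1, 2, 5, 10 x magnitude that is at least raw_step.
    for multiple in (1, 2, 5, 10):
        step = multiple * magnitude
        if step >= raw_step:
            break
    first = math.ceil(lo / step) * step
    count = int(math.floor((hi - first) / step)) + 1
    # Round to suppress floating-point noise such as 0.15000000000000002.
    return [round(first + i * step, 10) for i in range(count)]

print(numeric_grid(0, 10))
print(numeric_grid(0.03, 0.27))
```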
@jameshadfield jameshadfield temporarily deployed to auspice-scatterplot-rev-dqvlx6 March 30, 2021 01:50 Inactive
@jameshadfield
Member Author

jameshadfield commented Mar 30, 2021

Thanks for the great reviews @trvrb & @huddlej. As per Trevor's last message, the blocking issues have been resolved, so I am going to merge. I'll note the changes here for posterity.

[@trvrb] Good to push categorical variables to a subsequent PR.

I've created #1316 which sketches out a path to implementing these.

[@trvrb] I think it might be better to do "intercept = -4.66e+4, slope = 23.1, R^2 = 0.60" rather than the current "y = -4.66e+4 + 23.1x. R2 = 0.601".

👍 Done

[@trvrb] The URL state of scatterY=S1_mutations is not coming through

👍 Fixed. I believe all state is now being restored appropriately, but there may be some rare edge cases I haven't run into.

[@trvrb] a bit surprising that clicking "zoom to selected" when branches are hidden will still zoom as if branches are present

I've improved how we calculate domains so that zooming looks much better here. More generally, zooming in auspice doesn't map straightforwardly onto scatterplots (see #1317 for more).

[@trvrb] Providing axis labels on the Phylogeny panel itself would be hugely helpful to orient people

[@huddlej] This view does highlight the need for x- and y-axis labels

👍 Good reminder. Done.

[@huddlej] Initially, I was surprised that the branches and all other tips disappeared, but after toggling between these views, the reason became clear.

It's a bit confusing that we keep the "show branches" toggle in these situations, but unfortunately removing it isn't trivial (we only realise branches are never rendered inside PhyloTree, and there's no easy way to update the rest of the UI from there).

[@huddlej] I wish I knew how many samples are being shown in the display.

Agreed! I've created #1318 to fix this.

[@huddlej] I tested the scatterplot layout with a local auspice installation and with Sravani’s H3N2 HA embeddings (PCA, MDS, etc.).

I didn't know about these - they look great!

[@huddlej] I also find myself wanting to “zoom out” so I can still see the legend but it isn’t obscuring the data points in the top-left.

Fixed using the same approach as #1302

[@huddlej] I also was surprised by the view when I filtered to show just two clades as below. The x- and y- view didn’t rescale like I expected and it still shows clade labels for data that aren’t being shown.

Yes -- this (unfortunately) isn't an easy fix. I've written more about this in #1317.

[@huddlej] I just found one more issue where switching from the scatterplot back to the rectangular tree view defaults to the divergence view instead of time tree view. Steps to recreate:

Fixed 👍


In addition to fixes to the points raised above, I improved the logic behind axes grids, so that (e.g.) scatterplots with time on the y-axis look as they should.

@jameshadfield jameshadfield merged commit 93a171a into master Mar 30, 2021
@jameshadfield jameshadfield deleted the scatterplot-revisited branch March 30, 2021 03:13