Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow non-continuous scatterplot variables #1346

Merged
merged 5 commits into from
May 24, 2021
Merged

Conversation

jameshadfield
Copy link
Member

This implements a requested improvement to the original
scatterplot implementation. The implementation hinges on two changes:
(1) The collection of values for a given variable (e.g. x-var) need
to be computed and passed to PhyloTree to act as the scale's domain.
We reuse the colorScale machinery here, which could be optimised
(see todo messages in code), but this has the advantage that the
domain ordering matches the legend (unless user supplied).
(2) PhyloTree needed to be modified to use non-linear scales, in this
case pointScale.

This commit should be fully functional, however there are some
future improvements to be made:

(i) Grid text is obscured and unreadable when there are many entries
in the domain.
(ii) Genotypes and Boolean scales are not yet available.
(iii) Jitter should be added to nodes to avoid obfuscation.

@trvrb @huddlej would you mind trying this out? Thanks in advance.

image
image

@jameshadfield jameshadfield temporarily deployed to auspice-ordinal-scatter-39fzn6 May 18, 2021 05:54 Inactive
@trvrb
Copy link
Member

trvrb commented May 18, 2021

This is fantastic. A couple notes as I review:

  1. I really like having default y be active color by.

  2. https://auspice-ordinal-scatter-39fzn6.herokuapp.com/ncov/global?branches=hide&l=scatter&scatterX=region&scatterY=emerging_lineage should really be a useful visualization of where in the world emerging lineages have been sampled. However, the lack of jitter makes this not as useful as it could be.

Medium priority would be adding jitter. I don't think blocking PR (which is accurate and useful as it stands), but I do think jitter would be a nice addition. Though I see why automating jitter will be a bit complex.

Copy link
Contributor

@huddlej huddlej left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is so cool, @jameshadfield. I would be happy to use this now as it is, but I've provided more detailed comments below.

I love that I can compare categorical fields with each other and with quantitative fields! Just this functionality alone allows us to investigate some interesting questions for influenza that were previously more difficult with the tree view.

image
Which clades have the most epitope mutations and what is their distribution? This same type of view could be helpful for inspecting which flu clades are underrepresented for total titer measurements.

image
When were clade strains most recently submitted? The information is present in the corresponding tree colored by submission date with branch labels, but this view is much more succinct and unambiguous.

image
Which clades are present in each region? This is the kind of view where the ability to tune the opacity of points would produce a density plot where more overlapping points communicates a higher density. You could also imagine eventually supporting a radius encoding for that overlap to produce a punchcard plot.

image
Which regions have strains that are most likely to seed the future and what clades are those strains from?

image
What is the LBI of the closest strains to the future for two specific regions with the highest LBI strains? Enables a side-by-side comparison of forecast distributions between regions with the evidence from LBI to support those forecasts.

I wish that I could:

  • set an alpha/opacity value for each point to produce a quick density plot especially for categorical vs categorical plots
  • order categorical columns by any quantitative tip annotation. In addition to the UI and code complexity associated with this request, there’s also the issue of how to aggregate the quantitative values across each single categorical value. Would we order by the mean/median/max of the quantitative values?
  • filter tips to specific values in a category and have all other other values in the category disappear from the axis (see the last image above). This could be a flu-specific issue related to the vaccine strain “x” symbols though.

What if:

  • categorical vs. quantitative plots automatically sorted the categorical values by the mean/median of the quantitative field? This wouldn’t require any additional UI components unless you wanted to allow users to toggle between the default (maybe user-specific) and computed ordering.
  • we allowed users to define the order for any category based on another tip annotation that had a name based on the category field's name? For example, users could annotate a quantitative field named region-orderby based on latitude, for example, and that field would be automatically used to order the region values if it was present. This wouldn't be as nice as a full UI to select any arbitrary field but it could be a useful half-step. (This transmission network mockup is what I had in mind for this specific request.)

This implements a requested improvement to the original
scatterplot implementation. The implementation hinges on two changes:
(1) The collection of values for a given variable (e.g. x-var) need
to be computed and passed to PhyloTree to act as the scale's domain.
We reuse the colorScale machinery here, which could be optimised
(see todo messages in code), but this has the advantage that the
domain ordering matches the legend (unless user supplied).
(2) PhyloTree needed to be modified to use non-linear scales, in this
case `pointScale`.

This commit should be fully functional, however there are some
future improvements to be made:

(i) Grid text is obscured and unreadable when there are many entries
in the domain.
(ii) Genotypes and Boolean scales are not yet available.
(iii) Jitter should be added to nodes to avoid obfuscation.
This commit is in preparation for allowing genotype to be a scatterplot
variable. This will complicate the allowable scatterplot variables
and force these to update upon colorBy changes. This is much cleaner
if layout is changed in a thunk.
@jameshadfield jameshadfield force-pushed the ordinal-scatterplots branch from f6ee8e7 to 2d7b0a2 Compare May 21, 2021 08:18
@jameshadfield jameshadfield temporarily deployed to auspice-ordinal-scatter-39fzn6 May 21, 2021 08:19 Inactive
@jameshadfield
Copy link
Member Author

jameshadfield commented May 21, 2021

Thanks for the detailed review @huddlej! I've added jitter & a few other improvements to the PR, so you can re-click your links above and see a newer rendering of them.

Ordering is going to be complicated, so I'm going to avoid any changes in this PR!!!

filter tips to specific values in a category and have all other other values in the category disappear from the axis (see the last image above).

This isn't flu-specific, I'll take a look and see how easy a fix is going to be. (Underneath it's the same logic as we use for displaying dynamically filtered legend items, so I'm hopeful the fix will be easy.)

P.S. There's a bug somewhere which occurs in certain combinations of scatterplot variables and causes no nodes to be rendered, so this PR should at least wait until I've found that.

P.P.S. nextstrain/nextstrain.org#335 is a nextstrain.org PR using this version of Auspice, so you can use the Review App from that PR to access a wider range of datasets than the Auspice Review App.

@huddlej
Copy link
Contributor

huddlej commented May 21, 2021

Nice! The jitter makes such a huge difference and basically gets at what I was thinking with reduced opacity/density plot view.

Is there a hard threshold on the viewport width for when jittering is activated? When I go to this view of distance to future by region the points per region are jittered, but when I turn on the grid display of the same scatterplot the jitter disappears. I can imagine that it is hard to cram everything into such a small view, though. I can't really think of a better solution, if there is a hard threshold for width. 🤷🏻

@trvrb
Copy link
Member

trvrb commented May 21, 2021

Awesome! The jitter is an unexpectedly large improvement. I'm really happy with how things stand. And John, your flu examples are really interesting and things that I wouldn't have immediately thought to surface.

One small non-blocking issue. I saw from the commits that genotype was added as scatterplot option. But then I couldn't immediately figure out how to activate it. I eventually thought to select genotype as color by and toggle away from scatterplot layout and then toggle back to scatterplot layout. I then discovered the genotype option for x and y axes. I suspect this feature is somewhat hidden while it works like this (though many will be arriving at scatterplot after coloring tree by trait of interest, so it might not be so bad really).

Also, I just found a bug. If you go to: https://auspice-ordinal-scatter-39fzn6.herokuapp.com/ncov/global?c=gt-S_484,501&l=scatter&scatterY=gt to show sampling date vs 484,501 genotype everything works as expected, but then if you toggle "sampling date" to "emerging lineage" the points fly off the page. If you go to the resulting state directly https://auspice-ordinal-scatter-39fzn6.herokuapp.com/ncov/global?branches=hide&c=gt-S_484,501&l=scatter&scatterX=emerging_lineage&scatterY=gt it works as expected.

@emmahodcroft
Copy link
Member

I'm really enjoying playing around with this @jameshadfield - and @huddlej your examples above helped to get me going: I hadn't realised all the options!

Anyway, unfortunately I'm also posting as I managed to find a bit of an odd bug. I originally had plotted by X: Genotype S: 501 (also as a colourby, of course), and Y: Clade.

I then was curious if it would auto-adjust to changing the colourby, so typed in 681. It does auto adjust! However, it does not auto-recolour:

image
(All the columns should be the same colour)

Interestingly the map did recolour so it's just something about the scatter.
This isn't a big deal at all - and changing to another colourby and changing back resolves it - but well, figured I'd write it up! 🙃

@jameshadfield
Copy link
Member Author

jameshadfield commented May 24, 2021

Thanks all for the comments -- especially the two bugs and instructions on how to reproduce.

Is there a hard threshold on the viewport width for when jittering is activated?

There's a threshold of 50px spacing between values on an axis as below this it got confusing which point belonged to which value, but 50px is somewhat arbitrary! In Emma's screenshot you can see it's jittered on the x-axis but not the y-axis.

I saw from the commits that genotype was added as scatterplot option.

Yeah, it's somewhat secret squirrel. The only way we have of selecting a specific genotype is via the colouring (which uses the cascading dropdown UI); so if the colouring is not genotype then I don't think we can select this as a scatter variable without exposing the more complex UI, which I wanted to avoid. P.S. If the colouring is genotype, then genotype shows up as an option in the scatterplot variables.

@jameshadfield jameshadfield force-pushed the ordinal-scatterplots branch from 2d7b0a2 to 20d4985 Compare May 24, 2021 04:25
@jameshadfield jameshadfield temporarily deployed to auspice-ordinal-scatter-39fzn6 May 24, 2021 04:26 Inactive
Genotype is treated differently to other colorings in two important
ways: (1) it can change value, for instance when changing the
colorby to another genotype position and (2) it is stored in a
different place to other colorings. These require scatterplot logic
to be more complex as actions are no longer separate - we now require
a NEW_COLOURS action to potentially update the layout which was
formerly within the remit of the CHANGE_LAYOUT actions. This is
achieved through a middleware layer.

This implementation makes it clear that jitter and better domain
spacing are crucial for scatterplots.
This prevents nodes falling on the axis itself or at the very end of
the grid, which was especially noticeable for traits with small domains.
@jameshadfield jameshadfield force-pushed the ordinal-scatterplots branch from 20d4985 to 898a387 Compare May 24, 2021 04:35
@jameshadfield jameshadfield temporarily deployed to auspice-ordinal-scatter-39fzn6 May 24, 2021 04:35 Inactive
@jameshadfield
Copy link
Member Author

Bugs all fixed & (I believe) this is now good to go 😄

@trvrb I changed the behaviour to improve visibility and consistency - if you have selected a genotype colouring and then click scatterplot, the y-axis will be genotype.

@huddlej I'm going to leave the dynamic domains (i.e. filtering updating the axes) for a future PR in order to get this out ASAP

@trvrb
Copy link
Member

trvrb commented May 24, 2021

Fantastic! Thank you James. This is good to go from my perspective. Really excited about this as it opens up a bunch of expressive possibilities.

@jameshadfield jameshadfield merged commit b58f148 into master May 24, 2021
@jameshadfield jameshadfield deleted the ordinal-scatterplots branch May 24, 2021 22:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants