Allow non-continuous scatterplot variables #1346

jameshadfield · 2021-05-18T05:53:54Z

This implements a requested improvement to the original
scatterplot implementation. The implementation hinges on two changes:
(1) The collection of values for a given variable (e.g. x-var) need
to be computed and passed to PhyloTree to act as the scale's domain.
We reuse the colorScale machinery here, which could be optimised
(see todo messages in code), but this has the advantage that the
domain ordering matches the legend (unless user supplied).
(2) PhyloTree needed to be modified to use non-linear scales, in this
case pointScale.

This commit should be fully functional, however there are some
future improvements to be made:

(i) Grid text is obscured and unreadable when there are many entries
in the domain.
(ii) Genotypes and Boolean scales are not yet available.
(iii) Jitter should be added to nodes to avoid obfuscation.

@trvrb @huddlej would you mind trying this out? Thanks in advance.

trvrb · 2021-05-18T23:11:48Z

This is fantastic. A couple notes as I review:

I really like having default y be active color by.
https://auspice-ordinal-scatter-39fzn6.herokuapp.com/ncov/global?branches=hide&l=scatter&scatterX=region&scatterY=emerging_lineage should really be a useful visualization of where in the world emerging lineages have been sampled. However, the lack of jitter makes this not as useful as it could be.

Medium priority would be adding jitter. I don't think blocking PR (which is accurate and useful as it stands), but I do think jitter would be a nice addition. Though I see why automating jitter will be a bit complex.

huddlej

This is so cool, @jameshadfield. I would be happy to use this now as it is, but I've provided more detailed comments below.

I love that I can compare categorical fields with each other and with quantitative fields! Just this functionality alone allows us to investigate some interesting questions for influenza that were previously more difficult with the tree view.

Which clades have the most epitope mutations and what is their distribution? This same type of view could be helpful for inspecting which flu clades are underrepresented for total titer measurements.

When were clade strains most recently submitted? The information is present in the corresponding tree colored by submission date with branch labels, but this view is much more succinct and unambiguous.

Which clades are present in each region? This is the kind of view where the ability to tune the opacity of points would produce a density plot where more overlapping points communicates a higher density. You could also imagine eventually supporting a radius encoding for that overlap to produce a punchcard plot.

Which regions have strains that are most likely to seed the future and what clades are those strains from?

What is the LBI of the closest strains to the future for two specific regions with the highest LBI strains? Enables a side-by-side comparison of forecast distributions between regions with the evidence from LBI to support those forecasts.

I wish that I could:

set an alpha/opacity value for each point to produce a quick density plot especially for categorical vs categorical plots
order categorical columns by any quantitative tip annotation. In addition to the UI and code complexity associated with this request, there’s also the issue of how to aggregate the quantitative values across each single categorical value. Would we order by the mean/median/max of the quantitative values?
filter tips to specific values in a category and have all other other values in the category disappear from the axis (see the last image above). This could be a flu-specific issue related to the vaccine strain “x” symbols though.

What if:

categorical vs. quantitative plots automatically sorted the categorical values by the mean/median of the quantitative field? This wouldn’t require any additional UI components unless you wanted to allow users to toggle between the default (maybe user-specific) and computed ordering.
we allowed users to define the order for any category based on another tip annotation that had a name based on the category field's name? For example, users could annotate a quantitative field named region-orderby based on latitude, for example, and that field would be automatically used to order the region values if it was present. This wouldn't be as nice as a full UI to select any arbitrary field but it could be a useful half-step. (This transmission network mockup is what I had in mind for this specific request.)

This implements a requested improvement to the original scatterplot implementation. The implementation hinges on two changes: (1) The collection of values for a given variable (e.g. x-var) need to be computed and passed to PhyloTree to act as the scale's domain. We reuse the colorScale machinery here, which could be optimised (see todo messages in code), but this has the advantage that the domain ordering matches the legend (unless user supplied). (2) PhyloTree needed to be modified to use non-linear scales, in this case `pointScale`. This commit should be fully functional, however there are some future improvements to be made: (i) Grid text is obscured and unreadable when there are many entries in the domain. (ii) Genotypes and Boolean scales are not yet available. (iii) Jitter should be added to nodes to avoid obfuscation.

This commit is in preparation for allowing genotype to be a scatterplot variable. This will complicate the allowable scatterplot variables and force these to update upon colorBy changes. This is much cleaner if layout is changed in a thunk.

jameshadfield · 2021-05-21T08:28:41Z

Thanks for the detailed review @huddlej! I've added jitter & a few other improvements to the PR, so you can re-click your links above and see a newer rendering of them.

Ordering is going to be complicated, so I'm going to avoid any changes in this PR!!!

filter tips to specific values in a category and have all other other values in the category disappear from the axis (see the last image above).

This isn't flu-specific, I'll take a look and see how easy a fix is going to be. (Underneath it's the same logic as we use for displaying dynamically filtered legend items, so I'm hopeful the fix will be easy.)

P.S. There's a bug somewhere which occurs in certain combinations of scatterplot variables and causes no nodes to be rendered, so this PR should at least wait until I've found that.

P.P.S. nextstrain/nextstrain.org#335 is a nextstrain.org PR using this version of Auspice, so you can use the Review App from that PR to access a wider range of datasets than the Auspice Review App.

huddlej · 2021-05-21T15:23:02Z

Nice! The jitter makes such a huge difference and basically gets at what I was thinking with reduced opacity/density plot view.

Is there a hard threshold on the viewport width for when jittering is activated? When I go to this view of distance to future by region the points per region are jittered, but when I turn on the grid display of the same scatterplot the jitter disappears. I can imagine that it is hard to cram everything into such a small view, though. I can't really think of a better solution, if there is a hard threshold for width. 🤷🏻

trvrb · 2021-05-21T16:45:48Z

Awesome! The jitter is an unexpectedly large improvement. I'm really happy with how things stand. And John, your flu examples are really interesting and things that I wouldn't have immediately thought to surface.

One small non-blocking issue. I saw from the commits that genotype was added as scatterplot option. But then I couldn't immediately figure out how to activate it. I eventually thought to select genotype as color by and toggle away from scatterplot layout and then toggle back to scatterplot layout. I then discovered the genotype option for x and y axes. I suspect this feature is somewhat hidden while it works like this (though many will be arriving at scatterplot after coloring tree by trait of interest, so it might not be so bad really).

Also, I just found a bug. If you go to: https://auspice-ordinal-scatter-39fzn6.herokuapp.com/ncov/global?c=gt-S_484,501&l=scatter&scatterY=gt to show sampling date vs 484,501 genotype everything works as expected, but then if you toggle "sampling date" to "emerging lineage" the points fly off the page. If you go to the resulting state directly https://auspice-ordinal-scatter-39fzn6.herokuapp.com/ncov/global?branches=hide&c=gt-S_484,501&l=scatter&scatterX=emerging_lineage&scatterY=gt it works as expected.

emmahodcroft · 2021-05-21T17:00:39Z

I'm really enjoying playing around with this @jameshadfield - and @huddlej your examples above helped to get me going: I hadn't realised all the options!

Anyway, unfortunately I'm also posting as I managed to find a bit of an odd bug. I originally had plotted by X: Genotype S: 501 (also as a colourby, of course), and Y: Clade.

I then was curious if it would auto-adjust to changing the colourby, so typed in 681. It does auto adjust! However, it does not auto-recolour:

(All the columns should be the same colour)

Interestingly the map did recolour so it's just something about the scatter.
This isn't a big deal at all - and changing to another colourby and changing back resolves it - but well, figured I'd write it up! 🙃

jameshadfield · 2021-05-24T00:03:12Z

Thanks all for the comments -- especially the two bugs and instructions on how to reproduce.

Is there a hard threshold on the viewport width for when jittering is activated?

There's a threshold of 50px spacing between values on an axis as below this it got confusing which point belonged to which value, but 50px is somewhat arbitrary! In Emma's screenshot you can see it's jittered on the x-axis but not the y-axis.

I saw from the commits that genotype was added as scatterplot option.

Yeah, it's somewhat secret squirrel. The only way we have of selecting a specific genotype is via the colouring (which uses the cascading dropdown UI); so if the colouring is not genotype then I don't think we can select this as a scatter variable without exposing the more complex UI, which I wanted to avoid. P.S. If the colouring is genotype, then genotype shows up as an option in the scatterplot variables.

Genotype is treated differently to other colorings in two important ways: (1) it can change value, for instance when changing the colorby to another genotype position and (2) it is stored in a different place to other colorings. These require scatterplot logic to be more complex as actions are no longer separate - we now require a NEW_COLOURS action to potentially update the layout which was formerly within the remit of the CHANGE_LAYOUT actions. This is achieved through a middleware layer. This implementation makes it clear that jitter and better domain spacing are crucial for scatterplots.

This prevents nodes falling on the axis itself or at the very end of the grid, which was especially noticeable for traits with small domains.

jameshadfield · 2021-05-24T05:48:35Z

Bugs all fixed & (I believe) this is now good to go 😄

@trvrb I changed the behaviour to improve visibility and consistency - if you have selected a genotype colouring and then click scatterplot, the y-axis will be genotype.

@huddlej I'm going to leave the dynamic domains (i.e. filtering updating the axes) for a future PR in order to get this out ASAP

trvrb · 2021-05-24T17:47:58Z

Fantastic! Thank you James. This is good to go from my perspective. Really excited about this as it opens up a bunch of expressive possibilities.

jameshadfield temporarily deployed to auspice-ordinal-scatter-39fzn6 May 18, 2021 05:54 Inactive

huddlej approved these changes May 21, 2021

View reviewed changes

jameshadfield added 2 commits May 21, 2021 12:47

Layout changes occur via redux thunk

e40b03f

This commit is in preparation for allowing genotype to be a scatterplot variable. This will complicate the allowable scatterplot variables and force these to update upon colorBy changes. This is much cleaner if layout is changed in a thunk.

jameshadfield force-pushed the ordinal-scatterplots branch from f6ee8e7 to 2d7b0a2 Compare May 21, 2021 08:18

jameshadfield temporarily deployed to auspice-ordinal-scatter-39fzn6 May 21, 2021 08:19 Inactive

jameshadfield mentioned this pull request May 21, 2021

Test auspice PR 1346 nextstrain/nextstrain.org#335

Closed

jameshadfield force-pushed the ordinal-scatterplots branch from 2d7b0a2 to 20d4985 Compare May 24, 2021 04:25

jameshadfield temporarily deployed to auspice-ordinal-scatter-39fzn6 May 24, 2021 04:26 Inactive

jameshadfield added 3 commits May 24, 2021 16:34

Improve padding for categorical scatterplot variables

442dd0b

This prevents nodes falling on the axis itself or at the very end of the grid, which was especially noticeable for traits with small domains.

Add jitter to categorical scatterplots

898a387

jameshadfield force-pushed the ordinal-scatterplots branch from 20d4985 to 898a387 Compare May 24, 2021 04:35

jameshadfield temporarily deployed to auspice-ordinal-scatter-39fzn6 May 24, 2021 04:35 Inactive

jameshadfield merged commit b58f148 into master May 24, 2021

jameshadfield deleted the ordinal-scatterplots branch May 24, 2021 22:23

jameshadfield mentioned this pull request May 24, 2021

Scatterplot axis domains should dynamically update #1351

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Allow non-continuous scatterplot variables #1346

Allow non-continuous scatterplot variables #1346

jameshadfield commented May 18, 2021

trvrb commented May 18, 2021

huddlej left a comment

jameshadfield commented May 21, 2021 •

edited

Loading

huddlej commented May 21, 2021

trvrb commented May 21, 2021 •

edited

Loading

emmahodcroft commented May 21, 2021

jameshadfield commented May 24, 2021 •

edited

Loading

jameshadfield commented May 24, 2021

trvrb commented May 24, 2021

Allow non-continuous scatterplot variables #1346

Allow non-continuous scatterplot variables #1346

Conversation

jameshadfield commented May 18, 2021

trvrb commented May 18, 2021

huddlej left a comment

Choose a reason for hiding this comment

jameshadfield commented May 21, 2021 • edited Loading

huddlej commented May 21, 2021

trvrb commented May 21, 2021 • edited Loading

emmahodcroft commented May 21, 2021

jameshadfield commented May 24, 2021 • edited Loading

jameshadfield commented May 24, 2021

trvrb commented May 24, 2021

jameshadfield commented May 21, 2021 •

edited

Loading

trvrb commented May 21, 2021 •

edited

Loading

jameshadfield commented May 24, 2021 •

edited

Loading