-
Notifications
You must be signed in to change notification settings - Fork 163
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Allow non-continuous scatterplot variables #1346
Conversation
This is fantastic. A couple notes as I review:
Medium priority would be adding jitter. I don't think blocking PR (which is accurate and useful as it stands), but I do think jitter would be a nice addition. Though I see why automating jitter will be a bit complex. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is so cool, @jameshadfield. I would be happy to use this now as it is, but I've provided more detailed comments below.
I love that I can compare categorical fields with each other and with quantitative fields! Just this functionality alone allows us to investigate some interesting questions for influenza that were previously more difficult with the tree view.
Which clades have the most epitope mutations and what is their distribution? This same type of view could be helpful for inspecting which flu clades are underrepresented for total titer measurements.
When were clade strains most recently submitted? The information is present in the corresponding tree colored by submission date with branch labels, but this view is much more succinct and unambiguous.
Which clades are present in each region? This is the kind of view where the ability to tune the opacity of points would produce a density plot where more overlapping points communicates a higher density. You could also imagine eventually supporting a radius encoding for that overlap to produce a punchcard plot.
Which regions have strains that are most likely to seed the future and what clades are those strains from?
What is the LBI of the closest strains to the future for two specific regions with the highest LBI strains? Enables a side-by-side comparison of forecast distributions between regions with the evidence from LBI to support those forecasts.
I wish that I could:
- set an alpha/opacity value for each point to produce a quick density plot especially for categorical vs categorical plots
- order categorical columns by any quantitative tip annotation. In addition to the UI and code complexity associated with this request, there’s also the issue of how to aggregate the quantitative values across each single categorical value. Would we order by the mean/median/max of the quantitative values?
- filter tips to specific values in a category and have all other other values in the category disappear from the axis (see the last image above). This could be a flu-specific issue related to the vaccine strain “x” symbols though.
What if:
- categorical vs. quantitative plots automatically sorted the categorical values by the mean/median of the quantitative field? This wouldn’t require any additional UI components unless you wanted to allow users to toggle between the default (maybe user-specific) and computed ordering.
- we allowed users to define the order for any category based on another tip annotation that had a name based on the category field's name? For example, users could annotate a quantitative field named
region-orderby
based on latitude, for example, and that field would be automatically used to order theregion
values if it was present. This wouldn't be as nice as a full UI to select any arbitrary field but it could be a useful half-step. (This transmission network mockup is what I had in mind for this specific request.)
This implements a requested improvement to the original scatterplot implementation. The implementation hinges on two changes: (1) The collection of values for a given variable (e.g. x-var) need to be computed and passed to PhyloTree to act as the scale's domain. We reuse the colorScale machinery here, which could be optimised (see todo messages in code), but this has the advantage that the domain ordering matches the legend (unless user supplied). (2) PhyloTree needed to be modified to use non-linear scales, in this case `pointScale`. This commit should be fully functional, however there are some future improvements to be made: (i) Grid text is obscured and unreadable when there are many entries in the domain. (ii) Genotypes and Boolean scales are not yet available. (iii) Jitter should be added to nodes to avoid obfuscation.
This commit is in preparation for allowing genotype to be a scatterplot variable. This will complicate the allowable scatterplot variables and force these to update upon colorBy changes. This is much cleaner if layout is changed in a thunk.
f6ee8e7
to
2d7b0a2
Compare
Thanks for the detailed review @huddlej! I've added jitter & a few other improvements to the PR, so you can re-click your links above and see a newer rendering of them. Ordering is going to be complicated, so I'm going to avoid any changes in this PR!!!
This isn't flu-specific, I'll take a look and see how easy a fix is going to be. (Underneath it's the same logic as we use for displaying dynamically filtered legend items, so I'm hopeful the fix will be easy.) P.S. There's a bug somewhere which occurs in certain combinations of scatterplot variables and causes no nodes to be rendered, so this PR should at least wait until I've found that. P.P.S. nextstrain/nextstrain.org#335 is a nextstrain.org PR using this version of Auspice, so you can use the Review App from that PR to access a wider range of datasets than the Auspice Review App. |
Nice! The jitter makes such a huge difference and basically gets at what I was thinking with reduced opacity/density plot view. Is there a hard threshold on the viewport width for when jittering is activated? When I go to this view of distance to future by region the points per region are jittered, but when I turn on the grid display of the same scatterplot the jitter disappears. I can imagine that it is hard to cram everything into such a small view, though. I can't really think of a better solution, if there is a hard threshold for width. 🤷🏻 |
Awesome! The jitter is an unexpectedly large improvement. I'm really happy with how things stand. And John, your flu examples are really interesting and things that I wouldn't have immediately thought to surface. One small non-blocking issue. I saw from the commits that genotype was added as scatterplot option. But then I couldn't immediately figure out how to activate it. I eventually thought to select genotype as color by and toggle away from scatterplot layout and then toggle back to scatterplot layout. I then discovered the genotype option for x and y axes. I suspect this feature is somewhat hidden while it works like this (though many will be arriving at scatterplot after coloring tree by trait of interest, so it might not be so bad really). Also, I just found a bug. If you go to: https://auspice-ordinal-scatter-39fzn6.herokuapp.com/ncov/global?c=gt-S_484,501&l=scatter&scatterY=gt to show sampling date vs 484,501 genotype everything works as expected, but then if you toggle "sampling date" to "emerging lineage" the points fly off the page. If you go to the resulting state directly https://auspice-ordinal-scatter-39fzn6.herokuapp.com/ncov/global?branches=hide&c=gt-S_484,501&l=scatter&scatterX=emerging_lineage&scatterY=gt it works as expected. |
I'm really enjoying playing around with this @jameshadfield - and @huddlej your examples above helped to get me going: I hadn't realised all the options! Anyway, unfortunately I'm also posting as I managed to find a bit of an odd bug. I originally had plotted by I then was curious if it would auto-adjust to changing the colourby, so typed in
Interestingly the map did recolour so it's just something about the scatter. |
Thanks all for the comments -- especially the two bugs and instructions on how to reproduce.
There's a threshold of 50px spacing between values on an axis as below this it got confusing which point belonged to which value, but 50px is somewhat arbitrary! In Emma's screenshot you can see it's jittered on the x-axis but not the y-axis.
Yeah, it's somewhat secret squirrel. The only way we have of selecting a specific genotype is via the colouring (which uses the cascading dropdown UI); so if the colouring is not genotype then I don't think we can select this as a scatter variable without exposing the more complex UI, which I wanted to avoid. P.S. If the colouring is genotype, then genotype shows up as an option in the scatterplot variables. |
2d7b0a2
to
20d4985
Compare
Genotype is treated differently to other colorings in two important ways: (1) it can change value, for instance when changing the colorby to another genotype position and (2) it is stored in a different place to other colorings. These require scatterplot logic to be more complex as actions are no longer separate - we now require a NEW_COLOURS action to potentially update the layout which was formerly within the remit of the CHANGE_LAYOUT actions. This is achieved through a middleware layer. This implementation makes it clear that jitter and better domain spacing are crucial for scatterplots.
This prevents nodes falling on the axis itself or at the very end of the grid, which was especially noticeable for traits with small domains.
20d4985
to
898a387
Compare
Bugs all fixed & (I believe) this is now good to go 😄 @trvrb I changed the behaviour to improve visibility and consistency - if you have selected a genotype colouring and then click scatterplot, the y-axis will be genotype. @huddlej I'm going to leave the dynamic domains (i.e. filtering updating the axes) for a future PR in order to get this out ASAP |
Fantastic! Thank you James. This is good to go from my perspective. Really excited about this as it opens up a bunch of expressive possibilities. |
This implements a requested improvement to the original
scatterplot implementation. The implementation hinges on two changes:
(1) The collection of values for a given variable (e.g. x-var) need
to be computed and passed to PhyloTree to act as the scale's domain.
We reuse the colorScale machinery here, which could be optimised
(see todo messages in code), but this has the advantage that the
domain ordering matches the legend (unless user supplied).
(2) PhyloTree needed to be modified to use non-linear scales, in this
case
pointScale
.This commit should be fully functional, however there are some
future improvements to be made:
(i) Grid text is obscured and unreadable when there are many entries
in the domain.
(ii) Genotypes and Boolean scales are not yet available.
(iii) Jitter should be added to nodes to avoid obfuscation.
@trvrb @huddlej would you mind trying this out? Thanks in advance.