Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Entropy panel updates #478

Merged
merged 86 commits into from
Jan 18, 2018
Merged

Entropy panel updates #478

merged 86 commits into from
Jan 18, 2018

Conversation

jameshadfield
Copy link
Member

@jameshadfield jameshadfield commented Nov 28, 2017

This PR changes how the data shown in the entropy panel is calculated. The data is now calculated from the mutations on the tree and therefore updates as the visibility of nodes changes (e.g. after filtering, changing the date). This PR would close #470.

entropy vs change counts (updated)

The chart actually shows the number of observed mutations at a given position, simply because this is more straightforward to calculate. It's entirely possible to calculate entropy, but before I go down this road, perhaps the number of mutations is "better"? Calculating entropy would require knowledge of the bases at every visible node, not just a count of the mutations (I've sketched a way to do this in a single traversal).
There is a flag in globals - trueEntropyCalc - which toggles between the number of observed mutations vs entropy (it only works for nucleotides, will make it work for codons ASAP)
There is now a toggle in the panel to switch between entropy and number of mutations observed in the tree.

we can get rid of entropy.json

The only data that the JSON is used for is to create the gene annotations ({start, end, name}) and genome size (bp). This data could easily be included in the metadata JSON, thus getting rid of the entropy json.

Performance

The code to calculate the new data is surprisingly cheap (3-10ms), however the D3 rendering takes ~50ms. The calculation is therefore debounced (at 500ms) so it will not run when frequent changes to the visibility are happening. The structure of the data and the algorithm can be improved further to speed things up (I don't want to invest time in this if the end result is not what we're after...)


P.S. This PR is built upon PRs #477 (which is build upon #476) as it was helpful to have the genotype URLs working.

@rneher
Copy link
Member

rneher commented Dec 16, 2017

sorry for the slow response here. I really like this direction and suggest we push it a little further still by getting rid of the sequence.json as well. But first a few specific points.

  • Doesn't this line

https://github.com/nextstrain/auspice/blob/entropy/src/util/entropy.js#L84

overwrite the ancestral state if there are recurrent mutations at the same site? Since anc_state lives in the scope outside of the recursion and is used after the recursion, this can give wrong results. A few other ideas:

  • the current implementation basically duplicates the code for nucs and aa. any reason why you didn't collapse them?
  • We could get rid of the sequence.json along with the entropy.json. Based on what you got, we can easily extract the state of every node and replace this

https://github.com/nextstrain/auspice/blob/entropy/src/util/getGenotype.js#L27

by a function that parses mutations on the tree. (of course it would be much more efficient to get all states at once).

implementation would probably be easiest if we had a temporary field current_gt in every node. Basically pick gene/position, assign state to all nodes, determine color scale and color if genotype coloring is the goal, otherwise calc entropy. In the latter case, one would simply do this for every gene and position.

@rneher
Copy link
Member

rneher commented Dec 16, 2017

this is roughly what I am thinking (pseudo-code, really):

export const assign_sequence(nodes, anc_state, gene, pos){
  const root_node = nodes[0];
  root_node.tmp_gt = anc_state[gene][pos];
  for (const child of root_node.children) {
      assign_sequence_recursive(child, gene, pos);
    }
}

const assign_sequence_recursive(node, gene, pos){
  if (node.muts[gene][pos]){ //assuming muts = {'nuc':{123:['A', 'G'], 345:['C', 'T']}, 'HA1':{}}
    node.tmp_gt = node.muts[gene][pos][1];
  }else{
    node.tmp_gt = node.up.tmp_gt;
  }
  for (const child of node.children) {
    assign_sequence_recursive(child, gene, pos);
  }
}


const vis_state_count(nodes, anc_state, gene, pos){
  assign_sequence(nodes, anc_state, gene, pos);
  const counts={}
  for (var i=0; i<nodes.length; i++){
    if (visibility[i]&nodes[i].is_terminal()){
      if (!counts[nodes[i].tmp_gt]){
        counts[nodes[i].tmp_gt]=1;
      }else{
        counts[nodes[i].tmp_gt]++;        
      }
    }
  }
  return counts;
}

of course a traversal or all positions is probably slower than looping over variable position within one traversal. but one could be smarter about this by first calculating for each node the position that are variable within the clade and the number of visible nodes.

@jameshadfield
Copy link
Member Author

jameshadfield commented Dec 18, 2017

Thanks @rneher

  • re: anc_state - https://github.com/nextstrain/auspice/blob/entropy/src/util/entropy.js#L84 - this is only hit the first time a mutation is encountered at this position (even though it's inside recurse) so should never overwrite. I'll double check that now.
  • I'll combine the code blocks (nt & aa) - they were duplicated just to make things easier when I wasn't sure this was the right direction to go now. DONE
  • getting rid of sequence.json would be brilliant! I'll inplement something along those lines to see. (During this i found that tree traversals are ridiculously cheap - orders of magnitude cheaper than the D3 code we use)

@rneher
Copy link
Member

rneher commented Dec 18, 2017

Thanks, my bad re anc_state. But I am still fuzzy what happens to the state[pos] and the Object.assign({}, state) as it is passed down to children. Is state[pos] then modified in place overwriting previously set state? Or does Object.assign perform deep copy?

@jameshadfield
Copy link
Member Author

jameshadfield commented Dec 18, 2017

No worries. Object.assign is 1-layer-deep-copy - each time we traverse (recurse) from "parent X" down "child X1" we want to pass down a copy of the state so that it may be modified (if there's a mutation, e.g.), and then later in the recursion, when we traverse down "child X2" we need to have remembered the state at X to give to it. So that's why there's all this copying of state. (I'm sure there is a cleverer algorithm!)

Cloning of objects in JS is not good... Object.assign will deep copy things like {key -> INT, key-> STRING} but not {key-> OBJECT} (the object will be a reference). Unbelievably, the best way to deep-copy complex structs is to convert to JSON and then parse that.

In other news, i've gotten rid of the sequences.json - all that it's used for is the gene lengths, which i'll add to meta.json (they're needed to get rid of the entropy.json as well).

@rneher
Copy link
Member

rneher commented Dec 18, 2017

cool. the sequence json was giving us headaches when looking at the vcf/Tb cases.

@jameshadfield
Copy link
Member Author

The good news: We no longer use sequences or entropy JSONs :)

The bad news: This branch now needs "updated" metadata JSONs via nextstrain/augur@bf67f14

@rneher
Copy link
Member

rneher commented Dec 19, 2017

fabulous how this makes things simpler! I think we should test this on some of the different builds for a few days (could you put jsons on dev?), but otherwise it can go in.

@jameshadfield
Copy link
Member Author

jameshadfield commented Dec 21, 2017

Ok, i've come up with a way to test this - it's a bit complicated as this branch can't load the old augur JSONs, and you can't switch to the staging server JSONs without first successfully loading something!

  1. Go to: https://auspice-dev.herokuapp.com/measles ("new" JSONs are on the live server)
  2. Change to the staging server (i.e. "new" JSONs) via the toggle at the bottom of the sidebar
  3. Switch datasets via the dropdown - currently the staging server has "new" JSONs for ebola, zika, measles & mumps

@rneher
Copy link
Member

rneher commented Dec 21, 2017

thanks, didn't realize this required that much tweaking....
one question: I am having trouble reconciling the entropy values with diversity. have a look at

https://auspice-dev.herokuapp.com/measles?c=gt-L_610

and zoom into the bottom clade (should be a 2/11 and 9/11 mix of P and Q). The displayed entropy value is 0.584, but it should be -plog(p)-qlog(q) = 0.47413931305783746

also fuzzy about this line:
https://github.com/nextstrain/auspice/blob/entropy/src/util/entropy.js#L156

shouldn't this be simply the number of visible tips?

@trvrb
Copy link
Member

trvrb commented Jan 3, 2018

@jameshadfield: I really like this direction. I'm mildly surprised that performance didn't suffer more, but great that it's working well. I wasn't able to break this in testing. Two things:

  1. I think I prefer keeping the original "Diversity" non-clickable title to the panel and then moving "ENTROPY COUNTS" to the right, next to "AA NT".

  2. I guess the best JSON migration strategy is to:

  • run every build with updated augur
  • pushing JSONs just to staging
  • confirm that all new JSONs are working
  • simultaneously push auspice update and swap staging JSON live

This will be somewhat annoying, but not terrible. However, I wonder if we could make things easier by having a transitional redundant JSON format which has new annotations in meta.json, but keeps entropy.json around for the moment, thus making a JSON format that can be read by current auspice master and also by auspice entropy branch. This should work right?

@trvrb trvrb mentioned this pull request Jan 4, 2018
@jameshadfield
Copy link
Member Author

@trvrb re: updating JSONs, if we run new augur builds they will produce both the updated meta JSON (usable by auspice master & this PR) as well as the entropy & sequences JSONs (used only by master). So we can update all the files on S3 without worry and then merge this into master after testing...

* url-middleware:
  remove console logs
  posts use URL pushState
  use pushState and replaceState
  fix choose-dataset selector & pageChange API
  store dataset name in datasets.datapath
  remove npm history package (window.url)
  post URLs working (and they use redux)
  clean up URL handling
  handle URLs ourselves (remove react router)  (incomplete)
  restore modifyURLquery (now deprecated)
  remove context where no longer needed
  date URL changes in middleware
  panel layour & distance measure URL changes in middleware
  geo resolution URL changes in middleware
  layout URL changes in middleware
  filters now change URL in middleware
  move url state of  colorBy to middleware
  redux middleware prototype working
@jameshadfield
Copy link
Member Author

jameshadfield commented Jan 10, 2018

This branch now up at https://auspice-dev.herokuapp.com including the url-middleware PR. All new JSONs thanks to @trvrb

Bugs to fix before merge:

@jameshadfield
Copy link
Member Author

jameshadfield commented Jan 18, 2018

Hey @rneher, i've now fixed the bug in the entropy calculation (04ddea1) and tested it on trial data to ensure it's correct.

However, there are differences to the augur entropies! Why? Augur calculates entropy on the counts excluding gaps and Ns. However, the mutations encoded in the tree do not allow mutations to N, so as far as auspice is concerned, the base looks like its parent node (this is why we never get Ns when colouring by genotype). So the base/AA counts for auspice vs augur differ for columns where there are Ns or gaps, and therefore the entropy values differ. Some examples:

  • 5 taxa, with 2 As, 1G and 2Ns inferred as As: augur = 0.63, auspice = 0.50
  • 100 taxa, with 70As, 25Gs, 5Ns inferred as As : 0.58 vs 0.56

cc @huddlej @sidneymbell this may be of interest to you.

@jameshadfield jameshadfield merged commit b33e2c1 into master Jan 18, 2018
@jameshadfield jameshadfield deleted the entropy branch January 18, 2018 01:41
@rneher
Copy link
Member

rneher commented Jan 18, 2018

good point! great work!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

keep entropy in sync with visibility
3 participants