Entropy panel updates #478
sorry for the slow response here. I really like this direction and suggest we push it a little further still by getting rid of the sequence.json as well. But first a few specific points.
https://github.com/nextstrain/auspice/blob/entropy/src/util/entropy.js#L84 overwrite the ancestral state if there are recurrent mutations at the same site? Since
https://github.com/nextstrain/auspice/blob/entropy/src/util/getGenotype.js#L27 by a function that parses mutations on the tree. (of course it would be much more efficient to get all states at once). implementation would probably be easiest if we had a temporary field |
this is roughly what I am thinking (pseudo-code, really):
// assumes each node has muts = {'nuc': {123: ['A', 'G'], 345: ['C', 'T']}, 'HA1': {}}
// and a pointer `up` to its parent
const is_terminal = (node) => !node.children || node.children.length === 0;
export const assign_sequence = (nodes, anc_state, gene, pos) => {
  const root_node = nodes[0];
  root_node.tmp_gt = anc_state[gene][pos];
  for (const child of root_node.children || []) {
    assign_sequence_recursive(child, gene, pos);
  }
};
const assign_sequence_recursive = (node, gene, pos) => {
  const mut = node.muts && node.muts[gene] && node.muts[gene][pos];
  if (mut) {
    node.tmp_gt = mut[1]; // derived state of the mutation on this branch
  } else {
    node.tmp_gt = node.up.tmp_gt; // no mutation here: inherit from the parent
  }
  for (const child of node.children || []) {
    assign_sequence_recursive(child, gene, pos);
  }
};
// visibility is assumed to be an array parallel to nodes
const vis_state_count = (nodes, visibility, anc_state, gene, pos) => {
  assign_sequence(nodes, anc_state, gene, pos);
  const counts = {};
  for (let i = 0; i < nodes.length; i++) {
    if (visibility[i] && is_terminal(nodes[i])) {
      if (!counts[nodes[i].tmp_gt]) {
        counts[nodes[i].tmp_gt] = 1;
      } else {
        counts[nodes[i].tmp_gt]++;
      }
    }
  }
  return counts;
};
of course one traversal per position is probably slower than looping over the variable positions within a single traversal. but one could be smarter about this by first calculating, for each node, the positions that are variable within the clade and the number of visible nodes.
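The single-traversal idea in the last sentence could look roughly like the sketch below. It is an illustration only, not code from this branch: the function name `stateCountsAllPositions` and the per-node `visible` flag are hypothetical, and the `muts` shape is the one assumed in the pseudo-code. The traversal carries the derived states seen on the path from the root, records them at each visible tip, and at the end assigns the ancestral state to every visible tip that had no mutation at that position.

```javascript
// Hypothetical sketch: state counts at every variable position in one traversal.
// Assumes node.muts[gene] = {pos: [from, to]}, node.children, and node.visible.
function stateCountsAllPositions(root, ancState, gene) {
  const counts = {};          // pos -> {state: number of visible tips}
  const current = new Map();  // pos -> derived state on the path from the root
  let nVisibleTips = 0;

  const visit = (node) => {
    const muts = (node.muts && node.muts[gene]) || {};
    const undo = [];
    for (const pos of Object.keys(muts)) {
      undo.push([pos, current.has(pos), current.get(pos)]);
      current.set(pos, muts[pos][1]);
    }
    const children = node.children || [];
    if (children.length === 0) {
      if (node.visible) {
        nVisibleTips++;
        for (const [pos, state] of current) {
          counts[pos] = counts[pos] || {};
          counts[pos][state] = (counts[pos][state] || 0) + 1;
        }
      }
    } else {
      for (const child of children) visit(child);
    }
    // Restore the path state before returning to the parent.
    for (const [pos, had, prev] of undo.reverse()) {
      if (had) current.set(pos, prev);
      else current.delete(pos);
    }
  };
  visit(root);

  // Visible tips with no mutation on their path carry the ancestral state.
  for (const pos of Object.keys(counts)) {
    const seen = Object.values(counts[pos]).reduce((a, b) => a + b, 0);
    const remaining = nVisibleTips - seen;
    if (remaining > 0) {
      const anc = ancState[gene][pos];
      counts[pos][anc] = (counts[pos][anc] || 0) + remaining;
    }
  }
  return counts;
}

// Made-up example tree: three visible tips, one hidden tip.
const exampleTree = {
  children: [
    { visible: true, muts: { nuc: { 5: ["A", "G"] } } },
    {
      muts: { nuc: { 5: ["A", "T"], 7: ["C", "T"] } },
      children: [
        { visible: true },
        { visible: false },
        { visible: true, muts: { nuc: { 5: ["T", "A"] } } },
      ],
    },
  ],
};
const exampleCounts = stateCountsAllPositions(exampleTree, { nuc: { 5: "A", 7: "C" } }, "nuc");
// position 5: one G, one T, one A; position 7: two T, one C
console.log(exampleCounts);
```

Positions that never mutate on a path to a visible tip do not appear in the result, which is what makes this cheaper than visiting every site.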
Thanks @rneher
Thanks, my bad re |
No worries. Cloning of objects in JS is not good... In other news, i've gotten rid of the sequences.json - all that it's used for is the gene lengths, which i'll add to meta.json (they're needed to get rid of the entropy.json as well). |
cool. the sequence json was giving us headaches when looking at the vcf/Tb cases. |
The good news: We no longer use sequences or entropy JSONs :) The bad news: This branch now needs "updated" metadata JSONs via nextstrain/augur@bf67f14 |
fabulous how this makes things simpler! I think we should test this on some of the different builds for a few days (could you put jsons on dev?), but otherwise it can go in. |
Ok, i've come up with a way to test this - it's a bit complicated as this branch can't load the old augur JSONs, and you can't switch to the staging server JSONs without first successfully loading something!
thanks, didn't realize this required that much tweaking....
https://auspice-dev.herokuapp.com/measles?c=gt-L_610 and zoom into the bottom clade (should be a 2/11 and 9/11 mix of P and Q). The displayed entropy value is 0.584, but it should be -p log(p) - q log(q) = 0.47413931305783746 (natural log, with p = 2/11 and q = 9/11).
also fuzzy about this line: shouldn't this be simply the number of visible tips?
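The expected value quoted above can be reproduced with a few lines. This is a minimal sketch, assuming natural logarithms (which matches the quoted number) and a hypothetical helper name `shannonEntropy`; it is not the function used in the branch.

```javascript
// Shannon entropy (natural log) of a set of state counts, e.g. {P: 2, Q: 9}.
function shannonEntropy(counts) {
  const values = Object.values(counts);
  const total = values.reduce((a, b) => a + b, 0);
  if (total === 0) return 0;
  let h = 0;
  for (const n of values) {
    if (n > 0) {
      const p = n / total;
      h -= p * Math.log(p);
    }
  }
  return h;
}

const cladeEntropy = shannonEntropy({ P: 2, Q: 9 });
console.log(cladeEntropy.toFixed(3)); // 0.474
```

The denominator here is the total number of counted tips, which is the point of the question about using the number of visible tips.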
@jameshadfield: I really like this direction. I'm mildly surprised that performance didn't suffer more, but great that it's working well. I wasn't able to break this in testing. Two things:
This will be somewhat annoying, but not terrible. However, I wonder if we could make things easier by having a transitional redundant JSON format which has new |
@trvrb re: updating JSONs, if we run new augur builds they will produce both the updated meta JSON (usable by auspice master & this PR) as well as the entropy & sequences JSONs (used only by master). So we can update all the files on S3 without worry and then merge this into master after testing... |
* url-middleware:
  * remove console logs
  * posts use URL pushState
  * use pushState and replaceState
  * fix choose-dataset selector & pageChange API
  * store dataset name in datasets.datapath
  * remove npm history package (window.url)
  * post URLs working (and they use redux)
  * clean up URL handling
  * handle URLs ourselves (remove react router) (incomplete)
  * restore modifyURLquery (now deprecated)
  * remove context where no longer needed
  * date URL changes in middleware
  * panel layout & distance measure URL changes in middleware
  * geo resolution URL changes in middleware
  * layout URL changes in middleware
  * filters now change URL in middleware
  * move url state of colorBy to middleware
  * redux middleware prototype working
This branch now up at https://auspice-dev.herokuapp.com including the Bugs to fix before merge:
Hey @rneher, i've now fixed the bug in the entropy calculation (04ddea1) and tested it on trial data to ensure it's correct. However, there are differences to the augur entropies! Why? Augur calculates entropy on the counts excluding gaps and
cc @huddlej @sidneymbell this may be of interest to you. |
good point! great work! |
This PR changes how the data shown in the entropy panel is calculated. The data is now calculated from the mutations on the tree and therefore updates as the visibility of nodes changes (e.g. after filtering, changing the date). This PR would close #470.
entropy vs change counts (updated)
The chart actually shows the number of observed mutations at a given position, simply because this is more straightforward to calculate. It's entirely possible to calculate entropy, but before I go down this road, perhaps the number of mutations is "better"? Calculating entropy would require knowledge of the bases at every visible node, not just a count of the mutations (I've sketched a way to do this in a single traversal).
There is a flag in globals - trueEntropyCalc - which toggles between the number of observed mutations vs entropy (it only works for nucleotides, will make it work for codons ASAP).
There is now a toggle in the panel to switch between entropy and number of mutations observed in the tree.
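For illustration, counting observed mutations per position over the visible nodes could be sketched as follows. This is not the code in this PR: `countMutations` is a hypothetical name, `visibility` is assumed to be an array parallel to `nodes`, and the `muts` shape is the one assumed in the pseudo-code earlier in the thread.

```javascript
// Hypothetical sketch: number of mutations observed at each position,
// counting only branches leading to visible nodes.
// Assumes node.muts[gene] = {pos: [from, to]}.
function countMutations(nodes, visibility, gene) {
  const counts = {};
  nodes.forEach((node, i) => {
    if (!visibility[i]) return;
    const muts = (node.muts && node.muts[gene]) || {};
    for (const pos of Object.keys(muts)) {
      counts[pos] = (counts[pos] || 0) + 1;
    }
  });
  return counts;
}

// Made-up data: the third node is filtered out, so its mutation is ignored.
const exampleNodes = [
  { muts: { nuc: { 123: ["A", "G"] } } },
  { muts: { nuc: { 123: ["G", "A"], 345: ["C", "T"] } } },
  { muts: { nuc: { 345: ["C", "T"] } } },
];
const mutationCounts = countMutations(exampleNodes, [true, true, false], "nuc");
console.log(mutationCounts);
```

Because only the mutation lists are read, no ancestral state or per-node genotype is needed, which is why this is cheaper than a true entropy calculation.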
we can get rid of entropy.json
The only data that the JSON is used for is to create the gene annotations ({start, end, name}) and genome size (bp). This data could easily be included in the metadata JSON, thus getting rid of the entropy json.
Performance
The code to calculate the new data is surprisingly cheap (3-10ms), however the D3 rendering takes ~50ms. The calculation is therefore debounced (at 500ms) so it will not run when frequent changes to the visibility are happening. The structure of the data and the algorithm can be improved further to speed things up (I don't want to invest time in this if the end result is not what we're after...)
P.S. This PR is built upon PRs #477 (which is built upon #476) as it was helpful to have the genotype URLs working.
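The debouncing described under Performance can be sketched in plain JavaScript as below. This is a minimal illustration with hypothetical names (`debounce`, `recalculateEntropy`); the branch itself may well use a library helper instead.

```javascript
// Minimal trailing-edge debounce: fn runs once, waitMs after the last call.
function debounce(fn, waitMs) {
  let timer = null;
  return (...args) => {
    clearTimeout(timer);
    timer = setTimeout(() => fn(...args), waitMs);
  };
}

// Hypothetical stand-in for the recalculation plus D3 redraw.
let recalcRuns = 0;
const recalculateEntropy = () => {
  recalcRuns += 1;
};

const debouncedRecalc = debounce(recalculateEntropy, 500);

// Three visibility changes in quick succession trigger a single recalculation,
// 500ms after the last one.
debouncedRecalc();
debouncedRecalc();
debouncedRecalc();
```

With this pattern the ~50ms render cost is paid once per burst of visibility changes rather than once per change.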