docs: add lantern accuracy data #3826
Conversation
Looks good, added a couple of comments/sections that would be helpful.
docs/lantern.md (outdated):

|  | FCP | FMP | TTI |
| -- | -- | -- | -- |
| Lantern predicting Default LH | .850 : 19.6% | .866 : 21.0% | .907 : 26.9% |
| Lantern predicting LH on WPT | .764 : 34.4% | .795 : 32.5% | .879 : 33.1% |
| Lantern w/adjusted settings predicting LH on WPT<sup>1</sup> | .769 : 32.9% | .808 : 31.1% | .879 : 32.6% |
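Each cell above pairs a rank correlation with an average error percentage (e.g. `.850 : 19.6%`). A minimal sketch of how such a pair could be computed is below; this is illustrative only — the function names and sample values are made up, not the PR's actual analysis code or data:

```python
def rank(values):
    # Map each value to its 1-based rank (no tie handling; fine for a sketch).
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for r, i in enumerate(order, start=1):
        ranks[i] = r
    return ranks

def spearman(xs, ys):
    # Spearman rho = Pearson correlation computed on the ranks.
    rx, ry = rank(xs), rank(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

def mean_abs_pct_error(pred, obs):
    return sum(abs(p - o) / o for p, o in zip(pred, obs)) / len(pred) * 100

# Hypothetical (metric prediction, observation) values in ms — not real Lantern data.
lantern_ms  = [1200, 3400, 2100, 5600, 900]
observed_ms = [1100, 3900, 2300, 5100, 2500]
rho = spearman(lantern_ms, observed_ms)
err = mean_abs_pct_error(lantern_ms, observed_ms)
print(f"{rho:.3f} : {err:.1f}%")  # → 0.700 : 20.9%
```

A higher first number means Lantern orders sites the same way the real metric does; a lower second number means the predicted values are close in absolute terms.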
What settings were adjusted here?
RTT, throughput, and CPU multipliers. I'll move the footnote that explains this onto "adjusted settings" instead of the end of the line 👍
docs/lantern.md (outdated):

<sup>1</sup> 320 ms RTT, 1.3 Mbps, 5x CPU

<sup>2</sup> Default LH traces and WPT traces were captured several weeks apart, so some site changes may have occurred that skew these stats
Can we add a section explaining what conclusions we are drawing from this data, and potential reasons for why Lantern is correlating TTI on WPT but not FMP/FCP?
Done-ish :) With the additional reference stats, the TTI/FMP/FCP difference isn't necessarily an outlier that needs explaining anymore IMO, but let me know if you still think it needs some hypotheses.
docs/lantern.md (outdated):

## Accuracy

All of the following accuracy stats are reported excluding the 10% tail, as the initial research found that approximately 10% of sites will radically vary simply by visiting the page a second time, through no fault of the metrics or prediction logic. This means the accuracy is slightly overstated but should still hold for the controlled-environment/repeated-view use case.
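The 10%-tail exclusion described here could be implemented by dropping the worst-error decile before aggregating. A sketch under that assumption — the pairs and names below are illustrative, not the PR's actual dataset or code:

```python
def pct_error(pred, obs):
    return abs(pred - obs) / obs * 100

# Hypothetical (prediction, observation) pairs in ms; the last site
# "radically varied" on the second visit, as the doc describes.
pairs = [(1200, 1100), (3400, 3900), (2100, 2300), (5600, 5100),
         (900, 1000), (2500, 2400), (4100, 4400), (7000, 6500),
         (1500, 1600), (3000, 9000)]

errors = sorted(pct_error(p, o) for p, o in pairs)
keep = errors[: int(len(errors) * 0.9)]  # exclude the worst 10% tail
trimmed_mean = sum(keep) / len(keep)
full_mean = sum(errors) / len(errors)
print(f"full: {full_mean:.1f}%  trimmed: {trimmed_mean:.1f}%")  # → full: 14.2%  trimmed: 8.4%
```

This shows why the reported accuracy is "slightly overstated": one unstable site nearly doubles the untrimmed mean error.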
Can we make it explicit that this was calculated based on an analysis of 1500 URLs, run only once?
done
@vinamratasingal I believe I have addressed your concerns, mind taking another look?
Almost :) I have one more question that needs to be answered before giving an LGTM.
docs/lantern.md (outdated):

## Conclusions

### Lantern Accuracy Conclusions
Definitive conclusions on repeat view accuracy require much more data for the same URLs (i.e. more than one run per URL per environment), but for the single view use case, Lantern is roughly as accurate at predicting the rank of a website the next time you visit it as the metrics themselves, which is the highest goal we set out to achieve. As a sanity check, we also see that using the unthrottled metrics to predict the rank of throttled performance has a significantly lower rank correlation than Lantern.
Maybe I'm just being silly, but can you help me understand what "Lantern is roughly as accurate at predicting the rank of a website the next time you visit it as the metrics themselves which is the highest goal we set out to achieve." means?
This sentence is just pointing out that the rank correlation of Lantern with LH is roughly the same as LH with LH. In other words, running Lantern once on a URL gives you just as good a clue about the next load time as loading the site for real, i.e. the accuracy of Lantern is smaller than or equal to the natural deviation of load timing.
The jury is still out on how inaccurate the estimate would be if you could run it 100 times, which is identified in future work.
Let me know which snippets from here you think are worth including or if it's still clear as mud :)
Hmmm, yeah, the language feels a bit unclear to me. What would be helpful for me is to reframe this section into two bullets:

- For the single view use case, we conclude that the rank correlation of Lantern with LH is roughly the same as LH with LH. [add 1 sentence explaining what this means based on what you said above]
- For repeat view accuracy, we need to do more work.
How's this:

- For the single view use case, we conclude that the rank correlation of Lantern with standard LH is roughly the same as the rank correlation between any two arbitrary LH runs. That is to say, the average error we observe between a Lantern performance score and a LH on DevTools performance score is within the expected natural deviation. As a sanity check, we also see that using the unthrottled metrics to predict throttled performance has a significantly lower correlation than Lantern does.
- For the repeat view use case, we require more data to reach a conclusion, but the high correlation of the single view use case suggests the accuracy meets our correlation requirements even if some sites may diverge.
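The two bullets above can be illustrated numerically. With entirely made-up metric values (six hypothetical sites, not the PR's data), a Lantern-vs-LH rank correlation lands in the same range as LH-run-A-vs-run-B, while unthrottled-vs-throttled lands noticeably lower:

```python
def spearman(xs, ys):
    # Spearman rho via the rank-difference formula, 1 - 6*sum(d^2)/(n(n^2-1)).
    # Valid when there are no ties, which holds for the sample data below.
    n = len(xs)
    order = lambda v: sorted(range(n), key=lambda i: v[i])
    rx = [0] * n
    ry = [0] * n
    for r, i in enumerate(order(xs)):
        rx[i] = r
    for r, i in enumerate(order(ys)):
        ry[i] = r
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Hypothetical throttled TTI values (ms) for six sites.
lh_run_a    = [2000, 4500, 3200, 8000, 1500, 6000]
lh_run_b    = [2100, 6300, 3500, 7600, 1600, 6000]  # same sites, second run
lantern     = [1900, 4800, 3000, 7400, 2000, 5800]  # Lantern's predictions
unthrottled = [800, 900, 1500, 1200, 700, 1100]     # ordering differs more

print(f"Lantern vs LH run A: {spearman(lantern, lh_run_a):.3f}")      # 0.943
print(f"LH run A vs run B:   {spearman(lh_run_a, lh_run_b):.3f}")     # 0.943
print(f"unthrottled vs LH:   {spearman(unthrottled, lh_run_a):.3f}")  # 0.657
```

The first two correlations matching is the single-view conclusion; the third being lower is the sanity check.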
Sounds great! LGTM coming your way :)
closes #3691
Before rushing off to improve accuracy, the next AI (action item) is on @vinamratasingal to draw up a document on what we actually want to achieve accuracy with.