Track session recording internal metrics #2560
Comments
Original issue: PostHog/posthog#2560
Part of PostHog/posthog#2560. Needed because we lack a feature: PostHog/posthog#2584
lzstring has performance problems on the server side, causing issues for session recording: see PostHog/posthog#2560. Note that this is not a backwards-compatible change; it requires a change on the posthog side to accept a text/plain content type as well. We can't use application/json as the content type, since that counts as a custom header and would trigger a CORS preflight for every XHR request.
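To make the CORS point concrete, here's a minimal sketch - the endpoint and payload shape are illustrative, not the actual posthog-js code:

```js
// Hypothetical stand-in for the compressed event batch posthog-js would send.
const body = 'data=<lz64-compressed events>';

// text/plain is a CORS-safelisted ("simple") content type, so the browser sends
// the POST directly. Using application/json instead would force an OPTIONS
// preflight before every capture request.
fetch('https://app.posthog.com/e/', {
  method: 'POST',
  headers: { 'Content-Type': 'text/plain' },
  body,
});
```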
I've posted two summaries to Slack previously, gonna put them here as well for posterity:
The next day:
Since then we found at least one of the true bottlenecks: lzstring in Python is slow enough to cause large events to time out.

Next actions

I'd like to try an experiment where we don't compress the JSON data at all for our own sites (app.posthog.com / posthog.com). Two PRs for this: #2648 and PostHog/posthog-js#144

Future

If this removes the failures we're seeing at https://app.posthog.com/dashboard/1240, we can discuss switching or dropping compression schemes, or optimizing the Python lzstring implementation. Any suggestions for lightweight browser compression libraries are welcome! Full snapshot events are pretty massive though - after this experiment I'll also experiment with some tooling that is new to rrweb since we integrated it.

LZString benchmarks

I took 4 sample requests from my network panel and generated 2 more (26kb and 2.5mb in size) and did some measurements. On posthog.com the average /e/ request with lzstring compression is ~500b, a large session recording batch is ~2kb, a large update (switching from the homepage to the docs) is 9kb, and a full snapshot event is ~370kb.

Time

Measurements were done on my Lenovo laptop using the ipython %timeit command.
Compression ratios

I've excluded the two artificially generated payloads (as these would compress well):
Unscientific takeaway: For small requests (which are the majority of requests to /e/) compression can hurt, but other than that compressing the data has a great effect.
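For anyone wanting to repeat a similar measurement in the browser (the numbers above came from the Python implementation via %timeit), a rough sketch using the lz-string npm package could look like the following; the payload is an artificial stand-in, so its ratio will be better than real event data:

```js
import LZString from 'lz-string';

// Artificial ~26kb payload as a stand-in for a captured /e/ body. Repeated
// text compresses unrealistically well, so treat the ratio as illustrative.
const payload = JSON.stringify({ data: 'event-data '.repeat(2400) });

const start = performance.now();
const compressed = LZString.compressToBase64(payload);
const elapsed = performance.now() - start;

console.log(`compressed ${payload.length}b -> ${compressed.length}b ` +
            `(ratio ${(compressed.length / payload.length).toFixed(2)}) in ${elapsed.toFixed(1)}ms`);
```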
Is the lzstring JS implementation equally slow compressing the payloads? If posthog makes a weaker computer's browser freeze while sending the payload, then we should change it no matter how slow the python implementation is.
I have yet to run experiments in that regard. As mentioned above, I'd first like verification on whether this solves the issue for our own team (see the linked PRs, which do not affect any other clients). Then, if the hypothesis is proven, we can focus on choosing the best alternative for the future; if not - back to the drawing board. :) Will keep this thread updated on that progress.
No updates on skipping lzstring yet - data is still ticking in. However, I did some measurements on browser-side compression algorithms.

Benchmarks

rrweb now (optionally) ships with a pako.js-based deflate algorithm built in, so I used that to compare against lzstring. Raw data/notes can be found in this gist.

Compression ratios (compressed size / uncompressed size)

This was tested using posthog.com, looking at real content sizes for fair comparisons. Compression ratios when compressing events in batches (smaller = better):
Increasing batch frequency did not significantly change batch ratios.
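A sketch of how such a batched-vs-individual comparison can be made with pako; the sample events here are made-up stand-ins for real rrweb events, which are much larger:

```js
import { deflate } from 'pako';

// Made-up incremental-snapshot-style events standing in for real rrweb output.
const events = [
  { type: 3, data: { source: 2, x: 10, y: 20 }, timestamp: 1607000000000 },
  { type: 3, data: { source: 2, x: 12, y: 24 }, timestamp: 1607000000100 },
];

const encode = (value) => new TextEncoder().encode(JSON.stringify(value));

// Compressing each event on its own vs. compressing the whole batch at once.
const individualBytes = events.reduce((sum, e) => sum + deflate(encode(e)).length, 0);
const batchedBytes = deflate(encode(events)).length;
const rawBytes = encode(events).length;

console.log('individual ratio:', (individualBytes / rawBytes).toFixed(2));
console.log('batched ratio:', (batchedBytes / rawBytes).toFixed(2));
```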
Those batch ratios were slightly wrong, as they did not account for compression of the whole event body - only the rrweb parts. In reality this would not work.

lzstring vs pako.deflate speed

Did some measurements on my computer via https://jsbench.me/; sadly I cannot link them due to the site's size limitations.
Takeaway: pako.deflate is consistently many times faster with larger payloads, and the speed is otherwise comparable. Note: for some reason our bundled version of lzstring is somewhat slower than the one on cdnjs - did not dig into why.

Package sizes

The minified version of pako.deflate is 25kb (!). LZString is much smaller; Marius has floated the figure of 5kb in the past.

Takeaway: If we go for pako, we might want to only include it when session recording is enabled and pull it in together with recording.js.
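As a sketch of what "pull it in together with recording.js" could look like - the helper and bundling setup are hypothetical, not the actual posthog-js structure - a dynamic import keeps the ~25kb of pako out of array.js and only fetches it once session recording is active:

```js
// Hypothetical helper: only used on the session-recording code path, so the
// pako dependency can be bundled with recorder.js instead of the core library.
async function deflateRecordingPayload(payload) {
  const { deflate } = await import('pako'); // fetched lazily with the recording bundle
  return deflate(new TextEncoder().encode(JSON.stringify(payload)));
}
```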
Last time I started the following experiment:
More context on this in this post.

What do we know from this?

From initial measurements, it seems not quite. I will do a more thorough pass on Monday, but we're still seeing failures - HTTP status 0 errors on requests that take several seconds. This is slightly surprising, since LiveSession and co don't compress the payloads they send in any obvious way (besides obfuscating the JSON keys). This means that even if we have removed the obvious bottleneck on the server side, we're still seeing timeouts. This might be confounded by the AWS migration that @fuziontech did, but the Grafana stats look relatively similar to before. On my own computer I'm seeing 1s+ requests for full snapshot events and 300-400ms requests for normal (tiny) events - it still might be something going wrong on the server side. In this case, increasing the batching threshold for session recording might help.

Next steps

Given this, I'm proposing:
This means that even if we don't end up using pako long-term (either due to not solving the problem, package size, or lzstring being more efficient in Node), we will only break some sessions for our own app by rolling back.

Notes on tooling

I'm severely lacking good tooling to do some of this:
Since last update:

Benchmarking our endpoints

Created a k6 benchmark of our /s endpoint in production. Metrics in this gist.

Takeaways:
Using rrweb.pack

Realized my metrics in the previous post were wrong. Compressing individual events wouldn't be a huge win, as most of the server payload would be our normal object keys. Adding this scheme would also break this PR, as the snapshot type would not be queryable on the server if we don't also add decompression support on the server.

Conclusions / next steps

All in all, it seems getting session recording more reliable requires:
Achieving 1 is easy - we just add deflate support to the server.
We'll ship deflate compression within recorder.js as part of this iteration, with matching support in the Python server.
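A minimal sketch of that flow on the client side - the function name, endpoint, and query parameter are illustrative, and the matching inflate step on the Python server is not shown:

```js
import { deflate } from 'pako';

// Hypothetical send path for a session recording batch once deflate ships.
function sendRecordingBatch(events) {
  const body = deflate(new TextEncoder().encode(JSON.stringify({ data: events })));
  // Sending raw bytes without a custom Content-Type keeps this a "simple" CORS
  // request; the compression scheme is signalled here via a made-up query parameter.
  return fetch('https://app.posthog.com/s/?compression=deflate', {
    method: 'POST',
    body,
  });
}
```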
Based on discussion, we'll move forward with pako.deflate, just for recorder.js. The tradeoff of extra library size is offset by the extra compression we'll get, as outbound upload speed seems to be the slowest piece. For now, we'll keep LZString compression for normal events too.
We've since implemented a gzip-based compression scheme which has been running for a couple of days on app.posthog.com. The resulting payloads are 2-3x smaller than with lz64. It's hard to get good numbers on anything else though, due to instability in the app bleeding into these stats. There are no new errors on the backend, so we'll roll this out to all users (who use array.js or update the npm package) tomorrow. :) We should also put focus into making our endpoint fast, though - the p99 metrics are pretty scary and are leading to some data loss.
Are you interested in trying https://github.com/101arrowz/fflate as the client-side compression lib?
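For comparison, a tiny sketch of gzipping a payload with fflate (the payload here is a placeholder); its small, tree-shakeable exports are the main draw over pulling in all of pako:

```js
import { gzipSync, strToU8 } from 'fflate';

// Placeholder payload standing in for a session recording batch.
const payload = JSON.stringify({ data: ['<rrweb events>'] });

// strToU8 converts the JSON string to a Uint8Array; gzipSync compresses it.
const compressed = gzipSync(strToU8(payload));
console.log(`gzipped ${payload.length}b down to ${compressed.length}b`);
```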
Is your feature request related to a problem?
We have had some reports of session recordings being incomplete, misleading, or showing white screens. #2457 #2365
#2459
We've already made some fixes and some workarounds: https://github.com/PostHog/posthog-js/blob/master/CHANGELOG.md #2547
These all seem to be caused by us not receiving all of the events from posthog-js. This can be caused by:
a. rrweb (and our integration with it) not correctly capturing these events
b. sending these events timing out or otherwise failing on the client side (the main suspect right now)
c. processing these events failing inside our APIs due to payload sizes or other issues
Some of this is inevitable - if a user leaves the page before we manage to send data from posthog-js, then it's natural that these sessions are cut short.
Describe the solution you'd like
Ideally these metrics should inform us of where the issue is and help us track progress on our end as we implement future fixes such as retries and so on.
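As one illustration of the kind of "retries" fix mentioned above - entirely hypothetical, not existing posthog-js behaviour - a send wrapper could retry failed snapshot uploads with backoff and emit an internal metric on final failure:

```js
// Hypothetical retry wrapper; sendFn stands in for the existing XHR/fetch send,
// and the metric name is made up for illustration.
async function sendWithRetry(sendFn, payload, { retries = 3, backoffMs = 500 } = {}) {
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      return await sendFn(payload);
    } catch (err) {
      if (attempt === retries) {
        // Here we'd emit the internal metric this issue asks for.
        console.warn('session_recording_send_failed', { attempts: attempt + 1 });
        throw err;
      }
      await new Promise((resolve) => setTimeout(resolve, backoffMs * 2 ** attempt));
    }
  }
}
```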
Describe alternatives you've considered
Waiting for the Go-based API endpoint to rule out (c).
Additional context
Thank you for your feature request – we love each and every one!