Track session recording internal metrics #2560

Closed
macobo opened this issue Nov 30, 2020 · 12 comments
Labels
enhancement New feature or request

Comments

@macobo
Contributor

macobo commented Nov 30, 2020

Is your feature request related to a problem?

We have had some reports of session recordings being incomplete, misleading, or showing white screens: #2457 #2365 #2459

We've already made some fixes and some workarounds: https://github.com/PostHog/posthog-js/blob/master/CHANGELOG.md #2547

These all seem to be caused by us not receiving all of the events from posthog-js. This can be caused by:
a. rrweb (and our integration with it) not correctly capturing these events
b. sending these events timing out or otherwise failing on the client side (the prime suspect right now)
c. processing these events failing inside our APIs due to payload sizes or other issues

Some of this is inevitable - if the user leaves the page before we manage to send data from posthog-js, it's natural that these sessions are cut short.

Describe the solution you'd like

  1. Track (and keep tracking) internal reliability metrics on posthog-js side (Send metrics on unload posthog-js#136).
  2. Dogfooding: Make this data nicely presentable within app.posthog.com - currently these graphs would end up being unreadable.

Ideally these should inform us of where the issue is and help us track progress on our end as we implement future fixes such as retries and so on. A rough sketch of what (1) could look like is below.
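For illustration only, a minimal sketch of the idea behind (1): count request outcomes in memory and flush them with navigator.sendBeacon on unload so the metrics have a chance of reaching the server even while the page is being torn down. The counter names and the '$capture_metrics' event are hypothetical - the actual implementation is tracked in posthog-js#136.

```ts
// Hypothetical sketch - not the posthog-js implementation.
type RequestOutcome = 'success' | 'http_status_0' | 'timeout' | 'in_flight_on_unload';

const counters: Record<RequestOutcome, number> = {
  success: 0,
  http_status_0: 0,
  timeout: 0,
  in_flight_on_unload: 0,
};

export function recordOutcome(outcome: RequestOutcome): void {
  counters[outcome] += 1;
}

window.addEventListener('unload', () => {
  // sendBeacon queues the request even as the page unloads; it can still be
  // dropped, which is exactly the kind of loss these metrics estimate.
  const payload = JSON.stringify({ event: '$capture_metrics', properties: counters });
  navigator.sendBeacon('/e/', payload);
});
```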

Describe alternatives you've considered

Waiting for the Go-based API endpoint to rule out (c).

Additional context

Thank you for your feature request – we love each and every one!

@macobo macobo added enhancement New feature or request session recording labels Nov 30, 2020
macobo added a commit to PostHog/posthog.com that referenced this issue Dec 1, 2020
macobo added a commit to PostHog/posthog-js that referenced this issue Dec 1, 2020
yakkomajuri pushed a commit to PostHog/posthog.com that referenced this issue Dec 1, 2020
macobo added a commit to PostHog/posthog-js that referenced this issue Dec 3, 2020
lzstring has performance problems on the server side, causing issues for
session recording: see PostHog/posthog#2560

Note that this is not a backwards-compatible change; it requires a change
on the posthog side to support a text/plain header as well.

We can't use application/json as the content type, as that would trigger a CORS
preflight for every xhr request since it's considered a custom header.
@macobo
Contributor Author

macobo commented Dec 3, 2020

I've posted two summaries to Slack previously; putting them here as well for posterity:

First learnings on session recording reliability from stats from our sites https://app.posthog.com/dashboard/1240:

  • posthog-js made 5671 requests 1-2 hours ago, from 66 different page loads.
  • Of those, 5629 got a successful 200 (99.26%).
  • Of the 42 failures: 8 were 'http status 0', the rest were (by assumption) in flight and likely cancelled after unload.
  • The status of 66 /e/ requests + 16 /s/ requests is unclear - they were sent off with sendBeacon, which is more likely to fail than normal requests.
  • Those requests contained a total of 15236 .capture() calls - of which 14348 (94%!) were session recording.
  • 5xx status does not seem to be a common issue: in the last hour 20 requests failed with status 0 (?), 24 were in flight during unload, and only 1 failed with a 5xx (the only such failure during the whole day).

Learning points:
  • Need to figure out what causes http status 0. At first glance it seems like something retries would be effective at solving (see the sketch after this list).
  • The number of in-flight requests is also troubling - it would be interesting to know how much of this is the user's network and how much is our API servers.
  • We could also implement retries across page loads to increase the odds of requests reaching us, though this would bloat the library significantly.
  • Session recording increases load per site significantly. At the moment I think a 10x-15x increase in requests is a reasonable ballpark when doing calculations. This will likely start playing into pricing as well, and bumps the priority of being able to capture snapshots of the data/store them in S3.
  • Doing more frequent requests for session recording seems to be correct. It might be worth bumping the send frequency of normal capture calls too - though I think those numbers are affected by immediate bounces.
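For reference, a hypothetical sketch of the kind of retry that could address transient 'http status 0' failures - this is not what posthog-js did at the time, and the retry count, delay, and text/plain header are illustrative:

```ts
// Retry a failed POST a couple of times with exponential backoff before giving up.
function sendWithRetry(url: string, body: string, retriesLeft = 2, delayMs = 500): void {
  const xhr = new XMLHttpRequest();
  xhr.open('POST', url, true);
  xhr.setRequestHeader('Content-Type', 'text/plain');
  xhr.onload = () => {
    // Retry on server errors; status 0 means the request never completed.
    if (xhr.status === 0 || xhr.status >= 500) {
      retry();
    }
  };
  xhr.onerror = retry; // network failure / CORS error also surfaces as status 0

  function retry(): void {
    if (retriesLeft > 0) {
      setTimeout(() => sendWithRetry(url, body, retriesLeft - 1, delayMs * 2), delayMs);
    }
  }

  xhr.send(body);
}
```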

The next day:

Yep. Data is ticking in again with the enhanced analytics on posthog-js. https://app.posthog.com/dashboard/1240

What I've learned:
  • Dropped full snapshot events are definitely visible from the client side - so they're not reaching our servers.
  • The majority of full snapshots are dropped because the request has started but not finished when the page is unloaded.
  • The average size of a full snapshot event (for us) is 330kb, and it times out 3-5s after the request is made (either due to http status 0 or unload).
  • HTTP status 0 is likely caused by a combination of client-side request cancellation and /e/ endpoint failures/timeouts. When /e/ fails, the client gets a CORS error (which translates to status 0) since we only set the right headers on successful responses.
  • We also drop /e/ events, but these are taking 'just' 300ms on average while being 1kb in size.

Open questions:
  • Why are requests taking that long? Is 330kb too large to send for a full snapshot?
  • Or are our API servers/load balancer just taking too long, or geographically far from clients? The Go-based API endpoints could answer this question.
  • I'm currently not measuring average response times/p95 on the client side as it's slightly too heavy for posthog-js. It would be great to have access to these numbers, but it's hard/impossible without killing data budgets with our existing capture & data analysis tooling.

Notes:
  • Given that the main bottleneck is full snapshots, I don't think implementing retries will have any significant effect on it.
  • One tool we have is adding strict xhr timeouts on the client side as a forcing function (a sketch follows below).
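A minimal sketch of that forcing-function idea, assuming a plain XMLHttpRequest wrapper; the 5-second deadline is illustrative, not an actual setting:

```ts
// Abort requests that haven't completed within a hard deadline so we at least
// get a deterministic, countable failure instead of a request hanging until unload.
function sendWithTimeout(url: string, body: string, timeoutMs = 5000): void {
  const xhr = new XMLHttpRequest();
  xhr.open('POST', url, true);
  xhr.timeout = timeoutMs; // the browser aborts the request after timeoutMs
  xhr.ontimeout = () => {
    // Count this as a distinct failure mode rather than a generic status 0.
    console.warn(`request to ${url} timed out after ${timeoutMs}ms`);
  };
  xhr.send(body);
}
```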

macobo added a commit to PostHog/posthog-js that referenced this issue Dec 3, 2020
@macobo
Contributor Author

macobo commented Dec 3, 2020

Since then we've found at least one of the true bottlenecks: lzstring in Python is slow enough to cause large events to time out.

Next actions

I'd like to try an experiment where we don't compress the json data at all for our sites (app.posthog.com/posthog.com). Two PRs for this: #2648 PostHog/posthog-js#144

Future

If this removes the failures we're seeing at https://app.posthog.com/dashboard/1240, we can discuss switching or dropping compression schemes or optimizing the Python lzstring implementation. Any suggestions for lightweight browser compression libraries are welcome!

Full snapshot events are pretty massive though - after this experiment I'll also experiment with some tooling that's new to rrweb since we integrated it.

LZString benchmarks

I took 4 sample requests from my network panel and generated 2 more (26kb in size + 2.5mb in size) and did some measurements.

On posthog.com, the average /e/ request with lzstring compression is ~500b, a large session recording batch is ~2kb, a large update (switching from the homepage to the docs) is 9kb, and a full snapshot event is ~370kb.

Time

Measurements were done on my Lenovo laptop using IPython's %timeit command.

424b: 3.03 ms ± 28.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
520b: 3.49 ms ± 28.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
2kb: 18.7 ms ± 226 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
9kb: 66.7 ms ± 147 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
26kb: 188 ms ± 2.16 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
372kb: 3.23 s ± 106 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
2511kb: 22.4 s ± 455 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Compression ratios

I've excluded the two artificially generated payloads (as these would compress well):

424b: ratio=1.103932584269663 (original=0.34765625kb, compressed=0.3837890625kb)
520b: ratio=1.1333333333333333 (original=0.4248046875kb, compressed=0.4814453125kb)
2kb: ratio=0.7630739795918368 (original=3.0625kb, compressed=2.3369140625kb)
9kb: ratio=0.3572684298941282 (original=24.9970703125kb, compressed=8.9306640625kb)
372kb: ratio=0.6416829491564855 (original=579.2705078125kb, compressed=371.7080078125kb)

Unscientific takeaway: for small requests (which are the majority of requests to /e/) compression can hurt - the compressed payload is actually larger - but otherwise compressing the data has a great effect.

@mariusandra
Collaborator

Is the lzstring JS implementation equally slow compressing the payloads? If posthog makes a weaker computer's browser freeze while sending the payload, then we should change it no matter how slow the python implementation is.

@macobo
Contributor Author

macobo commented Dec 3, 2020

I have yet to do experiments in that regard.

As mentioned above, I'd first like to verify whether this solves the issue for our team (see the linked PRs, which do not affect any other clients). If the hypothesis is proven, we'll then focus on choosing the best alternative for the future; if not - back to the drawing board. :)

Will keep this thread updated on that progress.

@macobo
Contributor Author

macobo commented Dec 3, 2020

No updates on skipping lzstring yet - data is still ticking in. However I did some measurements on browser-side compression algorithms.

Benchmarks

rrweb now (optionally) ships with a pako.js-based deflate algorithm built in, so I used that to compare against lzstring.

Raw data/notes can be found in this gist

Compression ratios (compressed size / uncompressed size)

This was tested on posthog.com, using real content sizes for fair comparisons (branch).

Compression ratios when compressing events in batches (smaller = better):

|                   | pako.deflate | lzstring  |
| ----------------- | ------------ | --------- |
| session recording | 0.18-0.25    | 0.45-0.55 |
| normal events     | 0.11-0.15    | 0.5       |

Increasing batch frequency did not significantly change batch ratios.

Compressing individual events yielded an overall compression ratio of around 0.25 for session recording. This has a nice property: no server-side changes are needed, as the decoding can be done by the client/server when serving a recording.

Takeaway: If we can get away with it, using pako.deflate is 2-2.5x better than lzstring for compression ratios

The above was slightly wrong, as it did not account for compression of the whole event body - only the rrweb parts. In reality this would not work.

lzstring vs pako.deflate speed

Did some measurements on my computer via https://jsbench.me/ - sadly I cannot link to them due to the site's size limitations.

| payload size | pako.deflate          | lzstring.compressToBase64 |
| ------------ | --------------------- | ------------------------- |
| 500kb        | 16.39 ops/s ± 8.56%   | 3.98 ops/s ± 2.8%         |
| 3kb          | 1942.16 ops/s ± 3.75% | 165.39 ops/s ± 16.5%      |
| 300b         | 2476.65 ops/s ± 6.8%  | 2830.78 ops/s ± 11.84%    |

Takeaway: pako.deflate is consistently many times faster for larger payloads, and the speed is comparable for small ones.

Note: For some reason our bundled version of lzstring is somewhat slower than the one on cdnjs - did not dig into why.
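For anyone wanting to reproduce a comparison like this, a rough sketch (assuming the pako and lz-string npm packages; the payload below is a stand-in for a real rrweb batch, not the exact jsbench setup used above):

```ts
import { deflate } from 'pako';
import LZString from 'lz-string'; // default export (requires esModuleInterop)

function compareCompression(payload: object): void {
  const json = JSON.stringify(payload);
  const utf8 = new TextEncoder().encode(json);

  const deflated = deflate(utf8);              // Uint8Array
  const lz = LZString.compressToBase64(json);  // base64 string (ASCII, so length ≈ bytes)

  // Ratios are compressed size / uncompressed size, matching the tables above.
  console.log('deflate ratio ', deflated.byteLength / utf8.byteLength);
  console.log('lzstring ratio', lz.length / utf8.byteLength);
}

compareCompression({ type: 2, data: { href: 'https://posthog.com', width: 1920, height: 1080 } });
```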

Package sizes

The minified version of pako.deflate is 25kb (!). LZString is much smaller - Marius has floated the figure of 5kb in the past.

Takeaway: if we go for pako, we might want to only include it when session recording is enabled and pull it in together with recorder.js (see the sketch below).
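A sketch of what "only pull it in with the recorder" could look like, using dynamic imports so the ~25kb only loads when recording is actually enabled; `emitToServer` is a placeholder for the real capture pipeline:

```ts
// Lazily load both rrweb and the compression code when recording starts,
// keeping them out of the main bundle.
async function startSessionRecording(emitToServer: (compressed: Uint8Array) => void): Promise<void> {
  const [{ deflate }, rrweb] = await Promise.all([import('pako'), import('rrweb')]);
  rrweb.record({
    emit(event) {
      // Compress each emitted rrweb event before handing it to the capture pipeline.
      emitToServer(deflate(new TextEncoder().encode(JSON.stringify(event))));
    },
  });
}
```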

@macobo
Contributor Author

macobo commented Dec 4, 2020

Last time I started the following experiment:

  • Does replacing lzstring with no encoding reduce the rate of full snapshots that fail to be delivered?

More context on this in this post

What do we know from this?

From initial measurements, it seems not quite. I will do a more thorough pass on Monday, but we're still seeing http status 0 errors on requests that take several seconds.

This is slightly surprising, since LiveSession and co. don't compress the payloads they send in any obvious way (besides obfuscating JSON keys).

This means that even though we have removed the obvious bottleneck on the server side, we're still seeing timeouts. This might be confounded by the AWS migration that @fuziontech did, but Grafana stats look relatively similar to before.

On my own computer I'm seeing 1s+ requests for full snapshot events and 300-400ms requests for normal (tiny) events - it might still be something going wrong on the server side. In that case, increasing the batching threshold for session recording might help.

Next steps

Given this, I'm proposing:

  • benchmarking requests for full snapshot events with hey - it's possible something is still weird with our server-side setup
  • shipping a beta version, which includes pako.deflate to deflate rrweb events, and
  • using that to compress individual events for our team only (all other clients unaffected)
  • doing the decompression in the frontend (in a way that only affects our data - easy)
  • doing measurements on how it affects failed request rates, request sizes and durations

This means that even if we don't end up using pako long-term (either because it doesn't solve the problem, because of package size, or because lzstring is more efficient in Node), we will only break some sessions for our own app when rolling back. A rough sketch of the compress/decompress path is below.
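A hypothetical sketch of that path: deflate the rrweb event on the client, stash it under a marker property, and inflate it again in the frontend when replaying. The `$compressed_data` field name is made up for illustration - only our own team's data would go through this path.

```ts
import { deflate, inflate } from 'pako';

// Client side: compress a single rrweb event into a base64 string.
export function compressSnapshotEvent(event: object): { $compressed_data: string } {
  const bytes = deflate(new TextEncoder().encode(JSON.stringify(event)));
  // btoa needs a binary string, so convert the Uint8Array first.
  let binary = '';
  bytes.forEach((b) => (binary += String.fromCharCode(b)));
  return { $compressed_data: btoa(binary) };
}

// Frontend (replay) side: reverse the transformation before feeding rrweb-player.
export function decompressSnapshotEvent(stored: { $compressed_data: string }): object {
  const binary = atob(stored.$compressed_data);
  const bytes = Uint8Array.from(binary, (c) => c.charCodeAt(0));
  return JSON.parse(new TextDecoder().decode(inflate(bytes)));
}
```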

Notes on tooling

I'm severely lacking good tooling to do some of this:

  • On the posthog-js side, I want to avoid flooding PostHog with a ton of events, but that makes it hard to measure mean/p90/p99 request durations for session recording without introducing a lot of complexity. Oh well.
  • On app.posthog.com I've stumbled upon a lot of pain points doing the analysis (see the linked tickets). Of these, Custom filters on dashboards #2587 and Order metrics shown on tooltip #2620 are the most painful; I might prioritize fixing these as well.

@macobo
Contributor Author

macobo commented Dec 7, 2020

Since last update:

Benchmarking our endpoints

I created a k6 benchmark of our /s endpoint in production. The metrics are in this gist; a rough sketch of the kind of script involved is below.
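This is only illustrative - the endpoint path, payload format, and load profile are stand-ins; the real numbers are in the gist.

```ts
import http from 'k6/http';
import { check, sleep } from 'k6';

export const options = {
  vus: 10,
  duration: '1m',
};

// Stand-in for a ~500kb full snapshot payload; a real benchmark would replay captured request bodies.
const payload = 'data=' + encodeURIComponent('x'.repeat(500 * 1024));

export default function () {
  const res = http.post('https://app.posthog.com/s/', payload, {
    headers: { 'Content-Type': 'text/plain' },
  });
  check(res, { 'status is 200': (r) => r.status === 200 });
  sleep(1);
}
```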

Takeaways:

  1. Sending 500kb of data is slooow even on my semi-decent connection, so compressing the payloads will help for sure.
  2. Not sure whether my upload was capped on my end or on the server.
  3. Given the high baseline http_req_waiting, I think plugin-server-based ingestion OR Go API ingestion will help a lot.

Using rrweb.pack

I realized my metrics in the previous post were wrong. Compressing individual events wouldn't be a huge win, as most of the server payload would be our normal object keys.

Adding this scheme would also break this PR, as the snapshot type would not be queryable on the server if we don't also add decompression support there.

Conclusions / next steps

All in all, it seems making session recording more reliable requires:

  1. Making requests smaller - deflate achieves that
  2. Making our ingestion API endpoints faster to avoid client-side timeouts.

Achieving 1 is easy - we just add deflate support to the server.
However, 2 is trickier since even plain Python endpoints are slow, and decompression is bound to slow things down further. It could be achieved in a variety of ways:

  1. Go-based ingestion endpoints
  2. The plugin server also handling /e/ and /s/ endpoint ingestion

I'm going to put this on pause until I have a better understanding of what's going on with plugins / chat with Tim.

We'll ship deflate compression within recorder.js as part of this iteration, with support in the Python server.

@timgl
Collaborator

timgl commented Dec 7, 2020

Based on discussion, we'll move forward with pako.deflate, just for recorder.js. The tradeoff of extra library size is offset by the extra compression we'll get, as outbound upload speed seems to be the slowest piece. For now, we'll keep LZstring compression for normal events too.

@macobo
Contributor Author

macobo commented Dec 10, 2020

We've since implemented a gzip-based compression scheme, which has been running for a couple of days on app.posthog.com.

The resulting payloads are 2-3x smaller than with lz64 on app.posthog.com. It's hard to get good numbers on anything else though, due to instability in the app bleeding into these stats.

There are no new errors on the backend, so we'll roll this out to all users (who use array.js or update via npm) tomorrow. :)

We should also put focus into making our endpoint fast though - the p99 metrics are pretty scary and are leading to some data loss. A sketch of what the gzip scheme looks like on the client is below.
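For reference, a sketch of the client side of such a scheme, assuming pako's gzip helper; the exact wire format and flags posthog-js uses aren't shown here.

```ts
import { gzip } from 'pako';

// Compress a batch of events into a gzip byte stream before sending; the
// backend then ungzips the body before the usual JSON parsing.
export function gzipPayload(events: object[]): Uint8Array {
  return gzip(new TextEncoder().encode(JSON.stringify(events)));
}
```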

@macobo
Contributor Author

macobo commented Dec 11, 2020

Before and after rolling gzip support out to everyone on app.posthog.com:

[Screenshot from 2020-12-11 18-00-24]

@Yuyz0112

Are you interested in trying https://github.com/101arrowz/fflate as the client-side compression lib?

@macobo
Contributor Author

macobo commented Dec 15, 2020

We do now! https://github.com/PostHog/posthog-js/blob/master/CHANGELOG.md
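For reference, a minimal sketch of the same gzip helper written with fflate, which is much smaller than pako while exposing a similar API (strToU8 and gzipSync are its actual exports); how posthog-js wires it in is described in the changelog above.

```ts
import { strToU8, gzipSync } from 'fflate';

// fflate equivalent of the pako-based helper shown earlier in this thread.
export function gzipWithFflate(events: object[]): Uint8Array {
  return gzipSync(strToU8(JSON.stringify(events)));
}
```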

@macobo macobo closed this as completed Dec 18, 2020