Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Lifecycle query is slow #7382

Closed
1 of 2 tasks
macobo opened this issue Nov 26, 2021 · 9 comments · Fixed by #8021
Closed
1 of 2 tasks

Lifecycle query is slow #7382

macobo opened this issue Nov 26, 2021 · 9 comments · Fixed by #8021
Assignees
Labels
performance Has to do with performance. For PRs, runs the clickhouse query performance suite

Comments

@macobo
Copy link
Contributor

macobo commented Nov 26, 2021

In what situation are you experiencing subpar performance?

Lifecycle query is quite slow.

Causes

1. Query over all of time

Digging into the query, we do this subquery per user without filtering on timestamp:

JOIN (
    SELECT DISTINCT person_id, {trunc_func}(min(events.timestamp)) earliest FROM events
    JOIN
    ({GET_TEAM_PERSON_DISTINCT_IDS}) pdi on events.distinct_id = pdi.distinct_id
    WHERE team_id = %(team_id)s AND {event_query} {filters}
    GROUP BY person_id
) earliest ON e.person_id = earliest.person_id

This is to identify if the user we see in the period is a new user or someone we've seen before.

This is very very expensive in clickhouse since we end up reading all partitions in clickhouse. See also #5459 for a similar problem elsewhere.

Full query can be found here: https://github.com/PostHog/posthog/blob/master/ee/clickhouse/sql/trends/lifecycle.py

2. (Minor) No joins + property filter pushdown for person properties

Instead each person property ends up doing another subquery for pdis+persons, which makes the query even slower.

Compared to the above this is relatively minor but still remarkable.

How to reproduce

  1. Visit stickiness

Environment

  • PostHog Cloud
  • self-hosted PostHog, version/commit: please provide

Additional context

We aren't able to do "over all of time" queries in clickhouse - we should always have a time limit. Would love some product input on what to do here:

Couple of options:

  1. Kill the resurrected/new differentiation
  2. Introduce a fixed time range for detecting resurrected (e.g. +3 months to chosen time range)
  3. Leverage person.created_at timestamp over detecting first visit time. This sounds appealing but not sure if this data can be relied on - we don't use it elsewhere. Requires joining with person always, likely also rewriting the query completely.

Other product questions:

  1. Should this query support aggregating by groups over users?

cc @paolodamico and @marcushyett-ph for product input
cc @EDsCODE for core analytics context

Thank you for your performance issue report – we want PostHog to go supersonic!

@macobo macobo added performance Has to do with performance. For PRs, runs the clickhouse query performance suite insights labels Nov 26, 2021
@macobo
Copy link
Contributor Author

macobo commented Dec 8, 2021

@paolodamico and @marcushyett-ph given this is one of the goals next sprint let's make solve the business question. Are you in favor of using person.created_at column for discovering resurrected?

@EDsCODE
Copy link
Member

EDsCODE commented Dec 8, 2021

looping @hazzadous to follow this in preparation for next sprint

person.created_at makes sense as long as we validate the data is trustworthy

@paolodamico
Copy link
Contributor

In a general sense, product-wise, I strongly suggest keeping the base functionality as it's key part of the lifecycle insight (even though feature has low usage).

The proposed alternatives make sense to me conceptually, but let's dive a little bit deeper.

  • How do we set that created_at attribute? Do we set it automatically when creating a person record?
  • When was this attribute introduced?

Perhaps we can do a quick sanity check query to see if there are any major discrepancies between the first event seen for a user and this date? We could also do a one-off sync job.

@marcushyett-ph
Copy link
Contributor

+1 to @paolodamico and @EDsCODE, do we have any anecdotal reasons not to trust created_at?

@EDsCODE
Copy link
Member

EDsCODE commented Dec 8, 2021

I just checked and I think created_at was an original field on the model so we should have good data on it. The one thing with created_at is that it will lock us in to only being able to do $pageview lifecycles. We're making the assumption that a created person always hits a $pageview which is pretty fair but this won't be true for other events.

(Though this isn't a blocker because we could eventually materialize tables around events precalculate the earliest timestamp of a person and event)

@hazzadous
Copy link
Contributor

hazzadous commented Dec 17, 2021

Sitrep

  • have added a benchmark pr
  • have looked into what is slow, initial findings: events over all time is not so slow at least for the cases I tested. The slow part is loading all pdis into memory and joining by the looks. Placing pdis to the left of the join appears to speed this up

either we switch the order, or put in a join table. I wonder if using a smaller join key might help also. But at any rate filtering the pdis by person created date should reduce the size of the join considerably

I’ll follow up to more details and actions

@marcushyett-ph
Copy link
Contributor

Nice! @hazzadous totally anecdotal and probably not useful context.

If I remove the filter for internal users lifecycle queries are way way faster. Not sure if there's anything in that statement to help identifying any bottlenecks.

@hazzadous
Copy link
Contributor

Sitrep: @macobo is working on #7663 which should considerably speedup teams/projects with a large number of distinct_ids. This at least makes the query in isolation faster, a couple of other opportunities:

  1. even if the pdi query is fast, we still need to perform a join with a large, e.g. multiple millions right table on some cases, which appears to be slow as will need to be loaded into memory to perform the hash join. We can make the right table smaller, or keep it in memory e.g. with a Join Table
  2. as @marcushyett-ph mentions removing internal users filter speeds things up considerable. For filtering, we are parsing with JSONExtractRaw for all persons within a team for properties that are not materialized. This requires reading the entire properties column and json parsing it, both IO and CPU heavy, I'm not sure which is the bottleneck yet. Immediate options here are ensuring that any properties used for filtering are materialized. If hits are sparse, using data skipping indices might help, although I suspect the query pattern here doesn't relate. As @macobo mentions here there are potentially small gains with using Array for properties. Further, using a nested key, value structure may help with limiting io, and avoiding CPU having to parse JSON.

I'll do some experimentation to see what works, but we have the materialisation option as an immediate tool. The join speed looks more involved.

@hazzadous
Copy link
Contributor

After speaking to @macobo is seems email is materialized for us as pmat_email, but lifecycle isn't using it

hazzadous pushed a commit that referenced this issue Jan 4, 2022
Previously we were using the first event that matched the filtering
parameters. This could be expensive if there are lots of events/users
and event filtering doesn't utilize sorting or index skipping much.

Instead we use the created_at as the date of first activity, regardless
of any filtering that may have been applied to events. Note that this
may not be as selective as the query on events, but fingers crossed this
is an outlier. Note that this change also diverges from the current
functionality in that previously we would consider the first activity
for a specific event type, but now create_at is implicitly the earliest
of any event.

This PR doesn't handle an optimisation of further filtering the persons
by any person filters that may be applied to ensure the right hand
earliest JOIN is as small as it could be.

Refers to #7382
hazzadous pushed a commit that referenced this issue Jan 6, 2022
This updates the SQL to be comprised of two queries, one for getting
new, returning, and resurrecting periods of activity, one for getting
dormant periods right after periods of activity.

Refers to #7382
macobo pushed a commit that referenced this issue Jan 13, 2022
This updates the SQL to be comprised of two queries, one for getting
new, returning, and resurrecting periods of activity, one for getting
dormant periods right after periods of activity.

Refers to #7382
macobo added a commit that referenced this issue Jan 13, 2022
* refactor(lifecycle): simplify clickhouse sql logic

This updates the SQL to be comprised of two queries, one for getting
new, returning, and resurrecting periods of activity, one for getting
dormant periods right after periods of activity.

Refers to #7382

* refactor(lifecyle): use `ClickhouseEventQuery` to build event query

* format

* Use bounded_person_activity_by_period for both sides of dormant join

* refactor(lifecycle): reduce pdi2 join by one

This means we're now under the current query memory limit for orgs with
around 20m distinct_ids. It does remove some readability though :(

* update snapshot

* Add further comments to query

* Add further comments to query

* Add further comments to query

* Remove dead variables

* Refactor person_query overriding

* Lifecycle refactoring continued

* Update lifecycle tests (except people ones)

* Make lifecycle people endpoint happy

* Remove django lifecycle tests

* Add some edge case tests

* Add missing type

Co-authored-by: Harry Waye <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance Has to do with performance. For PRs, runs the clickhouse query performance suite
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants