This repository has been archived by the owner on Nov 4, 2021. It is now read-only.

Match actions to events on the fly #235

Closed
mariusandra opened this issue Mar 9, 2021 · 15 comments · Fixed by #436
Assignees
Labels
enhancement New feature or request epic

Comments

@mariusandra
Collaborator

We should probably match actions to events on the fly in the plugin server.

At the highest level:

  1. On the plugin server, keep in memory all actions and all action steps for all teams.
  2. After an event is ingested, see if any of the action steps apply to it. If so, create the action.

There are three different types of ActionSteps that we can match directly:

  • Autocapture: event == '$autocapture' && match elements via selector && url contains/regex && ...
  • Custom Event: event == 'custom event'
  • Pageview: event == '$pageview' && url contains/regex
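A hedged sketch of what that direct matching could look like in the plugin server. All type and function names here (`PluginEvent`, `ActionStep`, `matchesActionStep`) are illustrative, not the actual plugin server API:

```typescript
// Illustrative types only - the real plugin server models differ.
interface PluginEvent {
    event: string
    properties: Record<string, any>
}

interface ActionStep {
    event: string | null // '$autocapture', '$pageview', or a custom event name
    url: string | null
    url_matching: 'contains' | 'regex' | 'exact' | null
}

function matchesUrl(url: string, step: ActionStep): boolean {
    if (!step.url) {
        return true // no URL constraint on this step
    }
    switch (step.url_matching) {
        case 'regex':
            return new RegExp(step.url).test(url)
        case 'exact':
            return url === step.url
        default:
            return url.includes(step.url) // 'contains' as the default
    }
}

function matchesActionStep(event: PluginEvent, step: ActionStep): boolean {
    if (event.event !== step.event) {
        return false
    }
    // Autocapture and pageview steps also constrain the URL; a real
    // autocapture match would additionally check the elements chain.
    if (step.event === '$autocapture' || step.event === '$pageview') {
        return matchesUrl(event.properties['$current_url'] ?? '', step)
    }
    return true // custom events match on name alone
}
```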

Filters on all three of them add some complexity:

  • Event property equals/contains/etc something --> we have the event's properties, so easy to do all possible comparisons
  • User property contains something --> we need the Person for this. We are selecting it for all posthog-js events due to some automatic $set properties, so it's probably fine to also fetch the person (if not already fetched) to match a property for an action
  • User belongs to a cohort --> add one extra query to get all cohorts for a person? ... or inlined in the fetch above?
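The event-property comparisons could be plain JS predicates, along these lines (the operator set and the `matchesPropertyFilter` helper are assumptions for illustration, loosely modeled on PostHog's property filter operators):

```typescript
// Illustrative sketch - operator names and semantics are assumptions.
type PropertyOperator = 'exact' | 'icontains' | 'regex' | 'is_set'

interface PropertyFilter {
    key: string
    value: any
    operator: PropertyOperator
}

function matchesPropertyFilter(properties: Record<string, any>, filter: PropertyFilter): boolean {
    const actual = properties[filter.key]
    switch (filter.operator) {
        case 'exact':
            return actual === filter.value
        case 'icontains':
            return String(actual ?? '').toLowerCase().includes(String(filter.value).toLowerCase())
        case 'regex':
            return typeof actual === 'string' && new RegExp(filter.value).test(actual)
        case 'is_set':
            return actual !== undefined && actual !== null
        default:
            return false
    }
}
```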

In addition:

  • We will need a system to keep all these ActionSteps in memory in the plugin server and reload them when changed. Similar to PluginConfigs now.

However:

  • We still need the django async action-event mapping code to clean up after changes to actions (calculating action...)
  • Should we find a way to run webhooks on late-matched or back-matched actions?

Shouldn't be that hard to pull off?


Next step after this: automatically update cohorts after ingesting events.

@yakkomajuri
Contributor

Also responding to this here.

Indeed I'm also afraid of breaking some OSS setups - but submitted the PR as per the conversation on Slack.

I think you have more context here and the approach makes sense. Plus I guess the idea was to do this anyway #157.

Wondering now about the short-term, given that we wanted to even push a patch release. Realistically we won't get this done super soon I'm guessing.

Should we then maybe just change that webhook task to use event.actions instead of event.action_set.all() and let the webhooks flow?

The action mapping will still happen periodically and I guess the minor problem is that a user might receive a message via the webhook, go to the PostHog action, and the event isn't there yet.

@mariusandra
Collaborator Author

Forgot to add: there are also filters/selectors on elements (in this case the elements chain) that we must support.

@Twixes
Collaborator

Twixes commented May 3, 2021

We want to tackle this this week. Here's how I think we can go about this:

It looks like we can't feasibly do action matching in the plugin server, because we'd have to port all filtering logic to JS and maintain it at all times (so a new action definition or filtering option would need to be implemented not just for Postgres and for ClickHouse, but in fact for Postgres in Django, for ClickHouse in Django, for Postgres in Node, and for ClickHouse in Node). That sounds nightmarish. Keeping this in the main Python repo is hugely simpler.

So, how can we tell the plugin servers about actions matched to an event?
Well, we already do this for webhooks and Zapier. Currently that logic is fired only when webhooks/Zapier are enabled, but we could do it for all events. And information about actions that were matched could be sent to the plugin server with Celery/Kafka.
There's at least one major problem here for EE, in that Celery is badly suited for EE volumes and we'd have to move this stream of data to Kafka, but that's simple enough.

The above is a plan. I can foresee performance issues and other problems may come up too. But, this can be tried out in a few days, while porting action matching/filtering/cohorts logic (along with all tests) would take weeks.
So, WDYT? Is this the best way? Did I miss something?

@mariusandra
Collaborator Author

mariusandra commented May 3, 2021

I'd say the goal for this week is to realistically scope out how big of a task this would be. Nobody has yet asked for an onAction, so no need to rush something out the door for an arbitrary deadline. :)

The thing is: servers are not free. My hypothesis is that if we move action mapping, or at least part of it, into the plugin server, we will drastically cut down on cost. We can cut down on workers by an order of magnitude... with virtually no extra muscle spent by the plugin server.

That's worth spending a few weeks on.

Regarding the filtering, unless we find some creative shortcuts, like storing computed filter SQL inside the actions... (probably a bad idea, but think outside the box), we will need to port some of it over. However, I don't think the plugin server needs to run any queries for most actions/filters. The data should already be there. That means the plugin server port won't be maintaining two database filters, but just some if statements that compare objects, plus another set of matching rules, including for DOM selectors.

It's not trivial work, but let's at least spec it out. What filters are there? How can we do this with the least amount of resources? What could we do to make sure the filters stay in sync in both projects? Is there a way to split the work, or at least do the simple ones on the plugin server? Can we exclude most events that match nothing? What's the real world usage for all of these different types of actions? Maybe we should write the filters inside the posthog repo in JavaScript that gets transpiled to Python and directly imported by the nodejs plugin server? There are many things we could look at :D.

That all said, what are the plans for team core experience regarding filters, actions, etc? Are there plans on implementing LUA scripts inside nested property filters soon? How stable is the filters API? Is our plan of doing real time action matching in the plugin server a brave one ... or a brave and foolish one? Tagging @EDsCODE for this :).

If we can somehow avoid another round trip to django after ingesting an event, all the network cables and servers will be happy. Think of the servers. And the environment. And the cables.

@Twixes
Collaborator

Twixes commented May 4, 2021

Alright, actually right now actions only use the properties feature of the Filter class (which, all in all, is *lots* of mixins, but mostly used in insights etc.), so I do see a way of adding this in a few days. I definitely see the potential for performance gains, but I'm still concerned about duplicate maintenance. Mainly a concern for ClickHouse actually, where we calculate actions as needed – therefore Python needs the SQL. However, what if we somehow went for a precalculation-based approach similar to what we do for Postgres actions – this way all this logic could indeed be offloaded to the plugin server, with Celery-based (periodic and/or dispatch-based) calculation of actions to some table. Opinions from people who implemented ClickHouse action matching definitely needed here though. @EDsCODE @fuziontech

@mariusandra
Collaborator Author

I see problems with precalculating actions for clickhouse. Needing to recalculate them might become annoying (solvable with some unique ids and lots of disk space). Plus I'm not sure if calculating them dynamically is actually a problem. It could be. If so, having the discussion makes sense. Otherwise we'd be creating more work than we should.

For postgres, we will need code which recalculates all the events for an action from scratch and stores those in the database. That logic is now and can remain in Django.

For both postgres and clickhouse, we need to detect, when an event is being ingested, whether it currently matches any action, and trigger the relevant, uhm, actions on our part. Such as webhooks and possibly onAction(). And an INSERT for postgres.

This latter part is the code I'd like to port over to the plugin server. Because the plugin server is a long running process, we can cache all the actions and match them against incoming events with a few javascript if statements... and directly trigger any webhooks/callbacks on top of that. That's a lot better than sending a task via celery.

Now, again, this should be specced out a bit more :).

@Twixes
Collaborator

Twixes commented May 5, 2021

Alright, more specifically:

Note: Python = Django server in main repo, JS = plugin server in this repo

Currently

| Property | Postgres | ClickHouse |
| --- | --- | --- |
| Continuous action matching maintenance | Python only 🙂 | Python only 🙂 |
| Insights mechanism | Periodically-async in Python, Celery job doing matching over all events created since last action calculation and saving results to many-to-many actions-for-events bridge table using `Event.objects.query_db_by_action()` | Insight-sync in Python, applying `format_action_filter()` to insight query |
| Webhooks/Zapier mechanism | Insights mechanism-sync in Python | Ingestion-async in Python, Celery dispatch for a small fraction of prefiltered events, an ephemeral PG event for use with `Event.objects.query_db_by_action()` is created for every CH event |
| Plugin server onAction mechanism | 404 | 404 |

Crossroads

Laying out 3 options to consider. "Implementation work" values below don't list "building onAction API based on Celery/Kafka queueing and Piscina tasks", as that is implied for all.

  1. All Python. Minimum work, simply allowing onAction with added event-driven architecture. System performance taking a hit in exchange due to all the latencies (and Python).

    | Property | Postgres | ClickHouse |
    | --- | --- | --- |
    | Implementation work | None 🙂 | Moving webhooks/Zapier dispatch from Celery to Kafka 🙂 |
    | Continuous action matching maintenance | Python only 🙂 | Python only 🙂 |
    | Insights mechanism | Periodically-async in Python, Celery job doing matching over all events created since last action calculation and saving results to many-to-many actions-for-events bridge table using `Event.objects.query_db_by_action()` | Insight-sync in Python, applying `format_action_filter()` to insight query |
    | Webhooks/Zapier mechanism | Insights mechanism-sync in Python | Ingestion-async in Python, Kafka dispatch for every single event, an ephemeral PG event for use with `Event.objects.query_db_by_action()` is created for every CH event |
    | Plugin server onAction mechanism | Insights mechanism-async in JS, queued with Celery for processing with Piscina tasks | Webhooks/Zapier mechanism-async in JS, queued with Kafka for processing with Piscina tasks |
  2. Mixed Python and JS. Lots of work to move practically all action matching to JS, only CH insights still having to use format_action_filter() and its dependencies. Would be simple to maintain in the future if the query building didn't still have to be kept in Python.

    | Property | Postgres | ClickHouse |
    | --- | --- | --- |
    | Implementation work | Porting action matching PG query building to JS, identical to Python, along with all tests, plus moving periodic action calculation job to JS 😧 | Porting action matching CH query building to JS, identical to Python, along with all tests, plus adding webhooks/Zapier matching to JS 😧 |
    | Continuous action matching maintenance | JS only 🙂 | Python and JS 😧 |
    | Insights mechanism | Periodically-async in JS, Celery job doing matching over all events created since last action calculation and saving results to many-to-many actions-for-events bridge table using JS action matching | Insight-sync in Python, applying `format_action_filter()` to insight query |
    | Webhooks/Zapier mechanism | Insights mechanism-sync in JS | Ingestion-sync in JS, an ephemeral PG event for use with JS action matching is created for every CH event |
    | Plugin server onAction mechanism | Insights mechanism-async in JS, queued with Celery for processing with Piscina tasks | Ingestion-async in JS, queued with Kafka for processing with Piscina tasks |
  3. All JS. Evolution of 2 – same as that one, only ClickHouse insights mechanism is moved to JS from Python using a precalculation approach and webhooks/Zapier fired there. No Python maintenance anymore. @EDsCODE expressed interest in something like this, as this could also optimize insights queries, though there's added complexity in figuring that part out too.

    | Property | Postgres | ClickHouse |
    | --- | --- | --- |
    | Implementation work | Porting action matching PG query building to JS, identical to Python, along with all tests, plus moving periodic action calculation job to JS 😧 | Porting action matching CH query building to JS, identical to Python, along with all tests, plus replacing the need for `format_action_filter()` with a new CH action_events table 😧 |
    | Continuous action matching maintenance | JS only 🙂 | JS only 🙂 |
    | Insights mechanism | Periodically-async in JS, Celery job doing matching over all events created since last action calculation and saving results to many-to-many actions-for-events PG bridge table using JS action matching | Ingestion-sync in JS, matching event that's being ingested to actions and saving results to many-to-many actions-for-events CH bridge table using JS action matching |
    | Webhooks/Zapier mechanism | Insights mechanism-sync in JS | Insights mechanism-sync in JS |
    | Plugin server onAction mechanism | Insights mechanism-async in JS, queued with Celery for processing with Piscina tasks | Ingestion-async in JS, queued with Kafka for processing with Piscina tasks |

As always – this is only what I have identified, and I may be missing something, so if you're thinking about any other way, would love to hear (and I may have mixed something up in the 4 tables above, it took some tinkering to present this somewhat nicely).

@mariusandra
Collaborator Author

mariusandra commented May 5, 2021

Hey! Thanks for this! It's a great tool to aid us in comparing the approaches, plus great documentation for later.

However, I'm still not seeing the approach I've been advocating for months now 😛 . I'm not sure what gets lost in translation, or what I'm missing that makes it impossible to work. I'll use the table structure to describe the changes I'd make:

  1. Mixed Python and real-time JS. We port a part of the filtering logic to JS, we do not implement a query builder in JS.

    | Property | Postgres | ClickHouse |
    | --- | --- | --- |
    | Implementation work | Porting filtering logic for real-time action matching to JS, such that we do not do any extra queries per event to match an action. Keep an in-memory synced list of action steps for all teams in some POJO, compare those to events when they're processed (`event.properties.bla === actionStep.filter.properties.bla && event.elements.matches(actionStep.selector)`) and if there's a match, insert an action into postgres and run the onAction code | Same as postgres, but no need to save the action in the database |
    | Continuous action matching maintenance | Python, just like now | Not needed |
    | Insights mechanism | Nothing changes | Nothing changes |
    | Webhooks/Zapier mechanism | We put Zapier in the same place where we will put onAction | Same as postgres. Also no ephemeral events are created |
    | Plugin server onAction mechanism | As delicious as chocolate cake | Tastes like chicken |

Now, again, I might be missing something, so I'd really love for this approach to be specced out. What filters do we need. What are we missing.

However this approach has so many benefits that I'd really love to get some MVP out: there will be no celery/kafka round trips. There will be no extra queries per action. No longer do we make 100 queries per event if we have 100 actions. Etc. As discussed here, we need to get a handle on server costs and this will bring massive savings. We could scale down celery workers from ~50 tasks to ~2.

From what I was able to gather (and copy/pasting from the first post), we have three ActionSteps that we can match immediately:

  • Autocapture: event == '$autocapture' && match elements via selector && url contains/regex && ...
  • Custom Event: event == 'custom event'
  • Pageview: event == '$pageview' && url contains/regex

Filters on all three of them add some complexity:

  • Event property equals/contains/etc something --> we have the event's properties, so easy to do all possible comparisons
  • User property contains something --> we need the Person for this. We are selecting it for all posthog-js events due to some automatic $set properties, so it's probably fine to also fetch the person (if not already fetched) to match a property for an action (preferably only if we know we need it to match actions, not just in case)
  • User belongs to a cohort --> add one extra query to get all cohorts for a person? ... or inlined in the fetch above if we know we need this data later for action mapping?
  • Elements match selector

Plus the logic to keep actionsteps in sync in the plugin server, probably via some redis pubsub.
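A sketch of that sync mechanism, with everything hypothetical: in practice a Redis pubsub message published from Django on action save/delete would trigger the reload, but the cache itself is just a Map:

```typescript
// Hypothetical in-memory cache of action steps per team. A real version
// would subscribe to a Redis channel and call reloadAction()/dropAction()
// when a pubsub message arrives; it is not the actual plugin server API.
interface CachedActionStep {
    actionId: number
    event: string | null
}

class ActionStepCache {
    private stepsByTeam = new Map<number, Map<number, CachedActionStep[]>>()

    // Called on startup (full load) and on every "action updated" message
    reloadAction(teamId: number, actionId: number, steps: CachedActionStep[]): void {
        if (!this.stepsByTeam.has(teamId)) {
            this.stepsByTeam.set(teamId, new Map())
        }
        this.stepsByTeam.get(teamId)!.set(actionId, steps)
    }

    // Called on "action deleted" messages
    dropAction(teamId: number, actionId: number): void {
        this.stepsByTeam.get(teamId)?.delete(actionId)
    }

    // Used by the ingestion pipeline: no DB query, just a Map lookup
    stepsForTeam(teamId: number): CachedActionStep[] {
        const actions = this.stepsByTeam.get(teamId)
        return actions ? Array.from(actions.values()).flat() : []
    }
}
```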

That's not THAT much work to get in to JS? And that's not that much effort to keep in sync with core, I think. What am I missing?

I mean, yes, we will need to port these guys, but that's 20 lines of code:

[Screenshot 2021-05-05 at 09:34:28]

I'm up for a call to discuss this in more detail if needed :).

@Twixes
Collaborator

Twixes commented May 5, 2021

Okay, I get what you're going for more or less.
What do you mean by "Same as postgres, but need to save the action in the database"? As in save the action occurrence to ClickHouse? Why so?
One thing where we'd have to use a query is regex, to avoid discrepancies between insights and this, but that should be simple enough.
Also, just noting webhooks will stay a built-in feature (and not a plugin) for a while even after this, because we don't have an interface for specifying actions to plugins (for which actions to fire webhook) + we don't have an interface to specify per-action plugin config (webhook formatting).

@Twixes
Collaborator

Twixes commented May 5, 2021

And yes, you're pretty much on point with the specific features to port, I don't have anything to add to that list.
BTW I think you missed the point of the "Continuous action matching maintenance" row – the meaning is that, taking your suggested approach for instance, we'll need to maintain (fix/update features) Postgres matching in JS only (onAction/webhooks/Zapier/insights occurrence saving - I don't include querying the bridge table from Python because that's trivial), and ClickHouse matching in Python (insights query building) + in JS (onAction/webhooks/Zapier in ingestion). This is 3 places to maintain logic in, which, alright, is more than 2, but still acceptable to me.
Maybe just a Zoom call ASAP? :D

@mariusandra
Collaborator Author

mariusandra commented May 5, 2021

Sorry, typo for CH, should have been "no need to save in db" .

Why do we need a query for the regex? Just because of incompatible regex implementations?

Definitely fine for webhooks not to be a plugin.

I'm not really following after the BTW. Again, we would not do any postgres matching in JS. No databases, just comparing javascript variables on events as they fly by. We'd maintain the two filters as they are now in python, and call actions.filter(action => matchesAction(event, action)) in a simple loop after processEvent in JS. That's indeed one extra place to maintain the logic, but I think the gains are worth it.
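That "simple loop after processEvent" could be as small as this sketch, where `matchesAction` and the `onMatch` callback (webhooks, onAction, the PG INSERT) are placeholders for whatever the real implementations end up being:

```typescript
// Illustrative only - names and shapes are assumptions, not the real API.
interface IngestEvent {
    event: string
    teamId: number
}

interface Action {
    id: number
    name: string
}

// Placeholder predicate - the real one compares action steps/filters to the event
type ActionMatcher = (event: IngestEvent, action: Action) => boolean

async function handleEventActions(
    event: IngestEvent,
    actions: Action[],
    matchesAction: ActionMatcher,
    onMatch: (event: IngestEvent, action: Action) => Promise<void>
): Promise<Action[]> {
    // Pure in-memory comparison: no DB queries, no Celery round trips
    const matched = actions.filter((action) => matchesAction(event, action))
    for (const action of matched) {
        await onMatch(event, action) // webhooks / onAction / PG INSERT go here
    }
    return matched
}
```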

What's the bridge table?

@Twixes
Collaborator

Twixes commented May 5, 2021

Yep, the regex implementations differ slightly between all of JS, ClickHouse and Postgres.

No, sorry, I definitely don't mean "Postgres matching" as in "matching done with Postgres SQL", but as "matching done in the Postgres pipeline of ingestion > analytics" – as there's a significant degree separation between Postgres and CH in this regard, particularly with the current setup. I do see that doing matching in JS ingestion can eliminate a lot of this separation though – so actually really the logic is only 2x. OK, I like that.

A bridge table is one name for a table connecting two tables with a many-to-many relationship. Specifically interested in posthog_action_events here, used in Postgres insights, which is implicitly created by Django ORM for ManyToManyField Action.events.

OK, so here's a revised approach:

  • Implement an in-memory synced list of action steps for all teams, reloading/adding/removing data for a specific action based on pubsub dispatched when the action is updated in the app, using Django model signals.
  • Port action matching logic to ingestion in JS, eliminating action matching SQL queries per event:
    • Action step type:
      1. Autocapture:
        • Link href equals is trivial
        • Text equals is trivial
        • HTML selector match - more complicated, we can port the exact code from Python or maybe there's an npm library that does exactly what we need?
        • URL contains is trivial
        • URL matches regex – more complicated, JS RegExp should do the job in most cases, though a small fraction of users may experience discrepancies due to Postgres, ClickHouse and JS all using slightly different syntaxes. Querying Postgres/ClickHouse would give always correct results at a cost of complexity and performance
        • URL matches exactly is trivial
2. Custom event - just matching exact event name, trivial
      3. Page view - just matching exact event name, trivial + same URL options as autocapture, needs resolving
    • General filtering:
      1. Property:
        • Equals (or not) is trivial
        • Contains (or not) is trivial
        • Matches (or not) regex – same problem as with autocapture URLs
        • Greater than/lower than seems trivial, but will have to watch for weird JS behavior with casting strings to numbers
        • Set (or not) is trivial
      2. Cohort:
        • Person belonging - the simplest approach would just query Postgres posthog_cohortpeople, but that'd be an SQL query per event. As far as I understand, this is calculated async for both Postgres and ClickHouse, using tasks in posthog/tasks/calculate_cohort.py. So we can store posthog_cohortpeople data in memory in the plugin server, and these async calculation tasks can - when they are done - dispatch a pubsub event to the plugin server causing an update of CohortPeople data from Postgres. The only caveat is that this data has to fit in memory. Currently that table is 2.7 GB on Cloud, which is… maybe manageable, but still large?
  • In ingestion pipeline (processEvent), run above JS matching on the event – if there's a match, run an onAction Piscina task for it, run webhooks (plus Zapier on ClickHouse) firing. If ingesting to Postgres, insert the action occurrence into the Postgres posthog_action_events table (skipped for ClickHouse).
  • Postgres insights will keep using posthog_action_events, but that's no matching logic. ClickHouse insights will still need to use format_action_filter synced with JS mechanism, which is not ideal, but manageable.
  • Postgres webhooks firing will be moved entirely from Python (periodic calculation) to JS (ingestion). ClickHouse webhooks/Zapier firing will be moved entirely from Python (prefiltered Celery dispatch) to JS (ingestion).
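On the greater-than/lower-than caveat in the list above: in JS, `'10' > '9'` is `false` because string comparison is lexicographic, so a hypothetical helper would need to coerce to numbers explicitly, something like:

```typescript
// Explicit numeric coercion for gt/lt property filters.
// Without it, '10' > '9' compares strings lexicographically and is false.
function compareNumeric(actual: unknown, expected: unknown, op: 'gt' | 'lt'): boolean {
    const a = typeof actual === 'number' ? actual : parseFloat(String(actual))
    const b = typeof expected === 'number' ? expected : parseFloat(String(expected))
    if (Number.isNaN(a) || Number.isNaN(b)) {
        return false // non-numeric values never match a numeric filter
    }
    return op === 'gt' ? a > b : a < b
}
```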

Comments?

@mariusandra
Collaborator Author

That seems perfect to me!

Regarding cohorts, 2.7 GB is obviously too much to keep in memory. And if you multiply these numbers 10x just in case, it's then definitely too much.

I think the system should be smart enough to first even know if it needs cohorts for any action, and if so, we can just inline them with a subquery when fetching the person. Similarly we should also know if there's any need to fetch the person, and skip if not.
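That "only fetch what matching actually needs" idea could be precomputed once per team whenever the cached actions change, rather than decided per event. A sketch, with all names hypothetical:

```typescript
// Decide up front which extra data action matching needs for a team,
// so ingestion can skip the person/cohort fetches entirely when possible.
interface TeamActionFilter {
    type: 'event' | 'person' | 'cohort' | 'element'
}

interface DataNeeds {
    person: boolean
    cohorts: boolean
}

function computeDataNeeds(filters: TeamActionFilter[]): DataNeeds {
    return {
        // Cohort membership is resolved via the person, so it implies a person fetch
        person: filters.some((f) => f.type === 'person' || f.type === 'cohort'),
        cohorts: filters.some((f) => f.type === 'cohort'),
    }
}
```

If `needs.cohorts` is true, the cohort memberships could then be inlined as a subquery in the same person fetch, as suggested above.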

@Twixes
Collaborator

Twixes commented May 5, 2021

Ah, crap, I also forgot one other ugly complication for Zapier: it's a premium feature, meaning it does some convoluted stuff to get features available for the organization (both on Cloud and self-hosted), using the Organization model from PostHog/posthog plus the OrganizationBilling and License models from PostHog/posthog-cloud. We'll probably have to store that on the Organization model, which could be cached with something similar to the plugin server's TeamManager for Team. Will need to talk to @paolodamico about this.

@paolodamico
Contributor

I think for this it makes sense to have an available_features property in Organization that's just a cache. We can then update it based on certain event triggers (e.g. after you add a new license, a billing subscription is activated, or a subscription is cancelled), and probably a daily/weekly cron job would be needed too (to handle updates based on license/billing_plan expiration, and code changes where we add/remove features to certain plans). This would also marginally speed up the /api/users/@me/ endpoint. Happy to help here if needed.
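On the plugin server side, reading that cached field could go through a small TTL cache in the spirit of TeamManager. This is a sketch under assumptions, with a hypothetical loader standing in for the Postgres query:

```typescript
// Hypothetical TTL cache for Organization.available_features, so ingestion
// doesn't hit Postgres for every event that needs a premium-feature check.
type FeatureLoader = (organizationId: string) => Promise<string[]>

class OrganizationFeatureCache {
    private cache = new Map<string, { features: string[]; loadedAt: number }>()

    constructor(
        private load: FeatureLoader, // e.g. a SELECT on the organization row
        private ttlMs: number = 30_000,
        private now: () => number = Date.now // injectable clock, for testing
    ) {}

    async availableFeatures(organizationId: string): Promise<string[]> {
        const entry = this.cache.get(organizationId)
        if (entry && this.now() - entry.loadedAt < this.ttlMs) {
            return entry.features // fresh enough, skip the DB
        }
        const features = await this.load(organizationId)
        this.cache.set(organizationId, { features, loadedAt: this.now() })
        return features
    }
}
```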
