Kafka last mile batching & retrying processEvent
#273
Won't dive into the implementation proposal yet, as there are quite a few open questions we should consider:
Note that we should answer from a 2022-2023 PostHog perspective here even if we won't implement all of these immediately. This sort of architecture is really hard to change after the fact if we get some fundamental assumption incorrect.
That's a bunch of good questions! I don't know the answers to them all either, but to keep the ideas flowing, I'll give my first quick thoughts. And thanks for helping figure out what we really need to build!
Segment has an option to throw specific errors to indicate what you want to happen. Basically you throw a `RetryError`.
"Segment retries 9 times over the course of 4 hours. This increases the number of attempts for messages, so we try to re-deliver them another 4 times after some backoff." Something like that as a default and we adapt as needed? 4 hours seems a bit on the lower side though given our current scale and potential random outages.
We might need to move in a direction where too many timeouts disables a plugin altogether, as such out of control plugins might harm the system.
We can take inspiration from this delivery issues dashboard.
Nope.
We could eventually bill by CPU time (even if totally idle and awaiting 🤑), so it wouldn't really matter. 30sec for
I'd say plugin*message based. So if you have 3 plugins: ["currency", "geoip", "export"] and "geoip" errors, it should retry from that spot, skipping "currency".
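To illustrate what "retry from that spot" could mean, here's a rough sketch of a pipeline that resumes from the failed plugin's index. The types and shapes (`PluginStep`, `failedIndex`) are made up for this example:

```ts
// Illustrative sketch only: resume a plugin pipeline from the step that failed,
// skipping plugins that already ran. Names and shapes are assumptions.
type PluginEvent = Record<string, any>
type PluginStep = { name: string; processEvent: (event: PluginEvent) => Promise<PluginEvent> }

async function runPipeline(event: PluginEvent, plugins: PluginStep[], startIndex = 0): Promise<PluginEvent> {
    let current = event
    for (let i = startIndex; i < plugins.length; i++) {
        try {
            current = await plugins[i].processEvent(current)
        } catch (error) {
            // Persist { event: current, failedIndex: i } somewhere, so the retry
            // starts at "geoip" instead of re-running "currency".
            throw Object.assign(error as Error, { failedIndex: i, partiallyProcessedEvent: current })
        }
    }
    return current
}
```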
Skipping ingesting anything until the middle one succeeds after retries
We shouldn't make guarantees we can't keep. I'd say any export should be best-effort and unique by uuid.
Possibly only if throwing a `RetryError`.
"No, Segment can’t guarantee the order in which the events are delivered to an endpoint" and neither can't we.
They can use the
This is very good context, especially around RetryErrors and scheduled plugins. The only thing I disagree with is:
From a semantics standpoint I agree here, this would be ideal. However thinking of how to implement this, we would need to write the message "down" into a persistent storage N_Plugins times, which is going to be a driver in terms of $cost_of_storage. How about this?
**Proposed semantics**

To summarize the rest, the semantics would come out to be something like:
This should allow us to create a system that basically functions like the current one, except:
In my head it should work as expected for the different classes of plugins:
WDYT?
I'm not fully sure why we can't just directly skip the already processed plugins on a best-effort basis? Requiring every other plugin author to add
Good question, I tried answering this earlier but let's try again. It's because of the difference in expectations between data munging plugins and webhooks/synchronization plugins. A data munging plugin (e.g. a currency converter) does not care about retries; it needs to "do" its thing every time. As to why we are not storing the processed message after every plugin to avoid this added complexity:
You can think of our cost of infrastructure increasing linearly with $bytes_written_somewhere and that being the main cost point (besides devs). Saving the message per plugin would increase the bill significantly (we'd go from writing the message somewhere 3 times to 3 + N_Plugins times). Even if we did all that, there's no guarantee we wouldn't do double-deliveries. A plugin worker stalling after doing the $write-to-bigquery but before the $write-to-queue is an easy example of this. Another alternative here is separating these two plugin types somehow, but that also ends up with a lot of extra constraints.
I guess you could store the state of the message before $step_of_retry_failure in the DB and proceed from there. This brings up a new question though - what if the plugins or their order changes? I'd expect the retry to go through the "new" pipeline rather than the old.
I'd imagine we'd write the transient state of the event only if something fails and needs to be retried. Effectively a dead letter queue. That shouldn't add too much extra write overhead, except for when large outages happen. It could/would be best-effort, meaning if something in the order of plugins changes, all bets are off, or if something is retried twice (failure between $write-to-bigquery and $write-to-retry-queue), no big deal.
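For illustration, a rough sketch of the "write transient state only on failure" idea. `enqueueDeadLetter` and the entry shape are hypothetical; the point is only that the happy path never writes anything extra:

```ts
// Sketch of "only write on failure": the happy path never touches the dead
// letter queue; only a failed step persists the partially processed event.
// enqueueDeadLetter and its payload shape are assumptions for illustration.
interface DeadLetterEntry {
    eventUuid: string
    pluginConfigId: number
    partiallyProcessedEvent: Record<string, any>
    retryAt: Date
}

async function processWithDeadLetter(
    event: { uuid: string; [key: string]: any },
    pluginConfigId: number,
    step: (event: any) => Promise<any>,
    enqueueDeadLetter: (entry: DeadLetterEntry) => Promise<void>
): Promise<any> {
    try {
        return await step(event)
    } catch {
        // Best effort: if this enqueue itself fails we may double-deliver later,
        // which matches the "all bets are off" framing above.
        await enqueueDeadLetter({
            eventUuid: event.uuid,
            pluginConfigId,
            partiallyProcessedEvent: event,
            retryAt: new Date(Date.now() + 60_000), // retry in a minute, arbitrary choice
        })
        return event
    }
}
```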
Not sure I understand where you're proposing these writes would go. Are you describing a separate solution compared to the thing described under Proposed semantics here or something different? Why do we need a separate solution for this? If you're proposing the same solution, what would the semantics be? Do we only write things after failures or eagerly after every plugin?
I haven't confirmed that these semantics are exactly what we will implement, so I'm just thinking abstractly in terms of product here and not proposing anything specific. These semantics get deep into implementation details already in step 2 ("how will that look like?"), so I haven't had the chance to fully play them through nor poke holes in them.
Ack, will wait until you have specific feedback. You are correct that the "Proposed semantics" section had some implementation/algorithm thinking involved here. This is because some edge cases only became obvious to me when thinking through the whole data flow and trying to work through things that can break during every step.
I've been reading through segment's centrifuge blog post to get ideas. They have an interesting system that took 9 months to build. Some facts from there:
I'm not sure if, how, or when we should build something this complex.
How's this for a crazy idea? For last mile batching after kafka, how about using... Kafka?

Currently for PostHog cloud, the plugin server runs on 4-core instances (4096 ECS CPU). Each of these is able to process 10k events per minute when sequentially processing kafka batches of size 1. Beyond that the batch size (and parallelism) increases, all the way up to 4 events per batch during normal cloud hours. The most we've seen on cloud is ~40 events per batch during huge congestions.

So how's this for a strategy: we increase the number of Kafka consumers per server and limit the batch size to always be just 1. Is this a bad idea? I feel like this is slightly rebellious, but it could work. Am I missing something fundamental @fuziontech @macobo?

On a 4-core server, that gives us a theoretical maximum processing+ingestion rate of 40k events/minute (1.7B events/month), unless we overprovision to better handle async plugins. Since Kafka right now supports thousands of partitions and is very soon moving to a model that supports millions of partitions, we can probably increase our consumer count from the low single digits we have now to low double digits... and eventually add a comma or two. If even then we exceed the feasible number of consumers, we can implement app level sharding. Either sharding by topic (round-robin in

**What would this "one event per batch" give us?** We will trade some raw throughput for increased control and safety. Currently if we have one consumer and a batch of 40, with one 30sec event in the middle, ingestion is frozen. With the "consumer per thread" model, we could have processed and committed 39 events, while leaving that one hanging for longer.

That one hanging event? We could have the possibility to commit it anyway after 3 seconds, and throw its metadata on a separate retry queue. We'll check that after X seconds to see if the event had completed in the background, or if it needs to be retried.

Unless we overprovision consumers to threads (piscina is configured to handle 10 parallel tasks per thread after all), we would take a hit in our ingestion speed. However the flexibility we gain would be worth it. And if we bill per plugin-second, we should be good anyway :).

This might sound like we should also just make the entire plugin server singlethreaded (only 1 core to play with) and launch a dozen of them, but that's a bad idea. Thinking a few steps ahead, I'd still prefer having multiple cores per plugin server. The reason is simple: we will have different types of tasks a plugin server can do. Scheduled tasks. Processing events. Processing retries. Replying to HTTP requests (webhooks). Distributing frontend plugins. Etc.

Imagine a dozen plugin servers running. We'll want to somehow distribute the workload. For example with 16 cores in a cluster, have 10 cores process events, 2 run scheduled tasks, 1 deal with retries and 1 deal with web requests. Coordinating 4 servers with 4 cores feels a lot easier than coordinating 16 servers with 1 core. It becomes more apparent if you increase the scale: it feels easier to coordinate 80 cores on 20 servers than 80 cores on 80 servers.

This coordination is a topic for another issue, but shouldn't be too hard to implement. We already use redlock, which designates one server as the "scheduler". Instead we should just designate one server as "coordinator" and let it tell the others what to do... and unless told otherwise, every server just processes events.
Once that single-coordinator model gets too complicated, we can implement some version of Raft... or do a reverse-kafka and implement Zookeeper :D.
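For concreteness, a rough sketch of the "one event per batch" consumer model using kafkajs. The library choice, broker address, topic name, and consumer count are assumptions for illustration only, not how the plugin server is actually wired up:

```ts
import { Kafka } from 'kafkajs'

// Sketch of "more consumers per server, batch size of one". eachMessage gives
// commit-per-message semantics, and several consumers per process approximate
// the proposed model. Everything concrete here is an assumption.
const kafka = new Kafka({ clientId: 'plugin-server-sketch', brokers: ['localhost:9092'] })

async function processEvent(event: unknown): Promise<void> {
    // stand-in for handing the event to piscina / the plugin pipeline
}

async function startConsumer(index: number): Promise<void> {
    const consumer = kafka.consumer({ groupId: 'plugin-server' })
    await consumer.connect()
    await consumer.subscribe({ topics: ['events_ingestion'], fromBeginning: false })
    await consumer.run({
        partitionsConsumedConcurrently: 1,
        eachMessage: async ({ message }) => {
            // One event at a time: a single 30sec event only blocks this consumer,
            // not a whole 40-event batch. On a timeout we could commit anyway and
            // push the event's metadata onto a separate retry queue.
            await processEvent(JSON.parse(message.value?.toString() ?? '{}'))
        },
    })
    console.log(`consumer #${index} started`)
}

// e.g. a handful of consumers per 4-core server, tuned by load testing
Promise.all([0, 1, 2, 3].map(startConsumer)).catch(console.error)
```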
Feeling a bit lost here.
Can you clarify a bit: Where would that queue live and how would we make sure the task has completed when it's processed?
I'm not sure, and that's an open question. Either kafka again (though there's no delayed consumption support there, requiring app level fun), or it could be a postgres queue like https://github.com/graphile/worker. The requirements for this slow backup ingestion are different compared to normal fast ingestion. The main point is to just keep these events somewhere and to not lose data.
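To show roughly how the graphile/worker option could slot in, a sketch of enqueueing a failed event as a delayed job and running a worker for it. Connection string, task name, and payload shape are assumptions:

```ts
import { run, quickAddJob } from 'graphile-worker'

// Rough sketch of the postgres-queue idea: park events that failed processing
// in a graphile-worker job and retry them later. Not an actual implementation.
const connectionString = 'postgres://user:pass@localhost/posthog_jobs'

export async function enqueueRetry(event: Record<string, any>, pluginConfigId: number): Promise<void> {
    await quickAddJob({ connectionString }, 'retry_process_event', { event, pluginConfigId }, {
        runAt: new Date(Date.now() + 60_000), // back off for a minute before retrying
        maxAttempts: 9,
    })
}

export async function startRetryWorker() {
    // graphile-worker watches Postgres and calls this task for each due job.
    return run({
        connectionString,
        concurrency: 5,
        taskList: {
            retry_process_event: async (payload: any) => {
                // In the real server this would hand the event back to the pipeline.
                await reprocessEvent(payload.event, payload.pluginConfigId)
            },
        },
    })
}

async function reprocessEvent(event: Record<string, any>, pluginConfigId: number): Promise<void> {
    // stand-in for the actual plugin pipeline
}
```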
Consumer batching is now implemented here, though I haven't properly tested it yet. No retries either.
My 2c: This issue is now tackling two (IMO) separate issues.
I don't have input for 2 because:
As for (1), I don't really have anything to add which I didn't cover here. To summarize the architectural components of it:
There are still some questions open re semantics (N_messages vs N_messages*M_plugins writes, exposing things as meta arguments), but the gist there stays the same.
Hi - slowpoke to this distributed queue party, but just finished reading segment's centrifuge architecture and :chef_kiss: That's quite nice. They nailed it. The key here is the

They use mysql for their JobDB which is a great choice. Especially considering that they are running

So for us, we are planning on moving over to K8s for everything. This means that we can use different kinds of deployments like StatefulSets, which means we can have volumes associated with our services. This is how you can run Postgres on Kubernetes. What we can do then is just have a volume associated with each plugin server with a Sqlite DB on it that is the JobDB. No locks or manager necessary. Alternatively we can set up a mysql container in the same pod as the plugin server to accomplish the same thing with only slightly more complexity (1 more container per pod for the plugin server deployment).

In the meantime we could basically build towards this goal but abstract out the JobDB side and spin up a larger Mysql RDS instance that all of the plugin-servers talk to as the JobDB.

Something else we would need to do is to build out the state machine for handling payloads that need to be handled later. For cloud this is pretty straightforward with another Kafka topic ending up in S3. As for on prem we will probably need to use S3 or GCS unless we can allocate plenty of EBS for the cluster. We would then need to update
Why is mysql better for this than postgres?
I have seen Mysql scream at basic key value functions like this where you are not doing any joins and just doing basic inserts and reads, granted usually with more reads than writes. We used it heavily for this kind of thing at Uber. To be fair, Postgres would work just fine here too. I've just never seen anyone use Postgres as a core for this kind of thing; almost always it's for some OLAP thing. For something like this (simpler, higher TPS, no joins) it always seems to be built on mysql. I can't find any solid benchmarks comparing the two to validate that though.

Don't get me wrong, Postgres is my favorite RDBMS, but anytime you don't need those sophisticated functions and features Mysql has done the trick nicely. It's also a relatively lighter and simpler db.

To be clear though, I am advocating for something like sqlite or rocksdb - local to the plugin-service - so that we don't need to manage db instances separately from the plugin-service or have some sort of lock to bind the plugin-service to the db instance.
**Approach nr 42** (I lost count)

This issue is indeed mixing two topics (1. the case of one plugin slowing down ingestion; 2. the case of retrying event processing). Proposed solutions to both of them need some sort of "retry queue". So here's another idea for a queue.

Our priority is data integrity first, speed second. As long as we don't lose anything... and retries do happen within a timely manner, we can store this retry queue where it's the easiest to keep... and where we're sure it won't run out of space.

💡 So why not just store the retry queue in the same place where we eventually store the events? That means Postgres and Clickhouse.

With postgres ingestion, it'd be just another random table "posthog_retryqueue". Something simple like:

```sql
create table posthog_retryqueue (
    created_at timestamptz,
    uuid uuid,
    plugin_config_id integer,
    retry_at timestamptz,
    retry_count integer,
    retry_type varchar(200),
    retry_payload text -- actually json, but no need to deserialize into the db
)
```

In the plugin server, we will use redlock to select a random server that gets the role of "queue cleaner". This one server just periodically runs

<sidenote> Inside the plugins, we'll need support for different types of retries (events, batches, etc). This db-backed queue would be universal and support all of them.

The interface to interact with the queue would be pretty simple:

```ts
// inside blabla-plugin/index.js
async function sendToAPI(event: PluginEvent, meta: PluginMeta) {
    try {
        await fetch(...)
    } catch (error) {
        meta.retry("sendToAPI", event)
    }
}

export function onRetry(retryType: string, retryPayload: any, meta: PluginMeta) {
    if (retryType === "sendToAPI") {
        sendToAPI(retryPayload, meta)
    }
}

export function processEvent(event: PluginEvent, meta: PluginMeta) {
    void sendToAPI(event, meta) // void to ignore async and run in the background
    return event
}
```

Running

I originally also had an idea to use Kafka as a
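To make the "queue cleaner" role a bit more concrete, a sketch of what the periodically-run query could look like against the posthog_retryqueue table above. The `pg` client, query shape, and dispatch callback are assumptions, and a real version would need to be more careful about not dropping rows before the retry actually succeeds:

```ts
import { Pool } from 'pg'

// Sketch of the redlock-elected "queue cleaner". Illustrative only: the query,
// interval, and dispatch function are assumptions, not an actual implementation.
const pool = new Pool({ connectionString: process.env.DATABASE_URL })

async function cleanRetryQueueOnce(dispatch: (row: any) => Promise<void>): Promise<void> {
    // Claim due retries and hand them back to the plugin servers.
    const { rows } = await pool.query(
        `DELETE FROM posthog_retryqueue
         WHERE retry_at <= now()
         RETURNING uuid, plugin_config_id, retry_type, retry_payload`
    )
    for (const row of rows) {
        // e.g. call the plugin's onRetry(retry_type, retry_payload, meta).
        // Note this naive version deletes the row before the retry succeeds.
        await dispatch(row)
    }
}

// The elected server would run this on an interval, e.g. every 30 seconds.
export function startQueueCleaner(dispatch: (row: any) => Promise<void>) {
    return setInterval(() => cleanRetryQueueOnce(dispatch).catch(console.error), 30_000)
}
```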
This sounds like a great first cut design on a retry system... but this will get really messy on clickhouse when there are a lot of writes to the retry table. Imagine bigquery goes down and we end up sending a ton of events to be retried to clickhouse... and then bigquery comes back up, but intermittently - some events succeed and some don't. Even if we have the modulus coordination system for grabbing and distributing events to plugin workers, how are you going to update the statuses on clickhouse? You'll have to rewrite the rows with an updated

An even more extreme example: because we have some level of fanout in the event -> plugins flow, what if AWS goes down, taking down 6 services plugins depend on? We'll be writing 6 * event count to clickhouse and reading that number * the number of retries rows, which could get pretty big.

This feels much more OLTP than OLAP to me. I'd almost rather just throw this all into a giant Postgres db to start out with and hope nothing huge happens until we code the flush-to-S3 logic. It does feel like this is the simplest way to get going moving forward, but I think the mysql/postgres/rocksdb (local to plugin workers) -> s3 way might be a better solution in the longer term.

Maybe we can as a first step just set up the postgres solution for both EE and OSS? After a certain number of retries or a timeout we can then evict to CH or S3 for long term retry logic. Later we can swap out the big PG instance with an OLTP db local to the plugin server?

I do love that this solution is leaning in on clickhouse. There's a great article on AWS leaning into Dynamo in the same way.
The conclusion I reach from all of the above is that there's no best approach. Thus I made an abstract approach with swappable retry queues in PR #325. It is still very WIP, needs a layer of refactoring, and has not gone through any kind of load testing. But it kind of works. So far I've implemented two different retry queues:
We can stack as many retry queues as we want. If one throws an error while queueing, we'll just use the next one. The last one could always be an S3 or SQS fallback. S3 might be slow to read back from, but who cares as long as we don't lose data :). We can also still have a clickhouse fallback. I think we can do it without modifying any rows, and reading all new data just once. That surely can't be too much to ask from it? :)
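As an illustration of the stacking idea (not the actual code in PR #325), a minimal sketch of queues that fall through to the next one on enqueue failure:

```ts
// Illustrative sketch of stacked retry queues: try each queue in order and
// fall back to the next one if enqueueing fails. The RetryQueue interface and
// its implementations are assumptions, not the PR's code.
interface RetryQueue {
    name: string
    enqueue(retry: { type: string; payload: unknown; retryAt: Date }): Promise<void>
}

class StackedRetryQueue implements RetryQueue {
    name = 'stacked'
    constructor(private queues: RetryQueue[]) {}

    async enqueue(retry: { type: string; payload: unknown; retryAt: Date }): Promise<void> {
        for (const queue of this.queues) {
            try {
                return await queue.enqueue(retry)
            } catch (error) {
                console.warn(`retry queue "${queue.name}" failed, falling back to the next one`, error)
            }
        }
        throw new Error('all retry queues failed - surface loudly, since data would otherwise be lost')
    }
}

// Usage (hypothetical queue instances): new StackedRetryQueue([kafkaQueue, postgresQueue, s3FallbackQueue])
```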
**The curse of engineering: Tradeoffs ⚖️**

This is glorious and I love it. Basically we get to get it done and leave the path open to improve it later... Nice. We aren't even limiting ourselves to a local vs global solution. Nice.
We can close this now. In summary, we didn't implement last mile batching before Kafka. Instead two things changed:
The worst thing a data processing company can do is to lose data, yet that's what we're currently capable of :).
Imagine an export plugin (say bigquery). If for whatever reason the export fails (a rat ate a network cable) while we're processing an event, it will never be exported again. Worse, we also won't know if it was exported or not, unless we diff the `uuid`s in the source and destination databases and see what's missing. (See #269 for one alternative solution.)

To get around this, we need some retrying logic for `processEvent`. There is an existing issue for a dead letter queue, which partially covers this use case. However there could be a broader solution that solves this as well.

Both segment and rudderstack also came up against this issue, and fixed it by implementing a last mile queue on top of a relational database (mysql for segment, postgres for rudderstack):
Basically, events come in through Kafka and end up in a postgres/mysql table. This table is write-heavy with expiring timestamps for segment (it's only read from if something must be retried); not sure how it works for rudderstack.
We might need to do the same, unless some technology other than postgres is a better fit.
Browsing around, I came across this project: https://github.com/graphile/worker - a Job queue for PostgreSQL running on Node.js.
It's not 100% what segment does (you also read from it), but looks pretty sweet and could be a decent and easy-to-implement stopgap solution:
Similar to the celery worker, we'd just listen to jobs and send them to piscina for handling. We'd just have to reorient the Kafka reader in such a way that it feeds tasks to the system, in case there are not enough of them already in the queue.
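A rough sketch of what "listen to jobs and send them to piscina" might look like, assuming graphile-worker and piscina are wired together like this (file names, task names, and concurrency values are illustrative only):

```ts
import * as path from 'path'
import Piscina from 'piscina'
import { run } from 'graphile-worker'

// Sketch: graphile-worker pulls jobs out of Postgres and each job is handed to
// the piscina worker pool. Nothing here is the actual plugin server wiring.
const piscina = new Piscina({ filename: path.resolve(__dirname, 'workerTasks.js') })

export async function startJobRunner(connectionString: string) {
    return run({
        connectionString,
        concurrency: 10, // roughly "jobs in flight", tuned against piscina's own limits
        taskList: {
            process_event: async (payload: any) => {
                // piscina.run() executes the named export from workerTasks.js
                await piscina.run(payload, { name: 'processEvent' })
            },
        },
    })
}
```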
For performance, they claim:
That's 2x what we're experiencing during peak load. It should be horizontally scalable, yet definitely not as fast as segment's implementation, which is closer to a dead letter queue.
This could also be an optional extra that you enable, and it could also work with celery. For cloud, I'd definitely hook it up to a different database than the main Heroku one. If for nothing else, then just for data locality.
This library also has crontab support with some pretty nice guarantees that we'd be happy to give as well (see #15 and #68):