Originally created by @aaronsteers on 2021-05-26 17:53:40
While draining or flushing a collection of target sinks, priority could be given to a number of competing factors:

**Drain more often for the benefit of:**

- reduced memory pressure, by flushing stored records
- reduced latency, at least for earlier-emitted records
- more frequent checkpoints, i.e. an increased frequency of emitted STATE messages

**Drain less often for the benefit of:**

- fewer overall batches
- efficiency of bulk-loading records at high scale
- lower costs on destination platforms which may charge or meter per batch
  - for instance, Snowflake charges less when running 1 minute out of every 15 versus running intermittently for all 15 minutes
**Other factors to consider:**

- **Defining 'full' for each sink.** Each sink should report when it is full, either by implementing custom `is_full()` logic or else by specifying a max record count.
- **Controlling max per-record latency.** We may want to provide a max per-record latency threshold - for instance, prioritizing a batch to be loaded if it contains one or more records that have been in the queue for over 120 minutes.
- **Draining multiple sinks when one is triggered ('toppling').** When one sink is being drained, we may want to opportunistically drain all others at the same time. This could have benefits for metering and platform costs. For instance, in the Snowflake case it is cheaper to flush everything at once and spend fewer minutes of each hour running batches. Draining all sinks also allows us to flush the stored state message(s).
- **Memory pressure.** If memory pressure is detected, this might force the flush of one or more streams.
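The 'full' and per-record-latency checks above can be sketched together. This is a minimal illustration, not the actual SDK API - all names here (`Sink`, `max_record_count`, `max_record_age_seconds`, `is_full`, `drain`) are assumptions for the sake of example:

```python
import time


class Sink:
    """Illustrative sink that reports when it is 'full' and should be drained.

    Default logic: full when either the record-count threshold or the
    per-record-latency threshold is exceeded. A destination developer
    could override is_full() with custom logic tuned to their platform.
    """

    def __init__(self, max_record_count=10_000, max_record_age_seconds=120 * 60):
        self.max_record_count = max_record_count
        self.max_record_age_seconds = max_record_age_seconds
        self._records = []
        self._oldest_record_time = None  # monotonic time of the oldest queued record

    def add_record(self, record):
        if self._oldest_record_time is None:
            self._oldest_record_time = time.monotonic()
        self._records.append(record)

    def is_full(self):
        if len(self._records) >= self.max_record_count:
            return True
        if self._oldest_record_time is not None:
            age = time.monotonic() - self._oldest_record_time
            if age >= self.max_record_age_seconds:
                return True
        return False

    def drain(self):
        """Hand off all buffered records as one batch and reset the clock."""
        batch, self._records = self._records, []
        self._oldest_record_time = None
        return batch
```

A 'toppling' coordinator would then simply call `drain()` on every sink whenever any one sink reports `is_full()`.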
Our strategy here (broadly) should probably have at least two layers:

1. The developer provides default logic or a prioritization strategy that is tuned to work well for the destination system.
2. The user may optionally override that strategy at runtime using config options.
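One plausible shape for those two layers is a plain settings merge: developer-tuned defaults, with a whitelisted subset overridable from user config. This is a hypothetical sketch - the setting names and merge helper are invented for illustration:

```python
# Layer 1: developer-supplied defaults, tuned for the destination system.
DEVELOPER_DEFAULTS = {
    "max_record_count": 10_000,     # drain when this many records are buffered
    "max_record_age_seconds": 900,  # e.g. align with a 15-minute billing window
    "drain_all_on_trigger": True,   # 'toppling': drain every sink when one fills
}


def resolve_drain_settings(user_config: dict) -> dict:
    """Layer 2: merge user-provided runtime overrides on top of the defaults.

    Unknown keys are rejected so that typos in config fail loudly rather
    than being silently ignored.
    """
    unknown = set(user_config) - set(DEVELOPER_DEFAULTS)
    if unknown:
        raise ValueError(f"Unknown drain settings: {sorted(unknown)}")
    return {**DEVELOPER_DEFAULTS, **user_config}
```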
I have a wild idea: if we have an intermediate buffer layer (perhaps by expanding the inline map spec or introducing a buffer block type), the end user can roughly control and tune the flush size of the loader block even before each loader implements any strategy. My gut feeling is that this could satisfy maybe 80% of the use cases? Just my 2 cents.
@simonpai thank you for sharing your thoughts! Yeah, I agree. A durable source sitting right in front of the tap might be a simple and good enough solution against backpressure. I can't think of any reason why a naive buffer that implements the mapper interface wouldn't just work, but I may be missing something.
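A naive buffer of the kind discussed above could be sketched as a pass-through over the Singer message stream. This is purely illustrative (the function name and `flush_size` parameter are invented); it buffers RECORD messages and re-emits them in user-controlled chunks, flushing before any STATE message so a checkpoint never outruns the records it refers to:

```python
import json


def buffered_messages(lines, flush_size=1000):
    """Naive buffer sitting between a tap and a target (hypothetical sketch).

    - RECORD messages are held and re-emitted in chunks of `flush_size`,
      letting the user coarsely control the target's batch size.
    - A STATE message first flushes the buffer, then passes through.
    - SCHEMA and other messages pass through immediately.
    """
    buffer = []
    for line in lines:
        msg = json.loads(line)
        if msg.get("type") == "RECORD":
            buffer.append(line)
            if len(buffer) >= flush_size:
                yield from buffer
                buffer.clear()
        elif msg.get("type") == "STATE":
            yield from buffer
            buffer.clear()
            yield line
        else:
            yield line
    yield from buffer  # flush any remainder at end of stream
```

Because messages are only reordered within a stream's buffer window (and never across a STATE boundary), the relative ordering the target sees is preserved.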
Migrated from GitLab: https://gitlab.com/meltano/sdk/-/issues/135