This issue was moved to a discussion.

You can continue the conversation there. Go to discussion →

Target sink - optimization strategies for when to flush batches #134

Closed
MeltyBot opened this issue May 26, 2021 · 4 comments
@MeltyBot
Contributor

Migrated from GitLab: https://gitlab.com/meltano/sdk/-/issues/135

Originally created by @aaronsteers on 2021-05-26 17:53:40


When deciding when to drain or flush a collection of target sinks, priority could be given to a number of competing factors:

  1. Drain often for benefit of:
    1. reduced memory pressure by flushing stored records
    2. reduced latency, at least for earlier-emitted record(s)
    3. more frequent checkpoints, aka increased frequency of STATE messages emitted
  2. Drain less often for benefit of:
    1. Fewer overall batches
    2. Efficiency of bulk loading records at high scale
    3. Lower costs on destination platforms which may charge or meter per batch
      • for instance, Snowflake charges less when running 1 minute out of every 15 than when running intermittently for all 15 minutes.
  3. Other factors to consider:
    1. defining 'full' for each sink
      • Each sink should report when it is full, either by writing custom is_full() logic or else by specifying a max record count.
    2. controlling max per-record latency
      • We may want to provide a max per-record latency threshold - for instance, prioritizing a batch to be loaded if it contains one or more records that have been in the queue for over 120 minutes.
    3. draining multiple sinks when one is triggered ('toppling')
      • When one sink is being drained, we may want to opportunistically drain all others at the same time. This could have benefits for metering and platform costs.
      • For instance, it is cheaper in the Snowflake case to flush all at once and have fewer minutes of each hour running batches.
      • Draining all sinks also allows us to flush the stored state message(s).
    4. memory pressure
      • If memory pressure is detected, this might force the flush of one or more streams
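The flush-trigger factors above can be sketched roughly as follows. This is an illustrative sketch only, not the actual SDK interface: the class and parameter names (`Sink`, `max_size`, `max_record_age_seconds`) are hypothetical stand-ins for whatever the SDK ultimately exposes.

```python
import time


class Sink:
    """Hypothetical sink that reports when it should be drained."""

    def __init__(self, max_size=10_000, max_record_age_seconds=120 * 60):
        self.max_size = max_size
        self.max_record_age_seconds = max_record_age_seconds
        self.records = []
        self._oldest_record_time = None  # when the oldest queued record arrived

    def add_record(self, record):
        if self._oldest_record_time is None:
            self._oldest_record_time = time.monotonic()
        self.records.append(record)

    def is_full(self):
        """Return True when any drain trigger fires."""
        if len(self.records) >= self.max_size:
            return True  # record-count threshold reached
        if (
            self._oldest_record_time is not None
            and time.monotonic() - self._oldest_record_time
            >= self.max_record_age_seconds
        ):
            return True  # per-record latency threshold exceeded
        return False

    def drain(self):
        """Hand back the pending batch and reset the sink's state."""
        batch, self.records = self.records, []
        self._oldest_record_time = None
        return batch
```

A developer could override `is_full()` with destination-specific logic (e.g. estimated payload bytes, or a memory-pressure check), while the simple record-count and latency defaults cover the common case.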

Our strategy for this (broadly) should probably be to have at least two layers:

  • The developer provides some default logic or prioritization strategy that is tuned to work well for the destination system.
  • The user may optionally have some ability to override at runtime using config options.
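The two layers might combine as a simple merge of user-supplied config over developer defaults. Again a sketch under assumptions: the setting names (`batch_size`, `drain_all_on_flush`) are illustrative, not defined SDK options.

```python
# Layer 1: defaults the target developer tunes for the destination system.
DEVELOPER_DEFAULTS = {"batch_size": 100_000, "drain_all_on_flush": True}


def resolve_flush_settings(user_config: dict) -> dict:
    """Layer 2: let recognized user config options override the defaults."""
    settings = dict(DEVELOPER_DEFAULTS)
    settings.update({k: v for k, v in user_config.items() if k in settings})
    return settings
```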

@simonpai

I have a wild idea: if we have an intermediate buffer layer (perhaps by expanding the inline map spec or introducing a buffer block type), the end user can roughly control and tune the flush size of the loader block even before each loader implements any strategy. My gut feeling is that this could satisfy maybe 80% of the use cases. Just my 2 cents.
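The intermediate-buffer idea above could look something like this minimal sketch: a hypothetical `Buffer` that sits between tap and target and re-chunks the record stream into user-controlled batch sizes. This is not an existing SDK block type, just an illustration of the concept.

```python
class Buffer:
    """Accumulate records and forward them downstream in fixed-size batches."""

    def __init__(self, downstream, flush_size=1000):
        self.downstream = downstream  # callable that receives a list of records
        self.flush_size = flush_size
        self._pending = []

    def write(self, record):
        self._pending.append(record)
        if len(self._pending) >= self.flush_size:
            self.flush()

    def flush(self):
        """Forward any pending records, e.g. at end of stream."""
        if self._pending:
            self.downstream(self._pending)
            self._pending = []
```

The loader downstream then only ever sees batches of at most `flush_size` records, regardless of how the tap emits them.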

@edgarrmondragon
Collaborator

@simonpai thank you for sharing your thoughts! Yeah, I agree. A durable source sitting right in front of the tap might be a simple and good enough solution against backpressure. I can't think of any reason why a naive buffer that implements the mapper interface wouldn't just work, but I may be missing something.

@stale

stale bot commented Jul 18, 2023

This has been marked as stale because it is unassigned, and has not had recent activity. It will be closed after 21 days if no further activity occurs. If this should never go stale, please add the evergreen label, or request that it be added.

@stale stale bot added the stale label Jul 18, 2023
@stale stale bot closed this as completed Aug 8, 2023
@stale stale bot removed the stale label Aug 8, 2023
@meltano meltano locked and limited conversation to collaborators Aug 8, 2023
@edgarrmondragon edgarrmondragon converted this issue into discussion #1901 Aug 8, 2023
