This issue was moved to a discussion.

You can continue the conversation there. Go to discussion →

Target sink - optimization strategies for when to flush batches #134

Closed
MeltyBot opened this issue May 26, 2021 · 4 comments
@MeltyBot
Contributor

Migrated from GitLab: https://gitlab.com/meltano/sdk/-/issues/135

Originally created by @aaronsteers on 2021-05-26 17:53:40


When deciding when to drain or flush a collection of target sinks, priority could be given to a number of competing factors:

  1. Drain often for benefit of:
    1. reduced memory pressure by flushing stored records
    2. reduced latency, at least for earlier-emitted record(s)
    3. more frequent checkpoints, aka increased frequency of STATE messages emitted
  2. Drain less often for benefit of:
    1. Fewer overall batches
    2. Efficiency of bulk loading records at high scale
    3. Lower costs on destination platforms which may charge or meter per batch
      • for instance, Snowflake charges less when running 1 minute out of every 15 than when running intermittently for all 15 minutes.
  3. Other factors to consider:
    1. defining 'full' for each sink
      • Each sink should report when it is full, either by writing custom is_full() logic or else by specifying a max record count.
    2. controlling max per-record latency
      • We may want to provide a max per-record latency threshold - for instance, prioritizing a batch to be loaded if it contains one or more records that have been in the queue for over 120 minutes.
    3. draining multiple sinks when one is triggered ('toppling')
      • When one sink is being drained, we may want to opportunistically drain all others at the same time. This could have benefits for metering and platform costs.
      • For instance, it is cheaper in the Snowflake case to flush all at once and have fewer minutes of each hour running batches.
      • Draining all sinks also allows us to flush the stored state message(s).
    4. memory pressure
      • If memory pressure is detected, this might force the flush of one or more streams
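The flush-trigger factors above can be sketched roughly as follows. This is an illustrative sketch only, not the actual SDK interface: the class and parameter names (`Sink`, `max_size`, `max_record_age_seconds`) are hypothetical stand-ins for whatever the SDK ultimately exposes.

```python
import time


class Sink:
    """Hypothetical sink that reports when it should be drained."""

    def __init__(self, max_size=10_000, max_record_age_seconds=120 * 60):
        self.max_size = max_size
        self.max_record_age_seconds = max_record_age_seconds
        self.records = []
        self._oldest_record_time = None  # when the oldest queued record arrived

    def add_record(self, record):
        if self._oldest_record_time is None:
            self._oldest_record_time = time.monotonic()
        self.records.append(record)

    def is_full(self):
        """Return True when any drain trigger fires."""
        if len(self.records) >= self.max_size:
            return True  # record-count threshold reached
        if (
            self._oldest_record_time is not None
            and time.monotonic() - self._oldest_record_time
            >= self.max_record_age_seconds
        ):
            return True  # per-record latency threshold exceeded
        return False

    def drain(self):
        """Hand back the pending batch and reset the sink's state."""
        batch, self.records = self.records, []
        self._oldest_record_time = None
        return batch
```

A developer could override `is_full()` with destination-specific logic (e.g. estimated payload bytes, or a memory-pressure check), while the simple record-count and latency defaults cover the common case.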

Our strategy for this (broadly) should probably be to have at least two layers:

  • The developer provides some default logic or prioritization strategy that is tuned to work well for the destination system.
  • The user may optionally have some ability to override at runtime using config options.
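The two layers might combine as a simple merge of user-supplied config over developer defaults. Again a sketch under assumptions: the setting names (`batch_size`, `drain_all_on_flush`) are illustrative, not defined SDK options.

```python
# Layer 1: defaults the target developer tunes for the destination system.
DEVELOPER_DEFAULTS = {"batch_size": 100_000, "drain_all_on_flush": True}


def resolve_flush_settings(user_config: dict) -> dict:
    """Layer 2: let recognized user config options override the defaults."""
    settings = dict(DEVELOPER_DEFAULTS)
    settings.update({k: v for k, v in user_config.items() if k in settings})
    return settings
```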

@simonpai

I have a wild idea: if we have an intermediate buffer layer (perhaps by expanding the inline map spec or introducing a buffer block type), the end user can roughly control and tune the flush size of the loader block even before each loader implements any strategy. My gut feeling is that this could satisfy maybe 80% of the use cases. Just my 2 cents.
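The intermediate-buffer idea above could look something like this minimal sketch: a hypothetical `Buffer` that sits between tap and target and re-chunks the record stream into user-controlled batch sizes. This is not an existing SDK block type, just an illustration of the concept.

```python
class Buffer:
    """Accumulate records and forward them downstream in fixed-size batches."""

    def __init__(self, downstream, flush_size=1000):
        self.downstream = downstream  # callable that receives a list of records
        self.flush_size = flush_size
        self._pending = []

    def write(self, record):
        self._pending.append(record)
        if len(self._pending) >= self.flush_size:
            self.flush()

    def flush(self):
        """Forward any pending records, e.g. at end of stream."""
        if self._pending:
            self.downstream(self._pending)
            self._pending = []
```

The loader downstream then only ever sees batches of at most `flush_size` records, regardless of how the tap emits them.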

@edgarrmondragon
Collaborator

@simonpai thank you for sharing your thoughts! Yeah, I agree. A durable source sitting right in front of the tap might be a simple and good enough solution against backpressure. I can't think of any reason why a naive buffer that implements the mapper interface wouldn't just work, but I may be missing something.

@stale

stale bot commented Jul 18, 2023

This has been marked as stale because it is unassigned, and has not had recent activity. It will be closed after 21 days if no further activity occurs. If this should never go stale, please add the evergreen label, or request that it be added.

@stale stale bot added the stale label Jul 18, 2023
@stale stale bot closed this as completed Aug 8, 2023
@stale stale bot removed the stale label Aug 8, 2023
@meltano meltano locked and limited conversation to collaborators Aug 8, 2023
@edgarrmondragon edgarrmondragon converted this issue into discussion #1901 Aug 8, 2023
