Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

source-*-batch: Timeouts and partial-progress logging #2267

Merged
merged 3 commits into from
Jan 13, 2025

Conversation

willdonnelly
Copy link
Member

@willdonnelly willdonnelly commented Jan 11, 2025

Description:

This PR adds no-data watchdog timeouts to all of the batch SQL connectors, such that if a long period of time passes without any rows being received (when we're awaiting polling query results) it will kill the capture. The time period is pretty generous: 30 minutes when we're waiting for the first result row and 5 minutes for subsequent rows. Pretty much the only times this watchdog should fire are:

  • When the database connection has silently dropped on us (which is a thing that happens sometimes, especially when using SSH tunneling)
  • When the polling query is performing a full-table sort on some massive dataset. And generally speaking if that hasn't finished after 30 minutes it's not likely that we'll actually end up with a successful capture and a pleasant user experience anyway.

I'm still a little bit concerned that even a 30 minute timeout on the initial result rows might cause us to error out in situations where we otherwise could have just kept waiting and eventually gotten results. For example if it takes an hour to sort a particular table but we're only polling it once a day and the user is okay with the DB load of that full-table sort, then in that one particular scenario the user would be better-served by a connector without any such watchdog timeout.

But I think that in practice that scenario basically never happens. Note that when we're doing a cursorless full-refresh of a table we just issue a SELECT * FROM foobar query without any ordering, so the ordering clause only comes into play when the user has set up a cursor and clearly wants more efficient incremental sync behavior. So 30 minutes is probably fine. Even ten minutes would probably be fine.

This PR also adds a couple much smaller improvements:

  • Partial-progress logging so we can tell when looking at the task logs whether a capture was sitting there idly waiting for results, or whether it was furiously processing results but there were just a lot of them.
  • MySQL connection timeouts, the same as we use on the MySQL CDC connector. This is just for consistency with the CDC connector, and because I was already working on "batch MySQL timeouts" and so it seemed related-enough to throw in at the same time.

This change is Reviewable

Adds partial-progress log messages to be emitted after every 100k
result rows. This allows us to tell the difference (looking only
at the task logs) between a long-running query which is actively
processing a greate deal of data, and one which is just sitting
there waiting on the database to send us something.

Log messages are emitted after rows [1, 100001, 200001, ...] so
that we will see at least one log message as soon as we receive
the first result row. This will hopefully not be too spammy, as
it's just one additional log message amidst the three or so we
emit for each polling cycle, and it seemed useful to have that
log after the first result row because often the question is
whether we've received _any_ results yet.
Adds the same one-minute read/write timeouts on the MySQL
connection as the MySQL CDC connector now has. This doesn't
address any current or recent user issues as far as I know,
it just seemed like a useful thing to do for consistency,
while I'm here and thinking about connection loss timeouts
on batch SQL connectors.
The timeout is 30 minutes for the first result row and 5 minutes
for subsequent result rows, which is absurdly generous any time
things are working correctly. Pretty much the only times we ought
to see these fire are:

 - When the database connection has silently dropped on us (which
   is a thing that happens sometimes, especially when using SSH
   tunneling)
 - When the polling query is performing a full-table sort on some
   massive dataset. And generally speaking if that hasn't finished
   after 30 minutes it's not likely that we'll actually end up with
   a successful capture and a pleasant user experience anyway.
@willdonnelly willdonnelly added the change:unplanned Unplanned change, useful for things like doc updates label Jan 11, 2025
@willdonnelly willdonnelly requested a review from a team January 11, 2025 02:44
Copy link
Member

@williamhbaker williamhbaker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@willdonnelly willdonnelly merged commit 7f596b0 into main Jan 13, 2025
51 of 54 checks passed
@willdonnelly willdonnelly deleted the wgd/batch-sql-progress-and-timeouts-20250110 branch January 13, 2025 18:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
change:unplanned Unplanned change, useful for things like doc updates
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants