source-*-batch: Timeouts and partial-progress logging #2267

willdonnelly · 2025-01-11T02:44:09Z

Description:

This PR adds no-data watchdog timeouts to all of the batch SQL connectors, such that if a long period of time passes without any rows being received (when we're awaiting polling query results) it will kill the capture. The time period is pretty generous: 30 minutes when we're waiting for the first result row and 5 minutes for subsequent rows. Pretty much the only times this watchdog should fire are:

When the database connection has silently dropped on us (which is a thing that happens sometimes, especially when using SSH tunneling)
When the polling query is performing a full-table sort on some massive dataset. And generally speaking if that hasn't finished after 30 minutes it's not likely that we'll actually end up with a successful capture and a pleasant user experience anyway.

I'm still a little bit concerned that even a 30 minute timeout on the initial result rows might cause us to error out in situations where we otherwise could have just kept waiting and eventually gotten results. For example if it takes an hour to sort a particular table but we're only polling it once a day and the user is okay with the DB load of that full-table sort, then in that one particular scenario the user would be better-served by a connector without any such watchdog timeout.

But I think that in practice that scenario basically never happens. Note that when we're doing a cursorless full-refresh of a table we just issue a SELECT * FROM foobar query without any ordering, so the ordering clause only comes into play when the user has set up a cursor and clearly wants more efficient incremental sync behavior. So 30 minutes is probably fine. Even ten minutes would probably be fine.

This PR also adds a couple much smaller improvements:

Partial-progress logging so we can tell when looking at the task logs whether a capture was sitting there idly waiting for results, or whether it was furiously processing results but there were just a lot of them.
MySQL connection timeouts, the same as we use on the MySQL CDC connector. This is just for consistency with the CDC connector, and because I was already working on "batch MySQL timeouts" and so it seemed related-enough to throw in at the same time.

This change is

Adds partial-progress log messages to be emitted after every 100k result rows. This allows us to tell the difference (looking only at the task logs) between a long-running query which is actively processing a greate deal of data, and one which is just sitting there waiting on the database to send us something. Log messages are emitted after rows [1, 100001, 200001, ...] so that we will see at least one log message as soon as we receive the first result row. This will hopefully not be too spammy, as it's just one additional log message amidst the three or so we emit for each polling cycle, and it seemed useful to have that log after the first result row because often the question is whether we've received _any_ results yet.

Adds the same one-minute read/write timeouts on the MySQL connection as the MySQL CDC connector now has. This doesn't address any current or recent user issues as far as I know, it just seemed like a useful thing to do for consistency, while I'm here and thinking about connection loss timeouts on batch SQL connectors.

The timeout is 30 minutes for the first result row and 5 minutes for subsequent result rows, which is absurdly generous any time things are working correctly. Pretty much the only times we ought to see these fire are: - When the database connection has silently dropped on us (which is a thing that happens sometimes, especially when using SSH tunneling) - When the polling query is performing a full-table sort on some massive dataset. And generally speaking if that hasn't finished after 30 minutes it's not likely that we'll actually end up with a successful capture and a pleasant user experience anyway.

williamhbaker

LGTM

willdonnelly added 3 commits January 10, 2025 19:58

willdonnelly added the change:unplanned Unplanned change, useful for things like doc updates label Jan 11, 2025

willdonnelly requested a review from a team January 11, 2025 02:44

williamhbaker approved these changes Jan 13, 2025

View reviewed changes

willdonnelly merged commit 7f596b0 into main Jan 13, 2025
51 of 54 checks passed

willdonnelly deleted the wgd/batch-sql-progress-and-timeouts-20250110 branch January 13, 2025 18:22

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

source-*-batch: Timeouts and partial-progress logging #2267

source-*-batch: Timeouts and partial-progress logging #2267

willdonnelly commented Jan 11, 2025 •

edited by jgraettinger

Loading

williamhbaker left a comment

source-*-batch: Timeouts and partial-progress logging #2267

source-*-batch: Timeouts and partial-progress logging #2267

Conversation

willdonnelly commented Jan 11, 2025 • edited by jgraettinger Loading

williamhbaker left a comment

Choose a reason for hiding this comment

willdonnelly commented Jan 11, 2025 •

edited by jgraettinger

Loading