-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
refresh_queries shouldn't break because of a single query having a bad schedule object #4163
Conversation
While we should basically move on to the next query when an error occurs (and not stop the the loop), we should also handle a case where jobs already expire and can't be fetched (thus raising an exception), and remove the lock in these cases, so these queries can be executed as part of this issue. |
While wrapping it in |
redash/models/__init__.py
Outdated
Query.schedule.isnot(None), | ||
expression.func.coalesce( | ||
expression.text("schedule::json->>'interval'"), None | ||
).isnot(None), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wasn't the Python version of this a bit easier to understand?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it was. I just found the fact that the filtering logic is divided between the query and the enqueuing loops more confusing. I prefer a (slightly) more complex query over mixed concerns.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But it's split anyway, because later we do tests on the schedule data to see if a query is outdated. How's checking if interval
is None
any different from the other tests?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Btw, I'm not sure if this test is even needed. I believe it's from an interim version of the code where there was always a schedule
object, but later we switched to using None
to signal no schedule.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How's checking if interval is None any different from the other tests?
The only difference is how much complexity it adds. Checking for until
was way harder to understand in the query.
later we switched to using
None
to signal no schedule
Yeah, looks like there are no residues of that so it's safe to remove. (6ad764f)
redash/models/__init__.py
Outdated
logging.info( | ||
"Could not determine if query %d is outdated due to %s", | ||
query.id, | ||
repr(e), | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- This should go into Sentry.
- Considering the way our scheduler works this is going to be spammy. Maybe we should report this and then unschedule?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Regarding #2 - this function should be idempotent. It's kinda strange to call outdated_queries
and have your schedule change as a result.
I feel more comfortable with calling track_failure
in these cases - this way users could find out that there was an error with running the query, and it would increase the schedule_failures
counter, which would push the next reschedule down the exponential backoff track.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually it would make more sense to use track_failure
in _enqueue_queries
instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think schedule_failures
will work here, because it won't reach the part of the code that calculates the next iteration.
While usually you're right in the approach, we're talking about queries with malformed schedule object. Also, sending a user a notification about it doesn't feel useful as in most cases they won't know what to do about it.
Another approach we can take is to introduce a disabled
boolean to the schedule
object which we will trigger in such cases (and later can review these). The downside here is that it won't work in case it's completely malformed (not even a valid JSON).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
well, disabled
could be set to True
if schedule
is broken or schedule['disabled']
is True
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reporting to Sentry is done in 57b1d4f.
redash/tasks/queries/execution.py
Outdated
job_exists = Job.exists(job_id) | ||
job_complete = None | ||
|
||
if job_exists: | ||
job = Job.fetch(job_id) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While it's not very likely, it still can happen that between the call to exists
to fetch
, the job will be removed. Is there a version of fetch
that returns None
if job doesn't exist? Or maybe just catch exception?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
redash/tasks/queries/maintenance.py
Outdated
outdated = models.Query.outdated_queries() | ||
refreshable = _skip_unrefreshable_queries(outdated) | ||
queries = _apply_default_parameters(refreshable) | ||
enqueued = list(_enqueue_queries(queries)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wouldn't it be easier to understand, more performant (less loops) and simpler if it was something like:
for query in models.Query.outdated_queries():
# logic from _skip_unrefreshable_queries
if not should_refresh_query(query):
continue
try:
enqueue_query(
# logic from _apply_default_parameters
apply_default_parameters(query.query_text),
query.data_source,
query.user_id,
scheduled_query=query,
metadata={"Query ID": query.id, "Username": "Scheduled"},
)
except Exception as e:
logging.info("Could not enqueue query %d due to %s", query.id, repr(e))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, similar comment about exception logging as with the one in outdated_queries
. Except that here we should only report to Sentry, but not unschedule.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Simplicity / understandability is a matter of preference. Personally I thought of the journey from outdated_queries
to enqueued
as a set of transformations and filters, so I took a FP-based approach. I've switched to your suggestion in ef2eb39.
Regarding more performant - not sure, those were generator loops and I think there's a built-in optimization there.
…oid one query blowing up the rest
…nsibility and add try/except blocks to avoid one query blowing up the rest
… them again instead
6ad764f
to
fea7814
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good. We might need to tweak the Sentry reporting, but let's see how it goes.
Issue Summary
We recently had an issue where a single query had a bad schedule object:
And it caused
Query.outdated_queries
to blow up, which in turn stoppedrefresh_queries
from working.We should make sure there is enough exception handling that a single query won't stop the scheduler from working.
Technical details: