refresh_queries shouldn't break because of a single query having a bad schedule object #4163

rauchy · 2020-02-12T13:36:11Z

Issue Summary

We recently had an issue where a single query had a bad schedule object:

{
    "interval": 604800, 
    "until": null,
    "day_of_week": "Sunday", 
    "time": "Invalid date"
}

And it caused Query.outdated_queries to blow up, which in turn stopped refresh_queries from working.

We should make sure there is enough exception handling that a single query won't stop the scheduler from working.
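
A minimal sketch of the isolation being asked for, not the actual Redash code. `is_outdated`, `outdated_queries` as written here, and the `(id, schedule)` tuples are hypothetical stand-ins. The only point is that the try/except wraps each query individually, so the schedule above gets logged and skipped instead of aborting the whole pass.

```python
import logging
from datetime import datetime

logger = logging.getLogger(__name__)


def is_outdated(schedule, now):
    """Hypothetical check: raises ValueError when schedule["time"] is malformed."""
    # strptime raises ValueError for a value such as "Invalid date".
    scheduled_time = datetime.strptime(schedule["time"], "%H:%M").time()
    return now.time() >= scheduled_time


def outdated_queries(queries, now):
    """Return the ids of outdated queries, skipping any whose schedule is bad."""
    outdated = []
    for query_id, schedule in queries:
        try:
            if is_outdated(schedule, now):
                outdated.append(query_id)
        except Exception as e:
            # One bad schedule is logged and skipped; the loop carries on.
            logger.info(
                "Could not determine if query %d is outdated due to %s",
                query_id,
                repr(e),
            )
    return outdated


queries = [
    (1, {"interval": 604800, "until": None, "day_of_week": "Sunday", "time": "Invalid date"}),
    (2, {"interval": 86400, "until": None, "day_of_week": None, "time": "08:00"}),
]
result = outdated_queries(queries, datetime(2020, 2, 12, 13, 36))
```

Here query 1 carries the broken schedule from the report and is skipped, while query 2 is still picked up.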

Technical details:

Redash Version: 8

rauchy · 2020-01-09T09:21:28Z

While we should basically move on to the next query when an error occurs (and not stop the loop), as part of this issue we should also handle the case where jobs have already expired and can't be fetched (which raises an exception), and remove the lock in these cases so that these queries can be executed again.

rauchy · 2020-02-12T20:12:57Z

While wrapping it in try/except blocks, I felt that refresh_queries had too much going on and wasn't easy to read, so I've taken the opportunity to refactor it into something (that I see as) more readable.

arikfr · 2020-02-13T09:56:07Z

redash/models/__init__.py

+            Query.schedule.isnot(None),
+            expression.func.coalesce(
+                expression.text("schedule::json->>'interval'"), None
+            ).isnot(None),


Wasn't the Python version of this a bit easier to understand?

Yes, it was. I just found the fact that the filtering logic is divided between the query and the enqueuing loops more confusing. I prefer a (slightly) more complex query over mixed concerns.

But it's split anyway, because later we do tests on the schedule data to see if a query is outdated. How's checking if interval is None any different from the other tests?

Btw, I'm not sure if this test is even needed. I believe it's from an interim version of the code where there was always a schedule object, but later we switched to using None to signal no schedule.

How's checking if interval is None any different from the other tests?

The only difference is how much complexity it adds. Checking for until was way harder to understand in the query.

later we switched to using None to signal no schedule

Yeah, looks like there are no residues of that so it's safe to remove. (6ad764f)

arikfr · 2020-02-16T08:07:45Z

redash/models/__init__.py

+                logging.info(
+                    "Could not determine if query %d is outdated due to %s",
+                    query.id,
+                    repr(e),
+                )


This should go into Sentry.

Considering the way our scheduler works this is going to be spammy. Maybe we should report this and then unschedule?

Regarding #2 - this function should be idempotent. It's kinda strange to call outdated_queries and have your schedule change as a result.

I feel more comfortable with calling track_failure in these cases - this way users could find out that there was an error with running the query, and it would increase the schedule_failures counter, which would push the next reschedule down the exponential backoff track.
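
To illustrate why bumping schedule_failures pushes the next run further out, here is a rough sketch. It assumes the backoff adds 2**failures minutes on top of the regular interval; the helper name `next_run` and the exact formula are assumptions for illustration, not taken from this PR.

```python
from datetime import datetime, timedelta


def next_run(last_run, interval_seconds, failures=0):
    """Hypothetical helper: when a query should next run after `failures` consecutive failures."""
    next_iteration = last_run + timedelta(seconds=interval_seconds)
    if failures:
        # Assumed backoff: add 2**failures minutes; cap the exponent to avoid overflow.
        next_iteration += timedelta(minutes=2 ** min(failures, 20))
    return next_iteration


last_run = datetime(2020, 2, 16, 8, 0)
no_failures = next_run(last_run, 3600)
three_failures = next_run(last_run, 3600, failures=3)
```

With an hourly interval, three consecutive failures would delay the next run by an extra 8 minutes under this assumption.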

Actually it would make more sense to use track_failure in _enqueue_queries instead.

I don't think schedule_failures will work here, because it won't reach the part of the code that calculates the next iteration.

While usually you're right in the approach, we're talking about queries with a malformed schedule object. Also, sending a user a notification about it doesn't feel useful, as in most cases they won't know what to do about it.

Another approach we can take is to introduce a disabled boolean to the schedule object, which we will set in such cases (and can review later). The downside here is that it won't work if the schedule is completely malformed (not even valid JSON).

well, disabled could be set to True if schedule is broken or schedule['disabled'] is True.
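
A sketch of that rule, assuming the schedule arrives as a raw JSON string. `schedule_is_disabled` is a hypothetical helper, not something from the codebase; it treats an unparsable schedule the same way as an explicit disabled flag.

```python
import json


def schedule_is_disabled(raw_schedule):
    """Hypothetical helper: True if the schedule is unparsable, not an object, or flagged disabled."""
    try:
        schedule = json.loads(raw_schedule)
    except (TypeError, ValueError):
        # Not even valid JSON (json.JSONDecodeError is a ValueError): treat as disabled.
        return True
    if not isinstance(schedule, dict):
        # Valid JSON, but not a schedule object.
        return True
    return schedule.get("disabled") is True
```

For example, `schedule_is_disabled('{"interval": 60}')` is False, while a truncated string such as `'{not json'` counts as disabled.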

Reporting to Sentry is done in 57b1d4f.

arikfr · 2020-02-16T08:09:09Z

redash/tasks/queries/execution.py

+                job_exists = Job.exists(job_id)
+                job_complete = None
+
+                if job_exists:
+                    job = Job.fetch(job_id)


While it's not very likely, it can still happen that the job is removed between the call to exists and the call to fetch. Is there a version of fetch that returns None if the job doesn't exist? Or maybe just catch the exception?
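
A sketch of the "just catch the exception" variant. `InMemoryJobs` and `NoSuchJob` are hypothetical stand-ins so the example runs without Redis; with RQ, the exception to catch would presumably be `rq.exceptions.NoSuchJobError` raised by `Job.fetch`.

```python
class NoSuchJob(Exception):
    """Hypothetical stand-in for the error a job store raises on a missing job."""


class InMemoryJobs:
    """Hypothetical stand-in for a job store whose entries can expire at any time."""

    def __init__(self, jobs):
        self._jobs = dict(jobs)

    def fetch(self, job_id):
        try:
            return self._jobs[job_id]
        except KeyError:
            raise NoSuchJob(job_id)


def job_is_complete(jobs, job_id):
    """Return True/False for a known job, or None when the job no longer exists."""
    try:
        status = jobs.fetch(job_id)
    except NoSuchJob:
        # The job expired or was removed; a single fetch leaves no gap
        # between an existence check and the read.
        return None
    return status in ("finished", "failed")


jobs = InMemoryJobs({"a": "finished", "b": "started"})
```

A caller can then treat None the same as a completed job: drop the stale lock and enqueue the query again.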

arikfr · 2020-02-16T08:15:49Z

redash/tasks/queries/maintenance.py

+    outdated = models.Query.outdated_queries()
+    refreshable = _skip_unrefreshable_queries(outdated)
+    queries = _apply_default_parameters(refreshable)
+    enqueued = list(_enqueue_queries(queries))


Wouldn't it be easier to understand, more performant (fewer loops) and simpler if it was something like:

for query in models.Query.outdated_queries():
    # logic from _skip_unrefreshable_queries
    if not should_refresh_query(query):
        continue

    try:
        enqueue_query(
            # logic from _apply_default_parameters
            apply_default_parameters(query.query_text),
            query.data_source,
            query.user_id,
            scheduled_query=query,
            metadata={"Query ID": query.id, "Username": "Scheduled"},
        )
    except Exception as e:
        logging.info("Could not enqueue query %d due to %s", query.id, repr(e))

Also, similar comment about exception logging as with the one in outdated_queries. Except that here we should only report to Sentry, but not unschedule.

Simplicity / understandability is a matter of preference. Personally I thought of the journey from outdated_queries to enqueued as a set of transformations and filters, so I took a FP-based approach. I've switched to your suggestion in ef2eb39.

Regarding more performant - not sure, those were generator loops and I think there's a built-in optimization there.
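
On the generator point: chained generators are lazy, so each item passes through every stage before the next one is pulled, and no intermediate lists are built. A small sketch with hypothetical stage names (`outdated`, `skip_unrefreshable`, `enqueue`) that records the order of calls:

```python
calls = []


def outdated():
    for query_id in (1, 2, 3):
        calls.append(("outdated", query_id))
        yield query_id


def skip_unrefreshable(queries):
    for query_id in queries:
        calls.append(("filter", query_id))
        if query_id != 2:  # pretend query 2 is not refreshable
            yield query_id


def enqueue(queries):
    for query_id in queries:
        calls.append(("enqueue", query_id))
        yield query_id


# Nothing runs until list() starts pulling items through the chain.
enqueued = list(enqueue(skip_unrefreshable(outdated())))
```

The recorded calls interleave per query rather than running stage by stage, so the pipeline is effectively a single pass, at the cost of some per-stage generator overhead.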

arikfr

Looks good. We might need to tweak the Sentry reporting, but let's see how it goes.

arikfr added the Backend label Sep 22, 2019

arikfr added this to the Next milestone Sep 22, 2019

weekly-digest bot mentioned this pull request Sep 23, 2019

Weekly Digest (16 September, 2019 - 23 September, 2019) #4172

Closed

rauchy requested a review from arikfr February 12, 2020 20:10

arikfr reviewed Feb 16, 2020

This was referenced Feb 17, 2020

Weekly Digest (10 February, 2020 - 17 February, 2020) #4652

Closed

Weekly Digest (17 February, 2020 - 24 February, 2020) #4678

Closed

Omer Lachish added 6 commits February 25, 2020 10:15

move filtering of invalid schedules to the query

6df5a20

simplify retrieved_at assignment and wrap in a try/except block to avoid one query blowing up the rest

fc77983

refactor refresh_queries to use simpler functions with a single responsibility and add try/except blocks to avoid one query blowing up the rest

fd9c709

avoid blowing up when job locks point to expired Job objects. Enqueue them again instead

5cfa8d1

there's no need to check for the existence of interval - all schedules have intervals

6002654

disable faulty schedules

fea7814

rauchy force-pushed the resilient-refresh-queries branch from 6ad764f to fea7814 Compare February 25, 2020 21:03

Omer Lachish added 3 commits February 26, 2020 12:36

reduce FP style in refresh_queries

ef2eb39

report refresh_queries errors to Sentry (if it is configured)

57b1d4f

avoid using exists+fetch and use exceptions instead

24ca387

rauchy requested a review from arikfr February 27, 2020 09:06

arikfr reviewed Feb 27, 2020

rauchy merged commit a9cb87d into master Mar 1, 2020

rauchy deleted the resilient-refresh-queries branch March 1, 2020 09:02

weekly-digest bot mentioned this pull request Mar 2, 2020

Weekly Digest (24 February, 2020 - 2 March, 2020) #4702

Closed
