-
Notifications
You must be signed in to change notification settings - Fork 14.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: add retry to SQL-based alerting celery task #10542
Conversation
caf324c
to
34aa2fa
Compare
e28793a
to
a838c7d
Compare
e10e9d5
to
d6ce466
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
left a couple of comments/doubts
superset/tasks/schedules.py
Outdated
else: | ||
raise RuntimeError("Unknown report type") | ||
except NoSuchColumnError as column_error: | ||
stats_logger.incr("run_alert_task.failure.NoSuchColumnError") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the API follows a fixed pattern on statsd metrics: https://github.com/apache/incubator-superset/blob/master/superset/views/base_api.py#L173
So we use ..error|success|init. My take is that it's desirable to just incr an error counter (at the method level is granular enough) and also log the error with a more detailed message or exception
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in this case we want to differentiate between NoSuchColumnError & ResourceClosedError - still researching the error
s/failure/error - is a good idea
we should probably use snake case for the metric name
superset/tasks/schedules.py
Outdated
@@ -569,7 +583,7 @@ class AlertState: | |||
|
|||
|
|||
def deliver_alert(alert_id: int, recipients: Optional[str] = None) -> None: | |||
alert = db.session.query(Alert).get(alert_id) | |||
alert = db.create_scoped_session().query(Alert).get(alert_id) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is interesting, @john-bodley submitted this: #10427
is this necessary because it's called by a celery task, and we want to make sure the session gets removed on task completion? What's the effect if this is called by a web request with proper session scope?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For now I'm actually going to revert this change and just use regular session. Both types of sessions cause errors with the underlying cause remaining unknown for now.
superset/tasks/schedules.py
Outdated
|
||
logger.info("Processing alert ID: %i", alert.id) | ||
database = alert.database | ||
dbsession = db.create_scoped_session() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: can we call it db_session
superset/tasks/schedules.py
Outdated
else: | ||
raise RuntimeError("Unknown report type") | ||
except NoSuchColumnError as column_error: | ||
stats_logger.incr("run_alert_task.failure.NoSuchColumnError") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in this case we want to differentiate between NoSuchColumnError & ResourceClosedError - still researching the error
s/failure/error - is a good idea
we should probably use snake case for the metric name
@@ -533,6 +535,9 @@ def schedule_email_report( # pylint: disable=unused-argument | |||
name="alerts.run_query", | |||
bind=True, | |||
soft_time_limit=config["EMAIL_ASYNC_TIME_LIMIT_SEC"], | |||
autoretry_for=(NoSuchColumnError, ResourceClosedError,), | |||
retry_kwargs={"max_retries": 5}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add a todo and the link to the issue - to get rid of retry once the underlying issue is resolved
|
* added retry and minimized sqlalchemy object lives * pylint * added try catch * adjusted naming * added scoped session * update tests for dbsession * added requested changes * nit todo Co-authored-by: Jason Davis <@dropbox.com>
* added retry and minimized sqlalchemy object lives * pylint * added try catch * adjusted naming * added scoped session * update tests for dbsession * added requested changes * nit todo Co-authored-by: Jason Davis <@dropbox.com>
SUMMARY
Multiple errors such as sqlalchemy.exc.NoSuchColumnError, sqlalchemy.exc.ResourceClosedError and MySQLdb._exceptions.OperationalError were being seen when multiple alert celery tasks were run concurrently. These errors occurred during the first few iterations of a celery beat schedule after initialization, however they would stop after after a few iterations of the beat schedule. This PR introduces retrying along with the celery task as a temporary fix for #10530. The use of
db.session
is also standardized in the alerting workflow.BEFORE/AFTER SCREENSHOTS OR ANIMATED GIF
TEST PLAN
ADDITIONAL INFORMATION