-
Notifications
You must be signed in to change notification settings - Fork 14k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: enable ETag header for dashboard GET requests #10963
feat: enable ETag header for dashboard GET requests #10963
Conversation
eac7940
to
bd76db1
Compare
Codecov Report
@@ Coverage Diff @@
## master #10963 +/- ##
==========================================
- Coverage 61.46% 60.76% -0.71%
==========================================
Files 382 382
Lines 24139 24154 +15
==========================================
- Hits 14836 14676 -160
- Misses 9303 9478 +175
Flags with carried forward coverage won't be shown. Click here to find out more.
Continue to review full report at Codecov.
|
b879c96
to
caf3745
Compare
caf3745
to
d2afe98
Compare
superset/utils/decorators.py
Outdated
@@ -72,6 +80,12 @@ def wrapper(*args: Any, **kwargs: Any) -> ETagResponseMixin: | |||
if request.method == "POST": | |||
return f(*args, **kwargs) | |||
|
|||
# if it is dashboard request but feature is not eabled, | |||
# do not use cache | |||
is_dashboard = is_dashboard_request(kwargs) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of adding a helper function, maybe we can just expand dashboard_id_or_slug
in wrapper
and make this code more transparent?
def wrapper(*args: Any, dashboard_id_or_slug: str=None, **kwargs: Any) -> ETagResponseMixin:
# ...
is_dashboard = dashboard_id_or_slug is not None
Better yet, we should probably aim for keeping all dashboard specific logics out of etag_cache
so it stays generic. Maybe add a skip=
parameter that runs the feature flag check.
@etag_cache(skip=lambda: is_feature_enabled("ENABLE_DASHBOARD_ETAG_HEADER"))
def etag_cache(
max_age: int,
check_perms: Callable[..., Any],
skip: Optional[Callable[..., Any]] = None,
) -> Callable[..., Any]:
def decorator(f: Callable[..., Any]) -> Callable[..., Any]:
def wrapper(*args: Any, **kwargs: Any) -> ETagResponseMixin:
check_perms(*args, **kwargs)
if request.method == "POST" or (skip and skip(*args, **kwargs)):
return f(*args, **kwargs)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks! after introduce this skip
, dashboard_id
is not needed in decorator function any more.
superset/utils/decorators.py
Outdated
tzinfo=timezone.utc | ||
).astimezone(tz=None) | ||
if latest_changed_on.timestamp() > latest_record.timestamp(): | ||
response = None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can probably rename check_latest_changed_on
to get_latest_changed_on
and do
if get_latest_changed_on:
latest_changed_on = get_latest_changed_on(*args, **kwargs)
if response and response.last_modified and response.last_modified < latest_changed_on:
response = None
else:
latest_changed_on = datetime.utcnow()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed.
@graceguo-supercat can you please describe what effect this change will have on
|
|
great thanks, it looks like it addresses question #1, what about 2 and 3? e.g. for #2
Sorry if those questions do not make sense or are obvious. I am just trying to understand the flow here |
Note: For dashboard requests, we want to cache dashboard bootstrap data, which includes datasource metadata, slices parameters, etc. Dashboard front-end js use datasource metadata, slice parameters, and dashboard filters to build query and fetch query results, the results itself are not in the dashboard bootstrap data. This PR will only focus on dashboard's cache stale logic. i do not want to change current slice cache behavior. #2 #3: |
|
||
def get_dashboard_latest_changed_on(_self: Any, dashboard_id_or_slug: str) -> datetime: | ||
""" | ||
Get latest changed datetime for a dashboard. The change could be dashboard |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/Get latest changed datetime for a dashboard.
/Get latest changed datetime for a dashboard and it's charts
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes. I rename it to get_dashboard_latest_changedon_dt
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we get more specific with _self type ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know what type to use here. do you have suggestion? (Sorry i am not an expert in Python)
superset/utils/decorators.py
Outdated
@@ -34,6 +34,10 @@ | |||
logger = logging.getLogger(__name__) | |||
|
|||
|
|||
def is_dashboard_request(kwargs: Any) -> bool: | |||
return kwargs.get("dashboard_id_or_slug") is not None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
doesn't seem robust, it it possible to validate via uri path or just pass a param ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
after introduce this skip, dashboard_id_or_slug
is not needed in decorator function any more. this check is removed.
superset/utils/decorators.py
Outdated
latest_changed_on = check_latest_changed_on(*args, **kwargs) | ||
if response and response.last_modified: | ||
latest_record = response.last_modified.replace( | ||
tzinfo=timezone.utc |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this assumes that superset server runs in utc zone, it may be safer to make it as a superset config variable
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this convert, .replace(tzinfo)
is not necessary, removed.
superset/views/utils.py
Outdated
return dashboard | ||
|
||
|
||
def get_datasources_from_dashboard( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks like a good candidate for the Dashboard class method
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is a little confusing: Dashboard class has a get datasources
function. So i use it in check_dashboard_perms
function. But this function is to group slices by datasources, and the result will be used by another feature:
https://github.com/apache/incubator-superset/blob/ba009b7c09d49f2932fd10269882c901bc020c1d/superset/views/core.py#L1626
Instead of datasource.data, datasource.data_for_slices(slices) can reduce the initial dashboard data load size.
So right now i removed this helper function from utils, and build dict in the dashboard function. But i rename datasource to slices_by_datasources
for clarification.
superset/views/utils.py
Outdated
return datasources | ||
|
||
|
||
def get_dashboard_latest_changed_on(_self: Any, dashboard_id_or_slug: str) -> datetime: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is _self here? ideally we should avoid Any types
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please see other functions that used by decorator:
This function takes `self` since it must have the same signature as the the decorated method.
@@ -490,6 +538,26 @@ def check_slice_perms(_self: Any, slice_id: int) -> None: | |||
viz_obj.raise_for_access() | |||
|
|||
|
|||
def check_dashboard_perms(_self: Any, dashboard_id_or_slug: str) -> None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
best practice is to have a unit test for every function, it would be great if you could add some
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, i agree. but this function is refactored out from dashboard function. it is tested in
https://github.com/apache/incubator-superset/blob/448a41a4e7563cafadea1e03feb5980151e8b56d/tests/security_tests.py#L665
I assume the old unit tests didn't break will be good enough.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a good practice is to incrementally improve the state of the code, however it will be your call here
Big thanks for the explanation! |
eb3956a
to
1c39347
Compare
fda20c7
to
c69afd0
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LG%nits
): | ||
response = None | ||
else: | ||
content_changed_time = datetime.utcnow() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add a comment here why content_changed_time is set to now()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We use this content_changed_time
as cache's last_modified time.
for dashboard content_changed_time
is dashboard entity's latest updated time (like metadata, chart metadata changed time etc). this data is from a callback function.
for explore_json, the cache is query results and there is no entity's latest modified time to use. so we use request time (now) as cache's last_modified time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know generalizing too soon is not a good practice, but I wonder if we should pass a callable called is_stale
here. It would simplify the decorator logic, and since it would be defined closer to the dashboard it might simplify the logic there as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
at first i thought is_stale is a good idea. but when start refactor it, i found last_modified time is needed by decorator function(to set header). So is_stale (only boolean value) is not enough.
So instead of using is_stale return true or false, I prefer to keep get_last_modified, and use last_modified time to invalid cache.
|
||
def get_dashboard_latest_changed_on(_self: Any, dashboard_id_or_slug: str) -> datetime: | ||
""" | ||
Get latest changed datetime for a dashboard. The change could be dashboard |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we get more specific with _self type ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks great, @graceguo-supercat! I left a small comment on how to possible simplify the code a little b it, but I'm not 100% sure it would help.
): | ||
response = None | ||
else: | ||
content_changed_time = datetime.utcnow() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know generalizing too soon is not a good practice, but I wonder if we should pass a callable called is_stale
here. It would simplify the decorator logic, and since it would be defined closer to the dashboard it might simplify the logic there as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, found a couple of more nits, both are optional
superset/utils/decorators.py
Outdated
def etag_cache( | ||
max_age: int, | ||
check_perms: Callable[..., Any], | ||
get_latest_changed_on: Optional[Callable[..., Any]] = None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we rename get_latest_changed_on
to get_last_modified
just to be more consistent with the response attribute? Imagine in future refactor response.last_modified
is renamed to something else, you would know this function is definitely related by searching for last_modified
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
agree. rename it to get_last_modified
.
superset/views/core.py
Outdated
@etag_cache( | ||
0, | ||
check_perms=check_dashboard_perms, | ||
get_latest_changed_on=get_dashboard_changedon_dt, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: it's a little weird to have both changed_on
and changedon
, but up to your.
c69afd0
to
18b31f8
Compare
* feat: add etag for dashboard load requests * fix review comments
SUMMARY
ETag header is widely used as efficient server-side caching mechanism. Superset had ETag support for explore_json, this PR is to expand the coverage to dashboard GET request.
When dashboard request come in, Superset need to gather all the datasource metadata and all the slices query parameters that are used for this dashboard, and return as a big blob of data. For a large dashboard in airbnb, we saw some dashboard can have 100+ datasources and 300+ slices. This server-side processing can take 4 seconds, and make the whole dashboard load became very slow.
eTag could be a good solution for large dashboards with less frequent changes:
TEST PLAN
For example: regular GET request takes about 4 seconds. If enabled eTag header, 2nd request will get 304 and faster responded since there is no server-side processing:
cc @betodealmeida @etr2460