-
Notifications
You must be signed in to change notification settings - Fork 14k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make timestamp expression native SQLAlchemy element #7131
Conversation
superset/db_engine_specs.py
Outdated
@@ -117,16 +138,31 @@ class BaseEngineSpec(object): | |||
max_column_name_length = None | |||
|
|||
@classmethod | |||
def get_time_expr(cls, expr, pdf, time_grain, grain): | |||
def get_time_expr(cls, col: ColumnClause, pdf: Optional[str], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't you also provide a default value if pdf
and time_grain
are optional? Also in the docstring the Optional
word is redundant as that's implied for the type hints.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's my understanding that Optional[str]
is synonymous with Union[None, str]
, not that it is an optional argument. Good point about the docstring.
Thanks @villebro, Happy to "test this in prod" at Uber and report back to you. The code changes look okay to me and I think they should work. I have just one question: I am wondering if the TimestampExpression should be a good vehicle to fix the TZ issues plaguing superset: #6768 and #1149 for example. The problem is that Superset is essentially TZ unaware, and it assumes that the stored data is in the local TZ. I was curious if TimestampExpression could have a ".tz" field in it eventually ? And for databases that do support TZ's, that tz field is plumbed in when generating the timestamp expression sql. |
superset/connectors/sqla/models.py
Outdated
def get_timestamp_expression(self, time_grain): | ||
"""Getting the time component of the query""" | ||
def get_timestamp_expression(self, time_grain: Optional[str]) \ | ||
-> Union[TimeExpression, Label]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Type annotations FTW!
Codecov Report
@@ Coverage Diff @@
## master #7131 +/- ##
==========================================
+ Coverage 65.24% 65.27% +0.03%
==========================================
Files 435 435
Lines 21503 21502 -1
Branches 2379 2379
==========================================
+ Hits 14030 14036 +6
+ Misses 7353 7346 -7
Partials 120 120
Continue to review full report at Codecov.
|
Rebased to fix conflict. @agrawaldevesh thanks for offering to help verify this PR! Regarding the timezone issue, I think it's something that runs slightly deeper in the code, so would probably require some additional planning (perhaps a SIP). Perhaps something along the lines of a setting in |
kind reminder @john-bodley @agrawaldevesh for final comments on this PR. |
|
||
@compiles(TimestampExpression) | ||
def compile_timegrain_expression(element: TimestampExpression, compiler, **kw): | ||
return element.name.replace('{col}', compiler.process(element.col, **kw)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is "name" of a ColumnClause ?
Do we need to pass this further via the compiler ? Is this supposed to return a string back or is it supposed to return some sqlalchemy struct, and so we might have to do:
compiler.process(element.name.replace('{col}', compiler.process(element.col, **kw)))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
name
is the part containing the column name/expression; think SELECT name FROM table
. In this case name
will be a string containing a token {col}
to indicate where the native element col
should be compiled. The reason the element is kept in native form is because at construction time we want to build an engine agnostic select statement (SqlaTable.get_sqla_query()
in connectors/sqla/models.py
). By doing it this way we can get a different value for name
depending on the dialect when we are finally compiling the whole statement (SqlaTable.get_query_str_extended()
in connectors/sqla/models.py
). Example:
name
=TRUNC(CAST({col} as DATE), 'MI')
(string)col
=MyMixedCaseCol
(Any Core SQLAlchemy element that can be compiled, but in this example aColumnClause
whosename
=MyMixedCaseCol
)
When compiling this using different dialects we would get different results:
- MSSQL:
TRUNC(CAST([MyMixedCaseCol] as DATE), 'MI')
- Oracle:
TRUNC(CAST("MyMixedCaseCol" as DATE), 'MI')
And so forth. I'm actually working on adding a more generalized version of this to SQLAlchemy, but since it requires fairly big changes to the codebase I don't expect to see it ship in a released version before the end of this year. Hence the simple version above.
@@ -1485,20 +1503,21 @@ class PinotEngineSpec(BaseEngineSpec): | |||
inner_joins = False | |||
supports_column_aliases = False | |||
|
|||
_time_grain_to_datetimeconvert = { | |||
# Pinot does its own conversion below | |||
time_grain_functions: Dict[Optional[str], str] = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My concern is renaming this to time_grain_functions is that they are not really behaving like "functions" (like in the rest of the db engine specs).
For example if someone (other than the pinot specific get_timestamp_expr) starts using them, then they wouldn't be getting a function. So I would be wary of changing this.
If anything, we can make time_grain_functions = None (in case they are set in the BaseEngineSpec), so that no one can use them.
And switch this back to its internal name that the below pinot specific get_timestamp_expr can use.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The way I see it all values in time_grain_functions
are merely technical strings that are used by the spec to construct the final TimestampExpression
. The fact that Pinot does this slightly differently I think is ok; all interaction with them should always be done via the get_timestamp_expr()
function anyway, be it Pinot or Postgres. But I agree that the naming is slightly confusing. I would personally stick with my proposal, but I will change it back if you feel it is more intuitive.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay. This was a nit anyway :D.
Should we make time_grain_functions private then ? _time_grain_functions ?
The name is a bit of a misnomer now: The values in this dictionary are no longer pure functions.
Or we can just punt on this and perhaps the comment above "time_grain_functions are merely technical strings ...." as a code comment.
Its fine either way
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, it should be made private as it is. However, I think I'll let this be as it is and keep this in mind next time this time grains are refactored (originally they were a tuple of tuples which are currently emulated via BaseEngineSpec.get_time_grains()
). Perhaps next iteration they'll become more final.
Hi @villebro , I left some comments inline. LGTM to me otherwise. |
Thanks @agrawaldevesh , great comments as always! Were you able to get this working? And was the unit test ok? |
Hi Ville ... Everything works. The unit test is proper and also I was able to merge this branch and test it out in my prod environment and superset had no problems. All charts and everything rendered fine. The generated queries remained the same. So this PR is totally fine from Pinot perspective. |
@mistercrunch @john-bodley this one should be ok to ship. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did a pass on the PR and it LGTM @agrawaldevesh has this been running in production on your side? Which engines are you using there? Presto + Pinot I'm assuming?
Hi Max ... I tested this with Pinot natively and it works. Presto+pinot wouldn't really test the pinot side of superset (since superset would just speak presto-sql then). Having tested it, I am confident that this PR would work with Pinot. |
# if epoch, translate to DATE using db specific conf | ||
if pdf == 'epoch_s': | ||
expr = cls.epoch_to_dttm().format(col=expr) | ||
time_expr = time_expr.replace('{col}', cls.epoch_to_dttm()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
.format
does much more than just replacing strings. There's a whole mini language behind it that people might have used (even though it's unlikely)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Previously the code conditionally "burned" the column name/expression into the epoch expression (later that gets conditionally "burned" into the timegrain expression), which is what this PR tries to avoid. In this proposal the epoch expression is instead conditionally burned into the timestamp expression, making it possible to keep the column object in it's native form until compilation time. I can change the .replace
logic to .format
, but in this context I don't think it will make a difference.
elif pdf == 'epoch_ms': | ||
expr = cls.epoch_ms_to_dttm().format(col=expr) | ||
time_expr = time_expr.replace('{col}', cls.epoch_ms_to_dttm()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same here
Hi, Ex: So, is there any configuration parameter that we can modify to display the data in desired time zone (like EST/PST or some other)? Please advise. Thanks in advance. |
Also, we need to show the data displayed in time zone per user profile or preference if there is a way. |
It is observed that Superset is automatically converting the timestamp fields to EST time while displaying in the SQL Lab or Chart. |
SUMMARY
The functional part of this PR makes the timestamp expression a native SQLAlchemy element that respects quoting rules for column names (based largely on https://docs.sqlalchemy.org/en/latest/core/compiler.html) . This is a continuation of what was started in #6897, which was really just a hack to make
Postgres
work, but didn't address e.g. reserved keywords or similar problems that might arise in other dbs. This PR adds a new classTimestampExpression
, which can be used in a SQLAlchemy query, which keeps the target column as a native Core element. The column name is only rendered to text once the query is compiled, which ensures that engine specific quoting rules are respected.Refactoring:
sqla/models/get_timestamp_expression()
todb_engine_specs/get_timestamp_expr()
.models/core/grains()
which was no longer needed.TEST PLAN
Added unit tests that test the central features and tested timeseries graphs using both column and expression, both with and without time grains.
ADDITIONAL INFORMATION
REVIEWERS
Comments much appreciated @betodealmeida @john-bodley. Also @agrawaldevesh: I had to change the Pinot logic for this PR. Do you have the opportunity to test if this works on your Pinot deployment (I don't have a Pinot installation handy right now)? I also added a few grains while at it, do they work?