Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Common SQLCheckOperators Various Functionality Update #25164
Common SQLCheckOperators Various Functionality Update #25164
Changes from 5 commits
dc01866
7c25227
2a3df61
66922f0
554e8ba
cf90083
7c20bf6
d364e96
3c300e7
5645b5d
98a28c3
ee697e3
1754fdc
31d0e0e
bc4140e
24ce964
987f6c2
a27fc2c
404eef5
01bfb2f
75d59d1
0388ac4
719a830
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why did we change from getting records to getting this as a pandas dataframe? This now places a hard requirement on using pandas for this operator, where as previously pandas was almost entirely optional.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, this was changed because with
hook.get_first
, there was an issue with how the SQL was being written that caused only fully aggregated checks to be returned, unless the syntax of the SQL query was changed, but that would require either afetch_all
orget_pandas_df
call as the new SQL needs to returned multiple lines. It seemed much easier and possibly more efficient to use pandas here, but if afetch_all
seems more reasonable this can be changed. Happy to explain more about the specific issue if curious.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd be curious to hear more about it :D
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To expand a bit more, a check like
col_a + col_b >= col_c
would not work when there were multiple checks in the operator as the previousSELECT
statement would then fail and require aGROUP BY
clause iirc. So the check would either have to be in its own operator, or be amended like so:SUM(col_a) + SUM(col_b) >= SUM(col_c)
which isn't quite the same check. So the query needed to be updated, and the one that I wound up using returns multiple rows. Soget_first
is no longer useful, and in the moment of writing it seemed that handling things with a pandas dataframe might be easier in the long term, if more complicated uses of the pulled in data were implemented. But as of now I see how it's unneeded.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks.... Is the operator easy to understand by the users? I am afraid this is something we need good and clear howto and examples for it because people will have hard time using it and rais too many questions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ideally most users won't need to learn about why
.fetch_all
is being used instead of.get_first
🙃 . I'm working with some users of the operator right now and seeing what's complicated to make sure the docs are robust. I also have a working example DAG showing how to use the operator (with several more planned) here.