-
Notifications
You must be signed in to change notification settings - Fork 83
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Python wrapper classes for all user interfaces #750
Python wrapper classes for all user interfaces #750
Conversation
…r facing python features
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks like a big step forward to me. I am not a Python expert though, so it would be good to get reviews from others.
@jdye64 @charlesbluca fyi |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can add a py.typed
file to indicate to type-checkers that library is typed now (per PEP 561)
This looks great in terms of functionality, but I personally don't think it's worth to have wrappers for every DataFusion functionality just for the types and the docs. Afaik PyO3 creates Python docstrings from the Rust documentation of the exposed objects. So things can be documented in the existing Rust functions. And as you mentioned, for types, a For the renaming part |
It's a well made point @datapythonista and maintainability is definitely a concern. I would argue that going with a What I'm really trying to address is a concern from users that the
This is my reasoning. If the community prefers to go the route of not adding type hinting (or go with pyi approach) I'm willing to move things over to a different location. |
Thanks for the detailed information @timsaucer, and for all the work with this. I'm not an expert on type hints but it's not immediately clear to me what would every approach validate in a reasonable way. By using wrappers we won't have types for objects that don't exist, which can happen in the pyi file. But other than that, feels like validations should ve mostly via mypy of the unittests, and should be the same for both approaches, no? Or am I missing something? For your 3rd point, the PR I referenced in the previous message, #751, should address all the So, for me personally, I fully agree with the goals, but feels like a pyi file and documenting objects in Rust will make things simpler to maintain. I may surely be wrong. |
The wrappers also allow us to offer more pythonic interfaces when the An example of this is how @timsaucer cleaned up the datafusion-python/python/datafusion/functions.py Lines 946 to 954 in 79bb196
|
@datapythonista I think your PR is great! It definitely cleans up a problem, but it's not the same one as what I was trying to say. I know there are many users who look at the repositories to try to understand how a project works. All of this information can be found by looking in python using Again, this isn't a hill I intend to die on. I want this PR to be helpful. |
Updated all docstrings to match google format and added line length check into pre-commit configuration. The only thing I have left on my TODO list is updating the documentation building to get the cross references working right. I am currently testing mkdocs per @kylebarron recommendation. Once that is complete, I'll move this from Draft to ready and ask for a re-review. Thank you everyone for the thoughtful and productive discussion! |
Given that this project is currently using sphinx for documentation, it would be a pretty big change to move from sphinx to mkdocs, including new configuration and converting existing re-structured text files to markdown. Maybe it would be better to discuss moving to mkdocs in another issue? I mentioned that in my workflow as I'm a big fan of mkdocs over sphinx, but this PR is already decently large. |
Added a temporary work around for CI issues and added an issue to fix once the upstream is resolved. #763 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Excellent!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is excellent @timsaucer. Thank you for all of the hard work.
IMO, having Python wrappers is the most practical way to make sure we offer the Pythonic experience that people want.
This is required as of `datafusion-python` v40. ref: apache/datafusion-python#750
Which issue does this PR close?
Closes #749.
Rationale for this change
As described the the issue linked above, some python users find the current approach to exposing the
pyo3
generated functions and classes to be "unpythonic" (their words). This PR attempts to make a first class user interface by adding python documentation and type annotations in the place most python users will look. It has the added benefits of a cleanerhelp(datafusion.functions)
and similar calls.What changes are included in this PR?
datafusion.functions
str
orpathlib.Path
for paths, updatingselect
to accept column names as strings, and performing comparisons against literals.Are there any user-facing changes?
datafusion.functions.functions
it is simplydatafusion.functions
. Similar changes occur for other modules.df.filter(col("a") > lit(3))
you can simply dodf.filter(col("a") > 3)
. Any value that can be coerced into a literal is valid for the RHS.select
operation you can pass in either expressions or column names as a string. Instead of doingdf.select(col("a"), col("b") - col("c"))
you can dodf.select("a", col("b") - col("c"))
.filter
now accepts multiple arguments and will treat them as an AND filter. Instead of doingdf.filter(col("a") > 3).filter(col("b") == "orange")
you can usedf.filter(col("a" > 3, col("b") == "orange")
.pathlib.Path
objects.plan
,serde
,producer
, andconsumer
withindatafusion.substrait
are all marked as deprecated. Please usePlan
,Serde
,Producer
, andConsumer
, respectively.Volatility
enum available for convenience. Thestr
values are still accepted.Example generated documenation: