Enable unit tests using pyspark backend #55
Conversation
This looks like a great approach, thanks a ton @jstammers! Just a few tiny bits, and we need to figure out those test failures, but then this should be good to go. I'm curious whether pyspark installed via pip is gonna be stable enough across different devs' computers and CI. Ibis, for example, uses docker to run their various backends to ensure consistency. When you pip install a pinned version of pyspark, do you know if you are only getting a pinned python layer version, or are you also getting a pinned version of the underlying java engine?
mismo/conftest.py (Outdated)
```python
    return ibis.duckdb.connect()
elif backend_option == "pyspark":
    # Suppress warnings from PySpark
    warnings.filterwarnings("ignore", category=UserWarning)
```
Can we make these filters more selective? Long term I want to make these warnings not happen, if possible
I'll look into how best to do this. Ideally, I'd like to suppress just the warnings emitted by pyspark when using the ibis pyspark backend.
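For what it's worth, the `module` argument of `warnings.filterwarnings` might be enough here; a minimal sketch, assuming the warnings are raised from modules under the `pyspark` package:

```python
import warnings

# Sketch: ignore UserWarnings raised from pyspark's own modules only,
# rather than silencing every UserWarning in the test session.
warnings.filterwarnings("ignore", category=UserWarning, module="pyspark")
```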
Pip installing pyspark is waaaaay simpler, so I think we should probably start with that and see if we run into problems, but just keep the docker idea in the back of our minds.
pyspark versions mirror the underlying spark version, so pyspark 3.5.2 is a python layer on top of spark 3.5.2, etc. I'm not sure how backwards-compatible we would want to be with this, as in my experience there can be a few changes to the API between minor versions. Wherever possible, we would expect those to be handled within the ibis backend. I know ibis supports back to 3.3, depending on the function (e.g. pyarrow UDFs were implemented in pyspark 3.5), but I would expect most if not all of the built-in ibis functions to be supported in pyspark 3.3.
oops, didn't mean to approve.
Another thing to think about: probably some tests won't pass on pyspark immediately. IDK how feasible it is to get 100% of them passing, and if not, then we should introduce some test markers similar to ibis: `notimpl`, `notyet`, `never`.
I'm curious how many tests are gonna be problems 🙃
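As a point of reference, here's a minimal sketch of what ibis-style markers could look like in conftest.py; the marker names mirror ibis's, but the hook logic is an assumption, and it presumes the `--backend` option this PR adds:

```python
# conftest.py (sketch) -- an illustration, not mismo's actual API.
import pytest

def pytest_configure(config):
    # Register the markers so pytest doesn't warn about unknown marks.
    for name in ("notimpl", "notyet", "never"):
        config.addinivalue_line(
            "markers", f"{name}(backends): expected to fail on these backends"
        )

def pytest_collection_modifyitems(config, items):
    backend = config.getoption("--backend")
    for item in items:
        # notimpl/notyet: expected failures that may be fixed later.
        for name in ("notimpl", "notyet"):
            marker = item.get_closest_marker(name)
            if marker and backend in marker.args[0]:
                item.add_marker(pytest.mark.xfail(reason=f"{name} on {backend}"))
        # never: the backend will never support this, so skip outright.
        never = item.get_closest_marker("never")
        if never and backend in never.args[0]:
            item.add_marker(pytest.mark.skip(reason=f"never supported on {backend}"))
```

A test would then be marked like `@pytest.mark.notimpl(["pyspark"])`.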
@NickCrews thanks for the review on this. I've implemented most of the changes you've suggested. I'd like to make use of the notyet, never, etc. marks from ibis. Will let you know when this is ready for a second review.
sorry @jstammers, I just switched the project manager to be uv instead of PDM, so you're gonna get a few merge conflicts when you rebase :(

Oof, just took a look at the conftest.py's of the various backends from ibis. Open to what you come up with, but at first glance I don't think we need to be as thorough as they are. I think we want something equivalent (but probably lighter weight) to their BackendTest class, which would handle the backend-specific setup. Then, we create the BackendTest instance per pytest session using pytest fixtures. Let's look at how ibis does this; I bet they have this fairly worked out.
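As a rough illustration of that shape (a sketch only; the fixture name and structure are assumptions, not ibis's actual setup):

```python
# conftest.py (sketch): one backend connection per pytest session.
import pytest
import ibis

@pytest.fixture(scope="session")
def backend(request):
    option = request.config.getoption("--backend")
    if option == "duckdb":
        return ibis.duckdb.connect()
    elif option == "pyspark":
        # Assumes pyspark is installed; reuse a single SparkSession.
        from pyspark.sql import SparkSession

        session = SparkSession.builder.appName("mismo-tests").getOrCreate()
        return ibis.pyspark.connect(session)
    raise ValueError(f"unknown backend: {option!r}")
```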
This adds an option `--backend` for pytest that can be used to configure the `Backend` for ibis. Currently, this only supports the use of `pyspark`, and many of the tests appear to be failing. To make the CI tests clearer, we may want to define a separate workflow to run the tests with a pyspark backend, similar to how this is done for ibis.
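For context, registering such a pytest option in conftest.py generally looks like the following (a sketch; the default backend chosen here is an assumption):

```python
# conftest.py (sketch): expose a --backend option to the test suite.
def pytest_addoption(parser):
    parser.addoption(
        "--backend",
        action="store",
        default="duckdb",  # assumed default; duckdb is the existing backend
        help="ibis backend to run the test suite against",
    )
```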