Enable unit tests using pyspark backend #55

Open · wants to merge 7 commits into main
Conversation

jstammers (Contributor)

This adds a --backend option for pytest that can be used to configure the backend for ibis. Currently, this only supports pyspark, and many of the tests appear to be failing.

To make the CI tests clearer, we may want to define a separate workflow that runs the tests with the pyspark backend, similar to how this is done for ibis.
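
For context, a minimal sketch of what such an option and fixture might look like in conftest.py; the option name matches the description above, but the fixture name, default backend, and connection details are illustrative assumptions rather than the exact code in this branch:

```python
# conftest.py -- illustrative sketch only, not necessarily the exact code in this PR
import ibis
import pytest


def pytest_addoption(parser):
    # Register the --backend option; defaulting to duckdb is an assumption.
    parser.addoption("--backend", action="store", default="duckdb",
                     help="ibis backend to run the tests against")


@pytest.fixture(scope="session")
def backend(request):
    backend_option = request.config.getoption("--backend")
    if backend_option == "duckdb":
        return ibis.duckdb.connect()
    elif backend_option == "pyspark":
        # Import lazily so duckdb-only runs don't require pyspark to be installed.
        from pyspark.sql import SparkSession

        session = SparkSession.builder.getOrCreate()
        return ibis.pyspark.connect(session)
    raise ValueError(f"Unknown --backend value: {backend_option}")
```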

NickCrews (Owner) left a comment

This looks like a great approach, thanks a ton @jstammers! Just a few tiny bits, and we need to figure out those test failures, but then this should be good to go. I'm curious whether pyspark installed via pip is gonna be stable enough across different devs' computers and CI. Ibis, for example, uses docker to run their various backends to ensure consistency. When you pip install a pinned version of pyspark, do you know if you are only getting a pinned Python-layer version, or are you also getting a pinned version of the underlying Java engine?

    return ibis.duckdb.connect()
elif backend_option == "pyspark":
    # Suppress warnings from PySpark
    warnings.filterwarnings("ignore", category=UserWarning)
NickCrews (Owner)

Can we make these filters more selective? Long term, I want to make these warnings not happen, if possible.

jstammers (Contributor, Author)

I'll look into how best to do this. Ideally, I'd like to suppress just the warnings emitted by pyspark when using the ibis pyspark backend.
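
One possible way to scope the filter, assuming the warnings in question are raised from pyspark's own modules (the module pattern below is an assumption, not verified against the actual warnings we're seeing):

```python
import warnings

# Only ignore UserWarnings whose originating module is pyspark (or a submodule),
# instead of silencing every UserWarning in the test session.
warnings.filterwarnings("ignore", category=UserWarning, module=r"pyspark(\..*)?")
```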

@NickCrews (Owner)

Pip installing pyspark is waaaaay simpler so I think we should probably start with that and see if we run into problems, but just keep the docker idea in the back of our minds

@jstammers (Contributor, Author)

> When you pip install a pinned version of pyspark, do you know if you are only getting a pinned Python-layer version, or are you also getting a pinned version of the underlying Java engine?

pyspark versions mirror the underlying Spark version, so pyspark 3.5.2 is a Python layer on top of Spark 3.5.2, and so on.

I'm not sure how backwards-compatible we would want to be with this, as in my experience there can be a few API changes between minor versions. Wherever possible, we would expect those to be handled within the ibis backend. I know ibis supports back to 3.3, depending on the function (e.g. pyarrow UDFs were implemented in pyspark 3.5), but I would expect most, if not all, of the built-in ibis functions to be supported in pyspark 3.3.

NickCrews (Owner) left a comment

oops, didn't mean to approve.

Another thing to think about: probably some tests won't pass on pyspark immediately. IDK how feasible it is to get 100% of them passing, and if not, then we should introduce some test markers similar to ibis: notimpl, notyet, never.

I'm curious how many tests are gonna be problems 🙃
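
For what it's worth, a rough sketch of how a lightweight notyet-style marker could be wired up in conftest.py; the marker name and hook logic here are hypothetical, borrowed in spirit from ibis rather than copied from it:

```python
# conftest.py -- hypothetical sketch of a lightweight "notyet" marker
import pytest


def pytest_configure(config):
    config.addinivalue_line(
        "markers", "notyet(backend): test is not yet expected to pass on this backend"
    )


def pytest_collection_modifyitems(config, items):
    backend = config.getoption("--backend")
    for item in items:
        for mark in item.iter_markers(name="notyet"):
            if backend in mark.args:
                item.add_marker(pytest.mark.xfail(reason=f"not yet supported on {backend}"))


# Usage in a test module:
@pytest.mark.notyet("pyspark")
def test_some_feature(backend):
    ...
```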

@jstammers (Contributor, Author)

@NickCrews thanks for the review on this. I've implemented most of the changes you've suggested. I'd like to make use of the notyet, never, etc. markers from ibis, but I'm struggling to think of a way to do this that doesn't directly copy the conftest.py module from there.

Will let you know when this is ready for a second review

@NickCrews (Owner)

sorry @jstammers, I just switched the project manager to be uv instead of PDM, you're gonna get a few merge conflicts when you rebase :(

Oof, just took a look at the conftest.py's of the various backends from ibis. Open to what you come up with, but at first glance I don't think we need to be as thorough as they are. I think we want something equivalent (but probably lighter weight) to their BackendTest class, which would:

  • hold an instance of the actual backend object
  • be responsible for initializing and tearing down the backend (i.e. maybe we switch to using a docker-based pyspark instead of the in-process method you have here, or maybe someone wants to add postgres)
  • have some helper methods for smoothing over any differences between backends (but I think for our use cases this should be more minimal, and anyway if we discover these we should report them upstream to ibis, I think that is in their purview to smooth over these differences for us)

Then, we create the BackendTest instance per pytest session using pytest fixtures. Let's look at how ibis does this, I bet they have this fairly worked out.
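
To make the idea concrete, here's a hypothetical sketch of a lighter-weight BackendTest-style helper plus a session-scoped fixture; the class, fixture, and method names are assumptions for illustration, not ibis's actual implementation:

```python
# Hypothetical sketch, not ibis's BackendTest
import ibis
import pytest


class BackendTest:
    """Holds the backend connection and owns its setup and teardown."""

    def __init__(self, name: str):
        self.name = name
        self.session = None
        if name == "duckdb":
            self.connection = ibis.duckdb.connect()
        elif name == "pyspark":
            from pyspark.sql import SparkSession

            self.session = SparkSession.builder.getOrCreate()
            self.connection = ibis.pyspark.connect(self.session)
        else:
            raise ValueError(f"Unknown backend: {name}")

    def teardown(self):
        # Backend-specific cleanup, e.g. stopping an in-process Spark session.
        if self.session is not None:
            self.session.stop()


@pytest.fixture(scope="session")
def backend_test(request):
    bt = BackendTest(request.config.getoption("--backend"))
    yield bt
    bt.teardown()
```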
