Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replacing pandas with polars #1926

Closed
langestefan opened this issue Dec 3, 2023 · 6 comments
Closed

Replacing pandas with polars #1926

langestefan opened this issue Dec 3, 2023 · 6 comments

Comments

@langestefan
Copy link

langestefan commented Dec 3, 2023

Is your feature request related to a problem? Please describe.
Pandas is used extensively in pvlib all over the place. However after using it for many years I think it is now outdated and needs to be replaced. It is greedy, slow, inconsistent and the API is awkward and difficult to use.

Describe the solution you'd like
I would like to propose replacing pandas with polars. Polars is a DataFrame interface built in Rust with many, many performance benefits. Too many to list them here, so I suggest visiting their page to learn more.

I know this is a very big change, but I believe it will lead to some serious performance and usability benefits. So there is no real plan yet, I would like to just start the discussion and see what others think. I think a good starting point would be to benchmark some performance-critical pvlib code in pandas and polars so we can quantify what the performance benefits for pvlib would be.

@mikofski
Copy link
Member

mikofski commented Dec 3, 2023

It is greedy, slow, inconsistent and the API is awkward and difficult to use.

I ❤️ this! 🤣

a good starting point would be to benchmark some performance-critical code in pandas and polars so we can quantify what the performance benefits would be.

I 💯 agree. Maybe change issue title to “benchmark polars v pandas”? And could broaden it to include xarray and numpy structs to be unbiased?

one thing not seeing is how polars behaves with vectorization. I see SIMD but does that require or utilize MKL or specific CPU arch? Polars send great at SQL operations, but we don’t use it much in pvlib, eg sorting, selecting, etc.

also, important to note where/how pvlib api changes bc of pandas extensive use

@kandersolar
Copy link
Member

@langestefan can you say more about what "replace pandas with polars" would mean in this context? Not supporting pandas inputs, or changing our docs to use polars instead of pandas, or what?

Off the top of my head, there aren't a whole lot of places that pvlib actually uses pandas directly (iotools, spectrum, a few model functions here and there). Many of the files that import pandas only do it so that functions can be sure to return pandas objects when the inputs are pandas objects. It's hard to know for sure, but I'd guess that the majority of pvlib users have pandas-based workflows, so removing support for passing pandas objects to (and getting pandas objects back from) pvlib would be an enormously breaking change. I doubt any performance comparison would convince me personally that dropping support for pandas in favor of polars would be a net positive for the pvlib community.

If polars support is valuable, what seems like a better end result to me is to try to support both polars and pandas rather than dropping one for the other. I'm not eager to proliferate the if isinstance(input, polars.Whatever): return polars.Whatever(output) code pattern that I suspect supporting polars would require, but if the value is big enough then I could see it being a good tradeoff. For this, I agree quantifying the performance impact would be helpful.

@jules-ch
Copy link
Contributor

jules-ch commented Dec 5, 2023

Like @kevinsa5

I'd say it's way to early to support polars instead of pandas. The API is still unstable, version 0.X.
I'm a big user of polars on personal projects because of performance lazyloading, async collection, integration with the Apache Arrow Ecosystem.

But I really don't see any improvement in using polars here, apart from using the new shiny library, an effort is being made to use numpy array as inputs in functions where pandas were used before see #1455.
Performance-wise, benchmarks need to be made first and since we're not processing GB of data, we probably won't see any difference in performance, especially since we can use the new interface with the Arrow ecosystem with pandas 2.0.

Even when processing 20 years of weather data for a single location it's not even close to several millions of rows.
IMO I don't think it will make a significant gain in perf but I'm eager to see some benchmarks.

Moving forward using Numpy types instead of Dataframe or Series will make integrating with Polars or any Dataframe library easy.

We can probably use polars already if that's what you want with a bunch of PVLIB API.

@markcampanelli
Copy link
Contributor

I think it would be more appropriate to simply make pvlib "agnostic" to if one is using pandas vs. polars (vs. just plain old numpy float vectors and a Python list of timestamps). Not sure if this could be done with some "timestamped data provider" interface abstraction to lighten the load on users. Given the current evolution of pvlib, this would most likely be painful and disruptive to the pandas lovers. Also, sorting all this out would be quite the lift, I fear.

@langestefan
Copy link
Author

I think it would be more appropriate to simply make pvlib "agnostic" to if one is using pandas vs. polars (vs. just plain old numpy float vectors and a Python list of timestamps).

Narwhals would be way to go.

@markcampanelli
Copy link
Contributor

I think it would be more appropriate to simply make pvlib "agnostic" to if one is using pandas vs. polars (vs. just plain old numpy float vectors and a Python list of timestamps).

Narwhals would be way to go.

Oh cool. I need to check this out further!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants