Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

narwhals to support pandas as input #108

Closed
s3alfisc opened this issue Jul 2, 2024 · 9 comments
Closed

narwhals to support pandas as input #108

s3alfisc opened this issue Jul 2, 2024 · 9 comments

Comments

@s3alfisc
Copy link
Contributor

s3alfisc commented Jul 2, 2024

Hi @vincentarelbundock - @juanitorduz pointed me towards @MarcoGorelli's narwhals project, which solves the "two DataFrame APIs" problem for developers by allowing to define APIs that are agnostic to the input data frame type.

I.e. one can do things as

import narwhals as nw
import pandas as pd 
import polars as pl

def func(df_any):
    df = nw.from_native(df_any)
    df = df.select(
        a_sum=nw.col('a').sum(),
        a_mean=nw.col('a').mean(),
        a_std=nw.col('a').std(),
    )
    return nw.to_native(df)

df = pd.DataFrame({'a': [1, 2, 3, 4, 5]})
func(df) # returns pandas DataFrame

df = pl.DataFrame({'a': [1, 2, 3, 4, 5]})
func(df) # returns polars DataFrame

In other words - your polars code will work on pandas inputs, without requiring pandas as a dependency! In fact, narwhals even allows you to drop the polars dependency if you wanted to.

Happy to try myself at a PR for this once I find the time =)

@vincentarelbundock
Copy link
Owner

vincentarelbundock commented Jul 2, 2024

Ooooh that sounds like exactly what we need.

I won't have much (if any) time to write code for this in the next couple months, but I'd be more than happy to review a PR, especially if the changes are pretty minimal.

Thanks for raising this issue!

@vincentarelbundock
Copy link
Owner

@artiom-matvei you could take a look at this if you have time. Seems simple enough, and may not require that much internal change.

@artiom-matvei
Copy link
Contributor

@vincentarelbundock From what I understand, narwhals is used as an interface that translates operations to the underlying functions of polars or pandas.

Currently, I think we convert pandas to polars (through sanitize_newdata()) and then we work with polars everywhere.

The way I see we could apply narwhals, would be by using narwhals.DataFrame instead of polars.DataFrame everywhere which seems like needing quite some refactoring. Is this the way to go? Do you have a suggestion for a different way of using narwhals?

@vincentarelbundock
Copy link
Owner

Yes, I think that's basically it.

IIUC, there wouldn't be that much refactoring to do, because the Narwhals API is extremely similar to Polars, which we already use. So it would mostly be a matter of changing:

pl.DataFrame() -> nw.DataFrame()
pl.col() -> nw.col()

I think what I'd like to see is a proof of concept on a very small part of the code base. Converting a couple functions to accept any kind of data frame as input, and returning the original type.

If that passes the tests, we will have a better idea of what it would take to refactor everything. A proof of concept before committing many hours.

@s3alfisc do you have experience doing this? If so, how would you recommend we proceed?

@s3alfisc
Copy link
Contributor Author

I think what I'd like to see is a proof of concept on a very small part of the code base. Converting a couple functions to accept any kind of data frame as input, and returning the original type.

This sounds like the right strategy to me! I also think that replacing the pl. suffix with nw. should get you at least to 90%.

Unfortunately, I haven't yet started the narwhals migration of pyfixest, so still somewhat inexperienced. But I think that @MarcoGorelli has by now accompanied many such migrations (for example for altair) so he might have some tips / insights?

@artiom-matvei
Copy link
Contributor

I opened an Issue with narwhals as I cannot make it apply the scipy.stats.t.cdf function inside the pipe function. narwhals-dev/narwhals#1225
@vincentarelbundock if you have any suggestions

@vincentarelbundock
Copy link
Owner

Thanks for opening this issue. It looks detailed and helpful. Hopefully someone will respond with useful info.

@MarcoGorelli
Copy link

But I think that @MarcoGorelli has by now accompanied many such migrations (for example for altair) so he might have some tips / insights?

thanks for the ping! happy to help out along the way (though I wouldn't have capacity right now to lead the effort) - i'd suggest to start small and see if there's any parts of the codebase which it's possible to narwhalify whilst keeping the test suite passing. then, gradually increase

for any questions / help, i'd suggest:

  • joining the narwhals discord (link on the project's README)
  • attending our community call if you'd like to chat about anything (also linked to on the README)
  • i also have a calendly link if the time of the above doesn't work for you

happy to see what we can do here!

@vincentarelbundock
Copy link
Owner

marginaleffects version 0.0.14 now uses Narwhals to ingest data frames in many different formats. We still load Polars and do all internal transformations with it, but wrote a single Narwhals function to convert user-supplied data frames on entry.

Thanks to every who contributed to the discussion, and especially to Artiom for the implementation!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants