Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix issue 108 narwhals pandas polars support #130

Conversation

artiom-matvei
Copy link
Contributor

@artiom-matvei artiom-matvei commented Oct 18, 2024

I might also change the tests import structure to not use a relative import path like in this SO post: https://stackoverflow.com/a/16985066/6916564

@artiom-matvei artiom-matvei force-pushed the fix_issue_108_narwhals_pandas_polars_support branch from d341df9 to 4effc0f Compare October 18, 2024 22:01
@artiom-matvei artiom-matvei force-pushed the fix_issue_108_narwhals_pandas_polars_support branch from ef0699b to 71d0a6e Compare October 21, 2024 15:39
@vincentarelbundock
Copy link
Owner

@artiom-matvei looks like you are modifying a lot of files. Have you concluded that this is possible/easy/desirable?

@artiom-matvei artiom-matvei force-pushed the fix_issue_108_narwhals_pandas_polars_support branch from 71d0a6e to 0719887 Compare October 21, 2024 15:47
@artiom-matvei
Copy link
Contributor Author

I started with predictions() but then it required some small changes in other parts, and then other parts so it became a pretty big change.

To be honest, I don't think it is desirable as we can simply convert everything to polars as was done before.

Also, the user interacts with us by passing a dataframe only at the entry, if we want to pass him the same dataframe type, we can convert our dataframes at the exit.

Furthermore, the current version feels pretty patchy and would need additional refactoring to improve the implementation. @vincentarelbundock

@vincentarelbundock
Copy link
Owner

That makes sense.

Would it be possible to implement a more lightweight version of this: Use narhwals to convert the data frame to Polars on entry.

We would still do everything in Polars internally, and we would still return a class that inherits from Polars data frame.

The benefit of this would be that we wouldn't have to require both Polars and Pandas as dependencies. Pandas is very heavy, with lots of compiled code, etc. In contrast, narwhals is 0-dep, so it would be nice to use it as a thin ingestion layer.

Another benefit would be that this would let us accept other data frame times automatically, including DuckDB, PyArrow, etc.

@artiom-matvei
Copy link
Contributor Author

artiom-matvei commented Oct 22, 2024

Narwhals cannot be used to convert between dataframe types unfortunately.

How it removes dependencies to other dataframes is by

  • using a common API
  • at runtime, translate common API operations to dataframe package-specific operations

It is not a compatibility layer between different types of dataframes.

See discussion on discord
image

The way to go would be to convert the whole project to use nw but this can be made in steps

@vincentarelbundock
Copy link
Owner

We only need narwhals to convert in one direction: Any Format -> Polars.

And narwhals can absolutely do that. I'm just thinking about writing a simple ingest() helper, and using that to convert any user-supplied DF to Polars.

In the example below, I create data frames in 3 formats: Polars, Pandas, and DuckDB. Then, I use narwhals to convert each of them to identical Polars DFs:

import narwhals as nw
import pandas as pd
import polars as pl
import duckdb
df_polars = pl.DataFrame(
    {
        "A": [1, 2, 3, 4, 5],
        "fruits": ["banana", "banana", "apple", "apple", "banana"],
        "B": [5, 4, 3, 2, 1],
        "cars": ["beetle", "audi", "beetle", "beetle", "beetle"],
    }
)
df_duckdb = duckdb.sql("SELECT * FROM df_polars")
df_pandas = df_polars.to_pandas()

@nw.narwhalify
def ingest(df):
  return pl.DataFrame(df.to_arrow())

ingest(df_polars)
ingest(df_pandas)
ingest(df_duckdb)

@artiom-matvei
Copy link
Contributor Author

artiom-matvei commented Oct 22, 2024

Your are right. We can improve by doing like what MarcoGorelli suggested:
yup - you don't need to use df.to_arrow() explicitly btw, it would be more efficient to do

def ingest(df):
    return nw.from_arrow(df, native_namespace=pl).to_native()

, then it'll use the PyCapsule Interface (and, depending on the input, might not need to go via PyArrow)

@vincentarelbundock
Copy link
Owner

Sounds great. As long as it works with all the data frame formats... Are there any non-Arrow ones where this fails?

@vincentarelbundock
Copy link
Owner

Your function does not seem to work on DuckDB

@artiom-matvei artiom-matvei force-pushed the fix_issue_108_narwhals_pandas_polars_support branch from 0719887 to 9306fce Compare October 23, 2024 00:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants