[Bug]: pipe() can not pass column values to scipy.stats.t.cdf to compute p-value for example #1225

Closed
artiom-matvei opened this issue Oct 18, 2024 · 4 comments


@artiom-matvei
Contributor

Describe the bug

It seems like there is a problem in how narwhals passes values to lambda functions.

Steps or code to reproduce the bug

This is a stub that eventually will be used to compute a p-value. I removed what is not necessary to reproduce the bug.

import pandas as pd
import polars as pl
import narwhals as nw
import numpy as np
from scipy.stats import t
df = {
     "statistic": [1, 2, -3, 4],
 }
df_pd = pd.DataFrame(df)
df_pl = pl.DataFrame(df)

@nw.narwhalify
def p_value(df):
  dof = 4
  return df.select(nw.col("statistic").abs().pipe(lambda x: t.cdf(x, dof) ))

p_value(df_pd)

Expected results

I expect this to return some floats that will be used to compute the p-value.

Actual results

I get a TypeError from scipy, which indicates that the object passed to it has a type it cannot handle.

TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''

Below is the stack trace:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[21], line 1
----> 1 p_value(df_pl)

File c:\Users\IBM\Projects\Vincent A.B\pymarginaleffects\.venv\Lib\site-packages\narwhals\translate.py:766, in narwhalify.<locals>.decorator.<locals>.wrapper(*args, **kwargs)
    763     msg = "Found multiple backends. Make sure that all dataframe/series inputs come from the same backend."
    764     raise ValueError(msg)
--> 766 result = func(*args, **kwargs)
    768 return to_native(result, strict=strict)

Cell In[19], line 4
      1 @nw.narwhalify
      2 def p_value(df):
      3   dof = 4
----> 4   return df.select(nw.col("statistic").abs().pipe(lambda x: t.cdf(x, dof) ))

File c:\Users\IBM\Projects\Vincent A.B\pymarginaleffects\.venv\Lib\site-packages\narwhals\expr.py:119, in Expr.pipe(self, function, *args, **kwargs)
     80 def pipe(self, function: Callable[[Any], Self], *args: Any, **kwargs: Any) -> Self:
     81     """
     82     Pipe function call.
     83 
   (...)
    117         └─────┘
    118     """
--> 119     return function(self, *args, **kwargs)

Cell In[19], line 4
      1 @nw.narwhalify
      2 def p_value(df):
      3   dof = 4
----> 4   return df.select(nw.col("statistic").abs().pipe(lambda x: t.cdf(x, dof) ))

File c:\Users\IBM\Projects\Vincent A.B\pymarginaleffects\.venv\Lib\site-packages\scipy\stats\_distn_infrastructure.py:2116, in rv_continuous.cdf(self, x, *args, **kwds)
   2114 cond = cond0 & cond1
   2115 output = zeros(shape(cond), dtyp)
-> 2116 place(output, (1-cond0)+np.isnan(x), self.badvalue)
   2117 place(output, cond2, 1.0)
   2118 if np.any(cond):  # call only if at least 1 entry

TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
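
For illustration: the `Expr.pipe` frame in the traceback shows `return function(self, *args, **kwargs)`, which suggests the lambda receives the expression object itself rather than the column's values. The sketch below reproduces only that calling convention. `ToyExpr` is a hypothetical stand-in written for this example, not narwhals' actual class.

```python
from typing import Any, Callable


class ToyExpr:
    """Hypothetical stand-in for a lazy column expression (not narwhals' class)."""

    def __init__(self, name: str) -> None:
        self.name = name

    def pipe(self, function: Callable[..., Any], *args: Any, **kwargs: Any) -> Any:
        # Same shape as the line shown in the traceback: the callable gets `self`.
        return function(self, *args, **kwargs)


# The lambda is handed the expression object, not a sequence of numbers.
received = ToyExpr("statistic").pipe(lambda x: type(x).__name__)
print(received)
```

Under that assumption, `t.cdf(x, dof)` is called with an expression object, which would explain why `np.isnan(x)` fails inside scipy.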

Please run narwhals.show_version() and enter the output below.

System:
    python: 3.12.3 | packaged by conda-forge | (main, Apr 15 2024, 18:20:11) [MSC v.1938 64 bit (AMD64)]
executable: c:\Users\IBM\Projects\Vincent A.B\pymarginaleffects\.venv\Scripts\python.exe
   machine: Windows-10-10.0.19045-SP0

Python dependencies:
     narwhals: 1.9.4
       pandas: 2.2.2
       polars: 1.7.1
         cudf: 
        modin: 
      pyarrow: 17.0.0
        numpy: 2.0.2

Relevant log output

No response

@artiom-matvei
Contributor Author

Maybe the solution is in this example on GitHub.

@artiom-matvei
Contributor Author

artiom-matvei commented Oct 18, 2024

Using polars we could do something similar to:

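import numpy as np
import polars as pl
from scipy import stats
dof = 4  # degrees of freedom, as in the report above
df = pl.DataFrame({"statistic": [1, 2, -3, 4]})
df.with_columns(
    pl.col("statistic")
        .map_elements(
            lambda x: (2 * (1 - stats.t.cdf(np.abs(x), dof))), return_dtype=pl.Float64
        )
        .alias("p_value")
)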

The above is my attempt to convert this function to use narwhals.

@MarcoGorelli
Member

MarcoGorelli commented Oct 19, 2024

thanks @artiom-matvei for the report

I think what you're trying to do in Polars is

import pandas as pd
import polars as pl
import narwhals as nw
import numpy as np
from scipy.stats import t
df = {
     "statistic": [1, 2, -3, 4],
 }
df_pd = pd.DataFrame(df)
df_pl = pl.DataFrame(df)

def p_value(df):
    dof = 4
    return df.with_columns(
        pl.col("statistic")
            .map_batches(
                lambda x: (2 * (1 - t.cdf(np.abs(x), dof))), return_dtype=pl.Float64
            )
            .alias("p_value")
    )

print(p_value(df_pl))

right?

If so, we don't (yet) support map_batches - but we probably should! I've opened a new feature for that 👌
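
As background on why a batch-style map fits this computation: `scipy.stats.t.cdf` accepts whole arrays, so the two-sided p-value can be computed in one call over the column instead of once per value. A minimal sketch, assuming only numpy and scipy are installed; the variable names are illustrative and no dataframe library is involved.

```python
import numpy as np
from scipy.stats import t

dof = 4  # degrees of freedom, as in the thread
statistic = np.array([1, 2, -3, 4], dtype=float)

# One vectorized call over the whole column (what a batch-style map would do).
p_batch = 2 * (1 - t.cdf(np.abs(statistic), dof))

# One call per value (what an element-wise map would do): same numbers, more overhead.
p_elementwise = np.array([2 * (1 - t.cdf(abs(x), dof)) for x in statistic])

print(np.allclose(p_batch, p_elementwise))
```

Both approaches should agree numerically; the batch form simply avoids a Python-level call for every row.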

In the meantime, if df is an eager dataframe, you could do

import pandas as pd
import polars as pl
import narwhals as nw
import numpy as np
from scipy.stats import t

df = {
    "statistic": [1, 2, -3, 4],
}
df_pd = pd.DataFrame(df)
df_pl = pl.DataFrame(df)


@nw.narwhalify
def p_value(df):
    dof = 4
    return df.with_columns(
        nw.new_series(
            "p_value",
            2 * (1 - t.cdf(np.abs(df["statistic"]), dof)),
            dtype=nw.Float64,
            native_namespace=nw.get_native_namespace(df),
        )
    )

print(p_value(df_pl))

Once we address #1226, then you should be able to use map_batches on the col('statistic') expression, and then it should work for both lazy and eager inputs

@MarcoGorelli
Member

closing in favour of #1226 as that'll address this, thanks again for the issue!
