-
-
Notifications
You must be signed in to change notification settings - Fork 2.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
convert plotly-express to use narwhals #4790
base: master
Are you sure you want to change the base?
Conversation
@ndrezn with the latest commit, I am able to get pandas performances on the same ballpark of master on my local machine. Could you check if you are able to replicate that when you have the time? |
…lotly.py into plotly-with-narwhals
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
very impressive effort, just left some comments on some things i noticed
regarding the overhead, it still replicates on the latest commit if I run the timing test on a kaggle notebook: https://www.kaggle.com/code/marcogorelli/plotly-timings/notebook - looking into it 🕵️
if isinstance(argument, Constant) or isinstance(argument, Range): | ||
if isinstance(argument, (Constant, Range)): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is nice, but to keep the diff down, would it make sense to factor out some drive-by changes like this one into a separate precursor PR?
packages/python/plotly/plotly/tests/test_optional/test_px/conftest.py
Outdated
Show resolved
Hide resolved
import numpy as np | ||
|
||
import narwhals.stable.v1 as nw | ||
from narwhals.dependencies import is_into_series |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@FBruzzesi Should these second two import statements also import from the stable namespace? I would expect the import format to be consistent, but not sure if there is some reasoning I'm missing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed that, yet generate_unique_token
is yet not exposed. We have an open issue for it
def _is_continuous(df, col_name): | ||
return df[col_name].dtype.kind in "ifc" | ||
def _is_continuous(df: nw.DataFrame, col_name: str) -> bool: | ||
return df.get_column(col_name).dtype.is_numeric() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@FBruzzesi I think this is subtly different from the current logic in the case of unsigned integer types -- the current logic returns False
for unsigned types but this logic returns True
. I'm not yet sure whether that makes a difference in practice, but I wanted to flag the difference.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
True, but I would tend to treat uint and int in the same way. Probably _is_countinuous
is not the best name, but _is_numeric
could be.
I will try to come up with an example in which they end up with different plots in master and reason from there
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On master
, when grouping by some level (e.g. color='color_by'
), then currently Plotly drops groups where the key is None
, whereas in this PR we would still plot those
master:
here:
This is a source of overhead for pandas - iterating over groups dropna=False
is noticeably slower than with the default dropna=True
In [4]: %timeit dict(pandas_df.groupby(['colorby', 'facetby'], sort=False, dropna=True).__iter__())
173 ms ± 7.45 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [5]: %timeit dict(pandas_df.groupby(['colorby', 'facetby'], sort=False, dropna=False).__iter__())
254 ms ± 20.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
I think we should add such an option in Narwhals: narwhals-dev/narwhals#1257
Alright, we're making progress: a lot of the overhead is coming from BaseFigure._perform_update
- there's a few pandas-specific paths which we're currently skipping in this PR because the pandas objects are wrapped in Narwhals classes:
plotly.py/packages/python/plotly/_plotly_utils/basevalidators.py
Lines 186 to 207 in 960adb9
def is_homogeneous_array(v): | |
""" | |
Return whether a value is considered to be a homogeneous array | |
""" | |
np = get_module("numpy", should_load=False) | |
pd = get_module("pandas", should_load=False) | |
if ( | |
np | |
and isinstance(v, np.ndarray) | |
or (pd and isinstance(v, (pd.Series, pd.Index))) | |
): | |
return True | |
if is_numpy_convertable(v): | |
np = get_module("numpy", should_load=True) | |
if np: | |
v_numpy = np.array(v) | |
# v is essentially a scalar and so shouldn't count as an array | |
if v_numpy.shape == (): | |
return False | |
else: | |
return True # v_numpy.dtype.kind in ["u", "i", "f", "M", "U"] | |
return False |
This should be addressable - studying the code 🔎 , will update in due course
# This is safe since at this point `_compliant_frame` is one of the "full" level | ||
# support dataframe(s) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
outdated comment 😉
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am ok with removing the comment, but how is it outdated? For InterchangeFrame
, the __native_namespace__
method raises NotImplementedError
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the code below it does nw.get_native_namespace(df_input)
, so it's not clear what _compliant_frame
refers to (we may rename it in Narwhals, it's just an implementation detail)
and we should probably implement __native_namespace__
for InterchangeFrame
# ``` | ||
# However we cannot do that just yet, therefore a workaround is provided | ||
agg_f[args["color"]] = nw.col(args["color"]).max() | ||
agg_f[f'{args["color"]}__plotly_n_unique__'] = ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we use generate_unique_token
here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just to make sure I get this right, do you mean f'{args["color"]}' + <token>
or just <token>
?
The former case would keep similar comfort in the workaround but there is no guarantee of uniqueness (since it compares with columns that do not contain f'{args["color"]}' + <token>
. The latter case would be a fairly additional headache since we do that repeatedly the same workaround in line 1882, and we should keep track of all the tokens added so far, and to which column they refer to (since they come in pairs).
We can go really specific in the suffix info (e.g. __plotly_process_dataframe_hierarchy_discrete_agg_n_unique__
), hopefully no one uses a suffix like this one 😁
Update on performance from some work on some branches: timed on a kaggle notebook: master: plotly-with-narwhals (with Narwhals 1.11.0, and FBruzzesi#1, right at the bottom of the notebook): I'm going to work on cleaning this all up now, but in summary, the main sources of overhead were:
|
Description
This PR migrates plotly-express module logic from pandas to narwhals. In this way, pandas is not a required dependency for plotly-express (or at least for its entirety - e.g. trendlines will still require pandas for now) and users coming with polars,
pyarrow or other eager dataframes supported in narwhals do not need to depend on pandas in the first place.
Related issue #4749
Code PR
plotly.graph_objects
, my modifications concern thecodegen
files and not generated files.modified existing tests.
new tutorial notebook (please see the doc checklist as well).
Out of scope for the PR
cc: @MarcoGorelli @LiamConnors