Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

convert plotly-express to use narwhals #4790

Open
wants to merge 77 commits into
base: master
Choose a base branch
from

Conversation

FBruzzesi
Copy link

@FBruzzesi FBruzzesi commented Oct 9, 2024

Description

This PR migrates plotly-express module logic from pandas to narwhals. In this way, pandas is not a required dependency for plotly-express (or at least for its entirety - e.g. trendlines will still require pandas for now) and users coming with polars,
pyarrow or other eager dataframes supported in narwhals do not need to depend on pandas in the first place.

Related issue #4749

Code PR

  • I have read through the contributing notes and understand the structure of the package. In particular, if my PR modifies code of plotly.graph_objects, my modifications concern the codegen files and not generated files.
  • I have added tests (if submitting a new feature or correcting a bug) or
    modified existing tests.
  • For a new feature, I have added documentation examples in an existing or
    new tutorial notebook (please see the doc checklist as well).
  • I have added a CHANGELOG entry if fixing/changing/adding anything substantial.
  • For a new feature or a change in behaviour, I have updated the relevant docstrings in the code to describe the feature or behaviour (please see the doc checklist as well).

Out of scope for the PR

  • No documentation change has been done so far
  • Adapt plotly data accordingly
  • timeseries/trendlines

cc: @MarcoGorelli @LiamConnors

@FBruzzesi
Copy link
Author

FBruzzesi commented Oct 22, 2024

@ndrezn with the latest commit, I am able to get pandas performances on the same ballpark of master on my local machine. Could you check if you are able to replicate that when you have the time?

Copy link
Contributor

@MarcoGorelli MarcoGorelli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

very impressive effort, just left some comments on some things i noticed

regarding the overhead, it still replicates on the latest commit if I run the timing test on a kaggle notebook: https://www.kaggle.com/code/marcogorelli/plotly-timings/notebook - looking into it 🕵️

packages/python/plotly/plotly/express/_core.py Outdated Show resolved Hide resolved
packages/python/plotly/plotly/express/_core.py Outdated Show resolved Hide resolved
packages/python/plotly/plotly/express/_core.py Outdated Show resolved Hide resolved
packages/python/plotly/plotly/express/_core.py Outdated Show resolved Hide resolved
packages/python/plotly/plotly/express/_core.py Outdated Show resolved Hide resolved
Comment on lines -1181 to +1217
if isinstance(argument, Constant) or isinstance(argument, Range):
if isinstance(argument, (Constant, Range)):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is nice, but to keep the diff down, would it make sense to factor out some drive-by changes like this one into a separate precursor PR?

packages/python/plotly/plotly/express/_core.py Outdated Show resolved Hide resolved
packages/python/plotly/plotly/express/_core.py Outdated Show resolved Hide resolved
packages/python/plotly/plotly/express/_core.py Outdated Show resolved Hide resolved
packages/python/plotly/plotly/express/_core.py Outdated Show resolved Hide resolved
import numpy as np

import narwhals.stable.v1 as nw
from narwhals.dependencies import is_into_series
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@FBruzzesi Should these second two import statements also import from the stable namespace? I would expect the import format to be consistent, but not sure if there is some reasoning I'm missing.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes!

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed that, yet generate_unique_token is yet not exposed. We have an open issue for it

def _is_continuous(df, col_name):
return df[col_name].dtype.kind in "ifc"
def _is_continuous(df: nw.DataFrame, col_name: str) -> bool:
return df.get_column(col_name).dtype.is_numeric()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@FBruzzesi I think this is subtly different from the current logic in the case of unsigned integer types -- the current logic returns False for unsigned types but this logic returns True. I'm not yet sure whether that makes a difference in practice, but I wanted to flag the difference.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True, but I would tend to treat uint and int in the same way. Probably _is_countinuous is not the best name, but _is_numeric could be.

I will try to come up with an example in which they end up with different plots in master and reason from there

Copy link
Contributor

@MarcoGorelli MarcoGorelli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On master, when grouping by some level (e.g. color='color_by'), then currently Plotly drops groups where the key is None, whereas in this PR we would still plot those

master:

Screenshot 2024-10-26 110452

here:

Screenshot 2024-10-26 110540

This is a source of overhead for pandas - iterating over groups dropna=False is noticeably slower than with the default dropna=True

In [4]: %timeit dict(pandas_df.groupby(['colorby', 'facetby'], sort=False, dropna=True).__iter__())
173 ms ± 7.45 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [5]: %timeit dict(pandas_df.groupby(['colorby', 'facetby'], sort=False, dropna=False).__iter__())
254 ms ± 20.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

I think we should add such an option in Narwhals: narwhals-dev/narwhals#1257


Alright, we're making progress: a lot of the overhead is coming from BaseFigure._perform_update - there's a few pandas-specific paths which we're currently skipping in this PR because the pandas objects are wrapped in Narwhals classes:

def is_homogeneous_array(v):
"""
Return whether a value is considered to be a homogeneous array
"""
np = get_module("numpy", should_load=False)
pd = get_module("pandas", should_load=False)
if (
np
and isinstance(v, np.ndarray)
or (pd and isinstance(v, (pd.Series, pd.Index)))
):
return True
if is_numpy_convertable(v):
np = get_module("numpy", should_load=True)
if np:
v_numpy = np.array(v)
# v is essentially a scalar and so shouldn't count as an array
if v_numpy.shape == ():
return False
else:
return True # v_numpy.dtype.kind in ["u", "i", "f", "M", "U"]
return False

This should be addressable - studying the code 🔎 , will update in due course

Comment on lines +1497 to +1498
# This is safe since at this point `_compliant_frame` is one of the "full" level
# support dataframe(s)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

outdated comment 😉

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am ok with removing the comment, but how is it outdated? For InterchangeFrame, the __native_namespace__ method raises NotImplementedError

Copy link
Contributor

@MarcoGorelli MarcoGorelli Oct 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the code below it does nw.get_native_namespace(df_input), so it's not clear what _compliant_frame refers to (we may rename it in Narwhals, it's just an implementation detail)

and we should probably implement __native_namespace__ for InterchangeFrame

# ```
# However we cannot do that just yet, therefore a workaround is provided
agg_f[args["color"]] = nw.col(args["color"]).max()
agg_f[f'{args["color"]}__plotly_n_unique__'] = (
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we use generate_unique_token here?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to make sure I get this right, do you mean f'{args["color"]}' + <token> or just <token>?
The former case would keep similar comfort in the workaround but there is no guarantee of uniqueness (since it compares with columns that do not contain f'{args["color"]}' + <token>. The latter case would be a fairly additional headache since we do that repeatedly the same workaround in line 1882, and we should keep track of all the tokens added so far, and to which column they refer to (since they come in pairs).

We can go really specific in the suffix info (e.g. __plotly_process_dataframe_hierarchy_discrete_agg_n_unique__), hopefully no one uses a suffix like this one 😁

@MarcoGorelli
Copy link
Contributor

MarcoGorelli commented Oct 26, 2024

Update on performance from some work on some branches:

timed on a kaggle notebook:

master:
scatter_polars,0.586423233000005
bar_polars,1.5854483000000528
scatter_pandas,0.5920868089999658
bar_pandas,1.620950485000094

plotly-with-narwhals (with Narwhals 1.11.0, and FBruzzesi#1, right at the bottom of the notebook):
scatter_polars,0.20539162399995803
bar_polars,1.159081075999893
scatter_pandas,0.5411779630001092
bar_pandas,1.5852239850000842

I'm going to work on cleaning this all up now, but in summary, the main sources of overhead were:

  • pandas making unnecessary copies for some operations (rename, reset_index), which we're now being more careful about. thanks for having highlighted this area of improvement which helped us find these!
  • some pandas-specific paths being missed in BaseFigure._perform_update, meaning that we were repeatedly converting to numpy (and sometimes then making additional copies) unnecessarily

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
community community contribution feature something new P1 needed for current cycle
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants