convert plotly-express to use narwhals #4790

FBruzzesi · 2024-10-09T11:54:47Z

Description

This PR migrates the plotly-express module logic from pandas to narwhals. As a result, pandas is no longer a required dependency for plotly-express (at least not for all of it: e.g. trendlines still require pandas for now), and users working with polars, pyarrow or other eager dataframes supported by narwhals do not need to depend on pandas in the first place.

Related issue #4749

Code PR

I have read through the contributing notes and understand the structure of the package. In particular, if my PR modifies code of plotly.graph_objects, my modifications concern the codegen files and not generated files.
I have added tests (if submitting a new feature or correcting a bug) or modified existing tests.
For a new feature, I have added documentation examples in an existing or new tutorial notebook (please see the doc checklist as well).
I have added a CHANGELOG entry if fixing/changing/adding anything substantial.
For a new feature or a change in behaviour, I have updated the relevant docstrings in the code to describe the feature or behaviour (please see the doc checklist as well).

Out of scope for the PR

No documentation change has been done so far
Adapt plotly data accordingly
timeseries/trendlines

cc: @MarcoGorelli @LiamConnors

FBruzzesi · 2024-10-22T15:52:47Z

@ndrezn with the latest commit, I am able to get pandas performance in the same ballpark as master on my local machine. Could you check whether you can replicate that when you have the time?

MarcoGorelli

very impressive effort, just left some comments on some things i noticed

regarding the overhead, it still replicates on the latest commit if I run the timing test on a kaggle notebook: https://www.kaggle.com/code/marcogorelli/plotly-timings/notebook - looking into it 🕵️

MarcoGorelli · 2024-10-23T08:33:24Z

packages/python/plotly/plotly/express/_core.py

-            if isinstance(argument, Constant) or isinstance(argument, Range):
+            if isinstance(argument, (Constant, Range)):


this is nice, but to keep the diff down, would it make sense to factor out some drive-by changes like this one into a separate precursor PR?
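
For reference, a minimal sketch of why the two forms in this diff are interchangeable. `Constant` and `Range` below are empty stand-in classes defined only for illustration, not the plotly ones, and the two helper functions are hypothetical names.

```python
# Stand-in classes for illustration only (not the plotly classes).
class Constant:
    pass


class Range:
    pass


def check_chained(argument):
    # Original form: two isinstance calls joined with `or`.
    return isinstance(argument, Constant) or isinstance(argument, Range)


def check_tuple(argument):
    # Refactored form: one isinstance call with a tuple of types.
    return isinstance(argument, (Constant, Range))


samples = [Constant(), Range(), "x", 3, None]
# Each pair holds (chained result, tuple result) for one sample.
results = [(check_chained(s), check_tuple(s)) for s in samples]
```

Both helpers agree on every sample, which is why the change is behaviour-preserving and could be split out into a precursor PR.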

emilykl · 2024-10-25T20:26:26Z

packages/python/plotly/plotly/express/_core.py

-import numpy as np
+
+import narwhals.stable.v1 as nw
+from narwhals.dependencies import is_into_series


@FBruzzesi Should these second two import statements also import from the stable namespace? I would expect the import format to be consistent, but not sure if there is some reasoning I'm missing.

Fixed that, although generate_unique_token is not yet exposed there. We have an open issue for it.

emilykl · 2024-10-25T20:46:05Z

packages/python/plotly/plotly/express/_core.py

-def _is_continuous(df, col_name):
-    return df[col_name].dtype.kind in "ifc"
+def _is_continuous(df: nw.DataFrame, col_name: str) -> bool:
+    return df.get_column(col_name).dtype.is_numeric()


@FBruzzesi I think this is subtly different from the current logic in the case of unsigned integer types -- the current logic returns False for unsigned types but this logic returns True. I'm not yet sure whether that makes a difference in practice, but I wanted to flag the difference.

True, but I would tend to treat uint and int in the same way. _is_continuous is probably not the best name; _is_numeric could be a better one.

I will try to come up with an example in which they end up with different plots in master and reason from there
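
A small sketch of the difference flagged above, using plain NumPy dtypes. `is_continuous_old` reproduces the `dtype.kind in "ifc"` check from master; `is_numeric_kind` is a hypothetical stand-in that only approximates a "numeric" check by also accepting the unsigned kind code, and is not the Narwhals implementation of `dtype.is_numeric()`.

```python
import numpy as np


def is_continuous_old(values):
    # Logic on master: signed int ("i"), float ("f") or complex ("c") only.
    return values.dtype.kind in "ifc"


def is_numeric_kind(values):
    # Hypothetical stand-in: additionally accepts unsigned integers ("u").
    return values.dtype.kind in "iufc"


signed = np.array([1, 2, 3], dtype="int64")
unsigned = np.array([1, 2, 3], dtype="uint8")
floats = np.array([1.0, 2.0], dtype="float64")

# The two checks only disagree on the unsigned array.
disagreements = [
    name
    for name, arr in [("signed", signed), ("unsigned", unsigned), ("floats", floats)]
    if is_continuous_old(arr) != is_numeric_kind(arr)
]
```

Under these assumptions, only unsigned integer columns would be classified differently, so an example plot that differs from master would need such a column.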

packages/python/plotly/plotly/express/_core.py

MarcoGorelli

On master, when grouping by some level (e.g. color='color_by'), Plotly currently drops groups where the key is None, whereas in this PR we would still plot those.

master: [screenshot not preserved]

here: [screenshot not preserved]

This is a source of overhead for pandas: iterating over groups with dropna=False is noticeably slower than with the default dropna=True.

In [4]: %timeit dict(pandas_df.groupby(['colorby', 'facetby'], sort=False, dropna=True).__iter__())
173 ms ± 7.45 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [5]: %timeit dict(pandas_df.groupby(['colorby', 'facetby'], sort=False, dropna=False).__iter__())
254 ms ± 20.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

I think we should add such an option in Narwhals: narwhals-dev/narwhals#1257
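
To illustrate the behavioural difference (not the timing), here is a minimal pandas sketch on toy data. The column name `colorby` mirrors the timings above; the data itself is made up for the example.

```python
import pandas as pd

# Toy data with one null group key.
df = pd.DataFrame({"colorby": ["a", None, "b", "a"], "y": [1, 2, 3, 4]})

# Default behaviour: rows whose key is null do not form a group.
keys_dropna = [key for key, _ in df.groupby("colorby", sort=False, dropna=True)]

# dropna=False keeps an additional group for the null key.
keys_keepna = [key for key, _ in df.groupby("colorby", sort=False, dropna=False)]

n_default = len(keys_dropna)
n_keepna = len(keys_keepna)
```

With `dropna=True` the null-keyed row is silently excluded, which matches what master plots; with `dropna=False` it becomes its own group, which is what this PR currently plots.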

Alright, we're making progress: a lot of the overhead is coming from BaseFigure._perform_update - there's a few pandas-specific paths which we're currently skipping in this PR because the pandas objects are wrapped in Narwhals classes:

plotly.py/packages/python/plotly/_plotly_utils/basevalidators.py

Lines 186 to 207 in 960adb9

    
def is_homogeneous_array(v):
    """
    Return whether a value is considered to be a homogeneous array
    """
    np = get_module("numpy", should_load=False)
    pd = get_module("pandas", should_load=False)
    if (
        np
        and isinstance(v, np.ndarray)
        or (pd and isinstance(v, (pd.Series, pd.Index)))
    ):
        return True
    if is_numpy_convertable(v):
        np = get_module("numpy", should_load=True)
        if np:
            v_numpy = np.array(v)
            # v is essentially a scalar and so shouldn't count as an array
            if v_numpy.shape == ():
                return False
            else:
                return True  # v_numpy.dtype.kind in ["u", "i", "f", "M", "U"]
    return False

This should be addressable - studying the code 🔎 , will update in due course
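
A rough sketch of why a wrapped object misses the fast path in the function quoted above. `Wrapped` is a hypothetical wrapper class standing in for any object that holds native data and is only convertible to NumPy; `takes_fast_path` is a simplified version of the first `isinstance` branch, not the plotly function itself.

```python
import numpy as np


class Wrapped:
    # Hypothetical wrapper around native data (illustration only).
    def __init__(self, native):
        self.native = native

    def __array__(self, dtype=None, copy=None):
        # Converting to NumPy builds a new array on every call.
        return np.array(self.native, dtype=dtype)


def takes_fast_path(v):
    # Simplified first branch: a plain type check, no conversion.
    return isinstance(v, np.ndarray)


native = np.arange(5)
wrapped = Wrapped(native)

fast_native = takes_fast_path(native)    # recognised immediately
fast_wrapped = takes_fast_path(wrapped)  # not recognised, falls through
converted = np.array(wrapped)            # the fallback has to convert first
```

Passing the native object instead of the wrapper lets the cheap type check succeed, which is the kind of pandas-specific path being skipped here.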

MarcoGorelli · 2024-10-25T20:35:54Z

packages/python/plotly/plotly/express/_core.py

+    # This is safe since at this point `_compliant_frame` is one of the "full" level
+    # support dataframe(s)


outdated comment 😉

I am ok with removing the comment, but how is it outdated? For InterchangeFrame, the __native_namespace__ method raises NotImplementedError

the code below it does nw.get_native_namespace(df_input), so it's not clear what _compliant_frame refers to (we may rename it in Narwhals, it's just an implementation detail)

and we should probably implement __native_namespace__ for InterchangeFrame

MarcoGorelli · 2024-10-26T10:40:28Z

packages/python/plotly/plotly/express/_core.py

+            # ```
+            # However we cannot do that just yet, therefore a workaround is provided
+            agg_f[args["color"]] = nw.col(args["color"]).max()
+            agg_f[f'{args["color"]}__plotly_n_unique__'] = (


should we use generate_unique_token here?

Just to make sure I get this right: do you mean f'{args["color"]}' + <token> or just <token>?
The former would keep the workaround about as convenient as it is now, but uniqueness is not guaranteed, since the token is only compared against the existing columns, not against the combined name f'{args["color"]}' + <token>. The latter would be a considerable additional headache: we repeat the same workaround at line 1882, so we would have to keep track of all the tokens added so far and of the column each one refers to (since they come in pairs).

We can go really specific in the suffix info (e.g. __plotly_process_dataframe_hierarchy_discrete_agg_n_unique__), hopefully no one uses a suffix like this one 😁
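
As a sketch of a middle ground between the two options, the combined name itself can be checked against the existing columns. `unique_column_name` is a hypothetical helper written for this example; it is not `generate_unique_token` from Narwhals and makes no claim about how that function works.

```python
def unique_column_name(existing, base, suffix="__plotly_n_unique__"):
    # Hypothetical helper: start from base + suffix and append a counter
    # until the candidate does not collide with an existing column name.
    candidate = f"{base}{suffix}"
    counter = 0
    while candidate in existing:
        counter += 1
        candidate = f"{base}{suffix}{counter}"
    return candidate


# A frame that already contains the "obvious" suffixed name.
columns = ["color", "value", "color__plotly_n_unique__"]
name = unique_column_name(columns, "color")

# No collision: the plain suffixed name is used as-is.
plain = unique_column_name(["color", "value"], "color")
```

Because the name is still derived from the original column, the pairing between a column and its helper column stays recoverable without extra bookkeeping.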

MarcoGorelli · 2024-10-26T19:10:33Z

Update on performance from some work on some branches:

timed on a kaggle notebook:

master:
scatter_polars,0.586423233000005
bar_polars,1.5854483000000528
scatter_pandas,0.5920868089999658
bar_pandas,1.620950485000094

plotly-with-narwhals (with Narwhals 1.11.0, and FBruzzesi#1, right at the bottom of the notebook):
scatter_polars,0.20539162399995803
bar_polars,1.159081075999893
scatter_pandas,0.5411779630001092
bar_pandas,1.5852239850000842

I'm going to work on cleaning this all up now, but in summary, the main sources of overhead were:

pandas making unnecessary copies for some operations (rename, reset_index), which we're now being more careful about. thanks for having highlighted this area of improvement which helped us find these!
some pandas-specific paths being missed in BaseFigure._perform_update, meaning that we were repeatedly converting to numpy (and sometimes then making additional copies) unnecessarily

FBruzzesi added 30 commits September 28, 2024 19:08

non core changes

9873e97

_core overhaul

0389591

some _core fixes

ba93236

tests replace sort_index(axis=1)

421fc1d

reset_index in concat and allow any object to pandas

ca5c820

trendline prep

a6aab24

WIP Index

7665f10

clean from breakpoints

ec4f250

some tests fix

7e0d4c2

hotfix and tests output to pandas

5543638

FIX: columns never as index

cd0dab7

getting there with the tests

f334b32

get_column instead of pandas slicing, unix to seconds

e5eb949

bump narhwals, hierarchy fastpath

7747e30

fix to_unindexed_series

ac00b36

fix trendline

da80c5b

rm numpy dep in _core

8a72ba1

fix: _check_dataframe_all_leaves

aeff203

(maybe) fix to_unindexed_series

2041bef

(maybe) fix to_unindexed_series

71473f1

started tests with constructor

9f74c38

added constructor to all tests

28587c9

added some comments for fixme

1bb2448

to_py_scalar and more tests

f45addf

dealing with exceptions and tests

5341759

bump version, sort(...,nulls_last=True)

dfc957c

We did it: no more dups in group by :D

90f2667

concat_str

fb58d1b

fix test_several_dataframes

ddb3b35

dedups customdata

37ce302

Merge branch 'master' into plotly-with-narwhals

1fa9fe4

FBruzzesi added 2 commits October 22, 2024 18:02

revert test

0103aa6

Merge branch 'plotly-with-narwhals' of https://github.com/FBruzzesi/plotly.py into plotly-with-narwhals

673d141

MarcoGorelli reviewed Oct 23, 2024

packages/python/plotly/plotly/express/_core.py

FBruzzesi and others added 2 commits October 23, 2024 18:43

feedback adjustments

b858ed8

Merge branch 'master' into plotly-with-narwhals

bbcf438

emilykl reviewed Oct 24, 2024

packages/python/plotly/plotly/tests/test_optional/test_px/conftest.py

MarcoGorelli reviewed Oct 24, 2024

packages/python/plotly/plotly/express/_core.py

FBruzzesi and others added 5 commits October 25, 2024 08:44

raise if numpy is missing, conftest fix, typo

49efae2

__plotly_n_unique__

a36bc24

Merge branch 'master' into plotly-with-narwhals

c119153

format

7416407

format

1867f6f

ndrezn mentioned this pull request Oct 25, 2024

Explore dropping Pandas requirement for Plotly Express #4834

Open

emilykl reviewed Oct 25, 2024

packages/python/plotly/plotly/express/_core.py

MarcoGorelli reviewed Oct 26, 2024

MarcoGorelli mentioned this pull request Oct 26, 2024

feat(?): drop_null_keys in group_by? narwhals-dev/narwhals#1256

Closed

feedback adjustments

d3a28c0

MarcoGorelli mentioned this pull request Oct 27, 2024

Is v = np.array(v.dt.to_pydatetime()) still necessary? #4836

Open

MarcoGorelli and others added 5 commits October 27, 2024 15:51

use drop_null_keys, some pandas fastpaths

e6e9994

bump narwhals version

64b8c70

some improvements by Marco

3f6b383

format and pyspark path

755aea8

add narwhals to requirements core

6f18021
