MathFeatures seems much slower than pandas.sum() #576

solegalli · 2022-12-12T17:35:06Z

Need to check what is going on and fix

Morgan-Sell · 2023-04-13T12:50:57Z

hi @solegalli,

Were you able to pinpoint the root cause of this issue?

solegalli · 2023-04-20T14:40:22Z

No. Didn't have time to check.

olikra · 2024-06-05T20:44:25Z

"seems much slower" Do we know how much slower MathFeatures is? If desired I can do some checks and report back...

solegalli · 2024-06-06T08:40:10Z

I don't know exactly how much slower it is. It would be great if you could do the checks @olikra !

olikra · 2024-06-06T10:16:39Z

@solegalli
Welcome - Happy to help

I suggest following approach concentrating on the 4 basic arithmetics (add,subtract,divide,multiply):

Use of latest versions of numpy/pandas/feature-engine
Generating a CSV-File with two columns (float-numbers) and 100.000 rows for having same baseline for each module
Load the data upfront
repeating each arithmetic (result in a third column) 10 times on numpy,pandas/feature-engine and measure via timeit isolated calculation time and average it
build some graphics and discuss the ground truth

In a second step we could do further calculation on the two columns itself with (log, sin,...)

solegalli · 2024-06-06T11:26:43Z

Sounds good @olikra

olikra · 2024-06-08T11:57:54Z

@solegalli Did some investigation:

1. Performance Measurement Approach
For reproducable results I created a dataframe with 2 features (floating numbers) and 1.000.000 samples:

                 A            B
0       4083.179436  1029.857437
1       7451.928885  5945.188317
2       3669.416751  4133.482993
3       2288.967806  2333.608841
4       8108.977405  6750.933386
...             ...          ...
999995   623.329330  8458.191033
999996  1048.666685  1992.428440
999997   677.435021  3372.229829
999998  1934.224681  9202.690288
999999  4376.775617  5361.415726
[1000000 rows x 2 columns]>

A script reads in the testdata (looping from 100.000 to 1.000.000 in 10.000 increments) and doing the math 50 times to get better averages.

For pandas i used this line of code:
df['C'] = df['A'] + df['B']

For numpy:
df = np.column_stack((df, (df[:, 0] + df[:, 1])[:, None]))

for feature-engine math:

transformer = MathFeatures(variables=["A", "B"], func='sum', )
df = transformer.fit_transform(df)

Probably, there are more performant solutions for pandas and numpy. But it's just about the comparison to feature-engine math!

2. Results

For the 500.000 record block the runtime measured via time.time_ns() directly before and after the 3 line of code:

Runtime over all record-blocks:

Luckily the runtime is linear to number of records!

I stepped into the feature-engine-math and measured the time for each codeline with transform and fit + transform_fit:

I assume the issue is in the transform part :-)

I'm not familiar with profiling the code, but used cProfile for a first look inside:

profiler.enable()
mf.transform(df)
profiler.disable()

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        3    0.056    0.019    0.124    0.041 ../python3.11/site-packages/pandas/core/frame.py:11435(_reduce)
       12    0.023    0.002    0.023    0.002 ../python3.11/site-packages/pandas/core/array_algos/take.py:120(_take_nd_ndarray)
        7    0.023    0.003    0.023    0.003 ../python3.11/site-packages/pandas/core/internals/managers.py:180(blknos)
        1    0.017    0.017    0.023    0.023 .../python3.11/site-packages/pandas/core/nanops.py:1499(_maybe_null_out)
       25    0.012    0.000    0.012    0.000 {method 'reduce' of 'numpy.ufunc' objects}
       10    0.012    0.001    0.012    0.001 {method 'take' of 'numpy.ndarray' objects}
        4    0.010    0.002    0.010    0.002 {method 'copy' of 'numpy.ndarray' objects}
        2    0.003    0.001    0.003    0.001 ../python3.11/site-packages/pandas/core/dtypes/missing.py:261(_isna_array)

Not sure if above output is an indicator for the issue.

Regards

Addition:
Following versions were used:
pandas: 2.2.2
numpy: 1.26.4
feature-engine: 1.8.0

Performance measurement was done on a Debian system exclusive - no other tasks like mariadb,nginx,... influenced the run

olikra · 2024-06-09T14:39:05Z

It crossed my mind: Of course I have to test the pandas.agg() function similarly. Step into it as soon as posible...

olikra · 2024-06-10T06:51:29Z

Here we go:

Added two additional functions to the performance measurement

pandas.agg

variables = ['A', 'B']
df['C'] = df[variables].agg('sum', axis=1)

pandas.sum
df['C'] = df.loc[:, ['A', 'B']].sum(axis=1)

Performance for the 500.000 records run:

Performance over all records (100.000 records to 1.000.000 records

My conclusion: There is no real performance issue in the feature-engine.agg function. It is influenced by the underlaying pandas.agg function. The difference is just the additional wrapping in the feature-engine implementation of pandas.agg

solegalli · 2024-06-10T07:25:30Z

Thank you @olikra !! This is really useful.

So, by the looks of it, we'd have to replace the transform logic, stepping away from pandas agg, and probably replacing with numpy.

Yes, the problem is during transform, where the feature creation/combination takes place.

It would be great to have numpy sum in the comparison, to see if this makes a better solution: https://numpy.org/doc/stable/reference/generated/numpy.sum.html

Would you be able to add that one too in your comparison? and then, i guess we can modify the logic to replace the functions with those of numpy

olikra · 2024-06-10T15:14:50Z

@solegalli added numpy.sum to the stack:
numpy.sum

starttime = time.time_ns()
df = np.append(df, np.sum(df, axis=1, keepdims=True,), axis=1)
enddtime = time.time_ns() - starttime

Performance for the 500.000 record float based run:

Note: To check for correctness of all computations I calculated a checksum (sum of floats in all features/samples) after each run. For the 500.000 record-runs it looks so:

The checksum for the pandas-based calculations is the same- but the numpy-based ones are a little bit different. I assume the cause is the different handling of floats in numpy and pandas. Not sure, if this can be relevant for existing models when the calculation changes from pandas to numpy.

Cause the measurement is already setup I did the same with integers:
Performance for the 500.000 record run

So no big differences from Float to Integer.

solegalli · 2024-06-11T08:27:09Z

Thank you so much @olikra ! This is great! We need to move from pandas to numpy then. Would you be also up for trying to change the logic of this transformer?

olikra · 2024-06-11T09:24:26Z

@solegalli I can give it a try. Need to get more familiar with feature-engine and the coding-style, test-style. Until now it was a blackbox-action for me.

So the plan is to replace the generic pandas.agg functionality by native numpy functionality, where a native numpy functionality is available at this code-segment:

MathFeatures.transform(self, X)

 if len(new_variable_names) == 1:
         X[new_variable_names[0]] = X[self.variables].agg(self.func, axis=1)
     else:
         X[new_variable_names] = X[self.variables].agg(self.func, axis=1)

solegalli · 2024-06-11T15:22:43Z

Yes @olikra that is the bit to be replaced. Not sure there is a numpy equivalent of pandas.agg, if there is, that would be the simplest, if not, we need to break it down function by function :/

Hopefully, you wouldn't have to change the tests much.

olikra · 2024-06-12T11:40:02Z

I did some analytics how pandas implemented the pandas.agg function to avoid trapping in pitfalls implementing the numpy-functionality.

I used a simple dataframe for the pandas.agg sum function:

ref = pd.DataFrame.from_dict(
        {
            "A": [20, 21, 19, 18],
            "B": [0.9, 0.8, 0.7, 0.6],
            "C": [5.9, 0.8, 4.7, 0.6],
        }
    )

In the deeper code-sections of pandas, it seems they iterate over each row/sample and doing the math:

olikra · 2024-06-13T14:19:01Z

@solegalli Let me present a first idea how the migration from pandas.agg to numpy in .transform could be done. We have to keep in mind, that already existing implementations of MathFeatures (in your courses and in the rest of the world) have to work in the same way as before migration!

sidenote: The "func function, str, list or dict" in pandas.agg with the endless combinations does make things not easier... :-)

Overall solution
I assume the most used way to call the transform with functions, is to submit the native keywords like 'sum', 'mean', 'min', 'max', ... .

We have to iterate over the func-list, cause I found no numpy equivalent of pandas.agg. So the func-list is handled like a stack and we use a dictionary to do the calculation for each element in the stack.

In this example func will get following values:

func = ['mean', 'sum', 'np.log', 'convert_kelvin_to_celsius']

'np.log' is a call to numpy, because we haven't provide it in the following dictionary.
'convert_kelvin_to_celsius' is a domain-specific complex calculation.

functiondictionary = {
        "mean": getattr(np, "nanmean"),
        "sum": getattr(np, "nansum"),
        "min": getattr(np, "nanmin"),
        "max": getattr(np, "nanmax"),
        "prod": getattr(np, "nanprod"),
        "median": getattr(np, "median"),
        "std": getattr(np, "nanstd"),
        "var": getattr(np, "nanvar")}

Inside the iterator the calculation is done by this (surrounded code not shown):

df_nan = df_nan[stack_element] = functiondictionary[stack_element](df_nan[variables], axis=1, )
remove stack_element from stack

df_nan is populated with the 'mean' and 'sum' values and the stack looks this way:

func = ['np.log', 'convert_kelvin_to_celsius']

Unfortunatley we have to call these two functions with the pandas.agg. If we could find a way to identify a custom-function (e.g. by an decorator) we could call the custom-function by:

df_nan['convert_kelvin_to_celsius'] = np.apply_along_axis(convert_kelvin_to_celsius, axis=1, arr=df_nan[variables])

Specific issues with 'std' and 'var'

Pandas uses different default-values e.g. for ddof. Using following code:

df_nan = pd.DataFrame.from_dict(
    {
        "A": [20, 21, 23, 12],
        "B": [10, 10, 10, 10],
    }
)
variables = ['A', 'B',]

df_nan['std_pandas_np'] = df_nan[variables].agg(np.nanstd, axis=1)
df_nan['std_numpy_ddof1'] = np.nanstd(df_nan[variables], axis=1, ddof=1 )
df_nan['std_numpy_ddof0'] = np.nanstd(df_nan[variables], axis=1, )

results in this:

Pandas uses ddof = 1 as default and numpy uses ddof = 0. Same for 'var'

Specific issues with Nan

Having some Nan's in the array:

df_nan = pd.DataFrame.from_dict(
    {
        "A": [20, 21, 23, 12],
        "B": ['N/A', 12, 'N/A', 'N/A',],
        "C": [10, 10, 10, 10],
    }
)
df_nan.replace('N/A', np.NaN, inplace=True)
df_nan['var_pandas_np'] = df_nan[variables].agg(np.nanvar, axis=1)
df_nan['var_numpyddof1'] = np.nanvar(df_nan[variables], axis=1,ddof=1)
df_nan['var_numpyddof0'] = np.nanvar(df_nan[variables], axis=1,)

results in this:

Open question is, how we handle different default values against the background that MathFeatures is already in use.
What do you think about the general direction/approach?

solegalli · 2024-06-14T07:56:12Z

Hey @olikra thank you so much! This is an incredible amount of work.

I think you are on the right track. The main thing is not to break backwards compatibility in terms of the functionality. That means, that the user can still instantiate the transformer as usual to get the results they expect. So in principle, we should try and implement as many of the pandas supported methods as possible.

Then, I think that using ddof 1 or 0 is a minor detail, but if we want to be absolutely strict with backward compatibility we can enforce numpy to use ddof 1 as well.

Is the idea to use numpy whenever possible and pandas agg for pandas functions like the convert_kelvin_to_celsius? How many functions are there? did you find a list? If this is the approach, it will accelerate the transformer quite a bit, because I presume that most users would use the standard functions like sum, mean etc and those are supported by numpy. Then, depending on how many extra functions are supported by agg, we can choose to reimplement them in numpy (if they are just a few) or the easiest would be to default to pandas. The latter has the advantage that if/when pandas releases another function, it will be available for MathFeatures by default, whereas if we re-code them ourselves, every time pandas makes an update, we need to update it from our side.

olikra · 2024-06-15T14:26:08Z

@solegalli
1. Pandas List of functions
I digged into the pandas code but found no overview, what functions are supported by pandas.agg(). It’s quite splattered in the code. In the core component ist a list, but this is used by the plot-function:

_cython_table = {
builtins.sum: "sum",
builtins.max: "max",
builtins.min: "min",
np.all: "all",
np.any: "any",
np.sum: "sum",
np.nansum: "sum",
np.mean: "mean",
np.nanmean: "mean",
np.prod: "prod",
np.nanprod: "prod",
np.std: "std",
np.nanstd: "std",
np.var: "var",
np.nanvar: "var",
np.median: "median",
np.nanmedian: "median",
np.max: "max",
np.nanmax: "max",
np.min: "min",
np.nanmin: "min",
np.cumprod: "cumprod",
np.nancumprod: "cumprod",
np.cumsum: "cumsum",
np.nancumsum: "cumsum",
}

2. Backwards compatibility
Sure this is the most important thing.

On my Jupiter-Notebook I created/installed a first version of a boosted-transform method. I implemented the direct usage of numpy.nan* for following functions-calls:

calculations = ('mean', 'sum', 'min', 'max', 'prod', 'median', 'std', 'var')

During exercising your training-courses I will check/adjust it and do some fine-tuning.

3. Performance
Having the transform (with pandas.agg() ) and boosted-transform ( with numpy.nan*) by hand, i did a performance test with 100.000 records and two columns with floats.

transform (with pandas.agg() ) - Numbers in Seconds

boosted-transform ( with numpy.nan_) numbers in Millisecond

4. Test of transform
From my perspective, actually there is no test in place for calling the custom functions like convert_kelvin_to_celsius(). We have to test this anyway, when we call a custom-function via the boosted-transform. Should I implement one?

olikra · 2024-06-16T11:28:38Z

@solegalli
Update
Switched on local side for these standard-calculations:

('mean','sum', 'min', 'max', 'prod', 'median', 'std', 'var')

from pandas.agg to numpy.

32 Tests of 33 already passed...

Will look into it tomorrow evening...

olikra · 2024-06-21T06:11:45Z

@solegalli Math-Feature-conversion is done so far. I assume some parts need discussion. Do you prefer discussion in this issue or in the pull request?

olikra · 2024-06-25T08:13:50Z

I have uploaded a pr (#774)

solegalli · 2024-07-03T12:02:34Z

We discuss the code changes in the PR. I added a few comments.

solegalli added urgent urgent attention needed priority need to be looked at next labels Jan 14, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

MathFeatures seems much slower than pandas.sum() #576

MathFeatures seems much slower than pandas.sum() #576

solegalli commented Dec 12, 2022

Morgan-Sell commented Apr 13, 2023

solegalli commented Apr 20, 2023

olikra commented Jun 5, 2024

solegalli commented Jun 6, 2024

olikra commented Jun 6, 2024

solegalli commented Jun 6, 2024

olikra commented Jun 8, 2024 •

edited

Loading

olikra commented Jun 9, 2024

olikra commented Jun 10, 2024 •

edited

Loading

solegalli commented Jun 10, 2024

olikra commented Jun 10, 2024 •

edited

Loading

solegalli commented Jun 11, 2024

olikra commented Jun 11, 2024 •

edited

Loading

solegalli commented Jun 11, 2024

olikra commented Jun 12, 2024

olikra commented Jun 13, 2024

solegalli commented Jun 14, 2024

olikra commented Jun 15, 2024 •

edited

Loading

olikra commented Jun 16, 2024

olikra commented Jun 21, 2024

olikra commented Jun 25, 2024

solegalli commented Jul 3, 2024

MathFeatures seems much slower than pandas.sum() #576

MathFeatures seems much slower than pandas.sum() #576

Comments

solegalli commented Dec 12, 2022

Morgan-Sell commented Apr 13, 2023

solegalli commented Apr 20, 2023

olikra commented Jun 5, 2024

solegalli commented Jun 6, 2024

olikra commented Jun 6, 2024

solegalli commented Jun 6, 2024

olikra commented Jun 8, 2024 • edited Loading

olikra commented Jun 9, 2024

olikra commented Jun 10, 2024 • edited Loading

solegalli commented Jun 10, 2024

olikra commented Jun 10, 2024 • edited Loading

solegalli commented Jun 11, 2024

olikra commented Jun 11, 2024 • edited Loading

solegalli commented Jun 11, 2024

olikra commented Jun 12, 2024

olikra commented Jun 13, 2024

solegalli commented Jun 14, 2024

olikra commented Jun 15, 2024 • edited Loading

olikra commented Jun 16, 2024

olikra commented Jun 21, 2024

olikra commented Jun 25, 2024

solegalli commented Jul 3, 2024

olikra commented Jun 8, 2024 •

edited

Loading

olikra commented Jun 10, 2024 •

edited

Loading

olikra commented Jun 10, 2024 •

edited

Loading

olikra commented Jun 11, 2024 •

edited

Loading

olikra commented Jun 15, 2024 •

edited

Loading