Optimize array-from-ctypes in basic.py #3927
Conversation
Approximately 80% of runtime when loading "low column count, high row count" DataFrames into Datasets is consumed in `np.fromiter`, called as part of the `Dataset.get_field` method. This is a particularly pernicious hotspot: unlike other ctypes-based methods, it is a hot loop over a Python iterator and causes significant GIL contention in multi-threaded applications. Replace `np.fromiter` with a direct call to `np.ctypeslib.as_array`, which allows a single-shot `copy` of the underlying array. This reduces the load time of a ~35 million row categorical dataframe with 1 column from ~5 seconds to ~1 second, and allows multi-threaded execution.
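For orientation, a minimal sketch of the before/after pattern (the helper names, the pointer argument, and the `float64` element type are illustrative assumptions, not the exact `basic.py` diff):

```python
import numpy as np

# `cptr` is assumed to be a ctypes pointer instance (e.g. a
# ctypes.POINTER(ctypes.c_double)) addressing `length` valid elements
# returned by the LightGBM C API.


def copy_via_fromiter(cptr, length):
    # Old path: np.fromiter pulls one element at a time through the
    # Python iterator protocol, holding the GIL for every element.
    return np.fromiter(cptr, dtype=np.float64, count=length)


def copy_via_as_array(cptr, length):
    # New path: wrap the C buffer as a zero-copy ndarray view, then take
    # one bulk copy so the result owns its own memory.
    return np.ctypeslib.as_array(cptr, shape=(length,)).copy()
```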
Thanks so much for taking the time to contribute @asford! I think we'll need a little time to research this and might have some follow-up questions, but the benchmarking graphic and the fact that all the tests are passing are really encouraging.
My pleasure, thanks for all the effort on y'all's part maintaining the tool. Not sure what would be useful from a review perspective, but this is actually quite easy to evaluate by monkeypatching the associated functions. If it'd be useful, I'd be happy to set up a Colab or Binder with the head-to-head demo, or to set up a pytest-benchmark test case between the two code paths.
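A hedged sketch of what such a pytest-benchmark comparison could look like (the buffer setup and test names here are hypothetical, not code from this PR):

```python
import ctypes

import numpy as np

N = 10_000_000
_buf = (ctypes.c_double * N)()  # zero-filled C buffer standing in for LightGBM output
_ptr = ctypes.cast(_buf, ctypes.POINTER(ctypes.c_double))


def test_fromiter(benchmark):
    # Old code path: per-element iteration under the GIL.
    benchmark(np.fromiter, _ptr, dtype=np.float64, count=N)


def test_as_array_copy(benchmark):
    # New code path: single bulk copy of the wrapped buffer.
    benchmark(lambda: np.ctypeslib.as_array(_ptr, shape=(N,)).copy())
```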
Will the copy increase the memory usage?
The existing `np.fromiter` implementation already copies the data into a freshly allocated numpy array, so this change should not increase memory usage.

There could be a minor reduction in memory consumption by making this return a view, rather than a copy, of the ctypes pointer, but this opens a huge can of worms w.r.t. the lifetime of the underlying buffer. I've no idea how LightGBM's internal memory management strategy is structured, but it's far safer to perform the copy so that the standard memory management semantics for numpy are obeyed (i.e. the array contents are valid for the lifetime of the ndarray object).

As you can see from the second, post-fix profiling result, there are a lot of big existing gains to be had in tuning the "Dataset from DataFrame" story. For example, the current categorical normalization performs at least one full-size copy of the source dataframe and a slow single-threaded normalization pass.

@jameslamb given your work with dask dataframes you may find this strategy useful. I was seeing some GME-circa-January-'21 gains in dataset construction performance. 🚀

Code snippet for threaded, multi-column Dataset construction:

```python
from typing import Callable

import dask
import lightgbm as lgb
import pandas as pd


def merge_dataset(target: lgb.Dataset, other: lgb.Dataset) -> lgb.Dataset:
    """In-place merge of lightgbm dataset features.

    Args:
        target: Dataset, updated in place.
        other: Dataset, features added to target.

    Returns:
        In-place updated target.
    """
    # The C-level feature data is merged by LightGBM via "add_features_from".
    # pandas-categorical metadata, needed to ensure proper cat encoding,
    # is stored on the python-level object and must be manually merged.
    if target.pandas_categorical or other.pandas_categorical:
        pandas_categorical = (
            target.pandas_categorical if target.pandas_categorical else []
        ) + (other.pandas_categorical if other.pandas_categorical else [])
    else:
        pandas_categorical = None

    # Shift the incoming categorical column indices past the features
    # already present in the target.
    categorical_column = target.params.get("categorical_column", []) + [
        c + target.num_feature() for c in other.params.get("categorical_column", [])
    ]

    target.add_features_from(other)
    target.pandas_categorical = pandas_categorical
    if categorical_column:
        target.params["categorical_column"] = categorical_column

    return target


def construct_dataset_threaded(
    df: pd.DataFrame, pre_apply: Callable[[pd.DataFrame], pd.DataFrame] = lambda df: df
) -> lgb.Dataset:
    """Multi-threaded Dataset setup.

    Args:
        df: Target dataframe, constructed multi-threaded on a per-column basis.
        pre_apply: Pre-process a single-column frame before the Dataset constructor.

    Returns:
        Dataset, with feature-per-column.
    """

    @dask.delayed
    def d_construct(col_df: pd.DataFrame) -> lgb.Dataset:
        return lgb.Dataset(pre_apply(col_df)).construct()

    with dask.config.set(scheduler="threads"):
        (d_lgb_datasets,) = dask.persist([d_construct(df[[c]]) for c in df.columns])

    dataset = d_lgb_datasets[0].compute()
    for d_other in d_lgb_datasets[1:]:
        merge_dataset(dataset, d_other.compute())

    return dataset
```
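For illustration, a hypothetical call using the helpers above (the column names and data are made up; categorical columns must already have pandas `category` dtype):

```python
import pandas as pd

df = pd.DataFrame(
    {
        "cat_feature": pd.Categorical(["a", "b", "a", "c"]),
        "num_feature": [0.1, 0.2, 0.3, 0.4],
    }
)

# Each column becomes its own single-feature Dataset on a dask thread
# pool; the pieces are then merged feature-wise into one Dataset.
dataset = construct_dataset_threaded(df)
```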
@asford Thanks a lot for your explanation! For reference, see the source code of `np.ctypeslib.as_array`. The only thing I'm afraid of is that we are losing the explicitly requested dtype.
As far as I can tell, the requested types in these functions were probably there to work around the fact that `np.fromiter` requires an explicit dtype. As you mention, that dtype is no longer spelled out, but `np.ctypeslib.as_array` infers the element type directly from the ctypes pointer, so the resulting array dtype is unchanged. The pull is set to "Allow edits by maintainers". I'm totally fine with edits here to match the expectations of the project; I will likely not be online to update this pull for several days.
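A small illustrative check of that dtype behavior (standalone numpy, not code from the PR; the buffer is a stand-in for a C API result):

```python
import ctypes

import numpy as np

# A ctypes float buffer standing in for data returned by the C API.
buf = (ctypes.c_float * 4)(1.0, 2.0, 3.0, 4.0)

# as_array derives the dtype from the ctypes element type ...
arr = np.ctypeslib.as_array(buf)
assert arr.dtype == np.float32

# ... whereas np.fromiter must be told the dtype up front.
also = np.fromiter(buf, dtype=np.float32, count=4)
assert (arr == also).all()
```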
LGTM based on great comments from @asford !
great contribution, thanks so much for the time and effort you put into improving LightGBM!
Thank you!