Optimize array-from-ctypes in basic.py #3927
Conversation
Approximately 80% of runtime when loading "low column count, high row count" DataFrames into Datasets is consumed in `np.fromiter`, called as part of the `Dataset.get_field` method. This is a particularly pernicious hotspot: unlike other ctypes-based methods, it is a hot loop over a Python iterator and causes significant GIL contention in multi-threaded applications. Replace `np.fromiter` with a direct call to `np.ctypeslib.as_array`, which allows a single-shot `copy` of the underlying array. This reduces the load time of a ~35 million row categorical dataframe with 1 column from ~5 seconds to ~1 second, and allows multi-threaded execution.
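For orientation, a minimal sketch of the before/after pattern (the helper names, the pointer argument, and the `float64` element type are illustrative assumptions, not the exact `basic.py` diff):

```python
import numpy as np

# `cptr` is assumed to be a ctypes pointer instance (e.g. a
# ctypes.POINTER(ctypes.c_double)) addressing `length` valid elements
# returned by the LightGBM C API.


def copy_via_fromiter(cptr, length):
    # Old path: np.fromiter pulls one element at a time through the
    # Python iterator protocol, holding the GIL for every element.
    return np.fromiter(cptr, dtype=np.float64, count=length)


def copy_via_as_array(cptr, length):
    # New path: wrap the C buffer as a zero-copy ndarray view, then take
    # one bulk copy so the result owns its own memory.
    return np.ctypeslib.as_array(cptr, shape=(length,)).copy()
```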
Thanks so much for taking the time to contribute @asford! I think we'll need a little time to research this and might have some follow-up questions, but the benchmarking graphic and the fact that all the tests are passing are really encouraging.
My pleasure, thanks for all the effort on y'all's part maintaining the tool. Not sure what would be useful from a review perspective, but this is actually quite easy to evaluate by monkeypatching the associated functions. If it'd be useful, I'd be happy to set up a Colab or Binder with the head-to-head demo, or to set up a pytest-benchmark test case between the two code paths.
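A hedged sketch of what such a pytest-benchmark comparison could look like (the buffer setup and test names here are hypothetical, not code from this PR):

```python
import ctypes

import numpy as np

N = 10_000_000
_buf = (ctypes.c_double * N)()  # zero-filled C buffer standing in for LightGBM output
_ptr = ctypes.cast(_buf, ctypes.POINTER(ctypes.c_double))


def test_fromiter(benchmark):
    # Old code path: per-element iteration under the GIL.
    benchmark(np.fromiter, _ptr, dtype=np.float64, count=N)


def test_as_array_copy(benchmark):
    # New code path: single bulk copy of the wrapped buffer.
    benchmark(lambda: np.ctypeslib.as_array(_ptr, shape=(N,)).copy())
```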
Will the copy increase the memory usage?
The existing `np.fromiter` implementation already copies the data into a freshly allocated numpy array, so this change should not increase memory usage.

There could be a minor reduction in memory consumption by making this return a view, rather than a copy, of the ctypes pointer, but this opens a huge can of worms w.r.t. the lifetime of the underlying buffer. I've no idea how LightGBM's internal memory management strategy is structured, but it's far safer to perform the copy so that the standard memory management semantics for numpy are obeyed (i.e. the array contents are valid for the lifetime of the ndarray object).

As you can see from the second, post-fix profiling result, there are a lot of big existing gains to be had in tuning the "Dataset from DataFrame" story. For example, the current categorical normalization performs at least one full-size copy of the source dataframe and a slow single-threaded normalization pass.

@jameslamb given your work with dask dataframes you may find this strategy useful. I was seeing some GME-circa-January-'21 gains in dataset construction performance. 🚀

Code snippet for threaded, multi-column Dataset construction:

```python
from typing import Callable

import dask
import lightgbm as lgb
import pandas as pd


def merge_dataset(target: lgb.Dataset, other: lgb.Dataset) -> lgb.Dataset:
    """In-place merge of lightgbm dataset features.

    Args:
        target: Dataset, updated in place.
        other: Dataset, features added to target.

    Returns:
        In-place updated target.
    """
    # The C-level feature data is merged by LightGBM via "add_features_from".
    # pandas-categorical metadata, needed to ensure proper cat encoding,
    # is stored on the python-level object and must be manually merged.
    if target.pandas_categorical or other.pandas_categorical:
        pandas_categorical = (
            target.pandas_categorical if target.pandas_categorical else []
        ) + (other.pandas_categorical if other.pandas_categorical else [])
    else:
        pandas_categorical = None

    # Shift the incoming categorical column indices past the features
    # already present in the target.
    categorical_column = target.params.get("categorical_column", []) + [
        c + target.num_feature() for c in other.params.get("categorical_column", [])
    ]

    target.add_features_from(other)
    target.pandas_categorical = pandas_categorical
    if categorical_column:
        target.params["categorical_column"] = categorical_column

    return target


def construct_dataset_threaded(
    df: pd.DataFrame, pre_apply: Callable[[pd.DataFrame], pd.DataFrame] = lambda df: df
) -> lgb.Dataset:
    """Multi-threaded Dataset setup.

    Args:
        df: Target dataframe, constructed multi-threaded on a per-column basis.
        pre_apply: Pre-process a single-column frame before the Dataset constructor.

    Returns:
        Dataset, with feature-per-column.
    """

    @dask.delayed
    def d_construct(col_df: pd.DataFrame) -> lgb.Dataset:
        return lgb.Dataset(pre_apply(col_df)).construct()

    with dask.config.set(scheduler="threads"):
        (d_lgb_datasets,) = dask.persist([d_construct(df[[c]]) for c in df.columns])

    dataset = d_lgb_datasets[0].compute()
    for d_other in d_lgb_datasets[1:]:
        merge_dataset(dataset, d_other.compute())

    return dataset
```
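For illustration, a hypothetical call using the helpers above (the column names and data are made up; categorical columns must already have pandas `category` dtype):

```python
import pandas as pd

df = pd.DataFrame(
    {
        "cat_feature": pd.Categorical(["a", "b", "a", "c"]),
        "num_feature": [0.1, 0.2, 0.3, 0.4],
    }
)

# Each column becomes its own single-feature Dataset on a dask thread
# pool; the pieces are then merged feature-wise into one Dataset.
dataset = construct_dataset_threaded(df)
```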
@asford Thanks a lot for your explanation! For reference, see the source code of `np.ctypeslib.as_array`. The only thing I'm afraid of is that we are losing the explicitly requested dtype.
As far as I can tell, the requested types in these functions were probably there to work around the fact that `np.fromiter` requires an explicit dtype. As you mention, that dtype is no longer spelled out, but `np.ctypeslib.as_array` infers the element type directly from the ctypes pointer, so the resulting array dtype is unchanged. The pull is set to "Allow edits by maintainers". I'm totally fine with edits here to match the expectations of the project; I will likely not be online to update this pull for several days.
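A small illustrative check of that dtype behavior (standalone numpy, not code from the PR; the buffer is a stand-in for a C API result):

```python
import ctypes

import numpy as np

# A ctypes float buffer standing in for data returned by the C API.
buf = (ctypes.c_float * 4)(1.0, 2.0, 3.0, 4.0)

# as_array derives the dtype from the ctypes element type ...
arr = np.ctypeslib.as_array(buf)
assert arr.dtype == np.float32

# ... whereas np.fromiter must be told the dtype up front.
also = np.fromiter(buf, dtype=np.float32, count=4)
assert (arr == also).all()
```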
LGTM based on great comments from @asford !
great contribution, thanks so much for the time and effort you put into improving LightGBM!
Thank you!