[Python] How should I determine categorical features #5932

Closed
TopoKunst opened this issue Jun 15, 2023 · 6 comments

@TopoKunst

Version: [Python 3.7+] [lightgbm 2.1.2]
Thank you for sharing this great work. I have several questions about handling categorical features which have really confused me for a long time.
Currently, I pass a pandas DataFrame to lightgbm either with

# CATEGORICAL_COLS is a list of categorical column names
X[CATEGORICAL_COLS] = X[CATEGORICAL_COLS].astype("category")
gbm = LGBMClassifier(objective="binary", n_estimators=10, max_depth=5, random_state=42)
gbm.fit(X_train, y_train)

or

X[CATEGORICAL_COLS] = X[CATEGORICAL_COLS].astype("category")
lgb_train = lgb.Dataset(data = X_train, label = y_train, feature_name = list(X.columns), categorical_feature = CATEGORICAL_COLS)
lgb_test = lgb.Dataset(data = X_test, label = y_test, feature_name = list(X.columns), categorical_feature = CATEGORICAL_COLS)
gbm = lgb.train(params, lgb_train, categorical_feature = CATEGORICAL_COLS)

My questions are:
Q1: Which of these two methods is correct?
Q2: Do I need to both change the DataFrame column type to category AND specify categorical_feature when defining the lightgbm model?
Q3: What should I do with a categorical feature? Should I label-encode it as integers from 0 to k-1 (where k is the number of categories), or can I just pass the string column to lightgbm?

Thank you for your time.

@jameslamb
Collaborator

Thanks for using LightGBM.

Before we look into this... is it absolutely necessary that you use lightgbm v2.1.2? That version is very old (June 2018), and there have been hundreds of changes to LightGBM since then... including bugfixes and new features around handling of categorical features, e.g. #1979 and #2754.

I strongly recommend updating to lightgbm's latest stable release, v3.3.5.

@TopoKunst
Author

Sure. I don't know why the version I installed was 2.1.2; maybe it came from the mirror I use? I have now updated lightgbm to v3.3.5.

@jameslamb
Collaborator

Ok thank you! Someone here will provide an answer using lightgbm v3.3.5 when we have time.

@Recepguel

Recepguel commented Jun 17, 2023

Check the code here. It basically says that if you provide a pandas DataFrame, LightGBM does the following. If it detects categorical columns, it automatically encodes them as integers (check x.cat.codes in line 688), so you don't need to do the label encoding yourself. Detection is automatic for columns that already have the pandas category dtype, so after X[CATEGORICAL_COLS] = X[CATEGORICAL_COLS].astype("category") you don't need to pass categorical_feature as well. Use the categorical_feature parameter only if you want to set categorical features explicitly, regardless of the dtypes in the DataFrame. For example, a numeric column "N" that takes 40 different values can also be interpreted as a categorical column with 40 classes. This won't happen automatically, since pandas treats N as a numeric column. With categorical_feature, you can force N to be treated as a categorical feature. You should use this option in such cases.

    if isinstance(data, pd_DataFrame):
        if len(data.shape) != 2 or data.shape[0] < 1:
            raise ValueError('Input data must be 2 dimensional and non empty.')
        if feature_name == 'auto' or feature_name is None:
            data = data.rename(columns=str, copy=False)
        cat_cols = [col for col, dtype in zip(data.columns, data.dtypes) if isinstance(dtype, pd_CategoricalDtype)]
        cat_cols_not_ordered = [col for col in cat_cols if not data[col].cat.ordered]
        if pandas_categorical is None:  # train dataset
            pandas_categorical = [list(data[col].cat.categories) for col in cat_cols]
        else:
            if len(cat_cols) != len(pandas_categorical):
                raise ValueError('train and valid dataset categorical_feature do not match.')
            for col, category in zip(cat_cols, pandas_categorical):
                if list(data[col].cat.categories) != list(category):
                    data[col] = data[col].cat.set_categories(category)
        if len(cat_cols):  # cat_cols is list
            data = data.copy(deep=False)  # not alter origin DataFrame
            data[cat_cols] = data[cat_cols].apply(lambda x: x.cat.codes).replace({-1: np.nan})
        if categorical_feature is not None:
            if feature_name is None:
                feature_name = list(data.columns)
            if categorical_feature == 'auto':  # use cat cols from DataFrame
                categorical_feature = cat_cols_not_ordered
            else:  # use cat cols specified by user
                categorical_feature = list(categorical_feature)  # type: ignore[assignment]
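
To see what the excerpt above does without running LightGBM itself, here is a minimal pandas-only sketch of the same idea. The helper name encode_categoricals and the toy columns "city" and "N" are made up for illustration and are not part of LightGBM; the sketch only mirrors the detection by dtype, the reuse of the training categories on a validation set, and the replacement of code -1 with NaN.

```python
import numpy as np
import pandas as pd


def encode_categoricals(df, train_categories=None):
    """Hypothetical helper mirroring the excerpt: category columns -> integer codes."""
    # Only columns whose dtype is already "category" are detected.
    cat_cols = [c for c, dt in zip(df.columns, df.dtypes)
                if isinstance(dt, pd.CategoricalDtype)]
    if train_categories is None:
        # Training data: remember the category order of each detected column.
        train_categories = [list(df[c].cat.categories) for c in cat_cols]
    elif len(cat_cols) != len(train_categories):
        raise ValueError("train and valid categorical columns do not match.")
    out = df.copy()  # leave the caller's DataFrame untouched
    for c, cats in zip(cat_cols, train_categories):
        # Align with the training categories, then take the integer codes.
        codes = out[c].cat.set_categories(cats).cat.codes
        # Code -1 means "missing or unseen category"; turn it into NaN.
        out[c] = codes.astype("float64").where(codes >= 0)
    return out, cat_cols, train_categories


train = pd.DataFrame({"city": ["paris", "rome", "paris"], "N": [10, 20, 10]})
train["city"] = train["city"].astype("category")
train_enc, cat_cols, cats = encode_categoricals(train)

# "oslo" was never seen in training, so it becomes NaN after alignment.
valid = pd.DataFrame({"city": ["rome", "oslo"], "N": [20, 30]})
valid["city"] = valid["city"].astype("category")
valid_enc, _, _ = encode_categoricals(valid, cats)

# The integer column "N" is only picked up once it is cast to category.
_, cat_cols_with_n, _ = encode_categoricals(
    train.assign(N=train["N"].astype("category"))
)

print(cat_cols, cat_cols_with_n)
print(train_enc["city"].tolist(), valid_enc["city"].tolist())
```

In this sketch the numeric column "N" is ignored until it is cast, which is the situation where passing it through categorical_feature (or casting it to category yourself) would matter.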

@TopoKunst
Author

Got it, and thank you very much for this meticulous explanation!

@github-actions

This issue has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Sep 20, 2023