KPrototypes wrongly identifies categorical data as non-categorical #71

Closed
mikeyford opened this issue May 11, 2018 · 1 comment

mikeyford commented May 11, 2018

To reproduce, the example below uses the Titanic dataset from https://www.kaggle.com/c/titanic/data

from kmodes.kprototypes import KPrototypes
import pandas as pd

df = pd.read_csv("train.csv", usecols=['Sex', 'Age', 'Embarked'])

model = KPrototypes(n_clusters=2)
clusters = model.fit_predict(df)

This raises NotImplementedError: No categorical data selected, effectively doing k-means. Present a list of categorical columns, or use scikit-learn's KMeans instead.

The Sex and Embarked variables are categorical. Converting them explicitly, e.g. df["Sex"] = df["Sex"].astype('category'), gives the same result. KModes has no problems with the same data. Am I doing something wrong and this is expected behaviour, or is this a bug?


mikeyford commented May 11, 2018

Ah, I figured out where I had been going wrong by looking at the source code. For the benefit of anyone else who finds this via Google: fit_predict needs a categorical argument listing the indices of the categorical columns.
The last line above should be changed to
clusters = model.fit_predict(df.values, categorical=[0, 2]) for the example dataset.

Note that .values was added to df because of #40.
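For anyone who would rather not hard-code the indices, here is a minimal sketch of one way to derive them from the DataFrame. The helper names categorical_indices and cluster are made up for illustration and are not part of kmodes. The sketch assumes that every non-numeric column should be treated as categorical, and it uses a small invented frame rather than the real train.csv.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def categorical_indices(df):
    """Return the positional indices of the non-numeric columns of df.

    Hypothetical helper: assumes every non-numeric column is categorical.
    """
    return [i for i, dtype in enumerate(df.dtypes) if not is_numeric_dtype(dtype)]


# Invented rows with the same column layout as the example (Sex, Age, Embarked).
df = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, 38.0, 26.0, 35.0],
    "Embarked": ["S", "C", "S", "Q"],
})

cat_idx = categorical_indices(df)
print(cat_idx)  # [0, 2]


def cluster(df, n_clusters=2):
    """Fit KPrototypes, passing the categorical column indices explicitly."""
    # Imported here so the helper above can be tried without kmodes installed.
    from kmodes.kprototypes import KPrototypes

    model = KPrototypes(n_clusters=n_clusters)
    # .values is passed instead of the DataFrame, as in the workaround above.
    return model.fit_predict(df.values, categorical=categorical_indices(df))
```

Calling cluster(df) should then be equivalent to the corrected line above, since the helper returns [0, 2] for this column layout.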
