Categorical Features with no 0 label leads to partial_dependence ValueError #285

Open
tatkeller opened this issue Nov 20, 2020 · 1 comment

tatkeller commented Nov 20, 2020

Hi there,

I created a model with categorical features whose values are all greater than 0 (they range from n to m, where 0 < n <= m). When I tried to plot the partial dependence for the model, I ran into a ValueError (reproduced below). The problem is that generate_X_grid creates a matrix in which every column except the one for the current term is filled with zeros:

[[0,0,0, ..., 0, i, 0, ..., 0,0,0],
[0,0,0, ..., 0, i, 0, ..., 0,0,0],
...,
[0,0,0, ..., 0, i, 0, ..., 0,0,0]]

For models trained with categorical features that do not include 0 as a category, passing this grid to partial_dependence raises an error, because the zero-filled columns fall outside the range of categories seen during fitting.
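To make the failure mode concrete, here is a minimal NumPy-only sketch of what I believe is happening. It does not call pyGAM: `check_categorical_domain` is a hypothetical, simplified stand-in for the domain check in `check_X` (modelled only on the error message in the traceback below), and the grid values for feature 1 are illustrative placeholders.

```python
import numpy as np


def check_categorical_domain(X, feature, lo, hi):
    """Hypothetical stand-in for pyGAM's domain check (not the real check_X).

    Raises ValueError if any value in column `feature` lies outside [lo, hi].
    """
    x = X[:, feature]
    if x.min() < lo or x.max() > hi:
        raise ValueError(
            "X data is out of domain for categorical feature {}. "
            "Expected data on [{}, {}], but found data on [{}, {}]".format(
                feature, lo, hi, x.min(), x.max()
            )
        )
    return X


def domain_error(X, feature, lo, hi):
    """Return the error message as a string, or None if the check passes."""
    try:
        check_categorical_domain(X, feature, lo, hi)
    except ValueError as exc:
        return str(exc)
    return None


# A grid shaped like the one generate_X_grid returns for term 1:
# feature 1 varies, every other column is zero (placeholder values).
grid = np.zeros((5, 3))
grid[:, 1] = np.linspace(18.0, 80.0, 5)

# Feature 0 was fitted on categories in [2003, 2009], so the zero column fails.
message = domain_error(grid, feature=0, lo=2003.0, hi=2009.0)
print(message)
```

The point of the sketch is only that the zero filler in column 0 is validated even though the plot is for a different term, so any categorical feature without a 0 category trips the check.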

Here is a recreation of the error using the Quick start example code:

Input:

from pygam.datasets import wage

X, y = wage()

from pygam import LinearGAM, s, f

gam = LinearGAM(f(0) + s(1) + f(2)).fit(X, y)  # Use f(0) to make the 0th term categorical; feature 0 contains no value equal to 0

import matplotlib.pyplot as plt

for i, term in enumerate(gam.terms):
    if term.isintercept:
        continue

    XX = gam.generate_X_grid(term=i)
    pdep, confi = gam.partial_dependence(term=i, X=XX, width=0.95)

    #plt.figure()
    plt.plot(XX[:, term.feature], pdep)
    plt.plot(XX[:, term.feature], confi, c='r', ls='--')
    plt.title(repr(term))
    plt.show()

Output:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-10-0e5df89ff530> in <module>()
      7     XX = gam.generate_X_grid(term=i)
      8     print(XX)
----> 9     pdep, confi = gam.partial_dependence(term=i, X=XX, width=0.95)
     10 
     11     #plt.figure()

/Users/tatekeller/opt/anaconda3/envs/pbh/lib/python3.6/site-packages/pygam/pygam.py in partial_dependence(self, term, X, width, quantiles, meshgrid)
   1542                         features=self.feature, verbose=self.verbose)
   1543 
-> 1544         modelmat = self._modelmat(X, term=term)
   1545         pdep = self._linear_predictor(modelmat=modelmat, term=term)
   1546         out = [pdep]

/Users/tatekeller/opt/anaconda3/envs/pbh/lib/python3.6/site-packages/pygam/pygam.py in _modelmat(self, X, term)
    455         X = check_X(X, n_feats=self.statistics_['m_features'],
    456                     edge_knots=self.edge_knots_, dtypes=self.dtype,
--> 457                     features=self.feature, verbose=self.verbose)
    458 
    459         return self.terms.build_columns(X, term=term)

/Users/tatekeller/opt/anaconda3/envs/pbh/lib/python3.6/site-packages/pygam/utils.py in check_X(X, n_feats, min_samples, edge_knots, dtypes, features, verbose)
    301                                      'feature {}. Expected data on [{}, {}], '\
    302                                      'but found data on [{}, {}]'\
--> 303                                      .format(i, min_, max_, x.min(), x.max()))
    304 
    305     return X

ValueError: X data is out of domain for categorical feature 0. Expected data on [2003.0, 2009.0], but found data on [0.0, 0.0]

The versions that I used are:
pyGAM=0.8.0
Python=3.6.12

For now I will work around this by subtracting the minimum value from each categorical feature, which shifts the category range from (n, m) to (0, m - n).
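A minimal sketch of that workaround, assuming the features are in a NumPy array and the categorical columns are known by index. The helper name `shift_categories_to_zero` and the demo array are hypothetical and not part of pyGAM.

```python
import numpy as np


def shift_categories_to_zero(X, categorical_features):
    """Return a copy of X with each listed column shifted so its minimum is 0.

    Also returns the per-column offsets so the original category values
    can be restored later (for example, for axis labels on a plot).
    """
    X_shifted = np.array(X, dtype=float, copy=True)
    offsets = {}
    for j in categorical_features:
        offsets[j] = X_shifted[:, j].min()
        X_shifted[:, j] -= offsets[j]
    return X_shifted, offsets


# Small made-up array: column 0 is a year-like category, column 2 starts at 0.
X_demo = np.array(
    [
        [2003, 18, 1],
        [2006, 45, 0],
        [2009, 80, 4],
    ],
    dtype=float,
)

X_shifted, offsets = shift_categories_to_zero(X_demo, categorical_features=[0, 2])
print(X_shifted[:, 0], offsets)
```

The model would then be fitted on `X_shifted` instead of `X`, and `offsets[j]` can be added back to the grid column when labelling the plot for feature `j`. This only guarantees that 0 is a valid category for each shifted feature; it does not change how generate_X_grid fills the other columns.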

Thanks in advance

@5ch0r5ch1

See #302
