not work on titanic dataset #13

Closed
edvardoss opened this issue Jun 10, 2019 · 7 comments
Labels
bug Something isn't working

Comments

edvardoss commented Jun 10, 2019

I was impressed by your blog post about model interpretation and tried to test this package on the popular “titanic” dataset, but all my attempts failed.

install.packages("titanic") # only data in package
data("titanic_train",package="titanic")
library(tidyverse)
str(titanic_train)

d <- titanic_train %>% as_tibble %>%
  mutate(title=str_replace_all(string = Name, # extract title as general feature
                               pattern = "^[[:alpha:][:space:]'-]+,\\s+(the\\s)?(\\w+)\\..+",
                               replacement = "\\2")) %>%
  mutate(title=str_trim(title),
         title=case_when(title %in% c('Mlle','Ms')~'Miss', # normalize some titles
                         title=='Mme'~ 'Mrs',
                         title %in% c('Capt','Don','Major','Sir','Jonkheer', 'Col')~'Sir',
                         title %in% c('Dona', 'Lady', 'Countess')~'Lady',
                         TRUE~title)) %>%
  mutate(title=as_factor(title),
         Survived=factor(Survived,levels = c(0,1),labels=c("no","yes")),
         Sex=as_factor(Sex),
         Pclass=factor(Pclass,ordered = T)) %>%
  group_by(title) %>% # impute Age by median in current title
  mutate(Age=replace_na(Age,replace = median(Age,na.rm = T))) %>% ungroup
table(d$title,d$Sex) # look at the title distribution
caret::nearZeroVar(x = d,saveMetrics = T) # find and drop uninformative features (PassengerId,Name,Ticket)
d <- d %>% select_at(vars(-c(PassengerId,Name,Ticket)))
d %>% summarise_all(~sum(is.na(.))) # control NAs

library(ranger)
m <- ranger(formula = Survived~.,data = d,mtry = 6,min.node.size = 5, num.trees = 600,
            importance = "permutation")

library(easyalluvial)
imp <- importance(m) %>% as.data.frame %>% tidy_imp(imp = .,df=d)
alluvial_wide(data = select(d,Survived,title,Pclass,Sex,Fare),fill_by = "first_variable") # OK, this works, but I want to describe the model, not the data

gds <- get_data_space(df = d,imp,degree = 4) # Error in Summary.factor(c(1L, 2L, 3L, 2L, 1L, 1L, 1L, 4L, 2L, 2L, 3L,  : ‘max’ not meaningful for factors

# OK, don't give up and try caret
library(caret)
trc <- trainControl(method = "none")
m <- train(Survived~.,data = d,method="rf",trControl=trc,importance=T)
alluvial_model_response_caret(train = m,degree = 4,bins=5,stratum_label_size = 2.8) # Error in tidy_imp(imp, df) : not all listed important variables found in input data


@erblast erblast added the bug Something isn't working label Jun 10, 2019
Owner

erblast commented Jun 10, 2019

Thanks for reporting. I did not think to test with an all-factor dataset. I will fix this as soon as possible.
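The error message in the report ("‘max’ not meaningful for factors") can be reproduced in base R without easyalluvial. The sketch below only illustrates the underlying R behaviour; the toy vectors are invented for the example, and it is an assumption that the same kind of call is what fails inside get_data_space.

```r
# Unordered factor: the levels have no ranking, so max() raises an error.
title <- factor(c("Mr", "Mrs", "Miss", "Mr"))
unordered_result <- tryCatch(
  max(title),
  error = function(e) "error"
)

# Ordered factor: max() is defined and returns the highest level present.
pclass <- factor(c(3, 1, 2, 3), ordered = TRUE)
ordered_result <- max(pclass)

# One common summary that works for any factor: the most frequent level.
most_frequent <- names(which.max(table(title)))

print(unordered_result)
print(ordered_result)
print(most_frequent)
```

This is why a dataset that mixes ordered factors (Pclass) with unordered ones (title, Sex) can trip code that was only tested on numeric columns.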

erblast added a commit that referenced this issue Jun 13, 2019
Author

edvardoss commented Jun 21, 2019

get_data_space now works, thank you!
But the next step does not.

library(ranger)
m <- ranger(formula = Survived~.,data = d,mtry = 6,min.node.size = 5, num.trees = 600,
            importance = "permutation")
library(easyalluvial)
imp <- importance(m) %>% as.data.frame %>% tidy_imp(imp = .,df=d)

dspace <- get_data_space(df = d,imp,degree = 4) # works!
pred = predict(m, data = dspace)
p = alluvial_model_response(pred, dspace, imp, degree = 4) # Error in alluvial_model_response: "pred" needs to be a numeric or a factor vector

Owner

erblast commented Jun 24, 2019

fixing some issues that arise when having character and factors in the training data eb74c37

Owner

erblast commented Jun 24, 2019

Hi, sorry, I am not checking back on this as frequently as I would like to. The problem is with predict in the ranger package: it does not return the bare predictions but a list-like object that needs to be indexed to get to the predictions.

try:
p = alluvial_model_response(pred = pred$predictions, dspace = gds, imp = imp, degree = 4)

This works for me. Could you install the most recent development version and tell me whether it works for you now, including the caret bit?

Thanks for reporting this, it uncovered a few issues when using factors that I should have anticipated. I have added your example as a new test case. It will go to CRAN in the next two weeks hopefully.
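To make the point about ranger's predict concrete, here is a minimal sketch on the built-in iris data (an assumption for illustration, not the titanic setup from this thread). It shows that the returned object is not a plain vector and that the predictions sit in its `$predictions` element.

```r
library(ranger)

# Small classification forest on a built-in dataset (illustrative only).
m_iris <- ranger(Species ~ ., data = iris, num.trees = 50)

# predict() returns an object of class "ranger.prediction", not a vector.
pred_iris <- predict(m_iris, data = iris)
print(class(pred_iris))

# The predictions themselves are stored in the $predictions element;
# for a classification forest this is a factor.
pred_vec <- pred_iris$predictions
print(is.factor(pred_vec))
print(length(pred_vec) == nrow(iris))
```

Passing `pred_iris$predictions` rather than `pred_iris` is what satisfies a function that expects a numeric or factor vector.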

@edvardoss
Author

Hi!
Yes, I have installed the latest dev version.
Sorry about ranger::predict, I had not properly checked the type of that object. Thank you for your answer, it works well!
But caret still generates an error for me:

# OK, don't give up and try caret
devtools::install_local(path = "C:\\Users\\AnanevHA\\Downloads\\easyalluvial-master",force = TRUE)
library(caret)
trc <- trainControl(method = "none")
m <- train(Survived~.,data = d,method="rf",trControl=trc,importance=T)
library(easyalluvial)
alluvial_model_response_caret(train = m,degree = 4,bins=5,stratum_label_size = 2.8) # Error in tidy_imp(imp, df) : not all listed important variables found in input data

Owner

erblast commented Jun 25, 2019

Could you make sure that you have the latest dev version installed?
devtools::install_github('https://github.com/erblast/easyalluvial.git')

When you execute easyalluvial::tidy_imp (without parentheses), you see the function source code. You should find the following lines:

 # correct dummyvariable names back to original name

  df_ori_var = tibble( ori_var = names( select_if(df, ~ is.factor(.) | is.character(.) ) ) ) %>%

Note that the | is.character(.) part was added. This should resolve the error you were getting.
Let me know how it goes.
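The following base R sketch illustrates why the `is.character(.)` check matters. It is not the easyalluvial implementation: the toy data frame and the helper `map_to_original()` are hypothetical, and `model.matrix()` stands in for the dummy encoding that a formula interface produces.

```r
# Toy data with one factor column and one character column (illustrative).
df <- data.frame(
  y    = c(1, 0, 1, 0),
  sex  = factor(c("male", "female", "female", "male")),
  port = c("S", "C", "Q", "S"),
  fare = c(7.25, 71.3, 8.05, 53.1),
  stringsAsFactors = FALSE
)

# A formula interface expands categorical columns into dummy variables,
# so importance names look like "sexmale" or "portQ".
dummy_names <- setdiff(colnames(model.matrix(y ~ ., data = df)), "(Intercept)")

# Hypothetical helper: map dummy names back to the original column names.
map_to_original <- function(dummy_names, df, include_character) {
  is_cat <- vapply(
    df,
    function(x) is.factor(x) || (include_character && is.character(x)),
    logical(1)
  )
  cat_vars <- names(df)[is_cat]
  map_one <- function(nm) {
    hit <- cat_vars[startsWith(nm, cat_vars)]
    if (length(hit) > 0) hit[which.max(nchar(hit))] else nm
  }
  unique(vapply(dummy_names, map_one, character(1), USE.NAMES = FALSE))
}

# Factor columns only: "portQ" and "portS" are left unmapped.
factor_only <- map_to_original(dummy_names, df, include_character = FALSE)
# Factor and character columns: every name maps back to a real column.
with_character <- map_to_original(dummy_names, df, include_character = TRUE)

print(all(factor_only %in% names(df)))
print(all(with_character %in% names(df)))
```

With only the factor check, the leftover dummy names do not exist in the input data, which matches the kind of situation the "not all listed important variables found in input data" error describes.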

@edvardoss
Author

Hi!
Everything is working, thank you!
