Importance Overhaul: What Method(s) to Get P-Values? #36

Open
ryanbressler opened this issue Apr 8, 2014 · 7 comments

Comments

@ryanbressler
Owner

P-values for variable importance are desirable as they are easier to interpret and potentially easier to drop into our other tools.

A couple of different methods seem viable for this. The ACE method, as used in rf-ace, involves repeatedly growing forests that include artificial contrasts of all features and using Wilcoxon tests.
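
Not CloudForest code, just a minimal sketch of the ACE-style idea: grow several forests on data augmented with artificial contrasts (independently shuffled copies of every feature) and compare each real feature's importance with its contrast's across repeats. scikit-learn and its impurity importance are stand-ins here, and comparing a feature only to its own contrast is an assumption (rf-ace's exact comparison may differ).

```python
import numpy as np
from scipy.stats import wilcoxon
from sklearn.ensemble import RandomForestClassifier

def ace_pvalues(X, y, n_repeats=20, seed=0):
    rng = np.random.default_rng(seed)
    n_features = X.shape[1]
    real_imp = np.empty((n_repeats, n_features))
    contrast_imp = np.empty((n_repeats, n_features))
    for r in range(n_repeats):
        # Artificial contrasts: each feature shuffled independently, breaking any link to y.
        contrasts = np.column_stack([rng.permutation(X[:, j]) for j in range(n_features)])
        forest = RandomForestClassifier(n_estimators=200, random_state=r)
        forest.fit(np.hstack([X, contrasts]), y)
        real_imp[r] = forest.feature_importances_[:n_features]
        contrast_imp[r] = forest.feature_importances_[n_features:]
    # One-sided Wilcoxon signed-rank test per feature: is the real feature
    # consistently more important than its contrast across repeats?
    return [wilcoxon(real_imp[:, j], contrast_imp[:, j], alternative="greater").pvalue
            for j in range(n_features)]
```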

Another method, presented in "Identification of Statistically Significant Features from Random Forests," tests the change produced by permuting each feature and evaluating on OOB cases after each tree. This is potentially more computationally efficient since only one forest needs to be grown.
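
A rough sketch of that single-forest idea: permute one feature at a time, score every tree before and after the permutation, and test the per-tree accuracy drops. A held-out set stands in here for the per-tree OOB cases, sklearn stands in for CloudForest, and a one-sided Wilcoxon signed-rank test stands in for the test used in the paper (see the next comment on the chi-squared test it actually uses).

```python
import numpy as np
from scipy.stats import wilcoxon

def permutation_pvalue(forest, X_hold, y_hold, feature, seed=0):
    rng = np.random.default_rng(seed)
    X_perm = X_hold.copy()
    X_perm[:, feature] = rng.permutation(X_perm[:, feature])
    drops = []
    for tree in forest.estimators_:
        # sklearn sub-estimators predict encoded class indices; map back to labels.
        pred = forest.classes_[tree.predict(X_hold).astype(int)]
        pred_perm = forest.classes_[tree.predict(X_perm).astype(int)]
        drops.append((pred == y_hold).mean() - (pred_perm == y_hold).mean())
    # One-sided: does permuting this feature significantly reduce per-tree accuracy?
    return wilcoxon(drops, alternative="greater").pvalue

# Hypothetical usage with a fitted sklearn RandomForestClassifier:
# forest = RandomForestClassifier(n_estimators=500, random_state=0).fit(X_train, y_train)
# p = permutation_pvalue(forest, X_hold, y_hold, feature=3)
```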

Another interesting paper that might be complementary is "Understanding variable importances in forests of randomized trees," which presents work on totally randomized trees, Extra-Trees, and random forests, suggesting that the more randomized implementations might be useful when we are primarily concerned with feature selection.

@ryanbressler
Owner Author

An issue with the second approach is that it relies on a chi-squared test of the tree predictions for the permuted and unpermuted cases, so you are still doing significance testing of the feature values and may have issues with numerical features. The ACE paper, on the other hand, tests on the rank of feature importance, which may be less problematic.

@ryanbressler
Owner Author

The other ACE paper, "Feature Selection with Ensembles, Artificial Variables and Redundancy Elimination," uses a different method. Each forest is grown independently of the previous ones, and a Student's t-test is used to compare the variable importance of each variable against its contrasts.
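
Sketching that variant under the same setup as the earlier snippet, with real_imp and contrast_imp as repeats-by-features matrices of importances from independently grown forests; whether the test should be paired per contrast or pooled over all contrasts is an assumption here.

```python
from scipy.stats import ttest_rel

def ttest_pvalues(real_imp, contrast_imp):
    # One-sided, column-wise Student's t-test: is each real feature more
    # important than its contrast across the independent forests?
    return ttest_rel(real_imp, contrast_imp, axis=0, alternative="greater").pvalue
```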

@tungntdhtl

Hi Ryan,

  • How does CloudRF calculate the weights file (p-values)?
  • Assuming we have a weight file (based on p-values), how can we grow CloudRF using these weights?

@ryanbressler
Owner Author

P-values aren't calculated yet, just variable importance as described in the README:

https://github.com/ryanbressler/CloudForest#importance-and-contrasts

This is a measure of how important each variable is to the predictor and not something that can be fed into the predictor.

@tungntdhtl

My data set is like the case described here:
https://github.com/ryanbressler/CloudForest#data-with-lots-of-noisy-uninformative-high-cardinality-features
I have a weight file based on p-values; what command should I type to grow trees in CloudRF using this weight file?

You wrote: "The -vet option penalizes the impurity decrease of potential best splits by subtracting the best split they can make after the target values of the cases on which the split is being evaluated have been shuffled."
Assume my weight file, "wfile.tsv", has 3 columns (feature names, p-values, importances).

Is this command correct?
~/cloudRF/growforest -train usps -rfpred usps.sf -target 0 -nTrees 500 -vet wfile.tsv

@ryanbressler
Owner Author

No, -vet is an internal method and takes no parameters.

I'm not aware of a way to specify feature weights going into a random forest. Since it does its own feature selection internally, I'm not even sure how the algorithm could be modified to use them.

Methods like -evaloob, -vet, and comparison to artificial contrasts improve the internal feature selection.

If you already have values you want to use for feature selection from RF or another method, you'll need to apply a cutoff and/or take the top N features. You can specify whitelists and blacklists of features to use or not use if you don't want to reproduce the data set with the smaller set of features.
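
For example, something along these lines (the column layout and the one-name-per-line whitelist format here are just assumptions for illustration, not documented formats) would turn a scores file into a top-N whitelist:

```python
import csv

def write_whitelist(scores_path, out_path, n_top=50, name_col=0, score_col=1):
    # Read a tab-separated file with a feature name and a numeric score per row.
    with open(scores_path, newline="") as f:
        rows = [r for r in csv.reader(f, delimiter="\t") if r]
    # Keep the N highest-scoring features and write one name per line.
    rows.sort(key=lambda r: float(r[score_col]), reverse=True)
    with open(out_path, "w") as out:
        for row in rows[:n_top]:
            out.write(row[name_col] + "\n")
```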

Ryan

@ryanbressler
Owner Author

This is another paper that uses iterative feature selection:

http://www.ploscompbiol.org/article/info%3Adoi%2F10.1371%2Fjournal.pcbi.1002956

It relies on pairwise correlation and network partitioning, and each forest/iteration reweights network modules and features.
