Importance Overhaul: What Method(s) to Get P-Values? #36
Comments
An issue with the second approach is that it relies on a chi-squared test of the tree predictions for the permuted and unpermuted cases, so you are still doing significance testing of the feature values and may have issues with numerical features. The ACE paper, on the other hand, tests on the rank of feature importance, which may be less problematic.
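To make the concern concrete, here is an illustrative sketch (not CloudForest code; all data is simulated) of the chi-squared comparison in question: counting correct and incorrect out-of-bag predictions for a tree before and after one feature is permuted, and testing the resulting 2x2 table.

```python
# Hypothetical sketch of a chi-squared test on permuted vs. unpermuted
# tree predictions. oob_true, preds, preds_perm are simulated stand-ins,
# not CloudForest data structures.
import numpy as np
from scipy.stats import chi2_contingency

rng = np.random.default_rng(0)
oob_true = rng.integers(0, 2, size=200)      # true labels on OOB cases
preds = oob_true.copy()
preds[:20] = 1 - preds[:20]                  # unpermuted tree: 90% accurate
preds_perm = preds.copy()
flip = rng.random(200) < 0.3                 # permuting degrades some predictions
preds_perm[flip] = rng.integers(0, 2, size=flip.sum())

# 2x2 contingency table: (correct, incorrect) counts per condition
table = np.array([
    [(preds == oob_true).sum(), (preds != oob_true).sum()],
    [(preds_perm == oob_true).sum(), (preds_perm != oob_true).sum()],
])
chi2, p, dof, _ = chi2_contingency(table)
```

Note that the test operates on prediction counts, so any sensitivity to how numerical features were split upstream is baked into `preds` before the test ever runs, which is the objection raised above.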
The other ACE paper, "Feature Selection with Ensembles, Artificial Variables and Redundancy Elimination," uses a different method. Forests don't depend on the previous one, and a Student's t-test is used to compare the variable importance of each variable vs. its contrasts.
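A minimal sketch of that comparison, assuming simulated importance scores (none of these numbers come from CloudForest): collect a feature's importance across independently grown forests, do the same for its shuffled contrast copies, and run a one-sided Student's t-test.

```python
# Illustrative t-test of real-feature importance vs. artificial-contrast
# importance across forests. Values are simulated for the sketch.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(1)
# Importance of the real feature over 20 independently grown forests...
real_imp = rng.normal(loc=0.08, scale=0.01, size=20)
# ...and of its permuted contrast copies, which should carry no signal.
contrast_imp = rng.normal(loc=0.02, scale=0.01, size=20)

# One-sided test: is the real feature's importance greater than its contrasts'?
t, p = ttest_ind(real_imp, contrast_imp, equal_var=False, alternative="greater")
```

Because each forest is grown independently, the per-forest importances are treated as independent samples, which is what justifies the two-sample t-test here.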
Hi Ryan,
P-values aren't calculated yet. Just variable importance as described in the readme: https://github.com/ryanbressler/CloudForest#importance-and-contrasts This is a measure of how important each variable is to the predictor and not something that can be fed into the predictor.
In my case, you wrote: "The -vet option penalizes the impurity decrease of potential best split by subtracting the best split they can make after the target values cases on which the split is being evaluated have been shuffled". Is this command correct?
No, vet is an internal method and takes no parameters. I'm not aware of a way to specify feature weights going into a random forest. Methods like -evaloob, -vet, and comparison to artificial contrasts improve feature selection. If you already have values you want to use for feature selection from rf… Ryan
This is another paper that uses iterative feature selection: http://www.ploscompbiol.org/article/info%3Adoi%2F10.1371%2Fjournal.pcbi.1002956 It depends on pairwise correlation and network partitioning and each forest/iteration reweighs network modules and features. |
P-values for variable importance are desirable as they are easier to interpret and will potentially be easier to drop into our other tools.
A couple of different methods seem viable for this. The ACE method, as used in rf-ace, involves repeatedly growing a forest that includes artificial contrasts of all features and using Wilcoxon tests.
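The repeated-forest comparison can be sketched as a paired, one-sided Wilcoxon signed-rank test: each repetition yields one importance score for the real feature and one for its contrast, and the test asks whether the paired differences are shifted above zero. This is an illustrative sketch on simulated scores, not the rf-ace implementation.

```python
# Hedged sketch: paired Wilcoxon signed-rank test of a real feature's
# importance against its artificial contrast over repeated forests.
# All importance values are simulated.
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(2)
n_forests = 15
real = rng.normal(loc=0.07, scale=0.01, size=n_forests)      # real feature
contrast = rng.normal(loc=0.02, scale=0.01, size=n_forests)  # shuffled contrast

# One-sided: do the paired differences favor the real feature?
stat, p = wilcoxon(real - contrast, alternative="greater")
```

The rank-based test only uses the ordering of the differences, which is the "testing on the rank of feature importance" property argued above to be less sensitive to numerical-feature issues than a chi-squared test on predictions.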
Another method, presented in "Identification of Statistically Significant Features from Random Forests," tests the change produced by permuting each feature, testing on OOB cases after each tree. This is potentially more computationally efficient since only one forest needs to be grown.
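A rough sketch of the per-tree OOB permutation idea, under stated assumptions: the "trees" below are stand-in decision stumps rather than real forest trees, and the per-tree accuracy drops are fed to a one-sample t-test against zero. Only the single ensemble's OOB cases are reused, which is where the efficiency claim comes from.

```python
# Illustrative per-tree OOB permutation test. stump_predict is a toy
# stand-in for a trained tree; X, y are simulated.
import numpy as np
from scipy.stats import ttest_1samp

rng = np.random.default_rng(3)
X = rng.normal(size=(300, 2))
y = (X[:, 0] > 0).astype(int)             # only feature 0 is informative

def stump_predict(x_col, threshold=0.0):
    return (x_col > threshold).astype(int)

drops = []
for _ in range(25):                       # one "tree" per bootstrap
    oob = rng.random(300) > 0.632         # cases left out of the bootstrap
    acc = (stump_predict(X[oob, 0]) == y[oob]).mean()
    x_perm = rng.permutation(X[oob, 0])   # permute feature 0 on OOB cases only
    acc_perm = (stump_predict(x_perm) == y[oob]).mean()
    drops.append(acc - acc_perm)

# Is the mean per-tree accuracy drop significantly greater than zero?
t, p = ttest_1samp(drops, 0.0, alternative="greater")
```

Each tree contributes one observation, so the whole test runs on a single grown forest; the ACE approach above instead pays for a fresh forest per repetition.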
Another interesting paper that might be complementary is "Understanding variable importances in forests of randomized trees" which presents work on totally randomized trees, Extra-Trees and random forests suggesting that the more randomized implementations might be of use when we are concerned primarily with feature selection.