The astrid R package, short for Automatic STRucture IDentification, provides an implementation of the method described in
Henelius, Andreas, Puolamäki, Kai and Ukkonen, Antti. Finding Statistically Significant Attribute Interactions. 2016, available from arXiv.
The basic idea is to use classifiers to investigate class-dependent attribute interactions in datasets.
To get a BibTeX entry, type citation("astrid") in R once the package is installed.
The development version of the astrid package can be installed from GitHub as follows.
First install the devtools package and load it:
install.packages("devtools")
library(devtools)
You can now install the astrid package:
install_github("bwrc/astrid-r")
This short example demonstrates how to use the package. Here we analyse the following synthetic dataset:
The dataset has two classes, each with 500 samples. The data is generated so that attributes a1 and a2 must be used jointly to predict the class (leftmost panel), while attribute a3 carries some (weak) class information (middle panel). Attribute a4 (rightmost panel) is just noise. The known class-dependent attribute interaction structure is hence given by ((a1, a2), (a3), (a4)).
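
To make the structure described above concrete, here is a minimal base-R sketch of a dataset with the same kind of interactions. It is not the package's make_synthetic_dataset() function: the name sketch_dataset, its parameters and the chosen distributions are assumptions made purely for illustration.

```r
## Hypothetical generator (NOT the package's make_synthetic_dataset()):
## two classes with n samples each and attributes a1..a4.
sketch_dataset <- function(n = 500, seed = 42, shift = 0.6) {
  set.seed(seed)
  cls <- factor(rep(c("c1", "c2"), each = n))
  ## a1 is centred on -1 or +1 with equal probability in both classes
  s1 <- sample(c(-1, 1), 2 * n, replace = TRUE)
  ## a2 has the same sign as a1 in class c1 and the opposite sign in c2,
  ## so only a1 and a2 together reveal the class
  s2 <- ifelse(cls == "c1", s1, -s1)
  data.frame(
    a1 = s1 + rnorm(2 * n, sd = 0.3),
    a2 = s2 + rnorm(2 * n, sd = 0.3),
    ## a3 carries weak class information through a small mean shift
    a3 = rnorm(2 * n, mean = ifelse(cls == "c1", 0, shift)),
    ## a4 is pure noise
    a4 = rnorm(2 * n),
    class = cls
  )
}

d <- sketch_dataset()
table(d$class)
```

In this sketch the marginal distributions of a1 and a2 are the same in both classes, so neither attribute is informative on its own, whereas the sign of their product separates the classes almost perfectly.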
## Load astrid and the classifier packages (e1071 provides the SVM)
library(astrid)
library(e1071)
library(randomForest)
## Create a synthetic dataset with the known
## attribute interaction structure
## ((a1, a2), (a3), (a4)), where attribute a4 is just noise.
dataset <- make_synthetic_dataset(N = 500, seed = 42, mg2 = 0.6)
## Perform the analysis using the ASTRID algorithm
res <- analyze_dataset(dataset, classname = "class", classifier = "svm", parallel = TRUE, R = 250)
## Print the results as an HTML table
print_result_table_html(res, full_tree = TRUE)
This gives the following results for the analysis of the synthetic dataset using the SVM classifier:
k | acc | p | a3 | a4 | a2 | a1 |
---|---|---|---|---|---|---|
2 | 0.89 | 0.71 | (A) | (B | B | B) |
3 | 0.88 | 0.78 | (A) | (B) | (C | C) |
4 | 0.73 | 0.00 | (A) | (B) | (C) | (D) |
In this table k is the size (cardinality) of the grouping, acc is the average accuracy of the classifier when trained using a dataset randomised using this grouping, and p is the p-value (statistical significance) of the grouping. The remaining columns each denote one attribute (here a3, a4, a2 and a1). In each row, attributes marked with the same letter belong to the same group.
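
As an illustration of what "randomised using this grouping" can mean, the sketch below permutes each attribute group independently within each class, so that relations inside a group are kept while relations between groups are broken. This is one reading of the description, assumed here rather than taken from the package source, and permute_grouping is a hypothetical helper, not part of the astrid API.

```r
## Hypothetical helper (not part of astrid): permute every attribute group
## independently within each class.
permute_grouping <- function(data, grouping, classname = "class") {
  out <- data
  ## Row indices of the samples belonging to each class
  for (idx in split(seq_len(nrow(data)), data[[classname]])) {
    for (g in grouping) {
      ## One shared permutation per group: the columns of a group move together
      perm <- idx[sample.int(length(idx))]
      out[idx, g] <- data[perm, g, drop = FALSE]
    }
  }
  out
}

set.seed(1)
toy <- data.frame(a1 = 1:6, a2 = (1:6) * 10, a3 = 101:106,
                  class = factor(rep(c("x", "y"), each = 3)))
rnd <- permute_grouping(toy, list(c("a1", "a2"), "a3"))
rnd
```

In the toy output, a2 is still ten times a1 in every row because both attributes are in the same group, while a3 is shuffled separately. No value crosses from one class to the other.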
The results table shows that the maximum-cardinality grouping with a p-value of at least 0.05 is the one for k = 3, namely ((a1, a2), (a3), (a4)). The structure found by the ASTRID algorithm thus matches the model used to create the data.
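
The selection rule applied to the table can be written out in a few lines of base R. The values below are copied by hand from the results table; the data frame res_tab and the threshold name alpha are assumptions for illustration and do not reflect the structure of the object returned by analyze_dataset.

```r
## Values copied by hand from the results table above
res_tab <- data.frame(k   = c(2, 3, 4),
                      acc = c(0.89, 0.88, 0.73),
                      p   = c(0.71, 0.78, 0.00))

alpha <- 0.05

## Keep the groupings with a p-value of at least alpha,
## then take the one with the largest cardinality
ok   <- res_tab[res_tab$p >= alpha, ]
best <- ok[which.max(ok$k), ]
best$k  # 3
```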
The astrid R package is licensed under the MIT license.