An R interface to the Python module Featuretools.
featuretoolsR provides functionality from the Python module featuretools, which aims to automate feature engineering. This package is very much a work in progress, as Featuretools offers a lot of functionality. Any PRs are much appreciated.
The latest stable release is found on CRAN.
You can get the latest development version of featuretoolsR by installing it straight from GitHub: devtools::install_github("magnusfurugard/featuretoolsR").
You'll need a working Python environment with featuretools installed. The recommended way is to use the built-in function install_featuretools(), which automatically sets up a virtual environment for the package and installs featuretools.
All functions in featuretoolsR come with documentation, but it's advised to briefly browse through the Featuretools Python documentation. It covers concepts like entities, relationships and dfs.
An entityset is the set that contains all your entities. To create a set and add an entity straight away, you can use as_entityset.
# Libs
library(featuretoolsR)
library(magrittr)

# Create some mock data
set_1 <- data.frame(key = 1:100, value = sample(letters, 100, replace = TRUE), a = rep(Sys.Date(), 100))
set_2 <- data.frame(key = 1:100, value = sample(LETTERS, 100, replace = TRUE), b = rep(Sys.time(), 100))

# Create entityset
es <- as_entityset(
  set_1,
  index = "key",
  entity_id = "set_1",
  id = "demo",
  time_index = "a"
)
To add more entities (i.e. if you have relational data spread across multiple data.frames), use add_entity. This function is pipe friendly. For this demo case, we'll use set_2.
es <- es %>%
  add_entity(
    df = set_2,
    entity_id = "set_2",
    index = "key",
    time_index = "b"
  )
With relational data, it's useful to define a relationship between two or more entities. This can be done with add_relationship.
es <- es %>%
  add_relationship(
    parent_set = "set_1",
    child_set = "set_2",
    parent_idx = "key",
    child_idx = "key"
  )
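Before declaring a relationship, it can help to confirm that the key columns actually line up. The following is a minimal base R sketch, assuming the usual one-to-many setup where the parent key is unique and every child key appears in the parent. The helper check_keys is hypothetical and not part of featuretoolsR.

```r
# Hypothetical helper (not part of featuretoolsR): checks whether two
# data.frames look suitable for a parent/child relationship on the given keys.
check_keys <- function(parent, child, parent_idx, child_idx) {
  parent_keys <- parent[[parent_idx]]
  child_keys <- child[[child_idx]]
  list(
    # TRUE when no parent key value is repeated
    parent_unique = anyDuplicated(parent_keys) == 0L,
    # TRUE when every child key value also exists in the parent
    no_orphans = all(child_keys %in% parent_keys)
  )
}

# Mock data shaped like the demo sets (time columns omitted for brevity)
set_1 <- data.frame(key = 1:100, value = sample(letters, 100, replace = TRUE))
set_2 <- data.frame(key = 1:100, value = sample(LETTERS, 100, replace = TRUE))

checks <- check_keys(set_1, set_2, parent_idx = "key", child_idx = "key")
stopifnot(checks$parent_unique, checks$no_orphans)
```

If either check fails, it is probably worth fixing the data before calling add_relationship.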
The bread and butter of Featuretools is the dfs function (Deep Feature Synthesis; see the official Featuretools documentation). It will attempt to create features based on the *_primitives you provide (more on primitives below).
ft_matrix <- es %>%
  dfs(
    target_entity = "set_1",
    trans_primitives = c("and", "cum_sum")
  )
To work with the new data.frame/features created by dfs, use tidy_feature_matrix, a function unique to featuretoolsR. A few "nice-to-have" arguments can be passed to clean the new data, such as removing near-zero-variance variables and replacing NaN with NA.
tidy <- tidy_feature_matrix(ft_matrix, remove_nzv = TRUE, nan_is_na = TRUE, clean_names = TRUE)
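To make the cleaning steps concrete, here is a rough base R illustration of the two ideas on a made-up toy data.frame. It is not how tidy_feature_matrix is implemented. As a simplifying assumption, it only drops fully constant columns, which is a cruder rule than a real near-zero-variance filter.

```r
# Toy stand-in for a feature matrix; the values are made up for illustration.
toy <- data.frame(
  id = 1:4,
  ratio = c(0.5, NaN, 2, NaN),
  constant = c(1, 1, 1, 1)
)

# Replace NaN with NA in numeric columns.
# is.nan() is TRUE only for NaN, so existing NA values are left untouched.
numeric_cols <- vapply(toy, is.numeric, logical(1))
toy[numeric_cols] <- lapply(toy[numeric_cols], function(x) {
  x[is.nan(x)] <- NA
  x
})

# Drop columns that hold at most one distinct non-missing value
# (a simplified stand-in for near-zero-variance filtering).
is_constant <- vapply(
  toy,
  function(x) length(unique(x[!is.na(x)])) <= 1,
  logical(1)
)
cleaned <- toy[, !is_constant, drop = FALSE]

names(cleaned)
```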
Featuretools supports a lot of primitives. These are accessible with the function list_primitives(), which returns a data.frame containing the type (aggregation (agg_primitives) or transform (trans_primitives)), the name (in the example above, "and" and "cum_sum"), and a brief description of the primitive itself.
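Since the result is a plain data.frame, ordinary subsetting can narrow it down to the primitives of one type. The sketch below assumes the columns are literally named type and name and that the type values are "aggregation" and "transform"; check the actual output before relying on this. The helper primitive_names and the placeholder data are hypothetical, used so the example runs without a Python environment.

```r
# Hypothetical helper (not part of featuretoolsR): returns the names of all
# primitives of one type from a data.frame with `type` and `name` columns.
primitive_names <- function(primitives, type) {
  primitives$name[primitives$type == type]
}

# Placeholder standing in for the output of list_primitives().
# The descriptions are dummy text, not the real Featuretools descriptions.
example_primitives <- data.frame(
  type = c("aggregation", "aggregation", "transform", "transform"),
  name = c("sum", "mean", "and", "cum_sum"),
  description = rep("placeholder description", 4),
  stringsAsFactors = FALSE
)

transform_names <- primitive_names(example_primitives, "transform")

# With featuretoolsR and featuretools installed, the equivalent call would be:
# primitive_names(list_primitives(), "transform")
```

The resulting character vector could then be passed to the trans_primitives argument of dfs.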
reticulate - an R interface to Python.