DIBclust is an R package for clustering datasets using the Deterministic Information Bottleneck (DIB) method. This package supports datasets with mixed-type variables (nominal, ordinal, and continuous), as well as datasets that are purely continuous or categorical. The DIB approach preserves the most relevant information while forming concise and interpretable clusters, guided by principles from information theory.
You can install the latest version of the package directly from GitHub using devtools:
install.packages("devtools") # Install devtools if not already installed
devtools::install_github("amarkos/DIBclust") # Install DIBclust from GitHub
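After installing, it can help to confirm that the package is available before running the examples. The following is a minimal sketch using only base R functions (`requireNamespace` and `utils::packageVersion`); it is not part of DIBclust itself.

```r
# Check whether DIBclust is available without attaching it.
installed <- requireNamespace("DIBclust", quietly = TRUE)

if (installed) {
  # Report the installed version.
  message("DIBclust version: ", as.character(utils::packageVersion("DIBclust")))
} else {
  message("DIBclust is not installed; run the install commands above.")
}
```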
Below is a comprehensive example demonstrating how to use the package to cluster mixed-type, continuous, and categorical datasets and display the results.
library(DIBclust)
set.seed(123) # Fix the random seed so the simulated data are reproducible
# Example Mixed-Type Data
data <- data.frame(
  cat_var = factor(sample(letters[1:3], 100, replace = TRUE)), # Nominal categorical variable
  ord_var = factor(sample(c("low", "medium", "high"), 100, replace = TRUE),
                   levels = c("low", "medium", "high"),
                   ordered = TRUE), # Ordinal variable
  cont_var1 = rnorm(100), # Continuous variable 1
  cont_var2 = runif(100) # Continuous variable 2
)
# Perform Mixed-Type Clustering
result_mix <- DIBmix(X = data, ncl = 3, catcols = 1:2, contcols = 3:4)
cat("Mixed-Type Clustering Results:\n")
print(result_mix$Cluster)
print(result_mix$Entropy)
print(result_mix$MutualInfo)
# Example Continuous Data
X_cont <- matrix(rnorm(1000), ncol = 5) # 200 observations, 5 features
# Perform Continuous Data Clustering
result_cont <- DIBcont(X = X_cont, ncl = 3, s = -1, nstart = 50)
cat("Continuous Clustering Results:\n")
print(result_cont$Cluster)
print(result_cont$Entropy)
print(result_cont$MutualInfo)
# Example Categorical Data
X_cat <- data.frame(
  Var1 = factor(sample(letters[1:3], 200, replace = TRUE)), # Nominal variable
  Var2 = factor(sample(letters[4:6], 200, replace = TRUE)), # Nominal variable
  Var3 = factor(sample(c("low", "medium", "high"), 200, replace = TRUE),
                levels = c("low", "medium", "high"), ordered = TRUE) # Ordinal variable
)
# Perform Categorical Data Clustering
result_cat <- DIBcat(X = X_cat, ncl = 3, lambda = -1, nstart = 50)
cat("Categorical Clustering Results:\n")
print(result_cat$Cluster)
print(result_cat$Entropy)
print(result_cat$MutualInfo)
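Printing the raw cluster vector is hard to read for more than a few observations. The sketch below shows one way to summarise a clustering with base R: cluster sizes, a cross-tabulation against one variable, and empirical entropy and mutual information computed from those counts. It runs without DIBclust by using a hypothetical stand-in vector `clusters` in place of `result_cat$Cluster`, and it assumes that `$Cluster` is a vector of integer labels. The entropy and mutual information here are plain plug-in estimates in bits, for illustration only; they are not necessarily the quantities the package reports in `$Entropy` and `$MutualInfo`.

```r
set.seed(1)

# Hypothetical stand-ins: replace with result_cat$Cluster and X_cat$Var1.
clusters <- sample(1:3, 200, replace = TRUE)
var1 <- factor(sample(letters[1:3], 200, replace = TRUE))

# Number of observations assigned to each cluster.
sizes <- table(clusters)
print(sizes)

# Plug-in entropy (bits) of the cluster assignment.
p <- as.numeric(sizes) / sum(sizes)
cluster_entropy <- -sum(p[p > 0] * log2(p[p > 0]))

# Cross-tabulate cluster membership against one categorical variable.
xtab <- table(cluster = clusters, Var1 = var1)
print(xtab)

# Plug-in mutual information (bits) between cluster and Var1:
# sum over cells of p(t, x) * log2( p(t, x) / (p(t) * p(x)) ).
joint <- xtab / sum(xtab)
indep <- outer(rowSums(joint), colSums(joint))
nz <- joint > 0
mutual_info <- sum(joint[nz] * log2(joint[nz] / indep[nz]))

cat("Entropy of cluster assignment (bits):", cluster_entropy, "\n")
cat("Mutual information with Var1 (bits):", mutual_info, "\n")
```

With randomly generated labels, as here, the mutual information should be close to zero; for a real clustering, larger values indicate that cluster membership is more informative about the variable.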
You can also find ten classification datasets from the UCI Machine Learning Repository, along with the scripts needed to run them, in this GitHub repository. These can be used to reproduce the results presented in the paper.
Contributions are welcome! If you encounter issues, have suggestions, or would like to enhance the package, please feel free to submit an issue or a pull request on the GitHub repository.
This package is distributed under the GPL-3 License. See the GNU General Public License version 3 for details.