Deterministic Information Bottleneck Method for Clustering Mixed-Type Data

amarkos/DIBclust

DIBclust Package

DIBclust is an R package for clustering datasets using the Deterministic Information Bottleneck (DIB) method. This package supports datasets with mixed-type variables (nominal, ordinal, and continuous), as well as datasets that are purely continuous or categorical. The DIB approach preserves the most relevant information while forming concise and interpretable clusters, guided by principles from information theory.

Installation

You can install the latest version of the package directly from GitHub using devtools:

# Install devtools only if it is not already installed
if (!requireNamespace("devtools", quietly = TRUE)) install.packages("devtools")
devtools::install_github("amarkos/DIBclust")  # Install DIBclust from GitHub

Getting Started

Below is a comprehensive example demonstrating how to use the package for clustering mixed-type, continuous, and categorical datasets, and displaying the results.

library(DIBclust)

set.seed(123)  # Fix the RNG seed so the simulated data below are reproducible

# Example Mixed-Type Data
data <- data.frame(
  cat_var = factor(sample(letters[1:3], 100, replace = TRUE)),      # Nominal categorical variable
  ord_var = factor(sample(c("low", "medium", "high"), 100, replace = TRUE),
                   levels = c("low", "medium", "high"),
                   ordered = TRUE),                                # Ordinal variable
  cont_var1 = rnorm(100),                                          # Continuous variable 1
  cont_var2 = runif(100)                                           # Continuous variable 2
)

# Perform Mixed-Type Clustering
result_mix <- DIBmix(X = data, ncl = 3, catcols = 1:2, contcols = 3:4)
cat("Mixed-Type Clustering Results:\n")
print(result_mix$Cluster)
print(result_mix$Entropy)
print(result_mix$MutualInfo)

# Example Continuous Data
X_cont <- matrix(rnorm(1000), ncol = 5)  # 200 observations, 5 features

# Perform Continuous Data Clustering
result_cont <- DIBcont(X = X_cont, ncl = 3, s = -1, nstart = 50)
cat("Continuous Clustering Results:\n")
print(result_cont$Cluster)
print(result_cont$Entropy)
print(result_cont$MutualInfo)

# Example Categorical Data
X_cat <- data.frame(
  Var1 = factor(sample(letters[1:3], 200, replace = TRUE)),  # Nominal variable
  Var2 = factor(sample(letters[4:6], 200, replace = TRUE)),  # Nominal variable
  Var3 = factor(sample(c("low", "medium", "high"), 200, replace = TRUE),
                levels = c("low", "medium", "high"), ordered = TRUE)  # Ordinal variable
)

# Perform Categorical Data Clustering
result_cat <- DIBcat(X = X_cat, ncl = 3, lambda = -1, nstart = 50)
cat("Categorical Clustering Results:\n")
print(result_cat$Cluster)
print(result_cat$Entropy)
print(result_cat$MutualInfo)
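
The examples above print components named Entropy and MutualInfo. As a rough illustration of the quantities those names refer to, the sketch below estimates Shannon entropy and mutual information from empirical label frequencies in base R. It is not a description of how DIBclust computes its output: the helper names `shannon_entropy` and `mutual_information` are hypothetical, and the choice of base-2 logarithms (bits) is an assumption.

```r
# Hypothetical helpers for illustration only; not part of DIBclust.

# Shannon entropy (in bits) of a vector of discrete labels.
shannon_entropy <- function(labels) {
  p <- as.numeric(table(labels)) / length(labels)
  p <- p[p > 0]                      # drop empty levels to avoid 0 * log(0)
  -sum(p * log2(p))
}

# Mutual information (in bits) between two label vectors of equal length.
mutual_information <- function(a, b) {
  stopifnot(length(a) == length(b))
  joint <- table(a, b) / length(a)   # empirical joint distribution
  indep <- outer(rowSums(joint), colSums(joint))  # product of the marginals
  nz <- joint > 0                    # only non-zero cells contribute
  sum(joint[nz] * log2(joint[nz] / indep[nz]))
}

# Toy labels standing in for a cluster assignment and a categorical variable.
cluster  <- c(1, 1, 2, 2, 3, 3)
category <- c("a", "a", "b", "b", "b", "b")

shannon_entropy(cluster)               # entropy of the cluster assignment
mutual_information(cluster, category)  # information the clusters carry about category
```

A compact partition has low entropy, while a partition that tracks the variables of interest has high mutual information with them. The bottleneck idea is a trade-off between these two quantities.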

This GitHub repository also contains ten classification data sets from the UCI Machine Learning Repository, together with the scripts needed to run them. These can be used to reproduce the results presented in the paper.
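
When a data set comes with known class labels, one common way to check a clustering against them is the adjusted Rand index (ARI). The base-R sketch below is a generic implementation and is not taken from the repository scripts. The function name `adjusted_rand_index` and the toy vectors are hypothetical. Whether the paper reports ARI is not stated here, and the sketch assumes that the Cluster component returned by the DIB functions is a vector with one label per observation.

```r
# Hypothetical helper for illustration only; not part of DIBclust.
# Adjusted Rand index between two labelings of the same observations:
# 1 means identical partitions (up to relabeling), values near 0 mean
# agreement no better than chance.
adjusted_rand_index <- function(labels_a, labels_b) {
  stopifnot(length(labels_a) == length(labels_b))
  tab <- table(labels_a, labels_b)            # contingency table
  sum_cells <- sum(choose(tab, 2))            # pairs grouped together in both
  sum_rows  <- sum(choose(rowSums(tab), 2))   # pairs grouped together in labels_a
  sum_cols  <- sum(choose(colSums(tab), 2))   # pairs grouped together in labels_b
  total     <- choose(length(labels_a), 2)    # all pairs of observations
  expected  <- sum_rows * sum_cols / total    # agreement expected by chance
  max_index <- (sum_rows + sum_cols) / 2
  if (max_index == expected) return(1)        # degenerate case: trivial partitions
  (sum_cells - expected) / (max_index - expected)
}

# Toy example: true classes versus a cluster assignment.
true_class <- c("a", "a", "a", "b", "b", "b")
cluster    <- c(1, 1, 2, 2, 2, 2)
adjusted_rand_index(true_class, cluster)
```

With a fitted model, the second argument would be the returned cluster vector, for example `adjusted_rand_index(true_class, result_mix$Cluster)`, where `true_class` holds the known labels of the data set.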

Contributing

Contributions are welcome! If you encounter issues, have suggestions, or would like to enhance the package, please feel free to submit an issue or a pull request on the GitHub repository.

License

This package is distributed under the GPL-3 License. See the GNU General Public License version 3 for details.
