HoloClean is built on top of PyTorch and Postgres.
HoloClean is a statistical inference engine to impute, clean, and enrich data. As a weakly supervised machine learning system, HoloClean leverages available quality rules, value correlations, reference data, and multiple other signals to build a probabilistic model that accurately captures the data generation process, and uses the model in a variety of data curation tasks. HoloClean lets data practitioners and scientists avoid the time they would otherwise spend building piecemeal cleaning solutions and instead communicate their domain knowledge declaratively, enabling accurate analytics, predictions, and insights from noisy, incomplete, and erroneous data.
On Ubuntu, install Postgres by running
$ apt-get install postgresql postgresql-contrib
On macOS, installation instructions can be found at https://www.postgresql.org/download/macosx/.
To start the Postgres console from your terminal
$ psql --user <username> # you can omit --user <username> to use current user
We then create a database holo and a user holocleanuser (the default settings for HoloClean)
CREATE DATABASE holo;
CREATE USER holocleanuser;
ALTER USER holocleanuser WITH PASSWORD 'abcd1234';
GRANT ALL PRIVILEGES ON DATABASE holo TO holocleanuser;
\c holo
ALTER SCHEMA public OWNER TO holocleanuser;
In general, to connect to the holo database from the Postgres psql console, run
\c holo
HoloClean currently populates the database holo with auxiliary and meta tables.
To clear the database, connect as a root user or as holocleanuser (to a database other than holo, since Postgres cannot drop the database you are connected to) and run
DROP DATABASE holo;
CREATE DATABASE holo;
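The reset can also be scripted instead of typed into the console. The following is a minimal sketch, not part of HoloClean: holo_reset_sql is a hypothetical helper that prints the two statements above so they can be piped to psql. It assumes the default database name holo unless another name is passed.

```shell
# Hypothetical helper (not part of HoloClean): print the SQL that
# drops and recreates the HoloClean database.
# $1: database name (assumed default: holo)
holo_reset_sql() {
    db="${1:-holo}"
    printf 'DROP DATABASE %s;\nCREATE DATABASE %s;\n' "$db" "$db"
}
```

A possible invocation is `holo_reset_sql | psql --user <username> postgres`, where postgres is the maintenance database that a standard installation usually provides; substitute any other existing database if yours differs.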
Install Conda using one of the following methods
For 32-bit machines run
$ wget https://repo.continuum.io/archive/Anaconda-2.3.0-Linux-x86.sh
$ sh Anaconda-2.3.0-Linux-x86.sh
For 64-bit machines run
$ wget https://repo.continuum.io/archive/Anaconda-2.3.0-Linux-x86_64.sh
$ sh Anaconda-2.3.0-Linux-x86_64.sh
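If you would rather not check the architecture by hand, the choice between the two installers can be sketched as a small shell function. anaconda_installer is a hypothetical helper, not something shipped with HoloClean or Anaconda, and the list of architecture names it recognizes is an assumption based on common `uname -m` output.

```shell
# Hypothetical helper: map a machine architecture to the Anaconda
# installer file name used in this README.
# $1: architecture string (defaults to the output of `uname -m`)
anaconda_installer() {
    arch="${1:-$(uname -m)}"
    case "$arch" in
        x86_64|amd64)
            echo "Anaconda-2.3.0-Linux-x86_64.sh" ;;
        i386|i486|i586|i686|x86)
            echo "Anaconda-2.3.0-Linux-x86.sh" ;;
        *)
            # Anything else is outside the two cases this README covers.
            echo "unsupported architecture: $arch" >&2
            return 1 ;;
    esac
}
```

For example, `installer=$(anaconda_installer) && wget "https://repo.continuum.io/archive/$installer" && sh "$installer"` downloads and runs the matching installer, and stops early on an architecture the function does not recognize.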
Follow instructions here to install Anaconda (NOT miniconda).
Create a Python 2.7 conda environment by running
$ conda create -n holo_env python=2.7
Upon starting/restarting your terminal session, you will need to activate your conda environment by running
$ source activate holo_env
NOTE: ensure your environment is activated throughout the installation process.
If you are familiar with Virtualenv, create a new Python 2.7 environment with your preferred Virtualenv wrapper, for example:
- virtualenvwrapper (Bourne-shells)
- virtualfish (fish-shell)
Either follow instructions here or install via pip
$ pip install virtualenv
Create a new directory for a Python 2.7 virtualenv environment
$ mkdir -p holo_env
$ virtualenv --python=python holo_env
where python is a valid reference to a Python 2.7 executable.
Activate the environment
$ source holo_env/bin/activate
NOTE: ensure your environment is activated throughout the installation process.
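Whichever environment manager you use, it can help to confirm that the active interpreter really is Python 2.7 before installing packages. The following is a minimal sketch under that assumption; is_python27 is a hypothetical helper that only inspects a version string and is not part of HoloClean.

```shell
# Hypothetical helper: succeed only if the given version string
# (as printed by `python --version`) names a Python 2.7 release.
# $1: version string, e.g. "Python 2.7.18"
is_python27() {
    case "$1" in
        "Python 2.7"|"Python 2.7."*) return 0 ;;
        *) return 1 ;;
    esac
}
```

A possible check is `is_python27 "$(python --version 2>&1)" || echo "activate holo_env first"`. The `2>&1` matters because Python 2 prints its version to standard error.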
In the project root directory, run the following to install the required packages. Note that this command installs the packages within the activated virtual environment.
$ pip install -r requirements.txt
If you are on macOS, you may need to install the Xcode developer tools using the command xcode-select --install.
See the code in tests/test_holoclean.py for a documented usage of HoloClean.
To run the test script, run the following:
$ cd tests
$ ./start_test.sh
The script sets up the Python path environment for running HoloClean.