(Note that this code builds on and modifies the original HoloClean.)
We introduce a learning framework for the problem of unifying conflicting data across multiple records that refer to the same entity. We call this problem “record fusion”; it generalizes two known problems: “data fusion” and “golden record”. Our approach expresses record fusion as a learning problem over probabilistic models. In contrast to preceding approaches, our method achieves high performance with or without source information for the records, and outperforms state-of-the-art baselines. Furthermore, we show how our learned fusion model can address the scarcity of training data. We show that our framework fuses records with an average precision of ∼98% when source information is available and ∼94% without source information across a diverse array of datasets. We compare our approach to a comprehensive collection of data fusion and entity consolidation methods, ranging from methods that rely on source information to approaches that need none. We show that our approach achieves an average improvement of ∼20 precision points with source information and ∼45 without. In addition, our data augmentation method improves previous approaches by an average of ∼10 precision points.
Record Fusion was tested on Python versions 2.7, 3.6, and 3.7. It requires PostgreSQL version 9.4 or higher.
We describe how to install PostgreSQL and configure it for Record Fusion (creating a database, a user, and setting the required permissions).
A native installation of PostgreSQL runs faster than Docker containers. We explain how to install PostgreSQL and then how to configure it for use with Record Fusion.
On Ubuntu, install PostgreSQL by running
$ apt-get install postgresql postgresql-contrib
For macOS, you can find the installation instructions on https://www.postgresql.org/download/macosx/
By default, Record Fusion needs a database holo and a user holocleanuser with permissions on it.
- Start the PostgreSQL psql console from the terminal using
$ psql --user <username>
You can omit --user <username> to use the current user.
- Create a database holo and a user holocleanuser:
CREATE DATABASE holo;
CREATE USER holocleanuser;
ALTER USER holocleanuser WITH PASSWORD 'abcd1234';
GRANT ALL PRIVILEGES ON DATABASE holo TO holocleanuser;
\c holo
ALTER SCHEMA public OWNER TO holocleanuser;
You can connect to the holo database using the PostgreSQL psql console by running
$ psql -U holocleanuser -W holo
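If you connect from a script rather than from psql, the same settings can be expressed as a PostgreSQL connection URI. The following is a minimal sketch, not part of Record Fusion: the helper name build_dsn is hypothetical, the host localhost is an assumption, and port 5432 is PostgreSQL's default. It uses Python 3's urllib.parse to percent-encode special characters in the user name and password.

```python
from urllib.parse import quote


def build_dsn(db="holo", user="holocleanuser", password="abcd1234",
              host="localhost", port=5432):
    """Return a PostgreSQL connection URI for the given settings.

    The defaults mirror the database, user, and password created above;
    host and port are assumptions for a local server.
    """
    return "postgresql://{}:{}@{}:{}/{}".format(
        quote(user, safe=""),      # percent-encode characters such as '@' or '/'
        quote(password, safe=""),
        host,
        port,
        quote(db, safe=""),
    )


if __name__ == "__main__":
    print(build_dsn())
```

The resulting URI can be passed to psql in place of the database name, or to any client library that accepts libpq-style connection URIs.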
Record Fusion currently populates the database holo with auxiliary and meta tables.
To clear the database, connect as a root user or as holocleanuser to a database other than holo (PostgreSQL cannot drop the database of the current session) and run
DROP DATABASE holo;
CREATE DATABASE holo;
If you are familiar with docker, an easy way to start using Record Fusion is to start a PostgreSQL docker container.
To start a PostgreSQL docker container, run the following command:
docker run --name pghc \
-e POSTGRES_DB=holo -e POSTGRES_USER=holocleanuser -e POSTGRES_PASSWORD=abcd1234 \
-p 5432:5432 \
-d postgres:11
which starts a backend server and creates a database with the required permissions.
You can then use docker start pghc and docker stop pghc to start/stop the container.
Note that the published port 5432 may conflict with existing PostgreSQL servers on the host. Read more about this docker image here.
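To see whether port 5432 is already taken before starting the container, a quick check along these lines may help. This is a minimal sketch using only the Python standard library; the function name port_in_use and the choice of 127.0.0.1 as the host are assumptions, and the check only detects servers that accept TCP connections on that address.

```python
import socket


def port_in_use(port, host="127.0.0.1", timeout=1.0):
    """Return True if something accepts TCP connections on host:port."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.settimeout(timeout)
    try:
        # connect_ex returns 0 on success and an error code otherwise
        return sock.connect_ex((host, port)) == 0
    finally:
        sock.close()


if __name__ == "__main__":
    if port_in_use(5432):
        print("Port 5432 is in use; map the container to another host port.")
    else:
        print("Port 5432 looks free.")
```

If the port is taken, one option is to publish the container on a different host port, for example -p 5433:5432, and connect to that port instead.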
Record Fusion runs on Python 2.7 or 3.6+. We recommend running it from within a virtual environment.
First, download Anaconda (not miniconda) from this link. Follow the steps for your OS and framework.
Second, create a conda environment (python 2.7 or 3.6+). For example, to create a Python 3.6 conda environment, run:
$ conda create -n hc36 python=3.6
Upon starting/restarting your terminal session, you will need to activate your conda environment by running
$ conda activate hc36
If you are familiar with virtualenv, you can use it to create a virtual environment.
For Python 3.6, create a new environment with your preferred virtualenv wrapper, for example:
- virtualenvwrapper (Bourne-shells)
- virtualfish (fish-shell)
Either follow instructions here or install via pip.
$ pip install virtualenv
Then, create a new directory and set up a Python 3.6 virtualenv environment in it
$ mkdir -p hc36
$ virtualenv --python=python3.6 hc36
where python3.6 is a valid reference to a Python 3.6 executable.
Activate the environment
$ source hc36/bin/activate
Note: make sure that the environment is activated throughout the installation process.
When you are done, deactivate it using conda deactivate, source deactivate, or deactivate depending on your version.
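Before installing packages, it can be useful to confirm that the interpreter is a supported version and that an environment is active. The sketch below is an illustration, not part of Record Fusion: the function names are hypothetical, and the detection relies on the VIRTUAL_ENV and CONDA_DEFAULT_ENV environment variables set by virtualenv and conda on activation, with a fallback to comparing sys.prefix against sys.base_prefix. Note that an activated conda base environment also counts as active here.

```python
import os
import sys


def is_supported_python(version_info=None):
    """Return True for Python 2.7 or 3.6+, the versions listed above."""
    major, minor = tuple(version_info or sys.version_info)[:2]
    return (major, minor) == (2, 7) or (major, minor) >= (3, 6)


def in_virtual_env(environ=None):
    """Best-effort check that a conda or virtualenv environment is active."""
    environ = os.environ if environ is None else environ
    # Both tools export a variable when an environment is activated.
    if environ.get("VIRTUAL_ENV") or environ.get("CONDA_DEFAULT_ENV"):
        return True
    # Fallback: inside a virtual environment the prefixes differ
    # (older virtualenv releases set sys.real_prefix instead).
    base_prefix = getattr(sys, "base_prefix", sys.prefix)
    return hasattr(sys, "real_prefix") or base_prefix != sys.prefix


if __name__ == "__main__":
    print("Supported Python version:", is_supported_python())
    print("Environment active:", in_virtual_env())
```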
In the project root directory, run the following to install the required packages. Note that this command installs the packages within the activated virtual environment.
$ pip install -r requirements.txt
Note for macOS Users: you may need to install XCode developer tools using xcode-select --install.