Urban Data Integration - Database

This Java library is part of the Urban Data Integration project. It provides classes and functionality to maintain and transform (open urban) data sets.

Compute Database Column Similarity

The library contains JAR files to compute pairwise similarity between database columns. Computation assumes that a unique term index has been generated a-priori.

Create Unique Term Index

The JAR file TermIndexGenerator.jar is used to generate a set of unique terms in the database. Each term is assigned a unique identifier. With each term the index maintains the assigned data type and a comma-separated list of column:frequency pairs. Each pair denotes the frequency of the term in the identified column.

The possible data types that a term can have are:

1=INTEGER
2=DECIMAL
3=LONG
4=DATE
5=STRING

Usage:

java -jar TermIndexGenerator.jar
  <input-directory> : Directory with column files
  <mem-buffer-size> : Size of the memory-buffer. Once buffer is full intermediate results are written to disk.
  <output-file>     : Output file for term index

Compute Pairwise Column Similarity

Use JAR file ComputeColumnSimilarity.jar to compute pairwise similarity between columns. Requires a term-index-file generated using TermIndexGenerator.jar.

java -jar ComputeColumnSimilarity.jar
  <term-index-file>     : The term-index file generated using TermIndexGenerator.jar
  <similarity-function> : Similarity function [
                            JI          : Jaccard-Index |
                            WJI-COLSIZE : Weighted Jaccard-Index that uses total number of values in a column as normalization scale |
                            WJI-COLMAX  : Weighted Jaccard-Index that uses the most frequent value in a column as normalization scale
                          ]
  <similarity-threshold>: Outputs only column pairs with similarity above the given threshold
  <threads>             : Number of parallel threads to use
  <output-file>         : Output file for similarities. The format is tab-delimited:
                          1) ID of column 1,
                          2) ID of column 2,
                          3) similarity

N-Gram Column Generator

When computing column similarity based on n-grams one first has to transform the database column files into n-gram column files. These files only contain the n-grams for all column values. One then has to use TermIndexGenerator.jar to create a unique n-gram index before column similarity can be computed.

java -jar NGramColumnGenerator.jar
  <input-directory>      : Directory with database column files
  <ngram-size>           : Size of the generated n-grams
  <pad-for-short-values> : Add special padding characters ('$' and '#') at beginning and end of each value [true | false]
  <output-directory>     : Output directory for n-gram column files. The file format is the same as for the input files

Name		Name	Last commit message	Last commit date
Latest commit History 57 Commits
lib		lib
src/main/java/org/urban/data/db		src/main/java/org/urban/data/db
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
pom.xml		pom.xml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Urban Data Integration - Database

Compute Database Column Similarity

Create Unique Term Index

Compute Pairwise Column Similarity

N-Gram Column Generator

About

Releases

Packages

Languages

License

VIDA-NYU/urban-data-db

Folders and files

Latest commit

History

Repository files navigation

Urban Data Integration - Database

Compute Database Column Similarity

Create Unique Term Index

Compute Pairwise Column Similarity

N-Gram Column Generator

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages