This repo hosts source code to train locality-sensitive bucketing (LSB) functions.
A bucketing function
Here we develop a machine-learning framework to automatically learn
-
Environment: Python version >= 3.6
-
Data simulation. The code in `/simulation` generates a set of random pairs of length-$n$ strings $(s,t)$ with various edit distances as needed. Given $d_1, d_2$, each training sample is a tuple $(s,t,y)$, where $y = -1$ if $edit(s,t) \le d_1$ and $y = 1$ if $edit(s,t) \ge d_2$.
-
Model training. The code for $n = 20$ and $n = 100$ is kept in separate folders. `siacnn_models_gpu.py` is a function library (losses, evaluation, model structures, and hash code generation) meant to be imported. `siaincp_runner.py` is a trainer for the Siamese neural network. Parameters can be modified in the files by following the annotations. To train a model, run: `python siaincp_runner.py`
-
Testing and hash code generation. `tester.py` is a quick example that tests the data `seq-n20-ED15-2.txt` with the pretrained models stored in `trained models` and generates the hash codes. Run it with the command `python tester.py`; the hash codes will be stored in a file named `hashcode_20k_40m_(d1,d2)s.hdf5`.
-
Pre-trained models. More pre-trained models are available on Zenodo.
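
The labeling rule from the data simulation step can be illustrated with a minimal sketch. This is not the code in `/simulation`: the function names `edit_distance`, `label_pair`, and `simulate_pairs`, the `ACGT` alphabet, and the substitution-only mutation scheme are all assumptions made for illustration. Pairs whose edit distance falls strictly between $d_1$ and $d_2$ are discarded here, since the rule above assigns them no label.

```python
import random

ALPHABET = "ACGT"  # assumption: the repo may use a different alphabet


def edit_distance(s: str, t: str) -> int:
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, start=1):
        curr = [i]
        for j, b in enumerate(t, start=1):
            curr.append(min(
                prev[j] + 1,             # delete a from s
                curr[j - 1] + 1,         # insert b into s
                prev[j - 1] + (a != b),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def label_pair(s: str, t: str, d1: int, d2: int):
    """Return -1 if edit(s,t) <= d1, 1 if edit(s,t) >= d2, else None."""
    d = edit_distance(s, t)
    if d <= d1:
        return -1
    if d >= d2:
        return 1
    return None


def simulate_pairs(n: int, d1: int, d2: int, count: int, seed: int = 0):
    """Generate `count` labeled tuples (s, t, y) of length-n strings.

    Hypothetical scheme: t is derived from s by substituting a random
    number of positions, which keeps both strings at length n.
    """
    rng = random.Random(seed)
    samples = []
    while len(samples) < count:
        s = "".join(rng.choice(ALPHABET) for _ in range(n))
        t = list(s)
        for pos in rng.sample(range(n), rng.randint(0, n)):
            t[pos] = rng.choice(ALPHABET)
        t = "".join(t)
        y = label_pair(s, t, d1, d2)
        if y is not None:  # skip pairs with d1 < edit(s,t) < d2
            samples.append((s, t, y))
    return samples


if __name__ == "__main__":
    for s, t, y in simulate_pairs(n=20, d1=2, d2=5, count=5):
        print(s, t, y)
```

Substituting positions only bounds the edit distance from above by the number of substitutions, so a real simulator that also uses insertions and deletions would cover a different distribution of pairs.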