Fast, flexible name matching for large datasets
Recommended install via pip
- Create a virtual environment (optional)
- Install nama
pip install git+https://github.com/bradhackinen/nama.git@master
Install from source with conda
- Install Anaconda
- Clone nama
git clone https://github.com/bradhackinen/nama.git
- Enter the conda directory, which contains the conda environment file, with
cd conda
- Create new conda environment with
conda create --name <env-name>
- Activate the new environment with
conda activate <env-name>
- Download & install pytorch-mutex
conda install pytorch-mutex-1.0-cuda.tar.bz2
- Download & install pytorch
conda install pytorch-1.10.2-py3.9_cuda11.3_cudnn8.2.0_0.tar.bz2
- Install the rest of the dependencies with
conda install --file conda_env.txt
- Exit the conda directory with
cd ..
- Install the package with
pip install .
Installing from source with pip
- Clone nama
git clone https://github.com/bradhackinen/nama.git
- Create & activate virtual environment
python -m venv nama_env && source nama_env/bin/activate
- Install dependencies
pip install -r requirements.txt
- Install the package with
pip install ./nama
- Install from the project root directory
pip install .
- Install from another directory
pip install /path-to-project-root
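After any of the install routes above, a quick check that the package is importable can save a confusing traceback later. The following is a minimal sketch using only the standard library; the helper name `is_installed` is hypothetical and not part of nama.

```python
import importlib.util


def is_installed(package_name):
    """Return True if `package_name` can be found by the current interpreter."""
    # find_spec returns None when a top-level package is not importable.
    return importlib.util.find_spec(package_name) is not None


if is_installed("nama"):
    print("nama is available in this environment")
else:
    print("nama was not found; check that the right environment is active")
```

If the check fails right after a successful install, the usual cause is that the install ran in a different virtual or conda environment than the one running Python.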
To import data into the matcher, we can either pass nama a pandas DataFrame with
import nama
training_data = nama.from_df(
    df,
    group_column='group_column',
    string_column='string_column')
print(training_data)
or we can pass nama a .csv file directly
import nama
testing_data = nama.read_csv(
    'path-to-data',
    match_format=match_format,
    group_column=group_column,
    string_column=string_column)
print(testing_data)
See from_df & read_csv for parameters and function details
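Both loaders above take a column of name strings and a column of group labels. As a minimal sketch, the following writes a small .csv in that shape using only the standard library. The column names `group` and `name`, the sample rows, and the idea that rows sharing a group label are variants of the same name are illustrative assumptions, not taken from nama's documentation.

```python
import csv
import os
import tempfile

# Hypothetical sample data: rows sharing a group label are assumed
# to be variants of one underlying name.
rows = [
    {"group": "1", "name": "Acme Corp"},
    {"group": "1", "name": "ACME Corporation"},
    {"group": "2", "name": "Globex Inc"},
]

csv_path = os.path.join(tempfile.mkdtemp(), "names.csv")
with open(csv_path, "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["group", "name"])
    writer.writeheader()
    writer.writerows(rows)

# Read the file back to confirm the layout before handing the path to a loader.
with open(csv_path, newline="", encoding="utf-8") as f:
    loaded = list(csv.DictReader(f))

print(len(loaded), "rows with columns", list(loaded[0].keys()))
```

The resulting path and column names could then be supplied as the path, group_column and string_column arguments of read_csv; the appropriate match_format value for such a file is not covered by this sketch.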
We can initialize a model like so
from nama.embedding_similarity import EmbeddingSimilarityModel
sim = EmbeddingSimilarityModel()
If using a GPU, we need to send the model to the GPU device with
sim.to(gpu_device)
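The snippet above leaves `gpu_device` undefined. One way to choose it is to ask PyTorch whether CUDA is usable and fall back to the CPU otherwise. This sketch assumes the model accepts PyTorch-style device names, which the source does not confirm; the helper name `pick_device_name` is hypothetical.

```python
def pick_device_name():
    """Return "cuda" when PyTorch reports a usable GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        # Without PyTorch there is no GPU backend to query.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


gpu_device = pick_device_name()
print(gpu_device)
```

Falling back to "cpu" keeps the same script usable on machines without a GPU, at the cost of slower training and embedding.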
To train a model, we simply need to specify the training parameters and training data
train_kwargs = {
    'max_epochs': 1,
    'warmup_frac': 0.2,
    'transformer_lr': 1e-5,
    'score_lr': 30,
    'use_counts': False,
    'batch_size': 8,
    'early_stopping': False
}
history_df, val_df = sim.train(training_data, verbose=True, **train_kwargs)
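To get a feel for how batch_size, max_epochs and warmup_frac interact, the arithmetic below estimates step counts. It is a rough sketch under two assumptions the source does not state: that one optimizer step is taken per batch of training strings, and that warmup_frac is the fraction of all steps spent warming up the learning rate. The helper `estimate_steps` and the dataset size of 1000 are hypothetical.

```python
import math


def estimate_steps(n_examples, batch_size, max_epochs, warmup_frac):
    """Estimate total and warmup step counts under the assumptions stated above."""
    steps_per_epoch = math.ceil(n_examples / batch_size)
    total_steps = steps_per_epoch * max_epochs
    warmup_steps = round(total_steps * warmup_frac)
    return total_steps, warmup_steps


# 1000 is a made-up dataset size; the other values mirror train_kwargs above.
total_steps, warmup_steps = estimate_steps(
    1000, batch_size=8, max_epochs=1, warmup_frac=0.2)
print(total_steps, warmup_steps)
```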
We can also save the trained model for later
sim.save("path-to-save-model")
We can use the model we trained above directly like
embeddings = sim.embed(testing_data)
Or load a previously trained model
from nama.embedding_similarity import load_similarity_model
new_sim = load_similarity_model("path-to-saved-model")
embeddings = new_sim.embed(testing_data)
MORE TO COME