This repository documents M.Sc. thesis research titled "SoPa++: Leveraging explainability from hybridized RNN, CNN and weighted finite-state neural architectures". This research was adapted from the original SoPa model in Schwartz, Thomson and Smith (2018), which is distributed under the MIT License.
For more details, check out the following documents:
This repository's code was tested with Python versions
. To sync dependencies, we recommend creating a virtual environment and installing the relevant packages via pip:
pip install -r requirements.txt
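The recommended setup can also be scripted with Python's standard-library venv module. The following is a minimal sketch and not part of this repository: the helper name `create_env` and the environment path are assumptions for illustration, and the pip command is only assembled, not executed.

```python
import sys
import venv
from pathlib import Path


def create_env(env_dir, with_pip=True):
    """Create a virtual environment and return the pip install command for it.

    `create_env` is a hypothetical helper, not a function from this repository.
    """
    env_path = Path(env_dir)
    # with_pip=True bootstraps pip into the new environment via ensurepip
    venv.create(env_path, with_pip=with_pip)
    # The interpreter lives in "Scripts" on Windows and in "bin" elsewhere
    bin_dir = "Scripts" if sys.platform == "win32" else "bin"
    python = env_path / bin_dir / "python"
    # Command to run afterwards, e.g. with subprocess.run(command, check=True)
    return [str(python), "-m", "pip", "install", "-r", "requirements.txt"]
```

Calling `create_env(".venv")` from the repository root would create the environment and return the command that installs the pinned dependencies into it.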
Note: If you intend to use the GPU, the torch dependency in requirements.txt works out-of-the-box with CUDA version 10.2. If you have a different version of CUDA, refer to the official PyTorch webpage for alternative pip installation commands which will provide torch optimized for your CUDA version.
We use R for visualizations integrated with TikZ. Below is the sessionInfo() output, which can be used for replicating our dependencies explicitly.
R version 4.0.4 (2021-02-15)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Arch Linux
Matrix products: default
BLAS: /usr/lib/
LAPACK: /usr/lib/
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] tools stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] RColorBrewer_1.1-2 plyr_1.8.6 reshape2_1.4.4
[4] optparse_1.6.6 tikzDevice_0.12.3.1 rjson_0.2.20
[7] ggh4x_0.1.2.1 ggplot2_3.3.3
Automatically download and prepare GloVe-6B word embeddings and the Facebook Multilingual Task Oriented Dialogue (FMTOD) data set:
bash scripts/
Optional: Manually download our pre-trained models and place the tarball in the models directory (~5 GB download size). Next, execute the following to prepare all models:
bash scripts/
Optional: Initialize git hooks to manage development workflows such as linting shell/R scripts, keeping Python dependencies up to date and formatting the development log:
bash scripts/
i. Preprocessing
For preprocessing the FMTOD data set, we use src/
usage: [-h] [--data-directory <dir_path>] [--disable-upsampling]
[--logging-level {debug,info,warning,error,critical}] [--truecase]
optional arguments:
-h, --help show this help message and exit
optional preprocessing arguments:
--data-directory <dir_path>
Data directory containing clean input data (default:
--disable-upsampling Disable upsampling on the train and validation data
sets (default: False)
--truecase Retain true casing when preprocessing data. Otherwise
data will be lowercased by default (default: False)
optional logging arguments:
--logging-level {debug,info,warning,error,critical}
Set logging level (default: info)
The default workflow cleans the original FMTOD data, lowercases it and upsamples all minority classes. To run the default workflow, execute:
bash scripts/
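The default preprocessing behavior described above can be illustrated as follows. This is a minimal sketch under stated assumptions, not the repository's implementation: the function name `preprocess`, its parameters, and the choice to upsample by random sampling with replacement until every class matches the largest class are all assumptions, and the example sentences and labels are made up.

```python
import random
from collections import Counter, defaultdict


def preprocess(texts, labels, lowercase=True, upsample=True, seed=42):
    """Lowercase texts and upsample minority classes to the majority size."""
    if lowercase:
        texts = [text.lower() for text in texts]
    pairs = list(zip(texts, labels))
    if not upsample or not pairs:
        return pairs
    rng = random.Random(seed)
    # Group the (text, label) pairs by class label
    by_label = defaultdict(list)
    for text, label in pairs:
        by_label[label].append((text, label))
    target = max(len(items) for items in by_label.values())
    balanced = []
    for items in by_label.values():
        balanced.extend(items)
        # Assumption: sample with replacement until this class reaches the target size
        balanced.extend(rng.choices(items, k=target - len(items)))
    return balanced


# Made-up toy data, for illustration only
data = preprocess(
    ["Set Alarm", "Wake me UP", "Weather today"],
    ["alarm", "alarm", "weather"],
)
print(Counter(label for _, label in data))
```

Passing `upsample=False` or `lowercase=False` mirrors the effect of the --disable-upsampling and --truecase flags in this sketch.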
ii. Training
For training the SoPa++ model, we use src/
usage: [-h] --embeddings <file_path> --train-data <file_path>
--train-labels <file_path> --valid-data <file_path>
--valid-labels <file_path> [--batch-size <int>]
[--bias-scale <float>] [--clip-threshold <float>]
[--disable-scheduler] [--disable-tqdm] [--dropout <float>]
[--epochs <int>] [--evaluation-period <int>] [--gpu]
[--gpu-device <str>] [--grid-config <file_path>]
[--grid-training] [--learning-rate <float>]
[--logging-level {debug,info,warning,error,critical}]
[--max-doc-len <int>] [--max-train-instances <int>]
[--models-directory <dir_path>] [--no-wildcards]
[--num-random-iterations <int>] [--only-epoch-eval]
[--patience <int>] [--patterns <str>]
[--scheduler-factor <float>] [--scheduler-patience <int>]
[--seed <int>]
[--semiring {MaxSumSemiring,MaxProductSemiring}]
[--static-embeddings] [--tau-threshold <float>]
[--torch-num-threads <int>] [--tqdm-update-period <int>]
[--wildcard-scale <float>] [--word-dropout <float>]
optional arguments:
-h, --help show this help message and exit
required training arguments:
--embeddings <file_path>
Path to GloVe token embeddings file (default: None)
--train-data <file_path>
Path to train data file (default: None)
--train-labels <file_path>
Path to train labels file (default: None)
--valid-data <file_path>
Path to validation data file (default: None)
--valid-labels <file_path>
Path to validation labels file (default: None)
optional training arguments:
--batch-size <int>
Batch size for training (default: 256)
--clip-threshold <float>
Gradient clipping threshold (default: None)
--disable-scheduler Disable learning rate scheduler which reduces
learning rate on performance plateau (default:
--dropout <float>
Neuron dropout probability (default: 0.2)
--epochs <int>
Maximum number of training epochs (default: 50)
--evaluation-period <int>
Specify after how many training updates should
model evaluation(s) be conducted. Evaluation will
always be conducted at the end of epochs (default:
--learning-rate <float>
Learning rate for Adam optimizer (default: 0.001)
--max-doc-len <int>
Maximum document length allowed (default: None)
--max-train-instances <int>
Maximum number of training instances (default:
--models-directory <dir_path>
Base directory where all models will be saved
(default: ./models)
--only-epoch-eval Only evaluate model at the end of epoch, instead of
evaluation by updates (default: False)
--patience <int>
Number of epochs with no improvement after which
training will be stopped (default: 10)
--scheduler-factor <float>
Factor by which the learning rate will be reduced
(default: 0.1)
--scheduler-patience <int>
Number of epochs with no improvement after which
learning rate will be reduced (default: 5)
--seed <int>
Global random seed for numpy and torch (default:
--word-dropout <float>
Word dropout probability (default: 0.2)
optional grid-training arguments:
--grid-config <file_path>
Path to grid configuration file (default:
--grid-training Use grid-training instead of single-training
(default: False)
--num-random-iterations <int>
Number of random iteration(s) for each grid
instance (default: 10)
optional spp-architecture arguments:
--bias-scale <float>
Scale biases by this parameter (default: 1.0)
--no-wildcards Do not use wildcard transitions (default: False)
--patterns <str>
Pattern lengths and counts with the following
syntax: PatternLength1-PatternCount1_PatternLength2
-PatternCount2_... (default: 6-50_5-50_4-50_3-50)
--semiring {MaxSumSemiring,MaxProductSemiring}
Specify which semiring to use (default:
--static-embeddings Freeze learning of token embeddings (default:
--tau-threshold <float>
Specify value of TauSTE binarizer tau threshold
(default: 0.0)
--wildcard-scale <float>
Scale wildcard(s) by this parameter (default: None)
optional hardware-acceleration arguments:
--gpu Use GPU hardware acceleration (default: False)
--gpu-device <str>
GPU device specification in case --gpu option is
used (default: cuda:0)
--torch-num-threads <int>
Set the number of threads used for CPU intraop
parallelism with PyTorch (default: None)
optional logging arguments:
--logging-level {debug,info,warning,error,critical}
Set logging level (default: info)
optional progress-bar arguments:
--disable-tqdm Disable tqdm progress bars (default: False)
--tqdm-update-period <int>
Specify after how many training updates should the
tqdm progress bar be updated with model diagnostics
(default: 5)
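The --patterns syntax documented above (PatternLength1-PatternCount1_PatternLength2-PatternCount2_...) can be parsed in a few lines. This is a sketch for illustration; the function name `parse_patterns` is an assumption and the repository may parse the string differently.

```python
def parse_patterns(spec):
    """Parse 'Length1-Count1_Length2-Count2_...' into a {length: count} dict."""
    patterns = {}
    for item in spec.split("_"):
        # Each item has the form "<pattern length>-<pattern count>"
        length, count = item.split("-")
        patterns[int(length)] = int(count)
    return patterns


default = parse_patterns("6-50_5-50_4-50_3-50")
print(default)                # {6: 50, 5: 50, 4: 50, 3: 50}
print(sum(default.values()))  # 200 patterns in total
```

With the default value, the string describes 50 patterns each of lengths 6, 5, 4 and 3.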
To train a single SoPa++ model using our defaults on the CPU, execute:
bash scripts/
To train a single SoPa++ model using our defaults on a single GPU, execute:
bash scripts/
To apply grid-based training on SoPa++ models using our defaults on the CPU, execute:
bash scripts/
To apply grid-based training on SoPa++ models using our defaults on a single GPU, execute:
bash scripts/
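As a rough illustration of grid-based training, a grid of argument values is commonly expanded into one configuration per combination. The sketch below assumes a plain Cartesian product over patterns, tau_threshold and seed; the grid values are made up, the helper `expand_grid` is hypothetical, and the format of the repository's actual grid configuration file is not shown here.

```python
from itertools import product

# Hypothetical grid values, for illustration only
grid = {
    "patterns": ["6-50_5-50_4-50_3-50", "6-10_5-10_4-10_3-10"],
    "tau_threshold": [0.0, 0.5],
    "seed": [41, 42],
}


def expand_grid(grid):
    """Return one dict per combination of the grid's argument values."""
    keys = sorted(grid)
    combos = product(*(grid[key] for key in keys))
    return [dict(zip(keys, values)) for values in combos]


configs = expand_grid(grid)
print(len(configs))  # 2 * 2 * 2 = 8 configurations
print(configs[0])
```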
iii. Resume training
For resuming the aforementioned training workflow in case of interruptions, we use src/
usage: [-h] --model-log-directory <dir_path>
[--disable-tqdm] [--gpu] [--gpu-device <str>]
[--grid-training]
[--logging-level {debug,info,warning,error,critical}]
[--torch-num-threads <int>]
[--tqdm-update-period <int>]
optional arguments:
-h, --help show this help message and exit
required training arguments:
--model-log-directory <dir_path>
Base model directory containing model data to be
resumed for training (default: None)
optional grid-training arguments:
--grid-training Use grid-training instead of single-training
(default: False)
optional hardware-acceleration arguments:
--gpu Use GPU hardware acceleration (default: False)
--gpu-device <str>
GPU device specification in case --gpu option is used
(default: cuda:0)
--torch-num-threads <int>
Set the number of threads used for CPU intraop
parallelism with PyTorch (default: None)
optional logging arguments:
--logging-level {debug,info,warning,error,critical}
Set logging level (default: info)
optional progress-bar arguments:
--disable-tqdm Disable tqdm progress bars (default: False)
--tqdm-update-period <int>
Specify after how many training updates should the
tqdm progress bar be updated with model diagnostics
(default: 5)
To resume training of a single SoPa++ model using our defaults on the CPU, execute:
bash scripts/ /path/to/model/log/directory
To resume training of a single SoPa++ model using our defaults on a single GPU, execute:
bash scripts/ /path/to/model/log/directory
To resume grid-based training of SoPa++ models using our defaults on the CPU, execute:
bash scripts/ /path/to/model/log/directory
To resume grid-based training of SoPa++ models using our defaults on a single GPU, execute:
bash scripts/ /path/to/model/log/directory
iv. Evaluation
For evaluating trained SoPa++ model(s), we use src/
usage: [-h] --eval-data <file_path> --eval-labels <file_path>
--model-checkpoint <glob_path> [--batch-size <int>]
[--evaluation-metric {recall,precision,f1-score,accuracy}]
[--evaluation-metric-type {weighted avg,macro avg}]
[--gpu] [--gpu-device <str>] [--grid-evaluation]
[--logging-level {debug,info,warning,error,critical}]
[--max-doc-len <int>] [--output-prefix <str>]
[--torch-num-threads <int>]
optional arguments:
-h, --help show this help message and exit
required evaluation arguments:
--eval-data <file_path>
Path to evaluation data file (default: None)
--eval-labels <file_path>
Path to evaluation labels file (default: None)
--model-checkpoint <glob_path>
Glob path to model checkpoint(s) with '.pt'
extension (default: None)
optional evaluation arguments:
--batch-size <int>
Batch size for evaluation (default: 256)
--max-doc-len <int>
Maximum document length allowed (default: None)
--output-prefix <str>
Prefix for output classification report (default:
optional grid-evaluation arguments:
--evaluation-metric {recall,precision,f1-score,accuracy}
Specify which evaluation metric to use for
comparison (default: f1-score)
--evaluation-metric-type {weighted avg,macro avg}
Specify which type of evaluation metric to use
(default: weighted avg)
--grid-evaluation Use grid-evaluation framework to find/summarize
best model (default: False)
optional hardware-acceleration arguments:
--gpu Use GPU hardware acceleration (default: False)
--gpu-device <str>
GPU device specification in case --gpu option is
used (default: cuda:0)
--torch-num-threads <int>
Set the number of threads used for CPU intraop
parallelism with PyTorch (default: None)
optional logging arguments:
--logging-level {debug,info,warning,error,critical}
Set logging level (default: info)
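The --evaluation-metric and --evaluation-metric-type choices above match the keys of a scikit-learn style classification report dictionary. Assuming reports of that shape (an assumption, since the report format is not shown here), picking the best model in a grid could look like the sketch below. The helper names and all scores are made up for illustration.

```python
def metric_value(report, metric="f1-score", metric_type="weighted avg"):
    """Read one score from a report shaped like scikit-learn's output dict."""
    if metric == "accuracy":
        # scikit-learn stores accuracy as a single top-level number
        return report["accuracy"]
    return report[metric_type][metric]


def select_best(reports, metric="f1-score", metric_type="weighted avg"):
    """Return the name of the report with the highest chosen score."""
    return max(
        reports,
        key=lambda name: metric_value(reports[name], metric, metric_type),
    )


# Made-up scores for two hypothetical checkpoints
reports = {
    "model_a": {
        "accuracy": 0.90,
        "macro avg": {"precision": 0.84, "recall": 0.83, "f1-score": 0.83},
        "weighted avg": {"precision": 0.92, "recall": 0.90, "f1-score": 0.91},
    },
    "model_b": {
        "accuracy": 0.89,
        "macro avg": {"precision": 0.87, "recall": 0.86, "f1-score": 0.86},
        "weighted avg": {"precision": 0.90, "recall": 0.89, "f1-score": 0.89},
    },
}

print(select_best(reports))                           # model_a
print(select_best(reports, metric_type="macro avg"))  # model_b
```

The two calls show that the chosen averaging type can change which model is considered best.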
To evaluate SoPa++ model(s) using our defaults on the CPU, execute:
bash scripts/ "/glob/to/neural/model/*/checkpoint(s)"
To evaluate SoPa++ model(s) using our defaults on a single GPU, execute:
bash scripts/ "/glob/to/neural/model/*/checkpoint(s)"
To evaluate grid-based SoPa++ models using our defaults on the CPU, execute:
bash scripts/ "/glob/to/neural/model/*/checkpoints"
To evaluate grid-based SoPa++ models using our defaults on a single GPU, execute:
bash scripts/ "/glob/to/neural/model/*/checkpoints"
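The commands above pass a quoted glob, so the pattern reaches the script unexpanded by the shell and can match several checkpoints at once. A minimal sketch of how such a pattern could be expanded with the standard library follows; `find_checkpoints` and the directory and file names are hypothetical.

```python
import glob
import os
import tempfile


def find_checkpoints(pattern):
    """Expand a glob pattern and keep only '.pt' files, sorted for determinism."""
    return sorted(
        path
        for path in glob.glob(pattern)
        if path.endswith(".pt") and os.path.isfile(path)
    )


# Demonstration on a throwaway directory tree with made-up names
with tempfile.TemporaryDirectory() as root:
    for run in ("run_a", "run_b"):
        os.makedirs(os.path.join(root, run))
        open(os.path.join(root, run, "model.pt"), "w").close()
    checkpoints = find_checkpoints(os.path.join(root, "*", "*.pt"))
    print(len(checkpoints))  # 2
```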
i. Explanations by simplification
For explaining SoPa++ model(s) by simplifying them into RE proxy model(s), we use src/
usage: [-h] --neural-model-checkpoint <glob_path>
--train-data <file_path> --train-labels
<file_path> --valid-data <file_path>
--valid-labels <file_path> [--atol <float>]
[--batch-size <int>] [--disable-tqdm] [--gpu]
[--gpu-device <str>]
[--logging-level {debug,info,warning,error,critical}]
[--max-doc-len <int>]
[--max-train-instances <int>]
[--torch-num-threads <int>]
[--tqdm-update-period <int>]
optional arguments:
-h, --help show this help message and exit
required explainability arguments:
--neural-model-checkpoint <glob_path>
Glob path to neural model checkpoint(s) with
'.pt' extension (default: None)
--train-data <file_path>
Path to train data file (default: None)
--train-labels <file_path>
Path to train labels file (default: None)
--valid-data <file_path>
Path to validation data file (default: None)
--valid-labels <file_path>
Path to validation labels file (default: None)
optional explainability arguments:
--atol <float>
Specify absolute tolerance when comparing
equivalences between tensors (default: 1e-06)
--batch-size <int>
Batch size for explainability (default: 256)
--max-doc-len <int>
Maximum document length allowed (default: None)
--max-train-instances <int>
Maximum number of training instances (default:
optional hardware-acceleration arguments:
--gpu Use GPU hardware acceleration (default: False)
--gpu-device <str>
GPU device specification in case --gpu option is
used (default: cuda:0)
--torch-num-threads <int>
Set the number of threads used for CPU intraop
parallelism with PyTorch (default: None)
optional logging arguments:
--logging-level {debug,info,warning,error,critical}
Set logging level (default: info)
optional progress-bar arguments:
--disable-tqdm Disable tqdm progress bars (default: False)
--tqdm-update-period <int>
Specify after how many training updates should
the tqdm progress bar be updated with model
diagnostics (default: 5)
To simplify SoPa++ model(s) using our defaults on the CPU, execute:
bash scripts/ "/glob/to/neural/model/*/checkpoint(s)"
To simplify SoPa++ model(s) using our defaults on a GPU, execute:
bash scripts/ "/glob/to/neural/model/*/checkpoint(s)"
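The --atol option sets an absolute tolerance for deciding whether two tensors count as equivalent. The idea can be sketched on plain Python lists as below. This is an illustration only: the helper `allclose` is hypothetical, it ignores any relative tolerance, and the comparison function the repository actually calls is not shown here.

```python
import math


def allclose(a, b, atol=1e-06):
    """True if two equally long sequences match element-wise within atol."""
    return len(a) == len(b) and all(
        # rel_tol=0.0 makes this a purely absolute comparison
        math.isclose(x, y, rel_tol=0.0, abs_tol=atol)
        for x, y in zip(a, b)
    )


print(allclose([1.0, 2.0], [1.0 + 5e-07, 2.0]))  # True: difference is below atol
print(allclose([1.0, 2.0], [1.0 + 1e-05, 2.0]))  # False: difference exceeds atol
```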
ii. Compression
For compressing RE proxy model(s), we use src/
usage: [-h] --regex-model-checkpoint <glob_path>
[--disable-tqdm]
[--logging-level {debug,info,warning,error,critical}]
[--tqdm-update-period <int>]
optional arguments:
-h, --help show this help message and exit
required explainability arguments:
--regex-model-checkpoint <glob_path>
Glob path to regex model checkpoint(s) with '.pt'
extension (default: None)
optional logging arguments:
--logging-level {debug,info,warning,error,critical}
Set logging level (default: info)
optional progress-bar arguments:
--disable-tqdm Disable tqdm progress bars (default: False)
--tqdm-update-period <int>
Specify after how many training updates should the
tqdm progress bar be updated with model
diagnostics (default: 5)
To compress RE proxy model(s) using our defaults on the CPU, execute:
bash scripts/ "/glob/to/regex/model/*/checkpoint(s)"
iii. Evaluation
For evaluating RE proxy model(s), we use src/
usage: [-h] --eval-data <file_path> --eval-labels
<file_path> --model-checkpoint <glob_path>
[--batch-size <int>] [--disable-tqdm] [--gpu]
[--gpu-device <str>]
[--logging-level {debug,info,warning,error,critical}]
[--max-doc-len <int>] [--output-prefix <str>]
[--torch-num-threads <int>]
[--tqdm-update-period <int>]
optional arguments:
-h, --help show this help message and exit
required evaluation arguments:
--eval-data <file_path>
Path to evaluation data file (default: None)
--eval-labels <file_path>
Path to evaluation labels file (default: None)
--model-checkpoint <glob_path>
Glob path to model checkpoint(s) with '.pt' extension
(default: None)
optional evaluation arguments:
--batch-size <int>
Batch size for evaluation (default: 256)
--max-doc-len <int>
Maximum document length allowed (default: None)
--output-prefix <str>
Prefix for output classification report (default:
optional hardware-acceleration arguments:
--gpu Use GPU hardware acceleration (default: False)
--gpu-device <str>
GPU device specification in case --gpu option is used
(default: cuda:0)
--torch-num-threads <int>
Set the number of threads used for CPU intraop
parallelism with PyTorch (default: None)
optional logging arguments:
--logging-level {debug,info,warning,error,critical}
Set logging level (default: info)
optional progress-bar arguments:
--disable-tqdm Disable tqdm progress bars (default: False)
--tqdm-update-period <int>
Specify after how many training updates should the
tqdm progress bar be updated with model diagnostics
(default: 5)
To evaluate RE proxy model(s) using our defaults on the CPU, execute:
bash scripts/ "/glob/to/regex/model/*/checkpoint(s)"
To evaluate RE proxy model(s) using our defaults on a single GPU, execute:
bash scripts/ "/glob/to/regex/model/*/checkpoint(s)"
i. Model pair comparison
For comparing SoPa++ and RE proxy model pair(s), we use src/
usage: [-h] --eval-data <file_path> --eval-labels
<file_path> --model-log-directory <glob_path>
[--atol <float>] [--batch-size <int>]
[--disable-tqdm] [--gpu] [--gpu-device <str>]
[--logging-level {debug,info,warning,error,critical}]
[--max-doc-len <int>] [--output-prefix <str>]
[--torch-num-threads <int>]
[--tqdm-update-period <int>]
optional arguments:
-h, --help show this help message and exit
required evaluation arguments:
--eval-data <file_path>
Path to evaluation data file (default: None)
--eval-labels <file_path>
Path to evaluation labels file (default: None)
--model-log-directory <glob_path>
Glob path to model log directory/directories which
contain both the best neural and compressed regex
models (default: None)
optional evaluation arguments:
--atol <float>
Specify absolute tolerance when comparing
equivalences between tensors (default: 1e-06)
--batch-size <int>
Batch size for evaluation (default: 256)
--max-doc-len <int>
Maximum document length allowed (default: None)
--output-prefix <str>
Prefix for output classification report (default:
optional hardware-acceleration arguments:
--gpu Use GPU hardware acceleration (default: False)
--gpu-device <str>
GPU device specification in case --gpu option is used
(default: cuda:0)
--torch-num-threads <int>
Set the number of threads used for CPU intraop
parallelism with PyTorch (default: None)
optional logging arguments:
--logging-level {debug,info,warning,error,critical}
Set logging level (default: info)
optional progress-bar arguments:
--disable-tqdm Disable tqdm progress bars (default: False)
--tqdm-update-period <int>
Specify after how many training updates should the
tqdm progress bar be updated with model diagnostics
(default: 5)
To compare SoPa++ and RE proxy model pair(s) using our defaults on the CPU, execute:
bash scripts/ "/glob/to/model/log/*/director(ies)"
To compare SoPa++ and RE proxy model pair(s) using our defaults on a GPU, execute:
bash scripts/ "/glob/to/model/log/*/director(ies)"
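One simple way to compare a SoPa++ model with its RE proxy is to measure how often the two predict the same class on the same inputs. The sketch below shows that idea only; the helper `agreement_rate` is hypothetical, the predictions are made up, and the comparison script may compute different or additional quantities.

```python
def agreement_rate(neural_preds, regex_preds):
    """Fraction of inputs on which both models predict the same class."""
    if len(neural_preds) != len(regex_preds):
        raise ValueError("Both models must be evaluated on the same inputs")
    if not neural_preds:
        return 0.0
    matches = sum(n == r for n, r in zip(neural_preds, regex_preds))
    return matches / len(neural_preds)


# Made-up class predictions for four inputs
print(agreement_rate([0, 1, 2, 1], [0, 1, 1, 1]))  # 0.75
```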
ii. FMTOD summary statistics
For visualizing the FMTOD data set summary statistics, we apply functions from src/visualize_fmtod.R. This workflow is wrapped using scripts/
Usage: [-h|--help]
Visualize FMTOD data set summary statistics
Optional arguments:
-h, --help Show this help message and exit
To visualize the FMTOD data set summary statistics, simply execute:
bash scripts/
iii. Grid-based training
For visualizing grid-based training performance, we use src/ to convert TensorBoard event logs to CSV files and apply functions from src/visualize_grid.R to plot them. These two scripts are bound together by scripts/
Usage: [-h|--help] tb_event_directory
Visualize grid training performance for SoPa++ models,
given that grid allows for the following varying arguments:
patterns, tau_threshold, seed
Optional arguments:
-h, --help Show this help message and exit
Required arguments:
tb_event_directory <glob_path> Tensorboard event log directory/
To produce a facet-based visualization of grid-based training, simply execute:
bash scripts/ "/glob/to/tb/event/*/director(ies)"
Note: This script has been hard-coded for grid-based training scenarios where only the following three training/model arguments are varied: patterns, tau_threshold and seed.
iv. Grid-based evaluation
For visualizing grid-based evaluation performance and model-pair distances, we apply functions from src/visualize_grid.R. This workflow is wrapped using scripts/
Usage: [-h|--help] model_log_directory
Visualize grid evaluations for SoPa++ and regex model pairs, given
the grid-search allows for the following varying arguments:
patterns, tau_threshold, seed
Optional arguments:
-h, --help Show this help message and exit
Required arguments:
model_log_directory <glob_path> Model log directory/directories
containing SoPa++ and regex models,
as well as all evaluation json's
To produce a facet-based visualization of grid-based evaluation, simply execute:
bash scripts/ "/glob/to/model/log/*/director(ies)"
Note: This script has been hard-coded for grid-based evaluation scenarios where only the following three training/model arguments are varied: patterns, tau_threshold and seed.
v. TauSTE neurons and RE samples
For visualizing TauSTE neurons and RE samples, we use src/
usage: [-h] --class-mapping-config <file_path>
--regex-model-checkpoint <glob_path>
[--disable-tqdm]
[--logging-level {debug,info,warning,error,critical}]
[--max-num-regex <int>]
[--max-transition-tokens <int>] [--only-neurons]
[--seed <int>] [--tqdm-update-period <int>]
optional arguments:
-h, --help show this help message and exit
required visualization arguments:
--class-mapping-config <file_path>
Path to class mapping configuration (default:
--regex-model-checkpoint <glob_path>
Glob path to regex model checkpoint(s) with '.pt'
extension (default: None)
optional visualization arguments:
--max-num-regex <int>
Maximum number of regex's for each TauSTE neuron
(default: 10)
--max-transition-tokens <int>
Maximum number of tokens to display per transition
(default: 5)
--only-neurons Only produces plots of neurons without regex's
(default: False)
--seed <int>
Random seed for numpy (default: 42)
optional logging arguments:
--logging-level {debug,info,warning,error,critical}
Set logging level (default: info)
optional progress-bar arguments:
--disable-tqdm Disable tqdm progress bars (default: False)
--tqdm-update-period <int>
Specify after how many training updates should the
tqdm progress bar be updated with model
diagnostics (default: 5)
To visualize TauSTE neurons and corresponding activating RE samples, execute the following:
bash scripts/ "/glob/to/regex/model/*/checkpoint(s)"
To visualize only TauSTE neurons, execute the following:
bash scripts/ "/glob/to/regex/model/*/checkpoint(s)"
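For readers unfamiliar with the TauSTE neurons being visualized, the --tau-threshold training argument describes a binarizer with a threshold tau. A minimal sketch of the forward pass of such a threshold binarizer is given below. The helper `tau_binarize` is hypothetical, the strict inequality is an assumption, and the sketch does not model how gradients are handled during training.

```python
def tau_binarize(values, tau=0.0):
    """Map each value to 1.0 if it exceeds tau, otherwise to 0.0."""
    # Assumption: strict inequality, so a value equal to tau maps to 0.0
    return [1.0 if value > tau else 0.0 for value in values]


print(tau_binarize([-1.0, 0.0, 0.3, 2.0]))           # [0.0, 0.0, 1.0, 1.0]
print(tau_binarize([-1.0, 0.0, 0.3, 2.0], tau=0.5))  # [0.0, 0.0, 0.0, 1.0]
```

Raising tau makes each neuron harder to activate, which is why the default of 0.0 and larger values give different binary outputs in the example.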