Natural Language Processing
Area under the ROC Curve (AUC) - a single scalar value that measures the overall performance of a binary classifier; obtained by plotting the true positive rate (sensitivity) as a function of the false positive rate (1 - specificity) across classification thresholds
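AUC can also be read as the probability that a randomly chosen positive example is scored higher than a randomly chosen negative one; a minimal sketch of that rank-based computation (function name `auc` is illustrative, not from the source):

```python
def auc(scores, labels):
    """AUC as the probability that a randomly chosen positive example
    is scored higher than a randomly chosen negative one (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# A perfect ranker scores every positive above every negative -> AUC = 1.0
print(auc([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0]))  # 1.0
```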
Bootstrapping - random sampling with replacement from the available training data
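A minimal sketch of bootstrapping as defined above: draw a sample the same size as the training data, uniformly with replacement (helper name `bootstrap_sample` is illustrative):

```python
import random

def bootstrap_sample(data, rng=None):
    """Draw a bootstrap sample: same size as the original data,
    sampled uniformly with replacement (duplicates are expected)."""
    rng = rng or random.Random()
    return [rng.choice(data) for _ in range(len(data))]

data = list(range(10))
sample = bootstrap_sample(data, random.Random(0))
# Every drawn item comes from the original data; some items repeat,
# others are left out entirely.
```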
Confusion Matrices - N x N matrix used for evaluating the performance of a classification model, where N is the number of target classes. True Positive, True Negative, False Positive, and False Negative counts are reported, allowing for the calculation of Recall, Precision, Accuracy, and AUC.
Continuous Bag of Words (CBOW) - word embedding model architecture that tries to predict the current target word (the center word) based on the source context words (surrounding words).
Cosine Similarity - a metric to measure the text-similarity between two documents, words or phrases after they have been converted to a vector. Calculated by the dot product of two vectors divided by the product of the two vectors' lengths (magnitude).
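The cosine similarity formula stated above translates directly to a few lines of code; a minimal sketch (function name `cosine_similarity` is illustrative):

```python
import math

def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their magnitudes."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine_similarity([1, 0], [1, 0]))  # 1.0 (identical direction)
print(cosine_similarity([1, 0], [0, 1]))  # 0.0 (orthogonal vectors)
```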
Data-Driven Lexicon - a lexicon including Knowledge-Based lexicon terminology and additional terminology gathered through Word2vec word embedding trained on domain specific corpora.
F1-Score - Indicator of a machine learning model's performance. Calculated as the harmonic mean of Precision and Recall. Takes both false positives and false negatives into account.
Knowledge-Based Lexicon - a lexicon curated using UMLS terminology and subject matter expert terminology
Lexicon - a vocabulary, a list of words or a dictionary
Log-likelihood - a measure of a model's goodness of fit; the logarithm of the probability of the observed data under the model
Logistic regression - a supervised learning classification algorithm used to predict the probability of a target variable.
elastic net penalty - a regularized regression method that linearly combines the L1 and L2 penalties
generalized linear model - a linear model related to the response variable via a link function, allowing the magnitude of the variance of each measurement to be a function of its predicted value
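The logistic regression and log-likelihood entries above fit together: the model maps a linear score to a probability via the sigmoid, and log-likelihood measures how well those probabilities match the labels. A minimal sketch (function names are illustrative):

```python
import math

def sigmoid(z):
    """Map a linear score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def log_likelihood(weights, bias, xs, ys):
    """Sum of log p(y | x) under a logistic regression model;
    values are negative, and closer to 0 means a better fit."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
        total += math.log(p) if y == 1 else math.log(1 - p)
    return total

xs, ys = [[1.0], [2.0], [-1.0], [-2.0]], [1, 1, 0, 0]
good = log_likelihood([1.0], 0.0, xs, ys)   # weights aligned with labels
bad = log_likelihood([-1.0], 0.0, xs, ys)   # weights inverted
```

A better-fitting model has the higher (less negative) log-likelihood, which is exactly what maximum-likelihood training optimizes.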
Nearest Neighbor - classifying a data point by looking at the nearest annotated data point based on the cosine similarity distance.
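A minimal sketch of nearest-neighbor classification with cosine similarity as the comparison measure, as described above (names and the toy vectors are illustrative):

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def nearest_neighbor_label(query, annotated):
    """Return the label of the annotated vector most similar to the query."""
    best_vec, best_label = max(annotated, key=lambda pair: cosine(query, pair[0]))
    return best_label

annotated = [([1.0, 0.1], "positive"), ([0.1, 1.0], "negative")]
print(nearest_neighbor_label([0.9, 0.2], annotated))  # positive
```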
Neural Network - a type of machine learning algorithm used to model complex patterns in datasets
Positive Predictive Value (PPV) (Precision) - Indicator of a machine learning model's performance. Calculated by the number of true positives divided by the total number of positive predictions (true positive + false positive)
Recall (Sensitivity) - Indicator of a machine learning model's performance. Calculated by the number of true positives divided by the total number of positives (true positive + false negative)
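The three metrics above follow directly from confusion-matrix counts; a minimal sketch (function names are illustrative):

```python
def precision(tp, fp):
    """True positives divided by all positive predictions."""
    return tp / (tp + fp)

def recall(tp, fn):
    """True positives divided by all actual positives."""
    return tp / (tp + fn)

def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall."""
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r)

# 8 true positives, 2 false positives, 2 false negatives
print(precision(8, 2))    # 0.8
print(recall(8, 2))       # 0.8
print(f1_score(8, 2, 2))  # 0.8
```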
TIUDocuments - Text Integration Utility (TIU) documents found exclusively within the Veterans Affairs database
Vector space - n-dimensional metric space where each document, word, or phrase is represented by an n-dimensional vector. Allows for vector addition and scalar-vector multiplication. Learned by optimizing a neural network that maximizes the log-likelihood of observing frequently co-occurring words and phrases.
Word2Vec Parameters
context window - how many words before and after a target word are included as context words of the target word
embedding dimensionality - size of vectors being assigned
learning rate - step size controlling how much the model's weights are adjusted at each training update
negative instance sampling - sampling just N negative instances along with the target word instead of sampling the whole vocabulary.
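The context window parameter above determines which (target, context) training pairs the model sees; a minimal sketch of that extraction step (function name `context_pairs` is illustrative, not part of any Word2Vec library):

```python
def context_pairs(tokens, window):
    """Pair each target word with the words up to `window`
    positions before and after it."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

toks = ["the", "quick", "brown", "fox"]
print(context_pairs(toks, 1))
```

Widening the window adds more distant words as context, trading topical breadth for syntactic precision.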
Word2Phrase - algorithm that progressively joins frequently adjacent pairs of words with an '_' character. Can be used multiple times to create multiword phrases known as n-grams.
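A simplified sketch of one Word2Phrase-style pass, joining adjacent pairs that clear a raw count threshold (the real algorithm scores pairs with a discounted co-occurrence statistic; `join_phrases` and `min_count` are illustrative):

```python
from collections import Counter

def join_phrases(tokens, min_count=2):
    """One pass: join adjacent word pairs occurring at least
    `min_count` times with an underscore."""
    pair_counts = Counter(zip(tokens, tokens[1:]))
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and pair_counts[(tokens[i], tokens[i + 1])] >= min_count:
            out.append(tokens[i] + "_" + tokens[i + 1])
            i += 2  # consume both words of the joined pair
        else:
            out.append(tokens[i])
            i += 1
    return out

toks = "new york is big new york is old".split()
print(join_phrases(toks))  # ['new_york', 'is', 'big', 'new_york', 'is', 'old']
```

Running the pass again over the joined output would merge frequent bigram-plus-word sequences into trigrams, which is how longer n-grams are built up.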