This crate is designed to create a general toolkit for natural language processing, a current deficiency in the Rust ecosystem.
The project can be found on crates.io.
Check out the examples folder to see how to create a sentiment lexicon and get the arousal level for a term.
The sentiment analysis was originally designed by Dr. Christopher Healey and then ported to Rust for the purpose of this project.
Basic tokenization is supported right now (string to sentences, string to tokens, term frequencies), but there are plans to expand this to include stop word removal as well.
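
As a rough illustration of the three operations listed above, here is a minimal standard-library sketch. It is not this crate's API: the function names `to_sentences`, `to_tokens`, and `term_frequencies` are hypothetical, and the splitting rules are deliberately naive (sentence splitting breaks on abbreviations such as "Dr.", and tokenization splits contractions such as "don't").

```rust
use std::collections::HashMap;

/// Split text into sentences on '.', '!' or '?'.
/// Hypothetical helper; a naive rule that mis-splits abbreviations.
fn to_sentences(text: &str) -> Vec<String> {
    text.split(|c: char| matches!(c, '.' | '!' | '?'))
        .map(str::trim)
        .filter(|s| !s.is_empty())
        .map(String::from)
        .collect()
}

/// Split text into lowercase tokens on any non-alphanumeric character.
/// Hypothetical helper; punctuation is discarded.
fn to_tokens(text: &str) -> Vec<String> {
    text.split(|c: char| !c.is_alphanumeric())
        .filter(|t| !t.is_empty())
        .map(str::to_lowercase)
        .collect()
}

/// Count how many times each token occurs.
fn term_frequencies(tokens: &[String]) -> HashMap<String, usize> {
    let mut counts = HashMap::new();
    for token in tokens {
        *counts.entry(token.clone()).or_insert(0) += 1;
    }
    counts
}
```

Calling `to_tokens` on a string and passing the result to `term_frequencies` yields a map from each lowercase term to its count, which is the kind of input a TF-IDF weighting step needs.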
Stemming currently uses modified code from rust-stem, but this may switch to the rust-stemmers crate after further research.
More information on the stemming algorithm can be found here.
Term frequency–inverse document frequency (TF-IDF) is an algorithm used to find document similarity. Creating a TF-IDF matrix takes place over two steps:
- Apply a weight, $w_{i,j}$, for every term, $t_i$, in the document, $D_j$. $w_{i,j}$ is defined as $tf_{i,j} \times idf_i$, where $tf_{i,j}$ is the number of occurrences of $t_i$ in $D_j$, and $idf_i$ is the log of the inverse fraction of documents that contain at least one occurrence of $t_i$: $idf_i = \ln(\frac{n}{n_i})$, where $n$ is the total number of documents and $n_i$ is the number of documents containing $t_i$.
- Take the weighted matrix and then normalize each document vector in order to remove the influence of document length.
The weighted, normalized matrix can then be used to find the cosine similarity between documents.
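
The two steps can be sketched with the standard library alone. This is an illustrative sketch, not the crate's implementation: `tfidf_matrix` and `cosine_similarity` are hypothetical names, documents are assumed to be pre-tokenized, and normalization is assumed to mean scaling each document vector to unit Euclidean length. Under that assumption, the cosine similarity of two rows is simply their dot product.

```rust
use std::collections::BTreeSet;

/// Build a TF-IDF matrix with one row per document and one column per term.
/// Returns the sorted vocabulary (column order) alongside the matrix.
/// Hypothetical helper, assuming each document is already tokenized.
fn tfidf_matrix(docs: &[Vec<&str>]) -> (Vec<String>, Vec<Vec<f64>>) {
    // Sorted vocabulary: every distinct term across all documents.
    let vocab: Vec<&str> = docs
        .iter()
        .flatten()
        .copied()
        .collect::<BTreeSet<_>>()
        .into_iter()
        .collect();
    let n = docs.len() as f64;

    // Raw counts: matrix[j][i] holds tf_{i,j}.
    let mut matrix = vec![vec![0.0_f64; vocab.len()]; docs.len()];
    for (j, doc) in docs.iter().enumerate() {
        for term in doc {
            // Every term is in vocab by construction, so the search succeeds.
            let i = vocab.binary_search(term).unwrap();
            matrix[j][i] += 1.0;
        }
    }

    // Step 1: weight each column by idf_i = ln(n / n_i).
    for i in 0..vocab.len() {
        let n_i = matrix.iter().filter(|row| row[i] > 0.0).count() as f64;
        let idf = (n / n_i).ln();
        for row in matrix.iter_mut() {
            row[i] *= idf;
        }
    }

    // Step 2: scale each document vector to unit length.
    // A row whose weights are all zero is left unchanged.
    for row in matrix.iter_mut() {
        let norm = row.iter().map(|w| w * w).sum::<f64>().sqrt();
        if norm > 0.0 {
            for w in row.iter_mut() {
                *w /= norm;
            }
        }
    }

    (vocab.into_iter().map(String::from).collect(), matrix)
}

/// Cosine similarity of two unit-length vectors, which is their dot product.
fn cosine_similarity(a: &[f64], b: &[f64]) -> f64 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}
```

Note that a term appearing in every document gets $idf_i = \ln(1) = 0$, so it contributes nothing to any document vector under this weighting.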
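Normally, calculating the cosine similarity of two document vectors would look like $\cos\theta = \frac{D_i \cdot D_j}{|D_i||D_j|}$. If each document vector has been normalized to unit length, the denominator is 1 and this reduces to the dot product $D_i \cdot D_j$.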
The resulting
Latent Semantic Analysis (LSA) finds document similarity based on the idea of concepts. LSA starts with the
The resulting
- article summary (based on term frequency)
- topic clustering
- sentiment negation