Text normalization includes tokenization, sentence segmentation, and lemmatization
Types are the number of distinct words in a corpus; tokens are the total number of running words (quick count sketch below)
Corpora have datasheets describing the unique conditions under which they were created (language, time, location, speaker social status)
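A quick sanity check of the type/token distinction, as a minimal sketch: whitespace splitting and lowercasing stand in for real tokenization and normalization, and the example sentence is just a toy.

```python
# Toy corpus: count tokens (running words) vs. types (distinct words).
# Whitespace splitting + lowercasing stand in for real tokenization/normalization.
text = "They picnicked by the pool then lay back on the grass and looked at the stars"
words = text.lower().split()

num_tokens = len(words)        # every running word counts
num_types = len(set(words))    # each distinct word counted once

print(f"{num_tokens} tokens, {num_types} types")   # 16 tokens, 14 types
```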
Tokenization
Penn Treebank tokenization is a common standard
The nltk.regexp_tokenize function from the Python-based Natural Language Toolkit (NLTK) is a fast, lightweight regex-based tokenizer (usage sketch after this list)
Most modern (subword) tokenization schemes have two parts: a token learner and a token segmenter
Token learners induce a vocabulary from raw training corpora
Token segmenters use this vocabulary to tokenize raw test sentences
Algorithms: byte-pair encoding (Sennrich et al., 2016), unigram language modeling (Kudo, 2018), and WordPiece (Schuster and Nakajima, 2012); the SentencePiece library implements the first two of the three (Kudo and Richardson, 2018a).
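Two hedged sketches for this section. First, a usage example for nltk.regexp_tokenize; the regex is the illustrative pattern from the NLTK book, not a fixed standard:

```python
import nltk

# Illustrative pattern (adapted from the NLTK book); real systems tune this.
pattern = r'''(?x)            # verbose mode: allow comments and whitespace
      (?:[A-Z]\.)+            # abbreviations, e.g. U.S.A.
    | \w+(?:-\w+)*            # words with optional internal hyphens
    | \$?\d+(?:\.\d+)?%?      # currency and percentages, e.g. $12.40, 82%
    | \.\.\.                  # ellipsis
    | [][.,;"'?():_`-]        # punctuation kept as separate tokens
'''

print(nltk.regexp_tokenize("That U.S.A. poster-print costs $12.40...", pattern))
# ['That', 'U.S.A.', 'poster-print', 'costs', '$12.40', '...']
```

Second, a minimal, unoptimized sketch of a BPE token learner; the function name and structure are mine, and the toy word counts are illustrative. A token segmenter would replay the learned merges, in the order they were learned, on new test text.

```python
from collections import Counter

def learn_bpe(corpus_words, num_merges):
    """Token learner: induce a list of BPE merges from word frequencies."""
    # Represent each word as a tuple of symbols plus an end-of-word marker.
    vocab = Counter()
    for word, freq in corpus_words.items():
        vocab[tuple(word) + ('_',)] += freq

    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)          # most frequent adjacent pair
        merges.append(best)

        # Merge that pair everywhere in the symbolized vocabulary.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges

# Toy corpus (word -> count); returns merges like ('e', 'r'), ('er', '_'), ...
corpus = {'low': 5, 'lowest': 2, 'newer': 6, 'wider': 3, 'new': 2}
print(learn_bpe(corpus, 8))
```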
Lemmatization
Generally uses morphological parsing; morphemes are the smallest meaning-bearing units of a word
Stemmers are sometimes used as a simpler alternative (they just strip word-final affixes), e.g. the Porter stemmer (example below)
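A hedged usage sketch of NLTK's Porter stemmer; the outputs in the comments are approximate, since the exact forms depend on the rule cascade and NLTK version.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Stemming just strips affixes by rule; it is cruder than true lemmatization
# (it will not map "was" to "be", for instance), but it is simple and fast.
for word in ["motoring", "relational", "grasses", "was"]:
    print(word, "->", stemmer.stem(word))
# e.g. motoring -> motor, grasses -> grass, was -> wa (a typical stemming error)
```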
Sentence segmentation
Periods are ambiguous (abbreviations, numbers, ellipses), so sentence segmentation is sometimes framed as a learning problem
Other systems, e.g. the Stanford CoreNLP toolkit (Manning et al., 2014), use deterministic rule-based segmentation (sketch below)
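A minimal rule-based sketch in the spirit of the deterministic approach; the abbreviation list and regex are illustrative only, and real systems handle far more edge cases. NLTK's nltk.sent_tokenize (a pretrained Punkt model) is an off-the-shelf example of the learned approach.

```python
import re

# Naive rule-based splitter: break after ., !, or ? followed by whitespace and
# an uppercase letter, unless the period ends a known abbreviation.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "ms.", "prof.", "e.g.", "i.e.", "etc."}

def split_sentences(text):
    sentences, start = [], 0
    for match in re.finditer(r'[.!?]+\s+(?=[A-Z])', text):
        candidate = text[start:match.end()].strip()
        if candidate.split()[-1].lower() in ABBREVIATIONS:
            continue                      # the period belongs to an abbreviation
        sentences.append(candidate)
        start = match.end()
    if start < len(text):
        sentences.append(text[start:].strip())
    return sentences

print(split_sentences("Dr. Smith arrived at 9 p.m. She was late. Why? Traffic."))
# ['Dr. Smith arrived at 9 p.m.', 'She was late.', 'Why?', 'Traffic.']
```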
Minimum edit distance
Defined as the minimum number of insertions, deletions, or substitutions needed to convert one string into another
Useful for detecting and correcting typos (spelling correction)
Levenshtein distance assigns weights to insertions, deletions, and substitutions; the simplest version costs 1 for each, and an alternative costs 1 for insertions and deletions with substitutions disallowed (equivalent to a substitution cost of 2)
The minimum edit distance algorithm (Wagner and Fischer, 1974) computes this by dynamic programming (sketch after this list)
The algorithm can be extended with a backtrace to recover the alignment (useful in machine translation, for example)
Instead of uniform Levenshtein weights, operation costs can reflect confusability (e.g. adjacent keys on the keyboard) to compute a maximum-probability alignment
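A sketch of the Wagner-Fischer dynamic program; parameter names and defaults are mine, with substitutions defaulting to cost 2 (the no-substitutions variant). Keeping a backpointer in each cell and tracing from D[n][m] back to D[0][0] is what recovers the alignment mentioned above.

```python
def min_edit_distance(source, target, ins_cost=1, del_cost=1, sub_cost=2):
    """Wagner-Fischer DP: D[i][j] = min cost of turning source[:i] into target[:j]."""
    n, m = len(source), len(target)
    D = [[0] * (m + 1) for _ in range(n + 1)]

    # Base cases: converting to or from the empty string.
    for i in range(1, n + 1):
        D[i][0] = D[i - 1][0] + del_cost
    for j in range(1, m + 1):
        D[0][j] = D[0][j - 1] + ins_cost

    # Recurrence: each cell extends a smaller prefix-to-prefix solution.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            same = source[i - 1] == target[j - 1]
            D[i][j] = min(
                D[i - 1][j] + del_cost,                       # delete source[i-1]
                D[i][j - 1] + ins_cost,                       # insert target[j-1]
                D[i - 1][j - 1] + (0 if same else sub_cost),  # copy or substitute
            )
    return D[n][m]

# "intention" -> "execution": distance 8 with substitution cost 2, 5 with cost 1.
print(min_edit_distance("intention", "execution"))               # 8
print(min_edit_distance("intention", "execution", sub_cost=1))   # 5
```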