Text normalization includes tokenization, sentence segmentation, and lemmatization
Types are the number of distinct words in a corpus; tokens are the total number of running words (quick count sketch below)
Corpora have datasheets describing the unique conditions under which they were created (language, time, location, speaker social status)
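A quick sanity check of the type/token distinction, as a minimal sketch: whitespace splitting and lowercasing stand in for real tokenization and normalization, and the example sentence is just a toy.

```python
# Toy corpus: count tokens (running words) vs. types (distinct words).
# Whitespace splitting + lowercasing stand in for real tokenization/normalization.
text = "They picnicked by the pool then lay back on the grass and looked at the stars"
words = text.lower().split()

num_tokens = len(words)        # every running word counts
num_types = len(set(words))    # each distinct word counted once

print(f"{num_tokens} tokens, {num_types} types")   # 16 tokens, 14 types
```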
Tokenization
Penn Treebank tokenization is a common standard
The nltk.regexp_tokenize function from the Python-based Natural Language Toolkit (NLTK) is a fast, lightweight regex-based tokenizer (usage sketch after this list)
Most modern (subword) tokenization schemes have two parts: a token learner and a token segmenter
Token learners induce a vocabulary from raw training corpora
Token segmenters use this vocabulary to tokenize raw test sentences
Algorithms: byte-pair encoding (Sennrich et al., 2016), unigram language modeling (Kudo, 2018), and WordPiece (Schuster and Nakajima, 2012); the SentencePiece library implements the first two of the three (Kudo and Richardson, 2018a).
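Two hedged sketches for this section. First, a usage example for nltk.regexp_tokenize; the regex is the illustrative pattern from the NLTK book, not a fixed standard:

```python
import nltk

# Illustrative pattern (adapted from the NLTK book); real systems tune this.
pattern = r'''(?x)            # verbose mode: allow comments and whitespace
      (?:[A-Z]\.)+            # abbreviations, e.g. U.S.A.
    | \w+(?:-\w+)*            # words with optional internal hyphens
    | \$?\d+(?:\.\d+)?%?      # currency and percentages, e.g. $12.40, 82%
    | \.\.\.                  # ellipsis
    | [][.,;"'?():_`-]        # punctuation kept as separate tokens
'''

print(nltk.regexp_tokenize("That U.S.A. poster-print costs $12.40...", pattern))
# ['That', 'U.S.A.', 'poster-print', 'costs', '$12.40', '...']
```

Second, a minimal, unoptimized sketch of a BPE token learner; the function name and structure are mine, and the toy word counts are illustrative. A token segmenter would replay the learned merges, in the order they were learned, on new test text.

```python
from collections import Counter

def learn_bpe(corpus_words, num_merges):
    """Token learner: induce a list of BPE merges from word frequencies."""
    # Represent each word as a tuple of symbols plus an end-of-word marker.
    vocab = Counter()
    for word, freq in corpus_words.items():
        vocab[tuple(word) + ('_',)] += freq

    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)          # most frequent adjacent pair
        merges.append(best)

        # Merge that pair everywhere in the symbolized vocabulary.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges

# Toy corpus (word -> count); returns merges like ('e', 'r'), ('er', '_'), ...
corpus = {'low': 5, 'lowest': 2, 'newer': 6, 'wider': 3, 'new': 2}
print(learn_bpe(corpus, 8))
```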
Lemmatization
Generally uses morphological parsing; morphemes are the smallest meaning-bearing units of a word
Stemmers are sometimes used as a simpler alternative (they just strip word-final affixes), e.g. the Porter stemmer (example below)
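A hedged usage sketch of NLTK's Porter stemmer; the outputs in the comments are approximate, since the exact forms depend on the rule cascade and NLTK version.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Stemming just strips affixes by rule; it is cruder than true lemmatization
# (it will not map "was" to "be", for instance), but it is simple and fast.
for word in ["motoring", "relational", "grasses", "was"]:
    print(word, "->", stemmer.stem(word))
# e.g. motoring -> motor, grasses -> grass, was -> wa (a typical stemming error)
```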
Sentence segmentation
Periods are ambiguous (abbreviations, numbers, ellipses), so sentence segmentation is sometimes framed as a learning problem
Other systems, e.g. the Stanford CoreNLP toolkit (Manning et al., 2014), use deterministic rule-based segmentation (sketch below)
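A minimal rule-based sketch in the spirit of the deterministic approach; the abbreviation list and regex are illustrative only, and real systems handle far more edge cases. NLTK's nltk.sent_tokenize (a pretrained Punkt model) is an off-the-shelf example of the learned approach.

```python
import re

# Naive rule-based splitter: break after ., !, or ? followed by whitespace and
# an uppercase letter, unless the period ends a known abbreviation.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "ms.", "prof.", "e.g.", "i.e.", "etc."}

def split_sentences(text):
    sentences, start = [], 0
    for match in re.finditer(r'[.!?]+\s+(?=[A-Z])', text):
        candidate = text[start:match.end()].strip()
        if candidate.split()[-1].lower() in ABBREVIATIONS:
            continue                      # the period belongs to an abbreviation
        sentences.append(candidate)
        start = match.end()
    if start < len(text):
        sentences.append(text[start:].strip())
    return sentences

print(split_sentences("Dr. Smith arrived at 9 p.m. She was late. Why? Traffic."))
# ['Dr. Smith arrived at 9 p.m.', 'She was late.', 'Why?', 'Traffic.']
```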
Minimum edit distance
Defined as the minimum number of insertions, deletions, or substitutions needed to convert one string into another
Useful for detecting and correcting typos (spelling correction)
Levenshtein distance assigns weights to insertions, deletions, and substitutions; the simplest version costs 1 for each, and an alternative costs 1 for insertions and deletions with substitutions disallowed (equivalent to a substitution cost of 2)
The minimum edit distance algorithm (Wagner and Fischer, 1974) computes this by dynamic programming (sketch after this list)
The algorithm can be extended with a backtrace to recover the alignment (useful in machine translation, for example)
Instead of uniform Levenshtein weights, operation costs can reflect confusability (e.g. adjacent keys on the keyboard) to compute a maximum-probability alignment
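A sketch of the Wagner-Fischer dynamic program; parameter names and defaults are mine, with substitutions defaulting to cost 2 (the no-substitutions variant). Keeping a backpointer in each cell and tracing from D[n][m] back to D[0][0] is what recovers the alignment mentioned above.

```python
def min_edit_distance(source, target, ins_cost=1, del_cost=1, sub_cost=2):
    """Wagner-Fischer DP: D[i][j] = min cost of turning source[:i] into target[:j]."""
    n, m = len(source), len(target)
    D = [[0] * (m + 1) for _ in range(n + 1)]

    # Base cases: converting to or from the empty string.
    for i in range(1, n + 1):
        D[i][0] = D[i - 1][0] + del_cost
    for j in range(1, m + 1):
        D[0][j] = D[0][j - 1] + ins_cost

    # Recurrence: each cell extends a smaller prefix-to-prefix solution.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            same = source[i - 1] == target[j - 1]
            D[i][j] = min(
                D[i - 1][j] + del_cost,                       # delete source[i-1]
                D[i][j - 1] + ins_cost,                       # insert target[j-1]
                D[i - 1][j - 1] + (0 if same else sub_cost),  # copy or substitute
            )
    return D[n][m]

# "intention" -> "execution": distance 8 with substitution cost 2, 5 with cost 1.
print(min_edit_distance("intention", "execution"))               # 8
print(min_edit_distance("intention", "execution", sub_cost=1))   # 5
```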