Read SLP Chapter 2 #5

Open
rohildshah opened this issue Jan 25, 2024 · 0 comments

  • Text normalization includes tokenization, sentence segmentation, and lemmatization
  • Types are the distinct words in a corpus (the number of types is the vocabulary size); tokens are the total number of running words (quick illustration after this list)
  • Corpora have datasheets describing the unique conditions under which they were created (language, time, location, speaker social status)
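
A quick illustration of the type/token distinction (the example sentence is made up):

```python
# Toy example: count tokens (running words) vs. types (distinct word forms)
text = "the cat sat on the mat and the dog sat too"
tokens = text.split()            # 11 tokens: every occurrence counts
types = set(tokens)              # 8 types: "the" and "sat" count once each
print(len(tokens), len(types))
```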

Tokenization

  • Penn Treebank tokenization is a common standard
  • The nltk.regexp_tokenize function of the Python-based Natural Language Toolkit (NLTK) is a fast, lightweight regex-based tokenizer (example after this list)
  • Most subword tokenization schemes have two parts: a token learner and a token segmenter
  • Token learners induce a vocabulary from raw training corpora
  • Token segmenters use this vocabulary to tokenize raw test sentences
  • Algorithms: byte-pair encoding (Sennrich et al., 2016), unigram language modeling (Kudo, 2018), and WordPiece (Schuster and Nakajima, 2012); the SentencePiece library implements the first two of the three (Kudo and Richardson, 2018a). A BPE sketch follows below.
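
A small, hedged example of the NLTK regex tokenizer mentioned above. The pattern is adapted from the textbook-style example and is illustrative rather than the Penn Treebank standard; non-capturing groups are used so the whole match (not a group) is returned as each token:

```python
import nltk

text = "That U.S.A. poster-print costs $12.40..."
pattern = r'''(?x)              # verbose regex: whitespace and comments ignored
      (?:[A-Z]\.)+              # abbreviations, e.g. U.S.A.
    | \w+(?:-\w+)*              # words with optional internal hyphens
    | \$?\d+(?:\.\d+)?%?        # currency and percentages, e.g. $12.40, 82%
    | \.\.\.                    # ellipsis
    | [][.,;"'?():_`-]          # punctuation as separate tokens
'''
print(nltk.regexp_tokenize(text, pattern))
# expected: ['That', 'U.S.A.', 'poster-print', 'costs', '$12.40', '...']
```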
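
And a minimal sketch of the token learner / token segmenter split, using byte-pair encoding in the style of the Sennrich et al. (2016) reference code. The toy corpus and function names here are invented for illustration:

```python
import re
from collections import Counter

def get_pair_counts(vocab):
    """Count frequencies of adjacent symbol pairs across the vocabulary."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[(symbols[i], symbols[i + 1])] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every occurrence of the symbol pair with the merged symbol."""
    bigram = re.escape(" ".join(pair))
    pattern = re.compile(r"(?<!\S)" + bigram + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

def learn_bpe(word_freqs, num_merges):
    """Token learner: induce an ordered list of merge rules from word frequencies."""
    # Represent each word as space-separated characters plus an end-of-word marker.
    vocab = {" ".join(word) + " </w>": freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)   # most frequent adjacent pair
        vocab = merge_pair(best, vocab)
        merges.append(best)
    return merges

def segment(word, merges):
    """Token segmenter: apply the learned merges, in order, to a new word."""
    symbols = list(word) + ["</w>"]
    for a, b in merges:
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == a and symbols[i + 1] == b:
                symbols[i:i + 2] = [a + b]
            else:
                i += 1
    return symbols

# Tiny made-up training corpus: word -> frequency
corpus = {"low": 5, "lowest": 2, "newer": 6, "wider": 3, "new": 2}
merges = learn_bpe(corpus, num_merges=8)
print(merges)
print(segment("newer", merges))
```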

Lemmatization

  • Generally uses morphological parsing; morphemes are the smallest meaning-bearing units of a word
  • Stemmers are sometimes used for their simplicity (they simply chop off word-final affixes), e.g. the Porter stemmer (example below)
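
A quick example of the Porter stemmer via NLTK (assumes nltk is installed; the word list is made up):

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["running", "flies", "denied", "agreement"]:
    print(word, "->", stemmer.stem(word))
# e.g. "running" stems to "run"; crude suffix stripping can also produce
# non-word stems, which is why stemming is only a rough approximation of
# lemmatization
```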

Sentence segmentation

  • Periods are ambiguous (abbreviations, decimal numbers, etc.), so sentence segmentation is sometimes framed as a machine learning problem
  • Other tools, such as the Stanford CoreNLP toolkit (Manning et al., 2014), use rule-based segmentation (example below)
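
One readily available segmenter is NLTK's sent_tokenize (the Punkt model). A hedged sketch with a made-up example; it requires the Punkt data, which is named 'punkt' or 'punkt_tab' depending on the NLTK version:

```python
import nltk
# nltk.download('punkt')  # or 'punkt_tab' on newer NLTK releases; needed once

text = "Dr. Smith arrived at 3 p.m. on Jan. 25. The meeting had already started."
print(nltk.sent_tokenize(text))
# ideally two sentences: abbreviation periods like "Dr." and "p.m." should
# not trigger a split
```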

Minimum edit distance

  • Defined as the minimum number of insertions, deletions, or substitutions needed to transform one string into another
  • Useful for recognizing and correcting typos (spelling correction)
  • Levenshtein distance assigns a weight to each insertion (I), deletion (D), and substitution (S); the simplest version charges 1 for each operation, while an alternative charges 1 for I and D and disallows substitutions (equivalently, a substitution costs 2)
  • The minimum edit distance algorithm of Wagner and Fischer (1974) computes this by dynamic programming (sketch after this list)
  • This algorithm can be extended to recover the alignment between the two strings (useful in machine translation, for example)
  • Instead of uniform Levenshtein weights, operation costs can be based on keyboard adjacency (how likely one letter is to be mistyped as another) to compute a maximum-probability alignment
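
A minimal sketch of the Wagner-Fischer dynamic program; the cost arguments are illustrative defaults (a substitution cost of 2 matches the Levenshtein variant in which a substitution counts as a deletion plus an insertion):

```python
def min_edit_distance(source, target, ins_cost=1, del_cost=1, sub_cost=2):
    """Wagner-Fischer dynamic program for minimum edit distance."""
    n, m = len(source), len(target)
    # D[i][j] = distance between the first i chars of source and first j of target
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = D[i - 1][0] + del_cost
    for j in range(1, m + 1):
        D[0][j] = D[0][j - 1] + ins_cost
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if source[i - 1] == target[j - 1] else sub_cost
            D[i][j] = min(D[i - 1][j] + del_cost,     # delete from source
                          D[i][j - 1] + ins_cost,     # insert into source
                          D[i - 1][j - 1] + sub)      # substitute (or copy)
    return D[n][m]

print(min_edit_distance("intention", "execution"))  # 8 (5 with sub_cost=1)
```

Backtracing through the cells of D (not shown here) recovers the alignment mentioned above, and non-uniform weights such as keyboard-adjacency costs can be swapped in by changing the cost terms.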