Day 03
Section outline
March 6th, Wednesday (10:30-12:30)
Text normalization
- Word tokenization, character tokenization, and subword tokenization
- Byte-pair encoding algorithm
- Sentence segmentation and case folding
- Stop words, stemming, and lemmatization
- Research papers
Words and meaning
- Lexical semantics
- Distributional semantics
- Review: vectors
- Term-context matrix
References
- Jurafsky and Martin, chapter 2
- Jurafsky and Martin, chapter 6
- Eisenstein, section 14.3
Resources