Day 02
Section outline
- March 6th, Friday (10:30-12:30)
Text normalization
- Word types and word tokens
- Herdan's law and Zipf's law
- Morphology and word forms
- Corpora
- Language identification and spell checking
- Text normalization: contractions, punctuation, and special characters
- Word tokenization, character tokenization, and subword tokenization
- Byte-pair encoding algorithm: learner, encoder and decoder
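As a preview of the last topic, the three parts of byte-pair encoding can be sketched in a few lines of Python. This is a minimal illustration, not the course's reference implementation: the function names (`merge_pair`, `learn_bpe`, `encode`, `decode`), the end-of-word marker `_`, and the toy corpus are all assumptions made for this example, and the input text is assumed to contain no underscores.

```python
from collections import Counter

END = "_"  # assumed end-of-word marker; input words must not contain it


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out


def learn_bpe(corpus, num_merges):
    """Learner: return the ordered list of merges learned from a list of words."""
    # Start from characters, with the end-of-word marker appended to each word.
    vocab = Counter(tuple(word) + (END,) for word in corpus)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # ties go to the pair seen first
        merges.append(best)
        vocab = {tuple(merge_pair(list(s), best)): f for s, f in vocab.items()}
    return merges


def encode(word, merges):
    """Encoder: apply the learned merges, in order, to a new word."""
    symbols = list(word) + [END]
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return symbols


def decode(tokens):
    """Decoder: concatenate tokens and turn end-of-word markers back into spaces."""
    return "".join(tokens).replace(END, " ").strip()


corpus = ["low"] * 5 + ["lowest"] * 2 + ["newer"] * 6 + ["wider"] * 3 + ["new"] * 2
merges = learn_bpe(corpus, num_merges=8)
print(merges)
print(encode("lower", merges))
print(decode(encode("lower", merges)))
```

The encoder replays the merges in the order they were learned, so an unseen word such as "lower" is still segmented into known subword units, and decoding always recovers the original word.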
References
- Jurafsky and Martin, chapter 2
Resources