Section outline

  • March 6th, Friday (10:30-12:30)

    Text normalization

    • Word types and word tokens
    • Herdan's law and Zipf's law
    • Morphology and word forms
    • Corpora
    • Language identification and spell checking
    • Text normalization: contractions, punctuation, and special characters
    • Word tokenization, character tokenization, and subword tokenization
    • Byte-pair encoding algorithm: learner, encoder, and decoder

    References

    • Jurafsky and Martin, chapter 2

    Resources