Section outline

  • March 6th, Wednesday (10:30-12:30)

    Text normalization

    • Word tokenization, character tokenization, and subword tokenization
    • Byte-pair encoding algorithm
    • Sentence segmentation and case folding
    • Stop words, stemming, and lemmatization
    • Research papers

    Words and meaning

    • Lexical semantics
    • Distributional semantics
    • Review: vectors
    • Term-context matrix

    References

    • Jurafsky and Martin, chapter 2
    • Jurafsky and Martin, chapter 6
    • Eisenstein, section 14.3

    Resources