Section outline

  • Content: A detailed description of the course content and prerequisites can be found here.

    Textbook: The textbook adopted for this course is Speech and Language Processing (3rd edition draft, January 6th 2026) by Dan Jurafsky and James H. Martin, available here.

    Logistics: Lectures are on Thursday 10:30-12:30 (room Ce) and on Friday 10:30-12:30 (room Ce).

    Office hours: Thursday 12:30-14:30, email appointment required. Meetings can be held face-to-face or online via this Zoom link.

    • Forum for general news and announcements. Only the lecturer can post in this forum. Subscription is automatic for every student registered for this course.

    • Forum for discussion of technical material presented during the lectures. Any student with a unipd account can post in this forum. Subscription is automatic for every student registered for this course.

    • Forum for discussion of technical material presented during the open laboratory sessions. Any student with a unipd account can post in this forum. Subscription is automatic for every student registered for this course.

    • Forum for project discussion. Any student with a unipd account can post in this forum. Subscription is automatic for every student registered for this course.

  • March 5th, Thursday (10:30-12:30)

    Course administration and presentation

    • Content outline
    • Laboratory sessions
    • Course requirements
    • Textbook
    • Project
    • Coursework

    Natural language processing: An unexpected journey

    • What is natural language processing?
    • Very short history of natural language processing
    • Why is natural language processing tricky?
    • Ambiguity, composition, recursion and hidden structure
    • Language & learning
    • Miscellanea

    References

    • Slides from the lecture

    Resources

  • March 6th, Friday (10:30-12:30)

    Text normalization

    • Word types and word tokens
    • Herdan's law and Zipf's law
    • Morphology and word forms
    • Corpora
    • Language identification and spell checking
    • Text normalization: contractions, punctuation and special characters
    • Word tokenization, character tokenization, and subword tokenization
    • Byte-pair encoding algorithm: learner, encoder and decoder
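
The BPE learner listed above can be sketched as follows. This is a toy illustration of the learner only, assuming the character-level setup with an end-of-word marker used in the textbook; the function name `bpe_learn` and the `_` marker are illustrative choices, not course code.

```python
from collections import Counter

def bpe_learn(corpus, num_merges):
    """Learn BPE merge rules from a list of words (toy sketch).

    Each word starts as a tuple of characters plus an end-of-word
    marker '_'; at every step the most frequent adjacent symbol
    pair is merged into a single new symbol.
    """
    vocab = Counter(tuple(word) + ("_",) for word in corpus)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite every word, replacing the best pair with its merge.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] += freq
        vocab = new_vocab
    return merges

# Example: the learned merge rules are applied greedily, in order,
# by the encoder; the decoder simply concatenates and splits on '_'.
merges = bpe_learn(["low", "low", "lower", "newest", "newest", "widest"], 4)
```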

    References

    • Jurafsky and Martin, chapter 2

    Resources

  • March 12th, Thursday (10:30-12:30)

    Text normalization

    • Byte-pair encoding algorithm: learner, encoder and decoder (cont'd)
    • Sentence segmentation and case folding
    • Stop words, stemming and lemmatization

    Words and meaning

    • Lexical semantics
    • Distributional semantics
    • Count-based embeddings
    • Word2vec and skip-gram
    • Logistic regression
    • Training
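
The skip-gram training step with negative sampling can be sketched as a single SGD update on one (target, context) pair. This is a minimal illustration, not the course implementation; the name `sgns_step` and the plain-list representation of the embedding matrices are assumptions.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sgns_step(W, C, target, context, negatives, lr=0.1):
    """One SGD step of skip-gram with negative sampling (toy sketch).

    W[i] is the target-word embedding of word i, C[i] its context
    embedding.  The loss treats the true context word as a positive
    example (label 1) and the sampled negatives as label 0, scored
    by the logistic of the dot product c . w.
    """
    w = W[target]
    grad_w = [0.0] * len(w)
    for ctx, label in [(context, 1.0)] + [(n, 0.0) for n in negatives]:
        c = C[ctx]
        score = sigmoid(sum(ci * wi for ci, wi in zip(c, w)))
        err = score - label          # d(loss)/d(c . w)
        for k in range(len(w)):
            grad_w[k] += err * c[k]          # accumulate gradient wrt w
            C[ctx][k] -= lr * err * w[k]     # update context embedding
    for k in range(len(w)):
        W[target][k] -= lr * grad_w[k]       # update target embedding
```

Repeated updates push the positive pair's dot product up and the negative pairs' dot products down, which is exactly the classification view of word2vec via logistic regression.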

    References

    • Jurafsky and Martin, chapter 2
    • Jurafsky and Martin, chapter 5

    Resources

  • March 13th, Friday (10:30-12:30)

    Words and meaning

    • Training (cont'd)
    • Practical issues
    • FastText and GloVe
    • Semantic properties of neural word embeddings
    • Evaluation
    • Cross-lingual word embeddings

    References

    • Jurafsky and Martin, chapter 5
    • Voita, NLP Course | For You (web course): Word embeddings
  • March 19th, Thursday (10:30-12:30)

    Statistical language models

    • Language modeling: prediction and generation
    • Language modeling applications
    • Relative frequency estimation
    • N-gram model
    • N-gram probabilities and bias-variance trade-off
    • Practical issues
    • Evaluation: perplexity measure
    • Sampling sentences
    • Smoothing: Laplace and add-k smoothing
    • Stupid backoff and linear interpolation
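
The estimation, smoothing and perplexity topics above fit in one small sketch: a bigram model with add-k smoothing, evaluated by perplexity. Toy code with assumed function names, using `<s>`/`</s>` sentence markers.

```python
import math
from collections import Counter

def train_bigram(sentences):
    """Collect unigram and bigram counts with sentence markers."""
    unigrams, bigrams, vocab = Counter(), Counter(), set()
    for sent in sentences:
        tokens = ["<s>"] + sent + ["</s>"]
        vocab.update(tokens)
        unigrams.update(tokens[:-1])         # contexts only
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams, vocab

def add_k_prob(w_prev, w, unigrams, bigrams, vocab, k=1.0):
    """P(w | w_prev) with add-k smoothing (Laplace when k = 1)."""
    return (bigrams[(w_prev, w)] + k) / (unigrams[w_prev] + k * len(vocab))

def perplexity(sentences, unigrams, bigrams, vocab, k=1.0):
    """Perplexity: exp of the average negative log-probability per token."""
    log_prob, n = 0.0, 0
    for sent in sentences:
        tokens = ["<s>"] + sent + ["</s>"]
        for prev, w in zip(tokens, tokens[1:]):
            log_prob += math.log(add_k_prob(prev, w, unigrams, bigrams, vocab, k))
            n += 1
    return math.exp(-log_prob / n)
```

On a training corpus where every bigram probability is 0.5, the perplexity comes out exactly 2, while unseen bigrams are assigned the smoothed floor and raise the perplexity accordingly.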

    References

    • Jurafsky and Martin, chapter 3

    Resources

  • March 20th, Friday (10:30-12:30)

    Statistical language models

    • Out-of-vocabulary words
    • Limitations of the N-gram model
    • Research papers

    Neural language models (NLM)

    • General architecture for NLM
    • Feedforward NLM: inference
    • Feedforward NLM: training
    • Recurrent NLM: inference
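
Feedforward NLM inference as outlined above can be sketched in a few lines: concatenate the context-word embeddings, apply a tanh hidden layer, then a softmax over the vocabulary. A minimal illustration with assumed names and toy dimensions, not the course implementation.

```python
import math

def softmax(z):
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

def ffnn_lm_step(context_ids, E, W, b1, U, b2):
    """P(w | context) for a feedforward neural LM (toy sketch).

    E[i] is the embedding of word i; the context embeddings are
    concatenated, passed through a tanh hidden layer (W, b1), then
    projected to vocabulary logits (U, b2) and softmaxed.
    """
    x = [v for wid in context_ids for v in E[wid]]          # concatenate
    h = [math.tanh(sum(wi * xi for wi, xi in zip(row, x)) + b)
         for row, b in zip(W, b1)]                          # hidden layer
    logits = [sum(ui * hi for ui, hi in zip(row, h)) + b
              for row, b in zip(U, b2)]                     # output layer
    return softmax(logits)
```

Training fits the same picture: cross-entropy loss on the softmax output, with gradients backpropagated through U, W and the embeddings E.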

    Exercises

    • Subword tokenization: BPE algorithm

    References

    • Jurafsky and Martin, chapter 3
    • Jurafsky and Martin, section 6.5
    • Voita, NLP Course | For You (web course): Language Modeling