Section outline

  • Content: A detailed description of the course content and prerequisites can be found here.

    Textbook: The adopted textbook is Speech and Language Processing (3rd Edition, draft, Jan 6th 2026) by Dan Jurafsky and James H. Martin, available here.

    Logistics: Lectures are on Thursday 10:30-12:30 (room Ce) and on Friday 10:30-12:30 (room Ce).

    Office hours: Thursday 12:30-14:30, by email appointment. Meetings can take place face-to-face or online at this Zoom link.

    • Forum for general news and announcements. Only the lecturer can post in this forum. Subscription to this forum is automatic for every student registered for this course.

    • Forum for discussion of technical matters presented during the lectures. Any student with a unipd account can post in this forum. Subscription to this forum is automatic for every student registered for this course.

    • Forum for discussion of technical matters presented during the open laboratory sessions. Any student with a unipd account can post in this forum. Subscription to this forum is automatic for every student registered for this course.

    • Forum for project discussion. Any student with a unipd account can post in this forum. Subscription to this forum is automatic for every student registered for this course.

  • March 5th, Thursday (10:30-12:30)

    Course administration and presentation

    • Content outline
    • Laboratory sessions
    • Course requirements
    • Textbook
    • Project
    • Coursework

    Natural language processing: An unexpected journey

    • What is natural language processing?
    • Very short history of natural language processing
    • Why is natural language processing tricky?
    • Ambiguity, composition, recursion and hidden structure
    • Language & learning
    • Miscellanea

    References

    • Slides from the lecture

    Resources

  • March 6th, Friday (10:30-12:30)

    Text normalization

    • Word types and word tokens
    • Herdan's law and Zipf's law
    • Morphology and word-form
    • Corpora
    • Language identification and spell checking
    • Text normalization: contraction, punctuation and special characters
    • Word tokenization, character tokenization, and subword tokenization
    • Byte-pair encoding algorithm: learner, encoder and decoder
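
    The BPE learner listed above can be sketched in plain Python (toy corpus; the end-of-word marker "_" and all function names are illustrative, not the textbook's pseudocode):

```python
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs over a {word-as-symbol-tuple: freq} vocab."""
    pairs = Counter()
    for word, freq in vocab.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = pair[0] + pair[1]
    new_vocab = {}
    for word, freq in vocab.items():
        out, i = [], 0
        while i < len(word):
            if i < len(word) - 1 and (word[i], word[i + 1]) == pair:
                out.append(merged)
                i += 2
            else:
                out.append(word[i])
                i += 1
        new_vocab[tuple(out)] = freq
    return new_vocab

def learn_bpe(corpus_counts, num_merges):
    """corpus_counts: {word: frequency}; returns the ordered merge list."""
    vocab = {tuple(w) + ("_",): f for w, f in corpus_counts.items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)   # most frequent adjacent pair
        vocab = merge_pair(best, vocab)
        merges.append(best)
    return merges
```

    The encoder would replay the learned merges, in order, on a new word; the decoder simply concatenates subwords and strips the end-of-word marker.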

    References

    • Jurafsky and Martin, chapter 2

    Resources

  • March 12th, Thursday (10:30-12:30)

    Text normalization

    • Byte-pair encoding algorithm: learner, encoder and decoder (cont'd)
    • Sentence segmentation and case folding
    • Stop words, stemming and lemmatization
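
    The normalization steps above (case folding, stop-word removal, stemming) can be combined into a toy pipeline; the suffix list below is a deliberately crude stand-in for a real stemmer such as Porter's:

```python
def normalize(tokens, stop_words=frozenset({"the", "a", "of"})):
    """Toy pipeline: case folding, stop-word removal, naive suffix
    stripping (illustrative only; not a real stemming algorithm)."""
    out = []
    for tok in tokens:
        t = tok.lower()                      # case folding
        if t in stop_words:                  # stop-word removal
            continue
        for suf in ("ing", "ed", "s"):       # crude "stemming"
            if t.endswith(suf) and len(t) > len(suf) + 2:
                t = t[: -len(suf)]
                break
        out.append(t)
    return out
```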

    Words and meaning

    • Lexical semantics
    • Distributional semantics
    • Count-based embeddings
    • Word2vec and skip-gram
    • Logistic regression
    • Training
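
    One SGD step of skip-gram with negative sampling can be sketched in pure Python; the update is the standard logistic-regression gradient, but variable names, dimensions and learning rate are illustrative:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sgns_step(W, C, target, context, negatives, lr=0.1):
    """One SGD step of skip-gram with negative sampling.
    W: target-word embeddings, C: context-word embeddings (lists of lists).
    The positive pair is pushed toward sigmoid(t.c) = 1, negatives toward 0."""
    t = W[target]
    grad_t = [0.0] * len(t)
    for c_idx, label in [(context, 1.0)] + [(n, 0.0) for n in negatives]:
        c = list(C[c_idx])                    # snapshot before updating
        g = sigmoid(dot(t, c)) - label        # gradient of the logistic loss
        for i in range(len(t)):
            grad_t[i] += g * c[i]             # accumulate gradient w.r.t. t
            C[c_idx][i] -= lr * g * t[i]      # update context embedding
    for i in range(len(t)):
        W[target][i] -= lr * grad_t[i]        # update target embedding
```

    Repeated over a corpus, this drives the dot product of observed (target, context) pairs up and that of sampled negative pairs down.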

    References

    • Jurafsky and Martin, chapter 2
    • Jurafsky and Martin, chapter 5

    Resources

  • March 13th, Friday (10:30-12:30)

    Words and meaning

    • Training (cont'd)
    • Practical issues
    • FastText and GloVe
    • Semantic properties of neural word embeddings
    • Evaluation
    • Cross-lingual word embeddings
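
    The parallelogram analogy test used to probe the semantic properties of embeddings can be sketched with hand-made toy vectors (the embedding values below are fabricated for illustration):

```python
import math

def cosine(u, v):
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)

def analogy(emb, a, b, c):
    """Answer 'a is to b as c is to ?': the word closest to b - a + c,
    excluding the three query words themselves."""
    target = [emb[b][i] - emb[a][i] + emb[c][i] for i in range(len(emb[a]))]
    candidates = [w for w in emb if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(emb[w], target))
```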

    References

    • Jurafsky and Martin, chapter 5
    • Voita, NLP Course | For You (web course): Word embeddings

  • March 19th, Thursday (10:30-12:30)

    Statistical language models

    • Language modeling: prediction and generation
    • Language modeling applications
    • Relative frequency estimation
    • N-gram model
    • N-gram probabilities and bias-variance trade-off
    • Practical issues
    • Evaluation: perplexity measure
    • Sampling sentences
    • Smoothing: Laplace and add-k smoothing
    • Stupid backoff
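
    A bigram model with add-k smoothing and its perplexity can be sketched in a few lines (toy corpus; note that the vocabulary size must count the end-of-sentence symbol):

```python
import math
from collections import Counter

def train_bigram(sentences):
    """Collect unigram (context) and bigram counts with sentence markers."""
    unigrams, bigrams = Counter(), Counter()
    for s in sentences:
        toks = ["<s>"] + s + ["</s>"]
        unigrams.update(toks[:-1])            # contexts only
        bigrams.update(zip(toks, toks[1:]))
    return unigrams, bigrams

def prob_addk(w_prev, w, unigrams, bigrams, vocab_size, k=1.0):
    """Add-k smoothed bigram probability P(w | w_prev)."""
    return (bigrams[(w_prev, w)] + k) / (unigrams[w_prev] + k * vocab_size)

def perplexity(sentence, unigrams, bigrams, vocab_size, k=1.0):
    """exp of the average negative log-probability per predicted token."""
    toks = ["<s>"] + sentence + ["</s>"]
    logp = sum(math.log(prob_addk(p, w, unigrams, bigrams, vocab_size, k))
               for p, w in zip(toks, toks[1:]))
    return math.exp(-logp / (len(toks) - 1))
```

    Frequent sequences from the training corpus get lower perplexity than unseen ones, and smoothing keeps every probability strictly positive.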

    References

    • Jurafsky and Martin, chapter 3

    Resources

  • March 20th, Friday (10:30-12:30)

    Statistical language models

    • Linear interpolation
    • Out-of-vocabulary words
    • Limitations of N-gram model

    Neural language models (NLM)

    • General architecture for NLM
    • Feedforward NLM: inference
    • Feedforward NLM: training
    • Recurrent NLM: inference
    • Recurrent NLM: training
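
    Inference in a feedforward NLM can be sketched as: concatenate the embeddings of a fixed-size context, apply one hidden layer, and softmax over the vocabulary (toy dimensions; the weights in any real model would of course be learned):

```python
import math

def softmax(z):
    m = max(z)                               # subtract max for stability
    e = [math.exp(x - m) for x in z]
    s = sum(e)
    return [x / s for x in e]

def matvec(M, v):
    return [sum(row[i] * v[i] for i in range(len(v))) for row in M]

def ffnn_lm_step(context_ids, E, W, U):
    """Feedforward NLM inference: E embeds the context words, W is the
    hidden layer, U projects to vocabulary logits (toy sketch)."""
    x = [val for i in context_ids for val in E[i]]   # concatenated embeddings
    h = [math.tanh(a) for a in matvec(W, x)]         # hidden layer
    return softmax(matvec(U, h))                     # next-word distribution
```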

    Exercises

    • Subword tokenization: BPE algorithm

    References

    • Jurafsky and Martin, section 6.5
    • Jurafsky and Martin, section 13.2
    • Voita, NLP Course | For You (web course): Language Modeling

    Resources

  • March 26th, Thursday (10:30-12:30)

    Neural language models (NLM)

    • Practical issues: parameter freezing, weight tying, softmax temperature
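
    Softmax temperature, one of the practical issues above, can be illustrated directly: dividing the logits by a temperature below 1 sharpens the output distribution, while a temperature above 1 flattens it:

```python
import math

def softmax_with_temperature(logits, tau=1.0):
    """Temperature-scaled softmax: tau < 1 sharpens, tau > 1 flattens."""
    z = [l / tau for l in logits]
    m = max(z)                              # subtract max for stability
    e = [math.exp(x - m) for x in z]
    s = sum(e)
    return [x / s for x in e]
```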

    Contextual word embeddings

    • Static embeddings vs. contextualized embeddings
    • ELMo
    • BERT: encoder-only model
    • BERT: masked language modeling
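
    The BERT masking recipe (select 15% of tokens; of those, 80% become [MASK], 10% a random token, 10% stay unchanged) can be sketched as follows; the percentages follow the original BERT paper, while function and argument names are illustrative:

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=None):
    """BERT-style masking for masked language modeling.
    Returns (inputs, labels): labels[i] is the original token at masked
    positions and None elsewhere (only masked positions are predicted)."""
    rng = random.Random(seed)
    inputs, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:         # position selected for prediction
            labels[i] = tok
            r = rng.random()
            if r < 0.8:
                inputs[i] = "[MASK]"         # 80%: replace with [MASK]
            elif r < 0.9:
                inputs[i] = rng.choice(vocab)  # 10%: random token
            # remaining 10%: keep the token unchanged
    return inputs, labels
```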

    References

    • Jurafsky and Martin, chapter 9
    • Voita, NLP Course | For You (web course): Language Modeling
    • Slides from the lecture

    Resources

  • March 27th, Friday (10:30-12:30)

    Contextual word embeddings

    • BERT: next sentence prediction
    • Applications of BERT: sentiment analysis
    • Applications of BERT: named-entity recognition
    • GPT-n decoder-only model
    • Sentence-BERT

    References

    • Jurafsky and Martin, chapter 9
    • Slides from the lecture

  • April 2nd, Thursday (10:30-12:30)

    Lab Session I: Static word embeddings

    • Introduction to the gensim library
    • Common operations with word embeddings: lookup, similarity, nearest-neighbor retrieval
    • Visualizing word embeddings: dimensionality reduction with PCA
    • Intrinsic evaluation of word embeddings: word similarity and word analogy benchmarks
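
    The PCA step used for visualization can be sketched without any library, using power iteration on the covariance matrix to find the first principal component (toy sketch; in the lab session itself a library routine would do this):

```python
import math

def top_pc(points, iters=100):
    """First principal component by power iteration: the direction the
    embeddings would be projected onto for a 1-D view (a 2-D plot uses
    the top two components)."""
    n, d = len(points), len(points[0])
    mean = [sum(p[i] for p in points) / n for i in range(d)]
    X = [[p[i] - mean[i] for i in range(d)] for p in points]   # center data
    cov = [[sum(x[i] * x[j] for x in X) / n for j in range(d)]
           for i in range(d)]                                  # covariance
    v = [1.0] * d                             # arbitrary starting vector
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]             # renormalize each iteration
    return v
```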

    Large language models and pretraining

    • Pretraining and transfer learning
    • Large language models
    • Language modeling head
    • Text completion and decoder-only model
    • Casting NLP tasks as text completion

    References

    • Jurafsky and Martin, chapter 7
    • Jurafsky and Martin, chapter 8

    Resources

  • April 9th, Thursday (10:30-12:30)

    Large language models and pretraining

    • Sampling
    • Key-Value cache
    • Pretraining
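
    Top-k sampling with temperature, one common decoding strategy for the "Sampling" topic above, can be sketched as follows (the choices of k and tau are illustrative):

```python
import math
import random

def sample_next(logits, k=2, tau=1.0, rng=None):
    """Top-k sampling with temperature: keep the k highest logits,
    renormalize with a temperature-scaled softmax, sample an index."""
    rng = rng or random.Random()
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    z = [logits[i] / tau for i in top]
    m = max(z)
    e = [math.exp(x - m) for x in z]
    s = sum(e)
    probs = [x / s for x in e]
    r, acc = rng.random(), 0.0
    for i, p in zip(top, probs):              # inverse-CDF sampling
        acc += p
        if r <= acc:
            return i
    return top[-1]
```

    With k = 1 this degenerates to greedy decoding; larger k and higher temperature make the generated text more diverse.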

    References

    • Jurafsky and Martin, chapter 7
    • Jurafsky and Martin, chapter 8

  • April 10th, Friday (10:30-12:30)

    Lab Session II: Hugging Face Transformers

    • General overview of the library
    • Importing and using BERT
    • Using Gemma and chat templates

    Large language models and pretraining

    • Training corpora
    • Scaling laws for LLMs
    • Overview of LLMs
    • Multilingual LLMs
    • Training of multilingual LLMs
    • Evaluation of LLMs
    • Emergent abilities
    • Mixture of Experts
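
    The routing idea behind a mixture-of-experts layer can be sketched with top-1 gating: a linear gate scores the experts and only the best-scoring expert is applied to the input (toy sketch; real MoE layers typically use top-k routing with load balancing):

```python
def moe_forward(x, gate_w, experts):
    """Top-1 mixture-of-experts routing: score each expert with a linear
    gate, then apply only the highest-scoring expert to the input."""
    scores = [sum(w * xi for w, xi in zip(row, x)) for row in gate_w]
    best = max(range(len(scores)), key=lambda i: scores[i])
    return experts[best](x), best             # output and chosen expert index
```

    Because only one expert runs per input, the layer's parameter count can grow with the number of experts while the per-token compute stays roughly constant.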

    References

    • Jurafsky and Martin, chapter 7
    • Jurafsky and Martin, chapter 8
    • Slides from the lecture

    Resources