## Introduction

In this notebook, we train and test a transformer-based model for automatic summarization using the Hugging Face `transformers`, `datasets`, and `evaluate` libraries.

Text summarization is a challenging task in the field of Natural Language Processing, aiming to condense lengthy pieces of text into shorter summaries while preserving the most important information. It finds numerous applications in areas such as news summarization, document summarization, and information retrieval.

In [None]:
!pip install datasets 
!pip install transformers==4.28.0
!pip install py7zr
!pip install evaluate
!pip install rouge_score
!pip install accelerate


In [None]:
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, Seq2SeqTrainingArguments, Seq2SeqTrainer, DataCollatorForSeq2Seq
import evaluate
import numpy as np

## Model

We use T5, an encoder-decoder model pre-trained on a multi-task mixture of unsupervised and supervised tasks, each of which is converted into a text-to-text format. Summarization was one of the tasks included in the pre-training.

The official [T5 documentation page](https://huggingface.co/docs/transformers/model_doc/t5) is well worth reading. In general, the documentation pages are a great resource, with detailed explanations and plenty of useful information on how to use the models.

In [None]:
MODEL_NAME = "t5-small"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)

Let's take a look at the model definition.

In [None]:
model

T5ForConditionalGeneration(
  (shared): Embedding(32128, 512)
  (encoder): T5Stack(
    (embed_tokens): Embedding(32128, 512)
    (block): ModuleList(
      (0): T5Block(
        (layer): ModuleList(
          (0): T5LayerSelfAttention(
            (SelfAttention): T5Attention(
              (q): Linear(in_features=512, out_features=512, bias=False)
              (k): Linear(in_features=512, out_features=512, bias=False)
              (v): Linear(in_features=512, out_features=512, bias=False)
              (o): Linear(in_features=512, out_features=512, bias=False)
              (relative_attention_bias): Embedding(32, 8)
            )
            (layer_norm): T5LayerNorm()
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (1): T5LayerFF(
            (DenseReluDense): T5DenseActDense(
              (wi): Linear(in_features=512, out_features=2048, bias=False)
              (wo): Linear(in_features=2048, out_features=512, bias=False)
              (dropout): Drop

## Dataset

For this notebook we use the [samsum](https://huggingface.co/datasets/samsum) dataset. It contains 16k messenger-like conversations with annotated summaries.






In [None]:
train_data = load_dataset("samsum", split="train")
val_data = load_dataset("samsum", split="validation")
test_data = load_dataset("samsum", split="test")

Let's take a look at an example.

In [None]:
print(train_data[0])

{'id': '13818513', 'dialogue': "Amanda: I baked  cookies. Do you want some?\r\nJerry: Sure!\r\nAmanda: I'll bring you tomorrow :-)", 'summary': 'Amanda baked cookies and will bring Jerry some tomorrow.'}
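
Each `dialogue` is a single string in which turns are separated by `\r\n` and start with the speaker's name. The model consumes the raw string, so nothing below is required for training. As a minimal sketch for inspecting the data, assuming every turn follows the `Speaker: text` pattern (the helper name `split_turns` is ours, not part of any library), the turns can be separated like this:

```python
def split_turns(dialogue):
    """Split a samsum-style dialogue string into (speaker, utterance) pairs."""
    turns = []
    for line in dialogue.splitlines():  # splitlines() handles both \r\n and \n
        if not line.strip():
            continue
        speaker, sep, utterance = line.partition(": ")
        if sep:
            turns.append((speaker, utterance))
        else:
            # Line without a "Speaker: " prefix: keep the text, speaker unknown.
            turns.append((None, line))
    return turns


example = (
    "Amanda: I baked  cookies. Do you want some?\r\n"
    "Jerry: Sure!\r\n"
    "Amanda: I'll bring you tomorrow :-)"
)
turns = split_turns(example)
for speaker, utterance in turns:
    print(f"{speaker!r} says {utterance!r}")
```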



Next we perform the necessary data preprocessing steps.

First, as described in the model documentation, we prepend the prefix `summarize: ` to each input sample. The prefix acts as a prompt that tells the model which task to perform, in this case summarization. Then we use the tokenizer to get the input ids for both the dialogues and the target summaries, truncating them to at most 512 and 128 tokens respectively.
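
To make the two steps concrete without downloading a tokenizer, here is a toy sketch. It assumes a hypothetical whitespace tokenizer (`toy_tokenize`) in place of the real SentencePiece-based T5 tokenizer, and uses very small length limits so the truncation is visible. Only the overall shape of the logic (prefix, tokenize, truncate, attach labels) mirrors the real preprocessing.

```python
PREFIX = "summarize: "


def toy_tokenize(text, max_length):
    """Hypothetical stand-in tokenizer: one token per whitespace-separated word,
    truncated to max_length tokens."""
    return text.split()[:max_length]


def toy_preprocess(samples, max_input_len=8, max_target_len=4):
    """Prefix every dialogue, then tokenize and truncate inputs and targets."""
    inputs = [PREFIX + dialogue for dialogue in samples["dialogue"]]
    return {
        "input_tokens": [toy_tokenize(text, max_input_len) for text in inputs],
        "label_tokens": [toy_tokenize(s, max_target_len) for s in samples["summary"]],
    }


batch = {
    "dialogue": ["Amanda: I baked cookies. Do you want some? Jerry: Sure!"],
    "summary": ["Amanda baked cookies and will bring Jerry some tomorrow."],
}
out = toy_preprocess(batch)
print(out["input_tokens"][0])
print(out["label_tokens"][0])
```

The real tokenizer returns integer ids (plus an attention mask) rather than strings, but the batch-in, batch-out structure is the same, which is why the function can be used with `map(..., batched=True)`.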




In [None]:
PREFIX = 'summarize: '

def preprocess_function(samples):

    inputs = [PREFIX + text for text in samples['dialogue']]
    inputs = tokenizer(inputs, max_length=512, truncation=True)

    # `text_target` tokenizes the summaries as labels; it replaces the
    # deprecated `as_target_tokenizer()` context manager.
    labels = tokenizer(text_target=samples['summary'], max_length=128, truncation=True)
    inputs['labels'] = labels.input_ids

    return inputs

In [None]:
train_data = train_data.map(preprocess_function, batched=True)
val_data = val_data.map(preprocess_function, batched=True)
test_data = test_data.map(preprocess_function, batched=True)

Then, we define our data collator that will automatically do dynamic padding for us.

In [None]:
# Pass the model object (not its name) so the collator can prepare decoder inputs from the labels.
data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model)
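
Dynamic padding means that each batch is padded only up to the length of its own longest sequence, instead of a fixed global length. The sketch below illustrates the idea in plain Python; it is not the library's implementation. The token ids are made up, `pad_batch` is a hypothetical helper, and we assume a pad token id of 0 (as in T5) and a label padding value of -100, which the loss function ignores.

```python
def pad_batch(features, pad_token_id=0, label_pad_token_id=-100):
    """Pad inputs and labels to the longest sequence found in this batch."""
    max_input_len = max(len(f["input_ids"]) for f in features)
    max_label_len = max(len(f["labels"]) for f in features)

    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_pad = max_input_len - len(f["input_ids"])
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_pad)
        # 1 marks real tokens, 0 marks padding.
        batch["attention_mask"].append([1] * len(f["input_ids"]) + [0] * n_pad)
        l_pad = max_label_len - len(f["labels"])
        # Labels are padded with -100 so padded positions do not contribute to the loss.
        batch["labels"].append(f["labels"] + [label_pad_token_id] * l_pad)
    return batch


# Made-up token ids, for illustration only.
features = [
    {"input_ids": [11, 12, 13, 14, 1], "labels": [21, 22, 1]},
    {"input_ids": [15, 16, 1], "labels": [23, 1]},
]
padded = pad_batch(features)
print(padded["input_ids"])
print(padded["attention_mask"])
print(padded["labels"])
```

The -100 padding in the labels is also the reason `compute_metrics` replaces -100 with the pad token id before decoding the references.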

## Evaluation

As the evaluation metric, we use ROUGE (Recall-Oriented Understudy for Gisting Evaluation). It measures the n-gram overlap between the generated summary and one or more reference summaries. The key idea behind ROUGE is to capture how much of the important information in the reference summaries is recalled by the generated summary. [Here](https://medium.com/nlplanet/two-minutes-nlp-learn-the-rouge-metric-by-examples-f179cc285499) is a link with a brief explanation.
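
To build some intuition, here is a simplified sketch of ROUGE-N written from scratch. It is not the `rouge_score` implementation: it lowercases, keeps only alphanumeric tokens, and skips stemming, so its numbers can differ slightly from the ones the library reports. The two strings are the gold and generated summaries from the example inspected at the end of this notebook.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase and keep alphanumeric runs only (a simplification, no stemming)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def ngram_counts(tokens, n):
    """Count all contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(prediction, reference, n=1):
    """Precision, recall and F1 of the n-gram overlap between two texts."""
    pred_counts = ngram_counts(tokenize(prediction), n)
    ref_counts = ngram_counts(tokenize(reference), n)
    # Counter intersection keeps the minimum count of each shared n-gram.
    overlap = sum((pred_counts & ref_counts).values())
    precision = overlap / max(sum(pred_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


generated = "Jack and May will have a drink together."
gold = "Jack and May will drink cocktails later."

r1 = rouge_n(generated, gold, n=1)
r2 = rouge_n(generated, gold, n=2)
print({k: round(v, 4) for k, v in r1.items()})
print({k: round(v, 4) for k, v in r2.items()})
```

For this pair, five of the eight generated unigrams appear among the seven reference unigrams, giving a ROUGE-1 precision of 0.625, a recall of about 0.714, and an F1 of about 0.667. Only three bigrams are shared, so the ROUGE-2 F1 drops to about 0.462.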

In [None]:
rouge = evaluate.load("rouge")

In [None]:
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    result = rouge.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)

    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in predictions]
    result['gen_len'] = np.mean(prediction_lens)

    return {k: round(v, 4) for k, v in result.items()}

## Training

We are now ready to train our model. Let's set up the training parameters and call `trainer.train()`.

In [None]:
training_args = Seq2SeqTrainingArguments(
    output_dir='models/',
    evaluation_strategy='epoch',
    learning_rate=3e-4,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    weight_decay=0.01,
    num_train_epochs=8,
    predict_with_generate=True,
    metric_for_best_model='rouge1',
    load_best_model_at_end=True,
    save_strategy='epoch'
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=train_data,
    eval_dataset=val_data,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

trainer.train()

You're using a T5TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Epoch | Training Loss | Validation Loss | Rouge1 | Rouge2 | RougeL | RougeLsum | Gen Len |
|-------|---------------|-----------------|--------|--------|--------|-----------|---------|
| 1 | No log | 1.730138 | 0.4217 | 0.1937 | 0.3541 | 0.354 | 16.8032 |
| 2 | 1.981800 | 1.674981 | 0.4409 | 0.21 | 0.3688 | 0.3685 | 16.6773 |
| 3 | 1.781000 | 1.666199 | 0.4496 | 0.2222 | 0.3817 | 0.3816 | 16.4694 |
| 4 | 1.675100 | 1.654412 | 0.4511 | 0.2236 | 0.3798 | 0.3798 | 16.7579 |
| 5 | 1.610300 | 1.646768 | 0.4562 | 0.2256 | 0.3849 | 0.3848 | 16.6369 |
| 6 | 1.543800 | 1.647068 | 0.4591 | 0.2279 | 0.3856 | 0.3856 | 16.6516 |
| 7 | 1.506300 | 1.645739 | 0.4619 | 0.2284 | 0.3879 | 0.3881 | 16.7494 |
| 8 | 1.476300 | 1.648365 | 0.4609 | 0.2275 | 0.3877 | 0.3881 | 16.813 |


TrainOutput(global_step=3688, training_loss=1.6438676002485892, metrics={'train_runtime': 3622.4301, 'train_samples_per_second': 32.535, 'train_steps_per_second': 1.018, 'total_flos': 1.3623539571621888e+16, 'train_loss': 1.6438676002485892, 'epoch': 8.0})
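
The numbers in this output can be cross-checked with a little arithmetic. The sketch below assumes a single device and no gradient accumulation, and estimates the training set size from the reported throughput rather than taking it from the dataset itself. It also assumes the `Trainer` default of logging the training loss every 500 steps, which would explain the `No log` entry for the first epoch.

```python
import math

# Values copied from the training arguments and the TrainOutput above.
batch_size = 32
num_epochs = 8
train_runtime = 3622.4301            # seconds
train_samples_per_second = 32.535

# Assumption: default logging interval of the Trainer (not set in this notebook).
logging_steps = 500

# Samples processed in total, divided by the number of epochs.
num_train_examples = round(train_samples_per_second * train_runtime / num_epochs)

# The last, partially filled batch still counts as one step.
steps_per_epoch = math.ceil(num_train_examples / batch_size)
total_steps = steps_per_epoch * num_epochs

print(f"estimated training examples: {num_train_examples}")
print(f"steps per epoch: {steps_per_epoch}")
print(f"total steps: {total_steps}")
print(f"loss logged before end of epoch 1: {steps_per_epoch >= logging_steps}")
```

The total matches `global_step=3688`, and since the first epoch ends before step 500, no training loss has been logged yet when the first evaluation runs.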

Once the training is done, let's compute the predictions on the test set and evaluate the performance.

In [None]:
preds = trainer.predict(test_data)
print(preds.metrics)

{'test_loss': 1.656678318977356, 'test_rouge1': 0.4409, 'test_rouge2': 0.2039, 'test_rougeL': 0.3678, 'test_rougeLsum': 0.3675, 'test_gen_len': 16.9976, 'test_runtime': 22.8167, 'test_samples_per_second': 35.895, 'test_steps_per_second': 1.14}


Let's also manually inspect one of the predictions.

In [None]:
sample_id = 32
decoded_p = tokenizer.decode(preds.predictions[sample_id], skip_special_tokens=True)
print(test_data[sample_id]['dialogue'])
print('\nGold summary:')
print(test_data[sample_id]['summary'])
print('\nGenerated summary:')
print(decoded_p)

Jack: Cocktails later?
May: YES!!!
May: You read my mind...
Jack: Possibly a little tightly strung today?
May: Sigh... without question.
Jack: Thought so.
May: A little drink will help!
Jack: Maybe two!

Gold summary:
Jack and May will drink cocktails later.

Generated summary:
Jack and May will have a drink together.
