PunctFormer
Treating punctuation restoration as translation with Transformers.
Edit: Code is available at https://github.com/deterministic-algorithms-lab/PunctFormer
Task
The transcript we get from ASR is often not punctuated, and to use it in other tasks we need punctuated text. There are many approaches to this, but I wanted to explore seq2seq Transformers for it, and possibly for multilingual applications too.
Approaches
Classification
We can classify, for each word in the input, which punctuation mark follows it, as discussed in this paper: Adversarial Transfer Learning for Punctuation Restoration.
BertPunc uses a similar approach, predicting a punctuation mark for each word by applying a linear layer on top of a pretrained BERT masked language model (BertForMaskedLM).
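As a rough sketch of this token-classification framing (not the exact architecture of either work, and with a made-up label set), one could do something like:

import torch
from transformers import AutoTokenizer, BertForTokenClassification

# Hypothetical label set: "O" means no punctuation follows the token.
LABELS = ["O", ",", ".", "?"]

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = BertForTokenClassification.from_pretrained("bert-base-uncased", num_labels=len(LABELS))

inputs = tokenizer("how are you doing today", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits      # shape: (1, sequence_length, num_labels)
predictions = logits.argmax(-1)          # one punctuation label per input token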
Seq2Seq
Just give unpunctuated text as input and get punctuated text as output. However, the model may sometimes modify a word too. This bug/feature also allows for automatic capitalization, thus giving more context while restoring punctuation.
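For illustration, a source/target pair in this framing might look like the following (a made-up sentence, not from the corpus):

source = "how are you doing today i hope everything is fine"
target = "How are you doing today? I hope everything is fine."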
The model
Pre-trained
Pre-trained models have a better understanding of a sentence, so I used a pre-trained BART-base from HuggingFace’s model hub, since it’s a good choice for this type of task.
from transformers import pipeline, AutoTokenizer, BartForConditionalGeneration

# Load the pre-trained BART-base checkpoint and tokenizer from the model hub.
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base", max_length=50)
tokenizer = AutoTokenizer.from_pretrained("facebook/bart-base")

# Wrap them in a translation pipeline: unpunctuated text in, punctuated text out.
restore = pipeline("translation_xx_to_yy", model=model, tokenizer=tokenizer)
This pre-trained model simply copies the input. Now the task is to train it.
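For example, calling the untrained pipeline on a short made-up string just echoes it back (the output shown is approximate):

print(restore("hello how are you doing today")[0]["translation_text"])
# -> "hello how are you doing today"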
The data
Target
Punctuated English text from the European Parliament Proceedings Parallel Corpus. It was split 70:15:15 into training, evaluation, and test sets.
Source
punctuations = "?!,-:;."
The lowercased target text with the punctuation marks above removed. Lowercasing all the words helps by not letting the model guess punctuation from the initial capitalization of a sentence.
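As a sketch, the source side can be derived from a target sentence like this (assuming each example is a single string):

punctuations = "?!,-:;."

def make_source(target_text):
    # Lowercase the target and drop the punctuation marks listed above.
    table = str.maketrans("", "", punctuations)
    unpunctuated = target_text.lower().translate(table)
    return " ".join(unpunctuated.split())  # collapse any double spaces left behind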
Training
Training follows the standard seq2seq recipe using HuggingFace’s scripts.
The model was trained on a TPU, progressively increasing the source and target lengths from 32 to 64 to 128.
After a few epochs, the model starts generating good-quality punctuation.
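For reference, here is a minimal sketch of an equivalent fine-tuning setup using Seq2SeqTrainer rather than the exact scripts; the file names, data format, and hyperparameters are assumptions.

from datasets import load_dataset
from transformers import (AutoTokenizer, BartForConditionalGeneration,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

model_name = "facebook/bart-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = BartForConditionalGeneration.from_pretrained(model_name)

max_len = 128  # raised progressively during training: 32 -> 64 -> 128

def preprocess(batch):
    model_inputs = tokenizer(batch["source"], max_length=max_len, truncation=True)
    labels = tokenizer(text_target=batch["target"], max_length=max_len, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

# Hypothetical JSON-lines files with {"source": ..., "target": ...} records.
raw = load_dataset("json", data_files={"train": "train.json", "validation": "val.json"})
tokenized = raw.map(preprocess, batched=True, remove_columns=["source", "target"])

args = Seq2SeqTrainingArguments(
    output_dir="punctformer-bart-base",
    per_device_train_batch_size=16,
    num_train_epochs=3,
    predict_with_generate=True,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()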
Results
This was more of an experiment, without delving much into metrics and comparisons, but the results certainly look promising.
Input
Predictions
Comparison with Punctuator2
As we can see, the Transformer model gives some good baseline results. I tried comparing the predictions’ accuracy. Although the scoring function doesn’t do full justice, since the two approaches are different, it does give a sense of comparison.
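The exact scoring function isn’t shown here; as an assumption, a simple per-word punctuation accuracy along these lines would give a comparable number for both models:

import re

PUNCTUATIONS = set("?!,-:;.")

def word_punct_pairs(text):
    # Split into words and punctuation, pairing each word with the mark that follows it ("" if none).
    tokens = re.findall(r"[\w']+|[?!,\-:;.]", text.lower())
    pairs = []
    for token in tokens:
        if token in PUNCTUATIONS:
            if pairs:
                pairs[-1] = (pairs[-1][0], token)
        else:
            pairs.append((token, ""))
    return pairs

def punctuation_accuracy(prediction, reference):
    pred, ref = word_punct_pairs(prediction), word_punct_pairs(reference)
    if len(pred) != len(ref):  # the seq2seq model may occasionally change a word
        return None
    return sum(p[1] == r[1] for p, r in zip(pred, ref)) / len(ref)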
Future Improvements
- The current model is around 800 MB in size, so there is certainly room for distillation and pruning. An updated and optimized checkpoint will be published by the Lab.
- Multilingual models.
Come join DA Labs on our quest to understand the workings of machines and how they learn! Wonder and wander in the beautiful field that is Deep Learning! Any feedback/questions are always welcome & appreciated 😇 To join us in our projects/research, interact with us, or write with us, just chime in on our discord server https://discord.gg/UwFdGVN ! :)