PunctFormer
Treating punctuation restoration as translation with Transformers.
Edit: Code is available at https://github.com/deterministic-algorithms-lab/PunctFormer
Task
The transcript we get from ASR is often not punctuated, and to use it in other tasks we need punctuated text. There are many approaches to this, but I wanted to explore seq2seq Transformers for it, and possibly for multilingual applications too.
Approaches
Classification
We can classify, for each word in the input, which punctuation mark follows it, as discussed in this paper: Adversarial Transfer Learning for Punctuation Restoration.
BertPunc uses a similar approach, predicting a punctuation mark for each word by applying a linear layer on top of a pretrained BERT masked language model (BertForMaskedLM).
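As a rough sketch of this token-classification framing (not the exact architecture of either work, and with a made-up label set), one could do something like:

import torch
from transformers import AutoTokenizer, BertForTokenClassification

# Hypothetical label set: "O" means no punctuation follows the token.
LABELS = ["O", ",", ".", "?"]

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = BertForTokenClassification.from_pretrained("bert-base-uncased", num_labels=len(LABELS))

inputs = tokenizer("how are you doing today", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits      # shape: (1, sequence_length, num_labels)
predictions = logits.argmax(-1)          # one punctuation label per input token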
Seq2Seq
Just give unpunctuated text as input and get punctuated text as output. However, the model may sometimes modify a word too. This bug/feature also allows for automatic capitalization, thus giving more context while restoring punctuation.
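For illustration, a source/target pair in this framing might look like the following (a made-up sentence, not from the corpus):

source = "how are you doing today i hope everything is fine"
target = "How are you doing today? I hope everything is fine."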
The model
Pre-trained
Pre-trained models have a better understanding of a sentence, so I used a pre-trained BART-base from HuggingFace’s model hub, since it’s a good choice for this type of task.
from transformers import pipeline, AutoTokenizer, BartForConditionalGeneration

# Load the pre-trained BART-base checkpoint and tokenizer from the model hub.
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base", max_length=50)
tokenizer = AutoTokenizer.from_pretrained("facebook/bart-base")

# Wrap them in a translation pipeline: unpunctuated text in, punctuated text out.
restore = pipeline("translation_xx_to_yy", model=model, tokenizer=tokenizer)
This pre-trained model simply copies the input. Now the task is to train it.
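For example, calling the untrained pipeline on a short made-up string just echoes it back (the output shown is approximate):

print(restore("hello how are you doing today")[0]["translation_text"])
# -> "hello how are you doing today"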
The data
Target
Punctuated English text from the European Parliament Proceedings Parallel Corpus. It was split 70:15:15 into training, evaluation, and test sets.
Source
punctuations = "?!,-:;."
The lowercased target text with the punctuation marks above removed. Lowercasing all the words helps by not letting the model guess punctuation from the initial capitalization of a sentence.
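As a sketch, the source side can be derived from a target sentence like this (assuming each example is a single string):

punctuations = "?!,-:;."

def make_source(target_text):
    # Lowercase the target and drop the punctuation marks listed above.
    table = str.maketrans("", "", punctuations)
    unpunctuated = target_text.lower().translate(table)
    return " ".join(unpunctuated.split())  # collapse any double spaces left behind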
Training
Training follows the standard seq2seq recipe using HuggingFace’s scripts.
The model was trained on a TPU, progressively increasing the source and target lengths from 32 to 64 to 128.
After a few epochs, the model starts generating good-quality punctuation.
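For reference, here is a minimal sketch of an equivalent fine-tuning setup using Seq2SeqTrainer rather than the exact scripts; the file names, data format, and hyperparameters are assumptions.

from datasets import load_dataset
from transformers import (AutoTokenizer, BartForConditionalGeneration,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

model_name = "facebook/bart-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = BartForConditionalGeneration.from_pretrained(model_name)

max_len = 128  # raised progressively during training: 32 -> 64 -> 128

def preprocess(batch):
    model_inputs = tokenizer(batch["source"], max_length=max_len, truncation=True)
    labels = tokenizer(text_target=batch["target"], max_length=max_len, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

# Hypothetical JSON-lines files with {"source": ..., "target": ...} records.
raw = load_dataset("json", data_files={"train": "train.json", "validation": "val.json"})
tokenized = raw.map(preprocess, batched=True, remove_columns=["source", "target"])

args = Seq2SeqTrainingArguments(
    output_dir="punctformer-bart-base",
    per_device_train_batch_size=16,
    num_train_epochs=3,
    predict_with_generate=True,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()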
Results
This was more of an experiment, without delving much into metrics and comparisons, but the results certainly look promising.
Input
Predictions
Comparison with Punctuator2
As we can see, the Transformer model gives some good baseline results. I tried comparing the predictions’ accuracy. Although the scoring function doesn’t do full justice, since the two approaches are different, it does give a sense of comparison.
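The exact scoring function isn’t shown here; as an assumption, a simple per-word punctuation accuracy along these lines would give a comparable number for both models:

import re

PUNCTUATIONS = set("?!,-:;.")

def word_punct_pairs(text):
    # Split into words and punctuation, pairing each word with the mark that follows it ("" if none).
    tokens = re.findall(r"[\w']+|[?!,\-:;.]", text.lower())
    pairs = []
    for token in tokens:
        if token in PUNCTUATIONS:
            if pairs:
                pairs[-1] = (pairs[-1][0], token)
        else:
            pairs.append((token, ""))
    return pairs

def punctuation_accuracy(prediction, reference):
    pred, ref = word_punct_pairs(prediction), word_punct_pairs(reference)
    if len(pred) != len(ref):  # the seq2seq model may occasionally change a word
        return None
    return sum(p[1] == r[1] for p, r in zip(pred, ref)) / len(ref)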
Future Improvements
- The current model is around 800 MB in size, so there is certainly room for distillation and pruning. An updated and optimized checkpoint will be published by the Lab.
- Multilingual models.
Come join DA Labs on our quest to understand the workings of machines and how they learn! Wonder and wander in the beautiful field that is Deep Learning! Any feedback/questions are always welcome & appreciated 😇 To join us in our projects/research, interact with us, or write with us, just chime in on our discord server https://discord.gg/UwFdGVN ! :)