Has BERT become the solution for all problems?
In the world of Natural Language Processing (NLP), language models have come a long way, with the introduction of BERT in 2018 setting new records across a range of benchmarks. Since then, many models have been developed as extensions and improvements of it, including RoBERTa, ALBERT, and StructBERT.
BERT: Breaking Barriers with Masked Language Modeling (MLM) and Next Sentence Prediction (NSP)
BERT, short for Bidirectional Encoder Representations from Transformers, was trained to optimize two tasks: MLM and NSP. In MLM, a fraction of the input tokens are masked, and the task is to predict the original tokens from the surrounding context. This idea was originally introduced as the "Cloze task" by Taylor. Unlike traditional unidirectional language models, which condition only on the tokens to the left, MLM lets BERT draw on context from both directions.
The issue with MLM, however, lies in the mismatch between the pre-training and fine-tuning phases, since the [MASK] token never appears during fine-tuning. To mitigate this, for the 15% of tokens selected for prediction, BERT substitutes the [MASK] token 80% of the time, a random token 10% of the time, and keeps the original token the remaining 10% of the time.
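As a rough illustration, here is a minimal sketch of that 80/10/10 scheme in plain Python (the toy vocabulary, whitespace tokenization, and the `mask_tokens` helper are hypothetical stand-ins; BERT itself operates on WordPiece ids):

```python
import random

# Toy vocabulary used for the "replace with a random token" branch.
VOCAB = ["the", "cat", "sat", "on", "mat", "dog", "ran"]

def mask_tokens(tokens, mask_prob=0.15):
    """Return (masked_tokens, labels); labels are None for unselected positions."""
    masked, labels = [], []
    for tok in tokens:
        if random.random() < mask_prob:
            labels.append(tok)                       # model must recover the original token
            r = random.random()
            if r < 0.8:
                masked.append("[MASK]")              # 80%: replace with [MASK]
            elif r < 0.9:
                masked.append(random.choice(VOCAB))  # 10%: replace with a random token
            else:
                masked.append(tok)                   # 10%: keep the original token
        else:
            masked.append(tok)
            labels.append(None)                      # not selected: no prediction target
    return masked, labels

print(mask_tokens("the cat sat on the mat".split()))
```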
RoBERTa: A Step Forward with Dynamic Masking
RoBERTa, an extension of BERT, delivers noticeable improvements over BERT on various downstream tasks. Among its changes is dynamic masking: rather than fixing the masked positions once during preprocessing, a new masking pattern is generated each time a sequence is fed to the model, exposing it to more varied masked contexts. (RoBERTa also trains longer on more data and drops NSP.)
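To make the distinction concrete, here is a toy sketch of static versus dynamic masking (the `mask_tokens` helper and the example sentence are hypothetical, not RoBERTa's actual data pipeline):

```python
import random

# Stand-in masking routine; a real pipeline works over subword ids.
def mask_tokens(tokens, mask_prob=0.15):
    return [("[MASK]" if random.random() < mask_prob else t) for t in tokens]

sentence = "the quick brown fox jumps over the lazy dog".split()

# Static masking (original BERT): the sequence is masked once during data
# preprocessing, so every epoch trains on the same masked copy.
static_copy = mask_tokens(sentence)
static_epochs = [static_copy for _ in range(3)]

# Dynamic masking (RoBERTa): a fresh masking pattern is sampled every time the
# sequence is fed to the model, so each epoch sees different masked positions.
dynamic_epochs = [mask_tokens(sentence) for _ in range(3)]

print(static_epochs)
print(dynamic_epochs)
```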
ALBERT: Simplifying BERT for Efficiency
ALBERT, another extension of BERT, argues that NSP conflates topic prediction with coherence prediction and replaces it with Sentence Order Prediction (SOP). ALBERT also slims BERT down, cutting its parameter count through factorized embedding parameterization and cross-layer parameter sharing, which makes it far more parameter-efficient.
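As an illustration of the parameter-sharing idea, the following sketch (written with PyTorch, which the original work does not prescribe, and with toy dimensions) compares an encoder that reuses a single layer's weights at every depth with a BERT-style stack of distinct layers:

```python
import torch.nn as nn

class SharedEncoder(nn.Module):
    """ALBERT-style: one layer's weights reused at every depth."""
    def __init__(self, d_model=128, nhead=4, num_layers=12):
        super().__init__()
        self.layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.num_layers = num_layers

    def forward(self, x):
        for _ in range(self.num_layers):
            x = self.layer(x)  # same parameters applied again and again
        return x

class UnsharedEncoder(nn.Module):
    """BERT-style: a distinct set of weights per depth."""
    def __init__(self, d_model=128, nhead=4, num_layers=12):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
            for _ in range(num_layers)
        )

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return x

shared = sum(p.numel() for p in SharedEncoder().parameters())
unshared = sum(p.numel() for p in UnsharedEncoder().parameters())
print(shared, unshared)  # the shared encoder holds one layer's weights: 1/12 of the unshared count
```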
UniLM: A Unified Approach to Language Modeling
UniLM (Unified Language Model), proposed by researchers at Microsoft, extends masked prediction beyond a single setting: the same network is pre-trained with unidirectional, bidirectional, and sequence-to-sequence objectives, which are switched between simply by changing the self-attention mask.
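A rough way to picture the three modes is through the self-attention masks that control which positions a token may attend to; the sketch below builds toy versions with NumPy (the sequence length and the source/target split are made-up values):

```python
import numpy as np

seq_len, src_len = 6, 3  # seq2seq case: the first 3 tokens are the "source" segment

# Bidirectional LM: every token may attend to every other token.
bidirectional = np.ones((seq_len, seq_len), dtype=int)

# Unidirectional (left-to-right) LM: token i attends only to positions <= i.
unidirectional = np.tril(np.ones((seq_len, seq_len), dtype=int))

# Sequence-to-sequence LM: source tokens attend to the whole source;
# target tokens attend to the source plus the already-generated target prefix.
seq2seq = np.zeros((seq_len, seq_len), dtype=int)
seq2seq[:, :src_len] = 1
seq2seq[src_len:, src_len:] = np.tril(
    np.ones((seq_len - src_len, seq_len - src_len), dtype=int)
)

for name, mask in [("bidirectional", bidirectional),
                   ("unidirectional", unidirectional),
                   ("seq2seq", seq2seq)]:
    print(name)
    print(mask)
```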
StructBERT: Focusing on Sentence Order Prediction
StructBERT, following ALBERT's move beyond plain NSP, builds sentence order into pre-training: rather than only deciding whether a candidate sentence is the next one, the model must judge how two sentences are ordered relative to each other. This pushes it to understand how sentences are sequenced within a document.
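As an illustration, here is a toy sketch of how sentence-order training pairs might be constructed, with consecutive sentences kept in order as positives and swapped as negatives (the `make_sop_examples` helper and the example document are hypothetical, not either paper's exact recipe):

```python
import random

def make_sop_examples(sentences):
    """Build (sentence_pair, label) examples from consecutive sentences."""
    examples = []
    for a, b in zip(sentences, sentences[1:]):
        if random.random() < 0.5:
            examples.append(((a, b), 1))  # label 1: original order
        else:
            examples.append(((b, a), 0))  # label 0: swapped order
    return examples

doc = ["She opened the door.", "The room was dark.", "A lamp flickered on."]
print(make_sop_examples(doc))
```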
ELECTRA: A New Approach to Masked Language Modeling
ELECTRA replaces MLM with replaced token detection: a small generator fills masked positions with plausible tokens, and a discriminator must decide, for every token in the corrupted sequence, whether it is the original or a replacement. Because the discriminator never sees a [MASK] token and learns from all positions rather than only the masked 15%, this addresses the pre-training/fine-tuning mismatch while being markedly more sample-efficient.
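The following toy sketch shows the shape of the data the discriminator is trained on; the "generator" here is just a random sampler over a made-up vocabulary, whereas ELECTRA's generator is a small masked language model:

```python
import random

VOCAB = ["the", "cat", "sat", "on", "mat", "dog", "chef", "cooked"]

def corrupt(tokens, replace_prob=0.15):
    """Return (corrupted_tokens, labels); label 1 marks a replaced token."""
    corrupted, labels = [], []
    for tok in tokens:
        if random.random() < replace_prob:
            new_tok = random.choice(VOCAB)      # stand-in for the generator's proposal
            corrupted.append(new_tok)
            labels.append(int(new_tok != tok))  # 1 = replaced, 0 = original
        else:
            corrupted.append(tok)
            labels.append(0)
    return corrupted, labels

# The discriminator is trained to predict `labels` for every position of the
# corrupted sequence; note that no [MASK] token ever appears in its input.
print(corrupt("the cat sat on the mat".split()))
```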
TLM: A Pre-training Task for Parallel Bilingual Data
More recent research has also shown improvements from masking whole words rather than the subword pieces BERT operates on. TLM (Translation Language Modeling) targets parallel bilingual data: a source sentence and its translation are concatenated, and tokens are randomly masked in both the source and the target languages, so the model can look at the translation when filling in a blank.
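As a rough sketch, a TLM training example might be assembled like this (the `make_tlm_example` helper, the [SEP] convention, and the toy sentence pair are illustrative, not the original preprocessing pipeline):

```python
import random

def make_tlm_example(src_tokens, tgt_tokens, mask_prob=0.15):
    """Concatenate a parallel pair and mask tokens on both sides."""
    combined = src_tokens + ["[SEP]"] + tgt_tokens
    masked, labels = [], []
    for tok in combined:
        if tok != "[SEP]" and random.random() < mask_prob:
            masked.append("[MASK]")
            labels.append(tok)      # prediction target; the translation is visible as context
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels

en = "the cat sat on the mat".split()
fr = "le chat était assis sur le tapis".split()
print(make_tlm_example(en, fr))
```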
In the ongoing quest for improving language models, researchers continue to develop new pre-training tasks, such as Permuted Language Model (PLM), to address the remaining issues in masked language modeling. These advancements promise to bring us closer to more human-like language understanding and generation capabilities.