The fields of natural language processing (NLP) and natural language generation (NLG) have never been as promising as they are today. In just a few years, neural networks and machine learning models like the transformer architecture have allowed for constant leaps in the capability of computers to process and generate natural human languages.
Most recently, Bidirectional Encoder Representations from Transformers (BERT) and Generative Pre-trained Transformer 3 (GPT-3) have led the charge. As two of the most influential modern tools for NLP and NLG, both are already impacting the artificial intelligence (AI) landscape and bringing the advent of artificial general intelligence (AGI) ever closer.
Let’s explore the history of BERT and GPT-3 to see which is more practical or better suited for specific use cases.
BERT Was First on the Scene
A brainchild of Google, BERT was introduced in 2018. BERT is a neural network-based NLP pre-training technique — in other words, a language model. According to Google, BERT “enables anyone to train their own state-of-the-art question answering system.”
Of course, BERT can do more. Using a transformer model as a base, BERT was pre-trained on an unlabeled, plain-text corpus (more specifically, English Wikipedia and the BookCorpus) and builds on this initial layer of knowledge when it is adapted to practical applications. BERT’s pre-training focuses on two objectives: masked language modeling and next sentence prediction.
The former hides a word from BERT and forces it to use the surrounding words on both sides as context clues (instead of reading text in only one direction), while the latter makes BERT predict whether two sentences logically follow one another. Two things are noteworthy here: the pre-training on unlabeled data and the use of self-attention in an NLP model.
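To see masked language modeling in action, here is a minimal sketch using the Hugging Face transformers library (my choice of tooling for illustration, not something the article prescribes). BERT ranks candidate words for the hidden token by looking at the context on both sides:

```python
# Minimal masked-language-model demo with a pre-trained BERT checkpoint.
# Assumes `pip install transformers torch`.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# BERT uses the words before AND after [MASK] to guess the hidden token.
for prediction in fill_mask("The waiter was friendly and the food was [MASK]."):
    print(prediction["token_str"], round(prediction["score"], 3))
```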
Its major technical breakthrough is word masking during pre-training, which lets the bidirectional language model learn natural human language syntax in practically any language. As the name implies, BERT uses only the encoder portion of the transformer for downstream tasks, where users add new trainable layers on top to learn a domain-specific task.
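Here is a minimal sketch of that “new trainable layers on top” pattern, again with the Hugging Face transformers library (an assumed toolchain). The pre-trained encoder loads unchanged, while the two-label head starts out randomly initialized and only becomes useful after fine-tuning on labeled, domain-specific examples:

```python
# Sketch: reuse the pre-trained BERT encoder and add a task-specific head.
# The two-label head (e.g., positive/negative sentiment) is freshly
# initialized and must be fine-tuned on labeled data before it is useful.
from transformers import BertForSequenceClassification, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

inputs = tokenizer("The checkout process was painless.", return_tensors="pt")
logits = model(**inputs).logits  # untrained head: fine-tune before trusting these
print(logits)
```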
BERT performs well in classification tasks like sentiment analysis and answering questions. The model also excels in named entity recognition (NER) and next sentence prediction. BERT has been immensely helpful for:
- Voice assistants and chatbots aiming to enhance customer experience
- Customer review analysis (one of the most common sentiment analysis and classification applications)
- Enhanced search results
BERT is already used by Google in its search algorithm. Facebook has also leveraged a modified version to handle content moderation on its platform. BERT has even found its way to Japan, where researchers have built a financial domain-specific model for commercial purposes.
After BERT Comes GPT-3
As the third iteration of OpenAI’s Generative Pre-trained Transformer (GPT), GPT-3 is a general language model trained on uncategorized text data from the internet. GPT-3 dwarfs its predecessors and remains one of the most significant language models in the world today, with its largest version boasting 175 billion parameters, roughly ten times more than the next-largest notable NLP model at its release, Microsoft’s Turing Natural Language Generation (T-NLG). GPT-3 learns these parameters from historical training data and applies its “knowledge” to downstream tasks such as language inference, paraphrasing, and sentiment analysis.
In short, GPT-3 takes transformer embeddings and generates output from them. Its pre-training involved so many parameters, attention layers, and such large batch sizes that it can produce striking results as a generic model with only a bit of user prompting on a downstream task.
GPT-3 has displayed promising performance in zero-shot, one-shot, and few-shot multitask settings. It can bring existing knowledge to bear on new tasks without additional training, even when there’s little to no available data. You can see this in action by testing the model yourself: go to OpenAI’s GPT Playground, give the model a task and one or two examples of the expected output, and, with only what it already knows plus the information you’ve provided, GPT-3 will attempt to do what you’ve instructed.
So the core GPT-3 model doesn’t know how to perform any specific task, but you can easily make it specialize in a particular task, even without a lot of training data.
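Here is a minimal sketch of that kind of few-shot prompt, written against the OpenAI Python client as it existed around GPT-3’s release; the engine name and the pre-1.0 Completion API are assumptions that may have changed since, and the API key is a placeholder:

```python
# Few-shot prompting: the examples in the prompt stand in for training data.
# Assumes `pip install openai` (pre-1.0 client) and a valid API key.
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder

prompt = (
    "Label the sentiment of each review as Positive or Negative.\n"
    "Review: The battery lasted two days. Sentiment: Positive\n"
    "Review: It broke after one week. Sentiment: Negative\n"
    "Review: Setup was quick and the screen is gorgeous. Sentiment:"
)

response = openai.Completion.create(
    engine="text-davinci-002",  # assumed engine name; newer models exist
    prompt=prompt,
    max_tokens=3,
    temperature=0,
)
print(response.choices[0].text.strip())  # expected: "Positive"
```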
GPT-3 is broadly used in answering questions and excels in translation from other languages into English. To date, developers have used GPT-3 to:
- Build websites and applications
- Assist in the development of written content and material like podcasts
- Assist in the generation of legal documents and things like resumes
- Generate machine learning code
Indeed, GPT-3 has also been a media darling for language modeling, specifically text generation, where it can complete a sentence, a paragraph, or even a movie script. GPT-3 also powers AI-assisted web content generation, including copywriting and marketing collateral, and a more comprehensive approach to predictive auto-completion or assistance in emails, notes, and even programming languages.
Overlaps and Distinctions
There’s a lot of overlap between BERT and GPT-3, but also many fundamental differences. The foremost architectural distinction is that, of the transformer’s encoder-decoder design, BERT uses the encoder stack while GPT-3 uses the decoder stack. That structural difference alone practically limits the overlap between the two.
BERT encodes its input and relies on transfer learning: it continues learning from task-specific data when you fine-tune it for user-specific tasks. GPT-3, on the other hand, decodes from its massive store of pre-learned representations to produce output that impressively matches user prompts. It does not learn anything new at that point; in essence, it already has enough knowledge to deliver something usable right out of the box.
Additionally, BERT’s bidirectionality gives it a leg up in tasks historically handled better by encoders: fill-in-the-blank problems, tasks where the model must look back and compare two pieces of content, and tasks where it must process long passages to generate short answers. But as a more general language model with a far larger parameter base, GPT-3 performs better at common sense tasks and pragmatic inference than baseline BERT.
To better illustrate this distinction in architecture: at the sentence level, GPT-3 looks back at previous words to predict what should come next. Meanwhile, BERT considers the words that come both before and after a missing term and predicts what that word should be.
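As a counterpart to the earlier fill-in-the-blank sketch, here is the decoder-only behavior, using GPT-2 as a freely downloadable stand-in, since GPT-3 itself is only available through OpenAI’s API and GPT-2 merely shares the same decoder-only design:

```python
# Decoder-only generation: the model sees only the prompt to its left and
# extends it token by token, unlike BERT, which fills a gap using both sides.
from transformers import pipeline, set_seed

set_seed(42)  # make the sampled continuation repeatable
generator = pipeline("text-generation", model="gpt2")

result = generator("The waiter was friendly and the food was", max_new_tokens=10)
print(result[0]["generated_text"])
```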
According to the research paper that introduced it to the world, GPT-3 scored well on the benchmark tasks researchers typically use for performance evaluation. More precisely, GPT-3’s base model is already competitive with, and sometimes better than, state-of-the-art models that their developers have fine-tuned for specific tasks. For example, GPT-3 is competitive in low-resource translation with RoBERTa (the Robustly Optimized BERT Pretraining Approach), Meta’s fine-tuned, BERT-based model. And it produces slightly better results on some common sense reasoning and reading comprehension test sets.
This work-right-out-of-the-box quality has a lot to do with GPT-3’s state-of-the-art status, and it also gives the model an edge in low-resource environments where training data is scarce. Although you can fine-tune BERT, it won’t perform well without domain-specific training data.
But let’s look at more recent data. When BERT first arrived, it too achieved state-of-the-art results, in its case across 11 NLP tasks. The newest shiny thing will almost always be at least competitive with its predecessors, but how do BERT and GPT-3 fare today? Looking at the General Language Understanding Evaluation (GLUE) benchmark, which covers single-sentence, similarity and paraphrasing, and natural language inference tasks, both model families sit high on the leaderboard.
The BERT-based DeBERTa + CLEVER and Baidu’s ERNIE, which mixes BERT-style and GPT-style pre-training, are neck and neck on the GLUE leaderboard, earning identical overall scores of 91.1, just behind the two highest entries at 91.2 and 91.3. In single-sentence tasks, ERNIE has a slight edge, and it’s the opposite for similarity and paraphrasing tasks. The two are practically tied in natural language inference, where each alternately outperforms the other across four specific datasets.
But the real question is, what does all this mean for you? Which model is more capable? Which is easier to use?
Given the proper training data, BERT can potentially give you better downstream, domain-specific capabilities. GPT-3 outperforms BERT out of the box on most of the tasks examined during research, but you can’t customize it to the same degree. In the same vein, GPT-3 is easier to use and simpler to start integrating into your systems and processes. However, its sheer size and black-box nature can be restrictive for smaller operations that cannot accommodate the required infrastructure or that need more hands-on model tweaking.
Again, you must take these considerations into account alongside the historical performance of both models in existing applications.
What Does the Future Hold?
GPT-3’s massive model is typically overqualified for simple or narrowly defined tasks, not to mention cumbersome and compute-hungry. On the flip side, BERT requires additional training, which is resource-intensive in its own right: the larger BERT models have relatively slow training times and need a lot of computing power.
It’s also interesting to note that BERT (from tech giant Google) is open source, while GPT-3 (from OpenAI) is a paid model and API. These are essential considerations for larger development teams.
Still, BERT and GPT-3 are far from the be-all and end-all in large language models for NLP and NLG. Another massive, GPT-style language model called BLOOM has just entered the fray, built by the BigScience collaboration, which brings together HuggingFace and many other organizations and independent researchers. Only time will tell how it compares to BERT and GPT-3.
One thing’s for sure: the future of NLP and NLG will be bright, busy, and full of possibilities.