Artwork by Wxll, using Pixelmind
GPT-J-6B is an open-source, autoregressive language model created by EleutherAI, a collective of researchers working to open-source AI research.
That was quite a mouthful. Let’s unpack that statement a bit by understanding what a language model is - which will lead us to understanding what GPT-J is.
What is a language model?
We have all used smart keyboards on our phones. As we type, the keyboard offers us a few choices of words it thinks we might want to type next. If you stop and think about it, it seems strange that a phone can predict what we want to say at all. So how does it know which words to offer us? The answer is language models.
Essentially, language models are models that try to learn what “language” looks like by reading a lot of words. 📖
By reading A LOT of words, the model learns a probability distribution over words and word sequences. This helps the model predict which words or phrases make sense next to each other, or which would form a valid continuation of a given sequence of words. Validity here refers not to grammar, but to how often someone would actually use that combination of words in speech or text, based on the data the model has consumed.
For example, if you type “How” - the model can suggest “are”, “is”, “to” etc., simply because these words often occur after the word “How” in the English language. Below is a tiny sketch of that counting idea; once we understand this simple scenario, we can ask what else language models can do.
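Here is the sketch in Python. It is purely illustrative - the four-sentence corpus is made up, a real keyboard learns from vastly more data, and real language models do far more than count word pairs - but it shows what “a probability distribution over the next word” means:

```python
from collections import Counter, defaultdict

# A tiny made-up corpus - real models read billions of words.
corpus = [
    "how are you doing today",
    "how is the weather",
    "how are things going",
    "how to make pasta",
]

# Count which word follows which word (a "bigram" model).
next_word_counts = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for current_word, next_word in zip(words, words[1:]):
        next_word_counts[current_word][next_word] += 1

# Turn the counts after "how" into a probability distribution.
counts = next_word_counts["how"]
total = sum(counts.values())
for word, count in counts.most_common():
    print(f"P({word!r} | 'how') = {count / total:.2f}")
# "are" comes out most likely, followed by "is" and "to",
# simply because that is what this toy corpus contains.
```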
Artwork by Wxll, using Pixelmind
What can language models do?
These models are not only trained to learn the nuances of language - they can also be trained to perform many different tasks. We can train models to translate between languages, summarize long texts, or even answer questions. After all, answering a question can be thought of as predicting what words should come after the words in the question. We can even train the model to write code, since a programming language is also a “language”.
Exciting, isn’t it?
So can the model in our smart keyboards do all of these tasks, since it is a language model too? Not quite. The models that can perform these tasks are much bigger and need far more data and compute to learn. GPT-J-6B is one of those large models.
What is GPT-J-6B?
"GPT" is short for generative pre-trained transformer. Let’s break that down. 🔨 “Generative” means that the model was trained to predict or “generate” the next token (word) in a sequence of tokens. “Pre-trained” refers to the fact that a trained model can be considered entirely trained for any language task and does not need to be re-trained for specific tasks individually (with some caveats). Essentially, the model has understood language well enough to perform many tasks automatically.
That said, fine-tuning the model for a specific task can still boost performance and accuracy. “Transformer” refers to a popular model architecture in deep learning. One way to think about model architecture is that it defines how the information a model learns is organized inside it. Learn more about it here in a very well-written and easy-to-understand article about transformers.
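If you want to poke at GPT-J-6B yourself, one common way is through the Hugging Face transformers library, which hosts the model as EleutherAI/gpt-j-6B. Here is a minimal sketch - the prompt and sampling settings are just illustrative choices of mine, and the full model is heavy (roughly 24 GB of memory in float32, about half that in float16):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# GPT-J-6B as published by EleutherAI on the Hugging Face hub.
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-j-6B")

# Any prompt works; this one is just for illustration.
prompt = "A language model is"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

# "Generate" = repeatedly predict the next token, up to 40 new tokens here.
output_ids = model.generate(
    input_ids,
    max_new_tokens=40,
    do_sample=True,   # sample from the probability distribution
    temperature=0.8,  # a little randomness in the choices
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```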
Let’s continue. "J" distinguishes this model from other GPT variants and is likely due to the model being trained on the popular python library “JAX” - it is something that helps programmers make models without manually writing out all mathematical operations underneath.
"6B" represents the 6 billion trainable parameters. Parameters can be thought of as information storing units in neural network models. When a model is given an input, it runs through a combination of these parameters to give us the result. Quality of a language model has been found to continue to improve as the number of parameters increase. More parameters means it can consume more data successfully and store it.
For example, GPT-3 from OpenAI has 175 billion parameters (almost 30x larger than GPT-J-6B). The most recent state-of-the-art language model, Megatron-Turing from Microsoft and NVIDIA, has 530 billion parameters (almost a crazy 90x bigger 😲).
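A quick back-of-the-envelope calculation gives a feel for what those numbers mean in practice. Assuming just 2 bytes per parameter (the float16 format - my assumption here, and ignoring everything besides the raw weights):

```python
# Rough memory needed just to *store* the weights, at 2 bytes per parameter.
BYTES_PER_PARAM = 2

models = {
    "GPT-J-6B": 6e9,
    "GPT-3": 175e9,
    "Megatron-Turing": 530e9,
}

for name, params in models.items():
    gigabytes = params * BYTES_PER_PARAM / 1e9
    print(f"{name}: {params / 1e9:.0f}B parameters ≈ {gigabytes:,.0f} GB of weights")
# GPT-J-6B: 6B parameters ≈ 12 GB of weights
# GPT-3: 175B parameters ≈ 350 GB of weights
# Megatron-Turing: 530B parameters ≈ 1,060 GB of weights
```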
So what is GPT-J-6B trained on, and why does that matter?
As we have learnt, the model does not just learn words individually - it learns combinations of words and phrases. In doing so, the model also learns the biases and behaviors of the original dataset, which can be very dangerous. Some extreme examples here. The dataset therefore plays a crucial part in what the model learns.
For GPT-J-6B, the research group compiled an 825-gigabyte (GB) dataset called The Pile, curated from a set of datasets including arXiv, GitHub, Wikipedia, StackExchange, HackerNews, etc. The model was trained on 400 billion tokens from this dataset.
Now you might be thinking: “That’s all good to know, but what does this have to do with images?” 🤔 You’ll have to bear with me, as this is where my understanding still has its gaps, but let’s try to reason about it.
Language models + images + PixelMind?
We established earlier that language models are good at learning which words and phrases often appear around other words and phrases. Now imagine that, instead of showing the model a whole bunch of text during training, we showed it an image (or rather its pixel data) along with some text describing that image.
Would the model be similarly able to learn what text often appears with what kind of pixels? Would the model be able to get a rough idea of what pixel values often occur next to each other in an image when the text description contains the word “corgi”?
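As a cartoon of that thought experiment (and nothing more - the two tokenizer functions below are made-up placeholders rather than real APIs, and real text-to-image systems work quite differently), imagine flattening an image into a sequence of discrete “pixel tokens” and gluing it onto the caption’s word tokens, so that the same next-token objective covers both:

```python
import numpy as np

def tokenize_text(caption):
    # Placeholder: map each word to some integer id below 1000.
    return [hash(word) % 1000 for word in caption.split()]

def tokenize_image(pixels, levels=16):
    # Placeholder: flatten the image and bucket each pixel into one of a
    # few discrete values, so pixels become "words" the model can predict.
    return list((pixels.flatten() * (levels - 1)).astype(int) + 1000)

caption = "a corgi on the beach"
image = np.random.rand(8, 8)  # stand-in for a real 8x8 grayscale image

# One long training sequence: caption tokens followed by image tokens,
# learned with the same "predict the next token" trick as before.
sequence = tokenize_text(caption) + tokenize_image(image)
print(len(sequence), "tokens:", sequence[:10], "...")
```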
You see where I am going with this. If we scale this kind of thinking up, we can imagine the model learning the relationship between words and parts of an image, the location of objects in an image, the concept of foreground and background, styles, and even what “Trending on artstation” looks like. It would be able to start predicting pixels in the same way that it was able to predict words 🤯
There you have it folks - that’s my idea of how a Pixelmind-like tool could use language models for more than just language. To be very clear, there are far more advanced methods of generating images from text than this simplistic example can capture. Pixelmind leverages those methods for us in the tools we love, but I hope this gives you a small glimpse into how Artificial Intelligence is making the magical world of generating wild art from our imaginations possible. 🧠
Amazing, thank you for sharing with us!