
What Is a Vision Transformer (ViT)?

How researchers taught the transformer, built for text, to see images by chopping them into patches and treating each one like a word. A plain-English explainer.

From words to pictures

The transformer architecture was designed for language. It reads a sentence as a sequence of tokens (roughly, words or word pieces) and uses a mechanism called attention to figure out which tokens matter to each other. It powered a wave of language models and quickly became the default tool for text.

For a few years, images were a different world. Computer vision was dominated by convolutional neural networks, or CNNs, which scan an image with small sliding filters that detect edges, then textures, then shapes, building understanding layer by layer. The natural question was whether the transformer that worked so well on text could be pointed at pixels instead. A Vision Transformer, or ViT, is the answer to that question.

An image is worth 16x16 words

The breakthrough came from a Google Research team in a paper first posted in October 2020 and published at the ICLR 2021 conference. Its memorable title is An Image Is Worth 16x16 Words, and the lead author was Alexey Dosovitskiy, with co-authors including Lucas Beyer, Alexander Kolesnikov and Neil Houlsby.

The core trick is almost stubbornly simple. Instead of feeding the transformer one pixel at a time, which would be far too many tokens, the image is cut into a grid of fixed-size square patches, such as 16 by 16 pixels each. Each patch is flattened and passed through a single linear layer that turns it into a vector, the same kind of vector a language model would use to represent a word. Now the image is just a short sequence of patch tokens (a 224 by 224 pixel image cut into 16 by 16 patches gives 14 x 14 = 196 of them), and a standard transformer can read it almost exactly the way it reads a sentence.
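To make the patch step concrete, here is a minimal NumPy sketch of cutting an image into patches and projecting each one to a vector. The 224 by 224 input size, the embedding width of 64, and the names patchify, W and b are illustrative assumptions, and the weights are random rather than trained.

```python
import numpy as np

def patchify(image, patch_size):
    """Split an (H, W, C) image into flattened, non-overlapping square patches."""
    h, w, c = image.shape
    if h % patch_size or w % patch_size:
        raise ValueError("image height and width must be multiples of patch_size")
    rows, cols = h // patch_size, w // patch_size
    # (rows, P, cols, P, C) -> (rows, cols, P, P, C) -> (rows * cols, P * P * C)
    patches = image.reshape(rows, patch_size, cols, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(rows * cols, patch_size * patch_size * c)

rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))           # stand-in for a real RGB image
patches = patchify(image, patch_size=16)    # shape (196, 768)

embed_dim = 64                              # assumed width, kept small for the sketch
W = rng.normal(0.0, 0.02, (patches.shape[1], embed_dim))  # the single linear layer
b = np.zeros(embed_dim)
tokens = patches @ W + b                    # shape (196, 64): one vector per patch
```

Each row of tokens plays the role a word embedding plays in a language model. In a trained model W and b would be learned along with everything else.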

Patches, position, and attention

A few extra pieces make this work. Because a flattened patch carries no information about where it sat in the original image, the model adds a positional embedding to each patch token, a learned signal that says, in effect, "this piece came from the top-left, this one from the middle." Without it, the transformer would see a bag of patches with no layout.
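As a rough illustration of the positional step, the sketch below adds one vector per position to a set of patch tokens. The sizes and variable names are assumptions, and the values are random stand-ins for what a real model would learn.

```python
import numpy as np

rng = np.random.default_rng(0)
num_patches, embed_dim = 196, 64           # assumed: 14 x 14 grid, small width

patch_tokens = rng.normal(size=(num_patches, embed_dim))
patch_tokens[195] = patch_tokens[0]        # pretend two patches look identical

# One vector per position; random here because nothing is trained.
pos_embedding = rng.normal(0.0, 0.02, (num_patches, embed_dim))

tokens = patch_tokens + pos_embedding      # element-wise add, shape unchanged

same_before = np.allclose(patch_tokens[0], patch_tokens[195])  # identical content
same_after = np.allclose(tokens[0], tokens[195])               # now distinguishable
```

Two patches with identical pixels end up as different tokens once position is added, which is what lets the model tell the top-left corner from the bottom-right.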

The team also borrowed an idea from language models like BERT: a special extra token, often called the class token, is added to the front of the sequence. As the layers run, this token gathers information from every patch and ends up holding a summary of the whole image, which a small final layer uses to make the prediction.
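A simplified sketch of the class token idea follows. The transformer layers themselves are left out (marked with a placeholder), and the sizes, the ten-class head and all names are hypothetical choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
num_patches, embed_dim, num_classes = 196, 64, 10   # assumed sizes

patch_tokens = rng.normal(size=(num_patches, embed_dim))
cls_token = rng.normal(0.0, 0.02, (1, embed_dim))   # learned in a real model

# Put the class token at the front of the sequence: 196 patches become 197 tokens.
sequence = np.concatenate([cls_token, patch_tokens], axis=0)

# Transformer layers would update `sequence` here; this sketch applies none.
encoded = sequence

head_w = rng.normal(0.0, 0.02, (embed_dim, num_classes))
logits = encoded[0] @ head_w           # only the class token feeds the final layer
predicted = int(np.argmax(logits))
```

The point to notice is that the prediction reads only position 0. Everything the model knows about the image has to flow into that token through attention.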

The engine doing the real work is self-attention. In each layer, every patch can look directly at every other patch and decide how much weight to give it. This is the headline difference from a CNN. A convolutional layer sees only a small neighborhood, so a CNN has to stack many layers before distant parts of an image can influence each other. A ViT can relate a corner to the opposite corner in its very first attention layer. That global view is what makes the approach interesting.
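The "every patch looks at every other patch" idea can be sketched as single-head scaled dot-product attention. This is a simplification under stated assumptions: real ViTs use several heads plus an output projection, and the sizes and random weights here are placeholders.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over an (N, D) sequence."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])          # (N, N) pairwise scores
    scores -= scores.max(axis=-1, keepdims=True)     # for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over each row
    return weights @ v, weights

rng = np.random.default_rng(0)
n_tokens, dim = 196, 64                              # assumed sizes
x = rng.normal(size=(n_tokens, dim))
w_q, w_k, w_v = (rng.normal(0.0, dim ** -0.5, (dim, dim)) for _ in range(3))

out, weights = self_attention(x, w_q, w_k, w_v)
# weights[0, 195]: how much the top-left patch attends to the bottom-right one.
```

The weights matrix has one row and one column per patch, so the entry linking opposite corners exists in the very first layer. No stacking is needed for the two to interact.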

How ViTs compare to CNNs

The honest summary is that there is a trade-off, not a clean winner. CNNs come with built-in assumptions about images, sometimes called inductive biases: nearby pixels are related, and an object is the same object whether it sits on the left or the right. Those assumptions are baked into how convolutions work, so CNNs learn efficiently even from modest datasets.

A plain ViT has far fewer of those assumptions. It has to learn the structure of images from scratch, which means it is hungrier for data. The original paper found that when trained only on a mid-sized dataset such as ImageNet (about 1.3 million images), ViT trailed strong CNNs of comparable size. But when pre-trained on much larger datasets, in the range of 14 million to 300 million images, the ViT caught up and pulled ahead, while requiring less compute to reach comparable quality. The slogan that stuck is that ViTs are less data-efficient but higher-capacity: give them enough examples and their flexibility pays off.

Since 2021 the picture has kept evolving. Researchers built data-efficient training recipes, hybrid models that mix convolutions with attention, and very large ViTs scaled to billions of parameters. Transformers and CNNs now coexist across vision rather than one fully replacing the other.

Why this matters beyond research

Vision Transformers are not just a lab curiosity. The same patch-and-attention idea underpins many modern image and multimodal systems, including models that connect pictures with text. When a tool can describe a photo, find an object in it, or separate a subject from its background, attention-based vision is often part of the stack.

At Renderivo we work on the practical end of that spectrum: cleaning up product photos for ecommerce, removing busy backgrounds, placing items on clean white, squaring them up for marketplace requirements, and generating tidy scene shots. You do not need to understand attention to use tools like these, but it is the kind of research that makes everyday visual AI tools possible. If you sell online and want to see modern image AI applied to your own catalog, new accounts get free credits to try it.

Frequently asked questions

What does ViT stand for?

ViT stands for Vision Transformer. It is a model that applies the transformer architecture, originally built for language, directly to images by treating small image patches as tokens.

How is a Vision Transformer different from a CNN?

A CNN scans an image with small local filters and builds up a global view across many layers. A ViT splits the image into patches and uses self-attention so any patch can relate to any other patch from the first layer. CNNs learn well from smaller datasets, while plain ViTs usually need more training data but scale very well when they have it.
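To put a rough number on "many layers", the sketch below computes how far a stack of convolutions can see. It assumes stride-1 layers with no pooling or dilation, which is a deliberately pessimistic simplification: real CNNs downsample and reach a wide view in far fewer layers. The function names are hypothetical.

```python
def receptive_field(num_layers, kernel_size=3):
    """Width in pixels seen by one output unit after stacking stride-1 convolutions."""
    return 1 + num_layers * (kernel_size - 1)

def layers_to_cover(image_size, kernel_size=3):
    """Fewest stride-1 layers whose receptive field spans image_size pixels."""
    layers = 0
    while receptive_field(layers, kernel_size) < image_size:
        layers += 1
    return layers

print(receptive_field(1))      # 3
print(receptive_field(5))      # 11
print(layers_to_cover(224))    # 112
```

Under these assumptions, each 3 by 3 layer widens the view by only two pixels, whereas a single attention layer already spans the whole grid of patches.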

Why are images split into patches?

Feeding a transformer one pixel at a time would create far too many tokens to process efficiently. Splitting the image into fixed-size patches, such as 16 by 16 pixels, turns it into a short sequence the transformer can handle, with each patch acting like a word.
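The arithmetic behind "far too many tokens" is easy to check. The sketch below assumes a 224 by 224 pixel image, a common input size, and counts tokens and token pairs for both approaches.

```python
def sequence_lengths(height, width, patch_size):
    """Token counts for pixel-level vs patch-level tokenization of an image."""
    pixel_tokens = height * width
    patch_tokens = (height // patch_size) * (width // patch_size)
    return pixel_tokens, patch_tokens

pixel_tokens, patch_tokens = sequence_lengths(224, 224, 16)

# Self-attention compares every token with every token, so cost grows with n * n.
pixel_pairs = pixel_tokens ** 2
patch_pairs = patch_tokens ** 2

print(pixel_tokens, patch_tokens)      # 50176 196
print(pixel_pairs // patch_pairs)      # 65536
```

Patches shrink the sequence by a factor of 256, and because attention compares tokens in pairs, the number of comparisons drops by a factor of 65,536.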

Who created the Vision Transformer?

It was introduced by a team at Google Research in the 2020 paper An Image Is Worth 16x16 Words, published at ICLR 2021, with Alexey Dosovitskiy as lead author alongside several co-authors.

