
What is an embedding? How AI turns words and images into numbers

A plain-English explainer on vector embeddings: how AI represents words and images as points in space, why similar things end up close together, and how this powers search and recommendations.

The core idea: meaning as a location

An embedding is a list of numbers that an AI model assigns to a piece of data, such as a word, a sentence, or an image. You can think of that list as coordinates. Just as a city has a latitude and longitude that place it on a map, a word gets a position in a much larger space of numbers.

The useful part is what those positions mean. Embeddings are built so that similar things land near each other and unrelated things land far apart. The sentence "the cat sat on the mat" and the sentence "a kitten rested on the rug" end up close together because they mean nearly the same thing, while a sentence about quarterly revenue sits far away. Meaning becomes distance, and distance is something a computer can measure.

These spaces are high-dimensional. Instead of two coordinates, an embedding might have hundreds or thousands of numbers per item. We cannot picture that many dimensions, but the math behaves the same way it does on a flat map: closeness still signals similarity.
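To make the map analogy concrete, here is a minimal sketch using made-up three-number embeddings. The values and the third example sentence are invented for illustration and do not come from any real model; only the relative distances matter.

```python
import math

# Made-up 3-number "embeddings" for illustration only. Real models
# produce hundreds or thousands of numbers per item.
toy_embeddings = {
    "the cat sat on the mat": [0.9, 0.8, 0.1],
    "a kitten rested on the rug": [0.85, 0.75, 0.15],
    "quarterly revenue rose": [0.1, 0.2, 0.95],
}

def distance(a, b):
    """Straight-line (Euclidean) distance between two points."""
    return math.dist(a, b)

cat = toy_embeddings["the cat sat on the mat"]
kitten = toy_embeddings["a kitten rested on the rug"]
revenue = toy_embeddings["quarterly revenue rose"]

print(distance(cat, kitten))   # small: similar meaning
print(distance(cat, revenue))  # large: unrelated meaning
```

The same function works unchanged on lists of any length, which is why the flat-map intuition carries over to spaces with far more dimensions.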

Why numbers instead of words

Computers do not understand language or pictures directly. They work with numbers. The old approach was to treat each word as a separate symbol with no relationship to any other, so a model had no way to know that big and large are related, or that Paris and London are both cities.
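One common form of that older approach is one-hot encoding, where each word gets its own slot in a vector. The sketch below assumes a tiny four-word vocabulary chosen for illustration. It shows that every pair of different words looks equally unrelated, because their vectors never overlap.

```python
# Hypothetical four-word vocabulary, chosen only for this example.
vocabulary = ["big", "large", "Paris", "London"]

def one_hot(word):
    """Return a vector with a 1 in the word's slot and 0 elsewhere."""
    return [1 if w == word else 0 for w in vocabulary]

def dot(a, b):
    """Sum of element-wise products; 0 means no overlap at all."""
    return sum(x * y for x, y in zip(a, b))

print(one_hot("big"))                         # [1, 0, 0, 0]
print(dot(one_hot("big"), one_hot("large")))  # 0
print(dot(one_hot("big"), one_hot("Paris")))  # 0
```

With this encoding, big is exactly as far from large as it is from Paris, so nothing about meaning is recoverable from the numbers.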

Embeddings fix this by letting models learn the relationships from data. Google for Developers describes an embedding space as a representation where similar items are positioned near one another, learned during training rather than hand-coded. The model reads enormous amounts of text or image data and adjusts each item's coordinates until the geometry reflects real patterns of use.

IBM frames embeddings as numerical representations that capture the meaning and relationships of data, and notes they sit underneath most modern machine learning, from language models to image systems. Once data is in this numeric form, everything downstream becomes a question of comparing points.

The famous king and queen example

The breakthrough that made embeddings famous came from word2vec, a method published in 2013 by Tomas Mikolov and colleagues at Google in a paper titled "Efficient Estimation of Word Representations in Vector Space". It trained a small neural network either to predict a word from its surrounding words or to predict the surrounding words from the word, and in doing so it placed words that appear in similar contexts near one another.

What surprised people was that relationships showed up as consistent directions. If you take the vector for king, subtract the vector for man, and add the vector for woman, the closest result is queen. The step from man to woman and the step from king to queen are almost the same move through the space, so simple arithmetic captures a relationship the model was never explicitly taught.

One honest caveat: these analogies are tidy demonstrations, not magic. In practice, the original words are usually excluded from the search, because the result of the arithmetic often lands closest to one of the inputs, such as king, which would otherwise win. The effect is real and important, but it is a tendency in the data, not a guaranteed equation.
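The arithmetic can be sketched with hand-made vectors. The numbers below are invented so that the first value loosely stands for royalty and the other two for gender; real word2vec vectors are learned from text and have no such labeled axes. The helper names are hypothetical, and the search skips the input words, as is usual in practice.

```python
import math

# Invented toy vectors, not output from word2vec or any real model.
words = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.1, 0.2, 0.2],
}

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def analogy(a, b, c):
    """Find the word closest to (a - b + c), skipping the inputs."""
    target = [x - y + z for x, y, z in zip(words[a], words[b], words[c])]
    candidates = [w for w in words if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine_similarity(words[w], target))

print(analogy("king", "man", "woman"))  # queen
```

The result depends entirely on how the vectors were chosen here, which mirrors the caveat: on real data the pattern holds often, not always.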

Measuring closeness: cosine similarity

To compare two embeddings, the most common tool is cosine similarity. It measures the angle between two vectors rather than how long they are, so it focuses on direction. Vectors pointing the same way are treated as similar even if one is longer than the other.

The score runs from minus one to one. A value near one means the two items point in nearly the same direction and are very similar; a value near zero means they are unrelated; a value near minus one means they are opposites. This single number is what lets a system rank thousands of candidates and surface the closest matches quickly.
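A minimal sketch of the calculation, written in plain Python for readability. It divides the dot product of the two vectors by the product of their lengths and assumes neither vector is all zeros. Production systems would typically use a numerical library instead.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between a and b (assumes non-zero vectors)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

v = [1.0, 2.0]
print(cosine_similarity(v, [2.0, 4.0]))    # same direction, twice as long: about 1
print(cosine_similarity(v, [-2.0, 1.0]))   # perpendicular: 0
print(cosine_similarity(v, [-1.0, -2.0]))  # opposite direction: about -1
```

The first comparison shows why length does not matter: doubling a vector leaves the angle, and therefore the score, unchanged.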

What embeddings power

Semantic search is the clearest example. Instead of matching keywords, a search engine turns your query into an embedding and finds the stored items whose embeddings are closest. That is why a search for "affordable running shoes" can return a product described as "budget trainers", even with no shared words.
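The ranking step can be sketched as follows. The catalog entries and all vectors are invented stand-ins. In a real system an embedding model would produce them, and a dedicated index would replace the simple loop. The function name search and the top_k parameter are hypothetical.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

# Invented embeddings standing in for model output.
catalog = {
    "budget trainers": [0.8, 0.6, 0.1],
    "luxury leather loafers": [0.2, 0.7, 0.6],
    "stainless steel kettle": [0.1, 0.1, 0.9],
}
# Stands in for the embedding of the query "affordable running shoes".
query_embedding = [0.9, 0.5, 0.1]

def search(query_vec, items, top_k=2):
    """Return the names of the top_k items most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), name)
              for name, vec in items.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [name for _, name in scored[:top_k]]

print(search(query_embedding, catalog))
```

Note that the query text never appears in the code: only the vectors are compared, which is why matches do not need to share any words.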

Recommendations work the same way. If two products, songs, or articles have nearby embeddings, a system can suggest one to people who liked the other. The same logic underpins clustering, deduplication, and flagging items that do not fit a group.
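As one illustration of flagging an item that does not fit a group, the sketch below averages a set of invented song embeddings and reports the song least similar to that average. Comparing against the average is only one simple approach, assumed here for brevity, and the names and values are hypothetical.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def centroid(vectors):
    """Average the vectors position by position."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

# Invented embeddings for a hypothetical playlist.
playlist = {
    "acoustic ballad": [0.9, 0.1, 0.2],
    "folk duet": [0.8, 0.2, 0.1],
    "piano lullaby": [0.85, 0.15, 0.25],
    "drum and bass remix": [0.1, 0.9, 0.3],
}

center = centroid(list(playlist.values()))
odd_one_out = min(
    playlist, key=lambda name: cosine_similarity(playlist[name], center)
)
print(odd_one_out)  # drum and bass remix
```

Swapping min for max, and comparing against a single liked item instead of the average, turns the same code into a basic recommender.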

Embeddings also cross between media. OpenAI introduced CLIP in January 2021, a model that places images and text in one shared space using a dataset of around 400 million image and text pairs. Because a photo and its description can be compared directly, you can search a photo library with a plain sentence. In ecommerce, this matters: image embeddings can group near-duplicate product shots, find visually similar listings, or check whether a photo matches its description. At Renderivo our focus is upstream of that, on producing clean, consistent product photos in the first place, since reliable visual AI works best when the images feeding it are sharp and uncluttered.

Frequently asked questions

Is an embedding the same as the AI model?

No. The model is the trained system that produces embeddings. The embedding is the output: a list of numbers representing one specific input, such as a single word or image. Different models produce different embeddings for the same input.

How many dimensions does an embedding have?

It varies by model. Many text embeddings have several hundred to a few thousand numbers per item. More dimensions can capture more nuance but cost more to store and compare, so the right size depends on the task.
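The storage side of that trade-off is simple arithmetic. The sketch below assumes each number is stored as a 4-byte (32-bit) float and ignores any index overhead or compression. The dimension counts and the function name are illustrative choices, not tied to a specific model.

```python
BYTES_PER_NUMBER = 4  # assumption: 32-bit floats, no compression

def storage_megabytes(num_items, dimensions,
                      bytes_per_number=BYTES_PER_NUMBER):
    """Raw size of the embeddings alone, in megabytes (10**6 bytes)."""
    return num_items * dimensions * bytes_per_number / 1_000_000

for dims in (384, 1536, 3072):
    print(dims, storage_megabytes(1_000_000, dims), "MB")
# 384 1536.0 MB
# 1536 6144.0 MB
# 3072 12288.0 MB
```

Comparison cost grows the same way, since each similarity score touches every number in both vectors.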

Why use cosine similarity instead of plain distance?

Cosine similarity compares the direction of two vectors rather than their length, which suits high-dimensional embedding spaces where direction tends to carry the meaning. It is the standard choice for comparing embeddings, though some systems also use straight-line distance.
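The two measures are closely linked. When vectors are scaled to length one, the squared straight-line distance equals 2 minus 2 times the cosine similarity, so both rank neighbors in the same order. A small check with arbitrary example vectors:

```python
import math

def normalize(v):
    """Scale a non-zero vector to length one."""
    length = math.hypot(*v)
    return [x / length for x in v]

a = normalize([3.0, 4.0])
b = normalize([1.0, 0.0])

cosine = sum(x * y for x, y in zip(a, b))  # for unit vectors, dot = cosine
distance_squared = math.dist(a, b) ** 2

print(math.isclose(distance_squared, 2 - 2 * cosine))  # True
```

This holds only after normalization. On raw vectors of different lengths, the two measures can disagree.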

Can the same idea apply to images and text together?

Yes. Multimodal models such as CLIP place images and text in one shared space, so a picture and a sentence describing it land near each other. That makes it possible to search images with words or find matching pictures for a caption.

Cleaner product photos for smarter visual AI

Embeddings and visual search work best when your images are clean and consistent. New accounts get free credits to clean backgrounds and standardize product shots.
