Also known as: dense vector, feature vector, tensor (1D)
TL;DR
A vector is an ordered list of numbers — the universal data shape in modern AI. Every embedding, every layer activation, every gradient, every prediction is a vector under the hood.
A vector is an ordered list of numbers. That’s it. A list such as [0.42, -1.8, 3.1, ..., 0.07] of length $n$ is an $n$-dimensional vector. In modern AI, vectors are the universal data shape: every embedding is a vector, every layer activation inside a transformer is a vector, every gradient in backpropagation is a vector, and the output logits before softmax form a vector. The whole stack runs on floating-point arithmetic over arrays of numbers.
What “dimension” actually means
The dimension is the length of the list. A 768-dimensional vector has 768 numbers in it. The numbers themselves can mean anything — coordinates in space, scores across a vocabulary, components of an embedded sentence — but the shape of the data structure is fixed and matters for compute.
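As a minimal sketch of "dimension is the length of the list", the snippet below builds a 768-element NumPy array and reads its length and storage size. The variable names and the random placeholder values are assumptions for illustration, not output from any real model.

```python
import numpy as np

# A hypothetical 768-dimensional vector; the values are random placeholders.
rng = np.random.default_rng(0)
embedding = rng.standard_normal(768).astype(np.float32)

dimension = embedding.shape[0]   # the length of the list
n_bytes = embedding.nbytes       # 768 elements * 4 bytes per float32

print(dimension, n_bytes)
```

The storage figure is why the fixed shape matters for compute: memory and bandwidth scale linearly with the dimension.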
Common dimensions you’ll meet in AI:
768 — original BERT-base hidden size; many embedding models still emit this.
1024 — BERT-large, many production embedding models (E5, BGE).
30000–256000 — vocabulary-sized vectors (logits over the tokenizer’s vocabulary).
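The hidden sizes aren’t magic. They’re hardware-friendly multiples of 64 or 128, which tile cleanly onto GPU matrix-multiply units and memory-coalescing widths, chosen by the original architects. Embedding models inherit them by convention. Vocabulary-sized vectors are the exception: their length is set by the tokenizer’s vocabulary.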
The operations you’ll see everywhere
Three operations dominate AI compute:
Addition: adding two vectors of the same dimension element-wise. Used in residual connections, gradient updates, and accumulation.
Scalar multiplication: multiplying every element by a single number. Used in scaling, normalization, and softmax temperature.
Dot product: $a \cdot b = \sum_i a_i b_i$. The core similarity operation. Used in attention’s $QK^\top$, in cosine similarity, and in the final layer’s logit computation.
Most of the heavy ML primitives reduce to sequences of these three: a matrix multiply, for example, is a grid of dot products. GPUs are good at AI largely because they parallelize all three across thousands of cores at once.
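The three operations can be sketched in a few lines of dependency-free Python. The function names `add`, `scale`, and `dot` are illustrative choices; production code would use a vectorized library instead of Python lists.

```python
def add(a, b):
    """Element-wise sum of two vectors of the same dimension."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return [x + y for x, y in zip(a, b)]


def scale(c, a):
    """Multiply every element of vector a by the scalar c."""
    return [c * x for x in a]


def dot(a, b):
    """Sum of element-wise products: a single number."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return sum(x * y for x, y in zip(a, b))


a = [1.0, 2.0, 3.0]
b = [4.0, 5.0, 6.0]

print(add(a, b))      # [5.0, 7.0, 9.0]
print(scale(2.0, a))  # [2.0, 4.0, 6.0]
print(dot(a, b))      # 32.0
```

Each output element of `add` and `scale` depends on only one input position, and `dot` is a sum of independent products, which is what makes all three easy to parallelize.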
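The dot product is the standard bilinear, symmetric, scalar-valued operation on two vectors. (Strictly, any symmetric matrix $M$ defines another one, $a^\top M b$, but for a positive-definite $M$ that is just a dot product after a change of basis.) Most things you want to compute that take two vectors and return a single number (similarity, projection, alignment, attention weight) eventually reduce to a dot product, often after some normalization.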
Concretely:
In attention: $q \cdot k$ measures how much the query token wants to read from the key token.
In retrieval: $\text{query} \cdot \text{document}$ (after embedding both) measures relevance.
In the LLM output head: $h \cdot w_i$ for each row $w_i$ of the output projection produces one logit per vocab token.
In a linear classifier: the score for class $c$ is $w_c \cdot x$.
The recurring pattern: project both inputs into a shared space, then dot-product. The whole field of representation learning is “design embedding spaces where dot products mean something useful.”
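The cases above share one shape, sketched here with NumPy on toy sizes. The names `h`, `W`, `q`, and `K`, the sizes, and the random values are assumptions for illustration; a real model would learn these values.

```python
import numpy as np

rng = np.random.default_rng(0)
d, vocab_size = 8, 5                       # toy sizes, for illustration only

h = rng.standard_normal(d)                 # a hidden state
W = rng.standard_normal((vocab_size, d))   # output projection, one row per token

# One dot product per row of W ...
logits_loop = np.array([np.dot(W[i], h) for i in range(vocab_size)])
# ... which is the same thing as a single matrix-vector product.
logits = W @ h

# Attention scores follow the same pattern: one query against several keys.
q = rng.standard_normal(d)
K = rng.standard_normal((3, d))            # three key vectors, one per row
scores = K @ q                             # scores[j] is q . K[j]

print(logits.shape, scores.shape)
```

Writing the loop next to `W @ h` shows that the matrix product is only a batched way of taking many dot products at once.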
Norms — the “size” of a vector
The most common way to talk about a vector’s magnitude is its 2-norm (Euclidean length): $\|v\|_2 = \sqrt{\sum_i v_i^2}$. A unit vector has norm 1. Most production embedding indexes store unit-normalized vectors so that cosine similarity collapses to a plain dot product. That removes a normalization from every comparison, and with $N$ comparisons per query, the comparison is the inner loop of retrieval.
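A small sketch of why unit normalization helps, in plain Python. The helper names are illustrative, and the two vectors are toy values picked so the arithmetic is easy to check by hand.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    """Euclidean (2-norm) length of a vector."""
    return math.sqrt(dot(a, a))


def normalize(a):
    """Return a unit-length copy of a."""
    n = norm(a)
    if n == 0.0:
        raise ValueError("cannot normalize the zero vector")
    return [x / n for x in a]


def cosine_similarity(a, b):
    return dot(a, b) / (norm(a) * norm(b))


a = [3.0, 4.0]
b = [4.0, 3.0]

# Normalize once, at indexing time ...
a_hat, b_hat = normalize(a), normalize(b)

# ... and cosine similarity becomes a plain dot product at query time.
print(cosine_similarity(a, b), dot(a_hat, b_hat))
```

Both printed values agree up to floating-point rounding, so the division by the two norms can be paid once per stored vector instead of once per comparison.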
Where vectors stop being enough
Vectors are 1D. As soon as you need a 2D structure (an attention matrix Q @ K.T) or a 3D structure (a batch of sequences of token embeddings: [batch, seq_len, d]), you’re in tensor territory. A tensor is a higher-rank array of numbers; in PyTorch / NumPy code, a vector is simply a tensor whose .shape has a single entry. Most of the compute in a deep network reduces to vector arithmetic, batched and parallelized over higher-rank tensor shapes.
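The ranks mentioned above can be inspected directly through `.shape` and `.ndim`. This is a minimal NumPy sketch; the sizes are toy values and the arrays are filled with constants purely for illustration.

```python
import numpy as np

d, seq_len, batch = 4, 3, 2                 # toy sizes, for illustration only

vector = np.zeros(d)                        # rank 1: shape (d,)
Q = np.ones((seq_len, d))                   # rank 2: one query vector per row
K = np.ones((seq_len, d))                   # rank 2: one key vector per row
attention_scores = Q @ K.T                  # rank 2: shape (seq_len, seq_len)
batch_of_sequences = np.zeros((batch, seq_len, d))  # rank 3

print(vector.ndim, attention_scores.shape, batch_of_sequences.shape)
```

Every entry of `attention_scores` is still a dot product of two length-`d` vectors; the higher rank only records how many of them were computed together.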
Go further
What's the difference between a vector and a tensor?
A vector is a rank-1 tensor: one axis, whatever its length. A tensor is the N-dimensional generalization: a scalar is rank-0, a vector is rank-1, a matrix is rank-2, and an attention K/Q/V tensor is rank-3 or 4. In ML code (torch.Tensor, numpy.ndarray), 'tensor' is the umbrella term; 'vector' is the rank-1 case.
Dimensionality is the universal lever in ML. More dimensions = more expressive capacity but more storage, more compute per operation, more risk of curse-of-dimensionality effects. The standard knobs (768, 1024, 1536) come from BERT-era hidden sizes and are mostly hardware-friendly multiples of 64.
In modern AI, vector elements are almost always floating-point: float32, float16, or bfloat16. Integer vectors show up only as quantized representations (int8 for compressed embeddings, int4 for compressed model weights) or as token-ID inputs before they get embedded. Once you're in the model's hidden state, you're in float space.
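To make the float-versus-integer distinction concrete, here is a minimal NumPy sketch that stores the same vector as float32, float16, and int8. The int8 step uses a simple symmetric, per-vector scale, which is an assumption chosen for illustration and not the scheme of any particular library. NumPy has no built-in bfloat16, so it is left out.

```python
import numpy as np

v = np.array([0.42, -1.8, 3.1, 0.07], dtype=np.float32)

# Half precision: same values, half the storage, some rounding error.
v16 = v.astype(np.float16)

# Symmetric int8 quantization (illustrative scheme):
# map the range [-max|v|, +max|v|] onto the integers [-127, 127].
scale = float(np.max(np.abs(v))) / 127.0
q = np.clip(np.round(v / scale), -127, 127).astype(np.int8)

# Dequantize to get an approximation of the original floats back.
v_restored = q.astype(np.float32) * scale

print(v.nbytes, v16.nbytes, q.nbytes)  # 16 8 4
print(np.max(np.abs(v_restored - v)))  # rounding error, at most about scale / 2
```

The integer vector is only a storage format here: it is converted back to floats before any arithmetic that feeds the model.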