API (Application Programming Interface)
A standardized way for software applications to communicate with AI models. Most AI providers offer REST APIs where you send text prompts and receive model outputs. API access is typically metered by token usage, with pricing varying by model and provider.

Benchmark
A standardized test used to evaluate and compare AI model performance. Common benchmarks include MMLU (general knowledge), HumanEval (coding), SWE-bench (software engineering), and GSM8K (math reasoning). AI Models Map aggregates scores from multiple benchmarks to produce composite rankings.

Chain-of-Thought (CoT)
A prompting technique that encourages the model to break down complex problems into intermediate reasoning steps before arriving at a final answer. Chain-of-thought prompting significantly improves performance on math, logic, and multi-step reasoning tasks.

Context Window
The maximum amount of text (measured in tokens) that a model can process in a single request, including both the input prompt and the generated output. Larger context windows allow models to handle longer documents and maintain more conversation history. Modern models range from 8K to over 2M tokens.

Diffusion Model
A type of generative AI model that creates images (or other media) by gradually removing noise from a random starting point. Models like Stable Diffusion, DALL-E 3, and Midjourney use diffusion processes to generate high-quality images from text descriptions.

Elo Rating
A scoring system adapted from chess that ranks AI models based on head-to-head comparisons. In AI arenas like Chatbot Arena, users compare two model outputs blindly and vote for the better one. The resulting Elo scores reflect relative quality as judged by real users.

Embedding
A numerical representation of text (or images, audio, etc.) as a vector of numbers. Embeddings capture semantic meaning, so similar concepts have similar vector values.
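Similarity between embeddings is usually measured with cosine similarity. A minimal sketch using plain Python lists; the 4-dimensional vectors here are toy values, not real model embeddings (which have hundreds or thousands of dimensions):

```python
import math

def cosine_similarity(a, b):
    # cos(a, b) = dot(a, b) / (|a| * |b|); ranges from -1 to 1,
    # where values near 1 mean the vectors point the same way.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" for three concepts.
cat = [0.9, 0.1, 0.0, 0.2]
kitten = [0.85, 0.15, 0.05, 0.25]
car = [0.0, 0.9, 0.8, 0.1]

print(cosine_similarity(cat, kitten))  # high: related concepts
print(cosine_similarity(cat, car))     # low: unrelated concepts
```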
They are essential for search, recommendation systems, and retrieval-augmented generation.

Few-shot Learning
A technique where a model is given a small number of examples (typically 2-5) within the prompt to demonstrate the desired task or output format. Few-shot prompting helps models understand exactly what kind of response is expected without any additional training.

Fine-tuning
The process of further training a pre-trained AI model on a specific dataset to improve its performance on particular tasks. Fine-tuning allows you to customize a general-purpose model for specialized use cases like medical diagnosis, legal analysis, or domain-specific coding.

GAN (Generative Adversarial Network)
A machine learning architecture consisting of two neural networks: a generator that creates synthetic data and a discriminator that tries to distinguish real from generated data. GANs were the dominant approach to image generation before diffusion models and are still used in video synthesis and image editing.

Hallucination
When an AI model generates information that sounds plausible but is factually incorrect, fabricated, or not grounded in the provided context. Hallucinations are a major challenge in LLM deployment. Techniques like RAG, grounding, and chain-of-thought reasoning help reduce them.

HumanEval
A benchmark for evaluating AI code generation, consisting of 164 hand-written Python programming problems. Each problem includes a function signature, docstring, and unit tests. Models are scored on pass@1 (percentage of problems solved correctly on the first attempt).

Inference
The process of running a trained AI model to generate predictions or outputs from new input data. When you send a prompt to an AI API, the model performs inference to produce its response. Inference speed and cost are critical factors in production applications.

Latency
The time delay between sending a request to an AI model and receiving the response. Lower latency means faster responses.
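Measuring latency for a streaming model response can be sketched as follows. This is a minimal illustration only: `fake_stream` is a stand-in for a real streaming API call, with artificial delays in place of network and generation time:

```python
import time

def fake_stream(prompt):
    # Stand-in for a streaming model API: yields tokens with a delay.
    for token in ["Hello", ",", " world", "!"]:
        time.sleep(0.01)  # simulated network / generation delay
        yield token

start = time.perf_counter()
first_token_at = None
tokens = []
for token in fake_stream("Say hello"):
    if first_token_at is None:
        # Time-to-first-token: how long before any output appears.
        first_token_at = time.perf_counter() - start
    tokens.append(token)
total = time.perf_counter() - start  # latency for the full response

print(f"TTFT: {first_token_at * 1000:.1f} ms, total: {total * 1000:.1f} ms")
```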
Latency is typically measured in milliseconds and varies based on model size, server load, and geographic distance. Time-to-first-token (TTFT) is a related metric.

LLM (Large Language Model)
A type of AI model trained on massive text datasets to understand and generate human language. Examples include GPT-5, Claude, Gemini, and Llama. LLMs power chatbots, coding assistants, content generation tools, and a growing range of agentic applications.

LoRA (Low-Rank Adaptation)
A parameter-efficient fine-tuning technique that adds small trainable matrices to a frozen pre-trained model instead of updating all parameters. LoRA dramatically reduces the memory and compute required for fine-tuning, making it practical to customize large models on consumer hardware.

MMLU (Massive Multitask Language Understanding)
A benchmark consisting of 57 subjects across STEM, humanities, social sciences, and more, with about 15,000 multiple-choice questions. MMLU tests a model's breadth of knowledge and reasoning ability. Top models now score above 90%, leading to the creation of harder variants like MMLU-Pro.

MoE (Mixture of Experts)
An architecture where a large model is split into smaller specialized sub-networks (experts), and a routing mechanism selects which experts to activate for each input. MoE allows models to have more total parameters while keeping inference costs manageable, since only a fraction of experts are active per token.

Multimodal
An AI model that can process and generate multiple types of data, such as text, images, audio, and video. Multimodal models like GPT-4o, Claude, and Gemini can understand images in prompts, and some can generate both text and visual content.

Perplexity
A metric that measures how well a language model predicts a sequence of text. Lower perplexity indicates the model is less "surprised" by the text and better at predicting the next token.
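Concretely, perplexity is the exponential of the average negative log-probability the model assigns to each token in the sequence. A minimal sketch with made-up per-token probabilities:

```python
import math

def perplexity(token_probs):
    # perplexity = exp(-(1/N) * sum(log p_i))
    # where p_i is the probability the model gave the i-th token.
    n = len(token_probs)
    avg_neg_log_prob = -sum(math.log(p) for p in token_probs) / n
    return math.exp(avg_neg_log_prob)

# Made-up probabilities for two short sequences.
confident = [0.9, 0.8, 0.95, 0.85]  # model predicts each token well
uncertain = [0.2, 0.1, 0.3, 0.15]   # model is often "surprised"

print(perplexity(confident))  # low: better predictions
print(perplexity(uncertain))  # high: worse predictions
```

A model that assigned probability 1.0 to every token would reach the minimum perplexity of exactly 1.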
Perplexity is commonly used during model training and evaluation as an intrinsic quality measure.

Prompt Engineering
The practice of crafting input prompts to get the best possible output from an AI model. Techniques include providing examples (few-shot), assigning roles, specifying output format, breaking complex tasks into steps (chain-of-thought), and using system prompts to set behavior.

Quantization
The process of reducing the numerical precision of a model's weights (e.g., from 32-bit floating point to 4-bit integers) to decrease memory usage and speed up inference. Quantization enables running large models on consumer GPUs with minimal quality loss. Common formats include GPTQ, GGUF, and AWQ.

RAG (Retrieval-Augmented Generation)
A technique that enhances AI model responses by first retrieving relevant information from an external knowledge base, then including that context in the prompt. RAG reduces hallucinations and keeps responses grounded in up-to-date, factual data without retraining the model.

RLHF (Reinforcement Learning from Human Feedback)
A training technique where human evaluators rate model outputs, and the model learns to produce responses that align with human preferences. RLHF is a key reason modern chatbots feel helpful, harmless, and honest. Variants include RLAIF (AI feedback) and DPO (Direct Preference Optimization).

SDK (Software Development Kit)
A collection of tools, libraries, and documentation that simplifies integrating an AI model into your application. SDKs typically wrap REST APIs with language-specific functions, handle authentication, manage retries, and provide type-safe interfaces. Major providers offer SDKs for Python, TypeScript, and other languages.

Temperature
A parameter that controls the randomness of a model's output. Lower temperatures (e.g., 0.1) produce more deterministic, focused responses, while higher temperatures (e.g., 1.0) produce more creative and varied outputs.
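Under the hood, temperature divides the model's raw scores (logits) before the softmax converts them to probabilities. A minimal sketch showing how a low temperature sharpens the distribution toward the top token (the logit values are made up):

```python
import math

def softmax_with_temperature(logits, temperature):
    # Scale logits by 1/temperature, then apply softmax.
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens

print(softmax_with_temperature(logits, 0.1))  # sharp: top token dominates
print(softmax_with_temperature(logits, 1.0))  # softer: more variety
```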
Temperature 0 gives the most consistent, repeatable results.

Throughput
The number of tokens a model can generate per second, or the number of requests it can handle in a given time period. Higher throughput means the model can serve more users simultaneously and generate responses faster. Throughput is a key metric for production deployments.

Tokens
The basic unit of text that AI models process. A token can be a word, part of a word, or a punctuation mark. On average, one token equals roughly 3/4 of an English word (~4 characters). API pricing is typically measured per million tokens (input and output priced separately).

Top-p (Nucleus Sampling)
A sampling parameter that limits the model to choosing from the smallest set of tokens whose cumulative probability exceeds the threshold p. For example, Top-p of 0.9 means the model considers only the most likely tokens that together account for 90% of the probability mass, pruning unlikely options.

Transformer
The neural network architecture behind virtually all modern LLMs. Introduced in 2017 ("Attention Is All You Need"), transformers use self-attention mechanisms to process relationships between all parts of the input simultaneously, enabling much better handling of long-range dependencies in text.

Vector Database
A specialized database optimized for storing, indexing, and querying high-dimensional vectors (embeddings). Vector databases like Pinecone, Weaviate, Qdrant, and pgvector enable fast similarity search, which is essential for RAG pipelines, semantic search, and recommendation systems.

Zero-shot Learning
A model's ability to perform a task it was not explicitly trained on, without any examples in the prompt. Zero-shot capability is a key measure of a model's generalization. For instance, a model performing sentiment analysis without ever seeing labeled sentiment examples is doing zero-shot learning.
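The Top-p definition above translates directly into code. A minimal sketch that prunes a toy next-token distribution to its "nucleus" and renormalizes it before sampling (the token names and probabilities are made up):

```python
def nucleus(probs, p):
    # Keep the smallest set of tokens, from most to least likely,
    # whose cumulative probability reaches the threshold p.
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = {}, 0.0
    for token, prob in ranked:
        kept[token] = prob
        cumulative += prob
        if cumulative >= p:
            break
    # Renormalize so the kept probabilities sum to 1 before sampling.
    total = sum(kept.values())
    return {token: prob / total for token, prob in kept.items()}

# Made-up next-token distribution.
probs = {"the": 0.5, "a": 0.3, "cat": 0.15, "zebra": 0.05}
print(nucleus(probs, 0.9))  # "zebra" falls outside the nucleus
```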