API (Application Programming Interface)
A standardized way for software applications to communicate with AI models. Most AI providers offer REST APIs where you send text prompts and receive model outputs. API access is typically metered by token usage, with pricing varying by model and provider.

Benchmark
A standardized test used to evaluate and compare AI model performance. Common benchmarks include MMLU (general knowledge), HumanEval (coding), SWE-bench (software engineering), and GSM8K (math reasoning). AI Models Map aggregates scores from multiple benchmarks to produce composite rankings.

Chain-of-Thought (CoT)
A prompting technique that encourages the model to break down complex problems into intermediate reasoning steps before arriving at a final answer. Chain-of-thought prompting significantly improves performance on math, logic, and multi-step reasoning tasks.

Context Window
The maximum amount of text (measured in tokens) that a model can process in a single request, including both the input prompt and the generated output. Larger context windows allow models to handle longer documents and maintain more conversation history. Modern models range from 8K to over 2M tokens.

Diffusion Model
A type of generative AI model that creates images (or other media) by gradually removing noise from a random starting point. Models like Stable Diffusion, DALL-E 3, and Midjourney use diffusion processes to generate high-quality images from text descriptions.

Elo Rating
A scoring system adapted from chess that ranks AI models based on head-to-head comparisons. In AI arenas like Chatbot Arena, users compare two model outputs blindly and vote for the better one. The resulting Elo scores reflect relative quality as judged by real users.

Embedding
A numerical representation of text (or images, audio, etc.) as a vector of numbers. Embeddings capture semantic meaning, so similar concepts have similar vector values.
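Similarity between embeddings is usually measured with cosine similarity. A minimal sketch using plain Python lists; the 4-dimensional vectors here are toy values, not real model embeddings (which have hundreds or thousands of dimensions):

```python
import math

def cosine_similarity(a, b):
    # cos(a, b) = dot(a, b) / (|a| * |b|); ranges from -1 to 1,
    # where values near 1 mean the vectors point the same way.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" for three concepts.
cat = [0.9, 0.1, 0.0, 0.2]
kitten = [0.85, 0.15, 0.05, 0.25]
car = [0.0, 0.9, 0.8, 0.1]

print(cosine_similarity(cat, kitten))  # high: related concepts
print(cosine_similarity(cat, car))     # low: unrelated concepts
```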
They are essential for search, recommendation systems, and retrieval-augmented generation.

Few-shot Learning
A technique where a model is given a small number of examples (typically 2-5) within the prompt to demonstrate the desired task or output format. Few-shot prompting helps models understand exactly what kind of response is expected without any additional training.

Fine-tuning
The process of further training a pre-trained AI model on a specific dataset to improve its performance on particular tasks. Fine-tuning allows you to customize a general-purpose model for specialized use cases like medical diagnosis, legal analysis, or domain-specific coding.

GAN (Generative Adversarial Network)
A machine learning architecture consisting of two neural networks: a generator that creates synthetic data and a discriminator that tries to distinguish real from generated data. GANs were the dominant approach to image generation before diffusion models and are still used in video synthesis and image editing.

Hallucination
When an AI model generates information that sounds plausible but is factually incorrect, fabricated, or not grounded in the provided context. Hallucinations are a major challenge in LLM deployment. Techniques like RAG, grounding, and chain-of-thought reasoning help reduce them.

HumanEval
A benchmark for evaluating AI code generation, consisting of 164 hand-written Python programming problems. Each problem includes a function signature, docstring, and unit tests. Models are scored on pass@1 (percentage of problems solved correctly on the first attempt).

Inference
The process of running a trained AI model to generate predictions or outputs from new input data. When you send a prompt to an AI API, the model performs inference to produce its response. Inference speed and cost are critical factors in production applications.

Latency
The time delay between sending a request to an AI model and receiving the response. Lower latency means faster responses.
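Measuring latency for a streaming model response can be sketched as follows. This is a minimal illustration only: `fake_stream` is a stand-in for a real streaming API call, with artificial delays in place of network and generation time:

```python
import time

def fake_stream(prompt):
    # Stand-in for a streaming model API: yields tokens with a delay.
    for token in ["Hello", ",", " world", "!"]:
        time.sleep(0.01)  # simulated network / generation delay
        yield token

start = time.perf_counter()
first_token_at = None
tokens = []
for token in fake_stream("Say hello"):
    if first_token_at is None:
        # Time-to-first-token: how long before any output appears.
        first_token_at = time.perf_counter() - start
    tokens.append(token)
total = time.perf_counter() - start  # latency for the full response

print(f"TTFT: {first_token_at * 1000:.1f} ms, total: {total * 1000:.1f} ms")
```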
Latency is typically measured in milliseconds and varies based on model size, server load, and geographic distance. Time-to-first-token (TTFT) is a related metric.

LLM (Large Language Model)
A type of AI model trained on massive text datasets to understand and generate human language. Examples include GPT-5, Claude, Gemini, and Llama. LLMs power chatbots, coding assistants, content generation tools, and a growing range of agentic applications.

LoRA (Low-Rank Adaptation)
A parameter-efficient fine-tuning technique that adds small trainable matrices to a frozen pre-trained model instead of updating all parameters. LoRA dramatically reduces the memory and compute required for fine-tuning, making it practical to customize large models on consumer hardware.

MMLU (Massive Multitask Language Understanding)
A benchmark consisting of 57 subjects across STEM, humanities, social sciences, and more, with about 15,000 multiple-choice questions. MMLU tests a model's breadth of knowledge and reasoning ability. Top models now score above 90%, leading to the creation of harder variants like MMLU-Pro.

MoE (Mixture of Experts)
An architecture where a large model is split into smaller specialized sub-networks (experts), and a routing mechanism selects which experts to activate for each input. MoE allows models to have more total parameters while keeping inference costs manageable, since only a fraction of experts are active per token.

Multimodal
An AI model that can process and generate multiple types of data, such as text, images, audio, and video. Multimodal models like GPT-4o, Claude, and Gemini can understand images in prompts, and some can generate both text and visual content.

Perplexity
A metric that measures how well a language model predicts a sequence of text. Lower perplexity indicates the model is less "surprised" by the text and better at predicting the next token.
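Concretely, perplexity is the exponential of the average negative log-probability the model assigns to each token in the sequence. A minimal sketch with made-up per-token probabilities:

```python
import math

def perplexity(token_probs):
    # perplexity = exp(-(1/N) * sum(log p_i))
    # where p_i is the probability the model gave the i-th token.
    n = len(token_probs)
    avg_neg_log_prob = -sum(math.log(p) for p in token_probs) / n
    return math.exp(avg_neg_log_prob)

# Made-up probabilities for two short sequences.
confident = [0.9, 0.8, 0.95, 0.85]  # model predicts each token well
uncertain = [0.2, 0.1, 0.3, 0.15]   # model is often "surprised"

print(perplexity(confident))  # low: better predictions
print(perplexity(uncertain))  # high: worse predictions
```

A model that assigned probability 1.0 to every token would reach the minimum perplexity of exactly 1.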
Perplexity is commonly used during model training and evaluation as an intrinsic quality measure.

Prompt Engineering
The practice of crafting input prompts to get the best possible output from an AI model. Techniques include providing examples (few-shot), assigning roles, specifying output format, breaking complex tasks into steps (chain-of-thought), and using system prompts to set behavior.

Quantization
The process of reducing the numerical precision of a model's weights (e.g., from 32-bit floating point to 4-bit integers) to decrease memory usage and speed up inference. Quantization enables running large models on consumer GPUs with minimal quality loss. Common formats include GPTQ, GGUF, and AWQ.

RAG (Retrieval-Augmented Generation)
A technique that enhances AI model responses by first retrieving relevant information from an external knowledge base, then including that context in the prompt. RAG reduces hallucinations and keeps responses grounded in up-to-date, factual data without retraining the model.

RLHF (Reinforcement Learning from Human Feedback)
A training technique where human evaluators rate model outputs, and the model learns to produce responses that align with human preferences. RLHF is a key reason modern chatbots feel helpful, harmless, and honest. Variants include RLAIF (AI feedback) and DPO (Direct Preference Optimization).

SDK (Software Development Kit)
A collection of tools, libraries, and documentation that simplifies integrating an AI model into your application. SDKs typically wrap REST APIs with language-specific functions, handle authentication, manage retries, and provide type-safe interfaces. Major providers offer SDKs for Python, TypeScript, and other languages.

Temperature
A parameter that controls the randomness of a model's output. Lower temperatures (e.g., 0.1) produce more deterministic, focused responses, while higher temperatures (e.g., 1.0) produce more creative and varied outputs.
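Under the hood, temperature divides the model's raw scores (logits) before the softmax converts them to probabilities. A minimal sketch showing how a low temperature sharpens the distribution toward the top token (the logit values are made up):

```python
import math

def softmax_with_temperature(logits, temperature):
    # Scale logits by 1/temperature, then apply softmax.
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens

print(softmax_with_temperature(logits, 0.1))  # sharp: top token dominates
print(softmax_with_temperature(logits, 1.0))  # softer: more variety
```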
Temperature 0 gives the most consistent, repeatable results.

Throughput
The number of tokens a model can generate per second, or the number of requests it can handle in a given time period. Higher throughput means the model can serve more users simultaneously and generate responses faster. Throughput is a key metric for production deployments.

Tokens
The basic unit of text that AI models process. A token can be a word, part of a word, or a punctuation mark. On average, one token equals roughly 3/4 of an English word (~4 characters). API pricing is typically measured per million tokens (input and output priced separately).

Top-p (Nucleus Sampling)
A sampling parameter that limits the model to choosing from the smallest set of tokens whose cumulative probability exceeds the threshold p. For example, Top-p of 0.9 means the model considers only the most likely tokens that together account for 90% of the probability mass, pruning unlikely options.

Transformer
The neural network architecture behind virtually all modern LLMs. Introduced in 2017 ("Attention Is All You Need"), transformers use self-attention mechanisms to process relationships between all parts of the input simultaneously, enabling much better handling of long-range dependencies in text.

Vector Database
A specialized database optimized for storing, indexing, and querying high-dimensional vectors (embeddings). Vector databases like Pinecone, Weaviate, Qdrant, and pgvector enable fast similarity search, which is essential for RAG pipelines, semantic search, and recommendation systems.

Zero-shot Learning
A model's ability to perform a task it was not explicitly trained on, without any examples in the prompt. Zero-shot capability is a key measure of a model's generalization. For instance, a model performing sentiment analysis without ever seeing labeled sentiment examples is doing zero-shot learning.
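The Top-p definition above translates directly into code. A minimal sketch that prunes a toy next-token distribution to its "nucleus" and renormalizes it before sampling (the token names and probabilities are made up):

```python
def nucleus(probs, p):
    # Keep the smallest set of tokens, from most to least likely,
    # whose cumulative probability reaches the threshold p.
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = {}, 0.0
    for token, prob in ranked:
        kept[token] = prob
        cumulative += prob
        if cumulative >= p:
            break
    # Renormalize so the kept probabilities sum to 1 before sampling.
    total = sum(kept.values())
    return {token: prob / total for token, prob in kept.items()}

# Made-up next-token distribution.
probs = {"the": 0.5, "a": 0.3, "cat": 0.15, "zebra": 0.05}
print(nucleus(probs, 0.9))  # "zebra" falls outside the nucleus
```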