40+ terms defined with examples. Your reference guide for data annotation, RLHF, model evaluation, and AI operations.
A 3D point cloud is a dataset of points in three-dimensional space, where each point has x, y, z coordinates and may carry additional attributes such as intensity, color, or surface normal. Point clouds are generated by sensors like LiDAR, structured-light scanners, or stereo cameras and represent the geometry of physical environments. Processing and annotating point clouds requires specialized tools and algorithms due to their unordered, sparse, and large-scale nature.
Example
A construction company captures weekly 3D point cloud scans of a building site using a terrestrial LiDAR scanner, then compares successive scans against BIM models to track construction progress and detect deviations.
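As a minimal sketch of how point cloud data is typically represented in code (toy values and illustrative names; production pipelines use libraries such as Open3D and handle millions of points), each point can be stored as coordinates plus attributes, with an axis-aligned bounding box computed as a first step toward fitting annotation cuboids:

```python
# Toy point cloud: each point is (x, y, z, intensity). The intensity
# attribute here stands in for a LiDAR return value.

def axis_aligned_bounds(points):
    """Return ((min_x, min_y, min_z), (max_x, max_y, max_z)) for a point cloud."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    zs = [p[2] for p in points]
    return (min(xs), min(ys), min(zs)), (max(xs), max(ys), max(zs))

cloud = [
    (1.0, 2.0, 0.1, 0.9),
    (1.5, 2.2, 0.0, 0.7),
    (0.8, 1.9, 0.3, 0.8),
]
lo, hi = axis_aligned_bounds(cloud)
# lo == (0.8, 1.9, 0.0), hi == (1.5, 2.2, 0.3)
```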
Audio transcription is the process of converting spoken language in audio recordings into accurate, time-stamped text. Beyond verbatim transcription, it can include speaker diarization (identifying who spoke when), emotion tagging, and annotation of non-speech sounds. High-quality transcription datasets are critical for training automatic speech recognition (ASR) systems and voice assistants.
Example
A healthcare AI startup transcribes thousands of doctor-patient consultations, tagging each speaker turn and medical terminology, to train a clinical ASR model that auto-generates visit summaries.
Active learning is a machine learning strategy in which the model selectively queries a human annotator to label the most informative or uncertain data points, rather than labeling data randomly. By focusing annotation effort on samples where the model is least confident—near decision boundaries, in underrepresented regions, or on edge cases—active learning can achieve target accuracy with significantly fewer labeled examples. This dramatically reduces annotation cost and time-to-deployment.
Example
A document classification team uses uncertainty sampling to identify the 500 most ambiguous documents from a pool of 50,000 unlabeled samples, achieving the same F1 score as random labeling of 5,000 documents—a 10x reduction in annotation effort.
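The uncertainty-sampling step in the example can be sketched with Shannon entropy over the model's class probabilities (toy softmax outputs, illustrative function names):

```python
import math

def entropy(probs):
    """Shannon entropy of a class-probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_most_uncertain(predictions, k):
    """Return indices of the k samples whose predicted distributions
    have the highest entropy, i.e., where the model is least confident."""
    ranked = sorted(range(len(predictions)),
                    key=lambda i: entropy(predictions[i]),
                    reverse=True)
    return ranked[:k]

# Model softmax outputs for four unlabeled documents (3 classes).
preds = [
    [0.98, 0.01, 0.01],  # very confident
    [0.34, 0.33, 0.33],  # highly uncertain
    [0.70, 0.20, 0.10],
    [0.50, 0.45, 0.05],
]
print(select_most_uncertain(preds, 2))  # -> [1, 3]
```

The two selected documents would be sent to annotators; the rest stay in the unlabeled pool for the next round.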
An annotation pipeline is an end-to-end workflow that orchestrates the creation, review, and delivery of labeled data for machine learning. It typically includes task distribution to annotators, multi-stage quality review (adjudication, consensus, expert review), inter-annotator agreement measurement, and export in model-ready formats. A well-designed annotation pipeline balances throughput, cost, and quality while incorporating feedback loops from model performance back into annotation guidelines.
Example
A medical imaging company's annotation pipeline routes each CT scan through two radiologist annotators independently, flags disagreements for a senior specialist to adjudicate, and exports consensus segmentation masks in NIfTI format for model training.
Agentic AI refers to AI systems that can autonomously plan, execute multi-step tasks, use tools, and adapt their strategies based on intermediate results—going beyond single-turn question answering to sustained, goal-directed behavior. These systems often combine a large language model with external tools (web search, code execution, APIs) and a planning loop that decomposes complex goals into subtasks. Agentic AI represents a shift from passive assistants to proactive collaborators capable of completing real-world workflows.
Example
A software engineering agent receives a bug report, searches the codebase for relevant files, writes a fix, runs the test suite, and submits a pull request—all without human intervention beyond the initial task description.
AI red teaming is the practice of systematically probing an AI system to discover vulnerabilities, failure modes, and harmful behaviors before deployment. Red teamers craft adversarial prompts, edge cases, and attack vectors—such as jailbreaks, prompt injections, and social engineering scenarios—to test the model's safety guardrails. The findings inform targeted mitigations including fine-tuning, system prompt hardening, and output filtering, making red teaming a critical component of responsible AI deployment.
Example
Before launching a customer-facing chatbot, a safety team of 15 red teamers spends two weeks attempting to elicit harmful content, PII leakage, and policy violations, documenting 237 failure cases that are used to retrain the model's safety classifier.
A bounding box is a rectangular annotation drawn around an object in an image or video frame, defined by its top-left and bottom-right coordinates (or center, width, and height). It is the simplest and most widely used form of object localization in computer vision. While bounding boxes do not capture object shape precisely, their speed of annotation and computational efficiency make them the default choice for object detection tasks.
Example
A retail analytics system uses bounding box annotations around products on store shelves to train a detector that monitors stock levels and flags empty slots for restocking in real time.
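Overlap between a predicted and a ground-truth box is usually measured with Intersection-over-Union (IoU), the standard matching criterion in detection benchmarks. A minimal implementation for the corner format described above:

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two boxes given as (x1, y1, x2, y2),
    with (x1, y1) the top-left and (x2, y2) the bottom-right corner."""
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # -> ~0.143 (25 / 175)
```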
Bias detection is the process of identifying systematic prejudices in AI models, training data, or outputs that lead to unfair treatment of individuals or groups based on attributes such as race, gender, age, or socioeconomic status. Detection methods include statistical parity analysis across demographic groups, counterfactual testing (changing protected attributes and observing output changes), and adversarial probing with targeted prompts. Addressing detected biases is essential for building equitable AI systems and meeting regulatory requirements.
Example
An HR tech company tests its resume screening model by submitting identical resumes with names associated with different ethnic backgrounds, discovering a 23% callback rate disparity that leads them to retrain the model with debiased features.
Constitutional AI (CAI) is an alignment approach developed by Anthropic in which a language model is guided by a set of written principles—a 'constitution'—that define desired behavior. The model first critiques and revises its own outputs based on these principles, then the revised outputs are used to train a preference model via RLHF. CAI reduces reliance on large-scale human feedback for safety while making the alignment criteria explicit and auditable.
Example
A safety team defines 15 constitutional principles—such as 'Choose the response that is least likely to be used for harm'—and uses the model's self-critique against these rules to generate preference pairs for alignment training.
Coreference resolution identifies all expressions in a text that refer to the same real-world entity and clusters them together. For example, linking 'Marie Curie,' 'she,' 'the physicist,' and 'her' to the same entity enables coherent understanding of the text. This task is crucial for document summarization, question answering, and knowledge extraction where pronoun and noun phrase references must be resolved to avoid ambiguity.
Example
A question answering system uses coreference resolution on a Wikipedia article about Albert Einstein to determine that 'he,' 'the Nobel laureate,' and 'Einstein' all refer to the same person, allowing it to correctly answer 'Who developed the theory of relativity?'
Data annotation is the process of labeling raw data—such as images, text, audio, or video—with meaningful tags so that machine learning models can learn from it. High-quality annotations serve as the ground truth that supervised learning algorithms use during training. The accuracy, consistency, and granularity of annotations directly determine the upper bound of model performance.
Example
A self-driving car company annotates millions of dashcam frames by drawing bounding boxes around pedestrians, cyclists, and vehicles, enabling their perception model to detect road users in real time.
Direct Preference Optimization (DPO) is an alignment technique that eliminates the need for a separate reward model by directly optimizing the language model on human preference pairs. Instead of the multi-stage RLHF pipeline—train reward model, then run RL—DPO reformulates the objective so the model learns from chosen-versus-rejected response pairs in a single supervised training step. This simplification reduces computational cost and training instability while achieving comparable alignment quality.
Example
A chatbot team collects 50,000 preference pairs where annotators chose the better of two responses, then runs DPO training in a single pass rather than building a separate reward model and PPO loop.
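The per-pair DPO objective itself is compact: a logistic loss on the difference between the policy's and the reference model's log-probability margins for the chosen versus rejected response. A sketch with made-up log-probabilities (beta is the usual temperature hyperparameter):

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Per-pair DPO loss: -log sigmoid(beta * margin), where margin is
    (policy - reference) log-prob of the chosen response minus the same
    quantity for the rejected response."""
    margin = ((logp_chosen - ref_logp_chosen)
              - (logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# If the policy already prefers the chosen response relative to the
# reference model, the loss falls below log(2) (the zero-margin value):
loss = dpo_loss(-10.0, -14.0, -12.0, -12.0, beta=0.1)
```

In practice the log-probabilities are summed over response tokens and the loss is averaged over a batch of preference pairs.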
A data pipeline is an automated sequence of steps that ingests, transforms, validates, and delivers data from source systems to downstream consumers such as model training jobs, annotation platforms, or analytics dashboards. Well-designed pipelines include schema validation, deduplication, data quality checks, and monitoring to ensure that models are trained on clean, consistent data. In ML workflows, data pipelines are often the most complex and maintenance-intensive component.
Example
A fraud detection team builds a data pipeline that ingests transaction logs from Kafka, joins them with user profile data from PostgreSQL, applies feature engineering transformations, and writes daily training datasets to S3.
Data versioning tracks changes to datasets over time, enabling reproducibility of experiments, rollback to previous dataset states, and auditability of the data used to train each model version. Tools like DVC, LakeFS, and Delta Lake provide git-like semantics for large datasets, supporting branching, diffing, and merging of data. Data versioning is a cornerstone of mature MLOps practices and is often required for regulatory compliance in industries like healthcare and finance.
Example
An NLP team uses DVC to version their training corpus, discovering that a 2% accuracy drop in their latest model was caused by a data cleaning script that accidentally removed 15,000 valid training examples in the most recent dataset commit.
Depth estimation predicts the distance of each pixel in an image from the camera, producing a dense depth map from monocular (single-camera) or stereo imagery. Monocular depth estimation uses deep learning to infer depth from visual cues like texture gradients, occlusion, and perspective, while stereo methods triangulate depth from two viewpoints. Accurate depth estimation is essential for 3D scene reconstruction, augmented reality, robotic navigation, and autonomous driving.
Example
An AR furniture app uses monocular depth estimation on a smartphone camera feed to understand room geometry, allowing users to place virtual couches and tables at physically plausible positions and scales.
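The stereo case reduces to simple triangulation on a rectified image pair: depth equals focal length times baseline divided by disparity. A sketch with illustrative rig parameters:

```python
def stereo_depth(focal_px, baseline_m, disparity_px):
    """Triangulated depth in meters from a rectified stereo pair:
    depth = focal length (pixels) * baseline (meters) / disparity (pixels)."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_px * baseline_m / disparity_px

# A point seen 20 px apart between the two cameras of a rig with a
# 0.12 m baseline and 700 px focal length lies 4.2 m away:
print(stereo_depth(700, 0.12, 20))  # -> 4.2
```

Note the inverse relationship: small disparities (distant points) make depth estimates increasingly sensitive to matching errors.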
A foundation model is a large-scale neural network pre-trained on broad, diverse data that can be adapted to a wide range of downstream tasks through fine-tuning, prompting, or in-context learning. Examples include GPT-4, Claude, LLaMA, and DALL-E. These models learn general-purpose representations during pre-training and exhibit emergent capabilities—such as reasoning, translation, and code generation—that were not explicitly trained for. The foundation model paradigm has unified many previously separate AI subfields under a common architecture.
Example
A legal tech company takes a pre-trained foundation model and fine-tunes it on 100,000 legal documents, creating a specialized assistant that can draft contracts, summarize case law, and answer regulatory questions.
Human preference data consists of structured annotations in which human raters compare, rank, or rate model outputs according to criteria such as helpfulness, accuracy, safety, and style. This data is the foundation of alignment techniques like RLHF and DPO, translating subjective human values into a training signal. Collecting high-quality preference data requires carefully designed annotation guidelines, diverse annotator pools, and rigorous quality assurance to minimize bias and noise.
Example
An alignment team hires 200 annotators across five countries to rank pairs of chatbot responses on helpfulness and harmlessness, producing 500,000 labeled comparisons used to train both a reward model and DPO baselines.
Hallucination detection identifies instances where an AI model generates factually incorrect, fabricated, or unsupported information presented as if it were true. Hallucinations can range from subtle inaccuracies (wrong dates or statistics) to entirely invented citations, events, or entities. Detection methods include cross-referencing model outputs against verified knowledge sources, using entailment classifiers, and employing human evaluators who fact-check responses against source documents.
Example
A research assistant AI is tested by asking it to cite five academic papers on a topic; a hallucination detection pipeline verifies each citation against Semantic Scholar's API, flagging two fabricated paper titles and DOIs that do not exist.
Human-in-the-Loop (HITL) refers to a workflow design in which human judgment is integrated into an AI system's decision-making process, whether by reviewing model outputs, correcting errors, handling edge cases, or providing feedback that improves the model over time. HITL systems balance automation efficiency with human oversight accuracy, and are especially important in high-stakes domains like healthcare, legal, and finance where fully automated decisions carry unacceptable risk.
Example
A medical diagnosis AI flags suspicious chest X-rays for radiologist review; the radiologist confirms or overrides each prediction, and the corrections are fed back into the training pipeline to improve the model in subsequent retraining cycles.
Image classification assigns one or more categorical labels to an entire image, indicating what the image depicts without localizing objects within it. It is one of the most fundamental computer vision tasks and serves as a building block for more complex systems. Modern classifiers use deep convolutional or transformer architectures and can distinguish among thousands of classes with near-human accuracy.
Example
A wildlife conservation NGO classifies camera-trap images into species categories—deer, bear, wolf, or empty—enabling automated population monitoring across remote protected areas.
Instance segmentation combines object detection and semantic segmentation by assigning a class label and a unique identity to every pixel belonging to each individual object in an image. This means two adjacent cars receive different instance IDs while each is segmented at the pixel level. Instance segmentation is essential when counting, tracking, or individually analyzing objects matters, such as in cell biology or crowd analysis.
Example
A pathology AI segments individual cell nuclei in histology slides, assigning each nucleus a unique mask and class label, enabling automated cell counting and morphology analysis for cancer grading.
Inter-Annotator Agreement (IAA) measures the degree of consensus among multiple human annotators labeling the same data, quantifying the reliability and consistency of the annotation process. Common metrics include Cohen's kappa (for two annotators), Fleiss' kappa (for multiple annotators), and Krippendorff's alpha (for any number of annotators and data types). High IAA indicates clear annotation guidelines and well-defined categories, while low IAA signals ambiguous instructions or inherently subjective tasks requiring guideline revision.
Example
A sentiment annotation project achieves a Cohen's kappa of 0.82 between two annotators on a 3-class task, considered 'almost perfect' agreement, confirming that the annotation guidelines are clear and the labeled dataset is reliable for model training.
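Cohen's kappa corrects raw agreement for the agreement expected by chance given each annotator's label distribution. A self-contained implementation on a toy 3-class sentiment sample:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators over the same items:
    (observed agreement - chance agreement) / (1 - chance agreement)."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    chance = sum(freq_a[c] / n * freq_b[c] / n for c in freq_a)
    return (observed - chance) / (1 - chance)

a = ["pos", "pos", "neg", "neu", "neg", "pos", "neu", "neg"]
b = ["pos", "neg", "neg", "neu", "neg", "pos", "neu", "neg"]
print(round(cohens_kappa(a, b), 3))  # -> 0.81
```

Here 7 of 8 labels match (87.5% raw agreement), but kappa discounts the portion of that agreement the label frequencies alone would predict.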
Keypoint annotation marks specific anatomical or structural points on an object to capture its pose, shape, or articulation. Annotators place discrete points at predefined locations—such as joints on a human body or corners of a face—and these points are connected by edges to form a skeleton. This technique is fundamental for pose estimation, gesture recognition, and motion analysis.
Example
A sports analytics platform annotates basketball footage with 17-point body skeletons on each player, allowing a pose estimation model to analyze shooting form and predict injury risk from biomechanical imbalances.
LiDAR annotation is the process of labeling 3D point cloud data collected by Light Detection and Ranging sensors, which measure distances to objects by emitting laser pulses. Annotators draw 3D bounding cuboids, segment point clusters, or classify individual points into categories such as vehicle, pedestrian, vegetation, and ground. LiDAR annotation is essential for autonomous driving, robotics, and geospatial mapping where depth and spatial accuracy are critical.
Example
A self-driving truck company annotates LiDAR sweeps from highway driving by fitting 3D cuboids around vehicles, guardrails, and overhead signs, providing the 3D ground truth needed to train their sensor-fusion perception stack.
Model evaluation is the systematic process of measuring a machine learning model's performance against defined metrics, test sets, and real-world conditions. It encompasses quantitative metrics (accuracy, precision, recall, F1, mAP, BLEU), qualitative human evaluation, fairness audits, and robustness testing under distribution shift. Rigorous evaluation is essential for deciding when a model is ready for deployment and for identifying failure modes that require additional training data or architectural changes.
Example
Before deploying a content moderation model, the trust and safety team evaluates it on a stratified test set across 12 policy categories, measuring per-category recall at a fixed 1% false positive rate to ensure no category falls below 95% recall.
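The core classification metrics mentioned above follow directly from confusion-matrix counts. A minimal single-class implementation (toy labels):

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall, and F1 for one class from parallel label lists."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 1, 0, 1, 0]
p, r, f = precision_recall_f1(y_true, y_pred)
# p == 0.75 (3 of 4 positive predictions correct)
# r == 0.75 (3 of 4 true positives recovered)
```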
Multimodal AI systems process and reason over multiple types of input—such as text, images, audio, video, and structured data—within a unified model or architecture. By jointly understanding different modalities, these models can perform tasks that require cross-modal reasoning, such as answering questions about images, generating images from text descriptions, or transcribing and translating video content. Multimodal capabilities have become a defining feature of state-of-the-art foundation models.
Example
A multimodal model analyzes a photograph of a restaurant menu, reads the text via OCR, translates it from Italian to English, and estimates calorie counts for each dish—combining vision, language, and knowledge in a single inference pass.
Named Entity Recognition (NER) is an NLP task that identifies and classifies spans of text into predefined entity categories such as person, organization, location, date, monetary value, and medical term. NER serves as a foundational building block for information extraction, knowledge graph construction, and document understanding. Modern NER systems use transformer-based models fine-tuned on domain-specific annotated corpora to achieve high precision and recall.
Example
A legal tech platform runs NER on contract documents to automatically extract party names, effective dates, governing jurisdictions, and monetary obligations, saving paralegals hours of manual review per contract.
Object detection is a computer vision task that identifies and localizes multiple objects within an image by predicting both their class labels and spatial coordinates (typically bounding boxes). Modern detectors such as YOLO, Faster R-CNN, and DETR can process images in real time, making them suitable for safety-critical applications. Detection performance is measured by metrics like mean Average Precision (mAP) at various IoU thresholds.
Example
A factory quality inspection system runs a YOLO-based detector on high-resolution images of circuit boards, identifying and localizing defects such as missing solder joints, cracked components, and bridged pins.
Optical Character Recognition (OCR) is the technology that extracts machine-readable text from images, scanned documents, photographs, or video frames. Modern OCR systems use deep learning architectures—combining convolutional feature extractors with recurrent or transformer-based sequence decoders—to handle diverse fonts, handwriting, and complex layouts. OCR is a foundational capability for document digitization, automated data entry, and visual question answering.
Example
An insurance company deploys an OCR pipeline that extracts policyholder names, claim amounts, and dates from thousands of scanned paper forms daily, reducing manual data entry by 90%.
Polygon annotation involves drawing multi-sided shapes around irregularly shaped objects in an image to create pixel-precise boundaries. Unlike bounding boxes, polygons conform closely to the contours of objects such as roads, lakes, or tumors, capturing their true shape. This annotation type is essential for tasks requiring fine-grained spatial understanding, such as autonomous driving and medical imaging.
Example
An agricultural AI team annotates aerial drone images by tracing polygon boundaries around individual crop fields, enabling the model to calculate precise acreage and detect pest-damaged zones.
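One payoff of polygon annotation is that geometric quantities fall out directly: the acreage calculation in the example reduces to the shoelace formula over the traced vertices (toy coordinates):

```python
def polygon_area(vertices):
    """Area of a simple polygon via the shoelace formula; vertices are
    (x, y) pairs in drawing order (clockwise or counter-clockwise)."""
    n = len(vertices)
    acc = 0.0
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]
        acc += x1 * y2 - x2 * y1
    return abs(acc) / 2.0

# An L-shaped field traced by six annotator clicks:
field = [(0, 0), (4, 0), (4, 2), (2, 2), (2, 3), (0, 3)]
print(polygon_area(field))  # -> 10.0
```

Multiplying by the ground-sampling distance squared converts pixel area into real-world units such as square meters or acres.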
Panoptic segmentation unifies semantic segmentation and instance segmentation into a single task by assigning every pixel in an image both a class label and an instance ID. 'Stuff' classes like sky and road receive only semantic labels, while 'thing' classes like cars and people receive both semantic and instance-level labels. This comprehensive pixel-level understanding provides the richest possible scene representation for downstream tasks like autonomous navigation and robotics.
Example
A delivery robot uses panoptic segmentation to label every pixel of its camera feed—distinguishing individual pedestrians and bicycles (things) from sidewalk and grass (stuff)—so its planner can navigate safely on busy sidewalks.
Prompt engineering is the practice of designing, refining, and structuring input prompts to elicit desired behavior from large language models without modifying their weights. Techniques include zero-shot and few-shot prompting, chain-of-thought reasoning, system prompts, role-playing instructions, and structured output formatting. Effective prompt engineering can dramatically improve model accuracy, consistency, and safety on specific tasks, making it a critical skill for deploying LLMs in production.
Example
A data extraction team improves their LLM's JSON output accuracy from 72% to 96% by switching from a simple instruction ('Extract the entities') to a structured prompt with a schema definition, two examples, and explicit error-handling instructions.
Part-of-Speech (POS) tagging assigns grammatical categories—such as noun, verb, adjective, adverb, preposition, and determiner—to each word in a sentence based on its syntactic role and context. POS tagging is a fundamental NLP preprocessing step that enables downstream tasks like parsing, named entity recognition, and information extraction. Modern POS taggers use contextual embeddings from transformer models to achieve over 97% accuracy on standard benchmarks.
Example
A grammar correction tool uses POS tagging to identify that 'run' in 'I had a good run' is a noun rather than a verb, allowing it to correctly suggest 'runs' for the plural form instead of incorrectly flagging a verb conjugation error.
Reinforcement Learning from Human Feedback (RLHF) is a training paradigm that aligns language models with human preferences by using human judgments as a reward signal. After an initial supervised fine-tuning phase, a reward model is trained on ranked human comparisons, and the language model is then optimized via reinforcement learning (typically PPO) to maximize this learned reward. RLHF has become the dominant technique for making large language models helpful, harmless, and honest.
Example
OpenAI applied RLHF to GPT-4 by having human raters rank pairs of model responses on helpfulness and safety, then training a reward model on those rankings to iteratively steer the model toward preferred behavior.
A reward model is a neural network trained to predict human preferences by scoring model outputs on a numerical scale. It is trained on datasets of human comparisons—where annotators rank two or more responses to the same prompt—and learns to assign higher scores to outputs that humans prefer. In the RLHF pipeline, the reward model acts as a proxy for human judgment, providing the scalar reward signal used to optimize the language model via reinforcement learning.
Example
After collecting 100,000 pairwise comparisons from annotators, a team trains a reward model that scores helpfulness on a 0-to-1 scale, then uses it to provide per-token rewards during PPO training of their assistant model.
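The training objective for such pairwise comparisons is typically a Bradley-Terry logistic loss on the score difference. A minimal sketch with illustrative reward values:

```python
import math

def pairwise_preference_loss(reward_chosen, reward_rejected):
    """Bradley-Terry loss for one comparison: -log sigmoid(r_chosen - r_rejected).
    Minimizing it pushes the reward model to score preferred outputs higher."""
    diff = reward_chosen - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-diff)))

# Equal scores give the maximum-uncertainty loss, log(2) ~ 0.6931;
# a clear margin in the right direction drives the loss toward 0.
print(round(pairwise_preference_loss(0.0, 0.0), 4))   # -> 0.6931
print(round(pairwise_preference_loss(3.0, -1.0), 4))  # -> 0.0181
```

In a real pipeline the two rewards come from the same network scoring the chosen and rejected responses, and gradients flow through both.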
Relation extraction identifies and classifies semantic relationships between entities mentioned in text, such as 'works-for,' 'located-in,' 'causes,' or 'treats.' Given a sentence with two identified entities, the model predicts the type of relationship (if any) connecting them. Relation extraction is a key step in constructing knowledge graphs, populating databases, and enabling structured querying over unstructured text corpora.
Example
A biomedical knowledge graph system extracts drug-disease-gene relationships from PubMed abstracts, identifying that 'metformin treats type 2 diabetes by activating AMPK,' and populating a graph database used by researchers to discover drug repurposing opportunities.
Supervised Fine-Tuning (SFT) is the process of adapting a pre-trained language model to a specific task or style by training it on curated prompt-response pairs created by human experts. SFT establishes a strong behavioral baseline before further alignment techniques like RLHF or DPO are applied. The quality and diversity of the demonstration data directly determine how well the model learns to follow instructions.
Example
A medical AI team fine-tunes a foundation model on 10,000 clinician-written responses to patient queries, teaching the model the appropriate tone, clinical accuracy, and safety disclaimers expected in healthcare communication.
Semantic segmentation assigns a class label to every pixel in an image, producing a dense pixel-wise map that distinguishes between different categories such as road, sidewalk, vegetation, and sky. Unlike object detection, it does not differentiate between individual instances of the same class—all pixels belonging to 'car' receive the same label regardless of how many cars are present. This technique is critical for applications requiring holistic scene understanding.
Example
An autonomous vehicle perception stack uses semantic segmentation to classify every pixel of a forward-facing camera image into 19 categories—road, lane marking, pedestrian, building—enabling the planner to understand drivable surfaces.
Sentiment analysis determines the emotional tone or opinion expressed in a piece of text, typically classifying it as positive, negative, or neutral. Advanced sentiment models detect fine-grained emotions (joy, anger, frustration), aspect-level sentiment (sentiment toward specific product features), and sarcasm. It is widely used in brand monitoring, customer feedback analysis, financial market sentiment tracking, and social media analytics.
Example
A hotel chain runs aspect-level sentiment analysis on guest reviews, discovering that while overall sentiment is positive, the 'check-in process' aspect consistently receives negative sentiment—prompting operational improvements.
Synthetic data is artificially generated data that mimics the statistical properties and structure of real-world data without containing actual sensitive information. It can be produced through rule-based generation, simulation engines, GANs, diffusion models, or large language models. Synthetic data addresses data scarcity, privacy constraints, and class imbalance problems, enabling model training when real data is insufficient, expensive to collect, or subject to regulatory restrictions like GDPR or HIPAA.
Example
A fintech startup generates 10 million synthetic transaction records—including rare fraud patterns—using a conditional GAN, augmenting their limited real fraud dataset to improve their detection model's recall on uncommon fraud types by 35%.
Text classification assigns one or more predefined labels to a piece of text, such as an email, review, support ticket, or social media post. It encompasses binary tasks (spam vs. not spam), multi-class tasks (topic categorization), and multi-label tasks (tagging an article with multiple topics). Text classification models are trained on labeled datasets and are among the most widely deployed NLP applications in production systems.
Example
A customer support platform classifies incoming tickets into categories—billing, technical issue, feature request, account access—and routes each ticket to the appropriate team, reducing average response time by 40%.
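As a deliberately simplified sketch of the routing idea (a keyword lookup standing in for a trained classifier; the category names and lexicons here are hypothetical):

```python
# Hypothetical keyword lexicons per route; a production system would use
# a trained text classifier, but the routing logic has the same shape.
CATEGORY_KEYWORDS = {
    "billing": {"invoice", "charge", "refund", "payment"},
    "technical_issue": {"error", "crash", "bug", "broken"},
    "account_access": {"password", "login", "locked", "reset"},
}

def route_ticket(text, default="general"):
    """Assign the category whose keywords overlap the ticket text most."""
    words = set(text.lower().split())
    scores = {cat: len(words & kws) for cat, kws in CATEGORY_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else default

print(route_ticket("I was double charged and need a refund"))  # -> billing
print(route_ticket("Cannot login after password reset"))       # -> account_access
```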
Tokenization is the process of breaking raw text into smaller units called tokens, which can be words, subwords, characters, or byte-pair encodings. It is the essential first step in any NLP pipeline, converting human-readable text into a sequence of discrete symbols that models can process numerically. The choice of tokenization strategy—word-level, BPE, SentencePiece, or character-level—directly affects vocabulary size, out-of-vocabulary handling, and downstream model performance.
Example
A multilingual translation model uses SentencePiece tokenization to handle Japanese, Arabic, and English in a shared vocabulary of 64,000 subword tokens, enabling a single model to translate between all three languages.
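One step of BPE training illustrates the subword idea: count adjacent symbol pairs across the corpus, then merge the most frequent pair into a new token. A toy sketch on a tiny made-up corpus:

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across a corpus (mapping
    tuple-of-symbols -> frequency) and return the most common pair."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    """Apply one BPE merge: fuse every occurrence of `pair` into one symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Character-level corpus: word (as symbol tuple) -> frequency.
corpus = {("l", "o", "w"): 5, ("l", "o", "w", "e", "r"): 2, ("l", "o", "g"): 3}
pair = most_frequent_pair(corpus)  # ('l', 'o') occurs 10 times
corpus = merge_pair(corpus, pair)  # keys: ('lo','w'), ('lo','w','e','r'), ('lo','g')
```

Real tokenizers repeat this merge loop thousands of times to build the final subword vocabulary.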
Video annotation extends image annotation across temporal sequences by labeling objects, actions, or events across consecutive frames. Annotators track objects as they move, change shape, or become occluded, maintaining consistent identity labels throughout the clip. Techniques include frame-by-frame bounding boxes, interpolation between keyframes, and temporal action segmentation.
Example
A warehouse robotics company annotates forklift operation videos with tracked bounding boxes on pallets and workers, training a safety model to detect near-miss incidents and trigger automated alerts.