Glossary

Key terms and concepts

A

ADLS2 (Azure Data Lake Storage Gen2)
Azure’s scalable cloud storage service optimized for big data analytics. Supports hierarchical namespace and abfss:// protocol.
Asset Bundles (DABs)
Databricks’ infrastructure-as-code tool for deploying pipelines, jobs, and notebooks via CLI. See databricks bundle deploy.

B

Bronze Layer
First layer of medallion architecture. Raw data ingested as-is from source, with metadata added. No transformations applied.

C

Catalog (Unity Catalog)
Top-level namespace in Databricks: Catalog > Schema > Table. Provides centralized governance, auditing, and lineage.
Checkpoint
A directory on cloud storage in which a streaming query records the offsets it has processed, along with any aggregation state. Enables fault recovery in Structured Streaming and, combined with an idempotent or transactional sink, exactly-once semantics.
COPY INTO
Snowflake SQL command for bulk loading data from external stages into tables. Efficient for Parquet, CSV, JSON.
Cortex AI
Snowflake’s suite of AI features including AI_COMPLETE(), AI_EXTRACT(), Cortex Code, and Cortex Analyst.
Cortex Analyst
Snowflake service that translates natural language questions into SQL against governed semantic models (YAML definitions of tables and metrics).
Cortex Search
Snowflake hybrid search service (semantic + keyword) over unstructured and structured content. Enables RAG patterns within Snowflake.
Cross-validation
ML evaluation technique that splits data into k folds, training on k-1 and testing on 1, rotating through all folds. More robust than a single train/test split.
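The fold rotation can be sketched in a few lines of plain Python. This is a minimal illustration, not a replacement for a library splitter: the function name `k_fold_indices` is hypothetical, the folds are contiguous, and no shuffling is done.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k roughly equal, contiguous folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    base, extra = divmod(n_samples, k)  # the first `extra` folds get one more row
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


# 10 samples, 5 folds: every sample is used for testing exactly once.
folds = list(k_fold_indices(10, 5))
```

Each of the k models is trained on `train_idx` and scored on `test_idx`; averaging the k scores gives the cross-validated estimate. For time-ordered data such as trips, a time-based split is usually a safer choice than random folds.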

D

Data Leakage
Including information in ML features that would not be available at prediction time, or that mathematically encodes the target. A common and costly ML mistake: for example, total_amount already includes tip_amount, so one cannot serve as a feature for predicting the other.
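A small sketch of why this matters, under simplifying assumptions: the rows below are made up, and total_amount is modeled as just fare_amount plus tip_amount (real trip records carry more components). The "model" `predict_tip_with_leak` is hypothetical and only does arithmetic, yet it scores perfectly.

```python
# Made-up rows; total_amount is derived from the target (tip_amount).
trips = [
    {"fare_amount": 10.0, "tip_amount": 2.0},
    {"fare_amount": 7.5, "tip_amount": 0.0},
    {"fare_amount": 20.0, "tip_amount": 5.0},
]
for trip in trips:
    trip["total_amount"] = trip["fare_amount"] + trip["tip_amount"]


def predict_tip_with_leak(trip):
    """A 'model' that merely inverts the arithmetic hidden in total_amount."""
    return trip["total_amount"] - trip["fare_amount"]


# Zero error on every row: the feature encodes the target.
errors = [abs(predict_tip_with_leak(t) - t["tip_amount"]) for t in trips]
```

A suspiciously perfect score like this is a signal to check whether any feature is computed from the target or is only known after the event being predicted.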
dbt (data build tool)
Transformation framework that turns SQL SELECT statements into managed tables/views with testing, documentation, and lineage.
Delta Lake
Open-source storage format providing ACID transactions, time travel, and schema evolution on top of Parquet files.
Delta Live Tables (DLT)
Databricks’ declarative ETL framework using @dlt.table decorators. Handles orchestration, quality enforcement, and lineage automatically.
Dynamic Table (Snowflake)
A Snowflake object defined by a SQL SELECT that refreshes automatically based on TARGET_LAG. Declarative streaming — you write SQL, Snowflake manages incremental refresh.

E

ELT (Extract, Load, Transform)
Pattern where raw data is loaded first, then transformed in-place within the data warehouse. Used by all three pipelines in this training.
Exactly-once Semantics
Guarantee that each event is processed exactly once, even during failures. Achieved in Structured Streaming via checkpoint files + Delta Lake transaction log.
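The interplay of checkpoint and transactional sink can be simulated in plain Python. This is a simplified sketch, not how Structured Streaming or Delta Lake are implemented: `TransactionalSink`, `run_stream`, and the dictionary checkpoint are hypothetical stand-ins. A crash is simulated after a batch is written but before the checkpoint advances, which is the case where a naive pipeline would write duplicates.

```python
class TransactionalSink:
    """Stand-in for a transactional table: each batch id commits at most once."""

    def __init__(self):
        self.rows = []
        self.committed_batches = set()

    def commit(self, batch_id, rows):
        if batch_id in self.committed_batches:
            return False  # replayed batch: already written, skip it
        self.rows.extend(rows)
        self.committed_batches.add(batch_id)
        return True


def run_stream(events, checkpoint, sink, batch_size, crash_after_commit_of=None):
    """Process events in batches, resuming from checkpoint['offset']."""
    while checkpoint["offset"] < len(events):
        start = checkpoint["offset"]
        batch = events[start:start + batch_size]
        batch_id = start // batch_size  # deterministic id for a given offset
        sink.commit(batch_id, batch)
        if batch_id == crash_after_commit_of:
            raise RuntimeError("simulated crash before checkpoint update")
        checkpoint["offset"] = start + len(batch)


events = ["e0", "e1", "e2", "e3", "e4", "e5"]
checkpoint = {"offset": 0}
sink = TransactionalSink()

try:
    run_stream(events, checkpoint, sink, batch_size=2, crash_after_commit_of=1)
except RuntimeError:
    pass  # batch 1 was written, but the checkpoint still points at offset 2

# Restart: batch 1 is replayed from the checkpoint, and the sink ignores it.
run_stream(events, checkpoint, sink, batch_size=2)
```

After the restart, the sink holds each event once. Without the batch-id check in `commit`, the replay would produce at-least-once delivery, with `e2` and `e3` duplicated.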
External Stage
Snowflake object pointing to a cloud storage location (ADLS2, S3, GCS) for reading external files.

F

Feature Engineering
Process of creating input variables (features) for ML models from raw data. In this training, dbt defines the canonical feature table that both Databricks and Snowflake training pipelines consume.

G

Genie Space
Databricks AI/BI feature. Natural language interface over data tables — ask questions in plain English, get SQL + results.
Gold Layer
Third layer of medallion architecture. Pre-aggregated business KPIs and metrics optimized for reporting and dashboards.

I

Incremental Materialization
dbt strategy that appends or merges only new rows instead of rebuilding the entire table. Uses is_incremental() logic to filter for new rows; a unique_key is needed when existing rows should be updated (merged) rather than only appended. Essential for large Silver tables.
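The merge behavior can be illustrated with a plain-Python simulation. This is a sketch of the semantics only, not dbt code: the function `incremental_merge`, the columns `trip_id` and `loaded_at`, and the sample rows are all hypothetical.

```python
def incremental_merge(target, source_rows, unique_key, updated_at):
    """Merge only the source rows newer than the target's high-water mark."""
    # Comparable in spirit to an is_incremental() filter: on the first run the
    # target is empty, so every source row qualifies.
    high_water = max((row[updated_at] for row in target.values()), default=None)
    new_rows = [
        row for row in source_rows
        if high_water is None or row[updated_at] > high_water
    ]
    for row in new_rows:
        target[row[unique_key]] = row  # insert, or overwrite the existing key
    return len(new_rows)


source = [
    {"trip_id": 1, "fare": 10.0, "loaded_at": 1},
    {"trip_id": 2, "fare": 7.5, "loaded_at": 1},
]
target = {}
first_run = incremental_merge(target, source, "trip_id", "loaded_at")

# A later load brings a correction for trip 2 and a brand-new trip 3.
source.append({"trip_id": 2, "fare": 8.0, "loaded_at": 2})
source.append({"trip_id": 3, "fare": 12.0, "loaded_at": 2})
second_run = incremental_merge(target, source, "trip_id", "loaded_at")
```

The second run touches two rows instead of all four, and the key on `trip_id` is what lets the correction replace the old row rather than create a duplicate.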

J

Jinja
Templating engine embedded in dbt SQL files. Enables conditional compilation (target.type), reusable macros, ref()/source() dependency management, and configuration blocks.

K

KPI (Key Performance Indicator)
Pre-computed business metric. This training computes 12 KPIs from taxi trip data (trips by hour, top zones, revenue, etc.).

M

Macro (dbt)
Reusable SQL function defined in .sql files under macros/. Similar to functions in Python: accepts parameters and returns SQL snippets. Example: {{ time_of_day('pickup_hour') }}.
Materialization
Strategy for how a dbt model becomes a database object. Types: view, table, incremental, ephemeral, materialized_view. Controls the tradeoff between freshness, cost, and complexity.
Medallion Architecture
Multi-layer data design pattern: Bronze (raw) → Silver (cleaned) → Gold (aggregated). Also called “multi-hop architecture.”
MLflow
Open-source platform for ML lifecycle management. Tracks experiments (hyperparameters, metrics, model artifacts). Databricks includes MLflow autolog for automatic experiment capture.
Model (dbt)
A SQL SELECT statement that dbt materializes as a table or view. Each .sql file in models/ is one model.

P

PySpark
Python API for Apache Spark. Used in Databricks notebooks for distributed data processing.

R

ref() (dbt)
dbt function that creates a dependency between models. {{ ref('my_model') }} references another model and builds the DAG.
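How ref() calls turn into a build order can be sketched with the Python standard library. This is an illustration under stated assumptions, not dbt's actual parser: the regex `REF_PATTERN`, the helper `build_dag`, and the four model names are hypothetical, and the SQL uses dbt's usual double-brace Jinja syntax.

```python
import re
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Matches {{ ref('model_name') }} with single or double quotes.
REF_PATTERN = re.compile(r"\{\{\s*ref\(\s*['\"]([^'\"]+)['\"]\s*\)\s*\}\}")

# Hypothetical models: file name -> SQL body.
models = {
    "silver_trips": "select * from {{ source('bronze', 'trips') }}",
    "gold_hourly": "select pickup_hour, count(*) from {{ ref('silver_trips') }} group by 1",
    "gold_revenue": "select sum(fare) from {{ ref('silver_trips') }}",
    "dashboard": "select * from {{ ref('gold_hourly') }} cross join {{ ref('gold_revenue') }}",
}


def build_dag(models):
    """Map each model to the set of models it references via ref()."""
    return {name: set(REF_PATTERN.findall(sql)) for name, sql in models.items()}


dag = build_dag(models)
# static_order() yields every model after all of its dependencies.
build_order = list(TopologicalSorter(dag).static_order())
```

Here `silver_trips` is built first and `dashboard` last; the two gold models can run in either order, which is what allows a scheduler to run them in parallel. The source() call creates no edge between models because it points at a table outside the project.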

S

SAS Token
Shared Access Signature — a URL-based authentication method for Azure Storage. Used for Snowflake external stage access in trial accounts.
Silver Layer
Second layer of medallion architecture. Cleaned, validated, enriched data with derived metrics. Trustworthy data for analysis.
Slim CI
dbt testing pattern that only tests changed models + their downstream dependencies instead of the entire project. Uses --select state:modified+.
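The effect of the trailing + in state:modified+ can be sketched as a graph walk: start from the changed models and collect everything downstream. This is a simplified illustration with hypothetical model names and a hypothetical helper `select_modified_plus`; in dbt itself, the set of modified models comes from comparing against a previous run's artifacts supplied with --state.

```python
from collections import deque


def select_modified_plus(parents, modified):
    """Return the modified models plus everything downstream of them."""
    # Invert the parent map so each model points at its direct children.
    children = {}
    for model, deps in parents.items():
        for dep in deps:
            children.setdefault(dep, set()).add(model)

    selected = set(modified)
    queue = deque(modified)
    while queue:  # breadth-first walk over downstream models
        node = queue.popleft()
        for child in children.get(node, ()):
            if child not in selected:
                selected.add(child)
                queue.append(child)
    return selected


# Hypothetical project: model -> models it depends on.
parents = {
    "silver_trips": set(),
    "silver_zones": set(),
    "gold_hourly": {"silver_trips"},
    "gold_top_zones": {"silver_trips", "silver_zones"},
    "dashboard": {"gold_hourly"},
}

small_change = select_modified_plus(parents, {"silver_zones"})
big_change = select_modified_plus(parents, {"silver_trips"})
```

Changing `silver_zones` selects two models out of five, while changing `silver_trips` selects four. The savings from Slim CI therefore depend on where in the DAG the change lands.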
Snowpark
Snowflake’s Python DataFrame API. Runs Python code natively on Snowflake compute. API is similar to PySpark.
Snowflake Intelligence
Umbrella brand (announced 2025) bundling all Cortex AI capabilities — LLM functions, Cortex Analyst, Cortex Search, and ML functions — into a unified AI layer.
source() (dbt)
dbt function referencing a raw table not managed by dbt. {{ source('bronze', 'trips') }} references an external source table.
Stream (Snowflake)
Snowflake object that tracks changes (inserts, updates, deletes) on a table. Used for change data capture (CDC).

T

Task (Snowflake)
Snowflake object that runs SQL on a schedule or when triggered by a condition. Used with Streams for event-driven pipelines.

U

Unity Catalog
Databricks’ centralized governance layer. Manages catalogs, schemas, tables, permissions, lineage, and data sharing.

V

Virtual Warehouse (Snowflake)
Snowflake compute resource. Auto-suspends when idle, auto-resumes on query. Sizes from X-Small to 6X-Large.

Y

YellowLine NYC
Fictional NYC yellow-taxi operator used as the training client story (inspired by TLC public trip data, not a real company). Marcus Chen (Operations Manager) works for YellowLine NYC; MHP is the consulting vendor engaged to build analytics. On-screen dispatch UI and Marcus’s badge use YellowLine branding — not MHP.

W

Watermark
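In streaming, a threshold on event time: events that arrive more than a configured delay (for example, X minutes) behind the latest event time seen so far are treated as too late and may be dropped. Enables window finalization and state memory cleanup. Tradeoff: a shorter delay gives lower latency but drops more late events.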
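The tradeoff can be made concrete with a small simulation. This is a sketch under simplifying assumptions: the function `apply_watermark` is hypothetical, event times are plain minutes, and the watermark advances after every event, whereas a real engine such as Structured Streaming typically advances it between micro-batches.

```python
def apply_watermark(events, delay):
    """Split (event_time, payload) pairs, given in arrival order, into kept and dropped."""
    max_event_time = None
    kept, dropped = [], []
    for event_time, payload in events:
        # The watermark trails the latest event time seen so far by `delay`.
        if max_event_time is not None and event_time < max_event_time - delay:
            dropped.append((event_time, payload))
            continue
        kept.append((event_time, payload))
        if max_event_time is None or event_time > max_event_time:
            max_event_time = event_time
    return kept, dropped


# Arrival order differs from event-time order: two events show up late.
arrivals = [
    (0, "a"), (5, "b"), (12, "c"), (3, "late"),
    (11, "d"), (20, "e"), (9, "too_late"),
]
kept_10, dropped_10 = apply_watermark(arrivals, delay=10)
kept_5, dropped_5 = apply_watermark(arrivals, delay=5)
```

With a 10-minute delay only the event at minute 9 is dropped, because it arrives after minute 20 has been seen. With a 5-minute delay the event at minute 3 is lost as well. The shorter delay lets windows close sooner at the cost of more discarded late data.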
Workflows (Databricks)
Databricks’ job scheduler. Runs notebooks, DLT pipelines, or Python scripts as multi-task DAGs with dependencies and retries.