Module 5: Production Patterns

Scheduling, CI/CD, and go-live checklist

Duration: 45 min — Animation (3) · Think & Discuss (7) · Theory (15) · Quiz (3) · Practice (17)

1. Animation

2. Think & Discuss

Situation: Pipelines work in notebooks and manual dbt runs. YellowLine NYC needs scheduled, monitored, reliable jobs.

Prompts:

  • What is different about production vs this morning’s lab?
  • How would you schedule nightly Bronze → Silver → Gold on each platform?
  • If Silver fails at 2 a.m., who should know?

3. Theory

NoteVendor naming (2026)
Legacy name in repo Say in class
Delta Live Tables (DLT) Lakeflow Spark Declarative Pipelines (LSDP)
Databricks Workflows / Jobs Lakeflow Jobs
@dlt.table @dp.table / LSDP decorators (old API still works)

Lab file dlt_pipeline.py keeps the legacy import — no migration required for training.

3.1 The Big Question

You just learned to build pipelines. But you wouldn’t run notebooks in production.

This module bridges the training-to-production gap — showing how each tool deploys pipelines in the real world.

3.2 Training vs Production

Aspect Training (Today) Production (Real World)
Databricks Interactive notebooks LSDP + Lakeflow Jobs + Asset Bundles
Snowflake SQL Worksheets / Snowpark scripts Tasks + Streams + Stored Procs
dbt Manual dbt run in terminal dbt Cloud jobs / GitHub Actions CI
Orchestration Manual, sequential Scheduled, event-driven, automated
Error handling Interactive debugging Retries, alerts, logging
Deployment Copy/paste or import Git + CI/CD + infrastructure as code
Testing Ad-hoc SELECT COUNT(*) Automated tests on every change

3.3 Databricks Production Patterns

Lakeflow Spark Declarative Pipelines (LSDP, formerly DLT)

NoteLegacy import in lab code

The code below uses the legacy @dlt.table / @dlt.expect_or_drop decorators for training compatibility. In production, Lakeflow Spark Declarative Pipelines use @dp.table — same API, new name. No migration is required for this workshop.

LSDP replaces notebooks with declarative pipeline definitions:

@dlt.table(name="silver_nyc_taxi_cleaned")
@dlt.expect_or_drop("valid_distance", "trip_distance > 0")
def silver_nyc_taxi_cleaned():
    return dlt.read("bronze_nyc_taxi_trips").filter(...)

Key benefits:

  • You declare what each table should contain — LSDP figures out how to run it
  • @dp.expect enforces data quality with automatic violation metrics
  • Auto-scaling, auto-retry, built-in lineage in Unity Catalog
  • No explicit spark.read() / .write.saveAsTable() — LSDP manages I/O

Lakeflow Jobs (formerly Databricks Workflows)

Schedule notebook pipelines as multi-task jobs with dependency DAGs:

flowchart LR
    SETUP["setup<br/><i>Create schemas</i>"] --> BRONZE["bronze_ingestion<br/><i>Read Parquet</i>"]
    BRONZE --> SILVER["silver_cleaning<br/><i>Filter & enrich</i>"]
    SILVER --> GOLD["gold_kpis<br/><i>12 aggregations</i>"]
    GOLD --> VALIDATE["validate<br/><i>SQL checks</i>"]

    classDef task fill:#0369a1,color:#fff,stroke:#075985
    classDef final fill:#107c10,color:#fff,stroke:#0a5c0a

    class SETUP,BRONZE,SILVER,GOLD task
    class VALIDATE final

  • Task chaining: setup → bronze → silver → gold
  • Retry policies per task
  • Email/Slack alerting on failure
  • Cluster policies for cost control

Databricks Asset Bundles (DABs)

Infrastructure-as-code for Databricks:

databricks bundle deploy -t prod   # Deploy pipeline + job from Git

File: databricks/production/asset_bundle/databricks.yml — packages everything into a deployable bundle.

Job clusters vs all-purpose clusters

In production, always use job clusters (created fresh per run and terminated on completion) instead of all-purpose clusters left running. Job clusters cost ~5x less per DBU and eliminate the #1 cause of unexpected cloud bills in Databricks: forgotten interactive clusters. Combine with cluster policies to cap maximum worker count and prevent runaway costs.

3.4 Snowflake Production Patterns

Streams + Tasks

Streams detect changes (CDC). Tasks run on schedule or when triggered:

CREATE STREAM bronze_stream ON TABLE bronze_nyc_taxi_trips;

CREATE TASK refresh_silver
  WAREHOUSE = DE_WORKSHOP_WH
  SCHEDULE = 'USING CRON 0 6 * * * Europe/Berlin'
  WHEN SYSTEM$STREAM_HAS_DATA('bronze_stream')
AS
  -- Silver transformation SQL

flowchart TD
    BRONZE_TABLE[("Bronze Table")] --> STREAM["Stream<br/><i>detects new rows (CDC)</i>"]
    STREAM -->|"SYSTEM$STREAM_HAS_DATA()"| TASK_SILVER["Task: refresh_silver<br/><i>SCHEDULE = '0 6 * * *'</i>"]
    TASK_SILVER --> SILVER_TABLE[("Silver Table")]
    SILVER_TABLE --> TASK_GOLD["Task: refresh_gold"]
    TASK_GOLD --> GOLD_TABLE[("Gold Tables")]

    classDef table fill:#01065c,color:#fff,stroke:#000940
    classDef stream fill:#29B5E8,color:#fff,stroke:#1e9bc9
    classDef task fill:#0369a1,color:#fff,stroke:#075985

    class BRONZE_TABLE,SILVER_TABLE,GOLD_TABLE table
    class STREAM stream
    class TASK_SILVER,TASK_GOLD task

Stored Procedures

Wrap pipeline logic in callable units:

CREATE PROCEDURE run_silver_pipeline(attendee VARCHAR)
RETURNS VARCHAR
AS $$
BEGIN
    -- Silver transformation logic
    RETURN 'Silver pipeline completed';
END $$;

CALL run_silver_pipeline('01_alice');

Snowpark Stored Procedures

Deploy Python code as native Snowflake procedures:

session.sproc.register(
    func=run_silver_pipeline,
    name="RUN_SILVER_PIPELINE_PY",
    stage_location="@my_stage"
)

3.5 dbt Production Patterns

GitHub Actions CI

Automated testing on every pull request:

- name: dbt build (slim CI)
  run: dbt build --select state:modified+ --defer --state ./target

Slim CI only tests changed models + downstream — saves time and cost.

NoteState-aware runs (generally available 2026)

dbt now supports state-aware runs that detect which models have changed since the last run and only rebuild those — saving significant compute time on large projects. Combined with --select state:modified+, dbt compares the current project state against a previous manifest and skips unchanged models entirely. This is especially valuable for Gold-heavy projects where upstream Silver models rarely change.

dbt Cloud

Managed scheduling + monitoring:

  • Scheduled jobs: daily dbt build with alerting
  • CI jobs: automatic PR checks
  • Hosted documentation
  • Copilot: AI-assisted SQL, docs, tests (Starter+ plans)

“Works on my machine” is not production-ready

The most common production incident pattern: a pipeline runs fine in a development notebook or local dbt environment, then fails in production due to (a) different schema permissions, (b) hard-coded paths or attendee IDs, or (c) missing dependencies that were loaded interactively. Always test with dbt build in a clean environment that mirrors production — and never hard-code attendee IDs, warehouse names, or file paths that change between environments.

Blue-green deployment for data pipelines

For mission-critical pipelines, consider a blue-green deployment pattern: maintain two parallel environments (blue = current production, green = new version). Deploy changes to green, validate with dbt build and automated tests, then switch traffic by updating the production schedule to point at green. This enables zero-downtime releases and instant rollback — if green produces incorrect results, simply re-enable blue’s schedule. Both Databricks (via Asset Bundles with --target prod-blue / --target prod-green) and dbt Cloud (via separate environments) support this pattern natively.

3.6 Key Takeaways

  • Production ≠ notebooks — scheduled, tested, monitored pipelines replace interactive development
  • Databricks uses LSDP (declarative) + Lakeflow Jobs (orchestration) + Asset Bundles (IaC deployment)
  • Snowflake uses Streams (CDC) + Tasks (scheduling) + Stored Procedures (encapsulated logic)
  • dbt uses dbt build (run + test) + GitHub Actions CI (slim CI) or dbt Cloud (managed scheduling)
  • Cost control is a first-class production concern: job clusters, auto-suspend, and cluster policies
  • Testing in CI catches bad models before they reach Gold tables and dashboards

4. Quiz

Quiz: Module 5 — Production Patterns Quiz

Scan QR to open quiz

Before moving on, make sure you can answer:

  1. What is the difference between a Databricks job cluster and an all-purpose cluster, and why does it matter for production cost?
  2. How does Snowflake’s SYSTEM$STREAM_HAS_DATA() function enable event-driven task execution?
  3. What does “slim CI” mean in dbt, and how does it save time and compute cost?

5. Practice

Hands-on lab

Priya / Power BI: Production SLAs define how fresh Priya’s dashboard can be — batch refresh vs DirectQuery comes in Module 8.

Next module

Module 6: AI Features — Marcus asks whether AI can help analysts explore data faster.


Official Documentation