flowchart LR
S["Azure ADLS2\nParquet + CSV"]
subgraph DB["Databricks"]
DB1["spark.read.parquet()\nDirect cloud read"]
DB2["PySpark DataFrame\nfilter / join / agg"]
DB3["Delta Lake\nopen Parquet + txn log"]
DB1 --> DB2 --> DB3
end
subgraph SF["Snowflake"]
SF1["COPY INTO\nfrom external stage"]
SF2["CTAS / Snowpark\nSQL or Python"]
SF3["Snowflake storage\nproprietary columnar"]
SF1 --> SF2 --> SF3
end
subgraph DBT["dbt"]
DBT1["source()\nreads Bronze tables"]
DBT2["SQL Jinja models\nref() for DAG"]
DBT3["materialization\ntable / incremental / view"]
DBT1 --> DBT2 --> DBT3
end
S --> DB1
S --> SF1
DB3 --> DBT1
SF3 --> DBT1
style S fill:#0057b8,color:#fff,stroke:#003d82
style DB3 fill:#01065c,color:#fff,stroke:#000940
style SF3 fill:#01065c,color:#fff,stroke:#000940
style DBT3 fill:#01065c,color:#fff,stroke:#000940
Module 7: Comparison & Wrap-up
Power BI payoff and trainee-led tool discussion
Duration: 30 min — Animation (3) · Think & Discuss (5) · Theory (5) · Practice — Open Discussion (17)
1. Animation
2. Think & Discuss
Situation: Priya presented the full Power BI dashboard to Marcus. MHP’s engagement ends — trainees recommend a tool strategy for YellowLine NYC.
Prompts:
- Silent write (2 min): My recommended stack is Databricks / Snowflake / dbt / combination — one sentence why.
- If Marcus’s team is SQL-only, what do you recommend?
- Look at your Story whiteboard — what would you change now?
3. Theory
This module keeps theory brief before group discussion. Focus on objectives, Power BI demo notes, and when to use what. Full comparison tables and architectural gotchas are in Tool comparison deep dive for self-study after class.
Three constraints — back to the Story
Revisit the three constraints Marcus introduced on day one. Every Round 1 share and Round 2 challenge maps to one of them.
| Constraint | Use in discussion |
|---|---|
| Cost | “Year 3 TCO — does your stack still fit?” |
| Performance | “Will Module 8 streaming fit on this platform later?” |
| Compliance | “What proves a number to the auditor in Q3?” |
Decision-matrix handout: your trainer will provide a printed worksheet for the silent reflection step (2 min).
Closing line (Elena, Module 7 final slide):
“Technology is a decision. Architecture is responsibility.”
3.1 PowerBI Dashboard Demo (10 min)
The trainer demonstrates a pre-built PowerBI dashboard connected to the Gold KPI tables. Everything uses Power BI Desktop (free) — no Pro or Premium license required.
Self-paced lab: Exercise: Power BI Dashboard (after Module 4 — five pages, all 12 KPIs). Extra DAX measures and theme JSON: Power BI dashboard guide.
Dashboard pages:
- Overview — KPI cards (total trips, revenue, avg fare, quality score), trips-by-hour line chart, day-of-week bar chart
- Map ⭐ — Filled map by borough (color = revenue), bubble map by top pickup zones (size = trips), popular routes table
- Time Analysis — Day × hour heatmap (matrix with gradient), time-of-day donut, revenue-by-hour column chart
- Revenue & Payments — Payment-type pie chart, detail table with conditional formatting, credit-card tip KPI card
- Efficiency — Distance vs. speed scatter, distance-band funnel, passenger-count breakdown
Key demo points:
- The same dashboard connects to Databricks or Snowflake Gold tables — schema is identical
- The Filled Map auto-geocodes NYC boroughs; the Bubble Map geocodes zone names
- Power BI Desktop is free; publish to My Workspace (also free) for browser access
- Save as
.pbittemplate to share layout without credentials
3.3 Architecture: Three Pipelines, One Dataset
dbt is a transformation layer, not an ingestion tool. It reads from tables that Databricks or Snowflake already created. This is why you always need at least one other platform alongside dbt.
3.5 The same Silver transform — three ways
All three pipelines apply identical business logic. The platform changes only the how, not the what.
silver_df = bronze_df.filter(
(col("fare_amount") > 0) &
(col("trip_distance") > 0) &
(col("passenger_count") > 0) &
(col("tpep_pickup_datetime").isNotNull())
)
silver_df.write.mode("overwrite").saveAsTable("workshop.silver.nyc_taxi_enriched")Compute runs on Spark executors. Delta Lake enforces schema on write.
CREATE OR REPLACE TABLE silver.nyc_taxi_enriched AS
SELECT *
FROM bronze.nyc_taxi_raw
WHERE fare_amount > 0
AND trip_distance > 0
AND passenger_count > 0
AND tpep_pickup_datetime IS NOT NULL;Compute runs inside a Snowflake virtual warehouse. Data stored in proprietary columnar format.
-- models/silver/silver_nyc_taxi_enriched.sql
{{ config(materialized='table') }}
SELECT *
FROM {{ ref('bronze_nyc_taxi_raw') }}
WHERE fare_amount > 0
AND trip_distance > 0
AND passenger_count > 0
AND tpep_pickup_datetime IS NOT NULL{ ref() } resolves the DAG. Compute runs on whichever backend dbt targets. dbt test validates nulls and counts automatically after the run.
3.6 When to Use What
Choose Databricks when:
- You need complex Python/Spark transformations
- ML/AI workloads are part of the pipeline
- You want unified analytics + ML on one platform
- Streaming workloads require sub-second latency (Structured Streaming)
- Team is comfortable with PySpark
Choose Snowflake when:
- SQL is your team’s primary language
- You need instant, elastic compute scaling
- Data sharing across organizations is important
- You want both SQL and Python (Snowpark) options
- No-code ML (Cortex
ML.FORECAST,ML.ANOMALY_DETECTION) covers your use case - Near-real-time analytics (1-min lag) are sufficient — Dynamic Tables are simpler than Structured Streaming
Choose dbt when:
- You want transformation-layer standardization
- Testing and documentation are priorities
- You need to run the same logic across multiple backends
- Your team values version-controlled, testable SQL
- You want a governed, tested ML feature table that both Databricks and Snowflake training pipelines consume

Combine tools when:
- Databricks + dbt: Databricks for ingestion/ML, dbt for transformation governance
- Snowflake + dbt: Snowflake for compute/storage, dbt for testing/docs/CI
- All three: Different teams, different strengths — dbt as the common layer
3.7 Discussion Questions
- Which tool would you choose for a greenfield project and why?
- When would you combine tools vs. use one end-to-end?
- Where does dbt add value? Where is it overhead?
- How did the Snowpark experience compare to PySpark?
- When would you use streaming over batch? Which platform made that easier?
- For ML workloads: when is Snowflake Cortex SQL sufficient vs. when do you need Databricks?
- What surprised you most today?
3.8 Key Takeaways
- There is no single best tool — each excels in different scenarios
- Medallion architecture works across all tools — the pattern transcends tooling
- Production pipelines look very different from training notebooks
- AI features are rapidly becoming standard in every data engineering tool
- Streaming and ML are natural extensions of the batch pipeline — the same Silver tables feed both
- Snowflake Cortex ML covers the majority of business ML needs with zero Python
- Databricks remains the stronger choice when algorithm flexibility or deep MLflow tracking matters
- The ability to compare tools hands-on is the best way to make informed decisions
3.9 Further Learning
Official documentation
| Tool | Page | What it covers |
|---|---|---|
| What is Delta Lake? | Open format, ACID, time travel, schema enforcement | |
| High-level architecture | Control plane vs compute plane, serverless vs classic | |
| Overview of data loading | COPY INTO, stage types, INFER_SCHEMA, DML error logging |
|
| Virtual warehouses | Sizes, auto-suspend, credit billing model | |
| Cortex ML Functions | FORECAST, ANOMALY_DETECTION, CLASSIFICATION, TOP_INSIGHTS | |
| Snowflake Model Registry | Snowpark ML models (not Cortex ML Functions — those are separate) | |
| Materializations | All 5 types, Python models, dynamic_table vs materialized_view |
|
| Snowflake-specific configs | Why Snowflake uses dynamic_table not materialized_view |
Extended learning
- Delta Lake open protocol
- Snowpark Developer Guide
- Databricks MLflow
- Aiven Free Kafka (streaming broker used in Module 8)
- Aiven Karapace Schema Registry
- dbt Best Practices — Real-time data
Detailed comparison tables, integration gotchas, discussion questions, and further reading: Tool comparison deep dive
3.10 Review Questions
Before open discussion, make sure you can answer:
- Name one scenario where Databricks is the clear winner over Snowflake, and one where Snowflake wins.
- Why is dbt always used alongside another platform — never alone?
- What are Marcus’s three constraints, and which tool selection decision does each one influence?
4. Practice — Open Discussion
Trainer-led group discussion: compare Databricks, Snowflake, and dbt for YellowLine NYC and agree on a recommendation for Marcus.
Optional opener: Batch comparison exercise
Optional: Power BI demo (10 min)
Connect the same Gold schema live or walk through the Power BI dashboard guide.
Priya / Power BI: Full dashboard walkthrough — Overview, Map, Time Analysis, Revenue, Efficiency — all 12 KPIs.
After discussion — share the wrap-up quiz and end-of-workshop survey:
Module 7 — Comparison & Wrap-up Quiz · End-of-workshop survey
Next module
Module 8: Streaming (Optional) — Optional Phase 2: Marcus needs live dispatch, then tip prediction (Module 9).