Exercise: Batch Comparison
YellowLine NYC story · full hands-on lab
Estimated time: 10–15 min (fill table from memory: 5 min · discussion: 5–10 min)
YellowLine NYC context (Module 7)
Fill the observation table after Modules 2–4, then discuss.
Complete the Databricks, Snowflake, and dbt exercises first, then come back here.
Fill in the observation table from memory — the exact numbers matter less than noticing where the tools felt different. Your trainer will lead a discussion based on the open questions at the bottom.
What did you observe?
Fill this in after running all three pipelines on the same NYC Taxi data.
| What you ran | Databricks (PySpark) | Snowflake (SQL) | Snowflake (Snowpark) | dbt |
|---|---|---|---|---|
| Silver transform — how long? | ______ | ______ | ______ | ______ |
| How long until you could query output? | ______ | ______ | ______ | ______ |
| `trips_by_hour` row count | ______ | ______ | ______ | ______ |
| Top pickup zone | ______ | ______ | ______ | ______ |
| How did you write the Silver table? | `.write.saveAsTable()` | `CREATE TABLE AS SELECT` | `.write.save_as_table()` | `materialized=` config |
| How did you verify row counts? | Manual `SELECT COUNT(*)` | Manual `SELECT COUNT(*)` | Manual `SELECT COUNT(*)` | `dbt test` (automatic) |
| Did the cluster/warehouse need warming up? | ______ | ______ | ______ | N/A |
| Where did you see the output table? | Unity Catalog | Snowsight | Snowsight | dbt CLI + target DB |
If your trips_by_hour row counts differ between Databricks and Snowflake, there is a data quality issue. Most common cause: case-sensitive column name mismatch in the Silver filter (FARE_AMOUNT vs fare_amount).
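If you want to check this systematically rather than by eye, the sketch below compares the row counts you wrote in the table and flags column names that differ only by letter case. It is a minimal illustration in plain Python: the function names, column lists, and counts are all hypothetical placeholders, not values from the workshop data.

```python
def compare_row_counts(counts):
    """Return True when every tool reports the same row count."""
    return len(set(counts.values())) <= 1


def find_case_mismatches(left_columns, right_columns):
    """Return (left, right) name pairs that differ only by letter case."""
    right_by_lower = {name.lower(): name for name in right_columns}
    mismatches = []
    for name in left_columns:
        other = right_by_lower.get(name.lower())
        if other is not None and other != name:
            mismatches.append((name, other))
    return mismatches


if __name__ == "__main__":
    # Placeholder counts: substitute the numbers from your observation table.
    observed = {"databricks": 100, "snowflake_sql": 100, "snowpark": 98}
    if not compare_row_counts(observed):
        print("Row counts differ:", observed)

    # Placeholder column lists: substitute the columns of your Silver tables.
    databricks_cols = ["fare_amount", "trip_distance"]
    snowflake_cols = ["FARE_AMOUNT", "TRIP_DISTANCE"]
    for left, right in find_case_mismatches(databricks_cols, snowflake_cols):
        print(f"Same column, different case: {left} vs {right}")
```

If the case check reports mismatches, re-read the Silver filter in the pipeline whose count is off before looking for other causes.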
Things you probably noticed
These are things nearly every attendee notices without being told. Check which ones you experienced:
Open questions for group discussion
For the trainer: these are designed to be open-ended. There is no single right answer. Use them to surface what attendees found surprising or counterintuitive.
Q1 — All three tools produced the same KPI numbers.
“If the output is identical, what would actually drive your choice between these three tools for a new project at your company?”
Q2 — PySpark and Snowpark have almost the same API.
“Snowflake deliberately designed Snowpark to look like PySpark. Why? Who is the intended audience? Does that change how you would upskill your team?”
Q3 — dbt ran automated tests; Databricks and Snowflake required manual verification.
“In your current team, who would be responsible for data quality checks in the Databricks or Snowflake pipelines? How would you change that before going to production?”
Q4 — Snowflake auto-suspended the warehouse; Databricks kept the cluster running.
“Consider two pipelines: one needs to run every 5 minutes, the other once per day. Which cost model, pay-per-second-active or pay-per-VM-hour, is better for each scenario?”
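To ground the Q4 discussion, the arithmetic can be sketched as below. This is a deliberately simplified model, not either vendor's actual pricing: the hourly rate, run duration, suspend delay, and minimum billed time are all hypothetical parameters, and both functions are illustrative names. Replace them with figures from your own account before drawing conclusions.

```python
import math

SECONDS_PER_DAY = 24 * 60 * 60


def per_second_active_cost(runs_per_day, run_seconds, rate_per_hour,
                           suspend_after_seconds=60, min_billed_seconds=60):
    """Daily cost when compute is billed only while active.

    Each run is billed for its duration plus the idle time before
    auto-suspend, with an assumed minimum charge per resume. If runs are so
    frequent that the compute never suspends, billing is capped at the
    interval between runs (that is, it is effectively always on).
    """
    interval = SECONDS_PER_DAY / runs_per_day
    active = max(run_seconds + suspend_after_seconds, min_billed_seconds)
    active = min(active, interval)
    return runs_per_day * active / 3600 * rate_per_hour


def per_vm_hour_cost(runs_per_day, run_seconds, rate_per_hour):
    """Daily cost when each run is billed in whole VM-hours, capped at 24."""
    hours_per_run = math.ceil(run_seconds / 3600)
    billed_hours = min(runs_per_day * hours_per_run, 24)
    return billed_hours * rate_per_hour


if __name__ == "__main__":
    rate = 4.0  # hypothetical price per hour, same for both models
    for label, runs in [("every 5 minutes", 288), ("once per day", 1)]:
        active = per_second_active_cost(runs, run_seconds=90, rate_per_hour=rate)
        vm_hour = per_vm_hour_cost(runs, run_seconds=90, rate_per_hour=rate)
        print(f"{label}: per-second-active={active:.2f}, per-VM-hour={vm_hour:.2f}")
```

Try varying `run_seconds` and `suspend_after_seconds`: the gap between the two models narrows as the compute spends less time idle between runs.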
Q5 — dbt cannot ingest data.
“You want to use dbt for transformation governance and automated testing, but your team also handles ingestion. How would you split responsibilities between dbt and one other tool?”
Q6 — After today’s hands-on exercise:
“What surprised you most? Was there anything that worked better than you expected — or worse?”
The Module 7: Comparison & Wrap-up page contains the complete reference: full 20-row comparison table, architecture diagram, side-by-side code, key architectural facts from official docs, and all further reading links.
Source: merged from frozen workshop-2026-v1/exercises/ex-batch-comparison.qmd — do not edit workshop-2026-v1/.