Exercise: Batch Comparison

YellowLine NYC story · full hands-on lab


Estimated time: 10–15 min (fill table from memory: 5 min · discussion: 5–10 min)

YellowLine NYC context (Module 7)

Fill the observation table after Modules 2–4, then discuss.

How to use this page

Complete the Databricks, Snowflake, and dbt exercises first, then come back here.

Fill in the observation table from memory — the exact numbers matter less than noticing where the tools felt different. Your facilitator will lead a discussion based on the open questions at the bottom.

GitHub blocked? (emergency only)

The normal path is to fork the repository and open a Codespace (Prerequisites § Step 2). Use the Lab source files only if your facilitator has approved it, for example because you cannot create or use a GitHub account before class.

What did you observe?

Fill this in after running all three pipelines on the same NYC Taxi data.

What you ran	Databricks (PySpark)	Snowflake (SQL)	Snowflake (Snowpark)	dbt
Silver transform — how long?	______	______	______	______
How long until you could query output?	______	______	______	______
`trips_by_hour` row count	______	______	______	______
Top pickup zone	______	______	______	______
How did you write the Silver table?	`.write.saveAsTable()`	`CREATE TABLE AS SELECT`	`.write.save_as_table()`	`materialized=` config
How did you verify row counts?	Manual `SELECT COUNT(*)`	Manual `SELECT COUNT(*)`	Manual `SELECT COUNT(*)`	`dbt test` — automatic
Did the cluster/warehouse need warming up?	______	______	______	N/A
Where did you see the output table?	Unity Catalog	Snowsight	Snowsight	dbt CLI + target DB

The KPI numbers must match across all tools

If your `trips_by_hour` row counts differ between Databricks and Snowflake, there is a data quality issue. The most common cause is a case-sensitive column name mismatch in the Silver filter (`FARE_AMOUNT` vs `fare_amount`).
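One way to rule out the casing problem is to normalize column names before filtering and counting. The sketch below uses plain Python dictionaries as stand-ins for rows exported from each tool. The helper names `normalize_keys` and `silver_filter`, the sample rows, and the "fare must be positive" rule are all hypothetical; substitute the Silver conditions you actually used in the lab.

```python
def normalize_keys(row):
    """Return a copy of the row with lower-cased column names."""
    return {key.lower(): value for key, value in row.items()}


def silver_filter(rows):
    """Keep rows with a positive fare (assumed rule, for illustration only)."""
    normalized = (normalize_keys(row) for row in rows)
    return [row for row in normalized if row.get("fare_amount", 0) > 0]


# Hypothetical sample rows: same data, different column-name casing.
databricks_rows = [{"fare_amount": 12.5}, {"fare_amount": -3.0}, {"fare_amount": 8.0}]
snowflake_rows = [{"FARE_AMOUNT": 12.5}, {"FARE_AMOUNT": -3.0}, {"FARE_AMOUNT": 8.0}]

# A naive filter that looks up the lower-case name misses every upper-case column.
naive_count = len([row for row in snowflake_rows if row.get("fare_amount", 0) > 0])

# After normalization both sources pass through the same filter.
databricks_count = len(silver_filter(databricks_rows))
snowflake_count = len(silver_filter(snowflake_rows))

print("naive Snowflake count:", naive_count)
print("normalized counts:", databricks_count, snowflake_count)
```

If the naive count and the normalized count disagree for your own data, the mismatch is in the column names rather than in the data itself.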

Things you probably noticed

These are things nearly every attendee notices without being told. Check which ones you experienced:

The SQL for Silver was almost identical across all three tools: same `WHERE` conditions, same column names
Snowpark and PySpark looked nearly the same: `.filter(col(...))`, `.groupBy()`, `.agg()`, `.write`
dbt was the only tool that ran tests automatically; you had to write and run `SELECT COUNT(*)` manually in the others
Snowflake auto-suspended the warehouse; the cluster in Databricks kept running unless you stopped it
dbt needed the Bronze tables to already exist: it could not ingest data on its own, it read from Databricks or Snowflake tables
Column names looked different: `fare_amount` in Databricks, `FARE_AMOUNT` in Snowflake
All three produced the same KPI numbers: same top zones, same 24-row `trips_by_hour`, same borough distribution
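The "same KPI numbers" observation can also be checked mechanically instead of by eye. The following is a minimal sketch, assuming you have copied each tool's `trips_by_hour` result into a dictionary of hour to trip count. The function name `check_trips_by_hour` and the sample numbers are hypothetical and are not part of the lab code.

```python
def check_trips_by_hour(results, expected_rows=24):
    """Compare trips_by_hour tables from several tools.

    `results` maps a tool name to a {hour: trip_count} dictionary.
    Returns a list of problem descriptions; an empty list means all tools agree.
    """
    problems = []
    tools = list(results)
    reference_tool = tools[0]  # the first tool is used as the baseline
    for tool in tools:
        table = results[tool]
        if len(table) != expected_rows:
            problems.append(f"{tool}: expected {expected_rows} rows, got {len(table)}")
        if tool != reference_tool and table != results[reference_tool]:
            problems.append(f"{tool}: values differ from {reference_tool}")
    return problems


# Hypothetical sample data: one row per hour of the day.
baseline = {hour: 100 + hour for hour in range(24)}
matching = {"databricks": dict(baseline), "snowflake": dict(baseline), "dbt": dict(baseline)}

# Simulate a Snowflake run that lost the 23:00 bucket.
broken = dict(matching)
broken["snowflake"] = {hour: trips for hour, trips in baseline.items() if hour != 23}

print(check_trips_by_hour(matching))
for problem in check_trips_by_hour(broken):
    print(problem)
```

This is the kind of assertion that `dbt test` ran for you automatically; in the other tools it would have to be written and scheduled by hand.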

Open questions for group discussion

These questions are open-ended — there is no single right answer. Use them in your table group or when the facilitator opens the floor.

Q1 — All three tools produced the same KPI numbers.

“If the output is identical, what would actually drive your choice between these three tools for a new project at your company?”

Q2 — PySpark and Snowpark have almost the same API.

“Snowflake deliberately designed Snowpark to look like PySpark. Why? Who is the intended audience? Does that change how you would upskill your team?”

Q3 — dbt ran automated tests; Databricks and Snowflake required manual verification.

“In your current team, who would be responsible for data quality checks in the Databricks or Snowflake pipelines? How would you change that before going to production?”

Q4 — Snowflake auto-suspended the warehouse; Databricks kept the cluster running.

“Consider two scenarios: one pipeline needs to run every 5 minutes, another needs to run once per day. Which cost model, pay-per-second-active or pay-per-VM-hour, is better for each scenario?”
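To ground the discussion, it can help to put rough numbers on both scenarios. The sketch below is a toy model, not vendor pricing: the hourly rate, the 90-second run time, the 60 seconds of idle time before suspension, and the rule that a VM is billed in whole hours are all assumptions chosen for illustration. Replace them with figures from your own account before drawing conclusions.

```python
import math

SECONDS_PER_DAY = 24 * 60 * 60


def active_second_cost(runs_per_day, run_seconds, idle_before_suspend, rate_per_hour):
    """Daily cost when compute is billed per second while it is running.

    Each run is billed for its own duration plus the idle time before
    suspension, but never for longer than the gap between two runs.
    """
    interval = SECONDS_PER_DAY / runs_per_day
    billed_per_run = min(run_seconds + idle_before_suspend, interval)
    return runs_per_day * billed_per_run * rate_per_hour / 3600


def vm_hour_cost(runs_per_day, run_seconds, rate_per_hour):
    """Daily cost when compute is billed in whole VM hours (assumed model).

    Either start the VM for each run or keep it on all day, whichever is cheaper.
    """
    hours_if_started_per_run = runs_per_day * math.ceil(run_seconds / 3600)
    billed_hours = min(24, hours_if_started_per_run)
    return billed_hours * rate_per_hour


RATE = 4.0         # hypothetical price per compute hour, same for both models
RUN_SECONDS = 90   # hypothetical duration of one pipeline run
IDLE_SECONDS = 60  # hypothetical idle time before the compute suspends

frequent_active = active_second_cost(288, RUN_SECONDS, IDLE_SECONDS, RATE)  # every 5 min
frequent_vm = vm_hour_cost(288, RUN_SECONDS, RATE)
daily_active = active_second_cost(1, RUN_SECONDS, IDLE_SECONDS, RATE)       # once per day
daily_vm = vm_hour_cost(1, RUN_SECONDS, RATE)

print(f"every 5 min: per-second {frequent_active:.2f} vs VM-hour {frequent_vm:.2f}")
print(f"once a day:  per-second {daily_active:.2f} vs VM-hour {daily_vm:.2f}")
```

The ranking depends heavily on the inputs: a longer idle time before suspension or a lower VM rate can narrow or reverse the gap, which is part of what the question asks you to argue about.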

Q5 — dbt cannot ingest data.

“You want to use dbt for transformation governance and automated testing, but your team also handles ingestion. How would you split responsibilities between dbt and one other tool?”

Q6 — After today’s hands-on exercise:

“What surprised you most? Was there anything that worked better than you expected — or worse?”

Want the full technical deep-dive?

The Module 7: Comparison & Wrap-up page contains the complete reference: full 20-row comparison table, architecture diagram, side-by-side code, key architectural facts from official docs, and all further reading links.

Return to module

Module 7 — story wrapper