# Google Forms Reflection — Answer Key (Quiz + Full Rationale)

**Trainer only** — synced from [google-forms-grading.yaml](google-forms-grading.yaml).
Trainee forms (Mod 0–9) use **Quiz mode** with auto-score where `gradable: true`.

**Regenerate**: `node sync-grading-artifacts.mjs`

**Manifest**: [google-forms-manifest.yaml](google-forms-manifest.yaml)

---

## Quiz scoring

| Item | Detail |
|------|--------|
| Default points | Radio 1, Checkbox 2 |
| Final survey | Not a quiz — no auto-score |
| Name (G1) | Not graded |
| Scales / grids / question 0.7 | Not auto-graded; explanation only |

---

# Module 0 — Welcome & Setup

### 0.1 — What is Marcus's biggest problem right now?

**Recommended answer**: No trusted analytics platform or KPIs for the business

**Rationale**: YellowLine NYC already has trip data in the lake, but Marcus cannot answer business questions because there is no governed analytics platform or agreed KPIs. Tool licenses and streaming come later in the workshop.

**Correct feedback** (written to Google Form): Exactly — Marcus has data but no trusted KPI layer yet.

**Incorrect feedback** (written to Google Form): Think about the story anchor: what is Marcus missing on day one?

**Distractor rationale**:

- **Power BI license is missing**: Power BI is part of the solution, but the core gap is trusted KPIs and a pipeline — not a missing license.
- **Drivers are unhappy**: Driver satisfaction may be a business concern, but the workshop story starts with analytics and data platform gaps.
- **Kafka is not installed**: Kafka and streaming are optional (Module 8); Marcus's immediate problem is batch analytics and KPIs.

**Reference docs**:

- [Databricks - Medallion Architecture](https://www.databricks.com/glossary/medallion-architecture)
- [Snowflake - Data Engineering Guide](https://docs.snowflake.com/en/guides-overview-data-engineering)

### 0.2 — Where does trip data most likely live today?

**Recommended answer**: Cloud object storage / data lake (e.g. ADLS2 Parquet)

**Rationale**: In the YellowLine story, historical trip files land in cloud object storage (ADLS2) as Parquet. A production Snowflake warehouse and Excel-only workflows are future or distractor states.

**Correct feedback** (written to Google Form): Correct — the lab source is ADLS2 Parquet.

**Incorrect feedback** (written to Google Form): Where does the workshop lab read raw trips from?

**Distractor rationale**:

- **Excel files on shared drives only**: Excel might exist informally, but the workshop dataset at scale lives in the data lake as Parquet.
- **Snowflake production warehouse already**: Snowflake is built during the workshop; on day one the raw source is the lake, not a finished warehouse.
- **Not sure**: The narrative and labs consistently start from ADLS2 Parquet as the Bronze source.

**Reference docs**:

- [Databricks - What is a Data Lake](https://www.databricks.com/glossary/data-lake)
- [Azure - ADLS Gen2](https://learn.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-introduction)

### 0.3 — How often should data refresh for Priya's dashboard to start?

**Recommended answer**: Daily batch is enough to start

**Rationale**: Priya's first dashboard does not need sub-minute latency. Daily batch is a pragmatic MVP; real-time dispatch use cases are covered optionally in Module 8.

**Correct feedback** (written to Google Form): Right — start with a reliable daily MVP for Priya.

**Incorrect feedback** (written to Google Form): What refresh cadence matches an MVP before optional streaming?

**Distractor rationale**:

- **Every minute (real-time)**: Minute-level refresh is a streaming/advanced pattern — overkill for the initial KPI dashboard.
- **Once per year**: Annual refresh cannot support operational or weekly business decisions.
- **Only when Marcus asks**: Ad hoc refresh is not a dependable pattern for a production analytics platform.

**Reference docs**:

- [Databricks - Auto Loader (Batch/Streaming)](https://docs.databricks.com/aws/en/ingestion/auto-loader/)
- [Snowflake - Tasks Scheduling](https://docs.snowflake.com/en/user-guide/tasks-intro)

### 0.4 — Best description of a 3-layer medallion design

**Recommended answer**: Bronze = raw; Silver = cleaned; Gold = KPI tables for BI

**Rationale**: Medallion architecture separates immutable raw ingest (Bronze), cleaned and joined entities (Silver), and business-ready aggregates (Gold) that BI tools like Power BI consume.

**Correct feedback** (written to Google Form): That's the standard medallion pattern.

**Incorrect feedback** (written to Google Form): Which layer is raw, which is cleaned, and which feeds Power BI?

**Distractor rationale**:

- **Bronze = KPIs; Silver = raw; Gold = backups**: This reverses the layers — Bronze must preserve raw history, not KPIs.
- **One layer is enough for Marcus**: A single table blurs raw history, quality fixes, and report logic — hard to maintain and audit.
- **Gold = raw Parquet files**: Gold holds curated KPI tables, not raw file bytes; raw files belong in Bronze.

**Reference docs**:

- [Databricks - Medallion Architecture](https://www.databricks.com/glossary/medallion-architecture)
- [Snowflake - Data Engineering Guide](https://docs.snowflake.com/en/guides-overview-data-engineering)
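
As a rough illustration of the three layers, the sketch below pushes a few in-memory rows through Bronze, Silver, and Gold in plain Python. The column names (`pickup_ts`, `fare`) and the sample rows are assumptions made for this example, not the workshop schema, and a real pipeline would run on Spark or SQL rather than Python lists.

```python
from collections import defaultdict
from datetime import datetime

# Bronze: raw rows kept exactly as they arrived (everything is a string).
bronze = [
    {"pickup_ts": "2024-01-05 08:15:00", "fare": "12.50"},
    {"pickup_ts": "2024-01-05 08:40:00", "fare": "7.00"},
    {"pickup_ts": "2024-01-05 17:05:00", "fare": "not-a-number"},
]

def to_silver(rows):
    """Cast types and drop rows that cannot be parsed."""
    silver = []
    for row in rows:
        try:
            silver.append({
                "pickup_ts": datetime.strptime(row["pickup_ts"], "%Y-%m-%d %H:%M:%S"),
                "fare": float(row["fare"]),
            })
        except ValueError:
            continue  # a real pipeline would quarantine the row instead
    return silver

def to_gold(rows):
    """Aggregate trips and revenue by hour of day."""
    gold = defaultdict(lambda: {"trips": 0, "revenue": 0.0})
    for row in rows:
        bucket = gold[row["pickup_ts"].hour]
        bucket["trips"] += 1
        bucket["revenue"] += row["fare"]
    return dict(gold)

silver = to_silver(bronze)
gold = to_gold(silver)
```

Bronze still holds all three raw rows after the run; only Silver and Gold reflect the cleaning and aggregation.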

### 0.5 — Pick TWO tools you would consider first (select two)

**Recommended answer**: `Databricks` + `Snowflake`

**Rationale**: Any two of Databricks, Snowflake, and dbt are valid first choices for pipeline work (e.g. Databricks+Snowflake, Snowflake+dbt, or Databricks+dbt). Power BI alone and Excel macros do not replace ingest and transform layers.

**Correct feedback** (written to Google Form): Good pair — both are real pipeline platforms for this story.

**Incorrect feedback** (written to Google Form): Pick two tools that actually build and run the data pipeline, not BI-only or spreadsheets.

**Distractor rationale**:

- **Power BI only (no pipeline)**: Power BI consumes Gold KPIs; it does not ingest and transform lake-scale data by itself.
- **Excel macros**: Excel cannot replace a repeatable medallion pipeline at NYC Taxi scale.

**Reference docs**:

- [Databricks - Lakehouse Platform](https://www.databricks.com/product/data-lakehouse)
- [Snowflake - Key Concepts](https://docs.snowflake.com/en/user-guide/intro-key-concepts)
- [dbt - Introduction](https://docs.getdbt.com/docs/introduction)

### 0.6 — What must exist before Priya can build a reliable Power BI dashboard?

**Recommended answer**: Gold KPI tables with agreed metrics

**Rationale**: A reliable dashboard needs Gold tables with agreed definitions (trips, revenue, time grains). Bronze raw files alone lack joins, tests, and business-ready aggregates.

**Correct feedback** (written to Google Form): Yes — Priya needs governed Gold metrics.

**Incorrect feedback** (written to Google Form): What layer and artifact does Power BI connect to for trusted KPIs?

**Distractor rationale**:

- **Bronze raw files only**: Bronze preserves history but is not report-ready — Priya would re-implement logic in Power BI.
- **A published AI chatbot**: AI can assist analysts but does not replace governed KPI tables and metric ownership.
- **Streaming Kafka topics**: Streaming is optional; the core batch Gold layer must exist first.

**Reference docs**:

- [Databricks - Medallion Architecture (Gold Layer)](https://www.databricks.com/glossary/medallion-architecture)
- [Power BI - Enterprise BI Guidance](https://learn.microsoft.com/en-us/power-bi/guidance/powerbi-implementation-guidance-usage-scenario-enterprise-bi)

### 0.7 — Top risk you worry about most for this project

**Auto-scored**: No (self-assessment / open / matrix)

**Rationale**: Every listed risk is valid for YellowLine — data quality, cost, skills after MHP leaves, and governance/compliance. Use this question to debrief why each trainee prioritized their choice; there is no single correct answer.

---

# Module 1 — Data Engineering Fundamentals

### 1.1 — Main reason for three layers instead of one big table

**Recommended answer**: Separate raw history, cleaned data, and report-ready KPIs

**Rationale**: Three layers isolate concerns: preserve raw history (Bronze), apply cleaning and joins once (Silver), and publish stable KPIs for BI (Gold). This improves reuse, testing, and auditability.

**Correct feedback** (written to Google Form): That's the core medallion rationale.

**Incorrect feedback** (written to Google Form): Why split work across three layers instead of one wide table?

**Distractor rationale**:

- **Google requires exactly three tables**: Three layers are a design pattern, not a vendor requirement.
- **Three layers use less storage than one**: Medallion often uses more storage; the benefit is clarity and maintainability, not storage savings.
- **Power BI only connects to three tables**: Power BI can connect to many tables; the pattern is about data engineering design.

**Reference docs**:

- [Databricks - Medallion Architecture](https://www.databricks.com/glossary/medallion-architecture)

### 1.2 — Best example of something that belongs in Silver

**Recommended answer**: Zone name joined to each trip

**Rationale**: Silver applies business rules and joins — attaching zone names to each trip is classic Silver work. Aggregates by hour belong in Gold; raw paths stay in Bronze.

**Correct feedback** (written to Google Form): Correct — row-level enrichment belongs in Silver.

**Incorrect feedback** (written to Google Form): Silver holds cleaned entity-level data — not aggregates or raw paths.

**Distractor rationale**:

- **Total revenue by hour (aggregated)**: Hourly revenue totals are Gold KPI aggregates, not row-level Silver logic.
- **Raw Parquet file path**: File paths and raw ingest metadata belong in Bronze, not cleaned Silver entities.
- **Marcus's email address**: PII like personal emails is not part of the taxi trip Silver model in this workshop.

**Reference docs**:

- [Databricks - Medallion Architecture (Silver Layer)](https://www.databricks.com/glossary/medallion-architecture)
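
The zone join can be pictured as a left-join style lookup. This is a minimal sketch in plain Python; the zone IDs, names, and column names are illustrative assumptions rather than the lab's actual tables.

```python
# Hypothetical zone lookup (reference data) keyed by zone ID.
zones = {132: "JFK Airport", 161: "Midtown Center"}

trips = [
    {"trip_id": 1, "pickup_zone_id": 132, "fare": 52.0},
    {"trip_id": 2, "pickup_zone_id": 161, "fare": 9.5},
    {"trip_id": 3, "pickup_zone_id": 999, "fare": 4.0},  # ID missing from the lookup
]

def enrich_with_zone(trip_rows, zone_lookup):
    """Keep every trip and attach a zone name; unknown IDs are flagged, not dropped."""
    return [
        {**trip, "pickup_zone": zone_lookup.get(trip["pickup_zone_id"], "Unknown")}
        for trip in trip_rows
    ]

silver_trips = enrich_with_zone(trips, zones)
```

The output stays at one row per trip, which is what separates this Silver enrichment from a Gold aggregate.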

### 1.3 — Best example of something that belongs in Gold

**Recommended answer**: Total trips and revenue by hour of day

**Rationale**: Gold tables answer business questions at the right grain — trips and revenue by hour power peak-hour analysis. Row-level timestamps and unparsed strings belong in lower layers.

**Correct feedback** (written to Google Form): Yes — that's a Gold KPI table.

**Incorrect feedback** (written to Google Form): Gold holds report-ready aggregates Priya charts in Power BI.

**Distractor rationale**:

- **Single trip pickup timestamp (row-level)**: Row-level timestamps are Silver or Bronze detail, not a Gold aggregate.
- **Unparsed CSV string from source**: Unparsed source strings are Bronze raw data.
- **Unity Catalog audit log**: Audit logs are operational metadata, not a business KPI table for Priya's dashboard.

**Reference docs**:

- [Databricks - Medallion Architecture (Gold Layer)](https://www.databricks.com/glossary/medallion-architecture)

### 1.4 — Priya asks: "When are our peak revenue hours?" — which layer?

**Recommended answer**: Gold

**Rationale**: Questions about peak revenue hours map to pre-aggregated Gold KPIs (e.g. by hour of day). Silver has row-level trips; Bronze has raw files.

**Correct feedback** (written to Google Form): Peak revenue hours is a KPI question — Gold.

**Incorrect feedback** (written to Google Form): Which layer answers 'when are our peak hours?' with pre-built metrics?

**Distractor rationale**:

- **Bronze**: Bronze is raw ingest — it does not pre-compute peak-hour revenue.
- **Silver**: Silver cleans and joins trips but typically does not hold hour-level revenue KPIs for BI.
- **Not sure**: If the question is about peak metrics for a dashboard, the answer is Gold.

**Reference docs**:

- [Databricks - Medallion Architecture](https://www.databricks.com/glossary/medallion-architecture)

### 1.5 — For NYC Taxi at scale, best pattern

**Recommended answer**: ELT — load raw first, transform in the platform

**Rationale**: ELT loads raw data into the platform first (Bronze), then uses Spark or SQL for heavy transforms. This keeps Bronze immutable and leverages warehouse/compute scale.

**Correct feedback** (written to Google Form): Correct — load raw, transform with Spark/SQL in the platform.

**Incorrect feedback** (written to Google Form): At NYC Taxi scale, do you transform before load or load raw then transform?

**Distractor rationale**:

- **ETL — transform completely before any load**: Transforming everything before load scales poorly and discards raw history; the workshop pattern is ELT.
- **No transform needed**: Raw trips need cleaning, joins, and aggregates before trustworthy KPIs.
- **Transform only in Power BI**: Pushing all transform logic into Power BI duplicates effort and bypasses governed Silver/Gold layers.

**Reference docs**:

- [Databricks - ELT vs ETL](https://www.databricks.com/glossary/elt)
- [Snowflake - Data Engineering Guide](https://docs.snowflake.com/en/guides-overview-data-engineering)
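
The load-then-transform order can be sketched with SQLite from the Python standard library standing in for the warehouse. The table names (`bronze_trips`, `gold_revenue_by_hour`), the columns, and the rule that negative fares are invalid are all assumptions for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for Databricks or Snowflake

# Load: land the raw rows untouched (all text), as Bronze would.
conn.execute("CREATE TABLE bronze_trips (pickup_ts TEXT, fare TEXT)")
conn.executemany(
    "INSERT INTO bronze_trips VALUES (?, ?)",
    [
        ("2024-01-05 08:15:00", "12.50"),
        ("2024-01-05 08:40:00", "7.00"),
        ("2024-01-05 17:05:00", "-3.00"),  # bad row, still preserved in Bronze
    ],
)

# Transform: run SQL inside the platform, after the load.
conn.execute(
    """
    CREATE TABLE gold_revenue_by_hour AS
    SELECT CAST(strftime('%H', pickup_ts) AS INTEGER) AS pickup_hour,
           COUNT(*) AS trips,
           SUM(CAST(fare AS REAL)) AS revenue
    FROM bronze_trips
    WHERE CAST(fare AS REAL) >= 0
    GROUP BY pickup_hour
    """
)

rows = conn.execute(
    "SELECT pickup_hour, trips, revenue FROM gold_revenue_by_hour ORDER BY pickup_hour"
).fetchall()
bronze_count = conn.execute("SELECT COUNT(*) FROM bronze_trips").fetchone()[0]
```

Because the filter runs after the load, the bad row remains available in Bronze if the cleaning rule ever needs to change.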

### 1.6 — Best reason for your ELT/ETL choice

**Recommended answer**: Keep Bronze immutable; use Spark/SQL for heavy joins

**Rationale**: Immutable Bronze preserves audit history; Spark and SQL engines handle volume and joins efficiently inside Databricks or Snowflake. This is the main engineering reason for ELT here.

**Correct feedback** (written to Google Form): That's the ELT rationale for this dataset.

**Incorrect feedback** (written to Google Form): Why load raw first instead of transforming upstream?

**Distractor rationale**:

- **Excel cannot read Parquet**: Parquet tooling exists; the choice is architectural, not an Excel limitation.
- **Snowflake forbids ELT**: Snowflake supports ELT — load then transform in SQL.
- **dbt requires ETL first**: dbt transforms data already in the warehouse; it assumes ELT-style loading.

**Reference docs**:

- [Databricks - ELT vs ETL](https://www.databricks.com/glossary/elt)

### 1.7 — One fair criterion to compare tools (no winner yet)

**Recommended answer**: Team SQL skills and long-term maintenance

**Rationale**: Fair tool comparison includes team skills, maintainability, governance, and cost — not logos or press releases. This foreshadows the Module 7 decision matrix.

**Correct feedback** (written to Google Form): Good — compare tools on skills and sustainment.

**Incorrect feedback** (written to Google Form): Pick a criterion that matters after MHP leaves, not marketing trivia.

**Distractor rationale**:

- **Logo color**: Cosmetic criteria do not determine pipeline maintainability.
- **Number of press releases**: Marketing volume is not a technical fit criterion.
- **Alphabetical product name**: Naming order has no bearing on architecture fit.

**Reference docs**:

- [dbt - Introduction](https://docs.getdbt.com/docs/introduction)
- [Databricks - Lakehouse Platform](https://www.databricks.com/product/data-lakehouse)

---

# Module 2 — Databricks Pipeline

### 2.1 — Best reason to use Databricks instead of Excel for this dataset

**Recommended answer**: Volume and repeatable Spark/Delta pipeline

**Rationale**: NYC Taxi data at scale needs distributed Spark processing and a repeatable Delta pipeline. Excel and ad hoc tools cannot reliably handle volume or production scheduling.

**Correct feedback** (written to Google Form): Exactly — scale and repeatability drive Databricks here.

**Incorrect feedback** (written to Google Form): Why not Excel for millions of taxi trips?

**Distractor rationale**:

- **Excel cannot open any files**: Excel can open some files; the issue is scale and pipeline repeatability.
- **Marcus banned SQL**: SQL is central to the workshop — nothing bans it.
- **Power BI requires Databricks brand**: Power BI connects to many backends; Databricks is chosen for data engineering, not BI branding.

**Reference docs**:

- [Databricks - Lakehouse Platform](https://www.databricks.com/product/data-lakehouse)
- [Databricks - Delta Lake](https://www.databricks.com/glossary/delta-lake)

### 2.2 — Best way to ingest ADLS2 Parquet into Bronze on Databricks

**Recommended answer**: Auto Loader or Spark read from abfss path → Delta Bronze

**Rationale**: The Databricks lab reads Parquet from abfss paths (Auto Loader or Spark read) and writes Delta Bronze tables. This is cloud-native, repeatable ingest.

**Correct feedback** (written to Google Form): Correct — that's the lab ingest pattern.

**Incorrect feedback** (written to Google Form): How do you land ADLS2 Parquet into Delta Bronze on Databricks?

**Distractor rationale**:

- **Email Parquet files to Marcus**: Email is not a production ingest path for lake-scale data.
- **Manual copy into Power BI only**: Power BI import skips governed Bronze and cannot replace the medallion pipeline.
- **Delete Parquet and use CSV paste**: Destroying Parquet and manual paste breaks immutability and scale.

**Reference docs**:

- [Databricks - Auto Loader](https://docs.databricks.com/aws/en/ingestion/auto-loader/)
- [Databricks - Delta Lake](https://www.databricks.com/glossary/delta-lake)

### 2.3 — What could go wrong if you skip checks and jump to KPIs? (select all that apply)

**Recommended answer**: `Wrong aggregates on revenue` + `Silent data drift` + `Broken or misleading Power BI` + `Audit/compliance issues`

**Rationale**: Jumping straight to KPIs without Silver checks risks wrong revenue, undetected schema drift, misleading dashboards, and audit gaps. Data quality gates belong before Gold.

**Correct feedback** (written to Google Form): All four — skipping checks creates real downstream risk.

**Incorrect feedback** (written to Google Form): Each listed outcome can happen when quality checks are skipped.

**Reference docs**:

- [dbt - Data Tests](https://docs.getdbt.com/docs/build/tests)
- [Databricks - Medallion Architecture](https://www.databricks.com/glossary/medallion-architecture)
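
To make the "wrong aggregates" risk concrete, the sketch below sums revenue with and without two basic checks. The sample rows and the two rules (deduplicate on `trip_id`, reject negative fares) are assumptions chosen for illustration.

```python
trips = [
    {"trip_id": 1, "fare": 20.0},
    {"trip_id": 2, "fare": 15.0},
    {"trip_id": 2, "fare": 15.0},    # duplicate load of the same trip
    {"trip_id": 3, "fare": -500.0},  # implausible negative fare
]

def revenue_without_checks(rows):
    """Sum every row as-is, the way a KPI built straight on raw data would."""
    return sum(row["fare"] for row in rows)

def revenue_with_checks(rows):
    """Deduplicate on trip_id and drop negative fares before summing."""
    seen = set()
    total = 0.0
    for row in rows:
        if row["trip_id"] in seen or row["fare"] < 0:
            continue
        seen.add(row["trip_id"])
        total += row["fare"]
    return total
```

Neither version raises an error, which is the point: without a quality gate the wrong number reaches the dashboard silently.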

### 2.4 — Which are Silver-layer work? (select TWO)

**Recommended answer**: `Join taxi zone lookup to trips` + `Cast/fix data types and filter bad rows`

**Rationale**: Silver joins reference data (zones) and fixes types/filters bad rows. Hourly revenue aggregation is Gold; storing raw Parquet unchanged is Bronze.

**Correct feedback** (written to Google Form): Both are classic Silver transformations.

**Incorrect feedback** (written to Google Form): Silver cleans and enriches — not raw storage or Gold aggregates.

**Distractor rationale**:

- **Aggregate total revenue by hour**: Hourly revenue totals are Gold KPI work, not Silver entity cleaning.
- **Store raw Parquet bytes unchanged**: Raw bytes unchanged belong in Bronze, not Silver.

**Reference docs**:

- [Databricks - Medallion Architecture (Silver Layer)](https://www.databricks.com/glossary/medallion-architecture)

### 2.5 — Which TWO Gold tables unlock Priya's Overview hour/day charts? (select two)

**Recommended answer**: `kpi_trips_by_hour` + `kpi_trips_by_day`

**Rationale**: Priya's Overview needs `kpi_trips_by_hour` and `kpi_trips_by_day`. Borough analysis is a different slice; Bronze raw trips are not BI-ready KPIs.

**Correct feedback** (written to Google Form): Those two Gold tables feed Priya's hour/day Overview charts.

**Incorrect feedback** (written to Google Form): Which Gold tables match hour-of-day and day-level chart grains?

**Distractor rationale**:

- **kpi_borough_analysis**: Borough KPIs are useful but do not directly power the hour/day Overview charts asked here.
- **Bronze raw trips table**: Bronze is raw — Power BI should consume governed Gold KPIs.

**Reference docs**:

- [Databricks - Medallion Architecture (Gold Layer)](https://www.databricks.com/glossary/medallion-architecture)

### 2.6 — How clear is the path ADLS2 → Bronze → Silver → Gold? (1–5)

**Auto-scored**: No (self-assessment / open / matrix)

**Rationale**: Subjective self-assessment (1–5) of how clear the ADLS2 → Bronze → Silver → Gold path feels after the lab. Scores 4–5 suggest readiness to move on; 1–2 suggest revisiting the medallion diagram before practice.

---

# Module 3 — Snowflake Pipeline

### 3.1 — What is Marcus really asking for?

**Recommended answer**: Both a SQL-friendly tool and maintainable skill set

**Rationale**: Marcus's ask is dual: a SQL-friendly platform YellowLine can operate and a skill set that survives after MHP leaves. Neither tool choice alone nor a theme change solves that.

**Correct feedback** (written to Google Form): Marcus wants both the tool and team capability.

**Incorrect feedback** (written to Google Form): Marcus asked for SQL accessibility and long-term maintainability — both matter.

**Distractor rationale**:

- **Different tool only**: Tool change without skills does not sustain the pipeline.
- **Different skill set only**: Skills need a matching platform — tool and training go together.
- **A new Power BI theme**: Dashboard theming does not address pipeline maintainability.

**Reference docs**:

- [Snowflake - Key Concepts](https://docs.snowflake.com/en/user-guide/intro-key-concepts)

### 3.2 — Best way to keep the same 12 KPIs without PySpark

**Recommended answer**: Rebuild Silver/Gold logic in Snowflake SQL (same table names)

**Rationale**: The Snowflake module rebuilds Silver/Gold in SQL with the same table names so Priya's semantic model stays stable. Running PySpark in Power BI or cutting KPIs breaks parity.

**Correct feedback** (written to Google Form): Correct — parity via SQL on Snowflake with same KPI tables.

**Incorrect feedback** (written to Google Form): How do you keep the same 12 KPIs without PySpark?

**Distractor rationale**:

- **Run PySpark inside Power BI**: Power BI is not a Spark execution environment for medallion pipelines.
- **Export KPIs to PDF only**: Static PDFs are not a live analytics platform.
- **Reduce KPI count from 12 to 1**: Marcus asked to keep the same KPIs — not collapse the model.

**Reference docs**:

- [Snowflake - Data Modeling](https://docs.snowflake.com/en/user-guide/data-modeling)

### 3.3 — Who should maintain the pipeline after MHP leaves?

**Recommended answer**: YellowLine data engineer with SQL + Snowflake (+ dbt later)

**Rationale**: Production ownership should sit with a YellowLine data engineer trained in SQL and Snowflake, with dbt added for transforms. Marcus alone or Priya-only pipeline ownership does not scale.

**Correct feedback** (written to Google Form): That's the intended handoff model.

**Incorrect feedback** (written to Google Form): Who owns the pipeline after consultants leave?

**Distractor rationale**:

- **Marcus alone with no training**: Executives rarely maintain nightly pipelines without engineering support.
- **Priya only — she owns pipelines**: Priya consumes KPIs in Power BI; pipeline engineering is a different role.
- **External vendor only — no internal skills**: Vendor-only models create dependency — Marcus wanted internal maintainability.

**Reference docs**:

- [Snowflake - Data Engineering Guide](https://docs.snowflake.com/en/guides-overview-data-engineering)

### 3.4 — Best way to load ADLS2 Parquet into Snowflake Bronze

**Recommended answer**: External stage + COPY INTO or Snowpipe

**Rationale**: Snowflake uses external stages pointing at ADLS2, then COPY INTO (batch) or Snowpipe (continuous) to load Bronze tables. Manual worksheets and Power BI import are not ingest patterns.

**Correct feedback** (written to Google Form): Correct — Snowflake-native load from ADLS2.

**Incorrect feedback** (written to Google Form): How does Snowflake ingest Parquet from ADLS2 into Bronze?

**Distractor rationale**:

- **Databricks notebook only**: Databricks is covered in Module 2; the Snowflake module uses Snowflake-native load paths.
- **Manual row entry in worksheet**: Manual entry cannot load lake-scale Parquet.
- **Power BI Import button**: Power BI import does not replace warehouse Bronze ingest.

**Reference docs**:

- [Snowflake - Data Loading Overview](https://docs.snowflake.com/en/user-guide/data-load-overview)
- [Snowflake - Snowpipe Introduction](https://docs.snowflake.com/en/user-guide/data-load-snowpipe-intro)
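
The batch path can be sketched as two Snowflake statements, rendered here from Python so the placeholders are explicit. The stage, integration, and table names are hypothetical, the storage integration is assumed to exist already, and `MATCH_BY_COLUMN_NAME` assumes the Bronze table was created with columns matching the Parquet files.

```python
def render_bronze_load_sql(stage, url, storage_integration, table):
    """Return (CREATE STAGE, COPY INTO) statements for a batch Parquet load."""
    create_stage = (
        f"CREATE STAGE IF NOT EXISTS {stage}\n"
        f"  URL = '{url}'\n"
        f"  STORAGE_INTEGRATION = {storage_integration};"
    )
    copy_into = (
        f"COPY INTO {table}\n"
        f"  FROM @{stage}\n"
        f"  FILE_FORMAT = (TYPE = PARQUET)\n"
        f"  MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE;"
    )
    return create_stage, copy_into

# All values are placeholders; identifiers are interpolated as-is,
# so they must come from trusted configuration, not user input.
create_stage, copy_into = render_bronze_load_sql(
    stage="adls_trips_stage",
    url="azure://<account>.blob.core.windows.net/<container>/<path>/",
    storage_integration="adls_integration",
    table="bronze.trips",
)
```

For continuous loading, Snowpipe wraps a similar `COPY INTO` statement in a pipe definition instead of running it on demand.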

### 3.5 — If Gold schema stays identical, can Priya keep the same Power BI reports?

**Recommended answer**: Yes — same semantic model; connection/server may change

**Rationale**: Identical Gold schema lets Priya reuse the same semantic model and reports; only connection details (server, warehouse) may change. Visuals do not need a full rebuild.

**Correct feedback** (written to Google Form): Yes — identical Gold schema keeps Priya's reports working.

**Incorrect feedback** (written to Google Form): If Gold table names and columns match, what changes for Power BI?

**Distractor rationale**:

- **Yes — nothing changes at all (same connection string)**: Connection strings often change when moving platforms even if schema stays the same.
- **No — must rebuild every visual from scratch**: Schema parity avoids rebuilding every visual; Priya only needs to reconnect the semantic model to the new source.
- **Not sure**: Same Gold schema is explicitly the Priya parity goal in Module 3.

**Reference docs**:

- [Snowflake - Data Loading Overview](https://docs.snowflake.com/en/user-guide/data-load-overview)
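
One way to check "identical Gold schema" before repointing Power BI is to compare table and column names across the two platforms. This is a sketch under assumptions: the schemas are hand-written dictionaries with hypothetical columns, and names are lower-cased because Snowflake stores unquoted identifiers in upper case.

```python
def normalize(schema):
    """Lower-case table and column names so the comparison ignores case."""
    return {t.lower(): {c.lower() for c in cols} for t, cols in schema.items()}

def missing_from_target(source_schema, target_schema):
    """List tables and columns present in the source but absent in the target."""
    source, target = normalize(source_schema), normalize(target_schema)
    problems = []
    for table in sorted(source):
        if table not in target:
            problems.append(f"missing table: {table}")
            continue
        for column in sorted(source[table] - target[table]):
            problems.append(f"missing column: {table}.{column}")
    return problems

# Hypothetical column lists for illustration only.
databricks_gold = {"kpi_trips_by_hour": ["pickup_hour", "trips", "revenue"]}
snowflake_gold = {"KPI_TRIPS_BY_HOUR": ["PICKUP_HOUR", "TRIPS"]}

gaps = missing_from_target(databricks_gold, snowflake_gold)
```

An empty result suggests the semantic model can be reconnected without touching visuals; data types would still need a separate check.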

### 3.6 — How well does Snowflake fit your team's skills today? (1–5)

**Auto-scored**: No (self-assessment / open / matrix)

**Rationale**: Subjective 1–5 poll on Snowflake fit for the trainee's team. Use in Module 7 discussion — lower scores may support a SQL-first Snowflake + dbt narrative; there is no right/wrong score.

---

# Module 4 — dbt Pipeline

### 4.1 — What did Snowflake SQL alone not fully solve for Marcus?

**Recommended answer**: Lineage, tests, and versioned transform code in Git

**Rationale**: Snowflake stores and runs SQL, but dbt adds Git-versioned transforms, automated tests, and lineage documentation — critical for audit and team handoff.

**Correct feedback** (written to Google Form): That's what dbt adds on top of warehouse SQL.

**Incorrect feedback** (written to Google Form): Snowflake SQL runs queries — what did Marcus still need for governance?

**Distractor rationale**:

- **Storing any data**: Snowflake already stores data; dbt does not replace the warehouse storage role.
- **Running SELECT statements**: Both Snowflake and dbt can run SELECTs — that is not the unique dbt value.
- **Connecting Power BI**: Power BI connectivity is separate from dbt's transform-and-test layer.

**Reference docs**:

- [dbt - Data Tests](https://docs.getdbt.com/docs/build/tests)
- [dbt - Documentation & Lineage](https://docs.getdbt.com/docs/collaborate/documentation)

### 4.2 — Is dbt a replacement for Snowflake?

**Recommended answer**: No — dbt runs on the warehouse

**Rationale**: dbt compiles and runs transform SQL against Snowflake (or other warehouses); it does not replace storage and compute. Treating dbt as the warehouse leads to architecture mistakes.

**Correct feedback** (written to Google Form): Critical distinction — dbt orchestrates transforms in Snowflake.

**Incorrect feedback** (written to Google Form): dbt is a transform layer, not a replacement database.

**Distractor rationale**:

- **Yes — dbt replaces the warehouse**: dbt has no storage engine — it needs Snowflake/Databricks/etc. underneath.
- **Yes — dbt is only for Excel**: dbt is a warehouse transform tool, not a spreadsheet product.
- **Not sure**: Remember: dbt runs ON the warehouse, and the Module 4 lab shows this directly.

**Reference docs**:

- [dbt - Introduction](https://docs.getdbt.com/docs/introduction)

### 4.3 — dbt's job in this stack

**Recommended answer**: Manage transform SQL, tests, and lineage in Git on Snowflake

**Rationale**: In the YellowLine stack, dbt owns versioned transform SQL, data tests, and lineage docs executed against Snowflake. It does not ingest raw Parquet or replace Power BI.

**Correct feedback** (written to Google Form): That's dbt's role in this stack.

**Incorrect feedback** (written to Google Form): dbt sits between raw warehouse tables and Gold KPIs — what does it manage?

**Distractor rationale**:

- **Replace Power BI**: Power BI remains the BI layer; dbt feeds Gold tables Power BI reads.
- **Ingest raw Parquet from ADLS2**: Ingest is Bronze/Snowpipe work — dbt transforms data already in the warehouse.
- **Schedule Databricks clusters**: Databricks scheduling is covered in Modules 2 and 5; dbt does not manage clusters, and dbt runs are themselves scheduled through dbt Cloud, CI, or an orchestrator.

**Reference docs**:

- [dbt - Introduction](https://docs.getdbt.com/docs/introduction)
- [dbt - Data Tests](https://docs.getdbt.com/docs/build/tests)

### 4.4 — Best way to prove where a Power BI revenue tile comes from

**Recommended answer**: dbt docs lineage from tile → Gold model → Silver → source

**Rationale**: dbt docs lineage shows the full path from a Power BI metric back through Gold and Silver models to sources — essential for Q3 board audit questions.

**Correct feedback** (written to Google Form): Lineage tracing is the audit-friendly answer.

**Incorrect feedback** (written to Google Form): How do you prove a revenue number's path for a board audit?

**Distractor rationale**:

- **Ask Marcus to trust the number**: Trust without traceability fails governance and audit requirements.
- **Screenshot Power BI only**: A screenshot shows the tile, not upstream SQL lineage.
- **Delete the tile**: Removing the metric avoids the question instead of answering it.

**Reference docs**:

- [dbt - Documentation & Lineage](https://docs.getdbt.com/docs/collaborate/documentation)
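
Conceptually, tracing a tile back to its sources is a walk over a dependency graph. dbt builds that graph from `ref()` and `source()` calls, and a dashboard can be declared as an exposure. The sketch below hand-writes a small graph instead; every node name except `kpi_trips_by_hour` is hypothetical.

```python
# Hypothetical dependency graph: each node maps to the nodes it reads from.
parents = {
    "powerbi.revenue_tile": ["gold.kpi_trips_by_hour"],
    "gold.kpi_trips_by_hour": ["silver.trips_enriched"],
    "silver.trips_enriched": ["bronze.trips", "bronze.taxi_zones"],
    "bronze.trips": [],
    "bronze.taxi_zones": [],
}

def upstream(node, graph):
    """Return every ancestor of `node`, nearest first (breadth-first walk)."""
    seen, queue, order = set(), list(graph.get(node, [])), []
    while queue:
        current = queue.pop(0)
        if current in seen:
            continue
        seen.add(current)
        order.append(current)
        queue.extend(graph.get(current, []))
    return order

lineage = upstream("powerbi.revenue_tile", parents)
```

The resulting list is the audit trail in miniature: Gold model first, then Silver, then the raw sources.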

### 4.5 — Pick TWO tests to prevent silent KPI breakage (select two)

**Recommended answer**: `not_null` + `relationships`

**Rationale**: `not_null` catches empty KPI fields; `relationships` validates foreign keys (e.g. zone IDs). `not_null` + Source freshness is also a valid pair for catching silent breakage. `accepted_values` alone is too narrow for this story.

**Correct feedback** (written to Google Form): Good pair — null checks and referential integrity catch silent breakage.

**Incorrect feedback** (written to Google Form): Pick tests that catch missing keys, broken joins, or stale data — not just enum lists.

**Distractor rationale**:

- **unique**: Unique tests help primary keys but do not alone catch stale sources or broken FKs to lookup tables.
- **accepted_values**: Accepted values guard enums but miss null revenue rows or orphaned zone references.

**Reference docs**:

- [dbt - Data Tests (Generic Tests)](https://docs.getdbt.com/docs/build/tests)
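
What the two recommended tests look for can be shown in a few lines of plain Python. This is not how dbt implements them (dbt compiles each test to SQL that returns failing rows); the helper names and sample rows are assumptions for illustration.

```python
def not_null_failures(rows, column):
    """Rows where `column` is missing or None (what a not_null test would flag)."""
    return [row for row in rows if row.get(column) is None]

def relationship_failures(child_rows, fk_column, parent_rows, pk_column):
    """Child rows whose non-null foreign key has no matching parent key."""
    parent_keys = {row[pk_column] for row in parent_rows}
    return [
        row for row in child_rows
        if row.get(fk_column) is not None and row[fk_column] not in parent_keys
    ]

zones = [{"zone_id": 132}, {"zone_id": 161}]
trips = [
    {"trip_id": 1, "zone_id": 132, "revenue": 52.0},
    {"trip_id": 2, "zone_id": 999, "revenue": 9.5},   # orphaned zone reference
    {"trip_id": 3, "zone_id": 161, "revenue": None},  # null revenue
]

null_revenue = not_null_failures(trips, "revenue")
orphans = relationship_failures(trips, "zone_id", zones, "zone_id")
```

Each check catches a different row, which is why the pair covers more silent breakage than either test alone.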

### 4.6 — Best Gold table for Priya's data quality scorecard

**Recommended answer**: kpi_data_quality_metrics

**Rationale**: `kpi_data_quality_metrics` aggregates test results and quality signals for Priya's scorecard. Trip-by-hour KPIs answer volume questions, not quality monitoring.

**Correct feedback** (written to Google Form): Correct — that's Priya's data quality scorecard table.

**Incorrect feedback** (written to Google Form): Which Gold table is built for quality metrics, not trip counts?

**Distractor rationale**:

- **kpi_trips_by_hour**: Trip-by-hour is an operational KPI, not a data quality scorecard.
- **Bronze raw trips**: Bronze is untested raw data — not a quality scorecard.
- **Excel template**: Excel templates are not governed Gold warehouse tables.

**Reference docs**:

- [dbt - Data Tests](https://docs.getdbt.com/docs/build/tests)

### 4.7 — How useful was dbt docs / lineage today? (1–5)

**Auto-scored**: No (self-assessment / open / matrix)

**Rationale**: Subjective 1–5 rating of dbt docs/lineage usefulness after the lab. Scores 4–5 mean the click-through lineage demo landed; 1–2 suggest rerunning `dbt docs generate` in the debrief.

---

# Module 5 — Production Patterns

### 5.1 — How is production different from this morning's lab? (select all that apply)

**Recommended answer**: `Scheduling` + `Monitoring/alerting` + `CI/CD or PR review for SQL` + `Access control and cost controls`

**Rationale**: Production pipelines need schedules, monitoring, change control (CI/CD/PRs), and governance (access and cost). A one-off lab run skips these until Module 5.

**Correct feedback** (written to Google Form): All four — production adds operational discipline.

**Incorrect feedback** (written to Google Form): Production differs from a morning lab in all four listed ways.

**Reference docs**:

- [Databricks - Lakeflow Jobs](https://docs.databricks.com/aws/en/workflows/)
- [Snowflake - Tasks](https://docs.snowflake.com/en/user-guide/tasks-intro)

### 5.2 — What often breaks in production but rarely in class? (select all that apply)

**Recommended answer**: `Late or missing source files` + `Schema drift` + `2 a.m. job failures nobody notices` + `Analyst changing prod SQL without review`

**Rationale**: Class labs use clean, present data with trainers watching. Production hits missing files, schema changes, silent overnight failures, and unreviewed SQL edits.

**Correct feedback** (written to Google Form): All four — classic production failure modes.

**Incorrect feedback** (written to Google Form): Each item is a real production break that labs rarely expose.

**Reference docs**:

- [Databricks - Lakeflow Jobs (Monitoring)](https://docs.databricks.com/aws/en/workflows/)
- [dbt - CI/CD Jobs](https://docs.getdbt.com/docs/deploy/ci-jobs)

### 5.3 — Best way to schedule nightly pipeline on Databricks

**Recommended answer**: Lakeflow Jobs (Databricks Workflows)

**Rationale**: Lakeflow Jobs (formerly Databricks Workflows) schedule notebook and pipeline tasks reliably. Manual clicks, Power BI refresh alone, or email CSVs are not production scheduling.

**Correct feedback** (written to Google Form): Correct — Lakeflow Jobs schedule nightly pipelines.

**Incorrect feedback** (written to Google Form): Which Databricks feature runs notebooks on a schedule?

**Distractor rationale**:

- **Manual notebook click each morning**: Manual runs do not scale and fail on holidays or staff absence.
- **Power BI scheduled refresh only**: Power BI refresh reads Gold — it does not orchestrate Bronze/Silver transforms.
- **Email CSV to Marcus**: Email is not an automated pipeline scheduler.

**Reference docs**:

- [Databricks - Lakeflow Jobs](https://docs.databricks.com/aws/en/workflows/)

### 5.4 — Best way to schedule nightly pipeline on Snowflake + dbt

**Recommended answer**: Snowflake Tasks + scheduled dbt (Cloud or CI)

**Rationale**: Snowflake Tasks can chain SQL steps; dbt runs on a schedule via dbt Cloud or CI pipelines. Together they replace manual worksheets for nightly transforms.

**Correct feedback** (written to Google Form): That's the Snowflake + dbt scheduling pattern.

**Incorrect feedback** (written to Google Form): How do you combine warehouse tasks with scheduled dbt runs?

**Distractor rationale**:

- **Only manual worksheets**: Manual runs lack alerting, dependency management, and an audit trail.
- **Restart laptop nightly**: Laptop restarts are not pipeline orchestration.
- **Priya refreshes Power BI only**: BI refresh does not execute Silver/Gold transforms in the warehouse.

**Reference docs**:

- [Snowflake - Tasks](https://docs.snowflake.com/en/user-guide/tasks-intro)
- [dbt - Schedule Jobs](https://docs.getdbt.com/docs/deploy/schedule-jobs)

### 5.5 — If Silver fails at 2 a.m., best response

**Recommended answer**: Alert owner; stop/warn Gold updates; no silent wrong KPIs

**Rationale**: Production response must alert owners, halt or warn downstream Gold updates, and never silently publish incorrect KPIs. Ignoring failures or fabricating numbers violates trust.
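
The fail-safe behaviour can be illustrated for trainees with a small sketch: run the layers in order, alert on the first failure, and skip everything downstream so Gold is never refreshed from a broken Silver. This is a toy runner, not how Lakeflow Jobs or Snowflake Tasks are implemented; those tools express the same idea through task dependencies. The step names and the `alert` callback are assumptions for the example.

```python
def run_pipeline(steps, alert):
    """Run (name, callable) steps in order and stop at the first failure.

    Returns a dict mapping each step name to 'ok', 'failed', or 'skipped'.
    """
    status = {}
    failed = False
    for name, step in steps:
        if failed:
            status[name] = "skipped"  # downstream layers are not refreshed
            continue
        try:
            step()
            status[name] = "ok"
        except Exception as exc:
            status[name] = "failed"
            failed = True
            alert(f"{name} failed: {exc}")  # notify the owner instead of failing silently
    return status


def silver_step():
    # Hypothetical failure used only to drive the example.
    raise RuntimeError("schema drift in source file")


alerts = []
result = run_pipeline(
    [("bronze", lambda: None), ("silver", silver_step), ("gold", lambda: None)],
    alert=alerts.append,
)
print(result)
print(alerts)
```

Gold ends up `skipped` rather than rebuilt from bad inputs, and the alert list holds one message for the owner.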

**Correct feedback** (written to Google Form): Fail-safe — alert, block bad Gold, no silent wrong numbers.

**Incorrect feedback** (written to Google Form): When Silver fails at 2 a.m., should Gold still publish stale/wrong KPIs?

**Distractor rationale**:

- **Ignore until Marcus complains**: Silent failure erodes trust before Marcus opens the dashboard.
- **Publish random numbers so dashboard is green**: Fabricated KPIs are worse than a visible failure.
- **Delete all Bronze data**: Deleting Bronze destroys audit history and does not fix the Silver failure.

**Reference docs**:

- [Databricks - Lakeflow Jobs (Alerts)](https://docs.databricks.com/aws/en/workflows/)
- [Snowflake - Tasks (Error Handling)](https://docs.snowflake.com/en/user-guide/tasks-intro)

### 5.6 — Analyst edits SQL on Friday — best guardrail

**Recommended answer**: Git PR + CI tests before prod deploy

**Rationale**: Git pull requests with CI running dbt tests prevent untested SQL from reaching production. Direct prod edits and disabling tests invite breakage.

**Correct feedback** (written to Google Form): Correct — change control via PR and automated tests.

**Incorrect feedback** (written to Google Form): Friday analyst SQL edits need review and tests before production.

**Distractor rationale**:

- **Edit prod directly for speed**: Direct edits bypass review and break audit trails.
- **Email SQL to Marcus only**: Email approval is not automated testing or version control.
- **Turn off all tests**: Disabling tests increases silent KPI breakage risk.

**Reference docs**:

- [dbt - CI/CD Jobs](https://docs.getdbt.com/docs/deploy/ci-jobs)
- [dbt - Git Workflow Best Practices](https://docs.getdbt.com/docs/best-practices#git-workflow)

### 5.7 — Readiness to write a go-live checklist for Marcus (1–5)

**Auto-scored**: No (self-assessment / open / matrix)

**Rationale**: Subjective 1–5 self-check on readiness to write Marcus's go-live checklist. Higher scores mean trainees can draft scheduling, monitoring, and change-control items; not graded right/wrong.

---

# Module 6 — AI Features

### 6.1 — What is Marcus mainly trying to speed up?

**Recommended answer**: Writing and exploring SQL / data

**Rationale**: Marcus wants faster SQL writing and data exploration for analysts — not replacing the medallion pipeline, Gold tables, or hiring drivers.

**Correct feedback** (written to Google Form): Marcus wants analyst speed on SQL and data exploration.

**Incorrect feedback** (written to Google Form): Module 6 AI assists analysts — what task is Marcus trying to accelerate?

**Distractor rationale**:

- **Replacing the medallion pipeline**: AI augments engineers; it does not replace Bronze/Silver/Gold architecture.
- **Removing the need for Gold tables**: Governed Gold KPIs remain required for trusted BI.
- **Hiring more drivers**: Driver hiring is unrelated to Cortex/LLM data assistance.

**Reference docs**:

- [Snowflake - Cortex AI Functions](https://docs.snowflake.com/en/user-guide/snowflake-cortex/llm-functions)
- [Databricks - AI/ML Overview](https://docs.databricks.com/aws/en/large-language-models/index.html)

### 6.2 — Where could AI help? (select all that apply)

**Recommended answer**: `Draft or explain SQL` + `Document models` + `Suggest anomalies for review`

**Rationale**: AI can draft/explain SQL, document models, and flag anomalies for human review. Auto-approving revenue without review is unsafe — humans must validate before publish.

**Correct feedback** (written to Google Form): Good — AI assists drafting, docs, and anomaly hints with human review.

**Incorrect feedback** (written to Google Form): AI helps analysts work faster — but should not auto-approve revenue.

**Distractor rationale**:

- **Auto-approve revenue without review**: Revenue KPIs require human validation — AI must not auto-publish financial metrics.

**Reference docs**:

- [Snowflake - Cortex AI Functions](https://docs.snowflake.com/en/user-guide/snowflake-cortex/llm-functions)
- [dbt - Documentation](https://docs.getdbt.com/docs/collaborate/documentation)

### 6.3 — Where should AI NOT be trusted without human review? (select all that apply)

**Recommended answer**: `Published revenue KPIs` + `PII / sensitive data` + `Production deployments`

**Rationale**: Published revenue KPIs, PII/sensitive data, and production deployments require human review. Spell-checking variable names alone is low risk and does not need the same guardrails.

**Correct feedback** (written to Google Form): Correct — high-stakes areas need human review.

**Incorrect feedback** (written to Google Form): Where must humans stay in the loop — not just spell-check?

**Distractor rationale**:

- **Spell-checking variable names only**: Cosmetic name suggestions are low stakes — unlike revenue, PII, or prod deploys.

**Reference docs**:

- [Snowflake - Cortex AI Functions](https://docs.snowflake.com/en/user-guide/snowflake-cortex/llm-functions)
- [Databricks - AI/ML Overview](https://docs.databricks.com/aws/en/large-language-models/index.html)

### 6.4 — If AI writes SQL against Silver, what must still exist?

**Recommended answer**: Governed medallion layers, tests, and metric ownership

**Rationale**: AI-generated SQL must still query governed Silver/Gold with tests and clear metric ownership. AI does not replace data engineering fundamentals.

**Correct feedback** (written to Google Form): AI SQL still runs against governed layers with tests.

**Incorrect feedback** (written to Google Form): Even AI-generated SQL needs medallion governance — what must exist?

**Distractor rationale**:

- **Nothing — AI replaces data engineering**: AI assists but does not remove the need for layers, tests, and ownership.
- **Only Bronze raw files**: Bronze alone lacks the cleaned joins and KPI definitions that AI-generated SQL should use.
- **Only Marcus's intuition**: Intuition is not a substitute for tested, documented models.

**Reference docs**:

- [Databricks - Medallion Architecture](https://www.databricks.com/glossary/medallion-architecture)
- [dbt - Data Tests](https://docs.getdbt.com/docs/build/tests)

### 6.5 — Who is primarily accountable for wrong revenue from AI-generated SQL?

**Recommended answer**: Shared — data engineering owns models; business owns definitions

**Rationale**: Data engineering owns pipeline and model quality; business owns metric definitions and acceptance. Blaming only the vendor or only Marcus, or claiming AI is always correct, avoids accountability.

**Correct feedback** (written to Google Form): Accountability is shared between engineering and business.

**Incorrect feedback** (written to Google Form): Wrong revenue from AI SQL — who owns models vs metric definitions?

**Distractor rationale**:

- **Only the cloud vendor**: Vendors provide tools; YellowLine owns what gets published.
- **Only Marcus**: Marcus sets priorities but engineering shares responsibility for model quality.
- **Nobody — AI is always correct**: AI outputs require validation — they are not infallible.

**Reference docs**:

- [Snowflake - Cortex AI Functions](https://docs.snowflake.com/en/user-guide/snowflake-cortex/llm-functions)

### 6.6 — Could AI replace Priya's Power BI dashboard?

**Recommended answer**: No — governed dashboard and KPIs still required

**Rationale**: AI can assist analysis but cannot replace governed Power BI dashboards with agreed KPIs, access control, and refresh patterns Priya relies on.

**Correct feedback** (written to Google Form): Priya still needs governed Power BI and KPIs.

**Incorrect feedback** (written to Google Form): Can AI replace Priya's dashboard entirely?

**Distractor rationale**:

- **Yes — fully replace all BI**: BI governance, semantic models, and stakeholder views still require Power BI (or equivalent).
- **Yes — but only for Marcus personally**: Personal AI chats do not replace team dashboards.
- **Only on weekends**: Dashboard need is continuous — not a weekend-only concern.

**Reference docs**:

- [Power BI - Enterprise BI Guidance](https://learn.microsoft.com/en-us/power-bi/guidance/powerbi-implementation-guidance-usage-scenario-enterprise-bi)

### 6.7 — How clear is LLM (Mod 6) vs ML (Mod 9)? (1–5)

**Auto-scored**: No (self-assessment / open / matrix)

**Rationale**: Subjective 1–5 clarity rating for LLM features (Module 6) versus predictive ML (Module 9). Low scores suggest replaying the Cortex vs ML comparison slide in debrief.

---

# Module 7 — Power BI Payoff & Tool Discussion

### 7.1 — My recommended stack for YellowLine NYC

**Recommended answer**: Snowflake + dbt

**Rationale**: Snowflake + dbt is the strongest workshop recommendation for YellowLine's SQL team and Q3 audit lineage. Combination stacks (e.g. Databricks + Snowflake + dbt) can be defended but Snowflake + dbt is the narrative anchor.

**Correct feedback** (written to Google Form): Strong story fit — SQL team plus lineage/tests on Snowflake.

**Incorrect feedback** (written to Google Form): For a SQL-heavy team needing lineage after MHP leaves, which stack fits best?

**Distractor rationale**:

- **Databricks only**: Valid for Spark-heavy teams, but the story emphasizes SQL maintainability after MHP leaves.
- **Snowflake only**: Snowflake alone lacks the transform/test/lineage layer dbt provides.
- **Databricks + dbt**: dbt on Databricks is possible but not the primary YellowLine recommendation in Module 7.
- **Databricks + Snowflake + dbt (combination)**: Defensible for larger orgs, but adds complexity — debrief trade-offs vs Snowflake + dbt.

**Reference docs**:

- [Databricks - Lakehouse Platform](https://www.databricks.com/product/data-lakehouse)
- [Snowflake - Key Concepts](https://docs.snowflake.com/en/user-guide/intro-key-concepts)
- [dbt - Introduction](https://docs.getdbt.com/docs/introduction)

### 7.2 — Strongest reason for my recommendation (pick one)

**Recommended answer**: SQL team can maintain it after MHP leaves

**Rationale**: The strongest reasons mirror 7.1: SQL maintainability, lineage/tests for board audit, or fastest path to Priya's KPIs. SQL maintainability is the primary story anchor for YellowLine.

**Correct feedback** (written to Google Form): Maintainability aligns with the Snowflake + dbt recommendation.

**Incorrect feedback** (written to Google Form): Pick the reason that best matches your 7.1 stack choice.

**Distractor rationale**:

- **Lineage/tests for board audit (Q3)**: Also valid if 7.1 emphasized audit — ensure consistency with your stack choice in debrief.
- **Fastest path to Priya's KPIs**: Valid if time-to-value drove 7.1 — discuss whether speed trades off governance.
- **Lowest licensing cost at our scale**: Cost matters but the workshop narrative prioritizes skills and lineage over license alone.

**Reference docs**:

- [Snowflake - Data Engineering Guide](https://docs.snowflake.com/en/guides-overview-data-engineering)
- [dbt - Introduction](https://docs.getdbt.com/docs/introduction)

### 7.3 — Tool I would NOT choose as the primary platform

**Recommended answer**: dbt alone (not a warehouse)

**Rationale**: dbt alone is not a warehouse — it requires Snowflake, Databricks, or similar underneath. The other options are debatable 'would not choose primary' answers depending on team context.

**Correct feedback** (written to Google Form): dbt cannot be the primary platform — it needs a warehouse underneath.

**Incorrect feedback** (written to Google Form): Which option is not a primary data platform on its own?

**Distractor rationale**:

- **Databricks alone for a SQL-only team**: Often a poor primary fit for SQL-only teams — but debatable; discuss in Mod 7.
- **Snowflake alone with no transform layer**: Risky without dbt/tests — but Snowflake is still a warehouse platform.
- **Power BI as primary platform**: Power BI is BI, not a pipeline platform — also a strong 'would not choose' answer in discussion.

**Reference docs**:

- [dbt - Introduction (transform layer, not warehouse)](https://docs.getdbt.com/docs/introduction)

### 7.4-databricks — Rate Databricks (1 = poor fit, 5 = excellent)

**Auto-scored**: No (self-assessment / open / matrix)

**Rationale**: Subjective grid rating Databricks on skillset fit, time-to-KPI, governance/lineage, and cost. Example debrief pattern for SQL-heavy YellowLine: moderate skill fit, good time-to-KPI, moderate governance, moderate cost — compare rows across tools, no single correct grid.

### 7.4-snowflake — Rate Snowflake (1 = poor fit, 5 = excellent)

**Auto-scored**: No (self-assessment / open / matrix)

**Rationale**: Subjective grid rating Snowflake on four criteria. Example debrief: high skill fit for SQL team, strong time-to-KPI, good governance with dbt, watch cost at scale — use for comparison, not auto-grading.

### 7.4-dbt — Rate dbt (1 = poor fit, 5 = excellent)

**Auto-scored**: No (self-assessment / open / matrix)

**Rationale**: Subjective grid rating dbt on four criteria. Example debrief: strong on governance/lineage and tests, moderate skill fit (SQL + Git), not a warehouse — compare against 7.4-snowflake and 7.4-databricks rows.

### 7.8 — Looking back at Module 0, I would change (select all that apply)

**Recommended answer**: `Add dbt / lineage layer on Snowflake` + `Name Gold KPI tables earlier` + `Separate ingest, transform, and Power BI`

**Rationale**: Looking back, trainees typically add dbt/lineage, clarify Gold naming, or separate pipeline from BI layers. Claiming the day-one sketch was perfect misses the workshop learning arc.

**Correct feedback** (written to Google Form): Good reflection — all three show learning since Module 0.

**Incorrect feedback** (written to Google Form): Most trainees refine their day-one sketch; avoid the 'sketch was perfect' option if you learned something.

**Distractor rationale**:

- **Nothing — my day-one sketch was perfect**: Rarely true after Modules 0–6 — reflection expects at least one improvement.

**Reference docs**:

- [Databricks - Medallion Architecture](https://www.databricks.com/glossary/medallion-architecture)

### 7.9 — If Marcus needed streaming (Mod 8) or ML (Mod 9)

**Recommended answer**: Would add streaming and/or ML paths but keep batch Gold core

**Rationale**: Streaming and ML add specialized paths (dispatch, tip prediction) while batch Gold remains Priya's core dashboard foundation. Dropping Snowflake or replacing Priya with AI breaks the story.

**Correct feedback** (written to Google Form): Optional mods extend the stack without replacing batch Gold.

**Incorrect feedback** (written to Google Form): Streaming (Mod 8) and ML (Mod 9) are add-ons — what happens to batch Gold?

**Distractor rationale**:

- **Would drop Snowflake entirely**: Optional modules complement batch — they do not require abandoning the core warehouse.
- **Would not change anything**: Modules 8–9 introduce new capabilities worth acknowledging in architecture.
- **Would replace Priya with AI only**: AI assists analysts; it does not replace governed Power BI dashboards.

**Reference docs**:

- [Databricks - Structured Streaming](https://docs.databricks.com/aws/en/structured-streaming/)
- [Databricks - MLflow](https://docs.databricks.com/aws/en/mlflow/)

### 7.10 — I could defend my tool choice to a client (1–5)

**Auto-scored**: No (self-assessment / open / matrix)

**Rationale**: Subjective 1–5 confidence rating for defending the tool choice to a client. Scores 4–5 mean trainees can tie stack to Marcus/Priya/Q3 audit; 1–2 suggest pairing for open discussion — not auto-graded.

---

# Module 8 — Streaming (Optional)

### 8.1 — What batch cannot give Marcus for live dispatch

**Recommended answer**: Sub-hour / live demand visibility

**Rationale**: Live dispatch needs near-real-time demand visibility — sub-hour or minute grain. Annual reports and last year's KPIs are batch strengths, not live dispatch.

**Correct feedback** (written to Google Form): Live dispatch needs fresher than batch — sub-hour visibility.

**Incorrect feedback** (written to Google Form): What can batch not give Marcus for live dispatch decisions?

**Distractor rationale**:

- **Annual tax reports**: Tax reporting is a batch/periodic use case batch handles well.
- **Historical KPIs for last year**: Historical analysis is a batch Gold strength.
- **Power BI license keys**: Licensing is unrelated to batch vs streaming latency.

**Reference docs**:

- [Databricks - Structured Streaming](https://docs.databricks.com/aws/en/structured-streaming/)

### 8.2 — When is batch still the right answer?

**Recommended answer**: Daily finance KPIs and audit snapshots

**Rationale**: Daily finance KPIs, reconciliation, and audit snapshots are ideal batch workloads. Live GPS every second and 'batch is obsolete' overstate streaming; small Excel files are a scale distractor.

**Correct feedback** (written to Google Form): Batch still wins for daily finance and audit snapshots.

**Incorrect feedback** (written to Google Form): When is batch still the right answer despite streaming hype?

**Distractor rationale**:

- **Live GPS every second only**: High-frequency GPS is a streaming use case — not where batch still wins.
- **Never — batch is obsolete**: Batch remains essential for finance, history, and governed Gold KPIs.
- **Only for Excel under 100 rows**: Batch at scale serves millions of rows — not just tiny spreadsheets.

**Reference docs**:

- [Databricks - Medallion Architecture](https://www.databricks.com/glossary/medallion-architecture)

### 8.3 — Best description of a Kafka topic

**Recommended answer**: Named channel of events producers write to

**Rationale**: A Kafka topic is a durable, named log/channel where producers publish events and consumers subscribe. It is not a BI page, warehouse size, or dbt test.
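
A toy model can make the "named log" idea concrete in the debrief. The sketch below is an in-memory stand-in, not the Kafka client API: real topics are partitioned, replicated, and subject to retention. The class name `ToyTopic` and the event fields are hypothetical.

```python
class ToyTopic:
    """Toy stand-in for a topic: a named, append-only list of events."""

    def __init__(self, name):
        self.name = name
        self._log = []

    def publish(self, event):
        """Append an event and return its offset (its position in the log)."""
        self._log.append(event)
        return len(self._log) - 1

    def read_from(self, offset):
        """Return all events at or after the given offset, without removing them."""
        return self._log[offset:]


topic = ToyTopic("trip-events")
topic.publish({"trip_id": 1, "status": "requested"})
topic.publish({"trip_id": 1, "status": "picked_up"})

# Two independent consumers track their own position in the same log.
dashboard_events = topic.read_from(0)
archiver_events = topic.read_from(1)
print(len(dashboard_events), len(archiver_events))
```

Reading does not consume the events, so each consumer can start from its own offset. This is the main difference from a simple queue.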

**Correct feedback** (written to Google Form): A Kafka topic is a named event channel.

**Incorrect feedback** (written to Google Form): What is a Kafka topic in streaming architecture?

**Distractor rationale**:

- **A Power BI page**: Power BI pages visualize data — they are not Kafka messaging channels.
- **A Snowflake warehouse size**: Warehouse sizing is compute config, not a Kafka topic.
- **A dbt test**: dbt tests validate warehouse data — unrelated to Kafka topics.

**Reference docs**:

- [Apache Kafka - Introduction](https://kafka.apache.org/intro)

### 8.4 — Databricks vs relay to ADLS2 for Kafka

**Recommended answer**: Direct streaming for latency; relay when archive/multi-engine needed

**Rationale**: Direct Kafka-to-Databricks streaming minimizes latency for live use cases. Relaying through ADLS2 keeps an archive that can be replayed and read by multiple engines, at the cost of higher latency.

**Correct feedback** (written to Google Form): Trade-off — direct for latency, relay for archive/multi-consumer.

**Incorrect feedback** (written to Google Form): When would you stream directly into Databricks, and when would you relay through ADLS2?

**Distractor rationale**:

- **Never use Kafka**: Kafka is the workshop streaming source; the question is which integration pattern to use.
- **Only email events**: Email is not a scalable event streaming backbone.
- **Only batch forever**: Module 8 covers when streaming adds value alongside batch.

**Reference docs**:

- [Databricks - Structured Streaming](https://docs.databricks.com/aws/en/structured-streaming/)
- [Apache Kafka - Introduction](https://kafka.apache.org/intro)

### 8.5 — Why use watermarks in stream processing?

**Recommended answer**: Handle late events in windowed aggregations

**Rationale**: Watermarks tell the engine how late events may arrive so windowed aggregations can complete without waiting forever. They do not affect log formatting, encryption, or replace Snowflake.
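
The mechanism can be sketched in plain Python for the debrief. The example below is a simplified model under stated assumptions: 60-second tumbling windows, a 30-second lateness allowance, and a watermark recomputed before every event. Real engines differ in when they advance the watermark, so treat this as an illustration of the idea rather than of Spark's exact behaviour.

```python
from collections import defaultdict

WINDOW = 60            # tumbling window length in seconds (assumed)
ALLOWED_LATENESS = 30  # how far behind the newest event time data is still accepted (assumed)


def windowed_counts(event_times):
    """Count events per window in arrival order, dropping events that arrive too late."""
    counts = defaultdict(int)
    dropped = []
    max_seen = 0
    for t in event_times:
        # The watermark trails the newest event time seen so far.
        watermark = max_seen - ALLOWED_LATENESS
        window_start = (t // WINDOW) * WINDOW
        if window_start + WINDOW <= watermark:
            dropped.append(t)  # this event's window is already considered complete
            continue
        counts[window_start] += 1
        max_seen = max(max_seen, t)
    return dict(counts), dropped


# Event times in arrival order; 40 and 55 arrive out of order.
counts, dropped = windowed_counts([5, 50, 70, 40, 130, 55])
print(counts)
print(dropped)
```

The event at time 40 is late but still inside the allowance, so it is counted in the first window. The event at time 55 arrives after the watermark has passed the end of that window, so it is dropped and the window's result stays final.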

**Correct feedback** (written to Google Form): Watermarks bound lateness for windowed aggregates.

**Incorrect feedback** (written to Google Form): Why are watermarks used in stream processing?

**Distractor rationale**:

- **Print prettier logs**: Watermarks control event-time processing semantics, not log aesthetics.
- **Replace Snowflake entirely**: Watermarks are a streaming concept — unrelated to replacing a warehouse.
- **Encrypt Parquet files**: Encryption is a security control, not the purpose of a watermark.

**Reference docs**:

- [Databricks - Structured Streaming (Watermarks)](https://docs.databricks.com/aws/en/structured-streaming/)
- [Spark - Watermarking](https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#handling-late-data-and-watermarking)

### 8.6 — Priya wants refresh every minute — Power BI mode

**Recommended answer**: DirectQuery

**Rationale**: DirectQuery queries the warehouse or streaming sink live instead of serving a cached import, so the dashboard can show data that is about a minute old. Import with daily refresh cannot serve sub-hour dispatch dashboards.

**Correct feedback** (written to Google Form): DirectQuery supports near-minute refresh from live data.

**Incorrect feedback** (written to Google Form): Priya wants ~minute refresh — Import daily is too slow.

**Distractor rationale**:

- **Import with daily refresh only**: Daily import cannot refresh every minute for live dispatch.
- **Turn off dashboard**: Disabling the dashboard avoids the requirement instead of meeting it.
- **Export PDF only**: PDF export is static — not a live refresh mode.

**Reference docs**:

- [Power BI - DirectQuery](https://learn.microsoft.com/en-us/power-bi/connect-data/desktop-directquery-about)

---

# Module 9 — Machine Learning (Optional)

### 9.1 — Main business use of tip prediction

**Recommended answer**: Driver incentives / ops tuning; Priya compares predicted vs actual

**Rationale**: Tip prediction helps driver incentives and operations tuning; Priya compares predicted vs actual tips in Power BI. It does not replace all KPIs or remove Silver.

**Correct feedback** (written to Google Form): Tip prediction supports ops tuning — Priya validates predictions.

**Incorrect feedback** (written to Google Form): What's the business use of tip prediction in the story?

**Distractor rationale**:

- **Replace all KPIs**: ML augments analysis — batch Gold KPIs remain core.
- **Predict stock market**: Unrelated to NYC taxi tip modeling in the workshop.
- **Remove Silver layer**: Feature engineering still relies on cleaned Silver/Gold inputs.

**Reference docs**:

- [Databricks - MLflow](https://docs.databricks.com/aws/en/mlflow/)

### 9.2 — Who builds the feature table vs trains the model?

**Recommended answer**: Data engineer / analytics engineer features; data scientist trains

**Rationale**: Analytics/data engineers build versioned feature tables in the warehouse; data scientists train and evaluate models. Training by Marcus in Excel, or by Priya alone, skips feature engineering discipline.

**Correct feedback** (written to Google Form): Correct role split — engineers build features, scientists train.

**Incorrect feedback** (written to Google Form): Who owns feature tables vs model training?

**Distractor rationale**:

- **Marcus trains in Excel**: Excel is not the production ML path for lake-scale features.
- **Priya trains; nobody builds features**: Features must be engineered and tested before training.
- **AI builds everything with no review**: ML requires human review of features, leakage, and model metrics.

**Reference docs**:

- [Databricks - MLflow](https://docs.databricks.com/aws/en/mlflow/)
- [Snowflake - ML Overview](https://docs.snowflake.com/en/developer-guide/snowflake-ml/overview)

### 9.3 — Why must total_amount NOT be a feature for tip_amount?

**Recommended answer**: Leakage — it includes or reveals the tip

**Rationale**: total_amount typically includes the fare plus the tip, so it reveals the target variable: a model can recover tip_amount almost by subtraction. Column name length and vendor restrictions are distractors.
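
A short worked example can make the leakage visible. The rows below are synthetic and assume the simplest case, where `total_amount` is exactly `fare_amount + tip_amount`; real trip records also carry tolls and surcharges, which weaken the identity but not the point.

```python
# Synthetic card trips. Simplifying assumption: total_amount = fare_amount + tip_amount.
trips = [
    {"fare_amount": 10.0, "tip_amount": 2.0},
    {"fare_amount": 25.0, "tip_amount": 0.0},
    {"fare_amount": 8.5, "tip_amount": 3.5},
]
for trip in trips:
    trip["total_amount"] = trip["fare_amount"] + trip["tip_amount"]


def leaky_predict(trip):
    """A 'model' that cheats: it subtracts the fare from the total to read off the tip."""
    return trip["total_amount"] - trip["fare_amount"]


errors = [abs(leaky_predict(t) - t["tip_amount"]) for t in trips]
print(errors)
```

A model with zero error on the training data looks excellent, but `total_amount` is only known after the passenger has paid, so the same model has nothing to work with at prediction time.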

**Correct feedback** (written to Google Form): total_amount leaks the target — classic leakage.

**Incorrect feedback** (written to Google Form): Why is total_amount unsafe as a feature for tip_amount?

**Distractor rationale**:

- **Column name too long**: Name length does not cause ML leakage.
- **Snowflake does not allow it**: Snowflake allows the column — the issue is statistical leakage.
- **Power BI forbids it**: Power BI is unrelated to feature selection leakage rules.

**Reference docs**:

- [Databricks - MLflow (Feature Engineering)](https://docs.databricks.com/aws/en/mlflow/)

### 9.4 — Why train on credit card trips only?

**Recommended answer**: Cash tips are not recorded reliably

**Rationale**: Cash tips frequently go unreported in NYC taxi data, making tip_amount unreliable for cash trips. Credit card trips record tips consistently for supervised learning.
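
The effect of the filter can be shown with a handful of made-up rows. The `payment_type` strings below are illustrative labels, not the dataset's actual codes, and the tip values are invented to show how unrecorded cash tips pull the label toward zero.

```python
# Hypothetical rows. A recorded tip of 0.0 on a cash trip may mean "not recorded", not "no tip".
trips = [
    {"trip_id": 1, "payment_type": "credit_card", "tip_amount": 3.0},
    {"trip_id": 2, "payment_type": "cash", "tip_amount": 0.0},
    {"trip_id": 3, "payment_type": "credit_card", "tip_amount": 0.0},  # a genuine zero tip
    {"trip_id": 4, "payment_type": "cash", "tip_amount": 0.0},
]

# Keep only trips whose tip label can be trusted.
training_rows = [t for t in trips if t["payment_type"] == "credit_card"]


def mean_tip(rows):
    """Average recorded tip across the given rows."""
    return sum(r["tip_amount"] for r in rows) / len(rows)


print(mean_tip(trips))
print(mean_tip(training_rows))
```

Training on all four rows would teach the model that the average tip is 0.75, while the card-only rows give 1.5. The difference comes from missing labels, not from how passengers actually tip.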

**Correct feedback** (written to Google Form): Cash tips are often unrecorded — credit card trips have labeled tips.

**Incorrect feedback** (written to Google Form): Why filter to credit card trips for training?

**Distractor rationale**:

- **Credit cards are always fraudulent**: Fraud is not the training filter rationale in the lab.
- **Marcus prefers cash only**: Business preference is unrelated to label reliability.
- **Bronze has no trips**: Bronze contains trips — the issue is tip label quality by payment type.

**Reference docs**:

- [Databricks - MLflow](https://docs.databricks.com/aws/en/mlflow/)

### 9.5 — Primary ML role — match tool (select all that apply)

**Recommended answer**: `Databricks sklearn + MLflow — flexible training` + `Snowflake Cortex ML — SQL-native ML functions` + `Snowpark ML — Python in Snowflake` + `dbt — feature tables and tests, not training`

**Rationale**: Databricks sklearn + MLflow handles flexible Python training; Cortex ML offers SQL-native functions; Snowpark ML runs Python in Snowflake; dbt builds tested feature tables but does not train models.

**Correct feedback** (written to Google Form): All four — each tool has a distinct ML role in the lab.

**Incorrect feedback** (written to Google Form): Match each platform to its primary ML role — all four apply.

**Reference docs**:

- [Databricks - MLflow](https://docs.databricks.com/aws/en/mlflow/)
- [Snowflake - ML Overview](https://docs.snowflake.com/en/developer-guide/snowflake-ml/overview)
- [dbt - Data Tests](https://docs.getdbt.com/docs/build/tests)

### 9.6 — Lowest effort to first prediction (workshop comparison)

**Recommended answer**: Snowflake Cortex ML

**Rationale**: The workshop comparison often shows Snowflake Cortex ML as the fastest path for SQL-native teams to reach a first prediction. Databricks and Snowpark offer more flexibility at higher setup cost; dbt alone does not train.

**Correct feedback** (written to Google Form): In the workshop comparison, Cortex ML is often lowest effort for SQL teams.

**Incorrect feedback** (written to Google Form): Which path got to a first prediction with least setup for a SQL team?

**Distractor rationale**:

- **Databricks**: Flexible but typically more setup than SQL-native Cortex for first prediction.
- **Snowpark ML**: Powerful Python-in-Snowflake but more setup than Cortex SQL functions for a quick first model.
- **dbt-only (features only, no model)**: dbt prepares features — it does not produce model predictions by itself.

**Reference docs**:

- [Snowflake - ML Overview](https://docs.snowflake.com/en/developer-guide/snowflake-ml/overview)
- [Databricks - MLflow](https://docs.databricks.com/aws/en/mlflow/)

### 9.7 — Should features always be defined in dbt before training?

**Recommended answer**: Yes for production — versioned, tested features for all trainers

**Rationale**: Production ML benefits from dbt-managed feature tables with tests and Git versioning so all model trainers consume consistent definitions. Skipping dbt for ML loses that shared contract and its quality gates.

**Correct feedback** (written to Google Form): Production ML needs versioned, tested features — often in dbt.

**Incorrect feedback** (written to Google Form): Should features be defined in dbt before training in production?

**Distractor rationale**:

- **No — never use dbt for ML**: dbt is widely used for feature tables and data quality before ML training.
- **Only for batch pipelines, never ML**: Feature stores and ML pipelines commonly use dbt-modeled tables.
- **Only if Marcus asks**: Feature versioning is a production best practice, not an optional whim.

**Reference docs**:

- [dbt - Introduction](https://docs.getdbt.com/docs/introduction)
- [dbt - Data Tests](https://docs.getdbt.com/docs/build/tests)

---

# Final workshop survey (not graded)

Form `module: final` — satisfaction only. See [google-forms-reflection-design.md](google-forms-reflection-design.md#final-workshop-survey-module-final).

---

## Document history

| Date | Change |
|------|--------|
| 2026-05-31 | Auto-generated from google-forms-grading.yaml (quiz + full distractors) |
| 2026-06-04 | Added reference URLs (refs) per question; audit against official docs |