Module 1: Data Engineering Fundamentals

ETL, medallion, and the NYC Taxi dataset

Duration: 35 min — Animation (3) · Think & Discuss (8) · Theory (12) · Quiz (3) · Practice (9)

1. Animation

2. Think & Discuss

Situation: Elena proposed Bronze, Silver, Gold. Priya listed business questions but her dashboard is still empty. The team must agree layer semantics before writing code.

Prompts:

  • Why three layers instead of one big table for Marcus?
  • What belongs in Silver vs Gold? Give one example for each.
  • Priya asks: When are our peak revenue hours? — which layer do you query?

Capture 3–5 bullets on the whiteboard. Do not reveal answers yet — theory and labs validate trainee ideas.

3. Theory

Data Warehouse vs Data Lake vs Data Lakehouse

Understanding these three architectures explains why YellowLine NYC’s new platform uses a lakehouse approach.

Feature Data Warehouse Data Lake Data Lakehouse
Data types Structured only Any format (structured, semi-structured, unstructured) Any format
Primary use BI & reporting Data science & ML BI, ML, and streaming
ACID transactions Yes No Yes (via Delta Lake, Iceberg, etc.)
Schema Schema-on-write Schema-on-read Schema-on-write with evolution
Storage format Proprietary Open (Parquet, ORC, JSON, …) Open (Parquet-based)
Cost High (compute + storage coupled) Low (cheap object storage) Low (decoupled compute & storage)
BI latency Low (optimized for SQL) High (unvalidated data) Low (optimized metadata layer)
ML support Limited Full access to raw data Full — same data for BI and ML
Governance Strong Weak Strong (Unity Catalog, Snowflake RBAC)

Why lakehouse for YellowLine NYC?

A lakehouse gives YellowLine NYC the low-cost, open-format storage of a data lake with the ACID transactions and governance of a data warehouse. Both Databricks and Snowflake now implement this pattern — which is why Elena can run the same medallion layers (Bronze → Silver → Gold) on either platform.

3.1 ETL vs ELT

Cloud Platforms for Data Engineering

Before comparing pipeline patterns, let’s survey the cloud platforms where these patterns are implemented.

ETL vs ELT

The distinction between ETL and ELT shapes how you design every pipeline in this workshop. In traditional ETL, a separate transformation engine cleanses data before it reaches the warehouse — this made sense when storage was expensive and compute was fixed. Modern cloud platforms invert this: storage is cheap, compute scales elastically, so it is faster and more flexible to load raw data first and transform in-place (ELT).

ETL ELT
Transform Before loading After loading
Where Separate transform engine Inside the data warehouse
Best for Legacy systems, data cleansing Cloud DWH, big data, iteration
Examples SSIS, Informatica Databricks, Snowflake, dbt

Microsoft Fabric is also an ELT-native SaaS platform — standalone from Azure Synapse, built on Azure infrastructure — that follows the same load-first pattern.

Today’s approach: ELT — we load raw data first, then transform in-place.

flowchart LR
    subgraph ETL ["Traditional ETL"]
        E1["Source"] --> E2["Transform Engine"] --> E3["Load"] --> E4["Data Warehouse"]
    end

    subgraph ELT ["Modern ELT (Today)"]
        L1["Source"] --> L2["Load Raw"] --> L3["Data Warehouse"] --> L4["Transform In-Place"]
    end

    classDef source fill:#0057b8,color:#fff,stroke:#003d82
    classDef transform fill:#d97706,color:#fff,stroke:#b45309
    classDef warehouse fill:#01065c,color:#fff,stroke:#000940

    class E1,L1 source
    class E2,L4 transform
    class E4,L3 warehouse

Always land raw data first

Even if the source is already clean, always write an exact copy to Bronze before any transformation. This gives you reproducibility (reprocess from raw at any time) and an audit trail (prove what the source looked like at ingestion time). Skipping Bronze is the most common shortcut that creates irreproducible pipelines.

3.2 Medallion Architecture

The medallion architecture (also called multi-hop or multi-layer) organizes data into three progressively refined layers. Each layer has a clear contract: what goes in, what transformations are applied, and who consumes the output. This pattern is vendor-neutral — Databricks, Snowflake, and dbt all implement it, each using their own compute engine and syntax. Some organizations use different names — “raw / cleaned / consumption” or “landing / standardised / curated” — but the semantics are identical: raw → trusted → aggregated.

Bronze — Raw Ingestion

  • Exact copy of source data — no transformations, no filters
  • Metadata added: source file, processing timestamp, partition keys
  • Purpose: Reproducibility — you can always reprocess from raw
  • Consumer: Silver transformation jobs (never queried directly by analysts)

Silver — Cleaned & Enriched

  • Data quality filters applied (nulls, duplicates, outliers removed)
  • Column standardization (naming conventions, type casting)
  • Derived metrics (trip duration, fare per mile, time features)
  • Enrichment (zone lookups joined, descriptive labels added)
  • Purpose: Trustworthy, analysis-ready data
  • Consumer: Gold aggregations, ML feature tables, ad-hoc analyst queries

Gold — Business KPIs

  • Aggregated tables optimized for reporting — one table per KPI or metric group
  • Pre-computed for fast dashboard queries (sub-second on Power BI)
  • Purpose: Ready for business users, dashboards, and executive reports
  • Also feeds: ML feature engineering (tip prediction) and streaming Gold aggregations (optional modules)

Don’t skip Silver

A common mistake is to read directly from Bronze and build Gold KPIs without a cleaning step. Without Silver’s quality filters and standardization, Gold tables silently contain null fares, zero-distance trips, and duplicate rows. The errors compound: a “top routes” KPI built on dirty data looks plausible but is wrong. Always route through Silver — even if the transformation seems simple.

NoteMedallion is vendor-neutral

The layer semantics stay the same across Databricks, Snowflake, and dbt. Databricks also maps medallion hops to Lakeflow Spark Declarative Pipelines (LSDP) — formerly Delta Live Tables.

3.3 Key Takeaways

  • Medallion architecture is vendor-neutral — the layer semantics stay the same whether you use Databricks, Snowflake, or dbt
  • ELT loads raw data first, then transforms in-place — the dominant pattern for modern cloud data platforms
  • Bronze preserves an exact copy of source data for reproducibility and audit
  • Silver is where data quality happens — skipping it is the most common cause of silent KPI errors
  • The 12 Gold KPIs are the consistent benchmark — all three pipelines produce identical results so you can compare tools fairly

4. Quiz

Quiz: Module 1 — ETL & Medallion Quiz

Scan QR to open quiz

Before moving on, make sure you can answer:

  1. What is the key difference between ETL and ELT, and why does this workshop use ELT?
  2. Name one transformation that belongs in Silver but not in Bronze, and one that belongs in Gold but not Silver.
  3. Why do we compute the same 12 KPIs in three separate pipelines instead of just one?

5. Practice

NYC Taxi Dataset

We use the NYC Taxi & Limousine Commission trip data — one of the most popular public datasets for data engineering training.

Trip Data (Parquet)

  • ~3 million yellow taxi trips per month
  • 19 columns: pickup/dropoff times, locations, distances, fares, tips, payment types
  • Stored as Parquet on Azure ADLS2

Zone Lookup (CSV)

  • 265 taxi zones across NYC
  • 4 columns: LocationID, Borough, Zone, service_zone
  • Used to enrich trip data with human-readable location names

12 Core KPIs

We compute the same 12 KPIs in all three pipelines:

# KPI What It Shows
1 Trips by Hour Hourly demand patterns
2 Trips by Day Weekly patterns
3 Time of Day Analysis Morning/Midday/Evening/Night
4 Top Pickup Zones Busiest locations
5 Borough Analysis Cross-borough flows
6 Popular Routes Top zone-to-zone routes
7 Distance Bands Short/Medium/Long/Very Long
8 Passenger Count Solo vs group riders
9 Revenue by Hour Hourly revenue patterns
10 Payment Types Cash vs credit card behavior
11 Trip Efficiency Speed/cost by distance
12 Data Quality Pipeline health metrics

Today’s Architecture

flowchart TD
    ADLS2[("Azure ADLS2<br>Parquet + CSV")]

    ADLS2 --> DB_B
    ADLS2 --> SF_B
    ADLS2 --> DBT_B

    subgraph DB ["Databricks (PySpark)"]
        DB_B["Bronze<br>Raw ingest"] --> DB_S["Silver<br>Cleaned + enriched"] --> DB_G["Gold<br>12 KPI tables"]
    end

    subgraph SF ["Snowflake (SQL + Snowpark)"]
        SF_B["Bronze<br>Raw ingest"] --> SF_S["Silver<br>Cleaned + enriched"] --> SF_G["Gold<br>12 KPI tables"]
    end

    subgraph DBT ["dbt (SQL models)"]
        DBT_B["Staging<br>Source refs"] --> DBT_S["Silver<br>SQL models"] --> DBT_G["Gold<br>SQL models"]
    end

    DB_G --> PBI["Power BI<br>Dashboard"]
    SF_G --> PBI
    DBT_G --> PBI

    DB_S -.->|"Module 9 (Optional)"| ML["ML Feature Table<br>Tip Prediction"]
    SF_S -.->|"Module 9 (Optional)"| ML
    DBT_S -.->|"Module 9 (Optional)"| ML

    AIVEN[("Aiven Kafka<br>User Activity")] -.->|"Module 8 (Optional)"| DB_B
    AIVEN -.->|"Module 8 (Optional)"| SF_B
    AIVEN -.->|"Module 8 (Optional)"| DBT_B

    style ADLS2 fill:#0057b8,color:#fff,stroke:#003d82
    style AIVEN fill:#d97706,color:#fff,stroke:#b45309
    style DB_B  fill:#475569,color:#fff,stroke:#334155
    style DB_S  fill:#0369a1,color:#fff,stroke:#075985
    style DB_G  fill:#01065c,color:#fff,stroke:#000940
    style SF_B  fill:#475569,color:#fff,stroke:#334155
    style SF_S  fill:#0369a1,color:#fff,stroke:#075985
    style SF_G  fill:#01065c,color:#fff,stroke:#000940
    style DBT_B fill:#475569,color:#fff,stroke:#334155
    style DBT_S fill:#0369a1,color:#fff,stroke:#075985
    style DBT_G fill:#01065c,color:#fff,stroke:#000940
    style PBI   fill:#107c10,color:#fff,stroke:#0a5c0a
    style ML    fill:#6d28d9,color:#fff,stroke:#5b21b6

Why three pipelines? To compare tools hands-on. Each pipeline produces the same 12 KPIs from the same data — giving you direct experience to inform tool selection.

flowchart LR
    RAW[("Raw Data<br/>Parquet + CSV")] --> BRONZE
    BRONZE --> SILVER
    SILVER --> GOLD
    GOLD --> PBI["Power BI<br/>Dashboard"]

    subgraph BRONZE ["Bronze — Raw"]
        B1["Exact copy of source"]
        B2["+ metadata & timestamps"]
    end

    subgraph SILVER ["Silver — Cleaned & Enriched"]
        S1["Quality filters"]
        S2["Standardization"]
        S3["Derived metrics"]
        S4["Zone enrichment"]
    end

    subgraph GOLD ["Gold — Business KPIs"]
        G1["12 KPI tables"]
        G2["Pre-aggregated"]
        G3["Dashboard-ready"]
    end

    classDef bronze fill:#475569,color:#fff,stroke:#334155
    classDef silver fill:#0369a1,color:#fff,stroke:#075985
    classDef gold fill:#01065c,color:#fff,stroke:#000940
    classDef source fill:#0057b8,color:#fff,stroke:#003d82
    classDef output fill:#107c10,color:#fff,stroke:#0a5c0a

    class RAW source
    class BRONZE,B1,B2 bronze
    class SILVER,S1,S2,S3,S4 silver
    class GOLD,G1,G2,G3 gold
    class PBI output

The optional modules extend the same Silver tables into new use cases:

  • Module 8 (Streaming): Live user activity events from Aiven Kafka show how each tool handles continuous data — Databricks Structured Streaming, Snowflake Snowpipe + Dynamic Tables, dbt dynamic_table.
  • Module 9 (ML): The silver_nyc_taxi_enriched table feeds tip prediction models in Databricks (sklearn + MLflow), Snowflake (Cortex ML + Snowpark ML), and a dbt feature table.

Hands-on exploration

Explore the NYC Taxi schema and KPI mapping above.

Optional: preview paths under mhpdeworkshopsa/nyc-taxi-data — no full pipeline yet.

Reference: Data Model & KPIs

Priya / Power BI: Dashboard wireframe stays empty — waiting for Gold tables from Bob.

Next module

Module 2: Databricks Pipeline — Bob asks Elena to prototype ingest at scale on Databricks.


Official Documentation