Exercise: Machine Learning

YellowLine NYC story · full hands-on lab

Estimated time: 45–50 min (Databricks sklearn: 20 min · Snowflake Cortex ML: 10 min · Snowpark ML: 10 min · dbt features: 5 min · Discussion: 5 min)

YellowLine NYC context (Module 9)

Predict tip amounts on credit-card NYC Taxi Silver.

Module prerequisites

Modules 2–3 required · Module 4 recommended for the dbt feature-table track. This is Cortex ML (Module 9) — not Module 6 Cortex LLM assistants.

Working Environment

Use GitHub Codespaces for a ready-to-use environment — all tools pre-installed. Open your fork on GitHub → Code → Codespaces → Create codespace on master. ML notebooks run in Databricks Runtime ML; Cortex runs in Snowflake Workspaces SQL files.

GitHub blocked? (emergency only)

The normal path is fork + Codespace (Prerequisites § Step 2). Use Lab source files only if your facilitator approved it — e.g. you cannot create or use GitHub before class.

Optional Module

This exercise requires Module 2 (Databricks) and Module 3 (Snowflake) Silver enriched tables to exist with data. For Cortex ML.FORECAST, run Step A0 in ml/snowflake/01_cortex_ml.sql to create V_HOURLY_TRIPS_BY_BOROUGH.

Prerequisites

silver_nyc_taxi_enriched exists in your {attendee_id}_silver Databricks schema
SILVER_NYC_TAXI_ENRICHED exists in your {ATTENDEE_ID}_SQL_SILVER Snowflake schema (SQL track) or {ATTENDEE_ID}_SP_SILVER (Snowpark track)
Silver enriched table exists in Snowflake ({ATTENDEE_ID}_SQL_SILVER) for Cortex Step A0
Databricks: Runtime ML cluster attached (16.4 LTS ML recommended) — ML setup
Snowflake: role DE_WORKSHOP_ROLE + DE_WORKSHOP_WH Started (workshop role)
Snowflake role has SNOWFLAKE.CORTEX_USER database role (run snowflake/sql/setup/05_cortex_access.sql on your trial account if not done in Module 3/6)

.env check (dbt + Snowpark ML in Codespace)

Cortex SQL (base exercise) and Databricks sklearn do not use .env.

For the dbt feature model or Stretch B Snowpark ML in the Codespace terminal, reuse .env from Module 4:

bash .devcontainer/setup-environment.sh

All variables ✅? Continue. Any ❌? Complete Exercise: dbt § Configure .env first.

Base Exercise — Snowflake Cortex ML (SQL, ~20 min)

Tip

Start here — no Python, no setup. Cortex ML runs in a Workspaces SQL file (Exercise: Snowflake § navigation).

Step 1: Run ML.FORECAST on trip demand

Projects → Workspaces → + → SQL File — see Exercise: Snowflake § run SQL
Open ml/snowflake/01_cortex_ml.sql
Run all as DE_WORKSHOP_ROLE — attendee_id auto-loads from _workshop_config (no placeholder editing). Stops with ✖ ERROR if Modules 2–3 setup was skipped.
Select/highlight Parts A0–A2 in the script, then Run (Ctrl+Enter) — or place the cursor in each part and run one block at a time. Do not skip Step A0 (creates V_HOURLY_TRIPS_BY_BOROUGH in {ATTENDEE_ID}_SQL_GOLD)

The forecast reads from V_HOURLY_TRIPS_BY_BOROUGH, not kpi_trips_by_hour.
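To make concrete why the forecast needs an hourly view rather than raw trips, here is a minimal Python sketch of the kind of aggregation such a view performs: one row per borough per hour with a trip count. The real view is defined in Step A0 of 01_cortex_ml.sql and may differ; the function name, the tuple layout, and the sample trips below are illustrative assumptions.

```python
from collections import Counter
from datetime import datetime


def hourly_trips_by_borough(trips):
    """Count trips per (borough, hour) from (borough, pickup_datetime) pairs."""
    counts = Counter()
    for borough, pickup in trips:
        # Truncate the pickup timestamp to the start of its hour.
        hour = pickup.replace(minute=0, second=0, microsecond=0)
        counts[(borough, hour)] += 1
    # Order by series, then by time, so each series reads as a time series.
    return sorted((b, h, n) for (b, h), n in counts.items())


# Made-up trips for illustration only.
sample_trips = [
    ("Manhattan", datetime(2024, 1, 1, 8, 5)),
    ("Manhattan", datetime(2024, 1, 1, 8, 40)),
    ("Manhattan", datetime(2024, 1, 1, 9, 15)),
    ("Queens", datetime(2024, 1, 1, 8, 30)),
]

for row in hourly_trips_by_borough(sample_trips):
    print(row)
```

Each output row has a series key, a regular timestamp, and one numeric value, which is the shape a time-series forecaster works with.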

Note

Why is Cortex ML just one SELECT? Cortex ML.FORECAST is a managed service — Snowflake handles model selection, training, hyperparameter tuning, and inference behind a single SQL function. You never see the model artifact. The tradeoff: you can only run FORECAST and ANOMALY_DETECTION — no custom algorithms. Contrast this with Databricks where you control every step of the training pipeline.

Verify: You should see rows with SERIES (borough), TS (future timestamp), FORECAST, LOWER_BOUND, UPPER_BOUND.

Questions:
- Which borough has the highest forecasted demand for the next peak hour?
- How wide are the confidence intervals (UPPER_BOUND - LOWER_BOUND)? What does a wider interval mean?
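If you export the forecast rows, the interval-width question can be answered with a few lines of Python. This is a sketch only: it assumes each row is a dict keyed by the column names listed in the Verify step, and the sample values are invented for illustration.

```python
def interval_widths(rows):
    """Return {series: mean of UPPER_BOUND - LOWER_BOUND} for forecast rows."""
    totals = {}
    for row in rows:
        width = row["UPPER_BOUND"] - row["LOWER_BOUND"]
        total, count = totals.get(row["SERIES"], (0.0, 0))
        totals[row["SERIES"]] = (total + width, count + 1)
    return {series: total / count for series, (total, count) in totals.items()}


# Invented forecast rows; real values come from the ML.FORECAST result set.
sample_rows = [
    {"SERIES": "Manhattan", "FORECAST": 900.0, "LOWER_BOUND": 800.0, "UPPER_BOUND": 1000.0},
    {"SERIES": "Manhattan", "FORECAST": 950.0, "LOWER_BOUND": 850.0, "UPPER_BOUND": 1150.0},
    {"SERIES": "Bronx", "FORECAST": 40.0, "LOWER_BOUND": 10.0, "UPPER_BOUND": 70.0},
]

print(interval_widths(sample_rows))
```

Comparing the width to the forecast itself (width divided by FORECAST) is often more telling than the raw width, because a busy borough naturally has larger absolute bounds.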

Step 2: Run ML.ANOMALY_DETECTION on average fares

Still in 01_cortex_ml.sql, select Part B, then Run — ML.ANOMALY_DETECTION (creates V_HOURLY_FARES and the anomaly query).

Questions:
- Which boroughs have anomalous fare hours?
- Look up the anomalous timestamps. Do they correspond to late night, holidays, or storms?

Base Exercise — Databricks sklearn + MLflow (~25 min)

Step 1: Run the training notebook

In your Databricks Git folder (sidebar → Workspace), open ml/databricks/01_ml_tip_prediction.py
Set ATTENDEE_ID in 00_setup.py first if this notebook errors — it inherits ATTENDEE_ID, CATALOG_NAME, and schema names from the batch setup notebooks.
Optionally adjust SAMPLE_FRACTION (default 0.10; leave as-is for speed)
Run all cells in order — training takes ~1–2 minutes

Note

Why sample 10 % instead of training on all data? Training on 10 % of Silver (~280 k rows) gives a reasonable model in ~2 minutes. Full-data training would take ~15–20 minutes with marginal accuracy improvement. In production you would use more data, but for a workshop the 10 % sample demonstrates the same concepts without the wait.

Training filters (same on Databricks, Snowpark ML, and dbt)

All three paths use identical row filters before training:

payment_type_desc = 'Credit card'
tip_amount >= 0, fare_amount > 0, trip_distance > 0
passenger_count between 1 and 8
fare_amount < 200, tip_amount < 100 (outlier guard)

Databricks and Snowpark apply these in Python; dbt encodes them in ml_features_tip_prediction.sql.
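The filters listed above can be expressed as a single row-level predicate. The sketch below uses plain Python dicts rather than the Spark, Snowpark, or SQL code in the workshop files; the function name and sample rows are illustrative, and it assumes every field is present and non-null.

```python
def keep_for_training(trip):
    """Return True if a trip row passes the shared training filters."""
    return (
        trip["payment_type_desc"] == "Credit card"
        and trip["tip_amount"] >= 0
        and trip["fare_amount"] > 0
        and trip["trip_distance"] > 0
        and 1 <= trip["passenger_count"] <= 8
        # Outlier guard: drop extreme fares and tips.
        and trip["fare_amount"] < 200
        and trip["tip_amount"] < 100
    )


# Invented rows for illustration only.
sample_trips = [
    {"payment_type_desc": "Credit card", "tip_amount": 3.0, "fare_amount": 15.0,
     "trip_distance": 2.1, "passenger_count": 1},
    {"payment_type_desc": "Cash", "tip_amount": 0.0, "fare_amount": 12.0,
     "trip_distance": 1.8, "passenger_count": 2},
    {"payment_type_desc": "Credit card", "tip_amount": 5.0, "fare_amount": 250.0,
     "trip_distance": 40.0, "passenger_count": 1},
]

training_rows = [t for t in sample_trips if keep_for_training(t)]
print(len(training_rows))
```

Here only the first row survives: the second is a cash trip and the third exceeds the fare outlier guard.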

Watch for the output:

MLflow Run ID: <run_id>
   RMSE: $1.xx
   MAE:  $0.xx
   R²:   0.xx
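For reference, the three metrics in that output follow the standard definitions. The sketch below computes them in plain Python on a tiny invented sample; the notebook itself presumably uses library functions, so treat this as an illustration of the formulas, not of the notebook's code.

```python
import math


def regression_metrics(actual, predicted):
    """Return (rmse, mae, r2) for paired lists of actual and predicted values."""
    n = len(actual)
    errors = [p - a for a, p in zip(actual, predicted)]
    ss_res = sum(e * e for e in errors)  # residual sum of squares
    mean_actual = sum(actual) / n
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)  # total sum of squares
    rmse = math.sqrt(ss_res / n)
    mae = sum(abs(e) for e in errors) / n
    r2 = 1 - ss_res / ss_tot
    return rmse, mae, r2


# Invented tips in dollars, for illustration only.
actual_tips = [2.0, 3.0, 5.0, 4.0]
predicted_tips = [2.5, 2.5, 4.0, 4.0]

rmse, mae, r2 = regression_metrics(actual_tips, predicted_tips)
print(f"RMSE: ${rmse:.2f}  MAE: ${mae:.2f}  R2: {r2:.2f}")
```

RMSE penalizes large misses more heavily than MAE, so a large gap between the two suggests a few trips with very wrong predictions.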

Step 2: Explore the MLflow Experiment

In the Databricks sidebar (left), under AI/ML, click Experiments → find tip_prediction_{ATTENDEE_ID}
Click your run — inspect:
- Parameters: n_estimators, learning_rate, max_depth
- Metrics: RMSE, MAE, R²
- Artifacts: the model pickle, input_example.json
Click Feature Importances artifact (if present) or look at the notebook output

Note

Why track experiments in MLflow? Without MLflow, you would compare models by remembering which RMSE came from which run. MLflow automatically logs every parameter, metric, and artifact — so you can compare 10 runs side-by-side and reproduce any result. Cortex ML has no equivalent: it gives you the forecast but no model introspection.

Questions:
- Which feature has the highest importance? Does this make business sense?
- Is fare_amount the top feature? Is that a problem? (Hint: it is not leakage, because the tip is not part of the fare, but the two are highly correlated.)

Step 3: Inspect the Gold predictions table

After the notebook finishes, run this SQL query in a new cell:

SELECT
    pickup_borough,
    ROUND(AVG(tip_amount), 2)            AS avg_actual_tip,
    ROUND(AVG(predicted_tip), 2)         AS avg_predicted_tip,
    ROUND(AVG(ABS(prediction_error)), 2) AS avg_abs_error,
    COUNT(*)                             AS trips
FROM {catalog}.{attendee_id}_gold.ml_tip_predictions
GROUP BY pickup_borough
ORDER BY avg_actual_tip DESC;

Questions:
- In which borough are predictions most accurate (lowest avg_abs_error)?
- In which borough does the model overestimate tips?
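The absolute error hides direction, so the overestimation question needs a signed error. The sketch below computes the mean of predicted minus actual per borough in plain Python; a positive value means the model overestimates on average. The tuple layout and sample values are assumptions for illustration, and the sign convention of the table's own prediction_error column is not assumed here.

```python
from collections import defaultdict


def mean_signed_error_by_borough(rows):
    """Return {borough: mean(predicted_tip - actual_tip)}."""
    sums = defaultdict(lambda: [0.0, 0])
    for borough, actual_tip, predicted_tip in rows:
        sums[borough][0] += predicted_tip - actual_tip
        sums[borough][1] += 1
    return {borough: total / count for borough, (total, count) in sums.items()}


# Invented (borough, actual_tip, predicted_tip) rows for illustration only.
sample_predictions = [
    ("Manhattan", 3.0, 3.5),
    ("Manhattan", 4.0, 4.5),
    ("Queens", 5.0, 4.0),
    ("Queens", 3.0, 3.0),
]

bias = mean_signed_error_by_borough(sample_predictions)
print(bias)
```

In the SQL result, the same signal is visible by comparing avg_predicted_tip with avg_actual_tip for each borough.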

Base Exercise — dbt Feature Table (~10 min)

Prerequisite: Configure .env from Exercise: dbt.

Step 1: Run the feature model

cd dbt_project
dbt run  --target snowflake --select ml_features_tip_prediction
dbt test --target snowflake --select ml_features_tip_prediction

(ATTENDEE_ID is loaded from .env in Codespaces. Local only: export ATTENDEE_ID=de_XX_yourname.)

Verify the feature table in {ATTENDEE_ID}_DBT_GOLD (dbt +schema: dbt_gold — isolated from batch _SQL_GOLD KPIs):

SHOW TABLES LIKE 'ML_FEATURES%' IN SCHEMA DE_MASTERCLASS.DE_01_ALICE_DBT_GOLD;
SELECT COUNT(*), AVG(fare_amount), AVG(target_tip_amount)
FROM DE_MASTERCLASS.DE_01_ALICE_DBT_GOLD.ML_FEATURES_TIP_PREDICTION;

All 18 column tests (11 columns) should pass. If any fail, look at:
- dbt_utils.expression_is_true: tip_amount >= 0 (are there negative tips in the data?)
- accepted_values: payment_type_desc = [Credit card] (are non-credit-card rows leaking through?)
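Conceptually, each of these dbt tests selects the rows that violate a rule and passes when that selection is empty. The Python sketch below mimics that idea for the two checks named above; it is not how dbt runs them (dbt compiles each test to SQL), and the helper name and sample rows are hypothetical.

```python
def failing_rows(rows, predicate):
    """Return the rows that violate a check; an empty list means the check passes."""
    return [row for row in rows if not predicate(row)]


# Invented feature rows for illustration only.
feature_rows = [
    {"payment_type_desc": "Credit card", "tip_amount": 2.50},
    {"payment_type_desc": "Cash", "tip_amount": 0.00},
    {"payment_type_desc": "Credit card", "tip_amount": -1.00},
]

# Analogue of expression_is_true: tip_amount >= 0
negative_tips = failing_rows(feature_rows, lambda r: r["tip_amount"] >= 0)

# Analogue of accepted_values: payment_type_desc in ['Credit card']
bad_payment = failing_rows(
    feature_rows, lambda r: r["payment_type_desc"] in {"Credit card"}
)

print(len(negative_tips), len(bad_payment))
```

When a real test fails, inspecting the violating rows in the same way usually points to which upstream filter was skipped.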

Step 2: View the lineage graph

dbt docs generate --target snowflake
dbt docs serve

Open the lineage graph — find ml_features_tip_prediction and click it. Trace the full path from Bronze source to the feature table.

Question: If a data engineer modifies silver_nyc_taxi_enriched (e.g., changes how time_of_day is calculated), how would the ML team know? (Answer: dbt lineage + CI tests)

Stretch Goal — Retrain with a different feature set

Stretch A: Remove `fare_amount` and observe the impact

fare_amount is the strongest predictor — but you might argue it’s “too close” to the target. In a real scenario, you might not know the fare until after the trip (though for NYC taxi you do).

In your Databricks Git folder (sidebar (left) → Workspace), open ml/databricks/01_ml_tip_prediction.py — find the FEATURE_COLS list near the top of the file and remove "fare_amount"
Retrain the model
Compare RMSE before and after in MLflow Experiments
Check new feature importances — which feature rises to #1?

Expected: RMSE increases. This shows how much predictive power fare_amount adds.
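The edit itself is a one-line change to the feature list. The sketch below shows one way to derive the reduced list without hand-editing; the contents of FEATURE_COLS here are hypothetical (built from column names mentioned in this lab), and the real list in the notebook may differ.

```python
def drop_feature(feature_cols, name):
    """Return a copy of feature_cols without `name`; fail loudly if it is absent."""
    if name not in feature_cols:
        raise ValueError(f"{name!r} is not in the feature list")
    return [col for col in feature_cols if col != name]


# Hypothetical feature list; check the notebook for the actual FEATURE_COLS.
FEATURE_COLS = [
    "fare_amount",
    "trip_distance",
    "passenger_count",
    "time_of_day",
    "pickup_borough",
]

ABLATED_COLS = drop_feature(FEATURE_COLS, "fare_amount")
print(ABLATED_COLS)
```

Keeping the original list intact and training on the derived one makes it easy to switch back and compare the two runs in MLflow.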

Stretch Goal — Snowpark ML

Stretch B: Run Snowpark ML tip prediction

Prerequisite: Configure .env from Exercise: dbt. Requires SILVER_NYC_TAXI_ENRICHED in {ATTENDEE_ID}_SQL_SILVER (recommended SQL track from Module 3).

In Codespaces, all packages are pre-installed. Load credentials and run:

bash .devcontainer/setup-environment.sh
python ml/snowflake/02_snowpark_ml_tip_prediction.py

If the packages are not already installed (for example, outside Codespaces), install them, load credentials, then run:

uv pip install --system ".[ml]"
# Load .env (see Exercise: dbt § Configure .env)
python ml/snowflake/02_snowpark_ml_tip_prediction.py

uv pip install --system ".[ml]"
source .env
python ml/snowflake/02_snowpark_ml_tip_prediction.py

In Snowsight: Catalog → Database Explorer → DE_MASTERCLASS → {ATTENDEE_ID}_SQL_GOLD → Models — confirm TIP_PREDICTOR_{ATTENDEE_ID} appears (navigation mapping)

Compare with Databricks:
- Which RMSE is lower?
- Which took longer to run?
- In which approach did data leave Snowflake?

Stretch Goal — Databricks AutoML

Stretch C: Databricks AutoML

In Databricks sidebar: AI/ML → Experiments → Create AutoML Experiment
Prediction type: Regression
Dataset: {catalog}.{attendee_id}_silver.silver_nyc_taxi_enriched
Target column: tip_amount
Metric: RMSE
Click Start

After it completes (~5–10 minutes):
- Which algorithm won (XGBoost, LightGBM, or sklearn)?
- Compare the winning RMSE to your manual GBR. Which is better?
- Open the best-run notebook. What preprocessing did AutoML apply that you didn’t?

Compare Your Results

After running all three tools, fill in the table below with your own observed values. This is the core learning of the module — the same prediction problem, three very different experiences.

Observable differences from the exercises

What you observed	Cortex ML (SQL)	Databricks (sklearn)	Snowflake (Snowpark ML)
Lines of code you wrote	~5 (one SQL call)	~80 (notebook)	~60 (Python script)
Time to first result	~10 seconds	~2 minutes	~1–2 minutes
Your RMSE / accuracy	N/A — forecasting, not regression	$______	$______
Did data leave Snowflake?	No	Yes — `.toPandas()` to driver	No
Where did you see the model?	SQL result set	MLflow Experiments UI	Snowflake Model Registry
Could you see feature importances?	No	Yes — GBR built-in importance	Limited
Setup steps before running	0 (just SQL)	Runtime ML cluster	Pre-installed in Codespaces (`[ml]`)

Things you should have noticed

1. Cortex ML is dramatically simpler — the entire forecast is one SELECT statement. No training loop, no model artifact, no hyperparameters. The tradeoff: you can only run FORECAST or ANOMALY_DETECTION — no other algorithms.

2. Databricks gives you the most visibility — MLflow Experiments shows every parameter, metric, and artifact. You can compare runs side-by-side. Cortex gives you results but no model introspection.

3. The RMSE from Databricks and Snowpark ML should be similar — same algorithm (GBR), same features, same filters, and same data. Small differences come from sampling randomness and warehouse compute differences.

4. Data movement is the hidden cost in Databricks — the .toPandas() call transfers data from the cluster to the driver. For 10% of the Silver table this is fast, but at 100% or on a larger dataset it becomes a bottleneck. Snowpark ML avoids this entirely.

5. The dbt tests caught the credit card filter — if you forgot and tried to include cash trips, the accepted_values: ['Credit card'] test would catch it before training.

Discussion questions for the group

Which approach would you choose for a production weekly tip prediction job? Consider: maintenance burden, model visibility, data governance.
Cortex ML.FORECAST needs a time-series view (V_HOURLY_TRIPS_BY_BOROUGH). Why can’t it use the raw Silver trip table directly?
MLflow vs Snowflake Model Registry — both track models. What would make you choose one over the other in a real project?
If your company’s data scientists use Python notebooks daily (Databricks), but your data platform is Snowflake — which training approach do you use and why?
dbt ran in ~5 seconds; training took ~2 minutes. If the feature table changes daily, how should dbt run and model.fit() be sequenced in a production pipeline?

Expected Results

Exercise	Expected output
Cortex ML.FORECAST	24 rows per borough with forecast + confidence bounds
Cortex ML.ANOMALY_DETECTION	Handful of anomalous fare hours (usually late-night outliers)
Databricks RMSE	~$1.00–$2.00 (depends on sample size and features)
Databricks R²	~0.60–0.80 for credit card trips
dbt tests	18 passing column tests on ML feature table
Snowpark ML RMSE	Similar to Databricks (same algorithm, same data)

Clean Up

Databricks:

DROP TABLE IF EXISTS {catalog}.{attendee_id}_gold.ml_tip_predictions;

Snowflake: ML tables, views, and schemas are all removed by the central cleanup script — no separate step needed.

Workspaces SQL file: run snowflake/sql/setup/99_cleanup.sql
Workspaces notebook: open 00_setup/99_cleanup and click Run all

Return to module

Module 9 — story wrapper
