Exercise: Machine Learning

YellowLine NYC story · full hands-on lab

title: “Exercise: Machine Learning” subtitle: “YellowLine NYC story · full hands-on lab” —

Estimated time: 45–50 min (Databricks sklearn: 20 min · Snowflake Cortex ML: 10 min · Snowpark ML: 10 min · dbt features: 5 min · Discussion: 5 min)

YellowLine NYC context (Module 9)

Predict tip amounts on credit-card NYC Taxi Silver.

NoteModule prerequisites

Modules 2–3 required · Module 4 recommended for the dbt feature-table track. This is Cortex ML (Module 9) — not Module 6 Cortex LLM assistants.

TipWorking Environment

Use GitHub Codespaces for a ready-to-use environment — all tools pre-installed. Open your fork on GitHub → Code → Codespaces → Create codespace on main. ML notebooks run in Databricks Runtime ML; Cortex runs in Snowflake worksheets.

NoteOptional Module

This exercise requires Module 2 (Databricks) and Module 3 (Snowflake) Silver enriched tables to exist with data. Check with your trainer that the Gold KPI tables exist too (GOLD_TRIPS_BY_HOUR for Cortex ML.FORECAST).

Prerequisites


Base Exercise — Snowflake Cortex ML (SQL, ~20 min)

Tip

Start here — no Python, no setup. Cortex ML runs directly in a Snowsight SQL Worksheet.

Step 1: Run ML.FORECAST on trip demand

  1. Open a new Snowsight SQL Worksheet
  2. Open ml/snowflake/01_cortex_ml.sql
  3. Replace all {ATTENDEE_ID} with your ID (uppercase)
  4. Run Part AML.FORECAST section
SELECT *
FROM TABLE(
    ML.FORECAST(
        INPUT_DATA        => SYSTEM$REFERENCE('TABLE',
                             'DE_MASTERCLASS.{ATTENDEE_ID}_GOLD.GOLD_TRIPS_BY_HOUR'),
        TIMESTAMP_COLNAME => 'PICKUP_HOUR_TS',
        TARGET_COLNAME    => 'TOTAL_TRIPS',
        SERIES_COLNAME    => 'PICKUP_BOROUGH',
        CONFIG_OBJECT     => {'prediction_interval': 24}
    )
)
ORDER BY SERIES, TS;

Verify: You should see rows with SERIES (borough), TS (future timestamp), FORECAST, LOWER_BOUND, UPPER_BOUND.

Questions: - Which borough has the highest forecasted demand for the next peak hour? - How wide are the confidence intervals (UPPER_BOUND - LOWER_BOUND)? What does a wider interval mean?


Step 2: Run ML.ANOMALY_DETECTION on average fares

Still in 01_cortex_ml.sql, run Part BML.ANOMALY_DETECTION.

First create the hourly fare view:

CREATE OR REPLACE VIEW DE_MASTERCLASS.{ATTENDEE_ID}_SILVER.V_HOURLY_FARES AS
SELECT
    DATE_TRUNC('hour', PICKUP_DATETIME)  AS fare_hour,
    PICKUP_BOROUGH,
    AVG(FARE_AMOUNT)                     AS avg_fare
FROM DE_MASTERCLASS.{ATTENDEE_ID}_SILVER.SILVER_NYC_TAXI_ENRICHED
WHERE FARE_AMOUNT > 0
GROUP BY 1, 2;

Then run anomaly detection:

SELECT
    SERIES  AS pickup_borough,
    TS      AS fare_hour,
    Y       AS avg_fare,
    ANOMALY_SCORE,
    CASE WHEN ANOMALY THEN '⚠️ ANOMALY' ELSE '✅ Normal' END AS status
FROM TABLE(
    ML.ANOMALY_DETECTION(
        INPUT_DATA        => SYSTEM$REFERENCE('VIEW',
                             'DE_MASTERCLASS.{ATTENDEE_ID}_SILVER.V_HOURLY_FARES'),
        TIMESTAMP_COLNAME => 'FARE_HOUR',
        TARGET_COLNAME    => 'AVG_FARE',
        SERIES_COLNAME    => 'PICKUP_BOROUGH',
        LABEL_COLNAME     => NULL
    )
)
WHERE ANOMALY = TRUE
ORDER BY ANOMALY_SCORE DESC;

Questions: - Which boroughs have anomalous fare hours? - Look up the anomalous timestamps — do they correspond to late night, holidays, or storms?


Base Exercise — Databricks sklearn + MLflow (~25 min)

Step 1: Run the training notebook

  1. Open ml/databricks/01_ml_tip_prediction.py in your Databricks workspace
  2. Set ATTENDEE_ID (should already be set from 00_setup)
  3. Adjust SAMPLE_FRACTION = 0.10 (leave as-is for speed)
  4. Run all cells in order — training takes ~1–2 minutes

Watch for the output:

✅ MLflow Run ID: <run_id>
   RMSE: $1.xx
   MAE:  $0.xx
   R²:   0.xx

Step 2: Explore the MLflow Experiment

  1. In the Databricks sidebar: AI/ML → Experiments → find tip_prediction_{ATTENDEE_ID}
  2. Click your run — inspect:
    • Parameters: n_estimators, learning_rate, max_depth
    • Metrics: RMSE, MAE, R²
    • Artifacts: the model pickle, input_example.json
  3. Click Feature Importances artifact (if present) or look at the notebook output

Questions: - Which feature has the highest importance? Does this make business sense? - Is fare_amount the top feature? Is that a problem? (Hint: it’s not leakage — tip is not part of fare — but it is highly correlated)


Step 3: Inspect the Gold predictions table

After the notebook finishes, run this SQL query in a new cell:

SELECT
    pickup_borough,
    ROUND(AVG(tip_amount), 2)            AS avg_actual_tip,
    ROUND(AVG(predicted_tip), 2)         AS avg_predicted_tip,
    ROUND(AVG(ABS(prediction_error)), 2) AS avg_abs_error,
    COUNT(*)                             AS trips
FROM {catalog}.{attendee_id}_gold.ml_tip_predictions
GROUP BY pickup_borough
ORDER BY avg_actual_tip DESC;

Questions: - In which borough are predictions most accurate (lowest avg_abs_error)? - In which borough does the model overestimate tips?


Base Exercise — dbt Feature Table (~10 min)

Step 1: Run the feature model

cd dbt_project
dbt run  --select ml_features_tip_prediction
dbt test --select ml_features_tip_prediction

All 11 tests should pass. If any fail, look at: - dbt_utils.expression_is_true: tip_amount >= 0 — are there negative tips in the data? - accepted_values: payment_type_desc = [Credit Card] — are non-credit-card rows leaking through?

Step 2: View the lineage graph

dbt docs generate
dbt docs serve

Open the lineage graph — find ml_features_tip_prediction and click it. Trace the full path from Bronze source to the feature table.

Question: If a data engineer modifies silver_nyc_taxi_enriched (e.g., changes how time_of_day is calculated), how would the ML team know? (Answer: dbt lineage + CI tests)


Stretch Goal — Retrain with a different feature set

Stretch A: Remove fare_amount and observe the impact

fare_amount is the strongest predictor — but you might argue it’s “too close” to the target. In a real scenario, you might not know the fare until after the trip (though for NYC taxi you do).

  1. In 01_ml_tip_prediction.py, edit FEATURE_COLS — remove "fare_amount"
  2. Retrain the model
  3. Compare RMSE before and after in MLflow Experiments
  4. Check new feature importances — which feature rises to #1?

Expected: RMSE increases. This shows how much predictive power fare_amount adds.


Stretch Goal — Snowpark ML

Stretch B: Run Snowpark ML tip prediction

  1. Install: pip install snowflake-ml-python>=1.5.0

  2. Set environment variables:

    export SNOWFLAKE_ACCOUNT=xy12345.eu-west-1
    export SNOWFLAKE_USER=your_user
    export SNOWFLAKE_PASSWORD=your_password
    export ATTENDEE_ID=01_ALICE
  3. Run: python ml/snowflake/02_snowpark_ml_tip_prediction.py

  4. In Snowsight: Data → Databases → DE_MASTERCLASS → {schema} → Models — confirm TIP_PREDICTOR_{ATTENDEE_ID} appears

Compare with Databricks: - Which RMSE is lower? - Which took longer to run? - In which approach did data leave Snowflake?


Stretch Goal — Databricks AutoML

Stretch C: Databricks AutoML

  1. In Databricks sidebar: AI/ML → Experiments → Create AutoML Experiment
  2. Prediction type: Regression
  3. Dataset: {catalog}.{attendee_id}_silver.silver_nyc_taxi_enriched
  4. Target column: tip_amount
  5. Metric: RMSE
  6. Click Start

After it completes (~5–10 minutes): - Which algorithm won (XGBoost, LightGBM, or sklearn)? - Compare the winning RMSE to your manual GBR — which is better? - Open the best-run notebook — what preprocessing did AutoML apply that you didn’t?


Compare Your Results

After running all three tools, fill in the table below with your own observed values. This is the core learning of the module — the same prediction problem, three very different experiences.

Observable differences from the exercises

What you observed Cortex ML (SQL) Databricks (sklearn) Snowflake (Snowpark ML)
Lines of code you wrote ~5 (one SQL call) ~80 (notebook) ~60 (Python script)
Time to first result ~10 seconds ~2 minutes ~1–2 minutes
Your RMSE / accuracy N/A — forecasting, not regression $______ $______
Did data leave Snowflake? No Yes — .toPandas() to driver No
Where did you see the model? SQL result set MLflow Experiments UI Snowflake Model Registry
Could you see feature importances? No Yes — GBR built-in importance Limited
Setup steps before running 0 (just SQL) Runtime ML cluster pip install snowflake-ml-python
TipThings you should have noticed

1. Cortex ML is dramatically simpler — the entire forecast is one SELECT statement. No training loop, no model artifact, no hyperparameters. The tradeoff: you can only run FORECAST or ANOMALY_DETECTION — no other algorithms.

2. Databricks gives you the most visibility — MLflow Experiments shows every parameter, metric, and artifact. You can compare runs side-by-side. Cortex gives you results but no model introspection.

3. The RMSE from Databricks and Snowpark ML should be similar — because they use the same algorithm (GBR), same features, and same data. Small differences come from sampling randomness and warehouse compute differences.

4. Data movement is the hidden cost in Databricks — the .toPandas() call transfers data from the cluster to the driver. For 10% of the Silver table this is fast, but at 100% or on a larger dataset it becomes a bottleneck. Snowpark ML avoids this entirely.

5. The dbt tests caught the credit card filter — if you forgot and tried to include cash trips, the accepted_values: [Credit Card] test would catch it before training.

Discussion questions for the group

  1. Which approach would you choose for a production weekly tip prediction job? Consider: maintenance burden, model visibility, data governance.
  2. Cortex ML.FORECAST needed a Gold aggregation table as input (GOLD_TRIPS_BY_HOUR). Why can’t it use the raw Silver trip table directly?
  3. MLflow vs Snowflake Model Registry — both track models. What would make you choose one over the other in a real project?
  4. If your company’s data scientists use Python notebooks daily (Databricks), but your data platform is Snowflake — which training approach do you use and why?
  5. dbt ran in ~5 seconds; training took ~2 minutes — if the feature table changes daily, when should dbt run and model.fit() be sequenced in a production pipeline?

Expected Results

Exercise Expected output
Cortex ML.FORECAST 24 rows per borough with forecast + confidence bounds
Cortex ML.ANOMALY_DETECTION Handful of anomalous fare hours (usually late-night outliers)
Databricks RMSE ~$1.00–$2.00 (depends on sample size and features)
Databricks R² ~0.60–0.80 for credit card trips
dbt tests 11 passing tests on ML feature table
Snowpark ML RMSE Similar to Databricks (same algorithm, same data)

Clean Up

Databricks:

DROP TABLE IF EXISTS {catalog}.{attendee_id}_gold.ml_tip_predictions;

Snowflake:

DROP TABLE IF EXISTS DE_MASTERCLASS.{ATTENDEE_ID}_GOLD.ML_TRIP_DEMAND_FORECAST;
DROP TABLE IF EXISTS DE_MASTERCLASS.{ATTENDEE_ID}_GOLD.ML_TIP_PREDICTIONS_SNOWPARK;
DROP VIEW  IF EXISTS DE_MASTERCLASS.{ATTENDEE_ID}_SILVER.V_HOURLY_FARES;
DROP MODEL IF EXISTS DE_MASTERCLASS.{ATTENDEE_ID}_GOLD.TIP_PREDICTOR_{ATTENDEE_ID};

Return to module

Source: merged from frozen workshop-2026-v1/exercises/ex-ml.qmd — do not edit workshop-2026-v1/.