Machine Learning Setup Guide
Databricks ML Runtime, Snowflake Cortex, and Snowpark ML for Module 9 (Optional)
This setup is only needed if you plan to run Module 9 (Machine Learning). Modules 2 and 3 (Databricks and Snowflake batch pipelines) must be completed first — the ML module reads from the Silver enriched tables produced by those pipelines.
Prerequisites
Before starting Module 9:
If any of these are missing, run the Bronze → Silver → Gold notebooks from Modules 2 and 3 first.
Databricks Setup
Runtime requirement
The ML notebooks require Databricks Runtime ML (not standard Runtime).
How to check your cluster runtime:
- Compute → select your cluster → Edit
- Under Databricks Runtime Version, look for a version ending with ML, e.g., 15.4 LTS ML
- If you have standard Runtime, create a new cluster with an ML runtime
Databricks Runtime ML pre-installs: scikit-learn, XGBoost, LightGBM, PyTorch, TensorFlow, MLflow. You don’t need to pip install any of these.
Verify scikit-learn and MLflow are available
In a notebook cell:
import sklearn
import mlflow
import xgboost
print(f"sklearn: {sklearn.__version__}")
print(f"mlflow: {mlflow.__version__}")
print(f"xgboost: {xgboost.__version__}")
Expected: all three import without error.
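The import check above stops at the first missing package. As a minimal sketch using only the standard library, the helper below reports every missing package in one pass. `find_missing_packages` is a hypothetical name introduced here for illustration; it is not part of Databricks or any of the listed libraries.

```python
import importlib.util


def find_missing_packages(import_names):
    """Return the import names that cannot be located in this environment."""
    missing = []
    for name in import_names:
        # find_spec returns None when a top-level module is not installed
        if importlib.util.find_spec(name) is None:
            missing.append(name)
    return missing


# Import names, not pip names: scikit-learn is imported as "sklearn"
required = ["sklearn", "mlflow", "xgboost"]
missing = find_missing_packages(required)
print("All packages found" if not missing else f"Missing: {missing}")
```

If anything is reported as missing, the cluster is probably on standard Runtime rather than Runtime ML.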
MLflow Experiment access
The ML notebook logs runs to MLflow Experiments (built in to Databricks).
- Click Experiments in the left sidebar
- Verify you can create an experiment (or that a shared one exists)
- After training, your run appears here with metrics, params, and model artifacts
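To confirm experiment access before training, one option is to log a throwaway run. The sketch below is an assumption about how you might do that, not the Module 9 notebook's code. `log_smoke_run`, the run name, and the parameter and metric names are all hypothetical. The MLflow module is passed in as an argument so the helper can be defined even where MLflow is not installed.

```python
def log_smoke_run(mlflow_module, experiment_name):
    """Log one throwaway run and return its run ID."""
    # set_experiment creates the experiment if it does not exist yet
    mlflow_module.set_experiment(experiment_name)
    with mlflow_module.start_run(run_name="setup-smoke-test") as run:
        # Hypothetical names, used only to make the run easy to spot
        mlflow_module.log_param("purpose", "module9-setup-check")
        mlflow_module.log_metric("dummy_metric", 1.0)
        return run.info.run_id
```

In a Databricks notebook you would run `import mlflow` and then call `log_smoke_run(mlflow, "/Shared/module9-setup-check")`. The experiment path is an example only; use a workspace path you can write to. The run should then appear under Experiments.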
Fallback: standard Runtime
If only standard Runtime is available, install the required packages at the top of the notebook:
%pip install scikit-learn xgboost mlflow pandas matplotlib
Snowflake Setup
1. Grant the CORTEX_USER database role (Trainer action)
Cortex ML Functions require the SNOWFLAKE.CORTEX_USER database role. The trainer runs this once on the shared account:
USE ROLE ACCOUNTADMIN;
GRANT DATABASE ROLE SNOWFLAKE.CORTEX_USER TO ROLE SYSADMIN;
After this, all users with SYSADMIN (or their own role inheriting from it) can call ML.FORECAST and ML.ANOMALY_DETECTION.
2. Verify Cortex ML access
Run this in a Snowsight worksheet to confirm Cortex is enabled:
SELECT AI_COMPLETE('mistral-large2', 'Say hello') AS test;
Expected: a short text response. If you get an ACCESS_DENIED error, ask the trainer to grant USE AI FUNCTIONS and CORTEX_USER.
3. Verify ML training data
-- Verify Silver enriched table exists with data
SELECT
    COUNT(*) AS total_rows,
    COUNT(CASE WHEN PAYMENT_TYPE_DESC = 'Credit Card' THEN 1 END) AS credit_card_rows,
    AVG(TIP_AMOUNT) AS avg_tip
FROM DE_MASTERCLASS.{ATTENDEE_ID}_SILVER.SILVER_NYC_TAXI_ENRICHED;
Expected: total_rows > 200,000 and credit_card_rows > 100,000.
-- Verify Gold hourly trips table (needed for Cortex FORECAST)
SELECT COUNT(*), MIN(PICKUP_HOUR_TS), MAX(PICKUP_HOUR_TS)
FROM DE_MASTERCLASS.{ATTENDEE_ID}_GOLD.GOLD_TRIPS_BY_HOUR;
Expected: at least 500 rows spanning multiple days.
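The expected values from the two queries can be checked in one place. This is a minimal sketch, assuming you copy the counts from the query results by hand. `check_training_data` is a hypothetical helper, and the example numbers are illustrative, not real results.

```python
def check_training_data(total_rows, credit_card_rows, hourly_rows):
    """Return a list of problems; an empty list means every check passed."""
    problems = []
    # Thresholds taken from the "Expected" notes in this guide
    if total_rows <= 200_000:
        problems.append(f"total_rows is {total_rows}, expected more than 200,000")
    if credit_card_rows <= 100_000:
        problems.append(
            f"credit_card_rows is {credit_card_rows}, expected more than 100,000"
        )
    if hourly_rows < 500:
        problems.append(f"GOLD_TRIPS_BY_HOUR has {hourly_rows} rows, expected at least 500")
    return problems


# Illustrative values only; substitute the counts from your own query results
for problem in check_training_data(250_000, 90_000, 720):
    print(problem)
```

Any reported problem usually means the Silver or Gold pipeline needs to be rerun before Module 9.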
4. Snowpark ML — local Python (optional)
If you want to run Snowpark ML scripts locally (not in Snowsight notebooks):
uv pip install --system "snowflake-snowpark-python[pandas]>=1.14.0" snowflake-ml-python
Configure connection:
from snowflake.snowpark import Session

# Placeholder: replace with the attendee ID assigned to you
ATTENDEE_ID = "your-attendee-id"

connection_params = {
    "account": "your-account-id",
    "user": "your-username",
    "password": "your-password",
    "warehouse": "DE_WORKSHOP_WH",
    "database": "DE_MASTERCLASS",
    "schema": f"{ATTENDEE_ID}_SILVER",
    "role": "SYSADMIN",
}
session = Session.builder.configs(connection_params).create()
Snowpark Python is pre-installed in Snowsight notebooks, so no local setup is needed there. Open your Snowflake account → Projects → Notebooks → create a new notebook.
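For local runs, you may prefer not to hardcode the password in a script. The sketch below is one way to build the same connection parameters from environment variables. The variable names (`SNOWFLAKE_ACCOUNT`, `SNOWFLAKE_USER`, `SNOWFLAKE_PASSWORD`) and the helper `build_connection_params` are assumptions made for this example, not a convention required by Snowpark.

```python
import os


def build_connection_params(attendee_id, env=os.environ):
    """Build Snowpark connection parameters, reading credentials from env."""
    # Hypothetical variable names; set them in your shell before running
    required = ["SNOWFLAKE_ACCOUNT", "SNOWFLAKE_USER", "SNOWFLAKE_PASSWORD"]
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise ValueError(f"Missing environment variables: {', '.join(missing)}")
    return {
        "account": env["SNOWFLAKE_ACCOUNT"],
        "user": env["SNOWFLAKE_USER"],
        "password": env["SNOWFLAKE_PASSWORD"],
        # Workshop values from the connection example in this guide
        "warehouse": "DE_WORKSHOP_WH",
        "database": "DE_MASTERCLASS",
        "schema": f"{attendee_id}_SILVER",
        "role": "SYSADMIN",
    }
```

The returned dictionary can be passed to `Session.builder.configs(...)` in place of a hand-written one. Failing early on a missing variable tends to give a clearer message than a connection error.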
dbt Setup for ml_features Model
The dbt ML model (ml/dbt/models/ml_features_tip_prediction.sql) materialises the feature table as a standard dbt table model — no special config needed.
Run it with your existing dbt Snowflake profile:
cd dbt_project/
dbt run --target snowflake --select ml_features_tip_prediction
dbt test --target snowflake --select ml_features_tip_prediction
Verify:
SELECT COUNT(*), AVG(fare_amount), AVG(tip_amount)
FROM DE_MASTERCLASS.{ATTENDEE_ID}_GOLD.ML_FEATURES_TIP_PREDICTION;
Day-of Checklist
Before Module 9 starts:
Troubleshooting
| Issue | Solution |
|---|---|
| ModuleNotFoundError: sklearn on Databricks | Switch to ML Runtime (cluster edit → Runtime version) |
| MLflow experiment not visible | Click Experiments in sidebar; first run creates it automatically |
| Cortex ACCESS_DENIED | Ask trainer to grant CORTEX_USER to your role |
| Cortex ML.FORECAST returns 0 rows | Check GOLD_TRIPS_BY_HOUR is not empty; verify column names match |
| Snowpark Session.builder fails | Check account ID format: orgname-accountname or abc12345.west-europe.azure |
| dbt model fails with missing source | Run Silver and Gold batch pipeline first (Modules 2–4) |
| tip_amount all zeros in predictions | Filter to payment_type_desc = 'Credit Card'; cash tips are always 0 |