Every ML project follows the same workflow: load data, explore, preprocess, train, evaluate, tune, deploy. Scikit-Learn provides a clean API that standardizes each step. In this lesson you will build a complete, reusable pipeline and compare five algorithms on the same dataset, giving you a template you can apply to any regression or classification problem. #ScikitLearn #MLPipeline #CrossValidation
The ML Workflow
Every project follows these steps, regardless of the problem domain.
Without pipelines, preprocessing and model training are separate steps. This creates a subtle bug: if you fit the scaler on the full dataset before splitting, information from the test set leaks into training. Scikit-Learn Pipelines chain preprocessing and modeling into a single object, ensuring that scaling is fit only on training data during each cross-validation fold.
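To make the leak concrete, here is a minimal sketch (using small synthetic data, not the building dataset) contrasting a scaler fit on the full matrix with one fitted inside a Pipeline. The point is structural: in the second version the scaler is refit from scratch inside every cross-validation fold, so test folds never influence the scaling statistics.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import Ridge

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

# Leaky: the scaler sees every row, including future test folds
X_leaky = StandardScaler().fit_transform(X)
leaky_scores = cross_val_score(Ridge(), X_leaky, y, cv=5)

# Safe: the Pipeline refits the scaler inside each CV fold
pipe = Pipeline([("scaler", StandardScaler()), ("model", Ridge())])
safe_scores = cross_val_score(pipe, X, y, cv=5)
print(leaky_scores.mean(), safe_scores.mean())
```

On well-behaved data the two scores may look similar; the danger shows up on small or shifted datasets, which is exactly when you cannot afford an optimistic estimate.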
Generate a Realistic Dataset
We will predict power consumption in a building from sensor readings. This simulates the kind of tabular regression problem you encounter in building automation, industrial IoT, and energy management.
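The lesson's original generator is not reproduced here, so the following is an illustrative stand-in; the column names, coefficients, and noise level are assumptions chosen to match the description (occupancy-driven load plus a nonlinear HVAC-effort term):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 2000
hour = rng.integers(0, 24, n)
day_of_week = rng.integers(0, 7, n)           # 0 = Monday
workday = (hour >= 8) & (hour <= 18) & (day_of_week < 5)
occupancy = np.where(workday, rng.integers(10, 60, n), rng.integers(0, 5, n))
temperature = rng.normal(24, 6, n)            # outdoor temperature, C
humidity = rng.uniform(30, 80, n)             # relative humidity, %
setpoint = rng.choice([21.0, 22.0, 23.0], n)  # HVAC setpoint, C

power_kwh = (40.0                                    # base load
             + 2.5 * occupancy                       # people, lights, equipment
             + 6.0 * np.abs(temperature - setpoint)  # nonlinear HVAC effort
             + 0.1 * humidity
             + rng.normal(0, 10, n))                 # sensor noise

data = pd.DataFrame({"temperature": temperature, "humidity": humidity,
                     "occupancy": occupancy, "hour": hour,
                     "day_of_week": day_of_week, "setpoint": setpoint,
                     "power_kwh": power_kwh})
print(data.shape)
```

Any generator with the same shape works; what matters for the rest of the lesson is a `DataFrame` of sensor features plus a numeric target.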
Occupancy has the strongest correlation with power consumption, which matches our synthetic formula. But correlations only capture linear relationships. The HVAC effort term (absolute difference between temperature and setpoint) is nonlinear and will not show up well in simple correlation.
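A tiny, self-contained illustration of that blind spot: when the true relationship is purely |temperature − setpoint|, raw temperature shows only a weak linear correlation with the response, while the engineered feature correlates perfectly.

```python
import numpy as np

rng = np.random.default_rng(0)
temp = rng.uniform(10, 35, 1000)     # outdoor temperature, C
setpoint = 22.0
power = np.abs(temp - setpoint)      # pure HVAC-effort relationship, no noise

print(np.corrcoef(temp, power)[0, 1])                     # weak: the V-shape cancels out
print(np.corrcoef(np.abs(temp - setpoint), power)[0, 1])  # 1.0: engineered feature
```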
Data Preprocessing with Pipelines
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.pipeline import Pipeline

np.random.seed(42)
# (Regenerate the dataset from above, or assume 'data' is already loaded)
X, y = data.drop(columns="power_kwh"), data["power_kwh"]  # target column name assumed
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
preprocess = Pipeline([("scaler", StandardScaler()),
                       ("poly", PolynomialFeatures(degree=2, include_bias=False))])
X_train_prep = preprocess.fit_transform(X_train)  # fit on training data only
X_test_prep = preprocess.transform(X_test)        # reuses training mean/std
The pipeline ensures that StandardScaler is fit on training data only. When we call transform on the test set, it uses the training mean and standard deviation. Polynomial features create interaction terms (temperature * humidity, hour * occupancy, etc.) that can capture nonlinear relationships.
Cross-Validation and Model Comparison
A single train/test split can give misleading results. Cross-validation runs k different splits and averages the scores, giving a more reliable estimate.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor

models = {"Linear": LinearRegression(), "Ridge": Ridge(), "Lasso": Lasso(),
          "Random Forest": RandomForestRegressor(random_state=42),
          "Gradient Boosting": GradientBoostingRegressor(random_state=42)}
for name, model in models.items():  # every candidate gets the identical pipeline
    pipe = Pipeline([("scaler", StandardScaler()), ("model", model)])
    scores = cross_val_score(pipe, X_train, y_train, cv=5, scoring="r2")
    print(f"{name}: R^2 = {scores.mean():.3f} +/- {scores.std():.3f}")
import joblib  # add alongside the imports above

# Fit the best performer on the full training set, then persist scaler + model together
best = Pipeline([("scaler", StandardScaler()),
                 ("model", GradientBoostingRegressor(random_state=42))])
best.fit(X_train, y_train)
joblib.dump(best, "power_model.joblib")
print("Model saved to power_model.joblib")

# One new sensor reading (feature names assumed from the dataset above)
X_new = pd.DataFrame([{"temperature": 28.5, "humidity": 55.0, "occupancy": 45,
                       "hour": 15, "day_of_week": 2, "setpoint": 22.0}])
prediction = best.predict(X_new)
print(f"\nPredicted power consumption: {prediction[0]:.1f} kWh")
print("(Wednesday afternoon, 45 occupants, 28.5C with 22C setpoint)")
Expected output:
Model saved to power_model.joblib
Predicted power consumption: 193.4 kWh
(Wednesday afternoon, 45 occupants, 28.5C with 22C setpoint)
The saved .joblib file contains both the scaler and the model. When you load it, preprocessing and prediction happen in one predict() call. No separate scaling step, no chance of using the wrong scaler.
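A quick round-trip check makes this concrete. The sketch below trains a tiny stand-in pipeline, saves it, reloads it, and confirms the loaded object predicts identically, with no scaling code at the call site.

```python
import joblib
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge

X = np.random.default_rng(0).normal(size=(100, 3))
y = X[:, 0] * 2 + 1
pipe = Pipeline([("scaler", StandardScaler()), ("model", Ridge())]).fit(X, y)
joblib.dump(pipe, "power_model.joblib")

loaded = joblib.load("power_model.joblib")   # scaler + model come back together
pred = loaded.predict(X[:1])                 # scaling happens inside predict()
print(np.allclose(pred, pipe.predict(X[:1])))
```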
The Reusable ML Template
Every project you build from now on can follow this pattern.
1. Load and explore your data with pandas. Check shapes, types, distributions, and correlations.
2. Build a Pipeline that chains preprocessing (scaling, encoding, feature engineering) with a model. This prevents data leakage.
3. Cross-validate with cross_val_score to get reliable performance estimates across multiple splits.
4. Compare models by running several algorithms through the same pipeline and comparing cross-validation scores.
5. Tune the best model's hyperparameters with GridSearchCV (exhaustive) or RandomizedSearchCV (faster for large search spaces).
6. Evaluate on a held-out test set that was never used during training or tuning.
7. Save with joblib.dump() and load with joblib.load() for deployment.
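The tuning and final-evaluation steps can be sketched as follows. The dataset here is a generic stand-in from `make_regression`, and the parameter grid is illustrative; the key idiom is the `step__parameter` naming that lets GridSearchCV reach inside the pipeline.

```python
from sklearn.datasets import make_regression   # stand-in for your own data
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_regression(n_samples=300, n_features=6, noise=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)

pipe = Pipeline([("scaler", StandardScaler()),
                 ("model", GradientBoostingRegressor(random_state=42))])
# "model__n_estimators" means: the n_estimators parameter of the "model" step
grid = GridSearchCV(pipe, {"model__n_estimators": [100, 200],
                           "model__max_depth": [2, 3]},
                    cv=5, scoring="r2")
grid.fit(X_train, y_train)
print(grid.best_params_)
print(round(grid.score(X_test, y_test), 3))   # held-out evaluation, used once
```

Because the scaler lives inside the pipeline, every grid-search fold gets its own leak-free preprocessing automatically.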
This workflow applies whether you are predicting power consumption, classifying sensor anomalies, or estimating remaining useful life of equipment.