Elastic Net Regression
Elastic Net is a linear model that blends Ridge (L2) and Lasso (L1) penalties. It’s handy when you have many features, possible multicollinearity, and you’re unsure whether pure Ridge or pure Lasso is best.
What problem it solves
- You’ve got a large feature set and don’t know which inputs are important.
- Your inputs are correlated (multicollinearity), e.g., height and weight.
- You want a compromise between Ridge’s stability and Lasso’s feature selection.
Loss function (plain-text form)
Loss = MSE + A * (Σ w_i^2) + B * (Σ |w_i|)
- `A` is the weight on the Ridge (L2) penalty
- `B` is the weight on the Lasso (L1) penalty
In scikit‑learn:
- `alpha` ≈ overall regularization strength: `alpha = A + B`
- `l1_ratio` sets the L1/L2 mix: `l1_ratio = B / (A + B)`, i.e., the fraction of the penalty that is L1
- So `B = l1_ratio * alpha` and `A = (1 - l1_ratio) * alpha` (scikit-learn's exact objective adds constant factors of ½, but this is the intuition)
- Defaults: `alpha = 1.0`, `l1_ratio = 0.5` (equal L1 and L2 weight)
- Set `l1_ratio` closer to 1.0 → more L1 behavior (sparser)
- Set `l1_ratio` closer to 0.0 → more L2 behavior (more stable with collinearity)
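As a quick sanity check on that mapping, here is a tiny sketch; `A` and `B` are just the penalty weights from the loss above, and scikit-learn's constant factors are ignored:

# Illustrative only: convert between the (A, B) penalty weights above and
# scikit-learn's (alpha, l1_ratio) parameterization, ignoring constant factors.
A, B = 0.3, 0.7                 # L2 weight and L1 weight

alpha = A + B                   # overall strength
l1_ratio = B / (A + B)          # fraction of the penalty that is L1

# ...and back again
assert abs(l1_ratio * alpha - B) < 1e-12        # recovers the L1 weight
assert abs((1 - l1_ratio) * alpha - A) < 1e-12  # recovers the L2 weight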
Ridge vs Lasso context
- Ridge (L2): shrinks all coefficients toward zero but rarely to exactly zero; good with multicollinearity; keeps all features.
- Lasso (L1): can set some coefficients exactly to zero → feature selection; may be unstable when predictors are highly correlated.
- Elastic Net: balances both; tends to keep groups of correlated features while still enabling sparsity.
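To make the contrast concrete, here is a minimal sketch on synthetic data with near-duplicate (highly correlated) columns; the dataset and hyperparameter values are illustrative, not tuned:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge, Lasso, ElasticNet
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# Synthetic regression data, then append noisy copies of the first 5 columns
# so the design matrix contains strongly correlated features
X, y = make_regression(n_samples=200, n_features=20, n_informative=5, noise=10.0, random_state=0)
rng = np.random.default_rng(0)
X = np.hstack([X, X[:, :5] + 0.01 * rng.normal(size=(200, 5))])

for name, est in [("Ridge", Ridge(alpha=1.0)),
                  ("Lasso", Lasso(alpha=0.1)),
                  ("ElasticNet", ElasticNet(alpha=0.1, l1_ratio=0.5))]:
    model = make_pipeline(StandardScaler(), est).fit(X, y)
    coefs = model[-1].coef_
    print(f"{name}: {np.sum(np.abs(coefs) > 1e-8)} non-zero coefficients out of {coefs.size}")

Expect Ridge to keep all coefficients, Lasso to zero many out (sometimes dropping one of a correlated pair somewhat arbitrarily), and Elastic Net to land in between.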
Practical guidance
- Scale features (standardization) before fitting; penalties are scale‑dependent.
- Tune both `alpha` and `l1_ratio` via cross-validation.
- Try Linear Regression, Ridge, Lasso, and Elastic Net on wide datasets (e.g., 50+ columns) and compare validation metrics (see the comparison sketch below).
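A minimal comparison sketch along those lines, assuming `X` and `y` are already loaded as your feature matrix and target; the hyperparameter values are placeholders, not recommendations:

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# Hold out a validation split, then fit each model in a scaling pipeline
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

models = {
    "LinearRegression": LinearRegression(),
    "Ridge": Ridge(alpha=1.0),
    "Lasso": Lasso(alpha=0.1),
    "ElasticNet": ElasticNet(alpha=0.1, l1_ratio=0.5),
}
for name, est in models.items():
    pipe = make_pipeline(StandardScaler(), est)
    pipe.fit(X_train, y_train)
    print(f"{name}: validation R2 = {pipe.score(X_val, y_val):.3f}")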
Observations (from practice)
- On some multicollinear datasets, Ridge can beat Lasso slightly.
- Elastic Net with, for example, `alpha=0.005` and `l1_ratio=0.9` (mostly L1) may outperform both individually; tune on your data to be sure.
Minimal scikit‑learn examples
Direct fit
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import ElasticNet
from sklearn.pipeline import make_pipeline
model = make_pipeline(StandardScaler(), ElasticNet(alpha=1.0, l1_ratio=0.5, random_state=42))
model.fit(X_train, y_train)
print("R2 on validation:", model.score(X_val, y_val))
With cross‑validated tuning
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import ElasticNet
from sklearn.pipeline import Pipeline
pipe = Pipeline([
("scaler", StandardScaler()),
("enet", ElasticNet(max_iter=10000, random_state=42))
])
param_grid = {
"enet__alpha": [0.001, 0.005, 0.01, 0.1, 1.0],
"enet__l1_ratio": [0.1, 0.3, 0.5, 0.7, 0.9]
}
search = GridSearchCV(pipe, param_grid, cv=5, n_jobs=-1)
search.fit(X_train, y_train)
print("Best params:", search.best_params_)
print("Best CV R2:", search.best_score_)
SGDRegressor alternative
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import SGDRegressor
from sklearn.pipeline import make_pipeline
sgd_enet = make_pipeline(
    StandardScaler(),
    # SGDRegressor optimizes the same elastic-net penalty with stochastic gradient
    # descent, which scales better to very large datasets
    SGDRegressor(penalty="elasticnet", alpha=0.001, l1_ratio=0.5, max_iter=2000, tol=1e-3, random_state=42)
)
sgd_enet.fit(X_train, y_train)
print("R2 on validation:", sgd_enet.score(X_val, y_val))
Bias–variance intuition
- Increasing `alpha` increases bias and reduces variance.
- `l1_ratio` controls sparsity vs. stability: higher → sparser (more L1); lower → more stable with correlated features (more L2).
Key takeaways
- Elastic Net = L1 + L2: one knob for strength (`alpha`), one knob for mix (`l1_ratio`).
- Great when features are many and correlated.
- Tune both hyperparameters; standardize features first.
- Often more robust than pure Lasso in the presence of multicollinearity while still enabling sparsity.