Random Forest — Feature Importance, Tuning, and OOB Evaluation
Feature Importance
Feature importance quantifies how much each input column contributes to a model’s predictions.
Why it matters
- Feature selection: keep the most useful features, remove weak ones to reduce overfitting and training time.
- Interpretability: explain decisions (e.g., why a loan was rejected) by showing which features drove the prediction.
Which models provide importances
- Tree‑based ensembles: Random Forest, Gradient Boosting, AdaBoost (with trees), Decision Trees.
- Many gradient‑boosting libraries (XGBoost, LightGBM, CatBoost) also expose importances.
Example intuition (MNIST)
On 28×28 digit images, Random Forest often assigns higher importance to central pixels than to corners, reflecting where information content is concentrated.
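A minimal sketch of that check, assuming MNIST is fetched via sklearn's fetch_openml and matplotlib is available (a subset is used to keep training quick):
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_openml
from sklearn.ensemble import RandomForestClassifier
# Fetch MNIST (70,000 images, 28x28 = 784 pixel features); the download can take a while
X, y = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)
rf = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
rf.fit(X[:10_000], y[:10_000])  # a subset keeps the example fast
# Reshape the 784 per-pixel importances back into image form
plt.imshow(rf.feature_importances_.reshape(28, 28), cmap="hot")
plt.colorbar(label="importance")
plt.title("Pixel importances")
plt.show()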
How trees compute impurity‑based importance
At a split node with parent sample set S (size |S|), left child L and right child R, and impurity measure I(·) (e.g., Gini or entropy for classification; MSE for regression), the split’s contribution is the impurity reduction weighted by node size:
\(\Delta I = \frac{|S|}{N}\left( I(S) - \frac{|L|}{|S|}\, I(L) - \frac{|R|}{|S|}\, I(R) \right)\), where \(N\) is the total number of training samples.
For a given feature \(k\), sum \(\Delta I\) over all nodes that split on \(k\). Normalise across features so the importances sum to 1 within a single tree.
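A toy numeric check of the formula, using Gini impurity and made-up class counts for a single hypothetical split:
import numpy as np
def gini(counts):
    """Gini impurity from class counts at a node."""
    p = np.asarray(counts, dtype=float)
    p /= p.sum()
    return 1.0 - np.sum(p ** 2)
N = 100                          # total training samples in the tree
parent = [40, 60]                # class counts at the parent node S (|S| = 100)
left, right = [35, 5], [5, 55]   # class counts in the children (|L| = 40, |R| = 60)
S, L, R = sum(parent), sum(left), sum(right)
delta_I = (S / N) * (gini(parent) - (L / S) * gini(left) - (R / S) * gini(right))
print(round(delta_I, 4))         # contribution of this single split (about 0.30)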
Random Forest importances
Compute importances per tree as above, then average across trees (and re‑normalise). Access via feature_importances_ after fitting.
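A minimal sketch of reading and ranking the averaged importances, using the bundled breast-cancer dataset purely for illustration:
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
data = load_breast_cancer()
rf = RandomForestClassifier(n_estimators=200, random_state=42, n_jobs=-1)
rf.fit(data.data, data.target)
# feature_importances_ holds the per-tree impurity-based importances, averaged and normalised
for idx in np.argsort(rf.feature_importances_)[::-1][:5]:
    print(f"{data.feature_names[idx]:<25s} {rf.feature_importances_[idx]:.3f}")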
Caveat: high‑cardinality bias
Impurity‑based importances can overvalue features with many unique values. Prefer model‑agnostic checks like permutation importance for validation:
from sklearn.inspection import permutation_importance
rf.fit(X_train, y_train)
# Shuffle each feature in X_val and measure the resulting drop in score
result = permutation_importance(rf, X_val, y_val, n_repeats=10, random_state=42, n_jobs=-1)
perm_importances = result.importances_mean  # mean importance over the 10 repeats
Tools like SHAP values can provide even deeper local and global explanations.
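A sketch of that route, assuming the third-party shap package is installed and reusing the fitted rf and validation data from above:
import shap  # third-party package: pip install shap
# TreeExplainer computes SHAP values efficiently for tree ensembles
explainer = shap.TreeExplainer(rf)
shap_values = explainer.shap_values(X_val)
# Global summary of how strongly (and in which direction) each feature drives predictions
shap.summary_plot(shap_values, X_val)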
Hyperparameter Tuning
Random Forests work well out‑of‑the‑box, but tuning can yield further gains.
Commonly tuned parameters
- Number of trees: n_estimators (start in 100–500, increase until gains plateau).
- Features per split: max_features ("sqrt", "log2", or a fraction like 0.3–0.8).
- Row sampling per tree: max_samples (only if bootstrap=True; try 0.5–0.75).
- Tree depth/size controls: max_depth, min_samples_leaf, min_samples_split, max_leaf_nodes.
- Split criterion: classifier ("gini", "entropy"/"log_loss"); regressor ("squared_error", "absolute_error", "poisson").
GridSearchCV (exhaustive)
- Try a small, focused grid on key parameters; works well on smaller datasets.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
rf = RandomForestClassifier(random_state=42, n_jobs=-1)
param_grid = {
"n_estimators": [100, 200, 400],
"max_features": ["sqrt", "log2", 0.5],
"min_samples_leaf": [1, 2, 5],
}
grid = GridSearchCV(rf, param_grid=param_grid, cv=5, n_jobs=-1)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)
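By default GridSearchCV refits the best combination on the full training data; the resulting model can then be checked on a held-out split (X_test/y_test assumed here):
best_rf = grid.best_estimator_  # refitted on all of X_train with the best parameters
print("Held-out accuracy:", best_rf.score(X_test, y_test))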
RandomizedSearchCV (faster on large spaces)
- Sample a fixed number of parameter combinations; often finds near‑optimal settings quickly.
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint, uniform
rf = RandomForestRegressor(random_state=42, n_jobs=-1)
param_dist = {
"n_estimators": randint(100, 600),
"max_features": uniform(0.3, 0.7),
"min_samples_leaf": randint(1, 10),
}
search = RandomizedSearchCV(rf, param_distributions=param_dist, n_iter=40, cv=5, n_jobs=-1, random_state=42)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
Notes
- Use stratified CV and class‑weighted metrics for imbalanced classification (see the sketch after these notes).
- Calibrate expectations: smaller max_features increases tree diversity (reduces correlation between trees) but can raise bias; tune it with CV.
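A sketch of the imbalanced-classification note, swapping in an explicit stratified splitter and a class-weighted scoring metric (X_train/y_train assumed as before):
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold
rf = RandomForestClassifier(class_weight="balanced", random_state=42, n_jobs=-1)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
grid = GridSearchCV(
    rf,
    param_grid={"max_features": ["sqrt", 0.5], "min_samples_leaf": [1, 5]},
    scoring="f1_weighted",  # class-weighted metric instead of plain accuracy
    cv=cv,
    n_jobs=-1,
)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)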
Out‑of‑Bag (OOB) Evaluation
With bootstrapping, each tree trains on a sample drawn with replacement from the training set. Roughly 36.8% of rows are left out (OOB) for a given tree, since the probability of not being picked is \((1 - 1/N)^N \approx e^{-1}\).
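A quick numeric check of that limit for a few training-set sizes:
import math
# Probability that a given row is never drawn in N draws with replacement
for N in (10, 100, 10_000):
    print(N, round((1 - 1 / N) ** N, 4))
print("e^-1 =", round(math.exp(-1), 4))  # the limit, roughly 0.368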
These OOB rows serve as an internal validation set:
- Enable with oob_score=True (requires bootstrap=True).
- After fitting, read oob_score_ for a quick performance estimate (accuracy for classification; R² for regression).
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=300, oob_score=True, bootstrap=True, n_jobs=-1, random_state=42)
rf.fit(X_train, y_train)
print("OOB score:", rf.oob_score_)
OOB scores are typically close to test‑set metrics but may differ slightly; still, they are very handy when you want validation without a dedicated holdout.
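A sketch of that comparison, assuming a held-out X_test/y_test split is available alongside the OOB-enabled forest above:
from sklearn.metrics import accuracy_score
# Internal OOB estimate versus a held-out test set for the forest fitted above
print("OOB score :", rf.oob_score_)
print("Test score:", accuracy_score(y_test, rf.predict(X_test)))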