XGBoost Classification: Detailed Notes and Intuition
This page explains how XGBoost performs binary classification, from the objective and gradients to split selection, probability conversion, and practical settings. Multi‑class notes are included at the end.
Problem setup and base score
- Data: $\{(x_i, y_i)\}_{i=1}^{n}$ with binary labels $y_i \in \{0, 1\}$.
- Predictions are made in log‑odds (logit) space. Start from a constant base score $p_0$ (XGBoost's default is $p_0 = 0.5$), giving the initial raw prediction $\hat{y}_i^{(0)} = \log\frac{p_0}{1 - p_0}$.
With the default $p_0 = 0.5$, all initial predictions are $0$ in logit space. Convert any logit $z$ to a probability with the sigmoid $\sigma(z) = \frac{1}{1 + e^{-z}}$.
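As a quick numerical check (a minimal sketch, not library internals), the base score and sigmoid conversion look like this:

import numpy as np

def sigmoid(z):
    # Map a raw logit to a probability in (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

p0 = 0.5                                     # XGBoost's default base_score (probability space)
initial_logit = np.log(p0 / (1.0 - p0))      # 0.0 when p0 = 0.5
print(initial_logit, sigmoid(initial_logit)) # -> 0.0 0.5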
Objective and derivatives (logistic loss)
Negative log‑likelihood with logits $\hat{y}_i$ and predicted probabilities $p_i = \sigma(\hat{y}_i)$:
$$L = -\sum_{i=1}^{n} \left[\, y_i \log p_i + (1 - y_i) \log(1 - p_i) \,\right]$$
Per‑sample first and second derivatives used by XGBoost:
- Gradient: $g_i = \frac{\partial L}{\partial \hat{y}_i} = p_i - y_i$
- Hessian: $h_i = \frac{\partial^2 L}{\partial \hat{y}_i^2} = p_i (1 - p_i)$
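These derivatives are simple to compute directly; the sketch below mirrors the formulas in NumPy (the function name is illustrative, and the two returned arrays are exactly what a custom XGBoost objective would have to provide):

import numpy as np

def logistic_grad_hess(logits, labels):
    # Per-sample derivatives of the logistic loss w.r.t. the raw logits.
    p = 1.0 / (1.0 + np.exp(-logits))   # predicted probabilities
    grad = p - labels                   # g_i = p_i - y_i
    hess = p * (1.0 - p)                # h_i = p_i * (1 - p_i)
    return grad, hess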
Building one tree (second‑order boosting)
For a candidate tree structure, aggregate the gradients and Hessians over the instance set $I_j$ of each leaf $j$:
- $G_j = \sum_{i \in I_j} g_i$, $\qquad H_j = \sum_{i \in I_j} h_i$
With L2 regularization $\lambda$ and complexity penalty $\gamma$, the optimal leaf weight and the split gain are:
$$w_j^* = -\frac{G_j}{H_j + \lambda}, \qquad \text{Gain} = \frac{1}{2}\left[\frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda}\right] - \gamma$$
Choose the split with the highest positive gain; otherwise prune (controlled by $\gamma$). Note that for classification $h_i = p_i(1 - p_i)$ varies with the current probability, unlike squared‑error regression where $h_i = 1$.
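A small sketch of the leaf‑weight and gain formulas above in plain Python (function names are illustrative):

def leaf_weight(G, H, lam=1.0):
    # Optimal leaf weight: w* = -G / (H + lambda).
    return -G / (H + lam)

def split_gain(G_left, H_left, G_right, H_right, lam=1.0, gamma=0.0):
    # Gain of splitting a node into left/right children; keep the split only if gain > 0.
    def score(G, H):
        return G * G / (H + lam)
    parent = score(G_left + G_right, H_left + H_right)
    return 0.5 * (score(G_left, H_left) + score(G_right, H_right) - parent) - gamma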
Prediction update and probabilities
After fitting tree $f_t$, the logit prediction updates with learning rate $\eta$:
$$\hat{y}_i^{(t)} = \hat{y}_i^{(t-1)} + \eta\, f_t(x_i)$$
Convert to probability $p_i = \sigma(\hat{y}_i^{(t)})$. Class labels use a threshold (default $0.5$, often tuned for imbalance). Metrics like ROC‑AUC or PR‑AUC are threshold‑independent and recommended during tuning.
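In code, one boosting round followed by the probability and label conversion looks roughly like this (an illustrative sketch with made-up leaf values, not library internals):

import numpy as np

eta = 0.1                                              # learning rate
logits = np.zeros(5)                                   # current raw predictions (logit space)
tree_output = np.array([0.3, -0.2, 0.5, 0.1, -0.4])    # hypothetical leaf values f_t(x_i)

logits = logits + eta * tree_output                    # additive update in logit space
proba = 1.0 / (1.0 + np.exp(-logits))                  # sigmoid -> probabilities
labels = (proba >= 0.5).astype(int)                    # default 0.5 threshold; tune for imbalance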
Handling class imbalance
- Use scale_pos_weight ≈ (number of negatives) / (number of positives) for skewed data; see the sketch after this list.
- Prefer AUC/PR‑AUC and tune the probability threshold for F1, recall, or a custom cost.
- Row/column sampling (subsample, colsample_bytree) and regularization further improve generalization.
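A minimal way to derive scale_pos_weight from the labels (toy data here; with real data use y_train):

import numpy as np
from xgboost import XGBClassifier

y = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1])        # toy labels, roughly 70/30 imbalance
spw = float((y == 0).sum()) / (y == 1).sum()         # negatives / positives = 7/3
model = XGBClassifier(objective="binary:logistic", scale_pos_weight=spw)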
Hyperparameters cheat sheet
- Core: objective='binary:logistic', learning_rate (η), n_estimators with early_stopping_rounds
- Tree shape: max_depth, min_child_weight, gamma
- Sampling: subsample, colsample_bytree
- Regularization: reg_lambda (L2), reg_alpha (L1)
- Performance: tree_method='hist' (or 'gpu_hist'), n_jobs=-1
Minimal example (scikit‑learn API)
from xgboost import XGBClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=8000, n_features=20, weights=[0.7, 0.3], random_state=42)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

model = XGBClassifier(
    objective="binary:logistic",
    n_estimators=4000,
    learning_rate=0.03,
    max_depth=6,
    subsample=0.8,
    colsample_bytree=0.8,
    reg_lambda=1.0,
    reg_alpha=0.0,
    tree_method="hist",
    n_jobs=-1,
    random_state=42,
    # In recent XGBoost these are constructor arguments (passing them to fit() is deprecated/removed):
    eval_metric="auc",
    early_stopping_rounds=200,
)

model.fit(
    X_train, y_train,
    eval_set=[(X_valid, y_valid)],
    verbose=False,
)

proba = model.predict_proba(X_valid)[:, 1]
auc = roc_auc_score(y_valid, proba)
print({"auc": auc, "best_iteration": model.best_iteration})
Diagnostics and interpretation
- Inspect the confusion matrix and choose a probability threshold aligned with business costs (see the threshold sketch after this list).
- Use ROC/PR curves to understand trade‑offs; PR‑AUC is more informative under heavy imbalance.
- SHAP values provide faithful local/global explanations for tree models.
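One way to choose the threshold from validation data, here maximizing F1 (a sketch assuming proba and y_valid from the example above):

import numpy as np
from sklearn.metrics import precision_recall_curve

precision, recall, thresholds = precision_recall_curve(y_valid, proba)
f1 = 2 * precision * recall / (precision + recall + 1e-12)
best = int(np.argmax(f1[:-1]))          # the last precision/recall point has no threshold
print({"threshold": float(thresholds[best]), "f1": float(f1[best])})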
Multi‑class notes
- Set objective='multi:softprob' and num_class=K to get class probabilities via softmax (see the sketch after this list; the scikit‑learn wrapper infers num_class automatically).
- The same second‑order framework applies; gradients and Hessians come from the softmax cross‑entropy.
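A minimal multi‑class sketch with the scikit‑learn wrapper (which infers the number of classes from y; with the native xgb.train API you would set num_class explicitly):

from xgboost import XGBClassifier
from sklearn.datasets import make_classification

# Toy 3-class problem.
X, y = make_classification(n_samples=3000, n_features=20, n_informative=8,
                           n_classes=3, random_state=42)

clf = XGBClassifier(objective="multi:softprob", tree_method="hist", random_state=42)
clf.fit(X, y)
print(clf.predict_proba(X[:3]))   # one row of K softmax probabilities per sample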
Common pitfalls
- Relying on accuracy with imbalanced data; prefer AUC/PR‑AUC and calibrated thresholds.
- Too large a learning_rate with too few trees; use a smaller learning_rate plus early stopping.
- Forgetting stratify during the train/valid split, leading to a skewed validation set.
- Mismatch between label encoding and expectation: ensure labels are 0/1 for binary:logistic (see the snippet after this list).
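For the label‑encoding pitfall, a common fix when labels arrive as strings (LabelEncoder maps classes to 0..K-1 in sorted order, so verify which class ends up as the positive class 1):

import numpy as np
from sklearn.preprocessing import LabelEncoder

y_raw = np.array(["no", "yes", "no", "yes", "yes"])
y = LabelEncoder().fit_transform(y_raw)   # -> array([0, 1, 0, 1, 1]); "yes" becomes class 1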