XGBoost Classification: Detailed Notes and Intuition

This page explains how XGBoost performs binary classification, from the objective and gradients to split selection, probability conversion, and practical settings. Multi‑class notes are included at the end.

Problem setup and base score

  • Data: D = \{(x_i, y_i)\}_{i=1}^n with labels y_i \in \{0, 1\}
  • Predictions are made in log‑odds (logit) space. Start from a constant base score:

b = \log\frac{p}{1-p}, \qquad p = \frac{1}{n}\sum_i y_i

All initial predictions are \hat{y}_i^{(0)} = b. Convert any logit z to a probability with the sigmoid \sigma(z) = \frac{1}{1+e^{-z}}.
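
As a small numeric illustration (with a hypothetical positive rate of 0.3, not a value from this page), the base score and its probability round‑trip look like this:

import numpy as np

p = 0.3                               # illustrative prior positive rate
b = np.log(p / (1 - p))               # base score in logit space (about -0.847)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))   # logit -> probability

print(b, sigmoid(b))                  # sigmoid(b) recovers p = 0.3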

Objective and derivatives (logistic loss)

Negative log‑likelihood with the logit \hat{y}:

\ell(y, \hat{y}) = -\, y \log \sigma(\hat{y}) - (1-y) \log\big(1 - \sigma(\hat{y})\big)

Per‑sample first and second derivatives used by XGBoost:

  • Gradient: g_i = \sigma(\hat{y}_i^{(t-1)}) - y_i
  • Hessian: h_i = \sigma(\hat{y}_i^{(t-1)}) \big(1 - \sigma(\hat{y}_i^{(t-1)})\big)
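
A minimal NumPy sketch of these per‑sample derivatives (the function name and arrays are illustrative, not XGBoost internals):

import numpy as np

def logistic_grad_hess(logits, y):
    """Gradient and Hessian of the logistic loss with respect to the logits."""
    p = 1.0 / (1.0 + np.exp(-logits))   # current predicted probability
    grad = p - y                        # g_i = sigma(y_hat) - y_i
    hess = p * (1.0 - p)                # h_i = sigma(y_hat) * (1 - sigma(y_hat))
    return grad, hess

g, h = logistic_grad_hess(np.zeros(4), np.array([0.0, 1.0, 1.0, 0.0]))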

Building one tree (second‑order boosting)

For a candidate tree structure, aggregate over each leaf j:

  • G_j = \sum_{i \in I_j} g_i, \quad H_j = \sum_{i \in I_j} h_i

With regularization \Omega(f) = \gamma T + \tfrac{1}{2}\lambda \sum_j w_j^2, the optimal leaf weight and split gain are:

w_j^{\ast} = -\frac{G_j}{H_j + \lambda}, \qquad \text{Gain} = \tfrac{1}{2}\left(\frac{G_L^2}{H_L+\lambda} + \frac{G_R^2}{H_R+\lambda} - \frac{G_P^2}{H_P+\lambda}\right) - \gamma

Here the subscripts L, R, and P denote the left child, right child, and parent node (G_P = G_L + G_R, H_P = H_L + H_R). Choose the split with the highest positive gain; otherwise prune (controlled by \gamma). Note that for classification h_i varies with the current predicted probability, unlike squared‑error regression where h_i = 1.
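
The leaf‑weight and gain formulas translate directly into code. This is a toy sketch with made‑up gradient/hessian sums, not the library's split‑finding routine:

def leaf_weight(G, H, lam=1.0):
    # w* = -G / (H + lambda)
    return -G / (H + lam)

def split_gain(G_L, H_L, G_R, H_R, lam=1.0, gamma=0.0):
    # Gain = 1/2 [G_L^2/(H_L+lam) + G_R^2/(H_R+lam) - (G_L+G_R)^2/(H_L+H_R+lam)] - gamma
    def score(G, H):
        return G * G / (H + lam)
    return 0.5 * (score(G_L, H_L) + score(G_R, H_R) - score(G_L + G_R, H_L + H_R)) - gamma

print(leaf_weight(G=-3.0, H=2.5))                       # optimal weight for one leaf
print(split_gain(G_L=-3.0, H_L=2.5, G_R=4.0, H_R=3.0))  # positive gain => keep the split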

Prediction update and probabilities

After fitting tree f_t, the logit prediction is updated with learning rate \eta:

\hat{y}^{(t)}(x) = \hat{y}^{(t-1)}(x) + \eta\, f_t(x)

Convert to a probability with p = \sigma(\hat{y}). Class labels use a threshold \tau (default 0.5, often tuned for imbalance). Metrics like ROC‑AUC or PR‑AUC are threshold‑independent and recommended during tuning.
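
A short sketch of how the staged logit sum becomes a probability and a hard label (tree_outputs is a stand‑in for the per‑tree leaf values of one sample, not a real XGBoost attribute):

import numpy as np

base_score = 0.0                       # initial logit b
eta = 0.1                              # learning rate
tree_outputs = [0.8, -0.3, 0.5]        # f_t(x) for one sample (illustrative values)

logit = base_score + eta * sum(tree_outputs)   # y_hat after all boosting rounds
proba = 1.0 / (1.0 + np.exp(-logit))           # p = sigmoid(y_hat)
label = int(proba >= 0.5)                      # threshold tau = 0.5 (tunable)
print(proba, label)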

Handling class imbalance

  • Use scale_pos_weight ≈ negatives/positives for skewed data (a quick sketch follows this list).
  • Prefer AUC/PR‑AUC and tune the probability threshold for F1, recall, or custom cost.
  • Row/column sampling (subsample, colsample_bytree) and regularization further improve generalization.
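
A quick sketch for setting the ratio from training labels (the 70/30 label array here is illustrative):

import numpy as np
from xgboost import XGBClassifier

y_train = np.array([0] * 700 + [1] * 300)   # imbalanced labels, 70% negative
neg, pos = np.bincount(y_train)

model = XGBClassifier(
    objective="binary:logistic",
    scale_pos_weight=neg / pos,             # roughly negatives / positives
    eval_metric="aucpr",                    # PR-AUC is more informative under imbalance
)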

Hyperparameters cheat sheet

  • Core: objective='binary:logistic', learning_rate (η), n_estimators with early_stopping_rounds
  • Tree shape: max_depth, min_child_weight, gamma
  • Sampling: subsample, colsample_bytree
  • Regularization: reg_lambda (L2), reg_alpha (L1)
  • Performance: tree_method='hist' (plus device='cuda' for GPU on XGBoost 2.x; older versions use 'gpu_hist'), n_jobs=-1

Minimal example (scikit‑learn API)

from xgboost import XGBClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=8000, n_features=20, weights=[0.7, 0.3], random_state=42)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# In recent XGBoost releases (1.6+), eval_metric and early_stopping_rounds
# are constructor arguments; passing them to fit() is deprecated/removed.
model = XGBClassifier(
    objective="binary:logistic",
    n_estimators=4000,            # upper bound; early stopping picks the best round
    learning_rate=0.03,
    max_depth=6,
    subsample=0.8,
    colsample_bytree=0.8,
    reg_lambda=1.0,
    reg_alpha=0.0,
    tree_method="hist",
    eval_metric="auc",
    early_stopping_rounds=200,
    n_jobs=-1,
    random_state=42,
)

model.fit(
    X_train, y_train,
    eval_set=[(X_valid, y_valid)],   # validation set monitored for early stopping
    verbose=False,
)

proba = model.predict_proba(X_valid)[:, 1]
auc = roc_auc_score(y_valid, proba)
print({"auc": auc, "best_iteration": model.best_iteration})

Diagnostics and interpretation

  • Inspect the confusion matrix and choose a probability threshold aligned with business costs (a threshold sweep is sketched after this list).
  • Use ROC/PR curves to understand trade‑offs; PR‑AUC is more informative under heavy imbalance.
  • SHAP values provide faithful local/global explanations for tree models.
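
One common way to pick the operating threshold is to sweep the precision‑recall curve and keep the point with the best F1 (a sketch; proba and y_valid come from the example above, and a cost‑sensitive criterion may be more appropriate):

import numpy as np
from sklearn.metrics import precision_recall_curve

precision, recall, thresholds = precision_recall_curve(y_valid, proba)
f1 = 2 * precision * recall / (precision + recall + 1e-12)   # guard against 0/0
best = np.argmax(f1[:-1])                                    # last PR point has no threshold
print({"threshold": thresholds[best], "f1": f1[best]})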

Multi‑class notes

  • Set objective='multi:softprob' and num_class=K to get class probabilities via softmax (see the sketch after this list).
  • The same second‑order framework applies; gradients/hessians come from softmax cross‑entropy.
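
A minimal multi‑class sketch with the scikit‑learn wrapper (which infers the number of classes from the labels, so num_class is only set explicitly with the native xgb.train API; the dataset here is synthetic):

from xgboost import XGBClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=2000, n_features=20, n_informative=10,
                           n_classes=3, random_state=42)

clf = XGBClassifier(objective="multi:softprob", tree_method="hist")
clf.fit(X, y)
proba = clf.predict_proba(X)   # shape (n_samples, 3); each row sums to 1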

Common pitfalls

  • Relying on accuracy with imbalanced data; prefer AUC/PR‑AUC and calibrated thresholds.
  • Too large learning_rate with too few trees; use smaller learning_rate plus early stopping.
  • Forgetting stratify during train/valid split, leading to skewed validation sets.
  • Mismatch between label encoding and expectation (ensure labels are 0/1 for binary:logistic).