Logistic Regression — Perceptron Trick and Implementation

Logistic Regression is a cornerstone classification algorithm and a natural bridge to Deep Learning. The Perceptron Trick offers a simple way to learn a linear decision boundary and is closely related to Logistic Regression.

This page walks through a Perceptron‑style training loop and its unified update rule, then turns to the probabilistic view and explains why the full Logistic Regression approach is preferred in practice.

Core requirement: linear separability

  • Works best when data is linearly separable or almost linearly separable.
  • Linearly separable: a single straight line (2D), plane (3D), or hyperplane (higher‑D) splits classes.
  • Almost separable: a single boundary explains most points, tolerating a few misclassifications.
  • Highly non‑linear class structure → a single linear boundary will underperform without feature engineering (illustrated in the sketch below).
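
As a quick illustration (a sketch using scikit‑learn's synthetic dataset helpers; the datasets and scores are assumptions for demonstration, not from the original text), a linear model fits well‑separated blobs almost perfectly but struggles on interleaving half‑moons:

from sklearn.datasets import make_blobs, make_moons
from sklearn.linear_model import LogisticRegression

# Roughly linearly separable: two well-separated Gaussian blobs
X_sep, y_sep = make_blobs(n_samples=200, centers=2, random_state=0)

# Highly non-linear: interleaving half-moons
X_moon, y_moon = make_moons(n_samples=200, noise=0.1, random_state=0)

linear = LogisticRegression()
print(linear.fit(X_sep, y_sep).score(X_sep, y_sep))       # close to 1.0
print(linear.fit(X_moon, y_moon).score(X_moon, y_moon))   # noticeably lower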

The Perceptron Trick (simple learner)

Goal: find coefficients (A, B, C) of the line A*x + B*y + C = 0 that separate the classes.

High‑level loop:

  1. Initialise (A, B, C) randomly or to zeros; the initial line is usually a poor separator.
  2. For many epochs: pick a random training point (X_i, Y_true).
  3. If the point is misclassified, update the line to move toward positive points and away from negative points.

Positive/negative regions for A*x + B*y + C = 0

  • (x1, y1) is positive if A*x1 + B*y1 + C > 0.
  • Negative if A*x1 + B*y1 + C < 0.
  • On the line if A*x1 + B*y1 + C = 0 (see the code check below).
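
In code, the region test is just the sign of this expression. A minimal sketch (the coefficients and the point below are arbitrary examples):

def region(A, B, C, x1, y1):
    s = A * x1 + B * y1 + C
    if s > 0:
        return "positive"
    if s < 0:
        return "negative"
    return "on the line"

print(region(1.0, -1.0, 0.0, 2.0, 1.0))  # 1*2 - 1*1 + 0 = 1 > 0 -> "positive"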

How coefficients move the line

  • Change C → parallel shift up/down.
  • Change A → the line pivots about its y‑intercept (the point where it crosses the Y‑axis).
  • Change B → the line pivots about its x‑intercept (the point where it crosses the X‑axis); a quick numeric check follows below.
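
A small numeric check (illustrative values, not from the original): the y‑intercept -C/B does not move when A changes, and the x‑intercept -C/A does not move when B changes:

# Line: A*x + B*y + C = 0
def intercepts(A, B, C):
    return (-C / A, -C / B)  # (x-intercept, y-intercept)

print(intercepts(1.0, 2.0, -4.0))  # (4.0, 2.0)
print(intercepts(3.0, 2.0, -4.0))  # A changed: y-intercept is still 2.0
print(intercepts(1.0, 5.0, -4.0))  # B changed: x-intercept is still 4.0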

Augmented coordinates and case updates

Use augmented inputs X = [X0, X1, X2] with X0 = 1 and weights W = [W0, W1, W2] where W0=C, W1=A, W2=B.

  • Case 1: Negative point (Y_true = 0) predicted positive (Y_pred = 1)
    • Update: W_new = W_old - eta * X_i
  • Case 2: Positive point (Y_true = 1) predicted negative (Y_pred = 0)
    • Update: W_new = W_old + eta * X_i
  • eta is the learning rate (e.g., 0.01).

Unified update rule (concise)

Let S = W · X_i. Predict Y_pred = 1 if S >= 0, else 0.

W_new = W_old + eta * (Y_true - Y_pred) * X_i

Behaviour:

  • Correct classification → no change (term becomes 0).
  • Positive misclassified → add eta * X_i.
  • Negative misclassified → subtract eta * X_i (worked example below).
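
A worked example with illustrative numbers (the weights and point are assumptions for demonstration) shows the unified rule reproducing Case 1:

import numpy as np

eta = 0.1
W = np.array([0.5, -1.0, 2.0])  # [W0, W1, W2] = [C, A, B]
Xi = np.array([1.0, 2.0, 1.0])  # augmented point [1, x, y]

S = W @ Xi                       # 0.5*1 - 1.0*2 + 2.0*1 = 0.5 -> Y_pred = 1
y_pred = 1 if S >= 0 else 0

y_true = 0                       # the point is actually negative -> Case 1
W_new = W + eta * (y_true - y_pred) * Xi
print(W_new)                     # [0.4, -1.2, 1.9], i.e. W_old - eta * Xi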

Convergence and stopping

  • Stop when there are no misclassified points (convergence), or after a fixed number of epochs.
  • The Perceptron stops as soon as it perfectly classifies the training set; it does not try to "improve the margin" further.

Perceptron‑style training loop (Python)

import numpy as np

def train_perceptron(X, y, epochs=1000, eta=0.01, seed=42):
    # X: shape (n_samples, 2) for features [x, y]; we augment inside
    rng = np.random.default_rng(seed)
    X_aug = np.hstack([np.ones((X.shape[0], 1)), X])  # add X0 = 1
    W = np.zeros(X_aug.shape[1])                      # [C, A, B]

    for _ in range(epochs):
        i = rng.integers(0, X_aug.shape[0])  # pick a random training point
        Xi, yi = X_aug[i], y[i]
        S = W @ Xi
        y_pred = 1 if S >= 0 else 0
        W = W + eta * (yi - y_pred) * Xi     # unified update rule
    return W
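
A usage sketch on a tiny hand‑made separable dataset (the points are illustrative assumptions) confirms the loop converges to zero training misclassifications:

X = np.array([[1.0, 1.0], [2.0, 2.5], [3.0, 1.5],         # class 1
              [-1.0, -1.0], [-2.0, -0.5], [-1.5, -2.0]])  # class 0
y = np.array([1, 1, 1, 0, 0, 0])

W = train_perceptron(X, y)
X_aug = np.hstack([np.ones((X.shape[0], 1)), X])
y_pred = (X_aug @ W >= 0).astype(int)
print("Misclassified:", np.sum(y_pred != y))  # expect 0 on separable data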

Why the Perceptron Trick is suboptimal

  • Stops too early: halts as soon as every training point is classified correctly, even if the line passes very close to one class (small/asymmetric margin).
  • Generalisation risk: a boundary that hugs one class can overfit; test performance may suffer.
  • No probabilistic output: pure perceptron uses a hard threshold rather than probabilities.

Logistic Regression (what we actually train in practice)

Logistic Regression learns a weight vector W and bias b to model a probability:

p(y=1 | x) = sigmoid(W · x + b) where sigmoid(z) = 1 / (1 + exp(-z))

It optimises a smooth logistic (cross‑entropy) loss rather than a 0/1 misclassification rule, typically via gradient descent/solvers. This encourages a boundary that better balances the classes and improves generalisation. Regularisation (L2/L1/Elastic Net) is commonly added.
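
To make the contrast concrete, here is a minimal gradient‑descent sketch of the cross‑entropy objective (an illustrative implementation, not scikit‑learn's actual solver):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logreg(X, y, epochs=5000, eta=0.1):
    # Plain gradient descent on the mean cross-entropy loss
    n, d = X.shape
    W = np.zeros(d)
    b = 0.0
    for _ in range(epochs):
        p = sigmoid(X @ W + b)      # p(y=1 | x) for every sample
        grad_W = X.T @ (p - y) / n  # gradient of the loss w.r.t. W
        grad_b = np.mean(p - y)     # gradient w.r.t. the bias
        W -= eta * grad_W
        b -= eta * grad_b
    return W, b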

Minimal scikit‑learn example

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy data so the example runs end-to-end
X, y = make_classification(n_samples=500, n_features=2, n_informative=2,
                           n_redundant=0, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=42)

# Scale features first; lbfgs is the default solver (saga/liblinear are common alternatives)
logreg = make_pipeline(
    StandardScaler(),
    LogisticRegression(solver="lbfgs", max_iter=1000, random_state=42),
)

logreg.fit(X_train, y_train)
print("Validation accuracy:", logreg.score(X_val, y_val))

Key takeaways

  1. Perceptron Trick is a great teaching tool: simple updates, clear geometry.
  2. Unified update: W_new = W_old + eta * (Y_true - Y_pred) * X_i with augmented inputs.
  3. It can perfectly separate training data but may yield a poor margin and generalisation.
  4. Logistic Regression optimises a probabilistic loss and typically generalises better; add regularisation when needed.