
BFGS / L-BFGS

Quasi-Newton methods. Approximate the Hessian for faster convergence than gradient descent.

bfgs

BFGS with optional line search.

from solvor import bfgs

def objective(x):
    return x[0]**2 + x[1]**2

def grad(x):
    return [2*x[0], 2*x[1]]

result = bfgs(objective, grad, x0=[5.0, 5.0])
print(result.solution)  # Close to [0, 0]

Memory: O(n²) for the Hessian approximation
Best for: Smooth functions and moderate dimensions

lbfgs

Limited-memory BFGS. Stores only a short history of recent updates instead of the full Hessian approximation.

from solvor import lbfgs

result = lbfgs(objective, grad, x0=[5.0, 5.0], memory=10)

Memory: O(n × m) where m is the history size
Best for: Large-scale problems

Comparison

Method Memory Convergence Use When
BFGS O(n²) Superlinear n < 1000, smooth function
L-BFGS O(n × m) Superlinear n > 1000, memory limited

How It Works

The Newton's method dream: Newton's method converges quadratically—insanely fast—by using the Hessian (second derivatives) to find the exact step to the minimum of the local quadratic approximation:

x_{k+1} = x_k - H⁻¹ ∇f(x_k)

The problem: Computing and inverting the Hessian costs O(n²) storage and O(n³) work per iteration. For large n, this is prohibitive.
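
For contrast, here is what one explicit Newton step looks like on the quadratic objective above. This is a minimal NumPy sketch, not part of solvor; forming hess and calling np.linalg.solve is exactly the O(n²)/O(n³) cost that BFGS avoids.

import numpy as np

def f(x):
    return x[0]**2 + x[1]**2

def grad(x):
    return np.array([2*x[0], 2*x[1]])

def hess(x):
    # Constant Hessian of this quadratic: diag(2, 2)
    return np.array([[2.0, 0.0], [0.0, 2.0]])

x = np.array([5.0, 5.0])
# One Newton step: solve H p = -grad(x), then take x + p
p = np.linalg.solve(hess(x), -grad(x))
print(x + p)  # lands exactly at [0, 0] because f is quadratic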

BFGS insight: Approximate the Hessian (and its inverse H⁻¹) using only gradient information. Each iteration updates the Hessian approximation B so that it satisfies the secant condition:

B_{k+1} s_k = y_k

where s_k = x_{k+1} − x_k (step) and y_k = ∇f_{k+1} − ∇f_k (gradient change).

The BFGS update: In practice the inverse approximation H_k ≈ B_k⁻¹ is maintained directly, via a rank-2 update that preserves symmetry and positive definiteness:

H_{k+1} = (I - ρ s y') H_k (I - ρ y s') + ρ s s'
where ρ = 1 / (y' s)

This update satisfies the secant condition in its inverse form (H_{k+1} y_k = s_k) and keeps H symmetric positive definite.
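
In NumPy the update is only a few lines. The helper below is an illustrative sketch (bfgs_inverse_update is not part of solvor's API), using s, y, and ρ exactly as defined above:

import numpy as np

def bfgs_inverse_update(H, s, y):
    # One BFGS update of the inverse Hessian approximation H.
    # Assumes y' s > 0, which a Wolfe line search guarantees.
    rho = 1.0 / (y @ s)
    V = np.eye(len(s)) - rho * np.outer(s, y)
    return V @ H @ V.T + rho * np.outer(s, s)

# Quick check on random data: the updated H satisfies the secant
# condition in inverse form, H_{k+1} y = s.
rng = np.random.default_rng(0)
s = rng.normal(size=3)
y = s + 0.1 * rng.normal(size=3)  # keeps y' s > 0
print(np.allclose(bfgs_inverse_update(np.eye(3), s, y) @ y, s))  # True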

L-BFGS trick: Don't store H at all. Store the last m pairs (s_k, y_k) and compute H∇f implicitly via a two-loop recursion. Storage drops from O(n²) to O(nm).
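
Here is a sketch of that two-loop recursion (an illustrative helper, not solvor's internals). It assumes s_history and y_history hold the most recent pairs in chronological order, oldest first:

import numpy as np

def lbfgs_direction(g, s_history, y_history):
    # Apply the implicit inverse-Hessian approximation to the gradient g
    # using only the stored (s, y) pairs, then negate to get p = -Hg.
    q = np.array(g, dtype=float)
    rhos = [1.0 / (y @ s) for s, y in zip(s_history, y_history)]
    alphas = []

    # First loop: newest pair to oldest.
    for s, y, rho in zip(reversed(s_history), reversed(y_history), reversed(rhos)):
        a = rho * (s @ q)
        alphas.append(a)
        q = q - a * y

    # Initial scaling H0 = gamma * I, a standard heuristic.
    if s_history:
        s, y = s_history[-1], y_history[-1]
        gamma = (s @ y) / (y @ y)
    else:
        gamma = 1.0
    r = gamma * q

    # Second loop: oldest pair to newest.
    for s, y, rho, a in zip(s_history, y_history, rhos, reversed(alphas)):
        b = rho * (y @ r)
        r = r + (a - b) * s

    return -r

With an empty history this reduces to plain steepest descent (p = −g), so L-BFGS needs no special start-up logic.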

Why it works: After seeing enough curvature information, H approximates the true inverse Hessian well enough for superlinear convergence. You get Newton-like speed at gradient-only cost.

The algorithm (a code sketch follows the list):

  1. Compute gradient ∇f(x)
  2. Compute search direction p = −H∇f (via the two-loop recursion over the stored history for L-BFGS)
  3. Line search to find step size α satisfying Wolfe conditions
  4. Update x ← x + αp
  5. Update H (or store the (s, y) pair for L-BFGS)
  6. Repeat until ||∇f|| < ε
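
Putting the steps together, a minimal self-contained sketch of the loop might look like this. It substitutes simple backtracking (Armijo) for the Wolfe line search described above, and skips the update when the curvature condition y' s > 0 fails, so it is a simplification of what a full implementation does:

import numpy as np

def bfgs_minimize(f, grad, x0, tol=1e-6, max_iter=100):
    x = np.array(x0, dtype=float)
    H = np.eye(len(x))                      # inverse Hessian approximation
    g = np.array(grad(x), dtype=float)

    for _ in range(max_iter):
        if np.linalg.norm(g) < tol:         # step 6: convergence test
            break
        p = -H @ g                          # step 2: search direction

        # Step 3: backtracking (Armijo) line search, a simplification
        # of the Wolfe line search used by full implementations.
        alpha = 1.0
        while f(x + alpha * p) > f(x) + 1e-4 * alpha * (g @ p):
            alpha *= 0.5

        x_new = x + alpha * p               # step 4: take the step
        g_new = np.array(grad(x_new), dtype=float)
        s, y = x_new - x, g_new - g

        if y @ s > 1e-10:                   # step 5: rank-2 update, skipped
            rho = 1.0 / (y @ s)             # if the curvature condition fails
            V = np.eye(len(x)) - rho * np.outer(s, y)
            H = V @ H @ V.T + rho * np.outer(s, s)

        x, g = x_new, g_new

    return x

print(bfgs_minimize(lambda x: x[0]**2 + x[1]**2,
                    lambda x: [2*x[0], 2*x[1]],
                    [5.0, 5.0]))            # close to [0, 0]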

Tips

  • Smooth functions only. These methods assume twice-differentiable objectives.
  • Good for ML. Fast convergence on convex losses.
  • Line search built-in. Automatically finds step size.

See Also