Optimization Specification (v0.1)¶
Defines the training algorithms the Rust backend implements in Phase 2.
The gradients follow directly from math_spec.md.
Gradients¶
Let s = y_hat(x) and dL/ds the loss derivative
(logistic: sigmoid(s) - y_smooth; squared: s - y;
softmax: p_c - target_c per class logit).
FM parameter gradients for one row x:
ds/dw0 = 1
ds/dw_i = x_i
ds/dv_{i,f} = x_i * (sum_j v_{j,f} x_j) - v_{i,f} * x_i^2
(The term sum_j v_{j,f} x_j is shared across i — compute once per row/factor.)
FFM gradients for nonzero pair (i, j):
ds/dV[i, f_j] += x_i x_j * V[j, f_i]
ds/dV[j, f_i] += x_i x_j * V[i, f_j]
L2 adds l2_linear * w_i and l2_factors * v to the respective gradients
(only for parameters touched by the row — lazy/sparse regularization).
SGD¶
theta -= learning_rate * grad
AdaGrad (v0.1 default)¶
Per-parameter accumulator G (init 0), epsilon 1e-10:
G += grad^2
theta -= learning_rate * grad / sqrt(G + epsilon)
This is what libffm uses for FFM and is robust on sparse data.
Adam (v0.2)¶
Per-parameter lazy Adam. Each coordinate keeps its own moments (m, v) and an
update count t; t increments only when the coordinate is touched by a row
(consistent with lazy L2 above), so bias correction adapts per coordinate and the
update is exactly reproducible. Hyperparameters beta_1 (default 0.9), beta_2
(default 0.999), epsilon (default 1e-8); learning_rate is the step size α.
t += 1
m = beta_1 * m + (1 - beta_1) * grad
v = beta_2 * v + (1 - beta_2) * grad^2
m_hat = m / (1 - beta_1^t)
v_hat = v / (1 - beta_2^t)
theta -= learning_rate * m_hat / (sqrt(v_hat) + epsilon)
epsilon is added outside the square root. grad already includes lazy L2
(+ l2_linear * w_i / + l2_factors * v), exactly as for SGD/AdaGrad. The Rust
kernel uses beta.powf(t) to match Python's beta ** t.
In the Rust kernel the (m, v, t) accumulators are internal to a single fit
call. Adam + early stopping is supported by round-tripping the moments across
epochs through the NumPy reference trainer (fm_fit_reference's adam_state
argument; the per-epoch hand-off equals one multi-epoch call exactly). Because
that path is the reference, not the Rust kernel, Adam + early stopping is suited
to the moderate data early stopping is typically used on; for large data use Adam
without early stopping (Rust) or AdaGrad with early stopping (Rust round-trip).
FTRL-Proximal (v0.2)¶
Per-coordinate FTRL-Proximal (McMahan et al. 2013), the standard CTR optimizer.
Unlike SGD/AdaGrad/Adam, L1/L2 are folded into the update, not the gradient —
so the kernels pass FTRL the pure data gradient grad (no + l2 * theta). Each
coordinate keeps state (z, n) (init 0); the weight is reconstructed from (z, n)
each step, so theta always holds the current FTRL weight. Hyperparameters:
learning_rate is α (step size); ftrl_beta is β (default 1.0); l1_linear/
l1_factors are the L1 strengths (default 0.0); l2_linear/l2_factors reused
as the L2 strengths. The bias w0 uses l1 = l2 = 0.
sigma = (sqrt(n + grad^2) - sqrt(n)) / alpha
z += grad - sigma * theta
n += grad^2
theta = 0 if |z| <= l1
= -(z - sign(z)*l1) / ((beta + sqrt(n))/alpha + l2) otherwise
The (z, n) state is internal to one fit call; L1 produces exact zeros
(|z| <= l1), which AdaGrad/Adam never do. With mini-batch, FTRL steps once per
batch on the batch-mean data gradient; with n_jobs>1 the parallel accumulator
sums only data gradients, so FTRL composes with both unchanged. FTRL + early
stopping is deferred in v0.2 (its (z, n) state is not round-tripped), like
Adam. l1_* > 0 with a non-FTRL optimizer raises ValueError.
Training loop¶
- shuffle row order each epoch with the seeded RNG
- mini-batch size
batch_size(gradient averaged over batch); FFM commonly uses batch_size=1 (pure SGD) — both must work - initialization:
w0 = 0,w = 0,V ~ Normal(0, init_scale)(FM) /Uniform(0, init_scale/sqrt(k))(FFM, following libffm) from the seeded RNG - early stopping: evaluate
eval_metricon eval_set each epoch; stop afterpatienceepochs withoutmin_deltaimprovement; optionally restore best
Mini-batch (v0.2)¶
Each epoch's shuffled row order is consumed in contiguous chunks of batch_size
(the last chunk may be shorter). Within one batch:
- Parameters are frozen at batch start. Every row's score
sand per-row data-gradientdL/ds * ds/dthetaare computed against the batch-start parameters (no row in a batch sees another row's update). - Accumulate the data-gradient per touched coordinate, summed over the rows
in the batch (FM linear
g x_i; FM factorg (x_i * cache_f - v_{i,f} x_i^2); FFM per(feature, field)slot — exactly the batch_size=1 terms, just summed). - One update per touched coordinate, applied once at batch end:
g_theta = (1/B) * sum_rows(data-grad) + l2 * theta, whereBis the number of rows in the batch andthetais the batch-start value. Lazy L2 is added once here (not per row). The optimizer step (SGD / AdaGrad / Adam) then runs once: AdaGrad accumulatorsG += g_theta^2and Adam's per-coordinatetadvance once per batch, not once per row.
Coordinates untouched by the batch are neither regularized nor stepped (lazy).
w0 is touched by every row, so it is updated every batch. The batch-mean
divides by the actual batch length, so a full final partial batch is handled
correctly.
batch_size=1 reduces exactly to the per-row update above (one row per
batch ⇒ mean over one row, L2 on the frozen value = current value), so it is
bit-for-bit identical to the v0.1 path. Rows within a batch are accumulated in
row_orders order so the floating-point sum matches between the NumPy reference
and the Rust kernel.
Parallelism (Rust, v0.2)¶
Deterministic parallel-accumulate, serial-apply with rayon (n_jobs
threads), not lock-free Hogwild — chosen so results stay reproducible and
parity-testable. Within each batch:
- the batch's rows are split into
n_jobscontiguous chunks; each chunk's gradients are accumulated into its own partial accumulator on a separate thread (the frozen batch-start parameters are read-only, so no races); - the partial accumulators are reduced in chunk order into one, then a single serial flush applies the updates (and steps the AdaGrad/Adam state once).
Parallelism therefore only acts when batch_size > 1 (a batch_size=1 batch is
a single row/chunk). The contract:
n_jobs=1is the canonical serial path and is bit-identical to the NumPy reference;n_jobs>1is reproducible for a fixedn_jobs(contiguous chunking + fixed-order reduction), differing fromn_jobs=1only in floating-point summation order (parity-tested to a loose tolerance);- different
n_jobsvalues may differ at that tolerance.
FM binary and FFM parallelize; multiclass-softmax FM trains serially in v0.2
(n_jobs is accepted but ignored for it).