Mathematical Specification¶
This file fixes the exact formulas for v0.1. Implementations (Python reference, Rust backend, future GPU backends) must match these definitions. Do not mix in other FM variants (DeepFM, AFM, FwFM, FEFM); those are roadmap items with their own future specs.
Notation: n features, k latent factors, input row x ∈ R^n,
<a, b> is the dot product.
Factorization Machine (FM)¶
Parameters: bias w0 ∈ R, linear weights w ∈ R^n,
latent factors V ∈ R^{n×k} (row v_i for feature i).
Prediction (degree-2 FM, Rendle 2010):
y_hat(x) = w0 + sum_i w_i x_i + sum_{i<j} <v_i, v_j> x_i x_j
Efficient O(nk) computation of the pairwise term:
sum_{i<j} <v_i, v_j> x_i x_j
= 0.5 * sum_f [ (sum_i v_{i,f} x_i)^2 - sum_i v_{i,f}^2 x_i^2 ]
Field-aware Factorization Machine (FFM)¶
Each feature i belongs to a field f_i ∈ {0, ..., F-1}.
Parameters: w0, w ∈ R^n, and per-(feature, field) latent vectors
V ∈ R^{n×F×k} where V[i, g] is the vector feature i uses against field g.
Interaction for a pair (i, j) with i < j:
interaction(i, j) = <V[i, f_j], V[j, f_i]> x_i x_j
Prediction (Juan et al. 2016):
y_hat(x) = w0 + sum_i w_i x_i + sum_{i<j} <V[i, f_j], V[j, f_i]> x_i x_j
Note: unlike FM there is no O(nk) factorization of the pairwise sum; complexity
is O(z^2 k) in the number of nonzeros z per row. This is why FFM defaults to
smaller k and why the linear term is sometimes omitted in other libraries —
we keep w0 and the linear term (libffm omits them; this is a deliberate
difference, controlled later by hyperparameters if needed).
Binary logistic loss¶
Labels y ∈ {0, 1}, raw score s = y_hat(x), p = sigmoid(s):
loss = - y log(p) - (1 - y) log(1 - p)
Numerically stable form (no exp overflow):
loss = max(s, 0) - s * y + log1p(exp(-|s|))
With label smoothing eps:
y_smooth = y * (1 - eps) + 0.5 * eps
then use y_smooth in place of y.
Multiclass softmax loss¶
One model (one set of FM/FFM parameters) per class c produces logit_c.
p_c = exp(logit_c) / sum_k exp(logit_k) # compute via logsumexp
loss = - sum_c target_c * log(p_c)
With label smoothing eps and true class y:
target_c = 1 - eps if c == y
target_c = eps / (n_classes-1) otherwise
Squared loss (regression)¶
loss = 0.5 * (y_hat - y)^2
Reductions and weights¶
- Per-batch loss is the weighted mean:
sum_i sample_weight_i * loss_i / sum_i sample_weight_i. class_weightmultiplies intosample_weightper row before reduction.- L2 regularization adds
0.5 * l2_linear * ||w||^2 + 0.5 * l2_factors * ||V||^2(biasw0is not regularized).