Roadmap¶

v0.1 — Rust CPU core (current)¶

[x] Phase 0: design docs + package skeleton
[x] Phase 1: Python reference predictions (FM naive/fast, FFM naive/vectorized), losses (logistic/softmax + label smoothing), correctness tests
[x] Phase 2A: Rust prediction backend (completed 2026-06-11)
toolchain updated (rustc 1.96), pyproject switched to maturin mixed layout (python-source = "python", module modern_fm._rust, pyo3 0.25 abi3-py310)
rust/src/{lib,data,fm,ffm}.rs: FM fast + FFM predictions, dense and CSR, float64, GIL released during compute; input validation (CSR structure, field_ids range, shape mismatches -> ValueError)
modern_fm._backend: private dispatch — Rust when built, NumPy reference fallback otherwise; handles dtype/contiguity coercion
parity tests (tests/test_rust_parity.py): Rust vs reference at atol/rtol 1e-12, dense + CSR, multiple seeds, zero rows, single nonzero, hand-computed examples, bad-input rejection; Rust unit tests in-crate
[x] Phase 2B: Rust training — SGD + AdaGrad for FM and FFM (NumPy reference trainer as ground truth, parity-tested), estimators wired to the backend (binary + regression), seeded reproducibility. rayon row-parallelism and mini-batch deferred to v0.2 (n_jobs=1, batch_size=1 in v0.1).
[x] Phase 3: sklearn API polish (lightweight mixins, check_is_fitted, fit/predict validation) + CategoricalEncoder. Full sklearn check_estimator compliance deferred to v0.2 (no sklearn runtime dep).
[x] Phase 4: early_stopping/eval_set, label_smoothing, class_weight, sample_weight, multiclass softmax (FM), save/load, examples + benchmark.

v0.2 — Training quality & throughput¶

[x] Adam optimizer (optimizer="adam", per-parameter lazy Adam with beta_1/beta_2/epsilon); FM binary/multiclass + FFM, parity-tested vs the NumPy reference. Adam + early stopping is deferred (moments are not round-tripped).
[x] FTRL-Proximal optimizer (optimizer="ftrl", l1_linear/l1_factors/ ftrl_beta): per-coordinate (z, n) state with L1/L2 folded into the update; FM binary/multiclass + FFM, parity-tested vs the NumPy reference; L1 yields exact zeros (composes with mini-batch + n_jobs; FTRL + early stopping deferred)
[x] Rust multiclass-softmax training kernel (fm_fit_multiclass_csr), parity-tested vs the NumPy reference (done ahead of v0.2)
[x] mini-batch (batch_size > 1): per-batch gradient averaging with one update per touched coordinate (FM binary/multiclass + FFM), parity-tested vs the NumPy reference at batch_size ∈ {1, 4, full}; batch_size=1 stays the per-row path
[x] rayon row-parallelism (n_jobs > 1): deterministic parallel-accumulate / serial-apply per batch (FM binary + FFM; multiclass serial); n_jobs=1 matches the reference, n_jobs>1 reproducible per thread count. ~3x on 4 cores for FFM.
[x] full sklearn check_estimator compatibility (estimators subclass sklearn BaseEstimator + Classifier/RegressorMixin; scikit-learn is now a runtime dependency). FFMClassifier.fit(X, y) defaults field_ids to per-column.
[x] libffm format loader/exporter (load_libffm / dump_libffm, round-trip tested)
[x] pandas/polars input (DataFrames via sklearn validate_data; feature_names_in_ recorded, column reorder rejected at predict)
[x] CI + release pipeline: .github/workflows/ci.yml (pytest + ruff across {Linux, macOS, Windows} × py3.10–3.13, plus cargo test/clippy) and release.yml (abi3 wheels via maturin-action + sdist, PyPI trusted publishing on a v* tag). Verified locally: maturin builds an abi3 wheel + sdist that install and run in a clean venv.
[x] Adam + early stopping: moments round-tripped across epochs via the NumPy reference path (fm_fit_reference's adam_state); FM binary/regression + FFM, per-epoch hand-off equals one multi-epoch call exactly.
[x] multiclass + early stopping: per-class optimizer state (AdaGrad/Adam) round-tripped via the reference path; softmax cross-entropy eval metric, round-trip equals a single multi-epoch call.
[x] FFM multiclass softmax (ffm_fit_multiclass_csr): one FFM per class coupled by softmax, all optimizers (SGD/AdaGrad/Adam/FTRL) + mini-batch, parity-tested vs the NumPy reference; FFMClassifier auto-detects >2 classes / loss="softmax". FTRL + early stopping and multiclass FFM + early stopping are scheduled for v0.4 (see below).

v0.4 — API completeness & online learning¶

Closes the (model × task × optimizer × early-stopping) matrix and adds streaming / out-of-core training. No new model family, no release-infra work.

[x] FFMRegressor — squared-loss FFM, the regression counterpart to FMRegressor. Priority: P0.
DoD: RegressorMixin estimator; Rust kernel + NumPy-reference parity test (atol/rtol like existing FFM tests); dense+CSR equivalence; save/load + pickle round-trip; check_estimator passes; exported in __all__; docs/api_design.md + CHANGELOG updated. _reference*.py unchanged.
[x] FTRL + early stopping — round-trip FTRL's per-coordinate (z, n) state across epochs (mirror Adam's adam_state hand-off). Priority: P1.
DoD: remove the _no_ftrl_early_stopping guard (fm.py/ffm.py); per-epoch hand-off equals one multi-epoch call exactly (test); FM binary/multiclass + FFM; state carried via the reference path (no _reference*.py speed change).
[x] Multiclass FFM + early stopping — per-class optimizer state round-tripped for FFM multiclass (mirror FM-multiclass + ES). Priority: P1.
DoD: remove the multiclass guard in ffm.py; round-trip equals a single multi-epoch call (test); softmax cross-entropy eval metric.
[x] partial_fit + warm_start — incremental/streaming training for FM & FFM. Priority: P0.
DoD: partial_fit(X, y, classes=...) (sklearn first-call classes convention) and warm_start=True continue from existing params + optimizer state; classes_ retained; N sequential partial_fit calls equal one fit on the concatenated data under a matched epoch/batch schedule (exact via optimizer-state round-trip); contract added to docs/api_design.md; reference parity preserved; CHANGELOG updated.

v0.5 — Rust ES fast path, FwFM, pooling, CUDA plumbing¶

Performance completion of the early-stopping matrix, the v1.0 headline model pulled forward, a research-honest nfm_pooling, and locally-testable CUDA groundwork (kernels land separately, gated on real-GPU validation — see docs/gpu_backend_plan.md).

[x] Rust early-stopping fast path — every per-epoch optimizer-state hand-off (AdaGrad accumulators, Adam moments, FTRL (z, n), per-class multiclass state) now round-trips through the Rust kernels via optional state/adam_state/ftrl_state PyO3 arguments; _backend dispatches to the reference only when the extension is missing. The epoch-driven ES loop is bit-identical to a single multi-epoch Rust call (tested per optimizer × {FM, FFM} × {binary, multiclass}). ES fits sped up ~14–170x for the previously reference-bound cells (Adam/FTRL/multiclass; FFM+Adam ES 49.5 s → 0.86 s on the synthetic bench); partial_fit/warm_start ride the same path. Priority: P0.
[x] FwFM (FwFMClassifier) — Field-weighted FM (moved up from v1.0; it remains the 1.0 headline). Priority: P0.
DoD met: docs/math_spec_fwfm.md written FIRST (field-pair weights r_{f(i),f(j)} upper-triangle, exact prediction/gradients/updates, R=ones init = plain FM); NumPy reference → Rust kernel (rust/src/fwfm.rs) → FwFMClassifier, parity-tested at each layer (predict 1e-12, train RTOL=1e-9 × optimizer × loss × batch_size, multiclass, ES bit-exact hand-off); collapse-to-FM property test; existing FM/FFM formulas untouched; check_estimator, save/load, partial_fit/warm_start, __all__ + api_design + CHANGELOG. Binary + multiclass; serial (rayon n_jobs for FwFM deferred). (AFM/FEFM/FmFM follow this template post-1.0; FmFM is the research-recommended next variant — one field-pair k×k matrix generalizes FM/FwFM/FvFM/FmFM.)
[x] Bi-interaction pooling (BiInteractionPooling) — the honest "nfm_pooling": a sklearn transformer emitting the k-dim bi-interaction vector 0.5 * ((sum_i x_i v_i)^2 - sum_i (x_i v_i)^2) from a fitted FM for downstream models. As a predictor a linear head over this provably collapses to plain FM (NFM = this + MLP, out of scope), so it ships as a feature transform, not a model. Priority: P1.
DoD met: _reference.fm_bi_interaction (+ _backend.fm_bi_interaction wrapper; no Rust kernel — NumPy is two BLAS-grade sparse matmuls), BiInteractionPooling with fit/transform/get_feature_names_out (+ bi_interaction(X) on the FM estimators), collapse-to-FM identity test at 1e-12, Pipeline + check_estimator + pickle tests, api_design docs.
[x] CUDA plumbing (no kernels) — cuda-backend Cargo feature (cudarc 0.19, default fallback-dynamic-loading = no CUDA toolkit at build time; target-gated off on macOS), rust/src/cuda/mod.rs available(), always- registered has_cuda() pyfunction, _backend.has_cuda(), backend="cuda" accepted at fit with clear errors (RuntimeError without a CUDA build/device, NotImplementedError while no kernels exist — never a silent CPU fallback), CI cuda-check job (cargo check/clippy --features cuda-backend, in the ci-success gate). Kernels (FM CSR prediction first) follow in a separate PR that merges only after validation on a real GPU (runbook in the PR). Priority: P2.

v0.6 — in progress¶

[x] CUDA FFM prediction + context/module cache (gpu_backend_plan milestone 2, pulled ahead of the post-1.0 GPU track): FFM CSR prediction kernel (rust/src/cuda/ffm.rs, one block/row, 256-thread pair-strided loop, no row-nnz/k limit; FFMClassifier binary+multiclass and FFMRegressor inference via set_params(backend="cuda")), plus a process-wide cache of the CUDA context + NVRTC module (rust/src/cuda/mod.rs) so only the first call pays initialization. Parity rtol/atol 1e-10, T4-validated per docs/cuda_validation_runbook.md; bench_cuda.py gained the FFM grid + a cold-start line. FwFM-CUDA and device-resident parameters remain out of scope.
[x] CUDA FM training accumulation (gpu_backend_plan milestone 3): binary/regression FM fit with backend="cuda" — GPU accumulates each mini-batch's data-gradient (dense buffers, atomicAdd, CSR uploaded once per call), the untouched CPU flush applies SGD/AdaGrad/Adam/FTRL, so early stopping, partial_fit, warm_start and FTRL's exact L1 zeros work unchanged. Multiclass/FwFM training still raise. Nondeterministic run-to-run (atomics); parity on final predictions at rtol 1e-7/atol 1e-8; requires compute >= 6.0. Perf follow-ups shipped: device-resident parameters + per-batch compact-vs-dense transfer switch (2 * batch_nnz < n_features).
[x] CUDA FFM training accumulation (gpu_backend_plan milestone 4): binary/regression FFM fit with backend="cuda" (rust/src/cuda/ffm_train.rs) — dense slot-gradient device buffer, host touched-slot enumeration (pair loop minus the k-dot), gather/scatter kernels so V stays device-resident and only touched (feature, field) slots move; CPU flush reused verbatim (all optimizers + ES + partial_fit/warm_start + FTRL L1 zeros). Same caveats as the FM path.

v1.0 — stable release¶

Headline model variant + production-CTR features + docs/bench polish + API freeze. Shipping this milestone = tagging v1.0.0.

[ ] FwFM — pulled into v0.5 (see above); the v1.0 gate keeps its DoD.
[x] Probability calibration — calibrated predict_proba for CTR. Priority: P1.
DoD met via the recommended path: every public classifier is CalibratedClassifierCV-compatible (no library-specific API); tests/test_calibration.py pins ECE + Brier improvement on synthetic miscalibrated data (label-smoothing compression, sigmoid + isotonic) and compatibility for FM/FFM/FwFM; examples/calibration.py + docs/api_design.md "Probability calibration" section.
[x] Model inspection (top interactions) — strongest learned pairwise interactions. Priority: P1.
DoD met: top_interactions(n_top, class_idx=None) on all five estimators (FM |<v_i, v_j>|, FwFM r-weighted, FFM field-aware slots; exact blockwise scan in python/modern_fm/_inspect.py); planted dominant-pair tests + blockwise-vs-naive parity (tests/test_top_interactions.py); examples/top_interactions.py + api_design section.
[x] Real-data benchmark — real CTR sample. Priority: P1.
DoD met with a substitution, documented in the script: the Criteo/Avazu samples are no longer publicly downloadable without credentials (checked 2026-07: labs.criteo.com redirects, the S3 mirror 404s, the HF copy is gated), so benchmarks/bench_criteo_like.py uses the real KDD Cup 2012 track-2 CTR sample from OpenML (Click_prediction_small, zero-credential fetch_openml). 200k rows / 373k one-hot features / 9 fields, fixed seed, stratified split, libFM-style fixed hyperparameters + built-in early stopping (no tuning-to-benchmark); README table with AUC + fit time + machine specs (honest result: factor models match, not beat, LR on this singleton-heavy sample; FwFM closest at 0.6891 vs LR 0.6908).
[x] Documentation site — published API/usage docs. Priority: P1.
DoD met: mkdocs-material (mkdocs.yml + docs/index.md with install + quickstart + model table; nav covers the API reference, data format, math specs, GPU backend and project docs; examples linked); .github/workflows/docs.yml auto-deploys to GitHub Pages (gh-pages branch) on pushes to main touching docs; README links the site (https://matapanino.github.io/modern_fm/). One-time repo setting after the first deploy: Pages -> deploy from branch gh-pages.
[x] API freeze + backward-compat policy — Priority: P0.
DoD met: __all__ audited (5 estimators + pooling + encoder + NotFittedError + libffm I/O + __version__); every public constructor param verified present in docs/api_design.md by an inspect-based sweep (CategoricalEncoder/libffm section added — the one gap); SemVer + backward-compat policy written (docs/compat_policy.md, in the docs-site nav); save_model carries format_version and load_model now rejects newer-format files with a clear upgrade error (tested); the one stale NotImplementedError (dead _check_binary_classes claiming multiclass is unsupported) removed — the remaining NotImplementedError surface is exactly the documented CUDA cell guards.
[x] Release 1.0.0 — Priority: P0.
DoD met: version 1.0.0 in __init__.py / pyproject.toml / Cargo.toml (+ Cargo.lock); CHANGELOG 1.0.0 entry (Unreleased folded in); full CI matrix green on the release PR; tag v1.0.0 → trusted-publishing release.yml (wheels + sdist to PyPI).

v1.0 — criteria¶

The release is "stable" only when all of these hold (the global gate; the per-item DoDs above are the local checks):

Feature-matrix completeness — every documented cell of (model {FM, FFM} × task {regression, binary, multiclass} × optimizer {sgd, adagrad, adam, ftrl} × early-stopping) works, or is intentionally and visibly documented as out-of-scope. No surprise NotImplementedError in the public surface (FFMRegressor, FTRL+ES, multiclass-FFM+ES all closed).
Reference parity — every fast/Rust path proven equal to the NumPy reference; _reference*.py never changed for speed (CLAUDE.md rule, now a release gate).
Numerical stability — no inf/nan at extreme logits (logsumexp/log1p), tested.
Reproducibility — identical results under a fixed random_state across the supported matrix.
Serialization stability — save_model/load_model + pickle round-trip preserve predictions; on-disk format carries a version tag.
sklearn compatibility — check_estimator passes for every public estimator; works in Pipeline / GridSearchCV / clone.
API frozen & documented — __all__ audited; every public param in docs/api_design.md; a written backward-compatibility / SemVer policy.
Quality gates green — pytest green + ruff clean across the CI matrix (3 OS × py3.10–3.13); cargo test + cargo clippy warning-free.
Docs site live — GitHub Pages: install, quickstart, API reference, math specs, examples.
Real-data evidence — Criteo/Avazu-sample AUC + timing in the README.
Production CTR features — calibrated predict_proba + top-interaction inspection shipped.
Released — version 1.0.0, CHANGELOG complete, v1.0.0 tag published.

v1.1 — full CUDA coverage + CUDA-enabled Linux wheels¶

Shipped 2026-07-02 (additive minor under docs/compat_policy.md):

CUDA multiclass (softmax) training for FM and FFM (gpu_backend_plan milestone 5): GPU per-class batch accumulation with the softmax computed in-kernel in CPU class order; the untouched CPU flush runs per class via McState::class_views, so every optimizer, ES and partial_fit ride through. FFM uses a two-kernel design so one class-sized dense gv buffer serves all classes.
FwFM CUDA — prediction + binary/regression + multiclass training (milestone 6): new prediction kernel (FFM geometry, FM-shaped V, R pair weights) and training kernels (compact feature slots + dense n_fields² gr); the R group flushes through GroupStateMut/McGroupState unchanged. With this, every prediction and training cell is CUDA-covered and the per-cell NotImplementedError guards are gone.
CUDA-ready Linux wheels: pip install modern-fm on Linux now ships the cuda-backend feature (cudarc dynamic-loading pinned — nothing links libcuda, manylinux-clean); backend="cuda" works on Colab/Kaggle GPU runtimes without a source build. Gated in CI (cuda-wheel) and in the release workflow (linux-wheel-check: CPU-only import, has_cuda() is False, full suite, auditwheel).

Post-1.1 — model variants & GPU¶

AFM, FEFM/FmFM (each gets its own math spec first, per the FwFM template)
pairwise dropout, interaction pruning
PyTorch-compatible backend prototype
GPU optimizer flush; stacked-gv fast path for small-C FFM multiclass
cuML-style device= switch investigation

Distribution¶

PyPI name: modern-fm (availability confirmed 2026-06-11)
wheels via maturin + cibuildwheel once Rust backend lands