Roadmap
v0.1: Rust CPU core (current)
- [x] Phase 0: design docs + package skeleton
- [x] Phase 1: Python reference predictions (FM naive/fast, FFM naive/vectorized), losses (logistic/softmax + label smoothing), correctness tests
- [x] Phase 2A: Rust prediction backend (completed 2026-06-11)
  - toolchain updated (rustc 1.96), pyproject switched to maturin mixed layout (`python-source = "python"`, module `modern_fm._rust`, pyo3 0.25 abi3-py310)
  - `rust/src/{lib,data,fm,ffm}.rs`: FM fast + FFM predictions, dense and CSR, float64, GIL released during compute; input validation (CSR structure, field_ids range, shape mismatches -> ValueError)
  - `modern_fm._backend`: private dispatch (Rust when built, NumPy reference fallback otherwise); handles dtype/contiguity coercion
  - parity tests (`tests/test_rust_parity.py`): Rust vs reference at atol/rtol 1e-12, dense + CSR, multiple seeds, zero rows, single nonzero, hand-computed examples, bad-input rejection; Rust unit tests in-crate
- [x] Phase 2B: Rust training with SGD + AdaGrad for FM and FFM (NumPy reference trainer as ground truth, parity-tested), estimators wired to the backend (binary + regression), seeded reproducibility. `rayon` row-parallelism and mini-batch deferred to v0.2 (n_jobs=1, batch_size=1 in v0.1).
- [x] Phase 3: sklearn API polish (lightweight mixins, check_is_fitted, fit/predict validation) + `CategoricalEncoder`. Full sklearn `check_estimator` compliance deferred to v0.2 (no sklearn runtime dep).
- [x] Phase 4: early_stopping/eval_set, label_smoothing, class_weight, sample_weight, multiclass softmax (FM), save/load, examples + benchmark.
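The "FM naive/fast" pair in Phase 1 presumably refers to the two standard ways of computing the FM pairwise term: the direct double loop over feature pairs and the O(n·k) sum-of-squares reformulation. The sketch below is a minimal pure-Python illustration of that identity, not the project's reference code; the function names `fm_pairwise_naive` and `fm_pairwise_fast` are hypothetical.

```python
import random


def fm_pairwise_naive(x, V):
    """Sum over i < j of <v_i, v_j> * x_i * x_j, in O(n^2 * k)."""
    n, k = len(V), len(V[0])
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            dot = sum(V[i][f] * V[j][f] for f in range(k))
            total += dot * x[i] * x[j]
    return total


def fm_pairwise_fast(x, V):
    """Same quantity as 0.5 * sum_f ((sum_i v_if x_i)^2 - sum_i (v_if x_i)^2), in O(n * k)."""
    n, k = len(V), len(V[0])
    total = 0.0
    for f in range(k):
        s = 0.0      # running sum of v_if * x_i
        s_sq = 0.0   # running sum of (v_if * x_i)^2
        for i in range(n):
            t = V[i][f] * x[i]
            s += t
            s_sq += t * t
        total += 0.5 * (s * s - s_sq)
    return total


# Small random instance with one zero entry, mimicking a sparse row.
rng = random.Random(0)
n_features, k = 6, 3
x = [rng.uniform(-1.0, 1.0) for _ in range(n_features)]
x[2] = 0.0
V = [[rng.uniform(-0.5, 0.5) for _ in range(k)] for _ in range(n_features)]

naive = fm_pairwise_naive(x, V)
fast = fm_pairwise_fast(x, V)
```

A parity test in the spirit of the ones listed above would compare `naive` and `fast` at a tight absolute tolerance.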
v0.2: Training quality & throughput
- [x] Adam optimizer (`optimizer="adam"`, per-parameter lazy Adam with `beta_1`/`beta_2`/`epsilon`); FM binary/multiclass + FFM, parity-tested vs the NumPy reference. Adam + early stopping was deferred at this point (moments not yet round-tripped) and is closed by the later item in this list.
- [x] FTRL-Proximal optimizer (`optimizer="ftrl"`, `l1_linear`/`l1_factors`/`ftrl_beta`): per-coordinate `(z, n)` state with L1/L2 folded into the update; FM binary/multiclass + FFM, parity-tested vs the NumPy reference; L1 yields exact zeros (composes with mini-batch + n_jobs; FTRL + early stopping deferred)
- [x] Rust multiclass-softmax training kernel (`fm_fit_multiclass_csr`), parity-tested vs the NumPy reference (done ahead of v0.2)
- [x] mini-batch (`batch_size > 1`): per-batch gradient averaging with one update per touched coordinate (FM binary/multiclass + FFM), parity-tested vs the NumPy reference at batch_size ∈ {1, 4, full}; batch_size=1 stays the per-row path
- [x] `rayon` row-parallelism (`n_jobs > 1`): deterministic parallel-accumulate / serial-apply per batch (FM binary + FFM; multiclass serial); `n_jobs=1` matches the reference, `n_jobs>1` reproducible per thread count. ~3x on 4 cores for FFM.
- [x] full sklearn `check_estimator` compatibility (estimators subclass sklearn `BaseEstimator` + `Classifier/RegressorMixin`; scikit-learn is now a runtime dependency). `FFMClassifier.fit(X, y)` defaults `field_ids` to per-column.
- [x] libffm format loader/exporter (`load_libffm`/`dump_libffm`, round-trip tested)
- [x] pandas/polars input (DataFrames via sklearn `validate_data`; `feature_names_in_` recorded, column reorder rejected at predict)
- [x] CI + release pipeline: `.github/workflows/ci.yml` (pytest + ruff across {Linux, macOS, Windows} × py3.10–3.13, plus cargo test/clippy) and `release.yml` (abi3 wheels via maturin-action + sdist, PyPI trusted publishing on a `v*` tag). Verified locally: maturin builds an abi3 wheel + sdist that install and run in a clean venv.
- [x] Adam + early stopping: moments round-tripped across epochs via the NumPy reference path (`fm_fit_reference`'s `adam_state`); FM binary/regression + FFM, per-epoch hand-off equals one multi-epoch call exactly.
- [x] multiclass + early stopping: per-class optimizer state (AdaGrad/Adam) round-tripped via the reference path; softmax cross-entropy eval metric, round-trip equals a single multi-epoch call.
- [x] FFM multiclass softmax (`ffm_fit_multiclass_csr`): one FFM per class coupled by softmax, all optimizers (SGD/AdaGrad/Adam/FTRL) + mini-batch, parity-tested vs the NumPy reference; `FFMClassifier` auto-detects >2 classes / `loss="softmax"`.

FTRL + early stopping and multiclass FFM + early stopping are scheduled for v0.4 (see below).
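The claim that FTRL's L1 term yields exact zeros follows from the closed-form weight in the published FTRL-Proximal update. The sketch below uses that textbook per-coordinate form; whether the library's kernels use exactly this arrangement is an assumption, and the names `alpha`, `beta`, `l1`, `l2`, `ftrl_weight`, `ftrl_step` and `run` are hypothetical (the roadmap only names `l1_linear`, `l1_factors` and `ftrl_beta`).

```python
import math


def ftrl_weight(z, n, alpha, beta, l1, l2):
    """Closed-form weight from the (z, n) state; exactly 0.0 when |z| <= l1."""
    if abs(z) <= l1:
        return 0.0
    sign = 1.0 if z > 0 else -1.0
    return -(z - sign * l1) / ((beta + math.sqrt(n)) / alpha + l2)


def ftrl_step(z, n, g, alpha, beta, l1, l2):
    """Fold one gradient g into the per-coordinate (z, n) state."""
    w = ftrl_weight(z, n, alpha, beta, l1, l2)
    sigma = (math.sqrt(n + g * g) - math.sqrt(n)) / alpha
    return z + g - sigma * w, n + g * g


def run(grads, alpha=0.1, beta=1.0, l1=0.1, l2=0.0):
    """Apply a gradient sequence to one coordinate and return its final weight."""
    z, n = 0.0, 0.0
    for g in grads:
        z, n = ftrl_step(z, n, g, alpha, beta, l1, l2)
    return ftrl_weight(z, n, alpha, beta, l1, l2)


weak_w = run([0.05, -0.03, 0.02])   # accumulated z stays inside the L1 band
strong_w = run([1.0, 1.0, 1.0])     # consistent gradient pushes the weight away from zero
```

Here `weak_w` is an exact `0.0` rather than a small float, which is the sparsity property the item above refers to, while `strong_w` ends up negative.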
v0.4: API completeness & online learning
Closes the (model × task × optimizer × early-stopping) matrix and adds streaming / out-of-core training. No new model family, no release-infra work.
- [x] FFMRegressor: squared-loss FFM, the regression counterpart to `FMRegressor`. Priority: P0.
  - DoD: `RegressorMixin` estimator; Rust kernel + NumPy-reference parity test (atol/rtol like existing FFM tests); dense+CSR equivalence; save/load + pickle round-trip; `check_estimator` passes; exported in `__all__`; `docs/api_design.md` + CHANGELOG updated. `_reference*.py` unchanged.
- [x] FTRL + early stopping: round-trip FTRL's per-coordinate `(z, n)` state across epochs (mirror Adam's `adam_state` hand-off). Priority: P1.
  - DoD: remove the `_no_ftrl_early_stopping` guard (`fm.py`/`ffm.py`); per-epoch hand-off equals one multi-epoch call exactly (test); FM binary/multiclass + FFM; state carried via the reference path (no `_reference*.py` speed change).
- [x] Multiclass FFM + early stopping: per-class optimizer state round-tripped for FFM multiclass (mirror FM-multiclass + ES). Priority: P1.
  - DoD: remove the multiclass guard in `ffm.py`; round-trip equals a single multi-epoch call (test); softmax cross-entropy eval metric.
- [x] partial_fit + warm_start: incremental/streaming training for FM & FFM. Priority: P0.
  - DoD: `partial_fit(X, y, classes=...)` (sklearn first-call `classes` convention) and `warm_start=True` continue from existing params + optimizer state; `classes_` retained; N sequential `partial_fit` calls equal one `fit` on the concatenated data under a matched epoch/batch schedule (exact via optimizer-state round-trip); contract added to `docs/api_design.md`; reference parity preserved; CHANGELOG updated.
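The exactness requirement for `partial_fit` hinges on carrying the optimizer state between calls, not only the parameters. The toy below illustrates that point with a linear model and per-row AdaGrad; `ToyAdaGradRegressor` is a hypothetical stand-in written for this sketch and is not one of the library's estimators.

```python
import math


class ToyAdaGradRegressor:
    """Linear model with squared loss and per-row AdaGrad updates (illustration only)."""

    def __init__(self, n_features, lr=0.1, eps=1e-8):
        self.w = [0.0] * n_features
        self.g2 = [0.0] * n_features  # AdaGrad accumulators: the optimizer state
        self.lr = lr
        self.eps = eps

    def partial_fit(self, X, y):
        for row, target in zip(X, y):
            # Residual is computed once per row, before any weight changes.
            err = sum(w * x for w, x in zip(self.w, row)) - target
            for j, x in enumerate(row):
                if x == 0.0:
                    continue  # untouched coordinate: no update, no state change
                g = err * x
                self.g2[j] += g * g
                self.w[j] -= self.lr * g / (math.sqrt(self.g2[j]) + self.eps)
        return self


X = [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0], [1.0, 1.0, 0.0], [2.0, 0.0, 1.0]]
y = [1.0, 0.0, 2.0, 1.0]

# One pass over all rows.
one_call = ToyAdaGradRegressor(3).partial_fit(X, y)

# Same rows in two chunks, state carried across the calls.
chunked = ToyAdaGradRegressor(3)
chunked.partial_fit(X[:2], y[:2])
chunked.partial_fit(X[2:], y[2:])

# Same chunks, but the accumulators are dropped between calls.
state_lost = ToyAdaGradRegressor(3)
state_lost.partial_fit(X[:2], y[:2])
state_lost.g2 = [0.0, 0.0, 0.0]
state_lost.partial_fit(X[2:], y[2:])
```

With state carried over, `chunked.w` equals `one_call.w` exactly because the same floating-point operations run in the same order; resetting the accumulators changes the step sizes and the weights diverge.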
v0.5: Rust ES fast path, FwFM, pooling, CUDA plumbing
Performance completion of the early-stopping matrix, the v1.0 headline model pulled forward, a research-honest nfm_pooling, and locally-testable CUDA groundwork (kernels land separately, gated on real-GPU validation; see docs/gpu_backend_plan.md).
- [x] Rust early-stopping fast path: every per-epoch optimizer-state hand-off (AdaGrad accumulators, Adam moments, FTRL `(z, n)`, per-class multiclass state) now round-trips through the Rust kernels via optional `state`/`adam_state`/`ftrl_state` PyO3 arguments; `_backend` dispatches to the reference only when the extension is missing. The epoch-driven ES loop is bit-identical to a single multi-epoch Rust call (tested per optimizer × {FM, FFM} × {binary, multiclass}). ES fits sped up ~14–170x for the previously reference-bound cells (Adam/FTRL/multiclass; FFM+Adam ES 49.5 s → 0.86 s on the synthetic bench); `partial_fit`/`warm_start` ride the same path. Priority: P0.
- [x] FwFM (`FwFMClassifier`): Field-weighted FM (moved up from v1.0; it remains the 1.0 headline). Priority: P0.
  - DoD met: `docs/math_spec_fwfm.md` written FIRST (field-pair weights `r_{f(i),f(j)}` upper-triangle, exact prediction/gradients/updates, R=ones init = plain FM); NumPy reference → Rust kernel (`rust/src/fwfm.rs`) → `FwFMClassifier`, parity-tested at each layer (predict 1e-12, train RTOL=1e-9 × optimizer × loss × batch_size, multiclass, ES bit-exact hand-off); collapse-to-FM property test; existing FM/FFM formulas untouched; `check_estimator`, save/load, `partial_fit`/`warm_start`, `__all__` + api_design + CHANGELOG. Binary + multiclass; serial (`rayon` `n_jobs` for FwFM deferred). (AFM/FEFM/FmFM follow this template post-1.0; FmFM is the research-recommended next variant: one field-pair k×k matrix generalizes FM/FwFM/FvFM/FmFM.)
- [x] Bi-interaction pooling (`BiInteractionPooling`), the honest "nfm_pooling": a sklearn transformer emitting the k-dim bi-interaction vector `0.5 * ((sum_i x_i v_i)^2 - sum_i (x_i v_i)^2)` from a fitted FM for downstream models. As a predictor a linear head over this provably collapses to plain FM (NFM = this + MLP, out of scope), so it ships as a feature transform, not a model. Priority: P1.
  - DoD met: `_reference.fm_bi_interaction` (+ `_backend.fm_bi_interaction` wrapper; no Rust kernel, since NumPy is two BLAS-grade sparse matmuls), `BiInteractionPooling` with `fit`/`transform`/`get_feature_names_out` (+ `bi_interaction(X)` on the FM estimators), collapse-to-FM identity test at 1e-12, Pipeline + `check_estimator` + pickle tests, api_design docs.
- [x] CUDA plumbing (no kernels): `cuda-backend` Cargo feature (cudarc 0.19, default `fallback-dynamic-loading` = no CUDA toolkit at build time; target-gated off on macOS), `rust/src/cuda/mod.rs` `available()`, always-registered `has_cuda()` pyfunction, `_backend.has_cuda()`, `backend="cuda"` accepted at fit with clear errors (RuntimeError without a CUDA build/device, NotImplementedError while no kernels exist; never a silent CPU fallback), CI `cuda-check` job (`cargo check/clippy --features cuda-backend`, in the `ci-success` gate). Kernels (FM CSR prediction first) follow in a separate PR that merges only after validation on a real GPU (runbook in the PR). Priority: P2.
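The "R=ones init = plain FM" property in the FwFM item can be checked on the pairwise term alone. The sketch below is a minimal pure-Python illustration under stated assumptions: one feature per field, field-pair weights read from the upper triangle of a square `R`, and hypothetical helper names `fm_pairwise` and `fwfm_pairwise`. It is not the project's reference implementation, and the authoritative formulas are the ones in `docs/math_spec_fwfm.md`.

```python
import random


def fm_pairwise(x, V):
    """Plain FM pairwise term: sum over i < j of <v_i, v_j> * x_i * x_j."""
    n = len(x)
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            dot = sum(p * q for p, q in zip(V[i], V[j]))
            total += dot * x[i] * x[j]
    return total


def fwfm_pairwise(x, V, field_of, R):
    """FwFM pairwise term: each pair is scaled by r_{f(i), f(j)} from R's upper triangle."""
    n = len(x)
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            a, b = sorted((field_of[i], field_of[j]))  # upper-triangle lookup
            dot = sum(p * q for p, q in zip(V[i], V[j]))
            total += R[a][b] * dot * x[i] * x[j]
    return total


rng = random.Random(1)
n_features, k, n_fields = 4, 3, 4
field_of = [2, 0, 1, 3]  # assumption: one feature per field
x = [rng.uniform(-1.0, 1.0) for _ in range(n_features)]
V = [[rng.uniform(-0.5, 0.5) for _ in range(k)] for _ in range(n_features)]

ones = [[1.0] * n_fields for _ in range(n_fields)]
fm_value = fm_pairwise(x, V)
fwfm_ones = fwfm_pairwise(x, V, field_of, ones)

# Zeroing one field-pair weight removes exactly that pair's contribution.
gated_R = [row[:] for row in ones]
gated_R[0][2] = 0.0  # fields of features 1 and 0
fwfm_gated = fwfm_pairwise(x, V, field_of, gated_R)
pair_01 = sum(p * q for p, q in zip(V[0], V[1])) * x[0] * x[1]
```

With `R` all ones the two functions agree, which is the collapse-to-FM property; `fwfm_gated` differs from `fm_value` by `pair_01`, showing how a single field-pair weight scales one interaction.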
v0.6: in progress
- [x] CUDA FFM prediction + context/module cache (gpu_backend_plan milestone 2, pulled ahead of the post-1.0 GPU track): FFM CSR prediction kernel (`rust/src/cuda/ffm.rs`, one block/row, 256-thread pair-strided loop, no row-nnz/k limit; `FFMClassifier` binary+multiclass and `FFMRegressor` inference via `set_params(backend="cuda")`), plus a process-wide cache of the CUDA context + NVRTC module (`rust/src/cuda/mod.rs`) so only the first call pays initialization. Parity rtol/atol 1e-10, T4-validated per `docs/cuda_validation_runbook.md`; `bench_cuda.py` gained the FFM grid + a cold-start line. FwFM-CUDA and device-resident parameters remain out of scope.
- [x] CUDA FM training accumulation (gpu_backend_plan milestone 3): binary/regression FM fit with `backend="cuda"`. The GPU accumulates each mini-batch's data-gradient (dense buffers, `atomicAdd`, CSR uploaded once per call) and the untouched CPU flush applies SGD/AdaGrad/Adam/FTRL, so early stopping, `partial_fit`, `warm_start` and FTRL's exact L1 zeros work unchanged. Multiclass/FwFM training still raise. Nondeterministic run-to-run (atomics); parity on final predictions at rtol 1e-7/atol 1e-8; requires compute capability >= 6.0. Perf follow-ups shipped: device-resident parameters + per-batch compact-vs-dense transfer switch (`2 * batch_nnz < n_features`).
- [x] CUDA FFM training accumulation (gpu_backend_plan milestone 4): binary/regression FFM fit with `backend="cuda"` (`rust/src/cuda/ffm_train.rs`). Dense slot-gradient device buffer, host touched-slot enumeration (pair loop minus the k-dot), gather/scatter kernels so V stays device-resident and only touched (feature, field) slots move; CPU flush reused verbatim (all optimizers + ES + partial_fit/warm_start + FTRL L1 zeros). Same caveats as the FM path.
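The "nondeterministic run-to-run (atomics)" caveat comes down to floating-point addition not being associative: atomic accumulation does not fix the order in which contributions land, so the last bits of a sum can vary. The snippet below shows the effect on the CPU with plain Python floats; it is only an analogy for the GPU behaviour, and `accumulate` is a hypothetical helper.

```python
import math


def accumulate(contributions):
    """Add contributions one by one, in the order given."""
    total = 0.0
    for g in contributions:
        total += g
    return total


grads = [0.1, 0.2, 0.3]
forward = accumulate(grads)             # ((0.1 + 0.2) + 0.3)
backward = accumulate(reversed(grads))  # ((0.3 + 0.2) + 0.1)

# The two orders disagree in the last bit but agree at a loose tolerance.
differs_bitwise = forward != backward
agrees_loosely = math.isclose(forward, backward, rel_tol=1e-7, abs_tol=1e-8)
```

This is why an order-dependent accumulation path is compared against the reference at a tolerance rather than bit-for-bit.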
v1.0: stable release
Headline model variant + production-CTR features + docs/bench polish + API freeze. Shipping this milestone = tagging v1.0.0.
- [ ] FwFM: pulled into v0.5 (see above); the v1.0 gate keeps its DoD.
- [x] Probability calibration: calibrated `predict_proba` for CTR. Priority: P1.
  - DoD met via the recommended path: every public classifier is `CalibratedClassifierCV`-compatible (no library-specific API); `tests/test_calibration.py` pins ECE + Brier improvement on synthetic miscalibrated data (label-smoothing compression, sigmoid + isotonic) and compatibility for FM/FFM/FwFM; `examples/calibration.py` + `docs/api_design.md` "Probability calibration" section.
- [x] Model inspection (top interactions): strongest learned pairwise interactions. Priority: P1.
  - DoD met: `top_interactions(n_top, class_idx=None)` on all five estimators (FM `|<v_i, v_j>|`, FwFM r-weighted, FFM field-aware slots; exact blockwise scan in `python/modern_fm/_inspect.py`); planted dominant-pair tests + blockwise-vs-naive parity (`tests/test_top_interactions.py`); `examples/top_interactions.py` + api_design section.
- [x] Real-data benchmark: real CTR sample. Priority: P1.
  - DoD met with a substitution, documented in the script: the Criteo/Avazu samples are no longer publicly downloadable without credentials (checked 2026-07: labs.criteo.com redirects, the S3 mirror 404s, the HF copy is gated), so `benchmarks/bench_criteo_like.py` uses the real KDD Cup 2012 track-2 CTR sample from OpenML (`Click_prediction_small`, zero-credential `fetch_openml`). 200k rows / 373k one-hot features / 9 fields, fixed seed, stratified split, libFM-style fixed hyperparameters + built-in early stopping (no tuning-to-benchmark); README table with AUC + fit time + machine specs (honest result: factor models match, not beat, LR on this singleton-heavy sample; FwFM closest at 0.6891 vs LR 0.6908).
- [x] Documentation site: published API/usage docs. Priority: P1.
  - DoD met: mkdocs-material (`mkdocs.yml` + `docs/index.md` with install + quickstart + model table; nav covers the API reference, data format, math specs, GPU backend and project docs; examples linked); `.github/workflows/docs.yml` auto-deploys to GitHub Pages (`gh-pages` branch) on pushes to main touching docs; README links the site (https://matapanino.github.io/modern_fm/). One-time repo setting after the first deploy: Pages -> deploy from branch `gh-pages`.
- [x] API freeze + backward-compat policy. Priority: P0.
  - DoD met: `__all__` audited (5 estimators + pooling + encoder + `NotFittedError` + libffm I/O + `__version__`); every public constructor param verified present in `docs/api_design.md` by an inspect-based sweep (CategoricalEncoder/libffm section added, the one gap); SemVer + backward-compat policy written (`docs/compat_policy.md`, in the docs-site nav); `save_model` carries `format_version` and `load_model` now rejects newer-format files with a clear upgrade error (tested); the one stale `NotImplementedError` (dead `_check_binary_classes` claiming multiclass is unsupported) removed; the remaining `NotImplementedError` surface is exactly the documented CUDA cell guards.
- [x] Release 1.0.0. Priority: P0.
  - DoD met: version `1.0.0` in `__init__.py`/`pyproject.toml`/`Cargo.toml` (+ `Cargo.lock`); CHANGELOG `1.0.0` entry (Unreleased folded in); full CI matrix green on the release PR; tag `v1.0.0` → trusted-publishing `release.yml` (wheels + sdist to PyPI).
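The calibration item pins "ECE + Brier improvement" on probabilities that have been compressed toward 0.5. The sketch below shows one common way to compute both metrics (equal-width bins for ECE) on a hand-built example; the binning scheme, the function names and the data are assumptions for illustration and are not taken from `tests/test_calibration.py`.

```python
def brier_score(y_true, p):
    """Mean squared error between predicted probability and the 0/1 label."""
    return sum((pi - yi) ** 2 for yi, pi in zip(y_true, p)) / len(y_true)


def expected_calibration_error(y_true, p, n_bins=10):
    """Weighted mean |accuracy - confidence| over equal-width probability bins."""
    bins = [[] for _ in range(n_bins)]
    for yi, pi in zip(y_true, p):
        idx = min(int(pi * n_bins), n_bins - 1)  # p == 1.0 goes to the last bin
        bins[idx].append((yi, pi))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        acc = sum(yi for yi, _ in members) / len(members)
        conf = sum(pi for _, pi in members) / len(members)
        ece += len(members) / len(y_true) * abs(acc - conf)
    return ece


# 20 samples: a group that is positive 80% of the time and one that is positive 20%.
y = [1] * 8 + [0] * 2 + [1] * 2 + [0] * 8
calibrated = [0.8] * 10 + [0.2] * 10   # matches the observed rates
compressed = [0.6] * 10 + [0.4] * 10   # same ranking, squeezed toward 0.5

ece_calibrated = expected_calibration_error(y, calibrated)
ece_compressed = expected_calibration_error(y, compressed)
brier_calibrated = brier_score(y, calibrated)
brier_compressed = brier_score(y, compressed)
```

Both predictions rank the samples identically, so AUC cannot tell them apart, yet the compressed one scores worse on ECE and Brier; that gap is what a calibration wrapper is expected to close.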
v1.0: criteria
The release is "stable" only when all of these hold (the global gate; the per-item DoDs above are the local checks):
- Feature-matrix completeness: every documented cell of (model {FM, FFM} × task {regression, binary, multiclass} × optimizer {sgd, adagrad, adam, ftrl} × early-stopping) works, or is intentionally and visibly documented as out-of-scope. No surprise `NotImplementedError` in the public surface (FFMRegressor, FTRL+ES, multiclass-FFM+ES all closed).
- Reference parity: every fast/Rust path proven equal to the NumPy reference; `_reference*.py` never changed for speed (CLAUDE.md rule, now a release gate).
- Numerical stability: no inf/nan at extreme logits (logsumexp/log1p), tested.
- Reproducibility: identical results under a fixed `random_state` across the supported matrix.
- Serialization stability: `save_model`/`load_model` + pickle round-trip preserve predictions; on-disk format carries a version tag.
- sklearn compatibility: `check_estimator` passes for every public estimator; works in `Pipeline`/`GridSearchCV`/`clone`.
- API frozen & documented: `__all__` audited; every public param in `docs/api_design.md`; a written backward-compatibility / SemVer policy.
- Quality gates green: `pytest` green + `ruff` clean across the CI matrix (3 OS × py3.10–3.13); `cargo test` + `cargo clippy` warning-free.
- Docs site live on GitHub Pages: install, quickstart, API reference, math specs, examples.
- Real-data evidence: Criteo/Avazu-sample AUC + timing in the README (met with the KDD Cup 2012 substitution described under "Real-data benchmark" above).
- Production CTR features: calibrated `predict_proba` + top-interaction inspection shipped.
- Released: version `1.0.0`, CHANGELOG complete, `v1.0.0` tag published.
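The numerical-stability criterion names logsumexp and log1p. The sketch below shows the usual stable formulations of the logistic and softmax cross-entropy losses built on those two functions; it is a generic illustration with hypothetical function names, not the library's loss code.

```python
import math


def logistic_loss(logit, y):
    """Binary log loss for label y in {0, 1}, finite for any finite logit."""
    z = logit if y == 1 else -logit
    # loss = log(1 + exp(-z)), written as max(-z, 0) + log1p(exp(-|z|)) so exp never overflows
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


def logsumexp(logits):
    """log(sum(exp(v))) with the maximum factored out to avoid overflow."""
    m = max(logits)
    return m + math.log(sum(math.exp(v - m) for v in logits))


def softmax_cross_entropy(logits, target):
    """-log softmax(logits)[target], computed without forming the softmax."""
    return logsumexp(logits) - logits[target]


# Extreme logits: a naive math.exp(1000.0) would raise OverflowError.
confident_right = logistic_loss(1000.0, 1)
confident_wrong = logistic_loss(-1000.0, 1)
multiclass_right = softmax_cross_entropy([1000.0, 0.0, -1000.0], 0)
multiclass_wrong = softmax_cross_entropy([1000.0, 0.0, -1000.0], 2)
```

All four values stay finite: the confidently correct cases evaluate to `0.0` and the confidently wrong ones grow linearly with the logit gap instead of overflowing.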
v1.1: full CUDA coverage + CUDA-enabled Linux wheels
Shipped 2026-07-02 (additive minor under docs/compat_policy.md):
- CUDA multiclass (softmax) training for FM and FFM (gpu_backend_plan milestone 5): GPU per-class batch accumulation with the softmax computed in-kernel in CPU class order; the untouched CPU flush runs per class via `McState::class_views`, so every optimizer, ES and `partial_fit` ride through. FFM uses a two-kernel design so one class-sized dense gv buffer serves all classes.
- FwFM CUDA (prediction + binary/regression + multiclass training; milestone 6): new prediction kernel (FFM geometry, FM-shaped V, R pair weights) and training kernels (compact feature slots + dense `n_fields²` gr); the R group flushes through `GroupStateMut`/`McGroupState` unchanged. With this, every prediction and training cell is CUDA-covered and the per-cell `NotImplementedError` guards are gone.
- CUDA-ready Linux wheels: `pip install modern-fm` on Linux now ships the `cuda-backend` feature (cudarc `dynamic-loading` pinned, so nothing links libcuda and the wheel stays manylinux-clean); `backend="cuda"` works on Colab/Kaggle GPU runtimes without a source build. Gated in CI (`cuda-wheel`) and in the release workflow (`linux-wheel-check`: CPU-only import, `has_cuda() is False`, full suite, auditwheel).
Post-1.1: model variants & GPU
- AFM, FEFM/FmFM (each gets its own math spec first, per the FwFM template)
- pairwise dropout, interaction pruning
- PyTorch-compatible backend prototype
- GPU optimizer flush; stacked-gv fast path for small-C FFM multiclass
- cuML-style `device=` switch investigation
Distribution
- PyPI name: `modern-fm` (availability confirmed 2026-06-11)
- wheels via maturin + cibuildwheel once Rust backend lands