I Built a Neural Network from Scratch in Rust — Then Compiled It to WebAssembly

A complete ML pipeline: engine, backprop, binary format, and a live browser demo. Zero dependencies. Under 200 KB total.

If you have built machine-learning projects before, you have probably done it by importing PyTorch, TensorFlow, or scikit-learn and calling .fit(). Those are excellent libraries. This article is about what happens when you deliberately do not use them — when you build every piece of the pipeline yourself, in a language that compiles to WebAssembly, and the result runs live in the browser with no server, no Python, and no cloud bill.

Here is the live demo: move four sliders, watch the predicted Iris species update in real time. The model is running entirely inside your browser tab, loaded from a 1.1 KB binary file, powered by ~100 KB of WebAssembly compiled from pure Rust.

This is the story of how I built it and why the engineering choices made it work.

Why Rust? Why WebAssembly? Why zero dependencies?

Three constraints drove every design decision.

WASM requires no_std or a carefully limited std. The wasm32-unknown-unknown target has no operating system, no file system, and no libc. A crate that depends on rand, ndarray, or any other library that makes OS calls may not compile to it without significant plumbing. An engine that uses only the OS-independent parts of the Rust standard library compiles cleanly to that target as well as to native ones.

A zero-dependency, std-only engine is unusually easy to audit. There are no transitive dependency trees to vet, no supply-chain risks, no version conflicts. Apart from the wasm-bindgen glue in the deployment crate, every line of code that runs in the user's browser lives in this repository.

The deployment story becomes the technical story. A 100 KB WASM blob that runs locally in the browser is not just a cost optimisation — it is a privacy guarantee (user inputs never leave the machine) and a latency guarantee (inference is microseconds, not a round trip to a cloud API). That story is only possible because the engine has no external dependencies that would bloat the binary.

The architecture: twelve modules, strictly layered

The engine is a Cargo workspace with four crates. The core library, ferrum_core, contains twelve modules arranged in a strict dependency stack — each module imports only from those above it. There are no cycles, no forward references.

error      ← single InferError enum, Result<T> alias
tensor     ← Tensor: flat Vec<f32> + shape, row-major
ops        ← matmul (i-k-j order), bias-add, transpose, argmax, softmax
activation ← ReLU, Sigmoid, Tanh, Softmax, Identity (serialisable enum)
layer      ← Layer trait, Linear (y = xW+b), ActivationLayer
model      ← Sequential: Vec<Box<dyn Layer>>, forward()
rng        ← seeded xorshift64* PRNG, Box-Muller normal samples
loss       ← fused softmax cross-entropy + analytic gradient
optim      ← SGD with momentum, stateless over parameters
csv        ← numeric CSV parser, z-score Normalizer, train/val split
train      ← DenseT, ReluT, Net (trainable MLP), backpropagation
loader     ← FINF v2 binary format: weights + normalizer in one file

You can read these files in order and the entire engine unfolds with no surprises.
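
To make the rng row concrete, here is a minimal sketch of a seeded xorshift64* generator with Box-Muller normal samples. It is not the code from the repository: the type name XorShift64Star, the zero-seed fallback, and the choice of 23 random bits per float are assumptions made for illustration.

```rust
/// Seeded xorshift64* generator (hypothetical stand-in for the rng module).
pub struct XorShift64Star {
    state: u64,
}

impl XorShift64Star {
    /// A zero state would stay zero forever, so a zero seed is swapped for
    /// an arbitrary non-zero constant (an assumption of this sketch).
    pub fn new(seed: u64) -> Self {
        XorShift64Star { state: if seed == 0 { 0x9E37_79B9_7F4A_7C15 } else { seed } }
    }

    /// One xorshift64* step: three shift-xors, then a multiply.
    pub fn next_u64(&mut self) -> u64 {
        let mut x = self.state;
        x ^= x >> 12;
        x ^= x << 25;
        x ^= x >> 27;
        self.state = x;
        x.wrapping_mul(0x2545_F491_4F6C_DD1D)
    }

    /// Uniform sample strictly inside (0, 1), built from the top 23 bits.
    pub fn next_f32(&mut self) -> f32 {
        let bits = (self.next_u64() >> 41) as f32; // 0 ..= 2^23 - 1
        (bits + 0.5) / 8_388_608.0
    }

    /// Standard normal sample via the Box-Muller transform.
    pub fn next_normal(&mut self) -> f32 {
        let u1 = self.next_f32();
        let u2 = self.next_f32();
        (-2.0 * u1.ln()).sqrt() * (2.0 * std::f32::consts::PI * u2).cos()
    }
}
```

Because the generator is fully determined by its seed, two instances created with the same seed produce identical weight initialisations, which is what makes training runs reproducible.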

The tensor: deliberately minimal

pub struct Tensor {
    pub shape: Vec<usize>,
    pub data:  Vec<f32>,
}

That is the whole data model. A 3×4 matrix is twelve contiguous floats. A vector is the same thing with a one-element shape. There is no broadcasting, no views, no strides, no GPU. Every operation returns a new Tensor rather than mutating in place. This costs allocations and buys clarity: the data flow through the network is always explicit.
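
As an illustration of how little the data model needs, the sketch below adds a checked constructor and a row-major lookup to the same two-field struct. The names new and at2 and the String error type are hypothetical; the real crate reports errors through its InferError enum.

```rust
#[derive(Debug, Clone, PartialEq)]
pub struct Tensor {
    pub shape: Vec<usize>,
    pub data: Vec<f32>,
}

impl Tensor {
    /// Build a tensor, rejecting data whose length does not match the shape.
    /// (Hypothetical constructor; the error is a plain String in this sketch.)
    pub fn new(shape: Vec<usize>, data: Vec<f32>) -> Result<Tensor, String> {
        let expected: usize = shape.iter().product();
        if expected != data.len() {
            return Err(format!(
                "shape {:?} needs {} values, got {}",
                shape, expected, data.len()
            ));
        }
        Ok(Tensor { shape, data })
    }

    /// Row-major lookup in a 2-D tensor: element (i, j) lives at i * cols + j.
    /// Returns None for out-of-range indices or non-2-D tensors.
    pub fn at2(&self, i: usize, j: usize) -> Option<f32> {
        match self.shape.as_slice() {
            [rows, cols] if i < *rows && j < *cols => Some(self.data[i * cols + j]),
            _ => None,
        }
    }
}
```

For a 3×4 tensor filled with 0 through 11, element (1, 2) sits at flat index 1 * 4 + 2 = 6.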

The key primitive is map:

pub fn map<F: Fn(f32) -> f32>(&self, f: F) -> Tensor {
    Tensor {
        shape: self.shape.clone(),
        data: self.data.iter().copied().map(f).collect(),
    }
}

This single method is the entire mechanism behind ReLU, Sigmoid, and Tanh.
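
A short sketch of how the three element-wise activations can be written on top of map. The free functions relu, sigmoid, and tanh are illustrative names, not necessarily how the activation module is organised.

```rust
pub struct Tensor {
    pub shape: Vec<usize>,
    pub data: Vec<f32>,
}

impl Tensor {
    pub fn map<F: Fn(f32) -> f32>(&self, f: F) -> Tensor {
        Tensor {
            shape: self.shape.clone(),
            data: self.data.iter().copied().map(f).collect(),
        }
    }
}

/// Each activation is a one-line closure handed to map.
pub fn relu(x: &Tensor) -> Tensor {
    x.map(|v| v.max(0.0))
}

pub fn sigmoid(x: &Tensor) -> Tensor {
    x.map(|v| 1.0 / (1.0 + (-v).exp()))
}

pub fn tanh(x: &Tensor) -> Tensor {
    x.map(f32::tanh)
}
```

Softmax is the exception: it needs the whole row to compute a normaliser, so it cannot be expressed as a per-element closure.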

The matmul: cache-friendly loop order

The matrix multiply is the performance-critical operation and the one place where a single implementation choice makes a real difference.

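The textbook i-j-k loop order (for each output element (i,j), dot the i-th row of A with the j-th column of B) walks down a column of B in the innermost loop. In a row-major matrix, consecutive elements of a column sit n floats apart, so for wide matrices nearly every read lands on a different cache line.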

The i-k-j order walks both B and the output buffer contiguously in the innermost loop:

for i in 0..m {
    let a_row = i * ka;
    let o_row = i * n;
    for k in 0..ka {
        let a_ik = a.data[a_row + k];
        let b_row = k * n;
        for j in 0..n {
            out[o_row + j] += a_ik * b.data[b_row + j];
        }
    }
}

Same FLOP count. Meaningfully better cache behaviour. This is not BLAS — it is still naive single-threaded loops — but it is the right naive implementation.
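
For readers who want to run the loop above, here is a self-contained version over flat slices. The signature (explicit m, ka, n and a String error) is an assumption for this sketch; the engine's own matmul takes Tensor values.

```rust
/// Multiply an (m x ka) matrix `a` by a (ka x n) matrix `b`.
/// Both inputs and the result are flat, row-major buffers.
pub fn matmul(a: &[f32], b: &[f32], m: usize, ka: usize, n: usize) -> Result<Vec<f32>, String> {
    if a.len() != m * ka || b.len() != ka * n {
        return Err("buffer length does not match dimensions".to_string());
    }
    let mut out = vec![0.0f32; m * n];
    for i in 0..m {
        let out_row = &mut out[i * n..(i + 1) * n];
        for k in 0..ka {
            let a_ik = a[i * ka + k];
            let b_row = &b[k * n..(k + 1) * n];
            // Innermost loop walks one row of b and one row of out contiguously.
            for (o, &b_kj) in out_row.iter_mut().zip(b_row) {
                *o += a_ik * b_kj;
            }
        }
    }
    Ok(out)
}
```

Slicing each row once and zipping the two slices also lets the compiler drop per-element bounds checks in the inner loop.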

The loss function: why fusion matters

For a classifier, the natural loss is cross-entropy applied to a softmax output. Most introductions compute them separately. There is a strong reason not to.

If p = softmax(z), the true class is t, and the loss is averaged over the batch, the gradient of the cross-entropy loss with respect to the logits z is:

dL/dz = (p - onehot(t)) / batch_size

That is it. No chain rule composition, no softmax Jacobian, no numerical instability from computing log(softmax(z)) via two separate steps. The gradient is the predicted probabilities minus the one-hot target, divided by the batch size. This is why every layer below the loss in the network never needs to know what a softmax derivative looks like.

The implementation computes the stable softmax (max-subtracted before exponentiating), the negative log-likelihood of the true class, and this gradient in a single pass over the batch.
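
A minimal sketch of such a fused pass, assuming flat row-major logits, integer class targets, and a loss averaged over the batch. The function name and signature are illustrative rather than the crate's actual API.

```rust
/// Fused softmax + cross-entropy over a batch of logits (flat row-major,
/// `classes` values per row). Returns (mean loss, dL/dlogits).
pub fn softmax_cross_entropy(logits: &[f32], targets: &[usize], classes: usize) -> (f32, Vec<f32>) {
    let batch = targets.len();
    assert_eq!(logits.len(), batch * classes);
    let mut grad = vec![0.0f32; logits.len()];
    let mut loss = 0.0f32;
    for (row, &t) in targets.iter().enumerate() {
        let z = &logits[row * classes..(row + 1) * classes];
        let g = &mut grad[row * classes..(row + 1) * classes];
        // Stable softmax: subtract the row max before exponentiating.
        let max = z.iter().copied().fold(f32::NEG_INFINITY, f32::max);
        let mut sum = 0.0f32;
        for (gj, &zj) in g.iter_mut().zip(z) {
            *gj = (zj - max).exp();
            sum += *gj;
        }
        // -log p_t written as ln(sum) - (z_t - max), avoiding log of a tiny p.
        loss += sum.ln() - (z[t] - max);
        // Gradient: (p - onehot(t)) / batch.
        for (j, gj) in g.iter_mut().enumerate() {
            let p = *gj / sum;
            *gj = (p - if j == t { 1.0 } else { 0.0 }) / batch as f32;
        }
    }
    (loss / batch as f32, grad)
}
```

With a logit of 1000 in one row, a naive exp would overflow to infinity; after max subtraction the largest exponent is exactly zero, so the loss stays finite.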

The correctness of this gradient is verified by a finite-difference check in the test suite: every logit is perturbed by ε, the actual change in loss is measured, and the result is compared to the analytic gradient element-by-element.

Backpropagation by hand

The trainable network is a separate set of types from the inference engine. DenseT and ReluT mirror their inference counterparts but cache intermediate values for the backward pass.

For a dense layer y = xW + b, the backward pass is three expressions:

self.grad_w = matmul(&transpose(x)?, dy)?;   // dL/dW = x^T · dy
self.grad_b = sum_axis0(dy)?;                 // dL/db = Σ_rows(dy)
// return:
matmul(dy, &transpose(&self.weight)?)         // dL/dx = dy · W^T
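
The same three expressions can be written without matmul and transpose helpers, as one loop nest over flat buffers. This is a sketch for checking the index arithmetic by hand; dense_backward and its argument layout are assumptions, not the DenseT implementation.

```rust
/// Gradients for a dense layer y = xW + b with flat row-major buffers.
/// x: (batch x n_in), w: (n_in x n_out), dy: (batch x n_out).
/// Returns (grad_w, grad_b, dx).
pub fn dense_backward(
    x: &[f32],
    w: &[f32],
    dy: &[f32],
    batch: usize,
    n_in: usize,
    n_out: usize,
) -> (Vec<f32>, Vec<f32>, Vec<f32>) {
    let mut grad_w = vec![0.0f32; n_in * n_out];
    let mut grad_b = vec![0.0f32; n_out];
    let mut dx = vec![0.0f32; batch * n_in];
    for r in 0..batch {
        for o in 0..n_out {
            let d = dy[r * n_out + o];
            grad_b[o] += d; // dL/db = sum over rows of dy
            for i in 0..n_in {
                grad_w[i * n_out + o] += x[r * n_in + i] * d; // dL/dW = x^T · dy
                dx[r * n_in + i] += d * w[i * n_out + o]; // dL/dx = dy · W^T
            }
        }
    }
    (grad_w, grad_b, dx)
}
```

Note that grad_w has the shape of W and dx has the shape of x, which is a quick sanity check on any hand-written backward pass.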

For a ReLU layer, the backward pass is one expression. The forward pass caches a 0/1 mask of which inputs were positive — that mask is exactly ReLU's local derivative:

ops::mul(dy, mask)  // gate the gradient
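
A sketch of the forward/backward pair with the cached mask, using plain slices. ReluSketch is a hypothetical stand-in for ReluT.

```rust
/// Trainable ReLU that remembers which inputs were positive.
pub struct ReluSketch {
    mask: Vec<f32>,
}

impl ReluSketch {
    pub fn new() -> Self {
        ReluSketch { mask: Vec::new() }
    }

    /// Forward pass: clamp negatives to zero and cache the 0/1 mask.
    pub fn forward(&mut self, x: &[f32]) -> Vec<f32> {
        self.mask = x.iter().map(|&v| if v > 0.0 { 1.0 } else { 0.0 }).collect();
        x.iter().map(|&v| v.max(0.0)).collect()
    }

    /// Backward pass: element-wise product of the upstream gradient and the mask.
    pub fn backward(&self, dy: &[f32]) -> Vec<f32> {
        dy.iter().zip(&self.mask).map(|(&d, &m)| d * m).collect()
    }
}
```

An input of exactly zero gets a mask value of 0 here; the derivative at zero is a convention, and this sketch picks the one that matches "which inputs were positive".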

The gradient check test perturbs individual weights by ε = 0.001, measures (L(w+ε) - L(w-ε)) / 2ε, and confirms it matches the analytic gradient to within 0.01. If the calculus were wrong, this test would catch it.
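
The central-difference estimate is small enough to show in full. The sketch below works on f64 for clarity (the engine itself uses f32), and both helper names are hypothetical.

```rust
/// Central-difference estimate of the gradient of `loss` at `w`:
/// (L(w + eps) - L(w - eps)) / (2 * eps), one weight at a time.
pub fn numeric_grad<F: Fn(&[f64]) -> f64>(loss: F, w: &[f64], eps: f64) -> Vec<f64> {
    let mut probe = w.to_vec();
    (0..w.len())
        .map(|i| {
            probe[i] = w[i] + eps;
            let up = loss(&probe);
            probe[i] = w[i] - eps;
            let down = loss(&probe);
            probe[i] = w[i]; // restore before moving to the next weight
            (up - down) / (2.0 * eps)
        })
        .collect()
}

/// True when every analytic entry is within `tol` of its numeric estimate.
pub fn grads_match(analytic: &[f64], numeric: &[f64], tol: f64) -> bool {
    analytic.len() == numeric.len()
        && analytic.iter().zip(numeric).all(|(a, n)| (a - n).abs() <= tol)
}
```

For example, with loss(w) = Σ w², the numeric estimate agrees with the analytic gradient 2w, while a deliberately wrong gradient such as 3w fails the comparison.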

The model file: a custom binary format

Real engines use GGUF, SafeTensors, or ONNX. To stay dependency-free, I defined a minimal binary format called FINF (version 2) and serialised it by hand.

The key design choice: the normalizer statistics (per-column mean and standard deviation) are embedded in the same file as the model weights. Inference requires normalising input features with the exact statistics used during training. A model that was trained on standardised data but receives raw values at inference will silently produce wrong predictions. Embedding both in one file makes this mistake structurally impossible: there is nothing to forget to load separately.
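
To show what those statistics are, here is a minimal z-score sketch: fit once on the training rows, then apply the stored values to any later input. ZScoreSketch is a hypothetical name, and both the population standard deviation (dividing by n) and the fallback for constant columns are assumptions.

```rust
/// Per-column z-score statistics, fitted once on the training set.
pub struct ZScoreSketch {
    pub mean: Vec<f32>,
    pub std_dev: Vec<f32>,
}

impl ZScoreSketch {
    /// Fit on rows of equal length. A zero standard deviation is replaced
    /// by 1.0 so constant columns do not divide by zero (an assumption).
    pub fn fit(rows: &[Vec<f32>]) -> ZScoreSketch {
        let n = rows.len() as f32;
        let cols = rows.first().map_or(0, |r| r.len());
        let mut mean = vec![0.0f32; cols];
        let mut std_dev = vec![0.0f32; cols];
        for c in 0..cols {
            mean[c] = rows.iter().map(|r| r[c]).sum::<f32>() / n;
            let var = rows.iter().map(|r| (r[c] - mean[c]).powi(2)).sum::<f32>() / n;
            std_dev[c] = if var > 0.0 { var.sqrt() } else { 1.0 };
        }
        ZScoreSketch { mean, std_dev }
    }

    /// Apply the stored statistics to one feature row.
    pub fn transform(&self, row: &[f32]) -> Vec<f32> {
        row.iter()
            .zip(self.mean.iter().zip(&self.std_dev))
            .map(|(&x, (&m, &s))| (x - m) / s)
            .collect()
    }
}
```

The important property is that transform never recomputes anything: inference uses exactly the numbers that fit produced during training.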

4 bytes  b"FINF"                  ← magic number
u32      version = 2
u32      normalizer_byte_length
[bytes]  "mean0,std0;mean1,std1;…" ← normalizer stats
u32      num_layers
per layer:
  u8     tag: 0=Linear, 1=Activation
  ...    layer parameters
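
A sketch of writing the header portion of that layout by hand. Several details are assumptions, since the layout above does not specify them: little-endian integers, floats formatted with Rust's default Display, and no trailing separator in the normalizer text. Layer payloads are deliberately left out.

```rust
/// Serialise the FINF v2 header up to and including `num_layers`.
/// Layer payloads are not written here.
pub fn write_header(stats: &[(f32, f32)], num_layers: u32) -> Vec<u8> {
    // Normalizer stats as text: "mean0,std0;mean1,std1".
    let norm = stats
        .iter()
        .map(|(mean, sd)| format!("{},{}", mean, sd))
        .collect::<Vec<_>>()
        .join(";");
    let mut out = Vec::new();
    out.extend_from_slice(b"FINF"); // magic number
    out.extend_from_slice(&2u32.to_le_bytes()); // version (little-endian assumed)
    out.extend_from_slice(&(norm.len() as u32).to_le_bytes());
    out.extend_from_slice(norm.as_bytes());
    out.extend_from_slice(&num_layers.to_le_bytes());
    out
}
```

Storing the byte length before the text means a reader can skip or validate the normalizer block without scanning for a terminator.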

The reader is a forward-only bounds-checked cursor. Truncated or corrupt files return a Format error rather than panicking or reading out of bounds.
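
A minimal sketch of such a cursor, reading only the magic, version, and normalizer text. ByteCursor, read_header, the String error type, and little-endian integers are assumptions; the real reader returns the crate's Format error.

```rust
/// Forward-only reader over a byte slice; every read is bounds-checked.
pub struct ByteCursor<'a> {
    buf: &'a [u8],
    pos: usize,
}

impl<'a> ByteCursor<'a> {
    pub fn new(buf: &'a [u8]) -> Self {
        ByteCursor { buf, pos: 0 }
    }

    /// Take the next `n` bytes, or report an error if the file is too short.
    pub fn take(&mut self, n: usize) -> Result<&'a [u8], String> {
        let end = self.pos.checked_add(n).ok_or("length overflow")?;
        let bytes = self.buf.get(self.pos..end).ok_or("unexpected end of file")?;
        self.pos = end;
        Ok(bytes)
    }

    /// Read a little-endian u32 (endianness is an assumption of this sketch).
    pub fn read_u32(&mut self) -> Result<u32, String> {
        let b = self.take(4)?;
        Ok(u32::from_le_bytes([b[0], b[1], b[2], b[3]]))
    }
}

/// Read magic, version, and the normalizer text from a FINF v2 header.
pub fn read_header(bytes: &[u8]) -> Result<String, String> {
    let mut cur = ByteCursor::new(bytes);
    if cur.take(4)? != b"FINF" {
        return Err("bad magic".into());
    }
    if cur.read_u32()? != 2 {
        return Err("unsupported version".into());
    }
    let len = cur.read_u32()? as usize;
    let text = cur.take(len)?;
    String::from_utf8(text.to_vec()).map_err(|_| "normalizer is not UTF-8".to_string())
}
```

Because every read goes through take, a length field that claims more bytes than the file contains becomes an ordinary error instead of an out-of-bounds panic.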

The WASM bindings

The deployment crate, iris_wasm, exposes a single class to JavaScript:

#[wasm_bindgen]
pub struct IrisClassifier {
    model: Sequential,
    norm:  Normalizer,
}

#[wasm_bindgen]
impl IrisClassifier {
    #[wasm_bindgen(constructor)]
    pub fn new(model_bytes: &[u8]) -> Result<IrisClassifier, JsValue> { ... }

    pub fn predict(&self, sl: f32, sw: f32, pl: f32, pw: f32) -> Result<String, JsValue> { ... }
}

new takes the raw bytes of the FINF file (fetched via fetch() in JavaScript) and deserialises them. predict normalises the four slider values using the embedded statistics, runs a forward pass, and returns a JSON string with the predicted class index, name, and all three probabilities.
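
Building that JSON string does not need a JSON library. The sketch below shows one way to do it; the field names class, name, and probs and the class ordering (the conventional setosa, versicolor, virginica) are assumptions, not the actual output schema.

```rust
const SPECIES: [&str; 3] = ["setosa", "versicolor", "virginica"];

/// Format class probabilities as a small JSON object by hand.
/// Assumes the probabilities are finite (NaN or infinity are not valid JSON).
pub fn prediction_json(probs: &[f32; 3]) -> String {
    // Argmax: index of the largest probability (first one wins on ties).
    let mut best = 0;
    for i in 1..probs.len() {
        if probs[i] > probs[best] {
            best = i;
        }
    }
    format!(
        "{{\"class\":{},\"name\":\"{}\",\"probs\":[{},{},{}]}}",
        best, SPECIES[best], probs[0], probs[1], probs[2]
    )
}
```

Returning a string keeps the wasm-bindgen boundary to primitive types, at the cost of a JSON.parse call on the JavaScript side.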

The JavaScript side is equally minimal — an ES module, no bundler, no framework:

import init, { IrisClassifier } from './pkg/iris_wasm.js';

await init();
const bytes = new Uint8Array(await (await fetch('model.bin')).arrayBuffer());
const classifier = new IrisClassifier(bytes);

slider.addEventListener('input', () => {
    const result = JSON.parse(classifier.predict(sl, sw, pl, pw));
    updateUI(result);
});

The numbers

Property	Value
Architecture	4 → 32 (ReLU) → 3 (Softmax)
Trainable parameters	259
Dataset	UCI Iris, 150 examples
Full-dataset accuracy	99.3%
Training time	~2 seconds, single-threaded
Model file size	1,161 bytes
WASM binary	~100 KB
Total page weight	~118 KB (less than one JPEG)
External dependencies	0
Tests	110 (0 failures)
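
The parameter count in the table can be checked with a few lines of arithmetic: each dense layer contributes inputs × outputs weights plus one bias per output. The helper name param_count is hypothetical.

```rust
/// Count weights and biases in a fully connected network
/// given its layer widths, e.g. [4, 32, 3].
pub fn param_count(widths: &[usize]) -> usize {
    widths.windows(2).map(|w| w[0] * w[1] + w[1]).sum()
}
```

For 4 → 32 → 3 this gives (4·32 + 32) + (32·3 + 3) = 160 + 99 = 259. Assuming the weights are stored as raw 4-byte f32 values, that is 1,036 bytes, which leaves 125 bytes of the 1,161-byte file for the header, normalizer text, and layer tags.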

What the test suite covers

The 110-test suite is worth describing because it covers correctness at every level:

Unit tests (84) — every module tested in isolation. Highlights: the finite-difference gradient check proves the loss function's calculus is correct; the tag_roundtrips_all_variants test proves the serialisation enum never silently mis-maps; the normalizer_zero_mean_unit_std test verifies that fit-and-transform produces genuinely standardised data.

WASM glue tests (5) — the bindings layer tested natively: load from bytes, infer, verify probability distribution, check argmax range, reject corrupt input.

Integration tests (21) — the complete pipeline: parse the real UCI Iris CSV, normalise, train from scratch, serialise, deserialise, infer on known samples. These tests include setosa_textbook_sample and virginica_textbook_sample which assert that specific well-known Iris measurements produce the correct predicted species.
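
To give a flavour of the unit tests, here is a sketch of the idea behind a tag round-trip check: every enum variant is converted to its byte tag and back, and must come out unchanged. The tag values and method names are illustrative, not the ones used in the FINF format.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Activation {
    Relu,
    Sigmoid,
    Tanh,
    Softmax,
    Identity,
}

impl Activation {
    pub const ALL: [Activation; 5] = [
        Activation::Relu,
        Activation::Sigmoid,
        Activation::Tanh,
        Activation::Softmax,
        Activation::Identity,
    ];

    /// Byte written to the model file (values here are illustrative).
    pub fn to_tag(self) -> u8 {
        match self {
            Activation::Relu => 0,
            Activation::Sigmoid => 1,
            Activation::Tanh => 2,
            Activation::Softmax => 3,
            Activation::Identity => 4,
        }
    }

    /// Inverse mapping; unknown bytes are rejected rather than guessed.
    pub fn from_tag(tag: u8) -> Option<Activation> {
        Activation::ALL.iter().copied().find(|a| a.to_tag() == tag)
    }
}
```

If two variants were ever given the same tag by mistake, the round trip would return the first of them and the check would fail for the second, which is exactly the silent mis-mapping this kind of test guards against.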

Deployment in three steps

# Train (produces model.bin in ~2 seconds)
cargo run -p train --release

# Compile to WASM
cargo build -p iris_wasm --target wasm32-unknown-unknown --release
wasm-bindgen target/wasm32-unknown-unknown/release/iris_wasm.wasm \
  --out-dir web/pkg --target web --no-typescript

# Copy model and serve
cp model.bin web/model.bin
cd web && python3 -m http.server 8080

GitHub Pages deployment is three more steps: push the web/ directory, enable Pages, get a permanent HTTPS URL. The included DEPLOYMENT.md has the full GitHub Actions workflow for continuous deployment.

The engineering lessons

Separation of concerns is testable. Because every arithmetic operation lives in ops.rs, every activation in activation.rs, and the loss gradient in loss.rs, the finite-difference gradient check can verify the math without touching any other module.

Embedding the normalizer in the model file eliminates a whole class of bugs. This seems like a small detail. It is not. Preprocessing statistics are the most commonly forgotten artifact when deploying a model, and a model that receives un-normalised inputs fails silently.

The i-k-j loop order for matmul is a free speedup. It is the same length as the textbook i-j-k version, just with the j and k loops swapped. Any ML engine that does its own matrix multiply should use it.

WASM deployment is simpler than it looks. The hard part is not the WASM compilation — cargo build --target wasm32-unknown-unknown is one command. The hard part is having a codebase that compiles to that target in the first place, which requires zero OS dependencies. The architectural constraint and the deployment story are the same constraint.

What's next

The engine is designed so each kind of extension touches only one or two modules:

New activation → activation.rs (add variant, tag, apply arm)
New layer → layer.rs + loader.rs (tag bytes)
New loss → loss.rs (return (scalar, gradient))
New optimizer → optim.rs (Adam, RMSProp)
Faster kernels → ops.rs only (SIMD, rayon, BLAS)
Bigger model → csv.rs for a new featurizer, everything else unchanged

The repository will soon be on GitHub. I will update the link here once it goes live. The web/ directory is self-contained: copy it to any static host and you have a live demo immediately.

Built with Rust 1.95, wasm-bindgen 0.2.122, and the UCI Iris dataset from the UC Irvine Machine Learning Repository.