CRT Neural Network

7 independent channels. Block-diagonal backprop. Free error detection.

Train a neural network split into 7 independent sub-networks: mod 8, mod 9, mod 25, mod 49, mod 11, mod 13, mod 17. Each channel trains alone. Standard network (red) beside CRT ensemble (green). Kill any channel -- graceful degradation. Corrupt weights -- triple-parity (mod 11 + mod 13 + mod 17) detects it for free. 5 datasets. Two modes: i32 fixed-point and f64 native float.

How It Works

STANDARD NEURAL NETWORK:
  Forward:  x -> W*x + b -> ReLU -> output
  Backward: full Jacobian (all neurons interact)
  Cost: O(N^2) per layer

CRT NEURAL NETWORK:
  Forward:  x -> 7 independent sub-networks -> average
  Backward: 7 INDEPENDENT Jacobians (block-diagonal)
  Cost: O(sum(m_i^2)) = O(8^2+9^2+25^2+49^2+11^2+13^2+17^2) = O(3750)

  mod 8  (2^3):  8 states, depth 3 (unique -- only even-prime channel)
  mod 9  (3^2):  9 states, depth 2
  mod 25 (5^2):  25 states, depth 2
  mod 49 (7^2):  49 states, depth 2 (largest channel)
  mod 11:        error detection (parity check)
  mod 13:        quality gate (bounds check)
  mod 17:        escape channel (triple-parity ECC with mod 11 + mod 13)

At N=214,414,200: 12 TRILLION x fewer gradient ops.
Triple-parity ECC (mod 11 + mod 13 + mod 17) = free error detection.
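
The figures above follow from the seven moduli alone. A minimal Python sketch that recomputes them (variable names are illustrative and not taken from the demo's source):

```python
from math import prod

# Channel moduli as listed above.
MODULI = [8, 9, 25, 49, 11, 13, 17]

ring_size = prod(MODULI)                   # N, the size of the full ring
full_ops = ring_size ** 2                  # N^2: dense (full Jacobian) cost model
channel_ops = sum(m * m for m in MODULI)   # sum(m_i^2): block-diagonal cost model
ratio = full_ops // channel_ops

print(ring_size)        # 214414200
print(channel_ops)      # 3750
print(f"{ratio:.3e}")   # 1.226e+13
```

The ratio is a property of the cost model only; it says nothing about accuracy.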

Live Training

Standard = one 2->8->1 network (33 params). CRT = seven independent 2->4->1 networks (119 params, 7 channels x 17 params each), outputs averaged. 5 tasks: XOR (trivial), Circle (radial boundary), Moons (interleaving crescents), Spiral (hard), Blobs (500 pts, GPU scale). Adjustable learning rate: low (0.05) = stable but slow, high (0.50) = fast but may overshoot. CPU Train = fixed-point i32. CPU Train (f64) = native float with leaky ReLU + sigmoid + cross-entropy. GPU uses mini-batch SGD + leaky ReLU 25% + golden interleave (stride 137).
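
The parameter counts can be checked by counting weights and biases for a one-hidden-layer network. A small sketch (the helper name `dense_params` is hypothetical):

```python
def dense_params(n_in, n_hidden, n_out=1):
    """Weights plus biases for a network n_in -> n_hidden -> n_out."""
    hidden = n_in * n_hidden + n_hidden    # hidden-layer weights + biases
    output = n_hidden * n_out + n_out      # output-layer weights + biases
    return hidden + output

standard = dense_params(2, 8)       # one 2->8->1 network
per_channel = dense_params(2, 4)    # one 2->4->1 channel
crt_total = 7 * per_channel         # seven channels

print(standard, per_channel, crt_total, standard + crt_total)  # 33 17 119 152
```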

[Interactive panel: Task, Epochs/click, and Learning rate controls; epoch counter; side-by-side views of the Standard Network and the CRT Network (7 channels).]

Loss Curve

Each dot = loss after 200 epochs. Red = standard, green = CRT.

Accuracy Curve

Each dot = accuracy (%) after 200 epochs. 50% gridline = random chance.

Per-Channel Loss

Each trace = one CRT channel's loss. Channels converge independently (block-diagonal backprop).

XOR (4 pts) converges in ~200 epochs. Blobs (500 pts) in ~3000. Circle and Moons need 3000+. Spiral (200 pts) is hardest -- the 3-turn boundary stretches 8 hidden units. CPU Train: i32 mini-batch SGD. CPU Train (f64): native float, leaky ReLU + sigmoid -- Circle 50/50, Spiral 114/200 (+10 over i32). GPU: mini-batch SGD + leaky ReLU 25%.

Channel Isolation

Kill individual channels during inference -- the CRT network degrades gracefully because each channel is independent. A standard network has no structural channel decomposition -- failure is not isolated to independent components.

[Interactive panel: per-channel kill toggles and status for mod 8, mod 9, mod 25, mod 49, mod 11, mod 13, mod 17.]

Even with channels dead, the remaining ones reconstruct a partial answer. With mod-11 alive, corruption in any other channel is detectable -- the parity check catches inconsistencies.
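
One way to picture graceful degradation is an average taken over whichever channels are still alive. The sketch below assumes that killing a channel simply drops it from the average; the demo's actual handling may differ, and the output values are made up for illustration.

```python
MODULI = [8, 9, 25, 49, 11, 13, 17]

def ensemble_predict(outputs, alive):
    """Average the outputs of the channels marked alive (assumed behaviour)."""
    live = [outputs[m] for m in MODULI if alive[m]]
    if not live:
        raise ValueError("all channels are dead")
    return sum(live) / len(live)

# Hypothetical per-channel outputs for one input; mod 49 disagrees with the rest.
outputs = {8: 1.0, 9: 1.0, 25: 1.0, 49: 0.0, 11: 1.0, 13: 1.0, 17: 1.0}

all_alive = {m: True for m in MODULI}
without_49 = dict(all_alive)
without_49[49] = False

full = ensemble_predict(outputs, all_alive)      # 6/7: six channels say 1.0
degraded = ensemble_predict(outputs, without_49) # 1.0: remaining six agree
print(full, degraded)
```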

Error Detection (mod 11)

11 = 1+2+3+5, the parity checksum of the first four chain primes. Corrupt the mod-49 channel weights with random noise. The mod-11 channel detects corruption by comparing its independent prediction against the corrupted consensus. No additional cost -- mod-11 trains as part of CRT decomposition, protecting all other channels for free.
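
The comparison described above can be sketched as a disagreement test between the mod-11 prediction and the consensus of the other channels. The function name, the tolerance, and the sample values below are assumptions for illustration; the demo's actual detection rule is not given in the text.

```python
def flags_corruption(checker_pred, other_preds, tolerance=0.1):
    """Return True when the checker disagrees with the others' consensus.

    checker_pred: prediction of the mod-11 channel
    other_preds:  predictions of the remaining channels
    tolerance:    hypothetical threshold, not taken from the demo
    """
    consensus = sum(other_preds) / len(other_preds)
    return abs(checker_pred - consensus) > tolerance

# Hypothetical predictions in channel order mod 8, 9, 25, 49, 13, 17.
clean = [0.9, 0.9, 0.9, 0.9, 0.9, 0.9]
corrupted = [0.9, 0.9, 0.9, 0.0, 0.9, 0.9]   # mod-49 weights replaced by noise

print(flags_corruption(0.9, clean))      # False
print(flags_corruption(0.9, corrupted))  # True
```

A single averaged comparison like this can only say that something is inconsistent; it does not identify which channel was corrupted.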

Train a model, then click to corrupt and detect.

Standard vs CRT

Property	Standard	CRT
Backprop	Full Jacobian	Block-diagonal (7 independent)
Ops at N=214,414,200	N^2 = 46 quadrillion	sum(m_i^2) = 3,750
Savings	1x	~12,260,000,000,000x
Channel failure	Entangled	Graceful degradation
Error detection	External validation	Triple-parity ECC (free)
Parallelism	Synchronous	7 independent machines
Hot swap	Retrain everything	Add/remove channels

Two views of the same data: standard (one big network) and CRT (seven small ones). CRT decomposition is EXACT -- the Chinese Remainder Theorem guarantees unique reconstruction. The 12 trillion x savings is the mathematical ratio N^2/sum(m_i^2) at N = 214,414,200.
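
The exactness claim refers to ring arithmetic: every integer in [0, N) has a unique tuple of residues, and the tuple determines the integer. A short sketch of the round trip using the standard CRT reconstruction formula (function names are illustrative; `pow(x, -1, m)` needs Python 3.8+):

```python
from math import prod

MODULI = [8, 9, 25, 49, 11, 13, 17]
N = prod(MODULI)  # 214,414,200

def to_residues(x):
    """Split x into one residue per channel."""
    return [x % m for m in MODULI]

def from_residues(residues):
    """Rebuild x from its residues with the standard CRT formula."""
    total = 0
    for r, m in zip(residues, MODULI):
        partial = N // m                       # product of the other moduli
        total += r * partial * pow(partial, -1, m)
    return total % N

x, y = 123_456_789, 98_765_432
round_trip = from_residues(to_residues(x))

# Addition done channel by channel matches addition in the full ring.
channel_sum = [(a + b) % m for a, b, m in zip(to_residues(x), to_residues(y), MODULI)]
ring_sum = from_residues(channel_sum)

print(round_trip == x)            # True
print(ring_sum == (x + y) % N)    # True
```

This guarantee covers integer residues; it does not by itself extend to the averaged outputs of trained sub-networks.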

HONEST NOTE: This comparison overstates practical benefit. Real neural networks do not have N = 214,414,200 parameters in one layer. The block-diagonal structure trades off representational capacity for computational independence. Whether CRT decomposition improves real ML workloads has not been benchmarked against PyTorch or JAX. The demos above are toy problems. The architecture idea (independent sub-networks with built-in error detection) is novel; its practical value is an open question.

28x mulmod Elimination (COMPUTED)

In Z/214,414,200, each ring multiply requires mulmod with ~28 binary-method iterations (log2 N ~ 27.7). Per-channel: one i32 multiply (max 48*48 = 2304, no overflow). Forward + backward: 9 multiplies per sample. Full ring: 9 x 28 = 252 iterations. Per-channel parallel: 9 i32 ops in each of 7 channels = 9 wall-clock ops. Speedup: 252 / 9 = 28x in wall-clock terms with all 7 channels running in parallel; run one channel after another, the 9 x 7 = 63 multiplies are still 4x fewer than 252 iterations.

Verified: 120/120 gradient channel checks pass. Per-channel training converged on XOR (7 independent networks, CRT-reconstructed 4/4). CC0.
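
The iteration count can be illustrated with a double-and-add multiply, one common form of the binary method. This is a sketch under that assumption, not the demo's code; Python integers do not overflow, so the loop is shown only to count iterations.

```python
from math import prod

MODULI = [8, 9, 25, 49, 11, 13, 17]
N = prod(MODULI)  # 214,414,200

def mulmod_binary(a, b, n):
    """Double-and-add (a * b) mod n; returns the product and the loop count."""
    result, iterations = 0, 0
    a %= n
    while b > 0:
        if b & 1:
            result = (result + a) % n   # add the current multiple of a
        a = (a + a) % n                 # double a for the next bit of b
        b >>= 1
        iterations += 1
    return result, iterations

def mul_channels(ra, rb):
    """One small multiply per channel; residues stay below 49."""
    return [(x * y) % m for x, y, m in zip(ra, rb, MODULI)]

a, b = 123_456_789, N - 1
product, iterations = mulmod_binary(a, b, N)
channel_product = mul_channels([a % m for m in MODULI], [b % m for m in MODULI])

print(iterations)                                         # 28
print(channel_product == [product % m for m in MODULI])   # True
```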

GPU Backprop

CRT block-diagonal backprop maps directly to WGSL compute shaders: 7 independent channels = 7 workgroups, each operating in parallel. Each benchmark dispatches 100,000 ring elements through Z/214,414,200.

Each operation runs as a WGSL compute shader with 7 channels. Ring Add = forward pass accumulation. CRT Multiply = block-diagonal gradient step. Eigenvalue = convergence check. All verified against CPU.

GPU Training (WebGPU compute)

Train the same CRT network on GPU. Mini-batch SGD with leaky ReLU 25% and golden interleave (stride 137), matching CPU training. 8 networks (1 standard + 7 channels) in a single WGSL compute dispatch. Uses the selected learning rate and epoch count. Atomic gradient accumulation in workgroup shared memory.


f64 Training

Native float lifts the i32 truncation floor. Same 8-network architecture (1 standard nh=8 + 7 CRT nh=4 = 152 params), native WASM float64: leaky ReLU hidden (alpha=0.01), sigmoid output, cross-entropy gradient. No clipping, no overflow, no fixed-point truncation.

i32 FIXED-POINT (CPU Train):
  Scale 1000. Gradients clipped +-1000. Weights clamped +-5000.
  Circle: 49/50.  Spiral: 104/200.  Moons: 73/100.
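
A sketch of what one fixed-point weight update might look like with the constants above. The helper names and the truncate-toward-zero rounding are assumptions; only the scale, clip, and clamp values come from the text.

```python
SCALE = 1000          # 1.0 is stored as the integer 1000
GRAD_CLIP = 1000      # gradients clipped to +-1000
WEIGHT_CLAMP = 5000   # weights clamped to +-5000

def clamp(value, limit):
    """Restrict value to the range [-limit, +limit]."""
    return max(-limit, min(limit, value))

def fixed_mul(a, b):
    """Multiply two scaled integers, truncating toward zero (assumed rounding)."""
    product = a * b
    magnitude = abs(product) // SCALE
    return magnitude if product >= 0 else -magnitude

def sgd_step(weight, grad, lr):
    """One update: clip the gradient, scale by lr, clamp the new weight."""
    grad = clamp(grad, GRAD_CLIP)
    return clamp(weight - fixed_mul(lr, grad), WEIGHT_CLAMP)

small = sgd_step(1200, 15, 50)         # lr 0.05, gradient 0.015
saturated = sgd_step(4990, -2500, 500) # lr 0.50, gradient -2.5
print(small)      # 1200: the update 0.00075 truncates to zero
print(saturated)  # 5000: gradient clipped, then weight clamped
```

The first call shows the truncation floor: a small gradient produces no update at all.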

f64 NATIVE FLOAT (CPU Train f64):
  sigmoid(x) = 1/(1+exp(-x)).  leaky_relu(x, 0.01).
  Circle: 50/50 (perfect).  Spiral: 114/200 (+10).  Moons: 74/100.

WHY f64 WINS:
  No gradient truncation at decision boundaries.
  Sigmoid output = proper [0,1] probabilities.
  Cross-entropy gradient = (sigmoid(z) - target), no clipping needed.
  He initialization with sqrt() = proper scale per fan-in.
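
The ingredients listed above fit in a few lines of floating-point code. The following is a minimal single-sample sketch of one 2->4->1 channel, written in Python rather than the demo's .ax; the function names, the seed, and the Gaussian form of He initialization are assumptions.

```python
import math
import random

def leaky_relu(x, alpha=0.01):
    return x if x > 0 else alpha * x

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def init_params(n_in=2, n_hidden=4, seed=0):
    """He-style initialization: std = sqrt(2 / fan_in); biases start at zero."""
    rng = random.Random(seed)
    he = lambda fan_in: rng.gauss(0.0, math.sqrt(2.0 / fan_in))
    return {
        "W1": [[he(n_in) for _ in range(n_in)] for _ in range(n_hidden)],
        "b1": [0.0] * n_hidden,
        "W2": [he(n_hidden) for _ in range(n_hidden)],
        "b2": 0.0,
    }

def train_step(p, x, target, lr=0.5, alpha=0.01):
    """One SGD step on one sample; returns the prediction before the update."""
    z1 = [sum(w * xi for w, xi in zip(row, x)) + b
          for row, b in zip(p["W1"], p["b1"])]
    h = [leaky_relu(z, alpha) for z in z1]
    y = sigmoid(sum(w * hi for w, hi in zip(p["W2"], h)) + p["b2"])

    dz2 = y - target                      # cross-entropy + sigmoid gradient
    for j in range(len(h)):
        dz1 = dz2 * p["W2"][j] * (1.0 if z1[j] > 0 else alpha)
        p["W2"][j] -= lr * dz2 * h[j]
        for i in range(len(x)):
            p["W1"][j][i] -= lr * dz1 * x[i]
        p["b1"][j] -= lr * dz1
    p["b2"] -= lr * dz2
    return y

params = init_params()
n_params = sum(len(row) for row in params["W1"]) + len(params["b1"]) + len(params["W2"]) + 1
print(n_params)  # 17

xor = [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0), ([1.0, 0.0], 1.0), ([1.0, 1.0], 0.0)]
for _ in range(2000):
    for x, t in xor:
        train_step(params, x, t)
```

Whether a single 4-unit channel fits XOR from a given seed is not guaranteed; the sketch only shows the update rule.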

Both modes use the same CRT decomposition: 7 independent sub-networks whose predictions are averaged at inference. This is classic ensemble regularization -- each channel sees the full training signal, trains independently, and averaging smooths individual errors. The f64 mode makes this visible: smoother decision boundaries, faster convergence, and higher accuracy on Circle, Spiral, and Moons.

Tested: momentum SGD helps the standard network (Moons 71 to 83 at beta=0.9) but HURTS CRT (74 to 69). Similarly, adding a second hidden layer helps standard (Moons 71 to 84) but NOT CRT (74 to 72). Both momentum and depth add regularization that a single large network needs, but CRT's ensemble averaging already provides it.

Learning rate robustness: sweeping 7 rates (0.10 to 2.00), CRT accuracy varies by only 7 points on Moons vs 13 for standard -- 46% less sensitive. CRT works well with any rate in the 0.25 to 1.50 range without per-task tuning. In these tests CRT did not need external regularization -- the structure IS the regularizer.

The 2-Ratio (Theorem 47)

2-Ratio Bridge (Theorem 47, PROVED)

Two independent structural ratios of Z/214,414,200 both equal 2: |Nil|/|Sq0| = 420/210 = 2 and |Invol|/|Idem| = 256/128 = 2. Both trace to a single source: the mod-8 channel is the only one with exponent depth 3. In Z/8 (depth 3): nil=4, sq0=2, idem=2, invol=4. For every odd Z/p^k: nil/sq0=1, invol/idem=1. Only the mod-8 contribution differs -- and it contributes exactly 2 to both ratios. Exhaustive on all 7 channels.

Run the exhaustive per-channel count: nilpotent, square-zero, idempotent, involution for all 7 channels.
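
The exhaustive count is small enough to run directly. A sketch in Python (the function name is illustrative): it counts each element type per channel, then multiplies the counts, which is valid because an element of the product ring has a property exactly when every one of its residues does.

```python
from math import prod

MODULI = [8, 9, 25, 49, 11, 13, 17]

def channel_counts(m):
    """Brute-force counts of special elements in Z/m."""
    elems = range(m)
    return {
        "nil":   sum(1 for x in elems if pow(x, m, m) == 0),  # x^k = 0 for some k
        "sq0":   sum(1 for x in elems if (x * x) % m == 0),   # x^2 = 0
        "idem":  sum(1 for x in elems if (x * x) % m == x),   # x^2 = x
        "invol": sum(1 for x in elems if (x * x) % m == 1),   # x^2 = 1
    }

per_channel = {m: channel_counts(m) for m in MODULI}
totals = {key: prod(per_channel[m][key] for m in MODULI)
          for key in ("nil", "sq0", "idem", "invol")}

for m in MODULI:
    print(m, per_channel[m])
print(totals)  # {'nil': 420, 'sq0': 210, 'idem': 128, 'invol': 256}
```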

Nil / Sq0 = 2

420 / 210 = 2

Nilpotents = 420 = 4 x 3 x 5 x 7 (the prime channels contribute 1 each). Square-zeros = 210 = 2 x 3 x 5 x 7 (primorial). Ratio = 2. Only Z/8 contributes the extra factor (nil=4 vs sq0=2).

Invol / Idem = 2

256 / 128 = 2

Involutions = 2^8 = 256. Idempotents = 2^7 = 128. Ratio = 2. Only Z/8 has 4 involutions (Klein four-group) vs 2 for all odd channels.

mod-8 depth 3

Sole source

All 6 other channels contribute ratio 1 to both. The ratio 2 is a fingerprint of mod-8's unique exponent depth.

Klein four-group

{1, 3, 5, 7}

Z/8 involutions form the Klein four-group V_4. Z/9: {1, 8}. Z/25: {1, 24}. Z/49: {1, 48}. Only mod-8 doubles.
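
A quick check of the group structure, as a sketch: a set of four elements that is closed under multiplication and in which every element squares to 1 is the Klein four-group.

```python
def involutions(m):
    """All x in Z/m with x^2 = 1 (mod m)."""
    return [x for x in range(m) if (x * x) % m == 1]

v4 = involutions(8)
closed = all((a * b) % 8 in v4 for a in v4 for b in v4)
self_inverse = all((a * a) % 8 == 1 for a in v4)

print(v4)                    # [1, 3, 5, 7]
print(closed, self_inverse)  # True True
print(involutions(9), involutions(25), involutions(49))  # [1, 8] [1, 24] [1, 48]
```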

Nil/Sq0 ratio	Any single channel: 1 (odd) or 2 (mod-8 only)	Full product: exactly 2
Invol/Idem ratio	Any single channel: 1 (odd) or 2 (mod-8 only)	Full product: exactly 2
mod-8 unique	Other channels: exponent depth 1 or 2	mod-8: depth 3. The involution doubling IS the exponent justification.

Implementation

This demo is .ax compiled to WebAssembly by a self-hosting .ax compiler (the ouroboros). Two training modes: i32 fixed-point (scale 1000) and f64 native float (leaky ReLU + sigmoid + cross-entropy). 152 parameters: 33 standard (nh=8) + 119 CRT (7 channels x 17 params). Training: mini-batch SGD with golden interleave -- perm[i] = (i*137) % nd -- the phyllotaxis constant as mixing schedule. 137 = floor(golden angle in degrees). CRT = 7 independent channels averaged at inference = built-in ensemble regularization.
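
The mixing schedule perm[i] = (i*137) % nd can be sketched as follows. It visits every index exactly once whenever 137 and nd share no common factor, which holds for any nd that is not a multiple of 137 because 137 is prime. The function name is illustrative.

```python
import math

def golden_interleave(nd, stride=137):
    """Visit order perm[i] = (i * stride) % nd over nd samples."""
    return [(i * stride) % nd for i in range(nd)]

# 137 is the integer part of the golden angle in degrees.
phi = (1 + math.sqrt(5)) / 2
golden_angle = 360 * (2 - phi)       # about 137.5 degrees
stride = math.floor(golden_angle)

perm = golden_interleave(10, stride)
print(stride)  # 137
print(perm)    # [0, 7, 4, 1, 8, 5, 2, 9, 6, 3]

# The schedule is a permutation when gcd(stride, nd) == 1.
is_permutation = sorted(golden_interleave(500)) == list(range(500))
print(is_permutation)  # True
```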
