Train a neural network split into 7 independent sub-networks: mod 8, mod 9, mod 25, mod 49, mod 11, mod 13, mod 17. Each channel trains alone. Standard network (red) beside CRT ensemble (green). Kill any channel -- graceful degradation. Corrupt weights -- triple-parity (mod 11 + mod 13 + mod 17) detects it for free. 5 datasets. Two modes: i32 fixed-point and f64 native float.
STANDARD NEURAL NETWORK:
  Forward: x -> W*x + b -> ReLU -> output
  Backward: full Jacobian (all neurons interact)
  Cost: O(N^2) per layer

CRT NEURAL NETWORK:
  Forward: x -> 7 independent sub-networks -> average
  Backward: 7 INDEPENDENT Jacobians (block-diagonal)
  Cost: O(sum(m_i^2)) = O(8^2+9^2+25^2+49^2+11^2+13^2+17^2) = O(3750)

  mod 8 (2^3): 8 states, depth 3 (unique -- only even-prime channel)
  mod 9 (3^2): 9 states, depth 2
  mod 25 (5^2): 25 states, depth 2
  mod 49 (7^2): 49 states, depth 2 (largest channel)
  mod 11: error detection (parity check)
  mod 13: quality gate (bounds check)
  mod 17: escape channel (triple-parity ECC with mod 11 + mod 13)

At N=214,414,200: 12 TRILLION x fewer gradient ops.
Triple-parity ECC (mod 11 + mod 13 + mod 17) = free error detection.
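The op-count comparison above can be reproduced with a few lines of arithmetic. This is a minimal sketch in Python (not the page's .ax source); the function name `cost_summary` is hypothetical, and the only inputs are the seven moduli listed in the text.

```python
from math import prod

# Channel moduli exactly as listed in the text.
MODULI = [8, 9, 25, 49, 11, 13, 17]

def cost_summary(moduli):
    """Return (N, dense_ops, block_ops, ratio) for the op-count comparison."""
    n = prod(moduli)                         # ring size N = product of the moduli
    dense_ops = n * n                        # N^2: full Jacobian
    block_ops = sum(m * m for m in moduli)   # sum(m_i^2): block-diagonal
    return n, dense_ops, block_ops, dense_ops / block_ops

n, dense_ops, block_ops, ratio = cost_summary(MODULI)
print(n, block_ops, f"{ratio:.3e}")  # 214414200 3750 1.226e+13
```

The ratio is a pure counting argument about N^2 versus sum(m_i^2); it says nothing about the accuracy of either network.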
Standard = one 2->8->1 network (33 params). CRT = seven independent 2->4->1 networks (119 params, 7 channels x 17 params each), outputs averaged. 5 tasks: XOR (trivial), Circle (radial boundary), Moons (interleaving crescents), Spiral (hard), Blobs (500 pts, GPU scale). Adjustable learning rate: low (0.05) = stable but slow, high (0.50) = fast but may overshoot. CPU Train = fixed-point i32. CPU Train (f64) = native float with leaky ReLU + sigmoid + cross-entropy. GPU uses mini-batch SGD + leaky ReLU 25% + golden interleave (stride 137).
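The parameter counts and the averaging step can be checked with a small sketch. It is written in Python rather than .ax, and the helper names (`param_count`, `init_mlp`, `forward`, `crt_forward`) and the uniform random initialization are assumptions made for illustration, not the demo's actual code.

```python
import random

def param_count(n_in, n_hidden, n_out=1):
    """Weights plus biases of a one-hidden-layer network."""
    return n_in * n_hidden + n_hidden + n_hidden * n_out + n_out

def init_mlp(n_in, n_hidden, rng):
    """Small random weights, zero biases (initialization is an assumption)."""
    return {
        "w1": [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_hidden)],
        "b1": [0.0] * n_hidden,
        "w2": [rng.uniform(-1, 1) for _ in range(n_hidden)],
        "b2": 0.0,
    }

def forward(net, x):
    """x -> W*x + b -> ReLU -> linear output."""
    hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(net["w1"], net["b1"])]
    return sum(w * h for w, h in zip(net["w2"], hidden)) + net["b2"]

def crt_forward(channels, x):
    """Average the outputs of the independent channel networks."""
    outputs = [forward(net, x) for net in channels]
    return sum(outputs) / len(outputs)

rng = random.Random(0)
standard = init_mlp(2, 8, rng)                      # one 2->8->1 network
channels = [init_mlp(2, 4, rng) for _ in range(7)]  # seven 2->4->1 networks
print(param_count(2, 8), 7 * param_count(2, 4))     # 33 119
```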
Kill individual channels during inference -- the CRT network degrades gracefully because each channel is independent. A standard network has no structural channel decomposition -- failure is not isolated to independent components.
Even with channels dead, the remaining ones reconstruct a partial answer. With mod-11 alive, corruption in any other channel is detectable -- the parity check catches inconsistencies.
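One way to picture channel killing is as a masked average: dead channels are skipped and the survivors still produce an answer. The sketch below assumes that interpretation; `ensemble_output` and the sample outputs are hypothetical, not taken from the demo.

```python
def ensemble_output(channel_outputs, alive):
    """Average only the channels still marked alive.

    Returns None when every channel is dead, since no answer can be formed.
    """
    live = [y for y, ok in zip(channel_outputs, alive) if ok]
    if not live:
        return None
    return sum(live) / len(live)

# Hypothetical per-channel outputs for one input point.
outputs = [0.90, 0.80, 0.85, 0.95, 0.90, 0.70, 0.88]
all_alive = [True] * 7
two_dead = [True, True, False, True, True, False, True]

print(ensemble_output(outputs, all_alive))
print(ensemble_output(outputs, two_dead))
```

Because each channel contributes one term to the average, removing a channel shifts the result but does not invalidate the remaining terms.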
11 = 1+2+3+5, the parity checksum of the first four chain primes. Corrupt the mod-49 channel weights with random noise. The mod-11 channel detects corruption by comparing its independent prediction against the corrupted consensus. No additional cost -- mod-11 trains as part of CRT decomposition, protecting all other channels for free.
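The comparison described above can be sketched as a disagreement check between the mod-11 channel and the consensus of the other channels. The function name `detect_corruption`, the `tolerance` value, and the sample outputs are all assumptions; the demo's actual threshold is not given in the text.

```python
def detect_corruption(channel_outputs, parity_index, tolerance=0.25):
    """Flag disagreement between the parity channel and the consensus of the rest.

    tolerance is a hypothetical threshold chosen for this illustration.
    """
    parity = channel_outputs[parity_index]
    others = [y for i, y in enumerate(channel_outputs) if i != parity_index]
    consensus = sum(others) / len(others)
    return abs(parity - consensus) > tolerance

# Channel order follows the text: 8, 9, 25, 49, 11, 13, 17.
MOD_49, MOD_11 = 3, 4

clean = [0.90, 0.88, 0.91, 0.89, 0.90, 0.92, 0.87]
corrupted = list(clean)
corrupted[MOD_49] = 5.0  # stand-in for the output of noise-corrupted weights

print(detect_corruption(clean, MOD_11))      # False
print(detect_corruption(corrupted, MOD_11))  # True
```

A check of this kind reports that something disagrees; it does not by itself identify which channel was corrupted.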
| Property | Standard | CRT |
|---|---|---|
| Backprop | Full Jacobian | Block-diagonal (7 independent) |
| Ops at N=214,414,200 | N^2 = 46 quadrillion | sum(m_i^2) = 3,750 |
| Savings | 1x | about 12,260,000,000,000x |
| Channel failure | Entangled | Graceful degradation |
| Error detection | External validation | Triple-parity ECC (free) |
| Parallelism | Synchronous | 7 independent machines |
| Hot swap | Retrain everything | Add/remove channels |
Two views of the same data: standard (one big network) and CRT (seven small ones). CRT decomposition is EXACT -- the Chinese Remainder Theorem guarantees unique reconstruction. The 12 trillion x savings is the mathematical ratio N^2/sum(m_i^2) at N = 214,414,200.
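The exactness claim refers to the ring Z/214,414,200: every element is determined by its seven residues. A minimal Python sketch of the round trip, using the standard CRT reconstruction formula (the names `to_residues` and `from_residues` are hypothetical; `pow(x, -1, m)` needs Python 3.8 or later):

```python
from math import prod

MODULI = [8, 9, 25, 49, 11, 13, 17]  # pairwise coprime, as listed in the text
N = prod(MODULI)                     # 214,414,200

def to_residues(x, moduli=MODULI):
    """Split a ring element into one residue per channel."""
    return [x % m for m in moduli]

def from_residues(residues, moduli=MODULI):
    """Rebuild the unique element of Z/N with the given residues."""
    n = prod(moduli)
    total = 0
    for r, m in zip(residues, moduli):
        partial = n // m                           # product of the other moduli
        total += r * partial * pow(partial, -1, m)  # modular inverse mod m
    return total % n

x = 123_456_789
assert from_residues(to_residues(x)) == x
```

This covers integer ring elements only. The averaged sub-network outputs in the demo are real-valued predictions, which is a separate mechanism from this reconstruction.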
HONEST NOTE: This comparison overstates practical benefit. Real neural networks do not have N = 214,414,200 parameters in one layer. The block-diagonal structure trades off representational capacity for computational independence. Whether CRT decomposition improves real ML workloads has not been benchmarked against PyTorch or JAX. The demos above are toy problems. The architecture idea (independent sub-networks with built-in error detection) is novel; its practical value is an open question.
CRT block-diagonal backprop maps directly to WGSL compute shaders: 7 independent channels = 7 workgroups, each operating in parallel. Each benchmark dispatches 100,000 ring elements through Z/214,414,200.
Each operation runs as a WGSL compute shader with 7 channels. Ring Add = forward pass accumulation. CRT Multiply = block-diagonal gradient step. Eigenvalue = convergence check. All verified against CPU.
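The "verified against CPU" step can be illustrated on the CPU side alone: channel-wise add and multiply on residues must agree with ordinary arithmetic mod N. This is a Python sketch under that assumption, not the WGSL shaders themselves; the function names are hypothetical.

```python
import random

MODULI = (8, 9, 25, 49, 11, 13, 17)
N = 214_414_200  # product of the moduli

def residues(x):
    """One residue per channel."""
    return tuple(x % m for m in MODULI)

def ring_add(a, b):
    """Channel-wise addition; no channel reads another channel's value."""
    return tuple((x + y) % m for x, y, m in zip(a, b, MODULI))

def crt_multiply(a, b):
    """Channel-wise multiplication, again fully independent per channel."""
    return tuple((x * y) % m for x, y, m in zip(a, b, MODULI))

def verify(pairs):
    """Compare channel-wise results with direct arithmetic in Z/N."""
    return all(
        ring_add(residues(a), residues(b)) == residues((a + b) % N)
        and crt_multiply(residues(a), residues(b)) == residues((a * b) % N)
        for a, b in pairs
    )

rng = random.Random(42)
pairs = [(rng.randrange(N), rng.randrange(N)) for _ in range(1000)]
print(verify(pairs))  # True
```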
Train the same CRT network on GPU. Mini-batch SGD with leaky ReLU 25% and golden interleave (stride 137), matching CPU training. 8 networks (1 standard + 7 channels) in a single WGSL compute dispatch. Uses the selected learning rate and epoch count. Atomic gradient accumulation in workgroup shared memory.
Export trained weights as 135 comma-separated i32 values (scale 1000) from CPU Train mode. Copy and reuse in .ax programs or external tools. (f64 weights stored separately.)
Train a model, then export.
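For external tools, reading and writing the exported string is straightforward. The sketch below assumes weights held as floats and converted at scale 1000; the function names are hypothetical, and the demo's own i32 weights may already be stored at that scale.

```python
SCALE = 1000  # fixed-point scale stated in the text

def export_weights(weights, scale=SCALE):
    """Serialize weights as comma-separated integers at the given scale."""
    return ",".join(str(int(round(w * scale))) for w in weights)

def import_weights(text, scale=SCALE):
    """Parse the comma-separated integers back into floats."""
    return [int(token) / scale for token in text.split(",")]

text = export_weights([0.5, -1.234, 2.0])
print(text)                  # 500,-1234,2000
print(import_weights(text))  # [0.5, -1.234, 2.0]
```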
Native float lifts the i32 truncation floor. Same 7-network architecture (1 standard nh=8 + 6 CRT nh=4 = 135 params), native WASM float64: leaky ReLU hidden (alpha=0.01), sigmoid output, cross-entropy gradient. No clipping, no overflow, no fixed-point truncation.
i32 FIXED-POINT (CPU Train):
  Scale 1000. Gradients clipped +-1000. Weights clamped +-5000.
  Circle: 49/50. Spiral: 104/200. Moons: 73/100.

f64 NATIVE FLOAT (CPU Train f64):
  sigmoid(x) = 1/(1+exp(-x)). leaky_relu(x, 0.01).
  Circle: 50/50 (perfect). Spiral: 114/200 (+10). Moons: 74/100.

WHY f64 WINS:
  No gradient truncation at decision boundaries.
  Sigmoid output = proper [0,1] probabilities.
  Cross-entropy gradient = (sigmoid(z) - target), no clipping needed.
  He initialization with sqrt() = proper scale per fan-in.
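The difference between the two modes can be sketched side by side. The f64 helpers follow the formulas stated above; `fixed_point_update` is a hypothetical update rule (clip, truncate, clamp) written only to show how integer truncation can swallow small gradients, and `he_init_std` uses the common sqrt(2 / fan_in) form, which the text does not spell out.

```python
import math

def leaky_relu(x, alpha=0.01):
    """Hidden activation for the f64 mode."""
    return x if x > 0 else alpha * x

def sigmoid(x):
    """1/(1+exp(-x)), evaluated in a form that avoids overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def cross_entropy_grad(z, target):
    """Gradient of cross-entropy with respect to the pre-sigmoid output z."""
    return sigmoid(z) - target

def he_init_std(fan_in):
    """Assumed He scale: sqrt(2 / fan_in)."""
    return math.sqrt(2.0 / fan_in)

def clamp(v, lo, hi):
    return max(lo, min(hi, v))

def fixed_point_update(w_i32, grad_i32, lr):
    """Hypothetical i32 step: clip gradient, truncate the step, clamp the weight."""
    g = clamp(grad_i32, -1000, 1000)   # gradients clipped to +-1000
    step = int(lr * g)                 # truncation toward zero
    return clamp(w_i32 - step, -5000, 5000)  # weights clamped to +-5000

print(fixed_point_update(100, 10, 0.05))  # 100: the step truncates to zero
print(cross_entropy_grad(0.0, 1.0))       # -0.5
```

With lr = 0.05 and a gradient of 10 (0.010 at scale 1000), the integer step is zero and the weight never moves, which is one plausible reading of the "truncation floor".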
Both modes use the same CRT decomposition: 7 independent sub-networks whose predictions are averaged at inference. This is classic ensemble regularization -- each channel sees the full training signal, trains independently, and averaging smooths individual errors. The f64 mode makes this visible: smoother decision boundaries, faster convergence, higher accuracy on all datasets.

Tested: momentum SGD helps the standard network (Moons 71 to 83 at beta=0.9) but HURTS CRT (74 to 69). Similarly, adding a second hidden layer helps standard (Moons 71 to 84) but NOT CRT (74 to 72). Both momentum and depth add regularization that a single large network needs, but CRT's ensemble averaging already provides it.

Learning rate robustness: sweeping 7 rates (0.10 to 2.00), CRT accuracy varies by only 7 points on Moons vs 13 for standard -- 46% less sensitive. CRT works well with any rate in the 0.25 to 1.50 range without per-task tuning. CRT does not need external regularization -- the structure IS the regularizer.
Run the exhaustive per-channel count: nilpotent, square-zero, idempotent, involution for all 7 channels.
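A brute-force version of this count fits in a few lines. The sketch assumes the usual ring-theoretic definitions (nilpotent: some power of x is 0; square-zero: x^2 = 0; idempotent: x^2 = x; involution: x^2 = 1); the demo may define them differently, and `channel_counts` is a hypothetical name.

```python
MODULI = [8, 9, 25, 49, 11, 13, 17]

def channel_counts(m):
    """Exhaustively count special elements of Z/m."""
    return {
        # If any power of x is 0 mod m, then x^m is too, so one pow() suffices.
        "nilpotent": sum(1 for x in range(m) if pow(x, m, m) == 0),
        "square_zero": sum(1 for x in range(m) if (x * x) % m == 0),
        "idempotent": sum(1 for x in range(m) if (x * x) % m == x),
        "involution": sum(1 for x in range(m) if (x * x) % m == 1),
    }

for m in MODULI:
    print(m, channel_counts(m))
```

Under these definitions mod 8 stands apart: it has 4 nilpotents but only 2 square-zero elements, and 4 involutions where every other channel has 2.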
This demo is .ax compiled to WebAssembly by a self-hosting .ax compiler (the ouroboros). Two training modes: i32 fixed-point (scale 1000) and f64 native float (leaky ReLU + sigmoid + cross-entropy). 152 parameters: 33 standard (nh=8) + 119 CRT (7 channels x 17 params). Training: mini-batch SGD with golden interleave -- perm[i] = (i*137) % nd -- the phyllotaxis constant as mixing schedule. 137 = floor(golden angle in degrees). CRT = 7 independent channels averaged at inference = built-in ensemble regularization.
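The golden interleave formula can be tried directly. This Python sketch implements perm[i] = (i*137) % nd as stated; the helper names are hypothetical. One property worth noting: the result is a true permutation only when nd shares no factor with 137, which holds for any nd that is not a multiple of 137 because 137 is prime.

```python
import math

STRIDE = 137

def golden_interleave(nd, stride=STRIDE):
    """perm[i] = (i * stride) % nd, the mixing schedule from the text."""
    return [(i * stride) % nd for i in range(nd)]

def is_permutation(perm):
    """True when every index 0..nd-1 appears exactly once."""
    return sorted(perm) == list(range(len(perm)))

phi = (1 + math.sqrt(5)) / 2
golden_angle = 360.0 * (1.0 - 1.0 / phi)  # about 137.5 degrees

print(math.floor(golden_angle))           # 137
print(golden_interleave(10))              # [0, 7, 4, 1, 8, 5, 2, 9, 6, 3]
print(is_permutation(golden_interleave(500)))  # True
```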
Full CRT AI architecture > Source code · Public domain (CC0)
.ax source compiled to WASM via self-hosting compiler. Zero HTML authored.