CRT GPU Compute

CRT approach to GPU parallelism. Any function f that distributes over modular arithmetic can be computed in 7 independent workgroups -- one per prime power -- then reconstructed via two-level CRT. Zero cross-channel dependency. FPGA-measured, GPU-deployed.

How It Works

CRT Parallelism Principle
For f(x) = x^2 mod N where N = 8*9*25*49*11*13*17 = 214,414,200: decompose x into 7 residues. Compute f(r_i) = r_i^2 mod m_i independently in each channel. CRT reconstruction gives f(x) mod N exactly. Z/N = Z/8 x Z/9 x Z/25 x Z/49 x Z/11 x Z/13 x Z/17. The 3+4 split: 3 data channels {mod 8, mod 25, mod 49} carry bulk compute, 4 parity channels {mod 9, mod 11, mod 13, mod 17} verify.
7 workgroups
CRT decomposition
Each CRT channel computes independently. No shared memory, no barriers, no race conditions.
9-stage pipeline
FPGA measured
3 x 3 stages. FPGA: 6555 LUT4 (31%), 267 MHz on Tang Nano 20K.
Block-diagonal
Zero cross-talk
Gradients in channel i never touch channel j. Backprop decomposes the same way.
3+4 split
Compute + verify
3 data channels do the work. 4 parity channels verify for free.

Try It

Input value (0-214414199):

Decomposes x into 7 channels, squares each independently, reconstructs via CRT. Compares with direct x^2 mod N.

Batch Test

Real GPU Benchmark (WebGPU)

Run real CRT compute shaders on your GPU. 7 channels of Z/214,414,200 in WGSL. 15 operations: arithmetic, ML passes, ring census, and power map roots of unity. Ring Census: x^2=x (128=2^7), x^2=1 (256=2^8), x^2=0 (210=2*3*5*7), rad(N)|x (420=Carmichael lambda), gcd(x,N)=1 (Euler phi). Power Map: x^3=1 (27=3^3), x^5=1 (25=5^2), x^7=1 (7). The p-th prime power appears as the p-th-power roots-of-unity count. Full verification needs N=214,414,200 (~860MB GPU buffer).

Elements (1K-250M):

GPU Map: Compute on Your Data

gpu_map dispatches a WGSL kernel on user arrays. Data flows: WASM -> GPU -> WASM. Click to multiply 8 elements via CRT on GPU.

FPGA Ground Truth (Tang Nano 20K)

FPGA synthesis reveals the minimum circuit for each CRT operation. These measurements inform the GPU shader design. The FPGA is a microscope, not a destination.

9-stage pipeline
6555 LUT4, 267 MHz
3 x 3 stages. Best verified design. 31% of Tang Nano 20K.
Decompose = 50%
2937 LUT4 (of 5822 total)
7 modulus operations dominate all per-op cost. Critical path: mod 9 at 307 MHz. GPU: mod 8 = single ALU instruction.
Reconstruct
2059 LUT4
Two-level CRT: 6 channels first, then mod-17 lift (inv(11,17)=14). Single reduce. Same algorithm in WGSL shader.
Add: 365, Mul: 461
Arithmetic is cheap
Ring add/mul are small. The cost is decompose (getting into CRT) not computing (once there).

Per-Op Cost Distribution: FPGA vs GPU

FPGA measures gate area (LUT4). GPU measures time (ms). Run all 7 benchmarks above, then compare where cost lives on each substrate:

Cost Decomposition
Decompose cost = T_decompose. Reconstruct cost = T_roundtrip - T_decompose. Per-channel compute = T_eigenvalue - T_decompose. Full CRT multiply overhead = T_multiply - T_add. FPGA ground truth: decompose = 50% of per-op gate area (ALU divider chains). On GPU, integer modulus is a single ALU instruction -- expect decompose to be proportionally cheaper.
FPGA: Decompose
50% (2937 LUT4)
7 modulus ops via combinational dividers. Dominates circuit area. Critical path: mod 9.
FPGA: Reconstruct
35% (2059 LUT4)
6 idempotent multiplies + sum + mod 12,612,600 + mod-17 lift. Second largest cost.
GPU: Decompose
Run benchmark
GPU has hardware integer ALU. Expect mod ops to be fast. Decompose should NOT dominate on GPU.
GPU: Reconstruct
Run benchmark
T_roundtrip - T_decompose isolates CRT reconstruction cost on GPU.

CRT Forward + Backward: The ML Bridge

The CRT Forward benchmark (op 5) runs a 2-layer neural network independently in each of 7 CRT channels. The CRT Backward benchmark (op 6) computes the input gradient via chain rule: recompute forward activations, then g = w2 * 2h * w1 mod q per channel. Seven independent gradient computations, then two-level CRT reconstruction. Forward + backward = complete ML pipeline on the GPU.

CRT Parallelism for ML
A function f: Z/N -> Z/N that decomposes as f = (f_0, ..., f_6) across CRT channels can be evaluated in 7 independent parallel computations. A multi-layer perceptron with modular arithmetic and squaring activation decomposes exactly this way. Each channel trains and infers independently. CRT reconstruction combines channel outputs into a single ring prediction. Block-diagonal Jacobian: zero cross-channel gradients (PROVED). Backward pass: dy/dr = w2 * 2h * w1 mod q. More ops per element than forward -- expect higher GPU speedup.
7 networks
One per channel
Each CRT channel runs its own 2-layer perceptron: affine + x^2 activation + affine. 4 weights per channel, 28 total.
Zero sync
Block-diagonal
No cross-channel dependency during forward or backward pass. Perfect GPU parallelism. PROVED: Jacobian = diag(J_8, ..., J_17).
Backward pass
Chain rule per channel
g = w2 * 2h * w1 mod q. Recompute forward activations, then 3 multiplies for gradient. More per-element work than forward -- higher GPU speedup.
MNIST proved
25/25 accuracy
5-class 25->10->5 classifier in .ax. f64 arrays + arena reset + softmax. The pipeline works.
.ax->WGSL
Transpiler
Write GPU kernels in .ax, compile to WGSL compute shaders. Same language for CPU (WASM) and GPU (WGSL). All 12 benchmarks have .ax source equivalents.
GPU Ring Census
5 invariants + power map
Squaring: x^2=x (128=2^7), x^2=1 (256=2^8), x^2=0 (210=2*3*5*7). Counting: rad(N)|x (420=lambda), gcd=1 (phi). Power map: x^3=1 (27=3^3), x^5=1 (25=5^2), x^7=1 (7). Prime powers appear as roots-of-unity counts.

Gradient Field Theorem

Gradient Field (Theorem 52, PROVED)
Prime CRT channels (Z/p) are fields: for all nonzero a,b, the product a*b is never zero. Prime-power channels (Z/p^k, k>=2) are non-fields: nonzero zero-product (NZP) pairs exist. NZP formula: Z/p^2 (odd) has (p-1)^2 NZP pairs. Z/2^3 has 5. Across all 7 channels: 3 prime channels (mod 11, 13, 17) have NZP = 0. 4 prime-power channels (mod 8, 9, 25, 49) have NZP = 61 total. In modular SGD, NZP pairs cause gradient death: a valid gradient multiplied by a valid weight produces zero. Prime channels preserve gradient flow. Prime-power channels lose it. XOR benchmark: 3x higher convergence rate on prime channels (100 restarts, 500 epochs each).

Run the NZP census: exhaustive count of nonzero zero-product pairs for all 7 channels.

Prime = field
NZP = 0
mod 11, mod 13, mod 17 are prime: ab=0 implies a=0 or b=0. Gradients never vanish spuriously.
Prime-power = non-field
NZP > 0
mod 8: 5 pairs. mod 9: 4. mod 25: 16. mod 49: 36. Total: 61 gradient-death pathways.
Density ordering
mod 8 > mod 9 > mod 25 > mod 49
Smaller channels have higher NZP density. mod 8 (10.2%) > mod 9 (6.3%) > mod 25 (2.8%) > mod 49 (1.6%). Gradient risk decreases with channel size.
SGD convergence
3x prime > prime-power
XOR task, 100 restarts, 500 epochs: prime channels converge 3x more often. The field property IS the convergence advantage.
Zero-divisor pairsPrime-power: 61 total (gradient death)Prime: 0 (field -- gradients preserved)SGD convergencePrime-power: ~20-30% (XOR, 100 runs)Prime: ~60-80% (3x higher)Algebraic structureZ/p^k: nilpotents + zero divisorsZ/p: every nonzero element is a unit

HONEST NOTE: The 3x convergence gap is observed on XOR with a simple 5-parameter network (100 restarts, 500 epochs). AND/OR tasks had too few convergences to show a clear signal. The NZP counts are exhaustive and proved. The SGD gap is empirical.

CRT Compute vs CUDA

Parallelism modelCUDA: thread blocks + shared memory + barriersCRT: 7 independent channels. No sync needed.Cross-talkShared memory races, warp divergenceZero. Channels are algebraically orthogonal.ScalingGPU hardware-dependent (thousands of cores)Ring-intrinsic. 7 algebraically independent channels.VerificationSeparate checksums needed3+4 split: 4 parity channels verify 3 data channels. Triple-parity error detection (rate 4/7).Circuit floorUnknown (GPU microarch proprietary)FPGA-measured: 9-stage pipeline 6555 LUT4, 267 MHz. Minimum circuit known.Patent statusCUDA ecosystem (mostly royalty-free for end users)CC0. Public domain. Forever.

HONEST NOTE: This is a demonstration of CRT-structured parallelism, not a CUDA replacement. CUDA operates on thousands of cores with sophisticated memory hierarchies. CRT parallelism is limited to 7 channels. The structural advantage is zero-synchronization independence, built-in verification, and hardware-verified minimum circuits.

Source code · Public domain (CC0)

Report issue

.ax source compiled to WASM via self-hosting compiler. Zero HTML authored.