CRT GPU Compute -- antonlebed.com

How It Works

CRT Parallelism Principle

For f(x) = x^2 mod N where N = 8*9*25*49*11*13*17 = 214,414,200: decompose x into 7 residues. Compute f(r_i) = r_i^2 mod m_i independently in each channel. CRT reconstruction gives f(x) mod N exactly. Z/N = Z/8 x Z/9 x Z/25 x Z/49 x Z/11 x Z/13 x Z/17. The 3+4 split: 3 data channels {mod 8, mod 25, mod 49} carry bulk compute, 4 parity channels {mod 9, mod 11, mod 13, mod 17} verify.

7 workgroups

CRT decomposition

Each CRT channel computes independently. No shared memory, no barriers, no race conditions.

9-stage pipeline

FPGA measured

3 x 3 stages. FPGA: 6555 LUT4 (31%), 267 MHz on Tang Nano 20K.

Block-diagonal

Zero cross-talk

Gradients in channel i never touch channel j. Backprop decomposes the same way.

3+4 split

Compute + verify

3 data channels do the work. 4 parity channels verify for free.

Real GPU Benchmark (WebGPU)

Run real CRT compute shaders on your GPU. 7 channels of Z/214,414,200 in WGSL. 15 operations: arithmetic, ML passes, ring census, and power map roots of unity. Ring Census: x^2=x (128=2^7), x^2=1 (256=2^8), x^2=0 (210=2*3*5*7), rad(N)|x (420=Carmichael lambda), gcd(x,N)=1 (Euler phi). Power Map: x^3=1 (27=3^3), x^5=1 (25=5^2), x^7=1 (7). The p-th prime power appears as the p-th-power roots-of-unity count. Full verification needs N=214,414,200 (~860MB GPU buffer).

Elements (1K-250M):

FPGA Ground Truth (Tang Nano 20K)

FPGA synthesis reveals the minimum circuit for each CRT operation. These measurements inform the GPU shader design. The FPGA is a microscope, not a destination.

9-stage pipeline

6555 LUT4, 267 MHz

3 x 3 stages. Best verified design. 31% of Tang Nano 20K.

Decompose = 50%

2937 LUT4 (of 5822 total)

7 modulus operations dominate all per-op cost. Critical path: mod 9 at 307 MHz. GPU: mod 8 = single ALU instruction.

Reconstruct

2059 LUT4

Two-level CRT: 6 channels first, then mod-17 lift (inv(11,17)=14). Single reduce. Same algorithm in WGSL shader.

Add: 365, Mul: 461

Arithmetic is cheap

Ring add/mul are small. The cost is decompose (getting into CRT) not computing (once there).

Per-Op Cost Distribution: FPGA vs GPU

FPGA measures gate area (LUT4). GPU measures time (ms). Run all 7 benchmarks above, then compare where cost lives on each substrate:

Cost Decomposition

Decompose cost = T_decompose. Reconstruct cost = T_roundtrip - T_decompose. Per-channel compute = T_eigenvalue - T_decompose. Full CRT multiply overhead = T_multiply - T_add. FPGA ground truth: decompose = 50% of per-op gate area (ALU divider chains). On GPU, integer modulus is a single ALU instruction -- expect decompose to be proportionally cheaper.

FPGA: Decompose

50% (2937 LUT4)

7 modulus ops via combinational dividers. Dominates circuit area. Critical path: mod 9.

FPGA: Reconstruct

35% (2059 LUT4)

6 idempotent multiplies + sum + mod 12,612,600 + mod-17 lift. Second largest cost.

GPU: Decompose

Run benchmark

GPU has hardware integer ALU. Expect mod ops to be fast. Decompose should NOT dominate on GPU.

GPU: Reconstruct

Run benchmark

T_roundtrip - T_decompose isolates CRT reconstruction cost on GPU.

CRT Forward + Backward: The ML Bridge

The CRT Forward benchmark (op 5) runs a 2-layer neural network independently in each of 7 CRT channels. The CRT Backward benchmark (op 6) computes the input gradient via chain rule: recompute forward activations, then g = w2 * 2h * w1 mod q per channel. Seven independent gradient computations, then two-level CRT reconstruction. Forward + backward = complete ML pipeline on the GPU.

CRT Parallelism for ML

A function f: Z/N -> Z/N that decomposes as f = (f_0, ..., f_6) across CRT channels can be evaluated in 7 independent parallel computations. A multi-layer perceptron with modular arithmetic and squaring activation decomposes exactly this way. Each channel trains and infers independently. CRT reconstruction combines channel outputs into a single ring prediction. Block-diagonal Jacobian: zero cross-channel gradients (PROVED). Backward pass: dy/dr = w2 * 2h * w1 mod q. More ops per element than forward -- expect higher GPU speedup.

7 networks

One per channel

Each CRT channel runs its own 2-layer perceptron: affine + x^2 activation + affine. 4 weights per channel, 28 total.

Zero sync

Block-diagonal

No cross-channel dependency during forward or backward pass. Perfect GPU parallelism. PROVED: Jacobian = diag(J_8, ..., J_17).

Backward pass

Chain rule per channel

g = w2 * 2h * w1 mod q. Recompute forward activations, then 3 multiplies for gradient. More per-element work than forward -- higher GPU speedup.

MNIST proved

25/25 accuracy

5-class 25->10->5 classifier in .ax. f64 arrays + arena reset + softmax. The pipeline works.

.ax->WGSL

Transpiler

Write GPU kernels in .ax, compile to WGSL compute shaders. Same language for CPU (WASM) and GPU (WGSL). All 12 benchmarks have .ax source equivalents.

GPU Ring Census

5 invariants + power map

Squaring: x^2=x (128=2^7), x^2=1 (256=2^8), x^2=0 (210=2*3*5*7). Counting: rad(N)|x (420=lambda), gcd=1 (phi). Power map: x^3=1 (27=3^3), x^5=1 (25=5^2), x^7=1 (7). Prime powers appear as roots-of-unity counts.

Gradient Field Theorem

Gradient Field (Theorem 52, PROVED)

Prime CRT channels (Z/p) are fields: for all nonzero a,b, the product a*b is never zero. Prime-power channels (Z/p^k, k>=2) are non-fields: nonzero zero-product (NZP) pairs exist. NZP formula: Z/p^2 (odd) has (p-1)^2 NZP pairs. Z/2^3 has 5. Across all 7 channels: 3 prime channels (mod 11, 13, 17) have NZP = 0. 4 prime-power channels (mod 8, 9, 25, 49) have NZP = 61 total. In modular SGD, NZP pairs cause gradient death: a valid gradient multiplied by a valid weight produces zero. Prime channels preserve gradient flow. Prime-power channels lose it. XOR benchmark: 3x higher convergence rate on prime channels (100 restarts, 500 epochs each).

Run the NZP census: exhaustive count of nonzero zero-product pairs for all 7 channels.

Prime = field

NZP = 0

mod 11, mod 13, mod 17 are prime: ab=0 implies a=0 or b=0. Gradients never vanish spuriously.

Prime-power = non-field

NZP > 0

mod 8: 5 pairs. mod 9: 4. mod 25: 16. mod 49: 36. Total: 61 gradient-death pathways.

Density ordering

mod 8 > mod 9 > mod 25 > mod 49

Smaller channels have higher NZP density. mod 8 (10.2%) > mod 9 (6.3%) > mod 25 (2.8%) > mod 49 (1.6%). Gradient risk decreases with channel size.

SGD convergence

3x prime > prime-power

XOR task, 100 restarts, 500 epochs: prime channels converge 3x more often. The field property IS the convergence advantage.

Zero-divisor pairsPrime-power: 61 total (gradient death)Prime: 0 (field -- gradients preserved)SGD convergencePrime-power: ~20-30% (XOR, 100 runs)Prime: ~60-80% (3x higher)Algebraic structureZ/p^k: nilpotents + zero divisorsZ/p: every nonzero element is a unit

HONEST NOTE: The 3x convergence gap is observed on XOR with a simple 5-parameter network (100 restarts, 500 epochs). AND/OR tasks had too few convergences to show a clear signal. The NZP counts are exhaustive and proved. The SGD gap is empirical.

CRT Compute vs CUDA

Parallelism modelCUDA: thread blocks + shared memory + barriersCRT: 7 independent channels. No sync needed.Cross-talkShared memory races, warp divergenceZero. Channels are algebraically orthogonal.ScalingGPU hardware-dependent (thousands of cores)Ring-intrinsic. 7 algebraically independent channels.VerificationSeparate checksums needed3+4 split: 4 parity channels verify 3 data channels. Triple-parity error detection (rate 4/7).Circuit floorUnknown (GPU microarch proprietary)FPGA-measured: 9-stage pipeline 6555 LUT4, 267 MHz. Minimum circuit known.Patent statusCUDA ecosystem (mostly royalty-free for end users)CC0. Public domain. Forever.

HONEST NOTE: This is a demonstration of CRT-structured parallelism, not a CUDA replacement. CUDA operates on thousands of cores with sophisticated memory hierarchies. CRT parallelism is limited to 7 channels. The structural advantage is zero-synchronization independence, built-in verification, and hardware-verified minimum circuits.

CRT GPU Compute