Intelligence

7 channels = 7 independent computations

CRT decomposes any computation into 7 independent channels -- same math that decomposes the ring. Shared backbone, independent output heads per channel, mod-11 error detection built into the algebra. Block-diagonal gradients. 9512x parameter compression at Z/970,200 scale. Runs in browser.

CRT Architecture

Standard transformer: one monolithic output layer predicts among N classes. CRT transformer: shared backbone produces a representation, then independent output heads predict residues per channel. Joint probability reconstruction recovers the full prediction. The Chinese Remainder Theorem guarantees unique recovery. The full ring Z/214,414,200 has 7 channels; experiments below used the 5-channel Z/970,200 ring.

Architecture Theorem (PROVED)
Shared backbone + CRT output = unique optimal point. Shared backbone maximizes error detection reliability (correlated representations across channels). CRT output maximizes efficiency (5 small softmaxes instead of 1 large one). 7.5x fewer output parameters. 212x output backprop speedup at Z/12,612,600 scale. 2.68x error detection reliability vs split backbone.

Each channel has a natural domain. The CRT decomposition is not imposed -- it emerges from the ring structure:

Channel | Size | Domain | Why
Z/8 | 8 classes | Coarse (3 bits) | Fastest to learn, cheapest to sacrifice
Z/9 | 9 classes | Mid-range | Minimum for complete decomposition
Z/25 | 25 classes | Fine | Carries golden ratio (discriminant 5)
Z/49 | 49 classes | Precise (49 states) | Deepest channel, controls spectral gap
Z/11 | 11 classes | Parity | Error detection. Always on. 1+2+3+5 = 11
Z/13 | 13 classes | Boundary | Dual parity with mod-11. 2^2+3^2 = 13
Z/17 | 17 classes | Transcendence | 5*7 = 1 mod 17. Period quadruples to 1680

Full ring output: 8+9+25+49+11+13+17 = 132 classes. Experiments at Z/970,200 (5 channels, 102 classes): 9512x compression. The backprop Jacobian is block-diagonal: 25654x fewer entries at N=2310.
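
The class counts and ratios quoted above can be recomputed from the channel moduli alone. The sketch below is an illustration, not the project's code: the function names are hypothetical, and the Jacobian figure assumes each head contributes one q x q block while the monolithic layer contributes a dense N x N block.

```python
from math import prod

# Channel moduli as listed in the channel table.
FULL_RING = [8, 9, 25, 49, 11, 13, 17]   # Z/214,414,200
EXPERIMENT_RING = [8, 9, 25, 49, 11]     # Z/970,200
THIN_2310 = [2, 3, 5, 7, 11]             # Z/2310, exponent-1 channels

def output_classes(moduli):
    """Total number of softmax classes across all CRT heads."""
    return sum(moduli)

def forward_compression(moduli):
    """Monolithic class count N divided by the CRT class count."""
    return prod(moduli) / sum(moduli)

def jacobian_reduction(moduli):
    """Dense N*N Jacobian entries divided by block-diagonal entries (assumed q*q per head)."""
    n = prod(moduli)
    return n * n / sum(q * q for q in moduli)

print(prod(FULL_RING), output_classes(FULL_RING))        # 214414200 132
print(output_classes(EXPERIMENT_RING))                   # 102
print(round(forward_compression(EXPERIMENT_RING)))       # 9512
print(round(jacobian_reduction(THIN_2310)))              # 25654
```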

Five Breakthroughs

All five stack multiplicatively. Missing any one = leaving performance on the table.

Breakthrough | Factor | Mechanism
CRT Decomposition | 9512x compression | 5 small heads vs 1 monolithic
Loop Theorem | N / sum(p_i) forward | CRT = loop unrolling
Block-Diagonal Backprop | 25654x fewer Jacobian entries | 5 independent gradient paths
mod-11 error detection | 100% thin, ~92% prime-power | Free detection. Dual parity mod 11 + mod 13: 100%.
Rissanen MDL | 20x byte / 936x token | Minimum description length selects Z/12,612,600

Combined at N=210 (Z/210): ~126,000x. CRT does not improve AI incrementally. It changes the computational class.

CRT GPU Compute

6 CRT channels = 6 GPU workgroups. Zero synchronization. The ring parallelizes itself. CRT decomposition IS the GPU parallelization strategy. No manual workgroup design. No shared memory management. The algebraic structure does the engineering.

GPU Mapping Theorem (CC0)
N = 12612600 = 2^3 * 3^2 * 5^2 * 7^2 * 11 * 13. CRT decomposition: n -> (n%8, n%9, n%25, n%49, n%11, n%13). Each channel = one GPU workgroup. Per-channel arithmetic = independent compute shader invocations. Block-diagonal Jacobian = no cross-channel gradients = no inter-workgroup communication. All 12,612,600 elements processed simultaneously.

The WGSL compute shader is 8 lines per operation. Decompose, operate per-channel, reconstruct. The CRT reconstruction coefficients are ring constants:

Channel | Modulus | Workgroup | CRT Coefficient
mod 8 | 8 | Coarse (3 bits) | 363825
mod 9 | 9 | Mid-range | 431200
mod 25 | 25 | Fine | 853776
mod 49 | 49 | Precise (49 states) | 732600
mod 11 | 11 | Error detection | 529200
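
The decompose / operate / reconstruct cycle and the five coefficients in the table can be checked in a few lines. This is a minimal Python sketch of the arithmetic only (the page's shader is WGSL, which is not reproduced here); `crt_basis`, `decompose` and `reconstruct` are hypothetical helper names.

```python
from math import prod

MODULI = [8, 9, 25, 49, 11]
N = prod(MODULI)  # 970200

def crt_basis(moduli):
    """Return one coefficient per modulus q: 1 mod q, 0 mod every other modulus."""
    n = prod(moduli)
    basis = []
    for q in moduli:
        m = n // q
        basis.append(m * pow(m, -1, q) % n)  # modular inverse via pow needs Python 3.8+
    return basis

def decompose(x, moduli=MODULI):
    """Map a ring element to its tuple of per-channel residues."""
    return tuple(x % q for q in moduli)

def reconstruct(residues, basis, n=N):
    """Recombine per-channel residues into the unique ring element."""
    return sum(r * b for r, b in zip(residues, basis)) % n

BASIS = crt_basis(MODULI)
print(BASIS)  # [363825, 431200, 853776, 732600, 529200]

# Per-channel multiplication agrees with multiplication in the full ring.
a, b = 123456, 654321
product = reconstruct(
    [(ra * rb) % q for ra, rb, q in zip(decompose(a), decompose(b), MODULI)], BASIS
)
print(product == (a * b) % N)  # True
```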

mod-11 error detection runs as the 5th workgroup -- a free integrity check on every computation. Training a CRT neural network: each channel = independent backprop. 5 GPU workgroups, zero inter-workgroup communication.

Workgroups
5 independent
CRT guarantees zero cross-channel data dependency. Perfect GPU utilization.
Synchronization
Zero
No barriers, no shared memory, no atomic ops between channels.
Scaling
Linear with N
All 970,200 ring elements processed simultaneously. 256 threads per workgroup.
Prior art
CC0
WGSL compute shader is public domain. No CUDA dependency. No NVIDIA lock-in.

Explore: CRT Compression

For a vocabulary of size N, the CRT output stays fixed at 102 classes while a monolithic output layer needs N classes, so the compression ratio N/102 grows with N. Sizes worth trying: 256 (bytes), 2310, 970200, 12612600.

CRT Byte Compression

Every byte is reduced into Z/210 = Z/2 x Z/3 x Z/5 x Z/7, and the mod-11 residue is carried alongside as a fifth channel. Bytes 210-255 wrap (mod 210). Decompose any data into 5 independent channels. Each channel has its own entropy. Rissanen MDL: the sum of channel entropies vs joint entropy reveals exploitable redundancy -- up to 20x at byte scale.

Rissanen Redundancy (CC0)
Joint entropy H(X) <= 8 bits. Channel entropies: H(X%2) + H(X%3) + H(X%5) + H(X%7) + H(X%11). Redundancy ratio = sum / joint. For structured data (text, images), small channels saturate while large ones concentrate. Ratio climbs to 20x. A LEARNING ALGORITHM using CRT channels needs sum(p_i)/N = 28/210 parameters vs monolithic. Stochastic complexity savings: ~20x at byte scale.
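
The redundancy ratio defined above (sum of channel entropies divided by joint entropy) can be measured on any byte string. The sketch below assumes plain empirical Shannon entropy over byte values; the function names and the sample text are illustrative, and no particular ratio is claimed for it.

```python
from collections import Counter
from math import log2

PRIMES = [2, 3, 5, 7, 11]  # channel moduli from the text

def entropy(symbols):
    """Empirical Shannon entropy, in bits, of an iterable of hashable symbols."""
    counts = Counter(symbols)
    total = sum(counts.values())
    return -sum(c / total * log2(c / total) for c in counts.values())

def redundancy(data: bytes):
    """Return (joint entropy, list of channel entropies, sum / joint ratio)."""
    joint = entropy(data)
    channels = [entropy(b % p for b in data) for p in PRIMES]
    ratio = sum(channels) / joint if joint > 0 else float("inf")
    return joint, channels, ratio

joint, channels, ratio = redundancy(b"the quick brown fox jumps over the lazy dog")
print(f"joint = {joint:.3f} bits")
for p, h in zip(PRIMES, channels):
    print(f"H(X % {p}) = {h:.3f} bits")
print(f"redundancy ratio = {ratio:.3f}")
```

Because the five residues together determine a byte value uniquely (2*3*5*7*11 = 2310 > 256), the sum of channel entropies can never fall below the joint entropy, so the ratio is always at least 1.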

Method
Algebraic decomposition
n -> (n%2, n%3, n%5, n%7, n%11). Not statistical. CRT isomorphism.
Channels
5 independent (proved)
CRT guarantees channel independence. 5 cores, zero coordination.
Error detection
Built-in
mod-11 channel detects errors for free. Dual mod-11 + mod-13 corrects.
Foundation
Chinese Remainder Theorem
Over 1,500 years old (Sunzi Suanjing, 3rd to 5th century CE). Not ad hoc pattern matching (LZ77, 1977).

Matrix Foundation

Phase A begins: co-evolving .ax and axiom-native intelligence. The first step: f64 matrix operations in pure .ax. matmul, transpose, softmax, cross-entropy -- the primitives every neural network needs. All compiled to WASM. No Python. No NumPy. No dependencies.

The matrix library uses falloc/fset/fget (f64 arrays) and arena reset (hp/set_hp). Matrices are row-major flat arrays. 18 operations: mat_zero, mat_get, mat_set, matmul, transpose, mat_add, mat_sub, mat_scale, dot_f, vec_sum_f, vec_max_f, softmax, cross_entropy, relu, sigmoid, relu_vec, outer_product, argmax. 54/54 WASI tests pass.

matmul
O(n^3) triple loop
Row-major f64 via fget/fset. 2x3 * 3x2 verified: [58,64,139,154].
softmax
Numerically stable
Max-subtraction prevents overflow. Uniform [0,0,0] -> [1/3,1/3,1/3].
cross_entropy
Log-clamped
eps=1e-6 prevents log(0). Good prediction CE~0.01, bad CE>4.
CRT advantage
Per-channel matmul
7 small matrices (max 49x49) instead of one 214414200-wide layer.
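
For readers without the .ax toolchain, the three primitives described in the cards can be mirrored in Python. This is a sketch of the same ideas (row-major flat arrays, max-subtraction, log clamping), not a port of the library; the input matrices for the 2x3 * 3x2 check are an assumption chosen to reproduce the quoted result.

```python
import math

def matmul(a, b, n, k, m):
    """Multiply an n x k matrix by a k x m matrix; both are row-major flat lists."""
    out = [0.0] * (n * m)
    for i in range(n):
        for j in range(m):
            acc = 0.0
            for t in range(k):
                acc += a[i * k + t] * b[t * m + j]
            out[i * m + j] = acc
    return out

def softmax(v):
    """Numerically stable softmax: subtract the max before exponentiating."""
    top = max(v)
    exps = [math.exp(x - top) for x in v]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, target, eps=1e-6):
    """Negative log-likelihood of the target class, clamped to avoid log(0)."""
    return -math.log(max(probs[target], eps))

print(matmul([1, 2, 3, 4, 5, 6], [7, 8, 9, 10, 11, 12], 2, 3, 2))  # [58.0, 64.0, 139.0, 154.0]
print(softmax([0.0, 0.0, 0.0]))                # three equal entries of 1/3
print(cross_entropy([0.99, 0.01], 0))          # about 0.01
print(cross_entropy([0.99, 0.01], 1))          # about 4.6
```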

CRT Perceptron

First neural network trained in pure .ax f64. A per-channel perceptron learns x^2 mod 210 (Z/210 squaring) using matmul, softmax, and outer_product from the matrix library. 4 independent channels (mod 2, 3, 5, 7) with 87 total parameters. 10 epochs of gradient descent. Result: 210/210 -- perfect accuracy on all inputs, including 168 never seen during training.

A standard perceptron on the same task needs 44100 parameters (210x210). After identical training on 42 examples, it scores 43/210 -- pure memorization, zero generalization. The CRT model generalizes completely because each channel is small enough that training coverage is automatic: 42 examples cover all residue classes up to mod 7.
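
The coverage argument can be illustrated without any gradient descent. The sketch below swaps the softmax perceptrons for plain per-channel lookup tables (17 entries instead of 87 weights) and assumes the 42 training inputs are x = 0..41; both choices are simplifications for illustration, not the trained model described above.

```python
PRIMES = [2, 3, 5, 7]
N = 210
BASIS = [105, 70, 126, 120]  # CRT basis of Z/210

def train_channel_tables(examples):
    """Record, per channel, which output residue follows each input residue."""
    tables = [dict() for _ in PRIMES]
    for x, y in examples:
        for table, p in zip(tables, PRIMES):
            table[x % p] = y % p
    return tables

def predict(tables, x):
    """Look up each channel independently, then CRT-reconstruct the answer."""
    residues = [table[x % p] for table, p in zip(tables, PRIMES)]
    return sum(r * b for r, b in zip(residues, BASIS)) % N

train = [(x, x * x % N) for x in range(42)]  # assumed training set: inputs 0..41
tables = train_channel_tables(train)
correct = sum(predict(tables, x) == x * x % N for x in range(N))
print(correct)  # 210: the 168 unseen inputs are covered because every residue was seen
```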

CRT accuracy
210/210
4 channels: mod 2, 3, 5, 7. Each 210/210. CRT reconstruction perfect.
Standard accuracy
43/210
44100 params. Memorizes training inputs. Cannot generalize.
Backward Loop
506x
87 vs 44100 parameters. sum(p^2) vs N^2.
Z/970,200 Forward
9511x
970200 / 102 forward compression. 3292 params vs 941 billion.

Prime-Power Channels

Z/210 uses thin channels (mod 2, 3, 5, 7). Z/970,200 uses prime-power channels (mod 8, 9, 25, 49, 11) -- the Pareto-optimal exponents. Same architecture, same training. Prime-power channels: 3292 parameters. Standard equivalent: 941 billion. Forward compression jumps from 12x (Z/210 thin) to 9511x.

Prime-power channels compute on a richer substrate. The mod-49 channel carries 49 classes (vs 7 for thin mod-7). mod-11 provides free error detection: every prediction is independently verified by the 5th channel. All five channels achieve perfect accuracy. CRT reconstruction: 50/50 on random Z/970,200 elements.

mod 8
8/8
64 params. The only depth-3 channel. 4 involutions (Klein four).
mod 49
49/49
2401 params. The deepest channel. 49 states.
CRT recon
50/50
5-channel CRT reconstruction on random Z/970,200 elements. 3292 params.
Forward
9511x
970200 / 102. Raising exponents: 12x (Z/210 thin) to 9511x (Z/970,200). 286Mx backward.

Bloom Mixing

The CRT perceptron learns per-channel functions perfectly. But x%13 (a cross-channel function) depends on ALL channels -- no single channel determines it. Linear mixing across channels achieves random performance (23/210). The solution: bloom. CRT reconstruction IS the non-linearity.

Architecture: per-channel perceptrons (87 params) -> CRT reconstruct (0 params) -> mod q. The bloom layer is algebraic and exact. It has zero trainable parameters. It replaces ReLU. One set of trained weights projects through ANY coprime gate: x%11, x%13, x%17 -- all 210/210 from the same 87 parameters.

Bloom x%13
210/210
87 params + 0 bloom. CRT reconstruction provides the non-linearity.
Multi-gate
210/210 each
Same weights: x%11, x%13, x%17 all perfect. One model, infinite gates.
Additive
23/210
221 params, 40 epochs. Linear mixing = RANDOM on cross-channel targets.
Ratio
31x
Standard 2730 params (210*13). Bloom: 87. Structure wins completely.

Hourglass

Bloom computes x%13 perfectly -- but what about x^2 % 13? A single bloom layer reconstructs x^2 mod 210, then reduces mod 13. When x^2 >= 210 (x >= 15), the ring reduction destroys mod-13 information. Result: 28/210 correct. Ring overflow = info loss.

The hourglass fixes this. Two flowers face to face, gate to gate. Flower A extracts the gate value g = x%13 (identity bloom). Gate values are small: 0 to 12. Flower B computes g^2 inside Z/210 -- and 12^2 = 144 < 210, so no overflow ever occurs. The gate IS the bottleneck that prevents overflow.
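
The overflow effect and its fix can be reproduced with exact arithmetic standing in for the trained flowers. In this sketch `flower` applies a function per channel and reconstructs in Z/210; treating a trained flower as an exact per-channel map is an assumption made for illustration.

```python
PRIMES = [2, 3, 5, 7]
N = 210
BASIS = [105, 70, 126, 120]
GATE = 13

def flower(f, x):
    """Apply f independently in each channel, then CRT-reconstruct in Z/210."""
    residues = [f(x % p) % p for p in PRIMES]
    return sum(r * b for r, b in zip(residues, BASIS)) % N

def single_bloom(x):
    """Square inside Z/210, then reduce mod 13 (wraps whenever x^2 >= 210)."""
    return flower(lambda r: r * r, x) % GATE

def hourglass(x):
    """Flower A extracts g = x % 13; Flower B squares g, which never exceeds 144."""
    g = flower(lambda r: r, x) % GATE
    return flower(lambda r: r * r, g) % GATE

single = sum(single_bloom(x) == x * x % GATE for x in range(N))
double = sum(hourglass(x) == x * x % GATE for x in range(N))
print(single, double)  # compare with the 28/210 and 210/210 figures reported here
```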

Hourglass x^2%13
210/210
174 params (87+87). Two flowers solve what one cannot.
Single bloom
28/210
Same squaring, no gate extraction. Ring overflow kills accuracy.
Gate bottleneck
16:1
210 -> 13 -> 210. Max gate^2 = 144 < 210. Safe.
Ratio
15x
Standard 2730 params (210*13). Hourglass: 174.

Z/970,200 Hourglass

The Z/210 hourglass stops at polynomial degree 2: gate value 12, cubed gives 1728 > 210. Overflow. The Z/970,200 hourglass extends to degree 5: 12^5 = 248832 < 970200. Raising exponents gives 3 extra polynomial degrees beyond Z/210.
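
The degree ceiling follows directly from the no-overflow condition (largest gate value raised to the degree must stay below N). A small sketch, with hypothetical helper names:

```python
def max_safe_degree(n, gate=13):
    """Largest d such that (gate - 1)**d < n, so no gate value overflows the ring."""
    top = gate - 1
    d = 0
    while top ** (d + 1) < n:
        d += 1
    return d

def safe_gate_values(n, degree, gate=13):
    """How many gate values g in 0..gate-1 satisfy g**degree < n."""
    return sum(1 for g in range(gate) if g ** degree < n)

print(max_safe_degree(210))          # 2  (12^2 = 144 < 210, 12^3 = 1728)
print(max_safe_degree(970200))       # 5  (12^5 = 248832 < 970200)
print(safe_gate_values(210, 3))      # 6  gate values survive cubing in Z/210
print(safe_gate_values(970200, 6))   # 10
print(safe_gate_values(970200, 7))   # 8
```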

Architecture: Flower A (identity on 5 prime-power channels, 3292 params) -> gate(13) bottleneck -> Flower B (g^d on prime-power channels, 3292 params) -> %13. Total: 6584 params. Standard: 970200 * 13 = 12,612,600 parameters.

x^2%13
50/50
Degree 2. Safe in both Z/210 and Z/970,200. 6584 params.
x^3%13
50/50
Cube: Z/210 fails (only 6 of 13 gate values stay below the overflow point), Z/970,200 succeeds.
x^5%13
50/50
Max degree 5. 12^5 = 248832 < 970200.
Ratio
1915x
Standard = 12,612,600 params. Hourglass: 6584.

Inaccessible Degrees

The Z/970,200 hourglass reaches degree 5. But degrees 7 and 11 are unreachable by power composition: the multiplicative monoid of {1,...,5} in Z/12 (where phi(13) = 12) is missing exactly {7, 11}. These require a different approach.
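
The claim about the missing exponents is a finite closure computation and can be checked directly. The sketch below closes {1,...,5} under multiplication mod 12; the function name is hypothetical.

```python
def multiplicative_closure(generators, modulus):
    """All residues reachable as products of the generators, taken mod modulus."""
    generators = [g % modulus for g in generators]
    reached = set(generators)
    frontier = set(generators)
    while frontier:
        new = {(a * g) % modulus for a in frontier for g in generators} - reached
        reached |= new
        frontier = new
    return reached

reachable = multiplicative_closure(range(1, 6), 12)
missing = sorted(set(range(12)) - reachable)
print(missing)  # [7, 11]
```

The only unit of Z/12 among the generators is 5, and 5*5 = 1, so the units 7 and 11 can never appear as a product.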

Resolution: channel-49 carries a correction. For each gate value g, search r in 0..48 such that the CRT reconstruction with r in the b^2 channel produces the target mod 13. Pigeonhole guarantees at least 3 valid r per gate value (floor(49/13)=3). Same 6584 params. Same architecture. Only the training targets differ.

Power deg 7
8/13
Overflow: g^7 >= 970200 for g >= 8. Power per-channel fails.
Lookup deg 7
50/50
mod-49 correction. 13/13 gate values + 50/50 random Z/970,200.
Power deg 11
7/13
x^11 = x^(-1) mod 13 (Inverse Degree). Overflow at g >= 4.
Lookup deg 11
50/50
All 12 degrees accessible. Inner={1,5}. Outer={7,11}.

Stacked Hourglass

The Z/970,200 hourglass ceiling: degree 5. At degree 6, gate values 10, 11, 12 overflow (g^6 >= 970200). The fix: STACK. Two flowers in sequence, each applying a sub-degree. Flower B1 computes g^d1, extracts the gate. Flower B2 computes g1^d2, extracts the result. Total degree = d1 * d2. Each layer stays safe (max 12^5 = 248832 < 970200). The ceiling shatters.

5-squared collapse: degree 5 composed with degree 5 gives degree 25. But 25 mod phi(13) = 25 mod 12 = 1. Quintic composed with quintic = identity. 5^2 = 1 in Z/12: two layers cancel, leaving only what was already there.
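
Stacking can be sketched with exact arithmetic standing in for the trained flowers: each stage raises the current gate value to a sub-degree inside Z/970,200 and re-extracts the mod-13 gate. The helper names are hypothetical; the internal assert documents the per-stage no-overflow condition.

```python
N = 970200
GATE = 13

def stage(g, degree):
    """One flower: raise a gate value to `degree` inside Z/N, then extract the gate."""
    value = g ** degree
    assert value < N, "stage would overflow the ring"
    return (value % N) % GATE

def stacked(g, degrees):
    """Apply the stages in sequence; the total degree is the product of sub-degrees."""
    for d in degrees:
        g = stage(g, d)
    return g

gates = range(GATE)
print(all(stacked(g, [3, 2]) == pow(g, 6, GATE) for g in gates))   # True: cube then square
print(all(stacked(g, [3, 4]) == pow(g, 12, GATE) for g in gates))  # True: cube then quartic
print(all(stacked(g, [5, 5]) == g for g in gates))                 # True: 25 = 1 mod 12
```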

Single deg 6
10/13
Overflow: g^6 >= 970200 for g >= 10. THE CEILING.
2-stack deg 6
50/50
cube+square (3*2=6). Ceiling broken. 9876 params.
2-stack deg 12
50/50
cube+quartic (3*4=12). Fermat: x^12=1 mod 13.
5^2 collapse
50/50
quintic+quintic (5*5=25 mod 12=1). Identity.

Soft Bloom Emergence

Every test above hard-codes the CRT basis: {105, 70, 126, 120} for Z/210, {363825, 431200, 853776, 732600, 529200} for Z/970,200. But what if the ring can discover its own basis? Strip all CRT knowledge. Start with bloom weights = [0, 0, 0, 0]. Apply coordinate descent: for each channel, sweep the weight over all possible values and pick the one that maximizes reconstruction accuracy. After a few rounds, the CRT basis emerges from data alone.

Perturbation is catastrophic and exact. Changing any basis weight by +1 drops accuracy to precisely N/p -- the channel dies and only elements with zero residue survive. The basis is an isolated global maximum: no graceful degradation, no nearby local optima. The discovered basis elements are orthogonal idempotent projectors: B_p^2 = B_p, B_p * B_q = 0 for p != q, and sum(B_p) = 1 mod N.
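
The projector identities and the exact N/p collapse can be verified at Z/210 scale. The sketch below does not reproduce the coordinate-descent search; it only checks the properties stated for the basis once it is found.

```python
PRIMES = [2, 3, 5, 7]
N = 210
BASIS = [105, 70, 126, 120]

def accuracy(weights):
    """Count x in Z/210 for which sum(w * (x % p)) mod N reconstructs x."""
    return sum(
        sum(w * (x % p) for w, p in zip(weights, PRIMES)) % N == x
        for x in range(N)
    )

idempotent = all(b * b % N == b for b in BASIS)
orthogonal = all(
    BASIS[i] * BASIS[j] % N == 0
    for i in range(len(BASIS)) for j in range(len(BASIS)) if i != j
)
complete = sum(BASIS) % N == 1

perturbed = []
for i in range(len(BASIS)):
    weights = list(BASIS)
    weights[i] += 1            # nudge one basis weight by +1
    perturbed.append(accuracy(weights))

print(idempotent, orthogonal, complete)  # True True True
print(accuracy(BASIS))                   # 210
print(perturbed)                         # [105, 70, 42, 30], i.e. N/p per channel
```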

Z/30 emergence
30/30
[0,0,0] -> [15,10,6] in 3 rounds. The CRT basis grows from nothing.
Z/210 emergence
210/210
[0,0,0,0] -> [105,70,126,120]. Unique global maximum.
Z/970,200 full ring
970200/970200
Per-channel search: 5 basis elements found structurally.
Perturbation
exact N/p
B_8+1: 121275/970200 = N/8. Catastrophic, structural damage.

Joint Bloom

Soft bloom discovers the CRT basis but takes per-channel functions as given. Joint bloom learns BOTH. Architecture: prediction(x) = sum(W_ch[x%p] * B[ch]) % N, where W is a per-channel lookup table (17 params) and B is the bloom basis (4 params). Total: 21. Standard: 44100. Ratio: 2100x.

Identity-anchored training: learn B on identity (always bijective, always converges), then learn W on the actual target. Soft bloom alone gets 16/210 on x^2 -- only the 16 idempotents of Z/210 match. Joint bloom with trained W: 210/210. The per-channel lookup discovers r^2 mod p (for x^2) and (p-r) mod p (for mirror) entirely from data.
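
The prediction rule prediction(x) = sum(W_ch[x%p] * B[ch]) % N is short enough to run directly. In the sketch below the basis B is taken as already learned, and W is filled from (x, f(x)) samples rather than trained by descent; the sample set 0..6 is an assumption that happens to cover every residue.

```python
PRIMES = [2, 3, 5, 7]
N = 210
BASIS = [105, 70, 126, 120]   # B: the 4 bloom parameters

def learn_tables(target, samples):
    """Fill W: one lookup table per channel (2 + 3 + 5 + 7 = 17 entries)."""
    tables = [[0] * p for p in PRIMES]
    for x in samples:
        y = target(x)
        for table, p in zip(tables, PRIMES):
            table[x % p] = y % p
    return tables

def predict(tables, x):
    """prediction(x) = sum(W_ch[x % p] * B[ch]) % N."""
    return sum(t[x % p] * b for t, p, b in zip(tables, PRIMES, BASIS)) % N

def score(tables, target):
    return sum(predict(tables, x) == target(x) for x in range(N))

identity = lambda x: x
square = lambda x: x * x % N
mirror = lambda x: (N - x) % N
samples = range(7)            # assumed: 0..6 covers every residue of every channel

soft_only = score(learn_tables(identity, samples), square)
joint_square = score(learn_tables(square, samples), square)
joint_mirror = score(learn_tables(mirror, samples), mirror)
print(soft_only, joint_square, joint_mirror)  # 16 210 210
```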

Joint x^2
210/210
21 params. Identity-anchored B + trained W. Per-channel: r^2 mod p.
Joint mirror
210/210
Same 21 params. W learns (p-r)%p. CRT decomposition intact.
Soft bloom x^2
16/210
Basis alone = identity. Matches x^2 only at idempotents (2^4=16).
Constraint count
48 vs 210
x^2 images: 2*2*3*4=48 tuples. Identity: 210. Non-bijective = fewer.

Deep Joint Bloom

Joint bloom at Z/210 scale (210 elements, 21 params) extends to Z/970,200 (5 prime-power channels). Key result: global coordinate descent FAILS at Z/970,200 scale -- noise drowns signal (~0.0001% accidental match). Per-channel search ALWAYS works: correct basis scores M, wrong scores ~M/q. The ring FORCES CRT decomposition at scale.

Identity-anchored training at Z/970,200: per-channel basis search discovers B structurally (5 elements), then per-channel W training discovers lookup tables from data (102 entries). Total: 107 params. Forward ratio: 9067x. With perceptron matrices: 3297 params, backward ratio 286 million x. Raising exponents: 590,000x gain over Z/210.

Z/970,200 x^2
970200/970200
107 params. Per-channel search discovers r^2 mod q from samples.
Z/970,200 mirror
970200/970200
Per-channel W learns (q-r) mod q. Full ring verified.
Scale separation
~0 vs 970200
Wrong basis: 1/100. CRT basis: 970200/970200. Ring forces decomposition.
Perturbation
exact N/q
B_8+1: 121275 = N/8. B_49+1: 19800 = N/49. Structural damage.

Z/12,612,600 Joint Bloom

Z/12,612,600 adds mod-13 to the 5 channels of Z/970,200. 6 prime-power channels total: {8, 9, 25, 49, 11, 13}. The 490 split is now complete: inner = {mod 8, mod 25, mod 49} + outer = {mod 9, mod 11, mod 13}. Per-channel search scales cleanly -- 121 params (115 lookup + 6 basis) for a ring of 12.6 million elements.

Forward ratio: 104,236x. Perceptron upgrade: 3467 params, backward ratio ~46 billion x. Sample-based verification (1040 elements, exact divisibility by 8 and 13). GATE perturbation: wrong B_13 leaves only elements with x mod 13 = 0 intact (80/1040 = 1/13). The boundary channel is structurally independent.

Z/12,612,600 x^2
1040/1040
121 params. 6 per-channel lookups discover r^2 mod q from 20 samples.
Z/12,612,600 mirror
1040/1040
Per-channel W learns (q-r) mod q. All 115 entries correct.
mod-13 perturbation
80/1040
B_13+1: only x%13=0 survive. Boundary channel independent.
490 split
inner + outer
inner={2,5,7}(3) + outer={3,11,13}(3) = 6 channels.

Frame Evolution Probes

Is ring algebra the right frame? Six probes on Z/970,200 (5 prime-power channels) test the foundation before it strains. All probes PASS: the ring frame holds for decomposable targets. The first boundary is quantified.

Non-decomposable wall: f(x)=(x%8)*(x%9) achieves only 175175/970200 = 18% accuracy via per-channel prediction. The number 175175 = N*13/72 is structurally determined (72=8*9 mixed channels). Decomposable targets (x^2, x+42, x*42) achieve 100%. The wall proves channel independence has limits -- bloom mixing is the path across.

Commutativity
1000/1000
Forward, reverse, scrambled channel order identical. Integer addition commutes.
Non-decomposable wall
18%
f(x)=(x%8)*(x%9): 175175/970200 = N*13/72. Structural limit.
Basis uniqueness
neg=mirror
Negated CRT basis computes mirror(x). Doubled basis computes 2x. Only CRT idempotents give identity.
Channel removal
N/q loss
Drop mod-8: 121275 survive. Drop mod-11: 88200. Survivors = N/q, so the loss grows with the modulus.

Bloom Crossing

The non-decomposable wall (18%) is real. Can bloom mixing cross it? CRT reconstruction enables computing ANY cross-channel function from per-channel inputs. Pipeline: per-channel identity (102 params) -> CRT reconstruct (5 basis) -> compute cross-channel function -> compare. Result: 100% on ALL cross-channel targets. Zero additional trainable parameters.

Four targets tested at Z/970,200: (x%8)*(x%9), (x%25)*(x%49), (x%8)+(x%9), (x%8)*(x%9)*(x%25). All 4 cross from wall to 970200/970200 with bloom. The additive target achieves ZERO per-channel matches -- even harder than the multiplicative wall. Partial CRT: for 2-channel functions, only 19 params needed (vs 107 full).

Bloom (x%8)*(x%9)
970200/970200
Per-channel: 175175 (18%). Bloom: 100%. The wall is crossed.
Additive wall
0/970200
(x%8)+(x%9) gets ZERO per-channel matches. Stronger wall than multiplicative.
Partial CRT
72/72
Only involved channels needed. 19 params for 2-channel crossing. 5x smaller.
Basis necessity
139/1000
Perturbed basis (B[0]+1): crossing collapses. Correct basis is necessary.

Learned Bloom Crossing

Sample efficiency: 72 samples learn a 970200-element function (13475x compression). Accuracy is EXACTLY proportional to coverage -- each filled entry accounts for exactly N/table_size correct predictions. CRT guarantee, not statistical approximation. Works for arbitrary unknown functions: the learner sees (x, f(x)) pairs and fills the table.
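
The exact-coverage property can be seen by filling a table keyed on the two involved residues and scoring it over the whole ring. This sketch assumes the samples are simply x = 0..71 (or 0..49 for partial coverage) and counts an unfilled entry as a miss; names are illustrative.

```python
N = 970200
D, K = 8, 9   # the two channels the target depends on

def target(x):
    return (x % D) * (x % K)

def learn_table(samples):
    """Fill one entry per observed (x % 8, x % 9) pair from (x, f(x)) samples."""
    return {(x % D, x % K): target(x) for x in samples}

def accuracy(table):
    """Score over the full ring; an unfilled entry (None) never matches."""
    return sum(table.get((x % D, x % K)) == target(x) for x in range(N))

full = learn_table(range(72))      # 72 consecutive samples hit all 72 residue pairs
partial = learn_table(range(50))   # 50 of 72 entries filled

acc_full = accuracy(full)
acc_partial = accuracy(partial)
print(len(full), acc_full)         # 72 970200
print(len(partial), acc_partial)   # 50 673750 = 50 * (N / 72)
```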

(x%8)*(x%9) learned
970200/970200
72 samples, 72-entry table. 13475x compression. Formula-free.
(x%25)*(x%49) learned
970200/970200
1225 samples, 1225-entry table. Larger channels = larger table.
8*9 < per-ch
72 < 102
Cross-channel table (72) SMALLER than per-channel sum (102) for small channels.
Exact coverage
50/72 -> 69%
Partial accuracy = coverage * N/table_size. CRT structure, not statistics.

Composed Forward Pass

Key result: composition is non-commutative. compose(sq, mir)(x) = N - x^2 but compose(mir, sq)(x) = x^2. Squaring absorbs negation: (-x)^2 = x^2 in any ring. Layer order matters -- just like in standard neural networks. Mirror is an involution (mir o mir = identity). Three layers compose correctly: sq o mir o sq = x^4. The cross-channel pipeline chains a per-channel layer with a learned crossing table in one forward pass.
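
Layer order can be checked with exact per-channel maps in place of trained layers. In this sketch `compose(f, g)` applies f first and g second, matching the usage in the text; the helper names are hypothetical.

```python
MODULI = [8, 9, 25, 49, 11]
N = 970200
BASIS = [363825, 431200, 853776, 732600, 529200]

def layer(channel_fn):
    """Build a ring map that applies channel_fn(r, q) independently in every channel."""
    def apply(x):
        residues = [channel_fn(x % q, q) % q for q in MODULI]
        return sum(r * b for r, b in zip(residues, BASIS)) % N
    return apply

def compose(*layers):
    """compose(f, g)(x) applies f first, then g."""
    def apply(x):
        for f in layers:
            x = f(x)
        return x
    return apply

sq = layer(lambda r, q: r * r)    # per-channel squaring
mir = layer(lambda r, q: q - r)   # per-channel mirror (negation)

print(compose(sq, mir)(1))   # 970199 = N - 1
print(compose(mir, sq)(1))   # 1
print(all(compose(sq, mir, sq)(x) == pow(x, 4, N) for x in range(0, N, 997)))  # True
print(all(compose(mir, mir)(x) == x for x in range(0, N, 997)))                # True
```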

2-layer compose
970200/970200
sq then mir = N-x^2. Full ring verified. Per-channel maps compose.
Non-commutative
sq o mir != mir o sq
compose(sq,mir)(1)=970199. compose(mir,sq)(1)=1. Order matters.
Cross pipeline
970200/970200
Per-ch sq -> CRT reconstruct -> learned D*K table. 179 params.
Shared basis
5 params all layers
k layers = k*102 + 5 params. 2-layer: 209 params = 4642x vs N.

Backward Pass

The forward path is complete. The backward pass makes it trainable. CRT backward pass decomposes into independent per-channel backward passes. Each weight W[r] in channel ch has gradient 2*(W[r] - target(r)) -- no coupling to other channels. The Jacobian is block-diagonal by construction: per-channel loss functions share no parameters.

Key contrast: per-channel loss decouples channels completely (block-diagonal), while full-ring reconstruction loss COUPLES them through the shared CRT reconstruction error. Per-channel training avoids this coupling while converging to the same optimum for CRT-decomposable targets. This is the Backward Decomposition Theorem.
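
A minimal sketch of the per-channel backward pass, assuming a squared-error loss per weight so that the gradient is exactly 2*(W[r] - target(r)). The random initialisation, seed and helper names are illustrative. With lr = 0.5 a single step lands on the target, and perturbing one channel leaves every other channel's gradient untouched.

```python
import random

MODULI = [8, 9, 25, 49, 11]

def targets(fn):
    """Per-channel target tables: fn(r) mod q for every residue r."""
    return [[fn(r) % q for r in range(q)] for q in MODULI]

def gradients(weights, tgt):
    """Gradient of sum (W[r] - t[r])^2, computed channel by channel."""
    return [[2.0 * (w - t) for w, t in zip(ws, ts)] for ws, ts in zip(weights, tgt)]

def sgd_step(weights, tgt, lr):
    grads = gradients(weights, tgt)
    return [[w - lr * g for w, g in zip(ws, gs)] for ws, gs in zip(weights, grads)]

rng = random.Random(0)
weights = [[rng.uniform(0, q) for _ in range(q)] for q in MODULI]
tgt = targets(lambda r: r * r)

# Block-diagonal check: perturb every mod-8 weight, compare the mod-9 gradient.
before = gradients(weights, tgt)
perturbed = [[w + 1.0 for w in weights[0]]] + weights[1:]
after = gradients(perturbed, tgt)
print(before[1] == after[1])   # True: the mod-9 gradient is unchanged
print(before[0] == after[0])   # False: only the mod-8 block moved

trained = sgd_step(weights, tgt, lr=0.5)
print(all(round(w) == t for ws, ts in zip(trained, tgt) for w, t in zip(ws, ts)))  # True
```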

Block-diagonal
Gradient mod-9 = 0
Perturb mod-8 weights. Per-channel gradient in mod-9 is exactly zero.
Full-ring coupling
Gradient mod-9 != 0
Same perturbation. Full-ring reconstruction gradient couples channels.
SGD convergence
1000/1000 in 5 epochs
Random init -> per-channel gradient descent -> perfect accuracy on x^2.
Crossing table SGD
1000/1000 in 10 epochs
72-entry D*K product table learned from random init via gradient descent.

Multi-Layer Training

Hybrid pipeline: gradient descent for per-channel layers (block-diagonal, 102 params each) + discrete optimization for crossing tables (72 entries, sample-based or hillclimb). Full stack from scratch: 2 trained layers + learned crossing table. 281 params for 970200-element ring = 3452x compression. Channel-parallel: all 5 channels converge in 1 epoch each (lr=0.5). The full CRT training pipeline is now operational.

2-layer compose
1000/1000
sq->mir = N-x^2. Both layers from random init, trained independently.
3-layer compose
1000/1000
sq->mir->sq = x^4. Three independent layers compose correctly.
Hybrid full ring
970200/970200
Gradient per-ch + sample crossing. Full ring verified.
Channel parallel
1 epoch each
5 channels converge independently. Block-diagonal = embarrassingly parallel.

Coupling-Attention

Standard transformers learn attention via softmax(QK^T / sqrt(d)). CRT coupling provides Q and K algebraically -- no learned projection matrices. Per-channel eigenvalue eig(n, q) = 2*cos(2*pi*(n mod q)/q). Coupling(a,b) = dot product of eigenvalue vectors = sum over channels of eig(a)*eig(b). This is a symmetric, positive-definite kernel on the ring. 102 precomputed f64 entries encode all pairwise coupling for 970200 elements (9511x compression).

Full-coupling attention (all 7 channels simultaneously) outperforms multi-head decomposition -- multi-head averaging dilutes per-channel signal. 9 = the compiler's analysis pass count, not attention heads. Cross-coupling averages to zero by character orthogonality -- self-coupling is strongly positive. Signal and noise are separated by algebra, not by learning.
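
The coupling kernel is small enough to tabulate. The sketch below uses the 5-channel Z/970,200 setting from the first paragraph, stores the 102 eigenvalue entries, and wraps the kernel in a plain softmax; the `attention` function and the absence of any temperature scaling are assumptions, so the retrieval weights it prints are not the figures quoted elsewhere on this page.

```python
import math

MODULI = [8, 9, 25, 49, 11]

# 8 + 9 + 25 + 49 + 11 = 102 precomputed entries: eig(r, q) = 2*cos(2*pi*r/q).
EIG = {q: [2.0 * math.cos(2.0 * math.pi * r / q) for r in range(q)] for q in MODULI}

def eig_vector(n):
    """Per-channel eigenvalues of ring element n."""
    return [EIG[q][n % q] for q in MODULI]

def coupling(a, b):
    """Dot product of the two eigenvalue vectors."""
    return sum(u * v for u, v in zip(eig_vector(a), eig_vector(b)))

def attention(query, keys):
    """Softmax over coupling scores (no learned projections, no temperature)."""
    scores = [coupling(query, k) for k in keys]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

table_entries = sum(len(v) for v in EIG.values())
mean_self = sum(sum(e * e for e in v) / len(v) for v in EIG.values())
mean_eig = [sum(v) / len(v) for v in EIG.values()]

print(table_entries)                      # 102
print(round(mean_self, 6))                # 10.0, the average self-coupling
print([round(m, 6) + 0.0 for m in mean_eig])  # all zero, so cross-coupling averages to zero
print(attention(12345, [12345, 777, 424242]))
```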

Positive-definite
1000/1000
coupling(x,x) = ||eig(x)||^2 > 0 for all nonzero elements. Proper kernel.
Orthogonality
E[cross]=0
Random coupling averages to zero. Self-coupling avg=10. Signal/noise algebraic.
Full coupling
Single full-coupling attention outperforms multi-head. 9 = compiler passes, not heads.
Block-diagonal
Per-channel independent
mod-8 coupling depends only on mod-8 residue. Independent of other channels.

7-Channel Extension

490 split as trainable architecture: inner channels (mod 8, mod 25, mod 49) learn x^2, outer channels (mod 9, mod 11, mod 13, mod 17) learn mirror -- simultaneously, independently. Encoder and decoder train with different targets. Block-diagonal Jacobian extends across the split: perturb mod-8 (inner), mod-17 (outer) gradient is exactly zero. Full-coupling attention over all 7 channels.

SGD at 7ch
1000/1000
Per-channel x^2, mirror, and 490-split mixed all converge. f64 CRT reconstruction.
490 split training
1000/1000
inner=x^2, outer=mirror. Independent training, block-diagonal across the split.
Projection bifurcation
2^420 = mirror in mod-17
2^420 CRT=(0,1,1,1,1,1,16). 2^1680=(0,1,1,1,1,1,1). mod-17 last to converge.
1.6Mx compression
132 params
214,414,200 elements from 132 eigenvalue entries. 170x over Z/970,200.

Lambda-Periodic Training

The Carmichael period 420 becomes a training rhythm. Sample-based SGD (one random sample per step) runs for 420 steps per epoch. Then the projector 1,576,576 (= 2^420) fires: the mod-8 channel collapses, all other channels persist as long-term memory. mod-8 recovers in O(8) steps -- the smallest channel is the most expendable. 4 epochs of 420 = 1680 = Carmichael period of Z/214,414,200.

Channel convergence follows modulus size: mod-8 and mod-9 converge first (step 50), then mod-11, then mod-25, then mod-49. At step 420, all 102 per-channel weights are correct. At step 50, only 31/49 mod-49 residues are covered. 420 is the minimum natural period for full coverage -- forced by mod-49 needing 420/49 = 8.6 samples per residue. At Z/214,414,200, 2^420 zeroes mod-8 AND mirrors mod-17 (CRT=16=-1 mod 17). The full projector (0,1,1,1,1,1,1) needs period 1680. mod-17 is always last to resolve.
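
The periods and projector residues quoted above are plain modular arithmetic. The sketch below computes the exponent of each unit group by brute force (adequate for moduli up to 49) and then inspects 2^420 and 2^1680; the function names are hypothetical.

```python
from functools import reduce
from math import gcd, prod

MODULI = [8, 9, 25, 49, 11, 13, 17]

def unit_exponent(q):
    """Smallest k with a**k = 1 (mod q) for every unit a; brute force for small q."""
    units = [a for a in range(1, q) if gcd(a, q) == 1]
    k = 1
    while any(pow(a, k, q) != 1 for a in units):
        k += 1
    return k

def period(moduli):
    """Carmichael period of the product ring: lcm of the per-channel exponents."""
    return reduce(lambda a, b: a * b // gcd(a, b), (unit_exponent(q) for q in moduli))

def residues(x, moduli):
    return tuple(x % q for q in moduli)

N6 = prod(MODULI[:6])   # 12,612,600
N7 = prod(MODULI)       # 214,414,200

print(period(MODULI[:5]), period(MODULI[:6]), period(MODULI))  # 420 420 1680
print(pow(2, 420, N6))                                         # 1576576
print(residues(pow(2, 420, N7), MODULI))                       # (0, 1, 1, 1, 1, 1, 16)
print(residues(pow(2, 1680, N7), MODULI))                      # (0, 1, 1, 1, 1, 1, 1)
```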

Projection cycle
102->96->102
Before: 102/102. After 1,576,576: 96/102 (mod-8 killed). Recovery: 102/102.
mod-8 expendable
8 steps to recover
1,576,576 collapses mod-8 to constant. 50 sample steps fully restore.
Convergence order
8<9<11<25<49
Small channels first. mod-49 needs 420 steps. mod-8 converges by step 50.
4 * 420 = 1680
7-channel period
Carmichael periods compose. mod-17 mirrors at step 420, resolves at step 1680.

CRT at Scale

CRT decomposition is a compression technique. It trades exact element identity for smaller per-channel tables. On small rings (Z/210, 27 characters), flat tables outperform CRT -- they preserve more signal with fewer entries. On large rings, CRT is the ONLY feasible decomposition. Z/214,414,200 has 214 million elements. Flat bigram tables would need over 45 thousand trillion entries. CRT per-channel trigrams need 142,956.
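
The two table sizes can be recomputed from the moduli. The sketch assumes a flat bigram table has N^2 entries and a per-channel trigram table has q^3 entries (two context residues plus the predicted one), which reproduces the quoted counts.

```python
from math import prod

MODULI = [8, 9, 25, 49, 11, 13, 17]
N = prod(MODULI)                              # 214,414,200

flat_bigram_entries = N ** 2                  # one entry per (previous, next) pair
crt_trigram_entries = sum(q ** 3 for q in MODULI)

print(f"{flat_bigram_entries:.2e}")   # 4.60e+16
print(crt_trigram_entries)            # 142956
```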

Ring arithmetic sequences in Z/214,414,200 -- noisy affine maps mixing multiplication and addition -- are predicted at 58% exact accuracy using CRT per-channel trigrams. Each of seven channels achieves 607-704 ppt (parts per thousand) accuracy, 6x to 34x above random. The mod-49 channel shows the largest improvement: 34x. The improvement over random guessing: 124 million times. Per-channel independence means the same 143K table entries handle a 214-million-element alphabet that no flat table could touch.

CRT exact
1044/1799
58% exact element prediction on 214M-element ring. 143K entries.
Flat impossible
4.6 x 10^16
214414200^2 entries. Cannot be built. CRT is the only path.
mod-49 lift
34x random
683 ppt vs 20 ppt random (mod 49). Largest channel = largest CRT advantage.
Ring tower
5ch >= 6ch >= 7ch
1050 >= 1044 >= 1044. Operations correlated across channels.

Evolutionary Basin Dynamics

The multiplicative basin is a genuine structural attractor. A K^2=9 population of 2D-LUT state machines (3750 entries each) initialized from 90% correct multiplication tables holds at 75% accuracy across 200 generations of random mutation. The basin resists drift indefinitely. But random entry-level mutation cannot improve: each change has >97% probability of being wrong (P=(q-1)/q for Z/49). The 2D-LUT is a state machine where a single corrupted entry redirects the entire downstream trajectory -- cascading errors make the fitness landscape deceptive.

Crossover between members with independently corrupted entries provides +5.1 percentage points (72.6% to 77.7%) by recombining correct entries from different parents. The improvement plateaus when population diversity exhausts. Full recovery to 100% requires systematic search (coordinate descent visits every entry, tries all alternatives) or biology-scale populations. Evolution from additive initialization climbs to only 16% -- structurally unable to bridge the modular discontinuity.

Drift resistance
75.1% held, 200 gen
90% seeded population holds steady. Attractor resists random perturbation.
Crossover benefit
+5.1 pp
Uniform crossover between diverse members. 72.6% to 77.7%. Combines correct entries.
Additive stuck
8% to 16%
Evolution from additive cannot bridge to multiplicative basin.
Biology analog
Structured mutations
Random table rewrites fail. Biology uses structured DNA mutations + large populations.

Cross-Channel Attention

What determines how well one CRT channel predicts another? Two hypotheses: quadratic residue compatibility (algebraic: Legendre symbols between primes) or shared-factor coupling (operational: gcd of operation constants with channel moduli). The QR hypothesis is FALSIFIED; the GCD mechanism is constructive and predictive.

QR-Attention Separation (PROVED)
Among odd-prime pairs: QR mean = NR mean (difference = -4 ppt, noise). The directed 3-cycle K->L->b->K (all QR) shows zero asymmetry over its reverse (all NR). Cross-channel information flow is operational (gcd-mediated), not algebraic (QR-mediated). 21/21 verified.
GCD-Attention Tower-Step (PROVED)
Operation constants 42 = 2*3*7 and 105 = 3*5*7 reduce each data channel's effective target alphabet by one tower step: 8->4, 9->3, 25->5, 49->7. Extension channels (mod 11, mod 13, mod 17) have gcd=1: unreduced. GCD-product concordance with accuracy: 74%. 43/43 verified.
Direction-Resolution (PROVED)
V1(s,t) = q_src * weighted_target_gcd correctly predicts the higher-accuracy direction for 85% of channel pairs. The direction ratio for (mod-49, mod-9): V1(49->9)*9 = V1(9->49)*49 (algebraic identity). mod-25 source pairs account for 2 of 3 wrong predictions (self-blindness via 105 coupling). 30/30 verified.
QR falsified
QR mean = NR mean
Quadratic residue compatibility has zero predictive power for cross-channel accuracy. The information flow is operational, not algebraic.
Tower-step reduction
8->4, 9->3, 25->5, 49->7
Each data channel loses exactly one tower step of effective alphabet through gcd coupling. Extension channels are immune.
85% direction
V1 concordance 18/21
Source modulus and target gcd jointly predict which direction is stronger. 3 failures involve mod-25 source (self-blind channel).
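
The tower-step figures are consistent with dividing each modulus by its largest gcd with an operation constant. That reading is an assumption made for this sketch (the page does not spell out the formula), and `effective_alphabet` is a hypothetical name.

```python
from math import gcd

CONSTANTS = [42, 105]                 # operation constants named in the theorem
DATA = [8, 9, 25, 49]
EXTENSION = [11, 13, 17]

def effective_alphabet(q):
    """Assumed reading: the modulus divided by its largest gcd with any constant."""
    return q // max(gcd(q, c) for c in CONSTANTS)

print({q: effective_alphabet(q) for q in DATA})       # {8: 4, 9: 3, 25: 5, 49: 7}
print({q: effective_alphabet(q) for q in EXTENSION})  # {11: 11, 13: 13, 17: 17}
```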

Channel Independence

How much information does each CRT channel carry? The eigenvalue of n -- a sum of 7 cosines, one per channel -- collapses all channels into a single number. If the sum is useful, channels are redundant. If not, each channel carries independent information that summation destroys.

Channel Independence Theorem (VERIFIED)
Coupling class prediction of 2000 elements from Z/214,414,200: mod-8 channel alone (8 bins) achieves 67.9% accuracy. mod-9 alone (9 bins) achieves 49.9% (= baseline). mod-8 + mod-9 combined (72 bins) achieves 84.5%. Eigenvalue sum (200 bins) achieves 51.4% -- barely above random (50.2%). Information grows multiplicatively with channel count. Summing destroys it.

The mod-9 channel achieves exactly the baseline because it is orthogonal to parity: whether n is even or odd is invisible in n mod 9. But combined with mod-8, mod-9 further separates the odd elements by divisibility by 3. This is CRT in action: independent channels carry independent information.

mod-8 alone
67.9%
Parity + mod-8 structure. 8 bins, each class-concentrated.
mod-9 alone
49.9% = baseline
Orthogonal to mod-8. Cannot distinguish even from odd.
mod-8 + mod-9
84.5%
Multiplicative gain: 2 independent channels > 1 by 16.6 percentage points.
Eigenvalue sum
51.4% ~ random
Summing 7 channels destroys independence. 200 bins, no structure.

ECC Requires Coupling

Can parity channels (mod 11, mod 13, mod 17) detect prediction errors for free? If 7 independent per-channel predictors disagree on parity, is the prediction wrong? Testing this by training independent bigram predictors on a sequence in Z/88,200.

ECC Coupling Theorem (PROVED)
Parity channel transitions in Z/88,200 depend on ALL data channels jointly. An independent mod-11 predictor achieves only random accuracy (10.6% vs 100% for CRT-factor channels mod 8 and mod 9). The mod-N reduction (N=88200) scrambles mod-11 residues. Error detection requires cross-channel information flow -- it is a training objective, not a free property.

Root cause: the mod-11 residue depends on the FULL element (all 4 data channels), not just the previous mod-11 value. A predictor that sees only current mod-11 cannot predict the next. This proves that CRT-based intelligence MUST include coupling between channels -- pure block-diagonal processing cannot leverage error detection. The rate 4/7 IS the coupling cost: 3/7 of capacity enforces consistency.
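
A minimal Python sketch of the root cause. The walk `x -> (3x + 1) mod 88200` is a hypothetical choice for illustration (the sequence used in the experiment is not specified here). For each channel it counts how many different next-residues follow the same current residue.

```python
N = 88200  # 8 * 9 * 25 * 49

def step(x, a=3, b=1):
    """Hypothetical affine walk on Z/N; a and b are illustrative choices."""
    return (a * x + b) % N

def max_successors(q):
    """Largest number of distinct next-residues seen after any single residue mod q."""
    seen = {r: set() for r in range(q)}
    for x in range(N):
        seen[x % q].add(step(x) % q)
    return max(len(s) for s in seen.values())

# Channels that divide N: the next residue is a function of the current residue.
factor_channels = {q: max_successors(q) for q in (8, 9, 25, 49)}
# Channels that do not divide N: the wraparound mod N makes the next residue ambiguous.
parity_channels = {q: max_successors(q) for q in (11, 13, 17)}
```

In this sketch every factor channel has exactly one successor per residue, while each parity channel has more than one. A predictor that sees only the current mod-11 value therefore cannot be exact, whatever it learns.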

mod-8, mod-9
100% predictable
CRT factors of N=88200. Affine transition per-channel. Fully independent.
mod-11, mod-13, mod-17
~10% (random)
NOT CRT factors. Transition depends on full element. Cannot predict independently.
Implication
Detection = coupling
Block-diagonal architecture needs an explicit sync point for parity. 3/7 capacity.

CRT Coupling as Attention

CRT coupling(a,b) = sum of per-channel eigenvalue products = a dot-product in eigenvalue space. Zero learned parameters. Does this replace standard attention? Testing on Z/214,414,200 (7 channels) with 28 possible heads: 7 single-channel + 21 pair-channel.

Coupling Attention Theorem (CONFIRMED)
CRT coupling gives 99.5% retrieval weight on an exact-match target among 30 elements (vs 3.3% uniform = 30x advantage). The eigenvalue inner product IS a positive-definite attention kernel with zero parameters, 1.6Mx compression at Z/214,414,200 scale (132 eigenvalue entries encode all 214M pairwise coupling values), and inherent block-diagonal structure.

The 9-heads hypothesis (7 single-channel + 2 pair-channel = optimal head count) shows no statistically significant signal across 5 random seeds. All head counts from 7 to 21 give similar contrast (~50%). CRT attention works best as FULL COUPLING (all 7 channels simultaneously), not decomposed into independent heads.

CRT retrieval
99.5% weight
Exact match gets 99.5% attention. 30x over uniform. Zero parameters.
9 heads
No signal
Multi-seed contrast averages: NH=9 (499/1000) vs NH=8 (495/1000) vs NH=12 (490/1000). Within noise.
Multi-head avg
Dilutes attention
Averaging softmax across heads reduces target weight from 99.5% to 8-15%. Full coupling is the natural form.

Projector Prevents Forgetting

Catastrophic forgetting: train on task A, then train on task B, and task A knowledge is destroyed. Standard models have no built-in remedy -- all weights shift. CRT-decomposed models have a projector: 26,801,776 (= 2^1680 mod 214,414,200) with CRT = (0,1,1,1,1,1,1). It zeroes the mod-8 channel (coarsest, 8 states) and preserves the 6 finer channels exactly.

Projection Forgetting Theorem (VERIFIED)
Two tasks with different per-channel transition patterns (strides differ in all 7 channels of Z/214,414,200). Per-channel bigram tables. Full overwrite: 0% retention (catastrophic forgetting). Projector (mod-8 from task B, 6 channels from task A): 85.7% retention = 6/7 exactly. Algebraic guarantee from CRT channel independence.

The sacrifice is optimal: mod-8 is the coarsest channel (only 8 possible residues). It carries the least fine-grained information and is the fastest to relearn. The 6 preserved channels (mod-9 through mod-17) carry the finer structure that is expensive to learn and critical to retain.
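
The projector's algebra can be checked directly. The Python sketch below uses only the constants quoted above; the helper names `residues` and `project` and the sample element are illustrative.

```python
MODULI = (8, 9, 25, 49, 11, 13, 17)
N = 214_414_200
E = 26_801_776  # projector quoted in the text

def residues(n):
    """CRT coordinates of n, one residue per channel."""
    return tuple(n % q for q in MODULI)

def project(n):
    """Multiply by the idempotent: zero the mod-8 channel, keep the other six."""
    return (E * n) % N

assert N == 8 * 9 * 25 * 49 * 11 * 13 * 17

projector_from_power = pow(2, 1680, N)  # 2^1680 mod N
sample = 123_456_789                    # arbitrary illustrative element
kept = residues(project(sample))
```

Multiplying by the projector leaves the six finer residues of any element unchanged and sets the mod-8 residue to 0, which is the per-channel picture behind the 6/7 retention figure.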

Baseline
100%
All 7 channels learned perfectly. 13993/13993 correct predictions.
Full overwrite
0%
Different task = different transitions. All channels wrong. Catastrophic.
Projector
85.7% = 6/7
6 channels preserved exactly. mod-8 sacrificed and relearned for new task.

Convergence Staircase

How fast does each CRT channel learn? Walk through Z/12,612,600 multiplicatively (a^k mod 12612600) using a maximal-order generator (ord = 420 = Carmichael lambda). Per-channel bigram tables learn deterministic transitions. Each channel converges when its orbit is fully observed -- at step lambda(q_i), the per-channel Carmichael value.

Convergence Staircase Theorem (VERIFIED)
Multiplicative walk on Z/12,612,600 (6 channels). Convergence order: mod-8 (step 2), mod-9 (step 6), mod-11 (step 10), mod-13 (step 12), mod-25 (step 20), mod-49 (step 42). Chain order: 2, 3, 5, 7, 11, 13. DIFFERENT. Extension channels {mod-11, mod-13} converge BEFORE prime-power channels {mod-25, mod-49}. Additive walk confirms same relative order: 8, 9, 11, 13, 25, 49. 14/14 checks pass.

Raising exponents (5->5^2=25, 7->7^2=49) increases the state space and SLOWS learning. Extension channels (mod-11, mod-13) have fewer states and converge faster despite being later in the chain. mod-8 converges first (2 multiplicative steps) -- confirming it is optimal for sacrifice: coarsest, cheapest to relearn.
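
The staircase order follows from the per-channel Carmichael values. The Python sketch below computes them by brute force (adequate for moduli this small) and sorts the channels by convergence step; it does not train any bigram tables.

```python
from math import gcd, lcm

MODULI = (8, 9, 25, 49, 11, 13)  # chain order for Z/12,612,600

def carmichael(q):
    """Exponent of the unit group of Z/q: lcm of the orders of all units."""
    lam = 1
    for a in range(1, q):
        if gcd(a, q) != 1:
            continue
        order, power = 1, a % q
        while power != 1:
            power = (power * a) % q
            order += 1
        lam = lcm(lam, order)
    return lam

# (convergence step, modulus), sorted by step: the staircase.
staircase = sorted((carmichael(q), q) for q in MODULI)
# Period of the whole ring is the lcm of the per-channel values.
period = lcm(*(lam for lam, _ in staircase))
```

Sorting by Carmichael value puts mod-11 and mod-13 ahead of mod-25 and mod-49, which is the leapfrog described above.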

Period / bottleneck
420 / 42 = 10
Carmichael lambda / phi(49) = 2*5 = 10. Period is 10x the learning bottleneck.
mod-8 fastest
2 steps (mult)
Z/8 has Carmichael lambda = 2. Fewest multiplicative states. Cheapest to sacrifice.
mod-49 slowest
42 steps (mult)
Z/49 has Carmichael lambda = 42. The deepest channel IS the convergence bottleneck.
Extension leapfrogs
mod-11 before mod-25
Chain position does not determine learning speed. Modulus size does.

Scheduling Is Irrelevant

If channels are independent, does the ORDER of training matter? Comparing three schedules on the same additive walk through Z/12,612,600 (6 channels): simultaneous (all channels every step), chain-order sequential (mod 8, 9, 25, 49, 11, 13 one at a time), and modulus-order sequential (mod 8, 9, 11, 13, 25, 49 -- the convergence order).

Scheduling Irrelevance Theorem (VERIFIED)
Simultaneous: converges at step 49 (= max modulus). Chain-order sequential: step 115 (= sum of moduli). Modulus-order sequential: step 115 (same sum). Ratio: 2.35x. Per-channel convergence time = q_i regardless of schedule. Partial convergence at step 41: simultaneous 5/6, modulus 4/6, chain 2/6. 15/15 checks pass.

Each channel's learning speed is an intrinsic property of its modulus -- no curriculum ordering can accelerate it. Simultaneous training is always optimal because channels extract information in parallel without interference. Chain-order annealing provides zero advantage.
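
The schedule arithmetic can be sketched in a few lines of Python. The model is the one stated in the theorem (channel q_i needs q_i of its own training steps); the function names are illustrative.

```python
from itertools import accumulate

CHAIN_ORDER = (8, 9, 25, 49, 11, 13)
MODULUS_ORDER = tuple(sorted(CHAIN_ORDER))

def converged_simultaneous(step, moduli):
    """All channels train every step, so channel q finishes at step q."""
    return sum(1 for q in moduli if q <= step)

def converged_sequential(step, order):
    """One channel at a time: each finishes at the running sum of moduli."""
    return sum(1 for finish in accumulate(order) if finish <= step)

simultaneous_total = max(CHAIN_ORDER)  # last channel to finish in parallel
sequential_total = sum(CHAIN_ORDER)    # same for every ordering
```

Calling `converged_simultaneous(41, CHAIN_ORDER)`, `converged_sequential(41, MODULUS_ORDER)` and `converged_sequential(41, CHAIN_ORDER)` gives the three partial-convergence counts at step 41.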

Simultaneous
49 steps
max(q_i). All channels trained in parallel. Optimal.
Sequential (any order)
115 steps
sum(q_i). 2.35x slower. Order only affects WHICH channels converge early.
Modulus order partial
4/6 at step 41
Extension channels {mod-11, mod-13} resolved early. More error detection sooner.
Chain order partial
2/6 at step 41
Data channels {mod-25, mod-49} prioritized. Less early coverage.

490 Split: Encoder/Decoder

The 490 split (490^420 mod 12,612,600 = outer idempotent) divides 6 channels into inner = {mod 8, mod 25, mod 49} (encoder, 9800 states) and outer = {mod 9, mod 11, mod 13} (decoder, 1287 states). Does this split create a meaningful timing asymmetry?

490 Convergence Asymmetry Theorem (PROVED)
Inner bottleneck = max(lambda(8), lambda(25), lambda(49)) = max(2, 20, 42) = 42. Outer bottleneck = max(lambda(9), lambda(11), lambda(13)) = max(6, 10, 12) = 12. Ratio = 42/12 = 7/2 = 3.5. The decoder converges 3.5x faster than the encoder. 26/26 checks pass.

lcm(inner Carmichael lambdas) = lcm(2, 20, 42) = 420 = Carmichael period of the ring. lcm(outer) = lcm(6, 10, 12) = 60. Ratio = 420/60 = 7. In Z/214,414,200: mod-17 joins the outer set, flipping the balance (outer = 21879 > inner = 9800).
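
A short Python check of the split, using only numbers quoted in this section. The `LAMBDA` table is copied from the theorem statement rather than recomputed.

```python
from math import lcm

MODULI = (8, 9, 25, 49, 11, 13)
N = 12_612_600
INNER, OUTER = (8, 25, 49), (9, 11, 13)
LAMBDA = {8: 2, 9: 6, 25: 20, 49: 42, 11: 10, 13: 12}  # values quoted in the text

e_outer = pow(490, 420, N)                          # the outer idempotent
crt_signature = tuple(e_outer % q for q in MODULI)  # 0 on inner, 1 on outer

inner_bottleneck = max(LAMBDA[q] for q in INNER)
outer_bottleneck = max(LAMBDA[q] for q in OUTER)
inner_period = lcm(*(LAMBDA[q] for q in INNER))
outer_period = lcm(*(LAMBDA[q] for q in OUTER))
```

The power works because 490 = 2 * 5 * 7^2 vanishes in the inner channels after a few multiplications, while 420 is a multiple of every outer Carmichael value.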

e_outer = 8,722,000
CRT=(0,1,0,0,1,1)
490^420 mod 12,612,600. Inner channels zeroed, outer = identity. Idempotent.
Ratio 7/2 = 3.5
inner=42, outer=12
Decoder (outer) crystallizes 3.5x faster than encoder (inner).
Period = inner lcm
420 = lcm(2,20,42)
Carmichael period of the ring is exactly the LCM of the inner channel periods.
17 flips balance
outer > inner at 7ch
6-channel: inner=9800 > outer=1287. 7-channel: outer=21879 > inner=9800.

Curriculum Irrelevance

Does training channels in chain order (mod 2 -> mod 3 -> mod 5 -> mod 7) help? Testing three schedules on Z/210 (4 channels): simultaneous (all every step), chain-order sequential, and reverse-order sequential.

Curriculum Irrelevance Theorem (PROVED)
Simultaneous training converges at max(q_i) = 7 steps. Both sequential curricula converge at sum(q_i) = 17 steps (2.43x slower). Curriculum order immaterial: chain-order = reverse = 17 (sum is commutative). Joint accuracy at step k = product of per-channel accuracies (gap = 0). No phase transitions beyond per-channel convergence events. 18/18 checks pass.

The curriculum penalty is sum(q_i)/max(q_i): Z/210 17/7=2.43x, Z/12,612,600 115/49=2.35x, Z/214,414,200 132/49=2.69x. It is always above 1 but does not grow monotonically with channel count (2.43 -> 2.35 -> 2.69). Simultaneous training outperforms ALL sequential curricula. Chain order = real algebra (group-theoretic decomposition), NOT trainable capability phases. Channel independence is so strong that each channel converges at its intrinsic q_i-step rate regardless of what other channels are doing.
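
A minimal simulation of the three schedules on Z/210, in Python. It assumes an additive walk with stride 1 starting at 0 and per-channel bigram tables stored as dictionaries; both are illustrative choices, not details taken from the probe.

```python
def simultaneous_steps(moduli, n_ring=210, stride=1):
    """Every channel records every transition; stop when all tables are full."""
    tables = {q: {} for q in moduli}
    n, step = 0, 0
    while any(len(tables[q]) < q for q in moduli):
        nxt = (n + stride) % n_ring
        for q in moduli:
            tables[q][n % q] = nxt % q
        n, step = nxt, step + 1
    return step

def sequential_steps(order, n_ring=210, stride=1):
    """Train one channel until its table is full, then move to the next."""
    n, step = 0, 0
    for q in order:
        table = {}
        while len(table) < q:
            nxt = (n + stride) % n_ring
            table[n % q] = nxt % q
            n, step = nxt, step + 1
    return step

sim = simultaneous_steps((2, 3, 5, 7))
chain = sequential_steps((2, 3, 5, 7))
reverse = sequential_steps((7, 5, 3, 2))
```

Under these assumptions each channel fills its table after exactly q_i of its own steps, so the sequential total is the same for every ordering.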

Sim=7, Chain=17
2.43x penalty
Sequential curriculum is always slower: sum(q_i)/max(q_i). Parallelism wins.
Order irrelevant
Chain = Reverse = 17
sum(2,3,5,7) is commutative. No ordering advantage. Chain order is not a training curriculum.
Gap = 0 everywhere
Joint = product(per-channel)
No cross-channel synergy. No phase transitions. No emergent capabilities beyond independent convergence.
5 probes confirmed
All consistent
Independence governs memory, convergence, scheduling, encoder/decoder, AND curriculum.

CRT Search Reduction

CRT independence isn't just about learning -- it transforms search. Given training data (x,y) from an unknown polynomial f over Z/N, how many candidates must you try? Monolithically: N^{d+1} for degree d. Per-channel with CRT: sum(q_i^{d+1}). The search space converts from MULTIPLICATIVE to ADDITIVE.

CRT Search Reduction Theorem (PROVED)
Polynomial regression over Z/N with CRT decomposition N = prod(q_i): monolithic search = N^{d+1} candidates, per-channel = sum(q_i^{d+1}). Verified by exhaustive search on Z/210: linear (d=0) 210/17 = 12x, affine (d=1) 44100/87 = 506x. Quadratic (d=2) is computed algebraically rather than searched: 9261000/503 = 18411x. At Z/214,414,200 scale (d=0): 214414200/132 = 1624350x. 22/22 checks pass.

The affine ratio (506x) is EXACTLY the block-diagonal Jacobian ratio N^2/sum(q_i^2) from the backprop theorem. Search reduction IS backprop reduction -- the same algebraic structure powers both. The ratio grows by a factor of ~N/q_max per degree increase.
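
The affine case on Z/210 can be sketched as follows in Python. The target coefficients a=67, b=41 are the ones quoted on this page; the helper names and the use of the full input range as training data are illustrative assumptions.

```python
from itertools import product
from math import prod

MODULI = (2, 3, 5, 7)
N = prod(MODULI)  # 210

def crt(residues):
    """Reconstruct the element of Z/N with the given per-channel residues."""
    total = 0
    for r, q in zip(residues, MODULI):
        m = N // q
        total += r * m * pow(m, -1, q)  # pow(m, -1, q) is the inverse of m mod q
    return total % N

def fit_channel(data, q):
    """Try all q*q affine maps mod q; return the maps that fit and the count tried."""
    fits = [(a, b) for a, b in product(range(q), repeat=2)
            if all((a * x + b - y) % q == 0 for x, y in data)]
    return fits, q * q

data = [(x, (67 * x + 41) % N) for x in range(N)]

tried = 0
a_residues, b_residues = [], []
for q in MODULI:
    fits, cost = fit_channel(data, q)
    (a_q, b_q), = fits  # exactly one affine map fits each channel
    a_residues.append(a_q)
    b_residues.append(b_q)
    tried += cost

recovered = (crt(a_residues), crt(b_residues))
monolithic = N ** 2  # candidate (a, b) pairs without decomposition
```

Each channel is searched on its own and the coefficients are stitched back together with CRT, so the candidate count adds across channels instead of multiplying.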

Linear (d=0)
12x on Z/210
210 monolithic candidates vs 17 per-channel. CRT reconstruction recovers a=137 exactly.
Affine (d=1)
506x on Z/210
44100 pairs vs 87. Per-channel (a,b) reconstructed via CRT: a=67, b=41.
Z/214,414,200 (d=0)
1,624,350x
214 million monolithic vs 132 per-channel. Gap grows with ring size and polynomial degree.
6 probes confirmed
All consistent
Independence governs memory, convergence, scheduling, architecture, curriculum, AND search.

CRT Predicate Discovery

CRT accelerates polynomial search (above). The same principle extends to PREDICATE search: finding ring elements satisfying algebraic conditions. Idempotents (e^2=e), involutions (x^2=1), zero divisors (z*w=0) -- all decompose per CRT channel. The search phase costs sum(q_i) per-channel checks. The enumeration phase costs product(per-channel solution counts).

CRT Predicate Search Theorem (PROVED)
Ring-algebraic predicate P(n) over Z/N decomposes as P = AND(P_i(n mod q_i)). Search: sum(q_i) per-channel checks. Enumerate: product(|S_i|). Monolithic: N checks. Verified on Z/210: idempotents (16 found, ratio 6x), involutions (8 found, ratio 8x). At Z/214,414,200 scale: idempotent ratio 824670x, involution ratio 552613x. Zero divisors need no search -- CRT(w) directly encodes #ZD = gcd(N,w). 28/28 checks pass.

mod-8 signature: Z/2 has 1 involution (-1=1 mod 2). Z/8 has 4 (Klein four-group). The depth-3 exponent quadruples involution count. Zero divisors are instant: CRT(w) tells you which channels are zero (unconstrained) and which require z=0. The decomposition IS the answer.
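
A Python sketch of the search-then-enumerate procedure, checked against brute force on Z/210. The function names are illustrative. The cost model (sum of moduli plus product of per-channel solution counts) is the one implied by the 260 and 388 denominators quoted on this page.

```python
from itertools import product
from math import prod

def crt(residues, moduli):
    """Element of Z/prod(moduli) with the given per-channel residues."""
    n = prod(moduli)
    return sum(r * (n // q) * pow(n // q, -1, q)
               for r, q in zip(residues, moduli)) % n

def solve(predicate, moduli):
    """Search each channel separately, then enumerate the CRT combinations."""
    per_channel = [[r for r in range(q) if predicate(r, q)] for q in moduli]
    cost = sum(moduli) + prod(len(s) for s in per_channel)
    solutions = sorted(crt(combo, moduli) for combo in product(*per_channel))
    return solutions, cost

def is_idempotent(r, q):
    return (r * r - r) % q == 0

def is_involution(r, q):
    return (r * r - 1) % q == 0

def annihilator_count(w, n):
    """Brute-force count of z in Z/n with z*w = 0; expected to equal gcd(n, w)."""
    return sum(1 for z in range(n) if (z * w) % n == 0)

SMALL = (2, 3, 5, 7)
FULL = (8, 9, 25, 49, 11, 13, 17)

idempotents, idem_cost = solve(is_idempotent, SMALL)
involutions, inv_cost = solve(is_involution, SMALL)
_, idem_cost_full = solve(is_idempotent, FULL)
_, inv_cost_full = solve(is_involution, FULL)
```

On the full ring the per-channel search still touches only 132 residues. The rest of the cost is writing out the 128 idempotents or 256 involutions.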

Idempotent (7ch)
824,670x
214414200 / 260. Per-channel: always 2 solutions (0 and 1).
Involution (7ch)
552,613x
214414200 / 388. Z/8 contributes 4 (Klein four), rest contribute 2.
Zero divisors
Instant
CRT(w) directly encodes #ZD. Zero channels = unconstrained. No search needed.
7 probes confirmed
All consistent
Independence governs memory, convergence, scheduling, architecture, curriculum, search, AND discovery.

Block-Diagonal GPU

If CRT channels are independent, a neural network layer decomposes into 7 small operations instead of 1 big one. For matrix multiply: monolithic uses (sum q_i)^3 FLOPs, block-diagonal uses sum(q_i^3). The ratio is 16x at 7 channels -- and it is scale-invariant (same ratio regardless of hidden dimension).

Block-Diagonal FLOP Theorem (PROVED)
Matmul FLOP ratio = (sum q_i)^3 / sum(q_i^3). Z/210 (4ch): 9x. Z/12,612,600 (6ch): 11x. Z/214,414,200 (7ch): 16x. Scale-invariant. Parameter savings: (sum q_i)^2 / sum(q_i^2) = 4.6x at 7 channels. Neural net layer (attention mono + FFN block-diag): 3.4x total. 12-layer CRT-GPT: ~25M params vs ~117M standard. 24/24 checks pass.

Attention must remain monolithic (coupling is cross-channel). But V-projection and FFN are block-diagonal -- 7 independent per-channel operations. The existing GPU benchmarks on this page already demonstrate per-channel dispatch. CRT Multiply processes 7 channels per element with zero cross-channel communication.
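
The two ratios in the theorem are plain arithmetic on the channel sizes. A small Python sketch (function names are illustrative):

```python
RINGS = {
    "Z/210": (2, 3, 5, 7),
    "Z/12,612,600": (8, 9, 25, 49, 11, 13),
    "Z/214,414,200": (8, 9, 25, 49, 11, 13, 17),
}

def matmul_ratio(moduli):
    """Monolithic (sum q_i)^3 FLOPs over block-diagonal sum(q_i^3) FLOPs."""
    return sum(moduli) ** 3 / sum(q ** 3 for q in moduli)

def param_ratio(moduli):
    """Monolithic (sum q_i)^2 weights over block-diagonal sum(q_i^2) weights."""
    return sum(moduli) ** 2 / sum(q ** 2 for q in moduli)

for name, moduli in RINGS.items():
    print(f"{name}: matmul {matmul_ratio(moduli):.2f}x, "
          f"params {param_ratio(moduli):.2f}x")
```

The quoted 9x, 11x and 16x figures match the integer parts of these ratios. Scaling every block by the same factor k multiplies numerator and denominator by k^3, which is why the ratio does not depend on hidden dimension.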

Matmul 16x
7 channels
(sum q_i)^3 / sum(q_i^3) = 2299968/142956. Grows with channel count.
Params 4.6x
7 channels
(sum q_i)^2 / sum(q_i^2) = 17424/3750. Fewer weights to train.
Layer 3.4x
attention+FFN
Attention QK monolithic + V,FFN block-diagonal. 174240/51174 per layer.
CRT-GPT-1
~25M params
12-layer, d=448, 7 channels. vs 117M standard GPT-1. 4.7x compression.

Matrix Library

CRT-GPT-1 forward pass implemented in .ax. Block-diagonal matrix operations: per-channel matmul, matvec, layer norm, squaring activation (h^2 -- ring-native nonlinearity), and 7 independent softmaxes. Full pipeline: character embedding via CRT residues, block-diagonal FFN, per-channel softmax, CRT reconstruction of predicted character. 53 checks pass.

CRT-GPT Forward Pass (VERIFIED)
Input char 'A' (65) decomposes to (1,2,15,16,10,0,14) mod (8,9,25,49,11,13,17). 7 per-channel embeddings. Layer norm per channel (mean=0, var=1). Block-diagonal FFN: W1*x+b1, squaring, W2*h+b2. Per-channel softmax (132 total classes). CRT reconstruction recovers 65. End-to-end .ax, WASI-verified, no JS.
Block-diagonal matmul
7 independent
bd_matmul, bd_matvec: per-channel matrix ops. Zero cross-channel weights.
Layer norm
Per-channel
Mean=0, var=1 independently for each of 7 channels.
Squaring activation
h^2
Ring-native nonlinearity. All activations non-negative. Forced positivity.
CRT softmax
132 classes total
7 small softmaxes (8+9+25+49+11+13+17) vs one 50K-class softmax. Each channel's probabilities sum to 1.
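
A Python sketch of the data flow described above, not the .ax implementation. The weights are placeholder identity matrices (an assumption made so the result is checkable), layer norm is omitted, and the softmax is replaced by an argmax over logits, which selects the same class.

```python
from math import prod

MODULI = (8, 9, 25, 49, 11, 13, 17)
N = prod(MODULI)

def decompose(code):
    """Character code -> one residue per channel."""
    return tuple(code % q for q in MODULI)

def reconstruct(residues):
    """CRT: per-channel residues -> the unique element of Z/N."""
    total = 0
    for r, q in zip(residues, MODULI):
        m = N // q
        total += r * m * pow(m, -1, q)
    return total % N

def block_ffn(x, w1, b1, w2, b2):
    """One channel's FFN: W1*x + b1, elementwise squaring, W2*h + b2."""
    h = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(w1, b1)]
    h = [v * v for v in h]  # squaring activation
    return [sum(w * hi for w, hi in zip(row, h)) + b for row, b in zip(w2, b2)]

def identity(q):
    return [[1.0 if i == j else 0.0 for j in range(q)] for i in range(q)]

residues = decompose(ord("A"))
predicted = []
for r, q in zip(residues, MODULI):
    one_hot = [1.0 if i == r else 0.0 for i in range(q)]
    eye, zeros = identity(q), [0.0] * q
    logits = block_ffn(one_hot, eye, zeros, eye, zeros)  # placeholder weights
    predicted.append(max(range(q), key=logits.__getitem__))
decoded = reconstruct(predicted)
```

No weight in this sketch connects two channels: each block sees only its own residue, and the channels meet again only in the final CRT reconstruction.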

Attention & Embedding

CRT embedding: character code c decomposes into 7 residues (c mod q_i). Each residue indexes a learned d_c-dimensional table. 132 total entries (8+9+25+49+11+13+17) vs 50K for BPE -- 380x smaller output vocabulary. CRT coupling provides zero-parameter attention: QK scores come from the ring's eigenvalue inner product, with 99.5% retrieval and 30x advantage over uniform (CONFIRMED above). Single full-coupling attention, no multi-head (which dilutes). Block-diagonal V projection per channel. 50 checks pass.

CRT Attention Layer (VERIFIED)
Coupling-based QK attention on short sequences. Causal masking: position 0 = self-only (weight 1.0), verified. Identical-code sequences produce uniform attention weights (proved by algebraic symmetry). Identity V gives weighted embedding average. Attention composes with B2 FFN: embedding -> attention -> layer_norm -> FFN -> softmax -> CRT reconstruction recovers input character. All operations per-channel independent.
CRT embedding
132 entries
Character code reduced modulo each of the 7 channel moduli. 4570x fewer params than standard embedding.
Coupling attention
Zero-parameter QK
Eigenvalue inner product. 99.5% retrieval. No learned Q or K matrices.
Causal masking
Autoregressive
Position i attends to 0..i. Self-coupling dominates (highest weight).
Block-diagonal V
7 independent
Per-channel value projection. No cross-channel mixing in attention.

CRT Likelihood Scoring

Mode-based CRT prediction picks the most common next element per channel. It works at ring scale (58% exact on 214M elements, above) but hits a structural ceiling: 74% of test positions fail, and 89% of those failures are MODEL-limited -- the correct character has nonzero training count but is not the mode. The model sees the answer but selects wrong.

Soft CRT Prediction (VERIFIED)
CRT-factored distribution scoring -- summing normalized per-channel trigram counts across all channels -- achieves 115/399 (28.8%) on text prediction, vs 105/399 (26.3%) for mode+NN search (+9.5% relative). Per-channel accuracy jumps from 268 to 322 permil (+20%). Small channels gain most: D-channel +34%, K-channel +37%. 14 soft-only correct positions vs 4 NN-only: distribution scoring is a near-superset of mode search. Score discriminates: 5.1/7 channels match at correct vs 0.7/7 at wrong. Equal channel weights ARE optimal -- SGD coordinate descent finds zero improvement.

The soft score IS a CRT-factored probability: each channel votes with its conditional frequency, not its mode. Channel independence means the sum factorizes -- no interaction terms. Small moduli (D=8, K=9) gain most because their count distributions are richest: 8 and 9 residue classes per context provide well-sampled distributions. Large moduli (b=49) gain little: sparse data collapses distribution to mode. 15 incremental approaches closed (THMs 93-109): mode, bloom variants, weighting, cross-channel joints. Distribution scoring is the capstone.
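
A minimal Python sketch of distribution scoring with equal channel weights. The training observations are hypothetical (three 'e' and one 't' following some context), and the function names are illustrative. The trigram context handling of the real experiment is not reproduced.

```python
from collections import Counter

MODULI = (8, 9, 25, 49, 11, 13, 17)

def channel_distributions(next_chars):
    """Per-channel normalized counts of the next character's residue."""
    dists = []
    for q in MODULI:
        counts = Counter(c % q for c in next_chars)
        total = sum(counts.values())
        dists.append({r: k / total for r, k in counts.items()})
    return dists

def soft_score(c, dists):
    """Equal-weight sum of the conditional frequency of c's residue per channel."""
    return sum(d.get(c % q, 0.0) for q, d in zip(MODULI, dists))

def soft_predict(candidates, dists):
    return max(candidates, key=lambda c: soft_score(c, dists))

observed = [ord("e"), ord("e"), ord("e"), ord("t")]  # hypothetical training data
dists = channel_distributions(observed)
best = soft_predict(range(128), dists)
```

Every candidate receives a graded score here, so a character that was seen but is not the per-channel mode still competes instead of being discarded outright.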

Ceiling diagnostic
89% model-limited
Mode is 6.2x more common than actual. Correct character ranks 8th among 24 alternatives.
Soft vs Mode+NN
+9.5% exact, +20% per-ch
115/399 vs 105/399. D-channel: 141 vs 105 (+34%). K-channel: 133 vs 97 (+37%).
Weight optimality
Equal = optimal
SGD (5 epochs, 400 validation): 0 exclusive gains on test. 8 divergent predictions ALL wrong for both methods.
Near-superset
14 soft-only, 4 NN-only
Oracle(soft, NN) = 119. Gap from soft = 4, gap from NN = 14. Distribution subsumes mode.

Lucas CRT Independence

The hash representation collapses spatial structure: all 7 channels see the same data through slightly different hashes, producing informationally redundant channels (0.22% pairwise disagree rate). Lucas' theorem shows that binomial coefficients C(n,k) naturally create independent channel views:

Lucas Channel Independence Theorem (PROVED)
C(n,k) mod p = product of C(n_i, k_i) mod p, where n_i, k_i are the base-p digits of n, k (Lucas). Each axiom prime decomposes (n,k) in a different number base: D=binary, K=ternary, E=base-5, b=base-7, L=base-11, GATE=base-13, ESCAPE=base-17. Row p of Pascal's triangle is all-zero in the p-channel for 0 < k < p and is never all-zero in any other channel (C(p,1) = p is nonzero mod every other prime). 7 primes = 7 unique blindness rows = 7 genuinely independent views. Pairwise disagree: 230-493 permil vs ~2 for hash CRT. 78/78 verified.
Independence gain
230-493 vs 335 permil
Binomial zero/nonzero disagree exceeds hash on same data. V25 ARC hash was 2 permil (different metric). Key: each channel sees a different base-p digit decomposition.
490 split in zeros
DEAD 374 > ALIVE 245
DEAD channels (D,E,b) produce 52% more zeros than ALIVE (K,L,GATE,ESC). Smaller primes = denser Sierpinski fractal.
Kummer carries
945/945 verified
v_p(C(n,k)) = carries in base-p addition of k+(n-k). Each prime counts carries in its own base = genuinely different p-adic structure.
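
Both classical results used here can be checked against `math.comb` in Python. This is a sketch for small n with illustrative function names; the 78 and 945 checks cited above are not reproduced.

```python
from math import comb

PRIMES = (2, 3, 5, 7, 11, 13, 17)

def lucas(n, k, p):
    """C(n, k) mod p as a product over base-p digit pairs (Lucas' theorem)."""
    result = 1
    while n or k:
        n, nd = divmod(n, p)
        k, kd = divmod(k, p)
        result = (result * comb(nd, kd)) % p  # comb is 0 when kd > nd
    return result

def carries(a, b, p):
    """Number of carries when adding a and b in base p (Kummer's theorem)."""
    count = carry = 0
    while a or b or carry:
        a, ad = divmod(a, p)
        b, bd = divmod(b, p)
        carry = 1 if ad + bd + carry >= p else 0
        count += carry
    return count

def valuation(m, p):
    """Largest v with p**v dividing m (m must be positive)."""
    v = 0
    while m % p == 0:
        m //= p
        v += 1
    return v

def blind_row(p):
    """True if the interior of row p of Pascal's triangle vanishes mod p."""
    return all(lucas(p, k, p) == 0 for k in range(1, p))
```

For row p the base-p digits of n are (0, 1), so any interior k has a units digit above 0 and the Lucas product contains a zero factor. That is the blindness row of the p-channel.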

Contrast Table

Neural architecture | Monolithic transformer with large output softmax | CRT-decomposed: 5 independent heads. 9512x compression. Block-diagonal gradients.
Error detection | Post-hoc validation, separate ECC systems | mod-11 provides free error detection built into the algebra. Dual mod-11 + mod-13 corrects.
Scaling | Needs GPU clusters and billions of parameters | Runs in browser. CRT makes it small enough. 9512x parameter compression.
Compression | Statistical: LZ77 + Huffman / FSE on 1 monolithic stream | Algebraic: 5 independent CRT channels. Up to 20x Rissanen redundancy. Parallel by structure.

Source code · Public domain (CC0)

.ax source compiled to WASM via self-hosting compiler. Zero HTML authored.