Intelligence -- antonlebed.com

CRT Architecture

Standard transformer: one monolithic output layer predicts among N classes. CRT transformer: shared backbone produces a representation, then independent output heads predict residues per channel. Joint probability reconstruction recovers the full prediction. The Chinese Remainder Theorem guarantees unique recovery. The full ring Z/214,414,200 has 7 channels; experiments below used the 5-channel Z/970,200 ring.

Architecture Theorem (PROVED)

Shared backbone + CRT output = unique optimal point. Shared backbone maximizes error detection reliability (correlated representations across channels). CRT output maximizes efficiency (5 small softmaxes instead of 1 large one). 7.5x fewer output parameters. 212x output backprop speedup at Z/12,612,600 scale. 2.68x error detection reliability vs split backbone.

Each channel has a natural domain. The CRT decomposition is not imposed -- it emerges from the ring structure:

Channel	Size	Domain	Why
Z/8	8 classes	Coarse (3 bits)	Fastest to learn, cheapest to sacrifice
Z/9	9 classes	Mid-range	Minimum for complete decomposition
Z/25	25 classes	Fine	Carries golden ratio (discriminant 5)
Z/49	49 classes	Precise (49 states)	Deepest channel, controls spectral gap
Z/11	11 classes	Parity	Error detection. Always on. 1+2+3+5 = 11
Z/13	13 classes	Boundary	Dual parity with mod-11. 2^2+3^2 = 13
Z/17	17 classes	Transcendence	5*7 = 1 mod 17. Period quadruples to 1680

Full ring output: 8+9+25+49+11+13+17 = 132 classes. Experiments at Z/970,200 (5 channels, 102 classes): about 9512x compression (970,200 / 102 = 9511.8, quoted as 9511x elsewhere on this page). The backprop Jacobian is block-diagonal: 25654x fewer entries at N=2310.
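The counts above are plain arithmetic and can be rechecked with a short sketch. The helper names `forward_ratio` and `jacobian_ratio` are illustrative, not part of the project's code; the Jacobian ratio assumes one q x q block per channel against a single N x N block.

```python
from math import prod

THIN = [2, 3, 5, 7, 11]        # Z/2310
FIVE = [8, 9, 25, 49, 11]      # Z/970,200
SEVEN = FIVE + [13, 17]        # Z/214,414,200

def forward_ratio(moduli):
    """Ring size divided by the total number of per-channel classes."""
    return prod(moduli) / sum(moduli)

def jacobian_ratio(moduli):
    """N^2 entries of a monolithic Jacobian vs sum of q^2 per-channel blocks."""
    return prod(moduli) ** 2 / sum(q * q for q in moduli)

classes_seven = sum(SEVEN)
classes_five = sum(FIVE)
print(classes_seven, classes_five)
print(forward_ratio(FIVE))     # a little under 9512
print(jacobian_ratio(THIN))    # a little over 25654
```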

Five Breakthroughs

All five stack multiplicatively. Missing any one = leaving performance on the table.

Breakthrough	Factor	Mechanism
CRT Decomposition	9512x compression	5 small heads vs 1 monolithic
Loop Theorem	N / sum(p_i) forward	CRT = loop unrolling
Block-Diagonal Backprop	25654x fewer Jacobian entries	5 independent gradient paths
mod-11 error detection	100% thin, ~92% prime-power	Free detection. Dual parity mod 11 + mod 13: 100%.
Rissanen MDL	20x byte / 936x token	Minimum description length selects Z/12,612,600

Combined at N=210 (Z/210): ~126,000x. CRT does not improve AI incrementally. It changes the computational class.

CRT GPU Compute

6 CRT channels = 6 GPU workgroups. Zero synchronization. The ring parallelizes itself. CRT decomposition IS the GPU parallelization strategy. No manual workgroup design. No shared memory management. The algebraic structure does the engineering.

GPU Mapping Theorem (CC0)

N = 12612600 = 2^3 * 3^2 * 5^2 * 7^2 * 11 * 13. CRT decomposition: n -> (n%8, n%9, n%25, n%49, n%11, n%13). Each channel = one GPU workgroup. Per-channel arithmetic = independent compute shader invocations. Block-diagonal Jacobian = no cross-channel gradients = no inter-workgroup communication. All 12,612,600 elements processed simultaneously.

The WGSL compute shader is 8 lines per operation: decompose, operate per-channel, reconstruct. The CRT reconstruction coefficients are ring constants; for the 5-channel ring Z/970,200 they are:

Channel	Modulus	Workgroup	CRT Coefficient
mod 8	8	Coarse (3 bits)	363825
mod 9	9	Mid-range	431200
mod 25	25	Fine	853776
mod 49	49	Precise (49 states)	732600
mod 11	11	Error detection	529200

mod-11 error detection runs as a 6th workgroup -- free integrity check on every computation. Training a CRT neural network: each channel = independent backprop. 5 GPU workgroups, zero inter-workgroup communication.
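The coefficients in the table can be recomputed from the moduli alone. The following is a minimal Python sketch, not the WGSL shader itself; `crt_basis`, `decompose`, and `reconstruct` are hypothetical names, and the three-argument `pow` with exponent -1 needs Python 3.8 or later.

```python
from math import prod

def crt_basis(moduli):
    """Return (N, [B_q]) where B_q is 1 mod q and 0 mod every other modulus."""
    n = prod(moduli)
    return n, [(n // q) * pow(n // q, -1, q) % n for q in moduli]

def decompose(x, moduli):
    return [x % q for q in moduli]

def reconstruct(residues, basis, n):
    return sum(r * b for r, b in zip(residues, basis)) % n

MODULI = [8, 9, 25, 49, 11]
N, BASIS = crt_basis(MODULI)
print(N, BASIS)

x = 123456
roundtrip = reconstruct(decompose(x, MODULI), BASIS, N)
print(roundtrip == x)
```

Each channel's residue is computed without reference to any other channel, which is the property the workgroup mapping relies on.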

Workgroups

5 independent

CRT guarantees zero cross-channel data dependency. Perfect GPU utilization.

Synchronization

Zero

No barriers, no shared memory, no atomic ops between channels.

Scaling

Linear with N

All 970,200 ring elements processed simultaneously. 256 threads per workgroup.

Prior art

CC0

WGSL compute shader is public domain. No CUDA dependency. No NVIDIA lock-in.

CRT Byte Compression

Every byte lives in Z/210 = Z/2 x Z/3 x Z/5 x Z/7, with mod 11 carried as a fifth (parity) channel. Bytes 210-255 wrap (mod 210). Decompose any data into 5 channels. Each channel has its own entropy. Rissanen MDL: the sum of channel entropies vs joint entropy reveals exploitable redundancy -- up to 20x at byte scale.

Rissanen Redundancy (CC0)

Joint entropy H(X) <= 8 bits. Channel entropies: H(X%2) + H(X%3) + H(X%5) + H(X%7) + H(X%11). Redundancy ratio = sum / joint. For structured data (text, images), small channels saturate while large ones concentrate. Ratio climbs to 20x. A LEARNING ALGORITHM using CRT channels needs sum(p_i)/N = 28/210 parameters vs monolithic. Stochastic complexity savings: ~20x at byte scale.
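The redundancy ratio described above can be sketched as follows. This is an illustration under stated assumptions, not the page's own tool: bytes are wrapped mod 210 first, all five residues are taken from the wrapped value, and entropies are empirical (computed from symbol counts in the input).

```python
from collections import Counter
from math import log2

MODULI = [2, 3, 5, 7, 11]

def entropy(symbols):
    """Empirical Shannon entropy in bits."""
    counts = Counter(symbols)
    total = sum(counts.values())
    return -sum(c / total * log2(c / total) for c in counts.values())

def redundancy(data: bytes):
    wrapped = [b % 210 for b in data]          # assumption: wrap before splitting
    joint = entropy(wrapped)
    channels = [entropy([v % q for v in wrapped]) for q in MODULI]
    ratio = sum(channels) / joint if joint > 0 else float("inf")
    return joint, channels, ratio

joint, channels, ratio = redundancy(b"the quick brown fox jumps over the lazy dog")
print(round(joint, 3), [round(h, 3) for h in channels], round(ratio, 3))
```

Because the mod 2, 3, 5, 7 residues together determine the wrapped byte, the sum of channel entropies can never fall below the joint entropy, so the ratio is at least 1.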

Method

Algebraic decomposition

n -> (n%2, n%3, n%5, n%7, n%11). Not statistical. CRT isomorphism.

Channels

5 independent (proved)

CRT guarantees channel independence. 5 cores, zero coordination.

Error detection

Built-in

mod-11 channel detects errors for free. Dual mod-11 + mod-13 corrects.

Foundation

Chinese Remainder Theorem

Roughly 1,600 years old (Sunzi Suanjing, 3rd-5th century CE). Not ad hoc pattern matching (LZ77, 1977).

Matrix Foundation

Phase A begins: co-evolving .ax and axiom-native intelligence. The first step: f64 matrix operations in pure .ax. matmul, transpose, softmax, cross-entropy -- the primitives every neural network needs. All compiled to WASM. No Python. No NumPy. No dependencies.

The matrix library uses falloc/fset/fget (f64 arrays) and arena reset (hp/set_hp). Matrices are row-major flat arrays. 18 operations: mat_zero, mat_get, mat_set, matmul, transpose, mat_add, mat_sub, mat_scale, dot_f, vec_sum_f, vec_max_f, softmax, cross_entropy, relu, sigmoid, relu_vec, outer_product, argmax. 54/54 WASI tests pass.

matmul

O(n^3) triple loop

Row-major f64 via fget/fset. 2x3 * 3x2 verified: [58,64,139,154].

softmax

Numerically stable

Max-subtraction prevents overflow. Uniform [0,0,0] -> [1/3,1/3,1/3].

cross_entropy

Log-clamped

eps=1e-6 prevents log(0). Good prediction CE~0.01, bad CE>4.

CRT advantage

Per-channel matmul

7 small matrices (max 49x49) instead of one 214414200-wide layer.
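For readers without the .ax toolchain, the three primitives on these cards can be approximated in Python. This is a sketch, not the .ax library: the function signatures are assumptions, and the 2x3 and 3x2 inputs are the standard 1..6 and 7..12 matrices, chosen because they reproduce the quoted result [58,64,139,154].

```python
from math import exp, log

def matmul(a, b, n, k, m):
    """Multiply a row-major n x k matrix by a row-major k x m matrix."""
    out = [0.0] * (n * m)
    for i in range(n):
        for j in range(m):
            acc = 0.0
            for t in range(k):
                acc += a[i * k + t] * b[t * m + j]
            out[i * m + j] = acc
    return out

def softmax(v):
    peak = max(v)                         # max-subtraction keeps exp() in range
    exps = [exp(x - peak) for x in v]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, target, eps=1e-6):
    return -log(max(probs[target], eps))  # clamp avoids log(0)

product = matmul([1, 2, 3, 4, 5, 6], [7, 8, 9, 10, 11, 12], 2, 3, 2)
uniform = softmax([0.0, 0.0, 0.0])
print(product, uniform)
```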

CRT Perceptron

First neural network trained in pure .ax f64. A per-channel perceptron learns x^2 mod 210 (Z/210 squaring) using matmul, softmax, and outer_product from the matrix library. 4 independent channels (mod 2, 3, 5, 7) with 87 total parameters. 10 epochs of gradient descent. Result: 210/210 -- perfect accuracy on all inputs, including 168 never seen during training.

A standard perceptron on the same task needs 44100 parameters (210x210). After identical training on 42 examples, it scores 43/210 -- pure memorization, zero generalization. The CRT model generalizes completely because each channel is small enough that training coverage is automatic: 42 examples cover all residue classes up to mod 7.
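The coverage argument can be checked with a simplified stand-in. The sketch below replaces each per-channel perceptron with a plain lookup table, and assumes the 42 training inputs are 0..41 (the text does not say which 42 were used). It shows why 42 examples suffice, not how the .ax perceptron is trained.

```python
MODULI = [2, 3, 5, 7]
N = 210
BASIS = [105, 70, 126, 120]        # CRT basis for Z/210

train = range(42)                  # assumption: training inputs are 0..41

# "Train" each channel: record the target residue seen for each input residue.
tables = [dict() for _ in MODULI]
for x in train:
    y = (x * x) % N
    for table, q in zip(tables, MODULI):
        table[x % q] = y % q

def predict(x):
    residues = [table[x % q] for table, q in zip(tables, MODULI)]
    return sum(r * b for r, b in zip(residues, BASIS)) % N

correct = sum(predict(x) == (x * x) % N for x in range(N))
print(correct, "of", N)
```

Every residue class of every channel appears among the 42 inputs, so the 168 unseen inputs are answered from entries that were already filled.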

CRT accuracy

210/210

4 channels: mod 2, 3, 5, 7. Each 210/210. CRT reconstruction perfect.

Standard accuracy

43/210

44100 params. Memorizes training inputs. Cannot generalize.

Backward Loop

506x

87 vs 44100 parameters. sum(p^2) vs N^2.

Z/970,200 Forward

9511x

970200 / 102 forward compression. 3292 params vs 941 billion.

Prime-Power Channels

Z/210 uses thin channels (mod 2, 3, 5, 7). Z/970,200 uses prime-power channels (mod 8, 9, 25, 49, 11) -- the Pareto-optimal exponents. Same architecture, same training. Prime-power channels: 3292 parameters. Standard equivalent: 941 billion. Forward compression jumps from 12x (Z/210 thin) to 9511x.

Prime-power channels compute on a richer substrate. The mod-49 channel carries 49 classes (vs 7 for thin mod-7). mod-11 provides free error detection: every prediction is independently verified by the 5th channel. All five channels achieve perfect accuracy. CRT reconstruction: 50/50 on random Z/970,200 elements.

mod 8

8/8

64 params. The only depth-3 channel. 4 involutions (Klein four).

mod 49

49/49

2401 params. The deepest channel. 49 states.

CRT recon

50/50

5-channel CRT reconstruction on random Z/970,200 elements. 3292 params.

Forward

9511x

970200 / 102. Raising exponents: 12x (Z/210 thin) to 9511x (Z/970,200). 286Mx backward.

Bloom Mixing

The CRT perceptron learns per-channel functions perfectly. But x%13 (a cross-channel function) depends on ALL channels -- no single channel determines it. Linear mixing across channels achieves random performance (23/210). The solution: bloom. CRT reconstruction IS the non-linearity.

Architecture: per-channel perceptrons (87 params) -> CRT reconstruct (0 params) -> mod q. The bloom layer is algebraic and exact. It has zero trainable parameters. It replaces ReLU. One set of trained weights projects through ANY coprime gate: x%11, x%13, x%17 -- all 210/210 from the same 87 parameters.
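The bloom step itself has no trainable parts, so it can be sketched directly. In this illustration the per-channel perceptrons are replaced by exact residues (an assumption that stands in for a perfectly trained identity layer); `bloom` and `accuracy` are hypothetical names.

```python
MODULI = [2, 3, 5, 7]
N = 210
BASIS = [105, 70, 126, 120]

def bloom(residues, gate):
    """CRT reconstruct (0 params), then project through a coprime gate."""
    x = sum(r * b for r, b in zip(residues, BASIS)) % N
    return x % gate

def accuracy(gate):
    return sum(bloom([x % q for q in MODULI], gate) == x % gate for x in range(N))

results = {gate: accuracy(gate) for gate in (11, 13, 17)}
print(results)
```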

Bloom x%13

210/210

87 params + 0 bloom. CRT reconstruction provides the non-linearity.

Multi-gate

210/210 each

Same weights: x%11, x%13, x%17 all perfect. One model, infinite gates.

Additive

23/210

221 params, 40 epochs. Linear mixing = RANDOM on cross-channel targets.

Ratio

31x

Standard 2730 params (210*13). Bloom: 87. Structure wins completely.

Hourglass

Bloom computes x%13 perfectly -- but what about x^2 % 13? A single bloom layer reconstructs x^2 mod 210, then reduces mod 13. When x^2 >= 210 (x >= 15), the ring reduction destroys mod-13 information. Result: 28/210 correct. Ring overflow = info loss.

The hourglass fixes this. Two flowers face to face, gate to gate. Flower A extracts the gate value g = x%13 (identity bloom). Gate values are small: 0 to 12. Flower B computes g^2 inside Z/210 -- and 12^2 = 144 < 210, so no overflow ever occurs. The gate IS the bottleneck that prevents overflow.
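The overflow argument can be reproduced with exact arithmetic. In this sketch both flowers are assumed to be perfectly trained, so only the ring reduction differs between the two pipelines.

```python
N, GATE = 210, 13

def single_bloom(x):
    # Square inside Z/210, then reduce mod 13: the ring reduction loses information.
    return (x * x % N) % GATE

def hourglass(x):
    g = x % GATE                  # flower A: identity bloom through the gate
    return (g * g % N) % GATE     # flower B: g^2 <= 144 < 210, never overflows

target = [x * x % GATE for x in range(N)]
single = sum(single_bloom(x) == target[x] for x in range(N))
stacked = sum(hourglass(x) == target[x] for x in range(N))
print(single, stacked)
```

The single-bloom matches are exactly the inputs where floor(x^2 / 210) is a multiple of 13, which includes all x below 15.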

Hourglass x^2%13

210/210

174 params (87+87). Two flowers solve what one cannot.

Single bloom

28/210

Same squaring, no gate extraction. Ring overflow kills accuracy.

Gate bottleneck

16:1

210 -> 13 -> 210. Max gate^2 = 144 < 210. Safe.

Ratio

15x

Standard 2730 params (210*13). Hourglass: 174.

Z/970,200 Hourglass

The Z/210 hourglass stops at polynomial degree 2: gate value 12, cubed gives 1728 > 210. Overflow. The Z/970,200 hourglass extends to degree 5: 12^5 = 248832 < 970200. Raising exponents gives 3 extra polynomial degrees beyond Z/210.
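The degree ceiling follows from a one-line bound: the largest gate value, 12, raised to the degree must stay below the ring size. A small sketch (`max_safe_degree` is an illustrative name):

```python
def max_safe_degree(ring_size, gate=13):
    """Largest d such that (gate - 1)**d < ring_size."""
    d = 0
    while (gate - 1) ** (d + 1) < ring_size:
        d += 1
    return d

small = max_safe_degree(210)
large = max_safe_degree(970200)
print(small, large)
```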

Architecture: Flower A (identity on 5 prime-power channels, 3292 params) -> gate(13) bottleneck -> Flower B (g^d on prime-power channels, 3292 params) -> %13. Total: 6584 params. Standard: 970200 * 13 = 12,612,600 parameters.

x^2%13

50/50

Degree 2. Safe in both Z/210 and Z/970,200. 6584 params.

x^3%13

50/50

Cube: Z/210 fails (6/13 correct; g^3 overflows 210 for g >= 6), Z/970,200 succeeds.

x^5%13

50/50

Max degree 5. 12^5 = 248832 < 970200.

Ratio

1915x

Standard = 12,612,600 params. Hourglass: 6584.

Inaccessible Degrees

The Z/970,200 hourglass reaches degree 5. But degrees 7 and 11 are unreachable by power composition: exponents compose multiplicatively mod phi(13) = 12, and the multiplicative monoid generated by {1,...,5} in Z/12 is missing exactly {7, 11}. These require a different approach.

Resolution: channel-49 carries a correction. For each gate value g, search r in 0..48 such that the CRT reconstruction with r placed in the mod-49 (b^2) channel produces the target mod 13. Pigeonhole guarantees at least 3 valid r per gate value (floor(49/13)=3). Same 6584 params. Same architecture. Only the training targets differ.

Power deg 7

8/13

Overflow: g^7 >= 970200 for g >= 8. Power per-channel fails.

Lookup deg 7

50/50

mod-49 correction. 13/13 gate values + 50/50 random Z/970,200.

Power deg 11

7/13

x^11 = x^(-1) mod 13 (Inverse Degree). Overflow at g >= 4.

Lookup deg 11

50/50

All 12 degrees accessible. Inner={1,5}. Outer={7,11}.

Stacked Hourglass

The Z/970,200 hourglass ceiling: degree 5. At degree 6, gate values 10, 11, 12 overflow (g^6 >= 970200). The fix: STACK. Two flowers in sequence, each applying a sub-degree. Flower B1 computes g^d1, extracts the gate. Flower B2 computes g1^d2, extracts the result. Total degree = d1 * d2. Each layer stays safe (max 12^5 = 248832 < 970200). The ceiling shatters.

5-squared collapse: degree 5 composed with degree 5 gives degree 25. But 25 mod phi(13) = 25 mod 12 = 1. Quintic composed with quintic = identity. 5^2 = 1 in Z/12: two layers cancel, leaving only what was already there.
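Stacking and the 5-squared collapse can both be checked with exact arithmetic. The sketch assumes each flower is perfectly trained; `flower` and `stacked` are hypothetical names.

```python
N, GATE = 970200, 13

def flower(g, d):
    """One layer: raise a gate value to a safe degree inside Z/N, then re-gate."""
    assert (GATE - 1) ** d < N, "degree would overflow the ring"
    return (g ** d % N) % GATE

def stacked(x, d1, d2):
    g = x % GATE
    return flower(flower(g, d1), d2)

deg6_ok = all(stacked(x, 3, 2) == pow(x, 6, GATE) for x in range(1000))
deg12_ok = all(stacked(x, 3, 4) == pow(x, 12, GATE) for x in range(1000))
collapse = all(stacked(x, 5, 5) == x % GATE for x in range(1000))
print(deg6_ok, deg12_ok, collapse)
```

The final check is the collapse: degree 25 acts as the identity on every gate value, including 0.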

Single deg 6

10/13

Overflow: g^6 >= 970200 for g >= 10. THE CEILING.

2-stack deg 6

50/50

cube+square (3*2=6). Ceiling broken. 9876 params.

2-stack deg 12

50/50

cube+quartic (3*4=12). Fermat: x^12=1 mod 13 for x not divisible by 13.

5^2 collapse

50/50

quintic+quintic (5*5=25 mod 12=1). Identity.

Soft Bloom Emergence

Every test above hard-codes the CRT basis: {105, 70, 126, 120} for Z/210, {363825, 431200, 853776, 732600, 529200} for Z/970,200. But what if the ring can discover its own basis? Strip all CRT knowledge. Start with bloom weights = [0, 0, 0, 0]. Apply coordinate descent: for each channel, sweep the weight over all possible values and pick the one that maximizes reconstruction accuracy. After a few rounds, the CRT basis emerges from data alone.

Perturbation is catastrophic and exact. Changing any basis weight by +1 drops accuracy to precisely N/p -- the channel dies and only elements with zero residue survive. The basis is an isolated global maximum: no graceful degradation, no nearby local optima. The discovered basis elements are orthogonal idempotent projectors: B_p^2 = B_p, B_p * B_q = 0 for p != q, and sum(B_p) = 1 mod N.
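The projector identities and the exact N/p collapse can be verified directly on Z/210. This sketch checks the properties of the known basis; it does not reimplement the coordinate-descent search.

```python
MODULI = [2, 3, 5, 7]
N = 210
BASIS = [105, 70, 126, 120]

def accuracy(basis):
    """Count x in Z/N that the weighted residue sum reconstructs exactly."""
    return sum(
        sum(b * (x % q) for b, q in zip(basis, MODULI)) % N == x
        for x in range(N)
    )

idempotent = all(b * b % N == b for b in BASIS)
orthogonal = all(BASIS[i] * BASIS[j] % N == 0
                 for i in range(4) for j in range(4) if i != j)
complete = sum(BASIS) % N == 1

perturbed = {}
for i, q in enumerate(MODULI):
    bumped = list(BASIS)
    bumped[i] += 1                 # move one basis weight by +1
    perturbed[q] = accuracy(bumped)

print(accuracy(BASIS), perturbed)
```

With B_p bumped by 1 the prediction becomes x + (x mod p), which equals x only when x mod p is 0, hence exactly N/p survivors.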

Z/30 emergence

30/30

[0,0,0] -> [15,10,6] in 3 rounds. The CRT basis grows from nothing.

Z/210 emergence

210/210

[0,0,0,0] -> [105,70,126,120]. Unique global maximum.

Z/970,200 full ring

970200/970200

Per-channel search: 5 basis elements found structurally.

Perturbation

exact N/p

B_8+1: 121275/970200 = N/8. Catastrophic, structural damage.

Joint Bloom

Soft bloom discovers the CRT basis but takes per-channel functions as given. Joint bloom learns BOTH. Architecture: prediction(x) = sum(W_ch[x%p] * B[ch]) % N, where W is a per-channel lookup table (17 params) and B is the bloom basis (4 params). Total: 21. Standard: 44100. Ratio: 2100x.

Identity-anchored training: learn B on identity (always bijective, always converges), then learn W on the actual target. Soft bloom alone gets 16/210 on x^2 -- only the 16 idempotents of Z/210 match. Joint bloom with trained W: 210/210. The per-channel lookup discovers r^2 mod p (for x^2) and (p-r) mod p (for mirror) entirely from data.
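The prediction formula can be exercised with the lookup tables written out by hand. In this sketch W is set to the values the text says training discovers, rather than trained, so it illustrates the architecture and the 16/210 vs 210/210 contrast only.

```python
MODULI = [2, 3, 5, 7]
N = 210
B = [105, 70, 126, 120]                                   # bloom basis (4 params)

def predict(W, x):
    return sum(W[ch][x % q] * B[ch] for ch, q in enumerate(MODULI)) % N

W_identity = [[r for r in range(q)] for q in MODULI]      # 17 lookup entries
W_square = [[r * r % q for r in range(q)] for q in MODULI]
W_mirror = [[(q - r) % q for r in range(q)] for q in MODULI]

soft = sum(predict(W_identity, x) == x * x % N for x in range(N))
joint = sum(predict(W_square, x) == x * x % N for x in range(N))
mirror = sum(predict(W_mirror, x) == (N - x) % N for x in range(N))
print(soft, joint, mirror)
```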

Joint x^2

210/210

21 params. Identity-anchored B + trained W. Per-channel: r^2 mod p.

Joint mirror

210/210

Same 21 params. W learns (p-r)%p. CRT decomposition intact.

Soft bloom x^2

16/210

Basis alone = identity. Matches x^2 only at idempotents (2^4=16).

Constraint count

48 vs 210

x^2 images: 2*2*3*4=48 tuples. Identity: 210. Non-bijective = fewer.

Deep Joint Bloom

Joint bloom at Z/210 scale (210 elements, 21 params) extends to Z/970,200 (5 prime-power channels). Key result: global coordinate descent FAILS at Z/970,200 scale -- noise drowns signal (~0.0001% accidental match). Per-channel search ALWAYS works: correct basis scores M, wrong scores ~M/q. The ring FORCES CRT decomposition at scale.

Identity-anchored training at Z/970,200: per-channel basis search discovers B structurally (5 elements), then per-channel W training discovers lookup tables from data (102 entries). Total: 107 params. Forward ratio: 9067x. With perceptron matrices: 3297 params, backward ratio 286 million x. Raising exponents: 590,000x gain over Z/210.

Z/970,200 x^2

970200/970200

107 params. Per-channel search discovers r^2 mod q from samples.

Z/970,200 mirror

970200/970200

Per-channel W learns (q-r) mod q. Full ring verified.

Scale separation

~0 vs 970200

Wrong basis: 1/100. CRT basis: 970200/970200. Ring forces decomposition.

Perturbation

exact N/q

B_8+1: 121275 = N/8. B_49+1: 19800 = N/49. Structural damage.

Z/12,612,600 Joint Bloom

Z/12,612,600 adds mod-13 to the 5 channels of Z/970,200. 6 prime-power channels total: {8, 9, 25, 49, 11, 13}. The 490 split is now complete: inner = {mod 8, mod 25, mod 49} + outer = {mod 9, mod 11, mod 13}. Per-channel search scales cleanly -- 121 params (115 lookup + 6 basis) for a ring of 12.6 million elements.

Forward ratio: 104,236x. Perceptron upgrade: 3467 params, backward ratio ~46 billion x. Verification is sample-based (1040 elements; 1040 is divisible by both 8 and 13). GATE perturbation: a wrong B_13 leaves only elements with x mod 13 = 0 intact (80/1040 = 1/13). The boundary channel is structurally independent.

Z/12,612,600 x^2

1040/1040

121 params. 6 per-channel lookups discover r^2 mod q from 20 samples.

Z/12,612,600 mirror

1040/1040

Per-channel W learns (q-r) mod q. All 115 entries correct.

mod-13 perturbation

80/1040

B_13+1: only x%13=0 survive. Boundary channel independent.

490 split

inner + outer

inner={2,5,7}(3) + outer={3,11,13}(3) = 6 channels.

Frame Evolution Probes

Is ring algebra the right frame? Six probes on Z/970,200 (5 prime-power channels) test the foundation before it strains. All probes PASS: the ring frame holds for decomposable targets. The first boundary is quantified.

Non-decomposable wall: f(x)=(x%8)*(x%9) achieves only 175175/970200 = 18% accuracy via per-channel prediction. The number 175175 = N*13/72 is structurally determined (72=8*9 mixed channels). Decomposable targets (x^2, x+42, x*42) achieve 100%. The wall proves channel independence has limits -- bloom mixing is the path across.

Commutativity

1000/1000

Forward, reverse, scrambled channel order identical. Integer addition commutes.

Non-decomposable wall

18%

f(x)=(x%8)*(x%9): 175175/970200 = N*13/72. Structural limit.

Basis uniqueness

neg=mirror

Negated CRT basis computes mirror(x). Doubled basis computes 2x. Only CRT idempotents give identity.

Channel removal

N/q loss

Drop mod-8: 121275 survive. Drop mod-11: 88200. Survivors = N/q, inversely proportional to modulus.

Bloom Crossing

The non-decomposable wall (18%) is real. Can bloom mixing cross it? CRT reconstruction enables computing ANY cross-channel function from per-channel inputs. Pipeline: per-channel identity (102 params) -> CRT reconstruct (5 basis) -> compute cross-channel function -> compare. Result: 100% on ALL cross-channel targets. Zero additional trainable parameters.

Four targets tested at Z/970,200: (x%8)*(x%9), (x%25)*(x%49), (x%8)+(x%9), (x%8)*(x%9)*(x%25). All 4 cross from wall to 970200/970200 with bloom. The additive target achieves ZERO per-channel matches -- even harder than the multiplicative wall. Partial CRT: for 2-channel functions, only 19 params needed (vs 107 full).

Bloom (x%8)*(x%9)

970200/970200

Per-channel: 175175 (18%). Bloom: 100%. The wall is crossed.

Additive wall

0/970200

(x%8)+(x%9) gets ZERO per-channel matches. Stronger wall than multiplicative.

Partial CRT

72/72

Only involved channels needed. 19 params for 2-channel crossing. 5x smaller.

Basis necessity

139/1000

Perturbed basis (B[0]+1): crossing collapses. Correct basis is necessary.

Learned Bloom Crossing

Sample efficiency: 72 samples learn a 970200-element function (13475x compression). Accuracy is EXACTLY proportional to coverage -- each filled entry accounts for exactly N/table_size correct predictions. CRT guarantee, not statistical approximation. Works for arbitrary unknown functions: the learner sees (x, f(x)) pairs and fills the table.
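The exact-coverage claim can be checked over the full ring. The sketch below is an assumption-laden stand-in for the learner: it keys the table by the pair (x%8, x%9) and uses consecutive inputs as samples, since any 72 consecutive integers hit every pair once.

```python
N = 970200
D, K = 8, 9

def f(x):
    """The target; the learner only ever sees (x, f(x)) samples."""
    return (x % D) * (x % K)

def learn(samples):
    table = {}
    for x in samples:
        table[(x % D, x % K)] = f(x)
    return table

def correct_count(table):
    # Unfilled entries return None, which never equals an integer target.
    return sum(table.get((x % D, x % K)) == f(x) for x in range(N))

full = learn(range(72))        # all 72 entries filled
partial = learn(range(50))     # 50 of 72 entries filled

full_correct = correct_count(full)
partial_correct = correct_count(partial)
print(full_correct, partial_correct)
```

Each (x%8, x%9) pair occurs exactly N/72 = 13475 times in the ring, so a table with 50 filled entries scores exactly 50 * 13475.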

(x%8)*(x%9) learned

970200/970200

72 samples, 72-entry table. 13475x compression. Formula-free.

(x%25)*(x%49) learned

970200/970200

1225 samples, 1225-entry table. Larger channels = larger table.

8*9 < per-ch

72 < 102

Cross-channel table (72) SMALLER than per-channel sum (102) for small channels.

Exact coverage

50/72 -> 69%

Partial accuracy = coverage * N/table_size. CRT structure, not statistics.

Composed Forward Pass

Key result: composition is non-commutative. compose(sq, mir)(x) = N - x^2 but compose(mir, sq)(x) = x^2. Squaring absorbs negation: (-x)^2 = x^2 in any ring. Layer order matters -- just like in standard neural networks. Mirror is an involution (mir o mir = identity). Three layers compose correctly: sq o mir o sq = x^4. The cross-channel pipeline chains a per-channel layer with a learned crossing table in one forward pass.
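Layer composition can be sketched with per-channel lookup tables applied in sequence, followed by one CRT reconstruction. The tables are written out by hand (assumed perfectly trained), and `run` is an illustrative name.

```python
MODULI = [8, 9, 25, 49, 11]
N = 970200
BASIS = [363825, 431200, 853776, 732600, 529200]

SQ = [[r * r % q for r in range(q)] for q in MODULI]      # per-channel squaring
MIR = [[(q - r) % q for r in range(q)] for q in MODULI]   # per-channel negation

def run(layers, x):
    residues = [x % q for q in MODULI]
    for layer in layers:                                  # applied left to right
        residues = [layer[ch][r] for ch, r in enumerate(residues)]
    return sum(r * b for r, b in zip(residues, BASIS)) % N

sq_then_mir = run([SQ, MIR], 1)
mir_then_sq = run([MIR, SQ], 1)
print(sq_then_mir, mir_then_sq)
```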

2-layer compose

970200/970200

sq then mir = N-x^2. Full ring verified. Ring endomorphisms compose.

Non-commutative

sq o mir != mir o sq

compose(sq,mir)(1)=970199. compose(mir,sq)(1)=1. Order matters.

Cross pipeline

970200/970200

Per-ch sq -> CRT reconstruct -> learned D*K table. 179 params.

Shared basis

5 params all layers

k layers = k*102 + 5 params. 2-layer: 209 params = 4642x vs N.

Backward Pass

The forward path is complete. The backward pass makes it trainable. CRT backward pass decomposes into independent per-channel backward passes. Each weight W[r] in channel ch has gradient 2*(W[r] - target(r)) -- no coupling to other channels. The Jacobian is block-diagonal by construction: per-channel loss functions share no parameters.

Key contrast: per-channel loss decouples channels completely (block-diagonal), while full-ring reconstruction loss COUPLES them through the shared CRT reconstruction error. Per-channel training avoids this coupling while converging to the same optimum for CRT-decomposable targets. This is the Backward Decomposition Theorem.
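Per-channel training with the gradient 2*(W[r] - target(r)) can be sketched as below. The squared-error loss, the random initialization range, and the target x^2 are assumptions made for the illustration; with lr = 0.5 a single step lands on the target, which is consistent with the one-epoch convergence reported later on this page.

```python
import random

MODULI = [8, 9, 25, 49, 11]

def target(r, q):
    return r * r % q

def train(lr=0.5, epochs=1, seed=0):
    rng = random.Random(seed)
    W = [[rng.uniform(0, q) for _ in range(q)] for q in MODULI]
    for _ in range(epochs):
        for ch, q in enumerate(MODULI):        # each channel updates in isolation
            for r in range(q):
                grad = 2 * (W[ch][r] - target(r, q))   # d/dW of (W - target)^2
                W[ch][r] -= lr * grad
    return W

W = train()
learned = [[round(w) for w in row] for row in W]
print(learned[0])
```

No update for one channel reads or writes another channel's weights, which is the block-diagonal structure in code form.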

Block-diagonal

Gradient mod-9 = 0

Perturb mod-8 weights. Per-channel gradient in mod-9 is exactly zero.

Full-ring coupling

Gradient mod-9 != 0

Same perturbation. Full-ring reconstruction gradient couples channels.

SGD convergence

1000/1000 in 5 epochs

Random init -> per-channel gradient descent -> perfect accuracy on x^2.

Crossing table SGD

1000/1000 in 10 epochs

72-entry D*K product table learned from random init via gradient descent.

Multi-Layer Training

Hybrid pipeline: gradient descent for per-channel layers (block-diagonal, 102 params each) + discrete optimization for crossing tables (72 entries, sample-based or hillclimb). Full stack from scratch: 2 trained layers + learned crossing table. 281 params for 970200-element ring = 3452x compression. Channel-parallel: all 5 channels converge in 1 epoch each (lr=0.5). The full CRT training pipeline is now operational.

2-layer compose

1000/1000

sq->mir = N-x^2. Both layers from random init, trained independently.

3-layer compose

1000/1000

sq->mir->sq = x^4. Three independent layers compose correctly.

Hybrid full ring

970200/970200

Gradient per-ch + sample crossing. Full ring verified.

Channel parallel

1 epoch each

5 channels converge independently. Block-diagonal = embarrassingly parallel.

Coupling-Attention

Standard transformers learn attention via softmax(QK^T / sqrt(d)). CRT coupling provides Q and K algebraically -- no learned projection matrices. Per-channel eigenvalue eig(n, q) = 2*cos(2*pi*(n mod q)/q). Coupling(a,b) = dot product of eigenvalue vectors = sum over channels of eig(a)*eig(b). This is a symmetric, positive-definite kernel on the ring. 102 precomputed f64 entries encode all pairwise coupling for 970200 elements (9511x compression).

Full-coupling attention (all 7 channels simultaneously) outperforms multi-head decomposition -- multi-head averaging dilutes per-channel signal. 9 = the compiler's analysis pass count, not attention heads. Cross-coupling averages to zero by character orthogonality -- self-coupling is strongly positive. Signal and noise are separated by algebra, not by learning.
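The coupling kernel is small enough to write out in full. This sketch uses the 5-channel ring Z/970,200 (an assumption; the same code works with 7 moduli) and precomputes the 102 eigenvalue entries once.

```python
from math import cos, pi

MODULI = [8, 9, 25, 49, 11]

# One eigenvalue per residue per channel: 8 + 9 + 25 + 49 + 11 = 102 entries.
EIG = [[2 * cos(2 * pi * r / q) for r in range(q)] for q in MODULI]

def coupling(a, b):
    return sum(EIG[ch][a % q] * EIG[ch][b % q] for ch, q in enumerate(MODULI))

# Mean self-coupling over the ring. Residues are uniform in every channel,
# so the ring-wide mean is the sum of per-channel means.
mean_self = sum(sum(e * e for e in row) / len(row) for row in EIG)
print(round(mean_self, 6), coupling(3, 5) == coupling(5, 3))
```

Each channel contributes a mean of 2 to the self-coupling, which gives the average of 10 quoted for five channels.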

Positive-definite

1000/1000

coupling(x,x) = ||eig(x)||^2 > 0 for all nonzero elements. Proper kernel.

Orthogonality

E[cross]=0

Random coupling averages to zero. Self-coupling avg=10. Signal/noise algebraic.

Full coupling

Single full-coupling attention outperforms multi-head. 9 = compiler passes, not heads.

Block-diagonal

Per-channel independent

mod-8 coupling depends only on mod-8 residue. Independent of other channels.

7-Channel Extension

490 split as trainable architecture: inner channels (mod 8, mod 25, mod 49) learn x^2, outer channels (mod 9, mod 11, mod 13, mod 17) learn mirror -- simultaneously, independently. Encoder and decoder train with different targets. Block-diagonal Jacobian extends across the split: perturb mod-8 (inner), mod-17 (outer) gradient is exactly zero. Full-coupling attention over all 7 channels.

SGD at 7ch

1000/1000

Per-channel x^2, mirror, and 490-split mixed all converge. f64 CRT reconstruction.

490 split training

1000/1000

inner=x^2, outer=mirror. Independent training, block-diagonal across the split.

Projection bifurcation

2^420 = mirror in mod-17

2^420 CRT=(0,1,1,1,1,1,16). 2^1680=(0,1,1,1,1,1,1). mod-17 last to converge.

1.6Mx compression

132 params

214,414,200 elements from 132 eigenvalue entries. 170x over Z/970,200.

Lambda-Periodic Training

The Carmichael period 420 becomes a training rhythm. Sample-based SGD (one random sample per step) runs for 420 steps per epoch. Then the projector 1,576,576 (= 2^420 in Z/12,612,600) fires: the mod-8 channel collapses, while all other channels persist as long-term memory. mod-8 recovers in O(8) steps -- the smallest channel is the most expendable. 4 epochs of 420 = 1680 = Carmichael period of Z/214,414,200.

Channel convergence follows modulus size: mod-8 and mod-9 converge first (step 50), then mod-11, then mod-25, then mod-49. At step 420, all 102 per-channel weights are correct. At step 50, only 31/49 mod-49 residues are covered. 420 is the minimum natural period for full coverage -- forced by mod-49 needing 420/49 = 8.6 samples per residue. At Z/214,414,200, 2^420 zeroes mod-8 AND mirrors mod-17 (CRT=16=-1 mod 17). The full projector (0,1,1,1,1,1,1) needs period 1680. mod-17 is always last to resolve.
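The periods and projector residues quoted above can be recomputed with modular exponentiation. The per-modulus Carmichael values in `LAMBDA` are standard results entered by hand; `math.lcm` with several arguments needs Python 3.9 or later.

```python
from math import lcm

FIVE = [8, 9, 25, 49, 11]
SEVEN = FIVE + [13, 17]
# Carmichael lambda per prime power: lambda(8) = 2, odd p^k gives phi(p^k).
LAMBDA = {8: 2, 9: 6, 25: 20, 49: 42, 11: 10, 13: 12, 17: 16}

period5 = lcm(*(LAMBDA[q] for q in FIVE))
period7 = lcm(*(LAMBDA[q] for q in SEVEN))

N7 = 214414200
crt_420 = [pow(2, 420, N7) % q for q in SEVEN]
crt_1680 = [pow(2, 1680, N7) % q for q in SEVEN]
projector6 = pow(2, 420, 12612600)

print(period5, period7)
print(crt_420, crt_1680, projector6)
```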

Projection cycle

102->96->102

Before: 102/102. After 1,576,576: 96/102 (mod-8 killed). Recovery: 102/102.

mod-8 expendable

8 steps to recover

1,576,576 collapses mod-8 to constant. 50 sample steps fully restore.

Convergence order

8<9<11<25<49

Small channels first. mod-49 needs 420 steps. mod-8 converges by step 50.

4 * 420 = 1680

7-channel period

Carmichael periods compose. mod-17 mirrors at step 420, resolves at step 1680.

CRT at Scale

CRT decomposition is a compression technique. It trades exact element identity for smaller per-channel tables. On small rings (Z/210, 27 characters), flat tables outperform CRT -- they preserve more signal with fewer entries. On large rings, CRT is the ONLY feasible decomposition. Z/214,414,200 has 214 million elements. Flat bigram tables would need over 45 thousand trillion entries. CRT per-channel trigrams need 142,956.

Ring arithmetic sequences in Z/214,414,200 -- noisy affine maps mixing multiplication and addition -- are predicted at 58% exact accuracy using CRT per-channel trigrams. Each of the seven channels achieves 607-704 ppt (parts per thousand) accuracy, 6x to 34x above random. The mod-49 channel shows the largest improvement: 34x. The improvement over random guessing: 124 million times. Per-channel independence means the same 143K table entries handle a 214-million-element alphabet that no flat table could touch.

CRT exact

1044/1799

58% exact element prediction on 214M-element ring. 143K entries.

Flat impossible

4.6 x 10^16

214414200^2 entries. Cannot be built. CRT is the only path.

mod-49 lift

34x random

683 ppt vs 20 ppt random (mod 49). Largest channel = largest CRT advantage.

Ring tower

5ch >= 6ch >= 7ch

1050 >= 1044 >= 1044. Operations correlated across channels.

Evolutionary Basin Dynamics

The multiplicative basin is a genuine structural attractor. A K^2=9 population of 2D-LUT state machines (3750 entries each) initialized from 90% correct multiplication tables holds at 75% accuracy across 200 generations of random mutation. The basin resists drift indefinitely. But random entry-level mutation cannot improve: each change has >97% probability of being wrong (P=(q-1)/q for Z/49). The 2D-LUT is a state machine where a single corrupted entry redirects the entire downstream trajectory -- cascading errors make the fitness landscape deceptive.

Crossover between members with independently corrupted entries provides +5.1 percentage points (72.6% to 77.7%) by recombining correct entries from different parents. The improvement plateaus when population diversity exhausts. Full recovery to 100% requires systematic search (coordinate descent visits every entry, tries all alternatives) or biology-scale populations. Evolution from additive initialization climbs to only 16% -- structurally unable to bridge the modular discontinuity.

Drift resistance

75.1% held, 200 gen

90% seeded population holds steady. Attractor resists random perturbation.

Crossover benefit

+5.1 pp

Uniform crossover between diverse members. 72.6% to 77.7%. Combines correct entries.

Additive stuck

8% to 16%

Evolution from additive cannot bridge to multiplicative basin.

Biology analog

Structured mutations

Random table rewrites fail. Biology uses structured DNA mutations + large populations.

Cross-Channel Attention

What determines how well one CRT channel predicts another? Two hypotheses: quadratic residue compatibility (algebraic: Legendre symbols between primes) or shared-factor coupling (operational: gcd of operation constants with channel moduli). The QR hypothesis is FALSIFIED; the GCD mechanism is constructive and predictive.

QR-Attention Separation (PROVED)

Among odd-prime pairs: QR mean = NR mean (difference = -4 ppt, noise). The directed 3-cycle K->L->b->K (all QR) shows zero asymmetry over its reverse (all NR). Cross-channel information flow is operational (gcd-mediated), not algebraic (QR-mediated). 21/21 verified.

GCD-Attention Tower-Step (PROVED)

Operation constants 42 = 2*3*7 and 105 = 3*5*7 reduce each data channel's effective target alphabet by one tower step: 8->4, 9->3, 25->5, 49->7. Extension channels (mod 11, mod 13, mod 17) have gcd=1: unreduced. GCD-product concordance with accuracy: 74%. 43/43 verified.
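One way to read the tower-step reduction is as the image size of multiplication by the operation constant in each channel, which equals q / gcd(c, q). The sketch below rests on that interpretation, which is an assumption and not confirmed by the text; `image_size` is a hypothetical helper.

```python
from math import gcd

DATA = [8, 9, 25, 49]
EXTENSION = [11, 13, 17]
CONSTANTS = [42, 105]

def image_size(c, q):
    """Number of distinct values of c*r mod q, which equals q // gcd(c, q)."""
    return len({c * r % q for r in range(q)})

# Smallest alphabet either constant leaves behind in each data channel.
reduced = {q: min(image_size(c, q) for c in CONSTANTS) for q in DATA}
immune = all(gcd(c, q) == 1 for c in CONSTANTS for q in EXTENSION)
print(reduced, immune)
```

Under this reading, 42 accounts for the reductions in mod 8, mod 9, and mod 49, while 105 accounts for mod 25 (and also reduces mod 9 and mod 49).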

Direction-Resolution (PROVED)

V1(s,t) = q_src * weighted_target_gcd correctly predicts the higher-accuracy direction for 85% of channel pairs. The direction ratio for (mod-49, mod-9): V1(49->9)*9 = V1(9->49)*49 (algebraic identity). mod-25 source pairs account for 2 of 3 wrong predictions (self-blindness via 105 coupling). 30/30 verified.

QR falsified

QR mean = NR mean

Quadratic residue compatibility has zero predictive power for cross-channel accuracy. The information flow is operational, not algebraic.

Tower-step reduction

8->4, 9->3, 25->5, 49->7

Each data channel loses exactly one tower step of effective alphabet through gcd coupling. Extension channels are immune.

85% direction

V1 concordance 18/21

Source modulus and target gcd jointly predict which direction is stronger. 3 failures involve mod-25 source (self-blind channel).

Channel Independence

How much information does each CRT channel carry? The eigenvalue of n -- a sum of 7 cosines, one per channel -- collapses all channels into a single number. If the sum is useful, channels are redundant. If not, each channel carries independent information that summation destroys.

Channel Independence Theorem (VERIFIED)

Coupling class prediction of 2000 elements from Z/214,414,200: mod-8 channel alone (8 bins) achieves 67.9% accuracy. mod-9 alone (9 bins) achieves 49.9% (= baseline). mod-8 + mod-9 combined (72 bins) achieves 84.5%. Eigenvalue sum (200 bins) achieves 51.4% -- barely above random (50.2%). Information grows multiplicatively with channel count. Summing destroys it.

The mod-9 channel achieves exactly the baseline because it is orthogonal to parity: whether n is even or odd is invisible in n mod 9. But combined with mod-8, mod-9 further separates the odd elements by divisibility by 3. This is CRT in action: independent channels carry independent information.

mod-8 alone

67.9%

Parity + mod-8 structure. 8 bins, each class-concentrated.

mod-9 alone

49.9% = baseline

Orthogonal to mod-8. Cannot distinguish even from odd.

mod-8 + mod-9

84.5%

Multiplicative gain: 2 independent channels beat 1 by 16.6 percentage points (84.5% vs 67.9%).

Eigenvalue sum

51.4% ~ random

Summing 7 channels destroys independence. 200 bins, no structure.

ECC Requires Coupling

Can parity channels (mod 11, mod 13, mod 17) detect prediction errors for free? If 7 independent per-channel predictors disagree on parity, is the prediction wrong? The test trains independent bigram predictors on a sequence in Z/88,200.

ECC Coupling Theorem (PROVED)

Parity channel transitions in Z/88,200 depend on ALL data channels jointly. An independent mod-11 predictor achieves only random accuracy (10.6% vs 100% for CRT-factor channels mod 8 and mod 9). The mod-N reduction (N=88200) scrambles mod-11 residues. Error detection requires cross-channel information flow -- it is a training objective, not a free property.

Root cause: the mod-11 residue depends on the FULL element (all 4 data channels), not just the previous mod-11 value. A predictor that sees only current mod-11 cannot predict the next. This proves that CRT-based intelligence MUST include coupling between channels -- pure block-diagonal processing cannot leverage error detection. The rate 4/7 IS the coupling cost: 3/7 of capacity enforces consistency.
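The root cause can be reproduced in a few lines. This is a sketch under stated assumptions: the text does not say how the sequence was generated, so a multiplicative walk x -> 13x mod 88200 starting at 1 is assumed here, and `MULTIPLIER`, `STEPS`, `successor_sets` and `is_deterministic` are hypothetical names. A channel is predictable from its own residue only if every residue has a single successor.

```python
from collections import defaultdict

N = 88200          # 8 * 9 * 25 * 49
MULTIPLIER = 13    # hypothetical walk multiplier, coprime to N
STEPS = 2000

def successor_sets(modulus):
    """For each residue, collect every residue that follows it in the walk."""
    followers = defaultdict(set)
    x = 1
    for _ in range(STEPS):
        nxt = (x * MULTIPLIER) % N
        followers[x % modulus].add(nxt % modulus)
        x = nxt
    return followers

def is_deterministic(modulus):
    """True when every observed residue has exactly one successor residue."""
    return all(len(s) == 1 for s in successor_sets(modulus).values())

for q in (8, 9, 11):
    print(q, is_deterministic(q))
```

Because 8 and 9 divide N, reducing mod N never disturbs those residues. Since 11 does not divide N, each wrap-around shifts the mod-11 residue by an amount that depends on the full element.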

mod-8, mod-9

100% predictable

CRT factors of N=88200. Affine transition per-channel. Fully independent.

mod-11, mod-13, mod-17

~10% (random)

NOT CRT factors. Transition depends on full element. Cannot predict independently.

Implication

Detection = coupling

Block-diagonal architecture needs an explicit sync point for parity. 3/7 capacity.

CRT Coupling as Attention

CRT coupling(a,b) = sum of per-channel eigenvalue products = a dot-product in eigenvalue space. Zero learned parameters. Does this replace standard attention? The test uses Z/214,414,200 (7 channels) with 28 possible heads: 7 single-channel + 21 pair-channel.

Coupling Attention Theorem (CONFIRMED)

CRT coupling gives 99.5% retrieval weight on an exact-match target among 30 elements (vs 3.3% uniform = 30x advantage). The eigenvalue inner product IS a positive-definite attention kernel with zero parameters, 1.6Mx compression at Z/214,414,200 scale (132 eigenvalue entries encode all 214M pairwise coupling values), and inherent block-diagonal structure.

The 9-heads hypothesis (7 single-channel + 2 pair-channel = optimal head count) shows no statistically significant signal across 5 random seeds. All head counts from 7 to 21 give similar contrast (~50%). CRT attention works best as FULL COUPLING (all 7 channels simultaneously), not decomposed into independent heads.

CRT retrieval

99.5% weight

Exact match gets 99.5% attention. 30x over uniform. Zero parameters.

9 heads

No signal

Multi-seed contrast averages: NH=9 (499/1000) vs NH=8 (495/1000) vs NH=12 (490/1000). Within noise.

Multi-head avg

Dilutes attention

Averaging softmax across heads reduces target weight from 99.5% to 8-15%. Full coupling is the natural form.

Projector Prevents Forgetting

Catastrophic forgetting: train on task A, then train on task B, and task A knowledge is destroyed. Standard models have no built-in remedy -- all weights shift. CRT-decomposed models have a projector: 26,801,776 (= 2^1680 mod 214,414,200) with CRT = (0,1,1,1,1,1,1). It zeroes the mod-8 channel (coarsest, 8 states) and preserves the 6 finer channels exactly.

Projection Forgetting Theorem (VERIFIED)

Two tasks with different per-channel transition patterns (strides differ in all 7 channels of Z/214,414,200). Per-channel bigram tables. Full overwrite: 0% retention (catastrophic forgetting). Projector (mod-8 from task B, 6 channels from task A): 85.7% retention = 6/7 exactly. Algebraic guarantee from CRT channel independence.

The sacrifice is optimal: mod-8 is the coarsest channel (only 8 possible residues). It carries the least fine-grained information and is the fastest to relearn. The 6 preserved channels (mod-9 through mod-17) carry the finer structure that is expensive to learn and critical to retain.
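The projector arithmetic and the 6/7 retention figure can be checked with a short sketch. The per-channel tables below are hypothetical stand-ins (additive strides 1 and 2, which differ in every channel), not the tables used in the experiment; `merge_tables` and `retention` are illustrative names.

```python
N = 214_414_200
MODULI = (8, 9, 25, 49, 11, 13, 17)

projector = pow(2, 1680, N)

def residues(n):
    """CRT signature of n: one residue per channel."""
    return tuple(n % q for q in MODULI)

def merge_tables(task_a, task_b):
    """Keep task A's per-channel tables, except mod-8, which is taken from task B."""
    return {q: (task_b[q] if q == 8 else task_a[q]) for q in MODULI}

def retention(tables, task_a):
    """Fraction of channels whose table still matches task A."""
    kept = sum(tables[q] == task_a[q] for q in MODULI)
    return kept / len(MODULI)

# Hypothetical per-channel bigram tables: strides that differ in every channel.
stride_a, stride_b = 1, 2
task_a = {q: [(r + stride_a) % q for r in range(q)] for q in MODULI}
task_b = {q: [(r + stride_b) % q for r in range(q)] for q in MODULI}

print(projector, residues(projector))
print(retention(task_b, task_a), retention(merge_tables(task_a, task_b), task_a))
```

Multiplying any element by the projector zeroes its mod-8 residue and leaves the other six residues unchanged, which is the algebraic form of "sacrifice one channel, keep six".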

Baseline

100%

All 7 channels learned perfectly. 13993/13993 correct predictions.

Full overwrite

0%

Different task = different transitions. All channels wrong. Catastrophic.

Projector

85.7% = 6/7

6 channels preserved exactly. mod-8 sacrificed and relearned for new task.

Convergence Staircase

How fast does each CRT channel learn? Walk through Z/12,612,600 multiplicatively (a^k mod 12612600) using a maximal-order generator (ord = 420 = Carmichael lambda). Per-channel bigram tables learn deterministic transitions. Each channel converges when its orbit is fully observed -- at step lambda(q_i), the per-channel Carmichael value.

Convergence Staircase Theorem (VERIFIED)

Multiplicative walk on Z/12,612,600 (6 channels). Convergence order: mod-8 (step 2), mod-9 (step 6), mod-11 (step 10), mod-13 (step 12), mod-25 (step 20), mod-49 (step 42). Chain order: 2, 3, 5, 7, 11, 13. DIFFERENT. Extension channels {mod-11, mod-13} converge BEFORE prime-power channels {mod-25, mod-49}. Additive walk confirms same relative order: 8, 9, 11, 13, 25, 49. 14/14 checks pass.

Raising exponents (5->5^2=25, 7->7^2=49) increases the state space and SLOWS learning. Extension channels (mod-11, mod-13) have fewer states and converge faster despite being later in the chain. mod-8 converges first (2 multiplicative steps) -- confirming it is optimal for sacrifice: coarsest, cheapest to relearn.
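The staircase follows from the per-channel Carmichael values alone. A minimal sketch, assuming Python 3.9+ for `math.lcm`; the brute-force `carmichael` helper is written for illustration and only suits small moduli like these.

```python
from math import gcd, lcm

CHAIN_ORDER = (8, 9, 25, 49, 11, 13)   # channels of Z/12,612,600 in chain order

def element_order(a, q):
    """Multiplicative order of a unit a modulo q."""
    k, x = 1, a % q
    while x != 1:
        x = (x * a) % q
        k += 1
    return k

def carmichael(q):
    """Carmichael lambda by brute force: lcm of the orders of all units mod q."""
    return lcm(*(element_order(a, q) for a in range(1, q) if gcd(a, q) == 1))

steps = {q: carmichael(q) for q in CHAIN_ORDER}
staircase = sorted(CHAIN_ORDER, key=steps.get)

print(staircase)
print([steps[q] for q in staircase])
print(lcm(*steps.values()))
```

Sorting by lambda(q_i) rather than by chain position reproduces the order in which the channels converge, and the lcm of the six values gives the ring period.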

Period / bottleneck

420 / 42 = 10

Carmichael lambda / phi(49) = 2*5 = 10. Period is 10x the learning bottleneck.

mod-8 fastest

2 steps (mult)

Z/8 has Carmichael lambda = 2. Fewest multiplicative states. Cheapest to sacrifice.

mod-49 slowest

42 steps (mult)

Z/49 has Carmichael lambda = 42. The deepest channel IS the convergence bottleneck.

Extension leapfrogs

mod-11 before mod-25

Chain position does not determine learning speed. Modulus size does.

Scheduling Is Irrelevant

If channels are independent, does the ORDER of training matter? Comparing three schedules on the same additive walk through Z/12,612,600 (6 channels): simultaneous (all channels every step), chain-order sequential (mod 8, 9, 25, 49, 11, 13 one at a time), and modulus-order sequential (mod 8, 9, 11, 13, 25, 49 -- the convergence order).

Scheduling Irrelevance Theorem (VERIFIED)

Simultaneous: converges at step 49 (= max modulus). Chain-order sequential: step 115 (= sum of moduli). Modulus-order sequential: step 115 (same sum). Ratio: 2.35x. Per-channel convergence time = q_i regardless of schedule. Partial convergence at step 41: simultaneous 5/6, modulus 4/6, chain 2/6. 15/15 checks pass.

Each channel's learning speed is an intrinsic property of its modulus -- no curriculum ordering can accelerate it. Simultaneous training is always optimal because channels extract information in parallel without interference. Chain-order annealing provides zero advantage.
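The schedule comparison reduces to max versus running sum. The sketch below assumes, as the theorem states for the additive walk, that channel q_i needs exactly q_i training steps; the function names are hypothetical.

```python
from itertools import accumulate

ORDERS = {
    "chain": (8, 9, 25, 49, 11, 13),
    "modulus": (8, 9, 11, 13, 25, 49),
}

def simultaneous_converged(moduli, step):
    """Channels converged by `step` when all channels train every step."""
    return sum(q <= step for q in moduli)

def sequential_converged(order, step):
    """Channels converged by `step` when channels train one after another."""
    return sum(finish <= step for finish in accumulate(order))

chain = ORDERS["chain"]
print(max(chain), sum(chain), round(sum(chain) / max(chain), 2))
print(simultaneous_converged(chain, 41),
      sequential_converged(ORDERS["modulus"], 41),
      sequential_converged(chain, 41))
```

The first line gives the full-convergence steps and their ratio; the second gives the partial-convergence counts at step 41 for simultaneous, modulus-order and chain-order training.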

Simultaneous

49 steps

max(q_i). All channels trained in parallel. Optimal.

Sequential (any order)

115 steps

sum(q_i). 2.35x slower. Order only affects WHICH channels converge early.

Modulus order partial

4/6 at step 41

Extension channels {mod-11, mod-13} resolved early. More error detection sooner.

Chain order partial

2/6 at step 41

Data channels {mod-25, mod-49} prioritized. Less early coverage.

490 Split: Encoder/Decoder

The 490 split (490^420 mod 12,612,600 = outer idempotent) divides 6 channels into inner = {mod 8, mod 25, mod 49} (encoder, 9800 states) and outer = {mod 9, mod 11, mod 13} (decoder, 1287 states). Does this split create a meaningful timing asymmetry?

490 Convergence Asymmetry Theorem (PROVED)

Inner bottleneck = max(lambda(8), lambda(25), lambda(49)) = max(2, 20, 42) = 42. Outer bottleneck = max(lambda(9), lambda(11), lambda(13)) = max(6, 10, 12) = 12. Ratio = 42/12 = 7/2 = 3.5. Decoder converges 3.5x before encoder. 26/26 checks pass.

lcm(inner Carmichael lambdas) = lcm(2, 20, 42) = 420 = Carmichael period of the ring. lcm(outer) = lcm(6, 10, 12) = 60. Ratio = 420/60 = 7. In Z/214,414,200: mod-17 joins the outer set, flipping the balance (outer = 21879 > inner = 9800).
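The outer idempotent and the state counts quoted here can be recomputed with modular exponentiation. A minimal sketch, assuming Python 3.8+ for `math.prod`; `INNER` and `OUTER` are illustrative names for the two channel sets.

```python
from math import prod

N6 = 12_612_600
MODULI = (8, 9, 25, 49, 11, 13)
INNER, OUTER = (8, 25, 49), (9, 11, 13)

e_outer = pow(490, 420, N6)
signature = tuple(e_outer % q for q in MODULI)

print(e_outer, signature)
print(e_outer * e_outer % N6 == e_outer)            # idempotent check
print(prod(INNER), prod(OUTER), prod(OUTER) * 17)   # state counts, then outer with mod-17 added
```

490 = 2 * 5 * 7^2 shares a factor with each inner modulus, so its 420th power vanishes there, while 490 is a unit in every outer channel and 420 is a multiple of each outer Carmichael value.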

e_outer = 8,722,000

CRT=(0,1,0,0,1,1)

490^420 mod 12,612,600. Inner channels zeroed, outer = identity. Idempotent.

Ratio 7/2 = 3.5

inner=42, outer=12

Decoder (outer) crystallizes 3.5x faster than encoder (inner).

Period = inner lcm

420 = lcm(2,20,42)

Carmichael period of the ring is exactly the LCM of the inner channel periods.

17 flips balance

outer > inner at 7ch

6-channel: inner=9800 > outer=1287. 7-channel: outer=21879 > inner=9800.

Curriculum Irrelevance

Does training channels in chain order (mod 2 -> mod 3 -> mod 5 -> mod 7) help? Testing three schedules on Z/210 (4 channels): simultaneous (all every step), chain-order sequential, and reverse-order sequential.

Curriculum Irrelevance Theorem (PROVED)

Simultaneous training converges at max(q_i) = 7 steps. Both sequential curricula converge at sum(q_i) = 17 steps (2.43x slower). Curriculum order immaterial: chain-order = reverse = 17 (sum is commutative). Joint accuracy at step k = product of per-channel accuracies (gap = 0). No phase transitions beyond per-channel convergence events. 18/18 checks pass.

The curriculum penalty is sum(q_i)/max(q_i), between 2.3x and 2.7x on all three rings: Z/210 17/7=2.43x, Z/12,612,600 115/49=2.35x, Z/214,414,200 132/49=2.69x. Simultaneous training outperforms ALL curricula. Chain order = real algebra (group-theoretic decomposition), NOT trainable capability phases. Channel independence is so strong that each channel converges at its intrinsic q_i-step rate regardless of what other channels are doing.

Sim=7, Chain=17

2.43x penalty

Sequential curriculum is always slower: sum(q_i)/max(q_i). Parallelism wins.

Order irrelevant

Chain = Reverse = 17

Addition is commutative: 2+3+5+7 = 17 in any order. No ordering advantage. Chain order is not a training curriculum.

Gap = 0 everywhere

Joint = product(per-channel)

No cross-channel synergy. No phase transitions. No emergent capabilities beyond independent convergence.

5 probes confirmed

All consistent

Independence governs memory, convergence, scheduling, encoder/decoder, AND curriculum.

CRT Search Reduction

CRT independence isn't just about learning -- it transforms search. Given training data (x,y) from an unknown polynomial f over Z/N, how many candidates must you try? Monolithically: N^{d+1} for degree d. Per-channel with CRT: sum(q_i^{d+1}). The search space converts from MULTIPLICATIVE to ADDITIVE.

CRT Search Reduction Theorem (PROVED)

Polynomial regression over Z/N with CRT decomposition N = prod(q_i): monolithic search = N^{d+1} candidates, per-channel = sum(q_i^{d+1}). Verified by exhaustive search on Z/210: linear (d=0) 210/17 = 12x, affine (d=1) 44100/87 = 506x; quadratic (d=2) 18411x, computed algebraically. At Z/214,414,200 scale (d=0): 214414200/132 = 1624350x. 22/22 checks pass.

The affine ratio (506x) is EXACTLY the block-diagonal Jacobian ratio N^2/sum(q_i^2) from the backprop theorem. Search reduction IS backprop reduction -- the same algebraic structure powers both. The ratio grows by a factor of ~N/q_max per degree increase.
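The affine case on Z/210 is small enough to run end to end. The sketch below assumes training data generated from f(x) = 67x + 41 (the coefficients quoted in the text) at the hypothetical sample points x = 0..6; `crt` and `fit_channel` are illustrative helpers, and `pow(m, -1, q)` needs Python 3.8+.

```python
from itertools import product
from math import prod

MODULI = (2, 3, 5, 7)
N = prod(MODULI)  # 210

def crt(residues):
    """Combine per-channel residues into the unique element of Z/N."""
    total = 0
    for r, q in zip(residues, MODULI):
        m = N // q
        total += r * m * pow(m, -1, q)   # pow(m, -1, q) is the inverse of m mod q
    return total % N

def fit_channel(samples, q):
    """Exhaustively search the q*q affine candidates (a, b) in one channel."""
    for a, b in product(range(q), repeat=2):
        if all((a * x + b) % q == y % q for x, y in samples):
            return a, b
    raise ValueError("no affine fit in this channel")

# Hypothetical training data from f(x) = 67x + 41 over Z/210.
samples = [(x, (67 * x + 41) % N) for x in range(7)]

fits = [fit_channel(samples, q) for q in MODULI]
a = crt([f[0] for f in fits])
b = crt([f[1] for f in fits])
print(a, b, N**2 // sum(q * q for q in MODULI))
```

Each channel searches at most q_i^2 pairs (4 + 9 + 25 + 49 = 87 in total) instead of the 44100 pairs of the monolithic search, and CRT stitches the four local answers into the global coefficients.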

Linear (d=0)

12x on Z/210

210 monolithic candidates vs 17 per-channel. CRT reconstruction recovers a=137 exactly.

Affine (d=1)

506x on Z/210

44100 pairs vs 87. Per-channel (a,b) reconstructed via CRT: a=67, b=41.

Z/214,414,200 (d=0)

1,624,350x

214 million monolithic vs 132 per-channel. Gap grows with ring size and polynomial degree.

6 probes confirmed

All consistent

Independence governs memory, convergence, scheduling, architecture, curriculum, AND search.

CRT Predicate Discovery

CRT accelerates polynomial search (above). The same principle extends to PREDICATE search: finding ring elements satisfying algebraic conditions. Idempotents (e^2=e), involutions (x^2=1), zero divisors (z*w=0) -- all decompose per CRT channel. The search phase costs sum(q_i) per-channel checks. The enumeration phase costs product(per-channel solution counts).

CRT Predicate Search Theorem (PROVED)

Ring-algebraic predicate P(n) over Z/N decomposes as P = AND(P_i(n mod q_i)). Search: sum(q_i) per-channel checks. Enumerate: product(|S_i|). Monolithic: N checks. Verified on Z/210: idempotents (16 found, ratio 6x), involutions (8 found, ratio 8x). At Z/214,414,200 scale: idempotent ratio 824670x, involution ratio 552613x. Zero divisors need no search -- CRT(w) directly encodes #ZD = gcd(N,w). 28/28 checks pass.

mod-8 signature: Z/2 has 1 involution (-1=1 mod 2). Z/8 has 4 (Klein four-group). The depth-3 exponent quadruples involution count. Zero divisors are instant: CRT(w) tells you which channels are zero (unconstrained) and which require z=0. The decomposition IS the answer.
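The counts and ratios in the theorem can be reproduced by checking each predicate one channel at a time. This is a sketch with hypothetical helper names; the cost model (sum of per-channel checks plus the number of enumerated solutions) is taken from the figures quoted above.

```python
from math import prod

def channel_solutions(q, predicate):
    """Residues in Z/q that satisfy the predicate: q checks per channel."""
    return [r for r in range(q) if predicate(r, q)]

def is_idempotent(r, q):
    return (r * r - r) % q == 0

def is_involution(r, q):
    return (r * r - 1) % q == 0

def crt_cost(moduli, predicate):
    """(search checks, enumerated solutions) for the per-channel approach."""
    counts = [len(channel_solutions(q, predicate)) for q in moduli]
    return sum(moduli), prod(counts)

SMALL = (2, 3, 5, 7)
FULL = (8, 9, 25, 49, 11, 13, 17)

for predicate in (is_idempotent, is_involution):
    search, found = crt_cost(SMALL, predicate)
    brute = len(channel_solutions(prod(SMALL), predicate))   # monolithic scan of Z/210
    full_search, full_found = crt_cost(FULL, predicate)
    print(predicate.__name__, found, brute, prod(FULL) // (full_search + full_found))
```

On Z/210 the product of per-channel solution counts matches a monolithic scan of the whole ring; at 7 channels only the 132 per-channel checks are needed before enumeration.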

Idempotent (7ch)

824,670x

214414200 / 260. Per-channel: always 2 solutions (0 and 1).

Involution (7ch)

552,613x

214414200 / 388. Z/8 contributes 4 (Klein four), rest contribute 2.

Zero divisors

Instant

CRT(w) directly encodes #ZD. Zero channels = unconstrained. No search needed.

7 probes confirmed

All consistent

Independence governs memory, convergence, scheduling, architecture, curriculum, search, AND discovery.

Block-Diagonal GPU

If CRT channels are independent, a neural network layer decomposes into 7 small operations instead of 1 big one. For matrix multiply: monolithic uses (sum q_i)^3 FLOPs, block-diagonal uses sum(q_i^3). The ratio is 16x at 7 channels -- and it is scale-invariant (same ratio regardless of hidden dimension).

Block-Diagonal FLOP Theorem (PROVED)

Matmul FLOP ratio = (sum q_i)^3 / sum(q_i^3). Z/210 (4ch): 9x. Z/12,612,600 (6ch): 11x. Z/214,414,200 (7ch): 16x. Scale-invariant. Parameter savings: (sum q_i)^2 / sum(q_i^2) = 4.6x at 7 channels. Neural net layer (attention mono + FFN block-diag): 3.4x total. 12-layer CRT-GPT: ~25M params vs ~117M standard. 24/24 checks pass.

Attention must remain monolithic (coupling is cross-channel). But V-projection and FFN are block-diagonal -- 7 independent per-channel operations. The existing GPU benchmarks on this page already demonstrate per-channel dispatch. CRT Multiply processes 7 channels per element with zero cross-channel communication.
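The FLOP and parameter ratios are plain arithmetic on the channel sizes. A minimal sketch that recomputes them for the three rings named in the theorem; `flop_ratio` and `param_ratio` are hypothetical function names.

```python
RINGS = {
    "Z/210": (2, 3, 5, 7),
    "Z/12,612,600": (8, 9, 25, 49, 11, 13),
    "Z/214,414,200": (8, 9, 25, 49, 11, 13, 17),
}

def flop_ratio(moduli):
    """Dense (sum q_i)^3 matmul cost over block-diagonal sum(q_i^3)."""
    return sum(moduli) ** 3 / sum(q ** 3 for q in moduli)

def param_ratio(moduli):
    """Dense (sum q_i)^2 weight count over block-diagonal sum(q_i^2)."""
    return sum(moduli) ** 2 / sum(q ** 2 for q in moduli)

for name, moduli in RINGS.items():
    print(f"{name}: matmul {flop_ratio(moduli):.2f}x, params {param_ratio(moduli):.2f}x")
```

Scaling every block by the same hidden-dimension factor multiplies numerator and denominator equally, which is why the ratio depends only on the channel sizes.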

Matmul 16x

7 channels

(sum q_i)^3 / sum(q_i^3) = 2299968/142956. Grows with channel count.

Params 4.6x

7 channels

(sum q_i)^2 / sum(q_i^2) = 17424/3750. Fewer weights to train.

Layer 3.4x

attention+FFN

Attention QK monolithic + V,FFN block-diagonal. 174240/51174 per layer.

CRT-GPT-1

~25M params

12-layer, d=448, 7 channels. vs 117M standard GPT-1. 4.7x compression.

Matrix Library

CRT-GPT-1 forward pass implemented in .ax. Block-diagonal matrix operations: per-channel matmul, matvec, layer norm, squaring activation (h^2 -- ring-native nonlinearity), and 7 independent softmaxes. Full pipeline: character embedding via CRT residues, block-diagonal FFN, per-channel softmax, CRT reconstruction of predicted character. 53 checks pass.

CRT-GPT Forward Pass (VERIFIED)

Input char 'A' (65) decomposes to (1,2,15,16,10,0,14) mod (8,9,25,49,11,13,17). 7 per-channel embeddings. Layer norm per channel (mean=0, var=1). Block-diagonal FFN: W1*x+b1, squaring, W2*h+b2. Per-channel softmax (132 total classes). CRT reconstruction recovers 65. End-to-end .ax, WASI-verified, no JS.
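The two ends of the pipeline, residue decomposition and CRT reconstruction after per-channel softmax, can be sketched in Python (the original is written in .ax, which is not reproduced here). The logits are hypothetical stand-ins for a trained model, set so the true residue scores highest in each channel; embeddings, layer norm and the FFN are omitted.

```python
import math

MODULI = (8, 9, 25, 49, 11, 13, 17)
N = math.prod(MODULI)  # 214,414,200

def decompose(code):
    """Character code -> one residue per channel."""
    return tuple(code % q for q in MODULI)

def softmax(logits):
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reconstruct(residues):
    """CRT: the unique n in Z/N with n % q_i == residues[i]."""
    n = 0
    for r, q in zip(residues, MODULI):
        m = N // q
        n += r * m * pow(m, -1, q)   # modular inverse, Python 3.8+
    return n % N

target = decompose(ord("A"))

# Hypothetical logits standing in for a trained model: the true residue scores highest.
logits = [[3.0 if r == t else 0.0 for r in range(q)] for q, t in zip(MODULI, target)]
probs = [softmax(row) for row in logits]
predicted = tuple(max(range(q), key=row.__getitem__) for q, row in zip(MODULI, probs))

print(target, reconstruct(predicted), chr(reconstruct(predicted)))
```

The seven softmaxes cover 132 classes in total, and the argmax residues determine a single element of the ring.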

Block-diagonal matmul

7 independent

bd_matmul, bd_matvec: per-channel matrix ops. Zero cross-channel weights.

Layer norm

Per-channel

Mean=0, var=1 independently for each of 7 channels.

Squaring activation

h^2

Ring-native nonlinearity. All activations non-negative. Forced positivity.

CRT softmax

132 classes total

7 small softmaxes (8+9+25+49+11+13+17) vs 1 giant 50K-class softmax. Sums to 1 per channel.

Attention & Embedding

CRT embedding: character code c decomposes into 7 residues (c mod q_i). Each residue indexes a learned d_c-dimensional table. 132 total entries (8+9+25+49+11+13+17) vs 50K for BPE -- 380x smaller output vocabulary. CRT coupling provides zero-parameter attention: QK scores come from the ring's eigenvalue inner product, with 99.5% retrieval and 30x advantage over uniform (CONFIRMED above). Single full-coupling attention, no multi-head (which dilutes). Block-diagonal V projection per channel. 50 checks pass.

CRT Attention Layer (VERIFIED)

Coupling-based QK attention on short sequences. Causal masking: position 0 = self-only (weight 1.0), verified. Identical-code sequences produce uniform attention weights (proved by algebraic symmetry). Identity V gives weighted embedding average. Attention composes with B2 FFN: embedding -> attention -> layer_norm -> FFN -> softmax -> CRT reconstruction recovers input character. All operations per-channel independent.

CRT embedding

132 entries

Character code reduced modulo each of the 7 channel moduli. 4570x fewer params than standard embedding.

Coupling attention

Zero-parameter QK

Eigenvalue inner product. 99.5% retrieval. No learned Q or K matrices.

Causal masking

Autoregressive

Position i attends to 0..i. Self-coupling dominates (highest weight).

Block-diagonal V

7 independent

Per-channel value projection. No cross-channel mixing in attention.

CRT Likelihood Scoring

Mode-based CRT prediction picks the most common next element per channel. It works at ring scale (58% exact on 214M elements, above) but hits a structural ceiling: 74% of test positions fail, and 89% of those failures are MODEL-limited -- the correct character has nonzero training count but is not the mode. The model sees the answer but selects wrong.

Soft CRT Prediction (VERIFIED)

CRT-factored distribution scoring -- summing normalized per-channel trigram counts across all channels -- achieves 115/399 (28.8%) on text prediction, vs 105/399 (26.3%) for mode+NN search (+9.5%). Per-channel accuracy jumps from 268 to 322 ppt (+20%). Small channels gain most: D-channel +34%, K-channel +37%. 14 soft-only correct positions vs 4 NN-only: distribution scoring is a near-superset of mode search. Score discriminates: 5.1/7 channels match at correct vs 0.7/7 at wrong. Equal channel weights ARE optimal -- SGD coordinate descent finds zero improvement.

The soft score IS a CRT-factored probability: each channel votes with its conditional frequency, not its mode. Channel independence means the sum factorizes -- no interaction terms. Small moduli (D=8, K=9) gain most because their count distributions are richest: 8 and 9 residue classes per context provide well-sampled distributions. Large moduli (b=49) gain little: sparse data collapses distribution to mode. 15 incremental approaches closed (THMs 93-109): mode, bloom variants, weighting, cross-channel joints. Distribution scoring is the capstone.
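The difference between mode selection and distribution scoring shows up even in a toy ring. The sketch below makes several assumptions that are not from the text: two channels (mod 2, mod 3) instead of seven, a single context instead of trigram tables, made-up observation counts, and a candidate set restricted to elements actually seen in training.

```python
from collections import Counter

MODULI = (2, 3)   # toy channels; the text uses seven

def channel_counts(observed):
    """Per-channel counts of the residues of each observed next element."""
    return [Counter(y % q for y in observed) for q in MODULI]

def mode_prediction(counts):
    """Combine the most common residue of each channel into one element."""
    modes = tuple(c.most_common(1)[0][0] for c in counts)
    return next(n for n in range(2 * 3) if tuple(n % q for q in MODULI) == modes)

def soft_score(candidate, counts):
    """Sum over channels of the normalized count of the candidate's residue."""
    return sum(c[candidate % q] / sum(c.values()) for c, q in zip(counts, MODULI))

def soft_prediction(observed, counts):
    """Best-scoring candidate among elements actually seen (an assumed candidate set)."""
    return max(set(observed), key=lambda y: soft_score(y, counts))

observed = [3] * 4 + [5] * 3 + [4] * 5   # hypothetical next elements after one context
counts = channel_counts(observed)
print(mode_prediction(counts), soft_prediction(observed, counts))
```

Here the per-channel modes combine into an element that never occurred in training, while the summed per-channel frequencies select an observed element that agrees with the majority in one channel and is well supported in the other.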

Ceiling diagnostic

89% model-limited

Mode is 6.2x more common than actual. Correct character ranks 8th among 24 alternatives.

Soft vs Mode+NN

+9.5% exact, +20% per-ch

115/399 vs 105/399. D-channel: 141 vs 105 (+34%). K-channel: 133 vs 97 (+37%).

Weight optimality

Equal = optimal

SGD (5 epochs, 400 validation): 0 exclusive gains on test. 8 divergent predictions ALL wrong for both methods.

Near-superset

14 soft-only, 4 NN-only

Oracle(soft, NN) = 119. Gap from soft = 4, gap from NN = 14. Distribution subsumes mode.

Lucas CRT Independence

The hash representation collapses spatial structure: all 7 channels see the same data through slightly different hashes, producing informationally redundant channels (0.22% pairwise disagree rate). Lucas' theorem shows that binomial coefficients C(n,k) naturally create independent channel views:

Channel Independence Theorem (PROVED)

C(n,k) mod p = product of C(n_i, k_i) mod p, where n_i, k_i are base-p digits of n, k (Lucas). Each axiom prime decomposes (n,k) in a different number base: D=binary, K=ternary, E=base-5, b=base-7, L=base-11, GATE=base-13, ESCAPE=base-17. Row p of Pascal's triangle is all-zero in the p-channel at every interior entry (0 < k < p); no other channel zeroes that whole row. 7 primes = 7 unique blindness rows = 7 genuinely independent views. Pairwise disagree: 230-493 permil vs ~2 for hash CRT. 78/78 verified.

Independence gain

230-493 vs 335 permil

Binomial zero/nonzero disagree exceeds hash on same data. V25 ARC hash was 2 permil (different metric). Key: each channel sees a different base-p digit decomposition.

490 split in zeros

DEAD 374 > ALIVE 245

DEAD channels (D,E,b) produce 52% more zeros than ALIVE (K,L,GATE,ESC). Smaller primes = denser Sierpinski fractal.

Kummer carries

945/945 verified

v_p(C(n,k)) = carries in base-p addition of k+(n-k). Each prime counts carries in its own base = genuinely different p-adic structure.
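Both number-theoretic facts used in this section, Lucas' theorem and Kummer's carry count, can be verified by brute force on small inputs. A minimal sketch with hypothetical helper names, assuming Python 3.8+ for `math.comb`; the range n < 40 is an arbitrary choice for the check.

```python
from math import comb

PRIMES = (2, 3, 5, 7, 11, 13, 17)

def digits(n, p):
    """Base-p digits of n, least significant first."""
    out = []
    while n:
        n, d = divmod(n, p)
        out.append(d)
    return out

def lucas(n, k, p):
    """C(n, k) mod p as a product over base-p digit pairs (requires k <= n)."""
    nd, kd = digits(n, p), digits(k, p)
    kd += [0] * (len(nd) - len(kd))
    result = 1
    for a, b in zip(nd, kd):
        result = result * comb(a, b) % p
    return result

def carries(k, m, p):
    """Number of carries when adding k and m in base p."""
    count = carry = 0
    while k or m or carry:
        carry = 1 if (k % p + m % p + carry) >= p else 0
        count += carry
        k, m = k // p, m // p
    return count

def valuation(x, p):
    """Exponent of p in the positive integer x."""
    v = 0
    while x % p == 0:
        x, v = x // p, v + 1
    return v

pairs = [(n, k) for n in range(40) for k in range(n + 1)]
print(all(lucas(n, k, p) == comb(n, k) % p for n, k in pairs for p in PRIMES))
print(all(valuation(comb(n, k), p) == carries(k, n - k, p) for n, k in pairs for p in PRIMES))
```

Each prime reads the same pair (n, k) through its own digit expansion, which is what makes the seven zero/nonzero patterns differ.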

Contrast Table

Aspect	Standard	CRT
Neural architecture	Monolithic transformer with large output softmax	CRT-decomposed: 5 independent heads. 9512x compression. Block-diagonal gradients.
Error detection	Post-hoc validation, separate ECC systems	mod-11 provides free error detection built into the algebra. Dual mod-11 + mod-13 corrects.
Scaling	Needs GPU clusters and billions of parameters	Runs in browser. CRT makes it small enough. 9512x parameter compression.
Compression	Statistical: LZ77 + Huffman / FSE on 1 monolithic stream	Algebraic: 5 independent CRT channels. Up to 20x Rissanen redundancy. Parallel by structure.
