CRT decomposes any computation into 7 independent channels -- same math that decomposes the ring. Shared backbone, independent output heads per channel, mod-11 error detection built into the algebra. Block-diagonal gradients. 9512x parameter compression at Z/970,200 scale. Runs in browser.
Standard transformer: one monolithic output layer predicts among N classes. CRT transformer: shared backbone produces a representation, then independent output heads predict residues per channel. Joint probability reconstruction recovers the full prediction. The Chinese Remainder Theorem guarantees unique recovery. The full ring Z/214,414,200 has 7 channels; experiments below used the 5-channel Z/970,200 ring.
Each channel has a natural domain. The CRT decomposition is not imposed -- it emerges from the ring structure:
| Channel | Size | Domain | Why |
|---|---|---|---|
| Z/8 | 8 classes | Coarse (3 bits) | Fastest to learn, cheapest to sacrifice |
| Z/9 | 9 classes | Mid-range | Minimum for complete decomposition |
| Z/25 | 25 classes | Fine | Carries golden ratio (discriminant 5) |
| Z/49 | 49 classes | Precise (49 states) | Deepest channel, controls spectral gap |
| Z/11 | 11 classes | Parity | Error detection. Always on. 1+2+3+5 = 11 |
| Z/13 | 13 classes | Boundary | Dual parity with mod-11. 2^2+3^2 = 13 |
| Z/17 | 17 classes | Transcendence | 5*7 = 1 mod 17. Period quadruples to 1680 |
Full ring output: 8+9+25+49+11+13+17 = 132 classes. Experiments at Z/970,200 (5 channels, 102 classes): 9512x compression. The backprop Jacobian is block-diagonal: 25654x fewer entries at N=2310 (thin channels mod 2, 3, 5, 7, 11).
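The ratios quoted above follow from plain arithmetic on the channel moduli. A minimal Python sketch, with function names that are illustrative and not taken from the .ax source:

```python
from math import prod

def output_compression(moduli):
    """Monolithic class count N divided by the summed per-channel classes."""
    return prod(moduli) / sum(moduli)

def jacobian_ratio(moduli):
    """Dense N x N Jacobian entries divided by block-diagonal entries sum(q^2)."""
    n = prod(moduli)
    return n * n / sum(q * q for q in moduli)

PRIME_POWER = [8, 9, 25, 49, 11]          # Z/970,200
THIN_2310 = [2, 3, 5, 7, 11]              # Z/2310
FULL_RING = [8, 9, 25, 49, 11, 13, 17]    # Z/214,414,200

print(sum(FULL_RING))                           # 132
print(round(output_compression(PRIME_POWER)))   # 9512
print(int(jacobian_ratio(THIN_2310)))           # 25654
```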
All five stack multiplicatively. Missing any one = leaving performance on the table.
| Breakthrough | Factor | Mechanism |
|---|---|---|
| CRT Decomposition | 9512x compression | 5 small heads vs 1 monolithic |
| Loop Theorem | N / sum(p_i) forward | CRT = loop unrolling |
| Block-Diagonal Backprop | 25654x fewer Jacobian entries | 5 independent gradient paths |
| mod-11 error detection | 100% thin, ~92% prime-power | Free detection. Dual parity mod 11 + mod 13: 100%. |
| Rissanen MDL | 20x byte / 936x token | Minimum description length selects Z/12,612,600 |
Combined at N=210 (Z/210): ~126,000x. CRT does not improve AI incrementally. It changes the computational class.
6 CRT channels = 6 GPU workgroups. Zero synchronization. The ring parallelizes itself. CRT decomposition IS the GPU parallelization strategy. No manual workgroup design. No shared memory management. The algebraic structure does the engineering.
The WGSL compute shader is 8 lines per operation. Decompose, operate per-channel, reconstruct. The CRT reconstruction coefficients are ring constants:
| Channel | Modulus | Workgroup | CRT Coefficient |
|---|---|---|---|
| mod 8 | 8 | Coarse (3 bits) | 363825 |
| mod 9 | 9 | Mid-range | 431200 |
| mod 25 | 25 | Fine | 853776 |
| mod 49 | 49 | Precise (49 states) | 732600 |
| mod 11 | 11 | Error detection | 529200 |
mod-11 error detection runs as its own workgroup -- a free integrity check on every computation. Training a CRT neural network: each channel = independent backprop. 5 GPU workgroups, zero inter-workgroup communication.
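The decompose, operate per-channel, reconstruct pattern can be sketched on the CPU in a few lines of Python. This is not the WGSL shader, only an illustration that uses the coefficients from the table above; the helper names are hypothetical.

```python
MODULI = [8, 9, 25, 49, 11]
COEFFS = [363825, 431200, 853776, 732600, 529200]   # from the table above
N = 970200

def decompose(x):
    return [x % q for q in MODULI]

def reconstruct(residues):
    return sum(r * c for r, c in zip(residues, COEFFS)) % N

def crt_mul(a, b):
    """Multiply channel by channel (no cross-channel traffic), then reconstruct."""
    return reconstruct(
        [(ra * rb) % q for ra, rb, q in zip(decompose(a), decompose(b), MODULI)]
    )

# Each coefficient is 1 in its own channel and 0 in every other channel.
for i, c in enumerate(COEFFS):
    assert decompose(c) == [1 if j == i else 0 for j in range(len(MODULI))]

print(reconstruct(decompose(123456)))              # 123456
print(crt_mul(1234, 5678) == (1234 * 5678) % N)    # True
```

Each list element in `crt_mul` depends only on its own modulus, which is why the channels map onto independent workgroups.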
Enter a vocabulary size N. See CRT output (102 classes fixed) vs monolithic (N classes). The compression ratio grows with N. Try 256 (bytes), 2310, 970200, 12612600.
Every byte lives in Z/210 = Z/2 x Z/3 x Z/5 x Z/7. Bytes 210-255 wrap (mod 210). Decompose any data into 4 independent channels. Each channel has its own entropy. Rissanen MDL: comparing the sum of channel entropies against the joint entropy reveals exploitable redundancy -- up to 20x at byte scale.
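A rough Python sketch of the per-channel entropy measurement, using the four prime factors of 210. The comparison reported here (sum of channel entropies next to the joint entropy) is one plausible reading of the redundancy measure; the exact ratio the page computes is not specified, so treat the names and the output format as assumptions.

```python
from collections import Counter
from math import log2

MODULI = (2, 3, 5, 7)   # 2 * 3 * 5 * 7 = 210

def entropy(symbols):
    """Shannon entropy in bits of an observed symbol sequence."""
    counts = Counter(symbols)
    total = sum(counts.values())
    return -sum(c / total * log2(c / total) for c in counts.values())

def channel_report(text):
    data = [b % 210 for b in text.encode("utf-8")]   # bytes 210-255 wrap
    per_channel = {q: entropy([x % q for x in data]) for q in MODULI}
    return per_channel, entropy(data)

per_channel, joint = channel_report("the quick brown fox jumps over the lazy dog")
print(per_channel)
print(sum(per_channel.values()), joint)
```

Because the four residues determine the element of Z/210, the summed channel entropies can never fall below the joint entropy; the gap between the two is the quantity of interest.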
Enter a text string. See per-channel entropy and Rissanen redundancy ratio:
Phase A begins: co-evolving .ax and axiom-native intelligence. The first step: f64 matrix operations in pure .ax. matmul, transpose, softmax, cross-entropy -- the primitives every neural network needs. All compiled to WASM. No Python. No NumPy. No dependencies.
The matrix library uses falloc/fset/fget (f64 arrays) and arena reset (hp/set_hp). Matrices are row-major flat arrays. 18 operations: mat_zero, mat_get, mat_set, matmul, transpose, mat_add, mat_sub, mat_scale, dot_f, vec_sum_f, vec_max_f, softmax, cross_entropy, relu, sigmoid, relu_vec, outer_product, argmax. 54/54 WASI tests pass.
First neural network trained in pure .ax f64. A per-channel perceptron learns x^2 mod 210 (Z/210 squaring) using matmul, softmax, and outer_product from the matrix library. 4 independent channels (mod 2, 3, 5, 7) with 87 total parameters. 10 epochs of gradient descent. Result: 210/210 -- perfect accuracy on all inputs, including 168 never seen during training.
A standard perceptron on the same task needs 44100 parameters (210x210). After identical training on 42 examples, it scores 43/210 -- pure memorization, zero generalization. The CRT model generalizes completely because each channel is small enough that training coverage is automatic: 42 examples cover all residue classes up to mod 7.
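The generalization argument can be reproduced with a small pure-Python stand-in for the .ax experiment. Assumptions: the 42 training inputs are taken to be 0..41, inputs are one-hot per channel, and the learning rate is 1.0; the original training set and hyperparameters are not stated.

```python
import math

MODULI = [2, 3, 5, 7]
BASIS = [105, 70, 126, 120]   # CRT basis of Z/210
N = 210

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    s = sum(exps)
    return [e / s for e in exps]

def train(examples, epochs=10, lr=1.0):
    # One q x q weight matrix per channel: 4 + 9 + 25 + 49 = 87 parameters.
    weights = [[[0.0] * q for _ in range(q)] for q in MODULI]
    for _ in range(epochs):
        for x, y in examples:
            for W, q in zip(weights, MODULI):
                row = W[x % q]              # one-hot input selects a single row
                probs = softmax(row)
                for k in range(q):          # cross-entropy gradient: p - onehot
                    row[k] -= lr * (probs[k] - (1.0 if k == y % q else 0.0))
    return weights

def predict(weights, x):
    residues = [max(range(q), key=W[x % q].__getitem__)
                for W, q in zip(weights, MODULI)]
    return sum(r * b for r, b in zip(residues, BASIS)) % N

examples = [(x, (x * x) % N) for x in range(42)]   # assumed training inputs
weights = train(examples)
correct = sum(predict(weights, x) == (x * x) % N for x in range(N))
n_params = sum(q * q for q in MODULI)
print(correct, n_params)   # 210 87
```

The 168 unseen inputs are predicted correctly because every residue class mod 2, 3, 5 and 7 already appears among the 42 training inputs.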
Z/210 uses thin channels (mod 2, 3, 5, 7). Z/970,200 uses prime-power channels (mod 8, 9, 25, 49, 11) -- the Pareto-optimal exponents. Same architecture, same training. Prime-power channels: 3292 parameters. Standard equivalent: 941 billion. Forward compression jumps from 12x (Z/210 thin) to 9512x.
Prime-power channels compute on a richer substrate. The mod-49 channel carries 49 classes (vs 7 for thin mod-7). mod-11 provides free error detection: every prediction is independently verified by the 5th channel. All five channels achieve perfect accuracy. CRT reconstruction: 50/50 on random Z/970,200 elements.
The CRT perceptron learns per-channel functions perfectly. But x%13 (a cross-channel function) depends on ALL channels -- no single channel determines it. Linear mixing across channels achieves random performance (23/210). The solution: bloom. CRT reconstruction IS the non-linearity.
Architecture: per-channel perceptrons (87 params) -> CRT reconstruct (0 params) -> mod q. The bloom layer is algebraic and exact. It has zero trainable parameters. It replaces ReLU. One set of trained weights projects through ANY coprime gate: x%11, x%13, x%17 -- all 210/210 from the same 87 parameters.
Bloom computes x%13 perfectly -- but what about x^2 % 13? A single bloom layer reconstructs x^2 mod 210, then reduces mod 13. When x^2 >= 210 (x >= 15), the ring reduction destroys mod-13 information. Result: 28/210 correct. Ring overflow = info loss.
The hourglass fixes this. Two flowers face to face, gate to gate. Flower A extracts the gate value g = x%13 (identity bloom). Gate values are small: 0 to 12. Flower B computes g^2 inside Z/210 -- and 12^2 = 144 < 210, so no overflow ever occurs. The gate IS the bottleneck that prevents overflow.
The Z/210 hourglass stops at polynomial degree 2: gate value 12, cubed gives 1728 > 210. Overflow. The Z/970,200 hourglass extends to degree 5: 12^5 = 248832 < 970200. Raising exponents gives 3 extra polynomial degrees beyond Z/210.
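The overflow problem and the hourglass fix can be checked directly. In this Python sketch the per-channel layers are exact (no training), so it isolates the algebra only; function names are illustrative.

```python
MODULI = [2, 3, 5, 7]
BASIS = [105, 70, 126, 120]   # CRT basis of Z/210
N = 210
GATE = 13

def channels(x):
    return [x % q for q in MODULI]

def bloom(residues):
    return sum(r * b for r, b in zip(residues, BASIS)) % N

def single_bloom_square(x):
    # Square per channel, reconstruct x^2 mod 210, then reduce mod 13.
    return bloom([(r * r) % q for r, q in zip(channels(x), MODULI)]) % GATE

def hourglass_square(x):
    g = bloom(channels(x)) % GATE   # Flower A: identity bloom, then the gate
    sq = bloom([(r * r) % q for r, q in zip(channels(g), MODULI)])   # Flower B
    return sq % GATE

def max_safe_degree(n, gate=GATE):
    """Largest d with (gate - 1)^d < n, i.e. no overflow inside Z/n."""
    d = 0
    while (gate - 1) ** (d + 1) < n:
        d += 1
    return d

single = sum(single_bloom_square(x) == (x * x) % GATE for x in range(N))
hourglass = sum(hourglass_square(x) == (x * x) % GATE for x in range(N))
print(single, hourglass)
print(max_safe_degree(210), max_safe_degree(970200))   # 2 5
```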
Architecture: Flower A (identity on 5 prime-power channels, 3292 params) -> gate(13) bottleneck -> Flower B (g^d on prime-power channels, 3292 params) -> %13. Total: 6584 params. Standard: 970200 * 13 = 12,612,600 parameters.
The Z/970,200 hourglass reaches degree 5. But degrees 7 and 11 are unreachable by power composition: the multiplicative monoid of {1,...,5} in Z/12 (where phi(13) = 12) is missing exactly {7, 11}. These require a different approach.
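The missing degrees can be confirmed by closing the reachable exponents {1,...,5} under multiplication mod 12. A short Python check (the helper name is illustrative):

```python
def reachable_degrees(generators, modulus):
    """Closure of the generators under multiplication mod `modulus`."""
    seen = set(generators)
    frontier = list(seen)
    while frontier:
        a = frontier.pop()
        for g in generators:
            b = (a * g) % modulus
            if b not in seen:
                seen.add(b)
                frontier.append(b)
    return seen

reachable = reachable_degrees([1, 2, 3, 4, 5], 12)
missing = sorted(set(range(12)) - reachable)
print(missing)   # [7, 11]
```

The only units among the generators are 1 and 5, and they generate only {1, 5}, so the units 7 and 11 can never appear as a product.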
Resolution: channel-49 carries a correction. For each gate value g, search r in 0..48 such that the CRT reconstruction with r in the mod-49 channel produces the target mod 13. Pigeonhole guarantees at least 3 valid r per gate value (floor(49/13)=3). Same 6584 params. Same architecture. Only the training targets differ.
The Z/970,200 hourglass ceiling: degree 5. At degree 6, gate values 10, 11, 12 overflow (g^6 >= 970200). The fix: STACK. Two flowers in sequence, each applying a sub-degree. Flower B1 computes g^d1, extracts the gate. Flower B2 computes g1^d2, extracts the result. Total degree = d1 * d2. Each layer stays safe (max 12^5 = 248832 < 970200). The ceiling shatters.
5-squared collapse: degree 5 composed with degree 5 gives degree 25. But 25 mod phi(13) = 25 mod 12 = 1. Quintic composed with quintic = identity. 5^2 = 1 in Z/12: two layers cancel, leaving only what was already there.
Every test above hard-codes the CRT basis: {105, 70, 126, 120} for Z/210, {363825, 431200, 853776, 732600, 529200} for Z/970,200. But what if the ring can discover its own basis? Strip all CRT knowledge. Start with bloom weights = [0, 0, 0, 0]. Apply coordinate descent: for each channel, sweep the weight over all possible values and pick the one that maximizes reconstruction accuracy. After a few rounds, the CRT basis emerges from data alone.
Perturbation is catastrophic and exact. Changing any basis weight by +1 drops accuracy to precisely N/p -- the channel dies and only elements with zero residue survive. The basis is an isolated global maximum: no graceful degradation, no nearby local optima. The discovered basis elements are orthogonal idempotent projectors: B_p^2 = B_p, B_p * B_q = 0 for p != q, and sum(B_p) = 1 mod N.
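Both claims, the idempotent structure and the exact N/p collapse under a +1 perturbation, are easy to verify at Z/210. A minimal Python check using the hard-coded basis quoted above:

```python
MODULI = [2, 3, 5, 7]
BASIS = [105, 70, 126, 120]
N = 210

def accuracy(basis):
    """Number of x in Z/N that the bloom sum reconstructs exactly."""
    return sum(
        sum((x % q) * b for q, b in zip(MODULI, basis)) % N == x
        for x in range(N)
    )

# Orthogonal idempotent projectors: B_p^2 = B_p, B_p * B_q = 0, sum = 1.
for i, bi in enumerate(BASIS):
    assert (bi * bi) % N == bi
    for j, bj in enumerate(BASIS):
        if i != j:
            assert (bi * bj) % N == 0
assert sum(BASIS) % N == 1

# Perturbing one weight by +1 leaves only the x with zero residue in that channel.
for i, q in enumerate(MODULI):
    perturbed = list(BASIS)
    perturbed[i] += 1
    print(q, accuracy(perturbed))   # N // q for each channel
```

With weight B_p + 1 the prediction becomes x + (x mod p), which equals x only when x mod p = 0, hence exactly N/p survivors.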
Soft bloom discovers the CRT basis but takes per-channel functions as given. Joint bloom learns BOTH. Architecture: prediction(x) = sum(W_ch[x%p] * B[ch]) % N, where W is a per-channel lookup table (17 params) and B is the bloom basis (4 params). Total: 21. Standard: 44100. Ratio: 2100x.
Identity-anchored training: learn B on identity (always bijective, always converges), then learn W on the actual target. Soft bloom alone gets 16/210 on x^2 -- only the 16 idempotents of Z/210 match. Joint bloom with trained W: 210/210. The per-channel lookup discovers r^2 mod p (for x^2) and (p-r) mod p (for mirror) entirely from data.
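The joint bloom formula prediction(x) = sum(W_ch[x%p] * B[ch]) % N can be exercised with table-filling in place of gradient training. That substitution is an assumption of this Python sketch, and B is taken as already learned.

```python
MODULI = [2, 3, 5, 7]
B = [105, 70, 126, 120]   # bloom basis (4 params), assumed already learned
N = 210

def predict(W, x):
    return sum(W[ch][x % p] * B[ch] for ch, p in enumerate(MODULI)) % N

def learn_W(target):
    """Fill each per-channel lookup table from (x, target(x)) pairs."""
    W = [[0] * p for p in MODULI]
    for x in range(N):
        for ch, p in enumerate(MODULI):
            W[ch][x % p] = target(x) % p
    return W

def square(x):
    return (x * x) % N

def mirror(x):
    return (-x) % N

identity_W = [list(range(p)) for p in MODULI]
soft_only = sum(predict(identity_W, x) == square(x) for x in range(N))
W_sq, W_mir = learn_W(square), learn_W(mirror)
joint = sum(predict(W_sq, x) == square(x) for x in range(N))
n_params = sum(len(t) for t in W_sq) + len(B)
print(soft_only, joint, n_params)   # 16 210 21
```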
Joint bloom at Z/210 scale (210 elements, 21 params) extends to Z/970,200 (5 prime-power channels). Key result: global coordinate descent FAILS at Z/970,200 scale -- noise drowns signal (~0.0001% accidental match). Per-channel search ALWAYS works: correct basis scores M, wrong scores ~M/q. The ring FORCES CRT decomposition at scale.
Identity-anchored training at Z/970,200: per-channel basis search discovers B structurally (5 elements), then per-channel W training discovers lookup tables from data (102 entries). Total: 107 params. Forward ratio: 9067x. With perceptron matrices: 3297 params, backward ratio 286 million x. Raising exponents: 590,000x gain over Z/210.
Z/12,612,600 adds mod-13 to the 5 channels of Z/970,200. 6 prime-power channels total: {8, 9, 25, 49, 11, 13}. The 490 split is now complete: inner = {mod 8, mod 25, mod 49} + outer = {mod 9, mod 11, mod 13}. Per-channel search scales cleanly -- 121 params (115 lookup + 6 basis) for a ring of 12.6 million elements.
Forward ratio: 104,236x. Perceptron upgrade: 3467 params, backward ratio ~46 billion x. Sample-based verification (1040 elements, exact divisibility by 8 and 13). GATE perturbation: wrong B_13 leaves only elements with x mod 13 = 0 intact (80/1040 = 1/13). The boundary channel is structurally independent.
Is ring algebra the right frame? Six probes on Z/970,200 (5 prime-power channels) test the foundation before it strains. All probes PASS: the ring frame holds for decomposable targets. The first boundary is quantified.
Non-decomposable wall: f(x)=(x%8)*(x%9) achieves only 175175/970200 = 18% accuracy via per-channel prediction. The number 175175 = N*13/72 is structurally determined (72=8*9 mixed channels). Decomposable targets (x^2, x+42, x*42) achieve 100%. The wall proves channel independence has limits -- bloom mixing is the path across.
The non-decomposable wall (18%) is real. Can bloom mixing cross it? CRT reconstruction enables computing ANY cross-channel function from per-channel inputs. Pipeline: per-channel identity (102 params) -> CRT reconstruct (5 basis) -> compute cross-channel function -> compare. Result: 100% on ALL cross-channel targets. Zero additional trainable parameters.
Four targets tested at Z/970,200: (x%8)*(x%9), (x%25)*(x%49), (x%8)+(x%9), (x%8)*(x%9)*(x%25). All 4 cross from wall to 970200/970200 with bloom. The additive target achieves ZERO per-channel matches -- even harder than the multiplicative wall. Partial CRT: for 2-channel functions, only 19 params needed (vs 107 full).
Sample efficiency: 72 samples learn a 970200-element function (13475x compression). Accuracy is EXACTLY proportional to coverage -- each filled entry accounts for exactly N/table_size correct predictions. CRT guarantee, not statistical approximation. Works for arbitrary unknown functions: the learner sees (x, f(x)) pairs and fills the table.
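The coverage claim can be checked by brute force. In this Python sketch the crossing table is keyed by (x%8, x%9) and an unfilled entry counts as a wrong prediction; both choices are assumptions about how the experiment scores accuracy.

```python
N = 970200

def target(x):
    return (x % 8) * (x % 9)

def fill_table(samples):
    """Crossing table learned from observed (x, f(x)) pairs."""
    return {(x % 8, x % 9): fx for x, fx in samples}

def count_correct(table):
    # Each of the 72 keys is hit by exactly N / 72 = 13475 ring elements.
    return sum(table.get((x % 8, x % 9)) == target(x) for x in range(N))

full = fill_table((x, target(x)) for x in range(72))
half = fill_table((x, target(x)) for x in range(36))
print(len(full), count_correct(full))   # 72 970200
print(len(half), count_correct(half))   # 36 485100
```

Inputs 0..71 hit every (x%8, x%9) pair exactly once because 8 and 9 are coprime, which is why 72 samples suffice.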
Key result: composition is non-commutative. compose(sq, mir)(x) = N - x^2 but compose(mir, sq)(x) = x^2. Squaring absorbs negation: (-x)^2 = x^2 in any ring. Layer order matters -- just like in standard neural networks. Mirror is an involution (mir o mir = identity). Three layers compose correctly: sq o mir o sq = x^4. The cross-channel pipeline chains a per-channel layer with a learned crossing table in one forward pass.
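Layer composition on per-channel tables can be sketched as follows (Python, Z/210 for brevity). Here `compose(first, second)` applies `first` and then `second`, matching the convention implied by compose(sq, mir)(x) = N - x^2; the helper names are illustrative.

```python
MODULI = [2, 3, 5, 7]
BASIS = [105, 70, 126, 120]
N = 210

SQ = [[(r * r) % q for r in range(q)] for q in MODULI]   # squaring layer
MIR = [[(-r) % q for r in range(q)] for q in MODULI]     # mirror layer

def compose(first, second):
    """Per-channel tables for: apply `first`, then `second`."""
    return [[s[f[r]] for r in range(len(f))] for f, s in zip(first, second)]

def run(layer, x):
    return sum(t[x % q] * b for t, q, b in zip(layer, MODULI, BASIS)) % N

sq_then_mir = compose(SQ, MIR)
mir_then_sq = compose(MIR, SQ)
print(run(sq_then_mir, 3), run(mir_then_sq, 3))   # 201 9
```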
The forward path is complete. The backward pass makes it trainable. CRT backward pass decomposes into independent per-channel backward passes. Each weight W[r] in channel ch has gradient 2*(W[r] - target(r)) -- no coupling to other channels. The Jacobian is block-diagonal by construction: per-channel loss functions share no parameters.
Key contrast: per-channel loss decouples channels completely (block-diagonal), while full-ring reconstruction loss COUPLES them through the shared CRT reconstruction error. Per-channel training avoids this coupling while converging to the same optimum for CRT-decomposable targets. This is the Backward Decomposition Theorem.
Hybrid pipeline: gradient descent for per-channel layers (block-diagonal, 102 params each) + discrete optimization for crossing tables (72 entries, sample-based or hillclimb). Full stack from scratch: 2 trained layers + learned crossing table. 281 params for 970200-element ring = 3452x compression. Channel-parallel: all 5 channels converge in 1 epoch each (lr=0.5). The full CRT training pipeline is now operational.
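The one-epoch convergence follows from the gradient rule 2*(W[r] - target(r)): with lr=0.5 a single step lands exactly on the target. A minimal Python sketch, assuming a squared-error loss per table entry and zero initialization (neither is stated explicitly for the original run):

```python
MODULI = [8, 9, 25, 49, 11]

def train_channel(q, target, lr=0.5, epochs=1):
    """Gradient descent on sum_r (W[r] - target(r))^2 for a single channel."""
    W = [0.0] * q
    for _ in range(epochs):
        for r in range(q):
            grad = 2.0 * (W[r] - target(r))   # depends on W[r] only
            W[r] -= lr * grad
    return W

def square_target(q):
    return lambda r: float((r * r) % q)

# Channels never read each other's weights, so they can be trained in any order
# or in parallel.
tables = [train_channel(q, square_target(q)) for q in MODULI]
print(sum(len(t) for t in tables))   # 102
```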
Standard transformers learn attention via softmax(QK^T / sqrt(d)). CRT coupling provides Q and K algebraically -- no learned projection matrices. Per-channel eigenvalue eig(n, q) = 2*cos(2*pi*(n mod q)/q). Coupling(a,b) = dot product of eigenvalue vectors = sum over channels of eig(a)*eig(b). This is a symmetric, positive semi-definite kernel on the ring (a Gram matrix of 5-dimensional eigenvalue vectors). 102 precomputed f64 entries encode all pairwise coupling for 970200 elements (9512x compression).
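The coupling kernel is small enough to write out in full. A Python sketch with the 102 eigenvalue entries precomputed as described; the dictionary layout is an illustrative choice, not the .ax data structure.

```python
import math

MODULI = [8, 9, 25, 49, 11]

# 8 + 9 + 25 + 49 + 11 = 102 precomputed eigenvalue entries.
EIG = {q: [2.0 * math.cos(2.0 * math.pi * r / q) for r in range(q)]
       for q in MODULI}

def coupling(a, b):
    """Dot product of the per-channel eigenvalue vectors of a and b."""
    return sum(EIG[q][a % q] * EIG[q][b % q] for q in MODULI)

print(sum(len(v) for v in EIG.values()))   # 102
print(coupling(0, 0))                      # 20.0
print(coupling(12345, 678901) == coupling(678901, 12345))   # True
```

Self-coupling is a sum of squares, so it is never negative; it reaches its maximum of 20 at the elements whose residues are all zero.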
Full-coupling attention (all 7 channels simultaneously) outperforms multi-head decomposition -- multi-head averaging dilutes per-channel signal. 9 = the compiler's analysis pass count, not attention heads. Cross-coupling averages to zero by character orthogonality -- self-coupling is strongly positive. Signal and noise are separated by algebra, not by learning.
490 split as trainable architecture: inner channels (mod 8, mod 25, mod 49) learn x^2, outer channels (mod 9, mod 11, mod 13, mod 17) learn mirror -- simultaneously, independently. Encoder and decoder train with different targets. Block-diagonal Jacobian extends across the split: perturb mod-8 (inner), mod-17 (outer) gradient is exactly zero. Full-coupling attention over all 7 channels.
The Carmichael period 420 becomes a training rhythm. Sample-based SGD (one random sample per step) runs for 420 steps per epoch. Then the projector 1,576,576 (= 2^420 mod 12,612,600) fires: the mod-8 channel collapses, all other channels persist as long-term memory. mod-8 recovers in O(8) steps -- the smallest channel is the most expendable. 4 epochs of 420 = 1680 = Carmichael period of Z/214,414,200.
Channel convergence follows modulus size: mod-8 and mod-9 converge first (step 50), then mod-11, then mod-25, then mod-49. At step 420, all 102 per-channel weights are correct. At step 50, only 31/49 mod-49 residues are covered. 420 is the minimum natural period for full coverage -- forced by mod-49 needing 420/49 = 8.6 samples per residue. At Z/214,414,200, 2^420 zeroes mod-8 AND mirrors mod-17 (CRT=16=-1 mod 17). The full projector (0,1,1,1,1,1,1) needs period 1680. mod-17 is always last to resolve.
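The projector values can be reproduced with modular exponentiation. A short Python check using the built-in three-argument `pow`:

```python
CH6 = (8, 9, 25, 49, 11, 13)
N6 = 8 * 9 * 25 * 49 * 11 * 13   # 12,612,600
N7 = N6 * 17                     # 214,414,200

p420 = pow(2, 420, N6)
print(p420, [p420 % q for q in CH6])   # 1576576 [0, 1, 1, 1, 1, 1]

p1680 = pow(2, 1680, N7)
print(p1680, [p1680 % q for q in CH6 + (17,)])   # 26801776 [0, 1, 1, 1, 1, 1, 1]

print(pow(2, 420, 17))   # 16, i.e. -1 mod 17

# Applying the projector zeroes the mod-8 residue and keeps the others intact.
x = 123456
y = (x * p420) % N6
print([y % q for q in CH6] == [0] + [x % q for q in CH6[1:]])   # True
```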
CRT decomposition is a compression technique. It trades exact element identity for smaller per-channel tables. On small rings (Z/210, 27 characters), flat tables outperform CRT -- they preserve more signal with fewer entries. On large rings, CRT is the ONLY feasible decomposition. Z/214,414,200 has 214 million elements. Flat bigram tables would need over 45 thousand trillion entries. CRT per-channel trigrams need 142,956.
Ring arithmetic sequences in Z/214,414,200 -- noisy affine maps mixing multiplication and addition -- are predicted at 58% exact accuracy using CRT per-channel trigrams. Each of seven channels achieves 607-704 per-thousand accuracy (60.7% to 70.4%, 6x to 34x above random). The mod-49 channel shows the largest improvement: 34x. The improvement over random guessing: 124 million times. Per-channel independence means the same 143K table entries handle a 214-million-element alphabet that no flat table could touch.
The multiplicative basin is a genuine structural attractor. A K^2=9 population of 2D-LUT state machines (3750 entries each) initialized from 90% correct multiplication tables holds at 75% accuracy across 200 generations of random mutation. The basin resists drift indefinitely. But random entry-level mutation cannot improve: each change has >97% probability of being wrong (P=(q-1)/q for Z/49). The 2D-LUT is a state machine where a single corrupted entry redirects the entire downstream trajectory -- cascading errors make the fitness landscape deceptive.
Crossover between members with independently corrupted entries provides +5.1 percentage points (72.6% to 77.7%) by recombining correct entries from different parents. The improvement plateaus when population diversity exhausts. Full recovery to 100% requires systematic search (coordinate descent visits every entry, tries all alternatives) or biology-scale populations. Evolution from additive initialization climbs to only 16% -- structurally unable to bridge the modular discontinuity.
What determines how well one CRT channel predicts another? Two hypotheses: quadratic residue compatibility (algebraic: Legendre symbols between primes) or shared-factor coupling (operational: gcd of operation constants with channel moduli). The QR hypothesis is FALSIFIED; the GCD mechanism is constructive and predictive.
How much information does each CRT channel carry? The eigenvalue of n -- a sum of 7 cosines, one per channel -- collapses all channels into a single number. If the sum is useful, channels are redundant. If not, each channel carries independent information that summation destroys.
The mod-9 channel achieves exactly the baseline because it is orthogonal to parity: whether n is even or odd is invisible in n mod 9. But combined with mod-8, mod-9 further separates the odd elements by divisibility by 3. This is CRT in action: independent channels carry independent information.
Can parity channels (mod 11, mod 13, mod 17) detect prediction errors for free? If 7 independent per-channel predictors disagree on parity, is the prediction wrong? Testing this by training independent bigram predictors on a sequence in Z/88,200.
Root cause: the mod-11 residue depends on the FULL element (all 4 data channels), not just the previous mod-11 value. A predictor that sees only current mod-11 cannot predict the next. This proves that CRT-based intelligence MUST include coupling between channels -- pure block-diagonal processing cannot leverage error detection. The rate 4/7 IS the coupling cost: 3/7 of capacity enforces consistency.
CRT coupling(a,b) = sum of per-channel eigenvalue products = a dot-product in eigenvalue space. Zero learned parameters. Does this replace standard attention? Testing on Z/214,414,200 (7 channels) with 28 possible heads: 7 single-channel + 21 pair-channel.
The 9-heads hypothesis (7 single-channel + 2 pair-channel = optimal head count) shows no statistically significant signal across 5 random seeds. All head counts from 7 to 21 give similar contrast (~50%). CRT attention works best as FULL COUPLING (all 7 channels simultaneously), not decomposed into independent heads.
Catastrophic forgetting: train on task A, then train on task B, and task A knowledge is destroyed. Standard models have no remedy -- all weights shift. CRT-decomposed models have a projector: 26,801,776 (= 2^1680 mod 214,414,200) with CRT = (0,1,1,1,1,1,1). It zeroes the mod-8 channel (coarsest, 8 states) and preserves the 6 finer channels exactly.
The sacrifice is optimal: mod-8 is the coarsest channel (only 8 possible residues). It carries the least fine-grained information and is the fastest to relearn. The 6 preserved channels (mod-9 through mod-17) carry the finer structure that is expensive to learn and critical to retain.
How fast does each CRT channel learn? Walk through Z/12,612,600 multiplicatively (a^k mod 12612600) using a maximal-order generator (ord = 420 = Carmichael lambda). Per-channel bigram tables learn deterministic transitions. Each channel converges when its orbit is fully observed -- at step lambda(q_i), the per-channel Carmichael value.
Raising exponents (5->5^2=25, 7->7^2=49) increases the state space and SLOWS learning. Extension channels (mod-11, mod-13) have fewer states and converge faster despite being later in the chain. mod-8 converges first (2 multiplicative steps) -- confirming it is optimal for sacrifice: coarsest, cheapest to relearn.
If channels are independent, does the ORDER of training matter? Comparing three schedules on the same additive walk through Z/12,612,600 (6 channels): simultaneous (all channels every step), chain-order sequential (mod 8, 9, 25, 49, 11, 13 one at a time), and modulus-order sequential (mod 8, 9, 11, 13, 25, 49 -- the convergence order).
Each channel's learning speed is an intrinsic property of its modulus -- no curriculum ordering can accelerate it. Simultaneous training is always optimal because channels extract information in parallel without interference. Chain-order annealing provides zero advantage.
The 490 split (490^420 mod 12,612,600 = outer idempotent) divides 6 channels into inner = {mod 8, mod 25, mod 49} (encoder, 9800 states) and outer = {mod 9, mod 11, mod 13} (decoder, 1287 states). Does this split create a meaningful timing asymmetry?
lcm(inner Carmichael lambdas) = lcm(2, 20, 42) = 420 = Carmichael period of the ring. lcm(outer) = lcm(6, 10, 12) = 60. Ratio = 420/60 = 7. In Z/214,414,200: mod-17 joins the outer set, flipping the balance (outer = 21879 > inner = 9800).
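The timing asymmetry and the 490 idempotent can both be recomputed. This Python sketch finds each per-channel Carmichael value by brute force, which is fine for moduli this small; `math.lcm` requires Python 3.9 or later.

```python
from math import gcd, lcm

def carmichael(q):
    """Exponent of the unit group of Z/q (lcm of the orders of all units)."""
    lam = 1
    for a in range(1, q):
        if gcd(a, q) == 1:
            k, v = 1, a % q
            while v != 1:
                v = (v * a) % q
                k += 1
            lam = lcm(lam, k)
    return lam

INNER, OUTER = (8, 25, 49), (9, 11, 13)
N6 = 12612600

inner_period = lcm(*(carmichael(q) for q in INNER))
outer_period = lcm(*(carmichael(q) for q in OUTER))
print(inner_period, outer_period, inner_period // outer_period)   # 420 60 7

e = pow(490, 420, N6)
print([e % q for q in INNER], [e % q for q in OUTER])   # [0, 0, 0] [1, 1, 1]
```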
Does training channels in chain order (mod 2 -> mod 3 -> mod 5 -> mod 7) help? Testing three schedules on Z/210 (4 channels): simultaneous (all every step), chain-order sequential, and reverse-order sequential.
The curriculum penalty (sum of moduli over the largest modulus) persists at every scale: Z/210 17/7=2.43x, Z/12,612,600 115/49=2.35x, Z/214,414,200 132/49=2.69x. Simultaneous training outperforms ALL curricula. Chain order = real algebra (group-theoretic decomposition), NOT trainable capability phases. Channel independence is so strong that each channel converges at its intrinsic q_i-step rate regardless of what other channels are doing.
CRT independence isn't just about learning -- it transforms search. Given training data (x,y) from an unknown polynomial f over Z/N, how many candidates must you try? Monolithically: N^{d+1} for degree d. Per-channel with CRT: sum(q_i^{d+1}). The search space converts from MULTIPLICATIVE to ADDITIVE.
The affine ratio (506x at Z/210: 44100/87) is EXACTLY the block-diagonal Jacobian ratio N^2/sum(q_i^2) from the backprop theorem. Search reduction IS backprop reduction -- the same algebraic structure powers both. The ratio grows by a factor of ~N/q_max per degree increase.
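The multiplicative-to-additive conversion can be illustrated with an affine search at Z/210. In this Python sketch the hidden function 37x + 42 and the seven training inputs are made-up illustration values, not data from the experiments.

```python
from math import prod

MODULI = [2, 3, 5, 7]
BASIS = [105, 70, 126, 120]   # CRT basis of Z/210
N = 210

def candidates(moduli, degree):
    """(monolithic, per-channel) candidate counts for a degree-d polynomial."""
    return prod(moduli) ** (degree + 1), sum(q ** (degree + 1) for q in moduli)

def lift(parts):
    return sum(r * c for r, c in zip(parts, BASIS)) % N

def fit_affine(data):
    """Recover (a, b) with f(x) = a*x + b mod N by searching each channel alone."""
    a_parts, b_parts = [], []
    for q in MODULI:
        a, b = next(
            (a, b) for a in range(q) for b in range(q)
            if all((a * x + b) % q == y % q for x, y in data)
        )
        a_parts.append(a)
        b_parts.append(b)
    return lift(a_parts), lift(b_parts)

mono, per_channel = candidates(MODULI, 1)
data = [(x, (37 * x + 42) % N) for x in range(7)]   # hypothetical hidden function
print(mono, per_channel, mono // per_channel)       # 44100 87 506
print(fit_affine(data))                             # (37, 42)
```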
CRT accelerates polynomial search (above). The same principle extends to PREDICATE search: finding ring elements satisfying algebraic conditions. Idempotents (e^2=e), involutions (x^2=1), zero divisors (z*w=0) -- all decompose per CRT channel. The search phase costs sum(q_i) per-channel checks. The enumeration phase costs product(per-channel solution counts).
mod-8 signature: Z/2 has 1 involution (-1=1 mod 2). Z/8 has 4 (Klein four-group). The depth-3 exponent quadruples involution count. Zero divisors are instant: CRT(w) tells you which channels are zero (unconstrained) and which require z=0. The decomposition IS the answer.
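The two-phase predicate search (per-channel check, then enumeration of combinations) can be sketched in Python at Z/970,200. The `solve` helper is illustrative and takes the predicate as a function of a residue and its modulus.

```python
from itertools import product

MODULI = [8, 9, 25, 49, 11]
COEFFS = [363825, 431200, 853776, 732600, 529200]
N = 970200

def solve(predicate):
    """Search each channel separately, then enumerate CRT combinations."""
    per_channel = [[r for r in range(q) if predicate(r, q)] for q in MODULI]
    return sorted(
        sum(r * c for r, c in zip(combo, COEFFS)) % N
        for combo in product(*per_channel)
    )

idempotents = solve(lambda r, q: (r * r) % q == r)
involutions = solve(lambda r, q: (r * r) % q == 1)
print(len(idempotents), len(involutions))   # 32 64
```

The search phase touches 102 residues in total; the mod-8 channel contributes 4 of the involution choices and every other channel contributes 2.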
If CRT channels are independent, a neural network layer decomposes into 7 small operations instead of 1 big one. For matrix multiply: monolithic uses (sum q_i)^3 FLOPs, block-diagonal uses sum(q_i^3). The ratio is 16x at 7 channels -- and it is scale-invariant (same ratio regardless of hidden dimension).
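The 16x figure and its scale invariance can be checked numerically. In this Python sketch the `scale` parameter stands in for a hidden-dimension multiplier, which is an assumption about what "hidden dimension" means here.

```python
from math import isclose

FULL_RING = [8, 9, 25, 49, 11, 13, 17]

def matmul_flop_ratio(widths, scale=1):
    """Dense (sum w)^3 cost divided by block-diagonal sum(w^3) cost."""
    widths = [w * scale for w in widths]
    return sum(widths) ** 3 / sum(w ** 3 for w in widths)

print(sum(q ** 3 for q in FULL_RING))          # 142956
print(round(matmul_flop_ratio(FULL_RING), 2))  # 16.09
print(isclose(matmul_flop_ratio(FULL_RING), matmul_flop_ratio(FULL_RING, scale=64)))  # True
```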
Attention must remain monolithic (coupling is cross-channel). But V-projection and FFN are block-diagonal -- 7 independent per-channel operations. The existing GPU benchmarks on this page already demonstrate per-channel dispatch. CRT Multiply processes 7 channels per element with zero cross-channel communication.
CRT-GPT-1 forward pass implemented in .ax. Block-diagonal matrix operations: per-channel matmul, matvec, layer norm, squaring activation (h^2 -- ring-native nonlinearity), and 7 independent softmaxes. Full pipeline: character embedding via CRT residues, block-diagonal FFN, per-channel softmax, CRT reconstruction of predicted character. 53 checks pass.
CRT embedding: character code c decomposes into 7 residues (c mod q_i). Each residue indexes a learned d_c-dimensional table. 132 total entries (8+9+25+49+11+13+17) vs 50K for BPE -- 380x smaller output vocabulary. CRT coupling provides zero-parameter attention: QK scores come from the ring's eigenvalue inner product, with 99.5% retrieval and 30x advantage over uniform (PROVED). Single full-coupling attention, no multi-head (which dilutes). Block-diagonal V projection per channel. 50 checks pass.
Mode-based CRT prediction picks the most common next element per channel. It works at ring scale (58% exact on 214M elements, above) but hits a structural ceiling: 74% of test positions fail, and 89% of those failures are MODEL-limited -- the correct character has nonzero training count but is not the mode. The model sees the answer but selects wrong.
The soft score IS a CRT-factored probability: each channel votes with its conditional frequency, not its mode. Channel independence means the sum factorizes -- no interaction terms. Small moduli (D=8, K=9) gain most because their count distributions are richest: 8 and 9 residue classes per context provide well-sampled distributions. Large moduli (b=49) gain little: sparse data collapses distribution to mode. 15 incremental approaches closed (THMs 93-109): mode, bloom variants, weighting, cross-channel joints. Distribution scoring is the capstone.
The hash representation collapses spatial structure: all 7 channels see the same data through slightly different hashes, producing informationally redundant channels (0.22% pairwise disagree rate). Lucas' theorem shows that binomial coefficients C(n,k) naturally create independent channel views:
Source code · Public domain (CC0)
.ax source compiled to WASM via self-hosting compiler. Zero HTML authored.