u/Thrumpwart — reddlx

CODA: Rewriting Transformer Blocks as GEMM-Epilogue Programs

Transformer training systems are built around dense linear algebra, yet a nontrivial fraction of end-to-end time is spent on surrounding memory-bound operators. Normalization, activations, residual updates, reductions, and related computations repeatedly move large intermediate tensors through global memory while performing little arithmetic, making data movement an increasingly important bottleneck in otherwise highly optimized training stacks. We introduce CODA, a GPU kernel abstraction that expresses these computations as GEMM-plus-epilogue programs. CODA is based on the observation that many Transformer operators exposed as separate framework kernels can be algebraically reparameterized to execute while a GEMM output tile remains on chip, before it is written to memory. The abstraction fixes the GEMM mainloop and exposes a small set of composable epilogue primitives for scaling, reductions, pairwise transformations, and accumulation. This constrained interface preserves the performance structure of expert-written GEMMs while remaining expressive enough to cover nearly all non-attention computation in the forward and backward pass of a standard Transformer block. Across representative Transformer workloads, both human- and LLM-authored CODA kernels achieve high performance, suggesting that GEMM-plus-epilogue programming offers a practical path toward combining framework-level productivity with hardware-level efficiency.
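For anyone who wants the general idea in code: below is a minimal pure-Python sketch of "GEMM plus epilogue", where an elementwise step runs on each output tile before the tile is stored, instead of as a second pass over the whole result. It is not CODA's interface. The function names, the tile size, and the residual-plus-ReLU epilogue are all assumptions made up for illustration.

```python
# Illustrative sketch only, not CODA's API; every name here is hypothetical.

def matmul_with_epilogue(a, b, epilogue, tile=2):
    """Compute epilogue(A @ B) one output tile at a time.

    `epilogue(i, j, value)` runs on each accumulated element while its tile
    is still in a local buffer, so the raw GEMM result is never stored as a
    separate intermediate matrix.
    """
    m, k, n = len(a), len(b), len(b[0])
    out = [[0.0] * n for _ in range(m)]
    for i0 in range(0, m, tile):
        for j0 in range(0, n, tile):
            rows = range(i0, min(i0 + tile, m))
            cols = range(j0, min(j0 + tile, n))
            # "Mainloop": accumulate one output tile in a local buffer.
            acc = {(i, j): sum(a[i][p] * b[p][j] for p in range(k))
                   for i in rows for j in cols}
            # "Epilogue": transform the tile, then write it out once.
            for (i, j), v in acc.items():
                out[i][j] = epilogue(i, j, v)
    return out


def unfused_reference(a, b, epilogue):
    """Same result in two passes: full GEMM first, then an elementwise pass."""
    k, n = len(b), len(b[0])
    c = [[sum(row[p] * b[p][j] for p in range(k)) for j in range(n)]
         for row in a]
    return [[epilogue(i, j, c[i][j]) for j in range(n)] for i in range(len(a))]


# Toy inputs; the residual-add + ReLU epilogue is an assumed example.
A = [[1.0, -2.0, 0.5], [0.0, 3.0, -1.0], [2.0, 1.0, 1.0]]
B = [[1.0, 0.0], [-1.0, 2.0], [4.0, -3.0]]
RESIDUAL = [[0.5, -0.5], [1.0, 1.0], [-10.0, 0.0]]


def residual_relu(i, j, v):
    return max(0.0, v + RESIDUAL[i][j])


fused = matmul_with_epilogue(A, B, residual_relu)
unfused = unfused_reference(A, B, residual_relu)
print(fused == unfused)  # True
```

Both paths give the same numbers. The difference on a GPU would be memory traffic, which this toy does not model.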

arxiv.org

u/Thrumpwart — 6 hours ago

▲ 8 r/LocalLLaMA

Vector Policy Optimization: Training for Diversity Improves Test-Time Search

Language models must now generalize out of the box to novel environments and work inside inference-scaling search procedures, such as AlphaEvolve, that select rollouts with a variety of task-specific reward functions. Unfortunately, the standard paradigm of LLM post-training optimizes a pre-specified scalar reward, often leading current LLMs to produce low-entropy response distributions and thus to struggle at displaying the diversity that inference-time search will require. We propose Vector Policy Optimization (VPO), an RL algorithm that explicitly trains policies to anticipate diverse downstream reward functions and to produce diverse solutions. VPO exploits that rewards are often vector-valued in practice, like per-test-case correctness in code generation or, say, multiple different user personas or reward models. VPO is essentially a drop-in replacement for the GRPO advantage estimator, but it trains the LLM to output a set of solutions where individual solutions specialize to different trade-offs in the vector reward space. Across four tasks, VPO matches or beats the strongest scalar RL baselines on test-time search (e.g. pass@k and best@k), with the gap widening as the search budget grows. For evolutionary search, VPO models unlock problems that GRPO models cannot solve at all. As test-time search becomes more standardized, optimizing for diversity may need to become the default post-training objective.
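A small sketch of the scalar baseline the abstract refers to, to show what gets lost. It implements the commonly described GRPO group-relative advantage (reward minus group mean, divided by group standard deviation) on rewards obtained by averaging a per-test-case vector. It does not implement VPO, whose estimator is not given in the abstract. The use of a population standard deviation, the epsilon, and the test results are assumptions.

```python
from statistics import mean, pstdev


def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages for one prompt's group of rollouts.

    Assumes the common formulation (r - mean) / (std + eps); population
    std is an arbitrary choice here, implementations differ.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def scalarize(vector_rewards):
    """Collapse each per-test-case reward vector to its mean pass rate."""
    return [mean(v) for v in vector_rewards]


# Hypothetical pass/fail results: 4 rollouts x 4 test cases.
group = [
    [1, 1, 0, 0],  # passes tests 0 and 1
    [0, 0, 1, 1],  # passes tests 2 and 3 (complementary to rollout 0)
    [1, 1, 0, 0],  # duplicate of rollout 0
    [1, 0, 0, 0],
]

scalars = scalarize(group)
advantages = grpo_advantages(scalars)

# Rollouts 0, 1 and 2 all get the same advantage, even though rollout 1
# covers tests that no other rollout passes.
print(advantages[0] == advantages[1] == advantages[2])  # True
```

Once the vector is averaged, the estimator cannot tell the complementary rollout from the duplicate. As I read the abstract, that is the information the vector reward is meant to preserve.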

arxiv.org

u/Thrumpwart — 8 hours ago

▲ 4 r/prolog

9950X3D2 SWI-Prolog Benchmarks

So I got myself an AMD 9950X3D2 with 3D V-Cache on both dies.

This thing is pretty fast...

terminal: ~/bench$ swipl run.pl

Program             Time     GC
――――――――――――――――――――――――――――――――
boyer               0.330  0.029
browse              0.302  0.000
chat_parser         0.333  0.000
crypt               0.364  0.000
derive              0.355  0.000
fast_mu             0.322  0.000
flatten             0.307  0.000
log10               0.322  0.000
meta_qsort          0.321  0.000
mu                  0.335  0.000
nand                0.339  0.000
nreverse            0.412  0.000
ops8                0.315  0.000
perfect             0.329  0.000
poly_10             0.368  0.000
prover              0.328  0.000
qsort               0.301  0.000
queens_8            0.332  0.000
query               0.333  0.000
reducer             0.325  0.000
sendmore            0.353  0.000
serialise           0.302  0.000
simple_analyzer     0.331  0.000
tak                 0.342  0.000
times10             0.368  0.000
divide10            0.396  0.000
unify               0.312  0.000
zebra               0.369  0.000
sieve               0.365  0.000
queens_clpfd        0.321  0.000
pingpong            0.383  0.000
fib                 0.552  0.000
moded_path          0.510  0.000
det                 0.498  0.000
eval                0.363  0.000
average             0.355  0.001

NReverse benchmark

--- Naive Reverse Benchmark (10000 items) ---
Time taken: 0.639 seconds
Total Inferences: 50,015,162
LIPS: 78244493.92
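A quick sanity check on those numbers, as a sketch. LIPS is inferences divided by seconds. The closed form `(n + 1) * (n + 2) // 2` assumes the textbook naive reverse built on append/3; the actual benchmark source is not shown here, so that is an assumption.

```python
def nrev_inferences(n):
    """Inference count for textbook naive reverse of an n-element list.

    n + 1 calls to the reverse predicate plus n * (n + 1) / 2 calls to
    append, which simplifies to (n + 1) * (n + 2) / 2.
    """
    return (n + 1) * (n + 2) // 2


# Figures copied from the output above.
reported_inferences = 50_015_162
reported_seconds = 0.639
reported_lips = 78_244_493.92

expected = nrev_inferences(10_000)
overhead = reported_inferences - expected

# LIPS recomputed from the rounded time shown in the output.
lips_from_rounded_time = reported_inferences / reported_seconds

# Elapsed time implied by the reported LIPS figure.
implied_seconds = reported_inferences / reported_lips

print(expected)   # 50015001
print(overhead)   # 161
print(round(lips_from_rounded_time))
print(round(implied_seconds, 4))
```

The 161 extra inferences are presumably harness overhead (list construction, timing calls). The small gap between the recomputed and reported LIPS is consistent with the displayed time being rounded to three decimals.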

reddit.com

u/Thrumpwart — 8 days ago

▲ 26 r/LocalLLaMA

Attention Drift: What Autoregressive Speculative Decoding Models Learn

Speculative decoding accelerates LLM inference by drafting future tokens with a small model, but drafter models degrade sharply under template perturbation and long-context inputs. We identify a previously-unreported phenomenon we call attention drift: as the drafter generates successive tokens within a speculation chain, attention progressively moves from the prompt onto its own recently-generated tokens. We observe this across both EAGLE3 drafters and MTP heads, suggesting drift is a property of drafter designs. We trace this to the un-normalized residual path between chain steps: the drafter's hidden state magnitude grows monotonically with chain depth, which exhibits dynamics consistent with additional pre-norm transformer layers stacked on the target rather than as a standalone autoregressive predictor. In order to limit the growth, we propose two architectural changes: Post-norm on the drafter hidden states and per-hidden-state RMSNorm after capturing target hidden states. Our interventions improve acceptance length over the current leading model, pre-norm EAGLE3, by up to 2× under template perturbation, 1.18× on long-context tasks, and 1.10× on seven standard benchmarks spanning multi-turn chat, math, and coding. Our changes also allow shorter train-time-test depths to generalize over longer drafting sequences.
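A toy sketch of the magnitude-growth point, not the paper's architecture. It carries a hidden state across chain steps through a residual add, once un-normalized and once with RMSNorm applied to the carried state, and records the RMS magnitude at each step. The constant update vector, the unit gain, and all function names are assumptions for illustration.

```python
import math


def rms(x):
    """Root-mean-square magnitude of a vector."""
    return math.sqrt(sum(v * v for v in x) / len(x))


def rms_norm(x, gain=None, eps=1e-6):
    """Standard RMSNorm: x / sqrt(mean(x^2) + eps), times a per-channel gain."""
    scale = 1.0 / math.sqrt(sum(v * v for v in x) / len(x) + eps)
    gain = gain or [1.0] * len(x)
    return [g * v * scale for g, v in zip(gain, x)]


def run_chain(h, update, steps, normalize):
    """Carry h across `steps` chain steps; return its magnitude per step."""
    magnitudes = []
    for _ in range(steps):
        h = [a + b for a, b in zip(h, update)]  # residual add
        if normalize:
            h = rms_norm(h)  # normalize the state passed to the next step
        magnitudes.append(rms(h))
    return magnitudes


# Hypothetical starting state and per-step update.
h0 = [1.0, 0.0, 0.0, 0.0]
step_update = [0.5, 0.5, 0.5, 0.5]

raw = run_chain(h0, step_update, steps=6, normalize=False)
normed = run_chain(h0, step_update, steps=6, normalize=True)

print([round(m, 3) for m in raw])
print([round(m, 3) for m in normed])
```

In this toy the un-normalized magnitude increases at every step, while the normalized one stays near 1 regardless of chain depth.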

arxiv.org

u/Thrumpwart — 10 days ago

▲ 251 r/LocalLLaMA

Taiwanese company Skymizer announces HTX301 - PCIe inference card with 384GB of Memory at ~240 Watts

skymizer.ai

u/Thrumpwart — 15 days ago

▲ 140 r/LocalLLaMA

I look forward to the Local LLM community getting llama.cpp to run on these. Could be a good value.

u/Thrumpwart — 23 days ago

▲ 282 r/LocalLLM+1 crossposts

Came across hipfire the other day. It's a brand new inference engine focused on all AMD GPUs (not just the latest).

GitHub.

It uses a special mq4 quantization method. The hipfire creator is pumping out models on Hugging Face.

I don't know enough about quantization to know how good these quants are in terms of quality, but as an RDNA3 aficionado I'm happy AMD is getting some attention.

Localmaxxing is a new LLM benchmarking site, and shows some pretty dramatic speedups for hipfire inference.

Edit: I should have just said hipfire - I don't think this is connected to AMD officially.

u/Thrumpwart — 25 days ago