r/CUDA

FlashAttention-2 in CuTe, from scratch -- a line-by-line walkthrough
▲ 48 r/CUDA

FlashAttention-2 in CuTe, from scratch -- a line-by-line walkthrough

Hey everyone, I spent a few months learning CuTe by re-implementing FA-2 from scratch on Ampere, then wrote up a thorough walkthrough of every important line in Tri Dao's source code.

CuTe notoriously has an extremely high learning curve and is extremely hard to interpret. Most of what's online about CuTe is either NVIDIA's reference docs (not really a beginner guide), the production source code, or partial deep-dives that cover one concept at a time. I tried to fill the gap by walking the whole kernel end-to-end at a depth where you can fully understand why each decision was made.

The blog is not a true beginner's guide in the sense that it's not for those who have never touched a kernel -- but, I tried to make it as accessible to anyone who only have a vague notion of even the most basic CUDA concepts.

We cover: swizzling and bank conflicts, tiled MMAs and fragment layouts, the LDSM atoms, V-transpose, online softmax via warp reductions, async copy pipelining, and the output store. I made my own diagrams and even include some code improvements as well.

The kernel hits the full performance of production FA-2's throughput on A100 (it's close to the exact algorithm, just stripped into its essential core).

Hoping this is useful for anyone trying to ramp on CuTe or read Tri Dao's source. Happy to answer questions in the comments!

blog.echen.io
u/phospheric — 2 days ago
▲ 45 r/CUDA+4 crossposts

Running DeepSeek-V4 locally with 4x legacy RTX 2080 Ti ($2k budget setup). Custom Turing kernels, W8A8 quantization, and 255 prefill tok/s!

Hey r/DeepSeek,

Who says we need an H100 cluster or the latest expensive GPUs to run frontier MoE models? I wanted to see how far we could push a single node of consumer legacy hardware, so we spent less than $2,500 total to build a budget machine that successfully runs DeepSeek-V4-Flash (284B total, 13B active) locally!

Surprisingly, we managed to hit around 255 prefill tokens/s with a very tight memory budget.

https://preview.redd.it/cfefgc71732h1.png?width=1772&format=png&auto=webp&s=5c673acca7a2a73cfbd0d2059e25102462c56dfc

Here is a quick breakdown of how we achieved this "legacy donkey pulling a massive MoE chariot" feat via hardware-software co-optimization:

⚡️ The Technical Breakthroughs

  1. Custom Turing CUDA Kernels: The 2080 Ti Tensor Cores are still capable, but PCIe Gen3 and VRAM bandwidth are huge bottlenecks. We rewrote custom CUDA kernels tailored specifically for the Turing architecture to accelerate W8A8 (INT8) matrix multiplication, heavily alleviating the bandwidth choke.
  2. Heterogeneous Inference: Optimized static memory splitting and dynamic offloading between the 4x 11/22GB VRAM and 1TB system RAM. 100% of the hardware capacity is utilized.
  3. Computation-Communication Overlap: Implemented a pipelined execution strategy to hide the massive multi-GPU communication overhead caused by MoE routing.

https://preview.redd.it/5ltwol3z632h1.png?width=2414&format=png&auto=webp&s=6c4c4dcf62737f7f5dcb9a5b8d4aa3f422f7edae

🖥️ Budget Hardware Specs

  • CPU: Intel Xeon E5-2696 v4 (The classic budget king for multi-core)
  • GPU: 4x RTX 2080 Ti (11/22GB each)
  • RAM: 1TB DDR4 ECC

The entire implementation, deployment script, and preliminary tech report are 100% open-sourced. I'd love to hear your thoughts, benchmarks, or feedback from fellow system/compiler hackers here!

🔗 GitHub Repository:https://github.com/lvyufeng/deepseek-v4-2080ti

(Note: I submitted the detailed report to arXiv a few days ago, but it’s currently caught in the manual moderation queue—likely because a rookie author throwing a 2080 Ti at DeepSeek-V4 triggered their review boundaries lol. Will update with the arXiv link once it's cleared!)

https://reddit.com/link/1thlbwe/video/lxhccfh2732h1/player

reddit.com
u/Known_Ice9380 — 3 days ago
▲ 22 r/CUDA+2 crossposts

Hi everyone,

I’m an independent developer with a background in algorithms, HPC, and robotics infrastructure. Recently I’ve been working on a lightweight inference engine built around hand-written CUDA kernels, focusing on small-batch and real-time performance (especially for VLA and robotics workloads).

Here are some recent results on Thor and Blackwell:

  • Pi0.5 — Jetson AGX Thor (SM110): 44 ms (23 Hz)
  • Pi0 — Jetson AGX Thor (SM110): 46 ms (22 Hz)
  • Pi0.5 — RTX 5090 (SM120): 17.58 ms (57 Hz)
  • Pi0 — RTX 5090 (SM120): 18.43 / 21.16 / 24.48 ms (54 / 47 / 41 Hz)
  • GROOT N1.6 — Jetson AGX Thor: 45 ms (T=50) / 41 ms (T=16) → 22 / 24 Hz
  • GROOT N1.6 — RTX 5090: 13.08 ms (T=50) / 12.53 ms (T=16) → 76 / 80 Hz
  • Pi0-FAST (token)
    • Thor: 8.1 ms/token (123 tok/s)
    • RTX 5090: 2.39 ms/token (418 tok/s)

The focus is on pushing true real-time inference under small-batch settings, which tends to be underserved by typical large-batch optimized stacks.

Still early, but happy to share more details or discuss if anyone is working on similar workloads 🙂

Feeback welcome!:https://github.com/LiangSu8899/FlashRT

u/Diligent-End-2711 — 3 days ago
▲ 8 r/CUDA+1 crossposts

Writing an LLM compiler from scratch [Part 3]: Autotuning — A Search Loop Over Tile-IR Rewrites

The third and final article of building a hackable ML compiler from scratch. The previous parts built a six-IR pipeline (Torch → Tensor → Loop → Tile → Kernel → CUDA) and lowered TinyLlama / Qwen2.5-7B through it.

Block sizes, register tiles, staging decisions, etc., were determined by a heuristic that didn't generalize beyond the matmul shapes it was fitted on.

This part swaps those heuristics for a search loop. An SP-MCTS that explores the cross-product of rule parameters, benchmarks each candidate, and persists winners in a SQLite cache keyed by structural op hash. The cache replays on subsequent compiles.

On RTX 5090, the tuned stack lands at geomean 0.96× vs PyTorch eager (vs 0.87× for the heuristic and 0.91× for torch.compile), with 32 of 84 kernel shapes faster than PyTorch hand-optimized kernels. Best kernels are 5.6× faster than PyTorch (tall-skinny matmuls).

Passes

Pass                 Forks
tileify              —
chunk_matmul_k       one per legal K-chunk size (divisors of K, 16..128)
split_matmul_k       apply or skip — turn K into a parallel reduction
cooperative_reduce   —
blockify_launch      one per threads-per-block ∈ {64,128,256,512}
chunk_reduce         —
stage_inputs         which inputs to stage in smem (2^k combinations)
register_tile        one per (F_M, F_N) divisor pair
permute_reg_tile     inner-loop order ∈ {km, mk}
double_buffer        apply or skip — split stage buffers for overlap
tma_copy             apply or skip on sm_90+
async_copy           apply or skip on sm_80+ (cp.async)
pad_smem             —
pipeline_k_outer     apply or skip
mark_unroll          —

A dense matmul with six staging-relevant inputs, three legal K-chunks, four threads-per-block values, eight register-tile shapes, two pipelining choices, and two double-buffering choices spans 2^6 × 3 × 4 × 8 × 2 × 2 ≈ 24,000 terminals.

Search loop

SP-MCTS with max-Q propagation, normalized UCB1, and a patience termination criterion (stop after N consecutive measured terminals without a new best):

def sp_mcts(root, patience, c):
    best_reward = 0.0
    visits_at_best = 0
    while root.visits - visits_at_best < patience:
        # SELECT
        # descend to a frontier node by UCB1 over normalized max-Q
        node = root
        while node.children and node.has_unfinished_descendant():
            node = max(
                (ch for ch in node.children if ch.has_unfinished_descendant()),
                key=lambda ch: ucb(ch, node, c),
            )

        # SIMULATE / EXPAND — advance one rule
        # spawn forks or bench a terminal
        result = advance_one_rule(node.candidate)
        if result.forks:
            node.children = [Node(c, parent=node) for c in result.forks]
            continue
        reward = 1.0 / bench_latency(result.cuda_op)

        # BACKPROP — walk parent links
        # bump visits, max-update best_reward
        n = node
        while n is not None:
            n.visits += 1
            n.best_reward = max(n.best_reward, reward)
            n = n.parent

Structural keys

The entire cache is keyed by structural digests that describe the kernel's structure. To produce a structural key, eight normalization passes are used: drop size-1 free axes, sequential SSA rename, sort commutative args, canonicalize external buffer names, collapse op clusters: sub ↔ add (FMA), mod ↔ divide (SFU), the compare family; then hash the result.

Under this transformation, the following ops become identical and the same scheduling decisions will be applied:

# Op A
for i in range(M):
    for j in range(1):
        tmp = load(X[i])
        result = tmp + bias[i]
        Y[i, j] = result

# Op B
# different names and '-' instead of '+'
for i in range(M):
    a = load(input0[i])
    b = load(input1[i]) 
    c = a - b
    output0[i] = c

Run CLI example from the repo:

# Eager 25 µs, Deplodock 38.9 µs (0.64× eager)
deplodock run --bench -c \
  "a=torch.randn(1,32,2048);b=torch.randn(2048,5632);torch.matmul(a,b)"

# Tune (default patience 60). 207 variants explored in 67.7s,
# best 22.54 µs at BM=32, BN=64, F_M=8, F_N=2 (worst was 293.75 µs).
deplodock tune -v -c \
  "a=torch.randn(1,32,2048);b=torch.randn(2048,5632);torch.matmul(a,b)"

# Re-run with the cached knobs — 22.7 µs (1.10× eager)
deplodock run --bench -c \
  "a=torch.randn(1,32,2048);b=torch.randn(2048,5632);torch.matmul(a,b)"
open.substack.com
u/NoVibeCoding — 3 days ago
▲ 32 r/CUDA+1 crossposts

Built a GPT-2 inference engine from scratch in CUDA.

Includes the core transformer pipeline:

- tiled GEMM kernels

- fused attention + softmax kernels

- multi-head causal self-attention

- transformer blocks + MLPs

- KV cache + autoregressive token generation, etc.

Also built the runtime around it:
- weight loading, tensor routing, CUDA memory management, generation loop, profiling, benchmarking, etc.

- Current peak throughput is around ~190 tokens/sec on GPT-2.

Everything was profiled and tested on my RTX 3050 Laptop GPU with only 4GB VRAM.

Definitely not the fastest implementation possible and there’s still a lot that could be improved, but this project was mainly about learning CUDA, transformer inference, profiling, and GPU systems properly from scratch.

repo for more details:

https://github.com/Mog9/gpt2-inference

u/EigenMog — 3 days ago
▲ 6 r/CUDA+2 crossposts

do CS/AI students actually need powerful GPUs anymore, or is RTX 5090 overkill?

​

I’m a first-year CS student trying to buy ONE main machine for the next 4–5 years, and after weeks of research I’m honestly more confused than when I started.

My long-term interests are:

Software engineering

AI/ML engineering

Robotics

Systems programming

Cloud/devtools/infrastructure

Possibly research/startup work later

I’m NOT primarily buying this for gaming.

What’s confusing me is that people online seem completely divided:

One side says:

“You don’t need a GPU anymore.”

“Just use cloud GPUs.”

“Get a MacBook.”

“Students barely train models locally.”

“A lightweight laptop with good battery life matters more.”

The other side says:

“CUDA/NVIDIA is essential.”

“Local experimentation matters a lot.”

“VRAM will become more important because of local LLMs.”

“You’ll regret not having GPU power later.”

“Mac compatibility can become annoying for some AI/robotics workflows.”

Right now I’m considering:

Lenovo Legion Pro 7i

ASUS Strix Scar

Alienware Area-51

Hp omen

Typical configs:

RTX 5080 / 5090(or this is overkill 5060 or 5070)

64 GB RAM

2 TB SSD

Budget is flexible enough for a 5090 laptop, but I genuinely don’t know whether that’s smart planning or just overspending.

What I actually care about:

Reliability over several years

Good thermals/cooling

Lower fan noise if possible

Comfortable keyboard/build quality(don't like to use keyboard only mouse)

Stable performance during long workloads

Portability for college/internships

Linux compatibility

Battery life when unplugged

Upgradeability

Ability to comfortably handle:

software engineering workflows,

Docker/VMs,

AI/ML experimentation,

robotics tools,

CUDA development,

multiple IDEs,

multitasking,

and possibly local inference later(i get excited by hugging face opensource models but learning and developing career in mle/ai >>>> flexing, rgb lights )

I’m trying to optimize for:

learning,

experimentation,

projects,

internships,

and long-term usability, not just benchmark numbers.

What I’m struggling to understand from people already in industry/research:

Starting from scratch, do students even need strong local GPUs anymore?

Is cloud + lightweight laptop actually the smarter path now?

For software engineering + AI/ML + robotics, how important is NVIDIA/CUDA in practice?

Is RTX 5090 laptop genuinely useful long term, or mostly unnecessary for students?

Would RTX 5080 already be enough for almost everything realistically done in college?

Is desktop + lighter laptop a better setup than one powerful laptop?

Do heavy high-end laptops become annoying to carry/use daily?

How much do thermals, fan noise, and battery life affect real-world experience after the “new toy excitement” wears off?

Which brands currently balance:

cooling,

reliability,

portability,

fan noise,

battery,

and sustained performance the best?

Would especially appreciate advice from people working in:

software engineering,

ML/AI,

robotics,

CUDA,

systems,

infrastructure,

or research.

reddit.com
u/Emojers — 6 days ago
▲ 22 r/CUDA+1 crossposts

Built an open source GPU bottleneck analyzer for PyTorch/CUDA. Looking for honest feedback

I've been building an open source tool called Fournex that turns Nsight Compute output into specific, evidence-backed optimization suggestions for CUDA kernels.

What it does

You give it an NCU profile (or a PTX file), and it:

  • classifies bottlenecks from hardware-counter evidence
  • ranks issues by severity
  • generates concrete optimization recommendations tied directly to the metrics that triggered them

What it currently detects

  • Uncoalesced global memory access (sectors/request ratio)
  • L1/L2 cache thrashing
  • Memory bandwidth saturation
  • Tensor core underutilization
  • Warp stall patterns:
    • barrier stalls
    • memory throttle
    • scoreboard stalls
  • Low issue-slot utilization
  • Register pressure / spills (via PTX static analysis)

Concrete example

I tested it on a deliberately broken GEMM kernel with four planted flaws:

  1. stride-K uncoalesced access
  2. no shared memory tiling
  3. FP32 only execution (tensor cores idle)
  4. unnecessary __syncthreads() calls inside the reduction loop

It correctly identified all four and recommended:

  • improving memory coalescing
  • adding shared memory tiling
  • enabling AMP / tensor core usage
  • removing unnecessary barriers

Each recommendation includes:

  • the exact metric that triggered it
  • why the metric matters
  • numbered remediation steps

Workflow

# Analyze existing Nsight Compute CSV output
frx profile --ncu profile.csv

# Or let frx run NCU for you
# (Linux only, may require sudo for hardware counters)
frx profile -- ./my_binary

# Static PTX analysis
frx profile --ptx kernel.ptx

On Windows, you can export the CSV from Nsight Compute and pass it to:

frx profile --ncu profile.csv

No GPU is required at analysis time.

One thing I'm intentionally trying not to do

I don't want this to become an LLM wrapper that generates plausible sounding optimization advice.

Every recommendation is triggered by explicit thresholds on measured hardware counters. If the metric evidence isn't present, the recommendation doesn't fire.

Repo

https://github.com/jorgevee/fournex

Would appreciate feedback from people who profile CUDA workloads seriously or hobby:

  • What bottlenecks are hardest to diagnose today?
  • What’s missing from existing tooling?
  • Would you trust automated optimization suggestions? Under what conditions?
  • What would make something like this useful in your workflow?

And if the direction seems interesting, don't be shy to star the repo

u/jvbiz — 4 days ago
▲ 30 r/CUDA+1 crossposts

Cutile-rs beta is out !

This release today contains more details , its stable and achieves peak performance on B200 blackwell.

Tbh the syntax is far better than python Triton for me , without all of the excessive slicing and brainfuck like incantations that need to be written between the square brackets.

u/Daemontatox — 6 days ago
▲ 23 r/CUDA

Whos got a "plug" forCUDA courses - SR Platform Engineer wanting CUDA exp

I'm tryna make some money, I'm at 180 rn and I'm a SR Platform engineer. I did SWE for a few years, but I wanna pickup some experience. Whats the plug.. or should I go learn more gpu infrastructure engineering

reddit.com
u/Ok-Trainer-6407 — 6 days ago
▲ 38 r/CUDA

Dynamic persistent tile scheduling with Cluster Launch Control on Blackwell

Persistent tile scheduling is crucial for hiding epilogue latency in various kernels. Simply, a threadblock (CTA) is assigned available worktiles continuously until the kernel has completed. In its most naive form, a persistent tile scheduler will assign to CTA X successive tiles of index n*X and "delinearize" this list to compute the proper logical grid coordinates. However, when one has imbalanced workloads, CTAs may be assigned inequal work, a problem that compounds to a phenomenon of "CTA drift"; in this case, the persistent scheduler may perform significantly worse thanan ordinary single tile scheduler.

The solution to this problem is dynamic persistent schedulers: instead of assigning a pre-determined list of worktiles to a CTA, have a CTA claim the next available tile and start computing on that immediately. The common approach to dynamic persistence is to keep a global memory semaphore; once a CTA has finished its computation, it reads the semaphore and then atomically adds to increment. This adds enormous engineering complexity and has a lot of room for error.

Cluster Launch Control (CLC), introduced on Blackwell, approaches dynamic persistent scheduling with hardware-level instructions, alleviating much of the engineering challenge associated with the semaphore-based approach.

Please enjoy our recent blog on CLC, with examples in CuTeDSL! While the post focuses on datacenter Blackwell (Sm100), CLC is available on consumer cards as well (Sm120).

research.colfax-intl.com
u/Logical-Try-4084 — 8 days ago
▲ 62 r/CUDA+1 crossposts

Writing an LLM compiler from scratch [Part 2]: Lowering to a GPU Schedule

The modern ML (LLM) compiler stack is brutal. TVM is 500K+ lines of C++. PyTorch piles Dynamo, Inductor, and Triton on top of each other. I built a hackable LLM compiler from scratch and am documenting the process. It takes a small model (TinyLlama, Qwen2.5-7B) and lowers it to a sequence of CUDA kernels through six IRs.

Currently, on RTX 5090, the emitted FP32 kernels run at geomean 1.11× vs PyTorch eager and 1.20× vs torch.compile, with full-block parity on TinyLlama-128 and Qwen2.5-7B at seq=128. Wins on small reductions / SDPA / kv-projections (up to 4.7×); losses on dense matmul at seq=512.

Part 1 took an RMSNorm layer end-to-end and walked the upper half of that pipeline in detail. This second part closes the gap and explains Tile IR, Kernel IR, and associated lowering rules in depth.

Full article: A Principled ML Compiler Stack in 5,000 Lines of Python Repo: deplodock

The article focuses on producing a GPU schedule for an operation written in loop-nest form (Loop IR). Example for RMSNorm:

v0 = reciprocal(2048)
for a0 in 0..32:  # free
    for a1 in 0..2048:  # reduce
        in2 = load x[0, a0, a1]
        v1 = multiply(in2, in2)
        acc0 <- add(acc0, v1)
    v2 = multiply(acc0, v0)
    v3 = add(v2, 1e-06)
    v4 = rsqrt(v3)
    for a2 in 0..2048:  # free
        in3 = load x[0, a0, a2]
        in4 = load p_weight[a2]
        v5 = multiply(in3, v4)
        v6 = multiply(v5, in4)
        merged_n0[0, a0, a2] = v6

The stack mimics a sequence of optimization steps a CUDA engineer would perform when optimizing kernels: stage inputs to smem, reduce bank conflicts, increase occupancy, and so on.

LoopOp
  │
  ▼
[001] tileify                 — lift outer free Loops to thread axes
[002] chunk_matmul_k          — chunk the K reduce into K-outer × K-inner (intra-CTA)
[003] split_matmul_k          — promote the K-outer chunk loop into a grid dimension
[004] cooperative_reduce      — let multiple threads share one reduce; tree-merge with Combine
[005] blockify_launch         — pick block extents; partition free axes into BLOCK and THREAD
[006] chunk_reduce            — chunk non-matmul reduces so their Loads fit in shared memory
[007] stage_inputs            — hoist hot input slabs into Stage nodes
[008] register_tile           — replicate the inner tile so each thread owns a register block
[009] permute_register_tile   — reorder the register strip so bank-conflicting loads land on far columns
[010] double_buffer           — promote K-outer Stages to BufferedStage (ping-pong)
[011] tma_copy                — narrow eligible BufferedStages to TmaBufferedStage (sm_90+)
[012] split_inner_for_swizzle — split the inner cache axis of a TmaBufferedStage for swizzle
[013] async_copy              — narrow the rest to AsyncBufferedStage (cp.async, sm_80+)
[014] pad_smem                — pad shared-memory strides to break bank conflicts
[015] pipeline_k_outer        — rotate the K-outer loop into prologue/steady-state/epilogue (cp.async + TMA)
[016] mark_unroll             — annotate small inner loops for #pragma unroll
  │
  ▼
TileOp (fully scheduled)

Each stage can be reproduced with a CLI command. For example, the stage_inputs pass stages input buffers into smem if possible and if there is a benefit in doing that (inputs are being read multiple times within CTA). To see it, the following command can be used:

deplodock compile \
  -c "torch.nn.RMSNorm(2048)(torch.randn(1,32,2048))" \
  --ir tile -vv \
  | awk '/^>>> t:007/,/^<<< t:007/'
>>> t:007_stage_inputs
@@ matched at rms_norm (in-place) @@
@@ -2,6 +2,7 @@
   v0 = reciprocal(2048)
   Tile(axes=(a0:256=THREAD, a1:32=BLOCK)):
+      x_smem = Stage(x, origin=(0, a1, 0), slab=(a2:2048@2))
       StridedLoop(a2 = a0; < 2048; += 256):  # reduce
-          in2 = load x[0, a1, a2]
+          in2 = load x_smem[a2]
           v1 = multiply(in2, in2)
           acc0 <- add(acc0, v1)
@@ -11,5 +12,5 @@
       v4 = rsqrt(v3)
       StridedLoop(a2 = a0; < 2048; += 256):  # free
-          in3 = load x[0, a1, a2]
+          in3 = load x_smem[a2]
           in4 = load p_weight[a2]
           v5 = multiply(in3, v4)
<<< t:007_stage_inputs

The final CUDA kernel for the RMSNorm layer:

deplodock compile \
  -c "torch.nn.RMSNorm(2048)(torch.randn(1,32,2048))" \
  --target sm_120 --ir cuda
extern "C" __global__
__launch_bounds__(256) void k_rms_norm_reduce(
    const float* x, const float* p_weight, float* rms_norm) {
    float v0 = 1.0f / 2048.0f;
    int a1 = blockIdx.x;
    int a0 = threadIdx.x;
    int lane = threadIdx.x & 31;
    int warp = threadIdx.x >> 5;
    float acc0 = 0.0f;
    __shared__ float x_smem[2048];
    for (int x_smem_flat = a0; x_smem_flat < 2048; x_smem_flat += 256) {
        float x_smem_v = x[a1 * 2048 + x_smem_flat];
        x_smem[x_smem_flat] = x_smem_v;
    }
    __syncthreads();
    for (int a2 = a0; a2 < 2048; a2 += 256) {
        float in2 = x_smem[a2];
        float v1 = in2 * in2;
        acc0 += v1;
    }
    float acc0_w = acc0;
    acc0_w = acc0_w + __shfl_xor_sync(0xffffffff, acc0_w, 16);
    acc0_w = acc0_w + __shfl_xor_sync(0xffffffff, acc0_w, 8);
    acc0_w = acc0_w + __shfl_xor_sync(0xffffffff, acc0_w, 4);
    acc0_w = acc0_w + __shfl_xor_sync(0xffffffff, acc0_w, 2);
    acc0_w = acc0_w + __shfl_xor_sync(0xffffffff, acc0_w, 1);
    __shared__ float acc0_smem[8];
    if (lane == 0) {
        acc0_smem[warp] = acc0_w;
    }
    __syncthreads();
    for (int s = 4; s > 0; s >>= 1) {
        if (warp < s) {
            acc0_smem[warp] = acc0_smem[warp] + acc0_smem[warp + s];
        }
        __syncthreads();
    }
    float acc0_b = acc0_smem[0];
    float v2 = acc0_b * v0;
    float v3 = v2 + 1e-06f;
    float v4 = rsqrtf(v3);
    for (int a2 = a0; a2 < 2048; a2 += 256) {
        float in3 = x_smem[a2];
        float in4 = p_weight[a2];
        float v5 = in3 * v4;
        float v6 = v5 * in4;
        rms_norm[a1 * 2048 + a2] = v6;
    }
}
cloudrift.ai
u/NoVibeCoding — 10 days ago
▲ 6 r/CUDA

Nvidia Interview Help

I am interviewing for System Software - Autonomous Vehicles. Gave my 2nd round 2 weeks ago. Followed up on phone multiple time, HR said feedback is positive. Havent heard back for a week with the F2F interview timeline. Is this normal?

Also how many rounds do they generally take? And what would they be asking in F2F interviews?

reddit.com
u/Gullible_Stomach6765 — 13 days ago
▲ 19 r/CUDA

For edge inference, when do you drop below TensorRT/ONNX and write custom CUDA kernels?

Question for people who do CUDA work on production inference paths.

For large vision / multimodal models running on edge devices, the first pass is usually export/compile/quantize with TensorRT, ONNX Runtime, vendor SDKs, etc. But sometimes a small set of operators or pre/post-processing steps dominates the latency trace enough that custom CUDA kernels become worth it.

Recent datapoint from a Jetson Orin NX deployment I worked on: multimodal classifier, 111ms cold start, 100% of decisions inside a 150ms budget, zero cloud calls.

Curious how CUDA folks decide when custom kernels are worth the maintenance cost:

- What trace/profile signs make you reach for custom CUDA?

- Do you usually target model ops, preprocessing, memory layout/conversion, or batching?

- How do you keep custom kernels portable across Jetson vs larger NVIDIA GPUs?

- Any profiling workflow you trust for this kind of edge latency work?

reddit.com
u/Hairy_Strawberry7028 — 13 days ago
▲ 2 r/CUDA

SASS King Part 2: reverse-engineering ptxas heuristic decisions and what the compiled binary actually reveals

Published the second SASS King article. The first one covered reading SASS from first principles. This one addresses the layer above: what ptxas actually decides at compile time, how those decisions become visible in the binary, and what prior work exists in this space.

A few points the article develops in detail:

Huerta et al. 2025 confirmed on Ampere that control codes in SASS are not performance hints. The hardware does not verify register dependencies for fixed-latency instructions. ptxas embeds stall counts and dependency counters directly into the binary, and the GPU executes them without checking. Wrong value means incorrect results, not degraded performance.

Yan et al. 2026 reverse-engineered the closed-source driver command stream below ptxas. On an A40, 94% of what Nsight reports as "CUDA HW" for an 8-byte transfer is driver and submission overhead. Between CUDA 11.8 and 13.0, same kernel graph, same hardware, launch overhead dropped from 209μs to 5.9μs with zero SASS changes. Purely driver-side command stream reorganization.

Redplait extracted instruction encoding tables directly from the ptxas binary and built ced for inline cubin patching. Kuterdinel reverse-engineered SM90a ISA encoding via nvdisasm fuzzing. Neither covers SM120 or addresses compiler heuristic analysis.

SASS King sits at the intersection: empirical study of ptxas decisions on SM120, their SASS-level signatures, and their measurable performance implications. The long-term direction is a pattern library, production binary audits, and the encoding foundation for eventual assembler-level tooling on Blackwell.

Article: https://florianmattana.com/posts/sass-king-part-2-reading-the-compiler-mind/

Repo: https://github.com/florianmattana/sass-king

Corrections and feedback welcome.

reddit.com
u/CurrentLawfulness358 — 14 days ago