u/jvbiz

Built an open source GPU bottleneck analyzer for PyTorch/CUDA. Looking for honest feedback


I've been building an open source tool called Fournex that turns Nsight Compute output into specific, evidence-backed optimization suggestions for CUDA kernels.

What it does

You give it an NCU profile (or a PTX file), and it:

  • classifies bottlenecks from hardware-counter evidence
  • ranks issues by severity
  • generates concrete optimization recommendations tied directly to the metrics that triggered them

What it currently detects

  • Uncoalesced global memory access (sectors/request ratio)
  • L1/L2 cache thrashing
  • Memory bandwidth saturation
  • Tensor core underutilization
  • Warp stall patterns:
    • barrier stalls
    • memory throttle
    • scoreboard stalls
  • Low issue-slot utilization
  • Register pressure / spills (via PTX static analysis)
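To make the sectors/request ratio in the first item concrete, here is a small worked calculation in Python. It is only a sketch of the arithmetic, not Fournex's code: it assumes 32-byte sectors, a 32-thread warp, a sector-aligned base address, and 4-byte elements, and the function name is made up for illustration.

```python
SECTOR_BYTES = 32  # global memory traffic is counted in 32-byte sectors
WARP_SIZE = 32     # threads per warp


def sectors_per_request(stride_elems: int, elem_bytes: int = 4) -> int:
    """Count distinct sectors touched when lane i loads element i * stride_elems.

    Assumes the base address is sector-aligned (illustrative simplification).
    """
    sectors = set()
    for lane in range(WARP_SIZE):
        first = lane * stride_elems * elem_bytes
        last = first + elem_bytes - 1
        sectors.update(range(first // SECTOR_BYTES, last // SECTOR_BYTES + 1))
    return len(sectors)


if __name__ == "__main__":
    print(sectors_per_request(1))     # unit stride: 4 sectors for one warp-wide load
    print(sectors_per_request(1024))  # stride-K with K = 1024: 32 sectors
```

With unit stride a warp's 128 bytes fit in 4 sectors, so a ratio far above 4 for 4-byte loads is the kind of evidence that points to uncoalesced access.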

Concrete example

I tested it on a deliberately broken GEMM kernel with four planted flaws:

  1. stride-K uncoalesced access
  2. no shared memory tiling
  3. FP32-only execution (tensor cores idle)
  4. unnecessary __syncthreads() calls inside the reduction loop

It correctly identified all four and recommended:

  • improving memory coalescing
  • adding shared memory tiling
  • enabling AMP / tensor core usage
  • removing unnecessary barriers

Each recommendation includes:

  • the exact metric that triggered it
  • why the metric matters
  • numbered remediation steps
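As a rough picture of what one such recommendation could look like as data, here is a minimal Python sketch. The class, field names, and example values are all hypothetical and only mirror the three items above; the real output format in the repo may differ.

```python
from dataclasses import dataclass, field


@dataclass
class Recommendation:
    """Hypothetical record: triggering metric, rationale, numbered steps."""
    title: str
    metric: str       # name of the counter that triggered the rule
    value: float      # measured value
    threshold: float  # value the rule compares against
    why: str
    steps: list[str] = field(default_factory=list)

    def render(self) -> str:
        lines = [
            self.title,
            f"  metric: {self.metric} = {self.value:g} (threshold {self.threshold:g})",
            f"  why: {self.why}",
        ]
        lines += [f"  {i}. {step}" for i, step in enumerate(self.steps, start=1)]
        return "\n".join(lines)


if __name__ == "__main__":
    # Example values are invented for illustration.
    rec = Recommendation(
        title="Improve memory coalescing",
        metric="sectors_per_request",
        value=32.0,
        threshold=8.0,
        why="Each warp-wide load touches far more sectors than it needs.",
        steps=["Make adjacent threads read adjacent addresses.",
               "Re-profile and compare the ratio."],
    )
    print(rec.render())
```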

Workflow

# Analyze existing Nsight Compute CSV output
frx profile --ncu profile.csv

# Or let frx run NCU for you
# (Linux only, may require sudo for hardware counters)
frx profile -- ./my_binary

# Static PTX analysis
frx profile --ptx kernel.ptx

On Windows, you can export the CSV from Nsight Compute and pass it to:

frx profile --ncu profile.csv

No GPU is required at analysis time.
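Since analysis only needs the exported file, the first step can be plain CSV parsing. The sketch below is an assumption about how that could work, not the tool's actual loader: it assumes a long-format export with "Kernel Name", "Metric Name", and "Metric Value" columns, and the metric names in the sample are shortened placeholders rather than real counter names.

```python
import csv
import io
from collections import defaultdict

# Invented sample in the assumed long format (one row per kernel/metric pair).
SAMPLE = '''"Kernel Name","Metric Name","Metric Unit","Metric Value"
"gemm_naive","sectors_per_request","","32"
"gemm_naive","dram_throughput_pct","%","91.5"
"gemm_tiled","sectors_per_request","","4"
'''


def load_metrics(text: str) -> dict:
    """Group numeric metric values by kernel name."""
    per_kernel = defaultdict(dict)
    for row in csv.DictReader(io.StringIO(text)):
        raw = row["Metric Value"].replace(",", "")  # drop thousands separators
        try:
            value = float(raw)
        except ValueError:
            continue  # skip non-numeric values
        per_kernel[row["Kernel Name"]][row["Metric Name"]] = value
    return dict(per_kernel)


if __name__ == "__main__":
    for kernel, metrics in load_metrics(SAMPLE).items():
        print(kernel, metrics)
```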

One thing I'm intentionally trying not to do

I don't want this to become an LLM wrapper that generates plausible-sounding optimization advice.

Every recommendation is triggered by explicit thresholds on measured hardware counters. If the metric evidence isn't present, the recommendation doesn't fire.
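The gating idea can be sketched in a few lines of Python. This is a simplified illustration, not the rules shipped in the repo: the metric keys and threshold values are hypothetical placeholders.

```python
from typing import Callable, NamedTuple


class Rule(NamedTuple):
    metric: str                     # counter the rule depends on
    fires: Callable[[float], bool]  # explicit threshold check
    advice: str


# Hypothetical thresholds chosen only for the example.
RULES = [
    Rule("sectors_per_request", lambda v: v > 8.0, "improve memory coalescing"),
    Rule("barrier_stall_pct", lambda v: v > 30.0, "remove unnecessary barriers"),
]


def evaluate(metrics: dict[str, float]) -> list[str]:
    """Return advice only for rules whose metric is present and over threshold."""
    findings = []
    for rule in RULES:
        value = metrics.get(rule.metric)
        if value is None:
            continue  # no measured evidence, so the rule cannot fire
        if rule.fires(value):
            findings.append(f"{rule.advice} ({rule.metric} = {value:g})")
    return findings


if __name__ == "__main__":
    print(evaluate({"sectors_per_request": 32.0}))
    print(evaluate({}))
```

A missing counter is treated as "no finding" rather than guessed at, which is the property described above.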

Repo

https://github.com/jorgevee/fournex

Would appreciate feedback from people who profile CUDA workloads, professionally or as a hobby:

  • What bottlenecks are hardest to diagnose today?
  • What’s missing from existing tooling?
  • Would you trust automated optimization suggestions? Under what conditions?
  • What would make something like this useful in your workflow?

And if the direction seems interesting, feel free to star the repo.

u/jvbiz — 4 days ago