
Built an open source GPU bottleneck analyzer for PyTorch/CUDA. Looking for honest feedback
I've been building an open source tool called Fournex that turns Nsight Compute output into specific, evidence-backed optimization suggestions for CUDA kernels.
What it does
You give it an NCU profile (or a PTX file), and it:
- classifies bottlenecks from hardware-counter evidence
- ranks issues by severity
- generates concrete optimization recommendations tied directly to the metrics that triggered them
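To make the classify-and-rank idea concrete, here is a minimal sketch of threshold-based classification with severity ranking. It is not Fournex's actual code: the metric keys, thresholds, and the `classify` helper are all hypothetical, and severity is assumed to be "how far past the threshold" as a simple ratio.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    name: str
    metric: str
    value: float
    threshold: float
    severity: float  # ratio describing how far the value is past the threshold


# Hypothetical rules: (finding name, metric key, threshold, direction).
RULES = [
    ("uncoalesced_global_access", "sectors_per_request", 8.0, "above"),
    ("low_issue_slot_utilization", "issue_slot_util_pct", 40.0, "below"),
    ("memory_bandwidth_saturation", "dram_throughput_pct", 80.0, "above"),
]


def classify(metrics):
    """Return the findings whose thresholds are crossed, most severe first."""
    findings = []
    for name, key, threshold, direction in RULES:
        if key not in metrics:
            continue  # no counter evidence, so this rule is skipped
        value = metrics[key]
        if direction == "above" and value > threshold:
            severity = value / threshold
        elif direction == "below" and value < threshold:
            severity = threshold / max(value, 1e-9)
        else:
            continue
        findings.append(Finding(name, key, value, threshold, severity))
    return sorted(findings, key=lambda f: f.severity, reverse=True)


# Made-up counter values for illustration only.
example = {
    "sectors_per_request": 32.0,
    "issue_slot_util_pct": 30.0,
    "dram_throughput_pct": 50.0,
}
ranked = classify(example)
for f in ranked:
    print(f"{f.name}: {f.metric}={f.value} (threshold {f.threshold})")
```

With these made-up numbers, the uncoalesced-access finding ranks first because it overshoots its threshold by the widest margin, and the bandwidth rule does not appear at all.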
What it currently detects
- Uncoalesced global memory access (sectors/request ratio)
- L1/L2 cache thrashing
- Memory bandwidth saturation
- Tensor core underutilization
- Warp stall patterns:
  - barrier stalls
  - memory throttle
  - scoreboard stalls
- Low issue-slot utilization
- Register pressure / spills (via PTX static analysis)
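For readers unfamiliar with the sectors/request ratio, the arithmetic behind it can be sketched as follows. This is a toy model, not how the tool or the hardware counters work internally: it assumes 32-byte sectors, a 32-thread warp, and one load per thread, and the `sectors_per_request` function is a hypothetical helper.

```python
SECTOR_BYTES = 32  # size of one global memory sector
WARP_SIZE = 32     # threads issuing one warp-wide request


def sectors_per_request(stride_elems, elem_bytes=4, base_addr=0):
    """Count the distinct 32-byte sectors touched by one warp-wide load.

    stride_elems is the distance, in elements, between the addresses
    read by neighbouring threads (1 means fully contiguous).
    """
    sectors = set()
    for lane in range(WARP_SIZE):
        first = base_addr + lane * stride_elems * elem_bytes
        last = first + elem_bytes - 1
        sectors.update(range(first // SECTOR_BYTES, last // SECTOR_BYTES + 1))
    return len(sectors)


coalesced = sectors_per_request(stride_elems=1)    # contiguous floats
strided = sectors_per_request(stride_elems=1024)   # e.g. stepping by a row of 1024 floats
print(coalesced, strided)
```

Contiguous 4-byte loads touch 4 sectors per request (128 bytes), while a large stride puts every thread in its own sector, giving 32. A measured ratio far above 4 for FP32 loads is therefore a strong hint that accesses are not coalesced.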
Concrete example
I tested it on a deliberately broken GEMM kernel with four planted flaws:
- stride-K uncoalesced access
- no shared memory tiling
- FP32-only execution (tensor cores idle)
- unnecessary __syncthreads() calls inside the reduction loop
It correctly identified all four and recommended:
- improving memory coalescing
- adding shared memory tiling
- enabling AMP / tensor core usage
- removing unnecessary barriers
Each recommendation includes:
- the exact metric that triggered it
- why the metric matters
- numbered remediation steps
Workflow
# Analyze existing Nsight Compute CSV output
frx profile --ncu profile.csv
# Or let frx run NCU for you
# (Linux only, may require sudo for hardware counters)
frx profile -- ./my_binary
# Static PTX analysis
frx profile --ptx kernel.ptx
On Windows, you can export the CSV from Nsight Compute and pass it to:
frx profile --ncu profile.csv
No GPU is required at analysis time.
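Since analysis works from an exported file, the general idea can be shown with nothing but the standard library. The sketch below is not the tool's parser: it assumes a long-format CSV with "Kernel Name", "Metric Name", and "Metric Value" columns, the sample rows and the `naive_gemm` kernel name are invented, and the exact columns should be checked against a real Nsight Compute export.

```python
import csv
import io
from collections import defaultdict

# Invented sample in an assumed long format: one row per (kernel, metric).
SAMPLE = '''"ID","Kernel Name","Metric Name","Metric Unit","Metric Value"
"0","naive_gemm","l1tex__t_sectors_pipe_lsu_mem_global_op_ld.sum","sector","4,194,304"
"0","naive_gemm","l1tex__t_requests_pipe_lsu_mem_global_op_ld.sum","","131,072"
'''


def load_metrics(text):
    """Group numeric metric values by kernel name."""
    per_kernel = defaultdict(dict)
    for row in csv.DictReader(io.StringIO(text)):
        raw = row["Metric Value"].replace(",", "")  # drop thousands separators
        try:
            per_kernel[row["Kernel Name"]][row["Metric Name"]] = float(raw)
        except ValueError:
            continue  # skip non-numeric values
    return dict(per_kernel)


metrics = load_metrics(SAMPLE)["naive_gemm"]
ratio = (
    metrics["l1tex__t_sectors_pipe_lsu_mem_global_op_ld.sum"]
    / metrics["l1tex__t_requests_pipe_lsu_mem_global_op_ld.sum"]
)
print(f"sectors per request: {ratio}")
```

Everything here runs on a machine without a GPU, because the counters were already captured when the profile was recorded.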
One thing I'm intentionally trying not to do
I don't want this to become an LLM wrapper that generates plausible-sounding optimization advice.
Every recommendation is triggered by explicit thresholds on measured hardware counters. If the metric evidence isn't present, the recommendation doesn't fire.
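A minimal sketch of that gating behaviour, under stated assumptions: the `Recommendation` record, the `barrier_stall_pct` metric key, the 20% threshold, and the wording of the steps are all hypothetical and only illustrate the "no evidence, no recommendation" principle, not the project's real rules.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Recommendation:
    metric: str      # the exact metric that triggered the rule
    value: float     # the measured value
    threshold: float
    why: str         # why the metric matters
    steps: list      # numbered remediation steps


def barrier_rule(metrics) -> Optional[Recommendation]:
    """Fire only when the counter is present and exceeds the threshold."""
    key, threshold = "barrier_stall_pct", 20.0  # hypothetical key and cutoff
    if key not in metrics:
        return None  # no measurement, so nothing is claimed
    if metrics[key] <= threshold:
        return None  # measured, but not a problem
    return Recommendation(
        metric=key,
        value=metrics[key],
        threshold=threshold,
        why="Warps waiting at barriers cannot issue instructions.",
        steps=[
            "Check whether each __syncthreads() guards a real shared-memory hazard.",
            "Move barriers out of inner loops where no hazard exists.",
        ],
    )


fired = barrier_rule({"barrier_stall_pct": 45.0})
quiet = barrier_rule({"barrier_stall_pct": 5.0})
missing = barrier_rule({"dram_throughput_pct": 90.0})
print(fired is not None, quiet, missing)
```

The design choice is that a missing counter and a healthy counter both produce no output, so the advice that does appear can always be traced back to a specific measured value.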
Repo
https://github.com/jorgevee/fournex
Would appreciate feedback from people who profile CUDA workloads, whether professionally or as a hobby:
- What bottlenecks are hardest to diagnose today?
- What’s missing from existing tooling?
- Would you trust automated optimization suggestions? Under what conditions?
- What would make something like this useful in your workflow?
And if the direction seems interesting, feel free to star the repo.