I've been building an open source tool called Fournex that turns Nsight Compute output into specific, evidence-backed optimization suggestions for CUDA kernels.

What it does

You give it an NCU profile (or a PTX file), and it:

classifies bottlenecks from hardware-counter evidence
ranks issues by severity
generates concrete optimization recommendations tied directly to the metrics that triggered them

What it currently detects

Uncoalesced global memory access (sectors/request ratio)
L1/L2 cache thrashing
Memory bandwidth saturation
Tensor core underutilization
Warp stall patterns:
- barrier stalls
- memory throttle
- scoreboard stalls
Low issue-slot utilization
Register pressure / spills (via PTX static analysis)

Concrete example

I tested it on a deliberately broken GEMM kernel with four planted flaws:

stride-K uncoalesced access
no shared memory tiling
FP32 only execution (tensor cores idle)
unnecessary __syncthreads() calls inside the reduction loop

It correctly identified all four and recommended:

improving memory coalescing
adding shared memory tiling
enabling AMP / tensor core usage
removing unnecessary barriers

Each recommendation includes:

the exact metric that triggered it
why the metric matters
numbered remediation steps

Workflow

# Analyze existing Nsight Compute CSV output
frx profile --ncu profile.csv

# Or let frx run NCU for you
# (Linux only, may require sudo for hardware counters)
frx profile -- ./my_binary

# Static PTX analysis
frx profile --ptx kernel.ptx

On Windows, you can export the CSV from Nsight Compute and pass it to:

frx profile --ncu profile.csv

No GPU is required at analysis time.

One thing I'm intentionally trying not to do

I don't want this to become an LLM wrapper that generates plausible sounding optimization advice.

Every recommendation is triggered by explicit thresholds on measured hardware counters. If the metric evidence isn't present, the recommendation doesn't fire.

Repo

https://github.com/jorgevee/fournex

Would appreciate feedback from people who profile CUDA workloads seriously or hobby:

What bottlenecks are hardest to diagnose today?
What’s missing from existing tooling?
Would you trust automated optimization suggestions? Under what conditions?
What would make something like this useful in your workflow?

And if the direction seems interesting, don't be shy to star the repo

u/jvbiz