u/FrontRegular6113

I have been benchmarking Nemotron-3-Super and GPT-OSS:120B using vLLM on a system equipped with two Blackwell RTX Pro 6000 cards. I allocated one dedicated GPU to each model for the evaluation.

In my testing, the perceived output token throughput of Nemotron-3-Super was roughly 4x slower than that of GPT-OSS:120B. However, according to the official Nvidia Technical Report, Nemotron-3-Super is supposed to be 2.2x faster.
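To make "output token throughput" concrete, here is a minimal sketch of one way to compute it from per-request timings, so that both models are compared on the same footing. The names `RunTiming`, `decode_throughput`, `end_to_end_throughput`, and `speed_ratio` are hypothetical helpers for illustration; they are not part of vLLM, and this is not necessarily how the report measured throughput.

```python
from dataclasses import dataclass


@dataclass
class RunTiming:
    """Timing of a single generation request (hypothetical container)."""
    output_tokens: int  # number of tokens generated in the response
    ttft_s: float       # time to first token, in seconds
    total_s: float      # wall-clock time for the whole request, in seconds


def decode_throughput(run: RunTiming) -> float:
    """Output tokens per second during decoding only.

    The first token is excluded because its latency is dominated by
    prompt processing (prefill) rather than by decoding.
    """
    decode_s = run.total_s - run.ttft_s
    if run.output_tokens <= 1 or decode_s <= 0:
        raise ValueError("need more than one token and a positive decode time")
    return (run.output_tokens - 1) / decode_s


def end_to_end_throughput(run: RunTiming) -> float:
    """Output tokens per second over the whole request, prefill included."""
    if run.total_s <= 0:
        raise ValueError("total time must be positive")
    return run.output_tokens / run.total_s


def speed_ratio(a: RunTiming, b: RunTiming) -> float:
    """How many times faster run `a` decodes than run `b`."""
    return decode_throughput(a) / decode_throughput(b)
```

Single-request decode speed and aggregate throughput under many concurrent requests can differ a lot, so it is worth stating which of the two a given number refers to before comparing it with a published figure.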

What could be causing this massive discrepancy between the report and my real-world results? (Reference: https://research.nvidia.com/labs/nemotron/files/NVIDIA-Nemotron-3-Super-Technical-Report.pdf)

Do I need to migrate to TensorRT-LLM to unlock the full optimised performance of Nemotron-3-Super? In the paper, Nvidia provides a rather ambiguous explanation regarding their methodology.

They cherry-picked the best-performing metrics without clarifying which serving framework was actually used for which specific model, which is quite frustrating.

Could you please explain what causes this gap, and suggest any optimisation techniques or best practices to maximise the performance of Nemotron-3-Super on my setup?

Nemotron 3 Super vs GPT-OSS:120B on Blackwell RTX Pro 6000 Cards