DeepSeek R1 keeps inventing pandas methods that don't exist. Ran 50 tasks against Qwen3.6 last week — wasn't close.
Last week DeepSeek R1 confidently generated a pandas method that doesn't exist. Took me 20 minutes to figure out why my pipeline was throwing AttributeError. I've been on R1 since January and this isn't the first time it's happened.
So I ran a head-to-head against Qwen3.6 35B: 50 tasks pulled from my actual workflow (Python refactoring, SQL optimization, edge-case debugging). Same prompts, same temperature.
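For anyone replicating, the "same prompts, same temperature" setup can be sketched roughly like this. It is only an illustration, not the harness used for these numbers: `run_ab` and the `(prompt, temperature) -> text` client signature are hypothetical, and you would wrap whatever client you actually use to match it.

```python
from typing import Callable, Dict, List

# Hypothetical client signature (an assumption): (prompt, temperature) -> completion text.
ModelFn = Callable[[str, float], str]


def run_ab(
    tasks: List[str], models: Dict[str, ModelFn], temperature: float
) -> List[Dict[str, str]]:
    """Send every task to every model with an identical prompt and temperature."""
    results = []
    for prompt in tasks:
        row = {"prompt": prompt}
        for name, call in models.items():
            # Same prompt and same temperature for each model, so the only
            # variable between columns is the model itself.
            row[name] = call(prompt, temperature)
        results.append(row)
    return results
```

Each row then holds the prompt plus one answer per model, which makes side-by-side grading straightforward.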
Qwen3.6 won 31. DeepSeek took 14. 5 were basically a wash.
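To put that split in perspective, here is a quick back-of-the-envelope check. The sign test is an addition for illustration, not part of the original run, and it treats the 45 decisive tasks as independent, which hand-picked workflow tasks may not be. `sign_test_p` is a hypothetical helper name.

```python
from math import comb

wins_qwen, wins_deepseek, ties = 31, 14, 5  # counts reported above
total = wins_qwen + wins_deepseek + ties    # 50 tasks

shares = {
    "Qwen3.6": wins_qwen / total,
    "DeepSeek R1": wins_deepseek / total,
    "tie": ties / total,
}


def sign_test_p(a: int, b: int) -> float:
    """Two-sided exact sign test on decisive tasks (ties dropped)."""
    n, k = a + b, max(a, b)
    # Probability of k or more wins out of n if both models were equally good.
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)


p_value = sign_test_p(wins_qwen, wins_deepseek)

for name, share in shares.items():
    print(f"{name}: {share:.0%}")
print(f"sign test p = {p_value:.3f}")
```

A small p-value here only says the win/loss split is unlikely to be a coin flip on this task set; it says nothing about tasks outside it.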
The hallucination gap was the part I didn't expect. DeepSeek kept confidently generating pandas methods that don't exist, and did the same in SQL with invented Postgres functions. Qwen3.6 flagged its own uncertainty maybe 6-7 times across the run, saying things like "I'm not 100% sure about this syntax, verify it." That sounds soft until you've shipped a hallucinated method name into production.
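One cheap guard against made-up method names is to check generated code statically before it ever runs. The sketch below is illustrative only and was not part of the benchmark: `unknown_attributes` is a hypothetical helper, and it only inspects direct accesses on a single named variable. It is demonstrated on the built-in `list` so it runs without extra installs; with pandas installed you would pass `pd.DataFrame` as `cls` instead.

```python
import ast
from typing import List


def unknown_attributes(source: str, var_name: str, cls: type) -> List[str]:
    """Return attribute names accessed on `var_name` that `cls` does not define.

    Only direct accesses such as `df.foo` or `df.foo(...)` are checked.
    Chained calls (`df.a().b()`) are skipped after the first hop because
    the intermediate type is unknown to a static check.
    """
    known = set(dir(cls))
    missing: List[str] = []
    for node in ast.walk(ast.parse(source)):
        if (
            isinstance(node, ast.Attribute)
            and isinstance(node.value, ast.Name)
            and node.value.id == var_name
            and node.attr not in known
            and node.attr not in missing
        ):
            missing.append(node.attr)
    return missing


# `push` is not a list method, so it is the only name reported.
generated = "items.append(1)\nitems.push(2)\n"
print(unknown_attributes(generated, "items", list))  # ['push']
```

One caveat if you point this at a DataFrame: column access written as `df.some_column` will be reported too, since column names are not attributes of the class itself. Treat the output as a list of things to verify rather than a list of errors.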
Where DeepSeek still wins: pure chain-of-thought stuff. I threw in some proof-style math problems and DeepSeek handled them more cleanly. So for reasoning/math I'm still routing to R1. But for daily "this function is broken, fix it" coding work, Qwen3.6 is my default now.
Latency felt lower on Qwen3.6 too, but I didn't formally measure it, and the routing could be biasing things either way, so I'm not building a story on that.
Not saying DeepSeek is bad. Still my go-to for reasoning. But the hallucination gap was big enough that I'm not letting it touch library-specific code anymore until it tightens up.
Happy to share the task list if anyone wants to replicate. Both are on atlas if you want a single-endpoint A/B, otherwise direct works too.