tested DeepSeek V4, Kimi K2.6, and Qwen3-235B on the same coding task. surprised by which won.
not sponsored. just spent two weeks running the same workflow through three open-source LLMs, and the gaps between them were bigger than i expected.
i'd been on claude pro for everything since 2024. ran into the new gemini 3 limits last week and that pushed me to actually look at what open-source had become. spoiler: it's better than the 2024 reddit consensus says it is.
picked one prompt i use weekly. a coding refactor task with about 800 tokens of context and a clear ask. ran it three times on each model. same temperature, same context, same prompt verbatim.
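if you want to repeat this kind of comparison, here's a minimal sketch of the loop. it's an illustration, not the exact script behind this post. `call_model` is a hypothetical stand-in for whatever client your provider gives you, the model names are placeholder strings, and the default temperature of 0.2 is an assumption (the only thing that matters is that it's identical across models).

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical signature: (model_name, prompt, temperature) -> completion text.
# Replace with a thin wrapper around your provider's real client.
ModelFn = Callable[[str, str, float], str]


@dataclass
class Run:
    model: str
    attempt: int
    output: str


def compare_models(call_model: ModelFn, models: List[str], prompt: str,
                   temperature: float = 0.2, runs_per_model: int = 3) -> List[Run]:
    """Send the identical prompt to every model, several times each."""
    results: List[Run] = []
    for model in models:
        for attempt in range(1, runs_per_model + 1):
            # Same prompt and same temperature on every call, so the only
            # variable between runs is the model itself.
            output = call_model(model, prompt, temperature)
            results.append(Run(model, attempt, output))
    return results


def group_by_model(results: List[Run]) -> Dict[str, List[str]]:
    """Collect outputs per model so they can be read side by side."""
    grouped: Dict[str, List[str]] = {}
    for run in results:
        grouped.setdefault(run.model, []).append(run.output)
    return grouped


def fake_call(model: str, prompt: str, temperature: float) -> str:
    # Offline stand-in so the sketch runs without any API key.
    return f"[{model} @ T={temperature}] received {len(prompt)} chars"


if __name__ == "__main__":
    runs = compare_models(fake_call, ["model-a", "model-b", "model-c"],
                          "refactor this function ...")
    for model, outputs in group_by_model(runs).items():
        print(model, len(outputs), "runs")
```

three runs per model won't give you statistics, but it's enough to see whether a model caught an edge case by luck or does it consistently.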
DeepSeek V4:
clean. precise. caught two edge cases without being asked. added a comment explaining the reasoning behind a non-obvious choice. closest to senior-dev output i've seen from open-source.
second-cheapest of the three on my workload.
Kimi K2.6:
different style. more verbose explanations. caught one edge case deepseek missed (an off-by-one in the loop termination). added two test suggestions i hadn't asked for.
most expensive of the three, but still about 1/8 of Claude pricing for the same workload.
Qwen3-235B:
competent but workmanlike. refactored what i asked. didn't catch the edge cases the other two caught. less thoughtful about non-obvious tradeoffs.
cheapest of the three. the cost gap to deepseek isn't huge though, maybe 30%.
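a note on reading cost gaps like that one: the same gap gives two different percentages depending on which model you treat as the baseline. a rough sketch of the arithmetic, assuming per-million-token pricing. the function names are made up for illustration, and the numbers in the example are placeholder units, not real prices for any of these models.

```python
from typing import Tuple


def cost_per_run(input_tokens: int, output_tokens: int,
                 in_price_per_m: float, out_price_per_m: float) -> float:
    """Cost of one request, given prices quoted per million tokens."""
    return ((input_tokens / 1_000_000) * in_price_per_m
            + (output_tokens / 1_000_000) * out_price_per_m)


def gap_both_ways(cheaper: float, pricier: float) -> Tuple[float, float]:
    """Return (percent cheaper, percent more expensive) for the same two costs."""
    diff = pricier - cheaper
    pct_cheaper = diff / pricier * 100   # baseline: the pricier model
    pct_pricier = diff / cheaper * 100   # baseline: the cheaper model
    return round(pct_cheaper, 1), round(pct_pricier, 1)


if __name__ == "__main__":
    # Placeholder costs in arbitrary units, NOT real model prices.
    print(gap_both_ways(0.70, 1.00))  # -> (30.0, 42.9)
```

so "30% cheaper" and "43% more expensive" can describe the same pair of bills. plug your own token counts and your provider's current prices into `cost_per_run` before drawing conclusions.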
the realisation after two weeks:
DeepSeek V4 thinks.
Kimi K2.6 elaborates.
Qwen3-235B executes.
deepseek has the cleanest output overall. but for tasks where i just need execution and don't need the model to think alongside me, qwen3 is fine and even cheaper. kimi is the pick when explanation matters more than pure code, as long as you're okay paying the most of the three.
the uncomfortable part: open-source has caught up on coding tasks more than the reddit consensus says. the premium i was paying for claude was mostly brand familiarity, not quality.
switching wasn't the right move for everything. but for the bulk of my prompt-engineering work (refactors, summaries, structured extraction), the open-source models cover 80-90% of what i was getting before, at a fraction of the cost. and no rate caps.
run your most-used prompt across deepseek v4, kimi k2.6, and qwen3 this week. or pick three open-source models that match your workload. not to find a winner. to figure out which model is the right fit for which problem.
the answer will be different for different workloads. but you can't see the gap until you actually compare.
which open-source model surprised you when you tested it side by side with what you've been defaulting to?