Prompt consistency across models feels harder than prompt quality itself
Something I’ve been noticing while experimenting with prompts is that the same prompt can produce completely different reasoning patterns depending on the model.
Even when outputs look equally confident, the structure, assumptions, and depth of reasoning can vary a lot between models.
That’s what pushed me to start comparing outputs more systematically instead of treating prompt engineering as something that happens inside a single model.
Recently I’ve been experimenting with askNestr to compare multi-model responses side by side, and honestly the most useful part hasn’t been finding a “best” answer; it’s been identifying where models consistently diverge.
It made me wonder whether future prompt engineering workflows will focus less on optimizing for one model and more on designing prompts that remain reliable across multiple reasoning systems.
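To make "where models diverge" a bit more concrete, here is a minimal sketch of one way to flag divergence for a single prompt. It is not how askNestr works (I don't know its internals); it assumes you have already collected each model's response as plain text, and the model names, the `pairwise_similarity` and `divergent_pairs` helpers, and the 0.5 threshold are all made up for illustration. Word-level overlap is only a crude proxy for differences in reasoning, so treat the scores as a hint about where to look, not a verdict.

```python
from difflib import SequenceMatcher
from itertools import combinations


def pairwise_similarity(responses):
    """Return {(model_a, model_b): ratio} for every pair of model responses.

    `responses` maps a (hypothetical) model name to its response text.
    The ratio is difflib's word-level similarity, between 0.0 and 1.0.
    """
    scores = {}
    for a, b in combinations(sorted(responses), 2):
        matcher = SequenceMatcher(None, responses[a].split(), responses[b].split())
        scores[(a, b)] = matcher.ratio()
    return scores


def divergent_pairs(responses, threshold=0.5):
    """List model pairs whose similarity falls below `threshold`.

    The threshold is an arbitrary illustrative value, not a tuned one.
    """
    scores = pairwise_similarity(responses)
    return [pair for pair, score in scores.items() if score < threshold]


if __name__ == "__main__":
    # Placeholder responses; in practice these would come from real model calls.
    responses = {
        "model_a": "the answer is 42",
        "model_b": "the answer is 42",
        "model_c": "completely unrelated text here now",
    }
    for pair in divergent_pairs(responses):
        print("diverges:", pair)
```

Running this over a batch of prompts and counting how often each pair shows up would give a rough picture of which models consistently disagree, which is the signal I found most useful.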
Curious how others here think about cross-model consistency when evaluating prompts.