I tested how AI picks B2B agencies: 40 prompts across ChatGPT, Perplexity, Gemini and Google AI Overviews
I work mostly in B2B marketing and wanted to sanity-check something I’ve been hearing more often: when a buyer asks an AI engine “who should I hire for X?”, does it actually give a consistent answer?
So I made a simple sheet with 40 recommendation-style prompts. A few examples:
- best GEO agency for B2B 2026
- GEO vs SEO recommendations for a B2B SaaS company
- who should I hire for AI search visibility
- best agency for AI search / answer engine optimization
Then I ran the same prompts through ChatGPT, Perplexity, and Gemini, and logged Google AI Overviews whenever one appeared for the query.
This wasn’t meant to be a perfect scientific study. I mostly logged whether an agency was named, whether it was cited or linked, which sources were cited, and whether the same agencies showed up across engines.
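For anyone who wants to copy the setup, here is a minimal sketch of one way to structure that log in Python. The field names, the `Observation` class, and the sample row are all assumptions for illustration, not the actual sheet used in this test.

```python
import csv
import io
from dataclasses import dataclass, asdict

# Hypothetical column layout: one row per agency named in one answer.
FIELDS = ["prompt", "engine", "agency", "cited", "sources"]

@dataclass
class Observation:
    prompt: str    # the prompt text that was run
    engine: str    # e.g. "ChatGPT", "Perplexity", "Gemini", "AI Overviews"
    agency: str    # agency named in the answer
    cited: bool    # True if the agency was cited or linked, not just named
    sources: str   # semicolon-separated domains the answer cited

def to_csv(rows):
    """Serialize observations to CSV text with a header row."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS)
    writer.writeheader()
    for row in rows:
        writer.writerow(asdict(row))
    return buf.getvalue()

# Placeholder data, not a real result from the test.
rows = [
    Observation(
        prompt="best GEO agency for B2B 2026",
        engine="Gemini",
        agency="Agency A",
        cited=True,
        sources="example.com;example.org",
    )
]
csv_text = to_csv(rows)
```

One row per (prompt, engine, agency) keeps the later comparisons simple, since you can group by engine or by prompt without reshaping anything.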
What surprised me wasn’t that some answers were wrong. It was how little agreement there was.
The same prompt would produce a confident shortlist in one engine and a totally different shortlist in another. One engine would recommend a firm that another engine didn’t mention at all. For one mid-size agency I tracked, Gemini listed it as a top pick, while Perplexity gave it no meaningful mention across the same prompt set.
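If you want a number for "how little agreement", one option is the Jaccard overlap between the sets of agencies each engine named for a prompt. This is a sketch of that idea, not the method used in this test; the agency names below are placeholders, and treating two empty answers as full agreement is an assumption you may want to change.

```python
from itertools import combinations

def jaccard(a, b):
    """Share of agencies named by both engines out of all agencies named by either."""
    if not a and not b:
        return 1.0  # assumption: neither engine named anyone, count as agreement
    return len(a & b) / len(a | b)

def pairwise_agreement(named):
    """named: dict mapping engine name -> set of agencies named for one prompt."""
    return {
        (e1, e2): jaccard(named[e1], named[e2])
        for e1, e2 in combinations(sorted(named), 2)
    }

# Placeholder shortlists for a single prompt.
named = {
    "ChatGPT": {"Agency A", "Agency B", "Agency C"},
    "Gemini": {"Agency A", "Agency D"},
    "Perplexity": {"Agency E"},
}
scores = pairwise_agreement(named)
```

A score of 1.0 means two engines named exactly the same agencies and 0.0 means no overlap at all. Averaging the scores over all 40 prompts would give one agreement figure per engine pair.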
A couple of assumptions I had going in were probably wrong:
First, I assumed there was one “AI ranking” to climb. There isn’t. It feels more like multiple answer surfaces, and each one pulls from a different mix of sources.
Second, I assumed a strong website would translate into strong AI visibility. In this small test, that wasn’t always true. The agencies that showed up most often weren’t necessarily the ones with the best sites. They were the ones mentioned in places the engines seemed willing to quote or summarize.
That makes the whole GEO vs SEO conversation more interesting to me. Some of it definitely feels like rebranded SEO, but the measurable part is real: you can check whether AI engines name you, cite you, ignore you, or recommend competitors instead.
I’m going to rerun the same prompt set monthly to see what’s stable vs. noise.
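A rough way to separate stable from noisy once a few monthly runs exist is to count how often each agency shows up for the same prompt and engine. This is only a sketch: the 0.75 cutoff is an arbitrary assumption, and the function names and sample data are made up for illustration.

```python
from collections import Counter

def appearance_rate(runs):
    """runs: list of sets of agency names, one set per monthly run."""
    counts = Counter(name for run in runs for name in run)
    return {name: counts[name] / len(runs) for name in counts}

def split_stable(runs, threshold=0.75):
    """Return (stable, noisy) agency sets using an assumed appearance-rate cutoff."""
    rates = appearance_rate(runs)
    stable = {name for name, rate in rates.items() if rate >= threshold}
    return stable, set(rates) - stable

# Placeholder data: four monthly runs of one prompt on one engine.
runs = [
    {"Agency A", "Agency B"},
    {"Agency A", "Agency C"},
    {"Agency A", "Agency B"},
    {"Agency A"},
]
stable, noisy = split_stable(runs)
```

With only a handful of runs the rates are coarse, so the split is more of a flag for what to watch than a conclusion.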
Curious if anyone else here has run a structured test like this. Are you tracking AI visibility across multiple engines, or mostly just checking ChatGPT?