u/chancemehmu
On-Demand Human Judgement for AI Agents
Been thinking about this a lot lately. Agents are getting scary good at the mechanical stuff - searching, calling APIs, writing code, executing multi-step plans. But they still face two problems that no amount of scaling fixes:
- They hit decision points where the "right answer" is a judgment call, not a logic problem. Is this email tone too aggressive? Which of these three landing page headlines actually lands? Does this UI feel sketchy to a normal person? Models have priors on this stuff but their priors are an average of the internet, not your actual users.
- You can't eval them on anything subjective without burning a week recruiting people, building a survey, paying a panel, etc. So most teams just don't, and ship on vibes.
I built an MCP server that solves both. Agent hits a fork in the road, calls the tool with a question + audience (e.g. "US women 25-34" or "developers who've used Cursor"), and gets back actual human responses. Not synthetic. Not an MTurk graveyard. Real people replying within seconds.
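For anyone wondering what a call like that looks like on the wire: MCP tool invocations are JSON-RPC 2.0 requests with the method `tools/call`. Rough sketch below. The tool name `ask_humans` and the argument names (`question`, `audience`, `n`) are made up for illustration and are not this server's actual schema.

```python
import json


def build_tool_call(question, audience, n, request_id=1):
    """Build an MCP `tools/call` request (JSON-RPC 2.0) as a plain dict."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {
            # Hypothetical tool name and argument names, for illustration only.
            "name": "ask_humans",
            "arguments": {"question": question, "audience": audience, "n": n},
        },
    }


request = build_tool_call(
    question="Does this email tone come across as too aggressive?",
    audience="US women 25-34",
    n=200,
)
print(json.dumps(request, indent=2))
```

In practice an MCP client library would build and send this for the agent; the point is just that the question and the audience travel as ordinary tool arguments.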
Example from last week - someone wired it into a Claude Code agent generating marketing copy variants. Instead of picking the "best" one itself, the agent fires off 4 versions to 200 people in the target segment, gets back preference data, and only then commits.
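The "only then commits" step could look something like this. It's a sketch, assuming the preference data comes back as one pick per respondent; the vote counts are invented for the example and the 40% threshold is an arbitrary choice, not something the tool enforces.

```python
from collections import Counter


def pick_winner(votes, min_share=0.4):
    """Return the most-picked variant, or None if it falls short of min_share."""
    if not votes:
        return None
    counts = Counter(votes)
    variant, top_count = counts.most_common(1)[0]
    share = top_count / len(votes)
    return variant if share >= min_share else None


# Invented data: 200 respondents, each picking one of 4 copy variants.
votes = ["A"] * 38 + ["B"] * 97 + ["C"] * 41 + ["D"] * 24
winner = pick_winner(votes)
print(winner)  # B (97 of 200 votes, a 48.5% share)
```

Returning `None` when nothing clears the bar lets the agent generate new variants or escalate instead of committing to a near-tie.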
Same primitive works for eval generation. Want a 500-person benchmark on whether your agent's outputs feel trustworthy? One tool call.
Anyway - curious if anyone else is doing the human-in-the-loop thing for agents, and how?
Most stuff I've seen is either slow HITL or pure LLM judge (cheap but circular).
I can run consumer surveys at ~5¢/response. What's the most economically valuable use case for this?
I built a tool where you type a question, it goes out to real consumers (demo-targeted), and an AI layer synthesizes the responses. Marginal cost is ~5¢/response, so n=1000 runs about $50 and completes within a few hours.
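The arithmetic is simple enough to put in a helper. This assumes a flat ~5¢ per response with no fixed fee or targeting surcharge, which may not hold for every audience.

```python
def survey_cost_usd(n_responses, cents_per_response=5):
    """Total cost in dollars at a flat per-response price given in cents."""
    return n_responses * cents_per_response / 100


for n in (200, 500, 1000):
    print(n, survey_cost_usd(n))
# 200 10.0
# 500 25.0
# 1000 50.0
```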
The economics are cheap enough that I think it unlocks survey work that currently doesn't happen — stuff that's too small to justify a real study but too important to just guess on.
What I'm curious about: if you were me, who's the first customer? Agencies doing pre-tests, marketing teams testing ad creatives, founders doing pre-launch validation, or somewhere else?
Hey,
I've been building consumer apps for a while (recently scaled one to 150M+ users) and am now deep in the weeds experimenting with a new ad format.
We are seeing almost 50% higher CPMs than traditional mediation approaches with interstitial or rewarded ads, and recently launched self-serve SDKs for both Android and iOS.
If you are an Android developer with an app that has at least 100k downloads, feel free to message me.
been playing around with agents a lot lately and one thing kept bugging me
they’re great at generating options
but pretty bad at picking which one is actually good
especially for anything subjective (design, writing, images, etc.)
so we hacked together an mcp server that basically lets an agent go:
>
and get back real human preference data. there are other use cases too, like testing new packaging or just collecting a human preference dataset.
what it does:
- rank multiple outputs
- compare candidates side by side
- return a “human preference score” instead of the model guessing
simple flow:
agent generates a few options →
calls the mcp →
gets a ranking →
picks the best one
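rough sketch of that flow in python. the `generate` and `rank` callables are stand-ins: in a real setup `rank` would be the mcp tool call, here it's a stub with made-up scores so the example runs on its own.

```python
from typing import Callable, Dict, List


def choose_with_humans(
    generate: Callable[[], List[str]],
    rank: Callable[[List[str]], Dict[str, float]],
) -> str:
    """Generate options, get a human preference score for each, return the top one."""
    options = generate()
    scores = rank(options)
    # Options missing from the ranking are treated as score 0.0.
    return max(options, key=lambda option: scores.get(option, 0.0))


def fake_generate() -> List[str]:
    # Stub standing in for the agent's own generation step.
    return ["headline one", "headline two", "headline three"]


def fake_rank(options: List[str]) -> Dict[str, float]:
    # Stub standing in for the MCP call; these scores are invented.
    made_up = [0.21, 0.55, 0.24]
    return dict(zip(options, made_up))


best = choose_with_humans(fake_generate, fake_rank)
print(best)  # headline two
```

passing the two steps in as functions keeps the agent logic the same whether the ranking comes from humans, an llm judge, or a test stub.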
honestly the interesting part is where it kicks in
it’s not useful for everything, but when the model is uncertain or it’s a taste call, it works way better than trying to prompt your way out of it
feels less like “tooling” and more like giving the agent a fallback brain
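one way to decide when the fallback kicks in, as a sketch. it assumes the agent has some numeric score per candidate (say, its own judge scores) and a flag for whether the task is a taste call. the function name and the 0.15 gap threshold are both hypothetical, not part of the repo.

```python
def should_ask_humans(model_scores, subjective, min_gap=0.15):
    """Route to humans for taste calls, or when the top two candidates are too close."""
    if subjective:
        return True
    ranked = sorted(model_scores, reverse=True)
    if len(ranked) < 2:
        # Nothing to compare against, so there is no near-tie to break.
        return False
    return (ranked[0] - ranked[1]) < min_gap


print(should_ask_humans([0.90, 0.30], subjective=False))  # False, clear winner
print(should_ask_humans([0.52, 0.48], subjective=False))  # True, near-tie
print(should_ask_humans([0.90, 0.30], subjective=True))   # True, taste call
```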
repo if anyone wants to mess with it:
https://github.com/impel-intelligence/datapoint-mcp
curious how people here are thinking about this layer
do you:
- trust model evals long term
- add human-in-the-loop like this
- or just avoid these problems entirely
feels like something here becomes standard, but not sure what the right abstraction is yet