u/chancemehmu

On-Demand Human Judgement for AI Agents

Been thinking about this a lot lately. Agents are getting scary good at the mechanical stuff - searching, calling APIs, writing code, executing multi-step plans. But they still face two problems that no amount of scaling fixes:

  1. They hit decision points where the "right answer" is a judgment call, not a logic problem. Is this email tone too aggressive? Which of these three landing page headlines actually lands? Does this UI feel sketchy to a normal person? Models have priors on this stuff but their priors are an average of the internet, not your actual users.
  2. You can't eval them on anything subjective without burning a week recruiting people, building a survey, paying a panel, etc. So most teams just don't, and ship on vibes.

I built an MCP server that tackles both. Agent hits a fork in the road, calls the tool with a question + audience (e.g. "US women 25-34" or "developers who've used Cursor"), and gets back actual human responses. Not synthetic. Not an MTurk graveyard. Real people replying within seconds.
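For a rough idea of what that call could look like on the wire, here is a minimal sketch that builds an MCP `tools/call` request carrying a question and an audience. The JSON-RPC envelope follows the MCP spec, but the tool name `ask_humans` and the argument names `question`, `audience`, and `n_responses` are assumptions made up for illustration, not the server's actual schema.

```python
import json


def build_tool_call(question, audience, n_responses, request_id=1):
    """Build an MCP `tools/call` JSON-RPC request for a hypothetical tool."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {
            # "ask_humans" and its argument names are hypothetical;
            # check the real server's tool listing for the actual schema.
            "name": "ask_humans",
            "arguments": {
                "question": question,
                "audience": audience,
                "n_responses": n_responses,
            },
        },
    }


request = build_tool_call(
    question="Is the tone of this email too aggressive?",
    audience="US women 25-34",
    n_responses=200,
)
print(json.dumps(request, indent=2))
```

In practice an MCP client library would build this envelope for you; the point is only that the agent supplies a question plus an audience and waits for responses.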

Example from last week - someone wired it into a Claude Code agent generating marketing copy variants. Instead of picking the "best" one itself, the agent fires off 4 versions to 200 people in the target segment, gets back preference data, and only then commits.
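A sketch of the "only then commits" step, assuming the responses come back as one preferred variant per respondent. The `pick_winner` helper, the 5-point minimum lead, and the vote counts are all made up for illustration; none of this is real response data or the tool's actual output format.

```python
from collections import Counter


def pick_winner(votes, min_margin=0.05):
    """Return (winner, shares); winner is None when the top two are too close."""
    if not votes:
        return None, {}
    counts = Counter(votes)
    total = len(votes)
    shares = {variant: count / total for variant, count in counts.items()}
    ranked = counts.most_common()
    if len(ranked) == 1:
        return ranked[0][0], shares
    # Lead of the top variant over the runner-up, as a fraction of all votes
    lead = (ranked[0][1] - ranked[1][1]) / total
    winner = ranked[0][0] if lead >= min_margin else None
    return winner, shares


# Illustrative, invented votes: 200 respondents choosing among 4 copy variants
votes = ["A"] * 38 + ["B"] * 92 + ["C"] * 45 + ["D"] * 25
winner, shares = pick_winner(votes)
print(winner, shares)
```

Returning `None` on a near-tie lets the agent fall back to asking more people or picking on other criteria, instead of committing to a coin flip.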

Same primitive works for eval generation. Want a 500-person benchmark on whether your agent's outputs feel trustworthy? One tool call.

Anyway - curious if anyone else is doing the human-in-the-loop thing for agents, and how?

Most stuff I've seen is either slow HITL or pure LLM judge (cheap but circular).

u/chancemehmu — 6 days ago

I can run consumer surveys at ~5¢/response. What's the most economically valuable use case for this?

I built a tool where you type a question, it goes out to real consumers (demo-targeted), and an AI layer synthesizes the responses. Marginal cost is ~5¢/response, so n=1000 runs about $50 and completes within a few hours.
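The cost math as a tiny sketch, assuming a flat 5¢ per response with no fixed fees or volume discounts (the ~5¢ figure is approximate, and the function name is just illustrative):

```python
def survey_cost_usd(n_responses, cents_per_response=5):
    """Total cost in dollars; multiply in whole cents first, then convert."""
    return n_responses * cents_per_response / 100


for n in (100, 500, 1000):
    print(f"n={n}: ${survey_cost_usd(n):.2f}")
```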

The economics are cheap enough that I think it unlocks survey work that currently doesn't happen — stuff that's too small to justify a real study but too important to just guess on.

What I'm curious about: if you were me, who's the first customer? Agencies doing pre-tests, marketing teams testing ad creatives, founders doing pre-launch validation, or somewhere else?

u/chancemehmu — 7 days ago

Hey,

I've been building consumer apps for a while (recently scaled one to 150M+ users) and am now deep in the weeds experimenting with a new ad format.

We are seeing almost 50% higher CPMs than traditional mediation approaches with interstitial or rewarded ads, and recently launched self-serve SDKs for both Android and iOS.

If you are an Android developer with an app that has at least 100k downloads, feel free to message me.

u/chancemehmu — 22 days ago

been playing around with agents a lot lately and one thing kept bugging me

they’re great at generating options
but pretty bad at picking which one is actually good

especially for anything subjective (design, writing, images, etc.)

so we hacked together an mcp server that basically lets an agent go:

>

and get back real human preference data. there are other use cases as well, like testing new packaging or just collecting a human preference dataset in general.

what it does:

  • rank multiple outputs
  • compare candidates side by side
  • return a “human preference score” instead of the model guessing

simple flow:
agent generates a few options →
calls the mcp →
gets a ranking →
picks the best one
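
the flow above, as a minimal sketch. the ranking step is passed in as a plain function so the example runs without any server. `choose_with_humans`, `fake_generate`, and `fake_rank` are made-up names, and the stand-in ranker just sorts by length, which is obviously not what real people would do.

```python
from typing import Callable, List, Sequence


def choose_with_humans(
    generate: Callable[[], List[str]],
    rank: Callable[[Sequence[str]], List[str]],
) -> str:
    """Generate options, have them ranked, and return the top-ranked one."""
    options = generate()
    if not options:
        raise ValueError("generate() returned no options")
    ranking = rank(options)  # in a real setup this would be the mcp call
    return ranking[0]


def fake_generate() -> List[str]:
    # stand-in for the agent producing candidates
    return ["headline one", "a longer headline two", "h3"]


def fake_rank(options: Sequence[str]) -> List[str]:
    # placeholder ranking (shortest first); a real ranker returns human preferences
    return sorted(options, key=len)


best = choose_with_humans(fake_generate, fake_rank)
print(best)
```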

honestly the interesting part is where it kicks in

it’s not useful for everything, but when the model is uncertain or it’s a taste call, it works way better than trying to prompt your way out of it
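
one way to decide when it kicks in, sketched under the assumption that the agent can put its own score on each option: only call out to humans when the top two scores are close. the `needs_human_call` name, the score format, and the 0.1 gap are all assumptions for illustration, not how the repo does it.

```python
def needs_human_call(model_scores, min_gap=0.1):
    """True when the model's top two scores are within min_gap of each other."""
    if len(model_scores) < 2:
        return False  # nothing to compare, so nothing to escalate
    top, runner_up = sorted(model_scores.values(), reverse=True)[:2]
    return (top - runner_up) < min_gap


# close call between "a" and "b": worth asking people
print(needs_human_call({"a": 0.52, "b": 0.48, "c": 0.10}))
# clear favorite: the agent can just pick it
print(needs_human_call({"a": 0.90, "b": 0.30}))
```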

feels less like “tooling” and more like giving the agent a fallback brain

repo if anyone wants to mess with it:
https://github.com/impel-intelligence/datapoint-mcp

curious how people here are thinking about this layer

do you:

  • trust model evals long term
  • add human-in-the-loop like this
  • or just avoid these problems entirely

feels like something here becomes standard, but not sure what the right abstraction is yet

u/chancemehmu — 23 days ago
▲ 11 r/mcp
