u/Harshil-Jani

r/Agent_AI r/MarketingResearch r/ZaiGLM r/HowToAIAgent r/AIAgentsInAction r/EngineeringManagers

OpenAI quietly killed the $200K entry fee for ChatGPT ads. Six weeks later they'd made $100M.

Back in February, advertising on ChatGPT was a rich kid's club. You needed a $200K monthly minimum just to get in the door, roughly $2.4M a quarter to find out if the channel even worked for you. Dentsu, Omnicom, WPP and their enterprise clients got to play. Everyone else got to watch.

Then OpenAI did something interesting. They dropped the minimum to $50K in April. On May 5 they dropped it to zero. Anyone with a credit card can now log into ads.openai.com and run ads inside ChatGPT conversations. No agency, no invitation, no six figure handshake.

The early money is absurd. ChatGPT crossed $100M in annualized ad revenue within six weeks, and that was with less than 20% of eligible users even seeing ads on a given day. That's a fraction of capacity. OpenAI is openly targeting $2.5B this year and $100B by 2030. They are not treating this as an experiment.

Now the part that actually changes the job. This isn't Google Ads with a new logo. There are no keywords. You write "context hints," plain language descriptions of the conversations you want to appear in, and OpenAI's system decides where you match. Early advertisers who pasted in keyword lists instead of writing natural descriptions burned their budgets on loosely matched conversations.

Think about what the impression itself is, too. Nobody scrolls ChatGPT. The person seeing your ad just typed "best project management software for a 10 person team" into the box. They're mid decision, not mid doomscroll. Roughly one in five queries on the platform already carries commercial intent, across 900M weekly users.

Before anyone maxes out a card, the honest caveats:

Ads only show to Free and Go tier users. If your buyers live on Plus, Pro or Enterprise, your audience is capped.
You get conversion data after the click, but zero visibility into what the person was chatting about before it

https://preview.redd.it/cqnxxxx7dyah1.png?width=1556&format=png&auto=webp&s=bb930dfcface7dab47636580722d652e673a10ee

AI search is still around 0.7% of US search ad spend. Projected to hit 13.6% by 2029, but that's a projection, not a promise

Every ad channel that mattered had a brief weird window where access was open, competition was thin, and nobody knew the rules. Google in 2002, Facebook in 2007. The people who showed up during the confusion didn't win because they were smarter. They won because they were early and paid attention while everyone else waited for best practices to be written.

The best practices for this channel don't exist yet. Somebody in this sub is going to end up writing them.

reddit.com

u/Harshil-Jani — 3 days ago

▲ 13 r/HowToAIAgent+2 crossposts

GLM-5.2 is 753B params but only uses ~40B per token. Here's what that actually means for agent builders

GLM-5.2 dropped on HuggingFace under MIT license, and multiple practitioners are calling it the strongest open-weight text model available. Simon Willison called it "probably the most powerful text-only open weights LLM." That framing is mostly fair, but the architecture details matter a lot before you size hardware or drop it into a tool loop.

What the architecture actually does

GLM-5.2 is 753B total parameters, but only ~40B activate per token. Each incoming word wakes up the relevant ~40B parameters and ignores the rest. That's MoE (Mixture of Experts) and it means cheaper compute per token at inference time. The catch most people miss: the full 753B still has to sit in GPU memory. People hear "40B active" and size for 40B and nothing loads.

For long context, GLM-5.2 uses two stacked tricks:

DSA (DeepSeek Sparse Attention, borrowed from [DeepSeek-V3.2]): normally every word attends to every other word, so cost grows with the square of sequence length. DSA runs a cheap scan first to pick the ~2048 most relevant tokens, then does full attention only on those. About 50% cheaper at 128K context with minimal accuracy loss.
IndexShare (GLM's actual new contribution):** DSA still re-runs that cheap scan every layer. IndexShare reuses the scan result every 4 layers instead. That cuts the indexer's own cost ~75% and per-token FLOPs 2.9x at 1M context. This is the part that's genuinely new.

The benchmark picture

Z.ai's own numbers (no independent re-runs yet): SWE-bench Pro 62.1 vs GPT-5.5's 58.6, both at 400K context. On the Terminal-Bench 2.1 harness, it scores 81.0 vs Claude Opus 4.8's 85.0, a 4-point gap. Jeremy Howard called it at least as good as Opus 4.8 and GPT-5.5 for his text use.

Real catches before you commit

It's free to download, but not free to run. The whole model has to fit in GPU memory, and at full quality that's about 8 high-end datacenter GPUs (8x H200). The cheaper 8x H100 setup doesn't have enough memory and won't load it. The only realistic home option is a maxed-out Mac Studio (256GB+), and even then it's slow, a few words a second.

Two more things the benchmark scores hide. First, it's chatty Simon Willison measured it spending ~43k words of output per task where rivals use 24-37k. You pay per word out, so in an agent that calls itself in a loop, that adds up to a real bill.

Second, Zvi Mowshowitz noticed it often says it *is* Claude, which hints it may have been trained on Claude's output. If that's true, the benchmark scores might be flattering. Nobody's confirmed it either way yet.

No vision support at all, which is a hard blocker for multimodal agent pipelines.

Long-context serving cost is genuinely lower here than on a dense model of comparable capability, but the hardware bar is higher than the "40B active" framing suggests, and the verbosity issue will bite you in agentic loops where output tokens add up fast.

If you've run MoE models in production tool loops before, how did you handle the verbosity problem, prompt engineering or post-processing the outputs?

u/Harshil-Jani — 11 days ago

▲ 4 r/HowToAIAgent+2 crossposts

If agents are your real users now, what do you meter? Decisions or Dollars

Most software is still priced per seat. But agents are starting to be the actual users, not people. When one engineer runs 50 agents, "per seat" stops making sense. So what do you charge for?

I read this essay "Decisions and Dollars" on exactly this. His point: an agent leaves behind two things worth money, the decisions it makes and the money it moves. For people building agents, I think the decisions are the more interesting half.

Every time a user corrects your agent, that correction is data. They fix the code it wrote, or rewrite the draft, or just delete the answer and type their own. Each of those is a clear signal of what "good" means for your specific job. Cursor is basically a machine for collecting this: which code suggestions devs accept, which they throw out. The editor is copyable. That pile of accept/reject data is the part nobody can clone.

Those corrections are useful for two reasons:

They make the agent better: You feed them back to fix prompts, or fine-tune a model on what your users actually want.
They're your real test set: Public benchmarks don't measure your job. A new study, Reliability without Validity with 541K judgments across 21 models, found benchmark scores can be off by enough to shuffle model rankings by 14 spots. So your own corrections are the only honest scoreboard you have.

Most agents I see just dump all this into chat logs and forget it.

Dollars is the easier thing to bill for, you take a fee when the agent pays a vendor. Decisions is harder to charge for, but it's the part that compounds and can't be copied, so that's the side I'd build for. The corrections are the asset.

Are you saving user corrections as real data, or are they disappearing? And if you save them, what do you actually do with them, evals, fine-tuning, or nothing yet?

u/Harshil-Jani — 16 days ago

▲ 0 r/HowToAIAgent+1 crossposts

I built a 28-agent "company" in Claude Code to train for staff engineer

I'm a mid-level engineer who wants to be a staff engineer, and the gap isn't really a skills problem. It's an access problem. The staff-engineer job is about leverage, delegation, and org design and you don't get reps at any of those until someone hands you a team, and nobody hands a mid-level engineer a team.

So I built one with 28 Claude Code subagents across 8 teams, each one is a .claude/agents/*.md system prompt, and I route work to them with slash commands. Managing them has taught me more about the senior-engineering job than any side project I've done.

A few things landed harder than I expected:

Staffing is a decision you make on line one: Every agent declares which model it runs on. Leads run on the judgment-heavy model, ICs on the faster, cheaper one. That sounds like cost optimization. It's actually team composition. Put a "junior" on a judgment call and you get confident, fast, and wrong, in exactly the way you do with people.
You can delegate the work, never the synthesis: Subagents can't spawn their own subagents. Only I can. So five agents run in parallel, but the one place their outputs combine into a decision is me. Integration is the irreducible job. That took me years to learn with humans. The agents taught it in a week.
Firing is free, and that's the trap: No HR, no severance, 30 seconds to hire. The classic management failure flips: it isn't keeping a bad hire too long, it's roster sprawl. I had 28 agents when maybe 12 were doing real work.

Wrote up the full set of lessons (appraisals-as-evals, right-sizing the pod, why this is the cheapest staff-eng gym there is) here: https://medium.com/@harshiljani2002/how-im-training-for-staff-engineer-by-managing-28-ai-agents-2f3c9c63c51b

If you've run a multi-agent setup like this, what made you finally cut an agent instead of tuning its prompt one more time?

u/Harshil-Jani — 20 days ago

▲ 5 r/HowToAIAgent+1 crossposts

The Missing Piece in Most AI Marketing Agents

I've built a few agents now, and the marketing ones kept disappointing me until I figured out why: I was copying the coding agent loop without copying the thing that makes it work.

Every loop is the same shape where I used to instruct it to do something, check the result, keep what works, revert what doesn't, repeat. It all hangs on one question: how does the agent know it got better?

In coding, the environment answers. Tests pass or fail or the Build breaks or doesn't. The bug is gone or it isn't. The agent barely needs judgment because the environment is the judge. That's the real reason coding agents look so strong and verification is built in.

Marketing has no green. A weak landing page still deploys and generic email still sends. A bad ad still spends my budget. Nothing stops the agent from being wrong.

So draft → revise → ship just gives me more output, faster, with no signal about whether any of it is good.

I kept reaching for more autonomy but what I actually needed was verification. The agent needs something to optimize against, and when the environment won't give it, you build it an LLM-as-judge with an explicit rubric:

Truth: Does every claim survive a fact-check? No invented stats, no quiet exaggerations.
Proof: Is there real evidence behind the claim, or just a confident tone?
Specificity: Does it say something concrete, or is it vague filler anyone could've written?
Voice: Does it actually sound like us? (show the judge real samples of our writing, not adjectives like "make it punchy")
Positioning / offer fit: Is it about what we actually sell, and how we want to be seen?
Differentiation: Could a competitor publish the exact same thing word for word? If yes, it's generic.
Do nothing: is the current version already better? Leave it alone

That last one saved me the most. Without an explicit "ship nothing" option, the loop edits for the sake of editing and makes things worse.

A coding loop ends at "tests passed." A marketing loop should end at "this is worth a human looking at." That judge layer is the missing piece.

How are you all building yours?

u/Harshil-Jani — 24 days ago

▲ 98 r/HowToAIAgent

Can an AI learn cooking? (And what it teaches us about building better agents)

I just came across a paper where researchers trained what they claim is the largest multilingual food model yet:

• 4.1M recipes
• 7 languages
• 1,790 ingredients
• 300-dimensional embeddings

And the entire thing is roughly 2 MB.

Paper: https://arxiv.org/abs/2605.22391

What caught my attention wasn't the food angle.

It's that while the rest of the industry is racing to build agents with larger context windows, bigger vector databases, and more retrieval layers, these researchers took a different approach.

Instead of giving the model access to millions of recipes at runtime, they compressed the relationships between ingredients into a tiny representation.

They ended up with a model that can reason about ingredient similarity, substitutions, cuisine relationships, and culinary structure without carrying around 4 million recipes.

A lot of today's agent architectures look like:

User Query → 
Retrieve More Documents → 
Add More Context → 
Ask LLM Again

When an agent struggles, our first instinct is usually to give it more context. But what if the better question is: "Can we represent the domain better?"

Maybe the future isn't agents with 10 million token context windows. Maybe it's agents that operate on compact, learned representations of their world and retrieve only when necessary.

Not saying this food model directly solves that problem. But I love papers like this because they challenge the default assumption that bigger context automatically means better intelligence.

Curious what this sub thinks: If you were building a domain-specific agent, would you rather give it access to 4 million raw records or a compact world model that already understands the relationships between them?

u/Harshil-Jani — 1 month ago

▲ 6 r/HowToAIAgent

Claude Code's new background workflows in Opus 4.8 are designed for delegating, not pair-programming

Watching Claude Code work is a babysitting habit carried over from pair-programming. Anthropic's own best-practices doc says it directly: "Give Claude a check it can run: tests, a build, a screenshot to compare. It's the difference between a session you watch and one you walk away from."*

The autonomy data backs it up. Experienced Claude Code users grant auto-approval more than 40% of the time, compared to around 20% for new users. Session durations at the 99.9th percentile nearly doubled between October 2025 and January 2026, going from under 25 minutes to over 45. People who use it more stop supervising turn-by-turn and start checking in when something actually goes wrong.

Opus 4.8 launched on May 28 and Dynamic Workflows push this further. Background runs, up to 1,000 subagents per task, 16 concurrent — the runtime executes JS orchestration scripts while your session stays responsive. Boris Cherny, creator of Claude Code, describes using /loop for tasks up to three days unattended and, per his public posts, says the model now "catches its own bugs instead of declaring victory early."

Per the launch post, Opus 4.8 is around four times less likely to let code flaws pass unremarked compared to 4.7, and SWE-bench Pro improved from 64.3% to 69.2%. Some of this honesty gain comes from the model abstaining on uncertain questions rather than answering more correctly. That's still a better property for a delegated agent than false confidence, and it's what makes walking away less of a gamble than it was on 4.7.

A recent paper on agent harnesses quantifies exactly why raw tool-call count barely predicts task success (R² ≈ 0.33), but what they call Effective Feedback Compute predicts it at R² ≈ 0.94. Improving feedback quality alone took task success from 0.33 to 0.94 at fixed compute. How long you let it run matters less than whether the check and the goal you leave is informative.

The design philosophy from checkpoints, subagents, and background tasks has been consistent since September 2025: these primitives exist for broad, delegated work, not supervised step-by-step runs. Practically, Dynamic Workflows trigger by typing "workflow" in a prompt, and ultracode effort mode auto-orchestrates for every substantive task. Both launches together sharpen the same point: the constraint on delegation was never the runtime, it was the quality of the check you configured before walking away.

What check do you actually trust enough to walk away from and what category of task still needs eyes-on?

u/Harshil-Jani — 1 month ago

▲ 0 r/HowToAIAgent

NVIDIA launched a $249 box that replaces your $200/month ChatGPT Pro with $2 in electricity

NVIDIA's Jetson Orin Nano Super costs $249. Smaller than a wallet and it runs at 7-25W and does 67 INT8 TOPS. It can also run Llama 3, Mistral, Gemma, and DeepSeek locally via Ollama. No API fees, no data leaving the house.

NVIDIA dropped the price from $499 in December 2024 and shipped a JetPack software update that bumps the same hardware 1.7x on GenAI throughput. Existing Orin Nano owners get the upgrade for free.

ChatGPT Pro is $200/month and now the same coding/automation workload on this box runs you roughly $2-3/month in electricity (25W flat-out × 730 hours ≈ 18 kWh at US residential rates). Hardware pays for itself in a couple of months.

Setup is genuinely simple.

- Install Ollama

- Change your OpenAI base URL to `http://localhost:11434/v1\`

- Keep using the OpenAI SDK — it accepts the override out of the box

- `ollama pull llama3.2` (or `mistral`, `gemma2`, `deepseek-r1`) once

That's the entire migration for most code.

7B-class models handle around 80% of what people actually use ChatGPT for: summarization, drafting, coding assistance, document Q&A, automation pipelines. The 8GB RAM ceiling means DeepSeek R1 1.5B/7B, Gemma 2 2B/9B, Mistral 7B, and Llama 3.2 1B/3B run cleanly via Ollama. 8B is the edge; anything larger needs a desktop GPU or aggressive quantization.

Cloud subscriptions keep getting more expensive and the rate limits keep getting tighter. So the people are shifting more towards setting such home machines.

If you're already running a local-first agent stack (Jetson, M-series Mac, or self-hosted GPU), what's your actual monthly run cost looking like?

u/Harshil-Jani — 1 month ago

▲ 14 r/HowToAIAgent

Anthropic's Natural Language Autoencoders can read what AI models are actually thinking

Anthropic published research this week on something called Natural Language Autoencoders, which is a way to read what an AI model is "thinking" while it's actually running.

On SWE-bench (a popular coding benchmark) Claude was internally aware that it was being evaluated 26% of the time, whereas on real customer traffic at claude.ai that same number drops to under 1%. The model never actually said "I think this is a test" out loud, but the thought was sitting there in its internals the whole time it was answering.

The benchmark scores you use to compare AI models may not reflect how those same models actually behave in production for your customers, because a model that knows it's being tested can plausibly behave differently on the test than in real use. This isn't just theoretical either, since Anthropic's own evaluators caught hidden issues in models 4-5x more often using this technique than the old way of looking at internals.

The method isn't perfect, and an Anthropic researcher publicly pointed out that the plain-English explanations don't always reflect what the model is doing internally (especially on math problems), but the benchmark-awareness finding stands on its own regardless.

The full paper is at transformer-circuits.pub/2026/nla, the code is open-sourced, and there's a live demo on an open model you can play with without needing an Anthropic account.

If you're picking AI models based on benchmark scores today, what's your plan for verifying how they actually behave on your real workload?

u/Harshil-Jani — 2 months ago

▲ 87 r/HowToAIAgent+1 crossposts

Code w/ Claude 2026 shipped a stack of announcements yesterday: Remote Agents, CI auto-fix for automated PR merges, full Microsoft 365 integration (Excel, PowerPoint, Word, Outlook), and a "Dreaming" research preview where agents review their own prior sessions to self-improve.

One of the most important update I saw was around a new "Outcomes" primitive for multi-agent orchestration that lets you declare success criteria as a typed input to the agent run. It's the most consequential thing Anthropic shipped at the event.

You fire the agent, it loops, it stops eventually, and then you figure out, usually with an LLM judge or a human glance, whether it actually accomplished the task you handed it. Every production agent codebase end up rolling its own version of this. "Is the agent done?" problem has been the quiet bleeding wound inagentic systems for two years.

Making success criteria a first-class primitive does three things at once:

The agent has a typed target to verify against, not an ambient goal buried in the system prompt.
The runtime can decide when to stop without inferring stopping from tool patterns or token budgets.
Observability tooling has something concrete to grade against, which is the exact gap Harrison Chase argued for when he framed traces alone as passive records and structured feedback as the missing piece for agent learning.

Outcomes with the Dreaming preview and you have the loop closed for best end results. Outcomes defines the target and Dreaming uses past Outcomes to update agent behavior on subsequent runs. That's the shape of every "self-improving agent" handwave finally made concrete with primitives the runtime actually understands.

Anthropic also doubled Claude Code 5-hour rate limits and lifted peak-hour throttling the same day. So the company is shipping the orchestration primitive that makes long-running agentic loops verifiable, AND lifting the ceiling on how long those loops can actually run. That's a deliberate product surface.

In case if Outcomes goes to all users, the entire cottage industry of custom eval-as-stopping-condition will change and what we've been writing for two years is about to become runtime-native.

If you've already written your own success-criteria layer (typed goals, post-run verification, automatic stop), what does Outcomes have to do API-wise to make you actually rip yours out?

u/Harshil-Jani — 2 months ago