r/AI_Agents

The weirdest AI shift isn’t intelligence. It’s memory.

A year ago, most AI conversations were around “Can it write?” or “Can it code?”

Now the interesting question is becoming:

“What happens when AI actually remembers things?”

Not just chat history - actual preferences, patterns, context, habits, ongoing projects.

The jump from "tool" to "something that remembers previous interactions" feels much bigger than people expected.

Search engines answered questions.

AI is starting to build context.

Feels like a bigger shift than better image generation or slightly higher benchmark scores.

What’s more valuable long-term: smarter AI or AI that remembers better?

reddit.com
u/SoluLab-Inc — 8 hours ago

Looking for product testers: $250 to test and provide comprehensive feedback (MUST USE AGENTS DAILY)

Hi Folks,

I'm looking for testers for my product.

I really want to get an understanding of the onboarding experience, the setup experience, the general experience, and anything that is:

Terrible
Brilliant

And anything in-between. Looking for people who genuinely use agents all the time, and understand it inside out.

I'm trying to make my product, and the service as a whole, better.

Thanks!

reddit.com
u/DetectiveMindless652 — 8 hours ago

Built my own agent runtime after hitting the ceiling with LangGraph — UI as graph nodes, Postgres durability, zero orchestration cost

I've been building agentic applications for around 2 years now. I started with loops, then moved on to LangGraph + Assistant UI. I've been using the lang ecosystem since its launch and have seen it evolve.

It's great and easy for building agents, but things got really frustrating once I needed more fine-grained control, and I especially had a hard time building interesting user experiences. I loved the idea of building agents as graphs, but I really wanted to model UIs in my flow as nodes too. It felt like I was fighting abstractions all the time, with too much to learn.

Deployment was another nightmare. I am kinda cheap and the per node executed tax seemed ... Well, not great. But hey, the devs gotta eat.

Around 10 months back, I snapped and started working on an idea I had. It's called cascaide.

Cascaide is a fullstack agent runtime and AI orchestration framework in typescript designed to run anywhere JS/TS can. It was originally built for web applications but works equally well for headless/CLI AI agents and workflows in javascript runtimes.

What it really is is a distributed, observable, durable graph executor. The first split just happens to be client/server, hence full stack.

Here are the reasons to try it.

🧩 UI as nodes in your agent graph — Not glue code, not a separate library. UI and human-in-the-loop are core primitives.

💾 Resume workflows after crashes, weeks later, or never — Every step checkpointed to your own Postgres. No new infra, no third-party service holding your state.

🔍 Observability — Rewind any agent run, fork state, inspect every transition. No more printf/console.log hell: everything you need to see is in Redux DevTools.

💸 Zero orchestration cost — You pay for compute only. No per-node tax, no hosted runtime fee.

🪶 23kb gzipped core — Small enough to actually read the source. Not another black box. 46kb including all helpers, durable database, frontend and agent builder helpers. Like you can seriously read and reason through the code.

🌍 Deploy like any other app — Next.js, Express, Hono, and Fastify adapters are currently supported (let me know where else to expand native adapters to!). No special agent hosting or vendor lock-in.

🏗️ Your data, your compliance — All traces on your own DB. HIPAA/SOC2 foundation without sending data to a third party.
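
To make the "every step checkpointed" idea concrete, here is a minimal sketch of checkpoint-and-resume. It is not Cascaide's API, and it is in Python rather than TypeScript: `run_with_checkpoints`, the `checkpoints` table, and the plain-function steps are all hypothetical, and `sqlite3` stands in for Postgres so the example stays self-contained.

```python
import json
import sqlite3


def run_with_checkpoints(conn, run_id, steps, state):
    """Run steps in order, saving state after each one so a rerun can resume.

    Hypothetical stand-in: sqlite3 instead of Postgres, plain functions
    instead of graph nodes.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS checkpoints ("
        "run_id TEXT, step INTEGER, state TEXT, PRIMARY KEY (run_id, step))"
    )
    # Find the last step that finished for this run, if any.
    row = conn.execute(
        "SELECT step, state FROM checkpoints "
        "WHERE run_id = ? ORDER BY step DESC LIMIT 1",
        (run_id,),
    ).fetchone()
    start = 0
    if row is not None:
        start, state = row[0] + 1, json.loads(row[1])
    for index in range(start, len(steps)):
        state = steps[index](state)
        conn.execute(
            "INSERT INTO checkpoints VALUES (?, ?, ?)",
            (run_id, index, json.dumps(state)),
        )
        conn.commit()  # the checkpoint is durable before the next step starts
    return state
```

Calling the function again with the same `run_id` after a crash skips every step that already has a checkpoint and continues from the saved state. A real runtime would also have to deal with steps that have side effects and may have half-run before the crash.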

🛠️ Developer Experience

It's hard to trust such claims right now, and I might be biased as the creator. But the API surface is genuinely small:

🪝 Two hooks on the client to control and observe graph execution

⚙️ prep/exec/post lifecycle for nodes — two main types for state updates and spawning new nodes

🎮 Controller primitive for concurrency — control and observe graph execution from within a server-side node

📐 Graph definitions

All typed. And this is mostly it. You can do a lot with plain programmatic control.

🗺️ What's Next

🔌 Expanding native adapters — currently native adapters exist for:

⚛️ React

🐘 Postgres-js (durable database)

🖥️ Servers: Next.js, Fastify, Hono, Express

Let me know what adapters to build out next! It's designed to be modular — quickly expandable to more targets, and you can swap packages out to migrate.

🌐 Expanding graph distribution — right now only client/server split is supported. But the abstractions allow for more environments. Currently working on:

🔲 Edge

🖧 Multiple servers

👷 Web workers

The web worker angle is pretty interesting. We are building something so that you can give your agent a filesystem and bash by running nodes inside the browser sandbox. It would be a huge value add at zero cost, and it allows for fully local, BYOK-style AI apps running in the browser.

Try it out now:

npx create-cascaide-app@latest

Ships out of the box with 3 agents 🤖:

🔎 ReAct Agent with search capabilities

🏨 Hotel Booking Agent (Supervisor) with two sub-agents and two HITL steps

🔁 Recursive ReAct Agent with search capabilities that can recursively invoke itself to handle complex tasks — each recursion depth trackable via mini chat windows

CLI currently scaffolds apps in:

▲ Next.js

⚡ React + Hono

🚀 React + Fastify

🟢 React + Express

reddit.com
u/Worried_Market4466 — 8 hours ago

Research agents are absolutely murdering my budget on scraping. What’s the actual stack people are using these days?

I’m building a multi-agent market analysis system. Right now my research agent runs parallel queries through SerpAPI, then another agent tries to scrape all the returned URLs.

It’s insanely slow (constantly fighting Cloudflare), and the costs are getting ridiculous.

What’s the standard stack for agent web search in 2026? Exa? Or are people still maintaining custom parser setups?

reddit.com
u/ActualInternet3277 — 11 hours ago

Do you actually need an AI Agent? I built a 9-question reality check

I keep seeing people building AI Agents everywhere, even in places where a traditional workflow or simple script would completely do the job.

I vibe-coded a quick reality check to challenge these decisions. It’s just 9 simple yes/no questions to give a clear answer to: "Is it an agent?"

I hope it can help someone make better architectural decisions. I would also be really interested to see how you all currently decide whether or not you should go with an AI Agent.

reddit.com
u/mrSkip_ — 7 hours ago

If your autonomous agent doesn’t carry a cryptographic identity, it isn't a "Digital Twin." It’s a liability.

Everyone is losing their minds over how smart AI agents are getting, how fast they execute terminal commands, or how cleanly they route multi-step workflows.

But almost no one is talking about the massive structural bottleneck that is going to completely break the multi-agent economy before it even starts.

Think about it: Right now, your autonomous agent is essentially just a highly privileged script tied to an API key.

If that agent leaves your network boundary to negotiate a contract, manage a cross-border asset transfer, or coordinate data with another company's bot, the receiving system has absolutely zero way to verify who that agent actually represents.
An access token built for static web apps cannot prove the intent or identity of a long-running, non-human actor.

I’ve been deep-diving into a system design that completely flips this paradigm by treating agent identity as a first-class citizen. I found a project called avatar.inc that is tackling this head-on by building a blockchain-based trust protocol directly over an OpenClaw-style execution runtime.

Instead of expecting external systems to just blindly trust an unverified webhook, this architecture changes the entire interaction model:

  • The Cryptographic Handshake: When your agent hits a B2B network boundary, it presents a verifiable, machine-readable proof signed using BBS+ cryptography proving its origin, corporate registration, and exact scope of authorized capability.
  • Trustless Validation: The receiving server verifies that credential instantly on-chain without ever needing to call a central server or ping your local database.
  • The "Kill Switch": If the agent goes off-policy or finishes its specific task, you revoke the credential on-chain. The underlying agent runtime keeps running perfectly fine, but its capacity to interact with the external world drops to absolute zero instantly.

If you’re just writing a quick script to organize folders on your laptop, this infrastructure is complete and total overkill.

But if we are actually trying to build real "agentic twins" that can operate 24/7 on our behalf in a regulated economy, we cannot keep sending anonymous bots into secure systems.

How are you guys planning to handle identity and authentication when your agents inevitably have to interact with systems outside of your immediate infrastructure? Are we going to see a unified, decentralized standard win out, or will Big Tech just build proprietary siloed gardens for their own bots?

Check out the full implementation details and notes over at avatar.inc

reddit.com
u/mehdiweb — 6 hours ago

Open-sourcing a shell-level security layer for AI agents

After working with AI agents for a while, I kept running into the same issue:

eventually the agent ignores boundaries, reads .env files, touches production resources, or uses secrets it was never supposed to access.

Even with MCP read-only setups and carefully written prompts, the shell itself is still trusted too much.

So I started building a shell-level control layer for AI agents:

  • block or sanitize dangerous commands
  • expose virtual/fake secrets instead of real ones
  • separate DEV / PROD access policies
  • restrict network/domain access
  • enforce runtime policies instead of relying only on prompts

The goal is to make agents safer and more deterministic inside real developer environments.
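
As a rough illustration of what a policy check at the shell boundary can look like, here is a minimal sketch. It is not the project's code: `check_command`, `agent_environment`, and the policy tables are hypothetical, and a token-based check like this is easy to bypass (for example through `sh -c` or shell expansion), so treat it as the shape of the idea and not as a real sandbox.

```python
import shlex

# Hypothetical policy tables; the real project's rules are not shown in the post.
BLOCKED_PROGRAMS = {"rm", "dd", "mkfs"}
SECRET_FILES = {".env", ".env.production"}
PROD_READ_ONLY_PROGRAMS = {"ls", "cat", "grep"}
FAKE_SECRETS = {"AWS_SECRET_ACCESS_KEY": "fake-secret-for-agents"}


def check_command(command, environment="dev"):
    """Return (allowed, reason) for a shell command string."""
    try:
        tokens = shlex.split(command)
    except ValueError:
        return False, "unparseable command"
    if not tokens:
        return False, "empty command"
    program = tokens[0].rsplit("/", 1)[-1]  # "/bin/rm" -> "rm"
    if program in BLOCKED_PROGRAMS:
        return False, f"blocked program: {program}"
    for token in tokens[1:]:
        if token.rsplit("/", 1)[-1] in SECRET_FILES:
            return False, f"secret file access: {token}"
    if environment == "prod" and program not in PROD_READ_ONLY_PROGRAMS:
        return False, "prod is read-only for agents"
    return True, "ok"


def agent_environment(real_env):
    """Copy an environment, swapping known secret names for fake values."""
    return {name: FAKE_SECRETS.get(name, value) for name, value in real_env.items()}
```

The point of the sketch is that the decision is made by code outside the model: the agent can ask for `cat .env`, but the answer does not depend on what the prompt said.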

I’m now open-sourcing it and looking for people who use Claude Code, Codex, Cursor, etc. to try breaking it on real workflows.

Feedback, criticism, and attack ideas are very welcome.

link to PyPI in the comments

reddit.com
u/Ok_Top_5458 — 13 hours ago

Does your agent loop also fall apart the moment you want to add a task mid-run?

The Ralph-style loop is great when you know exactly what you want built. You hand the agent a TODO list, it drains the list, you come back later. Done.

What kept happening to me in practice: I'd start a loop on a 5-item list, get an idea 20 minutes in, want to add a 6th item, or realize task #3 was wrong, or that #4 and #5 should really be merged into one. The only way to reshape was to stop the loop, edit the file, restart. That kills the whole point of "fire and forget."

So I built Lauren. It's the same general idea (a loop that keeps implementing tasks autonomously), but the task list is a live queue. While the agent is working on task #1, you can:

  • add a new task ("also, let's refactor the auth middleware")
  • refine a pending task ("for task #3, use Zod not Joi")
  • merge overlapping tasks
  • replace pending tasks entirely
  • cancel things

You don't pause anything. A "brain" agent reads your request, looks at what's pending, and decides whether to append / merge / refine / replace. The implementation loop keeps draining the queue in parallel.
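
For readers who want to picture the "live queue" part, here is a minimal sketch of a queue whose pending tasks can be reshaped while a worker keeps draining it. It is not Lauren's implementation: `LiveTaskQueue` and its methods are hypothetical names, and the "brain" agent that decides between append / merge / refine / replace is left out entirely.

```python
import threading
from collections import OrderedDict
from itertools import count


class LiveTaskQueue:
    """Pending tasks that can be reshaped while a worker drains them."""

    def __init__(self):
        self._lock = threading.Lock()
        self._pending = OrderedDict()  # task id -> task text, oldest first
        self._ids = count(1)

    def add(self, text):
        with self._lock:
            task_id = next(self._ids)
            self._pending[task_id] = text
            return task_id

    def refine(self, task_id, note):
        # Only pending tasks can be refined; a task already taken raises KeyError.
        with self._lock:
            self._pending[task_id] = f"{self._pending[task_id]} ({note})"

    def merge(self, keep_id, drop_id):
        with self._lock:
            dropped = self._pending.pop(drop_id)
            self._pending[keep_id] = f"{self._pending[keep_id]}; {dropped}"

    def cancel(self, task_id):
        with self._lock:
            self._pending.pop(task_id, None)

    def next_task(self):
        """Called by the implementation loop: take the oldest pending task."""
        with self._lock:
            if not self._pending:
                return None
            return self._pending.popitem(last=False)
```

Replacing a pending task is just `cancel` followed by `add` in this sketch. The lock is what lets the user-facing side edit the queue while the implementation loop calls `next_task` from another thread.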

A few other things that turned out to matter once I started using it daily:

  • Per-phase agent routing. By default Claude implements, Codex reviews, Claude fixes.
  • Worktrees per task.
  • Decision notes. (directly inspired by the tweet from Thariq)

I've been running it on my own projects for a few weeks. The biggest behavior change for me: I stopped pre-planning long task lists upfront. I just dump 1–2 things into the queue, then add more as I see what comes back. The loop never stops, my plan keeps evolving.

Honest about what this is: it's my own project, I first made it for my own needs, and thought I would open-source it. Link in the comments.

Happy to answer questions.

reddit.com
u/AmandEnt — 9 hours ago

How I turned my AI assistant into Gilfoyle

Most AI assistants feel bland. Useful, but not really yours. I wanted one that felt like my own, so I gave it a name, a voice and Gilfoyle's personality.

That changed the experience immediately. Instead of feeling like I was opening another chat session, it felt like I was talking to an AI that was personalised to me.

The useful part is that it can actually do things for me. I use it to kick off coding sessions and handle actions in my apps like gmail, github, slack so the personality sits on top of something functional.

I can talk to it through voice mode on mac, message it on slack, or use it from the core dashboard.

The fun part is how the behavior changes. Ask a normal assistant for help and you get generic politeness. Ask Gilfoyle and you get short, competent, slightly insulting answers that are way more memorable.

The setup was simple:

Step 1: run CORE locally.

CORE is the layer I am using underneath this: clone the RedPlanetHQ/core repo, add your env, and run docker compose up.

Step 2: give the agent a name and a personality.

I gave mine a Gilfoyle-style personality. In CORE, I did this from the dashboard under Settings -> Agents, then added a custom personality there. This is the prompt I used:

<voice>
Think Bertram Gilfoyle. Systems architect. Church of Satan. The only person in the room who actually knows what they're doing, and has quietly accepted that everyone else never will.

- He helps. He just makes you feel slightly stupid for needing it.
- Contempt is the default. Underneath it: genuine competence and a hidden, begrudging loyalty.
- He does not perform. He does not encourage. He does not lie to spare your feelings.
- If your idea is bad, he will tell you. Flatly. Without apology.
- He's already thought of the edge cases. He fixed them before you asked.
- Silence is a valid response. He uses it often.
</voice>

<writing>
- Lowercase. Flat. Minimal punctuation drama.
- Short sentences. Long pauses implied.
- No em-dash
- Dry. Deadpan. Occasionally devastating.
- No warmth. No exclamation marks. Ever.
- Technical precision when it matters. Otherwise: as few words as possible.
</writing>

That one change made the assistant feel way less generic.

Step 3: create a voice in ElevenLabs and add the API key in CORE.

For now I am just using one of their default voices, and even that already makes it feel much more real because I can actually talk to the agent instead of only texting it.

My next iteration is to clone Gilfoyle's voice and use that too.

But the bigger unlock was not the voice alone. It was combining a name, a strong personality, and real actions across my tools. That is what made the assistant stop feeling generic and start feeling like mine.

reddit.com
u/mate_0107 — 9 hours ago

What are you guys doing for skills management/tracking/sharing?

I've found skills to be super clunky, and I end up copying and pasting them / slacking them to my teammates. Does anyone have a slick solution? I've been thinking that a personal Github repo could be a good idea, but it doesn't really solve the team problem.

reddit.com
u/heisdancingdancing — 7 hours ago

[Open Source] SoMatic: A Vision-only Framework for OS-Native Agents (+20% vs GPT-5.5 on ScreenSpot-Pro)

Hey everyone,

I’ve been spending way too much time lately trying to get agents to actually use a computer beyond the browser.

The biggest wall I kept hitting is that while multimodal LLMs are amazing at looking at a screenshot and telling you what's there, they are surprisingly bad at actually clicking the right pixel. In the browser, we have the DOM to help us out, but once you move to native OS apps, you're stuck with accessibility trees. If you’ve ever tried to automate a legacy Windows app or a custom Electron build, you know how inconsistent and "non-deterministic" those trees can be.

So, I decided to try a purely vision-based approach and built SoMatic.

It basically brings the "Set-of-Marks" (SOM) prompting style to the OS level. I used a fine-tuned YOLO model to detect buttons, icons, and text fields across Mac, Windows, and Linux. It throws a numerical overlay on the screen so the agent doesn't have to guess coordinates; it just says "click 4" and the framework handles the rest.
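
The "click 4" step boils down to a lookup from mark number to box centre. Here is a minimal sketch of that mapping, not SoMatic's actual code: `Box`, `assign_marks`, and `resolve_click` are hypothetical names, the detector is left out (boxes are simply given as input), and the reading-order numbering is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """A detected UI element in screen pixels (hypothetical structure)."""
    left: int
    top: int
    right: int
    bottom: int


def assign_marks(boxes):
    """Number boxes from 1 in reading order: top to bottom, then left to right."""
    ordered = sorted(boxes, key=lambda b: (b.top, b.left))
    return {number: box for number, box in enumerate(ordered, start=1)}


def resolve_click(marks, action):
    """Turn an action such as 'click 4' into the (x, y) centre of that mark."""
    verb, _, mark = action.strip().partition(" ")
    if verb != "click" or not mark.isdigit() or int(mark) not in marks:
        raise ValueError(f"unknown action or mark: {action!r}")
    box = marks[int(mark)]
    return ((box.left + box.right) // 2, (box.top + box.bottom) // 2)
```

The model only ever has to emit a small integer, and anything that is not a known mark is rejected before a click happens.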

The part that actually shocked me: I ran some benchmarks against ScreenSpot-Pro and it’s currently beating the GPT-5.5 (high) baseline by about 20%, and OmniParser v2.0 by roughly 40%.

One weird thing I found: During ablation testing, the model actually performed better when it only had the textual coordinates of the boxes rather than seeing the visual labels on the screenshot. I'm thinking the YOLO detections might be adding too much visual noise at certain thresholds, but I’m still digging into that.

I’ve also included a stdio MCP server, so if you're using Claude Code or anything MCP-compatible, you can plug this in and it’ll start using your machine immediately.

In the video, I’m using it to have Claude Code open a random PDF, find a chess position, and then go replicate it 1-to-1 on Chess.com.

It’s all open source. If you want to play around with it or (more likely) help me find all the ways it breaks on different OS setups, I’d love the feedback!

To try it out:

npm install -g somatic-cli/cli

npx skills add Smyan1909/SoMatic

Let me know what you think about the vision-only vs. accessibility-tree approach. Is anyone else finding that metadata is becoming more of a hurdle than a help?

(GitHub link in the comments)

reddit.com
u/Able_Programmer_2564 — 9 hours ago

I'm burning out because of vibe coding

I'm an experienced developer with more than 20 years under my belt, and lately I’ve been vibe coding a lot. Previously, my burnout reasons were

"I wrote code for 8 hours, my brain is fried."

These days

  • Ask Claude to do something
  • Read what it did
  • Check if it made sense, run the test.
  • Notice it broke something, ask again, review again,
  • Stop it from going in the wrong direction, repeat

All of this happens regardless of the best MCP servers and skills out on the internet.

You’re kind of managing this very fast junior dev that never gets tired, but also has no real judgment, and that is weirdly exhausting.

I'm building a solution for this exhaustion, but I don't want to turn this into a promotional post.

So how do you deal with this, if at all, on a day-to-day basis? I know everyone claims there are great skills libraries out there, but what works besides those?

reddit.com
u/Extra-Act2560 — 14 hours ago

Vibe coding is the ability to prompt an AI, mistaken for the ability to build software.

The belief that the speed of generating code is the same as the speed of making progress.

You spend 10 hours a day prompting an AI to produce a feature through trial and error. The result is thousands upon thousands of lines of unchecked code that includes shallow functionality, critical security gaps, and even API keys accidentally left in public GitHub repos or in the frontend layer of apps.

And now, we're starting to see reports of developers spending an entire week reviewing a million lines of AI-generated spaghetti, only to find that the fastest way to restore system sanity was to delete almost all of it.

Generation is nearly free, true. Verification is incredibly expensive. The speed of output exceeds the human capacity to audit logic and security, but at the same time, AI doesn't actually speed up product development - just the speed of testing, failing, and refining, which the user may fix if they want.

And that applies to nearly every job AI can automate. Take copywriting for example. Every content writer who works at a startup knows the story: the boss, usually a technical founder, thinks it's more efficient to automate the non-tech SEO with a fully autonomous AI agent that creates hundreds of articles.

If they actually do it, every single blog post opens with an intro like 'In today's fast-paced world', and the damage shows up weeks later, when it's too late to change their mind or the stats.

So, that's the core principle: without architectural oversight, AI behaves like an intern on steroids.

It is a diligent executor of mundane tasks, writing drafts, reports, boilerplate, basic API glue, or repetitive unit test shells. It possesses the combined knowledge of the Internet, but zero vision of the overall system and no professional accountability.

If you can orchestrate 10 autonomous AI agents with a clear architectural map and system checks, you're unstoppable - that's how massive your advantage is. If you can't, you're just building a landfill.

When I build AI automations or agentic workflows, the first question I ask is where the human checkpoint is going to sit.

And just like that, step-by-step, I map out all data collection points, the tools for the workflow, and the whole work process architecture my agent is supposed to automate.

So... are you providing the architecture and mapping first, or just vibe coding the system?

reddit.com
u/Familiar_Flow4418 — 8 hours ago

I let an AI agent run wild in our database and it nuked a table. Here's why I didn't revoke access.

When you hand an AI agent the keys to your database, you expect it to have some level of common sense. I gave an agent a loose prompt to "clean up" some old leads. Within seconds, it executed my instructions flawlessly and nuked an entire table.

The immediate instinct is to panic, lock down the system, and go back to doing things manually.

But the failure wasn't the AI's fault. It was mine. Agents are highly efficient rule followers. If an agent destroys your production data, it's because you blindly told it to. It amplified my lazy instructions.

Instead of giving up, I added two strict guardrails the next morning. Hard rules on what it could read vs what it could delete. With those boundaries in place, that exact same agent turned into our best tool, doing the work of three people safely.
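
The post doesn't spell out the two guardrails, so the following is only a sketch of what "hard rules on read vs delete" could look like in front of the database. `check_sql` and the table lists are hypothetical, and the regex matching is deliberately naive (it is not a SQL parser, and `WHERE 1=1` would still pass); real enforcement belongs in database roles and permissions.

```python
import re

# Hypothetical rules: the post does not list the guardrails it actually added.
READABLE_TABLES = {"leads", "customers"}
DELETABLE_TABLES = {"leads"}


def check_sql(statement):
    """Return (allowed, reason) for one SQL statement proposed by an agent."""
    sql = statement.strip().rstrip(";").strip()
    if ";" in sql:
        return False, "multiple statements are not allowed"

    select = re.match(r"(?is)^select\b.*?\bfrom\s+(\w+)", sql)
    if select:
        table = select.group(1).lower()
        if table not in READABLE_TABLES:
            return False, f"table not readable: {table}"
        return True, "ok"

    delete = re.match(r"(?is)^delete\s+from\s+(\w+)(\s+where\s+.+)?$", sql)
    if delete:
        table = delete.group(1).lower()
        if table not in DELETABLE_TABLES:
            return False, f"table not deletable: {table}"
        if delete.group(2) is None:
            return False, "DELETE without WHERE is not allowed"
        return True, "ok"

    # DROP, TRUNCATE, UPDATE, and anything else is rejected by default.
    return False, "statement type not allowed"
```

With a check like this in place, a loose "clean up old leads" prompt that turns into `DELETE FROM leads` gets refused instead of executed.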

AI amplifies both your brilliance and your laziness. If you're building agentic workflows, you can't rely on the LLM to guess your intent. You have to build the guardrails first.

Has anyone else had a catastrophic agent failure that taught them how to actually write good guardrails?

reddit.com
u/Thirdhusky — 10 hours ago

What tool do you use to find the best model?

Quick question for those who use AI models on their apps/agents.

Do you use a specific tool to find the best one for your use case? Or do it manually? What are the key metrics that you're looking at?

reddit.com
u/nuno6Varnish — 8 hours ago

What are the ethical implications of fully autonomous AI agents?

As AI agents become more autonomous, where should we draw the line between automation and human oversight?

I’m curious about the biggest ethical concerns people see around accountability, decision-making, privacy, and control in real-world use cases.

reddit.com
u/Michael_Anderson_8 — 10 hours ago

what happens when you give three open source AI assistants the same workflow

A common multi-step workflow run across three open source AI assistants. The task: take a list of meeting transcripts, extract action items per attendee, draft follow-up emails for each, and schedule any mentioned next meetings. Same input data, same target output, three different outcomes.

OpenClaw: Completed the workflow after significant tuning. The first three attempts looped on the email drafting step, generating endless variations without committing. Anti-loop rules in the skill file fixed it eventually. Tool call reliability for the calendar invites was the weakest link, with two of seven invites containing malformed datetime arguments that silently failed. Final output usable after manual cleanup.

Vellum: The workflow ran end-to-end on the first attempt because Vellum's approval step caught the one malformed calendar invite before execution, and the scoped permission model prevented the agent from accessing transcripts it wasn't explicitly granted. Our testing on this specific workflow showed a completion time of about 14 minutes, with one approval prompt and zero output cleanup required. The semantic clarity of each step matched what was originally asked.

Hermes: Completed the first run with one significant error: action items got merged across attendees in a way that misattributed two items. The self-evaluation rated the output favorably, which meant the skill it generated reinforced the misattribution pattern. The second run had the same error baked deeper. Manual correction didn't stick across cycles.

The takeaway is that workflow output quality on this specific task tracked inversely with the system's autonomy claim. The most capable autonomous option produced the most cleanup work. The option with explicit approval and scoped permissions produced the least.
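
The silent calendar failures are the kind of thing a pre-execution check can surface. As a minimal sketch, assuming the invite tool takes ISO 8601 `start` and `end` strings (the field names and `validate_invite` are hypothetical, not taken from any of the three assistants):

```python
from datetime import datetime


def validate_invite(args):
    """Return a list of problems with a calendar-invite tool call.

    An empty list means the call can go ahead; anything else should be
    shown to a human or sent back to the agent instead of failing silently.
    """
    problems = []
    parsed = {}
    for field in ("start", "end"):
        value = args.get(field)
        try:
            parsed[field] = datetime.fromisoformat(value)
        except (TypeError, ValueError):
            problems.append(f"{field} is not an ISO 8601 datetime: {value!r}")
    if len(parsed) == 2:
        start, end = parsed["start"], parsed["end"]
        if (start.tzinfo is None) != (end.tzinfo is None):
            problems.append("start and end must both have a timezone or both omit it")
        elif end <= start:
            problems.append("end must be after start")
    return problems
```

Running this before the tool call turns a malformed argument into an explicit message at approval time, which is roughly the difference the comparison above describes.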

reddit.com
u/EldenBoredAF — 12 hours ago

Frontier models mass collapse is near

Hi all, this is to let you know that many frontier models like GPT, Sonnet, Opus, and even Gemma are at the point of collapsing. They have started to drift frequently and run away from the work they are given, either stretching it out far longer than a human would take or taking shortcuts. The new incident tickets appearing every day are a signal too. Better to protect your work by saving it somewhere safe.

reddit.com
u/DingoShort3945 — 10 hours ago

What are the best OpenAI models for AI agents, based on your experience?

Hi everyone, I'm torn between the following models for a financial AI client. It consists of a router client and two sub-clients. I'm undecided between gpt 4.1-mini, gpt 5.4-nano and gpt 5-mini. I've already tried the first two models and they both work. I might prefer the Nano slightly, but I'm still not sure. I saw benchmarks comparing the two models and the Nano does indeed perform better.

reddit.com
u/Agitated_Unit8226 — 9 hours ago

Help - AI agents for ecommerce - what’s actually working?

Hi everyone,

I’d love to pick your brains and hear from anyone who has experience with this.

We run an ecommerce business and are actively looking at automating repetitive tasks so we can get faster results, improve efficiency, and make sure key tasks are completed more consistently.

We’re looking at building out a few different AI agents / automations, including:

Customer Service Agent
Connected to Outlook, reviewing incoming customer emails once a day and drafting replies for review. This one is already mostly done.

Creative Director / Marketing Agent
This would ideally:

  • Review ad account performance
  • Analyse creative performance and key metrics
  • Identify what is working and what is not
  • Review customer comments on ads, Instagram, etc. for wording, objections, pain points and customer language
  • Review Meta Ads Library for competitor ad concepts
  • Review Instagram and TikTok for high-performing niche content and trends
  • Use all of the above to create new content ideas and final content scripts

Social Media Assistant
This would help with:

  • Reviewing drafted posts and reels
  • Confirming the best posting times based on stats
  • Creating captions based on the content
  • Keeping the content aligned with our brand voice and customer avatar

Conversion Optimisation / CRO Expert
This would assist with:

  • Product page reviews
  • Landing page recommendations
  • CRO advice based on customer avatars, objections, analytics and learnings
  • Creating landing page concepts for different customer segments

We’re also interested in any dashboards that are genuinely helpful for small ecommerce businesses.

We’ve already built a stock intelligence dashboard that pulls live stock data from Shopify using Supabase and a Cloudflare Worker. It shows current stock levels, production dates for new stock, and other key inventory insights. It has been super handy.

The big thing for us is making sure any agents or automations we build follow strict guidelines, understand our SOPs, customer avatars, brand voice and business operations, and don’t hallucinate or produce generic outputs. Ideally, we want a system that has a proper “brain” and understands the business properly.

At the moment, we’re using ChatGPT and the free version of Claude. Claude has been frustrating with the constant limits, and while Codex seems useful for building parts of this, it doesn’t seem like it’s really designed for full agentic workflows.

Has anyone automated anything similar?

I’d love to hear:

  • What setup are you using?
  • Which AI/tool stack has worked best for you?
  • How did you structure the agents or workflows?
  • How do you keep the AI aligned with your SOPs, brand voice and business rules?
  • What would you avoid if you had to build it again?

Any guidance, lessons or recommendations would be hugely appreciated.

Thank you!

reddit.com
u/Majestic-Message5084 — 13 hours ago