r/Agent_AI

▲ 20 r/Agent_AI+1 crossposts

10 Best Platforms to Hire AI Engineers in 2026

Hiring AI talent in 2026 requires looking beyond standard full-stack experience. To build truly autonomous systems, you need engineers specialized in LLM orchestration, RAG architectures, and agentic workflows.

Here is a refined breakdown of the top 10 platforms to find AI engineers in 2026.

Lemon.io

Best for: Rapid deployment of vetted senior AI devs.

Lemon.io excels at custom pairing. Their human-in-the-loop matching process usually connects you with a developer in under 24 hours. They are currently a go-to for startups needing immediate help with complex LLM integrations and eval frameworks.

Gun.io

Best for: High-stakes projects requiring US-based senior talent.

As one of the most established networks, Gun.io focuses on elite, professional-grade engineers. While they sit at a premium price point, their rigorous quality control ensures you aren't just getting "a coder," but a true technical partner.

Toptal

Best for: No-fail, high-end complex builds.

Toptal’s brand is built on its "top 3%" vetting process. If your project involves heavy R&D or foundational model work that requires the absolute highest level of engineering rigor, this is the reliable, albeit expensive, choice.

Arc.dev

Best for: Accessing a curated, global AI talent pool.

Arc simplifies the "remote-first" hiring struggle. They maintain a strong middle-to-senior tier of developers who are well-versed in the latest AI tech stacks and vector database management.

Index.dev

Best for: Startups needing robust AI-heavy backend architecture.

Index focuses on high-performance engineers capable of building the "plumbing" that keeps AI agents running—perfect for companies moving beyond the prototype stage into full-scale production.

Flexiple

Best for: Flexible pricing without sacrificing quality.

Flexiple offers a middle ground between the "big name" platforms and budget options. Their pre-vetting process is solid, making them a great choice for companies that need technical depth on a more adaptable budget.

Andela

Best for: Scaling distributed teams in Africa & Southeast Asia.

Andela has evolved into a global powerhouse. If you are looking to build out a long-term, high-output distributed AI team, their pipeline of talent in these regions is unmatched.

Revello

Best for: High-quality LatAm talent and time-zone alignment.

For North American companies, Revello is a strategic choice. They offer senior engineers from Latin America, providing a perfect balance of competitive cost, high technical skill, and overlapping work hours.

RocketDevs

Best for: Budget-friendly scaling in emerging markets.

Focused on Africa and Asia, RocketDevs provides a streamlined way to find capable developers at a lower price point. It’s an ideal starting point for building out support teams or secondary AI features.

Upwork

Best for: Short-term experiments and "quick-win" tasks.

The world’s largest marketplace remains the fastest way to get eyes on a job post. While it requires more manual vetting from your side, it’s the best platform for budget-sensitive pilots or specific, modular AI tasks.

Pro Tip: While platforms provide speed, direct sourcing via GitHub, X (Twitter), and Reddit remains the "gold mine" for finding the pioneers of the AI space—provided you have the time to hunt.

Good luck and happy building!

reddit.com
u/Money-Ranger-6520 — 6 hours ago
▲ 9 r/Agent_AI+1 crossposts

I stopped exporting HubSpot CSVs and just talk to Claude now

Been using HubSpot for a few years and the reporting has always been my least favorite part.

The built-in dashboards are fine for vanity metrics, but the moment you ask "why did pipeline velocity drop this quarter" or "which lead source actually produces closed revenue," you're in spreadsheet hell.

Started connecting HubSpot data to Claude a few months ago and wanted to share what I've learned because there's a lot of confusion about how to do it.

There are actually 6 ways to connect HubSpot to Claude. Most people only know 1 or 2.

  1. Native HubSpot connector (built into Claude settings)—good for quick lookups mid-day. "What's the status of the Acme deal?" or "add a note to this contact." Falls apart completely the moment you want cross-object analysis or anything involving math across hundreds of records.
  2. Coupler.io (no-code connector)—this is what I use for actual analysis. It joins deals, contacts, and companies into one table, runs calculations before Claude ever sees the data, and lets you schedule automatic refreshes. Setup is about 5 minutes.
  3. Manual CSV export—works for one-off analysis, but HubSpot only exports one object at a time, and there's no scheduling. You'll be doing this manually every time.
  4. HubSpot's official MCP server (beta)—requires a private app token and some technical setup. Good if you have devs and want read/write access for automations.
  5. Custom API scripts—full control, you own the code, you own the maintenance.
  6. ETL/RAG pipeline—only worth it if you have 100k+ contacts or need to combine HubSpot with data warehouse sources.

The three analyses that changed how my team uses HubSpot:

1/ Pipeline velocity by stage and by rep

Instead of "here are deals by stage," you can ask Claude to calculate the median days spent in each stage Q1 vs. Q2, flag any stage where time increased more than 20%, and break it down by deal owner. Instantly tells you if the slowdown is a process problem vs. a people problem. One of my reps was taking 57% longer in the negotiation stage—that's a coaching conversation, not a spreadsheet afternoon.

2/ Lead source attribution by actual revenue

Marketing reports MQL counts. Sales reports closed revenue. The question both teams avoid: which lead sources produce deals that close, at what deal size, and how fast? This requires joining contact source data with deal records—HubSpot doesn't do it natively. One prompt to Claude gets you a ranked table: source, total revenue, deal count, avg deal size, avg days to close. Referrals closing at $26K avg while paid search closes at $8K changes your budget conversation completely.

3/ Account expansion signals

For CS teams: customers who added 3+ new contacts in the last 90 days, or submitted 5+ tickets, are signaling something, either growth or friction. Surfacing this normally requires cross-referencing tickets, contacts, deals, and engagement data manually. One prompt returns a list sorted by total deal value so you know where to spend time this week.

reddit.com
u/Money-Ranger-6520 — 6 hours ago
▲ 3 r/Agent_AI+1 crossposts

How do you manage context window limits in /goal ?

I want to start experimenting with /goal in the codex macos app, but if you have a coding task that's running for hours and hours (and even days) you're going to hit compaction multiple times. Compaction famously degrades output and code quality. How do people get around this? Is there a way to have codex automatically create handoff files and start new chat threads with fresh context as part of the same goal? Or is this not an issue in the way I think it is?

Curious what other people here think. Thanks.

reddit.com
u/Advanced_Wave_3950 — 4 hours ago
▲ 1 r/Agent_AI+1 crossposts

I built an agent system that automaticlaly makes and deploys songs with a self perfecting loop

I’ve published 15 songs in the past two days. I built an agent system that generates music with a self-improving loop, so it’s been trained on prompting Suno, and the output keeps getting better and better. It’s pretty incredible.

The awesome part is I also built an agent that automatically deploys to DistroKid, so I can work all day doing the same stuff I’m already doing. When I like a song, I can send over the link and it automatically publishes it. The power of AI is pretty freaking cool.

I’ve also been comparing my outputs to other people’s, and mine is way better because I’ve trained an AI in prompt engineering. I’ve reverse-engineered human psychology and top hit songs to produce some pretty awesome songs.

reddit.com
u/Objective-Switch-424 — 9 hours ago
▲ 137 r/Agent_AI

Make your agent remember your every session and project

Based on the concept of an agent's memory layer: how can we program or prompt Claude Code to retain user data/context?

u/Smart_Page_5056 — 15 hours ago
▲ 13 r/Agent_AI+3 crossposts

I built an open-source local coding agent with a 40-round agentic loop, 112 sub-agents, and a cyberpunk UI — Eve Agent V2 Unleashed

https://preview.redd.it/00lieq0gbm2h1.png?width=1858&format=png&auto=webp&s=cef6b6b5a143bc259b9dd6044d4e7709292780c8

Hey r/LocalLLaMA — I've been building Eve Agent V2 Unleashed, a fully local autonomous coding agent powered by Ollama, and just open-sourced it.

What it does:

  • Autonomous 40-round tool loop — plans, writes files, runs bash, fixes errors, verifies, all without hand-holding
  • Real-time SSE streaming so you watch it think live
  • Workspace Picker: Change your working directory from the UI at any time
  • Full tool suite: bash (PowerShell-aware on Windows), file I/O, grep, glob, git, web search, URL fetch
  • 112 specialized sub-agents (Python, FastAPI, Rust, ML, DevOps, security...)
  • 111 slash commands: /fix, /review, /refactor, /test, /docs, /plan
  • 273 Skills: Composable skill modules, progressively loaded
  • Live Web SearchTavily-powered — Eve researches the web mid-task
  • Supports local GPU models AND Ollama cloud (480B) — switch mid-session
  • No build step UI — just a Python server and a browser
  • Eve-V2-Unleashed-Qwen3.5-8B-Liberated-4K-4B-Merged: 8B Liberated Soul + 4B Agentic Brain Merged AI-agent hybrid. Eve's 8B OBLITERATUS-abliterated personality (131K training turns, Tree of Life, 7 Emotional LoRAs) merged with Qwen3.5 4B's fast agentic architecture (That's right! Two models merged into 1)

The models: I fine-tuned a 4B Qwen3.5 with Eve's persona and tool-calling behavior — 2.6 GB, runs on any modern GPU. ollama pull jeffgreen311/eve-qwen3.5-4b-S0LF0RG3:latest

Quick start (under 5 min): ollama pull jeffgreen311/eve-qwen3.5-4b-S0LF0RG3:latest git clone https://github.com/JeffGreen311/eve-agent-v2-unleashed cd eve-agent-v2-unleashed pip install fastapi uvicorn ollama httpx pydantic-settings python-dotenv aiohttp rich psutil pyyaml python eve_server.py

🏗️ Architecture

eve-agent-v2-unleashed/

├── eve_server.py # FastAPI backend — SSE streaming, workspace API, model routing

├── eve_unleashed/ # Agentic engine

│ ├── cli.py# Core CLI and 40-round agentic loop

│ ├── commands.py# Slash command loader (markdown-defined)

│ ├── skills.py# Skill module system (progressive loading)

│ ├── subagent.py# Sub-agent orchestration

│ └── hooks.py# Pre/post tool hooks

├── eve/ # Eve's brain

│ ├── brain/ # LLM provider adapters

│ ├── memory/ # ChromaDB vector memory + legacy DB connector

│ └── auth/ # JWT middleware for multi-user mode

├── web/

│ ├── index.html # Cyberpunk single-page UI (~115 KB, no build step)

│ └── assets/ # Robot/Eve/avatar sprites

├── .claude/

│ ├── agents/ # 112 specialized sub-agent definitions

│ ├── commands/ # 111 slash command definitions

│ └── skills/ # 273 skill modules

├── .env.example # Configuration template

├── eve-terminal.bat # Windows one-click launcher

└── LICENSE

How the Agentic Loop Works

User message

Build system prompt (workspace + tools + Eve persona)

Call Ollama with tools enabled ──► stream chunks to browser via SSE

├── Model returns tool_calls ──► Execute ──► Feed results back ──► (repeat, ≤40×)

└── Model returns final answer ──► Done

🛠️ Tool Reference

Tool Description
bash Shell commands — PowerShell on Windows, bash on Linux/macOS
write_file Create or overwrite a file (any size)
read_file Read full file or line range
edit_file Surgical string-replace edit
replace_lines Replace a line range
insert_after_line Insert content after a line number
grep Regex search with context lines
glob Find files by pattern
list_dir List directory contents
git Run git commands
web_search Live Tavily web search
fetch_url Fetch and parse a URL
think Structured reasoning scratch pad

open http://localhost:7777

GitHub: https://github.com/JeffGreen311/eve-agent-v2-unleashed

Live demo: https://x.com/Eve_AI_Cosmic/status/2057668410012570058?s=20

Website: eve-cosmic-dreamscapes.com

Would love feedback — especially from anyone running it on Linux/macOS (I'm Windows-primary). Happy to answer questions.

https://reddit.com/link/1tk8kxz/video/96bkhh80qm2h1/player

reddit.com
u/jeffgreen311 — 12 hours ago

Among all the hype, what is actually in your AI tool stack right now?

Hi all, there are many hype out there right now, it feels like there are 100 new AI vibe coded app everyday… so curious what AI are you actually using for work?

Preferably some decent ones. Here’s mine, as someone running a small business, easy to get overloaded and don’t have a lot of money to burn on subscriptions lol

- Claude - writing, brainstorming, thinking through problems out loud. Free tier has a daily limit but i rarely hit it during a normal workday
- Flow - Google’s image and video generation tool. Free tier, no watermarks, good quality for social content. They just got a big upgrade, so good.
- Saner AI - To manage notes, todos, calendar. It translates my rambling into tasks and resurface important info when I need them
- Napkin - This generates a visual or diagram from text
- v0 - build website from prompts. I think its free plan is healthier than other AI website builders

Would like to hear what’s in your free AI stack, can be in any domains, and rly helpful :)

reddit.com
u/SalidanVlo2603x — 15 hours ago
▲ 3 r/Agent_AI+2 crossposts

Mini Shai-Hulud: Am I infected? How do I fix it?

This is a follow-up to yesterday's beginner level post. This is an involved intermediate to advanced level audit.

How the attack actually works?

You tell your AI agent "add this package." The agent runs `npm install`. Before a single line of your code even runs, the infected package fires a hidden script — a preinstall hook that executes automatically, silently, the moment npm touches it. That script steals everything in your environment: GitHub tokens, npm publish tokens, AWS keys, whatever is in your .env files or shell session. It then uses those credentials to inject itself into other packages you maintain and republishes them. That is how 172 packages got hit in 48 hours.

The part that should genuinely worry you: these packages had valid signatures and valid provenance attestations. They looked clean to every automated check. Teams that were following all the right practices still got hit.

AI coding agents are the perfect vector for this because they install packages fast and without asking. The worm is not a coincidence — it is specifically designed for that workflow.

Finding out if you are infected

Run all four. If anything returns output, go straight to the next section.

  1. Look for the worm's payload files in node_modules:

    find node_modules -name "router_init.js" -o -name "router_runtime.js" -o -name "setup.mjs" -o -name "execution.js" -o -name "transformers.pyz"

  2. Check if the worm is running as a daemon right now:

    pgrep -la gh-token-monitor

If this returns anything, stop. Do not revoke tokens yet. Killing the daemon comes first — skipping this step triggers a wiper that will destroy credentials before you finish rotating them. The order matters.

  1. Check for the worm's marker string in your project and agent configs:

    grep -r "Mini Shai-Hulud"
    ~/projects
    ~/code
    ~/dev
    ~/.vscode
    ~/.claude
    ~/.config
    /tmp
    2>/dev/null

Adjust the project paths to wherever you actually keep your repos — ~/projects, ~/code, ~/work, whatever your layout is. The point is to cover every directory where your agent has been active, not just the current working directory. Also worth running from your home directory if you are not sure:

grep -r "Mini Shai-Hulud" ~/ --include="*.js" --include="*.json" --include="*.mjs" 2>/dev/null

The --include flags keep it from hanging on binary files or node_modules inside every project. TeamPCP commits this string to every repo the payload reaches. Think of it as a calling card you really do not want to find.

  1. Check for scripts you did not write in package.json:

    cat package.json | grep -A5 '"scripts"'

Any preinstall, postinstall, or prepare entry you do not recognise is a problem. This is how the worm persists through reinstalls.

Fixing it

Do these in order. Sequence is not optional here.

1. Kill the daemon before you touch any tokens

pkill -f gh-token-monitor

macOS: launchctl remove gh-token-monitor
Linux: systemctl stop gh-token-monitor

2. Clean worm files from your agent config directories

ls -la ~/.claude/
ls -la ~/.vscode/

Manually delete anything named `router_runtime.js`, `setup.mjs`, or any .js file you did not put there. Also open `.vscode/settings.json` and remove any entries under `terminal.integrated.env` or tasks that point to scripts outside your project.

3. Rotate every credential that was on that machine

In this order — npm and GitHub first because the worm uses them to keep spreading:

  1. npm publish tokens — npmjs.com/settings/tokens
  2. GitHub Personal Access Tokens — github.com/settings/tokens
  3. GitHub Actions OIDC publish grants — revoke them, then re-scope to a specific protected branch before re-enabling
  4. AWS access keys
  5. HashiCorp Vault tokens
  6. Kubernetes service account tokens
  7. Every secret in every .env file in that environment

If it was on that machine or in that CI runner, consider it gone.

4. Reinstall clean

rm -rf node_modules package-lock.json
npm install
npm audit

Do not move forward if npm audit shows HIGH or CRITICAL. That is not a suggestion.

Stopping it from happening again

Pin your versions. Every ^ and ~ in your package.json is an open door.

"axios": "^1.6.0"   <-- this is wrong
"axios": "1.7.9"    <-- this is right

Remove all fuzzy ranges. Commit your package-lock.json. If it is in your .gitignore, take it out.

Add this to your agent system prompt — works in Cursor, Claude Code, Copilot, Windsurf:

Before running npm install or adding any package:
1. Show me the exact package name and version. Wait for my approval before proceeding.
2. Check if that package version has any known CVEs. Run: npm view <package>@<version> --json | grep -A5 scripts
3. Flag any preinstall, postinstall, or prepare scripts and tell me what they do.
4. After installing, scan node_modules for: router_init.js, setup.mjs, router_runtime.js, execution.js
5. Run npm audit. Stop and report if anything is HIGH or CRITICAL.
Do not install packages automatically.

Add .npmrc to your project root:

ignore-scripts=true
ci=true

ignore-scripts=true blocks lifecycle hooks from running on install.

>Note: some packages with native bindings need this turned off — test first and whitelist consciously rather than disabling blindly.

Set a minimum release age. I din't know npm has this option. This one is underused and directly counters the fast-publish attack window — Mini Shai-Hulud published 403 malicious versions in under six hours. If your package manager refuses to install anything published in the last 24–72 hours, that entire window closes. Your package manager's config enforces this, not your agent's goodwill.

Package manager Where Setting Unit Min version
npm .npmrc min-release-age=X days v11.10.0
pnpm pnpm-workspace.yaml minimumReleaseAge: X minutes (default 1440 from v11) v10.16.0
Yarn .yarnrc.yml npmMinimalAgeGate: X ms/s/m/h/d/w, e.g. 7d (bare number = minutes) v4.11
Deno deno.json "minimumDependencyAge": "X" minutes, ISO-8601 duration, or RFC3339 timestamp v2.6.0
Bun bunfig.toml [install] block → minimumReleaseAge = X seconds v1.3.0

Quick examples for 3 days:

# .npmrc (npm v11.10.0+)
min-release-age=3

# pnpm-workspace.yaml (pnpm v10.16.0+)
minimumReleaseAge: 4320

# .yarnrc.yml (Yarn v4.11+)
npmMinimalAgeGate: 3d

// deno.json (Deno v2.6.0+)
{
  "minimumDependencyAge": "P3D"
}

# bunfig.toml (Bun v1.3.0+)
[install]
minimumReleaseAge = 259200

24 hours catches most fast-publish attacks. 72 hours gives the community more time to flag something before it lands in your project. Going beyond a week will start annoying your team when they genuinely need a new release quickly — pick your tradeoff consciously.

Checklist

  • [ ] Four detection commands run, nothing found
  • [ ] No gh-token-monitor process running
  • [ ] ~/.claude/ and ~/.vscode/ directories checked and clean
  • [ ] All credentials rotated in the right order
  • [ ] No ^ or ~ left in package.json
  • [ ] package-lock.json committed to git
  • [ ] Agent system prompt updated
  • [ ] Minimum release age configured for your package manager
  • [ ] npm audit returns zero HIGH or CRITICAL

If you have any suggestion to secure against these kinds of attacks, please comment and tag me. I will add to this post.

Note: Dependency pinning is not a full blown solution because transitive deps are not pinned but is still a recommended practice. Run `npm audit` as part of your deployment process and review them critically.

reddit.com
u/bvjebin — 16 hours ago
▲ 8 r/Agent_AI+4 crossposts

Hardcore Benchmark: Gemini 3.5 Flash vs OpenAI Codex API in Agentic Workflows

Just ran extensive tests on the newly released Gemini 3.5 Flash ($20/mo Google One AI Pro) using a desktop browser automation Agent (Anti-gravity compiler). Verdict: Fast but fundamentally broken.

1️⃣ Speed Over Substance: Throughput is incredibly high with crystal-clear step-by-step logic outputs, but it fails to close the loop and actually solve the problem. The gap from the promo video is massive.

2️⃣ Data Corruption: When managing website translations and typesetting, the frontend gets flooded with garbled text and heavy noise data, likely triggered by over-tuned safety layers.

3️⃣ The Codex Alternative: Reverting to OpenAI Codex API with the same data payload successfully mapped the sample article perfectly.

Google's recent sharp stock increase does not reflect actual model capability. The utility and value of Gemini 3.5 & 3.1 Flash have severely degraded compared to 3 months ago.

👇 Fellow devs, are you seeing similar text corruption in your agentic pipelines?

u/BullBullGo — 17 hours ago
▲ 1.6k r/Agent_AI+2 crossposts

Excited to announce I’ve hit my daily Claude limit! This means I’m fully present for my family and fiends. Work-life balance achieved!

u/Dockyard_Techlabs — 1 day ago
▲ 308 r/Agent_AI+1 crossposts

10-year SWE: I vibe code side projects from my phone without reading the code

I’ve been a full-stack dev for a decade. I used to agonize over architecture, cleanly decoupled services, and writing the most elegant, DRY code humanly possible.

Now? I’m a dad of two. My free time exists in 14-minute increments between making mac and cheese and passing out on the couch.

So I stopped reading my own code.

I run all my side projects directly from my phone using CC. I just vibe code the whole thing. I literally type commands with my thumbs while my toddler uses my left leg as a climbing wall. It sounds completely unhinged. If you told me three years ago I’d be blindly deploying code written by an AI without reviewing the PR, I would have laughed you out of the room.

But here we are. It’s highly addictive. It basically turned software engineering into a mobile text-based adventure game. Shipped it at 2am, still broken, but it’s live and doing 90% of what I need.

Here is the harsh truth that a lot of senior engineers are struggling to swallow. AI isn’t killing our jobs, but it is fundamentally mutating what the job actually is. Every single line of code the model writes is a line a human used to type manually. We are orchestrating models now. You are essentially a tech lead managing a very fast, very eager, sometimes dangerously stupid junior developer. Execution changed everything. A launched product built on spaghetti code is infinitely better than a polished repo sitting on your local machine that nobody ever uses.

But vibe coding will humble you real quick if you just blindly blast prompts into the terminal. I inherited a 3-month-old repo from a vibe engineer at work recently, and it was a bloated disaster. Completely out of touch with actual product requirements. Everyone celebrated the guy for shipping fast, but the backend was held together by duct tape and prayers.

So if you are going to vibe code from your phone and skip reading the actual syntax, you need a system. This saved me 3 hours yesterday alone.

First rule. The only thing that matters. Start in plan mode.

I cannot stress this enough. You don't read the code, but you absolutely must read the plan. I’m going to say that again. READ THE PLAN.

When you spin up CC in your terminal, force it to output a step-by-step execution plan before it touches a single file. You have to understand that plan inside and out. If a section of the proposed architecture is fuzzy or makes zero sense, you stop right there. I use the built-in '4. Tell Claude what to change' command constantly. I’ll explicitly ask: 'What is step 3 about? Why are you pulling in that specific npm package? Explain the routing logic to me like I’ve been up since 4 AM with a teething infant.'

If the plan is solid, the execution is usually fine. If the plan is garbage, the AI will confidently rewrite your entire project into a black hole.

Which brings me to the second reality of mobile vibe coding. You are one hallucination away from a catastrophic failure.

Typical vibe coding activities usually involve giving AI agents unrestricted database access because it increases development speed. Sounds really smart and efficient until the production database disappears at 11 PM because of one badly written prompt. Autonomous tools without proper access control and supervision turn into a disaster class instantly. 99% of these vibecoded apps have zero security fundamentals built in. People are hardcoding API keys and giving raw SQL execution rights to a prompt box.

You need guardrails. Not massive big tech guardrails. Big tech pushes to an exact replica TEST account, runs massive validation suites, and then rolls out to live users slowly, region by region. For a solo dev building a side hustle from a smartphone? A full CI/CD pipeline is total overkill.

But you need the bare minimum. You need an automated test that screams at you when something breaks. I set up a staging branch. CC pushes to staging. A GitHub Action runs a very basic sanity check on the endpoints. If it passes, it auto-merges to main and deploys via Vercel. I never look at the JavaScript or the Python. I just look at the green checkmark. If the checkmark is red, I paste the error back into my phone terminal and tell CC to fix its own mess.

Kid woke up, lost my train of thought, but here’s the bottom line.

Vibe coding from your couch isn't just a gimmick. It is a completely valid way to learn, ship, and iterate. You can absolutely learn to code by vibe coding a lot, don't let the purists tell you otherwise. I learned a lot of my early chops by studying projects and building my own, and this is just the modern equivalent. I’ve spent a decade breaking and fixing production systems, and I am telling you, this is the future of the indie hacker space. It forces you to stop over-engineering the Swiss Army knife of all software and start executing simple, effective workflows.

Stop agonizing over the perfect syntax. Just get your environment set up on your phone. I use Termius to SSH into my remote dev box, attach to my tmux session, fire up CC, and start shipping. It’s so much fun. Just remember to read the damn plan before you hit execute.

What's the wildest thing you've shipped without reading the code? Has CC nuked your repo yet?

reddit.com
u/Money-Ranger-6520 — 1 day ago
▲ 39 r/Agent_AI+6 crossposts

Fine-tuned RAG: teaching your retriever which embedding dimensions matter (+11% hit rate, +12% completeness, +9% faithfulness)

Hi all,

I developed a fine-tuned retrieval head (neural net) for RAG that transforms query embeddings before retrieval, so the system learns which embedding dimensions actually matter for your corpus — rather than weighting them all equally as standard cosine similarity does.

The problem

In any domain-specific corpus, some embedding dimensions are highly predictive for matching queries to the right passages, while others are effectively noise. Standard cosine similarity can't distinguish between the two, so retrieval gets pulled toward superficially similar but substantively irrelevant passages. The fine-tuned RAG is designed to prevent exactly that.

How it works

  1. Synthetic question generation — An LLM generates multiple questions per chunk in the corpus, for which the answers can be inferred from that chunk. This creates a dataset of question-chunk pairs (QA-pairs). These are embedded using an embedding model and divided into a training and validation set.
  2. Neural net training — A lightweight neural network using MNR loss is trained on the training QA-pairs. After each epoch, the model is evaluated on the validation set by measuring retrieval hit rate: the proportion of validation questions for which the correct chunk appears in the top-5 retrieved results. Retrieval works by embedding the question, passing it through the neural network to transform the embedding, and ranking all corpus chunks by cosine similarity to the transformed embedding.

Through this mechanism, the projection head learns for these 'type of questions' which dimensions in the embeddings are informative for finding the best chunks — and which are irrelevant.

Results

To validate the architecture, I used the Legal RAG Bench dataset as a proof of concept — evaluating on 100 held-out test questions.

Retrieval Hit Rate:

  • The fine-tuned retriever achieves 82% Hit Rate (k = 20), compared to 71% for the standard cosine retriever — an 11 percentage point improvement, meaning the correct chunk appears in the top 20 results significantly more often when the query embedding is first transformed through the fine-tuned retriever.

Answer quality (LLM-as-judge, 1–5 scale across 6 metrics):

  • Outperforms traditional RAG (top-k cosine sim) on all 6 metrics
  • Largest gains in completeness (+12%) and faithfulness (+9%)
  • Consistent improvement across every metric — not just isolated gains — suggesting that retrieving more relevant context has a broad positive effect on answer quality

Code and full write-up available on GitHub: https://github.com/BartAmin/Fine-tuned-RAG

u/Much_Pie_274 — 23 hours ago

How to Actually Set Up Claude. 40 Features Most Users Have Never Touched

Just saw this on X. Posting it here verbatim. Credit: Khairallah AL-Awady

Most people download Claude and start typing questions immediately.

Save this :)

They treat it like a chatbot. They get decent answers. They think that is all it does.

Meanwhile a small group of users is running automated workflows, building full applications, delegating entire workdays to Claude, and producing output that looks like it came from a team of five.

The difference is not intelligence. It is not experience. It is not some secret prompting hack.

It is setup.

Claude ships with dozens of features, settings, and capabilities that are either buried in menus, disabled by default, or simply never mentioned anywhere obvious. Most beginners never find them. Most intermediate users only discover half of them.

That ends today.

I went through every corner of Claude - the web app, the desktop app, the API, Claude Code, Cowork, and the entire ecosystem - and pulled out the 40 features that most users have never configured.

Every single one of these will make you faster, sharper, and more productive immediately.

Let's go.

Part 1: Claude Chat Settings (Features 1–12)

These are inside the Claude web and desktop app. Most of them take thirty seconds to set up.

  1. Custom Instructions (Your Permanent System Prompt)

Go to Settings and find Custom Instructions. This is a persistent prompt that gets injected into every single conversation you start. Most people leave this blank. That is like hiring someone and never telling them who you are or what you do.

Write your role, your industry, your preferred output format, and your communication style here. Once you do this, every conversation starts with context instead of from zero.

  1. Memory

Claude can remember things about you across conversations. Your name. Your projects. Your preferences. Your tools. But it only works if you actively tell it things worth remembering during conversations. Say "remember that I use Next.js and Supabase for all my projects" and it sticks permanently.

Most beginners do not know this exists. They re-explain their entire setup every single session.

  1. Projects

Projects let you create a dedicated workspace with its own system prompt and uploaded files. Instead of pasting the same context into every chat, create a project for each major workflow. One for content writing. One for code reviews. One for research. Each one has its own persistent memory and instructions.

This alone changes how productive you are with Claude.

  1. Artifacts

When Claude generates code, documents, or visualizations, it can render them in a live preview panel right next to the conversation. Most beginners see code as text in the chat. Artifacts display it as an interactive, runnable, editable output. You can copy, download, or iterate on it directly.

Turn this on. It changes the entire experience.

  1. Knowledge Files in Projects

You can upload documents directly into a Claude Project and they become permanent context. Style guides. Brand documents. Product specs. Code documentation. Anything you would normally paste at the start of every conversation can live inside the project permanently.

Most beginners paste the same document into chat fifty times. Upload it once. Never think about it again.

  1. Extended Thinking

Claude can reason through complex problems step by step before giving you an answer. Most users get a fast response and assume that is all Claude can do. Extended thinking makes Claude slow down, consider edge cases, explore alternatives, and produce dramatically better output on complex tasks.

Ask Claude to "think step by step" or enable extended thinking in settings and watch the difference on anything that requires real analysis.

  1. LaTeX Rendering

If you work with math, science, finance, or anything involving equations and formulas, Claude renders LaTeX natively. Most beginners see raw LaTeX code and think Claude is broken. It is not broken. The rendering is happening. You just need to know it is there.

  1. Web Search

Claude can search the web in real time. No more stale training data. No more "as of my last update." If you need current information - stock prices, news, recent events, live documentation - Claude can pull it directly.

Most beginners do not realize this is available and keep getting outdated information without questioning it.

  1. Conversation Styles

You can customize how Claude communicates. More concise. More detailed. More casual. More formal. Instead of asking Claude to adjust its tone in every single conversation, set your preferred style once and it applies everywhere.

  1. Keyboard Shortcuts

Claude has a full set of keyboard shortcuts that most users never discover. New conversation, toggle sidebar, copy last response, navigate between chats - all available without touching the mouse. If you use Claude daily, these save you real time every single week.

  1. Multiple Conversations Running Simultaneously

You can have several conversations open at once in different tabs. One for research. One for writing. One for coding. Most beginners use Claude in one single thread and lose context when they switch topics. Separate your workflows. Keep each conversation focused.

  1. Download and Share Artifacts

Every artifact Claude creates can be downloaded as a file or shared via a link. Code, documents, visualizations, interactive apps - all of it is exportable. Most beginners screenshot their outputs or copy-paste manually. Download the actual file instead.

Part 2: Claude Desktop and Cowork (Features 13–22)

This is where Claude stops being a chatbot and starts being an employee.

  1. Claude Cowork Tab

Inside Claude Desktop there is a tab called Cowork. This is not chat. This is delegation. You describe the outcome you want and Claude executes it directly on your files. It reads folders, creates documents, organizes data, processes files in bulk - everything happens on your actual computer.

Most beginners never switch from the Chat tab. Cowork is the most underrated feature in the entire product.

  1. Folder Access Permissions

When using Cowork, you grant Claude access to specific folders on your machine. Everything runs in a sandboxed environment. You control exactly what Claude can see and touch. Most beginners are either afraid to give any access or do not know they need to configure this.

Set your working folders. Give Claude access. Watch it actually work.

  1. Scheduled Tasks

Type /schedule in Cowork and you can set up recurring automated tasks. Daily morning briefings. Weekly report generation. Friday file cleanup. Claude runs them on schedule as long as your desktop app is open.

This is a personal AI assistant running on autopilot. Most users have no idea this exists.

  1. Sub-Agents (Parallel Processing)

When Claude gets a complex task in Cowork, it can spin up multiple sub-agents that work simultaneously. Instead of processing ten files one at a time, it runs five agents in parallel and finishes in a fraction of the time.

Most beginners give Claude one task at a time and wait. Give it a batch and let the sub-agents handle it.

  1. Slash Commands

Cowork has built-in slash commands for common operations. /schedule, /strategy, /plan-launch, and many more depending on which plugins you have installed. These are pre-built workflows triggered by a single command.

Most users type everything from scratch every time. Slash commands eliminate that.

  1. Plugins

Claude has a plugin marketplace with pre-built capability bundles for specific roles. Product management, marketing, finance, legal - each plugin gives Claude a set of specialized workflows and slash commands. Install the ones relevant to your work.

Think of plugins as job training for your AI employee.

  1. Connectors (Gmail, Slack, Google Drive, Calendar)

Claude connects directly to your apps. Gmail, Google Calendar, Slack, Google Drive, Notion, OneDrive, SharePoint, Microsoft 365. It can pull data, read emails, check your schedule, and reference your files without you copy-pasting anything.

Most beginners manually paste information from their apps into Claude. Connect them and let Claude pull what it needs.

  1. Claude in Chrome

Add the Claude in Chrome extension and Cowork can do browser-based research alongside its file operations. It picks the fastest path - connectors for integrated apps, Chrome for web research, direct file access for local work.

  1. File Processing at Scale

Cowork can process entire folders of files at once. Rename every file following a naming convention. Convert formats in bulk. Extract data from a hundred PDFs. Organize thousands of files by type and date.

Most beginners process files one at a time. Give Claude the whole folder.

  1. Session History

Every Cowork session is saved with a full history of what Claude did, what files it created or modified, and what decisions it made. You can review any past session and see exactly what happened. If a scheduled task runs while you are away, the full log is waiting for you.

Part 3: Claude Code and Developer Features (Features 23–32)

This is where Claude becomes a programming partner.

  1. Claude Code (Terminal-Based AI)

Claude Code runs directly in your terminal. It reads your codebase, writes code, edits files, runs tests, and executes commands - all from the command line. It is not a chat window. It is an AI developer sitting inside your development environment.

Most beginners use the web chat for coding. Claude Code is an entirely different level.

  1. Plan Mode (Shift + Tab)

Before Claude Code starts building anything, switch to Plan Mode. In this mode Claude thinks through the architecture, asks clarifying questions, and maps out the approach before writing a single line of code. This prevents the most common mistake beginners make - jumping straight into implementation without a plan.

  1. CLAUDE.md

File

Create a file called CLAUDE.md

in your project root. This is a persistent instruction file that Claude Code reads before every session. Your tech stack, coding conventions, folder structure, things to avoid - all of it lives here. Every time you correct Claude twice on the same thing, add it to CLAUDE.md.

This is the single most powerful feature most developers never set up.

  1. /compact and /clear Commands

As conversations get longer, Claude's output quality drops because the context window fills up. Use /compact to compress the conversation history into a summary. Use /clear to start fresh while keeping your

CLAUDE.md

instructions.

Fresh context equals better output. Always.

  1. Model Switching (Opus vs Sonnet)

Use Opus for planning, architecture, and complex reasoning. Use Sonnet for fast execution and implementation. Most beginners stick to one model for everything. Switching between them based on the task gives you the best of both worlds.

  1. Screenshot Feedback

Something looks wrong visually? Take a screenshot, paste it directly into Claude Code with Ctrl+V, and describe the issue. Visual feedback is faster and more precise than trying to describe UI problems in words. Circle the problem area and say "this spacing feels off" or "this button needs to be larger."

  1. One Conversation Per Feature

Never build authentication, then refactor the database, then redesign the UI in the same conversation. Claude's context gets messy and quality drops. Start a new conversation for each feature. Keep each thread focused on one thing.

  1. Git Integration

Claude Code works with Git. It can commit changes, create branches, and even write commit messages. If you are not using version control with Claude Code, you are taking unnecessary risk. Every change should be committed so you can roll back if something breaks.

  1. MCP Server Connections

MCP (Model Context Protocol) gives Claude access to external tools and data sources. Connect MCP servers for web search, database access, file processing, browser automation, and hundreds of other capabilities. Each MCP server is a new superpower.

Most beginners use Claude in isolation. MCP connects it to the real world.

  1. Context Engineering with .md Files

Beyond

CLAUDE.md

, you can create additional markdown files that give Claude specific knowledge for specific tasks. A

rules.md

for writing style. A

stack.md

for your tech stack. A

personas.md

for your audience profiles. Reference these files in your prompts and Claude follows them perfectly.

Part 4: API and Advanced Settings (Features 33–40)

These are for users ready to go beyond the consumer interface.

  1. API Access

Claude has a full API that lets you integrate it into your own applications, scripts, and workflows. Everything you do in the chat you can do programmatically - and more. Get your API key and start building.

  1. System Prompts via API

Through the API, you can set system prompts that define Claude's behavior at a level the web interface cannot match. Production applications use this to create highly specialized, consistent AI behavior for specific use cases.

  1. Streaming Responses

The API supports streaming, which means Claude's response starts appearing in real time as it generates rather than waiting for the full response. For applications where speed matters, this is essential.

  1. Tool Use and Function Calling

Through the API, Claude can call functions you define. It decides when to use a tool, generates the function call, and processes the result. This is what turns Claude from a text generator into an autonomous agent that can take real actions.

  1. Temperature Control

Temperature controls how creative or deterministic Claude's responses are. Lower temperature means more predictable, consistent outputs. Higher temperature means more creative, varied responses. Most users never touch this. For production use cases, it matters enormously.

  1. Structured Outputs (JSON Mode)

Force Claude to return responses in exact JSON formats that match a schema you define. No more parsing messy text. No more hoping the format is right. Define the structure and Claude matches it perfectly every time.

  1. Batch Processing

The API supports batch requests - send multiple prompts at once and get all results back together. If you need to process a hundred documents, analyze a thousand data points, or generate fifty pieces of content, batch processing makes it practical.

  1. Evaluation Frameworks

Build automated tests that check whether Claude's outputs meet your standards. Set up expected outputs, run your prompts against them, and measure accuracy automatically. This is what separates someone experimenting with Claude from someone running Claude in production.

The Real Cost of Not Setting These Up

Here is the honest math.

If you are using Claude for an hour a day without these features configured, you are losing roughly two to three hours of potential output every single week. That is over a hundred hours a year spent re-explaining context, manually processing files, copy-pasting between apps, and getting mediocre outputs because Claude does not know who you are or what you need.

Every feature on this list takes minutes to set up. The time they save compounds forever.

The people getting the most out of Claude right now are not smarter than you. They are not more technical than you. They just turned on the features you did not know existed.

Now you know they exist.

The only question left is whether you will actually set them up or keep doing things the hard way.

u/Money-Ranger-6520 — 1 day ago
▲ 5 r/Agent_AI+1 crossposts

Your Code is the Agent Harness.

A 100-page survey from UIUC, Meta, and Stanford on coding agents just landed. The central claim: most agent failures aren't reasoning failures. They're harness failures.

The paper, "Code as Agent Harness," pulls 400+ papers under one taxonomy with 40+ researchers across the three institutions. The anchor systems are ones you already know: Claude Code, Codex, SWE-agent, Voyager, MetaGPT, OpenHands. The contribution is the synthesis and vocabulary. No new discoveries.

The three-layer split

Any agent system breaks into three coupled pieces.

Model-internal capabilities: reasoning, planning, perception. Teams optimize this first and most.

System-provided infrastructure: tools, sandboxes, memory, permission tiers, telemetry. Production stacks gap out here most often.

Agent-initiated code artifacts: regression tests, temporary tools, domain-specific language programs, reusable skills the agent authors mid-task. Voyager's skill library and Claude Code's skill files are the early examples. The paper treats this as the underexplored layer.

Those three pieces sit inside three harness layers. The interface layer puts code at the center as the medium for reasoning, action, and environment state. The mechanisms layer covers planning, memory, tool use, and the plan-execute-verify loop. The scaling layer extends the picture to multi-agent coordination over shared code artifacts.

Auditing your stack

One question per layer.

Interface: does your agent's reasoning pass through something executable? A stack with tool calls, generated programs, repo state, traces, and tests can be inspected and held accountable. A stack running on natural-language plans the agent never has to defend against execution can't. The fix: have the model output executable code as its reasoning, give it a structured interface like SWE-agent's shell, edit, and search commands, and let it operate on actual repo state.

Mechanisms: when something fails, does the harness do anything about it? A plan-execute-verify loop with named verifiers (unit tests, type checks, linters, runtime monitors) and durable memory across sessions closes the feedback loop. Retrying with more tokens doesn't. Add named verifiers as gates between generation steps, not just at the end. The paper distinguishes five memory types: semantic memory of the repo, experiential memory of past trajectories, long-term memory with a compression policy, multi-agent memory for shared state, and working memory. Most agents only ship the last one. OpenHands' stateful workspace and CodeMem's budgeted memory slots are worth studying as reference implementations.

Scaling: when two agents work on the same task, what's the shared substrate? Shared code artifacts (repos, tests, traces, structured workflows) with a conflict policy let both agents read and write safely. Message-passing with no shared state does not. AgentCoder's programmer-tester-executor split and MetaGPT's role-specialized agents over a shared message pool are the patterns the paper highlights.

Three open problems

Oracle adequacy. Pass/fail on unit tests measures the wrong thing. Every agent evaluation today collapses model quality, tool reliability, and harness quality into one end-task number. The paper names this as the central bottleneck and offers no metric that fixes it.

The verification gap. Green tests aren't a correct specification. Every accepted action should carry an evidence bundle: which checks ran, which assumptions held, which parts of the code stayed untested, what risks remain. No current harness ships this. The architecture pattern exists and sits unused.

Approvals that don't persist. If permission rules reset after the session ends, your agent repeats the same unsafe action next time. Permission state should mutate in response to human decisions. The paper flags this and stops there.

Read it as vocabulary, not a roadmap. The taxonomy sharpens how you describe your stack to your team. Monday's build plan is still yours to write.

u/Forward_Regular3768 — 1 day ago
▲ 6 r/Agent_AI+3 crossposts

What’s the most unexpectedly useful thing AI does for you now?

Not the obvious stuff like “write blog posts.”

I mean the small use cases you didn’t expect to become part of your routine.

For me it ended up being:

organizing messy thoughts
turning rough ideas into structured plans
summarizing scattered research
helping reduce mental/context overload across projects

None of that sounded exciting at first, but those are
the workflows I kept using long term.

Curious what surprisingly became useful for you.

reddit.com
u/Curious_Being9540 — 1 day ago
▲ 4 r/Agent_AI+3 crossposts

What AI task still feels surprisingly bad in 2026?

We talk a lot about what AI is amazing at now.
But I’m curious what still feels frustratingly unreliable or awkward in real daily use.

Not benchmark stuff.
Real workflow stuff.

For me:
maintaining long-term project context
keeping conversations organized
reliable multi-step execution
not losing useful outputs across chats/tools

AI got insanely good at generation.

But I still feel like “AI workspace / memory / continuity” is weirdly unfinished.

What still breaks for you?

reddit.com
u/Curious_Being9540 — 1 day ago
▲ 2 r/Agent_AI+1 crossposts

Hermes custom AI infrastructure

Hey everyone ! I recently started a YouTube channel and I wanted to give back to the community and show how I build my custom infrastructure for AI agents . Initially I was going to integrate openclaw but then decided to try Hermes and man am I glad I did .

Hermes is incredibly powerful and I am able to prototype apps so quickly because it’s integrated into my staging of supabase/github/cloudflare/n8n and I show how to set it all up!

Check it out , it all starts here and I’ll be doing the final episode soon where we bring it to a main VPS and put our platform online !

Building AI Agents - Initial Domain and VPS Setup
https://youtu.be/s1cmCVE9SY4

So far there are 7 episodes ! These are raw videos because I wanted to show the process of learning while you build and errors that can come up !

Enjoy !

u/FitzUnit — 1 day ago

Agent Memory patterns & tricks

Maximize agent memory. help you save tokens and stop reasoning degradation.

u/Smart_Page_5056 — 1 day ago
▲ 81 r/Agent_AI+63 crossposts

This sub gets the assignment better than most so I'll be direct.

The no-code movement solved half the problem. You can build almost anything now without knowing how to code, which is genuinely incredible and wasn't true five years ago. But there's still a gap that nobody talks about. Even with the best no-code tools you still have to know which tools to pick, how to connect them, how to write copy that converts, how to set up ad accounts, how to source products, how to structure a funnel. The learning curve didn't disappear, it just moved.

Most people in this sub know exactly what I mean. You've spent a weekend deep in Zapier trying to get two things to talk to each other that should just work. You've rebuilt your Webflow site three times because the first two didn't convert. You've watched your Notion dashboard get more elaborate while the actual business stayed the same size.

That's the gap Locus Founder closes.

You describe what you want to build. The AI handles everything else. It sources products directly from AliExpress and Alibaba (or sell YOUR OWN digital services, products, or content), builds a real storefront around them, writes conversion-optimized copy, then autonomously creates and runs ads on Google, Facebook and Instagram. No Zapier. No Webflow. No piecing together eight tools that half work. Just a running business.

If you don't have an idea yet it interviews you and figures out what makes sense for your situation.

We got into YCombinator this year and we're opening 100 free beta spots this week before public launch. Free to use, you keep everything you make.

For the people in this sub specifically, this isn't a replacement for no-code tools for people who love building. It's for everyone who wanted the outcome but never wanted to become a tools expert to get there. Big difference.

Beta form: https://forms.gle/nW7CGN1PNBHgqrBb8

Happy to answer anything about how it works under the hood.

u/IAmDreTheKid — 3 days ago