
u/Single-Cherry8263

Learn how to be disliked and be okay with it.
Stop auditioning for your own life.
Google shipped 13 agent skills
Google shipped an official skills repo at Cloud Next last week. One install command:
gemini skills install https://github.com/google/skills.git
13 skills covering BigQuery, Cloud Run, Firebase, GKE, Cloud SQL, AlloyDB, the Gemini API, plus security/reliability/cost WAF pillars and a few operational playbooks.
I ran the same prompt twice, once before installing, once after. Before: the model generated code importing vertexai and vertexai.generative_models, the legacy SDK Google has been deprecating for a year, targeting gemini-1.5-pro, building schemas as raw parameter dicts. It compiles. It's wrong.
After activating the gemini-api skill, the same prompt returned code using from google import genai, Pydantic schemas, gemini-3.1-pro-preview, and environment-variable-driven client config. Current SDK, current model, current patterns.
The skill activation works through a consent prompt. At session start, only the skill's name and description sit in context, under 100 tokens per skill. When your prompt matches, the CLI asks permission to load the full SKILL.md plus the bundled reference files before it answers. You see exactly what context is about to enter the session.
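The load-on-consent flow described above can be sketched in a few lines. This is a toy model, not how the Gemini CLI is implemented: `Skill`, `SkillRegistry`, the keyword matcher, and the `ask_consent` callback are all hypothetical names I made up to illustrate the idea.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Skill:
    name: str
    description: str  # short text that sits in context at session start
    body: str         # stands in for the full SKILL.md plus reference files


class SkillRegistry:
    """Hypothetical registry: metadata is always visible, bodies load on consent."""

    def __init__(self, skills: List[Skill], ask_consent: Callable[[Skill], bool]):
        self._skills: Dict[str, Skill] = {s.name: s for s in skills}
        self._ask_consent = ask_consent
        self.loaded: List[str] = []

    def startup_context(self) -> List[str]:
        # Only name and description enter the context at session start.
        return [f"{s.name}: {s.description}" for s in self._skills.values()]

    def match(self, prompt: str) -> Optional[Skill]:
        # Naive keyword match (an assumption): any word of the skill name
        # appearing in the prompt counts as a match.
        text = prompt.lower()
        for skill in self._skills.values():
            if any(word in text for word in skill.name.split("-")):
                return skill
        return None

    def context_for(self, prompt: str) -> List[str]:
        context = self.startup_context()
        skill = self.match(prompt)
        # The full body is added only if a skill matched AND the user agreed.
        if skill is not None and self._ask_consent(skill):
            context.append(skill.body)
            self.loaded.append(skill.name)
        return context
```

The point of the design is that the expensive part (the body) never enters the session unless both the match and the consent check succeed.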
Worth knowing: the Skills format wasn't built by Google. Anthropic built and open-sourced the spec. These same SKILL.md files work in Claude Code and Codex without modification.
Google's own benchmarks put correct-API-code generation at 87% with Gemini 3 Flash and 96% with Gemini 3 Pro with this skill active. From what I saw, those numbers track.
/goal: Full Guide for Non-Technical Folks
/goal runs a self-checking loop after every step. It asks "am I done?" and keeps going until the answer is yes. That's the whole mechanism.
This makes it useful for tasks with many steps and a defined finish line, where you'd otherwise prompt the model repeatedly to continue.
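If it helps to see the mechanism, here is a minimal sketch of a self-checking loop. It is not the actual /goal implementation in any tool; `run_goal`, `step`, `is_done`, and `max_steps` are hypothetical names, and the step budget is my assumption about how a run would be bounded.

```python
from typing import Callable, List


def run_goal(
    step: Callable[[int], str],
    is_done: Callable[[List[str]], bool],
    max_steps: int = 50,
) -> dict:
    """Do one step, ask "am I done?", repeat until yes or the budget runs out."""
    history: List[str] = []
    for i in range(max_steps):
        history.append(step(i))   # do one unit of work
        if is_done(history):      # the self-check after every step
            return {"status": "achieved", "steps": len(history)}
    return {"status": "budget_exhausted", "steps": len(history)}


# Toy usage: "migrate 80 posts" is done once 80 steps have completed.
result = run_goal(
    step=lambda i: f"migrated post {i + 1}",
    is_done=lambda history: len(history) >= 80,
    max_steps=100,
)
```

Notice that `is_done` has to be something the loop can actually evaluate. That is why a measurable finish line matters more than the wording of the task.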
Use /goal when:
- The job has 10+ sequential steps
- "Done" can be described in specific, measurable terms
- You'd normally spend time re-prompting or checking progress manually
Don't use it for:
- Single-step tasks ("write a tweet", "summarize this paragraph")
- Open-ended exploration with no clear finish state
- Anything you need to review mid-run before continuing
Good /goal tasks:
- Build a course landing page with a hero section, five modules, three testimonials, a frequently asked questions section, and a Stripe checkout
- Migrate 80 blog posts from WordPress to Beehiiv, fix every broken image and internal link along the way
- Process last month's customer support tickets: categorize them, draft template replies, document the top five recurring issues
Writing a /goal prompt that actually works:
Paste this into Claude Code, Codex, or Hermes first:
>
Take that output, put /goal at the front, and run it.
The prompt-writing step matters. A vague finish line wastes tokens because the model keeps running checks against a target it can't verify.
Motivation is a spark. Systems are the fire.
Claude writes purple-gradient slop by default. These 10 configs change that
Claude's defaults are bad: purple gradients, Inter font, chaotic debugging, context windows eaten in five messages. These 10 skill files each fix a specific failure mode.
1. Frontend Design: before Claude writes a single line of UI code, it picks an aesthetic (Brutalism, Minimalism, Retro-futurism) and executes from there. The difference in output is not subtle. https://github.com/anthropics/skills/tree/main/skills/frontend-design
2. Algorithmic Art: generative art via p5.js. Give it an idea, get an interactive HTML artifact with sliders, seeds, and randomization controls you can tweak in real time. https://github.com/anthropics/skills/tree/main/skills/algorithmic-art
3. Systematic Debugging: four-phase bug-hunting using root cause analysis and component boundary logging. Structured teardown instead of trial-and-error. The repo claims 15-30 minutes per bug vs. 2-3 hours of guessing. https://github.com/obra/superpowers
4. Canvas Design: posters, covers, static visuals as PNG and PDF. Claude defines a design philosophy first, then executes from it. Not generic image output. https://github.com/anthropics/skills/tree/main/skills/canvas-design
5. Theme Factory: 10 pre-built professional color and font themes applicable to any artifact with one command. Useful when you need something presentable fast. https://github.com/anthropics/skills/tree/main/skills/theme-factory
6. Web Artifacts Builder: scaffolds React 18 + TypeScript + Tailwind + 40+ shadcn/ui components, then bundles into a single HTML file. For actual apps with routing and state, not demos. https://github.com/anthropics/skills/tree/main/skills/web-artifacts-builder
7. Superpowers: Jesse Vincent's full agentic coding framework. 20+ skills covering test-driven development, planning, code review, git worktrees, and subagent-driven development. Claude can run autonomously for hours without drifting. https://github.com/obra/superpowers
8. File Search: codebase search via ripgrep (text) and ast-grep (abstract syntax tree). Faster than grep/find when you're orienting in an unfamiliar repo or tracking all usages before a refactor. https://github.com/massgen/massgen
9. Context Optimization: context engineering skills covering compaction, tool-output masking, key-value cache optimization, and multi-agent partitioning. If your agent starts degrading after 30 minutes, it's almost always context pressure. https://github.com/muratcankoylan/agent-skills-for-context-engineering
10. Skill Creator: a meta-skill that teaches Claude to write its own skills, including evaluations, benchmarks, and tests. Start here if you want custom behavior that actually sticks. https://github.com/anthropics/skills/tree/main/skills/skill-creator
/Goal 101. Full guide
Three separate teams shipped the same thing. OpenAI's Codex CLI added /goal a few weeks ago. Claude Code added it this week. Hermes Agent, the orchestrator I run on a Mac Mini, had it built in before either of them.
Now I have a builder, a reviewer, and an orchestrator that all accept the same instruction format, and none of them share a codebase.
That convergence is the thing worth paying attention to.
The primitive itself
A regular prompt gets you the next response. You read it, decide if it's right, and push again. You're steering every turn.
/goal hands the steering to the agent. You write what done looks like, submit it once, and the agent works toward it until it gets there or runs out of budget. A real example from the source:
/goal Build the app described in SPEC.md. Done means tests pass,
build passes, README is accurate, and git status only shows
relevant project files.
The goal stays live until it's achieved, paused, blocked, or cleared. This isn't writing the word "goal" inside a prompt. The primitive only works inside an interactive worker session, not as a one-shot exec command.
The three tools, and why they're not interchangeable
Codex is the builder. Give it a spec and it produces working code. Strong at implementation.
Claude Code is the reviewer. Point it at code that looks right and it finds what's wrong. Spec compliance, error states, security holes, safety issues.
Hermes is neither. It's an orchestrator. It coordinates work between the two above, routes tasks to the right tool, and manages the handoffs. /goal is how I tell Hermes what I want, and also how Hermes tells Codex and Claude Code what to do.
A real run
I gave Hermes one goal: build a command-line tool that finds mentions of me on X and sends alerts when something blows up.
Hermes broke that into six cards on a Kanban board.
Card 1: Hermes wrote SPEC.md itself, capturing the stack, repo path, read-only constraints, mock mode requirements, tests, and verification commands.
Card 2: Codex ran /goal against the spec. It built the project files, wired up the backend and interface, added tests. About 15 minutes. When it finished, npm test passed, npm run build passed, and git status showed only relevant new files.
Card 3: Claude Code ran /goal to review what Codex produced. Checked spec compliance, read-only safety, API key handling, error states, test coverage, and security issues. Returned PASS with no blocking items.
Cards 4 and 5 would have been the fix loop and final verification pass, but the review passed so both were skipped. The cards still mattered. They're how Hermes models conditional work. If Claude Code had blocked, Hermes would have handed the findings back to Codex as a new /goal.
Card 6: Hermes summarized the finished state. UI and API verified in mock mode. Working app at the local path.
One message from me. Three tools did the work. I only ever talked to Hermes.
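The card flow from this run (spec, build, review, conditional fix and verify, summary) can be sketched as plain control flow. This is not how Hermes is built; `run_pipeline`, `Review`, and the `build` and `review` callables are hypothetical stand-ins for handing a /goal to Codex and Claude Code.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Review:
    passed: bool
    findings: List[str] = field(default_factory=list)


def run_pipeline(
    build: Callable[[str], None],
    review: Callable[[], Review],
    max_fix_rounds: int = 2,
) -> List[str]:
    """Return the list of cards that actually ran, in order."""
    log = ["spec"]
    build("Build the app described in SPEC.md")
    log.append("build")
    result = review()
    log.append("review")
    rounds = 0
    # The fix and verify cards only run when the review blocks.
    while not result.passed and rounds < max_fix_rounds:
        build("Fix only these findings: " + "; ".join(result.findings))
        log.append("fix")
        result = review()
        log.append("verify")
        rounds += 1
    log.append("summary: PASS" if result.passed else "summary: BLOCKED")
    return log
```

When the first review passes, the fix and verify cards never appear in the log, which mirrors cards 4 and 5 being skipped in the run above.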
The verification rule
After Codex marked the build done, Hermes ran the commands itself:
npm test # 17 tests passed
npm run build # vite build passed
Coding agents are confident. They'll tell you tests pass when the tests were never executed. Hermes didn't take Codex's self-report as ground truth. It checked.
Without that step, /goal is just a prompt with a done condition attached. The verification is what turns it into a contract.
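A minimal sketch of that verification step, assuming the done condition can be expressed as commands that exit non-zero on failure. `verify` and `goal_verified` are hypothetical helper names; the only library call is the standard `subprocess.run`.

```python
import subprocess
from typing import List, Sequence


def verify(commands: Sequence[Sequence[str]], cwd: str = ".") -> List[dict]:
    """Run each command ourselves and record the real exit status."""
    results = []
    for cmd in commands:
        try:
            proc = subprocess.run(list(cmd), cwd=cwd, capture_output=True, text=True)
            ok = proc.returncode == 0
            output = (proc.stdout + proc.stderr)[-2000:]  # keep the tail for the log
        except FileNotFoundError as exc:
            # A missing executable counts as a failed check, not a crash.
            ok, output = False, str(exc)
        results.append({"cmd": " ".join(cmd), "ok": ok, "output": output})
    return results


def goal_verified(results: List[dict]) -> bool:
    # The goal only counts as done when every check passed.
    return all(r["ok"] for r in results)
```

For the run above, the call would look like `verify([["npm", "test"], ["npm", "run", "build"]], cwd="path/to/repo")`, where the repo path is a placeholder.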
Parallelism without conflicts
You can run multiple goals in parallel, but not by pointing multiple workers at the same files. The safe pattern is clear boundaries: different repos, different branches, git worktrees, separate packages, tests versus implementation. One writer per file at a time.
Builder writes. Reviewer only reads. Fix goals stay scoped to the fix.
The bad pattern is three workers in the same repo editing the same file. You get partial overwrites and one agent silently undoing another's work.
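One way to enforce "one writer per file" before launching parallel goals is to have each worker declare the files it may write and check for overlap. This is a sketch under that assumption; `write_conflicts` and the worker names are hypothetical.

```python
from itertools import combinations
from typing import Dict, List, Set, Tuple


def write_conflicts(claims: Dict[str, Set[str]]) -> List[Tuple[str, str, Set[str]]]:
    """Return every pair of workers that claim write access to the same file."""
    conflicts = []
    for a, b in combinations(sorted(claims), 2):
        shared = claims[a] & claims[b]
        if shared:
            conflicts.append((a, b, shared))
    return conflicts


claims = {
    "builder": {"src/app.ts"},
    "tests": {"test/app.test.ts"},
    "fixer": {"src/app.ts", "README.md"},
    "reviewer": set(),  # the reviewer only reads, so it claims nothing
}
conflicts = write_conflicts(claims)
```

Here `conflicts` flags builder and fixer on `src/app.ts`, so those two goals should run one after the other or in separate worktrees.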
The Kanban board is what makes this visible
Without the board, parallel background workers are just terminal chaos. With it, every goal has a card, every card has a status, and every handoff leaves a trail. I watch the work move across columns from my phone over Telegram.
The board is what /goal looks like when there's an orchestrator managing it.
Setup itself is a goal
The first time I needed Codex and Claude Code on the Mac Mini, I didn't run install commands. I sent Hermes a message asking it to install both and log me in.
Once you have an orchestrator running, mechanical setup work stops being yours.
The structural point
If Codex and Claude Code had shipped different job formats, no orchestrator could route between them. The convergence on a single primitive is what makes the composition possible. The next coding tool that adopts /goal joins this pipeline without any changes on my end.
For anybody that needs to hear it.
Are you still waiting to feel ready?
I Run a 4-Agent Claude System With mkdir and a Text File
Running four Claude agents in parallel takes a folder structure and a shared config file. The complexity people imagine doesn't exist.
Each agent owns one phase: research, production, quality review, distribution. An Orchestrator routes between them and handles failures. It reads the full pipeline. Each agent reads only its own prompt.
mkdir multi-agent-system
cd multi-agent-system
mkdir -p inbox research-briefs drafts approved-content distribution logs
A CLAUDE.md at the project root sets the shared contract every agent reads before acting:
# Multi-Agent System — CLAUDE.md
## System Overview
This is a 4-agent content production system.
Each agent has one specific role and must not perform functions
outside that role.
## Agent Roster
- Research Agent: Produces structured research briefs from topics
- Production Agent: Produces first drafts from research briefs
- Quality Agent: Evaluates and approves or returns drafts
- Distribution Agent: Formats and deploys approved content
## Folder Structure
inbox/ — incoming task files
research-briefs/ — research agent outputs
drafts/ — production agent outputs
approved-content/ — quality agent approvals
distribution/ — deployment records
logs/ — operation logs
## Shared Standards
- Every output file must be named: YYYY-MM-DD-[type]-[topic].md
- Every agent must log its action to logs/operations.md
- Every agent must read this CLAUDE.md before starting any task
- No agent takes action outside its defined role
## Quality Bar
Research: Minimum 3 sources cross-referenced. No unsourced claims.
Production: Matches voice profile. Every sentence earns its place.
Quality: Scores 8/10 or above on all criteria before approval.
Distribution: Platform-specific formatting. No generic formatting.
## Hard Rules
- Never delete files. Archive to a timestamped backup folder.
- Never publish without Quality Agent approval in the file header.
- Log every action before taking it, not after.
- When uncertain: stop and flag for human review.
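The naming rule in Shared Standards is easy to check mechanically. Here is a small sketch; the slug rule (lowercase, hyphens for anything non-alphanumeric) and the function names are my assumptions, since the CLAUDE.md above only fixes the YYYY-MM-DD-[type]-[topic].md shape.

```python
import re
from datetime import date

# Assumed pattern: date, a lowercase type, a hyphenated lowercase topic slug.
NAME_PATTERN = re.compile(r"^\d{4}-\d{2}-\d{2}-[a-z]+-[a-z0-9-]+\.md$")


def output_name(kind: str, topic: str, day: date) -> str:
    """Build a filename in the YYYY-MM-DD-[type]-[topic].md convention."""
    slug = re.sub(r"[^a-z0-9]+", "-", topic.lower()).strip("-")
    return f"{day.isoformat()}-{kind}-{slug}.md"


def is_valid_name(filename: str) -> bool:
    return NAME_PATTERN.match(filename) is not None


example = output_name("research", "AI Agents 101", date(2025, 1, 5))
```

A check like `is_valid_name` could run over each output folder after a task, so a misnamed file gets caught before the next agent goes looking for it.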
Research Agent
Everything downstream depends on what this agent produces. A thin brief produces a thin draft. Give it a strict output schema:
# Research Agent
## Identity
You are a specialist research agent. Your only job is to produce
Research Briefs. You never write content. You never evaluate drafts.
You research and synthesize.
## Output Format
Save to: research-briefs/YYYY-MM-DD-research-[topic].md
CORE INSIGHT: [one sentence — the non-obvious angle]
TARGET AUDIENCE: [specific description]
SUPPORTING EVIDENCE: [3 specific examples with sources]
COUNTERINTUITIVE ANGLE: [what most people get wrong]
KEY DATA: [2-3 specific numbers or quotes]
CONTENT ANGLES: [3 ranked angles with one-sentence descriptions]
GAPS: [what this research could not answer]
Quality gate: if the core insight is something most people already know, the brief fails before the Production Agent sees it.
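The structural half of that gate can be automated: confirm every field in the schema is present and non-empty before the brief moves on. Whether the core insight is actually non-obvious still needs judgment. `parse_brief` and `missing_fields` are hypothetical helpers that assume the "LABEL: value" layout shown above.

```python
from typing import Dict, List

REQUIRED_FIELDS = [
    "CORE INSIGHT",
    "TARGET AUDIENCE",
    "SUPPORTING EVIDENCE",
    "COUNTERINTUITIVE ANGLE",
    "KEY DATA",
    "CONTENT ANGLES",
    "GAPS",
]


def parse_brief(text: str) -> Dict[str, str]:
    """Collect the text under each known label; extra lines extend the last label."""
    fields: Dict[str, str] = {}
    current = None
    for line in text.splitlines():
        label, sep, rest = line.partition(":")
        if sep and label.strip() in REQUIRED_FIELDS:
            current = label.strip()
            fields[current] = rest.strip()
        elif current and line.strip():
            joiner = "\n" if fields[current] else ""
            fields[current] += joiner + line.strip()
    return fields


def missing_fields(text: str) -> List[str]:
    fields = parse_brief(text)
    return [name for name in REQUIRED_FIELDS if not fields.get(name)]
```

An empty list from `missing_fields` means the brief is structurally complete. Anything else names exactly which sections the Research Agent has to fill in.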
Production Agent
The voice profile separates output that sounds like you from output that sounds like a model approximating you. Before writing this agent's prompt, run your 10 best posts through this:
Analyze these 10 pieces of content and extract the following:
1. Average sentence length
2. Capitalization patterns (what do you capitalize strategically?)
3. Structural patterns (how do you open, develop, close?)
4. Vocabulary level and specific word choices
5. What you never do (hedges, filler phrases, etc.)
6. How you handle transitions between ideas
7. Your CTA style
Content samples: [PASTE YOUR 10 BEST PIECES]
That output goes into the ## Voice Profile section. The rest of the prompt is standard: read the brief, pick the strongest angle, write to the schema, self-check before submitting.
Quality Agent
Five criteria, all scored 1-10, all requiring 8 or above to pass:
VOICE MATCH: Does this sound exactly like the configured voice?
HOOK STRENGTH: Does the first line stop the scroll?
INFORMATION DENSITY: Does every sentence earn its place?
CTA CLARITY: Is the call to action specific and compelling?
FORMAT COMPLIANCE: Does it follow all format requirements?
Anything below 8 triggers a revision brief with the exact problem and the exact fix required. Vague feedback ("make it more engaging") gives the Production Agent nothing to act on. The revision brief names the failed criterion and shows the correct approach.
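The pass rule itself is simple enough to write down. A minimal sketch, assuming the scores arrive as a dict keyed by criterion name; `evaluate` is a hypothetical helper, and the revision text still has to come from the Quality Agent.

```python
from typing import Dict

CRITERIA = [
    "VOICE MATCH",
    "HOOK STRENGTH",
    "INFORMATION DENSITY",
    "CTA CLARITY",
    "FORMAT COMPLIANCE",
]
PASS_SCORE = 8


def evaluate(scores: Dict[str, int]) -> dict:
    """Approve only if every criterion scores 8 or above; otherwise name the failures."""
    missing = [c for c in CRITERIA if c not in scores]
    if missing:
        raise ValueError(f"unscored criteria: {missing}")
    failed = [c for c in CRITERIA if scores[c] < PASS_SCORE]
    return {"approved": not failed, "failed": failed}


verdict = evaluate({
    "VOICE MATCH": 9,
    "HOOK STRENGTH": 7,
    "INFORMATION DENSITY": 8,
    "CTA CLARITY": 8,
    "FORMAT COMPLIANCE": 10,
})
```

In this example `verdict` comes back unapproved with HOOK STRENGTH as the only failed criterion, which is the name the revision brief should lead with.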
Distribution Agent
The agent verifies the QUALITY APPROVED header before touching the file. No header, no action. Platform rules live in its prompt: character limits for X, narrative structure for LinkedIn, header and subject line conventions for newsletters.
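A sketch of the "no header, no action" check. The exact header format isn't specified above, so this assumes the marker text appears somewhere in the first few lines of the file; `is_approved`, `distribute`, and the `publish` callback are hypothetical.

```python
from typing import Callable

APPROVAL_MARKER = "QUALITY APPROVED"


def is_approved(text: str, header_lines: int = 5) -> bool:
    """True only if the marker appears in the file header, not deep in the body."""
    header = text.splitlines()[:header_lines]
    return any(APPROVAL_MARKER in line for line in header)


def distribute(text: str, publish: Callable[[str], None]) -> bool:
    # No header, no action: refuse before any formatting or posting happens.
    if not is_approved(text):
        return False
    publish(text)
    return True
```

Limiting the search to the header keeps a draft that merely mentions the phrase in its body from slipping through.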
Running a task
Drop a task file in inbox/ and trigger the Orchestrator:
claude "Read CLAUDE.md. You are the Orchestrator.
A new task has arrived in inbox/[TASK-FILENAME].
Begin the workflow. Route to Research Agent first."
Every agent appends to logs/operations.md before acting and after completing. A draft in drafts/ with no matching file in approved-content/ means the Quality Agent returned it. Check the log for the failed criterion. Fix the brief. Rerun.
First end-to-end run: 15 to 30 minutes depending on research complexity. Failures stay isolated to the agent where they occur, so you debug one phase at a time.
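Finding returned drafts can be scripted instead of eyeballed. This sketch assumes an approved file keeps the same filename as its draft, which the setup above doesn't actually guarantee, so adjust the matching rule to your own naming. `returned_drafts` is a hypothetical helper.

```python
from pathlib import Path
from typing import List


def returned_drafts(root: str) -> List[str]:
    """List drafts with no same-named file in approved-content/."""
    base = Path(root)
    approved = {p.name for p in (base / "approved-content").glob("*.md")}
    return sorted(
        p.name for p in (base / "drafts").glob("*.md") if p.name not in approved
    )
```

Each name it returns is a draft to look up in logs/operations.md for the failed criterion.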
10 Claude Code Commands I Use Daily
Here are the ten commands I use consistently.
- /init: Generates your CLAUDE.md from your existing project. Set CLAUDE_CODE_NEW_INIT=1 first for the full interactive setup: skills, hooks, personal memory. Not perfect, but 80% done in three seconds. You edit, not write.
- /compact [instructions]: Run at 70-75% context usage, not when Claude warns you. Always pass instructions: /compact focus on the auth module, ignore the migration files. Without them, you get a generic summary. With them, the important context survives.
- /rewind: Full checkpoint rollback. Reverts the conversation and all file changes back to any earlier point. Use it when Claude goes out of scope, breaks something with an unsolicited "improvement," or you want to try a different approach from the same starting point.
- /plan [description]: Pre-load the task into the command: /plan refactor contract validation to handle Arabic RTL edge cases. Claude enters plan mode already thinking about your specific problem, not waiting for a follow-up.
- /context: Shows a colored breakdown of what's consuming your context window. Not just a number, it tells you what's causing the number. Found my CLAUDE.md was eating a noticeable slice of context on every message. Trimmed it that day.
- /btw [question]: Ask a side question without adding it to conversation history. The response doesn't carry forward. Zero context cost. I use it for quick one-off lookups mid-session: library defaults, pattern support, anything I'd otherwise open a new tab for.
- /security-review: Analyzes the git diff of your current branch for vulnerabilities. Fast because it looks at what changed, not the whole codebase. I run it before every pull request on anything handling user data. It has flagged subtle input handling issues three times that I would have shipped.
- /insights: Generates an analysis of your recent Claude Code sessions: where you spend the most turns, where friction keeps appearing. I ran it after two months on a project and found I was re-explaining the same parsing logic every session. That pointed directly to a gap in my CLAUDE.md. The fix took 15 minutes.
- /diff: Opens an interactive viewer of uncommitted git changes with per-turn diffs from the current session. Left and right arrows switch between the full diff and individual Claude turns. You can trace exactly which turn added a function, changed a variable, or introduced an edge case.
- /effort [low | medium | high | max]: Controls reasoning depth without changing the model. Use low for documentation and comment cleanup. Use high or max for architectural decisions and complex refactors. Defaulting to max for everything wastes tokens on tasks that don't need the depth.
End to End Agent Building
Most teams build, deploy, and then figure out how to test. Production becomes the eval suite. Users will find the obvious bugs.
The order that works: build, test, deploy, monitor. Testing comes before deployment. Every step feeds the next one.
Build
Pick your abstraction layer before you write anything.
Frameworks like LangChain handle model calls, tools, prompts, and retrieval. Runtimes like LangGraph add state, control flow, and the ability to pause and resume. Harnesses like the Claude Agent SDK wrap all of that in a working environment: prompts, skills, MCP servers, hooks, middleware.
The layer determines the complexity ceiling. A forty-line tool-calling loop and a multi-agent system with persistent context are both "building an agent." Know which one you're building.
No-code tools open this up to non-engineers, which matters when the person who understands the workflow isn't the one writing the harness. Hooks and middleware are still how you add custom logic around tool calls, auth, and approvals without rebuilding the agent every time.
Test
Start with a small dataset. Expected use cases, manual testing, dogfooding, known edge cases. Don't wait for production traces to begin testing.
Metrics depend on task shape. When ground truth exists, measure correctness. When there is no single right answer, score criteria, grounding, policy adherence, and tool efficiency.
Hold the eval set fixed and vary one thing at a time: prompt, model, retrieval strategy, tool schema. Experiments show whether the system is improving or quietly regressing.
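A toy sketch of that experiment shape: one fixed dataset, one baseline, and variants that each change a single thing. Exact-match accuracy is used here only because the toy data has ground truth; `accuracy`, `compare`, and the lookup-table "systems" are all hypothetical stand-ins for real agent runs.

```python
from typing import Callable, Dict, List, Tuple

Dataset = List[Tuple[str, str]]  # (input, expected output)


def accuracy(system: Callable[[str], str], dataset: Dataset) -> float:
    correct = sum(1 for question, expected in dataset if system(question) == expected)
    return correct / len(dataset)


def compare(
    baseline: Callable[[str], str],
    variants: Dict[str, Callable[[str], str]],
    dataset: Dataset,
) -> Dict[str, float]:
    """Score every variant on the SAME dataset and report the delta vs. baseline."""
    base = accuracy(baseline, dataset)
    return {name: round(accuracy(fn, dataset) - base, 3) for name, fn in variants.items()}


# Fixed eval set; the lookup tables below stand in for real systems.
DATASET = [("2+2", "4"), ("3+3", "6"), ("capital of France", "Paris"), ("1+1", "2")]
baseline = {"2+2": "4", "3+3": "6", "capital of France": "Lyon", "1+1": "2"}.get
new_prompt = {"2+2": "4", "3+3": "6", "capital of France": "Paris", "1+1": "2"}.get
new_model = {"2+2": "4", "3+3": "7", "capital of France": "Lyon", "1+1": "2"}.get

deltas = compare(baseline, {"new_prompt": new_prompt, "new_model": new_model}, DATASET)
```

A positive delta is an improvement and a negative one is a regression. Because the dataset never changed between runs, the difference can be attributed to the one thing that did.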
Multi-turn agents need multi-turn evals. A support agent handling a frustrated customer across six turns, a coding agent reacting to test output, an ops agent gathering missing fields before acting — single-turn evals miss all of it.
Deploy
Most real agents need more than a stateless server. They run for minutes, call tools, wait for human input, hold state, and recover from failures.
Two runtime requirements come up constantly. Durable execution: the agent checkpoints and resumes instead of losing the run on failure. Human-in-the-loop: the agent pauses for approval without crashing the trajectory. Teams already running Temporal for long-running workflows often build on top of it.
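Both requirements can be illustrated with a checkpoint file. This is a deliberately small sketch, not how Temporal or any agent runtime implements durability; `run_with_checkpoints`, the step tuples, and the JSON state layout are hypothetical.

```python
import json
from pathlib import Path
from typing import Callable, List, Tuple

Step = Tuple[str, Callable[[], None], bool]  # (name, action, needs_approval)


def run_with_checkpoints(
    steps: List[Step],
    checkpoint_path: str,
    approved: Callable[[str], bool] = lambda name: True,
) -> dict:
    path = Path(checkpoint_path)
    # Resume from the last saved state instead of starting over.
    state = json.loads(path.read_text()) if path.exists() else {"done": [], "status": "running"}
    for name, action, needs_approval in steps:
        if name in state["done"]:
            continue  # finished in an earlier run
        if needs_approval and not approved(name):
            # Pause cleanly: persist progress and hand control back to a human.
            state["status"] = "waiting_for_approval"
            path.write_text(json.dumps(state))
            return state
        action()
        state["done"].append(name)
        state["status"] = "running"
        path.write_text(json.dumps(state))  # checkpoint after every step
    state["status"] = "finished"
    path.write_text(json.dumps(state))
    return state
```

Calling the function again with the same checkpoint path skips completed steps, so a crash or a pending approval costs one step of rework at most.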
Agents that write or execute code need isolation. Sandboxes like Daytona and E2B cap the blast radius. If the agent only needs scratch storage, a virtual filesystem is enough. Deep Agents uses files as working memory without spinning up a sandbox per run.
Version and store prompts, retrieval sources, and skills separately from application code. They change more often, and the people editing them don't deploy services.
Monitor
Latency and error rates are the easy half. An agent can return a successful response and still pick the wrong tool, skip a required approval, or produce a plausible but wrong answer.
Traces catch what metrics miss. Every model call, tool invocation, input, output, and final action. That's what you need to debug real failures and build future evals from.
Layer signals on top: large language model judges for quality and policy, regex for required phrases or forbidden tools. Store feedback against the trace so "user was unhappy" connects to "wrong tool three steps earlier."
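The regex half of those signals fits in a short function. A sketch, assuming a trace is a dict with a `final_output` string and a `tool_calls` list; that shape, the `check_trace` name, and the example tool names are made up for illustration.

```python
import re
from typing import Dict, List, Set


def check_trace(trace: Dict, required_patterns: List[str], forbidden_tools: Set[str]) -> List[str]:
    """Return a list of problems; an empty list means the trace passed."""
    problems = []
    for pattern in required_patterns:
        if not re.search(pattern, trace["final_output"], flags=re.IGNORECASE):
            problems.append(f"missing required phrase: {pattern}")
    for i, call in enumerate(trace["tool_calls"]):
        if call["tool"] in forbidden_tools:
            # Record the step index so feedback can point at the exact call.
            problems.append(f"forbidden tool at step {i}: {call['tool']}")
    return problems


trace = {
    "final_output": "Your refund has been issued.",
    "tool_calls": [{"tool": "lookup_order"}, {"tool": "delete_account"}],
}
problems = check_trace(trace, [r"refund", r"ticket number"], {"delete_account"})
```

Storing `problems` alongside the trace is what lets a complaint be tied back to the specific step that went wrong.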
Production traces become dataset examples. Recurring failures become metrics. Monitoring feeds directly back into testing.
Governance
Cost, tool access, and discoverability are the three things that matter as agent count grows.
Track spend per agent and per team before the bill surprises anyone. Audit every tool call: which agent, what inputs, what authorization. Store prompts, skills, and retrieval sources somewhere findable so the second team doesn't rebuild what the first team already got right.
I have learned one thing: the teams that move fastest have enough visibility to ship without guessing. They trace failures back to their cause, fix the right thing, re-run evals, and deploy. Monitoring hands them the next batch.
How I run 550 AI UGC videos a day for TikTok Shop on a $550 budget
There are four layers to running this:
Script engine. Build a copy bank from real customer language: viral video comments, 5-star Amazon reviews, Reddit threads where people describe the problem your product solves. Lock the structure (hook in 3 seconds, problem named, one credible insight, product as resolution). Write the negative list into the prompt: no "game-changing," no "revolutionary," no "must-have," no "discover." The negative list matters as much as the positive direction. A tuned engine produces 50-100 unique scripts an hour.
Character system. Three to five recurring characters, not a fresh one per video. Each needs a reference portrait, a 9-angle reference set, and a Soul ID built from all 9 images uploaded together. That locks identity across generations so video 400 reads as the same person as video 1.
Video generation. Seedance 2.0 produces a 15-30 second video in 2-4 minutes. Four prompt elements decide whether it looks human:
- Skin: "realistic skin texture, visible pores around nose and cheeks, natural slight unevenness, no filter quality"
- Camera: "handheld phone camera feel, casual slightly unsteady framing, organic not studio quality, soft diffused light from window"
- Environment: bedroom with natural light and a messy bookshelf reads as real. Clean studio reads as an ad.
- Audio: "natural conversational tone, like talking to a friend, not presenting to an audience, slight natural variation in pace and energy"
Distribution. The publish tap has to come from a real phone on a real network. TikTok reads server IPs, robotic intervals, and missing device fingerprints, and throttles or flags accounts that show those signals. I run Postiz self-hosted to manage the calendar. It pings a team member when something is queued, they open the app, review, tap post. Everything else automates. This step never does.
The math
Per video: $0.15 to $3 depending on length and regenerations. Average $1.
550 videos a day at $1: $550 a day, $16,500 a month.
Equivalent reach on Meta: 5.5M daily impressions at $4-8 cost per mille = $22,000-$44,000 a day.
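The arithmetic above, written out so you can swap in your own numbers. The 30-day month is an assumption, and note that 5.5M daily impressions across 550 videos implies an average of 10,000 views per video.

```python
def daily_video_cost(videos_per_day: int, cost_per_video: float) -> float:
    return videos_per_day * cost_per_video


def paid_equivalent(impressions: int, cpm: float) -> float:
    """Cost of buying the same impressions at a given cost per mille (per 1,000)."""
    return impressions / 1000 * cpm


daily = daily_video_cost(550, 1.0)            # 550.0
monthly = daily * 30                          # 16500.0, assuming a 30-day month
meta_low = paid_equivalent(5_500_000, 4.0)    # 22000.0
meta_high = paid_equivalent(5_500_000, 8.0)   # 44000.0
views_per_video = 5_500_000 / 550             # 10000.0 implied average
```

The comparison only holds if your videos really average that many views, so plug in your own view counts before trusting the ratio.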
Quality gates
- Script review. Pass criteria: real hook, product named before midpoint, no banned words. Failed scripts regenerate against the criteria as a prompt.
- Character review. A trained reviewer checks every video before it queues. 60-90 minutes a day at full volume. Budget the headcount.
- Performance triage every 48-72 hours. Sort by revenue per view, not views. Top 10% gets scaled and pushed to Spark Ads. Bottom 20% gets retired.
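The mechanical parts of the script review (banned words, product named before the midpoint) can be checked automatically; whether the hook is "real" still needs a human or a model judge. A sketch with hypothetical names, using the negative list from the script engine and a made-up product:

```python
from typing import List

BANNED = ["game-changing", "revolutionary", "must-have", "discover"]


def review_script(script: str, product: str) -> List[str]:
    """Return the failed criteria; an empty list means the script passes these checks."""
    text = script.lower()
    failures = []
    for word in BANNED:
        # Substring match, so "discover" also catches "discovered".
        if word in text:
            failures.append(f"banned word: {word}")
    position = text.find(product.lower())
    if position == -1:
        failures.append("product never named")
    elif position >= len(text) / 2:
        failures.append("product named after midpoint")
    return failures


good = review_script(
    "My skin was peeling every winter. GlowBalm fixed it in a week "
    "and I have not looked back since then.",
    "GlowBalm",
)
bad = review_script(
    "Discover the revolutionary balm everyone talks about, it is called GlowBalm",
    "GlowBalm",
)
```

The failure list doubles as the regeneration prompt: feed it back to the script engine as the criteria the next attempt has to meet.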
The system only works when the first 10 videos pass the "is this real" test before you scale to 550. Volume amplifies whatever you point it at.
Opus 4.7 vs GPT-5.5: A comparison beyond benchmarks
Opus 4.7 shipped April 16. GPT-5.5 followed seven days later on April 23. Sticker price favors Opus on output: $25 per million tokens versus $30 for GPT-5.5. Input matches at $5.
On identical coding tasks, GPT-5.5 produces roughly 72% fewer output tokens than Opus 4.7. Opus narrates. It explains its reasoning, describes what it's about to do, documents as it works. Inside a chat window that reads as helpful. Inside an agent loop hitting hundreds of inference calls per task, every line of narration is a billable token.
Run the numbers on a support agent handling 500 tickets a day. GPT-5.5 averages 2,000 output tokens per ticket. Opus 4.7 averages 7,100. The monthly API delta lands around $5,100. At a billion tokens a day across an enterprise, the cheaper-per-token model becomes the more expensive deployment.
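Here is the output-token part of that calculation as a sketch you can rerun with your own traffic. It assumes one inference call per ticket, a 30-day month, and output tokens only. Under those assumptions the delta comes to about $1,760 a month, so the larger figure quoted above would need additional calls per ticket or other costs that aren't itemized here.

```python
def monthly_output_cost(
    tickets_per_day: int,
    output_tokens_per_ticket: int,
    price_per_million: float,
    days: int = 30,
) -> float:
    """Output-token spend only; input tokens and retries are not counted."""
    tokens = tickets_per_day * output_tokens_per_ticket * days
    return tokens / 1_000_000 * price_per_million


gpt_cost = monthly_output_cost(500, 2_000, 30.0)   # 900.0
opus_cost = monthly_output_cost(500, 7_100, 25.0)  # 2662.5
delta = opus_cost - gpt_cost                       # 1762.5
token_reduction = 1 - 2_000 / 7_100                # about 0.72, matching the 72% claim
```

Even in this conservative version, the model with the lower per-token price ends up costing roughly three times as much per month, which is the point of measuring tokens per task.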
NVIDIA's engineers reported 25–50% better cost efficiency on agentic workflows running GPT-5.5-style architectures. Internalize that number before picking a model on output price.
Where each model wins
The benchmarks split along the lines each lab optimized for.
Terminal-Bench 2.0 (multi-step terminal work, compiling, configuring, running tools):
- GPT-5.5: 82.7%
- Opus 4.7: 69.4%
SWE-Bench Pro (resolving real GitHub issues end-to-end):
- Opus 4.7: 64.3%
- GPT-5.5: 58.6%
OSWorld-Verified (operating real computer environments):
- GPT-5.5: 78.7%
- Opus 4.7: 78.0%
GDPval (knowledge work across 44 occupations): GPT-5.5 hits 84.9%.
Long-context retrieval at 512K–1M tokens:
- GPT-5.5: 74%
- Opus 4.7: 32.2%
OpenAI tuned GPT-5.5 for autonomy: tool use, long horizons, retrieval over big context. Anthropic tuned Opus 4.7 for code precision and instruction coherence. Anthropic added a self-verification step where the model checks output for logical faults before returning it. Production teams using Opus reported double-digit drops in feedback cycles because the model caught issues before delivery.
Teams running GPT-5.5 in Codex saw the inverse pattern. The model stays on task longer without pausing for clarification or abandoning halfway. For multi-step engineering work, that persistence compounds across the loop.
Latency
- GPT-5.5: ~3 seconds to first token
- Opus 4.7: ~0.5 seconds
For interactive workflows where someone watches the cursor, the 2.5-second gap shows up in the feel of the tool. For background agents, total wall-clock dominates first-token latency, and GPT-5.5's token efficiency narrows the gap.
Both ship 1M token context windows, so window size stopped being the differentiator. Retrieval reliability inside the window took its place, and GPT-5.5 leads there.
How I'd pick
Pick GPT-5.5 for:
- autonomous agents running long horizons
- high-volume workloads where token spend hits margin
- long-context retrieval over codebases or document sets
- multi-tool orchestration
Pick Opus 4.7 for:
- production code patches where review overhead drives cost
- instruction-heavy work where self-verification cuts revision cycles
- reasoning across interconnected systems where coherence beats speed
The token-efficiency gap rewards GPT-5.5 hardest in the workloads where Opus looks cheap on paper. Teams that pilot at low volume, where Opus's narration tax barely registers, get burned when production traffic separates the projection from the bill.
Run both on a slice of your real traffic. Measure tokens per task, not tokens per million. The pricing page is the wrong place to make this call.