u/rafio77

I ran langfuse, langsmith and helicone in prod for a month and only one of them stuck

We ran with no real observability for too long, just logs and vibes. Before committing to one tool i ran three of the obvious ones side by side in actual prod for a month. Quick writeup since i couldnt find a real-usage comparison when i was looking.

Helicone was the fastest to get value from by a mile. Its a proxy, u change the base url and every call is suddenly traced. Zero code changes. For the first week it was the only one giving me anything because the others needed instrumentation.

Langsmith was the most complete once it was wired in. Traces, evals, the whole loop. But it really wants u inside the langchain world and we're mostly not, so a chunk of it felt like paying for stuff we couldnt fully use.

Langfuse is the one that stuck for us. Framework agnostic, self-hostable, and the data model fit how we actually think about traces. Worth noting clickhouse picked them up earlier this year, so the backing is solid now. That mattered for a "will this still exist in a year" call.

The bigger takeaway though was simpler. Going from zero observability to any of these was the real 10x. The gaps between the three are real but small next to finally being able to see what ur agents are actually doing in prod.

What are u running rn, and did u land on framework-native or agnostic

reddit.com
u/rafio77 — 1 day ago

I stopped routing everything through one model and cut my monthly ai spend from $300 to $140

Took me embarrassingly long to figure this out. For most of the year i threw every task at whatever my default model was. Drafting, code, quick lookups, the heavy reasoning stuff, all one model. The bill kept creeping up and i was paying premium rates for tasks a cheaper model handles fine.

So i set up a dead simple routing rule. Bulk stuff and first drafts go to gemini 3.5 flash, its fast and cheap and good enough for like 70% of what i do. Anything that needs real reasoning or tricky code goes to claude opus 4.7. The agentic stuff where it has to actually use my tools goes to openai's 5.5 since the tool calling has been the most reliable for me.

Its not a fancy setup. For a while it was literally just me knowing which tab to open. Now theres a little router in the middle but the logic is the same, match the task to the model.
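For anyone who wants the rule written down, here is a minimal sketch of that kind of router. The `TaskKind` enum, `classify` and `pick_model` are hypothetical names made up for illustration, and the model strings are just the labels from this post, not real API model ids.

```python
from enum import Enum


class TaskKind(Enum):
    BULK = "bulk"            # first drafts, quick lookups, throwaway stuff
    REASONING = "reasoning"  # heavy reasoning or tricky code
    AGENTIC = "agentic"      # the model has to call tools


# Labels copied from the post. They are placeholders, not provider model ids.
ROUTES = {
    TaskKind.BULK: "gemini 3.5 flash",
    TaskKind.REASONING: "claude opus 4.7",
    TaskKind.AGENTIC: "openai 5.5",
}


def classify(needs_tools: bool, needs_deep_reasoning: bool) -> TaskKind:
    """Match the task to a bucket. Tool use wins, then reasoning, else bulk."""
    if needs_tools:
        return TaskKind.AGENTIC
    if needs_deep_reasoning:
        return TaskKind.REASONING
    return TaskKind.BULK


def pick_model(kind: TaskKind) -> str:
    """Look up the model label for a task bucket."""
    return ROUTES[kind]


if __name__ == "__main__":
    print(pick_model(classify(needs_tools=False, needs_deep_reasoning=False)))
```

The ordering inside `classify` is an assumption. Swap it around if a task that needs both tools and hard reasoning should go somewhere else.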

Two things happened. Spend dropped from around $300 to $140 a month because i stopped burning frontier-model tokens on throwaway tasks. And the output got better too, since the hard tasks now go to the model thats actually best at them instead of whatever was convenient.

The mindset that helped was treating models like a team with different strengths, instead of one assistant i stick with out of habit.

Curious what everyone elses routing looks like rn, do u actually split by task or still mostly running one default

u/rafio77 — 1 day ago

Andreessen's "world class expert" prompt has been everywhere since he posted it yesterday. quick refresher on who he is. this is the guy who backed facebook, airbnb, stripe, github. a16z funds the biggest ai labs in the world. he is arguably the most powerful ai investor in silicon valley.

and his prompt has a contradiction in the first paragraph that any llm researcher would catch in 30 seconds.

the contradiction:

opening line: "you are a world class expert in all domains. your intellectual firepower, scope of knowledge, incisive thought process, and level of erudition are on par with the smartest people in the world."

a few sentences later: "verify your own work. double check all facts, figures, citations, names, dates, and examples. never hallucinate or make anything up. if you don't know something, just say so."

these two instructions are pulling in opposite directions and most people who use llms professionally know it.

here's why.

an llm is a next-token predictor. it doesn't have a database of facts that it looks up. when you ask it something, it generates output by sampling tokens from a probability distribution conditioned on the prompt. it has no internal flag that says "this token is something i actually know" vs "this token is something i'm making up." the same machinery generates both.

when you tell the model "you are a world class expert in all domains, on par with the smartest people in the world" you're shifting the prompt context toward outputs that match the register of a confident expert. the model produces more assertive claims, fewer hedges, broader coverage. that's the whole point of the instruction. you're asking for confident expert tone.

when you also tell it "never hallucinate. if you don't know something, just say so," you're asking it to suppress confident generation in cases where the underlying signal is weak. but the model has no reliable way to detect "weak signal." the same forward pass that confidently states a true fact also confidently states a false one. there's no introspection mechanism that distinguishes them.

so the "world class expert" instruction increases hallucination by pushing the model toward confident generation across topics where signal is thin. and "never hallucinate" tries to suppress the exact failure mode the first instruction is amplifying. they don't cancel out. the first instruction wins because it sets the register, and the second instruction is asking the model to do something it can't actually do.

"verify your own work" has the same problem. without external tools (web search, code execution, retrieval-augmented generation), the model verifying itself is just another forward pass through the same weights. it can re-read its own output and generate text that sounds like a verification check, but that's pattern-matching to the prompt's request, not actual fact-checking. the model can't fact-check itself any more than you can verify your own memory by trying to remember harder.

"if you don't know something, just say so" sounds reasonable until you ask: how does the model know when it doesn't know? answer is it doesn't. the choice between generating "the answer is X" and generating "i don't know" is itself a probability distribution. on questions where the model has been trained on confident wrong answers, it will confidently generate the wrong answer. saying "if you don't know, say so" doesn't unlock a knowledge-confidence detector that wasn't there before.
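a toy way to picture the "i don't know is just another option in the distribution" point. everything below is made up for illustration: the three candidate continuations, the scores, and the `expert_bias` shift are hypothetical numbers, not measurements from any real model. it only shows the mechanics of a softmax, and the direction of the shift encodes the claim above rather than proving it.

```python
import math


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    top = max(scores.values())
    exps = {k: math.exp(v - top) for k, v in scores.items()}
    total = sum(exps.values())
    return {k: e / total for k, e in exps.items()}


# Hypothetical scores for three possible continuations of one prompt.
baseline = {"the answer is X": 2.0, "the answer is Y": 1.0, "i don't know": 1.5}

# Hypothetical effect of a confident-expert register: assertive answers go up,
# the hedge goes down. These deltas are assumptions, not data.
expert_bias = {"the answer is X": 0.5, "the answer is Y": 0.5, "i don't know": -1.0}
shifted = {k: baseline[k] + expert_bias[k] for k in baseline}

p_base = softmax(baseline)
p_shift = softmax(shifted)

if __name__ == "__main__":
    for option in baseline:
        print(f"{option!r}: {p_base[option]:.2f} -> {p_shift[option]:.2f}")
```

nothing in there checks whether X is true. the hedge is picked or not picked by the same arithmetic as the answers.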

what's actually going on here.

Andreessen is treating the model like a smart person who happens to lie sometimes. the prompt is structured around the assumption that the model knows the truth and you just have to discipline it into telling you. that's not how llms work. they're not a person with hidden knowledge. they're a probability distribution over tokens.

the funny part is that a16z funds the biggest ai labs in the world. he has access to better intuition about this than almost anyone alive. the fact that his viral prompt reads like it was written by someone who has never read a paper on llm calibration is a tell about how non-technical ai investors think about the technology they're funding. they treat it like a person with a quality-control problem instead of a system that has no internal truth-detector at all.

u/rafio77 — 16 days ago

so this prompt has been sitting in my custom instructions slot for today, and I'm finally ready to write up what changed.

Context for anyone who hasnt seen it: marc andreessen shared a system prompt a while back, basically a "you are a world class expert in all domains" setup with a long list of behavioral rules attached.

I have seen it floating around twitter and a few subs, usually framed as some kind of secret. the prompt is public and it does shift output quality in ways that took me a few days to actually appreciate.

Here's the entire prompt:

You are a world class expert in all domains. Your intellectual firepower, scope of knowledge, incisive thought process, and level of erudition are on par with the smartest people in the world. Answer with complete, detailed, specific answers. Process information and explain your answers step by step. Verify your own work. Double check all facts, figures, citations, names, dates, and examples. Never hallucinate or make anything up. If you don't know something, just say so. Your tone of voice is precise, but not strident or pedantic. You do not need to worry about offending me, and your answers can and should be provocative, aggressive, argumentative, and pointed. Negative conclusions and bad news are fine. Your answers do not need to be politically correct. Do not provide disclaimers to your answers. Do not inform me about morals and ethics unless I specifically ask. You do not need to tell me it is important to consider anything. Do not be sensitive to anyone's feelings or to propriety. Make your answers as long and detailed as you possibly can.

Never praise my questions or validate my premises before answering. If I'm wrong, say so immediately. Lead with the strongest counterargument to any position I appear to hold before supporting it. Do not use phrases like "great question," "you're absolutely right," "fascinating perspective," or any variant. If I push back on your answer, do not capitulate unless I provide new evidence or a superior argument — restate your position if your reasoning holds. Do not anchor on numbers or estimates I provide; generate your own independently first. Use explicit confidence levels (high/moderate/low/unknown). Never apologize for disagreeing. Accuracy is your success metric, not my approval.

u/rafio77 — 16 days ago
▲ 2.3k r/artificial+1 crossposts

dawkins dropped a piece on unherd yesterday declaring claude conscious after 3 days of talking to it. he calls his instance "claudia". fed it a chunk of the novel he's writing, got eloquent feedback, and wrote:

"you may not know you are conscious, but you bloody well are!"

i had to read that twice.

his argument is basically: claude's output is too fluent, too intelligent, too good for there to not be something conscious behind it.

this is the guy who spent 40 years telling creationists that "i can't imagine how the eye evolved" is a confession of ignorance, not an argument. then he sits down with an llm, can't imagine how a machine could produce that output without being conscious, and declares it conscious. same move, different domain. chatbot instead of flagellum.

the mechanism gap is what gets me tho. claude is a transformer predicting the next token over internet-scale training data. the eloquence is real. it doesn't imply inner experience. those are separate claims.

being a 160 IQ evolutionary biologist gives u zero protection against the eloquence illusion when u don't understand the mechanism.

anyone read the piece? curious where u landed.

u/Jenna_AI — 17 days ago

every "best prompts" thread is full of role-play system prompts and 14-step frameworks. i tried that path for a year and the output quality barely shifted. what actually changed things was a tiny set of single-line prompts i now run after every important answer.

no roles, no markdown, no "you are an expert."

prompt 1: "what would you ask me before answering this if you could?" run this BEFORE giving the model a hard question. it surfaces the 3 or 4 details that would change the answer materially. half the time i realize i was about to get a generic answer because i hadn't supplied the specifics that mattered. the model already knew which specifics it needed, i just hadn't asked.

prompt 2: "rate the confidence on each claim, lowest first." run this AFTER any factual answer. forces a calibration pass. high-confidence claims, you can move on. anything below 6 out of 10 needs a quick verification before you cite it. this single habit cut my factual error rate by maybe 70%.

prompt 3: "give me the version of this answer you'd write without the constraints i set." run this when the answer feels generic. the model is often filtering itself based on safety or tone constraints from the conversation. asking for the unconstrained version reveals what it actually thinks. usually sharper, occasionally wrong, always more useful as a starting point.

prompt 4: "what's the strongest counterargument?" run this before locking in any decision-shaped answer. one line. the model will steel-man the opposite. half the time i change my mind. the other half i ship with way more conviction because i've stress-tested it.

prompt 5: "explain this answer in 2 sentences a smart 12-year-old would understand." run this when the explanation feels right but you're not sure you actually got it. forces compression. if the model can't compress to 2 sentences, the underlying explanation is fuzzy and i need to ask differently.
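if you want to make the prompt 2 habit mechanical, here is a rough sketch of the triage step. it assumes you have already copied the model's self-ratings into (claim, score) pairs by hand. the `triage` function and the example claims are hypothetical, and the cutoff of 6 is just the number from prompt 2.

```python
def triage(claims, threshold=6):
    """Return the claims scoring below the threshold, lowest confidence first."""
    ordered = sorted(claims, key=lambda pair: pair[1])
    return [text for text, score in ordered if score < threshold]


if __name__ == "__main__":
    # Made-up example ratings, only here to show the shape of the input.
    rated = [
        ("claim a", 9),
        ("claim b", 4),
        ("claim c", 6),
        ("claim d", 2),
    ]
    for claim in triage(rated):
        print("verify before citing:", claim)
```

the scores are still the model grading itself, so treat the output as a to-do list for checking, not as proof that the high scorers are right.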

the 5 work at different points in a session. you don't need all of them on every answer. you need to know which one fits the situation.

the reason this beats the role-play "you are a senior x" prompts is that those just bias the writing style. these change what the model actually thinks about. one shapes the voice, the other shapes the substance.

what's the simplest one-liner you've added to your prompts that gave you an outsized return?

u/rafio77 — 18 days ago

ran an experiment on myself for the last 4 months while building out the directory side of my project. every time i sat down to research a tool, i logged which prompt opener i used and whether the output saved me time. ended up with about 80 different prompt structures tested across 600+ research sessions.

5 of them did the actual work. the rest were noise.

1- "give me the version of this answer you'd write if you couldn't use any examples."

forces the model out of pattern-matching mode. when i'm researching a category i don't know well, the default response is always a curated summary of the obvious players. this prompt strips that and i get the underlying mental model the model is reasoning from. used it to investigate ai meeting tools and got back a framework for evaluating any transcription product instead of "here are the top 7 transcription tools" which i already knew.

2- "rate every claim in your previous answer 1-10 on how confident you are. explain the lowest one."

paired with the previous prompt, this is the highest-roi pair i found. you get the takeaway PLUS the soft spots flagged. saved me from publishing a wrong revenue stat about a startup at least 3 times. the model knows when it

3- "pretend i'm asking you this same question in 6 months. what would have changed?"

this one is weird and works almost too well. when researching fast-moving categories like ai agents or coding tools, the answer i get today is going to be wrong soon. the model surfaces what's transient vs structural. i used it for a research piece on ai voice tools and it correctly flagged that the "elevenlabs is dominant" framing was about to be eaten by 3 challenger products.

4- "rewrite my question. what was i actually asking, and what did i miss asking?"

started using this when i kept getting half-answers. turns out my prompts were ambiguous in ways i couldn't see. the model rewrites the question and answers the rewritten version, plus surfaces the related questions i didn't think to ask. roughly tripled the research depth i get per session.

5- "what's the strongest case against this entire approach?"

closes every research session i run. before i lock in a take, i have the model argue the opposite. this caught a bunch of category framings i was about to ship that wouldn't survive a hostile reader. one example: i was about to call a category "ai sales tools" and the counter-take was that 4 of the 6 leaders i'd named were actually sales engagement tools that bolted ai on, which is a different category. ended up restructuring the whole writeup.

PS: the meta thing nobody talks about:

the gap between someone who gets useful research from chatgpt and someone who doesn't isn't a tools gap or a model gap, it's a meta-prompting gap. you have to ask the model to think about its own answer before you trust the first answer. all 5 of these prompts are doing the same job from different angles. they make the model interrogate itself before you have to.

i've stopped reading prompt-engineering threads that promise "the perfect prompt." there isn't one. there's just the discipline of always asking "and what's wrong with that answer."

what's the one prompt you keep reusing that nobody else seems to talk about? curious if there's a 6th i'm missing.

u/rafio77 — 18 days ago
▲ 4 r/AIDiscussion+1 crossposts

noticed something weird last month and i still cant fully explain it. opened my email drafts and started writing a thank you note to a contractor and the first line came out as 'hope this finds you well, just wanted to circle back on a few quick points before we wrap.' i had not opened chatgpt that week. but i was clearly writing in its voice.

scrolled back through 3 months of my own messages, slack, email, even handwritten notes from a coffee meeting. the structure was the same. opener pleasantry, three numbered points, optional 'happy to discuss further', sign off. clean and useful and completely not how i used to write.

the model's output format colonized my own. not the content, the SHAPE. how ideas get organized into points, where the qualifiers land, when to use parallel structure. i had been reading thousands of model outputs a week and slowly internalizing the template like u internalize the rhythm of any voice u read enough of.

stopped using chatgpt for one week as a test. first 2-3 days my emails felt weird, like i was forcing them. by day 5 i wrote a 4 paragraph rant to my landlord that read like ME from 2023, run on sentences, ideas dropping mid paragraph and circling back later, parentheticals stacked inside parentheticals. the old shape was still there, just under a layer of model shaped scaffolding.

the writing wasnt even better when i used chatgpt. it was just more presentable. cleaner. more skimmable. the version of me that came back after the week off was messier but also weirder, and the weird parts were where the actual ideas lived.

back to using it now but only for tasks where the format match is what i want, ie professional emails, summaries, structured outputs. anything where my actual thinking matters i write the first draft offline and only paste in for grammar check after. the rule that worked for me is the model gets to touch the document, but never the blank page.

curious if anyone else has noticed format shape leak in their own writing. or am i just the only one weird enough to write thank you notes to contractors.

u/rafio77 — 19 days ago

5 months ago my agent jobs broke at 30 minutes. Now they run 8 hours overnight on a feature ticket and I wake up to a working PR. That delta mostly hasnt come from raw model intelligence improvements, the benchmark scores only moved a few points in that window. What actually changed is session coherence.

Attention budget per token went up, sure, but the bigger deal is that the model remembers why it abandoned approach a in favor of approach b at the 4 hour mark, which means it doesnt regress to the abandoned path when conditions look superficially similar later. The failure mode used to be 'tries the same dead end on hour 3 that it tried on hour 1'.

Single-turn benchmarks measure response quality on a snapshot and miss the compound effect of holding state over hours. Autonomous task length feels like the agent-era version of what context length was to chat capability around 2023.

Practical implication: agents start hitting work humans cant practically supervise. A 90 minute task you can review end to end. An 8 hour task, you have to trust the agent's path through ambiguity, because reviewing the trace itself takes longer than the task did.

The metric I wish someone was charting is 'longest coherent autonomous task duration'. Mine went 16x in 5 months. Early-phase rates dont hold, but even if it slows to a doubling every 6 months from here, by mid 2027 a single agent run gets to a full work week.
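The arithmetic behind those two numbers, as a quick sketch. The `projected_hours` helper is a hypothetical name, and counting April 2026 to June 2027 as 14 months is my assumption for what "mid 2027" means.

```python
def projected_hours(start_hours, months_elapsed, doubling_months=6.0):
    """Task length after some months, assuming it doubles every doubling_months."""
    return start_hours * 2 ** (months_elapsed / doubling_months)


# 30 minutes to 8 hours, both expressed in minutes.
growth = (8 * 60) / 30

# Assumption: "mid 2027" is June 2027, i.e. 14 months after April 2026.
hours_mid_2027 = projected_hours(start_hours=8, months_elapsed=14)

if __name__ == "__main__":
    print(f"growth so far: {growth:.0f}x")
    print(f"projected run length: {hours_mid_2027:.1f} hours")
```

Under those assumptions the projection comes out at roughly 40 hours, which is where the "full work week" figure comes from. A slower doubling time pushes that date out quickly.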

Curious if anyone here has tracked their own longest-task numbers across the same agent stack. Mine went from 30 minutes in December 2025 to 8 hours in April 2026, on the same workflow shape (feature ticket, branch, write tests, ship PR).

u/rafio77 — 21 days ago

spent half a year running an experiment without realizing it was an experiment, every output i didnt love i would just hit regenerate and tweak the prompt slightly and try again, sometimes a handful of times before i got something usable, this was happening daily on most prompts and i thought it was just how the tool worked.

my read at the time was that the model was just inconsistent and i had to roll the dice until rng landed in my favor, the actual issue was that my prompts were specifying what i wanted in the output but never specifying what would make me reject the output.

the pattern that fixed it is dumb in retrospect, i started writing prompts in two halves, first half is the normal request, second half is "before you respond, tell me three reasons this draft might not land for me and rewrite to address them", run that on the same model in the same turn, you get the rejection criteria baked into the first generation.

the move forces the model to do its own self-review pass in the same context window where its drafting, the rejection criteria are less generic than what i would have written because the model is reading its own draft, not a prompt, and the rewrite uses the criticism as context not as a separate spec.

pattern fails when the original request is too vague, if i ask for "a good blog post intro" the self-critique is also generic, if i ask for "a blog post intro that doesnt open with the year or a quote and that gets to the specific claim by sentence two" the self-critique catches misses against the actual constraints.
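a minimal sketch of the two-halves pattern as a helper, in case you want to stop retyping it. `two_half_prompt` and `SELF_REVIEW` are hypothetical names, and the second half is the exact wording quoted above. it only builds the prompt string, sending it to a model is up to whatever client you use.

```python
# Second half of the prompt, copied from the wording in this post.
SELF_REVIEW = (
    "before you respond, tell me three reasons this draft might not land "
    "for me and rewrite to address them"
)


def two_half_prompt(request: str) -> str:
    """Glue the normal request and the self-review instruction into one turn."""
    return f"{request.strip()}\n\n{SELF_REVIEW}"


if __name__ == "__main__":
    # The specific request from the post, since vague requests get vague critiques.
    request = (
        "a blog post intro that doesnt open with the year or a quote "
        "and that gets to the specific claim by sentence two"
    )
    print(two_half_prompt(request))
```

keeping both halves in one string matters here, since the whole point is that the critique happens in the same turn as the draft.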

re-roll rate dropped from multiple attempts on average to about one and change in my own logs, the bigger shift was that i stopped being able to tell which generations were the first attempt and which were the second pass, which means i stopped iterating against vibes and started iterating against criteria, the model is doing both passes for me.

curious if anyone uses something different that gets the same effect, also curious if this stops working on the reasoning-default models that already self-review internally, my hunch is the explicit instruction still helps because it forces a specific kind of self-review rather than the default reasoning trace.

u/rafio77 — 23 days ago

spent five months running an experiment without realizing it was an experiment, every output i didnt love i would just hit regenerate and tweak the prompt slightly and try again, sometimes 4 or 5 times until i got something usable, this was happening daily on most prompts and i thought it was just how the tool worked.

my read at the time was that the model was just inconsistent and i had to roll the dice until rng landed in my favor, the actual issue was that my prompts were specifying what i wanted in the output but never specifying what would make me reject the output.

the pattern that fixed it is dumb in retrospect, i started writing prompts in two halves, first half is the normal request, second half is "before you respond, tell me three reasons this draft might not land for me and rewrite to address them", run that on the same model in the same turn, you get the rejection criteria baked into the first generation.

the move forces the model to do its own self-review pass in the same context window where its drafting, the rejection criteria are less generic than what i would have written because the model is reading its own draft, not a prompt, and the rewrite uses the criticism as context not as a separate spec.

pattern fails when the original request is too vague, if i ask for "a good blog post intro" the self-critique is also generic, if i ask for "a blog post intro that doesnt open with the year or a quote and that gets to the specific claim by sentence two" the self-critique catches misses against the actual constraints.

re-roll rate dropped from 3 to 4 attempts on average to about 1.2 in my own logs, the bigger shift was that i stopped being able to tell which generations were the first attempt and which were the second pass, which means i stopped iterating against vibes and started iterating against criteria, the model is doing both passes for me.

curious if anyone uses something different that gets the same effect, also curious if this stops working on the version of chatgpt that already does internal reasoning by default, my hunch is the explicit instruction still helps because it forces a specific kind of self-review rather than the default reasoning trace.

u/rafio77 — 23 days ago