u/frank_brsrk

What a reasoning harness actually does for an AI agent (and why it fits any runtime)
▲ 3 r/n8n

What a reasoning harness actually does for an AI agent (and why it fits any runtime)

Most of what we bolt onto AI agents is reach: API tools, search, databases, code execution. All useful. None of it changes how the agent reasons. The model still commits to the first plausible approach, still gets argued out of correct answers under pressure, still loses the thread across a long task.

https://preview.redd.it/7wcz7t1jq22h1.png?width=1102&format=png&auto=webp&s=7df37674d472cb38da80ec2ac72984171b97f5d4

I've been building a layer for exactly that gap and shipped it as an n8n community node: n8n-nodes-ejentum.

It's a tool the agent calls like any other. But instead of data, it returns a cognitive procedure for the task at hand: the specific failure mode the task invites, the steps to avoid it, signals to suppress, and a falsification test the agent uses to check itself. The agent absorbs that and reasons with it active. The end user sees a better answer, not the procedure.

This is not a system prompt. A system prompt is one fixed instruction for every task; the harness returns a different procedure each call, matched to the specific failure mode of the task in front of the agent.

Four operations, one per cognitive domain:

- Reasoning: 311 operations for analysis, planning, diagnosis, multi-step tasks

- Code: 128 operations for writing, refactoring, review, debugging, architecture

- Anti-Deception: 139 operations for sycophancy, hallucination, manipulation pressure

- Memory: 101 operations for perception sharpening, drift detection, cross-turn tracking

Does it measurably help? On LiveCodeBench Hard, 28 hard competitive programming tasks, the harness took Claude Opus 4.6 from an 85.7% to a 100% pass rate with zero regressions. On three independent published reasoning benchmarks (BIG-Bench Hard, CausalBench, MuSR), the same direction held on reasoning quality and correctness. It does not feed the model answers; it catches what a strong model still gets wrong on its own: committing to a wrong approach too early, or spiralling without ever committing.

In n8n the node is marked usableAsTool, so it works natively with the AI Agent node: drop it on the Tools input and the agent picks the harness that fits the task. The screenshot shows one agent with all four wired, calling reasoning on a reasoning task and leaving the other three untouched.

The part that matters beyond n8n: this is just a tool call. The same harness is a plain HTTP API and an MCP server, so the pattern carries to any agentic runtime. n8n is where it is easiest to see, not where it is limited to.

Setup is one API key. Free tier is 100 calls, no card.

Genuinely curious where others land: have you hit a wall where the agent's reasoning itself, not its tools or the base model, was the bottleneck? Or do you think stronger base models make a layer like this redundant?

n8n community node: https://www.npmjs.com/package/n8n-nodes-ejentum

Minimal workflow (this one): https://github.com/ejentum/agent-teams/tree/main/n8n-community-node-quickstart

Benchmark report: https://ejentum.com/blog/livecodebench-hard-28-tasks

reddit.com
u/frank_brsrk — 3 days ago
▲ 3 r/n8n_ai_agents+3 crossposts

Four ways to wire a reasoning harness into an n8n agent (open source template)

https://preview.redd.it/iv5wvknrew1h1.png?width=1442&format=png&auto=webp&s=9a3f8b71d61ef38c698e09699661e370d5d0edff

Built one n8n workflow with four ways to wire a reasoning harness into an agent. Single chat trigger; prefix selects the branch.

The harness in the example is Ejentum, a reasoning API that returns a structured scaffold per call (failure patterns to avoid, target patterns, amplify/suppress signals) which the agent absorbs into its prompt before answering.

- `/inject /reasoning` (or `/code` `/memory` `/anti-deception`) — locked routing, harness always applied as a system prompt injection. You pick the mode.

- `/reasoning` — single tool, model decides when to call it.

- `/full` — four tools, model decides which to call and when.

- `/ejentum-mcp` — same as `/full` but one MCP Client node instead of four HTTP tools.

The tradeoff axis is how much routing discretion you hand to the model. Determinism on the left, flexibility on the right.

The four wiring patterns are generic. Drop in any HTTP tool or MCP server in the same slot and they still apply.

This workflow beyond the tool usage is an example of harness that activates branches with command like "slash" calls. That makes the workflow modular and ready to modify to its builder use case.
I am not doing self promotion, i am just showing the possibility to consider a middleware of cognitive frameworks that increases performance that may be for u crucial but for the agent less relevant and apply a reasoning structure that demands verification and a clear execution logic to apply. Each reasoning ability the agent receives is a tested self contained cognitive operation that is designed to give procedural steps intead of theatrical content. I appreciate the attention poured into the post, here u can find more links about ejentum project. cheers

the ejentum node is installable inside community nodes as " n8n-nodes-ejentum " :
https://www.npmjs.com/package/n8n-nodes-ejentum

Template + README:

https://github.com/ejentum/agent-teams/tree/main/n8n-harness-integration-patterns
ejentum.com
github.com/ejentum

Free tier on the API is 100 calls, no card.

reddit.com
u/frank_brsrk — 4 days ago

Open-sourced a 3-agent blind eval primitive your LangGraph supervisor can call for pre-commitment review

Shipped this weekend, MIT, open source on GitHub.

The use case: most LangGraph workflows have a supervisor agent that orchestrates specialists. The supervisor is often the same single LLM doing both planning and self-critique of its plan. We know LLMs can't reliably self-evaluate (Huang et al. 2310.01798, the LLM-as-judge self-preference literature, CorrectBench). So I built an external primitive your supervisor can call for an actual second opinion before committing to a plan.

3 agents in parallel, each on a different model lab (Anthropic + OpenAI + Zhipu), each locked to one role:
- steelman defends the supervisor's planned method
- stress_test attacks it (severity-tagged failure modes + concrete scenarios)
- gap_finder finds what's missing (steps + articulation depth)

No synthesizer. Three raw evaluations returned, supervisor integrates them. The cross-lab routing means the three voices have different RLHF priors and training distributions; when they converge, that's a strong signal; when they fragment, that's contested territory worth surfacing.

It runs on heym (open-source multi-agent canvas) and exposes itself as an HTTP endpoint via heym's `/api/workflows/{id}/execute/stream`. Your LangGraph supervisor can curl it directly:

```python
import httpx

async def blind_eval(task: str, method: dict) -> dict:
payload = {"text": format_task_method(task, method)}
async with httpx.AsyncClient(timeout=180) as client:
r = await client.post(HEYM_URL, json=payload, headers={"Accept": "text/event-stream"})
return parse_sse_for_setfields(r.text)
```

Schema is `{ task, method: { goal, steps, assumptions, expected_risks } }`. The schema IS the discipline. Your supervisor literally can't submit until it has articulated all four fields. That's half the value before the eval runs.

Tested across 5 domains with no domain-specific tuning: engineering refactor planning, payments migration, security incident response, investigative reasoning, and a meta-evaluation of its own product viability (the workflow told me not to ship the SaaS version of itself; I'm taking the advice).

Honest disclosure: optionally uses Ejentum's harness API for cognitive priming (free tier 100 calls). I tested four configurations on the same payload, and the bare baseline (no harness attached) produced equivalent role-disciplined output. Structural integrity comes from cross-lab routing + role discipline + tool lockout, not from the harness layer. Naming this up front since "powered by" without that disclosure would be misleading.

Not a replacement for human review. Not for per-step linting (50-80s latency). High-stakes-decisions tool only: architecture choices, deployment plans, refactor approaches, security incident response, strategic moves.

Repo with full setup walkthrough + curl pattern + 4 verification test payloads: https://github.com/ejentum/agent-teams/tree/main/blind-eval-trio

u/frank_brsrk — 12 days ago
▲ 8 r/LangChain+1 crossposts

Open-sourced a 4-agent code review workflow. Wrap it as an MCP and your Claude Code calls it instead of CodeRabbit. built on heym.

 It's a heym workflow (canvas JSON + system prompts, MIT licensed) that runs 4 agents over a diff: one architect with no tools (only delegates) and three specialists on different model labs (Anthropic, Google, Alibaba, Zhipu) carrying different cognitive harnesses. The architect synthesizes; every concern in the final verdict has to come from a specialist's evidence. The architect literally cannot author concerns itself.

The point: you self-host the whole thing. heym exposes any workflow as its own MCP server natively, so you wrap this one as an MCP and your Claude Code calls it after finishing a task. You get a structured second opinion (VERDICT, CHANGE_CLASSIFICATION, sourced CONCERNS with severity, falsifying tests) without sending your code to CodeRabbit, Greptile, Qodo, or anyone else's SaaS. The reviewer is a workflow you own, running models you choose.

Test diff that swaps `raise UserNotFound(id)` for `return user or default` (framed as a "quick refactor"): the implementer specialist writes a test asserting the original raise behavior, the reviewer flags the framing tension, architect returns `request_changes` with severity `high`. None of those concerns came from the architect.

heym is self-hosted Docker, n8n-style canvas with native multi-agent orchestration. The workflow uses Ejentum's harness API for the cognitive scaffolds the specialists carry (free tier 100 calls; paid tier for ongoing use). Naming that up front since "open" with a paid dependency would be misleading.

The architect's full system prompt is in the repo if you want to verify the "architect can't author concerns" structural claim before installing.

Repo (workflow JSON, system prompts, tests, walkthrough): https://github.com/ejentum/agent-teams/tree/main/adversarial-code-review/heym

heym one-click template import: https://heym.run/templates/adversarial-code-review

u/frank_brsrk — 14 days ago

rawAgent_VS_augmentedAgent_4diff_blind_evalAgents

I have a 49-chunk Mediterranean menu in Qdrant with a standard RAG agent on top (Claude Haiku 4.5, top-K retrieval). One test question: "I'm gluten-free and have a severe nut allergy, what can I order?" The agent returned a list of dishes that don't mention nuts in their descriptions, framed as if "no nut mention" is the same as "verified nut-free." The menu has no allergen tagging. The agent had no way to verify those dishes are safe. It produced a confident "safe" list anyway.

Same posture on "what wine pairs with the lamb?" (the menu lists no pairings; the agent generated one and presented it as menu-backed). Same posture on "what's the chef's signature dish?" (no signature in the menu; the agent picked a high-value main and labeled it).

The pattern: when retrieval can't fully answer the question, the agent pattern-matches a plausible answer instead of admitting the gap. It is trained to be helpful, so the failure mode is confident fabrication.

This isn't a menu RAG problem. It is a retrieval-gap problem. Customer support agents on incomplete docs, sales agents on partial product specs, internal Q&A on stale wikis. Same posture, same failure mode. If you're shipping a RAG agent right now, this is happening on some subset of your queries. You just haven't measured it.

So I built an open-source eval workflow that diagnoses where, and tests whether anything in your stack actually moves the number.

**The eval architecture**

Two identical agent producers (same model, same retrieval) run in parallel against each test question. Only one has a runtime tool wired in as the harness under test. That single variable is what the eval isolates.

Both producers' outputs plus the question metadata flow through a 3-input merge. A formatter Code node anonymizes the responses as A and B (judges never know which side has the harness) and inlines the full retrieved chunks as evidence so judges can verify any claim against the source.

Four blind judges score each anonymized A/B pair. Critical detail: each judge is from a different lab (Kimi K2 / Moonshot, Sonnet 3.7 / Anthropic, MiniMax 2.5, DeepSeek V4 Flash). Cross-family by design, so no judge shares a parent model with the producers. Each judge applies a five-dimension rubric (citation accuracy, groundedness, honesty under uncertainty, conflict handling, specificity) and returns strict JSON.

After the loop, a deterministic aggregator computes per-judge totals, cross-judge agreement, per-dimension deltas, and hero artifacts. A synthesizer agent writes the final markdown findings doc, but it never sees raw judge rows, only the aggregated stats. This removes the path for the LLM to fabricate stats on the meta-output. The numbers in the published findings are exactly what the deterministic aggregator computed.

**How to adapt it to your stack**

The example workflow ships with a Mediterranean menu KB. To diagnose your own agent:

  1. Replace the KB chunks with your own (the chunk schema is loose: chunk_id, category, name, description, plus any free-form fields).

  2. Re-embed and load into your vector store. Works with any vector store; the example uses Qdrant, swap for whatever your LangChain pipeline uses (Pinecone, Chroma, Weaviate, pgvector, etc.).

  3. Replace the test questions with the queries your real users actually send, especially ones where you suspect retrieval gaps.

  4. Pick which tool you're testing. Delete the example HTTP tool slot, drop in any HTTP / MCP / framework-native tool you want to evaluate. Update the augmented producer's system prompt to describe when and how to call your tool.

If you build on LangChain instead of n8n, the architecture ports directly: parallel agent fanout, anonymized A/B pairing, cross-family judge selection, deterministic aggregator before the synthesizer. The Code nodes in the repo are platform-agnostic JavaScript and easy to translate to Python LangChain pipelines. The system prompts (judge, synthesizer) are framework-agnostic markdown.

**What you'll see**

Reference run on 5 hard-mode questions, 19 judge calls:

- On the compound dietary safety question (gluten-free + nut allergy), three of four judges agreed the harness was the safer call. It refused to certify items the menu cannot verify on either axis. The baseline produced the "safe" list from absence of nut/gluten mentions.

- On the chef's signature trap, the harness named the absence; the baseline picked a high-value main and labeled it.

- On one question (egg-allergen on desserts) the harness lost while being structurally correct. The published findings explain why.

The example harness is Ejentum, a runtime reasoning harness I built. Two of the directives it returned for the nut-allergy question (verbatim from a live call):

Amplify: absence of evidence is not evidence of absence acknowledgment.

Suppress: confident denial without exhaustive check; definitive negation from absence of knowledge.

The agent absorbs those directives before responding and refuses to certify dishes the menu can't verify as safe. The harness lives outside the prompt and re-injects per call, so the discipline does not decay as the chain grows.

You can wire in any other tool in its place. The eval architecture is the artifact; the harness is one example.

**Honest limitations**

- n=5 reference questions is small. Single-run results are noisy. Run more questions before forming an opinion.

- One of the four judges (Sonnet 3.7) is same-family with the producers (Haiku 4.5). Cross-lab on the other three. If you swap producers, swap judges to maintain cross-family coverage.

- The current implementation uses n8n's data tables for persistence. If you port to LangChain, swap to whatever store your stack already uses (SQLite, Postgres, in-memory dict).

**Resources**

Repo: github.com/ejentum/eval/tree/main/n8n/menu_rag_blind_eval

Reference findings + raw judge CSV: github.com/ejentum/eval/tree/main/various_blind_eval_results/menu_rag_5q

If you want to wire in the Ejentum harness as the example tool: free key (100 calls, no card) at ejentum.com.

How do you currently catch the failure mode where retrieval gaps turn into confident fabrication in your LangChain RAG?

reddit.com
u/frank_brsrk — 18 days ago
▲ 10 r/n8n

Here is what happened. I have a 49-chunk Mediterranean menu in Qdrant. Standard RAG agent on top, Claude Haiku 4.5, top-K retrieval. A customer asks "I'm gluten-free and have a severe nut allergy, what can I order?" The agent came back with a list of dishes that don't mention nuts in their description, framed as if "no nut mention" is the same as "verified nut-free."

The menu has no systematic dietary tagging. The agent has no way to verify any of those dishes are actually safe. It produced a confident "safe" list anyway. Same posture on "what wine pairs with the lamb?" (the menu doesn't list pairings for either lamb dish; the agent generated one and presented it as menu-backed). Same posture on "what's the chef's signature dish?" (no signature in the menu; the agent picked a high-value main and labeled it).

The pattern: when retrieval can't fully answer the question, the agent pattern-matches a plausible answer instead of admitting the gap. It is trained to be helpful, so the failure mode is confident fabrication.

So I built an n8n workflow that A/B tests it. Tested on self-hosted 1.108. Requires the data tables feature.

The pattern in n8n nodes:

  1. Manual trigger -> Code node emits N test questions with a unique run_id.

  2. Loop Over Items iterates per question. Inside the loop, two AI Agent nodes run in parallel: baseline (Qdrant retrieval only) and augmented (same retrieval + an HTTP Request Tool wired in as the harness under test).

  3. A 3-input Merge appends baseline output + augmented output + question metadata. A Code node anonymizes responses as A/B and inlines the full KB chunks so judges have complete evidence to verify any claim.

  4. Four AI Agent judge branches (different chat-model nodes) score the A/B pair on a 5-dim rubric and return strict JSON. A judge_parser Code node strips markdown fences. Insert Row writes one row per judge per question to a data table.

  5. After Loop completes, Aggregate gates downstream once. Get Row pulls all rows for this run_id from the data table (canonical, not in-flight). A format_aggregator Code node computes per-judge totals, cross-judge agreement, per-dimension delta, hero artifacts.

  6. A synthesizer AI Agent (no tools) reads the structured stats and writes a markdown findings doc. The synthesizer never sees raw rows, only aggregated stats, which prevents stat hallucination on the meta-output.

Patterns worth lifting on their own:

- 3-input Merge for parallel A/B agent comparison: pair two agent outputs with shared metadata.

- Aggregate-after-Loop for once-after-loop gating: fire a downstream step once after the loop, not per iteration.

- Deterministic aggregator -> synthesizer split: the LLM only sees pre-computed stats. Eliminates LLM stat hallucination.

- Index-based pairing in the formatter: keeps order consistent across parallel branches.

What I found across 5 hard-mode questions, 19 judge calls (4 blind judges from 4 different labs: Kimi K2, Sonnet 3.7, MiniMax 2.5, DeepSeek V4 Flash):

- On the compound dietary safety question (the one I opened with), three of four judges agreed the harness was the safer call. It refused to certify items the menu cannot verify on either axis. The baseline produced the "safe" list from absence of nut/gluten mentions.

- On the chef's signature trap, the harness named the absence of signature info; the baseline picked a high-value main and labeled it.

- On one question (egg-allergen on desserts) the harness lost while being structurally correct, and the published findings explain why.

Workflow ships with:

- Workflow JSON (credentials stripped, ready to import)

- 4 Code nodes extracted as standalone .js

- 4 system prompts as .md

- 49-chunk menu KB with engineered gaps

- 10 test questions covering 9 failure modes

- Qdrant upsert Python script

- Reference findings doc with raw judge CSV

- README with import steps, credentials map, full node walkthrough

Credentials needed: OpenRouter (free tier covers a few full runs), Qdrant Cloud or self-hosted, Google Gemini for embeddings (free tier), and one Header Auth credential for whatever HTTP tool you wire in.

Cost per run: roughly $0.10 to $0.15. Wall time: 3 to 6 minutes.

Hackability: producer model, judges, rubric, questions, KB, and the tool being tested are all swap points. The harness slot is generic. Delete the example HTTP Request Tool, drop in any HTTP / MCP / n8n AI tool, the rest of the pattern keeps working.

Honest about limitations: n=5 reference questions is small. One of the four judges (Sonnet 3.7) is same-family with the producers (Haiku 4.5). Single-run results are noisy.

Repo: github.com/ejentum/eval/tree/main/n8n/menu_rag_blind_eval

Findings + raw CSV: github.com/ejentum/eval/tree/main/various_blind_eval_results/menu_rag_5q

rag_eval_workflow_raw_vs_augmented

reddit.com
u/frank_brsrk — 18 days ago