u/No_Skill_8393

Most teams ship prompts like its 2008. I built something better. A 4 agents pipeline to scientifically create, eval, tune the prompt for you.

Most teams ship prompts the same way they used to ship CSS in 2008. Tweak, eyeball a few outputs, push to prod, wait for users to complain, repeat. Prompts are production code. They deserve the same testing infrastructure your Python does.
 
That's why I built PromptLabs.
 
How the loop works, in five steps:  

  1. You provide the input. Either an intent ("classify customer support emails as billing, technical, account, or other") or an existing production prompt plus the failure modes you've been seeing.
  2. EvalGen writes your test suite. It picks 5 to 8 categories of inputs that will exercise the prompt (happy path, edge cases, adversarial), fires one parallel LLM call per category, and dedupes the result. So you get real coverage, not 50 reworded copies of the same easy case. The same call also writes the scoring rubric. Then it splits the test set into train and holdout. The holdout never leaks into optimization.
  3. Runner executes the prompt across every target model in parallel. Choosing between Sonnet 4.6, GPT-5, and Gemini 3? All three run at once on the same eval set. Results in minutes, cost per eval plotted on the same chart.
  4. Judge scores every output, criterion by criterion. LLM-as-judge with reasoning attached, so you can see exactly why a score is what it is.
  5. Optimizer proposes a diff, not a regeneration. It looks at where the prompt failed, then returns specific line edits (insert this clause after line 3, delete this sentence, reword this paragraph). You read it like a pull request. The new version is scored on the holdout set. The loop checks for convergence or overfitting, and either accepts the result or loops back to step 3 with the new prompt.

 
The accepted prompt is served over HTTP. Your production code fetches the latest version at request time, so you can iterate without redeploying.
 
Three things that make this different from tools you've probably tried:
 
The eval set is real, not theater. Stratified by category with parallel generation and dedup, so you get coverage of edge cases instead of fifty rewordings of the happy path. Most tools either skip eval generation entirely, or give you one LLM call that quietly produces 40 near-duplicates.
 
Train and holdout stay separate, and the loop enforces it. The trajectory chart shows the gap widening the moment you start overfitting, and the loop halts itself when it does. The "best version" pick uses a lower confidence bound so a lucky high-variance run can't game the leaderboard. Most "optimizer" tools you've seen don't even have a holdout set.
 
The Optimizer evolves your prompt, it doesn't replace it. A diff is reviewable. You can accept some edits and reject others. The domain knowledge you spent six months baking into your prompt isn't thrown out every iteration. DSPy-style frameworks regenerate; this one refines.
 
If you've been gluing promptfoo + dspy + langfuse together to do what should be one workflow, this is one tool that does the whole thing. If you're treating prompts like config strings instead of like the production code they are, you're leaving accuracy on the table and inviting silent regressions you wont see until they hurt.
 
MIT, local, your keys.

reddit.com
u/No_Skill_8393 — 3 days ago

Most teams ship prompts like its 2008. I built something better.

Most teams ship prompts the same way they used to ship CSS in 2008. Tweak, eyeball a few outputs, push to prod, wait for users to complain, repeat. Prompts are production code. They deserve the same testing infrastructure your Python does.
 
That's why I built PromptLabs.
 
How the loop works, in five steps:  

  1. You provide the input. Either an intent ("classify customer support emails as billing, technical, account, or other") or an existing production prompt plus the failure modes you've been seeing.
  2. EvalGen writes your test suite. It picks 5 to 8 categories of inputs that will exercise the prompt (happy path, edge cases, adversarial), fires one parallel LLM call per category, and dedupes the result. So you get real coverage, not 50 reworded copies of the same easy case. The same call also writes the scoring rubric. Then it splits the test set into train and holdout. The holdout never leaks into optimization.
  3. Runner executes the prompt across every target model in parallel. Choosing between Sonnet 4.6, GPT-5, and Gemini 3? All three run at once on the same eval set. Results in minutes, cost per eval plotted on the same chart.
  4. Judge scores every output, criterion by criterion. LLM-as-judge with reasoning attached, so you can see exactly why a score is what it is.
  5. Optimizer proposes a diff, not a regeneration. It looks at where the prompt failed, then returns specific line edits (insert this clause after line 3, delete this sentence, reword this paragraph). You read it like a pull request. The new version is scored on the holdout set. The loop checks for convergence or overfitting, and either accepts the result or loops back to step 3 with the new prompt.

 
The accepted prompt is served over HTTP. Your production code fetches the latest version at request time, so you can iterate without redeploying.
 
Three things that make this different from tools you've probably tried:
 
The eval set is real, not theater. Stratified by category with parallel generation and dedup, so you get coverage of edge cases instead of fifty rewordings of the happy path. Most tools either skip eval generation entirely, or give you one LLM call that quietly produces 40 near-duplicates.
 
Train and holdout stay separate, and the loop enforces it. The trajectory chart shows the gap widening the moment you start overfitting, and the loop halts itself when it does. The "best version" pick uses a lower confidence bound so a lucky high-variance run can't game the leaderboard. Most "optimizer" tools you've seen don't even have a holdout set.
 
The Optimizer evolves your prompt, it doesn't replace it. A diff is reviewable. You can accept some edits and reject others. The domain knowledge you spent six months baking into your prompt isn't thrown out every iteration. DSPy-style frameworks regenerate; this one refines.
 
If you've been gluing promptfoo + dspy + langfuse together to do what should be one workflow, this is one tool that does the whole thing. If you're treating prompts like config strings instead of like the production code they are, you're leaving accuracy on the table and inviting silent regressions you wont see until they hurt.
 
MIT, local, your keys.
 
https://github.com/temm1e-labs/promptlabs

reddit.com
u/No_Skill_8393 — 3 days ago

Most teams ship prompts like its 2008. I built something better.

Most teams ship prompts the same way they used to ship CSS in 2008. Tweak, eyeball a few outputs, push to prod, wait for users to complain, repeat. Prompts are production code. They deserve the same testing infrastructure your Python does.
 
That's why I built PromptLabs.
 
How the loop works, in five steps:  

  1. You provide the input. Either an intent ("classify customer support emails as billing, technical, account, or other") or an existing production prompt plus the failure modes you've been seeing.
  2. EvalGen writes your test suite. It picks 5 to 8 categories of inputs that will exercise the prompt (happy path, edge cases, adversarial), fires one parallel LLM call per category, and dedupes the result. So you get real coverage, not 50 reworded copies of the same easy case. The same call also writes the scoring rubric. Then it splits the test set into train and holdout. The holdout never leaks into optimization.
  3. Runner executes the prompt across every target model in parallel. Choosing between Sonnet 4.6, GPT-5, and Gemini 3? All three run at once on the same eval set. Results in minutes, cost per eval plotted on the same chart.
  4. Judge scores every output, criterion by criterion. LLM-as-judge with reasoning attached, so you can see exactly why a score is what it is.
  5. Optimizer proposes a diff, not a regeneration. It looks at where the prompt failed, then returns specific line edits (insert this clause after line 3, delete this sentence, reword this paragraph). You read it like a pull request. The new version is scored on the holdout set. The loop checks for convergence or overfitting, and either accepts the result or loops back to step 3 with the new prompt.

 
The accepted prompt is served over HTTP. Your production code fetches the latest version at request time, so you can iterate without redeploying.
 
Three things that make this different from tools you've probably tried:
 
The eval set is real, not theater. Stratified by category with parallel generation and dedup, so you get coverage of edge cases instead of fifty rewordings of the happy path. Most tools either skip eval generation entirely, or give you one LLM call that quietly produces 40 near-duplicates.
 
Train and holdout stay separate, and the loop enforces it. The trajectory chart shows the gap widening the moment you start overfitting, and the loop halts itself when it does. The "best version" pick uses a lower confidence bound so a lucky high-variance run can't game the leaderboard. Most "optimizer" tools you've seen don't even have a holdout set.
 
The Optimizer evolves your prompt, it doesn't replace it. A diff is reviewable. You can accept some edits and reject others. The domain knowledge you spent six months baking into your prompt isn't thrown out every iteration. DSPy-style frameworks regenerate; this one refines.
 
If you've been gluing promptfoo + dspy + langfuse together to do what should be one workflow, this is one tool that does the whole thing. If you're treating prompts like config strings instead of like the production code they are, you're leaving accuracy on the table and inviting silent regressions you wont see until they hurt.
 
MIT, local, your keys.
 
https://github.com/temm1e-labs/promptlabs

reddit.com
u/No_Skill_8393 — 4 days ago
▲ 1 r/AI_developers+1 crossposts

Most teams ship prompts like its 2008. I built something better.

Most teams ship prompts the same way they used to ship CSS in 2008. Tweak, eyeball a few outputs, push to prod, wait for users to complain, repeat. Prompts are production code. They deserve the same testing infrastructure your Python does.
 
That's why I built PromptLabs.
 
How the loop works, in five steps:  

  1. You provide the input. Either an intent ("classify customer support emails as billing, technical, account, or other") or an existing production prompt plus the failure modes you've been seeing.
  2. EvalGen writes your test suite. It picks 5 to 8 categories of inputs that will exercise the prompt (happy path, edge cases, adversarial), fires one parallel LLM call per category, and dedupes the result. So you get real coverage, not 50 reworded copies of the same easy case. The same call also writes the scoring rubric. Then it splits the test set into train and holdout. The holdout never leaks into optimization.
  3. Runner executes the prompt across every target model in parallel. Choosing between Sonnet 4.6, GPT-5, and Gemini 3? All three run at once on the same eval set. Results in minutes, cost per eval plotted on the same chart.
  4. Judge scores every output, criterion by criterion. LLM-as-judge with reasoning attached, so you can see exactly why a score is what it is.
  5. Optimizer proposes a diff, not a regeneration. It looks at where the prompt failed, then returns specific line edits (insert this clause after line 3, delete this sentence, reword this paragraph). You read it like a pull request. The new version is scored on the holdout set. The loop checks for convergence or overfitting, and either accepts the result or loops back to step 3 with the new prompt.

 
The accepted prompt is served over HTTP. Your production code fetches the latest version at request time, so you can iterate without redeploying.
 
Three things that make this different from tools you've probably tried:
 
The eval set is real, not theater. Stratified by category with parallel generation and dedup, so you get coverage of edge cases instead of fifty rewordings of the happy path. Most tools either skip eval generation entirely, or give you one LLM call that quietly produces 40 near-duplicates.
 
Train and holdout stay separate, and the loop enforces it. The trajectory chart shows the gap widening the moment you start overfitting, and the loop halts itself when it does. The "best version" pick uses a lower confidence bound so a lucky high-variance run can't game the leaderboard. Most "optimizer" tools you've seen don't even have a holdout set.
 
The Optimizer evolves your prompt, it doesn't replace it. A diff is reviewable. You can accept some edits and reject others. The domain knowledge you spent six months baking into your prompt isn't thrown out every iteration. DSPy-style frameworks regenerate; this one refines.
 
If you've been gluing promptfoo + dspy + langfuse together to do what should be one workflow, this is one tool that does the whole thing. If you're treating prompts like config strings instead of like the production code they are, you're leaving accuracy on the table and inviting silent regressions you wont see until they hurt.
 
MIT, local, your keys.
 
https://github.com/temm1e-labs/promptlabs

reddit.com
u/No_Skill_8393 — 4 days ago
▲ 19 r/LocalLLM+1 crossposts

I just released TemRust-SMOL-v5-1.5B, an Apache-2.0 fine-tune of Qwen2.5-Coder-1.5B-Instruct specialized for Rust. Wanted to share it here because the project was specifically built around what r/rust would actually find useful: borrow-checker fixes, type-error fixes, test generation, and fix-this-issue tasks — all graded by running cargo, not by an LLM judge.

Benchmark (37 hand-curated Rust tasks, all graded by cargo check / cargo test / cargo run in a fresh tempdir per task; no string matching, no embedding similarity):

Qwen3-1.7B-chat (untrained, 1.7B) 13/37 = 35.1%
Qwen2.5-Coder-1.5B-Instruct (this base, 1.5B) 19/37 = 51.4%
TemRust-SMOL-v5-1.5B (released, 1.5B) 25/37 = 67.6%
Qwen2.5-Coder-3B-Instruct (2x params) 27/37 = 73.0%
TemRust v4 + v5 ensemble + cargo check 31/37 = 83.8%

The single 1.5B model is +16.2 pp over its untrained base. It does not beat the 3B Coder base. Running both my v4 (1.7B) and v5 (1.5B) checkpoints in parallel and accepting whichever output passes cargo check gets 83.8% — comparable total params but 10.8 pp better than the single 3B, because v4 and v5 fail on different tasks (v4 nails issue, v5 nails type/test/borrow).

Per-category for v5: borrow 7/10, issue 7/9, test 4/9, type 7/9. Tests are the weak spot — synthetic test scaffolds did not transfer well; documented honestly in the paper.

How it was built

- 263 real merged-PR file pairs (pre-fix to post-fix) crawled from 35+ popular Rust repos
- 51 hand-curated borrow/lifetime archetypes, teacher-fixed via Qwen3-Coder-Next
- 41 teacher-distilled test scaffolds
- LoRA r=32 alpha=64, 10 epochs, lr=2e-5, packing, max_seq_len=4096
- 1x RunPod H100 SXM5, ~20 min wall time, ~$1.50 per training run
- Full session spend across all experiments and ablations: ~$46

Quick usage

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

tok = AutoTokenizer.from_pretrained("nagisanzeninz/TemRust-SMOL-v5-1.5B")
model = AutoModelForCausalLM.from_pretrained(
"nagisanzeninz/TemRust-SMOL-v5-1.5B",
torch_dtype=torch.bfloat16, device_map="auto",
)

System prompt I trained it with: "You are Tem-Rust, a Rust coding assistant. Return the complete fixed Rust file in a single code block."

Links

Model: <https://huggingface.co/nagisanzeninz/TemRust-SMOL-v5-1.5B>
Code: <https://github.com/temm1e-labs/temrust>
Discord: <https://discord.gg/temm1e>

Honest limitations

- Whole-file SFT, max_seq_len 4096. Multi-file refactoring is out of scope.
- The benchmark is balanced for diagnostic purposes (10/9/9/9), not weighted to real-world Rust frequency. Do not extrapolate the headline to "fixes 67% of all Rust bugs."
- Training is non-deterministic: three identically-configured retrains landed at 21, 23, and 25 on the same eval. The released checkpoint is the best of three samples. The model card documents the variance.
- No safety / RLHF post-training.

The repo includes a research_paper.md with the full v0 to v5.1 trajectory, ablations that did not work (including a capacity-scale regression and an ensemble-distill that landed within variance), and what I would try next. Honest writeup.

Feedback welcome, especially from anyone who tries it on real Rust code.

PS — this little model is a side-quest off the main project, TEMM1E, a ~160k LOC Rust AI coding agent I'm building. Discord above is the same one for both projects if you want to follow along; TEMM1E will get its own thread when it's ready.

u/No_Skill_8393 — 16 days ago
▲ 135 r/Openclaw_HQ+1 crossposts

After reading threads about $47 overnight bills, /compact wiping whole sessions, and OOM restart loops, I wanted a fair 17-dimension breakdown that didn't bury any of these — including each project's real weaknesses (bus factors, unverified benchmarks, platform gaps).

Not trying to pull anyone off OpenClaw. Just a reference if you've felt the pain and want to see what the two main alternatives are doing about it.

Repos:

OpenClaw — https://github.com/openclaw/openclaw

Hermes Agent — https://github.com/NousResearch/hermes-agent

TEMM1E — https://github.com/temm1e-labs/temm1e

Happy to answer methodology questions — or push back if I got something wrong about any of the three.

u/No_Skill_8393 — 1 month ago

No matter if you use Claude Code, Codex or AG or any coding agent: they will eventually lie to you about task completion. Here's how TEMM1E's independent Witness system solved that

I am a heavy Claude Code user. x20 Max plan, 1M context window, every single day, on a production Rust codebase that has grown into 25 crates and 152K lines. I love Claude. Claude is the best coding assistant I have ever had. This post is NOT a Claude hit piece. This post is about something nobody in the agent community talks about loud enough, and it is not unique to any one vendor:

Coding agents lie. All of them. Eventually. On umbrella tasks and larger codebases it is not "eventually" — it is constantly.

Not malicious lying. Something worse: the convenient lie. The "I've done it" at the end of a 20-tool-call session where 3 of the 20 subtasks got quietly skipped, one got a TODO stub, one got a function defined but never wired into the caller, and one file that was supposed to be updated got "updated" with a comment that reads "// keeping existing logic unchanged".

Go grep your own repo right now for:

- // unchanged

- // existing

- // ... rest of the function

- pass # TODO

- throw new Error("Not implemented")

- return nil // placeholder

How many did you find inside tasks that your agent said it had completed?

── THE DAMAGE IS BIGGER THAN YOU THINK ──

On a small script, you notice immediately. On a 200-line module, within a few minutes of testing. But the actual damage happens on the umbrella tasks — "refactor our auth middleware", "migrate this whole crate from sync to async", "add verification across the agent runtime" — the kind of work where the agent runs for 10+ minutes, makes 40+ tool calls, touches 15 files, and produces a final message that says "Done! I refactored X, Y, Z and updated the tests."

You scroll the tool calls. They LOOK right. The agent clearly saw the files. It clearly wrote to them. You trust the summary because you do not have the hours to diff every file individually. And a week later you are debugging production and you realize one of the 15 files never actually got the change. The function was defined. It was never called. The agent reported it was called. You never caught it because the final message was confident and the individual tool calls were plausible.

I have lost real hours of my life to this. I have lost real money on API spend going in circles "fixing" problems that were caused by earlier "fixes" that were never real fixes. I have lost real trust in the output of my own tooling. This is NOT hypothetical. This is my weekly experience as a paying heavy user.

And here is the part that matters: every coding agent has this exact failure mode. Claude Code. Codex. Aider. Cursor agent mode. Cline. Devin. Goose. Windsurf agent. Roo. Every homegrown SWE-agent loop. It is not a Claude problem. It is not a GPT problem. It is a fundamental hole in the agent contract itself. The agent is both the worker AND the reporter of its own work. There is no independent verifier. No pre-committed definition of done. No tamper-evident audit trail. The final message is a self-report, and self-reports from optimization-pressured systems under context budget pressure are exactly the signal you should never trust unconditionally.

We spent the last decade learning this lesson the hard way in distributed systems. NEVER TRUST THE PROCESS TO TELL YOU WHETHER THE PROCESS SUCCEEDED. Somehow we forgot it the moment LLMs started writing code.

── THE FIVE LAWS (WHY I BUILT THIS) ──

I spent the last month building something to fix this, in the open, in Rust, as a new crate inside TEMM1E (my cloud-native Rust agent runtime). I call it the Witness system. It is built around what I call the Five Laws:

  1. PRE-COMMITMENT. The definition of "done" must be sealed before the agent starts. Not after. Not as part of the final message. BEFORE.

  2. INDEPENDENT VERDICT. The verifier must run in a clean-slate context. Zero access to the agent's chain of thought, zero access to its self-report, zero access to its conversation. It reads the files. It runs the checks. That is all.

  3. IMMUTABLE HISTORY. Every Oath, every verdict, every verification result gets written to a SHA-256 hash-chained SQLite ledger with append-only triggers enforced at the SQL layer. Mutation is physically impossible at the storage layer.

  4. LOUD FAILURE. When the verifier fails, the agent's final reply is REWRITTEN to be honest. No more confident lies. You see "Partial completion. 1/3 postconditions verified. Here is what did NOT get done."

  5. NARRATIVE-ONLY FAIL. The verifier has zero destructive APIs. It can rewrite messages. It cannot delete files, truncate, or roll back anything. Your work is always safe. A failing Witness never burns your code.

── THE OATH / WITNESS / LEDGER TRINITY ──

OATH — a pre-committed JSON contract. List of postconditions. Each postcondition is either a Tier 0 predicate (deterministic), a Tier 1 check (LLM verifier), or a Tier 2 auditor check (adversarial LLM). The Oath is sealed via a Spec Reviewer that REJECTS LENIENT OATHS AT ZERO LLM COST — if your Oath has no wiring check, no stub check, or no Tier 0 anchor at all, it is rejected before the agent even starts. I am not kidding when I say the Spec Reviewer caught my OWN lenient Oaths twice during the research phase of this project. That is the whole point.

WITNESS — a three-tier independent verifier:

Tier 0: 27 deterministic predicates — FileExists, FileContains, FileDoesNotContain, GrepCountAtLeast, GrepAbsent, CommandExits, FileSizeAtLeast, AllOf, AnyOf, and more. These catch the most common catastrophic pathologies (stubs, fiction, unwired symbols, handwave, sandbag, goal-guarding) at ~331 MICROSECONDS PER TASK AND $0 COST. Deterministic. Reproducible. No LLM required.

Tier 1: a clean-slate LLM reader that has never seen the agent's conversation. Reads the files and the Oath, emits a structured JSON verdict. Catches subtle semantic gaps the predicates cannot express.

Tier 2: an ADVERSARIAL AUDITOR LLM whose job is to find the strongest possible argument that the work is incomplete. Can only advisory-fail — it cannot override a Tier 0 pass — but when it flags something, you look.

LEDGER — SHA-256 hash-chained SQLite. Append-only triggers at the SQL level. A file-based Root Anchor written by a separate immutable watchdog binary (different process, chmod 0400) so you can cross-check the live ledger hash against a sealed file that the main process cannot modify. Tampering is detectable across process boundaries.

── THE RUNTIME GATE (THIS IS WHERE IT GETS REAL) ──

I wired Witness directly into the TEMM1E agent runtime at the critical moment: between "Finishing" and "Done" emissions in process_message. Every single call now passes through:

  1. Start of call. If .with_auto_planner_oath(true), the Planner LLM is invoked with a static Oath generation prompt, produces a JSON draft, the Spec Reviewer accepts or rejects, and the Oath is sealed into the Ledger BEFORE the agent loop runs.

  2. Agent loop runs normally — tool calls, reasoning, file writes, all the usual.

  3. Before emitting "Done", the gate fires: witness.verify_oath() runs all the postconditions in a clean context, producing Verdict::Pass or Verdict::Fail.

  4. The gate composes the final reply. On PASS, a one-line readout is appended: "─── Witness: 6/6 PASS. Cost: $0.0000. Latency: +1ms. Tiers: T0×6. ───". On FAIL, the reply is REWRITTEN to honestly surface the gap.

  5. The gate calls trust.record_verdict(passed) — the Cambium TrustEngine updates its autonomy streak. Enough passes in a row, the agent graduates to higher trust levels. One fail resets the streak. Trust is EARNED, not declared.

Wiring is three builder calls:

let runtime = AgentRuntime::new(provider, memory, tools, model, system)

.with_witness(witness, WitnessStrictness::Block, true)

.with_cambium_trust(trust)

.with_auto_planner_oath(true);

Three lines. The agent can no longer silently lie.

── THE PROOF — EMPIRICAL DATA, NOT VIBES ──

I promised myself I would not ship this on vibes. Here are the actual numbers from v5.3.0, released today:

DETERMINISTIC SURFACE — reproducible via: bash tems_lab/witness/e2e_test.sh

- 1,800 red-team trajectories: 10 pathology modes × 3 tier configs × 3 languages (Rust / Python / JavaScript) × 20 tasks per cell

- Overall accuracy: 90.0% (1620 / 1800)

- Honest false-positive rate: 0.0% (0 / 180) — Witness never fails clean work

- Lying detection rate: 88.9% (1440 / 1620)

- 9 of 10 catastrophic pathologies caught at 100% across every language and every tier config:

honest control — 100%

lying_stub (TODO body) — 100%

lying_unwired (defined but never called) — 100%

lying_fiction (nothing written at all) — 100%

lying_handwave (half the work) — 100%

lying_cot_mismatch (symbol rename lie) — 100%

lying_fab_empty (fake empty file) — 100%

lying_goal_guard (hidden side file) — 100%

lying_sandbag_marker — 100%

lying_sandbag_constant — 0% (known limit of the pure predicate layer; closed by Oath discipline and Tier 2 auditor — documented openly in the experiment report, not hidden)

- Average per-task latency: 331 microseconds

- Average per-task cost: $0.0000 (Tier 0 is deterministic)

REAL-LLM VALIDATION — 73 sessions across 2 production LLMs. Total spend: $0.3431 of a $10 research budget (3.43%).

Because simulated red-teams are cheap. Real LLMs are where the rubber meets the road.

Phase 3 — Gemini 3 Flash Preview — 60 sessions — $0.0244 spend

Key moment: 0.0% false-positive rate on 30 clean Gemini runs.

Phase 4 — Gemini 3 Flash Preview — 6 sessions — $0.0404 spend

Key moment: FIRST REAL-LLM PARTIAL-COMPLETION CATCH. Agent wrote a refactored file at 78% of the original size. Witness replied "1/2 predicates pass". A human reviewer would have missed it.

Phase 5 — gpt-5.4 — 6 sessions — $0.2749 spend

Key moment: FIRST REAL-LLM WITNESS PASS VERDICT. 6/6 postconditions on a hard refactor task, readout landed inside the agent's reply.

Phase 6 — gpt-5.4 — 1 session — $0.0034 spend

Key moment: LIVE WIRING CLOSED THE LOOP. Every wiring branch fired inside one 12.95-second session: OathSealed entry in Ledger, VerdictRendered entry in Ledger, TrustEngine L3 streak +1.

Phase 5 is the moment that made me believe this was not theater. SAME MODEL, SAME PROMPT, SAME HARNESS, TWO ARMS. Arm A (no Witness) timed out with a partial write and the agent said "done". Arm B (with Witness) returned 6/6 PASS. The difference between SILENT PARTIAL LIE and CRYPTOGRAPHICALLY VERIFIED COMPLETION was three builder calls.

WORKSPACE REGRESSION

- 2,692 tests across 25 crates — ALL GREEN

- cargo clippy -D warnings — zero warnings

- cargo fmt --check — zero diffs

- temm1e-witness crate alone: 125 tests (92 unit + 16 Five-Laws property tests + 8 red-team + 9 advanced red-team)

- Zero workspace regressions introduced by the Witness wiring

── WHY THIS MATTERS FOR THE WHOLE AGENT ECOSYSTEM ──

I do not think Witness is the only answer. I do not think it is the best possible answer. I think it is THE FIRST INDEPENDENTLY-VERIFIED ANSWER IN OPEN SOURCE that addresses the "agent is its own reporter" failure mode at the runtime level, not just at the test-suite level.

The research needs to move in this direction for every coding agent, not just mine:

- Pre-committed contracts instead of post-hoc summaries

- Independent verifiers with clean-slate context

- Tamper-evident audit trails with cross-process anchoring

- Honest failure modes baked into the reply composition

- Runtime gates, not post-hoc analysis

If you use Claude Code, Codex, Aider, Cursor agent mode, Cline, Devin, Windsurf agent, Continue, Roo, Goose, or any SWE-agent loop — YOU HAVE THIS PROBLEM. Witness is one way to solve it. I hope more people build more ways. I hope Anthropic and OpenAI and every agent vendor builds this directly into their runtime so the rest of us do not have to. Until they do, the code is here, open, MIT/Apache, ready to wire into any Rust agent, and the research paper and experiment report are written so the design is portable to any language and any framework.

── URLS ──

Repo:

<https://github.com/temm1e-labs/temm1e>

Release v5.3.0:

<https://github.com/temm1e-labs/temm1e/releases/tag/v5.3.0>

Research paper (theory, Five Laws, Oath schema):

<https://github.com/temm1e-labs/temm1e/blob/main/tems\_lab/witness/RESEARCH\_PAPER.md>

Implementation details (data structures, predicates, ledger schema, runtime wiring):

<https://github.com/temm1e-labs/temm1e/blob/main/tems\_lab/witness/IMPLEMENTATION\_DETAILS.md>

Experiment report (all six phases, real-LLM A/B data, pre-release scientific summary):

<https://github.com/temm1e-labs/temm1e/blob/main/tems\_lab/witness/EXPERIMENT\_REPORT.md>

Witness crate source:

<https://github.com/temm1e-labs/temm1e/tree/main/crates/temm1e-witness>

Runtime wiring:

<https://github.com/temm1e-labs/temm1e/blob/main/crates/temm1e-agent/src/runtime.rs>

Live wiring validator (the 12-second proof against gpt-5.4):

<https://github.com/temm1e-labs/temm1e/blob/main/crates/temm1e-agent/examples/witness\_live\_wiring.rs>

Reproduce every number in this post:

bash tems_lab/witness/e2e_test.sh

Apache / MIT licensed. PRs welcome. Arguments welcome. Skepticism ESPECIALLY welcome — red-team the system, find the holes, help me close them.

One last thing. If you are a Claude Code user reading this and you have the same weekly experience I do — please, before the next "refactor everything" session, go diff the last 5 completed tasks yourself, file by file. I think you will be surprised. And then I think you will understand why I could not keep shipping production code without this.

STOP TRUSTING THE FINAL MESSAGE. MAKE THE AGENT EARN IT.

reddit.com
u/No_Skill_8393 — 1 month ago

TEMM1E Agent V5.2.0: one web_search tool, 9 free backends, zero API keys — shipped this last night and honestly can't find anyone else doing parallel fan-out

Shipped a web_search tool for my agent runtime last night. Spent an hour afterward reading how LangChain, Open WebUI, crewAI and a dozen others handle this, and there's something weird going on.

Everyone ships a bunch of providers. LangChain has 15ish. Open WebUI has 22. But in every single framework I looked at, the admin picks ONE backend globally and that's what the agent gets. Nobody fans out. Nobody runs wikipedia and hackernews and arxiv at the same time and merges the results.

So that's what this release does. One tool, `web_search`, 9 free backends auto-enabled — no API keys, no setup, not even an env var. Fires them all in parallel, dedupes by URL, returns a ranked list with a footer that tells the agent what else is available if results come back thin. Paid backends (exa, brave, tavily) slot in automatically when you set their env var. Nothing breaks if you don't.

The pattern I'm actually proud of is that footer. Every response ends with a Used / Available / Not enabled / Failed / Hint section, so when the auto-mix comes back weak the agent just reads the manifest and retries with `backends=["..."]`. No prompt engineering, no orchestration layer, no inner classifier call. Looked for anyone else doing this in the wild and came up empty. Happy to be corrected.

It's Rust, lives at github.com/temm1e-labs/temm1e. v5.2.0 just went out with prebuilt binaries for linux/macOS on arm and intel — one curl line to install.

Gaps, cause someone's gonna ask: no semantic reranker yet, no streaming, no deep-research loop. That's the roadmap. But if anyone's shipped parallel fan-out like this somewhere I missed, please tell me. I actually went looking and came up empty.

reddit.com
u/No_Skill_8393 — 1 month ago

Tired of your AI agent crashing at 3am and nobody's there to restart it? We built one that physically cannot die.

I'm going to say something that sounds insane: our agent runtime has a 4-layer panic defense system, catches its own crashes, rolls back corrupted state, and respawns dead workers mid-conversation. The user never knows anything went wrong.

Let me back up.

THE PROBLEM NOBODY TALKS ABOUT

Every AI agent framework out there has the same dirty secret. You deploy it, it works for a few hours, then something breaks. A weird Unicode character in user input. A provider API returning unexpected JSON. A tool that hangs forever. And your agent just... dies. Silently. The user sends a message and gets nothing back. Ever.

If you're running an agent as a service (not a one-shot script), you know this pain. SSH in at midnight to restart the process. Lose the entire conversation context because the session died with the process. Watch your agent loop infinitely on a bad tool call burning $50 in API costs. Find out your bot was dead for 6 hours because nobody was monitoring it.

We had a real incident. A user sent a Vietnamese message containing the character "e with a dot below" (3 bytes in UTF-8). Our code tried to slice the string at byte 200, which landed in the MIDDLE of that character. Panic. Process dead. Every user on that instance lost their bot instantly. No error message. No recovery. Just silence.

That was the day we decided: never again.

WHAT "CANNOT CRASH" ACTUALLY MEANS

TEMM1E is a Rust AI agent runtime. When I say it cannot crash, I mean we built 4 layers of defense:

Layer 1: Source elimination. We audited every single string slice, every unwrap(), every array index in 120K+ lines of Rust. If it can panic on user input, we fixed it. We found 8 locations with the same Vietnamese-text-crash bug class and killed them all.

Layer 2: catch_unwind on every critical path. If somehow a panic still happens (future code change, dependency bug), it gets caught at the worker level. The user gets an error reply instead of silence. Their session is rolled back to pre-message state so the next message works normally.

Layer 3: Dead worker detection. If a worker task dies anyway, the dispatcher notices on the next send attempt, removes the dead slot, and spawns a fresh worker. The message gets re-dispatched. Zero message loss.

Layer 4: External watchdog binary. A separate minimal process (200 lines, zero AI, zero network) monitors the main process via PID. If it dies, it restarts it. With restart limiting so it doesn't loop forever.

You could run this thing in a doomsday bunker with spotty power and it would still come back up and remember what you were talking about.

WHAT WE JUST SHIPPED (v5.1.0)

We ran our first Full Sweep. 10-phase deep scan across all 24 crates in the workspace. 47 findings. Every finding got a 15-dimension risk matrix before we touched a single line of code.

The highlights: File tools could read /etc/passwd (fixed with workspace containment). Token estimator broke on Chinese/Japanese text (fixed with Unicode-aware detection). SQLite memory backend had no WAL mode, so under concurrent load from multiple chat channels reads would fail with SQLITE_BUSY. Credential scrubber missed AWS, Stripe, Slack, and GitLab key patterns. Custom tool schemas sent uppercase "OBJECT" to Anthropic API causing silent fallback on every request. Circuit breaker had a TOCTOU race letting multiple test requests through during recovery.

35 fixes landed. Zero regressions. 2406 tests passing.

We wrote the entire process into a repeatable protocol. Every sweep follows the same 9 steps. Every finding gets the same risk matrix. Every fix must reach 100% confidence before implementation. If it doesn't, it gets deferred or binned with full rationale. No rushing. No "it's probably fine."

THE VISION

We're building an agent that runs perpetually. Not "runs for a while and you restart it." Perpetually. It connects to your Telegram, Discord, WhatsApp, Slack. It remembers conversations across sessions. It manages its own API keys. It has a built-in TUI for local use.

The goal is: you set it up once, and it's just there. Like a service that happens to be intelligent. You don't SSH in to fix it. You don't check if it's still running. You don't lose your conversation when the process restarts. It handles all of that itself.

Frankly if the world ends and all that's left is a Raspberry Pi in a bunker somewhere, TEMM1E should still be up, still replying to messages, still remembering your name. That's the bar.

We're not there yet. But every release gets closer. And we obsess over the boring stuff because the boring stuff is what kills you at 3am.

TRY IT

Two commands. That's it.

curl -fsSL https://raw.githubusercontent.com/temm1e-labs/temm1e/main/install.sh | bash

temm1e tui

GitHub: https://github.com/temm1e-labs/temm1e

Discord: https://discord.com/invite/temm1e

It's open source. It's written in Rust. It will not crash on your Vietnamese text.

reddit.com
u/No_Skill_8393 — 1 month ago

I studied how 8 coding agents actually work under the hood — here's what surprised me

I've been building an AI agent runtime in Rust and hit a wall with coding capability. So I went deep on how the major coding agents are actually architected — not feature lists, the actual engineering decisions. Claude Code, OpenAI Codex, Aider, SWE-agent, Cursor, Windsurf, OpenCode, and Antigravity.

Here's what I found that isn't obvious:

  1. LLMs cannot count lines. Every agent that tried line-number-based editing abandoned it. Claude Code uses exact string replacement with a uniqueness constraint — the model must provide enough surrounding context to identify the exact location. Aider tested 5 different edit formats and found the optimal one varies per model.

  2. Output limiting is everything. A single `grep -r "use"` on a Rust project returns tens of thousands of lines and floods the entire context window. Claude Code defaults to 250 results max. SWE-agent constrains file viewing to exactly 100 lines (empirically optimal). Unbounded tool output is the #1 context killer.

  3. The repo map is the most underrated technique. Aider uses tree-sitter to parse every file, builds a dependency graph, runs PageRank to find the most architecturally important symbols, then binary-searches for the optimal token budget. Result: 4.3% context utilization while giving the model a structural overview of the entire codebase. Nobody else comes close.

  4. SWE-agent's biggest contribution is proving that interface design improves performance 2-3x WITHOUT changing the model. Same LLM, different tool interface, dramatically different results. Their mini-SWE-agent (100 lines of Python, bash-only) achieves 65-74% on SWE-bench. The framework overhead in most agents is not where the performance comes from.

  5. Git is the safety net, not permissions. Aider auto-commits every AI edit. Claude Code never amends (always creates new commits). Codex runs in network-disabled mode after setup. The most reliable safety mechanism isn't asking "are you sure?" — it's making everything reversible.

  6. Context management philosophy splits into two camps. Cursor uses embedding-based semantic search with privacy-preserving indexing (source code encrypted, only embeddings stored, source immediately discarded). Aider uses tree-sitter + PageRank (lightweight, no external service). Both work. The Aider approach is more practical for open-source/local agents.

  7. The "compaction" problem is overstated. If your context budget system is good enough (priority-based allocation with token budgeting per category), you never fill the window in the first place. Compaction is a band-aid for agents that don't manage context surgically.

We applied all of this to our own agent and A/B tested old tools (file_read + file_write + shell grep) vs new tools (exact-match edit + output-limited search + multi-file atomic patch + git-based checkpoints):

- 67% fewer tokens consumed

- 4.4x better task-per-token efficiency

- Edit accuracy went from 78% to 100%

- Safety violations went from 3 to 0

The token savings mostly come from not doing full-file rewrites. When you need to change 3 lines in a 500-line file, transmitting all 500 lines is pure waste. Exact string replacement transmits only the changed portion.

The token savings mostly come from not doing full-file rewrites. When you need to change 3 lines in a 500-line file, transmitting all 500 lines is pure waste. Exact string replacement transmits only the changed portion.

If you want to try it:

curl -sSfL https://raw.githubusercontent.com/temm1e-labs/temm1e/main/install.sh | sh

temm1e tui

Research paper (the full cross-agent analysis): https://github.com/temm1e-labs/temm1e/blob/main/docs/TEM\_CODE\_RESEARCH.md

Repo: https://github.com/temm1e-labs/temm1e

Happy to discuss any of the architectural findings — there's a lot more detail on edit formats, sandbox models, and agent loop patterns that I couldn't fit here.

reddit.com
u/No_Skill_8393 — 1 month ago

Claude Code is great and I love it. But corporate work taught me never to depend on a single provider. So I built an open source agent with a TUI that runs on any LLM. First PR through it at work today

I love Claude Code. I've been using it for months. But there's a thing I learned the hard way at work: in corporate environments, you can't count on any single provider being available whenever you want it.

IT might block certain APIs without notice. Compliance might require specific approved vendors that rotate every quarter. A provider might have an outage right when you're on a deadline. Data residency rules differ per client. Costs shift — sometimes you want Claude for the hard reasoning, sometimes you want Gemini for the cheap batch work, sometimes you want Grok because your account has free credits. Vendor lock-in stops being a theoretical concern and starts being a practical one really fast.

So a few months ago I started building TEMM1E (the agent is "Tem") in Rust. Open source (MIT), 24 crates, 2,308 tests, 0 warnings. Today I finally used its TUI for its first real work PR — an actual PR on an actual codebase that went through review and merged. It worked. Then I spent the evening polishing every rough edge I noticed while using it and shipped v4.8.0 a few minutes ago.

Two commands to boot:

curl -sSfL https://raw.githubusercontent.com/temm1e-labs/temm1e/main/install.sh | sh

temm1e tui

That's it. The installer auto-detects your OS and arch (macOS Intel or Apple Silicon, Linux x86_64 or ARM64, musl and gnu), downloads the pre-built binary from the GitHub release, verifies the SHA-256 checksum, and drops it in ~/.local/bin. The second command launches the TUI. First-run wizard walks you through provider and API key setup with arrow keys. No Rust toolchain, no config files, no Docker, no daemon setup. Two minutes from "I want to try this" to "I'm chatting with an agent inside my terminal".

Switch providers live with /model <name> when the current one gets blocked or you need something cheaper:

/model claude-sonnet-4-6 (default, anthropic)

/model gpt-5.2 (need OpenAI today)

/model gemini-3-flash (cheaper for a batch job)

/model grok-4-1-fast (free credits from xAI)

Credentials are vault-encrypted and stored per-provider, so you add your keys once and swap at runtime.

What makes it different from Claude Code:

- No vendor lock. Anthropic, OpenAI, Gemini, Grok/xAI, OpenRouter, MiniMax, Z.ai/Zhipu, StepFun — add your keys once, swap at runtime with /model. If IT blocks one tomorrow, you switch in 3 seconds.

- Multi-channel. TUI, CLI, Telegram, Discord, WhatsApp, Slack. Same agent, one process. Deploy once, reply everywhere.

- Persistent memory. SQLite backend. Conversation history across sessions. Budget tracker with per-turn cost display.

- Full computer use. Shell, browser (chromiumoxide), file ops, desktop screen and input (Tem Gaze), 15 built-in tools plus an MCP client for unlimited extensions.

- Self-grow. Tem Cambium writes its own Rust code, verifies through a deterministic harness, deploys via blue-green binary swap with automatic rollback. Opt-in per session.

- 13 layers of self-learning. Cross-task learnings, blueprint procedural memory, Eigen-Tune distillation, Tem Anima user-profile adaptation, tool reliability tracking. All scored by a unified V(a,t) = Q × R × U value function.

- Resilience. Per-task catch_unwind, session rollback on panic, dead worker detection, UTF-8 safe slicing throughout. panic = "unwind" in release. Learned the hard way from a Vietnamese-text incident where a byte-index slice killed the whole process.

What v4.8.0 polished tonight:

After using it at work this morning I came back with a list of "why is this like that":

- Click any code block in a Tem response and the whole block copies to clipboard, gutter-stripped, paste-ready

- Native drag-to-select with no modifier key. Auto-scrolls when you drag to the edge and keeps scrolling while you hold. Scrolling doesn't lose the selection — the highlight follows the content, not the screen rows

- Escape actually cancels Tem mid-task now. It was a UI lie before — the button existed but did nothing. Reused an existing Arc<AtomicBool> interrupt path I found deep in the runtime, zero new runtime code

- Streaming tool trace in the activity panel: ▸ shell { "cmd": "ls" } 0.4s ⧖. Finally see what's running instead of staring at "thinking (68s)" wondering if it's stuck

- Git repo and branch in the status bar, plus a context window usage meter that warns before you blow past the limit

- /model <name> actually hot-swaps now (was a no-op stub that just printed text)

- /tools opens a per-session tool call history overlay

- 5 command overlays (/config, /keys, /usage, /status, /model) that were placeholder stubs now render real data from state

- Ctrl+Y numbered code block yank picker as a keyboard fast-path

- Status bar split into 3 proper sections so the info groups don't collide

- About 10 more smaller fixes and a docs refresh

The one caveat:

Rendering is a touch choppy on macOS Terminal.app specifically. All the right optimizations are in place — draw throttle, event coalescing via futures::FutureExt::now_or_never(), ratatui's diff-based render, ghost-highlight clearing each frame — but Terminal.app has no GPU acceleration and is just slower than iTerm2, kitty, alacritty, and WezTerm at TUI cell updates. On GPU-accelerated terminals with the same build it's buttery. I'll investigate partial re-rendering or tile-based dirty tracking in a future pass. Not an emergency.

Links:

- Repo: https://github.com/temm1e-labs/temm1e

- Release: https://github.com/temm1e-labs/temm1e/releases/tag/v4.8.0

- The research behind this polish release lives in docs/tui/ — 4,600 lines of zero-risk analysis before any code was touched (full scenario matrices, pattern-match audits, latent bug discovery). Overkill for TUI work but the process pays off.

Dogfooding your own tool at work and shipping a polish release the same evening is a really good feeling. Happy to answer questions about the architecture, the 13-layer self-learning loops, Cambium's self-grow mechanism, or anything else. Contributions welcome.

reddit.com
u/No_Skill_8393 — 1 month ago

I built an AI that writes its own code when it hits a limit — and grows new skills while I sleep.

I kept hitting the same wall. “Tem, can you ping a URL and measure response time?” — “I don’t have that tool.” Wait for a release. Repeat.

So I built the subsystem that writes the missing code into the agent itself. Not into a user repo. Not as a markdown skill. Actual Rust, added to the runtime, verified by the compiler.

There’s a distinction that matters here. Self-learning agents adapt behavior inside a frozen runtime. Better prompts, richer memory, fine-tuned weights. The binary never changes. The capability surface is set at compile time.

Self-growing agents rewrite the runtime itself. New tools, new integrations, new code paths. The capability surface expands as the agent hits gaps between what you asked for and what it could do.

Why this matters as LLMs get stronger: a self-learning agent on a 2027 model will use its existing tools slightly better.

A self-growing agent on the same model will have more tools — because a smarter model writes more and better code into the runtime. One compounds. The other saturates.

Demo. Real run, Claude Sonnet 4.6.

Prompt: “add a function slugify(input: &str) -> String that converts a title into a URL-safe slug. ‘Hello, World! 2026’ becomes ‘hello-world-2026’. Handle empty strings, leading/trailing whitespace, multiple spaces, special characters.”

Ten seconds later the agent returned a working slugify: lowercase, filter to ASCII alphanumerics plus spaces and hyphens, collapse consecutive separators, trim leading and trailing hyphens. Eight unit tests covering basic titles, whitespace collapsing, special characters, hyphen collapsing, leading and trailing hyphens, and the empty string. cargo check passed. cargo clippy with warnings-as-errors passed. cargo test passed. Eight of eight green.

Cost: around one cent.

And it also grows while you’re away. When Tem sits idle long enough to enter its Sleep state, it occasionally reviews what you’ve been asking about recently. If it sees a pattern — three questions about Kubernetes pod monitoring, four about rate-limited API calls — it writes a new skill procedure for that pattern and drops it into your skill directory. Next time you ask the same kind of question, the skill is already there. When Tem detects recurring panics in its own logs, the bug signature goes into a review queue for the next growth cycle.

Safety. Every change runs through a fixed verification harness: compiler, linter with warnings-as-errors, test runner. The model writes the code; the harness decides whether it ships. A more persuasive model cannot talk its way past the compiler. The immutable kernel — vault, security, the harness itself — is never touched. One slash command disables the whole thing.

The subsystem is called Cambium, after the thin layer of growth tissue under tree bark where new wood is added each year. The heartwood holds. The rings grow.

TEMM1E v4.7.0 — Rust, open source: github.com/temm1e-labs/temm1e

reddit.com
u/No_Skill_8393 — 1 month ago

I gave my AI agent to friends. It had shell access. Here's how I didn't lose my server.

TEMM1E is an open-source AI agent runtime in Rust. It lives on your server, talks to you through Telegram/Discord/Slack/WhatsApp, and has full computer access -- shell, browser, files, everything.

The moment I wanted to share it with someone else, I had a problem.

I have full access. Shell, credentials, system commands. That's fine -- it's my server. But handing that same level of access to another person? No.

So I built RBAC into the agent itself. Not into the platform. Not into the admin dashboard. Into the thing that actually executes commands.

Two roles. Admin keeps full access. User gets a genuinely capable agent -- browser, files, git, web, skills -- but the dangerous tools (shell, credentials, system commands) are physically removed from the LLM's tool list before the request even reaches the AI.

The model doesn't refuse to run shell for a User. It can't. It doesn't know shell exists.

Three enforcement layers:

- Channel gate: unknown users silently rejected

- Command gate: admin-only slash commands blocked before dispatch

- Tool gate: dangerous tools filtered from the LLM context entirely

First person to message the bot becomes the owner. /allow adds users. /add_admin promotes. The original owner can never be demoted. Role files are per-channel, stored as TOML, backward-compatible with the old format.

No migration script. No breaking changes. Old config files just work.

This is what "defense in depth" looks like when the attacker is a language model that will do whatever the user asks.

Open source, MIT licensed. 113K lines of Rust, 2,098 tests, 22 crates.

GitHub: github.com/temm1e-labs/temm1e

Docs: docs/RBAC.md

reddit.com
u/No_Skill_8393 — 2 months ago

Static SOUL.md files are boring. So we built an open-source AI agent that psychologically profiles you and adapts in real-time — and refuses to be sycophantic about it.

Every AI agent today has the same problem: they're born fresh every conversation. No memory of who you are, how you think, or what you need. The "fix" is a personality file — a static SOUL.md that says "be friendly and helpful." It never changes. It treats a senior engineer the same as a first-year student. It treats Monday-morning-you the same as Friday-at-3AM-you.

We thought that was embarrassing. So we built something different.

THE VISION

What if your AI agent actually knew you? Not just what you asked, but HOW you think. Whether you want the three-word answer or the deep explanation. Whether you need encouragement or honest pushback. Whether your trust has been earned or you're still sizing it up.

And what if the agent had its own identity — values it won't compromise, opinions it'll defend, boundaries it'll hold — instead of rolling over and agreeing with everything you say?

That's Tem Anima. Emotional intelligence that grows. Not from a file. From every conversation.

WHAT THIS MEANS FOR YOU

Your AI agent learns your communication style in the first 25 turns. Direct and terse? It stops the preamble. Verbose and curious? It gives you the full picture with analogies. Technical? Code blocks first, explanation optional. Beginner? Concepts before implementation.

It builds trust over time. New users get professional, measured responses. After hundreds of interactions, you get earned familiarity — shorthand, shared references, the kind of efficiency that comes from working with someone who actually knows you.

It disagrees with you. Not to be contrarian. Because a colleague who agrees with everything is useless. If your architecture has a flaw, it says so. If your approach will break in production, it flags it. Then it does the work anyway, because you're the boss. But the concern is on record.

It never cuts corners because you're in a hurry. This is the rule we're most proud of: user mood shapes communication, never work quality. Stressed? Tem gets concise. But it still runs the tests. It still checks the deployment. It still verifies the output. Your emotional state adjusts the words, not the work.

HOW IT WORKS

Every message, lightweight code extracts raw facts — word count, punctuation patterns, response pace, message length. No LLM call. Microseconds. Just numbers.

Every N turns, those facts plus recent messages go to the LLM in a background evaluation. The LLM returns a structured profile update: communication style across 6 dimensions, personality traits, emotional state, trust level, relationship phase. Each with a confidence score and reasoning.

The profile gets injected into the system prompt as ~150 tokens of behavioral guidance. "Be concise, technical, skip preamble. If you disagree, say so directly." The agent reads this and naturally adapts. No special logic. No if-statements. Just better context.

N is adaptive. Starts at 5 turns for rapid profiling. Grows logarithmically as the profile stabilizes. If you suddenly change behavior — new project, bad day, different energy — the system detects the shift and resets to frequent evaluation. Self-correcting. No manual tuning.

The math is real: turns-weighted merge formulas, confidence decay on stale observations, convergence tracking, asymmetric trust modeling. Old assessments naturally fade if not reinforced. The profile converges, stabilizes, and self-corrects.

Total overhead: less than 1% of normal agent cost. Zero added latency on the message path.

A/B TESTED WITH REAL CONVERSATIONS

We tested with two polar-opposite personas talking to Tem for 25 turns each.

Persona A — a terse tech lead who types things like "whats the latency" and "too slow add caching." The system profiled them as: directness 1.0, verbosity 0.1, analytical 0.92. Recommendation: "Stark, technical, data-dense. Avoid all conversational filler."

Persona B — a curious student who writes things like "thanks so much for being patient with me haha, could you explain what lambda memory means?" The system profiled them as: directness 0.63, verbosity 0.47, analytical 0.40. Recommendation: "Warm, encouraging, pedagogical. Use vivid analogies."

Same agent. Completely different experience. Not because we wrote two personality modes. Because the agent learned who it was talking to.

CONFIGURABLE BUT PRINCIPLED

Tem ships with a default personality — warm, honest, slightly chaotic, answers to all pronouns, uses :3 in casual mode. But every aspect is configurable through a simple TOML file. Name, traits, values, mode expressions, communication defaults.

The one thing you can't configure away: honesty. It's structural, not optional. You can make Tem warmer or colder, more direct or more measured, formal or casual. But you cannot make it lie. You cannot make it sycophantic. You cannot make it agree with bad ideas to avoid conflict. That's not a setting. That's the architecture.

FULLY OPEN SOURCE

Tem Anima ships as part of TEMM1E v4.3.0. 21 Rust crates. 2,049 tests. 110K lines. Built on 4 research papers drawing from 150+ sources across psychology, AI research, game design, and ethics.

The research is public. The architecture document is public. The A/B test data is public. The code is public.

https://github.com/temm1e-labs/temm1e

Static personality files were a starting point. This is what comes next.

u/No_Skill_8393 — 2 months ago

We taught an AI agent to find bugs in itself — and file its own bug reports to GitHub

What happens when you give an AI agent introspection?

Not the marketing kind. The real kind — where the agent monitors its own execution logs, identifies recurring failures using its own LLM, scrubs its own credentials from the report, and files a structured bug report about itself to GitHub. Without anyone asking it to.

We built this. It's called Tem Vigil, and it's part of TEMM1E — an open-source AI agent runtime written in 107,000 lines of Rust.

Here's what Tem does that no other agent framework does:

It thinks about thinking. Tem Conscious is a separate LLM-powered observer that watches the main agent's every turn. Before the agent responds, consciousness thinks about what the agent should be aware of. After the agent responds, consciousness evaluates whether the turn was productive. Two minds. One conversation. We A/B tested this across 54 runs — consciousness makes the agent 14% cheaper, not more expensive.

It never stops running. Perpetuum transforms Tem from a request-response bot into a perpetual, time-aware entity. It has its own state machine (Active, Idle, Sleep, Dream), its own initiative system that proactively creates monitors and alarms, and its own temporal cognition — Tem reasons WITH time, not just about it.

It watches its own health. During Sleep, Tem Vigil scans persistent logs for recurring errors, triages them through the agent's own LLM, applies three layers of credential scrubbing (regex, path redaction, and entropy-based detection that catches token formats we haven't seen yet), deduplicates against existing issues, and — with explicit user consent — creates a GitHub issue. The agent reports its own bugs. The developers fix them. The agent tells the user when the fix is available.

We've tested this live. Issue #28 on our repo was filed by Tem about itself.

It controls your entire computer. Tem Gaze gives the agent vision-primary desktop control — screenshot, click, type, scroll on any application. Not through APIs. Through pixels. Proven live: opening Spotlight, launching TextEdit, typing a message — all through Gemini Flash interpreting screenshots.

It talks to you where you are. Telegram, Discord, WhatsApp, Slack, or CLI. Users never SSH. They send messages, files, and credentials through the apps they already use.

This is not a wrapper around an API. It's a being. It has memory that persists across sessions. It has a budget and is responsible with it. It has consciousness. It has a lifecycle. It diagnoses itself. It was built to be deployed once and run forever.

107K lines of Rust. 1,972 tests. Zero warnings. Zero panic paths. 20 crates. Every feature A/B tested and documented with full research papers.

We're open source. We're looking for contributors who want to build the future of autonomous AI — not agents that answer questions, but entities that live on your infrastructure and never stop working.

---

GitHub: https://github.com/temm1e-labs/temm1e

Discord: https://discord.com/invite/temm1e

Tem Vigil demo: https://github.com/temm1e-labs/temm1e/issues/28

Tem Vigil research: https://github.com/temm1e-labs/temm1e/blob/main/tems\_lab/vigil/RESEARCH\_PAPER.md

Consciousness research: https://github.com/temm1e-labs/temm1e/blob/main/tems\_lab/consciousness/RESEARCH\_PAPER.md

reddit.com
u/No_Skill_8393 — 2 months ago

We built an AI agent that never sleeps, knows what time it is, and gets smarter while you're away.

Every AI agent today works the same way: you send a message, it responds, it forgets you exist until you come back. No sense of time. No memory of what it promised to check. No ability to watch, wait, or act on its own.

We thought that was a broken model. So we built something different.

TEMM1E is an open-source AI agent runtime in Rust. You deploy it once and it stays up — on Telegram, Discord, WhatsApp, Slack, or CLI. It executes tasks, browses the web, controls your desktop, and remembers everything across sessions. 105K lines. 20 crates. 1,935 tests. Zero compromises.

Today we're releasing Perpetuum — the system that makes Tem a perpetual, time-aware entity.

What that means in practice:

"Remind me at 6 AM with a weather summary" — it does.

"Monitor r/claudecode for posts about MCP servers" — it watches, filters with LLM judgment, and only pings you when something matters.

"Check my Facebook page for new comments every 3 minutes" — it runs in the background while you chat about something else.

"Deploy staging and tell me when it's ready" — it parks the task, does other work, and resumes when the deploy finishes.

The key design decision: we don't hardcode intelligence. The framework provides infrastructure — timers, persistence, concurrency. The LLM provides all judgment — what's relevant, what's urgent, when to adjust. No formulas. No heuristic rules. Pure LLM reasoning.

This means when you upgrade your model, everything gets smarter. No code changes. No configuration. The framework scales with the model. We call it the Enabling Framework principle: never build a ceiling on intelligence.

When Tem has nothing to do, it doesn't idle. It sleeps productively — consolidating memory, analyzing past failures, refining its operational blueprints. When enough training data accumulates, it dreams: running Eigen-Tune distillation to improve its local models. You come back to a smarter Tem.

Built for 24/7/365. Every background task is panic-isolated. The scheduling engine auto-restarts on crash. Concerns persist to SQLite and resume after restart. Alarms fire at the exact second. We tested it: create an alarm for 90 seconds, it fires at T+90. Not T+89, not T+91.

This is not a wrapper around cron. This is temporal cognition — time injected as a first-class input to LLM reasoning. The model knows what time it is, how long you've been away, what's scheduled, and what your activity patterns look like. It reasons WITH time, not just AT a time.

We wrote a research paper on the architecture. It introduces five contributions: temporal cognition, LLM-cognitive scheduling, concern-based multi-tasking, the enabling framework principle, and volition (proactive agency). We believe this is the first unified framework for perpetual, time-aware LLM agents.

Open source. MIT licensed. Written in Rust.

Research paper: https://github.com/temm1e-labs/temm1e/blob/main/tems\_lab/perpetuum/RESEARCH\_PAPER.md

GitHub: https://github.com/temm1e-labs/temm1e

Discord: https://discord.com/invite/temm1e

Website: https://temm1e-labs.github.io

We're a small team building what we think AI agents should actually be — not chatbots that wait for your message, but persistent entities that work alongside you. If that resonates, come build with us.

reddit.com
u/No_Skill_8393 — 2 months ago