r/AISystemsEngineering

[Android] [$9.99 -> $0.99] Pocket Interpreter: Offline AI Interpreter for Real Conversations
▲ 25 r/AISystemsEngineering+21 crossposts

Hey Reddit,

I just launched Pocket Interpreter, an AI-powered offline interpreter designed to help people have real conversations across language barriers.

Unlike traditional translation apps where you manually translate sentences back and forth, Pocket Interpreter is built around live multilingual conversations.

What makes it different?

Imagine:

You speak Spanish,

The other person speaks English.

Both of you see and hear translations in your own language.

No internet required.

Pocket Interpreter acts like a personal interpreter in your pocket.

Features

✅ Real-time conversation interpretation

✅ Offline voice-to-voice communication

✅ Direct phone-to-phone conversation mode (BLE)

✅ Offline text translation

✅ OCR translation for signs, menus, and documents

✅ On-device AI processing

✅ Privacy-first (no cloud processing)

✅ Works in Airplane Mode

Built for

International travelers

Tourists

Business meetings

Taxi and delivery drivers

Hotels and hospitality staff

International students

Anyone communicating across languages

Launch Offer 🎉

To celebrate the launch:

Lifetime Access → $0.99

App link: https://play.google.com/store/apps/details?id=io.cyberfly.privatescan

One-time purchase. No subscription.

I'd love feedback from travelers, language learners, and anyone who has faced language barriers while traveling.

u/abuvanth — 20 hours ago
▲ 17 r/AISystemsEngineering+1 crossposts

If you're building long-running AI agents, do you actually care about memory observability? Like auditing what the agent "knew" and when?

Been thinking about a problem that doesn't get talked about much: agent memory is a black box.

You store something, you retrieve something — but you can't answer basic questions like: when exactly did the agent "know" this? Was this memory ever modified? What did it know at step 47 of a 300-step run? If something goes wrong during a long autonomous run, how do you even debug it?

The concept I've been thinking about is deterministic memory observability — giving agent memory the same guarantees we expect from databases and version control:

  • Hash-chained writes — cryptographically verifiable audit trail of every memory operation
  • Git-like rollback — tombstone any write, chain stays intact, reconstruct what the agent knew at any point
  • Confidence decay — memories fade automatically over time so stale knowledge stops polluting recall
  • Conflict detection — catch contradictions in memory before the agent acts on bad info
  • GDPR-style forget — proper hard deletes for compliance without breaking the chain
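For the confidence-decay bullet, one simple reading is exponential decay with a half-life. This is only a sketch: the function names, the half-life parameter, and the 0.2 recall threshold are assumptions for illustration, not part of any existing SDK.

```python
def decayed_confidence(initial: float, age_seconds: float, half_life_seconds: float) -> float:
    """Exponential decay: confidence halves every `half_life_seconds`."""
    if half_life_seconds <= 0:
        raise ValueError("half_life_seconds must be positive")
    return initial * 0.5 ** (age_seconds / half_life_seconds)


def is_recallable(initial: float, age_seconds: float, half_life_seconds: float,
                  threshold: float = 0.2) -> bool:
    """Hypothetical recall filter: drop memories whose confidence fell below the threshold."""
    return decayed_confidence(initial, age_seconds, half_life_seconds) >= threshold
```

Because the decay is a pure function of the write time, it never has to mutate the stored record, which keeps it compatible with an append-only audit trail.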

The mental model: persistent storage as the source of truth with full audit integrity, semantic/vector search as a sidecar. You never sacrifice the audit trail to get fast retrieval — they're separate concerns.
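A minimal sketch of the hash-chained log with tombstones, assuming an in-memory list stands in for persistent storage and values are JSON-serializable. `MemoryLog`, `Entry`, and `state_at` are hypothetical names. GDPR-style hard deletes and the vector-search sidecar are left out.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import Any, Dict, List, Optional

GENESIS = "0" * 64  # prev_hash of the very first entry


@dataclass(frozen=True)
class Entry:
    seq: int
    op: str                    # "write" or "tombstone"
    key: str
    value: Any                 # None for tombstones
    target_seq: Optional[int]  # for tombstones: the write being retracted
    prev_hash: str
    hash: str


def _digest(seq, op, key, value, target_seq, prev_hash) -> str:
    payload = json.dumps([seq, op, key, value, target_seq, prev_hash])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class MemoryLog:
    def __init__(self) -> None:
        self.entries: List[Entry] = []

    def _append(self, op: str, key: str, value: Any, target_seq: Optional[int]) -> Entry:
        seq = len(self.entries)
        prev_hash = self.entries[-1].hash if self.entries else GENESIS
        entry = Entry(seq, op, key, value, target_seq, prev_hash,
                      _digest(seq, op, key, value, target_seq, prev_hash))
        self.entries.append(entry)
        return entry

    def write(self, key: str, value: Any) -> Entry:
        return self._append("write", key, value, None)

    def tombstone(self, target_seq: int) -> Entry:
        # Retract an earlier write by appending; nothing is rewritten in place.
        target = self.entries[target_seq]
        return self._append("tombstone", target.key, None, target_seq)

    def verify(self) -> bool:
        """Recompute every hash; any edited or reordered entry breaks the chain."""
        prev = GENESIS
        for i, e in enumerate(self.entries):
            if e.seq != i or e.prev_hash != prev:
                return False
            if e.hash != _digest(e.seq, e.op, e.key, e.value, e.target_seq, e.prev_hash):
                return False
            prev = e.hash
        return True

    def state_at(self, step: int) -> Dict[str, Any]:
        """What the agent 'knew' after the first `step` log entries."""
        visible = self.entries[:step]
        retracted = {e.target_seq for e in visible if e.op == "tombstone"}
        state: Dict[str, Any] = {}
        for e in visible:
            if e.op == "write" and e.seq not in retracted:
                state[e.key] = e.value
        return state
```

With this shape, "what did it know at step 47" becomes `log.state_at(47)`, and a tombstone only changes the answer for steps after it was appended.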

My actual question:

If someone built an open-source Python SDK for this — something you could just pip install and drop into your existing agent stack — would you actually use it?

Or is this a problem that either doesn't exist yet for most people, or already has a solution I'm not aware of? I don't want to build something nobody needs. Genuinely asking before I commit to it.

Especially curious if you're building:

  • Agents that run for hours or days with persistent memory
  • Multi-agent systems where agents share memory banks
  • Anything in regulated industries where you need to prove what an agent knew and when

Or is the general consensus still "just use a vector DB and don't overthink it"? Would love to know how people are actually handling this in production.

u/imsuryya — 1 day ago

What I learned building low latency and high throughput AI agents

  • Know your workload. Before building the feature, estimate input tokens, output tokens, expected concurrency, and whether the user needs an instant response or can tolerate asynchronous processing.
  • Reduce tokens. Do not send full context because it is convenient. Compress, retrieve, summarize, and preserve provenance.
  • Embrace parallelism. If the work is independent, split it. File scans, scan/offset-based analysis, artifact classification, and output candidates often parallelize well. Microservices and queues add complexity, but they also let different stages scale, retry, and fail independently. Don't overoptimize.
  • Expect failures. LLM APIs fail. Providers rate-limit. Responses violate schema. Tool calls hang. Sandboxes break. Repos have bad tests. Treat every model call like a network call to a flaky dependency / data source, because that is what it is.
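A rough sketch of how the parallelism and failure points can fit together: independent prompts fan out under a concurrency cap, and each call gets a timeout, schema validation, and retries with exponential backoff and jitter. The `{"verdict": str}` schema, the `call` coroutine, and every name here are assumptions for illustration, not any provider's API.

```python
import asyncio
import json
import random
from typing import Awaitable, Callable, List

ModelCall = Callable[[str], Awaitable[str]]  # hypothetical: prompt in, raw JSON text out


class ModelCallError(Exception):
    """Raised when a call keeps failing after all retries."""


def parse_verdict(raw: str) -> dict:
    """Validate the response against the assumed schema {"verdict": str}."""
    data = json.loads(raw)  # json.JSONDecodeError is a ValueError subclass
    if not isinstance(data, dict) or not isinstance(data.get("verdict"), str):
        raise ValueError("response violates schema")
    return data


async def call_with_retry(call: ModelCall, prompt: str, *, attempts: int = 3,
                          timeout_s: float = 10.0, base_delay_s: float = 0.5) -> dict:
    last_error: Exception = ModelCallError("no attempts made")
    for attempt in range(attempts):
        try:
            # A hung call is cut off by the timeout instead of blocking the pipeline.
            raw = await asyncio.wait_for(call(prompt), timeout=timeout_s)
            return parse_verdict(raw)
        except (asyncio.TimeoutError, ConnectionError, ValueError) as exc:
            last_error = exc
            if attempt < attempts - 1:
                delay = base_delay_s * (2 ** attempt)
                await asyncio.sleep(delay + random.uniform(0, delay))  # backoff + jitter
    raise ModelCallError(f"gave up after {attempts} attempts") from last_error


async def run_all(call: ModelCall, prompts: List[str], *, max_concurrency: int = 4,
                  **retry_kwargs) -> list:
    """Run independent prompts in parallel, capped so the provider is not flooded."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def one(prompt: str) -> dict:
        async with semaphore:
            return await call_with_retry(call, prompt, **retry_kwargs)

    # return_exceptions=True: one failed item does not cancel the rest of the batch.
    return await asyncio.gather(*(one(p) for p in prompts), return_exceptions=True)
```

Results come back in the same order as `prompts`, with a `ModelCallError` in the slot of any item that never produced a valid response, so the caller can decide per item whether to queue it for later or drop it.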

u/tropical_vortex — 4 days ago
▲ 27 r/AISystemsEngineering+6 crossposts

I built a Goodhart-proof AI coding agent that runs locally on 4GB VRAM. It physically cannot see your tests.

I've been researching how AI coding agents inevitably optimize for metric-passing rather than problem-solving (Goodhart's Law). Commercial tools rely on prompt engineering and post-hoc review, but these are disciplinary, not architectural.

I built an open-source 4-layer pipeline (Planning → Execution → Verification → Optimization) where information asymmetry is enforced via strict TypedDict contracts and LangGraph state isolation:

  • The execution agent never receives acceptance criteria, unit tests, or the verification rubric.
  • Verification is blind: it evaluates git diffs without author identity or original prompt context.
  • Retry feedback is sanitized to abstract guidance only (prevents rubric memorization).
  • Neo4j graph analysis replaces context-window stuffing with precise AST dependency mapping.
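To make the TypedDict-contract idea concrete, here is a small sketch that is not taken from the repo. The field names and the `to_execution_input` helper are hypothetical, and LangGraph itself is left out. The point is that each layer's payload is built from an explicit allowlist, so the executor's input has no slot for acceptance criteria or test paths.

```python
from typing import List, TypedDict


class PlanState(TypedDict):
    task: str
    files_in_scope: List[str]
    acceptance_criteria: List[str]  # must never reach the execution layer
    test_paths: List[str]           # must never reach the execution layer


class ExecutionInput(TypedDict):
    task: str
    files_in_scope: List[str]


class VerificationInput(TypedDict):
    diff: str                       # no author identity, no original prompt
    acceptance_criteria: List[str]


def to_execution_input(plan: PlanState) -> ExecutionInput:
    # Copy named fields instead of deleting forbidden ones, so a field added
    # to PlanState later does not leak through by default.
    return ExecutionInput(task=plan["task"], files_in_scope=list(plan["files_in_scope"]))


def to_verification_input(plan: PlanState, diff: str) -> VerificationInput:
    return VerificationInput(diff=diff, acceptance_criteria=list(plan["acceptance_criteria"]))


def check_execution_payload(payload: dict) -> None:
    """TypedDict is not enforced at runtime, so check the keys explicitly."""
    extra = set(payload) - set(ExecutionInput.__annotations__)
    if extra:
        raise ValueError(f"execution payload carries forbidden keys: {sorted(extra)}")
```

A static type checker catches most contract violations at development time; the runtime key check covers payloads assembled dynamically.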

Results: 26s/feature, $0.03 cost (local 3B model execution + API reasoning), reproducible benchmarks. Open-source under MIT.

Repo: https://github.com/illyar80/developer-farm

I'm particularly interested in feedback on:

  1. Formal verification approaches to guarantee isolation properties
  2. Multi-model fallback strategies for the execution layer
  3. Benchmarking frameworks for "Goodhart-resistance" in autonomous agents

Would appreciate critiques and suggestions from folks working on AI alignment, evaluation, or agentic systems.

u/illyar80 — 6 days ago
▲ 1 r/AISystemsEngineering+1 crossposts

AI Agents in Production: The Failure Modes Nobody Puts in the Demo

Hey everyone,

I’ve spent the last month building and shipping agentic systems into production. If there’s one thing I’ve realized, it’s that the gap between a flashy Twitter/X demo and a stable, secure production agent is a mile wide.

I put together a deep-dive guide breaking down the architectural realities, high-ROI use cases, and the specific security risks that only surface after you ship.

Here is the TL;DR on what happens when agents meet the real world:

1. Chatbots vs. Agents (The Power to Act)

The only difference between a chatbot and an AI agent is one word: act. An LLM generates—it takes text and returns text. An agent takes that output and runs with it (a tool call, a database query, an email). The model is the mastermind, but tools give it hands. The moment software gets hands, your entire design, testing, and security paradigm has to change.

2. The Ideal Use Case Formula

Agents aren't a silver bullet for everything. They thrive where the cost of human attention is high, but the cost of a mistake is low.

  • High ROI: Operational automation, continuous synthesis/monitoring, support deflection, and repository hygiene.
  • The Trap: Building an agent to reason in a vacuum. If it isn't checking its work against environmental ground truth (real tool results, actual error messages) at every turn of its perceive-decide-act loop, it will drift.

3. The New Attack Surface (Securing a decision-maker)

Unlike traditional software, you're no longer just securing an application—you're securing a decision-maker with credentials. The OWASP Top 10 for LLM Applications highlights exactly why teams are quietly shutting down their agent pilots:

  • Indirect Prompt Injection: Your agent reads an untrusted webpage or email containing hidden instructions. The model can't reliably tell data from commands, so it executes the attacker's will.
  • Excessive Agency & Privilege Escalation: Giving an agent broad tool access paired with a weakly scoped CRM or DB connector. A minor reasoning error turns into an unintended database deletion or unauthorized admin action.
  • Data Leakage & Poisoning: Multi-tenant context bleeding, and RAG systems pulling from poisoned knowledge bases to serve malicious data back to users.

4. Designing for Safe Autonomy

Mitigating this isn't about breakthrough AI research; it's disciplined software engineering:

  • Least Privilege at the Tool Boundary: Treat every single tool call as a permission decision. If the agent doesn't have the capability in the first place, prompt injection can't exploit it.
  • Human-in-the-Loop Gates: Reading is cheap; acting is expensive. Let the agent reason freely, but put irreversible, high-stakes operations (payments, deletions, external publishing) behind a human sign-off step.
  • Observability as a First-Class Feature: Trace every step—the context seen, the decision made, the tool used, and the result. Turn "why did the agent go weird?" into a debuggable event log.
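The three mitigations above can be sketched as a single chokepoint that every tool call passes through. This is an illustration under assumptions, not a specific framework's API: `ToolGateway`, the `approve` callback, and the trace format are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Set


class ToolDenied(Exception):
    """The call was outside the agent's allowlist or a human rejected it."""


@dataclass
class ToolGateway:
    tools: Dict[str, Callable[..., Any]]
    allowed: Set[str]                             # least privilege: per-agent allowlist
    needs_approval: Set[str]                      # irreversible, high-stakes tools
    approve: Callable[[str, Dict[str, Any]], bool]  # human sign-off hook
    trace: List[Dict[str, Any]] = field(default_factory=list)

    def call(self, tool_name: str, /, **kwargs: Any) -> Any:
        # Every attempt is logged, including the ones that are refused.
        event: Dict[str, Any] = {"tool": tool_name, "args": kwargs, "outcome": None}
        self.trace.append(event)
        if tool_name not in self.allowed or tool_name not in self.tools:
            event["outcome"] = "denied"
            raise ToolDenied(f"{tool_name} is not available to this agent")
        if tool_name in self.needs_approval and not self.approve(tool_name, kwargs):
            event["outcome"] = "rejected_by_human"
            raise ToolDenied(f"{tool_name} was not approved")
        try:
            result = self.tools[tool_name](**kwargs)
        except Exception as exc:
            event["outcome"] = f"error: {exc!r}"
            raise
        event["outcome"] = "ok"
        event["result"] = result
        return result
```

Because the allowlist lives outside the model, an injected instruction to call a tool the agent was never granted ends as a `denied` trace entry rather than an action.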

The One-Sentence Version: Agents act—that’s why they’re powerful, why they’re risky, and why you must scope their power and gate the actions you can’t take back.

I wrote a much longer breakdown covering these architectural trade-offs, including the decision matrix on whether to build your own loop vs. use a managed agent runtime (declarative vs. hosted).

Check out the full article here if you're interested

Would love to hear from anyone else shipping agents right now. What failure modes are you hitting that caught you off guard?

u/SnooPuppers2477 — 7 days ago

A race condition on a shared agent instance caused a cross-tenant data leak in our multi-tenant AI system

We were close to shipping an AI agent for an ITSM tool — it turns plain-English requests into structured support tickets. Multi-tenant, one deployment serving many companies. Unit tests green, smoke tests clean, dev stable for days.

During concurrency testing I fired two requests at once — two different tenants hitting the same workflow — and Tenant A's response came back populated with Tenant B's data. Reproducible, every time the two overlapped. I pulled the deploy.

Root cause: we created a single agent instance at startup and reused it for every request. Felt efficient — agents are expensive to spin up, so build once and share. The problem: that one shared agent stored the active tenant's context on itself. Under sequential traffic it's invisible — request finishes, next one overwrites the slot, no harm. Under concurrency it's a time bomb: Request B sets tenant_id while Request A is mid-flight, A reads it back, and A gets B's value. Whoever writes last wins.
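A stripped-down reproduction of that failure, assuming an asyncio service. `SharedAgent` and the sleep durations are stand-ins for the real agent and its model call; only the shape of the bug is the same.

```python
import asyncio


class SharedAgent:
    """One instance built at startup and reused for every request."""

    def __init__(self) -> None:
        self.tenant_id = None  # per-request data resting on a shared object

    async def handle(self, tenant_id: str, work_seconds: float) -> str:
        self.tenant_id = tenant_id          # Request B overwrites this mid-flight
        await asyncio.sleep(work_seconds)   # stands in for the model or tool call
        return f"ticket for {self.tenant_id}"  # reads back whoever wrote last


async def demo() -> tuple:
    agent = SharedAgent()
    request_a = asyncio.create_task(agent.handle("tenant-a", 0.05))
    await asyncio.sleep(0.01)               # B arrives while A is still working
    request_b = asyncio.create_task(agent.handle("tenant-b", 0.01))
    return await request_a, await request_b
```

Running `asyncio.run(demo())` returns a tenant-b ticket for both requests. Run the two calls one after the other instead and each gets its own tenant back, which is why sequential tests stay green.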

What makes agents especially prone to this is that they feel like an object you build once and reuse, and they naturally accumulate state — prompt, retrieved docs, memory, tool results. Every one of those is a slot where per-tenant data can come to rest on something shared. And the failure mode isn't a 500 anyone notices; it's a fluent, confident answer about the wrong company.

Why nothing caught it: every test we owned ran one request at a time. Unit tests are great at proving correctness in isolation and completely blind to two requests stepping on each other. Green tests meant "correct in isolation," not "safe under load" — and for a multi-tenant system those are very different claims.

The fix: the quick patch is per-request instances so there's no shared slot. But that only closes one door. We moved tenancy off the agent entirely and pushed it to the tool boundary — the agent holds no tenant state, every tool call carries its own tenant scope + scoped credentials, and the boundary enforces it per call, so even a hallucinated wrong-tenant request can't cross it. Underneath that: row-level security at the data layer, plus a last-line assertion that every returned record's tenant ID matches the requester. Defense in depth, because any single layer can fail silently.
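A sketch of what moving tenancy to the tool boundary can look like. Everything here is hypothetical: the in-memory `TICKETS` list stands in for a data layer with row-level security, and scoped credentials are left out for brevity.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class TenantScope:
    """Immutable scope created per request and passed to every tool call."""
    tenant_id: str


class CrossTenantError(Exception):
    pass


# Stand-in for the real data layer.
TICKETS: List[Dict] = [
    {"id": 1, "tenant_id": "tenant-a", "title": "VPN down"},
    {"id": 2, "tenant_id": "tenant-b", "title": "Laptop request"},
]


def assert_same_tenant(scope: TenantScope, rows: List[Dict]) -> List[Dict]:
    """Last-line check: every returned record must belong to the requester."""
    for row in rows:
        if row.get("tenant_id") != scope.tenant_id:
            raise CrossTenantError(f"record {row.get('id')} is outside {scope.tenant_id}")
    return rows


def search_tickets(scope: TenantScope, query: str) -> List[Dict]:
    # The tool filters by the scope it was handed, never by agent state.
    rows = [t for t in TICKETS
            if t["tenant_id"] == scope.tenant_id and query.lower() in t["title"].lower()]
    return assert_same_tenant(scope, rows)


class StatelessAgent:
    """Holds no tenant state, so sharing one instance across requests is safe."""

    def handle(self, scope: TenantScope, request_text: str) -> Dict:
        rows = search_tickets(scope, request_text)
        return {"tenant_id": scope.tenant_id, "matches": [r["id"] for r in rows]}
```

The filter and the assertion overlap on purpose: if the filter is ever bypassed or mis-scoped, the assertion turns a silent leak into a loud error.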

Concurrency + tenant-isolation tests are now first-class in the pipeline — many tenants hitting the same endpoint simultaneously, asserting zero cross-contamination on every change.
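One possible shape for such a test, as a sketch: fire many overlapping requests for several tenants and assert that each response is tagged with the tenant that asked. The `handler` signature and the `tenant_id` field on the response are assumptions about the system under test.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List

Handler = Callable[[str], Awaitable[Dict]]  # hypothetical: tenant_id in, response dict out


async def assert_no_cross_tenant(handler: Handler, tenant_ids: List[str],
                                 rounds: int = 20) -> None:
    async def one(tenant_id: str):
        response = await handler(tenant_id)
        return tenant_id, response

    # Interleave tenants so requests for different tenants overlap in time.
    jobs = [one(t) for _ in range(rounds) for t in tenant_ids]
    for tenant_id, response in await asyncio.gather(*jobs):
        assert response["tenant_id"] == tenant_id, (
            f"{tenant_id} received data for {response['tenant_id']}"
        )
```

In a real pipeline the handler would call the deployed endpoint, and the assertion would inspect every record in the response rather than a single field.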

Curious how others handle tenant isolation in stateful/agent systems — do you scope at the tool boundary, the data layer, both? And has anyone found a clean way to make "no per-tenant state on shared objects" enforceable rather than a thing everyone has to remember?

Wrote up the longer version with diagrams here if useful: https://medium.com/@adityadhir97/i-almost-shipped-an-ai-agent-that-could-have-exposed-customer-data-af1c5a750efd

u/SnooPuppers2477 — 10 days ago

How are you testing your AI Agents?

Hello developers,

I've recently been building and testing AI agents, and one thing that keeps coming up is flaky evaluations caused by the non-deterministic nature of LLMs.

Sometimes a test case fails, I rerun it immediately, and it passes without any code changes. Other times the agent produces a slightly different reasoning path that still reaches the correct outcome.

For teams shipping agentic products:

  • How much tolerance do you allow for these kinds of failures in CI/CD?
  • Do you rerun failed evaluations before failing a build?
  • How do you distinguish between genuinely broken behavior and sporadic LLM variability?
  • Are your PR gates based on individual test cases, aggregate metrics, statistical significance, or something else?

I'm curious how mature teams handle this in production because traditional "all tests must pass" approaches seem difficult to apply when some amount of variability is inherent to the system.
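To make the "statistical significance" option concrete, here is one possible shape for such a gate, offered as a sketch rather than a recommendation. It assumes each eval case is rerun a fixed number of times and compared against a known baseline pass rate. The 0.9 baseline and 0.05 alpha are placeholder values, and the function names are hypothetical.

```python
from math import comb


def prob_at_most(passes: int, runs: int, baseline: float) -> float:
    """P(X <= passes) for X ~ Binomial(runs, baseline)."""
    return sum(comb(runs, i) * baseline ** i * (1 - baseline) ** (runs - i)
               for i in range(passes + 1))


def gate(passes: int, runs: int, baseline: float = 0.9, alpha: float = 0.05) -> bool:
    """Return True to allow the merge.

    The build fails only when a result this bad would be unlikely (probability
    below alpha) if the true pass rate were still at the baseline.
    """
    return prob_at_most(passes, runs, baseline) >= alpha
```

With 10 reruns against a 0.9 baseline, 7 passes still clears the gate and 6 does not, so a single sporadic failure never blocks a PR but a sustained drop does.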

Would love to hear what has worked (and what hasn't) for your teams.

u/More-Version3682 — 11 days ago