u/UnluckyOpposition

We reached 100 users on our AI expense tracker after a massive revamp. Looking for brutal feedback on our new UI, onboarding, and retention.

We reached 100 users on our AI expense tracker after a massive revamp. Looking for brutal feedback on our new UI, onboarding, and retention.

Hey r/SideProject,

A while back, we launched the initial version of SmartWalt. The core idea was simple: an expense tracker that doesn't force you to link your bank account (because a lot of people hate giving up those credentials), but uses AI to eliminate manual data entry. You just say "Lunch $12" or snap a photo of a receipt, and it logs and categorizes it instantly.

The early feedback we got was incredibly helpful, but it also made us realize the app needed a lot of work. We went back to the drawing board, completely "refurnished" the app from the ground up, and recently rolled out a massive update.

Since that revamp, we've grown to nearly 100 active customers. It feels like a great milestone, but I know we still have blind spots and I want to fix them before we push harder for growth.

I would love some fresh, critical eyes on the app. Specifically, I'm looking for feedback on:

  1. Onboarding Flow: Does the initial setup clearly explain how to use the voice and receipt scanning features, or is it confusing? Where do you feel the friction?
  2. The New UI: We completely overhauled the design. Is it intuitive to navigate, or does it feel cluttered?
  3. Retention: Utility apps are notoriously hard for retention. If you have built something similar, what specific hooks or UX patterns actually keep users coming back daily?

You can check out the landing page and app here: https://www.smartwalt.com/

I’m fully prepared for harsh, honest feedback. I'll also be hanging around the comments if anyone is curious about the tech stack (we use Flutter for the frontend and FastAPI on AWS for the AI orchestration).

Thanks in advance for the help!

u/UnluckyOpposition — 1 day ago

(Links to the GitHub repo and Docs are in the first comment)

Prototyping LLM applications and RAG pipelines is excellent for the zero-to-one phase, but deploying them in a B2B environment introduces a specific set of infrastructure bottlenecks.

Our team has been maintaining an open-source production wrapper called LongTrainer for the last two years to handle these exact deployment gaps. We just shipped v1.3.1, and I wanted to share how we are currently handling the core challenges of production LLM infrastructure.

Here are the main issues we see, and how this architecture addresses them:

1. The Multi-Tenant Vector Problem The Issue: When you scale to dozens of clients on a single backend, relying on metadata filtering to separate client data isn't always secure enough, and managing dynamic indices manually gets messy. The Solution: We enforce hard isolation through a bot_id. Every instance gets a completely walled-off vector space and memory chain. Client A's embeddings and conversations can never intersect with Client B's, natively supported across FAISS, Pinecone, Qdrant, PGVector, and Chroma.

2. Memory Bloat and Server Restarts The Issue: Loading historical conversation data into RAM is fine for demos. But at scale, if a server restarts and has to eagerly load 100k+ past chat sessions, it chokes. The Solution: We bypass in-memory storage entirely. Chat histories are persisted to MongoDB and strictly lazy-loaded. When a user queries the bot, only that specific conversation thread is fetched on demand. Startup times stay flat regardless of database size.

3. Span Tracing (Without 3rd-Party SaaS) The Issue: Knowing why a chain failed or why retrieval was poor usually requires piping data to a paid observability platform. The Solution: We built native tracing directly into the pipeline. It logs retrieval spans (which docs were fetched, latency, similarity scores), LLM spans (exact prompts, token counts), and Agent tool calls directly into your own MongoDB instance.

4. Real-time Hallucination Detection The Issue: Users finding out the LLM hallucinated before you do. The Solution: We integrated an NLI-based CitationVerifier. Before returning the final string, the response is split into atomic claims. Each claim is cross-referenced against the retrieved source documents. If it’s unsupported, it gets flagged in the database as a hallucination.

5. Traffic Management & "Noisy Neighbors" (New in v1.3.1) The Issue: In a multi-tenant environment, one active bot or client can drain your API limits (OpenAI/Anthropic) and throttle everyone else. The Solution: We just introduced a two-layer token-bucket rate limiting system. Layer 1 enforces strict per-tenant RPM ceilings, and Layer 2 ensures equal-share budgets across all bots under that tenant. When limits are hit, the API handles 429 Too Many Requests properly, and our CLI auto-retries with a progress bar.

What the implementation actually looks like: We designed it so deploying this entire stack takes just a few lines, rather than wiring up custom DB wrappers and session managers:

Python

from longtrainer.trainer import LongTrainer

# 1. Initialize with Mongo persistence and tracing enabled
trainer = LongTrainer(
    mongo_endpoint="mongodb://localhost:27017/",
    enable_tracer=True,
    tracer_verify=True # Enables the NLI hallucination checks
)

# 2. Create isolated multi-tenant instance
bot_id = trainer.initialize_bot_id()
trainer.add_document_from_path("client_data.pdf", bot_id)
trainer.create_bot(bot_id)

# 3. Query (Memory is automatically lazy-loaded and synced)
chat_id = trainer.new_chat(bot_id)
answer, sources = trainer.get_response("Summarize the terms", bot_id, chat_id)

Honest architectural trade-offs:

  • Latency: The NLI hallucination verification adds latency per query. It is not suitable for strict sub-100ms streaming requirements.
  • Database Dependency: We currently enforce a hard dependency on MongoDB for persistence and tracing logs; no lightweight SQLite option yet.
  • CLI vs TUI: As of v1.3.1, we ripped out the heavy TUI (Rich) assets for cleaner, more standard CLI logs to make it leaner for containerized deployments.

We also just added a fully interactive RAG demo (demos/longtrainer_demo.py) that supports OpenAI, Gemini, and Ollama out of the box if you want to test it locally without writing config.

The package is MIT licensed and actively maintained.

For other devs building LLM backends right now - how are you currently handling rate limiting and memory scaling for your tenants? Are you rolling custom middleware, or is there an existing pattern you prefer?

u/UnluckyOpposition — 15 days ago
▲ 10 r/LangChain+1 crossposts

When maintaining Retrieval-Augmented Generation (RAG) pipelines in production, one of the most persistent challenges engineering teams face is silent retrieval degradation.

Updating document indexes, modifying chunking strategies, or migrating embedding models can unintentionally break previously successful queries. The context window gets filled with irrelevant chunks, and without a dedicated testing layer, these retrieval regressions instantly surface as LLM hallucinations in production environments.

To address this at the architecture level, our team open-sourced LongProbe a retrieval regression testing package designed to bring stability and predictability to RAG infrastructure.

Instead of relying on manual spot-checks, LongProbe allows engineering teams to build "boring," highly stable infrastructure by treating vector retrieval exactly like standard software regression testing. It ensures that your retrieval layer consistently returns the correct context before it ever reaches the LLM.

Core Capabilities:

  • Automated Regression Testing: Define expected retrieval baselines for specific queries and continuously test your pipeline against them as your vector database expands.
  • Pipeline and Framework Agnostic: Whether your orchestration layer relies on LangChain, LlamaIndex, or custom API integrations, LongProbe validates the actual retrieval output independent of the framework.
  • CI/CD Ready: Catch exact failure points—like a specific chunking update or embedding swap—before deploying changes to production environments.

We built this for teams that prioritize production-grade scalability and need their AI architectures to maintain high development velocity without sacrificing reliability.

You can review the source code, documentation, and a complete workflow demo here: GitHub:https://github.com/ENDEVSOLS/LongProbe

We are actively maintaining this package alongside our broader open-source RAG suite. We would welcome any technical feedback, architectural critiques, or pull requests from developers currently managing vector store evaluations in production.

u/UnluckyOpposition — 2 days ago

Hey r/LLMDevs,

A little while ago we open-sourced LongParser to handle the messy parts of document ingestion for RAG architectures. Today we are pushing out the v0.1.5 update, which shifts the focus from basic parsing to solving the real-world pipeline bottlenecks we've been hitting in production.

Here is a breakdown of the new architecture and what we implemented in this release:

  • Semantic Chunking: We moved away from blind token limits. The chunker now uses all-MiniLM-L6-v2 to track cosine similarity between text blocks, creating hard boundaries only when the actual topic shifts to preserve context.
  • Cross-Reference Resolution: We added an $O(N)$ single-pass algorithm to resolve internal references (like "see Figure 3" or "the table below") directly to their corresponding data blocks, which keeps the document's relational structure intact.
  • Zero-ML OCR Filtering: To stop garbage OCR from poisoning Vector DBs without relying on heavy ML models, we built a fast heuristic scorer. It averages raw OCR confidence, OS dictionary validation, and fastText language ID to penalize garbled text.
  • Pre-DB PII Redaction: To prevent sensitive data leaks, we introduced a two-tier redaction engine. It uses Regex/Luhn validation for structured data (SSNs, cards) and spaCy NER for contextual masking before data touches the DB or LLM. The unmasked data remains securely stored in hidden metadata.
  • Async Summary Chunks: To enable hierarchical retrieval without freezing the main parsing pipeline, all heavy LLM summarization calls are now offloaded to a non-blocking background worker using ARQ/Redis.

Repo link in the comments. You can check exactly how the code works.

reddit.com
u/UnluckyOpposition — 17 days ago

(Links to the GitHub repo and Docs are in the first comment to avoid the spam filter)

LangChain is excellent for the zero-to-one phase, but deploying it in a B2B environment introduces a specific set of infrastructure bottlenecks.

Our team has been maintaining an open-source production wrapper called LongTrainer for the last two years to handle these exact deployment gaps. We recently shipped v1.3.0, and I wanted to share how we are currently handling the core challenges of production RAG.

Here are the main issues we see, and how this architecture addresses them:

1. The Multi-Tenant Vector Problem

The Issue: When you scale to dozens of clients on a single backend, relying on metadata filtering to separate client data isn't always secure enough, and managing dynamic indices manually gets messy. The Solution: We enforce hard isolation through a bot_id. Every instance gets a completely walled-off vector space and memory chain. Client A's embeddings and conversations can never intersect with Client B's, natively supported across FAISS, Pinecone, Qdrant, PGVector, and Chroma.

2. Memory Bloat and Server Restarts

The Issue: Loading historical RunnableWithMessageHistory data into RAM is fine for demos. But at scale, if a server restarts and has to eagerly load 100k+ past chat sessions, it chokes. The Solution: We bypass in-memory storage entirely. Chat histories are persisted to MongoDB and strictly lazy-loaded. When a user queries the bot, only that specific conversation thread is fetched on demand. Startup times stay flat regardless of database size.

3. Span Tracing (Without 3rd-Party SaaS)

The Issue: Knowing why a chain failed or why retrieval was poor usually requires piping data to a paid observability platform. The Solution: We built native tracing directly into the pipeline (LongTracer). It logs retrieval spans (which docs were fetched, latency, similarity scores), LLM spans (exact prompts, token counts), and Agent tool calls directly into your own MongoDB instance.

4. Real-time Hallucination Detection (v1.3.0 update)

The Issue: Users finding out the LLM hallucinated before you do. The Solution: We integrated an NLI-based CitationVerifier. Before returning the final string, the response is split into atomic claims. Each claim is cross-referenced against the retrieved source documents. If it’s unsupported, it gets flagged in the database as a hallucination.

What the implementation actually looks like:

We designed it so deploying this entire stack takes just a few lines, rather than wiring up custom DB wrappers and session managers:

from longtrainer.trainer import LongTrainer

# 1. Initialize with Mongo persistence and tracing enabled
trainer = LongTrainer(
    mongo_endpoint="mongodb://localhost:27017/",
    enable_tracer=True,
    tracer_verify=True # Enables the NLI hallucination checks
)

# 2. Create isolated multi-tenant instance
bot_id = trainer.initialize_bot_id()
trainer.add_document_from_path("client_data.pdf", bot_id)
trainer.create_bot(bot_id)

# 3. Query (Memory is automatically lazy-loaded and synced)
chat_id = trainer.new_chat(bot_id)
answer, sources = trainer.get_response("Summarize the terms", bot_id, chat_id)

Honest architectural trade-offs:

  • The NLI hallucination verification adds latency per query. It is not suitable for strict sub-100ms streaming requirements.
  • We currently enforce a hard dependency on MongoDB for persistence and tracing logs; no lightweight SQLite option yet.
  • Agent mode (converting the bot to a tool-calling LangGraph agent) is functional but less battle-tested than the standard RAG path.

The package is MIT licensed and actively maintained.

For other teams deploying LangChain to enterprise clients right now - how are you currently handling multi-tenant memory scaling? Are you rolling custom database wrappers, or is there an existing pattern you prefer?

reddit.com
u/UnluckyOpposition — 17 days ago
▲ 11 r/Rag

LongTrainer is an open-source Python framework built on top of LangChain. It handles the production infrastructure layer that most RAG tutorials skip entirely.

Multi-tenant bot isolation

Every bot instance gets a fully isolated vector index and chat history scoped through a bot_id. Client A's documents, embeddings, and conversation memory never touch Client B's. Works across FAISS, Pinecone, Qdrant, PGVector, and Chroma.

from longtrainer.trainer import LongTrainer

trainer = LongTrainer(mongo_endpoint="mongodb://localhost:27017/")

bot_id = trainer.initialize_bot_id()
trainer.add_document_from_path("client_data.pdf", bot_id)
trainer.create_bot(bot_id)

chat_id = trainer.new_chat(bot_id)
answer, sources = trainer.get_response(
    "What are the payment terms?",
    bot_id,
    chat_id
)

Persistent memory via MongoDB

Chat histories are persisted to MongoDB and lazy-loaded per session on demand. Nothing is held in RAM at startup. Startup time stays flat at milliseconds regardless of how many historical sessions exist.

Document ingestion

Accepts PDF, DOCX, URLs, S3 buckets, Google Drive, and GitHub repos directly. Chunking, embedding, and indexing are handled automatically.

trainer.add_document_from_path("report.pdf", bot_id)
trainer.add_document_from_url("[https://docs.example.com](https://docs.example.com)", bot_id)
trainer.add_document_from_s3("s3://bucket/data.pdf", bot_id)

Streaming responses

# Sync streaming
for chunk in trainer.stream_response(query, bot_id, chat_id):
    print(chunk, end="", flush=True)

# Async
answer = await trainer.aget_response(query, bot_id, chat_id)

Agent mode with tool calling

Switches any bot from standard RAG to a LangGraph-powered agent that can call tools. Same bot_id isolation applies.

def get_current_price(ticker: str) -> str:
    return f"Price for {ticker}: $142.30"

trainer = LongTrainer(
    mongo_endpoint="mongodb://localhost:27017/",
    agent_mode=True
)
trainer.add_tool(get_current_price, bot_id)

What is new in v1.3.0: Observability and hallucination detection

The main gap in earlier versions was visibility into what was actually happening inside the pipeline when responses were wrong. v1.3.0 adds native LongTracer integration to address this directly.

Enable it with a single flag:

trainer = LongTrainer(
    mongo_endpoint="mongodb://localhost:27017/",
    enable_tracer=True,
    tracer_backend="mongo",
    tracer_verify=True,
    tracer_threshold=0.5
)

Span-level observability Every query automatically generates a trace stored in MongoDB:

  • Retrieval spans: which documents were fetched, similarity scores, latency
  • LLM spans: exact prompt, token count, generation latency
  • Agent spans: every tool call, input, output, and execution time

Hallucination detection via CitationVerifier When tracer_verify=True, every response goes through NLI cross-referencing before being returned to the user:

  1. The LLM response is split into atomic, independently verifiable claims
  2. Each claim is checked against retrieved source documents using an NLI model
  3. Unsupported claims are flagged and written to MongoDB

Querying flagged responses:

from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017/")["longtracer"]

flagged = db.runs.find({
    "inputs.bot_id": "your-bot-id",
    "outputs.is_hallucinated": True
})

for run in flagged:
    print("Query      :", run["inputs"]["query"])
    print("Failed     :", run["outputs"]["failed_claims"])
    print("Source docs:", run["outputs"]["retrieved_docs"])

Span logging only, without NLI overhead:

trainer = LongTrainer(
    mongo_endpoint="mongodb://localhost:27017/",
    enable_tracer=True,
    tracer_verify=False
)

Supported LLMs: OpenAI, Anthropic, Gemini, AWS Bedrock, HuggingFace, Groq, Ollama Supported vector stores: FAISS, Pinecone, Qdrant, PGVector, Chroma

Known limitations:

  • MongoDB required - no in-memory backend yet
  • tracer_verify=True adds latency per query - not suitable for sub-100ms SLA requirements
  • NLI model pulls weights on first run
  • Agent mode is less battle-tested than the standard RAG path

MIT licensed.

For those running RAG in production - how are you currently handling hallucination detection? Curious whether anyone has found a lower-latency alternative to NLI for claim verification at scale.

reddit.com
u/UnluckyOpposition — 18 days ago