u/FinishResponsible354

I'm building a customer-facing AI agent for an ecommerce business and running into consistent issues with the LLM making things up and selecting the wrong tools. Looking for advice from people who've solved similar problems.

Current architecture:

- Node.js backend

- LLM with tool calling (the model decides which tool to invoke)

- RAG (vector embeddings) for one section only: general knowledge base

- The rest are tools the LLM calls directly

Tools available:

- `knowledge` — RAG over a knowledge base

- `product_search` — searches product catalog (plain text responses)

- `branches` — store locations info (plain text)

- `promotions` — active discounts (plain text, manually loaded)

- `branch_stock` — stock per location

- `personality` — brand tone and style

Problems I'm seeing:

  1. Hallucination when a product doesn't exist: a user asked for gift cards, the tool returned a jacket (the closest match by semantic similarity), and the LLM answered that yes, gift cards are available starting from $0.10. Completely invented.
  2. Wrong tool selected: for questions that require combining tools (e.g. "do you have X in stock at branch Y with delivery to Z?"), the LLM picks one tool and ignores the others.
  3. Active promotions ignored: even when the promotions tool has the right data, the LLM sometimes skips calling it.
  4. Stale integration data: tools for third-party platforms (Tienda Nube, VTEX, logistics operators) sometimes return outdated info and the LLM doesn't flag it.
  5. Data mixed across sections: e.g. the LLM uses a price found in promotions to answer a plain product price question.
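
To make problems 1 and 4 concrete, here is a minimal sketch of a guard at the tool boundary: weak similarity matches are dropped before the model sees them, and the age of the data is reported explicitly. The `Hit` shape, the `stale` and `ageMinutes` fields, and both cutoffs are assumptions for illustration, not part of the current setup.

```typescript
// Hypothetical shape of one vector-search hit; field names are assumptions.
interface Hit {
  name: string;
  score: number; // similarity in [0, 1], higher means closer
}

interface ToolResult {
  found: boolean;
  stale: boolean;
  ageMinutes: number;
  items: Hit[];
}

// Illustrative cutoffs, not recommendations; both need tuning on real traffic.
const MIN_SCORE = 0.75;
const MAX_AGE_MINUTES = 30;

function toToolResult(hits: Hit[], fetchedAt: Date, now: Date): ToolResult {
  // Drop weak matches so a jacket is never returned for "gift cards".
  const items = hits
    .filter((h) => h.score >= MIN_SCORE)
    .sort((a, b) => b.score - a.score);
  // Report how old the upstream data is instead of leaving the model to guess.
  const ageMinutes = Math.floor((now.getTime() - fetchedAt.getTime()) / 60000);
  return {
    found: items.length > 0,
    stale: ageMinutes > MAX_AGE_MINUTES,
    ageMinutes,
    items,
  };
}

const now = new Date("2024-01-01T12:00:00Z");
const synced = new Date("2024-01-01T10:00:00Z");
const weakMatch = [{ name: "Denim jacket", score: 0.41 }];
console.log(JSON.stringify(toToolResult(weakMatch, synced, now)));
// prints {"found":false,"stale":true,"ageMinutes":120,"items":[]}
```

With a result like this, the gift card query would reach the model as an explicit miss rather than as a jacket it has to explain away.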

What I think the core issues are:

- Tool descriptions are too vague — the LLM doesn't know when NOT to use a tool or when to combine multiple tools

- Tool outputs are plain text strings, not structured JSON — so when a product isn't found, the LLM gets ambiguous output and fills in the gaps

- No explicit anchoring instruction — nothing in the system prompt tells the LLM what to do when a tool returns empty or partial results

- `personality` and `context` probably shouldn't be tools at all — they should be hardcoded in the system prompt
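
As a rough illustration of the last two points, the sketch below assembles a system prompt with the brand tone hardcoded and the grounding rules spelled out. The personality text is a placeholder, the first rule uses the wording proposed later in this post, and the second rule is an assumed addition for partial results.

```typescript
// Placeholder brand voice; the real text would be whatever the `personality` tool returns today.
const PERSONALITY = "Warm, concise, and never pushy.";

// Rule 1 is the wording from this post; rule 2 is an assumption added for illustration.
const GROUNDING_RULES: string[] = [
  "If the tool returns `found: false` or an empty array, say you don't have that information. Never infer or estimate.",
  "Only state prices, stock levels, and dates that appear in a tool result.",
];

function buildSystemPrompt(personality: string, rules: string[]): string {
  // Number the rules so they can be referenced individually when debugging transcripts.
  const numbered = rules.map((rule, i) => `${i + 1}. ${rule}`).join("\n");
  return `Tone and style:\n${personality}\n\nGrounding rules:\n${numbered}`;
}

const systemPrompt = buildSystemPrompt(PERSONALITY, GROUNDING_RULES);
console.log(systemPrompt);
```

Because the tone is part of the prompt, the model no longer has to decide to call a tool just to learn how to speak.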

What I'm planning to fix:

  1. Rewrite tool descriptions with: when to use, when NOT to use, which tools to combine with
  2. Change all tool outputs to structured JSON with an explicit `found: true/false` field
  3. Add an anchoring rule to the system prompt: "If the tool returns `found: false` or an empty array, say you don't have that information. Never infer or estimate."
  4. Move personality and context to the system prompt, remove them as tools
  5. Instruct the model to call several tools for a single query when the question spans more than one data source
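
Step 1 of the plan could look something like the sketch below: each tool gets a small structured guide that is rendered into its description, and a check catches "combine with" references to tools that don't exist. The `ToolGuide` shape and the example wording for `branch_stock` are hypothetical; only the tool names come from the list above.

```typescript
interface ToolGuide {
  name: string;
  purpose: string;
  useWhen: string[];
  doNotUseWhen: string[];
  combineWith: string[]; // names of other tools
}

// Tool names taken from the list in this post (minus `personality`, which moves to the prompt).
const KNOWN_TOOLS = ["knowledge", "product_search", "branches", "promotions", "branch_stock"];

function renderDescription(g: ToolGuide): string {
  const lines = [
    g.purpose,
    `Use when: ${g.useWhen.join("; ")}.`,
    `Do NOT use when: ${g.doNotUseWhen.join("; ")}.`,
  ];
  if (g.combineWith.length > 0) {
    lines.push(`Combine with: ${g.combineWith.join(", ")}.`);
  }
  return lines.join("\n");
}

// Returns any "combine with" names that are not real tools, so typos fail fast.
function unknownReferences(g: ToolGuide, known: string[]): string[] {
  return g.combineWith.filter((name) => known.indexOf(name) === -1);
}

// Example wording is illustrative only.
const branchStock: ToolGuide = {
  name: "branch_stock",
  purpose: "Returns stock per location for a known product.",
  useWhen: ["the user asks whether a specific product is available at a specific branch"],
  doNotUseWhen: ["the user asks about price", "the product has not been identified yet"],
  combineWith: ["product_search", "branches"],
};

console.log(renderDescription(branchStock));
console.log(unknownReferences(branchStock, KNOWN_TOOLS)); // prints []
```

Keeping the guides as data rather than free text makes it easier to review all the "do not use" rules side by side and spot overlaps between tools.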

Questions for the community:

- Does this diagnosis sound right to you?

- Is structured JSON output from tools the standard approach, or is there a better pattern?

- Any experience with tool descriptions that reliably prevent wrong selection?

- Is there a better pattern than pure tool calling for sections like promotions and branches, which are structured business data with hard filters (date, SKU, zone)?

- Would you add an intent classifier before the tool calling step, or is that over-engineering for this scale?
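
For context on the hard-filter question, this is the kind of deterministic lookup meant for promotions: the model would only supply the SKU and zone, and the date, SKU, and zone checks run in plain code. The `Promotion` fields and the sample data are hypothetical.

```typescript
interface Promotion {
  sku: string;
  zone: string;
  startsAt: string; // ISO 8601 timestamp
  endsAt: string; // ISO 8601 timestamp, inclusive
  discountPct: number;
}

interface PromoQuery {
  sku: string;
  zone: string;
  at: Date;
}

interface PromoResult {
  found: boolean;
  promotions: Promotion[];
}

function findPromotions(all: Promotion[], q: PromoQuery): PromoResult {
  const t = q.at.getTime();
  // Exact match on SKU and zone, and the query time must fall inside the active window.
  const promotions = all.filter(
    (p) =>
      p.sku === q.sku &&
      p.zone === q.zone &&
      Date.parse(p.startsAt) <= t &&
      t <= Date.parse(p.endsAt)
  );
  return { found: promotions.length > 0, promotions };
}

// Made-up sample data.
const samplePromos: Promotion[] = [
  { sku: "JKT-001", zone: "north", startsAt: "2024-06-01T00:00:00Z", endsAt: "2024-06-30T23:59:59Z", discountPct: 15 },
];

const expired = findPromotions(samplePromos, { sku: "JKT-001", zone: "north", at: new Date("2024-07-02T10:00:00Z") });
console.log(JSON.stringify(expired)); // prints {"found":false,"promotions":[]}
```

The open question is whether something like this should sit behind a tool call, or whether matching promotions should be looked up and injected before the model runs at all.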

Happy to share more details. Thanks.

u/FinishResponsible354 — 26 days ago