u/dco44

Prism Coder: Qwen3.5-14B fine-tune for MCP tool-routing — 100% on 102-case benchmark (vs Claude Opus 98.3%)

Releasing Prism Coder checkpoints.

What it is: LoRA fine-tunes of Qwen3.5-14B and 32B optimized for MCP (Model Context Protocol) tool-routing. The training target is the routing decision itself: not only emitting a correctly formatted tool call, but also knowing when NOT to call a tool.

Why this matters: Base Qwen3.5 over-routes. It calls tools reflexively even when a direct answer is better. The fine-tune targets this failure mode specifically.

Benchmark (102-case routing eval):

| Model | Score |
|---|---|
| Prism Coder 14B | 100% |
| Claude Opus | 98.3% |
| Base Qwen3.5-14B | ~73% |

Use case: local MCP deployments. The intended cascade is 14B local → 32B local → cloud fallback, which keeps cloud API costs near zero.
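
A minimal sketch of that cascade in Python, assuming each tier is a callable that returns an answer or `None` when it declines. The names `Tier`, `run_cascade`, and the stub tiers are hypothetical and not taken from prism-mcp:

```python
from typing import Callable, Optional, Sequence, Tuple

# A tier takes the prompt and returns an answer, or None to signal
# "not confident, escalate". This contract is an assumption.
Tier = Callable[[str], Optional[str]]


def run_cascade(prompt: str, tiers: Sequence[Tuple[str, Tier]]) -> Tuple[str, str]:
    """Try each tier in order; return (tier_name, answer) from the first that answers."""
    for name, tier in tiers:
        answer = tier(prompt)
        if answer is not None:
            return name, answer
    raise RuntimeError("no tier produced an answer")


# Stub tiers standing in for real model calls (hypothetical behavior).
def local_14b(prompt: str) -> Optional[str]:
    return "direct_answer" if len(prompt) < 40 else None


def local_32b(prompt: str) -> Optional[str]:
    return "call_tool:search" if "search" in prompt else None


def cloud_fallback(prompt: str) -> Optional[str]:
    return "cloud_answer"


TIERS = [("14b-local", local_14b), ("32b-local", local_32b), ("cloud", cloud_fallback)]
```

With these stubs, `run_cascade("What is 2 + 2?", TIERS)` is answered by the first tier, and the cloud tier is reached only when both local tiers return `None`.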

License: AGPL-3.0 | GitHub: github.com/dcostenco/prism-mcp

Feedback on the benchmark methodology welcome.

reddit.com
u/dco44 — 1 day ago

Fine-tuning for MCP tool-routing decisions: what the benchmark revealed about small model failure modes

Building Prism Coder (a Qwen3.5 fine-tune for MCP tool-routing, AGPL-3.0; disclosure: I'm the developer) forced me to think carefully about what failure modes in tool-calling actually look like. Sharing the findings.

The benchmark: 102 cases, each consisting of a user prompt plus a set of available tools. The model must decide whether to call a tool or answer directly.

Most tool-calling benchmarks measure whether the tool call is correctly formatted. This one measures whether the decision to call is correct.
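
One way to represent such a case and score only the decision, as a sketch. The field names (`prompt`, `tools`, `should_call`), the example prompts and tool names, and the `decision_accuracy` helper are assumptions for illustration, not the actual benchmark schema:

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class RoutingCase:
    prompt: str        # user prompt shown to the model
    tools: List[str]   # names of the tools made available
    should_call: bool  # gold label: is calling a tool the right decision?


def decision_accuracy(cases: List[RoutingCase],
                      decide: Callable[[RoutingCase], bool]) -> float:
    """Fraction of cases where the call/no-call decision matches the label.

    Only the decision is scored; tool choice and argument format are ignored.
    """
    if not cases:
        raise ValueError("no cases to score")
    correct = sum(decide(case) == case.should_call for case in cases)
    return correct / len(cases)


# Made-up cases, two that need a tool and two that do not.
CASES = [
    RoutingCase("What is the capital of France?", ["web_search"], False),
    RoutingCase("What did we decide in yesterday's session?", ["memory_search"], True),
    RoutingCase("Rename variable x to count in this snippet.", ["web_search"], False),
    RoutingCase("Save this note to project memory.", ["memory_save"], True),
]


def always_call(case: RoutingCase) -> bool:
    """A deliberately over-routing policy: call whenever any tool is available."""
    return bool(case.tools)
```

On these four made-up cases, `decision_accuracy(CASES, always_call)` comes out at 0.5, because the policy is wrong on both cases that should have been answered directly.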

Finding 1: Two distinct failure modes below 14B

  • Over-routing (false positive): Calls a tool when a direct answer is better. Precision tanks.
  • Under-routing (false negative): Answers directly when a tool should be called. Recall tanks.

The two trade off against each other: aggressive fine-tuning that reduces over-routing often increases under-routing, so you have to optimize both at once.
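
To make the two failure modes measurable, treat "call a tool" as the positive class. A sketch of the resulting precision and recall, with hypothetical helper names and toy labels:

```python
from typing import List, Tuple


def precision_recall(gold: List[bool], pred: List[bool]) -> Tuple[float, float]:
    """Precision and recall with "call a tool" (True) as the positive class.

    Over-routing adds false positives and lowers precision.
    Under-routing adds false negatives and lowers recall.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    tp = sum(g and p for g, p in zip(gold, pred))
    fp = sum((not g) and p for g, p in zip(gold, pred))
    fn = sum(g and (not p) for g, p in zip(gold, pred))
    # Convention chosen here: report 0.0 when a ratio is undefined.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Toy labels: 4 cases need a tool, 4 do not.
GOLD = [True, True, True, True, False, False, False, False]
OVER_ROUTER = [True, True, True, True, True, True, False, False]       # 2 false positives
UNDER_ROUTER = [True, True, False, False, False, False, False, False]  # 2 false negatives
```

Here the over-router keeps recall at 1.0 but drops precision to 2/3, while the under-router keeps precision at 1.0 but drops recall to 0.5. Both are wrong on 2 of 8 cases, so accuracy alone cannot tell them apart.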

Finding 2: Base models over-route by default

  • Base Qwen3.5-14B: ~73% accuracy on routing decisions
  • After LoRA fine-tuning on routing corpus: 100%

Finding 3: Failure direction depends on fine-tune, not model size alone

The same base model with different training data can show completely opposite failure modes. Raw accuracy alone is not a useful comparison metric unless you also know in which direction a model fails.

Practical implication for agentic pipelines: Split your eval into false positive rate (called a tool when it shouldn't have) and false negative rate (didn't call when it should have). A model at 90% accuracy might get there by almost never calling tools, which makes it useless in production.
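
A worked example of why the split matters, as a sketch with made-up numbers: on a hypothetical 100-case set where only 10 cases need a tool, a model that never calls one still reaches 90% accuracy. The helper name and the data are assumptions for illustration:

```python
from typing import Dict, List


def routing_error_rates(gold: List[bool], pred: List[bool]) -> Dict[str, float]:
    """Accuracy plus false positive and false negative rates for call/no-call decisions."""
    if not gold or len(gold) != len(pred):
        raise ValueError("gold and pred must be non-empty and the same length")
    fp = sum((not g) and p for g, p in zip(gold, pred))  # called when it shouldn't
    fn = sum(g and (not p) for g, p in zip(gold, pred))  # didn't call when it should
    positives = sum(gold)               # cases that need a tool
    negatives = len(gold) - positives   # cases that need a direct answer
    correct = sum(g == p for g, p in zip(gold, pred))
    return {
        "accuracy": correct / len(gold),
        "false_positive_rate": fp / negatives if negatives else 0.0,
        "false_negative_rate": fn / positives if positives else 0.0,
    }


# Hypothetical set: 10 cases need a tool, 90 do not.
GOLD = [True] * 10 + [False] * 90
NEVER_CALLS = [False] * 100

rates = routing_error_rates(GOLD, NEVER_CALLS)
```

For this data `rates` holds an accuracy of 0.9, a false positive rate of 0.0, and a false negative rate of 1.0: the headline number looks fine while every case that needed a tool is missed.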

Happy to share the benchmark cases or training corpus structure if anyone wants to run their own evals.

GitHub: github.com/dcostenco/prism-mcp

u/dco44 — 1 day ago
▲ 11 r/BehaviorAnalysis+1 crossposts

My son is nonverbal. My wife is a BCBA. I built a free AAC app — looking for feedback.

Free: full AAC board, speech, 23 languages, 12 games, Apple Watch, emergency SOS
- On-device 1.7B AI (~0.5s, no internet needed, no PHI leaves the device)
- Per-child phrase learning (ACT-R spreading activation)
- Built for ABA workflows — data collection, verbal operant tracking

App Store: https://apps.apple.com/app/id6764692277
Web: https://synalux.ai/prism-aac
Source: https://github.com/dcostenco/prism-aac
Evaluation: https://synalux.ai/evaluation
u/dco44 — 8 days ago
▲ 6 r/SpecialNeedsChildren+2 crossposts

Same disclosure as the title — I built this. AGPL-3.0 open source, free tier requires no account. synalux.ai/prism-aac · github.com/dcostenco/prism-aac.

**Works on every device, online or offline:**

  • Any browser, any platform — iPhone, iPad, Android, Windows, Mac, Chromebook. No app store install.
  • Installable as a home-screen PWA so it feels native.
  • Fully offline after first load — communication never depends on a working internet connection.

Briefly what it does:

  • TouchChat-style pictograms next to every phrase (open ARASAAC library, free, every tier)
  • Auto-corrects hurried typing ("bowlof,ri" → "bowl of rice") so motor imprecision doesn't lose meaning. Runs on-device when possible.
  • Continuous voice input button: the kid can speak, and the same correction step cleans up word-boundary errors
  • No data leaves the device on the free tier

Looking for caregiver feedback on what's still broken, what's still missing.

Link to try:

https://synalux.ai/prism-aac

Screenshots: [home](https://github.com/dcostenco/prism-aac/raw/main/docs/screenshots/home-v2.png) · [phrase tiles with pictograms](https://github.com/dcostenco/prism-aac/raw/main/docs/screenshots/categories-pictograms-v2.png) · [math panel](https://github.com/dcostenco/prism-aac/raw/main/docs/screenshots/math-panel-v2.png)

u/dco44 — 21 days ago

Disclosure: I'm the developer. Prism AAC is free + open source under AGPL-3.0, lives at synalux.ai/prism-aac, source at github.com/dcostenco/prism-aac.

**Works on every device, online or offline:**

  • Installable as a PWA so it lives on the home screen like a native app.

What it does:

  • Auto-correction for hurried/imprecise typing: runs on-device via a local prism-coder model when available, with a portal fallback otherwise. The Speak button never blocks on the network.
  • Continuous voice input (Web Speech API, on-device)
  • Emergency button works on every tier including Free

Looking for honest critique from people who actually deploy AT. What's missing? What's wrong?

Link to try:

https://synalux.ai/prism-aac

Screenshots: [home](https://github.com/dcostenco/prism-aac/raw/main/docs/screenshots/home-v2.png) · [categories](https://github.com/dcostenco/prism-aac/raw/main/docs/screenshots/categories-list-v2.png) · [pictograms in action](https://github.com/dcostenco/prism-aac/raw/main/docs/screenshots/categories-pictograms-v2.png) · [math panel](https://github.com/dcostenco/prism-aac/raw/main/docs/screenshots/math-panel-v2.png)

u/dco44 — 21 days ago