u/False-Difference4010

I'm switching from Ollama to the Unsloth API using the same code I used with Ollama (I only changed the port from 11434 to 8888), so I'm using the OpenAI-compatible format.

(Edit: I'm running Unsloth Studio on Windows 10.)

Completion:

When I send a question to the chat completion endpoint, the API stops streaming after a while (around 500 tokens, say), as if the connection had been lost. If I switch to Ollama's port, it works without a problem.
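One way to tell a dropped connection apart from a normal stop (for example a default output-token cap) is to look at how the stream ends. Below is a minimal sketch, assuming the server emits OpenAI-style `data:` lines and finishes with `data: [DONE]`. `summarize_stream` is a hypothetical helper name, not part of any library.

```python
import json
from typing import Iterable, Optional, Tuple


def summarize_stream(lines: Iterable[str]) -> Tuple[str, Optional[str], bool]:
    """Collect text from OpenAI-style SSE lines.

    Returns (text, finish_reason, saw_done). Assumes each event is a line
    of the form 'data: {json}' and the stream ends with 'data: [DONE]'.
    """
    parts = []
    finish_reason = None
    saw_done = False
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and SSE comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            saw_done = True
            break
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            delta = choice.get("delta") or {}
            parts.append(delta.get("content") or "")
            if choice.get("finish_reason"):
                finish_reason = choice["finish_reason"]
    return "".join(parts), finish_reason, saw_done
```

With the `requests` library, the lines could come from `requests.post(url, json=payload, stream=True).iter_lines(decode_unicode=True)`, where `url` would be something like `http://localhost:8888/v1/chat/completions` (the path is an assumption based on the OpenAI format). A `finish_reason` of `"length"` would point to a token limit rather than a lost connection, while a stream that ends with no `finish_reason` and no `[DONE]` looks more like a real disconnect.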

OffGrid:

Also, I've tried connecting the OffGrid Android app to the Unsloth API: I can't get thinking and tools working, and the app can't list models (it only shows the loaded one).

Context Length error:

The Unsloth API seems to refuse to generate a completion if the query exceeds the context length. I expected it to use only the tokens that fit into its context and disregard the rest.
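A possible client-side workaround is to trim the conversation before sending it, so the request never exceeds the limit. This is a rough sketch only: `trim_to_context` and `rough_token_count` are hypothetical helpers, and the 4-characters-per-token estimate is an assumption that a real tokenizer will not match exactly.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]


def rough_token_count(text: str) -> int:
    # Assumption: roughly 4 characters per token; real tokenizers differ.
    return max(1, len(text) // 4)


def trim_to_context(
    messages: List[Message],
    max_tokens: int,
    count: Callable[[str], int] = rough_token_count,
) -> List[Message]:
    """Keep system messages plus the most recent turns that fit in max_tokens."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    budget = max_tokens - sum(count(m["content"]) for m in system)
    kept: List[Message] = []
    # Walk backwards from the newest message and stop at the first one
    # that no longer fits, so the kept turns stay contiguous.
    for m in reversed(rest):
        cost = count(m["content"])
        if cost > budget:
            break
        kept.append(m)
        budget -= cost
    return system + list(reversed(kept))
```

Leaving some headroom below the model's actual context length (for the reply and for estimation error) is probably wise, since the count here is only approximate.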

Claude's remarks about the API:

> What's been awkward about Unsloth Studio
>
> - usage returns 0: token counts unavailable. The "tokens/sec" column on /ai-performance is dead weight.
> - No native JSON mode: we prompt-engineer it; works but flaky on long extractions.
> - No per-stage timings in the response, exactly the thing you just asked for. Ollama reports prompt_eval_duration, eval_duration, load_duration natively in every response. Unsloth returns nothing.
> - Reasoning spills into content: we hard-disabled it. No clean separate channel.
> - One model resident at a time: /v1/load swaps. Can't keep gemma + a smaller text-only model warm side-by-side.
> - Auth + extra config: fine but adds friction.
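Until the response carries timings or token counts, they can be approximated on the client. The sketch below measures time to first chunk and chunks per second over any iterable of streamed text pieces. `time_stream` is a hypothetical helper, and treating one streamed chunk as roughly one token is an assumption, not something the API guarantees.

```python
import time
from typing import Callable, Dict, Iterable


def time_stream(
    chunks: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> Dict[str, float]:
    """Measure client-side timings while consuming streamed text chunks."""
    start = clock()
    first = None
    count = 0
    for chunk in chunks:
        if not chunk:
            continue  # ignore empty deltas
        now = clock()
        if first is None:
            first = now  # roughly prompt processing plus network latency
        count += 1
    end = clock()
    gen_time = (end - first) if first is not None else 0.0
    return {
        "time_to_first_chunk": (first - start) if first is not None else 0.0,
        "total": end - start,
        "chunks": count,
        "chunks_per_sec": count / gen_time if gen_time > 0 else 0.0,
    }
```

These numbers include network overhead, so they are not a drop-in replacement for Ollama's prompt_eval_duration and eval_duration, but they are enough to fill a rough tokens/sec column.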

I'm not sure whether I'm doing something wrong. Is anyone else having a similar experience?

u/False-Difference4010 — 18 days ago