I'm switching from Ollama to Unsloth API using the same code as Ollama (I just changed the port from 11434 to 8888). So I am using the Open-AI compatible format.
(Edit: I'm running Unsloth Studio on windows 10)
Completion:
When I ask a question to the chat completion endpoint, the API stops streaming after a while (let's say around 500 tokens), as if the connection was lost. If I switch to ollama's port it works without problem.
Offgrid:
Also, I've tried connecting OffGrid Android App to Unsloth API: I cannot get the thinking and tools working, and the App cannot list models (only the loaded one).
Context Length error:
Unsloth API seems to refuse generating a completion is the query exceeds the context length. I was expecting it to only care about the tokens that fit into it's context and disregard tokens that are out of context?
Claude Remarks about the API: > What's been awkward about Unsloth Studio usage returns 0 — token counts unavailable. The "tokens/sec" column on /ai-performance is dead weight. No native JSON mode — we prompt-engineer it; works but flaky on long extractions. No per-stage timings in the response — exactly the thing you just asked for. Ollama reports prompt_eval_duration, eval_duration, load_duration natively in every response. Unsloth returns nothing. Reasoning spills into content — we hard-disabled it. No clean separate channel. One model resident at a time — /v1/load swaps. Can't keep gemma + a smaller text-only model warm side-by-side. Auth + extra config — fine but adds friction.
I'm not sure if I'm doing anything wrong, or is anyone having a similar experience here?