r/unsloth

▲ 117 r/unsloth+22 crossposts

I would like to share my latest open source local LLM inference tool implemented in C#. It supports models like Gemma4, Qwen3.6 with multi-modal (image, vision, audio), reasoning and function tool. It can run on Windows/MacOS/Linux and fully leverage GPU's capability. The API is completely compatible with OpenAI and Ollama interface.

Really appreciated if you can try it and give me some feedback. If you like it, it will be a big thank you if you can star it. Thank you very much!

u/fuzhongkai — 2 hours ago

▲ 8 r/unsloth

What framework/app do you guys use to make Notebooks to train?

I'm genuinely losing my mind over this 😭.

Every time I ask ChatGPT or Claude to make a training notebook, it looks good at first, but then I end up spending like 5 hours debugging it. One fix leads to another, dependencies break, cells are out of order, training crashes halfway through and by the end I've made basically zero progress.

How are you guys making your training notebooks? Do you use a framework, template, or some app that actually produces solid notebooks, or does everyone just write them by hand?

I feel like there has to be a better workflow that I'm completely missing.

reddit.com

u/Capital_Savings_9942 — 23 hours ago

▲ 4 r/unsloth

Unsloth isnt offloading enough to my gpu ?

After trying out unsloth studio for a bit, i was really happy with the results, but then i decided to switch to linux (Unbuntu). MOE models are running flawlessly (Qwen 3.6 35B A3B for example), but for small models (Qwen 3.5 4B ), unsloth is lefting out a lot of vram when loading the model, and when using it, gpu usage doesnt go above 30%.

When running the same model with max gpu offloading through llama.cpp, all my vram is used correctly, i get 100% gpu usage and over twice the output speed. Is this an issue related to the auto fit system of unsloth, that doesnt work with my hardware ?

For context i have 16go of ram and a rtx4050 with 6gb vram

I get around 22 token/s with Q2 Qwen 3.6 35B A3B and only 12 /s with Qwen 3.5 4B ( up to 32/s when running 4B through command line )

reddit.com

u/Thin_Board4223 — 24 hours ago

▲ 397 r/unsloth+34 crossposts

browser-search — three tools, zero cost, and your AI agent learns to search and browse the web

/r/Hermes/comments/1uclwgi/browsersearch_three_tools_zero_cost_and_your_ai/

u/Ill-Tradition1362 — 4 days ago

▲ 68 r/unsloth

Ornith-1.0-35B

When will Unsloth make its own version of https://huggingface.co/deepreinforce-ai/Ornith-1.0-35B-GGUF?

🤔🤔🤔

u/Temporary-Roof2867 — 4 days ago

▲ 21 r/unsloth

BitTern / CAT-Q: Cost-efficient and Accurate Ternary Quantization for LLMs

I stumbled across this cool paper, link below. They're using back propagation on the weights to progressively fit ternary using layer by layer reconstruction error.

It's surprising it works so well using a much smaller dataset than Unsloth, 512 x 2048, so 1 million tokens, and the loss being minimized is not KL divergence on the text generation, but reconstruction error inside the model.

The way it learns to scale and shift blocks of parameters - learnable modulation - seems conceptually similar to Unsloth, but rather than then applying hard quants, backpropping steadily more aggressive soft ternerizaition using layer reconstruction loss, and that seems to be the novel bit.

In the results table, something around the Llama2-70B size looks interesting, scoring well and not dropping much across the reasoning benchmarks, so it would become viable on a 24GB card. If that's the same for other big models that could be a useful unlock. But obviously the ancient Llama2-70B was eclipsed by 27B / 35B Qwen 3.6.

But more excitingly, maybe it makes something huge like MiniMax-M3 428B A23B viable on a 128GB RAM system, especially as prefill will get a lot faster in ternary?

Making your car engine lighter with an angle grinder gets better every day!

Paper:

https://arxiv.org/html/2606.26650v1

Not yet open source:

https://github.com/IntelChina-AI/BitTern

https://preview.redd.it/f9xwwlvr1uah1.png?width=713&format=png&auto=webp&s=5d805314ca683bc3fa3134eb6e57102b1c392a69

https://preview.redd.it/p3si17q12uah1.png?width=753&format=png&auto=webp&s=baee13bf29ff6198e81c37548fc443eb24584545

reddit.com

u/Luke2642 — 4 days ago

▲ 5 r/unsloth

What's the best way to benchmark a model on SWE and gsm8k?

I've recently started training and evaluating LLMs, and I'd like to learn how people benchmark models in practice.

I'm not just interested in leaderboard scores—I want to understand the evaluation process itself. Which benchmarks are considered the most useful for reasoning, coding, math, instruction following, and general capabilities? What evaluation frameworks or tools do you use? If you've built your own evaluation pipeline, I'd also love to hear how you approached it.

If you have any recommendations for guides, papers, repositories, or other resources, I'd really appreciate it. Thanks!

reddit.com

u/Capital_Savings_9942 — 3 days ago

▲ 11 r/unsloth

Show the model used for each chat?

I use various LLMs. It would be great to be able to see which LLM was used in each chat.
Does that already exist?
If not, can it be added?
Thanks for your EXCELLENT app and AI work!
You ROCK!

reddit.com

u/WiseMathematician495 — 4 days ago

▲ 15 r/unsloth

Unsloth Nvfp4

Will there be support for exporting trained models to NVFP4 directly in Unsloth Studio in the future?

reddit.com

u/Useful_Watercress350 — 5 days ago

▲ 44 r/unsloth

Qwen-Wolrd-Agent is so much fun as simulator!

simulate a linux super computer

simulate a 2048 game

I deployed a this model (35B-A3B Q4) on a 4090 24G, it runs so fast as an Moe model.

At first, i use it as a agent in opencode, but found out its not good at coding tho it can finish some quick test real fast.

then i read this official paper of this model, and it's trained well to simulate such as terminal commands and many things like 2048 games.

there are many insteresting things i still need to do with this model.

(Note: everything in the picture is simulated by this model)

reddit.com

u/Unusual-Customer713 — 6 days ago

▲ 1.3k r/unsloth+1 crossposts

DeepSeek releases DSpark - 50%-600% faster spec decoding vs MTP

DeepSeek releases DSpark for V4 Flash & Pro, a new speculative decoding method boosting throughput by 51% to 400% vs single MTP!

DeepSeek also showed DSpark works well for other OSS models like Gemma & Qwen in their research paper as well.

They also compared to Eagle3 and DFlash, and showed DSpark performs better as well!

Github: https://github.com/deepseek-ai/DeepSpec
Paper: https://github.com/deepseek-ai/DeepSpec/blob/main/DSpark_paper.pdf
Hugging Face Model: https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro-DSpark

u/danielhanchen — 9 days ago

▲ 224 r/unsloth

Unsloth team should consider building a torrent tracker for their models

Centralization of the LLM space around HuggingFace has been extremely convenient for everyone but it's also a major risk. If it were shut down, there would be a period of chaos before the next best alternative would take its place.

Torrents seem like the best solution for LLM preservation and it shifts the storage burden back to the open-source community which wouldn't be unexpected. I would happily seed key consumer models like Qwen 3.6 and Gemma 4 models and I think many others would too.

A torrent tracker built into Unsloth Studio seems like it would be fairly easily attainable (correct me if I'm wrong....), a big win for LLM preservation, and it would further cement Unsloth's position as a leader in the open-source space.

Edit: A simple starting point would be to just add a column to this page with a link to a torrent published/verified by Unsloth.

u/Cereal_Grapeist — 8 days ago

▲ 5 r/unsloth+2 crossposts

I wanted to fine-tune an LLM on my own Git history. No tool existed to extract clean training data

Every guide on fine-tuning LLMs skips the hardest part: where do you get the data?

For code-aware models, the obvious answer is your own commit history, it's literally a record of how you think, write, and fix code. But when I tried to actually do this, I hit a wall.

Raw commit diffs are garbage for training. Merge commits. Bot-generated changelogs. "fix typo," "wip," "asdfasdf." Auto-generated lockfiles. Duplicate logic committed 6 different ways across branches. None of the existing dataset tools touched this problem.

So I spent time building git2llm, a CLI tool and Python library that turns your GitHub repositories into clean, fine-tuning-ready datasets.

What it does:

Crawls commits, PRs, and issues in parallel from any public or private repo
Runs a 4-stage cleaning pipeline:
- Drops merge commits and bot-authored noise
- Filters WIP/draft/auto-generated content
- Deduplicates using MinHash LSH (fuzzy match, not exact, catches near-identical commits too)
Outputs in Alpaca or ShareGPT format, ready to feed directly into Unsloth, LLaMA-Factory, or any SFT pipeline

The stat that surprised me most: on my own repos, the pipeline dropped 78% of raw commits before a single token hit the training set. That's not a bug, that's the point. Most of what lands in git log is noise that actively hurts model quality.

Why this matters:

Fine-tuning on your own coding style is one of the few cases where you can get genuinely personalised code suggestions, not a generic GitHub Copilot, but something trained on your actual architectural decisions, naming conventions, and problem-solving patterns.

But that only works if the training data is clean. Feeding "fix stuff" commits into QLoRA is just teaching the model to be confidently wrong.

Where I used it:

I fine-tuned a base model on my own GitHub history using QLoRA via Unsloth. Hit some expected overfitting early (low data volume problem — another reason cleaning matters), but the directional results were clear: the model started picking up domain-specific patterns that generic models miss.

It's open-source. I'm looking for:

🛠 Contributors: especially around multi-repo crawling, GitHub Actions integration, and GitLab support
🧪 Testers: try it on your repos and open issues. Especially interested in edge cases: monorepos, large orgs, non-English commit messages
💡 Ideas: what cleaning heuristics am I missing? What output formats would you use?
⭐ A star if you find it useful (helps discoverability)

👉 github.com/athuKawale/git2llm

What would make you actually use a tool like this? Drop it below, genuinely trying to make this useful for the fine-tuning community, not just a side project that rots in a repo.

u/athukawale — 6 days ago

▲ 11 r/unsloth

Would vLLM help in my setup? (Single GPU + CPU offload, multi-user)

I have an RTX 3060 12GB and 16GB DDR5 RAM. Since my GPU only has 12GB VRAM, I usually offload part of the model to the CPU.
I mainly use llama.cpp with **Qwen3-36B-A3B-MXFP4 (Unsloth quant)**and MTP enabled. I currently get around 17–20 tok/s , and about 25 tok/s after disabling Flash Attention (IDK seems unusual) and tweaking a few flags (though my setup is probably still not well optimized).
My workload is mostly:
Daily AI assistant use
Coding
Multi-user inference (typically 4–5 concurrent users at peak)
Given that I’m using a single RTX 3060 with CPU offloading ,would switching to vLLM provide any real benefit for multi-user serving? I’m fine with INT4 or other quantized weights if supported.
From what I understand, vLLM is great for batching and concurrent requests, but I’m unsure whether that advantage still applies when a significant portion of the model is CPU-offloaded due to limited VRAM.
Has anyone compared llama.cpp vs vLLM in a setup like this?

reddit.com

u/zyxciss — 7 days ago

▲ 1 r/unsloth

Hi all, I am new to Data Science & AI Development and have 2 questions (so far)...

I've installed gemma4: e4b and a few other models locally and would also like to choose a qwen model as well. Any model suggestions as my hardware is limited to only 8gb unified RAM on a 2020 MacBook Pro M1 at the moment?
I am looking to create a few projects to showcase my skills/competency. Open to project suggestions that can be pushed to my Git Repo.

I love learning. Thank you in advance, I know it will be a long road to mastery, but I am taking it piece by piece and want to continue until I can upgrade my hardware.

reddit.com

u/Superfly022 — 6 days ago

▲ 19 r/unsloth+3 crossposts

I fine-tuned Llama 3.1 8B on the public-domain works of a 19th-century author (niche PT-BR domain model) — GGUF + dataset open

Sharing a small solo project in case it's useful to anyone doing domain-specific fine-tunes in non-English languages.

I trained a Portuguese (PT-BR) model on the complete works of Allan Kardec — the 19th-century codifier of Spiritism. The whole corpus is public domain (he died in 1869), which made it a clean dataset to work with for a religious/philosophical domain.

Setup:

- Base: Llama 3.1 8B Instruct

- Method: QLoRA (4-bit) via Unsloth, on a single T4

- Data: ~4,896 Q&A pairs in ShareGPT format, built from the full works

- Format: GGUF Q4_K_M for Ollama / llama.cpp, plus the safetensors adapter

The goal was a study assistant that cites its source (book, chapter, question) instead of just asserting things. It's a research/replication artifact, not a product — Apache-2.0, and the dataset is public too.

Honest limitations: it's an 8B, so specific citations (question numbers, chapters) can still be wrong — the concept tends to be right, the exact reference not always. I treat it as a study aid, not a source of truth.

To my surprise it's been downloaded a fair bit by people I'll never meet, which is the fun part of releasing open weights.

Models + dataset: huggingface.co/ia-espirita

Happy to answer anything about the data prep or training — and if anyone's done domain fine-tunes on niche public-domain corpora, I'd love to hear what worked for you.

https://iaespirita.com.br/noticias/modelos-riv-ai-1260-downloads-hugging-face

u/SideSuspicious8083 — 8 days ago

▲ 2 r/unsloth

Unsloth Studio on DGX Spark?

Has anyone successfully managed to finetune using Unsloth Studio on a DGX Spark?

reddit.com

u/schnauzergambit — 7 days ago

▲ 556 r/unsloth+2 crossposts

1-bit GLM-5.2 GGUF vs. Claude 4.8 Opus vs. GPT-5.5

Hey guys, we gave 3 models the same prompt and compared one-shot outputs.

It's not necessarily to pick out the winner but we want to showcase that 1-bit can actually perform well. GLM-5.2 at 1-bit was a first try one-shot attempt.

The 1-bit GLM-5.2 GGUF ran locally on a Mac Studio M3 Ultra 256GB RAM at ~21.6 tok/s.

Which do you like best?

GGUF: https://huggingface.co/unsloth/GLM-5.2-GGUF
Guide: https://unsloth.ai/docs/models/glm-5.2

You can run it in unsloth studio!

u/de4dee — 13 days ago

▲ 3 r/unsloth

GLM-5.2 5-bit Quantized Error

I am very new to LLM's but I have a nasa server with 64 cores 1.34tb ram but no GPU. I am attempting to run GLM-5.2 5-bit quantized (UD-Q6_K) in the web ui. It is important to note that my 1.3tb's of ram are split across several numa nodes. Below is my numa hardware.

```

free -h

numactl --hardware

total used free shared buff/cache available

Mem: 1.3Ti 35Gi 673Gi 10Mi 623Gi 1.3Ti

Swap: 2.0Gi 4.2Mi 2.0Gi

available: 2 nodes (0-1)

node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31

node 0 size: 677252 MB

node 0 free: 465594 MB

node 1 cpus: 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63

node 1 size: 677360 MB

node 1 free: 223814 MB

node distances:

node 0 1

0: 10 21

1: 21 10

```

I've tried many different things with claude but I ran out of tokens but it hasn't really been working. I've made progress by getting the model to load to 99% but then studio stops running and goes into a retry loop. With NUMA enabled, llama.cpp creates multiple CPU backends (one per NUMA node)

Attached in the thread is an image of the assert that is failing.

https://preview.redd.it/nkogpx57y2ah1.png?width=767&format=png&auto=webp&s=d42f89f1a94af51289d5e5427874948c41bd0921

reddit.com

u/Complex-Fun-3039 — 8 days ago

▲ 0 r/unsloth+1 crossposts

A guy called "cerealpotatochipssea" in HF cyberbulled me.

Here's the full story: he was uploading g*re, c*rn, and n*zi content. I didnt see it cuz i was late. it was already marked as explicit content. however then he did some femboy cat stuff. i said "holy cow". then , he sended me a image of a guy with a middle finger. links: https://huggingface.co/spaces/SupraLabs/Blog/discussions/13#6a3eed1c654783004bc17eb2, https://huggingface.co/Tralalabs/TralaLabs-16M-Base/discussions/1#6a3eeb268fd17dd3073f2566, https://huggingface.co/spaces/SupraLabs/SupraWeather-Nano-Demo/discussions/1#6a3eea0595b7642a18cce9be, https://huggingface.co/SupraLabs/SupraWeather-Nano-Preview/discussions/2#6a3ee7f3a0f206b2d81ede50

u/localenjoyerllm — 10 days ago

r/unsloth

What framework/app do you guys use to make Notebooks to train?

Unsloth isnt offloading enough to my gpu ?

browser-search — three tools, zero cost, and your AI agent learns to search and browse the web

Ornith-1.0-35B

BitTern / CAT-Q: Cost-efficient and Accurate Ternary Quantization for LLMs

What's the best way to benchmark a model on SWE and gsm8k?

Show the model used for each chat?

Unsloth Nvfp4

Qwen-Wolrd-Agent is so much fun as simulator!

DeepSeek releases DSpark - 50%-600% faster spec decoding vs MTP

Unsloth team should consider building a torrent tracker for their models

I wanted to fine-tune an LLM on my own Git history. No tool existed to extract clean training data

Would vLLM help in my setup? (Single GPU + CPU offload, multi-user)

Hi all, I am new to Data Science &amp; AI Development and have 2 questions (so far)...

I fine-tuned Llama 3.1 8B on the public-domain works of a 19th-century author (niche PT-BR domain model) — GGUF + dataset open

Unsloth Studio on DGX Spark?

1-bit GLM-5.2 GGUF vs. Claude 4.8 Opus vs. GPT-5.5

GLM-5.2 5-bit Quantized Error

A guy called "cerealpotatochipssea" in HF cyberbulled me.

Hi all, I am new to Data Science & AI Development and have 2 questions (so far)...