I made my own organization on Hugging Face solely for releasing small distills of bigger models

I recently started my own Hugging Face org called CoNDeNse-AI focused on making smaller, lightweight distilled AI models that are easier to run on normal hardware 🙌

Org: https://huggingface.co/CoNDeNse-AI

Most of the training is done on Kaggle using 2x T4 GPUs, so a big part of the project is figuring out how to get the best possible results from limited hardware. Because of this, we unfortunately can't make proper distills from the newer/larger Qwen 3.5 base models yet, since Kaggle struggles heavily with them during training and distillation.

Some current projects are:

- GLM-5.1-Qwen3-1.7B-CoNDeNse

- GLM-5.1-Qwen3-0.6B-CoNDeNse

- GLM-5.1-Qwen3-1.7B-CoNDeNse-GGUF

The 1.7B versions mainly focus on preserving reasoning, coding, and multilingual capabilities while reducing overhead. The 0.6B variant is more focused on accessibility and lower-end hardware support. The GGUF release is aimed at easier local inference in tools like llama.cpp and LM Studio 💻

The org is still very experimental, so alongside proper releases there are also research checkpoints, quantization tests, and random experiments that may or may not work 😅

Would love feedback from people working on low-resource training/distillation setups.

u/Capital_Savings_9942 — 4 days ago

▲ 0 r/LocalLLM

Cooked up a new Qwen3-8B coding model that actually "thinks" before it types (HyperThinkCode-v1.5)

Hey everyone!

I just dropped a new 4-bit QLoRA fine-tune based on Qwen3-8B under my org, Cyprus. If you're into models that map out their logic before just blindly spitting out scripts, you might want to give this a spin. It's called HyperThinkCode-Qwen3-8B-v1.

Model Link: https://huggingface.co/Andy-ML-And-AI/HyperThinkCode-Qwen3-8B-v1

The Vibe: "Think first, code second"

The main goal here was to force the model to explicitly reason before writing the final code. I used a 30k subset of the Sashvat/HyperThink-X-Nvidia-Opencode-Reasoning-200K dataset and tweaked the chat template so the assistant responds inside a thinking field first. Basically, it talks to itself to figure out the problem, then it gives you the code.
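If you want to separate the model's self-talk from the final code when post-processing outputs, something like the sketch below might help. It assumes the reasoning arrives wrapped in Qwen3-style `<think>...</think>` tags. The post only says the assistant responds inside a "thinking field", so the exact delimiters are an assumption, and `split_thinking` is a hypothetical helper name, not part of any library.

```python
import re

# Assumption: reasoning is wrapped in Qwen3-style <think>...</think> tags.
# Adjust the pattern if the model's chat template uses different markers.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_thinking(text: str) -> tuple[str, str]:
    """Return (reasoning, answer) extracted from a raw completion string."""
    match = THINK_RE.search(text)
    if match is None:
        # No thinking block found: treat the whole completion as the answer.
        return "", text.strip()
    reasoning = match.group(1).strip()
    # Everything outside the first thinking block is kept as the answer.
    answer = (text[:match.start()] + text[match.end():]).strip()
    return reasoning, answer


raw = "<think>Need a loop over the list.</think>\nprint(sum([1, 2, 3]))"
reasoning, answer = split_thinking(raw)
print("reasoning:", reasoning)
print("answer:", answer)
```

Only the first thinking block is treated as reasoning here. If the model can emit several, the pattern would need `findall` instead.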

How I cooked it up:

Base: Qwen3-8B
Hardware: Trained on dual Tesla T4s (16GB VRAM each)
The Method: 4-bit QLoRA via Unsloth. Targeted all linear layers (Attention: q, k, v, o | MLP: gate, up, down) with Rank 16 / Alpha 16.
Time: Super quick run, just 50 steps (global batch size 8), which took about 1 hour and 17 minutes.
Context: Capped at 4096 tokens to fit reasonably complex code without letting VRAM explode.
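For a rough idea of how small the trainable part of this setup is, here is a back-of-the-envelope count of LoRA parameters for rank 16 on those seven projections. The layer dimensions are assumptions based on the published Qwen3-8B configuration (hidden size 4096, MLP size 12288, 36 layers, 32 query heads and 8 key/value heads of dimension 128); they are not stated in the post, so check the model's `config.json` before relying on the result.

```python
# Assumed Qwen3-8B dimensions (not from the post; verify against config.json).
HIDDEN = 4096
INTERMEDIATE = 12288
NUM_LAYERS = 36
Q_OUT = 32 * 128   # query heads * head dim
KV_OUT = 8 * 128   # key/value heads * head dim (grouped-query attention)
RANK = 16          # LoRA rank, taken from the post

# (in_features, out_features) of each targeted projection in one layer.
TARGETS = {
    "q": (HIDDEN, Q_OUT),
    "k": (HIDDEN, KV_OUT),
    "v": (HIDDEN, KV_OUT),
    "o": (Q_OUT, HIDDEN),
    "gate": (HIDDEN, INTERMEDIATE),
    "up": (HIDDEN, INTERMEDIATE),
    "down": (INTERMEDIATE, HIDDEN),
}


def lora_params(in_features: int, out_features: int, rank: int) -> int:
    # LoRA adds two matrices per layer: A (rank x in) and B (out x rank).
    return rank * (in_features + out_features)


per_layer = sum(lora_params(i, o, RANK) for i, o in TARGETS.values())
total = per_layer * NUM_LAYERS
print(f"{per_layer:,} LoRA parameters per layer, {total:,} in total")
```

Under these assumptions the adapters come to roughly 43.6M parameters, well under 1% of an 8B base model. With alpha equal to the rank, the usual LoRA scaling factor alpha / rank is 1.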

Even with just 50 steps, the training loss dropped nicely (0.8177 down to 0.6785). I'm currently running lm-eval benchmarks on HumanEval and GSM8K to see exactly how it stacks up against the base Qwen3-8B.
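To put those numbers in perspective, here is some quick arithmetic using only the figures quoted in this post. The wall-clock time is approximate ("about 1 hour and 17 minutes"), and the token count is an upper bound that assumes every sequence fills the full 4096-token context, which real samples will not.

```python
# All inputs are figures quoted in the post.
steps = 50
global_batch = 8
max_seq_len = 4096
subset_size = 30_000
wall_clock_s = (1 * 60 + 17) * 60          # about 1 h 17 min, in seconds
loss_start, loss_end = 0.8177, 0.6785

sequences_seen = steps * global_batch       # training examples consumed
max_tokens_seen = sequences_seen * max_seq_len  # upper bound, not a measurement
seconds_per_step = wall_clock_s / steps
subset_fraction = sequences_seen / subset_size
relative_loss_drop = (loss_start - loss_end) / loss_start

print(f"sequences seen:      {sequences_seen}")
print(f"max tokens seen:     {max_tokens_seen:,}")
print(f"seconds per step:    {seconds_per_step:.1f}")
print(f"share of 30k subset: {subset_fraction:.1%}")
print(f"relative loss drop:  {relative_loss_drop:.1%}")
```

The run touches only a small slice of the 30k subset, which is worth keeping in mind when reading the benchmark results. The loss drop compares two logged values, so some of it may be step-to-step noise.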

Running it

Since it’s an 8B, it’s super lightweight and easy to daily-drive. If you want to fire it up in Python using Unsloth, here is the quick snippet:

Python

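from unsloth import FastLanguageModel

# Load the fine-tuned model and its tokenizer from the Hugging Face Hub
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "Andy-ML-And-AI/HyperThinkCode-Qwen3-8B-v1",
    max_seq_length = 4096,  # same context cap used during training
    load_in_4bit = True,    # 4-bit weights to keep VRAM usage low
)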

I'd love for you guys to test it out against whatever local coding models you're currently using and let me know if the extra "hyperthinking" layer actually helps with your workflows!

u/Capital_Savings_9942 — 13 days ago

▲ 5 r/Lora+3 crossposts

I have written a technical report that looks at ways to optimize memory and compute for training large language models when resources are limited.

The report groups over 20 techniques into categories such as:

Model state partitioning, such as ZeRO and FSDP
Quantization-based methods, such as QLoRA and NF4
Activation memory management, including checkpointing
I/O-aware kernel optimizations, such as FlashAttention and kernel fusion
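As a small illustration of the model state partitioning category, the sketch below estimates per-GPU memory for model states under the ZeRO stages. It is not taken from the report. It uses the standard accounting for mixed-precision Adam (2 bytes of fp16 weights, 2 bytes of fp16 gradients, and 12 bytes of fp32 optimizer state per parameter) and ignores activations, buffers, and fragmentation. The function name `zero_bytes_per_gpu` is hypothetical.

```python
def zero_bytes_per_gpu(num_params: float, num_gpus: int, stage: int) -> float:
    """Estimate model-state bytes per GPU for mixed-precision Adam under ZeRO.

    stage 0: plain data parallelism, everything replicated
    stage 1: optimizer states partitioned across GPUs
    stage 2: optimizer states and gradients partitioned
    stage 3: optimizer states, gradients, and weights partitioned
    """
    if stage not in (0, 1, 2, 3):
        raise ValueError("stage must be 0, 1, 2, or 3")
    weights = 2 * num_params   # fp16 weights
    grads = 2 * num_params     # fp16 gradients
    optim = 12 * num_params    # fp32 master weights, momentum, variance
    if stage >= 1:
        optim /= num_gpus
    if stage >= 2:
        grads /= num_gpus
    if stage >= 3:
        weights /= num_gpus
    return weights + grads + optim


# Example: a 1B-parameter model on 2 GPUs.
for stage in range(4):
    gb = zero_bytes_per_gpu(1e9, 2, stage) / 1e9
    print(f"ZeRO stage {stage}: {gb:.1f} GB of model states per GPU")
```

With only two GPUs the savings are modest, since each partitioned component is just halved. The benefit grows with the number of GPUs sharing the states.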

It also covers:

How well different hardware generations (Turing, Ampere, and Hopper) support these techniques
Tables comparing VRAM reduction against compute overhead
Example setups for both single-GPU and multi-GPU cluster training

My goal with this report was to bring together ideas from theory and systems into one place that people can reference.

I would really like to hear any thoughts or corrections people might have.

I am also getting ready to send this work to arXiv. I need someone to endorse it for cs.AI and cs.LG.

I have an arXiv endorsement code (EKKH4F).
I can forward the official arXiv email with the endorsement link if you’re willing to help.

If someone who knows this area is willing to look it over and endorse it, that would be great.

drive.google.com

u/Capital_Savings_9942 — 18 days ago