▲ 1 r/csharp

TensorSharp Day-1 Support Google's Diffusion Gemma Model (GGUF-Unsloth)

Here is a screenshot showing how Diffusion Gemma working in TensorSharp. I run it locally on my RTX3060 Mobile 16GB, and the model is diffusiongemma-26B-A4B-it-Q4_K_M. Here is the model card: DiffusionGemma model card.

So far, ggml backend is optimized and fastest. MLX, CUDA and CPU backend is still under optimization. Because it's a diffusion model, KV cache and continuous batching in auto-regression model won't be applied for this type of model, so it will be slower when multi-request get processed in parallel.

Any feedback and comment is welcome, and if you like it, it would be appreicated if you can give this project a star in Github. Thanks in advance.

u/fuzhongkai — 24 days ago

TensorSharp Day-1 Supports Diffusion Gemma Model (GGUF-Unsloth)

Here is a screenshot showing how Diffusion Gemma working in TensorSharp. I run it locally on my RTX3060 Mobile 16GB, and the model is diffusiongemma-26B-A4B-it-Q4_K_M. Here is the model card: DiffusionGemma model card.

So far, ggml backend is optimized and fastest. MLX, CUDA and CPU backend is still under optimization. Because it's a diffusion model, KV cache and continuous batching in auto-regression model won't be applied for this type of model, so it will be slower when multi-request get processed in parallel.

Any feedback and comment is welcome, and if you like it, it would be appreicated if you can give this project a star in Github. Thanks in advance.

u/fuzhongkai — 24 days ago
▲ 29 r/unsloth+1 crossposts

TensorSharp Day-1 Supports Unsloth Diffusion Gemma Model

Here is a screenshot showing how Diffusion Gemma working in TensorSharp. I run it locally on my RTX3060 Mobile 16GB, and the model is diffusiongemma-26B-A4B-it-Q4_K_M. Here is the model card: DiffusionGemma model card.

So far, ggml backend is optimized and the fastest backend. MLX, CUDA and CPU backends are still under optimization. Because it's a diffusion model, KV cache and continuous batching in auto-regression model won't be applied for this type of model, so it will be slower when multi-request get processed in parallel.

Any feedback and comment is welcome, and if you like it, it would be appreicated if you can give this project a star in Github. Thanks in advance.

u/fuzhongkai — 24 days ago
▲ 172 r/LocalAIServers+20 crossposts

I would like to share my latest open source local LLM inference tool implemented in C#. It supports models like Gemma4, Qwen3.6 with multi-modal (image, vision, audio), reasoning and function tool. It can run on Windows/MacOS/Linux and fully leverage GPU's capability. The API is completely compatible with OpenAI and Ollama interface.

Really appreciated if you can try it and give me some feedback. If you like it, it will be a big thank you if you can star it. Thank you very much!

u/fuzhongkai — 20 hours ago