u/Dolboyob77 — reddlx

▲ 4 r/LocalLLM+1 crossposts

INTEL ARC PRO B70 ORNITH-1.0 with ovms results

Hello, here are my latest results for the Ornith model on a single Intel Arc Pro B70 GPU, served with OVMS.
Context is set to 65k.
Results are much faster than GGUF with SYCL or Vulkan on llama.cpp. OVMS FOR THE WIN 🎉🎉🎉

model |test |t/s |peak t/s |ttfr (ms) |est_ppt (ms) |e2e_ttft (ms)
Ornith-1.0-35B-AEON-Ultimate-Uncensored-BF16-int4-ov |pp2048 |1920.52 ± 21.98 | |983.74 ± 13.20 |982.74 ± 13.20 |1020.71 ± 13.90
Ornith-1.0-35B-AEON-Ultimate-Uncensored-BF16-int4-ov |tg32 |88.84 ± 0.51 |91.71 ± 0.53 | | |

reddit.com

u/Dolboyob77 — 1 day ago

▲ 7 r/LocalLLM+1 crossposts

B70 OVMS vs VLLM vs VLLM+mtp

Hello, I ran a small test on qwen3.6-27b-int4 and qwen3.6-35b-a3b to compare the latest updates of OpenVINO (OVMS), vLLM and vLLM+MTP3 on the Intel Arc Pro B70. Here are the results:

Qwen3.6-27b-int4 vllm :

model |test |t/s |peak t/s |ttfr (ms) |est_ppt (ms) |e2e_ttft (ms)
/models/Qwen3.6-27B-GPTQ-Int4 |pp2048 |2108.26 ± 10.94 | |874.50 ± 21.54 |871.92 ± 21.54 |874.50 ± 21.54
/models/Qwen3.6-27B-GPTQ-Int4 |tg32 |28.49 ± 0.03 |29.00 ± 0.00 | | |

Qwen3.6-27b-int4 ovms:

model |test |t/s |peak t/s |ttfr (ms) |est_ppt (ms) |e2e_ttft (ms)
qwen3.6-27B |pp2048 |1618.20 ± 50.04 | |1169.91 ± 14.48 |1168.82 ± 14.48 |1220.32 ± 14.34
qwen3.6-27B |tg32 |37.00 ± 0.05 |38.09 ± 0.05 | | |

Qwen3.6-27b-int4 vllm + mtp3:

model |test |t/s |peak t/s |ttfr (ms) |est_ppt (ms) |e2e_ttft (ms)
/models/Qwen3.6-27B-GPTQ-Int4 |pp2048 |1961.68 ± 25.06 | |986.27 ± 40.52 |985.05 ± 40.52 |986.27 ± 40.52
/models/Qwen3.6-27B-GPTQ-Int4 |tg32 |60.87 ± 6.01 |62.83 ± 6.21 | | |

Qwen3.6-35b-a3b-int4 vllm:

model |test |t/s |peak t/s |ttfr (ms) |est_ppt (ms) |e2e_ttft (ms)
/models/Qwen3.6-35B-A3B-GPTQ-Int4 |pp2048 |9629.41 ± 299.45 | |194.64 ± 3.53 |192.47 ± 3.53 |194.64 ± 3.53
/models/Qwen3.6-35B-A3B-GPTQ-Int4 |tg32 |36.91 ± 0.22 |38.10 ± 0.22 | | |

Qwen3.6-35b-a3b-int4 ovms :

model |test |t/s |peak t/s |ttfr (ms) |est_ppt (ms) |e2e_ttft (ms)
qwen3.6-35B |pp2048 |1901.66 ± 35.84 | |994.96 ± 14.63 |994.31 ± 14.63 |1033.41 ± 7.93
qwen3.6-35B |tg32 |101.93 ± 0.20 |106.73 ± 1.43 | | |

Qwen3.6-35b-a3b-int4 vllm + mtp3 :

model |test |t/s |peak t/s |ttfr (ms) |est_ppt (ms) |e2e_ttft (ms)
/models/Qwen3.6-35B-A3B-GPTQ-Int4 |pp2048 |7949.33 ± 12.86 | |235.01 ± 2.21 |233.73 ± 2.21 |235.01 ± 2.21
/models/Qwen3.6-35B-A3B-GPTQ-Int4 |tg32 |86.90 ± 7.78 |89.71 ± 8.03 | | |
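
To make the comparison easier to read, here is a small sketch that expresses each backend as a multiple of plain vLLM. The numbers are the mean t/s values copied from the tables above; the dictionary layout and the `relative_to` helper are just illustrative names, not part of any benchmark tool.

```python
# Mean throughput in tokens/s, copied from the tables above.
# tg32 = token generation (decode), pp2048 = prompt processing (prefill).
tg32 = {
    "Qwen3.6-27B": {"vllm": 28.49, "ovms": 37.00, "vllm+mtp3": 60.87},
    "Qwen3.6-35B-A3B": {"vllm": 36.91, "ovms": 101.93, "vllm+mtp3": 86.90},
}
pp2048 = {
    "Qwen3.6-27B": {"vllm": 2108.26, "ovms": 1618.20, "vllm+mtp3": 1961.68},
    "Qwen3.6-35B-A3B": {"vllm": 9629.41, "ovms": 1901.66, "vllm+mtp3": 7949.33},
}


def relative_to(results, baseline="vllm"):
    """Return every backend's throughput as a multiple of the baseline backend."""
    return {
        model: {
            backend: round(value / runs[baseline], 2)
            for backend, value in runs.items()
        }
        for model, runs in results.items()
    }


if __name__ == "__main__":
    for label, table in (("tg32", tg32), ("pp2048", pp2048)):
        for model, ratios in relative_to(table).items():
            print(label, model, ratios)
```

Read this way, OVMS decodes about 1.3x faster than plain vLLM on the 27B model and about 2.76x faster on the 35B-A3B model, while vLLM keeps the lead on prompt processing in both cases. These ratios ignore the ± spread, which is fairly large for the MTP runs.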


u/Dolboyob77 — 15 days ago

▲ 1 r/LocalLLM

VLLM + MTP + B70 = super fast !!!

I was getting around 32 t/s on qwen3.6-27b GPTQ int4 with Intel scaler, so I decided to build vLLM myself from the latest available vLLM image. This image supports MTP and I could reach 52 t/s!!! ))) What a difference!!!

(APIServer pid=1) INFO 06-14 17:42:20 [metrics.py:120] SpecDecoding metrics: Mean acceptance length: 2.20, Accepted throughput: 20.80 tokens/s, Drafted throughput: 52.20 tokens/s, Accepted: 208 tokens, Drafted: 522 tokens, Per-position acceptance rate: 0.644, 0.385, 0.167, Avg Draft acceptance rate: 39.8%
(APIServer pid=1) INFO 06-14 17:42:30 [loggers.py:273] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 44.1 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 13.3%, Prefix cache hit rate: 0.0%
(APIServer pid=1) INFO 06-14 17:42:30 [metrics.py:120] SpecDecoding metrics: Mean acceptance length: 2.58, Accepted throughput: 27.00 tokens/s, Drafted throughput: 51.30 tokens/s, Accepted: 270 tokens, Drafted: 513 tokens, Per-position acceptance rate: 0.684, 0.526, 0.368, Avg Draft acceptance rate: 52.6%
(APIServer pid=1) INFO 06-14 17:42:40 [loggers.py:273] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 52.8 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 13.3%, Prefix cache hit rate: 0.0%
(APIServer pid=1) INFO 06-14 17:42:40 [metrics.py:120] SpecDecoding metrics: Mean acceptance length: 3.11, Accepted throughput: 35.80 tokens/s, Drafted throughput: 50.99 tokens/s, Accepted: 358 tokens, Drafted: 510 tokens, Per-position acceptance rate: 0.853, 0.676, 0.576, Avg Draft acceptance rate: 70.2%
(APIServer pid=1) INFO 06-14 17:42:50 [loggers.py:273] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 39.8 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
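
The SpecDecoding lines can be cross-checked with a bit of arithmetic. The sketch below parses one line copied from the log and recomputes two figures: the draft acceptance rate as accepted / drafted tokens, and the mean acceptance length as 1 plus the sum of the per-position acceptance rates. That second relationship is only an observation that fits the numbers in these logs, not a claim about how vLLM computes it internally. The function names are made up for this example.

```python
import re

# One SpecDecoding line copied from the log above (timestamp prefix trimmed).
LOG_LINE = (
    "SpecDecoding metrics: Mean acceptance length: 2.20, "
    "Accepted throughput: 20.80 tokens/s, Drafted throughput: 52.20 tokens/s, "
    "Accepted: 208 tokens, Drafted: 522 tokens, "
    "Per-position acceptance rate: 0.644, 0.385, 0.167, "
    "Avg Draft acceptance rate: 39.8%"
)


def parse_spec_metrics(line):
    """Extract the reported counters and rates from one SpecDecoding log line."""
    rates_text = re.search(
        r"Per-position acceptance rate: ([\d., ]+?), Avg", line
    ).group(1)
    return {
        "reported_length": float(
            re.search(r"Mean acceptance length: ([\d.]+)", line).group(1)
        ),
        "reported_pct": float(
            re.search(r"Avg Draft acceptance rate: ([\d.]+)%", line).group(1)
        ),
        "accepted": int(re.search(r"Accepted: (\d+) tokens", line).group(1)),
        "drafted": int(re.search(r"Drafted: (\d+) tokens", line).group(1)),
        "rates": [float(x) for x in rates_text.split(", ")],
    }


def derived_metrics(m):
    """Recompute acceptance figures from the raw counters and per-position rates."""
    return {
        # Assumption: one guaranteed token per step plus the accepted draft tokens.
        "mean_acceptance_length": 1.0 + sum(m["rates"]),
        "draft_acceptance_pct": 100.0 * m["accepted"] / m["drafted"],
    }


if __name__ == "__main__":
    metrics = parse_spec_metrics(LOG_LINE)
    print(metrics)
    print(derived_metrics(metrics))
```

With three draft positions per step, the three per-position rates show how quickly acceptance falls off for the second and third speculated token, which is why the generation throughput swings between roughly 40 and 53 t/s from one log window to the next.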


u/Dolboyob77 — 21 days ago

▲ 1 r/LocalLLM

Openvino problem with Gemma4-31B-Int4-ov

Hello, I get very messed-up answers in chat when I ask a question using OpenVINO 2026.2 on the Intel Arc Pro B70. The benchmark seems legit, though.

u/Dolboyob77 — 23 days ago

▲ 1 r/LocalLLM

Knowing how to be grateful too…

After weeks of despair, owning an Intel B70 and not being able to use it properly, I can finally recognize that more and more devs are now working on the Intel scaler LLM repo, and patches and merges are being added every single day. I was fiercely accusing them of neglecting Intel GPU owners, but I must also speak up when good work is being done. Congrats to the devs and keep it going!!! Our B70s are now getting back on track ))))


u/Dolboyob77 — 25 days ago

▲ 15 r/LocalLLM+1 crossposts

Intel scaler llm working its magic on intel arc pro b70

Hello, I have been fighting for quite a while to squeeze some juice out of my B70. The best I was getting from GGUF models on llama.cpp SYCL with qwen3.6-27b-q4-kxl was around 22-25 t/s, and around 15-16 t/s on the Q8 quant.
After weeks of fighting with the Intel scaler devs to put pressure on them to finally deliver software that was up to the job, plus a few tweaks on my side, a few minutes ago I got a beautiful 60 t/s on qwen3.6-27b-gptq-int4. Finally some results worthy of this GPU!!!!
Running in a Docker container on Unraid OS.
Single B70.

u/Dolboyob77 — 29 days ago

▲ 1 r/LocalLLM

Openvino 2026.2 + intel gpu

Hello community, if anyone has succeeded in setting up OpenVINO for an Intel GPU (Arc Pro B70), could you please share your settings? I have been battling with Claude Opus for an hour and cannot even get the qwen3.6-27b model to compile to FP8. Using regex, transformers, eager… nothing works. Any help would be welcome!


u/Dolboyob77 — 1 month ago

▲ 55 r/eGPU

My new Egpu dock beast)))

Fully satisfied with my new purchase!!!
Fully customizable: TB3/4/5, USB4, OCuLink. Supports all GPU sizes. RGB lights and fans are included to cool down the GPU, and the best part is that amazing screen showing important info such as live power consumption in watts. Requires your own PSU. Many options to power up the beast: tx60/DC/8-pin… It comes with a double 8-pin connector cable and a USB4 cable included!

u/Dolboyob77 — 1 month ago

▲ 2 r/LocalLLM

Scaler LLM for Intel: big update to run 6-month-old models…

So today, May 20th 2026, we finally received a long-awaited update to scaler LLM for Intel GPUs!!!! FINALLY!!! I was so excited… until… I read the supported models: Qwen2.5 and so on… This is a F……g joke…!!!!!! Please, if someone can teach me how to compile and upgrade these things, I am willing to work on it and deliver a decent update… one that is actually up to date!!!!


u/Dolboyob77 — 2 months ago

▲ 44 r/LocalLLM+1 crossposts

Where are the Intel devs????

I own two Intel GPUs, both Battlemage on the xe driver, paired with an Intel Core CPU. I was sold the promise of a dream land where going all-Intel would make everything so much faster and irresistible… What I came to understand is that everything is done for the Nvidia community; maybe the devs at Nvidia are more passionate or involved… llama.cpp SYCL reaches about 70% of what the Intel GPU can really achieve, and the only real reason to buy an Intel GPU was IPEX vLLM, which has now been replaced by Intel scaler vLLM… but they obviously ship an update only every 6 weeks or even less often… So we have GPUs that are just sitting there half asleep… Come on… our GPUs were meant to run vLLM!!!! But what is the point of running models that are 2-3 months old or more??? Each time I try to launch a model on Unraid OS, the container crashes because the repo is too old… If it goes on this way, I will resell everything and invest more in something that actually works… I was not asking for the same tokens per second as Nvidia, because their memory bandwidth is higher… But getting something that actually works would be the minimum, no?
Intel Core Ultra 9 285H with 96 GB RAM
Intel Arc Pro B70
Intel Arc B580
If I use llama.cpp SYCL with GGUF models, yes, it works, but it is not optimized and I get way less than what the GPU is capable of… So if there are Intel devs somewhere… can you please do something about it and update the Intel scaler vLLM??? Thanks


u/Dolboyob77 — 2 months ago

▲ 8 r/LocalLLM+1 crossposts

[Bug] llama.cpp full-intel image breaks Q8_0 models on Intel Arc GPUs - reorder_qw_q8_0 SYCL out of memory error

Hello, I ran into a problem after updating to the latest image:

Image: ghcr.io/ggml-org/llama.cpp:full-intel
Error: reorder_qw_q8_0 UR_RESULT_ERROR_OUT_OF_DEVICE_MEMORY
GPU: Intel Arc Pro B70 + Intel Arc B580
Works fine with Q6_K_XL, crashes with Q8_0
Working version: full-intel-b9144

u/Dolboyob77 — 2 months ago

▲ 11 r/OpenWebUI

Openwebui + comfyui

Hello, has anyone succeeded in making these two work together? No matter what I try (UNet loader, checkpoint…), the workflow works when I type the prompt in ComfyUI, but as soon as I type the same prompt in Open WebUI I cannot get an image and always get errors… I import the JSON workflow and specify the prompt ID, checkpoint ID and model in Open WebUI, but nothing works… Is it because I use Flux 1 dev fp16? Does it require smaller models to work? Thanks for your input!!

SOLUTION: I finally made it work with the help of qwen3.6-27b-q8 ))) The problem is that ALL node IDs must be filled in in Open WebUI, and this setting must also be added to Open WebUI: ENABLE_RAG_LOCAL_WEB_FETCH=True. That was the fix for me )) Now working perfectly!!!


u/Dolboyob77 — 2 months ago

▲ 2 r/LocalLLM

Update for intel scaler vllm ?

Hello, I am currently using Intel scaler vLLM 14.8b2, I think, the one for the Intel Arc Pro B70. But its core is an old version, so I cannot use newer models like qwen3.6-27b-fp8. When will we see an update that lets us use the latest models in safetensors format? Thanks


u/Dolboyob77 — 2 months ago

▲ 10 r/LocalLLM+1 crossposts

Stop the " Thinking" in Openwebui

Hello, I have been going crazy trying to stop the qwen3.6-27b models from thinking in Open WebUI. I tried all sorts of arguments like nothink, no-think, jinja… nothing works. For each question I type, even "hello", it thinks for 4 minutes and then sends me choices to select as an answer… this is just ridiculous. I have tried different GGUF models in llama.cpp with SYCL (I have an Intel Arc Pro B70), going from qwen3.6 q4 to q8. They all load fine without error, but I can't get a proper answer… Just thinking forever and answering my hello with a list of questions. Any help would be appreciated!!!!


u/Dolboyob77 — 2 months ago

▲ 2 r/openclaw

Hello everyone, newbie here, so don't scream ))) I have just installed OpenClaw on Unraid. When I first opened the container, it asked me for a gateway key, and I typed in a password. Apparently this was a mistake, because I cannot connect on the OpenClaw page: it asks for a login and password, and admin or root with the password I wrote in the container obviously does not work. Network is set to HOST. The error on the OpenClaw page is: origin not allowed (open the Control UI from the gateway host or allow it in gateway.controlUi.allowedOrigins)
So if anyone has a fix for this, in simple words and with a step-by-step guide, I would be very grateful!!! Thank you in advance )))


u/Dolboyob77 — 2 months ago

▲ 1 r/BeelinkOfficial

Hello, I added a new Gen5 x16 GPU to my EX Pro dock (Gen5 x8). The mini PC is also advertised as Gen5 x8. What a surprise when my server showed only Gen4 x8 speed, which is half the speed of Gen5. So I went to check my BIOS (T205), and all the PCIe slots show Gen 4 as the maximum speed option… Waiting on an update with the right speeds as mentioned on their website.
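
For reference, the "half the speed" figure can be worked out from the per-lane signalling rates in the PCI Express specifications (16 GT/s for Gen4, 32 GT/s for Gen5, both with 128b/130b encoding). The sketch below is only a rough estimate of one-direction link bandwidth that ignores packet and protocol overhead; the function name is made up for this example.

```python
# Raw per-lane signalling rate in GT/s for each PCIe generation.
GT_PER_S = {3: 8.0, 4: 16.0, 5: 32.0}

# Gen3 and later use 128b/130b line encoding: 128 payload bits per 130 bits sent.
ENCODING = 128 / 130


def pcie_gb_per_s(gen, lanes):
    """Approximate one-direction link bandwidth in GB/s, ignoring protocol overhead."""
    return GT_PER_S[gen] * ENCODING * lanes / 8  # divide by 8: bits to bytes


if __name__ == "__main__":
    for gen, lanes in ((4, 8), (5, 8), (5, 16)):
        print(f"Gen{gen} x{lanes}: {pcie_gb_per_s(gen, lanes):.2f} GB/s")
```

So a Gen4 x8 link tops out at roughly 15.75 GB/s per direction, against roughly 31.5 GB/s for the advertised Gen5 x8.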


u/Dolboyob77 — 2 months ago