
Waiting for Qwen 3.7 open weight... The new King has arrived...
The hype is real! https://qwen.ai/blog?id=qwen3.7

The hype is real! https://qwen.ai/blog?id=qwen3.7
After the recent releases, there's almost a sense of emptiness.
When do you think new models will be released? Looking at the chart, it's between the end of May and the beginning of June, but... I don't know why, it seems like something's changing about "open weights"
llama-server.exe --model "H:\gptmodel\AesSedai\MiMo-V2.5-GGUF\MiMo-V2.5-IQ3_S-00001-of-00004.gguf" --ctx-size 1048576 --threads 16 --host 127.0.0.1 --no-mmap --jinja --fit on --flash-attn on -sm layer --n-cpu-moe 0 --threads 16 --parallel 1 --temp 0.2
load_tensors: offloaded 49/49 layers to GPU
load_tensors: Vulkan0 model buffer size = 72842.29 MiB
load_tensors: Vulkan1 model buffer size = 34524.53 MiB
load_tensors: Vulkan_Host model buffer size = 488.91 MiB
RTX 6000 96gb+ W7800 48gb
I started testing with the IQ3 version because the second w7800 is on another machine. What's impressed me so far is the processing speed, both on llamaserver and vscode+kilocode. While minimax drops very quickly in processing and prefill t/sec at 50k context, mimo is faster and more stable.
It's still early to give an overall assessment. It tends to loop. With repetition penalty at 1.1 and temp at 0.2, the code seems to improve. Also, if it loops, stopping and restarting doesn't do it again. Perhaps it's better to use a fixed seed. This is the main problem I've encountered. I'll let you know how it goes when I break 300k context.
______________
EDIT: 346'733/1'048'576 (33%) Context ---> all good. Code works. Zero repetion with Temp 0.2 and rep penality 1.1
_____________
srv log_server_r: done request: GET /tools 127.0.0.1 404
slot update_slots: id 0 | task 125418 | new prompt, n_ctx_slot = 1048576, n_keep = 0, task.n_tokens = 344225
slot update_slots: id 0 | task 125418 | n_tokens = 344196, memory_seq_rm [344196, end)
srv log_server_r: done request: POST /v1/chat/completions 127.0.0.1 200
slot update_slots: id 0 | task 125418 | prompt processing progress, n_tokens = 344221, batch.n_tokens = 25, progress = 0.999988
slot create_check: id 0 | task 125418 | erasing old context checkpoint (pos_min = 99868, pos_max = 100635, n_tokens = 100636, size = 146.260 MiB)
[0mslot create_check: id 0 | task 125418 | created context checkpoint 32 of 32 (pos_min = 343428, pos_max = 344195, n_tokens = 344196, size = 146.260 MiB)
[0mslot update_slots: id 0 | task 125418 | n_tokens = 344221, memory_seq_rm [344221, end)
slot init_sampler: id 0 | task 125418 | init sampler, took 71.01 ms, tokens: text = 344225, total = 344225
slot update_slots: id 0 | task 125418 | prompt processing done, n_tokens = 344225, batch.n_tokens = 4
slot print_timing: id 0 | task 125418 |
prompt eval time = 1387.92 ms / 29 tokens ( 47.86 ms per token, 20.89 tokens per second)
eval time = 80336.72 ms / 2508 tokens ( 32.03 ms per token, 31.22 tokens per second)
total time = 81724.64 ms / 2537 tokens
slot release: id 0 | task 125418 | stop processing: n_tokens = 346732, truncated = 0
srv update_slots: all slots are idle
I invested quite a bit of time and it wasn't easy but finally I can run models like Minimax 2.7 Q4 using Cuda+ROCm at the same time bypassing Vulkan.
load_tensors: offloaded 63/63 layers to GPU
load_tensors: CUDA0 model buffer size = 83650.42 MiB
load_tensors: CUDA_Host model buffer size = 622.76 MiB
load_tensors: ROCm0 model buffer size = 40314.35 MiB
the main advantage is the prefill.
On windows :
rmdir /s /q build
cmake -B build -G Ninja ^
-DCMAKE_C_COMPILER="C:/Program Files/AMD/ROCm/6.4/bin/clang-cl.exe" ^
-DCMAKE_CXX_COMPILER="C:/Program Files/AMD/ROCm/6.4/bin/clang-cl.exe" ^
-DCMAKE_HIP_COMPILER="C:/Program Files/AMD/ROCm/6.4/bin/clang-cl.exe" ^
-DCMAKE_PREFIX_PATH="C:/Program Files/AMD/ROCm/6.4" ^
-DHIP_ROOT_DIR="C:/Program Files/AMD/ROCm/6.4" ^
-DGGML_HIP=ON ^
-DGGML_CUDA=ON ^
-DGGML_BACKEND_DL=ON ^
-DGGML_CPU_ALL_VARIANTS=ON ^
-DGGML_AVX_VNNI=OFF ^
-DGGML_AVX512=OFF ^
-DGGML_AVX512_VBMI=OFF ^
-DGGML_AVX512_VNNI=OFF ^
-DGGML_AVX512_BF16=OFF ^
-DGGML_AMX_TILE=OFF ^
-DGGML_AMX_INT8=OFF ^
-DGGML_AMX_BF16=OFF ^
-DCMAKE_CUDA_COMPILER="C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v13.1/bin/nvcc.exe" ^
-DCMAKE_CUDA_ARCHITECTURES="120" ^
-DCMAKE_BUILD_TYPE=Release
___________________
cmake --build build -j
_______________________
Unfortunately, this flag: -DGGML_CPU_ALL_VARIANTS=ON --> creates many compilation errors and I had to edit, for example:
notepad C:\llm\llamacpp\ggml\src\CMakeLists.txt
and remove # ggml_add_cpu_backend_variant(alderlake SSE42 AVX F16C FMA AVX2 BMI2 AVX_VNNI)
With Ryzen 5950x it's ok.
then:
set PATH=C:\Program Files\AMD\ROCm\6.4\bin;%PATH%
llama-server.exe --model "H:\gptmodel\unsloth\MiniMax-M2.7-GGUF\MiniMax-M2.7-UD-Q4_K_S-00001-of-00004.gguf" --ctx-size 91920 --threads 16 --host 127.0.0.1 --no-mmap --jinja --fit on --flash-attn on -sm layer --n-cpu-moe 0 --threads 16 --cache-type-k q8_0 --cache-type-v q8_0 --parallel 1
Done.
https://huggingface.co/XiaomiMiMo/MiMo-V2.5
Interesting because unlike its bigger brother it can be run on "more human" configurations