u/LegacyRemaster — reddlx

Waiting for Qwen 3.7 open weight... The new King has arrived...

The hype is real! https://qwen.ai/blog?id=qwen3.7

New models when? Forecasting release date.

After the recent releases, there's almost a sense of emptiness.

When do you think new models will be released? Looking at the chart, it's between the end of May and the beginning of June, but... I don't know why, it seems like something's changing about "open weights"

u/LegacyRemaster — 5 days ago

▲ 15 r/LocalLLaMA

Testing MiMo-V2.5-IQ3_S with 1'048'576 context

llama-server.exe --model "H:\gptmodel\AesSedai\MiMo-V2.5-GGUF\MiMo-V2.5-IQ3_S-00001-of-00004.gguf" --ctx-size 1048576 --threads 16 --host 127.0.0.1 --no-mmap --jinja --fit on --flash-attn on -sm layer --n-cpu-moe 0 --threads 16 --parallel 1 --temp 0.2

load_tensors: offloaded 49/49 layers to GPU

load_tensors: Vulkan0 model buffer size = 72842.29 MiB

load_tensors: Vulkan1 model buffer size = 34524.53 MiB

load_tensors: Vulkan_Host model buffer size = 488.91 MiB

RTX 6000 96gb+ W7800 48gb

I started testing with the IQ3 version because the second w7800 is on another machine. What's impressed me so far is the processing speed, both on llamaserver and vscode+kilocode. While minimax drops very quickly in processing and prefill t/sec at 50k context, mimo is faster and more stable.

It's still early to give an overall assessment. It tends to loop. With repetition penalty at 1.1 and temp at 0.2, the code seems to improve. Also, if it loops, stopping and restarting doesn't do it again. Perhaps it's better to use a fixed seed. This is the main problem I've encountered. I'll let you know how it goes when I break 300k context.

______________

EDIT: 346'733/1'048'576 (33%) Context ---> all good. Code works. Zero repetion with Temp 0.2 and rep penality 1.1

_____________

srv log_server_r: done request: GET /tools 127.0.0.1 404

slot update_slots: id 0 | task 125418 | new prompt, n_ctx_slot = 1048576, n_keep = 0, task.n_tokens = 344225

slot update_slots: id 0 | task 125418 | n_tokens = 344196, memory_seq_rm [344196, end)

srv log_server_r: done request: POST /v1/chat/completions 127.0.0.1 200

slot update_slots: id 0 | task 125418 | prompt processing progress, n_tokens = 344221, batch.n_tokens = 25, progress = 0.999988

slot create_check: id 0 | task 125418 | erasing old context checkpoint (pos_min = 99868, pos_max = 100635, n_tokens = 100636, size = 146.260 MiB)

[0mslot create_check: id 0 | task 125418 | created context checkpoint 32 of 32 (pos_min = 343428, pos_max = 344195, n_tokens = 344196, size = 146.260 MiB)

[0mslot update_slots: id 0 | task 125418 | n_tokens = 344221, memory_seq_rm [344221, end)

slot init_sampler: id 0 | task 125418 | init sampler, took 71.01 ms, tokens: text = 344225, total = 344225

slot update_slots: id 0 | task 125418 | prompt processing done, n_tokens = 344225, batch.n_tokens = 4

slot print_timing: id 0 | task 125418 |

prompt eval time = 1387.92 ms / 29 tokens ( 47.86 ms per token, 20.89 tokens per second)

eval time = 80336.72 ms / 2508 tokens ( 32.03 ms per token, 31.22 tokens per second)

total time = 81724.64 ms / 2537 tokens

slot release: id 0 | task 125418 | stop processing: n_tokens = 346732, truncated = 0

srv update_slots: all slots are idle

u/LegacyRemaster — 14 days ago

▲ 48 r/LocalLLaMA

I invested quite a bit of time and it wasn't easy but finally I can run models like Minimax 2.7 Q4 using Cuda+ROCm at the same time bypassing Vulkan.

load_tensors: offloaded 63/63 layers to GPU

load_tensors: CUDA0 model buffer size = 83650.42 MiB

load_tensors: CUDA_Host model buffer size = 622.76 MiB

load_tensors: ROCm0 model buffer size = 40314.35 MiB

the main advantage is the prefill.

On windows :

rmdir /s /q build

cmake -B build -G Ninja ^

-DCMAKE_C_COMPILER="C:/Program Files/AMD/ROCm/6.4/bin/clang-cl.exe" ^

-DCMAKE_CXX_COMPILER="C:/Program Files/AMD/ROCm/6.4/bin/clang-cl.exe" ^

-DCMAKE_HIP_COMPILER="C:/Program Files/AMD/ROCm/6.4/bin/clang-cl.exe" ^

-DCMAKE_PREFIX_PATH="C:/Program Files/AMD/ROCm/6.4" ^

-DHIP_ROOT_DIR="C:/Program Files/AMD/ROCm/6.4" ^

-DGGML_HIP=ON ^

-DGGML_CUDA=ON ^

-DGGML_BACKEND_DL=ON ^

-DGGML_CPU_ALL_VARIANTS=ON ^

-DGGML_AVX_VNNI=OFF ^

-DGGML_AVX512=OFF ^

-DGGML_AVX512_VBMI=OFF ^

-DGGML_AVX512_VNNI=OFF ^

-DGGML_AVX512_BF16=OFF ^

-DGGML_AMX_TILE=OFF ^

-DGGML_AMX_INT8=OFF ^

-DGGML_AMX_BF16=OFF ^

-DCMAKE_CUDA_COMPILER="C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v13.1/bin/nvcc.exe" ^

-DCMAKE_CUDA_ARCHITECTURES="120" ^

-DCMAKE_BUILD_TYPE=Release

___________________

cmake --build build -j

_______________________

Unfortunately, this flag: -DGGML_CPU_ALL_VARIANTS=ON --> creates many compilation errors and I had to edit, for example:

notepad C:\llm\llamacpp\ggml\src\CMakeLists.txt

and remove # ggml_add_cpu_backend_variant(alderlake SSE42 AVX F16C FMA AVX2 BMI2 AVX_VNNI)

With Ryzen 5950x it's ok.

then:

set PATH=C:\Program Files\AMD\ROCm\6.4\bin;%PATH%

llama-server.exe --model "H:\gptmodel\unsloth\MiniMax-M2.7-GGUF\MiniMax-M2.7-UD-Q4_K_S-00001-of-00004.gguf" --ctx-size 91920 --threads 16 --host 127.0.0.1 --no-mmap --jinja --fit on --flash-attn on -sm layer --n-cpu-moe 0 --threads 16 --cache-type-k q8_0 --cache-type-v q8_0 --parallel 1

Done.

u/LegacyRemaster — 22 days ago

▲ 48 r/LocalLLaMA

https://huggingface.co/XiaomiMiMo/MiMo-V2.5

Interesting because unlike its bigger brother it can be run on "more human" configurations

u/LegacyRemaster — 24 days ago