u/cyclebiff

Happy Sunday, everyone! I'm relatively new to running local LLMs (about two weeks in), so I appreciate your patience with my questions. I'm eager to learn from this community's expertise.

Background

A few weeks ago, I discovered agentic coding through my work's GitHub Copilot account. After quickly exhausting my usage limits (lesson learned about token management!), I decided to explore running Qwen models locally on my personal laptop for hobby projects.

Hardware

M2 MacBook Pro Max 96GB

Models Tested

oMLX: Qwen 3.6 27B (oQ4/oQ5/oQ6/oQ8-fp16-mtp variants)
LM Studio/GGUF: Qwen 3.6 27B (Q4_K_M, Q6_K, Q8_K)
llama.cpp: Configured per this post

Use Case

I'm primarily doing C++ and ESP32/PlatformIO development for personal projects, including:

Real-time voice modulation for cosplay costumes
Real-time bark detection logger (courtesy of my neighbor's enthusiastic dog)

Current Configuration

After implementing MTP changes, I've settled on the following setup:

Model: oMLX Qwen 3.6 27B-oQ5-fp16-mtp

Settings:

Context: 262,144
Temperature: 0.6
Top P: 0.95
Top K: 20
Min P: 0
Repetition Penalty: 1
Presence Penalty: 0
Extended thinking: Enabled
Native MTP: Enabled
oMLX caching: Enabled

IDE Setup:

VS Code with Cline extension
OpenAI-compatible API from oMLX

Workflow:

Enable PLAN mode in Cline
Request feature implementation or bug research plan
Switch to ACT mode and execute
Wait lol

Current Performance

While the quality of Qwen 3.6 (Q4-Q8) is impressive, performance could be better:

Prompt processing: ~120 tok/s
Token generation: ~15 tok/s

Question

For those running similar hardware (especially M2 users), what combination of:

Software stack (oMLX, LM Studio, llama.cpp, etc.)
Specific Qwen 3.6 model variants
Inference settings

...have you found optimal? Any suggestions for improving prompt processing and token generation speeds on M2 hardware would be greatly appreciated!

Seeking Optimization Advice: Qwen 3.6 27B Setup on M2 MacBook Pro