Seeking Optimization Advice: Qwen 3.6 27B Setup on M2 MacBook Pro
Happy Sunday, everyone! I'm relatively new to running local LLMs (about two weeks in), so I appreciate your patience with my questions. I'm eager to learn from this community's expertise.
Background
A few weeks ago, I discovered agentic coding through my work's GitHub Copilot account. After quickly exhausting my usage limits (lesson learned about token management!), I decided to explore running Qwen models locally on my personal laptop for hobby projects.
Hardware
- M2 MacBook Pro Max 96GB
Models Tested
- oMLX: Qwen 3.6 27B (oQ4/oQ5/oQ6/oQ8-fp16-mtp variants)
- LM Studio/GGUF: Qwen 3.6 27B (Q4_K_M, Q6_K, Q8_K)
- llama.cpp: Configured per this post
Use Case
I'm primarily doing C++ and ESP32/PlatformIO development for personal projects, including:
- Real-time voice modulation for cosplay costumes
- Real-time bark detection logger (courtesy of my neighbor's enthusiastic dog)
Current Configuration
After implementing MTP changes, I've settled on the following setup:
Model: oMLX Qwen 3.6 27B-oQ5-fp16-mtp
Settings:
- Context: 262,144
- Temperature: 0.6
- Top P: 0.95
- Top K: 20
- Min P: 0
- Repetition Penalty: 1
- Presence Penalty: 0
- Extended thinking: Enabled
- Native MTP: Enabled
- oMLX caching: Enabled
IDE Setup:
- VS Code with Cline extension
- OpenAI-compatible API from oMLX
Workflow:
- Enable PLAN mode in Cline
- Request feature implementation or bug research plan
- Switch to ACT mode and execute
- Wait lol
Current Performance
While the quality of Qwen 3.6 (Q4-Q8) is impressive, performance could be better:
- Prompt processing: ~120 tok/s
- Token generation: ~15 tok/s
Question
For those running similar hardware (especially M2 users), what combination of:
- Software stack (oMLX, LM Studio, llama.cpp, etc.)
- Specific Qwen 3.6 model variants
- Inference settings
...have you found optimal? Any suggestions for improving prompt processing and token generation speeds on M2 hardware would be greatly appreciated!