Run Qwen 3.6 Locally
Deploy Qwen 3.6 on your own hardware - from Mac M4 16GB to production servers
Qwen 3.6 open-weight models are designed for local deployment across a wide range of hardware. The 27B dense model can run on 16GB VRAM using IQ4_XS GGUF with KV cache compression supporting up to 100K context. The 35B A3B MoE model delivers 20-40 tokens per second on consumer hardware at 4-bit quantization. Community reports confirm Mac M4 16GB runs the 35B A3B at Q3 quantization. Full support for Ollama, vLLM, llama.cpp, SGLang, and KTransformers. Vision and multimodal capabilities work locally.
Local deployment
Everything you need to run Qwen 3.6 on your own machine
From hardware selection to quantization tuning, this guide covers every aspect of deploying Qwen 3.6 models locally for development, testing, and production use. Six inference frameworks supported, with hardware configurations from 16GB laptops to 96GB workstations.
Hardware requirements - 35B A3B MoE
The 35B A3B MoE model with only 3B active parameters is the most hardware-friendly option. Q3_K_M quantization: ~17GB VRAM, confirmed working on Mac M4 16GB. Q4_K_M: ~21-23GB VRAM, fits RTX 4090 24GB. Q8_0: ~35GB. BF16 full precision: ~70GB, fits RTX 6000 96GB. Expect 20-40 tokens per second on consumer hardware at 4-bit quantization based on Unsloth community benchmarks.
Hardware requirements - 27B Dense
The 27B dense model delivers maximum open-weight quality with all parameters active. IQ4_XS GGUF: can run on 16GB VRAM with KV cache compression, supporting up to 100K context length. Q4_K_M: ~16GB, needs 24GB+ GPU for comfortable operation with context. FP16 full precision: ~55.6GB, requires 2x RTX 4090 or A100 80GB. Best for workstation deployments where quality is the top priority.
Ollama one-command setup
The fastest path to local deployment: 'ollama run qwen3.6:35b-a3b'. Automatic model download, quantization selection, and GPU detection. Supports NVIDIA CUDA and Apple Metal acceleration. The OpenAI-compatible API at localhost:11434 integrates with Claude Code, Aider, Continue.dev, and other coding tools. Vision and tool calling both work out of the box - fixes over Qwen 3.5.
vLLM production serving
Production-grade serving with continuous batching, PagedAttention, and OpenAI-compatible API endpoints. Ideal for multi-user deployments and high-throughput inference on server hardware. Supports tensor parallelism for splitting the 27B model across multiple GPUs. PagedAttention enables efficient memory management for long-context requests up to the model's full context length.
llama.cpp and SGLang
llama.cpp provides lightweight C++ inference with CPU and GPU support, ideal for edge deployments and resource-constrained environments. SGLang offers high-performance serving with RadixAttention for efficient prefix caching. Both support GGUF quantized models and provide OpenAI-compatible API endpoints. KTransformers is also supported for advanced deployment scenarios.
Vision and multimodal locally
Both the 27B and 35B A3B models support vision and multimodal inputs when deployed locally. Analyze code screenshots, review UI designs, parse architecture diagrams, and debug visual issues. This capability works across Ollama, vLLM, and other supported frameworks. A significant improvement over Qwen 3.5 where local vision was broken.
Privacy and data sovereignty
All data stays on your machine. No API calls, no cloud dependencies, no usage tracking, no data leaving your network. Perfect for sensitive codebases, proprietary data, healthcare and financial applications, and air-gapped environments where data sovereignty is legally required. The Apache 2.0 license allows commercial use without restrictions.
Cost analysis vs API
Zero per-token costs after initial hardware investment. A single RTX 4090 (~$1,600) running the 35B A3B model at 20-40 tok/s can handle thousands of requests per day. At DashScope pricing of $0.40/$2.40 per million tokens, the GPU pays for itself within weeks for heavy usage. For teams processing millions of tokens daily, local deployment offers 10-100x cost savings over API access.
Quick reference
Hardware configurations and framework options
Key specifications for local Qwen 3.6 deployment across different hardware configurations and inference frameworks.
35B A3B MoE configurations
- Q3_K_M: ~17GB VRAM - Mac M4 16GB confirmed working
- Q4_K_M: ~21-23GB VRAM - RTX 4090 24GB recommended
- Q8_0: ~35GB VRAM - RTX A6000 48GB or dual GPU
- BF16: ~70GB VRAM - RTX 6000 96GB full precision
- 20-40 tok/s on consumer hardware at 4-bit (Unsloth benchmarks)
- 3B active parameters per token, efficient inference
27B Dense configurations
- IQ4_XS GGUF: 16GB VRAM with KV cache compression (100K context)
- Q4_K_M: ~16GB VRAM - RTX 4090 24GB with context room
- FP16: ~55.6GB VRAM - 2x RTX 4090 or A100 80GB
- All 27B parameters active for maximum quality
- Best open-weight coding model: 77.2% SWE-bench
Supported frameworks
- Ollama: Easiest setup, one-command deployment, vision + tool calling
- vLLM: Production serving, continuous batching, tensor parallelism
- llama.cpp: Lightweight C++ inference, CPU + GPU, edge deployment
- SGLang: High-performance serving with RadixAttention prefix caching
- KTransformers: Advanced deployment and optimization
- HuggingFace Transformers: Native Python, full fine-tuning support
Setup guides
Step-by-step local deployment for every framework
Follow these guides to get Qwen 3.6 running on your hardware in minutes, with platform-specific optimization tips.
Install Ollama and run Qwen 3.6 in under 5 minutes
Set up production-grade serving with OpenAI-compatible API
Lightweight inference with CPU and GPU support
High-performance serving with RadixAttention
Containerized deployment for reproducible environments
Run 35B A3B on Mac M4 16GB with Q3 quantization
Optimization
Get the most out of your hardware
Tune quantization, batch size, memory allocation, and context length for optimal performance on your specific hardware.
Quality vs speed vs VRAM tradeoffs for each GGUF level
Tensor parallelism for the 27B dense model across GPUs
Optimized settings for M1/M2/M3/M4 Macs with Metal
Fit 27B on 16GB VRAM with 100K context using IQ4_XS
Connect local Qwen to Claude Code, Aider, Continue.dev
Qwen ecosystem
Open-weight models built for local deployment - Apache 2.0 licensed
Qwen 3.6 open-weight models are released under the Apache 2.0 license with full support for six inference frameworks. From Mac M4 laptops to multi-GPU servers, deploy with confidence and zero ongoing costs.
Get started
Ready to run Qwen 3.6 on your own hardware? Start with one command
Try Qwen 3.6 in the browser first, then deploy locally with Ollama, vLLM, llama.cpp, or SGLang. The 35B A3B runs on Mac M4 16GB, the 27B fits 16GB VRAM with IQ4_XS. Zero per-token costs, full data privacy, Apache 2.0 licensed.