Run Qwen 3.6 Locally

Deploy Qwen 3.6 on your own hardware - from Mac M4 16GB to production servers

Qwen 3.6 open-weight models are designed for local deployment across a wide range of hardware. The 27B dense model can run on 16GB VRAM using IQ4_XS GGUF with KV cache compression supporting up to 100K context. The 35B A3B MoE model delivers 20-40 tokens per second on consumer hardware at 4-bit quantization. Community reports confirm Mac M4 16GB runs the 35B A3B at Q3 quantization. Full support for Ollama, vLLM, llama.cpp, SGLang, and KTransformers. Vision and multimodal capabilities work locally.

Local deployment

Everything you need to run Qwen 3.6 on your own machine

From hardware selection to quantization tuning, this guide covers every aspect of deploying Qwen 3.6 models locally for development, testing, and production use. Six inference frameworks supported, with hardware configurations from 16GB laptops to 96GB workstations.

Hardware requirements - 35B A3B MoE

The 35B A3B MoE model with only 3B active parameters is the most hardware-friendly option. Q3_K_M quantization: ~17GB VRAM, confirmed working on Mac M4 16GB. Q4_K_M: ~21-23GB VRAM, fits RTX 4090 24GB. Q8_0: ~35GB. BF16 full precision: ~70GB, fits RTX 6000 96GB. Expect 20-40 tokens per second on consumer hardware at 4-bit quantization based on Unsloth community benchmarks.

Hardware requirements - 27B Dense

The 27B dense model delivers maximum open-weight quality with all parameters active. IQ4_XS GGUF: can run on 16GB VRAM with KV cache compression, supporting up to 100K context length. Q4_K_M: ~16GB, needs 24GB+ GPU for comfortable operation with context. FP16 full precision: ~55.6GB, requires 2x RTX 4090 or A100 80GB. Best for workstation deployments where quality is the top priority.

Ollama one-command setup

The fastest path to local deployment: 'ollama run qwen3.6:35b-a3b'. Automatic model download, quantization selection, and GPU detection. Supports NVIDIA CUDA and Apple Metal acceleration. The OpenAI-compatible API at localhost:11434 integrates with Claude Code, Aider, Continue.dev, and other coding tools. Vision and tool calling both work out of the box - fixes over Qwen 3.5.

vLLM production serving

Production-grade serving with continuous batching, PagedAttention, and OpenAI-compatible API endpoints. Ideal for multi-user deployments and high-throughput inference on server hardware. Supports tensor parallelism for splitting the 27B model across multiple GPUs. PagedAttention enables efficient memory management for long-context requests up to the model's full context length.

llama.cpp and SGLang

llama.cpp provides lightweight C++ inference with CPU and GPU support, ideal for edge deployments and resource-constrained environments. SGLang offers high-performance serving with RadixAttention for efficient prefix caching. Both support GGUF quantized models and provide OpenAI-compatible API endpoints. KTransformers is also supported for advanced deployment scenarios.

Vision and multimodal locally

Both the 27B and 35B A3B models support vision and multimodal inputs when deployed locally. Analyze code screenshots, review UI designs, parse architecture diagrams, and debug visual issues. This capability works across Ollama, vLLM, and other supported frameworks. A significant improvement over Qwen 3.5 where local vision was broken.

Privacy and data sovereignty

All data stays on your machine. No API calls, no cloud dependencies, no usage tracking, no data leaving your network. Perfect for sensitive codebases, proprietary data, healthcare and financial applications, and air-gapped environments where data sovereignty is legally required. The Apache 2.0 license allows commercial use without restrictions.

Cost analysis vs API

Zero per-token costs after initial hardware investment. A single RTX 4090 (~$1,600) running the 35B A3B model at 20-40 tok/s can handle thousands of requests per day. At DashScope pricing of $0.40/$2.40 per million tokens, the GPU pays for itself within weeks for heavy usage. For teams processing millions of tokens daily, local deployment offers 10-100x cost savings over API access.

Quick reference

Hardware configurations and framework options

Key specifications for local Qwen 3.6 deployment across different hardware configurations and inference frameworks.

35B A3B MoE configurations

  • Q3_K_M: ~17GB VRAM - Mac M4 16GB confirmed working
  • Q4_K_M: ~21-23GB VRAM - RTX 4090 24GB recommended
  • Q8_0: ~35GB VRAM - RTX A6000 48GB or dual GPU
  • BF16: ~70GB VRAM - RTX 6000 96GB full precision
  • 20-40 tok/s on consumer hardware at 4-bit (Unsloth benchmarks)
  • 3B active parameters per token, efficient inference

27B Dense configurations

  • IQ4_XS GGUF: 16GB VRAM with KV cache compression (100K context)
  • Q4_K_M: ~16GB VRAM - RTX 4090 24GB with context room
  • FP16: ~55.6GB VRAM - 2x RTX 4090 or A100 80GB
  • All 27B parameters active for maximum quality
  • Best open-weight coding model: 77.2% SWE-bench

Supported frameworks

  • Ollama: Easiest setup, one-command deployment, vision + tool calling
  • vLLM: Production serving, continuous batching, tensor parallelism
  • llama.cpp: Lightweight C++ inference, CPU + GPU, edge deployment
  • SGLang: High-performance serving with RadixAttention prefix caching
  • KTransformers: Advanced deployment and optimization
  • HuggingFace Transformers: Native Python, full fine-tuning support

Qwen ecosystem

Open-weight models built for local deployment - Apache 2.0 licensed

Qwen 3.6 open-weight models are released under the Apache 2.0 license with full support for six inference frameworks. From Mac M4 laptops to multi-GPU servers, deploy with confidence and zero ongoing costs.

Qwen 3.6 35B A3B

MoE, 3B active params, 20-40 tok/s on consumer GPU

Download

Qwen 3.6 27B

Dense, 16GB VRAM with IQ4_XS, max quality

Download

Ollama library

Pre-built model tags for one-command setup

Browse

GGUF models

Quantized models for every VRAM budget

Download

vLLM docs

Production serving with continuous batching

Read docs

Community

Get help from the Qwen community

Join

Get started

Ready to run Qwen 3.6 on your own hardware? Start with one command

Try Qwen 3.6 in the browser first, then deploy locally with Ollama, vLLM, llama.cpp, or SGLang. The 35B A3B runs on Mac M4 16GB, the 27B fits 16GB VRAM with IQ4_XS. Zero per-token costs, full data privacy, Apache 2.0 licensed.