Run Qwen 3.6 Locally

Deploy Qwen 3.6 on your own hardware - from Mac M4 16GB to production servers

Qwen 3.6 open-weight models are designed for local deployment across a wide range of hardware. The 27B dense model can run on 16GB VRAM using IQ4_XS GGUF with KV cache compression supporting up to 100K context. The 35B A3B MoE model delivers 20-40 tokens per second on consumer hardware at 4-bit quantization. Community reports confirm Mac M4 16GB runs the 35B A3B at Q3 quantization. Full support for Ollama, vLLM, llama.cpp, SGLang, and KTransformers. Vision and multimodal capabilities work locally.

Start Chatting View hardware guide

Local deployment

Everything you need to run Qwen 3.6 on your own machine

From hardware selection to quantization tuning, this guide covers every aspect of deploying Qwen 3.6 models locally for development, testing, and production use. Six inference frameworks supported, with hardware configurations from 16GB laptops to 96GB workstations.

Hardware requirements - 35B A3B MoE

The 35B A3B MoE model with only 3B active parameters is the most hardware-friendly option. Q3_K_M quantization: ~17GB VRAM, confirmed working on Mac M4 16GB. Q4_K_M: ~21-23GB VRAM, fits RTX 4090 24GB. Q8_0: ~35GB. BF16 full precision: ~70GB, fits RTX 6000 96GB. Expect 20-40 tokens per second on consumer hardware at 4-bit quantization based on Unsloth community benchmarks.

Hardware requirements - 27B Dense

The 27B dense model delivers maximum open-weight quality with all parameters active. IQ4_XS GGUF: can run on 16GB VRAM with KV cache compression, supporting up to 100K context length. Q4_K_M: ~16GB, needs 24GB+ GPU for comfortable operation with context. FP16 full precision: ~55.6GB, requires 2x RTX 4090 or A100 80GB. Best for workstation deployments where quality is the top priority.

Ollama one-command setup

The fastest path to local deployment: 'ollama run qwen3.6:35b-a3b'. Automatic model download, quantization selection, and GPU detection. Supports NVIDIA CUDA and Apple Metal acceleration. The OpenAI-compatible API at localhost:11434 integrates with Claude Code, Aider, Continue.dev, and other coding tools. Vision and tool calling both work out of the box - fixes over Qwen 3.5.

vLLM production serving

Production-grade serving with continuous batching, PagedAttention, and OpenAI-compatible API endpoints. Ideal for multi-user deployments and high-throughput inference on server hardware. Supports tensor parallelism for splitting the 27B model across multiple GPUs. PagedAttention enables efficient memory management for long-context requests up to the model's full context length.

llama.cpp and SGLang

llama.cpp provides lightweight C++ inference with CPU and GPU support, ideal for edge deployments and resource-constrained environments. SGLang offers high-performance serving with RadixAttention for efficient prefix caching. Both support GGUF quantized models and provide OpenAI-compatible API endpoints. KTransformers is also supported for advanced deployment scenarios.

Vision and multimodal locally

Both the 27B and 35B A3B models support vision and multimodal inputs when deployed locally. Analyze code screenshots, review UI designs, parse architecture diagrams, and debug visual issues. This capability works across Ollama, vLLM, and other supported frameworks. A significant improvement over Qwen 3.5 where local vision was broken.

Privacy and data sovereignty

All data stays on your machine. No API calls, no cloud dependencies, no usage tracking, no data leaving your network. Perfect for sensitive codebases, proprietary data, healthcare and financial applications, and air-gapped environments where data sovereignty is legally required. The Apache 2.0 license allows commercial use without restrictions.

Cost analysis vs API

Zero per-token costs after initial hardware investment. A single RTX 4090 (~$1,600) running the 35B A3B model at 20-40 tok/s can handle thousands of requests per day. At DashScope pricing of $0.40/$2.40 per million tokens, the GPU pays for itself within weeks for heavy usage. For teams processing millions of tokens daily, local deployment offers 10-100x cost savings over API access.

Quick reference

Hardware configurations and framework options

Key specifications for local Qwen 3.6 deployment across different hardware configurations and inference frameworks.

35B A3B MoE configurations

Q3_K_M: ~17GB VRAM - Mac M4 16GB confirmed working
Q4_K_M: ~21-23GB VRAM - RTX 4090 24GB recommended
Q8_0: ~35GB VRAM - RTX A6000 48GB or dual GPU
BF16: ~70GB VRAM - RTX 6000 96GB full precision
20-40 tok/s on consumer hardware at 4-bit (Unsloth benchmarks)
3B active parameters per token, efficient inference

27B Dense configurations

IQ4_XS GGUF: 16GB VRAM with KV cache compression (100K context)
Q4_K_M: ~16GB VRAM - RTX 4090 24GB with context room
FP16: ~55.6GB VRAM - 2x RTX 4090 or A100 80GB
All 27B parameters active for maximum quality
Best open-weight coding model: 77.2% SWE-bench

Supported frameworks

Ollama: Easiest setup, one-command deployment, vision + tool calling
vLLM: Production serving, continuous batching, tensor parallelism
llama.cpp: Lightweight C++ inference, CPU + GPU, edge deployment
SGLang: High-performance serving with RadixAttention prefix caching
KTransformers: Advanced deployment and optimization
HuggingFace Transformers: Native Python, full fine-tuning support

Start Chatting Download models

Setup guides

Step-by-step local deployment for every framework

Follow these guides to get Qwen 3.6 running on your hardware in minutes, with platform-specific optimization tips.

Ollama quickstart

Install Ollama and run Qwen 3.6 in under 5 minutes

vLLM deployment

Set up production-grade serving with OpenAI-compatible API

llama.cpp guide

Lightweight inference with CPU and GPU support

SGLang setup

High-performance serving with RadixAttention

box

Docker setup

Containerized deployment for reproducible environments

Mac M4 guide

Run 35B A3B on Mac M4 16GB with Q3 quantization

Optimization

Get the most out of your hardware

Tune quantization, batch size, memory allocation, and context length for optimal performance on your specific hardware.

Quantization comparison

Quality vs speed vs VRAM tradeoffs for each GGUF level

Multi-GPU setup

Tensor parallelism for the 27B dense model across GPUs

Apple Silicon guide

Optimized settings for M1/M2/M3/M4 Macs with Metal

KV cache compression

Fit 27B on 16GB VRAM with 100K context using IQ4_XS

Coding tool integration

Connect local Qwen to Claude Code, Aider, Continue.dev

Qwen ecosystem

Open-weight models built for local deployment - Apache 2.0 licensed

Qwen 3.6 open-weight models are released under the Apache 2.0 license with full support for six inference frameworks. From Mac M4 laptops to multi-GPU servers, deploy with confidence and zero ongoing costs.

Explore all models HuggingFace collection

Qwen 3.6 35B A3B

MoE, 3B active params, 20-40 tok/s on consumer GPU

Download

Qwen 3.6 27B

Dense, 16GB VRAM with IQ4_XS, max quality

Download

Ollama library

Pre-built model tags for one-command setup

Browse

GGUF models

Quantized models for every VRAM budget

Download

vLLM docs

Production serving with continuous batching

Read docs

Community

Get help from the Qwen community

Join

Get started

Ready to run Qwen 3.6 on your own hardware? Start with one command

Try Qwen 3.6 in the browser first, then deploy locally with Ollama, vLLM, llama.cpp, or SGLang. The 35B A3B runs on Mac M4 16GB, the 27B fits 16GB VRAM with IQ4_XS. Zero per-token costs, full data privacy, Apache 2.0 licensed.

Start Chatting Download models