Qwen 3.6 + Ollama
Run Qwen 3.6 locally with a single command - no configuration required
Ollama makes running Qwen 3.6 as simple as 'ollama run qwen3.6:35b-a3b'. Automatic GPU detection, model downloading, and quantization selection. Supports both the 27B dense and 35B A3B MoE models with NVIDIA CUDA and Apple Metal acceleration. Expect 20-40 tokens per second on consumer hardware for the 35B A3B 4-bit model. The OpenAI-compatible API at localhost:11434 integrates directly with Claude Code, Aider, Continue.dev, and other coding tools. Vision and multimodal input supported out of the box - a key fix over Qwen 3.5 where vision and tool calling were broken.
Ollama guide
From install to inference in under 5 minutes
Ollama handles the complexity of local model deployment - GPU detection, memory management, quantization, and API serving - so you can focus on using the model. Qwen 3.6 fixes the vision and tool calling issues that plagued Qwen 3.5 on Ollama.
One-command setup
Install Ollama, then run 'ollama run qwen3.6:35b-a3b' (default tag) or 'ollama run qwen3.6:27b'. Automatic model download, GPU detection, and optimal quantization selection. Works on macOS (Apple Silicon with Metal), Linux (NVIDIA CUDA), and Windows (WSL2 or native). The 35B A3B is the default recommended model for most users due to its balance of quality and hardware requirements.
Model tag selection
Choose the right model variant: 'qwen3.6:35b-a3b' for consumer GPUs (default tag), 'qwen3.6:27b' for maximum performance on workstation hardware, 'qwen3.6:35b-a3b-q4_k_m' for specific quantization control, or 'qwen3.6:35b-a3b-q3_k_m' for tighter VRAM budgets (~17GB). Tags map directly to GGUF quantization levels. Use 'ollama list' to see downloaded models and 'ollama show qwen3.6:35b-a3b' to inspect model details.
VRAM requirements and quantization
35B A3B quantization options: Q2_K (~13GB, fastest, lowest quality), Q3_K_M (~17GB, good for Mac M4 16GB), Q4_K_M (~21GB, balanced quality/speed on 24GB GPU), Q5_K_M (~25GB), Q8_0 (~35GB, near-lossless). 27B dense: Q4_K_M ~16GB, needs 24GB+ GPU. BF16 full precision for 35B A3B requires ~70GB VRAM. Community reports confirm Mac M4 16GB runs the 35B A3B at Q3 quantization successfully.
Vision and multimodal support
Qwen 3.6 models support multi-modal inputs through Ollama - a major improvement over Qwen 3.5 where vision was broken. Pass images alongside text prompts for code screenshot analysis, UI review, diagram understanding, architecture diagram parsing, and visual debugging workflows. Use the /image command in Ollama chat or pass base64-encoded images via the API.
Performance benchmarks on consumer hardware
Unsloth community benchmarks show 20-40 tokens per second on local rigs for the 35B A3B 4-bit model. Mac M4 16GB users report usable speeds with Q3 quantization. RTX 4090 24GB handles Q4_K_M with room for context. RTX 6000 96GB can run full precision deployment. Performance scales linearly with GPU memory bandwidth - faster memory means faster inference.
Modelfile customization
Create custom Modelfiles to configure system prompts, temperature, context length (num_ctx), GPU layer offloading (num_gpu), batch size (num_batch), and thread count. Set num_ctx up to 131072 for long-context tasks. Customize the chat template for specific use cases like coding assistants, technical writing, or agentic workflows. Modelfiles are plain text and version-controllable.
Tool calling and function support
Qwen 3.6 on Ollama supports tool calling and function invocation - another fix over Qwen 3.5 where tool calling was broken. Define tools in the OpenAI-compatible format and the model will generate structured function calls. This enables integration with agentic frameworks like LangChain, AutoGen, and CrewAI through the localhost:11434 endpoint.
Coding tool integration
Ollama exposes an OpenAI-compatible API at localhost:11434. Connect directly to Claude Code (via OpenAI-compatible API), OpenClaw, Aider, Continue.dev, Cursor, and other coding tools that support custom OpenAI endpoints. Set the base URL to http://localhost:11434/v1 and use any string as the API key. The Qwen 3.6 models support the same chat completions format as OpenAI.
Quick reference
Ollama commands, model tags, and hardware requirements
Essential commands, configuration options, and hardware requirements for running Qwen 3.6 with Ollama on different platforms.
Essential commands
- ollama run qwen3.6:35b-a3b - Run MoE model (default tag, consumer GPU)
- ollama run qwen3.6:27b - Run dense model (workstation GPU)
- ollama pull qwen3.6:35b-a3b-q3_k_m - Download Q3 quant (~17GB, Mac M4 friendly)
- ollama pull qwen3.6:35b-a3b-q4_k_m - Download Q4 quant (~21GB, balanced)
- ollama serve - Start API server on localhost:11434
- ollama list - Show downloaded models and sizes
- ollama show qwen3.6:35b-a3b - Inspect model details and parameters
Hardware requirements
- 35B A3B Q3_K_M: ~17GB VRAM (Mac M4 16GB confirmed working)
- 35B A3B Q4_K_M: ~21GB VRAM (RTX 4090 24GB recommended)
- 35B A3B BF16: ~70GB VRAM (RTX 6000 96GB or multi-GPU)
- 27B Dense Q4_K_M: ~16GB VRAM (RTX 4090 24GB minimum)
- 27B Dense IQ4_XS: fits 16GB VRAM with KV cache compression
- macOS: Apple Silicon with Metal acceleration (M1 Pro+ recommended)
- 20-40 tok/s on consumer hardware for 35B A3B 4-bit
- CPU fallback available but significantly slower (~2-5 tok/s)
Fixes over Qwen 3.5
- Vision/multimodal input: broken in 3.5, fully working in 3.6
- Tool calling/function invocation: broken in 3.5, fixed in 3.6
- Improved context handling and memory efficiency
- Better quantization quality at lower bit widths
Setup guides
Get Qwen 3.6 running with Ollama on any platform
Step-by-step guides for installing Ollama and configuring Qwen 3.6 on your platform, with hardware-specific optimization tips.
Install Ollama and run Qwen 3.6 on M1/M2/M3/M4 Macs with Metal acceleration
NVIDIA GPU setup with CUDA acceleration for maximum throughput
WSL2 and native Windows installation with GPU passthrough
Run Ollama in a container with GPU access for reproducible deployments
Run 35B A3B with Q3 quantization on Mac M4 with 16GB RAM
Split large models across multiple GPUs for better performance
Advanced configuration
Optimize Qwen 3.6 performance and integrate with coding tools
Fine-tune model performance with Modelfiles, GPU configuration, context settings, and connect to your development environment.
Custom system prompts, temperature, context length, and chat templates
VRAM management, layer offloading, and batch size tuning
Use Qwen 3.6 via Ollama as a backend for Claude Code
AI coding assistant in VS Code with local Qwen 3.6
AI pair programming with Ollama-hosted Qwen 3.6
Connect Ollama's localhost:11434 to any OpenAI-compatible tool
Qwen ecosystem
Ollama is the fastest path to local Qwen 3.6 - one command, full capabilities
One-command setup with automatic GPU detection, model management, vision support, tool calling, and an OpenAI-compatible API at localhost:11434 for seamless integration with Claude Code, Aider, Continue.dev, and more.
Get started
Ready to run Qwen 3.6 with Ollama? One command is all you need
Try Qwen 3.6 in the browser first, then install Ollama for local deployment. Run 'ollama run qwen3.6:35b-a3b' to download, configure, and start chatting with 20-40 tok/s on consumer hardware. Vision, tool calling, and coding tool integration all work out of the box.