Alibaba's Qwen 3.5 9B outperformed a model 13x its size on GPQA Diamond. The 397B version beats GPT-5.2 on instruction following. Open source, free to self-host, and already running on iPhones. Honest verdict.
Qwen 3.5 is Alibaba's latest open-source model family, released in early April 2026. It made an immediate impact: within hours of the 397B flagship announcement, it had 363 points and 173 comments on Hacker News. The 9B variant is the headline story: it scored 81.7% on GPQA Diamond, a graduate-level reasoning benchmark typically dominated by models 10-15x larger. For context, GPT-OSS-120B scores 71.5% on the same benchmark. The architecture uses mixture-of-experts, activating only 17B parameters per forward pass on the 397B model, which makes self-hosting more practical than the parameter count suggests. The family is natively multimodal, supporting text, images, and video through the same weights with no separate vision adapter, and it covers 201 languages and dialects.
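To see why the mixture-of-experts design matters for self-hosting, here is a back-of-envelope sketch using the parameter counts above. The bytes-per-parameter figures are standard for each precision, but treat the results as rough estimates only; real deployments also need memory for the KV cache and activations.

```python
# Rough memory math for a mixture-of-experts model.
# Figures from the review: 397B total parameters, ~17B active per token.
# All weights must fit in memory, but only the active subset is read
# per forward pass, which is where the decoding speedup comes from.

TOTAL_PARAMS = 397e9
ACTIVE_PARAMS = 17e9

def weight_gb(num_params: float, bytes_per_param: float) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9

# Weights still occupy memory for the full 397B parameters...
full_fp16 = weight_gb(TOTAL_PARAMS, 2.0)   # ~794 GB at 16-bit
full_int4 = weight_gb(TOTAL_PARAMS, 0.5)   # ~199 GB at 4-bit

# ...but each token only touches ~17B of them.
active_fp16 = weight_gb(ACTIVE_PARAMS, 2.0)  # ~34 GB read per token

print(f"fp16 weights: {full_fp16:.0f} GB, int4: {full_int4:.0f} GB, "
      f"active per token (fp16): {active_fp16:.0f} GB")
```

The takeaway: a 4-bit quantized 397B MoE fits on a multi-GPU node that a dense 397B model would saturate at decode time, because per-token compute and bandwidth scale with the 17B active parameters, not the full count.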
On instruction following, Qwen 3.5 leads GPT-5.2 on IFBench with 76.5 versus 75.4 and beats Gemini on MultiChallenge with 67.6 versus 64.2. On web browsing, the BrowseComp benchmark shows Qwen 3.5 at 78.6, beating all competitors. On coding, SWE-bench Verified shows 76.4, which is competitive but trails GPT-5.2 at 80.0 and Claude at 80.9. On vision tasks, Qwen 3.5 leads on MathVision at 88.6 and on several OCR benchmarks. Decoding throughput is 8.6x to 19x faster than the previous Qwen3-Max, depending on context length. The 9B model runs on any recent iPhone in airplane mode with just 4GB of RAM: on-device AI that actually works without any cloud dependency.
Via API, Qwen 3.5 costs approximately $0.40 per million input tokens and $1.20 per million output tokens. Claude Opus 4.6 costs roughly 13x more. For startups running high-volume inference, the cost difference between Qwen 3.5 and frontier closed models can determine whether a product is economically viable. Because the weights are available under an open licence, teams can self-host and eliminate per-token costs entirely. The 9B variant is the most cost-efficient serious reasoning model available, running locally on a MacBook Pro with 16GB of RAM at no ongoing cost.
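To make the 13x gap concrete, here is a minimal cost sketch using the per-token prices quoted above. The monthly traffic volume is hypothetical, chosen only to show the scale of the difference, and the Claude figures simply apply the review's ~13x multiplier rather than actual list prices.

```python
# Monthly API cost comparison using the prices in the review:
# $0.40/M input and $1.20/M output for Qwen 3.5, and a ~13x
# multiplier for Claude Opus 4.6. Workload numbers are hypothetical.

def monthly_cost(input_m: float, output_m: float,
                 in_price: float, out_price: float) -> float:
    """Dollar cost for a month of traffic; token counts in millions."""
    return input_m * in_price + output_m * out_price

# Hypothetical workload: 500M input + 100M output tokens per month.
qwen = monthly_cost(500, 100, 0.40, 1.20)              # ~$320/mo
claude = monthly_cost(500, 100, 0.40 * 13, 1.20 * 13)  # ~$4,160/mo

print(f"Qwen 3.5: ${qwen:,.0f}/mo vs ~13x-priced frontier: ${claude:,.0f}/mo")
```

At this volume the gap is already the cost of an engineer's laptop every month; at 10x the volume it approaches a salary, which is the arithmetic behind the "economically viable" claim.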
Qwen 3.5 is best suited for cost-sensitive, high-volume applications; for fine-tuning on proprietary data; and for teams that need to self-host for data sovereignty reasons. For client-facing applications involving sensitive data, the China-based training origin is worth evaluating, though self-hosting under the Apache 2.0 licence eliminates the data sovereignty concern entirely. It is not recommended as a primary model for the most complex production coding tasks, where Claude or GPT-5.4 produce meaningfully better results. It is excellent for internal tools, research pipelines, and applications where the 80th percentile of capability at 5% of the cost is the right tradeoff.
New honest reviews every week. Zero sponsorships. Zero fluff.
Subscribe Free →