Model Comparison
A practical guide to picking the right model for the job across every provider VibeCody supports. Last updated: 2026-05-08.
Caveat: model leaderboards shift weekly. Treat the strength/weakness blurbs as a shape of each model’s bias (what it was trained for), not a final benchmark verdict. When in doubt, run the same prompt through two candidates side-by-side in VibeUI’s MultiModel panel.
Where models run
Three different execution shapes hide behind the model picker. Pick the row that matches your privacy / cost / capability needs:
- vibecli-mistralrs: runs on your machine. Weights are cached at ~/.cache/huggingface/hub and forward passes execute on your CPU / Metal / CUDA. Nothing leaves the host. Limited to ~7B-class models on a laptop. Default for the privacy path.
- ollama: runs locally OR on Ollama Cloud. Without an ollama API key, only models you’ve pulled with ollama pull are available, and they run locally. With an API key, large cloud-hosted models (devstral-2:123b-cloud, nemotron-3-super, etc.) route to Ollama Cloud transparently. Open-weights only; scales up to 100B+ MoE.
- Cloud APIs (claude, openai, gemini, grok, mistral, deepseek, cerebras, perplexity, together, fireworks, openrouter, azure_openai, bedrock, copilot, zhipu, vercel_ai, minimax, sambanova): run on the provider’s hardware. Closed-weights flagships live here. Inputs and outputs traverse the network; check each provider’s data-handling terms.
The daemon serves all three from the same HTTP surface (:7878), so a remote VibeUI / VibeMobile / VibeWatch client can use any of them. The choice of provider determines where the model itself executes, not how the client connects.
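If you need to gate privacy-sensitive work on where a model executes, the three shapes above reduce to a small lookup. The sketch below is hypothetical (the ExecutionShape type, executionShape and staysOnHost functions, and the hasOllamaApiKey flag are illustrative names, not VibeCody code); it assumes provider ids match the picker keys listed above and encodes the Ollama rule as described: without a key, only local pulls run.

```typescript
// Hypothetical helper (not VibeCody code): where does a provider's model execute?
type ExecutionShape = "local" | "local-or-ollama-cloud" | "cloud-api";

// Provider ids from the "Cloud APIs" list above, plus claude-code,
// which bills a Claude.ai plan but still runs on Anthropic's hardware.
const CLOUD_API_PROVIDERS: ReadonlySet<string> = new Set([
  "claude", "claude-code", "openai", "gemini", "grok", "mistral", "deepseek",
  "cerebras", "perplexity", "together", "fireworks", "openrouter",
  "azure_openai", "bedrock", "copilot", "zhipu", "vercel_ai", "minimax",
  "sambanova",
]);

function executionShape(provider: string, hasOllamaApiKey: boolean): ExecutionShape {
  // In-daemon inference: weights and forward passes stay on this host.
  if (provider === "vibecli-mistralrs") return "local";
  if (provider === "ollama") {
    // Without an API key only locally pulled models run; with one,
    // cloud-hosted models may be routed to Ollama Cloud.
    return hasOllamaApiKey ? "local-or-ollama-cloud" : "local";
  }
  if (CLOUD_API_PROVIDERS.has(provider)) return "cloud-api";
  throw new Error(`unknown provider: ${provider}`);
}

// True only when the configuration guarantees prompts never leave the host.
function staysOnHost(provider: string, hasOllamaApiKey: boolean): boolean {
  return executionShape(provider, hasOllamaApiKey) === "local";
}
```

For example, staysOnHost("ollama", true) is false: once a key is configured, the same provider may route a request to Ollama Cloud, so the helper errs on the side of "may leave the host".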
Notation
- Ctx — maximum context window (input tokens).
- Tools — native function/tool calling support: ✅ first-class, ⚠️ supported but quirky, ❌ none.
- Vision — accepts image input.
- Reasoning — model does explicit chain-of-thought / “thinking” tokens before answering.
- Open — open-weights (you can self-host).
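The notation maps naturally onto a typed record, which makes it easy to shortlist candidates by hard requirements before a side-by-side run in the MultiModel panel. This is a sketch with a few rows transcribed from the provider tables on this page; the ModelInfo shape, Support values, and shortlist function are hypothetical and are not VibeCody’s actual registry types.

```typescript
// Hypothetical types mirroring the Notation columns; not VibeCody's registry schema.
type Support = "first-class" | "quirky" | "none"; // ✅ / ⚠️ / ❌

interface ModelInfo {
  id: string;
  ctxTokens: number; // Ctx: maximum input context
  tools: Support;
  vision: boolean;
  reasoning: Support;
}

// A few rows transcribed from the provider tables on this page.
const SAMPLE: ModelInfo[] = [
  { id: "claude-sonnet-4-6", ctxTokens: 200_000, tools: "first-class", vision: true, reasoning: "first-class" },
  { id: "gpt-4.1", ctxTokens: 1_000_000, tools: "first-class", vision: true, reasoning: "none" },
  { id: "gpt-5.3-codex-spark", ctxTokens: 200_000, tools: "first-class", vision: false, reasoning: "quirky" },
  { id: "gpt-4o", ctxTokens: 128_000, tools: "first-class", vision: true, reasoning: "none" },
  { id: "sonar-pro", ctxTokens: 200_000, tools: "quirky", vision: false, reasoning: "none" },
];

interface Requirements {
  minCtx?: number;
  needFirstClassTools?: boolean;
  needVision?: boolean;
}

// Keep only models that meet every stated hard requirement, in table order.
function shortlist(models: ModelInfo[], req: Requirements): string[] {
  return models
    .filter((m) => m.ctxTokens >= (req.minCtx ?? 0))
    .filter((m) => !req.needFirstClassTools || m.tools === "first-class")
    .filter((m) => !req.needVision || m.vision)
    .map((m) => m.id);
}
```

With these rows, shortlist(SAMPLE, { minCtx: 200_000, needFirstClassTools: true, needVision: true }) returns ["claude-sonnet-4-6", "gpt-4.1"]. The filter only narrows the field; quality still has to be judged by running the prompt.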
Pick by task
The “right” pick depends on what you’re doing. Use this matrix as a starting point, then verify with the MultiModel panel in VibeUI.
Coding agent (multi-step file edits, run-and-fix loops)
| Tier | Cloud-hosted | Open-weights (Ollama Cloud) | Local pull |
|---|---|---|---|
| Flagship | Claude Sonnet 4.6, gpt-5.3-codex | devstral-2 (123B) | devstral-small-2 |
| Strong | gpt-5.5, GPT-4.1 | qwen3-coder | qwen2.5-coder:7b |
| Cheap/fast | Claude Haiku 4.5, gpt-5.3-codex-spark, gpt-4.1-mini | ministral-3, devstral-small-2 | qwen2.5-coder:1.5b |
One-shot reasoning, math, hard algorithms
| Tier | Cloud-hosted | Open-weights | Local pull |
|---|---|---|---|
| Flagship | Claude Opus 4.7, gpt-5.5, o3 | nemotron-3-super, deepseek-v4-pro | deepseek-r1:14b |
| Strong | Gemini 3.1 Pro, gpt-4.1 | glm-5.1, magistral | qwq:32b |
| Cheap | o4-mini, gpt-4.1-mini | nemotron-3-nano | phi4-reasoning |
Long context (≥200k tokens)
| Tier | Provider · Model |
|---|---|
| Flagship | Gemini 3.1 Pro (1M+), gpt-5.5 (1M), Claude Sonnet 4.6 (200k) |
| Strong | gpt-4.1 (1M), Grok-3 (256k) |
| Open | qwen3-next, llama4 (variable) |
Vision (image input)
| Tier | Provider · Model |
|---|---|
| Flagship | Gemini 3.1 Pro, gpt-5.5, Claude Sonnet 4.6 |
| Strong | GPT-4o, Grok-3, gpt-4.1 |
| Open | qwen3-coder (vision variant), llama4 vision |
| Local | llama3.2-vision, gemma3 |
Cheap & fast tool-calling agents
| Tier | Provider · Model |
|---|---|
| Cloud | Claude Haiku 4.5, gpt-5.3-codex-spark, gpt-4.1-mini, Gemini 2.5 Flash, Grok-3-mini |
| Open cloud | ministral-3, devstral-small-2, gemma4 |
| Local | phi4-mini, llama3.2:3b, qwen2.5:1.5b |
Privacy / fully offline
| Tier | Engine · Model |
|---|---|
| Daemon (mistralrs) | Qwen2.5-7B-Instruct, Qwen2.5-Coder-7B, Phi-3.5-mini |
| Ollama local | devstral-small-2, qwen2.5-coder:7b, llama3.2:3b |
Web search / news-aware
| Tier | Provider · Model |
|---|---|
| Native | Perplexity Sonar Pro, Sonar Reasoning |
| With tools | gpt-4.1 + browser tool, Claude Sonnet 4.6 + web tool |
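If you script model selection (for example, a CI job that picks a model per task), a matrix like these can be encoded as a lookup that falls back down the tiers until it finds a model you can actually use. This minimal sketch transcribes only the “Coding agent” table, using picker ids in place of display names; the Tier, Column, CODING_AGENT, and pickFor names are hypothetical.

```typescript
// Hypothetical encoding of the "Coding agent" matrix (picker ids, not display names).
type Tier = "flagship" | "strong" | "cheap";
type Column = "cloud" | "openWeights" | "local";

const CODING_AGENT: Record<Tier, Record<Column, string[]>> = {
  flagship: {
    cloud: ["claude-sonnet-4-6", "gpt-5.3-codex"],
    openWeights: ["devstral-2"],
    local: ["devstral-small-2"],
  },
  strong: {
    cloud: ["gpt-5.5", "gpt-4.1"],
    openWeights: ["qwen3-coder"],
    local: ["qwen2.5-coder:7b"],
  },
  cheap: {
    cloud: ["claude-haiku-4-5", "gpt-5.3-codex-spark", "gpt-4.1-mini"],
    openWeights: ["ministral-3", "devstral-small-2"],
    local: ["qwen2.5-coder:1.5b"],
  },
};

const TIER_ORDER: Tier[] = ["flagship", "strong", "cheap"];

// Return the first usable candidate at or below startTier in the given column.
// `available` decides usability, e.g. "has an API key for" or "has been pulled".
function pickFor(column: Column, startTier: Tier, available: (id: string) => boolean): string | undefined {
  for (const tier of TIER_ORDER.slice(TIER_ORDER.indexOf(startTier))) {
    const hit = CODING_AGENT[tier][column].find(available);
    if (hit !== undefined) return hit;
  }
  return undefined; // nothing usable in this column at or below startTier
}
```

Falling back down the tiers (rather than across columns) keeps the privacy choice fixed: a "local" request never silently becomes a cloud call.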
Providers and models
Below: every provider VibeCody ships, the models we expose in the picker, and what each one is actually good at. Flagships get deeper dives; secondary models get one-liners.
Anthropic Claude (claude)
Three-tier family — Opus (deepest reasoning), Sonnet (balanced workhorse), Haiku (fast/cheap). All three support tool calling, vision, and extended thinking. Default in VibeCody is claude-opus-4-7.
| Model | Ctx | Tools | Vision | Reasoning | Notes |
|---|---|---|---|---|---|
| claude-opus-4-7 | 200k | ✅ | ✅ | ✅ | Flagship reasoning + agent default |
| claude-sonnet-4-6 | 200k | ✅ | ✅ | ✅ | Current Sonnet — best-bang-for-buck coding agent |
| claude-haiku-4-5 | 200k | ✅ | ✅ | ✅ | Current Haiku — cheap/fast tool calls + classification |
| claude-opus-4-6 | 200k | ✅ | ✅ | ✅ | Previous-gen flagship |
| claude-sonnet-4-5 | 200k | ✅ | ✅ | ✅ | Previous-gen Sonnet |
| claude-3-5-sonnet-20241022 | 200k | ✅ | ✅ | ❌ | Legacy 3.5 — kept for reproducibility |
claude-opus-4-7 — Strongest at sustained agentic loops with many tools and many turns. It rarely loses the plot on long sessions and is willing to push back on bad instructions. Most expensive of the three. Use when latency doesn’t matter and the work is hard.
claude-sonnet-4-6 — The model most VibeCody users will actually run. Roughly Opus-level coding quality on common tasks, ~3-4× cheaper, ~2× faster. Default for the VibeUI Code panel.
claude-haiku-4-5: Surprisingly capable for its tier; handles routine tool-calling, summarization, and intent classification. Don’t use it for novel architecture or deep debugging; there it can be confidently wrong.
claude-code (local Claude Code CLI passthrough)
Same Anthropic models (Opus 4.7, Sonnet 4.6, Haiku 4.5), but billed against the user’s Claude.ai Free/Pro/Max/Team/Enterprise plan instead of API credits. Same capabilities; payment shape differs.
OpenAI (openai)
Four parallel lines as of May 2026: the GPT-5 line (gpt-5, gpt-5.3, gpt-5.4, gpt-5.5) is the current general-purpose flagship family with built-in adaptive reasoning; the GPT-5 codex variants (gpt-5.3-codex, gpt-5.3-codex-spark) are coding-tuned for agent loops; the GPT-4 line (gpt-4o, gpt-4.1) remains for compatibility and long-context cases; and the o-line (o3, o4-mini) is the older explicit-reasoning family. VibeCody’s default is gpt-5.5.
| Model | Ctx | Tools | Vision | Reasoning | Notes |
|---|---|---|---|---|---|
| gpt-5.5 | 1M | ✅ | ✅ | ✅ | Current flagship — default in VibeCody |
| gpt-5.4 | 1M | ✅ | ✅ | ✅ | Previous flagship; cheaper than 5.5 |
| gpt-5.3-codex | 200k | ✅ | ❌ | ✅ | Coding-tuned, agent-loop optimised |
| gpt-5.3-codex-spark | 200k | ✅ | ❌ | ⚠️ | Faster/cheaper codex variant for IDE flows |
| gpt-5 | 1M | ✅ | ✅ | ✅ | First GPT-5 release; kept for reproducibility |
| gpt-4o | 128k | ✅ | ✅ | ❌ | Workhorse multimodal, omni input/output |
| gpt-4o-mini | 128k | ✅ | ✅ | ❌ | Fast/cheap variant |
| gpt-4-turbo | 128k | ✅ | ✅ | ❌ | Older; kept for reproducibility |
| gpt-4.1 | 1M | ✅ | ✅ | ❌ | Long-context GPT-4 flagship |
| gpt-4.1-mini | 1M | ✅ | ✅ | ❌ | Fast long-context |
| gpt-4.1-nano | 1M | ✅ | ❌ | ❌ | Very cheap classification/extract |
| o3 | 200k | ✅ | ✅ | ✅ | Hard reasoning, pre-GPT-5 |
| o3-mini | 200k | ✅ | ❌ | ✅ | Cheaper reasoning |
| o4-mini | 200k | ✅ | ✅ | ✅ | Reasoning + vision; replaces o3-mini for most use |
gpt-5.5 — OpenAI’s current general-purpose flagship. 1M-token context with strong long-range retrieval, built-in adaptive reasoning (the model decides per-prompt how much “thinking” to spend), native vision, rock-solid tool calling. Default in VibeCody for the OpenAI provider.
gpt-5.3-codex — Coding-specialised GPT-5 variant; tuned for multi-step file edits, run-and-fix loops, and tool-heavy agent flows. Pick this over gpt-5.5 when the workload is overwhelmingly code-editing. The -spark variant trades some quality for sub-second latency — good for inline completions and quick IDE actions.
gpt-4.1 — Still useful when you need a non-reasoning baseline at low cost or want to A/B against GPT-5 outputs. 1M-token context retrieves well. For new work, default to a GPT-5 entry instead.
o3 — Pre-GPT-5 reasoning flagship. GPT-5.5 generally matches or beats it on most benches now, but o3 is still useful where you want the older reasoning style for reproducibility.
o4-mini — A reasonable middle ground when you want some explicit reasoning but not o3 cost. Good for code review and architecture sketches.
Google Gemini (gemini)
Long context is the headline (1M+ on Pro). The Gemini 3 generation (released Q1 2026) is competitive with GPT-5-class models on most general tasks and remains best-in-class for long-context retrieval. The 2.5 line stays in the picker for cost-sensitive workloads. VibeCody’s default is gemini-3.1-pro.
| Model | Ctx | Tools | Vision | Reasoning | Notes |
|---|---|---|---|---|---|
| gemini-3.1-pro | 1M+ | ✅ | ✅ | ✅ | Current flagship — default in VibeCody |
| gemini-3-pro | 1M+ | ✅ | ✅ | ✅ | Initial Gemini 3 release; kept for reproducibility |
| gemini-2.5-pro | 1M | ✅ | ✅ | ✅ | Previous-gen long-context flagship |
| gemini-2.5-flash | 1M | ✅ | ✅ | ⚠️ | Cheap workhorse |
| gemini-2.0-flash | 1M | ✅ | ✅ | ❌ | Previous-gen flash |
| gemini-2.0-flash-lite | 1M | ✅ | ❌ | ❌ | Cheapest tier |
gemini-3.1-pro — Google’s current flagship. Strongest model in the picker for genuine 1M+ token comprehension (not just acceptance), with native multimodal handling and adaptive reasoning. Tool calling caught up to Claude/GPT-5 with the 3.x line; argument-shape hallucinations on complex tools have largely cleared. Default in VibeCody for the Gemini provider.
gemini-2.5-pro — Still a strong long-context option at lower cost than 3.1 Pro. Use when you need depth on a long input but don’t need the latest reasoning quality.
gemini-2.5-flash — Sub-second time-to-first-token, supports tools and vision, costs roughly 10× less than Pro. Good for chat-style use and high-volume tool-calling agents.
xAI Grok (grok)
Strong on real-time / news-aware tasks (it has live X data feed integration on the back end). Decent coding ability; tool calling is solid as of grok-3.
| Model | Ctx | Tools | Vision | Reasoning | Notes |
|---|---|---|---|---|---|
| grok-3 | 256k | ✅ | ✅ | ⚠️ | Flagship |
| grok-3-mini | 256k | ✅ | ❌ | ❌ | Cheap/fast — VibeCody default |
| grok-2 | 128k | ✅ | ✅ | ❌ | Previous gen |
grok-3 — Useful when the task involves recent events, market data, or code where the relevant docs were published in the last few months — it tends to be more current than rivals. Coding capability roughly between gpt-4o and gpt-4.1. Tool calling works but the JSON schema adherence is fussier than Claude’s.
Mistral (mistral)
European cloud provider, strong on multilingual and coding (Codestral). Tool calling is native and well-specced.
| Model | Ctx | Tools | Vision | Reasoning | Notes |
|---|---|---|---|---|---|
| mistral-large-latest | 128k | ✅ | ❌ | ❌ | General flagship |
| mistral-medium-latest | 128k | ✅ | ❌ | ❌ | Mid-tier balanced |
| mistral-small-latest | 32k | ✅ | ❌ | ❌ | Cheap/fast |
| codestral-latest | 32k | ✅ | ❌ | ❌ | Coding-tuned |
codestral-latest — Mistral’s coding specialist. Excellent at completion and edit tasks; smaller than Devstral but covers most languages well. Use this for inline-style completions; use devstral-2 (via Ollama) for full agentic loops.
DeepSeek (deepseek)
Chinese provider; very strong reasoning (R1) and aggressively cheap pricing. Note: data residency / outbound traffic considerations apply if your project requires non-Chinese hosting.
| Model | Ctx | Tools | Vision | Reasoning | Notes |
|---|---|---|---|---|---|
| deepseek-chat | 128k | ✅ | ❌ | ❌ | General workhorse |
| deepseek-reasoner | 128k | ✅ | ❌ | ✅ | R1-class reasoning |
| deepseek-coder | 128k | ✅ | ❌ | ❌ | Coding-tuned |
deepseek-reasoner — Strong at math and algorithmic reasoning; meaningfully cheaper than o3 for similar quality on bench tasks. Tool calling support is recent and a bit rough; verify your function schemas round-trip cleanly before relying on it for agent loops.
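One way to check that function schemas round-trip cleanly is to validate the arguments string a model returns against the required fields and primitive types your schema declares, before executing anything. The same check helps with the fussier schema adherence noted for grok-3 and Cerebras-hosted Llama. This is a deliberately small sketch, not a JSON Schema validator; ToolSchema, checkToolArgs, and the read_file tool used in testing are hypothetical.

```typescript
// Hypothetical minimal schema: property name -> expected JSON primitive type.
// A real JSON Schema validator covers far more (nesting, enums, formats).
interface ToolSchema {
  name: string;
  properties: Record<string, "string" | "number" | "boolean">;
  required: string[];
}

const hasOwn = (obj: object, key: string): boolean =>
  Object.prototype.hasOwnProperty.call(obj, key);

// Parse a model-produced arguments string and list every mismatch found.
function checkToolArgs(schema: ToolSchema, rawArgs: string): string[] {
  let parsed: unknown;
  try {
    parsed = JSON.parse(rawArgs);
  } catch {
    return [`${schema.name}: arguments are not valid JSON`];
  }
  if (typeof parsed !== "object" || parsed === null || Array.isArray(parsed)) {
    return [`${schema.name}: arguments must be a JSON object`];
  }
  const args = parsed as Record<string, unknown>;
  const problems: string[] = [];
  for (const key of schema.required) {
    if (!hasOwn(args, key)) problems.push(`${schema.name}: missing required "${key}"`);
  }
  for (const [key, value] of Object.entries(args)) {
    if (!hasOwn(schema.properties, key)) {
      problems.push(`${schema.name}: unexpected "${key}"`);
      continue;
    }
    const expected = schema.properties[key];
    if (typeof value !== expected) {
      problems.push(`${schema.name}: "${key}" should be ${expected}`);
    }
  }
  return problems;
}
```

Running a handful of representative prompts through a provider and asserting that checkToolArgs returns an empty list is a cheap smoke test before trusting that provider in an agent loop.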
Cerebras (cerebras)
Inference-only platform — does not train models, but runs Llama-class open weights at extreme speed (often 10-20× faster than typical cloud endpoints) on their wafer-scale hardware.
| Model | Ctx | Tools | Vision | Reasoning | Notes |
|---|---|---|---|---|---|
| llama-3.3-70b | 128k | ✅ | ❌ | ❌ | Best-quality option |
| llama-3.1-8b | 128k | ✅ | ❌ | ❌ | Tiny + extremely fast |
llama-3.3-70b on Cerebras — Use when you want Llama-3.3 quality with 1000+ tokens/sec generation. Great for streaming-heavy chat UIs and agent loops where round-trip count dominates. Tool calling works but the model itself is slightly weaker at strict JSON than GPT-4-class.
Perplexity (perplexity)
Web-search-augmented chat. Models include browsing as a native step in their generation pipeline; you don’t add a separate tool. Citations come back inline.
| Model | Ctx | Tools | Vision | Reasoning | Notes |
|---|---|---|---|---|---|
| sonar-pro | 200k | ⚠️ | ❌ | ❌ | Default. Web-grounded answers + citations |
| sonar | 128k | ⚠️ | ❌ | ❌ | Cheaper variant |
| sonar-reasoning | 128k | ⚠️ | ❌ | ✅ | Reasoning + web search |
Use Perplexity for “what’s the latest on X” prompts where you need fresh sources. Don’t use it for code generation or long agent loops — it isn’t shaped for that.
Together.ai (together)
Inference-only marketplace for open-weights models. We expose a couple of Llama / Mixtral defaults; Together hosts dozens more — extend STATIC_MODELS if you need them.
| Model | Ctx | Tools | Vision | Reasoning | Notes |
|---|---|---|---|---|---|
| meta-llama/Llama-3.3-70B-Instruct | 128k | ⚠️ | ❌ | ❌ | Workhorse open weights |
| mistralai/Mixtral-8x7B-Instruct-v0.1 | 32k | ⚠️ | ❌ | ❌ | Older but cheap MoE |
Fireworks (fireworks)
Same shape as Together — inference-only, open-weights focus, similar Llama/Mixtral lineup.
| Model | Ctx | Tools | Vision | Reasoning | Notes |
|---|---|---|---|---|---|
| accounts/fireworks/models/llama-v3p3-70b-instruct | 128k | ⚠️ | ❌ | ❌ | Llama 3.3 default |
| accounts/fireworks/models/mixtral-8x7b-instruct | 32k | ⚠️ | ❌ | ❌ | Older Mixtral |
OpenRouter (openrouter)
Aggregator front-end — one API key, hundreds of models routed to the cheapest/fastest available backend. Useful for quick experimentation across models, less ideal as a production primary because pricing and latency vary by route.
| Model | Ctx | Tools | Vision | Reasoning | Notes |
|---|---|---|---|---|---|
| anthropic/claude-3.5-sonnet | 200k | ✅ | ✅ | ❌ | Stable Claude route via OpenRouter |
| openai/gpt-4o | 128k | ✅ | ✅ | ❌ | OpenAI passthrough |
| google/gemini-2.0-flash-001 | 1M | ✅ | ✅ | ❌ | Cheap long context |
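The OpenRouter entry in STATIC_MODELS is a deliberately small, stable list; extend it in useModelRegistry.ts if you need the latest Sonnet 4.6 / Opus 4.7 / GPT-5 / Gemini 3.1 routes. Most flagships are exposed under OpenRouter slugs of the form provider/model (e.g. openai/gpt-5.5), but copy the exact slug from OpenRouter’s model list: its naming does not always match the provider’s own IDs (compare anthropic/claude-3.5-sonnet above with Anthropic’s claude-3-5-sonnet-20241022).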
Azure OpenAI (azure_openai)
Enterprise Azure-region-pinned OpenAI deployments. Same models as openai but billed via Azure with regional / compliance guarantees.
| Model | Notes |
|---|---|
| gpt-4o | Standard 4o on Azure |
| gpt-4-turbo | Older; kept for compliance reproducibility |
Amazon Bedrock (bedrock)
AWS-region-pinned Anthropic Claude (and others). Same models, AWS billing, IAM-gated. The Bedrock entries VibeCody pre-lists (below) lag the Anthropic API by 1-2 generations, and new Claude models can reach a given AWS region later than they reach the Anthropic API.
| Model | Notes |
|---|---|
| anthropic.claude-3-5-sonnet-20241022-v2:0 | Stable Sonnet route on Bedrock |
| anthropic.claude-3-haiku-20240307-v1:0 | Stable Haiku route on Bedrock |
To use the latest Anthropic models (Opus 4.7, Sonnet 4.6, Haiku 4.5) on Bedrock, add their Bedrock-side model IDs to useModelRegistry.ts:STATIC_MODELS.bedrock once AWS lists them in your region.
GitHub Copilot (copilot)
Copilot’s chat back-end uses gpt-4o-class models. We expose it as a provider for users on Copilot Business/Enterprise who want to channel chat through that quota.
| Model | Notes |
|---|---|
| gpt-4o | Routed via the Copilot endpoint |
Ollama (ollama)
The most-used provider in VibeCody. ollama covers both local-pulled models (run on your machine) and cloud-hosted models (run on ollama.com when an API key is configured). The full library list lives in vibeui/src/constants/ollamaModels.ts.
VibeCody’s default Ollama model is devstral-2 — Mistral’s 123B coding-agent flagship, non-Chinese origin, native tool calling.
Cloud-hosted flagships (non-Chinese)
| Model | Origin | Best for | Notes |
|---|---|---|---|
| devstral-2 | Mistral / France | Coding agents | 123B MoE, default. Tool calling native. |
| devstral-small-2 | Mistral / France | Cheaper coding | Smaller variant of devstral-2 |
| nemotron-3-super | NVIDIA / US | Reasoning | Llama-derived, RL-tuned for math/code reasoning |
| nemotron-3-nano | NVIDIA / US | Cheap reasoning | Smaller nemotron |
| cogito-2.1 | DeepCogito / US | Hybrid reasoning + tools | Newer entry; promising on agent benches |
| gemma4 | Google / US | General | Open-weights Gemini-adjacent |
| ministral-3 | Mistral / France | Cheap fast | Small but capable |
Cloud-hosted flagships (Chinese-origin)
These are technically excellent but may conflict with data-residency rules. Listed for completeness.
| Model | Origin | Notes |
|---|---|---|
| qwen3-coder, qwen3-coder-next | Alibaba | Strong coding model |
| qwen3-next, qwen3.5 | Alibaba | General-purpose |
| deepseek-v4-pro, deepseek-v4-flash | DeepSeek | Reasoning leader at low cost |
| glm-5, glm-5.1 | Zhipu | Strong agent eval scores |
| kimi-k2.5, kimi-k2.6 | Moonshot | 1T MoE; long context |
| minimax-m2.5, minimax-m2.7 | MiniMax | Agentic/reasoning hybrid |
Notable local-pull models
| Model | Best for | Notes |
|---|---|---|
| qwen2.5-coder:7b | Local coding | Best small-coder; ~5GB RAM |
| llama3.3:70b | Local general | Needs 48GB+ VRAM |
| llama3.2:3b | Mobile-class chat | Runs on a laptop CPU |
| phi4 | Reasoning on small hardware | Microsoft, 14B-class |
| phi4-mini | Edge inference | ~3B-class |
| deepseek-r1:14b | Local reasoning | R1-distilled |
| codellama, starcoder2 | Older code completion | Kept for reproducibility |
| llama3.2-vision | Local vision | If you need image input offline |
devstral-2 vs nemotron-3-super (most-asked)
- devstral-2 wins for coding agents — file edits, run-and-fix, multi-turn tool use. Trained specifically for that loop. SWE-Bench Verified ~58–62% per Mistral’s release numbers.
- nemotron-3-super wins for one-shot reasoning — math, algorithms, “think first then answer” problems. Heavy RLHF on reasoning benches.
- For VibeCody’s daemon (mostly multi-step coding/agent workloads), devstral-2 is the default. Switch to nemotron-3-super in useModelRegistry.ts:PROVIDER_DEFAULT_MODEL.ollama if your usage is reasoning-heavy.
VibeCLI mistralrs (vibecli-mistralrs)
Embedded-in-daemon inference. Talks to the local VibeCLI daemon (:7878 by default) and pins the in-process mistralrs backend via the X-VibeCLI-Backend request header. Models here are HuggingFace repo IDs that lazy-load on first use.
VibeCody’s default mistralrs model is meta-llama/Llama-3.1-8B-Instruct — Meta’s most recent ~8B open-weights model with a 128k context window and tool-calling support.
| Model | Ctx | Best for | Notes |
|---|---|---|---|
| meta-llama/Llama-3.1-8B-Instruct | 128k | Privacy-default — general + tools | Default. Gated (see below) |
| meta-llama/Llama-3.2-3B-Instruct | 128k | Mid-tier general | Gated |
| meta-llama/Llama-3.2-1B-Instruct | 128k | Tiniest Llama | Gated |
| Qwen/Qwen2.5-Coder-7B-Instruct | 32k | Privacy-default coding | Apache-2.0, ungated |
| Qwen/Qwen2.5-7B-Instruct | 32k | General ~7B alternative | Apache-2.0, ungated |
| Qwen/Qwen2.5-Coder-1.5B-Instruct | 32k | Edge / fast coding | Apache-2.0 |
| Qwen/Qwen2.5-3B-Instruct | 32k | Mobile-class chat | Apache-2.0 |
| Qwen/Qwen2.5-1.5B-Instruct | 32k | Edge / fast general | Apache-2.0 |
| Qwen/Qwen2.5-0.5B-Instruct | 32k | Tiniest viable | Apache-2.0 |
| microsoft/Phi-3.5-mini-instruct | 128k | Smart-but-small reasoning | MIT, ungated |
About gating: Meta’s Llama models are gated repos on HuggingFace. A first-time download requires you to (a) accept Meta’s community license on the model page at huggingface.co and (b) export an HF_TOKEN environment variable holding an access token with read scope. Qwen (Apache-2.0) and Phi (MIT) repos are fully open and need no token. If HF_TOKEN isn’t set, the daemon’s first lazy-load of a Llama model fails with a 401/403; switch to a Qwen or Phi model in the picker, or set up the token (see Hugging Face access token docs).
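The gating rule lends itself to a pre-flight check before the first lazy-load, so users get an actionable message instead of a bare 401/403. This sketch assumes gated repos can be recognized by the meta-llama/ prefix, which holds for the entries in the table above but not for HuggingFace in general; preflight and its result shape are hypothetical. It also cannot confirm that the Meta license was actually accepted; only the download itself reveals that.

```typescript
// Hypothetical pre-flight check for the mistralrs picker entries above.
// Assumption: the gated repos in this table are exactly the meta-llama/* ones.
const GATED_PREFIXES = ["meta-llama/"];

type Preflight = { ok: true } | { ok: false; reason: string };

function preflight(modelId: string, env: Record<string, string | undefined>): Preflight {
  const gated = GATED_PREFIXES.some((prefix) => modelId.startsWith(prefix));
  if (!gated) return { ok: true }; // Qwen (Apache-2.0) and Phi (MIT) need no token
  const token = env["HF_TOKEN"];
  if (token === undefined || token.trim() === "") {
    return {
      ok: false,
      reason: `${modelId} is gated: accept Meta's license on huggingface.co and set HF_TOKEN (read scope), or pick a Qwen/Phi model`,
    };
  }
  // A token is present; license acceptance can only be confirmed by the download.
  return { ok: true };
}
```

In Node, the environment argument would simply be process.env.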
This is the default provider for VibeCody’s privacy-preserving / no-API-key path. Inference is far slower than a hosted endpoint such as Cerebras, but every byte stays on your machine.
Zhipu (zhipu)
Chinese provider; GLM family.
| Model | Notes |
|---|---|
| glm-4-plus | Flagship |
| glm-4-flash | Cheap/fast |
Vercel AI Gateway (vercel_ai)
Gateway with no preset list — you point it at any backend Vercel AI supports. Empty model list in the registry; user supplies the model string.
MiniMax (minimax)
Chinese provider.
| Model | Notes |
|---|---|
| abab6.5s-chat | General chat |
SambaNova (sambanova)
Inference-only, similar shape to Cerebras (fast Llama runs).
| Model | Notes |
|---|---|
| Meta-Llama-3.3-70B-Instruct | Default |
Open vs closed weights
| Closed weights only | Open weights (you can self-host) |
|---|---|
| Claude (Anthropic) | Llama family (Meta) |
| GPT (OpenAI) | Mistral family (incl. Devstral, Codestral, Ministral) |
| Gemini (Google) | Gemma (Google) |
| Grok (xAI) | Qwen (Alibaba) |
| Sonar (Perplexity) | DeepSeek (R1, V3, V4 family) |
| | Phi (Microsoft) |
| | Nemotron (NVIDIA) |
| | GLM (Zhipu) |
| | Kimi (Moonshot) |
| | gpt-oss (OpenAI’s open-weights line) |
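If your project needs to run inference offline or prove no data left the machine, only the open-weights column is viable, and only through Ollama local pulls or the in-daemon mistralrs backend. Ollama Cloud also serves open weights, but it runs them on ollama.com, so prompts leave the host.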
Model lifecycle policy
Models in this picker are not equally durable. Open-weights models on HuggingFace, closed flagships behind a paid API, and inference-only marketplaces all age differently. Plan for it.
Two clocks: supply vs quality
Every model has two deprecation timelines:
- Supply clock: will the model still be available? For open weights from Meta, Microsoft, Mistral, Alibaba, Google, etc., the answer is essentially “forever”; first-party releases from major labs are not yanked from HuggingFace. Closed APIs (gpt-3.5-turbo, older Claude versions) do get sunset on published timelines, typically with 6-18 months’ notice.
- Quality clock: will the model still be the right pick? This clock runs much faster. The small-model tier sees a new generation every 6-12 months (Llama-3.2 → 3.3 → 4, Phi-3.5 → 4 → 4-mini, Qwen-2.5 → 3 → 3.5). The previous version still works; it’s just no longer competitive.
In practice: expect every model in this doc to be obsolete within 18 months, but expect open-weights models to keep working for as long as you have local copies.
Cached-weights floor
When mistralrs first uses a model, weights download once into ~/.cache/huggingface/hub. From that point forward, the model keeps working even if HuggingFace removed the upstream tomorrow. Same applies to Ollama’s local pulls (~/.ollama/models/). Cloud APIs have no equivalent floor — when Anthropic sunsets claude-3-5-sonnet-20241022, every client loses access on the same day.
Practical implication: if reproducibility matters (audit trail, regulated environment), cache open-weights models on disk and avoid relying on closed APIs for the part of the pipeline that must reproduce identically.
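To confirm that the cached-weights floor actually exists for a given model, check the HuggingFace hub cache. In the standard HuggingFace cache layout, each repo lives in a folder named models--{org}--{name} under the hub directory. The sketch below only computes that expected path. It assumes the hub directory is $HF_HOME/hub when HF_HOME is set and ~/.cache/huggingface/hub otherwise, that mistralrs uses the same layout, and POSIX-style paths; hubCacheDir and repoCacheDir are hypothetical helpers. Ollama’s local pulls under ~/.ollama/models/ use a different layout and are not covered.

```typescript
// Hypothetical helpers: compute where a HuggingFace repo's weights should be cached.
// Assumes the HuggingFace cache layout <hub>/models--{org}--{name} and POSIX paths.
type Env = Record<string, string | undefined>;

// Hub cache root: $HF_HOME/hub if HF_HOME is set, else ~/.cache/huggingface/hub.
function hubCacheDir(home: string, env: Env): string {
  const hfHome = env["HF_HOME"] || `${home}/.cache/huggingface`;
  return `${hfHome.replace(/\/+$/, "")}/hub`; // drop trailing slashes before joining
}

// "meta-llama/Llama-3.1-8B-Instruct" -> <hub>/models--meta-llama--Llama-3.1-8B-Instruct
function repoCacheDir(repoId: string, home: string, env: Env): string {
  return `${hubCacheDir(home, env)}/models--${repoId.split("/").join("--")}`;
}
```

In Node you would pass os.homedir() and process.env, then test the result with fs.existsSync; a backup job can archive exactly those folders to preserve the floor.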
Risk table
| Risk | Likelihood | What breaks | Mitigation |
|---|---|---|---|
| Cloud API sunsets a model | High (planned, ~yearly) | Cloud-API jobs using that model | Track provider deprecation pages; fail over to a sibling model |
| Open-weights repo renamed on HF | Low | First-time pulls; cached copies fine | Update the model id in STATIC_MODELS |
| Open-weights repo removed | Very low for first-party | First-time pulls; cached copies fine | Same as above; preserve cache backups |
| New generation released, old becomes “legacy” | Near-certain (6-12 mo) | Nothing breaks; competitive position erodes | Periodic registry refresh |
| HF gating policy tightens | Low-Med | New downloads of gated models fail | Switch to ungated alternative (Qwen/Phi) |
| License terms change | Low | Theoretical — already-released weights stay under their original license | Monitor license pages |
| mistralrs drops architecture support | Low (Llama, Phi, Qwen are tier-1) | Models can’t load with the latest mistralrs | Pin mistralrs version; upgrade selectively |
Hardening options for VibeCody
If you ship VibeCody to users who need reproducibility (enterprise, regulated, research), there are three knobs you can turn beyond the defaults:
- Pin commit SHAs in the registry. mistralrs accepts HuggingFace revision specs: change "meta-llama/Llama-3.1-8B-Instruct" to "meta-llama/Llama-3.1-8B-Instruct@<commit-sha>" in STATIC_MODELS. This immunizes against silent re-uploads under the same tag. Cost: you have to bump the SHA manually when you want a newer revision.
- Add a MODEL_REPLACEMENT_MAP. When a model 404s on pull, the daemon could log “this model has been retired; suggested replacement: X” and either fail fast or auto-substitute. Not implemented today; ~30 lines if you want it.
- Ship a snapshot mirror. For closed environments without HuggingFace access, mirror the open-weights models you depend on into an internal artifact store (S3, Artifactory) and point HF_ENDPOINT at it. The daemon will pull from there.
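The MODEL_REPLACEMENT_MAP idea above could look roughly like this. It is a sketch only: the map’s contents, the resolveRetired function, and the fail-fast / substitute policy names are hypothetical, and the real hook would sit in the daemon’s pull path, which is not shown here. The single example mapping is illustrative, chosen from models discussed on this page, not taken from any published deprecation notice.

```typescript
// Hypothetical: retired model id -> suggested replacement id (illustrative entry).
const MODEL_REPLACEMENT_MAP: Readonly<Record<string, string>> = {
  "claude-3-5-sonnet-20241022": "claude-sonnet-4-6",
};

type RetiredPolicy = "fail-fast" | "substitute";

interface Resolution {
  modelId: string; // id the caller should actually use
  warning: string; // log line for the operator
}

// Called after a pull/request for modelId comes back 404.
function resolveRetired(modelId: string, policy: RetiredPolicy): Resolution {
  let current = modelId;
  const seen = new Set<string>([current]);
  // Follow chains like A -> B -> C, stopping if the map accidentally cycles.
  while (Object.prototype.hasOwnProperty.call(MODEL_REPLACEMENT_MAP, current)) {
    const next = MODEL_REPLACEMENT_MAP[current];
    if (seen.has(next)) break;
    seen.add(next);
    current = next;
  }
  if (current === modelId) {
    throw new Error(`${modelId} returned 404 and has no known replacement`);
  }
  const message = `${modelId} has been retired; suggested replacement: ${current}`;
  if (policy === "fail-fast") throw new Error(message);
  return { modelId: current, warning: message };
}
```

Defaulting to fail-fast suits the reproducibility-focused deployments this section targets, since an auto-substituted model no longer reproduces the original outputs.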
None of these are urgent. They become useful when you start depending on a specific model staying frozen.
What we update and when
The lists in this doc and in vibeui/src/hooks/useModelRegistry.ts are refreshed on a roughly quarterly cadence — when a new flagship lands at one of the major providers, or when an existing model gets formally sunset. The “Last updated” date at the top of this page is authoritative; if it’s more than 6 months old when you read this, treat the picks as historical and verify against the providers’ current docs.
How to set a different default
Per-provider default lives in vibeui/src/hooks/useModelRegistry.ts:
```typescript
export const PROVIDER_DEFAULT_MODEL: Record<string, string> = {
  claude: "claude-opus-4-7",
  openai: "gpt-5.5",
  gemini: "gemini-3.1-pro",
  // ...
  ollama: "devstral-2", // ← change here
  // ...
};
```
To add a new model to a provider’s picker, append to the array in STATIC_MODELS in the same file. (For Ollama, the array is sourced from vibeui/src/constants/ollamaModels.ts.)
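Conceptually, the two maps work together: STATIC_MODELS lists what the picker offers, and PROVIDER_DEFAULT_MODEL chooses the preselected entry. The self-contained sketch below shows that relationship. It assumes STATIC_MODELS is a Record<string, string[]> (the text above only says it holds arrays), and the trimmed contents and the resolveModel helper are hypothetical, not copied from useModelRegistry.ts.

```typescript
// Trimmed, hypothetical copies of the two registry maps.
const STATIC_MODELS: Record<string, string[]> = {
  claude: ["claude-opus-4-7", "claude-sonnet-4-6", "claude-haiku-4-5"],
  ollama: ["devstral-2", "devstral-small-2", "nemotron-3-super"],
};

const PROVIDER_DEFAULT_MODEL: Record<string, string> = {
  claude: "claude-opus-4-7",
  ollama: "devstral-2",
};

// Use the requested model if the picker lists it; otherwise fall back to the default.
function resolveModel(provider: string, requested?: string): string {
  const listed = STATIC_MODELS[provider] ?? [];
  if (requested !== undefined && listed.includes(requested)) return requested;
  const fallback = PROVIDER_DEFAULT_MODEL[provider];
  if (fallback === undefined) throw new Error(`no default model for provider ${provider}`);
  return fallback;
}
```

Under this shape, adding a model is an append to the provider’s array, and changing the default is a single assignment in PROVIDER_DEFAULT_MODEL; keeping the default inside the listed array avoids preselecting a model the picker cannot show.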
Per CLAUDE.md, the model list is the only file you need to touch for a frontend-only change.
See also
- Providers overview — per-provider setup and API key configuration.
- Configuration — daemon and UI settings.
- Failover — chain providers so one going down doesn’t kill your session.