Ollama Provider
Run AI models locally with Ollama: no API key, no cost, and your code never leaves your machine.
Why Ollama?
- Free – no API costs, no rate limits, no usage tracking
- Private – all inference happens locally; ideal for proprietary code
- Offline – works without an internet connection (air-gapped deployments)
- Fast startup – models load in seconds on modern hardware
Installation
macOS:
brew install ollama
Linux:
curl -fsSL https://ollama.ai/install.sh | sh
Docker:
docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
Verify installation:
ollama --version
Pull a Model
# Recommended for coding
ollama pull qwen3-coder
# Popular alternatives
ollama pull llama3.2
ollama pull codellama:34b
ollama pull deepseek-coder-v2:16b
Recommended Models for Coding
| Model | Size | Quality | Speed | Best for |
|---|---|---|---|---|
| qwen3-coder | 480B (cloud) | Excellent | Fast | General coding (VibeCody default) |
| llama3.2:70b | 40 GB | Very good | Medium | Complex reasoning, refactoring |
| llama3.2:8b | 4.7 GB | Good | Very fast | Quick tasks, completions |
| deepseek-coder-v2:16b | 9 GB | Very good | Fast | Code generation, debugging |
| codellama:34b | 19 GB | Good | Medium | Code-specific tasks |
| codellama:7b | 3.8 GB | Fair | Very fast | Low-memory machines |
| qwen2.5-coder:7b | 4.4 GB | Good | Very fast | Balanced quality/speed |
| starcoder2:15b | 9 GB | Good | Fast | Code completion |
Minimum RAM: 8 GB for 7B models, 16 GB for 13-16B models, 64 GB for 70B models.
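The RAM guidance above can be turned into a quick lookup. The following is a minimal sketch, not part of VibeCody or Ollama: the function name `suggest_model_size` is hypothetical, and treating anything between 16 GB and 64 GB as the 13-16B class is an assumption based on the thresholds listed.

```shell
# Hypothetical helper: map installed RAM (whole GB) to the largest model
# class allowed by the minimum-RAM guidance above.
suggest_model_size() {
    ram_gb=$1
    if [ "$ram_gb" -ge 64 ]; then
        echo "70B"
    elif [ "$ram_gb" -ge 16 ]; then
        echo "13-16B"
    elif [ "$ram_gb" -ge 8 ]; then
        echo "7B"
    else
        # Below the smallest documented threshold.
        echo "none"
    fi
}

# Example: a machine with 32 GB of RAM
suggest_model_size 32   # prints 13-16B
```

These are minimums, so leave headroom for your editor, browser, and build tools when choosing a model.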
Configure VibeCody
Option 1: Environment variable (override API URL)
export OLLAMA_HOST="http://localhost:11434"
vibecli --provider ollama
Option 2: Config file (~/.vibecli/config.toml)
[ollama]
enabled = true
api_url = "http://localhost:11434"
model = "qwen3-coder:480b-cloud"
Option 3: CLI flag
vibecli --provider ollama --model llama3.2:8b
Verify Connection
# Check Ollama is running
curl http://localhost:11434/api/tags
# Test with VibeCody
vibecli --provider ollama -c "Say hello"
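In scripts it can help to wait until the Ollama API answers before calling VibeCody. The sketch below polls the same /api/tags endpoint used above. The function name `wait_for_ollama` and its retry parameters are hypothetical, and it assumes curl is installed.

```shell
# Hypothetical helper: succeed once the Ollama API answers, fail after N tries.
# Usage: wait_for_ollama [base_url] [attempts] [delay_seconds]
wait_for_ollama() {
    base_url=${1:-http://localhost:11434}
    attempts=${2:-10}
    delay=${3:-1}
    i=0
    while [ "$i" -lt "$attempts" ]; do
        # -f: treat HTTP errors as failures; -s: no progress output;
        # --max-time: never hang longer than 2 seconds per attempt.
        if curl -fs --max-time 2 "$base_url/api/tags" >/dev/null 2>&1; then
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    return 1
}

# Example (commented out so the snippet has no side effects):
# wait_for_ollama && vibecli --provider ollama -c "Say hello"
```

A non-zero exit status means the server never responded, which usually points to the "Connection refused" case in Troubleshooting.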
GPU Acceleration
Ollama automatically uses GPU when available:
- NVIDIA: Install the NVIDIA (CUDA) drivers; Ollama detects the GPU automatically
- Apple Silicon: Metal acceleration is used by default (no setup needed)
- AMD: ROCm support on Linux
Check GPU detection:
ollama run llama3.2:8b "hello"
# Watch the server logs: Ollama reports how many model layers were offloaded to the GPU
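A scripted alternative to reading logs is to inspect the output of ollama ps while a model is loaded. The sketch below assumes that output is a table with a header row and a PROCESSOR column containing values such as "100% GPU" or "100% CPU"; check this against your Ollama version. The function name `uses_gpu` is hypothetical.

```shell
# Hypothetical helper: read `ollama ps` output on stdin and succeed if any
# loaded model row mentions GPU. Assumes a header row followed by one row
# per loaded model.
uses_gpu() {
    # NR > 1 skips the header; exit status 0 means at least one GPU row.
    awk 'NR > 1 && /GPU/ { found = 1 } END { exit (found ? 0 : 1) }'
}

# Example (commented out so the snippet has no side effects):
# ollama ps | uses_gpu && echo "GPU in use" || echo "running on CPU only"
```

Run it shortly after a prompt, since Ollama unloads idle models and an empty table will report no GPU use.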
Air-Gapped Deployment
For environments without internet access:
1. On a machine with internet, pull the model:
ollama pull llama3.2:8b
2. Copy the model directory (~/.ollama/models/) to the air-gapped machine.
3. Use the provided Docker Compose for a self-contained deployment:
docker-compose up -d
This starts VibeCLI + Ollama as a sidecar with no external network dependencies.
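The copy step can be done with a single archive. This is a minimal sketch, assuming tar is available on both machines; the function names `pack_models` and `unpack_models` and the archive name in the usage note are hypothetical.

```shell
# Hypothetical helper: archive an Ollama models directory for offline transfer.
# Usage: pack_models <ollama_home> <archive_path>
pack_models() {
    ollama_home=$1
    archive=$2
    if [ ! -d "$ollama_home/models" ]; then
        echo "no models directory under $ollama_home" >&2
        return 1
    fi
    # -C stores paths relative to ollama_home ("models/..."), so the archive
    # can be unpacked under any Ollama home on the target machine.
    tar -czf "$archive" -C "$ollama_home" models
}

# Hypothetical counterpart for the air-gapped machine.
# Usage: unpack_models <archive_path> <ollama_home>
unpack_models() {
    archive=$1
    ollama_home=$2
    mkdir -p "$ollama_home"
    tar -xzf "$archive" -C "$ollama_home"
}
```

On the connected machine, run pack_models "$HOME/.ollama" ollama-models.tar.gz. After moving the archive across, run unpack_models ollama-models.tar.gz "$HOME/.ollama" and confirm the result with ollama list.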
Troubleshooting
Connection refused
Error: Connection refused (os error 61)
Ollama is not running (the error code is 61 on macOS and 111 on Linux). Start it:
ollama serve
# or on macOS, open the Ollama app
Model not found
Error: model 'xyz' not found
Pull the model first:
ollama pull xyz
List available models:
ollama list
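The two commands above can be combined so that a model is pulled only when it is missing. The sketch below assumes ollama list prints a header row followed by one row per model with the full name (including the tag, such as :latest) in the first column. The function name `has_model` is hypothetical.

```shell
# Hypothetical helper: read `ollama list` output on stdin and succeed if the
# given model name matches the first (NAME) column exactly.
# Usage: ollama list | has_model <name:tag>
has_model() {
    awk -v want="$1" 'NR > 1 && $1 == want { found = 1 } END { exit (found ? 0 : 1) }'
}

# Example (commented out so the snippet has no side effects):
# ollama list | has_model "llama3.2:latest" || ollama pull llama3.2
```

Because the match is exact, pass the name as it appears in the list, tag included.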
Out of memory (OOM)
If Ollama crashes or becomes unresponsive:
- Use a smaller model (e.g., llama3.2:8b instead of llama3.2:70b)
- Close other memory-intensive applications
- Set OLLAMA_MAX_LOADED_MODELS=1 to limit concurrent models
- On Linux, increase swap space
Slow generation
- Ensure GPU acceleration is active (check ollama ps)
- Use a smaller quantization: ollama pull llama3.2:8b-q4_0
- Reduce context window: set num_ctx in a Modelfile
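For the context window tip, a Modelfile needs only a FROM line and a PARAMETER line. The sketch below writes one. The function name `write_small_ctx_modelfile`, the derived model name `llama3.2-small-ctx`, and the value 2048 are illustrative assumptions, not recommended settings.

```shell
# Hypothetical helper: write a Modelfile that derives from a base model and
# caps the context window.
# Usage: write_small_ctx_modelfile <base_model> <num_ctx> <output_path>
write_small_ctx_modelfile() {
    cat > "$3" <<EOF
FROM $1
PARAMETER num_ctx $2
EOF
}

# Example (commented out so the snippet has no side effects):
# write_small_ctx_modelfile llama3.2:8b 2048 Modelfile
# ollama create llama3.2-small-ctx -f Modelfile
# vibecli --provider ollama --model llama3.2-small-ctx
```

A smaller context window reduces memory use and prompt processing time, at the cost of how much code the model can see at once.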