Codex CLI with Ollama & Local Models (2026) — Offline, Private, Zero Cost
Running AI-powered coding tools locally has never been more practical. With Ollama providing a drop-in OpenAI-compatible API on your own machine, you can point Codex CLI at a local model and get a fully private, offline, zero-cost AI coding assistant. Your source code stays on your hardware, API bills go to zero, and you can work without an internet connection — a compelling tradeoff for many professional workflows.
This guide covers everything: installing Ollama, choosing the right coding model for your hardware, configuring Codex CLI to use the local API, and understanding the real tradeoffs between local and cloud models.
1. Why Ollama + Codex CLI?
Codex CLI was designed to work with any OpenAI-compatible REST API — not just OpenAI's own servers. Ollama exposes exactly that interface on localhost:11434, making the integration seamless. Here is why developers choose this combination:
- Privacy: Your code, prompts, and completions never leave your machine. Ideal for proprietary codebases, client work under NDA, or any context where sending code to a third-party server is not acceptable.
- Zero API cost: Once you have the model downloaded, inference is free. You can run thousands of Codex tasks per day with no usage bill. The only cost is electricity and amortized hardware.
- Offline development: Ollama runs entirely locally. No internet connection required after the initial model download. Works on flights, in air-gapped environments, or when your internet connection is unreliable.
- Customizable: Full control over system prompts, AGENTS.md instructions, and model parameters. You can fine-tune the model or swap versions without any API constraints.
- Tradeoffs to understand: Local models currently lag behind frontier models like OpenAI o4-mini in code quality, especially for complex reasoning and multi-step refactors. They also require local GPU or CPU resources, so a powerful machine matters more here than with cloud APIs.
2. Install Ollama
Ollama supports macOS, Linux, and Windows. Choose your platform below.
macOS
# Option 1: Homebrew (recommended)
brew install ollama
# Option 2: Download the .dmg installer from ollama.com
# Visit https://ollama.com/download and run the installer
Linux
curl -fsSL https://ollama.com/install.sh | sh
Windows
Download the .exe installer from ollama.com/download and run it. Ollama will install as a background service.
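The three install paths above can be folded into one helper that prints the right command for the current machine. This is a minimal sketch: install_hint is a hypothetical name, and it only distinguishes the platforms by the output of uname -s.

```bash
# Print the install command this guide suggests for a given OS name.
# Usage: install_hint [os_name]   (defaults to the output of `uname -s`)
install_hint() {
  local os="${1:-$(uname -s)}"
  case "$os" in
    Darwin) echo "brew install ollama" ;;
    Linux)  echo "curl -fsSL https://ollama.com/install.sh | sh" ;;
    # Anything else (for example Git Bash on Windows) falls back to the download page
    *)      echo "Download the installer from https://ollama.com/download" ;;
  esac
}

install_hint
```

The helper only prints the command; running it stays a deliberate, separate step.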
Start and verify the Ollama service
# Start the Ollama background service
ollama serve
# Verify the API is responding (in a separate terminal)
curl http://localhost:11434/api/tags
A successful response returns a JSON object listing your downloaded models. If you see connection refused, ensure ollama serve is running in the background.
Note: On macOS, the .dmg package automatically starts Ollama as a menu bar app on login, so you do not need to run ollama serve manually in that case.
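When the service is started from a script, the API may need a moment before it answers. The sketch below polls the same /api/tags endpoint until it responds. The function name wait_for_ollama, the retry count, and the delay are illustrative choices, not part of Ollama.

```bash
# Poll the Ollama API until it answers or the attempts run out.
# Usage: wait_for_ollama [base_url] [attempts] [delay_seconds]
wait_for_ollama() {
  local base_url="${1:-http://localhost:11434}"
  local attempts="${2:-10}"
  local delay="${3:-1}"
  local i=1
  while [ "$i" -le "$attempts" ]; do
    # -f: fail on HTTP errors, -s: quiet, -o /dev/null: discard the body
    if curl -fs -o /dev/null --max-time 2 "$base_url/api/tags"; then
      echo "Ollama is responding at $base_url"
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "No response from $base_url after $attempts attempts" >&2
  return 1
}

# Example: wait up to 10 seconds before continuing
# wait_for_ollama && echo "ready"
```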
3. Recommended Models & Download Commands
Not all open-source models are equal for code generation. The following table summarizes the best options tested with Codex CLI in 2026, ordered by code quality.
| Model | Disk Size | VRAM Required | Code Quality | Best For |
|---|---|---|---|---|
| qwen2.5-coder:32b (Best) | 20 GB | 24 GB+ | ⭐⭐⭐⭐⭐ | Best code generation, complex refactors |
| deepseek-coder-v2:16b | 9 GB | 16 GB | ⭐⭐⭐⭐ | Daily development, balanced performance |
| codestral:22b | 13 GB | 16 GB | ⭐⭐⭐⭐ | Fast completions, Mistral ecosystem |
| qwen2.5-coder:7b (New) | 4.7 GB | 8 GB | ⭐⭐⭐ | Low VRAM machines, getting started |
| deepseek-coder:6.7b | 4.1 GB | 8 GB | ⭐⭐⭐ | Lightweight, CPU-friendly workloads |
| codellama:13b | 8 GB | 12 GB | ⭐⭐⭐ | Meta open-source, broad community support |
Pull a model with the ollama pull command. The download is one-time; subsequent runs use the cached model.
# Recommended: best code quality (requires 24GB+ VRAM)
ollama pull qwen2.5-coder:32b
# Recommended: balanced performance (16GB VRAM)
ollama pull deepseek-coder-v2:16b
# Low-VRAM alternative (8GB VRAM / 16GB unified memory)
ollama pull qwen2.5-coder:7b
Check available models at any time with ollama list.
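The VRAM thresholds from the table can be turned into a small helper for setup scripts. This is a sketch under the guide's own numbers: pick_model is a hypothetical name, and it only chooses among the three models in the download commands above.

```bash
# Map available VRAM (whole GB) to the model this guide recommends for it.
# Usage: pick_model <vram_gb>
pick_model() {
  local vram_gb="$1"
  if [ "$vram_gb" -ge 24 ]; then
    echo "qwen2.5-coder:32b"
  elif [ "$vram_gb" -ge 16 ]; then
    echo "deepseek-coder-v2:16b"
  else
    # Below 16 GB, fall back to the 7B model (the table lists 8 GB for it)
    echo "qwen2.5-coder:7b"
  fi
}

# Example: pull whichever model fits a 16 GB card
# ollama pull "$(pick_model 16)"
```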
4. Configure Codex CLI to Use Ollama
Codex CLI needs two pieces of information to use Ollama: the base URL of the local API endpoint, and a model name. There are three ways to configure this, from quick one-off tests to permanent project defaults.
Method 1: Environment Variables (temporary, per-session)
Set the environment variables in your current terminal session before running Codex. This is the fastest way to test the integration without changing any config files.
export OPENAI_BASE_URL=http://localhost:11434/v1
export OPENAI_API_KEY=ollama
export CODEX_MODEL=qwen2.5-coder:32b
codex "Refactor src/auth.ts — extract JWT validation into a separate function"
Why OPENAI_API_KEY=ollama? Ollama does not require authentication, but Codex CLI expects an API key to be set. The literal string ollama satisfies the requirement without sending any real credentials. Any non-empty string works.
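If you would rather not export these variables into the whole session, a wrapper can apply them to a single command. The sketch below uses env for that. The name with_ollama is hypothetical, and the three variables are the ones shown in this method.

```bash
# Run one command with the Ollama settings applied only to that command.
# Usage: with_ollama <command> [args...]
with_ollama() {
  env \
    OPENAI_BASE_URL="http://localhost:11434/v1" \
    OPENAI_API_KEY="ollama" \
    CODEX_MODEL="${CODEX_MODEL:-qwen2.5-coder:32b}" \
    "$@"
}

# Example:
# with_ollama codex "Write a Python quicksort function"
```

Because env only changes the environment of the child process, the surrounding shell keeps whatever cloud settings it already had.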
Method 2: config.toml (persistent, recommended)
Edit ~/.codex/config.toml to make Ollama the default provider permanently. This means you never need to set environment variables again for local model work.
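# ~/.codex/config.toml
# Default model and the provider entry that serves it
model = "qwen2.5-coder:32b"
model_provider = "ollama"
# Provider definition: display name, OpenAI-compatible endpoint, and the variable holding the key
[model_providers.ollama]
name = "Ollama"
base_url = "http://localhost:11434/v1"
env_key = "OPENAI_API_KEY"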
Then add the following line to your shell profile (~/.zshrc on macOS or ~/.bashrc on Linux) so the API key is always available:
export OPENAI_API_KEY=ollama
Reload your shell with source ~/.zshrc (or open a new terminal), and all subsequent codex invocations will use your local Ollama model by default.
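Setup scripts tend to be run more than once, so it helps to add the export line only when it is missing. The sketch below shows one idempotent way to do that. ensure_line is a hypothetical helper, and the profile path in the example is an assumption to adjust for your shell.

```bash
# Append a line to a file only if that exact line is not already present.
# Usage: ensure_line <file> <line>
ensure_line() {
  local file="$1"
  local line="$2"
  touch "$file"
  # -q: quiet, -x: match the whole line, -F: treat the pattern as a fixed string
  if ! grep -qxF -- "$line" "$file"; then
    printf '%s\n' "$line" >> "$file"
  fi
}

# Example (adjust the profile path for your shell):
# ensure_line "$HOME/.zshrc" "export OPENAI_API_KEY=ollama"
```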
Method 3: CLI flags (per-command override)
Pass provider and model directly on the command line. Useful when you have a global config but want to run a single command against a different model.
codex --model qwen2.5-coder:32b \
--provider ollama \
"explain this code"
5. Verify Configuration
After configuring Codex CLI, run a few quick checks to confirm everything is wired up correctly before starting real work.
# Check the installed Codex CLI version
codex --version
# Run a simple test task — should respond using the local model
codex "Write a Python quicksort function"
# View request logs to confirm traffic goes to local port
# Run this in a separate terminal before issuing codex commands
OLLAMA_DEBUG=1 ollama serve
When OLLAMA_DEBUG=1 is set, Ollama prints each incoming request to the terminal. You should see POST /v1/chat/completions entries arriving when Codex sends tasks to the model. If you see no traffic, double-check that OPENAI_BASE_URL is set correctly and that Ollama is listening on port 11434.
Tip: Run curl http://localhost:11434/v1/models to list the models Ollama is exposing through its OpenAI-compatible endpoint. The model names returned here are the exact strings to use in config.toml or --model flags.
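To check for one specific model name in a script, the id fields can be pulled out of the /v1/models response. The sketch below assumes the OpenAI-style list shape (a data array of objects with an id field) and uses plain text matching rather than a JSON parser. list_model_ids and has_model are hypothetical helper names.

```bash
# Extract the "id" values from an OpenAI-style model list read on stdin.
# This is a plain-text match, not a real JSON parser; jq is sturdier if installed.
list_model_ids() {
  grep -oE '"id"[[:space:]]*:[[:space:]]*"[^"]*"' | sed -E 's/.*"([^"]*)"$/\1/'
}

# Succeed only if the given model name appears in the list read on stdin.
# Usage: has_model <model_name> < models.json
has_model() {
  list_model_ids | grep -qxF -- "$1"
}

# Example against a running server:
# curl -s http://localhost:11434/v1/models | has_model "qwen2.5-coder:7b" && echo "model is available"
```

The match is exact, so a name without its tag (for example qwen2.5-coder) does not count as present.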
6. Local vs Cloud Model Comparison
Understanding where local models genuinely win — and where they fall short — helps you decide when to use Ollama vs a cloud provider like OpenAI.
| Dimension | Ollama Local | OpenAI o4-mini |
|---|---|---|
| Cost | Free (hardware amortized) | Per-token billing |
| Code Quality | ⭐⭐⭐ (7B) / ⭐⭐⭐⭐ (32B) | ⭐⭐⭐⭐⭐ |
| First-token Latency | 0.5–3s (GPU) / 5–15s (CPU) | 1–5s |
| Privacy | Data never leaves machine | Sent to OpenAI servers |
| Offline | ✅ Fully offline | ❌ Requires internet |
| Context Window | 4K–32K (model-dependent) | 200K tokens |
| Multimodal | Partial support (LLaVA, etc.) | ✅ Full support |
| Reasoning | Limited on complex tasks | Strong (o-series reasoning) |
The practical recommendation: use local Ollama models for routine tasks — writing boilerplate, explaining functions, writing tests for known APIs, and code reviews. Switch to OpenAI o4-mini or a stronger cloud model for complex multi-file refactors, architectural decisions, or any task that requires deep multi-step reasoning.
7. AGENTS.md for Local Models
Codex CLI reads an AGENTS.md file from your project root to customize assistant behavior. When using local models, tailoring this file specifically for the constraints of smaller context windows and lower reasoning capacity improves output quality noticeably.
Create or update AGENTS.md in your project root with content like the following:
# Coding Assistant
You are a coding assistant running on a local Ollama model.
## Constraints
- Keep responses concise (local models have limited context)
- Prefer showing code over long explanations
- Add inline comments for non-obvious logic
## Allowed Operations
- Read any file in this project
- Write to src/, tests/, docs/
- Run: npm test, pytest, go test
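Since AGENTS.md is sent along with each task, its size counts against the model's context window. The sketch below estimates how much of the window the file uses. The 4-bytes-per-token ratio is a rough rule of thumb rather than any model's real tokenizer, and both function names and the 10% default are illustrative assumptions.

```bash
# Rough token estimate for a file, assuming about 4 bytes per token.
# The 4:1 ratio is a rule of thumb, not the tokenizer of any specific model.
approx_tokens() {
  local bytes
  bytes=$(wc -c < "$1")
  echo $(( bytes / 4 ))
}

# Warn when a file would use more than a chosen share of the context window.
# Usage: check_agents_budget <file> <context_tokens> [max_percent]
check_agents_budget() {
  local file="$1" context="$2" max_percent="${3:-10}"
  local used
  used=$(approx_tokens "$file")
  if [ $(( used * 100 )) -gt $(( context * max_percent )) ]; then
    echo "warning: $file is ~$used tokens, over ${max_percent}% of a ${context}-token window"
    return 1
  fi
  echo "ok: $file is ~$used tokens"
}

# Example: check AGENTS.md against a 4096-token window
# check_agents_budget AGENTS.md 4096
```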
8. Troubleshooting
The following table covers the most common errors when using Codex CLI with Ollama, along with their causes and fixes.
| Error | Cause | Fix |
|---|---|---|
| connection refused localhost:11434 | Ollama service not running | Run ollama serve in a terminal or start the Ollama app |
| model not found | Model not downloaded locally | Run ollama pull <model-name> to download the model first |
| context length exceeded | Prompt too long for model's context window | Use smaller, focused tasks; switch to a model with a 32K context window |
| Very slow responses (10+ seconds per token) | GPU not being used; running on CPU only | On Linux/Windows, install the NVIDIA drivers with CUDA support and verify with nvidia-smi; on macOS, Metal acceleration is automatic |
| Poor generation quality, repetitive output | Model too small for the task complexity | Upgrade to the 32B variant or switch to an OpenAI cloud model |
| invalid API key or authentication error | API key environment variable not set or empty | Set OPENAI_API_KEY=ollama (the literal string "ollama") |
Tip: If you run Ollama in Docker, make sure nvidia-container-toolkit is configured. Run ollama run qwen2.5-coder:7b interactively and check the Ollama logs, which state explicitly whether the GPU or the CPU is in use.
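For wrapper scripts, the message-based rows of the table above can be encoded as a lookup that turns an error message into its suggested fix. This is a sketch only: suggest_fix is a hypothetical name, and the match is a case-sensitive substring test against the error strings listed in the table.

```bash
# Map an error message to the fix suggested in the troubleshooting table.
# Usage: suggest_fix "<error text>"
suggest_fix() {
  case "$1" in
    *"connection refused"*)      echo "Start the service: ollama serve" ;;
    *"model not found"*)         echo "Download the model: ollama pull <model-name>" ;;
    *"context length exceeded"*) echo "Split the task or use a model with a larger context window" ;;
    *"invalid API key"*|*"authentication"*)
                                 echo "Set the key: export OPENAI_API_KEY=ollama" ;;
    *)                           echo "No known fix; check the Ollama logs" ;;
  esac
}

# Example:
# suggest_fix "dial tcp: connection refused localhost:11434"
```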
9. Advanced: Remote Ollama Server
If you have a single powerful GPU workstation on your local network, you can share it across multiple developer machines. This is common in small teams where one person has a 4090 or similar hardware.
Server-side configuration
By default Ollama only listens on 127.0.0.1. To accept connections from other machines on the network, set the host to 0.0.0.0:
# Server: allow remote connections on all interfaces
OLLAMA_HOST=0.0.0.0 ollama serve
Or set it permanently in your systemd service or shell profile:
export OLLAMA_HOST=0.0.0.0
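For the systemd route, the variable is usually set through a drop-in file rather than by editing the unit itself. The sketch below writes such a drop-in. It assumes the service is named ollama.service, as created by the Linux install script, and write_ollama_override is a hypothetical helper name.

```bash
# Write a systemd drop-in that sets OLLAMA_HOST for the ollama service.
# Usage: write_ollama_override <dropin_dir> [host]
write_ollama_override() {
  local dir="$1"
  local host="${2:-0.0.0.0}"
  mkdir -p "$dir"
  printf '[Service]\nEnvironment="OLLAMA_HOST=%s"\n' "$host" > "$dir/override.conf"
}

# On a typical systemd install this would be run as root, followed by a reload:
# write_ollama_override /etc/systemd/system/ollama.service.d
# systemctl daemon-reload && systemctl restart ollama
```

Taking the directory as an argument makes it possible to inspect the generated file in a scratch location before touching /etc.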
Client-side configuration
On each developer's machine, point Codex CLI to the server's local IP address instead of localhost:
# Replace 192.168.1.100 with your GPU server's local IP
export OPENAI_BASE_URL=http://192.168.1.100:11434/v1
export OPENAI_API_KEY=ollama
Security warning: Anyone who can reach port 11434 on the server can use the model and potentially access request logs. On shared networks, protect this endpoint with a reverse proxy (Nginx or Caddy) that enforces API key authentication, or restrict access by IP using firewall rules. Never expose an unauthenticated Ollama port to the public internet.
Example Nginx reverse proxy with basic auth
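# /etc/nginx/sites-available/ollama
server {
    listen 443 ssl;
    server_name ollama.internal.yourcompany.com;
    # "listen ... ssl" needs a certificate and key; these paths are placeholders
    ssl_certificate     /etc/nginx/ssl/ollama.crt;
    ssl_certificate_key /etc/nginx/ssl/ollama.key;
    location / {
        # Require a username and password from the htpasswd file
        auth_basic "Ollama API";
        auth_basic_user_file /etc/nginx/.htpasswd;
        # Forward authenticated requests to the local Ollama API
        proxy_pass http://127.0.0.1:11434;
        proxy_set_header Host $host;
    }
}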
10. Frequently Asked Questions
Can Codex CLI use Ollama local models?
Yes. Codex CLI supports any OpenAI-compatible API backend, including Ollama. Set OPENAI_BASE_URL=http://localhost:11434/v1 and OPENAI_API_KEY=ollama to connect Codex CLI to a local model running via Ollama. The integration requires no plugins or patches — it works with any Ollama model that supports chat completions.
Which Ollama model works best with Codex CLI?
For users with 24GB+ VRAM, qwen2.5-coder:32b delivers the best code quality among open-source models and competes well with mid-tier cloud models. For 16GB VRAM, deepseek-coder-v2:16b provides an excellent balance of quality and speed. For getting started or machines with 8GB VRAM (or Apple Silicon with 16GB unified memory), qwen2.5-coder:7b is the recommended starting point. codestral:22b from Mistral AI is also a strong option if you prefer that ecosystem.
Is a local Ollama model better than OpenAI o4-mini?
For raw code generation quality and complex reasoning tasks, OpenAI o4-mini is currently superior. The o-series models have strong multi-step reasoning that local 32B models cannot fully replicate. However, local models win decisively on cost (completely free after hardware), privacy (code never leaves your machine), and offline availability. The best approach for many teams is a hybrid strategy: use local models for routine coding tasks and switch to o4-mini for complex architectural work.
What hardware do I need for Codex CLI with Ollama?
Hardware requirements scale with model size. As a practical guide: 7B models require 8GB VRAM (or 16GB Apple Silicon unified memory); 13B models require 12–16GB VRAM; 16B models require 16GB VRAM; 32B–34B models require 24GB+ VRAM, or can run across two 16GB GPUs with Ollama's multi-GPU support. Running on CPU only is possible with any model, but expect inference to be 5–10x slower than GPU. An NVIDIA RTX 3090 (24GB) or RTX 4090 (24GB) covers all models in the recommendation table. On macOS, an M2 Max with 32GB or an M3 Pro with 36GB of unified memory handles 32B models well.