Codex CLI with Ollama & Local Models (2026) — Offline, Private, Zero Cost

Q: Which Ollama model works best with Codex CLI?

We recommend qwen2.5-coder:32b (best code quality), deepseek-coder-v2:16b (balanced performance), or codestral:22b (fast completions). For GPUs under 16GB VRAM, use qwen2.5-coder:7b or deepseek-coder:6.7b.

Q: Is a local Ollama model better than OpenAI o4-mini?

OpenAI o4-mini has higher code quality and stronger reasoning. Local models win on cost (free), privacy (data stays local), and offline availability. Choose local for privacy-sensitive projects or cost-constrained workflows.

Q: What hardware do I need for Codex CLI with Ollama?

7B models need 8GB VRAM (or 16GB unified memory), 13B models need 12–16GB VRAM, and 32B/34B models need 24GB+ VRAM. CPU-only inference is possible but 5–10x slower.

Quick answer: Codex CLI can run fully offline against an Ollama local model so your code never leaves your machine. In ~/.codex/config.toml, add a [model_providers] entry pointing at Ollama's OpenAI-compatible endpoint (http://localhost:11434/v1), set model_provider to it and model to a pulled model (e.g. qwen2.5-coder). You then get a private, zero-cost AI coding assistant — ideal for corporate networks and sensitive codebases.

Running AI-powered coding tools locally has never been more practical. With Ollama providing a drop-in OpenAI-compatible API on your own machine, you can point Codex CLI at a local model and get a fully private, offline, zero-cost AI coding assistant. Your source code stays on your hardware, API bills go to zero, and you can work without an internet connection — a compelling tradeoff for many professional workflows.

This guide covers everything: installing Ollama, choosing the right coding model for your hardware, configuring Codex CLI to use the local API, and understanding the real tradeoffs between local and cloud models.

1. Why Ollama + Codex CLI?

Codex CLI was designed to work with any OpenAI-compatible REST API — not just OpenAI's own servers. Ollama exposes exactly that interface on localhost:11434, making the integration seamless. Here is why developers choose this combination:

Privacy: Your code, prompts, and completions never leave your machine. Ideal for proprietary codebases, client work under NDA, or any context where sending code to a third-party server is not acceptable.
Zero API cost: Once you have the model downloaded, inference is free. You can run thousands of Codex tasks per day with no usage bill. The only cost is electricity and amortized hardware.
Offline development: Ollama runs entirely locally. No internet connection required after the initial model download. Works on flights, in air-gapped environments, or when your internet connection is unreliable.
Customizable: Full control over system prompts, AGENTS.md instructions, and model parameters. You can fine-tune the model or swap versions without any API constraints.
Tradeoffs to understand: Local models currently lag behind frontier models like OpenAI o4-mini in code quality, especially for complex reasoning and multi-step refactors. They also require local GPU or CPU resources — a powerful machine matters more here than with cloud APIs.

2. Install Ollama

Ollama supports macOS, Linux, and Windows. Choose your platform below.

macOS

# Option 1: Homebrew (recommended)
brew install ollama

# Option 2: Download the .dmg installer from ollama.com
# Visit https://ollama.com/download and run the installer

Linux

curl -fsSL https://ollama.com/install.sh | sh

Windows

Download the .exe installer from ollama.com/download and run it. Ollama will install as a background service.

Start and verify the Ollama service

# Start the Ollama server (runs in the foreground; keep this terminal open)
ollama serve

# Verify the API is responding (in a separate terminal)
curl http://localhost:11434/api/tags

A successful response returns a JSON object listing your downloaded models. If you see connection refused, ensure ollama serve is running in the background.

Note: On macOS, installing via the .dmg package automatically starts Ollama as a menu bar app on login. You do not need to run ollama serve manually in that case.
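The verification step above can be wrapped in a small reusable function. This is only a sketch: `check_ollama` is a hypothetical helper name, and the 3-second timeout is an arbitrary choice. It queries the same /api/tags endpoint and reports whether the server answered.

```shell
# check_ollama: hypothetical helper that reports whether an Ollama server answers.
# Argument 1 (optional): base URL, defaulting to the local address used in this guide.
check_ollama() {
    base_url="${1:-http://localhost:11434}"
    # -f: treat HTTP errors as failure, -s: silent, --max-time: give up after 3 seconds
    if curl -fs --max-time 3 "$base_url/api/tags" >/dev/null 2>&1; then
        echo "Ollama is responding at $base_url"
    else
        echo "No response from $base_url (is 'ollama serve' running?)" >&2
        return 1
    fi
}

# Usage:
#   check_ollama                          # local server
#   check_ollama http://localhost:11434   # explicit URL
```

Because the function returns a non-zero status on failure, it can guard later commands, for example `check_ollama && ollama list`.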

3. Recommended Models & Download Commands

Not all open-source models are equal for code generation. The following table summarizes the best options tested with Codex CLI in 2026, ordered by code quality.

Model	Disk Size	VRAM Required	Code Quality	Best For
`qwen2.5-coder:32b`	20 GB	24 GB+	⭐⭐⭐⭐⭐	Best code generation, complex refactors
`deepseek-coder-v2:16b`	9 GB	16 GB	⭐⭐⭐⭐	Daily development, balanced performance
`codestral:22b`	13 GB	16 GB	⭐⭐⭐⭐	Fast completions, Mistral ecosystem
`qwen2.5-coder:7b`	4.7 GB	8 GB	⭐⭐⭐	Low VRAM machines, getting started
`deepseek-coder:6.7b`	4.1 GB	8 GB	⭐⭐⭐	Lightweight, CPU-friendly workloads
`codellama:13b`	8 GB	12 GB	⭐⭐⭐	Meta open-source, broad community support

Pull a model with the ollama pull command. The download is one-time; subsequent runs use the cached model.

# Recommended: best code quality (requires 24GB+ VRAM)
ollama pull qwen2.5-coder:32b

# Recommended: balanced performance (16GB VRAM)
ollama pull deepseek-coder-v2:16b

# Low-VRAM alternative (8GB VRAM / 16GB unified memory)
ollama pull qwen2.5-coder:7b

Check available models at any time with ollama list.
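As a rough illustration of how the table maps to a decision, the sketch below picks a model tag from the available VRAM in whole gigabytes. `pick_model` is a hypothetical helper, the thresholds simply mirror the "VRAM Required" column, and the fallback below 8 GB is an assumption rather than a tested recommendation.

```shell
# pick_model: hypothetical helper mapping VRAM (whole GB) to a model tag from the table.
pick_model() {
    vram_gb="$1"
    if [ "$vram_gb" -ge 24 ]; then
        echo "qwen2.5-coder:32b"
    elif [ "$vram_gb" -ge 16 ]; then
        echo "deepseek-coder-v2:16b"
    elif [ "$vram_gb" -ge 8 ]; then
        echo "qwen2.5-coder:7b"
    else
        # Assumption: below 8 GB, fall back to the smallest model and expect CPU speeds.
        echo "warning: under 8 GB VRAM, expect slow CPU inference" >&2
        echo "deepseek-coder:6.7b"
    fi
}

# Usage:
#   ollama pull "$(pick_model 16)"
```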

4. Configure Codex CLI to Use Ollama

Codex CLI needs two pieces of information to use Ollama: the base URL of the local API endpoint, and a model name. There are three ways to configure this, from quick one-off tests to permanent project defaults.

Method 1: Environment Variables (temporary, per-session)

Set the environment variables in your current terminal session before running Codex. This is the fastest way to test the integration without changing any config files.

export OPENAI_BASE_URL=http://localhost:11434/v1
export OPENAI_API_KEY=ollama
export CODEX_MODEL=qwen2.5-coder:32b

codex "Refactor src/auth.ts — extract JWT validation into a separate function"

Why OPENAI_API_KEY=ollama? Ollama does not require authentication, but Codex CLI expects an API key to be set. The literal string ollama satisfies the requirement without sending any real credentials. Any non-empty string works.
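If you switch between cloud and local models often, one option is to scope these variables to a single invocation instead of exporting them for the whole session. The wrapper below is a sketch: `codex_local` and `LOCAL_CODEX_MODEL` are hypothetical names, and the variables are the same ones shown above.

```shell
# codex_local: hypothetical wrapper that runs codex against the local Ollama endpoint.
# The subshell keeps the exported variables from leaking into the rest of the session.
codex_local() {
    (
        export OPENAI_BASE_URL="http://localhost:11434/v1"
        export OPENAI_API_KEY="ollama"   # placeholder value; Ollama ignores it
        # LOCAL_CODEX_MODEL is an illustrative override, not a Codex CLI setting.
        export CODEX_MODEL="${LOCAL_CODEX_MODEL:-qwen2.5-coder:32b}"
        codex "$@"
    )
}

# Usage:
#   codex_local "Write a Python quicksort function"
#   LOCAL_CODEX_MODEL=qwen2.5-coder:7b codex_local "explain this code"
```

Plain `codex` calls in the same terminal keep using whatever provider your environment or config already selects.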

Method 2: config.toml (persistent, recommended)

Edit ~/.codex/config.toml to make Ollama the default provider permanently, so you no longer need to export the base URL and model name in every session.

# ~/.codex/config.toml
model = "qwen2.5-coder:32b"
model_provider = "ollama"

[model_providers.ollama]
name = "Ollama"
base_url = "http://localhost:11434/v1"
env_key = "OPENAI_API_KEY"

Then add the following line to your shell profile (~/.zshrc on macOS or ~/.bashrc on Linux) so the API key is always available:

export OPENAI_API_KEY=ollama

Reload your shell with source ~/.zshrc (or open a new terminal), and all subsequent codex invocations will use your local Ollama model by default.

Method 3: CLI flags (per-command override)

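Pass the model and a provider override directly on the command line. Useful when you have a global config but want to run a single command against a different model. Flag names vary between Codex CLI releases, so confirm them with codex --help; recent releases accept config overrides through -c key=value.

codex --model qwen2.5-coder:32b \
      -c model_provider=ollama \
      "explain this code"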

5. Verify Configuration

After configuring Codex CLI, run a few quick checks to confirm everything is wired up correctly before starting real work.

# Check that Codex CLI is installed and print its version
codex --version

# Run a simple test task — should respond using the local model
codex "Write a Python quicksort function"

# View request logs to confirm traffic goes to local port
# Run this in a separate terminal before issuing codex commands
OLLAMA_DEBUG=1 ollama serve

When OLLAMA_DEBUG=1 is set, Ollama prints each incoming request to the terminal. You should see POST /v1/chat/completions entries arriving when Codex sends tasks to the model. If you see no traffic, double-check that OPENAI_BASE_URL is set correctly and that Ollama is listening on port 11434.

Tip: Run curl http://localhost:11434/v1/models to list the models Ollama is exposing through its OpenAI-compatible endpoint. The model names returned here are the exact strings to use in config.toml or --model flags.
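To pull just the model names out of that response without extra tooling, a small text filter is usually enough. The sketch below assumes the response follows the OpenAI list shape, with each model under an "id" key; `model_ids` is a hypothetical helper name.

```shell
# model_ids: hypothetical filter that reads a /v1/models JSON response on stdin
# and prints one model id per line. Assumes each model appears as "id":"<name>".
model_ids() {
    grep -o '"id":[[:space:]]*"[^"]*"' | sed 's/.*"\([^"]*\)"$/\1/'
}

# Usage:
#   curl -s http://localhost:11434/v1/models | model_ids
```

This is plain pattern matching rather than a JSON parser, so treat it as a convenience for eyeballing names; a tool such as jq is more robust if you have it installed.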

6. Local vs Cloud Model Comparison

Understanding where local models genuinely win — and where they fall short — helps you decide when to use Ollama vs a cloud provider like OpenAI.

Dimension	Ollama Local	OpenAI o4-mini
Cost	Free (hardware amortized)	Per-token billing
Code Quality	⭐⭐⭐ (7B) / ⭐⭐⭐⭐ (32B)	⭐⭐⭐⭐⭐
First-token Latency	0.5–3s (GPU) / 5–15s (CPU)	1–5s
Privacy	Data never leaves machine	Sent to OpenAI servers
Offline	✅ Fully offline	❌ Requires internet
Context Window	4K–32K (model-dependent)	128K tokens
Multimodal	Partial support (LLaVA, etc.)	✅ Full support
Reasoning	Limited on complex tasks	Strong (o-series reasoning)

The practical recommendation: use local Ollama models for routine tasks — writing boilerplate, explaining functions, writing tests for known APIs, and code reviews. Switch to OpenAI o4-mini or a stronger cloud model for complex multi-file refactors, architectural decisions, or any task that requires deep multi-step reasoning.

7. AGENTS.md for Local Models

Codex CLI reads an AGENTS.md file from your project root to customize assistant behavior. When using local models, tailoring this file specifically for the constraints of smaller context windows and lower reasoning capacity improves output quality noticeably.

Create or update AGENTS.md in your project root with content like the following:

# Coding Assistant

You are a coding assistant running on a local Ollama model.

## Constraints
- Keep responses concise (local models have limited context)
- Prefer showing code over long explanations
- Add inline comments for non-obvious logic

## Allowed Operations
- Read any file in this project
- Write to src/, tests/, docs/
- Run: npm test, pytest, go test

Context window tip: Local models typically offer 4K–32K context windows, compared to OpenAI's 128K. Keep your prompts and codebase context focused for best results. Avoid tasks that require reading many large files simultaneously — break them into smaller, focused subtasks instead.
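A quick way to judge whether a file is likely to fit is to estimate its token count from its size. The sketch below uses the common rule of thumb of roughly 4 bytes per token, which is an assumption and varies by model and language; `fits_context` is a hypothetical helper name.

```shell
# fits_context: hypothetical helper that estimates tokens as bytes / 4 (rounded up)
# and succeeds only if the estimate is within the given context size.
# Usage: fits_context <file> <context_tokens>
fits_context() {
    file="$1"
    ctx_tokens="$2"
    bytes=$(wc -c < "$file" | tr -d ' ')
    est_tokens=$(( (bytes + 3) / 4 ))   # assumption: ~4 bytes per token
    echo "$file: ~$est_tokens tokens (limit $ctx_tokens)"
    [ "$est_tokens" -le "$ctx_tokens" ]
}

# Usage:
#   fits_context src/auth.ts 4096 || echo "Split this task into smaller pieces"
```

Leave headroom for the system prompt, AGENTS.md, and the model's reply, since they share the same window.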

8. Troubleshooting

The following table covers the most common errors when using Codex CLI with Ollama, along with their causes and fixes.

Error	Cause	Fix
`connection refused localhost:11434`	Ollama service not running	Run `ollama serve` in a terminal or start the Ollama app
`model not found`	Model not downloaded locally	Run `ollama pull <model-name>` to download the model first
`context length exceeded`	Prompt too long for model's context window	Use smaller, focused tasks; switch to a model with 32K context window
Very slow responses (10+ seconds per token)	GPU not being used; running on CPU only	On Linux/Windows, install NVIDIA/CUDA drivers and verify with `nvidia-smi`; on macOS, Metal acceleration is automatic
Poor generation quality, repetitive output	Model too small for the task complexity	Upgrade to the 32B variant or switch to an OpenAI cloud model
`invalid API key` or authentication error	API key environment variable not set or empty	Set `OPENAI_API_KEY=ollama` (the literal string "ollama")

GPU drivers matter: On Linux, if Ollama is running on CPU despite having an NVIDIA GPU, install the NVIDIA driver and CUDA toolkit. If you run Ollama inside Docker, also make sure nvidia-container-toolkit is configured. Then run ollama run qwen2.5-coder:7b interactively and check the Ollama server logs, which state whether a GPU was detected.
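Another check is `ollama ps`, which lists loaded models with a PROCESSOR column such as "100% GPU" or "100% CPU". The filter below is a sketch that flags models running entirely on CPU; `cpu_only_models` is a hypothetical helper name, and the column layout is assumed from Ollama's documented output.

```shell
# cpu_only_models: hypothetical filter for `ollama ps` output read from stdin.
# Skips the header row and prints the NAME column of rows reporting 100% CPU.
cpu_only_models() {
    awk 'NR > 1 && /100% CPU/ { print $1 }'
}

# Usage (while a model is loaded):
#   ollama ps | cpu_only_models
```

An empty result means no loaded model is running purely on CPU.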

9. Advanced: Remote Ollama Server

If you have a single powerful GPU workstation on your local network, you can share it across multiple developer machines. This is common in small teams where one person has a 4090 or similar hardware.

Server-side configuration

By default Ollama only listens on 127.0.0.1. To accept connections from other machines on the network, set the host to 0.0.0.0:

# Server: allow remote connections on all interfaces
OLLAMA_HOST=0.0.0.0 ollama serve

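Or set it permanently. In a shell profile, add the export line below. If Ollama runs as a systemd service, set the same variable with an Environment= line in a service override (systemctl edit ollama.service) and restart the service: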

export OLLAMA_HOST=0.0.0.0

Client-side configuration

On each developer's machine, point Codex CLI to the server's local IP address instead of localhost:

# Replace 192.168.1.100 with your GPU server's local IP
export OPENAI_BASE_URL=http://192.168.1.100:11434/v1
export OPENAI_API_KEY=ollama

Security warning: Ollama has no built-in authentication. Anyone who can reach port 11434 on the server can use the model and potentially access request logs. On shared networks, protect this endpoint with a reverse proxy (Nginx or Caddy) that enforces API key authentication, or restrict access by IP using firewall rules. Never expose an unauthenticated Ollama port to the public internet.

Example Nginx reverse proxy with basic auth

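# /etc/nginx/sites-available/ollama
server {
    listen 443 ssl;
    server_name ollama.internal.yourcompany.com;

    # "listen ... ssl" requires a certificate and key.
    # These paths are placeholders; substitute your own files.
    ssl_certificate     /etc/nginx/ssl/ollama.crt;
    ssl_certificate_key /etc/nginx/ssl/ollama.key;

    location / {
        auth_basic           "Ollama API";
        auth_basic_user_file /etc/nginx/.htpasswd;
        proxy_pass           http://127.0.0.1:11434;
        proxy_set_header     Host $host;
    }
}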

10. Frequently Asked Questions

Can Codex CLI use Ollama local models?

Yes. Codex CLI supports any OpenAI-compatible API backend, including Ollama. Set OPENAI_BASE_URL=http://localhost:11434/v1 and OPENAI_API_KEY=ollama to connect Codex CLI to a local model running via Ollama. The integration requires no plugins or patches — it works with any Ollama model that supports chat completions.

Which Ollama model works best with Codex CLI?

For users with 24GB+ VRAM, qwen2.5-coder:32b delivers the best code quality among open-source models and competes well with mid-tier cloud models. For 16GB VRAM, deepseek-coder-v2:16b provides an excellent balance of quality and speed. For getting started or machines with 8GB VRAM (or Apple Silicon with 16GB unified memory), qwen2.5-coder:7b is the recommended starting point. codestral:22b from Mistral AI is also a strong option if you prefer that ecosystem.

Is a local Ollama model better than OpenAI o4-mini?

For raw code generation quality and complex reasoning tasks, OpenAI o4-mini is currently superior. The o-series models have strong multi-step reasoning that local 32B models cannot fully replicate. However, local models win decisively on cost (completely free after hardware), privacy (code never leaves your machine), and offline availability. The best approach for many teams is a hybrid strategy: use local models for routine coding tasks and switch to o4-mini for complex architectural work.

What hardware do I need for Codex CLI with Ollama?

Hardware requirements scale with model size. As a practical guide: 7B models require 8GB VRAM (or 16GB Apple Silicon unified memory); 13B models require 12–16GB VRAM; 16B models require 16GB VRAM; 32B–34B models require 24GB+ VRAM, or can run across two 16GB GPUs with Ollama's multi-GPU support. Running on CPU only is possible with any model, but expect inference to be 5–10x slower than GPU. An NVIDIA RTX 3090 (24GB) or RTX 4090 (24GB) covers all models in the recommendation table. On macOS, an M2 Max or M3 Pro with 36GB unified memory handles 32B models well.