OpenClaw Multi-Model Setup: Stop Burning Money on the Wrong Model

Updated March 22, 2026: Rewrote the model chain to reflect what actually works in production. Dropped local LLMs for task work (they’re not reliable enough), added budget cloud models like MiniMax M2.5 and DeepSeek V3.2, and clarified that your orchestration model must support tool calling. The original version oversold small local models for triage work. This version is honest about their limitations.

Most OpenClaw users pick one model and use it for everything. That’s like hiring a senior architect to sweep the floors.

I run four model tiers in my production setup. Each one handles what it’s good at. The expensive model only touches work that requires judgment. Everything else runs on cheaper alternatives. The result: better output quality AND lower costs.

Here’s the exact setup.

The Model Chain

Think of it as a hierarchy. Always use the cheapest model that can handle the task. Escalate up only when the work demands it.

Tier 1: Ollama Local (Free) — Embeddings Only

Local models running on your machine. Zero API costs. Zero data leaving your box.

What it handles:

  • Semantic memory search embeddings (nomic-embed-text or qwen3-embedding)
  • Code search embeddings
  • Vector store operations

What it does NOT handle anymore: I originally ran 7B and 14B local models for email triage, git commit messages, and cron job screening. In practice, they’re unreliable. They hallucinate action items that don’t exist, miss context that matters, and make bad ESCALATE/SKIP decisions often enough to erode trust. Embeddings are a different story. You don’t need a frontier model to turn text into vectors. A local embedding model does that perfectly, and it’s free.

Setup:

curl -fsSL https://ollama.com/install.sh | sh
ollama pull nomic-embed-text

Configure OpenClaw to use Ollama for embeddings:

memorySearch:
  provider: openai
  remote:
    baseUrl: http://localhost:11434/v1/
  model: nomic-embed-text

The takeaway: Local models are great for embeddings. For anything that requires language understanding or decision-making, use a real model.
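If you want to sanity-check the endpoint outside OpenClaw, the config above maps to a plain HTTP call: the `/v1/` prefix is what makes Ollama speak the OpenAI embeddings API, which is why the config sets `provider: openai`. A minimal sketch, assuming Ollama is serving on its default port; the helper names here are mine, not OpenClaw functions:

```python
# Hedged sketch: hitting Ollama's OpenAI-compatible embeddings endpoint
# directly. URL and model name match the config above; helper names are
# illustrative, not part of OpenClaw.
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/v1/embeddings"

def embedding_request(text: str, model: str = "nomic-embed-text") -> dict:
    # OpenAI-style request body that Ollama's /v1/embeddings accepts
    return {"model": model, "input": text}

def embed(text: str) -> list[float]:
    # POST to the local server and return the embedding vector
    body = json.dumps(embedding_request(text)).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["data"][0]["embedding"]
```

Swap the model string for qwen3-embedding if that's what you pulled; the request shape stays the same.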

Tier 2: Budget Cloud Models (Cheap)

This is where the value is. Cloud-hosted models that cost pennies per call or come bundled with cheap subscriptions. They handle the grunt work that’s too important for a local 7B but doesn’t need a flagship model.

Good options in this tier:

Model         | Strengths                         | Best For
------------- | --------------------------------- | ---------------------------------------
MiniMax M2.5  | 80% SWE-Bench, 198K context       | Code summaries, structured extraction
MiniMax M2.7  | Thorough analysis, research focus | Email triage, report summaries
DeepSeek V3.2 | Fast inference, 3-4s responses    | Quick triage, simple cron jobs
GLM-5         | 92.7% AIME, strong reasoning      | Complex cron tasks, math-heavy analysis
Haiku 4.5     | Fast, Anthropic ecosystem         | File scanning, bulk ops, boilerplate

What this tier handles:

  • Email triage and screening (ESCALATE/SKIP decisions)
  • Cron job summaries (backup reports, service health)
  • File scanning and code grep across large codebases
  • Bulk find/replace operations
  • Simple data extraction and reformatting
  • Git commit message generation

When to use it: If the task requires scanning, not thinking. If you could describe the task as “read this email and tell me if it’s important” or “look at these 50 files and tell me which ones contain X,” that’s budget tier work.

When NOT to use it: Anything requiring judgment, creativity, or handling adversarial input. These models are more susceptible to prompt injection and make worse autonomous decisions. Never put a budget model on your main orchestration layer.
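Because these models make worse autonomous decisions, the cheap defense is structural: constrain the output to a fixed vocabulary and fail safe on anything unexpected. A hypothetical sketch (the prompt wording and function names are mine, not OpenClaw APIs):

```python
# Hypothetical guardrail around a budget-tier triage call. The point is
# the fail-safe default, not the exact prompt wording.
TRIAGE_PROMPT = (
    "You are an email screener. Reply with exactly one word:\n"
    "ESCALATE if a human should see this email, SKIP otherwise.\n\n"
    "Email:\n{email}"
)

def parse_triage(reply: str) -> bool:
    # Return True to escalate. Anything other than a clean SKIP
    # escalates, so a confused or prompt-injected model fails safe.
    return reply.strip().upper() != "SKIP"
```

The asymmetry is deliberate: a false ESCALATE costs one cheap review by a stronger model, a false SKIP loses an email.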

Ollama Pro tip: If you’re already paying for Ollama Pro, you can access MiniMax, DeepSeek, GLM-5, and others as cloud models through the same localhost endpoint. No separate API keys needed. Just use the :cloud tag on any supported model name and Ollama routes it to cloud inference automatically. No ollama pull required:

ollama run minimax-m2.5:cloud "Summarize this email..."
ollama run deepseek-v3.2:cloud "Generate a commit message for..."

Same API, no VRAM impact, no download step.

Tier 3: GPT 5.x (Structured Builds)

OpenAI’s code-capable models. Excellent at structured generation from clear specs. Two options depending on your budget:

GPT 5.3 Codex ($2/$10 per 1M tokens input/output): Purpose-built for agentic coding. 200K context, 32K max output. The sweet spot for code generation if you’re paying per token.

GPT 5.4 ($2.50/$10 per 1M tokens input/output): The full flagship. Stronger reasoning, 1M context window, native computer use. Available via API or through a Pro/Team subscription.

What this tier handles:

  • Code generation from detailed prompts
  • Code reviews
  • Test generation
  • Documentation writing
  • Refactoring with clear patterns
  • API client scaffolding

When to use it: When the task is “build this thing according to this spec.” GPT 5.x is faster than Opus at structured code generation and often produces cleaner code for well-defined tasks.

When NOT to use it: Architecture decisions, ambiguous requirements, creative writing, anything where the spec is fuzzy. These models need precise instructions. Opus handles ambiguity.

Spawn pattern: Your main agent (Opus) writes the detailed prompt, then spawns a GPT sub-agent to execute:

# Opus writes the spec, GPT builds it
sessions_spawn(
  model="openai-codex/gpt-5.4",
  task="Build the CRUD routes according to this spec: [detailed spec here]"
)

Tier 4: Opus 4.6 (Orchestration + Judgment)

The flagship. Your orchestration layer. The model that decides what to do and when.

Critical requirement: Your orchestrator must support tool calling. This isn’t optional. The orchestration model needs to invoke tools: spawn sub-agents, read files, search memory, call APIs, send messages. Models without reliable tool-calling support can’t do this job. Today, that means Claude Opus/Sonnet, GPT-4o/5.x, or Gemini Pro. Don’t try to save money by putting a budget model here. It will break.

What it handles:

  • Main agent orchestration (deciding what to do with incoming messages)
  • PRD writing and architecture planning
  • Creative content (blog posts, social media, documentation)
  • Security decisions (evaluating untrusted input)
  • Complex reasoning and analysis
  • Anything touching untrusted content (email, web scraping, group chats)
  • Sub-agent prompt crafting (writing the instructions other models execute)

Why Opus for orchestration: Two reasons. First, it has the best prompt injection resistance. Your orchestration layer sees every incoming message, every email, every document. It will encounter adversarial input. Opus handles it. Second, it has the best judgment about when NOT to act. A cheaper model might execute an ambiguous instruction. Opus asks for clarification.

Could you use Sonnet instead? Yes, if cost is a concern. Sonnet 4.x has solid tool calling and decent prompt injection resistance. You’ll sacrifice some reasoning depth and creative quality, but for pure orchestration it works. Just don’t go cheaper than that.

The Workflow in Practice

Here’s how a typical interaction flows through the model chain:

  1. Email arrives → MiniMax M2.7 triages it: spam? Skip. Looks important? Escalate.
  2. Escalated email → Opus reads it, decides response strategy, drafts a reply if needed.
  3. “Build me a dashboard” → Opus writes the PRD and spec, spawns Codex to build it.
  4. Codex builds → Opus reviews the output, does a polish pass.
  5. “Search the codebase for security issues” → Opus spawns Haiku to scan files, reviews Haiku’s findings.
  6. Git commit → DeepSeek V3.2 generates the commit message. Fast, cheap, good enough.
  7. Memory search → Local Ollama embeds the query, searches local vector store. Free.

The expensive model (Opus) only touches steps 2, 3 (writing specs), 4 (review), and 5 (reviewing findings). Everything else runs on cheaper alternatives.
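The routing above collapses into a plain lookup table. This sketch uses the model assignments from this article; the event labels and `route()` helper are hypothetical, not OpenClaw APIs:

```python
# The seven workflow steps above as a lookup table. Model assignments
# come from this article; labels and route() are illustrative.
ROUTES = {
    "email_triage":   "minimax-m2.7",      # step 1
    "email_reply":    "opus-4.6",          # step 2
    "spec_writing":   "opus-4.6",          # step 3
    "code_build":     "gpt-5.3-codex",     # step 3
    "code_review":    "opus-4.6",          # step 4
    "file_scan":      "haiku-4.5",         # step 5
    "commit_message": "deepseek-v3.2",     # step 6
    "memory_search":  "nomic-embed-text",  # step 7, local Ollama
}

def route(task: str) -> str:
    # Unknown work defaults to the orchestrator: ambiguity needs judgment
    return ROUTES.get(task, "opus-4.6")
```

Note the default: when you can't classify the work, escalate to the judgment tier rather than guessing cheap.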

Token Optimization

Beyond model selection, there are structural ways to reduce token usage:

Heartbeat Batching

Instead of creating separate cron jobs for email checks, calendar checks, and notification checks, batch them into a single heartbeat that runs every 30 minutes. One context load, multiple checks. Saves thousands of input tokens per day.
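The arithmetic, back-of-envelope: the 3,000-token context load and 200-token marginal cost per check below are illustrative assumptions, not measured numbers.

```python
# Illustrative only: token counts are assumptions, not measurements.
CONTEXT_TOKENS = 3_000   # system prompt + workspace loaded per invocation
CHECK_TOKENS = 200       # extra tokens each individual check adds
CHECKS = 3               # email, calendar, notifications
RUNS_PER_DAY = 48        # every 30 minutes

separate = RUNS_PER_DAY * CHECKS * (CONTEXT_TOKENS + CHECK_TOKENS)  # 3 cron jobs
batched = RUNS_PER_DAY * (CONTEXT_TOKENS + CHECKS * CHECK_TOKENS)   # 1 heartbeat
print(separate - batched)  # 288000: context loads once per run, not three times
```

Under these assumptions the context load dominates, which is exactly why paying it once per heartbeat instead of once per check is the win.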

Memory Search Over Full Context

Don’t dump your entire memory file into every conversation. Use semantic memory search (local Ollama embeddings) to retrieve only the relevant chunks. My MEMORY.md would be 57,000+ characters if loaded raw. Memory search returns the 5-10 relevant lines instead of the whole file.
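Under the hood this is nearest-neighbor lookup over the Tier 1 embeddings: embed the query, rank stored chunks by cosine similarity, keep the top few. A minimal self-contained sketch (function names are mine; OpenClaw's memory search does this for you):

```python
# Sketch of the retrieval step: rank stored chunks by cosine similarity
# to the query embedding, return the top k. The toy 2-D vectors below
# stand in for real embedding output.
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def top_k(query_vec: list[float],
          chunks: list[tuple[str, list[float]]],
          k: int = 5) -> list[str]:
    # chunks: (text, vector) pairs from the local vector store
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, c[1]), reverse=True)
    return [text for text, _ in ranked[:k]]
```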

Sub-Agent Isolation

Spawn sub-agents for tasks that don’t need your main session’s context. A Codex agent building a React component doesn’t need your email history, your calendar, or your personal notes in its context window. Isolated sessions start clean.

Prompt Compression

Write tight, specific prompts for sub-agents. “Build CRUD routes for this schema” with the schema attached is better than “Read all these files and figure out what to build.” Fewer input tokens, better output.

Cost Breakdown

Per-token pricing as of March 2026 (input/output per 1M tokens):

Model         | Input      | Output     | What It Does                    | % of Work
------------- | ---------- | ---------- | ------------------------------- | ---------
Ollama Local  | $0         | $0         | Embeddings, vector search       | ~15%
Haiku 4.5     | $1         | $5         | Scanning, bulk ops, boilerplate | ~15%
MiniMax M2.5  | Ollama Pro | Ollama Pro | Triage, summaries, commits      | ~20%
GPT 5.3 Codex | $2         | $10        | Code gen, reviews, tests        | ~15%
GPT 5.4       | $2.50      | $10        | Structured builds, complex code | ~10%
Opus 4.6      | $5         | $25        | Orchestration, judgment         | ~25%

The math that matters: Opus at $5/$25 per million tokens handles the 25% of work that requires judgment, security awareness, and creative thinking. Everything else runs on models that cost 50-95% less per token. If you ran Opus for everything, you’d burn through rate limits or budget in days. If you ran budget models for everything, the quality would crater and your agent would be vulnerable to every prompt injection it encountered.
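Worked out with the table's input prices and work shares (treating the Ollama Pro models as $0 marginal cost under the flat subscription):

```python
# Blended input-token price per 1M tokens, from the table above.
# Ollama Pro models count as $0 marginal cost (flat subscription).
tiers = [
    ("ollama-local",  0.00, 0.15),
    ("haiku-4.5",     1.00, 0.15),
    ("minimax-m2.5",  0.00, 0.20),
    ("gpt-5.3-codex", 2.00, 0.15),
    ("gpt-5.4",       2.50, 0.10),
    ("opus-4.6",      5.00, 0.25),
]
blended = sum(price * share for _, price, share in tiers)
all_opus = 5.00
print(f"blended ${blended:.2f}/1M input vs ${all_opus:.2f}/1M all-Opus")
```

On input tokens alone that's $1.95 versus $5.00 per million, a roughly 61% reduction, before counting the larger gap on output pricing.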

Many of these are also available via subscriptions (Anthropic Max, OpenAI Pro, Ollama Pro), which give you flat-rate access subject to rate limits instead of per-token billing. Pick whichever billing model fits your usage pattern.

Getting Started

If you’re running a single-model setup today:

  1. Install Ollama and move your embeddings to a local model. Instant savings, zero quality loss.
  2. Keep a tool-calling model as your main agent. Opus, Sonnet, or GPT-4o minimum. Don’t downgrade your orchestration layer to save money. That’s a security and reliability decision.
  3. Add budget cloud models for triage, scanning, and mechanical tasks. MiniMax M2.5 and DeepSeek V3.2 are good starting points.
  4. Spawn sub-agents with task-specific models for builds (Codex) and bulk ops (Haiku).
  5. Batch your periodic checks into heartbeats instead of individual cron jobs.

Or Just Let Me Do It

I’ve been iterating on this multi-model setup for months. Every model assignment, every spawn pattern, every optimization came from real usage and real mistakes. If you want help getting multi-model orchestration running, reach out.