When Codex Said I Burned a Month of Tokens in One Night

I woke up to a rate limit wall and a usage report claiming I’d consumed an entire month of Codex Pro tokens overnight. I hadn’t. But the system was convinced I had, and it wasn’t letting me do anything until it reset.

This is about the night I tried to use Codex for batch code summarization and learned why some workloads should never leave your machine.

The Project

I run a local code search engine. It indexes every repository I work on: it splits source files into semantically meaningful chunks (function boundaries, class definitions, exports), embeds them for vector search, and generates natural language summaries so you can search code by describing what it does rather than remembering exact syntax.

The index had grown to around 70,000 chunks across a dozen projects. Each chunk needs a one or two sentence summary explaining what it does. The summaries are what make natural language code search actually work. Without them, you’re just doing fuzzy string matching with extra steps.

I’d been generating these summaries with a local model through Ollama. It worked, but it was slow. At roughly 20 tokens per second on a 14B parameter model, summarizing 70,000 chunks was an overnight-and-then-some job. Multiple workers helped, but it was still measured in hours.
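The back-of-envelope for that, assuming roughly 40 output tokens per two-sentence summary (my assumption, not a measured figure):

```python
# Rough single-worker estimate for local summarization time.
chunks = 70_000
output_tokens_per_summary = 40   # ~2 sentences, assumed
tokens_per_second = 20           # observed local 14B throughput

single_worker_hours = chunks * output_tokens_per_summary / tokens_per_second / 3600
print(round(single_worker_hours, 1))  # ~38.9 hours on one worker
```

Even split across several workers, that's comfortably "measured in hours."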

The Idea

Codex Pro is fast. Really fast. And I’m already paying for it. Why not point the summarization pipeline at Codex instead of waiting for the local GPU to grind through everything?

The plan was simple: swap the summary model from the local Ollama endpoint to the Codex API, crank up the worker count, and let it rip. Each summary request was tiny. Send a code chunk (usually 50 to 200 lines), get back two sentences. Input tokens were small, output tokens were smaller. Even at 70,000 chunks, the actual token volume shouldn’t have been that significant relative to a Pro tier.

I wrote the batch script, pointed it at the Codex endpoint, set it to three concurrent workers, and let it run overnight.
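In spirit, the batch script was nothing more than a small worker pool. A minimal sketch, where `summarize_chunk` is a placeholder for the real function that POSTed each chunk to the remote endpoint:

```python
from concurrent.futures import ThreadPoolExecutor

def summarize_chunk(chunk: str) -> str:
    # Placeholder: the real version sends `chunk` plus a short system
    # prompt to the summarization endpoint and returns two sentences.
    return f"summary of {len(chunk)}-char chunk"

def run_batch(chunks: list[str], workers: int = 3) -> list[str]:
    # Three concurrent workers, same as the overnight run.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(summarize_chunk, chunks))

summaries = run_batch(["def f(): ...", "class C: ..."])
```

Nothing exotic: small stateless requests, modest concurrency, no retries beyond the basics.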

The Morning After

I checked progress around midnight. About 15,000 chunks had been summarized. Speed was exactly what I expected from Codex: fast, clean summaries, no quality complaints. I went to bed.

By morning, the script had stopped. Rate limited. Every request was getting rejected with quota errors. I checked the Codex usage dashboard.

It claimed I’d used my entire monthly allocation.

In one night. On code summaries.

The Math Didn’t Add Up

Here’s why I knew something was wrong. Each summarization request was roughly:

  • Input: 100 to 300 tokens (a code chunk plus a short system prompt)
  • Output: 30 to 50 tokens (two sentences)

Even being generous, call it 400 tokens per chunk round-trip. At 70,000 chunks, that’s 28 million tokens total. On a Codex Pro plan, 28 million tokens shouldn’t come close to the monthly limit. These are small, stateless requests with no conversation history, no tool calls, no reasoning traces.
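The arithmetic behind that estimate, taking the generous end of each range plus padding:

```python
# Generous per-chunk budget from the figures above.
input_tokens = 300    # code chunk plus short system prompt
output_tokens = 50    # two sentences
overhead = 50         # padding to round up to 400/chunk
per_chunk = input_tokens + output_tokens + overhead

total = 70_000 * per_chunk
print(f"{total:,}")  # 28,000,000 tokens end to end
```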

But the usage tracker was showing token counts that were wildly higher than what the actual requests should have produced. The numbers looked like the system was counting each request as if it had a full conversation context attached, or double-counting input tokens, or doing something else entirely wrong with the accounting.

What Actually Happened

The Codex API's token counting for batch-style workloads was glitching. When you send thousands of small, rapid requests (as opposed to the interactive back-and-forth Codex is designed for), the usage reporting gets confused: it over-counts tokens, possibly applying per-session overhead to each stateless request, or failing to distinguish cached tokens from fresh ones.

The actual computation was happening correctly. The summaries were good. The model was doing its job. But the billing/quota system was treating each tiny request as if it were far more expensive than it actually was.

I opened a support thread. OpenAI acknowledged it was a known issue with rapid-fire small requests against the Codex endpoint. The quota wasn’t actually consumed. It was a reporting bug that triggered the rate limiter prematurely.

But knowing that didn’t help at 7 AM when I needed my coder agent to work and the API was locked out.

The Fix: Ollama Cloud Models

While waiting for the quota to reset (or for OpenAI to fix the reporting), I needed an alternative fast. Ollama had recently launched cloud model support, letting you run large cloud-hosted models through the same Ollama API you use for local inference. Same interface, same tooling, different backend.

The setup:

  • Model: qwen3-coder:480b-cloud via Ollama (cloud-hosted, not local)
  • Workers: 3 concurrent
  • Throughput: roughly 2,200 summaries per hour
  • Cost: free tier (Ollama cloud models have generous free usage)

The key insight: Ollama cloud models use the exact same API as local models. I changed one model name in the config and the entire pipeline kept working. No new SDK, no new auth flow, no code changes beyond the model identifier.

# Before (Codex)
SUMMARY_MODEL = "gpt-5.3-codex"
SUMMARY_ENDPOINT = "https://api.openai.com/v1/..."

# After (Ollama cloud)
SUMMARY_MODEL = "qwen3-coder:480b-cloud"
SUMMARY_ENDPOINT = "http://localhost:11434/api/generate"

Same localhost endpoint. Ollama handles the routing to cloud infrastructure transparently. The 480B parameter model produced summaries that were arguably better than what Codex was generating, since it’s a coding specialist with a massive parameter count.
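For reference, a request against that endpoint looks like the sketch below. It follows Ollama's /api/generate shape (JSON body, stream: false returns a single object with the completion under "response"); the prompt text is illustrative:

```python
import json
import urllib.request

def build_request(model: str, prompt: str) -> urllib.request.Request:
    # Ollama's /api/generate takes a JSON body; with stream=False it
    # returns one JSON object whose "response" field is the completion.
    body = json.dumps({"model": model, "prompt": prompt, "stream": False})
    return urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=body.encode(),
        headers={"Content-Type": "application/json"},
    )

req = build_request("qwen3-coder:480b-cloud",
                    "Summarize this function in two sentences:\n...")
# resp = urllib.request.urlopen(req)               # needs Ollama running
# summary = json.loads(resp.read())["response"]
```

Whether the model name resolves locally or to cloud infrastructure is entirely Ollama's concern; the caller never knows the difference.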

And here’s what it doesn’t do:

  1. It doesn’t hit phantom rate limits
  2. It doesn’t lock you out because a billing system glitched
  3. It doesn’t require a separate API key or auth flow
  4. It slots into the existing Ollama workflow seamlessly

The Architecture Now

The code search pipeline now uses a hybrid approach, with local embeddings and cloud summarization through Ollama:

Source repos → Code-aware chunking → Local embedding (qwen3-embedding:8b on GPU)
                                   → Cloud summarization (qwen3-coder:480b-cloud via Ollama)
                                   → Dual embedding (code + summary vectors)
                                   → SQLite index

Embeddings run locally on the GPU (free, fast, no network dependency). Summaries route through Ollama’s cloud models (free tier, fast, zero config overhead). Re-indexing is incremental (content hashing skips unchanged chunks). A nightly cron job catches new commits.
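The incremental skip can be as simple as a hash comparison. A minimal sketch of the idea, assuming the real index stores hashes in SQLite rather than a dict:

```python
import hashlib

def content_hash(chunk: str) -> str:
    # Stable fingerprint of the chunk's text.
    return hashlib.sha256(chunk.encode()).hexdigest()

def chunks_to_reindex(chunks: dict[str, str], stored: dict[str, str]) -> list[str]:
    # Compare each chunk's hash against the one recorded in the index;
    # only new or changed chunks go back through summarization/embedding.
    return [cid for cid, text in chunks.items()
            if stored.get(cid) != content_hash(text)]

stored = {"a": content_hash("def f(): pass")}
todo = chunks_to_reindex({"a": "def f(): pass", "b": "def g(): pass"}, stored)
# todo == ["b"]: unchanged chunk "a" is skipped
```

On a nightly run, most commits touch a handful of files, so nearly everything short-circuits here.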

The summary quality from qwen3-coder:480b is comparable to or better than what Codex was producing. For one to two sentence code descriptions, a 480B coding specialist is more than capable.

The Takeaways

  1. Batch workloads and interactive APIs don’t always mix. Codex is optimized for the back-and-forth of coding sessions: long context, tool calls, multi-turn reasoning. Firing 70,000 tiny stateless requests at it is a usage pattern the system wasn’t designed for. The model handled it fine. The billing system didn’t.

  2. Usage dashboards can lie in both directions. Sometimes dashboards show full quota when you’re actually blocked. This time it showed depleted quota when barely anything was consumed. Neither direction is reliable for real-time decisions.

  3. Ollama cloud models are a cheat code for batch work. Same API as local models, no new integration, free tier that handles serious volume. When your local GPU isn’t fast enough and your paid API is acting up, Ollama cloud models sit right in the middle: fast, free, and compatible with your existing tooling.

  4. Speed isn’t worth fragility. Codex was faster for this job. But it introduced a dependency that failed catastrophically at the worst possible time. The Ollama pipeline uses the same interface whether you’re running local or cloud. If cloud goes down, swap to a local model and keep going. That kind of flexibility is worth more than raw speed.

The Codex Pro subscription is still worth it for interactive coding work. But for batch jobs that touch tens of thousands of items? Use something that won’t lock you out overnight because a token counter had a bug.