Ops Deck
Developer Infrastructure Platform
Overview
The Ops Deck is a suite of local-first backend APIs that power my AI agent (Rocinante), development workflow, and content pipeline. Every service runs on my workstation, using local models for embeddings and cloud models for summaries, so routine search operations cost zero cloud tokens.
The core principle: if the data is local, the processing should be local. Cloud APIs are for reasoning and generation. Everything else (search, templates, knowledge retrieval, content publishing) runs on-premises with no external dependencies.
Architecture
Process Management
All services are managed via PM2 with auto-restart, log rotation, and crash recovery. A single ecosystem.config.cjs defines every service with environment variables, working directories, and health checks.
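A PM2 setup like the one above maps to a config along these lines. This is a minimal sketch: the service name, port, path, and memory limit are illustrative, not the real ecosystem.config.cjs.

```javascript
// Illustrative ecosystem.config.cjs entry. The service name, cwd,
// port, and memory limit are assumptions for the example only.
module.exports = {
  apps: [
    {
      name: "code-search-api",          // hypothetical service name
      script: "server.js",
      cwd: "/srv/ops-deck/code-search", // hypothetical working directory
      env: { PORT: 4101, OLLAMA_URL: "http://127.0.0.1:11434" },
      autorestart: true,                // PM2 crash recovery
      max_memory_restart: "512M",       // restart if memory climbs
    },
  ],
};
```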
Model Layer
Ollama runs locally for embeddings (qwen3-embedding:8b) and routes to Ollama Pro cloud for summaries and heavy generation. Embeddings stay on-device with zero API cost. Summarization uses cloud models (MiniMax M2.5, GLM-5, Kimi K2.5) for better quality.
Storage
SQLite databases per service. No PostgreSQL overhead for single-user workloads. Each service owns its schema. Nightly cron re-indexes the code search database with fresh summaries.
Services
Code Search API
Semantic + Keyword Hybrid
Semantic search across 40+ repositories using qwen3-embedding:8b vectors. Every file is chunked, summarized by cloud LLMs (MiniMax M2.5, Kimi K2.5 via Ollama Pro), and embedded into a dual vector index. Search modes: hybrid (default, combines vector similarity with keyword matching), code (raw text matches), and summary (natural language queries against AI-generated summaries).
How It Works
On index, each file is split into semantic chunks (by function, class, or logical block). Each chunk is embedded locally via qwen3-embedding:8b (768-dimensional vectors) and summarized by a cloud LLM (MiniMax M2.5 or Kimi K2.5 via round-robin). Both the embedding vector and summary are stored in SQLite alongside the raw code. Code and summary vectors are stored separately with configurable weighting (code: 0.35, summary: 0.65).
On search, the query is embedded with the same local model, then cosine similarity is computed against all stored vectors. In hybrid mode, keyword matches boost the similarity score. Results are ranked by combined score and returned with file path, project name, chunk content, and summary.
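The scoring step described above can be sketched as follows, using the 0.35/0.65 weights from the text. The keyword boost constant and the chunk record shape are illustrative assumptions, not the service's actual values.

```javascript
// Hybrid scoring sketch: weighted cosine similarity over the dual
// (code + summary) vectors, plus a keyword boost in hybrid mode.
function cosine(a, b) {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

const CODE_WEIGHT = 0.35;     // from the documented weighting
const SUMMARY_WEIGHT = 0.65;  // from the documented weighting
const KEYWORD_BOOST = 0.1;    // illustrative, not the real constant

function hybridScore(queryVec, queryText, chunk) {
  let score =
    CODE_WEIGHT * cosine(queryVec, chunk.codeVec) +
    SUMMARY_WEIGHT * cosine(queryVec, chunk.summaryVec);
  // Hybrid mode: each query term found in the raw chunk text
  // boosts the similarity score.
  const haystack = chunk.content.toLowerCase();
  for (const term of queryText.toLowerCase().split(/\s+/)) {
    if (term && haystack.includes(term)) score += KEYWORD_BOOST;
  }
  return score;
}
```

Ranking is then a sort by this combined score across all stored chunks.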
Prompt Library
Template Management
A versioned library of reusable prompt templates. Instead of rewriting the same instructions every session, the AI agent checks the library first and adapts existing templates. Currently holds 30+ templates across categories: build prompts, content generation, code review, system operations, and analysis.
Templates use {{variable_name}} placeholders with typed definitions. The agent fills in project-specific values at runtime.
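Filling a template might look like the following sketch. The template object shape, the type names, and the example template are assumptions for illustration, not the Prompt Library's actual schema.

```javascript
// Substitute {{variable_name}} placeholders after validating the
// supplied values against the template's typed definitions.
function renderTemplate(template, values) {
  for (const def of template.variables) {
    const v = values[def.name];
    if (v === undefined) throw new Error(`missing variable: ${def.name}`);
    if (typeof v !== def.type) throw new Error(`${def.name} must be ${def.type}`);
  }
  // Replace each {{name}} placeholder with its runtime value.
  return template.body.replace(/\{\{(\w+)\}\}/g, (_, name) => String(values[name]));
}

// Hypothetical template, not one of the library's real entries.
const template = {
  body: "Review {{repo}} and flag issues above severity {{min_severity}}.",
  variables: [
    { name: "repo", type: "string" },
    { name: "min_severity", type: "number" },
  ],
};
```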
Agent Intel
Knowledge Base
A curated knowledge base the AI agent checks before answering questions from its training data. Contains project-specific context, infrastructure details, workflow decisions, and accumulated lessons that a foundation model wouldn't know. The agent is instructed to check this API before every knowledge question, even ones it thinks it already knows the answer to.
Dev Tools API
Central Hub
The main orchestration API that ties everything together. Serves the developer dashboard, manages project metadata, tracks port assignments, and provides the preflight endpoint that the AI agent hits before every task.
Local-First Enforcement
Having local APIs is only useful if the AI agent actually uses them. I built a system prompt injection hook that enforces five rules on every session:
- Code Search first: Before any codebase question or file operation, hit the Code Search API. If it returns results, use them. No manual grep.
- Prompt Library first: Before writing any prompt or template, check the library. If a matching template exists, adapt it.
- Agent Intel first: Before answering any knowledge question (even "obvious" ones), check the knowledge base.
- No web search for indexed code: Never use external search engines for questions about code in indexed repositories.
- Delegate file operations: Route mechanical file scanning (grep, find, cat) to a cheaper sub-agent model instead of burning expensive orchestrator tokens.
Rules 1-3 worked on the first attempt because they're binary: "before X, do Y." Rule 5 (delegation) required keyword blocklists and pre-emptive counters to common rationalizations, because the model kept finding excuses to run commands directly. The enforcement rules are injected via an OpenClaw lifecycle hook on every agent:bootstrap event.
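The injection itself can be as simple as appending a rules block to the session's system prompt. A hypothetical sketch follows: the hook registration API shown in the trailing comment is an assumption, not OpenClaw's actual interface, and the rule wording is condensed from the list above.

```javascript
// The five enforcement rules, condensed from the documented list.
const ENFORCEMENT_RULES = [
  "Code Search first: hit the Code Search API before any codebase question; no manual grep.",
  "Prompt Library first: check the library before writing any prompt or template.",
  "Agent Intel first: check the knowledge base before answering any knowledge question.",
  "No web search for questions about code in indexed repositories.",
  "Delegate mechanical file operations (grep, find, cat) to a cheaper sub-agent.",
];

// Append a numbered rules block to the existing system prompt.
function injectRules(systemPrompt) {
  const lines = ENFORCEMENT_RULES.map((rule, i) => `${i + 1}. ${rule}`);
  return `${systemPrompt}\n\n<enforcement>\n${lines.join("\n")}\n</enforcement>`;
}

// Assumed registration shape for the agent:bootstrap lifecycle event:
// registerHook("agent:bootstrap", (session) => {
//   session.systemPrompt = injectRules(session.systemPrompt);
// });
```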
Model Fleet
The model fleet evolved significantly from the original all-local approach. After months of testing, the local instance now handles one job: embeddings. Everything else moved to cloud models where quality is dramatically better.
Local Models (Ollama)
| Model | Role |
|---|---|
| qwen3-embedding:8b | Code search embeddings (768d vectors, dual code + summary) |
| nomic-embed-text | OpenClaw memory search embeddings |
Ollama Pro Cloud Models
| Model | Role | Notes |
|---|---|---|
| minimax-m2.5 | Code summaries, structured output | 80.2% SWE-Bench, 198K context |
| glm-5 | Complex reasoning, fallback summaries | 744B params (40B active MoE), 92.7% AIME 2026 |
| deepseek-v3.2 | Fast general-purpose inference | 3-4s typical response |
| qwen3-coder-next | Specialized coding tasks | Code-focused training |
| kimi-k2.5 | Code summaries (round-robin with MiniMax) | Free tier via Moonshot API |
Cloud Agent Models (API)
| Model | Provider | Role |
|---|---|---|
| Claude Opus 4.6 | Anthropic API | Orchestrator, architecture, creative, security analysis |
| GPT 5.3 Codex | OpenAI API | Code generation, reviews, structured builds |
| Gemini 3 Pro/Flash | Google AI API | Image generation, multimodal tasks |
Key lesson: local models are excellent for embeddings and completely inadequate for generation. The fleet now runs local Ollama exclusively for embedding vectors and routes all generation to cloud models, where response quality justifies the cost.
What Got Cut (and Why)
The original fleet included three local generation models (plus nomic-embed-text for code search embeddings). All three generation models were removed after testing showed cloud alternatives were categorically better for every task they handled.
| Model | Was Used For | Why It Got Cut |
|---|---|---|
| gemma2:9b (5.4GB) | Git commits, cron triage, structured output | Cloud models handle structured output better. The "S-tier for structured output" label held up in benchmarks but not in production. Cloud models at 3-5s latency made the local speed advantage irrelevant. |
| qwen2.5:7b | Code summaries, bulk text processing | Replaced by MiniMax M2.5 and Kimi K2.5 for code summaries. Cloud summaries are significantly higher quality. The 43 tokens/sec local speed didn't matter because summary quality was the bottleneck, not throughput. |
| qwen2.5-coder:14b (8.9GB) | Local code generation | GPT 5.3 Codex and cloud models made this redundant. Output quality required frequent human correction compared to frontier cloud models. |
| nomic-embed-text (code search) | Code search embeddings | Replaced by qwen3-embedding:8b for code search (better multilingual and code understanding). Still used for OpenClaw memory search where the lighter model is sufficient. |
The pattern: local 7B-14B models were good enough to build with, but "good enough" creates technical debt when cloud models produce categorically better results for minimal API cost. The local Ollama instance now runs two embedding models and nothing else.
Content Pipeline
The Ops Deck includes a content publishing pipeline with a security boundary. A content scrubber script runs before any content leaves the machine, stripping internal infrastructure details (IP addresses, port numbers, hostnames) from blog posts, social media drafts, and documentation.
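The scrubber's core pass can be sketched as an ordered set of regex rules. The exact patterns and the `.internal` hostname convention are illustrative assumptions, not the real scrubber's rule set.

```javascript
// Ordered redaction rules: ports are scrubbed before IPs so the
// port lookbehind still sees a digit, not a replacement token.
const RULES = [
  { pattern: /(?<=\w):\d{2,5}\b/g, replacement: ":[REDACTED_PORT]" },      // :8080 etc.
  { pattern: /\b(?:\d{1,3}\.){3}\d{1,3}\b/g, replacement: "[REDACTED_IP]" }, // IPv4
  { pattern: /\b[\w-]+\.internal\b/g, replacement: "[REDACTED_HOST]" },    // assumed suffix
];

// Apply every rule in order to the outgoing text.
function scrub(text) {
  let out = text;
  for (const { pattern, replacement } of RULES) {
    out = out.replace(pattern, replacement);
  }
  return out;
}
```

The real pipeline runs a pass like this over every blog post, social draft, and doc before it leaves the machine.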
Token Economics
The entire Ops Deck exists to answer one question: what's worth paying cloud API prices for, and what can run locally for free?
Free (Local Ollama)
- Code search embeddings
- Memory search embeddings
- Semantic similarity
Low Cost (Ollama Pro Cloud)
- Code summaries (MiniMax, Kimi)
- Bulk text generation
- Structured output tasks
- Reasoning fallback (GLM-5)
API Tokens (Frontier Models)
- Architecture and planning (Opus 4.6)
- Code generation (GPT 5.3 Codex)
- Security analysis
- Creative writing
- Orchestration decisions