Ops Deck
Developer Infrastructure Platform
Overview
The Ops Deck is a suite of local-first backend APIs that power my AI agent (Rocinante), development workflow, and content pipeline. Every service runs on my workstation, using local models for embeddings and cloud models for summaries, so routine search operations cost zero cloud tokens.
The core principle: if the data is local, the processing should be local. Cloud APIs are for reasoning and generation. Everything else (search, templates, knowledge retrieval, content publishing) runs on-premises with no external dependencies.
Architecture
Process Management
All services are managed via PM2 with auto-restart, log rotation, and crash recovery. A single ecosystem.config.cjs defines every service with environment variables, working directories, and health checks.
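A PM2 setup like the one above maps to a config along these lines. This is a minimal sketch: the service name, port, path, and memory limit are illustrative, not the real ecosystem.config.cjs.

```javascript
// Illustrative ecosystem.config.cjs entry. The service name, cwd,
// port, and memory limit are assumptions for the example only.
module.exports = {
  apps: [
    {
      name: "code-search-api",          // hypothetical service name
      script: "server.js",
      cwd: "/srv/ops-deck/code-search", // hypothetical working directory
      env: { PORT: 4101, OLLAMA_URL: "http://127.0.0.1:11434" },
      autorestart: true,                // PM2 crash recovery
      max_memory_restart: "512M",       // restart if memory climbs
    },
  ],
};
```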
Model Layer
Ollama runs locally for embeddings (qwen3-embedding:8b) and routes to Ollama Pro cloud for summaries and heavy generation. Embeddings stay on-device with zero API cost. Summarization uses cloud models (MiniMax M2.5, GLM-5, Kimi K2.5) for better quality.
Storage
SQLite databases per service. No PostgreSQL overhead for single-user workloads. Each service owns its schema. Nightly cron re-indexes the code search database with fresh summaries.
Services
Code Search API
Semantic + Keyword Hybrid
Semantic search across 40+ repositories using qwen3-embedding:8b vectors. Every file is chunked, summarized by cloud LLMs (MiniMax M2.5, Kimi K2.5 via Ollama Pro), and embedded into a dual vector index. Search modes: hybrid (default, combines vector similarity with keyword matching), code (raw text matches), and summary (natural language queries against AI-generated summaries).
How It Works
On index, each file is split into semantic chunks (by function, class, or logical block). Each chunk is embedded locally via qwen3-embedding:8b (768-dimensional vectors) and summarized by a cloud LLM (MiniMax M2.5 or Kimi K2.5 via round-robin). Both the embedding vector and summary are stored in SQLite alongside the raw code. Code and summary vectors are stored separately with configurable weighting (code: 0.35, summary: 0.65).
On search, the query is embedded with the same local model, then cosine similarity is computed against all stored vectors. In hybrid mode, keyword matches boost the similarity score. Results are ranked by combined score and returned with file path, project name, chunk content, and summary.
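The scoring step described above can be sketched as follows, using the 0.35/0.65 weights from the text. The keyword boost constant and the chunk record shape are illustrative assumptions, not the service's actual values.

```javascript
// Hybrid scoring sketch: weighted cosine similarity over the dual
// (code + summary) vectors, plus a keyword boost in hybrid mode.
function cosine(a, b) {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

const CODE_WEIGHT = 0.35;     // from the documented weighting
const SUMMARY_WEIGHT = 0.65;  // from the documented weighting
const KEYWORD_BOOST = 0.1;    // illustrative, not the real constant

function hybridScore(queryVec, queryText, chunk) {
  let score =
    CODE_WEIGHT * cosine(queryVec, chunk.codeVec) +
    SUMMARY_WEIGHT * cosine(queryVec, chunk.summaryVec);
  // Hybrid mode: each query term found in the raw chunk text
  // boosts the similarity score.
  const haystack = chunk.content.toLowerCase();
  for (const term of queryText.toLowerCase().split(/\s+/)) {
    if (term && haystack.includes(term)) score += KEYWORD_BOOST;
  }
  return score;
}
```

Ranking is then a sort by this combined score across all stored chunks.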
Prompt Library
Template Management
A versioned library of reusable prompt templates. Instead of rewriting the same instructions every session, the AI agent checks the library first and adapts existing templates. Currently holds 30+ templates across categories: build prompts, content generation, code review, system operations, and analysis.
Templates use {{variable_name}} placeholders with typed definitions. The agent fills in project-specific values at runtime.
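Filling a template might look like the following sketch. The template object shape, the type names, and the example template are assumptions for illustration, not the Prompt Library's actual schema.

```javascript
// Substitute {{variable_name}} placeholders after validating the
// supplied values against the template's typed definitions.
function renderTemplate(template, values) {
  for (const def of template.variables) {
    const v = values[def.name];
    if (v === undefined) throw new Error(`missing variable: ${def.name}`);
    if (typeof v !== def.type) throw new Error(`${def.name} must be ${def.type}`);
  }
  // Replace each {{name}} placeholder with its runtime value.
  return template.body.replace(/\{\{(\w+)\}\}/g, (_, name) => String(values[name]));
}

// Hypothetical template, not one of the library's real entries.
const template = {
  body: "Review {{repo}} and flag issues above severity {{min_severity}}.",
  variables: [
    { name: "repo", type: "string" },
    { name: "min_severity", type: "number" },
  ],
};
```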
Agent Intel
Knowledge Base
A curated knowledge base the AI agent checks before answering questions from its training data. Contains project-specific context, infrastructure details, workflow decisions, and accumulated lessons that a foundation model wouldn't know. The agent is instructed to check this API before every knowledge question, even ones it thinks it already knows the answer to.
Dev Tools API
Central Hub
The main orchestration API that ties everything together. Serves the developer dashboard, manages project metadata, tracks port assignments, and provides the preflight endpoint that the AI agent hits before every task.
Local-First Enforcement
Having local APIs is only useful if the AI agent actually uses them. I built a system prompt injection hook that enforces five rules on every session:
- Code Search first: Before any codebase question or file operation, hit the Code Search API. If it returns results, use them. No manual grep.
- Prompt Library first: Before writing any prompt or template, check the library. If a matching template exists, adapt it.
- Agent Intel first: Before answering any knowledge question (even "obvious" ones), check the knowledge base.
- No web search for indexed code: Never use external search engines for questions about code in indexed repositories.
- Delegate file operations: Route mechanical file scanning (grep, find, cat) to a cheaper sub-agent model instead of burning expensive orchestrator tokens.
Rules 1-3 worked on the first attempt because they're binary: "before X, do Y." Rule 5 (delegation) required keyword blocklists and pre-emptive counters to common rationalizations, because the model kept finding excuses to run commands directly. The enforcement rules are injected via an OpenClaw lifecycle hook on every agent:bootstrap event.
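The injection itself can be as simple as appending a rules block to the session's system prompt. A hypothetical sketch follows: the hook registration API shown in the trailing comment is an assumption, not OpenClaw's actual interface, and the rule wording is condensed from the list above.

```javascript
// The five enforcement rules, condensed from the documented list.
const ENFORCEMENT_RULES = [
  "Code Search first: hit the Code Search API before any codebase question; no manual grep.",
  "Prompt Library first: check the library before writing any prompt or template.",
  "Agent Intel first: check the knowledge base before answering any knowledge question.",
  "No web search for questions about code in indexed repositories.",
  "Delegate mechanical file operations (grep, find, cat) to a cheaper sub-agent.",
];

// Append a numbered rules block to the existing system prompt.
function injectRules(systemPrompt) {
  const lines = ENFORCEMENT_RULES.map((rule, i) => `${i + 1}. ${rule}`);
  return `${systemPrompt}\n\n<enforcement>\n${lines.join("\n")}\n</enforcement>`;
}

// Assumed registration shape for the agent:bootstrap lifecycle event:
// registerHook("agent:bootstrap", (session) => {
//   session.systemPrompt = injectRules(session.systemPrompt);
// });
```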
Model Fleet
The model fleet evolved significantly from the original all-local approach. After months of testing, the local instance now handles one job: embeddings. Everything else moved to cloud models where quality is dramatically better.
Local Models (Ollama)
| Model | Role |
|---|---|
| qwen3-embedding:8b | Code search embeddings (768d vectors, dual code + summary) |
| nomic-embed-text | OpenClaw memory search embeddings |
Ollama Pro Cloud Models
| Model | Role | Notes |
|---|---|---|
| minimax-m2.5 | Code summaries, structured output | 80.2% SWE-Bench, 198K context |
| glm-5 | Complex reasoning, fallback summaries | 744B params (40B active MoE), 92.7% AIME 2026 |
| deepseek-v3.2 | Fast general-purpose inference | 3-4s typical response |
| qwen3-coder-next | Specialized coding tasks | Code-focused training |
| kimi-k2.5 | Code summaries (round-robin with MiniMax) | Free tier via Moonshot API |
Cloud Agent Models (API)
| Model | Provider | Role |
|---|---|---|
| Claude Opus 4.6 | Anthropic API | Orchestrator, architecture, creative, security analysis |
| GPT 5.3 Codex | OpenAI API | Code generation, reviews, structured builds |
| Gemini 3 Pro/Flash | Google AI API | Image generation, multimodal tasks |
Key lesson: local models are excellent for embeddings and completely inadequate for generation. The fleet now runs local Ollama exclusively for embedding vectors and routes all generation to cloud models, where response quality justifies the cost.
What Got Cut (and Why)
The original fleet included three local generation models (plus nomic-embed-text for code search embeddings). All three generation models were removed after testing showed cloud alternatives were categorically better for every task they handled.
| Model | Was Used For | Why It Got Cut |
|---|---|---|
| gemma2:9b (5.4GB) | Git commits, cron triage, structured output | Cloud models handle structured output better. The "S-tier for structured output" label held up in benchmarks but not in production. Cloud models at 3-5s latency made the local speed advantage irrelevant. |
| qwen2.5:7b | Code summaries, bulk text processing | Replaced by MiniMax M2.5 and Kimi K2.5 for code summaries. Cloud summaries are significantly higher quality. The 43 tokens/sec local speed didn't matter because summary quality was the bottleneck, not throughput. |
| qwen2.5-coder:14b (8.9GB) | Local code generation | GPT 5.3 Codex and cloud models made this redundant. Output quality required frequent human correction compared to frontier cloud models. |
| nomic-embed-text (code search) | Code search embeddings | Replaced by qwen3-embedding:8b for code search (better multilingual and code understanding). Still used for OpenClaw memory search where the lighter model is sufficient. |
The pattern: local 7B-14B models were good enough to build with, but "good enough" creates technical debt when cloud models produce categorically better results for minimal API cost. The local Ollama instance now runs two embedding models and nothing else.
Content Pipeline
The Ops Deck includes a content publishing pipeline with a security boundary. A content scrubber script runs before any content leaves the machine, stripping internal infrastructure details (IP addresses, port numbers, hostnames) from blog posts, social media drafts, and documentation.
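The scrubber's core pass can be sketched as an ordered set of regex rules. The exact patterns and the `.internal` hostname convention are illustrative assumptions, not the real scrubber's rule set.

```javascript
// Ordered redaction rules: ports are scrubbed before IPs so the
// port lookbehind still sees a digit, not a replacement token.
const RULES = [
  { pattern: /(?<=\w):\d{2,5}\b/g, replacement: ":[REDACTED_PORT]" },      // :8080 etc.
  { pattern: /\b(?:\d{1,3}\.){3}\d{1,3}\b/g, replacement: "[REDACTED_IP]" }, // IPv4
  { pattern: /\b[\w-]+\.internal\b/g, replacement: "[REDACTED_HOST]" },    // assumed suffix
];

// Apply every rule in order to the outgoing text.
function scrub(text) {
  let out = text;
  for (const { pattern, replacement } of RULES) {
    out = out.replace(pattern, replacement);
  }
  return out;
}
```

The real pipeline runs a pass like this over every blog post, social draft, and doc before it leaves the machine.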
Token Economics
The entire Ops Deck exists to answer one question: what's worth paying cloud API prices for, and what can run locally for free?
Free (Local Ollama)
- Code search embeddings
- Memory search embeddings
- Semantic similarity
Low Cost (Ollama Pro Cloud)
- Code summaries (MiniMax, Kimi)
- Bulk text generation
- Structured output tasks
- Reasoning fallback (GLM-5)
API Tokens (Frontier Models)
- Architecture and planning (Opus 4.6)
- Code generation (GPT 5.3 Codex)
- Security analysis
- Creative writing
- Orchestration decisions