I wanted my AI agent to talk directly to my security stack. Not through copy-pasted log snippets. Not through screenshots of dashboards. Actual tool calls against live data.

So I built seven MCP servers. Wazuh. Suricata. Zeek. TheHive. Cortex. MISP. MITRE ATT&CK. All open source, all on my GitHub. Full project details on the Security MCP Servers project page.

The protocol layer took a weekend. The context engineering took weeks. That ratio surprised me.

What I Actually Built

The servers break down into three categories based on how they access data:

API-based servers talk directly to running services. Wazuh MCP hits the manager’s REST API on port 55000 for alerts, agent status, vulnerability scans, and file integrity events. TheHive and Cortex connect to their respective APIs for case management and observable analysis. MISP pulls threat intelligence feeds and IOC lookups.

Log-based servers parse files on disk. Zeek MCP reads from a log directory (JSON or TSV format), letting you query connection logs, DNS, HTTP, SSL, and file analysis data. Suricata MCP reads EVE JSON logs for IDS alerts, flow data, and protocol metadata.
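The log-parsing side can be sketched in a few lines. This is a minimal, self-contained illustration (not the actual server code) of reading EVE-style JSON lines while tolerating malformed entries; the record shape here is a simplified assumption, not Suricata's full schema.

```typescript
// Simplified EVE record shape -- real Suricata records carry many more fields.
interface EveRecord {
  timestamp: string;
  event_type: string;
  src_ip?: string;
  alert?: { signature: string; severity: number };
}

function parseEveLines(raw: string): EveRecord[] {
  const records: EveRecord[] = [];
  for (const line of raw.split("\n")) {
    const trimmed = line.trim();
    if (!trimmed) continue; // skip blank lines
    try {
      records.push(JSON.parse(trimmed) as EveRecord);
    } catch {
      // a malformed entry is dropped rather than failing the whole file
    }
  }
  return records;
}

// Illustrative input: one alert, one garbage line, one flow record.
const sample = [
  '{"timestamp":"2024-01-01T00:00:00Z","event_type":"alert","src_ip":"10.0.0.5","alert":{"signature":"ET SCAN","severity":2}}',
  "not json",
  '{"timestamp":"2024-01-01T00:00:01Z","event_type":"flow","src_ip":"10.0.0.5"}',
].join("\n");

const parsed = parseEveLines(sample);
```

Skipping bad lines instead of aborting matters in practice: log files on disk get rotated mid-write, and one torn entry should not take down a whole query.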

Knowledge-base servers work offline. The MITRE ATT&CK server downloads STIX 2.1 bundles and lets you query techniques, tactics, groups, software, and mitigations without hitting any external API.
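An offline ATT&CK lookup reduces to filtering STIX objects. A minimal sketch, using real STIX 2.1 property names (`type`, `kill_chain_phases`) but a tiny hand-built bundle rather than the full enterprise download:

```typescript
// Partial STIX object shape -- only the properties this query needs.
interface StixObject {
  type: string;
  name?: string;
  kill_chain_phases?: { kill_chain_name: string; phase_name: string }[];
}

// Return technique names for a given ATT&CK tactic (kill-chain phase).
function techniquesForTactic(
  bundle: { objects: StixObject[] },
  tactic: string
): string[] {
  return bundle.objects
    .filter((o) => o.type === "attack-pattern")
    .filter((o) =>
      (o.kill_chain_phases ?? []).some(
        (p) => p.kill_chain_name === "mitre-attack" && p.phase_name === tactic
      )
    )
    .map((o) => o.name ?? "(unnamed)");
}

// Illustrative two-object bundle; the real file has thousands of objects.
const bundle = {
  objects: [
    {
      type: "attack-pattern",
      name: "Process Injection",
      kill_chain_phases: [
        { kill_chain_name: "mitre-attack", phase_name: "defense-evasion" },
      ],
    },
    { type: "intrusion-set", name: "Example Group" },
  ] as StixObject[],
};

const hits = techniquesForTactic(bundle, "defense-evasion");
```

Everything runs against a local JSON bundle, which is why the server needs no network access after the initial download.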

Each server exposes a focused set of tools. Wazuh has get_alerts, list_agents, get_vulnerabilities, get_fim_events. Zeek has query_connections, search_dns, get_ssl_certs. Suricata has get_alerts, get_flow_stats, search_protocols.

The surface area is intentional. Every tool does one thing with predictable output. The full server documentation and code are at github.com/solomonneas.


The Protocol Was Not the Hard Part

MCP itself is straightforward. You define tools with typed parameters, handle calls, return structured results. The TypeScript SDK handles the transport layer. Zod validates the inputs. Build, run, connect.
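The pattern is easy to see with the dependencies stripped out. This sketch shows the shape (typed parameters, validation, a handler returning a structured result) in plain TypeScript; the real servers do the validation with Zod and the wiring with the MCP TypeScript SDK, and the handler name here is illustrative.

```typescript
// Typed parameters for a hypothetical get_alerts tool.
interface ToolParams {
  limit: number;
  minSeverity: number;
}

// Hand-rolled validation standing in for a Zod schema.
function validateParams(input: unknown): ToolParams {
  const p = input as Partial<ToolParams>;
  if (typeof p.limit !== "number" || p.limit < 1) {
    throw new Error("limit must be a number >= 1");
  }
  if (typeof p.minSeverity !== "number") {
    throw new Error("minSeverity is required");
  }
  return { limit: p.limit, minSeverity: p.minSeverity };
}

// Handler: validate, (would) query the backing service, return structured text.
function handleGetAlerts(input: unknown) {
  const params = validateParams(input);
  // ...real code queries the Wazuh API here...
  return {
    content: [
      { type: "text", text: `querying up to ${params.limit} alerts` },
    ],
  };
}

const result = handleGetAlerts({ limit: 10, minSeverity: 8 });
```

Bad parameters fail loudly before any API call happens, which is most of what the protocol layer has to get right.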

Getting a model to successfully call get_alerts with the right parameters and get data back took maybe a day per server. Standard integration work.

The hard part started after the connection worked.

Testing Every Server Against Live Infrastructure

I didn’t just build these and hope for the best. Every server got tested against real running services on my home infrastructure.

Wazuh MCP was tested against my Wazuh 4.14.1 instance running on Proxmox (container 105). I queried live alerts, pulled agent status for my connected machines, reviewed vulnerability scan results, and verified file integrity monitoring events. The agent reconnection workflow (listing disconnected agents, checking last keep-alive, triggering restarts) got tested end-to-end.

Zeek and Suricata MCP servers were tested against actual captured traffic. I fed real log files through both parsers, verified connection correlation worked across source/destination pairs, confirmed DNS query lookups returned the right records, and stress-tested the time-window filtering with large log directories. Edge cases like malformed log entries and mixed JSON/TSV formats got handled explicitly.

TheHive and Cortex were tested against their APIs with sample cases and observables. MISP was tested against threat intel feeds with real IOC lookups. The MITRE ATT&CK server was verified against the full STIX 2.1 enterprise bundle: technique lookups, tactic mappings, group associations, software references.

The goal was not just “does the tool call succeed.” It was “does the model get back data it can actually reason about for a real investigation.”

Context Design Is the Real Engineering

Security telemetry is exactly the kind of data language models handle poorly. It’s verbose. It’s repetitive. It’s full of fields that matter sometimes and are noise the rest of the time.

Take Wazuh alerts. A single alert has 40+ fields. Agent metadata, rule details, decoder information, syscall data, file paths, timestamps, GeoIP, compliance mappings. Dump all of that into a model and ask it to “analyze the situation.” You’ll get a vague summary that touches everything and understands nothing.

I learned this the hard way. My first versions returned raw API responses. The model would pick whatever fields were easiest to talk about instead of whatever actually mattered for the investigation.

So I started designing the context layer. What does the model see first? What time window makes sense for correlation? Which fields help with triage and which ones are just metadata? What should get pre-summarized before it reaches the model?

For Wazuh alerts, I filter to severity 8+ by default and return a focused subset: timestamp, rule description, agent name, source IP, and MITRE technique. The full payload is available if the model asks, but the first pass is clean enough to reason about.
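That first-pass filter is simple to express. A sketch of the triage view described above: drop low-severity alerts, project the focused subset. The field names loosely mirror Wazuh's alert schema but should be read as assumptions, not its exact layout.

```typescript
// Simplified Wazuh-style alert -- the real schema has 40+ fields.
interface WazuhAlert {
  timestamp: string;
  rule: { level: number; description: string; mitre?: { technique?: string[] } };
  agent: { name: string };
  data?: { srcip?: string };
}

// First-pass triage view: severity 8+ only, five fields per alert.
function triageView(alerts: WazuhAlert[], minLevel = 8) {
  return alerts
    .filter((a) => a.rule.level >= minLevel)
    .map((a) => ({
      timestamp: a.timestamp,
      description: a.rule.description,
      agent: a.agent.name,
      srcIp: a.data?.srcip ?? "n/a",
      technique: a.rule.mitre?.technique?.[0] ?? "unmapped",
    }));
}

// Illustrative data: one high-severity alert, one routine event.
const alerts: WazuhAlert[] = [
  {
    timestamp: "2024-01-01T03:12:00Z",
    rule: { level: 10, description: "SSH brute force", mitre: { technique: ["T1110"] } },
    agent: { name: "web-01" },
    data: { srcip: "203.0.113.9" },
  },
  {
    timestamp: "2024-01-01T03:13:00Z",
    rule: { level: 3, description: "Log rotated" },
    agent: { name: "web-01" },
  },
];

const view = triageView(alerts);
```

The full payload stays reachable through a separate call; this projection is only what the model sees by default.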

For Zeek connection logs, I pre-aggregate by source/destination pair and surface the unusual patterns (long durations, high byte counts, rare ports) before the bulk data. The model gets a summary table first, then can drill into specific connections.
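The pre-aggregation step looks roughly like this: roll connections up by source/destination pair, then sort so the heaviest pairs surface first. A self-contained sketch with a simplified connection shape, not Zeek's actual conn.log columns.

```typescript
// Simplified connection record (real conn.log entries carry far more fields).
interface Conn {
  src: string;
  dst: string;
  duration: number; // seconds
  bytes: number;
}

// Aggregate per src->dst pair, sorted so unusual pairs come first.
function summarize(conns: Conn[]) {
  const byPair = new Map<string, { conns: number; totalBytes: number; maxDuration: number }>();
  for (const c of conns) {
    const key = `${c.src}->${c.dst}`;
    const agg = byPair.get(key) ?? { conns: 0, totalBytes: 0, maxDuration: 0 };
    agg.conns += 1;
    agg.totalBytes += c.bytes;
    agg.maxDuration = Math.max(agg.maxDuration, c.duration);
    byPair.set(key, agg);
  }
  return Array.from(byPair.entries())
    .map(([pair, agg]) => ({ pair, ...agg }))
    .sort((a, b) => b.totalBytes - a.totalBytes); // heavy talkers first
}

// Illustrative traffic: one chatty pair, one quiet one.
const conns: Conn[] = [
  { src: "10.0.0.5", dst: "198.51.100.7", duration: 3600, bytes: 5_000_000 },
  { src: "10.0.0.5", dst: "198.51.100.7", duration: 10, bytes: 2_000 },
  { src: "10.0.0.8", dst: "10.0.0.1", duration: 1, bytes: 500 },
];

const summary = summarize(conns);
```

Sorting by total bytes is one heuristic; rare ports and long durations get their own passes in the same spirit. The point is that the model reads the summary table before it ever sees a raw connection row.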

For Suricata, I separate true IDS alerts from flow metadata. The model sees detections first, network context second. That ordering matters more than I expected.

Where It Gets Interesting

The payoff comes when the model pulls from multiple servers in one workflow.

A Wazuh alert fires for a suspicious process on a workstation. The model checks Zeek connection logs for that host’s network activity in the same time window. It finds outbound connections to an unusual IP. It queries MITRE ATT&CK to map the process behavior to known techniques. It checks MISP for threat intel on the destination IP.
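The time-window hop from the alert to the Zeek data is the mechanical core of that chain. A sketch under illustrative names (this is not the actual servers' API): given a host and an alert timestamp, pull that host's connections within a window around it.

```typescript
// Simplified connection record for the correlation step.
interface ConnRecord {
  ts: number;   // epoch seconds
  host: string; // internal host the connection belongs to
  dst: string;  // destination IP
}

// Connections from `host` within +/- windowSec of the alert timestamp.
function connsNearAlert(
  conns: ConnRecord[],
  host: string,
  alertTs: number,
  windowSec = 900 // 15-minute default window
): ConnRecord[] {
  return conns.filter(
    (c) => c.host === host && Math.abs(c.ts - alertTs) <= windowSec
  );
}

// Illustrative data: two in-window connections, one outside, one other host.
const conns: ConnRecord[] = [
  { ts: 1000, host: "ws-12", dst: "203.0.113.50" },
  { ts: 1200, host: "ws-12", dst: "203.0.113.50" },
  { ts: 9000, host: "ws-12", dst: "8.8.8.8" },
  { ts: 1100, host: "ws-99", dst: "192.0.2.1" },
];

const nearby = connsNearAlert(conns, "ws-12", 1100);
```

Everything after this step (ATT&CK mapping, MISP lookup) is the same pattern against a different server: scope the question, get a small structured answer back.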

That correlation chain used to take 15 minutes of clicking through interfaces. Now it takes one question.

I’m not replacing analysts. I’m killing the mechanical evidence-gathering that burns time before a human reaches the actual decision points. The model assembles the first layer: timeline, supporting evidence, technique mapping, known indicators. The analyst starts from a briefing instead of starting from scratch.

The Lesson

Before this project, I assumed protocol maturity would be the bottleneck for AI-security integration. I was wrong. The protocol is a solved problem. MCP works fine. The tools connect. The calls succeed.

The bottleneck is what happens between the raw data and the model’s context window. Filtering, ordering, scoping, pre-summarizing. That’s where the quality of the analysis is determined.

A model with access to every field in every log is not better equipped than a model that sees the right 15 fields in the right order. It’s worse.

Now when I evaluate any agent-tool integration, that’s the first thing I look at. Not whether the model can reach the tool. Whether the tool is feeding the model something it can actually think with.

Seven servers. All open source. All tested against live infrastructure. The code is at github.com/solomonneas. The protocol was a weekend. The context design is ongoing. That’s the ratio that matters.