
Can You Use AI for Security Work Without the Cloud?


This article draws on three months of production experience building the CyberDesserts Security Assistant, a RAG-based AI system trained on 67,900 documents from 30 curated security sources. The findings reflect real queries, real failures, and real costs.

Yes, but only for a narrow set of tasks, and the hardware trade-offs are worse than most guides admit.

Security teams sending incident timelines, internal detection logic, and CVE triage notes to a third-party API have a legitimate concern. Most major inference providers offer zero data retention modes that prevent prompts from being used for model training. What they cannot guarantee is that queries are invisible during processing.

Your data still transits their infrastructure, hits CDN layers in multiple jurisdictions, and can be reviewed by safety teams if automated systems flag it. Zero data retention and zero data exposure are different promises.

So the interest in running AI locally is understandable. The question nobody answers with actual production data is whether it works for security tasks. After three months running a production AI assistant against real security queries, here is what the numbers show.

Local AI works for formatting and reformatting tasks. It fails for anything requiring cross-source reasoning or domain attribution. The hardware you need to clear that second bar costs more than most security teams can justify, and the API you are trying to avoid costs £5-7 a month for the same capability.

The rest of this article explains exactly where the line falls, what hardware sits on each side of it, and the three scenarios where local inference genuinely earns its place.




What Security Tasks Are Practitioners Actually Using AI For?

The hardware question comes second. The task question comes first, and most local AI guides skip it entirely.

AI in security work splits into two categories. The first is retrieval and formatting: taking information that already exists somewhere and restructuring it usefully. The second is reasoning: connecting information across sources, attributing behaviour to specific actors, and synthesising judgements that require domain knowledge baked into the model itself.

These two categories have completely different model requirements. A 14B model can reformat a CVE advisory into a triage brief. It cannot reliably tell you which threat actor uses a particular technique cluster without hallucinating MITRE IDs that were never in your input.

From three months of testing, the tasks that come up most for security practitioners are:

  • Threat actor attribution and TTP mapping
  • CVE triage and summary reformatting
  • Log explanation and incident triage notes
  • Detection rule writing (Sigma, YARA)
  • Framework guidance (NIST, CIS Controls)
  • Career and skills queries

The list above looks uniform. It is not.

The tasks at the top require reasoning. The tasks at the bottom are mostly formatting.

The hardware gap between those two task types is where the local AI argument falls apart for most teams.

The log triage and incident notes category is worth flagging. SOC analysts are the audience most actively evaluating AI tools right now, partly because the productivity case is obvious and partly because the "AI will replace the SOC analyst" narrative has put teams on the defensive. Local inference seems like a way to get the benefit without the data exposure. Whether it delivers is what this article tests.

Which AI Tasks Work Locally and Which Break?

After testing models from 8B to 70B parameters against the full query range above, I found the failure patterns held across model families. Size matters less than people assume. The task type is what determines whether a smaller model is usable or dangerous.

| Security Task | Works at 8B-14B? | Why It Fails (If It Does) | Minimum That Works |
|---|---|---|---|
| Log explanation / triage note drafting | Yes | Does not fail. Single document, format-and-extract. | 8B-14B |
| CVE summary reformatting | Yes | Does not fail. NVD text to analyst brief. | 8B-14B |
| Framework checklists (NIST, CIS) | Yes | Does not fail. Sequential, template-driven. | 8B-14B |
| Threat actor TTP attribution | No | Hallucinated MITRE technique IDs. 5-8 ungrounded T-codes per response vs acceptable max of 2. | 70B+ |
| Multi-source threat synthesis | No | Ignores injected enrichment data. Repeats same content across response sections. | 70B+ |
| Compound queries (techniques AND defences) | No | Collapses into single-focus output. One thread drops entirely. | 70B+ |
| Air-gapped / classified use (any task) | Accept trade-off | No API alternative. Use best available hardware. | Whatever fits |

The failure mode on threat intelligence is the one worth dwelling on. The model does not say "I do not know." It generates confident-sounding MITRE technique IDs that were never in the provided context.

A junior analyst reading a response with T1566.001, T1003.004, and T1078 listed without qualification has no way to know those IDs came from training data rather than the threat intelligence you fed it. That is not a minor inconvenience. It is a patient zero scenario for bad intelligence propagating through your team.
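One practical mitigation is a mechanical grounding check: extract every MITRE technique ID from the model's response and flag any that never appeared in the context you supplied. A minimal sketch (the regex and function name are illustrative, not taken from the production system):

```python
import re

# MITRE ATT&CK technique IDs look like T1566 or T1566.001
TECHNIQUE_ID = re.compile(r"\bT\d{4}(?:\.\d{3})?\b")

def ungrounded_technique_ids(context: str, response: str) -> set[str]:
    """Return technique IDs the model cited that never appeared in its input."""
    provided = set(TECHNIQUE_ID.findall(context))
    cited = set(TECHNIQUE_ID.findall(response))
    return cited - provided

context = "The report attributes the phishing wave to T1566.001."
response = "Key techniques: T1566.001, T1003.004 and T1078."
print(sorted(ungrounded_technique_ids(context, response)))
# → ['T1003.004', 'T1078']
```

Anything the check flags gets a human review before it reaches a junior analyst. It does not fix the hallucination; it makes it visible.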

The good news (if you can call it that) is that the tasks where hallucination risk is highest are also the tasks that define whether local hardware is worth buying. If you only need reformatting, a used RTX 3090 is genuinely adequate. If you need threat reasoning, you are either spending several thousand on hardware or you are back on an API.

What Hardware Do You Need for Each Task?

Most hardware guides benchmark token generation speed in isolation. Nobody multiplies that speed by the actual response length practitioners need.

Our production system averages 600-800 tokens per response. That is roughly 450-600 words: a structured security brief covering a summary, key findings, and a technical details section.

Short enough to be useful. Long enough to expose the performance gap between hardware tiers.

At 800 tokens output, the picture looks like this:

| Hardware | Approx. Price | Best Model | Speed (tok/s) | 800-token response | Suitable tasks |
|---|---|---|---|---|---|
| Mac Mini M4 Pro 24GB | ~£700 (base) | 14B Q4 | 20-30 | 27-40 seconds | Formatting only |
| Used RTX 3090 24GB | ~£700 used | 14B Q4 | 30-45 | 18-27 seconds | Formatting only |
| Mac Mini M4 Pro 64GB | ~£1,600 | 70B Q4 | 8-12 | 65-100 seconds | Batch overnight only |
| Dual RTX 3090 48GB | ~£1,500 used | 70B Q4 | 12-18 | 45-65 seconds | Batch overnight only |
| Mac Studio M4 Max 128GB | ~£3,200 | 70B Q4 | 30-45 | 18-27 seconds | Threat intel, interactive |
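The response-time column is simple arithmetic: output length divided by generation speed. A quick sketch of how those figures fall out of the measured speed ranges:

```python
# Wall-clock time for one response is just output tokens / generation speed.
def response_time_s(output_tokens: int, tokens_per_second: float) -> float:
    return output_tokens / tokens_per_second

# An 800-token brief at each tier's measured speed range (fast end first)
for label, low, high in [
    ("14B on Mac Mini M4 Pro 24GB", 20, 30),
    ("70B on Mac Mini M4 Pro 64GB", 8, 12),
    ("70B on Mac Studio M4 Max", 30, 45),
]:
    print(f"{label}: {response_time_s(800, high):.0f}-"
          f"{response_time_s(800, low):.0f} s")
```

The same arithmetic is worth running against your own typical response length; a 200-token answer changes which tiers feel interactive.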

Why a Mac Mini as the test case?

The Mac Mini M4 Pro 24GB is the machine I tested the CyberDesserts Security Assistant on. It is not enterprise hardware, and that is the point. It represents a test case, not a funded SOC project.

If you are evaluating local AI for a security team rather than a personal lab, the hardware picture looks different. A workstation with dual A100s or an H100 runs 70B models at 80-150 tok/s, well into interactive territory for most tasks. The £3,200 Mac Studio M4 Max in this table is the realistic consumer ceiling. For serious enterprise deployment, dedicated GPU servers change the economics considerably.

This article focuses on testing what I had available: hardware you can buy without a procurement approval and run on your home network. If you are sizing infrastructure for a team, those numbers are a floor, not a ceiling.

The 64GB Mac Mini is the most misleading option on this list. It can load a 70B model. Running one at a speed most users would tolerate is a different matter entirely.

One to two minutes per response is not a usable security tool. (I tested this on the same hardware I use daily. The wait is genuinely painful.)

The used RTX 3090 is the interesting pick for formatting-only workloads. At around £700 on the used market, it delivers 935 GB/s memory bandwidth, more than three times the Mac Mini M4 Pro's 273 GB/s.

Memory bandwidth is the actual bottleneck for token generation, not raw GPU compute. For a 14B model doing CVE reformatting or log triage, that bandwidth advantage means faster responses than Apple Silicon at the same price point.
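A back-of-envelope model makes the bandwidth point concrete: generating one token requires streaming the full weight set through memory, so bandwidth divided by model size gives a theoretical speed ceiling. Real-world speeds land well below the ceiling, but the ratio between platforms holds. A sketch, assuming roughly 0.5 bytes per parameter for Q4 quantisation:

```python
def speed_ceiling_tok_s(bandwidth_gb_s: float, params_billions: float,
                        bytes_per_param: float = 0.5) -> float:
    """Theoretical max generation speed: every output token needs one full
    pass of the weights through memory. Q4 quantisation ~= 0.5 bytes/param."""
    model_size_gb = params_billions * bytes_per_param
    return bandwidth_gb_s / model_size_gb

print(round(speed_ceiling_tok_s(273, 14)))  # Mac Mini M4 Pro, 14B Q4 → 39
print(round(speed_ceiling_tok_s(935, 14)))  # RTX 3090, 14B Q4 → 134
```

Measured speeds sit at maybe a third to two-thirds of the ceiling once prompt processing, KV cache reads, and quantisation overhead are counted, but the 3090's three-times-plus bandwidth advantage shows up directly in the table above.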

The Mac Studio M4 Max is the minimum viable hardware for interactive threat intelligence. At £3,200, it runs a quantised 70B at 30-45 tok/s.

Your worst-case 800-token response lands under 30 seconds. Borderline acceptable for a production tool, and the quality ceiling clears the bar for threat actor queries. The price argument, though, deserves its own section.

Does Using an Inference API Actually Solve the Privacy Problem?

The instinct to avoid cloud APIs is reasonable. The alternatives are more varied than most practitioners realise.

The market for running open-source models via API has expanded considerably in the past year. Beyond the frontier providers (OpenAI, Anthropic, Google) there are now dedicated inference platforms serving open-weight models like Llama, Qwen, and DeepSeek at significantly lower cost. Groq, Fireworks AI, Together AI, Cerebras, SambaNova, and aggregators like OpenRouter all operate in this space, each with different performance and privacy characteristics.

Privacy postures vary meaningfully between providers. Fireworks AI holds SOC2 Type II and HIPAA certification, the strongest compliance credentials at this tier.

Groq retains logs for 7 days by default with no API opt-out option. Most providers offer opt-in zero data retention for prompts, but as covered above, that prevents training use; it does not remove infrastructure visibility.

For teams that need to run open-weight models with a stronger compliance posture than consumer AI products, these platforms are worth evaluating before spending on hardware. They are not the same as sending queries to ChatGPT.

Over three months of production at roughly 500 queries per month, the inference costs broke down as follows:

| Component | Monthly Cost |
|---|---|
| LLM inference (70B open-source model via API) | £5-7 |
| Web search fallback (Brave API) | ~£4 |
| Hosting | £35-40 |
| Total | ~£45-50 |

The LLM inference itself (the piece local hardware would replace) costs £5-7 a month. A Mac Studio M4 Max at £3,200 breaks even on inference savings in roughly 38 years. A used dual RTX 3090 at £1,500 takes 18 years, and cannot run 70B at interactive speeds, so quality does not compare like-for-like.
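The break-even arithmetic is blunt enough to fit in two lines; a sketch using the figures above:

```python
def breakeven_years(hardware_cost_gbp: float, monthly_api_gbp: float) -> float:
    """Years of API spend needed to recoup the hardware purchase."""
    return hardware_cost_gbp / monthly_api_gbp / 12

print(round(breakeven_years(3200, 7)))  # Mac Studio M4 Max vs £7/month → 38
print(round(breakeven_years(1500, 7)))  # dual RTX 3090 vs £7/month → 18
```

Even at ten times the query volume, the Mac Studio still takes nearly four years to pay for itself on inference savings alone.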

The economics only shift if your query volume is dramatically higher, your needs are continuous, or your data genuinely cannot transit third-party infrastructure under any circumstances.

When Running AI Locally Is Actually Worth It

Three scenarios where local inference earns its place, rather than just sounding appealing.

Classified and air-gapped environments. If your network cannot reach an API endpoint by design, local is not a preference. It is the only option. Accept the quality ceiling on reasoning tasks. Use 14B for formatting work, build human review into the workflow for anything that touches attribution or TTP mapping, and budget accordingly for hardware if reasoning quality matters.

High-volume overnight batch work. Reformatting 5,000 CVE records. Generating triage notes from a log export. Summarising a month of threat reports. When latency per response does not matter, a 14B model on a used RTX 3090 handles this cleanly. Run overnight, collect results in the morning. The economics make sense at scale for formatting tasks.
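The batch pattern is a simple loop: read one advisory per line, summarise each with the local model, write the briefs out. A minimal sketch; the `summarise` stub is a placeholder for a call to whatever local endpoint you run (e.g. Ollama), and the file layout and field names are assumptions:

```python
import json

def summarise(advisory_text: str) -> str:
    """Placeholder: replace with a call to your local model,
    e.g. Ollama's /api/generate endpoint on localhost:11434."""
    return advisory_text[:200]

def batch_triage(records_path: str, out_path: str) -> int:
    """One JSON advisory per input line -> one JSON triage brief per
    output line. Per-record latency is irrelevant when this runs overnight."""
    done = 0
    with open(records_path) as src, open(out_path, "w") as dst:
        for line in src:
            record = json.loads(line)
            brief = summarise(record["description"])
            dst.write(json.dumps({"cve": record["id"], "brief": brief}) + "\n")
            done += 1
    return done
```

At 20 tok/s and ~800 tokens per brief, 5,000 records is roughly 55 hours of generation, so chunk the input across nights or cards if the backlog is large.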

Sensitive internal documents, simple tasks. Summarising a proprietary threat model document. Reformatting an internal incident template. Tasks where the answer is entirely in the text you provide and the model restructures rather than generates from memory. The privacy concern is real here. The hallucination risk is low. A 14B model is adequate.

Notice what is not on this list: interactive threat intelligence queries, actor attribution, TTP mapping, or detection engineering requiring cross-source reasoning. Those need 70B. 70B at interactive speeds means £3,200 minimum in local hardware, against £5-7 a month on a compliant inference API.

Why the Economics Rarely Favour Local AI

The privacy concern driving interest in local inference is legitimate. Sending active incident data to a third-party API is a real risk worth taking seriously.

The problem is that the tasks where privacy risk is highest (active incident telemetry, proprietary detection logic, internal threat models) are also the tasks where local AI at affordable prices performs worst. The scenarios where local hardware works well enough tend to involve tasks with lower privacy sensitivity. Higher-sensitivity tasks require hardware that costs more than most teams can justify on API savings alone.

This calculus changes if your data handling requirements are regulatory rather than economic. If compliance mandates that certain data never leaves the building, the hardware cost (£6k+ for decent equipment) is a compliance cost, not an ROI question. That is a different conversation entirely.


Which Setup Is Right for Your Situation?

Sensitive data, simple tasks (log formatting, CVE summaries): A used RTX 3090 (~£700) or an existing Mac Mini M4 Pro handles 14B formatting work adequately. The privacy benefit is genuine, and quality is acceptable for pure reformatting.

Sensitive data, complex tasks (threat intel, TTP mapping): Local hardware does not work at affordable prices. Your options are accepting the API privacy trade-off with a compliance-certified provider, spending several thousand on enterprise hardware, or building structured human review into a workflow using a lower-quality local model.

Air-gapped environment: Use the best hardware available. Accept 14B limitations for reasoning tasks and compensate with a structured RAG pipeline that pre-digests the cross-referencing the model cannot do reliably on its own.

Query volume under 1,000/month: Inference costs are negligible at this scale, under £10/month for open-weight 70B via API. Spend the time evaluating your provider's data handling rather than researching hardware.


How to Start Testing Local AI for Security Work

If you want to benchmark local inference against your actual workflow before committing to production hardware, here is the minimal setup.

What you need:

  • Mac with Apple Silicon (M1 or later), or a GPU with 16GB+ VRAM
  • Ollama, which installs in one command and exposes an OpenAI-compatible API at localhost:11434

Models worth pulling first:

# Solid 14B for formatting tasks
ollama pull qwen2.5:14b

# Lighter option for 8GB VRAM
ollama pull mistral:7b

A realistic test query:

Summarise this CVE advisory as a 5-bullet triage brief for a security analyst.
Include: severity, affected versions, attack vector, whether a patch exists,
and recommended immediate action.

[Paste full NVD advisory text here]

If the output is accurate and well-structured, that task belongs in your local inference workflow. If the model adds information that was not in the advisory (technique IDs, related CVEs, vendor names not mentioned) you are seeing the hallucination pattern that makes smaller models unreliable for anything beyond pure reformatting.
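To script this check rather than paste into a chat window, Ollama's local HTTP API takes the same prompt. A minimal sketch using only the standard library (the model name matches the pull commands above; swap in your own):

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default endpoint

def build_payload(model: str, prompt: str) -> dict:
    """Request body for Ollama's /api/generate (non-streaming)."""
    return {"model": model, "prompt": prompt, "stream": False}

def ask_local(model: str, prompt: str) -> str:
    """Send one prompt to the local Ollama server and return its text answer."""
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(build_payload(model, prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# Usage (requires a running Ollama instance):
#   brief = ask_local("qwen2.5:14b",
#                     "Summarise this CVE advisory as a 5-bullet triage brief "
#                     "for a security analyst.\n\n" + advisory_text)
```

Wrapping the call this way also makes it trivial to run the same advisory through two model sizes and compare outputs side by side.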

Test with tasks from your actual workflow before committing to hardware.


Last updated: March 2026

The CyberDesserts Security Assistant runs on 67,900 documents from 30 curated security sources, with hybrid retrieval and a knowledge graph covering 18,920 threat actors, techniques, and defences.

Subscribers get practical security content: production findings, tool teardowns, and threat intelligence with no vendor spin.


References

  1. Fireworks AI. (2026). Security and compliance documentation. SOC2 Type II and HIPAA certification. fireworks.ai/security
  2. Together AI. (2026). Privacy Policy. Zero Data Retention settings. together.ai/privacy
  3. DEV Community / Tiamatenity. (2026). The AI Training Data Opt-Out Lie: Why Your Prompts Are Being Used Anyway. dev.to. Analysis of what zero retention policies actually cover.
  4. Apple. (2024). M4 Pro chip technical specifications. Memory bandwidth 273 GB/s. apple.com
  5. Apple. (2024). M4 Max chip technical specifications. Memory bandwidth 546 GB/s. apple.com
  6. Ollama. (2026). Local model serving documentation and model library. ollama.com
  7. Artificial Analysis. (2026). LLM inference benchmark data. Tokens per second by hardware configuration. artificialanalysis.ai