opensourcesecurity/llm-labs-protectli

Multi-model local LLM evaluation for production technical support. A1 prompt template, Prism Porters Puzzle benchmark, failure-mode taxonomy.

skip 8994d6799f Update POSTMORTEM.md		2026-05-25 18:54:45 -07:00
benchmarks	Update benchmarks/prism-porters-puzzle.md	2026-05-24 07:22:14 -07:00
docs	Update docs/hardware-notes.md	2026-05-24 07:23:09 -07:00
prompts	Initial commit: LLM Labs Protectli evaluation	2026-05-24 08:08:34 -06:00
results	Update results/failure-modes.md	2026-05-24 07:24:08 -07:00
.gitignore	Initial commit: LLM Labs Protectli evaluation	2026-05-24 08:08:34 -06:00
LICENSE	Initial commit: LLM Labs Protectli evaluation	2026-05-24 08:08:34 -06:00
POSTMORTEM.md	Update POSTMORTEM.md	2026-05-25 18:54:45 -07:00
README.md	Update README.md	2026-05-25 18:45:33 -07:00
REPORT.md	Update REPORT.md	2026-05-25 18:48:36 -07:00

README.md

LLM Labs: Protectli Technical Support Evaluation

A two-day, multi-model evaluation of locally-run open-source LLMs as a candidate replacement for the OpenAI API in a production technical support workflow. The customer was Protectli, a network security appliance manufacturer. The work was conducted by Skip Star (Open Source Security Inc.) in October 2025 with Claude Sonnet 4.5 and Claude Opus 4.1 serving as analysis and rubric-grading partners.

TL;DR

Goal: Determine whether a dedicated GPU workstation could run an open-source LLM well enough to replace the OpenAI API call in Protectli's RAG-based Zammad ticket workflow.

Verdict: Technically yes, business-case no - and we recommended against the hardware spend.

Qwen3-14B beat GPT-5 on a domain-specific diagnostic rubric (101/100 vs 40/100 on the hardest scenario) and matched Claude Sonnet 4.5.
All models under 14B parameters failed catastrophically on multi-constraint reasoning - including hallucinated hardware ("remove the batteries" from a desktop appliance with no batteries) and complete state-tracking collapse.
Hardware ROI doesn't pencil out. The Big RAG system burned ~$450 in OpenAI API spend across 2025 serving ~20 concurrent employees across the US, Canada, and EU and processing ~4000 Zammad tickets over a 12-month lifespan. A $5,000 5090-class GPU server that "might beat OpenAI at times" is roughly an 11-year payback on the API cost alone ($5,000 / $450 per year), before electricity, before sysadmin time, before the next GPU refresh. We told the CEO that directly. The justification has to be privacy, control, or future scale - not savings.
The Chinese model dilemma is real. The best-performing local model is Alibaba's. Protectli is a security company that already works hard to differentiate from its (Chinese-manufactured) hardware origins. We documented it as a deployment constraint rather than pretending it away.
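
The payback figure in the ROI point above is simple division. A minimal sketch, using only the two dollar amounts quoted in this README and ignoring electricity, sysadmin time, and hardware refresh (all of which would only lengthen the payback):

```python
# Figures quoted in this README; both are approximate.
HARDWARE_COST_USD = 5_000     # floor price of a 5090-class GPU server
ANNUAL_API_SPEND_USD = 450    # OpenAI spend for the whole Big RAG system


def simple_payback_years(capex: float, annual_savings: float) -> float:
    """Years until the avoided API spend equals the hardware cost."""
    if annual_savings <= 0:
        raise ValueError("annual_savings must be positive")
    return capex / annual_savings


payback = simple_payback_years(HARDWARE_COST_USD, ANNUAL_API_SPEND_USD)
print(f"Simple payback: {payback:.1f} years")
```

The function name and structure are illustrative only; the full TCO breakdown lives in docs/cost-analysis.md.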

What's in this repo

Path	What it is
`REPORT.md`	The formal lab report delivered to Protectli's CEO. Start here for the polished version.
`POSTMORTEM.md`	Day-by-day project post-mortem with methodology evolution, false starts, and discoveries.
`prompts/A1-template-v1.1.txt`	The standardized prompt template ("A1") that controlled for prompt variance across all model comparisons.
`benchmarks/prism-porters-puzzle.md`	An original multi-constraint logic puzzle that defeated 7 of 10 frontier models. Includes rules, optimal solution, difficulty analysis, and full results.
`results/model-leaderboard.md`	Score tables for all 11 models across all test scenarios.
`results/failure-modes.md`	A 5-tier failure taxonomy with case studies (Phi-3's psychotic break, GPT-5's confident-and-wrong, DeepSeek's rule invention).
`docs/hardware-notes.md`	Test platform, VRAM ceilings, quantization choices, Ollama and WSL2 findings.
`docs/cost-analysis.md`	TCO breakdown comparing API vs. local-hardware deployment at Protectli's actual volume.

Big RAG vs Little RAG

Two architectures existed in parallel during this evaluation, and the distinction matters for interpreting the results.

Big RAG was the production system at the time of evaluation: a chat interface served to ~20 concurrent employees across the US, Canada, and EU, backed by Grafana and Prometheus for observability, three Gunicorn backend workers, and two frontend services. It handled the OpenAI API calls for both interactive chat and Zammad ticket suggestions. Across its 12-month lifespan it processed ~4000 tickets and ran ~$450 in total OpenAI spend.

Little RAG was the trimmed successor built during and after this work. The chat interface was killed. Grafana, Prometheus, all three Gunicorn workers, and both frontends came out. What remained was the Zammad ticket path only - the actual revenue-adjacent use case. The codebase shrank ~80%.

The cost analysis and the local-vs-API recommendation in this report apply to the Little RAG scope - ticket suggestions only. The Big RAG numbers ($450/year, ~4000 tickets, 20 concurrent users) are the historical baseline that justified Little RAG's existence in the first place.
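
As a rough illustration of why the baseline matters, the Big RAG totals give a blended API cost per ticket. This is an upper bound for the ticket path, since the ~$450 also paid for interactive chat. The sketch assumes only the two approximate figures quoted above:

```python
# Approximate Big RAG totals over its 12-month lifespan (from this README).
TOTAL_API_SPEND_USD = 450
TICKETS_PROCESSED = 4_000

# Blended figure: chat traffic is included in the spend, so the true
# per-ticket cost of the ticket-suggestion path alone is lower than this.
blended_cost_per_ticket = TOTAL_API_SPEND_USD / TICKETS_PROCESSED
print(f"Upper bound: ${blended_cost_per_ticket:.4f} per ticket")
```

That works out to a little over 11 cents per ticket at most, which is the number any local-hardware proposal for Little RAG has to beat.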

Methodology highlights

The A1 prompt template. A standardized prompt structure (role definition, embedded KB articles, response schema) used as the controlled variable across every model comparison. Without this, "Qwen vs Llama" results are just measuring prompt phrasing. With it, the differences are the models.
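
To make the "controlled variable" idea concrete, here is a minimal sketch of a fixed-structure prompt builder. It is not the actual A1 template (that is in prompts/A1-template-v1.1.txt); the role text, schema text, section labels, `KBArticle`, and `build_prompt` are all hypothetical placeholders. The only point it illustrates is that every model receives a byte-identical prompt:

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class KBArticle:
    title: str
    body: str


# Placeholder wording, NOT the text of prompts/A1-template-v1.1.txt.
ROLE_DEFINITION = "You are a technical support engineer for a network appliance vendor."
RESPONSE_SCHEMA = "Answer with three sections: Diagnosis, Steps, Caveats."


def build_prompt(ticket_text: str, articles: Sequence[KBArticle]) -> str:
    """Assemble role, KB articles, response schema, and ticket in a fixed order."""
    kb_block = "\n\n".join(f"### {a.title}\n{a.body}" for a in articles)
    return "\n\n".join([
        f"ROLE:\n{ROLE_DEFINITION}",
        f"KNOWLEDGE BASE:\n{kb_block}",
        f"RESPONSE FORMAT:\n{RESPONSE_SCHEMA}",
        f"TICKET:\n{ticket_text}",
    ])


articles = [KBArticle("Example KB article", "Example article body.")]
prompt = build_prompt("Example ticket text.", articles)

# The same string goes to every model, so only the model varies.
prompts_by_model = {name: prompt for name in ("model-a", "model-b")}
```

Building the prompt once and reusing the string, rather than re-rendering it per model, removes any chance of per-model drift in wording or article order.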

The Prism Porters Puzzle. When the TPM-troubleshooting tests started looking too easy for the top-tier models, I designed an original multi-constraint logic puzzle to stress-test reasoning. Three porters, three prisms, guardian-pairing safety rules, a 2-porter bridge capacity. Optimal solution is two moves. GPT-5 broke a rule in move one. Phi-3 invented a new entity called "A-C" and repeated "Bruno without prize" eight times before giving up. Details in benchmarks/prism-porters-puzzle.md.

Domain-specific rubrics, not generic benchmarks. Each test was scored against a hand-built rubric tied to actual Protectli KB content (TPM model compatibility, BIOS settings, pin-clipping requirements for the VP2410). Generic benchmarks don't predict domain performance.
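
A domain rubric can be thought of as a list of weighted criteria. The sketch below is a crude stand-in, not the rubric used in this evaluation: the criterion names echo the topics mentioned above, but the point weights and keywords are invented for illustration, and keyword matching is far blunter than the hand and LLM-assisted grading actually used.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Criterion:
    name: str
    keywords: Tuple[str, ...]  # any keyword match earns the points
    points: int


# Hypothetical rubric: weights and keywords are invented, not Protectli's.
RUBRIC = (
    Criterion("TPM model compatibility", ("compatib",), 40),
    Criterion("BIOS settings", ("bios",), 30),
    Criterion("Pin-clipping requirement", ("clip",), 30),
)


def score(response: str, rubric: Tuple[Criterion, ...] = RUBRIC) -> int:
    """Sum the points of every criterion the response appears to address."""
    text = response.lower()
    return sum(c.points for c in rubric if any(k in text for k in c.keywords))


example = "Check the BIOS settings, then confirm TPM compatibility."
print(score(example))  # the pin-clipping criterion is missed
```

Even this toy version shows the shape of the approach: a response that sounds fluent but skips a required domain step loses points for that step specifically.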

Honest cost analysis. The original framing was "save money by going local." Once the actual Big RAG spend came into focus - $450/year for the whole system, chat plus tickets, 20 users across three regions - the local-hardware case collapsed. A $5,000 GPU server to replace a $450/year API bill isn't a cost play. That pivot, and the rebuild of the justification around privacy and control, is documented in POSTMORTEM.md.

Test environment

2025 Asus Zephyrus G16 laptop
Intel Core Ultra 9 285H (16 cores)
NVIDIA RTX 5070 Ti Mobile (12 GB VRAM)
32 GB DDR5-7000
Windows 11 + WSL2 (Ubuntu)
Ollama for local inference
Q4_K_M quantization across all local models
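
For readers reproducing the setup, a minimal sketch of calling a local Ollama server over its HTTP API using only the standard library. It assumes Ollama's default endpoint on port 11434; the helper names, the `temperature` setting, and the model tag in the usage comment are assumptions, not settings taken from this evaluation.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint


def build_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming generate request for a local Ollama server."""
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,                 # return one JSON object, not chunks
        "options": {"temperature": 0},   # assumption: reduce sampling variance
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str, timeout: float = 300.0) -> str:
    """Send the prompt and return the model's text response."""
    with urllib.request.urlopen(build_request(model, prompt), timeout=timeout) as resp:
        return json.loads(resp.read())["response"]


# Usage (requires a running Ollama server with the model already pulled):
#   text = generate("qwen3:14b", "Reply with the single word: ready")
```

Under WSL2 the server address may differ depending on where Ollama is running; docs/hardware-notes.md covers the findings from this rig.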

The laptop was the evaluation rig, not the proposed deployment target. The hypothetical production box was a $5,000-floor 5090-class GPU workstation - which is what the ROI math is built against.

Models evaluated

Local (Ollama / Q4_K_M): Qwen 2.5 14B, Qwen 3 14B, Qwen 2.5 32B, Llama 3.1 8B, Llama 3.1 70B, Mistral 7B, Mistral Small 24B, Gemma 3 12B, Phi-3 Mini 3.8B, DeepSeek Coder 6.7B
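
A back-of-envelope way to see which of these fit in the test rig's 12 GB of VRAM. The sketch assumes roughly 4.8 bits per weight for Q4_K_M, which is an approximation and not a figure from this repo, and it counts weights only (no KV cache or runtime overhead), so real usage is higher.

```python
BITS_PER_WEIGHT_Q4_K_M = 4.8  # rough approximation, not a measured value
VRAM_GB = 12                  # RTX 5070 Ti Mobile in the test rig

# Nominal parameter counts in billions, taken from the model names.
MODELS_B = {
    "Phi-3 Mini 3.8B": 3.8,
    "Llama 3.1 8B": 8,
    "Qwen 3 14B": 14,
    "Mistral Small 24B": 24,
    "Qwen 2.5 32B": 32,
    "Llama 3.1 70B": 70,
}


def weight_gb(params_billion: float, bits: float = BITS_PER_WEIGHT_Q4_K_M) -> float:
    """Approximate size of the quantized weights in GB (weights only)."""
    return params_billion * bits / 8


for name, size_b in MODELS_B.items():
    gb = weight_gb(size_b)
    verdict = "fits" if gb <= VRAM_GB else "exceeds VRAM"
    print(f"{name:<18} ~{gb:5.1f} GB  {verdict}")

fits_in_vram = [n for n, b in MODELS_B.items() if weight_gb(b) <= VRAM_GB]
```

On this estimate the 14B class is about the largest that fits entirely on the GPU. Models above the ceiling need partial CPU offload, which Ollama supports; how the larger models were actually run is covered in docs/hardware-notes.md and is not inferred here.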

Closed-source baselines: Claude Sonnet 4.5, Claude Opus 4.1, GPT-5, Gemini 2.5 Pro, Le Chat (Mistral)

Final recommendation delivered to the customer

Don't buy the GPU server. $5,000 to replace a $450/year line item, with no guarantee of matching quality, is a bad trade.
Migrate the API endpoint from OpenAI to Claude. Better quality on this domain at comparable cost.
Re-evaluate hardware investment if any of the following happen: ticket volume increases 10×, customer-facing AI features are reintroduced, or data sovereignty becomes a compliance requirement.
If local deployment becomes necessary, Qwen3-14B is the technical winner but the Chinese-model concern needs to be addressed at the policy level - not engineered around.
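
The 10× volume trigger in the third recommendation can be sanity-checked with a small sensitivity sketch. It assumes API spend scales linearly with ticket volume, which is a simplifying assumption and not a claim from the report, and it still ignores electricity and operations costs.

```python
HARDWARE_COST_USD = 5_000        # GPU server floor price quoted in this README
BASELINE_ANNUAL_API_USD = 450    # approximate current annual API spend


def payback_years(volume_multiplier: float) -> float:
    """Simple payback if API spend grows linearly with ticket volume."""
    if volume_multiplier <= 0:
        raise ValueError("volume_multiplier must be positive")
    return HARDWARE_COST_USD / (BASELINE_ANNUAL_API_USD * volume_multiplier)


for multiplier in (1, 2, 5, 10):
    print(f"{multiplier:>2}x volume: {payback_years(multiplier):.1f} years")
```

Under that assumption, payback drops from about 11 years at current volume to just over one year at 10×, which is why the hardware question is worth reopening at that point.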

About this work

This was contract work for Protectli, conducted alongside the launch of Open Source Security Inc. The materials here are published with two goals: (1) demonstrating practical Forward Deployed Engineer (FDE) style evaluation methodology and (2) sharing the Prism Porters Puzzle as a benchmark that anyone is welcome to use on their own model comparisons.

If you're a hiring manager looking at this for a Forward Deployed Engineer position: the artifacts that probably matter most to you are REPORT.md, benchmarks/prism-porters-puzzle.md, and prompts/A1-template-v1.1.txt.

Skip Star
