RAG_AI
A production retrieval-augmented generation system that ran a hardware company's customer support and sales operations for a full year, zero lockups, zero downtime, on infrastructure built solo on personal time.
Status: Retired and open-sourced. Originally deployed at Protectli (a small-form-factor computer manufacturer) from Dec 2024 to May 2026. Shut down after the author's departure. Released here under BSD-2-Clause so the patterns can live on.
Provenance
The earliest working version of this system - a CLI-only proof of concept built to demonstrate that retrieval-augmented generation was viable for Protectli's support and sales workflows - exists as a separate, timestamped repository on GitHub:
github.com/bestcoast127/Protectli-DRAGON
That repository was created in late 2024 to preserve the original proof of concept. It is intentionally not migrated to Forgejo because the third-party timestamps are the point. The CLI demo got the full project funded; what is in this repository is the production system that followed.
This codebase itself was not developed under day-by-day version control. As a solo project, the rollback discipline was filesystem snapshots (often six per day during heavy iteration), not git commits. That choice is worth naming honestly: it kept the right things safe (a 10-day Flask-to-FastAPI refactor disaster was recovered from a snapshot) and it left no per-commit audit trail to point at later. A second engineer would have changed this. There was not a second engineer.
Why this repo exists
Most RAG demos are toys. This one shipped, scaled across two departments (support and sales), and was operated daily by people who had never heard the word "vector." It is being open-sourced because:
- Two of its pieces may have been firsts. A CRUD UI for Qdrant aimed at non-technical users (we searched, never found a prior one). An adaptive context-window token manager (ACWTM) for session-aware recursive summarization, built in December 2024 before "context compression" was an industry term.
- It was built the hard way, in an era when context windows were 4K tokens and LLM assistants had no project memory, no web search, and could not see entire scripts. V1 was scrapped. V2 was scrapped. V3 became the system that ran for a year. The patterns here are battle-tested, not theorized.
- The author is moving on. Original work, owned by the author, gifted to Protectli for operational use, now released for anyone to learn from, fork, or rebuild.
What it did
- Received Zammad ticket webhooks via a WordPress proxy
- Filtered spam and auto-replies before incurring API cost
- Classified incoming queries via regex pattern matching into six response shapes (described in detail below)
- Retrieved relevant context from a Qdrant vector database curated by support and sales staff
- Generated responses through OpenAI GPT-4o
- Posted responses back to Zammad as either customer-facing replies or internal tech notes for Tier 1 agents
- Handled queries in English, German, and French (Zammad served three geographic offices; multilingual response handling was tested by the German office team on live calls)
Alongside the ticket pipeline, the same backend served:
- A Streamlit chat interface with live temperature, k-value, and confidence-threshold controls, available on the internal network to all employees including remote workers tunneling in. Supported roughly 20 concurrent sessions with sustained use.
- A Qdrant CRUD UI that let support and sales staff add, edit, search-test, bulk-re-embed, and back up vectors without touching code (see services/frontend/vector_crud.py)
- A full observability stack (Prometheus, Grafana, Elasticsearch, Fluentd) feeding service health, latency, and API-cost telemetry
Single-script deployment
The entire 18-container production stack stood up from a single install script run inside a fresh Ubuntu VM. The actual deployment at Protectli was: provision a VM, run the script, change the GUID and IP, done. That reproducibility was deliberate from the moment Docker Compose became the orchestration layer in V3 (after V1 ran natively and V2 ran on a bare-metal server and broke nginx for the services already on that box).
Bring your own corpus, your own API keys, and your own webhook source, and the infrastructure stands up the same way it did in production.
Three things worth highlighting
ACWTM - Adaptive Context Window Token Manager
Built December 2024. At the time, context windows were small and a 30-message support session would blow the budget. ACWTM is a Flask service that, per session:
- Tracks every message with a token count via tiktoken
- Maintains a sliding window of raw recent messages
- Recursively summarizes older messages once the window approaches a configurable safe limit (MAX_TOKENS - TOKEN_BUFFER)
- Persists "already summarized" flags so messages are never summarized twice
- Times out idle sessions and reclaims their context
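A minimal sketch of the bookkeeping in the list above, under stated assumptions. It is not the code in services/acwtm/acwtm.py: the whitespace tokenizer stands in for tiktoken, the summarize callable stands in for the LLM call, idle-session timeout is left out, and Session, Message, keep_recent, and the numeric values of MAX_TOKENS and TOKEN_BUFFER are illustrative.

```python
from dataclasses import dataclass, field
from typing import Callable, List

MAX_TOKENS = 4096      # hypothetical model budget
TOKEN_BUFFER = 1024    # hypothetical headroom reserved for the reply


def count_tokens(text: str) -> int:
    """Stand-in tokenizer (whitespace split). The original used tiktoken."""
    return len(text.split())


@dataclass
class Message:
    text: str
    tokens: int
    summarized: bool = False  # set once the message is folded into the summary


@dataclass
class Session:
    summarize: Callable[[str], str]           # injected; an LLM call in practice
    safe_limit: int = MAX_TOKENS - TOKEN_BUFFER
    keep_recent: int = 4                      # raw messages always kept verbatim
    summary: str = ""
    messages: List[Message] = field(default_factory=list)

    def window_tokens(self) -> int:
        live = sum(m.tokens for m in self.messages if not m.summarized)
        return live + count_tokens(self.summary)

    def add(self, text: str) -> None:
        self.messages.append(Message(text, count_tokens(text)))
        if self.window_tokens() > self.safe_limit:
            self._compress()

    def _compress(self) -> None:
        # Only messages older than the most recent `keep_recent` are eligible,
        # and only if they have not been summarized before.
        cutoff = max(len(self.messages) - self.keep_recent, 0)
        older = [m for m in self.messages[:cutoff] if not m.summarized]
        if not older:
            return
        # Recursive step: the previous summary is summarized again together
        # with the messages that just aged out of the raw window.
        joined = "\n".join([self.summary] + [m.text for m in older]).strip()
        self.summary = self.summarize(joined)
        for m in older:
            m.summarized = True

    def context(self) -> str:
        recent = [m.text for m in self.messages if not m.summarized]
        return "\n".join(([self.summary] if self.summary else []) + recent)
```

Folding the previous summary into each new summarization call is what makes the scheme recursive; the per-message flag is what keeps any message from being summarized twice.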
Today, every major LLM lab does some variant of this - context compression, semantic memory, recursive summarization, sliding-window attention. In late 2024, it was hand-rolled because nobody had named the problem yet. See services/acwtm/acwtm.py.
The ACWTM was deprecated when foundation models started shipping with 128K+ context windows. The code is preserved here because the technique still has applications in cost optimization and long-running agents, but the daily operational need vanished almost overnight.
Vector CRUD UI
A Streamlit app at port 8503 that gave non-technical staff direct access to the Qdrant collection. Five tabs:
- Browse: live filter against the running collection with no caching (kept the operator from acting on stale data)
- Edit: in-place editing with two save modes - text-and-metadata-only (cheap payload update) or full re-embed-and-save (regenerates the vector via the OpenAI embeddings API). The operator picks which the change warrants.
- Add: live token counter with hard 8192-token ceiling, three ID-generation strategies (UUID, hash, manual integer)
- Search Test: arbitrary query against the live corpus with adjustable result count and minimum similarity score, so operators could preview retrieval before relying on it
- Tools: bulk re-embed across the entire corpus (with an embedding-model picker, which means swapping embedding models was a corpus-wide button-click operation), full-collection JSON export for backup, collection statistics
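The Add tab's three ID strategies and its token ceiling can be sketched as follows. This is an illustration, not the code in services/frontend/vector_crud.py: make_point_id and check_token_ceiling are hypothetical names, the SHA-256-to-UUID mapping is one assumed way to build a hash ID, and the token counter is injected rather than tied to a specific tokenizer. Qdrant accepts point IDs that are either unsigned integers or UUIDs, which is why the hash is folded into UUID form.

```python
import hashlib
import uuid
from typing import Callable, Optional, Union

MAX_EMBED_TOKENS = 8192  # the hard ceiling named in the Add tab description


def make_point_id(text: str, strategy: str,
                  manual_id: Optional[int] = None) -> Union[str, int]:
    """Return a point ID as a UUID string or a non-negative integer."""
    if strategy == "uuid":
        return str(uuid.uuid4())                     # random, unique per insert
    if strategy == "hash":
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        return str(uuid.UUID(bytes=digest[:16]))     # same text, same ID
    if strategy == "manual":
        if manual_id is None or manual_id < 0:
            raise ValueError("manual strategy needs a non-negative integer ID")
        return manual_id
    raise ValueError(f"unknown strategy: {strategy}")


def check_token_ceiling(text: str, count_tokens: Callable[[str], int]) -> int:
    """Return the token count, or raise if the text exceeds the ceiling."""
    n = count_tokens(text)
    if n > MAX_EMBED_TOKENS:
        raise ValueError(f"{n} tokens exceeds the {MAX_EMBED_TOKENS}-token ceiling")
    return n
```

A deterministic hash ID means re-adding identical text overwrites the existing point instead of creating a duplicate; a random UUID always creates a new point.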
The unlock was operational: the corpus stayed accurate because the people closest to the work - support engineers and sales - owned curation. We searched for a comparable open-source Qdrant CRUD UI before building this and could not find one. It may have been first. See services/frontend/vector_crud.py.
Modular response-shape templates with regex classification
The chat pipeline ran a user query through a chain of ~30 regex patterns, case-insensitive, mostly Protectli-hardware-specific. The matched pattern selected one of six response templates:
- Troubleshooting: numbered-step technical responses, constrained to steps that appear in retrieved troubleshooting guides; explicitly forbidden from improvising. Temperature 0.3. (Lower output randomness for technical steps.)
- Initial response: the Zammad-webhook template, a codified one-shot support reply with hardcoded link policy (CMOS reset link is provided verbatim; BIOS guide links are gated to URLs that appear in the retrieved context), explicit anti-hallucination instructions ("Never include a generic BIOS index link. Never invent links."), and a mandatory closing line. Temperature 0.0.
- Clarification: explanatory responses for "what does X mean" style queries. Temperature 0.3.
- Informational: spec lookups, compatibility, feature questions. Temperature 0.3.
- Fallback: hardcoded polite-decline for out-of-scope queries (anything not Protectli hardware). No model call required.
- Escalation: hardcoded handoff message for "I need a human" / "this isn't working" patterns. No model call required. Returns the same string every time.
Two of these six (fallback and escalation) gate before the OpenAI API, the same way the spam filter does. That is three deliberate cost-control gates in the request pipeline.
Query classification re-ran on every turn, so a session could legitimately move from informational to troubleshooting mid-conversation. See services/backend/templates.py.
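A rough sketch of the classify-then-gate flow described above. The four patterns and both canned strings are invented for illustration (the production chain had roughly 30 hardware-specific patterns in services/backend/templates.py), and treating "no pattern matched" as the fallback case is an assumption. The Zammad path's initial-response template is not routed through a classifier like this one.

```python
import re
from typing import Callable, List, Tuple

# Hypothetical patterns, checked in order; the first match wins.
PATTERNS: List[Tuple[str, re.Pattern]] = [
    ("escalation", re.compile(r"\b(human|agent|real person)\b|isn'?t working", re.I)),
    ("troubleshooting", re.compile(r"\b(won'?t boot|no display|error|crash)", re.I)),
    ("clarification", re.compile(r"\bwhat does .+ mean\b", re.I)),
    ("informational", re.compile(r"\b(spec|compatible|support|how many|does it)\b", re.I)),
]

# Temperatures from the template list above.
TEMPERATURE = {"troubleshooting": 0.3, "clarification": 0.3, "informational": 0.3}

# Hardcoded replies (wording invented here); no model call is made for these.
CANNED = {
    "fallback": "Sorry, I can only help with questions about this hardware.",
    "escalation": "I am handing this conversation to a support engineer.",
}


def classify(query: str) -> str:
    for shape, pattern in PATTERNS:
        if pattern.search(query):
            return shape
    return "fallback"  # assumption: nothing matched means out of scope


def respond(query: str, call_model: Callable[[str, str, float], str]) -> str:
    shape = classify(query)          # re-run on every turn
    if shape in CANNED:
        return CANNED[shape]         # cost gate: returns before any API call
    return call_model(shape, query, TEMPERATURE[shape])
```

Because classify runs per turn rather than per session, a conversation that opens with a spec question and continues with "it won't boot" is routed to the troubleshooting template on the second turn.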
The modular template structure existed because GPT-4o's output ceiling was ~4K tokens in 2024 and a unified prompt that handled all response shapes did not fit. The Zammad ticketing path bypassed the modular branching entirely and used a single template (INITIAL_RESPONSE_TEMPLATE) because by the time Zammad webhooks came online, output limits had risen and one well-engineered template covered the cases.
Architecture
                   ┌─────────────┐
                   │   Zammad    │  (ticketing)
                   └──────┬──────┘
                          │ webhook
                   ┌──────▼──────────┐
                   │ WordPress Proxy │  (auth + forwarding, not in this repo)
                   └──────┬──────────┘
                          │ HTTP POST /api/initial-response
                 ┌────────▼────────┐
                 │      Nginx      │
                 └────────┬────────┘
                          │
       ┌──────────────────┼───────────────────┐
       │                  │                   │
  ┌────▼────┐       ┌─────▼─────┐       ┌─────▼─────┐
  │ Backend │ x3    │ Frontend  │ x2    │   ACWTM   │ x2
  │  Flask  │       │ Streamlit │       │   Flask   │
  └────┬────┘       └─────┬─────┘       └───────────┘
       │                  │
       │    ┌─────────────┴────────────┐
       │    │                          │
  ┌────▼────▼───┐    ┌──────────┐    ┌─▼───────┐
  │   Qdrant    │    │  Redis   │    │ OpenAI  │
  │  (vectors)  │    │(sessions)│    │   API   │
  └─────────────┘    └──────────┘    └─────────┘

  Observability: Prometheus, Grafana, Elasticsearch, Fluentd
  (separate containers, scraping all of the above)
Vector search uses HNSW with ef=128. The backend exposes a Prometheus endpoint and structured JSON logs flow through Fluentd to Elasticsearch.
Live testing dashboard
The internal development dashboard exposed live controls for k value (retrieval depth), confidence threshold (similarity-score floor for retrieved chunks), and temperature, plus the document-upload ingestion pipeline and a per-conversation latency telemetry chart. The latency chart routinely showed the descending shape characteristic of ACWTM kicking in: as the session progressed and older messages got summarized, each subsequent turn became faster because the context payload shrank.
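How the k value and the confidence threshold interact can be shown in a few lines. This is a sketch, not the dashboard's code: Hit and select_context are hypothetical names, and scores are assumed to be similarities where higher means closer.

```python
from typing import List, NamedTuple


class Hit(NamedTuple):
    chunk_id: str
    score: float  # similarity score; higher is closer


def select_context(hits: List[Hit], k: int, min_score: float) -> List[Hit]:
    """Keep at most k hits, best first, dropping any below the score floor."""
    ranked = sorted(hits, key=lambda h: h.score, reverse=True)
    return [h for h in ranked[:k] if h.score >= min_score]
```

The two controls are independent: k caps how many chunks can reach the prompt, while the threshold can leave fewer than k (or none) when the corpus has nothing close to the query.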
These screenshots are operational artifacts from the production system rather than reconstructions.
Prompt engineering and the in-place iteration model
The Protectli templates went through somewhere between 200 and 300 iterations over the course of the system's life. Iteration was not formally version-controlled in code. Old templates were overwritten in place; rollback was available through the filesystem snapshots described in the Provenance section.
What was formalized was the evaluation loop. Each candidate template was scored against a reference "ideal answer" using a separate Claude / GPT instance as the grader. This is the same shape as a modern LLM-eval harness (Promptfoo, DeepEval, and others that exist now); none of those tools existed in usable form when this work was being done, so the harness was built by hand: candidate prompt + reference answer + LLM grader + manual review of disagreements. The iteration that survived was the one whose outputs the grader rated closer to the reference than the previous champion.
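The champion-versus-challenger loop described above has roughly this shape. It is a sketch under assumptions: generate and grade are injected stand-ins for the model call and the LLM grader, scores are assumed to fall in [0, 1], the margin parameter is an addition for illustration, and the manual review of disagreements is not modeled.

```python
from typing import Callable, List, Tuple

Generate = Callable[[str, str], str]   # (template, query) -> model output
Grade = Callable[[str, str], float]    # (output, reference answer) -> score

Case = Tuple[str, str]                 # (query, reference "ideal answer")


def mean_score(template: str, cases: List[Case],
               generate: Generate, grade: Grade) -> float:
    """Average grader score for one template across all reference cases."""
    scores = [grade(generate(template, query), ref) for query, ref in cases]
    return sum(scores) / len(scores)


def pick_champion(champion: str, challenger: str, cases: List[Case],
                  generate: Generate, grade: Grade, margin: float = 0.0) -> str:
    """Promote the challenger only if it beats the champion by more than margin."""
    champ = mean_score(champion, cases, generate, grade)
    chall = mean_score(challenger, cases, generate, grade)
    return challenger if chall > champ + margin else champion
```

Ties go to the current champion, so a template is only replaced when the grader rates the new outputs strictly closer to the reference.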
A specific anti-hallucination pattern emerged from this loop and is worth naming. The templates are written to constrain the model to retrieved context ("Use only information contained in the provided context chunks. Do not guess specs, versions, or procedures") rather than to instruct the model to hedge linguistically. The result is what modern RAG practice calls grounded generation: hallucinations that did happen tended to be over-recommendation of plausible-but-wrong steps (suggesting a CMOS reset for a symptom CMOS resets do not address - the model was statistically not wrong, just over-fitted to the most common fix) rather than fabricated specs, fabricated product names, or fabricated URLs. The "never invent links" instruction in the Zammad template is the explicit version of that posture, and it was added after seeing the model fabricate a plausible-looking Protectli URL that 404'd.
Corpus evolution
The Qdrant corpus did not stay one size. The arc:
- Start: ~2,000 small chunks, roughly 300 tokens each, retrieved with k=12.
- End: ~200-300 large chunks, roughly 1,500 tokens each, retrieved with k=3-4.
The reduction was a deliberate response to A/B testing. As the OpenAI models improved through 2025, larger chunks started outperforming smaller-chunks-higher-k on response quality, because the larger chunks preserved more local semantic coherence and the better models could handle the larger context payload without losing the thread. Lower k also reduced API cost per query, so the corpus and the model improvements were both moving toward the same operational point.
That is empirical tuning against measured output quality, including the willingness to invert an earlier design decision when the data said to. It is not a static "200+ vectors" claim.
Multi-tenancy, multi-language, and the actual deployment shape
This was a single-tenant deployment by design. One Qdrant collection, one Protectli, one shared corpus serving both support and sales. There was no SaaS-style tenant isolation, no per-customer namespacing, no per-tenant API-key routing - because none of that was a requirement. A multi-tenant or external-customer-facing deployment would warrant tenant-scoped vector namespacing and a different secrets posture. That work was never built because the threat model never required it.
What was distributed was geography. Protectli had three offices across the US, Germany, and France, all unified through Zammad. Tickets arrived in all three languages.
Multilingual handling was first built as an explicit prompt instruction ("if the customer's query is in German, respond in German"). That worked, was A/B-tested on conference calls with the German office team, and was rated acceptable-but-not-perfect by native speakers. The explicit instruction was removed later in 2025 when OpenAI's models began reliably matching response language to query language without scaffolding - the same pattern as the modular templates becoming unnecessary when output ceilings rose. Build for the current model's limits; remove the workaround when the model catches up.
The Streamlit chat interface's session isolation (per-session timeout, race-condition guards in Redis) supported about 20 concurrent users but is not the same thing as tenant isolation; it is session isolation within a single tenant.
Error handling and the production failure record
The honest version of this section is that almost nothing ever broke in production. Months of pre-production hammering on the system from the author's home workstation surfaced and patched the gnarly failure modes (Flask 5xx errors from refactoring routes were routine during development; none ever reached production traffic). The system was effectively in a long staging phase before it ever touched a Protectli ticket.
The failure modes that did matter in production:
- OpenAI API latency. The only failure mode that fired with any frequency was OpenAI's API taking longer to respond than the configured timeout. The async OpenAI client was configured with a 30-second timeout and the outer retry-and-fallback path extended that to 60 seconds total. Both values were tuned after observing that high-k, high-temperature, deep-thought responses could legitimately take 30-45 seconds. The user-facing failure case was "no automatic Zammad response was posted to the ticket"; a Tier 1 agent would handle the ticket manually, and a Protectli ticket without an automatic reply was not a customer-facing SLA breach.
- Routes returning 500. Common during development as new features and templates were added; tested out of existence before deploy. Zero occurrences in production.
- Qdrant unavailable, Redis sideways, zero-match retrieval, regex misclassification. None of these were observed in production at the volume the system ran. A higher-traffic or stricter-SLA deployment would warrant explicit fallback paths (retry-with-backoff for the LLM call, a degraded-mode response when Qdrant is unreachable, an "I do not have information about that" path for zero-match retrievals). Those paths were not built because the failure rate against actual call volume was effectively zero.
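One way to read the 30-second per-call timeout inside a 60-second total budget is sketched below. This is an interpretation, not the repository's retry path: call_with_budget and make_call are hypothetical names, retries are immediate (no backoff), and returning None stands for "post nothing and let a Tier 1 agent take the ticket".

```python
import asyncio
from typing import Awaitable, Callable, Optional

PER_CALL_TIMEOUT = 30.0  # seconds, per attempt (value from the text)
TOTAL_BUDGET = 60.0      # seconds, across all attempts (value from the text)


async def call_with_budget(make_call: Callable[[], Awaitable[str]],
                           per_call: float = PER_CALL_TIMEOUT,
                           total: float = TOTAL_BUDGET) -> Optional[str]:
    """Retry a slow call until it answers or the total budget is spent."""
    loop = asyncio.get_running_loop()
    deadline = loop.time() + total
    while True:
        remaining = deadline - loop.time()
        if remaining <= 0:
            return None  # budget spent: no automatic reply is posted
        try:
            # Never wait longer than what is left of the total budget.
            return await asyncio.wait_for(make_call(),
                                          timeout=min(per_call, remaining))
        except asyncio.TimeoutError:
            continue     # attempt timed out; try again if budget remains
```

With the default values, a call that times out once gets exactly one more full-length attempt before the function gives up.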
The detection mechanism for the rare production failure was twofold: the Grafana/Prometheus observability stack caught it on the system-health side (CPU, RAM, request latency, error rates) and the absence of an automatic reply on a Zammad ticket caught it on the operator-experience side. Both signals were used in practice; the second one was the one that actually surfaced the one or two times OpenAI had a flaky window during the year.
This is a defensible operational posture for a single-operator system serving a single company at the volume Protectli ran. It is not the same posture that would be defensible for a multi-tenant SaaS or for a system on a strict customer-facing SLA. The README is naming this distinction explicitly because the reviewer was right to flag it.
Secrets handling
API keys lived in a .env file on the production VM, ingested by Docker Compose via env_file:. VM access was restricted to four named individuals: the CEO, two directors, and the author. None of the other three ever opened the file; the work was opaque to them.
This is appropriate for the deployment context: single-tenant, single-deployment, internal-tools system, four pre-vetted humans with VM credentials. The threat model was "unauthorized external access to the VM," not "internal bad actor exfiltrating the API key." A multi-tenant or external-customer-facing deployment would warrant a real secret manager (vault, sops, cloud KMS) because the threat model would change.
Rotation discipline: the key was rotated when there was a rotation event (departure, suspected leak). No event triggered a rotation during the production year.
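On the application side, values injected through env_file: arrive as ordinary environment variables. A small sketch of reading them defensively follows; load_secrets and redact are hypothetical helpers, not functions from this codebase, and OPENAI_API_KEY is the only variable name taken from this README.

```python
import os
from typing import Dict, Iterable, Mapping


def load_secrets(names: Iterable[str],
                 env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Fail fast at startup if any required variable is missing or empty."""
    wanted = list(names)
    missing = [n for n in wanted if not env.get(n)]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return {n: env[n] for n in wanted}


def redact(value: str, keep: int = 4) -> str:
    """Mask a secret for logging, showing at most the last `keep` characters."""
    if keep <= 0 or len(value) <= keep:
        return "*" * len(value)
    return "*" * (len(value) - keep) + value[-keep:]
```

Checking at startup turns a missing key into an immediate container failure rather than a runtime error on the first ticket.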
What was deprecated, and why
Engineering judgment is not just what gets built. It is what gets killed. Four deliberate removals over the system's lifetime:
Document Processor (deprecated mid-2025). Originally a drag-and-drop block on the Streamlit dashboard. You would drop a .txt, .md, or .ipynb file, it would round-trip through Python-side cleaning (lowercase normalization, whitespace stripping, comment removal) and the OpenAI embedding API, then auto-ingest into Qdrant with configurable chunk size and overlap on live sliders. It worked. The CRUD UI replaced it because the round-trip-then-chunk model produced vectors that nobody had reviewed before they went live, where the CRUD UI made human-in-the-loop the default. The embedding pipeline that powered the Document Processor is preserved in services/document_processor/ for reference, though it is no longer the operational ingestion path.
ACWTM (deprecated when context windows grew). Killed when foundation models started shipping with 128K+ context windows. The clever workaround became obsolete almost overnight. The code is preserved in this repo because the technique still has applications in cost optimization and long-running agents, but the daily operational need vanished.
Streamlit chat interface (deprecated in Little RAG). The chat was popular. Too popular. Support engineers started using it as a crutch instead of learning the product, which made them slower on tickets that fell outside its knowledge. Worse, some Tier 1 agents copy-pasted chat responses directly into customer replies without verifying. Company memos went out about over-trusting AI outputs. Little RAG (the four-container successor) removed the chat entirely and replaced it with structured internal tech notes posted directly to Zammad tickets, where Tier 1 agents still had to read and engage. Better outcomes for the humans.
Sales recommendation extension (killed, never shipped). Roughly 100 hours went into a sales-RAG extension that used the same corpus plus additional sales reasoning, with branching logic and mandatory disambiguation questions to walk a customer toward a Protectli Vault recommendation (RAM, CPU, NIC count, Suricata-yes-or-no, and so on). It reached ~85% perceived-correctness against an internal rubric. It was killed because 85% is not acceptable when the failure mode is recommending the wrong product to a paying customer - a customer who buys an overspec'd Vault has been talked into spending more than they needed to, and a customer who buys an underspec'd Vault has been sold something that will not do the job. Either failure is a trust failure that no amount of "but it's right 85% of the time" recovers from. Today's retrieval techniques (graph RAG, multi-hop query rewriting, structured tool use) would close that gap; in late 2024 and early 2025 they did not exist in usable form. The right call was to stop.
The pattern in all four: the system worked, and was still removed because something better existed or because the design had stopped serving the people using it. That instinct matters more than any individual feature.
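For reference, the Document Processor's clean-then-chunk step can be approximated as below. This is a sketch, not the code preserved in services/document_processor/: clean and chunk are hypothetical names, comment removal is omitted because the text does not say what counted as a comment, and chunking operates on whatever token list the caller supplies.

```python
import re
from typing import List


def clean(text: str) -> str:
    """Collapse runs of whitespace, strip the ends, and lowercase."""
    return re.sub(r"\s+", " ", text).strip().lower()


def chunk(tokens: List[str], size: int, overlap: int) -> List[List[str]]:
    """Split tokens into windows of `size` that share `overlap` tokens."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reaches the end of the input
    return chunks
```

Size and overlap correspond to the two live sliders; every chunk produced this way would go straight to embedding, which is the unreviewed path the CRUD UI replaced.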
Little RAG: the successor
After a year of running rag_ai in production, the next iteration was a from-scratch rebuild. Little RAG, built in late 2025, ran four containers instead of eighteen, removed the Streamlit chat entirely, replaced OpenAI with Anthropic Claude, and handled tickets exclusively as internal Zammad tech notes for Tier 1 agents.
Two things make Little RAG worth mentioning here:
It was not a refactor. rag_ai was never refactored across its production year. It ran. Little RAG was a separate, ground-up build with only one carryover from rag_ai: the Zammad webhook handler, lightly refactored to fit the new structure. Knowing when to refactor versus when to rebuild is a senior engineering decision that goes wrong in both directions for most people. The honest reasoning here was that the requirements had changed enough (no chat, internal-notes-only, single response shape, smaller container footprint), the AI tooling had advanced enough to make a from-scratch build cheaper than a refactor, and a parallel rebuild let the team A/B-test the old system against the new before cutting over.
It was AI-built from a specification. Most of the Little RAG code was written by Anthropic Claude (Sonnet/Opus, late 2025) from a system specification the author wrote. The week of validation that followed is the part worth claiming: rag_ai and Little RAG ran in parallel, each receiving the same test queries, with outputs collected into a shared Google Drive folder accessible to the Protectli team. Rubric-graded A/B comparisons (with Claude as the grader against a reference rubric) showed Little RAG outperforming rag_ai's OpenAI-backed responses on the majority of test queries. The migration to Anthropic was an outcome of the A/B test, not a preset goal.
That validation methodology - reference-rubric-based A/B testing against a known-good production system, with results shared transparently with the team - is the testing discipline that did exist on this project. It was not formal pytest. It was the right harness for the question being asked: is the new system actually better than the system it would replace?
Little RAG is not in this repository. It may be released separately. If it is, it will get its own README and its own portfolio review.
Tech stack
| Layer | Choice |
|---|---|
| API | Flask (3 replicas behind nginx), Gunicorn |
| Frontend | Streamlit (chat interface, CRUD UI, dashboard) |
| Vector DB | Qdrant (HNSW, cosine, 200-300 curated large chunks in final production form) |
| Sessions | Redis (per-session timeout, race-condition guards) |
| Context manager | ACWTM (Flask, tiktoken, recursive summarization) |
| LLM | OpenAI GPT-4o |
| Embeddings | text-embedding-3-small (with text-embedding-3-large available via the CRUD UI's bulk re-embed tool) |
| Ticketing | Zammad (via WordPress webhook proxy, not in this repo) |
| Orchestration | Docker Compose (18 containers in production, single-script VM deploy) |
| Observability | Prometheus + Grafana + Elasticsearch + Fluentd |
| Language | Python 3.9+ |
The OpenAI client wrapper (services/backend/openai_client.py) was written with templates passed in as parameters rather than baked in, which means a swap to a different LLM provider would have been a sibling class implementing the same method signatures rather than a refactor of every consumer. The class was named OpenAIClient and the abstraction was not exercised in this codebase. Little RAG, when it migrated to Anthropic Claude, used different code entirely.
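The "sibling class with the same method signatures" idea can be illustrated like this. Everything here is hypothetical: ChatClient, complete, and both stub classes are invented for the sketch, the stubs return a formatted string where a real class would call its provider's SDK, and none of it is taken from services/backend/openai_client.py.

```python
from typing import Protocol


class ChatClient(Protocol):
    """Hypothetical shared interface that every provider class would satisfy."""

    def complete(self, template: str, context: str, query: str,
                 temperature: float) -> str: ...


def render(template: str, context: str, query: str) -> str:
    # The template is passed in as a parameter, never baked into the client.
    return template.format(context=context, query=query)


class StubOpenAIClient:
    def complete(self, template: str, context: str, query: str,
                 temperature: float) -> str:
        prompt = render(template, context, query)
        return f"[openai t={temperature}] {prompt}"     # real SDK call goes here


class StubAnthropicClient:
    def complete(self, template: str, context: str, query: str,
                 temperature: float) -> str:
        prompt = render(template, context, query)
        return f"[anthropic t={temperature}] {prompt}"  # real SDK call goes here


def answer(client: ChatClient, template: str, context: str, query: str,
           temperature: float = 0.3) -> str:
    """A consumer that depends only on the interface, not on a provider."""
    return client.complete(template, context, query, temperature)
```

Swapping providers then means passing a different object to answer; no consumer code changes.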
Repository layout
RAG_AI/
├── README.md ← you are here
├── LICENSE ← BSD-2-Clause
├── .env.example ← copy to .env, fill in values
├── docker-compose.yml ← the full 18-container stack
├── Docker_Commands.md ← operational cheatsheet
│
├── config/ ← service configs (nginx, prometheus, redis, spam patterns)
├── docker/ ← Dockerfiles for every service
│
├── docs/
│ ├── FILE_TREE.md ← the full original tree (for reference)
│ ├── THE_PLAN_V2.md ← original planning doc
│ ├── PCSIR_V2.md ← spec
│ ├── flask_notes.md ← Flask 3.1 migration notes
│ ├── corpus_schema/ ← how the Qdrant corpus was structured & tagged
│ ├── test_fixtures/ ← synthetic test stories for retrieval testing
│ └── screenshots/ ← dashboard and Grafana screenshots
│
├── services/
│ ├── backend/ ← Flask API, query templates, routes, vector store
│ ├── frontend/ ← Streamlit chat + CRUD UI
│ ├── acwtm/ ← Adaptive Context Window Token Manager
│ ├── document_processor/ ← Embedding pipeline preserved; UI deprecated in favor of CRUD UI
│ ├── query_processor/ ← Query-to-vector service
│ ├── redis/ ← Healthcheck and tests
│ ├── collectors/ ← Custom metric collector (Prometheus)
│ └── generators/ ← Error, load, and metric generators (used during early development)
│
└── tests/ ← Zammad webhook integration test
Will this run out of the box?
No. And that is deliberate. This repo is a reference implementation and case study. What is missing for a live deployment:
- The Qdrant corpus. The production system held Protectli-specific technical and product documentation. That data belongs to Protectli. Bring your own corpus.
- The WordPress webhook proxy. Protectli's integration layer between Zammad and this system was a custom WordPress plugin, not part of this codebase. You would need to wire your own webhook source.
- API keys. Provide your own in .env (see .env.example).
- The full observability stack is included in the compose file. You can disable Prometheus/Grafana/Elasticsearch/Fluentd if you do not need them.
If you want to spin up the components and see them talk to each other:
cp .env.example .env
# Edit .env with your OPENAI_API_KEY
docker compose up -d backend qdrant redis frontend
Visit the frontend at http://localhost:8501 and the CRUD UI at http://localhost:8503.
For the full production-shape deployment (all 18 containers including observability), the single-script install path is documented in Docker_Commands.md.
The origin story
V1 was built outside Docker Compose and hit every wall imaginable. Scrapped.
V2 ran on a personal dedicated server, on the host. It broke the nginx configuration for the services already on that box. Regretted. Scrapped. About a month gone.
V3 is what is here. Started as the CLI-only proof of concept now preserved at github.com/bestcoast127/Protectli-DRAGON and demoed to Protectli leadership in late 2024 as proof that the API could be productized internally. They greenlit a full build. The Streamlit chat interface, ACWTM, observability stack, multi-replica scaling, and finally the CRUD UI followed over the next several months. The bulk of the system was finished by the end of January 2025; the CRUD UI was added mid-2025.
The system ran in production for the rest of 2025. It was never refactored. Then Little RAG was built from scratch alongside it, validated against it with rubric-graded A/B testing, and deployed in its place.
The whole thing was built when LLM assistants had 4K context windows, no project memory, no web search, and could not see entire files. Every function was passed back and forth in pieces. Three accounts in rotation, every one hitting limits twice a day. For months. Including a 10-day refactor disaster where Claude Sonnet 3.5 confidently said it could move the codebase from Flask to FastAPI, the author trusted it, the rebuild did not work, and a filesystem snapshot from the day before the refactor started recovered the work.
The failure modes encountered during that build are mostly invisible to anyone starting RAG work today. That is not a complaint. It is the credential.
License
BSD-2-Clause. Use it, fork it, ship it, sell what you build on top of it. Just keep the copyright notice.
Author
Built by Skip. Founder of OSS Open Source Security - pre-configured OPNsense security appliances on Protectli hardware. "Don't trust us. Verify us."
If you are hiring for Forward Deployed Engineer or Founding Engineer roles and you read this far, you already know what you are going to find when you look at the rest of the code.

