opensourcesecurity/oss-vpn-test

Continuous head-to-head performance comparison of SafeNet vs. Proton and Mullvad. Methodology behind safenetvpn.us/proof.

Python 98.3%
Dockerfile 1.7%

skip 25681f860c Update README.md		2026-05-24 08:31:23 -07:00
configs	Initial commit: OSS VPN Test methodology and harness	2026-05-09 11:33:48 -06:00
docs	Initial commit: OSS VPN Test methodology and harness	2026-05-09 11:33:48 -06:00
harness	Initial commit: OSS VPN Test methodology and harness	2026-05-09 11:33:48 -06:00
.gitignore	Initial commit: OSS VPN Test methodology and harness	2026-05-09 11:33:48 -06:00
compose.yml	Initial commit: OSS VPN Test methodology and harness	2026-05-09 11:33:48 -06:00
Dockerfile	Initial commit: OSS VPN Test methodology and harness	2026-05-09 11:33:48 -06:00
LICENSE	Initial commit: OSS VPN Test methodology and harness	2026-05-09 11:33:48 -06:00
README.md	Update README.md	2026-05-24 08:31:23 -07:00
snapshot_emit.py	Initial commit: OSS VPN Test methodology and harness	2026-05-09 11:33:48 -06:00

README.md

OSS VPN Test

Continuous head-to-head performance comparison of GlassBox VPN against two well-respected privacy VPNs (Proton and Mullvad), plus a no-tunnel baseline. The numbers shown on https://glassboxvpn.com/proof come from this repository.

This repo exists so a skeptical reviewer can read the test harness end to end and decide for themselves whether the comparison is fair. Everything here - every measurement, every aggregation, every filter - is what the production system runs. No marketing layer.

What the test does

Every 20 minutes, a containerized harness running on a dedicated server in Dallas brings up a WireGuard tunnel, runs a battery of measurements through it, tears it down, and moves to the next configuration. One full cycle covers:

A no-tunnel baseline measurement (the floor - bare network from the same client)
9 tunneled measurements: three providers (GlassBox, Proton, Mullvad) × three regions (LA, Chicago, Virginia)

Each of these runs against the same 10 real-world target sites (Amazon, Apple, BBC, Chase, Craigslist, GitHub, Microsoft, Stack Overflow, Wikipedia, Yahoo). Every request is cache-busted. Same client, same code path, same targets - only the tunnel changes.
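As a rough illustration of the cycle described above, the sketch below enumerates one baseline plus nine provider/region pairs and shows one common way to cache-bust a URL. The identifiers (`PROVIDERS`, `REGIONS`, `cycle_plan`, `cache_busted`, the `_cb` parameter) are hypothetical and are not taken from the harness; the ordering of the tunneled pairs is also an assumption.

```python
import itertools
import uuid
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical identifiers; the real names live in configs/providers.yaml.
PROVIDERS = ["glassbox", "proton", "mullvad"]
REGIONS = ["la", "chicago", "virginia"]


def cycle_plan():
    """Return the configurations for one cycle: baseline first, then 3 x 3 tunnels."""
    plan = [("baseline", None)]
    plan += list(itertools.product(PROVIDERS, REGIONS))
    return plan


def cache_busted(url, token=None):
    """Append a unique query parameter so intermediate caches cannot reuse a response."""
    token = token or uuid.uuid4().hex
    parts = urlsplit(url)
    query = parse_qsl(parts.query, keep_blank_values=True)
    query.append(("_cb", token))  # "_cb" is an assumed parameter name
    return urlunsplit(parts._replace(query=urlencode(query)))


if __name__ == "__main__":
    for provider, region in cycle_plan():
        print(provider, region)
    print(cache_busted("https://example.com/"))
```

A random token per request is only one way to defeat caching; the harness may use a different mechanism.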

Every result lands in a SQLite database on the test server. Every 15 minutes, a separate process reads that database read-only and writes a JSON snapshot. The public dashboard reads that JSON. The path from raw measurement to public number is short, mechanical, and visible in this repo.

Why this design

Real protocols, not synthetic throughput

No iperf, no speedtest.net. Those numbers don't reflect what users experience and they're easy to game with QoS rules that prioritize speedtest endpoints. The harness fetches actual websites the same way a browser would. The measurement that matters most - time to first byte - is what users feel as "this page is responsive."

Head-to-head with named providers

Comparing GlassBox VPN only to itself proves nothing. Proton and Mullvad are two of the most respected privacy VPNs available; running our test through their tunnels using their official WireGuard configs anchors the GlassBox VPN numbers in something a reviewer can verify. If our test is rigged, the rig has to fool Proton and Mullvad's servers identically, every cycle, for months - that's not a rigging strategy, that's a research project.

Baseline as the floor

A no-tunnel measurement runs first in every cycle, against the same target list, from the same client. This is the floor: the fastest the network can possibly go from this client to these targets. Every tunneled provider's numbers are read in relation to it. GlassBox VPN will be slower than the baseline - every tunnel costs something. The honest question isn't "is GlassBox VPN slower than no VPN?" (yes, by definition); it's "by how much, compared to alternatives?"
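One way to express "by how much" is the tunnel's overhead as a percentage of the baseline. The function below is a minimal sketch of that arithmetic, not the aggregation in `snapshot_emit.py`; the name `overhead_pct` and the sample values are invented for illustration.

```python
def overhead_pct(tunnel_ms, baseline_ms):
    """Percentage by which a tunneled timing exceeds the no-tunnel baseline."""
    if baseline_ms <= 0:
        raise ValueError("baseline must be a positive duration")
    return (tunnel_ms - baseline_ms) / baseline_ms * 100.0


if __name__ == "__main__":
    # Made-up numbers: a 120 ms baseline TTFB and a 150 ms tunneled TTFB.
    print(overhead_pct(150.0, 120.0))  # 25.0
```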

Containerized for isolation

The harness runs inside a Docker container with cap_add: NET_ADMIN. WireGuard tunnels come up inside the container, not on the host. The host's networking is untouched between cycles. This means the test is reproducible: anyone with Docker, three sets of VPN credentials, and this repo can stand up an identical rig and verify our numbers from their own client.

Rotating configs (where it matters)

GlassBox VPN has one peer per region - that's the product, that's what's being tested. Proton and Mullvad ship multiple servers per region; the harness round-robins through three configs per region for each so we're not exclusively measuring the performance of one Proton/Mullvad server. Configs and rotation policy are declared in configs/providers.yaml.
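Round-robin rotation can be as simple as indexing the config list by cycle number. The sketch below assumes that approach; `pick_config` and the file names are hypothetical, and the actual policy is whatever `configs/providers.yaml` declares.

```python
def pick_config(configs, cycle_number):
    """Choose a config by rotating through the list once per cycle."""
    if not configs:
        raise ValueError("at least one config is required")
    return configs[cycle_number % len(configs)]


if __name__ == "__main__":
    rotating = ["server-a.conf", "server-b.conf", "server-c.conf"]  # placeholder names
    single = ["only-peer.conf"]  # a provider with one peer per region
    for cycle in range(4):
        print(cycle, pick_config(rotating, cycle), pick_config(single, cycle))
```

With a single-element list the same function always returns the one peer, so the one-peer and three-config cases can share a code path.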

Test client location: disclosed

The client lives at a Dallas Psychz Networks colocation facility (TEST_CLIENT_LOCATION=dallas-psychz in compose.yml). Geographic distance from Dallas to each tested region affects all three providers identically - which is the point. A test client in Chicago would advantage all three Chicago tunnels equally; the relative comparison would be the same.

What's measured

Per cycle, per (region, provider):

Network path: RTT to Cloudflare, jitter, packet loss, MTU sanity, IPv6 reachability through tunnel
DNS: resolution time of google.com against the provider's tunnel-side resolver
Per target: TCP connect, TLS handshake, TTFB, HTTP status, small-asset (≤256 KB) transfer time
Cycle metadata: observed exit IP, exit city, ASN, WireGuard handshake time

Full reference: docs/METRICS.md.
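The per-target timings map naturally onto curl's `--write-out` variables, which are cumulative from the start of the request. The sketch below builds a curl command and splits its output into phases. It is an assumption about how such a measurement could be taken, not the harness's code: `WRITE_OUT`, `curl_command`, and `split_phases` are hypothetical names, and TTFB is taken here as curl's cumulative `time_starttransfer`, which may differ from the definition in docs/METRICS.md.

```python
import json

# curl substitutes each %{...} variable; times are seconds since the request began.
# http_code is quoted because curl prints 000 when no response arrived,
# which would not be valid JSON as a bare number.
WRITE_OUT = (
    '{"namelookup": %{time_namelookup}, "connect": %{time_connect}, '
    '"appconnect": %{time_appconnect}, "starttransfer": %{time_starttransfer}, '
    '"http_code": "%{http_code}"}'
)


def curl_command(url, timeout_s=15):
    """Build an argv list that discards the body and prints only the timings."""
    return ["curl", "--silent", "--output", "/dev/null",
            "--max-time", str(timeout_s), "--write-out", WRITE_OUT, url]


def split_phases(write_out_text):
    """Convert curl's cumulative timings into per-phase durations in milliseconds."""
    raw = json.loads(write_out_text)

    def to_ms(seconds):
        return round(seconds * 1000.0, 3)

    return {
        "dns_ms": to_ms(raw["namelookup"]),
        "tcp_connect_ms": to_ms(raw["connect"] - raw["namelookup"]),
        "tls_handshake_ms": to_ms(raw["appconnect"] - raw["connect"]),
        "ttfb_ms": to_ms(raw["starttransfer"]),
        "http_status": int(raw["http_code"]),
    }


if __name__ == "__main__":
    sample = ('{"namelookup": 0.010, "connect": 0.030, "appconnect": 0.080, '
              '"starttransfer": 0.150, "http_code": "200"}')
    print(curl_command("https://example.com/"))
    print(split_phases(sample))
```

The command can be run with `subprocess.run(curl_command(url), capture_output=True, text=True)` and its stdout passed to `split_phases`.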

The headline number on the dashboard is median TTFB - the metric that most directly tracks "does this site feel responsive." Medians (not means) so one timeout doesn't drag a provider through the mud; the underlying data is in the database for anyone wanting tail analysis.
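A small example shows why a single timeout matters so much to the mean and so little to the median. The sample values below are invented for illustration and do not come from the dataset.

```python
from statistics import mean, median

# Made-up TTFB samples in milliseconds; the last one simulates a 30 s timeout.
ttfb_ms = [180, 190, 185, 200, 195, 30000]

if __name__ == "__main__":
    print("median:", median(ttfb_ms))        # 192.5
    print("mean:", round(mean(ttfb_ms), 1))  # 5158.3
```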

How a measurement becomes a public number

WireGuard tunnel up
  → curl through tunnel against target
  → measurement row inserted in SQLite (storage.py)
  → ...repeat for every metric, every target...
  → tunnel down

Cron @ */15 min:
  snapshot_emit.py
  → reads SQLite read-only
  → applies aggregation filters (Y8 fix, window floor)
  → computes 24h medians, hourly buckets, comparison percentages
  → writes dashboard_snapshot.json atomically

LA Ops server:
  rsync pulls dashboard_snapshot.json
  → glassboxvpn.com/proof renders from it

The public dashboard never reaches into the database. It reads the JSON snapshot only. The shape of that JSON, and every aggregation that produces it, is documented in docs/SCHEMA.md.
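The read-only open and the atomic write are both standard techniques, sketched below with the Python standard library. This is not the code in `snapshot_emit.py`: the `measurements` table, its columns, the function names, and the output shape are assumptions made for the example (the real schema is in docs/SCHEMA.md).

```python
import json
import os
import sqlite3
import tempfile
from pathlib import Path
from statistics import median


def read_ttfb(db_path):
    """Open the database read-only and group TTFB samples by provider."""
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"  # mode=ro forbids writes
    conn = sqlite3.connect(uri, uri=True)
    try:
        rows = conn.execute(
            "SELECT provider, ttfb_ms FROM measurements WHERE ttfb_ms IS NOT NULL"
        ).fetchall()
    finally:
        conn.close()
    grouped = {}
    for provider, ttfb in rows:
        grouped.setdefault(provider, []).append(ttfb)
    return grouped


def write_snapshot(grouped, out_path):
    """Write the snapshot to a temp file, then rename it over the target."""
    snapshot = {"median_ttfb_ms": {p: median(v) for p, v in sorted(grouped.items())}}
    out_dir = os.path.dirname(os.path.abspath(out_path))
    fd, tmp_path = tempfile.mkstemp(dir=out_dir, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as handle:
            json.dump(snapshot, handle)
            handle.flush()
            os.fsync(handle.fileno())
        os.replace(tmp_path, out_path)  # atomic when both paths share a filesystem
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    return snapshot
```

Because the rename replaces the file in one step, a reader such as the rsync pull sees either the previous snapshot or the new one, never a half-written file.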

What's deliberately not measured, and why

Synthetic throughput (iperf, speedtest): doesn't model real traffic shape, and is easy for providers to game with QoS rules that prioritize speedtest endpoints
Bulk download: would risk hitting GlassBox VPN's 100 Mbps per-peer cap, which would pollute the comparison
Provider's own assets (proton.me, mullvad.net): structural advantage for that provider's tunnel
Sites hosted in the same datacenter clusters as one provider's POPs: same structural advantage
Sites that aggressively block datacenter IPs (Walmart, BestBuy historically): unstable signal

The exclusions are recorded in configs/targets.yaml with reasoning per entry.

Methodology iteration

The test has iterated. Targets have been added and removed; bugs have been found and fixed; aggregation rules have been adjusted. We don't paper over this - every change is recorded as a row in the dataset_notes table inside the database, and every row of dataset_notes is published in the JSON snapshot under the dataset_notes_active key. The public dashboard surfaces the audit log directly.

The two filters currently active in production are documented in docs/SCHEMA.md:

Y8 payload-floor fix (May 2026): excludes pre-fix rows for targets that returned redirect-terminal sub-1KB responses
Window floor (cycle 334): excludes data from before a target list reset so the 24h dashboard window stays comparable

Older data isn't deleted. It's preserved in the DB and excluded from rendering. Anyone running ad-hoc analysis can query it directly.
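Excluding rows at render time rather than deleting them can be modeled as a pure filter over the stored rows. The sketch below is one reading of the two filters, not the production logic: the row fields, the function name, the idea that the Y8 fix applies from a known cycle to a known set of targets, and the treatment of the floor cycle itself as included are all assumptions. docs/SCHEMA.md is the authoritative description.

```python
def rendered_rows(rows, window_floor_cycle, y8_fix_cycle, y8_targets):
    """Return the rows eligible for rendering; the input list is left untouched."""
    kept = []
    for row in rows:
        # Window floor: drop rows from before the target list reset.
        before_floor = row["cycle"] < window_floor_cycle
        # Y8: drop pre-fix rows, but only for the affected targets.
        pre_fix_affected = row["cycle"] < y8_fix_cycle and row["target"] in y8_targets
        if not (before_floor or pre_fix_affected):
            kept.append(row)
    return kept


if __name__ == "__main__":
    # Made-up rows; 334 is the window-floor cycle named above, 400 is invented.
    rows = [
        {"cycle": 300, "target": "example-a", "ttfb_ms": 210},
        {"cycle": 350, "target": "example-a", "ttfb_ms": 40},
        {"cycle": 350, "target": "example-b", "ttfb_ms": 190},
        {"cycle": 420, "target": "example-a", "ttfb_ms": 205},
    ]
    for row in rendered_rows(rows, 334, 400, {"example-a"}):
        print(row)
```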

Repository contents

| Path | What it is |
| --- | --- |
| `compose.yml` | Docker Compose definition for the harness container |
| `Dockerfile` | Ubuntu 24.04 + WireGuard tools + Python deps |
| `harness/` | The test runner - config loader, measurements, tunnel control, storage |
| `configs/providers.yaml` | Which providers, which regions, which configs, rotation policy |
| `configs/targets.yaml` | The 10 target sites and selection/exclusion criteria |
| `configs/{GlassBox VPN,proton,mullvad}/EXAMPLE.conf` | Redacted WireGuard config templates |
| `snapshot_emit.py` | DB → public JSON aggregator (runs every 15 min via cron) |
| `docs/METRICS.md` | Every metric, what it measures, gotchas |
| `docs/SCHEMA.md` | DB schema and JSON snapshot shape |

Real WireGuard configs live in configs/{GlassBox VPN,proton,mullvad}/ on the production server and are excluded from this repo - they contain the test client's private keys for each provider. The example configs show the structure and document what we strip from upstream-provided configs (DNS lines, IPv6 endpoints) and why.

License

BSD 2-Clause. See LICENSE.
