Cloud NAC performance test: 10,000 RADIUS auths/min on production hardware

2026-05-13

We ramped a single Arbiter tenant from 1,000 to 10,000 RADIUS authentications per minute on the production VM pair, then sat at 2,000/min for four hours. Everything held. Arbiter's target customers are SMEs and the MSPs supporting them: a 2,000-endpoint tenant's worst minute lands around 500/min, so 10,000/min is a deliberate overshoot of roughly 20x. Here's what we measured.

Two tests this week on the production VM pair. First we ramped a single tenant from 1,000 to 10,000 RADIUS authentications per minute. Then we sat at 2,000/min for four hours to see if anything drifted. Two tenants ran concurrently throughout, both on real per-tenant PKI, both seeing the traffic mix a customer generates: EAP-TLS, MAC-Auth-Bypass, accounting on every permit, reject paths for bad certs and unknown MACs.

Arbiter is built for SMEs and the MSPs that support them: tenants between 50 and 2,000 endpoints whose busiest minute of the year tends to land around 300 to 500 authentications. We tested at 20x the top of that range on purpose: a headroom claim is only useful when there's enough gap that bad days disappear into the noise.

10,000 auths/min sustained, 100% policy accuracy. p99 latency 1,394 ms at the ceiling (well under the 2-second pass threshold). Zero queue growth, flat FreeRADIUS memory, no leaks. Two tenants concurrent on the same hardware with no resource contention.

What we're not claiming

Three caveats up front, before the numbers, because credibility on a performance post is built on what you don't claim:

These numbers are per tenant. We tested two tenants concurrently and the architecture isolates them by design, but we have not yet run a meaningful fan-out. The next test is ten tenants at 5,000 authentications per minute each for thirty minutes (ten times the worst minute of any one SME tenant, 50,000/min aggregate) to see how the shared layers behave when many tenants push at once.
The mix here was weighted towards MAC-Auth-Bypass. EAP-TLS is heavier per packet because of the handshake cost, and customers with mostly-wireless 802.1X traffic will sit at a different point on the curve. We're going to measure that separately and publish it the same way.
These results come from synthetic traffic. Real switches, real supplicants and real wireless controllers introduce retries, fragmentation and timing patterns that no script perfectly reproduces. The numbers below are the upper bound under clean conditions, not a substitute for testing in your own environment.

Test one: ramp to the ceiling

Five 30-second tiers at constant rate: 1,000, 2,000, 5,000, 7,500 and 10,000 auths/min. Pass criteria: at least 95% verdict accuracy per class, p99 latency under 2 seconds, no internal queue overflow. All five tiers passed with 100% accuracy: every permit permitted, every reject rejected, every Acct-Start acknowledged. Zero unexpected accepts on the bad-cert path.
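The three pass criteria can be written down as a small gate function. This is a minimal sketch, not the harness's actual code: `TierResult`, its field names and `tier_passes` are hypothetical, and the per-class counts in the example are illustrative. Only the thresholds (95% per class, p99 under 2 seconds, no queue overflow) and the 1,394 ms figure come from this post.

```python
from dataclasses import dataclass

# Thresholds from the pass criteria described in the text.
MIN_ACCURACY = 0.95   # minimum share of correct verdicts, per class
MAX_P99_MS = 2000.0   # p99 latency must stay under this


@dataclass
class TierResult:
    """Hypothetical per-tier summary; field names are illustrative."""
    rate_per_min: int
    correct: dict[str, int]   # verdict class -> correct verdicts
    total: dict[str, int]     # verdict class -> requests sent
    p99_ms: float
    queue_overflows: int


def tier_passes(result: TierResult) -> bool:
    """Return True only if every criterion holds for this tier."""
    for verdict_class, sent in result.total.items():
        if sent == 0:
            continue  # nothing sent for this class, nothing to judge
        if result.correct.get(verdict_class, 0) / sent < MIN_ACCURACY:
            return False
    return result.p99_ms < MAX_P99_MS and result.queue_overflows == 0


# Illustrative counts only; the p99 value is the one reported for the top tier.
top_tier = TierResult(
    rate_per_min=10_000,
    correct={"permit": 4000, "reject": 1000},
    total={"permit": 4000, "reject": 1000},
    p99_ms=1394.0,
    queue_overflows=0,
)
print(tier_passes(top_tier))  # True
```

Checking accuracy per class rather than overall matters here: a large permit class could otherwise mask a broken reject path.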

Harness specifics

Traffic mix: 80% MAC-Auth-Bypass, 20% EAP-TLS. TLS session resumption disabled so every handshake paid full cost.
Test client outside the VPC, talking to the public arbiter-radius IP over UDP/1812. Eight simulated NAS-Identifier source IPs so the per-source rate limit (200 req/60s) didn't dominate.
Server side: the production pair on Hetzner. arbiter-radius on CPX21 (2 vCPU / 4 GB), arbiter-core on CPX31 (4 vCPU / 8 GB). Production sizing, not a tuned test cluster.
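To see why a single test source runs into a per-source cap, here is a minimal sliding-window limiter sketch. It is an assumption-laden model, not Arbiter's implementation: the class and function names are hypothetical, and it counts every request towards the cap, whereas this post does not say which requests the real limiter counts. Only the 200 requests per 60 seconds figure is taken from the text.

```python
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Hypothetical per-source limiter: at most `limit` requests per window."""

    def __init__(self, limit=200, window_s=60.0):
        self.limit = limit
        self.window_s = window_s
        self._seen = defaultdict(deque)  # source -> timestamps of admitted requests

    def allow(self, source, now):
        timestamps = self._seen[source]
        # Forget admissions that have aged out of the window.
        while timestamps and now - timestamps[0] >= self.window_s:
            timestamps.popleft()
        if len(timestamps) >= self.limit:
            return False
        timestamps.append(now)
        return True


def admitted(n_sources, requests, duration_s=60.0):
    """Send `requests` evenly over `duration_s`, round-robin across sources."""
    limiter = SlidingWindowLimiter()
    ok = 0
    for i in range(requests):
        now = i * duration_s / requests
        if limiter.allow(f"nas-{i % n_sources}", now):
            ok += 1
    return ok


# 1,000 requests in one minute, the rate of the bottom tier.
print(admitted(1, 1000))  # 200: a single source is capped by the limiter
print(admitted(8, 1000))  # 1000: 125 per source, all under the cap
```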

Sustained authentication rate by tier. Target shown as dashed line, actual sustained rate as the bar. Every tier passed.

FreeRADIUS deliberately sleeps for one second on every Access-Reject as a credential-stuffing brake. The ~1.1-second floor in the chart below is that pause, not server processing time. Subtract a second to read real per-request work, which is in the low-millisecond range.
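The subtraction can be made explicit when post-processing latency samples. A minimal sketch, assuming each sample carries its verdict so the one-second pause is removed from rejects only; the function names and the five sample values are made up for illustration and are not measurements from this run.

```python
import math

REJECT_DELAY_MS = 1000.0  # the deliberate Access-Reject pause described above


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def work_ms(latency_ms, verdict):
    """Latency with the deliberate reject pause removed."""
    if verdict == "reject":
        return latency_ms - REJECT_DELAY_MS
    return latency_ms


# Made-up samples: (client-observed latency in ms, verdict).
samples = [
    (3.0, "accept"),
    (5.0, "accept"),
    (1004.0, "reject"),
    (4.0, "accept"),
    (1006.0, "reject"),
]

raw_p99 = percentile([ms for ms, _ in samples], 99)
adjusted_p99 = percentile([work_ms(ms, verdict) for ms, verdict in samples], 99)
print(raw_p99, adjusted_p99)  # 1006.0 6.0
```

As soon as rejects make up more than 1% of the samples, the raw p99 is a reject, which is why the charted floor sits just above one second.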

p99 latency climbed from 1.06s at the bottom tier to 1.39s at the top, well under the 2-second threshold. Server-side, every internal queue stayed under 1% of capacity through the whole ramp and FreeRADIUS resident memory finished where it started.

p99 latency by tier. The 2,000 ms ceiling is our pass threshold; the line that matters is well under it.

What this looks like at SME scale

The chart below puts realistic SME tenants next to the tested ceiling. Loads are peak-burst (the boot-storm Monday or post-policy-push reauth wave), not steady-state. Average daily load sits in fractions of a percent of that ceiling.

Worst-minute (boot-storm / mass re-auth) load for realistic SME tenants against the 10,000 auth/minute tested ceiling. Not average daily load: the busiest 60 seconds of the year.

A 2,000-endpoint SME at peak burst sits at about 5% of capacity. The 500-endpoint case is under 2%. That's per tenant, and the two concurrent tenants in this run shared the FreeRADIUS process, the database pool and the accounting writer without competing for any of them.
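The headroom arithmetic, spelled out as a short sketch. The function names are ours; the inputs are the 10,000/min tested ceiling and the roughly 500/min worst minute quoted earlier for a 2,000-endpoint tenant.

```python
TESTED_CEILING_PER_MIN = 10_000


def share_of_ceiling(peak_per_min, ceiling=TESTED_CEILING_PER_MIN):
    """Peak load as a percentage of the tested ceiling."""
    return 100.0 * peak_per_min / ceiling


def headroom_factor(peak_per_min, ceiling=TESTED_CEILING_PER_MIN):
    """How many times the peak fits under the tested ceiling."""
    return ceiling / peak_per_min


# Worst minute for a 2,000-endpoint tenant, as quoted in this post.
print(share_of_ceiling(500))  # 5.0
print(headroom_factor(500))   # 20.0
```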

Test two: four hours at the wheel

Ramping is one thing. Sitting at a number for hours catches the bugs the ramp can't: memory leaks, slow queue drift, caches that grow and don't evict. We set the rate to 2,000 auths/min (20% of the proven ceiling), cleared the tables and let it run for four hours: twenty-four 10-minute windows, same mix as the ramp, no warm restart.

Result: 465,367 authentications and 185,369 accounting writes. Every window held 100% policy accuracy. p99 latency stayed between 1,098 and 1,124 ms across the whole run, a drift ratio of 0.98x first-to-last (the last window was fractionally faster than the first). FreeRADIUS memory finished flat.
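For readers who want to apply the same drift check to their own soak runs, a minimal sketch follows. It assumes the ratio is the last window's p99 divided by the first window's, which matches the note that a value below 1 means the last window was faster. The function name and the sample values are illustrative, not the measured per-window figures.

```python
def drift_ratio(window_p99_ms):
    """Last-window p99 divided by first-window p99.

    Below 1.0 means the run ended faster than it started;
    a leak or slow queue build-up would push it above 1.0.
    """
    if not window_p99_ms:
        raise ValueError("need at least one window")
    return window_p99_ms[-1] / window_p99_ms[0]


# Illustrative values only.
print(drift_ratio([1000.0, 1010.0, 990.0]))  # 0.99
```

A first-to-last ratio only looks at two points, so it is worth reading alongside the full per-window chart rather than in place of it.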

Two windows about twenty minutes in did trip the harness's pass criteria with 52% and 21% timeout rates. We dug in: FreeRADIUS received only 11,000 and 14,000 requests in those windows instead of 20,000, and the journal had no entries at all for that period. The packets never arrived. The cause was packet loss on the test client's own domestic uplink, not anything on the Arbiter side. The detail that matters is what happened next: when traffic resumed, p99 was 1,111 ms, identical to the first window. No backlog, no recovery curve, no drift. We kept the FAIL in the record because the more useful result is what the server did when its load dropped out and came back: nothing.
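The totals can be reconciled with a few lines of arithmetic. This sketch assumes the harness offered a constant 2,000/min in every window and uses the rounded per-window counts quoted above, so the result is an estimate.

```python
RATE_PER_MIN = 2_000
WINDOW_MIN = 10
WINDOWS = 24

expected_per_window = RATE_PER_MIN * WINDOW_MIN   # 20,000
expected_total = expected_per_window * WINDOWS    # 480,000

# Rounded counts FreeRADIUS received in the two affected windows.
received_in_bad_windows = [11_000, 14_000]
shortfall = sum(expected_per_window - r for r in received_in_bad_windows)

estimated_total = expected_total - shortfall
REPORTED_TOTAL = 465_367  # authentications reported for the soak

print(shortfall)                         # 15000
print(estimated_total)                   # 465000
print(REPORTED_TOTAL - estimated_total)  # 367
```

The estimate lands within a few hundred requests of the reported total, which is what you would expect if the two lossy windows account for essentially all of the missing traffic.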

p99 latency across twenty-four 10-minute windows of the four-hour soak. The 2,000 ms pass-threshold sits near the top of the chart; the actual line barely moves. The shaded band marks the two windows where the test-client internet uplink dropped packets (see text): server-side latency did not budge.

What the database did

465,367 rows into auth_log, 185,369 into accounting_log. Effectively zero dead tuples (nine across 650,000 inserts). Both tables are append-only on the hot path so autovacuum stayed idle.
Connection pool peaked at 1-2 active connections. All writes go through batched async queues, so the auth path never waits on Postgres.
Disk grew ~240 MB across the four hours (~1 MB/min, ~1.4 GB/day at this load), well inside the default per-tenant retention budget.
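The growth figures follow directly from the soak numbers. A quick sketch of the arithmetic, using decimal units (1 GB = 1,000 MB) and the approximate 240 MB figure, so the outputs are approximate too:

```python
GROWTH_MB = 240        # approximate disk growth over the soak
RUN_MINUTES = 4 * 60   # four-hour run

mb_per_min = GROWTH_MB / RUN_MINUTES
gb_per_day = mb_per_min * 60 * 24 / 1000

rows_inserted = 465_367 + 185_369  # auth_log + accounting_log

print(mb_per_min)      # 1.0
print(gb_per_day)      # 1.44
print(rows_inserted)   # 650736
```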

The engineering that made it possible

Four small changes landed in the two days before the test. Each one shifted the curve more than its diff size would suggest.

Asynchronous accounting writes. Acct-Start used to do five synchronous DB ops on the FreeRADIUS request thread. Moved them into a 5-second batched queue. Soak-test peak depth 236 events, median batch 163, p99 flush 44 ms.
Profile-cache invalidation on enrichment. New MACs used to see a stale empty profile for 30 seconds while vendor enrichment ran. Now invalidated the moment enrichment lands.
RFC 3579 EAP-Failure compliance. Post-Auth REJECT was sending Access-Reject with the EAP-Success message still attached. Strict supplicants are within their rights to act on it. Fixed.
Per-tenant rate-limit fan-out. The test client was hitting the 200/60s/IP brute-force cap on itself. Spreading across simulated NAS-Identifiers moved the ceiling from 'rate limiter blocked us' to 'server-side capacity', which is what we wanted to measure.
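The first change, taking accounting writes off the request path, follows a common batching pattern. Below is a minimal sketch of that pattern using asyncio. This post does not describe how Arbiter implements its queue, so the class name, method names and the `fake_insert` stand-in are all hypothetical; only the 5-second interval comes from the text.

```python
import asyncio


class BatchedWriter:
    """Collects events in memory and writes them in batches."""

    def __init__(self, flush, interval_s=5.0):
        self._flush = flush            # async callable that takes a list of events
        self._interval_s = interval_s
        self._pending = []

    def enqueue(self, event):
        # Called on the request path: an in-memory append, no database wait.
        self._pending.append(event)

    async def flush_once(self):
        """Write everything queued so far; return the batch size."""
        if not self._pending:
            return 0
        batch, self._pending = self._pending, []
        await self._flush(batch)
        return len(batch)

    async def run(self):
        """Background loop: flush on a fixed interval until cancelled."""
        while True:
            await asyncio.sleep(self._interval_s)
            await self.flush_once()


async def demo():
    written = []

    async def fake_insert(batch):
        # Stand-in for a single multi-row INSERT.
        written.append(batch)

    writer = BatchedWriter(fake_insert)
    for session in range(3):
        writer.enqueue({"type": "Acct-Start", "session": session})
    first = await writer.flush_once()
    second = await writer.flush_once()
    return first, second, len(written)


print(asyncio.run(demo()))  # (3, 0, 1)
```

The trade-off is durability: events that are queued but not yet flushed are lost if the process dies, so the flush interval bounds how much accounting data is at risk.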

How to reproduce

The harness is a multi-process Python driver using pyrad for plain RADIUS and eapol_test for EAP-TLS, fanning out across simulated NAS-Identifier values against any Arbiter tenant. We'll share the harness, the cert helpers and a short README with serious evaluators on request: email support@arbiter.ie.

What's next

The single-tenant ceiling is on the record. The interesting unanswered question is the shared layers: database pool, accounting writer, audit pipeline. So the next test is a fan-out: ten concurrent tenants at 5,000 auths/min each for thirty minutes (50,000/min aggregate, well past any plausible MSP-managed load), each on its own per-tenant virtual server, PKI and policy chain. Same pass criteria. We'll pair it with an EAP-TLS-heavy mix on at least one tenant, since the crypto path deserves its own honest number. Results in the next devlog.

See it for yourself

Spin up a free trial tenant on app.arbiter.ie and point any RADIUS test client at it. Want our reference stress-test script (the same one that produced the numbers above)? Email support@arbiter.ie and we will send you the code, the cert helpers and a short README. No smoke and mirrors.
