Breaking Arbiter, round two: ten tenants, four hours and the control plane that buckled

Round one (last week) was one tenant on the wheel: 10,000 auths/min, clean. Round two: ten tenants in parallel at 1,000 auths/min each, MAB-only, four hours straight. The auth path held perfectly. The control plane did not. Here is the honest account of what gave way, why and what we changed.

Round one of breaking Arbiter put one tenant on the wheel and pushed it from 1,000 to 10,000 RADIUS authentications per minute. Everything held. The question round one didn't answer was what happens to a cloud NAC service when many tenants push at once.

Round two: ten tenants, 1,000 auths/min each, MAB-only, four hours straight. 2.4 million authentications across the run. This is the first published multi-tenant cloud RADIUS soak test for Arbiter.

The auth path behaved exactly the way the single-tenant test suggested it would. Ten tenants sustained a combined 10,000 RADIUS authentications per minute for four hours, with no queue pressure and no auth-side drops. 802.1X reject paths and MAB accept paths were both clean across the run.

The failures appeared somewhere else entirely: the control plane.

With nearly 2.7 million auth_log rows on disk, two API endpoints slowed from ~50 ms to multi-second responses under realistic data scale. The API connection pool saturated, admin pages returned 500s and the status badge stayed pinned on Degraded long after the service had recovered.

Results at a glance

How we tested

Ten tenants ran in parallel, each against a separate per-tenant virtual server and PKI chain. Each generated 1,000 MAB authentications per minute for four hours straight. Total load: 10,000 auths/min aggregate, 2.4 million authentications overall.

Traffic originated from ten Fly.io machines spread across five European regions, two per region: Frankfurt, Amsterdam, Paris, London and Stockholm. The Arbiter server pair sits in Hetzner Falkenstein (Germany), so every authentication crossed a real WAN path rather than a same-VPC loopback. WAN latency variance was part of the test design.

Pass criteria stayed the same as round one: p99 under two seconds, no queue overflow, at least 95% verdict accuracy. The run was MAB-only to isolate the multi-tenant question from EAP-TLS handshake cost. Round three picks up the EAP-TLS-heavy fan-out.

Aggregate auth rate across all ten tenants. Twenty-four 10-minute windows, flat on target. The 10,000 auths/min mark sustained for four hours.

Per tenant the picture is the same: each of the ten tenants held within 1-2% of its 1,000/min target across every 10-minute window. FreeRADIUS memory finished where it started.

Internal queue depth across the soak. Auth-log writer queue peaked at 250 against a 50,000 ceiling. Accounting writer queue peaked at 110 against 10,000. Session policy cache held between 450 and 870 entries. The hot path never queued.

Why we are comfortable with 50 tenants per block

Arbiter is sized in blocks: one core VM, two listener VMs and a load balancer, sized for fifty tenants. The soak puts an honest number against that sizing. Ten tenants in parallel sustained 10,000 RADIUS authentications per minute for four hours on the production-grade CPX pair. The block is sized for five times the tenant count we tested in parallel here, but the per-tenant load it has to absorb is far smaller than the 1,000/min we drove each test tenant at.

Arbiter targets SMEs and the MSPs that look after them. Round one published the realistic peak-burst numbers: a 50-endpoint office tops out around 25 auths in a high-volume minute, a 500-endpoint SME around 150, a 2,000-endpoint estate around 500. Even a worst-case simultaneous burst across fifty SME tenants stays well below the 10,000/min soak load. Real boot storms are local to one customer, not synchronised across an entire block. The arithmetic gives us five to ten times headroom against the load a real block of fifty SMEs would produce in a high-volume minute.
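The headroom arithmetic can be sketched as a small calculation. The peak-burst figures and the 10,000/min soak load come from the text above; the tenant mix is a hypothetical example, not a measured customer distribution, and the function names are illustrative only.

```python
# Peak auths in a high-volume minute, as quoted from round one.
PEAK_AUTHS_PER_MIN = {"office_50": 25, "sme_500": 150, "estate_2000": 500}

SOAK_LOAD_PER_MIN = 10_000  # aggregate rate sustained in this soak


def block_peak(mix: dict[str, int]) -> int:
    """Worst case: every tenant in the block bursts in the same minute."""
    return sum(PEAK_AUTHS_PER_MIN[kind] * count for kind, count in mix.items())


def headroom(mix: dict[str, int]) -> float:
    """How many times larger the soak load is than the block's simultaneous peak."""
    return SOAK_LOAD_PER_MIN / block_peak(mix)


# Hypothetical block of fifty tenants: mostly small offices, a few larger SMEs.
example_mix = {"office_50": 45, "sme_500": 5}

print(block_peak(example_mix))          # 45*25 + 5*150 = 1875 auths/min
print(round(headroom(example_mix), 1))  # 10000 / 1875, about 5.3
```

Changing the mix changes the ratio, so the headroom figure is only as good as the assumed tenant sizes.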

This test was a code-design validation, not a throughput claim for marketing. The interesting question was whether ten tenants in parallel compose cleanly with the single-tenant numbers round one produced. They did. The shared layers (the FreeRADIUS process, the database pool, the accounting writer) absorbed ten simultaneous streams without contention or fairness issues. If demand ever outgrows the current sizing the scale-out path is straightforward: larger Hetzner instance types for the core and listeners, or a second block once the first fills up. No architecture changes required.

Tenant isolation under a hostile load is the harder problem. To protect co-tenants on the same block from a misconfigured supplicant or a flapping NAS at one tenant, we have added a per-tenant rate limit at 5x the tenant's endpoint limit per minute. One misbehaving tenant cannot burn the block's headroom; the rate limiter cuts the offending source off long before it reaches the listener's ceiling.
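A minimal sketch of what such a limiter could look like, assuming a fixed one-minute window per tenant. Only the 5x-endpoint-limit rule comes from the text; the class name, the fixed-window algorithm and the injectable clock are assumptions for illustration and may not match Arbiter's implementation.

```python
import time
from collections import defaultdict


class TenantRateLimiter:
    """Fixed one-minute window per tenant; limit = multiplier * endpoint_limit."""

    def __init__(self, multiplier: int = 5, clock=time.monotonic):
        self.multiplier = multiplier
        self.clock = clock  # injectable so the limiter can be tested without sleeping
        # tenant_id -> [window_index, auths_counted_in_that_window]
        self._windows = defaultdict(lambda: [-1, 0])

    def allow(self, tenant_id: str, endpoint_limit: int) -> bool:
        window = int(self.clock() // 60)
        state = self._windows[tenant_id]
        if state[0] != window:
            # A new minute has started: reset this tenant's counter.
            state[0], state[1] = window, 0
        if state[1] >= self.multiplier * endpoint_limit:
            return False  # over budget: reject without touching other tenants
        state[1] += 1
        return True
```

Because each tenant has its own counter, a tenant that exhausts its budget is rejected while every other tenant on the block keeps its full allowance. A fixed window permits short bursts at window boundaries; a token bucket would smooth that at the cost of a little more state.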

What broke

Three things gave way during and immediately after the soak. None of them was in the auth path.

1. The admin tenant list timed out

/api/tenants/, the endpoint that backs the admin Tenants page, ran a COUNT(DISTINCT mac_address) over a month of auth_log for every tenant on every page load. At test-data scale this was a few hundred milliseconds and nobody noticed. Post-soak, with ~250,000 rows per active tenant inside the billing window, the same query ran four to six seconds.

Once auth_log crossed roughly two million rows, the Postgres planner stopped using the per-tenant index scan and switched to a parallel sequential scan instead. Each worker filtered hundreds of thousands of rows and eventually hit the five-second statement timeout.

Each slow query held its connection in the API connection pool until the timeout. The next request waited on the pool, the pool saturated and the page started returning 500s.
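The saturation follows from Little's law: connections in use equal the request rate multiplied by the time each request holds a connection. A rough sketch, assuming a hypothetical pool of ten connections (the real pool size is not stated here):

```python
def saturating_rate(pool_size: int, hold_seconds: float) -> float:
    """Requests per second above which a pool of `pool_size` connections stays full.

    Little's law: connections in use = arrival rate * time each one is held.
    """
    return pool_size / hold_seconds


POOL_SIZE = 10  # hypothetical; chosen only to make the arithmetic concrete

# At ~50 ms per request the pool absorbs roughly 200 requests/second.
print(saturating_rate(POOL_SIZE, 0.05))

# Held until the five-second statement timeout, the same pool saturates
# at about 2 requests/second.
print(saturating_rate(POOL_SIZE, 5.0))
```

Under these assumptions a hundredfold increase in hold time means a hundredfold drop in the request rate the pool can absorb, which is why a handful of page loads was enough to exhaust it.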

Lesson: a planner choice that was correct on the test dataset is not necessarily correct on the production dataset. The test dataset has to look like production before the test means much.

2. Tenant portal reloads got noticeably slower

The per-tenant detail endpoint had the same shape: two correlated COUNT(DISTINCT) aggregates inline, one for the billable count, one for the live month-to-date count. Each ran ~600 ms at stress-tenant scale, about 1.2 seconds for the pair. Every full-page reload inside the portal hit it (the layout fetches tenant metadata on mount), so refreshing the NAS page on stress-04 suddenly felt sluggish in normal use.

3. The status badge stayed Degraded for a day

Cloud-Probe caught one bad latency sample during our deploy restart: 3,027 ms p99 for MAB authentication. The probe was timing the moment FreeRADIUS was bouncing for a config push, so the supplicant retried and Cloud-Probe logged the retry time. Real cause was clear from the journal. What kept the badge stuck on Partial Degradation was that our status logic computed p99 over a rolling 24-hour window: one sample at 3 seconds, no matter when, kept the badge in degraded state until that sample aged out twenty-four hours later.

Same status page also showed 133 drifts on dot1x next to 100% uptime, because drifts were defined as 'service replied with the wrong answer' and uptime only counted real timeouts. Technically correct, customer-confusing.

How we fixed it

Cache the expensive aggregates, refresh in the background

Both admin endpoints now read endpoint counts from an in-process cache, refreshed every five minutes by a background daemon. List endpoint dropped from 4-6 seconds to about 5 ms. Per-tenant detail dropped from 1.2 seconds to about 30 ms. The numbers lag by up to five minutes, which is invisible for a slow-changing monthly billable count.

Admin endpoint response time before and after caching the heavy aggregates. Same data, same queries, just moved off the hot path.
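The caching pattern described above can be sketched as follows. This is an illustration under assumptions, not Arbiter's code: the class name and the `load_counts` callable (standing in for the slow COUNT(DISTINCT) query) are hypothetical, and only the five-minute background refresh comes from the text.

```python
import threading


class EndpointCountCache:
    """In-process snapshot of per-tenant endpoint counts, refreshed in the background."""

    def __init__(self, load_counts, refresh_seconds: float = 300.0):
        self._load_counts = load_counts  # callable returning {tenant_id: count}
        self._refresh_seconds = refresh_seconds
        self._counts = {}
        self._lock = threading.Lock()
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def refresh(self) -> None:
        fresh = self._load_counts()  # the slow aggregate runs here, off the request path
        with self._lock:
            self._counts = fresh     # swap in the whole snapshot at once

    def start(self) -> None:
        self.refresh()               # warm once so the first request sees real data
        self._thread.start()

    def stop(self) -> None:
        self._stop.set()

    def _run(self) -> None:
        # Event.wait returns False on timeout and True once stop() has been called.
        while not self._stop.wait(self._refresh_seconds):
            try:
                self.refresh()
            except Exception:
                pass                 # keep serving the last good snapshot

    def get(self, tenant_id, default=0):
        with self._lock:
            return self._counts.get(tenant_id, default)
```

Request handlers call only `get`, which is a dictionary lookup, so their latency no longer depends on the size of auth_log. If a refresh fails, readers keep seeing the previous snapshot rather than an error.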

Switch the status badge to a one-hour window

The badge now computes from the last hour of probe data, not the last 24. A restart blip rolls off in 60 minutes. The 24-hour and 90-day figures are still shown on the per-service strip so the longer-term picture is visible, but the headline status follows current health, not a sample from yesterday morning that no longer reflects reality.

Time the badge stays Degraded after one bad p99 sample. Old window: 24 hours. New window: 60 minutes. Same probe data, different aggregation period.
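The effect of the window length can be shown with a small sketch. It assumes the badge trips whenever any probe sample inside the window exceeds a p99 threshold. The 2,000 ms threshold is borrowed from the soak's pass criterion and the "Operational" label is a placeholder, so both are assumptions rather than the real status logic.

```python
from datetime import datetime, timedelta

DEGRADED_P99_MS = 2000  # assumed threshold, taken from the soak's p99 pass criterion


def badge(samples, now: datetime, window: timedelta) -> str:
    """samples: iterable of (timestamp, p99_ms) probe results.

    Returns "Degraded" if any sample inside the window is over the threshold.
    """
    recent = [p99 for ts, p99 in samples if now - window <= ts <= now]
    return "Degraded" if any(p99 > DEGRADED_P99_MS for p99 in recent) else "Operational"


# One restart blip (the 3,027 ms sample) followed by two hours of normal probes.
# The date and the 40 ms baseline are made up for the example.
blip = datetime(2024, 1, 1, 9, 0)
samples = [(blip, 3027)] + [(blip + timedelta(minutes=5 * i), 40) for i in range(1, 25)]
now = blip + timedelta(hours=2)

print(badge(samples, now, timedelta(hours=24)))  # blip is still inside the window
print(badge(samples, now, timedelta(hours=1)))   # blip has aged out
```

Two hours after the blip, the 24-hour window still contains the bad sample while the one-hour window does not, which is the difference the chart above illustrates.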

Define service health honestly

The status page now reports Service Health rather than Uptime as its headline. Health is pass / (pass + drift + error): a probe that gets the wrong answer counts against the headline just like a probe that gets no answer. For a NAC service, an authentication that returns the wrong policy decision is a service failure for the downstream customer, even if the network path was up. Reachability stays as a sub-metric. Yesterday's 133 dot1x drifts now show on the 24-hour bar as 96.71% health, with reachability still at 100%. The two numbers next to each other tell the truth.
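The two metrics can be written out as a short sketch. The formula for health is the one stated above; treating reachability as "any reply, right or wrong" follows from the earlier description of uptime counting only real timeouts. The total probe count is not given here, so the 3,909 passes in the example are a hypothetical figure, picked only because it is consistent with 133 drifts and 96.71%.

```python
def service_health(passed: int, drift: int, error: int) -> float:
    """Share of probes that got the right answer: pass / (pass + drift + error)."""
    total = passed + drift + error
    if total == 0:
        raise ValueError("no probe results in the window")
    return passed / total


def reachability(passed: int, drift: int, error: int) -> float:
    """Share of probes that got any answer at all, right or wrong."""
    total = passed + drift + error
    if total == 0:
        raise ValueError("no probe results in the window")
    return (passed + drift) / total


# Hypothetical 24-hour window: 3,909 passes (assumed), 133 drifts, no timeouts.
print(round(service_health(3909, 133, 0) * 100, 2))  # 96.71
print(round(reachability(3909, 133, 0) * 100, 2))    # 100.0
```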

What we learned

What's next

Round three: an EAP-TLS-heavy mix on the same ten-tenant fan-out, since round two was MAB-only and the crypto path deserves its own honest number under multi-tenant load. After that, the per-tenant rate limiter (a new defensive backstop sized at 5x endpoint_limit per minute) gets its own deliberate breakage test. Same shape, same rules: publish what breaks, what we changed, what we did not claim to have solved.

See it for yourself

Spin up a free trial tenant on app.arbiter.ie and point any RADIUS test client at it. Want our reference stress-test script or the Cloud-Probe rig (the same one that produced the numbers above)? Email support@arbiter.ie and we will send you the code, the cert helpers and a short README.
