Breaking Arbiter: scaling EAP-TLS in a multi-tenant cloud NAC

Round three tested Arbiter under sustained mixed 802.1X and RADIUS load across ten tenants in parallel over real WAN links. The run covered 332,128 authentications, with 100% of policy decisions matching the expected outcome and zero auth-path drops. Four scaling limits surfaced; three were removed during the run, and the fourth was characterised and queued for the next architectural change.

This round tested Arbiter's cloud NAC platform under sustained mixed 802.1X and RADIUS authentication load across ten tenants running in parallel over real WAN links.

The workload combined successful EAP-TLS authentication, failed certificate validation, MAB allow decisions and MAB deny decisions. The goal was to validate multi-tenant isolation, RADIUS throughput, EAP-TLS performance, policy correctness and stability during sustained concurrent authentication load.

The test exposed four scaling limits in the cloud authentication path. Three were removed during the run. The fourth was characterised and queued for the next architectural change.

This round was a stress test, not a production sizing exercise. The authentication rates and concurrency patterns were intentionally more aggressive than what a typical Arbiter block would see in normal customer use. The objective was to force queueing, concurrency and protocol edge cases out into the open before customers met them in production.

Results at a glance

Test design

Ten tenants ran concurrently. Each tenant used a dedicated FreeRADIUS virtual server, an independent certificate trust chain, isolated policy evaluation and isolated audit logging. Traffic originated from five European regions so every authentication crossed a real WAN path rather than localhost or intra-region networking.

Every authentication used a unique device identity, a unique MAC address, real certificate validation and full policy evaluation.

Authentication mix across the run. The deny paths were deliberately included to validate policy enforcement correctness under sustained load, not just successful authentication throughput.

Forward-path serialisation under EAP-TLS concurrency

The primary throughput limit appeared in the cloud RadSec proxy layer between the Edge appliance tunnels and the tenant FreeRADIUS instances. Each tenant tunnel used a single forwarding worker: receive packet, forward, block on reply, forward next.

This architecture behaved acceptably under MAB load because MAB typically completes in a single RADIUS round trip. Under EAP-TLS load the behaviour changed significantly. A typical EAP-TLS authentication involves multiple sequential round trips: identity exchange, TLS negotiation, certificate exchange, key exchange, TLS completion, authentication success.

Each in-flight handshake held the forwarding path while waiting for upstream replies. Under concurrent supplicant activity, EAP-TLS sessions queued behind each other within the tenant tunnel and eventually began retrying. In practice, the proxy serialised authentication throughput per tenant.
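A back-of-envelope model makes the serialisation effect concrete. This is a sketch only: the reply time and the EAP-TLS round-trip count below are assumed values for illustration, not measurements from this round.

```python
def serial_auths_per_second(round_trips: int, upstream_reply_seconds: float) -> float:
    """Throughput ceiling for one blocking forwarder.

    The worker is busy for every round trip of every authentication,
    so one authentication costs round_trips * upstream_reply_seconds.
    """
    return 1.0 / (round_trips * upstream_reply_seconds)


# Assumed values, for illustration only.
UPSTREAM_REPLY_SECONDS = 0.005  # time the worker blocks per forwarded packet
MAB_ROUND_TRIPS = 1             # MAB typically completes in a single round trip
EAP_TLS_ROUND_TRIPS = 8         # assumption; varies with certificate size and fragmentation

mab_ceiling = serial_auths_per_second(MAB_ROUND_TRIPS, UPSTREAM_REPLY_SECONDS)
eap_ceiling = serial_auths_per_second(EAP_TLS_ROUND_TRIPS, UPSTREAM_REPLY_SECONDS)

print(f"MAB ceiling per tunnel:     {mab_ceiling:.0f} auth/s")
print(f"EAP-TLS ceiling per tunnel: {eap_ceiling:.0f} auth/s")
```

Whatever the real numbers are, the ratio is what matters: with a single blocking worker, the per-tunnel ceiling falls in proportion to the number of round trips per authentication.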

Proxy architecture changes

The proxy was reworked to support concurrent in-flight RADIUS exchanges per tenant. The changes were per-tenant socket pooling, independent reply readers, parallel forward paths and bounded queue behaviour under burst load.
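The difference between a single forwarding worker and several concurrent in-flight exchanges can be illustrated with a small asyncio simulation. This is a minimal sketch, not Arbiter's proxy code: `upstream_exchange`, `forward_all`, the pool size and the queue depth are all hypothetical, and the upstream is simulated with a sleep.

```python
import asyncio
import time

UPSTREAM_DELAY = 0.02  # simulated upstream reply time in seconds (assumption)


async def upstream_exchange(packet: int) -> int:
    """Stand-in for one RADIUS request/reply exchange with the tenant server."""
    await asyncio.sleep(UPSTREAM_DELAY)
    return packet


async def forward_all(packets, pool_size: int, queue_depth: int) -> list:
    """Forward packets with at most pool_size exchanges in flight."""
    # Bounded queue: when it is full, the producer waits instead of
    # letting work pile up without limit during a burst.
    queue: asyncio.Queue = asyncio.Queue(maxsize=queue_depth)
    replies = []

    async def worker() -> None:
        while True:
            packet = await queue.get()
            try:
                replies.append(await upstream_exchange(packet))
            finally:
                queue.task_done()

    workers = [asyncio.create_task(worker()) for _ in range(pool_size)]
    for packet in packets:
        await queue.put(packet)
    await queue.join()  # wait until every queued packet has a reply
    for task in workers:
        task.cancel()
    await asyncio.gather(*workers, return_exceptions=True)
    return replies


def timed_run(pool_size: int, n_packets: int = 24) -> float:
    start = time.perf_counter()
    replies = asyncio.run(forward_all(range(n_packets), pool_size, queue_depth=16))
    assert len(replies) == n_packets
    return time.perf_counter() - start


serial_seconds = timed_run(pool_size=1)  # one blocking forwarder
pooled_seconds = timed_run(pool_size=8)  # eight exchanges in flight
print(f"serial: {serial_seconds:.2f}s  pooled: {pooled_seconds:.2f}s")
```

With a pool of one, every packet waits for the previous reply. With a larger pool, the total time is governed by the upstream delay and the pool size rather than by the number of queued handshakes.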

After the change, the proxy ceased to be the primary throughput limit. The next bottlenecks shifted upward into FreeRADIUS worker scheduling, PostgreSQL connection pooling and Python interpreter serialisation inside the policy layer.

The change is expected to materially improve customer-facing behaviour during burst events such as morning boot storms, where a large number of endpoints attempt to authenticate within a short window. A boot storm that previously took several minutes to clear should now complete in well under a minute under the same authentication mix.

Per-tenant EAP-TLS throughput, single-worker forwarder vs concurrent in-flight exchanges. Same FreeRADIUS, same Postgres, same WAN: just the proxy layer reworked.

Listener behaviour under stalled TLS setup

The second scaling issue appeared in the cloud listener itself. The TLS handshake path originally executed on the same thread responsible for accepting inbound tunnel connections. A small number of stalled handshake attempts could fill the accept queue, blocking new tunnels until the stalls cleared.

The listener was reworked to move TLS handshake processing off the accept thread, increase accept queue depth, apply fixed handshake timeouts and isolate stalled clients to per-connection workers. Existing production tunnels remained unaffected throughout testing because the issue affected only new tunnel establishment while stalled handshakes were accumulating.
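The listener pattern described above can be sketched with the standard library. This is an illustration under stated assumptions, not Arbiter's listener: the TLS handshake is replaced by a stand-in that waits for one byte so the example runs without certificates, and the timeout, backlog and pool size are hypothetical values.

```python
import socket
import threading
from concurrent.futures import ThreadPoolExecutor

HANDSHAKE_TIMEOUT = 0.3  # seconds; hypothetical value for the demo


def handshake(conn: socket.socket) -> bool:
    """Stand-in for the TLS handshake: wait for one byte from the client.

    A real listener would call SSLContext.wrap_socket(conn, server_side=True)
    here; the socket timeout bounds the handshake in the same way.
    """
    conn.settimeout(HANDSHAKE_TIMEOUT)
    try:
        return conn.recv(1) == b"H"
    except socket.timeout:
        return False


def handle(conn: socket.socket, results: list) -> None:
    """Runs on a pool worker, so a stalled client only ties up that worker."""
    with conn:
        ok = handshake(conn)
        results.append(ok)
        if ok:
            conn.sendall(b"Y")


def accept_loop(listener, pool, results, stop) -> None:
    """The accept thread only accepts and hands off; it never handshakes."""
    while not stop.is_set():
        try:
            conn, _addr = listener.accept()
        except socket.timeout:
            continue
        pool.submit(handle, conn, results)


listener = socket.create_server(("127.0.0.1", 0), backlog=128)  # deeper accept queue
listener.settimeout(0.1)  # lets the loop notice the stop flag
port = listener.getsockname()[1]
results: list = []
stop = threading.Event()
pool = ThreadPoolExecutor(max_workers=8)
acceptor = threading.Thread(target=accept_loop, args=(listener, pool, results, stop))
acceptor.start()

stalled = socket.create_connection(("127.0.0.1", port))  # connects, then sends nothing
good = socket.create_connection(("127.0.0.1", port))
good.settimeout(2.0)
good.sendall(b"H")
good_reply = good.recv(1)  # served while the stalled client is still pending

stop.set()
acceptor.join()
pool.shutdown(wait=True)  # the stalled handshake ends at its timeout
for sock in (stalled, good, listener):
    sock.close()
print(good_reply, sorted(results))
```

The well-behaved client is answered even though an earlier connection is stalled, because the stall is confined to one worker and ends at the fixed timeout.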

Database concurrency limits

Once the proxy bottleneck was removed, the authentication layer pushed significantly higher transaction concurrency through PgBouncer and PostgreSQL. The database layer required additional headroom: increased PgBouncer transaction pool sizing and increased PostgreSQL connection limits. After adjustment, sustained write-path stability was confirmed under concurrent EAP-TLS load. The database layer no longer constrained the tested workload.
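A rough way to reason about database headroom is Little's law: mean concurrent transactions equal the arrival rate multiplied by the mean transaction duration. The sketch below uses hypothetical figures, not values from this run. In a PgBouncer and PostgreSQL deployment the result would typically inform settings such as PgBouncer's `default_pool_size` and `max_client_conn` and PostgreSQL's `max_connections`, although this write-up does not state which settings were changed.

```python
import math


def server_connections_needed(tx_per_second: int, avg_tx_ms: int, headroom: float = 2.0) -> int:
    """Estimate pooled server connections using Little's law.

    mean concurrency = arrival rate * mean duration; headroom is a
    multiplier to absorb bursts above the mean.
    """
    mean_concurrency = tx_per_second * avg_tx_ms / 1000
    return math.ceil(mean_concurrency * headroom)


# Hypothetical workload: 2000 transactions per second at 4 ms each.
needed = server_connections_needed(tx_per_second=2000, avg_tx_ms=4)
print(f"server connections to provision: {needed}")
```

The estimate only covers the mean; burst shape and long-tail transaction times are the reason for the headroom multiplier.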

Residual behaviour under synthetic concurrency

After the earlier fixes, the platform sustained the soak cleanly at the tested concurrency level with effectively no timeouts. We then increased concurrent supplicant density per tunnel well beyond typical real-world Edge appliance behaviour.

Under this synthetic load shape, some EAP-TLS authentications completed on retry rather than first attempt. Endpoints still authenticated successfully through normal supplicant retry behaviour. The visible effect was delayed first success during the burst window rather than outright authentication failure.

Three contributors were identified: latency amplification across multi-round-trip EAP-TLS handshakes, serialisation inside the per-instance Python interpreter, and FreeRADIUS duplicate-request suppression under aggressive synthetic concurrency.
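For readers unfamiliar with duplicate-request suppression, the sketch below shows the general idea in the spirit of RFC 2865, where a server can treat a request as a duplicate if the source IP, source port and Identifier repeat within a short time. It is a simplified illustration, not FreeRADIUS's implementation; the class name, the window length and the extra authenticator comparison are assumptions.

```python
import time
from typing import Dict, Optional, Tuple

Key = Tuple[str, int, int]  # (source IP, source port, RADIUS Identifier)


class DuplicateTracker:
    """Hypothetical duplicate detector keyed on source address and Identifier."""

    def __init__(self, window_seconds: float = 5.0) -> None:
        self.window = window_seconds
        self._seen: Dict[Key, Tuple[bytes, float]] = {}

    def classify(self, src_ip: str, src_port: int, identifier: int,
                 authenticator: bytes, now: Optional[float] = None) -> str:
        """Return "duplicate" for a retransmission inside the window, else "new"."""
        now = time.monotonic() if now is None else now
        key = (src_ip, src_port, identifier & 0xFF)  # Identifier is a single octet
        entry = self._seen.get(key)
        if entry is not None:
            seen_auth, seen_at = entry
            if now - seen_at <= self.window and seen_auth == authenticator:
                return "duplicate"
        self._seen[key] = (authenticator, now)
        return "new"


tracker = DuplicateTracker(window_seconds=5.0)
first = tracker.classify("192.0.2.10", 40000, 7, b"a" * 16, now=0.0)
retransmit = tracker.classify("192.0.2.10", 40000, 7, b"a" * 16, now=1.0)
reused_id = tracker.classify("192.0.2.10", 40000, 7, b"b" * 16, now=2.0)
print(first, retransmit, reused_id)
```

Because the Identifier is a single octet, a source socket has at most 256 distinct values to use before it must reuse one, which is one reason dense synthetic concurrency interacts with this mechanism more than typical appliance traffic does.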

The next optimisation work is focused on reducing Python-side work during EAP-TLS processing, expanding per-instance interpreter concurrency and improving policy-cache locality. This behaviour only appeared under synthetic concurrency levels substantially higher than those produced by current customer Edge appliances.

Outcome accuracy across all four traffic classes. Every class hit its expected verdict 100% of the time. The single residual timeout sits inside the EAP-TLS invalid-cert path at 1 in 33,292 attempts (0.003%).

Architecture overview

All authentication paths, policy evaluation and certificate trust chains remain tenant-isolated throughout the flow.

Cloud authentication path after the proxy rework. The per-tenant socket pool is new in this round; it removes the per-tunnel serialisation that was capping EAP-TLS throughput.

What this round validated

What still needs testing

The next round focuses on eight-hour endurance under sustained heavy EAP-TLS load and higher per-tenant concurrency levels once the next optimisation lands.

Failure-injection scenarios such as live FreeRADIUS restart, database failover under load and Edge-to-cloud network partition are on the longer-term test roadmap and will be covered in their own dedicated rounds.

Why we publish this

The difficult parts of a cloud NAC platform are usually invisible until something breaks: a school boot storm, a faulty supplicant or a queue that silently backs up under load. Publishing the rough edges is more useful than publishing perfect-looking graphs without context.

Engineers evaluating network access control platforms need to understand where the limits are, how failures present, how bottlenecks are diagnosed and what changes actually improved behaviour.

See it for yourself

Spin up a free trial tenant on app.arbiter.ie and point any RADIUS test client at it. Want our reference stress-test script and the EAP-TLS load harness? Email support@arbiter.ie and we will send you the code, the cert helpers and a short README.
