Don't break the chain: offline 802.1X on a SaaS NAC

2026-05-29

Cloud NAC has an obvious failure mode: if you lose the WAN, auth stops. MAB results can be cached, but 802.1X can't: it is a live cryptographic handshake, not a replayable permit keyed on a MAC address. So we built a real EAP-TLS server into the Edge. Here's the design.

TL;DR. We taught the on-premises Edge appliance to terminate 802.1X EAP-TLS handshakes locally when the cloud control plane is unreachable, and to forward RADIUS to a peer Edge over the LAN when the local tunnel is down but a peer's tunnel is still up. The HA design works around the variety of dead-server-detection behaviours across switch vendors (each has its own timers and thresholds) instead of betting on any one of them. Failover is a sub-second LAN round-trip, not seconds to tens of seconds spent waiting on whichever NAS detector the customer's switches happen to ship with. Audit replay closes the compliance hole on reconnect.

The problem

Cloud-delivered NAC follows a common pattern: a small on-prem appliance accepts RADIUS from the customer's switches, wraps each packet in a long-lived TLS tunnel (RadSec, RFC 6614) and forwards to a SaaS control plane running policy and audit. The customer gets multi-tenant management, centralised logs and no on-prem servers to operate.

The obvious problem is what happens when the site loses connectivity to the control plane. ISP outage, BGP flap, upstream firewall change, control-plane brown-out. Switches stop getting answers. Dot1x sessions go unauthorised. MAB lookups time out. Within minutes the auth-fail VLAN fills up, or worse, the switch fail-opens via critical-auth bypass (its own set of problems).

In operational terms: a branch office loses WAN connectivity for an hour. Employees should not lose wired authentication. Printers should not disappear from the network. IP phones should keep working. Switches should not enter critical-auth fail-open. None of those things should require the operator to do anything.

A cache solves the easy half. MAB authentications can be replayed from a cached Access-Accept. Most NAC vendors do this; it's well-trodden ground.

802.1X is the hard half. A dot1x authentication isn't a single request/response: it's a multi-round-trip cryptographic exchange whose state is bound to the live TLS session. You cannot cache an EAP-Success and replay it. The supplicant has no matching TLS state, the reply is garbage and authentication fails closed.

Doing 802.1X offline means the appliance has to become a real EAP server. Not a proxy, not a cache, an actual EAP-TLS server with cert chain validation, fragmentation handling and per-session state. We shipped that. Here is how it fits together.

Core architecture

Two Edges per site: both tunnel to the cloud independently and peer with each other over the LAN. The peer link is mTLS over the existing tenant PKI and carries both state polling and RADIUS forwarding, with no new secret material and no dependency on the cloud being up.

The Edge does four jobs:

Proxy. Normal day, tunnel up: RADIUS in, wrap in RadSec, reply back to the NAS.
MAB cache. Tunnel down: replay cached Access-Accepts for known MACs, with the integrity fields recomputed per packet (see the previous post for why).
Local EAP-TLS server. Tunnel down: drive a real EAP-TLS handshake against the supplicant, validate its cert chain, emit a fresh Access-Accept.
Audit queue. Every offline-served auth gets written to an append-only JSONL file and replayed to the cloud auth_log on tunnel reconnect, keyed for idempotency so a retry never double-inserts.

Why local EAP-TLS is its own thing

EAP-TLS (RFC 5216) is stateful. The TLS session is cryptographically bound to that specific exchange: its master secret, its random nonces, its cert verify hash. Three consequences fall out of that:

You cannot replay a cached EAP-Success. The supplicant has no matching TLS state and rejects the reply.
The Edge has to be a real EAP server during the outage: present a server certificate, validate the client's cert against the tenant CA, drive fragmentation and reassembly across multiple RADIUS round-trips.
The MS-MPPE keys for wireless WPA2/3-Enterprise are derived from the TLS master secret. Cached MPPE keys are useless on replay because they were encrypted under the original request's authenticator.

That is why the cache works for MAB and does not work for 802.1X. There is no clever workaround. You either build a local EAP server or you give up 802.1X during outages. We built one, on top of CPython's ssl.SSLContext.wrap_bio() (the documented way to do TLS without a socket), plus a state machine for EAP-TLS fragmentation per RFC 5216 section 3 and a RADIUS wrapper for EAP-Message and Message-Authenticator packing per RFC 3579 section 3. Three modules, each independently testable. No new crypto library: just the appliance's existing python3-cryptography for cert chain operations.
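
To make the fragmentation piece concrete, here is a minimal sketch of how a server-side TLS flight could be split into EAP-TLS Request packets and reassembled, using the L and M flag bits from RFC 5216. It is not the Edge's actual module: the function names, the 1000-byte fragment size and the sequential identifiers are assumptions for illustration, and a real exchange waits for the peer's ACK between fragments.

```python
import struct
from typing import List

EAP_REQUEST = 1
EAP_TYPE_TLS = 13
FLAG_LENGTH = 0x80  # L bit: 4-byte TLS Message Length field is present
FLAG_MORE = 0x40    # M bit: more fragments follow this one


def fragment_tls_flight(tls_bytes: bytes, first_id: int, max_data: int = 1000) -> List[bytes]:
    """Split one TLS flight into EAP-TLS Request packets (hypothetical helper)."""
    chunks = [tls_bytes[i:i + max_data] for i in range(0, len(tls_bytes), max_data)] or [b""]
    packets = []
    for n, chunk in enumerate(chunks):
        flags = 0
        body = chunk
        if n == 0 and len(chunks) > 1:
            # First fragment of a fragmented message carries the total TLS length.
            flags |= FLAG_LENGTH
            body = struct.pack("!I", len(tls_bytes)) + chunk
        if n < len(chunks) - 1:
            flags |= FLAG_MORE
        length = 6 + len(body)  # code + id + length(2) + type + flags
        header = struct.pack("!BBHBB", EAP_REQUEST, (first_id + n) % 256,
                             length, EAP_TYPE_TLS, flags)
        packets.append(header + body)
    return packets


def reassemble(packets: List[bytes]) -> bytes:
    """Rebuild the TLS flight from its EAP-TLS fragments and check the length."""
    out = bytearray()
    expected = None
    for pkt in packets:
        _code, _ident, length, _type, flags = struct.unpack("!BBHBB", pkt[:6])
        body = pkt[6:length]
        if flags & FLAG_LENGTH:
            expected = struct.unpack("!I", body[:4])[0]
            body = body[4:]
        out += body
    if expected is not None and expected != len(out):
        raise ValueError("TLS Message Length mismatch")
    return bytes(out)
```

Keeping the fragmenter free of sockets and TLS objects is what makes it independently testable: feed it bytes, check flags and lengths.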

Trust material bootstrap

The appliance gets a per-device server certificate at activation time:

Edge generates an ECDSA P-256 keypair locally. The private key never leaves the appliance.
Edge builds a CSR with CN = edge-{device_id}.{tenant_shortname}.arbiter.internal.
Edge submits the CSR to the cloud over its existing tunnel, authenticated with the tunnel client cert (already chained to the tenant CA).
Cloud signs with the tenant intermediate, returns the leaf PEM.
Edge writes server.crt (0644) and server.key (0600).

The tenant CA bundle (public, no private key) ships via the heartbeat response, gated on a SHA-256 hash so we carry the ~2 KB payload only on first fetch and on rotation.
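
The hash gate can be sketched roughly as follows. This is an illustration under assumptions, not the real heartbeat schema: the field names ca_bundle_sha256 and ca_bundle_pem and both function names are hypothetical, and the sketch assumes the Edge reports the hash of the bundle it currently holds.

```python
import hashlib
from typing import Optional


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def build_heartbeat_response(edge_bundle_sha256: Optional[str], current_bundle: bytes) -> dict:
    """Cloud side: attach the bundle only when the Edge's copy is missing or stale."""
    current_hash = sha256_hex(current_bundle)
    response = {"ca_bundle_sha256": current_hash}
    if edge_bundle_sha256 != current_hash:
        response["ca_bundle_pem"] = current_bundle.decode("ascii")
    return response


def apply_heartbeat_response(local_bundle: Optional[bytes], response: dict) -> Optional[bytes]:
    """Edge side: keep the local bundle unless a verified replacement arrived."""
    pem = response.get("ca_bundle_pem")
    if pem is None:
        return local_bundle
    fresh = pem.encode("ascii")
    if sha256_hex(fresh) != response["ca_bundle_sha256"]:
        raise ValueError("CA bundle does not match advertised SHA-256")
    return fresh
```

In steady state the response carries only the 64-character hash; the PEM payload appears on first fetch and again when the bundle rotates.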

Audit replay closes the compliance loop

SOC 2 CC7.2 expects logging completeness. NIS2 Article 21(2)(b) expects incident handling, with reporting obligations under Article 23. Both want a coherent audit trail across the outage window.

Every locally-served auth gets appended to a local journal with fdatasync per write so we never lose an acknowledged auth to power loss. On tunnel reconnect, a background drainer ships events to the cloud's audit-replay endpoint in batches of 100, which writes to auth_log with the markers served_offline=TRUE and served_by_edge_id pointing at the Edge that handled it. The customer's compliance officer can then query:

SELECT mac_address, result, served_by_edge_id, event_timestamp
  FROM auth_log
 WHERE served_offline = TRUE
 ORDER BY event_timestamp DESC;

And see every offline-served auth, attributable to a specific Edge, for the specific outage window. Without this, the cloud audit log has a hole through every outage. With this, the hole closes within seconds of reconnect.
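
A minimal sketch of the journal-and-drain pattern described above, assuming a JSONL file and a per-event idempotency key. The names append_event, drain and event_id are illustrative, not the Edge's real identifiers, and the upload call is left as a caller-supplied function.

```python
import json
import os
import uuid
from typing import Callable, Iterator, List

# os.fdatasync is not available on every platform; fall back to fsync there.
_sync = getattr(os, "fdatasync", os.fsync)


def append_event(path: str, event: dict) -> str:
    """Append one auth event and flush it to disk before returning."""
    record = dict(event)
    record.setdefault("event_id", str(uuid.uuid4()))  # assumed idempotency key
    line = (json.dumps(record, sort_keys=True) + "\n").encode("utf-8")
    fd = os.open(path, os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0o600)
    try:
        os.write(fd, line)
        _sync(fd)  # the auth is only acknowledged after this returns
    finally:
        os.close(fd)
    return record["event_id"]


def batches(path: str, size: int = 100) -> Iterator[List[dict]]:
    """Yield journal events in order, `size` at a time."""
    batch: List[dict] = []
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            batch.append(json.loads(line))
            if len(batch) == size:
                yield batch
                batch = []
    if batch:
        yield batch


def drain(path: str, post_batch: Callable[[List[dict]], None], size: int = 100) -> int:
    """Replay every journal event through post_batch; returns the event count."""
    replayed = 0
    for batch in batches(path, size):
        post_batch(batch)
        replayed += len(batch)
    return replayed
```

Because each event carries its own key, a drain that is interrupted and retried can resend whole batches safely as long as the receiving side upserts on that key.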

HA behaviour: multi-Edge peer awareness

The single hardest design question wasn't about cert validation or EAP fragmentation. It was about how two Edges on the same site should coordinate during a partial outage.

Why not lean on the NAS dead-server detector

The first version of this design leaned on Cisco's dead-server detector: when Edge A's tunnel broke, A would silent-drop the request so the switch would time out and fail over to Edge B. That works on paper. In practice it couples customer auth latency to radius-server dead-criteria (default tries 3 time 5, so up to fifteen seconds of timeouts before failover) and risks marking A dead for any single-Edge customer who has no B to fail over to.

Different NAS vendors implement dead-server detection differently. Juniper, Aruba, Ubiquiti and MikroTik each ship their own timer model and threshold defaults; some operators tune those defaults, some don't. Building the failover path on top of whichever detector the customer's switches happen to ship with means a fragmented testing matrix and surprises on first-deployment day.

The version we shipped does not silent-drop. When A's tunnel is down and a peer's tunnel is up, A forwards the RADIUS request to the peer over the LAN, the peer serves it normally and A relays the reply back to the NAS. Failover takes one LAN round-trip instead of three timeouts, and the NAS never sees A go quiet. The behaviour is the same whichever vendor's dead-server timer is underneath, because the design no longer depends on it firing.

The peer-aware design

Two Edges on a site. Switch configured with both as RADIUS targets, priorities 1 and 2. Edge A's tunnel breaks, B's is fine. A receives a RADIUS request, sees its own tunnel is down, sees B's tunnel is up, opens a short-lived LAN session to B, forwards the request, waits for the reply and answers the NAS. From the switch's perspective, A answered. From the cloud's perspective, the request came in via B's tunnel as normal.

The two requirements that fall out of that:

A needs to know B's tunnel state in real time, without depending on the cloud being up (because the cloud might be the thing that's down). So the peering layer is LAN-local.
Forwarded requests need a loop-prevention marker so B doesn't re-forward back to A if B's own tunnel is also down. The peer-forward channel carries an originating-Edge HTTP header on the request; if a request arrives carrying that header, the receiving Edge treats it as locally-handled-only and never forwards again.

Each Edge runs a small local HTTPS endpoint that returns its tunnel state to authenticated peers and accepts forwarded RADIUS over the same mTLS channel. Mutual TLS uses the existing per-appliance server cert (chained to the tenant CA); peer validation uses the existing tenant CA trust bundle. mTLS by PKI, with no new secret to rotate.

Each Edge polls every peer every two seconds with a five-second freshness TTL. Failure to reach a peer counts as tunnel_up=false (the safe direction: if I cannot see you, I have to assume you cannot help me). Short interval, short TTL: peer state never lags reality by more than a few seconds, and the LAN poll is cheap enough that there's no cost reason to wait longer.
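
The freshness rule can be sketched as a small table keyed by peer, with an injectable clock so it can be tested without sleeping. The class and method names here are assumptions (only any_peer_tunnel_up echoes the name used in the decision tree), and the actual HTTPS polling is left out.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, List

POLL_INTERVAL_S = 2.0   # how often the poller is expected to call record_poll
FRESHNESS_TTL_S = 5.0   # observations older than this are ignored


@dataclass
class PeerObservation:
    tunnel_up: bool
    seen_at: float


class PeerTable:
    """Last-known tunnel state per peer, trusted only while fresh."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._peers: Dict[str, PeerObservation] = {}

    def record_poll(self, peer: str, tunnel_up: bool) -> None:
        # Callers pass tunnel_up=False when the poll itself failed.
        self._peers[peer] = PeerObservation(tunnel_up, self._clock())

    def healthy_peers(self) -> List[str]:
        now = self._clock()
        return [name for name, obs in self._peers.items()
                if obs.tunnel_up and now - obs.seen_at <= FRESHNESS_TTL_S]

    def any_peer_tunnel_up(self) -> bool:
        return bool(self.healthy_peers())
```

A peer that was healthy six seconds ago and has not been heard from since is treated exactly like a peer that reported its tunnel down.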

The decision tree:

if request arrived with peer-origin header:
    handle_locally_only()          # came via peer-forward, must not re-forward

elif local_tunnel_up:
    forward_to_cloud()             # normal path

elif peer_poller.any_peer_tunnel_up():
    proxy_to_peer()                # LAN round-trip via mTLS to a healthy peer

elif request carries EAP-Message AND local EAP armed:
    drive_local_handshake()        # fresh EAP-Success bound to this session

elif request matches a cached MAC:
    serve_from_cache_filtered()    # cached policy attrs, EAP TLVs stripped

else:
    unknown_mac_policy()           # deny/permit/vlan_override

Each branch is operationally visible in the journal so the operator can see which path served any given auth.
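
The same tree as a small pure function, which is one way to keep the branch choice testable and to have a label ready for the journal. The Request and EdgeState types and the string labels are illustrative assumptions; the branch order mirrors the tree above.

```python
from dataclasses import dataclass


@dataclass
class Request:
    peer_origin: bool = False      # arrived with the peer-origin header
    has_eap_message: bool = False  # carries an EAP-Message attribute
    mac_cached: bool = False       # MAC has a cached Access-Accept


@dataclass
class EdgeState:
    local_tunnel_up: bool
    any_peer_tunnel_up: bool
    local_eap_armed: bool


def choose_path(req: Request, state: EdgeState) -> str:
    """Return the name of the branch that should serve this request."""
    if req.peer_origin:
        return "handle_locally_only"        # must never be re-forwarded
    if state.local_tunnel_up:
        return "forward_to_cloud"           # normal path
    if state.any_peer_tunnel_up:
        return "proxy_to_peer"              # LAN hop to a healthy peer
    if req.has_eap_message and state.local_eap_armed:
        return "drive_local_handshake"      # local EAP-TLS termination
    if req.mac_cached:
        return "serve_from_cache_filtered"  # cached MAB reply
    return "unknown_mac_policy"             # deny/permit/vlan_override
```

For example, a dot1x request hitting an Edge whose tunnel and peers are all down resolves to the local handshake branch, while the same request with a healthy peer resolves to the proxy branch.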

The failure modes that are not failures

Asymmetric reachability. A can poll B but B can't poll A: A forwards to B over the LAN, B answers via its own cloud tunnel, no degradation.
Both tunnels down. Forwarding is a no-op (no healthy peer); each Edge falls through to local EAP-TLS or cached MAB independently. Single-Edge semantics.
No peers configured. Forwarding path is dead code; behaviour identical to a single-Edge install (local EAP-TLS, cached MAB).
Peer link itself dies. Forwarder times out fast (sub-second LAN budget) and falls through to local EAP-TLS or cached MAB. The peer-link doesn't get a second chance to delay the auth.

Lab validation

Two Edges (.34 and .45) on the same LAN, both on the same software version. Cisco Catalyst 9000 switch with dot1x + MAB configured on multiple ports. Windows test laptop on g1/0/2 with a client cert signed by the tenant CA. A second non-dot1x device on g1/0/3 for MAB.

Baseline: both Edges healthy

ARBITER-SW1#show access-session int g1/0/2 de
            Interface:  GigabitEthernet1/0/2
          MAC Address:  98fa.9b23.d8d1
            User-Name:  host/windows-test-laptop
               Status:  Authorized
               Domain:  DATA

Server Policies:
           Vlan Group:  Vlan: 21

Method status list:
       Method           State
        dot1x           Authc Success

Auth flows normally: switch to Edge to cloud to policy, VLAN 21 returned in the Access-Accept tunnel triplet. Edge journal shows the round-trip:

auth fwd: auth rid=124 mac=98fa9b23d8d1 from=192.168.0.21:5249
auth rpy: ACCEPT  rid=124 mac=98fa9b23d8d1 to=192.168.0.21:5249

Both tunnels down, dot1x via local EAP-TLS

Drop both tunnels with nft add rule inet filter output tcp dport 2083 counter reject on each Edge. Trigger a reauth on the switch. The Edge that receives the request drives a full local handshake:

21:23:34  INFO  local EAP-TLS termination armed
21:23:34  INFO  local EAP: served 64 bytes (live sessions=1)         # Start
21:23:41  INFO  local EAP: served 1100 bytes (live sessions=2)       # Server flight
21:23:41  INFO  local EAP: served 604 bytes (live sessions=2)        # cert + done
21:23:41  INFO  local EAP: served 60 bytes (live sessions=1)         # Access-Accept
21:23:41  INFO  local EAP: Access-Accept peer_cn='host/windows-test-laptop'
                          policy_attrs=16B reason='cert PASS'

Four replies out: three EAP-TLS flights, then one fresh Access-Accept. The 60-byte payload is the Accept: RADIUS header (20) + EAP-Message attribute carrying EAP-Success (6) + VLAN tunnel triplet (16) + Message-Authenticator (18). Math checks out.
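
The 60-byte figure can be reproduced by packing such a reply by hand. This is a sketch under assumptions, not the Edge's encoder: it assumes an untagged Tunnel-Private-Group-ID of "21", and the function names, shared secret and identifiers are made up for illustration. The Message-Authenticator and Response Authenticator follow RFC 3579 and RFC 2865 respectively.

```python
import hashlib
import hmac
import struct

ACCESS_ACCEPT = 2
EAP_SUCCESS = 3


def attr(attr_type: int, value: bytes) -> bytes:
    """One RADIUS attribute: type, length (including the 2-byte header), value."""
    return struct.pack("!BB", attr_type, 2 + len(value)) + value


def build_access_accept(radius_id: int, eap_id: int, request_auth: bytes,
                        secret: bytes, vlan: str) -> bytes:
    eap_success = struct.pack("!BBH", EAP_SUCCESS, eap_id, 4)
    attrs = (
        attr(79, eap_success)                 # EAP-Message: 6 bytes
        + attr(64, struct.pack("!I", 13))     # Tunnel-Type, tag 0, VLAN (13): 6 bytes
        + attr(65, struct.pack("!I", 6))      # Tunnel-Medium-Type, tag 0, IEEE-802 (6): 6 bytes
        + attr(81, vlan.encode("ascii"))      # Tunnel-Private-Group-ID, untagged: 4 bytes
        + attr(80, bytes(16))                 # Message-Authenticator placeholder: 18 bytes
    )
    header = struct.pack("!BBH", ACCESS_ACCEPT, radius_id, 20 + len(attrs))
    # RFC 3579: HMAC-MD5 over the packet with the Request Authenticator in place
    # and the Message-Authenticator value zeroed.
    mac = hmac.new(secret, header + request_auth + attrs, hashlib.md5).digest()
    attrs = attrs[:-16] + mac
    # RFC 2865: Response Authenticator = MD5(code+id+length+RequestAuth+attrs+secret).
    response_auth = hashlib.md5(header + request_auth + attrs + secret).digest()
    return header + response_auth + attrs


packet = build_access_accept(radius_id=124, eap_id=9, request_auth=bytes(16),
                             secret=b"example-secret", vlan="21")
print(len(packet))
```

The policy portion (Tunnel-Type, Tunnel-Medium-Type, Tunnel-Private-Group-ID) adds up to 16 bytes, which lines up with the policy_attrs=16B value in the log line above.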

The switch's view:

ARBITER-SW1#show access-session int g1/0/2 de
               Status:  Authorized

Server Policies:
           Vlan Group:  Vlan: 21

Method status list:
       Method           State
        dot1x           Authc Success

Port authorised by the certificate (no cloud round-trip) at the same VLAN the cloud's policy engine would have returned. From the customer's perspective the outage is invisible at the auth layer.

MAB during the same outage

On g1/0/3, a non-dot1x device hits MAB during the same outage window. The Edge serves the filtered cache attrs:

21:26:00  INFO  offline auth: served mac=f8c65049d672 code=2 (20 bytes)
                                       from cache to 192.168.0.21
21:26:00  INFO  offline acct: ACKed (code=5) ... -- keeping NAS
                                       server-health green during outage

ARBITER-SW1#show access-session int g1/0/3 de
               Status:  Authorized
Method status list:
       Method           State
        dot1x           Stopped
          mab           Authc Success

Same outage, same Edge, two auth paths, both working. Laptops using dot1x and devices using MAB continued authenticating through the outage without operator intervention.

Multi-Edge HA: A forwards to B

With A's tunnel down and B's tunnel up, an auth that the switch sends to A is forwarded over the LAN peer link to B, B services it via its own cloud tunnel and A relays the reply back to the NAS. End-to-end latency is dominated by the cloud round-trip; the LAN hop adds single-digit milliseconds. The peer-origin header on the inbound at B blocks B from ever forwarding it back to A.

From the switch's perspective, A answered normally. No timeout, no retransmission, no failover increment, no dead-time entry. Both Edges stay UP. The lab traces for the proxy path will be published in a follow-up once the soak round completes.

Audit replay

After both tunnels restore, the offline events drain to the cloud within seconds:

tunnel connected to 46.225.164.231:2083
offline audit drain: replayed=3 failed=0

On Core:

SELECT mac_address, result, served_by_edge_id, event_timestamp
  FROM auth_log
 WHERE served_offline = TRUE
 ORDER BY event_timestamp DESC LIMIT 10;

Shows every locally-served auth from the outage window, attributable to the Edge that handled it. The audit-log hole closes the moment the tunnel comes back.

What the lab taught us that the design didn't

Three bugs worth recording.

1. Authority-Key-Identifier strictness

Modern OpenSSL strict-validates the X.509 chain including AKI/SKI consistency. Our unit-test PKI fixture didn't include either; tests passed because we'd set check_hostname=False. Lab pairing failed because production validation is strict. Fix: real AKI/SKI on minted certs, plus a Core-side EKU guarantee (TLS Web Server Authentication) on issued server certs.

2. Logging propagation

The daemon configures a named logger "edge_client" with propagate=False. Helper modules used getLogger(__name__) and produced loggers named after their files; those records propagated to the root logger which has no handler. Result: peer modules' INFO startup lines never reached journald even though the listeners were bound. One-line fix: getLogger("edge_client.peer_server"). The kind of thing only operating the software at scale catches.
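
The effect is easy to reproduce with the standard logging module. The sketch below uses an in-memory handler as a stand-in for the journald handler (an assumption for illustration) and shows that only the logger named under the "edge_client" hierarchy reaches it.

```python
import logging


class ListHandler(logging.Handler):
    """Stand-in for the daemon's real handler; collects formatted messages."""

    def __init__(self) -> None:
        super().__init__()
        self.messages = []

    def emit(self, record: logging.LogRecord) -> None:
        self.messages.append(f"{record.name}: {record.getMessage()}")


def configure_daemon_logger() -> ListHandler:
    handler = ListHandler()
    daemon = logging.getLogger("edge_client")
    daemon.setLevel(logging.INFO)
    daemon.propagate = False      # records stop here, never reach the root logger
    daemon.handlers = [handler]
    return handler


handler = configure_daemon_logger()

# Named after the module file: not a child of "edge_client", so it bypasses the handler.
logging.getLogger("peer_server").info("listener bound")

# Named under the daemon's hierarchy: propagates up to "edge_client" and its handler.
logging.getLogger("edge_client.peer_server").info("listener bound")

print(handler.messages)
```

Logger names form a dot-separated hierarchy, so the parent relationship comes from the name string, not from which package the module file lives in.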

3. The cached-attrs Frankenstein

An early implementation of policy-attribute grafting copied the ENTIRE cached Access-Accept attribute blob onto the locally-minted Access-Accept, including stale EAP-Message, MS-MPPE keys encrypted under a long-gone request authenticator and a stale Message-Authenticator. The result was a 226-byte malformed reply Cisco refused (visible on the switch as the incorrect counter on show aaa servers incrementing on every offline-EAP reply).

The fix: a filter that keeps only RFC-standard policy attributes (Tunnel-*, Filter-Id, Reply-Message, timeouts) and drops everything else, including all Vendor-Specific Attributes. We default to vendor-neutral attributes per RADIUS standards. Customers using Cisco AVPair-style policy need a separate code path which we will add when a real customer needs it.
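
In outline, the filter is an allowlist over RADIUS attribute types. The sketch below is a guess at its shape, not the shipped code: the allowlist is an illustrative subset (Filter-Id, Reply-Message, Session-Timeout, Idle-Timeout and the three tunnel attributes used for VLAN assignment) and the function name is hypothetical.

```python
from typing import List, Tuple

Attribute = Tuple[int, bytes]  # (RADIUS attribute type, raw value)

# Assumed allowlist for illustration; the real set may be wider.
ALLOWED_POLICY_TYPES = frozenset({
    11,  # Filter-Id
    18,  # Reply-Message
    27,  # Session-Timeout
    28,  # Idle-Timeout
    64,  # Tunnel-Type
    65,  # Tunnel-Medium-Type
    81,  # Tunnel-Private-Group-ID
})


def filter_policy_attrs(cached: List[Attribute]) -> List[Attribute]:
    """Keep only allowlisted policy attributes, preserving their order.

    Everything else is dropped, including EAP-Message (79),
    Message-Authenticator (80) and Vendor-Specific (26).
    """
    return [(t, v) for t, v in cached if t in ALLOWED_POLICY_TYPES]
```

An allowlist fails safe: an attribute type the filter has never heard of is dropped, not passed through to the switch with stale cryptographic context.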

The size delta is the smoking gun: 60 bytes after the filter, 226 bytes before. The switch's view of that reply was the difference between "valid EAP-Success, port authorised" and "garbage, fail closed".

Where this sequence goes next

Local EAP-TLS termination, MAB cache replay, multi-Edge peer awareness and audit replay are shipped and lab-validated. The sequencing for the rest of the offline story is straightforward:

BYO endpoint trust. Today the Edge validates supplicant certs against the Arbiter-managed tenant CA. For customers whose own PKI issues the certs their fleet already uses, the next iteration decouples the two trust roles: tenant-ca.pem continues to anchor Arbiter's per-tenant PKI for internal mTLS, while a new endpoint-trust.pem shipped via heartbeat carries the customer's CA bundle for EAP-TLS supplicant validation.
Portal read view. The audit-replay data already lands in auth_log with the offline markers set. The read view ("offline auths during last outage", per Edge, per tenant) is a small UX add on top of data that's already there.

Open questions

The cached-attrs filter drops all Vendor-Specific Attributes. Customers whose policy depends on Cisco AVPair-style attributes (cisco-av-pair=...) won't get their dACL or URL-redirect in offline replay. That tradeoff works for our current customer base; we'll revisit it when a real customer policy needs it.
Peer-forward at scale: the current design relies on a single peer being reachable when the local tunnel is down. For sites with three or more Edges, the policy for which peer to forward to (least-loaded? lowest-RTT? round-robin?) is the next thing to characterise under real lab load.

References

RFC 5216: The EAP-TLS Authentication Protocol
RFC 6614: Transport Layer Security (TLS) Encryption for RADIUS
RFC 3579: RADIUS Support for EAP
RFC 2548: Microsoft Vendor-specific RADIUS Attributes
RFC 2865: Remote Authentication Dial In User Service (RADIUS)

Customer-deployed Edge components are reviewable by customers under NDA. The hosted control plane remains proprietary. Reach out if you'd like a deeper dive into any of the above.