Don't break the chain: offline 802.1X on a SaaS NAC

Cloud NAC has an obvious failure mode: if you lose the WAN, auth stops. MAB results can be cached, but 802.1X can't: it's a live cryptographic handshake, not a replayable permit keyed on a MAC address. So we built a real EAP-TLS server into the Edge. Here's the design.

TL;DR. We taught the on-premises Edge appliance to terminate 802.1X EAP-TLS handshakes locally when the cloud control plane is unreachable, and to forward RADIUS to a peer Edge over the LAN when the local tunnel is down but a peer's tunnel is still up. The HA design does not bet on any one switch vendor's dead-server detection (each has its own timers and thresholds): failover is a sub-second LAN round-trip, not a wait of seconds to tens of seconds on whichever NAS detector the customer's switches happen to ship with. Audit replay closes the compliance hole on reconnect.

The problem

Cloud-delivered NAC follows a common pattern: a small on-prem appliance accepts RADIUS from the customer's switches, carries each packet over a long-lived TLS tunnel (RadSec, RFC 6614) and forwards it to a SaaS control plane that runs policy and audit. The customer gets multi-tenant management, centralised logs and no on-prem servers to operate.

The obvious problem is what happens when the site loses connectivity to the control plane. ISP outage, BGP flap, upstream firewall change, control-plane brown-out. Switches stop getting answers. Dot1x sessions go unauthorised. MAB lookups time out. Within minutes the auth-fail VLAN fills up, or worse, the switch fail-opens via critical-auth bypass (its own set of problems).

In operational terms: a branch office loses WAN connectivity for an hour. Employees should not lose wired authentication. Printers should not disappear from the network. IP phones should keep working. Switches should not enter critical-auth fail-open. None of those things should require the operator to do anything.

A cache solves the easy half. MAB authentications can be replayed from a cached Access-Accept. Most NAC vendors do this; it's well-trodden ground.

802.1X is the hard half. A dot1x authentication isn't a single request/response: it's a multi-round-trip cryptographic exchange whose state is bound to the live TLS session. You cannot cache an EAP-Success and replay it. The supplicant has no matching TLS state, so the reply is garbage and authentication fails closed.

Doing 802.1X offline means the appliance has to become a real EAP server. Not a proxy, not a cache, an actual EAP-TLS server with cert chain validation, fragmentation handling and per-session state. We shipped that. Here is how it fits together.

Core architecture

Two Edges per site. Both tunnel to the cloud independently and peer with each other over the LAN. The peer link is mTLS over the existing tenant PKI and carries both state polling and RADIUS forwarding, with no new secret material and no dependency on the cloud being up.

The Edge does four jobs:

Why local EAP-TLS is its own thing

EAP-TLS (RFC 5216) is stateful. The TLS session is cryptographically bound to that specific exchange: its master secret, its random nonces, its cert verify hash. Three consequences fall out of that:

That is why the cache works for MAB and does not work for 802.1X. There is no clever workaround. You either build a local EAP server or you give up 802.1X during outages. We built one, on top of CPython's ssl.SSLContext.wrap_bio() (the documented way to do TLS without a socket), plus a state machine for EAP-TLS fragmentation per RFC 5216 section 3 and a RADIUS wrapper for EAP-Message and Message-Authenticator packing per RFC 3579 section 3. Three modules, each independently testable. No new crypto library: just the appliance's existing python3-cryptography for cert chain operations.
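As a rough illustration of the first two modules, here is a minimal sketch of a socket-less TLS engine on ssl.SSLContext.wrap_bio() plus an EAP-TLS fragmenter using the L and M flags from RFC 5216. The names TlsPump, feed, fragment and the 1000-byte fragment size are hypothetical, chosen for this example; this is not the shipped code.

```python
import ssl
import struct

EAP_REQUEST = 1          # EAP Code: Request
EAP_TYPE_TLS = 13        # EAP Type: EAP-TLS
FLAG_LENGTH = 0x80       # L: TLS Message Length field present
FLAG_MORE = 0x40         # M: more fragments follow
EAP_TLS_HEADER_LEN = 6   # code, identifier, length (2), type, flags


class TlsPump:
    """Hypothetical helper: server-side TLS engine with no socket."""

    def __init__(self, ctx: ssl.SSLContext) -> None:
        self.incoming = ssl.MemoryBIO()  # TLS bytes from the supplicant
        self.outgoing = ssl.MemoryBIO()  # TLS bytes for the supplicant
        self.tls = ctx.wrap_bio(self.incoming, self.outgoing,
                                server_side=True)
        self.done = False

    def feed(self, data: bytes) -> bytes:
        """Push reassembled client bytes in; return the next server flight."""
        if data:
            self.incoming.write(data)
        try:
            self.tls.do_handshake()
            self.done = True
        except ssl.SSLWantReadError:
            pass  # handshake is waiting for the next client flight
        return self.outgoing.read()


def fragment(flight: bytes, first_id: int,
             max_data: int = 1000) -> list[bytes]:
    """Split one TLS flight into EAP-Request/EAP-TLS packets."""
    chunks = [flight[i:i + max_data]
              for i in range(0, len(flight), max_data)] or [b""]
    packets = []
    for n, chunk in enumerate(chunks):
        flags = 0
        body = chunk
        if n == 0 and len(chunks) > 1:
            # First fragment of a fragmented message carries the total length.
            flags |= FLAG_LENGTH
            body = struct.pack("!I", len(flight)) + chunk
        if n < len(chunks) - 1:
            flags |= FLAG_MORE
        header = struct.pack("!BBHBB", EAP_REQUEST, (first_id + n) % 256,
                             EAP_TLS_HEADER_LEN + len(body),
                             EAP_TYPE_TLS, flags)
        packets.append(header + body)
    return packets
```

In a real deployment the context passed to TlsPump would presumably be loaded with the per-device certificate (load_cert_chain) and the tenant CA bundle (load_verify_locations) with verify_mode set to ssl.CERT_REQUIRED. Each fragment after the first is only sent once the supplicant has acknowledged the previous one, which is why the identifier advances per fragment.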

Trust material bootstrap

The appliance gets a per-device server certificate at activation time:

The tenant CA bundle (public, no private key) ships via the heartbeat response, gated on a SHA-256 hash so we carry the ~2 KB payload only on first fetch and on rotation.
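The hash gate can be sketched as follows. This assumes the Edge reports the SHA-256 of the bundle it currently holds and the cloud attaches the PEM only when that differs; the field names ca_bundle_sha256 and ca_bundle_pem are hypothetical, not the real heartbeat schema.

```python
import hashlib


def bundle_digest(pem: bytes) -> str:
    """Hex SHA-256 of the CA bundle exactly as stored."""
    return hashlib.sha256(pem).hexdigest()


def heartbeat_response(edge_reported_sha256: str, tenant_pem: bytes) -> dict:
    """Cloud side: attach the bundle only when the Edge's copy is stale."""
    current = bundle_digest(tenant_pem)
    response = {"ca_bundle_sha256": current}
    if edge_reported_sha256 != current:
        response["ca_bundle_pem"] = tenant_pem.decode("ascii")
    return response


def apply_heartbeat(held_pem: bytes, response: dict) -> bytes:
    """Edge side: install a new bundle only if it matches the advertised hash."""
    pem_text = response.get("ca_bundle_pem")
    if pem_text is None:
        return held_pem  # nothing shipped, keep what we have
    new_pem = pem_text.encode("ascii")
    if bundle_digest(new_pem) != response["ca_bundle_sha256"]:
        raise ValueError("CA bundle does not match advertised SHA-256")
    return new_pem
```

An Edge with no bundle yet would report an empty digest, so the first heartbeat carries the payload and later ones carry only the 64-character hash until the CA rotates.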

Audit replay closes the compliance loop

SOC 2 CC7.2 expects logging completeness. NIS2 Article 23 expects incident reporting. Both want a coherent audit trail across the outage window.

Every locally served auth is appended to a local journal with an fdatasync per write, so we never lose an acknowledged auth to power loss. On tunnel reconnect, a background drainer ships events to the cloud's audit-replay endpoint in batches of 100. The endpoint writes them to auth_log with the markers served_offline=TRUE and served_by_edge_id pointing at the Edge that handled each one. The customer's compliance officer can then query:

SELECT mac_address, result, served_by_edge_id, event_timestamp
  FROM auth_log
 WHERE served_offline = TRUE
 ORDER BY event_timestamp DESC;

The result is every offline-served auth across the outage window, each attributable to a specific Edge. Without this, the cloud audit log has a hole through every outage. With this, the hole closes within seconds of reconnect.
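The journal-and-drain loop might look roughly like the sketch below. The JSON-lines layout, the function names and the send_batch callback are assumptions made for illustration; only the fdatasync-per-write and the batch size of 100 come from the design above.

```python
import json
import os

# os.fdatasync is not available on every platform; fall back to fsync.
_sync = getattr(os, "fdatasync", os.fsync)


def append_event(path: str, event: dict) -> None:
    """Append one auth event and force it to stable storage before returning."""
    line = (json.dumps(event, sort_keys=True) + "\n").encode("utf-8")
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
    try:
        os.write(fd, line)
        _sync(fd)
    finally:
        os.close(fd)


def drain(path: str, send_batch, batch_size: int = 100) -> int:
    """Replay journalled events in batches; truncate only after all succeed.

    send_batch is a hypothetical callable that raises on failure. If it
    raises, the journal is left intact and the next drain resends everything,
    so this sketch assumes the replay endpoint tolerates duplicates.
    """
    with open(path, "rb") as fh:
        events = [json.loads(raw) for raw in fh if raw.strip()]
    replayed = 0
    for start in range(0, len(events), batch_size):
        batch = events[start:start + batch_size]
        send_batch(batch)
        replayed += len(batch)
    os.truncate(path, 0)
    return replayed
```

Truncating only after the last batch succeeds trades possible duplicate delivery for never dropping an acknowledged auth, which matches the completeness goal.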

HA behaviour: multi-Edge peer awareness

The single hardest design question wasn't about cert validation or EAP fragmentation. It was about how two Edges on the same site should coordinate during a partial outage.

Why not lean on the NAS dead-server detector

The first version of this design leaned on Cisco's dead-server detector: when Edge A's tunnel broke, A would silent-drop the request so the switch would time out and fail over to Edge B. That works on paper. In practice it couples customer auth latency to radius-server dead-criteria (default tries 3 time 5, so up to fifteen seconds of timeouts before failover) and risks marking A dead for any single-Edge customer who has no B to fail over to.

Different NAS vendors implement dead-server detection differently. Juniper, Aruba, Ubiquiti and MikroTik each ship their own timer model and threshold defaults; some operators tune those defaults, some don't. Building the failover path on top of whichever detector the customer's switches happen to ship with means a fragmented testing matrix and surprises on first-deployment day.

The version we shipped does not silent-drop. When A's tunnel is down and a peer's tunnel is up, A forwards the RADIUS request to the peer over the LAN, the peer serves it normally and A relays the reply back to the NAS. Failover takes one LAN round-trip instead of three timeouts, and the NAS never sees A go quiet. The behaviour is the same whichever vendor's dead-server timer is underneath, because the design no longer depends on it firing.

The peer-aware design

Two Edges on a site. Switch configured with both as RADIUS targets, priorities 1 and 2. Edge A's tunnel breaks, B's is fine. A receives a RADIUS request, sees its own tunnel is down, sees B's tunnel is up, opens a short-lived LAN session to B, forwards the request, waits for the reply and answers the NAS. From the switch's perspective, A answered. From the cloud's perspective, the request came in via B's tunnel as normal.

The two requirements that fall out of that:

Each Edge runs a small local HTTPS endpoint that returns its tunnel state to authenticated peers and accepts forwarded RADIUS over the same mTLS channel. Mutual TLS uses the existing per-appliance server cert (chained to the tenant CA); peer validation uses the existing tenant CA trust bundle. It is mTLS by PKI, with no new secret to rotate.

Each Edge polls every peer every two seconds with a five-second freshness TTL. Failure to reach a peer counts as tunnel_up=false (the safe direction: if I cannot see you, I have to assume you cannot help me). Short interval, short TTL: peer state never lags far behind reality, and the LAN poll is cheap enough that there's no cost reason to wait longer.
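A minimal sketch of that polling rule, with the clock and the fetch function injected so the TTL logic can be exercised without a network. PeerPoller, poll_once and fetch_state are hypothetical names; fetch_state stands in for the mTLS HTTPS call and is assumed to raise OSError on any connection or TLS failure.

```python
import time
from dataclasses import dataclass

POLL_INTERVAL_S = 2.0   # how often a scheduler would call poll_once()
FRESHNESS_TTL_S = 5.0   # older observations are treated as tunnel_up=false


@dataclass
class PeerState:
    tunnel_up: bool = False
    seen_at: float = float("-inf")  # never observed yet


class PeerPoller:
    def __init__(self, peers, fetch_state, clock=time.monotonic):
        self._fetch = fetch_state
        self._clock = clock
        self._states = {peer: PeerState() for peer in peers}

    def poll_once(self) -> None:
        """Record each peer's tunnel state; unreachable means down."""
        for peer, state in self._states.items():
            try:
                state.tunnel_up = bool(self._fetch(peer))
            except OSError:
                state.tunnel_up = False  # cannot see you, assume no help
            state.seen_at = self._clock()

    def any_peer_tunnel_up(self) -> bool:
        """True only if some peer reported tunnel_up within the TTL."""
        now = self._clock()
        return any(
            state.tunnel_up and now - state.seen_at <= FRESHNESS_TTL_S
            for state in self._states.values()
        )
```

With a two-second interval and a five-second TTL, a peer has to miss two consecutive polls before its last good answer ages out.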

The decision tree:

if request arrived with peer-origin header:
    handle_locally_only()          # came via peer-forward, must not re-forward

elif local_tunnel_up:
    forward_to_cloud()             # normal path

elif peer_poller.any_peer_tunnel_up():
    proxy_to_peer()                # LAN round-trip via mTLS to a healthy peer

elif request carries EAP-Message AND local EAP armed:
    drive_local_handshake()        # fresh EAP-Success bound to this session

elif request matches a cached MAC:
    serve_from_cache_filtered()    # cached policy attrs, EAP TLVs stripped

else:
    unknown_mac_policy()           # deny/permit/vlan_override

Each branch is operationally visible in the journal so the operator can see which path served any given auth.
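For readers who want to test the ordering, the tree can be written as a pure function. This is a sketch under assumptions: the Request fields and the choose_path signature are invented for the example, and the returned strings are just the branch names from the tree.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    mac: str
    has_eap_message: bool = False
    peer_origin: bool = False  # set when the request arrived via peer-forward


def choose_path(req: Request, *, local_tunnel_up: bool,
                any_peer_tunnel_up: bool, local_eap_armed: bool,
                cached_macs: frozenset) -> str:
    """Return the name of the branch that should serve this request."""
    if req.peer_origin:
        return "handle_locally_only"       # must never be re-forwarded
    if local_tunnel_up:
        return "forward_to_cloud"          # normal path
    if any_peer_tunnel_up:
        return "proxy_to_peer"             # LAN round-trip to a healthy peer
    if req.has_eap_message and local_eap_armed:
        return "drive_local_handshake"     # fresh EAP-TLS session
    if req.mac in cached_macs:
        return "serve_from_cache_filtered"
    return "unknown_mac_policy"
```

The order matters: the peer-origin check comes first so that two Edges with stale views of each other cannot bounce a request back and forth.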

The failure modes that are not failures

Lab validation

Two Edges (.34 and .45) on the same LAN, both on the same software version. Cisco Catalyst 9000 switch with dot1x + MAB configured on multiple ports. Windows test laptop on g1/0/2 with a client cert signed by the tenant CA. A second non-dot1x device on g1/0/3 for MAB.

Baseline: both Edges healthy

ARBITER-SW1#show access-session int g1/0/2 de
            Interface:  GigabitEthernet1/0/2
          MAC Address:  98fa.9b23.d8d1
            User-Name:  host/windows-test-laptop
               Status:  Authorized
               Domain:  DATA

Server Policies:
           Vlan Group:  Vlan: 21

Method status list:
       Method           State
        dot1x           Authc Success

Auth flows normally: switch to Edge to cloud to policy, VLAN 21 returned in the Access-Accept tunnel triplet. Edge journal shows the round-trip:

auth fwd: auth rid=124 mac=98fa9b23d8d1 from=192.168.0.21:5249
auth rpy: ACCEPT  rid=124 mac=98fa9b23d8d1 to=192.168.0.21:5249

Both tunnels down, dot1x via local EAP-TLS

Drop both tunnels with nft add rule inet filter output tcp dport 2083 counter reject on each Edge. Trigger a reauth on the switch. The Edge that receives the request drives a full local handshake:

21:23:34  INFO  local EAP-TLS termination armed
21:23:34  INFO  local EAP: served 64 bytes (live sessions=1)         # Start
21:23:41  INFO  local EAP: served 1100 bytes (live sessions=2)       # Server flight
21:23:41  INFO  local EAP: served 604 bytes (live sessions=2)        # cert + done
21:23:41  INFO  local EAP: served 60 bytes (live sessions=1)         # Access-Accept
21:23:41  INFO  local EAP: Access-Accept peer_cn='host/windows-test-laptop'
                          policy_attrs=16B reason='cert PASS'

Four replies out, the last of which is a fresh Access-Accept. The 60-byte payload is the Accept: RADIUS header (20) + EAP-Message attribute carrying EAP-Success (6) + VLAN tunnel triplet (16) + Message-Authenticator (18). Math checks out.
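The byte count can be reproduced with a short sketch that assembles an Access-Accept of the same shape and signs it per RFC 3579 section 3.2 and RFC 2865. The function name, attribute order, identifiers and shared secret are assumptions for illustration; only the attribute set and the sizes come from the trace above.

```python
import hashlib
import hmac
import struct

ACCESS_ACCEPT = 2
EAP_SUCCESS = 3


def attr(type_code: int, value: bytes) -> bytes:
    """Encode one RADIUS attribute: type, length (incl. 2-byte header), value."""
    return struct.pack("!BB", type_code, len(value) + 2) + value


def build_accept(radius_id: int, eap_id: int, request_auth: bytes,
                 secret: bytes, vlan: str) -> bytes:
    eap_success = struct.pack("!BBH", EAP_SUCCESS, eap_id, 4)
    attrs = (
        attr(79, eap_success)                          # EAP-Message
        + attr(64, b"\x00" + (13).to_bytes(3, "big"))  # Tunnel-Type = VLAN
        + attr(65, b"\x00" + (6).to_bytes(3, "big"))   # Tunnel-Medium-Type = IEEE-802
        + attr(81, vlan.encode("ascii"))               # Tunnel-Private-Group-ID
        + attr(80, bytes(16))                          # Message-Authenticator, zeroed
    )
    head = struct.pack("!BBH", ACCESS_ACCEPT, radius_id, 20 + len(attrs))
    # HMAC-MD5 over the packet with the Request Authenticator in place and
    # the Message-Authenticator value zeroed, keyed by the shared secret.
    mac = hmac.new(secret, head + request_auth + attrs, hashlib.md5).digest()
    attrs = attrs[:-16] + mac
    # Response Authenticator covers the final attributes plus the secret.
    response_auth = hashlib.md5(head + request_auth + attrs + secret).digest()
    return head + response_auth + attrs


packet = build_accept(124, 9, bytes(range(16)), b"example-secret", "21")
print(len(packet))  # 20 + 6 + 16 + 18
```

Because both authenticators are derived from the Request Authenticator of the Access-Request being answered, neither can be lifted from an older reply.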

The switch's view:

ARBITER-SW1#show access-session int g1/0/2 de
               Status:  Authorized

Server Policies:
           Vlan Group:  Vlan: 21

Method status list:
       Method           State
        dot1x           Authc Success

Port authorised by the certificate (no cloud round-trip) at the same VLAN the cloud's policy engine would have returned. From the customer's perspective the outage is invisible at the auth layer.

MAB during the same outage

On g1/0/3, a non-dot1x device hits MAB during the same outage window. The Edge serves the filtered cache attrs:

21:26:00  INFO  offline auth: served mac=f8c65049d672 code=2 (20 bytes)
                                       from cache to 192.168.0.21
21:26:00  INFO  offline acct: ACKed (code=5) ... -- keeping NAS
                                       server-health green during outage
ARBITER-SW1#show access-session int g1/0/3 de
               Status:  Authorized
Method status list:
       Method           State
        dot1x           Stopped
          mab           Authc Success

Same outage, same Edge, two auth paths, both working. Laptops using dot1x and devices using MAB continued authenticating through the outage without operator intervention.

Multi-Edge HA: A forwards to B

With A's tunnel down and B's tunnel up, an auth that the switch sends to A is forwarded over the LAN peer link to B, B services it via its own cloud tunnel and A relays the reply back to the NAS. End-to-end latency is dominated by the cloud round-trip; the LAN hop adds single-digit milliseconds. The peer-origin header on the inbound at B blocks B from ever forwarding it back to A.

From the switch's perspective, A answered normally. No timeout, no retransmission, no failover increment, no dead-time entry. Both Edges stay UP. The lab traces for the proxy path will be published in a follow-up once the soak round completes.

Audit replay

After both tunnels restore, the offline events drain to the cloud within seconds:

tunnel connected to 46.225.164.231:2083
offline audit drain: replayed=3 failed=0

On Core:

SELECT mac_address, result, served_by_edge_id, event_timestamp
  FROM auth_log
 WHERE served_offline = TRUE
 ORDER BY event_timestamp DESC LIMIT 10;

This shows every locally served auth from the outage window, attributable to the Edge that handled it. The audit-log hole closes within seconds of the tunnel coming back.

What the lab taught us that the design didn't

Three bugs worth recording.

1. Authority-Key-Identifier strictness

Modern OpenSSL strict-validates the X.509 chain including AKI/SKI consistency. Our unit-test PKI fixture didn't include either; tests passed because we'd set check_hostname=False. Lab pairing failed because production validation is strict. Fix: real AKI/SKI on minted certs, plus a Core-side EKU guarantee (TLS Web Server Authentication) on issued server certs.

2. Logging propagation

The daemon configures a named logger "edge_client" with propagate=False. Helper modules used getLogger(__name__) and produced loggers named after their files, outside the "edge_client" hierarchy; those records propagated to the root logger, which had no handler. Result: peer modules' INFO startup lines never reached journald even though the listeners were bound. One-line fix: getLogger("edge_client.peer_server"). The kind of thing you only catch by actually running the software.
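The behaviour is easy to reproduce with the standard logging module alone. In this sketch a list-backed handler stands in for the journald handler, and the bare logger name "peer_server" stands in for whatever __name__ resolved to in the helper module; both are assumptions for the demo.

```python
import logging

records = []


class ListHandler(logging.Handler):
    """Stand-in for the journald handler: remembers which loggers got through."""

    def emit(self, record: logging.LogRecord) -> None:
        records.append(record.name)


daemon_log = logging.getLogger("edge_client")
daemon_log.setLevel(logging.INFO)
daemon_log.propagate = False
daemon_log.addHandler(ListHandler())

# Before the fix: a logger named after the module sits outside the
# "edge_client" hierarchy, so the daemon's handler never sees it.
logging.getLogger("peer_server").info("peer listener bound")

# After the fix: a child of "edge_client" inherits its level and handler.
logging.getLogger("edge_client.peer_server").info("peer listener bound")

print(records)
```

Only the dotted child name reaches the handler, because logger names form the hierarchy and module file names do not automatically fall under it.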

3. The cached-attrs Frankenstein

An early implementation of policy-attribute grafting copied the ENTIRE cached Access-Accept attribute blob onto the locally-minted Access-Accept, including stale EAP-Message, MS-MPPE keys encrypted under a long-gone request authenticator and a stale Message-Authenticator. The result was a 226-byte malformed reply Cisco refused (visible on the switch as the incorrect counter on show aaa servers incrementing on every offline-EAP reply).

The fix: a filter that keeps only RFC-standard policy attributes (Tunnel-*, Filter-Id, Reply-Message, timeouts) and drops everything else, including all Vendor-Specific Attributes. We default to vendor-neutral attributes per RADIUS standards. Customers using Cisco AVPair-style policy need a separate code path which we will add when a real customer needs it.
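A filter of that kind can be sketched as an allow-list over RADIUS attribute types. The exact list below is an assumption for illustration (the tunnel triplet, Filter-Id, Reply-Message and two timeout attributes); the shipped allow-list may differ.

```python
import struct

# Hypothetical allow-list of standard policy attributes (RFC 2865 / RFC 2868).
KEEP = {
    11: "Filter-Id",
    18: "Reply-Message",
    27: "Session-Timeout",
    28: "Idle-Timeout",
    64: "Tunnel-Type",
    65: "Tunnel-Medium-Type",
    81: "Tunnel-Private-Group-ID",
}


def encode_attr(type_code: int, value: bytes) -> bytes:
    return struct.pack("!BB", type_code, len(value) + 2) + value


def parse_attrs(blob: bytes):
    """Yield (type, value) pairs, rejecting truncated or malformed input."""
    offset = 0
    while offset < len(blob):
        if offset + 2 > len(blob):
            raise ValueError("truncated attribute header")
        type_code, length = blob[offset], blob[offset + 1]
        if length < 2 or offset + length > len(blob):
            raise ValueError("bad attribute length")
        yield type_code, blob[offset + 2:offset + length]
        offset += length


def filter_policy_attrs(cached_blob: bytes) -> bytes:
    """Keep allow-listed attributes; drop EAP, MPPE keys, VSAs and the rest."""
    return b"".join(
        encode_attr(type_code, value)
        for type_code, value in parse_attrs(cached_blob)
        if type_code in KEEP
    )
```

An allow-list fails safe: anything session-bound that is not explicitly listed, such as EAP-Message (79), Message-Authenticator (80) or Vendor-Specific (26), is dropped by default.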

The size delta is the smoking gun: 60 bytes after the filter, 226 bytes before. The supplicant's view of that reply was the difference between "valid EAP-Success, port authorised" and "garbage, fail closed".

Where this sequence goes next

Local EAP-TLS termination, MAB cache replay, multi-Edge peer awareness and audit replay are shipped and lab-validated. The sequencing for the rest of the offline story is straightforward:

Open questions

References

Customer-deployed Edge components are reviewable by customers under NDA. The hosted control plane remains proprietary. Reach out if you'd like a deeper dive into any of the above.