Chasing an MTU Ghost Through Three Vendors
PPSK handshake failure on cloud-managed APs. A fragmented TLS certificate story.
When you work in enterprise IT, you often have to rely on third parties for information and support, even when you're more capable than the tech on the other end of the phone. You'll find yourself trying solutions before you understand the problem, because when the company is losing business you need things working now, whether or not you understand why they're broken.
This issue could have ended up like that, but the best of us can't settle for a lack of understanding. We can't settle for a band-aid. We need to know why, how, and what.
A new retail inventory solution's Android-based hand scanners refused to connect to our PPSK-managed private Wi-Fi network at numerous locations. Attempting to join a client device to the network with its device-specific password returned an authentication error.
As is often the case, I got handed the ticket after the usual solutions (factory reset, fresh configuration, etc.) had failed. In fact, the previous troubleshooting had failed so badly that not only was this specific PPSK device still not connecting, we could no longer push configuration from our cloud portal at all.
All I was given was a description of the issue, what had caused the new failure (a factory reset of the AP), and a screenshot of a vague error message thrown in the cloud.
I often approach problems by skipping down as many layers as possible to bypass the most annoying failures, especially when I don't have permissions on the higher-layer infrastructure. In this case I bypassed my lack of cloud access with Gemini-assisted research of the AP documentation to dig up default SSH credentials and get into the console directly.
I tested the very basics first: DNS, then a traceroute out to the AWS server hosting our cloud portal. Everything checked out. From there I ran the same command my supervisor had been attempting to push from the cloud GUI, this time on the much more reliable CLI with verbose output. The result: an SSL error. The config push was failing at the TLS handshake with the cloud; the cloud couldn't validate that it was talking to the AP securely. The next step was to figure out why.
There were many links in the chain, each behind its own fog of vendor support: our AP vendor (whose cloud portal I didn't have access to), our managed network service provider (MNSP), and our ISP, where you either get lucky with an outstanding tech or wait three weeks for a senior engineer.
Without going into too much detail, I got unlucky with each link. They either wasted my time on basic troubleshooting I had already ruled out, or threw me down silly rabbit holes the next tech would reverse: one ISP tech turned off bridge mode (which I'd just enabled to rule out double NAT), the next said "no, pass-through," the next said neither, and the fourth agreed with me and kept it on bridge mode.
Eventually I brought the AP to a nearby site with a working AP, just to rule out quirks. Still nothing. Finally my supervisor mentioned something I wish I'd known from the start: when they provision APs in the corporate office they have to put them on a specific policy that completely bypasses our MNSP. Plugged directly into the modem, the config pushed without issue.
I called the MNSP and asked what we could do to bypass our ISP from their end. The answer was simple: fail the firewall over to cellular, keeping the MNSP route but cutting the ISP out of the path. Immediately, the scanner connected. And the kicker: it stayed connected after we failed back to the primary ISP, confirming that the initial handshake, not steady-state traffic, was to blame.
Now this is where we're often urged to stop. We've found a way around it, we can do this at other sites, we're done. But I didn't agree with that assessment. PPSK handshakes don't last forever. Are we really going to keep half a dozen sites routinely failing over to cellular just to keep our inventory scanners online? And what if this issue comes up again, on something more critical?
I was through relying on vendors. After a few YouTube crash courses, some Claude-assisted run-book docs, and my notes from the night before, I was confident today would be the day.
First, a wireless capture of the handshake. As expected, it failed after Message 2. The AP couldn't verify the MIC from the client. The problem was that every time a new client attempts a connection, the AP needs to look up that client's PSK from the cloud (over RADsec) before it can verify the Message 2 MIC. No successful RADsec session, no PSK lookup, no client connection.
Next, a wired capture between the AP and the switch, filtered to TCP/2083. The failing exchange: a TLS certificate from the AP to the cloud totaling 3570 bytes across three frames (1514 + 1514 + 542). The response from the cloud? A duplicate ACK. The cloud was still listening, but whatever it was expecting, this wasn't it.
Now, for the network engineers out there: you might be screaming. I had the same thought, just a day earlier. The MNSP and I had already verified the WAN MTU and MSS clamp. The WAN MTU was a non-standard 1492, and the ISP PMTU test came back with a largest successful payload of 1464, validating 1492/1452.
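For anyone who wants the header math spelled out, here's a quick Python sketch of the arithmetic behind that validation, using the values from the tests above:

```python
# PMTU arithmetic: the largest DF-set ping payload that gets through
# tells you the path MTU, which in turn fixes the correct TCP MSS.

ICMP_HEADER = 8   # ICMP echo header bytes
IP_HEADER = 20    # IPv4 header bytes (no options)
TCP_HEADER = 20   # TCP header bytes (no options)

def path_mtu_from_ping(payload: int) -> int:
    """Path MTU implied by the largest successful DF-set ping payload."""
    return payload + ICMP_HEADER + IP_HEADER

def mss_for_mtu(mtu: int) -> int:
    """Correct TCP MSS for a given IP MTU."""
    return mtu - IP_HEADER - TCP_HEADER

print(path_mtu_from_ping(1464))  # 1492 -> matches the verified WAN MTU
print(mss_for_mtu(1492))         # 1452 -> the clamp the MNSP should enforce
```

A 1464-byte payload plus 28 bytes of ICMP and IP headers is exactly 1492, and 1492 minus 40 bytes of IP and TCP headers gives the 1452 MSS clamp.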
So now I knew the WAN MTU was correct, the AP could handshake over MNSP routes without our ISP involved, and the certificate exchange involved significantly larger packets than other traffic on the same route. Hmmm...
I asked Claude how I could verify the MSS myself. The breakthrough came not from the failures but from the successes: the AP and cloud could talk fine when certificates weren't involved. Their SYN packets reported a bidirectional MSS of 1460, eight bytes more than the MNSP's 1452. Both endpoints thought they could send payloads sized for a 1460-byte hole, then tried to shove them through a 1452-byte hole.
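That eight-byte overshoot is plain header arithmetic; a quick sketch with the numbers from the capture:

```python
# Why an 8-byte MSS mismatch breaks large transfers: each full-sized
# segment becomes an IP packet 8 bytes bigger than the WAN can forward.

IP_TCP_OVERHEAD = 40   # 20-byte IPv4 header + 20-byte TCP header

advertised_mss = 1460  # negotiated in the SYNs (implies local MTU 1500)
wan_mtu = 1492         # what the WAN interface can actually carry

packet_size = advertised_mss + IP_TCP_OVERHEAD  # full-sized IP packet
overflow = packet_size - wan_mtu

print(packet_size)  # 1500
print(overflow)     # 8 -> every full segment is 8 bytes too big
```

Small transfers never fill a segment, which is why keepalives and light traffic sailed through while the certificate exchange died.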
Smoking gun. The DF bit wasn't set, so the WAN was fragmenting them, which is a no-no for internet-bound secure traffic. I called the MNSP back. A technician I know well 100% confirmed the values, reset them to be 101% sure, and ran another test. Still failing. Crestfallen, but then it came to me.
The thing you don't normally think about adjusting on a VLAN is its MTU, at least not outside a data center. In a moment of desperation-turned-brilliance, I asked the tech whether it was possible to adjust the MTU on the VLAN the AP sat behind.
There was a pause. "I've never done that before but I'm certainly willing to try for you." He verified the VLAN had a standard 1500 MTU, different from the 1492 WAN. Then he lowered it to match.
SUCCESS. Instantaneously, the client device connected. I was momentarily speechless. I'd gone on such a deep dive I was barely prepared to have reached the bottom.
As cool as it was to have solved the issue with the VLAN MTU, the real problem was the MSS clamp not applying correctly. That was a question for MNSP engineers, not me. But getting a senior engineer at a massive national MNSP to fix an issue in their firewall, reported by one of their customers' support engineers? That's where my technical writing came in. I put together a detailed technical report covering the issue, the solution, and my recommendation for a permanent fix; a version of it is the companion technical brief to this post.
My takeaway, when I give advice to new technicians or colleagues: respect the vendors (they can point out the obvious sometimes) but don't be afraid to dig up your own evidence. It's the best there is.
Companion technical report to the narrative blog post.
Android-based retail inventory scanners could not authenticate to a PPSK-managed wireless network at multiple retail sites. A parallel failure prevented initial configuration push from the cloud controller to any factory-reset AP at the same sites. Both failures traced to a single root cause: the AP's VLAN on the edge gateway used a standard 1500-byte MTU, while the gateway's WAN interface operated at 1492. The MSS clamp that would normally correct the mismatch was not being applied to this segment, so TLS certificate traffic between the AP and the cloud controller was fragmented at the WAN and silently dropped upstream. Lowering the VLAN MTU to match the WAN (1492) restored service immediately. A permanent fix (correcting the MSS clamp scope on the MNSP-managed firewall) was recommended to the MNSP's senior network engineering team.
| Component | Detail |
|---|---|
| Access point | Cloud-managed AP, PPSK-enabled |
| Cloud management | Cloud controller, RADsec over TCP/2083 |
| MNSP | Managed network service provider; operates the edge gateway and the AP VLAN at site |
| ISP | Cable ISP circuit; modem in bridge mode |
| Client | Android-based handheld inventory scanners (new retail inventory solution) |
| Primary affected site | Site A |
| Secondary validated site | Site B |
Two related failures observed at Site A:

- New PPSK clients (the Android inventory scanners) failed authentication when joining the wireless network with their device-specific passwords.
- Factory-reset APs could not receive an initial configuration push from the cloud controller.
A wireless capture of the PPSK handshake confirmed that the 4-way handshake broke down after Message 2. The AP was unable to verify the MIC from the client. Because MIC validation on Message 2 requires the AP to hold the correct PSK for the client's MAC, this indicates the AP could not retrieve or validate the client's PPSK entry from the cloud controller at association time (a RADsec-dependent lookup). The wireless handshake failure was a downstream symptom of a failed cloud auth lookup.
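To make the dependency concrete, here is a minimal Python sketch of why MIC verification requires the PSK. The SSID and passphrases are hypothetical, and the real derivation also mixes the ANonce, SNonce, and both MAC addresses into the PTK; this models the MIC directly off the PMK for brevity:

```python
import hashlib
import hmac

def pmk(passphrase: str, ssid: str) -> bytes:
    # Standard WPA2-PSK mapping: PBKDF2-HMAC-SHA1, 4096 iterations, 32 bytes.
    return hashlib.pbkdf2_hmac("sha1", passphrase.encode(), ssid.encode(), 4096, 32)

def mic(key: bytes, eapol_frame: bytes) -> bytes:
    # Message 2 carries an HMAC over the EAPOL frame, keyed by material
    # derived from the PSK (the KCK); modeled directly off the PMK here.
    return hmac.new(key, eapol_frame, hashlib.sha1).digest()[:16]

frame = b"eapol-message-2"
client_mic = mic(pmk("per-device-psk", "StoreSSID"), frame)

# With the correct PSK the AP reproduces the client's MIC; with no PSK
# (failed RADsec lookup) or the wrong one, verification fails and the
# handshake stops exactly where the capture showed: after Message 2.
print(hmac.compare_digest(client_mic, mic(pmk("per-device-psk", "StoreSSID"), frame)))   # True
print(hmac.compare_digest(client_mic, mic(pmk("some-other-psk", "StoreSSID"), frame)))   # False
```

The point of the sketch: the MIC is a keyed hash, so an AP with no PSK on file for the client's MAC has nothing to key the verification with.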
With RADsec running over TCP/2083, the capture was filtered to that flow. The relevant exchange before failure was a TLS certificate transmission from the AP to the cloud controller totaling 3570 bytes across three frames: 1514 + 1514 + 542. The cloud responded with a Dup ACK, indicating the certificate frames were not received properly. This was the only traffic on the segment with a payload of this magnitude; ordinary RADsec keepalives and small-payload traffic had no issue.
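Those frame sizes are exactly what MSS-1460 segmentation predicts; a quick Python sketch, where the 3408-byte payload figure is inferred from the 3570 bytes on the wire minus three 54-byte header sets:

```python
# Reconstructing the captured frame sizes from the advertised MSS.
HEADERS = 54  # 20-byte IPv4 + 20-byte TCP + 14-byte Ethernet per frame
MSS = 1460    # advertised in both SYNs

def frames_for_payload(payload: int, mss: int = MSS) -> list[int]:
    """Ethernet frame sizes produced by segmenting a TCP payload."""
    sizes = []
    while payload > 0:
        segment = min(mss, payload)
        sizes.append(segment + HEADERS)
        payload -= segment
    return sizes

# The certificate payload segments into the three captured frames.
print(frames_for_payload(3408))       # [1514, 1514, 542]
print(sum(frames_for_payload(3408)))  # 3570 bytes on the wire
```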
TLS certificate exchanges are unusually large and are a well-known pressure point for MTU misconfiguration. An MTU issue was hypothesized and then validated through layered testing.
The AP is attached to a VLAN with an MTU of 1500 on the edge gateway. With no MSS clamp applied to traffic originating from this VLAN, the AP and the cloud controller completed the TCP three-way handshake advertising MSS 1460 in both directions. When the AP then transmitted its ~3570-byte TLS certificate chain to the cloud controller (RADsec uses mutual TLS), the segments left the AP at 1500-byte IP size and arrived at the gateway's WAN interface, which can only transmit 1492-byte IP packets. The DF bit was not set, so the gateway fragmented the oversized packets rather than returning an ICMP Fragmentation Needed. The fragmented TLS frames were then dropped upstream, consistent with common behavior at internet-facing load balancers that discard IP fragments as a hardening measure. The certificate transmission never completed, so the RADsec session never established. Without a successful RADsec session, the AP's PPSK database could not be refreshed, which caused the wireless 4-way handshake to fail at Message 2 on any new client attempt.
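The fragmentation step can be sketched with the standard IPv4 rules (non-final fragment payloads must be multiples of 8 bytes):

```python
# IPv4 fragmentation of a DF-clear packet at a smaller-MTU hop.
IP_HEADER = 20

def fragment(total_len: int, mtu: int) -> list[int]:
    """IP total lengths of the fragments produced for one packet."""
    data = total_len - IP_HEADER
    per_frag = (mtu - IP_HEADER) // 8 * 8  # payload per non-final fragment
    sizes = []
    while data > 0:
        chunk = min(per_frag, data)
        sizes.append(chunk + IP_HEADER)
        data -= chunk
    return sizes

# Each full-sized 1500-byte packet from the AP splits at the 1492-byte WAN
# into a maximal fragment plus a 28-byte runt, which upstream then discards.
print(fragment(1500, 1492))  # [1492, 28]
```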
Immediate, per-site remediation: lower the MTU on the AP's VLAN to match the WAN MTU (1492). This brings the negotiated MSS down to 1452, matching what the WAN can actually forward, and eliminates fragmentation of the TLS certificate exchange. Validated at Site A by resolving the original PPSK failure. Validated at Site B by reproducing the fix on a second site with the same architecture.
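The remediation can be checked with the same header arithmetic: lowering the VLAN MTU lowers the MSS each endpoint advertises, so the largest packet either side emits now fits the WAN. A sketch:

```python
# Verifying the per-site remediation arithmetic.
IP_TCP_OVERHEAD = 40  # 20-byte IPv4 header + 20-byte TCP header
WAN_MTU = 1492

def fits_wan(vlan_mtu: int) -> bool:
    mss = vlan_mtu - IP_TCP_OVERHEAD  # MSS the endpoints will advertise
    packet = mss + IP_TCP_OVERHEAD    # largest IP packet they will emit
    return packet <= WAN_MTU

print(fits_wan(1500))  # False -> original VLAN MTU: fragmented at the WAN
print(fits_wan(1492))  # True  -> remediated VLAN MTU: forwarded intact
```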
The VLAN MTU adjustment is a workaround that relies on each VLAN being correctly sized for its WAN. The underlying defect is that the MSS clamp on the edge gateway is not being applied to all traffic on this segment, specifically not to RADsec / TLS traffic on TCP/2083 originating from the AP VLAN. Recommendations to the MNSP senior engineering team:
(All wired captures referenced in this report were filtered to the RADsec flow with the Wireshark display filter `tcp.port == 2083`.)