Field Notes · 2023 – Present

Case studies.

A working archive of problems I've solved in the field — from stubborn network handshakes to multi-site rollouts. Each case is available in two voices.

Blog version: The story, in plain English. How it felt, what I tried, where I got stuck, and the shape of the breakthrough. Written for anyone who enjoys a good chase.
Technical brief: The report, in engineer's shorthand. Environment, packet captures, root cause, and the permanent-fix recommendation. Written for the next person to pick up the ticket.

Chasing an MTU Ghost Through Three Vendors

PPSK handshake failure on cloud-managed APs. A fragmented TLS certificate story.

Retail · 2024
Network · RADsec

Chasing an MTU Ghost Through Three Vendors

Companion technical report available via the toggle above.

When you work in enterprise IT you often need to rely on third parties for information and support, even when you're more capable than the tech on the other end of the phone. You'll find yourself trying solutions before you understand the problem, because when the company is losing business you need it working now, whether you understand why it's broken or not.

This issue could've ended up like that, but the best of us can't settle for a lack of understanding. We can't settle for a band-aid. We need to know why, how, and what.

The issue

A new retail inventory solution's Android-based handheld scanners refused to connect to our PPSK-managed private Wi-Fi network at numerous locations. Every attempt to connect a device with its device-specific password returned an authentication error.

As is often the case, I got handed the ticket after the usual solutions (factory reset, fresh configuration, etc.) failed. In fact, the previous troubleshooting had failed so badly that not only was this specific PPSK device still not connecting, but we could no longer push configuration from our cloud portal at all.

All I was given was a description of the issue, what had caused the new failure (a factory reset of the AP), and a screenshot of a vague error message from the cloud portal.

Digging past vendor support

I often approach problems by skipping down as many layers as possible to bypass the most annoying failures, especially when I don't have permissions on the higher-layer infrastructure. In this case I bypassed my lack of cloud access with Gemini-assisted research of the AP documentation to dig up default SSH credentials and get into the console directly.

I tested the very basics first: DNS, then a traceroute out to the AWS server hosting our cloud portal. Everything checked out. From there I ran the same command my supervisor had been attempting from the cloud GUI on the much more reliable CLI, with verbose output. The result: an SSL error. The AP's secure handshake with the cloud was failing; the cloud couldn't validate that it was talking to the AP securely. The next step was figuring out why.

There were many links in the chain behind the fog of vendor support: our AP vendor (whose cloud portal I didn't have access to), our managed network service provider (MNSP), and our ISP, where you either get lucky with an outstanding tech or wait three weeks for a senior engineer.

Without going into too much detail, I got unlucky with each link. They either wasted my time on basic troubleshooting I had already ruled out, or threw me down silly rabbit holes the next tech would reverse. One ISP tech turned off bridge mode (which I'd just enabled to rule out double-NAT), the next said "no, pass-through," the next said neither, and the fourth agreed with me and kept it on bridge mode.

The workaround (and why it wasn't enough)

Eventually I brought the AP to a nearby site where another AP was working fine, just to rule out site-specific quirks. Still nothing. Finally my supervisor mentioned something I wish I'd known from the start: when they provision APs in the corporate office they have to put them on a specific policy that completely bypasses our MNSP. Plugged directly into the modem, the config pushed without issue.

I called MNSP and asked what we could do to bypass our ISP from their end. The answer was simple: fail the firewall over to cellular, keeping the MNSP route but bypassing the ISP. Immediately, the handset connected. And the kicker: it stayed connected after we failed back to the primary ISP, confirming the handshake was to blame.

Now this is where we're often urged to stop. We've found a way around it, we can do this at other sites, we're done. But I didn't agree with that assessment. PPSK handshakes don't last forever. Are we really going to keep half a dozen sites routinely failing over to cellular just to keep our inventory scanners online? And what if this issue comes up again, on something more critical?

Learning Wireshark on the fly

I was done relying on vendors. After a few YouTube crash courses, some Claude-assisted run-book docs, and my notes from the night before, I was confident today would be the day.

First, a wireless capture of the handshake. As expected, it failed after Message 2. The AP couldn't verify the MIC from the client. The problem was that every time a new client attempts a connection, the AP needs to look up that client's PSK from the cloud (over RADsec) before it can verify the Message 2 MIC. No successful RADsec session, no PSK lookup, no client connection.

Next, a wired capture between the AP and switch, filtered to TCP/2083. The failing exchange: a TLS certificate from AP to cloud totaling 3570 bytes, across three frames (1514 + 1514 + 542). Response from the cloud? Dup ACK. The cloud was still listening, but whatever it was expecting, this wasn't it.
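For the curious, those frame sizes line up exactly with the VLAN's framing. Here's a quick sketch of the arithmetic, assuming standard Ethernet II, IPv4, and TCP headers with no options on the data segments:

```python
# Sanity-checking the captured frame sizes. Assumptions: Ethernet II
# framing (14-byte header, untagged capture port) and 20-byte IPv4 +
# 20-byte TCP headers with no TCP options on the data segments.
ETH_HDR, IP_HDR, TCP_HDR = 14, 20, 20

frames = [1514, 1514, 542]      # frame sizes from the capture
total = sum(frames)             # 3570 bytes on the wire

# A 1514-byte frame is a full 1500-byte IP packet, i.e. the VLAN MTU.
ip_sizes = [f - ETH_HDR for f in frames]
tls_payloads = [f - ETH_HDR - IP_HDR - TCP_HDR for f in frames]

print(total)          # 3570
print(ip_sizes)       # [1500, 1500, 528]
print(tls_payloads)   # [1460, 1460, 488]
```

Those two 1500-byte IP packets turn out to matter a lot later.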

"MTU! MTU! MTU!"

Now, for the network engineers out there: you might be screaming. Well, I'd had the same thought, the day before all this. The MNSP and I had already verified the WAN MTU and MSS clamp. The WAN MTU was a non-standard 1492, and the ISP's PMTU test came back with a largest successful payload of 1464, validating 1492/1452.
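Those numbers do hang together, and it's worth showing why. A minimal sketch of the header arithmetic, assuming IPv4 and an ICMP echo probe (the usual way a PMTU test is run):

```python
# Why a 1464-byte ping payload validates a 1492 WAN MTU, and where the
# 1452 MSS comes from. Assumptions: IPv4 (20-byte header), ICMP echo
# (8-byte header), TCP without options (20-byte header).
IP_HDR, ICMP_HDR, TCP_HDR = 20, 8, 20

WAN_MTU = 1492
max_ping_payload = WAN_MTU - IP_HDR - ICMP_HDR   # largest DF-set echo payload
clamped_mss = WAN_MTU - IP_HDR - TCP_HDR         # the correct MSS clamp

print(max_ping_payload, clamped_mss)   # 1464 1452
```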

So now I knew the WAN MTU was correct, the AP could handshake over MNSP routes without our ISP involved, and the certificate exchange involved significantly larger packets than other traffic on the same route. Hmmm...

I asked Claude how I could verify the MSS myself. The breakthrough came not from the failures but from the successes: the AP and cloud could talk fine when certificates weren't involved. Their SYN packets reported a bidirectional MSS of 1460, eight bytes more than the MNSP's 1452. Both endpoints thought they could send payloads sized for a 1460-byte hole, then tried to shove them through a 1452-byte hole.
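Put differently, each side derived its MSS from the wrong link. A sketch of that derivation, assuming the usual IPv4 rule that MSS is the link MTU minus 40 bytes of IP and TCP headers:

```python
# Each endpoint advertises an MSS based on its local link MTU.
# Assumption: IPv4 with no TCP options, so MSS = MTU - 20 - 20.
def mss_for_mtu(mtu: int) -> int:
    return mtu - 20 - 20

vlan_mss = mss_for_mtu(1500)   # what both SYNs advertised
wan_mss = mss_for_mtu(1492)    # what the WAN could actually carry

print(vlan_mss, wan_mss, vlan_mss - wan_mss)   # 1460 1452 8
```

Eight bytes of overshoot, and only on segments big enough to notice, which is exactly why the small keepalive traffic sailed through.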

Smoking gun. The DF bit wasn't set, so the WAN was fragmenting the oversized packets, and fragments are a no-no for internet-bound secure traffic. I called MNSP back. A technician I know well confirmed the clamp values 100%, reset them to be 101% sure, and ran another test. Still failing. I was crestfallen, but then it came to me.

The VLAN

The thing you don't normally think about with a VLAN is adjusting the MTU, at least not outside a data center. In a moment of desperation-turned-brilliance, I asked the tech if it was possible to adjust the MTU on the VLAN the AP was behind.

There was a pause. "I've never done that before but I'm certainly willing to try for you." He verified the VLAN had a standard 1500 MTU, different from the 1492 WAN. Then he lowered it to match.

SUCCESS. Instantaneously, the client device connected. I was momentarily speechless. I'd gone on such a deep dive I was barely prepared to have reached the bottom.

The real fix

As cool as it was to have solved the issue with the VLAN MTU, the real problem was the MSS clamp not applying correctly. That was a question for MNSP engineers, not me. But getting a senior engineer at a massive national MNSP to fix an issue in their firewall, reported by one of their customers' support engineers? That's where my technical writing came in. I put together a detailed technical report of the issue, the solution, and my recommendation for a permanent fix, a version of which is the companion technical brief to this post.

My takeaway, when I give advice to new technicians or colleagues: respect the vendors (they can point out the obvious sometimes) but don't be afraid to dig up your own evidence. It's the best there is.


PPSK Handshake Failure on Cloud-Managed APs: MTU Mismatch on Managed VLAN

Companion technical report to the narrative blog post.

Summary

Android-based retail inventory scanners could not authenticate to a PPSK-managed wireless network at multiple retail sites. A parallel failure prevented initial configuration push from the cloud controller to any factory-reset AP at the same sites. Both failures traced to a single root cause: the AP's VLAN on the edge gateway used a standard 1500-byte MTU, while the gateway's WAN interface operated at 1492. The MSS clamp that would normally correct the mismatch was not being applied to this segment, so TLS certificate traffic between the AP and the cloud controller was fragmented at the WAN and silently dropped upstream. Lowering the VLAN MTU to match the WAN (1492) restored service immediately. A permanent fix (correcting the MSS clamp scope on the MNSP-managed firewall) was recommended to the MNSP's senior network engineering team.

Environment

Component | Detail
Access point | Cloud-managed AP, PPSK-enabled
Cloud management | Cloud controller, RADsec over TCP/2083
MNSP | Managed network service provider; operates the edge gateway and the AP VLAN at site
ISP | Cable ISP circuit; modem in bridge mode
Client | Android-based handheld inventory scanners (new retail inventory solution)
Primary affected site | Site A
Secondary validated site | Site B

Problem Statement

Two related failures observed at Site A:

  1. PPSK handshake failure. New client devices received an authentication error when attempting to associate with their device-specific PPSK password.
  2. Initial configuration push failure. After on-site factory reset, the cloud controller could not push initial configuration to the AP.

Initial Observations

Wireless Packet Capture (AP ↔ Client)

A wireless capture of the PPSK handshake confirmed that the 4-way handshake broke down after Message 2. The AP was unable to verify the MIC from the client. Because MIC validation on Message 2 requires the AP to hold the correct PSK for the client's MAC, this indicates the AP could not retrieve or validate the client's PPSK entry from the cloud controller at association time (a RADsec-dependent lookup). The wireless handshake failure was a downstream symptom of a failed cloud auth lookup.

Wired Packet Capture (AP ↔ Switch)

With RADsec running over TCP/2083, the capture was filtered to that flow. The relevant exchange before failure was a TLS certificate transmission from the AP to the cloud controller totaling 3570 bytes across three frames: 1514 + 1514 + 542. The cloud responded with duplicate ACKs, indicating the certificate frames were never received intact. This was the only traffic on the segment with a payload of this magnitude; ordinary RADsec keepalives and small-payload traffic had no issue.

Hypothesis: MTU Mismatch

TLS certificate exchanges are unusually large and are a well-known pressure point for MTU misconfiguration. An MTU issue was hypothesized and then validated through layered testing.

Validation

  1. WAN MTU confirmation. The edge gateway's WAN interface MTU was confirmed at 1492 by the MNSP. A PMTU test produced a largest successful payload of 1464 bytes, validating MTU 1492 and the derived MSS of 1452.
  2. SYN inspection. SYN packets between the AP and the cloud controller reported a bidirectional MSS of 1460, 8 bytes higher than the 1452 required for the WAN. This indicated the MSS clamp was not being applied to AP-originated traffic.
  3. VLAN MTU inspection. On a second call, the MNSP re-confirmed the gateway's WAN values, then checked the MTU on the AP VLAN. The VLAN MTU was a standard 1500, with a corresponding MSS of 1460, matching exactly what the AP and the cloud controller had negotiated.
  4. Fix applied. The AP VLAN MTU was lowered to match the WAN (1492). The client device connected to the PPSK network immediately.

Root Cause

The AP is attached to a VLAN with an MTU of 1500 on the edge gateway. With no MSS clamp applied to traffic originating from this VLAN, the AP and the cloud controller completed the TCP three-way handshake advertising MSS 1460 in both directions. When the AP then transmitted its ~3570-byte TLS certificate chain to the cloud controller (RADsec uses mutual TLS), the segments left the AP at 1500-byte IP size and arrived at the gateway's WAN interface, which can only transmit 1492-byte IP packets. The DF bit was not set, so the gateway fragmented the oversized packets rather than returning an ICMP Fragmentation Needed. The fragmented TLS frames were then dropped upstream, consistent with common behavior at internet-facing load balancers that discard IP fragments as a hardening measure. The certificate transmission never completed, so the RADsec session never established. Without a successful RADsec session, the AP's PPSK database could not be refreshed, which caused the wireless 4-way handshake to fail at Message 2 on any new client attempt.
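To make the fragmentation step concrete, here is a sketch of how IPv4 splits a full-MTU packet at the smaller WAN link, assuming 20-byte IP headers and the RFC 791 rule that every fragment except the last carries a payload that is a multiple of 8 bytes:

```python
# How the gateway fragments a 1500-byte IP packet onto a 1492-byte WAN.
# Assumptions: IPv4, 20-byte header, no IP options; non-final fragment
# payloads are multiples of 8 bytes per RFC 791.
IP_HDR = 20

def fragment(packet_len: int, link_mtu: int) -> list[int]:
    payload = packet_len - IP_HDR
    max_chunk = (link_mtu - IP_HDR) // 8 * 8   # 1472 for a 1492 link
    frags = []
    while payload > 0:
        chunk = min(payload, max_chunk)
        frags.append(IP_HDR + chunk)           # each fragment gets its own header
        payload -= chunk
    return frags

# One full-MTU TLS segment becomes a 1492-byte fragment plus a 28-byte
# trailer fragment; if either is dropped upstream, the whole segment is lost.
print(fragment(1500, 1492))   # [1492, 28]
```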

Resolution

Immediate, per-site remediation: lower the MTU on the AP's VLAN to match the WAN MTU (1492). This brings the negotiated MSS down to 1452, matching what the WAN can actually forward, and eliminates fragmentation of the TLS certificate exchange. Validated at Site A by resolving the original PPSK failure. Validated at Site B by reproducing the fix on a second site with the same architecture.

Recommended Permanent Fix

The VLAN MTU adjustment is a workaround that relies on each VLAN being correctly sized for its WAN. The underlying defect is that the MSS clamp on the edge gateway is not being applied to all traffic on this segment, specifically not to RADsec / TLS traffic on TCP/2083 originating from the AP VLAN. Recommendations to the MNSP senior engineering team:

  1. Validate at Site A by factory resetting the AP again and confirming config push completes with the VLAN MTU change in place.
  2. Validate at Site B by reducing the AP VLAN MTU to 1492 and confirming a successful PPSK handshake.
  3. Audit the WAN-to-AP-VLAN segment's firewall rules to ensure the MSS clamp applies to all traffic, so that the AP and the cloud controller receive accurate MSS values without depending on manual per-VLAN MTU tuning.
  4. Audit all other retail sites running a non-standard WAN MTU for the same class of VLAN/WAN MTU mismatch.


ozzyphantom.com / case-studies Updated 2026