When problems on your home network start affecting the family’s use of the Internet… you know you’ve got an urgent problem to fix!
At home I’ve moved exclusively to the UniFi range of networking products – from Ubiquiti Networks. To say I’m impressed is an understatement! One of the many fantastic features of this Software Defined Networking (SDN) product range is the centralised controller that manages all my network devices. I’ve configured the controller to report into the centralised console offered by Ubiquiti – which gives me a single place to access the multiple sites that I administer. It allows you to remotely administer your entire network from anywhere in the world 🙂
Approximately 2 weeks ago I noted something strange: my home network stopped reporting into the cloud console at https://unifi.ubnt.com. Initially I thought nothing of this – I was on holidays, had a scheduled restart in place to work around an obscure PPPoE issue I’d been noticing (now fixed), and wasn’t using the console on a regular basis.
When I was archiving some of my personal email I noted that I didn’t get the expected “controller available” message that is normally sent post-reboot. My VPN still worked, just not the management interface. Being in Australia, it didn’t bother me too much, so I left it as something to fix when I got back home. That’s when things got a little strange…!
I was having inconsistent problems when accessing certain sites. The most obvious one that I knew about was the connection of my UniFi Controller to the cloud portal. But then I realised that I couldn’t log into any of the forum pages from Ubiquiti either. A little troubleshooting showed me that going to unifi.ubnt.com redirected me to account.ubnt.com for authentication using their SSO engine – with a redirect back to the original site. The same pattern applied to their Community site, community.ubnt.com. Interestingly, the initial connection to the cloud or community page worked, but then I received a timeout in the browser when connecting to the account page.
Then I started to observe that a lot of other sites were having problems too. It seemed random, in so far as I could not predict which site might be affected, but if a site was unavailable, it was inaccessible on every one of my devices (we have a lot in our house!).
I opened a support case with Ubiquiti. This was VERY easy – I used the built-in chat feature that’s available from the Controller interface. Vann from the team was extremely responsive and helpful – working through the problem description, trying a few options, and then gathering log data so that he could share it internally. He raised a support case with their L2 team.
Now that’s interesting isn’t it?… While I can’t log into any of the SSO-protected services, I can raise a new support case … hmmmmmm!
Could this be HTTPS only? Hard to be sure – most of the sites I use are available on HTTPS. In fact, if it’s not an HTTPS connection I’ll often force the browser to try https by changing the address in the address bar of my browser. Something for me to think about.
Watching the behaviour, it looked like I was having dropped packets. But where – was it my laptop? my WiFi Access Point (AP)? my switch? my UniFi Security Gateway (USG)? A problem with my ISP?
- I quickly ruled out a device-specific problem – “everything” was affected the same way
- It wasn’t my APs – I have multiple and had the same problem on each
- It didn’t seem like it would be a Switch issue. Only having one makes it difficult to rule out – so I decided to leave that for last
- While unlikely, I did ask a friend of mine who works at my ISP, Eir.ie, to see if he knew of any issues – but nothing had been reported other than by me
So back to my USG – could it be a problem there? I’d been doing some remote updates of my SDN infrastructure while in Australia (I too like to live dangerously 😉 ), but without looking couldn’t recall a specific change I’d made that aligned with the loss of connectivity.
Given that I’m running BETA versions of their code, and that I had applied some recent changes, I wondered if perhaps there was some unanticipated behaviour. I searched for similar cases (on my phone, because I needed to be “out of band” from my home network to access the Ubiquiti Forum pages) and found suggestions to (a) disable the hardware offload capabilities, and/or (b) disable QUIC (Quick UDP Internet Connections) in my Chrome browser.
Neither of these worked 🙁
Isolating the traffic was the next thing I did. I identified the IP Address(es) of the “problematic” account.ubnt.com service… there were 2 of them. I then logged into my USG and performed a tcpdump on my internal interface (eth1), limiting the scope of my trace to only those IPs. I used this service because I knew that my laptop would be the only device making such a connection – so it made it easy to isolate the traffic in the trace file.
tcpdump -vvv -i eth1 -w /tmp/tcpdump-20170817-1952.cap host 126.96.36.199 or host 188.8.131.52
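If the service sits behind more than a couple of addresses, typing the filter by hand gets tedious. The following is a minimal sketch of how the same tcpdump command could be assembled from a list of addresses. The helper names (`resolve_ipv4`, `host_filter`, `tcpdump_command`) are made up for illustration, and the addresses in the usage note are documentation placeholders rather than the real ones from my trace.

```python
import shlex
import socket


def resolve_ipv4(hostname):
    """Return the sorted, de-duplicated IPv4 addresses for hostname (needs working DNS)."""
    infos = socket.getaddrinfo(hostname, 443, family=socket.AF_INET,
                               type=socket.SOCK_STREAM)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in infos})


def host_filter(addresses):
    """Build a tcpdump filter expression that matches any of the given addresses."""
    return " or ".join(f"host {addr}" for addr in addresses)


def tcpdump_command(interface, capture_file, addresses):
    """Assemble the tcpdump invocation as a shell-safe string."""
    args = ["tcpdump", "-vvv", "-i", interface, "-w", capture_file]
    args += host_filter(addresses).split()
    return " ".join(shlex.quote(arg) for arg in args)
```

For example, `tcpdump_command("eth1", "/tmp/capture.cap", ["192.0.2.10", "192.0.2.20"])` returns a command of the same shape as the one above, and `resolve_ipv4("account.ubnt.com")` could supply the address list when DNS is reachable.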
Having successfully captured the traffic, I opened it in Wireshark. The first thing that stood out was the packet fragmentation. There were also RST (reset) packets where I didn’t expect them, and duplicate ACKs (acknowledgements).
Now that was very interesting. My first thought was MTU size … could it be that I needed to update this on my USG? Only one way to find out – play with more settings!
Interestingly, there is a setting in the UniFi Controller software that is closely related to MTU. I found that the “MSS Clamping” parameter was set to 1492 – which I reduced to 1412 to force the issue and allow me to test.
Everything Worked Again!
Literally in a heartbeat, all those sites that were previously not opening – they worked! I was immediately able to re-connect my controller to the UniFi cloud service. And, perhaps most importantly, Netflix started working again #familyPriority
After discussing with another member of the Ubiquiti Forum Community (which I could now access from my laptop), I increased the MSS Clamping size to 1452 – on the basis of a default (and standard) MTU of 1500, less 8 for PPPoE, and less a further 40 for the TCP and IP headers combined. This value continues to work for me.
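The arithmetic behind that 1452 figure can be written out as a short sketch. It assumes IPv4 and TCP headers without options (20 bytes each) and the usual 8 bytes of PPPoE overhead; the function name `clamp_value` is just an illustrative label, not anything from the UniFi software.

```python
PPPOE_OVERHEAD = 8   # 6-byte PPPoE header + 2-byte PPP protocol ID
IPV4_HEADER = 20     # assumes no IP options
TCP_HEADER = 20      # assumes no TCP options


def clamp_value(ethernet_mtu=1500, link_overhead=PPPOE_OVERHEAD):
    """Largest TCP payload (MSS) that fits in one packet on the link."""
    link_mtu = ethernet_mtu - link_overhead        # 1500 - 8 = 1492 on PPPoE
    return link_mtu - IPV4_HEADER - TCP_HEADER     # 1492 - 40 = 1452


print(clamp_value())         # PPPoE link: 1452
print(clamp_value(1500, 0))  # plain Ethernet, no PPPoE: 1460
```

TCP options such as timestamps are carried inside the MSS budget, so they do not change the result for the full packet size.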
So here I am, happily able to access all the sites I could ever want to, thanks to a changed MSS value. Now – the question of why this was necessary remains. I’m working on Root Cause Analysis (RCA) with the support of the awesome team at Ubiquiti Networks! One blog post suggests a change in one of the recent USG firmware updates… makes sense, but I’ll wait for the full RCA from the support team before I draw any final conclusions!
I will write more about my experiences with UniFi in a future post.
tl;dr – I needed to reduce the MSS size on my internet gateway’s WAN interface to avoid packets being discarded so that Netflix would work again!
Update – 20170818-2111
So there was an interesting discussion on the Ubiquiti Community where one of the support engineers/developers indicated that this is not a firmware issue but a problem with the 5.6.x version of the controller. It seems that the UI wrongly defaults to 1492 for MSS clamping. So if you save the Advanced panel without changing that, then it sets your MSS clamping to 1492 where it should actually be 1452. 1492 is the MTU, not MSS.
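To see why a clamp of 1492 causes trouble on a PPPoE link, here is a rough sketch. It assumes clients advertise the common Ethernet MSS of 1460, that clamping only ever lowers the advertised value, and that the IP and TCP headers total 40 bytes; the function names are illustrative only.

```python
def effective_mss(advertised_mss, clamp):
    """Clamping can only reduce the MSS a host advertises, never raise it."""
    return min(advertised_mss, clamp)


def largest_packet(advertised_mss, clamp, headers=40):
    """Size of a full-sized TCP packet once IP and TCP headers are added."""
    return effective_mss(advertised_mss, clamp) + headers


def fits_link(advertised_mss, clamp, link_mtu=1492):
    """True if a full-sized packet passes the link without fragmentation."""
    return largest_packet(advertised_mss, clamp) <= link_mtu


for clamp in (1492, 1452, 1412):
    print(clamp, largest_packet(1460, clamp), fits_link(1460, clamp))
# 1492 1500 False
# 1452 1492 True
# 1412 1452 True
```

Under these assumptions a clamp of 1492 is effectively no clamp at all: the 1460 advertised by the client passes through untouched, and the resulting 1500-byte packets are too big for the 1492-byte PPPoE link.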
In true Agile manner, the fix is already pushed into an upcoming release of the firmware, along with a new MSS Clamping option that can be set to Automatic, Customer-Defined, or Disabled.
I continue to be inspired by the team in Ubiquiti for their support, dedication, insight, and passion!