I want to write about a recent experience troubleshooting an issue during a Lync 2010 migration to Skype for Business 2015. It caused a lot of head scratching and frustration, so I wanted to document the steps I took before getting it working. The migration to Skype itself was straightforward and everything seemed fine after the Lync pool was decommissioned, but a few days later an issue emerged.
This particular customer had been using Lync 2010 for many years and had hundreds of Lync Phone Edition devices deployed, both for users and in common areas. These phones had not changed much (if at all) over the lifetime of the Lync deployment. Users who were logged in on them were migrated successfully, the phones signed back in to Skype during the account move, and calls were made without problems.
As part of the migration the DHCP options were updated and the scope was moved to a new server, so when some phones stopped working a few days later and just kept displaying “Connecting to Lync Server”, the first thought was to look at the DHCP server. Now if all the phones had stopped working at the same time this post would be a lot shorter, but they didn’t. The majority of phones continued to work, even surviving reboots and new leases. The affected ones seemed to be phones that had either been logged out or reset (wiped) of user data. A fresh phone was not able to log in using either USB based authentication or the PhoneExt/Pin method.
The first thing I did was check that the right DHCP options were being provided. To do this I first used the DHCPUtil tool to emulate a client request. This was performed on the same VLAN as the phones and reported a success. The next step was to run the PowerShell cmdlet Test-CsPhoneBootstrap, again from a laptop on the same VLAN. This cmdlet replicates a phone completing the authentication process, and it too passed. Just in case, DHCP was pointed back to the original server, with its options updated to point at Skype, but still no luck.
The customer, like most, used an internal Certificate Authority, in this case with an offline Root and a subordinate Issuing CA, neither of which the phones trust by default. This is perfectly acceptable and expected, since the Lync Phone Edition sign-in process includes connecting via HTTP to download the certificate information prior to connecting via TLS. According to the Microsoft documentation, in some scenarios the phone can instead connect via AD to retrieve the certificate, or use web enrolment methods to retrieve it. The root certificate was published correctly within AD, so that was not the problem. Upon checking the Skype service and IIS logs I could see the requests from the phone over HTTP and the certificate chain being presented to it, but still no joy.
My next thought was that it might be a TLS related issue, since the phone was constantly asking for the certificate information to be downloaded. Either it didn’t like the certificates, or, since the phones only support TLS 1.0 (hence why they’ll soon not be supported against Office 365 with it enforcing TLS 1.2), it couldn’t negotiate a secure connection, so I started down that road. After checking the Schannel section of the registry on the front ends I could see that the servers did not have TLS 1.0 disabled but did have some changes made to the cipher suites to increase security. These were reverted to the Windows Server 2012 R2 defaults, but still no joy.
At this point I decided to try another type of device. Since the Lync 2010 and Skype for Business desktop clients were still able to sign in, I was sure it was client related somehow. Since Lync Phone Edition was released, several vendors have brought out certified (3PIP) devices that do not run the Lync Phone Edition OS/App but are capable of registering natively to Lync/Skype. Fortunately I have my own Polycom VVX series phone that I was able to take with me on my return visit to the customer to troubleshoot further. I ensured my phone was using the Skype base profile and reset it to make sure no other settings were retained. After restarting, it was able to register using the PhoneExt/Pin method. Now we were getting somewhere.
I then struggled to work out the differences between the two device types. The VVX series has accessible logging and management, which allowed me to confirm it was connecting to Skype as expected, retrieving the private certificate and storing it. Unfortunately, getting logs from the Lync Phone Edition was a struggle, and when I did get them they didn’t seem to shed any light.
Since I wasn’t able to get logs showing what was happening on the Lync phone during start up and registration, I asked the customer to set up a port mirror on a switch so I could use Wireshark to see what traffic was being sent, in the hope of working out what was going on. With Wireshark up and running I was able to see the phone start up, retrieve its DHCP options and then use HTTP to retrieve the certificate, all as it should be, before it then looped.
It took me a while of looking blankly at the packet capture before something jumped out at me: the phone was trying to establish a TLS session with the Edge servers’ external IP addresses! Internal clients should not be able to reach the external interface of the Edge server, and all communication on the LAN should use the internal one, so why was it trying to connect externally? (And how?)
The why became apparent when it dawned on me that they did not have split-brain DNS. Instead they had been creating pinpoint DNS zones for each of the required hostnames on their internal DNS, as well as creating the records externally. The phone was carrying out a DNS request for the external SRV record (_sip._tls.) and, with no internal zone to answer it, was receiving a response from public DNS. Now this in and of itself wouldn’t necessarily be a problem, but it turns out that on their network clients can reach the external IP addresses of servers.
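For anyone wanting to check for the same leak, here is a minimal Python sketch of the idea, not something I ran on the customer's network. It takes the IP addresses that an internal client gets back for the SRV target and flags any that are not in a private range. The function names are my own, and the standard library cannot query SRV records directly, so the assumption is that you first find the SRV target with a tool such as nslookup and then pass that hostname to resolve().

```python
import ipaddress
import socket


def public_addresses(addresses):
    """Return the addresses that Python does not classify as private/reserved."""
    return [a for a in addresses if not ipaddress.ip_address(a).is_private]


def resolve(hostname):
    """Resolve a hostname to its unique IP addresses using the system resolver."""
    infos = socket.getaddrinfo(hostname, None)
    # Each entry's sockaddr tuple starts with the IP address as a string.
    return sorted({info[4][0] for info in infos})
```

Running public_addresses(resolve(target)) from a machine on the phone VLAN, where target is the hostname returned by the SRV lookup, should give an empty list if internal clients only ever see internal addresses. Anything left in the list is a public address being handed to an internal client.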
The fastest solution was to create an empty zone for _sip._tls. on the internal DNS, since I didn’t want to make firewall changes late in the day or try to work out why clients were able to route to the external interface. After the empty zone was created the broken phones were able to sign back in again, happy days!
TLDR: The lesson of this story is to check that internal devices can’t reach the external interface of the Edge servers, as per the Microsoft documentation. Since the desktop client was working both internally and externally, it wasn’t something that jumped out as worth checking.
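That check can be scripted. The following is a rough Python sketch under my own assumptions rather than a tool I used on site: it simply attempts a TCP connection from an internal machine to a given address and port. The function name is hypothetical, and the address and port to test are whatever the external _sip._tls SRV record returns in your environment.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        # create_connection raises OSError on refusal, timeout or routing failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Run from the phone VLAN against the Edge external IP, a result of True means internal devices can reach the external interface, which is the situation described above. Note that this only proves TCP reachability; it does not attempt a TLS handshake.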
Bonus info: I was curious as to why the Polycom VVX phone worked, so I performed a Wireshark capture on it. I could see that after it performed the HTTP request for the certificate it switched to using the lyncdiscoverinternal/lyncdiscover method to authenticate, like the desktop/mobile Lync 2013/Skype clients do.