Skype for Business CMS Pool Failover Issue

Good Morning All

Last week, while working on a client's deployment, one of the Modality team came across a rather interesting issue when failing over a Standard Edition Skype for Business pool pair.

What started out as a simple failover test became a monster of an issue for no good reason. When we failed over the CMS, it became more or less orphaned in the process: the secondary front end did not become the master, and the former primary also released itself as the master. It's also worth confirming that the failover procedure completed without error; it was only when the services restarted that we saw the issue. The Master Replicator Agent service would not start.

As this was the first time we had seen such an odd issue, we checked the usual channels and came across Mark Vale's blog, which details almost word for word the same situation.

After some investigation we could see that the CMS was no longer attached to either the primary or the secondary front end, yet the Skype for Business client was happily working. Running PowerShell commands, we could also see that replication was now reporting False across all the front ends and the edge.

[Screenshot: Issue1]
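For anyone wanting to run similar checks, here is a minimal sketch of the kind of commands that show where the CMS lives and whether the replicas are current. It assumes the Skype for Business Server Management Shell on a server in the topology, and it is not necessarily the exact set of commands we ran on the day.

```powershell
# Run from the Skype for Business Server Management Shell.

# Which pool does the topology think hosts the Central Management store?
Get-CsService -CentralManagement

# Which server currently holds the active master role?
Get-CsManagementStoreReplicationStatus -CentralManagementStoreStatus

# List any replica (front end or edge) that is not up to date.
Get-CsManagementStoreReplicationStatus |
    Where-Object { -not $_.UpToDate } |
    Select-Object ReplicaFqdn, UpToDate, LastStatusReport
```

In a healthy deployment the last command returns nothing. In our case every replica would have been listed.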

We tried a number of steps to resolve this issue, including the ones Mark detailed in his blog, without success. It seems this issue can take many forms.

We tried to fail back to the primary without success and got the following error:

[Screenshot: issue3]

And when we attempted to restart the services, we got the following error:

[Screenshot: issue2]

We even went to the extent of deleting the xds/lis databases, reinstalling them and importing the backup we had into the clean databases. We were still seeing the same errors.

The solution that worked for us was a simple one really.

Uninstall-CsDatabase -CentralManagement

then

Install-CsDatabase -CentralManagement

and restoring CMS from backup did the trick.
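The restore step itself is not spelled out above, so here is a rough sketch of one way it can be done. It assumes a backup was taken beforehand with Export-CsConfiguration and Export-CsLisConfiguration; the file paths are hypothetical placeholders, and your environment may need additional steps.

```powershell
# Run from the Skype for Business Server Management Shell.
# File paths below are hypothetical placeholders for your own backup files.

# Taken BEFORE the failover test, while the CMS was healthy:
#   Export-CsConfiguration    -FileName 'C:\Backup\CsConfig.zip'
#   Export-CsLisConfiguration -FileName 'C:\Backup\CsLis.zip'

# After the Central Management database has been reinstalled:
Import-CsConfiguration    -FileName 'C:\Backup\CsConfig.zip'
Import-CsLisConfiguration -FileName 'C:\Backup\CsLis.zip'

# Republish the topology and push the restored data out to the replicas.
Enable-CsTopology
Invoke-CsManagementStoreReplication

# Confirm that every replica eventually reports UpToDate = True.
Get-CsManagementStoreReplicationStatus | Select-Object ReplicaFqdn, UpToDate
```

Replication can take a few minutes to settle, so it is worth re-running the last command before assuming something is still broken.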

 

 


5 thoughts on “Skype for Business CMS Pool Failover Issue”

  1. Hi Ian, thanks for the pingback. Yes, it is a monster. I have another post that shows how to restore the CMS without a backup, which also comes in handy in this event.

  2. Hi guys,
    This issue is awful stuff…do you know if there is any response on this from MS?
    @mark On my test failover I got the same error as you, but I was able to ignore it as the CMS failed over successfully overall, i.e. my PowerShell commands returned what I expected.
    However, when I went to fail back the CMS I got the exact same issue as Ian, so it looks like I will be doing a restore as well.

    Dreadful hairy stuff

  3. We have run into this problem with several customers. We have gleaned the following observations:

    1. The combination of pool types doesn’t matter. It happens when you have 2 Enterprise Edition pools, 2 Standard Edition pools, or one of each in a pair.
    2. It occurs randomly and is most evident when you try to do a Maintenance Failover, i.e. a failover when both pools are functioning. I have observed this happening, conservatively, 75% of the time.
    3. It can also “fail” when doing a DR failover but does not become apparent until you try to move the CMS back after recovering the dead pool.
    4. While your method sounds like it would work, it probably won’t work all the time.
    5. Having the pools in separate AD sites does impact CMS failovers if the site-to-site replication is lengthy or a replication isn’t forced. In the testing I did, all of the pools were in the same AD site

    We opened a support case with Microsoft regarding the issue. I was able to demonstrate the behavior to them. I had run over 15 CMS failover tests in my lab environment in which most failed. I supplied these results to Microsoft with the steps I had to use to recover my environment.

    Microsoft identified the likely cause of the issue. In the XDS database on your SQL back end there is a table named dbo.dbConfigInt, which appears to have a single row. There is a field in the table named “CurrentState”. In a paired set of pools, the value in this field should be “0” on the pool with the active version of the CMS and, concurrently, “3” on the DR pool with the passive version. During a CMS failover, one of the first things that should happen is that these values flip on both pools: the primary pool’s value should change to “3” and the DR pool’s to “0”. When I had a CMS failure, both pools would have a value of “0”.

    Microsoft provided a SQL query that we could run to switch the value in this field to its proper setting. Even after doing this, I usually had to prod things along to get the CMS failover to complete. This ranged from bouncing the FTA, MRA and RRA services on all servers in all pools, to bootstrapping one or all of the servers, to running the first step in the deployment wizard on one server in the primary pool and then bootstrapping, to breaking the pool pairing, and so on. My experience has been that this is a lot of “try this, then this, then this” until everything is working.

    At the end of the day, Microsoft provided us with a process to fix this problem if it occurred during a CMS failover. They advised us that if one of our customers ran into this issue, the customer should open a support case with us or Microsoft to get it resolved. Microsoft did not advocate having the customer make a direct change to the SQL database. They are much more comfortable with us or them making the change and getting the failover to complete. With this information in hand, the support folks brought the problem to the product team.

    We heard back from Microsoft a few months later, after they had conferred with the product team. The issue was acknowledged as a bug, and since there was a workaround (opening a support ticket), the developers declined to repair the problem. The bottom line from Microsoft is that a customer experiencing this issue should open a ticket with either their integrator or Microsoft to resolve it. We were disappointed with this decision and advised Microsoft of our distaste for this solution and our belief that it would negatively impact our customers’ ability to recover their Skype for Business functionality within their defined RTOs.

    I am working on creating a decision tree or flowchart for our support personnel to help them fix this problem for our customers. It’s the best we can offer them for now. I had considered abandoning the “supported” method for failing over a Skype for Business pool and using an alternate method. My Plan B was to move the CMS to the DR pool when the primary failed and then run Move-CsUser to move the users from the primary pool to the DR pool. In theory, this should work, but we lose the RTO of near zero for voice since the users would not go into Limited Functionality mode. I’d also have to make sure that the DR pool had a recent copy of both the CMS and LIS, as well as a current listing of users homed on the primary pool. I’d need this list in order to move the primary pool users back easily. We are letting our customers decide which way to go: use the supported method, which would likely increase the RTO for all Skype for Business services but provide a zero RTO for voice, or use Plan B, which would not provide an RTO of zero for voice but would provide a more reliable method to return all Skype for Business functions to an operational state.

    I hope you find this information useful

    John Miller, Messaging Engineer
    Enabling Technologies
    http://www.enablingtechcorp.com

  4. Had the exact same issue; running the following SQL query on the surviving CMS server also fixed it for me.
    USE xds
    UPDATE dbo.DbConfigInt SET Value = 0
    WHERE Value = 3 AND Name = 'CurrentState'

  5. Pingback: Shawn Harry | Skype for Business Split Brain
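The CurrentState behaviour described in comment 3 can be turned into a small triage helper. This is only a sketch: the function name Get-CmsPairState and its result labels are hypothetical, the meaning of 0 and 3 is taken from that comment rather than from Microsoft documentation, and the function only classifies values you have already read from each pool's xds database. It does not query or change anything.

```powershell
function Get-CmsPairState {
    param(
        [Parameter(Mandatory)][int]$PrimaryState,  # CurrentState read from the primary pool
        [Parameter(Mandatory)][int]$BackupState    # CurrentState read from the DR pool
    )

    # Per comment 3: 0 = active copy of the CMS, 3 = passive copy.
    if ($PrimaryState -eq 0 -and $BackupState -eq 3) { return 'ActiveOnPrimary' }
    if ($PrimaryState -eq 3 -and $BackupState -eq 0) { return 'ActiveOnBackup' }

    # The failure pattern reported in comment 3: both pools claim the active copy.
    if ($PrimaryState -eq 0 -and $BackupState -eq 0) { return 'BothActive' }

    # Any other combination is not covered by the comments on this post.
    return 'Unknown'
}

# Example: the state seen after a failed failover in comment 3.
Get-CmsPairState -PrimaryState 0 -BackupState 0
```

A result of BothActive matches the broken state described in the comments. As noted there, Microsoft's advice was to open a support case rather than edit the database directly.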
