Lync 2013 – RTCSRV Frontend Service failing to start “showing as starting”

Good Morning

This blog post talks through a situation I ran into with a client recently. The client had a vanilla Lync 2013 Enterprise Edition implementation with three front ends and a back-end SQL server. All of the servers were running Windows Server 2008 R2 Standard edition.

The installation had gone to plan, with no issues in the earlier steps. On starting the services, though, I ran into an issue I'd personally never seen before: the RTCSRV service was stuck recycling on ‘Starting’ and never finished. (I left it for 2 days and it still didn’t finish.)

So what was causing the problem? This is what I did to track down and resolve the issue. (It's probably worth noting that there is a lot on TechNet and other blog sites around this issue, and some of the suggestions I found are crazy and would break your Lync environment if run.)

  1. First step was to check the event logs for information. > This proved fruitless, as nothing in the way of an error or warning was showing against the start of the services.
  2. Check the certificate, including its trust binding and the intermediate chain. > This checked out OK and the certificate was good to use.
  3. Get Snooper running. Add SIPStack and S4 with all tracing enabled, then stop and start the service on the front ends again while the trace is running. NOTE: you will need to kill the RTCSRV process from the command line. (First run sc queryex RTCSRV, which gives you the process number, then run taskkill /f /pid <process number>.) > I ran this and again it checked out OK, with no errors to be seen.
  4. Run some PowerShell commands to check the status of the Lync 2013 implementation, just to make sure it really did go OK.
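For the process kill in step 3, a PowerShell equivalent of the sc queryex / taskkill pair might look like the sketch below. It assumes an elevated PowerShell session on the affected front end, and the variable name is purely illustrative.

```powershell
# Sketch only: look up the PID behind the RTCSRV service and kill it.
# Assumes an elevated PowerShell session on the affected front end.
$svc = Get-WmiObject -Class Win32_Service -Filter "Name='RTCSRV'"

if ($svc -and $svc.ProcessId -gt 0) {
    Write-Host "RTCSRV is in state '$($svc.State)' with PID $($svc.ProcessId)"
    # Equivalent to: taskkill /f /pid <process number>
    Stop-Process -Id $svc.ProcessId -Force
}
else {
    Write-Host "RTCSRV has no running process to kill."
}
```

A service stuck on starting still has a process ID, which is why the kill works where a normal stop does not.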

The commands for step 4 were:

  • Get-CsManagementStoreReplicationStatus > Check that every replica shows UpToDate as True

  • Get-CsPoolUpgradeReadinessState > This was Ready
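The two checks can be run together from the Lync Server Management Shell. The sketch below is one way to do it, assuming the readiness check meant here is Get-CsPoolUpgradeReadinessState (the cmdlet name as I know it); the $stale variable and the choice of columns are just for illustration.

```powershell
# Sketch only: run from the Lync Server Management Shell.
# List any replica whose copy of the Central Management Store is stale.
$stale = Get-CsManagementStoreReplicationStatus | Where-Object { -not $_.UpToDate }

if ($stale) {
    $stale | Format-Table ReplicaFqdn, UpToDate, LastStatusReport -AutoSize
}
else {
    Write-Host "All replicas report UpToDate = True."
}

# Pool-level readiness; the value to look for is a State of Ready.
Get-CsPoolUpgradeReadinessState | Format-List
```

Filtering on the stale replicas only keeps the output short when most servers are healthy.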

So what was my next step? After consulting other consultants internally (thanks to the Modality Systems guys), the next natural step was to patch the Lync 2013 environment even with the issue present. This is something I don’t usually do, as I don’t like to muddy the water with patching until I’m happy that the implementation is working as expected. HOWEVER, as Tom Arbuthnot mentioned, CU4 had changed the way things work internally in Lync 2013, so it was worth a shot to see if patching fixed this odd issue.

I patched all three Lync FEs and the back-end SQL up to the January 2014 CU, and still NO: the service was stuck recycling. As a consultant you follow the same well-trodden path when investigating, so again I set about looking in the event logs and Snooper. This time, though, the event logs had a lot more information to view, and one key item was the warning below.

<<<<

Server startup is being delayed because fabric pool manager has not finished initial placement of users.

 

Currently waiting for routing group: {63BB8586-A9D8-5AF2-83FF-B5CE680594C0}.

Number of groups potentially not yet placed: 1.

Total number of groups: 1.

Cause: This is normal during cold-start of a Pool and during server startup.

If you continue to see this message many times, it indicates that insufficient number of Front-Ends are available in the Pool.

Resolution:

During a cold-start of a large Pool it can take upto an hour for the placement process to finish as it needs to populate all the Front-End databases with data from the Backup Store. If the Pool is running and the Front-End is just started, this is normal for some time. If this repeats for a long time, ensure that all the Front-Ends configured for this Pool are up and running. If multiple Front-Ends have been recently decommissioned, run Reset-CsPoolRegistrarState -ResetType QuorumLossRecovery to enable the Pool to recover from Quorum Loss and make progress

 >>>>
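If you want to find this warning without scrolling through Event Viewer, something like the sketch below should do it. It assumes the events are written to the "Lync Server" event log on the front end; the -MaxEvents value and the wildcard text are illustrative choices, not anything special.

```powershell
# Sketch only: pull recent "Lync Server" events that mention the fabric pool manager.
# The -MaxEvents value and the wildcard text are illustrative choices.
Get-WinEvent -LogName 'Lync Server' -MaxEvents 1000 |
    Where-Object { $_.Message -like '*fabric pool manager*' } |
    Select-Object TimeCreated, Id, LevelDisplayName, Message |
    Format-List
```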

What's interesting about this is: why has quorum got itself in a twist? Yes, the servers have been rebooted, but the issue was already showing before the reboots. No servers have been removed from the pool, so again this shouldn’t have affected the quorum state.

Anyhow, I ran the QuorumLossRecovery command.

Reset-CsPoolRegistrarState -ResetType QuorumLossRecovery

AND BOOM... the front-end services started as expected.
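For anyone repeating this, here is a minimal sketch of the reset followed by a service check. As far as I know the cmdlet expects a -PoolFqdn value, so the sketch passes one; 'pool01.contoso.local' is a made-up placeholder, not the client's pool.

```powershell
# Sketch only: run in the Lync Server Management Shell on a front end.
# 'pool01.contoso.local' is a placeholder, not a real pool FQDN.
$poolFqdn = 'pool01.contoso.local'

# Ask the pool to recover from quorum loss (the reset type named in the warning).
Reset-CsPoolRegistrarState -PoolFqdn $poolFqdn -ResetType QuorumLossRecovery

# Then check whether the front-end service has left 'Starting'.
Get-CsWindowsService -Name RTCSRV
```

QuorumLossRecovery is the reset type the event text itself suggests, so it seems the sensible one to try before anything heavier.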

 

KEY TAKEAWAYS 

  1. Always follow the same process in your investigation work, even after you've patched your Lync environment.
  2. DON'T always follow what people write on the TechNet forums, or you will end up either chasing your tail or, more drastically, breaking your already non-working Lync environment.

That's it for this blog post.

Thanks

IainS

 


7 thoughts on “Lync 2013 – RTCSRV Frontend Service failing to start “showing as starting”

  1. I had the same problem in our setup, but the QuorumLossRecovery switch did not help. I had to run “Reset-CsPoolRegistrarState -PoolFqdn "atl-cs-001.litwareinc.com" -ResetType FullReset”, and then it started.

  2. “What's interesting about this is: why has quorum got itself in a twist? Yes, the servers have been rebooted, but the issue was already showing before the reboots. No servers have been removed from the pool, so again this shouldn’t have affected the quorum state.”

    Maybe, this is a solution. It worked for us.

  3. Key Takeaways!!!! I am glad I took the time to read this article after I chased my own tail for a while. Sometimes the step by step Technet articles can throw you off.

  4. I have just had a very similar problem, with an extremely unlikely apparent cause: PowerShell.

    I had to do a QuorumLossRecovery and it seems that it only worked when run on the front end server directly, not from my own machine. It took 4 days of troubleshooting before I got that far…

    When I run the Lync Server Management Shell on my machine, I have to do a “run as a different user” to use my domain admin account. This works and I get a PowerShell window, but it's blue (like the standard PowerShell window). Running the management shell as administrator (but as my standard user) opens a black PowerShell window, and I can’t run commands that way as I lack permissions. If I open it directly on the front end, I get a black window and can do things, as I am logged in as my domain admin via RDP. This does not make much sense, but it absolutely is the only thing I did differently to get the pool working.

    Hopefully this may save someone else a lot of frustration….

  5. After pulling my hair out trying to figure out why, despite trying all the pool reset types, etc. and performing two complete rebuilds, I was still having this issue. It turns out that Windows Server 2012 R2 does not support the SHA512 cipher block algorithm over TLS1.2 out of the box. As our environment uses certs signed with the SHA512 hashing algorithm, we would see one FE (Enterprise) server start, but others would not start with it. The FE servers use TLS1.2 to communicate with each other, so if that communication cannot occur, no quorum. If you are seeing SChannel errors in your System event log, apply this: http://support.microsoft.com/kb/2973337
