SharePoint Blogs / SharePoint University
SharePoint Blogs and SharePoint University - all in one place!
Need SharePoint Training? Attend a SharePoint Bootcamp!

What to do when Load Balanced MOSS Servers can't see each other

I was checking the status of a pair of MOSS servers I had recently set up in a web farm.

[Image: NLB Unicast Layout]

I noticed that the cluster appeared to be functioning, but when I connected to the NLB cluster from WFE1, only WFE1's configuration could be loaded. Likewise, when I connected to the NLB cluster from WFE2, only WFE2's configuration could be loaded. The servers could not ping each other on any IP address.

Still, web traffic to the portal site was being routed properly to each server, and the servers themselves could connect to the back-end database (and to any other machine on the network)... they simply could not connect to each other.

This is a problem, of course, because the various shared services need to be able to talk to each other. For example, the Central Administration site is on WFE1, but if I went to the Operations-->Services on Server page and tried to open the Search Service properties on WFE2, I got an error. I'm sure there are plenty of other complications, but I didn't wait around to see what they were.

I mentioned in a previous post that I had originally set up this NLB cluster in Multicast Mode, which works fine unless you have more than one subnet.  I had installed an additional network card in each MOSS server and switched to Unicast Mode. 
Here's a quick blurb from Brian Madden about Unicast and Multicast:
Windows NLB has the ability to work in two different modes: “unicast” and “multicast.” Regardless of the mode you choose, NLB creates a new virtual MAC address assigned to the network card that has NLB enabled, and all hosts in the cluster share this virtual MAC. Then, all incoming packets are received by all servers in the cluster, and each server’s NLB drivers are responsible for filtering which packets are for that server and which are not. When in unicast mode, NLB replaces the network card’s original MAC address. When in multicast mode, NLB adds the new virtual MAC to the network card, but also keeps the card’s original MAC address.

Both unicast and multicast modes have benefits and drawbacks. One benefit of unicast mode is that it works out of the box with all routers and switches (since each network card only has one MAC address). The disadvantage is that since all hosts in the cluster all have the same MAC and IP address, they do not have the ability to communicate with each other via their NLB network card. A second network card is required for communication between the servers.


Multicast mode does not have the problem that unicast operation does since the servers can communicate with each other via the original addresses of their NLB network cards. However, the fact that each server’s NLB network card operating in multicast mode has two MAC addresses (the original one and the virtual one for the cluster) causes some problems on its own. Most routers reject the ARP replies sent by hosts in the cluster, since the router sees the response to the ARP request that contains a unicast IP address with a multicast MAC address. The router considers this to be invalid and rejects the update to the ARP table. In this case you’ll need to manually configure the ARP entries on the router. (Don’t worry if you’re lost at this point. Just be aware that if you’re using multicast mode, you’ll need to get one of your network infrastructure people involved.)


The bottom line is that you don’t want to use unicast in a Terminal Server environment unless you have two network cards. (That way, you can still connect to a specific Terminal Server if you need to via another adapter and another IP address.) If your servers have only a single network card, then you’ll want to use the multicast mode.
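The quote says every host receives every incoming packet and each host's NLB driver decides which packets belong to it. The Python sketch below models only that filtering step. The hash is a hypothetical stand-in (the post does not describe NLB's real algorithm), and the client address 10.10.20.15 is made up; the point is that when every host runs the same deterministic check, exactly one host keeps each connection without any per-packet coordination between hosts.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Packet:
    src_ip: str
    src_port: int


def owning_host(packet: Packet, num_hosts: int) -> int:
    """Map a client connection to a host ID (hypothetical hash, not NLB's actual one)."""
    key = f"{packet.src_ip}:{packet.src_port}".encode()
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:4], "big") % num_hosts


def host_accepts(host_id: int, packet: Packet, num_hosts: int) -> bool:
    """Every host runs this same check on every packet; only the owner keeps it."""
    return owning_host(packet, num_hosts) == host_id


if __name__ == "__main__":
    hosts = {0: "WFE1", 1: "WFE2"}
    for port in (50001, 50002, 50003):
        pkt = Packet("10.10.20.15", port)  # hypothetical client address
        keepers = [name for hid, name in hosts.items() if host_accepts(hid, pkt, len(hosts))]
        print(pkt, "->", keepers)  # always exactly one host
```

Because the decision depends only on the packet and the host count, adding or removing a node changes which host owns a given connection, which is one reason NLB hosts have to converge on a shared membership view.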

What it boils down to is that in my case, I needed to use Unicast because of the multi-subnetted environment, and I needed to use dual network cards on each server so that each node could talk to the others.  So that's what I did... AND IT WORKED FOR A WHILE.  This is important.  It threw me for a loop until I remembered that it had been working originally.  Something must have changed.
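To make the unicast/multicast trade-off from the quote concrete, here is a small Python sketch. It assumes the commonly documented NLB convention that the cluster MAC is 02-BF followed by the four bytes of the cluster IP in unicast mode, and 03-BF followed by the same bytes in multicast mode. It uses the cluster address 10.10.10.163 that appears in the WLBS event later in this post. The `router_accepts_arp_reply` function is a hypothetical model of the strict router behaviour the quote describes, not any vendor's actual logic.

```python
import ipaddress


def nlb_cluster_mac(cluster_ip: str, multicast: bool) -> str:
    """Cluster MAC under the assumed 02-BF-<IP> (unicast) / 03-BF-<IP> (multicast) convention."""
    first_octet = 0x03 if multicast else 0x02
    octets = [first_octet, 0xBF] + list(ipaddress.IPv4Address(cluster_ip).packed)
    return "-".join(f"{b:02X}" for b in octets)


def is_multicast_mac(mac: str) -> bool:
    """The least significant bit of the first octet (the I/G bit) marks a group address."""
    return (int(mac.split("-")[0], 16) & 0x01) == 1


def router_accepts_arp_reply(ip: str, mac: str) -> bool:
    """Hypothetical strict router: refuse to map a unicast IP to a multicast MAC."""
    ip_is_unicast = not ipaddress.IPv4Address(ip).is_multicast
    return not (ip_is_unicast and is_multicast_mac(mac))


if __name__ == "__main__":
    cluster_ip = "10.10.10.163"

    # Unicast: the cluster MAC replaces each NLB card's own MAC, so WFE1 and WFE2
    # present the same address on that card and cannot single each other out.
    unicast = {host: nlb_cluster_mac(cluster_ip, multicast=False) for host in ("WFE1", "WFE2")}
    print(unicast)
    print(router_accepts_arp_reply(cluster_ip, unicast["WFE1"]))  # True

    # Multicast: the added cluster MAC has the group bit set, which a strict router refuses.
    mcast_mac = nlb_cluster_mac(cluster_ip, multicast=True)
    print(mcast_mac, router_accepts_arp_reply(cluster_ip, mcast_mac))  # 03-BF-0A-0A-0A-A3 False
```

In unicast mode both hosts end up with an identical MAC on the NLB card, which is why the second card is what lets WFE1 and WFE2 reach each other.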

This week when I came in, I saw that for some reason the MOSS servers had stopped communicating with each other. If I was on WFE1 and went to Start-->Administrative Tools-->Network Load Balancing Manager, I could only load the local node (WFE1). From a third Windows Server 2003 machine, I was able to connect to the cluster and verify that it was all working just fine. I don't know why this happened, but I think something may have changed around the 29th of September:
Event Type:        Warning
Event Source:      WLBS
Event Category:    None
Event ID:          18
Date:              9/29/2008
Time:              11:00:20 PM
User:              N/A
Computer:          WFE1
Description:
NLB Cluster 10.10.10.163 : Duplicate cluster subnets detected. The network may have been inadvertently partitioned.

...but per Microsoft, this is not a big deal:

During the time that the cluster was partitioned, the members of the cluster converged into two or more separate clusters. This event is an informational message that reports the network had been partitioned and the WLBS hosts now have correctly converged in just one cluster. This event is benign but if it is logged repeatedly there may be an issue with the underlying network or the network infrastructure may be insufficient for the volume of traffic. STATUS: This behavior is by design.
Was it a corrupt configuration?  I tried destroying and recreating the cluster.  Interestingly, as soon as I removed one of the nodes, communications were instantly restored between the servers.  It was only after both nodes were converged that communications ceased.  It seemed like a routing problem to me. I did a lot of reading this morning but there were a lot of dead ends.  Finally I happened across this tidbit, buried six feet deep in a Microsoft FAQ page on NLB:
Q. I Have Two Network Adapters on Each Server in My NLB Cluster. How Do I Ensure That All Outbound Traffic Goes Through Non-Load-Balanced Network Adapters? 

A. Sometimes it is desirable for performance or other reasons to direct all outgoing traffic through a different network adapter than the one that is being load balanced with NLB. This implies that there is more than one network adapter on each host in a cluster: NLB is bound to one network adapter, called the cluster network interface card, and the other network adapter does not have NLB bound to it. To make sure that the outbound traffic leaves each host through the non-cluster network adapter, do the following:

Set the metric on the cluster network adapter to a higher value than the non-cluster network adapter. For example, if you have two network adapters on each host, set the non-cluster network adapter metric to 1 and cluster network adapter metric to 2. The network adapter with a higher metric means it is more expensive to use than the other one with a lower metric. That will ensure that the outbound traffic will be routed out of the non-cluster network adapter. 

If you want to use default gateways on both cluster and non-cluster network adapters, make sure the metric of the default gateway on the cluster network adapter has a higher value than the one on the non-cluster network adapter. If you do not want to route any outgoing traffic out of the cluster network adapter, you should not specify the default gateway for it at all.
  

In this case, the solution was to open each network adapter on WFE1 and WFE2 and, under Advanced TCP/IP Settings, clear the "Automatic metric" check box and enter explicit interface metrics: 1 for the Production NIC and 2 for the NLB NIC. After that, everything worked perfectly.
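The FAQ's rule boils down to this: when more than one route could carry a packet, the one with the lower metric wins. The Python sketch below models that choice with the metrics from this fix (1 on the Production NIC, 2 on the NLB NIC). The 10.10.10.0/24 subnet, the 10.10.10.20 address used for WFE2, and the single default route are hypothetical, and the lookup (most specific prefix first, then lowest metric) is a simplified model, not the actual Windows Server 2003 routing code.

```python
import ipaddress
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Route:
    destination: str  # CIDR prefix; "0.0.0.0/0" is the default route
    interface: str
    metric: int


def select_route(routes: List[Route], target_ip: str) -> Optional[Route]:
    """Pick the most specific matching route; break ties with the lower metric."""
    addr = ipaddress.IPv4Address(target_ip)
    matches = [r for r in routes if addr in ipaddress.IPv4Network(r.destination)]
    if not matches:
        return None
    return min(
        matches,
        key=lambda r: (-ipaddress.IPv4Network(r.destination).prefixlen, r.metric),
    )


if __name__ == "__main__":
    # Hypothetical table for WFE1: both NICs on one subnet, metrics set as in the fix.
    routes = [
        Route("10.10.10.0/24", "NLB NIC", 2),
        Route("10.10.10.0/24", "Production NIC", 1),
        Route("0.0.0.0/0", "Production NIC", 1),  # no default gateway on the NLB NIC, per the FAQ
    ]
    print(select_route(routes, "10.10.10.20").interface)  # Production NIC (hypothetical WFE2 address)
    print(select_route(routes, "192.168.5.7").interface)  # Production NIC via the default route
```

The NLB route is listed first on purpose, so the result comes from the metric rather than from list order. With equal metrics, `min` would simply return whichever matching route appears first; the post does not say how Windows broke the tie while both adapters were on automatic metric.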

 

[Image: NLB Unicast TCP/IP Settings]


Posted 10-08-2008 1:11 PM by moffitar

Posts (c) their respective authors. Everything else (c) 2009 SharePoint Experts, Inc.