Windows Server 2008 R2 Network Disconnect Intermittently

I am running ESXi 4.1 on an HP ProLiant ML350 G6 hosting three virtual guest Windows servers. One server runs Windows 2008 R2 64-bit as a domain controller and DNS server. The second server runs Windows 2003 and is a print server, file server and application server. The third, problematic server runs Windows Server 2008 R2 and hosts Exchange Server 2010, with a base build exactly the same as the Windows 2008 domain controller. Intermittently, once or more a day, the network connection drops and only a server restart restores it. The Exchange server has sufficient hardware resources and, other than the network disconnections, runs normally with acceptable performance. The only difference between the two Windows 2008 servers is that one is a domain controller and the other is not and hosts Exchange.

After a restart the server behaves normally for anywhere from 3 to 16 hours, during which all mail sending and receiving through POP3, IMAP4 or OWA works without issue. Then the network randomly disconnects, with the following test results:

•      Network and Sharing shows complete network disconnect – no LAN or WAN access
•      ipconfig returns expected network configurations settings
•      From Exchange server cannot ping DNS, Gateway or any other LAN IP or hostname
•      No response when I ping Exchange server from another computer on the LAN
•      Loopback or ping of Exchange IP does return a response (network card is active and responding)
•      Network troubleshoot/repair does not resolve the problem
•      Disable and enable NIC does not repair the problem.
•      Restarting Exchange and Network service does not resolve problem
•      Event ID 1014 is logged at the time of the disconnect:
Name resolution for the name dns.msftncsi.com timed out after none of the configured DNS servers responded.

The Exchange 2010 server is configured as one Organization hosting Mailbox, Client Access and Hub Transport. It has only one mailbox database. Client Access includes OWA, POP3, IMAP4 and Offline Address Book and two receive connectors for client and OWA.

All three servers described above have identical virtual hardware, including an E1000 NIC with one assigned IP. Network configurations are identical, including one IP, subnet, gateway and DNS, as well as all the NIC driver settings themselves.

The other two Windows servers' network connections are fully reliable and have been stable since they went into production, and they are hosted off the same physical NIC.

Some things I have tried in attempt to resolve the problem:

•      Uninstalled and re-installed NIC driver through windows
•      Confirmed same driver version as other known good Windows 2008 (Microsoft 8.4.1.0)
•      Uninstalled and re-installed E1000 VMware hardware NICs
•      Uninstalled E1000 NIC and tried using VMXNET 2 (Enhanced) and VMXNET3
•      Confirmed VMware Tools are up to date and service is running
•      Reinstalled VMware Tools
•      Configured Windows Power Options to Performance
•      Configured Windows Power Options PCI Express option to Off
•      NIC Driver Disabled “Allow the computer to turn off this device to save power”
•      Fully disabled IPV6 using Microsoft tool
•      Checked Group Policies to confirm no network or power settings are being forced
•      Fully disabled Windows Firewall including service
•      Updated Kaspersky from Windows Server 6.0.4.1424 to Windows Server Enterprise 8.0.0.599
•      Configured Kaspersky with recommended Exclusion Rules for Microsoft and Kaspersky.
•      Kaspersky Trusted Processes list is empty?
•      Windows logs around disconnect times Event Viewer > System
Event ID 1014
Name resolution for the name dns.msftncsi.com timed out after none of the configured DNS servers responded.
•      No Kaspersky logs
•      Virus scan report indicates 100% clean

All symptoms indicate this is a local issue and not related to any actual network connectivity since other guest VMs with the same configuration do not have this problem. Hardware resources are sufficient. There is no decipherable pattern other than daily and normally between 5:00 and 11:00 am so I suspect some background process such as power options or possibly Kaspersky invokes the disconnect.
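
One way to pin down the timing more precisely would be to probe the gateway from the Exchange server on a schedule and tally the failed probes by hour of day. The following is only a minimal sketch under assumptions: the function names `ping_once` and `outages_by_hour` are made up for illustration, nothing like this was part of the original setup, and the ping return code is a rough signal at best (on Windows, ping can return 0 for a "Destination host unreachable" reply).

```python
import platform
import subprocess
from collections import Counter
from datetime import datetime


def ping_once(host, timeout_s=2):
    """Return True if a single ICMP echo to `host` appears to succeed.

    Hypothetical helper: it only looks at the exit code of the system
    ping command, which is a coarse indicator.
    """
    if platform.system() == "Windows":
        # Windows ping: -n = count, -w = timeout in milliseconds.
        cmd = ["ping", "-n", "1", "-w", str(timeout_s * 1000), host]
    else:
        # Unix-like ping: -c = count; the subprocess timeout bounds the wait.
        cmd = ["ping", "-c", "1", host]
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s + 3,
        )
    except (subprocess.TimeoutExpired, OSError):
        return False
    return result.returncode == 0


def outages_by_hour(failure_times):
    """Count failed probes per hour of day (0-23).

    `failure_times` is an iterable of datetime objects, one per failed probe.
    """
    return Counter(t.hour for t in failure_times)


# Example wiring (not executed here): record datetime.now() whenever
# ping_once("<gateway address>") returns False, then pass the collected
# timestamps to outages_by_hour() to see where the failures cluster.
```

Run every minute or so (for example from Task Scheduler) with the failure timestamps saved to a list or file, the hourly counts would show whether the drops really cluster in the morning window.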

Let me know if you have any suggestions or need more information.

Thanks,
Mike

cbielich

Let's try some basic network troubleshooting.

I am assuming you are using a dedicated IP for the server. Try disabling the NIC or changing the IP address, then see whether you can still ping the original address. Maybe someone else has the same IP address assigned and you are getting conflicts when they boot up or come online.

Are you running full or half duplex? Try changing those and see what happens.

AutomationOne

ASKER

Thanks for your response.

Sorry, I forgot to mention that in my original post. After a disconnect I did ping the IP the server has configured and got no response. It is not an IP conflict. I will double check the next time it disconnects.

I have duplexing set to Auto-negotiate which is the same as the other servers that are okay. Could it still make a difference?

vvzar

Please check whether any other services are running on the problematic server. Maybe one of them is configured incorrectly?

Also, this may be a routing issue.

Please post the output of the route print command, once when everything is OK and again when the connection problem occurs.

AutomationOne

ASKER

The only services that are running are for Exchange as previously mentioned.

Since Kaspersky can be quite aggressive with perceived threats I thought it might be possible that it was disconnecting the network as an intrusion detection method but after installing Kaspersky AV and Agent the network still dropped.

Attached are the ROUTE outputs for the connected and disconnected states: AOEX01-Route-Command-Connected.txt AOEX01-Route-Command-Disconnecte.txt
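
For anyone comparing two saved `route print` captures like these, a line-by-line diff is quicker than reading them side by side. This is a minimal sketch using Python's standard `difflib`; the function name `diff_route_tables` is hypothetical, and the attached files themselves are not reproduced here, so the function simply takes the two captures as text.

```python
import difflib


def diff_route_tables(connected_text, disconnected_text):
    """Return unified-diff lines between two saved `route print` outputs.

    Trailing whitespace is stripped so that padding differences in the
    captures do not show up as changes.
    """
    connected = [line.rstrip() for line in connected_text.splitlines()]
    disconnected = [line.rstrip() for line in disconnected_text.splitlines()]
    return list(
        difflib.unified_diff(
            connected,
            disconnected,
            fromfile="connected",
            tofile="disconnected",
            lineterm="",
        )
    )
```

An empty result means the two routing tables are identical, which would point away from a routing change as the cause.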

AutomationOne

ASKER

Also when the network disconnects I found that Disabling and Re-enabling the Local Area Connections reestablishes the connection. So as mentioned previously a restart is not necessary to resolve the problem.

vvzar

Routes seem to be all OK.
When you disable the LAN connection... hmmm...
Sounds like a software network loop. The description also fits an ARP table override, or an IP/MAC address conflict.

When the issue happens, try entering the following in a command shell: netsh interface ip delete arpcache

AutomationOne

ASKER

Since it is a VM I am going to try changing the MAC to rule that out.

Fairly confident it is not an IP conflict as when it occurs I disconnect the server from network and ping the IP with no response. It is possible another networked device has ping response disabled so there would be no reply.

Just to clarify: the next time this occurs on the server, you want me to clear the ARP cache by running "netsh interface 192.168.1.x ip delete arpcache"?

If that doesn't resolve the problem I will try changing IPs but obviously there is a bit of work with DNS and firewall if I take that step.

Thanks again for your help.

AutomationOne

ASKER

Changing MAC did not resolve the problem.

There does seem to be a pattern emerging: it disconnects in the morning between 8:00 am and 10:00 am. So there might be a device connecting to the network or waking up on the network that is causing the problem. I would have thought the live IP would not be affected and the device that connects with the same IP would be the one impacted. In fact I tried to recreate the problem by configuring a workstation with the same IP and restarting it, and found the workstation did not connect while the server's connection remained live.
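
A conflicting device can also be looked for from a second machine on the LAN by capturing `arp -a` before and during an outage and checking whether the server's IP now maps to a different MAC. The sketch below assumes the usual Windows `arp -a` layout (IP address, dash-separated physical address, type); `parse_arp_table` and `changed_macs` are illustrative names, not existing tools.

```python
import re

# Matches lines such as:
#   192.168.1.1           00-11-22-33-44-55     dynamic
# (assumed Windows `arp -a` layout; header and "Interface:" lines are skipped)
_ARP_LINE = re.compile(
    r"^\s*(\d{1,3}(?:\.\d{1,3}){3})\s+"
    r"([0-9A-Fa-f]{2}(?:-[0-9A-Fa-f]{2}){5})\s+\w+",
    re.MULTILINE,
)


def parse_arp_table(arp_output):
    """Map each IP address in `arp -a` output to its lower-cased MAC."""
    return {ip: mac.lower() for ip, mac in _ARP_LINE.findall(arp_output)}


def changed_macs(before, after):
    """Return {ip: (old_mac, new_mac)} for IPs whose MAC differs."""
    return {
        ip: (before[ip], after[ip])
        for ip in before
        if ip in after and before[ip] != after[ip]
    }
```

If the server's IP shows up in the result during an outage, another device is answering ARP for that address; an empty result makes an IP conflict less likely.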

cbielich

What kind of switch are you connected to?

Model?

AutomationOne

ASKER

DLink DES1024R

The thing is the other VMs on the same virtual and physical switch are okay.

To rule out the physical NIC and switch port I am connecting the second physical NIC on the server. Then I'm going to create a second virtual switch for the card and route the problematic server through there.

cbielich

Yeah, but your VM has a unique MAC address, so there could be something bugging out in the ARP table somewhere.

cbielich

Did you clone your VM from a physical server that is now being used on the same network?

AutomationOne

ASKER

Yes, I may have used the VMware OVF template to create the VM.

The VM MACs are definitely unique. In fact the problem still occurs after I have added a new virtual NIC.

Thanks,

Mike

ArneLovius

As you have done extensive troubleshooting that has not brought anything obvious to light, I would be very tempted to spin up a new VM, install 2k8r2 on it, install Exchange 2010 on it and see if you have the same problem.

When installing 2k8, I would suggest doing the install "manually" from a mounted ISO, not using a template.

If you do not have the same problem, I would suggest moving connectors, mailboxes etc. onto the "new" Exchange server. Once you have moved all of the "working" parts of Exchange, you can decide whether you want to explore further, or just uninstall Exchange and take the VM off the domain.

Cheers

AutomationOne

ASKER

I made two changes on July 8 and just returned to work today, July 13, to find it has been live ever since.

I enabled the second physical NIC and configured load balancing, connecting both NICs directly to the SonicWall's LAN ports.

I configured both physical NICs to 100 Full Duplex in ESXi. In Windows Server the NICs are set to Auto-negotiate.

Since I am only running 3 servers in ESXi with moderate network utilization, I have a hard time believing load balancing was the solution. What I suspect is that the original NIC that has been providing network connectivity has an issue where it is not handling the three VMs well. Possibly a hardware defect, or it needs a firmware upgrade.

At this time the problem is resolved although I am not 100% sure what fixed it. I will continue to investigate and report the results.

Thanks,

Mike

AutomationOne

ASKER

The solution was identified in the first posted response to check link speed and duplexing options. The settings were confirmed in Windows Server 2008 but until I investigated further I was not aware of the ESXi host settings for speed and duplexing options.
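
As background on why speed and duplex settings matter on both ends of a link: with 10/100 Ethernet, a port left on auto-negotiate that faces a port with a forced setting generally cannot learn the partner's duplex and falls back to half duplex. The sketch below models only that general rule. It is not a reconstruction of this particular setup, the names `effective_duplex` and `is_mismatch` are invented for illustration, and it assumes both ports support full duplex.

```python
def effective_duplex(local, peer):
    """Return the duplex mode the local port ends up using.

    `local` and `peer` are each "auto", "full", or "half".
    Simplified model of 10/100 Ethernet behaviour (an assumption,
    not vendor-specific logic).
    """
    if local in ("full", "half"):
        # A forced setting is used as-is, whatever the partner does.
        return local
    if peer == "auto":
        # Both sides negotiate and agree on the best common mode.
        return "full"
    # Local side negotiates but the partner is forced and silent:
    # the local side falls back to half duplex.
    return "half"


def is_mismatch(side_a, side_b):
    """True if the two link partners end up in different duplex modes."""
    return effective_duplex(side_a, side_b) != effective_duplex(side_b, side_a)
```

For example, `is_mismatch("full", "auto")` is True while `is_mismatch("auto", "auto")` and `is_mismatch("full", "full")` are False, which is why the settings need to be checked on the host and on the switch or firewall port together. On ESX/ESXi 4.x the host NIC values should be visible with `esxcfg-nics -l` or under the network adapter configuration in the vSphere Client.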