thijsschoonbrood asked:

ESXi 4.0 U1 - Management network becomes unstable after a few days

I’ve installed ESXi 4.0 Update 1 on two identical machines that reside in the same network segment. On both servers, I’ve created two virtual machines: one runs Red Hat Enterprise Linux 5.4 and the other runs a small load balancer appliance (Hercules).

Hardware:
Dell PowerEdge R210
Intel Xeon X3450 2.66GHz HT
8GB RAM
2x500 GB in RAID 1
Using ONE port of the internal Broadcom NetXtreme II BCM5716 NIC (this port is shared between the management network and the VMs).
(All hardware is marked as ‘supported’ by VMware.)

We applied all available patches, including the recent April 1st patch; we’re at build 244038 now.

The Problem
After a few days, the vSphere client cannot establish a connection to the ESXi hosts anymore. The virtual machines keep running without any problem, however. Only a full reset (applied through a remote power cycle) restores connectivity to the management network. We experience this issue on both servers: about three days after power-on/reset, the vSphere client cannot connect anymore.

Observations:
• Only the management network suffers from connectivity problems.
• Restarting the management network (agents) via the physical console doesn’t restore service.
• The physical console offers some basic diagnostics, like ‘testing the management network’. The PING tests intermittently fail: about half of the PINGs to the gateway or DNS servers fail. The hardware and the network config MUST be correct, since the management network works for a few days before failing and the VMs keep running without any problem.
• We’ve investigated the network traffic from a remote vSphere client trying to connect to the ESXi server, using a packet sniffer. The remote ESXi host resets the connection after initial contact, so there IS packet interchange.

Given the above, I strongly suspect a problem in the network driver in ESXi, but I don’t know how to diagnose the issue any further. I’ve exhausted all options on the physical ESXi console. I know how to access the (unsupported) command-line console, but I don’t know what to look for. Could it be a problem that the management network shares the same NIC as the VMs?
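
For context, the management-network side of things can be inspected from the unsupported console with commands along these lines (a sketch; the gateway address is only a placeholder):

esxcfg-vmknic -l        (lists the vmkernel/management interfaces and their IP configuration)
esxcfg-route            (shows the default gateway used by the management network)
vmkping 192.168.1.254   (pings the gateway through the vmkernel stack; substitute your own gateway)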

I’ve been struggling with this issue for several weeks now; any help or suggestions are highly appreciated.
wolfje_xp replied:

Hi,
I also use ESXi 4.0, but haven't applied the last update. I also share a NIC for management and VM traffic, without problems so far.
Maybe you can have a look in the log files under /var/log?
What do vpxa.log and hostd.log say? If it's an agent issue within each ESXi host, those logs should reveal some clue for you.

/var/log/vmware/hostd.log
/var/log/vmware/vpx/vpxa.log
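
If you can still get onto the unsupported console while it is happening, you could also watch those logs live; a sketch using the busybox tools, assuming the paths above:

tail -f /var/log/vmware/hostd.log             (follow the management agent log while connections are attempted)
grep -i error /var/log/messages | tail -20    (show recent error lines from the general system log)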

Post back.
thijsschoonbrood (ASKER) replied:

After a recent boot, vSphere is able to connect again. However, I see some disturbing lines in these logs:

In /var/log/messages:
Hostd: [2010-04 ..... 29766DC0 warning 'Proxysvc'] Accept on client connection failed: Bad file descriptor

In /var/log/vmware/hostd.log:
[2010-04-....  warning 'Proxysvc Req00005'] Error reading from client while waiting for header: N7Vmacore15SystemExceptionE (Connection reset by peer)

I was unable to find '/var/log/vmware/vpx/vpxa.log'; the subdirectory vpx didn't exist either.


Before I rebooted the physical machine, I noticed an even more disturbing message in a log:
[2010-04 -… 20BECCDC panic ‘App’] error: Cannot allocate memory

and several of these (which I suspect are related to an out-of-memory situation):
vmkernel: 3:19:39:51.629 cpu2:1028056)WARNING: Tcpip_Socket: 1619: socreate(type=2, proto=0) failed with error 55

I had allocated 8 GB to one of the virtual machines, while the host only has 8 GB in total. Can this have caused the memory problem? To be on the safe side, I reduced the amount to 6 GB.
Also, I put the management network on another vSwitch connected to a dedicated NIC port. So now the VMs share one NIC port and the management network uses a separate one.
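For reference, the resulting layout can be double-checked from the console with commands along these lines (the vSwitch and port group names will of course differ per setup):

esxcfg-vswitch -l    (lists the vSwitches, their port groups and their uplink vmnics)
esxcfg-nics -l       (lists the physical NICs, so you can see which vmnic is used where)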

Perhaps one of these actions fixes the issue (I'll only know after three days...), but I would very much like to understand what went wrong. Did I misconfigure anything?
Hi. It sure can be a memory issue. It's not a good idea to grant all your physical memory to a VM. (It shouldn't produce those errors, however, but swapping will kill your performance.)
I assume your network settings are correct (domain etc.)?
Any chance you have bad memory / a bad hard disk?

I can only assume that all the physical server hardware is OK, since we experience exactly the same issue on two identical (brand new) servers, and only after three days. It's of course an assumption, but I think I can be rather sure the hardware is OK. Network settings are correct, I think; I verified them quite a few times. :-)
Can I assume that you have run a full memtest on the memory in both servers? :)
I put the mgmt. network on a different NIC port (and vSwitch) and reduced the memory allocated to the VMs to 6 GB. Still, after three days, vSphere is unable to connect to ESXi again. :-( I'm running out of ideas here. :-(
Are you sure you have the latest driver for your NICs? It sounds like a NIC problem when you're experiencing "PING tests intermittently fail" problems.
Hi bbnp2006,

Well, I've applied all available patches, assuming that all the network drivers would be updated to the new version as well. Is my assumption wrong? How can I determine which network driver version I'm exactly using? When my vSphere client could still connect, I was not able to find any version reference except the build number.
(I'm using the (internal) Broadcom NetXtreme II BCM5716 NIC.)
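
For anyone reading along: the driver module and version actually in use can be read from the unsupported console, roughly like this (vmnic0 is just an example name):

esxcfg-nics -l       (shows each vmnic together with the driver module it uses, e.g. bnx2)
ethtool -i vmnic0    (shows the driver name, driver version and firmware version for that vmnic)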
SOLUTION by bbnp2006 (United Kingdom) [visible to Experts Exchange members only]

ASKER CERTIFIED SOLUTION [visible to Experts Exchange members only]
Hi bbnp2006,
First, thanks for your help! I did check the compatibility guide, but I guess I overlooked the 'S'... However, Dell indicates that ESX 4.0 U1 is supported on their R210 server hardware. Perhaps it's supported, but not using the internal NIC then? I really don't want to revert to 3.5, though; perhaps it's possible to use a bnx2 version that supports the 5716 (without the S)? How can I determine which bnx2 version is currently being used?
I did find a driver for ESXi 4.0 for the BCM5716 (without the S ;-)).
http://downloads.vmware.com/d/details/esx_esxi_40_broadcom_bnx2_dt/ZCV0YmRqZHRidHdw
I'll check it out ASAP.
Yes, that's the post-release version of the driver for your NIC that was not originally supported "inbox" when vSphere was released. That will hopefully get rid of the networking issue.
Report back if it's working for you. Good luck!
I will report back ASAP. One question, though: are drivers (like the bnx2 driver) also included when I check for updates using the Host Update Utility?
SOLUTION [visible to Experts Exchange members only]
I checked the bnx2 driver version using 'ethtool -i vmnic1'.
It revealed that I was still on 1.6.9. The latest release available from VMware (see the URL a few posts back) is 2.0.7c. I just installed the new driver successfully and rebooted the system. In three days I'll know whether that solved the issue. :-)
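In case it helps someone else: an offline bundle like that one can typically be installed from a machine with the vSphere CLI, roughly as follows (the bundle file name is only a placeholder; the host is normally put in maintenance mode first and rebooted afterwards):

vihostupdate.pl --server <esxi-host> --username root --install --bundle offline-bundle.zip   (installs the driver bundle on the host)
vihostupdate.pl --server <esxi-host> --username root --query                                 (lists the installed bulletins to verify)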
Great stuff! Fingers crossed and looking forward to your updates in 3 days :)
~bbnp2006
Unfortunately, it turned out that the issue was not solved. I also disabled the CIM agents, but after a while both servers crashed again.
Sorry to hear that, mate. I would contact VMware for support (hopefully that's an option). Let me do some digging to see if I can find anything.
Hi bbnp2006,
VMware support is definitely an option if it's up to me. However, I've enabled SSH (to help diagnose the problem) and disabled the CIM agents, which breaks the support option, I'm afraid.
Also, here is another guy having exactly the same issue: http://communities.vmware.com/thread/268626.
I've added the observation below to that thread; maybe it helps in generating some new ideas.. ;-) Is there a way the two servers can influence one another?
Thanks for sticking with me. Appreciated.
--------
We're running two (identical) servers. It appears that the issue only occurs when both ESXi hosts are up (in the sense that vSphere can connect). This chain of events leads me to believe this:

both servers were inaccessible due to the issue at hand (VMs running fine, but vSphere couldn't connect)
I disabled the CIM agents on one server and rebooted it
that server worked for weeks; vSphere was able to connect to it (and not to the other, which was still to be fixed)
I became confident that this solved the issue and implemented it on the other server as well
rebooted the second server
vSphere was able to connect to both servers
a few days later: both servers were inaccessible through vSphere again... same story with 'cannot allocate memory'......
No problem, bud.
Just another thought: are you 100% confident that there isn't any sort of IP conflict on your network? It's just so strange that it works for a while and then all of a sudden stops working... just trying to cover all the bases.
These days I'm never 100% certain of anything anymore. ;-) But yes, I did check that (quite a few times ;-)). Could an IP conflict cause memory problems?
Perhaps some process exhausts all sockets? Is there a way to determine which (and how many) sockets are opened by a particular process under ESXi?
The ps command should give you lots of options to query all the processes running on the host, but I'm not sure it will show all the open ports, though.
Actually,
#netstat -tulnap
will give you all the ports, bud :P
Hi bbnp2006,
There's no netstat on the SSH console (using ESXi 4.0 U1)... ;-(