Two months ago we purchased three identical big blue servers, and we started to experience issues with two of them. The servers turn off at random: they can work for a week or two and then they shut down. When we dug into the logs we saw that the servers were detecting too high a temperature and the IMM would initiate an immediate power-off.
There is no alert that the temperature is going up or anything else; the server just turns off as if somebody had unplugged the cable.
Of course, cutting power to a server like that can cause all kinds of corruption, from the operating system to the hypervisor, disks and so on. In our case the corruption showed up in the VM config file, so the virtual machine would not start and was reported as being in a saved-critical state, or as Incomplete VM Configuration in System Center Virtual Machine Manager (SCVMM). That state simply means the config file is corrupted and neither Hyper-V Manager nor SCVMM is able to read it.
The solution I found online is to recreate the machine: delete the machine in Hyper-V Manager, create a new one and attach the existing vhd or vhdx file. This works, but I think it is very time consuming. It is fine for a few machines, but imagine what you will be doing for the next 5 days if you have hundreds of test machines like we do :)
The problem with recreating the machine is that it gets brand new network adapters, which causes the old adapters to become hidden and the server to acquire a new IP address. Fixing that involves digging into HKLM\SYSTEM\ControlSet001\services\Tcpip\Parameters\Interfaces\ to find the old IP address, or looking the old IP address up somewhere else, and then assigning it to the new network adapter. Far too time consuming.
What did we do? Since VMM reported the configuration as incomplete, we investigated the config files on the Hyper-V hosts. You cannot do anything about this from VMM, so abandon it and connect directly to the Hyper-V host.
In Hyper-V Manager you will see a big mess like this:
So let’s say the machine we are going to fix is WS2008R2-001.test.local. Go to the configuration folder for the virtual machines. Some people keep it on the C drive, some keep it together with the virtual machines; setups vary.
If you let VMM create the config files, the files you are looking for will look like this:
You got that right, it is a big mess, and if you have hundreds of virtual machines you need some kind of text search application. In our case we use good old Total Commander. So search all these files for the server name, WS2008R2-001.test.local in our example.
After you find the file, open it in something like Notepad++ and scroll to the bottom.
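If you would rather script the search than use a file manager, here is a minimal Python sketch of the same idea. It is not what we used, just an illustration. It assumes the config files have an .xml extension and may be saved as UTF-16 (it checks for a byte order mark and otherwise falls back to UTF-8); the function names and the example folder are made up for this post.

```python
from pathlib import Path


def read_text_any(path):
    """Read a config file as text, guessing UTF-16 from a byte order mark."""
    raw = Path(path).read_bytes()
    if raw[:2] in (b"\xff\xfe", b"\xfe\xff"):
        return raw.decode("utf-16", errors="replace")
    # Assumption: anything without a UTF-16 BOM is treated as UTF-8.
    return raw.decode("utf-8", errors="replace")


def find_configs(config_dir, vm_name):
    """Return every *.xml file under config_dir that mentions vm_name."""
    needle = vm_name.lower()
    hits = []
    for path in sorted(Path(config_dir).rglob("*.xml")):
        # Case-insensitive match, since host names are often written both ways.
        if needle in read_text_any(path).lower():
            hits.append(path)
    return hits


# Example call (the folder is hypothetical, use your own config folder):
# for hit in find_configs(r"D:\VMConfigs", "WS2008R2-001.test.local"):
#     print(hit)
```

The search walks subfolders too, because the config folder layout differs from host to host.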
This is the end of the file in our environment:
So obviously the configuration is corrupted: the file is incomplete, and neither the Hyper-V console nor VMM is able to read it. What we did was open the config file of a healthy virtual machine and find the <count_per_node> tag.
In the healthy file you will see something like this:
So basically we need to finish our corrupted file. We simply copied everything from <count_per_node> down in the healthy file and pasted it in place of the old <count_per_node> tag in the corrupted file, and VOILA, the machine config is no longer corrupted and the machine is visible again in VMM and Hyper-V Manager because the config file is complete.
What you need to look for in your case are other tags like Count_per_node, Node_per_socket, Stopped_at_host_shutdown etc., as you can see in the screenshot. All of these are default values. If you need different values, change them accordingly, but that is fine tuning and in most environments you don’t need to touch them.
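With hundreds of machines it can help to list every incomplete config file up front instead of opening them one by one. A rough Python sketch, assuming the config files are XML with an .xml extension: a file that was cut off in the middle is no longer well-formed XML, so the parser rejects it. The function names are made up for this post.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def is_well_formed(path):
    """Return True if the file parses as XML, False if it is cut off or broken."""
    try:
        # Pass raw bytes so the parser can work out the encoding by itself.
        ET.fromstring(Path(path).read_bytes())
        return True
    except ET.ParseError:
        return False


def find_broken_configs(config_dir):
    """List every *.xml file under config_dir that fails to parse."""
    return [p for p in sorted(Path(config_dir).rglob("*.xml"))
            if not is_well_formed(p)]
```

This only tells you that a file is not valid XML, not why, so you still want to look at the end of each reported file by hand.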
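We did the copy and paste by hand, but the same splice can be sketched in Python if you have many files to fix. Treat this as an illustration under assumptions, not a tested tool: it assumes the corrupted file still contains the start of the count_per_node tag, that the healthy file has the same layout, and that a .bak copy next to the original is an acceptable backup. The function names are made up for this post.

```python
import shutil
from pathlib import Path

# Match the start of the tag only, in case the real tag carries attributes.
TAG = "<count_per_node"


def read_text_and_encoding(path):
    """Return (text, encoding), guessing UTF-16 from a byte order mark."""
    raw = Path(path).read_bytes()
    if raw[:2] in (b"\xff\xfe", b"\xfe\xff"):
        return raw.decode("utf-16", errors="replace"), "utf-16"
    # Assumption: anything without a UTF-16 BOM is treated as UTF-8.
    return raw.decode("utf-8", errors="replace"), "utf-8"


def repair_config(broken_path, healthy_path, tag=TAG):
    """Replace the tail of the broken file, from the last `tag` onwards,
    with the tail of the healthy file. Returns the path of the backup."""
    broken, encoding = read_text_and_encoding(broken_path)
    healthy, _ = read_text_and_encoding(healthy_path)
    cut_broken = broken.rfind(tag)
    cut_healthy = healthy.rfind(tag)
    if cut_broken == -1 or cut_healthy == -1:
        raise ValueError(f"{tag!r} not found in both files, fix this one by hand")
    # Keep the original next to the repaired file before touching anything.
    backup = Path(str(broken_path) + ".bak")
    shutil.copy2(broken_path, backup)
    repaired = broken[:cut_broken] + healthy[cut_healthy:]
    Path(broken_path).write_bytes(repaired.encode(encoding))
    return backup
```

Everything above the tag stays as it was in the corrupted file, so machine-specific settings such as the network adapters are kept. Only the tail comes from the healthy machine.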