HP ASR – Automatic Server Recovery
How much do you know about HP ASR, and how does it work?
The ASR feature is a hardware-based timer. If a true hardware failure occurs, the Health Monitor might not be called, but the server will be reset as if the power switch were pressed. The ProLiant ROM code may log an event to the IML when the server reboots.
The ASR Timeout option sets a timeout limit for resetting a server that is not responding. When the server has not responded in the selected amount of time, the server automatically resets.
The available time increments are:
The ASR feature is implemented using a “heartbeat” timer that continually counts down. The Health Monitor frequently reloads the counter to prevent it from counting down to zero.
If the ASR timer counts down to zero, the operating system is assumed to have locked up, and the system automatically attempts to reboot.
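The countdown-and-reload behaviour described above can be sketched as a small simulation. This is only a toy model, not HP's implementation: the names `HeartbeatTimer`, `simulate`, `timeout_ticks`, and `hang_at` are made up for illustration, and time is counted in abstract ticks rather than minutes.

```python
class HeartbeatTimer:
    """Toy model of a countdown watchdog timer (not HP's actual code)."""

    def __init__(self, timeout_ticks):
        self.timeout_ticks = timeout_ticks
        self.remaining = timeout_ticks
        self.expired = False

    def reload(self):
        """Reset the counter; the monitor calls this while the OS is healthy."""
        if not self.expired:
            self.remaining = self.timeout_ticks

    def tick(self):
        """Count down one time unit; return True once the timer has expired."""
        if not self.expired:
            self.remaining -= 1
            if self.remaining <= 0:
                self.expired = True
        return self.expired


def simulate(timeout_ticks, hang_at, total_ticks):
    """Return the tick at which a reset would fire, or None if it never does.

    The monitor reloads the timer on every tick before `hang_at`. From
    `hang_at` onward it stops reloading, which models a locked-up OS.
    """
    timer = HeartbeatTimer(timeout_ticks)
    for now in range(1, total_ticks + 1):
        if now < hang_at:
            timer.reload()
        if timer.tick():
            return now
    return None
```

As long as the reload keeps arriving, `simulate` returns `None`. Once the reloads stop, the timer runs out a fixed number of ticks later, and that is the point at which the hardware would reset the server.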
Events which may contribute to the operating system locking up include:
- A peripheral device, such as a Peripheral Component Interconnect (PCI) adapter, that generates numerous spurious interrupts when it fails.
- A high-priority software application that consumes all available central processing unit (CPU) cycles and does not allow the operating system scheduler to run the ASR timer reset process.
- A software or kernel application that consumes all available memory, including the virtual memory space (for example, swap). This may cause the operating system scheduler to cease functioning.
- A critical operating system component, such as a file system, that fails and causes the operating system scheduler to cease functioning.
- Any other event besides an ASR timeout that causes a Non-Maskable Interrupt (NMI) to be generated.
The Health Monitor is notified of an ASR timeout through an NMI. If possible, the driver attempts to perform the following actions:
- Display a message on the console stating the problem.
- Make an entry in the IML.
- Gracefully shut down the operating system to close the file systems.
There is no guarantee that the operating system will shut down gracefully. Whether it does depends on the type of error condition (software or hardware) and its severity. The Health Monitor logs a series of messages when an ASR event occurs. The presence or absence of these messages can provide some insight into the reason for the ASR event. The order of the messages is important, since the ASR event is always a symptom of another error condition.
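To see why the presence or absence of messages is informative, here is a minimal sketch of a best-effort handler. It is hypothetical throughout: `handle_asr_timeout` and the step names are invented for illustration, and it assumes that a failing step stops the sequence, which the Health Monitor driver may or may not do.

```python
def handle_asr_timeout(steps):
    """Run each recovery step in order, best effort.

    `steps` is a list of (name, callable) pairs. Assumption for this
    sketch: a failing step stops the sequence, since a system in this
    state may not get any further. Returns the names of completed steps.
    """
    completed = []
    for name, action in steps:
        try:
            action()
        except Exception:
            break
        completed.append(name)
    return completed


def ok():
    """Stand-in for a step that succeeds."""


def fail():
    """Stand-in for a step that cannot complete."""
    raise RuntimeError("file system unavailable")


steps = [
    ("console message", ok),
    ("IML entry", ok),
    ("graceful shutdown", fail),
]
print(handle_asr_timeout(steps))  # ['console message', 'IML entry']
```

In this model, a log that ends after the IML entry suggests the shutdown step is where things went wrong, which is the kind of inference the ordered messages allow.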
Fine, ASR is a good feature, but what do VMware experts recommend about it?
ASR must be disabled when you have ESXi installed on a ProLiant server.
There are three primary arguments for disabling ASR on ESXi hosts:
- Unintended virtual machine outages: If the heartbeat timer reaches zero as a result of a problem within the Service Console (for example, CPU or memory utilization or an agent failure), ASR may determine that the server has failed, even if the overlying virtual machines are still functioning. In this case, attempt to migrate the virtual machines off the host prior to a host restart. If ASR is enabled, the host is rebooted and the overlying virtual machines fail, resulting in an outage to business-facing applications that may have been avoided (by migrating or working with support) or minimized (by scheduling a maintenance window in the event that migration fails).
- Loss of diagnostic data: If the heartbeat timer reaches zero as a result of a purple diagnostic screen error and ASR reboots the system, it may become impossible to determine the root cause of the ASR reboot, since diagnostic data related to the crash is lost upon restart. In addition, if there is a delay between the Service Console becoming unresponsive and the resultant purple diagnostic screen error, ASR could reboot the system before the purple diagnostic screen error is generated. This would prevent the purple diagnostic screen error and its related diagnostic data from being produced at all. The purple diagnostic screen error contains a wealth of valuable information that can aid in pinpointing a root cause.
- Increasing the ASR timer may not help: The ASR timer can be increased from 10 to as high as 30 or 60 minutes. However, doing so may reduce ASR’s effectiveness. Its intent is to minimize downtime, and 30 or 60 minutes is a long time for a system to be unresponsive without operator intervention. Further, even with the timer set that high, ASR reboots can still occur, which will impact the administrator’s ability to troubleshoot the issue.