Memory failure is one of the causes of server crashes, and it affects service availability and performance. Consider a service that runs on multiple servers: any of them could crash because of a single memory module failure or an uncorrectable memory error. To limit the impact of memory errors on services, HPE provides RAS (reliability, availability, and serviceability) technologies, including:
- HPE Fast Fault Tolerance
- Advanced ECC support
- Online spare with Advanced ECC support
- Mirrored memory with Advanced ECC support
In this post, we’ll compare HPE Fast Fault Tolerance and HPE Advanced ECC Support. Before the comparison, let’s look at why memory RAS is needed.
Why Is Memory RAS Needed?
Server uptime is still one of the most critical aspects of data center maintenance. Unfortunately, servers can run into trouble from time to time due to software issues, power outages, or memory errors. The three major categories of memory errors we track and manage are correctable errors, uncorrectable errors, and recoverable errors. Which errors are correctable and which are uncorrectable depends entirely on the capability of the memory controller.
Correctable errors are, by definition, errors that can be detected and corrected by the chipset. Correctable errors are generally single-bit errors. All HPE servers can detect and correct single-bit errors and, with Advanced error-correcting code (ECC) support, some multi-bit errors as well. On HPE systems, the user is warned about a DIMM exceeding the correctable error threshold (the maximum number of correctable errors tolerated in a certain time window) either through lights on the front panel or system board (if available), or through the HPE Integrated Management Log (IML).
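The idea of a correctable error threshold over a time window can be sketched as a sliding-window counter. This is only an illustration: the class name, the threshold of 3, and the 60-second window are hypothetical values chosen for the example, not figures HPE publishes or the way HPE firmware implements the check.

```python
from collections import deque


class CorrectableErrorMonitor:
    """Counts correctable errors for one DIMM inside a sliding time window."""

    def __init__(self, threshold: int, window_seconds: float) -> None:
        self.threshold = threshold            # hypothetical value, not an HPE figure
        self.window_seconds = window_seconds  # hypothetical value, not an HPE figure
        self._timestamps = deque()

    def record_error(self, now: float) -> bool:
        """Record one correctable error at time `now` (in seconds).

        Returns True when the count inside the window exceeds the threshold.
        """
        self._timestamps.append(now)
        # Drop errors that have aged out of the window.
        while self._timestamps and now - self._timestamps[0] > self.window_seconds:
            self._timestamps.popleft()
        return len(self._timestamps) > self.threshold


monitor = CorrectableErrorMonitor(threshold=3, window_seconds=60.0)
alerts = [monitor.record_error(t) for t in (0.0, 10.0, 20.0, 30.0, 200.0)]
print(alerts)  # [False, False, False, True, False]
```

Four errors within 30 seconds trip the warning, while a single error much later does not, because the earlier errors have aged out of the window.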
Uncorrectable errors are errors that can be detected but not corrected by the chipset. These are always multi-bit memory errors. The error will be logged in the IML. Uncorrectable errors can typically be isolated down to a single DIMM. Uncorrectable errors will usually result in an immediate system crash or shutdown. In some cases, with operating system (OS) support and advanced SKU processors (Intel Xeon Platinum and Gold processors), uncorrectable errors do not result in a system crash. We call these recoverable errors. For details on error recovery, check with your OS vendor.
DRAM errors generally fall into two types, hard errors and soft errors:
- Hard errors typically indicate a problem with the DIMM itself. Although hard correctable errors are corrected by the system and will not result in system downtime or data corruption, they still indicate a hardware problem. Hard errors will typically cause a DIMM to exceed HPE systems’ correctable error threshold. The user is warned about those errors.
- Soft errors do not indicate any issues with the DIMM. They occur when the data and/or ECC bits on the DIMM are incorrect, but the error will not continue to occur once the data and/or ECC bits on the DIMM have been corrected. Soft errors will not typically cause a DIMM to exceed HPE systems’ correctable error threshold and therefore, no indication of a hardware issue is shown.
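The difference between the two types can be illustrated with a toy model: rewrite the corrected data and read it back. If the error is gone, it was soft; if it persists, it was hard. The `MemoryWord` class and its `stuck_mask` field are hypothetical constructs for this sketch and do not describe how a real memory controller classifies errors.

```python
class MemoryWord:
    """Toy model of one stored word; `stuck_mask` marks bits physically stuck at 1."""

    def __init__(self, value: int, stuck_mask: int = 0) -> None:
        self.stuck_mask = stuck_mask
        self._cell = value

    def write(self, value: int) -> None:
        self._cell = value

    def flip(self, bit: int) -> None:
        """Transient upset: flips one stored bit once."""
        self._cell ^= 1 << bit

    def read(self) -> int:
        # Stuck bits always read as 1, regardless of what was written.
        return self._cell | self.stuck_mask


def error_persists_after_rewrite(word: MemoryWord, expected: int) -> bool:
    """Rewrite the corrected data, then read back. True suggests a hard error."""
    word.write(expected)
    return word.read() != expected


expected = 0b1010
soft = MemoryWord(expected)
soft.flip(0)                                    # one-off bit flip
hard = MemoryWord(expected, stuck_mask=0b0100)  # bit 2 stuck at 1

soft_seen = soft.read() != expected
hard_seen = hard.read() != expected
soft_persists = error_persists_after_rewrite(soft, expected)
hard_persists = error_persists_after_rewrite(hard, expected)
print(soft_seen, soft_persists)  # True False
print(hard_seen, hard_persists)  # True True
```

Both words show an error on the first read, but only the word with the stuck bit keeps failing after the data has been rewritten.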
Any kind of error, if not handled correctly, can eventually cause a system shutdown. In the early days of servers, basic ECC was sufficient to resolve most DRAM failures. However, today’s servers present a completely different challenge, so additional RAS features are necessary to maintain expected server stability and uptime. By keeping a memory fault from becoming a critical failure, these features avoid a system crash, and failed memory devices can then be replaced as part of periodic service. Memory RAS technologies can also detect a DRAM device on a DIMM that has had numerous soft errors and recommend replacing it before it suffers a hard failure.
HPE Advanced ECC Support
In system ROM revisions prior to 1.50, Advanced ECC memory is the default memory protection mode for HPE servers. In revision 1.50 and later, HPE Fast Fault Tolerance is the default RAS mode in all Workload Profiles except for the Low Latency Profile.
Standard ECC can correct single-bit memory errors and detect multi-bit memory errors. When multi-bit errors are detected using standard ECC, the error is signaled to the server and causes the server to halt.
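The "correct one bit, detect two bits" behaviour can be demonstrated with a small extended Hamming code. The sketch below protects only 4 data bits with an 8-bit codeword so that it stays readable; real server ECC works on much wider words, and nothing here is claimed to match the code used in HPE hardware.

```python
from typing import List, Optional, Tuple


def encode(data: int) -> List[int]:
    """Encode 4 data bits as an 8-bit extended Hamming codeword.

    Index 0 holds the overall parity bit; indexes 1, 2 and 4 hold Hamming parity.
    """
    code = [0] * 8
    code[3], code[5], code[6], code[7] = [(data >> i) & 1 for i in range(4)]
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    code[0] = sum(code[1:]) % 2
    return code


def decode(code: List[int]) -> Tuple[str, Optional[int]]:
    """Return (status, data); data is None when the error cannot be corrected."""
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos           # XOR of the positions of all set bits
    overall = sum(code) % 2           # 0 when the overall parity checks out
    fixed = list(code)
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        # Odd parity: treat as a single-bit error at index `syndrome`
        # (index 0 means the overall parity bit itself was hit).
        fixed[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with even parity: two bits flipped.
        return "uncorrectable", None
    data = fixed[3] | (fixed[5] << 1) | (fixed[6] << 2) | (fixed[7] << 3)
    return status, data


word = encode(0b1011)
single = list(word)
single[5] ^= 1                        # one flipped bit
double = list(word)
double[5] ^= 1
double[6] ^= 1                        # two flipped bits

print(decode(word))    # ('ok', 11)
print(decode(single))  # ('corrected', 11)
print(decode(double))  # ('uncorrectable', None)
```

The double-bit case is the one that, in the standard ECC description above, is signaled to the server and causes it to halt.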
Advanced ECC has been the default error correction scheme in HPE servers for over two decades. It protects servers not only against single-bit errors but also against some multi-bit memory errors, specifically those that occur within a single DRAM chip.
Advanced ECC can correct both single-bit memory errors and 4-bit memory errors if all failed bits are on the same DRAM device on the DIMM. Advanced ECC provides more protection than standard ECC because it is possible to correct certain memory errors that would otherwise be uncorrected and result in a server failure. Using HPE advanced memory error detection technology, the server provides notification when a DIMM is degrading and has a higher probability of an uncorrectable memory error.
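The "same DRAM device" rule can be expressed as a simple check on which device each failed bit belongs to. This sketch models only the rule, not the error-correcting code itself. It assumes, purely for illustration, that bit positions are numbered so that each consecutive group of four bits comes from one x4 device; the actual bit-to-device mapping depends on the memory controller and DIMM.

```python
from typing import Iterable


def advanced_ecc_can_correct(failed_bits: Iterable[int], device_width: int = 4) -> bool:
    """True if every failed bit position maps to the same DRAM device.

    Assumption: bits [0..3] belong to device 0, bits [4..7] to device 1, and so on.
    """
    devices = {bit // device_width for bit in failed_bits}
    # No failed bits means nothing to correct; one device means the rule is met.
    return len(devices) <= 1


print(advanced_ecc_can_correct([17]))            # True: single-bit error
print(advanced_ecc_can_correct([8, 9, 10, 11]))  # True: four bits, all in device 2
print(advanced_ecc_can_correct([3, 4]))          # False: bits span devices 0 and 1
```

Note that the last case is only a 2-bit error, yet it falls outside the rule because the bits sit on different devices.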
There are no specific memory population rules or RBSU settings required for Advanced ECC support. It is enabled by default on Purley platforms.
Although Advanced ECC provides failure protection, it can reliably correct multi-bit errors only when they occur within a single DRAM chip. Advanced ECC does not provide failover capability. As a result, if there is a memory failure, the system must be shut down before the memory can be replaced. The latest generation of HPE ProLiant/Synergy/Blade servers using Intel Xeon Scalable processors offers three levels of advanced memory protection (including HPE Fast Fault Tolerance) that provide increased fault tolerance for applications requiring higher levels of availability.
HPE Fast Fault Tolerance
HPE Fast Fault Tolerance is an HPE memory RAS feature first introduced in HPE Gen10 servers with Intel Xeon Scalable processors. Servers configured with HPE SmartMemory and HPE Fast Fault Tolerance offer an extra layer of protection against planned server downtime and server crashes. HPE Fast Fault Tolerance, an enhanced version of adaptive double device data correction (ADDDC), is the result of a joint Intel and Hewlett Packard Enterprise collaboration. It has more spare regions (parts of memory allocated only for replacing bad memory areas) and more options for mapping out bad sections of memory. This results in significantly better memory reliability and availability than what the rest of the industry can provide using ADDDC only.
Starting with ROM revision 1.50, HPE Fast Fault Tolerance is enabled by default for all Workload Profiles with the exception of the Low Latency Profile. In all previous ROM revisions, HPE Fast Fault Tolerance is only enabled by default when the “mission-critical” profile is selected in the ROM-based setup utility (RBSU).
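The default-mode rules described in this post can be condensed into a small decision function. The profile labels and function names below are hypothetical and are not RBSU identifiers. The mode returned for the Low Latency Profile on revision 1.50 and later is an assumption: the text only says Fast Fault Tolerance is not the default there.

```python
from typing import Tuple

LOW_LATENCY = "Low Latency"            # hypothetical label for the profile
MISSION_CRITICAL = "Mission Critical"  # hypothetical label for the profile


def parse_revision(text: str) -> Tuple[int, int]:
    """Turn a revision string such as '1.50' into (1, 50) for comparison."""
    major, minor = text.split(".")
    return int(major), int(minor)


def default_ras_mode(rom_revision: str, workload_profile: str) -> str:
    """Return the default memory protection mode described in the text."""
    if parse_revision(rom_revision) >= (1, 50):
        if workload_profile == LOW_LATENCY:
            # Assumption: the text does not name the default for this profile.
            return "Advanced ECC"
        return "HPE Fast Fault Tolerance"
    # Before 1.50, Fast Fault Tolerance is the default only for mission-critical.
    if workload_profile == MISSION_CRITICAL:
        return "HPE Fast Fault Tolerance"
    return "Advanced ECC"


print(default_ras_mode("1.50", "General Power Efficient Compute"))
print(default_ras_mode("1.50", LOW_LATENCY))
print(default_ras_mode("1.42", MISSION_CRITICAL))
print(default_ras_mode("1.42", "General Power Efficient Compute"))
```

Parsing the revision into a pair of integers avoids comparing "1.50" and "1.6" as floating-point numbers, which would order them incorrectly if the minor part is a counter rather than a decimal fraction; that reading of the revision format is also an assumption.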
In past server generations, the most advanced memory protection technology in ProLiant servers was double device data correction (DDDC). Its biggest drawback was that it had to be enabled at boot and significantly reduced memory throughput when enabled, so customers had to choose between resiliency and performance. HPE Fast Fault Tolerance is a significant improvement over DDDC because it combines the performance benefits of single device data correction (SDDC) with the availability of DDDC. It allows the system to boot with full memory performance and puts only small sections (banks) of memory into lockstep when needed to correct failures, resulting in significantly better performance than DDDC. When the failing section is larger than a bank, a larger negative impact on performance may be observed.
HPE Fast Fault Tolerance has two memory configuration requirements: each populated channel must have a minimum of two ranks, and only HPE SmartMemory in x4 organization can be used.
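These two requirements lend themselves to a quick configuration check. The sketch below is a hypothetical helper, not an HPE tool: the `Dimm` fields, the channel names, and the reading of "two ranks" as the total across all DIMMs on a channel are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Dimm:
    ranks: int            # number of ranks on this DIMM
    device_width: int     # 4 for x4 organization, 8 for x8
    is_smartmemory: bool  # True for HPE SmartMemory


def fast_fault_tolerance_problems(channels: Dict[str, List[Dimm]]) -> List[str]:
    """Return a list of reasons the configuration does not meet the requirements."""
    problems = []
    for name, dimms in channels.items():
        if not dimms:
            continue  # the rank rule applies to populated channels only
        if sum(d.ranks for d in dimms) < 2:
            problems.append(f"channel {name}: fewer than two ranks")
        for index, dimm in enumerate(dimms):
            if not (dimm.is_smartmemory and dimm.device_width == 4):
                problems.append(f"channel {name}, DIMM {index}: not x4 HPE SmartMemory")
    return problems


config = {
    "A": [Dimm(ranks=2, device_width=4, is_smartmemory=True)],
    "B": [Dimm(ranks=1, device_width=8, is_smartmemory=True)],
    "C": [],
}
problems = fast_fault_tolerance_problems(config)
for line in problems:
    print(line)
```

In this example channel A passes, channel B is reported twice (one rank, x8 devices), and the empty channel C is ignored.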
Currently, HPE Fast Fault Tolerance requires that the server run in “closed-page” mode, and some workloads will see a minor reduction in throughput. Closed-page mode is not expected to cause a significant performance loss for random-access memory patterns (e.g., SQL or other databases), but there will be a performance loss for sequential-access memory patterns (e.g., data streams).
There will also be a minimal reduction in throughput if a DRAM fails, but only in the typically very small region of memory that is affected (most commonly a bank). No significant loss is expected for random-access memory patterns because the region in lockstep will be accessed infrequently. The loss can be significant with rank-level virtual lockstep, or if an application accesses the region frequently until the DIMM is replaced. The overall reduction in throughput from HPE Fast Fault Tolerance is expected to be minimal for the vast majority of customers, but it does depend on the application, the size of the affected region, and the memory configuration.