[Review]: What’s Remote Direct Memory Access (RDMA)?
What’s Remote Direct Memory Access (RDMA)?
Remote Direct Memory Access (RDMA) provides direct memory access from the memory of one host (storage or compute) to the memory of another host without involving the remote Operating System and CPU, boosting network and host performance with lower latency, lower CPU load and higher bandwidth. In contrast, TCP/IP communications typically require copy operations, which add latency and consume significant CPU and memory resources.
RDMA supports zero-copy networking by enabling the network adapter to transfer data directly to or from application memory, eliminating the need to copy data between application memory and the data buffers in the operating system. Such transfers require no work to be done by CPUs, caches, or context switches, and transfers continue in parallel with other system operations. When an application performs an RDMA Read or Write request, the application data is delivered directly to the network, reducing latency and enabling fast message transfer.
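The difference between a copied and a zero-copy transfer can be illustrated with a toy Python analogy. This is only a sketch of the idea, not real RDMA: the function names `stage_through_os` and `expose_to_adapter` are hypothetical, and a `memoryview` merely stands in for an adapter that reads application memory directly.

```python
def stage_through_os(app_buffer: bytearray) -> bytearray:
    """Socket-style path: the payload is copied into a separate, OS-owned buffer."""
    return bytearray(app_buffer)  # a real copy: costs CPU time and memory bandwidth


def expose_to_adapter(app_buffer: bytearray) -> memoryview:
    """RDMA-style path: the adapter gets a window onto the application's own memory."""
    return memoryview(app_buffer)  # no payload bytes are copied


app_buffer = bytearray(b"payload-v1")
staged = stage_through_os(app_buffer)    # independent copy of the data
window = expose_to_adapter(app_buffer)   # shares memory with app_buffer

# The application updates its buffer in place.
app_buffer[-1] = ord("2")

print(bytes(staged))  # the copy still holds the old contents
print(bytes(window))  # the window reflects the application's memory
```

The staged copy and the application buffer diverge after the update, while the window always shows the application's current memory. That shared view is what lets an RDMA adapter move data without an intermediate buffer.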
Advantages of Using RDMA
- Zero-copy: Send and receive data to and from remote buffers
- Kernel Bypass: Improving latency and throughput
- Low CPU Involvement: Access remote server’s memory without consuming CPU cycles on the remote server
- Convergence: Single fabric to support Storage and Compute
- Close to wire speed performance on Lossy Fabrics
- Available in InfiniBand and Ethernet (L2 and L3)
Where is RDMA used?
- High Performance Computing (HPC): MPI and SHMEM
- Machine Learning: TensorFlow™, Caffe, Microsoft Cognitive Toolkit (CNTK), PaddlePaddle and more
- Big Data: Spark, Hadoop
- Databases: Oracle, SAP (HANA)
- Storage: NVMe-oF (remote block access to NVMe SSDs), iSER (iSCSI Extensions for RDMA), Lustre, GPFS, HDFS, Ceph, EMC ScaleIO, VMware Virtual SAN, Dell Fluid Cache, Windows SMB Direct
Common RDMA implementations are as follows:
- Virtual Interface Architecture (VIA) is an abstract model of a user-level zero-copy network, and is the basis for InfiniBand, iWARP and RoCE. Created by Microsoft, Intel, and Compaq, the original VIA sought to standardize the interface for high-performance network technologies known as System Area Networks (SANs; not to be confused with Storage Area Networks).
- RDMA over Converged Ethernet (RoCE) is a network protocol that allows remote direct memory access (RDMA) over an Ethernet network. There are two RoCE versions, RoCE v1 and RoCE v2. RoCE v1 is an Ethernet link layer protocol and hence allows communication between any two hosts in the same Ethernet broadcast domain. RoCE v2 is an internet layer protocol, which means that RoCE v2 packets can be routed. Although the RoCE protocol benefits from the characteristics of a converged Ethernet network, the protocol can also be used on a traditional or non-converged Ethernet network.
- InfiniBand (abbreviated IB) is a computer-networking communications standard used in high-performance computing that features very high throughput and very low latency. It is used for data interconnect both among and within computers. InfiniBand is also used as either a direct or switched interconnect between servers and storage systems, as well as an interconnect between storage systems.
- Omni-Path (also Omni-Path Architecture, abbr. OPA) is a high-performance communication architecture owned by Intel. It aims for low communication latency, low power consumption and high throughput. Intel plans to develop technology based on this architecture for exascale computing. As of 2017, Intel offered at least 7 variations of multi-port switches under this name, in the form “Intel® Omni-Path Edge Switch 100 Series”, all “supporting 100 Gb/s for all ports”. The first models of that series became available in Q4/2015.
- iWARP (Internet Wide-area RDMA Protocol) is a computer networking protocol that implements remote direct memory access (RDMA) for efficient data transfer over Internet Protocol networks. Because iWARP is layered on IETF-standard congestion-aware protocols such as TCP and SCTP, it makes few requirements on the network, and can be successfully deployed in a broad range of environments.
- Soft-RoCE is a software implementation of RoCE that allows RoCE to run on any Ethernet network adapter, whether it offers hardware acceleration or not. Soft-RoCE has been part of the upstream Linux kernel since version 4.8 and is included in Mellanox OFED 4.0 and above.
- SCSI RDMA Protocol (SRP) is a protocol that allows one computer to access SCSI devices attached to another computer via remote direct memory access (RDMA). The SRP protocol is also known as the SCSI Remote Protocol. RDMA makes higher throughput and lower latency possible than, for example, TCP/IP. Hardware-offloaded RDMA requires network adapters that support it, such as InfiniBand HCAs and 10 GbE network adapters with iWARP support. While the SRP protocol has been designed to use RDMA networks efficiently, it is also possible to implement the SRP protocol over networks that do not support RDMA.
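As a small illustration of the RoCE v1 versus RoCE v2 distinction described in the list above, the sketch below classifies a frame from a few header fields. It relies on two well-known identifiers: RoCE v1 frames carry Ethertype 0x8915, and RoCE v2 packets are UDP datagrams addressed to destination port 4791. The function names and the simplified "fields as arguments" interface are assumptions made for illustration; a real classifier would parse raw packets.

```python
from typing import Optional

ROCE_V1_ETHERTYPE = 0x8915  # RoCE v1 is identified at the Ethernet link layer
IPV4_ETHERTYPE = 0x0800
IPV6_ETHERTYPE = 0x86DD
UDP_PROTOCOL = 17           # IP protocol number for UDP
ROCE_V2_UDP_PORT = 4791     # UDP destination port used by RoCE v2


def classify_roce(ethertype: int,
                  ip_protocol: Optional[int] = None,
                  udp_dst_port: Optional[int] = None) -> Optional[str]:
    """Return "RoCE v1", "RoCE v2", or None for non-RoCE traffic."""
    if ethertype == ROCE_V1_ETHERTYPE:
        return "RoCE v1"
    is_ip = ethertype in (IPV4_ETHERTYPE, IPV6_ETHERTYPE)
    if is_ip and ip_protocol == UDP_PROTOCOL and udp_dst_port == ROCE_V2_UDP_PORT:
        return "RoCE v2"
    return None


def is_routable(version: Optional[str]) -> bool:
    """Only RoCE v2 rides on IP, so only RoCE v2 can cross IP routers."""
    return version == "RoCE v2"


print(classify_roce(0x8915))            # link-layer frame
print(classify_roce(0x0800, 17, 4791))  # UDP/IPv4 datagram
print(classify_roce(0x0800, 6, 443))    # ordinary TCP traffic
```

Because RoCE v1 has no IP header, it stays inside one Ethernet broadcast domain, whereas the UDP/IP encapsulation of RoCE v2 is what makes it routable.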
RDMA in Bare-Metal
Most popular operating systems currently support RDMA, although the implementations differ. Microsoft Windows Server 2012 R2 and later, and most enterprise Linux distributions such as RHEL, support RDMA either as a native feature or through third-party drivers and software that enable the related APIs.
RDMA in Virtualization
Virtualization offers many benefits, including lower service deployment and maintenance costs. However, many organizations and companies still keep their latency-sensitive applications on bare metal, or run them in virtual machines with features that allow direct access to hardware, such as DirectPath I/O (in vSphere) or SR-IOV.
These features reduce latency and CPU utilization (especially under heavy workloads), but they have limitations. RDMA can reduce latency and CPU utilization further, because the other solutions still rely on TCP/IP, and RDMA has fewer limitations.
RDMA in VMware vSphere
Paravirtual devices are common in virtualized environments, providing improved virtual device performance compared to emulated physical devices. For virtualization to make inroads in High Performance Computing and other areas that require high bandwidth and low latency, high-performance transports such as InfiniBand, the Internet Wide Area RDMA Protocol (iWARP), and RDMA over Converged Ethernet (RoCE) must be virtualized.
VMware developed a paravirtual interface called Virtual RDMA (vRDMA) that provides an RDMA-like interface for VMware ESXi guests. vRDMA uses the Virtual Machine Communication Interface (VMCI) virtual device to interact with ESXi. The vRDMA interface is designed to support snapshots and VMware vMotion so the state of the virtual machine can be easily isolated and transferred.
Paravirtualized devices are common in virtualized environments because they provide better performance than emulated devices. With the increased importance of newer high-performance fabrics such as InfiniBand, iWARP, and RoCE for Big Data, High Performance Computing, financial trading systems, and so on, there is a need to support such technologies in a virtualized environment. These devices support zero-copy, operating system-bypass and CPU offload for data transfer, providing low latency and high throughput to applications. It is also true, however, that applications running in virtualized environments benefit from features such as vMotion (virtual machine live migration), resource management, and virtual machine fault tolerance. For applications to continue to benefit from the full value of virtualization while also making use of RDMA, the paravirtual interface must be designed to support these virtualization features.
Currently there are several ways to provide RDMA support in virtual machines. The first option, called passthrough (or VM DirectPath I/O on ESXi), allows virtual machines to directly control RDMA devices. Passthrough also can be used in conjunction with single root I/O virtualization (SR-IOV) to support the sharing of a single hardware device between multiple virtual machines by passing through a Virtual Function (VF) to each virtual machine. This method, however, restricts the ability to use virtual machine live migration or to perform any resource management. A second option is to use a software-level driver, called SoftRoCE, to convert RDMA Verbs operations into socket operations across an Ethernet device. This technique, however, suffers from performance penalties and may not be a viable option for some applications.
With that in mind, VMware developed a paravirtual device driver for RDMA-capable fabrics, called Virtual RDMA (vRDMA). It allows multiple guests to access the RDMA device using a Verbs API, an industry-standard interface. A set of these Verbs was implemented to expose an RDMA-capable guest device (vRDMA) to applications. The applications can use the vRDMA guest driver to communicate with the underlying physical device. The VMware paper Toward a Paravirtual vRDMA Device for VMware ESXi Guests describes the design and implementation of the vRDMA guest driver using the VMCI virtual device, the components of vRDMA, and how they work at different levels of the virtualization stack.
The Remote Direct Memory Access (RDMA) technique allows devices to read/write directly to an application’s memory without interacting with the CPU or operating system, enabling higher throughput and lower latencies. The application can directly program the network device to perform DMA to and from application memory. Essentially, network processing is pushed onto the device, which is responsible for performing all protocol operations. As a result, RDMA devices historically have been extremely popular for High Performance Computing (HPC) applications. More recently, many clustered enterprise applications, such as databases, file systems and emerging Big Data application frameworks such as Apache Hadoop, have demonstrated performance benefits using RDMA.
While data transfer operations can be performed directly by the application as described above, control operations such as allocation of network resources on the device need to be executed by the device driver in the operating system for each application. This allows the device to multiplex between various applications using these resources. After the control path is established, the application can directly interact with the device, programming it to perform DMA operations to other hosts, a capability often called OS-bypass. RDMA also is said to support zero-copy since the device directly reads/writes from/to application memory and there is no buffering of data in the operating system. This offloading of capabilities onto the device, coupled with direct user-level access to the hardware, largely explains why such devices offer superior performance.
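The split between the control path and the data path described above can be sketched with a toy model. Everything here is hypothetical: `ToyRdmaDevice`, its methods, and the `kernel_calls` counter are invented for illustration and do not correspond to the Verbs API or to any real driver. The model only shows that registration goes through the operating system once, while later transfers touch application memory directly and report completion through a queue.

```python
from collections import deque


class ToyRdmaDevice:
    """Hypothetical stand-in for an RDMA adapter; not a real driver or library."""

    def __init__(self):
        self.registered = {}        # key -> registered application buffer
        self.completions = deque()  # toy completion queue
        self.kernel_calls = 0       # how often the OS driver was involved
        self._next_key = 1

    def register_memory(self, buffer: bytearray) -> int:
        """Control path: the OS driver sets up resources for the application."""
        self.kernel_calls += 1
        key = self._next_key
        self._next_key += 1
        self.registered[key] = buffer
        return key

    def post_write(self, local_key: int, remote: "ToyRdmaDevice",
                   remote_key: int, length: int) -> None:
        """Data path: move bytes between registered buffers, bypassing the OS."""
        src = memoryview(self.registered[local_key])
        dst = memoryview(remote.registered[remote_key])
        dst[:length] = src[:length]  # written straight into the remote buffer
        self.completions.append(("write", length))

    def poll_completion(self):
        """Return the oldest completion, or None if the queue is empty."""
        return self.completions.popleft() if self.completions else None


sender, receiver = ToyRdmaDevice(), ToyRdmaDevice()
src = bytearray(b"hello rdma")
dst = bytearray(len(src))

src_key = sender.register_memory(src)    # control path, once per buffer
dst_key = receiver.register_memory(dst)
sender.post_write(src_key, receiver, dst_key, len(src))  # data path

print(dst, sender.kernel_calls, receiver.kernel_calls)
```

In this model the receiver's only involvement is the one-time registration; the write itself never calls into the receiver's "kernel", which mirrors the OS-bypass and zero-copy properties discussed above.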
Using RDMA in vSphere
In vSphere, a virtual machine can use a PVRDMA network adapter to communicate with other virtual machines that have PVRDMA devices. The virtual machines must be connected to the same vSphere Distributed Switch.
The PVRDMA device automatically selects the method of communication between the virtual machines. For virtual machines that run on the same ESXi host with or without a physical RDMA device, the data transfer is a memcpy between the two virtual machines. The physical RDMA hardware is not used in this case.
For virtual machines that reside on different ESXi hosts and that have a physical RDMA connection, the physical RDMA devices must be uplinks on the distributed switch. In this case, the communication between the virtual machines by way of PVRDMA uses the underlying physical RDMA devices.
For two virtual machines that run on different ESXi hosts, when at least one of the hosts does not have a physical RDMA device, the communication falls back to a TCP-based channel and the performance is reduced.
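The three cases above amount to a small decision procedure, sketched below. The function name, parameter names, and return labels are assumptions chosen for illustration; the actual selection happens inside the PVRDMA device. The `*_has_rdma_uplink` flags are meant as "this host has a physical RDMA device configured as an uplink on the distributed switch".

```python
def select_pvrdma_transport(same_host: bool,
                            src_has_rdma_uplink: bool,
                            dst_has_rdma_uplink: bool) -> str:
    """Pick the communication method between two PVRDMA virtual machines."""
    if same_host:
        # Same ESXi host: a memory copy between the VMs, no RDMA hardware used.
        return "memcpy"
    if src_has_rdma_uplink and dst_has_rdma_uplink:
        # Different hosts, both with physical RDMA devices as uplinks.
        return "rdma"
    # At least one host lacks a physical RDMA device: slower TCP-based channel.
    return "tcp"


print(select_pvrdma_transport(True, False, False))
print(select_pvrdma_transport(False, True, True))
print(select_pvrdma_transport(False, True, False))
```

Note that the same-host check comes first, so co-located virtual machines use memcpy even when the host has RDMA hardware.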
vSphere 6.5 and later supports PVRDMA only in environments with a specific configuration:
Host Channel Adapter (HCA)
Currently PVRDMA is not supported for Windows virtual machines.
VMware PVRDMA currently supports these Linux distributions:
- CentOS 7.2 or later
- Red Hat Enterprise Linux (RHEL) 7.2 or later
- SUSE Linux Enterprise Server (SLES) 12 SP1 or later
- Oracle Linux 7 UEKR4 or later
- Ubuntu LTS Releases 14.04 or later
Installing PVRDMA Support in Linux
Use OFED version 4.8 or above
Both the PVRDMA library and driver are part of a larger software package called OpenFabrics Enterprise Distribution (OFED), which installs RDMA support in Linux. The OFED software can be downloaded here: http://downloads.openfabrics.org/OFED/. This is the recommended method to install RDMA and PVRDMA support in Linux.
Use other open source locations
Alternatively, the PVRDMA kernel driver is available in Linux kernel version 4.10 and above.
The PVRDMA library is available through a new set of common libraries called rdma-core: http://github.com/linux-rdma/rdma-core.
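Before choosing between the two installation routes, it can help to check what the guest already provides. The sketch below is a minimal, hedged example: it compares a kernel release string against the 4.10 version mentioned above and lists any RDMA devices the kernel has registered under `/sys/class/infiniband`. The function names are hypothetical, and the version check only reflects upstream kernel numbering; distribution kernels may backport the driver to older versions, so a negative result is not conclusive.

```python
import os
import platform
import re

PVRDMA_MIN_KERNEL = (4, 10)  # upstream kernel version that includes the PVRDMA driver


def kernel_has_pvrdma_driver(release: str) -> bool:
    """Compare a release string such as '4.15.0-112-generic' with 4.10."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if not match:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    version = (int(match.group(1)), int(match.group(2)))
    return version >= PVRDMA_MIN_KERNEL


def list_rdma_devices(sysfs_root: str = "/sys/class/infiniband") -> list:
    """Return the names of RDMA devices known to the kernel, if any."""
    if not os.path.isdir(sysfs_root):
        return []  # no RDMA stack loaded, or not a Linux system
    return sorted(os.listdir(sysfs_root))


release = platform.release()
if re.match(r"\d+\.\d+", release):
    print("upstream PVRDMA driver expected:", kernel_has_pvrdma_driver(release))
print("RDMA devices:", list_rdma_devices())
```

An empty device list on a guest with a PVRDMA adapter usually suggests that the driver is not loaded or the RDMA packages are not installed yet.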
RDMA in Other Virtualization Platforms
Microsoft Hyper-V, Oracle VM, and most other virtualization platforms support RDMA, either for use inside virtual machines or for platform features such as live migration and fast access to storage.
- RDMA – Wikipedia
- RDMA Consortium
- A Tutorial of the RDMA Model
- RDMA usage
- A Critique of RDMA for high-performance computing
- Is RDMA the Future of Data Center Storage Fabrics?
- Toward a Paravirtual vRDMA Device for VMware ESXi Guests
- Programming the Verbs API for a PVRDMA Device on ESXi 6.5 Hosts