[Review]: Oracle Unbreakable Enterprise Kernel Release 5
New Features and Changes
The Unbreakable Enterprise Kernel Release 5 (UEK R5) is a heavily tested and optimized operating system kernel for Oracle Linux 7 Update 5 and later on the x86-64 and 64-bit ARM (aarch64) architectures. It is based on the mainline Linux kernel version 4.14.35. This release also updates drivers and includes bug and security fixes.
Notable Features and Changes
The Unbreakable Enterprise Kernel Release 5 (UEK R5) introduces many changes and new features. This section reviews the most notable ones.
64-bit ARM (aarch64) architecture
With Unbreakable Enterprise Kernel Release 5, Oracle delivers kernel modifications to enable support for 64-bit ARM (aarch64) architecture. These changes are built and tested against existing ARM hardware and provide support for Oracle Linux for ARM.
- 64 KB Base Page Size: During testing the use of a 64 KB base page size resulted in significant performance gains for workloads that stress memory, such as MySQL and Java middleware, where THP (Transparent Huge Pages) are not used or the application is not configured to use huge pages. This change results in better overall performance and removes complex configuration requirements to configure huge pages manually.
- ARM port of DTrace code: Kernel code has been patched to facilitate an ARM (aarch64) port of DTrace on UEK R5. This includes changes to add support for aarch64 in the SDT collection process and to allow SDT to be disabled even when DTrace is enabled. Profile and systrace providers have been updated and tested to be functional on aarch64.
- Kdump modifications: Changes were made to kexec to ensure that the crashdump kernel runs at exception level 2 (EL2).
- KVM patches for ARM: A large number of ARM-related backports are included to help to enable KVM for ARM.
- CPU topology workaround to resolve missing cache information in ACPI: Due to lack of an official cache property for ARM64 in ACPI, the CPU cache information is not present in sysfs. To resolve this issue, a patch has been applied to display default cache information until such time that ACPI provides better information.
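To see which base page size a running kernel uses, as discussed for the 64 KB aarch64 configuration above, a minimal sketch follows. It assumes the POSIX `getconf` utility is installed; the helper names `bytes_to_kb`, `page_size_kb` and `describe_page_size` are ours, not part of UEK.

```shell
#!/bin/sh
# Convert a byte count to whole kilobytes.
bytes_to_kb() {
    echo $(( $1 / 1024 ))
}

# Report the base page size of the running kernel in KB.
# getconf PAGESIZE prints the page size in bytes.
page_size_kb() {
    bytes=$(getconf PAGESIZE) || return 1
    bytes_to_kb "$bytes"
}

# Turn a page size in KB into a short description.
describe_page_size() {
    case "$1" in
        64) echo "64 KB base pages" ;;
        4)  echo "4 KB base pages" ;;
        *)  echo "$1 KB base pages (other)" ;;
    esac
}

# Only query the system when getconf is actually available.
if command -v getconf >/dev/null 2>&1; then
    describe_page_size "$(page_size_kb)"
fi
```

On an x86-64 system this would normally report 4 KB pages; the text above leads us to expect 64 KB on UEK R5 for aarch64.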
Core Kernel Functionality
The following notable core kernel features are implemented in UEK R5:
- Ambient Capability Mask included: When performing privileged tasks, processes can be assigned capabilities in the form of different masks. The ambient capability mask is added to help solve inheritability problems in the current capability model that made capabilities difficult to use.
- kmod support for PKCS#7: Previous versions of kmod, up through 20-21.0.1, do not support the PKCS#7 signature type. As a result, the modinfo command does not display signature information for a signed module. As a workaround, confirmation that a module is signed may be obtained by checking for the label ~Module signature appended~ returned at the end of the module binary. For example:
xzgrep 'Module signature appended' /lib/modules/<kernel_version>/kernel/drivers/net/dummy.ko.xz
Binary file (standard input) matches
In kmod version 20-21.0.2, basic PKCS#7 signature type support has been added. You can use modinfo to display whether a PKCS#7 signature is present. However, note that sig_key information is still missing and that the algorithm displayed for the sig_hashalgo value may incorrectly display as using the md4 algorithm when the sha512 algorithm has been used instead. Note that the default algorithm used for module signature hashes in UEK R5 is
- xz kernel compression enabled: The CONFIG_HAVE_KERNEL_XZ option is enabled in UEK R5. This means that the kernel image and all kernel modules are automatically compressed, using xz compression, when compiled. Module file suffixes indicate that they are compressed and differ from the suffix used in UEK R4 and other previous releases of UEK. For example, the dummy module is installed as dummy.ko.xz. This change significantly reduces kernel footprint and package size.
- cgroup updates and changes: The cgroup mechanism has been updated and improved in UEK R5. Notable upstream changes available in this release include:
- Berkeley packet filter (BPF) cgroup controller and pids controller configuration enabled in the kernel.
- Thread mode for cgroup v2 is available to enable thread granularity for some controllers. This update facilitates hierarchical resource distribution across the threads of a group of processes. By default, all threads of a process belong to the same cgroup, which also serves as the resource domain to host resource consumptions which are not specific to a process or thread. The thread mode allows threads to be spread across a subtree while still maintaining the common resource domain for them.
- The memcontrol cgroup has introduced three new entries in memory.stat: workingset_refault (number of refaults of previously evicted pages), workingset_activate (number of refaulted pages that were immediately activated), and workingset_nodereclaim (number of times a shadow node has been reclaimed).
- The memcontrol cgroup now also provides shmem statistics.
- The rdma cgroup controller was added to perform accounting and limit enforcement on RDMA or InfiniBand resources.
- A new boot option, cgroup_no_v1, has been added to make it possible to disable specified controllers in cgroup v1 mounts, so that they remain available for cgroup v2 mounts.
- Futex scalability improvements: Several improvements were made to futex code, including the addition of a patch that removes the requirement to lock pages when handling keys for shared futexes. These improvements can boost hashing of shared futexes significantly, resulting in better performance.
- Legacy mcelog device enabled: The kernel configuration option CONFIG_X86_MCELOG_LEGACY is enabled in UEK R5. Although support for /dev/mcelog is deprecated upstream and this option is usually disabled by default, this device is required for the Oracle Linux FMA Software that is part of the Oracle Hardware Management Pack.
- Intel QuickAssist Technology enabled: UEK R5 enables Intel QuickAssist Technology, which is used to offload cryptographic workloads to hardware capable of optimizing these operations. UEK R5 includes the drivers and firmware required to use this hardware for cryptographic and compression acceleration. No user space packages are provided for this technology at this point.
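The module signature workaround described in the kmod item above can be wrapped in a small helper. This is a sketch only: the function name `module_is_signed` is ours, and it assumes `xzgrep` is installed for compressed modules. It looks for the same marker string the text mentions and does not verify the signature itself.

```shell
#!/bin/sh
# Succeed if the module file carries the appended-signature marker.
# Returns 2 when the file cannot be read, 1 when no marker is found.
module_is_signed() {
    file=$1
    [ -r "$file" ] || return 2
    case "$file" in
        # Compressed modules (.ko.xz) are searched through xzgrep.
        *.xz) xzgrep -q 'Module signature appended' "$file" ;;
        # Uncompressed modules can be searched directly.
        *)    grep -q 'Module signature appended' "$file" ;;
    esac
}

# Example path taken from the text; it may not exist on every system.
mod="/lib/modules/$(uname -r)/kernel/drivers/net/dummy.ko.xz"
if module_is_signed "$mod"; then
    echo "$mod: signature marker present"
fi
```

A marker match only shows that a signature was appended. Checking it against a trusted key still happens in the kernel at module load time.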
DTrace
- SDT probes enabled for KASLR-enabled kernels: A fix has been applied to resolve an issue that caused a kernel crash if Kernel Address Space Layout Randomization (KASLR) was enabled and DTrace SDT probes were enabled at the same time. DTrace can now be used with KASLR-enabled kernels.
- Added dynamic debugging. Where a kernel is enabled for dynamic debugging (CONFIG_DYNAMIC_DEBUG is enabled), DTrace is built with all debugging messages enabled.
- Array size boundary checking in user space: An enhancement was applied to the DTrace user space packages to add checking of the bounds of non-associative arrays, both in CTF and in declared arrays. Lvalue arrays used for assignment are also bounds-checked. It is possible to bypass the bounds checking by casting to an untyped pointer type. For example:
- Disassembler prints all actions: A fix has been applied to the D disassembler to follow the full chain of actions per statement so that it prints out all actions.
- PID provider added: A new PID provider has been added to both the DTrace kernel and user space code. It extends the existing fasttrap provider (used for USDT probes) with the ability to set function boundary probes on user space functions, and to probe most arbitrary instructions within user space functions. It is called the ‘pid’ provider because it is a meta-provider that creates user space tracing providers on demand based on process IDs (pid).
Several OCFS2 improvements and patches have been applied in this update, including the following notable items:
- Inode cluster lock set before moving reflinked inodes: A fix was applied to inode cluster locking to ensure that a cluster lock is taken in EX mode before initializing security ACLs on the orphan inode that is being moved to a reflinked destination. This fix helps to prevent problems from occurring due to missing checks on lock modes.
- Added feature to attempt to reuse the extent block in meta_alloc: A feature was added to reuse the extent block cached in dealloc after it has been unlinked from the extent tree to resolve an issue where the extent tree needs to grow but no metadata has been reserved ahead of time. By reusing the extents in dealloc, where deleted extents are cached, the extent tree can grow without the need to reserve additional metadata. This patch can solve a potential crash issue.
The following notable memory management features are implemented in UEK R5:
- Heterogeneous Memory Management (HMM) support: UEK R5 introduces HMM, a helper layer that allows device drivers to mirror the address space of a process. This new memory management facility includes features to shadow the CPU page table of a process into a device-specific page table and to keep both tables synchronized; to handle DMA mapping for the shadowed page table; and to migrate private anonymous memory to private device memory and vice versa. These features allow device drivers to avoid pinning memory, which blocks some kernel features, and allow the user space API to be decoupled from the requirement to manually manage memory copies to and from device memory. The change is transparent to user space, effectively allowing a library to use a GPU, DSP or FPGA without requiring links within the application.
- hugetlbfs hole punching enhancement: The userfaultfd mechanism has been updated to allow it to deliver a SIGBUS signal to the faulting process, instead of a page-fault event. This update to userfaultfd allows an application to prevent pages from being allocated implicitly when a hole in a hugetlbfs file is accessed by using the mapped address, so that the application can explicitly manage page allocations.
The following notable networking features are implemented in Unbreakable Enterprise Kernel Release 5:
- TCP-BBR enabled. UEK R5 enables TCP-BBR, which can achieve higher bandwidth and lower latency for internet traffic and so offers significant performance improvements for internet-based applications. BBR (Bottleneck Bandwidth and Round-Trip Time) is a congestion control algorithm that controls the transmit rate of TCP by monitoring round-trip times against bandwidth bottlenecks, which reduces buffering and TCP congestion.
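Having BBR enabled in the kernel does not by itself select it for new connections. The sketch below checks whether `bbr` is among the algorithms the running kernel offers, reading the standard procfs entry `tcp_available_congestion_control`. The helper name `list_has_bbr` is ours, and the sysctl commands in the comments are the usual upstream keys rather than anything UEK-specific.

```shell
#!/bin/sh
# Succeed if "bbr" appears as a whole word in a space-separated algorithm list.
list_has_bbr() {
    for algo in $1; do
        if [ "$algo" = "bbr" ]; then
            return 0
        fi
    done
    return 1
}

avail_file=/proc/sys/net/ipv4/tcp_available_congestion_control
if [ -r "$avail_file" ] && list_has_bbr "$(cat "$avail_file")"; then
    echo "bbr is available"
else
    echo "bbr is not listed; the tcp_bbr module may need loading (modprobe tcp_bbr)"
fi

# To select BBR system-wide (as root), the usual keys are:
#   sysctl -w net.core.default_qdisc=fq
#   sysctl -w net.ipv4.tcp_congestion_control=bbr
```

The `fq` queueing discipline is commonly paired with BBR because it provides packet pacing. Treat that pairing as a general recommendation, not a UEK R5 requirement.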
Many modern multiprocessors have non-uniform memory access (NUMA) memory designs, where the performance of a process can depend on whether the memory range being accessed is attached to the local CPU or to another CPU. As performance is different depending on memory locality, the operating system should ideally schedule a process to run on the CPU whose memory controller is connected to the memory to be accessed.
- NUMA balancing enabled. UEK R5 includes improvements and fixes to NUMA balancing to resolve issues that caused high I/O Wait times when this feature was enabled. NUMA balancing is automatically enabled on systems that have multiple NUMA nodes.
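Whether automatic NUMA balancing is active on a given system can be read from the `kernel.numa_balancing` sysctl. The following is a minimal sketch with a helper name of our own (`numa_balancing_state`). It assumes the standard procfs location and reports "unknown" if the file is absent, for example on a kernel built without NUMA balancing.

```shell
#!/bin/sh
# Print "enabled", "disabled", "unknown", or the raw value of the sysctl.
# An alternate file may be passed as the first argument.
numa_balancing_state() {
    file=${1:-/proc/sys/kernel/numa_balancing}
    if [ ! -r "$file" ]; then
        echo "unknown"
        return 0
    fi
    case "$(cat "$file")" in
        0) echo "disabled" ;;
        1) echo "enabled" ;;
        *) echo "other ($(cat "$file"))" ;;
    esac
}

echo "NUMA balancing: $(numa_balancing_state)"
```

On a single-node machine the value is normally 0, which matches the statement above that balancing is enabled automatically only when multiple NUMA nodes are present.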
Remote Direct Memory Access (RDMA) is a feature that allows direct memory access between two systems that are connected by a network. RDMA facilitates high-throughput and low-latency networking in clusters.
Unbreakable Enterprise Kernel Release 5 includes RDMA features that are provided in the upstream kernel, with the addition of Ksplice and DTrace functionality, along with Oracle’s own RDMA features, including support for RDS and Shared-PD.
The following RDMA protocols are enabled with UEK R5:
- SCSI RDMA Protocol (SRP) enables access to remote SCSI devices through remote direct memory access (RDMA)
- iSCSI Extensions for remote direct memory access (iSER) provide access to iSCSI storage devices
- Reliable Datagram Sockets (RDS) is a high-performance, low-latency, reliable connectionless protocol for datagram delivery
- Internet Protocol over InfiniBand (IPoIB)
Ethernet tunneling over IPoIB (eIPoIB) is not supported with UEK R5.
The following RDS features are enabled with UEK R5:
- Quality of Service (QoS)
- Active Bonding (AB)
- Netfilter (NF)
Oracle provides support for RDMA on InfiniBand on the following Oracle-branded HCAs:
- Sun InfiniBand Dual Port 4x QDR Host Channel Adapters M2
- Oracle Dual Port QDR InfiniBand Adapter M3
Oracle provides limited support for RDMA over Converged Ethernet (RoCEv2). Hardware vendors are responsible for testing and supporting RoCE on their own hardware. For more information on RoCE support for your hardware, please contact your hardware vendor.
New RDMA features implemented in UEK R5 include:
- Various RoCE features added: UEK R5 introduces RDMA over Converged Ethernet (RoCE), a standard InfiniBand Trade Association (IBTA) protocol that enables efficient data transfer for RDMA over Ethernet networks using UDP encapsulation to transcend Layer 3 networks. Kernel configuration options required to enable RoCE on a variety of hardware and the appropriate kernel driver modules have been included. The following additional notable RoCE features are implemented in UEK R5:
- Soft-RoCE functionality: Soft-RoCE is a software implementation of the RDMA transport that allows the use of standard Ethernet adapters to connect servers to high performance storage units using hardware-based RoCE. Soft-RoCE makes it possible to test RoCE on systems that do not have the RDMA hardware required for RoCE, and allows for graduated adoption where full-scale hardware upgrades may not be possible in the short term.
- RoCE with SR-IOV: The code responsible for handling Single Root I/O Virtualization, a major feature that allows for RDMA within virtual machines, has been updated to enable the same functionality on RoCE.
- Mellanox HCA drivers updated: The Mellanox mlx4 HCA driver has been updated for Ethernet and InfiniBand. The Mellanox mlx5 HCA driver has been updated to facilitate Ethernet and RoCEv2.
- RDMA subsystem updated: The RDMA subsystem has been updated. This includes an update to ib_core and new user land based on upstream RDMA Core libraries.
- QoS features added: Quality-of-Service (QoS) technologies such as PFC and CNP Counters and DSCP (including DSCP-to-Priority Mapping) have been added to facilitate QoS.
- resilient_rdmaip module added: The Active-Active Bonding feature that was previously available in the RDS driver module is moved into a new independent driver module, resilient_rdmaip, in UEK R5. This change acknowledges that the Active-Active Bonding feature is more generic and applies more widely to RDMA, as a whole. It also helps to reduce code complexity within the RDS module and brings the UEK RDS driver closer to matching the upstream RDS implementation. Finally, this change facilitates further improvement to the Active-Active Bonding code.
The following notable security features are implemented in Unbreakable Enterprise Kernel Release 5:
- Secure boot improvements: Secure boot is designed to protect a system against malicious code being loaded and executed early in the boot process. Secured platforms load only software binaries, such as option ROM drivers, boot loaders, and operating system loaders, that are unmodified and trusted by the platform. While the operating system is loaded, measures have been added to prevent malicious code from being injected on subsequent boots. Although this feature was available in previous releases of UEK, the implementation differed significantly from the approach taken in UEK R5. The new design avoids any relation to the securelevel security mechanism used in BSD kernels. These updates and changes help to ensure that the approach that is taken in UEK R5 brings Oracle Linux in line with other mainstream distributions. Some of the secure boot features that are applied to the kernel when it is locked down are described briefly in the following list:
- Facilitates using keys in the UEFI database when in secure boot mode
- Enforces module signatures
- Disallows access to do_kexec_load, which is used to allocate structs and load initram
- Copies the secure_boot flag in the boot parameters across
- Disallows images to be loaded into trusted kernels where the signature is not verified in the
- Disables hibernate and user space software suspend (
- Locks down PCIe Base Address Register access
- Locks down IO port access
- Restricts CPU Model Specific Register access
- Restricts the debugfs interface in the ASUS WMI driver
- Restricts access to custom ACPI methods
- Ignores the
- Disables ACPI table override
- Disables ACPI Platform Error Interface (APEI) error injection
- Disables the EATA SCSI driver
- Prohibits PCMCIA CIS storage
- Prohibits using TIOCSSERIAL to change device addresses, IRQs and DMA channels
- Prevents using module parameters that specify hardware options (such as
- Disables the
- Disables debugfs
- Disables kprobes for debugging
- Disables Berkeley Packet Filter functions
- Disables DTrace
Several new kernel configuration options have been added to cater for secure boot:
- LOCK_DOWN_KERNEL: Allows the kernel to be locked down under certain circumstances, such as when UEFI secure boot is enabled.
- ALLOW_LOCKDOWN_LIFT_BY_SYSRQ: Allows the lockdown on a kernel to be lifted, by pressing a SysRq key combination on a wired keyboard.
- LOCK_DOWN_IN_EFI_SECURE_BOOT: Allows kernel lockdown to be triggered if EFI Secure Boot is set in an EFI variable provided by system firmware if not indicated by a boot parameter.
- LOAD_UEFI_KEYS: Allows a kernel in secure boot mode to load modules signed with UEFI-stored keys and to reject modules signed with keys that match the blacklist.
- User space updates to enable FIPS: The dracut package for Oracle Linux 7 has been updated to dracut-033-535.0.2. This update enables FIPS support and compatibility with UEK R5. You must install this version or higher of the dracut package if you intend to enable FIPS mode on a system running UEK R5.
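The secure boot configuration options listed above can be checked against the configuration file shipped with an installed kernel. The sketch below assumes that file lives at /boot/config-<kernel_version>, the usual location on Oracle Linux, and that the options carry the standard CONFIG_ prefix there. The helper name `kconfig_value` is ours.

```shell
#!/bin/sh
# Print the value of a kernel config option ("y", "m", a string, or "not set").
# Prints "unknown" when the config file cannot be read.
# Usage: kconfig_value OPTION_WITHOUT_PREFIX [config_file]
kconfig_value() {
    option=$1
    config=${2:-/boot/config-$(uname -r)}
    if [ ! -r "$config" ]; then
        echo "unknown"
        return 0
    fi
    # Match only lines of the exact form CONFIG_<option>=<value>.
    value=$(sed -n "s/^CONFIG_${option}=//p" "$config" | head -n 1)
    echo "${value:-not set}"
}

for opt in LOCK_DOWN_KERNEL ALLOW_LOCKDOWN_LIFT_BY_SYSRQ \
           LOCK_DOWN_IN_EFI_SECURE_BOOT LOAD_UEFI_KEYS; do
    echo "CONFIG_$opt: $(kconfig_value "$opt")"
done
```

The same helper works for the other options named in this review, such as HAVE_KERNEL_XZ, X86_MCELOG_LEGACY and DYNAMIC_DEBUG.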
The following notable storage features are implemented in Unbreakable Enterprise Kernel Release 5:
- NBD functionality enabled: Network Block Device (NBD) functionality is enabled as a loadable kernel module in UEK R5. This allows the operating system to use a remote server as one of its block devices by using TCP.
- libnvdimm subsystem added to kernel and updated for PMEM and DAX: The libnvdimm kernel subsystem, which is responsible for the detection, configuration, and management of Non-Volatile Dual Inline Memory Modules (NVDIMMs), is enabled in UEK R5. If NVDIMMs are present in the system, they are exposed through the /dev/pmem* device nodes and can be configured by using the ndctl utility. PMEM, through libnvdimm, also makes DAX (Direct Access) functionality available. DAX is a facility that avoids the overhead of traditional buffer I/O on the page cache and produces direct file mappings into user space. Upstream patches for libnvdimm were also backported to introduce a ‘flags’ attribute that exports the generic DIMM status to indicate whether it is locked or whether it is in an alias state; and to clean up some code for better stability. ACPI 6.2 allows for named methods to access the label storage area of an NVDIMM. A patch has been applied to ensure that the new standard _LSI, _LSR and _LSW label methods are used, if available, and to fall back to use the NVDIMM_FAMILY_INTEL device-specific methods. This enables interoperability with environments that only implement standardized methods.
- TCMU functionality backported: TCMU (Target Core Module in Userspace) features have been backported from the 4.16 release of the upstream kernel to enable this functionality in UEK R5. These features allow Linux I/O iSCSI targets to be run as user space programs and facilitate targets to function in a Highly Available manner, allowing failover and failback of multiple iSCSI target gateways without data corruption.
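As a quick check for the /dev/pmem* device nodes mentioned in the libnvdimm item, the following sketch lists any that exist. The helper name `list_pmem_nodes` is ours. It only inspects device node names and does not replace ndctl for configuration.

```shell
#!/bin/sh
# List pmem device node names found in a directory (default /dev), one per line.
list_pmem_nodes() {
    dir=${1:-/dev}
    for node in "$dir"/pmem*; do
        # When nothing matches, the pattern stays unexpanded; skip it.
        [ -e "$node" ] || continue
        echo "${node##*/}"
    done
}

nodes=$(list_pmem_nodes)
if [ -n "$nodes" ]; then
    echo "pmem device nodes:"
    echo "$nodes"
else
    echo "no pmem device nodes found"
fi
```

On a system without NVDIMMs the script prints the "no pmem device nodes found" message.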
The following notable virtualization features are implemented in Unbreakable Enterprise Kernel Release 5:
- KVM updated to include backported bug fixes: KVM features in the upstream 4.15 and 4.16 kernels are backported into UEK R5. Many of these patches offer better stability and resolve bugs and performance issues.
- Secure Encrypted Virtualization (SEV) for AMD-V enabled: AMD’s Secure Encrypted Virtualization (SEV) feature that extends the AMD-V architecture has been enabled in UEK R5 and upstream patches from the 4.16 kernel have been backported to ensure that the latest features and functionality are available. Hardware that supports SEV can use this feature to run multiple virtual machines under the control of a hypervisor in a more secure fashion. Private memory space can be encrypted with a guest-specific key, while shared memory space can be encrypted with a hypervisor key. This feature can protect data on guest virtual machines from a potentially compromised hypervisor.
- User-Mode Instruction Prevention (UMIP) for Intel enabled: Intel’s UMIP feature has been enabled in UEK R5 and upstream patches from the 4.16 kernel have been backported to ensure that the latest features and functionality are available. UMIP is a security feature present in newer Intel processors, that can prevent the execution of certain instructions if the Current Privilege Level (CPL) is greater than 0. UMIP helps to protect access to system-wide settings such as the global and local descriptor tables, the task register and the interrupt descriptor table. UMIP has specifically been integrated with KVM to enable support for UMIP within a virtualized environment.
- Paravirtual TLB shootdown implemented: Patches have been applied to implement a KVM paravirtual translation lookaside buffer (TLB) shootdown algorithm. TLB is a memory cache that reduces the time taken to access a memory location. TLB shootdown is an operation that runs on multi-processor machines to flush the TLB on all processors to ensure that page restrictions are respected. Typically, TLB shootdown is managed by the host scheduler. In environments where multi-CPU virtual machines are running, VCPUs are not scheduled simultaneously. This can waste CPU cycles and cause synchronization latency, particularly in oversubscribed situations. The paravirtual TLB shootdown code helps to resolve this and makes TLB invalidation significantly more effective.
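SEV and UMIP both depend on processor support. One way to see whether the CPU advertises them is to look at the flags line in /proc/cpuinfo. This sketch assumes the flag names `sev` and `umip`, which are what upstream x86 kernels print; whether a particular UEK R5 build exposes them under those names is an assumption. The helper name `cpu_has_flag` is ours.

```shell
#!/bin/sh
# Succeed if the first "flags" line of a cpuinfo file lists the given flag
# as a whole word. Returns 2 if the file cannot be read.
# Usage: cpu_has_flag FLAG [cpuinfo_file]
cpu_has_flag() {
    flag=$1
    cpuinfo=${2:-/proc/cpuinfo}
    [ -r "$cpuinfo" ] || return 2
    awk -v want="$flag" '
        /^flags/ {
            # Fields 1 and 2 are "flags" and ":"; the rest are flag names.
            for (i = 3; i <= NF; i++) if ($i == want) found = 1
            exit
        }
        END { exit !found }
    ' "$cpuinfo"
}

for f in sev umip; do
    if cpu_has_flag "$f"; then
        echo "$f: advertised by the CPU"
    else
        echo "$f: not advertised"
    fi
done
```

The whole-word comparison matters because related flags share prefixes, for example `sev_es` alongside `sev`.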
Notable Driver Features
The following new features are noted in the drivers shipped with UEK R5:
- VXLAN offload support on Mellanox CX-5 HCAs: The mlx5e driver has added netdev support for VXLAN tunneling. This feature reduces CPU overhead by offloading packet processing for VXLAN encapsulation to the HCA hardware directly. This reduces system load for VXLAN tunneling and improves performance and packet throughput.
- Hyper-V drivers updated: The Hyper-V storage driver, hv_storvsc, has been updated to provide performance improvements for I/O operations on certain workloads by eliminating bounce buffers. The Hyper-V network driver, hv_netvsc, has been updated to support transparent SR-IOV on Virtual Function devices to reduce configuration complexity and the use of a dedicated bonding driver and script to handle hot plugging of the required PCI devices. A large number of other upstream patches from the 4.15 and 4.16 Linux kernel versions have been backported to deliver a full range of expected functionality and features for Hyper-V on UEK R5.
- Intel iWARP RDMA driver added: The Intel Ethernet Connection X722 iWARP RDMA driver, i40iw, has been added to the driver modules included in this kernel release. A library, libi40iw, has been added for direct user space use of this RDMA hardware.
- QLogic 40G/100G RoCE driver added and iWARP enabled: The QLogic 40G/100G RoCE driver, qedr, has been added to the driver modules included in this kernel release. Additionally, the existing QLogic FastLinQ 4xxxx Core Module, qed, was updated to include patches to enable iWARP. A library, libqedr, has been added for direct user space use of this RDMA hardware.
- QLogic QEDF 25/40/50/100Gb FCoE driver added: The QLogic QEDF 25/40/50/100Gb FCoE driver, qedf, has been added to the driver modules included in this kernel release. The driver introduces FCoE support for QLogic 41000 Series Converged Network Adapters.
- FC-NVMe transport support for Emulex and QLogic devices enabled: The NVM Express drivers, nvme, have been patched and updated to support enabling NVMe over Fibre Channel fabrics. This change involved the addition of several new nvme modules, updates to other modules, such as the Emulex LightPulse Fibre Channel SCSI driver, lpfc, and modifications to kernel block layer code such as the multi-queue block I/O queueing mechanism. Note that this functionality is currently available as a technical preview. Hardware vendors are responsible for testing and supporting FC-NVMe transport for their own devices. For more information on FC-NVMe support for your hardware, please contact your hardware vendor.
- Broadcom/Emulex LightPulse Fibre Channel SCSI driver updated: The Broadcom/Emulex LightPulse Fibre Channel SCSI driver, lpfc, has been updated. This release adds support for Emulex 32/64GB Host Bus Adapters and the initial framework to enable NVMe on Fibre Channel. Note that FC-NVMe in lpfc is available as a technical preview.
- QLogic Fibre Channel HBA driver updated to 10.00.00.06-k1: The QLogic Fibre Channel HBA driver, qla2xxx, has been updated to version 10.00.00.06-k1. Changes include many bug fixes for stability and performance. This release also includes a large number of vendor supplied and upstream patches to enable NVMe on Fibre Channel. Note that FC-NVMe in qla2xxx is available as a technical preview.
- LSI MPT Fusion SAS 3.0 device driver updated: The LSI MPT Fusion SAS 3.0 device driver, mpt3sas, has been patched and updated to support NVMe drives and to add support for the Broadcom SAS3616 HBA. Other upstream patches have also been applied for bug fixes.
- Amazon Elastic Network Adapter driver updated to 1.5.0k: The Elastic Network Adapter driver, ena, has been updated to version 1.5.0k. This version provides a number of upstream bug fixes and improvements. New features include additional power management operations, initial support for IPv6 RSS and improved driver robustness.
- Avago MegaRAID SAS driver updated: The Avago MegaRAID SAS driver, megaraid_sas, has been updated to version 07.704.04.00-rc1 and includes upstream and vendor supplied patches. Additional features include added support for the SAS3.5 generation of MegaRAID SAS controllers. Changes were also applied to cater for the potential to increase the adapter Queue Depth (QD) to 9k.
- Interface driver for GENEVE encapsulated traffic included: The interface driver for GENEVE encapsulated traffic, geneve, is included in this release of the kernel. Although this driver is provided simply as part of the upstream code used by this kernel release, it is mentioned as its inclusion resolves a known issue in Oracle Linux 7 Update 5.
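To see which of the drivers discussed in this section are currently loaded on a system, the first column of /proc/modules can be scanned. The sketch below uses the module names given in the text. The helper name `module_loaded` is ours, and drivers built into the kernel image rather than as modules will not appear in this file.

```shell
#!/bin/sh
# Succeed if a module name appears in the first column of a modules file.
# Returns 2 if the file cannot be read.
# Usage: module_loaded NAME [modules_file]
module_loaded() {
    name=$1
    modules=${2:-/proc/modules}
    [ -r "$modules" ] || return 2
    awk -v want="$name" '$1 == want { found = 1 } END { exit !found }' "$modules"
}

for m in hv_storvsc hv_netvsc i40iw qedr qedf lpfc qla2xxx \
         mpt3sas ena megaraid_sas geneve; do
    if module_loaded "$m"; then
        echo "$m is loaded"
    fi
done
```

For a loaded module, `modinfo <module>` then reports its version string, which can be compared with the versions quoted above.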