Beware of R1 - Updated

Sven Illert - June 2, 2023

The Good

Recently I discovered a critical issue when installing a new Oracle cluster for a customer on Oracle Linux systems. The hardware consists of shiny new blade systems, and I started with the latest and greatest software components that are certified for Oracle Database Enterprise Edition and Real Application Clusters. Of course the OS would be Oracle Linux 8.7 with the famous Unbreakable Enterprise Kernel, which has been at release 7 (UEKR7) since this year. In April 2023 Oracle released update 1 for UEKR7, which of course brings some enhancements and fixes. Business as usual, you would think.

Going on with the project, I installed the operating systems, and of course I consulted the certification matrix in My Oracle Support (MOS) before doing anything. There you can see that with the April 2023 RU of the database (release 19.19, which is in fact the release I wanted to install) any kernel version newer than a specific version of UEKR7 is supported. Since I used the latest UEKR7, I thought I would be in a safe harbour. Things went on, and after installing core components like the GI, configuring disk groups, enabling log maintenance and creating a database, I came to the point of applying best practices like enabling huge pages. Since I used the glorious oracle-database-preinstall-19c package, I just had to set the vm.nr_hugepages kernel parameter, because the memlock setting was already configured properly.

UPDATE: Apparently the statement about the certification state is not entirely correct. It seems that only UEKR6 was certified and the "or later" clause referred only to UEKR6. The UEKR7 support seems to cover only some GI components.
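
To make the vm.nr_hugepages setting mentioned above a bit more tangible, here is a minimal sketch of the arithmetic. It assumes that the total SGA size is given in GiB and that the huge page size is the value reported as Hugepagesize in /proc/meminfo (typically 2048 kB on x86_64). The function name hugepages_needed is my own invention for illustration, and the result is only the bare minimum without any headroom.

```shell
# Number of huge pages needed to back a given amount of SGA.
# $1: total SGA size in GiB (illustrative input, e.g. 450)
# $2: huge page size in kB (e.g. 2048, see "Hugepagesize" in /proc/meminfo)
hugepages_needed() {
    # Convert GiB to kB and round up, so that a partially filled
    # last page is counted as well.
    echo $(( ($1 * 1024 * 1024 + $2 - 1) / $2 ))
}

# Example usage on a live system (assumption, not executed here):
#   page_kb=$(awk '/^Hugepagesize:/ {print $2}' /proc/meminfo)
#   sysctl -w vm.nr_hugepages="$(hugepages_needed 450 "$page_kb")"
```

With 450 GiB of SGA and 2048 kB pages this yields 230400 pages. To survive a reboot, the value would additionally have to be persisted in a sysctl configuration file.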

The Bad

Since the system had 768 GB of RAM and about 450 GB of that should be used for SGAs, there is no reason not to insist on having huge pages available. That is something you should do with any database instance that has more than 4 GB of SGA. So I moved on to configure the database instance to require them with the following statement.

ALTER SYSTEM SET use_large_pages='ONLY' SCOPE=SPFILE;

With this setting, the instance refuses to start when not enough huge pages are available. So I bounced the database and wondered why it did not come up, and for some reason the host was unreachable too. Bouncing an instance after this change is something I have done tens or hundreds of times in the past, and I never experienced such an issue. As this is a cluster, I first disabled the database from the surviving node using srvctl and then started to investigate.
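
Because an instance configured this way does not start without enough huge pages, it can be worth checking the free pages before the bounce. The following is only a sketch: the function name enough_free_hugepages is made up, it reads a meminfo-style file (normally /proc/meminfo), and the required page count has to come from your own sizing.

```shell
# Succeeds if the meminfo-style file $1 reports at least $2 free huge pages.
enough_free_hugepages() {
    free=$(awk '/^HugePages_Free:/ {print $2}' "$1")
    # Treat a missing HugePages_Free line as zero free pages.
    [ "${free:-0}" -ge "$2" ]
}

# Example usage (assumption, not executed here):
#   enough_free_hugepages /proc/meminfo 230400 \
#       || echo "not enough free huge pages, the instance would not start"
```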

The Ugly

First I checked the whole huge pages and memlock configuration. Nope, everything was fine. Then I tried to configure the instance with and without huge pages. Without huge pages the instance would come up without any issues, but enabling them reproduced the failure reliably and cluster-wide. Of course I had some kernel (module) bug in mind and went ahead to dig around. One good thing about Oracle Linux 8 is that kdump is enabled by default, so when such things happen you can rely on the crash information in /var/crash/<timestamp>. There I found this neat kernel panic report at the end of the vmcore-dmesg.txt file.

% cat vmcore-dmesg.txt
...
[ 1103.887526] BUG: unable to handle page fault for address: ffffeb95612cd034
[ 1103.887605] #PF: supervisor write access in kernel mode
[ 1103.887656] #PF: error_code(0x0003) - permissions violation
[ 1103.887705] PGD c03ff5f067 P4D c03ff5f067 PUD c03ff59067 PMD bf38acf067 PTE 800000bea00c8161
[ 1103.887734] Oops: 0003 [#1] SMP NOPTI
[ 1103.887749] CPU: 23 PID: 22345 Comm: oracle_22345_cd Kdump: loaded Tainted: P           O      5.15.0-101.103.2.1.el8uek.x86_64 #2
[ 1103.887786] Hardware name: xxxxxxxxxxxxxxxxxxxxx
[ 1103.887819] RIP: 0010:safd_rio_unmappages+0x63/0x120 [oracleafd]
[ 1103.887853] Code: 0b 83 c3 01 39 9d c0 00 00 00 76 34 48 8b 95 b8 00 00 00 48 63 c3 48 8b 3c c2 48 8b 47 08 48 8d 50 ff a8 01 48 0f 45 fa 66 90 <f0> ff 4f 34 75 d1 e8 82 82 3b cb 83 c3 01 39 9d c0 00 00 00 77 cc
[ 1103.887909] RSP: 0018:ffffb542889cfb90 EFLAGS: 00010246
[ 1103.887926] RAX: ffffeb95aef70008 RBX: 0000000000000000 RCX: 0000000000000004
[ 1103.887950] RDX: ffffeb95aef70007 RSI: 0000000000000000 RDI: ffffeb95612cd000
[ 1103.887972] RBP: ffff9f9afc3dc000 R08: 0000000000000000 R09: 0000000000000000
[ 1103.887994] R10: 0000000000000000 R11: 0000000000000000 R12: ffff9f9a35051900
[ 1103.888017] R13: 0000000000000000 R14: 0000000000000000 R15: ffff9edd035b25b0
[ 1103.888039] FS:  00007f3b44a7f400(0000) GS:ffff9f993ffc0000(0000) knlGS:0000000000000000
[ 1103.888065] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1103.888085] CR2: ffffeb95612cd034 CR3: 000000bd5362a002 CR4: 00000000007706e0
[ 1103.888108] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1103.888131] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 1103.888154] PKRU: 55555554
[ 1103.888164] Call Trace:
[ 1103.888176]  <TASK>
[ 1103.888187]  afdq_request_drop+0x127/0x140 [oracleafd]
[ 1103.888217]  afdq_request_wait+0x1fd/0x230 [oracleafd]
[ 1103.888242]  afdq_batch_submit+0x526/0xbb0 [oracleafd]
[ 1103.888268]  afdc_io+0xb52/0x1000 [oracleafd]
[ 1103.888289]  ? AfdgCopyin+0x39/0x70 [oracleafd]
[ 1103.888312]  afdc_execute_ioctl+0x163/0x220 [oracleafd]
[ 1103.888337]  afd_ioctl+0x81/0x330 [oracleafd]
[ 1103.888359]  block_ioctl+0x48/0x5f
[ 1103.888375]  __x64_sys_ioctl+0x8f/0xce
[ 1103.888391]  do_syscall_64+0x38/0x8d
[ 1103.888408]  entry_SYSCALL_64_after_hwframe+0x63/0x0
[ 1103.888429] RIP: 0033:0x7f3b402e27cb
[ 1103.888443] Code: 73 01 c3 48 8b 0d bd 56 38 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 8d 56 38 00 f7 d8 64 89 01 48
[ 1103.888500] RSP: 002b:00007ffca9a7aa58 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[ 1103.888525] RAX: ffffffffffffffda RBX: 00000000197f2620 RCX: 00007f3b402e27cb
[ 1103.888547] RDX: 00000000197f2620 RSI: 0000000000507606 RDI: 000000000000000b
[ 1103.889314] RBP: 00007ffca9a7aad0 R08: 00007f3b448e01d8 R09: 0000000000000000
[ 1103.890006] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[ 1103.890696] R13: 0000000000000001 R14: 00000000197f2490 R15: 00007ffca9a7acf8
[ 1103.891379]  </TASK>
[ 1103.892047] Modules linked in: rds_tcp rds oracleacfs(PO) oracleadvm(PO) oracleoks(PO) oracleafd(PO) rfkill cuse fuse sunrpc dm_round_robin dm_multipath intel_rapl_msr intel_rapl_common isst_if_common skx_edac nfit libnvdimm x86_pkg_temp_thermal intel_powerclamp coretemp ipmi_ssif kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel acpi_ipmi aesni_intel ipmi_si crypto_simd mei_me ipmi_devintf cryptd pcspkr ioatdma joydev mei lpc_ich intel_pch_thermal dca ipmi_msghandler acpi_power_meter acpi_pad binfmt_misc sch_fq_codel ext4 mbcache jbd2 sd_mod t10_pi sg mgag200 drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops cec fnic drm libfcoe ahci libfc libahci libata enic megaraid_sas scsi_transport_fc i2c_algo_bit wmi dm_mirror dm_region_hash dm_log dm_mod
[ 1103.897202] CR2: ffffeb95612cd034
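
The interesting part of such a report is the kernel-mode RIP line, which names the function and, in square brackets, the module where the fault happened (here oracleafd). As a rough sketch, assuming the x86_64 convention that 0010 is the kernel code segment, the module can be pulled out of a dump file like this. The function name panic_module is hypothetical.

```shell
# Print the module named on the first kernel-mode "RIP:" line of the
# vmcore-dmesg.txt file given as $1. Prints nothing if that line does not
# carry a [module] suffix (i.e. the fault was in the core kernel).
panic_module() {
    grep -m1 -E 'RIP: 0010:.*\[[^]]+\]' "$1" \
        | sed -E 's/.*\[([^]]+)\][[:space:]]*$/\1/'
}

# Example usage (assumption, not executed here; keep your own timestamp directory):
#   panic_module /var/crash/<timestamp>/vmcore-dmesg.txt
```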

Of course I reached out to MOS and requested an investigation of the problem. But until a solution is provided, I wanted to get ahead with the project. Since UEKR7 release update 1 was released in April 2023, I thought about using an older version of the kernel, because there might be some regression that had not yet been discovered and whose fix was not included in the latest April RU for the database and GI. To list all available kernel versions I used the following dnf command (output stripped to the latest available versions).

% dnf --showduplicates list kernel-uek
...
kernel-uek.src         5.15.0-6.80.3.1.el8uek         ol8_UEKR7
kernel-uek.x86_64      5.15.0-6.80.3.1.el8uek         ol8_UEKR7
kernel-uek.src         5.15.0-7.86.6.1.el8uek         ol8_UEKR7
kernel-uek.x86_64      5.15.0-7.86.6.1.el8uek         ol8_UEKR7
kernel-uek.src         5.15.0-8.91.4.1.el8uek         ol8_UEKR7
kernel-uek.x86_64      5.15.0-8.91.4.1.el8uek         ol8_UEKR7
kernel-uek.src         5.15.0-100.96.32.el8uek        ol8_UEKR7
kernel-uek.x86_64      5.15.0-100.96.32.el8uek        ol8_UEKR7
kernel-uek.src         5.15.0-101.103.2.1.el8uek      ol8_UEKR7
kernel-uek.x86_64      5.15.0-101.103.2.1.el8uek      ol8_UEKR7 <-- the one installed

After identifying the previous releases, I went back one after another by installing each and selecting it for boot (ignoring the Red Hat compatible kernel, since it is way older than the UEK).

% dnf install kernel-uek-5.15.0-100.96.32.el8uek
% grubby --info=ALL | grep ^kernel
kernel="/boot/vmlinuz-5.15.0-101.103.2.1.el8uek.x86_64"
kernel="/boot/vmlinuz-5.15.0-100.96.32.el8uek.x86_64"
kernel="/boot/vmlinuz-4.18.0-477.13.1.el8_8.x86_64"
kernel="/boot/vmlinuz-4.18.0-477.10.1.el8_8.x86_64"
kernel="/boot/vmlinuz-4.18.0-425.3.1.el8.x86_64"
kernel="/boot/vmlinuz-0-rescue-a157dc48c8a240e7b30731914073a86f"
% grubby --set-default /boot/vmlinuz-5.15.0-100.96.32.el8uek.x86_64
% reboot
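
When stepping through several kernels like this, it is easy to lose track of which one actually booted. A small sketch for verifying that after the reboot follows. The two function names are invented for illustration, and it assumes that kernel images are named /boot/vmlinuz-<release> as in the grubby output above.

```shell
# Strip the "/boot/vmlinuz-" prefix from a kernel image path as printed by grubby.
kernel_release_from_path() {
    echo "${1#/boot/vmlinuz-}"
}

# Succeeds if the currently running kernel belongs to the image path $1.
running_kernel_is() {
    [ "$(uname -r)" = "$(kernel_release_from_path "$1")" ]
}

# Example usage after the reboot (assumption, not executed here):
#   running_kernel_is "$(grubby --default-kernel)" \
#       || echo "unexpected kernel: $(uname -r)"
```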

Fortunately there were only two releases of UEKR7 release update 1, so I soon came to try release 5.15.0-8.91.4.1.el8uek.x86_64, which let me open the database while still using a UEKR7 kernel. So there might be some regression in the latest update to UEKR7, and you should be careful when updating. Luckily this happened before the system went into production, and I hope that not a lot of people face the same issue. Maybe this is also related to the hardware in use and not a problem everywhere. The general advice to try updates on a test system first is not repeated without reason.