Specifying nohz_full breaks CPU frequency reporting

Bug #2051733 reported by Lastique
This bug affects 2 people
Affects: linux-signed-lowlatency-hwe-6.5 (Ubuntu)
Status: Confirmed
Importance: Undecided
Assigned to: Unassigned

Bug Description

With the lowlatency kernel, if I specify the "nohz_full=1-15" boot parameter, CPU frequency reporting doesn't work for logical cores 1-15. That is, only logical core 0 shows a varying CPU frequency in its /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq file; all other cores constantly show 800000 (the lowest supported frequency) in their scaling_cur_freq files, regardless of the CPU load.

Steps to reproduce:

1. Add "nohz_full=1-15" (specify the core numbers to include all logical cores except 0) to kernel boot options in /etc/default/grub.
2. Run `sudo update-grub` and reboot.
3. Upon booting, run a multithreaded workload. For example, run `openssl speed -multi $(nproc --all)`.
4. In another console, monitor CPU frequencies by running `watch cat /sys/devices/system/cpu/cpu[0-9]*/cpufreq/scaling_cur_freq`.
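
Step 4 can also be scripted so that stuck cores are flagged automatically instead of being spotted by eye. The following is a minimal sketch, assuming the usual cpufreq sysfs layout (`scaling_cur_freq` and `cpuinfo_min_freq` under each `cpuN/cpufreq` directory). The helper names `parse_cpu_list` and `stuck_cpus` are made up for illustration, and the list parser only handles plain numbers and ranges, not the full kernel CPU-list syntax.

```python
import glob
import re
import time


def parse_cpu_list(spec):
    """Expand a simple CPU list such as "1-15" or "0,2-3" into a set of ints."""
    cpus = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            first, last = part.split("-", 1)
            cpus.update(range(int(first), int(last) + 1))
        else:
            cpus.add(int(part))
    return cpus


def read_khz(path):
    """Read one integer kHz value from a sysfs file."""
    with open(path) as f:
        return int(f.read().strip())


def stuck_cpus(samples=10, interval=0.5, root="/sys/devices/system/cpu"):
    """Return the CPUs that reported their minimum frequency in every sample."""
    always_min = {}
    for _ in range(samples):
        for cpufreq_dir in glob.glob(f"{root}/cpu[0-9]*/cpufreq"):
            cpu = int(re.search(r"cpu(\d+)/cpufreq$", cpufreq_dir).group(1))
            cur = read_khz(f"{cpufreq_dir}/scaling_cur_freq")
            low = read_khz(f"{cpufreq_dir}/cpuinfo_min_freq")
            always_min.setdefault(cpu, True)
            if cur != low:
                always_min[cpu] = False
        time.sleep(interval)
    return sorted(cpu for cpu, stuck in always_min.items() if stuck)
```

Calling `stuck_cpus()` while the workload from step 3 is running and comparing the result with `sorted(parse_cpu_list("1-15"))` shows whether exactly the nohz_full cores are affected.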

Actual results:

All cores specified in "nohz_full" parameter always report their lowest frequency.

Despite that, the actual performance seems to be as if frequency scaling actually works (i.e. according to benchmarks, the performance seems to be similar with and without the "nohz_full" parameter).

Expected results:

All cores must report their actual frequency depending on the load.

ProblemType: Bug
DistroRelease: Ubuntu 22.04
Package: linux-image-6.5.0-15-lowlatency 6.5.0-15.15.1.1~22.04.1
ProcVersionSignature: Ubuntu 6.5.0-15.15.1.1~22.04.1-lowlatency 6.5.3
Uname: Linux 6.5.0-15-lowlatency x86_64
NonfreeKernelModules: nvidia_modeset nvidia
ApportVersion: 2.20.11-0ubuntu82.5
Architecture: amd64
CasperMD5CheckResult: unknown
CurrentDesktop: KDE
Date: Tue Jan 30 23:39:51 2024
InstallationDate: Installed on 2015-05-01 (3196 days ago)
InstallationMedia: Kubuntu 15.04 "Vivid Vervet" - Release amd64 (20150422)
SourcePackage: linux-signed-lowlatency-hwe-6.5
UpgradeStatus: Upgraded to jammy on 2022-05-14 (626 days ago)

Lastique (andysem) wrote :
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in linux-signed-lowlatency-hwe-6.5 (Ubuntu):
status: New → Confirmed
Doug Smythies (dsmythies) wrote :

I confirm your findings.

In my case I am using the intel_pstate CPU frequency scaling driver with the powersave governor, with HWP (hardware P-states) control disabled, on a mainline 6.8-rc1 kernel compiled with the kernel configuration changes being considered in that other bug report. The machine is my main test server, running Ubuntu 20.04.6.

Note also, in my case, the CPU frequencies actually seem to be scaling properly, it is just that they are not being reported properly via "/sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq".

Andrea Righi (arighi) wrote :

@dsmythies IIUC this happens also with a mainline kernel, right? Not just the Ubuntu lowlatency one.

Lastique (andysem) wrote :

> Note also, in my case, the CPU frequencies actually seem to be scaling properly, it is just that they are not being reported properly via "/sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq".

To be clear, I cannot be sure whether that was also the case in my testing. I didn't test the actual performance much. In a few short tests it did seem like the performance was lower, but that was not in any way scientific, so it is possible that the problem is just representation in scaling_cur_freq files.

Doug Smythies (dsmythies) wrote :

Yes, this happens also with the mainline kernel.

It also happens with the intel_cpufreq CPU frequency scaling driver (i.e. the intel_pstate driver in passive mode), and all governors. It also happens with the acpi-cpufreq CPU frequency scaling driver, and all governors. However the manifestations of the incorrectly reported scaling_cur_freq can be anywhere from wrong to correct.

Example 1: 100% load on all 12 CPUs; acpi-cpufreq; schedutil:

doug@s19:~$ grep . /sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq
/sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq:4800005
/sys/devices/system/cpu/cpu10/cpufreq/scaling_cur_freq:4101000
/sys/devices/system/cpu/cpu11/cpufreq/scaling_cur_freq:4101000
/sys/devices/system/cpu/cpu1/cpufreq/scaling_cur_freq:4101000
/sys/devices/system/cpu/cpu2/cpufreq/scaling_cur_freq:4101000
/sys/devices/system/cpu/cpu3/cpufreq/scaling_cur_freq:4101000
/sys/devices/system/cpu/cpu4/cpufreq/scaling_cur_freq:4101000
/sys/devices/system/cpu/cpu5/cpufreq/scaling_cur_freq:4101000
/sys/devices/system/cpu/cpu6/cpufreq/scaling_cur_freq:4101000
/sys/devices/system/cpu/cpu7/cpufreq/scaling_cur_freq:4101000
/sys/devices/system/cpu/cpu8/cpufreq/scaling_cur_freq:4101000
/sys/devices/system/cpu/cpu9/cpufreq/scaling_cur_freq:4101000

Except for CPU 0, which seems to be reporting as if it is using a different driver, the results are correct.

Example 2: 100% load on CPU 4 only; acpi-cpufreq; ondemand:

doug@s19:~$ grep . /sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq
/sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq:4799876
/sys/devices/system/cpu/cpu10/cpufreq/scaling_cur_freq:800000
/sys/devices/system/cpu/cpu11/cpufreq/scaling_cur_freq:800000
/sys/devices/system/cpu/cpu1/cpufreq/scaling_cur_freq:800000
/sys/devices/system/cpu/cpu2/cpufreq/scaling_cur_freq:800000
/sys/devices/system/cpu/cpu3/cpufreq/scaling_cur_freq:800000
/sys/devices/system/cpu/cpu4/cpufreq/scaling_cur_freq:4101000
/sys/devices/system/cpu/cpu5/cpufreq/scaling_cur_freq:800000
/sys/devices/system/cpu/cpu6/cpufreq/scaling_cur_freq:800000
/sys/devices/system/cpu/cpu7/cpufreq/scaling_cur_freq:800000
/sys/devices/system/cpu/cpu8/cpufreq/scaling_cur_freq:800000
/sys/devices/system/cpu/cpu9/cpufreq/scaling_cur_freq:800000

Again, except for CPU 0, the results are correct.
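
The listings above come out in lexical order (cpu10 and cpu11 before cpu1) because the shell sorts glob matches as strings. Here is a small sketch that parses such `grep .` output, re-sorts it by CPU number, and lists the CPUs reporting more than a given floor. The function names are hypothetical, and the 800000 kHz floor is simply the minimum seen in the examples.

```python
import re

# Matches lines like ".../cpu4/cpufreq/scaling_cur_freq:4101000".
LINE_RE = re.compile(r"/cpu(\d+)/cpufreq/scaling_cur_freq:(\d+)$")


def parse_grep_output(text):
    """Map CPU number -> reported frequency in kHz, ordered by CPU number."""
    freqs = {}
    for line in text.splitlines():
        match = LINE_RE.search(line.strip())
        if match:
            freqs[int(match.group(1))] = int(match.group(2))
    return dict(sorted(freqs.items()))


def cpus_above(freqs, floor_khz):
    """Return the CPUs whose reported frequency exceeds floor_khz."""
    return [cpu for cpu, khz in freqs.items() if khz > floor_khz]


sample = """\
/sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq:4799876
/sys/devices/system/cpu/cpu10/cpufreq/scaling_cur_freq:800000
/sys/devices/system/cpu/cpu4/cpufreq/scaling_cur_freq:4101000
"""
freqs = parse_grep_output(sample)
print(freqs)
print(cpus_above(freqs, 800000))
```

With the three lines taken from Example 2, only CPU 0 and the loaded CPU 4 are above the floor.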

Lastique (andysem) wrote :

I have run a quick `7z b` benchmark on the lowlatency kernel with and without `nohz_full` parameter, and the results are fairly close:

No `nohz_full` parameter:

                       Compressing  |                  Decompressing
Dict     Speed Usage    R/U Rating  |      Speed Usage    R/U Rating
         KiB/s     %   MIPS   MIPS  |      KiB/s     %   MIPS   MIPS

22:      82694  1333   6033  80445  |     741767  1584   3994  63265
23:      81022  1400   5896  82552  |     736593  1589   4011  63731
24:      79427  1429   5978  85401  |     722675  1581   4011  63433
25:      77665  1459   6077  88676  |     711778  1587   3990  63346
----------------------------------  | ------------------------------
Avr:            1405   5996  84269  |             1585   4002  63444
Tot:            1495   4999  73856

With `nohz_full=1-15` parameter:

                       Compressing  |                  Decompressing
Dict     Speed Usage    R/U Rating  |      Speed Usage    R/U Rating
         KiB/s     %   MIPS   MIPS  |      KiB/s     %   MIPS   MIPS

22:      84475  1357   6055  82177  |     738578  1552   4060  62993
23:      80361  1376   5951  81878  |     726439  1482   4240  62852
24:      80275  1437   6006  86312  |     715778  1496   4199  62827
25:      77007  1448   6073  87924  |     708632  1563   4034  63066
----------------------------------  | ------------------------------
Avr:            1404   6021  84573  |             1523   4133  62935
Tot:            1464   5077  73754

In the latter case, decompressing is slightly slower, but definitely not "800MHz" slower, so it looks like the problem is indeed with frequency reporting rather than scaling.
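
To put numbers on "fairly close", the relative change between the two runs can be computed from the Avr and Tot rating columns of the two `7z b` outputs. This is only a back-of-the-envelope sketch; a single benchmark run per configuration says nothing about run-to-run noise.

```python
def percent_change(baseline, value):
    """Relative change of value with respect to baseline, in percent."""
    return (value - baseline) / baseline * 100.0


# (rating without nohz_full, rating with nohz_full=1-15), in MIPS
ratings = {
    "compressing (Avr)": (84269, 84573),
    "decompressing (Avr)": (63444, 62935),
    "total (Tot)": (73856, 73754),
}

for name, (without, with_nohz) in ratings.items():
    print(f"{name}: {percent_change(without, with_nohz):+.2f}%")
```

All three ratings differ by less than 1% between the two runs, whereas cores genuinely held at their lowest frequency would be expected to show a far larger drop.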

summary: - Specifying nohz_full disables CPU frequency scaling
+ Specifying nohz_full breaks CPU frequency reporting
description: updated
Doug Smythies (dsmythies) wrote :

There is a high probability that the root issue here is related to some work done in August and September.
There was already an outstanding issue with intel_cpufreq driver / schedutil governor, hwp enabled.

References:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d51847acb018d83186e4af67bc93f9a00a8644f7

https://bugzilla.kernel.org/show_bug.cgi?id=217597

Doug Smythies (dsmythies) wrote :

CPU frequency scaling driver: intel_pstate
CPU frequency scaling governor: powersave
HWP: disabled.

Purpose: to verify that the driver is working correctly, regardless of the CPU frequencies reported.
A single-threaded load was applied to CPU 5 at a 347 hertz sleep/work frequency. The load was increased, then decreased. The intel_pstate_tracer.py utility was run during the test, capturing the attached.
All pstates were used and appropriate per the load.
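
For anyone wanting to repeat a test of this kind, a periodic work/sleep load can be approximated as below. This is not the tool used for the test above, just a rough sketch: `work_sleep_split` and `run_load` are hypothetical names, the duty cycle stands in for "load", and `time.sleep` granularity means the real work/sleep frequency will only be approximately the one requested. `os.sched_setaffinity` is Linux-only.

```python
import os
import time


def work_sleep_split(freq_hz, duty):
    """Split one period of a freq_hz cycle into (work, sleep) seconds."""
    if freq_hz <= 0 or not 0.0 <= duty <= 1.0:
        raise ValueError("freq_hz must be > 0 and duty within [0, 1]")
    period = 1.0 / freq_hz
    work = period * duty
    return work, period - work


def run_load(cpu, freq_hz, duty, seconds):
    """Pin this process to one CPU and alternate busy-waiting and sleeping."""
    os.sched_setaffinity(0, {cpu})
    work, sleep = work_sleep_split(freq_hz, duty)
    end = time.perf_counter() + seconds
    while time.perf_counter() < end:
        busy_until = time.perf_counter() + work
        while time.perf_counter() < busy_until:
            pass  # burn CPU for the work part of the period
        time.sleep(sleep)
```

For example, `run_load(5, 347, 0.25, 10)` would load CPU 5 at roughly 25% for ten seconds; stepping the duty value up and then down imitates an increasing and then decreasing load.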

Doug Smythies (dsmythies) wrote :

The way it is currently done, I don't think valid CPU frequency listing via "scaling_cur_freq", or /proc/cpuinfo, is expected to work. Why not? Because the required code is never executed, on purpose. Here is an excerpt from a commit (see the bit about NOHZ full):

commit f3eca381bd49d708073ba1a9af4fa6ea5d5810a6
Author: Thomas Gleixner <email address hidden>
Date: Fri Apr 15 21:20:04 2022 +0200

    x86/aperfmperf: Replace arch_freq_get_on_cpu()

    Reading the current CPU frequency from /sys/..../scaling_cur_freq involves
    in the worst case two IPIs due to the ad hoc sampling.

    The frequency invariance infrastructure provides the APERF/MPERF samples
    already. Utilize them and consolidate this with the /proc/cpuinfo readout.

    The sample is considered valid for 20ms. So for idle or isolated NOHZ full
    CPUs the function returns 0, which is matching the previous behaviour.

There were a couple of later commits, and now it prints out the minimum CPU frequency when it thinks the numbers are stale. With NOHZ full it always thinks the numbers are stale.
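
The behaviour described above can be illustrated with a toy model. This is not the kernel code, only a sketch of the logic as described in this comment and in the commit message: a sample is trusted for 20 ms, a fresh sample yields the base frequency scaled by the APERF/MPERF ratio, and a stale one yields a fallback value (the minimum frequency here). The names `Sample` and `reported_khz` and all the numbers in the usage lines are made up.

```python
from dataclasses import dataclass

MAX_SAMPLE_AGE_MS = 20  # validity window quoted in the commit message


@dataclass
class Sample:
    aperf_delta: int   # APERF counter increase over the sample window
    mperf_delta: int   # MPERF counter increase over the same window
    taken_at_ms: int   # time of the last tick-driven update


def reported_khz(sample, now_ms, base_khz, fallback_khz):
    """Toy model: trust a fresh APERF/MPERF sample, otherwise fall back."""
    age_ms = now_ms - sample.taken_at_ms
    if age_ms > MAX_SAMPLE_AGE_MS or sample.mperf_delta == 0:
        return fallback_khz
    return base_khz * sample.aperf_delta // sample.mperf_delta


# Housekeeping CPU: the tick refreshed the sample 4 ms ago.
fresh = reported_khz(Sample(4100, 3000, 1000), 1004, 3_000_000, 800_000)
# nohz_full CPU: no tick, so the last sample is seconds old.
stale = reported_khz(Sample(4100, 3000, 1000), 6000, 3_000_000, 800_000)
print(fresh, stale)
```

In this model a busy nohz_full CPU keeps returning the fallback value no matter how fast it is actually running, which matches the constant 800000 readings reported in this bug.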

The intel_cpufreq driver seems to display CPU frequencies okay, but only the pstate that was requested, not the actual frequency granted.

Lastique (andysem) wrote :

I'm not familiar with Linux kernel internals, or how scaling_cur_freq is implemented internally, but that doesn't look like valid logic to me. If it takes an IPI (or two, as the commit message suggests) to read the core frequency, then make those IPIs. It doesn't matter how expensive it is: if users want to read the current frequency, then they are willing to pay for that information. This likely won't be a frequent operation anyway. Providing an interface to read this information and then feeding bogus data through it is not acceptable, IMO.
