commit e641e014278f6144faed0bf14fe79be0b220c2ef
Author: Alexandre Frade
Date: Wed Jan 12 23:54:08 2022 +0000

Linux 5.16.0-xanmod1

Signed-off-by: Alexandre Frade

commit f83f4f4e20d2492746317ff64eb061107494fd3c
Author: graysky
Date: Tue Sep 14 15:35:34 2021 -0400

x86/kconfig: more uarches for kernel 5.15+

FEATURES

This patch adds additional CPU options to the Linux kernel, accessible under:
Processor type and features ---> Processor family --->

With the release of gcc 11.1 and clang 12.0, several generic 64-bit levels are offered which are good for supported Intel or AMD CPUs:
• x86-64-v2
• x86-64-v3
• x86-64-v4

Users of glibc 2.33 and above can see which level is supported by the current hardware by running:
/lib/ld-linux-x86-64.so.2 --help | grep supported
Alternatively, compare the flags from /proc/cpuinfo to this list.[1]

CPU-specific microarchitectures include:
• AMD Improved K8-family
• AMD K10-family
• AMD Family 10h (Barcelona)
• AMD Family 14h (Bobcat)
• AMD Family 16h (Jaguar)
• AMD Family 15h (Bulldozer)
• AMD Family 15h (Piledriver)
• AMD Family 15h (Steamroller)
• AMD Family 15h (Excavator)
• AMD Family 17h (Zen)
• AMD Family 17h (Zen 2)
• AMD Family 19h (Zen 3)†
• Intel Silvermont low-power processors
• Intel Goldmont low-power processors (Apollo Lake and Denverton)
• Intel Goldmont Plus low-power processors (Gemini Lake)
• Intel 1st Gen Core i3/i5/i7 (Nehalem)
• Intel 1.5 Gen Core i3/i5/i7 (Westmere)
• Intel 2nd Gen Core i3/i5/i7 (Sandybridge)
• Intel 3rd Gen Core i3/i5/i7 (Ivybridge)
• Intel 4th Gen Core i3/i5/i7 (Haswell)
• Intel 5th Gen Core i3/i5/i7 (Broadwell)
• Intel 6th Gen Core i3/i5/i7 (Skylake)
• Intel 6th Gen Core i7/i9 (Skylake X)
• Intel 8th Gen Core i3/i5/i7 (Cannon Lake)
• Intel 10th Gen Core i7/i9 (Ice Lake)
• Intel Xeon (Cascade Lake)
• Intel Xeon (Cooper Lake)*
• Intel 3rd Gen 10nm++ i3/i5/i7/i9-family (Tiger Lake)*
• Intel 3rd Gen 10nm++ Xeon (Sapphire Rapids)‡
• Intel 11th Gen i3/i5/i7/i9-family (Rocket Lake)‡
• Intel 12th Gen i3/i5/i7/i9-family (Alder Lake)‡

Notes: If not otherwise noted, gcc >=9.1 is required for support.
*Requires gcc >=10.1 or clang >=10.0
†Requires gcc >=10.3 or clang >=12.0
‡Requires gcc >=11.1 or clang >=12.0

It also offers to compile passing the 'native' option which, "selects the CPU to generate code for at compilation time by determining the processor type of the compiling machine. Using -march=native enables all instruction subsets supported by the local machine and will produce code optimized for the local machine under the constraints of the selected instruction set."[2]

Users of Intel CPUs should select the 'Intel-Native' option and users of AMD CPUs should select the 'AMD-Native' option.

MINOR NOTES RELATING TO INTEL ATOM PROCESSORS

This patch also changes -march=atom to -march=bonnell in accordance with the gcc v4.9 changes. Upstream is using the deprecated -march=atom flag when I believe it should use the newer -march=bonnell flag for Atom processors.[3]

It is not recommended to compile on Atom CPUs with the 'native' option.[4] The recommendation is to use the 'atom' option instead.

BENEFITS

Small but real speed increases are measurable using a make endpoint comparing a generic kernel to one built with one of the respective microarchs. See the following experimental evidence supporting this statement: https://github.com/graysky2/kernel_gcc_patch

REQUIREMENTS

linux version >=5.15
gcc version >=9.0 or clang version >=9.0

ACKNOWLEDGMENTS

This patch builds on the seminal work by Jeroen.[5]

REFERENCES

1. https://gitlab.com/x86-psABIs/x86-64-ABI/-/commit/77566eb03bc6a326811cb7e9
2. https://gcc.gnu.org/onlinedocs/gcc/x86-Options.html#index-x86-Options
3. https://bugzilla.kernel.org/show_bug.cgi?id=77461
4. https://github.com/graysky2/kernel_gcc_patch/issues/15
5.
http://www.linuxforge.net/docs/linux/linux-gcc.php

Signed-off-by: graysky

commit dfa0270cbaa1678e908061412b44a01bdc5bd28c
Author: Zebediah Figura
Date: Thu Nov 4 17:15:38 2021 +0000

winesync: Introduce the winesync driver and character device

Rebased-by: Tk-Glitch
Rebased-by: Alexandre Frade
Signed-off-by: Alexandre Frade

commit 79f9f08c2b2b5ea30d617167ca0d40b593622855
Author: Serge Hallyn
Date: Fri May 31 19:12:12 2013 +0100

sysctl: add sysctl to disallow unprivileged CLONE_NEWUSER by default

This is a short-term patch. Unprivileged use of CLONE_NEWUSER is certainly an intended feature of user namespaces. However, for at least saucy we want to make sure that, if any security issues are found, we have a fail-safe.

Signed-off-by: Serge Hallyn
[bwh: Remove unneeded binary sysctl bits]
[bwh: Keep this sysctl, but change the default to enabled]

commit 8a357c256342bd41dda20108cba17d54d9a62461
Author: Christian Brauner
Date: Wed Jan 23 21:54:23 2019 +0100

SAUCE: binder: give binder_alloc its own debug mask file

Currently binder.c and binder_alloc.c both register the /sys/module/binder_linux/parameters/debug_mask file, which leads to conflicts in sysfs. This commit gives binder_alloc.c its own /sys/module/binder_linux/parameters/alloc_debug_mask file.

Signed-off-by: Christian Brauner
Signed-off-by: Seth Forshee

commit e44ebc9b1b0b33b9fe2248a488479e3ace470ea3
Author: Christian Brauner
Date: Wed Jan 16 23:13:25 2019 +0100

SAUCE: binder: turn into module

The Android binder driver needs to become a module for the sake of shipping Anbox.
To do this we need to export the following functions since binder is currently still using them:
- security_binder_set_context_mgr()
- security_binder_transaction()
- security_binder_transfer_binder()
- security_binder_transfer_file()
- can_nice()
- __wake_up_pollfree()
- __close_fd_get_file()
- mmput_async()
- task_work_add()
- map_kernel_range_noflush()
- get_vm_area()
- zap_page_range()
- put_ipc_ns()
- get_ipc_ns_exported()
- show_init_ipc_ns()

Rebased-by: Alexandre Frade
Signed-off-by: Christian Brauner
[ saf: fix additional reference to init_ipc_ns from 5.0-rc6 ]
Signed-off-by: Seth Forshee
Signed-off-by: Alexandre Frade

commit 8ddaada2d676852853db4555b27c9ef7f9ad2ba5
Author: Christian Brauner
Date: Wed Jun 20 19:21:37 2018 +0200

SAUCE: ashmem: turn into module

The Android ashmem driver needs to become a module for the sake of Anbox. To do this we need to export shmem_zero_setup() since ashmem is currently using it. Note, the abomination that is the Android ashmem driver will go away in the not so distant future in favour of memfds.

Signed-off-by: Christian Brauner
Signed-off-by: Seth Forshee

commit 039f212ae30995afc7974b1d1ed9905264c7fa57
Author: Arjan van de Ven
Date: Wed May 17 01:52:11 2017 +0000

init: wait for partition and retry scan

As Clear Linux boots fast, the device is not ready when the mounting code is reached, so a retry device scan will be performed every 0.5 sec for at least 40 sec and synchronize the async task.

Signed-off-by: Miguel Bernal Marin

commit 36c62c812e748ad3b1b3333d3928a578e6c14a2d
Author: Arjan van de Ven
Date: Thu Jun 2 23:36:32 2016 -0500

drivers: initialize ata before graphics

ATA init is the long pole in the boot process, and it's asynchronous.
Move the graphics init after it so that ATA and graphics initialize in parallel.

commit 0047359f551468633c95a851ad99754cd35955cc
Author: Arjan van de Ven
Date: Sun Feb 18 23:35:41 2018 +0000

locking: rwsem: spin faster

Tweak rwsem owner spinning a bit.

Signed-off-by: Alexandre Frade

commit 9ce5b74ea37c43e78cb099f02e01d22430445330
Author: William Douglas
Date: Wed Jun 20 17:23:21 2018 +0000

firmware: Enable stateless firmware loading

Prefer the order of specific version before generic, and /etc before /lib, to enable the user to give specific overrides for generic firmware and distribution firmware.

commit de4f4a54ac9b5eab572c1341fb834628363a8f40
Author: Arjan van de Ven
Date: Sun Sep 22 11:12:35 2019 -0300

intel_rapl: Silence rapl trace debug

commit 968d9525eb6f1cf6d6abc0afb61e41ac357d5955
Author: Mark Weiman
Date: Sun Aug 12 11:36:21 2018 -0400

pci: Enable overrides for missing ACS capabilities

This is an updated version of Alex Williamson's patch from: https://lkml.org/lkml/2013/5/30/513

Original commit message follows:

PCIe ACS (Access Control Services) is the PCIe 2.0+ feature that allows us to control whether transactions are allowed to be redirected in various subnodes of a PCIe topology. For instance, if two endpoints are below a root port or downstream switch port, the downstream port may optionally redirect transactions between the devices, bypassing upstream devices. The same can happen internally on multifunction devices. The transaction may never be visible to the upstream devices.

One upstream device that we particularly care about is the IOMMU. If a redirection occurs in the topology below the IOMMU, then the IOMMU cannot provide isolation between devices. This is why the PCIe spec encourages topologies to include ACS support. Without it, we have to assume peer-to-peer DMA within a hierarchy can bypass IOMMU isolation.

Unfortunately, far too many topologies do not support ACS to make this a steadfast requirement.
Even the latest chipsets from Intel are only sporadically supporting ACS. We have trouble getting interconnect vendors to include the PCIe spec required PCIe capability, let alone suggested features. Therefore, we need to add some flexibility. The pcie_acs_override= boot option lets users opt-in specific devices or sets of devices to assume ACS support. The "downstream" option assumes full ACS support on root ports and downstream switch ports. The "multifunction" option assumes the subset of ACS features available on multifunction endpoints and upstream switch ports are supported. The "id:nnnn:nnnn" option enables ACS support on devices matching the provided vendor and device IDs, allowing more strategic ACS overrides. These options may be combined in any order. A maximum of 16 id specific overrides are available. It's suggested to use the most limited set of options necessary to avoid completely disabling ACS across the topology. Note to hardware vendors, we have facilities to permanently quirk specific devices which enforce isolation but not provide an ACS capability. Please contact me to have your devices added and save your customers the hassle of this boot option. 
Rebased-by: Alexandre Frade
Signed-off-by: Mark Weiman
Signed-off-by: Alexandre Frade

commit 989ec85f9ed607a294a80c6823d647d153515db2
Author: Alexandre Frade
Date: Thu Oct 7 14:09:55 2021 +0000

i2c: busses: Add SMBus capability to work with OpenRGB driver control

Signed-off-by: Alexandre Frade

commit 37c65fd2b0504b517b4460be6e1cabefe58e02e2
Author: Deren Wu
Date: Sun Nov 14 10:46:57 2021 +0800

mt76: mt7921: add support for PCIe ID 0x0608/0x0616

New mt7921 series chip support.

Signed-off-by: Deren Wu

commit 08e0b9854c04c080bbc6d0a7496003454a8bd655
Author: Alexandre Frade
Date: Wed Dec 8 11:55:28 2021 +0000

netfilter: Add full cone NAT support

Link: https://github.com/llccd/netfilter-full-cone-nat

Signed-off-by: Alexandre Frade

commit 9463259810f512aa60a8b64a67e44f9eb940ea4d
Author: Huang Rui
Date: Thu Jan 6 15:43:06 2022 +0800

x86, sched: Fix the undefined reference building error of init_freq_invariance_cppc

The init_freq_invariance_cppc function is implemented in smpboot and depends on CONFIG_SMP.

  MODPOST vmlinux.symvers
  MODINFO modules.builtin.modinfo
  GEN     modules.builtin
  LD      .tmp_vmlinux.kallsyms1
ld: drivers/acpi/cppc_acpi.o: in function `acpi_cppc_processor_probe':
/home/ray/brahma3/linux/drivers/acpi/cppc_acpi.c:819: undefined reference to `init_freq_invariance_cppc'
make: *** [Makefile:1161: vmlinux] Error 1

See https://lore.kernel.org/lkml/484af487-7511-647e-5c5b-33d4429acdec@infradead.org/.

Fixes: 41ea667227ba ("x86, sched: Calculate frequency invariance for AMD systems")
Reported-by: kernel test robot
Reported-by: Randy Dunlap
Reported-by: Stephen Rothwell
Signed-off-by: Huang Rui
Cc: Rafael J. Wysocki
Cc: Borislav Petkov
Cc: Ingo Molnar
Cc: Peter Zijlstra
Cc: x86@kernel.org
Cc: stable@vger.kernel.org

commit 55cb4113ae2bb4cc2b7a201291c4511ea35fc447
Author: Huang Rui
Date: Thu Jan 6 15:43:05 2022 +0800

cpufreq: amd-pstate: Fix the dependence issue of AMD P-State

The AMD P-State driver is based on the ACPI CPPC function, so ACPI should be a dependency of this driver in the kernel config.

In file included from ../drivers/cpufreq/amd-pstate.c:40:0:
../include/acpi/processor.h:226:2: error: unknown type name ‘phys_cpuid_t’
  phys_cpuid_t phys_id; /* CPU hardware ID such as APIC ID for x86 */
../include/acpi/processor.h:355:1: error: unknown type name ‘phys_cpuid_t’; did you mean ‘phys_addr_t’?
  phys_cpuid_t acpi_get_phys_id(acpi_handle, int type, u32 acpi_id);
../include/acpi/processor.h:356:1: error: unknown type name ‘phys_cpuid_t’; did you mean ‘phys_addr_t’?
  phys_cpuid_t acpi_map_madt_entry(u32 acpi_id);
../include/acpi/processor.h:357:20: error: unknown type name ‘phys_cpuid_t’; did you mean ‘phys_addr_t’?
  int acpi_map_cpuid(phys_cpuid_t phys_id, u32 acpi_id);

See https://lore.kernel.org/lkml/20e286d4-25d7-fb6e-31a1-4349c805aae3@infradead.org/.

Reported-by: Randy Dunlap
Reported-by: Stephen Rothwell
Signed-off-by: Huang Rui

commit 632f0bcb2fac314d5a8c35757d6efb51a8c32f60
Author: Huang Rui
Date: Fri Dec 24 09:05:08 2021 +0800

MAINTAINERS: Add AMD P-State driver maintainer entry

I will continue to add new feature and processor support, optimize the performance, and handle the issues for the AMD P-State driver.

Signed-off-by: Huang Rui

commit 89af400445a9b0c636976ca813c7f9e39896b9e3
Author: Huang Rui
Date: Fri Dec 24 09:05:07 2021 +0800

Documentation: amd-pstate: Add AMD P-State driver introduction

Introduce the AMD P-State driver design and implementation.
Signed-off-by: Huang Rui

commit f250f45f86134f7ac1bd2893194eac063e6a1772
Author: Huang Rui
Date: Fri Dec 24 09:05:06 2021 +0800

cpufreq: amd-pstate: Add AMD P-State performance attributes

Introduce sysfs attributes to read the different AMD P-State performance levels.

Signed-off-by: Huang Rui

commit 76026702049aa265e904c41bd70d13cea5450cc2
Author: Huang Rui
Date: Fri Dec 24 09:05:05 2021 +0800

cpufreq: amd-pstate: Add AMD P-State frequencies attributes

Introduce sysfs attributes to read the different processor frequency levels.

Signed-off-by: Huang Rui

commit 36ec7e5e5ff8c6fd2f694cf149b7b03261a5d187
Author: Huang Rui
Date: Fri Dec 24 09:05:04 2021 +0800

cpufreq: amd-pstate: Add boost mode support for AMD P-State

If the SBIOS supports the boost mode of AMD P-State, switch to boost enabled by default.

Signed-off-by: Huang Rui

commit 949edd580676ef0412b28dccf040a8877360452e
Author: Huang Rui
Date: Fri Dec 24 09:05:03 2021 +0800

cpufreq: amd-pstate: Add trace for AMD P-State module

Add a trace event to monitor the performance value changes controlled by the CPU governors.

Signed-off-by: Huang Rui

commit 3701edd92882f976f261d3d5ed1ae2e86b63a0ce
Author: Huang Rui
Date: Fri Dec 24 09:05:02 2021 +0800

cpufreq: amd-pstate: Introduce the support for the processors with shared memory solution

Some Zen2 and Zen3 based processors use shared memory exposed by the ACPI SBIOS. These processors have no MSR support, so we add the ACPI CPPC function as the backend for them. A module parameter (shared_mem) enables the related processors manually. We will enable this by default once we address the performance issue with this solution.
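The MSR-vs-shared-memory split described above can be modeled as a backend selection. The following is a hedged userspace C sketch under stated assumptions: pstate_backend, select_backend, and both fake register stores are illustrative names, not the driver's actual structures; only the selection logic (full MSR when the CPPC feature flag is present, opt-in shared memory otherwise) follows the commit message.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical backend vtable: the real driver chooses between a direct
 * MSR path and an ACPI CPPC shared-memory path; names are illustrative. */
struct pstate_backend {
    uint64_t (*read_perf)(void);
    void (*write_perf)(uint64_t val);
};

static uint64_t fake_msr;    /* stand-in for an MSR register */
static uint64_t msr_read_perf(void) { return fake_msr; }
static void msr_write_perf(uint64_t v) { fake_msr = v; }

static uint64_t fake_shmem;  /* stand-in for the SBIOS shared-memory region */
static uint64_t shmem_read_perf(void) { return fake_shmem; }
static void shmem_write_perf(uint64_t v) { fake_shmem = v; }

static const struct pstate_backend msr_backend = { msr_read_perf, msr_write_perf };
static const struct pstate_backend shmem_backend = { shmem_read_perf, shmem_write_perf };

/* Pick a backend from a CPPC-in-MSR capability flag, mirroring the way
 * the driver keys off X86_FEATURE_CPPC and the shared_mem module param. */
static const struct pstate_backend *select_backend(int has_cppc_msr, int shared_mem)
{
    if (has_cppc_msr)
        return &msr_backend;             /* full MSR support */
    return shared_mem ? &shmem_backend : NULL;  /* shared memory is opt-in */
}
```

The function-pointer table keeps the fast path free of repeated capability checks, which is the usual kernel idiom for this kind of split.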
Signed-off-by: Jinzhou Su
Signed-off-by: Huang Rui

commit 8bca5bff7205475c000d310be17b6b4ae8efd37e
Author: Huang Rui
Date: Fri Dec 24 09:05:01 2021 +0800

cpufreq: amd-pstate: Add fast switch function for AMD P-State

Introduce the fast switch function for AMD P-State on AMD processors that support full MSR register control. It decreases latency in interrupt context.

Signed-off-by: Huang Rui

commit 6a3addfce915a50f177693dcc487d1423ab03de7
Author: Huang Rui
Date: Fri Dec 24 09:05:00 2021 +0800

cpufreq: amd-pstate: Introduce a new AMD P-State driver to support future processors

AMD P-State is the AMD CPU performance scaling driver that introduces a new CPU frequency control mechanism for AMD Zen based CPU series in the Linux kernel. The new mechanism is based on Collaborative Processor Performance Control (CPPC), which provides finer-grained frequency management than legacy ACPI hardware P-States. Current AMD CPU platforms use the ACPI P-states driver to manage CPU frequency and clocks, switching among only three P-states. AMD P-State replaces the ACPI P-states control and provides a flexible, low-latency interface for the Linux kernel to directly communicate performance hints to the hardware.

AMD P-State leverages the Linux kernel governors such as *schedutil*, *ondemand*, etc. to manage the performance hints which are provided by the CPPC hardware functionality. The first version of AMD P-State supports one of the Zen3 processors, and we will support more in the future after we verify the hardware and SBIOS functionalities.

There are two types of hardware implementations for AMD P-State: one is full MSR support and the other is shared memory support. The X86_FEATURE_CPPC feature flag can be used to distinguish between the two types.

Using the new AMD P-State method plus kernel governors (*schedutil*, *ondemand*, ...)
to manage the frequency update is the most appropriate bridge between AMD Zen based processors and the Linux kernel; the processor can adjust to the most efficient frequency according to the kernel scheduler load.

Please check the detailed CPU feature and MSR register descriptions in the Processor Programming Reference (PPR) for AMD Family 19h Model 51h, Revision A1 Processors: https://www.amd.com/system/files/TechDocs/56569-A1-PUB.zip

Signed-off-by: Huang Rui

commit 4da27c1163145643bdb002dd9b821d4803232700
Author: Jinzhou Su
Date: Fri Dec 24 09:04:59 2021 +0800

ACPI: CPPC: Add CPPC enable register function

Add a new function to enable the CPPC feature. This function writes the Continuous Performance Control package EnableRegister field on the processor.

The CPPC EnableRegister is described in section 8.4.7.1 of ACPI 6.4: This element is optional. If supported, it contains a resource descriptor with a single Register() descriptor that describes a register to which OSPM writes a One to enable CPPC on this processor. Before this register is set, the processor will be controlled by legacy mechanisms (ACPI P-states, firmware, etc.).

This register will be used by AMD processors to enable the AMD P-State function instead of legacy ACPI P-States.

Signed-off-by: Jinzhou Su
Signed-off-by: Huang Rui

commit 4821365415ebb57161757ac5d0c196b0344b8d5f
Author: Mario Limonciello
Date: Fri Dec 24 09:04:58 2021 +0800

ACPI: CPPC: Check present CPUs for determining _CPC is valid

As this is a static check, it should be based upon what is currently present on the system. This makes probing more deterministic.

When the local APIC flags field (lapic_flags) of a CPU core in the MADT table is 0, that core won't be enabled. In that case _CPC won't be found for the core, and walking through possible CPUs (including disabled ones) reports _CPC as invalid. This is not expected, so switch to checking present CPUs instead.
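The present-vs-possible distinction above can be captured in a small model. This is a hedged userspace C sketch, not kernel code: struct cpu and both helpers are illustrative, standing in for for_each_possible_cpu/for_each_present_cpu iteration over ACPI data.

```c
/* Simplified model: a CPU may be "possible" (listed in the MADT) without
 * being "present" (its lapic_flags enable bit is 0). A _CPC validity scan
 * over possible CPUs wrongly includes disabled cores that have no _CPC;
 * a scan over present CPUs does not. */
struct cpu {
    int possible;  /* listed in MADT */
    int present;   /* lapic_flags says enabled */
    int has_cpc;   /* a _CPC object exists for this core */
};

/* The old check: every possible CPU must expose _CPC. */
static int all_possible_have_cpc(const struct cpu *cpus, int n)
{
    for (int i = 0; i < n; i++)
        if (cpus[i].possible && !cpus[i].has_cpc)
            return 0;
    return 1;
}

/* The fixed check: only present CPUs need to expose _CPC. */
static int all_present_have_cpc(const struct cpu *cpus, int n)
{
    for (int i = 0; i < n; i++)
        if (cpus[i].present && !cpus[i].has_cpc)
            return 0;
    return 1;
}
```

With one disabled-but-possible core lacking _CPC, the old check declares _CPC invalid while the fixed check correctly accepts the system.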
Reported-by: Jinzhou Su Signed-off-by: Mario Limonciello Signed-off-by: Huang Rui commit ea3b8f094eaf46a8862d845a626c23c0ed3a80ec Author: Steven Noonan Date: Fri Dec 24 09:04:57 2021 +0800 ACPI: CPPC: Implement support for SystemIO registers According to the ACPI v6.2 (and later) specification, SystemIO can be used for _CPC registers. This teaches cppc_acpi how to handle such registers. This patch was tested using the amd_pstate driver on my Zephyrus G15 (model GA503QS) using the current version 410 BIOS, which uses a SystemIO register for the HighestPerformance element in _CPC. Signed-off-by: Steven Noonan Signed-off-by: Huang Rui commit d7ec32945cf92feb0d70798880c0cc85199329aa Author: Huang Rui Date: Fri Dec 24 09:04:56 2021 +0800 x86/msr: Add AMD CPPC MSR definitions AMD CPPC (Collaborative Processor Performance Control) function uses MSR registers to manage the performance hints. So add the MSR register macro here. Signed-off-by: Huang Rui Acked-by: Borislav Petkov commit 8eb61010f2fddbf5888bae765ad5d2a447c2f479 Author: Huang Rui Date: Fri Dec 24 09:04:55 2021 +0800 x86/cpufeatures: Add AMD Collaborative Processor Performance Control feature flag Add Collaborative Processor Performance Control feature flag for AMD processors. This feature flag will be used on the following AMD P-State driver. The AMD P-State driver has two approaches to implement the frequency control behavior. That depends on the CPU hardware implementation. One is "Full MSR Support" and another is "Shared Memory Support". The feature flag indicates the current processors with "Full MSR Support". Acked-by: Borislav Petkov Signed-off-by: Huang Rui commit 85f1e1442b2c72a4b1b1bd467e93fce11b5b65d8 Author: Eric Dumazet Date: Thu Nov 18 09:52:39 2021 -0800 x86/csum: Fix compilation error for UM load_unaligned_zeropad() is not yet universal. ARCH=um SUBARCH=x86_64 builds do not have it. When CONFIG_DCACHE_WORD_ACCESS is not set, simply continue the bisection with 4, 2 and 1 byte steps. 
Fixes: df4554cebdaa ("x86/csum: Rewrite/optimize csum_partial()")
Reported-by: kernel test robot
Signed-off-by: Eric Dumazet
Signed-off-by: Peter Zijlstra (Intel)
Link: https://lkml.kernel.org/r/20211118175239.1525650-1-eric.dumazet@gmail.com

commit b0d8e91932a1bec7cda2fcac70c1c2bc37e5cf66
Author: Eric Dumazet
Date: Fri Nov 12 08:19:50 2021 -0800

x86/csum: Rewrite/optimize csum_partial()

With more NICs supporting CHECKSUM_COMPLETE and IPv6 being widely used, csum_partial() is heavily called with small numbers of bytes and consumes many cycles. The IPv6 header size, for instance, is 40 bytes.

Another thing to consider is that NET_IP_ALIGN is 0 on x86, meaning that network headers are not word-aligned unless the driver forces this.

This means that csum_partial() fetches one u16 to 'align the buffer', then performs three u64 additions with carry in a loop, then a remaining u32, then a remaining u16.

With this new version, we perform a loop only for the 64 byte blocks, then the remaining is bisected.
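The bisection idea can be sketched in portable C. This is a hedged userspace model, not the kernel's asm-optimized implementation: csum_partial_sketch and csum_naive are illustrative names, 32-bit adds stand in for the kernel's add-with-carry u64 chains, and a little-endian host is assumed. A straightforward 16-bit reference sum is included so the sketch can be checked against it.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Fold an accumulator down to a 16-bit ones'-complement sum. */
static uint16_t csum_fold(uint64_t sum)
{
    while (sum >> 16)
        sum = (sum & 0xffffu) + (sum >> 16);
    return (uint16_t)sum;
}

/* Bisected partial checksum: one loop for 64-byte blocks, then the
 * remaining length is halved (32, 16, 8, 4) instead of byte-stepping,
 * finishing with at most one u16 and one byte. */
static uint16_t csum_partial_sketch(const uint8_t *buf, size_t len)
{
    uint64_t sum = 0;
    uint32_t w;

    while (len >= 64) {                    /* main loop: 16 x u32 per block */
        for (int i = 0; i < 16; i++) {
            memcpy(&w, buf + 4 * i, 4);    /* memcpy tolerates misalignment */
            sum += w;
        }
        buf += 64;
        len -= 64;
    }
    for (size_t step = 32; step >= 4; step /= 2) {  /* bisect the tail */
        if (len >= step) {
            for (size_t off = 0; off < step; off += 4) {
                memcpy(&w, buf + off, 4);
                sum += w;
            }
            buf += step;
            len -= step;
        }
    }
    if (len >= 2) {                        /* final u16, if any */
        uint16_t h;
        memcpy(&h, buf, 2);
        sum += h;
        buf += 2;
        len -= 2;
    }
    if (len)                               /* final odd byte (little-endian) */
        sum += *buf;
    return csum_fold(sum);
}

/* Reference: plain 16-bit word sum, for checking the sketch. */
static uint16_t csum_naive(const uint8_t *buf, size_t len)
{
    uint64_t sum = 0;

    while (len >= 2) {
        uint16_t h;
        memcpy(&h, buf, 2);
        sum += h;
        buf += 2;
        len -= 2;
    }
    if (len)
        sum += *buf;
    return csum_fold(sum);
}
```

Because ones'-complement addition is associative across word splits, summing u32 words and folding yields the same 16-bit result as summing u16 words, which is what makes the wide main loop legal.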
Tested on various CPUs; all of them show a big reduction in csum_partial() cost (by 50 to 80%).

Before: 4.16% [kernel] [k] csum_partial
After: 0.83% [kernel] [k] csum_partial

If run in a loop 1,000,000 times:

Before:
26,922,913 cycles
80,302,961 instructions # 2.98 insn per cycle
21,059,816 branches
2,896 branch-misses # 0.01% of all branches

After:
17,960,709 cycles
41,292,805 instructions # 2.30 insn per cycle
11,058,119 branches
2,997 branch-misses # 0.03% of all branches

Signed-off-by: Eric Dumazet
Signed-off-by: Peter Zijlstra (Intel)
Reviewed-by: Alexander Duyck
Link: https://lore.kernel.org/r/20211112161950.528886-1-eric.dumazet@gmail.com

commit 5cbf462e52947c696e25053ba9b799fb4f3ba9ca
Author: Eric Dumazet
Date: Mon Nov 15 11:02:49 2021 -0800

net: move early demux fields close to sk_refcnt

sk_rx_dst/sk_rx_dst_ifindex/sk_rx_dst_cookie are read in early demux and currently span two cache lines. Moving them close to sk_refcnt makes more sense, as only one cache line is needed.

New layout for this hot cache line is:

struct sock {
	struct sock_common __sk_common;   /*     0  0x88 */
	/* --- cacheline 2 boundary (128 bytes) was 8 bytes ago --- */
	struct dst_entry * sk_rx_dst;     /*  0x88   0x8 */
	int sk_rx_dst_ifindex;            /*  0x90   0x4 */
	u32 sk_rx_dst_cookie;             /*  0x94   0x4 */
	socket_lock_t sk_lock;            /*  0x98  0x20 */
	atomic_t sk_drops;                /*  0xb8   0x4 */
	int sk_rcvlowat;                  /*  0xbc   0x4 */
	/* --- cacheline 3 boundary (192 bytes) --- */

Signed-off-by: Eric Dumazet

commit 2c0abf22886b1a84a8f8ec1d880f22351eed9b0d
Author: Eric Dumazet
Date: Mon Nov 15 11:02:48 2021 -0800

tcp: do not call tcp_cleanup_rbuf() if we have a backlog

Under pressure, tcp recvmsg() has logic to process the socket backlog, but calls tcp_cleanup_rbuf() right before. Avoiding sending an ACK right before processing new segments makes a lot of sense, as this decreases the number of ACK packets, with no impact on effective ACK clocking.
Signed-off-by: Eric Dumazet

commit 6c55b2052590896513f92958478ec2ff390733ad
Author: Eric Dumazet
Date: Mon Nov 15 11:02:47 2021 -0800

tcp: check local var (timeo) before socket fields in one test

Testing timeo before sk_err/sk_state/sk_shutdown makes more sense. Modern applications use non-blocking IO, while a socket is terminated only once during its lifetime.

Signed-off-by: Eric Dumazet

commit 8a307a7acb16b2dae9ebe692ff6c13817fe9ba70
Author: Eric Dumazet
Date: Mon Nov 15 11:02:46 2021 -0800

tcp: defer skb freeing after socket lock is released

tcp recvmsg() (or rx zerocopy) spends a fair amount of time freeing skbs after their payload has been consumed. A typical ~64KB GRO packet has to release ~45 page references, eventually going to the page allocator for each of them.

Currently this freeing is performed while the socket lock is held, meaning that there is a high chance the BH handler has to queue incoming packets to the tcp socket backlog. This can cause additional latencies, because the user thread has to process the backlog at release_sock() time, and while doing so, additional frames can be added by the BH handler.

This patch adds logic to defer these frees after the socket lock is released, or directly from the BH handler if possible. Being able to free these skbs from the BH handler helps a lot, because it avoids the usual alloc/free asymmetry when the BH handler and user thread do not run on the same cpu or NUMA node.
One cpu can now be fully utilized for the kernel->user copy, and another cpu handles BH processing and skb/page allocs/frees (assuming RFS is not forcing use of a single CPU).

Tested: 100Gbit NIC, max throughput for one TCP_STREAM flow, over 10 runs.

MTU 1500: Before 55 Gbit, After 66 Gbit
MTU 4096+(headers): Before 82 Gbit, After 95 Gbit

Signed-off-by: Eric Dumazet

commit bb1c0c8c5e9a71b53002401e6d8ff303b18e915a
Author: Eric Dumazet
Date: Mon Nov 15 11:02:45 2021 -0800

tcp: avoid indirect calls to sock_rfree

TCP uses sk_eat_skb() when skbs can be removed from the receive queue. However, the call to skb_orphan() from __kfree_skb() incurs an indirect call to sock_rfree(), which is more expensive than a direct call, especially for CONFIG_RETPOLINE=y.

Add a tcp_eat_recv_skb() function to make the call before __kfree_skb().

Signed-off-by: Eric Dumazet

commit 3b47a3af97b99eae5facd29f3070f33703dc3ec3
Author: Eric Dumazet
Date: Mon Nov 15 11:02:44 2021 -0800

tcp: tp->urg_data is unlikely to be set

Use some unlikely() hints in the fast path.

Signed-off-by: Eric Dumazet

commit 00d8db65fe8165059cf023a285474a5ea27c5287
Author: Eric Dumazet
Date: Mon Nov 15 11:02:43 2021 -0800

tcp: annotate races around tp->urg_data

tcp_poll() and tcp_ioctl() are reading tp->urg_data without the socket lock owned. Also, it is faster to first check tp->urg_data in tcp_poll(), then tp->urg_seq == tp->copied_seq, because tp->urg_seq is located in a different/cold cache line.

Signed-off-by: Eric Dumazet

commit 3c4bc0be566c44077cdd15ad4e39df3b16d7d9d4
Author: Eric Dumazet
Date: Mon Nov 15 11:02:42 2021 -0800

tcp: annotate data-races on tp->segs_in and tp->data_segs_in

tcp_segs_in() can be called from BH while the socket spinlock is held but the socket is owned by user, with tcp_get_info() eventually reading these fields. Found by code inspection; no need to backport this patch to older kernels.
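The annotation pattern behind these race fixes can be shown in userspace C. This is a simplified sketch: the READ_ONCE/WRITE_ONCE macros below are stand-ins for the kernel's (which have more machinery), and tcp_sock_sketch is an illustrative subset, not the real struct tcp_sock.

```c
#include <stdint.h>

/* Simplified stand-ins for the kernel's annotation macros: a single
 * volatile access keeps the compiler from tearing, duplicating, or
 * re-reading the value in a lockless race. */
#define WRITE_ONCE(x, val) (*(volatile __typeof__(x) *)&(x) = (val))
#define READ_ONCE(x)       (*(volatile __typeof__(x) *)&(x))

struct tcp_sock_sketch {   /* illustrative subset of the annotated fields */
    uint32_t urg_data;
    uint32_t segs_in;
};

/* Writer side (segment ingest, possibly from BH): annotated update. */
static void tcp_segs_in_sketch(struct tcp_sock_sketch *tp, uint32_t segs)
{
    WRITE_ONCE(tp->segs_in, tp->segs_in + segs);
}

/* Reader side (e.g. tcp_poll()/tcp_get_info() without the socket lock):
 * annotated lockless read. */
static uint32_t tcp_read_urg_sketch(struct tcp_sock_sketch *tp)
{
    return READ_ONCE(tp->urg_data);
}
```

The annotations do not add ordering or atomicity beyond single-access; they document the race and constrain the compiler, which is exactly what these data-race commits need.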
Signed-off-by: Eric Dumazet commit 14e47c2467182abfe785e1f842d342fac7758abd Author: Eric Dumazet Date: Mon Nov 15 11:02:41 2021 -0800 tcp: add RETPOLINE mitigation to sk_backlog_rcv Use INDIRECT_CALL_INET() to avoid an indirect call when/if CONFIG_RETPOLINE=y Signed-off-by: Eric Dumazet commit 06dcd20283c886976257bb99ce38ba89044d4ebc Author: Eric Dumazet Date: Mon Nov 15 11:02:40 2021 -0800 tcp: small optimization in tcp recvmsg() When reading large chunks of data, incoming packets might be added to the backlog from BH. tcp recvmsg() detects the backlog queue is not empty, and uses a release_sock()/lock_sock() pair to process this backlog. We now have __sk_flush_backlog() to perform this a bit faster. Signed-off-by: Eric Dumazet commit 05091893496e16c9b38b647bd02ef882ac053b09 Author: Eric Dumazet Date: Mon Nov 15 11:02:39 2021 -0800 net: cache align tcp_memory_allocated, tcp_sockets_allocated tcp_memory_allocated and tcp_sockets_allocated often share a common cache line, source of false sharing. Also take care of udp_memory_allocated and mptcp_sockets_allocated. Signed-off-by: Eric Dumazet commit 44c254410781627e7db7f02be8a35e32644f8016 Author: Eric Dumazet Date: Mon Nov 15 11:02:38 2021 -0800 net: forward_alloc_get depends on CONFIG_MPTCP (struct proto)->sk_forward_alloc is currently only used by MPTCP. Signed-off-by: Eric Dumazet commit db7962b994e245716af6e2e6ab43aac937609254 Author: Eric Dumazet Date: Mon Nov 15 11:02:37 2021 -0800 net: shrink struct sock by 8 bytes Move sk_bind_phc next to sk_peer_lock to fill a hole. Signed-off-by: Eric Dumazet commit 47f7af169d29dc708243414366753968bec71283 Author: Eric Dumazet Date: Mon Nov 15 11:02:36 2021 -0800 ipv6: shrink struct ipcm6_cookie gso_size can be moved after tclass, to use an existing hole. 
(8 bytes saved on 64bit arches)

Signed-off-by: Eric Dumazet

commit c6fdfb7ab54b12ed489362df18af470ca4aa91ca
Author: Eric Dumazet
Date: Mon Nov 15 11:02:32 2021 -0800

tcp: small optimization in tcp_v6_send_check()

For TCP flows, inet6_sk(sk)->saddr has the same value as sk->sk_v6_rcv_saddr. Using sk->sk_v6_rcv_saddr increases data locality.

Signed-off-by: Eric Dumazet

commit 1b9d62d108711fac71d5714fd88fa9dfb7700b35
Author: Eric Dumazet
Date: Mon Nov 15 11:02:31 2021 -0800

tcp: remove dead code in __tcp_v6_send_check()

For some reason, I forgot to change __tcp_v6_send_check() at the same time I removed the (ip_summed == CHECKSUM_PARTIAL) check in __tcp_v4_send_check().

Fixes: 98be9b12096f ("tcp: remove dead code after CHECKSUM_PARTIAL adoption")
Signed-off-by: Eric Dumazet

commit 2aef2076bced854f82b4ab7af33c64e5bfcb1673
Author: Eric Dumazet
Date: Mon Nov 15 11:02:30 2021 -0800

tcp: minor optimization in tcp_add_backlog()

If a packet is going to be coalesced, the sk_sndbuf/sk_rcvbuf values are not used. Defer their access to the point where we need them.

Signed-off-by: Eric Dumazet

commit 6564991ef2b36b8d5c19f0ff6ccc1831625b0875
Author: Adithya Abraham Philip
Date: Fri Jun 11 21:56:10 2021 +0000

net-tcp_bbr: v2: Fix missing ECT markings on retransmits for BBRv2

Adds a new flag TCP_ECN_ECT_PERMANENT that is used by CCAs to indicate that retransmitted packets and pure ACKs must have the ECT bit set. This is a necessary fix for BBRv2, which when using ECN expects ECT to be set even on retransmitted packets and ACKs. Currently, CCAs like BBRv2 which can use ECN but don't "need" it have no way to indicate that ECT should be set on retransmissions/ACKs.
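The decision the new flag enables can be sketched as a small predicate. This is a hedged userspace C model: the flag values and tcp_set_ect_sketch are illustrative (only the TCP_ECN_ECT_PERMANENT name comes from the commit), not the kernel's actual ECN state machine.

```c
#include <stdint.h>

/* Illustrative ECN state flags; values are made up for this sketch. */
#define TCP_ECN_OK            0x1u  /* ECN negotiated on this connection */
#define TCP_ECN_ECT_PERMANENT 0x8u  /* CCA wants ECT even on rtx/pure ACKs */

/* Should this transmission carry the ECT codepoint? Conventionally,
 * retransmits and pure ACKs go out Not-ECT; a CCA that sets
 * TCP_ECN_ECT_PERMANENT (BBRv2 with ECN) wants ECT on those too. */
static int tcp_set_ect_sketch(uint32_t ecn_flags, int is_retransmit_or_pure_ack)
{
    if (!(ecn_flags & TCP_ECN_OK))
        return 0;                    /* no ECN on this connection at all */
    if (is_retransmit_or_pure_ack)
        return !!(ecn_flags & TCP_ECN_ECT_PERMANENT);
    return 1;                        /* ordinary data segment: ECT as usual */
}
```

This keeps the default conservative behavior for CCAs that never set the flag, while letting DCTCP-style users of ECN mark every packet.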
Signed-off-by: Adithya Abraham Philip Signed-off-by: Neal Cardwell commit 6de1c71a63ed6b9d5c0b571d198d222dbd23e1a3 Author: Neal Cardwell Date: Mon Dec 28 19:23:09 2020 -0500 net-tcp_bbr: v2: don't assume prior_cwnd was set entering CA_Loss Fix WARN_ON_ONCE() warnings that were firing and pointing to a bbr->prior_cwnd of 0 when exiting CA_Loss and transitioning to CA_Open. The issue was that tcp_simple_retransmit() calls: tcp_set_ca_state(sk, TCP_CA_Loss); without first calling icsk_ca_ops->ssthresh(sk) (because tcp_simple_retransmit() is dealing with losses due to MTU issues and not congestion). The lack of this callback means that BBR did not get a chance to set bbr->prior_cwnd, and thus upon exiting CA_Loss in such cases the WARN_ON_ONCE() would fire due to a zero bbr->prior_cwnd. This commit removes that warning, since a bbr->prior_cwnd of 0 is a valid situation in this state transition. For setting inflight_lo upon entering CA_Loss, to avoid setting an inflight_lo of 0 in this case, this commit switches to taking the max of cwnd and prior_cwnd. We plan to remove that line of code when we switch to cautious (PRR-style) recovery, so that awkwardness will go away. Change-Id: I575dce871c2f20e91e3e9449e1706f42a07b8118 commit 72a0d85b73045cde78d6288fe8ddb8516d1a6ff3 Author: Neal Cardwell Date: Mon Aug 17 19:10:21 2020 -0400 net-tcp_bbr: v2: remove cycle_rand parameter that is unused in BBRv2 Change-Id: Iee1df7e41e42de199068d7c89131ed3d228327c0 commit cc2d74bb89a9debd8a360e16c694469d4c1e6eda Author: Neal Cardwell Date: Mon Aug 17 19:08:41 2020 -0400 net-tcp_bbr: v2: remove field bw_rtts that is unused in BBRv2 Change-Id: I58e3346c707748a6f316f3ed060d2da84c32a79b commit b3ac5ca7c674fe79527f3cfbd553f8fbc57a96bf Author: Neal Cardwell Date: Thu Nov 21 15:28:01 2019 -0500 net-tcp_bbr: v2: remove unnecessary rs.delivered_ce logic upon loss There is no reason to compute rs.delivered_ce upon loss. In fact, we specifically do not want to compute rs.delivered_ce upon loss. 
Two issues: (1) This would be the wrong thing to do, in behavior terms. With RACK's dynamic reordering window, losses can be marked long after the sequence hole appears in the ACK/SACK stream. We want to catch the ECN mark rate rising too high as quickly as possible, which means we want to check for high ECN mark rates at ACK time (as BBRv2 currently does) and not at loss marking time. (2) This is dead code. The ECN mark rate cannot be detected as too high because the check needs rs->delivered to be > 0 as well: if (rs->delivered_ce > 0 && rs->delivered > 0 && Since we are not setting rs->delivered upon loss, this check cannot succeed, so setting delivered_ce is pointless. This dead and wrong line was discovered by Randall Stewart at Netflix as he was reading the BBRv2 code. Change-Id: I37f83f418a259ec31d8f82de986db071b364b76a commit ad956a6f603a75ee10e7fabe7bf25074c9be4668 Author: Neal Cardwell Date: Mon Jul 22 23:18:56 2019 -0400 net-tcp_bbr: v2: add a README.md for TCP BBR v2 alpha release Change-Id: I35a8c984e299d2af6e78c3d4b3aade5627678306 commit 8b19d651c9e9d1cded4984f6c691c720026eeaf9 Author: Neal Cardwell Date: Tue Jun 11 12:54:22 2019 -0400 net-tcp_bbr: v2: BBRv2 ("bbr2") congestion control for Linux TCP BBR v2 is an enhancement to the BBR v1 algorithm. It's designed to aim for lower queues, lower loss, and better Reno/CUBIC coexistence than BBR v1. BBR v2 maintains the core of BBR v1: an explicit model of the network path that is two-dimensional, adapting to estimate the (a) maximum available bandwidth and (b) maximum safe volume of data a flow can keep in-flight in the network. It maintains the estimated BDP as a core guide for estimating an appropriate level of in-flight data. BBR v2 makes several key enhancements: o Its bandwidth-probing time scale is adapted, within bounds, to allow improved coexistence with Reno and CUBIC. 
The bandwidth-probing time scale is (a) extended dynamically based on estimated BDP to improve coexistence with Reno/CUBIC; (b) bounded by an interactive wall-clock time-scale to be more scalable and responsive than Reno and CUBIC. o Rather than being largely agnostic to loss and ECN marks, it explicitly uses loss and (DCTCP-style) ECN signals to maintain its model. o It aims for lower losses than v1 by adjusting its model to attempt to stay within loss rate and ECN mark rate bounds (loss_thresh and ecn_thresh, respectively). o It adapts to loss/ECN signals even when the application is running out of data ("application-limited"), in case the "application-limited" flow is also "network-limited" (the bw and/or inflight available to this flow is lower than previously estimated when the flow ran out of data). o It has a three-part model: the model explicitly tracks three operating points, where an operating point is a tuple (bandwidth, inflight). The three operating points are: o latest: the latest measurement from the current round trip o upper bound: robust, optimistic, long-term upper bound o lower bound: robust, conservative, short-term lower bound These are stored in the following state variables: o latest: bw_latest, inflight_latest o lo: bw_lo, inflight_lo o hi: bw_hi[2], inflight_hi To gain intuition about the meaning of the three operating points, it may help to consider the analogs in CUBIC, which has a somewhat analogous three-part model used by its probing state machine:

    BBR param     CUBIC param
    -----------   -------------
    latest      ~ cwnd
    lo          ~ ssthresh
    hi          ~ last_max_cwnd

The analogy is only a loose one, though, since the BBR operating points are calculated differently, and are 2-dimensional (bw,inflight) rather than CUBIC's one-dimensional notion of operating point (inflight). 
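The three operating points above can be sketched as follows. This is an illustrative Python model with deliberately simplified update rules (the real kernel code windows, decays, and interpolates these bounds); only the field names follow the commit text:

```python
# Illustrative sketch of the three-part (bw, inflight) model described above.
# Update rules are simplified stand-ins, not the kernel's actual logic.
class BbrModel:
    def __init__(self):
        self.bw_latest = 0                  # latest: current round trip
        self.inflight_latest = 0
        self.bw_lo = float("inf")           # lo: conservative short-term bound
        self.inflight_lo = float("inf")
        self.bw_hi = 0                      # hi: optimistic long-term bound
        self.inflight_hi = 0

    def on_round_sample(self, bw, inflight, saw_congestion):
        # latest: the measurement from the current round trip
        self.bw_latest = bw
        self.inflight_latest = inflight
        # hi: raise the long-term upper bound toward the best level seen
        self.bw_hi = max(self.bw_hi, bw)
        self.inflight_hi = max(self.inflight_hi, inflight)
        if saw_congestion:
            # lo: pull the short-term lower bound down on loss/ECN signals
            self.bw_lo = min(self.bw_lo, bw)
            self.inflight_lo = min(self.inflight_lo, inflight)
```

The point of the sketch is the asymmetry: `hi` only ratchets up on good samples, while `lo` only ratchets down when congestion signals arrive.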
o It uses the three-part model to adapt the magnitude of its bandwidth probing to match the estimated space available in the buffer, rather than (as in BBR v1) assuming that it was always acceptable to place 0.25*BDP in the bottleneck buffer when probing (commodity datacenter switches commonly do not have that much buffer for WAN flows). When BBR v2 estimates it hit a buffer limit during probing, its bandwidth probing then starts gently in case little space is still available in the buffer, and then accelerates, slowly at first and then rapidly if it can grow inflight without seeing congestion signals. In such cases, probing is bounded by inflight_hi + inflight_probe, where inflight_probe grows as: [0, 1, 2, 4, 8, 16,...]. This allows BBR to keep losses low and bounded if a bottleneck remains congested, while rapidly/scalably utilizing free bandwidth when it becomes available. o It has a slightly revised state machine, to achieve the goals above.

    BBR_BW_PROBE_UP:     pushes up inflight to probe for bw/vol
    BBR_BW_PROBE_DOWN:   drain excess inflight from the queue
    BBR_BW_PROBE_CRUISE: use pipe, w/ headroom in queue/pipe
    BBR_BW_PROBE_REFILL: try to refill the pipe again to 100%, leaving queue empty

o The estimated BDP: BBR v2 continues to maintain an estimate of the path's two-way propagation delay, by tracking a windowed min_rtt, and coordinating (on an as-needed basis) to try to expose the two-way propagation delay by draining the bottleneck queue. BBR v2 continues to use its min_rtt and (currently-applicable) bandwidth estimate to estimate the current bandwidth-delay product. The estimated BDP still provides one important guideline for bounding inflight data. However, because any min-filtered RTT and max-filtered bw inherently tend to both overestimate, the estimated BDP is often too high; in this case loss or ECN marks can ensue, in which case BBR v2 adjusts inflight_hi and inflight_lo to adapt its sending rate and inflight down to match the available capacity of the path. 
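The bounded probing growth described above can be sketched directly. `probe_bound` and its round-counting interface are illustrative assumptions, not the kernel's function; only the growth schedule [0, 1, 2, 4, 8, 16, ...] comes from the commit text:

```python
# Sketch of the bounded probing described above: after hitting a buffer
# limit, the probe amount beyond inflight_hi grows as 0, 1, 2, 4, 8, 16, ...
def probe_bound(inflight_hi, rounds):
    """Upper bound on inflight after `rounds` of probing past inflight_hi."""
    inflight_probe = 0 if rounds == 0 else 1 << (rounds - 1)
    return inflight_hi + inflight_probe
```

Starting at zero and doubling gives the "gently at first, then rapidly" behavior: early probes risk at most a packet or two of extra queue, while sustained congestion-free growth quickly scales up.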
o Space: Note that ICSK_CA_PRIV_SIZE increased. This is because BBR v2 requires more space. Note that much of the space is due to support for per-socket parameterization and debugging in this research release. With that state removed, the full "struct bbr" is 140 bytes, or 144 with padding. This is an increase of 40 bytes over the existing ca_priv space. o Code: BBR v2 reuses many pieces from BBR v1. But it omits the following significant pieces: o "packet conservation" (bbr_set_cwnd_to_recover_or_restore(), bbr_can_grow_inflight()) o long-term bandwidth estimator ("policer mode") The code layout tries to keep BBR v2 code near the bottom of the file, so that v1-applicable code in the top does not accidentally refer to v2 code. o Docs: See the following docs for more details and diagrams describing the BBR v2 algorithm: https://datatracker.ietf.org/meeting/104/materials/slides-104-iccrg-an-update-on-bbr-00 https://datatracker.ietf.org/meeting/102/materials/slides-102-iccrg-an-update-on-bbr-work-at-google-00 o Internal notes: For this upstream rebase, Neal started from: git show fed518041ac6:net/ipv4/tcp_bbr.c > net/ipv4/tcp_bbr.c then removed dev instrumentation (dynamic get/set for parameters) and code that was only used by BBRv1 Effort: net-tcp_bbr Origin-9xx-SHA1: 2c84098e60bed6d67dde23cd7538c51dee273102 Change-Id: I125cf26ba2a7a686f2fa5e87f4c2afceb65f7a05 commit 8c91b8b17e402084165719e427e4113d245e57d1 Author: Neal Cardwell Date: Sat Nov 16 13:16:25 2019 -0500 net-tcp: add fast_ack_mode=1: skip rwin check in tcp_fast_ack_mode__tcp_ack_snd_check() Add logic for an experimental TCP connection behavior, enabled with tp->fast_ack_mode = 1, which disables checking the receive window before sending an ack in __tcp_ack_snd_check(). If this behavior is enabled, the data receiver sends an ACK if the amount of data is > RCV.MSS. 
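The fast_ack_mode=1 behavior just described amounts to dropping one condition from the ack decision. A simplified sketch (hypothetical Python; the real __tcp_ack_snd_check() weighs several more conditions, collapsed here into a single `rwin_would_update` flag):

```python
# Sketch of the ack-decision change described above. With fast_ack_mode = 1
# the receive-window check is skipped; only the data threshold matters.
def should_send_ack(fast_ack_mode, unacked_bytes, rcv_mss, rwin_would_update):
    if unacked_bytes > rcv_mss:
        if fast_ack_mode == 1:
            return True           # skip the receive-window check entirely
        return rwin_would_update  # default: also require a window update
    return False
```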
Change-Id: Iaa0a0fd7108221f883137a79d5bfa724f1b096d4 commit df2dd44c0324db5b2f8e44e82a317015b4c933eb Author: Neal Cardwell Date: Fri Sep 27 17:10:26 2019 -0400 net-tcp: re-generalize TSO sizing in TCP CC module API Reorganize the API for CC modules so that the CC module once again gets complete control of the TSO sizing decision. This is how the API was set up around 2016 and the initial BBRv1 upstreaming. Later Eric Dumazet simplified it. But with wider testing it now seems that to avoid CPU regressions BBR needs to have a different TSO sizing function. This is necessary to handle cases where there are many flows bottlenecked on the sender host's NIC, in which case BBR's pacing rate is much lower than CUBIC/Reno/DCTCP's. Why does this happen? Because BBR's pacing rate adapts to the low bandwidth share each flow sees. By contrast, CUBIC/Reno/DCTCP see no loss or ECN, so they grow a very large cwnd, and thus a large pacing rate and large TSO burst size. Change-Id: Ic8ccfdbe4010ee8d4bf6a6334c48a2fceb2171ea commit 958aaeff5afb297f82031818ecba3661da134808 Author: Yousuk Seung Date: Wed May 23 17:55:54 2018 -0700 net-tcp: add new ca opts flag TCP_CONG_WANTS_CE_EVENTS Add a new ca opts flag TCP_CONG_WANTS_CE_EVENTS that allows a congestion control module to receive CE events. Currently congestion control modules have to set the TCP_CONG_NEEDS_ECN bit in opts flag to receive CE events, but this may incur changes in ECN behavior elsewhere. This patch adds a new bit TCP_CONG_WANTS_CE_EVENTS that allows congestion control modules to receive CE events independently of TCP_CONG_NEEDS_ECN. 
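The intended flag semantics can be illustrated with plain bit flags. This is a Python sketch of the described behavior (the kernel checks such bits on a CC module's flags field; the helper names here are illustrative):

```python
# Sketch of the flag split described above: a module can opt into CE events
# without also requesting full ECN negotiation behavior.
TCP_CONG_NEEDS_ECN = 1 << 0
TCP_CONG_WANTS_CE_EVENTS = 1 << 1

def ca_wants_ce_events(ca_flags):
    # CE events are delivered if either flag is set.
    return bool(ca_flags & (TCP_CONG_NEEDS_ECN | TCP_CONG_WANTS_CE_EVENTS))

def ca_needs_ecn(ca_flags):
    # Full ECN behavior is only triggered by NEEDS_ECN.
    return bool(ca_flags & TCP_CONG_NEEDS_ECN)
```

A module like BBRv2 would set only WANTS_CE_EVENTS: it then receives CE events while leaving the NEEDS_ECN-driven behavior elsewhere in the stack unchanged.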
Effort: net-tcp Origin-9xx-SHA1: 9f7e14716cde760bc6c67ef8ef7e1ee48501d95b Change-Id: I2255506985242f376d910c6fd37daabaf4744f24 commit 08ed311d295db7c463c476524d95f52030462fb5 Author: Neal Cardwell Date: Tue May 7 22:37:19 2019 -0400 net-tcp_bbr: v2: set tx.in_flight for skbs in repair write queue Syzkaller was able to use TCP_REPAIR to reproduce the new warning added in tcp_fragment(): WARNING: CPU: 0 PID: 118174 at net/ipv4/tcp_output.c:1487 tcp_fragment+0xdcc/0x10a0 net/ipv4/tcp_output.c:1487() inconsistent: tx.in_flight: 0 old_factor: 53 The warning happens because skbs inserted into the tcp_rtx_queue during the repair process go through a sort of "fake send" process, and that process was setting pcount but not tx.in_flight, hence the warnings (where old_factor is the old pcount). The fix of setting tx.in_flight in the TCP_REPAIR code path seems simple enough, and indeed makes the repro code from syzkaller stop producing warnings. Running through kokonut tests, and will send out for review when all tests pass. Effort: net-tcp_bbr Origin-9xx-SHA1: 330f825a08a6fe92cef74d799cc468864c479f63 Change-Id: I0bc4a790f040fd4239620e1eedd5dc64666c6f05 commit 38d074ba1c129b8738eb7eead9698d320c0eb240 Author: Neal Cardwell Date: Wed May 1 20:16:25 2019 -0400 net-tcp_bbr: v2: adjust skb tx.in_flight upon split in tcp_fragment() When we fragment an skb that has already been sent, we need to update the tx.in_flight for the first skb in the resulting pair ("buff"). Because we were not updating the tx.in_flight, the tx.in_flight value was inconsistent with the pcount of the "buff" skb (tx.in_flight would be too high). That meant that if the "buff" skb was lost, then bbr2_inflight_hi_from_lost_skb() would calculate an inflight_hi value that is too high. This could result in longer queues and higher packet loss. 
Packetdrill testing verified that without this commit, when the second half of an skb is SACKed and then later the first half of that skb is marked lost, the calculated inflight_hi was incorrect. Effort: net-tcp_bbr Origin-9xx-SHA1: 385f1ddc610798fab2837f9f372857438b25f874 Change-Id: I617f8cab4e9be7a0b8e8d30b047bf8645393354d commit 010819adf5096d03f5d3ba32176514345de98c89 Author: Neal Cardwell Date: Wed May 1 20:16:33 2019 -0400 net-tcp_bbr: v2: adjust skb tx.in_flight upon merge in tcp_shifted_skb() When tcp_shifted_skb() updates state as adjacent SACKed skbs are coalesced, previously the tx.in_flight was not adjusted, so we could get contradictory state where the skb's recorded pcount was bigger than the tx.in_flight (the number of segments that were in_flight after sending the skb). Normally, having a SACKed skb with a contradictory pcount/tx.in_flight would not matter. However, with SACK reneging, the SACKed bit is removed, and an skb once again becomes eligible for retransmitting, fragmenting, SACKing, etc. Packetdrill testing verified the following sequence is possible in a kernel that does not have this commit:
- skb N is SACKed
- skb N+1 is SACKed and combined with skb N using tcp_shifted_skb()
- tcp_shifted_skb() will increase the pcount of prev, but leave tx.in_flight as-is
- so prev skb can have pcount > tx.in_flight
- RTO, tcp_timeout_mark_lost(), detect reneg, remove "SACKed" bit, mark skb N as lost
- find pcount of skb N is greater than its tx.in_flight
I suspect this issue is what caused the bbr2_inflight_hi_from_lost_skb(): WARN_ON_ONCE(inflight_prev < 0) to fire in production machines using bbr2. Tested: See last commit in series for sponge link. 
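The fix described above preserves a simple invariant: every skb's recorded tx.in_flight must be at least its pcount. A minimal Python sketch, modeling skbs as dicts (the actual kernel adjustment operates on TCP_SKB_CB state and may differ in detail):

```python
# Sketch of the merge fix described above: when `pcount` segments shift from
# skb into prev, prev's recorded tx.in_flight must grow too, preserving the
# invariant tx_in_flight >= pcount for every skb.
def shift_skb(prev, skb, pcount):
    prev["pcount"] += pcount
    prev["tx_in_flight"] += pcount   # the fix: keep in_flight consistent
    skb["pcount"] -= pcount
```

Without the `tx_in_flight` line, prev can end up with pcount > tx_in_flight, which is exactly the contradictory state that later trips the WARN_ON_ONCE after SACK reneging.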
Effort: net-tcp_bbr Origin-9xx-SHA1: 1a3e997e613d2dcf32b947992882854ebe873715 Change-Id: I1b0b75c27519953430c7db51c6f358f104c7af55 commit 6f61d68d1e53865dfa49e20b44e0db0dfa4bc47b Author: Neal Cardwell Date: Tue May 7 22:36:36 2019 -0400 net-tcp_bbr: v2: factor out tx.in_flight setting into tcp_set_tx_in_flight() Factor out the code to set an skb's tx.in_flight field into its own function, so that this code can be used for the TCP_REPAIR "fake send" code path that inserts skbs into the rtx queue without sending them. This is in preparation for the following patch, which fixes an issue with TCP_REPAIR and tx.in_flight. Tested: See last patch in series for sponge link. Effort: net-tcp_bbr Origin-9xx-SHA1: e880fc907d06ea7354333f60f712748ebce9497b Change-Id: I4fbd4a6e18a51ab06d50ab1c9ad820ce5bea89af commit fed267c2eb13256f89fcc907e814c73a919311ca Author: Neal Cardwell Date: Tue Aug 7 21:52:06 2018 -0400 net-tcp_bbr: v2: introduce ca_ops->skb_marked_lost() CC module callback API For connections experiencing reordering, RACK can mark packets lost long after we receive the SACKs/ACKs hinting that the packets were actually lost. This means that CC modules cannot easily learn the volume of inflight data at which packet loss happens by looking at the current inflight or even the packets in flight when the most recently SACKed packet was sent. To learn this, CC modules need to know how many packets were in flight at the time lost packets were sent. This new callback, combined with TCP_SKB_CB(skb)->tx.in_flight, allows them to learn this. This also provides a consistent callback that is invoked whether packets are marked lost upon ACK processing, using the RACK reordering timer, or at RTO time. 
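The purpose of the skb_marked_lost() callback can be sketched as follows. This is illustrative Python; `Flow`, `transmit`, and the seq-keyed snapshot table are hypothetical stand-ins for per-skb state (the kernel stores the snapshot in TCP_SKB_CB(skb)->tx.in_flight):

```python
# Sketch of why the callback described above is needed: loss may be marked
# long after the hinting SACKs arrive, so the CCA reads the in-flight level
# recorded when the lost packet was *sent*, not the current level.
class Flow:
    def __init__(self):
        self.inflight = 0
        self.sent = {}           # seq -> in-flight snapshot at transmit time
        self.lost_inflight = []

    def transmit(self, seq, pcount):
        self.inflight += pcount
        self.sent[seq] = self.inflight   # snapshot taken at transmit time

    def skb_marked_lost(self, seq):
        # CCA callback: learn inflight at the time the lost skb was sent,
        # whether loss is detected at ACK time, RACK timer, or RTO.
        self.lost_inflight.append(self.sent[seq])
```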
Effort: net-tcp_bbr Origin-9xx-SHA1: afcbebe3374e4632ac6714d39e4dc8a8455956f4 Change-Id: I54826ab53df636be537e5d3c618a46145d12d51a commit b324fd8092194ee352cf47e4af633a3b90a83f6c Author: Neal Cardwell Date: Mon Nov 19 13:48:36 2018 -0500 net-tcp_bbr: v2: export FLAG_ECE in rate_sample.is_ece For understanding the relationship between inflight and ECN signals, to try to find the highest inflight value that has acceptable levels of ECN marking. Effort: net-tcp_bbr Origin-9xx-SHA1: 3eba998f2898541406c2666781182200934965a8 Change-Id: I3a964e04cee83e11649a54507043d2dfe769a3b3 commit b2f308e94a5564555f63349019314523c258103a Author: Neal Cardwell Date: Thu Oct 12 23:44:27 2017 -0400 net-tcp_bbr: v2: count packets lost over TCP rate sampling interval For understanding the relationship between inflight and packet loss signals, to try to find the highest inflight value that has acceptable levels of packet losses. Effort: net-tcp_bbr Origin-9xx-SHA1: 4527e26b2bd7756a88b5b9ef1ada3da33dd609ab Change-Id: I594c2500868d9c530770e7ddd68ffc87c57f4fd5 commit 2854a76b0897c45fce59754e9d2c689361c826d5 Author: Neal Cardwell Date: Sat Aug 5 11:49:50 2017 -0400 net-tcp_bbr: v2: snapshot packets in flight at transmit time and pass in rate_sample For understanding the relationship between inflight and losses or ECN signals, to try to find the highest inflight value that has acceptable levels of loss/ECN marking. Effort: net-tcp_bbr Origin-9xx-SHA1: b3eb4f2d20efab4ca001f32c9294739036c493ea Change-Id: I7314047d0ff14dd261a04b1969a46dc658c8836a commit d0775a1e4f43188c66e972ea8447eb5fa0e3bdae Author: Neal Cardwell Date: Sun Jun 24 21:55:59 2018 -0400 net-tcp_bbr: v2: shrink delivered_mstamp, first_tx_mstamp to u32 to free up 8 bytes Free up some space for tracking inflight and losses for each bw sample, in upcoming commits. These timestamps are in microseconds, and are now stored in 32 bits. So they can only hold time intervals up to roughly 2^12 = 4096 seconds. 
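A quick arithmetic check of the interval bound just stated:

```python
# A 32-bit microsecond counter wraps after 2^32 us, i.e. roughly
# 2^12 = 4096 seconds (about 71.6 minutes), as the commit says.
wrap_seconds = (2**32) / 1_000_000   # ~4294.97 s
approx = 2**12                       # 4096 s, within ~5% of the exact value
```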
But Linux TCP RTT and RTO tracking has the same 32-bit microsecond implementation approach and resulting deployment limitations. So this is not introducing a new limit. And these should not be a limitation for the foreseeable future. Effort: net-tcp_bbr Origin-9xx-SHA1: 238a7e6b5d51625fef1ce7769826a7b21b02ae55 Change-Id: I3b779603797263b52a61ad57c565eb91fe42680c commit d6c71581cba6a4d1ff7d4cf899ab0ec052123b65 Author: Neal Cardwell Date: Tue Jun 11 12:26:55 2019 -0400 net-tcp_bbr: broaden app-limited rate sample detection This commit is a bug fix for the Linux TCP app-limited (application-limited) logic that is used for collecting rate (bandwidth) samples. Previously the app-limited logic only looked for "bubbles" of silence in between application writes, by checking at the start of each sendmsg. But "bubbles" of silence can also happen before retransmits: e.g. bubbles can happen between an application write and a retransmit, or between two retransmits. Retransmits are triggered by ACKs or timers. So this commit checks for bubbles of app-limited silence upon ACKs or timers. Why does this commit check for app-limited state at the start of ACKs and timer handling? Because at that point we know whether inflight was fully using the cwnd. During processing the ACK or timer event we often change the cwnd; after changing the cwnd we can't know whether inflight was fully using the old cwnd. Origin-9xx-SHA1: 3fe9b53291e018407780fb8c356adb5666722cbc Change-Id: I37221506f5166877c2b110753d39bb0757985e68 commit ed307ccbcfff0b21a526f53cfa28df9b2052db53 Author: Stephan Mueller Date: Tue Sep 28 17:41:57 2021 +0200 char/lrng: add power-on and runtime self-tests Parts of the LRNG are already covered by self-tests, including: * Self-test of SP800-90A DRBG provided by the Linux kernel crypto API. * Self-test of the PRNG provided by the Linux kernel crypto API. 
* Raw noise source data testing including SP800-90B compliant tests when enabling CONFIG_LRNG_HEALTH_TESTS This patch adds the self-tests for the remaining critical functions of the LRNG that are essential to maintain entropy and provide cryptographically strong random numbers. The following self-tests are implemented: * Self-test of the time array maintenance. This test verifies that the time stamp array management, which stores multiple values in one integer, implements a concatenation of the data. * Self-test of the software hash implementation ensures that this function operates in compliance with the FIPS 180-4 specification. The self-test performs a hash operation of a zeroized per-CPU data array. * Self-test of the ChaCha20 DRNG is based on the self-tests that are already present and implemented with the stand-alone user space ChaCha20 DRNG implementation available at [1]. The self-tests cover different use cases of the DRNG seeded with known seed data. The status of the LRNG self-tests is provided with the selftest_status SysFS file. If the file contains a zero, the self-tests passed. The value 0xffffffff means that the self-tests were not executed. Any other value indicates a self-test failure. The self-test may be compiled to panic the system if the self-test fails. All self-tests operate on private state data structures. This implies that none of the self-tests have any impact on the regular LRNG operations. This allows the self-tests to be repeated at runtime by writing anything into the selftest_status SysFS file. [1] https://www.chronox.de/chacha20.html CC: Torsten Duwe CC: "Eric W. Biederman" CC: "Alexander E. Patrakov" CC: "Ahmed S. Darwish" CC: "Theodore Y. 
Ts'o" CC: Willy Tarreau CC: Matthew Garrett CC: Vito Caputo CC: Andreas Dilger CC: Jan Kara CC: Ray Strode CC: William Jon McCann CC: zhangjs CC: Andy Lutomirski CC: Florian Weimer CC: Lennart Poettering CC: Nicolai Stange CC: Marcelo Henrique Cerri CC: Neil Horman Reviewed-by: Alexander Lobakin Tested-by: Alexander Lobakin Tested-by: Jirka Hladky Reviewed-by: Jirka Hladky Signed-off-by: Stephan Mueller commit 990036a0a3ff50142abe71c31872da8c24f8d0c9 Author: Stephan Mueller Date: Mon Oct 18 20:55:51 2021 +0200 char/lrng: add interface for gathering of raw entropy The test interface allows a privileged process to capture the raw unconditioned noise that is collected by the LRNG for statistical analysis. Such testing allows analysis of how much entropy the interrupt noise source provides on a given platform. Extracted noise data is not used to seed the LRNG. This is a test interface and not appropriate for production systems. Yet, the interface is considered to be sufficiently secured for production systems. Access to the data is given through the lrng_raw debugfs file. The data buffer should be multiples of sizeof(u32) to fill the entire buffer. Using the option lrng_testing.boot_test=1 the raw noise of the first 1000 entropy events since boot can be sampled. This test interface allows generating the data required for analyzing whether the LRNG is in compliance with SP800-90B sections 3.1.3 and 3.1.4. In addition, the test interface allows gathering of the concatenated raw entropy data to verify that the concatenation works appropriately. This includes sampling of the following raw data:
* high-resolution time stamp
* Jiffies
* IRQ number
* IRQ flags
* return instruction pointer
* interrupt register state
* array logic batching the high-resolution time stamp
* enabling the runtime configuration of entropy source entropy rates
Also, a testing interface to support ACVT of the hash implementation is provided. 
The reason why only hash testing is supported (as opposed to also provide testing for the DRNG) is the fact that the LRNG software hash implementation contains glue code that may warrant testing in addition to the testing of the software ciphers via the kernel crypto API. Also, for testing the CTR-DRBG, the underlying AES implementation would need to be tested. However, such AES test interface cannot be provided by the LRNG as it has no means to access the AES operation. Finally, the execution duration for processing a time stamp can be obtained with the LRNG raw entropy interface. If a test interface is not compiled, its code is a noop which has no impact on the performance. CC: Torsten Duwe CC: "Eric W. Biederman" CC: "Alexander E. Patrakov" CC: "Ahmed S. Darwish" CC: "Theodore Y. Ts'o" CC: Willy Tarreau CC: Matthew Garrett CC: Vito Caputo CC: Andreas Dilger CC: Jan Kara CC: Ray Strode CC: William Jon McCann CC: zhangjs CC: Andy Lutomirski CC: Florian Weimer CC: Lennart Poettering CC: Nicolai Stange Reviewed-by: Alexander Lobakin Tested-by: Alexander Lobakin Reviewed-by: Roman Drahtmueller Tested-by: Marcelo Henrique Cerri Tested-by: Neil Horman Tested-by: Jirka Hladky Reviewed-by: Jirka Hladky Signed-off-by: Stephan Mueller commit c808bb6fb444fde94a8dec94d3936cc44e7f8a44 Author: Stephan Mueller Date: Tue Sep 28 18:10:36 2021 +0200 char/lrng: add SP800-90B compliant health tests Implement health tests for LRNG's slow noise sources as mandated by SP-800-90B The file contains the following health tests: - stuck test: The stuck test calculates the first, second and third discrete derivative of the time stamp to be processed by the hash for the per-CPU entropy pool. Only if all three values are non-zero, the received time delta is considered to be non-stuck. - SP800-90B Repetition Count Test (RCT): The LRNG uses an enhanced version of the RCT specified in SP800-90B section 4.4.1. 
Instead of counting identical back-to-back values, the input to the RCT is the counting of the stuck values during the processing of received interrupt events. The RCT is applied with alpha=2^-30, compliant with the recommendation of FIPS 140-2 IG 9.8. During the counting operation, the LRNG always calculates the RCT cut-off value C. If that value exceeds the allowed cut-off value, the LRNG will trigger the health test failure discussed below. An error is logged to the kernel log when such an RCT failure occurs. This test is only applied and enforced in FIPS mode, i.e. when a kernel compiled with CONFIG_FIPS is started with fips=1. - SP800-90B Adaptive Proportion Test (APT): The LRNG implements the APT as defined in SP800-90B section 4.4.2. The applied significance level again is alpha=2^-30, compliant with the recommendation of FIPS 140-2 IG 9.8. The aforementioned health tests are applied to the first 1,024 time stamps obtained from interrupt events. If an error is identified by either the RCT or the APT, the collected entropy is invalidated and the SP800-90B startup health test is restarted. As long as the SP800-90B startup health test is not completed, all LRNG random number output interfaces that may block will block and not generate any data. This implies that only those potentially blocking interfaces are defined to provide random numbers that are seeded with the interrupt noise source being SP800-90B compliant. All other output interfaces will not be affected by the SP800-90B startup test and thus are not considered SP800-90B compliant. At runtime, the SP800-90B APT and RCT are applied to each time stamp generated for a received interrupt. 
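The two health tests can be sketched as follows. This is illustrative Python using the textbook SP800-90B cutoff formula; as noted above, the LRNG's actual RCT input is stuck-value counts rather than raw samples, and it derives its own cutoffs internally, so the numbers here are assumptions for illustration only:

```python
# Illustrative sketches of the SP800-90B health tests described above.
import math

def rct_cutoff(h_min_bits, alpha_exp=30):
    # SP800-90B 4.4.1: C = 1 + ceil(-log2(alpha) / H), here alpha = 2^-30.
    return 1 + math.ceil(alpha_exp / h_min_bits)

def rct_fails(samples, cutoff):
    # Repetition Count Test: fail on a run of identical values >= cutoff.
    run = 1
    for prev, cur in zip(samples, samples[1:]):
        run = run + 1 if cur == prev else 1
        if run >= cutoff:
            return True
    return False

def apt_fails(window, cutoff):
    # Adaptive Proportion Test (4.4.2): within a window, count how often the
    # first sample repeats; too many repeats indicates a degraded source.
    return window.count(window[0]) >= cutoff
```

For example, a source claiming 1 bit of min-entropy per sample gets an RCT cutoff of 31: a run of 31 identical samples has probability about 2^-30 and so triggers a failure.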
When either the APT or RCT indicates a noise source failure, the LRNG is reset to the state it has immediately after boot:
- all entropy counters are set to zero
- the SP800-90B startup tests are re-performed, which implies that getrandom(2) would block again until new entropy was collected
To summarize, the following rules apply:
• SP800-90B compliant output interfaces
  - /dev/random
  - getrandom(2) system call
  - get_random_bytes kernel-internal interface when being triggered by the callback registered with add_random_ready_callback
• SP800-90B non-compliant output interfaces
  - /dev/urandom
  - get_random_bytes kernel-internal interface called directly
  - randomize_page kernel-internal interface
  - get_random_u32 and get_random_u64 kernel-internal interfaces
  - get_random_u32_wait, get_random_u64_wait, get_random_int_wait, and get_random_long_wait kernel-internal interfaces
If either the RCT or the APT health test fails, irrespective of whether during initialization or runtime, the following actions occur: 1. The entropy of the entire entropy pool is invalidated. 2. All DRNGs are reset, which implies that they are treated as not seeded and require a reseed during the next invocation. 3. The SP800-90B startup health tests are initiated, with all implications of the startup tests. That implies that from that point on, new events must be observed and their entropy must be inserted into the entropy pool before random numbers are calculated from it. Further details on the SP800-90B compliance and the availability of all test tools required to perform all tests mandated by SP800-90B are provided at [1]. The entire health testing code is compile-time configurable. The patch provides a CONFIG_BROKEN configuration of the APT / RCT cutoff values which has a high likelihood of triggering a health test failure. 
The BROKEN APT cutoff is set to the exact mean of the expected value if the time stamps are equally distributed (512 time stamps divided by 16 possible values due to using the 4 LSB of the time stamp). The BROKEN RCT cutoff value is set to 1 which is likely to be triggered during regular operation. CC: Torsten Duwe CC: "Eric W. Biederman" CC: "Alexander E. Patrakov" CC: "Ahmed S. Darwish" CC: "Theodore Y. Ts'o" CC: Willy Tarreau CC: Matthew Garrett CC: Vito Caputo CC: Andreas Dilger CC: Jan Kara CC: Ray Strode CC: William Jon McCann CC: zhangjs CC: Andy Lutomirski CC: Florian Weimer CC: Lennart Poettering CC: Nicolai Stange Reviewed-by: Alexander Lobakin Tested-by: Alexander Lobakin Reviewed-by: Roman Drahtmueller Tested-by: Marcelo Henrique Cerri Tested-by: Neil Horman Tested-by: Jirka Hladky Reviewed-by: Jirka Hladky Signed-off-by: Stephan Mueller commit 7db70085449199372687d8cf508c5679f3bb165c Author: Stephan Mueller Date: Wed Oct 13 22:53:51 2021 +0200 char/lrng: add Jitter RNG fast noise source The Jitter RNG fast noise source implemented as part of the kernel crypto API is queried for 256 bits of entropy at the time the seed buffer managed by the LRNG is about to be filled. CC: Torsten Duwe CC: "Eric W. Biederman" CC: "Alexander E. Patrakov" CC: "Ahmed S. Darwish" CC: "Theodore Y. 
Ts'o" CC: Willy Tarreau CC: Matthew Garrett CC: Vito Caputo CC: Andreas Dilger CC: Jan Kara CC: Ray Strode CC: William Jon McCann CC: zhangjs CC: Andy Lutomirski CC: Florian Weimer CC: Lennart Poettering CC: Nicolai Stange Reviewed-by: Alexander Lobakin Tested-by: Alexander Lobakin Reviewed-by: Marcelo Henrique Cerri Tested-by: Marcelo Henrique Cerri Tested-by: Neil Horman Tested-by: Jirka Hladky Reviewed-by: Jirka Hladky Signed-off-by: Stephan Mueller commit e0ba90281b8805c8b7928979a46d033b73b1a7ce Author: Stephan Mueller Date: Wed Sep 16 09:50:27 2020 +0200 crypto: move Jitter RNG header include dir To support the LRNG operation which uses the Jitter RNG separately from the kernel crypto API, the header file must be accessible to the LRNG code. CC: Torsten Duwe CC: "Eric W. Biederman" CC: "Alexander E. Patrakov" CC: "Ahmed S. Darwish" CC: "Theodore Y. Ts'o" CC: Willy Tarreau CC: Matthew Garrett CC: Vito Caputo CC: Andreas Dilger CC: Jan Kara CC: Ray Strode CC: William Jon McCann CC: zhangjs CC: Andy Lutomirski CC: Florian Weimer CC: Lennart Poettering CC: Nicolai Stange Reviewed-by: Alexander Lobakin Tested-by: Alexander Lobakin Reviewed-by: Roman Drahtmueller Tested-by: Roman Drahtmüller Tested-by: Marcelo Henrique Cerri Tested-by: Neil Horman Tested-by: Jirka Hladky Reviewed-by: Jirka Hladky Signed-off-by: Stephan Mueller commit f07f4a5f40dc634bb20420ff319cdcc8eeb5c992 Author: Stephan Mueller Date: Fri Jun 18 08:10:53 2021 +0200 char/lrng: add kernel crypto API PRNG extension Add runtime-pluggable support for all PRNGs that are accessible via the kernel crypto API, including hardware PRNGs. The PRNG is selected with the module parameter drng_name where the name must be one that the kernel crypto API can resolve into an RNG. This allows using of the kernel crypto API PRNG implementations that provide an interface to hardware PRNGs. Using this extension, the LRNG uses the hardware PRNGs to generate random numbers. 
An example is the S390 CPACF support providing such a PRNG. The hash is provided by a kernel crypto API SHASH whose digest size complies with the seedsize of the PRNG. CC: Torsten Duwe CC: "Eric W. Biederman" CC: "Alexander E. Patrakov" CC: "Ahmed S. Darwish" CC: "Theodore Y. Ts'o" CC: Willy Tarreau CC: Matthew Garrett CC: Vito Caputo CC: Andreas Dilger CC: Jan Kara CC: Ray Strode CC: William Jon McCann CC: zhangjs CC: Andy Lutomirski CC: Florian Weimer CC: Lennart Poettering CC: Nicolai Stange Reviewed-by: Alexander Lobakin Tested-by: Alexander Lobakin Reviewed-by: Marcelo Henrique Cerri Reviewed-by: Roman Drahtmueller Tested-by: Marcelo Henrique Cerri Tested-by: Neil Horman Tested-by: Jirka Hladky Reviewed-by: Jirka Hladky Signed-off-by: Stephan Mueller commit b86cde95ce97e7d0b1bf48179f35ea20cc7a69a1 Author: Stephan Mueller Date: Fri Jun 18 08:09:59 2021 +0200 char/lrng: add SP800-90A DRBG extension Using the LRNG switchable DRNG support, the SP800-90A DRBG extension is implemented. The DRBG uses the kernel crypto API DRBG implementation. In addition, it uses the kernel crypto API SHASH support to provide the hashing operation. The DRBG supports the choice of either a CTR DRBG using AES-256, HMAC DRBG with SHA-512 core or Hash DRBG with SHA-512 core. The used core can be selected with the module parameter lrng_drbg_type. The default is the CTR DRBG. When compiling the DRBG extension statically, the DRBG is loaded at late_initcall stage which implies that with the start of user space, the user space interfaces of getrandom(2), /dev/random and /dev/urandom provide random data produced by an SP800-90A DRBG. CC: Torsten Duwe CC: "Eric W. Biederman" CC: "Alexander E. Patrakov" CC: "Ahmed S. Darwish" CC: "Theodore Y. 
Ts'o" CC: Willy Tarreau CC: Matthew Garrett CC: Vito Caputo CC: Andreas Dilger CC: Jan Kara CC: Ray Strode CC: William Jon McCann CC: zhangjs CC: Andy Lutomirski CC: Florian Weimer CC: Lennart Poettering CC: Nicolai Stange Reviewed-by: Alexander Lobakin Tested-by: Alexander Lobakin Reviewed-by: Roman Drahtmueller Tested-by: Marcelo Henrique Cerri Tested-by: Neil Horman Tested-by: Jirka Hladky Reviewed-by: Jirka Hladky Signed-off-by: Stephan Mueller commit 2eed3ac32f7184096ff7f18591fbedd1de66d97d Author: Stephan Mueller Date: Tue Sep 15 22:17:43 2020 +0200 crypto: drbg: externalize DRBG functions for LRNG This patch allows several DRBG functions to be called by the LRNG kernel code paths outside the drbg.c file. CC: Torsten Duwe CC: "Eric W. Biederman" CC: "Alexander E. Patrakov" CC: "Ahmed S. Darwish" CC: "Theodore Y. Ts'o" CC: Willy Tarreau CC: Matthew Garrett CC: Vito Caputo CC: Andreas Dilger CC: Jan Kara CC: Ray Strode CC: William Jon McCann CC: zhangjs CC: Andy Lutomirski CC: Florian Weimer CC: Lennart Poettering CC: Nicolai Stange Reviewed-by: Alexander Lobakin Tested-by: Alexander Lobakin Reviewed-by: Roman Drahtmueller Tested-by: Roman Drahtmüller Tested-by: Marcelo Henrique Cerri Tested-by: Neil Horman Tested-by: Jirka Hladky Reviewed-by: Jirka Hladky Signed-off-by: Stephan Mueller commit 3ef6e0b3e183a5f5d97bcd3e688387c91c66a03e Author: Stephan Mueller Date: Fri Jun 18 08:08:20 2021 +0200 char/lrng: add common generic hash support The LRNG switchable DRNG support also allows the replacement of the hash implementation used as conditioning component. The common generic hash support code provides the required callbacks using the synchronous hash implementations of the kernel crypto API. All synchronous hash implementations supported by the kernel crypto API can be used as part of the LRNG with this generic support. The generic support is intended to be configured by separate switchable DRNG backends. CC: Torsten Duwe CC: "Eric W. Biederman" CC: "Alexander E. 
Patrakov" CC: "Ahmed S. Darwish" CC: "Theodore Y. Ts'o" CC: Willy Tarreau CC: Matthew Garrett CC: Vito Caputo CC: Andreas Dilger CC: Jan Kara CC: Ray Strode CC: William Jon McCann CC: zhangjs CC: Andy Lutomirski CC: Florian Weimer CC: Lennart Poettering CC: Nicolai Stange CC: "Peter, Matthias" CC: Marcelo Henrique Cerri CC: Neil Horman Reviewed-by: Alexander Lobakin Tested-by: Alexander Lobakin Tested-by: Jirka Hladky Reviewed-by: Jirka Hladky Signed-off-by: Stephan Mueller commit 1ac746ce017ec114f354b6f89493c8c059184fd7 Author: Stephan Mueller Date: Fri Oct 1 20:47:30 2021 +0200 char/lrng: add switchable DRNG support The DRNG switch support allows replacing the DRNG mechanism of the LRNG. The switching support rests on the interface definition of include/linux/lrng.h. A new DRNG is implemented by filling in the interface defined in this header file. In addition to the DRNG, the extension also has to provide a hash implementation that is used to hash the entropy pool for random number extraction. Note: It is permissible to implement a DRNG whose operations may sleep. However, the hash function must not sleep. The switchable DRNG support allows replacing the DRNG at runtime. However, only one DRNG extension is allowed to be loaded at any given time. Before replacing it with another DRNG implementation, the possibly existing DRNG extension must be unloaded. The switchable DRNG extension activates the new DRNG during load time. It is expected, however, that such a DRNG switch would be done only once by an administrator to load the intended DRNG implementation. It is permissible to compile DRNG extensions either as kernel modules or statically. The initialization of the DRNG extension should be performed with a late_initcall to ensure the extension is available when user space starts but after all other initialization completed. The initialization is performed by registering the function call data structure with the lrng_set_drng_cb function. 
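The registration flow just described, one extension loaded at any given time via a callback data structure, can be sketched in ordinary C. The structure below is a hypothetical stand-in for the interface defined in include/linux/lrng.h, reduced to two operations for brevity; field names are illustrative, not the actual kernel definitions.

```c
#include <stddef.h>

/* Hypothetical callback table; the real interface in include/linux/lrng.h
 * contains more operations (hash callbacks, name strings, etc.). */
struct lrng_drng_cb {
	int (*seed)(void *drng, const unsigned char *inbuf, size_t inbuflen);
	int (*generate)(void *drng, unsigned char *outbuf, size_t outbuflen);
};

static const struct lrng_drng_cb *active_cb;

/* Register a DRNG extension; invoking with NULL unloads the current one.
 * Only one extension may be loaded at a time, so registering a different
 * table is rejected until the existing extension is unloaded. */
int lrng_set_drng_cb(const struct lrng_drng_cb *cb)
{
	if (cb && active_cb && active_cb != cb)
		return -1;
	active_cb = cb;
	return 0;
}
```

A second registration fails until the first extension deregisters itself by passing NULL, mirroring the unload rule described in the text.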
In order to unload the DRNG extension, lrng_set_drng_cb must be invoked with the NULL parameter. The DRNG extension should always provide a security strength that is at least as strong as LRNG_DRNG_SECURITY_STRENGTH_BITS. The hash extension must not sleep and must not maintain a separate state. CC: Torsten Duwe CC: "Eric W. Biederman" CC: "Alexander E. Patrakov" CC: "Ahmed S. Darwish" CC: "Theodore Y. Ts'o" CC: Willy Tarreau CC: Matthew Garrett CC: Vito Caputo CC: Andreas Dilger CC: Jan Kara CC: Ray Strode CC: William Jon McCann CC: zhangjs CC: Andy Lutomirski CC: Florian Weimer CC: Lennart Poettering CC: Nicolai Stange Reviewed-by: Alexander Lobakin Tested-by: Alexander Lobakin Reviewed-by: Marcelo Henrique Cerri Reviewed-by: Roman Drahtmueller Tested-by: Marcelo Henrique Cerri Tested-by: Neil Horman Tested-by: Jirka Hladky Reviewed-by: Jirka Hladky Signed-off-by: Stephan Mueller commit 98be5134b53ff31b0169b1d71d88d98d22410ac8 Author: Stephan Mueller Date: Mon Oct 18 20:52:42 2021 +0200 char/lrng: CPU entropy source Certain CPUs provide instructions giving access to an entropy source (e.g. RDSEED on Intel/AMD, DARN on POWER, etc.). The LRNG can utilize the entropy source to seed its DRNG from. CC: Torsten Duwe CC: "Eric W. Biederman" CC: "Alexander E. Patrakov" CC: "Ahmed S. Darwish" CC: "Theodore Y. 
Ts'o" CC: Willy Tarreau CC: Matthew Garrett CC: Vito Caputo CC: Andreas Dilger CC: Jan Kara CC: Ray Strode CC: William Jon McCann CC: zhangjs CC: Andy Lutomirski CC: Florian Weimer CC: Lennart Poettering CC: Nicolai Stange Reviewed-by: Alexander Lobakin Tested-by: Alexander Lobakin Mathematical aspects Reviewed-by: "Peter, Matthias" Reviewed-by: Marcelo Henrique Cerri Reviewed-by: Roman Drahtmueller Tested-by: Marcelo Henrique Cerri Tested-by: Neil Horman Tested-by: Jirka Hladky Reviewed-by: Jirka Hladky Signed-off-by: Stephan Mueller commit 78be60b1bab4754b5b1f7b9153ed67186d4a5d24 Author: Stephan Mueller Date: Tue Sep 28 17:05:50 2021 +0200 char/lrng: allocate one DRNG instance per NUMA node In order to improve NUMA-locality when serving getrandom(2) requests, allocate one DRNG instance per node. The DRNG instance that is present right from the start of the kernel is reused as the first per-NUMA-node DRNG. For all remaining online NUMA nodes a new DRNG instance is allocated. During boot time, the multiple DRNG instances are seeded sequentially. With this, the first DRNG instance (referenced as the initial DRNG in the code) is completely seeded with 256 bits of entropy before the next DRNG instance is completely seeded. When random numbers are requested, the NUMA-node-local DRNG is checked whether it has been already fully seeded. If this is not the case, the initial DRNG is used to serve the request. CC: Torsten Duwe CC: "Eric W. Biederman" CC: "Alexander E. Patrakov" CC: "Ahmed S. Darwish" CC: "Theodore Y. 
Ts'o" CC: Willy Tarreau CC: Matthew Garrett CC: Vito Caputo CC: Andreas Dilger CC: Jan Kara CC: Ray Strode CC: William Jon McCann CC: zhangjs CC: Andy Lutomirski CC: Florian Weimer CC: Lennart Poettering CC: Nicolai Stange CC: Eric Biggers Reviewed-by: Alexander Lobakin Tested-by: Alexander Lobakin Reviewed-by: Marcelo Henrique Cerri Reviewed-by: Roman Drahtmueller Tested-by: Marcelo Henrique Cerri Tested-by: Neil Horman Tested-by: Jirka Hladky Reviewed-by: Jirka Hladky Signed-off-by: Stephan Mueller commit 07f25c72f6b3699b4851f2ded96fcee5f34fb709 Author: Stephan Mueller Date: Wed Oct 13 22:50:53 2021 +0200 char/lrng: sysctls and /proc interface The LRNG sysctl interface provides the same controls as the existing /dev/random implementation. These sysctls behave identically and are implemented identically. The goal is to allow a possible merge of the existing /dev/random implementation with this implementation, which implies that this patch tries to maintain a very close similarity. Yet, all sysctls are documented at [1]. In addition, it provides the file lrng_type which provides details about the LRNG: - the name of the DRNG that produces the random numbers for /dev/random, /dev/urandom, getrandom(2) - the hash used to produce random numbers from the entropy pool - the number of secondary DRNG instances - indicator whether the LRNG operates SP800-90B compliant - indicator whether a high-resolution timer is identified - only with a high-resolution timer the interrupt noise source will deliver sufficient entropy - indicator whether the LRNG has been minimally seeded (i.e. is the secondary DRNG seeded with at least 128 bits of entropy) - indicator whether the LRNG has been fully seeded (i.e. is the secondary DRNG seeded with at least 256 bits of entropy) [1] https://www.chronox.de/lrng.html CC: Torsten Duwe CC: "Eric W. Biederman" CC: "Alexander E. Patrakov" CC: "Ahmed S. Darwish" CC: "Theodore Y. 
Ts'o" CC: Willy Tarreau CC: Matthew Garrett CC: Vito Caputo CC: Andreas Dilger CC: Jan Kara CC: Ray Strode CC: William Jon McCann CC: zhangjs CC: Andy Lutomirski CC: Florian Weimer CC: Lennart Poettering CC: Nicolai Stange Reviewed-by: Alexander Lobakin Tested-by: Alexander Lobakin Reviewed-by: Marcelo Henrique Cerri Reviewed-by: Roman Drahtmueller Tested-by: Marcelo Henrique Cerri Tested-by: Neil Horman Tested-by: Jirka Hladky Reviewed-by: Jirka Hladky Signed-off-by: Stephan Mueller commit 53123f524900466761165f1d795e906ce1fddc5c Author: Stephan Mueller Date: Wed Oct 13 22:49:22 2021 +0200 char/lrng: IRQ entropy source The interrupt entropy source hooks into the interrupt handler via the add_interrupt_randomness function callback. Every interrupt received by the kernel is also sent to the LRNG for processing. The IRQ entropy source performs the following processing: 1. Record a time stamp. 2. Divide the time stamp by its greatest common divisor to eliminate fixed least significant bits. 3. Insert the 8 LSB of the result from step 2 into the collection pool. 4. When the collection pool is full, it is hashed into the per-CPU entropy pool (if continuous compression is enabled) or the latest time stamps overwrite the oldest entries in the collection pool. If entropy is requested from the IRQ entropy pool, a message digest over all per-CPU entropy pool digests is calculated. The GCD calculation is performed for the first 100 interrupt time stamps. Until the GCD value is calculated, the full 32 bit time stamp is inserted into the collection pool. CC: Torsten Duwe CC: "Eric W. Biederman" CC: "Alexander E. Patrakov" CC: "Ahmed S. Darwish" CC: "Theodore Y. 
Ts'o" CC: Willy Tarreau CC: Matthew Garrett CC: Vito Caputo CC: Andreas Dilger CC: Jan Kara CC: Ray Strode CC: William Jon McCann CC: zhangjs CC: Andy Lutomirski CC: Florian Weimer CC: Lennart Poettering CC: Nicolai Stange Reviewed-by: Alexander Lobakin Tested-by: Alexander Lobakin Mathematical aspects Reviewed-by: "Peter, Matthias" Reviewed-by: Marcelo Henrique Cerri Reviewed-by: Roman Drahtmueller Tested-by: Marcelo Henrique Cerri Tested-by: Neil Horman Tested-by: Jirka Hladky Reviewed-by: Jirka Hladky Signed-off-by: Stephan Mueller commit 0d82aaa422d09de415cafb571131055c3ee53902 Author: Stephan Mueller Date: Sun Oct 17 20:23:04 2021 +0200 drivers/char: Introduce the Linux Random Number Generator In an effort to provide a flexible implementation of a random number generator that also delivers entropy during early boot time, allows replacement of the deterministic random number generation mechanism, implements the various components in separate code for easier maintenance, and provides compliance with SP800-90[A|B|C], introduce the Linux Random Number Generator (LRNG) framework. The LRNG framework provides a flexible random number generator which allows developers and system integrators to achieve different goals by ensuring that each solution establishes a secure state. The general design is as follows. Additional implementation details are given in [1]. The LRNG consists of the following components: 1. The LRNG implements a DRNG. The DRNG always generates the requested amount of output. When using the SP800-90A terminology it operates without prediction resistance. The DRNG maintains a counter of how many bytes were generated since last re-seed and a timer of the elapsed time since last re-seed. If either the counter or the timer reaches a threshold, the DRNG is seeded from the entropy pool. In case the Linux kernel detects a NUMA system, one DRNG instance per NUMA node is maintained. 2. 
The DRNG is seeded by concatenating the data from the following sources which deliver data and are credited with entropy if enabled: (a) the output of the IRQ per-CPU entropy pools, (b) the auxiliary entropy pool, (c) the Jitter RNG if available and enabled, and (d) the CPU-based noise source such as Intel RDSEED. The entropy estimates of the data of all noise sources are added to form the entropy estimate of the data used to seed the DRNG. The LRNG ensures, however, that the entropy credited to the DRNG after seeding is at most the security strength of the DRNG. The LRNG is designed such that none of these noise sources can dominate the other noise sources in providing seed data to the DRNG due to the following: (a) During boot time, the amounts of received entropy at the different entropy sources are the trigger points to (re)seed the DRNG. (b) At runtime, the available entropy from the slow noise source is concatenated with a pre-defined amount of data from the fast noise sources. In addition, each DRNG reseed operation triggers external noise source providers to deliver one block of data. 3. The IRQ entropy pool collects noise data from interrupt timing. Any data received by the LRNG from the interrupt noise sources is inserted into a per-CPU entropy pool using a hash operation that can be changed during runtime. By default, SHA-256 is used. (a) When an interrupt occurs, the 8 least significant bits of the high-resolution time stamp divided by the greatest common divisor (GCD) are mixed into the per-CPU entropy pool. This time stamp is credited with heuristically implied entropy. (b) HID event data like the key stroke or the mouse coordinates are mixed into the per-CPU entropy pool. This data is not credited with entropy by the LRNG. 5. Any data provided from user space by either writing to /dev/random, /dev/urandom or the IOCTL of RNDADDENTROPY on both device files is always injected into the auxiliary pool. 
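The per-interrupt processing in (a) above (divide the time stamp by the observed GCD, then keep the 8 least significant bits) is plain integer arithmetic. A standalone sketch, not the kernel code:

```c
#include <stdint.h>

/* Euclid's algorithm: the greatest common divisor of the observed time
 * stamps identifies the fixed least significant bits to strip. */
static uint32_t gcd32(uint32_t a, uint32_t b)
{
	while (b) {
		uint32_t t = a % b;
		a = b;
		b = t;
	}
	return a;
}

/* Divide a time stamp by the measured GCD, then keep the 8 least
 * significant bits that get mixed into the per-CPU entropy pool. */
static uint8_t ts_to_pool_byte(uint32_t ts, uint32_t gcd)
{
	return (uint8_t)((ts / gcd) & 0xff);
}
```

In the LRNG the GCD is measured over the first 100 interrupt time stamps; here it is simply passed in as a parameter.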
Also, device drivers may provide data that is mixed into an auxiliary pool using the same hash that is used to process the per-CPU entropy pool. This data is not credited with entropy by the LRNG. In addition, when a hardware random number generator covered by the Linux kernel HW generator framework wants to deliver random numbers, it is injected into the auxiliary pool as well. The HW generator noise source is handled separately from the other noise sources because the HW generator framework may decide by itself when to deliver data, whereas the other noise sources are always queried for data as driven by the LRNG operation. Similarly, any user space provided data is inserted into the entropy pool. When seed data for the DRNG is to be generated, all per-CPU entropy pools are hashed. The message digest forms the data used for seeding the DRNG. To speed up the interrupt handling code of the LRNG, the time stamp collected for an interrupt event is divided by the greatest common divisor to eliminate fixed low bits and then truncated to the 8 least significant bits. 1024 truncated time stamps are concatenated and then jointly inserted into the per-CPU entropy pool. During boot time, until the fully seeded stage is reached, the 32 least significant bits of each time stamp are concatenated. When 1024/32 = 32 such events are received, they are injected into the per-CPU entropy pool. The LRNG allows the DRNG mechanism to be changed at runtime. By default, a ChaCha20-based DRNG is used. The ChaCha20-DRNG implemented for the LRNG is also provided as a stand-alone user space deterministic random number generator. The LRNG also offers an SP800-90A DRBG based on the Linux kernel crypto API DRBG implementation. The processing of entropic data from the noise source before injecting them into the DRNG is performed with the following mathematical operations: 1. 
Truncation: The received time stamps divided by the GCD are truncated to 8 least significant bits (or 32 least significant bits during boot time) 2. Concatenation: The received and truncated time stamps as well as auxiliary 32 bit words are concatenated to fill the per-CPU data array that is capable of holding 64 8-bit words. 3. Hashing: A set of concatenated time stamp data received from the interrupts is hashed together with the current existing per-CPU entropy pool state. The resulting message digest is the new per-CPU entropy pool state. 4. Hashing: When new data is added to the auxiliary pool, the data is hashed together with the auxiliary pool to form a new auxiliary pool state. 5. Hashing: A message digest of all per-CPU entropy pools and the auxiliary pool is calculated, which forms the new auxiliary pool state. 6. Truncation: The most-significant bits (MSB) defined by the requested number of bits (commonly equal to the security strength of the DRBG) or the entropy available transported with the buffer (which is the minimum of the message digest size and the available entropy in all entropy pools and the auxiliary pool), whichever is smaller, are obtained from the slow noise source output buffer. 7. Concatenation: The temporary seed buffer used to seed the DRNG is a concatenation of the slow noise source buffer, the Jitter RNG output, the CPU noise source output, and the current time. The DRNG always tries to seed itself with 256 bits of entropy, except during boot. In any case, if the noise sources cannot deliver that amount, the available entropy is used and the DRNG keeps track of how much entropy it was seeded with. The entropy estimate implied by the LRNG for the data available in the entropy pool may be too conservative. To ensure that during boot time all available entropy from the entropy pool is transferred to the DRNG, the hash_df function always generates 256 data bits during boot to seed the DRNG. During boot, the DRNG is seeded as follows: 1. 
The DRNG is reseeded from the entropy sources if the entropy sources collectively have at least 32 bits of entropy. The goal of this step is to ensure that the DRNG receives some initial entropy as early as possible. 2. The DRNG is reseeded from the entropy sources if all entropy sources collectively can provide at least 128 bits of entropy. 3. The DRNG is reseeded from the entropy sources if all entropy sources collectively can provide at least 256 bits. At the time of the reseeding steps, the DRNG requests as much entropy as is available in order to skip certain steps and reach the seeding level of 256 bits. This may imply that one or more of the aforementioned steps are skipped. Before the DRNG is seeded with 256 bits of entropy in step 3, requests of random data from /dev/random and the getrandom system call are not processed. The reseeding of the DRNG always ensures that all entropy sources collectively can deliver at least 128 entropy bits during runtime once the DRNG is fully seeded. The DRNG operates as deterministic random number generator with the following properties: * The maximum number of random bytes that can be generated with one DRNG generate operation is limited to 4096 bytes. When longer random numbers are requested, multiple DRNG generate operations are performed. The ChaCha20 DRNG as well as the SP800-90A DRBGs implement an update of their state after completing a generate request for backtracking resistance. 
* The DRNG is reseeded with whatever entropy is available - in the worst case where no additional entropy can be provided by the entropy sources, the DRNG is not re-seeded and continues its operation, trying to reseed again after the expiry of one of these thresholds: - If the last reseeding of the DRNG is more than 600 seconds ago, or - 2^20 DRNG generate operations are performed, whichever comes first, or - the DRNG is forced to reseed before the next generation of random numbers if data has been injected into the LRNG by writing data into /dev/random or /dev/urandom. - If the DRNG was not successfully reseeded after 2^30 generate requests, the DRNG reverts back to an unseeded stage, implying that the blocking interfaces of /dev/random and getrandom will block again. The chosen values prevent high-volume requests from user space from causing frequent reseeding operations which would drag down the performance of the DRNG. With the automatic reseeding after 600 seconds, the LRNG is triggered to reseed itself before the first request after a suspend that put the hardware to sleep for longer than 600 seconds. To support smaller devices including IoT environments, this patch allows reducing the runtime memory footprint of the LRNG at compile time by selecting smaller collection data sizes. When compiling a kernel for a small environment, the allocation of a buffer of up to 4096 bytes to serve user space requests is prevented. In this case, the stack variable of 64 bytes is used to serve all user space requests. The LRNG has the following properties: * internal noise source: interrupt timing with fast boot time seeding * high performance of interrupt handling code: The LRNG impact on the interrupt handling has been reduced to a minimum. 
On one example system, the LRNG interrupt handling code in its fastest configuration executes within an average 55 cycles whereas the existing /dev/random on the same device takes about 97 cycles when measuring the execution time of add_interrupt_randomness(). * use of almost never contended lock for hashing operation to collect raw entropy supporting concurrency-free use of massive parallel systems - worst case rate of contention is the number of DRNG reseeds, usually: number of NUMA nodes contentions per 5 minutes. * use of standalone ChaCha20 based RNG with the option to use a different DRNG selectable at compile time * instantiate one DRNG per NUMA node * support for runtime switchable output DRNGs * use of runtime-switchable hash for conditioning implementation following widely accepted approach * compile-time selectable collection size * support of small systems by allowing the reduction of the runtime memory needs Further details including the rationale for the design choices and properties of the LRNG together with testing is provided at [1]. In addition, the documentation explains the conducted regression tests to verify that the LRNG is API and ABI compatible with the existing /dev/random implementation. Note, this patch covers the entropy sources manager, the API implementation, the built-in ChaCha20 DRNG and the auxiliary entropy pool. [1] https://www.chronox.de/lrng.html CC: Torsten Duwe CC: "Eric W. Biederman" CC: "Alexander E. Patrakov" CC: "Ahmed S. Darwish" CC: "Theodore Y. 
Ts'o" CC: Willy Tarreau CC: Matthew Garrett CC: Vito Caputo CC: Andreas Dilger CC: Jan Kara CC: Ray Strode CC: William Jon McCann CC: zhangjs CC: Andy Lutomirski CC: Florian Weimer CC: Lennart Poettering CC: Nicolai Stange Reviewed-by: Alexander Lobakin Tested-by: Alexander Lobakin Mathematical aspects Reviewed-by: "Peter, Matthias" Reviewed-by: Marcelo Henrique Cerri Reviewed-by: Roman Drahtmueller Tested-by: Marcelo Henrique Cerri Tested-by: Neil Horman Tested-by: Jirka Hladky Reviewed-by: Jirka Hladky Signed-off-by: Stephan Mueller commit 7aa708cfb7fb0e3562fa9a15216b6675f4762618 Author: Alexey Avramov Date: Sat Nov 13 10:42:27 2021 +0900 mm/vmscan: add sysctl knobs for protecting the working set The kernel does not provide a way to protect the working set under memory pressure. A certain amount of anonymous and clean file pages is required by the userspace for normal operation. First of all, the userspace needs a cache of shared libraries and executable binaries. If the amount of the clean file pages falls below a certain level, then thrashing and even livelock can take place. The patch provides sysctl knobs for protecting the working set (anonymous and clean file pages) under memory pressure. The vm.anon_min_kbytes sysctl knob provides *hard* protection of anonymous pages. The anonymous pages on the current node won't be reclaimed under any conditions when their amount is below vm.anon_min_kbytes. This knob may be used to prevent excessive swap thrashing when anonymous memory is low (for example, when memory is going to be overfilled by compressed data of zram module). The default value is defined by CONFIG_ANON_MIN_KBYTES (suggested 0 in Kconfig). The vm.clean_low_kbytes sysctl knob provides *best-effort* protection of clean file pages. The file pages on the current node won't be reclaimed under memory pressure when the amount of clean file pages is below vm.clean_low_kbytes *unless* we threaten to OOM. 
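Together with the vm.clean_min_kbytes hard limit described below, the knobs reduce to a simple predicate at reclaim time. A userspace sketch under the assumption that near-OOM detection is available as a flag; names are illustrative, not the patch's internals:

```c
#include <stdbool.h>

/* Two protection levels for clean file pages: below clean_min_kbytes
 * reclaim is always skipped (hard protection); below clean_low_kbytes
 * it is skipped unless we threaten to OOM (best-effort protection). */
static bool may_reclaim_clean_file(unsigned long clean_kbytes,
				   unsigned long clean_low_kbytes,
				   unsigned long clean_min_kbytes,
				   bool near_oom)
{
	if (clean_kbytes < clean_min_kbytes)
		return false;	/* hard protection */
	if (clean_kbytes < clean_low_kbytes && !near_oom)
		return false;	/* best-effort protection */
	return true;
}
```

The vm.anon_min_kbytes knob corresponds to the hard branch alone, applied to anonymous pages.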
Protection of clean file pages using this knob may be used when swapping is still possible to - prevent disk I/O thrashing under memory pressure; - improve performance in disk cache-bound tasks under memory pressure. The default value is defined by CONFIG_CLEAN_LOW_KBYTES (suggested 0 in Kconfig). The vm.clean_min_kbytes sysctl knob provides *hard* protection of clean file pages. The file pages on the current node won't be reclaimed under memory pressure when the amount of clean file pages is below vm.clean_min_kbytes. Hard protection of clean file pages using this knob may be used to - prevent disk I/O thrashing under memory pressure even with no free swap space; - improve performance in disk cache-bound tasks under memory pressure; - avoid high latency and prevent livelock in near-OOM conditions. The default value is defined by CONFIG_CLEAN_MIN_KBYTES (suggested 0 in Kconfig). Signed-off-by: Alexey Avramov commit 1a60a347574e7dcff7d49302fb93af4eec0ad4a6 Author: Yu Zhao Date: Tue Jan 4 13:22:28 2022 -0700 mm: multigenerational lru: Kconfig Add configuration options for the multigenerational lru. Signed-off-by: Yu Zhao Tested-by: Konstantin Kharlamov commit 1bec6b47268d90de7299ff36f386ef344c7154d0 Author: Yu Zhao Date: Tue Jan 4 13:22:27 2022 -0700 mm: multigenerational lru: user interface Add /sys/kernel/mm/lru_gen/enabled as a runtime kill switch. Add /sys/kernel/mm/lru_gen/min_ttl_ms for thrashing prevention. Compared with the size-based approach, e.g., [1], this time-based approach has the following advantages: 1) It's easier to configure because it's agnostic to applications and memory sizes. 2) It's more reliable because it's directly wired to the OOM killer. Add /sys/kernel/debug/lru_gen for working set estimation and proactive reclaim. 
Compared with the page table-based approach and the PFN-based approach, e.g., mm/damon/[vp]addr.c, this lruvec-based approach has the following advantages: 1) It offers better choices because it's aware of memcgs, NUMA nodes, shared mappings and unmapped page cache. 2) It's more scalable because it's O(nr_hot_evictable_pages), whereas the PFN-based approach is O(nr_total_pages). Add /sys/kernel/debug/lru_gen_full for debugging. [1] https://lore.kernel.org/lkml/20211130201652.2218636d@mail.inbox.lv/ Signed-off-by: Yu Zhao Tested-by: Konstantin Kharlamov commit 811d1a9d6a1421021dc083f7aa8a7122167a76df Author: Yu Zhao Date: Tue Jan 4 13:22:26 2022 -0700 mm: multigenerational lru: eviction The eviction consumes old generations. Given an lruvec, it scans pages on lrugen->lists[] indexed by min_seq%MAX_NR_GENS. A feedback loop modeled after the PID controller monitors refaults over anon and file types and decides which type to evict when both are available from the same generation. Each generation is divided into multiple tiers. Tiers represent different ranges of numbers of accesses thru file descriptors. A page accessed N times thru file descriptors is in tier order_base_2(N). The feedback loop also monitors refaults over all tiers and decides when to promote pages in which tiers (N>1), using the first tier (N=0,1) as a baseline. The eviction sorts a page according to the gen counter if the aging has found this page accessed thru page tables, which completes the promotion of this page. The eviction also promotes a page to the next generation (min_seq+1 rather than max_seq) if this page was accessed multiple times thru file descriptors and the feedback loop has detected higher refaults from the tier this page is in. 
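The tier index order_base_2(N) grows logarithmically with the access count, so N = 0 and N = 1 share the first tier. A userspace sketch of the computation (the kernel provides order_base_2() as a macro in include/linux/log2.h):

```c
/* log2 rounded up: a page accessed n times thru file descriptors sits
 * in tier order_base_2(n); n = 0 and n = 1 both map to tier 0. */
static unsigned int order_base_2(unsigned long n)
{
	unsigned int order = 0;
	unsigned long pow = 1;

	while (pow < n) {
		pow <<= 1;
		order++;
	}
	return order;
}
```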
This approach has the following advantages: 1) It removes the cost of activation (recall the terms) in the buffered access path by inferring whether pages accessed multiple times thru file descriptors are statistically hot and thus worth promoting in the eviction path. 2) It takes pages accessed thru page tables into account and avoids overprotecting pages accessed multiple times thru file descriptors. 3) More tiers, which require additional bits in folio->flags, provide better protection for pages accessed more than twice thru file descriptors, when under heavy buffered I/O workloads. The eviction increments min_seq when lrugen->lists[] indexed by min_seq%MAX_NR_GENS is empty. Signed-off-by: Yu Zhao Tested-by: Konstantin Kharlamov commit 22bfc0019ac38a0967faf35bf9939387d9ba47d3 Author: Yu Zhao Date: Tue Jan 4 13:22:25 2022 -0700 mm: multigenerational lru: aging To avoid confusion, the term "scan" will be applied to PTEs in a page table and pages on an lru list. It emphasizes consecutive elements in a set rather than the data structure holding this set together. The aging produces young generations. Given an lruvec, it iterates lruvec_memcg()->mm_list and calls walk_page_range() with each mm_struct on this list to scan PTEs for accessed pages. On finding a young PTE, it clears the accessed bit and updates the gen counter of the page mapped by this PTE to (max_seq%MAX_NR_GENS)+1. After each iteration of this list, it increments max_seq. The aging is needed before the eviction can continue when max_seq-min_seq+1 reaches MIN_NR_GENS. To avoid confusion, the terms "promotion" and "demotion" will be applied to the multigenerational lru, as a new convention; the terms "activation" and "deactivation" will be applied to the active/inactive lru, as usual. IOW, the aging promotes a page to the youngest generation when it finds this page accessed thru page tables; demotion happens consequently when it creates a new generation. 
Note that promotion doesn't require any lru list operations in the aging path, only the update of the gen counter and the lru sizes; demotion, unless as the result of the creation of a new generation, requires lru list operations, e.g., lru_deactivate_fn(). The aging uses the following optimizations when walking page tables: 1) It uses the accessed bit in non-leaf PMD entries, the hint from the CPU scheduler and the Bloom filters to reduce its search space. 2) It doesn't zigzag between a PGD table and the same PMD or PTE table spanning multiple VMAs. In other words, it finishes all the VMAs within the range of the same PMD or PTE table before it returns to a PGD table. This improves the cache performance for workloads that have large numbers of tiny VMAs, especially when CONFIG_PGTABLE_LEVELS=5. The aging is only interested in accessed pages and therefore has the complexity of O(nr_hot_evictable_pages). The worst case scenario is that the aging fails to exploit any spatial locality and the eviction has to promote all accessed pages when walking the rmap, which is similar to the active/inactive lru. However, generations can still provide better temporal locality. Signed-off-by: Yu Zhao Tested-by: Konstantin Kharlamov commit 4dccd52500664d2a404eb3ba3bfe158b2a2a92b5 Author: Yu Zhao Date: Tue Jan 4 13:22:24 2022 -0700 mm: multigenerational lru: mm_struct list To exploit spatial locality, the aging prefers to walk page tables to search for young PTEs. And this patch paves the way for that. An mm_struct list is maintained for each memcg, and an mm_struct follows its owner task to the new memcg when this task is migrated. To avoid confusion, the term "iteration" specifically means the traversal of an entire mm_struct list; the term "walk" will be applied to page tables and the rmap, as usual. A page table walker, i.e., a thread in the aging path, iterates an mm_struct list and calls walk_page_range() with each mm_struct on this list. 
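The Bloom filters used to prune the aging's search space (optimization 1 above) record populated branches so that empty ones can be skipped; a set-membership test may report a false positive but never a false negative. A minimal two-hash sketch; the hash functions are arbitrary placeholders, not the kernel's:

```c
#include <stdbool.h>
#include <stdint.h>

#define BLOOM_BITS 1024u

struct bloom {
	uint8_t bits[BLOOM_BITS / 8];
};

/* Two cheap hashes of a branch identifier (e.g. a table address). */
static void bloom_hash(uint64_t key, uint32_t *h1, uint32_t *h2)
{
	*h1 = (uint32_t)(key % BLOOM_BITS);
	*h2 = (uint32_t)((key * 2654435761u) % BLOOM_BITS);
}

/* Record a populated branch. */
static void bloom_set(struct bloom *b, uint64_t key)
{
	uint32_t h1, h2;

	bloom_hash(key, &h1, &h2);
	b->bits[h1 / 8] |= 1u << (h1 % 8);
	b->bits[h2 / 8] |= 1u << (h2 % 8);
}

/* False positives possible, false negatives not: an unset key is
 * guaranteed empty, so the walker can safely skip it. */
static bool bloom_test(const struct bloom *b, uint64_t key)
{
	uint32_t h1, h2;

	bloom_hash(key, &h1, &h2);
	return (b->bits[h1 / 8] & (1u << (h1 % 8))) &&
	       (b->bits[h2 / 8] & (1u << (h2 % 8)));
}
```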
The iteration finishes when it reaches the end of this list. When multiple page table walkers iterate the same list, each of them gets a unique mm_struct; therefore the aging can run concurrently.

This infra also provides the following optimizations:

1) It tracks the usage of mm_struct's between context switches so that page table walkers may skip processes that have been sleeping since the last iteration.
2) It provides generational Bloom filters to record populated branches so that page table walkers may reduce their search space based on the query results.

Signed-off-by: Yu Zhao
Tested-by: Konstantin Kharlamov

commit e45699af41cd0356c46135107969c2ba308e4cb2
Author: Yu Zhao
Date: Tue Jan 4 13:22:23 2022 -0700

mm: multigenerational lru: groundwork

Evictable pages are divided into multiple generations for each lruvec. The youngest generation number is stored in lrugen->max_seq for both anon and file types as they're aged on an equal footing. The oldest generation numbers are stored in lrugen->min_seq[] separately for anon and file types as clean file pages may be evicted regardless of swap constraints. These three variables are monotonically increasing.

Generation numbers are truncated into order_base_2(MAX_NR_GENS+1) bits in order to fit into the gen counter in folio->flags. Each truncated generation number is an index to lrugen->lists[]. The sliding window technique is used to track at least MIN_NR_GENS and at most MAX_NR_GENS generations.

There are two conceptually independent processes (as in manufacturing process): "the aging", which produces young generations, and "the eviction", which consumes old generations. They form a closed-loop system, i.e., "the page reclaim". Both processes can be triggered separately from userspace for the purposes of working set estimation and proactive reclaim. These features are required to optimize job scheduling in data centers. The variable size of the sliding window is designed for such use cases.
To avoid confusion, the terms "hot" and "cold" will be applied to the multigenerational lru, as a new convention; the terms "active" and "inactive" will be applied to the active/inactive lru, as usual.

The protection of hot pages and the selection of cold pages are based on page access channels and patterns. There are two access channels: one thru page tables and the other thru file descriptors. The protection of the former channel is by design stronger because:

1) The uncertainty in determining the access patterns of the former channel is higher due to the approximation of the accessed bit.
2) The cost of evicting the former channel is higher due to the TLB flushes required and the likelihood of encountering the dirty bit.
3) The penalty of underprotecting the former channel is higher because applications usually don't prepare themselves for major faults like they do for blocked I/O. For example, GUI applications commonly use dedicated I/O threads to avoid blocking the rendering threads.

There are also two access patterns: one with temporal locality and the other without. For the reasons listed above, the former channel is assumed to follow the former pattern unless VM_SEQ_READ or VM_RAND_READ is present, and the latter channel is assumed to follow the latter pattern unless outlying refaults have been observed. The "outlying refaults" will be addressed in [PATCH 07/10]. A few macros, i.e., LRU_REFS_*, used in that patch are added in this one to make the patchset less diffy.

A page is added to the youngest generation on faulting. The aging needs to check the accessed bit at least twice before handing this page over to the eviction. The first check takes care of the accessed bit set on the initial fault; the second check makes sure this page hasn't been used since then. This process, AKA second chance, requires a minimum of two generations, hence MIN_NR_GENS.
Signed-off-by: Yu Zhao
Tested-by: Konstantin Kharlamov

commit 22380a132a26c6e67f5a9a7a615fc9c339e9c18e
Author: Yu Zhao
Date: Tue Jan 4 13:22:22 2022 -0700

mm/vmscan.c: refactor shrink_node()

This patch refactors shrink_node() to improve readability for the upcoming changes to mm/vmscan.c.

Signed-off-by: Yu Zhao
Tested-by: Konstantin Kharlamov

commit 1d3d37788232f0cad7e5f46d1eae6cce28be4dd2
Author: Yu Zhao
Date: Tue Jan 4 13:22:21 2022 -0700

mm: x86: add CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG

Some architectures support the accessed bit in non-leaf PMD entries, e.g., x86_64 sets the accessed bit in a non-leaf PMD entry when using it as part of linear address translation [1]. Page table walkers that clear the accessed bit may use this feature to reduce their search space.

Although an inline function is preferable, this capability is added as a configuration option for consistency with the existing macros.

[1]: Intel 64 and IA-32 Architectures Software Developer's Manual Volume 3 (June 2021), section 4.8

Signed-off-by: Yu Zhao
Tested-by: Konstantin Kharlamov

commit f6abb14a93eef78e871fca379613c56702933f3f
Author: Yu Zhao
Date: Tue Jan 4 13:22:20 2022 -0700

mm: x86, arm64: add arch_has_hw_pte_young()

Some architectures automatically set the accessed bit in PTEs, e.g., x86 and arm64 v8.2. On architectures that don't have this capability, clearing the accessed bit in a PTE usually triggers a page fault following the TLB miss of this PTE.

Being aware of this capability can help make better decisions, e.g., whether to spread the work out over a period of time to avoid bursty page faults when trying to clear the accessed bit in a large number of PTEs.
Signed-off-by: Yu Zhao
Tested-by: Konstantin Kharlamov

commit 49f3adf9ef6a791bafcfaa6f01d2c6d9a27a9c3f
Author: André Almeida
Date: Mon Oct 25 09:49:42 2021 -0300

futex: Add entry point for FUTEX_WAIT_MULTIPLE (opcode 31)

Add an option to wait on multiple futexes using the old interface, which uses opcode 31 through the futex() syscall. Do that by simply translating the old interface to use the new code. This allows old and stable versions of Proton to still use fsync in new kernel releases.

Signed-off-by: André Almeida

commit 932fce1ed0596142382e78b221f4b965b705fc90
Author: Alexandre Frade
Date: Fri Jun 18 19:10:55 2021 +0000

XANMOD: Makefile: Turn off loop vectorization for GCC -O3 optimization level

Signed-off-by: Alexandre Frade

commit eb4a18e0cd72d65a9fe0afe3edcfd12be5fc7a0e
Author: Alexandre Frade
Date: Thu Sep 3 20:36:13 2020 +0000

XANMOD: init/Kconfig: Enable -O3 KBUILD_CFLAGS optimization for all architectures

Signed-off-by: Alexandre Frade

commit a2614c7e2a54e076363db8827c12b80db92c1f9d
Author: Alexandre Frade
Date: Thu Jun 25 16:40:43 2020 -0300

XANMOD: lib/kconfig.debug: disable default CONFIG_SYMBOLIC_ERRNAME and CONFIG_DEBUG_BUGVERBOSE

Signed-off-by: Alexandre Frade

commit 64901729773bf5581cc06878fb5abf8b2759ffd1
Author: Alexandre Frade
Date: Mon Jan 29 17:41:29 2018 +0000

XANMOD: scripts: disable the localversion "+" tag of a git repo

Signed-off-by: Alexandre Frade

commit 4752faf9f1e2437533722dcb60529d2670678646
Author: Alexandre Frade
Date: Tue Mar 31 13:32:08 2020 -0300

XANMOD: cpufreq: tunes ondemand and conservative governor for performance

Signed-off-by: Alexandre Frade

commit 2bd5fa59c26bfaa201a2b250cc70b65e30179b20
Author: Alexandre Frade
Date: Mon Jan 29 17:31:25 2018 +0000

XANMOD: mm/vmscan: vm_swappiness = 30 decreases the amount of swapping

Signed-off-by: Alexandre Frade

commit bb174dfdfa0fd21b14e8c4e7662f692c5b5e19c1
Author: Alexandre Frade
Date: Thu Aug 13 14:57:06 2020 +0000

XANMOD: sched/autogroup: Add kernel parameter and config option to enable/disable autogroup feature by default

Signed-off-by: Alexandre Frade

commit ee7dbc58b4fbb1a393c7f375744c3b91a875a3d6
Author: Alexandre Frade
Date: Mon Jan 29 16:59:22 2018 +0000

XANMOD: dcache: cache_pressure = 50 decreases the rate at which VFS caches are reclaimed

Signed-off-by: Alexandre Frade

commit f0b6b5339416631b845e3495c5625c83e9ed81ec
Author: Alexandre Frade
Date: Mon Jan 29 17:26:15 2018 +0000

XANMOD: kconfig: add 500Hz timer interrupt kernel config option

Signed-off-by: Alexandre Frade

commit a7f3f16374100d19b8e89768b97e421b74098f1d
Author: Alexandre Frade
Date: Mon Dec 14 16:24:26 2020 +0000

XANMOD: block: set rq_affinity to force full multithreading I/O requests

Signed-off-by: Alexandre Frade

commit 641a5091ac6fd1c08c097d94b3fcadb73477f3d2
Author: Alexandre Frade
Date: Thu Jan 6 16:59:01 2022 +0000

XANMOD: block/mq-deadline: Disable front_merges by default

Signed-off-by: Alexandre Frade

commit 05bdfcdb4122ff03c18c3f3e9ba5c59684484ef8
Author: Alexandre Frade
Date: Sun Aug 29 23:58:33 2021 +0000

XANMOD: fair: Remove all energy efficiency functions

Signed-off-by: Alexandre Frade