commit 48d2efc7eb86fba378aa7e11b82e7f3a22ef1d2e
Author: Alexandre Frade
Date:   Mon Mar 21 21:49:46 2022 +0000

    Linux 5.17.0-xanmod1

    Signed-off-by: Alexandre Frade

commit b8aeacd018bbe3cb933ab4ee6b964eca52e9bdfc
Author: Alexandre Frade
Date:   Mon Mar 21 21:37:19 2022 +0000

    i2c: busses: Add SMBus capability to work with OpenRGB driver control

    Signed-off-by: Alexandre Frade

commit d10c308637d499546fe79d078aa6e22b5b1a18c9
Author: Mark Weiman
Date:   Sun Aug 12 11:36:21 2018 -0400

    pci: Enable overrides for missing ACS capabilities

    This is an updated version of Alex Williamson's patch from:
    https://lkml.org/lkml/2013/5/30/513

    Original commit message follows:

    PCIe ACS (Access Control Services) is the PCIe 2.0+ feature that
    allows us to control whether transactions are allowed to be redirected
    in various subnodes of a PCIe topology. For instance, if two endpoints
    are below a root port or downstream switch port, the downstream port
    may optionally redirect transactions between the devices, bypassing
    upstream devices. The same can happen internally on multifunction
    devices. The transaction may never be visible to the upstream devices.

    One upstream device that we particularly care about is the IOMMU. If
    a redirection occurs in the topology below the IOMMU, then the IOMMU
    cannot provide isolation between devices. This is why the PCIe spec
    encourages topologies to include ACS support. Without it, we have to
    assume peer-to-peer DMA within a hierarchy can bypass IOMMU isolation.

    Unfortunately, far too many topologies do not support ACS to make this
    a steadfast requirement. Even the latest chipsets from Intel only
    sporadically support ACS. We have trouble getting interconnect vendors
    to include the PCIe-spec-required PCIe capability, let alone suggested
    features.

    Therefore, we need to add some flexibility. The pcie_acs_override=
    boot option lets users opt-in specific devices or sets of devices to
    assume ACS support. The "downstream" option assumes full ACS support
    on root ports and downstream switch ports. The "multifunction" option
    assumes the subset of ACS features available on multifunction
    endpoints and upstream switch ports are supported. The "id:nnnn:nnnn"
    option enables ACS support on devices matching the provided vendor
    and device IDs, allowing more strategic ACS overrides. These options
    may be combined in any order. A maximum of 16 id-specific overrides
    are available. It's suggested to use the most limited set of options
    necessary to avoid completely disabling ACS across the topology.

    Note to hardware vendors: we have facilities to permanently quirk
    specific devices which enforce isolation but do not provide an ACS
    capability. Please contact me to have your devices added and save
    your customers the hassle of this boot option.

    Rebased-by: Alexandre Frade
    Signed-off-by: Mark Weiman
    Signed-off-by: Alexandre Frade
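For illustration, a minimal sketch of how these options are typically combined
on the kernel command line; the vendor:device ID below is a hypothetical
placeholder, not taken from the patch:

    # /etc/default/grub -- assume ACS on downstream switch ports and
    # multifunction endpoints, plus one specific (hypothetical) device
    GRUB_CMDLINE_LINUX_DEFAULT="quiet pcie_acs_override=downstream,multifunction,id:8086:1234"
    # regenerate the bootloader config and reboot, e.g.:
    update-grub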
commit 4ce3c1e32d42003132bdffc327911684cee38e0d
Author: graysky
Date:   Tue Mar 15 05:58:43 2022 -0400

    x86/kconfig: more uarches for kernel 5.17+

    FEATURES
    This patch adds additional CPU options to the Linux kernel accessible
    under:
      Processor type and features --->
        Processor family --->

    With the release of gcc 11.1 and clang 12.0, several generic 64-bit
    levels are offered which are good for supported Intel or AMD CPUs:
    • x86-64-v2
    • x86-64-v3
    • x86-64-v4

    Users of glibc 2.33 and above can see which level is supported by
    current hardware by running:
      /lib/ld-linux-x86-64.so.2 --help | grep supported
    Alternatively, compare the flags from /proc/cpuinfo to this list.[1]

    CPU-specific microarchitectures include:
    • AMD Improved K8-family
    • AMD K10-family
    • AMD Family 10h (Barcelona)
    • AMD Family 14h (Bobcat)
    • AMD Family 16h (Jaguar)
    • AMD Family 15h (Bulldozer)
    • AMD Family 15h (Piledriver)
    • AMD Family 15h (Steamroller)
    • AMD Family 15h (Excavator)
    • AMD Family 17h (Zen)
    • AMD Family 17h (Zen 2)
    • AMD Family 19h (Zen 3)†
    • Intel Silvermont low-power processors
    • Intel Goldmont low-power processors (Apollo Lake and Denverton)
    • Intel Goldmont Plus low-power processors (Gemini Lake)
    • Intel 1st Gen Core i3/i5/i7 (Nehalem)
    • Intel 1.5 Gen Core i3/i5/i7 (Westmere)
    • Intel 2nd Gen Core i3/i5/i7 (Sandybridge)
    • Intel 3rd Gen Core i3/i5/i7 (Ivybridge)
    • Intel 4th Gen Core i3/i5/i7 (Haswell)
    • Intel 5th Gen Core i3/i5/i7 (Broadwell)
    • Intel 6th Gen Core i3/i5/i7 (Skylake)
    • Intel 6th Gen Core i7/i9 (Skylake X)
    • Intel 8th Gen Core i3/i5/i7 (Cannon Lake)
    • Intel 10th Gen Core i7/i9 (Ice Lake)
    • Intel Xeon (Cascade Lake)
    • Intel Xeon (Cooper Lake)*
    • Intel 3rd Gen 10nm++ i3/i5/i7/i9-family (Tiger Lake)*
    • Intel 3rd Gen 10nm++ Xeon (Sapphire Rapids)‡
    • Intel 11th Gen i3/i5/i7/i9-family (Rocket Lake)‡
    • Intel 12th Gen i3/i5/i7/i9-family (Alder Lake)‡

    Notes: If not otherwise noted, gcc >=9.1 is required for support.
    *Requires gcc >=10.1 or clang >=10.0
    †Requires gcc >=10.3 or clang >=12.0
    ‡Requires gcc >=11.1 or clang >=12.0

    It also offers to compile passing the 'native' option, which "selects
    the CPU to generate code for at compilation time by determining the
    processor type of the compiling machine. Using -march=native enables
    all instruction subsets supported by the local machine and will
    produce code optimized for the local machine under the constraints of
    the selected instruction set."[2]

    Users of Intel CPUs should select the 'Intel-Native' option and users
    of AMD CPUs should select the 'AMD-Native' option.

    MINOR NOTES RELATING TO INTEL ATOM PROCESSORS
    This patch also changes -march=atom to -march=bonnell in accordance
    with the gcc v4.9 changes. Upstream is using the deprecated
    -march=atom flag when I believe it should use the newer
    -march=bonnell flag for Atom processors.[3]

    It is not recommended to compile on Atom CPUs with the 'native'
    option.[4] The recommendation is to use the 'atom' option instead.

    BENEFITS
    Small but real speed increases are measurable using a make endpoint
    comparing a generic kernel to one built with one of the respective
    microarchs. See the following experimental evidence supporting this
    statement: https://github.com/graysky2/kernel_gcc_patch

    REQUIREMENTS
    linux version 5.17+
    gcc version >=9.0 or clang version >=9.0

    ACKNOWLEDGMENTS
    This patch builds on the seminal work by Jeroen.[5]

    REFERENCES
    1. https://gitlab.com/x86-psABIs/x86-64-ABI/-/commit/77566eb03bc6a326811cb7e9
    2. https://gcc.gnu.org/onlinedocs/gcc/x86-Options.html#index-x86-Options
    3. https://bugzilla.kernel.org/show_bug.cgi?id=77461
    4. https://github.com/graysky2/kernel_gcc_patch/issues/15
    5. http://www.linuxforge.net/docs/linux/linux-gcc.php

    Signed-off-by: graysky
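Before selecting one of the generic x86-64-v[234] options above, the supported
level can be checked as the commit message suggests; a short sketch of the
check on a glibc >=2.33 system (the sample output line is illustrative):

    # lines marked "supported" indicate levels this CPU can run
    /lib/ld-linux-x86-64.so.2 --help | grep supported
    #   e.g. "x86-64-v3 (supported, searched)" means the x86-64-v3
    #   kernel option is safe to pick for this machine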
commit 717f5ebd240f6f15f037f7ece05863c3ef37ddf3
Author: Arjan van de Ven
Date:   Wed May 17 01:52:11 2017 +0000

    init: wait for partition and retry scan

    As Clear Linux boots fast, the device is not ready when the mounting
    code is reached, so a retry device scan will be performed every
    0.5 sec for at least 40 sec to synchronize with the async task.

    Signed-off-by: Miguel Bernal Marin

commit 488b2e0a6e5849f3c14cfb0f2d039742234559b2
Author: Arjan van de Ven
Date:   Thu Jun 2 23:36:32 2016 -0500

    drivers: initialize ata before graphics

    ATA init is the long pole in the boot process, and it's asynchronous.
    Move the graphics init after it so that ata and graphics initialize
    in parallel.

commit 47de15b3d478548b8d5251e2ae20a73e74afd209
Author: Arjan van de Ven
Date:   Sun Feb 18 23:35:41 2018 +0000

    locking: rwsem: spin faster

    Tweak rwsem owner spinning a bit.

    Signed-off-by: Alexandre Frade

commit dfe005a23744ef20af557e31862b754ec11ee811
Author: William Douglas
Date:   Wed Jun 20 17:23:21 2018 +0000

    firmware: Enable stateless firmware loading

    Prefer the order of specific version before generic and /etc before
    /lib to enable the user to give specific overrides for generic
    firmware and distribution firmware.

commit 1b40023471510546ff3f63b636408e6226d04d2f
Author: Arjan van de Ven
Date:   Sun Sep 22 11:12:35 2019 -0300

    intel_rapl: Silence rapl trace debug

commit 66fb91d3e3dcce5e12fd21f2e17594cf715f09ca
Author: Serge Hallyn
Date:   Fri May 31 19:12:12 2013 +0100

    sysctl: add sysctl to disallow unprivileged CLONE_NEWUSER by default

    This is a short-term patch. Unprivileged use of CLONE_NEWUSER is
    certainly an intended feature of user namespaces. However, for at
    least saucy we want to make sure that, if any security issues are
    found, we have a fail-safe.

    Signed-off-by: Serge Hallyn
    [bwh: Remove unneeded binary sysctl bits]
    [bwh: Keep this sysctl, but change the default to enabled]
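A minimal usage sketch for the knob this commit adds; the sysctl name
kernel.unprivileged_userns_clone is how this patch is commonly carried in
Debian/Ubuntu-lineage kernels and is assumed here:

    # check the current setting (1 = unprivileged CLONE_NEWUSER allowed,
    # the default chosen above)
    sysctl kernel.unprivileged_userns_clone
    # fail-safe: disallow unprivileged user namespaces
    sysctl -w kernel.unprivileged_userns_clone=0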
commit 2e62c61b100c45d10d71ff4c696159105b4f79dc
Author: Zebediah Figura
Date:   Thu Nov 4 17:15:38 2021 +0000

    winesync: Introduce the winesync driver and character device

    Rebased-by: Tk-Glitch
    Rebased-by: Alexandre Frade
    Signed-off-by: Alexandre Frade

commit 224a666172b27edcf3c39756cfc81e4d12365eb0
Author: Christian Brauner
Date:   Wed Jan 23 21:54:23 2019 +0100

    SAUCE: binder: give binder_alloc its own debug mask file

    Currently binder.c and binder_alloc.c both register the
    /sys/module/binder_linux/parameters/debug_mask file, which leads to
    conflicts in sysfs. This commit gives binder_alloc.c its own
    /sys/module/binder_linux/parameters/alloc_debug_mask file.

    Signed-off-by: Christian Brauner
    Signed-off-by: Seth Forshee

commit c8e92f84d84dbea6870d22557f2dbd14f3d7807a
Author: Christian Brauner
Date:   Wed Jan 16 23:13:25 2019 +0100

    SAUCE: binder: turn into module

    The Android binder driver needs to become a module for the sake of
    shipping Anbox. To do this we need to export the following functions
    since binder is currently still using them:

    - security_binder_set_context_mgr()
    - security_binder_transaction()
    - security_binder_transfer_binder()
    - security_binder_transfer_file()
    - can_nice()
    - __wake_up_pollfree()
    - __close_fd_get_file()
    - mmput_async()
    - task_work_add()
    - map_kernel_range_noflush()
    - get_vm_area()
    - zap_page_range()
    - put_ipc_ns()
    - get_ipc_ns_exported()
    - show_init_ipc_ns()

    Rebased-by: Alexandre Frade
    Signed-off-by: Christian Brauner
    [ saf: fix additional reference to init_ipc_ns from 5.0-rc6 ]
    Signed-off-by: Seth Forshee
    Signed-off-by: Alexandre Frade

commit 2032f85ad24639ca1463d6ac17e8e71a470f01c2
Author: Christian Brauner
Date:   Wed Jun 20 19:21:37 2018 +0200

    SAUCE: ashmem: turn into module

    The Android ashmem driver needs to become a module for the sake of
    Anbox. To do this we need to export shmem_zero_setup() since ashmem
    is currently using it.

    Note, the abomination that is the Android ashmem driver will go away
    in the not so distant future in favour of memfds.

    Signed-off-by: Christian Brauner
    Signed-off-by: Seth Forshee

commit 71f1b8113d60130167a3a0cf11a0fc37b7d08110
Author: Shuah Khan
Date:   Mon Mar 21 19:35:16 2022 +0000

    cpupower update for Linux 5.18-rc1

    This cpupower update for Linux 5.18-rc1 adds AMD P-State support to
    the cpupower tool. AMD P-State kernel support went into 5.17-rc1.

    ----------------------------------------------------------------

    Huang Rui (10):
      cpupower: Add AMD P-State capability flag
      cpupower: Add the function to check AMD P-State enabled
      cpupower: Initial AMD P-State capability
      cpupower: Add the function to get the sysfs value from specific table
      cpupower: Introduce ACPI CPPC library
      cpupower: Add AMD P-State sysfs definition and access helper
      cpupower: Enable boost state support for AMD P-State module
      cpupower: Move print_speed function into misc helper
      cpupower: Add function to print AMD P-State performance capabilities
      cpupower: Add "perf" option to print AMD P-State information

    Signed-off-by: Alexandre Frade

commit b036a92ff3a8a794a042280802bef2b06f13e59a
Author: Alexandre Frade
Date:   Wed Dec 8 11:55:28 2021 +0000

    netfilter: Add full cone NAT support

    Link: https://github.com/llccd/netfilter-full-cone-nat

    Signed-off-by: Alexandre Frade
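A hedged usage sketch for the full cone NAT support; the FULLCONENAT target
name and rule shape follow the linked out-of-tree project's documentation and
are assumptions, not part of this commit message:

    # masquerade-style full cone NAT on the WAN interface (names assumed)
    iptables -t nat -A POSTROUTING -o eth0 -j FULLCONENAT
    iptables -t nat -A PREROUTING  -i eth0 -j FULLCONENAT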
commit 4a26170e1e7de6ff0d2f032985008b265c73aadb
Author: Felix Fietkau
Date:   Sat Dec 5 15:07:03 2015 +0100

    mac80211: ignore AP power level when tx power type is "fixed"

    In some cases a user might want to connect to a far-away access
    point, which announces a low tx power limit. Using the AP's power
    limit can make the connection significantly more unstable or even
    impossible, and mac80211 currently provides no way to disable this
    behavior.

    To fix this, use the currently unused distinction between limited and
    fixed tx power to decide whether a remote AP's power limit should be
    accepted.

    Signed-off-by: Felix Fietkau

commit 19bf23d51b056b9f986de1aac14d6d62a9e5876e
Author: Adithya Abraham Philip
Date:   Fri Jun 11 21:56:10 2021 +0000

    net-tcp_bbr: v2: Fix missing ECT markings on retransmits for BBRv2

    Adds a new flag TCP_ECN_ECT_PERMANENT that is used by CCAs to
    indicate that retransmitted packets and pure ACKs must have the ECT
    bit set. This is a necessary fix for BBRv2, which when using ECN
    expects ECT to be set even on retransmitted packets and ACKs.

    Currently CCAs like BBRv2 which can use ECN but don't "need" it do
    not have a way to indicate that ECT should be set on
    retransmissions/ACKs.

    Signed-off-by: Adithya Abraham Philip
    Signed-off-by: Neal Cardwell

commit a46b12b194146f232ece4437c71a68d4ded9872e
Author: Neal Cardwell
Date:   Mon Dec 28 19:23:09 2020 -0500

    net-tcp_bbr: v2: don't assume prior_cwnd was set entering CA_Loss

    Fix WARN_ON_ONCE() warnings that were firing and pointing to a
    bbr->prior_cwnd of 0 when exiting CA_Loss and transitioning to
    CA_Open.

    The issue was that tcp_simple_retransmit() calls:

      tcp_set_ca_state(sk, TCP_CA_Loss);

    without first calling icsk_ca_ops->ssthresh(sk) (because
    tcp_simple_retransmit() is dealing with losses due to MTU issues and
    not congestion). The lack of this callback means that BBR did not get
    a chance to set bbr->prior_cwnd, and thus upon exiting CA_Loss in
    such cases the WARN_ON_ONCE() would fire due to a zero
    bbr->prior_cwnd.

    This commit removes that warning, since a bbr->prior_cwnd of 0 is a
    valid situation in this state transition.

    For setting inflight_lo upon entering CA_Loss, to avoid setting an
    inflight_lo of 0 in this case, this commit switches to taking the max
    of cwnd and prior_cwnd. We plan to remove that line of code when we
    switch to cautious (PRR-style) recovery, so that awkwardness will go
    away.

    Change-Id: I575dce871c2f20e91e3e9449e1706f42a07b8118

commit 4d164de49b0872e9ea675f69d5522c72636e5a1f
Author: Neal Cardwell
Date:   Mon Aug 17 19:10:21 2020 -0400

    net-tcp_bbr: v2: remove cycle_rand parameter that is unused in BBRv2

    Change-Id: Iee1df7e41e42de199068d7c89131ed3d228327c0

commit a65ee28cafce25231e3768f7a8f5bb9692f8b75f
Author: Neal Cardwell
Date:   Mon Aug 17 19:08:41 2020 -0400

    net-tcp_bbr: v2: remove field bw_rtts that is unused in BBRv2

    Change-Id: I58e3346c707748a6f316f3ed060d2da84c32a79b

commit 901c10a766d35f1c332654bde2278bd95396870b
Author: Neal Cardwell
Date:   Thu Nov 21 15:28:01 2019 -0500

    net-tcp_bbr: v2: remove unnecessary rs.delivered_ce logic upon loss

    There is no reason to compute rs.delivered_ce upon loss. In fact, we
    specifically do not want to compute rs.delivered_ce upon loss. Two
    issues:

    (1) This would be the wrong thing to do, in behavior terms. With
        RACK's dynamic reordering window, losses can be marked long after
        the sequence hole appears in the ACK/SACK stream. We want to
        catch the ECN mark rate rising too high as quickly as possible,
        which means we want to check for high ECN mark rates at ACK time
        (as BBRv2 currently does) and not at loss marking time.

    (2) This is dead code. The ECN mark rate cannot be detected as too
        high because the check needs rs->delivered to be > 0 as well:

          if (rs->delivered_ce > 0 && rs->delivered > 0 &&

        Since we are not setting rs->delivered upon loss, this check
        cannot succeed, so setting delivered_ce is pointless.

    This dead and wrong line was discovered by Randall Stewart at Netflix
    as he was reading the BBRv2 code.

    Change-Id: I37f83f418a259ec31d8f82de986db071b364b76a

commit 6ca69923776efb3ba71370c287ee2f0efb7201c2
Author: Neal Cardwell
Date:   Mon Jul 22 23:18:56 2019 -0400

    net-tcp_bbr: v2: add a README.md for TCP BBR v2 alpha release

    Change-Id: I35a8c984e299d2af6e78c3d4b3aade5627678306

commit fb018c1085d01359ef0f0b726cc18e859d7e3b1a
Author: Neal Cardwell
Date:   Tue Jun 11 12:54:22 2019 -0400

    net-tcp_bbr: v2: BBRv2 ("bbr2") congestion control for Linux TCP

    BBR v2 is an enhancement to the BBR v1 algorithm. It's designed to
    aim for lower queues, lower loss, and better Reno/CUBIC coexistence
    than BBR v1.
    BBR v2 maintains the core of BBR v1: an explicit model of the
    network path that is two-dimensional, adapting to estimate the (a)
    maximum available bandwidth and (b) maximum safe volume of data a
    flow can keep in-flight in the network. It maintains the estimated
    BDP as a core guide for estimating an appropriate level of in-flight
    data.

    BBR v2 makes several key enhancements:

    o Its bandwidth-probing time scale is adapted, within bounds, to
      allow improved coexistence with Reno and CUBIC. The
      bandwidth-probing time scale is (a) extended dynamically based on
      estimated BDP to improve coexistence with Reno/CUBIC; (b) bounded
      by an interactive wall-clock time-scale to be more scalable and
      responsive than Reno and CUBIC.

    o Rather than being largely agnostic to loss and ECN marks, it
      explicitly uses loss and (DCTCP-style) ECN signals to maintain its
      model.

    o It aims for lower losses than v1 by adjusting its model to attempt
      to stay within loss rate and ECN mark rate bounds (loss_thresh and
      ecn_thresh, respectively).

    o It adapts to loss/ECN signals even when the application is running
      out of data ("application-limited"), in case the
      "application-limited" flow is also "network-limited" (the bw and/or
      inflight available to this flow is lower than previously estimated
      when the flow ran out of data).

    o It has a three-part model: the model explicitly tracks three
      operating points, where an operating point is a tuple:
      (bandwidth, inflight). The three operating points are:

      o latest:      the latest measurement from the current round trip
      o upper bound: robust, optimistic, long-term upper bound
      o lower bound: robust, conservative, short-term lower bound

      These are stored in the following state variables:

      o latest:  bw_latest, inflight_latest
      o lo:      bw_lo, inflight_lo
      o hi:      bw_hi[2], inflight_hi

      To gain intuition about the meaning of the three operating points,
      it may help to consider the analogs in CUBIC, which has a somewhat
      analogous three-part model used by its probing state machine:

        BBR param    CUBIC param
        -----------  -------------
        latest     ~ cwnd
        lo         ~ ssthresh
        hi         ~ last_max_cwnd

      The analogy is only a loose one, though, since the BBR operating
      points are calculated differently, and are 2-dimensional
      (bw, inflight) rather than CUBIC's one-dimensional notion of
      operating point (inflight).

    o It uses the three-part model to adapt the magnitude of its
      bandwidth probing to match the estimated space available in the
      buffer, rather than (as in BBR v1) assuming that it was always
      acceptable to place 0.25*BDP in the bottleneck buffer when probing
      (commodity datacenter switches commonly do not have that much
      buffer for WAN flows). When BBR v2 estimates it hit a buffer limit
      during probing, its bandwidth probing then starts gently in case
      little space is still available in the buffer, and then
      accelerates, slowly at first and then rapidly if it can grow
      inflight without seeing congestion signals. In such cases, probing
      is bounded by inflight_hi + inflight_probe, where inflight_probe
      grows as: [0, 1, 2, 4, 8, 16, ...]. This allows BBR to keep losses
      low and bounded if a bottleneck remains congested, while
      rapidly/scalably utilizing free bandwidth when it becomes
      available.

    o It has a slightly revised state machine, to achieve the goals
      above.
      BBR_BW_PROBE_UP:     pushes up inflight to probe for bw/vol
      BBR_BW_PROBE_DOWN:   drain excess inflight from the queue
      BBR_BW_PROBE_CRUISE: use pipe, w/ headroom in queue/pipe
      BBR_BW_PROBE_REFILL: try to refill the pipe again to 100%,
                           leaving queue empty

    o The estimated BDP: BBR v2 continues to maintain an estimate of the
      path's two-way propagation delay, by tracking a windowed min_rtt,
      and coordinating (on an as-needed basis) to try to expose the
      two-way propagation delay by draining the bottleneck queue. BBR v2
      continues to use its min_rtt and (currently-applicable) bandwidth
      estimate to estimate the current bandwidth-delay product. The
      estimated BDP still provides one important guideline for bounding
      inflight data. However, because any min-filtered RTT and
      max-filtered bw inherently tend to both overestimate, the
      estimated BDP is often too high; in this case loss or ECN marks
      can ensue, in which case BBR v2 adjusts inflight_hi and
      inflight_lo to adapt its sending rate and inflight down to match
      the available capacity of the path.

    o Space: Note that ICSK_CA_PRIV_SIZE increased. This is because BBR
      v2 requires more space. Note that much of the space is due to
      support for per-socket parameterization and debugging in this
      release for research and debugging. With that state removed, the
      full "struct bbr" is 140 bytes, or 144 with padding. This is an
      increase of 40 bytes over the existing ca_priv space.

    o Code: BBR v2 reuses many pieces from BBR v1. But it omits the
      following significant pieces:

      o "packet conservation" (bbr_set_cwnd_to_recover_or_restore(),
        bbr_can_grow_inflight())
      o long-term bandwidth estimator ("policer mode")

      The code layout tries to keep BBR v2 code near the bottom of the
      file, so that v1-applicable code in the top does not accidentally
      refer to v2 code.

    o Docs: See the following docs for more details and diagrams
      describing the BBR v2 algorithm:
      https://datatracker.ietf.org/meeting/104/materials/slides-104-iccrg-an-update-on-bbr-00
      https://datatracker.ietf.org/meeting/102/materials/slides-102-iccrg-an-update-on-bbr-work-at-google-00

    o Internal notes:
      For this upstream rebase, Neal started from:
        git show fed518041ac6:net/ipv4/tcp_bbr.c > net/ipv4/tcp_bbr.c
      then removed dev instrumentation (dynamic get/set for parameters)
      and code that was only used by BBRv1.

    Effort: net-tcp_bbr
    Origin-9xx-SHA1: 2c84098e60bed6d67dde23cd7538c51dee273102
    Change-Id: I125cf26ba2a7a686f2fa5e87f4c2afceb65f7a05
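A minimal sketch of switching a running system to the new congestion control;
"bbr2" is the name this commit registers, while the tcp_bbr2 module name is an
assumption:

    modprobe tcp_bbr2                                # if built as a module
    sysctl -w net.ipv4.tcp_congestion_control=bbr2
    sysctl net.ipv4.tcp_congestion_control          # verify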
commit a2c0c2624fc9ae5ae9b6efdce65bd8be3b7fc4da
Author: Neal Cardwell
Date:   Sat Nov 16 13:16:25 2019 -0500

    net-tcp: add fast_ack_mode=1: skip rwin check in tcp_fast_ack_mode__tcp_ack_snd_check()

    Add logic for an experimental TCP connection behavior, enabled with
    tp->fast_ack_mode = 1, which disables checking the receive window
    before sending an ack in __tcp_ack_snd_check(). If this behavior is
    enabled, the data receiver sends an ACK if the amount of data is >
    RCV.MSS.

    Change-Id: Iaa0a0fd7108221f883137a79d5bfa724f1b096d4

commit 7512e84524ba123c513681f5a25e9a265c0ffcb2
Author: Neal Cardwell
Date:   Fri Sep 27 17:10:26 2019 -0400

    net-tcp: re-generalize TSO sizing in TCP CC module API

    Reorganize the API for CC modules so that the CC module once again
    gets complete control of the TSO sizing decision. This is how the API
    was set up around 2016 and the initial BBRv1 upstreaming. Later Eric
    Dumazet simplified it. But with wider testing it now seems that to
    avoid CPU regressions BBR needs to have a different TSO sizing
    function.

    This is necessary to handle cases where there are many flows
    bottlenecked on the sender host's NIC, in which case BBR's pacing
    rate is much lower than CUBIC/Reno/DCTCP's. Why does this happen?
    Because BBR's pacing rate adapts to the low bandwidth share each flow
    sees. By contrast, CUBIC/Reno/DCTCP see no loss or ECN, so they grow
    a very large cwnd, and thus a large pacing rate and large TSO burst
    size.

    Change-Id: Ic8ccfdbe4010ee8d4bf6a6334c48a2fceb2171ea

commit 7b0012a4a7ec0af08d13870ae2f4c3494b6de520
Author: Yousuk Seung
Date:   Wed May 23 17:55:54 2018 -0700

    net-tcp: add new ca opts flag TCP_CONG_WANTS_CE_EVENTS

    Add a new ca opts flag TCP_CONG_WANTS_CE_EVENTS that allows a
    congestion control module to receive CE events.

    Currently congestion control modules have to set the
    TCP_CONG_NEEDS_ECN bit in opts flag to receive CE events but this may
    incur changes in ECN behavior elsewhere. This patch adds a new bit
    TCP_CONG_WANTS_CE_EVENTS that allows congestion control modules to
    receive CE events independently of TCP_CONG_NEEDS_ECN.

    Effort: net-tcp
    Origin-9xx-SHA1: 9f7e14716cde760bc6c67ef8ef7e1ee48501d95b
    Change-Id: I2255506985242f376d910c6fd37daabaf4744f24

commit 0230a8dcaa1dca8ce747eac720534c048d25c81b
Author: Neal Cardwell
Date:   Tue May 7 22:37:19 2019 -0400

    net-tcp_bbr: v2: set tx.in_flight for skbs in repair write queue

    Syzkaller was able to use TCP_REPAIR to reproduce the new warning
    added in tcp_fragment():

      WARNING: CPU: 0 PID: 118174 at net/ipv4/tcp_output.c:1487
      tcp_fragment+0xdcc/0x10a0 net/ipv4/tcp_output.c:1487()
      inconsistent: tx.in_flight: 0 old_factor: 53

    The warning happens because skbs inserted into the tcp_rtx_queue
    during the repair process go through a sort of "fake send" process,
    and that process was setting pcount but not tx.in_flight, and thus
    the warnings (where old_factor is the old pcount).

    The fix of setting tx.in_flight in the TCP_REPAIR code path seems
    simple enough, and indeed makes the repro code from syzkaller stop
    producing warnings. Running through kokonut tests, and will send out
    for review when all tests pass.

    Effort: net-tcp_bbr
    Origin-9xx-SHA1: 330f825a08a6fe92cef74d799cc468864c479f63
    Change-Id: I0bc4a790f040fd4239620e1eedd5dc64666c6f05

commit 2fd1f1c98b69d090bab256e1d15b221722d7f561
Author: Neal Cardwell
Date:   Wed May 1 20:16:25 2019 -0400

    net-tcp_bbr: v2: adjust skb tx.in_flight upon split in tcp_fragment()

    When we fragment an skb that has already been sent, we need to update
    the tx.in_flight for the first skb in the resulting pair ("buff").

    Because we were not updating the tx.in_flight, the tx.in_flight value
    was inconsistent with the pcount of the "buff" skb (tx.in_flight
    would be too high). That meant that if the "buff" skb was lost, then
    bbr2_inflight_hi_from_lost_skb() would calculate an inflight_hi value
    that is too high. This could result in longer queues and higher
    packet loss.

    Packetdrill testing verified that without this commit, when the
    second half of an skb is SACKed and then later the first half of that
    skb is marked lost, the calculated inflight_hi was incorrect.
    Effort: net-tcp_bbr
    Origin-9xx-SHA1: 385f1ddc610798fab2837f9f372857438b25f874
    Change-Id: I617f8cab4e9be7a0b8e8d30b047bf8645393354d

commit 1e67187650db6609942807601a70c6bf9784248e
Author: Neal Cardwell
Date:   Wed May 1 20:16:33 2019 -0400

    net-tcp_bbr: v2: adjust skb tx.in_flight upon merge in tcp_shifted_skb()

    When tcp_shifted_skb() updates state as adjacent SACKed skbs are
    coalesced, previously the tx.in_flight was not adjusted, so we could
    get contradictory state where the skb's recorded pcount was bigger
    than the tx.in_flight (the number of segments that were in_flight
    after sending the skb).

    Normally, having a SACKed skb with contradictory pcount/tx.in_flight
    would not matter. However, with SACK reneging, the SACKed bit is
    removed, and an skb once again becomes eligible for retransmitting,
    fragmenting, SACKing, etc.

    Packetdrill testing verified the following sequence is possible in a
    kernel that does not have this commit:

    - skb N is SACKed
    - skb N+1 is SACKed and combined with skb N using tcp_shifted_skb()
      - tcp_shifted_skb() will increase the pcount of prev, but leave
        tx.in_flight as-is
      - so prev skb can have pcount > tx.in_flight
    - RTO, tcp_timeout_mark_lost(), detect reneg, remove "SACKed" bit,
      mark skb N as lost
      - find pcount of skb N is greater than its tx.in_flight

    I suspect this issue is what caused the
    bbr2_inflight_hi_from_lost_skb():

      WARN_ON_ONCE(inflight_prev < 0)

    to fire in production machines using bbr2.

    Tested: See last commit in series for sponge link.

    Effort: net-tcp_bbr
    Origin-9xx-SHA1: 1a3e997e613d2dcf32b947992882854ebe873715
    Change-Id: I1b0b75c27519953430c7db51c6f358f104c7af55

commit 03de2554b1b3659b713bb21197fb15ea420677dc
Author: Neal Cardwell
Date:   Tue May 7 22:36:36 2019 -0400

    net-tcp_bbr: v2: factor out tx.in_flight setting into tcp_set_tx_in_flight()

    Factor out the code to set an skb's tx.in_flight field into its own
    function, so that this code can be used for the TCP_REPAIR "fake
    send" code path that inserts skbs into the rtx queue without sending
    them.

    This is in preparation for the following patch, which fixes an issue
    with TCP_REPAIR and tx.in_flight.

    Tested: See last patch in series for sponge link.

    Effort: net-tcp_bbr
    Origin-9xx-SHA1: e880fc907d06ea7354333f60f712748ebce9497b
    Change-Id: I4fbd4a6e18a51ab06d50ab1c9ad820ce5bea89af

commit 169e37e254941219c886a9b0a19e1fd9b05624e6
Author: Neal Cardwell
Date:   Tue Aug 7 21:52:06 2018 -0400

    net-tcp_bbr: v2: introduce ca_ops->skb_marked_lost() CC module callback API

    For connections experiencing reordering, RACK can mark packets lost
    long after we receive the SACKs/ACKs hinting that the packets were
    actually lost.

    This means that CC modules cannot easily learn the volume of inflight
    data at which packet loss happens by looking at the current inflight
    or even the packets in flight when the most recently SACKed packet
    was sent. To learn this, CC modules need to know how many packets
    were in flight at the time lost packets were sent. This new callback,
    combined with TCP_SKB_CB(skb)->tx.in_flight, allows them to learn
    this.

    This also provides a consistent callback that is invoked whether
    packets are marked lost upon ACK processing, using the RACK
    reordering timer, or at RTO time.
    Effort: net-tcp_bbr
    Origin-9xx-SHA1: afcbebe3374e4632ac6714d39e4dc8a8455956f4
    Change-Id: I54826ab53df636be537e5d3c618a46145d12d51a

commit 30c10407203775dc96efb3acef7b258c6abde977
Author: Neal Cardwell
Date:   Mon Nov 19 13:48:36 2018 -0500

    net-tcp_bbr: v2: export FLAG_ECE in rate_sample.is_ece

    For understanding the relationship between inflight and ECN signals,
    to try to find the highest inflight value that has acceptable levels
    of ECN marking.

    Effort: net-tcp_bbr
    Origin-9xx-SHA1: 3eba998f2898541406c2666781182200934965a8
    Change-Id: I3a964e04cee83e11649a54507043d2dfe769a3b3

commit 55eaa7e089be51bf00eba2be0c8dc05d3043587c
Author: Neal Cardwell
Date:   Thu Oct 12 23:44:27 2017 -0400

    net-tcp_bbr: v2: count packets lost over TCP rate sampling interval

    For understanding the relationship between inflight and packet loss
    signals, to try to find the highest inflight value that has
    acceptable levels of packet losses.

    Effort: net-tcp_bbr
    Origin-9xx-SHA1: 4527e26b2bd7756a88b5b9ef1ada3da33dd609ab
    Change-Id: I594c2500868d9c530770e7ddd68ffc87c57f4fd5

commit 5c5157fc70df0aa60fa065e12d7aae4e318a9854
Author: Neal Cardwell
Date:   Sat Aug 5 11:49:50 2017 -0400

    net-tcp_bbr: v2: snapshot packets in flight at transmit time and pass in rate_sample

    For understanding the relationship between inflight and losses or ECN
    signals, to try to find the highest inflight value that has
    acceptable levels of loss/ECN marking.

    Effort: net-tcp_bbr
    Origin-9xx-SHA1: b3eb4f2d20efab4ca001f32c9294739036c493ea
    Change-Id: I7314047d0ff14dd261a04b1969a46dc658c8836a

commit bade6fb6d05f70a8202e33eac0dd6e1eaef099c6
Author: Neal Cardwell
Date:   Sun Jun 24 21:55:59 2018 -0400

    net-tcp_bbr: v2: shrink delivered_mstamp, first_tx_mstamp to u32 to free up 8 bytes

    Free up some space for tracking inflight and losses for each bw
    sample, in upcoming commits.

    These timestamps are in microseconds, and are now stored in 32 bits.
    So they can only hold time intervals up to roughly 2^12 = 4096
    seconds (2^32 microseconds). But Linux TCP RTT and RTO tracking has
    the same 32-bit microsecond implementation approach and resulting
    deployment limitations. So this is not introducing a new limit. And
    these should not be a limitation for the foreseeable future.

    Effort: net-tcp_bbr
    Origin-9xx-SHA1: 238a7e6b5d51625fef1ce7769826a7b21b02ae55
    Change-Id: I3b779603797263b52a61ad57c565eb91fe42680c

commit 5fafff660ae67032a1f9e9de3034b1390e38c919
Author: Neal Cardwell
Date:   Tue Jun 11 12:26:55 2019 -0400

    net-tcp_bbr: broaden app-limited rate sample detection

    This commit is a bug fix for the Linux TCP app-limited
    (application-limited) logic that is used for collecting rate
    (bandwidth) samples.

    Previously the app-limited logic only looked for "bubbles" of
    silence in between application writes, by checking at the start of
    each sendmsg. But "bubbles" of silence can also happen before
    retransmits: e.g. bubbles can happen between an application write and
    a retransmit, or between two retransmits.

    Retransmits are triggered by ACKs or timers. So this commit checks
    for bubbles of app-limited silence upon ACKs or timers.

    Why does this commit check for app-limited state at the start of ACKs
    and timer handling? Because at that point we know whether inflight
    was fully using the cwnd. During processing the ACK or timer event we
    often change the cwnd; after changing the cwnd we can't know whether
    inflight was fully using the old cwnd.
    Origin-9xx-SHA1: 3fe9b53291e018407780fb8c356adb5666722cbc
    Change-Id: I37221506f5166877c2b110753d39bb0757985e68

commit ea4f7926c7a742ee6c6ed2e795d79fa67d029c0a
Author: Alexey Avramov
Date:   Sat Nov 13 10:42:27 2021 +0900

    mm/vmscan: add sysctl knobs for protecting the working set

    The kernel does not provide a way to protect the working set under
    memory pressure. A certain amount of anonymous and clean file pages
    is required by the userspace for normal operation. First of all, the
    userspace needs a cache of shared libraries and executable binaries.
    If the amount of the clean file pages falls below a certain level,
    then thrashing and even livelock can take place.

    The patch provides sysctl knobs for protecting the working set
    (anonymous and clean file pages) under memory pressure.

    The vm.anon_min_kbytes sysctl knob provides *hard* protection of
    anonymous pages. The anonymous pages on the current node won't be
    reclaimed under any conditions when their amount is below
    vm.anon_min_kbytes. This knob may be used to prevent excessive swap
    thrashing when anonymous memory is low (for example, when memory is
    going to be overfilled by compressed data of zram module). The
    default value is defined by CONFIG_ANON_MIN_KBYTES (suggested 0 in
    Kconfig).

    The vm.clean_low_kbytes sysctl knob provides *best-effort*
    protection of clean file pages. The file pages on the current node
    won't be reclaimed under memory pressure when the amount of clean
    file pages is below vm.clean_low_kbytes *unless* we threaten to OOM.
    Protection of clean file pages using this knob may be used when
    swapping is still possible to:
    - prevent disk I/O thrashing under memory pressure;
    - improve performance in disk cache-bound tasks under memory
      pressure.
    The default value is defined by CONFIG_CLEAN_LOW_KBYTES (suggested 0
    in Kconfig).

    The vm.clean_min_kbytes sysctl knob provides *hard* protection of
    clean file pages. The file pages on the current node won't be
    reclaimed under memory pressure when the amount of clean file pages
    is below vm.clean_min_kbytes. Hard protection of clean file pages
    using this knob may be used to:
    - prevent disk I/O thrashing under memory pressure even with no free
      swap space;
    - improve performance in disk cache-bound tasks under memory
      pressure;
    - avoid high latency and prevent livelock in near-OOM conditions.
    The default value is defined by CONFIG_CLEAN_MIN_KBYTES (suggested 0
    in Kconfig).

    Signed-off-by: Alexey Avramov
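A small runtime sketch of the knobs this commit adds; the values are
illustrative only (the suggested Kconfig defaults are 0, i.e. no protection):

    # hard-protect 256 MiB of anonymous pages, e.g. for zram-backed swap
    sysctl -w vm.anon_min_kbytes=262144
    # best-effort-protect 512 MiB of clean file pages against I/O thrashing
    sysctl -w vm.clean_low_kbytes=524288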
commit d4f795f9881a4d784d08b67ccaf77e58fc69d892
Author: Yu Zhao
Date:   Tue Mar 8 19:12:31 2022 -0700

    mm: multi-gen LRU: design doc

    Add a design doc.

    Signed-off-by: Yu Zhao
    Acked-by: Brian Geffon
    Acked-by: Jan Alexander Steffens (heftig)
    Acked-by: Oleksandr Natalenko
    Acked-by: Steven Barrett
    Acked-by: Suleiman Souhlal
    Tested-by: Daniel Byrne
    Tested-by: Donald Carr
    Tested-by: Holger Hoffstätte
    Tested-by: Konstantin Kharlamov
    Tested-by: Shuang Zhai
    Tested-by: Sofia Trinh
    Tested-by: Vaibhav Jain

commit dacb2912f3756908dad1e74c4aefcf60d536bdae
Author: Yu Zhao
Date:   Tue Mar 8 19:12:30 2022 -0700

    mm: multi-gen LRU: admin guide

    Add an admin guide.

    Signed-off-by: Yu Zhao
    Acked-by: Brian Geffon
    Acked-by: Jan Alexander Steffens (heftig)
    Acked-by: Oleksandr Natalenko
    Acked-by: Steven Barrett
    Acked-by: Suleiman Souhlal
    Tested-by: Daniel Byrne
    Tested-by: Donald Carr
    Tested-by: Holger Hoffstätte
    Tested-by: Konstantin Kharlamov
    Tested-by: Shuang Zhai
    Tested-by: Sofia Trinh
    Tested-by: Vaibhav Jain

commit 89e427a677584c65ad5f84f885353b376ea0546d
Author: Yu Zhao
Date:   Tue Mar 8 19:12:29 2022 -0700

    mm: multi-gen LRU: debugfs interface

    Add /sys/kernel/debug/lru_gen for working set estimation and
    proactive reclaim. These features are required to optimize job
    scheduling (bin packing) in data centers [1][2].

    Compared with the page table-based approach and the PFN-based
    approach, e.g., mm/damon/[vp]addr.c, this lruvec-based approach has
    the following advantages:
    1. It offers better choices because it is aware of memcgs, NUMA
       nodes, shared mappings and unmapped page cache.
    2. It is more scalable because it is O(nr_hot_pages), whereas the
       PFN-based approach is O(nr_total_pages).

    Add /sys/kernel/debug/lru_gen_full for debugging.

    [1] https://dl.acm.org/doi/10.1145/3297858.3304053
    [2] https://dl.acm.org/doi/10.1145/3503222.3507731

    Signed-off-by: Yu Zhao
    Acked-by: Brian Geffon
    Acked-by: Jan Alexander Steffens (heftig)
    Acked-by: Oleksandr Natalenko
    Acked-by: Steven Barrett
    Acked-by: Suleiman Souhlal
    Tested-by: Daniel Byrne
    Tested-by: Donald Carr
    Tested-by: Holger Hoffstätte
    Tested-by: Konstantin Kharlamov
    Tested-by: Shuang Zhai
    Tested-by: Sofia Trinh
    Tested-by: Vaibhav Jain

commit 34739f184a28a5b66ce61a2cdf8fd20d5a2c540e
Author: Yu Zhao
Date:   Tue Mar 8 19:12:28 2022 -0700

    mm: multi-gen LRU: thrashing prevention

    Add /sys/kernel/mm/lru_gen/min_ttl_ms for thrashing prevention, as
    requested by many desktop users [1].

    When set to value N, it prevents the working set of N milliseconds
    from getting evicted. The OOM killer is triggered if this working set
    cannot be kept in memory. Based on the average human detectable lag
    (~100ms), N=1000 usually eliminates intolerable lags due to
    thrashing. Larger values like N=3000 make lags less noticeable at the
    risk of premature OOM kills.

    Compared with the size-based approach, e.g., [2], this time-based
    approach has the following advantages:
    1. It is easier to configure because it is agnostic to applications
       and memory sizes.
    2. It is more reliable because it is directly wired to the OOM
       killer.

    [1] https://lore.kernel.org/lkml/Ydza%2FzXKY9ATRoh6@google.com/
    [2] https://lore.kernel.org/lkml/20211130201652.2218636d@mail.inbox.lv/

    Signed-off-by: Yu Zhao
    Acked-by: Brian Geffon
    Acked-by: Jan Alexander Steffens (heftig)
    Acked-by: Oleksandr Natalenko
    Acked-by: Steven Barrett
    Acked-by: Suleiman Souhlal
    Tested-by: Daniel Byrne
    Tested-by: Donald Carr
    Tested-by: Holger Hoffstätte
    Tested-by: Konstantin Kharlamov
    Tested-by: Shuang Zhai
    Tested-by: Sofia Trinh
    Tested-by: Vaibhav Jain
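A one-line runtime sketch, using the N=1000 value the commit message itself
recommends:

    # keep the last 1000 ms of the working set resident; if it cannot be
    # kept, the OOM killer is triggered instead of thrashing
    echo 1000 >/sys/kernel/mm/lru_gen/min_ttl_ms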
commit 177221399e34cff10f1e20ec1cf613b4c1a5e61a
Author: Yu Zhao
Date:   Tue Mar 8 19:12:27 2022 -0700

    mm: multi-gen LRU: kill switch

    Add /sys/kernel/mm/lru_gen/enabled as a kill switch. Components that
    can be disabled include:
      0x0001: the multi-gen LRU core
      0x0002: walking page table, when arch_has_hw_pte_young() returns
              true
      0x0004: clearing the accessed bit in non-leaf PMD entries, when
              CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG=y
      [yYnN]: apply to all the components above

    E.g.,
      echo y >/sys/kernel/mm/lru_gen/enabled
      cat /sys/kernel/mm/lru_gen/enabled
      0x0007
      echo 5 >/sys/kernel/mm/lru_gen/enabled
      cat /sys/kernel/mm/lru_gen/enabled
      0x0005

    NB: the page table walks happen on the scale of seconds under heavy
    memory pressure, in which case the mmap_lock contention is a lesser
    concern, compared with the LRU lock contention and the I/O
    congestion. So far the only well-known case of the mmap_lock
    contention happens on Android, due to Scudo [1] which allocates
    several thousand VMAs for merely a few hundred MBs. The SPF and the
    Maple Tree also have provided their own assessments [2][3]. However,
    if walking page tables does worsen the mmap_lock contention, the kill
    switch can be used to disable it. In this case the multi-gen LRU will
    suffer a minor performance degradation, as shown previously.

    Clearing the accessed bit in non-leaf PMD entries can also be
    disabled, since this behavior was not tested on x86 varieties other
    than Intel and AMD.

    [1] https://source.android.com/devices/tech/debug/scudo
    [2] https://lore.kernel.org/lkml/20220128131006.67712-1-michel@lespinasse.org/
    [3] https://lore.kernel.org/lkml/20220202024137.2516438-1-Liam.Howlett@oracle.com/

    Signed-off-by: Yu Zhao
    Acked-by: Brian Geffon
    Acked-by: Jan Alexander Steffens (heftig)
    Acked-by: Oleksandr Natalenko
    Acked-by: Steven Barrett
    Acked-by: Suleiman Souhlal
    Tested-by: Daniel Byrne
    Tested-by: Donald Carr
    Tested-by: Holger Hoffstätte
    Tested-by: Konstantin Kharlamov
    Tested-by: Shuang Zhai
    Tested-by: Sofia Trinh
    Tested-by: Vaibhav Jain

commit f534a8f7060c198ae40b2b66e3fd0cf3a31f03a3
Author: Yu Zhao
Date:   Tue Mar 8 19:12:26 2022 -0700

    mm: multi-gen LRU: optimize multiple memcgs

    When multiple memcgs are available, it is possible to make better
    choices based on generations and tiers and therefore improve the
    overall performance under global memory pressure. This patch adds a
    rudimentary optimization to select memcgs that can drop single-use
    unmapped clean pages first. Doing so reduces the chance of going into
    the aging path or swapping. These two operations can be costly.

    A typical example that benefits from this optimization is a server
    running mixed types of workloads, e.g., heavy anon workload in one
    memcg and heavy buffered I/O workload in the other.

    Though this optimization can be applied to both kswapd and direct
    reclaim, it is only added to kswapd to keep the patchset manageable.
    Later improvements will cover the direct reclaim path.
    Server benchmark results:
      Mixed workloads:
        fio (buffered I/O): -[28, 30]%
                      IOPS         BW
          patch1-7:   3117k        11.9GiB/s
          patch1-8:   2217k        8661MiB/s

        memcached (anon): +[247, 251]%
                      Ops/sec      KB/sec
          patch1-7:   563772.35    21900.01
          patch1-8:   1968343.76   76461.24

      Mixed workloads:
        fio (buffered I/O): -[4, 6]%
                      IOPS         BW
          5.17-rc2:   2338k        9133MiB/s
          patch1-8:   2217k        8661MiB/s

        memcached (anon): +[524, 530]%
                      Ops/sec      KB/sec
          5.17-rc2:   313821.65    12190.55
          patch1-8:   1968343.76   76461.24

    Configurations: (changes since patch 5)

      cat mixed.sh
      modprobe brd rd_nr=2 rd_size=56623104

      swapoff -a
      mkswap /dev/ram0
      swapon /dev/ram0

      mkfs.ext4 /dev/ram1
      mount -t ext4 /dev/ram1 /mnt

      memtier_benchmark -S /var/run/memcached/memcached.sock \
        -P memcache_binary -n allkeys --key-minimum=1 \
        --key-maximum=50000000 --key-pattern=P:P -c 1 -t 36 \
        --ratio 1:0 --pipeline 8 -d 2000

      fio -name=mglru --numjobs=36 --directory=/mnt --size=1408m \
        --buffered=1 --ioengine=io_uring --iodepth=128 \
        --iodepth_batch_submit=32 --iodepth_batch_complete=32 \
        --rw=randread --random_distribution=random --norandommap \
        --time_based --ramp_time=10m --runtime=90m --group_reporting &
      pid=$!

      sleep 200

      memtier_benchmark -S /var/run/memcached/memcached.sock \
        -P memcache_binary -n allkeys --key-minimum=1 \
        --key-maximum=50000000 --key-pattern=R:R -c 1 -t 36 \
        --ratio 0:1 --pipeline 8 --randomize --distinct-client-seed

      kill -INT $pid
      wait

    Client benchmark results: no change (CONFIG_MEMCG=n)

    Signed-off-by: Yu Zhao
    Acked-by: Brian Geffon
    Acked-by: Jan Alexander Steffens (heftig)
    Acked-by: Oleksandr Natalenko
    Acked-by: Steven Barrett
    Acked-by: Suleiman Souhlal
    Tested-by: Daniel Byrne
    Tested-by: Donald Carr
    Tested-by: Holger Hoffstätte
    Tested-by: Konstantin Kharlamov
    Tested-by: Shuang Zhai
    Tested-by: Sofia Trinh
    Tested-by: Vaibhav Jain

commit 869c1f3607304012f47ec46abc8a533d123a3ff1
Author: Yu Zhao
Date:   Tue Mar 8 19:12:25 2022 -0700

    mm: multi-gen LRU: support page table walks

    To further exploit spatial locality, the aging prefers to walk page
    tables to search for young PTEs and promote hot pages. A kill switch
    will be added in the next patch to disable this behavior. When
    disabled, the aging relies on the rmap only.

    NB: this behavior has nothing in common with the page table scanning
    in the 2.4 kernel [1], which searches page tables for old PTEs, adds
    cold pages to swapcache and unmaps them.

    To avoid confusion, the term "iteration" specifically means the
    traversal of an entire mm_struct list; the term "walk" will be
    applied to page tables and the rmap, as usual.

    An mm_struct list is maintained for each memcg, and an mm_struct
    follows its owner task to the new memcg when this task is migrated.
    Given an lruvec, the aging iterates lruvec_memcg()->mm_list and calls
    walk_page_range() with each mm_struct on this list to promote hot
    pages before it increments max_seq.

    When multiple page table walkers iterate the same list, each of them
    gets a unique mm_struct; therefore they can run concurrently. Page
    table walkers ignore any misplaced pages, e.g., if an mm_struct was
    migrated, pages it left in the previous memcg will not be promoted
    when its current memcg is under reclaim. Similarly, page table
    walkers will not promote pages from nodes other than the one under
    reclaim.

    This patch uses the following optimizations when walking page tables:
    1. It tracks the usage of mm_struct's between context switches so
       that page table walkers can skip processes that have been sleeping
       since the last iteration.
    2. It uses generational Bloom filters to record populated branches
       so that page table walkers can reduce their search space based on
       the query results, e.g., to skip page tables containing mostly
       holes or misplaced pages.
    3. It takes advantage of the accessed bit in non-leaf PMD entries
       when CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG=y.
    4. It does not zigzag between a PGD table and the same PMD table
       spanning multiple VMAs. IOW, it finishes all the VMAs within the
       range of the same PMD table before it returns to a PGD table.
       This improves the cache performance for workloads that have large
       numbers of tiny VMAs [2], especially when
       CONFIG_PGTABLE_LEVELS=5.

    Server benchmark results:
      Single workload:
        fio (buffered I/O): no change

      Single workload:
        memcached (anon): +[5.5, 7.5]%
                      Ops/sec      KB/sec
          patch1-6:   1015292.83   39490.38
          patch1-7:   1080856.82   42040.53

      Configurations: no change

    Client benchmark results:
      kswapd profiles:
        patch1-6
          45.49%  lzo1x_1_do_compress (real work)
           7.38%  page_vma_mapped_walk
           7.24%  _raw_spin_unlock_irq
           2.64%  ptep_clear_flush
           2.31%  __zram_bvec_write
           2.13%  do_raw_spin_lock
           2.09%  lru_gen_look_around
           1.89%  free_unref_page_list
           1.85%  memmove
           1.74%  obj_malloc

        patch1-7
          47.73%  lzo1x_1_do_compress (real work)
           6.84%  page_vma_mapped_walk
           6.14%  _raw_spin_unlock_irq
           2.86%  walk_pte_range
           2.79%  ptep_clear_flush
           2.24%  __zram_bvec_write
           2.10%  do_raw_spin_lock
           1.94%  free_unref_page_list
           1.80%  memmove
           1.75%  obj_malloc

      Configurations: no change

    [1] https://lwn.net/Articles/23732/
    [2] https://source.android.com/devices/tech/debug/scudo

    Signed-off-by: Yu Zhao
    Acked-by: Brian Geffon
    Acked-by: Jan Alexander Steffens (heftig)
    Acked-by: Oleksandr Natalenko
    Acked-by: Steven Barrett
    Acked-by: Suleiman Souhlal
    Tested-by: Daniel Byrne
    Tested-by: Donald Carr
    Tested-by: Holger Hoffstätte
    Tested-by: Konstantin Kharlamov
    Tested-by: Shuang Zhai
    Tested-by: Sofia Trinh
    Tested-by: Vaibhav Jain

commit 1f5e85620d2af660ee1c051f617d98285ccf55eb
Author: Yu Zhao
Date:   Tue Mar 8 19:12:24 2022 -0700

    mm: multi-gen LRU: exploit locality in rmap

    Searching the rmap for PTEs mapping each page on an LRU list (to test
    and clear the accessed bit) can be expensive because pages from
    different VMAs (PA space) are not cache friendly to the rmap (VA
    space). For workloads mostly using mapped pages, the rmap has a high
    CPU cost in the reclaim path.

    This patch exploits spatial locality to reduce the trips into the
    rmap. When shrink_page_list() walks the rmap and finds a young PTE, a
    new function lru_gen_look_around() scans at most BITS_PER_LONG-1
    adjacent PTEs. On finding another young PTE, it clears the accessed
    bit and updates the gen counter of the page mapped by this PTE to
    (max_seq%MAX_NR_GENS)+1.
    Server benchmark results:
      Single workload:
        fio (buffered I/O): no change

      Single workload:
        memcached (anon): +[3.5, 5.5]%
                      Ops/sec      KB/sec
          patch1-5:   972526.07    37826.95
          patch1-6:   1015292.83   39490.38

      Configurations: no change

    Client benchmark results:
      kswapd profiles:
        patch1-5
          39.73%  lzo1x_1_do_compress (real work)
          14.96%  page_vma_mapped_walk
           6.97%  _raw_spin_unlock_irq
           3.07%  do_raw_spin_lock
           2.53%  anon_vma_interval_tree_iter_first
           2.04%  ptep_clear_flush
           1.82%  __zram_bvec_write
           1.76%  __anon_vma_interval_tree_subtree_search
           1.57%  memmove
           1.45%  free_unref_page_list

        patch1-6
          45.49%  lzo1x_1_do_compress (real work)
           7.38%  page_vma_mapped_walk
           7.24%  _raw_spin_unlock_irq
           2.64%  ptep_clear_flush
           2.31%  __zram_bvec_write
           2.13%  do_raw_spin_lock
           2.09%  lru_gen_look_around
           1.89%  free_unref_page_list
           1.85%  memmove
           1.74%  obj_malloc

      Configurations: no change

    Signed-off-by: Yu Zhao
    Acked-by: Brian Geffon
    Acked-by: Jan Alexander Steffens (heftig)
    Acked-by: Oleksandr Natalenko
    Acked-by: Steven Barrett
    Acked-by: Suleiman Souhlal
    Tested-by: Daniel Byrne
    Tested-by: Donald Carr
    Tested-by: Holger Hoffstätte
    Tested-by: Konstantin Kharlamov
    Tested-by: Shuang Zhai
    Tested-by: Sofia Trinh
    Tested-by: Vaibhav Jain

commit 2902d09901c239a56ecdb6b2029198b2f9907600
Author: Yu Zhao
Date:   Tue Mar 8 19:12:23 2022 -0700

    mm: multi-gen LRU: minimal implementation

    To avoid confusion, the terms "promotion" and "demotion" will be
    applied to the multi-gen LRU, as a new convention; the terms
    "activation" and "deactivation" will be applied to the
    active/inactive LRU, as usual.

    The aging produces young generations. Given an lruvec, it increments
    max_seq when max_seq-min_seq+1 approaches MIN_NR_GENS. The aging
    promotes hot pages to the youngest generation when it finds them
    accessed through page tables; the demotion of cold pages happens
    consequently when it increments max_seq. The aging has the complexity
    O(nr_hot_pages), since it is only interested in hot pages. Promotion
    in the aging path does not require any LRU list operations, only the
    updates of the gen counter and lrugen->nr_pages[]; demotion, unless
    as the result of the increment of max_seq, requires LRU list
    operations, e.g., lru_deactivate_fn().

    The eviction consumes old generations. Given an lruvec, it increments
    min_seq when the lists indexed by min_seq%MAX_NR_GENS become empty. A
    feedback loop modeled after the PID controller monitors refaults over
    anon and file types and decides which type to evict when both types
    are available from the same generation.

    Each generation is divided into multiple tiers. Tiers represent
    different ranges of numbers of accesses through file descriptors. A
    page accessed N times through file descriptors is in tier
    order_base_2(N). Tiers do not have dedicated lrugen->lists[], only
    bits in folio->flags. In contrast to moving across generations, which
    requires the LRU lock, moving across tiers only involves operations
    on folio->flags. The feedback loop also monitors refaults over all
    tiers and decides when to protect pages in which tiers (N>1), using
    the first tier (N=0,1) as a baseline. The first tier contains
    single-use unmapped clean pages, which are most likely the best
    choices. The eviction moves a page to the next generation, i.e.,
    min_seq+1, if the feedback loop decides so.

    This approach has the following advantages:
    1. It removes the cost of activation in the buffered access path by
       inferring whether pages accessed multiple times through file
       descriptors are statistically hot and thus worth protecting in
       the eviction path.
    2. It takes pages accessed through page tables into account and
       avoids overprotecting pages accessed multiple times through file
       descriptors. (Pages accessed through page tables are in the first
       tier, since N=0.)
    3. More tiers provide better protection for pages accessed more than
       twice through file descriptors, when under heavy buffered I/O
       workloads.

    Server benchmark results:
      Single workload:
        fio (buffered I/O): +[47, 49]%
                      IOPS         BW
          5.17-rc2:   2242k        8759MiB/s
          patch1-5:   3321k        12.7GiB/s

      Single workload:
        memcached (anon): +[101, 105]%
                      Ops/sec      KB/sec
          5.17-rc2:   476771.79    18544.31
          patch1-5:   972526.07    37826.95

      Configurations:
        CPU: two Xeon 6154
        Mem: total 256G

        Node 1 was only used as a ram disk to reduce the variance in the
        results.

        patch drivers/block/brd.c <<EOF
        ...
        > gfp_flags = GFP_NOIO | __GFP_ZERO | __GFP_HIGHMEM | __GFP_THISNODE;
        > page = alloc_pages_node(1, gfp_flags, 0);
        EOF

        cat >>/etc/systemd/system.conf <<EOF
        ...
        EOF

        cat >>/etc/memcached.conf <<EOF
        ...
        EOF

        ...
        echo ... >/sys/fs/cgroup/user.slice/test/memory.max
        echo $$ >/sys/fs/cgroup/user.slice/test/cgroup.procs
        fio -name=mglru --numjobs=72 --directory=/mnt --size=1408m \
          --buffered=1 --ioengine=io_uring --iodepth=128 \
          --iodepth_batch_submit=32 --iodepth_batch_complete=32 \
          --rw=randread --random_distribution=random --norandommap \
          --time_based --ramp_time=10m --runtime=5m --group_reporting

        cat memcached.sh
        modprobe brd rd_nr=1 rd_size=113246208

        swapoff -a
        mkswap /dev/ram0
        swapon /dev/ram0

        memtier_benchmark -S /var/run/memcached/memcached.sock \
          -P memcache_binary -n allkeys --key-minimum=1 \
          --key-maximum=65000000 --key-pattern=P:P -c 1 -t 36 \
          --ratio 1:0 --pipeline 8 -d 2000

        memtier_benchmark -S /var/run/memcached/memcached.sock \
          -P memcache_binary -n allkeys --key-minimum=1 \
          --key-maximum=65000000 --key-pattern=R:R -c 1 -t 36 \
          --ratio 0:1 --pipeline 8 --randomize --distinct-client-seed

    Client benchmark results:
      kswapd profiles:
        5.17-rc2
          38.05%  page_vma_mapped_walk
          20.86%  lzo1x_1_do_compress (real work)
           6.16%  do_raw_spin_lock
           4.61%  _raw_spin_unlock_irq
           2.20%  vma_interval_tree_iter_next
           2.19%  vma_interval_tree_subtree_search
           2.15%  page_referenced_one
           1.93%  anon_vma_interval_tree_iter_first
           1.65%  ptep_clear_flush
           1.00%  __zram_bvec_write

        patch1-5
          39.73%  lzo1x_1_do_compress (real work)
          14.96%  page_vma_mapped_walk
           6.97%  _raw_spin_unlock_irq
           3.07%  do_raw_spin_lock
           2.53%  anon_vma_interval_tree_iter_first
           2.04%  ptep_clear_flush
           1.82%  __zram_bvec_write
           1.76%  __anon_vma_interval_tree_subtree_search
           1.57%  memmove
           1.45%  free_unref_page_list

      Configurations:
        CPU: single Snapdragon 7c
        Mem: total 4G

        Chrome OS MemoryPressure [1]

    [1] https://chromium.googlesource.com/chromiumos/platform/tast-tests/

    Signed-off-by: Yu Zhao
    Acked-by: Brian Geffon
    Acked-by: Jan Alexander Steffens (heftig)
    Acked-by: Oleksandr Natalenko
    Acked-by: Steven Barrett
    Acked-by: Suleiman Souhlal
    Tested-by: Daniel Byrne
    Tested-by: Donald Carr
    Tested-by: Holger Hoffstätte
    Tested-by: Konstantin Kharlamov
    Tested-by: Shuang Zhai
    Tested-by: Sofia Trinh
    Tested-by: Vaibhav Jain

commit 1dfe8a837e89f26e8834265a5da33688245dfb8e
Author: Yu Zhao
Date:   Tue Mar 8 19:12:22 2022 -0700

    mm: multi-gen LRU: groundwork

    Evictable pages are divided into multiple generations for each
    lruvec. The youngest generation number is stored in lrugen->max_seq
    for both anon and file types as they are aged on an equal footing.
    The oldest generation numbers are stored in lrugen->min_seq[]
    separately for anon and file types as clean file pages can be evicted
    regardless of swap constraints. These three variables are
    monotonically increasing.
    Generation numbers are truncated into order_base_2(MAX_NR_GENS+1)
    bits in order to fit into the gen counter in folio->flags. Each
    truncated generation number is an index to lrugen->lists[]. The
    sliding window technique is used to track at least MIN_NR_GENS and
    at most MAX_NR_GENS generations. The gen counter stores a value
    within [1, MAX_NR_GENS] while a page is on one of lrugen->lists[].
    Otherwise it stores 0.

    There are two conceptually independent procedures: "the aging",
    which produces young generations, and "the eviction", which consumes
    old generations. They form a closed-loop system, i.e., "the page
    reclaim". Both procedures can be invoked from userspace for the
    purposes of working set estimation and proactive reclaim. These
    features are required to optimize job scheduling (bin packing) in
    data centers. The variable size of the sliding window is designed
    for such use cases [1][2].

    To avoid confusion, the terms "hot" and "cold" will be applied to
    the multi-gen LRU, as a new convention; the terms "active" and
    "inactive" will be applied to the active/inactive LRU, as usual.

    The protection of hot pages and the selection of cold pages are
    based on page access channels and patterns. There are two access
    channels: one through page tables and the other through file
    descriptors. The protection of the former channel is by design
    stronger because:
    1. The uncertainty in determining the access patterns of the former
       channel is higher due to the approximation of the accessed bit.
    2. The cost of evicting the former channel is higher due to the TLB
       flushes required and the likelihood of encountering the dirty
       bit.
    3. The penalty of underprotecting the former channel is higher
       because applications usually do not prepare themselves for major
       page faults like they do for blocked I/O. E.g., GUI applications
       commonly use dedicated I/O threads to avoid blocking the
       rendering threads.

    There are also two access patterns: one with temporal locality and
    the other without. For the reasons listed above, the former channel
    is assumed to follow the former pattern unless VM_SEQ_READ or
    VM_RAND_READ is present; the latter channel is assumed to follow the
    latter pattern unless outlying refaults have been observed. The next
    patch will address the "outlying refaults".

    A few macros, i.e., LRU_REFS_*, used later are added in this patch
    to make the patchset less diffy.

    A page is added to the youngest generation on faulting. The aging
    needs to check the accessed bit at least twice before handing this
    page over to the eviction. The first check takes care of the
    accessed bit set on the initial fault; the second check makes sure
    this page has not been used since then. This protocol, AKA second
    chance, requires a minimum of two generations, hence MIN_NR_GENS.
[1] https://dl.acm.org/doi/10.1145/3297858.3304053 [2] https://dl.acm.org/doi/10.1145/3503222.3507731 Signed-off-by: Yu Zhao Acked-by: Brian Geffon Acked-by: Jan Alexander Steffens (heftig) Acked-by: Oleksandr Natalenko Acked-by: Steven Barrett Acked-by: Suleiman Souhlal Tested-by: Daniel Byrne Tested-by: Donald Carr Tested-by: Holger Hoffstätte Tested-by: Konstantin Kharlamov Tested-by: Shuang Zhai Tested-by: Sofia Trinh Tested-by: Vaibhav Jain commit 71b42a8dd96b094c7d8e3b40b17c343e79558dea Author: Yu Zhao Date: Tue Mar 8 19:12:21 2022 -0700 Revert "include/linux/mm_inline.h: fold __update_lru_size() into its sole caller" This patch undoes the following refactor: commit 289ccba18af4 ("include/linux/mm_inline.h: fold __update_lru_size() into its sole caller") The upcoming changes to include/linux/mm_inline.h will reuse __update_lru_size(). Signed-off-by: Yu Zhao Acked-by: Brian Geffon Acked-by: Jan Alexander Steffens (heftig) Acked-by: Oleksandr Natalenko Acked-by: Steven Barrett Acked-by: Suleiman Souhlal Tested-by: Daniel Byrne Tested-by: Donald Carr Tested-by: Holger Hoffstätte Tested-by: Konstantin Kharlamov Tested-by: Shuang Zhai Tested-by: Sofia Trinh Tested-by: Vaibhav Jain commit 59faf72a823f99aafc0d32e4b53cd6f1cd1c3fbe Author: Yu Zhao Date: Tue Mar 8 19:12:20 2022 -0700 mm/vmscan.c: refactor shrink_node() This patch refactors shrink_node() to improve readability for the upcoming changes to mm/vmscan.c. Signed-off-by: Yu Zhao Acked-by: Brian Geffon Acked-by: Jan Alexander Steffens (heftig) Acked-by: Oleksandr Natalenko Acked-by: Steven Barrett Acked-by: Suleiman Souhlal Tested-by: Daniel Byrne Tested-by: Donald Carr Tested-by: Holger Hoffstätte Tested-by: Konstantin Kharlamov Tested-by: Shuang Zhai Tested-by: Sofia Trinh Tested-by: Vaibhav Jain Reviewed-by: Barry Song commit 9e27d98038d5d0e211a67db68e1b10c90494eeca Author: Yu Zhao Date: Tue Mar 8 19:12:19 2022 -0700 mm: x86: add CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG Some architectures support the accessed bit in non-leaf PMD entries, e.g., x86 sets the accessed bit in a non-leaf PMD entry when using it as part of linear address translation [1]. Page table walkers that clear the accessed bit may use this capability to reduce their search space. Note that: 1. Although an inline function is preferable, this capability is added as a configuration option for consistency with the existing macros. 2. Due to the little interest in other varieties, this capability was only tested on Intel and AMD CPUs. [1]: Intel 64 and IA-32 Architectures Software Developer's Manual Volume 3 (June 2021), section 4.8 Signed-off-by: Yu Zhao Acked-by: Brian Geffon Acked-by: Jan Alexander Steffens (heftig) Acked-by: Oleksandr Natalenko Acked-by: Steven Barrett Acked-by: Suleiman Souhlal Tested-by: Daniel Byrne Tested-by: Donald Carr Tested-by: Holger Hoffstätte Tested-by: Konstantin Kharlamov Tested-by: Shuang Zhai Tested-by: Sofia Trinh Tested-by: Vaibhav Jain Reviewed-by: Barry Song commit 7106d489b83950a18d9fdbd53901a992d266654f Author: Yu Zhao Date: Tue Mar 8 19:12:18 2022 -0700 mm: x86, arm64: add arch_has_hw_pte_young() Some architectures automatically set the accessed bit in PTEs, e.g., x86 and arm64 v8.2. On architectures that do not have this capability, clearing the accessed bit in a PTE usually triggers a page fault following the TLB miss of this PTE (to emulate the accessed bit). 
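As a rough illustration of how the non-leaf accessed bit shrinks a walker's search space, consider this self-contained toy model (plain user-space C; the structure and names are invented for illustration and are not the kernel's):

  #include <stdbool.h>
  #include <stdio.h>
  #include <string.h>

  #define PTRS_PER_PMD 512

  /* Toy model: one non-leaf "PMD" accessed bit summarizes 512 "PTEs". */
  struct toy_pmd {
          bool young;                    /* set by hardware on any access below */
          bool pte_young[PTRS_PER_PMD];
  };

  /* Clear accessed bits; skip PMD ranges not touched since the last scan. */
  static unsigned long scan(struct toy_pmd *pmds, unsigned long n)
  {
          unsigned long visited = 0;

          for (unsigned long i = 0; i < n; i++) {
                  if (!pmds[i].young)
                          continue;      /* skip 512 entries at once */
                  pmds[i].young = false;
                  for (int j = 0; j < PTRS_PER_PMD; j++)
                          pmds[i].pte_young[j] = false;
                  visited += PTRS_PER_PMD;
          }
          return visited;
  }

  int main(void)
  {
          struct toy_pmd pmds[16];

          memset(pmds, 0, sizeof(pmds));
          pmds[3].young = true;          /* only one range was accessed */
          pmds[3].pte_young[42] = true;

          printf("visited %lu of %d PTEs\n", scan(pmds, 16), 16 * PTRS_PER_PMD);
          return 0;
  }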
Being aware of this capability can help make better decisions, e.g., whether to spread the work out over a period of time to reduce bursty page faults when trying to clear the accessed bit in many PTEs. Note that theoretically this capability can be unreliable, e.g., hotplugged CPUs might be different from builtin ones. Therefore it should not be used in architecture-independent code that involves correctness, e.g., to determine whether TLB flushes are required (in combination with the accessed bit). Signed-off-by: Yu Zhao Acked-by: Brian Geffon Acked-by: Jan Alexander Steffens (heftig) Acked-by: Oleksandr Natalenko Acked-by: Steven Barrett Acked-by: Suleiman Souhlal Acked-by: Will Deacon Tested-by: Daniel Byrne Tested-by: Donald Carr Tested-by: Holger Hoffstätte Tested-by: Konstantin Kharlamov Tested-by: Shuang Zhai Tested-by: Sofia Trinh Tested-by: Vaibhav Jain Reviewed-by: Barry Song commit d85499c5302d7bca405cc3cdc0db5bbbd6186d32 Author: Stephan Mueller Date: Tue Sep 28 17:41:57 2021 +0200 char/lrng: add power-on and runtime self-tests Parts of the LRNG are already covered by self-tests, including: * Self-test of the SP800-90A DRBG provided by the Linux kernel crypto API. * Self-test of the PRNG provided by the Linux kernel crypto API. * Raw noise source data testing, including SP800-90B compliant tests, when CONFIG_LRNG_HEALTH_TESTS is enabled. This patch adds the self-tests for the remaining critical functions of the LRNG that are essential to maintain entropy and provide cryptographically strong random numbers. The following self-tests are implemented: * Self-test of the time array maintenance. This test verifies whether the time stamp array management that stores multiple values in one integer implements a concatenation of the data. * Self-test of the software hash implementation, ensuring that this function operates in compliance with the FIPS 180-4 specification. The self-test performs a hash operation over a zeroized per-CPU data array. * Self-test of the ChaCha20 DRNG, based on the self-tests that are already present and implemented with the stand-alone user space ChaCha20 DRNG implementation available at [1]. The self-tests cover different use cases of the DRNG seeded with known seed data. The status of the LRNG self-tests is provided with the selftest_status SysFS file. If the file contains a zero, the self-tests passed. The value 0xffffffff means that the self-tests were not executed. Any other value indicates a self-test failure. The self-test code may be compiled to panic the system if a self-test fails. All self-tests operate on private state data structures. This implies that none of the self-tests have any impact on the regular LRNG operations. This allows the self-tests to be repeated at runtime by writing anything into the selftest_status SysFS file. [1] https://www.chronox.de/chacha20.html CC: Torsten Duwe CC: "Eric W. Biederman" CC: "Alexander E. Patrakov" CC: "Ahmed S. Darwish" CC: "Theodore Y. Ts'o"
Ts'o" CC: Willy Tarreau CC: Matthew Garrett CC: Vito Caputo CC: Andreas Dilger CC: Jan Kara CC: Ray Strode CC: William Jon McCann CC: zhangjs CC: Andy Lutomirski CC: Florian Weimer CC: Lennart Poettering CC: Nicolai Stange CC: Marcelo Henrique Cerri CC: Neil Horman Reviewed-by: Alexander Lobakin Tested-by: Alexander Lobakin Tested-by: Jirka Hladky Reviewed-by: Jirka Hladky Signed-off-by: Stephan Mueller commit dcc8a77115e5fc627e7b3e91a55d3afb1df2b900 Author: Stephan Mueller Date: Mon Oct 18 20:55:51 2021 +0200 char/lrng: add interface for gathering of raw entropy The test interface allows a privileged process to capture the raw unconditioned noise that is collected by the LRNG for statistical analysis. Such testing allows the analysis how much entropy the interrupt noise source provides on a given platform. Extracted noise data is not used to seed the LRNG. This is a test interface and not appropriate for production systems. Yet, the interface is considered to be sufficiently secured for production systems. Access to the data is given through the lrng_raw debugfs file. The data buffer should be multiples of sizeof(u32) to fill the entire buffer. Using the option lrng_testing.boot_test=1 the raw noise of the first 1000 entropy events since boot can be sampled. This test interface allows generating the data required for analysis whether the LRNG is in compliance with SP800-90B sections 3.1.3 and 3.1.4. In addition, the test interface allows gathering of the concatenated raw entropy data to verify that the concatenation works appropriately. This includes sampling of the following raw data: * high-resolution time stamp * Jiffies * IRQ number * IRQ flags * return instruction pointer * interrupt register state * array logic batching the high-resolution time stamp * enabling the runtime configuration of entropy source entropy rates Also, a testing interface to support ACVT of the hash implementation is provided. The reason why only hash testing is supported (as opposed to also provide testing for the DRNG) is the fact that the LRNG software hash implementation contains glue code that may warrant testing in addition to the testing of the software ciphers via the kernel crypto API. Also, for testing the CTR-DRBG, the underlying AES implementation would need to be tested. However, such AES test interface cannot be provided by the LRNG as it has no means to access the AES operation. Finally, the execution duration for processing a time stamp can be obtained with the LRNG raw entropy interface. If a test interface is not compiled, its code is a noop which has no impact on the performance. CC: Torsten Duwe CC: "Eric W. Biederman" CC: "Alexander E. Patrakov" CC: "Ahmed S. Darwish" CC: "Theodore Y. 
Ts'o" CC: Willy Tarreau CC: Matthew Garrett CC: Vito Caputo CC: Andreas Dilger CC: Jan Kara CC: Ray Strode CC: William Jon McCann CC: zhangjs CC: Andy Lutomirski CC: Florian Weimer CC: Lennart Poettering CC: Nicolai Stange Reviewed-by: Alexander Lobakin Tested-by: Alexander Lobakin Reviewed-by: Roman Drahtmueller Tested-by: Marcelo Henrique Cerri Tested-by: Neil Horman Tested-by: Jirka Hladky Reviewed-by: Jirka Hladky Signed-off-by: Stephan Mueller commit 33b609e59b861da9b5dd657425c03c69ceb4aa00 Author: Stephan Mueller Date: Tue Sep 28 18:10:36 2021 +0200 char/lrng: add SP800-90B compliant health tests Implement health tests for LRNG's slow noise sources as mandated by SP-800-90B The file contains the following health tests: - stuck test: The stuck test calculates the first, second and third discrete derivative of the time stamp to be processed by the hash for the per-CPU entropy pool. Only if all three values are non-zero, the received time delta is considered to be non-stuck. - SP800-90B Repetition Count Test (RCT): The LRNG uses an enhanced version of the RCT specified in SP800-90B section 4.4.1. Instead of counting identical back-to-back values, the input to the RCT is the counting of the stuck values during the processing of received interrupt events. The RCT is applied with alpha=2^-30 compliant to the recommendation of FIPS 140-2 IG 9.8. During the counting operation, the LRNG always calculates the RCT cut-off value of C. If that value exceeds the allowed cut-off value, the LRNG will trigger the health test failure discussed below. An error is logged to the kernel log that such RCT failure occurred. This test is only applied and enforced in FIPS mode, i.e. when the kernel compiled with CONFIG_CONFIG_FIPS is started with fips=1. - SP800-90B Adaptive Proportion Test (APT): The LRNG implements the APT as defined in SP800-90B section 4.4.2. The applied significance level again is alpha=2^-30 compliant to the recommendation of FIPS 140-2 IG 9.8. The aforementioned health tests are applied to the first 1,024 time stamps obtained from interrupt events. In case one error is identified for either the RCT, or the APT, the collected entropy is invalidated and the SP800-90B startup health test is restarted. As long as the SP800-90B startup health test is not completed, all LRNG random number output interfaces that may block will block and not generate any data. This implies that only those potentially blocking interfaces are defined to provide random numbers that are seeded with the interrupt noise source being SP800-90B compliant. All other output interfaces will not be affected by the SP800-90B startup test and thus are not considered SP800-90B compliant. At runtime, the SP800-90B APT and RCT are applied to each time stamp generated for a received interrupt. 
When either the APT or the RCT indicates a noise source failure, the LRNG is reset to the state it has immediately after boot: - all entropy counters are set to zero - the SP800-90B startup tests are re-performed, which implies that getrandom(2) would block again until new entropy was collected To summarize, the following rules apply: • SP800-90B compliant output interfaces - /dev/random - getrandom(2) system call - get_random_bytes kernel-internal interface when being triggered by the callback registered with add_random_ready_callback • SP800-90B non-compliant output interfaces - /dev/urandom - get_random_bytes kernel-internal interface called directly - randomize_page kernel-internal interface - get_random_u32 and get_random_u64 kernel-internal interfaces - get_random_u32_wait, get_random_u64_wait, get_random_int_wait, and get_random_long_wait kernel-internal interfaces If either the RCT or the APT health test fails, irrespective of whether this happens during initialization or runtime, the following actions occur: 1. The entropy of the entire entropy pool is invalidated. 2. All DRNGs are reset, which implies that they are treated as being unseeded and require a reseed during the next invocation. 3. The SP800-90B startup health tests are initiated with all implications of the startup tests. That implies that from that point on, new events must be observed and their entropy must be inserted into the entropy pool before random numbers are calculated from the entropy pool. Further details on the SP800-90B compliance and the availability of all test tools required to perform all tests mandated by SP800-90B are provided at [1]. The entire health testing code is compile-time configurable. The patch provides a CONFIG_BROKEN configuration of the APT / RCT cutoff values which has a high likelihood of triggering the health test failure. The BROKEN APT cutoff is set to the exact mean of the expected value if the time stamps are equally distributed (512 time stamps divided by 16 possible values due to using the 4 LSB of the time stamp). The BROKEN RCT cutoff value is set to 1, which is likely to be triggered during regular operation. CC: Torsten Duwe CC: "Eric W. Biederman" CC: "Alexander E. Patrakov" CC: "Ahmed S. Darwish" CC: "Theodore Y. Ts'o" CC: Willy Tarreau CC: Matthew Garrett CC: Vito Caputo CC: Andreas Dilger CC: Jan Kara CC: Ray Strode CC: William Jon McCann CC: zhangjs CC: Andy Lutomirski CC: Florian Weimer CC: Lennart Poettering CC: Nicolai Stange Reviewed-by: Alexander Lobakin Tested-by: Alexander Lobakin Reviewed-by: Roman Drahtmueller Tested-by: Marcelo Henrique Cerri Tested-by: Neil Horman Tested-by: Jirka Hladky Reviewed-by: Jirka Hladky Signed-off-by: Stephan Mueller commit 86155b3d21c54ffc70579889fcdb36059adfd5b5 Author: Stephan Mueller Date: Wed Oct 13 22:53:51 2021 +0200 char/lrng: add Jitter RNG fast noise source The Jitter RNG fast noise source implemented as part of the kernel crypto API is queried for 256 bits of entropy at the time the seed buffer managed by the LRNG is about to be filled. CC: Torsten Duwe CC: "Eric W. Biederman" CC: "Alexander E. Patrakov" CC: "Ahmed S. Darwish" CC: "Theodore Y. Ts'o"
Ts'o" CC: Willy Tarreau CC: Matthew Garrett CC: Vito Caputo CC: Andreas Dilger CC: Jan Kara CC: Ray Strode CC: William Jon McCann CC: zhangjs CC: Andy Lutomirski CC: Florian Weimer CC: Lennart Poettering CC: Nicolai Stange Reviewed-by: Alexander Lobakin Tested-by: Alexander Lobakin Reviewed-by: Marcelo Henrique Cerri Tested-by: Marcelo Henrique Cerri Tested-by: Neil Horman Tested-by: Jirka Hladky Reviewed-by: Jirka Hladky Signed-off-by: Stephan Mueller commit 3a1d4d80393262a50fddacd90a7bfea45d471ce7 Author: Stephan Mueller Date: Thu Mar 17 21:20:40 2022 +0100 crypto: move Jitter RNG header include dir To support the LRNG operation which uses the Jitter RNG separately from the kernel crypto API, the header file must be accessible to the LRNG code. CC: Torsten Duwe CC: "Eric W. Biederman" CC: "Alexander E. Patrakov" CC: "Ahmed S. Darwish" CC: "Theodore Y. Ts'o" CC: Willy Tarreau CC: Matthew Garrett CC: Vito Caputo CC: Andreas Dilger CC: Jan Kara CC: Ray Strode CC: William Jon McCann CC: zhangjs CC: Andy Lutomirski CC: Florian Weimer CC: Lennart Poettering CC: Nicolai Stange Reviewed-by: Alexander Lobakin Tested-by: Alexander Lobakin Reviewed-by: Roman Drahtmueller Tested-by: Roman Drahtmüller Tested-by: Marcelo Henrique Cerri Tested-by: Neil Horman Tested-by: Jirka Hladky Reviewed-by: Jirka Hladky Signed-off-by: Stephan Mueller commit 4c243c474ce57a63b93077c4500c84173f251800 Author: Stephan Mueller Date: Fri Jun 18 08:10:53 2021 +0200 char/lrng: add kernel crypto API PRNG extension Add runtime-pluggable support for all PRNGs that are accessible via the kernel crypto API, including hardware PRNGs. The PRNG is selected with the module parameter drng_name where the name must be one that the kernel crypto API can resolve into an RNG. This allows using of the kernel crypto API PRNG implementations that provide an interface to hardware PRNGs. Using this extension, the LRNG uses the hardware PRNGs to generate random numbers. An example is the S390 CPACF support providing such a PRNG. The hash is provided by a kernel crypto API SHASH whose digest size complies with the seedsize of the PRNG. CC: Torsten Duwe CC: "Eric W. Biederman" CC: "Alexander E. Patrakov" CC: "Ahmed S. Darwish" CC: "Theodore Y. Ts'o" CC: Willy Tarreau CC: Matthew Garrett CC: Vito Caputo CC: Andreas Dilger CC: Jan Kara CC: Ray Strode CC: William Jon McCann CC: zhangjs CC: Andy Lutomirski CC: Florian Weimer CC: Lennart Poettering CC: Nicolai Stange Reviewed-by: Alexander Lobakin Tested-by: Alexander Lobakin Reviewed-by: Marcelo Henrique Cerri Reviewed-by: Roman Drahtmueller Tested-by: Marcelo Henrique Cerri Tested-by: Neil Horman Tested-by: Jirka Hladky Reviewed-by: Jirka Hladky Signed-off-by: Stephan Mueller commit 0ec706165f73c5258b9729e875ff851aa3c6714a Author: Stephan Mueller Date: Fri Jun 18 08:09:59 2021 +0200 char/lrng: add SP800-90A DRBG extension Using the LRNG switchable DRNG support, the SP800-90A DRBG extension is implemented. The DRBG uses the kernel crypto API DRBG implementation. In addition, it uses the kernel crypto API SHASH support to provide the hashing operation. The DRBG supports the choice of either a CTR DRBG using AES-256, HMAC DRBG with SHA-512 core or Hash DRBG with SHA-512 core. The used core can be selected with the module parameter lrng_drbg_type. The default is the CTR DRBG. 
When compiling the DRBG extension statically, the DRBG is loaded at late_initcall stage which implies that with the start of user space, the user space interfaces of getrandom(2), /dev/random and /dev/urandom provide random data produced by an SP800-90A DRBG. CC: Torsten Duwe CC: "Eric W. Biederman" CC: "Alexander E. Patrakov" CC: "Ahmed S. Darwish" CC: "Theodore Y. Ts'o" CC: Willy Tarreau CC: Matthew Garrett CC: Vito Caputo CC: Andreas Dilger CC: Jan Kara CC: Ray Strode CC: William Jon McCann CC: zhangjs CC: Andy Lutomirski CC: Florian Weimer CC: Lennart Poettering CC: Nicolai Stange Reviewed-by: Alexander Lobakin Tested-by: Alexander Lobakin Reviewed-by: Roman Drahtmueller Tested-by: Marcelo Henrique Cerri Tested-by: Neil Horman Tested-by: Jirka Hladky Reviewed-by: Jirka Hladky Signed-off-by: Stephan Mueller commit 73d160043e10339200377333bf9948300bd0440c Author: Stephan Mueller Date: Tue Sep 15 22:17:43 2020 +0200 crypto: drbg: externalize DRBG functions for LRNG This patch allows several DRBG functions to be called by the LRNG kernel code paths outside the drbg.c file. CC: Torsten Duwe CC: "Eric W. Biederman" CC: "Alexander E. Patrakov" CC: "Ahmed S. Darwish" CC: "Theodore Y. Ts'o" CC: Willy Tarreau CC: Matthew Garrett CC: Vito Caputo CC: Andreas Dilger CC: Jan Kara CC: Ray Strode CC: William Jon McCann CC: zhangjs CC: Andy Lutomirski CC: Florian Weimer CC: Lennart Poettering CC: Nicolai Stange Reviewed-by: Alexander Lobakin Tested-by: Alexander Lobakin Reviewed-by: Roman Drahtmueller Tested-by: Roman Drahtmüller Tested-by: Marcelo Henrique Cerri Tested-by: Neil Horman Tested-by: Jirka Hladky Reviewed-by: Jirka Hladky Signed-off-by: Stephan Mueller commit 2f0ab3165e259e5e9de5fb7c018d14dc9f4669c9 Author: Stephan Mueller Date: Fri Jun 18 08:08:20 2021 +0200 char/lrng: add common generic hash support The LRNG switchable DRNG support also allows the replacement of the hash implementation used as conditioning component. The common generic hash support code provides the required callbacks using the synchronous hash implementations of the kernel crypto API. All synchronous hash implementations supported by the kernel crypto API can be used as part of the LRNG with this generic support. The generic support is intended to be configured by separate switchable DRNG backends. CC: Torsten Duwe CC: "Eric W. Biederman" CC: "Alexander E. Patrakov" CC: "Ahmed S. Darwish" CC: "Theodore Y. Ts'o" CC: Willy Tarreau CC: Matthew Garrett CC: Vito Caputo CC: Andreas Dilger CC: Jan Kara CC: Ray Strode CC: William Jon McCann CC: zhangjs CC: Andy Lutomirski CC: Florian Weimer CC: Lennart Poettering CC: Nicolai Stange CC: "Peter, Matthias" CC: Marcelo Henrique Cerri CC: Neil Horman Reviewed-by: Alexander Lobakin Tested-by: Alexander Lobakin Tested-by: Jirka Hladky Reviewed-by: Jirka Hladky Signed-off-by: Stephan Mueller commit 183a2f7658b49eb85ba526345a5cb37b5701fe25 Author: Stephan Mueller Date: Fri Oct 1 20:47:30 2021 +0200 char/lrng: add switchable DRNG support The DRNG switch support allows replacing the DRNG mechanism of the LRNG. The switching support rests on the interface definition of include/linux/lrng.h. A new DRNG is implemented by filling in the interface defined in this header file. In addition to the DRNG, the extension also has to provide a hash implementation that is used to hash the entropy pool for random number extraction. Note: It is permissible to implement a DRNG whose operations may sleep. However, the hash function must not sleep. 
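To make "filling in the interface" concrete, here is a skeleton of such an extension. Only lrng_set_drng_cb and include/linux/lrng.h are named in this message; the callback structure's tag and contents below are assumptions for illustration, and that header is authoritative:

  /* Sketch of a switchable DRNG extension; see include/linux/lrng.h for
   * the real interface definition -- the struct contents here are
   * assumed for illustration only. */
  #include <linux/lrng.h>
  #include <linux/module.h>

  static const struct lrng_crypto_cb my_drng_cb = {
          /* DRNG callbacks: allocate/deallocate state, seed, generate --
           * these may sleep. Hash callbacks: must not sleep and must not
           * maintain a separate state. */
  };

  /* late_initcall: available when user space starts, after all other
   * initialization has completed. */
  static int __init my_drng_init(void)
  {
          return lrng_set_drng_cb(&my_drng_cb);
  }

  static void __exit my_drng_exit(void)
  {
          lrng_set_drng_cb(NULL); /* unload the extension */
  }

  late_initcall(my_drng_init);
  module_exit(my_drng_exit);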
The switchable DRNG support allows replacing the DRNG at runtime. However, only one DRNG extension is allowed to be loaded at any given time. Before replacing it with another DRNG implementation, the possibly existing DRNG extension must be unloaded. The switchable DRNG extension activates the new DRNG during load time. It is expected, however, that such a DRNG switch would be done only once by an administrator to load the intended DRNG implementation. It is permissible to compile DRNG extensions either as kernel modules or statically. The initialization of the DRNG extension should be performed with a late_initcall to ensure the extension is available when user space starts but after all other initialization has completed. The initialization is performed by registering the function call data structure with the lrng_set_drng_cb function. In order to unload the DRNG extension, lrng_set_drng_cb must be invoked with the NULL parameter. The DRNG extension should always provide a security strength that is at least as strong as LRNG_DRNG_SECURITY_STRENGTH_BITS. The hash extension must not sleep and must not maintain a separate state. CC: Torsten Duwe CC: "Eric W. Biederman" CC: "Alexander E. Patrakov" CC: "Ahmed S. Darwish" CC: "Theodore Y. Ts'o" CC: Willy Tarreau CC: Matthew Garrett CC: Vito Caputo CC: Andreas Dilger CC: Jan Kara CC: Ray Strode CC: William Jon McCann CC: zhangjs CC: Andy Lutomirski CC: Florian Weimer CC: Lennart Poettering CC: Nicolai Stange Reviewed-by: Alexander Lobakin Tested-by: Alexander Lobakin Reviewed-by: Marcelo Henrique Cerri Reviewed-by: Roman Drahtmueller Tested-by: Marcelo Henrique Cerri Tested-by: Neil Horman Tested-by: Jirka Hladky Reviewed-by: Jirka Hladky Signed-off-by: Stephan Mueller commit cff16e8705f514a52d8babca73f40080dfe99801 Author: Stephan Mueller Date: Wed Dec 29 20:23:37 2021 +0100 char/lrng: CPU entropy source Certain CPUs provide instructions giving access to an entropy source (e.g. RDSEED on Intel/AMD, DARN on POWER, etc.). The LRNG can utilize that entropy source to seed its DRNG. CC: Torsten Duwe CC: "Eric W. Biederman" CC: "Alexander E. Patrakov" CC: "Ahmed S. Darwish" CC: "Theodore Y. Ts'o" CC: Willy Tarreau CC: Matthew Garrett CC: Vito Caputo CC: Andreas Dilger CC: Jan Kara CC: Ray Strode CC: William Jon McCann CC: zhangjs CC: Andy Lutomirski CC: Florian Weimer CC: Lennart Poettering CC: Nicolai Stange Reviewed-by: Alexander Lobakin Tested-by: Alexander Lobakin Mathematical aspects Reviewed-by: "Peter, Matthias" Reviewed-by: Marcelo Henrique Cerri Reviewed-by: Roman Drahtmueller Tested-by: Marcelo Henrique Cerri Tested-by: Neil Horman Tested-by: Jirka Hladky Reviewed-by: Jirka Hladky Signed-off-by: Stephan Mueller commit db5e3433900eb60dd2c89493a540003732a9024f Author: Stephan Mueller Date: Tue Sep 28 17:05:50 2021 +0200 char/lrng: allocate one DRNG instance per NUMA node In order to improve NUMA-locality when serving getrandom(2) requests, allocate one DRNG instance per node. The DRNG instance that is present right from the start of the kernel is reused as the first per-NUMA-node DRNG. For all remaining online NUMA nodes a new DRNG instance is allocated. During boot time, the multiple DRNG instances are seeded sequentially. With this, the first DRNG instance (referenced as the initial DRNG in the code) is completely seeded with 256 bits of entropy before the next DRNG instance is completely seeded. When random numbers are requested, the NUMA-node-local DRNG is checked to determine whether it has already been fully seeded.
If this is not the case, the initial DRNG is used to serve the request. CC: Torsten Duwe CC: "Eric W. Biederman" CC: "Alexander E. Patrakov" CC: "Ahmed S. Darwish" CC: "Theodore Y. Ts'o" CC: Willy Tarreau CC: Matthew Garrett CC: Vito Caputo CC: Andreas Dilger CC: Jan Kara CC: Ray Strode CC: William Jon McCann CC: zhangjs CC: Andy Lutomirski CC: Florian Weimer CC: Lennart Poettering CC: Nicolai Stange CC: Eric Biggers Reviewed-by: Alexander Lobakin Tested-by: Alexander Lobakin Reviewed-by: Marcelo Henrique Cerri Reviewed-by: Roman Drahtmueller Tested-by: Marcelo Henrique Cerri Tested-by: Neil Horman Tested-by: Jirka Hladky Reviewed-by: Jirka Hladky Signed-off-by: Stephan Mueller commit 7601a99bc5a0489be857abd7f4802a213a33b402 Author: Stephan Mueller Date: Thu Mar 17 21:18:48 2022 +0100 char/lrng: sysctls and /proc interface The LRNG sysctl interface provides the same controls as the existing /dev/random implementation. These sysctls behave identically and are implemented identically. The goal is to allow a possible merge of the existing /dev/random implementation with this implementation, which implies that this patch tries to keep a very close similarity. Yet, all sysctls are documented at [1]. In addition, it provides the file lrng_type which provides details about the LRNG: - the name of the DRNG that produces the random numbers for /dev/random, /dev/urandom, getrandom(2) - the hash used to produce random numbers from the entropy pool - the number of secondary DRNG instances - indicator whether the LRNG operates SP800-90B compliant - indicator whether a high-resolution timer is identified - only with a high-resolution timer will the interrupt noise source deliver sufficient entropy - indicator whether the LRNG has been minimally seeded (i.e. is the secondary DRNG seeded with at least 128 bits of entropy) - indicator whether the LRNG has been fully seeded (i.e. is the secondary DRNG seeded with at least 256 bits of entropy) [1] https://www.chronox.de/lrng.html CC: Torsten Duwe CC: "Eric W. Biederman" CC: "Alexander E. Patrakov" CC: "Ahmed S. Darwish" CC: "Theodore Y. Ts'o" CC: Willy Tarreau CC: Matthew Garrett CC: Vito Caputo CC: Andreas Dilger CC: Jan Kara CC: Ray Strode CC: William Jon McCann CC: zhangjs CC: Andy Lutomirski CC: Florian Weimer CC: Lennart Poettering CC: Nicolai Stange Reviewed-by: Alexander Lobakin Tested-by: Alexander Lobakin Reviewed-by: Marcelo Henrique Cerri Reviewed-by: Roman Drahtmueller Tested-by: Marcelo Henrique Cerri Tested-by: Neil Horman Tested-by: Jirka Hladky Reviewed-by: Jirka Hladky Signed-off-by: Stephan Mueller commit b1ff737d62c28dd47788b80fdaa531bf59c04883 Author: Stephan Mueller Date: Thu Mar 17 21:17:52 2022 +0100 char/lrng: IRQ entropy source The interrupt entropy source hooks into the interrupt handler via the add_interrupt_randomness function callback. Every interrupt received by the kernel is also sent to the LRNG for processing. The IRQ entropy source performs the following processing: 1. Record a time stamp. 2. Divide the time stamp by its greatest common divisor to eliminate fixed least significant bits. 3. Insert the 8 LSB of the result from step 2 into the collection pool. 4. When the collection pool is full, it is hashed into the per-CPU entropy pool (if continuous compression is enabled) or the latest time stamps overwrite the oldest entries in the collection pool. If entropy is requested from the IRQ entropy pool, a message digest over all per-CPU entropy pool digests is calculated.
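A compact sketch of steps 1 through 3 of that processing follows (illustrative user-space C with invented names; the 64-entry collection size is taken from the "64 8-bit words" data array mentioned later in this log):

  #include <stdint.h>
  #include <stdio.h>

  #define COLLECTION_SIZE 64 /* per-CPU collection pool of 8-bit words */

  static uint8_t  collection[COLLECTION_SIZE];
  static unsigned fill;

  /* Step 1 happens in the caller: record a time stamp. Step 2: strip fixed
   * least significant bits by dividing by the GCD. Step 3: keep the 8 LSB
   * and insert them into the collection pool; the wrap-around models the
   * oldest entries being overwritten. */
  static void collect_irq_ts(uint32_t time_stamp, uint32_t gcd)
  {
          collection[fill++ % COLLECTION_SIZE] = (uint8_t)(time_stamp / gcd);
          if (fill % COLLECTION_SIZE == 0) {
                  /* step 4 would hash the full pool into the per-CPU
                   * entropy pool here */
          }
  }

  int main(void)
  {
          for (uint32_t i = 0; i < 200; i++)
                  collect_irq_ts(1000003u * i, 3u); /* fabricated inputs */
          printf("collected %u samples\n", fill);
          return 0;
  }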
The GCD calculation is performed for the first 100 interrupt time stamps. Until the GCD value is calculated, the full 32 bit time stamp is inserted into the collection pool. CC: Torsten Duwe CC: "Eric W. Biederman" CC: "Alexander E. Patrakov" CC: "Ahmed S. Darwish" CC: "Theodore Y. Ts'o" CC: Willy Tarreau CC: Matthew Garrett CC: Vito Caputo CC: Andreas Dilger CC: Jan Kara CC: Ray Strode CC: William Jon McCann CC: zhangjs CC: Andy Lutomirski CC: Florian Weimer CC: Lennart Poettering CC: Nicolai Stange Reviewed-by: Alexander Lobakin Tested-by: Alexander Lobakin Mathematical aspects Reviewed-by: "Peter, Matthias" Reviewed-by: Marcelo Henrique Cerri Reviewed-by: Roman Drahtmueller Tested-by: Marcelo Henrique Cerri Tested-by: Neil Horman Tested-by: Jirka Hladky Reviewed-by: Jirka Hladky Signed-off-by: Stephan Mueller commit e090bd0e82df44cb3d3732249edb7c01b6639b15 Author: Stephan Mueller Date: Sun Oct 17 20:23:04 2021 +0200 drivers/char: Introduce the Linux Random Number Generator In an effort to provide a flexible implementation of a random number generator that also delivers entropy during early boot time, allow replacement of the deterministic random number generation mechanism, implement the various components in separate code for easier maintenance, and provide compliance with SP800-90[A|B|C], introduce the Linux Random Number Generator (LRNG) framework. The LRNG framework provides a flexible random number generator which allows developers and system integrators to achieve different goals by ensuring that each solution establishes a secure state. The general design is as follows. Additional implementation details are given in [1]. The LRNG consists of the following components: 1. The LRNG implements a DRNG. The DRNG always generates the requested amount of output. When using the SP800-90A terminology it operates without prediction resistance. The DRNG maintains a counter of how many bytes were generated since the last re-seed and a timer of the elapsed time since the last re-seed. If either the counter or the timer reaches a threshold, the DRNG is seeded from the entropy pool. In case the Linux kernel detects a NUMA system, one DRNG instance per NUMA node is maintained. 2. The DRNG is seeded by concatenating the data from the following sources which deliver data and are credited with entropy if enabled: (a) the output of the IRQ per-CPU entropy pools, (b) the auxiliary entropy pool, (c) the Jitter RNG if available and enabled, and (d) the CPU-based noise source such as Intel RDSEED. The entropy estimates of the data of all noise sources are added to form the entropy estimate of the data used to seed the DRNG with. The LRNG ensures, however, that the entropy credited to the DRNG after seeding is at most the security strength of the DRNG. The LRNG is designed such that none of these noise sources can dominate the other noise sources in providing seed data to the DRNG, due to the following: (a) During boot time, the amounts of entropy received at the different entropy sources are the trigger points to (re)seed the DRNG. (b) At runtime, the available entropy from the slow noise source is concatenated with a pre-defined amount of data from the fast noise sources. In addition, each DRNG reseed operation triggers external noise source providers to deliver one block of data. 3. The IRQ entropy pool collects noise data from interrupt timing. Any data received by the LRNG from the interrupt noise sources is inserted into a per-CPU entropy pool using a hash operation that can be changed during runtime.
Per default, SHA-256 is used. (a) When an interrupt occurs, the 8 least significant bits of the high-resolution time stamp divided by the greatest common divisor (GCD) are mixed into the per-CPU entropy pool. This time stamp is credited with heuristically implied entropy. (b) HID event data like the key stroke or the mouse coordinates are mixed into the per-CPU entropy pool. This data is not credited with entropy by the LRNG. 5. Any data provided from user space by either writing to /dev/random, /dev/urandom or the IOCTL of RNDADDENTROPY on both device files is always injected into the auxiliary pool. Also, device drivers may provide data that is mixed into an auxiliary pool using the same hash that is used to process the per-CPU entropy pool. This data is not credited with entropy by the LRNG. In addition, when a hardware random number generator covered by the Linux kernel HW generator framework wants to deliver random numbers, it is injected into the auxiliary pool as well. The HW generator noise source is handled separately from the other noise sources because the HW generator framework may decide by itself when to deliver data, whereas the other noise sources are always asked for data as driven by the LRNG operation. Similarly, any user-space-provided data is inserted into the entropy pool. When seed data for the DRNG is to be generated, all per-CPU entropy pools are hashed. The message digest forms the data used for seeding the DRNG. To speed up the interrupt handling code of the LRNG, the time stamp collected for an interrupt event is divided by the greatest common divisor to eliminate fixed low bits and then truncated to the 8 least significant bits. 1024 truncated time stamps are concatenated and then jointly inserted into the per-CPU entropy pool. During boot time, until the fully seeded stage is reached, the 32 least significant bits of each time stamp are concatenated. When 1024/32 = 32 such events are received, they are injected into the per-CPU entropy pool. The LRNG allows the DRNG mechanism to be changed at runtime. Per default, a ChaCha20-based DRNG is used. The ChaCha20-DRNG implemented for the LRNG is also provided as a stand-alone user space deterministic random number generator. The LRNG also offers an SP800-90A DRBG based on the Linux kernel crypto API DRBG implementation. The processing of entropic data from the noise source before injecting it into the DRNG is performed with the following mathematical operations: 1. Truncation: The received time stamps divided by the GCD are truncated to the 8 least significant bits (or the 32 least significant bits during boot time). 2. Concatenation: The received and truncated time stamps as well as auxiliary 32-bit words are concatenated to fill the per-CPU data array that is capable of holding 64 8-bit words. 3. Hashing: A set of concatenated time stamp data received from the interrupts is hashed together with the existing per-CPU entropy pool state. The resulting message digest is the new per-CPU entropy pool state. 4. Hashing: When new data is added to the auxiliary pool, the data is hashed together with the auxiliary pool to form a new auxiliary pool state. 5. Hashing: A message digest of all per-CPU entropy pools and the auxiliary pool is calculated, which forms the new auxiliary pool state.
6. Truncation: The most significant bits (MSB), defined by the requested number of bits (commonly equal to the security strength of the DRBG) or the available entropy transported with the buffer (which is the minimum of the message digest size and the available entropy in all entropy pools and the auxiliary pool), whichever is smaller, are obtained from the slow noise source output buffer. 7. Concatenation: The temporary seed buffer used to seed the DRNG is a concatenation of the slow noise source buffer, the Jitter RNG output, the CPU noise source output, and the current time. The DRNG always tries to seed itself with 256 bits of entropy, except during boot. In any case, if the noise sources cannot deliver that amount, the available entropy is used and the DRNG keeps track of how much entropy it was seeded with. The entropy implied by the LRNG available in the entropy pool may be too conservative. To ensure that during boot time all available entropy from the entropy pool is transferred to the DRNG, the hash_df function always generates 256 data bits during boot to seed the DRNG. During boot, the DRNG is seeded as follows: 1. The DRNG is reseeded from the entropy sources if the entropy sources collectively have at least 32 bits of entropy. The goal of this step is to ensure that the DRNG receives some initial entropy as early as possible. 2. The DRNG is reseeded from the entropy sources if all entropy sources collectively can provide at least 128 bits of entropy. 3. The DRNG is reseeded from the entropy sources if all entropy sources collectively can provide at least 256 bits. At the time of the reseeding steps, the DRNG requests as much entropy as is available in order to skip certain steps and reach the seeding level of 256 bits. This may imply that one or more of the aforementioned steps are skipped. Before the DRNG is seeded with 256 bits of entropy in step 3, requests for random data from /dev/random and the getrandom system call are not processed. The reseeding of the DRNG always ensures that all entropy sources collectively can deliver at least 128 entropy bits during runtime once the DRNG is fully seeded. The DRNG operates as a deterministic random number generator with the following properties: * The maximum number of random bytes that can be generated with one DRNG generate operation is limited to 4096 bytes. When longer random numbers are requested, multiple DRNG generate operations are performed. The ChaCha20 DRNG as well as the SP800-90A DRBGs implement an update of their state after completing a generate request for backtracking resistance. * The DRNG is reseeded with whatever entropy is available - in the worst case where no additional entropy can be provided by the entropy sources, the DRNG is not re-seeded and continues its operation, trying to reseed again after the expiry of one of these thresholds: - If the last reseeding of the DRNG is more than 600 seconds ago, or - 2^20 DRNG generate operations are performed, whichever comes first, or - the DRNG is forced to reseed before the next generation of random numbers if data has been injected into the LRNG by writing data into /dev/random or /dev/urandom. - If the DRNG was not successfully reseeded after 2^30 generate requests, the DRNG reverts to an unseeded stage, implying that the blocking interfaces of /dev/random and getrandom will block again. The chosen values prevent high-volume requests from user space from causing frequent reseeding operations which drag down the performance of the DRNG.
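The reseed triggers enumerated above amount to a small decision function; the following is a sketch with illustrative names, using only the thresholds stated in the text:

  #include <stdbool.h>
  #include <stdint.h>
  #include <stdio.h>

  #define MAX_RESEED_SECS 600          /* elapsed-time threshold */
  #define MAX_RESEED_GENS (1ULL << 20) /* generate-operations threshold */
  #define UNSEEDED_GENS   (1ULL << 30) /* revert to unseeded past this */

  struct drng_state {
          uint64_t last_seed_time;     /* seconds */
          uint64_t gens_since_seed;    /* generate ops since last reseed */
          bool     force_reseed;       /* /dev/random or /dev/urandom written */
  };

  /* Must the DRNG be reseeded before the next generate operation? */
  static bool needs_reseed(const struct drng_state *s, uint64_t now)
  {
          return s->force_reseed ||
                 now - s->last_seed_time > MAX_RESEED_SECS ||
                 s->gens_since_seed >= MAX_RESEED_GENS;
  }

  /* Past 2^30 generate requests without a successful reseed, the DRNG is
   * treated as unseeded again and the blocking interfaces block. */
  static bool reverts_to_unseeded(const struct drng_state *s)
  {
          return s->gens_since_seed >= UNSEEDED_GENS;
  }

  int main(void)
  {
          struct drng_state s = { .last_seed_time = 0, .gens_since_seed = 5 };

          printf("reseed at t=700s? %d\n", needs_reseed(&s, 700));
          printf("unseeded? %d\n", reverts_to_unseeded(&s));
          return 0;
  }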
With the automatic reseeding after 600 seconds, the LRNG is triggered to reseed itself before the first request after a suspend that put the hardware to sleep for longer than 600 seconds. To support smaller devices including IoT environments, this patch allows reducing the runtime memory footprint of the LRNG at compile time by selecting smaller collection data sizes. When compiling a kernel for a small environment, the allocation of a buffer of up to 4096 bytes to serve user space requests is prevented. In this case, a stack variable of 64 bytes is used to serve all user space requests. The LRNG has the following properties: * internal noise source: interrupt timing with fast boot time seeding * high performance of interrupt handling code: The LRNG impact on the interrupt handling has been reduced to a minimum. On one example system, the LRNG interrupt handling code in its fastest configuration executes within an average of 55 cycles whereas the existing /dev/random on the same device takes about 97 cycles when measuring the execution time of add_interrupt_randomness(). * use of an almost never contended lock for the hashing operation to collect raw entropy, supporting concurrency-free use of massively parallel systems - the worst-case rate of contention is the number of DRNG reseeds, usually the number of NUMA nodes' contentions per 5 minutes. * use of a standalone ChaCha20-based RNG with the option to use a different DRNG selectable at compile time * instantiation of one DRNG per NUMA node * support for runtime-switchable output DRNGs * use of a runtime-switchable hash for the conditioning implementation, following a widely accepted approach * compile-time selectable collection size * support of small systems by allowing the reduction of the runtime memory needs Further details, including the rationale for the design choices and properties of the LRNG together with testing, are provided at [1]. In addition, the documentation explains the conducted regression tests to verify that the LRNG is API and ABI compatible with the existing /dev/random implementation. Note, this patch covers the entropy sources manager, the API implementation, the built-in ChaCha20 DRNG and the auxiliary entropy pool. [1] https://www.chronox.de/lrng.html CC: Torsten Duwe CC: "Eric W. Biederman" CC: "Alexander E. Patrakov" CC: "Ahmed S. Darwish" CC: "Theodore Y. Ts'o" CC: Willy Tarreau CC: Matthew Garrett CC: Vito Caputo CC: Andreas Dilger CC: Jan Kara CC: Ray Strode CC: William Jon McCann CC: zhangjs CC: Andy Lutomirski CC: Florian Weimer CC: Lennart Poettering CC: Nicolai Stange Reviewed-by: Alexander Lobakin Tested-by: Alexander Lobakin Mathematical aspects Reviewed-by: "Peter, Matthias" Reviewed-by: Marcelo Henrique Cerri Reviewed-by: Roman Drahtmueller Tested-by: Marcelo Henrique Cerri Tested-by: Neil Horman Tested-by: Jirka Hladky Reviewed-by: Jirka Hladky Signed-off-by: Stephan Mueller commit 54bd86dd89d088591a1a2ac0558913d897ef08be Author: André Almeida Date: Mon Oct 25 09:49:42 2021 -0300 futex: Add entry point for FUTEX_WAIT_MULTIPLE (opcode 31) Add an option to wait on multiple futexes using the old interface that uses opcode 31 through the futex() syscall. Do that by just translating the old interface to use the new code. This allows old and stable versions of Proton to still use fsync in new kernel releases.
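For reference, here is a hedged user-space sketch of invoking the opcode 31 entry point; the futex_wait_block layout below follows the earlier out-of-tree FUTEX_WAIT_MULTIPLE patches and is an assumption here, not something this commit message specifies:

  #include <stdint.h>
  #include <sys/syscall.h>
  #include <time.h>
  #include <unistd.h>

  #define FUTEX_WAIT_MULTIPLE 31 /* opcode added by this patch */

  /* Assumed layout, per the earlier out-of-tree patches used by Proton. */
  struct futex_wait_block {
          uint32_t *uaddr;
          uint32_t val;    /* expected value at uaddr */
          uint32_t bitset;
  };

  /* Wait until any one of the given futexes is woken. */
  static long futex_wait_multiple(struct futex_wait_block *blocks,
                                  unsigned int count,
                                  const struct timespec *timeout)
  {
          /* Old interface: the block array is passed via the uaddr
           * argument and the block count via the val argument. */
          return syscall(SYS_futex, blocks, FUTEX_WAIT_MULTIPLE, count,
                         timeout, NULL, 0);
  }

  int main(void)
  {
          uint32_t word = 0;
          struct futex_wait_block blocks[1] = { { &word, 0, ~0u } };
          struct timespec ts = { .tv_sec = 0, .tv_nsec = 1000000 };

          /* Returns on wake-up or timeout; on kernels without this patch,
           * the call fails with an error instead. */
          return (int)futex_wait_multiple(blocks, 1, &ts);
  }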
Signed-off-by: André Almeida commit b71598eb5518f151daa411f1df3fcdff76a52d19 Author: Alexandre Frade Date: Fri Jun 18 19:10:55 2021 +0000 XANMOD: Makefile: Turn off loop vectorization for GCC -O3 optimization level Signed-off-by: Alexandre Frade commit b1bf0e58f253e91574705f8de1cf0aa2e5db5f7c Author: Alexandre Frade Date: Thu Sep 3 20:36:13 2020 +0000 XANMOD: init/Kconfig: Enable -O3 KBUILD_CFLAGS optimization for all architectures Signed-off-by: Alexandre Frade commit 66732288b3179eaa21400c237831e42246bfcbf1 Author: Alexandre Frade Date: Thu Jun 25 16:40:43 2020 -0300 XANMOD: lib/kconfig.debug: disable default CONFIG_SYMBOLIC_ERRNAME and CONFIG_DEBUG_BUGVERBOSE Signed-off-by: Alexandre Frade commit 12641d2090cbd7105cc2d30e7a319c38fcf1c3ba Author: Alexandre Frade Date: Mon Jan 29 17:41:29 2018 +0000 XANMOD: scripts: disable the localversion "+" tag of a git repo Signed-off-by: Alexandre Frade commit ab0ade86b6934f5b2036979924d23ab9c572c669 Author: Alexandre Frade Date: Tue Mar 31 13:32:08 2020 -0300 XANMOD: cpufreq: tunes ondemand and conservative governor for performance Signed-off-by: Alexandre Frade commit 16ec98feec2a4bc6c3e6aac3bbab65a033e4956f Author: Alexandre Frade Date: Mon Jan 29 17:31:25 2018 +0000 XANMOD: mm/vmscan: vm_swappiness = 30 decreases the amount of swapping Signed-off-by: Alexandre Frade commit 33f482ba6f3fbbc8d47cf5fd45af789c5eca40df Author: Alexandre Frade Date: Thu Aug 13 14:57:06 2020 +0000 XANMOD: sched/autogroup: Add kernel parameter and config option to enable/disable autogroup feature by default Signed-off-by: Alexandre Frade commit bdcf06963cf8c7c5285550ddbe6f9a4adb5ce768 Author: Alexandre Frade Date: Mon Jan 29 16:59:22 2018 +0000 XANMOD: dcache: cache_pressure = 50 decreases the rate at which VFS caches are reclaimed Signed-off-by: Alexandre Frade commit eb7fef2d44cd9ce1ab68415901ea56a56b033330 Author: Alexandre Frade Date: Mon Jan 29 17:26:15 2018 +0000 XANMOD: kconfig: add 500Hz timer interrupt kernel config option Signed-off-by: Alexandre Frade commit 35b03b65b89c7c392972c2b82fee154c6f0cb044 Author: Alexandre Frade Date: Mon Dec 14 16:24:26 2020 +0000 XANMOD: block: set rq_affinity to force full multithreading I/O requests Signed-off-by: Alexandre Frade commit b6c6279d1d9614dfb85063dcfdddb4a8c6acc08d Author: Alexandre Frade Date: Thu Jan 6 16:59:01 2022 +0000 XANMOD: block/mq-deadline: Disable front_merges by default Signed-off-by: Alexandre Frade commit 12cf86a1b23ccb8871d3cea8955916cd48be91cc Author: Alexandre Frade Date: Mon Mar 21 18:20:24 2022 +0000 XANMOD: fair: Remove all energy efficiency functions Signed-off-by: Alexandre Frade