commit 8c839bfa06e51e6409f73894a13f8dfa6d8d8b34
Author: Alexandre Frade <kernel@xanmod.org>
Date:   Tue Aug 31 18:41:03 2021 +0000

    Linux 5.14.0-xanmod1
    
    Signed-off-by: Alexandre Frade <kernel@xanmod.org>

commit 6317b65b9a958a48a4a20e6f32e4e3755370c852
Author: Alexandre Frade <kernel@xanmod.org>
Date:   Thu Aug 12 17:08:06 2021 +0000

    netfilter: Add full cone NAT support
    
    Signed-off-by: Alexandre Frade <kernel@xanmod.org>

commit 74a20708125794e1c0a516230810340b64f86705
Author: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
Date:   Tue Aug 31 16:57:40 2021 +0300

    fs/ntfs3: Restyle comments to better align with kernel-doc
    
    Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>

commit cdba19cfda59563175086a8c5fcd3795cabfbaa6
Author: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
Date:   Tue Aug 31 18:52:39 2021 +0300

    fs/ntfs3: Rework file operations
    
    Rename now works as "add new name, then remove old name".
    The previous order, "remove old name, then add new name", may result in
    a bad inode if we can't add the new name and then can't restore (re-add)
    the old name.
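    
    A minimal sketch of why adding first is safer, using a hypothetical
    in-memory directory (dir_sketch, add_name and rename_add_first are
    illustrative names, not ntfs3 code). If adding the new name fails,
    nothing has been removed yet, so the old name still resolves.

```c
#include <string.h>

#define SLOTS 2
#define NAME_LEN_SKETCH 16

/* Hypothetical directory: an empty string marks a free slot. */
struct dir_sketch {
	char names[SLOTS][NAME_LEN_SKETCH];
};

static int find_name(const struct dir_sketch *d, const char *name)
{
	for (int i = 0; i < SLOTS; i++)
		if (d->names[i][0] && strcmp(d->names[i], name) == 0)
			return i;
	return -1;
}

static int add_name(struct dir_sketch *d, const char *name)
{
	if (!name[0] || strlen(name) >= NAME_LEN_SKETCH)
		return -1;
	for (int i = 0; i < SLOTS; i++) {
		if (!d->names[i][0]) {
			strcpy(d->names[i], name);
			return 0;
		}
	}
	return -1;	/* directory full */
}

/* Add the new name first; remove the old one only after that succeeded. */
int rename_add_first(struct dir_sketch *d, const char *old_name,
		     const char *new_name)
{
	int old_slot = find_name(d, old_name);

	if (old_slot < 0)
		return -1;
	if (add_name(d, new_name) < 0)
		return -1;	/* old name untouched, nothing to restore */
	d->names[old_slot][0] = '\0';
	return 0;
}
```

    With a full directory the rename fails, but the old entry is still
    present, which is the property the reworked order relies on.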
    
    Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>

commit 46a09b9ebba0d5c08edd49e35dd0c6a742e19e88
Author: Kari Argillander <kari.argillander@gmail.com>
Date:   Tue Aug 24 21:20:20 2021 +0300

    fs/ntfs3: Remove fat ioctl's from ntfs3 driver for now
    
    For some reason we have FAT ioctl calls. Even the old ntfs driver did not
    use these. We should not use them because it is hard to get things out
    of the kernel once they are upstream. That's why we remove these for now.
    
    More discussion is needed about which ioctls should be implemented and
    which are important.
    
    Signed-off-by: Kari Argillander <kari.argillander@gmail.com>
    Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>

commit a1508640c47150d3bb3a1dfc0754ec964148b694
Author: Kari Argillander <kari.argillander@gmail.com>
Date:   Tue Aug 3 14:57:09 2021 +0300

    fs/ntfs3: Restyle comments to better align with kernel-doc
    
    Capitalize comments and end them with a period for better readability.
    
    Also, function comments are now a little more kernel-doc style. This way we
    can easily convert them to kernel-doc style if we want. Note that these
    are not yet complete in this style. For example, function comments start
    with /* whereas in kernel-doc style they start with /**.
    
    Use imperative mood in function descriptions.
    
    Change words like ntfs -> NTFS, linux -> Linux.
    
    Use "we" not "I" when commenting code.
    
    Signed-off-by: Kari Argillander <kari.argillander@gmail.com>
    Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>

commit 82380696c7c6c731298b985f7ee9369c09977b5b
Author: Dan Carpenter <dan.carpenter@oracle.com>
Date:   Tue Aug 24 10:51:04 2021 +0300

    fs/ntfs3: Fix error handling in indx_insert_into_root()
    
    There are three bugs in this code:
    1) If indx_get_root() fails, then return -EINVAL instead of success.
    2) On the "/* make root external */" -EOPNOTSUPP; error path it should
       free "re" but it has a memory leak.
    3) If indx_new() fails then it will lead to an error pointer dereference
       when we call put_indx_node().
    
    I've re-written the error handling to be more clear.
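    
    The first two fixes follow the usual "return early, unwind with goto"
    pattern. A minimal userspace sketch of that shape, assuming hypothetical
    stand-ins (node_sketch, get_root_sketch, insert_into_root_sketch) rather
    than the real ntfs3 functions:

```c
#include <errno.h>
#include <stdlib.h>

struct node_sketch {
	int dummy;
};

/* Hypothetical lookup that can fail, standing in for a root lookup. */
static struct node_sketch *get_root_sketch(int ok)
{
	static struct node_sketch root;

	return ok ? &root : NULL;
}

int insert_into_root_sketch(int root_ok, int supported)
{
	struct node_sketch *root;
	void *re;
	int err;

	root = get_root_sketch(root_ok);
	if (!root)
		return -EINVAL;		/* fix 1: report failure, not success */

	re = malloc(64);		/* temporary buffer, like "re" */
	if (!re)
		return -ENOMEM;

	if (!supported) {
		err = -EOPNOTSUPP;
		goto out_free_re;	/* fix 2: free "re" on this path too */
	}

	err = 0;

out_free_re:
	free(re);
	return err;
}
```

    Every exit after the allocation goes through the same label, so the
    buffer cannot leak on any of the error paths.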
    
    Fixes: 82cae269cfa9 ("fs/ntfs3: Add initialization of super block")
    Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
    Reviewed-by: Kari Argillander <kari.argillander@gmail.com>
    Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>

commit ecda7a75d9b7f2f8ed56a1d108faf90d3a813642
Author: Dan Carpenter <dan.carpenter@oracle.com>
Date:   Tue Aug 24 10:50:15 2021 +0300

    fs/ntfs3: Potential NULL dereference in hdr_find_split()
    
    The "e" pointer is dereferenced before it has been checked for NULL.
    Move the dereference after the NULL check to prevent an Oops.
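    
    The ordering can be sketched as follows; entry_sketch and
    entry_size_checked are hypothetical names for illustration, not the
    actual ntfs3 structures.

```c
#include <stddef.h>

/* Hypothetical stand-in for an index entry; not the real ntfs3 layout. */
struct entry_sketch {
	unsigned short size;
};

/* Return the entry size, or 0 when there is no entry. */
size_t entry_size_checked(const struct entry_sketch *e)
{
	if (!e)			/* check first ... */
		return 0;
	return e->size;		/* ... dereference afterwards */
}
```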
    
    Fixes: 82cae269cfa9 ("fs/ntfs3: Add initialization of super block")
    Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
    Reviewed-by: Kari Argillander <kari.argillander@gmail.com>
    Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>

commit df5d9bb6c1bc22d237a8fc7a462562a7044b0faa
Author: Dan Carpenter <dan.carpenter@oracle.com>
Date:   Tue Aug 24 10:49:32 2021 +0300

    fs/ntfs3: Fix error code in indx_add_allocate()
    
    Return -EINVAL if ni_find_attr() fails.  Don't return success.
    
    Fixes: 82cae269cfa9 ("fs/ntfs3: Add initialization of super block")
    Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
    Reviewed-by: Kari Argillander <kari.argillander@gmail.com>
    Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>

commit 548f3a8b345412b9ef22e8347d687c6c7168f5db
Author: Dan Carpenter <dan.carpenter@oracle.com>
Date:   Tue Aug 24 14:48:58 2021 +0300

    fs/ntfs3: fix an error code in ntfs_get_acl_ex()
    
    The ntfs_get_ea() function returns negative error codes or on success
    it returns the length.  In the original code a zero length return was
    treated as -ENODATA and results in a NULL return.  But it should be
    treated as an invalid length and result in an ERR_PTR(-EINVAL) return.
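    
    A userspace sketch of the intended mapping from return value to result.
    The helpers err_ptr_sketch() and ptr_err_sketch() are assumed stand-ins
    for the kernel's error-pointer helpers, and acl_result_sketch() is a
    hypothetical function, not the driver code.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace stand-ins for the kernel's error-pointer helpers. */
static inline void *err_ptr_sketch(long err)
{
	return (void *)(intptr_t)err;
}

static inline long ptr_err_sketch(const void *p)
{
	return (long)(intptr_t)p;
}

/*
 * Map the return value of a "get extended attribute" call to an ACL result.
 * len < 0 is an error code, len > 0 is the attribute length.
 */
void *acl_result_sketch(long len, void *parsed_acl)
{
	if (len > 0)
		return parsed_acl;		/* valid attribute */
	if (len == -ENODATA)
		return NULL;			/* no ACL stored */
	if (len == 0)
		return err_ptr_sketch(-EINVAL);	/* zero length is invalid */
	return err_ptr_sketch(len);		/* pass other errors through */
}
```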
    
    Fixes: be71b5cba2e6 ("fs/ntfs3: Add attrib operations")
    Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
    Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>

commit a84a0bb4c9aaa90c3594f125b6c9e9ed46be788a
Author: Dan Carpenter <dan.carpenter@oracle.com>
Date:   Tue Aug 24 14:52:36 2021 +0300

    fs/ntfs3: add checks for allocation failure
    
    Add a check for when the kzalloc() in init_rsttbl() fails.  Some of
    the callers checked for NULL and some did not.  I went down the call
    tree and added NULL checks wherever they were missing.
    
    Fixes: b46acd6a6a62 ("fs/ntfs3: Add NTFS journal")
    Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
    Reviewed-by: Kari Argillander <kari.argillander@gmail.com>
    Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>

commit 582437a4316954b9012171b51796e78d8602f7be
Author: Kari Argillander <kari.argillander@gmail.com>
Date:   Tue Aug 24 21:37:08 2021 +0300

    fs/ntfs3: Use kcalloc/kmalloc_array over kzalloc/kmalloc
    
    Use kcalloc/kmalloc_array over kzalloc/kmalloc when we allocate an array.
    Checkpatch found these once we stopped using our own allocation
    wrappers.
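    
    The array variants are preferred because they check the element count
    times element size multiplication for overflow. A userspace sketch of
    that check, where alloc_array_sketch() is a hypothetical analogue and
    not the kernel implementation:

```c
#include <stdint.h>
#include <stdlib.h>

/* Refuse the allocation when n * size would not fit in a size_t. */
void *alloc_array_sketch(size_t n, size_t size)
{
	if (size != 0 && n > SIZE_MAX / size)
		return NULL;
	return malloc(n * size);
}
```

    An open-coded malloc(n * size) would silently wrap for such inputs
    and return a buffer that is far too small.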
    
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Kari Argillander <kari.argillander@gmail.com>
    Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>

commit 3d45b1852f34f7654c7b9f64e57354659cee8355
Author: Kari Argillander <kari.argillander@gmail.com>
Date:   Tue Aug 24 21:37:07 2021 +0300

    fs/ntfs3: Do not use driver own alloc wrappers
    
    The problem with these wrappers is that we cannot drop, for example, the
    GFP_NOFS flag. It is not recommended to use that flag in all places. Also,
    if we changed just one driver-specific wrapper to a kernel wrapper, it would
    look really weird. People should be most familiar with the kernel wrappers,
    so let's just use those.
    
    Driver-specific alloc wrappers also confuse some static analysis tools;
    a good example is the kernel's checkpatch tool. After we converted
    these to the kernel ones, its warnings showed up.
    
    The following Coccinelle script was used to automate the change.
    
    virtual patch
    
    @alloc depends on patch@
    expression x;
    expression y;
    @@
    (
    -       ntfs_malloc(x)
    +       kmalloc(x, GFP_NOFS)
    |
    -       ntfs_zalloc(x)
    +       kzalloc(x, GFP_NOFS)
    |
    -       ntfs_vmalloc(x)
    +       kvmalloc(x, GFP_NOFS)
    |
    -       ntfs_free(x)
    +       kfree(x)
    |
    -       ntfs_vfree(x)
    +       kvfree(x)
    |
    -       ntfs_memdup(x, y)
    +       kmemdup(x, y, GFP_NOFS)
    )
    
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Kari Argillander <kari.argillander@gmail.com>
    Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>

commit 7b2edf8084fb7af91b7bc80e87ebb95073eb528b
Author: Kari Argillander <kari.argillander@gmail.com>
Date:   Thu Aug 26 11:56:29 2021 +0300

    fs/ntfs3: Use kernel ALIGN macros over driver specific
    
    The static checkers (Smatch) were complaining because QuadAlign() was
    buggy.  If you try to align something higher than UINT_MAX, it gets
    truncated to a u32.
    
    Smatch warning was:
            fs/ntfs3/attrib.c:383 attr_set_size_res()
            warn: was expecting a 64 bit value instead of '~7'
    
    To keep this from happening again, change all these macros to the
    kernel-provided ones. This can also help some other static analysis tools
    to give us better warnings.
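    
    The truncation can be reproduced in userspace. In the sketch below,
    QuadAlign() is the old macro as quoted in the script in this commit,
    while ALIGN_SKETCH() is an assumed 64-bit stand-in for the kernel's
    ALIGN(), not its real definition. It assumes unsigned int is 32 bits.

```c
#include <stdint.h>

/* Old driver-specific macro: the mask ~7u is only 32 bits wide. */
#define QuadAlign(n)		(((n) + 7u) & (~7u))

/* Hypothetical stand-in for ALIGN(): the mask is as wide as the value. */
#define ALIGN_SKETCH(x, a) \
	(((x) + ((uint64_t)(a) - 1)) & ~((uint64_t)(a) - 1))

uint64_t quad_align_old(uint64_t n)
{
	return QuadAlign(n);		/* upper 32 bits are masked away */
}

uint64_t quad_align_new(uint64_t n)
{
	return ALIGN_SKETCH(n, 8);	/* upper 32 bits are preserved */
}
```

    For 0x100000001 the old macro yields 0x8 because ~7u zero-extends to
    0x00000000fffffff8, while the 64-bit mask yields 0x100000008.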
    
    The patch was generated with a Coccinelle script, and afterwards some style
    issues were fixed by hand.
    
    Coccinelle script:
    
    virtual patch
    
    @alloc depends on patch@
    expression x;
    @@
    (
    -       #define QuadAlign(n)            (((n) + 7u) & (~7u))
    |
    -       QuadAlign(x)
    +       ALIGN(x, 8)
    |
    -       #define IsQuadAligned(n)        (!((size_t)(n)&7u))
    |
    -       IsQuadAligned(x)
    +       IS_ALIGNED(x, 8)
    |
    -       #define Quad2Align(n)           (((n) + 15u) & (~15u))
    |
    -       Quad2Align(x)
    +       ALIGN(x, 16)
    |
    -       #define IsQuad2Aligned(n)       (!((size_t)(n)&15u))
    |
    -       IsQuad2Aligned(x)
    +       IS_ALIGNED(x, 16)
    |
    -       #define Quad4Align(n)           (((n) + 31u) & (~31u))
    |
    -       Quad4Align(x)
    +       ALIGN(x, 32)
    |
    -       #define IsSizeTAligned(n)       (!((size_t)(n) & (sizeof(size_t) - 1)))
    |
    -       IsSizeTAligned(x)
    +       IS_ALIGNED(x, sizeof(size_t))
    |
    -       #define DwordAlign(n)           (((n) + 3u) & (~3u))
    |
    -       DwordAlign(x)
    +       ALIGN(x, 4)
    |
    -       #define IsDwordAligned(n)       (!((size_t)(n)&3u))
    |
    -       IsDwordAligned(x)
    +       IS_ALIGNED(x, 4)
    |
    -       #define WordAlign(n)            (((n) + 1u) & (~1u))
    |
    -       WordAlign(x)
    +       ALIGN(x, 2)
    |
    -       #define IsWordAligned(n)        (!((size_t)(n)&1u))
    |
    -       IsWordAligned(x)
    +       IS_ALIGNED(x, 2)
    |
    )
    
    Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
    Signed-off-by: Kari Argillander <kari.argillander@gmail.com>
    Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>

commit 6071926968f1446d01baaa20f1a0d811e58b946f
Author: Kari Argillander <kari.argillander@gmail.com>
Date:   Tue Aug 24 21:37:06 2021 +0300

    fs/ntfs3: Restyle comment block in ni_parse_reparse()
    
    First of all, this fixes one non-UTF-8 char in this comment block. Maybe
    this happened because of an error in the filesystem ;)
    
    Also, this block was hard to read because of long lines, so limit them to
    80 columns. And while we are at it, improve the grammar a little.
    
    Signed-off-by: Kari Argillander <kari.argillander@gmail.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>

commit 75d85f88e9c3bfb5ef7a9e0c64d7f962914f060e
Author: Jiapeng Chong <jiapeng.chong@linux.alibaba.com>
Date:   Thu Aug 19 16:23:37 2021 +0800

    fs/ntfs3: Remove unused including <linux/version.h>
    
    Eliminate the following versioncheck warning:
    
    ./fs/ntfs3/inode.c: 16 linux/version.h not needed.
    
    Reported-by: Abaci Robot <abaci@linux.alibaba.com>
    Fixes: 82cae269cfa9 ("fs/ntfs3: Add initialization of super block")
    Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com>
    Reviewed-by: Kari Argillander <kari.argillander@gmail.com>
    Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>

commit 208e2b5f62f9b6290a44e3067ba4aaa07cf1cf30
Author: Gustavo A. R. Silva <gustavoars@kernel.org>
Date:   Wed Aug 18 17:21:46 2021 -0500

    fs/ntfs3: Fix fall-through warnings for Clang
    
    Fix the following fallthrough warnings:
    
    fs/ntfs3/inode.c:1792:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
    fs/ntfs3/index.c:178:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
    
    This helps with the ongoing efforts to globally enable
    -Wimplicit-fallthrough for Clang.
    
    Link: https://github.com/KSPP/linux/issues/115
    Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
    Reviewed-by: Nathan Chancellor <nathan@kernel.org>
    Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>

commit dba411671cc79059f036cc8335a3e9f633031bfa
Author: Kari Argillander <kari.argillander@gmail.com>
Date:   Wed Aug 18 04:06:47 2021 +0300

    fs/ntfs3: Fix one none utf8 char in source file
    
    In one source file there is, for some reason, a non-UTF-8 char. But hey, this
    is fs development, so this kind of thing might happen.
    
    Signed-off-by: Kari Argillander <kari.argillander@gmail.com>
    Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>

commit 448da00d3a07ff09004aa9dcf9cf782586bc2391
Author: Nathan Chancellor <nathan@kernel.org>
Date:   Mon Aug 16 12:30:41 2021 -0700

    fs/ntfs3: Remove unused variable cnt in ntfs_security_init()
    
    Clang warns:
    
    fs/ntfs3/fsntfs.c:1874:9: warning: variable 'cnt' set but not used
    [-Wunused-but-set-variable]
            size_t cnt, off;
                   ^
    1 warning generated.
    
    It is indeed unused so remove it.
    
    Fixes: 82cae269cfa9 ("fs/ntfs3: Add initialization of super block")
    Signed-off-by: Nathan Chancellor <nathan@kernel.org>
    Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
    Reviewed-by: Kari Argillander <kari.argillander@gmail.com>
    Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>

commit 7848a6f50e1c106cfff56907bba40bc85949b868
Author: Colin Ian King <colin.king@canonical.com>
Date:   Mon Aug 16 17:30:25 2021 +0100

    fs/ntfs3: Fix integer overflow in multiplication
    
    The multiplication of the u32 data_size with an int is performed
    using 32-bit arithmetic; however, the result is assigned to the
    variable nbits, which is a size_t (64-bit) value. Fix a potential
    integer overflow by casting the u32 value to a size_t before the
    multiply so that the multiplication is done at size_t width.
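    
    A userspace sketch of the difference, assuming a multiplier of 8 and the
    hypothetical function names nbits_wrapping() and nbits_widened():

```c
#include <stddef.h>
#include <stdint.h>

/* 32-bit multiply: the product wraps before it is widened to size_t. */
size_t nbits_wrapping(uint32_t data_size)
{
	return data_size * 8;
}

/* Widen first: the multiply is done at size_t width. */
size_t nbits_widened(uint32_t data_size)
{
	return (size_t)data_size * 8;
}
```

    With data_size = 0x20000000 the first variant wraps to 0, while the
    second yields 0x100000000 when size_t is 64 bits wide.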
    
    Addresses-Coverity: ("Unintentional integer overflow")
    Fixes: 82cae269cfa9 ("fs/ntfs3: Add initialization of super block")
    Signed-off-by: Colin Ian King <colin.king@canonical.com>
    Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>

commit 9457b121f871c7f279b082a118ea955a778f2d2c
Author: Kari Argillander <kari.argillander@gmail.com>
Date:   Mon Aug 16 15:01:56 2021 +0300

    fs/ntfs3: Add ifndef + define to all header files
    
    Add guards so that the compiler will only include each header file once.
    
    Signed-off-by: Kari Argillander <kari.argillander@gmail.com>
    Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>

commit cfab079d7add7cea956d1c73bfb227277acbfd4c
Author: Kari Argillander <kari.argillander@gmail.com>
Date:   Mon Aug 16 13:37:32 2021 +0300

    fs/ntfs3: Use linux/log2 is_power_of_2 function
    
    We do not need our own implementation for this function in this
    driver. It is much better to use the generic one.
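    
    For reference, the test in question is the usual single-bit check. A
    userspace sketch, where is_power_of_2_sketch() is a stand-in name and
    not the helper from linux/log2.h itself:

```c
#include <stdbool.h>

/* True when exactly one bit is set; zero is not a power of two. */
static inline bool is_power_of_2_sketch(unsigned long n)
{
	return n != 0 && (n & (n - 1)) == 0;
}
```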
    
    Signed-off-by: Kari Argillander <kari.argillander@gmail.com>
    Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>

commit 6ac3cde638e1ff25fd379512a81e383890eaadc7
Author: Colin Ian King <colin.king@canonical.com>
Date:   Mon Aug 16 11:13:08 2021 +0100

    fs/ntfs3: Fix various spelling mistakes
    
    There is a spelling mistake in an ntfs_err error message. Also
    fix various spelling mistakes in comments.
    
    Signed-off-by: Colin Ian King <colin.king@canonical.com>
    Reviewed-by: Kari Argillander <kari.argillander@gmail.com>
    Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>

commit 7aeab99b2591883a9a64dca274b0bbf12ac0192e
Author: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
Date:   Fri Aug 13 17:21:31 2021 +0300

    fs/ntfs3: Add MAINTAINERS
    
    This adds MAINTAINERS
    
    Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>

commit 9e3c3d099f1819618461480af4a242f1fb916405
Author: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
Date:   Fri Aug 13 17:21:30 2021 +0300

    fs/ntfs3: Add NTFS3 in fs/Kconfig and fs/Makefile
    
    This adds NTFS3 in fs/Kconfig and fs/Makefile
    
    Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>

commit 93ead8fb54035d3803d4698cf41bdd1a1958ea13
Author: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
Date:   Fri Aug 13 17:21:30 2021 +0300

    fs/ntfs3: Add Kconfig, Makefile and doc
    
    This adds Kconfig, Makefile and doc
    
    Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>

commit fd63d37c1b9ebe5b34ad544d706178b871aea1ba
Author: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
Date:   Fri Aug 13 17:21:30 2021 +0300

    fs/ntfs3: Add NTFS journal
    
    This adds NTFS journal
    
    Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>

commit d3f5f3ad9f8c5f20aa1e1cf27a499a4dd36bd65a
Author: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
Date:   Fri Aug 13 17:21:30 2021 +0300

    fs/ntfs3: Add compression
    
    This patch adds different types of NTFS-applicable compressions:
    - lznt
    - lzx
    - xpress
    The latter two (lzx, xpress) implement the Windows Compact OS feature and
    were taken from the ntfs-3g system compression plugin authored by Eric Biggers
    (https://github.com/ebiggers/ntfs-3g-system-compression),
    then ported to ntfs3 and adapted to the Linux kernel environment.
    
    Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>

commit cb25eb8b77e4c8126831cf9cb7620f327a9a4ea8
Author: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
Date:   Fri Aug 13 17:21:30 2021 +0300

    fs/ntfs3: Add attrib operations
    
    This adds attrib operations
    
    Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>

commit 9d01713d336b42ce2f54360ffb81cb572d17440e
Author: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
Date:   Fri Aug 13 17:21:29 2021 +0300

    fs/ntfs3: Add file operations and implementation
    
    This adds file operations and implementation
    
    Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>

commit 018d7923d76995c99b7ac4ece05117df226de852
Author: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
Date:   Fri Aug 13 17:21:29 2021 +0300

    fs/ntfs3: Add bitmap
    
    This adds bitmap
    
    Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>

commit 5bf9cd7aca3534e7a9550507e4e1e029a7239eaf
Author: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
Date:   Fri Aug 13 17:21:29 2021 +0300

    fs/ntfs3: Add initialization of super block
    
    This adds initialization of super block
    
    Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>

commit 7f3f48560484db9ef0d998305da1f598227de434
Author: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
Date:   Fri Aug 13 17:21:29 2021 +0300

    fs/ntfs3: Add headers and misc files
    
    This adds headers and misc files
    
    Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>

commit 418f552b08faf3d1170fa09c40b12bdb39e347b0
Author: Mark Weiman <mark.weiman@markzz.com>
Date:   Sun Aug 12 11:36:21 2018 -0400

    pci: Enable overrides for missing ACS capabilities
    
    This is an updated version of Alex Williamson's patch from:
    https://lkml.org/lkml/2013/5/30/513
    
    Original commit message follows:
    
    PCIe ACS (Access Control Services) is the PCIe 2.0+ feature that
    allows us to control whether transactions are allowed to be redirected
    in various subnodes of a PCIe topology.  For instance, if two
    endpoints are below a root port or downstream switch port, the
    downstream port may optionally redirect transactions between the
    devices, bypassing upstream devices.  The same can happen internally
    on multifunction devices.  The transaction may never be visible to the
    upstream devices.
    
    One upstream device that we particularly care about is the IOMMU.  If
    a redirection occurs in the topology below the IOMMU, then the IOMMU
    cannot provide isolation between devices.  This is why the PCIe spec
    encourages topologies to include ACS support.  Without it, we have to
    assume peer-to-peer DMA within a hierarchy can bypass IOMMU isolation.
    
    Unfortunately, far too many topologies do not support ACS to make this
    a steadfast requirement.  Even the latest chipsets from Intel are only
    sporadically supporting ACS.  We have trouble getting interconnect
    vendors to include the PCIe spec required PCIe capability, let alone
    suggested features.
    
    Therefore, we need to add some flexibility.  The pcie_acs_override=
    boot option lets users opt-in specific devices or sets of devices to
    assume ACS support.  The "downstream" option assumes full ACS support
    on root ports and downstream switch ports.  The "multifunction"
    option assumes the subset of ACS features available on multifunction
    endpoints and upstream switch ports are supported.  The "id:nnnn:nnnn"
    option enables ACS support on devices matching the provided vendor
    and device IDs, allowing more strategic ACS overrides.  These options
    may be combined in any order.  A maximum of 16 id specific overrides
    are available.  It's suggested to use the most limited set of options
    necessary to avoid completely disabling ACS across the topology.
    Note to hardware vendors, we have facilities to permanently quirk
    specific devices which enforce isolation but do not provide an ACS
    capability.  Please contact me to have your devices added and save
    your customers the hassle of this boot option.
    
    Signed-off-by: Mark Weiman <mark.weiman@markzz.com>

commit 080344179972fc6dea99e30be2b37c4aa8abff55
Author: graysky <graysky@archlinux.us>
Date:   Sun Jun 6 09:41:36 2021 -0400

    x86/kconfig: more uarches for kernel 5.8+
    
    FEATURES
    This patch adds additional CPU options to the Linux kernel accessible under:
     Processor type and features  --->
      Processor family --->
    
    With the release of gcc 11.1 and clang 12.0, several generic 64-bit levels are
    offered which are good for supported Intel or AMD CPUs:
    • x86-64-v2
    • x86-64-v3
    • x86-64-v4
    
    Users of glibc 2.33 and above can see which level is supported by current
    hardware by running:
      /lib/ld-linux-x86-64.so.2 --help | grep supported
    
    Alternatively, compare the flags from /proc/cpuinfo to this list.[1]
    
    CPU-specific microarchitectures include:
    • AMD Improved K8-family
    • AMD K10-family
    • AMD Family 10h (Barcelona)
    • AMD Family 14h (Bobcat)
    • AMD Family 16h (Jaguar)
    • AMD Family 15h (Bulldozer)
    • AMD Family 15h (Piledriver)
    • AMD Family 15h (Steamroller)
    • AMD Family 15h (Excavator)
    • AMD Family 17h (Zen)
    • AMD Family 17h (Zen 2)
    • AMD Family 19h (Zen 3)†
    • Intel Silvermont low-power processors
    • Intel Goldmont low-power processors (Apollo Lake and Denverton)
    • Intel Goldmont Plus low-power processors (Gemini Lake)
    • Intel 1st Gen Core i3/i5/i7 (Nehalem)
    • Intel 1.5 Gen Core i3/i5/i7 (Westmere)
    • Intel 2nd Gen Core i3/i5/i7 (Sandybridge)
    • Intel 3rd Gen Core i3/i5/i7 (Ivybridge)
    • Intel 4th Gen Core i3/i5/i7 (Haswell)
    • Intel 5th Gen Core i3/i5/i7 (Broadwell)
    • Intel 6th Gen Core i3/i5/i7 (Skylake)
    • Intel 6th Gen Core i7/i9 (Skylake X)
    • Intel 8th Gen Core i3/i5/i7 (Cannon Lake)
    • Intel 10th Gen Core i7/i9 (Ice Lake)
    • Intel Xeon (Cascade Lake)
    • Intel Xeon (Cooper Lake)*
    • Intel 3rd Gen 10nm++ i3/i5/i7/i9-family (Tiger Lake)*
    • Intel 3rd Gen 10nm++ Xeon (Sapphire Rapids)‡
    • Intel 11th Gen i3/i5/i7/i9-family (Rocket Lake)‡
    • Intel 12th Gen i3/i5/i7/i9-family (Alder Lake)‡
    
    Notes: If not otherwise noted, gcc >=9.1 is required for support.
           *Requires gcc >=10.1 or clang >=10.0
           †Requires gcc >=10.3 or clang >=12.0
           ‡Requires gcc >=11.1 or clang >=12.0
    
    It also offers to compile passing the 'native' option which, "selects the CPU
    to generate code for at compilation time by determining the processor type of
    the compiling machine. Using -march=native enables all instruction subsets
    supported by the local machine and will produce code optimized for the local
    machine under the constraints of the selected instruction set."[2]
    
    Users of Intel CPUs should select the 'Intel-Native' option and users of AMD
    CPUs should select the 'AMD-Native' option.
    
    MINOR NOTES RELATING TO INTEL ATOM PROCESSORS
    This patch also changes -march=atom to -march=bonnell in accordance with the
    gcc v4.9 changes. Upstream is using the deprecated -march=atom flags when I
    believe it should use the newer -march=bonnell flag for atom processors.[3]
    
    It is not recommended to compile on Atom-CPUs with the 'native' option.[4] The
    recommendation is to use the 'atom' option instead.
    
    BENEFITS
    Small but real speed increases are measurable using a make endpoint comparing
    a generic kernel to one built with one of the respective microarchs.
    
    See the following experimental evidence supporting this statement:
    https://github.com/graysky2/kernel_gcc_patch
    
    REQUIREMENTS
    linux version >=5.8
    gcc version >=9.0 or clang version >=9.0
    
    ACKNOWLEDGMENTS
    This patch builds on the seminal work by Jeroen.[5]
    
    REFERENCES
    1.  https://gitlab.com/x86-psABIs/x86-64-ABI/-/commit/77566eb03bc6a326811cb7e9
    2.  https://gcc.gnu.org/onlinedocs/gcc/x86-Options.html#index-x86-Options
    3.  https://bugzilla.kernel.org/show_bug.cgi?id=77461
    4.  https://github.com/graysky2/kernel_gcc_patch/issues/15
    5.  http://www.linuxforge.net/docs/linux/linux-gcc.php
    
    Signed-off-by: graysky <graysky@archlinux.us>

commit af2d931bdfa54d2260e47c826a93b50ac16f72db
Author: Arjan van de Ven <arjan@linux.intel.com>
Date:   Wed May 17 01:52:11 2017 +0000

    init: wait for partition and retry scan
    
    Because Clear Linux boots fast, the device may not be ready when
    the mounting code is reached, so the device scan is retried
    every 0.5 sec for at least 40 sec, and the async task is
    synchronized.
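    
    The retry policy can be sketched as a bounded polling loop. Everything
    below is a hypothetical userspace model (wait_for_device_sketch, the
    probe and sleep callbacks, the countdown fake), not the init code; only
    the 0.5 sec interval and 40 sec budget come from the description above.

```c
#include <stdbool.h>

#define RETRY_INTERVAL_MS	500
#define RETRY_TOTAL_MS		40000

/*
 * Poll until probe() succeeds or the time budget is used up.
 * Returns the milliseconds waited, or -1 if the device never appeared.
 */
int wait_for_device_sketch(bool (*probe)(void *), void (*sleep_ms)(int),
			   void *ctx)
{
	int waited = 0;

	for (;;) {
		if (probe(ctx))
			return waited;
		if (waited >= RETRY_TOTAL_MS)
			return -1;
		sleep_ms(RETRY_INTERVAL_MS);
		waited += RETRY_INTERVAL_MS;
	}
}

/* Fake probe for demonstration: fails *ctx times, then succeeds. */
bool probe_countdown(void *ctx)
{
	int *left = ctx;

	if (*left == 0)
		return true;
	(*left)--;
	return false;
}

/* Fake sleep so the example runs instantly. */
void sleep_noop(int ms)
{
	(void)ms;
}
```

    A device that shows up on the fourth probe is found after 1500 ms of
    waiting; one that never shows up is given up on once 40 sec have passed.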
    
    Signed-off-by: Miguel Bernal Marin <miguel.bernal.marin@linux.intel.com>

commit a41a87f11c0cc996bc5f1a34f1f2644858282d89
Author: Arjan van de Ven <arjan@linux.intel.com>
Date:   Thu Jun 2 23:36:32 2016 -0500

    drivers: initialize ata before graphics
    
    ATA init is the long pole in the boot process, and it's asynchronous.
    Move the graphics init after it so that ATA and graphics initialize
    in parallel.

commit 80e01b0e579c3b6d879d7be33dd39bb0e9797f52
Author: Arjan van de Ven <arjan@linux.intel.com>
Date:   Sun Feb 18 23:35:41 2018 +0000

    locking: rwsem: spin faster
    
    tweak rwsem owner spinning a bit
    
    Signed-off-by: Alexandre Frade <kernel@xanmod.org>

commit 5948ca1226e4477038e0a3b514623281de1d9111
Author: William Douglas <william.douglas@intel.com>
Date:   Wed Jun 20 17:23:21 2018 +0000

    firmware: Enable stateless firmware loading
    
    Prefer the order of specific version before generic and /etc before
    /lib to enable the user to give specific overrides for generic
    firmware and distribution firmware.

commit 09e053e2b26e19d575b7bf4a111473c31abaf5ca
Author: Arjan van de Ven <arjan@linux.intel.com>
Date:   Sun Sep 22 11:12:35 2019 -0300

    intel_rapl: Silence rapl trace debug

commit 0196fd65dbcdba18dd61fab8a8655e6ae28ff649
Author: Christian Brauner <christian@brauner.io>
Date:   Wed Jan 23 21:54:23 2019 +0100

    SAUCE: binder: give binder_alloc its own debug mask file
    
    Currently binder.c and binder_alloc.c both register the
    /sys/module/binder_linux/parameters/debug_mask file, which leads to conflicts
    in sysfs. This commit gives binder_alloc.c its own
    /sys/module/binder_linux/parameters/alloc_debug_mask file.
    
    Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
    Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

commit 04709621350442cdc7164066a9a46bc3ba4d65a8
Author: Christian Brauner <christian@brauner.io>
Date:   Wed Jan 16 23:13:25 2019 +0100

    SAUCE: binder: turn into module
    
    The Android binder driver needs to become a module for the sake of shipping
    Anbox. To do this we need to export the following functions since binder is
    currently still using them:
    
    - security_binder_set_context_mgr()
    - security_binder_transaction()
    - security_binder_transfer_binder()
    - security_binder_transfer_file()
    - can_nice()
    - __close_fd_get_file()
    - mmput_async()
    - task_work_add()
    - map_kernel_range_noflush()
    - get_vm_area()
    - zap_page_range()
    - put_ipc_ns()
    - get_ipc_ns_exported()
    - show_init_ipc_ns()
    
    Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
    [ saf: fix additional reference to init_ipc_ns from 5.0-rc6 ]
    Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

commit 7183da0464044bb1d743207c6e60fd83c04c5457
Author: Christian Brauner <christian@brauner.io>
Date:   Wed Jun 20 19:21:37 2018 +0200

    SAUCE: ashmem: turn into module
    
    The Android ashmem driver needs to become a module for the sake of Anbox.
    To do this we need to export shmem_zero_setup() since ashmem is currently
    using it.
    Note, the abomination that is the Android ashmem driver will go away in the
    not so distant future in favour of memfds.
    
    Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
    Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

commit 97e857f92dde43310ac3eb959b3df0074490f084
Author: Serge Hallyn <serge.hallyn@canonical.com>
Date:   Fri May 31 19:12:12 2013 +0100

    sysctl: add sysctl to disallow unprivileged CLONE_NEWUSER by default
    
    add sysctl to disallow unprivileged CLONE_NEWUSER by default
    
    This is a short-term patch.  Unprivileged use of CLONE_NEWUSER
    is certainly an intended feature of user namespaces.  However
    for at least saucy we want to make sure that, if any security
    issues are found, we have a fail-safe.
    
    Signed-off-by: Serge Hallyn <serge.hallyn@ubuntu.com>
    [bwh: Remove unneeded binary sysctl bits]
    [bwh: Keep this sysctl, but change the default to enabled]

commit f6002a622eceae4ed570608b95736bcf8d534b61
Author: Adithya Abraham Philip <abrahamphilip@google.com>
Date:   Fri Jun 11 21:56:10 2021 +0000

    net-tcp_bbr: v2: Fix missing ECT markings on retransmits for BBRv2
    
    Adds a new flag TCP_ECN_ECT_PERMANENT that is used by CCAs to
    indicate that retransmitted packets and pure ACKs must have the
    ECT bit set. This is a necessary fix for BBRv2, which when using
    ECN expects ECT to be set even on retransmitted packets and ACKs.
    Currently CCAs like BBRv2 which can use ECN but don't "need" it
    do not have a way to indicate that ECT should be set on
    retransmissions/ACKs.
    
    Signed-off-by: Adithya Abraham Philip <abrahamphilip@google.com>
    Signed-off-by: Neal Cardwell <ncardwell@google.com>

commit 68856d40909bf33eb65dbf406124250d2025186e
Author: Neal Cardwell <ncardwell@google.com>
Date:   Mon Dec 28 19:23:09 2020 -0500

    net-tcp_bbr: v2: don't assume prior_cwnd was set entering CA_Loss
    
    Fix WARN_ON_ONCE() warnings that were firing and pointing to a
    bbr->prior_cwnd of 0 when exiting CA_Loss and transitioning to
    CA_Open.
    
    The issue was that tcp_simple_retransmit() calls:
    
      tcp_set_ca_state(sk, TCP_CA_Loss);
    
    without first calling icsk_ca_ops->ssthresh(sk) (because
    tcp_simple_retransmit() is dealing with losses due to MTU issues and
    not congestion). The lack of this callback means that BBR did not get
    a chance to set bbr->prior_cwnd, and thus upon exiting CA_Loss in such
    cases the WARN_ON_ONCE() would fire due to a zero bbr->prior_cwnd.
    
    This commit removes that warning, since a bbr->prior_cwnd of 0 is a
    valid situation in this state transition.
    
    For setting inflight_lo upon entering CA_Loss, to avoid setting an
    inflight_lo of 0 in this case, this commit switches to taking the max
    of cwnd and prior_cwnd. We plan to remove that line of code when we
    switch to cautious (PRR-style) recovery, so that awkwardness will go
    away.
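
    The max-of-two rule above can be sketched as follows. This is a
    simplified model, not the tcp_bbr2.c code: struct bbr_model and
    model_enter_loss() are hypothetical names used only for illustration.

```c
#include <stdint.h>

/* Hypothetical, simplified stand-in for the BBR per-socket state. */
struct bbr_model {
	uint32_t prior_cwnd;  /* cwnd saved before recovery; may still be 0 */
	uint32_t inflight_lo; /* short-term lower bound on inflight */
};

static uint32_t model_max_u32(uint32_t a, uint32_t b)
{
	return a > b ? a : b;
}

/*
 * On entering CA_Loss, take the larger of cwnd and prior_cwnd so that
 * inflight_lo does not become 0 when the ssthresh() callback was skipped
 * and prior_cwnd was therefore never set.
 */
static void model_enter_loss(struct bbr_model *bbr, uint32_t cwnd)
{
	bbr->inflight_lo = model_max_u32(cwnd, bbr->prior_cwnd);
}
```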
    
    Change-Id: I575dce871c2f20e91e3e9449e1706f42a07b8118

commit 7de3f36f619c585abd6c48396ca49b2227d5061c
Author: Neal Cardwell <ncardwell@google.com>
Date:   Mon Aug 17 19:10:21 2020 -0400

    net-tcp_bbr: v2: remove cycle_rand parameter that is unused in BBRv2
    
    Change-Id: Iee1df7e41e42de199068d7c89131ed3d228327c0

commit dfedade9dca832438327a41f43f3d29bb971d255
Author: Neal Cardwell <ncardwell@google.com>
Date:   Mon Aug 17 19:08:41 2020 -0400

    net-tcp_bbr: v2: remove field bw_rtts that is unused in BBRv2
    
    Change-Id: I58e3346c707748a6f316f3ed060d2da84c32a79b

commit 7882fddcbfb3a8edc3f22fca51541cd2d221cd22
Author: Neal Cardwell <ncardwell@google.com>
Date:   Thu Nov 21 15:28:01 2019 -0500

    net-tcp_bbr: v2: remove unnecessary rs.delivered_ce logic upon loss
    
    There is no reason to compute rs.delivered_ce upon loss.
    
    In fact, we specifically do not want to compute rs.delivered_ce upon loss.
    
    Two issues:
    
    (1) This would be the wrong thing to do, in behavior terms.  With
        RACK's dynamic reordering window, losses can be marked long after
        the sequence hole appears in the ACK/SACK stream. We want to
        catch the ECN mark rate rising too high as quickly as possible,
        which means we want to check for high ECN mark rates at ACK time
        (as BBRv2 currently does) and not loss marking time.
    
    (2) This is dead code. The ECN mark rate cannot be detected as too
        high because the check needs rs->delivered to be > 0 as well:
    
           if (rs->delivered_ce > 0 && rs->delivered > 0 &&
    
        Since we are not setting rs->delivered upon loss, this check
        cannot succeed, so setting delivered_ce is pointless.
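
    To make point (2) concrete, here is a minimal model of such a check.
    The struct, the function name, and the way the threshold is passed
    (as a numerator/denominator pair) are assumptions for illustration;
    only the two "> 0" conditions come from the text above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical subset of a rate sample. */
struct rate_sample_model {
	int32_t delivered;    /* packets delivered over the interval */
	int32_t delivered_ce; /* of those, packets carrying a CE mark */
};

/*
 * Return true if the CE mark rate delivered_ce / delivered is at least
 * thresh_num / thresh_den. The test can never pass while delivered is 0,
 * which is why setting delivered_ce alone upon loss had no effect.
 */
static bool model_ecn_rate_too_high(const struct rate_sample_model *rs,
				    uint32_t thresh_num, uint32_t thresh_den)
{
	if (rs->delivered_ce > 0 && rs->delivered > 0)
		return (uint64_t)rs->delivered_ce * thresh_den >=
		       (uint64_t)rs->delivered * thresh_num;
	return false;
}
```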
    
    This dead and wrong line was discovered by Randall Stewart at Netflix
    as he was reading the BBRv2 code.
    
    Change-Id: I37f83f418a259ec31d8f82de986db071b364b76a

commit 729237398032a90cc069a2a6d1c84ef79b556adf
Author: Neal Cardwell <ncardwell@google.com>
Date:   Tue Jun 11 12:54:22 2019 -0400

    net-tcp_bbr: v2: BBRv2 ("bbr2") congestion control for Linux TCP
    
    BBR v2 is an enhancement to the BBR v1 algorithm. It's designed to aim for lower
    queues, lower loss, and better Reno/CUBIC coexistence than BBR v1.
    
    BBR v2 maintains the core of BBR v1: an explicit model of the network
    path that is two-dimensional, adapting to estimate the (a) maximum
    available bandwidth and (b) maximum safe volume of data a flow can
    keep in-flight in the network. It maintains the estimated BDP as a
    core guide for estimating an appropriate level of in-flight data.
    
    BBR v2 makes several key enhancements:
    
    o Its bandwidth-probing time scale is adapted, within bounds, to allow improved
    coexistence with Reno and CUBIC. The bandwidth-probing time scale is (a)
    extended dynamically based on estimated BDP to improve coexistence with
    Reno/CUBIC; (b) bounded by an interactive wall-clock time-scale to be more
    scalable and responsive than Reno and CUBIC.
    
    o Rather than being largely agnostic to loss and ECN marks, it explicitly uses
    loss and (DCTCP-style) ECN signals to maintain its model.
    
    o It aims for lower losses than v1 by adjusting its model to attempt to stay
    within loss rate and ECN mark rate bounds (loss_thresh and ecn_thresh,
    respectively).
    
    o It adapts to loss/ECN signals even when the application is running out of
    data ("application-limited"), in case the "application-limited" flow is also
    "network-limited" (the bw and/or inflight available to this flow is lower than
    previously estimated when the flow ran out of data).
    
    o It has a three-part model: the model explicitly tracks three operating points,
    where an operating point is a tuple: (bandwidth, inflight). The three operating
    points are:
    
      o latest:        the latest measurement from the current round trip
      o upper bound:   robust, optimistic, long-term upper bound
      o lower bound:   robust, conservative, short-term lower bound
    
    These are stored in the following state variables:
    
      o latest:  bw_latest, inflight_latest
      o lo:      bw_lo,     inflight_lo
      o hi:      bw_hi[2],  inflight_hi
    
    To gain intuition about the meaning of the three operating points, it
    may help to consider the analogs in CUBIC, which has a somewhat
    analogous three-part model used by its probing state machine:
    
      BBR param     CUBIC param
      -----------   -------------
      latest     ~  cwnd
      lo         ~  ssthresh
      hi         ~  last_max_cwnd
    
    The analogy is only a loose one, though, since the BBR operating
    points are calculated differently, and are 2-dimensional (bw,inflight)
    rather than CUBIC's one-dimensional notion of operating point
    (inflight).
    
    o It uses the three-part model to adapt the magnitude of its bandwidth
    probing to match the estimated space available in the buffer, rather than (as
    in BBR v1) assuming that it was always acceptable to place 0.25*BDP in
    the bottleneck buffer when probing (commodity datacenter switches
    commonly do not have that much buffer for WAN flows). When BBR v2
    estimates it hit a buffer limit during probing, its bandwidth probing
    then starts gently in case little space is still available in the
    buffer, and then accelerates, slowly at first and then rapidly if it
    can grow inflight without seeing congestion signals. In such cases,
    probing is bounded by inflight_hi + inflight_probe, where
    inflight_probe grows as: [0, 1, 2, 4, 8, 16,...]. This allows BBR to
    keep losses low and bounded if a bottleneck remains congested, while
    rapidly/scalably utilizing free bandwidth when it becomes available.
    
    o It has a slightly revised state machine, to achieve the goals above.
        BBR_BW_PROBE_UP:    pushes up inflight to probe for bw/vol
        BBR_BW_PROBE_DOWN:  drain excess inflight from the queue
        BBR_BW_PROBE_CRUISE: use pipe, w/ headroom in queue/pipe
        BBR_BW_PROBE_REFILL: try refill the pipe again to 100%, leaving queue empty
    
    o The estimated BDP: BBR v2 continues to maintain an estimate of the
    path's two-way propagation delay, by tracking a windowed min_rtt, and
    coordinating (on an as-needed basis) to try to expose the two-way
    propagation delay by draining the bottleneck queue.
    
    BBR v2 continues to use its min_rtt and (currently-applicable) bandwidth
    estimate to estimate the current bandwidth-delay product. The estimated BDP
    still provides one important guideline for bounding inflight data. However,
    because any min-filtered RTT and max-filtered bw inherently tend to both
    overestimate, the estimated BDP is often too high; in this case loss or ECN
    marks can ensue, in which case BBR v2 adjusts inflight_hi and inflight_lo to
    adapt its sending rate and inflight down to match the available capacity of the
    path.
    
    o Space: Note that ICSK_CA_PRIV_SIZE increased. This is because BBR v2
    requires more space. Note that much of the space is due to support for
    per-socket parameterization and debugging in this release for research
    and debugging. With that state removed, the full "struct bbr" is 140
    bytes, or 144 with padding. This is an increase of 40 bytes over the
    existing ca_priv space.
    
    o Code: BBR v2 reuses many pieces from BBR v1. But it omits the following
      significant pieces:
    
      o "packet conservation" (bbr_set_cwnd_to_recover_or_restore(),
        bbr_can_grow_inflight())
      o long-term bandwidth estimator ("policer mode")
    
      The code layout tries to keep BBR v2 code near the bottom of the
      file, so that v1-applicable code in the top does not accidentally
      refer to v2 code.
    
    o Docs:
      See the following docs for more details and diagrams describing the BBR v2
      algorithm:
        https://datatracker.ietf.org/meeting/104/materials/slides-104-iccrg-an-update-on-bbr-00
        https://datatracker.ietf.org/meeting/102/materials/slides-102-iccrg-an-update-on-bbr-work-at-google-00
    
    o Internal notes:
      For this upstream rebase, Neal started from:
        git show fed518041ac6:net/ipv4/tcp_bbr.c > net/ipv4/tcp_bbr.c
      then removed dev instrumentation (dynamic get/set for parameters)
      and code that was only used by BBRv1
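
    Two of the quantities described in the list above, the estimated BDP
    and the probing bound inflight_hi + inflight_probe, can be sketched in
    a few lines. This is an illustrative model only: the function names
    and the units (bytes per second, microseconds, packets) are
    assumptions, and the real code uses scaled fixed-point values.

```c
#include <stdint.h>

/* Estimated BDP in bytes from a bandwidth estimate and the windowed min_rtt. */
static uint64_t model_bdp_bytes(uint64_t bw_bytes_per_sec, uint32_t min_rtt_us)
{
	return bw_bytes_per_sec * min_rtt_us / 1000000ULL;
}

/* Next value in the probing sequence [0, 1, 2, 4, 8, 16, ...]. */
static uint32_t model_next_inflight_probe(uint32_t cur)
{
	if (cur == 0)
		return 1;
	if (cur > UINT32_MAX / 2)
		return UINT32_MAX; /* saturate instead of overflowing */
	return cur * 2;
}

/* Upper bound on inflight while probing after a buffer limit was hit. */
static uint32_t model_probe_bound(uint32_t inflight_hi, uint32_t inflight_probe)
{
	return inflight_hi + inflight_probe;
}
```

    For example, a 100 Mbit/s path (12,500,000 bytes/s) with a 20 ms
    min_rtt gives an estimated BDP of 250,000 bytes in this model.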
    
    Effort: net-tcp_bbr
    Origin-9xx-SHA1: 2c84098e60bed6d67dde23cd7538c51dee273102
    Change-Id: I125cf26ba2a7a686f2fa5e87f4c2afceb65f7a05

commit 002c56a891c693f9e596c6c397ca80ae460bf2d2
Author: Neal Cardwell <ncardwell@google.com>
Date:   Sat Nov 16 13:16:25 2019 -0500

    net-tcp: add fast_ack_mode=1: skip rwin check in __tcp_ack_snd_check()
    
    Add logic for an experimental TCP connection behavior, enabled with
    tp->fast_ack_mode = 1, which disables checking the receive window
    before sending an ack in __tcp_ack_snd_check(). If this behavior is
    enabled, the data receiver sends an ACK if the amount of data is >
    RCV.MSS.
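
    A simplified model of the decision described above is sketched below.
    The function and parameter names are hypothetical, and the real
    __tcp_ack_snd_check() has further conditions (quick-ack mode,
    out-of-order data) that are left out here.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * unacked_bytes:     data received but not yet acknowledged.
 * rcv_mss:           the receiver's MSS estimate (RCV.MSS).
 * rwin_check_passes: result of the usual receive window check.
 * fast_ack_mode:     1 skips the receive window check entirely.
 */
static bool model_should_ack_now(uint32_t unacked_bytes, uint32_t rcv_mss,
				 bool rwin_check_passes, uint8_t fast_ack_mode)
{
	if (unacked_bytes <= rcv_mss)
		return false;
	return fast_ack_mode == 1 || rwin_check_passes;
}
```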
    
    Change-Id: Iaa0a0fd7108221f883137a79d5bfa724f1b096d4

commit aaac92aea9ffb99bd18c2431cee31e47eacfc6e5
Author: Neal Cardwell <ncardwell@google.com>
Date:   Fri Sep 27 17:10:26 2019 -0400

    net-tcp: re-generalize TSO sizing in TCP CC module API
    
    Reorganize the API for CC modules so that the CC module once again
    gets complete control of the TSO sizing decision. This is how the API
    was set up around 2016 and the initial BBRv1 upstreaming. Later Eric
    Dumazet simplified it. But with wider testing it now seems that to
    avoid CPU regressions BBR needs to have a different TSO sizing
    function.
    
    This is necessary to handle cases where there are many flows
    bottlenecked on the sender host's NIC, in which case BBR's pacing rate
    is much lower than CUBIC/Reno/DCTCP's. Why does this happen? Because
    BBR's pacing rate adapts to the low bandwidth share each flow sees. By
    contrast, CUBIC/Reno/DCTCP see no loss or ECN, so they grow a very
    large cwnd, and thus large pacing rate and large TSO burst size.
    
    Change-Id: Ic8ccfdbe4010ee8d4bf6a6334c48a2fceb2171ea

commit 7df691bf67541601ac74d9a6dc55e988dd7706c5
Author: Yousuk Seung <ysseung@google.com>
Date:   Wed May 23 17:55:54 2018 -0700

    net-tcp: add new ca opts flag TCP_CONG_WANTS_CE_EVENTS
    
    Add a new ca opts flag TCP_CONG_WANTS_CE_EVENTS that allows a
    congestion control module to receive CE events.
    
    Currently congestion control modules have to set the TCP_CONG_NEEDS_ECN
    bit in opts flag to receive CE events but this may incur changes in ECN
    behavior elsewhere. This patch adds a new bit TCP_CONG_WANTS_CE_EVENTS
    that allows congestion control modules to receive CE events
    independently of TCP_CONG_NEEDS_ECN.
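
    The relationship between the two bits can be sketched as below. The
    MODEL_ names and bit values are assumptions chosen for illustration
    and are not taken from the kernel headers.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit values for the two ca opts flags. */
#define MODEL_CONG_NEEDS_ECN       0x2u
#define MODEL_CONG_WANTS_CE_EVENTS 0x4u

/* ECN behavior elsewhere in the stack keys off NEEDS_ECN only. */
static bool model_ca_needs_ecn(uint32_t flags)
{
	return (flags & MODEL_CONG_NEEDS_ECN) != 0;
}

/* CE events are delivered if either bit is set. */
static bool model_ca_wants_ce_events(uint32_t flags)
{
	return (flags & (MODEL_CONG_NEEDS_ECN |
			 MODEL_CONG_WANTS_CE_EVENTS)) != 0;
}
```

    A module that sets only the new bit receives CE events in this model
    without changing the answer of the NEEDS_ECN test.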
    
    Effort: net-tcp
    Origin-9xx-SHA1: 9f7e14716cde760bc6c67ef8ef7e1ee48501d95b
    Change-Id: I2255506985242f376d910c6fd37daabaf4744f24

commit 358fdaea39eaa8248856ed9b7073f9996c2e331f
Author: Neal Cardwell <ncardwell@google.com>
Date:   Tue May 7 22:37:19 2019 -0400

    net-tcp_bbr: v2: set tx.in_flight for skbs in repair write queue
    
    Syzkaller was able to use TCP_REPAIR to reproduce the new warning
    added in tcp_fragment():
    
      WARNING: CPU: 0 PID: 118174 at net/ipv4/tcp_output.c:1487
        tcp_fragment+0xdcc/0x10a0 net/ipv4/tcp_output.c:1487()
      inconsistent: tx.in_flight: 0 old_factor: 53
    
    The warning happens because skbs inserted into the tcp_rtx_queue
    during the repair process go through a sort of "fake send" process,
    and that process was setting pcount but not tx.in_flight, and thus the
    warnings (where old_factor is the old pcount).
    
    The fix of setting tx.in_flight in the TCP_REPAIR code path seems
    simple enough, and indeed makes the repro code from syzkaller stop
    producing warnings. Running through kokonut tests, and will send out
    for review when all tests pass.
    
    Effort: net-tcp_bbr
    Origin-9xx-SHA1: 330f825a08a6fe92cef74d799cc468864c479f63
    Change-Id: I0bc4a790f040fd4239620e1eedd5dc64666c6f05

commit 88004ea5d93d28413f1337e2995ac6f28fb89d9c
Author: Neal Cardwell <ncardwell@google.com>
Date:   Wed May 1 20:16:25 2019 -0400

    net-tcp_bbr: v2: adjust skb tx.in_flight upon split in tcp_fragment()
    
    When we fragment an skb that has already been sent, we need to update
    the tx.in_flight for the first skb in the resulting pair ("buff").
    
    Because we were not updating the tx.in_flight, the tx.in_flight value
    was inconsistent with the pcount of the "buff" skb (tx.in_flight would
    be too high). That meant that if the "buff" skb was lost, then
    bbr2_inflight_hi_from_lost_skb() would calculate an inflight_hi value
    that is too high. This could result in longer queues and higher packet
    loss.
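
    The adjustment can be modeled with plain integers, treating
    tx.in_flight as the number of packets in flight right after the skb
    was sent. The function name is hypothetical, and clamping an
    inconsistent input to 0 is an assumption of this sketch.

```c
#include <stdint.h>

/*
 * tx_in_flight: value recorded on the original skb when it was sent.
 * old_pcount:   packet count of the original skb before the split.
 * buff_pcount:  packet count of the "buff" skb after the split.
 *
 * Returns the tx.in_flight that "buff" would have had if it had been
 * sent by itself.
 */
static uint32_t model_buff_in_flight(uint32_t tx_in_flight,
				     uint32_t old_pcount, uint32_t buff_pcount)
{
	/* Packets already in flight before the original skb was sent. */
	uint32_t inflight_prev = 0;

	if (tx_in_flight >= old_pcount)
		inflight_prev = tx_in_flight - old_pcount;
	/* else: inconsistent input, fall back to 0 in this model */

	return inflight_prev + buff_pcount;
}
```

    With 20 packets in flight after sending a 10-packet skb, a 4-packet
    "buff" gets 14 in this model rather than the stale value of 20.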
    
    Packetdrill testing verified that without this commit, when the second
    half of an skb is SACKed and then later the first half of that skb is
    marked lost, the calculated inflight_hi was incorrect.
    
    Effort: net-tcp_bbr
    Origin-9xx-SHA1: 385f1ddc610798fab2837f9f372857438b25f874
    Change-Id: I617f8cab4e9be7a0b8e8d30b047bf8645393354d

commit 817fad8477b264a77f5bd37adfb92ad1eb06ab3e
Author: Neal Cardwell <ncardwell@google.com>
Date:   Wed May 1 20:16:33 2019 -0400

    net-tcp_bbr: v2: adjust skb tx.in_flight upon merge in tcp_shifted_skb()
    
    When tcp_shifted_skb() updates state as adjacent SACKed skbs are
    coalesced, previously the tx.in_flight was not adjusted, so we could
    get contradictory state where the skb's recorded pcount was bigger
    than the tx.in_flight (the number of segments that were in_flight
    after sending the skb).
    
    Normally, having a SACKed skb with contradictory pcount/tx.in_flight
    would not matter. However, with SACK reneging, the SACKed bit is
    removed, and an skb once again becomes eligible for retransmitting,
    fragmenting, SACKing, etc. Packetdrill testing verified the following
    sequence is possible in a kernel that does not have this commit:
    
     - skb N is SACKed
     - skb N+1 is SACKed and combined with skb N using tcp_shifted_skb()
       - tcp_shifted_skb() will increase the pcount of prev,
         but leave tx.in_flight as-is
       - so prev skb can have pcount > tx.in_flight
     - RTO, tcp_timeout_mark_lost(), detect reneg,
       remove "SACKed" bit, mark skb N as lost
       - find pcount of skb N is greater than its tx.in_flight
    
    I suspect this issue is what caused the bbr2_inflight_hi_from_lost_skb():
      WARN_ON_ONCE(inflight_prev < 0)
    to fire in production machines using bbr2.
    
    Tested: See last commit in series for sponge link.
    
    Effort: net-tcp_bbr
    Origin-9xx-SHA1: 1a3e997e613d2dcf32b947992882854ebe873715
    Change-Id: I1b0b75c27519953430c7db51c6f358f104c7af55

commit 3fe3672f973c91ec3311654b584394204001fec3
Author: Neal Cardwell <ncardwell@google.com>
Date:   Tue May 7 22:36:36 2019 -0400

    net-tcp_bbr: v2: factor out tx.in_flight setting into tcp_set_tx_in_flight()
    
    Factor out the code to set an skb's tx.in_flight field into its own
    function, so that this code can be used for the TCP_REPAIR "fake send"
    code path that inserts skbs into the rtx queue without sending
    them. This is in preparation for the following patch, which fixes an
    issue with TCP_REPAIR and tx.in_flight.
    
    Tested: See last patch in series for sponge link.
    
    Effort: net-tcp_bbr
    Origin-9xx-SHA1: e880fc907d06ea7354333f60f712748ebce9497b
    Change-Id: I4fbd4a6e18a51ab06d50ab1c9ad820ce5bea89af

commit 1a21903c691c851309e9c5fa166227f16cab2a46
Author: Neal Cardwell <ncardwell@google.com>
Date:   Tue Aug 7 21:52:06 2018 -0400

    net-tcp_bbr: v2: introduce ca_ops->skb_marked_lost() CC module callback API
    
    For connections experiencing reordering, RACK can mark packets lost
    long after we receive the SACKs/ACKs hinting that the packets were
    actually lost.
    
    This means that CC modules cannot easily learn the volume of inflight
    data at which packet loss happens by looking at the current inflight
    or even the packets in flight when the most recently SACKed packet was
    sent. To learn this, CC modules need to know how many packets were in
    flight at the time lost packets were sent. This new callback, combined
    with TCP_SKB_CB(skb)->tx.in_flight, allows them to learn this.
    
    This also provides a consistent callback that is invoked whether
    packets are marked lost upon ACK processing, using the RACK reordering
    timer, or at RTO time.
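
    The shape of such a callback can be sketched in userspace C. All of
    the model_ types and functions below are hypothetical; they only
    illustrate how one hook, called from every loss-marking path, lets a
    CC module see the in_flight value recorded at transmit time.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical subset of per-skb state. */
struct model_skb {
	uint32_t pcount;       /* packets covered by this skb */
	uint32_t tx_in_flight; /* packets in flight when it was sent */
};

/* Hypothetical CC module state updated by the callback. */
struct model_loss_state {
	uint32_t lost_pkts;
	uint32_t inflight_at_last_loss;
};

struct model_ca_ops {
	void (*skb_marked_lost)(void *state, const struct model_skb *skb);
};

/* Example callback: remember how full the pipe was when the lost skb left. */
static void model_record_loss(void *state, const struct model_skb *skb)
{
	struct model_loss_state *st = state;

	st->lost_pkts += skb->pcount;
	st->inflight_at_last_loss = skb->tx_in_flight;
}

/* Single entry point used by ACK processing, the RACK timer, and RTO. */
static void model_mark_skb_lost(const struct model_ca_ops *ops, void *state,
				const struct model_skb *skb)
{
	if (ops->skb_marked_lost)
		ops->skb_marked_lost(state, skb);
}
```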
    
    Effort: net-tcp_bbr
    Origin-9xx-SHA1: afcbebe3374e4632ac6714d39e4dc8a8455956f4
    Change-Id: I54826ab53df636be537e5d3c618a46145d12d51a

commit a88fb3d7e3cbde0593e9b1a005234ec6d7622751
Author: Neal Cardwell <ncardwell@google.com>
Date:   Mon Nov 19 13:48:36 2018 -0500

    net-tcp_bbr: v2: export FLAG_ECE in rate_sample.is_ece
    
    For understanding the relationship between inflight and ECN signals,
    to try to find the highest inflight value that has acceptable levels
    of ECN marking.
    
    Effort: net-tcp_bbr
    Origin-9xx-SHA1: 3eba998f2898541406c2666781182200934965a8
    Change-Id: I3a964e04cee83e11649a54507043d2dfe769a3b3

commit 3c836d6a378442044d82b9492f9dfe1937f07e32
Author: Neal Cardwell <ncardwell@google.com>
Date:   Thu Oct 12 23:44:27 2017 -0400

    net-tcp_bbr: v2: count packets lost over TCP rate sampling interval
    
    For understanding the relationship between inflight and packet loss
    signals, to try to find the highest inflight value that has acceptable
    levels of packet losses.
    
    Effort: net-tcp_bbr
    Origin-9xx-SHA1: 4527e26b2bd7756a88b5b9ef1ada3da33dd609ab
    Change-Id: I594c2500868d9c530770e7ddd68ffc87c57f4fd5

commit a9dca3ca500c3f089eb010859faaaeb9501bb6f9
Author: Neal Cardwell <ncardwell@google.com>
Date:   Sat Aug 5 11:49:50 2017 -0400

    net-tcp_bbr: v2: snapshot packets in flight at transmit time and pass in rate_sample
    
    For understanding the relationship between inflight and losses or ECN
    signals, to try to find the highest inflight value that has acceptable
    levels of loss/ECN marking.
    
    Effort: net-tcp_bbr
    Origin-9xx-SHA1: b3eb4f2d20efab4ca001f32c9294739036c493ea
    Change-Id: I7314047d0ff14dd261a04b1969a46dc658c8836a

commit 72b3d52fbcfa38d3e8c029f77f2c1d9607a6de76
Author: Neal Cardwell <ncardwell@google.com>
Date:   Sun Jun 24 21:55:59 2018 -0400

    net-tcp_bbr: v2: shrink delivered_mstamp, first_tx_mstamp to u32 to free up 8 bytes
    
    Free up some space for tracking inflight and losses for each
    bw sample, in upcoming commits.
    
    These timestamps are in microseconds, and are now stored in 32
    bits. So they can only hold time intervals up to 2^32 usec, which is
    roughly 2^12 = 4096 seconds (4295 seconds, more precisely).  But Linux
    TCP RTT and RTO tracking has the same 32-bit
    microsecond implementation approach and resulting deployment
    limitations. So this is not introducing a new limit. And these should
    not be a limitation for the foreseeable future.
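
    The arithmetic behind that limit, and the reason a 32-bit difference
    stays correct across a single wraparound, can be checked with a short
    sketch. The function names are hypothetical and this is not the
    kernel's timestamp code.

```c
#include <stdint.h>

#define MODEL_USEC_PER_SEC 1000000u

/* Largest whole number of seconds a 32-bit microsecond count can hold. */
static uint32_t model_max_interval_sec(void)
{
	return UINT32_MAX / MODEL_USEC_PER_SEC;
}

/*
 * Elapsed microseconds between two 32-bit timestamps. Unsigned
 * subtraction is modular, so the result is right even if the counter
 * wrapped once between then_us and now_us.
 */
static uint32_t model_elapsed_us(uint32_t now_us, uint32_t then_us)
{
	return now_us - then_us;
}
```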
    
    Effort: net-tcp_bbr
    Origin-9xx-SHA1: 238a7e6b5d51625fef1ce7769826a7b21b02ae55
    Change-Id: I3b779603797263b52a61ad57c565eb91fe42680c

commit 4beae98fdebeee68908a4cd2086e27782dbdd84f
Author: Yuchung Cheng <ycheng@google.com>
Date:   Tue Mar 27 18:01:46 2018 -0700

    net-tcp_rate: account for CE marks in rate sample
    
    This patch counts the number of delivered packets that have a CE mark
    in the rate sample, using an approach similar to delivery accounting.
    
    Effort: net-tcp_rate
    Origin-9xx-SHA1: 710644db434c3da335a7c8b72207a671ccbb5cf8
    Change-Id: I0968fb33fe19b5c774e8c3afd2685558a6ec8710

commit e1fa0b1b4d7186616bdff28be52e8f25e3fdeb1d
Author: Yuchung Cheng <ycheng@google.com>
Date:   Tue Mar 27 18:33:29 2018 -0700

    net-tcp_rate: consolidate inflight tracking approaches in TCP
    
    In order to track CE marks per rate sample (one round trip), we'll
    need to snap the starting tcp delivered_ce count in the packet
    meta header (tcp_skb_cb). But there's not enough space.
    
    Good news is that the "last_in_flight" in the header, used by
    NV congestion control, is almost equivalent to "delivered". In
    fact "delivered" is better by accounting out-of-order packets
    additionally.  Therefore we can remove it to make room for the
    CE tracking.
    
    This would make delayed ACK detection slightly less accurate but the
    impact is negligible since it's not used for any critical control.
    
    Effort: net-tcp_rate
    Origin-9xx-SHA1: ddcd46ec85d5f1c4454258af0c54b3254c0d64a7
    Change-Id: I1a184aad6d101c981ac7f2f275aa9417ff856910

commit e99a898dd29ed733a368201f8d0654e067853169
Author: Neal Cardwell <ncardwell@google.com>
Date:   Tue Jun 11 12:26:55 2019 -0400

    net-tcp_bbr: broaden app-limited rate sample detection
    
    This commit is a bug fix for the Linux TCP app-limited
    (application-limited) logic that is used for collecting rate
    (bandwidth) samples.
    
    Previously the app-limited logic only looked for "bubbles" of
    silence in between application writes, by checking at the start
    of each sendmsg. But "bubbles" of silence can also happen before
    retransmits: e.g. bubbles can happen between an application write
    and a retransmit, or between two retransmits.
    
    Retransmits are triggered by ACKs or timers. So this commit checks
    for bubbles of app-limited silence upon ACKs or timers.
    
    Why does this commit check for app-limited state at the start of
    ACKs and timer handling? Because at that point we know whether
    inflight was fully using the cwnd.  During processing the ACK or
    timer event we often change the cwnd; after changing the cwnd we
    can't know whether inflight was fully using the old cwnd.
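
    A simplified version of the check, as it might run at the start of
    ACK or timer processing, is sketched below. The function and
    parameter names are hypothetical, and the real check looks at more
    state (for example, data still queued below TCP) than this model.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Treat the flow as app-limited when there is less than one MSS left to
 * send, inflight is not filling the cwnd, and no lost packets are still
 * waiting to be retransmitted.
 */
static bool model_is_app_limited(uint32_t unsent_bytes, uint32_t mss,
				 uint32_t packets_in_flight, uint32_t cwnd,
				 uint32_t lost_awaiting_retransmit)
{
	return unsent_bytes < mss &&
	       packets_in_flight < cwnd &&
	       lost_awaiting_retransmit == 0;
}
```

    Running the check before the ACK or timer handler changes cwnd is
    what lets the comparison against cwnd reflect the window that was
    actually in effect.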
    
    Origin-9xx-SHA1: 3fe9b53291e018407780fb8c356adb5666722cbc
    Change-Id: I37221506f5166877c2b110753d39bb0757985e68

commit 50bc1ffc2beef1e67f02d25fd93ed56dc3b12141
Author: Nick Terrell <terrelln@fb.com>
Date:   Thu Apr 29 18:31:56 2021 -0700

    lib: zstd: Upgrade to latest upstream zstd version 1.4.10
    
    Upgrade to the latest upstream zstd version 1.4.10.
    
    This patch is 100% generated from upstream zstd commit 67a426c322c5 [0].
    
    This patch is very large because it is transitioning from the custom
    kernel zstd to using upstream directly. The new zstd follows upstream's
    file structure which is different. Future update patches will be much
    smaller because they will only contain the changes from one upstream
    zstd release.
    
    As an aid for review I've created a commit [1] that shows the diff
    between upstream zstd as-is (which doesn't compile), and the zstd
    code imported in this patch. The version of zstd in this patch is
    generated from upstream with changes applied by automation to replace
    upstream's libc dependencies, remove unnecessary portability macros,
    replace `/**` comments with `/*` comments, and use the kernel's xxhash
    instead of bundling it.
    
    The benefits of this patch are as follows:
    1. Using upstream directly with automated script to generate kernel
       code. This allows us to update the kernel every upstream release, so
       the kernel gets the latest bug fixes and performance improvements,
       and doesn't get 3 years out of date again. The automation and the
       translated code are tested every upstream commit to ensure it
       continues to work.
    2. Upgrades from a custom zstd based on 1.3.1 to 1.4.10, getting 3 years
       of performance improvements and bug fixes. On x86_64 I've measured
       15% faster BtrFS and SquashFS decompression+read speeds, 35% faster
       kernel decompression, and 30% faster ZRAM decompression+read speeds.
    3. Zstd-1.4.10 supports negative compression levels, which allow zstd to
       match or subsume lzo's performance.
    4. Maintains the same kernel-specific wrapper API, so no callers have to
       be modified with zstd version updates.
    
    One concern that was brought up was stack usage. Upstream zstd had
    already removed most of its heavy stack usage functions, but I just
    removed the last functions that allocate arrays on the stack. I've
    measured the high water mark for both compression and decompression
    before and after this patch. Decompression is approximately neutral,
    using about 1.2KB of stack space. Compression levels up to 3 regressed
    from 1.4KB -> 1.6KB, and higher compression levels regressed from 1.5KB
    -> 2KB. We've added unit tests upstream to prevent further regression.
    I believe that this is a reasonable increase, and if it does end up
    causing problems, this commit can be cleanly reverted, because it only
    touches zstd.
    
    I chose the bulk update instead of replaying upstream commits because
    there have been ~3500 upstream commits since the 1.3.1 release, zstd
    wasn't ready to be used in the kernel as-is before a month ago, and not
    all upstream zstd commits build. The bulk update preserves bisectability
    because bugs can be bisected to the zstd version update. At that point
    the update can be reverted, and we can work with upstream to find and
    fix the bug.
    
    Note that upstream zstd release 1.4.10 doesn't exist yet. I have cut a
    staging branch at 4432dac93bea [0] and will apply any changes requested
    to the staging branch. Once we're ready to merge this update I will cut
    a zstd release at the commit we merge, so we have a known zstd release
    in the kernel.
    
    The implementation of the kernel API is contained in
    zstd_compress_module.c and zstd_decompress_module.c.
    
    [0] https://github.com/facebook/zstd/commit/67a426c322c58a91b7fa59fdb2d59ea4f641b185
    [1] https://github.com/terrelln/linux/commit/cc1caeead9ca616386e35078a36e05ab536af112
    
    Signed-off-by: Nick Terrell <terrelln@fb.com>

commit 2f701516ba0f6f4d539e20a58365481a67508654
Author: Nick Terrell <terrelln@fb.com>
Date:   Thu Apr 29 18:31:55 2021 -0700

    lib: zstd: Add decompress_sources.h for decompress_unzstd
    
    Adds decompress_sources.h which includes every .c file necessary for
    zstd decompression. This is used in decompress_unzstd.c so the internal
    structure of the library isn't exposed.
    
    This allows us to upgrade the zstd library version without modifying any
    callers. Instead we just need to update decompress_sources.h.
    
    Signed-off-by: Nick Terrell <terrelln@fb.com>

commit 041ebfb541fcd10f96c40b2dd8adb89d73189153
Author: Nick Terrell <terrelln@fb.com>
Date:   Thu Apr 29 18:31:54 2021 -0700

    lib: zstd: Add kernel-specific API
    
    This patch:
    - Moves `include/linux/zstd.h` -> `include/linux/zstd_lib.h`
    - Updates modified zstd headers to yearless copyright
    - Adds a new API in `include/linux/zstd.h` that is functionally
      equivalent to the in-use subset of the current API. Functions are
      renamed to avoid symbol collisions with zstd, to make it clear it is
      not the upstream zstd API, and to follow the kernel style guide.
    - Updates all callers to use the new API.
    
    There are no functional changes in this patch. Since there are no
    functional changes, I felt it was okay to update all the callers in a
    single patch. Once the API is approved, the callers are mechanically
    changed.
    
    This patch is preparing for the 3rd patch in this series, which updates
    zstd to version 1.4.10. Since the upstream zstd API is no longer exposed
    to callers, the update can happen transparently.
    
    Signed-off-by: Nick Terrell <terrelln@fb.com>

commit 965d7f45c6714fe9e2a08b07f6f9f9b9073f93e2
Author: Alexandre Frade <kernel@xanmod.org>
Date:   Tue Aug 31 17:06:58 2021 +0000

    futex: Implement mechanism to wait on any of several futexes
    
    This is a new futex operation, called FUTEX_WAIT_MULTIPLE, which allows
    a thread to wait on several futexes at the same time, and be awoken by
    any of them.  In a sense, it implements one of the features that was
    supported by polling on the old FUTEX_FD interface.
    
    The use case lies in the Wine implementation of the Windows NT interface
    WaitMultipleObjects. This Windows API function allows a thread to sleep
    waiting on the first of a set of event sources (mutexes, timers, signal,
    console input, etc) to signal.  Considering this is a primitive
    synchronization operation for Windows applications, being able to quickly
    signal events on the producer side, and quickly go to sleep on the
    consumer side is essential for good performance of those running over Wine.
    
    Wine developers have an implementation that uses eventfd, but it suffers
    from FD exhaustion (there are applications that go to the order of
    multi-million FDs), and higher CPU utilization than this new operation.
    
    The futex list is passed as an array of `struct futex_wait_block`
    (pointer, value, bitset) to the kernel, which will enqueue all of them
    and sleep if none was already triggered. It returns a hint of which
    futex caused the wake up event to userspace, but the hint doesn't
    guarantee that is the only futex triggered.  Before calling the syscall
    again, userspace should traverse the list, trying to re-acquire any of
    the other futexes, to prevent an immediate -EWOULDBLOCK return code from
    the kernel.
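
    The userspace side of that advice can be sketched as follows. The
    struct mirrors the (pointer, value, bitset) triple described above,
    but its name, field names, and layout are assumptions for
    illustration, and the helper is hypothetical userspace code rather
    than part of the kernel interface.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace model of one wait entry: (pointer, value, bitset). */
struct model_futex_wait_block {
	_Atomic uint32_t *uaddr; /* futex word to watch */
	uint32_t val;            /* value the waiter expects to see */
	uint32_t bitset;         /* wake mask */
};

/*
 * After a wakeup, scan every entry, starting at the kernel's hint, and
 * return the index of the first futex whose word no longer matches the
 * expected value, or -1 if all of them still match.
 */
static int model_find_changed(const struct model_futex_wait_block *blocks,
			      size_t count, size_t hint)
{
	for (size_t i = 0; i < count; i++) {
		size_t idx = (hint + i) % count;

		if (atomic_load(blocks[idx].uaddr) != blocks[idx].val)
			return (int)idx;
	}
	return -1;
}
```

    Starting the scan at the hint keeps the common case cheap, while
    continuing through the rest of the array covers the case where more
    than one futex was triggered.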
    
    This was tested using three mechanisms:
    
    1) By reimplementing FUTEX_WAIT in terms of FUTEX_WAIT_MULTIPLE and
    running the unmodified tools/testing/selftests/futex and a full linux
    distro on top of this kernel.
    
    2) By an example code that exercises the FUTEX_WAIT_MULTIPLE path on a
    multi-threaded, event-handling setup.
    
    3) By running the Wine fsync implementation and executing multi-threaded
    applications, in particular modern games, on top of this implementation.
    
    Changes were tested for the following ABIs: x86_64, i386 and x32.
    Support for x32 applications is not implemented since it would
    take a major rework adding a new entry point and splitting the current
    futex 64 entry point in two and we can't change the current x32 syscall
    number without breaking user space compatibility.
    
    Included Valve's Proton compatibility code.
    
    Applicable with futex2 patchset.
    
    Cc: Steven Rostedt <rostedt@goodmis.org>
    Cc: Richard Yao <ryao@gentoo.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Co-developed-by: Zebediah Figura <z.figura12@gmail.com>
    Signed-off-by: Zebediah Figura <z.figura12@gmail.com>
    Co-developed-by: Steven Noonan <steven@valvesoftware.com>
    Signed-off-by: Steven Noonan <steven@valvesoftware.com>
    Co-developed-by: Pierre-Loup A. Griffais <pgriffais@valvesoftware.com>
    Signed-off-by: Pierre-Loup A. Griffais <pgriffais@valvesoftware.com>
    Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
    [Added compatibility code]
    Co-developed-by: André Almeida <andrealmeid@collabora.com>
    Signed-off-by: André Almeida <andrealmeid@collabora.com>
    
    Rebased-by: Alexandre Frade <kernel@xanmod.org>

commit 6e77d352e9db5bc367cf087e832c662c29783cf8
Author: André Almeida <andrealmeid@collabora.com>
Date:   Fri Jun 25 18:52:32 2021 -0300

    futex2: proton

commit b1a704d844e677165842cdb5afd416274822c9b5
Author: André Almeida <andrealmeid@collabora.com>
Date:   Fri Feb 5 10:34:02 2021 -0300

    futex2: Add sysfs entry for syscall numbers
    
    In the course of futex2 development, it will be rebased on top of
    different kernel releases, and the syscall number can change in this
    process. Expose futex2 syscall number via sysfs so tools that are
    experimenting with futex2 (like Proton/Wine) can test it and set the
    syscall number at runtime, rather than setting it at compilation time.
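    
    A minimal sketch of how a tool might pick up the number at runtime. The
    helper name is hypothetical, and the sysfs path is left to the caller
    because this message does not name it; the file is assumed to hold one
    decimal number.
    
    ```c
    #include <stdio.h>
    
    /*
     * Read one decimal syscall number from a sysfs-style text file.
     * Returns the number, or -1 if the file is missing or malformed.
     */
    static long read_syscall_nr(const char *path)
    {
        FILE *f = fopen(path, "r");
        long nr = -1;
    
        if (!f)
            return -1;
        if (fscanf(f, "%ld", &nr) != 1 || nr < 0)
            nr = -1;
        fclose(f);
        return nr;
    }
    ```
    
    The returned value could then be passed to syscall(2) instead of a
    constant fixed at compilation time.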
    
    Signed-off-by: André Almeida <andrealmeid@collabora.com>

commit fd4d81ccbed0cb4ad42a4d546342e93111f4d7e8
Author: André Almeida <andrealmeid@collabora.com>
Date:   Tue Jun 29 16:17:42 2021 -0300

    perf bench: Add futex2 benchmark tests
    
    Add support to the existing futex benchmarking code base to enable
    futex2 calls. `perf bench` tests can be used not only as a way to
    measure the performance of the implementation, but also as stress testing
    for the kernel infrastructure.
    
    Signed-off-by: André Almeida <andrealmeid@collabora.com>

commit 8d639eab4d4f993a241c7a3624be370aa42b5e32
Author: André Almeida <andrealmeid@collabora.com>
Date:   Fri Feb 5 10:34:02 2021 -0300

    selftests: futex2: Add waitv test
    
    Create a new file to test the waitv mechanism. Test both private and
    shared futexes. Wake the last futex in the array, and check if the
    return value from futex_waitv() is the right index.
    
    Signed-off-by: André Almeida <andrealmeid@collabora.com>

commit 81056be7e1486eeafa5d572257cf60a4d3e2d0e5
Author: André Almeida <andrealmeid@collabora.com>
Date:   Fri Feb 5 10:34:01 2021 -0300

    selftests: futex2: Add wouldblock test
    
    Adapt existing futex wait wouldblock file to test the same mechanism for
    futex2.
    
    Signed-off-by: André Almeida <andrealmeid@collabora.com>

commit 33eaaee8dbec84cd30b18dede1c2ca7292ed4db0
Author: André Almeida <andrealmeid@collabora.com>
Date:   Fri Feb 5 10:34:01 2021 -0300

    selftests: futex2: Add timeout test
    
    Adapt existing futex wait timeout file to test the same mechanism for
    futex2. futex2 accepts only absolute 64bit timers, but supports both
    monotonic and realtime clocks.
    
    Signed-off-by: André Almeida <andrealmeid@collabora.com>

commit ed8b67078f182f8d2237d26d99e5220532708676
Author: André Almeida <andrealmeid@collabora.com>
Date:   Fri Feb 5 10:34:01 2021 -0300

    selftests: futex2: Add wake/wait test
    
    Add a simple file to test wake/wait mechanism using futex2 interface.
    Test three scenarios: using a common local int variable as private
    futex, a shm futex as shared futex and a file-backed shared memory as a
    shared futex. This should test all branches of futex_get_key().
    
    Create helper files so more tests can evaluate futex2. While 32bit ABIs
    from glibc aren't yet able to use 64 bit sized time variables, add a
    temporary workaround that implements the required types and calls the
    appropriate syscalls, since futex2 doesn't support 32 bit sized time.
    
    Signed-off-by: André Almeida <andrealmeid@collabora.com>

commit c98b09c1fbe66db0bbc7814d1e5f742e322c36bf
Author: André Almeida <andrealmeid@collabora.com>
Date:   Tue Feb 9 13:59:00 2021 -0300

    docs: locking: futex2: Add documentation
    
    Add a new documentation file specifying both userspace API and internal
    implementation details of futex2 syscalls.
    
    Signed-off-by: André Almeida <andrealmeid@collabora.com>

commit a1fa8fc751afd78d478a941964c681e620b417c7
Author: André Almeida <andrealmeid@collabora.com>
Date:   Thu Jun 24 10:43:51 2021 -0300

    futex2: Implement vectorized wait
    
    Add support to wait on multiple futexes. This is the interface
    implemented by this syscall:
    
    futex_waitv(struct futex_waitv *waiters, unsigned int nr_futexes,
                unsigned int flags, struct timespec *timo)
    
    struct futex_waitv {
            __u64 val;
            void *uaddr;
            unsigned int flags;
    };
    
    Given an array of struct futex_waitv, wait on each uaddr. The thread
    wakes if a futex_wake() is performed at any uaddr. The syscall returns
    immediately if any waiter has *uaddr != val. *timo is an optional
    timeout value for the operation. The flags argument of the syscall
    should be used solely for specifying the timeout clock as realtime, if
    needed.  Flags for shared futexes, sizes, etc. should be used on the
    individual flags of each waiter.
    
    Returns the array index of one of the awakened futexes. No information
    is given about how many were awakened, or about any particular attribute
    of the returned one (whether it was the first awakened, whether it has
    the smallest index...).
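    
    The "returns immediately if any waiter has *uaddr != val" rule can be
    modelled in userspace as a pre-check. This is a sketch, not the kernel
    code: the struct is a local mirror of the one shown above, the helper
    name is hypothetical, and every uaddr is assumed to point at a 32-bit
    futex word. The per-waiter flags are left at 0 because their values are
    not given here.
    
    ```c
    #include <stdint.h>
    
    /* Local mirror of struct futex_waitv, using fixed-width types. */
    struct futex_waitv_sketch {
        uint64_t val;
        void *uaddr;
        unsigned int flags;
    };
    
    /*
     * Index of the first waiter whose futex word already differs from
     * val, or -1 if every word still matches and waiting makes sense.
     */
    static int first_mismatch(const struct futex_waitv_sketch *w,
                              unsigned int nr)
    {
        for (unsigned int i = 0; i < nr; i++) {
            uint32_t cur = __atomic_load_n((const uint32_t *)w[i].uaddr,
                                           __ATOMIC_ACQUIRE);
    
            if (cur != (uint32_t)w[i].val)
                return (int)i;
        }
        return -1;
    }
    ```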

commit 60e07a8eb7ac13a8e52e118523927bc8f15ee39d
Author: André Almeida <andrealmeid@collabora.com>
Date:   Thu Jun 17 11:50:20 2021 -0300

    futex2: Implement wait and wake functions
    
    Create a new set of futex syscalls known as futex2. This new interface
    is aimed at adding new functionality without modifying the current,
    complex interface.
    
    Implement wait and wake functions with support for 32-bit sized futexes:
    
    - futex_wait(void *uaddr, unsigned int val, unsigned int flags,
                 struct timespec *timo)
    
       The user thread is put to sleep, waiting for a futex_wake() at uaddr,
       if the value at *uaddr is the same as val (otherwise, the syscall
       returns immediately with -EAGAIN). timo is an optional timeout value
       for the operation.
    
       Return 0 on success, error code otherwise.
    
    - futex_wake(void *uaddr, unsigned long nr_wake, unsigned int flags)
    
       Wake `nr_wake` threads waiting at uaddr.
    
       Return the number of woken threads on success, error code otherwise.
    
    ** The `flag` argument
    
     The flag is used to specify the size of the futex word
     (FUTEX_[8, 16, 32, 64]). It's mandatory to define one.
    
     By default, the timeout uses a monotonic clock, but can be used as a
     realtime one by using the FUTEX_REALTIME_CLOCK flag.
    
     By default, futexes are of the private type, which means that this user
     address will be accessed by threads that share the same memory region.
     This allows for some internal optimizations, so they are faster.
     However, if the address needs to be shared with different processes
     (like using `mmap()` or `shm()`), they need to be defined as shared and
     the flag FUTEX_SHARED_FLAG is used to set that.
    
     By default, the operation has no NUMA-awareness, meaning that the user
     can't choose the memory node where the kernel side futex data will be
     stored. The user can choose the node where it wants to operate by
     setting the FUTEX_NUMA_FLAG and using the following structure (where X
     can be 8, 16, 32, or 64):
    
      struct futexX_numa {
              __uX value;
              __sX hint;
      };
    
     This structure should be passed at the `void *uaddr` of futex
     functions. The address of the structure will be used for waiting and
     waking, and the `value` will be compared to `val` as usual. The `hint`
     member is used to define which node the futex will use. When waiting,
     the futex will be registered on a kernel-side table stored on that
     node; when waking, the futex will be searched for on that given table.
     That means that there's no redundancy between tables, and the wrong
     `hint` value will lead to undesired behavior.  Userspace is responsible
     for dealing with node migration issues that may occur. `hint` can
     range from [0, MAX_NUMA_NODES], for specifying a node, or -1, to use
     the same node the current process is using.
    
     When not using FUTEX_NUMA_FLAG on a NUMA system, the futex will be
     stored on a global table on some node, defined at compilation time.
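    
     As an illustration, the futexX_numa template could be instantiated
     for X = 32 as below. The concrete struct name and the init helper are
     assumptions made for this sketch; only the two members come from the
     template above.
    
     ```c
     #include <stddef.h>
     #include <stdint.h>
    
     /* The futexX_numa template, instantiated for X = 32. */
     struct futex32_numa {
         uint32_t value; /* compared against val, as for a plain futex */
         int32_t hint;   /* NUMA node, or -1 for the caller's current node */
     };
    
     /* Build a futex word that follows the calling process's node. */
     static struct futex32_numa futex32_numa_init(uint32_t value)
     {
         struct futex32_numa f = { .value = value, .hint = -1 };
    
         return f;
     }
     ```
    
     The address of such an object is what would be passed as `uaddr`
     together with FUTEX_NUMA_FLAG.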
    
    ** The `timo` argument
    
    As per the Y2038 work done in the kernel, new interfaces shouldn't add
    timeout options known to be buggy. Given that, `timo` should be a 64bit
    timeout at all platforms, using an absolute timeout value.
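    
    Because `timo` is absolute, a caller with a relative delay has to add
    it to the current time first. A minimal sketch, assuming a libc whose
    struct timespec carries a 64-bit tv_sec and a non-negative delay; both
    helper names are hypothetical.
    
    ```c
    #define _POSIX_C_SOURCE 200809L
    #include <stdint.h>
    #include <time.h>
    
    /* Add a delay in milliseconds to a base time, normalising tv_nsec. */
    static struct timespec abs_deadline(struct timespec base, int64_t delay_ms)
    {
        base.tv_sec += delay_ms / 1000;
        base.tv_nsec += (delay_ms % 1000) * 1000000L;
        if (base.tv_nsec >= 1000000000L) {
            base.tv_sec += 1;
            base.tv_nsec -= 1000000000L;
        }
        return base;
    }
    
    /* Deadline delay_ms from now on the monotonic clock (the default). */
    static struct timespec monotonic_deadline(int64_t delay_ms)
    {
        struct timespec now;
    
        clock_gettime(CLOCK_MONOTONIC, &now);
        return abs_deadline(now, delay_ms);
    }
    ```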
    
    Signed-off-by: André Almeida <andrealmeid@collabora.com>

commit b0a3be52c5df28a1e2b298d37aa8b7108f0f0bb8
Author: Alexey Avramov <hakavlad@inbox.lv>
Date:   Tue Aug 24 03:39:08 2021 +0900

    mm/vmscan: add sysctl knobs for protecting the working set
    
    The kernel does not provide a way to protect the working set under memory
    pressure. A certain amount of anonymous and clean file pages is required by
    the userspace for normal operation. First of all, the userspace needs a
    cache of shared libraries and executable binaries. If the amount of the
    clean file pages falls below a certain level, then thrashing and even
    livelock can take place.
    
    The patch provides sysctl knobs for protecting the working set (anonymous
    and clean file pages) under memory pressure.
    
    The vm.anon_min_kbytes sysctl knob provides *hard* protection of anonymous
    pages. The anonymous pages on the current node won't be reclaimed under any
    conditions when their amount is below vm.anon_min_kbytes. This knob may be
    used to prevent excessive swap thrashing when anonymous memory is low (for
    example, when memory is going to be overfilled by compressed data of zram
    module). The default value is defined by CONFIG_ANON_MIN_KBYTES (suggested
    0 in Kconfig).
    
    The vm.clean_low_kbytes sysctl knob provides *best-effort* protection of
    clean file pages. The file pages on the current node won't be reclaimed
    under memory pressure when the amount of clean file pages is below
    vm.clean_low_kbytes *unless* we threaten to OOM. Protection of clean file
    pages using this knob may be used when swapping is still possible to
      - prevent disk I/O thrashing under memory pressure;
      - improve performance in disk cache-bound tasks under memory pressure.
    The default value is defined by CONFIG_CLEAN_LOW_KBYTES (suggested 0 in
    Kconfig).
    
    The vm.clean_min_kbytes sysctl knob provides *hard* protection of clean
    file pages. The file pages on the current node won't be reclaimed under
    memory pressure when the amount of clean file pages is below
    vm.clean_min_kbytes. Hard protection of clean file pages using this knob
    may be used to
      - prevent disk I/O thrashing under memory pressure even with no free swap
        space;
      - improve performance in disk cache-bound tasks under memory pressure;
      - avoid high latency and prevent livelock in near-OOM conditions.
    The default value is defined by CONFIG_CLEAN_MIN_KBYTES (suggested 0 in
    Kconfig).
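    
    The interplay of the three knobs can be summarised with a small
    decision model. This is a userspace sketch of the rules stated above,
    not the code added to mm/vmscan.c; the struct, the function names and
    the near_oom input are assumptions.
    
    ```c
    #include <stdbool.h>
    
    /* Hypothetical snapshot of the three knobs, all in KiB. */
    struct ws_knobs {
        unsigned long anon_min_kbytes;
        unsigned long clean_low_kbytes;
        unsigned long clean_min_kbytes;
    };
    
    /* Anonymous pages: hard protection below vm.anon_min_kbytes. */
    static bool may_reclaim_anon(const struct ws_knobs *k,
                                 unsigned long anon_kb)
    {
        return anon_kb >= k->anon_min_kbytes;
    }
    
    /*
     * Clean file pages: hard protection below vm.clean_min_kbytes,
     * best-effort protection below vm.clean_low_kbytes that gives way
     * when OOM threatens.
     */
    static bool may_reclaim_clean(const struct ws_knobs *k,
                                  unsigned long clean_kb, bool near_oom)
    {
        if (clean_kb < k->clean_min_kbytes)
            return false;
        if (clean_kb < k->clean_low_kbytes)
            return near_oom;
        return true;
    }
    ```
    
    With all three knobs at the suggested default of 0, both helpers
    always allow reclaim, which matches the unpatched behaviour.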
    
    Signed-off-by: Alexey Avramov <hakavlad@inbox.lv>

commit fd0956e4d4f8c43d285c5aee96e8a1fc86b4d55a
Author: Yu Zhao <yuzhao@google.com>
Date:   Wed Aug 18 00:31:07 2021 -0600

    mm: multigenerational lru: documentation
    
    Add Documentation/vm/multigen_lru.rst.
    
    Signed-off-by: Yu Zhao <yuzhao@google.com>
    Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>

commit 5b37aedba70847e62857998488fc9dff11cd7b36
Author: Yu Zhao <yuzhao@google.com>
Date:   Wed Aug 18 00:31:06 2021 -0600

    mm: multigenerational lru: Kconfig
    
    Add configuration options for the multigenerational lru.
    
    Signed-off-by: Yu Zhao <yuzhao@google.com>
    Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>

commit 29c00f025cedc4286d3eaddb1af80f2654797e43
Author: Yu Zhao <yuzhao@google.com>
Date:   Wed Aug 18 00:31:05 2021 -0600

    mm: multigenerational lru: user interface
    
    Add /sys/kernel/mm/lru_gen/enabled to enable and disable the
    multigenerational lru at runtime.
    
    Add /sys/kernel/mm/lru_gen/min_ttl_ms to protect the working set of a
    given number of milliseconds. The OOM killer is invoked if this
    working set cannot be kept in memory.
    
    Add /sys/kernel/debug/lru_gen to monitor the multigenerational lru and
    invoke the aging and the eviction. This file has the following output:
      memcg  memcg_id  memcg_path
        node  node_id
          min_gen  birth_time  anon_size  file_size
          ...
          max_gen  birth_time  anon_size  file_size
    
    min_gen is the oldest generation number and max_gen is the youngest
    generation number. birth_time is in milliseconds. anon_size and
    file_size are in pages.
    
    This file takes the following input:
      + memcg_id node_id max_gen [swappiness]
      - memcg_id node_id min_gen [swappiness] [nr_to_reclaim]
    
    The first command line invokes the aging, which scans PTEs for
    accessed pages and then creates the next generation max_gen+1. A swap
    file and a non-zero swappiness, which overrides vm.swappiness, are
    required to scan PTEs mapping anon pages. The second command line
    invokes the eviction, which evicts generations less than or equal to
    min_gen. min_gen should be less than max_gen-1 as max_gen and
    max_gen-1 are not fully aged and therefore cannot be evicted.
    nr_to_reclaim can be used to limit the number of pages to evict.
    Multiple command lines are supported, as is concatenation with
    delimiters "," and ";".
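    
    The two command lines can be built with plain string formatting. The
    helpers below are hypothetical and always emit the optional fields;
    they only produce the text, which a privileged tool would then write
    to /sys/kernel/debug/lru_gen.
    
    ```c
    #include <stdio.h>
    
    /* Format "+ memcg_id node_id max_gen swappiness" (aging). */
    static int fmt_aging_cmd(char *buf, size_t len, unsigned int memcg_id,
                             unsigned int node_id, unsigned long max_gen,
                             unsigned int swappiness)
    {
        return snprintf(buf, len, "+ %u %u %lu %u", memcg_id, node_id,
                        max_gen, swappiness);
    }
    
    /* Format "- memcg_id node_id min_gen swappiness nr_to_reclaim". */
    static int fmt_evict_cmd(char *buf, size_t len, unsigned int memcg_id,
                             unsigned int node_id, unsigned long min_gen,
                             unsigned int swappiness,
                             unsigned long nr_to_reclaim)
    {
        return snprintf(buf, len, "- %u %u %lu %u %lu", memcg_id, node_id,
                        min_gen, swappiness, nr_to_reclaim);
    }
    ```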
    
    Signed-off-by: Yu Zhao <yuzhao@google.com>
    Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>

commit 1d0e6093d7254e7801d02b1c204a64268fff31cf
Author: Yu Zhao <yuzhao@google.com>
Date:   Wed Aug 18 00:31:04 2021 -0600

    mm: multigenerational lru: eviction
    
    The eviction consumes old generations. Given an lruvec, the eviction
    scans pages on lrugen->lists indexed by anon and file min_seq[2]
    (modulo MAX_NR_GENS). It first tries to select a type based on the
    values of min_seq[2]. If they are equal, it selects the type that has
    a lower refault rate. The eviction sorts a page according to its
    updated generation number if the aging has found this page accessed.
    It also moves a page to the next generation if this page is from an
    upper tier that has a higher refault rate than the base tier. The
    eviction increments min_seq[2] of a selected type when it finds
    lrugen->lists indexed by min_seq[2] of this selected type are empty.
    
    With the aging and the eviction in place, implementing page reclaim
    becomes quite straightforward:
      1) To reduce the latency, direct reclaim skips the aging unless both
      min_seq[2] are equal to max_seq-1. Then it invokes the eviction.
      2) To avoid the aging in the direct reclaim path, kswapd invokes the
      aging if either of min_seq[2] is equal to max_seq-1. Then it invokes
      the eviction.
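    
    The two rules above reduce to a pair of predicates on min_seq[2] and
    max_seq. The following is a sketch with hypothetical function names,
    not the code in mm/vmscan.c:
    
    ```c
    #include <stdbool.h>
    
    /* Direct reclaim ages only when both types have caught up. */
    static bool direct_reclaim_runs_aging(const unsigned long min_seq[2],
                                          unsigned long max_seq)
    {
        return min_seq[0] == max_seq - 1 && min_seq[1] == max_seq - 1;
    }
    
    /* kswapd ages earlier, as soon as either type has caught up. */
    static bool kswapd_runs_aging(const unsigned long min_seq[2],
                                  unsigned long max_seq)
    {
        return min_seq[0] == max_seq - 1 || min_seq[1] == max_seq - 1;
    }
    ```
    
    Since kswapd's condition is the weaker one, it normally triggers
    first, which keeps the aging out of the latency-sensitive direct
    reclaim path.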
    
    Signed-off-by: Yu Zhao <yuzhao@google.com>
    Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>

commit b87b0bd8b0e8b584b739c985d668fce7cd953b13
Author: Yu Zhao <yuzhao@google.com>
Date:   Wed Aug 18 00:31:03 2021 -0600

    mm: multigenerational lru: aging
    
    The aging produces young generations. Given an lruvec, the aging
    traverses lruvec_memcg()->mm_list and calls walk_page_range() to scan
    PTEs for accessed pages. Upon finding one, the aging updates its
    generation number to max_seq (modulo MAX_NR_GENS). After each round of
    traversal, the aging increments max_seq. The aging is due when both
    min_seq[2] have caught up with max_seq-1.
    
    The aging uses the following optimizations when walking page tables:
      1) It skips page tables of processes that have been sleeping since
      the last walk.
      2) It skips non-leaf PMD entries that have the accessed bit cleared
      when CONFIG_HAVE_ARCH_PARENT_PMD_YOUNG=y.
      3) It does not zigzag between a PGD table and the same PMD or PTE
      table spanning multiple VMAs. In other words, it finishes all the
      VMAs within the range of the same PMD or PTE table before it returns
      to this PGD table.
    
    Signed-off-by: Yu Zhao <yuzhao@google.com>
    Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>

commit 6658557afe49792727630e53b97a29d6fb722170
Author: Yu Zhao <yuzhao@google.com>
Date:   Wed Aug 18 00:31:02 2021 -0600

    mm: multigenerational lru: mm_struct list
    
    To scan PTEs for accessed pages, a mm_struct list is maintained for
    each memcg. When multiple threads traverse the same memcg->mm_list,
    each of them gets a unique mm_struct and therefore they can run
    walk_page_range() concurrently to reach page tables of all processes
    of this memcg.
    
    And to skip page tables of processes that have been sleeping since the
    last walk, the usage of mm_struct is also tracked between context
    switches.
    
    Signed-off-by: Yu Zhao <yuzhao@google.com>
    Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>

commit 047aa506412f69a1bdb5d19bb7d073804dba7fa7
Author: Yu Zhao <yuzhao@google.com>
Date:   Wed Aug 18 00:31:01 2021 -0600

    mm: multigenerational lru: protection
    
    The protection is based on page access types and patterns. There are
    two access types: one via page tables and the other via file
    descriptors. The protection of the former type is by design stronger
    because:
      1) The uncertainty in determining the access patterns of the former
      type is higher due to the coalesced nature of the accessed bit.
      2) The cost of evicting the former type is higher due to the TLB
      flushes required and the likelihood of involving I/O.
      3) The penalty of under-protecting the former type is higher because
      applications usually do not prepare themselves for major faults like
      they do for blocked I/O. For example, client applications commonly
      dedicate blocked I/O to separate threads to avoid UI janks that
      negatively affect user experience.
    
    There are also two access patterns: one with temporal locality and the
    other without. The latter pattern, e.g., random and sequential, needs
    to be explicitly excluded to avoid weakening the protection of the
    former pattern. Generally the former type follows the former pattern
    unless MADV_SEQUENTIAL is specified and the latter type follows the
    latter pattern unless outlying refaults have been observed.
    
    Upon faulting, a page is added to the youngest generation, which
    provides the strongest protection as the eviction will not consider
    this page before the aging has scanned it at least twice. The first
    scan clears the accessed bit set during the initial fault. And the
    second scan makes sure this page has not been used since the first
    scan. A page from any other generations is brought back to the
    youngest generation whenever the aging finds the accessed bit set on
    any of the PTEs mapping this page.
    
    Unmapped pages are initially added to the oldest generation and then
    conditionally protected by tiers. Pages accessed N times via file
    descriptors belong to tier order_base_2(N). Each tier keeps track of
    how many pages from it have refaulted. Tier 0 is the base tier and
    pages from it are evicted unconditionally because there are no better
    candidates. Pages from an upper tier are either evicted or moved to
    the next generation, depending on whether this upper tier has a higher
    refault rate than the base tier. This model has the following
    advantages:
      1) It removes the cost in the buffered access path and reduces the
      overall cost of protection because pages are conditionally protected
      in the reclaim path.
      2) It takes mapped pages into account and avoids overprotecting
      pages accessed multiple times via file descriptors.
      3) Additional tiers improve the protection of pages accessed more
      than twice.
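    
    The mapping from access count to tier can be reproduced in userspace.
    The helper below is a hypothetical stand-in for the kernel's
    order_base_2(), which rounds up to the next power of two before taking
    the logarithm; it ignores any cap on the number of tiers.
    
    ```c
    /* Smallest k with 2^k >= n; returns 0 for n <= 1. */
    static unsigned int tier_of(unsigned long n)
    {
        unsigned int k = 0;
    
        while (k < 8 * sizeof(n) - 1 && (1UL << k) < n)
            k++;
        return k;
    }
    ```
    
    So one access lands in tier 0, two accesses in tier 1, three or four
    in tier 2, and five to eight in tier 3.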
    
    Signed-off-by: Yu Zhao <yuzhao@google.com>
    Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>

commit 6016b640b11e9f6c05b3dd436ac62669a771058c
Author: Yu Zhao <yuzhao@google.com>
Date:   Wed Aug 18 00:31:00 2021 -0600

    mm: multigenerational lru: groundwork
    
    For each lruvec, evictable pages are divided into multiple
    generations. The youngest generation number is stored in
    lrugen->max_seq for both anon and file types as they are aged on an
    equal footing. The oldest generation numbers are stored in
    lrugen->min_seq[2] separately for anon and file types as clean file
    pages can be evicted regardless of swap and writeback constraints.
    These three variables are monotonically increasing. Generation numbers
    are truncated into order_base_2(MAX_NR_GENS+1) bits in order to fit
    into page->flags. The sliding window technique is used to prevent
    truncated generation numbers from overlapping. Each truncated
    generation number is an index to
    lrugen->lists[MAX_NR_GENS][ANON_AND_FILE][MAX_NR_ZONES].
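    
    A small sketch of the truncation and of one reading of the sliding
    window constraint. The constant's value and both helper names are
    assumptions for illustration; the window check expresses the idea
    that two live generations must never share a truncated index.
    
    ```c
    #define MAX_NR_GENS_SKETCH 4UL /* assumed value, illustration only */
    
    /* Truncated generation number used as the first index into lists[]. */
    static unsigned long gen_from_seq(unsigned long seq)
    {
        return seq % MAX_NR_GENS_SKETCH;
    }
    
    /*
     * The live range [min_seq, max_seq] must span at most
     * MAX_NR_GENS_SKETCH generations, otherwise two of them would map
     * to the same truncated index.
     */
    static int window_is_valid(unsigned long min_seq, unsigned long max_seq)
    {
        return min_seq <= max_seq &&
               max_seq - min_seq < MAX_NR_GENS_SKETCH;
    }
    ```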
    
    Each generation is then divided into multiple tiers. Tiers represent
    levels of usage from file descriptors only. Pages accessed N times via
    file descriptors belong to tier order_base_2(N). Each generation
    contains at most MAX_NR_TIERS tiers, and they require additional
    MAX_NR_TIERS-2 bits in page->flags. In contrast to moving across
    generations which requires list operations, moving across tiers only
    involves operations on page->flags and therefore has a negligible
    cost. A feedback loop modeled after the PID controller monitors
    refault rates of all tiers and decides when to protect pages from
    which tiers.
    
    The framework comprises two conceptually independent components: the
    aging and the eviction, which can be invoked separately from user
    space for the purpose of working set estimation and proactive reclaim.
    
    The aging produces young generations. Given an lruvec, the aging
    traverses lruvec_memcg()->mm_list and calls walk_page_range() to scan
    PTEs for accessed pages (a mm_struct list is maintained for each
    memcg). Upon finding one, the aging updates its generation number to
    max_seq (modulo MAX_NR_GENS). After each round of traversal, the aging
    increments max_seq. The aging is due when both min_seq[2] have caught
    up with max_seq-1.
    
    The eviction consumes old generations. Given an lruvec, the eviction
    scans pages on lrugen->lists indexed by anon and file min_seq[2]
    (modulo MAX_NR_GENS). It first tries to select a type based on the
    values of min_seq[2]. If they are equal, it selects the type that has
    a lower refault rate. The eviction sorts a page according to its
    updated generation number if the aging has found this page accessed.
    It also moves a page to the next generation if this page is from an
    upper tier that has a higher refault rate than the base tier. The
    eviction increments min_seq[2] of a selected type when it finds
    lrugen->lists indexed by min_seq[2] of this selected type are empty.
    
    Signed-off-by: Yu Zhao <yuzhao@google.com>
    Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>

commit e224026584fc3de73f8b23c77b787ef76d5ba375
Author: Yu Zhao <yuzhao@google.com>
Date:   Wed Aug 18 00:30:59 2021 -0600

    mm/vmscan.c: refactor shrink_node()
    
    This patch refactors shrink_node(). This will make the upcoming
    changes to mm/vmscan.c more readable.
    
    Signed-off-by: Yu Zhao <yuzhao@google.com>
    Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>

commit 350e3508f8efa522d39e3f705ca906a5410ecaa3
Author: Yu Zhao <yuzhao@google.com>
Date:   Wed Aug 18 00:30:58 2021 -0600

    mm: x86: add CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG
    
    Some architectures support the accessed bit on non-leaf PMD entries,
    e.g., x86_64 sets the accessed bit on a non-leaf PMD entry when using
    it as part of linear address translation [1]. As an optimization, page
    table walkers who are interested in the accessed bit can skip the PTEs
    under a non-leaf PMD entry if the accessed bit is cleared on this
    non-leaf PMD entry.
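    
    The optimization can be modelled with a toy two-level table. The
    types, the fan-out of 4 and the function name are all assumptions
    for this sketch and bear no relation to the real page table layout.
    
    ```c
    #include <stdbool.h>
    #include <stddef.h>
    
    #define PTES_PER_PMD_SKETCH 4 /* tiny illustrative fan-out */
    
    struct pmd_sketch {
        bool accessed;                       /* bit on the non-leaf entry */
        bool pte_young[PTES_PER_PMD_SKETCH]; /* bits of the PTEs below */
    };
    
    /*
     * Count young PTEs, visiting the PTEs under a PMD only when that
     * PMD's own accessed bit is set. *visited reports how many PTEs
     * were actually examined.
     */
    static size_t count_young(const struct pmd_sketch *pmds, size_t nr,
                              size_t *visited)
    {
        size_t young = 0;
    
        *visited = 0;
        for (size_t i = 0; i < nr; i++) {
            if (!pmds[i].accessed)
                continue; /* nothing below was used for translation */
            for (size_t j = 0; j < PTES_PER_PMD_SKETCH; j++) {
                (*visited)++;
                young += pmds[i].pte_young[j];
            }
        }
        return young;
    }
    ```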
    
    Although an inline function may be preferable, this capability is
    added as a configuration option to look consistent when used with the
    existing macros.
    
    [1]: Intel 64 and IA-32 Architectures Software Developer's Manual
         Volume 3 (October 2019), section 4.8
    
    Signed-off-by: Yu Zhao <yuzhao@google.com>
    Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>

commit 3bdf11d1fecaa4d140143baed01326dd7ee6801c
Author: Yu Zhao <yuzhao@google.com>
Date:   Wed Aug 18 00:30:57 2021 -0600

    mm: x86, arm64: add arch_has_hw_pte_young()
    
    Some architectures set the accessed bit in PTEs automatically, e.g.,
    x86, and arm64 v8.2 and later. On architectures that do not have this
    capability, clearing the accessed bit in a PTE triggers a page fault
    following the TLB miss.
    
    Being aware of this capability can help make better decisions, i.e.,
    whether to limit the size of each batch of PTEs and the burst of
    batches when clearing the accessed bit.
    
    Signed-off-by: Yu Zhao <yuzhao@google.com>

commit 81632bc71a55be61db1b8f5045cc14f5789e427a
Author: Stephan Mueller <smueller@chronox.de>
Date:   Fri Jun 18 08:26:24 2021 +0200

    char/lrng: add power-on and runtime self-tests
    
    Parts of the LRNG are already covered by self-tests, including:
    
    * Self-test of SP800-90A DRBG provided by the Linux kernel crypto API.
    
    * Self-test of the PRNG provided by the Linux kernel crypto API.
    
    * Raw noise source data testing including SP800-90B compliant
      tests when enabling CONFIG_LRNG_HEALTH_TESTS
    
    This patch adds the self-tests for the remaining critical functions of
    the LRNG that are essential to maintain entropy and provide
    cryptographically strong random numbers. The following self-tests are
    implemented:
    
    * Self-test of the time array maintenance. This test verifies whether
    the time stamp array management to store multiple values in one integer
    implements a concatenation of the data.
    
    * Self-test of the software hash implementation ensures that this
    function operates in compliance with the FIPS 180-4 specification. The
    self-test performs a hash operation of a zeroized per-CPU data array.
    
    * Self-test of the ChaCha20 DRNG is based on the self-tests that are
    already present and implemented with the stand-alone user space
    ChaCha20 DRNG implementation available at [1]. The self-tests cover
    different use cases of the DRNG seeded with known seed data.
    
    The status of the LRNG self-tests is provided with the selftest_status
    SysFS file. If the file contains a zero, the self-tests passed. The
    value 0xffffffff means that the self-tests were not executed. Any other
    value indicates a self-test failure.
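    
    A monitoring tool could classify the value read from selftest_status
    as sketched below. The enum and function names are hypothetical; only
    the three value classes come from the description above.
    
    ```c
    #include <stdint.h>
    
    enum selftest_result {
        SELFTEST_PASSED,
        SELFTEST_NOT_RUN,
        SELFTEST_FAILED,
    };
    
    /* Map a raw selftest_status value to one of the three outcomes. */
    static enum selftest_result classify_selftest_status(uint32_t status)
    {
        if (status == 0)
            return SELFTEST_PASSED;
        if (status == UINT32_C(0xffffffff))
            return SELFTEST_NOT_RUN;
        return SELFTEST_FAILED;
    }
    ```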
    
    The self-test may be compiled to panic the system if the self-test
    fails.
    
    All self-tests operate on private state data structures. This implies
    that none of the self-tests have any impact on the regular LRNG
    operations. This allows the self-tests to be repeated at runtime by
    writing anything into the selftest_status SysFS file.
    
    [1] https://www.chronox.de/chacha20.html
    
    CC: Torsten Duwe <duwe@lst.de>
    CC: "Eric W. Biederman" <ebiederm@xmission.com>
    CC: "Alexander E. Patrakov" <patrakov@gmail.com>
    CC: "Ahmed S. Darwish" <darwish.07@gmail.com>
    CC: "Theodore Y. Ts'o" <tytso@mit.edu>
    CC: Willy Tarreau <w@1wt.eu>
    CC: Matthew Garrett <mjg59@srcf.ucam.org>
    CC: Vito Caputo <vcaputo@pengaru.com>
    CC: Andreas Dilger <adilger.kernel@dilger.ca>
    CC: Jan Kara <jack@suse.cz>
    CC: Ray Strode <rstrode@redhat.com>
    CC: William Jon McCann <mccann@jhu.edu>
    CC: zhangjs <zachary@baishancloud.com>
    CC: Andy Lutomirski <luto@kernel.org>
    CC: Florian Weimer <fweimer@redhat.com>
    CC: Lennart Poettering <mzxreary@0pointer.de>
    CC: Nicolai Stange <nstange@suse.de>
    CC: Marcelo Henrique Cerri <marcelo.cerri@canonical.com>
    CC: Neil Horman <nhorman@redhat.com>
    CC: Alexander Lobakin <alobakin@mailbox.org>
    Signed-off-by: Stephan Mueller <smueller@chronox.de>

commit d3c8b63af812ac95d73930630afd92855d9326ee
Author: Stephan Mueller <smueller@chronox.de>
Date:   Fri Jun 18 08:23:31 2021 +0200

    char/lrng: add interface for gathering of raw entropy
    
    The test interface allows a privileged process to capture the raw
    unconditioned noise that is collected by the LRNG for statistical
    analysis. Such testing allows the analysis of how much entropy
    the interrupt noise source provides on a given platform.
    Extracted noise data is not used to seed the LRNG. This
    is a test interface and not appropriate for production systems.
    Yet, the interface is considered to be sufficiently secured for
    production systems.
    
    Access to the data is given through the lrng_raw debugfs file. The
    data buffer should be multiples of sizeof(u32) to fill the entire
    buffer. Using the option lrng_testing.boot_test=1 the raw noise of
    the first 1000 entropy events since boot can be sampled.
    
    This test interface allows generating the data required for
    analysis of whether the LRNG is in compliance with SP800-90B
    sections 3.1.3 and 3.1.4.
    
    In addition, the test interface allows gathering of the concatenated raw
    entropy data to verify that the concatenation works appropriately.
    This includes sampling of the following raw data:
    
    * high-resolution time stamp
    
    * Jiffies
    
    * IRQ number
    
    * IRQ flags
    
    * return instruction pointer
    
    * interrupt register state
    
    * array logic batching the high-resolution time stamp
    
    * enabling the runtime configuration of entropy source entropy rates
    
    Also, a testing interface to support ACVT of the hash implementation
    is provided. The reason why only hash testing is supported (as
    opposed to also providing testing for the DRNG) is the fact that the
    LRNG software hash implementation contains glue code that may
    warrant testing in addition to the testing of the software ciphers
    via the kernel crypto API. Also, for testing the CTR-DRBG, the
    underlying AES implementation would need to be tested. However,
    such AES test interface cannot be provided by the LRNG as it has no
    means to access the AES operation.
    
    Finally, the execution duration for processing a time stamp can be
    obtained with the LRNG raw entropy interface.
    
    If a test interface is not compiled, its code is a noop which has no
    impact on the performance.
    
    CC: Torsten Duwe <duwe@lst.de>
    CC: "Eric W. Biederman" <ebiederm@xmission.com>
    CC: "Alexander E. Patrakov" <patrakov@gmail.com>
    CC: "Ahmed S. Darwish" <darwish.07@gmail.com>
    CC: "Theodore Y. Ts'o" <tytso@mit.edu>
    CC: Willy Tarreau <w@1wt.eu>
    CC: Matthew Garrett <mjg59@srcf.ucam.org>
    CC: Vito Caputo <vcaputo@pengaru.com>
    CC: Andreas Dilger <adilger.kernel@dilger.ca>
    CC: Jan Kara <jack@suse.cz>
    CC: Ray Strode <rstrode@redhat.com>
    CC: William Jon McCann <mccann@jhu.edu>
    CC: zhangjs <zachary@baishancloud.com>
    CC: Andy Lutomirski <luto@kernel.org>
    CC: Florian Weimer <fweimer@redhat.com>
    CC: Lennart Poettering <mzxreary@0pointer.de>
    CC: Nicolai Stange <nstange@suse.de>
    CC: Alexander Lobakin <alobakin@mailbox.org>
    Reviewed-by: Roman Drahtmueller <draht@schaltsekun.de>
    Tested-by: Marcelo Henrique Cerri <marcelo.cerri@canonical.com>
    Tested-by: Neil Horman <nhorman@redhat.com>
    Signed-off-by: Stephan Mueller <smueller@chronox.de>

commit c635a8aeeca010317f0de0877ae083c07d028cde
Author: Stephan Mueller <smueller@chronox.de>
Date:   Fri Jun 18 08:17:40 2021 +0200

    char/lrng: add SP800-90B compliant health tests
    
    Implement health tests for LRNG's slow noise sources as mandated by
    SP800-90B. The file contains the following health tests:
    
    - stuck test: The stuck test calculates the first, second and third
      discrete derivative of the time stamp to be processed by the hash
      for the per-CPU entropy pool. Only if all three values are non-zero,
      the received time delta is considered to be non-stuck.
    
    - SP800-90B Repetition Count Test (RCT): The LRNG uses an enhanced
      version of the RCT specified in SP800-90B section 4.4.1. Instead of
      counting identical back-to-back values, the input to the RCT is the
      counting of the stuck values during the processing of received
      interrupt events. The RCT is applied with alpha=2^-30 compliant to
      the recommendation of FIPS 140-2 IG 9.8. During the counting operation,
      the LRNG always calculates the RCT cut-off value of C. If that value
      exceeds the allowed cut-off value, the LRNG will trigger the health
      test failure discussed below. An error is logged to the kernel log
      that such an RCT failure occurred. This test is only applied and
      enforced in FIPS mode, i.e. when the kernel compiled with
      CONFIG_CRYPTO_FIPS is started with fips=1.
    
    - SP800-90B Adaptive Proportion Test (APT): The LRNG implements the
      APT as defined in SP800-90B section 4.4.2. The applied significance
      level again is alpha=2^-30 compliant to the recommendation of FIPS
      140-2 IG 9.8.
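
    A minimal sketch of how the stuck test can feed the RCT, written in
    Python for illustration only (it is not the kernel code). The cutoff
    formula C = 1 + ceil(-log2(alpha) / H) is the one from SP800-90B
    section 4.4.1; the entropy rate H = 1 bit per time stamp and all
    names below are assumptions made for this example:

```python
import math


def rct_cutoff(alpha_log2: int = 30, entropy_per_sample: float = 1.0) -> int:
    """SP800-90B section 4.4.1: C = 1 + ceil(-log2(alpha) / H).

    alpha_log2 is -log2(alpha); H = 1.0 is an assumed entropy rate.
    """
    return 1 + math.ceil(alpha_log2 / entropy_per_sample)


class StuckRct:
    """Stuck test feeding a repetition count test (illustrative only)."""

    def __init__(self, cutoff: int) -> None:
        self.cutoff = cutoff
        self.last_time = 0     # previous time stamp
        self.last_delta = 0    # previous first derivative
        self.last_delta2 = 0   # previous second derivative
        self.stuck_run = 0     # consecutive stuck time stamps seen so far

    def is_stuck(self, now: int) -> bool:
        delta = now - self.last_time        # first discrete derivative
        delta2 = delta - self.last_delta    # second discrete derivative
        delta3 = delta2 - self.last_delta2  # third discrete derivative
        self.last_time, self.last_delta, self.last_delta2 = now, delta, delta2
        # Non-stuck only if all three derivatives are non-zero.
        return delta == 0 or delta2 == 0 or delta3 == 0

    def insert(self, now: int) -> bool:
        """Return True while healthy, False once the cutoff is reached."""
        if self.is_stuck(now):
            self.stuck_run += 1
        else:
            self.stuck_run = 0
        return self.stuck_run < self.cutoff
```

    In this model, a perfectly regular event source (constant time delta)
    has a zero second derivative, so every time stamp counts as stuck and
    the run eventually reaches the cutoff.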
    
    The aforementioned health tests are applied to the first 1,024 time stamps
    obtained from interrupt events. In case one error is identified for either
    the RCT, or the APT, the collected entropy is invalidated and the
    SP800-90B startup health test is restarted.
    
    As long as the SP800-90B startup health test is not completed, all LRNG
    random number output interfaces that may block will block and not generate
    any data. This implies that only those potentially blocking interfaces are
    defined to provide random numbers that are seeded with the interrupt noise
    source being SP800-90B compliant. All other output interfaces will not be
    affected by the SP800-90B startup test and thus are not considered
    SP800-90B compliant.
    
    At runtime, the SP800-90B APT and RCT are applied to each time stamp
    generated for a received interrupt. When either the APT or the RCT
    indicates a noise source failure, the LRNG is reset to the state it has
    immediately after boot:
    
    - all entropy counters are set to zero
    
    - the SP800-90B startup tests are re-performed which implies that
      getrandom(2) would block again until new entropy was collected
    
    To summarize, the following rules apply:
    
    • SP800-90B compliant output interfaces
    
      - /dev/random
    
      - getrandom(2) system call
    
      - get_random_bytes kernel-internal interface when being triggered by
        the callback registered with add_random_ready_callback
    
    • SP800-90B non-compliant output interfaces
    
      - /dev/urandom
    
      - get_random_bytes kernel-internal interface called directly
    
      - randomize_page kernel-internal interface
    
      - get_random_u32 and get_random_u64 kernel-internal interfaces
    
      - get_random_u32_wait, get_random_u64_wait, get_random_int_wait, and
        get_random_long_wait kernel-internal interfaces
    
    If either the RCT or the APT health test fails, irrespective of whether
    this happens during initialization or at runtime, the following actions
    occur:
    
      1. The entropy of the entire entropy pool is invalidated.
    
      2. All DRNGs are reset, which implies that they are treated as being
         not seeded and require a reseed during the next invocation.
    
      3. The SP800-90B startup health tests are initiated with all
         implications of the startup tests. That implies that from that point
         on, new events must be observed and their entropy must be inserted
         into the entropy pool before random numbers are calculated from the
         entropy pool.
    
    Further details on the SP800-90B compliance and the availability of all
    test tools required to perform all tests mandated by SP800-90B are
    provided at [1].
    
    The entire health testing code is compile-time configurable.
    
    The patch provides a CONFIG_BROKEN configuration of the APT / RCT cutoff
    values which have a high likelihood to trigger the health test failure.
    The BROKEN APT cutoff is set to the exact mean of the expected value if
    the time stamps are equally distributed (512 time stamps divided by 16
    possible values due to using the 4 LSB of the time stamp). The BROKEN
    RCT cutoff value is set to 1 which is likely to be triggered during
    regular operation.
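
    The arithmetic behind the BROKEN APT cutoff can be sketched as
    follows. This is an illustrative Python model of a single APT window,
    not the kernel code; the function name and the use of the first sample
    as reference value (as in SP800-90B section 4.4.2) are assumptions of
    this sketch:

```python
WINDOW = 512                       # time stamps per APT window (from the text)
SYMBOLS = 16                       # 4 LSB of the time stamp -> 16 values
BROKEN_CUTOFF = WINDOW // SYMBOLS  # exact mean for equally distributed values


def apt_window_ok(samples, cutoff):
    """Adaptive proportion test over one window.

    The first sample is the reference value; the window fails as soon as
    the reference value has been observed `cutoff` times.
    """
    reference = samples[0] & (SYMBOLS - 1)
    count = 1
    for sample in samples[1:WINDOW]:
        if (sample & (SYMBOLS - 1)) == reference:
            count += 1
            if count >= cutoff:
                return False
    return True
```

    With equally distributed input the reference value shows up 32 times
    per window, so a cutoff of 32 fails in this model even though the
    noise source behaves exactly as expected.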
    
    CC: Torsten Duwe <duwe@lst.de>
    CC: "Eric W. Biederman" <ebiederm@xmission.com>
    CC: "Alexander E. Patrakov" <patrakov@gmail.com>
    CC: "Ahmed S. Darwish" <darwish.07@gmail.com>
    CC: "Theodore Y. Ts'o" <tytso@mit.edu>
    CC: Willy Tarreau <w@1wt.eu>
    CC: Matthew Garrett <mjg59@srcf.ucam.org>
    CC: Vito Caputo <vcaputo@pengaru.com>
    CC: Andreas Dilger <adilger.kernel@dilger.ca>
    CC: Jan Kara <jack@suse.cz>
    CC: Ray Strode <rstrode@redhat.com>
    CC: William Jon McCann <mccann@jhu.edu>
    CC: zhangjs <zachary@baishancloud.com>
    CC: Andy Lutomirski <luto@kernel.org>
    CC: Florian Weimer <fweimer@redhat.com>
    CC: Lennart Poettering <mzxreary@0pointer.de>
    CC: Nicolai Stange <nstange@suse.de>
    CC: Alexander Lobakin <alobakin@mailbox.org>
    Reviewed-by: Roman Drahtmueller <draht@schaltsekun.de>
    Tested-by: Marcelo Henrique Cerri <marcelo.cerri@canonical.com>
    Tested-by: Neil Horman <nhorman@redhat.com>
    Signed-off-by: Stephan Mueller <smueller@chronox.de>

commit 31720489c680dc9865bf92e8b21d6656e24758b2
Author: Stephan Mueller <smueller@chronox.de>
Date:   Fri Jun 18 08:13:57 2021 +0200

    char/lrng: add Jitter RNG fast noise source
    
    The Jitter RNG fast noise source implemented as part of the kernel
    crypto API is queried for 256 bits of entropy at the time the seed
    buffer managed by the LRNG is about to be filled.
    
    CC: Torsten Duwe <duwe@lst.de>
    CC: "Eric W. Biederman" <ebiederm@xmission.com>
    CC: "Alexander E. Patrakov" <patrakov@gmail.com>
    CC: "Ahmed S. Darwish" <darwish.07@gmail.com>
    CC: "Theodore Y. Ts'o" <tytso@mit.edu>
    CC: Willy Tarreau <w@1wt.eu>
    CC: Matthew Garrett <mjg59@srcf.ucam.org>
    CC: Vito Caputo <vcaputo@pengaru.com>
    CC: Andreas Dilger <adilger.kernel@dilger.ca>
    CC: Jan Kara <jack@suse.cz>
    CC: Ray Strode <rstrode@redhat.com>
    CC: William Jon McCann <mccann@jhu.edu>
    CC: zhangjs <zachary@baishancloud.com>
    CC: Andy Lutomirski <luto@kernel.org>
    CC: Florian Weimer <fweimer@redhat.com>
    CC: Lennart Poettering <mzxreary@0pointer.de>
    CC: Nicolai Stange <nstange@suse.de>
    CC: Alexander Lobakin <alobakin@mailbox.org>
    Reviewed-by: Marcelo Henrique Cerri <marcelo.cerri@canonical.com>
    Tested-by: Marcelo Henrique Cerri <marcelo.cerri@canonical.com>
    Tested-by: Neil Horman <nhorman@redhat.com>
    Signed-off-by: Stephan Mueller <smueller@chronox.de>

commit 413a2475361c3a725ed208adbc6272494ea96605
Author: Stephan Mueller <smueller@chronox.de>
Date:   Wed Sep 16 09:50:27 2020 +0200

    crypto: provide access to a static Jitter RNG state
    
    To support the LRNG operation which uses the Jitter RNG separately
    from the kernel crypto API, at a time where potentially the regular
    memory management is not yet initialized, the Jitter RNG needs to
    provide a state whose memory is defined at compile time. As only one
    instance will ever be needed by the LRNG, define one static memory
    block which is solely to be used by the LRNG.
    
    CC: Torsten Duwe <duwe@lst.de>
    CC: "Eric W. Biederman" <ebiederm@xmission.com>
    CC: "Alexander E. Patrakov" <patrakov@gmail.com>
    CC: "Ahmed S. Darwish" <darwish.07@gmail.com>
    CC: "Theodore Y. Ts'o" <tytso@mit.edu>
    CC: Willy Tarreau <w@1wt.eu>
    CC: Matthew Garrett <mjg59@srcf.ucam.org>
    CC: Vito Caputo <vcaputo@pengaru.com>
    CC: Andreas Dilger <adilger.kernel@dilger.ca>
    CC: Jan Kara <jack@suse.cz>
    CC: Ray Strode <rstrode@redhat.com>
    CC: William Jon McCann <mccann@jhu.edu>
    CC: zhangjs <zachary@baishancloud.com>
    CC: Andy Lutomirski <luto@kernel.org>
    CC: Florian Weimer <fweimer@redhat.com>
    CC: Lennart Poettering <mzxreary@0pointer.de>
    CC: Nicolai Stange <nstange@suse.de>
    CC: Alexander Lobakin <alobakin@mailbox.org>
    Reviewed-by: Roman Drahtmueller <draht@schaltsekun.de>
    Tested-by: Roman Drahtmüller <draht@schaltsekun.de>
    Tested-by: Marcelo Henrique Cerri <marcelo.cerri@canonical.com>
    Tested-by: Neil Horman <nhorman@redhat.com>
    Signed-off-by: Stephan Mueller <smueller@chronox.de>

commit b73b9363c446dcad4bed34469284f50684d364af
Author: Stephan Mueller <smueller@chronox.de>
Date:   Fri Jun 18 08:10:53 2021 +0200

    char/lrng: add kernel crypto API PRNG extension
    
    Add runtime-pluggable support for all PRNGs that are accessible via
    the kernel crypto API, including hardware PRNGs. The PRNG is selected
    with the module parameter drng_name where the name must be one that the
    kernel crypto API can resolve into an RNG.
    
    This allows using the kernel crypto API PRNG implementations that
    provide an interface to hardware PRNGs. With this extension,
    the LRNG uses the hardware PRNGs to generate random numbers. An
    example is the S390 CPACF support providing such a PRNG.
    
    The hash is provided by a kernel crypto API SHASH whose digest size
    complies with the seedsize of the PRNG.
    
    CC: Torsten Duwe <duwe@lst.de>
    CC: "Eric W. Biederman" <ebiederm@xmission.com>
    CC: "Alexander E. Patrakov" <patrakov@gmail.com>
    CC: "Ahmed S. Darwish" <darwish.07@gmail.com>
    CC: "Theodore Y. Ts'o" <tytso@mit.edu>
    CC: Willy Tarreau <w@1wt.eu>
    CC: Matthew Garrett <mjg59@srcf.ucam.org>
    CC: Vito Caputo <vcaputo@pengaru.com>
    CC: Andreas Dilger <adilger.kernel@dilger.ca>
    CC: Jan Kara <jack@suse.cz>
    CC: Ray Strode <rstrode@redhat.com>
    CC: William Jon McCann <mccann@jhu.edu>
    CC: zhangjs <zachary@baishancloud.com>
    CC: Andy Lutomirski <luto@kernel.org>
    CC: Florian Weimer <fweimer@redhat.com>
    CC: Lennart Poettering <mzxreary@0pointer.de>
    CC: Nicolai Stange <nstange@suse.de>
    CC: Alexander Lobakin <alobakin@mailbox.org>
    Reviewed-by: Marcelo Henrique Cerri <marcelo.cerri@canonical.com>
    Reviewed-by: Roman Drahtmueller <draht@schaltsekun.de>
    Tested-by: Marcelo Henrique Cerri <marcelo.cerri@canonical.com>
    Tested-by: Neil Horman <nhorman@redhat.com>
    Signed-off-by: Stephan Mueller <smueller@chronox.de>

commit 3c8d95825b3ee26c6a8904e1565c495f9242a29a
Author: Stephan Mueller <smueller@chronox.de>
Date:   Fri Jun 18 08:09:59 2021 +0200

    char/lrng: add SP800-90A DRBG extension
    
    Using the LRNG switchable DRNG support, the SP800-90A DRBG extension is
    implemented.
    
    The DRBG uses the kernel crypto API DRBG implementation. In addition, it
    uses the kernel crypto API SHASH support to provide the hashing
    operation.
    
    The DRBG supports the choice of either a CTR DRBG using AES-256, HMAC
    DRBG with SHA-512 core or Hash DRBG with SHA-512 core. The used core can
    be selected with the module parameter lrng_drbg_type. The default is the
    CTR DRBG.
    
    When compiling the DRBG extension statically, the DRBG is loaded at
    late_initcall stage which implies that with the start of user space, the
    user space interfaces of getrandom(2), /dev/random and /dev/urandom
    provide random data produced by an SP800-90A DRBG.
    
    CC: Torsten Duwe <duwe@lst.de>
    CC: "Eric W. Biederman" <ebiederm@xmission.com>
    CC: "Alexander E. Patrakov" <patrakov@gmail.com>
    CC: "Ahmed S. Darwish" <darwish.07@gmail.com>
    CC: "Theodore Y. Ts'o" <tytso@mit.edu>
    CC: Willy Tarreau <w@1wt.eu>
    CC: Matthew Garrett <mjg59@srcf.ucam.org>
    CC: Vito Caputo <vcaputo@pengaru.com>
    CC: Andreas Dilger <adilger.kernel@dilger.ca>
    CC: Jan Kara <jack@suse.cz>
    CC: Ray Strode <rstrode@redhat.com>
    CC: William Jon McCann <mccann@jhu.edu>
    CC: zhangjs <zachary@baishancloud.com>
    CC: Andy Lutomirski <luto@kernel.org>
    CC: Florian Weimer <fweimer@redhat.com>
    CC: Lennart Poettering <mzxreary@0pointer.de>
    CC: Nicolai Stange <nstange@suse.de>
    CC: Alexander Lobakin <alobakin@mailbox.org>
    Reviewed-by: Roman Drahtmueller <draht@schaltsekun.de>
    Tested-by: Marcelo Henrique Cerri <marcelo.cerri@canonical.com>
    Tested-by: Neil Horman <nhorman@redhat.com>
    Signed-off-by: Stephan Mueller <smueller@chronox.de>

commit 772b4355074b6d79ee538cea505b943d520df774
Author: Stephan Mueller <smueller@chronox.de>
Date:   Tue Sep 15 22:17:43 2020 +0200

    crypto: drbg - externalize DRBG functions for LRNG
    
    This patch allows several DRBG functions to be called by the LRNG kernel
    code paths outside the drbg.c file.
    
    CC: Torsten Duwe <duwe@lst.de>
    CC: "Eric W. Biederman" <ebiederm@xmission.com>
    CC: "Alexander E. Patrakov" <patrakov@gmail.com>
    CC: "Ahmed S. Darwish" <darwish.07@gmail.com>
    CC: "Theodore Y. Ts'o" <tytso@mit.edu>
    CC: Willy Tarreau <w@1wt.eu>
    CC: Matthew Garrett <mjg59@srcf.ucam.org>
    CC: Vito Caputo <vcaputo@pengaru.com>
    CC: Andreas Dilger <adilger.kernel@dilger.ca>
    CC: Jan Kara <jack@suse.cz>
    CC: Ray Strode <rstrode@redhat.com>
    CC: William Jon McCann <mccann@jhu.edu>
    CC: zhangjs <zachary@baishancloud.com>
    CC: Andy Lutomirski <luto@kernel.org>
    CC: Florian Weimer <fweimer@redhat.com>
    CC: Lennart Poettering <mzxreary@0pointer.de>
    CC: Nicolai Stange <nstange@suse.de>
    CC: Alexander Lobakin <alobakin@mailbox.org>
    Reviewed-by: Roman Drahtmueller <draht@schaltsekun.de>
    Tested-by: Roman Drahtmüller <draht@schaltsekun.de>
    Tested-by: Marcelo Henrique Cerri <marcelo.cerri@canonical.com>
    Tested-by: Neil Horman <nhorman@redhat.com>
    Signed-off-by: Stephan Mueller <smueller@chronox.de>

commit 289ae6f42596a669c76f74255b0b156cc8d0b396
Author: Stephan Mueller <smueller@chronox.de>
Date:   Fri Jun 18 08:08:20 2021 +0200

    char/lrng: add common generic hash support
    
    The LRNG switchable DRNG support also allows the replacement of the hash
    implementation used as conditioning component. The common generic hash
    support code provides the required callbacks using the synchronous hash
    implementations of the kernel crypto API.
    
    All synchronous hash implementations supported by the kernel crypto API
    can be used as part of the LRNG with this generic support.
    
    The generic support is intended to be configured by separate switchable
    DRNG backends.
    
    CC: Torsten Duwe <duwe@lst.de>
    CC: "Eric W. Biederman" <ebiederm@xmission.com>
    CC: "Alexander E. Patrakov" <patrakov@gmail.com>
    CC: "Ahmed S. Darwish" <darwish.07@gmail.com>
    CC: "Theodore Y. Ts'o" <tytso@mit.edu>
    CC: Willy Tarreau <w@1wt.eu>
    CC: Matthew Garrett <mjg59@srcf.ucam.org>
    CC: Vito Caputo <vcaputo@pengaru.com>
    CC: Andreas Dilger <adilger.kernel@dilger.ca>
    CC: Jan Kara <jack@suse.cz>
    CC: Ray Strode <rstrode@redhat.com>
    CC: William Jon McCann <mccann@jhu.edu>
    CC: zhangjs <zachary@baishancloud.com>
    CC: Andy Lutomirski <luto@kernel.org>
    CC: Florian Weimer <fweimer@redhat.com>
    CC: Lennart Poettering <mzxreary@0pointer.de>
    CC: Nicolai Stange <nstange@suse.de>
    CC: Alexander Lobakin <alobakin@mailbox.org>
    CC: "Peter, Matthias" <matthias.peter@bsi.bund.de>
    CC: Marcelo Henrique Cerri <marcelo.cerri@canonical.com>
    CC: Neil Horman <nhorman@redhat.com>
    Signed-off-by: Stephan Mueller <smueller@chronox.de>

commit 255b2f8211e70b06a123fb045f9455f9828b51b6
Author: Stephan Mueller <smueller@chronox.de>
Date:   Fri Jun 18 08:06:39 2021 +0200

    char/lrng: add switchable DRNG support
    
    The DRNG switch support allows replacing the DRNG mechanism of the
    LRNG. The switching support rests on the interface definition of
    include/linux/lrng.h. A new DRNG is implemented by filling in the
    interface defined in this header file.
    
    In addition to the DRNG, the extension also has to provide a hash
    implementation that is used to hash the entropy pool for random number
    extraction.
    
    Note: It is permissible to implement a DRNG whose operations may sleep.
    However, the hash function must not sleep.
    
    The switchable DRNG support allows replacing the DRNG at runtime.
    However, only one DRNG extension is allowed to be loaded at any given
    time. Before replacing it with another DRNG implementation, the possibly
    existing DRNG extension must be unloaded.
    
    The switchable DRNG extension activates the new DRNG during load time.
    It is expected, however, that such a DRNG switch would be done only once
    by an administrator to load the intended DRNG implementation.
    
    It is permissible to compile DRNG extensions either as kernel modules or
    statically. The initialization of the DRNG extension should be performed
    with a late_initcall to ensure the extension is available when user
    space starts but after all other initialization completed.
    The initialization is performed by registering the function call data
    structure with the lrng_set_drng_cb function. In order to unload the
    DRNG extension, lrng_set_drng_cb must be invoked with the NULL
    parameter.
    
    The DRNG extension should always provide a security strength that is at
    least as strong as LRNG_DRNG_SECURITY_STRENGTH_BITS.
    
    The hash extension must not sleep and must not maintain a separate
    state.
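
    The load and unload rules above can be modelled with a small Python
    sketch. It only illustrates the "one extension at a time, unload with
    NULL" behaviour; the class, the method names and the return values are
    hypothetical and do not reflect the signature or return codes of
    lrng_set_drng_cb:

```python
class DrngSwitch:
    """Toy registry mirroring the load/unload rules described above."""

    def __init__(self, default_cb):
        self._default = default_cb   # built-in DRNG used when nothing is loaded
        self._extension = None       # at most one extension at any given time

    def set_drng_cb(self, cb):
        """Register an extension, or unload the current one with None."""
        if cb is None:
            self._extension = None
            return 0
        if self._extension is not None:
            return -1                # hypothetical error: already loaded
        self._extension = cb
        return 0

    def active(self):
        """Return the DRNG callbacks currently in use."""
        if self._extension is not None:
            return self._extension
        return self._default
```

    A second extension is refused in this model until the first one has
    been unloaded by passing None, the stand-in for the NULL parameter.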
    
    CC: Torsten Duwe <duwe@lst.de>
    CC: "Eric W. Biederman" <ebiederm@xmission.com>
    CC: "Alexander E. Patrakov" <patrakov@gmail.com>
    CC: "Ahmed S. Darwish" <darwish.07@gmail.com>
    CC: "Theodore Y. Ts'o" <tytso@mit.edu>
    CC: Willy Tarreau <w@1wt.eu>
    CC: Matthew Garrett <mjg59@srcf.ucam.org>
    CC: Vito Caputo <vcaputo@pengaru.com>
    CC: Andreas Dilger <adilger.kernel@dilger.ca>
    CC: Jan Kara <jack@suse.cz>
    CC: Ray Strode <rstrode@redhat.com>
    CC: William Jon McCann <mccann@jhu.edu>
    CC: zhangjs <zachary@baishancloud.com>
    CC: Andy Lutomirski <luto@kernel.org>
    CC: Florian Weimer <fweimer@redhat.com>
    CC: Lennart Poettering <mzxreary@0pointer.de>
    CC: Nicolai Stange <nstange@suse.de>
    CC: Alexander Lobakin <alobakin@mailbox.org>
    Reviewed-by: Marcelo Henrique Cerri <marcelo.cerri@canonical.com>
    Reviewed-by: Roman Drahtmueller <draht@schaltsekun.de>
    Tested-by: Marcelo Henrique Cerri <marcelo.cerri@canonical.com>
    Tested-by: Neil Horman <nhorman@redhat.com>
    Signed-off-by: Stephan Mueller <smueller@chronox.de>

commit 44dab88512d3bade571fbf415c3bb1364e56cacd
Author: Stephan Mueller <smueller@chronox.de>
Date:   Wed Jun 23 18:44:26 2021 +0200

    char/lrng: sysctls and /proc interface
    
    The LRNG sysctl interface provides the same controls as the existing
    /dev/random implementation. These sysctls behave identically and are
    implemented identically. The goal is to allow a possible merge of the
    existing /dev/random implementation with this implementation, which
    implies that this patch tries to stay very close to it. All sysctls
    are documented at [1].
    
    In addition, it provides the file lrng_type which provides details about
    the LRNG:
    
    - the name of the DRNG that produces the random numbers for /dev/random,
    /dev/urandom, getrandom(2)
    
    - the hash used to produce random numbers from the entropy pool
    
    - the number of secondary DRNG instances
    
    - indicator whether the LRNG operates SP800-90B compliant
    
    - indicator whether a high-resolution timer is identified - only with a
    high-resolution timer the interrupt noise source will deliver sufficient
    entropy
    
    - indicator whether the LRNG has been minimally seeded (i.e. is the
    secondary DRNG seeded with at least 128 bits of entropy)
    
    - indicator whether the LRNG has been fully seeded (i.e. is the
    secondary DRNG seeded with at least 256 bits of entropy)
    
    [1] https://www.chronox.de/lrng.html
    
    CC: Torsten Duwe <duwe@lst.de>
    CC: "Eric W. Biederman" <ebiederm@xmission.com>
    CC: "Alexander E. Patrakov" <patrakov@gmail.com>
    CC: "Ahmed S. Darwish" <darwish.07@gmail.com>
    CC: "Theodore Y. Ts'o" <tytso@mit.edu>
    CC: Willy Tarreau <w@1wt.eu>
    CC: Matthew Garrett <mjg59@srcf.ucam.org>
    CC: Vito Caputo <vcaputo@pengaru.com>
    CC: Andreas Dilger <adilger.kernel@dilger.ca>
    CC: Jan Kara <jack@suse.cz>
    CC: Ray Strode <rstrode@redhat.com>
    CC: William Jon McCann <mccann@jhu.edu>
    CC: zhangjs <zachary@baishancloud.com>
    CC: Andy Lutomirski <luto@kernel.org>
    CC: Florian Weimer <fweimer@redhat.com>
    CC: Lennart Poettering <mzxreary@0pointer.de>
    CC: Nicolai Stange <nstange@suse.de>
    CC: Alexander Lobakin <alobakin@mailbox.org>
    Reviewed-by: Marcelo Henrique Cerri <marcelo.cerri@canonical.com>
    Reviewed-by: Roman Drahtmueller <draht@schaltsekun.de>
    Tested-by: Marcelo Henrique Cerri <marcelo.cerri@canonical.com>
    Tested-by: Neil Horman <nhorman@redhat.com>
    Signed-off-by: Stephan Mueller <smueller@chronox.de>

commit 70d7ec6c96f9c69587402aaea5ef9279364004ca
Author: Stephan Mueller <smueller@chronox.de>
Date:   Fri Jun 18 08:03:15 2021 +0200

    char/lrng: allocate one DRNG instance per NUMA node
    
    In order to improve NUMA-locality when serving getrandom(2) requests,
    allocate one DRNG instance per node.
    
    The DRNG instance that is present right from the start of the kernel is
    reused as the first per-NUMA-node DRNG. For all remaining online NUMA
    nodes a new DRNG instance is allocated.
    
    During boot time, the multiple DRNG instances are seeded sequentially.
    With this, the first DRNG instance (referenced as the initial DRNG
    in the code) is completely seeded with 256 bits of entropy before the
    next DRNG instance is completely seeded.
    
    When random numbers are requested, the NUMA-node-local DRNG is checked
    whether it has been already fully seeded. If this is not the case, the
    initial DRNG is used to serve the request.
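
    The selection rule can be sketched in Python as follows. This is only
    a model of the fallback logic described above; the data structure and
    the function name are assumptions made for this example:

```python
from dataclasses import dataclass


@dataclass
class Drng:
    name: str
    fully_seeded: bool = False


def pick_drng(node, per_node, initial):
    """Return the node-local DRNG if fully seeded, else the initial DRNG."""
    candidate = per_node.get(node)
    if candidate is not None and candidate.fully_seeded:
        return candidate
    return initial
```

    In this model the initial DRNG doubles as the instance of the first
    node, so requests on that node use the same object either way.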
    
    CC: Torsten Duwe <duwe@lst.de>
    CC: "Eric W. Biederman" <ebiederm@xmission.com>
    CC: "Alexander E. Patrakov" <patrakov@gmail.com>
    CC: "Ahmed S. Darwish" <darwish.07@gmail.com>
    CC: "Theodore Y. Ts'o" <tytso@mit.edu>
    CC: Willy Tarreau <w@1wt.eu>
    CC: Matthew Garrett <mjg59@srcf.ucam.org>
    CC: Vito Caputo <vcaputo@pengaru.com>
    CC: Andreas Dilger <adilger.kernel@dilger.ca>
    CC: Jan Kara <jack@suse.cz>
    CC: Ray Strode <rstrode@redhat.com>
    CC: William Jon McCann <mccann@jhu.edu>
    CC: zhangjs <zachary@baishancloud.com>
    CC: Andy Lutomirski <luto@kernel.org>
    CC: Florian Weimer <fweimer@redhat.com>
    CC: Lennart Poettering <mzxreary@0pointer.de>
    CC: Nicolai Stange <nstange@suse.de>
    CC: Eric Biggers <ebiggers@kernel.org>
    CC: Alexander Lobakin <alobakin@mailbox.org>
    Reviewed-by: Marcelo Henrique Cerri <marcelo.cerri@canonical.com>
    Reviewed-by: Roman Drahtmueller <draht@schaltsekun.de>
    Tested-by: Marcelo Henrique Cerri <marcelo.cerri@canonical.com>
    Tested-by: Neil Horman <nhorman@redhat.com>
    Signed-off-by: Stephan Mueller <smueller@chronox.de>

commit 80dde18aac3368db221794142040f8cd3ca9c36a
Author: Stephan Mueller <smueller@chronox.de>
Date:   Wed Jun 23 18:42:39 2021 +0200

    drivers: Introduce the Linux Random Number Generator
    
    In an effort to provide a flexible implementation for a random number
    generator that also delivers entropy during early boot time, allows
    replacement of the deterministic random number generation mechanism,
    implements the various components in separate code for easier
    maintenance, and provides compliance to SP800-90[A|B|C], introduce
    the Linux Random Number Generator (LRNG) framework.
    
    The general design is as follows. Additional implementation details
    are given in [1]. The LRNG consists of the following components:
    
    1. The LRNG implements a DRNG. The DRNG always generates the
    requested amount of output. When using the SP800-90A terminology
    it operates without prediction resistance. The secondary DRNG
    maintains a counter of how many bytes were generated since last
    re-seed and a timer of the elapsed time since last re-seed. If either
    the counter or the timer reaches a threshold, the secondary DRNG is
    seeded from the entropy pool.
    
    In case the Linux kernel detects a NUMA system, one secondary DRNG
    instance per NUMA node is maintained.
    
    2. The DRNG is seeded by concatenating the data from the
    following sources:
    
    (a) the output of the entropy pool,
    
    (b) the Jitter RNG if available and enabled, and
    
    (c) the CPU-based noise source such as Intel RDRAND if available and
    enabled.
    
    The entropy estimates of the data of all noise sources are added to
    form the entropy estimate of the data used to seed the DRNG.
    The LRNG ensures, however, that the entropy credited to the DRNG
    after seeding is at most the security strength of the DRNG.
    
    The LRNG is designed such that none of these noise sources can dominate
    the other noise sources in providing seed data to the DRNG, due to
    the following:
    
    (a) During boot time, the number of received interrupts provides the
    trigger points to (re)seed the DRNG.
    
    (b) At runtime, the available entropy from the slow noise source is
    concatenated with a pre-defined amount of data from the fast noise
    sources. In addition, each DRNG reseed operation triggers external
    noise source providers to deliver one block of data.
    
    3. The entropy pool accumulates entropy obtained from certain events,
    which will henceforth be collectively called "slow noise sources".
    The entropy pool collects noise data from slow noise sources. Any data
    received by the LRNG from the slow noise sources is inserted into a
    per-CPU entropy pool using a hash operation that can be changed during
    runtime. Per default, SHA-256 is used.
    
     (a) When an interrupt occurs, the high-resolution time stamp is mixed
    into the per-CPU entropy pool. This time stamp is credited with
    heuristically implied entropy.
    
     (b) HID event data like the key stroke or the mouse coordinates are
    mixed into the per-CPU entropy pool. This data is not credited with
    entropy by the LRNG.
    
     (c) Device drivers may provide data that is mixed into an auxiliary
    pool using the same hash that is used to process the per-CPU entropy
    pool. This data is not credited with entropy by the LRNG.
    
    Any data provided from user space by either writing to /dev/random,
    /dev/urandom or the IOCTL of RNDADDENTROPY on both device files
    are always injected into the auxiliary pool.
    
    In addition, when a hardware random number generator covered by the
    Linux kernel HW generator framework wants to deliver random numbers,
    they are injected into the auxiliary pool as well. The HW generator
    noise source is handled separately from the other noise sources because
    the HW generator framework may decide by itself when to deliver data,
    whereas the other noise sources are always requested for data as driven
    by the LRNG operation. Similarly, any user space provided data is
    inserted into the entropy pool.
    
    When seed data for the DRNG is to be generated, all per-CPU
    entropy pools and the auxiliary pool are hashed. The message digest
    forms the new auxiliary pool state. At the same time, this data
    is used for seeding the DRNG.
    
    To speed up the interrupt handling code of the LRNG, the time stamp
    collected for an interrupt event is truncated to the 8 least
    significant bits. 64 truncated time stamps are concatenated and then
    jointly inserted into the per-CPU entropy pool. During boot time,
    until the fully seeded stage is reached, the 32 least significant
    bits of each time stamp are concatenated. When 16 such events
    are received, they are injected into the per-CPU entropy pool.
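
    Both batching modes fill the same amount of data: 64 time stamps of
    8 bits and 16 time stamps of 32 bits each yield 512 bits. A Python
    sketch of this batching, for illustration only; the class name, the
    byte order and the list standing in for the per-CPU pool insertion are
    assumptions of this example:

```python
class TimestampBatch:
    """Collect truncated time stamps until a 64-byte array is full."""

    ARRAY_BYTES = 64

    def __init__(self, fully_seeded=False):
        self.fully_seeded = fully_seeded
        self.buf = bytearray()
        self.flushed = []   # stand-in for insertion into the per-CPU pool

    def add(self, timestamp):
        # 8 LSB once fully seeded, 32 LSB during boot.
        width = 1 if self.fully_seeded else 4
        mask = (1 << (8 * width)) - 1
        self.buf += (timestamp & mask).to_bytes(width, "little")
        if len(self.buf) >= self.ARRAY_BYTES:
            self.flushed.append(bytes(self.buf))
            self.buf.clear()
```

    During boot the sketch hands over a batch after 16 events, at runtime
    after 64 events.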
    
    The LRNG allows the DRNG mechanism to be changed at runtime. Per default,
    a ChaCha20-based DRNG is used. The ChaCha20-DRNG implemented for the
    LRNG is also provided as a stand-alone user space deterministic random
    number generator. The LRNG also offers an SP800-90A DRBG based on the
    Linux kernel crypto API DRBG implementation.
    
    The processing of entropic data from the noise source before injecting
    them into the DRNG is performed with the following mathematical
    operations:
    
    1. Truncation: The received time stamps are truncated to 8 least
    significant bits (or 32 least significant bits during boot time)
    
    2. Concatenation: The received and truncated time stamps as well as
    auxiliary 32 bit words are concatenated to fill the per-CPU data
    array that is capable of holding 64 8-bit words.
    
    3. Hashing: A set of concatenated time stamp data received from the
    interrupts are hashed together with the current existing per-CPU
    entropy pool state. The resulting message digest is the new per-CPU
    entropy pool state.
    
    4. Hashing: When new data is added to the auxiliary pool, the data
    is hashed together with the auxiliary pool to form a new auxiliary
    pool state.
    
    5. Hashing: A message digest of all per-CPU entropy pools and the
    auxiliary pool is calculated which forms the new auxiliary pool
    state. At the same time, this message digest is used to fill the
    slow noise source output buffer discussed in the following.
    
    6. Truncation: The most-significant bits (MSB) defined by the
    requested number of bits (commonly equal to the security strength
    of the DRBG) or the entropy available transported with the buffer
    (which is the minimum of the message digest size and the available
    entropy in all entropy pools and the auxiliary pool), whatever is
    smaller, are obtained from the slow noise source output buffer.
    
    7. Concatenation: The temporary seed buffer used to seed the DRNG
    is a concatenation of the slow noise source buffer, the Jitter RNG
    output, the CPU noise source output, and the current time.
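
    Steps 3 to 6 can be sketched in Python with the default SHA-256. This
    is a simplified model, not the kernel code: the function names, the
    order in which the pools are fed into the hash and the treatment of
    the leading digest bytes as the most-significant bits are assumptions
    of this example:

```python
import hashlib

DIGEST_BITS = 256   # SHA-256 message digest size


def mix(state: bytes, data: bytes) -> bytes:
    """Steps 3 and 4: hash new data together with the current pool state."""
    return hashlib.sha256(state + data).digest()


def extract_seed(percpu_pools, aux_pool, entropy_bits, requested_bits=256):
    """Steps 5 and 6: digest all pools, then truncate the output.

    Returns (new_aux_pool_state, slow_noise_source_output).
    """
    h = hashlib.sha256()
    for pool in percpu_pools:
        h.update(pool)
    h.update(aux_pool)
    digest = h.digest()     # also becomes the new auxiliary pool state
    # Whatever is smaller: the requested bits, or the transported entropy,
    # which itself is capped by the message digest size.
    out_bits = min(requested_bits, min(DIGEST_BITS, entropy_bits))
    return digest, digest[: out_bits // 8]
```

    With 128 bits of available entropy and a 256-bit request, the sketch
    returns 16 bytes of output while the full digest is kept as the new
    auxiliary pool state.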
    
    The DRNG always tries to seed itself with 256 bits of entropy, except
    during boot. In any case, if the noise sources cannot deliver that
    amount, the available entropy is used and the DRNG keeps track of how
    much entropy it was seeded with. The entropy implied by the LRNG
    available in the entropy pool may be too conservative. To ensure
    that during boot time all available entropy from the entropy pool is
    transferred to the DRNG, the hash_df function always generates 256
    data bits during boot to seed the DRNG. During boot, the DRNG is
    seeded as follows:
    
    1. The DRNG is reseeded from the entropy pool and potentially the fast
    noise sources if the entropy pool has collected at least 32 bits of
    entropy from the interrupt noise source. The goal of this step is to
    ensure that the DRNG receives some initial entropy as early as
    possible. In addition it receives the entropy available from
    the fast noise sources.
    
    2. The DRNG is reseeded from the entropy pool and potentially the fast
    noise sources if all noise sources collectively can provide at least
    128 bits of entropy.
    
    3. The DRNG is reseeded from the entropy pool and potentially the fast
    noise sources if all noise sources collectively can provide at least 256
    bits of entropy.
    
    At the time of the reseeding steps, the DRNG requests as much entropy as
    is available in order to skip certain steps and reach the seeding level
    of 256 bits. This may imply that one or more of the aforementioned steps
    are skipped.
    
    In all listed steps, the DRNG is (re)seeded with a number of random
    bytes from the entropy pool that is at most the amount of entropy
    present in the entropy pool. This means that when the entropy pool
    contains 128 or 256 bits of entropy, the DRNG is seeded with that
    amount of entropy as well.
    
    Before the DRNG is seeded with 256 bits of entropy in step 3,
    requests of random data from /dev/random and the getrandom system
    call are not processed.
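    
    The three boot-time seeding steps and the gating of /dev/random and
    getrandom can be sketched as below. The enum and function names are
    hypothetical and chosen only for this illustration. Checking the
    highest threshold first shows how lower steps are skipped when
    enough entropy is already available.
    
    ```c
    #include <assert.h>
    #include <stdbool.h>
    #include <stdint.h>
    
    /* Illustrative seed levels; the names are assumptions. */
    enum seed_level { SEED_NONE, SEED_INIT, SEED_MIN, SEED_FULL };
    
    /*
     * irq_bits:   entropy collected from the interrupt noise source
     * total_bits: entropy all noise sources can collectively provide
     */
    static enum seed_level boot_seed_level(uint32_t irq_bits,
                                           uint32_t total_bits)
    {
        /* The interrupt source is one of the contributing sources. */
        assert(irq_bits <= total_bits);
    
        if (total_bits >= 256)
            return SEED_FULL;   /* step 3 */
        if (total_bits >= 128)
            return SEED_MIN;    /* step 2 */
        if (irq_bits >= 32)
            return SEED_INIT;   /* step 1 */
        return SEED_NONE;
    }
    
    /* Blocking interfaces are served only after step 3 has completed. */
    static bool blocking_reads_served(enum seed_level level)
    {
        return level == SEED_FULL;
    }
    ```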
    
    The hash operation providing random data from the entropy pools will
    always require that all entropy sources collectively can deliver at
    least 128 entropy bits.
    
    The DRNG operates as a deterministic random number generator with the
    following properties:
    
    * The maximum number of random bytes that can be generated with one
    DRNG generate operation is limited to 4096 bytes. When longer random
    numbers are requested, multiple DRNG generate operations are performed.
    The ChaCha20 DRNG as well as the SP800-90A DRBGs implement an update of
    their state after completing a generate request for backtracking
    resistance.
    
    * The secondary DRNG is reseeded with whatever entropy is available.
    In the worst case, where the noise sources cannot provide additional
    entropy, the DRNG is not reseeded and continues its operation, trying
    to reseed again after the expiry of one of these thresholds:
    
     - the last reseeding of the secondary DRNG was more than 600 seconds
       ago, or
    
     - 2^20 DRNG generate operations have been performed, whichever comes
       first, or
    
     - the secondary DRNG is forced to reseed before the next generation of
       random numbers if data has been injected into the LRNG by writing data
       into /dev/random or /dev/urandom.
    
    The chosen values prevent high-volume requests from user space from
    causing frequent reseeding operations which drag down the performance
    of the DRNG.
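    
    The DRNG properties listed above can be sketched as follows. This is
    an illustration under assumptions, not the LRNG implementation: the
    struct, macro and function names are invented for this example, and
    time is modeled as a plain seconds counter.
    
    ```c
    #include <assert.h>
    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>
    
    #define DRNG_MAX_REQUEST_BYTES 4096u       /* per generate operation */
    #define DRNG_RESEED_MAX_AGE    600u        /* seconds */
    #define DRNG_RESEED_MAX_OPS    (1u << 20)  /* generate operations */
    
    /* Hypothetical bookkeeping for one secondary DRNG. */
    struct drng_state {
        uint64_t last_seeded;   /* time of last reseed, in seconds */
        uint32_t generate_ops;  /* generate operations since last reseed */
        bool force_reseed;      /* set when data was written to the device */
    };
    
    /* Generate operations needed to serve a request of nbytes. */
    static size_t drng_ops_for_request(size_t nbytes)
    {
        return (nbytes + DRNG_MAX_REQUEST_BYTES - 1) / DRNG_MAX_REQUEST_BYTES;
    }
    
    /* True if a reseed should be attempted before the next generate. */
    static bool drng_must_reseed(const struct drng_state *s, uint64_t now)
    {
        assert(now >= s->last_seeded);
    
        return s->force_reseed ||
               s->generate_ops >= DRNG_RESEED_MAX_OPS ||
               now - s->last_seeded > DRNG_RESEED_MAX_AGE;
    }
    ```
    
    A 10000-byte request, for instance, maps to three generate
    operations in this sketch.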
    
    With the automatic reseeding after 600 seconds, the LRNG is triggered
    to reseed itself before the first request after a suspend that put the
    hardware to sleep for longer than 600 seconds.
    
    To support smaller devices including IoT environments, this patch
    allows reducing the runtime memory footprint of the LRNG at compile
    time by selecting smaller collection data sizes.
    
    When the kernel is compiled for a small environment, the allocation
    of a buffer of up to 4096 bytes to serve user space requests is
    prevented. In this case, a stack variable of 64 bytes is used to serve
    all user space requests.
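    
    The buffer selection can be sketched as below. This is a minimal
    sketch with hypothetical names. That requests of up to 64 bytes use
    the stack variable even in the regular configuration is an
    assumption of this example, not something stated above.
    
    ```c
    #include <assert.h>
    #include <stdbool.h>
    #include <stddef.h>
    
    #define SMALL_BUF_BYTES 64u    /* stack variable */
    #define LARGE_BUF_BYTES 4096u  /* upper bound of the allocated buffer */
    
    /* Size of the buffer used to serve a user space request. */
    static size_t serve_buf_size(size_t nbytes, bool small_env)
    {
        /* Small environments never allocate the large buffer. */
        if (small_env || nbytes <= SMALL_BUF_BYTES)
            return SMALL_BUF_BYTES;
        return nbytes < LARGE_BUF_BYTES ? nbytes : LARGE_BUF_BYTES;
    }
    
    /* Number of fill-and-copy rounds needed for the whole request. */
    static size_t serve_rounds(size_t nbytes, bool small_env)
    {
        size_t buf = serve_buf_size(nbytes, small_env);
    
        assert(buf >= SMALL_BUF_BYTES);
        return (nbytes + buf - 1) / buf;
    }
    ```
    
    The trade-off is visible in the round count: a small environment
    saves the allocation but needs more rounds for large requests.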
    
    The LRNG has the following properties:
    
    * internal noise source: interrupt timing with fast boot time seeding
    
    * high performance of interrupt handling code: The LRNG impact on the
    interrupt handling has been reduced to a minimum. On one example
    system, the LRNG interrupt handling code in its fastest configuration
    executes within an average of 55 cycles whereas the existing
    /dev/random on the same device takes about 97 cycles when measuring
    the execution time of add_interrupt_randomness().
    
    * use of an almost never contended lock for the hashing operation that
      collects raw entropy, supporting concurrency-free use of massively
      parallel systems - the worst case rate of contention is the number of
      DRNG reseeds, usually: number of NUMA nodes contentions per 5 minutes.
    
    * use of standalone ChaCha20 based RNG with the option to use a
      different DRNG selectable at compile time
    
    * instantiate one DRNG per NUMA node
    
    * support for runtime switchable output DRNGs
    
    * use of a runtime-switchable hash for the conditioning implementation,
    following a widely accepted approach
    
    * compile-time selectable collection size
    
    * support of small systems by allowing the reduction of the
    runtime memory needs
    
    Further details including the rationale for the design choices and
    properties of the LRNG together with testing are provided at [1].
    In addition, the documentation explains the conducted regression
    tests to verify that the LRNG is API and ABI compatible with the
    existing /dev/random implementation.
    
    [1] https://www.chronox.de/lrng.html
    
    CC: Torsten Duwe <duwe@lst.de>
    CC: "Eric W. Biederman" <ebiederm@xmission.com>
    CC: "Alexander E. Patrakov" <patrakov@gmail.com>
    CC: "Ahmed S. Darwish" <darwish.07@gmail.com>
    CC: "Theodore Y. Ts'o" <tytso@mit.edu>
    CC: Willy Tarreau <w@1wt.eu>
    CC: Matthew Garrett <mjg59@srcf.ucam.org>
    CC: Vito Caputo <vcaputo@pengaru.com>
    CC: Andreas Dilger <adilger.kernel@dilger.ca>
    CC: Jan Kara <jack@suse.cz>
    CC: Ray Strode <rstrode@redhat.com>
    CC: William Jon McCann <mccann@jhu.edu>
    CC: zhangjs <zachary@baishancloud.com>
    CC: Andy Lutomirski <luto@kernel.org>
    CC: Florian Weimer <fweimer@redhat.com>
    CC: Lennart Poettering <mzxreary@0pointer.de>
    CC: Nicolai Stange <nstange@suse.de>
    CC: Alexander Lobakin <alobakin@mailbox.org>
    Mathematical aspects Reviewed-by: "Peter, Matthias" <matthias.peter@bsi.bund.de>
    Reviewed-by: Marcelo Henrique Cerri <marcelo.cerri@canonical.com>
    Reviewed-by: Roman Drahtmueller <draht@schaltsekun.de>
    Tested-by: Marcelo Henrique Cerri <marcelo.cerri@canonical.com>
    Tested-by: Neil Horman <nhorman@redhat.com>
    Signed-off-by: Stephan Mueller <smueller@chronox.de>

commit 072a55ef17e96360fef026e5e40a0799d615b22e
Author: Con Kolivas <kernel@kolivas.org>
Date:   Mon Dec 14 19:09:01 2020 +0000

    clockevents, hrtimer: Make hrtimer granularity and minimum hrtimeout configurable in sysctl. Set default granularity to 100us and min timeout to 500us
    
    Signed-off-by: Alexandre Frade <kernel@xanmod.org>

commit 4c9484b388e63ee3774f53de5f3d5b426e3f0e43
Author: Con Kolivas <kernel@kolivas.org>
Date:   Mon Feb 20 13:32:58 2017 +1100

    time: Don't use hrtimer overlay when pm_freezing since some drivers still don't correctly use freezable timeouts.
    
    Signed-off-by: Alexandre Frade <kernel@xanmod.org>

commit 33c780391af7cb83b89ecf4a5020d866e73f60fa
Author: Con Kolivas <kernel@kolivas.org>
Date:   Mon Feb 20 13:30:32 2017 +1100

    hrtimer: Replace all calls to schedule_timeout_uninterruptible of potentially under 50ms to use schedule_msec_hrtimeout_uninterruptible
    
    Signed-off-by: Alexandre Frade <kernel@xanmod.org>

commit 1db9c631c9eea9eb8ef34f4006bc6a46a7113285
Author: Con Kolivas <kernel@kolivas.org>
Date:   Mon Feb 20 13:30:07 2017 +1100

    hrtimer: Replace all calls to schedule_timeout_interruptible of potentially under 50ms to use schedule_msec_hrtimeout_interruptible.
    
    Signed-off-by: Alexandre Frade <kernel@xanmod.org>

commit 9de56a69843a0e3fd2d42972ee0b69bfcbf7c87d
Author: Con Kolivas <kernel@kolivas.org>
Date:   Mon Feb 15 21:56:16 2021 +0000

    hrtimer: Replace all schedule_timeout(1) with schedule_min_hrtimeout()
    
    Signed-off-by: Alexandre Frade <kernel@xanmod.org>

commit d12a11b569cd6d291b78b8ebe153ef25ef5aae6f
Author: Con Kolivas <kernel@kolivas.org>
Date:   Fri Nov 4 09:25:54 2016 +1100

    timer: Convert msleep to use hrtimers when active.
    
    Signed-off-by: Alexandre Frade <kernel@xanmod.org>

commit 489c9796e4f4651d9babc3fad54bce9ceec042a7
Author: Con Kolivas <kernel@kolivas.org>
Date:   Sat Nov 5 09:27:36 2016 +1100

    time: Special case calls of schedule_timeout(1) to use the min hrtimeout of 1ms, working around low Hz resolutions.
    
    Signed-off-by: Alexandre Frade <kernel@xanmod.org>

commit d918ad3d6c2ad8a479b854f0de1eab6f32adf890
Author: Con Kolivas <kernel@kolivas.org>
Date:   Sat Aug 12 11:53:39 2017 +1000

    hrtimer: Create highres timeout variants of schedule_timeout functions.
    
    Signed-off-by: Alexandre Frade <kernel@xanmod.org>

commit 0f184028885b667502e7083e63cf78ff5e8a48ed
Author: Alexandre Frade <kernel@xanmod.org>
Date:   Sun Aug 29 23:58:33 2021 +0000

    XANMOD: fair: Remove all energy efficiency functions
    
    Signed-off-by: Alexandre Frade <kernel@xanmod.org>

commit fc28a768fbb5e3a79e1823d2e2579c968c7126f4
Author: Alexandre Frade <kernel@xanmod.org>
Date:   Fri Jun 18 19:10:55 2021 +0000

    XANMOD: Makefile: Turn off loop vectorization for GCC -O3 optimization level
    
    Signed-off-by: Alexandre Frade <kernel@xanmod.org>

commit ad53ae3e9fc839503aeb1ae91ad84000d9ce6594
Author: Alexandre Frade <kernel@xanmod.org>
Date:   Thu Sep 3 20:36:13 2020 +0000

    XANMOD: init/Kconfig: Enable -O3 KBUILD_CFLAGS optimization for all architectures
    
    Signed-off-by: Alexandre Frade <kernel@xanmod.org>

commit 441d0b82d15297975714816627b5656a44e9f950
Author: Alexandre Frade <admfrade@gmail.com>
Date:   Thu Jun 25 16:40:43 2020 -0300

    XANMOD: lib/kconfig.debug: disable default CONFIG_SYMBOLIC_ERRNAME and CONFIG_DEBUG_BUGVERBOSE
    
    Signed-off-by: Alexandre Frade <admfrade@gmail.com>

commit 19dc9716bba5ea323f527adc08b6811c47150f3e
Author: Alexandre Frade <admfrade@gmail.com>
Date:   Mon Jan 29 17:41:29 2018 +0000

    XANMOD: scripts: disable the localversion "+" tag of a git repo
    
    Signed-off-by: Alexandre Frade <admfrade@gmail.com>

commit 933a3df33c872b0fbc5a1fc968f45654c6f619eb
Author: Alexandre Frade <admfrade@gmail.com>
Date:   Tue Mar 31 13:32:08 2020 -0300

    XANMOD: cpufreq: tunes ondemand and conservative governor for performance
    
    Signed-off-by: Alexandre Frade <admfrade@gmail.com>

commit f76b2038e5fbad82c517d60f9721225d5a62ffb4
Author: Alexandre Frade <admfrade@gmail.com>
Date:   Mon Jan 29 17:31:25 2018 +0000

    XANMOD: mm/vmscan: vm_swappiness = 30 decreases the amount of swapping
    
    Signed-off-by: Alexandre Frade <admfrade@gmail.com>

commit 9171c3f9b851993d97d4a465dab182548eacb955
Author: Alexandre Frade <kernel@xanmod.org>
Date:   Thu Aug 13 14:57:06 2020 +0000

    XANMOD: sched/autogroup: Add kernel parameter and config option to enable/disable autogroup feature by default
    
    Signed-off-by: Alexandre Frade <kernel@xanmod.org>

commit 0f67d47b522f157aa1777c76d9eb998d2f40823e
Author: Alexandre Frade <admfrade@gmail.com>
Date:   Mon Jan 29 16:59:22 2018 +0000

    XANMOD: dcache: cache_pressure = 50 decreases the rate at which VFS caches are reclaimed
    
    Signed-off-by: Alexandre Frade <admfrade@gmail.com>

commit 7fb8252e688794275211ebac03c407f642b2eadf
Author: Alexandre Frade <admfrade@gmail.com>
Date:   Sun Oct 13 03:10:39 2019 -0300

    XANMOD: kconfig: set PREEMPT and RCU_BOOST without delay by default
    
    Signed-off-by: Alexandre Frade <admfrade@gmail.com>

commit aeadb68c9cc496eae4119895c0ffd856f1a4a081
Author: Alexandre Frade <admfrade@gmail.com>
Date:   Mon Jan 29 17:26:15 2018 +0000

    XANMOD: kconfig: add 500Hz timer interrupt kernel config option
    
    Signed-off-by: Alexandre Frade <admfrade@gmail.com>

commit 1330aad8774a279357a4e5b06be98db75081d050
Author: Alexandre Frade <kernel@xanmod.org>
Date:   Mon Dec 14 16:24:26 2020 +0000

    XANMOD: block: set rq_affinity to force full multithreading I/O requests
    
    Signed-off-by: Alexandre Frade <kernel@xanmod.org>

commit ad0a45a396146a321d0e717bd38b34c3e368c4d0
Author: Alexandre Frade <kernel@xanmod.org>
Date:   Mon Jun 1 18:23:51 2020 -0300

    XANMOD: block, bfq: change BLK_DEV_ZONED depends to IOSCHED_BFQ
    
    Signed-off-by: Alexandre Frade <kernel@xanmod.org>

commit 570e1b5b7e352eb2771344c36f57b97d59df686e
Author: Alexandre Frade <kernel@xanmod.org>
Date:   Mon Nov 25 15:13:06 2019 -0300

    XANMOD: elevator: set default scheduler to bfq for blk-mq
    
    Signed-off-by: Alexandre Frade <kernel@xanmod.org>

commit 7d2a07b769330c34b4deabeed939325c77a7ec2f
Author: Linus Torvalds <torvalds@linux-foundation.org>
Date:   Sun Aug 29 15:04:50 2021 -0700

    Linux 5.14