commit a6829e61931c50f79fc67220e060e2f168fef17d Author: Alexandre Frade Date: Mon Feb 15 23:28:03 2021 +0000 Linux 5.11.0-xanmod1 Signed-off-by: Alexandre Frade commit 1ae1bd19ca7513085c2beeec52a9a622b13012d5 Author: Neal Cardwell Date: Mon Dec 28 19:23:09 2020 -0500 net-tcp_bbr: v2: don't assume prior_cwnd was set entering CA_Loss Fix WARN_ON_ONCE() warnings that were firing and pointing to a bbr->prior_cwnd of 0 when exiting CA_Loss and transitioning to CA_Open. The issue was that tcp_simple_retransmit() calls: tcp_set_ca_state(sk, TCP_CA_Loss); without first calling icsk_ca_ops->ssthresh(sk) (because tcp_simple_retransmit() is dealing with losses due to MTU issues and not congestion). The lack of this callback means that BBR did not get a chance to set bbr->prior_cwnd, and thus upon exiting CA_Loss in such cases the WARN_ON_ONCE() would fire due to a zero bbr->prior_cwnd. This commit removes that warning, since a bbr->prior_cwnd of 0 is a valid situation in this state transition. For setting inflight_lo upon entering CA_Loss, to avoid setting an inflight_lo of 0 in this case, this commit switches to taking the max of cwnd and prior_cwnd. We plan to remove that line of code when we switch to cautious (PRR-style) recovery, so that awkwardness will go away. Change-Id: I575dce871c2f20e91e3e9449e1706f42a07b8118 commit d424d3294dd3b7f19b86cdbe5910644323916575 Author: Neal Cardwell Date: Mon Aug 17 19:10:21 2020 -0400 net-tcp_bbr: v2: remove cycle_rand parameter that is unused in BBRv2 Change-Id: Iee1df7e41e42de199068d7c89131ed3d228327c0 commit 429afa9146c37ac7d85bcbdf86afe6acd34a0830 Author: Neal Cardwell Date: Mon Aug 17 19:08:41 2020 -0400 net-tcp_bbr: v2: remove field bw_rtts that is unused in BBRv2 Change-Id: I58e3346c707748a6f316f3ed060d2da84c32a79b commit 119cfaa88f307f854b65adcc87ecde93d9211663 Author: Neal Cardwell Date: Thu Nov 21 15:28:01 2019 -0500 net-tcp_bbr: v2: remove unnecessary rs.delivered_ce logic upon loss There is no reason to compute rs.delivered_ce upon loss. In fact, we specifically do not want to compute rs.delivered_ce upon loss. Two issues: (1) This would be the wrong thing to do, in behavior terms. With RACK's dynamic reordering window, losses can be marked long after the sequence hole appears in the ACK/SACK stream. We want to to catch the ECN mark rate rising too high as quickly as possible, which means we want to check for high ECN mark rates at ACK time (as BBRv2 currently does) and not loss marking time. (2) This is dead code. The ECN mark rate cannot be detected as too high because the check needs rs->delivered to be > 0 as well: if (rs->delivered_ce > 0 && rs->delivered > 0 && Since we are not setting rs->delivered upon loss, this check cannot succeed, so setting delivered_ce is pointless. This dead and wrong line was discovered by Randall Stewart at Netflix as he was reading the BBRv2 code. Change-Id: I37f83f418a259ec31d8f82de986db071b364b76a commit 6bc862c6f850835b61ee633a70c33aed03cc9115 Author: Neal Cardwell Date: Tue Jun 11 12:54:22 2019 -0400 net-tcp_bbr: v2: BBRv2 ("bbr2") congestion control for Linux TCP BBR v2 is an enhacement to the BBR v1 algorithm. It's designed to aim for lower queues, lower loss, and better Reno/CUBIC coexistence than BBR v1. BBR v2 maintains the core of BBR v1: an explicit model of the network path that is two-dimensional, adapting to estimate the (a) maximum available bandwidth and (b) maximum safe volume of data a flow can keep in-flight in the network. It maintains the estimated BDP as a core guide for estimating an appropriate level of in-flight data. BBR v2 makes several key enhancements: o Its bandwidth-probing time scale is adapted, within bounds, to allow improved coexistence with Reno and CUBIC. The bandwidth-probing time scale is (a) extended dynamically based on estimated BDP to improve coexistence with Reno/CUBIC; (b) bounded by an interactive wall-clock time-scale to be more scalable and responsive than Reno and CUBIC. o Rather than being largely agnostic to loss and ECN marks, it explicitly uses loss and (DCTCP-style) ECN signals to maintain its model. o It aims for lower losses than v1 by adjusting its model to attempt to stay within loss rate and ECN mark rate bounds (loss_thresh and ecn_thresh, respectively). o It adapts to loss/ECN signals even when the application is running out of data ("application-limited"), in case the "application-limited" flow is also "network-limited" (the bw and/or inflight available to this flow is lower than previously estimated when the flow ran out of data). o It has a three-part model: the model explicit three tracks operating points, where an operating point is a tuple: (bandwidth, inflight). The three operating points are: o latest: the latest measurement from the current round trip o upper bound: robust, optimistic, long-term upper bound o lower bound: robust, conservative, short-term lower bound These are stored in the following state variables: o latest: bw_latest, inflight_latest o lo: bw_lo, inflight_lo o hi: bw_hi[2], inflight_hi To gain intuition about the meaning of the three operating points, it may help to consider the analogs in CUBIC, which has a somewhat analogous three-part model used by its probing state machine: BBR param CUBIC param ----------- ------------- latest ~ cwnd lo ~ ssthresh hi ~ last_max_cwnd The analogy is only a loose one, though, since the BBR operating points are calculated differently, and are 2-dimensional (bw,inflight) rather than CUBIC's one-dimensional notion of operating point (inflight). o It uses the three-part model to adapt the magnitude of its bandwidth to match the estimated space available in the buffer, rather than (as in BBR v1) assuming that it was always acceptable to place 0.25*BDP in the bottleneck buffer when probing (commodity datacenter switches commonly do not have that much buffer for WAN flows). When BBR v2 estimates it hit a buffer limit during probing, its bandwidth probing then starts gently in case little space is still available in the buffer, and the accelerates, slowly at first and then rapidly if it can grow inflight without seeing congestion signals. In such cases, probing is bounded by inflight_hi + inflight_probe, where inflight_probe grows as: [0, 1, 2, 4, 8, 16,...]. This allows BBR to keep losses low and bounded if a bottleneck remains congested, while rapidly/scalably utilizing free bandwidth when it becomes available. o It has a slightly revised state machine, to achieve the goals above. BBR_BW_PROBE_UP: pushes up inflight to probe for bw/vol BBR_BW_PROBE_DOWN: drain excess inflight from the queue BBR_BW_PROBE_CRUISE: use pipe, w/ headroom in queue/pipe BBR_BW_PROBE_REFILL: try refill the pipe again to 100%, leaving queue empty o The estimated BDP: BBR v2 continues to maintain an estimate of the path's two-way propagation delay, by tracking a windowed min_rtt, and coordinating (on an as-ndeeded basis) to try to expose the two-way propagation delay by draining the bottleneck queue. BBR v2 continues to use its min_rtt and (currently-applicable) bandwidth estimate to estimate the current bandwidth-delay product. The estimated BDP still provides one important guideline for bounding inflight data. However, because any min-filtered RTT and max-filtered bw inherently tend to both overestimate, the estimated BDP is often too high; in this case loss or ECN marks can ensue, in which case BBR v2 adjusts inflight_hi and inflight_lo to adapt its sending rate and inflight down to match the available capacity of the path. o Space: Note that ICSK_CA_PRIV_SIZE increased. This is because BBR v2 requires more space. Note that much of the space is due to support for per-socket parameterization and debugging in this release for research and debugging. With that state removed, the full "struct bbr" is 140 bytes, or 144 with padding. This is an increase of 40 bytes over the existing ca_priv space. o Code: BBR v2 reuses many pieces from BBR v1. But it omits the following significant pieces: o "packet conservation" (bbr_set_cwnd_to_recover_or_restore(), bbr_can_grow_inflight()) o long-term bandwidth estimator ("policer mode") The code layout tries to keep BBR v2 code near the bottom of the file, so that v1-applicable code in the top does not accidentally refer to v2 code. o Docs: See the following docs for more details and diagrams decsribing the BBR v2 algorithm: https://datatracker.ietf.org/meeting/104/materials/slides-104-iccrg-an-update-on-bbr-00 https://datatracker.ietf.org/meeting/102/materials/slides-102-iccrg-an-update-on-bbr-work-at-google-00 o Internal notes: For this upstream rebase, Neal started from: git show fed518041ac6:net/ipv4/tcp_bbr.c > net/ipv4/tcp_bbr.c then removed dev instrumentation (dynamic get/set for parameters) and code that was only used by BBRv1 Effort: net-tcp_bbr Origin-9xx-SHA1: 2c84098e60bed6d67dde23cd7538c51dee273102 Change-Id: I125cf26ba2a7a686f2fa5e87f4c2afceb65f7a05 Signed-off-by: Alexandre Frade commit be267c759420573b7a5125f7a0f4b590551dd4b0 Author: Neal Cardwell Date: Sat Nov 16 13:16:25 2019 -0500 net-tcp: add fast_ack_mode=1: skip rwin check in tcp_fast_ack_mode__tcp_ack_snd_check() Add logic for an experimental TCP connection behavior, enabled with tp->fast_ack_mode = 1, which disables checking the receive window before sending an ack in __tcp_ack_snd_check(). If this behavior is enabled, the data receiver sends an ACK if the amount of data is > RCV.MSS. Change-Id: Iaa0a0fd7108221f883137a79d5bfa724f1b096d4 commit 71b9fc86e66e11d0c8daeff0f4f841b5164d5066 Author: Neal Cardwell Date: Fri Sep 27 17:10:26 2019 -0400 net-tcp: re-generalize TSO sizing in TCP CC module API Reorganize the API for CC modules so that the CC module once again gets complete control of the TSO sizing decision. This is how the API was set up around 2016 and the initial BBRv1 upstreaming. Later Eric Dumazet simplified it. But with wider testing it now seems that to avoid CPU regressions BBR needs to have a different TSO sizing function. This is necessary to handle cases where there are many flows bottlenecked on the sender host's NIC, in which case BBR's pacing rate is much lower than CUBIC/Reno/DCTCP's. Why does this happen? Because BBR's pacing rate adapts to the low bandwidth share each flow sees. By contrast, CUBIC/Reno/DCTCP see no loss or ECN, so they grow a very large cwnd, and thus large pacing rate and large TSO burst size. Change-Id: Ic8ccfdbe4010ee8d4bf6a6334c48a2fceb2171ea commit 4508a811c6c0e58c986e1dbd242c70df0c4df422 Author: Yousuk Seung Date: Wed May 23 17:55:54 2018 -0700 net-tcp: add new ca opts flag TCP_CONG_WANTS_CE_EVENTS Add a a new ca opts flag TCP_CONG_WANTS_CE_EVENTS that allows a congestion control module to receive CE events. Currently congestion control modules have to set the TCP_CONG_NEEDS_ECN bit in opts flag to receive CE events but this may incur changes in ECN behavior elsewhere. This patch adds a new bit TCP_CONG_WANTS_CE_EVENTS that allows congestion control modules to receive CE events independently of TCP_CONG_NEEDS_ECN. Effort: net-tcp Origin-9xx-SHA1: 9f7e14716cde760bc6c67ef8ef7e1ee48501d95b Change-Id: I2255506985242f376d910c6fd37daabaf4744f24 commit 1ede541ad1fb9e2b0733e04eeb3671f3e3fee0a4 Author: Neal Cardwell Date: Tue May 7 22:37:19 2019 -0400 net-tcp_bbr: v2: set tx.in_flight for skbs in repair write queue Syzkaller was able to use TCP_REPAIR to reproduce the new warning added in tcp_fragment(): WARNING: CPU: 0 PID: 118174 at net/ipv4/tcp_output.c:1487 tcp_fragment+0xdcc/0x10a0 net/ipv4/tcp_output.c:1487() inconsistent: tx.in_flight: 0 old_factor: 53 The warning happens because skbs inserted into the tcp_rtx_queue during the repair process go through a sort of "fake send" process, and that process was seting pcount but not tx.in_flight, and thus the warnings (where old_factor is the old pcount). The fix of setting tx.in_flight in the TCP_REPAIR code path seems simple enough, and indeed makes the repro code from syzkaller stop producing warnings. Running through kokonut tests, and will send out for review when all tests pass. Effort: net-tcp_bbr Origin-9xx-SHA1: 330f825a08a6fe92cef74d799cc468864c479f63 Change-Id: I0bc4a790f040fd4239620e1eedd5dc64666c6f05 commit 904721165f70a568a7a513e389a9dfd92be689ed Author: Neal Cardwell Date: Wed May 1 20:16:25 2019 -0400 net-tcp_bbr: v2: adjust skb tx.in_flight upon split in tcp_fragment() When we fragment an skb that has already been sent, we need to update the tx.in_flight for the first skb in the resulting pair ("buff"). Because we were not updating the tx.in_flight, the tx.in_flight value was inconsistent with the pcount of the "buff" skb (tx.in_flight would be too high). That meant that if the "buff" skb was lost, then bbr2_inflight_hi_from_lost_skb() would calculate an inflight_hi value that is too high. This could result in longer queues and higher packet loss. Packetdrill testing verified that without this commit, when the second half of an skb is SACKed and then later the first half of that skb is marked lost, the calculated inflight_hi was incorrect. Effort: net-tcp_bbr Origin-9xx-SHA1: 385f1ddc610798fab2837f9f372857438b25f874 Change-Id: I617f8cab4e9be7a0b8e8d30b047bf8645393354d commit c9a1234e491a9ce22857b1f90ae3599f7ed665ab Author: Neal Cardwell Date: Wed May 1 20:16:33 2019 -0400 net-tcp_bbr: v2: adjust skb tx.in_flight upon merge in tcp_shifted_skb() When tcp_shifted_skb() updates state as adjacent SACKed skbs are coalesced, previously the tx.in_flight was not adjusted, so we could get contradictory state where the skb's recorded pcount was bigger than the tx.in_flight (the number of segments that were in_flight after sending the skb). Normally have a SACKed skb with contradictory pcount/tx.in_flight would not matter. However, with SACK reneging, the SACKed bit is removed, and an skb once again becomes eligible for retransmitting, fragmenting, SACKing, etc. Packetdrill testing verified the following sequence is possible in a kernel that does not have this commit: - skb N is SACKed - skb N+1 is SACKed and combined with skb N using tcp_shifted_skb() - tcp_shifted_skb() will increase the pcount of prev, but leave tx.in_flight as-is - so prev skb can have pcount > tx.in_flight - RTO, tcp_timeout_mark_lost(), detect reneg, remove "SACKed" bit, mark skb N as lost - find pcount of skb N is greater than its tx.in_flight I suspect this issue iw what caused the bbr2_inflight_hi_from_lost_skb(): WARN_ON_ONCE(inflight_prev < 0) to fire in production machines using bbr2. Tested: See last commit in series for sponge link. Effort: net-tcp_bbr Origin-9xx-SHA1: 1a3e997e613d2dcf32b947992882854ebe873715 Change-Id: I1b0b75c27519953430c7db51c6f358f104c7af55 commit b0dfecfb8e15ba1bdc7e68645f03e9c94bab5ff9 Author: Neal Cardwell Date: Tue May 7 22:36:36 2019 -0400 net-tcp_bbr: v2: factor out tx.in_flight setting into tcp_set_tx_in_flight() Factor out the code to set an skb's tx.in_flight field into its own function, so that this code can be used for the TCP_REPAIR "fake send" code path that inserts skbs into the rtx queue without sending them. This is in preparation for the following patch, which fixes an issue with TCP_REPAIR and tx.in_flight. Tested: See last patch in series for sponge link. Effort: net-tcp_bbr Origin-9xx-SHA1: e880fc907d06ea7354333f60f712748ebce9497b Change-Id: I4fbd4a6e18a51ab06d50ab1c9ad820ce5bea89af commit cd518082b83b17afbeadac36d1d463c71a6d4d33 Author: Neal Cardwell Date: Tue Aug 7 21:52:06 2018 -0400 net-tcp_bbr: v2: introduce ca_ops->skb_marked_lost() CC module callback API For connections experiencing reordering, RACK can mark packets lost long after we receive the SACKs/ACKs hinting that the packets were actually lost. This means that CC modules cannot easily learn the volume of inflight data at which packet loss happens by looking at the current inflight or even the packets in flight when the most recently SACKed packet was sent. To learn this, CC modules need to know how many packets were in flight at the time lost packets were sent. This new callback, combined with TCP_SKB_CB(skb)->tx.in_flight, allows them to learn this. This also provides a consistent callback that is invoked whether packets are marked lost upon ACK processing, using the RACK reordering timer, or at RTO time. Effort: net-tcp_bbr Origin-9xx-SHA1: afcbebe3374e4632ac6714d39e4dc8a8455956f4 Change-Id: I54826ab53df636be537e5d3c618a46145d12d51a commit fc4405e66c510699cb183b437c88465efebf9a30 Author: Neal Cardwell Date: Mon Nov 19 13:48:36 2018 -0500 net-tcp_bbr: v2: export FLAG_ECE in rate_sample.is_ece For understanding the relationship between inflight and ECN signals, to try to find the highest inflight value that has acceptable levels ECN marking. Effort: net-tcp_bbr Origin-9xx-SHA1: 3eba998f2898541406c2666781182200934965a8 Change-Id: I3a964e04cee83e11649a54507043d2dfe769a3b3 commit 6f48174a4da9a2d642b348a7c31bde022f5a7116 Author: Neal Cardwell Date: Thu Oct 12 23:44:27 2017 -0400 net-tcp_bbr: v2: count packets lost over TCP rate sampling interval For understanding the relationship between inflight and packet loss signals, to try to find the highest inflight value that has acceptable levels of packet losses. Effort: net-tcp_bbr Origin-9xx-SHA1: 4527e26b2bd7756a88b5b9ef1ada3da33dd609ab Change-Id: I594c2500868d9c530770e7ddd68ffc87c57f4fd5 commit 98712a9c168a2bf1eca3ff104a9a4f8bbea1741a Author: Neal Cardwell Date: Sat Aug 5 11:49:50 2017 -0400 net-tcp_bbr: v2: snapshot packets in flight at transmit time and pass in rate_sample For understanding the relationship between inflight and losses or ECN signals, to try to find the highest inflight value that has acceptable levels of loss/ECN marking. Effort: net-tcp_bbr Origin-9xx-SHA1: b3eb4f2d20efab4ca001f32c9294739036c493ea Change-Id: I7314047d0ff14dd261a04b1969a46dc658c8836a commit d88b19523432652430a10bc8666d3be6beca457e Author: Neal Cardwell Date: Sun Jun 24 21:55:59 2018 -0400 net-tcp_bbr: v2: shrink delivered_mstamp, first_tx_mstamp to u32 to free up 8 bytes Free up some space for tracking inflight and losses for each bw sample, in upcoming commits. These timestamps are in microseconds, and are now stored in 32 bits. So they can only hold time intervals up to roughly 2^12 = 4096 seconds. But Linux TCP RTT and RTO tracking has the same 32-bit microsecond implementation approach and resulting deployment limitations. So this is not introducing a new limit. And these should not be a limitation for the foreseeable future. Effort: net-tcp_bbr Origin-9xx-SHA1: 238a7e6b5d51625fef1ce7769826a7b21b02ae55 Change-Id: I3b779603797263b52a61ad57c565eb91fe42680c commit f12804eed503e10e55e87351707058bc9e3dc89a Author: Yuchung Cheng Date: Tue Mar 27 18:01:46 2018 -0700 net-tcp_rate: account for CE marks in rate sample This patch counts number of packets delivered have CE mark in the rate sample, using similar approach of delivery accounting. Effort: net-tcp_rate Origin-9xx-SHA1: 710644db434c3da335a7c8b72207a671ccbb5cf8 Change-Id: I0968fb33fe19b5c774e8c3afd2685558a6ec8710 commit 758640779425e7126a1dc91f28040c77b3b5220f Author: Yuchung Cheng Date: Tue Mar 27 18:33:29 2018 -0700 net-tcp_rate: consolidate inflight tracking approaches in TCP In order to track CE marks per rate sample (one round trip), we'll need to snap the starting tcp delivered_ce acount in the packet meta header (tcp_skb_cb). But there's not enough space. Good news is that the "last_in_flight" in the header, used by NV congestion control, is almost equivalent as "delivered". In fact "delivered" is better by accounting out-of-order packets additionally. Therefore we can remove it to make room for the CE tracking. This would make delayed ACK detection slightly less accurate but the impact is negligible since it's not used for any critical control. Effort: net-tcp_rate Origin-9xx-SHA1: ddcd46ec85d5f1c4454258af0c54b3254c0d64a7 Change-Id: I1a184aad6d101c981ac7f2f275aa9417ff856910 commit 5c761d8922d3b2a8d9ba40cabdeeca3a01170ab7 Author: Neal Cardwell Date: Tue Jun 11 12:26:55 2019 -0400 net-tcp_bbr: broaden app-limited rate sample detection This commit is a bug fix for the Linux TCP app-limited (application-limited) logic that is used for collecting rate (bandwidth) samples. Previously the app-limited logic only looked for "bubbles" of silence in between application writes, by checking at the start of each sendmsg. But "bubbles" of silence can also happen before retransmits: e.g. bubbles can happen between an application write and a retransmit, or between two retransmits. Retransmits are triggered by ACKs or timers. So this commit checks for bubbles of app-limited silence upon ACKs or timers. Why does this commit check for app-limited state at the start of ACKs and timer handling? Because at that point we know whether inflight was fully using the cwnd. During processing the ACK or timer event we often change the cwnd; after changing the cwnd we can't know whether inflight was fully using the old cwnd. Origin-9xx-SHA1: 3fe9b53291e018407780fb8c356adb5666722cbc Change-Id: I37221506f5166877c2b110753d39bb0757985e68 commit f78bdaf7f7294afb613a3772a833973b97a27c24 Author: Con Kolivas Date: Mon Dec 14 19:09:01 2020 +0000 clockevents, hrtimer: Make hrtimer granularity and minimum hrtimeout configurable in sysctl. Set default granularity to 100us and min timeout to 500us Co-developed-by: Alexandre Frade Signed-off-by: Alexandre Frade commit 5ad7b03b30609dfc9ef1e05ef55e030579208601 Author: Con Kolivas Date: Mon Feb 20 13:32:58 2017 +1100 time: Don't use hrtimer overlay when pm_freezing since some drivers still don't correctly use freezable timeouts. commit 9fa58ac29fd3aea1ccf9ad2d3b98348e0ef9cf1d Author: Con Kolivas Date: Mon Feb 20 13:30:32 2017 +1100 hrtimer: Replace all calls to schedule_timeout_uninterruptible of potentially under 50ms to use schedule_msec_hrtimeout_uninterruptible commit 03a69a939945356f5b2b1c18655db6a80118cb8b Author: Con Kolivas Date: Mon Feb 20 13:30:07 2017 +1100 hrtimer: Replace all calls to schedule_timeout_interruptible of potentially under 50ms to use schedule_msec_hrtimeout_interruptible. commit e50ea50a19388efacb94d903f9512ecdf504ac62 Author: Con Kolivas Date: Mon Feb 15 21:56:16 2021 +0000 hrtimer: Replace all schedule timeout(1) with schedule_min_hrtimeout() Signed-off-by: Alexandre Frade commit feea21a992024b66fd5f38d7f7322f498a4a29a2 Author: Con Kolivas Date: Fri Nov 4 09:25:54 2016 +1100 timer: Convert msleep to use hrtimers when active. commit 480a4d203f2c9b201f71a76bd84a12b327ba0f66 Author: Con Kolivas Date: Sat Nov 5 09:27:36 2016 +1100 time: Special case calls of schedule_timeout(1) to use the min hrtimeout of 1ms, working around low Hz resolutions. commit 07d208a9c048c61a4316ef0e2f61ecf4527c31ff Author: Con Kolivas Date: Sat Aug 12 11:53:39 2017 +1000 hrtimer: Create highres timeout variants of schedule_timeout functions. commit 708926b710a35edd5a036421907b1d84c7658a3e Author: Gabriel Krisman Bertazi Date: Thu Feb 13 18:45:25 2020 -0300 selftests: futex: Add FUTEX_WAIT_MULTIPLE wake up test Add test for wait at multiple futexes mechanism. Skip the test if it's a x32 application and the kernel returned the approtiaded error, since this ABI is not supported for this operation. Signed-off-by: Gabriel Krisman Bertazi Co-developed-by: André Almeida Signed-off-by: André Almeida Signed-off-by: Alexandre Frade commit 1b8d17961dbf0808e3bbc695fd9ac40c6930f4ba Author: Gabriel Krisman Bertazi Date: Thu Feb 13 18:45:24 2020 -0300 selftests: futex: Add FUTEX_WAIT_MULTIPLE wouldblock test Add test for wouldblock return when waiting for multiple futexes. Skip the test if it's a x32 application and the kernel returned the approtiaded error, since this ABI is not supported for this operation. Signed-off-by: Gabriel Krisman Bertazi Co-developed-by: André Almeida Signed-off-by: André Almeida Signed-off-by: Alexandre Frade commit a8317978e81b9866fe5687fe90d440becb79e9a7 Author: Gabriel Krisman Bertazi Date: Thu Feb 13 18:45:23 2020 -0300 selftests: futex: Add FUTEX_WAIT_MULTIPLE timeout test Add test for timeout when waiting for multiple futexes. Skip the test if it's a x32 application and the kernel returned the approtiaded error, since this ABI is not supported for this operation. Signed-off-by: Gabriel Krisman Bertazi Co-developed-by: André Almeida Signed-off-by: André Almeida Signed-off-by: Alexandre Frade commit 89134fb714873e3b185c3d974ca2d6b9bfc310f1 Author: Gabriel Krisman Bertazi Date: Sat Jan 30 11:57:02 2021 -0300 futex: Implement mechanism to wait on any of several futexes This is a new futex operation, called FUTEX_WAIT_MULTIPLE, which allows a thread to wait on several futexes at the same time, and be awoken by any of them. In a sense, it implements one of the features that was supported by pooling on the old FUTEX_FD interface. The use case lies in the Wine implementation of the Windows NT interface WaitMultipleObjects. This Windows API function allows a thread to sleep waiting on the first of a set of event sources (mutexes, timers, signal, console input, etc) to signal. Considering this is a primitive synchronization operation for Windows applications, being able to quickly signal events on the producer side, and quickly go to sleep on the consumer side is essential for good performance of those running over Wine. Wine developers have an implementation that uses eventfd, but it suffers from FD exhaustion (there is applications that go to the order of multi-milion FDs), and higher CPU utilization than this new operation. The futex list is passed as an array of `struct futex_wait_block` (pointer, value, bitset) to the kernel, which will enqueue all of them and sleep if none was already triggered. It returns a hint of which futex caused the wake up event to userspace, but the hint doesn't guarantee that is the only futex triggered. Before calling the syscall again, userspace should traverse the list, trying to re-acquire any of the other futexes, to prevent an immediate -EWOULDBLOCK return code from the kernel. This was tested using three mechanisms: 1) By reimplementing FUTEX_WAIT in terms of FUTEX_WAIT_MULTIPLE and running the unmodified tools/testing/selftests/futex and a full linux distro on top of this kernel. 2) By an example code that exercises the FUTEX_WAIT_MULTIPLE path on a multi-threaded, event-handling setup. 3) By running the Wine fsync implementation and executing multi-threaded applications, in particular modern games, on top of this implementation. Changes were tested for the following ABIs: x86_64, i386 and x32. Support for x32 applications is not implemented since it would take a major rework adding a new entry point and splitting the current futex 64 entry point in two and we can't change the current x32 syscall number without breaking user space compatibility. Included Valve's Proton compatibility code. Adjusted for v5.9: Removed `put_futex_key` calls. CC: Steven Rostedt Cc: Richard Yao Cc: Thomas Gleixner Cc: Peter Zijlstra Co-developed-by: Zebediah Figura Signed-off-by: Zebediah Figura Co-developed-by: Steven Noonan Signed-off-by: Steven Noonan Co-developed-by: Pierre-Loup A. Griffais Signed-off-by: Pierre-Loup A. Griffais Signed-off-by: Gabriel Krisman Bertazi [Added compatibility code] Co-developed-by: André Almeida Signed-off-by: André Almeida Signed-off-by: Alexandre Frade commit 9253d194e4e5b1fce5d0d60a8765d6519254193c Author: Arjan van de Ven Date: Sun Feb 18 23:35:41 2018 +0000 locking: rwsem: spin faster tweak rwsem owner spinning a bit Signed-off-by: Alexandre Frade commit c0921cfd454ae00324bd62580c1f6274716cbdc1 Author: William Douglas Date: Wed Jun 20 17:23:21 2018 +0000 firmware: Enable stateless firmware loading Prefer the order of specific version before generic and /etc before /lib to enable the user to give specific overrides for generic firmware and distribution firmware. commit 2505ab248bfa5e55f3912aedf089d1fa0bd3aaff Author: Arjan van de Ven Date: Sun Sep 22 11:12:35 2019 -0300 intel_rapl: Silence rapl trace debug commit 17fe52c670350db4a3c440c514e2e4c9172e22d6 Author: Piotr Gorski Date: Thu Jul 30 16:53:12 2020 -0800 init: add support for zstd compressed modules Signed-off-by: Piotr Gorski commit 31bc0a34ad4ac5332719958b2399593800234717 Author: Ben Hutchings Date: Mon Sep 7 02:51:53 2020 +0100 android: Export symbols needed by Android drivers We want to enable use of the Android ashmem and binder drivers to support Anbox, but they should not be built-in as that would waste resources and increase security attack surface on systems that don't need them. Export the currently un-exported symbols they depend on. commit 5bffe872a2b4b8a84090fa3dd76058ea789334f3 Author: Ben Hutchings Date: Fri Jun 22 17:27:00 2018 +0100 android: Enable building ashmem and binder as modules We want to enable use of the Android ashmem and binder drivers to support Anbox, but they should not be built-in as that would waste resources and increase security attack surface on systems that don't need them. - Add a MODULE_LICENSE declaration to ashmem - Change the Makefiles to build each driver as an object with the "_linux" suffix (which is what Anbox expects) - Change config symbol types to tristate commit 6c1f4fbc91c8962953cd16ae98066ccb9609736d Author: Serge Hallyn Date: Fri May 31 19:12:12 2013 +0100 sysctl: add sysctl to disallow unprivileged CLONE_NEWUSER by default add sysctl to disallow unprivileged CLONE_NEWUSER by default This is a short-term patch. Unprivileged use of CLONE_NEWUSER is certainly an intended feature of user namespaces. However for at least saucy we want to make sure that, if any security issues are found, we have a fail-safe. Signed-off-by: Serge Hallyn [bwh: Remove unneeded binary sysctl bits] [bwh: Keep this sysctl, but change the default to enabled] commit 9971c77c59b305ad7b8f2d872c28102607b4f96b Author: graysky Date: Mon Dec 21 06:51:36 2020 -0500 x86/kconfig: Enable additional cpu optimizations for gcc v11.0+ kernel v5.8+ WARNING This patch works with gcc versions 11.0+ and with kernel version 5.8+ and should NOT be applied when compiling on older versions of gcc due to key name changes of the march flags introduced with the version 4.9 release of gcc.[1] Use the older version of this patch hosted on the same github for older versions of gcc. FEATURES This patch adds additional CPU options to the Linux kernel accessible under: Processor type and features ---> Processor family ---> The expanded microarchitectures include: * AMD Improved K8-family * AMD K10-family * AMD Family 10h (Barcelona) * AMD Family 14h (Bobcat) * AMD Family 16h (Jaguar) * AMD Family 15h (Bulldozer) * AMD Family 15h (Piledriver) * AMD Family 15h (Steamroller) * AMD Family 15h (Excavator) * AMD Family 17h (Zen) * AMD Family 17h (Zen 2) * AMD Family 17h (Zen 3) * Intel Silvermont low-power processors * Intel Goldmont low-power processors (Apollo Lake and Denverton) * Intel Goldmont Plus low-power processors (Gemini Lake) * Intel 1st Gen Core i3/i5/i7 (Nehalem) * Intel 1.5 Gen Core i3/i5/i7 (Westmere) * Intel 2nd Gen Core i3/i5/i7 (Sandybridge) * Intel 3rd Gen Core i3/i5/i7 (Ivybridge) * Intel 4th Gen Core i3/i5/i7 (Haswell) * Intel 5th Gen Core i3/i5/i7 (Broadwell) * Intel 6th Gen Core i3/i5/i7 (Skylake) * Intel 6th Gen Core i7/i9 (Skylake X) * Intel 8th Gen Core i3/i5/i7 (Cannon Lake) * Intel 10th Gen Core i7/i9 (Ice Lake) * Intel Xeon (Cascade Lake) * Intel Xeon (Cooper Lake) * Intel 3rd Gen 10nm++ i3/i5/i7/i9-family (Tiger Lake) It also offers to compile passing the 'native' option which, "selects the CPU to generate code for at compilation time by determining the processor type of the compiling machine. Using -march=native enables all instruction subsets supported by the local machine and will produce code optimized for the local machine under the constraints of the selected instruction set."[2] Do NOT try using the 'native' option on AMD Piledriver, Steamroller, or Excavator CPUs (-march=bdver{2,3,4} flag). The build will error out due the kernel's objtool issue with these.[3a,b] MINOR NOTES This patch also changes 'atom' to 'bonnell' in accordance with the gcc v4.9 changes. Note that upstream is using the deprecated 'match=atom' flags when I believe it should use the newer 'march=bonnell' flag for atom processors.[4] It is not recommended to compile on Atom-CPUs with the 'native' option.[5] The recommendation is to use the 'atom' option instead. BENEFITS Small but real speed increases are measurable using a make endpoint comparing a generic kernel to one built with one of the respective microarchs. See the following experimental evidence supporting this statement: https://github.com/graysky2/kernel_gcc_patch REQUIREMENTS linux version >=5.8 gcc version >=11.0 ACKNOWLEDGMENTS This patch builds on the seminal work by Jeroen.[6] REFERENCES 1. https://gcc.gnu.org/gcc-4.9/changes.html 2. https://gcc.gnu.org/onlinedocs/gcc/x86-Options.html 3a. https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95671#c11 3b. https://github.com/graysky2/kernel_gcc_patch/issues/55 4. https://bugzilla.kernel.org/show_bug.cgi?id=77461 5. https://github.com/graysky2/kernel_gcc_patch/issues/15 6. http://www.linuxforge.net/docs/linux/linux-gcc.php commit 4ce6b243755866c75885d19b5af4b14d5c516da5 Author: Alexandre Frade Date: Thu Sep 3 20:36:13 2020 +0000 init/Kconfig: Enable -O3 KBUILD_CFLAGS optimization for all architectures Signed-off-by: Alexandre Frade commit 66c24af1104ac3a8b4d1189cfe23ed1859af3178 Author: Alexandre Frade Date: Thu Jun 25 16:40:43 2020 -0300 lib/kconfig.debug: disable default CONFIG_SYMBOLIC_ERRNAME and CONFIG_DEBUG_BUGVERBOSE Signed-off-by: Alexandre Frade commit 5579778e0d212d95cbe1f9e22ea80c195ba3cac9 Author: Alexandre Frade Date: Mon Jan 29 17:41:29 2018 +0000 scripts: disable the localversion "+" tag of a git repo Signed-off-by: Alexandre Frade commit fac98c7151aa887ef4cf1006b49c40aa12debebf Author: Alexandre Frade Date: Tue Mar 31 13:32:08 2020 -0300 cpufreq: tunes ondemand and conservative governor for performance Signed-off-by: Alexandre Frade commit 9e54b21a677654df077800fe9843cceab1e5ddb0 Author: Alexandre Frade Date: Mon Jan 29 17:31:25 2018 +0000 mm/vmscan: vm_swappiness = 30 decreases the amount of swapping Signed-off-by: Alexandre Frade commit a9487c047f0835c7f64ca3c66b54e38155643cc5 Author: Alexandre Frade Date: Thu Aug 13 14:57:06 2020 +0000 sched/autogroup: Add kernel parameter and config option to enable/disable autogroup feature by default Signed-off-by: Alexandre Frade commit 719a8927b6165d1eae1406fc56ed951eb605be02 Author: Alexandre Frade Date: Mon Jan 29 16:59:22 2018 +0000 dcache: cache_pressure = 50 decreases the rate at which VFS caches are reclaimed Signed-off-by: Alexandre Frade commit 9e669018a5bb61e5f948001ade696ed7493f3cd4 Author: Alexandre Frade Date: Sun Oct 13 03:10:39 2019 -0300 kconfig: set PREEMPT and RCU_BOOST without delay by default Signed-off-by: Alexandre Frade commit 3ef0da48ff064a194406805701d8cdaa136164e3 Author: Alexandre Frade Date: Mon Jan 29 17:26:15 2018 +0000 kconfig: add 500Hz timer interrupt kernel config option Signed-off-by: Alexandre Frade commit 1a270fff183c22a440555ef243e51c79c713e70d Author: Alexandre Frade Date: Mon Jan 29 18:29:13 2018 +0000 sched/core: nr_migrate = 256 increases number of tasks to iterate in a single balance run. Signed-off-by: Alexandre Frade commit de80a83214fc85f50e3ee39883f28cf6c7669685 Author: Alexandre Frade Date: Mon Dec 14 16:24:26 2020 +0000 block: set rq_affinity to force full multithreading I/O requests Signed-off-by: Alexandre Frade commit 04c4db0c99bdc84bab75c0ae608c176376821918 Author: Alexandre Frade Date: Mon Jun 1 18:23:51 2020 -0300 block, bfq: change BLK_DEV_ZONED depends to IOSCHED_BFQ Signed-off-by: Alexandre Frade commit 3785e9822c8bb21b926c3fdf1054d5a7d29904b0 Author: Alexandre Frade Date: Mon Nov 25 15:13:06 2019 -0300 elevator: set default scheduler to bfq for blk-mq Signed-off-by: Alexandre Frade commit f40ddce88593482919761f74910f42f4b84c004b Author: Linus Torvalds Date: Sun Feb 14 14:32:24 2021 -0800 Linux 5.11