commit e32a4776fae148ac07078d52e97b16ba8ae63cc3 Author: Alexandre Frade Date: Thu Dec 26 08:19:04 2019 -0300 5.4.5-rt3-xanmod Signed-off-by: Alexandre Frade commit f29eab5062ea59a189a478698ae3eac3c405bd87 Author: Alexandre Frade Date: Thu Dec 26 08:18:25 2019 -0300 rcu: Fix exports that make rcu_read_lock() and rcu_read_unlock() EXPORT_SYMBOL_GPL Signed-off-by: Alexandre Frade commit 81f05aa273de57b2804130406d3e0b8ffba6431a Author: Tony Hutter Date: Wed Dec 18 11:31:49 2019 -0300 fs: Introduce the ZFS filesystem v0.8.2 for Linux kernel v5.4 Signed-off-by: Tony Hutter Signed-off-by: Alexandre Frade commit d7d3134d44f60dcc72a92a196e12ca02c4849193 Author: Alexandre Frade Date: Sun Oct 13 03:10:39 2019 -0300 kconfig: set PREEMPT_RT and RCU_BOOST without delay by default Signed-off-by: Alexandre Frade commit 28f32f59d9d55ac7ec3a20b79bdd02d2a0a5f7e1 Author: Alexandre Frade Date: Mon Jan 29 18:29:13 2018 +0000 sched/core: nr_migrate = 128 increases number of tasks to iterate in a single balance run. Signed-off-by: Alexandre Frade commit 2cc3aba868988b1dc3c5af764158913028a407ab Author: Alexandre Frade Date: Thu Dec 26 05:31:02 2019 -0300 Revert "sched: Limit the number of task migrations per batch" This reverts commit 9ee3003aa3700b4f9c8941de6344050bd2e00d94. commit be72ec79c1a272dd23bb29883695b57ed8c9b4ba Merge: 76c4b93df628 1fbcaa9506f9 Author: Alexandre Frade Date: Wed Dec 25 23:08:06 2019 -0300 Merge tag 'v5.4.5-rt3' into 5.4 v5.4.5-rt3 commit 76c4b93df628713e1f4c2ab6773eaa3746aceb45 Author: Alexandre Frade Date: Wed Dec 25 23:06:41 2019 -0300 Revert "kconfig: set PREEMPT and RCU_BOOST without delay by default" This reverts commit bfcca779b4ea80997c2be518b49db90ae77a4268. commit c42fd14c1d770a0a4e7bb80eab2ab681cd7fbae8 Author: Alexandre Frade Date: Wed Dec 25 23:06:31 2019 -0300 Revert "sched/core: nr_migrate = 128 increases number of tasks to iterate in a single balance run." This reverts commit edb07e3fce7676c87eb4838a8b98e26a4aaca225. commit 8c7b98a3f6860acdb1bd5aeb49fe8b9935ca2901 Author: Alexandre Frade Date: Wed Dec 25 23:06:16 2019 -0300 sched: Remove BitMap Queue CPU scheduler patchset commit 1fbcaa9506f953b1f054c0d1ae79776fb77887b3 Author: Sebastian Andrzej Siewior Date: Fri Dec 20 16:31:01 2019 +0100 v5.4.5-rt3 Signed-off-by: Sebastian Andrzej Siewior commit 65a387a0b45cdd6844b7c6269e6333c9f0113410 Author: Sebastian Andrzej Siewior Date: Fri Dec 20 16:25:33 2019 +0100 kmemleak: Cosmetic changes Align with the patch that got sent upstream for review. Only cosmetic changes. Signed-off-by: Sebastian Andrzej Siewior commit df5e2d5245a6f4b63d61cf332ddd629e412f01c9 Author: Sebastian Andrzej Siewior Date: Fri Dec 20 15:23:21 2019 +0100 Revert "arm*: disable NEON in kernel mode" The NEON code was disabled due to preempt_disable() / local_bh_disable() assumptions and possible memory allocations during a "cipher_walk" in the atomic sections. This has been reworked in the meantime and the atomic section is now only around the encryption code. I haven't seen a failure/warning while testing with the tcrypt module. Allow NEON in kernel mode again. Signed-off-by: Sebastian Andrzej Siewior commit 1f61d338e02809d12a82c315324d28f1fda43ea8 Author: Sebastian Andrzej Siewior Date: Wed Dec 18 18:37:31 2019 +0100 Revert "cpumask: Disable CONFIG_CPUMASK_OFFSTACK for RT" The one x86 case we had was fixed in commit 832df3d47badc ("x86/smp: Enhance native_send_call_func_ipi()"). I didn't find another in-IRQ user. Most callers use GFP_KERNEL and the ATOMIC users are allocating the mask while holding a spinlock_t.
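For context, a minimal illustrative sketch (not part of the revert itself) of the offstack cpumask pattern the message refers to: with CONFIG_CPUMASK_OFFSTACK=y a cpumask_var_t is allocated from the heap instead of living on the stack, so the GFP flags passed at allocation time decide whether the caller may sleep. The helper name below is hypothetical; the cpumask API calls are the standard kernel ones.

    #include <linux/cpumask.h>
    #include <linux/errno.h>
    #include <linux/gfp.h>

    /* Hypothetical helper: build a scratch mask of the online CPUs. */
    static int example_build_scratch_mask(cpumask_var_t *mask)
    {
            /*
             * With CONFIG_CPUMASK_OFFSTACK=y this allocates the mask from
             * the heap; GFP_KERNEL means the caller is allowed to sleep,
             * which matches "most callers use GFP_KERNEL" above.
             */
            if (!zalloc_cpumask_var(mask, GFP_KERNEL))
                    return -ENOMEM;

            cpumask_copy(*mask, cpu_online_mask);
            return 0;       /* caller releases it with free_cpumask_var() */
    }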
Allow the use of CPUMASK_OFFSTACK because it is no longer a problem on RT. Signed-off-by: Sebastian Andrzej Siewior commit dc952a564d02997330654be9628bbe97ba2a05d3 Author: Sebastian Andrzej Siewior Date: Wed Dec 18 12:25:09 2019 +0100 userfaultfd: Use a seqlock instead of seqcount On RT, write_seqcount_begin() disables preemption, which leads to a warning in add_wait_queue() while the spinlock_t is acquired. The waitqueue can't be converted to swait_queue because userfaultfd_wake_function() is used as a custom wake function. Use a seqlock instead of a seqcount to avoid the preempt_disable() section during add_wait_queue(). Cc: stable-rt@vger.kernel.org Signed-off-by: Sebastian Andrzej Siewior commit 66d64882cae92c30dd3019e867b28cfdce6205c4 Author: Sebastian Andrzej Siewior Date: Fri Dec 20 10:16:50 2019 +0100 v5.4.5-rt2 Signed-off-by: Sebastian Andrzej Siewior commit 4da8a664b0351a69579e5e5f073e2ef7e2f875cf Merge: 925cbfe727ed 9a088971000c Author: Sebastian Andrzej Siewior Date: Fri Dec 20 10:16:20 2019 +0100 Merge tag 'v5.4.5' into linux-5.4.y-rt This is the 5.4.5 stable release Signed-off-by: Sebastian Andrzej Siewior commit 9a088971000c4e7a4abddf9751649ead4d8a0fe0 Author: Greg Kroah-Hartman Date: Wed Dec 18 16:09:17 2019 +0100 Linux 5.4.5 commit 68159412b26e141ada8c41d30cb08871690e3126 Author: Heiner Kallweit Date: Fri Dec 6 23:27:15 2019 +0100 r8169: add missing RX enabling for WoL on RTL8125 [ Upstream commit 00222d1394104f0fd6c01ca9f578afec9e0f148b ] RTL8125 also requires RX to be enabled for WoL. v2: add missing Fixes tag Fixes: f1bce4ad2f1c ("r8169: add support for RTL8125") Signed-off-by: Heiner Kallweit Signed-off-by: David S. Miller Signed-off-by: Greg Kroah-Hartman commit 157560f95d4cb5e3d15a91e489d0acbf399fabf1 Author: Vladimir Oltean Date: Tue Dec 3 17:45:35 2019 +0200 net: mscc: ocelot: unregister the PTP clock on deinit [ Upstream commit 9385973fe8db9743fa93bf17245635be4eb8c4a6 ] Currently a switch driver deinit frees the regmaps, but the PTP clock is still out there, available to user space via /dev/ptpN.
Any PTP operation is a ticking time bomb, since it will attempt to use the freed regmaps and thus trigger kernel panics: [ 4.291746] fsl_enetc 0000:00:00.2 eth1: error -22 setting up slave phy [ 4.291871] mscc_felix 0000:00:00.5: Failed to register DSA switch: -22 [ 4.308666] mscc_felix: probe of 0000:00:00.5 failed with error -22 [ 6.358270] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000088 [ 6.367090] Mem abort info: [ 6.369888] ESR = 0x96000046 [ 6.369891] EC = 0x25: DABT (current EL), IL = 32 bits [ 6.369892] SET = 0, FnV = 0 [ 6.369894] EA = 0, S1PTW = 0 [ 6.369895] Data abort info: [ 6.369897] ISV = 0, ISS = 0x00000046 [ 6.369899] CM = 0, WnR = 1 [ 6.369902] user pgtable: 4k pages, 48-bit VAs, pgdp=00000020d58c7000 [ 6.369904] [0000000000000088] pgd=00000020d5912003, pud=00000020d5915003, pmd=0000000000000000 [ 6.369914] Internal error: Oops: 96000046 [#1] PREEMPT SMP [ 6.420443] Modules linked in: [ 6.423506] CPU: 1 PID: 262 Comm: phc_ctl Not tainted 5.4.0-03625-gb7b2a5dadd7f #204 [ 6.431273] Hardware name: LS1028A RDB Board (DT) [ 6.435989] pstate: 40000085 (nZcv daIf -PAN -UAO) [ 6.440802] pc : css_release+0x24/0x58 [ 6.444561] lr : regmap_read+0x40/0x78 [ 6.448316] sp : ffff800010513cc0 [ 6.451636] x29: ffff800010513cc0 x28: ffff002055873040 [ 6.456963] x27: 0000000000000000 x26: 0000000000000000 [ 6.462289] x25: 0000000000000000 x24: 0000000000000000 [ 6.467617] x23: 0000000000000000 x22: 0000000000000080 [ 6.472944] x21: ffff800010513d44 x20: 0000000000000080 [ 6.478270] x19: 0000000000000000 x18: 0000000000000000 [ 6.483596] x17: 0000000000000000 x16: 0000000000000000 [ 6.488921] x15: 0000000000000000 x14: 0000000000000000 [ 6.494247] x13: 0000000000000000 x12: 0000000000000000 [ 6.499573] x11: 0000000000000000 x10: 0000000000000000 [ 6.504899] x9 : 0000000000000000 x8 : 0000000000000000 [ 6.510225] x7 : 0000000000000000 x6 : ffff800010513cf0 [ 6.515550] x5 : 0000000000000000 x4 : 0000000fffffffe0 [ 6.520876] x3 : 0000000000000088 x2 : ffff800010513d44 [ 6.526202] x1 : ffffcada668ea000 x0 : ffffcada64d8b0c0 [ 6.531528] Call trace: [ 6.533977] css_release+0x24/0x58 [ 6.537385] regmap_read+0x40/0x78 [ 6.540795] __ocelot_read_ix+0x6c/0xa0 [ 6.544641] ocelot_ptp_gettime64+0x4c/0x110 [ 6.548921] ptp_clock_gettime+0x4c/0x58 [ 6.552853] pc_clock_gettime+0x5c/0xa8 [ 6.556699] __arm64_sys_clock_gettime+0x68/0xc8 [ 6.561331] el0_svc_common.constprop.2+0x7c/0x178 [ 6.566133] el0_svc_handler+0x34/0xa0 [ 6.569891] el0_sync_handler+0x114/0x1d0 [ 6.573908] el0_sync+0x140/0x180 [ 6.577232] Code: d503201f b00119a1 91022263 b27b7be4 (f9004663) [ 6.583349] ---[ end trace d196b9b14cdae2da ]--- [ 6.587977] Kernel panic - not syncing: Fatal exception [ 6.593216] SMP: stopping secondary CPUs [ 6.597151] Kernel Offset: 0x4ada54400000 from 0xffff800010000000 [ 6.603261] PHYS_OFFSET: 0xffffd0a7c0000000 [ 6.607454] CPU features: 0x10002,21806008 [ 6.611558] Memory Limit: none And now that ocelot->ptp_clock is checked at exit, prevent a potential error where ptp_clock_register returned a pointer-encoded error, which we are keeping in the ocelot private data structure. So now, ocelot->ptp_clock is now either NULL or a valid pointer. Fixes: 4e3b0468e6d7 ("net: mscc: PTP Hardware Clock (PHC) support") Cc: Antoine Tenart Reviewed-by: Florian Fainelli Signed-off-by: Vladimir Oltean Signed-off-by: David S. 
Miller Signed-off-by: Greg Kroah-Hartman commit dd561233e068044b7b2203fd55cb1337459ff1f8 Author: Shannon Nelson Date: Tue Dec 3 14:17:34 2019 -0800 ionic: keep users rss hash across lif reset [ Upstream commit ffac2027e18f006f42630f2e01a8a9bd8dc664b5 ] If the user has specified their own RSS hash key, don't lose it across queue resets such as DOWN/UP, MTU change, and number of channels change. This is fixed by moving the key initialization to a little earlier in the lif creation. Also, let's clean up the RSS config a little better on the way down by setting it all to 0. Fixes: aa3198819bea ("ionic: Add RSS support") Signed-off-by: Shannon Nelson Signed-off-by: David S. Miller Signed-off-by: Greg Kroah-Hartman commit 9bd01a33c780a4f10a2e36ecf42dcab9f6b01aa5 Author: Jonathan Lemon Date: Tue Dec 3 14:01:14 2019 -0800 xdp: obtain the mem_id mutex before trying to remove an entry. [ Upstream commit 86c76c09898332143be365c702cf8d586ed4ed21 ] A lockdep splat was observed when trying to remove an xdp memory model from the table since the mutex was obtained when trying to remove the entry, but not before the table walk started: Fix the splat by obtaining the lock before starting the table walk. Fixes: c3f812cea0d7 ("page_pool: do not release pool until inflight == 0.") Reported-by: Grygorii Strashko Signed-off-by: Jonathan Lemon Tested-by: Grygorii Strashko Acked-by: Jesper Dangaard Brouer Acked-by: Ilias Apalodimas Signed-off-by: David S. Miller Signed-off-by: Greg Kroah-Hartman commit 05f646cb2174d1a4e032b60b99097f5c4b522616 Author: Jonathan Lemon Date: Thu Nov 14 14:13:00 2019 -0800 page_pool: do not release pool until inflight == 0. [ Upstream commit c3f812cea0d7006469d1cf33a4a9f0a12bb4b3a3 ] The page pool keeps track of the number of pages in flight, and it isn't safe to remove the pool until all pages are returned. Disallow removing the pool until all pages are back, so the pool is always available for page producers. Make the page pool responsible for its own delayed destruction instead of relying on XDP, so the page pool can be used without the xdp memory model. When all pages are returned, free the pool and notify xdp if the pool is registered with the xdp memory system. Have the callback perform a table walk since some drivers (cpsw) may share the pool among multiple xdp_rxq_info. Note that the increment of pages_state_release_cnt may result in inflight == 0, resulting in the pool being released. Fixes: d956a048cd3f ("xdp: force mem allocator removal and periodic warning") Signed-off-by: Jonathan Lemon Acked-by: Jesper Dangaard Brouer Acked-by: Ilias Apalodimas Signed-off-by: David S. Miller Signed-off-by: Greg Kroah-Hartman commit 6b2377de13af6821d2a196a1e4326369e1a4fabd Author: Aya Levin Date: Sun Dec 1 16:33:55 2019 +0200 net/mlx5e: ethtool, Fix analysis of speed setting [ Upstream commit 3d7cadae51f1b7f28358e36d0a1ce3f0ae2eee60 ] When setting speed to 100G via ethtool (AN is set to off), only 25G*4 is configured while the user, who has an advanced HW which supports extended PTYS, expects also 50G*2 to be configured. With this patch, when extended PTYS mode is available, configure PTYS via extended fields. 
Fixes: 4b95840a6ced ("net/mlx5e: Fix matching of speed to PRM link modes") Signed-off-by: Aya Levin Reviewed-by: Eran Ben Elisha Signed-off-by: Saeed Mahameed Signed-off-by: Greg Kroah-Hartman commit dd54484500ec8018bd7c27caf7b8458d6328c255 Author: Aya Levin Date: Sun Dec 1 14:45:25 2019 +0200 net/mlx5e: Fix translation of link mode into speed [ Upstream commit 6d485e5e555436d2c13accdb10807328c4158a17 ] Add a missing value in translation of PTYS ext_eth_proto_oper to its corresponding speed. When ext_eth_proto_oper bit 10 is set, ethtool shows unknown speed. With this fix, ethtool shows speed is 100G as expected. Fixes: a08b4ed1373d ("net/mlx5: Add support to ext_* fields introduced in Port Type and Speed register") Signed-off-by: Aya Levin Reviewed-by: Eran Ben Elisha Signed-off-by: Saeed Mahameed Signed-off-by: Greg Kroah-Hartman commit 65523f0fe7b885da9454e97ca97996d2b3392be5 Author: Roi Dayan Date: Wed Dec 4 11:25:43 2019 +0200 net/mlx5e: Fix freeing flow with kfree() and not kvfree() [ Upstream commit a23dae79fb6555c808528707c6389345d0b0c189 ] Flows are allocated with kzalloc() so free with kfree(). Fixes: 04de7dda7394 ("net/mlx5e: Infrastructure for duplicated offloading of TC flows") Signed-off-by: Roi Dayan Reviewed-by: Eli Britstein Signed-off-by: Saeed Mahameed Signed-off-by: Greg Kroah-Hartman commit 2e4e7670cba5bf6c900e832c13f597db262f5ed5 Author: Eran Ben Elisha Date: Thu Dec 5 10:30:22 2019 +0200 net/mlx5e: Fix SFF 8472 eeprom length [ Upstream commit c431f8597863a91eea6024926e0c1b179cfa4852 ] SFF 8472 eeprom length is 512 bytes. Fix module info return value to support 512 bytes read. Fixes: ace329f4ab3b ("net/mlx5e: ethtool, Remove unsupported SFP EEPROM high pages query") Signed-off-by: Eran Ben Elisha Reviewed-by: Aya Levin Signed-off-by: Saeed Mahameed Signed-off-by: Greg Kroah-Hartman commit 4e57c233915e898678e5654d9b91ab24600fe7e5 Author: Aaron Conole Date: Tue Dec 3 16:34:14 2019 -0500 act_ct: support asymmetric conntrack [ Upstream commit 95219afbb980f10934de9f23a3e199be69c5ed09 ] The act_ct TC module shares a common conntrack and NAT infrastructure exposed via netfilter. It's possible that a packet needs both SNAT and DNAT manipulation, due to e.g. tuple collision. Netfilter can support this because it runs through the NAT table twice - once on ingress and again after egress. The act_ct action doesn't have such capability. Like netfilter hook infrastructure, we should run through NAT twice to keep the symmetry. Fixes: b57dc7c13ea9 ("net/sched: Introduce action ct") Signed-off-by: Aaron Conole Acked-by: Marcelo Ricardo Leitner Signed-off-by: David S. Miller Signed-off-by: Greg Kroah-Hartman commit 411fdb975269ac2d1d746535eadb25dea3813189 Author: Eran Ben Elisha Date: Mon Nov 25 12:11:49 2019 +0200 net/mlx5e: Fix TXQ indices to be sequential [ Upstream commit c55d8b108caa2ec1ae8dddd02cb9d3a740f7c838 ] Cited patch changed (channel index, tc) => (TXQ index) mapping to be a static one, in order to keep indices consistent when changing number of channels or TCs. For 32 channels (OOB) and 8 TCs, real num of TXQs is 256. When reducing the amount of channels to 8, the real num of TXQs will be changed to 64. This indices method is buggy: - Channel #0, TC 3, the TXQ index is 96. - Index 8 is not valid, as there is no such TXQ from driver perspective (As it represents channel #8, TC 0, which is not valid with the above configuration). As part of driver's select queue, it calls netdev_pick_tx which returns an index in the range of real number of TXQs. 
Depending on the return value, with the examples above, the driver could have returned an index larger than the real number of tx queues, or crashed the kernel by reading the invalid address of an SQ which was not allocated. Fix that by allocating sequential TXQ indices, and holding a new mapping between (channel index, tc) => (real TXQ index). This mapping will be updated as part of priv channels activation, and is used in mlx5e_select_queue to find the selected queue index. The existing indices mapping (channel_tc2txq) is no longer needed, as it is used only for statistics structures and can be calculated at run time. Delete its definition and updates. Fixes: 8bfaf07f7806 ("net/mlx5e: Present SW stats when state is not opened") Signed-off-by: Eran Ben Elisha Signed-off-by: Saeed Mahameed Signed-off-by: Greg Kroah-Hartman commit cd477d06d22d8b6d058962043060785c76819446 Author: Martin Varghese Date: Thu Dec 5 05:57:22 2019 +0530 net: Fixed updating of ethertype in skb_mpls_push() [ Upstream commit d04ac224b1688f005a84f764cfe29844f8e9da08 ] The skb_mpls_push function was not updating the ethertype of an ethernet packet if the packet was originally received from a non ARPHRD_ETHER device. In the below OVS data path flow, since the device corresponding to port 7 is an l3 device (ARPHRD_NONE), the skb_mpls_push function does not update the ethertype of the packet even though the previous push_eth action had added an ethernet header to the packet. recirc_id(0),in_port(7),eth_type(0x0800),ipv4(tos=0/0xfc,ttl=64,frag=no), actions:push_eth(src=00:00:00:00:00:00,dst=00:00:00:00:00:00), push_mpls(label=13,tc=0,ttl=64,bos=1,eth_type=0x8847),4 Fixes: 8822e270d697 ("net: core: move push MPLS functionality from OvS to core helper") Signed-off-by: Martin Varghese Signed-off-by: David S. Miller Signed-off-by: Greg Kroah-Hartman commit 10fec3e5660b40839c0109e45569909ec5e33916 Author: Taehee Yoo Date: Thu Dec 5 07:23:39 2019 +0000 hsr: fix a NULL pointer dereference in hsr_dev_xmit() [ Upstream commit df95467b6d2bfce49667ee4b71c67249b01957f7 ] hsr_dev_xmit() calls hsr_port_get_hsr() to find the master node, and that would return NULL if the master node does not exist in the list. But hsr_dev_xmit() doesn't check the returned pointer, so a NULL dereference could occur.
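For illustration, a sketch of the kind of NULL check such a fix calls for, based on the description above; the actual upstream patch may differ in details such as statistics accounting.

    static netdev_tx_t hsr_dev_xmit(struct sk_buff *skb, struct net_device *dev)
    {
            struct hsr_priv *hsr = netdev_priv(dev);
            struct hsr_port *master;

            master = hsr_port_get_hsr(hsr, HSR_PT_MASTER);
            if (master) {
                    skb->dev = master->dev;
                    hsr_forward_skb(skb, master);
            } else {
                    /* Master port is already gone (e.g. module removal in
                     * progress): drop the skb instead of dereferencing NULL. */
                    atomic_long_inc(&dev->tx_dropped);
                    dev_kfree_skb_any(skb);
            }
            return NETDEV_TX_OK;
    }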
Test commands: ip netns add nst ip link add veth0 type veth peer name veth1 ip link add veth2 type veth peer name veth3 ip link set veth1 netns nst ip link set veth3 netns nst ip link set veth0 up ip link set veth2 up ip link add hsr0 type hsr slave1 veth0 slave2 veth2 ip a a 192.168.100.1/24 dev hsr0 ip link set hsr0 up ip netns exec nst ip link set veth1 up ip netns exec nst ip link set veth3 up ip netns exec nst ip link add hsr1 type hsr slave1 veth1 slave2 veth3 ip netns exec nst ip a a 192.168.100.2/24 dev hsr1 ip netns exec nst ip link set hsr1 up hping3 192.168.100.2 -2 --flood & modprobe -rv hsr Splat looks like: [ 217.351122][ T1635] kasan: CONFIG_KASAN_INLINE enabled [ 217.352969][ T1635] kasan: GPF could be caused by NULL-ptr deref or user memory access [ 217.354297][ T1635] general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN PTI [ 217.355507][ T1635] CPU: 1 PID: 1635 Comm: hping3 Not tainted 5.4.0+ #192 [ 217.356472][ T1635] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006 [ 217.357804][ T1635] RIP: 0010:hsr_dev_xmit+0x34/0x90 [hsr] [ 217.373010][ T1635] Code: 48 8d be 00 0c 00 00 be 04 00 00 00 48 83 ec 08 e8 21 be ff ff 48 8d 78 10 48 ba 00 b [ 217.376919][ T1635] RSP: 0018:ffff8880cd8af058 EFLAGS: 00010202 [ 217.377571][ T1635] RAX: 0000000000000000 RBX: ffff8880acde6840 RCX: 0000000000000002 [ 217.379465][ T1635] RDX: dffffc0000000000 RSI: 0000000000000004 RDI: 0000000000000010 [ 217.380274][ T1635] RBP: ffff8880acde6840 R08: ffffed101b440d5d R09: 0000000000000001 [ 217.381078][ T1635] R10: 0000000000000001 R11: ffffed101b440d5c R12: ffff8880bffcc000 [ 217.382023][ T1635] R13: ffff8880bffcc088 R14: 0000000000000000 R15: ffff8880ca675c00 [ 217.383094][ T1635] FS: 00007f060d9d1740(0000) GS:ffff8880da000000(0000) knlGS:0000000000000000 [ 217.384289][ T1635] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 217.385009][ T1635] CR2: 00007faf15381dd0 CR3: 00000000d523c001 CR4: 00000000000606e0 [ 217.385940][ T1635] Call Trace: [ 217.386544][ T1635] dev_hard_start_xmit+0x160/0x740 [ 217.387114][ T1635] __dev_queue_xmit+0x1961/0x2e10 [ 217.388118][ T1635] ? check_object+0xaf/0x260 [ 217.391466][ T1635] ? __alloc_skb+0xb9/0x500 [ 217.392017][ T1635] ? init_object+0x6b/0x80 [ 217.392629][ T1635] ? netdev_core_pick_tx+0x2e0/0x2e0 [ 217.393175][ T1635] ? __alloc_skb+0xb9/0x500 [ 217.393727][ T1635] ? rcu_read_lock_sched_held+0x90/0xc0 [ 217.394331][ T1635] ? rcu_read_lock_bh_held+0xa0/0xa0 [ 217.395013][ T1635] ? kasan_unpoison_shadow+0x30/0x40 [ 217.395668][ T1635] ? __kasan_kmalloc.constprop.4+0xa0/0xd0 [ 217.396280][ T1635] ? __kmalloc_node_track_caller+0x3a8/0x3f0 [ 217.399007][ T1635] ? __kasan_kmalloc.constprop.4+0xa0/0xd0 [ 217.400093][ T1635] ? __kmalloc_reserve.isra.46+0x2e/0xb0 [ 217.401118][ T1635] ? memset+0x1f/0x40 [ 217.402529][ T1635] ? __alloc_skb+0x317/0x500 [ 217.404915][ T1635] ? arp_xmit+0xca/0x2c0 [ ... ] Fixes: 311633b60406 ("hsr: switch ->dellink() to ->ndo_uninit()") Acked-by: Cong Wang Signed-off-by: Taehee Yoo Signed-off-by: David S. Miller Signed-off-by: Greg Kroah-Hartman commit 2cbaf5fb573a5f150109c11d9e4a99006b885702 Author: Martin Varghese Date: Mon Dec 2 10:49:51 2019 +0530 Fixed updating of ethertype in function skb_mpls_pop [ Upstream commit 040b5cfbcefa263ccf2c118c4938308606bb7ed8 ] The skb_mpls_pop was not updating ethertype of an ethernet packet if the packet was originally received from a non ARPHRD_ETHER device. 
In the below OVS data path flow, since the device corresponding to port 7 is an l3 device (ARPHRD_NONE), the skb_mpls_pop function does not update the ethertype of the packet even though the previous push_eth action had added an ethernet header to the packet. recirc_id(0),in_port(7),eth_type(0x8847), mpls(label=12/0xfffff,tc=0/0,ttl=0/0x0,bos=1/1), actions:push_eth(src=00:00:00:00:00:00,dst=00:00:00:00:00:00), pop_mpls(eth_type=0x800),4 Fixes: ed246cee09b9 ("net: core: move pop MPLS functionality from OvS to core helper") Signed-off-by: Martin Varghese Acked-by: Pravin B Shelar Signed-off-by: David S. Miller Signed-off-by: Greg Kroah-Hartman commit 23fbdd5d1e826454a1ce199e716e2015033212c4 Author: Cong Wang Date: Thu Dec 5 19:39:02 2019 -0800 gre: refetch erspan header from skb->data after pskb_may_pull() [ Upstream commit 0e4940928c26527ce8f97237fef4c8a91cd34207 ] After pskb_may_pull() we should always refetch the header pointers from the skb->data in case it got reallocated. In gre_parse_header(), the erspan header is still fetched from the 'options' pointer which is fetched before pskb_may_pull(). Found this during code review of a KMSAN bug report. Fixes: cb73ee40b1b3 ("net: ip_gre: use erspan key field for tunnel lookup") Cc: Lorenzo Bianconi Signed-off-by: Cong Wang Acked-by: Lorenzo Bianconi Acked-by: William Tu Reviewed-by: Simon Horman Signed-off-by: David S. Miller Signed-off-by: Greg Kroah-Hartman commit 71bc12b1fb4afedf52d558a2cfb351f68831caeb Author: Yoshiki Komachi Date: Tue Dec 3 19:40:12 2019 +0900 cls_flower: Fix the behavior using port ranges with hw-offload [ Upstream commit 8ffb055beae58574d3e77b4bf9d4d15eace1ca27 ] The recent commit 5c72299fba9d ("net: sched: cls_flower: Classify packets using port ranges") had added filtering based on port ranges to tc flower. However, the commit missed the necessary changes in the hw-offload code, so the feature gave rise to incorrect offloaded flow keys in the NIC. A more detailed example is below: $ tc qdisc add dev eth0 ingress $ tc filter add dev eth0 ingress protocol ip flower ip_proto tcp \ dst_port 100-200 action drop With the setup above, an exact match filter with dst_port == 0 will be installed in the NIC by hw-offload. IOW, the NIC will have a rule which is equivalent to the following one. $ tc qdisc add dev eth0 ingress $ tc filter add dev eth0 ingress protocol ip flower ip_proto tcp \ dst_port 0 action drop The behavior was caused by the flow dissector which extracts packet data into the flow key in the tc flower. More specifically, regardless of exact match or specified port ranges, fl_init_dissector() set the FLOW_DISSECTOR_KEY_PORTS flag in struct flow_dissector to extract port numbers from skb in skb_flow_dissect() called by fl_classify(). Note that device drivers received the same struct flow_dissector object as used in skb_flow_dissect(). Thus, offloaded drivers could not identify which of these is used because the FLOW_DISSECTOR_KEY_PORTS flag was set in struct flow_dissector in either case. This patch adds the new FLOW_DISSECTOR_KEY_PORTS_RANGE flag and the new tp_range field in struct fl_flow_key to recognize which filters are applied to offloaded drivers. At this point, when filters based on port ranges are passed to drivers, the drivers return the EOPNOTSUPP error because they do not support the feature (the newly created FLOW_DISSECTOR_KEY_PORTS_RANGE flag). Fixes: 5c72299fba9d ("net: sched: cls_flower: Classify packets using port ranges") Signed-off-by: Yoshiki Komachi Signed-off-by: David S.
Miller Signed-off-by: Greg Kroah-Hartman commit 554d2e14c5e1dac1b15ebd0c461084f7b733cb03 Author: John Hurley Date: Thu Dec 5 17:03:35 2019 +0000 net: sched: allow indirect blocks to bind to clsact in TC [ Upstream commit 25a443f74bcff2c4d506a39eae62fc15ad7c618a ] When a device is bound to a clsact qdisc, bind events are triggered to registered drivers for both ingress and egress. However, if a driver registers to such a device using the indirect block routines, then it is assumed that it is only interested in ingress offload, and so only ingress bind/unbind messages are replayed. The NFP driver supports the offload of some egress filters when registering to a block with a qdisc of type clsact. However, on unregister, if the block is still active, it will not receive an unbind egress notification which can prevent proper cleanup of other registered callbacks. Modify the indirect block callback command in TC to send messages of ingress and/or egress bind depending on the qdisc in use. NFP currently supports egress offload for TC flower offload so the changes are only added to TC. Fixes: 4d12ba42787b ("nfp: flower: allow offloading of matches on 'internal' ports") Signed-off-by: John Hurley Acked-by: Jakub Kicinski Signed-off-by: David S. Miller Signed-off-by: Greg Kroah-Hartman commit 1b511a9d2c09bc7a0b0ea3d2b4538547b7615284 Author: John Hurley Date: Thu Dec 5 17:03:34 2019 +0000 net: core: rename indirect block ingress cb function [ Upstream commit dbad3408896c3c5722ec9cda065468b3df16c5bf ] With indirect blocks, a driver can register for callbacks from a device that it does not 'own', for example, a tunnel device. When registering to or unregistering from a new device, a callback is triggered to generate a bind/unbind event. This, in turn, allows the driver to receive any existing rules or to properly clean up installed rules. When first added, it was assumed that all indirect block registrations would be for ingress offloads. However, the NFP driver can, in some instances, support clsact qdisc binds for egress offload. Change the name of the indirect block callback command in flow_offload to remove the 'ingress' identifier from it. While this does not change functionality, a follow-up patch will implement a more generic callback than those currently supporting only ingress offload. Fixes: 4d12ba42787b ("nfp: flower: allow offloading of matches on 'internal' ports") Signed-off-by: John Hurley Acked-by: Jakub Kicinski Signed-off-by: David S. Miller Signed-off-by: Greg Kroah-Hartman commit ee0dc0c3f371197ff8dbaf4ce874bef2e33674ea Author: Guillaume Nault Date: Fri Dec 6 12:38:49 2019 +0100 tcp: Protect accesses to .ts_recent_stamp with {READ,WRITE}_ONCE() [ Upstream commit 721c8dafad26ccfa90ff659ee19755e3377b829d ] Syncookies borrow the ->rx_opt.ts_recent_stamp field to store the timestamp of the last synflood. Protect them with READ_ONCE() and WRITE_ONCE() since reads and writes aren't serialised. Use of .rx_opt.ts_recent_stamp for storing the synflood timestamp was introduced by a0f82f64e269 ("syncookies: remove last_synq_overflow from struct tcp_sock"). But unprotected accesses were already there when the timestamp was stored in .last_synq_overflow. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Guillaume Nault Signed-off-by: Eric Dumazet Signed-off-by: David S.
Miller Signed-off-by: Greg Kroah-Hartman commit e70ee16481f9030030b51349f2131116ac916859 Author: Guillaume Nault Date: Fri Dec 6 12:38:43 2019 +0100 tcp: tighten acceptance of ACKs not matching a child socket [ Upstream commit cb44a08f8647fd2e8db5cc9ac27cd8355fa392d8 ] When no synflood occurs, the synflood timestamp isn't updated. Therefore it can be so old that time_after32() can consider it to be in the future. That's a problem for tcp_synq_no_recent_overflow() as it may report that a recent overflow occurred while, in fact, it's just that jiffies has grown past 'last_overflow' + TCP_SYNCOOKIE_VALID + 2^31. Spurious detection of recent overflows lead to extra syncookie verification in cookie_v[46]_check(). At that point, the verification should fail and the packet dropped. But we should have dropped the packet earlier as we didn't even send a syncookie. Let's refine tcp_synq_no_recent_overflow() to report a recent overflow only if jiffies is within the [last_overflow, last_overflow + TCP_SYNCOOKIE_VALID] interval. This way, no spurious recent overflow is reported when jiffies wraps and 'last_overflow' becomes in the future from the point of view of time_after32(). However, if jiffies wraps and enters the [last_overflow, last_overflow + TCP_SYNCOOKIE_VALID] interval (with 'last_overflow' being a stale synflood timestamp), then tcp_synq_no_recent_overflow() still erroneously reports an overflow. In such cases, we have to rely on syncookie verification to drop the packet. We unfortunately have no way to differentiate between a fresh and a stale syncookie timestamp. In practice, using last_overflow as lower bound is problematic. If the synflood timestamp is concurrently updated between the time we read jiffies and the moment we store the timestamp in 'last_overflow', then 'now' becomes smaller than 'last_overflow' and tcp_synq_no_recent_overflow() returns true, potentially dropping a valid syncookie. Reading jiffies after loading the timestamp could fix the problem, but that'd require a memory barrier. Let's just accommodate for potential timestamp growth instead and extend the interval using 'last_overflow - HZ' as lower bound. Signed-off-by: Guillaume Nault Signed-off-by: Eric Dumazet Signed-off-by: David S. Miller Signed-off-by: Greg Kroah-Hartman commit 9afe690185bcdeae3989410a3684f02e0a1fc9e9 Author: Guillaume Nault Date: Fri Dec 6 12:38:36 2019 +0100 tcp: fix rejected syncookies due to stale timestamps [ Upstream commit 04d26e7b159a396372646a480f4caa166d1b6720 ] If no synflood happens for a long enough period of time, then the synflood timestamp isn't refreshed and jiffies can advance so much that time_after32() can't accurately compare them any more. Therefore, we can end up in a situation where time_after32(now, last_overflow + HZ) returns false, just because these two values are too far apart. In that case, the synflood timestamp isn't updated as it should be, which can trick tcp_synq_no_recent_overflow() into rejecting valid syncookies. For example, let's consider the following scenario on a system with HZ=1000: * The synflood timestamp is 0, either because that's the timestamp of the last synflood or, more commonly, because we're working with a freshly created socket. * We receive a new SYN, which triggers synflood protection. Let's say that this happens when jiffies == 2147484649 (that is, 'synflood timestamp' + HZ + 2^31 + 1). * Then tcp_synq_overflow() doesn't update the synflood timestamp, because time_after32(2147484649, 1000) returns false. 
With: - 2147484649: the value of jiffies, aka. 'now'. - 1000: the value of 'last_overflow' + HZ. * A bit later, we receive the ACK completing the 3WHS. But cookie_v[46]_check() rejects it because tcp_synq_no_recent_overflow() says that we're not under synflood. That's because time_after32(2147484649, 120000) returns false. With: - 2147484649: the value of jiffies, aka. 'now'. - 120000: the value of 'last_overflow' + TCP_SYNCOOKIE_VALID. Of course, in reality jiffies would have increased a bit, but this condition will last for the next 119 seconds, which is far enough to accommodate for jiffie's growth. Fix this by updating the overflow timestamp whenever jiffies isn't within the [last_overflow, last_overflow + HZ] range. That shouldn't have any performance impact since the update still happens at most once per second. Now we're guaranteed to have fresh timestamps while under synflood, so tcp_synq_no_recent_overflow() can safely use it with time_after32() in such situations. Stale timestamps can still make tcp_synq_no_recent_overflow() return the wrong verdict when not under synflood. This will be handled in the next patch. For 64 bits architectures, the problem was introduced with the conversion of ->tw_ts_recent_stamp to 32 bits integer by commit cca9bab1b72c ("tcp: use monotonic timestamps for PAWS"). The problem has always been there on 32 bits architectures. Fixes: cca9bab1b72c ("tcp: use monotonic timestamps for PAWS") Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Guillaume Nault Signed-off-by: Eric Dumazet Signed-off-by: David S. Miller Signed-off-by: Greg Kroah-Hartman commit 48d58ae9e87aaa11814364ddb52b3461f9abac57 Author: Sabrina Dubroca Date: Wed Dec 4 15:35:53 2019 +0100 net: ipv6_stub: use ip6_dst_lookup_flow instead of ip6_dst_lookup [ Upstream commit 6c8991f41546c3c472503dff1ea9daaddf9331c2 ] ipv6_stub uses the ip6_dst_lookup function to allow other modules to perform IPv6 lookups. However, this function skips the XFRM layer entirely. All users of ipv6_stub->ip6_dst_lookup use ip_route_output_flow (via the ip_route_output_key and ip_route_output helpers) for their IPv4 lookups, which calls xfrm_lookup_route(). This patch fixes this inconsistent behavior by switching the stub to ip6_dst_lookup_flow, which also calls xfrm_lookup_route(). This requires some changes in all the callers, as these two functions take different arguments and have different return types. Fixes: 5f81bd2e5d80 ("ipv6: export a stub for IPv6 symbols used by vxlan") Reported-by: Xiumei Mu Signed-off-by: Sabrina Dubroca Signed-off-by: David S. Miller Signed-off-by: Greg Kroah-Hartman commit 8cadbd146a8712cffef5921559d24b00911ac4b7 Author: Sabrina Dubroca Date: Wed Dec 4 15:35:52 2019 +0100 net: ipv6: add net argument to ip6_dst_lookup_flow [ Upstream commit c4e85f73afb6384123e5ef1bba3315b2e3ad031e ] This will be used in the conversion of ipv6_stub to ip6_dst_lookup_flow, as some modules currently pass a net argument without a socket to ip6_dst_lookup. This is equivalent to commit 343d60aada5a ("ipv6: change ipv6_stub_impl.ipv6_dst_lookup to take net argument"). Signed-off-by: Sabrina Dubroca Signed-off-by: David S. 
Miller Signed-off-by: Greg Kroah-Hartman commit 9617d69d663de358957df86862984414e0bbc1cf Author: Huy Nguyen Date: Fri Sep 6 09:28:46 2019 -0500 net/mlx5e: Query global pause state before setting prio2buffer [ Upstream commit 73e6551699a32fac703ceea09214d6580edcf2d5 ] When the user changes prio2buffer mapping while global pause is enabled, mlx5 driver incorrectly sets all active buffers (buffer that has at least one priority mapped) to lossy. Solution: If global pause is enabled, set all the active buffers to lossless in prio2buffer command. Also, add error message when buffer size is not enough to meet xoff threshold. Fixes: 0696d60853d5 ("net/mlx5e: Receive buffer configuration") Signed-off-by: Huy Nguyen Signed-off-by: Saeed Mahameed Signed-off-by: Greg Kroah-Hartman commit 0703996ff4a1344e7bab4f35d933c1ee75d78a79 Author: Taehee Yoo Date: Fri Dec 6 05:25:48 2019 +0000 tipc: fix ordering of tipc module init and exit routine [ Upstream commit 9cf1cd8ee3ee09ef2859017df2058e2f53c5347f ] In order to set/get/dump, the tipc uses the generic netlink infrastructure. So, when tipc module is inserted, init function calls genl_register_family(). After genl_register_family(), set/get/dump commands are immediately allowed and these callbacks internally use the net_generic. net_generic is allocated by register_pernet_device() but this is called after genl_register_family() in the __init function. So, these callbacks would use un-initialized net_generic. Test commands: #SHELL1 while : do modprobe tipc modprobe -rv tipc done #SHELL2 while : do tipc link list done Splat looks like: [ 59.616322][ T2788] kasan: CONFIG_KASAN_INLINE enabled [ 59.617234][ T2788] kasan: GPF could be caused by NULL-ptr deref or user memory access [ 59.618398][ T2788] general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN PTI [ 59.619389][ T2788] CPU: 3 PID: 2788 Comm: tipc Not tainted 5.4.0+ #194 [ 59.620231][ T2788] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006 [ 59.621428][ T2788] RIP: 0010:tipc_bcast_get_broadcast_mode+0x131/0x310 [tipc] [ 59.622379][ T2788] Code: c7 c6 ef 8b 38 c0 65 ff 0d 84 83 c9 3f e8 d7 a5 f2 e3 48 8d bb 38 11 00 00 48 b8 00 00 00 00 [ 59.622550][ T2780] NET: Registered protocol family 30 [ 59.624627][ T2788] RSP: 0018:ffff88804b09f578 EFLAGS: 00010202 [ 59.624630][ T2788] RAX: dffffc0000000000 RBX: 0000000000000011 RCX: 000000008bc66907 [ 59.624631][ T2788] RDX: 0000000000000229 RSI: 000000004b3cf4cc RDI: 0000000000001149 [ 59.624633][ T2788] RBP: ffff88804b09f588 R08: 0000000000000003 R09: fffffbfff4fb3df1 [ 59.624635][ T2788] R10: fffffbfff50318f8 R11: ffff888066cadc18 R12: ffffffffa6cc2f40 [ 59.624637][ T2788] R13: 1ffff11009613eba R14: ffff8880662e9328 R15: ffff8880662e9328 [ 59.624639][ T2788] FS: 00007f57d8f7b740(0000) GS:ffff88806cc00000(0000) knlGS:0000000000000000 [ 59.624645][ T2788] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 59.625875][ T2780] tipc: Started in single node mode [ 59.626128][ T2788] CR2: 00007f57d887a8c0 CR3: 000000004b140002 CR4: 00000000000606e0 [ 59.633991][ T2788] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 59.635195][ T2788] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 59.636478][ T2788] Call Trace: [ 59.637025][ T2788] tipc_nl_add_bc_link+0x179/0x1470 [tipc] [ 59.638219][ T2788] ? lock_downgrade+0x6e0/0x6e0 [ 59.638923][ T2788] ? __tipc_nl_add_link+0xf90/0xf90 [tipc] [ 59.639533][ T2788] ? tipc_nl_node_dump_link+0x318/0xa50 [tipc] [ 59.640160][ T2788] ? 
mutex_lock_io_nested+0x1380/0x1380 [ 59.640746][ T2788] tipc_nl_node_dump_link+0x4fd/0xa50 [tipc] [ 59.641356][ T2788] ? tipc_nl_node_reset_link_stats+0x340/0x340 [tipc] [ 59.642088][ T2788] ? __skb_ext_del+0x270/0x270 [ 59.642594][ T2788] genl_lock_dumpit+0x85/0xb0 [ 59.643050][ T2788] netlink_dump+0x49c/0xed0 [ 59.643529][ T2788] ? __netlink_sendskb+0xc0/0xc0 [ 59.644044][ T2788] ? __netlink_dump_start+0x190/0x800 [ 59.644617][ T2788] ? __mutex_unlock_slowpath+0xd0/0x670 [ 59.645177][ T2788] __netlink_dump_start+0x5a0/0x800 [ 59.645692][ T2788] genl_rcv_msg+0xa75/0xe90 [ 59.646144][ T2788] ? __lock_acquire+0xdfe/0x3de0 [ 59.646692][ T2788] ? genl_family_rcv_msg_attrs_parse+0x320/0x320 [ 59.647340][ T2788] ? genl_lock_dumpit+0xb0/0xb0 [ 59.647821][ T2788] ? genl_unlock+0x20/0x20 [ 59.648290][ T2788] ? genl_parallel_done+0xe0/0xe0 [ 59.648787][ T2788] ? find_held_lock+0x39/0x1d0 [ 59.649276][ T2788] ? genl_rcv+0x15/0x40 [ 59.649722][ T2788] ? lock_contended+0xcd0/0xcd0 [ 59.650296][ T2788] netlink_rcv_skb+0x121/0x350 [ 59.650828][ T2788] ? genl_family_rcv_msg_attrs_parse+0x320/0x320 [ 59.651491][ T2788] ? netlink_ack+0x940/0x940 [ 59.651953][ T2788] ? lock_acquire+0x164/0x3b0 [ 59.652449][ T2788] genl_rcv+0x24/0x40 [ 59.652841][ T2788] netlink_unicast+0x421/0x600 [ ... ] Fixes: 7e4369057806 ("tipc: fix a slab object leak") Fixes: a62fbccecd62 ("tipc: make subscriber server support net namespace") Signed-off-by: Taehee Yoo Acked-by: Jon Maloy Signed-off-by: David S. Miller Signed-off-by: Greg Kroah-Hartman commit 2fc7d173ea6121349165f49c8bd91f82c79a9da1 Author: Eric Dumazet Date: Thu Dec 5 10:10:15 2019 -0800 tcp: md5: fix potential overestimation of TCP option space [ Upstream commit 9424e2e7ad93ffffa88f882c9bc5023570904b55 ] Back in 2008, Adam Langley fixed the corner case of packets for flows having all of the following options : MD5 TS SACK Since MD5 needs 20 bytes, and TS needs 12 bytes, no sack block can be cooked from the remaining 8 bytes. tcp_established_options() correctly sets opts->num_sack_blocks to zero, but returns 36 instead of 32. This means TCP cooks packets with 4 extra bytes at the end of options, containing unitialized bytes. Fixes: 33ad798c924b ("tcp: options clean up") Signed-off-by: Eric Dumazet Reported-by: syzbot Acked-by: Neal Cardwell Acked-by: Soheil Hassas Yeganeh Signed-off-by: David S. Miller Signed-off-by: Greg Kroah-Hartman commit 0fa3554e921483c34807359cd9bf30034163fa8c Author: Aaron Conole Date: Tue Dec 3 16:34:13 2019 -0500 openvswitch: support asymmetric conntrack [ Upstream commit 5d50aa83e2c8e91ced2cca77c198b468ca9210f4 ] The openvswitch module shares a common conntrack and NAT infrastructure exposed via netfilter. It's possible that a packet needs both SNAT and DNAT manipulation, due to e.g. tuple collision. Netfilter can support this because it runs through the NAT table twice - once on ingress and again after egress. The openvswitch module doesn't have such capability. Like netfilter hook infrastructure, we should run through NAT twice to keep the symmetry. Fixes: 05752523e565 ("openvswitch: Interface with NAT.") Signed-off-by: Aaron Conole Signed-off-by: David S. 
Miller Signed-off-by: Greg Kroah-Hartman commit 61c6c1296a5e3d122223890198ab017f07321def Author: Valentin Vidic Date: Thu Dec 5 07:41:18 2019 +0100 net/tls: Fix return values to avoid ENOTSUPP [ Upstream commit 4a5cdc604b9cf645e6fa24d8d9f055955c3c8516 ] ENOTSUPP is not available in userspace, for example: setsockopt failed, 524, Unknown error 524 Signed-off-by: Valentin Vidic Acked-by: Jakub Kicinski Signed-off-by: David S. Miller Signed-off-by: Greg Kroah-Hartman commit 94fbebd20a607d29aa8028c55d5a3521c49acd95 Author: Mian Yousaf Kaukab Date: Thu Dec 5 10:41:16 2019 +0100 net: thunderx: start phy before starting autonegotiation [ Upstream commit a350d2e7adbb57181d33e3aa6f0565632747feaa ] Since commit 2b3e88ea6528 ("net: phy: improve phy state checking") phy_start_aneg() expects phy state to be >= PHY_UP. Call phy_start() before calling phy_start_aneg() during probe so that autonegotiation is initiated. As phy_start() takes care of calling phy_start_aneg(), drop the explicit call to phy_start_aneg(). Network fails without this patch on Octeon TX. Fixes: 2b3e88ea6528 ("net: phy: improve phy state checking") Signed-off-by: Mian Yousaf Kaukab Reviewed-by: Andrew Lunn Signed-off-by: David S. Miller Signed-off-by: Greg Kroah-Hartman commit c774abc60719a5472bbfb692b6462184866e0d2b Author: Eric Dumazet Date: Sat Dec 7 11:34:45 2019 -0800 net_sched: validate TCA_KIND attribute in tc_chain_tmplt_add() [ Upstream commit 2dd5616ecdcebdf5a8d007af64e040d4e9214efe ] Use the new tcf_proto_check_kind() helper to make sure user provided value is well formed. BUG: KMSAN: uninit-value in string_nocheck lib/vsprintf.c:606 [inline] BUG: KMSAN: uninit-value in string+0x4be/0x600 lib/vsprintf.c:668 CPU: 0 PID: 12358 Comm: syz-executor.1 Not tainted 5.4.0-rc8-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x1c9/0x220 lib/dump_stack.c:118 kmsan_report+0x128/0x220 mm/kmsan/kmsan_report.c:108 __msan_warning+0x64/0xc0 mm/kmsan/kmsan_instr.c:245 string_nocheck lib/vsprintf.c:606 [inline] string+0x4be/0x600 lib/vsprintf.c:668 vsnprintf+0x218f/0x3210 lib/vsprintf.c:2510 __request_module+0x2b1/0x11c0 kernel/kmod.c:143 tcf_proto_lookup_ops+0x171/0x700 net/sched/cls_api.c:139 tc_chain_tmplt_add net/sched/cls_api.c:2730 [inline] tc_ctl_chain+0x1904/0x38a0 net/sched/cls_api.c:2850 rtnetlink_rcv_msg+0x115a/0x1580 net/core/rtnetlink.c:5224 netlink_rcv_skb+0x431/0x620 net/netlink/af_netlink.c:2477 rtnetlink_rcv+0x50/0x60 net/core/rtnetlink.c:5242 netlink_unicast_kernel net/netlink/af_netlink.c:1302 [inline] netlink_unicast+0xf3e/0x1020 net/netlink/af_netlink.c:1328 netlink_sendmsg+0x110f/0x1330 net/netlink/af_netlink.c:1917 sock_sendmsg_nosec net/socket.c:637 [inline] sock_sendmsg net/socket.c:657 [inline] ___sys_sendmsg+0x14ff/0x1590 net/socket.c:2311 __sys_sendmsg net/socket.c:2356 [inline] __do_sys_sendmsg net/socket.c:2365 [inline] __se_sys_sendmsg+0x305/0x460 net/socket.c:2363 __x64_sys_sendmsg+0x4a/0x70 net/socket.c:2363 do_syscall_64+0xb6/0x160 arch/x86/entry/common.c:291 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x45a649 Code: ad b6 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 7b b6 fb ff c3 66 2e 0f 1f 84 00 00 00 00 RSP: 002b:00007f0790795c78 EFLAGS: 00000246 ORIG_RAX: 000000000000002e RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 000000000045a649 RDX: 0000000000000000 RSI: 
0000000020000300 RDI: 0000000000000006 RBP: 000000000075bfc8 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 00007f07907966d4 R13: 00000000004c8db5 R14: 00000000004df630 R15: 00000000ffffffff Uninit was created at: kmsan_save_stack_with_flags mm/kmsan/kmsan.c:149 [inline] kmsan_internal_poison_shadow+0x5c/0x110 mm/kmsan/kmsan.c:132 kmsan_slab_alloc+0x97/0x100 mm/kmsan/kmsan_hooks.c:86 slab_alloc_node mm/slub.c:2773 [inline] __kmalloc_node_track_caller+0xe27/0x11a0 mm/slub.c:4381 __kmalloc_reserve net/core/skbuff.c:141 [inline] __alloc_skb+0x306/0xa10 net/core/skbuff.c:209 alloc_skb include/linux/skbuff.h:1049 [inline] netlink_alloc_large_skb net/netlink/af_netlink.c:1174 [inline] netlink_sendmsg+0x783/0x1330 net/netlink/af_netlink.c:1892 sock_sendmsg_nosec net/socket.c:637 [inline] sock_sendmsg net/socket.c:657 [inline] ___sys_sendmsg+0x14ff/0x1590 net/socket.c:2311 __sys_sendmsg net/socket.c:2356 [inline] __do_sys_sendmsg net/socket.c:2365 [inline] __se_sys_sendmsg+0x305/0x460 net/socket.c:2363 __x64_sys_sendmsg+0x4a/0x70 net/socket.c:2363 do_syscall_64+0xb6/0x160 arch/x86/entry/common.c:291 entry_SYSCALL_64_after_hwframe+0x44/0xa9 Fixes: 6f96c3c6904c ("net_sched: fix backward compatibility for TCA_KIND") Signed-off-by: Eric Dumazet Reported-by: syzbot Acked-by: Cong Wang Cc: Marcelo Ricardo Leitner Cc: Jamal Hadi Salim Cc: Jiri Pirko Signed-off-by: David S. Miller Signed-off-by: Greg Kroah-Hartman commit 2bbcffbfc2a51739b6c5933a657e6cff2c597887 Author: Dust Li Date: Tue Dec 3 11:17:40 2019 +0800 net: sched: fix dump qlen for sch_mq/sch_mqprio with NOLOCK subqueues [ Upstream commit 2f23cd42e19c22c24ff0e221089b7b6123b117c5 ] sch->q.len hasn't been set if the subqueue is a NOLOCK qdisc in mq_dump() and mqprio_dump(). Fixes: ce679e8df7ed ("net: sched: add support for TCQ_F_NOLOCK subqueues to sch_mqprio") Signed-off-by: Dust Li Signed-off-by: Tony Lu Signed-off-by: David S. Miller Signed-off-by: Greg Kroah-Hartman commit 5fc9fc7aac9a9d0e6007988f0c075767756072c8 Author: Grygorii Strashko Date: Fri Dec 6 14:28:20 2019 +0200 net: ethernet: ti: cpsw: fix extra rx interrupt [ Upstream commit 51302f77bedab8768b761ed1899c08f89af9e4e2 ] Now RX interrupt is triggered twice every time, because in cpsw_rx_interrupt() it is asked first and then disabled. So there will be pending interrupt always, when RX interrupt is enabled again in NAPI handler. Fix it by first disabling IRQ and then do ask. Fixes: 870915feabdc ("drivers: net: cpsw: remove disable_irq/enable_irq as irq can be masked from cpsw itself") Signed-off-by: Grygorii Strashko Signed-off-by: David S. Miller Signed-off-by: Greg Kroah-Hartman commit 0706bfdfa7047408dcb572915063ed115fb76b09 Author: Alexander Lobakin Date: Thu Dec 5 13:02:35 2019 +0300 net: dsa: fix flow dissection on Tx path [ Upstream commit 8bef0af09a5415df761b04fa487a6c34acae74bc ] Commit 43e665287f93 ("net-next: dsa: fix flow dissection") added an ability to override protocol and network offset during flow dissection for DSA-enabled devices (i.e. controllers shipped as switch CPU ports) in order to fix skb hashing for RPS on Rx path. However, skb_hash() and added part of code can be invoked not only on Rx, but also on Tx path if we have a multi-queued device and: - kernel is running on UP system or - XPS is not configured. The call stack in this two cases will be like: dev_queue_xmit() -> __dev_queue_xmit() -> netdev_core_pick_tx() -> netdev_pick_tx() -> skb_tx_hash() -> skb_get_hash(). 
The problem is that skbs queued for Tx have both network offset and correct protocol already set up even after inserting a CPU tag by DSA tagger, so calling tag_ops->flow_dissect() on this path actually only breaks flow dissection and hashing. This can be observed by adding debug prints just before and right after tag_ops->flow_dissect() call to the related block of code: Before the patch: Rx path (RPS): [ 19.240001] Rx: proto: 0x00f8, nhoff: 0 /* ETH_P_XDSA */ [ 19.244271] tag_ops->flow_dissect() [ 19.247811] Rx: proto: 0x0800, nhoff: 8 /* ETH_P_IP */ [ 19.215435] Rx: proto: 0x00f8, nhoff: 0 /* ETH_P_XDSA */ [ 19.219746] tag_ops->flow_dissect() [ 19.223241] Rx: proto: 0x0806, nhoff: 8 /* ETH_P_ARP */ [ 18.654057] Rx: proto: 0x00f8, nhoff: 0 /* ETH_P_XDSA */ [ 18.658332] tag_ops->flow_dissect() [ 18.661826] Rx: proto: 0x8100, nhoff: 8 /* ETH_P_8021Q */ Tx path (UP system): [ 18.759560] Tx: proto: 0x0800, nhoff: 26 /* ETH_P_IP */ [ 18.763933] tag_ops->flow_dissect() [ 18.767485] Tx: proto: 0x920b, nhoff: 34 /* junk */ [ 22.800020] Tx: proto: 0x0806, nhoff: 26 /* ETH_P_ARP */ [ 22.804392] tag_ops->flow_dissect() [ 22.807921] Tx: proto: 0x920b, nhoff: 34 /* junk */ [ 16.898342] Tx: proto: 0x86dd, nhoff: 26 /* ETH_P_IPV6 */ [ 16.902705] tag_ops->flow_dissect() [ 16.906227] Tx: proto: 0x920b, nhoff: 34 /* junk */ After: Rx path (RPS): [ 16.520993] Rx: proto: 0x00f8, nhoff: 0 /* ETH_P_XDSA */ [ 16.525260] tag_ops->flow_dissect() [ 16.528808] Rx: proto: 0x0800, nhoff: 8 /* ETH_P_IP */ [ 15.484807] Rx: proto: 0x00f8, nhoff: 0 /* ETH_P_XDSA */ [ 15.490417] tag_ops->flow_dissect() [ 15.495223] Rx: proto: 0x0806, nhoff: 8 /* ETH_P_ARP */ [ 17.134621] Rx: proto: 0x00f8, nhoff: 0 /* ETH_P_XDSA */ [ 17.138895] tag_ops->flow_dissect() [ 17.142388] Rx: proto: 0x8100, nhoff: 8 /* ETH_P_8021Q */ Tx path (UP system): [ 15.499558] Tx: proto: 0x0800, nhoff: 26 /* ETH_P_IP */ [ 20.664689] Tx: proto: 0x0806, nhoff: 26 /* ETH_P_ARP */ [ 18.565782] Tx: proto: 0x86dd, nhoff: 26 /* ETH_P_IPV6 */ In order to fix that we can add the check 'proto == htons(ETH_P_XDSA)' to prevent code from calling tag_ops->flow_dissect() on Tx. I also decided to initialize 'offset' variable so tagger callbacks can now safely leave it untouched without provoking a chaos. Fixes: 43e665287f93 ("net-next: dsa: fix flow dissection") Signed-off-by: Alexander Lobakin Reviewed-by: Florian Fainelli Signed-off-by: David S. Miller Signed-off-by: Greg Kroah-Hartman commit c1780f088f400f0cede9a2ea55761d8effa35ca4 Author: Nikolay Aleksandrov Date: Tue Dec 3 16:48:06 2019 +0200 net: bridge: deny dev_set_mac_address() when unregistering [ Upstream commit c4b4c421857dc7b1cf0dccbd738472360ff2cd70 ] We have an interesting memory leak in the bridge when it is being unregistered and is a slave to a master device which would change the mac of its slaves on unregister (e.g. bond, team). This is a very unusual setup but we do end up leaking 1 fdb entry because dev_set_mac_address() would cause the bridge to insert the new mac address into its table after all fdbs are flushed, i.e. after dellink() on the bridge has finished and we call NETDEV_UNREGISTER the bond/team would release it and will call dev_set_mac_address() to restore its original address and that in turn will add an fdb in the bridge. One fix is to check for the bridge dev's reg_state in its ndo_set_mac_address callback and return an error if the bridge is not in NETREG_REGISTERED. Easy steps to reproduce: 1. add bond in mode != A/B 2. add any slave to the bond 3. 
add bridge dev as a slave to the bond 4. destroy the bridge device Trace: unreferenced object 0xffff888035c4d080 (size 128): comm "ip", pid 4068, jiffies 4296209429 (age 1413.753s) hex dump (first 32 bytes): 41 1d c9 36 80 88 ff ff 00 00 00 00 00 00 00 00 A..6............ d2 19 c9 5e 3f d7 00 00 00 00 00 00 00 00 00 00 ...^?........... backtrace: [<00000000ddb525dc>] kmem_cache_alloc+0x155/0x26f [<00000000633ff1e0>] fdb_create+0x21/0x486 [bridge] [<0000000092b17e9c>] fdb_insert+0x91/0xdc [bridge] [<00000000f2a0f0ff>] br_fdb_change_mac_address+0xb3/0x175 [bridge] [<000000001de02dbd>] br_stp_change_bridge_id+0xf/0xff [bridge] [<00000000ac0e32b1>] br_set_mac_address+0x76/0x99 [bridge] [<000000006846a77f>] dev_set_mac_address+0x63/0x9b [<00000000d30738fc>] __bond_release_one+0x3f6/0x455 [bonding] [<00000000fc7ec01d>] bond_netdev_event+0x2f2/0x400 [bonding] [<00000000305d7795>] notifier_call_chain+0x38/0x56 [<0000000028885d4a>] call_netdevice_notifiers+0x1e/0x23 [<000000008279477b>] rollback_registered_many+0x353/0x6a4 [<0000000018ef753a>] unregister_netdevice_many+0x17/0x6f [<00000000ba854b7a>] rtnl_delete_link+0x3c/0x43 [<00000000adf8618d>] rtnl_dellink+0x1dc/0x20a [<000000009b6395fd>] rtnetlink_rcv_msg+0x23d/0x268 Fixes: 43598813386f ("bridge: add local MAC address to forwarding table (v2)") Reported-by: syzbot+2add91c08eb181fea1bf@syzkaller.appspotmail.com Signed-off-by: Nikolay Aleksandrov Signed-off-by: David S. Miller Signed-off-by: Greg Kroah-Hartman commit 62d7fdb00b0af2c0f41ea76ceff18bb99e794b66 Author: Vladyslav Tarasiuk Date: Fri Dec 6 13:51:05 2019 +0000 mqprio: Fix out-of-bounds access in mqprio_dump [ Upstream commit 9f104c7736904ac72385bbb48669e0c923ca879b ] When user runs a command like tc qdisc add dev eth1 root mqprio KASAN stack-out-of-bounds warning is emitted. Currently, NLA_ALIGN macro used in mqprio_dump provides too large buffer size as argument for nla_put and memcpy down the call stack. The flow looks like this: 1. nla_put expects exact object size as an argument; 2. Later it provides this size to memcpy; 3. To calculate correct padding for SKB, nla_put applies NLA_ALIGN macro itself. Therefore, NLA_ALIGN should not be applied to the nla_put parameter. Otherwise it will lead to out-of-bounds memory access in memcpy. Fixes: 4e8b86c06269 ("mqprio: Introduce new hardware offload mode and shaper in mqprio") Signed-off-by: Vladyslav Tarasiuk Signed-off-by: David S. Miller Signed-off-by: Greg Kroah-Hartman commit 20f72aae9b21577e5c325c53817ce4ea00eb1133 Author: Eric Dumazet Date: Thu Dec 5 20:43:46 2019 -0800 inet: protect against too small mtu values. [ Upstream commit 501a90c945103e8627406763dac418f20f3837b2 ] syzbot was once again able to crash a host by setting a very small mtu on loopback device. Let's make inetdev_valid_mtu() available in include/net/ip.h, and use it in ip_setup_cork(), so that we protect both ip_append_page() and __ip_append_data() Also add a READ_ONCE() when the device mtu is read. Pairs this lockless read with one WRITE_ONCE() in __dev_set_mtu(), even if other code paths might write over this field. Add a big comment in include/linux/netdevice.h about dev->mtu needing READ_ONCE()/WRITE_ONCE() annotations. Hopefully we will add the missing ones in followup patches. [1] refcount_t: saturated; leaking memory. WARNING: CPU: 0 PID: 9464 at lib/refcount.c:22 refcount_warn_saturate+0x138/0x1f0 lib/refcount.c:22 Kernel panic - not syncing: panic_on_warn set ... 
CPU: 0 PID: 9464 Comm: syz-executor850 Not tainted 5.4.0-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x197/0x210 lib/dump_stack.c:118 panic+0x2e3/0x75c kernel/panic.c:221 __warn.cold+0x2f/0x3e kernel/panic.c:582 report_bug+0x289/0x300 lib/bug.c:195 fixup_bug arch/x86/kernel/traps.c:174 [inline] fixup_bug arch/x86/kernel/traps.c:169 [inline] do_error_trap+0x11b/0x200 arch/x86/kernel/traps.c:267 do_invalid_op+0x37/0x50 arch/x86/kernel/traps.c:286 invalid_op+0x23/0x30 arch/x86/entry/entry_64.S:1027 RIP: 0010:refcount_warn_saturate+0x138/0x1f0 lib/refcount.c:22 Code: 06 31 ff 89 de e8 c8 f5 e6 fd 84 db 0f 85 6f ff ff ff e8 7b f4 e6 fd 48 c7 c7 e0 71 4f 88 c6 05 56 a6 a4 06 01 e8 c7 a8 b7 fd <0f> 0b e9 50 ff ff ff e8 5c f4 e6 fd 0f b6 1d 3d a6 a4 06 31 ff 89 RSP: 0018:ffff88809689f550 EFLAGS: 00010286 RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000 RDX: 0000000000000000 RSI: ffffffff815e4336 RDI: ffffed1012d13e9c RBP: ffff88809689f560 R08: ffff88809c50a3c0 R09: fffffbfff15d31b1 R10: fffffbfff15d31b0 R11: ffffffff8ae98d87 R12: 0000000000000001 R13: 0000000000040100 R14: ffff888099041104 R15: ffff888218d96e40 refcount_add include/linux/refcount.h:193 [inline] skb_set_owner_w+0x2b6/0x410 net/core/sock.c:1999 sock_wmalloc+0xf1/0x120 net/core/sock.c:2096 ip_append_page+0x7ef/0x1190 net/ipv4/ip_output.c:1383 udp_sendpage+0x1c7/0x480 net/ipv4/udp.c:1276 inet_sendpage+0xdb/0x150 net/ipv4/af_inet.c:821 kernel_sendpage+0x92/0xf0 net/socket.c:3794 sock_sendpage+0x8b/0xc0 net/socket.c:936 pipe_to_sendpage+0x2da/0x3c0 fs/splice.c:458 splice_from_pipe_feed fs/splice.c:512 [inline] __splice_from_pipe+0x3ee/0x7c0 fs/splice.c:636 splice_from_pipe+0x108/0x170 fs/splice.c:671 generic_splice_sendpage+0x3c/0x50 fs/splice.c:842 do_splice_from fs/splice.c:861 [inline] direct_splice_actor+0x123/0x190 fs/splice.c:1035 splice_direct_to_actor+0x3b4/0xa30 fs/splice.c:990 do_splice_direct+0x1da/0x2a0 fs/splice.c:1078 do_sendfile+0x597/0xd00 fs/read_write.c:1464 __do_sys_sendfile64 fs/read_write.c:1525 [inline] __se_sys_sendfile64 fs/read_write.c:1511 [inline] __x64_sys_sendfile64+0x1dd/0x220 fs/read_write.c:1511 do_syscall_64+0xfa/0x790 arch/x86/entry/common.c:294 entry_SYSCALL_64_after_hwframe+0x49/0xbe RIP: 0033:0x441409 Code: e8 ac e8 ff ff 48 83 c4 18 c3 0f 1f 80 00 00 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 eb 08 fc ff c3 66 2e 0f 1f 84 00 00 00 00 RSP: 002b:00007fffb64c4f78 EFLAGS: 00000246 ORIG_RAX: 0000000000000028 RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 0000000000441409 RDX: 0000000000000000 RSI: 0000000000000006 RDI: 0000000000000005 RBP: 0000000000073b8a R08: 0000000000000010 R09: 0000000000000010 R10: 0000000000010001 R11: 0000000000000246 R12: 0000000000402180 R13: 0000000000402210 R14: 0000000000000000 R15: 0000000000000000 Kernel Offset: disabled Rebooting in 86400 seconds.. Fixes: 1470ddf7f8ce ("inet: Remove explicit write references to sk/inet in ip_append_data") Signed-off-by: Eric Dumazet Reported-by: syzbot Signed-off-by: David S. 
Miller Signed-off-by: Greg Kroah-Hartman commit 925cbfe727edeb05cff819ee6836524ce9fe8aff Author: Thomas Gleixner Date: Fri Jul 8 20:25:16 2011 +0200 Add localversion for -RT release Signed-off-by: Thomas Gleixner commit 0ad021ba6e573099913d786df7ef6be8b53014fd Author: Clark Williams Date: Sat Jul 30 21:55:53 2011 -0500 sysfs: Add /sys/kernel/realtime entry Add a /sys/kernel entry to indicate that the kernel is a realtime kernel. Clark says that he needs this for udev rules; udev needs to evaluate if it's a PREEMPT_RT kernel a few thousand times and parsing uname output is too slow or so. Are there better solutions? Should it exist and return 0 on !-rt? Signed-off-by: Clark Williams Signed-off-by: Peter Zijlstra commit 00857b80bfe7507e626ee33ac0e60eca8fcd71ae Author: Ingo Molnar Date: Fri Jul 3 08:29:57 2009 -0500 genirq: Disable irqpoll on -rt Creates long latencies for no value Signed-off-by: Ingo Molnar Signed-off-by: Thomas Gleixner commit 2f971baeab572b0d1fecb43467256af90334dff1 Author: Thomas Gleixner Date: Fri Jul 3 08:44:56 2009 -0500 signals: Allow rt tasks to cache one sigqueue struct To avoid allocation allow rt tasks to cache one sigqueue struct in task struct. Signed-off-by: Thomas Gleixner commit 1bcf2c295f30661561a845644572bfd59e0b89d4 Author: Haris Okanovic Date: Tue Aug 15 15:13:08 2017 -0500 tpm_tis: fix stall after iowrite*()s ioread8() operations to TPM MMIO addresses can stall the cpu when immediately following a sequence of iowrite*()'s to the same region. For example, cyclictest measures ~400us latency spikes when a non-RT usermode application communicates with an SPI-based TPM chip (Intel Atom E3940 system, PREEMPT_RT kernel). The spikes are caused by a stalling ioread8() operation following a sequence of 30+ iowrite8()s to the same address. I believe this happens because the write sequence is buffered (in the CPU or somewhere along the bus), and gets flushed on the first LOAD instruction (ioread*()) that follows. The enclosed change appears to fix this issue: read the TPM chip's access register (status code) after every iowrite*() operation to amortize the cost of flushing data to the chip across multiple instructions. Signed-off-by: Haris Okanovic Signed-off-by: Sebastian Andrzej Siewior commit b45f39e849edf78b6cd90edd28f3ef179fd65218 Author: Julia Cartwright Date: Mon May 7 08:58:57 2018 -0500 squashfs: make use of local lock in multi_cpu decompressor Currently, the squashfs multi_cpu decompressor makes use of get_cpu_ptr()/put_cpu_ptr(), which unconditionally disable preemption during decompression. Because the workload is distributed across CPUs, all CPUs can observe a very high wakeup latency, which has been seen to be as much as 8000us. Convert this decompressor to make use of a local lock, which will allow execution of the decompressor with preemption enabled, but also ensure concurrent accesses to the percpu compressor data on the local CPU will be serialized. Cc: stable-rt@vger.kernel.org Reported-by: Alexander Stein Tested-by: Alexander Stein Signed-off-by: Julia Cartwright Signed-off-by: Sebastian Andrzej Siewior commit c0715d8ee510eeeb7e2cff1425f9310466d2161e Author: Mike Galbraith Date: Thu Oct 20 11:15:22 2016 +0200 drivers/zram: Don't disable preemption in zcomp_stream_get/put() In v4.7, the driver switched to percpu compression streams, disabling preemption via get/put_cpu_ptr(). Use a per-zcomp_strm lock here.
We also have to fix a lock order issue in zram_decompress_page() such that zs_map_object() nests inside of zcomp_stream_put() as it does in zram_bvec_write(). Signed-off-by: Mike Galbraith [bigeasy: get_locked_var() -> per zcomp_strm lock] Signed-off-by: Sebastian Andrzej Siewior commit b61eab6ee753ff92dbf6c18fd915ac509e0ff5f8 Author: Mike Galbraith Date: Thu Mar 31 04:08:28 2016 +0200 drivers/block/zram: Replace bit spinlocks with rtmutex for -rt They're nondeterministic, and lead to ___might_sleep() splats in -rt. OTOH, they're a lot less wasteful than an rtmutex per page. Signed-off-by: Mike Galbraith Signed-off-by: Sebastian Andrzej Siewior commit 78412101a22410944389c3bbc954ba97a0f47e0f Author: Mike Galbraith Date: Sun Oct 16 05:11:54 2016 +0200 connector/cn_proc: Protect send_msg() with a local lock on RT |BUG: sleeping function called from invalid context at kernel/locking/rtmutex.c:931 |in_atomic(): 1, irqs_disabled(): 0, pid: 31807, name: sleep |Preemption disabled at:[] proc_exit_connector+0xbb/0x140 | |CPU: 4 PID: 31807 Comm: sleep Tainted: G W E 4.8.0-rt11-rt #106 |Call Trace: | [] dump_stack+0x65/0x88 | [] ___might_sleep+0xf5/0x180 | [] __rt_spin_lock+0x20/0x50 | [] rt_read_lock+0x28/0x30 | [] netlink_broadcast_filtered+0x49/0x3f0 | [] ? __kmalloc_reserve.isra.33+0x31/0x90 | [] netlink_broadcast+0x1d/0x20 | [] cn_netlink_send_mult+0x19a/0x1f0 | [] cn_netlink_send+0x1b/0x20 | [] proc_exit_connector+0xf8/0x140 | [] do_exit+0x5d1/0xba0 | [] do_group_exit+0x4c/0xc0 | [] SyS_exit_group+0x14/0x20 | [] entry_SYSCALL_64_fastpath+0x1a/0xa4 Since ab8ed951080e ("connector: fix out-of-order cn_proc netlink message delivery") which is v4.7-rc6. Signed-off-by: Mike Galbraith Signed-off-by: Sebastian Andrzej Siewior commit 8c5f41c2e7734590a54b82a6329f19fb40f4db42 Author: Thomas Gleixner Date: Mon Jul 18 17:10:12 2011 +0200 mips: Disable highmem on RT The current highmem handling on -RT is not compatible and needs fixups. Signed-off-by: Thomas Gleixner commit f8da972878774c6ac8a3bc8eb7606948bdc3f2db Author: Sebastian Andrzej Siewior Date: Fri Oct 11 13:14:41 2019 +0200 POWERPC: Allow to enable RT Allow to select RT. Signed-off-by: Sebastian Andrzej Siewior commit 5e4329d0651cc9c8633993265c6b98f094b049bc Author: Sebastian Andrzej Siewior Date: Tue Mar 26 18:31:29 2019 +0100 powerpc/stackprotector: work around stack-guard init from atomic This is invoked from the secondary CPU in atomic context. On x86 we use tsc instead. On Power we XOR it against mftb() so let's use the stack address as the initial value. Cc: stable-rt@vger.kernel.org Signed-off-by: Sebastian Andrzej Siewior commit ec5758bd780555e9690fa3591a571db0b149a29e Author: Thomas Gleixner Date: Mon Jul 18 17:08:34 2011 +0200 powerpc: Disable highmem on RT The current highmem handling on -RT is not compatible and needs fixups. Signed-off-by: Thomas Gleixner commit 7c221e761da3d3684029cbf140fe44b40ab9986b Author: Bogdan Purcareata Date: Fri Apr 24 15:53:13 2015 +0000 powerpc/kvm: Disable in-kernel MPIC emulation for PREEMPT_RT While converting the openpic emulation code to use a raw_spinlock_t enables guests to run on RT, there's still a performance issue. For interrupts sent in directed delivery mode with a multiple CPU mask, the emulated openpic will loop through all of the VCPUs, and for each VCPU it calls IRQ_check, which will loop through all the pending interrupts for that VCPU. This is done while holding the raw_lock, meaning that in all this time the interrupts and preemption are disabled on the host Linux.
A malicious user app can max out both of these numbers and cause a DoS. This temporary fix is sent for two reasons. First is so that users who want to use the in-kernel MPIC emulation are aware of the potential latencies, thus making sure that the hardware MPIC and their usage scenario do not involve interrupts sent in directed delivery mode, and the number of possible pending interrupts is kept small. Secondly, this should incentivize the development of a proper openpic emulation that would be better suited for RT. Acked-by: Scott Wood Signed-off-by: Bogdan Purcareata Signed-off-by: Sebastian Andrzej Siewior commit 3ab3ff1d185c02c51cfb22a22476d8c3791c8810 Author: Sebastian Andrzej Siewior Date: Tue Mar 26 18:31:54 2019 +0100 powerpc/pseries/iommu: Use a locallock instead local_irq_save() The locallock protects the per-CPU variable tce_page. The function attempts to allocate memory while tce_page is protected (by disabling interrupts). Use local_irq_save() instead of local_irq_disable(). Cc: stable-rt@vger.kernel.org Signed-off-by: Sebastian Andrzej Siewior commit a5b122c32c95528fafa3fdcc74e487af07e92202 Author: Sebastian Andrzej Siewior Date: Fri Oct 11 13:14:35 2019 +0200 ARM64: Allow to enable RT Allow to select RT. Signed-off-by: Sebastian Andrzej Siewior commit 5de71cea58c55dc14e71dbe165492087746af935 Author: Sebastian Andrzej Siewior Date: Fri Oct 11 13:14:29 2019 +0200 ARM: Allow to enable RT Allow to select RT. Signed-off-by: Sebastian Andrzej Siewior commit 5f56dae2384176bb7669c19eeb1a2e22ba633b0a Author: Sebastian Andrzej Siewior Date: Thu Nov 7 17:49:20 2019 +0100 x86: Enable RT also on 32bit Signed-off-by: Sebastian Andrzej Siewior commit a6460f89ebb0028eede528718500fc6994b13bcc Author: Benedikt Spranger Date: Mon Mar 8 18:57:04 2010 +0100 clocksource: TCLIB: Allow higher clock rates for clock events By default the TCLIB uses the 32KiHz base clock rate for clock events. Add a compile time selection to allow higher clock resolution. (fixed up by Sami Pietikäinen ) Signed-off-by: Benedikt Spranger Signed-off-by: Thomas Gleixner commit 4d1a2226d45f5c618d4f94f9f2d8939703bfcb7f Author: Sebastian Andrzej Siewior Date: Wed Mar 9 10:51:06 2016 +0100 arm: at91: do not disable/enable clocks in a row Currently the driver will disable the clock and enable it one line later if it is switching from periodic mode into one shot. This can be avoided and causes a needless warning on -RT. Signed-off-by: Sebastian Andrzej Siewior commit 7f14cf46e212671b002993b669b3ed6891a16466 Author: Sebastian Andrzej Siewior Date: Wed Jul 25 14:02:38 2018 +0200 arm64: fpsimd: Delay freeing memory in fpsimd_flush_thread() fpsimd_flush_thread() invokes kfree() via sve_free() within a preempt disabled section which does not work on -RT. Delay freeing of memory until preemption is enabled again. Signed-off-by: Sebastian Andrzej Siewior commit ed4bf6a931037afe388a6f449e9ae529e4acc39e Author: Josh Cartwright Date: Thu Feb 11 11:54:01 2016 -0600 KVM: arm/arm64: downgrade preempt_disable()d region to migrate_disable() kvm_arch_vcpu_ioctl_run() disables the use of preemption when updating the vgic and timer states to prevent the calling task from migrating to another CPU. It does so to prevent the task from writing to the incorrect per-CPU GIC distributor registers. On -rt kernels, it's possible to maintain the same guarantee with the use of migrate_{disable,enable}(), with the added benefit that the migrate-disabled region is preemptible. Update kvm_arch_vcpu_ioctl_run() to do so.
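A minimal sketch of the pattern this commit describes (illustrative only, not the actual patch; the helper name is made up):

  #include <linux/preempt.h>

  /* Keep the task on its CPU while touching per-CPU GIC state, but stay
   * preemptible on -RT: migrate_disable() replaces preempt_disable().
   */
  static void vcpu_flush_vgic_state_sketch(void)      /* hypothetical helper */
  {
          migrate_disable();          /* was: preempt_disable(); */
          /* ... update vgic/timer state, per-CPU distributor registers ... */
          migrate_enable();           /* was: preempt_enable();  */
  }

As noted elsewhere in this log, on !RT configurations migrate_disable()/migrate_enable() map back to preempt_disable()/preempt_enable(), so behaviour there is unchanged.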
Cc: Christoffer Dall Reported-by: Manish Jaggi Signed-off-by: Josh Cartwright Signed-off-by: Sebastian Andrzej Siewior commit 36535d4c7c815c585fe89cd4275f6d03bd7ba77f Author: Sebastian Andrzej Siewior Date: Fri Dec 1 10:42:03 2017 +0100 arm*: disable NEON in kernel mode NEON in kernel mode is used by the crypto algorithms and raid6 code. While the raid6 code looks okay, the crypto algorithms do not: NEON is enabled on first invocation and may allocate/free/map memory before the NEON mode is disabled again. This needs to be changed before it can be enabled. On ARM, NEON in kernel mode can be simply disabled. On ARM64 it needs to stay on due to possible EFI callbacks, so here I disable each algorithm. Cc: stable-rt@vger.kernel.org Signed-off-by: Sebastian Andrzej Siewior commit f0d8944af08fd5b7e88dc07642737484bfaa8a1f Author: Josh Cartwright Date: Thu Feb 11 11:54:00 2016 -0600 genirq: update irq_set_irqchip_state documentation On -rt kernels, the use of migrate_disable()/migrate_enable() is sufficient to guarantee a task isn't moved to another CPU. Update the irq_set_irqchip_state() documentation to reflect this. Signed-off-by: Josh Cartwright Signed-off-by: Sebastian Andrzej Siewior commit 92410436f42873164f5b7ec12b808fca5a2b42f8 Author: Yadi.hu Date: Wed Dec 10 10:32:09 2014 +0800 ARM: enable irq in translation/section permission fault handlers Probably happens on all ARM, with CONFIG_PREEMPT_RT and CONFIG_DEBUG_ATOMIC_SLEEP. This simple program.... int main() { *((char*)0xc0001000) = 0; }; [ 512.742724] BUG: sleeping function called from invalid context at kernel/rtmutex.c:658 [ 512.743000] in_atomic(): 0, irqs_disabled(): 128, pid: 994, name: a [ 512.743217] INFO: lockdep is turned off. [ 512.743360] irq event stamp: 0 [ 512.743482] hardirqs last enabled at (0): [< (null)>] (null) [ 512.743714] hardirqs last disabled at (0): [] copy_process+0x3b0/0x11c0 [ 512.744013] softirqs last enabled at (0): [] copy_process+0x3b0/0x11c0 [ 512.744303] softirqs last disabled at (0): [< (null)>] (null) [ 512.744631] [] (unwind_backtrace+0x0/0x104) [ 512.745001] [] (dump_stack+0x20/0x24) [ 512.745355] [] (__might_sleep+0x1dc/0x1e0) [ 512.745717] [] (rt_spin_lock+0x34/0x6c) [ 512.746073] [] (do_force_sig_info+0x34/0xf0) [ 512.746457] [] (force_sig_info+0x18/0x1c) [ 512.746829] [] (__do_user_fault+0x9c/0xd8) [ 512.747185] [] (do_bad_area+0x7c/0x94) [ 512.747536] [] (do_sect_fault+0x40/0x48) [ 512.747898] [] (do_DataAbort+0x40/0xa0) [ 512.748181] Exception stack(0xecaa1fb0 to 0xecaa1ff8) 0xc0000000 belongs to the kernel address space; a user task cannot be allowed to access it. For the above condition, the correct result is that the test case receives a "segmentation fault" and exits, instead of producing the splat above. The root cause is commit 02fe2845d6a8 ("avoid enabling interrupts in prefetch/data abort handlers"): it deletes the irq enable block in the data abort assembly code and moves it into the page/breakpoint/alignment fault handlers instead. However, it does not enable irqs in the translation/section permission fault handlers. ARM disables irqs when it enters exception/interrupt mode; if the kernel doesn't re-enable them, they stay disabled during translation/section permission faults.
We see the above splat because do_force_sig_info is still called with IRQs off, and that code eventually does a: spin_lock_irqsave(&t->sighand->siglock, flags); As this is architecture independent code, and we've not seen any other need for other arches to have the siglock converted to a raw lock, we can conclude that we should enable irqs for the ARM translation/section permission exceptions. Signed-off-by: Yadi.hu Signed-off-by: Sebastian Andrzej Siewior commit 0f7b3f1d159c441a97e626ef92ed2a3aa66c4e97 Author: Sebastian Andrzej Siewior Date: Thu Dec 22 17:28:33 2016 +0100 arm: include definition for cpumask_t This definition gets pulled in by other files. With the (later) split of RCU and spinlock.h it won't compile anymore. The split is done in ("rbtree: don't include the rcu header"). Signed-off-by: Sebastian Andrzej Siewior commit df233cc58baca634846aebb2418f4701597780c4 Author: Kurt Kanzenbach Date: Mon Sep 24 10:29:01 2018 +0200 tty: serial: pl011: explicitly initialize the flags variable Silence the following gcc warning: drivers/tty/serial/amba-pl011.c: In function ‘pl011_console_write’: ./include/linux/spinlock.h:260:3: warning: ‘flags’ may be used uninitialized in this function [-Wmaybe-uninitialized] _raw_spin_unlock_irqrestore(lock, flags); \ ^~~~~~~~~~~~~~~~~~~~~~~~~~~ drivers/tty/serial/amba-pl011.c:2214:16: note: ‘flags’ was declared here unsigned long flags; ^~~~~ The code is correct. Thus, initializing flags to zero doesn't change the behavior and resolves the warning. Signed-off-by: Kurt Kanzenbach Signed-off-by: Sebastian Andrzej Siewior commit dbac70accf940fac631aa58483ce5317a62cb6c9 Author: Thomas Gleixner Date: Tue Jan 8 21:36:51 2013 +0100 tty/serial/pl011: Make the locking work on RT The lock is a sleeping lock and local_irq_save() is not the optimisation we are looking for. Redo it to make it work on -RT and non-RT. Signed-off-by: Thomas Gleixner commit 089f9858075e5b7da8117b695135729c6d48d1c1 Author: Thomas Gleixner Date: Thu Jul 28 13:32:57 2011 +0200 tty/serial/omap: Make the locking RT aware The lock is a sleeping lock and local_irq_save() is not the optimisation we are looking for. Redo it to make it work on -RT and non-RT. Signed-off-by: Thomas Gleixner commit e3a1d988bb79d7de8b98fc5f0c94e327fec7c8e4 Author: Sebastian Andrzej Siewior Date: Thu Jan 23 14:45:59 2014 +0100 leds: trigger: disable CPU trigger on -RT as it triggers: |CPU: 0 PID: 0 Comm: swapper Not tainted 3.12.8-rt10 #141 |[] (unwind_backtrace+0x0/0xf8) from [] (show_stack+0x1c/0x20) |[] (show_stack+0x1c/0x20) from [] (dump_stack+0x20/0x2c) |[] (dump_stack+0x20/0x2c) from [] (__might_sleep+0x13c/0x170) |[] (__might_sleep+0x13c/0x170) from [] (__rt_spin_lock+0x28/0x38) |[] (__rt_spin_lock+0x28/0x38) from [] (rt_read_lock+0x68/0x7c) |[] (rt_read_lock+0x68/0x7c) from [] (led_trigger_event+0x2c/0x5c) |[] (led_trigger_event+0x2c/0x5c) from [] (ledtrig_cpu+0x54/0x5c) |[] (ledtrig_cpu+0x54/0x5c) from [] (arch_cpu_idle_exit+0x18/0x1c) |[] (arch_cpu_idle_exit+0x18/0x1c) from [] (cpu_startup_entry+0xa8/0x234) |[] (cpu_startup_entry+0xa8/0x234) from [] (rest_init+0xb8/0xe0) |[] (rest_init+0xb8/0xe0) from [] (start_kernel+0x2c4/0x380) Signed-off-by: Sebastian Andrzej Siewior commit 71ffda95e799eda9d777e8cf4020a5caf80997e5 Author: Thomas Gleixner Date: Wed Jul 8 17:14:48 2015 +0200 jump-label: disable if stop_machine() is used Some architectures are using stop_machine() while switching the opcode which leads to latency spikes.
The architectures which use stop_machine() atm: - ARM stop machine - s390 stop machine The architectures which use other sorcery: - MIPS - X86 - powerpc - sparc - arm64 Signed-off-by: Thomas Gleixner [bigeasy: only ARM for now] Signed-off-by: Sebastian Andrzej Siewior commit 3968924db82e8a1e747202ea4d46e76f91cd760d Author: Anders Roxell Date: Thu May 14 17:52:17 2015 +0200 arch/arm64: Add lazy preempt support arm64 is missing support for PREEMPT_RT. The main feature which is lacking is support for lazy preemption. The arch-specific entry code, thread information structure definitions, and associated data tables have to be extended to provide this support. Then the Kconfig file has to be extended to indicate the support is available, and also to indicate that support for full RT preemption is now available. Signed-off-by: Anders Roxell commit 2db3cbcb08e3ff27330edae46792d4556bff9004 Author: Thomas Gleixner Date: Thu Nov 1 10:14:11 2012 +0100 powerpc: Add support for lazy preemption Implement the powerpc pieces for lazy preempt. Signed-off-by: Thomas Gleixner commit 8d1196cf41de6fe2d6c34aae6e8d22cf0f7bb634 Author: Thomas Gleixner Date: Wed Oct 31 12:04:11 2012 +0100 arm: Add support for lazy preemption Implement the arm pieces for lazy preempt. Signed-off-by: Thomas Gleixner commit be5f0d6796a5b66e7e12ca851f747deb8bb53010 Author: Thomas Gleixner Date: Thu Nov 1 11:03:47 2012 +0100 x86: Support for lazy preemption Implement the x86 pieces for lazy preempt. Signed-off-by: Thomas Gleixner commit 98bb2da8e3799e379a664d3a9c5935556bd87a41 Author: Thomas Gleixner Date: Fri Oct 26 18:50:54 2012 +0100 sched: Add support for lazy preemption It has become an obsession to mitigate the determinism vs. throughput loss of RT. Looking at the mainline semantics of preemption points gives a hint why RT sucks throughput-wise for ordinary SCHED_OTHER tasks. One major issue is the wakeup of tasks which are right away preempting the waking task while the waking task holds a lock on which the woken task will block right after having preempted the waker. In mainline this is prevented due to the implicit preemption disable of spin/rw_lock held regions. On RT this is not possible due to the fully preemptible nature of sleeping spinlocks. Though for a SCHED_OTHER task preempting another SCHED_OTHER task this is really not a correctness issue. RT folks are concerned about SCHED_FIFO/RR task preemption and not about the purely fairness driven SCHED_OTHER preemption latencies. So I introduced a lazy preemption mechanism which only applies to SCHED_OTHER tasks preempting another SCHED_OTHER task. Aside from the existing preempt_count, each task now sports a preempt_lazy_count which is manipulated on lock acquisition and release. This is slightly incorrect as for laziness reasons I coupled this to migrate_disable/enable so some other mechanisms get the same treatment (e.g. get_cpu_light). Now on the scheduler side, instead of setting NEED_RESCHED this sets NEED_RESCHED_LAZY in case of a SCHED_OTHER/SCHED_OTHER preemption and therefore allows the waking task to exit the lock-held region before the woken task preempts. That also works better for cross CPU wakeups as the other side can stay in the adaptive spinning loop. For RT class preemption there is no change. This simply sets NEED_RESCHED and forgoes the lazy preemption counter.
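A conceptual sketch of the wakeup-side decision described above (names simplified and the helper is made up; TIF_NEED_RESCHED_LAZY comes from the RT patch set):

  /* Choose the lazy flag for fair-class-on-fair-class preemption so the
   * waker can leave its lock-held region first; RT-class wakeups still
   * get the immediate NEED_RESCHED.
   */
  static void resched_curr_sketch(struct task_struct *curr,
                                  struct task_struct *waking)
  {
          if (rt_task(waking) || dl_task(waking))
                  set_tsk_need_resched(curr);                       /* TIF_NEED_RESCHED    */
          else
                  set_tsk_thread_flag(curr, TIF_NEED_RESCHED_LAZY); /* deferred preemption */
  }

The idea, per the description above, is that the lazy flag is only acted upon once the task's preempt_lazy_count has dropped back to zero.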
Initial tests do not expose any observable latency increase, but history shows that I've been proven wrong before :) The lazy preemption mode is on by default, but with CONFIG_SCHED_DEBUG enabled it can be disabled via: # echo NO_PREEMPT_LAZY >/sys/kernel/debug/sched_features and re-enabled via # echo PREEMPT_LAZY >/sys/kernel/debug/sched_features The test results so far are very machine and workload dependent, but there is a clear trend that it enhances the non-RT workload performance. Signed-off-by: Thomas Gleixner commit 9fde3d3e1f90b4a688af3fc29e542bd4659a6b4b Author: Thomas Gleixner Date: Fri Jul 3 08:44:34 2009 -0500 mm/scatterlist: Do not disable irqs on RT For -RT it is enough to keep pagefaults disabled (which is currently handled by kmap_atomic()). Signed-off-by: Thomas Gleixner commit fc63f38e78834bd7005022b02d7278836bb07056 Author: Thomas Gleixner Date: Wed Feb 13 11:03:11 2013 +0100 arm: Enable highmem for rt Fix up highmem for ARM. Signed-off-by: Thomas Gleixner commit 9c9120453b81788a146290f3e4f524a0870603a2 Author: Sebastian Andrzej Siewior Date: Mon Mar 11 21:37:27 2013 +0100 arm/highmem: Flush tlb on unmap The tlb should be flushed on unmap and thus make the mapping entry invalid. This is only done in the non-debug case which does not look right. Signed-off-by: Sebastian Andrzej Siewior commit bff26283078872d736a1216995ad8fe6a2fbbf18 Author: Sebastian Andrzej Siewior Date: Mon Mar 11 17:09:55 2013 +0100 x86/highmem: Add an "already used pte" check This is a copy from kmap_atomic_prot(). Signed-off-by: Sebastian Andrzej Siewior commit 93234f6354d31d88d8987e374ca877979c7dbc15 Author: Peter Zijlstra Date: Thu Jul 28 10:43:51 2011 +0200 mm, rt: kmap_atomic scheduling In fact, with migrate_disable() existing one could play games with kmap_atomic. You could save/restore the kmap_atomic slots on context switch (if there are any in use of course), this should be esp easy now that we have a kmap_atomic stack. Something like the below.. it wants replacing all the preempt_disable() stuff with pagefault_disable() && migrate_disable() of course, but then you can flip kmaps around like below. Signed-off-by: Peter Zijlstra [dvhart@linux.intel.com: build fix] Link: http://lkml.kernel.org/r/1311842631.5890.208.camel@twins [tglx@linutronix.de: Get rid of the per cpu variable and store the idx and the pte content right away in the task struct. Shortens the context switch code. ] commit 95c50e7fcaa1ebe6e9f8c98b385dcc10653781e9 Author: Sebastian Andrzej Siewior Date: Wed Aug 7 18:15:38 2019 +0200 x86: Allow to enable RT Allow to select RT. Signed-off-by: Sebastian Andrzej Siewior commit 79c91c537d21dbe7da0cd325683cbda451e7035d Author: Sebastian Andrzej Siewior Date: Wed Oct 11 17:43:49 2017 +0200 apparmor: use a locallock instead preempt_disable() get_buffers() disables preemption which acts as a lock for the per-CPU variable. Since we can't disable preemption here on RT, a local_lock is used in order to remain on the same CPU and not to have more than one user within the critical section.
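The same pattern recurs in several commits in this log (squashfs, zram, apparmor, srcu): replace an implicit preempt_disable()-style "lock" on per-CPU data with an explicit per-CPU lock. A minimal sketch, written against the local_lock_t API as it later landed in mainline (the 5.4-rt queue uses its own, equivalent locallock primitive; all names below are illustrative):

  #include <linux/local_lock.h>
  #include <linux/percpu.h>

  struct pcpu_scratch {
          local_lock_t lock;
          char buf[512];
  };

  static DEFINE_PER_CPU(struct pcpu_scratch, scratch) = {
          .lock = INIT_LOCAL_LOCK(lock),
  };

  static void use_scratch(void)
  {
          struct pcpu_scratch *s;

          local_lock(&scratch.lock);      /* was: preempt_disable() / get_cpu_ptr() */
          s = this_cpu_ptr(&scratch);
          /* ... work on s->buf: preemptible on RT, still serialized per CPU ...   */
          local_unlock(&scratch.lock);    /* was: preempt_enable() / put_cpu_ptr()  */
  }

On !RT the local lock collapses to a preemption-disabled section, so the non-RT fast path is essentially unchanged.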
Signed-off-by: Sebastian Andrzej Siewior commit d23b03a80783ac989cf9b9e5df9a06a414b1ad31 Author: Mike Galbraith Date: Sun Jan 8 09:32:25 2017 +0100 cpuset: Convert callback_lock to raw_spinlock_t The two commits below add up to a cpuset might_sleep() splat for RT: 8447a0fee974 cpuset: convert callback_mutex to a spinlock 344736f29b35 cpuset: simplify cpuset_node_allowed API BUG: sleeping function called from invalid context at kernel/locking/rtmutex.c:995 in_atomic(): 0, irqs_disabled(): 1, pid: 11718, name: cset CPU: 135 PID: 11718 Comm: cset Tainted: G E 4.10.0-rt1-rt #4 Hardware name: Intel Corporation BRICKLAND/BRICKLAND, BIOS BRHSXSD1.86B.0056.R01.1409242327 09/24/2014 Call Trace: ? dump_stack+0x5c/0x81 ? ___might_sleep+0xf4/0x170 ? rt_spin_lock+0x1c/0x50 ? __cpuset_node_allowed+0x66/0xc0 ? ___slab_alloc+0x390/0x570 ? anon_vma_fork+0x8f/0x140 ? copy_page_range+0x6cf/0xb00 ? anon_vma_fork+0x8f/0x140 ? __slab_alloc.isra.74+0x5a/0x81 ? anon_vma_fork+0x8f/0x140 ? kmem_cache_alloc+0x1b5/0x1f0 ? anon_vma_fork+0x8f/0x140 ? copy_process.part.35+0x1670/0x1ee0 ? _do_fork+0xdd/0x3f0 ? _do_fork+0xdd/0x3f0 ? do_syscall_64+0x61/0x170 ? entry_SYSCALL64_slow_path+0x25/0x25 The latter ensured that a NUMA box WILL take callback_lock in atomic context by removing the allocator and reclaim path __GFP_HARDWALL usage which prevented such contexts from taking callback_mutex. One option would be to reinstate __GFP_HARDWALL protections for RT, however, as the 8447a0fee974 changelog states: The callback_mutex is only used to synchronize reads/updates of cpusets' flags and cpu/node masks. These operations should always proceed fast so there's no reason why we can't use a spinlock instead of the mutex. Cc: stable-rt@vger.kernel.org Signed-off-by: Mike Galbraith Signed-off-by: Sebastian Andrzej Siewior commit 94880cde0b590202dd095b79748a5fb32ae30a72 Author: Sebastian Andrzej Siewior Date: Thu Sep 26 12:30:21 2019 +0200 drm/i915: Drop the IRQ-off asserts The lockdep_assert_irqs_disabled() check is needless. The previous lockdep_assert_held() check ensures that the lock is acquired and, while the lock is acquired, lockdep also prints a warning if the interrupts are not disabled when they have to be. These IRQ-off asserts trigger on PREEMPT_RT because the locks become sleeping locks and do not really disable interrupts. Remove lockdep_assert_irqs_disabled(). Reported-by: Clark Williams Signed-off-by: Sebastian Andrzej Siewior commit 4423f12da051dc72620bfa1a74d26f87b389fccf Author: Sebastian Andrzej Siewior Date: Thu Sep 26 12:29:05 2019 +0200 drm/i915: Don't disable interrupts for intel_engine_breadcrumbs_irq() The function intel_engine_breadcrumbs_irq() is always invoked from an interrupt handler and for that reason it invokes (as an optimisation) only spin_lock() for locking, assuming that the interrupts are already disabled. The function intel_engine_signal_breadcrumbs() is provided to disable interrupts while the former function is invoked so that assumption is also true for callers from preemptible context. On PREEMPT_RT local_irq_disable() really disables interrupts and this forbids invoking spin_lock(), which becomes a sleeping spinlock. This is also problematic with `threadirqs' in conjunction with irq_work. With forced threading of the interrupt handler, the handler is invoked with BH disabled but with interrupts enabled. This is okay and the lock itself is never acquired in IRQ context. This changes with irq_work (signal_irq_work()) which _still_ invokes intel_engine_breadcrumbs_irq() from IRQ context.
Lockdep should see this and complain. Acquire the locks in intel_engine_breadcrumbs_irq() with the _irqsave() suffix and let all callers invoke intel_engine_breadcrumbs_irq() directly instead of using intel_engine_signal_breadcrumbs(). Reported-by: Clark Williams Signed-off-by: Sebastian Andrzej Siewior commit 958ee9f391b11003f3af16881dda13cb185c36e5 Author: Sebastian Andrzej Siewior Date: Wed Dec 19 10:47:02 2018 +0100 drm/i915: skip DRM_I915_LOW_LEVEL_TRACEPOINTS with NOTRACE The order of the header files is important. If this header file is included after tracepoint.h was included then the NOTRACE here becomes a nop. Currently this happens for two .c files which use the tracepoints behind DRM_I915_LOW_LEVEL_TRACEPOINTS. Signed-off-by: Sebastian Andrzej Siewior commit 755fb87ccda885e741c112978f2021e2e4259167 Author: Sebastian Andrzej Siewior Date: Thu Dec 6 09:52:20 2018 +0100 drm/i915: disable tracing on -RT Luca Abeni reported this: | BUG: scheduling while atomic: kworker/u8:2/15203/0x00000003 | CPU: 1 PID: 15203 Comm: kworker/u8:2 Not tainted 4.19.1-rt3 #10 | Call Trace: | rt_spin_lock+0x3f/0x50 | gen6_read32+0x45/0x1d0 [i915] | g4x_get_vblank_counter+0x36/0x40 [i915] | trace_event_raw_event_i915_pipe_update_start+0x7d/0xf0 [i915] The tracing events, trace_i915_pipe_update_start() among others, use functions which acquire spin locks. A few trace points use intel_get_crtc_scanline(), others use ->get_vblank_counter() which also might acquire a sleeping lock. Based on this I don't see any other way than to disable trace points on RT. Cc: stable-rt@vger.kernel.org Reported-by: Luca Abeni Signed-off-by: Sebastian Andrzej Siewior commit f5bcd1ba26d178d80c1424195832325a2daa923b Author: Mike Galbraith Date: Sat Feb 27 09:01:42 2016 +0100 drm,i915: Use local_lock/unlock_irq() in intel_pipe_update_start/end() [ 8.014039] BUG: sleeping function called from invalid context at kernel/locking/rtmutex.c:918 [ 8.014041] in_atomic(): 0, irqs_disabled(): 1, pid: 78, name: kworker/u4:4 [ 8.014045] CPU: 1 PID: 78 Comm: kworker/u4:4 Not tainted 4.1.7-rt7 #5 [ 8.014055] Workqueue: events_unbound async_run_entry_fn [ 8.014059] 0000000000000000 ffff880037153748 ffffffff815f32c9 0000000000000002 [ 8.014063] ffff88013a50e380 ffff880037153768 ffffffff815ef075 ffff8800372c06c8 [ 8.014066] ffff8800372c06c8 ffff880037153778 ffffffff8107c0b3 ffff880037153798 [ 8.014067] Call Trace: [ 8.014074] [] dump_stack+0x4a/0x61 [ 8.014078] [] ___might_sleep.part.93+0xe9/0xee [ 8.014082] [] ___might_sleep+0x53/0x80 [ 8.014086] [] rt_spin_lock+0x24/0x50 [ 8.014090] [] prepare_to_wait+0x2b/0xa0 [ 8.014152] [] intel_pipe_update_start+0x17c/0x300 [i915] [ 8.014156] [] ? prepare_to_wait_event+0x120/0x120 [ 8.014201] [] intel_begin_crtc_commit+0x166/0x1e0 [i915] [ 8.014215] [] drm_atomic_helper_commit_planes+0x5d/0x1a0 [drm_kms_helper] [ 8.014260] [] intel_atomic_commit+0xab/0xf0 [i915] [ 8.014288] [] drm_atomic_commit+0x37/0x60 [drm] [ 8.014298] [] drm_atomic_helper_plane_set_property+0x8d/0xd0 [drm_kms_helper] [ 8.014301] [] ?
__ww_mutex_lock+0x39/0x40 [ 8.014319] [] drm_mode_plane_set_obj_prop+0x2d/0x90 [drm] [ 8.014328] [] restore_fbdev_mode+0x6b/0xf0 [drm_kms_helper] [ 8.014337] [] drm_fb_helper_restore_fbdev_mode_unlocked+0x29/0x80 [drm_kms_helper] [ 8.014346] [] drm_fb_helper_set_par+0x22/0x50 [drm_kms_helper] [ 8.014390] [] intel_fbdev_set_par+0x1a/0x60 [i915] [ 8.014394] [] fbcon_init+0x4f4/0x580 [ 8.014398] [] visual_init+0xbc/0x120 [ 8.014401] [] do_bind_con_driver+0x163/0x330 [ 8.014405] [] do_take_over_console+0x11c/0x1c0 [ 8.014408] [] do_fbcon_takeover+0x63/0xd0 [ 8.014410] [] fbcon_event_notify+0x785/0x8d0 [ 8.014413] [] ? __might_sleep+0x4d/0x90 [ 8.014416] [] notifier_call_chain+0x4e/0x80 [ 8.014419] [] __blocking_notifier_call_chain+0x4d/0x70 [ 8.014422] [] blocking_notifier_call_chain+0x16/0x20 [ 8.014425] [] fb_notifier_call_chain+0x1b/0x20 [ 8.014428] [] register_framebuffer+0x21a/0x350 [ 8.014439] [] drm_fb_helper_initial_config+0x274/0x3e0 [drm_kms_helper] [ 8.014483] [] intel_fbdev_initial_config+0x1b/0x20 [i915] [ 8.014486] [] async_run_entry_fn+0x4c/0x160 [ 8.014490] [] process_one_work+0x14a/0x470 [ 8.014493] [] worker_thread+0x169/0x4c0 [ 8.014496] [] ? process_one_work+0x470/0x470 [ 8.014499] [] kthread+0xc6/0xe0 [ 8.014502] [] ? queue_work_on+0x80/0x110 [ 8.014506] [] ? kthread_worker_fn+0x1c0/0x1c0 Signed-off-by: Mike Galbraith Cc: Sebastian Andrzej Siewior Cc: linux-rt-users Signed-off-by: Thomas Gleixner commit 5fbd5698ab6d48b6eb3b813aaf9dede119ec86c0 Author: Mike Galbraith Date: Sat Feb 27 08:09:11 2016 +0100 drm,radeon,i915: Use preempt_disable/enable_rt() where recommended DRM folks identified the spots, so use them. Signed-off-by: Mike Galbraith Cc: Sebastian Andrzej Siewior Cc: linux-rt-users Signed-off-by: Thomas Gleixner commit 228184a397bb646312a3b9692ecfca1131588565 Author: Sebastian Andrzej Siewior Date: Tue Oct 17 16:36:18 2017 +0200 lockdep: disable self-test The self-test wasn't always 100% accurate for RT. We disabled a few tests which failed because they had a different semantic for RT. Some still reported false positives. Now the selftest locks up the system during boot and it needs to be investigated… Signed-off-by: Sebastian Andrzej Siewior commit f521dbc7bd4bc33e46cd3f6e77c8a7e282d7919f Author: Josh Cartwright Date: Wed Jan 28 13:08:45 2015 -0600 lockdep: selftest: fix warnings due to missing PREEMPT_RT conditionals "lockdep: Selftest: Only do hardirq context test for raw spinlock" disabled the execution of certain tests with PREEMPT_RT, but did not prevent the tests from still being defined. This leads to warnings like: ./linux/lib/locking-selftest.c:574:1: warning: 'irqsafe1_hard_rlock_12' defined but not used [-Wunused-function] ./linux/lib/locking-selftest.c:574:1: warning: 'irqsafe1_hard_rlock_21' defined but not used [-Wunused-function] ./linux/lib/locking-selftest.c:577:1: warning: 'irqsafe1_hard_wlock_12' defined but not used [-Wunused-function] ./linux/lib/locking-selftest.c:577:1: warning: 'irqsafe1_hard_wlock_21' defined but not used [-Wunused-function] ./linux/lib/locking-selftest.c:580:1: warning: 'irqsafe1_soft_spin_12' defined but not used [-Wunused-function] ... Fixed by wrapping the test definitions in #ifndef CONFIG_PREEMPT_RT conditionals. 
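A minimal sketch of the described fix, using one of the test names from the warnings above (the body is elided; the real definitions are generated by macros in lib/locking-selftest.c):

  #ifndef CONFIG_PREEMPT_RT
  /* hardirq-context rwlock tests: never run on PREEMPT_RT, so only
   * define them when the configuration can actually exercise them,
   * which silences the -Wunused-function warnings quoted above.
   */
  static void irqsafe1_hard_rlock_12(void)
  {
          /* ... rwlock-in-hardirq test body ... */
  }
  #endif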
Signed-off-by: Josh Cartwright Signed-off-by: Xander Huff Acked-by: Gratian Crisan Signed-off-by: Sebastian Andrzej Siewior commit 8ff29fb487dc2fa4b1f6c83ec335241bff99609f Author: Yong Zhang Date: Mon Apr 16 15:01:56 2012 +0800 lockdep: selftest: Only do hardirq context test for raw spinlock On -rt there is no softirq context any more and rwlock is sleepable, so disable the softirq context test and the rwlock+irq test. Signed-off-by: Yong Zhang Cc: Yong Zhang Link: http://lkml.kernel.org/r/1334559716-18447-3-git-send-email-yong.zhang0@gmail.com Signed-off-by: Thomas Gleixner commit 44143c44d0c9e2a8a2089b2f7435aecc2bb9909f Author: Thomas Gleixner Date: Sun Jul 17 18:51:23 2011 +0200 lockdep: Make it RT aware Teach lockdep that we don't really do softirqs on -RT. Signed-off-by: Thomas Gleixner commit cc9bb9632e1cd389473ef053088a9629abc24479 Author: Priyanka Jain Date: Thu May 17 09:35:11 2012 +0530 net: Remove preemption disabling in netif_rx() 1) enqueue_to_backlog() (called from netif_rx) should be bound to a particular CPU. This can be achieved by disabling migration. No need to disable preemption. 2) Fixes crash "BUG: scheduling while atomic: ksoftirqd" in case of RT. If preemption is disabled, enqueue_to_backlog() is called in atomic context. And if the backlog exceeds its count, kfree_skb() is called. But in RT, kfree_skb() might get scheduled out, so it expects a non-atomic context. 3) When CONFIG_PREEMPT_RT is not defined, migrate_enable(), migrate_disable() map to preempt_enable() and preempt_disable(), so no change in functionality in case of non-RT. -Replace preempt_enable(), preempt_disable() with migrate_enable(), migrate_disable() respectively -Replace get_cpu(), put_cpu() with get_cpu_light(), put_cpu_light() respectively Signed-off-by: Priyanka Jain Acked-by: Rajan Srivastava Cc: Link: http://lkml.kernel.org/r/1337227511-2271-1-git-send-email-Priyanka.Jain@freescale.com Signed-off-by: Thomas Gleixner commit c15cde8892613ed8bb483f595cae470d4e48c5d8 Author: Thomas Gleixner Date: Tue Aug 21 20:38:50 2012 +0200 random: Make it work on rt Delegate the random insertion to the forced threaded interrupt handler. Store the return IP of the hard interrupt handler in the irq descriptor and feed it into the random generator as a source of entropy. Signed-off-by: Thomas Gleixner commit 9bb64746826ad390197f6e093db26c02f6736281 Author: Thomas Gleixner Date: Thu Dec 16 14:25:18 2010 +0100 x86: stackprotector: Avoid random pool on rt CPU bringup calls into the random pool to initialize the stack canary. During boot that works nicely even on RT as the might sleep checks are disabled. During CPU hotplug the might sleep checks trigger. Making the locks in random raw is a major PITA, so avoiding the call on RT is the only sensible solution. This is basically the same randomness which we get during boot where the random pool has no entropy and we rely on the TSC randomness. Reported-by: Carsten Emde Signed-off-by: Thomas Gleixner commit 3b7c9200fd46ab73dd9e1eeba4fac9d64723894e Author: Thomas Gleixner Date: Tue Jul 14 14:26:34 2015 +0200 panic: skip get_random_bytes for RT_FULL in init_oops_id Disable on -RT. If this is invoked from irq-context we will have problems acquiring the sleeping lock. Signed-off-by: Thomas Gleixner commit c7af6772e481d302f1901918f91d6ab74fb548b2 Author: Sebastian Andrzej Siewior Date: Thu Jul 26 18:52:00 2018 +0200 crypto: cryptd - add a lock instead preempt_disable/local_bh_disable cryptd has a per-CPU lock which is protected with local_bh_disable() and preempt_disable().
Add an explicit spin_lock to make the locking context more obvious and visible to lockdep. Since it is a per-CPU lock, there should be no lock contention on the actual spinlock. There is a small race-window where we could be migrated to another CPU after the cpu_queue has been obtained. This is not a problem because the actual resource is protected by the spinlock. Signed-off-by: Sebastian Andrzej Siewior commit 9970f167bbd9cc71a574908bf3cdeac0f1532e93 Author: Sebastian Andrzej Siewior Date: Thu Nov 30 13:40:10 2017 +0100 crypto: limit more FPU-enabled sections Those crypto drivers use SSE/AVX/… for their crypto work and in order to do so in kernel they need to enable the "FPU" in kernel mode which disables preemption. There are two problems with the way they are used: - the while loop which processes X bytes may create latency spikes and should be avoided or limited. - the cipher-walk-next part may allocate/free memory and may use kmap_atomic(). The whole kernel_fpu_begin()/end() processing probably isn't that cheap. It most likely makes sense to process as much of those as possible in one go. The new *_fpu_sched_rt() schedules only if an RT task is pending. Probably we should measure the performance of those ciphers in pure SW mode and with these optimisations to see if it makes sense to keep them for RT. This kernel_fpu_resched() makes the code more preemptible which might hurt performance. Cc: stable-rt@vger.kernel.org Signed-off-by: Sebastian Andrzej Siewior commit c554a9df238c66f125c217296a7ec5576ea1f519 Author: Sebastian Andrzej Siewior Date: Fri Feb 21 17:24:04 2014 +0100 crypto: Reduce preempt disabled regions, more algos Don Estabrook reported | kernel: WARNING: CPU: 2 PID: 858 at kernel/sched/core.c:2428 migrate_disable+0xed/0x100() | kernel: WARNING: CPU: 2 PID: 858 at kernel/sched/core.c:2462 migrate_enable+0x17b/0x200() | kernel: WARNING: CPU: 3 PID: 865 at kernel/sched/core.c:2428 migrate_disable+0xed/0x100() and his backtrace showed some crypto functions which looked fine. The problem is the following sequence: glue_xts_crypt_128bit() { blkcipher_walk_virt(); /* normal migrate_disable() */ glue_fpu_begin(); /* get atomic */ while (nbytes) { __glue_xts_crypt_128bit(); blkcipher_walk_done(); /* with nbytes = 0, migrate_enable() * while we are atomic */ }; glue_fpu_end() /* no longer atomic */ } and this is why the counter gets out of sync and the warning is printed. The other problem is that we are non-preemptible between glue_fpu_begin() and glue_fpu_end() and the latency grows. To fix this, I shorten the FPU off region and ensure blkcipher_walk_done() is called with preemption enabled. This might hurt the performance because we now enable/disable the FPU state more often but we gain lower latency and the bug is gone. Reported-by: Don Estabrook Signed-off-by: Sebastian Andrzej Siewior commit 70028f0dac37dec01005058d4205d0fae3eb0371 Author: Peter Zijlstra Date: Mon Nov 14 18:19:27 2011 +0100 x86: crypto: Reduce preempt disabled regions Restrict the preempt disabled regions to the actual floating point operations and enable preemption for the administrative actions. This is necessary on RT to avoid kfree and other operations being called with preemption disabled.
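Sketched out, the reshuffled loop from the description above looks roughly like this (fragment only, reusing the helper names quoted in the changelog; argument lists are elided):

  blkcipher_walk_virt(desc, &walk);
  while ((nbytes = walk.nbytes)) {
          glue_fpu_begin(...);                    /* FPU/SIMD on only for this chunk */
          nbytes = __glue_xts_crypt_128bit(...);  /* the actual SIMD work            */
          glue_fpu_end(...);                      /* FPU off again, preemption back  */
          err = blkcipher_walk_done(desc, &walk, nbytes);  /* may allocate/sleep     */
  }

The trade-off is exactly the one stated above: more frequent FPU enable/disable transitions in exchange for bounded preempt-off regions and a walker that runs preemptibly.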
Reported-and-tested-by: Carsten Emde Signed-off-by: Peter Zijlstra Signed-off-by: Thomas Gleixner commit 459407acd571254b2264a39b28b6a01ff76f217c Author: Sebastian Andrzej Siewior Date: Tue Jun 23 15:32:51 2015 +0200 irqwork: push most work into softirq context Initially we deferred all irqwork into softirq because we didn't want the latency spikes if perf or another user was busy and delayed the RT task. The NOHZ trigger (nohz_full_kick_work) was the first user that did not work as expected if it did not run in the original irqwork context so we had to bring it back somehow for it. push_irq_work_func is the second one that requires this. This patch adds the IRQ_WORK_HARD_IRQ flag which makes sure the callback runs in raw-irq context. Everything else is deferred into softirq context. Without -RT we have the original behavior. This patch incorporates tglx's original work, reworked a little, bringing back arch_irq_work_raise() if possible, and a few fixes from Steven Rostedt and Mike Galbraith, [bigeasy: melt tglx's irq_work_tick_soft() which splits irq_work_tick() into a hard and soft variant] Signed-off-by: Sebastian Andrzej Siewior commit afc978cafd90f3b3df5d17bc9431ec5007a430ed Author: Sebastian Andrzej Siewior Date: Wed Mar 30 13:36:29 2016 +0200 net: dev: always take qdisc's busylock in __dev_xmit_skb() The root-lock is dropped before dev_hard_start_xmit() is invoked and after setting the __QDISC___STATE_RUNNING bit. If this task is now pushed away by a task with a higher priority, then the task with the higher priority won't be able to submit packets to the NIC directly; instead they will be enqueued into the Qdisc. The NIC will remain idle until the task(s) with higher priority leave the CPU and the task with lower priority gets back and finishes the job. If we always take the busylock we ensure that the RT task can boost the low-prio task and submit the packet. Signed-off-by: Sebastian Andrzej Siewior commit 1daee3cadaf57c6d81591a820fc7d73155968a10 Author: Thomas Gleixner Date: Tue Jul 12 15:38:34 2011 +0200 net: Use skbufhead with raw lock Use the rps lock as a rawlock so we can keep irq-off regions. It looks low latency. However we can't kfree() from this context, therefore we defer this to the softirq and use the tofree_queue list for it (similar to process_queue). Signed-off-by: Thomas Gleixner commit 49852dc0b90eeb31d80c6079a997f5f09fdc7e16 Author: Thomas Gleixner Date: Sun Jul 17 21:41:35 2011 +0200 debugobjects: Make RT aware Avoid filling the pool / allocating memory with irqs off. Signed-off-by: Thomas Gleixner commit 289dcb7e4bb9238a0f29e5dcb028be12c5a84681 Author: Thomas Gleixner Date: Wed Mar 7 21:10:04 2012 +0100 net: Use cpu_chill() instead of cpu_relax() Retry loops on RT might loop forever when the modifying side was preempted. Use cpu_chill() instead of cpu_relax() to let the system make progress. Signed-off-by: Thomas Gleixner commit d35e246eb7f73729cda1f9640406eb15017fbabf Author: Thomas Gleixner Date: Wed Mar 7 21:00:34 2012 +0100 fs: namespace: Use cpu_chill() in trylock loops Retry loops on RT might loop forever when the modifying side was preempted. Use cpu_chill() instead of cpu_relax() to let the system make progress. Signed-off-by: Thomas Gleixner commit 7bd50b99052cecaca5f196407a7a369d598c777b Author: Thomas Gleixner Date: Thu Dec 20 18:28:26 2012 +0100 block: Use cpu_chill() for retry loops Retry loops on RT might loop forever when the modifying side was preempted. Steven also observed a live lock when there was a concurrent priority boosting going on.
Use cpu_chill() instead of cpu_relax() to let the system make progress. [bigeasy: After all those changes that occurred over the years, this one hunk is left and should not cause any starvation on -RT anymore] Signed-off-by: Thomas Gleixner commit 32f35d21bf065ee4d525b44e5ba47a674cd12fc3 Author: Thomas Gleixner Date: Wed Mar 7 20:51:03 2012 +0100 rt: Introduce cpu_chill() Retry loops on RT might loop forever when the modifying side was preempted. Add cpu_chill() to replace cpu_relax(). cpu_chill() defaults to cpu_relax() for non-RT. On RT it puts the looping task to sleep for a tick so the preempted task can make progress. Steven Rostedt changed it to use a hrtimer instead of msleep(): | |Ulrich Obergfell pointed out that cpu_chill() calls msleep() which is woken |up by the ksoftirqd running the TIMER softirq. But as the cpu_chill() is |called from softirq context, it may block the ksoftirqd() from running, in |which case, it may never wake up the msleep() causing the deadlock. + bigeasy later changed to schedule_hrtimeout() |If a task calls cpu_chill() and gets woken up by a regular or spurious |wakeup and has a signal pending, then it exits the sleep loop in |do_nanosleep() and sets up the restart block. If restart->nanosleep.type is |not TI_NONE then this results in accessing a stale user pointer from a |previously interrupted syscall and a copy to user based on the stale |pointer or a BUG() when 'type' is not supported in nanosleep_copyout(). + bigeasy: add PF_NOFREEZE: | [....] Waiting for /dev to be fully populated... | ===================================== | [ BUG: udevd/229 still has locks held! ] | 3.12.11-rt17 #23 Not tainted | ------------------------------------- | 1 lock held by udevd/229: | #0: (&type->i_mutex_dir_key#2){+.+.+.}, at: lookup_slow+0x28/0x98 | | stack backtrace: | CPU: 0 PID: 229 Comm: udevd Not tainted 3.12.11-rt17 #23 | (unwind_backtrace+0x0/0xf8) from (show_stack+0x10/0x14) | (show_stack+0x10/0x14) from (dump_stack+0x74/0xbc) | (dump_stack+0x74/0xbc) from (do_nanosleep+0x120/0x160) | (do_nanosleep+0x120/0x160) from (hrtimer_nanosleep+0x90/0x110) | (hrtimer_nanosleep+0x90/0x110) from (cpu_chill+0x30/0x38) | (cpu_chill+0x30/0x38) from (dentry_kill+0x158/0x1ec) | (dentry_kill+0x158/0x1ec) from (dput+0x74/0x15c) | (dput+0x74/0x15c) from (lookup_real+0x4c/0x50) | (lookup_real+0x4c/0x50) from (__lookup_hash+0x34/0x44) | (__lookup_hash+0x34/0x44) from (lookup_slow+0x38/0x98) | (lookup_slow+0x38/0x98) from (path_lookupat+0x208/0x7fc) | (path_lookupat+0x208/0x7fc) from (filename_lookup+0x20/0x60) | (filename_lookup+0x20/0x60) from (user_path_at_empty+0x50/0x7c) | (user_path_at_empty+0x50/0x7c) from (user_path_at+0x14/0x1c) | (user_path_at+0x14/0x1c) from (vfs_fstatat+0x48/0x94) | (vfs_fstatat+0x48/0x94) from (SyS_stat64+0x14/0x30) | (SyS_stat64+0x14/0x30) from (ret_fast_syscall+0x0/0x48) Signed-off-by: Thomas Gleixner Signed-off-by: Steven Rostedt Signed-off-by: Sebastian Andrzej Siewior commit e1631aca3ee3d60ff3a5fa86c4e96141b7de21d3 Author: Mike Galbraith Date: Wed Feb 18 16:05:28 2015 +0100 sunrpc: Make svc_xprt_do_enqueue() use get_cpu_light() |BUG: sleeping function called from invalid context at kernel/locking/rtmutex.c:915 |in_atomic(): 1, irqs_disabled(): 0, pid: 3194, name: rpc.nfsd |Preemption disabled at:[] svc_xprt_received+0x4b/0xc0 [sunrpc] |CPU: 6 PID: 3194 Comm: rpc.nfsd Not tainted 3.18.7-rt1 #9 |Hardware name: MEDION MS-7848/MS-7848, BIOS M7848W08.404 11/06/2014 | ffff880409630000 ffff8800d9a33c78 ffffffff815bdeb5 0000000000000002 | 0000000000000000
ffff8800d9a33c98 ffffffff81073c86 ffff880408dd6008 | ffff880408dd6000 ffff8800d9a33cb8 ffffffff815c3d84 ffff88040b3ac000 |Call Trace: | [] dump_stack+0x4f/0x9e | [] __might_sleep+0xe6/0x150 | [] rt_spin_lock+0x24/0x50 | [] svc_xprt_do_enqueue+0x80/0x230 [sunrpc] | [] svc_xprt_received+0x4b/0xc0 [sunrpc] | [] svc_add_new_perm_xprt+0x6d/0x80 [sunrpc] | [] svc_addsock+0x143/0x200 [sunrpc] | [] write_ports+0x28c/0x340 [nfsd] | [] nfsctl_transaction_write+0x4c/0x80 [nfsd] | [] vfs_write+0xb3/0x1d0 | [] SyS_write+0x49/0xb0 | [] system_call_fastpath+0x16/0x1b Signed-off-by: Mike Galbraith Signed-off-by: Sebastian Andrzej Siewior commit c74011a12990f7e6529518288c32a3da541dc3cb Author: Thomas Gleixner Date: Sat Nov 12 14:00:48 2011 +0100 scsi/fcoe: Make RT aware. Do not disable preemption while taking sleeping locks. All users look safe for migrate_disable() only. Signed-off-by: Thomas Gleixner commit ae8acffa3df219b2f37a21caa7727b647b7c024e Author: Thomas Gleixner Date: Tue Apr 6 16:51:31 2010 +0200 md: raid5: Make raid5_percpu handling RT aware __raid_run_ops() disables preemption with get_cpu() around the access to the raid5_percpu variables. That causes scheduling while atomic spews on RT. Serialize the access to the percpu data with a lock and keep the code preemptible. Reported-by: Udo van den Heuvel Signed-off-by: Thomas Gleixner Tested-by: Udo van den Heuvel commit 248851e15a4e5cba63e1f54ae3b4e28f8e78d016 Author: Sebastian Andrzej Siewior Date: Thu Jan 29 15:10:08 2015 +0100 block/mq: don't complete requests via IPI The IPI runs in hardirq context and there are sleeping locks. Assume caches are shared and complete them on the local CPU. Signed-off-by: Sebastian Andrzej Siewior commit e92d7168ed51919ff6743369e27cc29d868a95a3 Author: Sebastian Andrzej Siewior Date: Tue Jul 14 14:26:34 2015 +0200 block/mq: do not invoke preempt_disable() preempt_disable() and get_cpu() don't play well together with the sleeping locks it tries to allocate later. It seems to be enough to replace it with get_cpu_light() and migrate_disable(). Signed-off-by: Sebastian Andrzej Siewior commit fc41d69dd3d143d5d94ae1c72b593a56ba9afc37 Author: Thomas Gleixner Date: Tue Jul 12 11:39:36 2011 +0200 mm/vmalloc: Another preempt disable region which sucks Avoid the preempt disable version of get_cpu_var(). The inner-lock should provide enough serialisation. Signed-off-by: Thomas Gleixner commit 4cc9f2b1475f3d291b84a48803703ecdf0fb5ef8 Author: Thomas Gleixner Date: Fri Jul 8 16:35:35 2011 +0200 fs/epoll: Do not disable preemption on RT ep_call_nested() takes a sleeping lock so we can't disable preemption. The light version is enough since ep_call_nested() doesn't mind being invoked twice on the same CPU. Signed-off-by: Thomas Gleixner commit 57b920baacaefe1ba1f8183fb6a40e3cf71cc3da Author: Ingo Molnar Date: Wed Dec 14 13:05:54 2011 +0100 rt: Improve the serial console PASS_LIMIT Beyond the warning: drivers/tty/serial/8250/8250.c:1613:6: warning: unused variable ‘pass_counter’ [-Wunused-variable] the solution of just looping infinitely was ugly - up it to 1 million to give it a chance to continue in some really ugly situation. Signed-off-by: Ingo Molnar Signed-off-by: Thomas Gleixner commit 4590602756e69c67931ef37d6360008ce78f9060 Author: Scott Wood Date: Wed Sep 11 17:57:29 2019 +0100 rcutorture: Avoid problematic critical section nesting on RT rcutorture was generating some nesting scenarios that are not reasonable. Constrain the state selection to avoid them. Example #1: 1. preempt_disable() 2. local_bh_disable() 3.
preempt_enable() 4. local_bh_enable() On PREEMPT_RT, BH disabling takes a local lock only when called in non-atomic context. Thus, atomic context must be retained until after BH is re-enabled. Likewise, if BH is initially disabled in non-atomic context, it cannot be re-enabled in atomic context. Example #2: 1. rcu_read_lock() 2. local_irq_disable() 3. rcu_read_unlock() 4. local_irq_enable() If the thread is preempted between steps 1 and 2, rcu_read_unlock_special.b.blocked will be set, but it won't be acted on in step 3 because IRQs are disabled. Thus, reporting of the quiescent state will be delayed beyond the local_irq_enable(). For now, these scenarios will continue to be tested on non-PREEMPT_RT kernels, until debug checks are added to ensure that they are not happening elsewhere. Signed-off-by: Scott Wood Signed-off-by: Sebastian Andrzej Siewior commit dccb961c56443c469469cd1bc363b78c1cd61c6c Author: Julia Cartwright Date: Wed Oct 12 11:21:14 2016 -0500 rcu: enable rcu_normal_after_boot by default for RT The forcing of an expedited grace period is an expensive and very RT-application unfriendly operation, as it forcibly preempts all running tasks on CPUs which are preventing the gp from expiring. By default, as a policy decision, disable the expediting of grace periods (after boot) on configurations which enable PREEMPT_RT. Suggested-by: Luiz Capitulino Acked-by: Paul E. McKenney Signed-off-by: Julia Cartwright Signed-off-by: Sebastian Andrzej Siewior commit 8a5f2a03d28c1523e49a5c501b74ccc567d321f7 Author: Sebastian Andrzej Siewior Date: Thu Oct 12 18:37:12 2017 +0200 srcu: replace local_irqsave() with a locallock There are two instances which disable interrupts in order to get a stable this_cpu_ptr() pointer. The restore part is coupled with spin_unlock_irqrestore() which does not work on RT. Replace the local_irq_save() call with the appropriate local_lock() version of it. Signed-off-by: Sebastian Andrzej Siewior commit 9537d44330ae6991dcb81e7c1fc3d21133861b3d Author: Scott Wood Date: Wed Sep 11 17:57:28 2019 +0100 rcu: Use rcuc threads on PREEMPT_RT as we did While switching to the reworked RCU-thread code, enabling the thread processing on -RT was forgotten. Besides restoring behavior that used to be default on RT, this avoids a deadlock on scheduler locks. Signed-off-by: Scott Wood Signed-off-by: Sebastian Andrzej Siewior commit 6a735ed61780f625f342ac4a27499d37646eb264 Author: Sebastian Andrzej Siewior Date: Tue Nov 19 09:25:04 2019 +0100 locking: Make spinlock_t and rwlock_t a RCU section on RT On !RT a locked spinlock_t and rwlock_t disables preemption which implies an RCU read section. There is code that relies on that behaviour. Add an explicit RCU read section on RT while a sleeping lock (a lock which would disable preemption on !RT) is acquired. Signed-off-by: Sebastian Andrzej Siewior commit 8c4eab816edd78a40a502a740ac44b66ed636f78 Author: Sebastian Andrzej Siewior Date: Fri Aug 4 17:40:42 2017 +0200 locking: don't check for __LINUX_SPINLOCK_TYPES_H on -RT archs Upstream uses arch_spinlock_t within spinlock_t and requests that the spinlock_types.h header file is included first. On -RT we have the rt_mutex with its raw_lock wait_lock which needs the architectures' spinlock_types.h header file for its definition. However we need rt_mutex first because it is used to build the spinlock_t, so that check does not work for us. Therefore I am dropping that check.
Signed-off-by: Sebastian Andrzej Siewior commit 020123a7ac3c8e333f9de18e0c691b17888811e7 Author: Thomas Gleixner Date: Wed Mar 8 14:23:35 2017 +0100 futex: workaround migrate_disable/enable in different context migrate_enable() invokes __schedule() and it expects a preempt count of one. Holding a raw_spinlock_t with disabled interrupts should not allow scheduling. These little hacks ensure that we don't schedule while we lock the hb lock with interrupts enabled and unlock it with interrupts disabled. Signed-off-by: Thomas Gleixner [XXX: As per PeterZ suggestion set_thread_flag(TIF_NEED_RESCHED); preempt_fold_need_resched() would trigger a scheduler invocation on the last preempt_enable() which in turn would allow dropping this. ] Signed-off-by: Sebastian Andrzej Siewior commit b8ee5b0a17bf4c5c4daee5debd9f056f9a9ce617 Author: Thomas Gleixner Date: Sun Jul 17 21:56:42 2011 +0200 trace: Add migrate-disabled counter to tracing output Signed-off-by: Thomas Gleixner commit e6c287b1512dbb2e8235adba2ff7d4c2f1d7bb9e Author: Scott Wood Date: Sat Oct 12 01:52:14 2019 -0500 sched: migrate_enable: Use stop_one_cpu_nowait() migrate_enable() can be called with current->state != TASK_RUNNING. Avoid clobbering the existing state by using stop_one_cpu_nowait(). Since we're stopping the current cpu, we know that we won't get past __schedule() until migration_cpu_stop() has run (at least up to the point of migrating us to another cpu). Signed-off-by: Scott Wood [bigeasy: spin until the request has been processed] Signed-off-by: Sebastian Andrzej Siewior commit 956aceddd944f381b87e57d7ca58767029414885 Author: Sebastian Andrzej Siewior Date: Fri Nov 29 17:24:55 2019 +0100 sched/core: migrate_enable() must access takedown_cpu_task on !HOTPLUG_CPU The variable takedown_cpu_task is never declared/used on !HOTPLUG_CPU except for migrate_enable(). This leads to a link error. Don't use takedown_cpu_task in !HOTPLUG_CPU. Reported-by: Dick Hollenbeck Signed-off-by: Sebastian Andrzej Siewior commit 1100372be1ea4b8988b1b2c4290547be1b306a1e Author: Sebastian Andrzej Siewior Date: Sat May 27 19:02:06 2017 +0200 kernel/sched/core: add migrate_disable() [bristot@redhat.com: rt: Increase/decrease the nr of migratory tasks when enabling/disabling migration Link: https://lkml.kernel.org/r/e981d271cbeca975bca710e2fbcc6078c09741b0.1498482127.git.bristot@redhat.com ] [swood@redhat.com: fixups and optimisations Link:https://lkml.kernel.org/r/20190727055638.20443-1-swood@redhat.com Link:https://lkml.kernel.org/r/20191012065214.28109-1-swood@redhat.com ] Signed-off-by: Sebastian Andrzej Siewior commit 9ea4cba8f7a5718e73811a57caa237e705f9f2d2 Author: Scott Wood Date: Sat Jul 27 00:56:32 2019 -0500 sched: __set_cpus_allowed_ptr(): Check cpus_mask, not cpus_ptr This function is concerned with the long-term cpu mask, not the transitory mask the task might have while migrate disabled. Before this patch, if a task was migrate disabled at the time __set_cpus_allowed_ptr() was called, and the new mask happened to be equal to the cpu that the task was running on, then the mask update would be lost. Signed-off-by: Scott Wood Signed-off-by: Sebastian Andrzej Siewior commit 3117ce0a85abf8ce490f680018d41fcae0cf3db4 Author: Sebastian Andrzej Siewior Date: Thu Aug 29 18:21:04 2013 +0200 ptrace: fix ptrace vs tasklist_lock race As explained by Alexander Fyodorov : |read_lock(&tasklist_lock) in ptrace_stop() is converted to mutex on RT kernel, |and it can remove __TASK_TRACED from task->state (by moving it to |task->saved_state).
If parent does wait() on child followed by a sys_ptrace |call, the following race can happen: | |- child sets __TASK_TRACED in ptrace_stop() |- parent does wait() which eventually calls wait_task_stopped() and returns | child's pid |- child blocks on read_lock(&tasklist_lock) in ptrace_stop() and moves | __TASK_TRACED flag to saved_state |- parent calls sys_ptrace, which calls ptrace_check_attach() and wait_task_inactive() The patch is based on his initial patch where an additional check is added in case the __TASK_TRACED moved to ->saved_state. The pi_lock is taken in case the caller is interrupted between looking into ->state and ->saved_state. Signed-off-by: Sebastian Andrzej Siewior commit a7846823256d94fb0cb8b446a6e06f8f5d450e59 Author: Sebastian Andrzej Siewior Date: Thu Nov 16 16:48:48 2017 +0100 locking/rtmutex: re-init the wait_lock in rt_mutex_init_proxy_locked() We could provide a key-class for the lockdep (and fixup all callers) or move the init to all callers (like it was) in order to avoid lockdep seeing a double-lock of the wait_lock. Reported-by: Fernando Lopez-Lezcano Signed-off-by: Sebastian Andrzej Siewior commit 935531e0dc952155ef237a5153ce59d9bed66b2d Author: Scott Wood Date: Fri Jan 4 15:33:21 2019 -0500 locking/rt-mutex: Flush block plug on __down_read() __down_read() bypasses the rtmutex frontend to call rt_mutex_slowlock_locked() directly, and thus it needs to call blk_schedule_flush_plug() itself. Cc: stable-rt@vger.kernel.org Signed-off-by: Scott Wood Signed-off-by: Sebastian Andrzej Siewior commit 08d6251af8157b330c56e5c81a6e64f2035efd37 Author: Mikulas Patocka Date: Mon Nov 13 12:56:53 2017 -0500 locking/rt-mutex: fix deadlock in device mapper / block-IO When some block device driver creates a bio and submits it to another block device driver, the bio is added to current->bio_list (in order to avoid unbounded recursion). However, this queuing of bios can cause deadlocks; in order to avoid them, device mapper registers a function flush_current_bio_list. This function is called when the device mapper driver blocks. It redirects bios queued on current->bio_list to helper workqueues, so that these bios can proceed even if the driver is blocked. The problem with CONFIG_PREEMPT_RT is that when the device mapper driver blocks, it won't call flush_current_bio_list (because tsk_is_pi_blocked returns true in sched_submit_work), so deadlocks in the block device stack can happen. Note that we can't call blk_schedule_flush_plug if tsk_is_pi_blocked returns true - that would cause BUG_ON(rt_mutex_real_waiter(task->pi_blocked_on)) in task_blocks_on_rt_mutex when flush_current_bio_list attempts to take a spinlock. So the proper fix is to call blk_schedule_flush_plug in rt_mutex_fastlock, when the fast acquire failed and the task is about to block.
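The fix described in the device mapper / block-IO commit above amounts to flushing the current task's plugged block requests before it blocks on a sleeping rt_mutex. A minimal sketch of that idea follows (simplified; the actual patch places this inside the rt_mutex locking path and its guard conditions are more involved, while the two block-layer helpers used here are the regular ones):

    /*
     * Sketch only: flush plugged block I/O before the task blocks on a
     * sleeping lock, so bios queued on current->bio_list can make progress.
     */
    static inline void flush_plug_before_block(void)
    {
        if (blk_needs_flush_plug(current))
            blk_schedule_flush_plug(current);
    }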
CC: stable-rt@vger.kernel.org [bigeasy: The deadlock is not device-mapper specific, it can also occur in plain EXT4] Signed-off-by: Mikulas Patocka Signed-off-by: Sebastian Andrzej Siewior commit 6abfb596762b185b24a4f1038f3a638083e6ad45 Author: Sebastian Andrzej Siewior Date: Thu Oct 12 17:34:38 2017 +0200 rtmutex: add ww_mutex addon for mutex-rt Signed-off-by: Sebastian Andrzej Siewior commit 218eb0302e55a1dc39f36f623f32918b11bb4931 Author: Thomas Gleixner Date: Thu Oct 12 17:31:14 2017 +0200 rtmutex: wire up RT's locking Signed-off-by: Thomas Gleixner Signed-off-by: Sebastian Andrzej Siewior commit 385fcb5719e578604c088978d79fe44632b33f5e Author: Thomas Gleixner Date: Thu Oct 12 17:18:06 2017 +0200 rtmutex: add rwlock implementation based on rtmutex The implementation is bias-based, similar to the rwsem implementation. Signed-off-by: Thomas Gleixner Signed-off-by: Sebastian Andrzej Siewior commit 8e181ba8b077d091410121755b076bddb11d6c03 Author: Thomas Gleixner Date: Thu Oct 12 17:28:34 2017 +0200 rtmutex: add rwsem implementation based on rtmutex The RT specific R/W semaphore implementation restricts the number of readers to one because a writer cannot block on multiple readers and inherit its priority or budget. The single reader restricting is painful in various ways: - Performance bottleneck for multi-threaded applications in the page fault path (mmap sem) - Progress blocker for drivers which are carefully crafted to avoid the potential reader/writer deadlock in mainline. The analysis of the writer code pathes shows, that properly written RT tasks should not take them. Syscalls like mmap(), file access which take mmap sem write locked have unbound latencies which are completely unrelated to mmap sem. Other R/W sem users like graphics drivers are not suitable for RT tasks either. So there is little risk to hurt RT tasks when the RT rwsem implementation is changed in the following way: - Allow concurrent readers - Make writers block until the last reader left the critical section. This blocking is not subject to priority/budget inheritance. - Readers blocked on a writer inherit their priority/budget in the normal way. There is a drawback with this scheme. R/W semaphores become writer unfair though the applications which have triggered writer starvation (mostly on mmap_sem) in the past are not really the typical workloads running on a RT system. So while it's unlikely to hit writer starvation, it's possible. If there are unexpected workloads on RT systems triggering it, we need to rethink the approach. Signed-off-by: Thomas Gleixner Signed-off-by: Sebastian Andrzej Siewior commit 7f6a0d6961fed83b7d045dfe2ad32b68cb3b1a53 Author: Thomas Gleixner Date: Thu Oct 12 17:17:03 2017 +0200 rtmutex: add mutex implementation based on rtmutex Signed-off-by: Thomas Gleixner Signed-off-by: Sebastian Andrzej Siewior commit d80e1718e14e1c1fb57ba2fc0d9e95a2600ad928 Author: Sebastian Andrzej Siewior Date: Wed Dec 2 11:34:07 2015 +0100 rtmutex: trylock is okay on -RT non-RT kernel could deadlock on rt_mutex_trylock() in softirq context. On -RT we don't run softirqs in IRQ context but in thread context so it is not a issue here. Signed-off-by: Sebastian Andrzej Siewior commit 0269356d1f7d0d22675528bed02d26b8e2443c13 Author: Peter Zijlstra Date: Mon Sep 30 18:15:44 2019 +0200 locking/rtmutex: Clean ->pi_blocked_on in the error case The function rt_mutex_wait_proxy_lock() cleans ->pi_blocked_on in case of failure (timeout, signal). The same cleanup is required in __rt_mutex_start_proxy_lock(). 
In both cases the task was interrupted by a signal or timeout while acquiring the lock, and after the interruption it no longer blocks on the lock. Fixes: 1a1fb985f2e2b ("futex: Handle early deadlock return correctly") Signed-off-by: Peter Zijlstra (Intel) Signed-off-by: Sebastian Andrzej Siewior commit f0b976c874de810888b4b9b7a9a8f16018fe08d7 Author: Thomas Gleixner Date: Sun Jul 17 22:51:33 2011 +0200 sched: Use the proper LOCK_OFFSET for cond_resched() RT does not increment the preempt count when a 'sleeping' spinlock is locked. Update PREEMPT_LOCK_OFFSET for that case. Signed-off-by: Thomas Gleixner commit 0fb4c8160c7fd614aeb38a3db12f6b6c504cf284 Author: Thomas Gleixner Date: Thu Oct 12 17:11:19 2017 +0200 rtmutex: add sleeping lock implementation Signed-off-by: Thomas Gleixner Signed-off-by: Sebastian Andrzej Siewior commit c0c0efdc2ee4b7d48c04a071ef0f6b62a5c93243 Author: Thomas Gleixner Date: Thu Oct 12 16:36:39 2017 +0200 rtmutex: export lockdep-less version of rt_mutex's lock, trylock and unlock Required for the lock implementation on top of rtmutex. Signed-off-by: Thomas Gleixner Signed-off-by: Sebastian Andrzej Siewior commit 14226b6ed4f57a5430dcaa147d5884f356183aa4 Author: Thomas Gleixner Date: Thu Oct 12 16:14:22 2017 +0200 rtmutex: Provide rt_mutex_slowlock_locked() This is the inner part of rt_mutex_slowlock(), required for rwsem-rt. Signed-off-by: Thomas Gleixner Signed-off-by: Sebastian Andrzej Siewior commit 04ce694e75a17c02586fdec53c7f9031e897ce5a Author: Sebastian Andrzej Siewior Date: Tue Dec 17 22:00:21 2019 +0100 rbtree: don't include the rcu header The RCU header pulls in spinlock.h and fails due to not-yet-defined types: |In file included from include/linux/spinlock.h:275:0, | from include/linux/rcupdate.h:38, | from include/linux/rbtree.h:34, | from include/linux/rtmutex.h:17, | from include/linux/spinlock_types.h:18, | from kernel/bounds.c:13: |include/linux/rwlock_rt.h:16:38: error: unknown type name ‘rwlock_t’ | extern void __lockfunc rt_write_lock(rwlock_t *rwlock); | ^ This patch moves the required RCU function from the rcupdate.h header file into a new header file which can be included by both users. Signed-off-by: Sebastian Andrzej Siewior commit 2c4e68dcae7063a9cd226e58346ba00988212e20 Author: Thomas Gleixner Date: Wed Jun 29 20:06:39 2011 +0200 rtmutex: Avoid include hell Include only the required raw types. This avoids pulling in the complete spinlock header which in turn requires rtmutex.h at some point. Signed-off-by: Thomas Gleixner commit 61ff0ce6f914fabe3129f2b35a30b8f65c7c39b5 Author: Thomas Gleixner Date: Wed Jun 29 19:34:01 2011 +0200 spinlock: Split the lock types header Split raw_spinlock into its own file and the remaining spinlock_t into its own non-RT header. The non-RT header will be replaced later by sleeping spinlocks. Signed-off-by: Thomas Gleixner commit c058ad01ac2f6ee960ed735c173019fd65ccfb71 Author: Thomas Gleixner Date: Sat Apr 1 12:50:59 2017 +0200 rtmutex: Make lock_killable work Locking an rt mutex killable does not work because signal handling is restricted to TASK_INTERRUPTIBLE. Use signal_pending_state() unconditionally. Cc: stable-rt@vger.kernel.org Signed-off-by: Thomas Gleixner Signed-off-by: Sebastian Andrzej Siewior commit a4498c9adf99050fe67269168090955473796978 Author: Thomas Gleixner Date: Thu Jun 9 11:43:52 2011 +0200 rtmutex: Add rtmutex_lock_killable() Add a "killable" type to rtmutex. We need this since rtmutexes are used as "normal" mutexes which do use this type.
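The lock_killable change above hinges on signal_pending_state(), which honours whichever sleep state the waiter requested. A rough sketch of a killable wait loop, assuming a toy lock type and a hypothetical try_take_lock() helper (everything else is the regular API):

    struct my_lock { atomic_t owned; };            /* toy lock, illustration only */

    static bool try_take_lock(struct my_lock *l)   /* hypothetical helper */
    {
        return atomic_cmpxchg(&l->owned, 0, 1) == 0;
    }

    static int lock_killable_sketch(struct my_lock *l)
    {
        for (;;) {
            set_current_state(TASK_KILLABLE);
            if (try_take_lock(l))
                break;
            /* with TASK_KILLABLE this reacts to fatal signals only */
            if (signal_pending_state(TASK_KILLABLE, current)) {
                __set_current_state(TASK_RUNNING);
                return -EINTR;
            }
            schedule();
        }
        __set_current_state(TASK_RUNNING);
        return 0;
    }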
Signed-off-by: Thomas Gleixner commit 87e845062c447a54d5cff5632730c88d07262f06 Author: Wolfgang M. Reimer Date: Tue Jul 21 16:20:07 2015 +0200 locking: locktorture: Do NOT include rwlock.h directly Including rwlock.h directly will cause kernel builds to fail if CONFIG_PREEMPT_RT is defined. The correct header file (rwlock_rt.h OR rwlock.h) will be included by spinlock.h which is included by locktorture.c anyway. Cc: stable-rt@vger.kernel.org Signed-off-by: Wolfgang M. Reimer Signed-off-by: Sebastian Andrzej Siewior commit c78979d327cbe034b8b3ab48ed1528b7e2e0ef63 Author: Grygorii Strashko Date: Tue Jul 21 19:43:56 2015 +0300 pid.h: include atomic.h This patch fixes build error: CC kernel/pid_namespace.o In file included from kernel/pid_namespace.c:11:0: include/linux/pid.h: In function 'get_pid': include/linux/pid.h:78:3: error: implicit declaration of function 'atomic_inc' [-Werror=implicit-function-declaration] atomic_inc(&pid->count); ^ which happens when CONFIG_PROVE_LOCKING=n CONFIG_DEBUG_SPINLOCK=n CONFIG_DEBUG_MUTEXES=n CONFIG_DEBUG_LOCK_ALLOC=n CONFIG_PID_NS=y Vanilla gets this via spinlock.h. Signed-off-by: Grygorii Strashko Signed-off-by: Sebastian Andrzej Siewior commit 9196f580e015ce1c8a27e38ac01affaa6fcbc333 Author: Thomas Gleixner Date: Fri Mar 1 11:17:42 2013 +0100 futex: Ensure lock/unlock symetry versus pi_lock and hash bucket lock In exit_pi_state_list() we have the following locking construct: spin_lock(&hb->lock); raw_spin_lock_irq(&curr->pi_lock); ... spin_unlock(&hb->lock); In !RT this works, but on RT the migrate_enable() function which is called from spin_unlock() sees atomic context due to the held pi_lock and just decrements the migrate_disable_atomic counter of the task. Now the next call to migrate_disable() sees the counter being negative and issues a warning. That check should be in migrate_enable() already. Fix this by dropping pi_lock before unlocking hb->lock and reaquire pi_lock after that again. This is safe as the loop code reevaluates head again under the pi_lock. Reported-by: Yong Zhang Signed-off-by: Thomas Gleixner Signed-off-by: Sebastian Andrzej Siewior commit 65c0768e32a51742d2254c4b4d87934dbcf4ec33 Author: Steven Rostedt Date: Tue Jul 14 14:26:34 2015 +0200 futex: Fix bug on when a requeued RT task times out Requeue with timeout causes a bug with PREEMPT_RT. The bug comes from a timed out condition. TASK 1 TASK 2 ------ ------ futex_wait_requeue_pi() futex_wait_queue_me() double_lock_hb(); raw_spin_lock(pi_lock); if (current->pi_blocked_on) { } else { current->pi_blocked_on = PI_WAKE_INPROGRESS; run_spin_unlock(pi_lock); spin_lock(hb->lock); <-- blocked! plist_for_each_entry_safe(this) { rt_mutex_start_proxy_lock(); task_blocks_on_rt_mutex(); BUG_ON(task->pi_blocked_on)!!!! The BUG_ON() actually has a check for PI_WAKE_INPROGRESS, but the problem is that, after TASK 1 sets PI_WAKE_INPROGRESS, it then tries to grab the hb->lock, which it fails to do so. As the hb->lock is a mutex, it will block and set the "pi_blocked_on" to the hb->lock. When TASK 2 goes to requeue it, the check for PI_WAKE_INPROGESS fails because the task1's pi_blocked_on is no longer set to that, but instead, set to the hb->lock. The fix: When calling rt_mutex_start_proxy_lock() a check is made to see if the proxy tasks pi_blocked_on is set. If so, exit out early. Otherwise set it to a new flag PI_REQUEUE_INPROGRESS, which notifies the proxy task that it is being requeued, and will handle things appropriately. 
Signed-off-by: Steven Rostedt Signed-off-by: Thomas Gleixner commit 17a62333ac5d375f3cd66b92b586480d1aeb1f48 Author: Thomas Gleixner Date: Fri Jun 10 11:04:15 2011 +0200 rtmutex: Handle the various new futex race conditions RT opens a few new interesting race conditions in the rtmutex/futex combo due to the futex hash bucket lock being a 'sleeping' spinlock and therefore not disabling preemption. Signed-off-by: Thomas Gleixner commit 665a857aea1534e328cca878a4bccaac9c365d81 Author: Sebastian Andrzej Siewior Date: Fri Jun 16 19:03:16 2017 +0200 net/core: use local_bh_disable() in netif_rx_ni() In 2004 netif_rx_ni() gained a preempt_disable() section around netif_rx() and its do_softirq() + testing for it. The do_softirq() part is required because netif_rx() raises the softirq but does not invoke it. The preempt_disable() is required to remain on the same CPU which added the skb to the per-CPU list. All this can be avoided by putting this into a local_bh_disable()ed section. The local_bh_enable() part will invoke do_softirq() if required. Signed-off-by: Sebastian Andrzej Siewior commit 5ea901fa1716906d0f7bd77a54eccb3aba9be325 Author: Thomas Gleixner Date: Mon Jul 18 13:59:17 2011 +0200 softirq: Disable softirq stacks for RT Disable extra stacks for softirqs. We want to preempt softirqs and having them on a special IRQ stack does not make this easier. Signed-off-by: Thomas Gleixner commit cc2c40590da849d3c4a43e6488e0a6219e1f9a0f Author: Thomas Gleixner Date: Sun Nov 13 17:17:09 2011 +0100 softirq: Check preemption after reenabling interrupts raise_softirq_irqoff() disables interrupts and wakes the softirq daemon, but after reenabling interrupts there is no preemption check, so the execution of the softirq thread might be delayed arbitrarily. In principle we could add that check to local_irq_enable/restore, but that's overkill as the raise_softirq_irqoff() sections are the only ones which show this behaviour. Reported-by: Carsten Emde Signed-off-by: Thomas Gleixner commit 7c2a2410e3746e052c25f67e48f5c8f671129ac3 Author: Sebastian Andrzej Siewior Date: Sat Jun 22 00:09:22 2019 +0200 softirq: Avoid a cancel dead-lock in tasklet handling due to preemptible-softirq A pending / active tasklet which is preempted by a task on the same CPU will spin indefinitely because the tasklet makes no progress. To avoid this deadlock we can disable BH, which will acquire the softirq lock and thus force the completion of the softirq and so the tasklet. The BH off/on in tasklet_kill() will force the completion of tasklets which are not yet running but scheduled (because ksoftirqd was preempted before it could start the tasklet). The BH off/on in tasklet_unlock_wait() will force the completion of tasklets which got preempted while running. Signed-off-by: Sebastian Andrzej Siewior commit 2ca7ec88833726d3fc0191173f0e7f9e52f85863 Author: Thomas Gleixner Date: Tue Sep 13 16:42:35 2011 +0200 sched: Disable TTWU_QUEUE on RT The queued remote wakeup mechanism can introduce rather large latencies if the number of migrated tasks is high. Disable it for RT. Signed-off-by: Thomas Gleixner commit 916417b560990c6e954250fe5cab6c65f367eca0 Author: Thomas Gleixner Date: Tue Jun 7 09:19:06 2011 +0200 sched: Do not account rcu_preempt_depth on RT in might_sleep() RT changes the rcu_preempt_depth semantics, so we cannot check for it in might_sleep().
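The netif_rx_ni() rework above is easiest to see as code; a simplified sketch of the resulting structure (not the actual net/core/dev.c function):

    static int netif_rx_ni_sketch(struct sk_buff *skb)
    {
        int err;

        local_bh_disable();     /* stay on the CPU that queues the skb      */
        err = netif_rx(skb);    /* queues the skb and raises NET_RX_SOFTIRQ */
        local_bh_enable();      /* runs the pending softirq if required     */

        return err;
    }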
Signed-off-by: Thomas Gleixner commit 00f9da84e39849647b18b46fe47b629ce6d0aae8 Author: Thomas Gleixner Date: Sat Jun 25 09:21:04 2011 +0200 sched: Add saved_state for tasks blocked on sleeping locks Spinlocks are state preserving in !RT. RT changes the state when a task gets blocked on a lock. So we need to remember the state before the lock contention. If a regular wakeup (not an RT-mutex related wakeup) happens, the saved_state is updated to running. When the lock sleep is done, the saved state is restored. Signed-off-by: Thomas Gleixner commit 2cf99757b15fd508864268d58be58f703fec340a Author: Sebastian Andrzej Siewior Date: Mon Nov 21 19:31:08 2016 +0100 kernel/sched: move stack + kprobe clean up to __put_task_struct() There is no need to free the stack before the task struct (except for reasons mentioned in commit 68f24b08ee89 ("sched/core: Free the stack early if CONFIG_THREAD_INFO_IN_TASK")). This also comes in handy on -RT because we can't free memory in a preempt-disabled region. vfree_atomic() delays the memory cleanup to a worker. Since we move everything to the RCU callback, we can also free it immediately. Cc: stable-rt@vger.kernel.org #for kprobe_flush_task() Signed-off-by: Sebastian Andrzej Siewior commit ebd4dc2bb9ead5efadbaa8b54dba6223e8f2dac4 Author: Thomas Gleixner Date: Mon Jun 6 12:20:33 2011 +0200 sched: Move mmdrop to RCU on RT Takes sleeping locks and calls into the memory allocator, so nothing we want to do in task switch and other atomic contexts. Signed-off-by: Thomas Gleixner commit 9ee3003aa3700b4f9c8941de6344050bd2e00d94 Author: Thomas Gleixner Date: Mon Jun 6 12:12:51 2011 +0200 sched: Limit the number of task migrations per batch Put an upper limit on the number of tasks which are migrated per batch to avoid large latencies. Signed-off-by: Thomas Gleixner commit 31338d347fb9e8fe904ce3ff3d95f338f3f8bfa8 Author: Anna-Maria Gleixner Date: Mon May 27 16:54:06 2019 +0200 posix-timers: Add expiry lock If an about-to-be-removed posix timer is active then the code will retry the delete operation until it succeeds / the timer callback completes. Use hrtimer_grab_expiry_lock() for posix timers which use a hrtimer underneath to spin on a lock until the callback has finished. Introduce cpu_timers_grab_expiry_lock() for the posix-cpu-timer. This will acquire the proper per-CPU spin_lock which is acquired by the CPU which is expiring the timer. Signed-off-by: Anna-Maria Gleixner [bigeasy: keep the posix-cpu timer bits, everything else got applied] Signed-off-by: Sebastian Andrzej Siewior commit 932070a78f3135e0bf12e026ffec7bab63387736 Author: John Stultz Date: Fri Jul 3 08:29:58 2009 -0500 posix-timers: Thread posix-cpu-timers on -rt The posix-cpu-timer code takes non-RT-safe locks in hard irq context. Move it to a thread. [ 3.0 fixes from Peter Zijlstra ] Signed-off-by: John Stultz Signed-off-by: Thomas Gleixner commit 8482ef002faff9fddce94a6fc8a88ec4d8006c2a Author: Sebastian Andrzej Siewior Date: Thu Dec 6 10:15:13 2018 +0100 hrtimer: move state change before hrtimer_cancel in do_nanosleep() There is a small window between setting t->task to NULL and waking the task up (which would set TASK_RUNNING). So the timer would fire, run and set ->task to NULL while the other side/do_nanosleep() wouldn't enter freezable_schedule(). After all we are preemptible here (in do_nanosleep() and on the timer wake up path) and on KVM/virt the virt-CPU might get preempted.
So do_nanosleep() wouldn't enter freezable_schedule() but cancel the timer, which is still running, and wait for it via hrtimer_wait_for_timer(). Then wait_event()/might_sleep() would complain that it is invoked with state != TASK_RUNNING. This isn't a problem since it would be reset to TASK_RUNNING later anyway and we don't rely on the previous state. Move the state update to TASK_RUNNING before hrtimer_cancel() so there are no complaints from might_sleep() about a wrong state. Cc: stable-rt@vger.kernel.org Reviewed-by: Daniel Bristot de Oliveira Signed-off-by: Sebastian Andrzej Siewior commit a5e5bcbcda8f6cd43dbc1e5af76f4557a2f231ac Author: Sebastian Andrzej Siewior Date: Fri Aug 9 15:25:21 2019 +0200 hrtimer: Allow raw wakeups during boot There are a few wake-up timers during the early boot which are essential for the system to make progress. At this stage the softirq threads have not been spawned yet, so there is no timer processing in softirq. The wakeup in question: smpboot_create_thread() -> kthread_create_on_cpu() -> kthread_bind() -> wait_task_inactive() -> schedule_hrtimeout() Let the timer fire in hardirq context during the system boot. Signed-off-by: Sebastian Andrzej Siewior commit ce90a519a4f9834dbe89c239a072cc5e2d5e297b Author: Thomas Gleixner Date: Fri Jan 11 11:23:51 2013 +0100 completion: Use simple wait queues Completions have no long lasting callbacks and therefore do not need the complex waitqueue variant. Use simple waitqueues, which reduces the contention on the waitqueue lock. Signed-off-by: Thomas Gleixner [cminyard@mvista.com: Move __prepare_to_swait() into the do loop because swake_up_locked() removes the waiter on wake from the queue while in the original code it is not the case] Signed-off-by: Sebastian Andrzej Siewior commit 2065c5d92dce51d86cd2b055ce7bb1a75c5a28ea Author: Sebastian Andrzej Siewior Date: Mon Oct 28 12:19:57 2013 +0100 wait.h: include atomic.h | CC init/main.o |In file included from include/linux/mmzone.h:9:0, | from include/linux/gfp.h:4, | from include/linux/kmod.h:22, | from include/linux/module.h:13, | from init/main.c:15: |include/linux/wait.h: In function ‘wait_on_atomic_t’: |include/linux/wait.h:982:2: error: implicit declaration of function ‘atomic_read’ [-Werror=implicit-function-declaration] | if (atomic_read(val) == 0) | ^ This pops up on ARM. Non-RT gets its atomic.h include from spinlock.h. Signed-off-by: Sebastian Andrzej Siewior commit 78650f512d29d01fb070941b200c82a7c42df067 Author: Sebastian Andrzej Siewior Date: Wed Oct 4 10:24:23 2017 +0200 pci/switchtec: Don't use completion's wait queue The poll callback is using the completion's wait_queue_head_t member and puts it in poll_wait() so the poll() caller gets a wakeup after the command completed. This does not work on RT because we don't have a wait_queue_head_t in our completion implementation. Nobody else in tree does it like that, so this is the only driver that breaks. Instead of using the completion, here is a waitqueue with a status flag, as suggested by Logan. I don't have the HW so I have no idea if it works as expected, so please test it. Cc: Kurt Schwemmer Cc: Logan Gunthorpe Signed-off-by: Sebastian Andrzej Siewior commit b98d16291f1237c02e022c88801ce61ac8ff7387 Author: Thomas Gleixner Date: Sun Nov 6 12:26:18 2011 +0100 x86: kvm Require const tsc for RT A non-constant TSC is a nightmare on bare metal already, but with virtualization it becomes a complete disaster because the workarounds are horrible latency wise.
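Returning to the do_nanosleep() ordering fix above, the change boils down to two lines; a sketch (not the actual hrtimer code):

    static void finish_nanosleep_sketch(struct hrtimer_sleeper *t)
    {
        __set_current_state(TASK_RUNNING);  /* previously done after the cancel */
        hrtimer_cancel(&t->timer);          /* may block on RT until the callback finishes */
    }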
That's also a preliminary for running RT in a guest on top of a RT host. Signed-off-by: Thomas Gleixner commit 092bbff51e49f288d34b7f5ae4251a436457120f Author: Sebastian Andrzej Siewior Date: Wed Jan 25 16:34:27 2017 +0100 radix-tree: use local locks The preload functionality uses per-CPU variables and preempt-disable to ensure that it does not switch CPUs during its usage. This patch adds local_locks() instead preempt_disable() for the same purpose and to remain preemptible on -RT. Cc: stable-rt@vger.kernel.org Reported-and-debugged-by: Mike Galbraith Signed-off-by: Sebastian Andrzej Siewior commit da3fe91a16694488c96d359a271574003786403a Author: Luis Claudio R. Goncalves Date: Tue Jun 25 11:28:04 2019 -0300 mm/zswap: Do not disable preemption in zswap_frontswap_store() Zswap causes "BUG: scheduling while atomic" by blocking on a rt_spin_lock() with preemption disabled. The preemption is disabled by get_cpu_var() in zswap_frontswap_store() to protect the access of the zswap_dstmem percpu variable. Use get_locked_var() to protect the percpu zswap_dstmem variable, making the code preemptive. As get_cpu_ptr() also disables preemption, replace it by this_cpu_ptr() and remove the counterpart put_cpu_ptr(). Steps to Reproduce: 1. # grubby --args "zswap.enabled=1" --update-kernel DEFAULT 2. # reboot 3. Calculate the amount o memory to be used by the test: ---> grep MemAvailable /proc/meminfo ---> Add 25% ~ 50% to that value 4. # stress --vm 1 --vm-bytes ${MemAvailable+25%} --timeout 240s Usually, in less than 5 minutes the backtrace listed below appears, followed by a kernel panic: | BUG: scheduling while atomic: kswapd1/181/0x00000002 | | Preemption disabled at: | [] zswap_frontswap_store+0x21a/0x6e1 | | Kernel panic - not syncing: scheduling while atomic | CPU: 14 PID: 181 Comm: kswapd1 Kdump: loaded Not tainted 5.0.14-rt9 #1 | Hardware name: AMD Pence/Pence, BIOS WPN2321X_Weekly_12_03_21 03/19/2012 | Call Trace: | panic+0x106/0x2a7 | __schedule_bug.cold+0x3f/0x51 | __schedule+0x5cb/0x6f0 | schedule+0x43/0xd0 | rt_spin_lock_slowlock_locked+0x114/0x2b0 | rt_spin_lock_slowlock+0x51/0x80 | zbud_alloc+0x1da/0x2d0 | zswap_frontswap_store+0x31a/0x6e1 | __frontswap_store+0xab/0x130 | swap_writepage+0x39/0x70 | pageout.isra.0+0xe3/0x320 | shrink_page_list+0xa8e/0xd10 | shrink_inactive_list+0x251/0x840 | shrink_node_memcg+0x213/0x770 | shrink_node+0xd9/0x450 | balance_pgdat+0x2d5/0x510 | kswapd+0x218/0x470 | kthread+0xfb/0x130 | ret_from_fork+0x27/0x50 Cc: stable-rt@vger.kernel.org Reported-by: Ping Fang Signed-off-by: Luis Claudio R. Goncalves Reviewed-by: Daniel Bristot de Oliveira Signed-off-by: Sebastian Andrzej Siewior commit 7806c166133e9fa496a633781631e35e4f664375 Author: Mike Galbraith Date: Tue Mar 22 11:16:09 2016 +0100 mm/zsmalloc: copy with get_cpu_var() and locking get_cpu_var() disables preemption and triggers a might_sleep() splat later. This is replaced with get_locked_var(). This bitspinlocks are replaced with a proper mutex which requires a slightly larger struct to allocate. Signed-off-by: Mike Galbraith [bigeasy: replace the bitspin_lock() with a mutex, get_locked_var(). Mike then fixed the size magic] Signed-off-by: Sebastian Andrzej Siewior commit 571d9493c170937c485ca0b952d614207968ec60 Author: Sebastian Andrzej Siewior Date: Wed Jan 28 17:14:16 2015 +0100 mm/memcontrol: Replace local_irq_disable with local locks There are a few local_irq_disable() which then take sleeping locks. This patch converts them local locks. 
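Several of the mm commits above (zswap, zsmalloc, memcontrol) revolve around the same per-CPU access pattern; the following is a generic illustration of the difference, not code from any of those patches (the per-CPU variable is invented):

    static DEFINE_PER_CPU(int, scratch_counter);    /* stand-in per-CPU data */

    static void percpu_access_patterns(void)
    {
        int *p;

        /* disables preemption: any sleeping lock taken before put_cpu_var()
         * splats on RT */
        get_cpu_var(scratch_counter)++;
        put_cpu_var(scratch_counter);

        /* stays preemptible: serialisation must come from an explicit lock,
         * which is exactly what the RT conversions above add */
        p = this_cpu_ptr(&scratch_counter);
        (*p)++;
    }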
Signed-off-by: Sebastian Andrzej Siewior commit 8d9074087d4e14aa0d39f607cc3ffd367d9f3f79 Author: Yang Shi Date: Wed Oct 30 11:48:33 2013 -0700 mm/memcontrol: Don't call schedule_work_on in preemption disabled context The following trace is triggered when running ltp oom test cases: BUG: sleeping function called from invalid context at kernel/rtmutex.c:659 in_atomic(): 1, irqs_disabled(): 0, pid: 17188, name: oom03 Preemption disabled at:[] mem_cgroup_reclaim+0x90/0xe0 CPU: 2 PID: 17188 Comm: oom03 Not tainted 3.10.10-rt3 #2 Hardware name: Intel Corporation Calpella platform/MATXM-CORE-411-B, BIOS 4.6.3 08/18/2010 ffff88007684d730 ffff880070df9b58 ffffffff8169918d ffff880070df9b70 ffffffff8106db31 ffff88007688b4a0 ffff880070df9b88 ffffffff8169d9c0 ffff88007688b4a0 ffff880070df9bc8 ffffffff81059da1 0000000170df9bb0 Call Trace: [] dump_stack+0x19/0x1b [] __might_sleep+0xf1/0x170 [] rt_spin_lock+0x20/0x50 [] queue_work_on+0x61/0x100 [] drain_all_stock+0xe1/0x1c0 [] mem_cgroup_reclaim+0x90/0xe0 [] __mem_cgroup_try_charge+0x41a/0xc40 [] ? release_pages+0x1b1/0x1f0 [] ? sched_exec+0x40/0xb0 [] mem_cgroup_charge_common+0x37/0x70 [] mem_cgroup_newpage_charge+0x26/0x30 [] handle_pte_fault+0x618/0x840 [] ? unpin_current_cpu+0x16/0x70 [] ? migrate_enable+0xd4/0x200 [] handle_mm_fault+0x145/0x1e0 [] __do_page_fault+0x1a1/0x4c0 [] ? preempt_schedule_irq+0x4b/0x70 [] ? retint_kernel+0x37/0x40 [] do_page_fault+0xe/0x10 [] page_fault+0x22/0x30 So, to prevent schedule_work_on from being called in preempt disabled context, replace the pair of get/put_cpu() to get/put_cpu_light(). Signed-off-by: Yang Shi Signed-off-by: Sebastian Andrzej Siewior commit d085a52d55e71468c0bc2a9d4801d49fa6f0cee7 Author: Sebastian Andrzej Siewior Date: Wed Apr 15 19:00:47 2015 +0200 slub: Disable SLUB_CPU_PARTIAL |BUG: sleeping function called from invalid context at kernel/locking/rtmutex.c:915 |in_atomic(): 1, irqs_disabled(): 0, pid: 87, name: rcuop/7 |1 lock held by rcuop/7/87: | #0: (rcu_callback){......}, at: [] rcu_nocb_kthread+0x1ca/0x5d0 |Preemption disabled at:[] put_cpu_partial+0x29/0x220 | |CPU: 0 PID: 87 Comm: rcuop/7 Tainted: G W 4.0.0-rt0+ #477 |Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.7.5-20140531_083030-gandalf 04/01/2014 | 000000000007a9fc ffff88013987baf8 ffffffff817441c7 0000000000000007 | 0000000000000000 ffff88013987bb18 ffffffff810eee51 0000000000000000 | ffff88013fc10200 ffff88013987bb48 ffffffff8174a1c4 000000000007a9fc |Call Trace: | [] dump_stack+0x4f/0x90 | [] ___might_sleep+0x121/0x1b0 | [] rt_spin_lock+0x24/0x60 | [] __free_pages_ok+0xaa/0x540 | [] __free_pages+0x1d/0x30 | [] __free_slab+0xc5/0x1e0 | [] free_delayed+0x56/0x70 | [] put_cpu_partial+0x14d/0x220 | [] __slab_free+0x158/0x2c0 | [] kmem_cache_free+0x221/0x2d0 | [] file_free_rcu+0x2c/0x40 | [] rcu_nocb_kthread+0x243/0x5d0 | [] kthread+0xfc/0x120 | [] ret_from_fork+0x58/0x90 Signed-off-by: Sebastian Andrzej Siewior commit da10e73944acc61e84dafd623d584dd1d3ce61ce Author: Thomas Gleixner Date: Wed Jan 9 12:08:15 2013 +0100 slub: Enable irqs for __GFP_WAIT SYSTEM_RUNNING might be too late for enabling interrupts. Allocations with GFP_WAIT can happen before that. So use this as an indicator. 
Signed-off-by: Thomas Gleixner commit 8d35c6992295a255ddf084c8ce4ba11e972656c6 Author: Thomas Gleixner Date: Thu Oct 25 10:32:35 2012 +0100 mm: Enable SLUB for RT Avoid the memory allocation in the IRQ section Signed-off-by: Thomas Gleixner [bigeasy: factor out everything except the kcalloc() workaround ] Signed-off-by: Sebastian Andrzej Siewior commit 6c55667c1a470341b13b174eb6afa9626bb2a233 Author: Ingo Molnar Date: Fri Jul 3 08:30:13 2009 -0500 mm/vmstat: Protect per cpu variables with preempt disable on RT Disable preemption on -RT for the vmstat code. On vanilla the code runs in IRQ-off regions while on -RT it does not. "preempt_disable" ensures that the same resources are not updated in parallel due to preemption. Signed-off-by: Ingo Molnar Signed-off-by: Thomas Gleixner commit 461fde0c2bbfac63e961dfbb8771b3c2aaa9da06 Author: Thomas Gleixner Date: Fri Jul 24 12:38:56 2009 +0200 preempt: Provide preempt_*_(no)rt variants RT needs a few preempt_disable/enable points which are not necessary otherwise. Implement variants to avoid #ifdeffery. Signed-off-by: Thomas Gleixner commit fc3f0ec3431c8ac40f6f797b762ac2ffff30ae2d Author: Sebastian Andrzej Siewior Date: Mon Aug 12 11:20:44 2019 +0200 mm/swap: Enable use pvec lock on RT On RT we also need to avoid preempt disable/IRQ-off regions, so we have to enable the locking while accessing pvecs. Signed-off-by: Sebastian Andrzej Siewior commit 89fee776c155543fe2a9c8951e501d3bb46b6f51 Author: Anna-Maria Gleixner Date: Thu Apr 18 11:09:07 2019 +0200 mm/swap: Enable "use_pvec_lock" nohz_full dependent When a system runs with CONFIG_NO_HZ_FULL enabled, the tick of CPUs listed in the 'nohz_full=' kernel command line parameter should be stopped whenever possible. The tick stays stopped longer when work for this CPU is handled by another CPU. With the already introduced static key 'use_pvec_lock' there is the possibility to prevent firing a worker for mm/swap work on a remote CPU with a stopped tick. Therefore enable the static key in case the kernel command line parameter 'nohz_full=' setup was successful, which implies that CONFIG_NO_HZ_FULL is set. Signed-off-by: Anna-Maria Gleixner Signed-off-by: Sebastian Andrzej Siewior commit da157a111c0d6039bb086245e1bdaf836b6b6279 Author: Thomas Gleixner Date: Thu Apr 18 11:09:06 2019 +0200 mm/swap: Access struct pagevec remotely When the newly introduced static key is enabled, struct pagevec is locked during access. So it is possible to access it from a remote CPU. The advantage is that the work can be done from the "requesting" CPU without firing a worker on a remote CPU and waiting for it to complete the work. No functional change because the static key is not enabled. Signed-off-by: Thomas Gleixner Signed-off-by: Anna-Maria Gleixner Signed-off-by: Sebastian Andrzej Siewior commit b659412b9cfa63fb5e9e6d33abb1205ca700c6b9 Author: Thomas Gleixner Date: Thu Apr 18 11:09:05 2019 +0200 mm/swap: Add static key dependent pagevec locking The locking of struct pagevec is done by disabling preemption. In case the struct has to be accessed from interrupt context, interrupts are disabled. This means the struct can only be accessed locally from the CPU. There is also no lockdep coverage which would scream if it is accessed from the wrong context. Create struct swap_pagevec, which consists of a pagevec member and a spinlock_t. Introduce a static key which changes the locking behavior only if the key is set, in the following way: before the struct is accessed the spin_lock has to be acquired instead of using preempt_disable(), as sketched below.
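A sketch of that scheme, using the struct and key names from the commit text (the surrounding helpers and the per-CPU variable are invented, and lock initialisation is omitted):

    struct swap_pagevec {
        spinlock_t      lock;
        struct pagevec  pvec;
    };

    static DEFINE_PER_CPU(struct swap_pagevec, swap_pvec_sketch);
    static DEFINE_STATIC_KEY_FALSE(use_pvec_lock);

    static struct swap_pagevec *lock_swap_pvec(void)
    {
        struct swap_pagevec *swpvec;

        if (static_branch_likely(&use_pvec_lock)) {
            /* the lock, not CPU pinning, provides exclusion, so migrating
             * between the dereference and the lock is harmless */
            swpvec = raw_cpu_ptr(&swap_pvec_sketch);
            spin_lock(&swpvec->lock);
        } else {
            /* default behaviour: pin the CPU by disabling preemption */
            swpvec = &get_cpu_var(swap_pvec_sketch);
        }
        return swpvec;
    }

    static void unlock_swap_pvec(struct swap_pagevec *swpvec)
    {
        if (static_branch_likely(&use_pvec_lock))
            spin_unlock(&swpvec->lock);
        else
            put_cpu_var(swap_pvec_sketch);
    }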
Since the struct is used CPU-locally there is no spinning on the lock but the lock is acquired immediately. If the struct is accessed from interrupt context, spin_lock_irqsave() is used. No functional change yet because the static key is not enabled. [anna-maria: introduce static key] Signed-off-by: Thomas Gleixner Signed-off-by: Anna-Maria Gleixner Signed-off-by: Sebastian Andrzej Siewior commit 51618b684886e83940abeac9147b729207757875 Author: Anna-Maria Gleixner Date: Thu Apr 18 11:09:04 2019 +0200 mm/page_alloc: Split drain_local_pages() Split the functionality of drain_local_pages() into a separate function. This is preparatory work for introducing the static-key-dependent locking mechanism. No functional change. Signed-off-by: Anna-Maria Gleixner Signed-off-by: Sebastian Andrzej Siewior commit 4e5b625a4e09d170a81019b99c3edd16faa8e1a4 Author: Ingo Molnar Date: Fri Jul 3 08:29:37 2009 -0500 mm: page_alloc: rt-friendly per-cpu pages rt-friendly per-cpu pages: convert the irqs-off per-cpu locking method into a preemptible, explicit-per-cpu-locks method. Contains fixes from: Peter Zijlstra Thomas Gleixner Signed-off-by: Ingo Molnar Signed-off-by: Thomas Gleixner commit f6336c5dafb3d1724d0ab86a09504103c014d30c Author: Thomas Gleixner Date: Thu Jun 21 17:29:19 2018 +0200 mm/SLUB: delay giving back empty slubs to IRQ enabled regions __free_slab() is invoked with disabled interrupts, which increases the irq-off time while __free_pages() is doing the work. Allow __free_slab() to be invoked with enabled interrupts and move everything from interrupts-off invocations to a temporary per-CPU list so it can be processed later. Signed-off-by: Thomas Gleixner Signed-off-by: Sebastian Andrzej Siewior commit 8405434196155585227bafa1eb174c9944b0087e Author: Thomas Gleixner Date: Mon May 28 15:24:22 2018 +0200 mm/SLxB: change list_lock to raw_spinlock_t The list_lock is used with IRQs off on RT. Make it a raw_spinlock_t, otherwise the interrupts won't be disabled on -RT. The locking rules remain the same on !RT. This patch changes it for SLAB and SLUB since both share the same header file for the struct kmem_cache_node definition. Signed-off-by: Thomas Gleixner Signed-off-by: Sebastian Andrzej Siewior commit 05ab7206a0899c9ad966a3d3ba7f86271688de95 Author: Peter Zijlstra Date: Mon May 28 15:24:21 2018 +0200 Split IRQ-off and zone->lock while freeing pages from PCP list #2 Split the IRQ-off section around the PCP list access from the zone->lock section while freeing pages. Introduce isolate_pcp_pages(), which separates the pages from the PCP list onto a temporary list, and then free the temporary list via free_pcppages_bulk(). Signed-off-by: Peter Zijlstra Signed-off-by: Sebastian Andrzej Siewior commit eebdb4f831157d1ad08a26f44160b93a9c05f95c Author: Peter Zijlstra Date: Mon May 28 15:24:20 2018 +0200 Split IRQ-off and zone->lock while freeing pages from PCP list #1 Split the IRQ-off section around the PCP list access from the zone->lock section while freeing pages. Introduce isolate_pcp_pages(), which separates the pages from the PCP list onto a temporary list, and then free the temporary list via free_pcppages_bulk(). Signed-off-by: Peter Zijlstra Signed-off-by: Sebastian Andrzej Siewior commit 96c89fc4c351c0e8c125c77ca4e7d6b7f9be065f Author: Oleg Nesterov Date: Tue Jul 14 14:26:34 2015 +0200 signal/x86: Delay calling signals in atomic On x86_64 we must disable preemption before we enable interrupts for stack faults, int3 and debugging, because the current task is using a per-CPU debug stack defined by the IST.
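The list_lock type change described above is small but characteristic of the RT tree; an illustration (not the actual mm/slub.c diff, struct and function names are invented, lock initialisation omitted):

    struct kmem_cache_node_sketch {
        raw_spinlock_t   list_lock;   /* was: spinlock_t, which sleeps on RT */
        struct list_head partial;
    };

    static void add_partial_sketch(struct kmem_cache_node_sketch *n,
                                   struct list_head *slab)
    {
        unsigned long flags;

        /* a raw_spinlock_t really keeps IRQs off, on RT as well */
        raw_spin_lock_irqsave(&n->list_lock, flags);
        list_add(slab, &n->partial);
        raw_spin_unlock_irqrestore(&n->list_lock, flags);
    }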
If we schedule out, another task can come in and use the same stack and cause the stack to be corrupted and crash the kernel on return. When CONFIG_PREEMPT_RT is enabled, spin_locks become mutexes, and one of these is the spin lock used in signal handling. Some of the debug code (int3) causes do_trap() to send a signal. This function calls a spin lock that has been converted to a mutex and has the possibility to sleep. If this happens, the above issues with the corrupted stack is possible. Instead of calling the signal right away, for PREEMPT_RT and x86_64, the signal information is stored on the stacks task_struct and TIF_NOTIFY_RESUME is set. Then on exit of the trap, the signal resume code will send the signal when preemption is enabled. [ rostedt: Switched from #ifdef CONFIG_PREEMPT_RT to ARCH_RT_DELAYS_SIGNAL_SEND and added comments to the code. ] Signed-off-by: Oleg Nesterov Signed-off-by: Steven Rostedt Signed-off-by: Thomas Gleixner [bigeasy: also needed on 32bit as per Yang Shi ] Signed-off-by: Sebastian Andrzej Siewior commit 96fac67317432408cc4b8b1f317673cb0ede399d Author: Sebastian Andrzej Siewior Date: Mon May 20 13:09:08 2019 +0200 softirq: Add preemptible softirq Add preemptible softirq for RT's needs. By removing the softirq count from the preempt counter, the softirq becomes preemptible. A per-CPU lock ensures that there is no parallel softirq processing or that per-CPU variables are not access in parallel by multiple threads. local_bh_enable() will process all softirq work that has been raised in its BH-disabled section once the BH counter gets to 0. [+ rcu_read_lock() as part of local_bh_disable() by Scott Wood] Signed-off-by: Sebastian Andrzej Siewior commit d3789504ee2a280bc4930de45d806d7393aa14ac Author: Thomas Gleixner Date: Mon Jun 20 09:03:47 2011 +0200 rt: Add local irq locks Introduce locallock. For !RT this maps to preempt_disable()/ local_irq_disable() so there is not much that changes. For RT this will map to a spinlock. This makes preemption possible and locked "ressource" gets the lockdep anotation it wouldn't have otherwise. The locks are recursive for owner == current. Also, all locks user migrate_disable() which ensures that the task is not migrated to another CPU while the lock is held and the owner is preempted. Signed-off-by: Thomas Gleixner commit 37997f80f39506180f3129113c28811a3ea99e58 Author: Sebastian Andrzej Siewior Date: Mon Jul 1 17:39:28 2019 +0200 x86: Disable HAVE_ARCH_JUMP_LABEL __text_poke() does: | local_irq_save(flags); … | ptep = get_locked_pte(poking_mm, poking_addr, &ptl); which does not work on -RT because the PTE-lock is a spinlock_t typed lock. Signed-off-by: Sebastian Andrzej Siewior commit ba511da112dd8f1c1cedd10732ece2e4e2cc6f6d Author: Sebastian Andrzej Siewior Date: Thu Jul 26 15:06:10 2018 +0200 efi: Allow efi=runtime In case the command line option "efi=noruntime" is default at built-time, the user could overwrite its state by `efi=runtime' and allow it again. Acked-by: Ard Biesheuvel Signed-off-by: Sebastian Andrzej Siewior commit e8a1edc60725fdaa5daab418ed58ff35a69dffde Author: Sebastian Andrzej Siewior Date: Thu Jul 26 15:03:16 2018 +0200 efi: Disable runtime services on RT Based on meassurements the EFI functions get_variable / get_next_variable take up to 2us which looks okay. The functions get_time, set_time take around 10ms. Those 10ms are too much. Even one ms would be too much. Ard mentioned that SetVariable might even trigger larger latencies if the firware will erase flash blocks on NOR. 
The time-functions are used by efi-rtc and can be triggered during runtimed (either via explicit read/write or ntp sync). The variable write could be used by pstore. These functions can be disabled without much of a loss. The poweroff / reboot hooks may be provided by PSCI. Disable EFI's runtime wrappers. This was observed on "EFI v2.60 by SoftIron Overdrive 1000". Acked-by: Ard Biesheuvel Signed-off-by: Sebastian Andrzej Siewior commit 232b6cc8548233fc7ac96c20600078e659295295 Author: Sebastian Andrzej Siewior Date: Thu Aug 29 11:48:57 2013 +0200 md: disable bcache It uses anon semaphores |drivers/md/bcache/request.c: In function ‘cached_dev_write_complete’: |drivers/md/bcache/request.c:1007:2: error: implicit declaration of function ‘up_read_non_owner’ [-Werror=implicit-function-declaration] | up_read_non_owner(&dc->writeback_lock); | ^ |drivers/md/bcache/request.c: In function ‘request_write’: |drivers/md/bcache/request.c:1033:2: error: implicit declaration of function ‘down_read_non_owner’ [-Werror=implicit-function-declaration] | down_read_non_owner(&dc->writeback_lock); | ^ either we get rid of those or we have to introduce them… Link: http://lkml.kernel.org/r/20130820111602.3cea203c@gandalf.local.home Signed-off-by: Sebastian Andrzej Siewior commit 2802d34d5d995b297eadd8de6f621942bd0d7f02 Author: Sebastian Andrzej Siewior Date: Sat May 27 19:02:06 2017 +0200 net/core: disable NET_RX_BUSY_POLL on RT napi_busy_loop() disables preemption and performs a NAPI poll. We can't acquire sleeping locks with disabled preemption so we would have to work around this and add explicit locking for synchronisation against ksoftirqd. Without explicit synchronisation a low priority process would "own" the NAPI state (by setting NAPIF_STATE_SCHED) and could be scheduled out (no preempt_disable() and BH is preemptible on RT). In case a network packages arrives then the interrupt handler would set NAPIF_STATE_MISSED and the system would wait until the task owning the NAPI would be scheduled in again. Should a task with RT priority busy poll then it would consume the CPU instead allowing tasks with lower priority to run. The NET_RX_BUSY_POLL is disabled by default (the system wide sysctls for poll/read are set to zero) so disable NET_RX_BUSY_POLL on RT to avoid wrong locking context on RT. Should this feature be considered useful on RT systems then it could be enabled again with proper locking and synchronisation. Signed-off-by: Sebastian Andrzej Siewior commit 60db774dd96aec741fd25855df070e6d898fde28 Author: Thomas Gleixner Date: Mon Jul 18 17:03:52 2011 +0200 sched: Disable CONFIG_RT_GROUP_SCHED on RT Carsten reported problems when running: taskset 01 chrt -f 1 sleep 1 from within rc.local on a F15 machine. The task stays running and never gets on the run queue because some of the run queues have rt_throttled=1 which does not go away. Works nice from a ssh login shell. Disabling CONFIG_RT_GROUP_SCHED solves that as well. Signed-off-by: Thomas Gleixner commit 51eabdb2ff938afeafc4e082aab0a2dfb7a755ad Author: Sebastian Andrzej Siewior Date: Fri Mar 21 20:19:05 2014 +0100 rcu: make RCU_BOOST default on RT Since it is no longer invoked from the softirq people run into OOM more often if the priority of the RCU thread is too low. Making boosting default on RT should help in those case and it can be switched off if someone knows better. 
Signed-off-by: Sebastian Andrzej Siewior commit 7723edae883b009a6c39fcf9b23649baed6c9657 Author: Ingo Molnar Date: Fri Jul 3 08:44:03 2009 -0500 mm: Allow only SLUB on RT Memory allocation disables interrupts as part of the allocation and freeing process. For -RT it is important that this section remain short and don't depend on the size of the request or an internal state of the memory allocator. At the beginning the SLAB memory allocator was adopted for RT's needs and it required substantial changes. Later, with the addition of the SLUB memory allocator we adopted this one as well and the changes were smaller. More important, due to the design of the SLUB allocator it performs better and its worst case latency was smaller. In the end only SLUB remained supported. Disable SLAB and SLOB on -RT. Only SLUB is adopted to -RT needs. Signed-off-by: Ingo Molnar Signed-off-by: Thomas Gleixner Signed-off-by: Sebastian Andrzej Siewior commit 25f30995a41ec39eb0983e82375dbf01c36584b1 Author: Thomas Gleixner Date: Sun Jul 24 12:11:43 2011 +0200 kconfig: Disable config options which are not RT compatible Disable stuff which is known to have issues on RT Signed-off-by: Thomas Gleixner commit 35eecc46c0a8a56b8b5f30a32bdec0ca0b1490cf Author: Thomas Gleixner Date: Wed Dec 14 01:03:49 2011 +0100 cpumask: Disable CONFIG_CPUMASK_OFFSTACK for RT There are "valid" GFP_ATOMIC allocations such as |BUG: sleeping function called from invalid context at kernel/locking/rtmutex.c:931 |in_atomic(): 1, irqs_disabled(): 0, pid: 2130, name: tar |1 lock held by tar/2130: | #0: (&mm->mmap_sem){++++++}, at: [] SyS_brk+0x39/0x190 |Preemption disabled at:[] flush_tlb_mm_range+0x28/0x350 | |CPU: 1 PID: 2130 Comm: tar Tainted: G W 4.8.2-rt2+ #747 |Call Trace: | [] dump_stack+0x86/0xca | [] ___might_sleep+0x14b/0x240 | [] rt_spin_lock+0x24/0x60 | [] get_page_from_freelist+0x83a/0x11b0 | [] __alloc_pages_nodemask+0x15b/0x1190 | [] alloc_pages_current+0xa1/0x1f0 | [] new_slab+0x3e5/0x690 | [] ___slab_alloc+0x495/0x660 | [] __slab_alloc.isra.79+0x71/0xc0 | [] __kmalloc_node+0xe7/0x240 | [] alloc_cpumask_var_node+0x20/0x50 | [] alloc_cpumask_var+0xe/0x10 | [] native_send_call_func_ipi+0x21/0x130 | [] smp_call_function_many+0x22f/0x370 | [] native_flush_tlb_others+0x1a4/0x3a0 | [] flush_tlb_mm_range+0x7b/0x350 | [] tlb_flush_mmu_tlbonly+0x62/0xd0 | [] tlb_finish_mmu+0x14/0x50 | [] unmap_region+0xe4/0x110 | [] do_munmap+0x293/0x470 | [] SyS_brk+0x13c/0x190 | [] do_fast_syscall_32+0xb2/0x2f0 | [] entry_SYSENTER_compat+0x51/0x60 which forbid allocations at run-time. Signed-off-by: Thomas Gleixner commit 1ef5c0094d80c02103664f2e28f72207aa78b361 Author: Sebastian Andrzej Siewior Date: Wed Sep 14 14:35:49 2016 +0200 fs/dcache: use swait_queue instead of waitqueue __d_lookup_done() invokes wake_up_all() while holding a hlist_bl_lock() which disables preemption. As a workaround convert it to swait. Signed-off-by: Sebastian Andrzej Siewior commit 6d614726419fedde738facffb1a7311f5d1bf8dc Author: Sebastian Andrzej Siewior Date: Wed Sep 13 12:32:34 2017 +0200 fs/dcache: bring back explicit INIT_HLIST_BL_HEAD init Commit 3d375d78593c ("mm: update callers to use HASH_ZERO flag") removed INIT_HLIST_BL_HEAD and uses the ZERO flag instead for the init. However on RT we have also a spinlock which needs an init call so we can't use that. 
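The explicit-init change above can be summarised in a few lines; a sketch of the kind of loop the patch brings back (table name and size are placeholders):

    static void init_bl_hash_sketch(struct hlist_bl_head *table, unsigned int count)
    {
        unsigned int i;

        /* zeroing (HASH_ZERO) is not enough once the RT list_bl variant
         * embeds a spinlock that needs real initialisation */
        for (i = 0; i < count; i++)
            INIT_HLIST_BL_HEAD(&table[i]);
    }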
Signed-off-by: Sebastian Andrzej Siewior commit 602660600bcdb53e56a87edee6b4dbbcc248a0cc Author: Clark Williams Date: Tue Jul 3 13:34:30 2018 -0500 fscache: initialize cookie hash table raw spinlocks The fscache cookie mechanism uses a hash table of hlist_bl_head structures. The PREEMPT_RT patcheset adds a raw spinlock to this structure and so on PREEMPT_RT the structures get used uninitialized, causing warnings about bad magic numbers when spinlock debugging is turned on. Use the init function for fscache cookies. Signed-off-by: Clark Williams Signed-off-by: Sebastian Andrzej Siewior commit d2534e94dc1ac26f4c08fdadf9d6448954f37524 Author: Paul Gortmaker Date: Fri Jun 21 15:07:25 2013 -0400 list_bl: Make list head locking RT safe As per changes in include/linux/jbd_common.h for avoiding the bit_spin_locks on RT ("fs: jbd/jbd2: Make state lock and journal head lock rt safe") we do the same thing here. We use the non atomic __set_bit and __clear_bit inside the scope of the lock to preserve the ability of the existing LIST_DEBUG code to use the zero'th bit in the sanity checks. As a bit spinlock, we had no lockdep visibility into the usage of the list head locking. Now, if we were to implement it as a standard non-raw spinlock, we would see: BUG: sleeping function called from invalid context at kernel/rtmutex.c:658 in_atomic(): 1, irqs_disabled(): 0, pid: 122, name: udevd 5 locks held by udevd/122: #0: (&sb->s_type->i_mutex_key#7/1){+.+.+.}, at: [] lock_rename+0xe8/0xf0 #1: (rename_lock){+.+...}, at: [] d_move+0x2c/0x60 #2: (&dentry->d_lock){+.+...}, at: [] dentry_lock_for_move+0xf3/0x130 #3: (&dentry->d_lock/2){+.+...}, at: [] dentry_lock_for_move+0xc4/0x130 #4: (&dentry->d_lock/3){+.+...}, at: [] dentry_lock_for_move+0xd7/0x130 Pid: 122, comm: udevd Not tainted 3.4.47-rt62 #7 Call Trace: [] __might_sleep+0x134/0x1f0 [] rt_spin_lock+0x24/0x60 [] __d_shrink+0x5c/0xa0 [] __d_drop+0x1d/0x40 [] __d_move+0x8e/0x320 [] d_move+0x3e/0x60 [] vfs_rename+0x198/0x4c0 [] sys_renameat+0x213/0x240 [] ? _raw_spin_unlock+0x35/0x60 [] ? do_page_fault+0x1ec/0x4b0 [] ? retint_swapgs+0xe/0x13 [] ? trace_hardirqs_on_thunk+0x3a/0x3f [] sys_rename+0x1b/0x20 [] system_call_fastpath+0x1a/0x1f Since we are only taking the lock during short lived list operations, lets assume for now that it being raw won't be a significant latency concern. Signed-off-by: Paul Gortmaker [julia@ni.com: Use #define instead static inline to avoid false positive from lockdep] Signed-off-by: Sebastian Andrzej Siewior commit ebc6792fba15c476eee2a9e1cc0c35b478e14513 Author: Sebastian Andrzej Siewior Date: Fri Oct 20 11:29:53 2017 +0200 fs/dcache: disable preemption on i_dir_seq's write side i_dir_seq is an opencoded seqcounter. Based on the code it looks like we could have two writers in parallel despite the fact that the d_lock is held. The problem is that during the write process on RT the preemption is still enabled and if this process is interrupted by a reader with RT priority then we lock up. To avoid that lock up I am disabling the preemption during the update. The rename of i_dir_seq is here to ensure to catch new write sides in future. 
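A sketch of the i_dir_seq write side described above (simplified, operating on a bare counter rather than the inode field): disabling preemption over the open-coded sequence update is what keeps an RT-priority reader from spinning forever on a preempted writer.

    static unsigned int start_dir_add_sketch(unsigned int *dir_seq)
    {
        preempt_disable();
        for (;;) {
            unsigned int n = *dir_seq;

            /* take the sequence to an odd value: "update in progress" */
            if (!(n & 1) && cmpxchg(dir_seq, n, n + 1) == n)
                return n;
            cpu_relax();
        }
    }

    static void end_dir_add_sketch(unsigned int *dir_seq, unsigned int n)
    {
        smp_store_release(dir_seq, n + 2);   /* back to even: update finished */
        preempt_enable();
    }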
Cc: stable-rt@vger.kernel.org Reported-by: Oleg.Karfich@wago.com Signed-off-by: Sebastian Andrzej Siewior commit a2e9f242ef32e232c1a396ad62eef2c6efdb0705 Author: Sebastian Andrzej Siewior Date: Thu Sep 15 10:51:27 2016 +0200 fs/nfs: turn rmdir_sem into a semaphore The RW semaphore had a reader side which used the _non_owner version because it most likely took the reader lock in one thread and released it in another which would cause lockdep to complain if the "regular" version was used. On -RT we need the owner because the rw lock is turned into a rtmutex. The semaphores on the hand are "plain simple" and should work as expected. We can't have multiple readers but on -RT we don't allow multiple readers anyway so that is not a loss. Signed-off-by: Sebastian Andrzej Siewior commit 04dfe0f92b9d9878a5e3a9f7a104ddfdf382c082 Author: Sebastian Andrzej Siewior Date: Wed Mar 20 18:06:20 2013 +0100 net: Add a mutex around devnet_rename_seq On RT write_seqcount_begin() disables preemption and device_rename() allocates memory with GFP_KERNEL and grabs later the sysfs_mutex mutex. Serialize with a mutex and add use the non preemption disabling __write_seqcount_begin(). To avoid writer starvation, let the reader grab the mutex and release it when it detects a writer in progress. This keeps the normal case (no reader on the fly) fast. [ tglx: Instead of replacing the seqcount by a mutex, add the mutex ] Signed-off-by: Sebastian Andrzej Siewior Signed-off-by: Thomas Gleixner commit 4972eb751f585228a725683944fbdad70b895a77 Author: Sebastian Andrzej Siewior Date: Wed Sep 14 17:36:35 2016 +0200 net/Qdisc: use a seqlock instead seqcount The seqcount disables preemption on -RT while it is held which can't remove. Also we don't want the reader to spin for ages if the writer is scheduled out. The seqlock on the other hand will serialize / sleep on the lock while writer is active. Signed-off-by: Sebastian Andrzej Siewior commit d264e5efb22d11e45901261fc22644941bc04ecc Author: Sebastian Andrzej Siewior Date: Fri Oct 28 23:05:11 2016 +0200 NFSv4: replace seqcount_t with a seqlock_t The raw_write_seqcount_begin() in nfs4_reclaim_open_state() causes a preempt_disable() on -RT. The spin_lock()/spin_unlock() in that section does not work. The lockdep part was removed in commit abbec2da13f0 ("NFS: Use raw_write_seqcount_begin/end int nfs4_reclaim_open_state") because lockdep complained. The whole seqcount thing was introduced in commit c137afabe330 ("NFSv4: Allow the state manager to mark an open_owner as being recovered"). The recovery threads runs only once. write_seqlock() does not work on !RT because it disables preemption and it the writer side is preemptible (has to remain so despite the fact that it will block readers). Reported-by: kernel test robot Link: https://lkml.kernel.org/r/20161021164727.24485-1-bigeasy@linutronix.de Signed-off-by: Sebastian Andrzej Siewior commit eedd46f3e578848f862940e44f54de1996b903d6 Author: Thomas Gleixner Date: Wed Feb 22 12:03:30 2012 +0100 seqlock: Prevent rt starvation If a low prio writer gets preempted while holding the seqlock write locked, a high prio reader spins forever on RT. To prevent this let the reader grab the spinlock, so it blocks and eventually boosts the writer. This way the writer can proceed and endless spinning is prevented. For seqcount writers we disable preemption over the update code path. Thanks to Al Viro for distangling some VFS code to make that possible. 
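Several commits above replace bare seqcount_t users with seqlock_t; the read side keeps its usual retry-loop shape, which is what makes the conversion cheap. A generic example (invented data, standard API):

    static u64 read_sample(seqlock_t *sl, const u64 *value)
    {
        unsigned int seq;
        u64 v;

        do {
            seq = read_seqbegin(sl);   /* on RT the reader may block on, and
                                        * thereby boost, an active writer    */
            v = *value;
        } while (read_seqretry(sl, seq));

        return v;
    }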
Nicholas Mc Guire: - spin_lock+unlock => spin_unlock_wait - __write_seqcount_begin => __raw_write_seqcount_begin Signed-off-by: Thomas Gleixner commit 2591d632cbcf144fb7b45f6eb74596d39c9e1b58 Author: Sebastian Andrzej Siewior Date: Wed Aug 14 16:38:43 2019 +0200 dma-buf: Use seqlock_t instread disabling preemption "dma reservation" disables preemption while acquiring the write access for "seqcount". Replace the seqcount with a seqlock_t which provides seqcount like semantic and lock for writer. Link: https://lkml.kernel.org/r/f410b429-db86-f81c-7c67-f563fa808b62@free.fr Signed-off-by: Sebastian Andrzej Siewior commit 821d736ef58b806d1319c771e4d04374d5c40af3 Author: Thomas Gleixner Date: Wed Sep 21 19:57:12 2011 +0200 signal: Revert ptrace preempt magic Upstream commit '53da1d9456fe7f8 fix ptrace slowness' is nothing more than a bandaid around the ptrace design trainwreck. It's not a correctness issue, it's merily a cosmetic bandaid. Signed-off-by: Thomas Gleixner commit 2748e735b1fa4f8b30ab422b76322610020b5207 Author: Thomas Gleixner Date: Thu Feb 14 22:36:59 2013 +0100 timekeeping: Split jiffies seqlock Replace jiffies_lock seqlock with a simple seqcounter and a rawlock so it can be taken in atomic context on RT. Signed-off-by: Thomas Gleixner commit 6662e8416588bc9bee5363c8ea0aeaf035b4954b Author: Liu Haitao Date: Fri Sep 27 16:22:30 2019 +0800 kmemleak: Change the lock of kmemleak_object to raw_spinlock_t The commit ("kmemleak: Turn kmemleak_lock to raw spinlock on RT") changed the kmemleak_lock to raw spinlock. However the kmemleak_object->lock is held after the kmemleak_lock is held in scan_block(). Make the object->lock a raw_spinlock_t. Cc: stable-rt@vger.kernel.org Link: https://lkml.kernel.org/r/20190927082230.34152-1-yongxin.liu@windriver.com Signed-off-by: Liu Haitao Signed-off-by: Yongxin Liu Signed-off-by: Sebastian Andrzej Siewior commit e5add8d099baaaa5d467d53caaa7cf524a056016 Author: He Zhe Date: Wed Dec 19 16:30:57 2018 +0100 kmemleak: Turn kmemleak_lock to raw spinlock on RT kmemleak_lock, as a rwlock on RT, can possibly be held in atomic context and causes the follow BUG. BUG: scheduling while atomic: migration/15/132/0x00000002 Preemption disabled at: [] cpu_stopper_thread+0x71/0x100 CPU: 15 PID: 132 Comm: migration/15 Not tainted 4.19.0-rt1-preempt-rt #1 Call Trace: schedule+0x3d/0xe0 __rt_spin_lock+0x26/0x30 __write_rt_lock+0x23/0x1a0 rt_write_lock+0x2a/0x30 find_and_remove_object+0x1e/0x80 delete_object_full+0x10/0x20 kmemleak_free+0x32/0x50 kfree+0x104/0x1f0 intel_pmu_cpu_dying+0x67/0x70 x86_pmu_dying_cpu+0x1a/0x30 cpuhp_invoke_callback+0x92/0x700 take_cpu_down+0x70/0xa0 multi_cpu_stop+0x62/0xc0 cpu_stopper_thread+0x79/0x100 smpboot_thread_fn+0x20f/0x2d0 kthread+0x121/0x140 And on v4.18 stable tree the following call trace, caused by grabbing kmemleak_lock again, is also observed. kernel BUG at kernel/locking/rtmutex.c:1048! CPU: 5 PID: 689 Comm: mkfs.ext4 Not tainted 4.18.16-rt9-preempt-rt #1 Call Trace: rt_write_lock+0x2a/0x30 create_object+0x17d/0x2b0 kmemleak_alloc+0x34/0x50 kmem_cache_alloc+0x146/0x220 mempool_alloc_slab+0x15/0x20 mempool_alloc+0x65/0x170 sg_pool_alloc+0x21/0x60 sg_alloc_table_chained+0x8b/0xb0 … blk_flush_plug_list+0x204/0x230 schedule+0x87/0xe0 rt_write_lock+0x2a/0x30 create_object+0x17d/0x2b0 kmemleak_alloc+0x34/0x50 __kmalloc_node+0x1cd/0x340 alloc_request_size+0x30/0x70 mempool_alloc+0x65/0x170 get_request+0x4e3/0x8d0 blk_queue_bio+0x153/0x470 generic_make_request+0x1dc/0x3f0 submit_bio+0x49/0x140 … kmemleak is an error detecting feature. 
We would not expect as good performance as without it. As there is no raw rwlock defining helpers, we turn kmemleak_lock to a raw spinlock. Signed-off-by: He Zhe Cc: catalin.marinas@arm.com Cc: bigeasy@linutronix.de Cc: tglx@linutronix.de Cc: rostedt@goodmis.org Acked-by: Catalin Marinas Link: https://lkml.kernel.org/r/1542877459-144382-1-git-send-email-zhe.he@windriver.com Link: https://lkml.kernel.org/r/20181218150744.GB20197@arrakis.emea.arm.com Signed-off-by: Sebastian Andrzej Siewior commit df2caf070bea084d30ccc03b8f7fc1a2011f3e71 Author: Rob Herring Date: Wed Dec 11 17:23:45 2019 -0600 of: Rework and simplify phandle cache to use a fixed size The phandle cache was added to speed up of_find_node_by_phandle() by avoiding walking the whole DT to find a matching phandle. The implementation has several shortcomings: - The cache is designed to work on a linear set of phandle values. This is true for dtc generated DTs, but not for other cases such as Power. - The cache isn't enabled until of_core_init() and a typical system may see hundreds of calls to of_find_node_by_phandle() before that point. - The cache is freed and re-allocated when the number of phandles changes. - It takes a raw spinlock around a memory allocation which breaks on RT. Change the implementation to a fixed size and use hash_32() as the cache index. This greatly simplifies the implementation. It avoids the need for any re-alloc of the cache and taking a reference on nodes in the cache. We only have a single source of removing cache entries which is of_detach_node(). Using hash_32() removes any assumption on phandle values improving the hit rate for non-linear phandle values. The effect on linear values using hash_32() is about a 10% collision. The chances of thrashing on colliding values seems to be low. To compare performance, I used a RK3399 board which is a pretty typical system. I found that just measuring boot time as done previously is noisy and may be impacted by other things. Also bringing up secondary cores causes some issues with measuring, so I booted with 'nr_cpus=1'. With no caching, calls to of_find_node_by_phandle() take about 20124 us for 1248 calls. There's an additional 288 calls before time keeping is up. Using the average time per hit/miss with the cache, we can calculate these calls to take 690 us (277 hit / 11 miss) with a 128 entry cache and 13319 us with no cache or an uninitialized cache. Comparing the 3 implementations the time spent in of_find_node_by_phandle() is: no cache: 20124 us (+ 13319 us) 128 entry cache: 5134 us (+ 690 us) current cache: 819 us (+ 13319 us) We could move the allocation of the cache earlier to improve the current cache, but that just further complicates the situation as it needs to be after slab is up, so we can't do it when unflattening (which uses memblock). Reported-by: Sebastian Andrzej Siewior Cc: Michael Ellerman Cc: Segher Boessenkool Cc: Frank Rowand Signed-off-by: Rob Herring Link: https://lkml.kernel.org/r/20191211232345.24810-1-robh@kernel.org Signed-off-by: Sebastian Andrzej Siewior commit 063329d7d85b7977dd4c966cd772dd073965b07e Author: Sebastian Andrzej Siewior Date: Mon Feb 11 11:33:11 2019 +0100 tpm: remove tpm_dev_wq_lock Added in commit 9e1b74a63f776 ("tpm: add support for nonblocking operation") but never actually used it. 
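A minimal sketch of the fixed-size, hash_32()-indexed cache idea from the phandle commit above; the array size, the fallback helper and the absence of locking are simplifications, not the actual drivers/of/base.c code:

#include <linux/hash.h>
#include <linux/of.h>

#define PHANDLE_CACHE_BITS	7
#define PHANDLE_CACHE_SZ	(1U << PHANDLE_CACHE_BITS)	/* 128 entries */

static struct device_node *phandle_cache[PHANDLE_CACHE_SZ];

/* hypothetical fallback that walks the whole tree */
static struct device_node *slow_tree_walk(phandle handle);

static struct device_node *find_by_phandle_sketch(phandle handle)
{
	u32 idx = hash_32(handle, PHANDLE_CACHE_BITS);
	struct device_node *np = phandle_cache[idx];

	if (np && np->phandle == handle)
		return np;			/* hit: no tree walk */

	np = slow_tree_walk(handle);		/* miss or collision */
	if (np)
		phandle_cache[idx] = np;	/* remember for next time */
	return np;
}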
Cc: Philip Tricca Cc: Tadeusz Struk Cc: Jarkko Sakkinen Signed-off-by: Sebastian Andrzej Siewior commit 78c672961b5ca5e20461f127a79829aa6d01f778 Author: Sebastian Andrzej Siewior Date: Mon Feb 11 10:40:46 2019 +0100 mm: workingset: replace IRQ-off check with a lockdep assert. Commit 68d48e6a2df57 ("mm: workingset: add vmstat counter for shadow nodes") introduced an IRQ-off check to ensure that a lock is held which also disabled interrupts. This does not work the same way on -RT because none of the locks, that are held, disable interrupts. Replace this check with a lockdep assert which ensures that the lock is held. Cc: Peter Zijlstra Signed-off-by: Sebastian Andrzej Siewior commit a36cc883b77ad3d6b5cb49cf3764d2b4998f45d6 Author: Sebastian Andrzej Siewior Date: Fri Aug 16 12:49:36 2019 +0200 cgroup: Acquire cgroup_rstat_lock with enabled interrupts There is no need to disable interrupts while cgroup_rstat_lock is acquired. The lock is never used in-IRQ context so a simple spin_lock() is enough for synchronisation purpose. Acquire cgroup_rstat_lock without disabling interrupts and ensure that cgroup_rstat_cpu_lock is acquired with disabled interrupts (this one is acquired in-IRQ context). Signed-off-by: Sebastian Andrzej Siewior commit 1d9b31108796becf3b6c6c3bf8f08372fd433175 Author: Sebastian Andrzej Siewior Date: Fri Aug 16 12:25:35 2019 +0200 cgroup: Remove `may_sleep' from cgroup_rstat_flush_locked() cgroup_rstat_flush_locked() is always invoked with `may_sleep' set to true so that this case can be made default and the parameter removed. Remove the `may_sleep' parameter. Signed-off-by: Sebastian Andrzej Siewior commit 6119d1d5ab75db95db47e1072e1de18e12c37d4e Author: Sebastian Andrzej Siewior Date: Fri Aug 16 12:20:42 2019 +0200 cgroup: Consolidate users of cgroup_rstat_lock. cgroup_rstat_flush_irqsafe() has no users, remove it. cgroup_rstat_flush_hold() and cgroup_rstat_flush_release() are only used within this file. Make it static. Signed-off-by: Sebastian Andrzej Siewior commit aaffe8aebd9c1848997f3fd339fa4afbfca8af70 Author: Sebastian Andrzej Siewior Date: Thu Aug 15 18:14:16 2019 +0200 cgroup: Remove ->css_rstat_flush() I was looking at the lifetime of the the ->css_rstat_flush() to see if cgroup_rstat_cpu_lock should remain a raw_spinlock_t. I didn't find any users and is unused since it was introduced in commit 8f53470bab042 ("cgroup: Add cgroup_subsys->css_rstat_flush()") Remove the css_rstat_flush callback because it has no users. Signed-off-by: Sebastian Andrzej Siewior commit 044c35d09de02e5b28475372f06b2a7a9a922d9c Author: Sebastian Andrzej Siewior Date: Fri Nov 8 12:55:47 2019 +0100 mm/compaction: Disable compact_unevictable_allowed on RT Since commit 5bbe3547aa3ba ("mm: allow compaction of unevictable pages") it is allowed to examine mlocked pages for pages to compact by default. On -RT even minor pagefaults are problematic because it may take a few 100us to resolve them and until then the task is blocked. Make compact_unevictable_allowed = 0 default and remove it from /proc on RT. Link: https://lore.kernel.org/linux-mm/20190710144138.qyn4tuttdq6h7kqx@linutronix.de/ Signed-off-by: Sebastian Andrzej Siewior commit 1f6debad1881d9eea843daf2ca86bbf1705bdf2b Author: Sebastian Andrzej Siewior Date: Wed May 22 12:43:56 2019 +0200 workqueue: Convert the locks to raw type After all the workqueue and the timer rework, we can finally make the worker_pool lock raw. The lock is not held over an unbounded period of time/iterations. 
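The workingset change above boils down to swapping one assertion for another; a sketch of the pattern (the lock and function names here are illustrative, not the exact mm/workingset.c code):

#include <linux/lockdep.h>
#include <linux/xarray.h>

static void shadow_node_update_sketch(struct xarray *xa)
{
	/* before: WARN_ON_ONCE(!irqs_disabled());
	 * only valid when the lock also disables interrupts, which it
	 * does not on -RT */

	/* after: assert the lock that actually provides the protection */
	lockdep_assert_held(&xa->xa_lock);

	/* ... update shadow-node statistics ... */
}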
Signed-off-by: Sebastian Andrzej Siewior commit 48c3f1d992b5ede5f70c38e309af79b6105cffb3 Author: Sebastian Andrzej Siewior Date: Tue Jun 11 11:21:09 2019 +0200 workqueue: Use swait for wq_manager_wait In order for the workqueue code to use raw_spinlock_t typed locking, no spinlock_t typed lock may be acquired. A wait_queue_head uses a spinlock_t lock for its list protection. Use a swait based queue head to avoid raw_spinlock_t -> spinlock_t locking. Signed-off-by: Sebastian Andrzej Siewior commit d54a2668e09849e039cd6470a736e2719f92cb08 Author: Sebastian Andrzej Siewior Date: Wed May 22 12:42:26 2019 +0200 sched/swait: Add swait_event_lock_irq() The swait_event_lock_irq() is inspired by wait_event_lock_irq(). This is required by the workqueue code once it switches to swait. Signed-off-by: Sebastian Andrzej Siewior commit c4ef3a44183ace4118848bc819c5677dbe2021e3 Author: Sebastian Andrzej Siewior Date: Tue Jun 11 11:21:02 2019 +0200 workqueue: Don't assume that the callback has interrupts disabled Due to the TIMER_IRQSAFE flag, the timer callback is invoked with disabled interrupts. On -RT the callback is invoked in softirq context with enabled interrupts. Since the interrupts are threaded, there are no in_irq() users. The local_bh_disable() around the threaded handler ensures that there is either a timer or a threaded handler active on the CPU. Disable interrupts before __queue_work() is invoked from the timer callback. Signed-off-by: Sebastian Andrzej Siewior commit fe18e9e08e25d57df5a4e76d6a833e0b2afba4af Author: Sebastian Andrzej Siewior Date: Thu Oct 10 16:54:45 2019 +0200 BPF: Disable on PREEMPT_RT Disable BPF on PREEMPT_RT because - it allocates and frees memory in atomic context - it uses up_read_non_owner() - BPF_PROG_RUN() expects to be invoked in non-preemptible context Signed-off-by: Sebastian Andrzej Siewior commit ba8fae94491e5067096b67dda66dcbadc92d97a9 Author: Sebastian Andrzej Siewior Date: Fri Jul 26 11:30:49 2019 +0200 Use CONFIG_PREEMPTION This is an all-in-one patch of the current `PREEMPTION' branch. Signed-off-by: Sebastian Andrzej Siewior commit ba38538074cecb86d5c2c0257abc301c6cea438a Author: Sebastian Andrzej Siewior Date: Fri Nov 15 18:04:07 2019 +0100 perf/core: Add SRCU annotation for pmus list walk Since commit 28875945ba98d ("rcu: Add support for consolidated-RCU reader checking") there is an additional check to ensure that an RCU-related lock is held while the RCU list is iterated. This section holds the SRCU reader lock instead. Add an annotation to list_for_each_entry_rcu() that pmus_srcu must be acquired during the list traversal. Signed-off-by: Sebastian Andrzej Siewior commit 1d6ea34403127219ccfbdd17b52e8a90d01fb1c7 Author: Clark Williams Date: Mon Jul 15 15:25:00 2019 -0500 thermal/x86_pkg_temp: Make pkg_temp_lock a raw_spinlock_t The spinlock pkg_temp_lock has the potential of being taken in atomic context because it can be acquired from the thermal IRQ vector. It's static and of limited scope, so go ahead and make it a raw spinlock.
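A sketch of the timer-callback change referenced above ("Disable interrupts before __queue_work()"); this is a simplified rendition under assumptions, not the exact kernel/workqueue.c hunk, and __queue_work() is internal to workqueue.c:

static void delayed_work_timer_fn_sketch(struct timer_list *t)
{
	struct delayed_work *dwork = from_timer(dwork, t, timer);
	unsigned long flags;

	/* On -RT the callback runs in softirq context with interrupts
	 * enabled, so TIMER_IRQSAFE no longer guarantees irqs-off here. */
	local_irq_save(flags);
	__queue_work(dwork->cpu, dwork->wq, &dwork->work);
	local_irq_restore(flags);
}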
Signed-off-by: Clark Williams Signed-off-by: Sebastian Andrzej Siewior commit 49c3a13888907c2b60170a9b49518ce769883bf5 Author: Thomas Gleixner Date: Fri Nov 15 18:54:20 2019 +0100 fs/buffer: Make BH_Uptodate_Lock bit_spin_lock a regular spinlock_t Bit spinlocks are problematic if PREEMPT_RT is enabled, because they disable preemption, which is undesired for latency reasons and breaks when regular spinlocks are taken within the bit_spinlock locked region because regular spinlocks are converted to 'sleeping spinlocks' on RT. So RT replaces the bit spinlocks with regular spinlocks to avoid this problem. Bit spinlocks are also not covered by lock debugging, e.g. lockdep. Substitute the BH_Uptodate_Lock bit spinlock with a regular spinlock. Reviewed-by: Jan Kara Signed-off-by: Thomas Gleixner [bigeasy: remove the wrapper and use always spinlock_t and move it into the padding hole] Signed-off-by: Sebastian Andrzej Siewior commit 4a99b1b8b82a99010834375212f56a8a4bfdf36a Author: John Ogness Date: Tue Dec 3 09:14:57 2019 +0100 printk: hack out emergency loglevel usage Instead of using an emergency loglevel to determine if atomic messages should be printed, use oops_in_progress. This conforms to the decision that latency-causing atomic messages never be generated during normal operation. Signed-off-by: John Ogness Signed-off-by: Sebastian Andrzej Siewior commit fbbfa6a76935ec670d7fb77a9bff3f6f1d78dbdb Author: John Ogness Date: Mon Oct 7 16:20:39 2019 +0200 printk: handle iterating while buffer changing The syslog and kmsg_dump readers are provided buffers to fill. Both try to maximize the provided buffer usage by calculating the maximum number of messages that can fit. However, if after the calculation, messages are dropped and new messages added, the calculation will no longer match. For syslog, add a check to make sure the provided buffer is not overfilled. For kmsg_dump, start over by recalculating the messages available. Signed-off-by: John Ogness Signed-off-by: Sebastian Andrzej Siewior commit c8533dc29734466c2eb9e44be82a2d8af049dc34 Author: He Zhe Date: Tue Sep 24 15:26:39 2019 +0800 printk: devkmsg: read: Return EPIPE when the first message user-space wants has gone When user-space wants to read the first message, that is when user->seq is 0, and that message has gone, it currently automatically resets user->seq to current first seq. This mis-aligns with mainline kernel. https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/ABI/testing/dev-kmsg#n39 https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/printk/printk.c#n899 We should inform user-space that what it wants has gone by returning EPIPE in such scenario. Link: https://lore.kernel.org/r/20190924072639.25986-1-zhe.he@windriver.com Signed-off-by: He Zhe Signed-off-by: Sebastian Andrzej Siewior commit 6c5384859cef5c6b92c33a63cae2c5a7087ef7ba Author: John Ogness Date: Wed Apr 24 16:36:04 2019 +0200 printk: kmsg_dump: remove mutex usage The kmsg dumper can be called from any context, but the dumping helpers were using a mutex to synchronize the iterator against concurrent dumps. Rather than trying to synchronize the iterator, use a local copy of the iterator during the dump. Then no synchronization is required. 
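A before/after sketch of the BH_Uptodate_Lock conversion described above; the b_uptodate_lock field follows the commit's description of placing a spinlock_t in the buffer_head padding and is not present in a vanilla v5.4 tree:

#include <linux/buffer_head.h>
#include <linux/bit_spinlock.h>

/* before: bit spinlock, disables preemption, invisible to lockdep */
static void end_io_serialize_old(struct buffer_head *bh)
{
	unsigned long flags;

	local_irq_save(flags);
	bit_spin_lock(BH_Uptodate_Lock, &bh->b_state);
	/* ... walk the buffers of the page ... */
	bit_spin_unlock(BH_Uptodate_Lock, &bh->b_state);
	local_irq_restore(flags);
}

/* after: a regular spinlock_t, RT-compatible and lockdep-covered */
static void end_io_serialize_new(struct buffer_head *bh)
{
	unsigned long flags;

	spin_lock_irqsave(&bh->b_uptodate_lock, flags);
	/* ... walk the buffers of the page ... */
	spin_unlock_irqrestore(&bh->b_uptodate_lock, flags);
}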
Reported-by: Scott Wood Signed-off-by: John Ogness Signed-off-by: Sebastian Andrzej Siewior commit 53b51390d52fd0e23d02e3ad6b62a7804fbd03ff Author: Sebastian Andrzej Siewior Date: Fri Feb 22 12:47:13 2019 +0100 printk: print "rate-limitted" message as info If messages which are injected via kmsg are dropped then they don't need to be printed as warnings. This is to avoid latency spikes if the interface decides to print a lot of important messages. Signed-off-by: Sebastian Andrzej Siewior commit a3aedebc5de58e2026db8e1c9ede292fe7ffea0c Author: John Ogness Date: Fri Feb 22 23:02:44 2019 +0100 printk: devkmsg: llseek: reset clear if it is lost SEEK_DATA will seek to the last clear record. If this clear record is no longer in the ring buffer, devkmsg_llseek() will go into an infinite loop. Fix that by resetting the clear sequence if the old clear record is no longer in the ring buffer. Signed-off-by: John Ogness Signed-off-by: Sebastian Andrzej Siewior commit a44692a3e9832f7681da8071c2e95bdb91a0cd7a Author: John Ogness Date: Sun Feb 17 03:11:20 2019 +0100 printk: only allow kernel to emergency message Emergency messages exist as a mechanism for the kernel to communicate critical information to users. It is not meant for use by userspace. Only allow facility=0 messages to be processed by the emergency message code. Signed-off-by: John Ogness Signed-off-by: Sebastian Andrzej Siewior commit c27893ab868c4ee820cdcf6c6520d5e98a6e59b4 Author: Sebastian Andrzej Siewior Date: Fri Feb 15 14:34:20 2019 +0100 arm: remove printk_nmi_.*() It is no longer provided by the printk core code. Signed-off-by: Sebastian Andrzej Siewior commit 0d5e0776bd61204367f964f4f4bcb8db4da4a466 Author: Sebastian Andrzej Siewior Date: Sat Feb 16 09:02:00 2019 +0100 serial: 8250: export symbols which are used by symbols Signed-off-by: Sebastian Andrzej Siewior commit 438845a16c162210987daf947441e31e73fa724f Author: Sebastian Andrzej Siewior Date: Thu Feb 14 17:38:24 2019 +0100 serial: 8250: remove that trylock in serial8250_console_write_atomic() This does not work as rtmutex in NMI context. As per John, it is not needed. Signed-off-by: Sebastian Andrzej Siewior commit 0b3cc0e6121ce9f0186cc86c7b04dce449be0308 Author: John Ogness Date: Thu Feb 14 23:13:30 2019 +0100 printk: set deferred to default loglevel, enforce mask All messages printed via vpritnk_deferred() were being automatically treated as emergency messages. Messages printed via vprintk_deferred() should be set to the default loglevel. LOGLEVEL_SCHED is no longer relevant. Also, enforce the loglevel mask for emergency messages. Signed-off-by: John Ogness Signed-off-by: Sebastian Andrzej Siewior commit 35a550e0af7946f894d89a6e6f2d0991f71547e6 Author: John Ogness Date: Tue Feb 12 15:30:03 2019 +0100 printk: remove unused code Code relating to the safe context and anything dealing with the previous log buffer implementation is no longer in use. Remove it. Signed-off-by: John Ogness Signed-off-by: Sebastian Andrzej Siewior commit 785e6b783625b8134aa15410e3010d047edd3cdc Author: John Ogness Date: Tue Feb 12 15:30:02 2019 +0100 printk: implement kmsg_dump Since printk messages are now logged to a new ring buffer, update the kmsg_dump functions to pull the messages from there. 
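A structural sketch of the "local copy of the iterator" approach used by the kmsg_dump rework above; the iterator layout and both helpers are invented stand-ins, not the real ring-buffer API:

struct dump_iter_sketch {
	u64 next_seq;			/* next record to read */
};

/* hypothetical helpers standing in for the ring-buffer read API */
static bool read_record_sketch(struct dump_iter_sketch *iter,
			       char *buf, size_t size, size_t *len);
static void emit_line_sketch(const char *buf, size_t len);

static void kmsg_dump_sketch(const struct dump_iter_sketch *shared)
{
	struct dump_iter_sketch iter = *shared;	/* private copy: no mutex */
	char line[1024];
	size_t len;

	/* Works from any context because nothing shared is modified. */
	while (read_record_sketch(&iter, line, sizeof(line), &len))
		emit_line_sketch(line, len);
}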
Signed-off-by: John Ogness Signed-off-by: Sebastian Andrzej Siewior commit c406fbce2054efbf812b3d811ed23a872f719db9 Author: John Ogness Date: Tue Feb 12 15:30:01 2019 +0100 printk: implement syslog Since printk messages are now logged to a new ring buffer, update the syslog functions to pull the messages from there. Signed-off-by: John Ogness Signed-off-by: Sebastian Andrzej Siewior commit fa644ef8b55461d21af563c6d6a2aa5987642036 Author: John Ogness Date: Tue Feb 12 15:30:00 2019 +0100 printk: implement /dev/kmsg Since printk messages are now logged to a new ring buffer, update the /dev/kmsg functions to pull the messages from there. Signed-off-by: John Ogness Signed-off-by: Sebastian Andrzej Siewior commit 1c805d018e40138fbad9aa9cd8071ed7550476ff Author: John Ogness Date: Tue Feb 12 15:29:59 2019 +0100 printk: implement KERN_CONT Implement KERN_CONT based on the printing CPU rather than on the printing task. As long as the KERN_CONT messages are coming from the same CPU and no non-KERN_CONT messages come, the messages are assumed to belong to each other. Signed-off-by: John Ogness Signed-off-by: Sebastian Andrzej Siewior commit eb37d7378ab2dbbde7f1f4336a3b497bd99d7d62 Author: John Ogness Date: Tue Feb 12 15:29:58 2019 +0100 serial: 8250: implement write_atomic Implement a non-sleeping NMI-safe write_atomic console function in order to support emergency printk messages. Since interrupts need to be disabled during transmit, all usage of the IER register was wrapped with access functions that use the console_atomic_lock function to synchronize register access while tracking the state of the interrupts. This was necessary because write_atomic is can be calling from an NMI context that has preempted write_atomic. Signed-off-by: John Ogness Signed-off-by: Sebastian Andrzej Siewior commit 629c34a35f386c7d21f6dd33dcea4dfdc9cd1e6d Author: John Ogness Date: Tue Feb 12 15:29:57 2019 +0100 printk: introduce emergency messages Console messages are generally either critical or non-critical. Critical messages are messages such as crashes or sysrq output. Critical messages should never be lost because generally they provide important debugging information. Since all console messages are output via a fully preemptible printk kernel thread, it is possible that messages are not output because that thread cannot be scheduled (BUG in scheduler, run-away RT task, etc). To allow critical messages to be output independent of the schedulability of the printk task, introduce an emergency mechanism that _immediately_ outputs the message to the consoles. To avoid possible unbounded latency issues, the emergency mechanism only outputs the printk line provided by the caller and ignores any pending messages in the log buffer. Critical messages are identified as messages (by default) with log level LOGLEVEL_WARNING or more critical. This is configurable via the kernel option CONSOLE_LOGLEVEL_EMERGENCY. Any messages output as emergency messages are skipped by the printk thread on those consoles that output the emergency message. In order for a console driver to support emergency messages, the write_atomic function must be implemented by the driver. If not implemented, the emergency messages are handled like all other messages and are printed by the printk thread. Signed-off-by: John Ogness Signed-off-by: Sebastian Andrzej Siewior commit 1a9de295f7b76146c6a29521502362f28578bb89 Author: John Ogness Date: Tue Feb 12 15:29:56 2019 +0100 console: add write_atomic interface Add a write_atomic callback to the console. 
This is an optional function for console drivers. The function must be atomic (including NMI safe) for writing to the console. Console drivers must still implement the write callback. The write_atomic callback will only be used for emergency messages. Creating an NMI safe write_atomic that must synchronize with write requires a careful implementation of the console driver. To aid with the implementation, a set of console_atomic_* functions are provided: void console_atomic_lock(unsigned int *flags); void console_atomic_unlock(unsigned int flags); These functions synchronize using the processor-reentrant cpu lock of the printk buffer. Signed-off-by: John Ogness Signed-off-by: Sebastian Andrzej Siewior commit a46c1b0eab6098532806704809bd9e23179e87c7 Author: John Ogness Date: Tue Feb 12 15:29:55 2019 +0100 printk: add processor number to output It can be difficult to sort printk out if multiple processors are printing simultaneously. Add the processor number to the printk output to allow the messages to be sorted. Signed-off-by: John Ogness Signed-off-by: Sebastian Andrzej Siewior commit 80b2daea61553b2ab99be5cf0d15e253b71d9fe9 Author: John Ogness Date: Tue Feb 12 15:29:54 2019 +0100 printk: implement CON_PRINTBUFFER If the CON_PRINTBUFFER flag is not set, do not replay the history for that console. Signed-off-by: John Ogness Signed-off-by: Sebastian Andrzej Siewior commit afca0bce7e259b01afa8f6f5abe7d6cf7d7c0d40 Author: John Ogness Date: Tue Feb 12 15:29:53 2019 +0100 printk: print history for new consoles When new consoles register, they currently print how many messages they have missed. However, many (or all) of those messages may still be in the ring buffer. Add functionality to print as much of the history as available. This is a clean replacement of the old exclusive console hack. Signed-off-by: John Ogness Signed-off-by: Sebastian Andrzej Siewior commit b2b908beefe2555512b20f76dded990bcbdff584 Author: John Ogness Date: Tue Feb 12 15:29:52 2019 +0100 printk: do boot_delay_msec inside printk_delay Both functions needed to be called one after the other, so just integrate boot_delay_msec into printk_delay for simplification. Signed-off-by: John Ogness Signed-off-by: Sebastian Andrzej Siewior commit 7bb5eac69566d9a724edab2e488cf93c6658c953 Author: John Ogness Date: Tue Feb 12 15:29:51 2019 +0100 printk: track seq per console Allow each console to track which seq record was last printed. This simplifies identifying dropped records. Signed-off-by: John Ogness Signed-off-by: Sebastian Andrzej Siewior commit 1021ee879ff255df487da6f19c2b2b497570365b Author: John Ogness Date: Tue Feb 12 15:29:50 2019 +0100 printk: minimize console locking implementation Since printing of the printk buffer is now handled by the printk kthread, minimize the console locking functions to just handle locking of the console. NOTE: With this console_flush_on_panic will no longer flush. Signed-off-by: John Ogness Signed-off-by: Sebastian Andrzej Siewior commit bee084e7d7c7a1d69b6423ffbc390cc4aab5bf21 Author: John Ogness Date: Tue Feb 12 15:29:49 2019 +0100 printk_safe: remove printk safe code vprintk variants are now NMI-safe so there is no longer a need for the "safe" calls. NOTE: This also removes printk flushing functionality. 
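Building on the console_atomic_lock()/console_atomic_unlock() prototypes quoted above, a minimal write_atomic sketch for a hypothetical console driver could look as follows (the character-output helper is an assumption, not a real 8250 function):

static void sketch_uart_put_char(struct console *co, char c);	/* assumed */

static void sketch_console_write_atomic(struct console *co,
					const char *s, unsigned int count)
{
	unsigned int flags;
	unsigned int i;

	console_atomic_lock(&flags);	/* processor-reentrant, NMI safe */
	for (i = 0; i < count; i++)
		sketch_uart_put_char(co, s[i]);
	console_atomic_unlock(flags);
}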
Signed-off-by: John Ogness Signed-off-by: Sebastian Andrzej Siewior commit 4dd343fadb9f8f7b8b04e2356898f6b2a681d5da Author: John Ogness Date: Tue Feb 12 15:29:48 2019 +0100 printk: redirect emit/store to new ringbuffer vprintk_emit and vprintk_store are the main functions that all printk variants eventually go through. Change these to store the message in the new printk ring buffer that the printk kthread is reading. Remove functions no longer in use because of the changes to vprintk_emit and vprintk_store. In order to handle interrupts and NMIs, a second per-cpu ring buffer (sprint_rb) is added. This ring buffer is used for NMI-safe memory allocation in order to format the printk messages. NOTE: LOG_CONT is ignored for now and handled as individual messages. LOG_CONT functions are masked behind "#if 0" blocks until their functionality can be restored Signed-off-by: John Ogness Signed-off-by: Sebastian Andrzej Siewior commit 81869a6180e285b4d29cd48a455400ccbb9852a8 Author: John Ogness Date: Tue Feb 12 15:29:47 2019 +0100 printk: remove exclusive console hack In order to support printing the printk log history when new consoles are registered, a global exclusive_console variable is temporarily set. This only works because printk runs with preemption disabled. When console printing is moved to a fully preemptible dedicated kthread, this hack no longer works. Remove exclusive_console usage. Signed-off-by: John Ogness Signed-off-by: Sebastian Andrzej Siewior commit c9a71d20bad00e88ed426d5c14665d48443f7314 Author: John Ogness Date: Tue Feb 12 15:29:46 2019 +0100 printk: add ring buffer and kthread The printk ring buffer provides an NMI-safe interface for writing messages to a ring buffer. Using such a buffer for alleviates printk callers from the current burdens of disabled preemption while calling the console drivers (and possibly printing out many messages that another task put into the log buffer). Create a ring buffer to be used for storing messages to be printed to the consoles. Create a dedicated printk kthread to block on the ring buffer and call the console drivers for the read messages. NOTE: The printk_delay is relocated to _after_ the message is printed, where it makes more sense. Signed-off-by: John Ogness Signed-off-by: Sebastian Andrzej Siewior commit 9b9e811d443d676dd4db3b12b191601957779024 Author: John Ogness Date: Tue Feb 12 15:29:45 2019 +0100 printk-rb: add functionality required by printk The printk subsystem needs to be able to query the size of the ring buffer, seek to specific entries within the ring buffer, and track if records could not be stored in the ring buffer. Signed-off-by: John Ogness Signed-off-by: Sebastian Andrzej Siewior commit 5f878bc67f82c7e4a5b5688c78b742a6e9b187a6 Author: John Ogness Date: Tue Feb 12 15:29:44 2019 +0100 printk-rb: add blocking reader support Add a blocking read function for readers. An irq_work function is used to signal the wait queue so that write notification can be triggered from any context. Signed-off-by: John Ogness Signed-off-by: Sebastian Andrzej Siewior commit f0e2db4043171b74c90b83c251f2e7fb6698e1d2 Author: John Ogness Date: Tue Feb 12 15:29:43 2019 +0100 printk-rb: add basic non-blocking reading interface Add reader iterator static declaration/initializer, dynamic initializer, and functions to iterate and retrieve ring buffer data. 
Signed-off-by: John Ogness Signed-off-by: Sebastian Andrzej Siewior commit 4707654b7fca859f949e0d66c671d71089dfba75 Author: John Ogness Date: Tue Feb 12 15:29:42 2019 +0100 printk-rb: add writer interface Add the writer functions prb_reserve() and prb_commit(). These make use of processor-reentrant spin locks to limit the number of possible interruption scenarios for the writers. Signed-off-by: John Ogness Signed-off-by: Sebastian Andrzej Siewior commit a58ca0fe5d6b440c5150c19b51bb71d00493352c Author: John Ogness Date: Tue Feb 12 15:29:41 2019 +0100 printk-rb: define ring buffer struct and initializer See Documentation/printk-ringbuffer.txt for details about the initializer arguments. Signed-off-by: John Ogness Signed-off-by: Sebastian Andrzej Siewior commit 3016030f82962255e6dc036253a1aaf46f2ce832 Author: John Ogness Date: Tue Feb 12 15:29:40 2019 +0100 printk-rb: add prb locking functions Add processor-reentrant spin locking functions. These allow restricting the number of possible contexts to 2, which can simplify implementing code that also supports NMI interruptions. prb_lock(); /* * This code is synchronized with all contexts * except an NMI on the same processor. */ prb_unlock(); In order to support printk's emergency messages, a processor-reentrant spin lock will be used to control raw access to the emergency console. However, it must be the same processor-reentrant spin lock as the one used by the ring buffer, otherwise a deadlock can occur: CPU1: printk lock -> emergency -> serial lock CPU2: serial lock -> printk lock By making the processor-reentrant implementation available externally, printk can use the same atomic_t for the ring buffer as for the emergency console and thus avoid the above deadlock. Signed-off-by: John Ogness Signed-off-by: Sebastian Andrzej Siewior commit 490f2abb77daf20feacbbfa55d88bde616f9072e Author: John Ogness Date: Tue Feb 12 15:29:39 2019 +0100 printk-rb: add printk ring buffer documentation The full documentation file for the printk ring buffer. Signed-off-by: John Ogness Signed-off-by: Sebastian Andrzej Siewior commit cf1b46c050a47273213ef94b342453be7f59dfef Author: Thomas Gleixner Date: Tue Aug 13 14:29:41 2019 +0200 KVM: arm/arm64: Let the timer expire in hardirq context on RT The timers are canceled from a preempt-notifier which is invoked with disabled preemption, which is not allowed on PREEMPT_RT. The timer callback is short, so it could be invoked in hard-IRQ context on -RT. Let the timer expire in hard-IRQ context even on -RT. Signed-off-by: Thomas Gleixner Acked-by: Marc Zyngier Tested-by: Julien Grall Signed-off-by: Sebastian Andrzej Siewior commit fc2bdf536b3d145e5cb16d1d4edf36c6427760f0 Author: Uladzislau Rezki (Sony) Date: Sat Nov 30 17:54:33 2019 -0800 mm/vmalloc: remove preempt_disable/enable when doing preloading Some background. Preemption was previously disabled to guarantee that a preloaded object is available for the CPU it was stored for. That was achieved by combining disabling preemption with taking the spin lock while the ne_fit_preload_node is checked. The aim was to not allocate in atomic context when the spinlock is taken later, for regular vmap allocations. But that approach conflicts with the CONFIG_PREEMPT_RT philosophy. It means that calling spin_lock() with disabled preemption is forbidden in the CONFIG_PREEMPT_RT kernel. Therefore, get rid of preempt_disable() and preempt_enable() when the preload is done for splitting purposes.
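An illustrative and simplified implementation of such a processor-reentrant spin lock, in the spirit of the prb_lock()/prb_unlock() usage quoted above but not the actual printk ring buffer code:

#include <linux/atomic.h>
#include <linux/irqflags.h>
#include <linux/smp.h>

struct prb_cpulock_sketch {
	atomic_t owner;		/* owning CPU, or -1 when free */
	unsigned long irqflags;	/* saved flags of the outermost owner */
	unsigned int nested;	/* re-entry depth on the owning CPU */
};
/* initialize with .owner = ATOMIC_INIT(-1) */

static void prb_lock_sketch(struct prb_cpulock_sketch *l)
{
	unsigned long flags;
	int cpu;

	local_irq_save(flags);
	cpu = smp_processor_id();

	if (atomic_read(&l->owner) == cpu) {
		l->nested++;		/* NMI re-entering on the owner CPU */
		local_irq_restore(flags);
		return;
	}

	while (atomic_cmpxchg(&l->owner, -1, cpu) != -1)
		cpu_relax();		/* another CPU holds the lock */
	l->irqflags = flags;
}

static void prb_unlock_sketch(struct prb_cpulock_sketch *l)
{
	unsigned long flags;

	if (l->nested) {
		l->nested--;
		return;
	}

	flags = l->irqflags;
	atomic_set(&l->owner, -1);	/* release the outermost hold */
	local_irq_restore(flags);
}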
As a result we do not guarantee now that a CPU is preloaded, instead we minimize the case when it is not, with this change, by populating the per cpu preload pointer under the vmap_area_lock. This implies that at least each caller that has done the preallocation will not fallback to an atomic allocation later. It is possible that the preallocation would be pointless or that no preallocation is done because of the race but the data shows that this is really rare. For example i run the special test case that follows the preload pattern and path. 20 "unbind" threads run it and each does 1000000 allocations. Only 3.5 times among 1000000 a CPU was not preloaded. So it can happen but the number is negligible. [mhocko@suse.com: changelog additions] Link: http://lkml.kernel.org/r/20191016095438.12391-1-urezki@gmail.com Fixes: 82dd23e84be3 ("mm/vmalloc.c: preload a CPU with one object for split purpose") Signed-off-by: Uladzislau Rezki (Sony) Reviewed-by: Steven Rostedt (VMware) Acked-by: Sebastian Andrzej Siewior Acked-by: Daniel Wagner Acked-by: Michal Hocko Cc: Hillf Danton Cc: Matthew Wilcox Cc: Oleksiy Avramchenko Cc: Peter Zijlstra Cc: Thomas Gleixner Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Sebastian Andrzej Siewior commit 3d3b821b18058f829ed52298ec8bca3db697dc22 Author: Marc Kleine-Budde Date: Wed Mar 5 00:49:47 2014 +0100 net: sched: Use msleep() instead of yield() On PREEMPT_RT enabled systems the interrupt handler run as threads at prio 50 (by default). If a high priority userspace process tries to shut down a busy network interface it might spin in a yield loop waiting for the device to become idle. With the interrupt thread having a lower priority than the looping process it might never be scheduled and so result in a deadlock on UP systems. With Magic SysRq the following backtrace can be produced: > test_app R running 0 174 168 0x00000000 > [] (__schedule+0x220/0x3fc) from [] (preempt_schedule_irq+0x48/0x80) > [] (preempt_schedule_irq+0x48/0x80) from [] (svc_preempt+0x8/0x20) > [] (svc_preempt+0x8/0x20) from [] (local_bh_enable+0x18/0x88) > [] (local_bh_enable+0x18/0x88) from [] (dev_deactivate_many+0x220/0x264) > [] (dev_deactivate_many+0x220/0x264) from [] (__dev_close_many+0x64/0xd4) > [] (__dev_close_many+0x64/0xd4) from [] (__dev_close+0x28/0x3c) > [] (__dev_close+0x28/0x3c) from [] (__dev_change_flags+0x88/0x130) > [] (__dev_change_flags+0x88/0x130) from [] (dev_change_flags+0x10/0x48) > [] (dev_change_flags+0x10/0x48) from [] (do_setlink+0x370/0x7ec) > [] (do_setlink+0x370/0x7ec) from [] (rtnl_newlink+0x2b4/0x450) > [] (rtnl_newlink+0x2b4/0x450) from [] (rtnetlink_rcv_msg+0x158/0x1f4) > [] (rtnetlink_rcv_msg+0x158/0x1f4) from [] (netlink_rcv_skb+0xac/0xc0) > [] (netlink_rcv_skb+0xac/0xc0) from [] (rtnetlink_rcv+0x18/0x24) > [] (rtnetlink_rcv+0x18/0x24) from [] (netlink_unicast+0x13c/0x198) > [] (netlink_unicast+0x13c/0x198) from [] (netlink_sendmsg+0x264/0x2e0) > [] (netlink_sendmsg+0x264/0x2e0) from [] (sock_sendmsg+0x78/0x98) > [] (sock_sendmsg+0x78/0x98) from [] (___sys_sendmsg.part.25+0x268/0x278) > [] (___sys_sendmsg.part.25+0x268/0x278) from [] (__sys_sendmsg+0x48/0x78) > [] (__sys_sendmsg+0x48/0x78) from [] (ret_fast_syscall+0x0/0x2c) This patch works around the problem by replacing yield() by msleep(1), giving the interrupt thread time to finish, similar to other changes contained in the rt patch set. Using wait_for_completion() instead would probably be a better solution. 
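The fix above reduces to a one-line substitution in the wait loop; a sketch using the existing some_qdisc_is_busy() helper from net/sched/sch_generic.c:

#include <linux/delay.h>
#include <linux/netdevice.h>

static void wait_for_qdiscs_sketch(struct net_device *dev)
{
	/* was: while (some_qdisc_is_busy(dev)) yield();
	 * which can spin forever against a lower-priority IRQ thread */
	while (some_qdisc_is_busy(dev))
		msleep(1);	/* sleep so the threaded handler can run */
}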
Signed-off-by: Marc Kleine-Budde Signed-off-by: Sebastian Andrzej Siewior commit cebffab36276fa57906b6eaf5a667acda8aa32a3 Author: Sebastian Andrzej Siewior Date: Thu Jul 26 09:13:42 2018 +0200 arm64: KVM: Invoke compute_layout() before alternatives are applied compute_layout() is invoked as part of an alternative fixup under stop_machine(). This function invokes get_random_long() which acquires a sleeping lock on -RT which can not be acquired in this context. Rename compute_layout() to kvm_compute_layout() and invoke it before stop_machine() applies the alternatives. Add a __init prefix to kvm_compute_layout() because the caller has it, too (and so the code can be discarded after boot). Signed-off-by: Sebastian Andrzej Siewior commit 6ec1cb6b21d28d8e844daa62dc1b8d9c853c70a8 Author: Sebastian Andrzej Siewior Date: Fri Nov 15 21:37:22 2019 +0100 block: Don't disable interrupts in trigger_softirq() trigger_softirq() is always invoked as a SMP-function call which is always invoked with disables interrupts. Don't disable interrupt in trigger_softirq() because interrupts are already disabled. Signed-off-by: Sebastian Andrzej Siewior commit 736b531afee862fd73e0268b72a1b95e24cb28b4 Author: Julia Cartwright Date: Fri Sep 28 21:03:51 2018 +0000 watchdog: prevent deferral of watchdogd wakeup on RT When PREEMPT_RT is enabled, all hrtimer expiry functions are deferred for execution into the context of ksoftirqd unless otherwise annotated. Deferring the expiry of the hrtimer used by the watchdog core, however, is a waste, as the callback does nothing but queue a kthread work item and wakeup watchdogd. It's worst then that, too: the deferral through ksoftirqd also means that for correct behavior a user must adjust the scheduling parameters of both watchdogd _and_ ksoftirqd, which is unnecessary and has other side effects (like causing unrelated expiry functions to execute at potentially elevated priority). Instead, mark the hrtimer used by the watchdog core as being _HARD to allow it's execution directly from hardirq context. The work done in this expiry function is well-bounded and minimal. A user still must adjust the scheduling parameters of the watchdogd to be correct w.r.t. their application needs. Cc: Guenter Roeck Reported-and-tested-by: Steffen Trumtrar Reported-by: Tim Sander Signed-off-by: Julia Cartwright Acked-by: Guenter Roeck [bigeasy: use only HRTIMER_MODE_REL_HARD] Signed-off-by: Sebastian Andrzej Siewior commit c20b053798e9dcd808ce3040e5e6128c7b176e8c Author: Sebastian Andrzej Siewior Date: Wed Apr 10 11:01:37 2019 +0200 drm/i915: Don't disable interrupts independently of the lock The locks (active.lock and rq->lock) need to be taken with disabled interrupts. This is done in i915_request_retire() by disabling the interrupts independently of the locks itself. While local_irq_disable()+spin_lock() equals spin_lock_irq() on vanilla it does not on PREEMPT_RT. Chris Wilson confirmed that local_irq_disable() was just introduced as an optimisation to avoid enabling/disabling interrupts during lock/unlock combo. Enable/disable interrupts as part of the locking instruction. 
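A before/after sketch of the i915 locking pattern discussed above, with an illustrative lock rather than the real active.lock/rq->lock pair:

/* before: equals spin_lock_irq() only on !RT kernels */
static void retire_sketch_before(spinlock_t *lock)
{
	local_irq_disable();
	spin_lock(lock);
	/* ... retire requests ... */
	spin_unlock(lock);
	local_irq_enable();
}

/* after: interrupt handling is part of the locking primitive,
 * so PREEMPT_RT can map it onto its sleeping spinlocks */
static void retire_sketch_after(spinlock_t *lock)
{
	spin_lock_irq(lock);
	/* ... retire requests ... */
	spin_unlock_irq(lock);
}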
Cc: Chris Wilson Signed-off-by: Sebastian Andrzej Siewior commit 1152781d37f9641bf5d41d3e9e5e67acb4ba6618 Author: Sebastian Andrzej Siewior Date: Wed Sep 4 17:59:36 2019 +0200 percpu-refcount: use normal instead of RCU-sched" This is a revert of commit a4244454df129 ("percpu-refcount: use RCU-sched insted of normal RCU") which claims the only reason for using RCU-sched is "rcu_read_[un]lock() … are slightly more expensive than preempt_disable/enable()" and "As the RCU critical sections are extremely short, using sched-RCU shouldn't have any latency implications." The problem with RCU-sched is that it disables preemption and the callback must not acquire any sleeping locks like spinlock_t on PREEMPT_RT which is the case. Convert back to normal RCU. Signed-off-by: Sebastian Andrzej Siewior commit 1cee8a1b6f3fd595a369d3d08a3397e07ce50827 Author: Thomas Gleixner Date: Thu Oct 17 12:19:02 2019 +0200 x86/ioapic: Rename misnamed functions ioapic_irqd_[un]mask() are misnomers as both functions do way more than masking and unmasking the interrupt line. Both deal with the moving the affinity of the interrupt within interrupt context. The mask/unmask is just a tiny part of the functionality. Rename them to ioapic_prepare/finish_move(), fixup the call sites and rename the related variables in the code to reflect what this is about. No functional change. Signed-off-by: Thomas Gleixner Cc: Andy Shevchenko Cc: Linus Torvalds Cc: Peter Zijlstra Cc: Sebastian Siewior Link: https://lkml.kernel.org/r/20191017101938.412489856@linutronix.de Signed-off-by: Ingo Molnar Signed-off-by: Sebastian Andrzej Siewior commit b8be116b21009a91a375fd8b47b06d62706763eb Author: Thomas Gleixner Date: Thu Oct 17 12:19:01 2019 +0200 x86/ioapic: Prevent inconsistent state when moving an interrupt There is an issue with threaded interrupts which are marked ONESHOT and using the fasteoi handler: if (IS_ONESHOT()) mask_irq(); .... cond_unmask_eoi_irq() chip->irq_eoi(); if (setaffinity_pending) { mask_ioapic(); ... move_affinity(); unmask_ioapic(); } So if setaffinity is pending the interrupt will be moved and then unconditionally unmasked at the ioapic level, which is wrong in two aspects: 1) It should be kept masked up to the point where the threaded handler finished. 2) The physical chip state and the software masked state are inconsistent Guard both the mask and the unmask with a check for the software masked state. If the line is marked masked then the ioapic line is also masked, so both mask_ioapic() and unmask_ioapic() can be skipped safely. Signed-off-by: Thomas Gleixner Cc: Andy Shevchenko Cc: Linus Torvalds Cc: Peter Zijlstra Cc: Sebastian Siewior Fixes: 3aa551c9b4c4 ("genirq: add threaded interrupt handler support") Link: https://lkml.kernel.org/r/20191017101938.321393687@linutronix.de Signed-off-by: Ingo Molnar Signed-off-by: Sebastian Andrzej Siewior commit 03633c38041a7728507473ca7cedb892f0586f98 Author: Joel Fernandes (Google) Date: Thu Aug 15 10:18:42 2019 -0400 workqueue: Convert for_each_wq to use built-in list check Because list_for_each_entry_rcu() can now check for holding a lock as well as for being in an RCU read-side critical section, this commit replaces the workqueue_sysfs_unregister() function's use of assert_rcu_or_wq_mutex() and list_for_each_entry_rcu() with list_for_each_entry_rcu() augmented with a lockdep_is_held() optional argument. Acked-by: Tejun Heo Signed-off-by: Joel Fernandes (Google) Signed-off-by: Paul E. 
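The percpu-refcount revert above is essentially a swap of RCU flavours on the read side; a generic sketch (the structure is invented, not the percpu_ref internals):

#include <linux/rcupdate.h>

struct counted_obj_sketch {
	unsigned long count;
};

static unsigned long read_count_sketch(struct counted_obj_sketch *obj)
{
	unsigned long v;

	rcu_read_lock();		/* was: rcu_read_lock_sched() */
	v = READ_ONCE(obj->count);	/* short read-side section, now
					 * preemptible on PREEMPT_RT */
	rcu_read_unlock();		/* was: rcu_read_unlock_sched() */

	return v;
}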
McKenney Signed-off-by: Sebastian Andrzej Siewior commit 416adddeabbd2b70224986ef5f7625228f531bfd Author: Thomas Gleixner Date: Fri Aug 9 14:42:33 2019 +0200 jbd2: Free journal head outside of locked region On PREEMPT_RT bit-spinlocks have the same semantics as on PREEMPT_RT=n, i.e. they disable preemption. That means functions which are not safe to be called in preempt disabled context on RT trigger a might_sleep() assert. The journal head bit spinlock is mostly held for short code sequences with trivial RT safe functionality, except for one place: jbd2_journal_put_journal_head() invokes __journal_remove_journal_head() with the journal head bit spinlock held. __journal_remove_journal_head() invokes kmem_cache_free() which must not be called with preemption disabled on RT. Jan suggested to rework the removal function so the actual free happens outside the bit-spinlocked region. Split it into two parts: - Do the sanity checks and the buffer head detach under the lock - Do the actual free after dropping the lock There is error case handling in the free part which needs to dereference the b_size field of the now detached buffer head. Due to paranoia (caused by ignorance) the size is retrieved in the detach function and handed into the free function. Might be over-engineered, but better safe than sorry. This makes the journal head bit-spinlock usage RT compliant and also avoids nested locking which is not covered by lockdep. Suggested-by: Jan Kara Signed-off-by: Thomas Gleixner Cc: linux-ext4@vger.kernel.org Cc: "Theodore Ts'o" Cc: Jan Kara Signed-off-by: Jan Kara Signed-off-by: Sebastian Andrzej Siewior commit 5f3933095bbefa0f536b9c701c55968629cc5bbf Author: Thomas Gleixner Date: Fri Aug 9 14:42:32 2019 +0200 jbd2: Make state lock a spinlock Bit-spinlocks are problematic on PREEMPT_RT if functions which might sleep on RT, e.g. spin_lock(), alloc/free(), are invoked inside the lock held region because bit spinlocks disable preemption even on RT. A first attempt was to replace state lock with a spinlock placed in struct buffer_head and make the locking conditional on PREEMPT_RT and DEBUG_BIT_SPINLOCKS. Jan pointed out that there is a 4 byte hole in struct journal_head where a regular spinlock fits in and he would not object to convert the state lock to a spinlock unconditionally. Aside of solving the RT problem, this also gains lockdep coverage for the journal head state lock (bit-spinlocks are not covered by lockdep as it's hard to fit a lockdep map into a single bit). The trivial change would have been to convert the jbd_*lock_bh_state() inlines, but that comes with the downside that these functions take a buffer head pointer which needs to be converted to a journal head pointer which adds another level of indirection. As almost all functions which use this lock have a journal head pointer readily available, it makes more sense to remove the lock helper inlines and write out spin_*lock() at all call sites. Fixup all locking comments as well. Suggested-by: Jan Kara Signed-off-by: Thomas Gleixner Signed-off-by: Jan Kara Cc: "Theodore Ts'o" Cc: Mark Fasheh Cc: Joseph Qi Cc: Joel Becker Cc: Jan Kara Cc: linux-ext4@vger.kernel.org Signed-off-by: Sebastian Andrzej Siewior commit 3763f91988c466323b4630a2a3e90232773f1ef9 Author: Jan Kara Date: Fri Aug 9 14:42:31 2019 +0200 jbd2: Don't call __bforget() unnecessarily jbd2_journal_forget() jumps to 'not_jbd' branch which calls __bforget() in cases where the buffer is clean which is pointless. 
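A structural sketch of the two-part split described in the jbd2 commit above ("Free journal head outside of locked region"); the detach and free helpers are illustrative, only the bit-lock helpers are the real jbd2 ones:

#include <linux/jbd2.h>

/* hypothetical split helpers */
static void __detach_journal_head_sketch(struct buffer_head *bh,
					 struct journal_head *jh);
static void __free_journal_head_sketch(struct journal_head *jh,
				       size_t b_size);

static void put_journal_head_sketch(struct journal_head *jh)
{
	struct buffer_head *bh = jh2bh(jh);
	size_t b_size;

	jbd_lock_bh_journal_head(bh);
	b_size = bh->b_size;		/* error path needs it after detach */
	__detach_journal_head_sketch(bh, jh);	/* sanity checks + detach */
	jbd_unlock_bh_journal_head(bh);

	/* kmem_cache_free() must not run under the bit lock on RT */
	__free_journal_head_sketch(jh, b_size);
}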
In case of failed assertion, it can be even argued that it is safer not to touch buffer's dirty bits. Also logically it makes more sense to just jump to 'drop' and that will make logic also simpler when we switch bh_state_lock to a spinlock. Signed-off-by: Jan Kara Signed-off-by: Sebastian Andrzej Siewior commit fc006a8c30b8bb0d739cc8296f2f5672a6a8229b Author: Jan Kara Date: Fri Aug 9 14:42:30 2019 +0200 jbd2: Drop unnecessary branch from jbd2_journal_forget() We have cleared both dirty & jbddirty bits from the bh. So there's no difference between bforget() and brelse(). Thus there's no point jumping to no_jbd branch. Signed-off-by: Jan Kara Signed-off-by: Sebastian Andrzej Siewior commit 0f0a21d60a239b39223bc8db5a350ca77adc57ba Author: Jan Kara Date: Fri Aug 9 14:42:29 2019 +0200 jbd2: Move dropping of jh reference out of un/re-filing functions __jbd2_journal_unfile_buffer() and __jbd2_journal_refile_buffer() drop transaction's jh reference when they remove jh from a transaction. This will be however inconvenient once we move state lock into journal_head itself as we still need to unlock it and we'd need to grab jh reference just for that. Move dropping of jh reference out of these functions into the few callers. Signed-off-by: Jan Kara Signed-off-by: Sebastian Andrzej Siewior commit 22e19335b9f0b32f69a622b2da230de6e1fe2bb5 Author: Thomas Gleixner Date: Fri Aug 9 14:42:28 2019 +0200 jbd2: Remove jbd_trylock_bh_state() No users. Signed-off-by: Thomas Gleixner Reviewed-by: Jan Kara Cc: linux-ext4@vger.kernel.org Cc: "Theodore Ts'o" Signed-off-by: Jan Kara Signed-off-by: Sebastian Andrzej Siewior commit e0cb0cbddd3ff4fffad80d4d23f2626652d6c930 Author: Thomas Gleixner Date: Fri Aug 9 14:42:27 2019 +0200 jbd2: Simplify journal_unmap_buffer() journal_unmap_buffer() checks first whether the buffer head is a journal. If so it takes locks and then invokes jbd2_journal_grab_journal_head() followed by another check whether this is journal head buffer. The double checking is pointless. Replace the initial check with jbd2_journal_grab_journal_head() which alredy checks whether the buffer head is actually a journal. Allows also early access to the journal head pointer for the upcoming conversion of state lock to a regular spinlock. Signed-off-by: Thomas Gleixner Reviewed-by: Jan Kara Cc: linux-ext4@vger.kernel.org Cc: "Theodore Ts'o" Signed-off-by: Jan Kara Signed-off-by: Sebastian Andrzej Siewior commit 252b692c1433c5bdee609ca6cf2d31212a3f0e17 Author: Julien Grall Date: Fri Sep 20 11:08:35 2019 +0100 lib/ubsan: Don't seralize UBSAN report At the moment, UBSAN report will be serialized using a spin_lock(). On RT-systems, spinlocks are turned to rt_spin_lock and may sleep. 
This will result to the following splat if the undefined behavior is in a context that can sleep: | BUG: sleeping function called from invalid context at /src/linux/kernel/locking/rtmutex.c:968 | in_atomic(): 1, irqs_disabled(): 128, pid: 3447, name: make | 1 lock held by make/3447: | #0: 000000009a966332 (&mm->mmap_sem){++++}, at: do_page_fault+0x140/0x4f8 | Preemption disabled at: | [] rt_mutex_futex_unlock+0x4c/0xb0 | CPU: 3 PID: 3447 Comm: make Tainted: G W 5.2.14-rt7-01890-ge6e057589653 #911 | Call trace: | dump_backtrace+0x0/0x148 | show_stack+0x14/0x20 | dump_stack+0xbc/0x104 | ___might_sleep+0x154/0x210 | rt_spin_lock+0x68/0xa0 | ubsan_prologue+0x30/0x68 | handle_overflow+0x64/0xe0 | __ubsan_handle_add_overflow+0x10/0x18 | __lock_acquire+0x1c28/0x2a28 | lock_acquire+0xf0/0x370 | _raw_spin_lock_irqsave+0x58/0x78 | rt_mutex_futex_unlock+0x4c/0xb0 | rt_spin_unlock+0x28/0x70 | get_page_from_freelist+0x428/0x2b60 | __alloc_pages_nodemask+0x174/0x1708 | alloc_pages_vma+0x1ac/0x238 | __handle_mm_fault+0x4ac/0x10b0 | handle_mm_fault+0x1d8/0x3b0 | do_page_fault+0x1c8/0x4f8 | do_translation_fault+0xb8/0xe0 | do_mem_abort+0x3c/0x98 | el0_da+0x20/0x24 The spin_lock() will protect against multiple CPUs to output a report together, I guess to prevent them to be interleaved. However, they can still interleave with other messages (and even splat from __migth_sleep). So the lock usefulness seems pretty limited. Rather than trying to accomodate RT-system by switching to a raw_spin_lock(), the lock is now completely dropped. Link: https://lkml.kernel.org/r/20190920100835.14999-1-julien.grall@arm.com Reported-by: Andre Przywara Signed-off-by: Julien Grall Acked-by: Andrey Ryabinin Signed-off-by: Sebastian Andrzej Siewior commit 68da8bdb74d7aa217934fa4df847d9d668506d7b Author: Waiman Long Date: Thu Oct 3 16:36:08 2019 -0400 lib/smp_processor_id: Don't use cpumask_equal() The check_preemption_disabled() function uses cpumask_equal() to see if the task is bounded to the current CPU only. cpumask_equal() calls memcmp() to do the comparison. As x86 doesn't have __HAVE_ARCH_MEMCMP, the slow memcmp() function in lib/string.c is used. On a RT kernel that call check_preemption_disabled() very frequently, below is the perf-record output of a certain microbenchmark: 42.75% 2.45% testpmd [kernel.kallsyms] [k] check_preemption_disabled 40.01% 39.97% testpmd [kernel.kallsyms] [k] memcmp We should avoid calling memcmp() in performance critical path. So the cpumask_equal() call is now replaced with an equivalent simpler check. Signed-off-by: Waiman Long Signed-off-by: Sebastian Andrzej Siewior commit dc71226e59c276e531e6a512cdcf821b44ceb323 Author: Greg Kroah-Hartman Date: Tue Dec 17 19:56:55 2019 +0100 Linux 5.4.4 commit e240c7d1f17872a41df9e098fa0b06afd51b1270 Author: Robert Richter Date: Thu Nov 21 21:36:57 2019 +0000 EDAC/ghes: Do not warn when incrementing refcount on 0 [ Upstream commit 16214bd9e43a31683a7073664b000029bba00354 ] The following warning from the refcount framework is seen during ghes initialization: EDAC MC0: Giving out device to module ghes_edac.c controller ghes_edac: DEV ghes (INTERRUPT) ------------[ cut here ]------------ refcount_t: increment on 0; use-after-free. WARNING: CPU: 36 PID: 1 at lib/refcount.c:156 refcount_inc_checked [...] Call trace: refcount_inc_checked ghes_edac_register ghes_probe ... It warns if the refcount is incremented from zero. This warning is reasonable as a kernel object is typically created with a refcount of one and freed once the refcount is zero. 
Afterwards the object would be "used-after-free". For GHES, the refcount is initialized with zero, and that is why this message is seen when initializing the first instance. However, whenever the refcount is zero, the device will be allocated and registered. Since the ghes_reg_mutex protects the refcount and serializes allocation and freeing of ghes devices, a use-after-free cannot happen here. Instead of using refcount_inc() for the first instance, use refcount_set(). This can be used here because the refcount is zero at this point and can not change due to its protection by the mutex. Fixes: 23f61b9fc5cc ("EDAC/ghes: Fix locking and memory barrier issues") Reported-by: John Garry Signed-off-by: Robert Richter Signed-off-by: Borislav Petkov Tested-by: John Garry Cc: Cc: James Morse Cc: Cc: linux-edac Cc: Mauro Carvalho Chehab Cc: Cc: Tony Luck Cc: Link: https://lkml.kernel.org/r/20191121213628.21244-1-rrichter@marvell.com Signed-off-by: Sasha Levin commit dc63e75e19d3509e0b52d8929ced258a8b94ef2c Author: Heiner Kallweit Date: Sat Dec 7 22:21:52 2019 +0100 r8169: fix rtl_hw_jumbo_disable for RTL8168evl [ Upstream commit 0fc75219fe9a3c90631453e9870e4f6d956f0ebc ] In referenced fix we removed the RTL8168e-specific jumbo config for RTL8168evl in rtl_hw_jumbo_enable(). We have to do the same in rtl_hw_jumbo_disable(). v2: fix referenced commit id Fixes: 14012c9f3bb9 ("r8169: fix jumbo configuration for RTL8168evl") Signed-off-by: Heiner Kallweit Signed-off-by: David S. Miller Signed-off-by: Sasha Levin commit 26ba4f73a097b41726c2046f61858c184d7f75d1 Author: Tejun Heo Date: Fri Sep 20 13:39:57 2019 -0700 workqueue: Fix missing kfree(rescuer) in destroy_workqueue() commit 8efe1223d73c218ce7e8b2e0e9aadb974b582d7f upstream. Signed-off-by: Tejun Heo Reported-by: Qian Cai Fixes: def98c84b6cd ("workqueue: Fix spurious sanity check failures in destroy_workqueue()") Cc: Nobuhiro Iwamatsu Signed-off-by: Greg Kroah-Hartman commit e13c3c2196e90a9fdd1f90201635d371e51c10b7 Author: Ming Lei Date: Mon Nov 4 16:26:53 2019 +0800 blk-mq: make sure that line break can be printed commit d2c9be89f8ebe7ebcc97676ac40f8dec1cf9b43a upstream. 8962842ca5ab ("blk-mq: avoid sysfs buffer overflow with too many CPU cores") avoids sysfs buffer overflow, and reserves one character for line break. However, the last snprintf() doesn't get correct 'size' parameter passed in, so fixed it. Fixes: 8962842ca5ab ("blk-mq: avoid sysfs buffer overflow with too many CPU cores") Signed-off-by: Ming Lei Signed-off-by: Jens Axboe Cc: Nobuhiro Iwamatsu Signed-off-by: Greg Kroah-Hartman commit 62f4e8015ed88bcae00465cc50dfc628992ce2a4 Author: Jan Kara Date: Fri Nov 8 12:45:11 2019 +0100 ext4: fix leak of quota reservations commit f4c2d372b89a1e504ebb7b7eb3e29b8306479366 upstream. Commit 8fcc3a580651 ("ext4: rework reserved cluster accounting when invalidating pages") moved freeing of delayed allocation reservations from dirty page invalidation time to time when we evict corresponding status extent from extent status tree. For inodes which don't have any blocks allocated this may actually happen only in ext4_clear_blocks() which is after we've dropped references to quota structures from the inode. Thus reservation of quota leaked. Fix the problem by clearing quota information from the inode only after evicting extent status tree in ext4_clear_inode(). 
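A simplified sketch of the refcount handling described for EDAC/ghes above; the mutex and refcount names are placeholders for ghes_reg_mutex and ghes_refcount:

#include <linux/mutex.h>
#include <linux/refcount.h>

static DEFINE_MUTEX(reg_mutex);		/* stands in for ghes_reg_mutex */
static refcount_t reg_refcount;		/* starts at zero */

static void register_instance_sketch(void)
{
	mutex_lock(&reg_mutex);

	if (refcount_read(&reg_refcount) == 0) {
		/* first user: allocate and register the device ... */
		refcount_set(&reg_refcount, 1);	/* no increment-from-zero */
	} else {
		refcount_inc(&reg_refcount);
	}

	mutex_unlock(&reg_mutex);
}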
Link: https://lore.kernel.org/r/20191108115420.GI20863@quack2.suse.cz Reported-by: Konstantin Khlebnikov Fixes: 8fcc3a580651 ("ext4: rework reserved cluster accounting when invalidating pages") Signed-off-by: Jan Kara Signed-off-by: Theodore Ts'o Signed-off-by: Greg Kroah-Hartman commit 5eb36e64bc9edb62dca1ecdf9261011a7e36ac8c Author: yangerkun Date: Thu Sep 19 14:35:08 2019 +0800 ext4: fix a bug in ext4_wait_for_tail_page_commit commit 565333a1554d704789e74205989305c811fd9c7a upstream. No need to wait for any commit once the page is fully truncated. Besides, it may confuse e.g. concurrent ext4_writepage() with the page still be dirty (will be cleared by truncate_pagecache() in ext4_setattr()) but buffers has been freed; and then trigger a bug show as below: [ 26.057508] ------------[ cut here ]------------ [ 26.058531] kernel BUG at fs/ext4/inode.c:2134! ... [ 26.088130] Call trace: [ 26.088695] ext4_writepage+0x914/0xb28 [ 26.089541] writeout.isra.4+0x1b4/0x2b8 [ 26.090409] move_to_new_page+0x3b0/0x568 [ 26.091338] __unmap_and_move+0x648/0x988 [ 26.092241] unmap_and_move+0x48c/0xbb8 [ 26.093096] migrate_pages+0x220/0xb28 [ 26.093945] kernel_mbind+0x828/0xa18 [ 26.094791] __arm64_sys_mbind+0xc8/0x138 [ 26.095716] el0_svc_common+0x190/0x490 [ 26.096571] el0_svc_handler+0x60/0xd0 [ 26.097423] el0_svc+0x8/0xc Run the procedure (generate by syzkaller) parallel with ext3. void main() { int fd, fd1, ret; void *addr; size_t length = 4096; int flags; off_t offset = 0; char *str = "12345"; fd = open("a", O_RDWR | O_CREAT); assert(fd >= 0); /* Truncate to 4k */ ret = ftruncate(fd, length); assert(ret == 0); /* Journal data mode */ flags = 0xc00f; ret = ioctl(fd, _IOW('f', 2, long), &flags); assert(ret == 0); /* Truncate to 0 */ fd1 = open("a", O_TRUNC | O_NOATIME); assert(fd1 >= 0); addr = mmap(NULL, length, PROT_WRITE | PROT_READ, MAP_SHARED, fd, offset); assert(addr != (void *)-1); memcpy(addr, str, 5); mbind(addr, length, 0, 0, 0, MPOL_MF_MOVE); } And the bug will be triggered once we seen the below order. reproduce1 reproduce2 ... | ... truncate to 4k | change to journal data mode | | memcpy(set page dirty) truncate to 0: | ext4_setattr: | ... | ext4_wait_for_tail_page_commit | | mbind(trigger bug) truncate_pagecache(clean dirty)| ... ... | mbind will call ext4_writepage() since the page still be dirty, and then report the bug since the buffers has been free. Fix it by return directly once offset equals to 0 which means the page has been fully truncated. Reported-by: Hulk Robot Signed-off-by: yangerkun Link: https://lore.kernel.org/r/20190919063508.1045-1-yangerkun@huawei.com Reviewed-by: Jan Kara Signed-off-by: Theodore Ts'o Signed-off-by: Greg Kroah-Hartman commit 70d3c881e8abf0bd3342b7f52fe1ec7eb4c7eac4 Author: Darrick J. Wong Date: Tue Oct 15 08:44:32 2019 -0700 splice: only read in as much information as there is pipe buffer space commit 3253d9d093376d62b4a56e609f15d2ec5085ac73 upstream. Andreas Grünbacher reports that on the two filesystems that support iomap directio, it's possible for splice() to return -EAGAIN (instead of a short splice) if the pipe being written to has less space available in its pipe buffers than the length supplied by the calling process. Months ago we fixed splice_direct_to_actor to clamp the length of the read request to the size of the splice pipe. Do the same to do_splice. 
Fixes: 17614445576b6 ("splice: don't read more than available pipe space") Reported-by: syzbot+3c01db6025f26530cf8d@syzkaller.appspotmail.com Reported-by: Andreas Grünbacher Reviewed-by: Andreas Grünbacher Signed-off-by: Darrick J. Wong Signed-off-by: Greg Kroah-Hartman commit b44f9cd36bbc699d7dc71c99e4b7dabcd4fd55d8 Author: Alexandre Belloni Date: Mon Oct 21 01:13:20 2019 +0200 rtc: disable uie before setting time and enable after commit 7e7c005b4b1f1f169bcc4b2c3a40085ecc663df2 upstream. When setting the time in the future with the uie timer enabled, rtc_timer_do_work will loop for a while because the expiration of the uie timer was way before the current RTC time and a new timer will be enqueued until the current rtc time is reached. If the uie timer is enabled, disable it before setting the time and enable it after expiring current timers (which may actually be an alarm). This is the safest thing to do to ensure the uie timer is still synchronized with the RTC, especially in the UIE emulation case. Reported-by: syzbot+08116743f8ad6f9a6de7@syzkaller.appspotmail.com Fixes: 6610e0893b8b ("RTC: Rework RTC code to use timerqueue for events") Link: https://lore.kernel.org/r/20191020231320.8191-1-alexandre.belloni@bootlin.com Signed-off-by: Alexandre Belloni Signed-off-by: Greg Kroah-Hartman commit edb2aa9301b1159b9ddc7ecc55302480fad35a72 Author: Andrey Konovalov Date: Mon Oct 21 16:20:58 2019 +0200 USB: dummy-hcd: increase max number of devices to 32 commit 8442b02bf3c6770e0d7e7ea17be36c30e95987b6 upstream. When fuzzing the USB subsystem with syzkaller, we currently use 8 testing processes within one VM. To isolate testing processes from one another it is desirable to assign a dedicated USB bus to each of those, which means we need at least 8 Dummy UDC/HCD devices. This patch increases the maximum number of Dummy UDC/HCD devices to 32 (more than 8 in case we need more of them in the future). Signed-off-by: Andrey Konovalov Link: https://lore.kernel.org/r/665578f904484069bb6100fb20283b22a046ad9b.1571667489.git.andreyknvl@google.com Signed-off-by: Greg Kroah-Hartman commit 246cd4b0d52e5ca37b00d8d1c4612b2022185cb9 Author: Michael Ellerman Date: Wed Nov 27 18:41:26 2019 +1100 powerpc: Define arch_is_kernel_initmem_freed() for lockdep commit 6f07048c00fd100ed8cab66c225c157e0b6c0a50 upstream. Under certain circumstances, we hit a warning in lockdep_register_key: if (WARN_ON_ONCE(static_obj(key))) return; This occurs when the key falls into initmem that has since been freed and can now be reused. This has been observed on boot, and under memory pressure. Define arch_is_kernel_initmem_freed(), which allows lockdep to correctly identify this memory as dynamic. This fixes a bug picked up by the powerpc64 syzkaller instance where we hit the WARN via alloc_netdev_mqs. Reported-by: Qian Cai Reported-by: ppc syzbot c/o Andrew Donnellan Signed-off-by: Michael Ellerman Signed-off-by: Daniel Axtens Link: https://lore.kernel.org/r/87lfs4f7d6.fsf@dja-thinkpad.axtens.net Signed-off-by: Greg Kroah-Hartman commit 12de9bf4bfba2953119c5b8e4da52de34c84cb83 Author: Chen Jun Date: Sat Nov 30 17:58:11 2019 -0800 mm/shmem.c: cast the type of unmap_start to u64 commit aa71ecd8d86500da6081a72da6b0b524007e0627 upstream. In 64bit system. sb->s_maxbytes of shmem filesystem is MAX_LFS_FILESIZE, which equal LLONG_MAX. If offset > LLONG_MAX - PAGE_SIZE, offset + len < LLONG_MAX in shmem_fallocate, which will pass the checking in vfs_fallocate. 
/* Check for wrap through zero too */ if (((offset + len) > inode->i_sb->s_maxbytes) || ((offset + len) < 0)) return -EFBIG; loff_t unmap_start = round_up(offset, PAGE_SIZE) in shmem_fallocate causes a overflow. Syzkaller reports a overflow problem in mm/shmem: UBSAN: Undefined behaviour in mm/shmem.c:2014:10 signed integer overflow: '9223372036854775807 + 1' cannot be represented in type 'long long int' CPU: 0 PID:17076 Comm: syz-executor0 Not tainted 4.1.46+ #1 Hardware name: linux, dummy-virt (DT) Call trace: dump_backtrace+0x0/0x2c8 arch/arm64/kernel/traps.c:100 show_stack+0x20/0x30 arch/arm64/kernel/traps.c:238 __dump_stack lib/dump_stack.c:15 [inline] ubsan_epilogue+0x18/0x70 lib/ubsan.c:164 handle_overflow+0x158/0x1b0 lib/ubsan.c:195 shmem_fallocate+0x6d0/0x820 mm/shmem.c:2104 vfs_fallocate+0x238/0x428 fs/open.c:312 SYSC_fallocate fs/open.c:335 [inline] SyS_fallocate+0x54/0xc8 fs/open.c:239 The highest bit of unmap_start will be appended with sign bit 1 (overflow) when calculate shmem_falloc.start: shmem_falloc.start = unmap_start >> PAGE_SHIFT. Fix it by casting the type of unmap_start to u64, when right shifted. This bug is found in LTS Linux 4.1. It also seems to exist in mainline. Link: http://lkml.kernel.org/r/1573867464-5107-1-git-send-email-chenjun102@huawei.com Signed-off-by: Chen Jun Reviewed-by: Andrew Morton Cc: Hugh Dickins Cc: Qian Cai Cc: Kefeng Wang Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Greg Kroah-Hartman commit c5407f8859fb56ed8508ac1ac03ee5d3abbad0f9 Author: Gerald Schaefer Date: Tue Nov 19 12:30:53 2019 +0100 s390/kaslr: store KASLR offset for early dumps commit a9f2f6865d784477e1c7b59269d3a384abafd9ca upstream. The KASLR offset is added to vmcoreinfo in arch_crash_save_vmcoreinfo(), so that it can be found by crash when processing kernel dumps. However, arch_crash_save_vmcoreinfo() is called during a subsys_initcall, so if the kernel crashes before that, we have no vmcoreinfo and no KASLR offset. Fix this by storing the KASLR offset in the lowcore, where the vmcore_info pointer will be stored, and where it can be found by crash. In order to make it distinguishable from a real vmcore_info pointer, mark it as uneven (KASLR offset itself is aligned to THREAD_SIZE). When arch_crash_save_vmcoreinfo() stores the real vmcore_info pointer in the lowcore, it overwrites the KASLR offset. At that point, the KASLR offset is not yet added to vmcoreinfo, so we also need to move the mem_assign_absolute() behind the vmcoreinfo_append_str(). Fixes: b2d24b97b2a9 ("s390/kernel: add support for kernel address space layout randomization (KASLR)") Cc: # v5.2+ Signed-off-by: Gerald Schaefer Signed-off-by: Vasily Gorbik Signed-off-by: Greg Kroah-Hartman commit a7c1c595334351bc798703aa02f94f260349df26 Author: Heiko Carstens Date: Mon Nov 18 13:09:52 2019 +0100 s390/smp,vdso: fix ASCE handling commit a2308c11ecbc3471ebb7435ee8075815b1502ef0 upstream. When a secondary CPU is brought up it must initialize its control registers. CPU A which triggers that a secondary CPU B is brought up stores its control register contents into the lowcore of new CPU B, which then loads these values on startup. This is problematic in various ways: the control register which contains the home space ASCE will correctly contain the kernel ASCE; however control registers for primary and secondary ASCEs are initialized with whatever values were present in CPU A. Typically: - the primary ASCE will contain the user process ASCE of the process that triggered onlining of CPU B. 
- the secondary ASCE will contain the percpu VDSO ASCE of CPU A. Due to lazy ASCE handling we may also end up with other combinations. When then CPU B switches to a different process (!= idle) it will fixup the primary ASCE. However the problem is that the (wrong) ASCE from CPU A was loaded into control register 1: as soon as an ASCE is attached (aka loaded) a CPU is free to generate TLB entries using that address space. Even though it is very unlikey that CPU B will actually generate such entries, this could result in TLB entries of the address space of the process that ran on CPU A. These entries shouldn't exist at all and could cause problems later on. Furthermore the secondary ASCE of CPU B will not be updated correctly. This means that processes may see wrong results or even crash if they access VDSO data on CPU B. The correct VDSO ASCE will eventually be loaded on return to user space as soon as the kernel executed a call to strnlen_user or an atomic futex operation on CPU B. Fix both issues by intializing the to be loaded control register contents with the correct ASCEs and also enforce (re-)loading of the ASCEs upon first context switch and return to user space. Fixes: 0aaba41b58bc ("s390: remove all code using the access register mode") Cc: stable@vger.kernel.org # v4.15+ Signed-off-by: Heiko Carstens Signed-off-by: Vasily Gorbik Signed-off-by: Greg Kroah-Hartman commit 2f04249b33f4fe870d40271581f32aa06d5a3ebe Author: Will Deacon Date: Mon Nov 4 15:58:15 2019 +0000 firmware: qcom: scm: Ensure 'a0' status code is treated as signed commit ff34f3cce278a0982a7b66b1afaed6295141b1fc upstream. The 'a0' member of 'struct arm_smccc_res' is declared as 'unsigned long', however the Qualcomm SCM firmware interface driver expects to receive negative error codes via this field, so ensure that it's cast to 'long' before comparing to see if it is less than 0. Cc: Reviewed-by: Bjorn Andersson Signed-off-by: Will Deacon Signed-off-by: Greg Kroah-Hartman commit a44a5939a4097c98481a5b873b7bd9f387e56f59 Author: Theodore Ts'o Date: Mon Nov 11 22:18:13 2019 -0500 ext4: work around deleting a file with i_nlink == 0 safely commit c7df4a1ecb8579838ec8c56b2bb6a6716e974f37 upstream. If the file system is corrupted such that a file's i_links_count is too small, then it's possible that when unlinking that file, i_nlink will already be zero. Previously we were working around this kind of corruption by forcing i_nlink to one; but we were doing this before trying to delete the directory entry --- and if the file system is corrupted enough that ext4_delete_entry() fails, then we exit with i_nlink elevated, and this causes the orphan inode list handling to be FUBAR'ed, such that when we unmount the file system, the orphan inode list can get corrupted. A better way to fix this is to simply skip trying to call drop_nlink() if i_nlink is already zero, thus moving the check to the place where it makes the most sense. https://bugzilla.kernel.org/show_bug.cgi?id=205433 Link: https://lore.kernel.org/r/20191112032903.8828-1-tytso@mit.edu Signed-off-by: Theodore Ts'o Cc: stable@kernel.org Reviewed-by: Andreas Dilger Signed-off-by: Greg Kroah-Hartman commit e4d09b31ad89cd5813de71a12b9255b813dfaaeb Author: Roman Gushchin Date: Wed Dec 4 16:49:46 2019 -0800 mm: memcg/slab: wait for !root kmem_cache refcnt killing on root kmem_cache destruction commit a264df74df38855096393447f1b8f386069a94b9 upstream. 
Christian reported a warning like the following obtained during running some KVM-related tests on s390: WARNING: CPU: 8 PID: 208 at lib/percpu-refcount.c:108 percpu_ref_exit+0x50/0x58 Modules linked in: kvm(-) xt_CHECKSUM xt_MASQUERADE bonding xt_tcpudp ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_conntrack ip6table_na> CPU: 8 PID: 208 Comm: kworker/8:1 Not tainted 5.2.0+ #66 Hardware name: IBM 2964 NC9 712 (LPAR) Workqueue: events sysfs_slab_remove_workfn Krnl PSW : 0704e00180000000 0000001529746850 (percpu_ref_exit+0x50/0x58) R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:2 PM:0 RI:0 EA:3 Krnl GPRS: 00000000ffff8808 0000001529746740 000003f4e30e8e18 0036008100000000 0000001f00000000 0035008100000000 0000001fb3573ab8 0000000000000000 0000001fbdb6de00 0000000000000000 0000001529f01328 0000001fb3573b00 0000001fbb27e000 0000001fbdb69300 000003e009263d00 000003e009263cd0 Krnl Code: 0000001529746842: f0a0000407fe srp 4(11,%r0),2046,0 0000001529746848: 47000700 bc 0,1792 #000000152974684c: a7f40001 brc 15,152974684e >0000001529746850: a7f4fff2 brc 15,1529746834 0000001529746854: 0707 bcr 0,%r7 0000001529746856: 0707 bcr 0,%r7 0000001529746858: eb8ff0580024 stmg %r8,%r15,88(%r15) 000000152974685e: a738ffff lhi %r3,-1 Call Trace: ([<000003e009263d00>] 0x3e009263d00) [<00000015293252ea>] slab_kmem_cache_release+0x3a/0x70 [<0000001529b04882>] kobject_put+0xaa/0xe8 [<000000152918cf28>] process_one_work+0x1e8/0x428 [<000000152918d1b0>] worker_thread+0x48/0x460 [<00000015291942c6>] kthread+0x126/0x160 [<0000001529b22344>] ret_from_fork+0x28/0x30 [<0000001529b2234c>] kernel_thread_starter+0x0/0x10 Last Breaking-Event-Address: [<000000152974684c>] percpu_ref_exit+0x4c/0x58 ---[ end trace b035e7da5788eb09 ]--- The problem occurs because kmem_cache_destroy() is called immediately after deleting of a memcg, so it races with the memcg kmem_cache deactivation. flush_memcg_workqueue() at the beginning of kmem_cache_destroy() is supposed to guarantee that all deactivation processes are finished, but failed to do so. It waits for an rcu grace period, after which all children kmem_caches should be deactivated. During the deactivation percpu_ref_kill() is called for non root kmem_cache refcounters, but it requires yet another rcu grace period to finish the transition to the atomic (dead) state. So in a rare case when not all children kmem_caches are destroyed at the moment when the root kmem_cache is about to be gone, we need to wait another rcu grace period before destroying the root kmem_cache. This issue can be triggered only with dynamically created kmem_caches which are used with memcg accounting. In this case per-memcg child kmem_caches are created. They are deactivated from the cgroup removing path. If the destruction of the root kmem_cache is racing with the removal of the cgroup (both are quite complicated multi-stage processes), the described issue can occur. The only known way to trigger it in the real life, is to unload some kernel module which creates a dedicated kmem_cache, used from different memory cgroups with GFP_ACCOUNT flag. If the unloading happens immediately after calling rmdir on the corresponding cgroup, there is some chance to trigger the issue. 
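The extra grace period discussed above can be made explicit. The sketch below shows one generic way to block until a percpu_ref has finished its switch to atomic (dead) mode, using percpu_ref_kill_and_confirm() with a completion; the names ref_dead and kill_ref_and_wait are hypothetical and this is not the actual flush_memcg_workqueue() change:

    /* Sketch, not the upstream fix: wait for the confirm-kill callback that
     * runs only after the RCU grace period completing the percpu -> atomic
     * switch has elapsed. */
    static DECLARE_COMPLETION(ref_dead);

    static void ref_confirm_kill(struct percpu_ref *ref)
    {
            complete(&ref_dead);            /* switch to atomic mode is done */
    }

    static void kill_ref_and_wait(struct percpu_ref *ref)
    {
            percpu_ref_kill_and_confirm(ref, ref_confirm_kill);
            wait_for_completion(&ref_dead); /* now safe to tear down the object */
    }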
Link: http://lkml.kernel.org/r/20191129025011.3076017-1-guro@fb.com Fixes: f0a3a24b532d ("mm: memcg/slab: rework non-root kmem_cache lifecycle management") Signed-off-by: Roman Gushchin Reported-by: Christian Borntraeger Tested-by: Christian Borntraeger Reviewed-by: Shakeel Butt Acked-by: Michal Hocko Cc: Johannes Weiner Cc: Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Greg Kroah-Hartman commit 7e8b342c24adc96df4ded207e377f32fca8ad0cd Author: Daniel Schultz Date: Tue Sep 17 10:12:53 2019 +0200 mfd: rk808: Fix RK818 ID template commit 37ef8c2c15bdc1322b160e38986c187de2b877b2 upstream. The Rockchip PMIC driver can automatically detect connected component versions by reading the ID_MSB and ID_LSB registers. The probe function will always fail with RK818 PMICs because the ID_MSK is 0xFFF0 and the RK818 template ID is 0x8181. This patch changes this value to 0x8180. Fixes: 9d6105e19f61 ("mfd: rk808: Fix up the chip id get failed") Cc: stable@vger.kernel.org Cc: Elaine Zhang Cc: Joseph Chen Signed-off-by: Daniel Schultz Signed-off-by: Heiko Stuebner Signed-off-by: Lee Jones Signed-off-by: Greg Kroah-Hartman commit 4d0f420c8612fd43bcf0dcc08ea991fade15d817 Author: Nicolas Geoffray Date: Sat Nov 30 17:53:28 2019 -0800 mm, memfd: fix COW issue on MAP_PRIVATE and F_SEAL_FUTURE_WRITE mappings commit 05d351102dbe4e103d6bdac18b1122cd3cd04925 upstream. F_SEAL_FUTURE_WRITE has unexpected behavior when used with MAP_PRIVATE: A private mapping created after the memfd file that gets sealed with F_SEAL_FUTURE_WRITE loses the copy-on-write at fork behavior, meaning children and parent share the same memory, even though the mapping is private. The reason for this is due to the code below: static int shmem_mmap(struct file *file, struct vm_area_struct *vma) { struct shmem_inode_info *info = SHMEM_I(file_inode(file)); if (info->seals & F_SEAL_FUTURE_WRITE) { /* * New PROT_WRITE and MAP_SHARED mmaps are not allowed when * "future write" seal active. */ if ((vma->vm_flags & VM_SHARED) && (vma->vm_flags & VM_WRITE)) return -EPERM; /* * Since the F_SEAL_FUTURE_WRITE seals allow for a MAP_SHARED * read-only mapping, take care to not allow mprotect to revert * protections. */ vma->vm_flags &= ~(VM_MAYWRITE); } ... } And for the mm to know if a mapping is copy-on-write: static inline bool is_cow_mapping(vm_flags_t flags) { return (flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE; } The patch fixes the issue by making the mprotect revert protection happen only for shared mappings. For private mappings, using mprotect will have no effect on the seal behavior. The F_SEAL_FUTURE_WRITE feature was introduced in v5.1 so v5.3.x stable kernels would need a backport. [akpm@linux-foundation.org: reflow comment, per Christoph] Link: http://lkml.kernel.org/r/20191107195355.80608-1-joel@joelfernandes.org Fixes: ab3948f58ff84 ("mm/memfd: add an F_SEAL_FUTURE_WRITE seal to memfd") Signed-off-by: Nicolas Geoffray Signed-off-by: Joel Fernandes (Google) Cc: Hugh Dickins Cc: Shuah Khan Cc: Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Signed-off-by: Greg Kroah-Hartman commit 78d375ace0f5c49ca1131fb033473de47b36312f Author: Vincenzo Frascino Date: Mon Dec 2 07:57:29 2019 +0000 powerpc: Fix vDSO clock_getres() [ Upstream commit 552263456215ada7ee8700ce022d12b0cffe4802 ] clock_getres in the vDSO library has to preserve the same behaviour of posix_get_hrtimer_res(). 
In particular, posix_get_hrtimer_res() does: sec = 0; ns = hrtimer_resolution; and hrtimer_resolution depends on the enablement of the high resolution timers that can happen either at compile or at run time. Fix the powerpc vdso implementation of clock_getres keeping a copy of hrtimer_resolution in vdso data and using that directly. Fixes: a7f290dad32e ("[PATCH] powerpc: Merge vdso's and add vdso support to 32 bits kernel") Cc: stable@vger.kernel.org Signed-off-by: Vincenzo Frascino Reviewed-by: Christophe Leroy Acked-by: Shuah Khan [chleroy: changed CLOCK_REALTIME_RES to CLOCK_HRTIMER_RES] Signed-off-by: Christophe Leroy Signed-off-by: Michael Ellerman Link: https://lore.kernel.org/r/a55eca3a5e85233838c2349783bcb5164dae1d09.1575273217.git.christophe.leroy@c-s.fr Signed-off-by: Sasha Levin commit 002d1cac5af8ff882b0c60955ef952f3376cdd57 Author: Nathan Chancellor Date: Mon Nov 18 21:57:11 2019 -0700 powerpc: Avoid clang warnings around setjmp and longjmp [ Upstream commit c9029ef9c95765e7b63c4d9aa780674447db1ec0 ] Commit aea447141c7e ("powerpc: Disable -Wbuiltin-requires-header when setjmp is used") disabled -Wbuiltin-requires-header because of a warning about the setjmp and longjmp declarations. r367387 in clang added another diagnostic around this, complaining that there is no jmp_buf declaration. In file included from ../arch/powerpc/xmon/xmon.c:47: ../arch/powerpc/include/asm/setjmp.h:10:13: error: declaration of built-in function 'setjmp' requires the declaration of the 'jmp_buf' type, commonly provided in the header . [-Werror,-Wincomplete-setjmp-declaration] extern long setjmp(long *); ^ ../arch/powerpc/include/asm/setjmp.h:11:13: error: declaration of built-in function 'longjmp' requires the declaration of the 'jmp_buf' type, commonly provided in the header . [-Werror,-Wincomplete-setjmp-declaration] extern void longjmp(long *, long); ^ 2 errors generated. We are not using the standard library's longjmp/setjmp implementations for obvious reasons; make this clear to clang by using -ffreestanding on these files. Cc: stable@vger.kernel.org # 4.14+ Suggested-by: Segher Boessenkool Reviewed-by: Nick Desaulniers Signed-off-by: Nathan Chancellor Signed-off-by: Michael Ellerman Link: https://lore.kernel.org/r/20191119045712.39633-3-natechancellor@gmail.com Signed-off-by: Sasha Levin commit d6620fc5447a5782e490d2d1fcd98eaa9a5f9fba Author: H. Nikolaus Schaller Date: Thu Nov 7 11:30:39 2019 +0100 omap: pdata-quirks: remove openpandora quirks for mmc3 and wl1251 [ Upstream commit 2398c41d64321e62af54424fd399964f3d48cdc2 ] With a wl1251 child node of mmc3 in the device tree decoded in omap_hsmmc.c to handle special wl1251 initialization, we do no longer need to instantiate the mmc3 through pdata quirks. We also can remove the wlan regulator and reset/interrupt definitions and do them through device tree. Fixes: 81eef6ca9201 ("mmc: omap_hsmmc: Use dma_request_chan() for requesting DMA channel") Signed-off-by: H. Nikolaus Schaller Cc: # v4.7+ Acked-by: Tony Lindgren Signed-off-by: Ulf Hansson Signed-off-by: Sasha Levin commit 784a559f94d3510bcb1b93e770dc78e1be17fd59 Author: H. Nikolaus Schaller Date: Thu Nov 7 11:30:38 2019 +0100 omap: pdata-quirks: revert pandora specific gpiod additions [ Upstream commit 4e8fad98171babe019db51c15055ec74697e9525 ] This partly reverts the commit efdfeb079cc3 ("regulator: fixed: Convert to use GPIO descriptor only"). 
We must remove this from mainline first, so that the following patch to remove the openpandora quirks for mmc3 and wl1251 cleanly applies to stable v4.9, v4.14, v4.19 where the above mentioned patch is not yet present. Since the code affected is removed (no pandora gpios in pdata-quirks and more), there will be no matching revert-of-the-revert. Signed-off-by: H. Nikolaus Schaller Acked-by: Tony Lindgren Signed-off-by: Ulf Hansson Signed-off-by: Sasha Levin commit af5b2e18aed60d6df7d6ac644648d981a0de0c99 Author: Andrea Merello Date: Mon Dec 2 15:13:36 2019 +0100 iio: ad7949: fix channels mixups [ Upstream commit 3b71f6b59508b1c9befcb43de434866aafc76520 ] Each time we need to read a sample (from the sysfs interface, since the driver supports only it) the driver writes the configuration register with the proper settings needed to perform the said read, then it runs another xfer to actually read the resulting value. Most notably the configuration register is updated to set the ADC internal MUX depending by which channel the read targets. Unfortunately this seems not enough to ensure correct operation because the ADC works in a pipelined-like fashion and the new configuration isn't applied in time. The ADC alternates two phases: acquisition and conversion. During the acquisition phase the ADC samples the analog signal in an internal capacitor; in the conversion phase the ADC performs the actual analog to digital conversion of the stored voltage. Note that of course the MUX needs to be set to the proper channel when the acquisition phase is performed. Once the conversion phase has been completed, the device automatically switches back to a new acquisition; on the other hand the device switches from acquisition to conversion on the rising edge of SPI cs signal (that is when the xfer finishes). Only after both two phases have been completed (with the proper settings already written in the configuration register since the beginning) it is possible to read the outcome from SPI bus. With the current driver implementation, we end up in the following situation: _______ 1st xfer ____________ 2nd xfer ___________________ SPI cs.. \_________/ \_________/ SPI rd.. idle |(val N-2)+ idle | val N-1 + idle ... SPI wr.. idle | cfg N + idle | (X) + idle ... ------------------------ + -------------------- + ------------------ AD .. acq N-1 + cnv N-1 | acq N + cnv N | acq N+1 As shown in the diagram above, the value we read in the Nth read belongs to configuration setting N-1. In case the configuration is not changed (config[N] == config[N-1]), then we still get correct data, but in case the configuration changes (i.e. switching the MUX on another channel), we get wrong data (data from the previously selected channel). This patch fixes this by performing one more "dummy" transfer in order to ending up in reading the data when it's really ready, as per the following timing diagram. _______ 1st xfer ____________ 2nd xfer ___________ 3rd xfer ___ SPI cs.. \_________/ \_________/ \_________/ SPI rd.. idle |(val N-2)+ idle |(val N-1)+ idle | val N + .. SPI wr.. idle | cfg N + idle | (X) + idle | (X) + .. ------------------------ + -------------------- + ------------------- + -- AD .. acq N-1 + cnv N-1 | acq N + cnv N | acq N+1 | .. NOTE: in the latter case (cfg changes), the acquisition phase for the value to be read begins after the 1st xfer, that is after the read request has been issued on sysfs. 
On the other hand, if the cfg doesn't change, then we can refer to the fist diagram assuming N == (N - 1); the acquisition phase _begins_ before the 1st xfer (potentially a lot of time before the read has been issued via sysfs, but it _ends_ after the 1st xfer, that is _after_ the read has started. This should guarantee a reasonably fresh data, which value represents the voltage that the sampled signal has after the read start or maybe just around it. Signed-off-by: Andrea Merello Reviewed-by: Charles-Antoine Couret Cc: Signed-off-by: Jonathan Cameron Signed-off-by: Sasha Levin commit a4160d9f57c22815736897f2f0590ae0d35a1562 Author: Andrea Merello Date: Thu Sep 12 16:43:07 2019 +0200 iio: ad7949: kill pointless "readback"-handling code [ Upstream commit c270bbf7bb9ddc4e2a51b3c56557c377c9ac79bc ] The device could be configured to spit out also the configuration word while reading the AD result value (in the same SPI xfer) - this is called "readback" in the device datasheet. The driver checks if readback is enabled and it eventually adjusts the SPI xfer length and it applies proper shifts to still get the data, discarding the configuration word. The readback option is actually never enabled (the driver disables it), so the said checks do not serve for any purpose. Since enabling the readback option seems not to provide any advantage (the driver entirely sets the configuration word without relying on any default value), just kill the said, unused, code. Signed-off-by: Andrea Merello Reviewed-by: Alexandru Ardelean Signed-off-by: Jonathan Cameron Signed-off-by: Sasha Levin commit 44120fd4fd644db95868832fb2c94f716cc61d53 Author: Martin K. Petersen Date: Mon Nov 18 23:55:45 2019 -0500 Revert "scsi: qla2xxx: Fix memory leak when sending I/O fails" [ Upstream commit 5a993e507ee65a28eca6690ee11868555c4ca46b ] This reverts commit 2f856d4e8c23f5ad5221f8da4a2f22d090627f19. This patch was found to introduce a double free regression. The issue it originally attempted to address was fixed in patch f45bca8c5052 ("scsi: qla2xxx: Fix double scsi_done for abort path"). Link: https://lore.kernel.org/r/4BDE2B95-835F-43BE-A32C-2629D7E03E0A@marvell.com Requested-by: Himanshu Madhani Signed-off-by: Martin K. Petersen Signed-off-by: Sasha Levin commit 26c9d7b181bbfa1453cda6edcafe274368202cde Author: Bart Van Assche Date: Tue Nov 5 20:42:26 2019 -0800 scsi: qla2xxx: Fix a dma_pool_free() call [ Upstream commit 162b805e38327135168cb0938bd37b131b481cb0 ] This patch fixes the following kernel warning: DMA-API: qla2xxx 0000:00:0a.0: device driver frees DMA memory with different size [device address=0x00000000c7b60000] [map size=4088 bytes] [unmap size=512 bytes] WARNING: CPU: 3 PID: 1122 at kernel/dma/debug.c:1021 check_unmap+0x4d0/0xbd0 CPU: 3 PID: 1122 Comm: rmmod Tainted: G O 5.4.0-rc1-dbg+ #1 RIP: 0010:check_unmap+0x4d0/0xbd0 Call Trace: debug_dma_free_coherent+0x123/0x173 dma_free_attrs+0x76/0xe0 qla2x00_mem_free+0x329/0xc40 [qla2xxx_scst] qla2x00_free_device+0x170/0x1c0 [qla2xxx_scst] qla2x00_remove_one+0x4f0/0x6d0 [qla2xxx_scst] pci_device_remove+0xd5/0x1f0 device_release_driver_internal+0x159/0x280 driver_detach+0x8b/0xf2 bus_remove_driver+0x9a/0x15a driver_unregister+0x51/0x70 pci_unregister_driver+0x2d/0x130 qla2x00_module_exit+0x1c/0xbc [qla2xxx_scst] __x64_sys_delete_module+0x22a/0x300 do_syscall_64+0x6f/0x2e0 entry_SYSCALL_64_after_hwframe+0x49/0xbe Fixes: 3f006ac342c0 ("scsi: qla2xxx: Secure flash update support for ISP28XX") # v5.2-rc1~130^2~270. 
Cc: Michael Hernandez Cc: Himanshu Madhani Link: https://lore.kernel.org/r/20191106044226.5207-3-bvanassche@acm.org Reviewed-by: Martin Wilck Acked-by: Himanshu Madhani Signed-off-by: Bart Van Assche Signed-off-by: Martin K. Petersen Signed-off-by: Sasha Levin commit dea6ee7173039d489977c9ed92e3749154615db4 Author: Quinn Tran Date: Tue Nov 5 07:06:52 2019 -0800 scsi: qla2xxx: Fix SRB leak on switch command timeout [ Upstream commit af2a0c51b1205327f55a7e82e530403ae1d42cbb ] when GPSC/GPDB switch command fails, driver just returns without doing a proper cleanup. This patch fixes this memory leak by calling sp->free() in the error path. Link: https://lore.kernel.org/r/20191105150657.8092-4-hmadhani@marvell.com Reviewed-by: Ewan D. Milne Signed-off-by: Quinn Tran Signed-off-by: Himanshu Madhani Signed-off-by: Martin K. Petersen Signed-off-by: Sasha Levin commit af7878b07aa3b498fbebabaece408cfe9e7ae83a Author: Jeff Mahoney Date: Thu Oct 24 10:31:27 2019 -0400 reiserfs: fix extended attributes on the root directory commit 60e4cf67a582d64f07713eda5fcc8ccdaf7833e6 upstream. Since commit d0a5b995a308 (vfs: Add IOP_XATTR inode operations flag) extended attributes haven't worked on the root directory in reiserfs. This is due to reiserfs conditionally setting the sb->s_xattrs handler array depending on whether it located or create the internal privroot directory. It necessarily does this after the root inode is already read in. The IOP_XATTR flag is set during inode initialization, so it never gets set on the root directory. This commit unconditionally assigns sb->s_xattrs and clears IOP_XATTR on internal inodes. The old return values due to the conditional assignment are handled via open_xa_root, which now returns EOPNOTSUPP as the VFS would have done. Link: https://lore.kernel.org/r/20191024143127.17509-1-jeffm@suse.com CC: stable@vger.kernel.org Fixes: d0a5b995a308 ("vfs: Add IOP_XATTR inode operations flag") Signed-off-by: Jeff Mahoney Signed-off-by: Jan Kara Signed-off-by: Greg Kroah-Hartman commit c46addbdd041511d871acc75dbd1be2c8441b934 Author: Jan Kara Date: Tue Nov 5 17:44:12 2019 +0100 ext4: Fix credit estimate for final inode freeing commit 65db869c754e7c271691dd5feabf884347e694f5 upstream. Estimate for the number of credits needed for final freeing of inode in ext4_evict_inode() was to small. We may modify 4 blocks (inode & sb for orphan deletion, bitmap & group descriptor for inode freeing) and not just 3. [ Fixed minor whitespace nit. -- TYT ] Fixes: e50e5129f384 ("ext4: xattr-in-inode support") CC: stable@vger.kernel.org Signed-off-by: Jan Kara Link: https://lore.kernel.org/r/20191105164437.32602-6-jack@suse.cz Signed-off-by: Theodore Ts'o Signed-off-by: Greg Kroah-Hartman commit 1a4437076566a758f7c11565907835d7ec7a4893 Author: Dmitry Monakhov Date: Thu Oct 31 10:39:19 2019 +0000 quota: fix livelock in dquot_writeback_dquots commit 6ff33d99fc5c96797103b48b7b0902c296f09c05 upstream. Write only quotas which are dirty at entry. XFSTEST: https://github.com/dmonakhov/xfstests/commit/b10ad23566a5bf75832a6f500e1236084083cddc Link: https://lore.kernel.org/r/20191031103920.3919-1-dmonakhov@openvz.org CC: stable@vger.kernel.org Signed-off-by: Konstantin Khlebnikov Signed-off-by: Dmitry Monakhov Signed-off-by: Jan Kara Signed-off-by: Greg Kroah-Hartman commit 72c7fa7466f51e6ec0d0c4b2acf5d66580dcd0cb Author: Christian Brauner Date: Fri Sep 20 10:30:06 2019 +0200 seccomp: avoid overflow in implicit constant conversion commit 223e660bc7638d126a0e4fbace4f33f2895788c4 upstream. 
USER_NOTIF_MAGIC is assigned to int variables in this test so set it to INT_MAX to avoid warnings: seccomp_bpf.c: In function ‘user_notification_continue’: seccomp_bpf.c:3088:26: warning: overflow in implicit constant conversion [-Woverflow] #define USER_NOTIF_MAGIC 116983961184613L ^ seccomp_bpf.c:3572:15: note: in expansion of macro ‘USER_NOTIF_MAGIC’ resp.error = USER_NOTIF_MAGIC; ^~~~~~~~~~~~~~~~ Fixes: 6a21cc50f0c7 ("seccomp: add a return code to trap to userspace") Signed-off-by: Christian Brauner Reviewed-by: Tyler Hicks Cc: Andy Lutomirski Cc: Will Drewry Cc: Shuah Khan Cc: Alexei Starovoitov Cc: Daniel Borkmann Cc: Martin KaFai Lau Cc: Song Liu Cc: Yonghong Song Cc: Tycho Andersen Cc: stable@vger.kernel.org Cc: linux-kselftest@vger.kernel.org Cc: netdev@vger.kernel.org Cc: bpf@vger.kernel.org Reviewed-by: Tycho Andersen Link: https://lore.kernel.org/r/20190920083007.11475-3-christian.brauner@ubuntu.com Signed-off-by: Kees Cook Signed-off-by: Greg Kroah-Hartman commit 298489477403569a7734f57b08f975cf9ee0a3ff Author: Chengguang Xu Date: Tue Nov 5 12:51:00 2019 +0800 ext2: check err when partial != NULL commit e705f4b8aa27a59f8933e8f384e9752f052c469c upstream. Check err when partial == NULL is meaningless because partial == NULL means getting branch successfully without error. CC: stable@vger.kernel.org Link: https://lore.kernel.org/r/20191105045100.7104-1-cgxu519@mykernel.net Signed-off-by: Chengguang Xu Signed-off-by: Jan Kara Signed-off-by: Greg Kroah-Hartman commit b28df8395d5e5e830d2126f9b21f5817599c815f Author: Dmitry Monakhov Date: Thu Oct 31 10:39:20 2019 +0000 quota: Check that quota is not dirty before release commit df4bb5d128e2c44848aeb36b7ceceba3ac85080d upstream. There is a race window where quota was redirted once we drop dq_list_lock inside dqput(), but before we grab dquot->dq_lock inside dquot_release() TASK1 TASK2 (chowner) ->dqput() we_slept: spin_lock(&dq_list_lock) if (dquot_dirty(dquot)) { spin_unlock(&dq_list_lock); dquot->dq_sb->dq_op->write_dquot(dquot); goto we_slept if (test_bit(DQ_ACTIVE_B, &dquot->dq_flags)) { spin_unlock(&dq_list_lock); dquot->dq_sb->dq_op->release_dquot(dquot); dqget() mark_dquot_dirty() dqput() goto we_slept; } So dquot dirty quota will be released by TASK1, but on next we_sleept loop we detect this and call ->write_dquot() for it. XFSTEST: https://github.com/dmonakhov/xfstests/commit/440a80d4cbb39e9234df4d7240aee1d551c36107 Link: https://lore.kernel.org/r/20191031103920.3919-2-dmonakhov@openvz.org CC: stable@vger.kernel.org Signed-off-by: Dmitry Monakhov Signed-off-by: Jan Kara Signed-off-by: Greg Kroah-Hartman commit 8d3e44702d4e6e07f81098de2c078ff28007948c Author: Ville Syrjälä Date: Thu Sep 19 16:28:53 2019 +0300 video/hdmi: Fix AVI bar unpack commit 6039f37dd6b76641198e290f26b31c475248f567 upstream. The bar values are little endian, not big endian. The pack function did it right but the unpack got it wrong. Fix it. 
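The endianness mix-up above comes down to which byte is treated as the low byte; a minimal sketch of little-endian 16-bit unpacking as used for the AVI bar fields (the helper name is illustrative):

    /* Bar values are transmitted little-endian: byte 0 is the low byte. */
    static u16 hdmi_le16(const u8 *p)
    {
            return (u16)p[0] | ((u16)p[1] << 8);
    }

    /* A big-endian unpack -- (p[0] << 8) | p[1] -- swaps the halves, which is
     * the bug being fixed here. */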
Cc: stable@vger.kernel.org Cc: linux-media@vger.kernel.org Cc: Martin Bugge Cc: Hans Verkuil Cc: Thierry Reding Cc: Mauro Carvalho Chehab Fixes: 2c676f378edb ("[media] hdmi: added unpack and logging functions for InfoFrames") Signed-off-by: Ville Syrjälä Link: https://patchwork.freedesktop.org/patch/msgid/20190919132853.30954-1-ville.syrjala@linux.intel.com Reviewed-by: Thierry Reding Signed-off-by: Greg Kroah-Hartman commit 01d8c174695c6143cec2e1f8e1ca9bae33116975 Author: Cédric Le Goater Date: Tue Dec 3 17:36:42 2019 +0100 powerpc/xive: Skip ioremap() of ESB pages for LSI interrupts commit b67a95f2abff0c34e5667c15ab8900de73d8d087 upstream. The PCI INTx interrupts and other LSI interrupts are handled differently under a sPAPR platform. When the interrupt source characteristics are queried, the hypervisor returns an H_INT_ESB flag to inform the OS that it should be using the H_INT_ESB hcall for interrupt management and not loads and stores on the interrupt ESB pages. A default -1 value is returned for the addresses of the ESB pages. The driver ignores this condition today and performs a bogus IO mapping. Recent changes and the DEBUG_VM configuration option make the bug visible with : kernel BUG at arch/powerpc/include/asm/book3s/64/pgtable.h:612! Oops: Exception in kernel mode, sig: 5 [#1] LE PAGE_SIZE=64K MMU=Radix MMU=Hash SMP NR_CPUS=1024 NUMA pSeries Modules linked in: CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.4.0-0.rc6.git0.1.fc32.ppc64le #1 NIP: c000000000f63294 LR: c000000000f62e44 CTR: 0000000000000000 REGS: c0000000fa45f0d0 TRAP: 0700 Not tainted (5.4.0-0.rc6.git0.1.fc32.ppc64le) ... NIP ioremap_page_range+0x4c4/0x6e0 LR ioremap_page_range+0x74/0x6e0 Call Trace: ioremap_page_range+0x74/0x6e0 (unreliable) do_ioremap+0x8c/0x120 __ioremap_caller+0x128/0x140 ioremap+0x30/0x50 xive_spapr_populate_irq_data+0x170/0x260 xive_irq_domain_map+0x8c/0x170 irq_domain_associate+0xb4/0x2d0 irq_create_mapping+0x1e0/0x3b0 irq_create_fwspec_mapping+0x27c/0x3e0 irq_create_of_mapping+0x98/0xb0 of_irq_parse_and_map_pci+0x168/0x230 pcibios_setup_device+0x88/0x250 pcibios_setup_bus_devices+0x54/0x100 __of_scan_bus+0x160/0x310 pcibios_scan_phb+0x330/0x390 pcibios_init+0x8c/0x128 do_one_initcall+0x60/0x2c0 kernel_init_freeable+0x290/0x378 kernel_init+0x2c/0x148 ret_from_kernel_thread+0x5c/0x80 Fixes: bed81ee181dd ("powerpc/xive: introduce H_INT_ESB hcall") Cc: stable@vger.kernel.org # v4.14+ Signed-off-by: Cédric Le Goater Tested-by: Daniel Axtens Signed-off-by: Michael Ellerman Link: https://lore.kernel.org/r/20191203163642.2428-1-clg@kaod.org Signed-off-by: Greg Kroah-Hartman commit 34d5d5a81fc6827275b7b9c3a6b419c2b897f6ed Author: Alastair D'Silva Date: Mon Nov 4 13:32:53 2019 +1100 powerpc: Allow flush_icache_range to work across ranges >4GB commit 29430fae82073d39b1b881a3cd507416a56a363f upstream. When calling flush_icache_range with a size >4GB, we were masking off the upper 32 bits, so we would incorrectly flush a range smaller than intended. This patch replaces the 32 bit shifts with 64 bit ones, so that the full size is accounted for. Signed-off-by: Alastair D'Silva Cc: stable@vger.kernel.org Signed-off-by: Michael Ellerman Link: https://lore.kernel.org/r/20191104023305.9581-2-alastair@au1.ibm.com Signed-off-by: Greg Kroah-Hartman commit e6d76815e9a44774fa57f1d390ede404f1ed75cc Author: Cédric Le Goater Date: Thu Oct 31 07:31:00 2019 +0100 powerpc/xive: Prevent page fault issues in the machine crash handler commit 1ca3dec2b2dff9d286ce6cd64108bda0e98f9710 upstream. 
When the machine crash handler is invoked, all interrupts are masked but interrupts which have not been started yet do not have an ESB page mapped in the Linux address space. This crashes the 'crash kexec' sequence on sPAPR guests. To fix, force the mapping of the ESB page when an interrupt is being mapped in the Linux IRQ number space. This is done by setting the initial state of the interrupt to OFF which is not necessarily the case on PowerNV. Fixes: 243e25112d06 ("powerpc/xive: Native exploitation of the XIVE interrupt controller") Cc: stable@vger.kernel.org # v4.12+ Signed-off-by: Cédric Le Goater Reviewed-by: Greg Kurz Signed-off-by: Michael Ellerman Link: https://lore.kernel.org/r/20191031063100.3864-1-clg@kaod.org Signed-off-by: Greg Kroah-Hartman commit a0fc373c0d521ff4c42e391d1de4e9ff17aa7155 Author: Alastair D'Silva Date: Mon Nov 4 13:32:54 2019 +1100 powerpc: Allow 64bit VDSO __kernel_sync_dicache to work across ranges >4GB commit f9ec11165301982585e5e5f606739b5bae5331f3 upstream. When calling __kernel_sync_dicache with a size >4GB, we were masking off the upper 32 bits, so we would incorrectly flush a range smaller than intended. This patch replaces the 32 bit shifts with 64 bit ones, so that the full size is accounted for. Signed-off-by: Alastair D'Silva Cc: stable@vger.kernel.org Signed-off-by: Michael Ellerman Link: https://lore.kernel.org/r/20191104023305.9581-3-alastair@au1.ibm.com Signed-off-by: Greg Kroah-Hartman commit d3416b89ce22a4c5753bd7e842b05debc4353d73 Author: Yabin Cui Date: Mon Nov 4 11:12:50 2019 -0700 coresight: Serialize enabling/disabling a link device. commit edda32dabedb01f98b9d7b9a4492c13357834bbe upstream. When tracing etm data of multiple threads on multiple cpus through perf interface, some link devices are shared between paths of different cpus. It creates race conditions when different cpus wants to enable/disable the same link device at the same time. Example 1: Two cpus want to enable different ports of a coresight funnel, thus calling the funnel enable operation at the same time. But the funnel enable operation isn't reentrantable. Example 2: For an enabled coresight dynamic replicator with refcnt=1, one cpu wants to disable it, while another cpu wants to enable it. Ideally we still have an enabled replicator with refcnt=1 at the end. But in reality the result is uncertain. Since coresight devices claim themselves when enabled for self-hosted usage, the race conditions above usually make the link devices not usable after many cycles. To fix the race conditions, this patch uses spinlocks to serialize enabling/disabling link devices. Fixes: a06ae8609b3d ("coresight: add CoreSight core layer framework") Signed-off-by: Yabin Cui Signed-off-by: Mathieu Poirier Cc: stable # 5.3 Link: https://lore.kernel.org/r/20191104181251.26732-14-mathieu.poirier@linaro.org Signed-off-by: Greg Kroah-Hartman commit 614662016d3d81a5d2fb5bc1b25b865eea0702ac Author: Alexander Shishkin Date: Thu Nov 14 08:42:00 2019 +0200 stm class: Lose the protocol driver when dropping its reference commit 0a8f72fafb3f72a08df4ee491fcbeaafd6de85fd upstream. Commit c7fd62bc69d02 ("stm class: Introduce framing protocol drivers") forgot to tear down the link between an stm device and its protocol driver when policy is removed. 
This leads to an invalid pointer reference if one tries to write to an stm device after the policy has been removed and the protocol driver module unloaded, leading to the below splat: > BUG: unable to handle page fault for address: ffffffffc0737068 > #PF: supervisor read access in kernel mode > #PF: error_code(0x0000) - not-present page > PGD 3d780f067 P4D 3d780f067 PUD 3d7811067 PMD 492781067 PTE 0 > Oops: 0000 [#1] SMP NOPTI > CPU: 1 PID: 26122 Comm: cat Not tainted 5.4.0-rc5+ #1 > RIP: 0010:stm_output_free+0x40/0xc0 [stm_core] > Call Trace: > stm_char_release+0x3e/0x70 [stm_core] > __fput+0xc6/0x260 > ____fput+0xe/0x10 > task_work_run+0x9d/0xc0 > exit_to_usermode_loop+0x103/0x110 > do_syscall_64+0x19d/0x1e0 > entry_SYSCALL_64_after_hwframe+0x44/0xa9 Fix this by tearing down the link from an stm device to its protocol driver when the policy involving that driver is removed. Signed-off-by: Alexander Shishkin Fixes: c7fd62bc69d02 ("stm class: Introduce framing protocol drivers") Reported-by: Ammy Yi Tested-by: Ammy Yi CC: stable@vger.kernel.org # v4.20+ Link: https://lore.kernel.org/r/20191114064201.43089-2-alexander.shishkin@linux.intel.com Signed-off-by: Greg Kroah-Hartman commit 03087e5d36bc7accb0023db0f37d3a63271b31ed Author: Arnd Bergmann Date: Fri Nov 8 21:34:30 2019 +0100 ppdev: fix PPGETTIME/PPSETTIME ioctls commit 998174042da229e2cf5841f574aba4a743e69650 upstream. Going through the uses of timeval in the user space API, I noticed two bugs in ppdev that were introduced in the y2038 conversion: * The range check was accidentally moved from ppsettime to ppgettime * On sparc64, the microseconds are in the other half of the 64-bit word. Fix both, and mark the fix for stable backports. Cc: stable@vger.kernel.org Fixes: 3b9ab374a1e6 ("ppdev: convert to y2038 safe") Signed-off-by: Arnd Bergmann Link: https://lore.kernel.org/r/20191108203435.112759-8-arnd@arndb.de Signed-off-by: Greg Kroah-Hartman Signed-off-by: Greg Kroah-Hartman commit 1e974c08c73bccbb386faaec008de050c0ac689a Author: Bart Van Assche Date: Fri Oct 25 15:58:27 2019 -0700 RDMA/core: Fix ib_dma_max_seg_size() commit ecdfdfdbe4d4c74029f2b416b7ee6d0aeb56364a upstream. If dev->dma_device->params == NULL then the maximum DMA segment size is 64 KB. See also the dma_get_max_seg_size() implementation. 
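For reference, a sketch of the fallback behaviour that dma_get_max_seg_size() provides and that ib_dma_max_seg_size() is being aligned with; this mirrors the DMA-API helper as I understand it, not the RDMA patch itself:

    static unsigned int max_seg_size(struct device *dev)
    {
            /* No DMA parameters attached: the DMA API assumes 64 KB segments. */
            if (dev->dma_parms && dev->dma_parms->max_segment_size)
                    return dev->dma_parms->max_segment_size;
            return SZ_64K;
    }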
This patch fixes the following kernel warning: DMA-API: infiniband rxe0: mapping sg segment longer than device claims to support [len=126976] [max=65536] WARNING: CPU: 4 PID: 4848 at kernel/dma/debug.c:1220 debug_dma_map_sg+0x3d9/0x450 RIP: 0010:debug_dma_map_sg+0x3d9/0x450 Call Trace: srp_queuecommand+0x626/0x18d0 [ib_srp] scsi_queue_rq+0xd02/0x13e0 [scsi_mod] __blk_mq_try_issue_directly+0x2b3/0x3f0 blk_mq_request_issue_directly+0xac/0xf0 blk_insert_cloned_request+0xdf/0x170 dm_mq_queue_rq+0x43d/0x830 [dm_mod] __blk_mq_try_issue_directly+0x2b3/0x3f0 blk_mq_request_issue_directly+0xac/0xf0 blk_mq_try_issue_list_directly+0xb8/0x170 blk_mq_sched_insert_requests+0x23c/0x3b0 blk_mq_flush_plug_list+0x529/0x730 blk_flush_plug_list+0x21f/0x260 blk_mq_make_request+0x56b/0xf20 generic_make_request+0x196/0x660 submit_bio+0xae/0x290 blkdev_direct_IO+0x822/0x900 generic_file_direct_write+0x110/0x200 __generic_file_write_iter+0x124/0x2a0 blkdev_write_iter+0x168/0x270 aio_write+0x1c4/0x310 io_submit_one+0x971/0x1390 __x64_sys_io_submit+0x12a/0x390 do_syscall_64+0x6f/0x2e0 entry_SYSCALL_64_after_hwframe+0x49/0xbe Link: https://lore.kernel.org/r/20191025225830.257535-2-bvanassche@acm.org Cc: Fixes: 0b5cb3300ae5 ("RDMA/srp: Increase max_segment_size") Signed-off-by: Bart Van Assche Reviewed-by: Jason Gunthorpe Signed-off-by: Jason Gunthorpe Signed-off-by: Greg Kroah-Hartman commit 24b5f8ce2bada1a3f156b36216d34d224716e118 Author: Jarkko Nikula Date: Sat Nov 16 17:16:51 2019 +0200 ARM: dts: omap3-tao3530: Fix incorrect MMC card detection GPIO polarity commit 287897f9aaa2ad1c923d9875914f57c4dc9159c8 upstream. The MMC card detection GPIO polarity is active low on TAO3530, like in many other similar boards. Now the card is not detected and it is unable to mount rootfs from an SD card. Fix this by using the correct polarity. This incorrect polarity was defined already in the commit 30d95c6d7092 ("ARM: dts: omap3: Add Technexion TAO3530 SOM omap3-tao3530.dtsi") in v3.18 kernel and later changed to use defined GPIO constants in v4.4 kernel by the commit 3a637e008e54 ("ARM: dts: Use defined GPIO constants in flags cell for OMAP2+ boards"). While the latter commit did not introduce the issue I'm marking it with Fixes tag due the v4.4 kernels still being maintained. Fixes: 3a637e008e54 ("ARM: dts: Use defined GPIO constants in flags cell for OMAP2+ boards") Cc: linux-stable # 4.4+ Signed-off-by: Jarkko Nikula Signed-off-by: Tony Lindgren Signed-off-by: Greg Kroah-Hartman commit a495f6dd2a9e5f16860098d5287ee1cfa7121778 Author: H. Nikolaus Schaller Date: Thu Nov 7 11:30:37 2019 +0100 mmc: host: omap_hsmmc: add code for special init of wl1251 to get rid of pandora_wl1251_init_card commit f6498b922e57aecbe3b7fa30a308d9d586c0c369 upstream. Pandora_wl1251_init_card was used to do special pdata based setup of the sdio mmc interface. This does no longer work with v4.7 and later. A fix requires a device tree based mmc3 setup. Therefore we move the special setup to omap_hsmmc.c instead of calling some pdata supplied init_card function. The new code checks for a DT child node compatible to wl1251 so it will not affect other MMC3 use cases. Generally, this code was and still is a hack and should be moved to mmc core to e.g. read such properties from optional DT child nodes. Fixes: 81eef6ca9201 ("mmc: omap_hsmmc: Use dma_request_chan() for requesting DMA channel") Signed-off-by: H. 
Nikolaus Schaller Cc: # v4.7+ [Ulf: Fixed up some checkpatch complaints] Signed-off-by: Ulf Hansson Signed-off-by: Greg Kroah-Hartman commit 1dc61ab2a1136671adbcef095a4370006d99dd10 Author: Krzysztof Kozlowski Date: Mon Aug 5 18:27:09 2019 +0200 pinctrl: samsung: Fix device node refcount leaks in S3C64xx wakeup controller init commit 7f028caadf6c37580d0f59c6c094ed09afc04062 upstream. In s3c64xx_eint_eint0_init() the for_each_child_of_node() loop is used with a break to find a matching child node. Although each iteration of for_each_child_of_node puts the previous node, but early exit from loop misses it. This leads to leak of device node. Cc: Fixes: 61dd72613177 ("pinctrl: Add pinctrl-s3c64xx driver") Signed-off-by: Krzysztof Kozlowski Signed-off-by: Greg Kroah-Hartman commit 75ae5a92a1f669679ffae5df64adb468777f4e9d Author: Krzysztof Kozlowski Date: Mon Aug 5 18:27:10 2019 +0200 pinctrl: samsung: Fix device node refcount leaks in init code commit a322b3377f4bac32aa25fb1acb9e7afbbbbd0137 upstream. Several functions use for_each_child_of_node() loop with a break to find a matching child node. Although each iteration of for_each_child_of_node puts the previous node, but early exit from loop misses it. This leads to leak of device node. Cc: Fixes: 9a2c1c3b91aa ("pinctrl: samsung: Allow grouping multiple pinmux/pinconf nodes") Signed-off-by: Krzysztof Kozlowski Signed-off-by: Greg Kroah-Hartman commit 7b703ca18b92dd44419727f932d32d5d6f0dff2c Author: Krzysztof Kozlowski Date: Mon Aug 5 18:27:08 2019 +0200 pinctrl: samsung: Fix device node refcount leaks in S3C24xx wakeup controller init commit 6fbbcb050802d6ea109f387e961b1dbcc3a80c96 upstream. In s3c24xx_eint_init() the for_each_child_of_node() loop is used with a break to find a matching child node. Although each iteration of for_each_child_of_node puts the previous node, but early exit from loop misses it. This leads to leak of device node. Cc: Fixes: af99a7507469 ("pinctrl: Add pinctrl-s3c24xx driver") Signed-off-by: Krzysztof Kozlowski Signed-off-by: Greg Kroah-Hartman commit d3d3a0bc3228056b00af2955b9273aa3c5eb264a Author: Krzysztof Kozlowski Date: Mon Aug 5 18:27:07 2019 +0200 pinctrl: samsung: Fix device node refcount leaks in Exynos wakeup controller init commit 5c7f48dd14e892e3e920dd6bbbd52df79e1b3b41 upstream. In exynos_eint_wkup_init() the for_each_child_of_node() loop is used with a break to find a matching child node. Although each iteration of for_each_child_of_node puts the previous node, but early exit from loop misses it. This leads to leak of device node. Cc: Fixes: 43b169db1841 ("pinctrl: add exynos4210 specific extensions for samsung pinctrl driver") Signed-off-by: Krzysztof Kozlowski Signed-off-by: Greg Kroah-Hartman commit 4e8285d98c520eebb3a5c8740279a537684f98f1 Author: Nishka Dasgupta Date: Sun Aug 4 21:32:00 2019 +0530 pinctrl: samsung: Add of_node_put() before return in error path commit 3d2557ab75d4c568c79eefa2e550e0d80348a6bd upstream. Each iteration of for_each_child_of_node puts the previous node, but in the case of a return from the middle of the loop, there is no put, thus causing a memory leak. Hence add an of_node_put before the return of exynos_eint_wkup_init() error path. Issue found with Coccinelle. 
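The same reference-counting pattern underlies all of the pinctrl-samsung fixes above; a generic sketch (the compatible string is made up) of where the extra of_node_put() belongs when a for_each_child_of_node() loop exits early:

    struct device_node *child, *found = NULL;

    for_each_child_of_node(np, child) {
            if (of_device_is_compatible(child, "vendor,example")) {
                    /* Early exit keeps child's reference; remember it. */
                    found = child;
                    break;
            }
    }
    /* ... use "found" ... */
    of_node_put(found);     /* drop the retained reference; no-op when NULL */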
Signed-off-by: Nishka Dasgupta Cc: Fixes: 14c255d35b25 ("pinctrl: exynos: Add irq_chip instance for Exynos7 wakeup interrupts") Signed-off-by: Krzysztof Kozlowski Signed-off-by: Greg Kroah-Hartman commit 0298d6cf85462e8ba090fa37a6a30008d97c4afb Author: Gregory CLEMENT Date: Fri Nov 15 16:57:52 2019 +0100 pinctrl: armada-37xx: Fix irq mask access in armada_37xx_irq_set_type() commit 04fb02757ae5188031eb71b2f6f189edb1caf5dc upstream. As explained in the following commit a9a1a4833613 ("pinctrl: armada-37xx: Fix gpio interrupt setup") the armada_37xx_irq_set_type() function can be called before the initialization of the mask field. That means that we can't use this field in this function and need to workaround it using hwirq. Fixes: 30ac0d3b0702 ("pinctrl: armada-37xx: Add edge both type gpio irq support") Cc: stable@vger.kernel.org Reported-by: Russell King Signed-off-by: Gregory CLEMENT Link: https://lore.kernel.org/r/20191115155752.2562-1-gregory.clement@bootlin.com Signed-off-by: Linus Walleij Signed-off-by: Greg Kroah-Hartman commit c21e0c84a858465f0191d6fdabc587b68b2ca4a9 Author: Chris Brandt Date: Mon Sep 30 09:58:04 2019 -0500 pinctrl: rza2: Fix gpio name typos commit 930d3a4907ae6cdb476db23fc7caa86e9de1e557 upstream. Fix apparent copy/paste errors that were overlooked in the original driver. "P0_4" -> "PF_4" "P0_3" -> "PG_3" Fixes: b59d0e782706 ("pinctrl: Add RZ/A2 pin and gpio controller") Cc: Signed-off-by: Chris Brandt Link: https://lore.kernel.org/r/20190930145804.30497-1-chris.brandt@renesas.com Signed-off-by: Geert Uytterhoeven Signed-off-by: Greg Kroah-Hartman commit be059d26faa29f3d8bca1679f66b66d7755bf1c8 Author: Rafael J. Wysocki Date: Wed Dec 4 02:54:27 2019 +0100 ACPI: PM: Avoid attaching ACPI PM domain to certain devices commit b9ea0bae260f6aae546db224daa6ac1bd9d94b91 upstream. Certain ACPI-enumerated devices represented as platform devices in Linux, like fans, require special low-level power management handling implemented by their drivers that is not in agreement with the ACPI PM domain behavior. That leads to problems with managing ACPI fans during system-wide suspend and resume. For this reason, make acpi_dev_pm_attach() skip the affected devices by adding a list of device IDs to avoid to it and putting the IDs of the affected devices into that list. Fixes: e5cc8ef31267 (ACPI / PM: Provide ACPI PM callback routines for subsystems) Reported-by: Zhang Rui Tested-by: Todd Brandt Cc: 3.10+ # 3.10+ Signed-off-by: Rafael J. Wysocki Signed-off-by: Greg Kroah-Hartman commit 59808eaa795fd8314babc41d02fe1ba86482920d Author: Rafael J. Wysocki Date: Thu Nov 28 23:47:51 2019 +0100 ACPI: EC: Rework flushing of pending work commit 016b87ca5c8c6e9e87db442f04dc99609b11ed36 upstream. There is a race condition in the ACPI EC driver, between __acpi_ec_flush_event() and acpi_ec_event_handler(), that may cause systems to stay in suspended-to-idle forever after a wakeup event coming from the EC. Namely, acpi_s2idle_wake() calls acpi_ec_flush_work() to wait until the delayed work resulting from the handling of the EC GPE in acpi_ec_dispatch_gpe() is processed, and that function invokes __acpi_ec_flush_event() which uses wait_event() to wait for ec->nr_pending_queries to become zero on ec->wait, and that wait queue may be woken up too early. 
Suppose that acpi_ec_dispatch_gpe() has caused acpi_ec_gpe_handler() to run, so advance_transaction() has been called and it has invoked acpi_ec_submit_query() to queue up an event work item, so ec->nr_pending_queries has been incremented (under ec->lock). The work function of that work item, acpi_ec_event_handler() runs later and calls acpi_ec_query() to process the event. That function calls acpi_ec_transaction() which invokes acpi_ec_transaction_unlocked() and the latter wakes up ec->wait under ec->lock, but it drops that lock before returning. When acpi_ec_query() returns, acpi_ec_event_handler() acquires ec->lock and decrements ec->nr_pending_queries, but at that point __acpi_ec_flush_event() (woken up previously) may already have acquired ec->lock, checked the value of ec->nr_pending_queries (and it would not have been zero then) and decided to go back to sleep. Next, if ec->nr_pending_queries is equal to zero now, the loop in acpi_ec_event_handler() terminates, ec->lock is released and acpi_ec_check_event() is called, but it does nothing unless ec_event_clearing is equal to ACPI_EC_EVT_TIMING_EVENT (which is not the case by default). In the end, if no more event work items have been queued up while executing acpi_ec_transaction_unlocked(), there is nothing to wake up __acpi_ec_flush_event() again and it sleeps forever, so the suspend-to-idle loop cannot make progress and the system is permanently suspended. To avoid this issue, notice that it actually is not necessary to wait for ec->nr_pending_queries to become zero in every case in which __acpi_ec_flush_event() is used. First, during platform-based system suspend (not suspend-to-idle), __acpi_ec_flush_event() is called by acpi_ec_disable_event() after clearing the EC_FLAGS_QUERY_ENABLED flag, which prevents acpi_ec_submit_query() from submitting any new event work items, so calling flush_scheduled_work() and flushing ec_query_wq subsequently (in order to wait until all of the queries in that queue have been processed) would be sufficient to flush all of the pending EC work in that case. Second, the purpose of the flushing of pending EC work while suspended-to-idle described above really is to wait until the first event work item coming from acpi_ec_dispatch_gpe() is complete, because it should produce system wakeup events if that is a valid EC-based system wakeup, so calling flush_scheduled_work() followed by flushing ec_query_wq is also sufficient for that purpose. Rework the code to follow the above observations. Fixes: 56b9918490 ("PM: sleep: Simplify suspend-to-idle control flow") Reported-by: Kenneth R. Crudup Tested-by: Kenneth R. Crudup Cc: 5.4+ # 5.4+ Signed-off-by: Rafael J. Wysocki Signed-off-by: Greg Kroah-Hartman commit f296f648e76a894dd1c7612eabb03e621f3ca520 Author: Vamshi K Sthambamkadi Date: Thu Nov 28 15:58:29 2019 +0530 ACPI: bus: Fix NULL pointer check in acpi_bus_get_private_data() commit 627ead724eff33673597216f5020b72118827de4 upstream. 
kmemleak reported backtrace: [] kmem_cache_alloc_trace+0x128/0x260 [<6677f215>] i2c_acpi_install_space_handler+0x4b/0xe0 [<1180f4fc>] i2c_register_adapter+0x186/0x400 [<6083baf7>] i2c_add_adapter+0x4e/0x70 [] intel_gmbus_setup+0x1a2/0x2c0 [i915] [<84cb69ae>] i915_driver_probe+0x8d8/0x13a0 [i915] [<81911d4b>] i915_pci_probe+0x48/0x160 [i915] [<4b159af1>] pci_device_probe+0xdc/0x160 [] really_probe+0x1ee/0x450 [] driver_probe_device+0x142/0x1b0 [] device_driver_attach+0x49/0x50 [] __driver_attach+0xc9/0x150 [] bus_for_each_dev+0x56/0xa0 [<80089bba>] driver_attach+0x19/0x20 [] bus_add_driver+0x177/0x220 [<7b29d8c7>] driver_register+0x56/0xf0 In i2c_acpi_remove_space_handler(), a leak occurs whenever the "data" parameter is initialized to 0 before being passed to acpi_bus_get_private_data(). This is because the NULL pointer check in acpi_bus_get_private_data() (condition->if(!*data)) returns EINVAL and, in consequence, memory is never freed in i2c_acpi_remove_space_handler(). Fix the NULL pointer check in acpi_bus_get_private_data() to follow the analogous check in acpi_get_data_full(). Signed-off-by: Vamshi K Sthambamkadi [ rjw: Subject & changelog ] Cc: All applicable Signed-off-by: Rafael J. Wysocki Signed-off-by: Greg Kroah-Hartman commit b8b5c898b0081f6a180e01f5135b7b5b37493e10 Author: Francesco Ruggeri Date: Tue Nov 19 21:47:27 2019 -0800 ACPI: OSL: only free map once in osl.c commit 833a426cc471b6088011b3d67f1dc4e147614647 upstream. acpi_os_map_cleanup checks map->refcount outside of acpi_ioremap_lock before freeing the map. This creates a race condition the can result in the map being freed more than once. A panic can be caused by running for ((i=0; i<10; i++)) do for ((j=0; j<100000; j++)) do cat /sys/firmware/acpi/tables/data/BERT >/dev/null done & done This patch makes sure that only the process that drops the reference to 0 does the freeing. Fixes: b7c1fadd6c2e ("ACPI: Do not use krefs under a mutex in osl.c") Signed-off-by: Francesco Ruggeri Reviewed-by: Dmitry Safonov <0x7f454c46@gmail.com> Cc: All applicable Signed-off-by: Rafael J. Wysocki Signed-off-by: Greg Kroah-Hartman commit ebbc1380a366e5047c1cc579cdaecfafb2a4d937 Author: Mika Westerberg Date: Wed Oct 30 18:05:45 2019 +0300 ACPI / hotplug / PCI: Allocate resources directly under the non-hotplug bridge commit 77adf9355304f8dcf09054280af5e23fc451ab3d upstream. Valerio and others reported that commit 84c8b58ed3ad ("ACPI / hotplug / PCI: Don't scan bridges managed by native hotplug") prevents some recent LG and HP laptops from booting with endless loop of: ACPI Error: No handler or method for GPE 08, disabling event (20190215/evgpe-835) ACPI Error: No handler or method for GPE 09, disabling event (20190215/evgpe-835) ACPI Error: No handler or method for GPE 0A, disabling event (20190215/evgpe-835) ... What seems to happen is that during boot, after the initial PCI enumeration when EC is enabled the platform triggers ACPI Notify() to one of the root ports. The root port itself looks like this: pci 0000:00:1b.0: PCI bridge to [bus 02-3a] pci 0000:00:1b.0: bridge window [mem 0xc4000000-0xda0fffff] pci 0000:00:1b.0: bridge window [mem 0x80000000-0xa1ffffff 64bit pref] The BIOS has configured the root port so that it does not have I/O bridge window. Now when the ACPI Notify() is triggered ACPI hotplug handler calls acpiphp_native_scan_bridge() for each non-hotplug bridge (as this system is using native PCIe hotplug) and pci_assign_unassigned_bridge_resources() to allocate resources. 
The device connected to the root port is a PCIe switch (Thunderbolt controller) with two hotplug downstream ports. Because of the hotplug ports __pci_bus_size_bridges() tries to add "additional I/O" of 256 bytes to each (DEFAULT_HOTPLUG_IO_SIZE). This gets further aligned to 4k as that's the minimum I/O window size so each hotplug port gets 4k I/O window and the same happens for the root port (which is also hotplug port). This means 3 * 4k = 12k I/O window. Because of this pci_assign_unassigned_bridge_resources() ends up opening a I/O bridge window for the root port at first available I/O address which seems to be in range 0x1000 - 0x3fff. Normally this range is used for ACPI stuff such as GPE bits (below is part of /proc/ioports): 1800-1803 : ACPI PM1a_EVT_BLK 1804-1805 : ACPI PM1a_CNT_BLK 1808-180b : ACPI PM_TMR 1810-1815 : ACPI CPU throttle 1850-1850 : ACPI PM2_CNT_BLK 1854-1857 : pnp 00:05 1860-187f : ACPI GPE0_BLK However, when the ACPI Notify() happened this range was not yet reserved for ACPI/PNP (that happens later) so PCI gets it. It then starts writing to this range and accidentally stomps over GPE bits among other things causing the endless stream of messages about missing GPE handler. This problem does not happen if "pci=hpiosize=0" is passed in the kernel command line. The reason is that then the kernel does not try to allocate the additional 256 bytes for each hotplug port. Fix this by allocating resources directly below the non-hotplug bridges where a new device may appear as a result of ACPI Notify(). This avoids the hotplug bridges and prevents opening the additional I/O window. Fixes: 84c8b58ed3ad ("ACPI / hotplug / PCI: Don't scan bridges managed by native hotplug") Link: https://bugzilla.kernel.org/show_bug.cgi?id=203617 Link: https://lore.kernel.org/r/20191030150545.19885-1-mika.westerberg@linux.intel.com Reported-by: Valerio Passini Signed-off-by: Mika Westerberg Signed-off-by: Bjorn Helgaas Reviewed-by: Rafael J. Wysocki Cc: stable@vger.kernel.org Signed-off-by: Greg Kroah-Hartman commit 4b598c171e622c4775314541c68fd13245ded6c2 Author: Hans de Goede Date: Thu Oct 24 23:57:23 2019 +0200 ACPI: LPSS: Add dmi quirk for skipping _DEP check for some device-links commit 6025e2fae3dde3c3d789d08f8ceacbdd9f90d471 upstream. The iGPU / GFX0 device's _PS0 method on the ASUS T200TA depends on the I2C1 controller (which is connected to the embedded controller). But unlike in the T100TA/T100CHI this dependency is not listed in the _DEP of the GFX0 device. This results in the dev_WARN_ONCE(..., "Transfer while suspended\n") call in i2c-designware-master.c triggering and the AML code not working as it should. This commit fixes this by adding a dmi based quirk mechanism for devices which miss a _DEP, and adding a quirk for the LNXVIDEO depending on the I2C1 device on the Asus T200TA. Fixes: 2d71ee0ce72f ("ACPI / LPSS: Add a device link from the GPU to the BYT I2C5 controller") Tested-by: Pierre-Louis Bossart Signed-off-by: Hans de Goede Reviewed-by: Andy Shevchenko Cc: 4.20+ # 4.20+ Signed-off-by: Rafael J. Wysocki Signed-off-by: Greg Kroah-Hartman commit 4cbdbad9ae74bae258a074b036c9be08c4f0ad0f Author: Hans de Goede Date: Thu Oct 24 23:57:22 2019 +0200 ACPI: LPSS: Add LNXVIDEO -> BYT I2C1 to lpss_device_links commit b3b3519c04bdff91651d0a6deb79dbd4516b5d7b upstream. 
Various Asus Bay Trail devices (T100TA, T100CHI, T200TA) have an embedded controller connected to I2C1 and the iGPU (LNXVIDEO) _PS0/_PS3 methods access it, so we need to add a consumer link from LNXVIDEO to I2C1 on these devices to avoid suspend/resume ordering problems. Fixes: 2d71ee0ce72f ("ACPI / LPSS: Add a device link from the GPU to the BYT I2C5 controller") Tested-by: Pierre-Louis Bossart Signed-off-by: Hans de Goede Reviewed-by: Andy Shevchenko Cc: 4.20+ # 4.20+ Signed-off-by: Rafael J. Wysocki Signed-off-by: Greg Kroah-Hartman commit 8655d19193395aefeb544bde0aacfaf3e58070a8 Author: Hans de Goede Date: Thu Oct 24 23:57:21 2019 +0200 ACPI: LPSS: Add LNXVIDEO -> BYT I2C7 to lpss_device_links commit cc18735f208565343a9824adeca5305026598550 upstream. So far on Bay Trail (BYT) we only have been adding a device_link adding the iGPU (LNXVIDEO) device as consumer for the I2C controller for the PMIC for I2C5, but the PMIC only uses I2C5 on BYT CR (cost reduced) on regular BYT platforms I2C7 is used and we were not adding the device_link sometimes causing resume ordering issues. This commit adds LNXVIDEO -> BYT I2C7 to the lpss_device_links table, fixing this. Fixes: 2d71ee0ce72f ("ACPI / LPSS: Add a device link from the GPU to the BYT I2C5 controller") Tested-by: Pierre-Louis Bossart Signed-off-by: Hans de Goede Reviewed-by: Andy Shevchenko Cc: 4.20+ # 4.20+ Signed-off-by: Rafael J. Wysocki Signed-off-by: Greg Kroah-Hartman commit e9fcfbc239c0eb73a4775212f8872ce0520dedfe Author: Andy Shevchenko Date: Tue Oct 1 17:27:21 2019 +0300 ACPI / utils: Move acpi_dev_get_first_match_dev() under CONFIG_ACPI commit a814dcc269830c9dbb8a83731cfc6fc5dd787f8d upstream. We have a stub defined for the acpi_dev_get_first_match_dev() in acpi.h for the case when CONFIG_ACPI=n. Moreover, acpi_dev_put(), counterpart function, is already placed under CONFIG_ACPI. Thus, move acpi_dev_get_first_match_dev() under CONFIG_ACPI as well. Fixes: 817b4d64da03 ("ACPI / utils: Introduce acpi_dev_get_first_match_dev() helper") Reported-by: kbuild test robot Signed-off-by: Andy Shevchenko Reviewed-by: Mika Westerberg Cc: 5.2+ # 5.2+ Signed-off-by: Rafael J. Wysocki Signed-off-by: Greg Kroah-Hartman commit ea8627164928c2a57e8bc9e24e7d48ea4edb137c Author: Hui Wang Date: Wed Dec 11 13:13:21 2019 +0800 ALSA: hda/realtek - Line-out jack doesn't work on a Dell AIO commit 5815bdfd7f54739be9abed1301d55f5e74d7ad1f upstream. After applying the fixup ALC274_FIXUP_DELL_AIO_LINEOUT_VERB, the Line-out jack works well. And instead of adding a new set of pin definition in the pin_fixup_tbl, we put a more generic matching entry in the fallback_pin_fixup_tbl. Cc: Signed-off-by: Hui Wang Link: https://lore.kernel.org/r/20191211051321.5883-1-hui.wang@canonical.com Signed-off-by: Takashi Iwai Signed-off-by: Greg Kroah-Hartman commit dc4f813f1d66f32ec5b6a111ca41221517735d2a Author: Takashi Sakamoto Date: Tue Dec 10 00:03:04 2019 +0900 ALSA: oxfw: fix return value in error path of isochronous resources reservation commit 59a126aa3113fc23f03fedcafe3705f1de5aff50 upstream. Even if isochronous resources reservation fails, error code doesn't return in pcm.hw_params callback. 
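The oxfw fix above (and the matching fireface fix just after it) comes down to propagating the reservation error out of the .hw_params callback instead of returning 0. A minimal, hedged sketch of that pattern; example_card and example_reserve_iso_resources() are placeholders, not the real snd-oxfw/snd-fireface symbols:

  #include <linux/errno.h>
  #include <linux/mutex.h>
  #include <sound/pcm.h>

  /* Placeholder driver state, not the real snd-oxfw/snd-fireface structures. */
  struct example_card {
      struct mutex mutex;
      unsigned int substreams_count;
  };

  /* Pretend reservation helper; in the real drivers this can fail (-EBUSY etc.). */
  static int example_reserve_iso_resources(struct example_card *card)
  {
      return -EBUSY;
  }

  static int example_pcm_hw_params(struct snd_pcm_substream *substream,
                                   struct snd_pcm_hw_params *hw_params)
  {
      struct example_card *card = substream->private_data;
      int err;

      mutex_lock(&card->mutex);
      err = example_reserve_iso_resources(card);
      if (err >= 0)
          ++card->substreams_count;
      mutex_unlock(&card->mutex);

      return err;    /* the fix: do not return 0 when the reservation failed */
  }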
Cc: #5.3+ Fixes: 4f380d007052 ("ALSA: oxfw: configure packet format in pcm.hw_params callback") Signed-off-by: Takashi Sakamoto Link: https://lore.kernel.org/r/20191209151655.GA8090@workstation Signed-off-by: Takashi Iwai Signed-off-by: Greg Kroah-Hartman commit d3a811fd7882cd61262aff1d5a86afa0297cedb0 Author: Takashi Sakamoto Date: Tue Dec 10 00:05:41 2019 +0900 ALSA: fireface: fix return value in error path of isochronous resources reservation commit 480136343cbe89426d6c2ab74ffb4e3ee572c7ee upstream. Even if isochronous resources reservation fails, error code doesn't return in pcm.hw_params callback. Cc: #5.3+ Fixes: 55162d2bb0e8 ("ALSA: fireface: reserve/release isochronous resources in pcm.hw_params/hw_free callbacks") Signed-off-by: Takashi Sakamoto Link: https://lore.kernel.org/r/20191209151655.GA8090@workstation Signed-off-by: Takashi Iwai Signed-off-by: Greg Kroah-Hartman commit 5ec6a40b88d8d791adcaa5503ed7c8d6ab5013b2 Author: John Hubbard Date: Wed Oct 30 22:21:59 2019 -0700 cpufreq: powernv: fix stack bloat and hard limit on number of CPUs commit db0d32d84031188443e25edbd50a71a6e7ac5d1d upstream. The following build warning occurred on powerpc 64-bit builds: drivers/cpufreq/powernv-cpufreq.c: In function 'init_chip_info': drivers/cpufreq/powernv-cpufreq.c:1070:1: warning: the frame size of 1040 bytes is larger than 1024 bytes [-Wframe-larger-than=] This is with a cross-compiler based on gcc 8.1.0, which I got from: https://mirrors.edge.kernel.org/pub/tools/crosstool/files/bin/x86_64/8.1.0/ The warning is due to putting 1024 bytes on the stack: unsigned int chip[256]; ...and it's also undesirable to have a hard limit on the number of CPUs here. Fix both problems by dynamically allocating based on num_possible_cpus, as recommended by Michael Ellerman. Fixes: 053819e0bf840 ("cpufreq: powernv: Handle throttling due to Pmax capping at chip level") Signed-off-by: John Hubbard Acked-by: Viresh Kumar Cc: 4.10+ # 4.10+ Signed-off-by: Rafael J. Wysocki Signed-off-by: Greg Kroah-Hartman commit 1b5d4a3a0957dbbd403a3f1ce3a6e23b5d3db2c0 Author: Leonard Crestez Date: Tue Sep 24 10:52:23 2019 +0300 PM / devfreq: Lock devfreq in trans_stat_show commit 2abb0d5268ae7b5ddf82099b1f8d5aa8414637d4 upstream. There is no locking in this sysfs show function so stats printing can race with a devfreq_update_status called as part of freq switching or with initialization. Also add an assert in devfreq_update_status to make it clear that lock must be held by caller. Fixes: 39688ce6facd ("PM / devfreq: account suspend/resume for stats") Cc: stable@vger.kernel.org Signed-off-by: Leonard Crestez Reviewed-by: Matthias Kaehlcke Reviewed-by: Chanwoo Choi Signed-off-by: Chanwoo Choi Signed-off-by: Greg Kroah-Hartman commit 40c3c389329f71ca2006ba802b5945e08b0a12b4 Author: Alexander Shishkin Date: Wed Nov 20 15:08:06 2019 +0200 intel_th: pci: Add Tiger Lake CPU support commit 6e6c18bcb78c0dc0601ebe216bed12c844492d0c upstream. This adds support for the Trace Hub in Tiger Lake CPU. Signed-off-by: Alexander Shishkin Reviewed-by: Andy Shevchenko Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20191120130806.44028-4-alexander.shishkin@linux.intel.com Signed-off-by: Greg Kroah-Hartman commit b3e7c7242abbf26fa981233fca359d70a41612e7 Author: Alexander Shishkin Date: Wed Nov 20 15:08:05 2019 +0200 intel_th: pci: Add Ice Lake CPU support commit 6a1743422a7c0fda26764a544136cac13e5ae486 upstream. This adds support for the Trace Hub in Ice Lake CPU. 
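For the powernv-cpufreq stack-bloat fix above, the shape of the change is replacing a fixed on-stack array with an allocation sized by num_possible_cpus(). A rough sketch under that assumption; example_init_chip_info() and the loop body are illustrative, not the driver code:

  #include <linux/cpumask.h>
  #include <linux/slab.h>

  static int example_init_chip_info(void)
  {
      unsigned int *chip;
      int cpu;

      /* Was: unsigned int chip[256]; -- 1 KiB of stack and a hard CPU limit. */
      chip = kcalloc(num_possible_cpus(), sizeof(*chip), GFP_KERNEL);
      if (!chip)
          return -ENOMEM;

      for_each_possible_cpu(cpu) {
          /* ... collect the distinct chip ids of the possible CPUs in chip[] ... */
      }

      kfree(chip);
      return 0;
  }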
Signed-off-by: Alexander Shishkin Reviewed-by: Andy Shevchenko Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20191120130806.44028-3-alexander.shishkin@linux.intel.com Signed-off-by: Greg Kroah-Hartman commit eb0add45c99d97c812c986ad236a471ec4ba645d Author: Alexander Shishkin Date: Wed Nov 20 15:08:04 2019 +0200 intel_th: Fix a double put_device() in error path commit 512592779a337feb5905d8fcf9498dbf33672d4a upstream. Commit a753bfcfdb1f ("intel_th: Make the switch allocate its subdevices") factored out intel_th_subdevice_alloc() from intel_th_populate(), but got the error path wrong, resulting in two instances of a double put_device() on a freshly initialized, but not 'added' device. Fix this by only doing one put_device() in the error path. Signed-off-by: Alexander Shishkin Fixes: a753bfcfdb1f ("intel_th: Make the switch allocate its subdevices") Reported-by: Wen Yang Reviewed-by: Andy Shevchenko Cc: stable@vger.kernel.org # v4.14+ Link: https://lore.kernel.org/r/20191120130806.44028-2-alexander.shishkin@linux.intel.com Signed-off-by: Greg Kroah-Hartman commit 69fb7f4e86be62de3cdc714bc94dd0eb131dae47 Author: Madhavan Srinivasan Date: Mon Nov 18 09:14:52 2019 +0530 powerpc/perf: Disable trace_imc pmu commit 249fad734a25889a4f23ed014d43634af6798063 upstream. When a root user or a user with CAP_SYS_ADMIN privilege uses any trace_imc performance monitoring unit events, to monitor application or KVM threads, it may result in a checkstop (System crash). The cause is frequent switching of the "trace/accumulation" mode of the In-Memory Collection hardware (LDBAR). This patch disables the trace_imc PMU unit entirely to avoid triggering the checkstop. A future patch will reenable it at a later stage once a workaround has been developed. Fixes: 012ae244845f ("powerpc/perf: Trace imc PMU functions") Cc: stable@vger.kernel.org # v5.2+ Signed-off-by: Madhavan Srinivasan Tested-by: Hariharan T.S. [mpe: Add pr_info_once() so dmesg shows the PMU has been disabled] Signed-off-by: Michael Ellerman Link: https://lore.kernel.org/r/20191118034452.9939-1-maddy@linux.vnet.ibm.com Signed-off-by: Greg Kroah-Hartman commit 5f7bca3f2a467a831e0f482f509ae0d2300e914e Author: Boris Brezillon Date: Fri Nov 29 14:59:05 2019 +0100 drm/panfrost: Open/close the perfcnt BO commit 0a5239985a3bc084738851afdf3fceb7d5651b0c upstream. Commit a5efb4c9a562 ("drm/panfrost: Restructure the GEM object creation") moved the drm_mm_insert_node_generic() call to the gem->open() hook, but forgot to update perfcnt accordingly. Patch the perfcnt logic to call panfrost_gem_open/close() where appropriate. Fixes: a5efb4c9a562 ("drm/panfrost: Restructure the GEM object creation") Cc: Signed-off-by: Boris Brezillon Reviewed-by: Steven Price Acked-by: Alyssa Rosenzweig Signed-off-by: Rob Herring Link: https://patchwork.freedesktop.org/patch/msgid/20191129135908.2439529-6-boris.brezillon@collabora.com Signed-off-by: Greg Kroah-Hartman commit a88259db2765999277e09f949a5d5fac34612848 Author: Leo Yan Date: Thu Nov 7 10:02:44 2019 +0800 perf tests: Fix out of bounds memory access commit af8490eb2b33684e26a0a927a9d93ae43cd08890 upstream. The test case 'Read backward ring buffer' failed on 32-bit architectures which were found by LKFT perf testing. 
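The intel_th error-path fix above is an instance of the general "exactly one put_device() after device_initialize()" rule. A generic sketch of that pattern; the example_* names are illustrative and this is not the intel_th code:

  #include <linux/device.h>
  #include <linux/err.h>
  #include <linux/slab.h>

  static void example_subdev_release(struct device *dev)
  {
      kfree(dev);
  }

  static struct device *example_subdev_alloc(struct device *parent)
  {
      struct device *dev;
      int err;

      dev = kzalloc(sizeof(*dev), GFP_KERNEL);
      if (!dev)
          return ERR_PTR(-ENOMEM);

      device_initialize(dev);
      dev->parent = parent;
      dev->release = example_subdev_release;

      err = dev_set_name(dev, "example-subdev");
      if (err)
          goto err_put;

      err = device_add(dev);
      if (err)
          goto err_put;

      return dev;

  err_put:
      /* A single put_device() drops the initial reference and frees the
       * device via its release callback; a second put here was the bug. */
      put_device(dev);
      return ERR_PTR(err);
  }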
The test failed on arm32 x15 device, qemu_arm32, qemu_i386, and found intermittent failure on i386; the failure log is as below: 50: Read backward ring buffer : --- start --- test child forked, pid 510 Using CPUID GenuineIntel-6-9E-9 mmap size 1052672B mmap size 8192B Finished reading overwrite ring buffer: rewind free(): invalid next size (fast) test child interrupted ---- end ---- Read backward ring buffer: FAILED! The log hints there have issue for memory usage, thus free() reports error 'invalid next size' and directly exit for the case. Finally, this issue is root caused as out of bounds memory access for the data array 'evsel->id'. The backward ring buffer test invokes do_test() twice. 'evsel->id' is allocated at the first call with the flow: test__backward_ring_buffer() `-> do_test() `-> evlist__mmap() `-> evlist__mmap_ex() `-> perf_evsel__alloc_id() So 'evsel->id' is allocated with one item, and it will be used in function perf_evlist__id_add(): evsel->id[0] = id evsel->ids = 1 At the second call for do_test(), it skips to initialize 'evsel->id' and reuses the array which is allocated in the first call. But 'evsel->ids' contains the stale value. Thus: evsel->id[1] = id -> out of bound access evsel->ids = 2 To fix this issue, we will use evlist__open() and evlist__close() pair functions to prepare and cleanup context for evlist; so 'evsel->id' and 'evsel->ids' can be initialized properly when invoke do_test() and avoid the out of bounds memory access. Fixes: ee74701ed8ad ("perf tests: Add test to check backward ring buffer") Signed-off-by: Leo Yan Reviewed-by: Jiri Olsa Cc: Alexander Shishkin Cc: Mark Rutland Cc: Namhyung Kim Cc: Naresh Kamboju Cc: Peter Zijlstra Cc: Wang Nan Cc: stable@vger.kernel.org # v4.10+ Link: http://lore.kernel.org/lkml/20191107020244.2427-1-leo.yan@linaro.org Signed-off-by: Arnaldo Carvalho de Melo Signed-off-by: Greg Kroah-Hartman commit a70bc7cc7608cc7bfa130c2fd982f6e7760e682b Author: Gao Xiang Date: Sun Dec 1 16:01:09 2019 +0800 erofs: zero out when listxattr is called with no xattr commit 926d1650176448d7684b991fbe1a5b1a8289e97c upstream. As David reported [1], ENODATA returns when attempting to modify files by using EROFS as an overlayfs lower layer. The root cause is that listxattr could return unexpected -ENODATA by mistake for inodes without xattr. That breaks listxattr return value convention and it can cause copy up failure when used with overlayfs. Resolve by zeroing out if no xattr is found for listxattr. [1] https://lore.kernel.org/r/CAEvUa7nxnby+rxK-KRMA46=exeOMApkDMAV08AjMkkPnTPV4CQ@mail.gmail.com Link: https://lore.kernel.org/r/20191201084040.29275-1-hsiangkao@aol.com Fixes: cadf1ccf1b00 ("staging: erofs: add error handling for xattr submodule") Cc: # 4.19+ Reviewed-by: Chao Yu Signed-off-by: Gao Xiang Signed-off-by: Greg Kroah-Hartman commit a101ec74bb19bea66ac57ddbe8676e32b1c1ed93 Author: Marcelo Tosatti Date: Fri Dec 6 13:07:41 2019 -0200 cpuidle: use first valid target residency as poll time commit 36fcb4292473cb9c9ce7706d038bcf0eda5cabeb upstream. Commit 259231a04561 ("cpuidle: add poll_limit_ns to cpuidle_device structure") changed, by mistake, the target residency from the first available sleep state to the last available sleep state (which should be longer). This might cause excessive polling. Fixes: 259231a04561 ("cpuidle: add poll_limit_ns to cpuidle_device structure") Signed-off-by: Marcelo Tosatti Cc: 5.4+ # 5.4+ Signed-off-by: Rafael J. 
Wysocki Signed-off-by: Greg Kroah-Hartman commit 18feee7b1cadac8b3f3a1885ff1708826966fa90 Author: Rafael J. Wysocki Date: Thu Oct 10 23:37:39 2019 +0200 cpuidle: teo: Fix "early hits" handling for disabled idle states commit 159e48560f51d9c2aa02d762a18cd24f7868ab27 upstream. The TEO governor uses idle duration "bins" defined in accordance with the CPU idle states table provided by the driver, so that each "bin" covers the idle duration range between the target residency of the idle state corresponding to it and the target residency of the closest deeper idle state. The governor collects statistics for each bin regardless of whether or not the idle state corresponding to it is currently enabled. In particular, the "early hits" metric measures the likelihood of a situation in which the idle duration measured after wakeup falls into to given bin, but the time till the next timer (sleep length) falls into a bin corresponding to one of the deeper idle states. It is used when the "hits" and "misses" metrics indicate that the state "matching" the sleep length should not be selected, so that the state with the maximum "early hits" value is selected instead of it. If the idle state corresponding to the given bin is disabled, it cannot be selected and if it turns out to be the one that should be selected, a shallower idle state needs to be used instead of it. Nevertheless, the metrics collected for the bin corresponding to it are still valid and need to be taken into account as though that state had not been disabled. As far as the "early hits" metric is concerned, teo_select() tries to take disabled states into account, but the state index corresponding to the maximum "early hits" value computed by it may be incorrect. Namely, it always uses the index of the previous maximum "early hits" state then, but there may be enabled idle states closer to the disabled one in question. In particular, if the current candidate state (whose index is the idx value) is closer to the disabled one and the "early hits" value of the disabled state is greater than the current maximum, the index of the current candidate state (idx) should replace the "maximum early hits state" index. Modify the code to handle that case correctly. Fixes: b26bf6ab716f ("cpuidle: New timer events oriented governor for tickless systems") Reported-by: Doug Smythies Signed-off-by: Rafael J. Wysocki Cc: 5.1+ # 5.1+ Signed-off-by: Greg Kroah-Hartman commit 86fe55266e56d98c4473842619246c8ed1afcb13 Author: Rafael J. Wysocki Date: Thu Oct 10 23:36:15 2019 +0200 cpuidle: teo: Consider hits and misses metrics of disabled states commit e43dcf20215f0287ea113102617ca04daa76b70e upstream. The TEO governor uses idle duration "bins" defined in accordance with the CPU idle states table provided by the driver, so that each "bin" covers the idle duration range between the target residency of the idle state corresponding to it and the target residency of the closest deeper idle state. The governor collects statistics for each bin regardless of whether or not the idle state corresponding to it is currently enabled. In particular, the "hits" and "misses" metrics measure the likelihood of a situation in which both the time till the next timer (sleep length) and the idle duration measured after wakeup fall into the given bin. 
Namely, if the "hits" value is greater than the "misses" one, that situation is more likely than the one in which the sleep length falls into the given bin, but the idle duration measured after wakeup falls into a bin corresponding to one of the shallower idle states. If the idle state corresponding to the given bin is disabled, it cannot be selected and if it turns out to be the one that should be selected, a shallower idle state needs to be used instead of it. Nevertheless, the metrics collected for the bin corresponding to it are still valid and need to be taken into account as though that state had not been disabled. For this reason, make teo_select() always use the "hits" and "misses" values of the idle duration range that the sleep length falls into even if the specific idle state corresponding to it is disabled and if the "hits" values is greater than the "misses" one, select the closest enabled shallower idle state in that case. Fixes: b26bf6ab716f ("cpuidle: New timer events oriented governor for tickless systems") Signed-off-by: Rafael J. Wysocki Cc: 5.1+ # 5.1+ Signed-off-by: Greg Kroah-Hartman commit e893247c71b2dd9ffaf10d9d0711519db4331136 Author: Rafael J. Wysocki Date: Thu Oct 10 23:32:59 2019 +0200 cpuidle: teo: Rename local variable in teo_select() commit 4f690bb8ce4cc5d3fabe3a8e9c2401de1554cdc1 upstream. Rename a local variable in teo_select() in preparation for subsequent code modifications, no intentional impact. Signed-off-by: Rafael J. Wysocki Cc: 5.1+ # 5.1+ Signed-off-by: Greg Kroah-Hartman commit b327b673c508c0656e07358bb746cf160210a502 Author: Rafael J. Wysocki Date: Thu Oct 10 23:32:17 2019 +0200 cpuidle: teo: Ignore disabled idle states that are too deep commit 069ce2ef1a6dd84cbd4d897b333e30f825e021f0 upstream. Prevent disabled CPU idle state with target residencies beyond the anticipated idle duration from being taken into account by the TEO governor. Fixes: b26bf6ab716f ("cpuidle: New timer events oriented governor for tickless systems") Signed-off-by: Rafael J. Wysocki Cc: 5.1+ # 5.1+ Signed-off-by: Greg Kroah-Hartman commit 768cfe83211ca6f23ba9d6c367d753a1e1697ffc Author: Zhenzhong Duan Date: Wed Oct 23 09:57:14 2019 +0800 cpuidle: Do not unset the driver if it is there already commit 918c1fe9fbbe46fcf56837ff21f0ef96424e8b29 upstream. Fix __cpuidle_set_driver() to check if any of the CPUs in the mask has a driver different from drv already and, if so, return -EBUSY before updating any cpuidle_drivers per-CPU pointers. Fixes: 82467a5a885d ("cpuidle: simplify multiple driver support") Cc: 3.11+ # 3.11+ Signed-off-by: Zhenzhong Duan [ rjw: Subject & changelog ] Signed-off-by: Rafael J. Wysocki Signed-off-by: Greg Kroah-Hartman commit dc857d605bb8c66ca8f1082e173e97406c5b3bf1 Author: Hans Verkuil Date: Mon Sep 16 02:47:41 2019 -0300 media: cec.h: CEC_OP_REC_FLAG_ values were swapped commit 806e0cdfee0b99efbb450f9f6e69deb7118602fc upstream. CEC_OP_REC_FLAG_NOT_USED is 0 and CEC_OP_REC_FLAG_USED is 1, not the other way around. Signed-off-by: Hans Verkuil Reported-by: Jiunn Chang Cc: # for v4.10 and up Signed-off-by: Mauro Carvalho Chehab Signed-off-by: Greg Kroah-Hartman commit 3e8d9d1c4668a6eec887f0f97ba85472f3dd2d57 Author: Johan Hovold Date: Thu Oct 10 10:13:32 2019 -0300 media: radio: wl1273: fix interrupt masking on release commit 1091eb830627625dcf79958d99353c2391f41708 upstream. If a process is interrupted while accessing the radio device and the core lock is contended, release() could return early and fail to update the interrupt mask. 
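The radio-wl1273 fix above hinges on release() never bailing out early. A hedged sketch of that pattern, assuming (as the changelog describes) that the driver masks its interrupts under the core lock; example_core and example_write_irq_mask() are placeholders:

  #include <linux/mutex.h>
  #include <linux/types.h>

  struct example_core {
      struct mutex lock;
      u16 irq_mask;
  };

  static void example_write_irq_mask(struct example_core *core)
  {
      /* ... push core->irq_mask to the hardware ... */
  }

  static int example_fm_release(struct example_core *core)
  {
      /*
       * Was: if (mutex_lock_interruptible(&core->lock)) return -EINTR;
       * The release return value is ignored by the caller, so returning
       * early here simply left the interrupt mask stale.
       */
      mutex_lock(&core->lock);

      core->irq_mask = 0;
      example_write_irq_mask(core);

      mutex_unlock(&core->lock);
      return 0;
  }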
Note that the return value of the v4l2 release file operation is ignored. Fixes: 87d1a50ce451 ("[media] V4L2: WL1273 FM Radio: TI WL1273 FM radio driver") Cc: stable # 2.6.38 Cc: Matti Aaltonen Signed-off-by: Johan Hovold Signed-off-by: Hans Verkuil Signed-off-by: Mauro Carvalho Chehab Signed-off-by: Greg Kroah-Hartman commit 733c4d12e93234d78146381edbee033498744cbc Author: Johan Hovold Date: Thu Oct 10 10:13:31 2019 -0300 media: bdisp: fix memleak on release commit 11609a7e21f8cea42630350aa57662928fa4dc63 upstream. If a process is interrupted while accessing the video device and the device lock is contended, release() could return early and fail to free related resources. Note that the return value of the v4l2 release file operation is ignored. Fixes: 28ffeebbb7bd ("[media] bdisp: 2D blitter driver using v4l2 mem2mem framework") Cc: stable # 4.2 Signed-off-by: Johan Hovold Reviewed-by: Fabien Dessenne Signed-off-by: Hans Verkuil Signed-off-by: Mauro Carvalho Chehab Signed-off-by: Greg Kroah-Hartman commit 7c5aabf08037965a72a219a7a74d244c77f1380a Author: Dafna Hirschfeld Date: Tue Nov 5 18:53:17 2019 +0100 media: vimc: sen: remove unused kthread_sen field commit 3ea35d5db448c27807acbcc7a2306cf65c5e6397 upstream. The field kthread_sen in the vimc_sen_device is not set and used. So remove the field and the code that check if it is non NULL Signed-off-by: Dafna Hirschfeld Cc: # for v5.4 and up Signed-off-by: Hans Verkuil Signed-off-by: Mauro Carvalho Chehab Signed-off-by: Greg Kroah-Hartman commit ce3c4396c38ffac8ae30e2e2a498d3390dc37fce Author: Francois Buergisser Date: Tue Oct 29 02:24:48 2019 +0100 media: hantro: Fix picture order count table enable commit 58c93a548b0248fad6437f8c8921f9b031c3892a upstream. The picture order count table only makes sense for profiles higher than Baseline. This is confirmed by the H.264 specification (See 8.2.1 Decoding process for picture order count), which clarifies how POC are used for features not present in Baseline. """ Picture order counts are used to determine initial picture orderings for reference pictures in the decoding of B slices, to represent picture order differences between frames or fields for motion vector derivation in temporal direct mode, for implicit mode weighted prediction in B slices, and for decoder conformance checking. """ As a side note, this change matches various vendors downstream codebases, including ChromiumOS and IMX VPU libraries. Fixes: dea0a82f3d22 ("media: hantro: Add support for H264 decoding on G1") Signed-off-by: Francois Buergisser Signed-off-by: Ezequiel Garcia Signed-off-by: Jonas Karlman Reviewed-by: Boris Brezillon Tested-by: Boris Brezillon Cc: # for v5.4 and up Signed-off-by: Hans Verkuil Signed-off-by: Mauro Carvalho Chehab Signed-off-by: Greg Kroah-Hartman commit 4b65b884133f7a22bf99f150c0513d2a77e444b9 Author: Francois Buergisser Date: Tue Oct 29 02:24:47 2019 +0100 media: hantro: Fix motion vectors usage condition commit 658f9d9921d7e76af03f689b5f0ffde042b8bf5b upstream. The setting of the motion vectors usage and the setting of motion vectors address are currently done under different conditions. When decoding pre-recorded videos, this results of leaving the motion vectors address unset, resulting in faulty memory accesses. Fix it by using the same condition everywhere, which matches the profiles that support motion vectors. 
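A sketch of what "use the same condition everywhere" means for the hantro motion-vector fix above, with made-up structure and helper names (the real driver derives the predicate from the H.264 profile): evaluate the condition once and let both the enable bit and the buffer address depend on it.

  #include <linux/types.h>

  struct example_h264_ctx {
      bool profile_uses_motion_vectors;   /* derived once from the bitstream profile */
      dma_addr_t mv_dma_addr;
  };

  static void example_hw_enable_mv(struct example_h264_ctx *ctx, bool enable) { }
  static void example_hw_set_mv_addr(struct example_h264_ctx *ctx, dma_addr_t addr) { }

  static void example_h264_configure(struct example_h264_ctx *ctx)
  {
      bool use_mv = ctx->profile_uses_motion_vectors;

      /* Both decisions use the same predicate, so the hardware never sees
       * motion vectors enabled while the buffer address is left unset. */
      example_hw_enable_mv(ctx, use_mv);
      if (use_mv)
          example_hw_set_mv_addr(ctx, ctx->mv_dma_addr);
  }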
Fixes: dea0a82f3d22 ("media: hantro: Add support for H264 decoding on G1") Signed-off-by: Francois Buergisser Signed-off-by: Ezequiel Garcia Signed-off-by: Jonas Karlman Reviewed-by: Boris Brezillon Tested-by: Boris Brezillon Cc: # for v5.4 and up Signed-off-by: Hans Verkuil Signed-off-by: Mauro Carvalho Chehab Signed-off-by: Greg Kroah-Hartman commit 18eda8b8bb399c4cbfa6e0c90c7776bfd3383a75 Author: Ezequiel Garcia Date: Mon Oct 7 19:45:02 2019 +0200 media: hantro: Fix s_fmt for dynamic resolution changes commit ae02d49493b5d32bb3e035fdeb1655346f5e1ea5 upstream. Commit 953aaa1492c53 ("media: rockchip/vpu: Prepare things to support decoders") changed the conditions under S_FMT was allowed for OUTPUT CAPTURE buffers. However, and according to the mem-to-mem stateless decoder specification, in order to support dynamic resolution changes, S_FMT should be allowed even if OUTPUT buffers have been allocated. Relax decoder S_FMT restrictions on OUTPUT buffers, allowing a resolution modification, provided the pixel format stays the same. Tested on RK3288 platforms using ChromiumOS Video Decode/Encode Accelerator Unittests. [hverkuil: fix typo: In other -> In order] Fixes: 953aaa1492c53 ("media: rockchip/vpu: Prepare things to support decoders") Signed-off-by: Ezequiel Garcia Reviewed-by: Boris Brezillon Cc: # for v5.4 and up Signed-off-by: Hans Verkuil Signed-off-by: Mauro Carvalho Chehab Signed-off-by: Greg Kroah-Hartman commit a76ce01ec9fcddc6aad80bb0e3be12f0f3220fa7 Author: Gerald Schaefer Date: Wed Sep 11 19:42:23 2019 +0200 s390/mm: properly clear _PAGE_NOEXEC bit when it is not supported commit ab874f22d35a8058d8fdee5f13eb69d8867efeae upstream. On older HW or under a hypervisor, w/o the instruction-execution- protection (IEP) facility, and also w/o EDAT-1, a translation-specification exception may be recognized when bit 55 of a pte is one (_PAGE_NOEXEC). The current code tries to prevent setting _PAGE_NOEXEC in such cases, by removing it within set_pte_at(). However, ptep_set_access_flags() will modify a pte directly, w/o using set_pte_at(). There is at least one scenario where this can result in an active pte with _PAGE_NOEXEC set, which would then lead to a panic due to a translation-specification exception (write to swapped out page): do_swap_page pte = mk_pte (with _PAGE_NOEXEC bit) set_pte_at (will remove _PAGE_NOEXEC bit in page table, but keep it in local variable pte) vmf->orig_pte = pte (pte still contains _PAGE_NOEXEC bit) do_wp_page wp_page_reuse entry = vmf->orig_pte (still with _PAGE_NOEXEC bit) ptep_set_access_flags (writes entry with _PAGE_NOEXEC bit) Fix this by clearing _PAGE_NOEXEC already in mk_pte_phys(), where the pgprot value is applied, so that no pte with _PAGE_NOEXEC will ever be visible, if it is not supported. The check in set_pte_at() can then also be removed. Cc: # 4.11+ Fixes: 57d7f939e7bd ("s390: add no-execute support") Signed-off-by: Gerald Schaefer Signed-off-by: Vasily Gorbik Signed-off-by: Greg Kroah-Hartman commit 2438d2f8fd78f5f9c98214f28c475165bd6c3395 Author: Denis Efremov Date: Mon Sep 30 23:31:47 2019 +0300 ar5523: check NULL before memcpy() in ar5523_cmd() commit 315cee426f87658a6799815845788fde965ddaad upstream. memcpy() call with "idata == NULL && ilen == 0" results in undefined behavior in ar5523_cmd(). For example, NULL is passed in callchain "ar5523_stat_work() -> ar5523_cmd_write() -> ar5523_cmd()". This patch adds ilen check before memcpy() call in ar5523_cmd() to prevent an undefined behavior. Cc: Pontus Fuchs Cc: Kalle Valo Cc: "David S. 
Miller" Cc: David Laight Cc: stable@vger.kernel.org Signed-off-by: Denis Efremov Signed-off-by: Kalle Valo Signed-off-by: Greg Kroah-Hartman commit bd69ce19571b33e1d66b279b5dc6b2e1e95a547e Author: Denis Efremov Date: Tue Oct 1 15:08:23 2019 +0300 wil6210: check len before memcpy() calls commit 2c840676be8ffc624bf9bb4490d944fd13c02d71 upstream. memcpy() in wmi_set_ie() and wmi_update_ft_ies() is called with src == NULL and len == 0. This is an undefined behavior. Fix it by checking "ie_len > 0" before the memcpy() calls. As suggested by GCC documentation: "The pointers passed to memmove (and similar functions in ) must be non-null even when nbytes==0, so GCC can use that information to remove the check after the memmove call." [1] [1] https://gcc.gnu.org/gcc-4.9/porting_to.html Cc: Maya Erez Cc: Kalle Valo Cc: "David S. Miller" Cc: stable@vger.kernel.org Signed-off-by: Denis Efremov Signed-off-by: Kalle Valo Signed-off-by: Greg Kroah-Hartman commit 2539f282e436e345bf243dc3ab5e143a31eefa22 Author: Aleksa Sarai Date: Thu Oct 17 02:50:01 2019 +1100 cgroup: pids: use atomic64_t for pids->limit commit a713af394cf382a30dd28a1015cbe572f1b9ca75 upstream. Because pids->limit can be changed concurrently (but we don't want to take a lock because it would be needlessly expensive), use atomic64_ts instead. Fixes: commit 49b786ea146f ("cgroup: implement the PIDs subsystem") Cc: stable@vger.kernel.org # v4.3+ Signed-off-by: Aleksa Sarai Signed-off-by: Tejun Heo Signed-off-by: Greg Kroah-Hartman commit 285b07348946818dcdc17aea67427627e957be0d Author: Ming Lei Date: Sat Nov 2 16:02:15 2019 +0800 blk-mq: avoid sysfs buffer overflow with too many CPU cores commit 8962842ca5abdcf98e22ab3b2b45a103f0408b95 upstream. It is reported that sysfs buffer overflow can be triggered if the system has too many CPU cores(>841 on 4K PAGE_SIZE) when showing CPUs of hctx via /sys/block/$DEV/mq/$N/cpu_list. Use snprintf to avoid the potential buffer overflow. This version doesn't change the attribute format, and simply stops showing CPU numbers if the buffer is going to overflow. Cc: stable@vger.kernel.org Fixes: 676141e48af7("blk-mq: don't dump CPU -> hw queue map on driver load") Signed-off-by: Ming Lei Signed-off-by: Jens Axboe Signed-off-by: Greg Kroah-Hartman commit f020809b8450598ae7ae83d7f480463acf9486ac Author: David Jeffery Date: Mon Sep 16 13:15:14 2019 -0400 md: improve handling of bio with REQ_PREFLUSH in md_flush_request() commit 775d78319f1ceb32be8eb3b1202ccdc60e9cb7f1 upstream. If pers->make_request fails in md_flush_request(), the bio is lost. To fix this, pass back a bool to indicate if the original make_request call should continue to handle the I/O and instead of assuming the flush logic will push it to completion. Convert md_flush_request to return a bool and no longer calls the raid driver's make_request function. If the return is true, then the md flush logic has or will complete the bio and the md make_request call is done. If false, then the md make_request function needs to keep processing like it is a normal bio. Let the original call to md_handle_request handle any need to retry sending the bio to the raid driver's make_request function should it be needed. Also mark md_flush_request and the make_request function pointer as __must_check to issue warnings should these critical return values be ignored. 
Fixes: 2bc13b83e629 ("md: batch flush requests.") Cc: stable@vger.kernel.org # v4.19+ Cc: NeilBrown Signed-off-by: David Jeffery Reviewed-by: Xiao Ni Signed-off-by: Song Liu Signed-off-by: Greg Kroah-Hartman commit a11fab7708329fd902d721fe8f2b1b628da35de9 Author: Shengjiu Wang Date: Mon Nov 11 15:50:48 2019 +0800 ASoC: fsl_audmix: Add spin lock to protect tdms commit fe965096c9495ddcf78ec163348105e2baf8d185 upstream. Audmix supports two substreams. When the two substreams start to run, the trigger function may be called by both substreams at the same time, so priv->tdms may be updated wrongly. The expected priv->tdms is 0x3, but sometimes the result is 0x2, or 0x1. Fixes: be1df61cf06e ("ASoC: fsl: Add Audio Mixer CPU DAI driver") Signed-off-by: Shengjiu Wang Acked-by: Nicolin Chen Reviewed-by: Daniel Baluta Link: https://lore.kernel.org/r/1e706afe53fdd1fbbbc79277c48a98f8416ba873.1573458378.git.shengjiu.wang@nxp.com Signed-off-by: Mark Brown Cc: Signed-off-by: Greg Kroah-Hartman commit 9ae0611f0c55178aea77f39ee07d539556d7c6eb Author: Pawel Harlozinski Date: Tue Nov 12 14:02:36 2019 +0100 ASoC: Jack: Fix NULL pointer dereference in snd_soc_jack_report commit 8f157d4ff039e03e2ed4cb602eeed2fd4687a58f upstream. Check for existence of jack before tracing. NULL pointer dereference has been reported by KASAN while unloading machine driver (snd_soc_cnl_rt274). Signed-off-by: Pawel Harlozinski Link: https://lore.kernel.org/r/20191112130237.10141-1-pawel.harlozinski@linux.intel.com Signed-off-by: Mark Brown Cc: stable@vger.kernel.org Signed-off-by: Greg Kroah-Hartman commit 560025a0b565b6ae2d98ff10d2030eeb798e9fb7 Author: Jacob Rasmussen Date: Thu Nov 14 16:20:11 2019 -0700 ASoC: rt5645: Fixed typo for buddy jack support. commit fe23be2d85b05f561431d75acddec726ea807d2a upstream. Had a typo in e7cfd867fd98 that resulted in buddy jack support not being fixed. Fixes: e7cfd867fd98 ("ASoC: rt5645: Fixed buddy jack support.") Signed-off-by: Jacob Rasmussen Reviewed-by: Ross Zwisler Cc: CC: stable@vger.kernel.org Link: https://lore.kernel.org/r/20191114232011.165762-1-jacobraz@google.com Signed-off-by: Mark Brown Signed-off-by: Greg Kroah-Hartman commit bb949b530cd76435c0fc743ad0c24758971bfd1c Author: Jacob Rasmussen Date: Mon Nov 11 11:59:57 2019 -0700 ASoC: rt5645: Fixed buddy jack support. commit e7cfd867fd9842f346688f28412eb83dec342900 upstream. The headphone jack on buddy was broken with the following commit: commit 6b5da66322c5 ("ASoC: rt5645: read jd1_1 status for jd detection"). This changes the jd_mode for buddy to 4 so buddy can read from the same register that was used in the working version of this driver without affecting any other devices that might use this, since no other device uses jd_mode = 4. To test this I plugged and unplugged the headphone jack, verifying audio works. Signed-off-by: Jacob Rasmussen Reviewed-by: Ross Zwisler Link: https://lore.kernel.org/r/20191111185957.217244-1-jacobraz@google.com Signed-off-by: Mark Brown Cc: stable@vger.kernel.org Signed-off-by: Greg Kroah-Hartman commit 470e77ea879585a8a2e8f264877b113d2b68074f Author: Tejun Heo Date: Wed Sep 25 06:59:15 2019 -0700 workqueue: Fix pwq ref leak in rescuer_thread() commit e66b39af00f426b3356b96433d620cb3367ba1ff upstream. 008847f66c3 ("workqueue: allow rescuer thread to do more work.") made the rescuer worker requeue the pwq immediately if there may be more work items which need rescuing instead of waiting for the next mayday timer expiration.
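For the fsl_audmix fix above, the essence is serializing the read-modify-write of priv->tdms from the two substreams' trigger callbacks. A hedged sketch with a placeholder structure, not the actual driver code:

  #include <linux/bits.h>
  #include <linux/spinlock.h>

  struct example_audmix_priv {
      spinlock_t lock;    /* protects tdms */
      u8 tdms;
  };

  static void example_audmix_trigger_start(struct example_audmix_priv *priv,
                                           unsigned int tdm_index)
  {
      unsigned long flags;

      spin_lock_irqsave(&priv->lock, flags);
      priv->tdms |= BIT(tdm_index);   /* with both streams running this ends up 0x3 */
      spin_unlock_irqrestore(&priv->lock, flags);
  }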
Unfortunately, it doesn't check whether the pwq is already on the mayday list and unconditionally gets the ref and moves it onto the list. This doesn't corrupt the list but creates an additional reference to the pwq. It got queued twice but will only be removed once. This leak later can trigger pwq refcnt warning on workqueue destruction and prevent freeing of the workqueue. Signed-off-by: Tejun Heo Cc: "Williams, Gerald S" Cc: NeilBrown Cc: stable@vger.kernel.org # v3.19+ Signed-off-by: Greg Kroah-Hartman commit 20caa355f3d4dfd0e5725947d5e4c501c99ed972 Author: Tejun Heo Date: Wed Sep 18 18:43:40 2019 -0700 workqueue: Fix spurious sanity check failures in destroy_workqueue() commit def98c84b6cdf2eeea19ec5736e90e316df5206b upstream. Before actually destroying a workqueue, destroy_workqueue() checks whether it's actually idle. If it isn't, it prints out a bunch of warning messages and leaves the workqueue dangling. It unfortunately has a couple of issues. * Mayday list queueing increments the pwq's refcnt, which gets detected as busy and fails the sanity checks. However, because mayday list queueing is asynchronous, this condition can happen without any actual work items left in the workqueue. * Sanity check failure leaves the sysfs interface behind too which can lead to init failure of newer instances of the workqueue. This patch fixes the above two by * If a workqueue has a rescuer, disable and kill the rescuer before sanity checks. Disabling and killing is guaranteed to flush the existing mayday list. * Remove sysfs interface before sanity checks. Signed-off-by: Tejun Heo Reported-by: Marcin Pawlowski Reported-by: "Williams, Gerald S" Cc: stable@vger.kernel.org Signed-off-by: Greg Kroah-Hartman commit fca436251d1f5e177a2cd6fc9d2867483a0f0afd Author: Dmitry Fomichev Date: Wed Nov 6 14:34:35 2019 -0800 dm zoned: reduce overhead of backing device checks commit e7fad909b68aa37470d9f2d2731b5bec355ee5d6 upstream. Commit 75d66ffb48efb3 added backing device health checks and as a part of these checks, check_events() block ops template call is invoked in dm-zoned mapping path as well as in reclaim and flush path. Calling check_events() with ATA or SCSI backing devices introduces a blocking scsi_test_unit_ready() call being made in sd_check_events(). Even though the overhead of calling scsi_test_unit_ready() is small for ATA zoned devices, it is much larger for SCSI and it affects performance in a very negative way. Fix this performance regression by executing check_events() only in case of any I/O errors. The function dmz_bdev_is_dying() is modified to call only blk_queue_dying(), while calls to check_events() are made in a new helper function, dmz_check_bdev(). Reported-by: zhangxiaoxu Fixes: 75d66ffb48efb3 ("dm zoned: properly handle backing device failure") Cc: stable@vger.kernel.org Signed-off-by: Dmitry Fomichev Signed-off-by: Mike Snitzer Signed-off-by: Greg Kroah-Hartman commit 26fe6306244cf5f979bf1698211eaa06ed3e7082 Author: Maged Mokhtar Date: Wed Oct 23 22:41:17 2019 +0200 dm writecache: handle REQ_FUA commit c1005322ff02110a4df7f0033368ea015062b583 upstream. Call writecache_flush() on REQ_FUA in writecache_map(). Cc: stable@vger.kernel.org # 4.18+ Signed-off-by: Maged Mokhtar Acked-by: Mikulas Patocka Signed-off-by: Mike Snitzer Signed-off-by: Greg Kroah-Hartman commit e8f0102ddfbf0bfd850924b3fdeeaaaef78a7561 Author: Sumit Garg Date: Mon Oct 14 17:32:45 2019 +0530 hwrng: omap - Fix RNG wait loop timeout commit be867f987a4e1222114dd07a01838a17c26f3fff upstream.
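An abstracted sketch of the rescuer_thread() leak fix described above: take the extra reference and requeue only if the item is not already on the mayday list. The types below are stand-ins; the real code lives in kernel/workqueue.c and operates on pool_workqueue.

  #include <linux/kref.h>
  #include <linux/list.h>

  struct example_pwq {
      struct kref ref;
      struct list_head mayday_node;   /* list_del_init()ed when serviced */
  };

  static void example_queue_mayday(struct example_pwq *pwq,
                                   struct list_head *maydays)
  {
      /* The fix: do not grab another ref if the entry is already queued,
       * otherwise that extra reference is never dropped. */
      if (list_empty(&pwq->mayday_node)) {
          kref_get(&pwq->ref);
          list_add_tail(&pwq->mayday_node, maydays);
      }
  }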
Existing RNG data read timeout is 200us but it doesn't cover EIP76 RNG data rate which takes approx. 700us to produce 16 bytes of output data as per testing results. So configure the timeout as 1000us to also take account of lack of udelay()'s reliability. Fixes: 383212425c92 ("hwrng: omap - Add device variant for SafeXcel IP-76 found in Armada 8K") Cc: Signed-off-by: Sumit Garg Signed-off-by: Herbert Xu Signed-off-by: Greg Kroah-Hartman commit 82a0e257342b8c9831c0e5e3610be5f4284bab46 Author: Amir Goldstein Date: Fri Dec 6 08:33:36 2019 +0200 ovl: relax WARN_ON() on rename to self commit 6889ee5a53b8d969aa542047f5ac8acdc0e79a91 upstream. In ovl_rename(), if new upper is hardlinked to old upper underneath overlayfs before upper dirs are locked, user will get an ESTALE error and a WARN_ON will be printed. Changes to underlying layers while overlayfs is mounted may result in unexpected behavior, but it shouldn't crash the kernel and it shouldn't trigger WARN_ON() either, so relax this WARN_ON(). Reported-by: syzbot+bb1836a212e69f8e201a@syzkaller.appspotmail.com Fixes: 804032fabb3b ("ovl: don't check rename to self") Cc: # v4.9+ Signed-off-by: Amir Goldstein Signed-off-by: Miklos Szeredi Signed-off-by: Greg Kroah-Hartman commit f96384a621ee04df8bf5b879c81d16dcabbd8248 Author: Amir Goldstein Date: Sun Nov 17 17:43:44 2019 +0200 ovl: fix corner case of non-unique st_dev;st_ino commit 9c6d8f13e9da10a26ad7f0a020ef86e8ef142835 upstream. On non-samefs overlay without xino, non pure upper inodes should use a pseudo_dev assigned to each unique lower fs and pure upper inodes use the real upper st_dev. It is fine for an overlay pure upper inode to use the same st_dev;st_ino values as the real upper inode, because the content of those two different filesystem objects is always the same. In this case, however: - two filesystems, A and B - upper layer is on A - lower layer 1 is also on A - lower layer 2 is on B Non pure upper overlay inode, whose origin is in layer 1 will have the same st_dev;st_ino values as the real lower inode. This may result with a false positive results of 'diff' between the real lower and copied up overlay inode. Fix this by using the upper st_dev;st_ino values in this case. This breaks the property of constant st_dev;st_ino across copy up of this case. This breakage will be fixed by a later patch. Fixes: 5148626b806a ("ovl: allocate anon bdev per unique lower fs") Cc: stable@vger.kernel.org # v4.17+ Signed-off-by: Amir Goldstein Signed-off-by: Miklos Szeredi Signed-off-by: Greg Kroah-Hartman commit 84514aa3c06f2fc955a6dc654346272a1900dee6 Author: Amir Goldstein Date: Thu Nov 14 22:28:41 2019 +0200 ovl: fix lookup failure on multi lower squashfs commit 7e63c87fc2dcf3be9d3aab82d4a0ea085880bdca upstream. In the past, overlayfs required that lower fs have non null uuid in order to support nfs export and decode copy up origin file handles. Commit 9df085f3c9a2 ("ovl: relax requirement for non null uuid of lower fs") relaxed this requirement for nfs export support, as long as uuid (even if null) is unique among all lower fs. However, said commit unintentionally also relaxed the non null uuid requirement for decoding copy up origin file handles, regardless of the unique uuid requirement. Amend this mistake by disabling decoding of copy up origin file handle from lower fs with a conflicting uuid. We still encode copy up origin file handles from those fs, because file handles like those already exist in the wild and because they might provide useful information in the future. 
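The omap-rng fix above is purely a timeout-budget change: roughly 700us for 16 bytes means a 200us poll window can never succeed, so the wait loop gets a 1000us budget. A sketch using readl_poll_timeout(), with hypothetical register and bit names (not the real omap-rng definitions):

  #include <linux/bits.h>
  #include <linux/iopoll.h>

  #define EXAMPLE_RNG_STATUS           0x04      /* hypothetical register */
  #define EXAMPLE_RNG_DATA_READY       BIT(0)    /* hypothetical bit */
  #define EXAMPLE_RNG_READ_TIMEOUT_US  1000      /* raised from 200us */

  static int example_rng_wait_ready(void __iomem *base)
  {
      u32 val;

      /* Poll every 10us, give up after 1000us. */
      return readl_poll_timeout(base + EXAMPLE_RNG_STATUS, val,
                                val & EXAMPLE_RNG_DATA_READY,
                                10, EXAMPLE_RNG_READ_TIMEOUT_US);
  }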
There is an unhandled corner case described by Miklos this way: - two filesystems, A and B, both have null uuid - upper layer is on A - lower layer 1 is also on A - lower layer 2 is on B In this case bad_uuid won't be set for B, because the check only involves the list of lower fs. Hence we'll try to decode a layer 2 origin on layer 1 and fail. We will deal with this corner case later. Reported-by: Colin Ian King Tested-by: Colin Ian King Link: https://lore.kernel.org/lkml/20191106234301.283006-1-colin.king@canonical.com/ Fixes: 9df085f3c9a2 ("ovl: relax requirement for non null uuid ...") Cc: stable@vger.kernel.org # v4.20+ Signed-off-by: Amir Goldstein Signed-off-by: Miklos Szeredi Signed-off-by: Greg Kroah-Hartman commit 9b7935f72f9be674d2177c395f3cfb62283dc97e Author: Greg Kroah-Hartman Date: Fri Dec 6 16:26:00 2019 +0100 lib: raid6: fix awk build warnings commit 702600eef73033ddd4eafcefcbb6560f3e3a90f7 upstream. Newer versions of awk spit out these fun warnings: awk: ../lib/raid6/unroll.awk:16: warning: regexp escape sequence `\#' is not a known regexp operator As commit 700c1018b86d ("x86/insn: Fix awk regexp warnings") showed, it turns out that there are a number of awk strings that do not need to be escaped and newer versions of awk now warn about this. Fix the string up so that no warning is produced. The exact same kernel module gets created before and after this patch, showing that it wasn't needed. Link: https://lore.kernel.org/r/20191206152600.GA75093@kroah.com Signed-off-by: Greg Kroah-Hartman commit 6422173dd8ad3003de54e3d0a1aad403ef05574e Author: Larry Finger Date: Mon Nov 11 13:40:46 2019 -0600 rtlwifi: rtl8192de: Fix missing enable interrupt flag commit 330bb7117101099c687e9c7f13d48068670b9c62 upstream. In commit 38506ecefab9 ("rtlwifi: rtl_pci: Start modification for new drivers"), the flag that indicates that interrupts are enabled was never set. In addition, there are several places when enable/disable interrupts were commented out are restored. A sychronize_interrupts() call is removed. Fixes: 38506ecefab9 ("rtlwifi: rtl_pci: Start modification for new drivers") Cc: Stable # v3.18+ Signed-off-by: Larry Finger Signed-off-by: Kalle Valo Signed-off-by: Greg Kroah-Hartman commit ca754b3c4d2272df9ee568e04beb46f745e61806 Author: Larry Finger Date: Mon Nov 11 13:40:45 2019 -0600 rtlwifi: rtl8192de: Fix missing callback that tests for hw release of buffer commit 3155db7613edea8fb943624062baf1e4f9cfbfd6 upstream. In commit 38506ecefab9 ("rtlwifi: rtl_pci: Start modification for new drivers"), a callback needed to check if the hardware has released a buffer indicating that a DMA operation is completed was not added. Fixes: 38506ecefab9 ("rtlwifi: rtl_pci: Start modification for new drivers") Cc: Stable # v3.18+ Signed-off-by: Larry Finger Signed-off-by: Kalle Valo Signed-off-by: Greg Kroah-Hartman commit d21a09d5811befdee74513be86a25532a2bea2e6 Author: Larry Finger Date: Mon Nov 11 13:40:44 2019 -0600 rtlwifi: rtl8192de: Fix missing code to retrieve RX buffer address commit 0e531cc575c4e9e3dd52ad287b49d3c2dc74c810 upstream. In commit 38506ecefab9 ("rtlwifi: rtl_pci: Start modification for new drivers"), a callback to get the RX buffer address was added to the PCI driver. Unfortunately, driver rtl8192de was not modified appropriately and the code runs into a WARN_ONCE() call. The use of an incorrect array is also fixed. 
Fixes: 38506ecefab9 ("rtlwifi: rtl_pci: Start modification for new drivers") Cc: Stable # 3.18+ Signed-off-by: Larry Finger Signed-off-by: Kalle Valo Signed-off-by: Greg Kroah-Hartman commit cab5f4c6fdbde86056f8c98e580cc002175bb242 Author: Josef Bacik Date: Fri Nov 15 15:43:06 2019 -0500 btrfs: record all roots for rename exchange on a subvol commit 3e1740993e43116b3bc71b0aad1e6872f6ccf341 upstream. Testing with the new fsstress support for subvolumes uncovered a pretty bad problem with rename exchange on subvolumes. We're modifying two different subvolumes, but we only start the transaction on one of them, so the other one is not added to the dirty root list. This is caught by btrfs_cow_block() with a warning because the root has not been updated, however if we do not modify this root again we'll end up pointing at an invalid root because the root item is never updated. Fix this by making sure we add the destination root to the trans list, the same as we do with normal renames. This fixes the corruption. Fixes: cdd1fedf8261 ("btrfs: add support for RENAME_EXCHANGE and RENAME_WHITEOUT") CC: stable@vger.kernel.org # 4.9+ Reviewed-by: Filipe Manana Signed-off-by: Josef Bacik Signed-off-by: David Sterba Signed-off-by: Greg Kroah-Hartman commit cb7c10c675e8844a13b8168a0c5ae428a5dc8199 Author: Filipe Manana Date: Wed Oct 30 12:23:01 2019 +0000 Btrfs: send, skip backreference walking for extents with many references commit fd0ddbe2509568b00df364156f47561e9f469f15 upstream. Backreference walking, which is used by send to figure out if it can issue clone operations instead of write operations, can be very slow and use too much memory when extents have many references. This change simply skips backreference walking when an extent has more than 64 references, in which case we fall back to a write operation instead of a clone operation. This limit is conservative and in practice I observed no significant slowdown with up to 100 references and still low memory usage up to that limit. This is a temporary workaround until there are speedups in the backref walking code, and as such it does not attempt to add extra interfaces or knobs to tweak the threshold. Reported-by: Atemu Link: https://lore.kernel.org/linux-btrfs/CAE4GHgkvqVADtS4AzcQJxo0Q1jKQgKaW3JGp3SGdoinVo=C9eQ@mail.gmail.com/T/#me55dc0987f9cc2acaa54372ce0492c65782be3fa CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Qu Wenruo Signed-off-by: Filipe Manana Signed-off-by: David Sterba Signed-off-by: Greg Kroah-Hartman commit 6951a31e551e7664c90b14fd49afdca57fda1272 Author: Qu Wenruo Date: Thu Oct 24 09:38:29 2019 +0800 btrfs: Remove btrfs_bio::flags member commit 34b127aecd4fe8e6a3903e10f204a7b7ffddca22 upstream. The last user of btrfs_bio::flags was removed in commit 326e1dbb5736 ("block: remove management of bi_remaining when restoring original bi_end_io"), remove it. (Tagged for stable as the structure is heavily used and space savings are desirable.) CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Qu Wenruo Reviewed-by: David Sterba Signed-off-by: David Sterba Signed-off-by: Greg Kroah-Hartman commit 6c2fb7a5aa87698c57b8454ea1630c5a50229f7b Author: Tejun Heo Date: Thu Oct 3 07:27:13 2019 -0700 btrfs: Avoid getting stuck during cyclic writebacks commit f7bddf1e27d18fbc7d3e3056ba449cfbe4e20b0a upstream. During a cyclic writeback, extent_write_cache_pages() uses done_index to update the writeback_index after the current run is over. However, instead of current index + 1, it gets set to the current index itself.
Unfortunately, this, combined with returning on EOF instead of looping back, can lead to the following pathological behavior. 1. There is a single file which has accumulated enough dirty pages to trigger balance_dirty_pages() and the writer appending to the file with a series of short writes. 2. balance_dirty_pages kicks in, wakes up background writeback and sleeps. 3. Writeback kicks in and the cursor is on the last page of the dirty file. Writeback is started or skipped if already in progress. As it's EOF, extent_write_cache_pages() returns and the cursor is set to done_index which is pointing to the last page. 4. Writeback is done. Nothing happens till balance_dirty_pages finishes, at which point we go back to #1. This can almost completely stall out writing back of the file and keep the system over dirty threshold for a long time which can mess up the whole system. We encountered this issue in production with a package handling application which can reliably reproduce the issue when running under tight memory limits. Reading the comment in the error handling section, this seems to be to avoid accidentally skipping a page in case the write attempt on the page doesn't succeed. However, this concern seems bogus. On each page, the code either: * Skips and moves onto the next page. * Fails issue and sets done_index to index + 1. * Successfully issues and continues to the next page if budget allows and not EOF. IOW, as long as it's not EOF and there's budget, the code never retries writing back the same page. Only when a page happens to be the last page of a particular run, we end up retrying the page, which can't possibly guarantee anything data integrity related. Besides, cyclic writes are only used for non-syncing writebacks meaning that there's no data integrity implication to begin with. Fix it by always setting done_index past the current page being processed. Note that this problem exists in other writepages too. CC: stable@vger.kernel.org # 4.19+ Signed-off-by: Tejun Heo Reviewed-by: David Sterba Signed-off-by: David Sterba Signed-off-by: Greg Kroah-Hartman commit b24ec1e6b6f03365eb716f3ecccf5c739dcabbdb Author: Filipe Manana Date: Fri Oct 11 16:41:20 2019 +0100 Btrfs: fix negative subv_writers counter and data space leak after buffered write commit a0e248bb502d5165b3314ac3819e888fdcdf7d9f upstream. When doing a buffered write it's possible to leave the subv_writers counter of the root, used for synchronization between buffered nocow writers and snapshotting, with a negative value.
This happens in an exceptional case like the following: 1) We fail to allocate data space for the write, since there's not enough available data space nor enough unallocated space for allocating a new data block group; 2) Because of that failure, we try to go to NOCOW mode, which succeeds and therefore we set the local variable 'only_release_metadata' to true and set the root's subv_writers counter to 1 through the call to btrfs_start_write_no_snapshotting() made by check_can_nocow(); 3) The call to btrfs_copy_from_user() returns zero, which is very unlikely to happen but not impossible; 4) No pages are copied because btrfs_copy_from_user() returned zero; 5) We call btrfs_end_write_no_snapshotting() which decrements the root's subv_writers counter to 0; 6) We don't set 'only_release_metadata' back to 'false' because we do it only if 'copied', the value returned by btrfs_copy_from_user(), is greater than zero; 7) On the next iteration of the while loop, which processes the same page range, we are now able to allocate data space for the write (we got enough data space released in the meanwhile); 8) After this, if we fail at btrfs_delalloc_reserve_metadata(), because now there isn't enough free metadata space, or in some other place further below (prepare_pages(), lock_and_cleanup_extent_if_need(), btrfs_dirty_pages()), we break out of the while loop with 'only_release_metadata' having a value of 'true'; 9) Because 'only_release_metadata' is 'true' we end up decrementing the root's subv_writers counter to -1 (through a call to btrfs_end_write_no_snapshotting()), and we also end up not releasing the data space previously reserved through btrfs_check_data_free_space(). As a consequence the mechanism for synchronizing NOCOW buffered writes with snapshotting gets broken. Fix this by always setting 'only_release_metadata' to false at the start of each iteration. Fixes: 8257b2dc3c1a ("Btrfs: introduce btrfs_{start, end}_nocow_write() for each subvolume") Fixes: 7ee9e4405f26 ("Btrfs: check if we can nocow if we don't have data space") CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Josef Bacik Signed-off-by: Filipe Manana Reviewed-by: David Sterba Signed-off-by: David Sterba Signed-off-by: Greg Kroah-Hartman commit 17b22f8594fa200c870caa27bfc7c6f9b8d9849e Author: Filipe Manana Date: Wed Oct 9 17:43:59 2019 +0100 Btrfs: fix metadata space leak on fixup worker failure to set range as delalloc commit 536870071dbc4278264f59c9a2f5f447e584d139 upstream. In the fixup worker, if we fail to mark the range as delalloc in the io tree, we must release the previously reserved metadata, as well as update the outstanding extents counter for the inode, otherwise we leak metadata space. In practice we can't return an error from btrfs_set_extent_delalloc(), which is just a wrapper around __set_extent_bit(), as for most errors __set_extent_bit() does a BUG_ON() (or panics which hits a BUG_ON() as well) and returning an -EEXIST error doesn't happen in this case since the exclusive bits parameter always has a value of 0 through this code path. Nevertheless, just fix the error handling in the fixup worker, in case one day __set_extent_bit() can return an error to this code path.
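The fixup-worker fix above is an unwind-on-error pattern: whatever was reserved immediately before the failing step must be released on that error path. Abstracted sketch with placeholder helpers, not the btrfs API:

  #include <linux/errno.h>

  static int  example_reserve_metadata(unsigned long bytes) { return 0; }
  static void example_release_metadata(unsigned long bytes) { }
  static int  example_set_range_delalloc(unsigned long start, unsigned long len)
  {
      return -ENOMEM;     /* pretend the io-tree update can fail */
  }

  static int example_fixup_range(unsigned long start, unsigned long len)
  {
      int ret;

      ret = example_reserve_metadata(len);
      if (ret)
          return ret;

      ret = example_set_range_delalloc(start, len);
      if (ret) {
          /* The fix: drop the reservation taken above instead of leaking it. */
          example_release_metadata(len);
          return ret;
      }

      return 0;
  }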
Fixes: f3038ee3a3f101 ("btrfs: Handle btrfs_set_extent_delalloc failure in fixup worker") CC: stable@vger.kernel.org # 4.19+ Reviewed-by: Nikolay Borisov Signed-off-by: Filipe Manana Reviewed-by: David Sterba Signed-off-by: David Sterba Signed-off-by: Greg Kroah-Hartman commit 1e8308fb3d715f0ca4010864a6039da6db51d5b2 Author: Josef Bacik Date: Thu Sep 26 08:29:32 2019 -0400 btrfs: use refcount_inc_not_zero in kill_all_nodes commit baf320b9d531f1cfbf64c60dd155ff80a58b3796 upstream. We hit the following warning while running down a different problem [ 6197.175850] ------------[ cut here ]------------ [ 6197.185082] refcount_t: underflow; use-after-free. [ 6197.194704] WARNING: CPU: 47 PID: 966 at lib/refcount.c:190 refcount_sub_and_test_checked+0x53/0x60 [ 6197.521792] Call Trace: [ 6197.526687] __btrfs_release_delayed_node+0x76/0x1c0 [ 6197.536615] btrfs_kill_all_delayed_nodes+0xec/0x130 [ 6197.546532] ? __btrfs_btree_balance_dirty+0x60/0x60 [ 6197.556482] btrfs_clean_one_deleted_snapshot+0x71/0xd0 [ 6197.566910] cleaner_kthread+0xfa/0x120 [ 6197.574573] kthread+0x111/0x130 [ 6197.581022] ? kthread_create_on_node+0x60/0x60 [ 6197.590086] ret_from_fork+0x1f/0x30 [ 6197.597228] ---[ end trace 424bb7ae00509f56 ]--- This is because the free side drops the ref without the lock, and then takes the lock if our refcount is 0. So you can have nodes on the tree that have a refcount of 0. Fix this by zero'ing out that element in our temporary array so we don't try to kill it again. CC: stable@vger.kernel.org # 4.14+ Reviewed-by: Nikolay Borisov Signed-off-by: Josef Bacik Reviewed-by: David Sterba [ add comment ] Signed-off-by: David Sterba Signed-off-by: Greg Kroah-Hartman commit d92f03395aeb3c27cbe8e5cd3bd7bf81d64a4024 Author: Josef Bacik Date: Tue Sep 24 16:50:44 2019 -0400 btrfs: use btrfs_block_group_cache_done in update_block_group commit a60adce85f4bb5c1ef8ffcebadd702cafa2f3696 upstream. When free'ing extents in a block group we check to see if the block group is not cached, and then cache it if we need to. However we'll just carry on as long as we're loading the cache. This is problematic because we are dirtying the block group here. If we are fast enough we could do a transaction commit and clear the free space cache while we're still loading the space cache in another thread. This truncates the free space inode, which will keep it from loading the space cache. Fix this by using the btrfs_block_group_cache_done helper so that we try to load the space cache unconditionally here, which will result in the caller waiting for the fast caching to complete and keep us from truncating the free space inode. CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Josef Bacik Reviewed-by: Nikolay Borisov Signed-off-by: David Sterba Signed-off-by: Greg Kroah-Hartman commit 3c821cc5edf9f53753cc70273102fe338f51dca5 Author: Josef Bacik Date: Tue Sep 24 16:50:43 2019 -0400 btrfs: check page->mapping when loading free space cache commit 3797136b626ad4b6582223660c041efdea8f26b2 upstream. While testing 5.2 we ran into the following panic [52238.017028] BUG: kernel NULL pointer dereference, address: 0000000000000001 [52238.105608] RIP: 0010:drop_buffers+0x3d/0x150 [52238.304051] Call Trace: [52238.308958] try_to_free_buffers+0x15b/0x1b0 [52238.317503] shrink_page_list+0x1164/0x1780 [52238.325877] shrink_inactive_list+0x18f/0x3b0 [52238.334596] shrink_node_memcg+0x23e/0x7d0 [52238.342790] ? 
do_shrink_slab+0x4f/0x290 [52238.350648] shrink_node+0xce/0x4a0 [52238.357628] balance_pgdat+0x2c7/0x510 [52238.365135] kswapd+0x216/0x3e0 [52238.371425] ? wait_woken+0x80/0x80 [52238.378412] ? balance_pgdat+0x510/0x510 [52238.386265] kthread+0x111/0x130 [52238.392727] ? kthread_create_on_node+0x60/0x60 [52238.401782] ret_from_fork+0x1f/0x30 The page we were trying to drop had a page->private, but had no page->mapping and so called drop_buffers, assuming that we had a buffer_head on the page, and then panic'ed trying to deref 1, which is our page->private for data pages. This is happening because we're truncating the free space cache while we're trying to load the free space cache. This isn't supposed to happen, and I'll fix that in a followup patch. However we still shouldn't allow those sort of mistakes to result in messing with pages that do not belong to us. So add the page->mapping check to verify that we still own this page after dropping and re-acquiring the page lock. This page being unlocked as: btrfs_readpage extent_read_full_page __extent_read_full_page __do_readpage if (!nr) unlock_page <-- nr can be 0 only if submit_extent_page returns an error CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Filipe Manana Reviewed-by: Nikolay Borisov Signed-off-by: Josef Bacik [ add callchain ] Signed-off-by: David Sterba Signed-off-by: Greg Kroah-Hartman commit 0f16d13cb83b3d85c1be7a312236d03760cdc936 Author: Johannes Berg Date: Fri Jun 1 10:32:55 2018 +0200 iwlwifi: pcie: fix support for transmitting SKBs with fraglist commit 4f4925a7b23428d5719af5a2816586b2a0e6fd19 upstream. When the implementation of SKBs with fraglist was sent upstream, a merge-damage occurred and half the patch was not applied. This causes problems in high-throughput situations with AX200 devices, including low throughput and FW crashes. Introduce the part that was missing from the original patch. Fixes: 0044f1716c4d ("iwlwifi: pcie: support transmitting SKBs with fraglist") Cc: stable@vger.kernel.org # 4.20+ Signed-off-by: Johannes Berg [ This patch was created by me, but the original author of this code is Johannes, so his s-o-b is here and he's marked as the author of the patch. ] Signed-off-by: Luca Coelho Signed-off-by: Greg Kroah-Hartman commit cbf3de66565a3efad489cc9adb54c66d13bfe36d Author: Wen Yang Date: Tue Nov 26 22:04:52 2019 +0800 usb: typec: fix use after free in typec_register_port() commit 5c388abefda0d92355714010c0199055c57ab6c7 upstream. We can't use "port->sw" and/or "port->mux" after it has been freed. Fixes: 23481121c81d ("usb: typec: class: Don't use port parent for getting mux handles") Signed-off-by: Wen Yang Cc: stable Cc: linux-usb@vger.kernel.org Cc: linux-kernel@vger.kernel.org Acked-by: Heikki Krogerus  Link: https://lore.kernel.org/r/20191126140452.14048-1-wenyang@linux.alibaba.com Signed-off-by: Greg Kroah-Hartman commit 7d01bc8c1ac8d79f423ed96ba357c63b08aaa338 Author: Yoshihiro Shimoda Date: Mon Oct 7 16:55:10 2019 +0900 phy: renesas: rcar-gen3-usb2: Fix sysfs interface of "role" commit 4bd5ead82d4b877ebe41daf95f28cda53205b039 upstream. Since the role_store() uses strncmp(), it's possible to refer out-of-memory if the sysfs data size is smaller than strlen("host"). This patch fixes it by using sysfs_streq() instead of strncmp(). 
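As a rough illustration of the role_store() fix just described: sysfs_streq() and the attribute boilerplate are real kernel interfaces (sysfs_streq() tolerates the trailing newline that sysfs writes usually carry); the role setter and the accepted values are illustrative.

#include <linux/device.h>
#include <linux/string.h>

/* Hypothetical driver hook standing in for the real role-swap call. */
extern int demo_set_role(struct device *dev, bool host);

static ssize_t role_store(struct device *dev, struct device_attribute *attr,
                          const char *buf, size_t count)
{
        int ret;

        /* Unlike a strncmp("host", buf, ...) test, sysfs_streq() matches
         * "host" and "host\n" exactly and never depends on the length of
         * the user-supplied buffer. */
        if (sysfs_streq(buf, "host"))
                ret = demo_set_role(dev, true);
        else if (sysfs_streq(buf, "peripheral"))
                ret = demo_set_role(dev, false);
        else
                return -EINVAL;

        return ret ? ret : count;
}
static DEVICE_ATTR_WO(role);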
Reported-by: Pavel Machek Fixes: 9bb86777fb71 ("phy: rcar-gen3-usb2: add sysfs for usb role swap") Cc: # v4.10+ Signed-off-by: Yoshihiro Shimoda Reviewed-by: Geert Uytterhoeven Acked-by: Pavel Machek Signed-off-by: Kishon Vijay Abraham I Signed-off-by: Greg Kroah-Hartman commit e4dfa5e147283b4f27d1dd00b17d2544055c67da Author: Thinh Nguyen Date: Wed Nov 27 13:10:54 2019 -0800 usb: dwc3: ep0: Clear started flag on completion commit 2d7b78f59e020b07fc6338eefe286f54ee2d6773 upstream. Clear ep0's DWC3_EP_TRANSFER_STARTED flag if the END_TRANSFER command is completed. Otherwise, we can't start control transfer again after END_TRANSFER. Cc: stable@vger.kernel.org Signed-off-by: Thinh Nguyen Signed-off-by: Felipe Balbi Signed-off-by: Greg Kroah-Hartman commit 54f027a46b47d1f91c5a52141b33785f3506e147 Author: Thinh Nguyen Date: Wed Nov 27 13:10:47 2019 -0800 usb: dwc3: gadget: Clear started flag for non-IOC commit d3abda5a98a18e524e17fd4085c9f4bd53e9ef53 upstream. Normally the END_TRANSFER command completion handler will clear the DWC3_EP_TRANSFER_STARTED flag. However, if the command was sent without interrupt on completion, then the flag will not be cleared. Make sure to clear the flag in this case. Cc: stable@vger.kernel.org Signed-off-by: Thinh Nguyen Signed-off-by: Felipe Balbi Signed-off-by: Greg Kroah-Hartman commit a7f7e61270f1676517c2f6f2903317f77f122f15 Author: Tejas Joglekar Date: Wed Nov 13 11:45:16 2019 +0530 usb: dwc3: gadget: Fix logical condition commit 8c7d4b7b3d43c54c0b8c1e4adb917a151c754196 upstream. This patch corrects the condition to kick the transfer without giving back the requests when either request has remaining data or when there are pending SGs. The && check was introduced during spliting up the dwc3_gadget_ep_cleanup_completed_requests() function. Fixes: f38e35dd84e2 ("usb: dwc3: gadget: split dwc3_gadget_ep_cleanup_completed_requests()") Cc: stable@vger.kernel.org Signed-off-by: Tejas Joglekar Signed-off-by: Felipe Balbi Signed-off-by: Greg Kroah-Hartman commit 1dcdfe49066835aa08e31570cf64223f28cb6ed0 Author: Heikki Krogerus Date: Thu Dec 12 12:37:13 2019 +0300 usb: dwc3: pci: add ID for the Intel Comet Lake -H variant commit 3c3caae4cd6e122472efcf64759ff6392fb6bce2 upstream. The original ID that was added for Comet Lake PCH was actually for the -LP (low power) variant even though the constant for it said CMLH. Changing that while at it. Signed-off-by: Heikki Krogerus Acked-by: Felipe Balbi Cc: stable Link: https://lore.kernel.org/r/20191212093713.60614-1-heikki.krogerus@linux.intel.com Signed-off-by: Greg Kroah-Hartman commit cc3b0930f209a9ff8c6f16842b72093fd7d78f3f Author: David Hildenbrand Date: Wed Dec 11 12:11:52 2019 +0100 virtio-balloon: fix managed page counts when migrating pages between zones commit 63341ab03706e11a31e3dd8ccc0fbc9beaf723f0 upstream. In case we have to migrate a ballon page to a newpage of another zone, the managed page count of both zones is wrong. Paired with memory offlining (which will adjust the managed page count), we can trigger kernel crashes and all kinds of different symptoms. One way to reproduce: 1. Start a QEMU guest with 4GB, no NUMA 2. Hotplug a 1GB DIMM and online the memory to ZONE_NORMAL 3. Inflate the balloon to 1GB 4. Unplug the DIMM (be quick, otherwise unmovable data ends up on it) 5. Observe /proc/zoneinfo Node 0, zone Normal pages free 16810 min 24848885473806 low 18471592959183339 high 36918337032892872 spanned 262144 present 262144 managed 18446744073709533486 6. 
Do anything that requires some memory (e.g., inflate the balloon some more). The OOM goes crazy and the system crashes [ 238.324946] Out of memory: Killed process 537 (login) total-vm:27584kB, anon-rss:860kB, file-rss:0kB, shmem-rss:00 [ 238.338585] systemd invoked oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0 [ 238.339420] CPU: 0 PID: 1 Comm: systemd Tainted: G D W 5.4.0-next-20191204+ #75 [ 238.340139] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu4 [ 238.341121] Call Trace: [ 238.341337] dump_stack+0x8f/0xd0 [ 238.341630] dump_header+0x61/0x5ea [ 238.341942] oom_kill_process.cold+0xb/0x10 [ 238.342299] out_of_memory+0x24d/0x5a0 [ 238.342625] __alloc_pages_slowpath+0xd12/0x1020 [ 238.343024] __alloc_pages_nodemask+0x391/0x410 [ 238.343407] pagecache_get_page+0xc3/0x3a0 [ 238.343757] filemap_fault+0x804/0xc30 [ 238.344083] ? ext4_filemap_fault+0x28/0x42 [ 238.344444] ext4_filemap_fault+0x30/0x42 [ 238.344789] __do_fault+0x37/0x1a0 [ 238.345087] __handle_mm_fault+0x104d/0x1ab0 [ 238.345450] handle_mm_fault+0x169/0x360 [ 238.345790] do_user_addr_fault+0x20d/0x490 [ 238.346154] do_page_fault+0x31/0x210 [ 238.346468] async_page_fault+0x43/0x50 [ 238.346797] RIP: 0033:0x7f47eba4197e [ 238.347110] Code: Bad RIP value. [ 238.347387] RSP: 002b:00007ffd7c0c1890 EFLAGS: 00010293 [ 238.347834] RAX: 0000000000000002 RBX: 000055d196a20a20 RCX: 00007f47eba4197e [ 238.348437] RDX: 0000000000000033 RSI: 00007ffd7c0c18c0 RDI: 0000000000000004 [ 238.349047] RBP: 00007ffd7c0c1c20 R08: 0000000000000000 R09: 0000000000000033 [ 238.349660] R10: 00000000ffffffff R11: 0000000000000293 R12: 0000000000000001 [ 238.350261] R13: ffffffffffffffff R14: 0000000000000000 R15: 00007ffd7c0c18c0 [ 238.350878] Mem-Info: [ 238.351085] active_anon:3121 inactive_anon:51 isolated_anon:0 [ 238.351085] active_file:12 inactive_file:7 isolated_file:0 [ 238.351085] unevictable:0 dirty:0 writeback:0 unstable:0 [ 238.351085] slab_reclaimable:5565 slab_unreclaimable:10170 [ 238.351085] mapped:3 shmem:111 pagetables:155 bounce:0 [ 238.351085] free:720717 free_pcp:2 free_cma:0 [ 238.353757] Node 0 active_anon:12484kB inactive_anon:204kB active_file:48kB inactive_file:28kB unevictable:0kB iss [ 238.355979] Node 0 DMA free:11556kB min:36kB low:48kB high:60kB reserved_highatomic:0KB active_anon:152kB inactivB [ 238.358345] lowmem_reserve[]: 0 2955 2884 2884 2884 [ 238.358761] Node 0 DMA32 free:2677864kB min:7004kB low:10028kB high:13052kB reserved_highatomic:0KB active_anon:0B [ 238.361202] lowmem_reserve[]: 0 0 72057594037927865 72057594037927865 72057594037927865 [ 238.361888] Node 0 Normal free:193448kB min:99395541895224kB low:73886371836733356kB high:147673348131571488kB reB [ 238.364765] lowmem_reserve[]: 0 0 0 0 0 [ 238.365101] Node 0 DMA: 7*4kB (U) 5*8kB (UE) 6*16kB (UME) 2*32kB (UM) 1*64kB (U) 2*128kB (UE) 3*256kB (UME) 2*512B [ 238.366379] Node 0 DMA32: 0*4kB 1*8kB (U) 2*16kB (UM) 2*32kB (UM) 2*64kB (UM) 1*128kB (U) 1*256kB (U) 1*512kB (U)B [ 238.367654] Node 0 Normal: 1985*4kB (UME) 1321*8kB (UME) 844*16kB (UME) 524*32kB (UME) 300*64kB (UME) 138*128kB (B [ 238.369184] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB [ 238.369915] 130 total pagecache pages [ 238.370241] 0 pages in swap cache [ 238.370533] Swap cache stats: add 0, delete 0, find 0/0 [ 238.370981] Free swap = 0kB [ 238.371239] Total swap = 0kB [ 238.371488] 1048445 pages RAM [ 238.371756] 0 pages HighMem/MovableOnly [ 238.372090] 306992 pages 
reserved [ 238.372376] 0 pages cma reserved [ 238.372661] 0 pages hwpoisoned In another instance (older kernel), I was able to observe this (negative page count :/): [ 180.896971] Offlined Pages 32768 [ 182.667462] Offlined Pages 32768 [ 184.408117] Offlined Pages 32768 [ 186.026321] Offlined Pages 32768 [ 187.684861] Offlined Pages 32768 [ 189.227013] Offlined Pages 32768 [ 190.830303] Offlined Pages 32768 [ 190.833071] Built 1 zonelists, mobility grouping on. Total pages: -36920272750453009 In another instance (older kernel), I was no longer able to start any process: [root@vm ~]# [ 214.348068] Offlined Pages 32768 [ 215.973009] Offlined Pages 32768 cat /proc/meminfo -bash: fork: Cannot allocate memory [root@vm ~]# cat /proc/meminfo -bash: fork: Cannot allocate memory Fix it by properly adjusting the managed page count when migrating if the zone changed. The managed page count of the zones now looks after unplug of the DIMM (and after deflating the balloon) just like before inflating the balloon (and plugging+onlining the DIMM). We'll temporarily modify the totalram page count. If this ever becomes a problem, we can fine tune by providing helpers that don't touch the totalram pages (e.g., adjust_zone_managed_page_count()). Please note that fixing up the managed page count is only necessary when we adjusted the managed page count when inflating - only if we don't have VIRTIO_BALLOON_F_DEFLATE_ON_OOM. With that feature, the managed page count is not touched when inflating/deflating. Reported-by: Yumei Huang Fixes: 3dcc0571cd64 ("mm: correctly update zone->managed_pages") Cc: # v3.11+ Cc: "Michael S. Tsirkin" Cc: Jason Wang Cc: Jiang Liu Cc: Andrew Morton Cc: Igor Mammedov Cc: virtualization@lists.linux-foundation.org Signed-off-by: David Hildenbrand Signed-off-by: Michael S. Tsirkin Signed-off-by: Greg Kroah-Hartman commit c511058f167c3622a5dfe9c9f295d7766d41267c Author: Taehee Yoo Date: Thu Nov 21 12:26:45 2019 +0000 virt_wifi: fix use-after-free in virt_wifi_newlink() commit bc71d8b580ba81b55b6e15b1c0320632515b4bac upstream. When virt_wifi interface is created, virt_wifi_newlink() is called and it calls register_netdevice(). if register_netdevice() fails, it internally would call ->priv_destructor(), which is virt_wifi_net_device_destructor() and it frees netdev. but virt_wifi_newlink() still use netdev. So, use-after-free would occur in virt_wifi_newlink(). Test commands: ip link add dummy0 type dummy modprobe bonding ip link add bonding_masters link dummy0 type virt_wifi Splat looks like: [ 202.220554] BUG: KASAN: use-after-free in virt_wifi_newlink+0x88b/0x9a0 [virt_wifi] [ 202.221659] Read of size 8 at addr ffff888061629cb8 by task ip/852 [ 202.222896] CPU: 1 PID: 852 Comm: ip Not tainted 5.4.0-rc5 #3 [ 202.223765] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006 [ 202.225073] Call Trace: [ 202.225532] dump_stack+0x7c/0xbb [ 202.226869] print_address_description.constprop.5+0x1be/0x360 [ 202.229362] __kasan_report+0x12a/0x16f [ 202.230714] kasan_report+0xe/0x20 [ 202.232595] virt_wifi_newlink+0x88b/0x9a0 [virt_wifi] [ 202.233370] __rtnl_newlink+0xb9f/0x11b0 [ 202.244909] rtnl_newlink+0x65/0x90 [ ... 
] Cc: stable@vger.kernel.org Fixes: c7cdba31ed8b ("mac80211-next: rtnetlink wifi simulation device") Signed-off-by: Taehee Yoo Link: https://lore.kernel.org/r/20191121122645.9355-1-ap420073@gmail.com [trim stack dump a bit] Signed-off-by: Johannes Berg Signed-off-by: Greg Kroah-Hartman commit b0adf9e2e4c049647a793747155b94a051ea8494 Author: Piotr Sroka Date: Tue Sep 24 06:54:31 2019 +0100 mtd: rawnand: Change calculating of position page containing BBM commit a3c4c2339f8948b0f578e938970303a7372e60c0 upstream. Change calculating of position page containing BBM If none of BBM flags are set then function nand_bbm_get_next_page reports EINVAL. It causes that BBM is not read at all during scanning factory bad blocks. The result is that the BBT table is build without checking factory BBM at all. For Micron flash memories none of these flags are set if page size is different than 2048 bytes. Address this regression by: - adding NAND_BBM_FIRSTPAGE chip flag without any condition. It solves issue only for Micron devices. - changing the nand_bbm_get_next_page_function. It will return 0 if no of BBM flag is set and page parameter is 0. After that modification way of discovering factory bad blocks will work similar as in kernel version 5.1. Cc: stable@vger.kernel.org Fixes: f90da7818b14 (mtd: rawnand: Support bad block markers in first, second or last page) Signed-off-by: Piotr Sroka Reviewed-by: Frieder Schrempf Signed-off-by: Miquel Raynal Signed-off-by: Greg Kroah-Hartman commit 893f4092a3b2604977f62ab9e89e89eed89af113 Author: Miquel Raynal Date: Tue Oct 22 16:58:59 2019 +0200 mtd: spear_smi: Fix Write Burst mode commit 69c7f4618c16b4678f8a4949b6bb5ace259c0033 upstream. Any write with either dd or flashcp to a device driven by the spear_smi.c driver will pass through the spear_smi_cpy_toio() function. This function will get called for chunks of up to 256 bytes. If the amount of data is smaller, we may have a problem if the data length is not 4-byte aligned. In this situation, the kernel panics during the memcpy: # dd if=/dev/urandom bs=1001 count=1 of=/dev/mtd6 spear_smi_cpy_toio [620] dest c9070000, src c7be8800, len 256 spear_smi_cpy_toio [620] dest c9070100, src c7be8900, len 256 spear_smi_cpy_toio [620] dest c9070200, src c7be8a00, len 256 spear_smi_cpy_toio [620] dest c9070300, src c7be8b00, len 233 Unhandled fault: external abort on non-linefetch (0x808) at 0xc90703e8 [...] PC is at memcpy+0xcc/0x330 The above error occurs because the implementation of memcpy_toio() tries to optimize the number of I/O by writing 4 bytes at a time as much as possible, until there are less than 4 bytes left and then switches to word or byte writes. Unfortunately, the specification states about the Write Burst mode: "the next AHB Write request should point to the next incremented address and should have the same size (byte, half-word or word)" This means ARM architecture implementation of memcpy_toio() cannot reliably be used blindly here. Workaround this situation by update the write path to stick to byte access when the burst length is not multiple of 4. Fixes: f18dbbb1bfe0 ("mtd: ST SPEAr: Add SMI driver for serial NOR flash") Cc: Russell King Cc: Boris Brezillon Cc: stable@vger.kernel.org Signed-off-by: Miquel Raynal Reviewed-by: Russell King Signed-off-by: Greg Kroah-Hartman commit e67fa7fb36b1a982cb1fae132e0703d596b5e1cf Author: Rafał Miłecki Date: Mon Nov 18 12:53:08 2019 +0100 brcmfmac: disable PCIe interrupts before bus reset commit 5d26a6a6150c486f51ea2aaab33af04db02f63b8 upstream. 
Keeping interrupts on could result in brcmfmac freeing some resources and then IRQ handlers trying to use them. That was obviously a straight path for crashing a kernel. Example: CPU0 CPU1 ---- ---- brcmf_pcie_reset brcmf_pcie_bus_console_read brcmf_detach ... brcmf_fweh_detach brcmf_proto_detach brcmf_pcie_isr_thread ... brcmf_proto_msgbuf_rx_trigger ... drvr->proto->pd brcmf_pcie_release_irq [ 363.789218] Unable to handle kernel NULL pointer dereference at virtual address 00000038 [ 363.797339] pgd = c0004000 [ 363.800050] [00000038] *pgd=00000000 [ 363.803635] Internal error: Oops: 17 [#1] SMP ARM (...) [ 364.029209] Backtrace: [ 364.031725] [] (brcmf_proto_msgbuf_rx_trigger [brcmfmac]) from [] (brcmf_pcie_isr_thread+0x228/0x274 [brcmfmac]) [ 364.043662] r7:00000001 r6:c8ca0000 r5:00010000 r4:c7b4f800 Fixes: 4684997d9eea ("brcmfmac: reset PCIe bus on a firmware crash") Cc: stable@vger.kernel.org # v5.2+ Signed-off-by: Rafał Miłecki Signed-off-by: Kalle Valo Signed-off-by: Greg Kroah-Hartman commit dc69bd239348021a6de499660189f34d9b6809c7 Author: Meng Li Date: Thu Nov 21 12:30:46 2019 -0600 EDAC/altera: Use fast register IO for S10 IRQs commit 56d9e7bd3fa0f105b6670021d167744bc50ae4fe upstream. When an IRQ occurs, regmap_{read,write,...}() is invoked in atomic context. Regmap must indicate register IO is fast so that a spinlock is used instead of a mutex to avoid sleeping in atomic context: lock_acquire __mutex_lock mutex_lock_nested regmap_lock_mutex regmap_write a10_eccmgr_irq_unmask unmask_irq.part.0 irq_enable __irq_startup irq_startup __setup_irq request_threaded_irq devm_request_threaded_irq altr_sdram_probe Mark it so. [ bp: Massage. ] Fixes: 3dab6bd52687 ("EDAC, altera: Add support for Stratix10 SDRAM EDAC") Reported-by: Meng Li Signed-off-by: Meng Li Signed-off-by: Thor Thayer Signed-off-by: Borislav Petkov Cc: James Morse Cc: linux-edac Cc: Mauro Carvalho Chehab Cc: Robert Richter Cc: stable Cc: Tony Luck Link: https://lkml.kernel.org/r/1574361048-17572-2-git-send-email-thor.thayer@linux.intel.com Signed-off-by: Greg Kroah-Hartman commit 23da547a26eb0f1a1eea0ccb640787c94505b71b Author: Hans de Goede Date: Fri Oct 25 11:14:48 2019 +0200 tpm: Switch to platform_get_irq_optional() commit 9c8c5742b6af76a3fd93b4e56d1d981173cf9016 upstream. platform_get_irq() calls dev_err() on an error. As the IRQ usage in the tpm_tis driver is optional, this is undesirable. Specifically this leads to this new false-positive error being logged: [ 5.135413] tpm_tis MSFT0101:00: IRQ index 0 not found This commit switches to platform_get_irq_optional(), which does not log an error, fixing this. Fixes: 7723f4c5ecdb ("driver core: platform: Add an error message to platform_get_irq*()" Cc: # 5.4.x Signed-off-by: Hans de Goede Reviewed-by: Jerry Snitselaar Reviewed-by: Jarkko Sakkinen Signed-off-by: Jarkko Sakkinen Signed-off-by: Greg Kroah-Hartman commit 12d9c03863e2b043092936b3a34410fda3c35215 Author: Tadeusz Struk Date: Mon Oct 7 14:46:37 2019 -0700 tpm: add check after commands attribs tab allocation commit f1689114acc5e89a196fec6d732dae3e48edb6ad upstream. devm_kcalloc() can fail and return NULL so we need to check for that. 
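The missing check is the standard allocation-failure pattern; a minimal sketch, with an illustrative attribute table rather than the real tpm2 structures:

#include <linux/device.h>
#include <linux/slab.h>
#include <linux/types.h>

struct demo_cmd_attr { u32 id; u32 flags; };    /* illustrative only */

static int demo_alloc_cc_attrs(struct device *dev, unsigned int nr_commands,
                               struct demo_cmd_attr **tbl)
{
        *tbl = devm_kcalloc(dev, nr_commands, sizeof(**tbl), GFP_KERNEL);
        if (!*tbl)              /* the check the patch adds */
                return -ENOMEM;
        return 0;
}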
Cc: stable@vger.kernel.org Fixes: 58472f5cd4f6f ("tpm: validate TPM 2.0 commands") Signed-off-by: Tadeusz Struk Reviewed-by: Jerry Snitselaar Reviewed-by: Jarkko Sakkinen Tested-by: Jarkko Sakkinen Signed-off-by: Jarkko Sakkinen Signed-off-by: Greg Kroah-Hartman commit 9e28d2e9329f4f6ddb8679a21ce77d9a40f6ad51 Author: Pete Zaitcev Date: Wed Dec 4 20:39:41 2019 -0600 usb: mon: Fix a deadlock in usbmon between mmap and read commit 19e6317d24c25ee737c65d1ffb7483bdda4bb54a upstream. The problem arises because our read() function grabs a lock of the circular buffer, finds something of interest, then invokes copy_to_user() straight from the buffer, which in turn takes mm->mmap_sem. In the same time, the callback mon_bin_vma_fault() is invoked under mm->mmap_sem. It attempts to take the fetch lock and deadlocks. This patch does away with protecting of our page list with any semaphores, and instead relies on the kernel not close the device while mmap is active in a process. In addition, we prohibit re-sizing of a buffer while mmap is active. This way, when (now unlocked) fault is processed, it works with the page that is intended to be mapped-in, and not some other random page. Note that this may have an ABI impact, but hopefully no legitimate program is this wrong. Signed-off-by: Pete Zaitcev Reported-by: syzbot+56f9673bb4cdcbeb0e92@syzkaller.appspotmail.com Reviewed-by: Alan Stern Fixes: 46eb14a6e158 ("USB: fix usbmon BUG trigger") Cc: Link: https://lore.kernel.org/r/20191204203941.3503452b@suzdal.zaitcev.lan Signed-off-by: Greg Kroah-Hartman commit 363ae48f364c86746112aaffe697236ba102d9a4 Author: Emiliano Ingrassia Date: Wed Nov 27 17:03:55 2019 +0100 usb: core: urb: fix URB structure initialization function commit 1cd17f7f0def31e3695501c4f86cd3faf8489840 upstream. Explicitly initialize URB structure urb_list field in usb_init_urb(). This field can be potentially accessed uninitialized and its initialization is coherent with the usage of list_del_init() in usb_hcd_unlink_urb_from_ep() and usb_giveback_urb_bh() and its explicit initialization in usb_hcd_submit_urb() error path. Signed-off-by: Emiliano Ingrassia Cc: stable Link: https://lore.kernel.org/r/20191127160355.GA27196@ingrassia.epigenesys.com Signed-off-by: Greg Kroah-Hartman commit 710b44430ec23ff130adca8309b1154057057f64 Author: Johan Hovold Date: Tue Dec 10 12:25:59 2019 +0100 USB: adutux: fix interface sanity check commit 3c11c4bed02b202e278c0f5c319ae435d7fb9815 upstream. Make sure to use the current alternate setting when verifying the interface descriptors to avoid binding to an invalid interface. Failing to do so could cause the driver to misbehave or trigger a WARN() in usb_submit_urb() that kernels with panic_on_warn set would choke on. Fixes: 03270634e242 ("USB: Add ADU support for Ontrak ADU devices") Cc: stable # 2.6.19 Signed-off-by: Johan Hovold Link: https://lore.kernel.org/r/20191210112601.3561-3-johan@kernel.org Signed-off-by: Greg Kroah-Hartman commit 76d915a1b13efeb8be7c05001455ec348fdb9ab9 Author: Wen Yang Date: Sun Nov 24 22:22:36 2019 +0800 usb: roles: fix a potential use after free commit 1848a543191ae32e558bb0a5974ae7c38ebd86fc upstream. Free the sw structure only after we are done using it. This patch just moves the put_device() down a bit to avoid the use after free. 
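The use-after-free fix is pure ordering: keep the reference until the last access, and only then drop it. A sketch with a hypothetical switch structure:

#include <linux/device.h>

struct demo_role_sw { struct device dev; };     /* illustrative only */

static void demo_report_and_release(struct demo_role_sw *sw)
{
        /* Use the object first ... */
        dev_info(&sw->dev, "releasing role switch\n");

        /* ... then drop the reference that may free it. Calling
         * put_device() before the access above is exactly the
         * use-after-free the patch removes. */
        put_device(&sw->dev);
}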
Fixes: 5c54fcac9a9d ("usb: roles: Take care of driver module reference counting") Signed-off-by: Wen Yang Reviewed-by: Heikki Krogerus Reviewed-by: Peter Chen Cc: stable Cc: Hans de Goede Cc: Chunfeng Yun Cc: Suzuki K Poulose Cc: linux-usb@vger.kernel.org Cc: linux-kernel@vger.kernel.org Link: https://lore.kernel.org/r/20191124142236.25671-1-wenyang@linux.alibaba.com Signed-off-by: Greg Kroah-Hartman commit ebedb736280f7e6fc20d770208af4083d26b69ea Author: Johan Hovold Date: Tue Dec 10 12:26:01 2019 +0100 USB: serial: io_edgeport: fix epic endpoint lookup commit 7c5a2df3367a2c4984f1300261345817d95b71f8 upstream. Make sure to use the current alternate setting when looking up the endpoints on epic devices to avoid binding to an invalid interface. Failing to do so could cause the driver to misbehave or trigger a WARN() in usb_submit_urb() that kernels with panic_on_warn set would choke on. Fixes: 6e8cf7751f9f ("USB: add EPIC support to the io_edgeport driver") Cc: stable # 2.6.21 Signed-off-by: Johan Hovold Link: https://lore.kernel.org/r/20191210112601.3561-5-johan@kernel.org Signed-off-by: Greg Kroah-Hartman commit 6805e00788891f471c8257a9ae9a0c040ba1a611 Author: Johan Hovold Date: Tue Dec 10 12:26:00 2019 +0100 USB: idmouse: fix interface sanity checks commit 59920635b89d74b9207ea803d5e91498d39e8b69 upstream. Make sure to use the current alternate setting when verifying the interface descriptors to avoid binding to an invalid interface. Failing to do so could cause the driver to misbehave or trigger a WARN() in usb_submit_urb() that kernels with panic_on_warn set would choke on. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Cc: stable Signed-off-by: Johan Hovold Link: https://lore.kernel.org/r/20191210112601.3561-4-johan@kernel.org Signed-off-by: Greg Kroah-Hartman commit 836924c2dab68522d8df75a4441dea3a91fc9ec8 Author: Johan Hovold Date: Tue Dec 10 12:25:58 2019 +0100 USB: atm: ueagle-atm: add missing endpoint check commit 09068c1ad53fb077bdac288869dec2435420bdc4 upstream. Make sure that the interrupt interface has an endpoint before trying to access its endpoint descriptors to avoid dereferencing a NULL pointer. The driver binds to the interrupt interface with interface number 0, but must not assume that this interface or its current alternate setting are the first entries in the corresponding configuration arrays. Fixes: b72458a80c75 ("[PATCH] USB: Eagle and ADI 930 usb adsl modem driver") Cc: stable # 2.6.16 Signed-off-by: Johan Hovold Link: https://lore.kernel.org/r/20191210112601.3561-2-johan@kernel.org Signed-off-by: Greg Kroah-Hartman commit 991fd95e5f2a4498420ac4471de98c04335331bb Author: Mircea Caprioru Date: Mon Nov 18 10:38:57 2019 +0200 iio: adc: ad7124: Enable internal reference commit 11d7c8d3b1259c303fb52789febed58f0bc35ad1 upstream. When the internal reference was selected by a channel it was not enabled. This patch fixes that and enables it. Fixes: b3af341bbd96 ("iio: adc: Add ad7124 support") Signed-off-by: Mircea Caprioru Cc: Signed-off-by: Jonathan Cameron Signed-off-by: Greg Kroah-Hartman commit 187e07d9910d36535795f105ae5b3dd6496b714b Author: Beniamin Bia Date: Mon Nov 4 18:26:34 2019 +0200 iio: adc: ad7606: fix reading unnecessary data from device commit 341826a065660d1b77d89e6335b6095cd654271c upstream. When a conversion result is being read from ADC, the driver reads the number of channels + 1 because it thinks that IIO_CHAN_SOFT_TIMESTAMP is also a physical channel. This patch fixes this issue. 
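The point of the ad7606 fix is that the IIO_CHAN_SOFT_TIMESTAMP entry in the channel array is not a hardware channel and must not be fetched from the ADC. Counting only the physical channels looks roughly like this (the transfer helper is a hypothetical stand-in for the driver's bus read):

#include <linux/iio/iio.h>

/* Hypothetical stand-in for the driver's SPI/parallel read. */
extern int demo_read_samples(struct iio_dev *indio_dev, unsigned int count);

static int demo_read_conversion(struct iio_dev *indio_dev)
{
        int i;
        unsigned int hw_channels = 0;

        /* IIO_CHAN_SOFT_TIMESTAMP declares a channel of type
         * IIO_TIMESTAMP; it is filled in by software, not the device. */
        for (i = 0; i < indio_dev->num_channels; i++)
                if (indio_dev->channels[i].type != IIO_TIMESTAMP)
                        hw_channels++;

        return demo_read_samples(indio_dev, hw_channels);
}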
Fixes: 2985a5d88455 ("staging: iio: adc: ad7606: Move out of staging") Reported-by: Robert Wörle Signed-off-by: Beniamin Bia Cc: Signed-off-by: Jonathan Cameron Signed-off-by: Greg Kroah-Hartman commit d314b891272abaad7d1c9bf4caadb97023ee9721 Author: Jean-Baptiste Maneyrol Date: Tue Nov 26 17:19:12 2019 +0100 iio: imu: inv_mpu6050: fix temperature reporting using bad unit commit 53eaa9c27fdc01b4f4d885223e29f97393409e7e upstream. Temperature should be reported in milli-degrees, not degrees. Fix scale and offset values to use the correct unit. This is a fix for an issue that has been present for a long time. The fixes tag reflects the point at which the code last changed in a fashion that would make this fix patch no longer apply. Backports will be necessary to fix those elements that predate that patch. Fixes: 1615fe41a195 ("iio: imu: mpu6050: Fix FIFO layout for ICM20602") Cc: stable@vger.kernel.org Signed-off-by: Jean-Baptiste Maneyrol Signed-off-by: Jonathan Cameron Signed-off-by: Greg Kroah-Hartman commit 6e1536f5c50229490ac3bcf0a4007937aa924e6e Author: Chris Lesiak Date: Thu Nov 21 20:39:42 2019 +0000 iio: humidity: hdc100x: fix IIO_HUMIDITYRELATIVE channel reporting commit 342a6928bd5017edbdae376042d8ad6af3d3b943 upstream. The IIO_HUMIDITYRELATIVE channel was being incorrectly reported back as percent when it should have been milli percent. This is via an incorrect scale value being returned to userspace. Signed-off-by: Chris Lesiak Acked-by: Matt Ranostay Cc: Signed-off-by: Jonathan Cameron Signed-off-by: Greg Kroah-Hartman commit 9c58162eedbf66e965e49a29c3a3f7ff8bf48c51 Author: Nuno Sá Date: Mon Oct 28 17:33:48 2019 +0100 iio: adis16480: Fix scales factors commit 49549cb23a2926eba70bb634e361daea0f319794 upstream. This patch fixes the scales for the gyroscope, accelerometer and barometer. The pressure scale was just wrong. For the others, the scale factors were not taking into account that a 32bit word is being read from the device. Fixes: 7abad1063deb ("iio: adis16480: Fix scale factors") Fixes: 82e7a1b25017 ("iio: imu: adis16480: Add support for ADIS1649x family of devices") Signed-off-by: Nuno Sá Cc: Signed-off-by: Jonathan Cameron Signed-off-by: Greg Kroah-Hartman commit 5d8fb67d4068defa01267e6d509e4c8371487010 Author: Lorenzo Bianconi Date: Sun Oct 27 19:02:30 2019 +0100 iio: imu: st_lsm6dsx: fix ODR check in st_lsm6dsx_write_raw commit fc3f6ad7f5dc6c899fbda0255865737bac88c2e0 upstream. Since st_lsm6dsx i2c master controller relies on accel device as trigger and slave devices can run at different ODRs we must select an accel_odr >= slave_odr. Report real accel ODR in st_lsm6dsx_check_odr() in order to properly set sensor frequency in st_lsm6dsx_write_raw and avoid to report unsupported frequency Fixes: 6ffb55e5009ff ("iio: imu: st_lsm6dsx: introduce ST_LSM6DSX_ID_EXT sensor ids") Signed-off-by: Lorenzo Bianconi Cc: Signed-off-by: Jonathan Cameron Signed-off-by: Greg Kroah-Hartman commit 4b41b1c4ebac6da9b6e93d278b1eafda0aa925b0 Author: Nuno Sá Date: Mon Oct 28 17:33:49 2019 +0100 iio: adis16480: Add debugfs_reg_access entry commit 4c35b7a51e2f291471f7221d112c6a45c63e83bc upstream. The driver is defining debugfs entries by calling `adis16480_debugfs_init()`. However, those entries are attached to the iio_dev debugfs entry which won't exist if no debugfs_reg_access callback is provided. 
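The callback in question is the .debugfs_reg_access member of struct iio_info; providing one is what makes the per-device debugfs directory appear. A sketch with hypothetical register accessors standing in for the adis library helpers:

#include <linux/iio/iio.h>

/* Hypothetical raw register accessors, for illustration only. */
extern int demo_reg_read(struct iio_dev *indio_dev, unsigned int reg,
                         unsigned int *val);
extern int demo_reg_write(struct iio_dev *indio_dev, unsigned int reg,
                          unsigned int val);

static int demo_reg_access(struct iio_dev *indio_dev, unsigned int reg,
                           unsigned int writeval, unsigned int *readval)
{
        if (readval)
                return demo_reg_read(indio_dev, reg, readval);
        return demo_reg_write(indio_dev, reg, writeval);
}

static const struct iio_info demo_info = {
        /* Without this callback the iio_dev gets no debugfs entry, so
         * any extra entries the driver creates have nowhere to attach. */
        .debugfs_reg_access = demo_reg_access,
};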
Fixes: 2f3abe6cbb6c ("iio:imu: Add support for the ADIS16480 and similar IMUs") Signed-off-by: Nuno Sá Cc: Signed-off-by: Jonathan Cameron Signed-off-by: Greg Kroah-Hartman commit 674a89b757fef1d2e94f7663df6a52fbc488dd8c Author: H. Nikolaus Schaller Date: Thu Nov 7 11:30:36 2019 +0100 ARM: dts: pandora-common: define wl1251 as child node of mmc3 commit 4f9007d692017cef38baf2a9b82b7879d5b2407b upstream. Since v4.7 the dma initialization requires that there is a device tree property for "rx" and "tx" channels which is not provided by the pdata-quirks initialization. By conversion of the mmc3 setup to device tree this will finally allows to remove the OpenPandora wlan specific omap3 data-quirks. Fixes: 81eef6ca9201 ("mmc: omap_hsmmc: Use dma_request_chan() for requesting DMA channel") Signed-off-by: H. Nikolaus Schaller Cc: # v4.7+ Acked-by: Tony Lindgren Signed-off-by: Ulf Hansson Signed-off-by: Greg Kroah-Hartman commit 44e7ecdab8ae67a536f3c5a123376b543c6903fc Author: Bryan O'Donoghue Date: Thu Nov 28 13:43:57 2019 +0000 usb: common: usb-conn-gpio: Don't log an error on probe deferral commit 59120962e4be4f72be537adb17da6881c4b3797c upstream. This patch makes the printout of the error message for failing to get a VBUS regulator handle conditional on the error code being something other than -EPROBE_DEFER. Deferral is a normal thing, we don't need an error message for this. Cc: Chunfeng Yun Cc: Nagarjuna Kristam Cc: Linus Walleij Cc: Greg Kroah-Hartman Cc: linux-usb@vger.kernel.org Signed-off-by: Bryan O'Donoghue Cc: stable Link: https://lore.kernel.org/r/20191128134358.3880498-2-bryan.odonoghue@linaro.org Signed-off-by: Greg Kroah-Hartman commit 9fb0a8c74c54946a8a6eda77b92c3b2f5608e147 Author: Georgi Djakov Date: Thu Dec 12 09:53:31 2019 +0200 interconnect: qcom: qcs404: Walk the list safely on node removal commit f39488ea2a75c49634c8611090f58734f61eee7c upstream. As we will remove items off the list using list_del(), we need to use the safe version of list_for_each_entry(). Fixes: 5e4e6c4d3ae0 ("interconnect: qcom: Add QCS404 interconnect provider driver") Reported-by: Dmitry Osipenko Reviewed-by: Bjorn Andersson Signed-off-by: Georgi Djakov Cc: # v5.4 Link: https://lore.kernel.org/r/20191212075332.16202-4-georgi.djakov@linaro.org Signed-off-by: Greg Kroah-Hartman commit 48b47dfd0441eb1eafe6fef7402abaf18a47f207 Author: Georgi Djakov Date: Thu Dec 12 09:53:30 2019 +0200 interconnect: qcom: sdm845: Walk the list safely on node removal commit b29b8113bb41285eb7ed55ce0c65017b5c0240f7 upstream. As we will remove items off the list using list_del(), we need to use the safe version of list_for_each_entry(). Fixes: b5d2f741077a ("interconnect: qcom: Add sdm845 interconnect provider driver") Reported-by: Dmitry Osipenko Reviewed-by: Bjorn Andersson Signed-off-by: Georgi Djakov Cc: # v5.3+ Link: https://lore.kernel.org/r/20191212075332.16202-3-georgi.djakov@linaro.org Signed-off-by: Greg Kroah-Hartman commit e6406776137bc5187fb7f1bce76785deac6f84d4 Author: Mathias Nyman Date: Wed Dec 11 16:20:07 2019 +0200 xhci: make sure interrupts are restored to correct state commit bd82873f23c9a6ad834348f8b83f3b6a5bca2c65 upstream. spin_unlock_irqrestore() might be called with stale flags after reading port status, possibly restoring interrupts to a incorrect state. If a usb2 port just finished resuming while the port status is read the spin lock will be temporary released and re-acquired in a separate function. 
The flags parameter is passed as value instead of a pointer, not updating flags properly before the final spin_unlock_irqrestore() is called. Cc: # v3.12+ Fixes: 8b3d45705e54 ("usb: Fix xHCI host issues on remote wakeup.") Signed-off-by: Mathias Nyman Link: https://lore.kernel.org/r/20191211142007.8847-7-mathias.nyman@linux.intel.com Signed-off-by: Greg Kroah-Hartman commit 975711cd3b1852d1e666dc59058d8c662938dfbd Author: Mathias Nyman Date: Wed Dec 11 16:20:06 2019 +0200 xhci: handle some XHCI_TRUST_TX_LENGTH quirks cases as default behaviour. commit 7ff11162808cc2ec66353fc012c58bb449c892c3 upstream. xhci driver claims it needs XHCI_TRUST_TX_LENGTH quirk for both Broadcom/Cavium and a Renesas xHC controllers. The quirk was inteded for handling false "success" complete event for transfers that had data left untransferred. These transfers should complete with "short packet" events instead. In these two new cases the false "success" completion is reported after a "short packet" if the TD consists of several TRBs. xHCI specs 4.10.1.1.2 say remaining TRBs should report "short packet" as well after the first short packet in a TD, but this issue seems so common it doesn't make sense to add the quirk for all vendors. Turn these events into short packets automatically instead. This gets rid of the "The WARN Successful completion on short TX for slot 1 ep 1: needs XHCI_TRUST_TX_LENGTH quirk" warning in many cases. Cc: Reported-by: Eli Billauer Reported-by: Ard Biesheuvel Tested-by: Eli Billauer Tested-by: Ard Biesheuvel Signed-off-by: Mathias Nyman Link: https://lore.kernel.org/r/20191211142007.8847-6-mathias.nyman@linux.intel.com Signed-off-by: Greg Kroah-Hartman commit 89071f7513d5a92943cbd481d28003e5b617eec6 Author: Kai-Heng Feng Date: Wed Dec 11 16:20:05 2019 +0200 xhci: Increase STS_HALT timeout in xhci_suspend() commit 7c67cf6658cec70d8a43229f2ce74ca1443dc95e upstream. I've recently observed failed xHCI suspend attempt on AMD Raven Ridge system: kernel: xhci_hcd 0000:04:00.4: WARN: xHC CMD_RUN timeout kernel: PM: suspend_common(): xhci_pci_suspend+0x0/0xd0 returns -110 kernel: PM: pci_pm_suspend(): hcd_pci_suspend+0x0/0x30 returns -110 kernel: PM: dpm_run_callback(): pci_pm_suspend+0x0/0x150 returns -110 kernel: PM: Device 0000:04:00.4 failed to suspend async: error -110 Similar to commit ac343366846a ("xhci: Increase STS_SAVE timeout in xhci_suspend()") we also need to increase the HALT timeout to make it be able to suspend again. Cc: # 5.2+ Fixes: f7fac17ca925 ("xhci: Convert xhci_handshake() to use readl_poll_timeout_atomic()") Signed-off-by: Kai-Heng Feng Signed-off-by: Mathias Nyman Link: https://lore.kernel.org/r/20191211142007.8847-5-mathias.nyman@linux.intel.com Signed-off-by: Greg Kroah-Hartman commit 55734bad42cf9f8a46562110959528ce3bceaff5 Author: Mathias Nyman Date: Wed Dec 11 16:20:03 2019 +0200 xhci: fix USB3 device initiated resume race with roothub autosuspend commit 057d476fff778f1d3b9f861fdb5437ea1a3cfc99 upstream. A race in xhci USB3 remote wake handling may force device back to suspend after it initiated resume siganaling, causing a missed resume event or warm reset of device. When a USB3 link completes resume signaling and goes to enabled (UO) state a interrupt is issued and the interrupt handler will clear the bus_state->port_remote_wakeup resume flag, allowing bus suspend. 
If the USB3 roothub thread just finished reading port status before the interrupt, finding ports still in suspended (U3) state, but hasn't yet started suspending the hub, then the xhci interrupt handler will clear the flag that prevented roothub suspend and allow bus to suspend, forcing all port links back to suspended (U3) state. Example case: usb_runtime_suspend() # because all ports still show suspended U3 usb_suspend_both() hub_suspend(); # successful as hub->wakeup_bits not set yet ==> INTERRUPT xhci_irq() handle_port_status() clear bus_state->port_remote_wakeup usb_wakeup_notification() sets hub->wakeup_bits; kick_hub_wq() <== END INTERRUPT hcd_bus_suspend() xhci_bus_suspend() # success as port_remote_wakeup bits cleared Fix this by increasing roothub usage count during port resume to prevent roothub autosuspend, and by making sure bus_state->port_remote_wakeup flag is only cleared after resume completion is visible, i.e. after xhci roothub returned U0 or other non-U3 link state link on a get port status request. Issue rootcaused by Chiasheng Lee Cc: Cc: Lee, Hou-hsun Reported-by: Lee, Chiasheng Signed-off-by: Mathias Nyman Link: https://lore.kernel.org/r/20191211142007.8847-3-mathias.nyman@linux.intel.com Signed-off-by: Greg Kroah-Hartman commit 0b3cf241df75fa1ff3e8d75a757d86c86492ad86 Author: Mika Westerberg Date: Wed Dec 11 16:20:02 2019 +0200 xhci: Fix memory leak in xhci_add_in_port() commit ce91f1a43b37463f517155bdfbd525eb43adbd1a upstream. When xHCI is part of Alpine or Titan Ridge Thunderbolt controller and the xHCI device is hot-removed as a result of unplugging a dock for example, the driver leaks memory it allocates for xhci->usb3_rhub.psi and xhci->usb2_rhub.psi in xhci_add_in_port() as reported by kmemleak: unreferenced object 0xffff922c24ef42f0 (size 16): comm "kworker/u16:2", pid 178, jiffies 4294711640 (age 956.620s) hex dump (first 16 bytes): 21 00 0c 00 12 00 dc 05 23 00 e0 01 00 00 00 00 !.......#....... backtrace: [<000000007ac80914>] xhci_mem_init+0xcf8/0xeb7 [<0000000001b6d775>] xhci_init+0x7c/0x160 [<00000000db443fe3>] xhci_gen_setup+0x214/0x340 [<00000000fdffd320>] xhci_pci_setup+0x48/0x110 [<00000000541e1e03>] usb_add_hcd.cold+0x265/0x747 [<00000000ca47a56b>] usb_hcd_pci_probe+0x219/0x3b4 [<0000000021043861>] xhci_pci_probe+0x24/0x1c0 [<00000000b9231f25>] local_pci_probe+0x3d/0x70 [<000000006385c9d7>] pci_device_probe+0xd0/0x150 [<0000000070241068>] really_probe+0xf5/0x3c0 [<0000000061f35c0a>] driver_probe_device+0x58/0x100 [<000000009da11198>] bus_for_each_drv+0x79/0xc0 [<000000009ce45f69>] __device_attach+0xda/0x160 [<00000000df201aaf>] pci_bus_add_device+0x46/0x70 [<0000000088a1bc48>] pci_bus_add_devices+0x27/0x60 [<00000000ad9ee708>] pci_bus_add_devices+0x52/0x60 unreferenced object 0xffff922c24ef3318 (size 8): comm "kworker/u16:2", pid 178, jiffies 4294711640 (age 956.620s) hex dump (first 8 bytes): 34 01 05 00 35 41 0a 00 4...5A.. 
backtrace: [<000000007ac80914>] xhci_mem_init+0xcf8/0xeb7 [<0000000001b6d775>] xhci_init+0x7c/0x160 [<00000000db443fe3>] xhci_gen_setup+0x214/0x340 [<00000000fdffd320>] xhci_pci_setup+0x48/0x110 [<00000000541e1e03>] usb_add_hcd.cold+0x265/0x747 [<00000000ca47a56b>] usb_hcd_pci_probe+0x219/0x3b4 [<0000000021043861>] xhci_pci_probe+0x24/0x1c0 [<00000000b9231f25>] local_pci_probe+0x3d/0x70 [<000000006385c9d7>] pci_device_probe+0xd0/0x150 [<0000000070241068>] really_probe+0xf5/0x3c0 [<0000000061f35c0a>] driver_probe_device+0x58/0x100 [<000000009da11198>] bus_for_each_drv+0x79/0xc0 [<000000009ce45f69>] __device_attach+0xda/0x160 [<00000000df201aaf>] pci_bus_add_device+0x46/0x70 [<0000000088a1bc48>] pci_bus_add_devices+0x27/0x60 [<00000000ad9ee708>] pci_bus_add_devices+0x52/0x60 Fix this by calling kfree() for the both psi objects in xhci_mem_cleanup(). Cc: # 4.4+ Fixes: 47189098f8be ("xhci: parse xhci protocol speed ID list for usb 3.1 usage") Signed-off-by: Mika Westerberg Signed-off-by: Mathias Nyman Link: https://lore.kernel.org/r/20191211142007.8847-2-mathias.nyman@linux.intel.com Signed-off-by: Greg Kroah-Hartman commit 00e0fb69194a95a5707a4f197300060eb8f222bb Author: Henry Lin Date: Wed Dec 11 16:20:04 2019 +0200 usb: xhci: only set D3hot for pci device commit f2c710f7dca8457e88b4ac9de2060f011254f9dd upstream. Xhci driver cannot call pci_set_power_state() on non-pci xhci host controllers. For example, NVIDIA Tegra XHCI host controller which acts as platform device with XHCI_SPURIOUS_WAKEUP quirk set in some platform hits this issue during shutdown. Cc: Fixes: 638298dc66ea ("xhci: Fix spurious wakeups after S5 on Haswell") Signed-off-by: Henry Lin Signed-off-by: Mathias Nyman Link: https://lore.kernel.org/r/20191211142007.8847-4-mathias.nyman@linux.intel.com Signed-off-by: Greg Kroah-Hartman commit 661cf020ae2b57d0675feea5e68caf68616ec4c3 Author: Johan Hovold Date: Mon Dec 2 09:56:10 2019 +0100 staging: gigaset: add endpoint-type sanity check commit ed9ed5a89acba51b82bdff61144d4e4a4245ec8a upstream. Add missing endpoint-type sanity checks to probe. This specifically prevents a warning in USB core on URB submission when fuzzing USB descriptors. Signed-off-by: Johan Hovold Cc: stable Link: https://lore.kernel.org/r/20191202085610.12719-4-johan@kernel.org Signed-off-by: Greg Kroah-Hartman commit da64ea560aa69ffbf0235a81fdb2b5b5c5238385 Author: Johan Hovold Date: Mon Dec 2 09:56:09 2019 +0100 staging: gigaset: fix illegal free on probe errors commit 84f60ca7b326ed8c08582417493982fe2573a9ad upstream. The driver failed to initialise its receive-buffer pointer, something which could lead to an illegal free on late probe errors. Fix this by making sure to clear all driver data at allocation. Fixes: 2032e2c2309d ("usb_gigaset: code cleanup") Cc: stable # 2.6.33 Cc: Tilman Schmidt Signed-off-by: Johan Hovold Link: https://lore.kernel.org/r/20191202085610.12719-3-johan@kernel.org Signed-off-by: Greg Kroah-Hartman commit d1cbf4e59240b6c3380d748b1874aa2c49d1c1dd Author: Johan Hovold Date: Mon Dec 2 09:56:08 2019 +0100 staging: gigaset: fix general protection fault on probe commit 53f35a39c3860baac1e5ca80bf052751cfb24a99 upstream. Fix a general protection fault when accessing the endpoint descriptors which could be triggered by a malicious device due to missing sanity checks on the number of endpoints. 
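The sanity check being added is a few lines at the top of probe(); a minimal sketch (the required endpoint count is illustrative):

#include <linux/usb.h>

static int demo_probe(struct usb_interface *intf,
                      const struct usb_device_id *id)
{
        struct usb_host_interface *alt = intf->cur_altsetting;

        /* Reject descriptors that do not carry the endpoints the driver
         * is about to dereference. */
        if (alt->desc.bNumEndpoints < 2)
                return -ENODEV;

        /* ... normal probe work ... */
        return 0;
}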
Reported-by: syzbot+35b1c403a14f5c89eba7@syzkaller.appspotmail.com Fixes: 07dc1f9f2f80 ("[PATCH] isdn4linux: Siemens Gigaset drivers - M105 USB DECT adapter") Cc: stable # 2.6.17 Cc: Hansjoerg Lipp Cc: Tilman Schmidt Signed-off-by: Johan Hovold Link: https://lore.kernel.org/r/20191202085610.12719-2-johan@kernel.org Signed-off-by: Greg Kroah-Hartman commit 2aaf1e194e2930c78e9eb54e90726be63ddf5374 Author: Marcelo Diop-Gonzalez Date: Tue Dec 3 10:39:21 2019 -0500 staging: vchiq: call unregister_chrdev_region() when driver registration fails commit d2cdb20507fe2079a146459f9718b45d78cbbe61 upstream. This undoes the previous call to alloc_chrdev_region() on failure, and is probably what was meant originally given the label name. Signed-off-by: Marcelo Diop-Gonzalez Cc: stable Fixes: 187ac53e590c ("staging: vchiq_arm: rework probe and init functions") Reviewed-by: Dan Carpenter Reviewed-by: Nicolas Saenz Julienne Link: https://lore.kernel.org/r/20191203153921.70540-1-marcgonzalez@google.com Signed-off-by: Greg Kroah-Hartman commit 601dc859961967efbefd730c3f0dc251f033fc1b Author: Johan Hovold Date: Tue Dec 10 12:47:51 2019 +0100 staging: rtl8712: fix interface sanity check commit c724f776f048538ecfdf53a52b7a522309f5c504 upstream. Make sure to use the current alternate setting when verifying the interface descriptors to avoid binding to an invalid interface. Failing to do so could cause the driver to misbehave or trigger a WARN() in usb_submit_urb() that kernels with panic_on_warn set would choke on. Fixes: 2865d42c78a9 ("staging: r8712u: Add the new driver to the mainline kernel") Cc: stable # 2.6.37 Signed-off-by: Johan Hovold Link: https://lore.kernel.org/r/20191210114751.5119-3-johan@kernel.org Signed-off-by: Greg Kroah-Hartman commit 6c38bd22074f48d66dcd921badf743d83b1d1c69 Author: Johan Hovold Date: Tue Dec 10 12:47:50 2019 +0100 staging: rtl8188eu: fix interface sanity check commit 74ca34118a0e05793935d804ccffcedd6eb56596 upstream. Make sure to use the current alternate setting when verifying the interface descriptors to avoid binding to an invalid interface. Failing to do so could cause the driver to misbehave or trigger a WARN() in usb_submit_urb() that kernels with panic_on_warn set would choke on. Fixes: c2478d39076b ("staging: r8188eu: Add files for new driver - part 20") Cc: stable # 3.12 Signed-off-by: Johan Hovold Link: https://lore.kernel.org/r/20191210114751.5119-2-johan@kernel.org Signed-off-by: Greg Kroah-Hartman commit 6859c3c6bb2b0f0e77ab764a61b474fb79f2550c Author: Brendan Higgins Date: Wed Dec 4 15:45:22 2019 -0800 staging: exfat: fix multiple definition error of `rename_file' commit 1af73a25e6e7d9f2f1e2a14259cc9ffce6d8f6d4 upstream. `rename_file' was exported but not properly namespaced causing a multiple definition error because `rename_file' is already defined in fs/hostfs/hostfs_user.c: ld: drivers/staging/exfat/exfat_core.o: in function `rename_file': drivers/staging/exfat/exfat_core.c:2327: multiple definition of `rename_file'; fs/hostfs/hostfs_user.o:fs/hostfs/hostfs_user.c:350: first defined here make: *** [Makefile:1077: vmlinux] Error 1 This error can be reproduced on ARCH=um by selecting: CONFIG_EXFAT_FS=y CONFIG_HOSTFS=y Add a namespace prefix exfat_* to fix this error. 
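The failure mode is ordinary C linkage: two translation units exporting the same external symbol. A hedged illustration (the signatures below are placeholders, not the real hostfs or exfat prototypes):

struct super_block;     /* opaque here; placeholder signature only */

/* fs/hostfs already exports a global with this name. */
int rename_file(char *from, char *to);

/* The staging exfat driver exported another global named rename_file,
 * so a kernel with both built in fails to link. Prefixing keeps the
 * helper in the driver's own namespace (making it static would also
 * work if it were only used within one file). */
int exfat_rename_file(struct super_block *sb, const char *old_name,
                      const char *new_name);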
Reported-by: Brendan Higgins Signed-off-by: Brendan Higgins Cc: stable Cc: Valdis Kletnieks Tested-by: David Gow Reviewed-by: David Gow Link: https://lore.kernel.org/r/20191204234522.42855-1-brendanhiggins@google.com Signed-off-by: Greg Kroah-Hartman commit 34d8a89fe156b082823f438f8240e8d57291c9f2 Author: Todd Kjos Date: Fri Dec 13 12:25:31 2019 -0800 binder: fix incorrect calculation for num_valid commit 16981742717b04644a41052570fb502682a315d2 upstream. For BINDER_TYPE_PTR and BINDER_TYPE_FDA transactions, the num_valid local was calculated incorrectly causing the range check in binder_validate_ptr() to miss out-of-bounds offsets. Fixes: bde4a19fc04f ("binder: use userspace pointer as base of buffer space") Signed-off-by: Todd Kjos Cc: stable Link: https://lore.kernel.org/r/20191213202531.55010-1-tkjos@google.com Signed-off-by: Greg Kroah-Hartman commit a348e30570f8986952e378d62d699001840483ab Author: Nagarjuna Kristam Date: Mon Nov 4 14:54:30 2019 +0530 usb: host: xhci-tegra: Correct phy enable sequence commit 6351653febbb784d86fdf83afe41f7523a61b392 upstream. XUSB phy needs to be enabled before un-powergating the power partitions. However in the current sequence, it happens opposite. Correct the phy enable and powergating partition sequence to avoid any boot hangs. Signed-off-by: Nagarjuna Kristam Cc: stable Signed-off-by: Jui Chang Kuo Tested-by: Jon Hunter Acked-by: Thierry Reding Link: https://lore.kernel.org/r/1572859470-7823-1-git-send-email-nkristam@nvidia.com Signed-off-by: Greg Kroah-Hartman commit dabdb57bd6aa8db72a5050de34428aae288de09d Author: Kai-Heng Feng Date: Wed Nov 6 14:27:10 2019 +0800 usb: Allow USB device to be warm reset in suspended state commit e76b3bf7654c3c94554c24ba15a3d105f4006c80 upstream. On Dell WD15 dock, sometimes USB ethernet cannot be detected after plugging cable to the ethernet port, the hub and roothub get runtime resumed and runtime suspended immediately: ... [ 433.315169] xhci_hcd 0000:3a:00.0: hcd_pci_runtime_resume: 0 [ 433.315204] usb usb4: usb auto-resume [ 433.315226] hub 4-0:1.0: hub_resume [ 433.315239] xhci_hcd 0000:3a:00.0: Get port status 4-1 read: 0x10202e2, return 0x10343 [ 433.315264] usb usb4-port1: status 0343 change 0001 [ 433.315279] xhci_hcd 0000:3a:00.0: clear port1 connect change, portsc: 0x10002e2 [ 433.315293] xhci_hcd 0000:3a:00.0: Get port status 4-2 read: 0x2a0, return 0x2a0 [ 433.317012] xhci_hcd 0000:3a:00.0: xhci_hub_status_data: stopping port polling. [ 433.422282] xhci_hcd 0000:3a:00.0: Get port status 4-1 read: 0x10002e2, return 0x343 [ 433.422307] usb usb4-port1: do warm reset [ 433.422311] usb 4-1: device reset not allowed in state 8 [ 433.422339] hub 4-0:1.0: state 7 ports 2 chg 0002 evt 0000 [ 433.422346] xhci_hcd 0000:3a:00.0: Get port status 4-1 read: 0x10002e2, return 0x343 [ 433.422356] usb usb4-port1: do warm reset [ 433.422358] usb 4-1: device reset not allowed in state 8 [ 433.422428] xhci_hcd 0000:3a:00.0: set port remote wake mask, actual port 0 status = 0xf0002e2 [ 433.422455] xhci_hcd 0000:3a:00.0: set port remote wake mask, actual port 1 status = 0xe0002a0 [ 433.422465] hub 4-0:1.0: hub_suspend [ 433.422475] usb usb4: bus auto-suspend, wakeup 1 [ 433.426161] xhci_hcd 0000:3a:00.0: xhci_hub_status_data: stopping port polling. 
[ 433.466209] xhci_hcd 0000:3a:00.0: port 0 polling in bus suspend, waiting [ 433.510204] xhci_hcd 0000:3a:00.0: port 0 polling in bus suspend, waiting [ 433.554051] xhci_hcd 0000:3a:00.0: port 0 polling in bus suspend, waiting [ 433.598235] xhci_hcd 0000:3a:00.0: port 0 polling in bus suspend, waiting [ 433.642154] xhci_hcd 0000:3a:00.0: port 0 polling in bus suspend, waiting [ 433.686204] xhci_hcd 0000:3a:00.0: port 0 polling in bus suspend, waiting [ 433.730205] xhci_hcd 0000:3a:00.0: port 0 polling in bus suspend, waiting [ 433.774203] xhci_hcd 0000:3a:00.0: port 0 polling in bus suspend, waiting [ 433.818207] xhci_hcd 0000:3a:00.0: port 0 polling in bus suspend, waiting [ 433.862040] xhci_hcd 0000:3a:00.0: port 0 polling in bus suspend, waiting [ 433.862053] xhci_hcd 0000:3a:00.0: xhci_hub_status_data: stopping port polling. [ 433.862077] xhci_hcd 0000:3a:00.0: xhci_suspend: stopping port polling. [ 433.862096] xhci_hcd 0000:3a:00.0: // Setting command ring address to 0x8578fc001 [ 433.862312] xhci_hcd 0000:3a:00.0: hcd_pci_runtime_suspend: 0 [ 433.862445] xhci_hcd 0000:3a:00.0: PME# enabled [ 433.902376] xhci_hcd 0000:3a:00.0: restoring config space at offset 0xc (was 0x0, writing 0x20) [ 433.902395] xhci_hcd 0000:3a:00.0: restoring config space at offset 0x4 (was 0x100000, writing 0x100403) [ 433.902490] xhci_hcd 0000:3a:00.0: PME# disabled [ 433.902504] xhci_hcd 0000:3a:00.0: enabling bus mastering [ 433.902547] xhci_hcd 0000:3a:00.0: // Setting command ring address to 0x8578fc001 [ 433.902649] pcieport 0000:00:1b.0: PME: Spurious native interrupt! [ 433.902839] xhci_hcd 0000:3a:00.0: Port change event, 4-1, id 3, portsc: 0xb0202e2 [ 433.902842] xhci_hcd 0000:3a:00.0: resume root hub [ 433.902845] xhci_hcd 0000:3a:00.0: handle_port_status: starting port polling. [ 433.902877] xhci_hcd 0000:3a:00.0: xhci_resume: starting port polling. [ 433.902889] xhci_hcd 0000:3a:00.0: xhci_hub_status_data: stopping port polling. [ 433.902891] xhci_hcd 0000:3a:00.0: hcd_pci_runtime_resume: 0 [ 433.902919] usb usb4: usb wakeup-resume [ 433.902942] usb usb4: usb auto-resume [ 433.902966] hub 4-0:1.0: hub_resume ... As Mathias pointed out, the hub enters Cold Attach Status state and requires a warm reset. However usb_reset_device() bails out early when the device is in suspended state, as its callers port_event() and hub_event() don't always resume the device. Since there's nothing wrong to reset a suspended device, allow usb_reset_device() to do so to solve the issue. Signed-off-by: Kai-Heng Feng Acked-by: Alan Stern Cc: stable Link: https://lore.kernel.org/r/20191106062710.29880-1-kai.heng.feng@canonical.com Signed-off-by: Greg Kroah-Hartman commit d8fc2266c40fef226f3f0f6e5d839a75e2748c98 Author: Oliver Neukum Date: Thu Nov 14 12:27:58 2019 +0100 USB: documentation: flags on usb-storage versus UAS commit 65cc8bf99349f651a0a2cee69333525fe581f306 upstream. Document which flags work storage, UAS or both Signed-off-by: Oliver Neukum Cc: stable Link: https://lore.kernel.org/r/20191114112758.32747-4-oneukum@suse.com Signed-off-by: Greg Kroah-Hartman commit bf2e403d150583eb3ef6d17aa80e263b0a2d41eb Author: Oliver Neukum Date: Thu Nov 14 12:27:57 2019 +0100 USB: uas: heed CAPACITY_HEURISTICS commit 335cbbd5762d5e5c67a8ddd6e6362c2aa42a328f upstream. There is no need to ignore this flag. We should be as close to storage in that regard as makes sense, so honor flags whose cost is tiny. 
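In usb-storage the flag just nudges the SCSI disk driver's capacity handling, and the uas change mirrors that. A sketch of the pattern, assuming a devinfo structure that carries the US_FL_* quirk bits (both drivers keep those bits in their own per-device state):

#include <linux/usb_usual.h>
#include <scsi/scsi_device.h>

struct demo_devinfo { unsigned long flags; };   /* illustrative only */

static int demo_slave_configure(struct scsi_device *sdev,
                                struct demo_devinfo *devinfo)
{
        /* A cheap quirk to honor: let sd be suspicious of the reported
         * capacity (some bridges report it off by one). */
        if (devinfo->flags & US_FL_CAPACITY_HEURISTICS)
                sdev->guess_capacity = 1;
        return 0;
}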
Signed-off-by: Oliver Neukum Cc: stable Link: https://lore.kernel.org/r/20191114112758.32747-3-oneukum@suse.com Signed-off-by: Greg Kroah-Hartman commit 84a82ba810379ff099d8660ef18e27daf532da13 Author: Oliver Neukum Date: Thu Nov 14 12:27:56 2019 +0100 USB: uas: honor flag to avoid CAPACITY16 commit bff000cae1eec750d62e265c4ba2db9af57b17e1 upstream. Copy the support over from usb-storage to get feature parity Signed-off-by: Oliver Neukum Cc: stable Link: https://lore.kernel.org/r/20191114112758.32747-2-oneukum@suse.com Signed-off-by: Greg Kroah-Hartman commit dea5cc44e0164c4613af7664d03cdc9c8c7e8689 Author: Arnd Bergmann Date: Wed Nov 6 10:06:54 2019 +0100 media: venus: remove invalid compat_ioctl32 handler commit 4adc0423de92cf850d1ef5c0e7cb28fd7a38219e upstream. v4l2_compat_ioctl32() is the function that calls into v4l2_file_operations->compat_ioctl32(), so setting that back to the same function leads to a trivial endless loop, followed by a kernel stack overrun. Remove the incorrect assignment. Cc: stable@vger.kernel.org Fixes: 7472c1c69138 ("[media] media: venus: vdec: add video decoder files") Fixes: aaaa93eda64b ("[media] media: venus: venc: add video encoder files") Signed-off-by: Arnd Bergmann Acked-by: Stanimir Varbanov Signed-off-by: Hans Verkuil Signed-off-by: Mauro Carvalho Chehab Signed-off-by: Greg Kroah-Hartman commit c13f137cfaa31a752476a5075e1389a69df91372 Author: Arnd Bergmann Date: Tue Sep 11 20:47:23 2018 +0200 ceph: fix compat_ioctl for ceph_dir_operations commit 18bd6caaef4021803dd0d031dc37c2d001d18a5b upstream. The ceph_ioctl function is used both for files and directories, but only the files support doing that in 32-bit compat mode. On the s390 architecture, there is also a problem with invalid 31-bit pointers that need to be passed through compat_ptr(). Use the new compat_ptr_ioctl() to address both issues. Note: When backporting this patch to stable kernels, "compat_ioctl: add compat_ptr_ioctl()" is needed as well. Reviewed-by: "Yan, Zheng" Cc: stable@vger.kernel.org Signed-off-by: Arnd Bergmann Signed-off-by: Greg Kroah-Hartman commit 8896dd968b8b2422800c63626268e37d04e1d3e6 Author: Arnd Bergmann Date: Tue Sep 11 16:55:03 2018 +0200 compat_ioctl: add compat_ptr_ioctl() commit 2952db0fd51b0890f728df94ac563c21407f4f43 upstream. Many drivers have ioctl() handlers that are completely compatible between 32-bit and 64-bit architectures, except for the argument that is passed down from user space and may have to be passed through compat_ptr() in order to become a valid 64-bit pointer. Using ".compat_ptr = compat_ptr_ioctl" in file operations should let us simplify a lot of those drivers to avoid #ifdef checks, and convert additional drivers that don't have proper compat handling yet. On most architectures, the compat_ptr_ioctl() just passes all arguments to the corresponding ->ioctl handler. The exception is arch/s390, where compat_ptr() clears the top bit of a 32-bit pointer value, so user space pointers to the second 2GB alias the first 2GB, as is the case for native 32-bit s390 user space. The compat_ptr_ioctl() function must therefore be used only with ioctl functions that either ignore the argument or pass a pointer to a compatible data type. If any ioctl command handled by fops->unlocked_ioctl passes a plain integer instead of a pointer, or any of the passed data types is incompatible between 32-bit and 64-bit architectures, a proper handler is required instead of compat_ptr_ioctl. 
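Using the helper is a one-line change in the file_operations, as described above; a minimal sketch for a driver whose ioctl commands all take a user pointer (the handler name is hypothetical):

#include <linux/fs.h>
#include <linux/compat.h>
#include <linux/module.h>

/* Hypothetical ioctl handler that always treats 'arg' as a __user pointer. */
extern long demo_unlocked_ioctl(struct file *file, unsigned int cmd,
                                unsigned long arg);

static const struct file_operations demo_fops = {
        .owner          = THIS_MODULE,
        .unlocked_ioctl = demo_unlocked_ioctl,
        /* On s390 this applies compat_ptr() to the argument before
         * calling ->unlocked_ioctl(); on other architectures it is a
         * plain pass-through. Not valid if any command passes a plain
         * integer or a layout-incompatible struct. */
        .compat_ioctl   = compat_ptr_ioctl,
};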
commit 402f7198311f84a8b56183923532f57a3cc1b63f Author: Arun Easi Date: Tue Nov 5 07:06:55 2019 -0800
scsi: qla2xxx: Fix memory leak when sending I/O fails
commit 2f856d4e8c23f5ad5221f8da4a2f22d090627f19 upstream.
Under heavy load, a memory leak of the srb_t structure is observed. This makes the qla2xxx_srbs cache gobble up memory.
Fixes: 219d27d7147e0 ("scsi: qla2xxx: Fix race conditions in the code for aborting SCSI commands") Cc: stable@vger.kernel.org # 5.2 Link: https://lore.kernel.org/r/20191105150657.8092-7-hmadhani@marvell.com Reviewed-by: Ewan D. Milne Signed-off-by: Arun Easi Signed-off-by: Himanshu Madhani Signed-off-by: Martin K. Petersen Signed-off-by: Greg Kroah-Hartman

commit 31c1f455203e56a3ce8d5dd92f37c83d07bd5bd5 Author: Quinn Tran Date: Tue Nov 5 07:06:54 2019 -0800
scsi: qla2xxx: Fix double scsi_done for abort path
commit f45bca8c5052e8c59bab64ee90c44441678b9a52 upstream.
The current code assumes that an abort removes the original command from the active list, so that scsi_done is not called for it and the eh_abort thread does the scsi_done instead. That is not the case; instead we get double scsi_done calls, triggering a use after free. An abort only tells the firmware to release the command from firmware possession; the original command still returns to the ULP with an error in its normal fashion via scsi_done. The eh_abort path now waits for the original command to complete before returning and does not perform the scsi_done call itself.
Fixes: 219d27d7147e0 ("scsi: qla2xxx: Fix race conditions in the code for aborting SCSI commands") Cc: stable@vger.kernel.org # 5.2 Link: https://lore.kernel.org/r/20191105150657.8092-6-hmadhani@marvell.com Reviewed-by: Ewan D. Milne Signed-off-by: Quinn Tran Signed-off-by: Arun Easi Signed-off-by: Himanshu Madhani Signed-off-by: Martin K. Petersen Signed-off-by: Greg Kroah-Hartman

commit 5cb5b6748024c2bae6e9930bd652f754287e45e2 Author: Quinn Tran Date: Tue Nov 5 07:06:53 2019 -0800
scsi: qla2xxx: Fix driver unload hang
commit dd322b7f3efc8cda085bb60eadc4aee6324eadd8 upstream.
This patch fixes a driver unload hang by removing an msleep().
Fixes: d74595278f4ab ("scsi: qla2xxx: Add multiple queue pair functionality.") Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20191105150657.8092-5-hmadhani@marvell.com Reviewed-by: Ewan D. Milne Signed-off-by: Quinn Tran Signed-off-by: Himanshu Madhani Signed-off-by: Martin K. Petersen Signed-off-by: Greg Kroah-Hartman

commit b7abcc7df5e131c0b4bf89cb2411c5301ee83d26 Author: Quinn Tran Date: Tue Nov 5 07:06:51 2019 -0800
scsi: qla2xxx: Do command completion on abort timeout
commit 71c80b75ce8f08c0978ce9a9816b81b5c3ce5e12 upstream.
On a switch, fabric or management command timeout, the driver sends an Abort to tell the firmware to return the original command. If the Abort itself times out, return both the Abort and the original command for cleanup.
Fixes: 219d27d7147e0 ("scsi: qla2xxx: Fix race conditions in the code for aborting SCSI commands") Cc: stable@vger.kernel.org # 5.2 Link: https://lore.kernel.org/r/20191105150657.8092-3-hmadhani@marvell.com Reviewed-by: Ewan D. Milne Signed-off-by: Quinn Tran Signed-off-by: Himanshu Madhani Signed-off-by: Martin K. Petersen Signed-off-by: Greg Kroah-Hartman
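
As an aside on the double scsi_done fix above, the generic "complete exactly once" pattern it relies on can be sketched as follows; struct demo_srb, DEMO_SRB_DONE and demo_srb_complete() are made-up names, not the qla2xxx code:

    #include <linux/bitops.h>
    #include <scsi/scsi_cmnd.h>

    #define DEMO_SRB_DONE   0               /* made-up "already completed" bit */

    struct demo_srb {                       /* made-up stand-in for srb_t */
            unsigned long flags;
            struct scsi_cmnd *cmd;
    };

    /* Called from both the normal completion path and the abort path:
     * whichever path gets here first completes the command, the loser
     * returns immediately, so ->scsi_done() never runs twice. */
    static void demo_srb_complete(struct demo_srb *sp, int result)
    {
            if (test_and_set_bit(DEMO_SRB_DONE, &sp->flags))
                    return;
            sp->cmd->result = result;
            sp->cmd->scsi_done(sp->cmd);
    }
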
commit f3aed6797ee310ca13ad9c15255f3585123dae00 Author: Steffen Maier Date: Fri Oct 25 18:12:53 2019 +0200
scsi: zfcp: trace channel log even for FCP command responses
commit 100843f176109af94600e500da0428e21030ca7f upstream.
While v2.6.26 commit b75db73159cc ("[SCSI] zfcp: Add qtcb dump to hba debug trace") is right that we don't want to flood the (payload) trace ring buffer, we don't trace successful FCP command responses by default. So we can include the channel log for problem determination with failed responses of any FSF request type.
Fixes: b75db73159cc ("[SCSI] zfcp: Add qtcb dump to hba debug trace") Fixes: a54ca0f62f95 ("[SCSI] zfcp: Redesign of the debug tracing for HBA records.") Cc: #2.6.38+ Link: https://lore.kernel.org/r/e37597b5c4ae123aaa85fd86c23a9f71e994e4a9.1572018132.git.bblock@linux.ibm.com Reviewed-by: Benjamin Block Signed-off-by: Steffen Maier Signed-off-by: Benjamin Block Signed-off-by: Martin K. Petersen Signed-off-by: Greg Kroah-Hartman

commit 64c8e5afcb2c27a3016b6f916192f498c0d27b31 Author: James Smart Date: Fri Oct 18 14:18:21 2019 -0700
scsi: lpfc: Fix bad ndlp ptr in xri aborted handling
commit 324e1c402069e8d277d2a2b18ce40bde1265b96a upstream.
In cases where I/O may be aborted, such as driver unload or link bounces, the system will crash based on a bad ndlp pointer. Example:
RIP: 0010:lpfc_sli4_abts_err_handler+0x15/0x140 [lpfc]
...
lpfc_sli4_io_xri_aborted+0x20d/0x270 [lpfc]
lpfc_sli4_sp_handle_abort_xri_wcqe.isra.54+0x84/0x170 [lpfc]
lpfc_sli4_fp_handle_cqe+0xc2/0x480 [lpfc]
__lpfc_sli4_process_cq+0xc6/0x230 [lpfc]
__lpfc_sli4_hba_process_cq+0x29/0xc0 [lpfc]
process_one_work+0x14c/0x390
The crash was caused by a bad ndlp address passed to the I/O indicated by the XRI aborted CQE. The address was not NULL, so the routine dereferenced the ndlp pointer. The bad ndlp also caused lpfc_sli4_io_xri_aborted to call an erroneous io handler. The root cause of the bad ndlp was an lpfc_ncmd that was aborted, put on the abort_io list, completed, taken off the abort_io list, and sent to lpfc_release_nvme_buf, where it was put back on the abort_io list because the LPFC_SBUF_XBUSY setting in lpfc_ncmd->flags was not cleared on the final completion. Rework the exchange busy handling to ensure the flags are properly set for both scsi and nvme.
Fixes: c490850a0947 ("scsi: lpfc: Adapt partitioned XRI lists to efficient sharing") Cc: # v5.1+ Link: https://lore.kernel.org/r/20191018211832.7917-6-jsmart2021@gmail.com Signed-off-by: Dick Kennedy Signed-off-by: James Smart Signed-off-by: Martin K. Petersen Signed-off-by: Greg Kroah-Hartman

commit b49e676ce4308ee7fe040a41ad6504b2771068d2 Author: Jian-Hong Pan Date: Thu Oct 31 17:34:09 2019 +0800
Revert "nvme: Add quirk for Kingston NVME SSD running FW E8FK11.T"
commit 655e7aee1f0398602627a485f7dca6c29cc96cae upstream.
Since e045fa29e893 ("PCI/MSI: Fix incorrect MSI-X masking on resume") has been merged, we can revert the previous quirk now. This reverts commit 19ea025e1d28c629b369c3532a85b3df478cc5c6.
Buglink: https://bugzilla.kernel.org/show_bug.cgi?id=204887 Fixes: 19ea025e1d28 ("nvme: Add quirk for Kingston NVME SSD running FW E8FK11.T") Link: https://lore.kernel.org/r/20191031093408.9322-1-jian-hong@endlessm.com Signed-off-by: Jian-Hong Pan Signed-off-by: Bjorn Helgaas Acked-by: Christoph Hellwig Cc: stable@vger.kernel.org Signed-off-by: Greg Kroah-Hartman

commit 5ce4a36e037ef95b95a8957a4dca4d28b035e921 Author: Keith Busch Date: Tue Dec 3 00:44:59 2019 +0900
nvme: Namespace identification descriptor list is optional
commit 22802bf742c25b1e2473c70b3b99da98af65ef4d upstream.
Although the NVM Express specification 1.3 requires that a controller claiming to be 1.3 or higher implement Identify CNS 03h (Namespace Identification Descriptor list), the driver doesn't really need this identification in order to use a namespace. The code already documents in comments that failure of this command is not to be considered an error. Return success if the controller provided any response to a namespace identification descriptors command.
Fixes: 538af88ea7d9de24 ("nvme: make nvme_report_ns_ids propagate error back") Link: https://bugzilla.kernel.org/show_bug.cgi?id=205679 Reported-by: Ingo Brunberg Cc: Sagi Grimberg Cc: stable@vger.kernel.org # 5.4+ Reviewed-by: Christoph Hellwig Signed-off-by: Keith Busch Signed-off-by: Greg Kroah-Hartman

commit 65b295a84549d593766a9a58442514724ad4cdd9 Author: Gustavo A. R. Silva Date: Wed Nov 6 14:28:21 2019 -0600
usb: gadget: pch_udc: fix use after free
commit 66d1b0c0580b7f1b1850ee4423f32ac42afa2e92 upstream.
Remove the pointer dereference after free. pci_pool_free doesn't care about the contents of td; to it, td is just a void *.
Addresses-Coverity-ID: 1091173 ("Use after free") Cc: stable@vger.kernel.org Acked-by: Michal Nazarewicz Signed-off-by: Gustavo A. R. Silva Link: https://lore.kernel.org/r/20191106202821.GA20347@embeddedor Signed-off-by: Greg Kroah-Hartman

commit 58849169408e422bcd254234bb576dd280a0fc7f Author: Wei Yongjun Date: Wed Oct 30 03:40:46 2019 +0000
usb: gadget: configfs: Fix missing spin_lock_init()
commit 093edc2baad2c258b1f55d1ab9c63c2b5ae67e42 upstream.
The driver allocates the spinlock but does not initialize it. Use spin_lock_init() to initialize it correctly. This was detected by a Coccinelle semantic patch.
Fixes: 1a1c851bbd70 ("usb: gadget: configfs: fix concurrent issue between composite APIs") Signed-off-by: Wei Yongjun Cc: stable Reviewed-by: Peter Chen Link: https://lore.kernel.org/r/20191030034046.188808-1-weiyongjun1@huawei.com Signed-off-by: Greg Kroah-Hartman
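
As a minimal illustration of the rule the last fix enforces, here is a sketch of initializing a dynamically allocated spinlock; struct demo_ctx and demo_ctx_alloc() are made-up names, not the configfs code:

    #include <linux/slab.h>
    #include <linux/spinlock.h>

    struct demo_ctx {                       /* made-up example structure */
            spinlock_t lock;
            unsigned int users;
    };

    static struct demo_ctx *demo_ctx_alloc(void)
    {
            struct demo_ctx *ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);

            if (!ctx)
                    return NULL;
            /* A dynamically allocated spinlock must be initialized before
             * first use; zeroed memory is not a valid lock, in particular
             * with lockdep enabled. This is the call the patch adds. */
            spin_lock_init(&ctx->lock);
            return ctx;
    }
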