The problem is currently page fault handler doesn't supports dirty bit
emulation of pmd for non-HW dirty-bit architecture so that application
stucks until VM marked the pmd dirty.
How the emulation work depends on the architecture. In case of arm64,
when it set up pte firstly, it sets pte PTE_RDONLY to get a chance to
mark the pte dirty via triggering page fault when store access happens.
Once the page fault occurs, VM marks the pmd dirty and arch code for
setting pmd will clear PTE_RDONLY for application to proceed.
IOW, if VM doesn't mark the pmd dirty, application hangs forever by
repeated fault(i.e., store op but the pmd is PTE_RDONLY).
This patch enables pmd dirty-bit emulation for those architectures.
[1] b8d3c4c3009d, mm/huge_memory.c: don't split THP page when MADV_FREE syscall is called
Fixes: b8d3c4c3009d ("mm/huge_memory.c: don't split THP page when MADV_FREE syscall is called") Link: http://lkml.kernel.org/r/1482506098-6149-1-git-send-email-minchan@kernel.org Signed-off-by: Minchan Kim <minchan@kernel.org> Reported-by: Andreas Schwab <schwab@suse.de> Tested-by: Andreas Schwab <schwab@suse.de> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Jason Evans <je@fb.com> Cc: Will Deacon <will.deacon@arm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The crux of the problem is that once we insert a 4k zero page, all
locking from then on is done in terms of that 4k zero page and any
additional threads sleeping on the empty DAX entry will never be woken.
Fix this by waking all sleepers when we replace the DAX radix tree entry
with a 4k zero page. This will allow all sleeping threads to
successfully transition from locking based on the DAX empty entry to
locking on the 4k zero page.
With the test case reported by Xiong this happens very regularly in my
test setup, with some runs resulting in 9+ threads in this deadlocked
state. With this fix I've been able to run that same test dozens of
times in a loop without issue.
Fixes: ac401cc78242 ("dax: New fault locking") Link: http://lkml.kernel.org/r/1483479365-13607-1-git-send-email-ross.zwisler@linux.intel.com Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> Reported-by: Xiong Zhou <xzhou@redhat.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
zram has used per-cpu stream feature from v4.7. It aims for increasing
cache hit ratio of scratch buffer for compressing. Downside of that
approach is that zram should ask memory space for compressed page in
per-cpu context which requires stricted gfp flag which could be failed.
If so, it retries to allocate memory space out of per-cpu context so it
could get memory this time and compress the data again, copies it to the
memory space.
In this scenario, zram assumes the data should never be changed but it is
not true without stable page support. So, If the data is changed under
us, zram can make buffer overrun so that zsmalloc free object chain is
broken so system goes crash like below
https://bugzilla.suse.com/show_bug.cgi?id=997574
This patch adds BDI_CAP_STABLE_WRITES to zram for declaring "I am block
device needing *stable write*".
Fixes: da9556a2367c ("zram: user per-cpu compression streams") Link: http://lkml.kernel.org/r/1482366980-3782-4-git-send-email-minchan@kernel.org Signed-off-by: Minchan Kim <minchan@kernel.org> Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Takashi Iwai <tiwai@suse.de> Cc: Hyeoncheol Lee <cheol.lee@lge.com> Cc: <yjay.kim@lge.com> Cc: Sangseok Lee <sangseok.lee@lge.com> Cc: Hugh Dickins <hughd@google.com> Cc: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Commit b4c5c60920e3 ("zram: avoid lockdep splat by revalidate_disk")
moved revalidate_disk call out of init_lock to avoid lockdep
false-positive splat. However, commit 08eee69fcf6b ("zram: remove
init_lock in zram_make_request") removed init_lock in IO path so there
is no worry about lockdep splat. So, let's restore it.
This patch is needed to set BDI_CAP_STABLE_WRITES atomically in next
patch.
Fixes: da9556a2367c ("zram: user per-cpu compression streams") Link: http://lkml.kernel.org/r/1482366980-3782-3-git-send-email-minchan@kernel.org Signed-off-by: Minchan Kim <minchan@kernel.org> Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Takashi Iwai <tiwai@suse.de> Cc: Hyeoncheol Lee <cheol.lee@lge.com> Cc: <yjay.kim@lge.com> Cc: Sangseok Lee <sangseok.lee@lge.com> Cc: Hugh Dickins <hughd@google.com> Cc: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Nothing in this minimal script seems to require bash. We often run these
tests on embedded devices where the only shell available is the busybox
ash. Use sh instead.
Nothing in this minimal script seems to require bash. We often run these
tests on embedded devices where the only shell available is the busybox
ash. Use sh instead.
A recent cleanup changed the kmalloc() + copy_from_user() to
memdup_user() but the error handling wasn't updated so we might call
kfree(-EFAULT) and crash.
Fixes: a6e3918bcdb1 ('GPU-DRM-Savage: Use memdup_user() rather than duplicating') Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch> Link: http://patchwork.freedesktop.org/patch/msgid/20161012062227.GU12841@mwanda Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
If the allocation fails the current code returns success. If
copy_from_user() fails it returns the number of bytes remaining instead
of -EFAULT.
Fixes: d5b1a78a772f ("drm/vc4: Add support for drawing 3D frames.") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Eric Anholt <eric@anholt.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The maximum supported voltage for ldo_io# is 3.3V, but on cold boot
the selector comes up at 0x1f, which maps to 3.8V. This was previously
corrected by Allwinner's U-boot, which set all regulators on the PMICs
to some pre-configured voltage. With recent progress in U-boot SPL
support, this is no longer the case. In any case we should handle
this quirk in the kernel driver as well.
This invalid setting causes _regulator_get_voltage() to fail with -EINVAL
which causes regulator registration to fail when constrains are used:
[ 1.054181] vcc-pg: failed to get the current voltage(-22)
[ 1.059670] axp20x-regulator axp20x-regulator.0: Failed to register ldo_io0
[ 1.069749] axp20x-regulator: probe of axp20x-regulator.0 failed with error -22
This commits makes the axp20x regulator driver accept the 0x1f register
value, fixing this.
The datasheet does not guarantee reliable operation above 3.3V, so on
boards where this regulator is used the regulator-max-microvolt setting
must be 3.3V or less.
This is essentially the same as the commit f40d4896bf32 ("regulator:
axp20x: Fix axp22x ldo_io registration error on cold boot") for AXP22x
PMICs.
Fixes: a51f9f4622a3 ("regulator: axp20x: support AXP809 variant") Signed-off-by: Chen-Yu Tsai <wens@csie.org> Signed-off-by: Mark Brown <broonie@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The BUCK regulators 3, 4, and 5 also have a 10mV step mode,
adjust the tables and logic to reflect the data-sheet for
these regulators.
fixes: d2a2e729a666 ("regulator: tps65086: Add regulator driver for the TPS65086 PMIC") Signed-off-by: Andrew F. Davis <afd@ti.com> Signed-off-by: Mark Brown <broonie@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
On some SoC there are no simple mapping of pins to bias register bits
and a lookup table is needed. This logic is already implemented in some
SoC specific drivers that could benefit from a generic implementation.
Add helpers to deal with the lookup which later can be used by the SoC
specific drivers. The logic used to lookup are different from the one it
aims to replace, this is intentional. This new method reduces the memory
consumption at the cost of increased CPU usage and fix a bug where a
WARN() would incorrectly be triggered if the register offset is 0.
This is due to the WARN() check if the reg field of the pullups struct
is zero, and this should be 0 for pins controlled by the PUEN0/PUD0
registers since PU0 is defined as 0. Change the data structure and use
the generic sh_pfc_pin_to_bias_info() function to get the register
offset and bit information.
These drivers need to be able to reference "struct ieee80211_hw" from
the driver's private data, and vice versa. The USB driver failed to
store the address of ieee80211_hw in the private data. Although this
bug has been present for a long time, it was not exposed until
commit ba9f93f82aba ("rtlwifi: Fix enter/exit power_save").
Fixes: ba9f93f82aba ("rtlwifi: Fix enter/exit power_save") Signed-off-by: Larry Finger <Larry.Finger@lwfinger.net> Signed-off-by: Kalle Valo <kvalo@codeaurora.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
In commit a5ffbe0a1993 ("rtlwifi: Fix scheduling while atomic bug") and
commit a269913c52ad ("rtlwifi: Rework rtl_lps_leave() and rtl_lps_enter()
to use work queue"), an error was introduced in the power-save routines
due to the fact that leaving PS was delayed by the use of a work queue.
This problem is fixed by detecting if the enter or leave routines are
in interrupt mode. If so, the workqueue is used to place the request.
If in normal mode, the enter or leave routines are called directly.
Fixes: a269913c52ad ("rtlwifi: Rework rtl_lps_leave() and rtl_lps_enter() to use work queue") Reported-by: Ping-Ke Shih <pkshih@realtek.com> Signed-off-by: Larry Finger <Larry.Finger@lwfinger.net> Cc: Stable <stable@vger.kernel.org> Signed-off-by: Kalle Valo <kvalo@codeaurora.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
drm/i915: Wait up to 3ms for the pcu to ack the cdclk change request on SKL
increased the timeout to match the spec, but we still see a timeout on
at least one SKL. A CDCLK change request following the failed one will
succeed nevertheless.
I could reproduce this problem easily by running kms_pipe_crc_basic in a
loop. In all failure cases _wait_for() was pre-empted for >3ms and so in
the worst case - when the pre-emption happened right after calculating
timeout__ in _wait_for() - we called skl_cdclk_wait_for_pcu_ready() only
once which failed and so _wait_for() timed out. As opposed to this the
spec says to keep retrying the request for at most a 3ms period.
To fix this send the first request explicitly to guarantee that there is
3ms between the first and last request. Though this matches the spec, I
noticed that in rare cases this can still time out if we sent only a few
requests (in the worst case 2) _and_ PCODE is busy for some reason even
after a previous request and a 3ms delay. To work around this retry the
polling with pre-emption disabled to maximize the number of requests.
Also increase the timeout to 10ms to account for interrupts that could
reduce the number of requests. With this change I couldn't trigger
the problem.
v2:
- Use 1ms poll period instead of 10us. (Chris)
v3:
- Poll with pre-emption disabled to increase the number of request
attempts. (Ville, Chris)
- Factor out a helper to poll, it's also needed by the next patch.
v4:
- Pass reply_mask, reply to skl_pcode_request(), instead of assuming the
reply is generic. (Ville)
v5:
- List the request specific timeout values as code comment. (Ville)
v6:
- Try the poll first with preemption enabled.
- Add code comment about first request being queued by PCODE. (Art)
- Add timeout_base_ms argument. (Ville)
v7:
- Clarify code comment about first queued request. (Chris)
Cc: Ville Syrjälä <ville.syrjala@linux.intel.com> Cc: Chris Wilson <chris@chris-wilson.co.uk> Cc: Art Runyan <arthur.j.runyan@intel.com> Cc: <stable@vger.kernel.org> # v4.2- : 3b2c171 : drm/i915: Wait up to 3ms Cc: <stable@vger.kernel.org> # v4.2- Fixes: 5d96d8afcfbb ("drm/i915/skl: Deinit/init the display at suspend/resume")
Reference: https://bugs.freedesktop.org/show_bug.cgi?id=97929
Testcase: igt/kms_pipe_crc_basic/suspend-read-crc-pipe-B Signed-off-by: Imre Deak <imre.deak@intel.com> Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk> Link: http://patchwork.freedesktop.org/patch/msgid/1480955258-26311-1-git-send-email-imre.deak@intel.com
(cherry picked from commit a0b8a1fe34430c3a82258e8cb45f5968bdf31afd) Signed-off-by: Jani Nikula <jani.nikula@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Plantronics BT600 does not support reading the sample rate which leads
to many lines of "cannot get freq at ep 0x1" and "cannot get freq at
ep 0x82". This patch adds the USB ID of the BT600 to quirks.c and
avoids those error messages.
The calculation of SPR and SPPR doesn't round correctly at several
places which might result in baud rates that are too big. For example
with tclk_hz = 250000001 and target rate 25000000 it determined a
divider of 10 which is wrong.
Instead of fixing all the corner cases replace the calculation by an
algorithm without a loop which should even be quicker to execute apart
from being correct.
Fixes: df59fa7f4bca ("spi: orion: support armada extended baud rates") Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de> Signed-off-by: Mark Brown <broonie@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The commit 55ee7017ee31 ("arm: omap2: board-generic: use
omap4_local_timer_init for AM437x") unintentionally changes the
clocksource devices for AM437x from OMAP GP Timer to SyncTimer32K.
Unfortunately, the SyncTimer32K is starving from frequency deviation
as mentioned in commit 5b5c01359152 ("ARM: OMAP2+: AM43x: Use gptimer
as clocksource") and, as reported by Franklin [1], even its monotonic
nature is under question (most probably there is a HW issue, but it's
still under investigation).
Taking into account above facts It's reasonable to rollback to the use
of omap3_gptimer_timer_init().
Unlike clk_register_clkdev(), clk_hw_register_clkdev() doesn't check for
passed error objects from a previous registration call. Hence the caller
of clk_hw_register_*() has to check for errors before calling
clk_hw_register_clkdev*().
Make clk_hw_register_clkdev() more similar to clk_register_clkdev() by
adding this error check, removing the burden from callers that do mass
registration.
Fixes: e4f1b49bda6d6aa2 ("clkdev: Add clk_hw based registration APIs") Fixes: 944b9a41e004534f ("clk: ls1x: Migrate to clk_hw based OF and registration APIs") Fixes: 44ce9a9ae977736f ("MIPS: TXx9: Convert to Common Clock Framework") Fixes: f48d947a162dfa9d ("clk: clps711x: Migrate to clk_hw based OF and registration APIs") Fixes: b4626a7f489238a5 ("CLK: Add Loongson1C clock support") Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
It's CONFIG_SOC_OMAP5, not CONFIG_ARCH_OMAP5. Looks like make randconfig
builds have not hit this one yet.
Fixes: b3bf289c1c45 ("ARM: OMAP2+: Fix build with CONFIG_SMP and CONFIG_PM is not set") Acked-by: Santosh Shilimkar <ssantosh@kernel.org> Signed-off-by: Tony Lindgren <tony@atomide.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
We need to properly initialize mpuss also on omap5 like we do on omap4.
Otherwise we run into similar kexec problems like we had on omap4 when
trying to kexec from a kernel with PM initialized.
Fixes: 0573b957fc21 ("ARM: OMAP4+: Prevent CPU1 related hang with kexec") Acked-by: Santosh Shilimkar <ssantosh@kernel.org> Signed-off-by: Tony Lindgren <tony@atomide.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Fixes: 39a9beab5acb ('rpc: share one xps between all backchannels') Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
On Zynq, we haven't been reserving the correct amount of DMA-incapable
RAM to keep DMA away from it (per the Zynq TRM Section 4.1, it should be
the first 512k). In older kernels, this was masked by the
memblock_reserve call in arm_memblock_init(). Now, reserve the correct
amount excplicitly rather than relying on swapper_pg_dir, which is an
address and not a size anyway.
Fixes: 46f5b96 ("ARM: zynq: Reserve not DMAable space in front of the kernel") Signed-off-by: Kyle Roeschley <kyle.roeschley@ni.com> Tested-by: Nathan Rossi <nathan@nathanrossi.com> Signed-off-by: Michal Simek <michal.simek@xilinx.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
bcm2837-rpi-3-b.dts, its only in-tree user, was overriding it as
"brcm,bcm2837" already.
Fixes: 9d56c22a7861 ("ARM: bcm2835: Add devicetree for the Raspberry Pi 3.") Cc: Stephen Warren <swarren@wwwdotorg.org> Signed-off-by: Andreas Färber <afaerber@suse.de> Signed-off-by: Eric Anholt <eric@anholt.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
We are incorrectly defining the pwr LED, attaching it to a gpio line
that is wired to the Wi-Fi SDIO module (which fails due to this).
The actual power LED is connected to the GPIO expander, which we don't
expose currently.
Fixes: 9d56c22a7861 ("ARM: bcm2835: Add devicetree for the Raspberry Pi 3.") Thanks-to: Eric Anholt <eric@anholt.net> [for clarifying we can't control the LED] Signed-off-by: Andrea Merello <andrea.merello@gmail.com> Signed-off-by: Eric Anholt <eric@anholt.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The virtio tools implementation of READ_ONCE() has a single parameter called
'var', but erroneously refers to 'val' for its cast, and thus won't work unless
there's a variable of the correct type that happens to be called 'var'.
Fix this with s/var/val/, making READ_ONCE() work as expected regardless.
Fixes: a7c490333df3cff5 ("tools/virtio: use virt_xxx barriers") Signed-off-by: Mark Rutland <mark.rutland@arm.com> Cc: Jason Wang <jasowang@redhat.com> Cc: Michael S. Tsirkin <mst@redhat.com> Cc: linux-kernel@vger.kernel.org Cc: virtualization@lists.linux-foundation.org Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com> Reviewed-by: Jason Wang <jasowang@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
I am getting the following warning when I build kernel 4.9-git on my
PowerBook G4 with a 32-bit PPC processor:
AS arch/powerpc/kernel/misc_32.o
arch/powerpc/kernel/misc_32.S:299:7: warning: "CONFIG_FSL_BOOKE" is not defined [-Wundef]
This problem is evident after commit 989cea5c14be ("kbuild: prevent
lib-ksyms.o rebuilds"); however, this change in kbuild only exposes an
error that has been in the code since 2005 when this source file was
created. That was with commit 9994a33865f4 ("powerpc: Introduce
entry_{32,64}.S, misc_{32,64}.S, systbl.S").
The offending line does not make a lot of sense. This error does not
seem to cause any errors in the executable, thus I am not recommending
that it be applied to any stable versions.
Thanks to Nicholas Piggin for suggesting this solution.
Fixes: 9994a33865f4 ("powerpc: Introduce entry_{32,64}.S, misc_{32,64}.S, systbl.S") Signed-off-by: Larry Finger <Larry.Finger@lwfinger.net> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: linuxppc-dev@lists.ozlabs.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Frank reported that vrf devices can be created with a table id of 0.
This breaks many of the run time table id checks and should not be
allowed. Detect this condition at create time and fail with EINVAL.
Fixes: 193125dbd8eb ("net: Introduce VRF device driver") Reported-by: Frank Kellermann <frank.kellermann@atos.net> Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
fib_select_path does not call fib_select_multipath if oif is set in the
flow struct. For VRF use cases oif is always set, so multipath route
selection is bypassed. Use the FLOWI_FLAG_SKIP_NH_OIF to skip the oif
check similar to what is done in fib_table_lookup.
Add saddr and proto to the flow struct for the fib lookup done by the
VRF driver to better match hash computation for a flow.
Fixes: 613d09b30f8b ("net: Use VRF device index for lookups on TX") Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
When trying to do interface down or changing interface configuration
under heavy traffic, some of the adaptive moderation corner cases can
occur and leave a WARN_ONCE call trace in the kernel log.
Those WARN_ONCE are meant for debug only, and should have been inserted
only under debug. We avoid such call traces by removing those WARN_ONCE.
Fixes: cb3c7fd4f839 ("net/mlx5e: Support adaptive RX coalescing") Signed-off-by: Gil Rockah <gilr@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The GRO fast path caches the frag0 address. This address becomes
invalid if frag0 is modified by pskb_may_pull or its variants.
So whenever that happens we must disable the frag0 optimization.
This is usually done through the combination of gro_header_hard
and gro_header_slow, however, the IPv6 extension header path did
the pulling directly and would continue to use the GRO fast path
incorrectly.
This patch fixes it by disabling the fast path when we enter the
IPv6 extension header path.
Fixes: 78a478d0efd9 ("gro: Inline skb_gro_header and cache frag0 virtual address") Reported-by: Slava Shwartsman <slavash@mellanox.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
On 32bit arches, (skb->end - skb->data) is not 'unsigned int',
so we shall use min_t() instead of min() to avoid a compiler error.
Fixes: 1272ce87fa01 ("gro: Enter slow-path if there is no tailroom") Reported-by: kernel test robot <fengguang.wu@intel.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The GRO path has a fast-path where we avoid calling pskb_may_pull
and pskb_expand by directly accessing frag0. However, this should
only be done if we have enough tailroom in the skb as otherwise
we'll have to expand it later anyway.
This patch adds the check by capping frag0_len with the skb tailroom.
Fixes: cb18978cbf45 ("gro: Open-code final pskb_may_pull") Reported-by: Slava Shwartsman <slavash@mellanox.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Commit bdabad3e363d ("net: Add Qualcomm IPC router") introduced a
new address family. Update the family name tables accordingly so
that the lockdep initialization can use the proper names for this
family.
Cc: Courtney Cavin <courtney.cavin@sonymobile.com> Cc: Bjorn Andersson <bjorn.andersson@linaro.org> Signed-off-by: Suman Anna <s-anna@ti.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
It is perfectly possible to have non zero indexed switches being present
in a DSA switch tree, in such a case, we will be deferencing a NULL
pointer while dsa_cpu_port_ethtool_{setup,restore}. Be more defensive
and ensure that dst->ds[0] is valid before doing anything with it.
Fixes: 0c73c523cf73 ("net: dsa: Initialize CPU port ethtool ops per tree") Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Reviewed-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Pause the rx and make sure the rx fifo is empty when the autosuspend
occurs.
If the rx data comes when the driver is canceling the rx urb, the host
controller would stop getting the data from the device and continue
it after next rx urb is submitted. That is, one continuing data is
split into two different urb buffers. That let the driver take the
data as a rx descriptor, and unexpected behavior happens.
Signed-off-by: Hayes Wang <hayeswang@realtek.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Split rtl8152_suspend() into rtl8152_system_suspend() and
rtl8152_rumtime_suspend().
Signed-off-by: Hayes Wang <hayeswang@realtek.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
We make the bcm_sf2 driver override ds->ops which points to
b53_switch_ops since b53_switch_alloc() did the assignent. This is all
well and good until a second b53 switch comes in, and ends up using the
bcm_sf2 operations. Make a proper local copy, substitute the ds->ops
pointer and then override the operations.
Fixes: f458995b9ad8 ("net: dsa: bcm_sf2: Utilize core B53 driver when possible") Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
If after too many passes still no image could be emitted, then
swap back to the original program as we do in all other cases
and don't use the one with blinding.
Fixes: 959a75791603 ("bpf, x86: add support for constant blinding") Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The move from rx-handler to L3 receive handler inadvertantly dropped the
rx counters. Restore them.
Fixes: 74b20582ac38 ("net: l3mdev: Add hook in ip and ipv6") Reported-by: Dinesh Dutt <ddutt@cumulusnetworks.com> Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
In the case of custom rules being present we need to handle the case of the
LOCAL table being intialized after the new rule has been added. To address
that I am adding a new check so that we can make certain we don't use an
alias of MAIN for LOCAL when allocating a new table.
Fixes: 0ddcf43d5d4a ("ipv4: FIB Local/MAIN table collapse") Reported-by: Oliver Brunel <jjk@jjacky.com> Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
When a system receives a Query, it does not respond immediately.
Instead, it delays its response by a random amount of time, bounded
by the Max Resp Time value derived from the Max Resp Code in the
received Query message. A system may receive a variety of Queries on
different interfaces and of different kinds (e.g., General Queries,
Group-Specific Queries, and Group-and-Source-Specific Queries), each
of which may require its own delayed response.
Before scheduling a response to a Query, the system must first
consider previously scheduled pending responses and in many cases
schedule a combined response. Therefore, the system must be able to
maintain the following state:
o A timer per interface for scheduling responses to General Queries.
o A per-group and interface timer for scheduling responses to Group-
Specific and Group-and-Source-Specific Queries.
o A per-group and interface list of sources to be reported in the
response to a Group-and-Source-Specific Query.
When a new Query with the Router-Alert option arrives on an
interface, provided the system has state to report, a delay for a
response is randomly selected in the range (0, [Max Resp Time]) where
Max Resp Time is derived from Max Resp Code in the received Query
message. The following rules are then used to determine if a Report
needs to be scheduled and the type of Report to schedule. The rules
are considered in order and only the first matching rule is applied.
1. If there is a pending response to a previous General Query
scheduled sooner than the selected delay, no additional response
needs to be scheduled.
2. If the received Query is a General Query, the interface timer is
used to schedule a response to the General Query after the
selected delay. Any previously pending response to a General
Query is canceled.
--8<--
Currently the timer is rearmed with new random expiration time for
every incoming query regardless of possibly already pending report.
Which is not aligned with the above RFE.
It also might happen that higher rate of incoming queries can
postpone the report after the expiration time of the first query
causing group membership loss.
Now the per interface general query timer is rearmed only
when there is no pending report already scheduled on that interface or
the newly selected expiration time is before the already pending
scheduled report.
Signed-off-by: Michal Tesar <mtesar@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
__skb_flow_dissect can be called with a skb or a data packet, either
can be NULL. All calls seems to have been moved to __skb_header_pointer
except the pptp handling which is still calling skb_header_pointer.
Fixes: ab10dccb1160 ("rps: Inspect PPTP encapsulated by GRE to get flow hash") Signed-off-by: Ian Kumlien <ian.kumlien@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Final nlmsg_len field update must reflect inserted net_dm_drop_point
data.
This patch depends on previous patch:
"drop_monitor: add missing call to genlmsg_end"
Signed-off-by: Reiter Wolfgang <wr0112358@gmail.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Update nlmsg_len field with genlmsg_end to enable userspace processing
using nlmsg_next helper. Also adds error handling.
Signed-off-by: Reiter Wolfgang <wr0112358@gmail.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
IPv4 output routes already use l3mdev device instead of loopback for dst's
if it is applicable. Change local input routes to do the same.
This fixes icmp responses for unreachable UDP ports which are directed
to the wrong table after commit 9d1a6c4ea43e4 because local_input
routes use the loopback device. Moving from ingress device to loopback
loses the L3 domain causing responses based on the dst to get to lost.
Fixes: 9d1a6c4ea43e4 ("net: icmp_route_lookup should use rt dev to
determine L3 domain") Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
When we send a packet for our own local address on a non-loopback
interface (e.g. eth0), due to the change had been introduced from
commit 0b922b7a829c ("net: original ingress device index in PKTINFO"), the
original ingress device index would be set as the loopback interface.
However, the packet should be considered as if it is being arrived via the
sending interface (eth0), otherwise it would break the expectation of the
userspace application (e.g. the DHCPRELEASE message from dhcp_release
binary would be ignored by the dnsmasq daemon, since it come from lo which
is not the interface dnsmasq bind to)
Fixes: 0b922b7a829c ("net: original ingress device index in PKTINFO") Acked-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: Wei Zhang <asuka.com@163.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Disable netdev should come after it was closed, although no harm of doing it
before -hence the MLX5E_STATE_DESTROYING bit- but it is more natural this way.
Fixes: 26e59d8077a3 ("net/mlx5e: Implement mlx5e interface attach/detach callbacks") Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Reviewed-by: Mohamad Haj Yahia <mohamad@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Skip setting netdev vxlan ports and netdev rx_mode on driver load
when netdev is not yet registered.
Synchronizing with netdev state is needed only on reset flow where the
netdev remains registered for the whole reset period.
This also fixes an access before initialization of net_device.addr_list_lock
- which for some reason initialized on register_netdev - where we queued
set_rx_mode work on driver load before netdev registration.
Fixes: 26e59d8077a3 ("net/mlx5e: Implement mlx5e interface attach/detach callbacks") Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Reported-by: Sebastian Ott <sebott@linux.vnet.ibm.com> Reviewed-by: Mohamad Haj Yahia <mohamad@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
We need to mask the destination mac value with the destination mac
mask when adding steering rule via ethtool.
Fixes: 1174fce8d1410 ('net/mlx5e: Support l3/l4 flow type specs in ethtool flow steering') Signed-off-by: Maor Gottlieb <maorg@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Avoid using a local variable named numa_node to avoid shadowing a public
one.
Fixes: db058a186f98 ('net/mlx5_core: Set irq affinity hints') Signed-off-by: Eli Cohen <eli@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
If there is pending delayed work for health recovery it must be canceled
if the device is being unloaded.
Fixes: 05ac2c0b7438 ("net/mlx5: Fix race between PCI error handlers and health work") Signed-off-by: Daniel Jurgens <danielj@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Since we now use a non zero mask on addr_type, we are matching on its
value (IPV4/IPV6). So before this fix, matching on enc_src_ip/enc_dst_ip
failed in SW/classify path since its value was zero.
This patch sets the proper value of addr_type for encapsulated packets.
Fixes: 970bfcd09791 ('net/sched: cls_flower: Use mask for addr_type') Signed-off-by: Paul Blakey <paulb@mellanox.com> Reviewed-by: Hadar Hen Zion <hadarh@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
There is currently a small window during which the network device registered by
stmmac can be made visible, yet all resources, including and clock and MDIO bus
have not had a chance to be set up, this can lead to the following error to
occur:
[ 473.919358] stmmaceth 0000:01:00.0 (unnamed net_device) (uninitialized):
stmmac_dvr_probe: warning: cannot get CSR clock
[ 473.919382] stmmaceth 0000:01:00.0: no reset control found
[ 473.919412] stmmac - user ID: 0x10, Synopsys ID: 0x42
[ 473.919429] stmmaceth 0000:01:00.0: DMA HW capability register supported
[ 473.919436] stmmaceth 0000:01:00.0: RX Checksum Offload Engine supported
[ 473.919443] stmmaceth 0000:01:00.0: TX Checksum insertion supported
[ 473.919451] stmmaceth 0000:01:00.0 (unnamed net_device) (uninitialized):
Enable RX Mitigation via HW Watchdog Timer
[ 473.921395] libphy: PHY stmmac-1:00 not found
[ 473.921417] stmmaceth 0000:01:00.0 eth0: Could not attach to PHY
[ 473.921427] stmmaceth 0000:01:00.0 eth0: stmmac_open: Cannot attach to
PHY (error: -19)
[ 473.959710] libphy: stmmac: probed
[ 473.959724] stmmaceth 0000:01:00.0 eth0: PHY ID 01410cc2 at 0 IRQ POLL
(stmmac-1:00) active
[ 473.959728] stmmaceth 0000:01:00.0 eth0: PHY ID 01410cc2 at 1 IRQ POLL
(stmmac-1:01)
[ 473.959731] stmmaceth 0000:01:00.0 eth0: PHY ID 01410cc2 at 2 IRQ POLL
(stmmac-1:02)
[ 473.959734] stmmaceth 0000:01:00.0 eth0: PHY ID 01410cc2 at 3 IRQ POLL
(stmmac-1:03)
Fix this by making sure that register_netdev() is the last thing being done,
which guarantees that the clock and the MDIO bus are available.
Fixes: 4bfcbd7abce2 ("stmmac: Move the mdio_register/_unregister in probe/remove") Reported-by: Kweh, Hock Leong <hock.leong.kweh@intel.com> Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Shahar reported a soft lockup in tc_classify(), where we run into an
endless loop when walking the classifier chain due to tp->next == tp
which is a state we should never run into. The issue only seems to
trigger under load in the tc control path.
What happens is that in tc_ctl_tfilter(), thread A allocates a new
tp, initializes it, sets tp_created to 1, and calls into tp->ops->change()
with it. In that classifier callback we had to unlock/lock the rtnl
mutex and returned with -EAGAIN. One reason why we need to drop there
is, for example, that we need to request an action module to be loaded.
This happens via tcf_exts_validate() -> tcf_action_init/_1() meaning
after we loaded and found the requested action, we need to redo the
whole request so we don't race against others. While we had to unlock
rtnl in that time, thread B's request was processed next on that CPU.
Thread B added a new tp instance successfully to the classifier chain.
When thread A returned grabbing the rtnl mutex again, propagating -EAGAIN
and destroying its tp instance which never got linked, we goto replay
and redo A's request.
This time when walking the classifier chain in tc_ctl_tfilter() for
checking for existing tp instances we had a priority match and found
the tp instance that was created and linked by thread B. Now calling
again into tp->ops->change() with that tp was successful and returned
without error.
tp_created was never cleared in the second round, thus kernel thinks
that we need to link it into the classifier chain (once again). tp and
*back point to the same object due to the match we had earlier on. Thus
for thread B's already public tp, we reset tp->next to tp itself and
link it into the chain, which eventually causes the mentioned endless
loop in tc_classify() once a packet hits the data path.
Fix is to clear tp_created at the beginning of each request, also when
we replay it. On the paths that can cause -EAGAIN we already destroy
the original tp instance we had and on replay we really need to start
from scratch. It seems that this issue was first introduced in commit 12186be7d2e1 ("net_cls: fix unconfigured struct tcf_proto keeps chaining
and avoid kernel panic when we use cls_cgroup").
Fixes: 12186be7d2e1 ("net_cls: fix unconfigured struct tcf_proto keeps chaining and avoid kernel panic when we use cls_cgroup") Reported-by: Shahar Klein <shahark@mellanox.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Cc: Cong Wang <xiyou.wangcong@gmail.com> Acked-by: Eric Dumazet <edumazet@google.com> Tested-by: Shahar Klein <shahark@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Dave Jones <davej@codemonkey.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Socket cmsg IP(V6)_RECVORIGDSTADDR checks that port range lies within
the packet. For sockets that have transport headers pulled, transport
offset can be negative. Use signed comparison to avoid overflow.
Fixes: e6afc8ace6dd ("udp: remove headers from UDP packets before queueing") Reported-by: Nisar Jagabar <njagabar@cloudmark.com> Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Prior to this patch, sctp_transport_lookup_process didn't rcu_read_unlock
when it failed to find a transport by sctp_addrs_lookup_transport.
This patch is to fix it by moving up rcu_read_unlock right before checking
transport and also to remove the out path.
Fixes: 1cceda784980 ("sctp: fix the issue sctp_diag uses lock_sock in rcu_read_lock") Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Locally originated traffic in a VRF fails in the presence of a POSTROUTING
rule. For example,
$ iptables -t nat -A POSTROUTING -s 11.1.1.0/24 -j MASQUERADE
$ ping -I red -c1 11.1.1.3
ping: Warning: source address might be selected on device other than red.
PING 11.1.1.3 (11.1.1.3) from 11.1.1.2 red: 56(84) bytes of data.
ping: sendmsg: Operation not permitted
Worse, the above causes random corruption resulting in a panic in random
places (I have not seen a consistent backtrace).
Call nf_reset to drop the conntrack info following the pass through the
VRF device. The nf_reset is needed on Tx but not Rx because of the order
in which NF_HOOK's are hit: on Rx the VRF device is after the real ingress
device and on Tx it is is before the real egress device. Connection
tracking should be tied to the real egress device and not the VRF device.
Fixes: 8f58336d3f78a ("net: Add ethernet header for pass through VRF device") Fixes: 35402e3136634 ("net: Add IPv6 support to VRF device") Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Connection tracking with VRF is broken because the pass through the VRF
device drops the connection tracking info. Removing the call to nf_reset
allows DNAT and MASQUERADE to work across interfaces within a VRF.
Fixes: 73e20b761acf ("net: vrf: Add support for PREROUTING rules on vrf device") Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
ep->mult is supposed to be set to Isochronous and
Interrupt Endapoint's multiplier value. This value
is computed from different places depending on the
link speed.
If we're dealing with HighSpeed, then it's part of
bits [12:11] of wMaxPacketSize. This case wasn't
taken into consideration before.
While at that, also make sure the ep->mult defaults
to one so drivers can use it unconditionally and
assume they'll never multiply ep->maxpacket to zero.
Signed-off-by: Felipe Balbi <felipe.balbi@linux.intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
When a disfunctional timer, e.g. dummy timer, is installed, the tick core
tries to setup the broadcast timer.
If no broadcast device is installed, the kernel crashes with a NULL pointer
dereference in tick_broadcast_setup_oneshot() because the function has no
sanity check.
Reported-by: Mason <slash.tmp@free.fr> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Anna-Maria Gleixner <anna-maria@linutronix.de> Cc: Richard Cochran <rcochran@linutronix.de> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Daniel Lezcano <daniel.lezcano@linaro.org> Cc: Peter Zijlstra <peterz@infradead.org>, Cc: Sebastian Frias <sf84@laposte.net> Cc: Thibaud Cornic <thibaud_cornic@sigmadesigns.com> Cc: Robin Murphy <robin.murphy@arm.com> Link: http://lkml.kernel.org/r/1147ef90-7877-e4d2-bb2b-5c4fa8d3144b@free.fr Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
When the dummy timer callback is invoked before the real timer callbacks,
then it tries to install that timer for the starting CPU. If the platform
does not have a broadcast timer installed the installation fails with a
kernel crash. The crash happens due to a unconditional deference of the non
available broadcast device. This needs to be fixed in the timer core code.
But even when this is fixed in the core code then installing the dummy
timer before the real timers is a pointless exercise.
Move it to the end of the callback list.
Fixes: 00c1d17aab51 ("clocksource/dummy_timer: Convert to hotplug state machine") Reported-and-tested-by: Mason <slash.tmp@free.fr> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Anna-Maria Gleixner <anna-maria@linutronix.de> Cc: Richard Cochran <rcochran@linutronix.de> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Daniel Lezcano <daniel.lezcano@linaro.org> Cc: Peter Zijlstra <peterz@infradead.org>, Cc: Sebastian Frias <sf84@laposte.net> Cc: Thibaud Cornic <thibaud_cornic@sigmadesigns.com> Cc: Robin Murphy <robin.murphy@arm.com> Link: http://lkml.kernel.org/r/1147ef90-7877-e4d2-bb2b-5c4fa8d3144b@free.fr Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
max_retries _show and _store functions should test against cfg->max_retries,
not cfg->retry_timeout
Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
There is a race window between write_cache_pages calling
clear_page_dirty_for_io and XFS calling set_page_writeback, in which
the mapping for an inode is tagged neither as dirty, nor as writeback.
If the COW shrinker hits in exactly that window we'll remove the delayed
COW extents and writepages trying to write it back, which in release
kernels will manifest as corruption of the bmap btree, and in debug
kernels will trip the ASSERT about now calling xfs_bmapi_write with the
COWFORK flag for holes. A complex customer load manages to hit this
window fairly reliably, probably by always having COW writeback in flight
while the cow shrinker runs.
This patch adds another check for having the I_DIRTY_PAGES flag set,
which is still set during this race window. While this fixes the problem
I'm still not overly happy about the way the COW shrinker works as it
still seems a bit fragile.
Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
We need to use the actual AG length when making per-AG reservations,
since we could otherwise end up reserving more blocks out of the last
AG than there are actual blocks.
Complained-about-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Dan Carpenter reported a double-free of rcur if _defer_finish fails
while we're recovering CUI items. Fix the error recovery to prevent
this.
Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Use NOFS for allocating btree cursors, since they can be called
under the ilock.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
When we create a new attribute, we first create a shortform
attribute, and try to fit the new attribute into it.
If that fails, we copy the (empty) attribute into a leaf attribute,
and do the copy again. Thus there can be a transient state where
we have an empty leaf attribute.
If we encounter this during log replay, the verifier will fail.
So add a test to ignore this part of the leaf attr verification
during log replay.
Thanks as usual to dchinner for spotting the problem.
Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
After various discussions on linux-fsdevel, it has been decided that it
is not necessary to cap the length of a dedupe request, and that
correctly-written userspace client programs will be able to absorb the
change. Therefore, remove the length clamping behavior.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The on-disk field di_size is used to set i_size, which is a signed
integer of loff_t. If the high bit of di_size is set, we'll end up with
a negative i_size, which will cause all sorts of problems. Since the
VFS won't let us create a file with such length, we should catch them
here in the verifier too.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
We shouldn't assert if somehow we end up trying to add an attr fork to
an inode that apparently already has attr extents because this is an
indication of on-disk corruption. Instead, return an error code to
userspace.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
In xfs_dir3_data_read, we can encounter the situation where err == 0 and
*bpp == NULL if the given bno offset happens to be a hole; this leads to
a crash if we try to set the buffer type after the _da_read_buf call.
Holes can happen due to corrupt or malicious entries in the bmbt data,
so be a little more careful when we're handling buffers.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
When reading into memory all extents of a btree-format inode fork,
complain if the number of extents we find is not the same as the number
of extents reported in the inode core. This is needed to stop an IO
action from accessing the garbage areas of the in-core fork.
[dchinner: removed redundant assert]
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
When we're reading a btree block, make sure that what we retrieved
matches the owner and level; and has a plausible number of records.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
There is no such thing as a zero-level AG btree since even a single-node
zero-records btree has one level. Btree cursor constructors read
cur_nlevels straight from disk and then access things like
cur_bufs[cur_nlevels - 1] which is /really/ bad if cur_nlevels is zero!
Therefore, strengthen the verifiers to prevent this possibility.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
By inspection, xfs_bmap_trace_exlist isn't handling cow forks,
and will trace the data fork instead.
Fix this by setting state appropriately if whichfork
== XFS_COW_FORK.
()___()
< @ @ >
| |
{o_o}
(|)
Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
When xfs_bmap_trace_exlist called trace_xfs_extlist,
it sent in the "whichfork" var instead of the bmap "state"
as expected (even though state was already set up for this
purpose).
As a result, the xfs_bmap_class in tracing code used
"whichfork" not state in xfs_iext_state_to_fork(), and got
the wrong ifork pointer. It all goes downhill from
there, including an ASSERT when ifp_bytes is empty
by the time it reaches xfs_iext_get_ext():
Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
We've missed properly setting the buffer type for
an AGI transaction in 3 spots now, so just move it
into xfs_read_agi() and set it if we are in a transaction
to avoid the problem in the future.
This is similar to how it is done in i.e. the dir3
and attr3 read functions.
Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
xfs_file_iomap_begin_delay() implements post-eof speculative
preallocation by extending the block count of the requested delayed
allocation. Now that xfs_bmapi_reserve_delalloc() has been updated to
handle prealloc blocks separately and tag the inode, update
xfs_file_iomap_begin_delay() to use the new parameter and rely on the
former to tag the inode.
Note that this patch does not change behavior.
Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
And only lookup the previous extent inside xfs_iomap_prealloc_size
if we actually need it.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>