The following build error occured during a ia64 build with
swap-over-NFS patches applied.
net/core/sock.c:274:36: error: initializer element is not constant
net/core/sock.c:274:36: error: (near initialization for 'memalloc_socks')
net/core/sock.c:274:36: error: initializer element is not constant
This is identical to a parisc build error. Fengguang Wu, Mel Gorman
and James Bottomley did all the legwork to track the root cause of
the problem. This fix and entire commit log is shamelessly copied
from them with one extra detail to change a dubious runtime use of
ATOMIC_INIT() to atomic_set() in drivers/char/mspec.c
Dave Anglin says:
> Here is the line in sock.i:
>
> struct static_key memalloc_socks = ((struct static_key) { .enabled =
> ((atomic_t) { (0) }) });
The above line contains two compound literals. It also uses a designated
initializer to initialize the field enabled. A compound literal is not a
constant expression.
The location of the above statement isn't fully clear, but if a compound
literal occurs outside the body of a function, the initializer list must
consist of constant expressions.
Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
The downgrade of the 4 level page table created by init_new_context is
currently done only in start_thread31. If a 31 bit process forks the
new mm uses a 4 level page table, including the task size of 2<<42
that goes along with it. This is incorrect as now a 31 bit process
can map memory beyond 2GB. Define arch_dup_mmap to do the downgrade
after fork.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
[bwh: Backported to 3.2: adjust context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
The Intel case falls through into the generic case which then changes
the values. For cases like the P6 it doesn't do the right thing so
this seems to be a screwup.
The irq field of struct snd_mpu401 is supposed to be initialized to -1.
Since it's set to zero as of now, a probing error before the irq
installation results in a kernel warning "Trying to free already-free
IRQ 0".
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=44821 Signed-off-by: Takashi Iwai <tiwai@suse.de> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
We've got a bug report about the silent output from the headphone on a
mobo with VT2021, and spotted out that this was because of the wrong
D3 state on the DAC for the headphone output. The bug is triggered by
the incomplete check for this DAC in set_widgets_power_state_vt1718S().
It checks only the connectivity of the primary output (0x27) but
doesn't consider the path from the headphone pin (0x28).
Now this patch fixes the problem by checking both pins for DAC 0x0b.
Signed-off-by: Takashi Iwai <tiwai@suse.de>
[bwh: Backported to 3.2: keep using snd_hda_codec_write() as
update_power_state() is missing] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
The touchpad on the Acer Aspire One D250 will report out of range values
in the extreme lower portion of the touchpad. These appear as abrupt
changes in the values reported by the hardware from very low values to
very high values, which can cause unexpected vertical jumps in the
position of the mouse pointer.
What seems to be happening is that the value is wrapping to a two's
compliment negative value of higher resolution than the 13-bit value
reported by the hardware, with the high-order bits being truncated. This
patch adds handling for these values by converting them to the
appropriate negative values.
The only tricky part about this is deciding when to treat a number as
negative. It stands to reason that if out of range values can be
reported on the low end then it could also happen on the high end, so
not all out of range values should be treated as negative. The approach
taken here is to split the difference between the maximum legitimate
value for the axis and the maximum possible value that the hardware can
report, treating values greater than this number as negative and all
other values as positive. This can be tweaked later if hardware is found
that operates outside of these parameters.
BugLink: http://bugs.launchpad.net/bugs/1001251 Signed-off-by: Seth Forshee <seth.forshee@canonical.com> Reviewed-by: Daniel Kurtz <djkurtz@chromium.org> Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
[bwh: Backported to 3.2: adjust context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Line 0 and 1 were both written to line 0 (on the display) and all subsequent
lines had an offset of -1. The result was that the last line on the display
was never overwritten by writes to /dev/fbN.
The origin of this bug seems to have been udlfb.
Signed-off-by: Alexander Holler <holler@ahsoftware.de> Signed-off-by: Florian Tobias Schandinat <FlorianSchandinat@gmx.de> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
If uaddr == uaddr2, then we have broken the rule of only requeueing
from a non-pi futex to a pi futex with this call. If we attempt this,
as the trinity test suite manages to do, we miss early wakeups as
q.key is equal to key2 (because they are the same uaddr). We will then
attempt to dereference the pi_mutex (which would exist had the futex_q
been properly requeued to a pi futex) and trigger a NULL pointer
dereference.
The WARN_ON in futex_wait_requeue_pi() for a NULL q.pi_state was testing
the address (&q.pi_state) of the pointer instead of the value
(q.pi_state) of the pointer. Correct it accordingly.
wm831x devices contain a unique ID value. Feed this into the newly added
device_add_randomness() to add some per device seed data to the pool.
Signed-off-by: Mark Brown <broonie@opensource.wolfsonmicro.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
The tamper evident features of the RTC include the "write counter" which
is a pseudo-random number regenerated whenever we set the RTC. Since this
value is unpredictable it should provide some useful seeding to the random
number generator.
Only do this on boot since the goal is to seed the pool rather than add
useful entropy.
Signed-off-by: Mark Brown <broonie@opensource.wolfsonmicro.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Create a new function, get_random_bytes_arch() which will use the
architecture-specific hardware random number generator if it is
present. Change get_random_bytes() to not use the HW RNG, even if it
is avaiable.
The reason for this is that the hw random number generator is fast (if
it is present), but it requires that we trust the hardware
manufacturer to have not put in a back door. (For example, an
increasing counter encrypted by an AES key known to the NSA.)
It's unlikely that Intel (for example) was paid off by the US
Government to do this, but it's impossible for them to prove otherwise
If the CPU supports a hardware random number generator, use it in
xfer_secondary_pool(), where it will significantly improve things and
where we can afford it.
Also, remove the use of the arch-specific rng in
add_timer_randomness(), since the call is significantly slower than
get_cycles(), and we're much better off using it in
xfer_secondary_pool() anyway.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Add a new interface, add_device_randomness() for adding data to the
random pool that is likely to differ between two devices (or possibly
even per boot). This would be things like MAC addresses or serial
numbers, or the read-out of the RTC. This does *not* add any actual
entropy to the pool, but it initializes the pool to different values
for devices that might otherwise be identical and have very little
entropy available to them (particularly common in the embedded world).
[ Modified by tytso to mix in a timestamp, since there may be some
variability caused by the time needed to detect/configure the hardware
in question. ]
The real-time Linux folks don't like add_interrupt_randomness() taking
a spinlock since it is called in the low-level interrupt routine.
This also allows us to reduce the overhead in the fast path, for the
random driver, which is the interrupt collection path.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
We've been moving away from add_interrupt_randomness() for various
reasons: it's too expensive to do on every interrupt, and flooding the
CPU with interrupts could theoretically cause bogus floods of entropy
from a somewhat externally controllable source.
This solves both problems by limiting the actual randomness addition
to just once a second or after 64 interrupts, whicever comes first.
During that time, the interrupt cycle data is buffered up in a per-cpu
pool. Also, we make sure the the nonblocking pool used by urandom is
initialized before we start feeding the normal input pool. This
assures that /dev/urandom is returning unpredictable data as soon as
possible.
(Based on an original patch by Linus, but significantly modified by
tytso.)
Tested-by: Eric Wustrow <ewust@umich.edu> Reported-by: Eric Wustrow <ewust@umich.edu> Reported-by: Nadia Heninger <nadiah@cs.ucsd.edu> Reported-by: Zakir Durumeric <zakir@umich.edu> Reported-by: J. Alex Halderman <jhalderm@umich.edu>. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
When we are initializing using arch_get_random_long() we only need to
loop enough times to touch all the bytes in the buffer; using
poolwords for that does twice the number of operations necessary on a
64-bit machine, since in the random number generator code "word" means
32 bits.
If there is an architecture-specific random number generator (such as
RDRAND for Intel architectures), use it to initialize /dev/random's
entropy stores. Even in the worst case, if RDRAND is something like
AES(NSA_KEY, counter++), it won't hurt, and it will definitely help
against any other adversaries.
We still don't use rdrand in /dev/random, which just seems stupid. We
accept the *cycle*counter* as a random input, but we don't accept
rdrand? That's just broken.
Sure, people can do things in user space (write to /dev/random, use
rdrand in addition to /dev/random themselves etc etc), but that
*still* seems to be a particularly stupid reason for saying "we
shouldn't bother to try to do better in /dev/random".
And even if somebody really doesn't trust rdrand as a source of random
bytes, it seems singularly stupid to trust the cycle counter *more*.
So I'd suggest the attached patch. I'm not going to even bother
arguing that we should add more bits to the entropy estimate, because
that's not the point - I don't care if /dev/random fills up slowly or
not, I think it's just stupid to not use the bits we can get from
rdrand and mix them into the strong randomness pool.
Link: http://lkml.kernel.org/r/CA%2B55aFwn59N1=m651QAyTy-1gO1noGbK18zwKDwvwqnravA84A@mail.gmail.com Acked-by: "David S. Miller" <davem@davemloft.net> Acked-by: "Theodore Ts'o" <tytso@mit.edu> Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Cc: Matt Mackall <mpm@selenic.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
commit 9ef449c6b31bb6a8e6dedc24de475a3b8c79be20 ("[media] rc: Postpone ISR
registration") fixed an early ISR registration on several drivers. It did
however also introduced a bug by moving the invocation of pnp_port_start()
to the end of the probe function.
This patch fixes this issue by moving the invocation of pnp_port_start() to
an earlier stage in the probe function.
Cc: Jarod Wilson <jarod@redhat.com> Signed-off-by: Luis Henriques <luis.henriques@canonical.com> Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Booting a 3.2, 3.3, or 3.4-rc4 kernel on an Atari using the
`nfeth' ethernet device triggers a WARN_ONCE() in generic irq
handling code on the first irq for that device:
After invoking the irq's handler the kernel sees !irqs_disabled()
and concludes that the handler erroneously enabled interrupts.
However, debugging shows that !irqs_disabled() is true even before
the handler is invoked, which indicates a problem in the platform
code rather than the specific driver.
The warning does not occur in 3.1 or older kernels.
It turns out that the ALLOWINT definition for Atari is incorrect.
The Atari definition of ALLOWINT is ~0x400, the stated purpose of
that is to avoid taking HSYNC interrupts. irqs_disabled() returns
true if the 3-bit ipl & 4 is non-zero. The nfeth interrupt runs at
ipl 3 (it's autovector 3), but 3 & 4 is zero so irqs_disabled() is
false, and the warning above is generated.
When interrupts are explicitly disabled, ipl is set to 7. When they
are enabled, ipl is masked with ALLOWINT. On Atari this will result
in ipl = 3, which blocks interrupts at ipl 3 and below. So how come
nfeth interrupts at ipl 3 are received at all? That's because ipl
is reset to 2 by Atari-specific code in default_idle(), again with
the stated purpose of blocking HSYNC interrupts. This discrepancy
means that ipl 3 can remain blocked for longer than intended.
Both default_idle() and falcon_hblhandler() identify HSYNC with
ipl 2, and the "Atari ST/.../F030 Hardware Register Listing" agrees,
but ALLOWINT is defined as if HSYNC was ipl 3.
[As an experiment I modified default_idle() to reset ipl to 3, and
as expected that resulted in all nfeth interrupts being blocked.]
The fix is simple: define ALLOWINT as ~0x500 instead. This makes
arch_local_irq_enable() consistent with default_idle(), and prevents
the !irqs_disabled() problems for ipl 3 interrupts.
Tested on Atari running in an Aranym VM.
Signed-off-by: Mikael Pettersson <mikpe@it.uu.se> Tested-by: Michael Schmitz <schmitzmic@googlemail.com> (on Falcon/CT60)
[Geert Uytterhoeven: This version applies to v3.2..v3.4.] Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
As part of commit 463454b5dbd8 ("cfg80211: fix interface
combinations check"), this extra check was introduced:
if ((all_iftypes & used_iftypes) != used_iftypes)
goto cont;
However, most wireless NIC drivers did not advertise ADHOC in
wiphy.iface_combinations[i].limits[] and hence we'll get -EBUSY
when we bring up a ADHOC wlan with commands similar to:
# iwconfig wlan0 mode ad-hoc && ifconfig wlan0 up
In commit 8e8b41f9d8c8e ("cfg80211: enforce lack of interface
combinations"), the change below fixes the issue:
if (total == 1)
return 0;
But it also introduces other dependencies for stable. For example,
a full cherry pick of 8e8b41f9d8c8e would introduce additional
regressions unless we also start cherry picking driver specific
fixes like the following:
9b4760e ath5k: add possible wiphy interface combinations 1ae2fc2 mac80211_hwsim: advertise interface combinations 20c8e8d ath9k: add possible wiphy interface combinations
And the purpose of the 'if (total == 1)' is to cover the specific
use case (IBSS, adhoc) that was mentioned above. So we just pick
the specific part out from 8e8b41f9d8c8e here.
Doing so gives stable kernels a way to fix the change introduced
by 463454b5dbd8, without having to make cherry picks specific to
various NIC drivers.
Signed-off-by: Liang Li <liang.li@windriver.com> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Also add a model/fixup string "lenovo-dock", so that other Thinkpad
users will be able to test this fixup easily, to see if it enables
dock I/O for them as well.
BugLink: https://bugs.launchpad.net/bugs/1026953 Tested-by: John McCarron <john.mccarron@canonical.com> Signed-off-by: David Henningsson <david.henningsson@canonical.com> Signed-off-by: Takashi Iwai <tiwai@suse.de> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
For one, the driver device pointer needs to be filled in, or the lirc core
will refuse to load the driver. And we really need to wire up all the
platform_device bits. This has been tested via the lirc sourceforge tree
and verified to work, been sitting there for months, finally getting
around to sending it. :\
CC: Josh Boyer <jwboyer@redhat.com> Signed-off-by: Jarod Wilson <jarod@redhat.com> Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
When the frontend and the backend reside on the same domain, even if we
add pages to the m2p_override, these pages will never be returned by
mfn_to_pfn because the check "get_phys_to_machine(pfn) != mfn" will
always fail, so the pfn of the frontend will be returned instead
(resulting in a deadlock because the frontend pages are already locked).
The explanation is in the comment within the code:
We need to do this because the pages shared by the frontend
(xen-blkfront) can be already locked (lock_page, called by
do_read_cache_page); when the userspace backend tries to use them
with direct_IO, mfn_to_pfn returns the pfn of the frontend, so
do_blockdev_direct_IO is going to try to lock the same pages
again resulting in a deadlock.
Internally the xen-blkback uses m2p_add_override to swizzle (temporarily)
a 'struct page' to have a different MFN (so that it can point to another
guest). It also can easily find out whether another pfn corresponding
to the mfn exists in the m2p, and can set the FOREIGN bit
in the p2m, making sure that mfn_to_pfn returns the pfn of the backend.
This allows the backend to perform direct_IO on these pages, but as a
side effect prevents the frontend from using get_user_pages_fast on
them while they are being shared with the backend.
Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
add_disk() takes gendisk reference on request queue. If driver failed during
initialization and never called add_disk() then that extra reference is not
taken. That reference is put in put_disk(). floppy driver allocates the
disk, allocates queue, sets disk->queue and then relizes that floppy
controller is not present. It tries to tear down everything and tries to
put a reference down in put_disk() which was never taken.
In such error cases cleanup disk->queue before calling put_disk() so that
we never try to put down a reference which was never taken in first place.
Stefan reported a crash on a kernel before a3e5d1091c1 ("sched:
Don't call task_group() too many times in set_task_rq()"), he
found the reason to be that the multiple task_group()
invocations in set_task_rq() returned different values.
Looking at all that I found a lack of serialization and plain
wrong comments.
The below tries to fix it using an extra pointer which is
updated under the appropriate scheduler locks. Its not pretty,
but I can't really see another way given how all the cgroup
stuff works.
Reported-and-tested-by: Stefan Bader <stefan.bader@canonical.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1340364965.18025.71.camel@twins Signed-off-by: Ingo Molnar <mingo@kernel.org>
(backported to previous file names and layout) Signed-off-by: Stefan Bader <stefan.bader@canonical.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Several fields in struct cpuinfo_x86 were not defined for the
!SMP case, likely to save space. However, those fields still
have some meaning for UP, and keeping them allows some #ifdef
removal from other files. The additional size of the UP kernel
from this change is not significant enough to worry about
keeping up the distinction:
text data bss dec hex filename 4737168 506459 972040 6215667 5ed7f3 vmlinux.o.before 4737444 506459 972040 6215943 5ed907 vmlinux.o.after
for a difference of 276 bytes for an example UP config.
If someone wants those 276 bytes back badly then it should
be implemented in a cleaner way.
Commit a6bc32b89922 ("mm: compaction: introduce sync-light migration for
use by compaction") changed the declaration of migrate_pages() and
migrate_huge_pages().
But it missed changing the argument of migrate_huge_pages() in
soft_offline_huge_page(). In this case, we should call
migrate_huge_pages() with MIGRATE_SYNC.
Additionally, there is a mismatch between type the of argument and the
function declaration for migrate_pages().
Signed-off-by: Joonsoo Kim <js1304@gmail.com> Cc: Christoph Lameter <cl@linux.com> Cc: Mel Gorman <mgorman@suse.de> Acked-by: David Rientjes <rientjes@google.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Linear copy works by adding the offset to the buffer address,
which may end up not being 16-byte aligned.
Some tests I've written for prime_pcopy show that the engine
allows this correctly, so the restriction on lowest 4 bits of
address can be lifted safely.
The comments added were by envyas, I think because I used
a newer version.
Signed-off-by: Maarten Lankhorst <maarten.lankhorst@canonical.com>
[bwh: Backported to 3.2: no # prefixes in nva3_copy.fuc] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Each ordered operation has a free callback, and this was called with the
worker spinlock held. Josef made the free callback also call iput,
which we can't do with the spinlock.
This drops the spinlock for the free operation and grabs it again before
moving through the rest of the list. We'll circle back around to this
and find a cleaner way that doesn't bounce the lock around so much.
Signed-off-by: Chris Mason <chris.mason@fusionio.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
To have DP behave like VGA/DVI we need to retrain the link
on hotplug. For this to happen we need to force link
training to happen by setting connector dpms to off
before asking it turning it on again.
v2: agd5f
- drop the dp_get_link_status() change in atombios_dp.c
for now. We still need the dpms OFF change.
Signed-off-by: Jerome Glisse <jglisse@redhat.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Dave Airlie <airlied@redhat.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
v2: agd5f
- no passive DP to VGA adapters, update comments
- assign radeon_connector_atom_dig after we are sure
we have a digital connector as analog connectors
have different private data.
- get new sink type before checking for retrain. No
need to check if it's no longer a DP connection.
Signed-off-by: Jerome Glisse <jglisse@redhat.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Dave Airlie <airlied@redhat.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
We want to print link status query failed only if it's
an unexepected fail. If we query to see if we need
link training it might be because there is nothing
connected and thus link status query have the right
to fail in that case.
To avoid printing failure when it's expected, move the
failure message to proper place.
Signed-off-by: Jerome Glisse <jglisse@redhat.com> Signed-off-by: Dave Airlie <airlied@redhat.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Retry label was at wrong place in function leading to memory
leak.
Signed-off-by: Jerome Glisse <jglisse@redhat.com> Reviewed-by: Michel Dänzer <michel.daenzer@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Dave Airlie <airlied@redhat.com>
[bwh: Backported to 3.2: adjust context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
In the ac.c, power_supply_register()'s return value is not checked.
As a result, the driver's add() ops may return success
even though the device failed to initialize.
For example, some BIOS may describe two ACADs in the same DSDT.
The second ACAD device will fail to register,
but ACPI driver's add() ops returns sucessfully.
The ACPI device will receive ACPI notification and cause OOPS.
The only checks of the long argument passed to fcntl(fd,F_SETLEASE,.)
are done after converting the long to an int. Thus some illegal values
may be let through and cause problems in later code.
[ They actually *don't* cause problems in mainline, as of Dave Jones's
commit 8d657eb3b438 "Remove easily user-triggerable BUG from
generic_setlease", but we should fix this anyway. And this patch will
be necessary to fix real bugs on earlier kernels. ]
Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Ever since the DAPM performance improvements we've been marking all widgets
as not dirty after each DAPM run. Since _PRE and _POST events aren't part
of the DAPM graph this has rendered them non-functional, they will never be
marked dirty again and thus will never be run again.
Fix this by skipping them when marking widgets as not dirty.
Signed-off-by: Mark Brown <broonie@opensource.wolfsonmicro.com> Acked-by: Liam Girdwood <lrg@ti.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
The function ext4_calc_metadata_amount() has side effects, although
it's not obvious from its function name. So if we fail to claim
space, regardless of whether we retry to claim the space again, or
return an error, we need to undo these side effects.
Otherwise we can end up incorrectly calculating the number of metadata
blocks needed for the operation, which was responsible for an xfstests
failure for test #271 when using an ext2 file system with delalloc
enabled.
Reported-by: Brian Foster <bfoster@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
If we hit a condition where we have allocated metadata blocks that
were not appropriately reserved, we risk underflow of
ei->i_reserved_meta_blocks. In turn, this can throw
sbi->s_dirtyclusters_counter significantly out of whack and undermine
the nondelalloc fallback logic in ext4_nonda_switch(). Warn if this
occurs and set i_allocated_meta_blocks to avoid this problem.
This condition is reproduced by xfstests 270 against ext2 with
delalloc enabled:
Mar 28 08:58:02 localhost kernel: [ 171.526344] EXT4-fs (loop1): delayed block allocation failed for inode 14 at logical offset 64486 with max blocks 64 with error -28
Mar 28 08:58:02 localhost kernel: [ 171.526346] EXT4-fs (loop1): This should not happen!! Data will be lost
270 ultimately fails with an inconsistent filesystem and requires an
fsck to repair. The cause of the error is an underflow in
ext4_da_update_reserve_space() due to an unreserved meta block
allocation.
Signed-off-by: Brian Foster <bfoster@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
At http://dev.laptop.org/ticket/11980 we have determined that the
Marvell CaFe SDHCI controller reports bad card presence during
resume. It reports that no card is present even when it is.
This is a regression -- resume worked back around 2.6.37.
Around 400ms after resuming, a "card inserted" interrupt is
generated, at which point it starts reporting presence.
Work around this hardware oddity by setting the
SDHCI_QUIRK_BROKEN_CARD_DETECTION flag.
Thanks to Chris Ball for helping with diagnosis.
Signed-off-by: Daniel Drake <dsd@laptop.org>
[stable@: please apply to 3.0+] Signed-off-by: Chris Ball <cjb@laptop.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
BTW, speaking of struct file treatment related to sockets -
there's this piece of code in iscsi:
/*
* The SCTP stack needs struct socket->file.
*/
if ((np->np_network_transport == ISCSI_SCTP_TCP) ||
(np->np_network_transport == ISCSI_SCTP_UDP)) {
if (!new_sock->file) {
new_sock->file = kzalloc(
sizeof(struct file), GFP_KERNEL);
For one thing, as far as I can see it'not true - sctp does *not* depend on
socket->file being non-NULL; it does, in one place, check socket->file->f_flags
for O_NONBLOCK, but there it treats NULL socket->file as "flag not set".
Which is the case here anyway - the fake struct file created in
__iscsi_target_login_thread() (and in iscsi_target_setup_login_socket(), with
the same excuse) do *not* get that flag set.
Moreover, it's a bloody serious violation of a bunch of asserts in VFS;
all struct file instances should come from filp_cachep, via get_empty_filp()
(or alloc_file(), which is a wrapper for it). FWIW, I'm very tempted to
do this and be done with the entire mess:
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Cc: Andy Grover <agrover@redhat.com> Cc: Hannes Reinecke <hare@suse.de> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
[bwh: Backported to 3.2: adjust context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
commit 198439e4 [SCSI] libsas: do not set res = 0 in sas_ex_discover_dev()
commit 19252de6 [SCSI] libsas: fix wide port hotplug issues
The above commits seem to have confused the return value of
sas_ex_discover_dev which is non-zero on failure and
sas_ex_join_wide_port which just indicates short circuiting discovery on
already established ports. The result is random discovery failures
depending on configuration.
Calls to sas_ex_join_wide_port are the source of the trouble as its
return value is errantly assigned to 'res'. Convert it to bool and stop
returning its result up the stack.
Tested-by: Dan Melnic <dan.melnic@amd.com> Reported-by: Dan Melnic <dan.melnic@amd.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Jack Wang <jack_wang@usish.com> Signed-off-by: James Bottomley <JBottomley@Parallels.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Continue running revalidation until no more broadcast devices are
discovered. Fixes cases where re-discovery completes too early in a
domain with multiple expanders with pending re-discovery events.
Servicing BCNs can get backed up behind error recovery.
Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: James Bottomley <JBottomley@Parallels.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Rapid ata hotplug on a libsas controller results in cases where libsas
is waiting indefinitely on eh to perform an ata probe.
A race exists between scsi_schedule_eh() and scsi_restart_operations()
in the case when scsi_restart_operations() issues i/o to other devices
in the sas domain. When this happens the host state transitions from
SHOST_RECOVERY (set by scsi_schedule_eh) back to SHOST_RUNNING and
->host_busy is non-zero so we put the eh thread to sleep even though
->host_eh_scheduled is active.
Before putting the error handler to sleep we need to check if the
host_state needs to return to SHOST_RECOVERY for another trip through
eh. Since i/o that is released by scsi_restart_operations has been
blocked for at least one eh cycle, this implementation allows those
i/o's to run before another eh cycle starts to discourage hung task
timeouts.
Reported-by: Tom Jackson <thomas.p.jackson@intel.com> Tested-by: Tom Jackson <thomas.p.jackson@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: James Bottomley <JBottomley@Parallels.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
...teach scsi_sysfs_add_devices() to check for deleted devices() before
trying to add them, and teach scsi_remove_target() how to remove targets
that have not been added via device_add().
Reported-by: Dariusz Majchrzak <dariusz.majchrzak@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: James Bottomley <JBottomley@Parallels.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
When we call scsi_unprep_request() the command associated with the request
gets destroyed and therefore drops its reference on the device. If this was
the only reference, the device may get released and we end up with a NULL
pointer deref when we call blk_requeue_request.
Reported-by: Mike Christie <michaelc@cs.wisc.edu> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Mike Christie <michaelc@cs.wisc.edu> Reviewed-by: Tejun Heo <tj@kernel.org>
[jejb: enhance commend and add commit log for stable] Signed-off-by: James Bottomley <JBottomley@Parallels.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Use blk_queue_dead() to test whether the queue is dead instead
of !sdev. Since scsi_prep_fn() may be invoked concurrently with
__scsi_remove_device(), keep the queuedata (sdev) pointer in
__scsi_remove_device(). This patch fixes a kernel oops that
can be triggered by USB device removal. See also
http://www.spinics.net/lists/linux-scsi/msg56254.html.
Other changes included in this patch:
- Swap the blk_cleanup_queue() and kfree() calls in
scsi_host_dev_release() to make that code easier to grasp.
- Remove the queue dead check from scsi_run_queue() since the
queue state can change anyway at any point in that function
where the queue lock is not held.
- Remove the queue dead check from the start of scsi_request_fn()
since it is redundant with the scsi_device_online() check.
Reported-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Mike Christie <michaelc@cs.wisc.edu> Reviewed-by: Tejun Heo <tj@kernel.org> Signed-off-by: James Bottomley <JBottomley@Parallels.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Turn on the pin widget's PIN_OUT bit from playback prepare. The pin is
enabled in open, but is disabled in hdmi_init_pin which is called during
system resume. This causes a system suspend/resume during playback to
mute HDMI/DP. Enabling the pin in prepare instead of open allows calling
snd_pcm_prepare after a system resume to restore audio.
This could previously fail if either of the enabled displays was using a
horizontal resolution that is a multiple of 128, and only the leftmost column
of the cursor was (supposed to be) visible at the right edge of that display.
The solution is to move the cursor one pixel to the left in that case.
Signed-off-by: Michel Dänzer <michel.daenzer@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Dave Airlie <airlied@redhat.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
This did not work because devices are not put into the
pt_domain. Fix this.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
[bwh: Backported to 3.2: do not use iommu_dev_data::passthrough] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Currently, all workqueue cpu hotplug operations run off
CPU_PRI_WORKQUEUE which is higher than normal notifiers. This is to
ensure that workqueue is up and running while bringing up a CPU before
other notifiers try to use workqueue on the CPU.
Per-cpu workqueues are supposed to remain working and bound to the CPU
for normal CPU_DOWN_PREPARE notifiers. This holds mostly true even
with workqueue offlining running with higher priority because
workqueue CPU_DOWN_PREPARE only creates a bound trustee thread which
runs the per-cpu workqueue without concurrency management without
explicitly detaching the existing workers.
However, if the trustee needs to create new workers, it creates
unbound workers which may wander off to other CPUs while
CPU_DOWN_PREPARE notifiers are in progress. Furthermore, if the CPU
down is cancelled, the per-CPU workqueue may end up with workers which
aren't bound to the CPU.
While reliably reproducible with a convoluted artificial test-case
involving scheduling and flushing CPU burning work items from CPU down
notifiers, this isn't very likely to happen in the wild, and, even
when it happens, the effects are likely to be hidden by the following
successful CPU down.
Fix it by using different priorities for up and down notifiers - high
priority for up operations and low priority for down operations.
Workqueue cpu hotplug operations will soon go through further cleanup.
Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: "Rafael J. Wysocki" <rjw@sisk.pl> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Commit d83579e2a50ac68389e6b4c58b845c702cf37516 incorporated some
changes from the vendor driver that made it newly important that the
calculated hardware version correctly include the CHIP_92D bit, as all
of the IS_92D_* macros were changed to depend on it. However, this bit
was being unset for dual-mac, dual-phy devices. The vendor driver
behavior was modified to not do this, but unfortunately this change was
not picked up along with the others. This caused scanning in the 2.4GHz
band to be broken, and possibly other bugs as well.
This patch brings the version calculation logic in parity with the
vendor driver in this regard, and in doing so fixes the regression.
However, the version calculation code in general continues to be largely
incoherent and messy, and needs to be cleaned up.
Signed-off-by: Forest Bond <forest.bond@rapidrollout.com> Signed-off-by: Larry Finger <Larry.Finger@lwfinger.net> Signed-off-by: John W. Linville <linville@tuxdriver.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
The s390 idle accounting code uses a sequence counter which gets used
when the per cpu idle statistics get updated and read.
One assumption on read access is that only when the sequence counter is
even and did not change while reading all values the result is valid.
On cpu hotplug however the per cpu data structure gets initialized via
a cpu hotplug notifier on CPU_ONLINE.
CPU_ONLINE however is too late, since the onlined cpu is already running
and might access the per cpu data. Worst case is that the data structure
gets initialized while an idle thread is updating its idle statistics.
This will result in an uneven sequence counter after an update.
As a result user space tools like top, which access /proc/stat in order
to get idle stats, will busy loop waiting for the sequence counter to
become even again, which will never happen until the queried cpu will
update its idle statistics again. And even then the sequence counter
will only have an even value for a couple of cpu cycles.
Fix this by moving the initialization of the per cpu idle statistics
to cpu_init(). I prefer that solution in favor of changing the
notifier to CPU_UP_PREPARE, which would be a different solution to
the problem.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
[bwh: Backported to 3.2: adjust context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Fail UNMAP commands that have more than our reported limit on unmap
descriptors.
Signed-off-by: Roland Dreier <roland@purestorage.com> Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
[bwh: Backported to 3.2: adjust filename] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
It's possible for an initiator to send us an UNMAP command with a
descriptor that is less than 8 bytes; in that case it's really bad for
us to set an unsigned int to that value, subtract 8 from it, and then
use that as a limit for our loop (since the value will wrap around to
a huge positive value).
Fix this by making size be signed and only looping if size >= 16 (ie
if we have at least a full descriptor available).
Also remove offset as an obfuscated name for the constant 8.
Signed-off-by: Roland Dreier <roland@purestorage.com> Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
[bwh: Backported to 3.2: adjust filename, context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
The UNMAP DATA LENGTH and UNMAP BLOCK DESCRIPTOR DATA LENGTH fields
are in the unmap descriptor (the payload transferred to our data out
buffer), not in the CDB itself. Read them from the correct place in
target_emulated_unmap.
Signed-off-by: Roland Dreier <roland@purestorage.com> Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
[bwh: Backported to 3.2: adjust filename, context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
When processing an UNMAP command, we need to make sure that the number
of blocks we're asked to UNMAP does not exceed our reported maximum
number of blocks per UNMAP, and that the range of blocks we're
unmapping doesn't go past the end of the device.
Signed-off-by: Roland Dreier <roland@purestorage.com> Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
[bwh: Backported to 3.2: adjust filename, context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Many SCSI commands are defined to return a CHECK CONDITION / ILLEGAL
REQUEST with ASC set to LOGICAL BLOCK ADDRESS OUT OF RANGE if the
initiator sends a command that accesses a too-big LBA. Add an enum
value and case entries so that target code can return this status.
Signed-off-by: Roland Dreier <roland@purestorage.com> Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
These interfaces need to be handled by QMI/WWAN driver
Signed-off-by: Andrew Bird <ajb@spheresystems.co.uk> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
mwifiex driver supports 2x2 chips as well. Hence valid mcs values
are 0 to 15. The check for mcs index is corrected in this patch.
For example: if 40MHz is enabled and mcs index is 11, "iw link"
command would show "tx bitrate: 108.0 MBit/s" without this patch.
Now it shows "tx bitrate: 108.0 MBit/s MCS 11 40Mhz" with the patch.
Signed-off-by: Amitkumar Karwar <akarwar@marvell.com> Signed-off-by: Bing Zhao <bzhao@marvell.com> Signed-off-by: John W. Linville <linville@tuxdriver.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
When a partition table length is corrupted to be close to 1 << 32, the
check for its length may overflow on 32-bit systems and we will think
the length is valid. Later on the kernel can crash trying to read beyond
end of buffer. Fix the check to avoid possible overflow.
Reported-by: Ben Hutchings <ben@decadent.org.uk> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Commit f975d6bcc7a introduced bug which caused ext4_statfs() to
miscalculate the number of file system overhead blocks. This causes
the f_blocks field in the statfs structure to be larger than it should
be. This would in turn cause the "df" output to show the number of
data blocks in the file system and the number of data blocks used to
be larger than they should be.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
[bwh: Backported to 3.2: adjust context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
iso data buffers may have holes in them if some packets were short, so for
iso urbs we should always copy the entire buffer, just like the regular
processcompl does.
Signed-off-by: Hans de Goede <hdegoede@redhat.com> Acked-by: Alan Stern <stern@rowland.harvard.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Microcode reloading in a per-core manner is a very bad idea for both
major x86 vendors. And the thing is, we have such interface with which
we can end up with different microcode versions applied on different
cores of an otherwise homogeneous wrt (family,model,stepping) system.
So turn off the possibility of doing that per core and allow it only
system-wide.
This is a minimal fix which we'd like to see in stable too thus the
more-or-less arbitrary decision to allow system-wide reloading only on
the BSP:
Also, allowing the reload only from one CPU (the BSP in
that case) doesn't allow the reload procedure to degenerate
into an O(n^2) deal when triggering reloads from all
/sys/devices/system/cpu/cpuX/microcode/reload sysfs nodes
simultaneously.
A more generic fix will follow.
Cc: Henrique de Moraes Holschuh <hmh@hmh.eng.br> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Borislav Petkov <borislav.petkov@amd.com> Link: http://lkml.kernel.org/r/1340280437-7718-2-git-send-email-bp@amd64.org Signed-off-by: H. Peter Anvin <hpa@zytor.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
If function tracing is enabled for some of the low-level suspend/resume
functions, it leads to triple fault during resume from suspend, ultimately
ending up in a reboot instead of a resume (or a total refusal to come out
of suspended state, on some machines).
This issue was explained in more detail in commit f42ac38c59e0a03d (ftrace:
disable tracing for suspend to ram). However, the changes made by that commit
got reverted by commit cbe2f5a6e84eebb (tracing: allow tracing of
suspend/resume & hibernation code again). So, unfortunately since things are
not yet robust enough to allow tracing of low-level suspend/resume functions,
suspend/resume is still broken when ftrace is enabled.
So fix this by disabling function tracing during suspend/resume & hibernation.
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com> Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
A "usb0" interface that has never been connected to a host has an unknown
operstate, and therefore the IFF_RUNNING flag is (incorrectly) asserted
when queried by ifconfig, ifplugd, etc. This is a result of calling
netif_carrier_off() too early in the probe function; it should be called
after register_netdev().
Similar problems have been fixed in many other drivers, e.g.:
e826eafa6 (bonding: Call netif_carrier_off after register_netdevice) 0d672e9f8 (drivers/net: Call netif_carrier_off at the end of the probe) 6a3c869a6 (cxgb4: fix reported state of interfaces without link)
Fix is to move netif_carrier_off() to the end of the function.
Signed-off-by: Kevin Cernekee <cernekee@gmail.com> Signed-off-by: Felipe Balbi <balbi@ti.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Signed-off-by: Albert Pool<albertpool@solcon.nl> Acked-by: Gertjan van Wingerde <gwingerde@gmail.com> Signed-off-by: John W. Linville <linville@tuxdriver.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
RT3070:
(0x2019,0x5201) Planex Communications, Inc. RT8070
(0x7392,0x4085) 2L Central Europe BV 8070
7392 is Edimax
RT35xx:
(0x1690,0x0761) Askey
was Fujitsu Stylistic 550, but 1690 is Askey
Signed-off-by: Xose Vazquez Perez <xose.vazquez@gmail.com> Acked-by: Gertjan van Wingerde <gwingerde@gmail.com> Signed-off-by: John W. Linville <linville@tuxdriver.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Signed-off-by: Xose Vazquez Perez <xose.vazquez@gmail.com> Acked-by: Gertjan van Wingerde <gwingerde@gmail.com> Signed-off-by: John W. Linville <linville@tuxdriver.com>
[bwh: Backported to 3.2: drop the 5372 devices] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Jian found that when he ran fsx on a 32 bit arch with a large wsize the
process and one of the bdi writeback kthreads would sometimes deadlock
with a stack trace like this:
Each task would kmap part of its address array before getting stuck, but
not enough to actually issue the write.
This patch fixes this by serializing the marshal_iov operations for
async reads and writes. The idea here is to ensure that cifs
aggressively tries to populate a request before attempting to fulfill
another one. As soon as all of the pages are kmapped for a request, then
we can unlock and allow another one to proceed.
There's no need to do this serialization on non-CONFIG_HIGHMEM arches
however, so optimize all of this out when CONFIG_HIGHMEM isn't set.
Reported-by: Jian Li <jiali@redhat.com> Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <smfrench@gmail.com>
[bwh: Backported to 3.2: adjust context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
The 8168evl (RTL_GIGA_MAC_VER_34) based Gigabyte GA-990FXA motherboards
are very prone to NETDEV watchdog problems without this change. See
https://bugzilla.kernel.org/show_bug.cgi?id=42899 for instance.
I don't know why it *works*. It's depressingly effective though.
For the record:
- the problem may go along IOMMU (AMD-Vi) errors but it really looks
like a red herring.
- the patch sets the RX_MULTI_EN bit. If the 8168c doc is any guide,
the chipset now fetches several Rx descriptors at a time.
- long ago the driver ignored the RX_MULTI_EN bit. e542a2269f232d61270ceddd42b73a4348dee2bb changed the RxConfig
settings. Whatever the problem it's now labeled a regression.
- Realtek's own driver can identify two different 8168evl devices
(CFG_METHOD_16 and CFG_METHOD_17) where the r8169 driver only
sees one. It sucks.
Signed-off-by: Francois Romieu <romieu@fr.zoreil.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Despite lots of investigation into why this is needed we don't
know or have an elegant cure. The only answer found on this
laptop is to mark a problem region as used so that Linux doesn't
put anything there.
Currently all the users add reserve= command lines and anyone
not knowing this needs to find the magic page that documents it.
Automate it instead.
Signed-off-by: Alan Cox <alan@linux.intel.com> Tested-and-bugfixed-by: Arne Fitzenreiter <arne@fitzenreiter.de>
Resolves-bug: https://bugzilla.kernel.org/show_bug.cgi?id=10231 Link: http://lkml.kernel.org/r/20120515174347.5109.94551.stgit@bluebook Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
The strcpy was being used to set the name of the board. Since the
destination char* was read-only and the name is set statically at
compile time; this was both wrong and redundant.
The type of char* is changed to const char* to prevent future errors.
Reported-by: Radek Masin <radek@masin.eu> Signed-off-by: Ezequiel Garcia <elezegarcia@gmail.com>
[ Taking directly due to vacations - Linus ] Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Just like the module loader, ftrace needs to be updated to use r12
instead of r11 with newer gcc's.
Signed-off-by: Roger Blofeld <blofeldus@yahoo.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Follow up on commit 556061b00 ("sched/nohz: Fix rq->cpu_load[]
calculations") since while that fixed the busy case it regressed the
mostly idle case.
Add a callback from the nohz exit to also age the rq->cpu_load[]
array. This closes the hole where either there was no nohz load
balance pass during the nohz, or there was a 'significant' amount of
idle time between the last nohz balance and the nohz exit.
So we'll update unconditionally from the tick to not insert any
accidental 0 load periods while busy, and we try and catch up from
nohz idle balance and nohz exit. Both these are still prone to missing
a jiffy, but that has always been the case.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: pjt@google.com Cc: Venkatesh Pallipadi <venki@google.com> Link: http://lkml.kernel.org/n/tip-kt0trz0apodbf84ucjfdbr1a@git.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
[bwh: Backported to 3.2: adjust filenames and context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
While investigating why the load-balancer did funny I found that the
rq->cpu_load[] tables were completely screwy.. a bit more digging
revealed that the updates that got through were missing ticks followed
by a catchup of 2 ticks.
The catchup assumes the cpu was idle during that time (since only nohz
can cause missed ticks and the machine is idle etc..) this means that
esp. the higher indices were significantly lower than they ought to
be.
The reason for this is that its not correct to compare against jiffies
on every jiffy on any other cpu than the cpu that updates jiffies.
This patch cludges around it by only doing the catch-up stuff from
nohz_idle_balance() and doing the regular stuff unconditionally from
the tick.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: pjt@google.com Cc: Venkatesh Pallipadi <venki@google.com> Link: http://lkml.kernel.org/n/tip-tp4kj18xdd5aj4vvj0qg55s2@git.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
[bwh: Backported to 3.2: adjust filenames and context; keep functions static] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Avoid crashing if the private_data pointer happens to be NULL. This has
been seen sometimes when a host reset happens, notably when there are
many LUNs:
Signed-off-by: Mark Rustad <mark.d.rustad@intel.com> Tested-by: Marcus Dennis <marcusx.e.dennis@intel.com> Signed-off-by: James Bottomley <JBottomley@Parallels.com>
[bwh: Backported to 3.2: adjust filename, context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Stable note: Not tracked in Bugzilla. [get|put]_mems_allowed() is extremely
expensive and severely impacted page allocator performance. This
is part of a series of patches that reduce page allocator overhead.
Fix a gcc warning (and bug?) introduced in cc9a6c877 ("cpuset: mm: reduce
large amounts of memory barrier related damage v3")
Local variable "page" can be uninitialized if the nodemask from vma policy
does not intersects with nodemask from cpuset. Even if it doesn't happens
it is better to initialize this variable explicitly than to introduce
a kernel oops in a weird corner case.
mm/hugetlb.c: In function `alloc_huge_page':
mm/hugetlb.c:1135:5: warning: `page' may be used uninitialized in this function
Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org> Acked-by: Mel Gorman <mgorman@suse.de> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Stable note: Not tracked in Bugzilla. [get|put]_mems_allowed() is extremely
expensive and severely impacted page allocator performance. This
is part of a series of patches that reduce page allocator overhead.
Commit c0ff7453bb5c ("cpuset,mm: fix no node to alloc memory when
changing cpuset's mems") wins a super prize for the largest number of
memory barriers entered into fast paths for one commit.
[get|put]_mems_allowed is incredibly heavy with pairs of full memory
barriers inserted into a number of hot paths. This was detected while
investigating at large page allocator slowdown introduced some time
after 2.6.32. The largest portion of this overhead was shown by
oprofile to be at an mfence introduced by this commit into the page
allocator hot path.
For extra style points, the commit introduced the use of yield() in an
implementation of what looks like a spinning mutex.
This patch replaces the full memory barriers on both read and write
sides with a sequence counter with just read barriers on the fast path
side. This is much cheaper on some architectures, including x86. The
main bulk of the patch is the retry logic if the nodemask changes in a
manner that can cause a false failure.
While updating the nodemask, a check is made to see if a false failure
is a risk. If it is, the sequence number gets bumped and parallel
allocators will briefly stall while the nodemask update takes place.
In a page fault test microbenchmark, oprofile samples from
__alloc_pages_nodemask went from 4.53% of all samples to 1.15%. The
actual results were
The overall improvement is small but the System CPU time is much
improved and roughly in correlation to what oprofile reported (these
performance figures are without profiling so skew is expected). The
actual number of page faults is noticeably improved.
For benchmarks like kernel builds, the overall benefit is marginal but
the system CPU time is slightly reduced.
To test the actual bug the commit fixed I opened two terminals. The
first ran within a cpuset and continually ran a small program that
faulted 100M of anonymous data. In a second window, the nodemask of the
cpuset was continually randomised in a loop.
Without the commit, the program would fail every so often (usually
within 10 seconds) and obviously with the commit everything worked fine.
With this patch applied, it also worked fine so the fix should be
functionally equivalent.
Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: Miao Xie <miaox@cn.fujitsu.com> Cc: David Rientjes <rientjes@google.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Christoph Lameter <cl@linux.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Mel Gorman <mgorman@suse.de>
[bwh: Forward-ported from 3.0 to 3.2: apply the upstream changes
to get_any_partial()] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Stable note: Not tracked in Bugzilla. This is a partial backport of an
upstream commit addressing a completely different issue
that accidentally contained an important fix. The workload
this patch helps was memcached when IO is started in the
background. memcached should stay resident but without this patch
it gets swapped. Sometimes this manifests as a drop in throughput
but mostly it was observed through /proc/vmstat.
Commit [246e87a9: memcg: fix get_scan_count() for small targets] was meant
to fix a problem whereby small scan targets on memcg were ignored causing
priority to raise too sharply. It forced scanning to take place if the
target was small, memcg or kswapd.
From the time it was introduced it caused excessive reclaim by kswapd
with workloads being pushed to swap that previously would have stayed
resident. This was accidentally fixed in commit [b95a2f2d: mm: vmscan:
convert global reclaim to per-memcg LRU lists] by making it harder for
kswapd to force scan small targets but that patchset is not suitable for
backporting. This was later changed again by commit [90126375: mm/vmscan:
push lruvec pointer into get_scan_count()] into a format that looks
like it would be a straight-forward backport but there is a subtle
difference due to the use of lruvecs.
The impact of the accidental fix is to make it harder for kswapd to force
scan small targets by taking zone->all_unreclaimable into account. This
patch is the closest equivalent available based on what is backported.
Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Stable note: Not tracked in Bugzilla. There were reports of shared
mapped pages being unfairly reclaimed in comparison to older kernels.
This is being addressed over time. Even though the subject
refers to lumpy reclaim, it impacts compaction as well.
Lumpy reclaim does well to stop at a PageAnon when there's no swap, but
better is to stop at any PageSwapBacked, which includes shmem/tmpfs too.
Signed-off-by: Hugh Dickins <hughd@google.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Stable note: Not tracked in Bugzilla. This patch reduces kswapd CPU
usage on swapless systems with high anonymous memory usage.
It's pointless to continue reclaiming when we have no swap space and lots
of anon pages in the inactive list.
Without this patch, it is possible when swap is disabled to continue
trying to reclaim when there are only anonymous pages in the system even
though that will not make any progress.
Signed-off-by: Minchan Kim <minchan@kernel.org> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Johannes Weiner <jweiner@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Stable note: Not tracked in Bugzilla. There were reports of shared
mapped pages being unfairly reclaimed in comparison to older kernels.
This is being addressed over time.
Logic added in commit 8cab4754d24a0 ("vmscan: make mapped executable pages
the first class citizen") was noticeably weakened in commit 645747462435d84 ("vmscan: detect mapped file pages used only once").
Currently these pages can become "first class citizens" only after second
usage. After this patch page_check_references() will activate they after
first usage, and executable code gets yet better chance to stay in memory.
Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org> Cc: Pekka Enberg <penberg@kernel.org> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Nick Piggin <npiggin@kernel.dk> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Shaohua Li <shaohua.li@intel.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Stable note: Not tracked in Bugzilla. There were reports of shared
mapped pages being unfairly reclaimed in comparison to older kernels.
This is being addressed over time. The specific workload being
addressed here in described in paragraph four and while paragraph
five says it did not help performance as such, it made a difference
to major page faults. I'm aware of at least one bug for a large
vendor that was due to increased major faults.
Commit 645747462435 ("vmscan: detect mapped file pages used only once")
greatly decreases lifetime of single-used mapped file pages.
Unfortunately it also decreases life time of all shared mapped file
pages. Because after commit bf3f3bc5e7347 ("mm: don't mark_page_accessed
in fault path") page-fault handler does not mark page active or even
referenced.
Thus page_check_references() activates file page only if it was used twice
while it stays in inactive list, meanwhile it activates anon pages after
first access. Inactive list can be small enough, this way reclaimer can
accidentally throw away any widely used page if it wasn't used twice in
short period.
After this patch page_check_references() also activate file mapped page at
first inactive list scan if this page is already used multiple times via
several ptes.
I found this while trying to fix degragation in rhel6 (~2.6.32) from rhel5
(~2.6.18). There a complete mess with >100 web/mail/spam/ftp containers,
they share all their files but there a lot of anonymous pages: ~500mb
shared file mapped memory and 15-20Gb non-shared anonymous memory. In
this situation major-pagefaults are very costly, because all containers
share the same page. In my load kernel created a disproportionate
pressure on the file memory, compared with the anonymous, they equaled
only if I raise swappiness up to 150 =)
These patches actually wasn't helped a lot in my problem, but I saw
noticable (10-20 times) reduce in count and average time of
major-pagefault in file-mapped areas.
Actually both patches are fixes for commit v2.6.33-5448-g6457474, because
it was aimed at one scenario (singly used pages), but it breaks the logic
in other scenarios (shared and/or executable pages)
Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org> Acked-by: Pekka Enberg <penberg@kernel.org> Acked-by: Minchan Kim <minchan.kim@gmail.com> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Nick Piggin <npiggin@kernel.dk> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Shaohua Li <shaohua.li@intel.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>