Theodore Ts'o [Tue, 20 Dec 2011 22:06:08 +0000 (17:06 -0500)]
ext4: fix potential deadlock with setuid files and EXT4_IOC_MOVE_EXT
file_remove_suid() must be called with i_mutex down, since it calls
notify_change(). In addition, we really want to remove the suid bit
*before* we modify the donor file, to prevent someone from exploiting
a race.
Linus Torvalds [Tue, 20 Dec 2011 19:42:38 +0000 (11:42 -0800)]
Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
time/clocksource: Fix kernel-doc warnings
rtc: m41t80: Workaround broken alarm functionality
rtc: Expire alarms after the time is set.
Linus Torvalds [Tue, 20 Dec 2011 19:41:17 +0000 (11:41 -0800)]
Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
oprofile: Fix uninitialized memory access when writing to oprofilefs
Linus Torvalds [Tue, 20 Dec 2011 19:40:48 +0000 (11:40 -0800)]
Merge branch 'stable/for-linus-fixes-3.2' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen
* 'stable/for-linus-fixes-3.2' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen:
Revert "xen/pv-on-hvm kexec: add xs_reset_watches to shutdown watches from old kernel"
Linus Torvalds [Tue, 20 Dec 2011 19:31:56 +0000 (11:31 -0800)]
Merge branch 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
* 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
NFS: Fix a regression in nfs_file_llseek()
NFSv4: Do not accept delegated opens when a delegation recall is in effect
NFSv4: Ensure correct locking when accessing the 'lock_states' list
NFSv4.1: Ensure that we handle _all_ SEQUENCE status bits.
NFSv4: Don't error if we handled it in nfs4_recovery_handle_error
SUNRPC: Ensure we always bump the backlog queue in xprt_free_slot
SUNRPC: Fix the execution time statistics in the face of RPC restarts
Linus Torvalds [Tue, 20 Dec 2011 19:31:44 +0000 (11:31 -0800)]
Merge branch 'drm-fixes' of git://people.freedesktop.org/~airlied/linux
* 'drm-fixes' of git://people.freedesktop.org/~airlied/linux:
vmwgfx: Clip cliprects against screen boundaries in present and dirty
vmwgfx: Resend the cursor after legacy modeset
vmwgfx: Do better culling of presents
vmwgfx: Refactor kms code to use vmw_user_lookup_handle helper
vmwgfx: Add helper function to get surface or dmabuf
vmwgfx: Refactor cursor update
vmwgfx: Remove dmabuf check in present ioctl
vmwgfx: Use the revised fifo hw version register when present
Before waiting (predefined value 120s), check that at least one device
was successfully brought up. Otherwise (e.g. buggy bootloader
which does not set the MAC address) there is no point in waiting
for carrier.
Cc: Micha Nelissen <micha@neli.hopto.org> Cc: Holger Brunck <holger.brunck@keymile.com> Signed-off-by: Gerlando Falauto <gerlando.falauto@keymile.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Olof Johansson [Tue, 20 Dec 2011 17:27:39 +0000 (09:27 -0800)]
arm/tegra: remove __initdata annotation from pinmux tables
Instead of reshuffling what functions in the pinmux paths should be
__init and thus could keep references to __initdata, let's just remove
the annotations for now -- the tables are moving to device tree in the
next version anyway and the whole subsystem is being wired up. We will
go back and re-annotate where appropriate once things settle down.
Signed-off-by: Olof Johansson <olof@lixom.net> Acked-by: Stephen Warren <swarren@nvidia.com>
Thomas Graf [Mon, 19 Dec 2011 04:11:40 +0000 (04:11 +0000)]
sctp: Do not account for sizeof(struct sk_buff) in estimated rwnd
When checking whether a DATA chunk fits into the estimated rwnd a
full sizeof(struct sk_buff) is added to the needed chunk size. This
quickly exhausts the available rwnd space and leads to packets being
sent which are much below the PMTU limit. This can lead to much worse
performance.
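To illustrate the effect described above, here is a minimal sketch (not the
SCTP code itself) that counts how many DATA chunks fit into a given rwnd when
each chunk is charged a fixed per-chunk overhead. The function name and the
overhead value used in the usage note are hypothetical; the real overhead
would be sizeof(struct sk_buff) on the running kernel.

```c
#include <stddef.h>

/*
 * Hypothetical helper: number of DATA chunks of chunk_len bytes that fit
 * into rwnd when every chunk is charged chunk_len + per_chunk_overhead.
 * Passing per_chunk_overhead = 0 models the accounting after this change.
 */
size_t chunks_fitting_rwnd(size_t rwnd, size_t chunk_len,
                           size_t per_chunk_overhead)
{
    size_t cost = chunk_len + per_chunk_overhead;

    /* Guard against a zero-sized chunk with zero overhead. */
    return cost ? rwnd / cost : 0;
}
```

With a 64 KiB rwnd and 100-byte chunks, an assumed overhead of 200 bytes per
chunk cuts the number of chunks that fit to roughly a third, which is the
kind of early rwnd exhaustion the message describes for small DATA chunks.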
The reason for this behaviour was to avoid putting too much memory
pressure on the receiver. The concept is not completely irrational
because a Linux receiver does in fact clone an skb for each DATA chunk
delivered. However, Linux also reserves half the available socket
buffer space for data structures, so that usage is already
accounted for.
When proposing to change this the last time it was noted that this
behaviour was introduced to solve a performance issue caused by rwnd
overusage in combination with small DATA chunks.
Trying to reproduce this I found that with the sk_buff overhead removed,
the performance would improve significantly unless socket buffer limits
are increased.
The following numbers have been gathered using a patched iperf
supporting SCTP over a live 1 Gbit ethernet network. The -l option
was used to limit DATA chunk sizes. The numbers listed are based on
the average of 3 test runs each. Default values have been used for
sk_(r|w)mem.
binary_sysctl() calls sysctl_getname() which allocates from names_cache
slab using __getname().
The matching function to free the name is __putname(), and not putname()
which should be used only to match getname() allocations.
This is because when auditing is enabled, putname() calls audit_putname
*instead* (not in addition) to __putname(). Then, if a syscall is in
progress, audit_putname does not release the name - instead, it expects
the name to get released when the syscall completes, but that will happen
only if audit_getname() was called previously, i.e. if the name was
allocated with getname() rather than the naked __getname(). So,
__getname() followed by putname() ends up leaking memory.
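The mismatch can be reproduced in a small userspace toy model. This is only
a sketch under stated assumptions: struct audit_ctx, live_names and all
function names below are hypothetical stand-ins, not the kernel's data
structures or API. raw_getname()/raw_putname() play the role of
__getname()/__putname(), and audited_getname()/audited_putname() play the
role of getname()/putname() with auditing enabled.

```c
#include <stdbool.h>
#include <stdlib.h>

#define MAX_AUDIT_NAMES 8

/* Toy stand-in for the per-task audit context; not the kernel structure. */
struct audit_ctx {
    bool in_syscall;
    char *names[MAX_AUDIT_NAMES];
    int name_count;
};

int live_names; /* buffers allocated and not yet freed */

/* Models __getname(): plain allocation, audit never hears about it. */
char *raw_getname(void)
{
    char *name = malloc(64);

    if (name)
        live_names++;
    return name;
}

/* Models __putname(): plain free. */
void raw_putname(char *name)
{
    if (name) {
        free(name);
        live_names--;
    }
}

/* Models getname(): allocate, then register the buffer with audit. */
char *audited_getname(struct audit_ctx *ctx)
{
    char *name;

    if (ctx->in_syscall && ctx->name_count >= MAX_AUDIT_NAMES)
        return NULL; /* toy limit reached */
    name = raw_getname();
    if (name && ctx->in_syscall)
        ctx->names[ctx->name_count++] = name;
    return name;
}

/*
 * Models putname() with auditing enabled: inside a syscall the free is
 * deferred to syscall exit instead of happening here.
 */
void audited_putname(struct audit_ctx *ctx, char *name)
{
    if (!ctx->in_syscall)
        raw_putname(name);
}

/* Models syscall exit: only names that were registered get released. */
void audit_syscall_exit(struct audit_ctx *ctx)
{
    for (int i = 0; i < ctx->name_count; i++)
        raw_putname(ctx->names[i]);
    ctx->name_count = 0;
    ctx->in_syscall = false;
}
```

In this model, a buffer obtained with raw_getname() and released with
audited_putname() during a syscall is never registered, so
audit_syscall_exit() does not free it and live_names stays at one.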
Signed-off-by: Michel Lespinasse <walken@google.com> Acked-by: Al Viro <viro@zeniv.linux.org.uk> Cc: Christoph Hellwig <hch@infradead.org> Cc: Eric Paris <eparis@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
where the oom score computation was divided into several steps and it's no
longer computed as one expression in unsigned long (rss, swapents, nr_pte
are unsigned long), where the result value assigned to points (int) is in
the range 1..1000. So there could be an int overflow while computing
    points *= 1000;
and points may become negative, meaning the oom score for a memory-hog task
will be one.
    if (points <= 0)
        return 1;
For example:
[ 3366]     0  3366 35390480 24303939   5       0             0 oom01
Out of memory: Kill process 3366 (oom01) score 1 or sacrifice child
Here the oom01 process consumes more than 24303939(rss)*4096~=92GB physical
memory, but its oom score is one.
In this situation the mem hog task is skipped and oom killer kills another and
most probably innocent task with oom score greater than one.
The points variable should be of type long instead of int to prevent the
int overflow.
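The arithmetic can be checked with a short userspace sketch. Because signed
overflow is undefined behaviour in C, the sketch does not perform the
overflowing multiplication; it only tests whether the product would fit in
an int, and computes the score in 64-bit arithmetic. The function names and
the totalpages value in the usage note are assumptions made for
illustration.

```c
#include <limits.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * True if pages * 1000 can be stored in an int without overflow.
 * Assumes pages is non-negative.
 */
bool scaled_points_fit_int(int64_t pages)
{
    return pages <= INT_MAX / 1000;
}

/*
 * Score computed in 64-bit arithmetic, mirroring the two steps
 * "points *= 1000; points /= totalpages" and the clamp to 1.
 * Assumes totalpages > 0.
 */
int64_t oom_score_wide(int64_t pages, int64_t totalpages)
{
    int64_t points = pages * 1000 / totalpages;

    return points <= 0 ? 1 : points;
}
```

For the rss of 24303939 pages from the log above, the product does not fit
in a 32-bit int, whereas the 64-bit computation yields a score near the top
of the range when totalpages is assumed to be 26214400 (100 GiB of 4 KiB
pages).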
Haogang Chen [Tue, 20 Dec 2011 01:11:56 +0000 (17:11 -0800)]
nilfs2: potential integer overflow in nilfs_ioctl_clean_segments()
There is a potential integer overflow in nilfs_ioctl_clean_segments().
When a large argv[n].v_nmembs is passed from the userspace, the subsequent
call to vmalloc() will allocate a buffer smaller than expected, which
leads to out-of-bound access in nilfs_ioctl_move_blocks() and
nilfs_clean_segments().
The following check does not prevent the overflow because nsegs is also
controlled by the userspace and could be very large.
if (argv[n].v_nmembs > nsegs * nilfs->ns_blocks_per_segment)
goto out_free;
This patch clamps argv[n].v_nmembs to UINT_MAX / argv[n].v_size, and
returns -EINVAL on overflow.
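The check can be sketched in userspace as follows. This is an illustration
of the division-based overflow test, not the nilfs2 code: the function names
are hypothetical, fixed-width uint32_t stands in for the unsigned int fields,
and rejecting a zero element size is an added assumption to keep the
division well defined.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* The unchecked product wraps modulo 2^32 and can come out far too small. */
uint32_t unchecked_buf_size(uint32_t nmembs, uint32_t size)
{
    return nmembs * size;
}

/*
 * Hypothetical checked variant: store nmembs * size in *out, or return
 * -EINVAL if the product would not fit in 32 bits.
 */
int checked_buf_size(uint32_t nmembs, uint32_t size, size_t *out)
{
    if (size == 0 || nmembs > UINT32_MAX / size)
        return -EINVAL;
    *out = (size_t)nmembs * size;
    return 0;
}
```

With nmembs = 0x10000 and size = 0x10000 the unchecked product wraps to 0,
so an allocation based on it would be far smaller than the caller expects,
while the checked variant rejects the request.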
David Rientjes [Tue, 20 Dec 2011 01:11:52 +0000 (17:11 -0800)]
cpusets: stall when updating mems_allowed for mempolicy or disjoint nodemask
Kernels where MAX_NUMNODES > BITS_PER_LONG may temporarily see an empty
nodemask in a tsk's mempolicy if its previous nodemask is remapped onto a
new set of allowed cpuset nodes where the two nodemasks, as a result of
the remap, are now disjoint.
c0ff7453bb5c ("cpuset,mm: fix no node to alloc memory when changing
cpuset's mems") adds get_mems_allowed() to prevent the set of allowed
nodes from changing for a thread. This causes any update to a set of
allowed nodes to stall until put_mems_allowed() is called.
This stall is unnecessary, however, if at least one node remains unchanged
in the update to the set of allowed nodes. This was addressed by
89e8a244b97e ("cpusets: avoid looping when storing to mems_allowed if one
node remains set"), but it's still possible that an empty nodemask may be
read from a mempolicy because the old nodemask may be remapped to the new
nodemask during rebind. To prevent this, only avoid the stall if there is
no mempolicy for the thread being changed.
This is a temporary solution until all reads from mempolicy nodemasks can
be guaranteed to not be empty without the get_mems_allowed()
synchronization.
Also moves the check for nodemask intersection inside task_lock() so that
tsk->mems_allowed cannot change. This ensures that nothing can set this
tsk's mems_allowed out from under us and also protects tsk->mempolicy.
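The resulting rule can be sketched with plain bitmasks. This is a
simplified model, assuming at most 64 nodes so that a nodemask fits in a
uint64_t; the function name is hypothetical and the locking described above
is not modelled.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical model of the decision: the update may skip the stall only
 * when the task has no mempolicy and the old and new nodemasks share at
 * least one node.
 */
bool update_needs_stall(uint64_t old_mems, uint64_t new_mems,
                        bool has_mempolicy)
{
    return has_mempolicy || (old_mems & new_mems) == 0;
}
```

For example, moving from nodes {0,1} to nodes {1,2} needs no stall for a
task without a mempolicy, but does once a mempolicy is attached or when the
new set is {2,3}.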
Reported-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: David Rientjes <rientjes@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Paul Menage <paul@paulmenage.org> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Josh Boyer [Tue, 20 Dec 2011 15:41:28 +0000 (10:41 -0500)]
powerpc/44x: Fix build error on currituck platform
The MPIC_PRIMARY define was recently made "default" and the meaning was
inverted to MPIC_SECONDARY. This causes compile errors in currituck now, so
fix it to the new manner of allocating mpics.
Suzuki Poulose [Wed, 14 Dec 2011 22:59:57 +0000 (22:59 +0000)]
powerpc/boot: Change the load address for the wrapper to fit the kernel
The wrapper code which uncompresses the kernel in case of a 'ppc' boot
is by default loaded at 0x00400000 and the kernel will be uncompressed
to fit the location 0-0x00400000. But with dynamic relocations, the size
of the kernel may exceed 0x00400000(4M). This would cause an overlap
of the uncompressed kernel and the boot wrapper, causing a failure in
boot.
The message looks like :
zImage starting: loaded at 0x00400000 (sp: 0x0065ffb0)
Allocating 0x5ce650 bytes for kernel ...
Insufficient memory for kernel at address 0! (_start=00400000, uncompressed size=00591a20)
This patch shifts the load address of the boot wrapper code to the next
higher MB, according to the size of the uncompressed vmlinux.
With the patch, we get the following message while building the image :
WARN: Uncompressed kernel (size 0x5b0344) overlaps the address of the wrapper(0x400000)
WARN: Fixing the link_address of wrapper to (0x600000)
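The address computation behind these messages can be sketched as below. This
is an illustration only: the function and macro names are hypothetical, and
treating a kernel of exactly 0x400000 bytes as still fitting is an
assumption. The rounding reproduces the 0x600000 value shown for a
0x5b0344-byte kernel.

```c
#include <stdint.h>

#define DEFAULT_LINK_ADDRESS 0x00400000u /* default wrapper load address */
#define ONE_MB               0x00100000u

/*
 * Hypothetical helper: keep the default address while the uncompressed
 * kernel fits below it, otherwise round the kernel size up to the next
 * MB boundary and load the wrapper there.
 */
uint32_t wrapper_link_address(uint32_t vmlinux_size)
{
    if (vmlinux_size <= DEFAULT_LINK_ADDRESS)
        return DEFAULT_LINK_ADDRESS;
    return (vmlinux_size + ONE_MB - 1) & ~(ONE_MB - 1);
}
```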
Signed-off-by: Suzuki K. Poulose <suzuki@in.ibm.com> Signed-off-by: Josh Boyer <jwboyer@gmail.com>
virt_phys_offset = effective. kernel virt base - kernstart_addr
I have tested the patches on 440x platforms only. However this should
work fine for PPC_47x also, as we only depend on the runtime address
and the current TLB XLAT entry for the startup code, which is available
in r25. I don't have access to a 47x board yet. So, it would be great if
somebody could test this on 47x.
Signed-off-by: Suzuki K. Poulose <suzuki@in.ibm.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Kumar Gala <galak@kernel.crashing.org> Cc: Tony Breeds <tony@bakeyournoodle.com> Cc: Josh Boyer <jwboyer@gmail.com> Cc: linuxppc-dev <linuxppc-dev@lists.ozlabs.org> Signed-off-by: Josh Boyer <jwboyer@gmail.com>
On BookE, we need __va() & __pa() early in the boot process to access
the device tree.
Currently this has been defined as :
#define __va(x) ((void *)(unsigned long)((phys_addr_t)(x) - \
PHYSICAL_START + KERNELBASE))
where:
PHYSICAL_START is kernstart_addr - a variable updated at runtime.
KERNELBASE is the compile time Virtual base address of kernel.
This won't work for us, as kernstart_addr is dynamic and will yield different
results for __va()/__pa() for the same mapping.
e.g.,
Let the kernel be loaded at 64MB and KERNELBASE be 0xc0000000 (same as
PAGE_OFFSET).
In this case, we would be mapping 0 to 0xc0000000, and kernstart_addr = 64M
Now __va(1MB) = (0x100000) - (0x4000000) + 0xc0000000
= 0xbc100000, which is wrong.
It should be: 0xc0000000 + 0x100000 = 0xc0100000
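The miscalculation can be reproduced with 32-bit unsigned arithmetic. The
helper below is a hypothetical userspace restatement of the existing macro,
with PHYSICAL_START and KERNELBASE passed in as parameters.

```c
#include <stdint.h>

/*
 * Hypothetical restatement of the current __va():
 * phys - PHYSICAL_START + KERNELBASE, evaluated modulo 2^32.
 */
uint32_t va_compile_time(uint32_t phys, uint32_t kernstart_addr,
                         uint32_t kernelbase)
{
    return phys - kernstart_addr + kernelbase;
}
```

Calling va_compile_time(0x100000, 0x4000000, 0xc0000000) gives 0xbc100000,
which is 64MB below the address that the 0 -> 0xc0000000 mapping actually
provides for physical 1MB.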
On platforms which support AMP, like PPC_47x (based on 44x), the kernel
could be loaded at highmem. Hence we cannot always depend on the compile
time constants for mapping.
Here are the possible solutions:
1) Update kernstart_addr(PHYSICAL_START) to match the Physical address of
compile time KERNELBASE value, instead of the actual Physical_Address(_stext).
The disadvantage is that we may break other users of PHYSICAL_START. They
could be replaced with __pa(_stext).
2) Redefine __va() & __pa() with relocation offset
a) A variable, say relocation_offset (like kernstart_addr), updated
at boot time. This impacts performance, as we have to load an additional
variable from memory.
__va(0x100000) = 0xc0000000 + 0x100000 = 0xc0100000
which is what we want.
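A translation based on a single offset can be sketched as follows. The
helper names are hypothetical, and the offset value 0xc0000000 is simply
the one implied by the worked line above for a kernel whose mapping sends
physical 0 to 0xc0000000; how the offset is obtained at boot is not shown
here.

```c
#include <stdint.h>

/* Hypothetical helpers: translate using one virtual-minus-physical offset. */
uint32_t va_from_offset(uint32_t phys, uint32_t virt_phys_offset)
{
    return phys + virt_phys_offset;
}

uint32_t pa_from_offset(uint32_t virt, uint32_t virt_phys_offset)
{
    return virt - virt_phys_offset;
}
```

With an offset of 0xc0000000, va_from_offset(0x100000, ...) returns
0xc0100000 and pa_from_offset() maps it back to 0x100000, independent of
where the kernel image itself was loaded.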
I have implemented (3) in the following patch which has same cost of
operation as the existing one.
I have tested the patches on 440x platforms only. However this should
work fine for PPC_47x also, as we only depend on the runtime address
and the current TLB XLAT entry for the startup code, which is available
in r25. I don't have access to a 47x board yet. So, it would be great if
somebody could test this on 47x.
Signed-off-by: Suzuki K. Poulose <suzuki@in.ibm.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Kumar Gala <galak@kernel.crashing.org> Cc: linuxppc-dev <linuxppc-dev@lists.ozlabs.org> Signed-off-by: Josh Boyer <jwboyer@gmail.com>
Suzuki Poulose [Wed, 14 Dec 2011 22:58:12 +0000 (22:58 +0000)]
powerpc: Process dynamic relocations for kernel
The following patch implements the dynamic relocation processing for
PPC32 kernel. relocate() accepts the target virtual address and relocates
the kernel image to the same.
Currently the following relocation types are handled :
The last 3 relocations in the above list depend on the value of the symbol
whose index is encoded in the relocation entry. Hence we need the symbol
table for processing such relocations.
Note: The GNU ld for ppc32 produces buggy relocations for relocation types
that depend on symbols. The value of the symbols with STB_LOCAL scope
should be assumed to be zero. - Alan Modra
Signed-off-by: Suzuki K. Poulose <suzuki@in.ibm.com> Signed-off-by: Josh Poimboeuf <jpoimboe@linux.vnet.ibm.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Alan Modra <amodra@au1.ibm.com> Cc: Kumar Gala <galak@kernel.crashing.org> Cc: linuxppc-dev <linuxppc-dev@lists.ozlabs.org> Signed-off-by: Josh Boyer <jwboyer@gmail.com>
Suzuki Poulose [Wed, 14 Dec 2011 22:57:15 +0000 (22:57 +0000)]
powerpc: Rename mapping based RELOCATABLE to DYNAMIC_MEMSTART for BookE
The current implementation of CONFIG_RELOCATABLE in BookE is based
on mapping the page aligned kernel load address to KERNELBASE. This
approach however is not enough for platforms, where the TLB page size
is large (e.g, 256M on 44x). So we are renaming the RELOCATABLE used
currently in BookE to DYNAMIC_MEMSTART to reflect the actual method.
The CONFIG_RELOCATABLE for PPC32(BookE) based on processing of the
dynamic relocations will be introduced later in the patch series.
This change would allow the use of the old method of RELOCATABLE for
platforms which can afford to enforce the page alignment (platforms with
smaller TLB size).
Changes since v3:
* Introduced a new config, NONSTATIC_KERNEL, to denote a kernel which is
either a RELOCATABLE or DYNAMIC_MEMSTART(Suggested by: Josh Boyer)
Suggested-by: Scott Wood <scottwood@freescale.com> Tested-by: Scott Wood <scottwood@freescale.com> Signed-off-by: Suzuki K. Poulose <suzuki@in.ibm.com> Cc: Scott Wood <scottwood@freescale.com> Cc: Kumar Gala <galak@kernel.crashing.org> Cc: Josh Boyer <jwboyer@gmail.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: linux ppc dev <linuxppc-dev@lists.ozlabs.org> Signed-off-by: Josh Boyer <jwboyer@gmail.com>
Olof Johansson [Tue, 20 Dec 2011 05:09:48 +0000 (21:09 -0800)]
Merge branch 'omap/uart' into next/pm
* omap/uart: (32 commits)
ARM: omap: pass minimal SoC/board data for UART from dt
arm/dts: Add minimal device tree support for omap2420 and omap2430
omap-serial: Add minimal device tree support
omap-serial: Use default clock speed (48Mhz) if not specified
omap-serial: Get rid of all pdev->id usage
ARM: OMAP2+: UART: Fix compilation/sparse warnings
ARM: OMAP2+: UART: Remove omap_uart_can_sleep and add pm_qos
ARM: OMAP2+: UART: Do not gate uart clocks if used for debug_prints
ARM: OMAP2+: UART: Avoid uart idling on suspend for no_console_suspend usecase
ARM: OMAP2+: UART: Avoid console uart idling during bootup
ARM: OMAP2+: UART: remove temporary variable used to count uart instance
ARM: OMAP2+: UART: Make the RX_TIMEOUT for DMA configurable for each UART
ARM: OMAP2+: UART: Allow UART parameters to be configured from board file.
ARM: OMAP2+: UART: Remove old and unused clocks handling funcs
ARM: OMAP2+: UART: Add wakeup mechanism for omap-uarts
ARM: OMAP2+: UART: Move errata handling from serial.c to omap-serial
ARM: OMAP2+: UART: Get context loss count to context restore
ARM: OMAP2+: UART: Remove uart reset function.
ARM: OMAP2+: UART: Ensure all reg values configured are available from port structure
ARM: OMAP2+: UART: Remove context_save and move context restore to driver
...
Olof Johansson [Tue, 20 Dec 2011 05:04:42 +0000 (21:04 -0800)]
Merge branch 'omap/prcm' into next/pm
* omap/prcm:
ARM: OMAP2+: hwmod: Add a new flag to handle hwmods left enabled at init
ARM: OMAP4: PRM: use PRCM interrupt handler
ARM: OMAP3: pm: use prcm chain handler
ARM: OMAP: hwmod: add support for selecting mpu_irq for each wakeup pad
ARM: OMAP2+: mux: add support for PAD wakeup interrupts
ARM: OMAP: PRCM: add suspend prepare / finish support
ARM: OMAP: PRCM: add support for chain interrupt handler
ARM: OMAP3/4: PRM: add functions to read pending IRQs, PRM barrier
ARM: OMAP2+: hwmod: Add API to enable IO ring wakeup
ARM: OMAP2+: mux: add wakeup-capable hwmod mux entries to dynamic list
Holger Brunck [Mon, 19 Dec 2011 16:49:48 +0000 (17:49 +0100)]
ARM: plat-orion: make gpiochip label unique
The former implementation adds a fixed gpiochip label string
to the framework. This is confusing because orion_gpio_init
is called more than once and this ends up in different gpiochips
with the same label.
This patch adds the already present orion_gpio_chip_count to the
label string to make it unique in the system.
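The idea can be sketched in userspace with snprintf(). The helper name and
the exact "orion_gpio%d" format are assumptions for illustration; the
message only states that the chip count is added to the label.

```c
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical helper: build a per-chip label by appending the chip index
 * to a fixed prefix. Returns the snprintf() result, i.e. the length the
 * label needs excluding the terminating NUL.
 */
int make_gpiochip_label(char *buf, size_t len, int chip_index)
{
    return snprintf(buf, len, "orion_gpio%d", chip_index);
}
```

Two successive calls with indices 0 and 1 then produce distinct labels, so
chips registered by repeated init calls can be told apart.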
Signed-off-by: Holger Brunck <holger.brunck@keymile.com> Cc: Lennert Buytenhek <kernel@wantstofly.org> Acked-by: Nicolas Pitre <nico@linaro.org> Signed-off-by: Olof Johansson <olof@lixom.net>
On a system with lots of memory pressure that is stuck on synchronous inode
reclaim, the workqueue code will run one instance of the inode reclaim work
item on every CPU, which is not what we want. Make sure to mark the
xfssyncd workqueue as non-reentrant so that there is only one instance
of each running globally. Also stop using special parameters for the
workqueue; now that we guarantee each fs has only one of each work item
running at a time, there is no need to artificially lower max_active and
compensate for that by setting the WQ_CPU_INTENSIVE flag.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>
Stephen Warren [Mon, 19 Dec 2011 19:24:04 +0000 (12:24 -0700)]
arm/tegra: Make MACH_TEGRA_DT depend on ARCH_TEGRA_2x_SOC
Now that Tegra20 and Tegra30 device tree board files are separate,
MACH_TEGRA_DT (which enables the Tegra20 device tree board file) should
depend on Tegra20 support being enabled.
Signed-off-by: Stephen Warren <swarren@nvidia.com> Signed-off-by: Olof Johansson <olof@lixom.net>
Stephen Warren [Mon, 19 Dec 2011 19:24:03 +0000 (12:24 -0700)]
arm/tegra: Delete tegra_init_clock()
tegra_init_clock() is written to call tegra2_init_clocks(), which only
exists if Tegra20 support is enabled. This breaks the build of a
Tegra30-only kernel.
tegra_init_clock() isn't actually used any more; tegra20_init_early()
calls tegra2_init_clocks() directly. So, just delete this function.
Signed-off-by: Stephen Warren <swarren@nvidia.com> Signed-off-by: Olof Johansson <olof@lixom.net>
Stephen Warren [Fri, 16 Dec 2011 22:12:32 +0000 (15:12 -0700)]
arm/tegra: Use bus notifiers to trigger pinmux setup
Currently, the Tegra pinmux is initialized at different times when booting
with and without device tree:
Without device tree:
1) Pinmux and GPIO drivers are registered.
2) Pinmux is configured.
3) All other drivers are registered.
With device tree:
1) All drivers are registered and probed, including pinmux and GPIO.
2) Pinmux is configured.
This change modifies board-pinmux.c to detect pinmux and GPIO driver
registration using bus notifiers. This allows pinmux configuration to
happen immediately after the pinmux driver is probed, irrespective of
whether the pinmux driver is manually registered by board-pinmux.c, or
if it's instantiated during device tree parsing.
To support this with device tree, the pinmux init functions must be
called prior to instantiating devices from device tree, so that the
notifiers are set up before-hand.
Signed-off-by: Stephen Warren <swarren@nvidia.com> Signed-off-by: Olof Johansson <olof@lixom.net>
Stephen Warren [Fri, 16 Dec 2011 22:12:31 +0000 (15:12 -0700)]
arm/tegra: Refactor board-*-pinmux.c to share code
This moves the implementation of *_pinmux_init() into a single location.
The board-specific pinmux data is left in each board's own file. This
will allow future changes that set up the pinmux in a more complex
fashion to do so without duplicating that code in each board's pinmux
file.
Signed-off-by: Stephen Warren <swarren@nvidia.com> Signed-off-by: Olof Johansson <olof@lixom.net>