Now that gfs2_lookup_by_inum takes the inode glock only for new inodes
(and no longer for cached inodes), there is no longer a need to
optimize the cached-inode case in gfs2_get_dentry or delete_work_func,
and gfs2_ilookup can be removed.
In addition, gfs2_get_dentry wasn't checking the GFS2_DIF_SYSTEM flag in
i_diskflags in the gfs2_ilookup case (see gfs2_lookup_by_inum); this
inconsistency goes away as well.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
The current gfs2_lookup_by_inum takes the glock of a presumed inode
identified by block number, verifies that the block is indeed an inode,
and then instantiates and reads the new inode via gfs2_inode_lookup.
However, instantiating a new inode may block on freeing a previous
instance of that inode (__wait_on_freeing_inode), and freeing an inode
requires taking the glock that is already held, leading to lock inversion and
deadlock.
Fix this by first instantiating the new inode, then verifying that the
block is an inode (if required), and then reading in the new inode, all
in gfs2_inode_lookup.
If the block we are looking for is not an inode, we discard the new
inode via iget_failed, which marks inodes as bad and unhashes them.
Other tasks waiting on that inode will get a bad inode back from
ilookup or iget_locked; in that case, retry the lookup.
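A rough sketch of the retry (the lookup helper and its arguments are
shown for illustration only, not copied from the patch):
	struct inode *inode;

	for (;;) {
		inode = gfs2_inode_lookup(sb, type, no_addr, no_formal_ino);
		if (IS_ERR(inode) || !is_bad_inode(inode))
			break;		/* usable inode or a hard error */
		/* a racing failed instantiation left a bad, unhashed
		 * inode behind; drop it and retry the lookup */
		iput(inode);
	}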
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
gfs2: Initialize iopen glock holder for new inodes
In gfs2_init_inode_once, initialize inode->i_iopen_gh.gh_gl to NULL:
otherwise, when gfs2_inode_lookup fails, the iopen glock holder can
remain unset and iget_failed can end up accessing random memory.
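A minimal sketch of the change (the slab constructor is abridged for
illustration):
	static void gfs2_init_inode_once(void *foo)
	{
		struct gfs2_inode *ip = foo;

		inode_init_once(&ip->i_inode);
		/* ... other per-object initialization ... */
		ip->i_iopen_gh.gh_gl = NULL;	/* so iget_failed never sees a stale pointer */
	}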
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Bob Peterson [Thu, 9 Jun 2016 19:24:07 +0000 (14:24 -0500)]
GFS2: don't set rgrp gl_object until it's inserted into rgrp tree
Before this patch, function read_rindex_entry would set a rgrp
glock's gl_object pointer to itself before inserting the rgrp into
the rgrp rbtree. The problem is: if another process was also reading
the rgrp in and had already inserted its newly created rgrp, the
second call to read_rindex_entry would overwrite that pointer and
then return an error code to the caller. Later, other functions
would reference the now-freed rgrp memory by way of gl_object.
In some cases, that could result in gfs2_rgrp_brelse being called
twice for the same rgrp: once for the failed attempt and once for
the "real" rgrp release. Eventually the kernel would panic.
There are also a number of other things that could go wrong when
a kernel module is accessing freed storage. For example, this could
result in rgrp corruption, because the stale rgrp would also point to
a stale bitmap in memory, causing gfs2_inplace_reserve to search
random memory for free blocks (and find some), since rgd->rd_bits was
never set to NULL before being freed.
This patch fixes the problem by not setting gl_object until we
have successfully inserted the rgrp into the rbtree. Also, it sets
rd_bits to NULL as it frees them, which ensures that any accidental
access to the wrong rgrp results in a kernel panic rather than in
file system corruption, the preferred failure mode.
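A simplified sketch of the new ordering (helper names are illustrative,
not the literal patch):
	error = rgd_insert(rgd);		/* assumed rbtree-insert helper */
	if (error) {				/* lost the race: rgrp already in the tree */
		kfree(rgd->rd_bits);
		rgd->rd_bits = NULL;		/* poison: stray users oops instead of corrupting */
		kfree(rgd);
		return error;
	}
	rgd->rd_gl->gl_object = rgd;		/* only now publish via the glock */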
Linus Torvalds [Tue, 24 May 2016 17:22:34 +0000 (10:22 -0700)]
Merge tag 'for-linus-4.7-rc0-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip
Pull xen bug fixes from David Vrabel.
* tag 'for-linus-4.7-rc0-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
xen: use same main loop for counting and remapping pages
xen/events: Don't move disabled irqs
xen/x86: actually allocate legacy interrupts on PV guests
Xen: don't warn about 2-byte wchar_t in efi
xen/gntdev: reduce copy batch size to 16
xen/x86: don't lose event interrupts
Linus Torvalds [Tue, 24 May 2016 16:46:45 +0000 (09:46 -0700)]
Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost
Pull virtio updates from Michael Tsirkin:
"Looks like a quiet cycle for virtio. There's a new inorder option for
the ringtest tool, and a bugfix for balloon for ppc platforms when
using virtio 1 mode"
* tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost:
ringtest: pass buf != NULL
virtio_balloon: fix PFN format for virtio-1
virtio: add inorder option
Linus Torvalds [Tue, 24 May 2016 16:19:38 +0000 (09:19 -0700)]
Merge tag 'microblaze-4.7-rc1' of git://git.monstr.eu/linux-2.6-microblaze
Pull Microblaze updates from Michal Simek:
- Wire-up new syscalls
- Fix link error
* tag 'microblaze-4.7-rc1' of git://git.monstr.eu/linux-2.6-microblaze:
microblaze: pci: export isa_io_base to fix link errors
microblaze: Wire up userfaultfd, membarrier, mlock2 syscalls
Juergen Gross [Wed, 18 May 2016 14:44:54 +0000 (16:44 +0200)]
xen: use same main loop for counting and remapping pages
Instead of having two functions for cycling through the E820 map, one
to count the pages to be remapped and one to remap them later, just use
a single function with a caller-supplied sub-function called for each
region to be processed. This eliminates the possibility of a mismatch
between the two loops, which showed up in certain configurations.
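An illustrative sketch of the shape of this change (types and names are
made up, not the actual Xen code):
	struct pfn_range { unsigned long start_pfn, end_pfn; };

	typedef unsigned long (*remap_fn)(unsigned long start_pfn,
					  unsigned long end_pfn);

	/* One walker over the E820 map; the per-region work is supplied
	 * by the caller, so the counting pass and the remapping pass can
	 * never disagree about which pages they visit. */
	static unsigned long xen_walk_remap_areas(const struct pfn_range *map,
						  unsigned int entries,
						  remap_fn fn)
	{
		unsigned long total = 0;
		unsigned int i;

		for (i = 0; i < entries; i++)
			total += fn(map[i].start_pfn, map[i].end_pfn);

		return total;
	}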
Suggested-by: Ed Swierk <eswierk@skyportsystems.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Ross Lagerwall [Tue, 10 May 2016 15:11:00 +0000 (16:11 +0100)]
xen/events: Don't move disabled irqs
Commit ff1e22e7a638 ("xen/events: Mask a moving irq") open-coded
irq_move_irq() but left out checking if the IRQ is disabled. This broke
resuming from suspend since it tries to move a (disabled) irq without
holding the IRQ's desc->lock. Fix it by adding a check for disabled
IRQs.
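The heart of the fix is a check along these lines (sketch only; the
surrounding mask/unmask handling is omitted):
	if (!irqd_irq_disabled(data) &&
	    unlikely(irqd_is_setaffinity_pending(data))) {
		/* mask the irq, perform the deferred affinity move, unmask */
	}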
Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
xen/x86: actually allocate legacy interrupts on PV guests
Commit b4ff8389ed14 is incomplete: it relies on nr_legacy_irqs() to get
the number of legacy interrupts, when in fact nr_legacy_irqs() returns 0
after probe_8259A(). Use NR_IRQS_LEGACY instead.
Arnd Bergmann [Wed, 11 May 2016 12:47:59 +0000 (14:47 +0200)]
Xen: don't warn about 2-byte wchar_t in efi
The XEN UEFI code has become available on the ARM architecture
recently, but now causes a link-time warning:
ld: warning: drivers/xen/efi.o uses 2-byte wchar_t yet the output is to use 4-byte wchar_t; use of wchar_t values across objects may fail
This seems harmless, because the efi code only uses 2-byte
characters when interacting with EFI, so those strings are never
passed on to the rest of the system, and we just need to
silence the warning.
It is not clear to me whether we actually need to build the file
with the -fshort-wchar flag, but if we do, then we should also
pass --no-wchar-size-warning to the linker, to avoid the warning.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
Fixes: 37060935dc04 ("ARM64: XEN: Add a function to initialize Xen specific UEFI runtime services")
David Vrabel [Mon, 9 May 2016 09:59:48 +0000 (10:59 +0100)]
xen/gntdev: reduce copy batch size to 16
IOCTL_GNTDEV_GRANT_COPY batches copy operations to reduce the number
of hypercalls. The stack is used to avoid a memory allocation in a
hot path. However, a batch size of 24 requires more than 1024 bytes of
stack which in some configurations causes a compiler warning.
xen/gntdev.c: In function ‘gntdev_ioctl_grant_copy’:
xen/gntdev.c:949:1: warning: the frame size of 1248 bytes is
larger than 1024 bytes [-Wframe-larger-than=]
This is a harmless warning as there is still plenty of stack spare,
but people keep trying to "fix" it. Reduce the batch size to 16 to
reduce stack usage to less than 1024 bytes. This should have minimal
impact on performance.
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
On slow platforms with unreliable TSC, such as QEMU emulated machines,
it is possible for the kernel to request the next event in the past. In
that case, in the current implementation of xen_vcpuop_clockevent, we
simply return -ETIME. To be precise, Xen returns -ETIME and we pass
it on. However, the result is a missed event, which simply causes
the kernel to hang.
Instead it is better to always ask the hypervisor for a timer event,
even if the timeout is in the past. That way there are no lost
interrupts and the kernel survives. To do that, remove the
VCPU_SSHOTTMR_future flag.
Linus Torvalds [Tue, 24 May 2016 02:42:28 +0000 (19:42 -0700)]
Merge branch 'akpm' (patches from Andrew)
Merge yet more updates from Andrew Morton:
- Oleg's "wait/ptrace: assume __WALL if the child is traced". It's a
kernel-based workaround for existing userspace issues.
- A few hotfixes
- befs cleanups
- nilfs2 updates
- sys_wait() changes
- kexec updates
- kdump
- scripts/gdb updates
- the last of the MM queue
- a few other misc things
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (84 commits)
kgdb: depends on VT
drm/amdgpu: make amdgpu_mn_get wait for mmap_sem killable
drm/radeon: make radeon_mn_get wait for mmap_sem killable
drm/i915: make i915_gem_mmap_ioctl wait for mmap_sem killable
uprobes: wait for mmap_sem for write killable
prctl: make PR_SET_THP_DISABLE wait for mmap_sem killable
exec: make exec path waiting for mmap_sem killable
aio: make aio_setup_ring killable
coredump: make coredump_wait wait for mmap_sem for write killable
vdso: make arch_setup_additional_pages wait for mmap_sem for write killable
ipc, shm: make shmem attach/detach wait for mmap_sem killable
mm, fork: make dup_mmap wait for mmap_sem for write killable
mm, proc: make clear_refs killable
mm: make vm_brk killable
mm, elf: handle vm_brk error
mm, aout: handle vm_brk failures
mm: make vm_munmap killable
mm: make vm_mmap killable
mm: make mmap_sem for write waits killable for mm syscalls
MAINTAINERS: add co-maintainer for scripts/gdb
...
Linus Torvalds [Tue, 24 May 2016 02:37:41 +0000 (19:37 -0700)]
Merge tag 'linux-kselftest-4.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest
Pull kselftest updates from Shuah Khan:
"This update for Kselftest adds:
- a new ftrace testcase
- fixes for ftrace and intel_pstate tests"
* tag 'linux-kselftest-4.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest:
tools: testing: define the _GNU_SOURCE macro
kselftests/ftrace: Add a test case for event pid filtering
kselftests/ftrace: Detect tracefs mount point
Linus Torvalds [Tue, 24 May 2016 02:30:30 +0000 (19:30 -0700)]
Merge tag 'trace-v4.7-3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fix from Steven Rostedt:
"Reviewing the selftest I recently submitted, I realize that the second
part of it uses my old hack to get the PID of the spawned background
tasks, which doesn't work for all shells, instead of the common use of
$!"
* tag 'trace-v4.7-3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
ftracetest: Use proper logic to find process PID
Linus Torvalds [Tue, 24 May 2016 01:19:21 +0000 (18:19 -0700)]
Merge branch 'for-4.7-dw' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/libata
Pull libata sata_dwc_460ex updates from Tejun Heo:
"Patches to bring sata_dwc_460ex up to snuff.
It was a separate pull request because it depends on dmaengine dw
platform changes which are now in mainline"
* 'for-4.7-dw' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/libata: (24 commits)
ata: dwc: add DMADEVICES dependency
powerpc/4xx: Device tree update for the 460ex DWC SATA
ata: sata_dwc_460ex: make debug messages neat
ata: sata_dwc_460ex: supply physical address of FIFO to DMA
ata: sata_dwc_460ex: use devm_ioremap
ata: sata_dwc_460ex: tidy up sata_dwc_clear_dmacr()
ata: sata_dwc_460ex: use readl/writel_relaxed()
ata: sata_dwc_460ex: switch to new dmaengine_terminate_* API
ata: sata_dwc_460ex: add __iomem to register base pointer
ata: sata_dwc_460ex: get rid of incorrect cast
ata: sata_dwc_460ex: get rid of some pointless casts
ata: sata_dwc_460ex: remove empty libata callback
ata: sata_dwc_460ex: correct HOSTDEV{P}_FROM_*() macros
ata: sata_dwc_460ex: get rid of global data
ata: sata_dwc_460ex: add phy support
ata: sata_dwc_460ex: use "dmas" DT property to find dma channel
ata: sata_dwc_460ex: don't call ata_sff_qc_issue() on DMA commands
ata: sata_dwc_460ex: skip dma setup for non-dma commands
ata: sata_dwc_460ex: select only core part of DMA driver
ata: sata_dwc_460ex: DMA is always a flow controller
...
Linus Torvalds [Tue, 24 May 2016 00:53:39 +0000 (17:53 -0700)]
Merge branch 'for-4.7-zac' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/libata
Pull libata ZAC support from Tejun Heo:
"This contains Zone ATA Command support for Shingled Magnetic Recording
devices.
In addition to sending the new commands down to the device, as ZAC
commands depend on getting a lot of responses from the device, piping
up responses is beefed up too. However, it doesn't involve changes to
libata core mechanism or its interaction with upper layers, so I'm not
expecting too many fallouts.
Kudos to Hannes for driving SMR support"
* 'for-4.7-zac' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/libata: (28 commits)
libata: support host-aware and host-managed ZAC devices
libata: support device-managed ZAC devices
libata: NCQ encapsulation for ZAC MANAGEMENT OUT
libata: Implement ZBC OUT translation
libata: implement ZBC IN translation
libata: fixup ZAC device disabling
libata-scsi: Generate sense code for disabled devices
libata-trace: decode subcommands
libata: Check log page directory before accessing pages
libata: Add command definitions for NCQ Encapsulation for READ LOG DMA EXT
libata: Separate out ata_dev_config_ncq_send_recv()
libata/libsas: Define ATA_CMD_NCQ_NON_DATA
libsas: enable FPDMA SEND/RECEIVE
libata: do not attempt to retrieve sense code twice
libata-scsi: Set information sense field for invalid parameter
libata-scsi: set bit pointer for sense code information
libata-scsi: Set field pointer in sense code
scsi: add scsi_set_sense_field_pointer()
libata: Implement control mode page to select sense format
libata-scsi: generate correct ATA pass-through sense
...
Linus Torvalds [Tue, 24 May 2016 00:26:27 +0000 (17:26 -0700)]
Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security
Pull more security subsystem updates from James Morris:
"Minor updates for the keys code"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security:
MAINTAINERS: Update keyrings record and add asymmetric keys record
lib: asn1_decoder - add MODULE_LICENSE("GPL")
KEYS: The PKCS#7 test key type should use the secondary keyring
Jiri Slaby [Mon, 23 May 2016 23:26:20 +0000 (16:26 -0700)]
kgdb: depends on VT
With VT=n, the kernel build fails with:
drivers/built-in.o: In function `kgdboc_pre_exp_handler':
kgdboc.c:(.text+0x7b5aa): undefined reference to `fg_console'
kgdboc.c:(.text+0x7b5ce): undefined reference to `vc_cons'
kgdboc.c:(.text+0x7b5d5): undefined reference to `vc_cons'
kgdboc.o is built when KGDB_SERIAL_CONSOLE is set. So make
KGDB_SERIAL_CONSOLE depend on HW_CONSOLE which includes those symbols.
Michal Hocko [Mon, 23 May 2016 23:26:17 +0000 (16:26 -0700)]
drm/amdgpu: make amdgpu_mn_get wait for mmap_sem killable
amdgpu_mn_get, which is called in the ioctl path, relies on mmap_sem for
write. If the waiting task gets killed by the oom killer it would block
oom_reaper from asynchronous address space reclaim and reduce the
chances of timely OOM resolving. Wait for the lock in the killable mode
and return with EINTR if the task got killed while waiting.
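A generic sketch of this conversion pattern (illustrative function, not
the actual amdgpu code):
	static int some_write_locked_path(struct mm_struct *mm)
	{
		if (down_write_killable(&mm->mmap_sem))
			return -EINTR;	/* fatal signal: bail out, let the task die */

		/* ... modify the address space ... */

		up_write(&mm->mmap_sem);
		return 0;
	}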
[arnd@arndb.de: use ERR_PTR() to return from amdgpu_mn_get] Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Christian König <christian.koenig@amd.com> Cc: David Airlie <airlied@linux.ie> Cc: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Michal Hocko [Mon, 23 May 2016 23:26:14 +0000 (16:26 -0700)]
drm/radeon: make radeon_mn_get wait for mmap_sem killable
radeon_mn_get, which is called in the ioctl path, relies on mmap_sem for
write. If the waiting task gets killed by the oom killer it would block
oom_reaper from asynchronous address space reclaim and reduce the
chances of timely OOM resolving. Wait for the lock in the killable mode
and return with EINTR if the task got killed while waiting.
Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Christian König <christian.koenig@amd.com> Cc: Alex Deucher <alexander.deucher@amd.com> Cc: David Airlie <airlied@linux.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Michal Hocko [Mon, 23 May 2016 23:26:11 +0000 (16:26 -0700)]
drm/i915: make i915_gem_mmap_ioctl wait for mmap_sem killable
i915_gem_mmap_ioctl relies on mmap_sem for write. If the waiting task
gets killed by the oom killer it would block oom_reaper from
asynchronous address space reclaim and reduce the chances of timely OOM
resolving. Wait for the lock in the killable mode and return with EINTR
if the task got killed while waiting.
Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Daniel Vetter <daniel.vetter@intel.com> Cc: David Airlie <airlied@linux.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Michal Hocko [Mon, 23 May 2016 23:26:08 +0000 (16:26 -0700)]
uprobes: wait for mmap_sem for write killable
xol_add_vma needs mmap_sem for write. If the waiting task gets killed
by the oom killer it would block oom_reaper from asynchronous address
space reclaim and reduce the chances of timely OOM resolving. Wait for
the lock in the killable mode and return with EINTR if the task got
killed while waiting.
Do not warn in dup_xol_work if __create_xol_area failed due to fatal
signal pending because this is usually considered a kernel issue.
Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Michal Hocko [Mon, 23 May 2016 23:26:05 +0000 (16:26 -0700)]
prctl: make PR_SET_THP_DISABLE wait for mmap_sem killable
PR_SET_THP_DISABLE requires mmap_sem for write. If the waiting task
gets killed by the oom killer it would block oom_reaper from
asynchronous address space reclaim and reduce the chances of timely OOM
resolving. Wait for the lock in the killable mode and return with EINTR
if the task got killed while waiting.
Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Alex Thorlton <athorlton@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Michal Hocko [Mon, 23 May 2016 23:26:02 +0000 (16:26 -0700)]
exec: make exec path waiting for mmap_sem killable
setup_arg_pages requires mmap_sem for write. If the waiting task gets
killed by the oom killer it would block oom_reaper from asynchronous
address space reclaim and reduce the chances of timely OOM resolving.
Wait for the lock in the killable mode and return with EINTR if the task
got killed while waiting. All the callers are already handling the error
path and the fatal signal doesn't need any additional treatment.
The same applies to __bprm_mm_init.
Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Michal Hocko [Mon, 23 May 2016 23:25:59 +0000 (16:25 -0700)]
aio: make aio_setup_ring killable
aio_setup_ring waits for mmap_sem in writable mode. If the waiting task
gets killed by the oom killer it would block oom_reaper from
asynchronous address space reclaim and reduce the chances of timely OOM
resolving. Wait for the lock in the killable mode and return with EINTR
if the task got killed while waiting. This will also expedite the
return to the userspace and do_exit.
Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Jeff Moyer <jmoyer@redhat.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Benamin LaHaise <bcrl@kvack.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Michal Hocko [Mon, 23 May 2016 23:25:57 +0000 (16:25 -0700)]
coredump: make coredump_wait wait for mmap_sem for write killable
coredump_wait currently waits for mmap_sem for write, which can prevent
oom_reaper from reclaiming the oom victim's address space asynchronously
because that requires mmap_sem for read. This might happen if the oom
victim is multi-threaded and some thread(s) are holding mmap_sem for read
(e.g. in a page fault) and stuck in the page allocator while other
thread(s) have already reached coredump_wait.
This patch simply uses down_write_killable and bails out with EINTR if
the lock got interrupted by the fatal signal. do_coredump will return
right away and do_group_exit will take care to zap the whole thread
group.
Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Michal Hocko [Mon, 23 May 2016 23:25:54 +0000 (16:25 -0700)]
vdso: make arch_setup_additional_pages wait for mmap_sem for write killable
Most architectures rely on mmap_sem for write in their
arch_setup_additional_pages. If the waiting task gets killed by the oom
killer it would block oom_reaper from asynchronous address space reclaim
and reduce the chances of timely OOM resolving. Wait for the lock in
the killable mode and return with EINTR if the task got killed while
waiting.
Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Andy Lutomirski <luto@amacapital.net> [x86 vdso] Acked-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Michal Hocko [Mon, 23 May 2016 23:25:51 +0000 (16:25 -0700)]
ipc, shm: make shmem attach/detach wait for mmap_sem killable
shmat and shmdt rely on mmap_sem for write. If the waiting task gets
killed by the oom killer it would block oom_reaper from asynchronous
address space reclaim and reduce the chances of timely OOM resolving.
Wait for the lock in the killable mode and return with EINTR if the task
got killed while waiting.
Michal Hocko [Mon, 23 May 2016 23:25:48 +0000 (16:25 -0700)]
mm, fork: make dup_mmap wait for mmap_sem for write killable
dup_mmap needs to lock current's mm mmap_sem for write. If the waiting
task gets killed by the oom killer it would block oom_reaper from
asynchronous address space reclaim and reduce the chances of timely OOM
resolving. Wait for the lock in the killable mode and return with EINTR
if the task got killed while waiting.
Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Michal Hocko [Mon, 23 May 2016 23:25:45 +0000 (16:25 -0700)]
mm, proc: make clear_refs killable
CLEAR_REFS_MM_HIWATER_RSS and CLEAR_REFS_SOFT_DIRTY rely on
mmap_sem for write. If the waiting task gets killed by the oom killer
and it is operating on the current task's mm, it would block oom_reaper from
asynchronous address space reclaim and reduce the chances of timely OOM
resolving. Wait for the lock in the killable mode and return with EINTR
if the task got killed while waiting. This will also expedite the
return to the userspace and do_exit even if the mm is remote.
Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Petr Cermak <petrcermak@chromium.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Michal Hocko [Mon, 23 May 2016 23:25:42 +0000 (16:25 -0700)]
mm: make vm_brk killable
Now that all the callers handle vm_brk failure, we can change it to wait
for mmap_sem in a killable fashion, to help oom_reaper not get blocked
just because vm_brk is waiting behind mmap_sem readers.
Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Michal Hocko [Mon, 23 May 2016 23:25:39 +0000 (16:25 -0700)]
mm, elf: handle vm_brk error
load_elf_library doesn't handle vm_brk failure although nothing really
indicates it cannot do that because the function is allowed to fail due
to vm_mmap failures already. This might not be a problem now, but a later
patch will make vm_brk killable (resp. waiting for mmap_sem for write will
become killable), so the failure will be more probable.
Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Michal Hocko [Mon, 23 May 2016 23:25:36 +0000 (16:25 -0700)]
mm, aout: handle vm_brk failures
vm_brk is allowed to fail but load_aout_binary simply ignores the error
and happily continues. I haven't noticed any problem from that in real
life, but later patches will make the failure more likely because vm_brk
will become killable (resp. waiting for mmap_sem for write will become
killable), so we should be more careful now.
The error handling should be quite straightforward because there are
calls to vm_mmap which check the error properly already. The only
notable exception is set_brk, which is called after the beyond_if label.
But nothing indicates that we cannot move it above set_binfmt, as the two
do not depend on each other, so that we fail before set_binfmt alters the
reference counting.
Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@kernel.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Michal Hocko [Mon, 23 May 2016 23:25:33 +0000 (16:25 -0700)]
mm: make vm_munmap killable
Almost all current users of vm_munmap are ignoring the return value and
so they do not handle potential error. This means that some VMAs might
stay behind. This patch doesn't try to solve those potential problems.
Quite the contrary, it adds a new failure mode by using down_write_killable
in vm_munmap. This should be safer than other failure modes, though,
because the process is guaranteed to die as soon as it leaves the kernel
and exit_mmap will clean the whole address space.
This will help in the OOM conditions when the oom victim might be stuck
waiting for the mmap_sem for write which in turn can block oom_reaper
which relies on the mmap_sem for read to make a forward progress and
reclaim the address space of the victim.
Signed-off-by: Michal Hocko <mhocko@suse.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Michal Hocko [Mon, 23 May 2016 23:25:30 +0000 (16:25 -0700)]
mm: make vm_mmap killable
All the callers of vm_mmap seem to check for the failure already and
bail out in one way or another on the error which means that we can
change it to use killable version of vm_mmap_pgoff and return -EINTR if
the current task gets killed while waiting for mmap_sem. This also
means that vm_mmap_pgoff can be killable by default and drop the
additional parameter.
This will help in the OOM conditions when the oom victim might be stuck
waiting for the mmap_sem for write which in turn can block oom_reaper
which relies on the mmap_sem for read to make a forward progress and
reclaim the address space of the victim.
Please note that load_elf_binary is ignoring vm_mmap error for
current->personality & MMAP_PAGE_ZERO case but that shouldn't be a
problem because the address is not used anywhere and we never return to
the userspace if we got killed.
Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Michal Hocko [Mon, 23 May 2016 23:25:27 +0000 (16:25 -0700)]
mm: make mmap_sem for write waits killable for mm syscalls
This is a follow-up work for oom_reaper [1]. As the async OOM killing
depends on mmap_sem for read, we would really appreciate it if a holder
for write didn't stand in the way. This patchset changes many of the
down_write calls to be killable, to help those cases when the writer is
blocked waiting for readers to release the lock, and so help
__oom_reap_task to process the oom victim.
Most of the patches are really trivial, because the lock is held in
shallow syscall paths where we can return EINTR trivially and allow the
current task to die (note that EINTR will never get to userspace as
the task has a fatal signal pending). Others seem easy as well, as
the callers are already handling fatal errors and bail out and return to
userspace, which should be sufficient to handle the failure gracefully.
I am not familiar with all those code paths so a deeper review is really
appreciated.
As this work is touching more areas which are not directly connected I
have tried to keep the CC list as small as possible and people who I
believed would be familiar are CCed only to the specific patches (all
should have received the cover though).
This patchset is based on linux-next and depends on down_write_killable
for rw_semaphores, which got merged into the tip locking/rwsem branch and
is included in that next tree. I guess it would be easiest to route these
patches via mmotm because of the dependency on the tip tree, but if the
respective maintainers prefer another way I have no objections.
I haven't covered all the down_write(&mm->mmap_sem) instances here;
I have tried to cover those which should be relatively easy to review in
this series, because this alone should be a nice improvement. Other
places can be changed on top.
This is the first step in making mmap_sem write waiters killable. It
focuses on the trivial ones which take the lock early after
entering the syscall and do not change any state before that.
Therefore it is very easy to change them to use down_write_killable and
immediately return with -EINTR. This will allow the waiter to pass away
without blocking the mmap_sem, which might be required to make forward
progress. E.g. the oom reaper will need the lock for reading to
dismantle the OOM victim's address space.
The only tricky function in this patch is vm_mmap_pgoff which has many
call sites via vm_mmap. To reduce the risk, keep vm_mmap with the
original non-killable semantics for now.
vm_munmap callers do not bother checking the return value, so open-code
it in the munmap syscall path for now, for simplicity.
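As an illustration, the open-coded munmap syscall path would look
roughly like this (a sketch; details may differ from the actual patch):
	SYSCALL_DEFINE2(munmap, unsigned long, addr, size_t, len)
	{
		int ret;
		struct mm_struct *mm = current->mm;

		profile_munmap(addr);
		if (down_write_killable(&mm->mmap_sem))
			return -EINTR;		/* killed while waiting for the lock */
		ret = do_munmap(mm, addr, len);
		up_write(&mm->mmap_sem);
		return ret;
	}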
Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Mel Gorman <mgorman@suse.de> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Hugh Dickins <hughd@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Kieran Bingham [Mon, 23 May 2016 23:25:21 +0000 (16:25 -0700)]
scripts/gdb: decode bytestream on dmesg for Python3
The recent fixes to lx-dmesg now allow the command to print
successfully on Python 3; however, the python interpreter wraps the bytes
for each line with a b'<text>' marker.
To remove this, we need to decode the line, where .decode() will default
to 'UTF-8'
Link: http://lkml.kernel.org/r/d67ccf93f2479c94cb3399262b9b796e0dbefcf2.1462865983.git.jan.kiszka@siemens.com Signed-off-by: Kieran Bingham <kieran@bingham.xyz> Acked-by: Dom Cote <buzdelabuz2@gmail.com> Tested-by: Dom Cote <buzdelabuz2@gmail.com> Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Dom Cote [Mon, 23 May 2016 23:25:16 +0000 (16:25 -0700)]
scripts/gdb: improve types abstraction for gdb python scripts
Change the read_u16 function so it accepts both 'str' and 'byte' as type
for the arguments.
When calling read_memory() from the gdb API, depending on whether it was
built with 2.7 or 3.X, the format used to return the data will differ
('str' for 2.7, and 'byte' for 3.X).
Add a function read_memoryview() to be able to get a 'memoryview' object
back from read_memory() both with python 2.7 and 3.X .
Tested with python 3.4 and 2.7
Tested with gdb 7.7
Link: http://lkml.kernel.org/r/73621f564503137a002a639d174e4fb35f73f462.1462865983.git.jan.kiszka@siemens.com Signed-off-by: Dom Cote <buzdelabuz2+git@gmail.com> Tested-by: Kieran Bingham <kieran@bingham.xyz> (Py2.7,Py3.4,GDB10) Signed-off-by: Kieran Bingham <kieran@bingham.xyz> Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Kieran Bingham [Mon, 23 May 2016 23:25:13 +0000 (16:25 -0700)]
scripts/gdb: add lx_thread_info_by_pid helper
The tasks module already provides helpers to find the task struct by
pid, and the thread_info by task struct; however this is cumbersome to
utilise on the gdb commandline.
Wrap these two functionalities together in a single extra helper to
allow exploring the thread info from a PID value.
Kieran Bingham [Mon, 23 May 2016 23:25:07 +0000 (16:25 -0700)]
scripts/gdb: add a Radix Tree Parser
Linux makes use of the Radix Tree data structure to store pointers
indexed by integer values. This structure is utilised across many
structures in the kernel including the IRQ descriptor tables, and
several filesystems.
This module provides a method to look up values from a structure given
its head node.
Usage:
The function lx_radix_tree_lookup must be given a symbol of type struct
radix_tree_root and an index into that tree.
The object returned is a generic integer value, and must be cast
correctly to the type based on the storage in the data structure.
For example, the irq descriptor at index 18 of the sparse irq_desc_tree
can be printed by passing irq_desc_tree and 18 to lx_radix_tree_lookup
and casting the result to struct irq_desc.
Kieran Bingham [Mon, 23 May 2016 23:25:02 +0000 (16:25 -0700)]
scripts/gdb: add cpu iterators
The Linux kernel provides macros for iterating over values from the
cpu_list masks. By providing some commonly used masks, we can mirror
the kernel's helper macros with easy-to-use generators.
Kieran Bingham [Mon, 23 May 2016 23:24:59 +0000 (16:24 -0700)]
scripts/gdb: add mount point list command
lx-mounts will identify current mount points based on the 'init_task'
namespace by default, as we do not yet have a kernel thread list
implementation to select the current running thread.
Optionally, a user can specify a PID to list from that process'
namespace.
Kieran Bingham [Mon, 23 May 2016 23:24:51 +0000 (16:24 -0700)]
scripts/gdb: support !CONFIG_MODULES gracefully
If CONFIG_MODULES is not enabled, lx-lsmod tries to find a non-existent
symbol and generates an unfriendly traceback:
(gdb) lx-lsmod
Address Module Size Used by
Traceback (most recent call last):
File "scripts/gdb/linux/modules.py", line 75, in invoke
for module in module_list():
File "scripts/gdb/linux/modules.py", line 24, in module_list
module_ptr_type = module_type.get_type().pointer()
File "scripts/gdb/linux/utils.py", line 28, in get_type
self._type = gdb.lookup_type(self._name)
gdb.error: No struct type named module.
Error occurred in Python command: No struct type named module.
Catch the error and return an empty module_list() for a clean command
output (just the header line, with no modules listed).
Kieran Bingham [Mon, 23 May 2016 23:24:48 +0000 (16:24 -0700)]
scripts/gdb: provide exception catching parser
If we attempt to read a value that is not available to GDB, an exception
is raised. Most of the time, this is a good thing; however on occasion
we will want to be able to determine if a symbol is available.
By catching the exception and simply returning None, we can determine
whether we tried to read an invalid value, without the exception taking
our execution context away from us.
Use kmemdup when some other buffer is immediately copied into the newly
allocated region. It replaces a call to the allocator followed by a
memcpy with a single call to kmemdup.
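A generic illustration of the transformation (not tied to any particular
call site):
	/* before */
	dst = kmalloc(len, GFP_KERNEL);
	if (!dst)
		return -ENOMEM;
	memcpy(dst, src, len);

	/* after */
	dst = kmemdup(src, len, GFP_KERNEL);
	if (!dst)
		return -ENOMEM;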
[akpm@linux-foundation.org: remove unneeded cast to void*] Link: http://lkml.kernel.org/r/1463665743-16269-1-git-send-email-falakreyaz@gmail.com Signed-off-by: Muhammad Falak R Wani <falakreyaz@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
rtsx_usb_ms: use schedule_timeout_idle() in polling loop
The first version of this patch was already posted to LKML by Ben
Hutchings ~6 months ago, but no further action was taken.
Ben's original message:
: rtsx_usb_ms creates a task that mostly sleeps, but tasks in
: uninterruptible sleep still contribute to the load average (for
: bug-compatibility with Unix). A load average of ~1 on a system that
: should be idle is somewhat alarming.
:
: Change the sleep to be interruptible, but still ignore signals.
References: https://bugs.debian.org/765717 Link: http://lkml.kernel.org/r/b49f95ae83057efa5d96f532803cba47@natalenko.name Signed-off-by: Oleksandr Natalenko <oleksandr@natalenko.name> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Ben Hutchings <ben@decadent.org.uk> Cc: Lee Jones <lee.jones@linaro.org> Cc: Wolfram Sang <wsa@the-dreams.de> Cc: Roger Tseng <rogerable@realtek.com> Cc: Greg KH <greg@kroah.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Corey Minyard [Mon, 23 May 2016 23:24:25 +0000 (16:24 -0700)]
kdump: fix gdb macros to work with newer and 64-bit kernels
Lots of little changes needed to be made to clean these up, remove the
four-byte pointer assumption and traverse the pid queue properly. Also
consolidate the traceback code into a single function instead of having
three copies of it.
Link: http://lkml.kernel.org/r/1462926655-9390-1-git-send-email-minyard@acm.org Signed-off-by: Corey Minyard <cminyard@mvista.com> Acked-by: Baoquan He <bhe@redhat.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Haren Myneni <hbabu@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Xunlei Pang [Mon, 23 May 2016 23:24:22 +0000 (16:24 -0700)]
s390/kexec: consolidate crash_map/unmap_reserved_pages() and arch_kexec_protect(unprotect)_crashkres()
Commit 3f625002581b ("kexec: introduce a protection mechanism for the
crashkernel reserved memory") added a mechanism for protecting the crash
kernel reserved memory that is similar to the previous
crash_map/unmap_reserved_pages() implementation; the new one is more
generic in name and cleaner in code (besides, some arches may not be
allowed to unmap the pgtable).
Therefore, this patch consolidates them, and uses the new
arch_kexec_protect(unprotect)_crashkres() to replace the former
crash_map/unmap_reserved_pages(), which by now is only used by
S390.
The consolidation work needs the crash memory to be mapped initially,
this is done in machine_kdump_pm_init() which is after
reserve_crashkernel(). Once kdump kernel is loaded, the new
arch_kexec_protect_crashkres() implemented for S390 will actually
unmap the pgtable like before.
Signed-off-by: Xunlei Pang <xlpang@redhat.com> Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com> Acked-by: Michael Holzheu <holzheu@linux.vnet.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Minfei Huang <mhuang@redhat.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Dave Young <dyoung@redhat.com> Cc: Baoquan He <bhe@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Minfei Huang [Mon, 23 May 2016 23:24:19 +0000 (16:24 -0700)]
kexec: do a cleanup for function kexec_load
There is a lot of work to be done in the function kexec_load, not only
allocating structs and loading the initramfs, but also some miscellaneous
tasks.
To make it clearer, wrap a new function do_kexec_load which is used to
allocate structs and load the initramfs, while the preparatory work
remains in kexec_load.
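The resulting split looks roughly like this (a sketch with the checks
abridged; do_kexec_load's exact signature may differ from the patch):
	SYSCALL_DEFINE4(kexec_load, unsigned long, entry, unsigned long, nr_segments,
			struct kexec_segment __user *, segments, unsigned long, flags)
	{
		/* capability, flags and architecture sanity checks ... */

		return do_kexec_load(entry, nr_segments, segments, flags);
	}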
Signed-off-by: Minfei Huang <mnfhuang@gmail.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Xunlei Pang <xlpang@redhat.com> Cc: Baoquan He <bhe@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Minfei Huang [Mon, 23 May 2016 23:24:16 +0000 (16:24 -0700)]
kexec: make a pair of map/unmap reserved pages in error path
For some arch, kexec shall map the reserved pages, then use them, when
we try to start the kdump service.
kexec may return directly, without unmapping the reserved pages, if it
fails while starting the service. To fix it, we pair the map/unmap of
the reserved pages in both the generic path and the error path.
This patch only affects s390. Other architectures don't implement the
crash_unmap_reserved_pages and crash_map_reserved_pages interface.
It isn't an urgent patch. The kernel can work well without any risk,
although the reserved pages are not unmapped before returning in the
error path.
Signed-off-by: Minfei Huang <mnfhuang@gmail.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Xunlei Pang <xlpang@redhat.com> Cc: Baoquan He <bhe@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Xunlei Pang [Mon, 23 May 2016 23:24:10 +0000 (16:24 -0700)]
kexec: introduce a protection mechanism for the crashkernel reserved memory
For the cases where some kernel (module) path stomps on the crash
reserved memory (already mapped by the kernel) into which the second
kernel's data has been loaded, the kdump kernel will probably fail to
boot when a panic happens (or even when it doesn't), leaving the culprit
at large. This is unacceptable.
The patch introduces a mechanism for detecting such cases:
1) After each crash kexec loading, it simply marks the reserved memory
regions read-only, since we no longer access them after that. When
someone stomps on the region, the first kernel will panic and trigger
the kdump. The weak arch_kexec_protect_crashkres() is introduced to do
the actual protection.
2) To allow multiple loadings, once 1) is done we also need to remark
the reserved memory read-write each time a system call related to
kdump is made. The weak arch_kexec_unprotect_crashkres() is introduced
to do the actual unprotection.
The architecture can make its specific implementation by overriding
arch_kexec_protect_crashkres() and arch_kexec_unprotect_crashkres().
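The weak defaults are presumably no-ops along these lines, to be
overridden by architecture code with the real protection logic:
	void __weak arch_kexec_protect_crashkres(void)
	{
	}

	void __weak arch_kexec_unprotect_crashkres(void)
	{
	}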
Signed-off-by: Xunlei Pang <xlpang@redhat.com> Cc: Eric Biederman <ebiederm@xmission.com> Cc: Dave Young <dyoung@redhat.com> Cc: Minfei Huang <mhuang@redhat.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Baoquan He <bhe@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Oleg Nesterov [Mon, 23 May 2016 23:24:08 +0000 (16:24 -0700)]
exec: remove the no longer needed remove_arg_zero()->free_arg_page()
remove_arg_zero() does free_arg_page() for no reason. This was needed
before and only if CONFIG_MMU=y: see commit 4fc75ff4816c ("exec: fix
remove_arg_zero"), install_arg_page() was called for every page != NULL
in the bprm->page[] array. Today install_arg_page() is already gone and
free_arg_page() is a nop after another commit b6a2fea39318 ("mm: variable
length argument support").
CONFIG_MMU=n does free_arg_pages() in free_bprm() and thus it doesn't
need remove_arg_zero()->free_arg_page() too; apart from get_arg_page()
it never checks if the page in bprm->page[] was allocated or not, so the
"extra" non-freed page is fine. OTOH, this free_arg_page() can add the
minor pessimization, the caller is going to do copy_strings_kernel()
right after remove_arg_zero() which will likely need to re-allocate the
same page again.
And as Hujunjie pointed out, the "offset == PAGE_SIZE" check is wrong
because we are going to increment bprm->p once again before return, so
CONFIG_MMU=n "leaks" the page anyway if '0' is the final byte in this
page.
NOTE: remove_arg_zero() assumes that argv[0] is null-terminated but this
is not necessarily true. copy_strings() does "len = strnlen_user(...)",
then copy_from_user(len), but another thread or debugger can overwrite the
trailing '0' in between. Afaics nothing really bad can happen because
we must always have the null-terminated bprm->filename copied by the 1st
copy_strings_kernel(), but perhaps we should change this code to check
"bprm->p < bprm->exec" anyway, and/or change copy_strings() to ensure
that the last byte in string is always zero.
Andi Kleen [Mon, 23 May 2016 23:24:05 +0000 (16:24 -0700)]
kernel/fork.c: allocate idle task for a CPU always on its local node
Linux preallocates the task structs of the idle tasks for all possible
CPUs. This currently means they all end up on node 0. This also
implies that the cache lines used for MWAIT, which are around the flags
field in the task struct, are all located on node 0.
We see a noticeable performance improvement on Knights Landing CPUs when
the cache lines used for MWAIT are located in the local nodes of the
CPUs using them. I would expect this to give a (likely slight)
improvement on other systems too.
The patch places the idle task of each CPU on that CPU's node, by
passing the right target node to copy_process().
[akpm@linux-foundation.org: use NUMA_NO_NODE, not a bare -1] Link: http://lkml.kernel.org/r/1463492694-15833-1-git-send-email-andi@firstfloor.org Signed-off-by: Andi Kleen <ak@linux.intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Tetsuo Handa [Mon, 23 May 2016 23:23:57 +0000 (16:23 -0700)]
signal: make oom_flags a bool
Currently the size of "struct signal_struct"->oom_flags member is
sizeof(unsigned) bytes, but only one flag, OOM_FLAG_ORIGIN, which is
updated by the current thread, is defined. We can convert OOM_FLAG_ORIGIN
into a bool, and reuse the saved bytes for updating from the OOM killer
and/or the OOM reaper thread.
By the way, do we care about a race window between run_store() and
swapoff() because it would be theoretically possible that two threads
sharing the "struct signal_struct" concurrently call respective
functions? If we care, we can make oom_flags an atomic_t.
Oleg Nesterov [Mon, 23 May 2016 23:23:53 +0000 (16:23 -0700)]
wait: allow sys_waitid() to accept __WNOTHREAD/__WCLONE/__WALL
I see no reason why waitid() can't support other linux-specific flags
allowed in sys_wait4().
In particular this change can help if we reconsider the previous change
("wait/ptrace: assume __WALL if the child is traced") which adds the
"automagical" __WALL for debugger.
Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: Jan Kratochvil <jan.kratochvil@redhat.com> Cc: "Michael Kerrisk (man-pages)" <mtk.manpages@gmail.com> Cc: Pedro Alves <palves@redhat.com> Cc: Roland McGrath <roland@hack.frob.com> Cc: <syzkaller@googlegroups.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
creates an unreapable zombie if /sbin/init doesn't use __WALL.
This is not a kernel bug, at least in a sense that everything works as
expected: debugger should reap a traced sub-thread before it can reap the
leader, but without __WALL/__WCLONE do_wait() ignores sub-threads.
Unfortunately, it seems that /sbin/init in most (all?) distributions
doesn't use it and we have to change the kernel to avoid the problem.
Note also that most inits use sys_waitid(), which doesn't allow __WALL, so
the necessary user-space fix is not that trivial.
This patch just adds the "ptrace" check into eligible_child(). To some
degree this matches the "tsk->ptrace" in exit_notify(), ->exit_signal is
mostly ignored when the tracee reports to debugger. Or WSTOPPED, the
tracer doesn't need to set this flag to wait for the stopped tracee.
This obviously means the user-visible change: __WCLONE and __WALL no
longer have any meaning for debugger. And I can only hope that this won't
break something, but at least strace/gdb won't suffer.
We could make a more conservative change. Say, we can take __WCLONE into
account, or !thread_group_leader(). But it would be nice to not
complicate these historical/confusing checks.
Ryusuke Konishi [Mon, 23 May 2016 23:23:31 +0000 (16:23 -0700)]
nilfs2: do not emit extra newline on nilfs_warning() and nilfs_error()
This updates call sites of nilfs_warning() and nilfs_error() so that they
don't add a duplicate newline. These output functions are already
designed to add a trailing newline to the message.
Ryusuke Konishi [Mon, 23 May 2016 23:23:17 +0000 (16:23 -0700)]
nilfs2: get rid of nilfs_mdt_mark_block_dirty()
nilfs_mdt_mark_block_dirty() can be replaced with primary functions
like nilfs_mdt_get_block() and mark_buffer_dirty(), and it's used only
by nilfs_ioctl_mark_blocks_dirty().
This gets rid of the function to simplify the interface of metadata
file.
Ryusuke Konishi [Mon, 23 May 2016 23:23:14 +0000 (16:23 -0700)]
nilfs2: clarify permission to replicate the design
To respond to a certain developer's request, this explicitly states that
developers can reimplement the nilfs2 design for other operating systems
to share data stored in that format.
Ryusuke Konishi [Mon, 23 May 2016 23:23:06 +0000 (16:23 -0700)]
nilfs2: remove FSF mailing address from GPL notices
This removes the extra paragraph which mentions FSF address in GPL
notices from source code of nilfs2 and avoids the checkpatch.pl error
related to it.
Ryusuke Konishi [Mon, 23 May 2016 23:23:03 +0000 (16:23 -0700)]
nilfs2: remove space before comma
Fix checkpatch.pl error "ERROR: space prohibited before that ','
(ctx:WxW)" at nilfs_sufile_set_suinfo().
This also fixes checkpatch.pl warning "WARNING: Prefer 'unsigned int' to
bare use of 'unsigned'" at nilfs_sufile_set_suinfo() and
nilfs_sufile_get_suinfo().