Zheng Liu [Sun, 19 Aug 2012 22:07:40 +0000 (18:07 -0400)]
ext4: remove duplicated declarations in inode.c
In patch cb20d5188366f04d96d2e07b1240cc92170ade40, ext4_set_bh_endio
and ext4_end_io_buffer_write are declared at the beginning of inode.c,
and again later on in the middle of the file. Remove the second set
of duplicated function declarations.
Signed-off-by: Zheng Liu <wenqing.lz@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
results in an IO error when unmounting the RO filesystem:
[ 318.020828] Buffer I/O error on device loop1, logical block 196608
[ 318.027024] lost page write due to I/O error on loop1
[ 318.032088] JBD2: Error -5 detected when updating journal superblock for loop1-8.
This was a regression introduced by commit 24bcc89c7e7c: "jbd2: split
updating of journal superblock and marking journal empty".
Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@vger.kernel.org
Linus Torvalds [Sat, 18 Aug 2012 23:20:05 +0000 (16:20 -0700)]
Merge branch 'fixes' of git://git.linaro.org/people/rmk/linux-arm
Pull ARM fixes from Russell King:
"The largest thing in this set of changes is bringing back some of the
ARMv3 code to fix a compile problem noticed on RiscPC, which we still
support, even though we only support ARMv4 there.
(The reason is that the system bus doesn't support ARMv4 half-word
accesses, so we need the ARMv3 library code for this platform.)
The rest are all quite minor fixes."
* 'fixes' of git://git.linaro.org/people/rmk/linux-arm:
ARM: 7490/1: Drop duplicate select for GENERIC_IRQ_PROBE
ARM: Bring back ARMv3 IO and user access code
ARM: 7489/1: errata: fix workaround for erratum #720789 on UP systems
ARM: 7488/1: mm: use 5 bits for swapfile type encoding
ARM: 7487/1: mm: avoid setting nG bit for user mappings that aren't present
ARM: 7486/1: sched_clock: update epoch_cyc on resume
ARM: 7484/1: Don't enable GENERIC_LOCKBREAK with ticket spinlocks
ARM: 7483/1: vfp: only advertise VFPv4 in hwcaps if CONFIG_VFPv3 is enabled
ARM: 7482/1: topology: fix section mismatch warning for init_cpu_topology
Linus Torvalds [Sat, 18 Aug 2012 21:39:19 +0000 (14:39 -0700)]
Merge tag 'pm-for-3.6-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management fixes from Rafael J. Wysocki:
- Fixes for three obscure problems in the runtime PM core code found
recently.
- Two fixes for the new "coupled" cpuidle code from Colin Cross and Jon
Medhurst.
- intel_idle driver fix from Konrad Rzeszutek Wilk.
* tag 'pm-for-3.6-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
intel_idle: Check cpu_idle_get_driver() for NULL before dereferencing it.
cpuidle: Prevent null pointer dereference in cpuidle_coupled_cpu_notify
cpuidle: coupled: fix sleeping while atomic in cpu notifier
PM / Runtime: Check device PM QoS setting before "no callbacks" check
PM / Runtime: Clear power.deferred_resume on success in rpm_suspend()
PM / Runtime: Fix rpm_resume() return value for power.no_callbacks set
Al Viro [Sat, 18 Aug 2012 04:25:51 +0000 (00:25 -0400)]
unexport sock_map_fd(), switch to sock_alloc_file()
Both modular callers of sock_map_fd() had been buggy; sctp one leaks
descriptor and file if copy_to_user() fails, 9p one shouldn't be
exposing file in the descriptor table at all.
Switch both to sock_alloc_file(), export it, unexport sock_map_fd() and
make it static.
Al Viro [Sat, 18 Aug 2012 01:32:56 +0000 (21:32 -0400)]
vfio: grab vfio_device reference *before* exposing the sucker via fd_install()
It's not critical (anymore) since another thread closing the file will block
on ->device_lock before it gets to dropping the final reference, but it's
definitely cleaner that way...
Al Viro [Sat, 18 Aug 2012 01:29:06 +0000 (21:29 -0400)]
vfio: get rid of vfio_device_put()/vfio_group_get_device* races
we really need to make sure that dropping the last reference happens
under the group->device_lock; otherwise a loop (under device_lock)
might find vfio_device instance that is being freed right now, has
already dropped the last reference and waits on device_lock to exclude
the sucker from the list.
Linus Torvalds [Sat, 18 Aug 2012 17:02:17 +0000 (10:02 -0700)]
Merge branch 'vfs-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs
Pull vfs fixes from Miklos Szeredi.
This mainly fixes some confusion about whether the open 'mode' variable
passed around should contain the full file type (S_IFREG etc)
information or just the permission mode. In particular, the lack of
proper file type information had confused fuse.
* 'vfs-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
vfs: fix propagation of atomic_open create error on negative dentry
fuse: check create mode in atomic open
vfs: pass right create mode to may_o_create()
vfs: atomic_open(): fix create mode usage
vfs: canonicalize create mode in build_open_flags()
Sven Schnelle [Fri, 17 Aug 2012 19:43:43 +0000 (21:43 +0200)]
USB: CDC ACM: Fix NULL pointer dereference
If a device specifies zero endpoints in its interface descriptor,
the kernel oopses in acm_probe(). Even though that's clearly an
invalid descriptor, we should test wether we have all endpoints.
This is especially bad as this oops can be triggered by just
plugging a USB device in.
USB: emi62: remove __devinit* from the struct usb_device_id table
This structure needs to always stick around, even if CONFIG_HOTPLUG
is disabled, otherwise we can oops when trying to probe a device that
was added after the structure is thrown away.
Thanks to Fengguang Wu and Bjørn Mork for tracking this issue down.
Reported-by: Fengguang Wu <fengguang.wu@intel.com> Reported-by: Bjørn Mork <bjorn@mork.no> Cc: stable <stable@vger.kernel.org> CC: Paul Gortmaker <paul.gortmaker@windriver.com> CC: Andrew Morton <akpm@linux-foundation.org> CC: Felipe Balbi <balbi@ti.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
USB: winbond: remove __devinit* from the struct usb_device_id table
This structure needs to always stick around, even if CONFIG_HOTPLUG
is disabled, otherwise we can oops when trying to probe a device that
was added after the structure is thrown away.
Thanks to Fengguang Wu and Bjørn Mork for tracking this issue down.
USB: vt6656: remove __devinit* from the struct usb_device_id table
This structure needs to always stick around, even if CONFIG_HOTPLUG
is disabled, otherwise we can oops when trying to probe a device that
was added after the structure is thrown away.
Thanks to Fengguang Wu and Bjørn Mork for tracking this issue down.
Reported-by: Fengguang Wu <fengguang.wu@intel.com> Reported-by: Bjørn Mork <bjorn@mork.no> Cc: stable <stable@vger.kernel.org> CC: Forest Bond <forest@alittletooquiet.net> CC: Marcos Paulo de Souza <marcos.souza.org@gmail.com> CC: "David S. Miller" <davem@davemloft.net> CC: Jesper Juhl <jj@chaosbits.net> CC: Jiri Pirko <jpirko@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
USB: rtl8187: remove __devinit* from the struct usb_device_id table
This structure needs to always stick around, even if CONFIG_HOTPLUG
is disabled, otherwise we can oops when trying to probe a device that
was added after the structure is thrown away.
Thanks to Fengguang Wu and Bjørn Mork for tracking this issue down.
USB: p54usb: remove __devinit* from the struct usb_device_id table
This structure needs to always stick around, even if CONFIG_HOTPLUG
is disabled, otherwise we can oops when trying to probe a device that
was added after the structure is thrown away.
Thanks to Fengguang Wu and Bjørn Mork for tracking this issue down.
USB: spca506: remove __devinit* from the struct usb_device_id table
This structure needs to always stick around, even if CONFIG_HOTPLUG
is disabled, otherwise we can oops when trying to probe a device that
was added after the structure is thrown away.
Thanks to Fengguang Wu and Bjørn Mork for tracking this issue down.
USB: jl2005bcd: remove __devinit* from the struct usb_device_id table
This structure needs to always stick around, even if CONFIG_HOTPLUG
is disabled, otherwise we can oops when trying to probe a device that
was added after the structure is thrown away.
Thanks to Fengguang Wu and Bjørn Mork for tracking this issue down.
USB: smsusb: remove __devinit* from the struct usb_device_id table
This structure needs to always stick around, even if CONFIG_HOTPLUG
is disabled, otherwise we can oops when trying to probe a device that
was added after the structure is thrown away.
Thanks to Fengguang Wu and Bjørn Mork for tracking this issue down.
Linus Torvalds [Sat, 18 Aug 2012 00:47:32 +0000 (17:47 -0700)]
Merge tag 'md-3.6-fixes' of git://neil.brown.name/md
Pull md fixes from NeilBrown:
"2 fixes for md, tagged for -stable"
* tag 'md-3.6-fixes' of git://neil.brown.name/md:
md/raid10: fix problem with on-stack allocation of r10bio structure.
md: Don't truncate size at 4TB for RAID0 and Linear
NeilBrown [Fri, 17 Aug 2012 23:51:42 +0000 (09:51 +1000)]
md/raid10: fix problem with on-stack allocation of r10bio structure.
A 'struct r10bio' has an array of per-copy information at the end.
This array is declared with size [0] and r10bio_pool_alloc allocates
enough extra space to store the per-copy information depending on the
number of copies needed.
So declaring a 'struct r10bio on the stack isn't going to work. It
won't allocate enough space, and memory corruption will ensue.
So in the two places where this is done, declare a sufficiently large
structure and use that instead.
The two call-sites of this bug were introduced in 3.4 and 3.5
so this is suitable for both those kernels. The patch will have to
be modified for 3.4 as it only has one bug.
Cc: stable@vger.kernel.org Reported-by: Ivan Vasilyev <ivan.vasilyev@gmail.com> Tested-by: Ivan Vasilyev <ivan.vasilyev@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>
Linus Torvalds [Fri, 17 Aug 2012 18:45:58 +0000 (11:45 -0700)]
Merge tag 'rdma-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband
Pull infiniband/rdma fixes from Roland Dreier:
"Grab bag of InfiniBand/RDMA fixes:
- IPoIB fixes for regressions introduced by path database conversion
- mlx4 fixes for bugs with large memory systems and regressions from
SR-IOV patches
- RDMA CM fix for passing bad event up to userspace
- Other minor fixes"
* tag 'rdma-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband:
IB/mlx4: Check iboe netdev pointer before dereferencing it
mlx4_core: Clean up buddy bitmap allocation
mlx4_core: Fix integer overflow issues around MTT table
mlx4_core: Allow large mlx4_buddy bitmaps
IB/srp: Fix a race condition
IB/qib: Fix error return code in qib_init_7322_variables()
IB: Fix typos in infiniband drivers
IB/ipoib: Fix RCU pointer dereference of wrong object
IB/ipoib: Add missing locking when CM object is deleted
RDMA/ucma.c: Fix for events with wrong context on iWARP
RDMA/ocrdma: Don't call vlan_dev_real_dev() for non-VLAN netdevs
IB/mlx4: Fix possible deadlock on sm_lock spinlock
intel_idle: Check cpu_idle_get_driver() for NULL before dereferencing it.
If the machine is booted without any cpu_idle driver set
(b/c disable_cpuidle() has been called) we should follow
other users of cpu_idle API and check the return value
for NULL before using it.
Reported-and-tested-by: Mark van Dijk <mark@internecto.net> Suggested-by: Jan Beulich <JBeulich@suse.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
cpuidle: Prevent null pointer dereference in cpuidle_coupled_cpu_notify
When a kernel is built to support multiple hardware types it's possible
that CONFIG_ARCH_NEEDS_CPU_IDLE_COUPLED is set but the hardware the
kernel is run on doesn't support cpuidle and therefore doesn't load a
driver for it. In this case, when the system is shut down,
cpuidle_coupled_cpu_notify() gets called with cpuidle_devices set to
NULL. There are quite possibly other circumstances where this
situation can also occur and we should check for it.
Signed-off-by: Jon Medhurst <tixy@linaro.org> Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Colin Cross [Wed, 15 Aug 2012 20:10:50 +0000 (22:10 +0200)]
cpuidle: coupled: fix sleeping while atomic in cpu notifier
The cpu hotplug notifier gets called in both atomic and non-atomic
contexts, it is not always safe to lock a mutex. Filter out all events
except the six necessary ones, which are all sleepable, before taking
the mutex.
Signed-off-by: Colin Cross <ccross@android.com> Reviewed-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com> Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
PM / Runtime: Check device PM QoS setting before "no callbacks" check
If __dev_pm_qos_read_value(dev) returns a negative value,
rpm_suspend() should return -EPERM for dev even if its
power.no_callbacks flag is set. For this to happen, the device's
power.no_callbacks flag has to be checked after the PM QoS check,
so move the PM QoS check to rpm_check_suspend_allowed() (this will
make it cover idle notifications as well as runtime suspend too).
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Acked-by: Alan Stern <stern@rowland.harvard.edu> Cc: stable@vger.kernel.org
PM / Runtime: Clear power.deferred_resume on success in rpm_suspend()
The power.deferred_resume can only be set if the runtime PM status
of device is RPM_SUSPENDING and it should be cleared after its
status has been changed, regardless of whether or not the runtime
suspend has been successful. However, it only is cleared on
suspend failure, while it may remain set on successful suspend and
is happily leaked to rpm_resume() executed in that case.
That shouldn't happen, so if power.deferred_resume is set in
rpm_suspend() after the status has been changed to RPM_SUSPENDED,
clear it before calling rpm_resume(). Then, it doesn't need to be
cleared before changing the status to RPM_SUSPENDING any more,
because it's always cleared after the status has been changed to
either RPM_SUSPENDED (on success) or RPM_ACTIVE (on failure).
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Acked-by: Alan Stern <stern@rowland.harvard.edu> Cc: stable@vger.kernel.org
PM / Runtime: Fix rpm_resume() return value for power.no_callbacks set
For devices whose power.no_callbacks flag is set, rpm_resume()
should return 1 if the device's parent is already active, so that
the callers of pm_runtime_get() don't think that they have to wait
for the device to resume (asynchronously) in that case (the core
won't queue up an asynchronous resume in that case, so there's
nothing to wait for anyway).
Modify the code accordingly (and make sure that an idle notification
will be queued up on success, even if 1 is to be returned).
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Acked-by: Alan Stern <stern@rowland.harvard.edu> Cc: stable@vger.kernel.org
Linus Torvalds [Fri, 17 Aug 2012 17:17:03 +0000 (10:17 -0700)]
Merge tag 'staging-3.6-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging
Pull staging fixes from Greg Kroah-Hartman:
"Here are some staging driver fixes (and iio driver fixes, they get
lumped in with the staging stuff due to dependancies) for your 3.6-rc3
tree.
Nothing major, just a bunch of fixes that people have reported.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>"
* tag 'staging-3.6-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging: (26 commits)
iio: lm3533-als: Fix build warnings
staging:iio:ad7780: Mark channels as unsigned
staging:iio:ad7192: Report offset and scale for temperature channel
staging:iio:ad7192: Report channel offset
staging:iio:ad7192: Mark channels as unsigned
staging:iio:ad7192: Fix setting ACX
staging:iio:ad7192: Add missing break in switch statement
staging:iio:ad7793: Fix internal reference value
staging:iio:ad7793: Follow new IIO naming spec
staging:iio:ad7793: Fix temperature scale and offset
staging:iio:ad7793: Report channel offset
staging:iio:ad7793: Mark channels as unsigned
staging:iio:ad7793: Add missing break in switch statement
iio/adjd_s311: Fix potential memory leak in adjd_s311_update_scan_mode()
iio: frequency: ADF4350: Fix potential reference div factor overflow.
iio: staging: ad7298_ring: Fix maybe-uninitialized warning
staging: comedi: usbduxfast: Declare MODULE_FIRMWARE usage
staging: comedi: usbdux: Declare MODULE_FIRMWARE usage
staging: comedi: usbduxsigma: Declare MODULE_FIRMWARE usage
staging: csr: add INET dependancy
...
Linus Torvalds [Fri, 17 Aug 2012 17:16:30 +0000 (10:16 -0700)]
Merge tag 'driver-core-3.6-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
Pull driver core fixes from Greg Kroah-Hartman:
"Here are two tiny patches, one fixing a dynamic debug problem that the
printk rework turned up, and the other one fixing an extcon problem
that people reported.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>"
* tag 'driver-core-3.6-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
extcon: extcon_gpio: Replace gpio_request_one by devm_gpio_request_one
drivers-core: make structured logging play nice with dynamic-debug
Linus Torvalds [Fri, 17 Aug 2012 17:15:57 +0000 (10:15 -0700)]
Merge tag 'char-misc-3.6-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc
Pull Char / Misc driver fixes from Greg Kroah-Hartman:
"Here are some small misc and w1 driver fixes for 3.6-rc3. Nothing
major, just some some bugfixes and a new device id for a w1 driver.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>"
* tag 'char-misc-3.6-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc:
1-Wire: Add support for the maxim ds1825 temperature sensor
ti-st: Fix check for pdata->chip_awake function pointer
mei: add mei_quirk_probe function
mei: fix device stall after wd is stopped
Linus Torvalds [Fri, 17 Aug 2012 17:14:53 +0000 (10:14 -0700)]
Merge tag 'usb-3.6-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb
Pull USB patches from Greg Kroah-Hartman:
"Here are a number of small USB patches for 3.6-rc3.
The "large" one is just a number of device id updates to the option
driver, done by the manufacturer, properly fixing up the device ids
based on shipping devices.
Other than that, some gadget driver fixes, the obligitary XHCI
patches, and some other device ids and bugs fixed.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>"
* tag 'usb-3.6-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb: (26 commits)
USB: qcserial: fix port handling on Gobi 1K and 2K+
USB: serial: Fix mos7840 timeout
USB: option: add ZTE K5006-Z
usb: gadget: u_ether: fix kworker 100% CPU issue with still used interfaces in eth_stop
usb: host: tegra: fix warning messages in ehci_remove
usb: host: mips: sead3: Update for EHCI register structure.
usb: renesas_usbhs: fixup resume method for autonomy mode
usb: renesas_usbhs: mod_host: add missing .bus_suspend/resume
update MAINTAINERS for Oliver Neukum
usb: usb_wwan: resume/suspend can be called after port is gone
usb: serial: prevent suspend/resume from racing against probe/remove
usb: usb_wwan: replace release and disconnect with a port_remove hook
usb: serial: mos7840: Fixup mos7840_chars_in_buffer()
USB: isp1362-hcd.c: usb message always saved in case of underrun
OMAP: USB : Fix the EHCI enumeration and core retention issue
usb: chipidea: fix and improve dependencies if usb host or gadget support is built as module
USB: support the new interfaces of Huawei Data Card devices in option driver
USB: ftdi_sio: Add VID/PID for Kondo Serial USB
xhci: Switch PPT ports to EHCI on shutdown.
xhci: Fix bug after deq ptr set to link TRB.
...
Linus Torvalds [Fri, 17 Aug 2012 15:04:47 +0000 (08:04 -0700)]
Merge tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 bug fixes from Ted Ts'o:
"The following are all bug fixes and regressions. The most notable are
the ones which cause problems for ext4 on RAID --- a performance
problem when mounting very large filesystems, and a kernel OOPS when
doing an rm -rf on large directory hierarchies on fast devices."
* tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
ext4: fix kernel BUG on large-scale rm -rf commands
ext4: fix long mount times on very big file systems
ext4: don't call ext4_error while block group is locked
ext4: avoid kmemcheck complaint from reading uninitialized memory
ext4: make sure the journal sb is written in ext4_clear_journal_err()
powerpc/mpc85xx: Add new ext fields to Integrated FLash Controller
Freescale's Integrated Flash controller(IFC) v1.1.0 supports 40 bit
address bus width.
In case more than 32 bit address is used, the EXT registers should be set.
Add support of ext registers.
Signed-off-by: Kumar Gala <galak@kernel.crashing.org> Signed-off-by: York Sun <yorksun@freescale.com> Signed-off-by: Prabhakar Kushwaha <prabhakar@freescale.com> Signed-off-by: Kumar Gala <galak@kernel.crashing.org>
Theodore Ts'o [Fri, 17 Aug 2012 14:04:17 +0000 (10:04 -0400)]
ext4: return an error if kset_create_and_add fails in ext4_init_fs()
In the very unlikely case that kset_create_and_add() fails when the
ext4.ko module is being loaded (or during kernel startup) set err so
that it's clear that the module load failed.
Ian Kent [Fri, 17 Aug 2012 03:09:04 +0000 (11:09 +0800)]
autofs4 - fix expire check
In some cases when an autofs indirect mount is contained in a file
system that is marked as shared (such as when systemd does the
equivalent of "mount --make-rshared /" early in the boot), mounts
stop expiring.
When this happens the first expiry check on a mountpoint dentry in
autofs_expire_indirect() sees a mountpoint dentry with a higher
than minimal reference count. Consequently the dentry is condidered
busy and the actual expiry check is never done.
This particular check was originally meant as an optimisation to
detect a path walk in progress but with the addition of rcu-walk
it can be ineffective anyway.
Removing the test allows automounts to expire again since the
actual expire check doesn't rely on the dentry reference count.
Signed-off-by: Ian Kent <raven@themaw.net> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Zheng Liu [Fri, 17 Aug 2012 13:54:17 +0000 (09:54 -0400)]
ext4: make the zero-out chunk size tunable
Currently in ext4 the length of zero-out chunk is set to 7 file system
blocks. But if an inode has uninitailized extents from using
fallocate to preallocate space, and the workload issues many random
writes, this can cause a fragmented extent tree that will
unnecessarily grow the extent tree.
So create a new sysfs tunable, extent_max_zeroout_kb, which controls
the maximum size where blocks will be zeroed out instead of creating a
new uninitialized extent. The default of this has been sent to 32kb.
CC: Zach Brown <zab@zabbo.net> CC: Andreas Dilger <adilger@dilger.ca> Signed-off-by: Zheng Liu <wenqing.lz@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Anatol Pomozov [Fri, 17 Aug 2012 13:50:17 +0000 (09:50 -0400)]
ext4: realign trace events structs to make it smaller
Most hardware architectures require that data (including struct fields)
have to be aligned in memory. To make it happen compiler inserts padding
between struct fields if they are not aligned correctly.
Reorder fields to remove paddings and make structures denser. Making data
smaller saves some memory that is very important for trace events.
Tracing buffer has limited size and making objects smaller we can put more
of them without overflowing the tracing buffer.
To find data struct holes I used 'pahole -H 1 -E -I vmlinux.o' from
'dwarves' package.
Theodore Ts'o [Fri, 17 Aug 2012 13:48:17 +0000 (09:48 -0400)]
ext4: add max_dir_size_kb mount option
Very large directories can cause significant performance problems, or
perhaps even invoke the OOM killer, if the process is running in a
highly constrained memory environment (whether it is VM's with a small
amount of memory or in a small memory cgroup).
So it is useful, in cloud server/data center environments, to be able
to set a filesystem-wide cap on the maximum size of a directory, to
ensure that directories never get larger than a sane size. We do this
via a new mount option, max_dir_size_kb. If there is an attempt to
grow the directory larger than max_dir_size_kb, the system call will
return ENOSPC instead.
Theodore Ts'o [Fri, 17 Aug 2012 13:46:17 +0000 (09:46 -0400)]
ext4: don't load the block bitmap for block groups which have no space
Add a short circuit check to ext4_mb_group_group() so that we don't
bother to load the block bitmap for a block group which does not have
any space available. (Or which does not have enough space until we
are in desperation mode, i.e., when cr == 3.)
Theodore Ts'o [Fri, 17 Aug 2012 13:44:17 +0000 (09:44 -0400)]
ext4: collapse a single extent tree block into the inode if possible
If an inode has more than 4 extents, but then later some of the
extents are merged together, we can optimize the file system by moving
the extents up into the inode, and discarding the extent tree block.
This is important, because if there are a large number of inodes with
an external extent tree blocks where the contents could fit in the
inode, this can significantly increase the fsck time of the file
system.
Theodore Ts'o [Fri, 17 Aug 2012 12:54:52 +0000 (08:54 -0400)]
ext4: fix kernel BUG on large-scale rm -rf commands
Commit 968dee7722: "ext4: fix hole punch failure when depth is greater
than 0" introduced a regression in v3.5.1/v3.6-rc1 which caused kernel
crashes when users ran run "rm -rf" on large directory hierarchy on
ext4 filesystems on RAID devices:
BUG: unable to handle kernel NULL pointer dereference at 0000000000000028
The problem in commit 968dee7722 was that caused the variable 'i' to
be left uninitialized if the truncate required more space than was
available in the journal. This resulted in the function
ext4_ext_truncate_extend_restart() returning -EAGAIN, which caused
ext4_ext_remove_space() to restart the truncate operation after
starting a new jbd2 handle.
Theodore Ts'o [Thu, 16 Aug 2012 15:59:04 +0000 (11:59 -0400)]
ext4: fix long mount times on very big file systems
Commit 8aeb00ff85a: "ext4: fix overhead calculation used by
ext4_statfs()" introduced a O(n**2) calculation which makes very large
file systems take forever to mount. Fix this with an optimization for
non-bigalloc file systems. (For bigalloc file systems the overhead
needs to be set in the the superblock.)
Theodore Ts'o [Fri, 10 Aug 2012 17:57:52 +0000 (13:57 -0400)]
ext4: don't call ext4_error while block group is locked
While in ext4_validate_block_bitmap(), if an block allocation bitmap
is found to be invalid, we call ext4_error() while the block group is
still locked. This causes ext4_commit_super() to call a function
which might sleep while in an atomic context.
There's no need to keep the block group locked at this point, so hoist
the ext4_error() call up to ext4_validate_block_bitmap() and release
the block group spinlock before calling ext4_error().
Kees Cook [Wed, 15 Aug 2012 18:41:55 +0000 (11:41 -0700)]
Yama: access task_struct->comm directly
The core ptrace access checking routine holds a task lock, and when
reporting a failure, Yama takes a separate task lock. To avoid a
potential deadlock with two ptracers taking the opposite locks, do not
use get_task_comm() and just use ->comm directly since accuracy is not
important for the report.