Alex Bligh [Mon, 24 Oct 2011 14:53:27 +0000 (01:53 +1100)]
net/netfilter/nf_conntrack_netlink.c: fix Oops on container destroy
Problem:
A repeatable Oops can be caused if a container with networking
unshared is destroyed when it has nf_conntrack entries yet to expire.
A copy of the oops follows below. A perl program generating the oops
repeatably is attached inline below.
Analysis:
The oops is called from cleanup_net when the namespace is
destroyed. conntrack iterates through outstanding events and calls
death_by_timeout on each of them, which in turn produces a call to
ctnetlink_conntrack_event. This calls nf_netlink_has_listeners, which
oopses because net->nfnl is NULL.
The perl program generates the container through fork() then
clone(CLONE_NEWNET). It does not explicitly set up netlink, but I
presume it was set up, else net->nfnl
would have been NULL earlier (i.e. when an earlier connection
timed out). This would thus suggest that net->nfnl is made NULL
during the destruction of the container, which I think is done by
nfnetlink_net_exit_batch.
I can see that the various subsystems are deinitialised in the opposite
order to which the relevant register_pernet_subsys calls are called,
and both nf_conntrack and nfnetlink_net_ops register their relevant
subsystems. If nfnetlink_net_ops registered later than nfconntrack,
then its exit routine would have been called first, which would cause
the oops described. I am not sure there is anything to prevent this
happening in a container environment.
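The ordering argument above can be modelled in a few lines of userspace C. This is only a sketch: pernet_ops_stub, register_ops and teardown_order are hypothetical names standing in for the kernel's pernet registration machinery, and the fixed-size array is an assumption made to keep the example short.

```c
#include <stddef.h>

#define MAX_OPS 8

/* Hypothetical stand-in for struct pernet_operations; only a name is kept. */
struct pernet_ops_stub {
    const char *name;
};

static const struct pernet_ops_stub *registered[MAX_OPS];
static size_t nr_registered;

/* Record a subsystem in registration order. Returns -1 when the table is full. */
static int register_ops(const struct pernet_ops_stub *ops)
{
    if (nr_registered == MAX_OPS)
        return -1;
    registered[nr_registered++] = ops;
    return 0;
}

/*
 * Fill 'out' with subsystem names in teardown order, i.e. the most
 * recently registered subsystem first. Returns the number of entries.
 */
static size_t teardown_order(const char **out, size_t cap)
{
    size_t n = 0;

    for (size_t i = nr_registered; i > 0 && n < cap; i--)
        out[n++] = registered[i - 1]->name;
    return n;
}
```

If the conntrack stub is registered first and the nfnetlink stub second, teardown_order() lists nfnetlink first, which is the situation in which conntrack would still deliver events after net->nfnl has gone away.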
Whilst there's perhaps a more complex problem revolving around ordering
of subsystem deinit, it seems to me that missing a netlink event on a
container that is dying is not a disaster. An early check for net->nfnl
being non-NULL in ctnetlink_conntrack_event appears to fix this. There
may remain a potential race condition if it becomes NULL immediately
after being checked (I am not sure any lock is held at this point or
how synchronisation for subsystem deinitialization works).
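A minimal userspace sketch of the early check described above. The types net_stub and sock_stub and the functions has_listeners and conntrack_event are hypothetical models, not the kernel code; only the shape of the NULL test is taken from the description.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel types; only the nfnl field matters. */
struct sock_stub {
    int listeners;
};

struct net_stub {
    struct sock_stub *nfnl; /* NULL once the netlink side has been torn down */
};

/* Models the listener check: dereferences net->nfnl unconditionally. */
static bool has_listeners(const struct net_stub *net)
{
    return net->nfnl->listeners > 0;
}

/*
 * Models the patched event path. Returns 1 when an event would be
 * delivered and 0 when it is dropped. The early NULL test means a dying
 * namespace silently loses the event instead of crashing.
 */
static int conntrack_event(const struct net_stub *net)
{
    if (net->nfnl == NULL)
        return 0;
    if (!has_listeners(net))
        return 0;
    return 1;
}
```

As noted above, this does not close the window in which nfnl becomes NULL between the test and the dereference; the sketch is single-threaded and does not model that race.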
Patch:
The patch attached should apply on everything from 2.6.26 (if not before)
onwards; it appears to be a problem on all kernels. This was taken against
Ubuntu-3.0.0-11.17 which is very close to 3.0.4. I have torture-tested it
with the above perl script for 15 minutes or so; the perl script hung the
machine within 20 seconds without this patch.
Applicability:
If this is the right solution, it should be applied to all stable kernels
as well as head. Apart from the minor overhead of checking one variable
against NULL, it can never 'do the wrong thing', because if net->nfnl
is NULL, an oops will inevitably result. Therefore, checking is a reasonable
thing to do unless it can be proven that net->nfnl will never be NULL.
Check net->nfnl for NULL in ctnetlink_conntrack_event to avoid Oops on
container destroy
Signed-off-by: Alex Bligh <alex@alex.org.uk>
Cc: Patrick McHardy <kaber@trash.net>
Cc: David Miller <davem@davemloft.net>
Cc: <stable@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Ben Hutchings [Mon, 24 Oct 2011 13:12:28 +0000 (15:12 +0200)]
module,bug: Add TAINT_OOT_MODULE flag for modules not built in-tree
Use of the GPL or a compatible licence doesn't necessarily make the code
any good. We already consider staging modules to be suspect, and this
should also be true for out-of-tree modules which may receive very
little review.
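As a rough illustration of the policy, a loader could derive taint bits from two module properties. Everything here is an assumption for illustration: the bit positions, the module_stub type and the module_taints helper are hypothetical and are not taken from the kernel sources.

```c
#include <stdbool.h>

/* Illustrative bit positions only; treat these values as assumptions. */
enum {
    TAINT_CRAP_BIT = 10,       /* staging module */
    TAINT_OOT_MODULE_BIT = 12  /* module not built in-tree */
};

/* Hypothetical summary of what the loader knows about a module. */
struct module_stub {
    bool in_tree;
    bool staging;
};

/* Compute the taint mask a module would contribute when loaded. */
static unsigned long module_taints(const struct module_stub *mod)
{
    unsigned long taints = 0;

    if (!mod->in_tree)
        taints |= 1UL << TAINT_OOT_MODULE_BIT;
    if (mod->staging)
        taints |= 1UL << TAINT_CRAP_BIT;
    return taints;
}
```

An in-tree, non-staging module contributes no bits, while an out-of-tree module sets the out-of-tree bit regardless of its licence.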
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Reviewed-by: Dave Jones <davej@redhat.com>
Acked-by: Greg Kroah-Hartman <gregkh@suse.de>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> (patched oops-tracing.txt)
Ben Hutchings [Tue, 1 Nov 2011 03:59:33 +0000 (03:59 +0000)]
module: Enable dynamic debugging regardless of taint
Dynamic debugging is currently disabled for tainted modules, except
for TAINT_CRAP. This prevents use of dynamic debugging for
out-of-tree modules once the next patch is applied.
This condition was apparently intended to avoid a crash if a force-
loaded module has an incompatible definition of dynamic debug
structures. However, an administrator that forces us to load a module
is claiming that it *is* compatible even though it fails our version
checks. If they are mistaken, there are any number of ways the module
could crash the system.
As a side-effect, proprietary and other tainted modules can now use
dynamic_debug.
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Tejun Heo [Fri, 4 Nov 2011 00:04:52 +0000 (01:04 +0100)]
PM / Freezer: Revert 27920651fe "PM / Freezer: Make fake_signal_wake_up() wake TASK_KILLABLE tasks too"
Commit 27920651fe "PM / Freezer: Make fake_signal_wake_up() wake
TASK_KILLABLE tasks too" updated fake_signal_wake_up() used by freezer
to wake up KILLABLE tasks. Sending unsolicited wakeups to tasks in
killable sleep is dangerous as there are code paths which depend on
tasks not waking up spuriously from KILLABLE sleep.
For example, sys_read() or page can sleep in TASK_KILLABLE assuming
that wait/down/whatever _killable can only fail if we can not return
to the usermode. TASK_TRACED is another obvious example.
The previous patch updated wait_event_freezekillable() such that it
doesn't depend on the spurious wakeup. This patch reverts the
offending commit.
Note that the spurious KILLABLE wakeup had other implicit effects in
KILLABLE sleeps in nfs and cifs and those will need further updates to
regain freezekillable behavior.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Oleg Nesterov [Thu, 3 Nov 2011 23:07:49 +0000 (16:07 -0700)]
PM / Freezer: Reimplement wait_event_freezekillable using freezer_do_not_count/freezer_count
Commit 27920651fe "PM / Freezer: Make fake_signal_wake_up() wake
TASK_KILLABLE tasks too" updated fake_signal_wake_up() used by freezer
to wake up KILLABLE tasks. Sending unsolicited wakeups to tasks in
killable sleep is dangerous as there are code paths which depend on
tasks not waking up spuriously from KILLABLE sleep.
For example, sys_read() or page can sleep in TASK_KILLABLE assuming
that wait/down/whatever _killable can only fail if we can not return
to the usermode. TASK_TRACED is another obvious example.
The offending commit was to resolve freezer hang during system PM
operations caused by KILLABLE sleeps in network filesystems.
wait_event_freezekillable(), which depends on the spurious KILLABLE
wakeup, was added by f06ac72e92 "cifs, freezer: add
wait_event_freezekillable and have cifs use it" to be used to
implement killable & freezable sleeps in network filesystems.
To prepare for reverting of 27920651fe, this patch reimplements
wait_event_freezekillable() using freezer_do_not_count/freezer_count()
so that it doesn't depend on the spurious KILLABLE wakeup. This isn't
very nice but should do for now.
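The bracketing idea can be sketched in userspace as follows. task_stub, the two freezer helpers and sample_killable_wait are hypothetical models: a single boolean stands in for the per-task "do not count me" state, and the real freezer_count() is assumed to do more than clear that state.

```c
#include <stdbool.h>

/* Hypothetical task model: one flag tells the freezer to skip this task. */
struct task_stub {
    bool freezer_skip;
    bool skip_seen_during_wait; /* recorded by the sample wait below */
};

/* Tell the freezer not to wait for this task while it sleeps. */
static void freezer_do_not_count(struct task_stub *t)
{
    t->freezer_skip = true;
}

/* Make the task visible to the freezer again after the sleep. */
static void freezer_count(struct task_stub *t)
{
    t->freezer_skip = false;
}

/* Sample killable wait: records the flag it observed and "succeeds". */
static int sample_killable_wait(struct task_stub *t)
{
    t->skip_seen_during_wait = t->freezer_skip;
    return 0;
}

/*
 * Killable and freezable wait built by bracketing a plain killable wait,
 * so no spurious wakeup from the freezer is needed.
 */
static int wait_freezekillable(struct task_stub *t,
                               int (*wait_killable)(struct task_stub *))
{
    int ret;

    freezer_do_not_count(t);
    ret = wait_killable(t);
    freezer_count(t);
    return ret;
}
```

The point of the design is that the sleeping task is never woken by the freezer; it is simply excluded from the freezer's accounting for the duration of the sleep.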
[tj: Refreshed patch to apply to linus/master and updated commit
description on Rafael's request.]
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Alan Stern [Thu, 3 Nov 2011 23:52:46 +0000 (00:52 +0100)]
USB: Update last_busy time after autosuspend fails
Originally, the runtime PM core would send an idle notification
whenever a suspend attempt failed. The idle callback routine could
then schedule a delayed suspend for some time later.
However this behavior was changed by commit f71648d73c1650b8b4aceb3856bebbde6daa3b86 (PM / Runtime: Remove idle
notification after failing suspend). No notifications were sent, and
there was no clear mechanism to retry failed suspends.
This caused problems for the usbhid driver, because it fails
autosuspend attempts as long as a key is being held down. A companion
patch changes the PM core's behavior, but we also need to change the
USB core. In particular, this patch (as1493) updates the device's
last_busy time when an autosuspend fails, so that the PM core will
retry the autosuspend in the future when the delay time expires
again.
Signed-off-by: Alan Stern <stern@rowland.harvard.edu>
Tested-by: Henrik Rydberg <rydberg@euromail.se>
Cc: <stable@kernel.org>
Acked-by: Greg Kroah-Hartman <gregkh@suse.de>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Originally, the runtime PM core would send an idle notification
whenever a suspend attempt failed. The idle callback routine could
then schedule a delayed suspend for some time later.
However this behavior was changed by commit f71648d73c1650b8b4aceb3856bebbde6daa3b86 (PM / Runtime: Remove idle
notification after failing suspend). No notifications were sent, and
there was no clear mechanism to retry failed suspends.
This caused problems for the usbhid driver, because it fails
autosuspend attempts as long as a key is being held down. Therefore
this patch (as1492) adds a mechanism for retrying failed
autosuspends. If the callback routine updates the last_busy field so
that the next autosuspend expiration time is in the future, the
autosuspend will automatically be rescheduled.
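The retry rule can be illustrated with a small model. dev_stub, autosuspend_expiration and retry_after_failed_suspend are hypothetical names, time is a plain counter, and counter wraparound is ignored; the sketch only shows how a refreshed last_busy value moves the expiration back into the future.

```c
#include <stdbool.h>

/* Hypothetical device model; times are in arbitrary ticks. */
struct dev_stub {
    unsigned long last_busy; /* last time the device was in use */
    unsigned long delay;     /* autosuspend delay */
};

/*
 * Returns the tick at which the autosuspend delay expires, or 0 if it
 * has already expired at 'now'. Wraparound is ignored in this sketch.
 */
static unsigned long autosuspend_expiration(const struct dev_stub *dev,
                                            unsigned long now)
{
    unsigned long expires = dev->last_busy + dev->delay;

    return expires > now ? expires : 0;
}

/*
 * Called after a suspend attempt has failed. If the failure path updated
 * last_busy so that the expiration lies in the future, the caller should
 * reschedule the autosuspend for *when and this function returns true.
 */
static bool retry_after_failed_suspend(const struct dev_stub *dev,
                                       unsigned long now,
                                       unsigned long *when)
{
    unsigned long expires = autosuspend_expiration(dev, now);

    if (expires == 0)
        return false;
    *when = expires;
    return true;
}
```

With last_busy left stale, the expiration is already in the past and nothing is rescheduled; once the failure path sets last_busy to the current time, the next attempt lands one delay later.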
Signed-off-by: Alan Stern <stern@rowland.harvard.edu>
Tested-by: Henrik Rydberg <rydberg@euromail.se>
Cc: <stable@kernel.org>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Olof Johansson [Wed, 2 Nov 2011 11:00:49 +0000 (11:00 +0000)]
af_packet: de-inline some helper functions
This popped some compiler warnings due to mismatched prototypes. Just
remove most manual inlines; the compiler should be able to figure out
what makes sense to inline and what does not.
net/packet/af_packet.c:252: warning: 'prb_curr_blk_in_use' declared inline after being called
net/packet/af_packet.c:252: warning: previous declaration of 'prb_curr_blk_in_use' was here
net/packet/af_packet.c:258: warning: 'prb_queue_frozen' declared inline after being called
net/packet/af_packet.c:258: warning: previous declaration of 'prb_queue_frozen' was here
net/packet/af_packet.c:248: warning: 'packet_previous_frame' declared inline after being called
net/packet/af_packet.c:248: warning: previous declaration of 'packet_previous_frame' was here
net/packet/af_packet.c:251: warning: 'packet_increment_head' declared inline after being called
net/packet/af_packet.c:251: warning: previous declaration of 'packet_increment_head' was here
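The pattern behind these warnings is a forward declaration without "inline", a call, and only then a definition marked inline. A sketch of the de-inlined form, using the hypothetical names queue_frozen and check_queue rather than the real af_packet helpers:

```c
/* Forward declaration: no 'inline' here. */
static int queue_frozen(int state);

/* The helper is called before its definition has been seen. */
static int check_queue(int state)
{
    return queue_frozen(state) ? -1 : 0;
}

/*
 * Definition without 'inline', matching the prototype above. Marking
 * only this definition 'static inline' is the mismatch that the
 * "declared inline after being called" warning complains about.
 */
static int queue_frozen(int state)
{
    return state != 0;
}
```

Leaving the keyword off keeps declaration and definition consistent, and the compiler remains free to inline a small static function by itself.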
Signed-off-by: Olof Johansson <olof@lixom.net>
Cc: Chetan Loke <loke.chetan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Oliver Hartkopp [Wed, 2 Nov 2011 10:55:13 +0000 (10:55 +0000)]
MAINTAINERS: Add can-gw include to maintained files
Commit c1aabdf379bc2feeb0df7057ed5bad96f492133e (can-gw: add netlink based
CAN routing) added a new include file that is not referenced by any of
the CAN entries in MAINTAINERS.
Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Tony Lindgren [Wed, 2 Nov 2011 13:40:28 +0000 (13:40 +0000)]
net: Add back alignment for size for __alloc_skb
Commit 87fb4b7b533073eeeaed0b6bf7c2328995f6c075 (net: more
accurate skb truesize) changed the alignment of size. This
can cause problems at least on some machines with NFS root:
Problem comes from commit 0e734419
(ipv4: Use inet_csk_route_child_sock() in DCCP and TCP.)
If inet_csk_route_child_sock() returns NULL, we should release socket
lock before freeing it.
Another lock imbalance exists if __inet_inherit_port() returns an error
since commit 093d282321da (tproxy: fix hash locking issue when using
port redirection in __inet_inherit_port()). A backport is also needed for
>= 2.6.37 kernels.
Reported-by: Simon Kirby <sim@hostway.ca>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Tested-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Balazs Scheidler <bazsi@balabit.hu>
CC: KOVACS Krisztian <hidden@balabit.hu>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Simon Kirby <sim@hostway.ca>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Wed, 2 Nov 2011 22:47:44 +0000 (22:47 +0000)]
l2tp: fix race in l2tp_recv_dequeue()
Misha Labjuk reported panics occurring in l2tp_recv_dequeue()
If we release reorder_q.lock, we must not keep a dangling pointer (tmp),
since another thread could manipulate reorder_q.
Instead we must restart the scan at beginning of list.
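The restart-the-scan rule can be sketched with a simple singly linked list. This is a simplified userspace model, not the l2tp code: pkt, queue, q_lock, q_unlock, record_delivery and drain_in_order are hypothetical, and the "lock" is just a flag so the example stays single-threaded.

```c
#include <stddef.h>

/* Hypothetical packet and queue models; 'locked' stands in for a spinlock. */
struct pkt {
    int ns;            /* sequence number */
    struct pkt *next;
};

struct queue {
    struct pkt *head;
    int locked;
};

static void q_lock(struct queue *q)   { q->locked = 1; }
static void q_unlock(struct queue *q) { q->locked = 0; }

static int last_delivered_ns = -1;

/* Sample delivery callback; a real one might modify the queue. */
static void record_delivery(struct queue *q, struct pkt *p)
{
    (void)q;
    last_delivered_ns = p->ns;
}

/*
 * Deliver every queued packet whose ns matches *next_ns, in order.
 * The lock is dropped around each delivery, so no list pointer is kept
 * across the unlocked region: after relocking, the scan restarts from
 * the head of the list. Returns the number of packets delivered.
 */
static int drain_in_order(struct queue *q, int *next_ns,
                          void (*deliver)(struct queue *, struct pkt *))
{
    int delivered = 0;
    struct pkt **link;

    q_lock(q);
restart:
    for (link = &q->head; *link != NULL; link = &(*link)->next) {
        struct pkt *p = *link;

        if (p->ns != *next_ns)
            continue;
        *link = p->next;  /* unlink while still holding the lock */
        p->next = NULL;
        (*next_ns)++;
        q_unlock(q);
        deliver(q, p);    /* the queue may change while unlocked */
        q_lock(q);
        delivered++;
        goto restart;     /* never reuse 'link' after relocking */
    }
    q_unlock(q);
    return delivered;
}
```

Restarting costs a rescan of the list after each delivery, but it guarantees that every pointer followed was read while the lock was held.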
Reported-by: Misha Labjuk <spiked.yar@gmail.com>
Tested-by: Misha Labjuk <spiked.yar@gmail.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>