Harout Hedeshian [Tue, 16 Jun 2015 00:40:43 +0000 (18:40 -0600)]
netfilter: xt_socket: add XT_SOCKET_RESTORESKMARK flag
xt_socket is useful for matching sockets with IP_TRANSPARENT and
taking some action on the matching packets. However, it lacks the
ability to match only a small subset of transparent sockets.
Suppose there are 2 applications, each with its own set of transparent
sockets. The first application wants all matching packets dropped,
while the second application wants them forwarded somewhere else.
Add the ability to retore the skb->mark from the sk_mark. The mark
is only restored if a matching socket is found and the transparent /
nowildcard conditions are satisfied.
Now the 2 hypothetical applications can differentiate their sockets
based on a mark value set with SO_MARK.
iptables -t mangle -I PREROUTING -m socket --transparent \
--restore-skmark -j action
iptables -t mangle -A action -m mark --mark 10 -j action2
iptables -t mangle -A action -m mark --mark 11 -j action3
Signed-off-by: Harout Hedeshian <harouth@codeaurora.org> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Roman Kubiak [Fri, 12 Jun 2015 10:32:57 +0000 (12:32 +0200)]
netfilter: nfnetlink_queue: add security context information
This patch adds an additional attribute when sending
packet information via netlink in netfilter_queue module.
It will send additional security context data, so that
userspace applications can verify this context against
their own security databases.
Signed-off-by: Roman Kubiak <r.kubiak@samsung.com> Acked-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Accessing current->pid/uid from cls_bpf may lead to misleading results and
should not be used when TC classifiers need accurate information about pid/uid.
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Edward Cree [Mon, 15 Jun 2015 17:27:54 +0000 (18:27 +0100)]
sfc: mark state UNINIT after unregister
Without this change, modprobe -r sfc hits the BUG_ON() in
efx_pci_remove_main().
Fixes: e7fef9b45ae1 ("sfc: add sysfs entry to control MCDI tracing") Reported-by: Jarod Wilson <jarod@redhat.com> Reviewed-by: Jarod Wilson <jarod@redhat.com> Signed-off-by: Edward Cree <ecree@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Tue, 16 Jun 2015 02:49:22 +0000 (19:49 -0700)]
Merge branch 'sock_diag_destruction_events'
Craig Gallek says:
====================
Socket destruction events via netlink sock_diag
This series extends the netlink sock_diag interface to broadcast
socket information as they are being destroyed. The current
interface is poll based and can not be used to retreive information
about sockets that are destroyed between poll intervals.
Only inet sockets are broadcast in this implementation, but other
families could easily be added as needed in the future.
If this patch set is accepted, a follow-up patch to the ss utility
in the iproute2 suite will also be submitted.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Craig Gallek [Mon, 15 Jun 2015 15:26:20 +0000 (11:26 -0400)]
sock_diag: implement a get_info handler for inet
This get_info handler will simply dispatch to the appropriate
existing inet protocol handler.
This patch also includes a new netlink attribute
(INET_DIAG_PROTOCOL). This attribute is currently only used
for multicast messages. Without this attribute, there is no
way of knowing the IP protocol used by the socket information
being broadcast. This attribute is not necessary in the 'dump'
variant of this protocol (though it could easily be added)
because dump requests are issued for specific family/protocol
pairs.
Tested: ss -E (note, the -E option has not yet been merged into
the upstream version of ss).
Signed-off-by: Craig Gallek <kraig@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Craig Gallek [Mon, 15 Jun 2015 15:26:19 +0000 (11:26 -0400)]
sock_diag: specify info_size per inet protocol
Previously, there was no clear distinction between the inet protocols
that used struct tcp_info to report information and those that didn't.
This change adds a specific size attribute to the inet_diag_handler
struct which defines these interfaces. This will make dispatching
sock_diag get_info requests identical for all inet protocols in a
following patch.
Tested: ss -au
Tested: ss -at Signed-off-by: Craig Gallek <kraig@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Craig Gallek [Mon, 15 Jun 2015 15:26:18 +0000 (11:26 -0400)]
sock_diag: define destruction multicast groups
These groups will contain socket-destruction events for
AF_INET/AF_INET6, IPPROTO_TCP/IPPROTO_UDP.
Near the end of socket destruction, a check for listeners is
performed. In the presence of a listener, rather than completely
cleanup the socket, a unit of work will be added to a private
work queue which will first broadcast information about the socket
and then finish the cleanup operation.
Signed-off-by: Craig Gallek <kraig@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Tue, 16 Jun 2015 00:23:03 +0000 (17:23 -0700)]
Merge branch 'mlx4-vf-counters'
Or Gerlitz says:
====================
mlx4 driver update (+ new VF ndo)
This series from Eran and Hadar is further dealing with traffic
counters in the mlx4 driver, this time mostly around SRIOV.
We added a new ndo to read the VF counters through the PF netdev
netlink infrastructure plus mlx4 implementation for that ndo.
changes from V0:
- applied feedback from John to use nested netlink encoding
for the VF counters so we can extend it later
- add handling of single ported VFs in the mlx4_en driver new ndo
- avoid chopping the FW counters from 64 to 32 bits in mlx4_en PF flow
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Eran Ben Elisha [Mon, 15 Jun 2015 14:59:08 +0000 (17:59 +0300)]
net/mlx4_en: Support ndo_get_vf_stats
Implement the ndo to gather VF statistics through the PF.
All counters related to this VF are stored in a per slave
list, run over the slave's list and collect all statistics.
Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com> Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Eran Ben Elisha [Mon, 15 Jun 2015 14:59:07 +0000 (17:59 +0300)]
net/core: Add reading VF statistics through the PF netdevice
Add ndo_get_vf_stats where the PF retrieves and fills the VFs traffic
statistics. We encode the VF stats in a nested manner to allow for
future extensions.
Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com> Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Eran Ben Elisha [Mon, 15 Jun 2015 14:59:06 +0000 (17:59 +0300)]
net/mlx4_en: Show PF own statistics via ethtool
Allow the user to observe the PF own statistics using ethtool with pf_
prefixed counter names.
Those counters are the PF statistics out of the overall port statistics.
Every PF QP is attached to a counter and the summary of those counters
is the PF statistics.
Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com> Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Eran Ben Elisha [Mon, 15 Jun 2015 14:59:05 +0000 (17:59 +0300)]
net/mlx4_core: Add helper to query counters
This is an infrastructure step for querying VF and PF counters.
This code was in the IB driver, move it to the mlx4 core driver
so it will be accessible for more use cases.
Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com> Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Eran Ben Elisha [Mon, 15 Jun 2015 14:59:04 +0000 (17:59 +0300)]
IB/mlx4: Set VF to read from QP counters
As IB VFs are not capable to read the port counters through MADs,
move there to read their own QP counters to gather statistics.
Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com> Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Eran Ben Elisha [Mon, 15 Jun 2015 14:59:03 +0000 (17:59 +0300)]
IB/mlx4: Add RoCE/IB dedicated counters
This is an infrastructure step to attach all the QPs opened from the
IB driver to a counter in order to collect VF stats from the PF using
those counters.
If the port's type is Ethernet, the counter policy demands two counters
per port (one for RoCE and one for Ethernet). The port default counter
(allocated in mlx4_core) is used for the Ethernet netdev QPs and we
allocate another counter for RoCE.
If the port's traffic is Infiniband, the counter policy demands
one counter per port, so it can use the port's default counter.
Also, Add 'allocated' flag for each counter in order to clean it at
unload.
Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com> Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Eran Ben Elisha [Mon, 15 Jun 2015 14:59:02 +0000 (17:59 +0300)]
net/mlx4_core: Allocate default counter per port
Default counter per port will be allocated at the mlx4 core driver load.
Every QP opened by the Ethernet driver will be attached to the port's default
counter. This is an infrastructure step to collect VF statistics from the PF.
Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com> Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Eran Ben Elisha [Mon, 15 Jun 2015 14:59:01 +0000 (17:59 +0300)]
net/mlx4_core: Add port attribute when tracking counters
Counter will get its port attribute within the resource tracker when
the first QP attached to it is modified to RTR. If a QP is counter-less,
an attempt to create a new counter with assigned port will be made.
Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com> Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Eran Ben Elisha [Mon, 15 Jun 2015 14:59:00 +0000 (17:59 +0300)]
net/mlx4_core: Adjust counter grant policy in the resource tracker
Each physical function has a guarantee of two counters per port, one
for a default counter and one for the IB driver.
Each virtual function has a guarantee of one counter per port.
All other counters are free and can be obtained on demand.
This is a preparation step for supporting a get_vf_stats ndo call,
so we can promise a counter for every VF in order to collect their
statistics from the PF context.
Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com> Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Eran Ben Elisha [Mon, 15 Jun 2015 14:58:59 +0000 (17:58 +0300)]
net/mlx4_core: Remove counters table allocation from VF flow
Since virtual functions get their counters indices allocation from the PF,
allocate counters indices bitmap only in case the function isn't virtual.
Also, check that the device has counters to allocate before creating the
indices bitmap table.
Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com> Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Eran Ben Elisha [Mon, 15 Jun 2015 14:58:58 +0000 (17:58 +0300)]
net/mlx4_core: Add sink counter
Reserve the last valid counter index for "sink" counter, when a
new counter cannot be allocated, the driver will use this counter.
In order to avoid allocating this counter on any other flow, fix the
indices bitmap allocation range, and reserve the sink counter index.
Add macro for the sink counter index and replace all appearences of the
index with the macro.
Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com> Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Eran Ben Elisha [Mon, 15 Jun 2015 14:58:57 +0000 (17:58 +0300)]
net/mlx4_core: Reset counters data when freed
Add resetting the counter data to the free counter flow, so the counter's
data won't be accessible anymore if querying the counter. Also, on next
counter allocation (to another VM for example), it will be fresh and clear.
Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com> Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Eran Ben Elisha [Mon, 15 Jun 2015 14:58:56 +0000 (17:58 +0300)]
net/mlx4_core: Check before cleaning counters bitmap
If counters are not supported by the device. The indices bitmap table is not
allocated during initialization. Add the symmetrical check before cleaning
the counters bitmap table or freeing a counter.
Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com> Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Mon, 15 Jun 2015 23:44:19 +0000 (16:44 -0700)]
Merge tag 'nfc-next-4.2-1' of git://git.kernel.org/pub/scm/linux/kernel/git/sameo/nfc-next
Samuel Ortiz says:
====================
NFC 4.2 pull request
This is the NFC pull request for 4.2.
- NCI drivers can now define their own handlers for processing
proprietary NCI responses and notifications.
- NFC vendors can use a dedicated netlink API to send their own
proprietary commands, like e.g. all commands needed to implement
vendor specific manufacturing tools.
- A new generic NCI over UART driver against which any NCI chipset
running on top of a serial interface can register.
- The st21nfcb driver is renamed to st-nci as it can and will support
most of ST Microelectronics NCI chipsets.
- The st21nfcb driver can put its CLF in hibernate mode and save
significant amount of power.
- A few st21nfcb minor fixes.
- The NXP NCI driver now supports ACPI enumeration.
- The Marvell NCI driver now supports both USB and serial
physical interfaces.
- The Marvell NCI drivers also supports NCI frames being muxed
over HCI. This is a setting that can be defined by a DT property.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Mon, 15 Jun 2015 23:40:25 +0000 (16:40 -0700)]
Merge branch 'bond-netlink-3ad-attrs'
Nikolay Aleksandrov says:
====================
bonding: extend the 3ad exported attributes
These are two small patches that export actor_oper_port_state and
partner_oper_port_state via netlink and sysfs, until now they were only
exported via bond's proc entry. If this set gets accepted I have an iproute2
patch prepared that will export them with which I tested these changes.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
bonding: export slave's partner_oper_port_state via sysfs and netlink
Export the partner_oper_port_state of each port via sysfs and netlink.
In 802.3ad mode it is valuable for the user to be able to check the
partner_oper state, it is already exported via bond's proc entry.
Signed-off-by: Nikolay Aleksandrov <razor@blackwall.org> Signed-off-by: Andy Gospodarek <gospo@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
bonding: export slave's actor_oper_port_state via sysfs and netlink
Export the actor_oper_port_state of each port via sysfs and netlink.
In 802.3ad mode it is valuable for the user to be able to check the
actor_oper state, it is already exported via bond's proc entry.
Signed-off-by: Nikolay Aleksandrov <razor@blackwall.org> Signed-off-by: Andy Gospodarek <gospo@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Mon, 15 Jun 2015 23:06:49 +0000 (16:06 -0700)]
Merge branch 'rocker-no-wait'
Scott Feldman says:
====================
rocker: revert back to support for nowait processes
One of the items removed from the rocker driver in the Spring Cleanup patch
series was the ability to mark processing in the driver as "no wait" for
those contexts where we cannot sleep. Turns out, we have "no wait"
contexts where we want to program the device and we don't want to defer the
processing to a process context. So re-add the ROCKER_OP_FLAG_NOWAIT flag
to mark such processes, and propagate flags to mem allocator and to the
device cmd executor. With NOWAIT, mem allocs are GFP_ATOMIC and device
cmds are queued to the device, but the driver will not wait (sleep) for the
response back from the device.
My bad for removing NOWAIT support in the first place; I thought we could
swing non-sleep contexts to process context using a work queue, for
example, but there is push-back to keep processing in original context.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Scott Feldman [Sat, 13 Jun 2015 04:35:50 +0000 (21:35 -0700)]
rocker: move port stop to 'no wait' processing
rocker_port_stop can be called from atomic and non-atomic contexts. Since
we can't test what context we're getting called in, do the processing as
'no wait', which will cover all cases.
Signed-off-by: Scott Feldman <sfeldma@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Scott Feldman [Sat, 13 Jun 2015 04:35:48 +0000 (21:35 -0700)]
rocker: mark STP update as 'no wait' processing
We can get STP updates from the bridge driver in atomic and non-atomic
contexts. Since we can't test what context we're getting called in,
do the STP processing as 'no wait', which will cover all cases.
Signed-off-by: Scott Feldman <sfeldma@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Scott Feldman [Sat, 13 Jun 2015 04:35:47 +0000 (21:35 -0700)]
rocker: mark neigh update event processing as 'no wait'
Neigh update event handler runs in a context where we can't sleep, so mark
processing in driver with ROCKER_OP_FLAG_NOWAIT. NOWAIT will use
GFP_ATOMIC for allocations and will queue cmds to the device's cmd ring but
will not wait (sleep) for cmd response back from device.
Signed-off-by: Scott Feldman <sfeldma@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Scott Feldman [Sat, 13 Jun 2015 04:35:46 +0000 (21:35 -0700)]
rocker: revert back to support for nowait processes
One of the items removed from the rocker driver in the Spring Cleanup patch
series was the ability to mark processing in the driver as "no wait" for
those contexts where we cannot sleep. Turns out, we have "no wait"
contexts where we want to program the device. So re-add the
ROCKER_OP_FLAG_NOWAIT flag to mark such processes, and propagate flags to
mem allocator and to the device cmd executor. With NOWAIT, mem allocs are
GFP_ATOMIC and device cmds are queued to the device, but the driver will
not wait (sleep) for the response back from the device.
My bad for removing NOWAIT support in the first place; I thought we could
swing non-sleep contexts to process context using a work queue, for
example, but there is push-back to keep processing in original context.
Signed-off-by: Scott Feldman <sfeldma@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Scott Feldman [Sat, 13 Jun 2015 04:24:40 +0000 (21:24 -0700)]
rocker: fix neigh tbl index increment race
rocker->neigh_tbl_next_index is used to generate unique indices for neigh
entries programmed into the device. The way new indices were generated was
racy with the new prepare-commit transaction model. A simple fix here
removes the race. The race was with two processes getting the same index,
one process using prepare-commit, the other not:
Proc A Proc B
PREPARE phase
get neigh_tbl_next_index
NONE phase
get neigh_tbl_next_index
neigh_tbl_next_index++
COMMIT phase
neigh_tbl_next_index++
Both A and B got the same index. The fix is to store and increment
neigh_tbl_next_index in the PREPARE (or NONE) phase and use value in COMMIT
phase:
Proc A Proc B
PREPARE phase
get neigh_tbl_next_index
neigh_tbl_next_index++
NONE phase
get neigh_tbl_next_index
neigh_tbl_next_index++
COMMIT phase
// use value stashed in PREPARE phase
Reported-by: Simon Horman <simon.horman@netronome.com> Signed-off-by: Scott Feldman <sfeldma@gmail.com> Reviewed-by: Simon Horman <simon.horman@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Scott Feldman [Sat, 13 Jun 2015 04:09:44 +0000 (21:09 -0700)]
rocker: gaurd against NULL rocker_port when removing ports
The ports array is filled in as ports are probed, but if probing doesn't
finish, we need to stop only those ports that where probed successfully.
Check the ports array for NULL to skip un-probed ports when stopping.
Signed-off-by: Scott Feldman <sfeldma@gmail.com> Acked-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Sat, 13 Jun 2015 02:44:48 +0000 (19:44 -0700)]
net: make u64_stats_init() a function
Using a function instead of a macro is cleaner and remove
following W=1 warnings (extract)
In file included from net/ipv6/ip6_vti.c:29:0:
net/ipv6/ip6_vti.c: In function ‘vti6_dev_init_gen’:
include/linux/netdevice.h:2029:18: warning: variable ‘stat’ set but not
used [-Wunused-but-set-variable]
typeof(type) *stat; \
^
net/ipv6/ip6_vti.c:862:16: note: in expansion of macro
‘netdev_alloc_pcpu_stats’
dev->tstats = netdev_alloc_pcpu_stats(struct pcpu_sw_netstats);
^
CC [M] net/ipv6/sit.o
In file included from net/ipv6/sit.c:30:0:
net/ipv6/sit.c: In function ‘ipip6_tunnel_init’:
include/linux/netdevice.h:2029:18: warning: variable ‘stat’ set but not
used [-Wunused-but-set-variable]
typeof(type) *stat; \
^
Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Scott Feldman [Sat, 13 Jun 2015 00:39:50 +0000 (17:39 -0700)]
bridge: use either ndo VLAN ops or switchdev VLAN ops to install MASTER vlans
v2:
Move struct switchdev_obj automatics to inner scope where there used.
v1:
To maintain backward compatibility with the existing iproute2 "bridge vlan"
command, let bridge's setlink/dellink handler call into either the port
driver's 8021q ndo ops or the port driver's bridge_setlink/dellink ops.
This allows port driver to choose 8021q ops or the newer
bridge_setlink/dellink ops when implementing VLAN add/del filtering on the
device. The iproute "bridge vlan" command does not need to be modified.
To summarize using the "bridge vlan" command examples, we have:
1) bridge vlan add|del vid VID dev DEV
Here iproute2 sets MASTER flag. Bridge's bridge_setlink/dellink is called.
Vlan is set on bridge for port. If port driver implements ndo 8021q ops,
call those to port driver can install vlan filter on device. Otherwise, if
port driver implements bridge_setlink/dellink ops, call those to install
vlan filter to device. This option only works if port is bridged.
2) bridge vlan add|del vid VID dev DEV master
Same as 1)
3) bridge vlan add|del vid VID dev DEV self
Bridge's bridge_setlink/dellink isn't called. Port driver's
bridge_setlink/dellink is called, if implemented. This option works if
port is bridged or not. If port is not bridged, a VLAN can still be
added/deleted to device filter using this variant.
4) bridge vlan add|del vid VID dev DEV master self
This is a combination of 1) and 3), but will only work if port is bridged.
Signed-off-by: Scott Feldman <sfeldma@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
bpf: let kprobe programs use bpf_get_smp_processor_id() helper
It's useful to do per-cpu histograms.
Suggested-by: Daniel Wagner <daniel.wagner@bmw-carit.de> Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
bpf: allow networking programs to use bpf_trace_printk() for debugging
bpf_trace_printk() is a helper function used to debug eBPF programs.
Let socket and TC programs use it as well.
Note, it's DEBUG ONLY helper. If it's used in the program,
the kernel will print warning banner to make sure users don't use
it in production.
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
====================
Netfilter updates for net-next
This a bit large (and late) patchset that contains Netfilter updates for
net-next. Most relevantly br_netfilter fixes, ipset RCU support, removal of
x_tables percpu ruleset copy and rework of the nf_tables netdev support. More
specifically, they are:
1) Warn the user when there is a better protocol conntracker available, from
Marcelo Ricardo Leitner.
2) Fix forwarding of IPv6 fragmented traffic in br_netfilter, from Bernhard
Thaler. This comes with several patches to prepare the change in first place.
3) Get rid of special mtu handling of PPPoE/VLAN frames for br_netfilter. This
is not needed anymore since now we use the largest fragment size to
refragment, from Florian Westphal.
4) Restore vlan tag when refragmenting in br_netfilter, also from Florian.
5) Get rid of the percpu ruleset copy in x_tables, from Florian. Plus another
follow up patch to refine it from Eric Dumazet.
6) Several ipset cleanups, fixes and finally RCU support, from Jozsef Kadlecsik.
7) Get rid of parens in Netfilter Kconfig files.
8) Attach the net_device to the basechain as opposed to the initial per table
approach in the nf_tables netdev family.
9) Subscribe to netdev events to detect the removal and registration of a
device that is referenced by a basechain.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
netfilter: nf_tables_netdev: unregister hooks on net_device removal
In case the net_device is gone, we have to unregister the hooks and put back
the reference on the net_device object. Once it comes back, register them
again. This also covers the device rename case.
This patch also adds a new flag to indicate that the basechain is disabled, so
their hooks are not registered. This flag is used by the netdev family to
handle the case where the net_device object is gone. Currently this flag is not
exposed to userspace.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
netfilter: nf_tables: attach net_device to basechain
The device is part of the hook configuration, so instead of a global
configuration per table, set it to each of the basechain that we create.
This patch reworks ebddf1a8d78a ("netfilter: nf_tables: allow to bind table to
net_device").
Note that this adds a dev_name field in the nft_base_chain structure which is
required the netdev notification subscription that follows up in a patch to
handle gone net_devices.
Suggested-by: Patrick McHardy <kaber@trash.net> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Merge branch 'master' of git://blackhole.kfki.hu/nf-next
Jozsef Kadlecsik says:
====================
ipset patches for nf-next
Please consider to apply the next bunch of patches for ipset. First
comes the small changes, then the bugfixes and at the end the RCU
related patches.
* Use MSEC_PER_SEC consistently instead of the number.
* Use SET_WITH_*() helpers to test set extensions from Sergey Popovich.
* Check extensions attributes before getting extensions from Sergey Popovich.
* Permit CIDR equal to the host address CIDR in IPv6 from Sergey Popovich.
* Make sure we always return line number on batch in the case of error
from Sergey Popovich.
* Check CIDR value only when attribute is given from Sergey Popovich.
* Fix cidr handling for hash:*net* types, reported by Jonathan Johnson.
* Fix parallel resizing and listing of the same set so that the original
set is kept for the whole dumping.
* Make sure listing doesn't grab a set which is just being destroyed.
* Remove rbtree from ip_set_hash_netiface.c in order to introduce RCU.
* Replace rwlock_t with spinlock_t in "struct ip_set", change the locking
in the core and simplifications in the timeout routines.
* Introduce RCU locking in bitmap:* types with a slight modification in the
logic on how an element is added.
* Introduce RCU locking in hash:* types. This is the most complex part of
the changes.
* Introduce RCU locking in list type where standard rculist is used.
* Fix coding styles reported by checkpatch.pl.
====================
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Jozsef Kadlecsik [Sat, 13 Jun 2015 15:29:56 +0000 (17:29 +0200)]
netfilter: ipset: Introduce RCU locking in hash:* types
Three types of data need to be protected in the case of the hash types:
a. The hash buckets: standard rcu pointer operations are used.
b. The element blobs in the hash buckets are stored in an array and
a bitmap is used for book-keeping to tell which elements in the array
are used or free.
c. Networks per cidr values and the cidr values themselves are stored
in fix sized arrays and need no protection. The values are modified
in such an order that in the worst case an element testing is repeated
once with the same cidr value.
The ipset hash approach uses arrays instead of lists and therefore is
incompatible with rhashtable.
Performance is tested by Jesper Dangaard Brouer:
Simple drop in FORWARD
~~~~~~~~~~~~~~~~~~~~~~
Dropping via simple iptables net-mask match::
iptables -t raw -N simple || iptables -t raw -F simple
iptables -t raw -I simple -s 198.18.0.0/15 -j DROP
iptables -t raw -D PREROUTING -j simple
iptables -t raw -I PREROUTING -j simple
Drop via original ipset in RAW table
~~~~~~~~~~~~~~~~~~~~~~~~~~~
Create a set with lots of elements::
sudo ./ipset destroy test
echo "create test hash:ip hashsize 65536" > test.set
for x in `seq 0 255`; do
for y in `seq 0 255`; do
echo "add test 198.18.$x.$y" >> test.set
done
done
sudo ./ipset restore < test.set
Dropping via ipset::
iptables -t raw -F
iptables -t raw -N net198 || iptables -t raw -F net198
iptables -t raw -I net198 -m set --match-set test src -j DROP
iptables -t raw -I PREROUTING -j net198
Jozsef Kadlecsik [Sat, 13 Jun 2015 12:39:59 +0000 (14:39 +0200)]
netfilter: ipset: Introduce RCU locking in bitmap:* types
There's nothing much required because the bitmap types use atomic
bit operations. However the logic of adding elements slightly changed:
first the MAC address updated (which is not atomic), then the element
activated (added). The extensions may call kfree_rcu() therefore we
call rcu_barrier() at module removal.
Jozsef Kadlecsik [Sat, 13 Jun 2015 12:22:25 +0000 (14:22 +0200)]
netfilter: ipset: Prepare the ipset core to use RCU at set level
Replace rwlock_t with spinlock_t in "struct ip_set" and change the locking
accordingly. Convert the comment extension into an rcu-avare object. Also,
simplify the timeout routines.
Jozsef Kadlecsik [Sat, 13 Jun 2015 09:59:45 +0000 (11:59 +0200)]
netfilter: ipset: Fix parallel resizing and listing of the same set
When elements added to a hash:* type of set and resizing triggered,
parallel listing could start to list the original set (before resizing)
and "continue" with listing the new set. Fix it by references and
using the original hash table for listing. Therefore the destroying of
the original hash table may happen from the resizing or listing functions.
Jozsef Kadlecsik [Fri, 12 Jun 2015 20:11:00 +0000 (22:11 +0200)]
netfilter: ipset: Fix cidr handling for hash:*net* types
Commit "Simplify cidr handling for hash:*net* types" broke the cidr
handling for the hash:*net* types when the sets were used by the SET
target: entries with invalid cidr values were added to the sets.
Reported by Jonathan Johnson.
Sergey Popovich [Fri, 12 Jun 2015 19:30:57 +0000 (21:30 +0200)]
netfilter: ipset: Check CIDR value only when attribute is given
There is no reason to check CIDR value regardless attribute
specifying CIDR is given.
Initialize cidr array in element structure on element structure
declaration to let more freedom to the compiler to optimize
initialization right before element structure is used.
Remove local variables cidr and cidr2 for netnet and netportnet
hashes as we do not use packed cidr value for such set types and
can store value directly in e.cidr[].
Sergey Popovich [Fri, 12 Jun 2015 19:26:43 +0000 (21:26 +0200)]
netfilter: ipset: Make sure we always return line number on batch
Even if we return with generic IPSET_ERR_PROTOCOL it is good idea
to return line number if we called in batch mode.
Moreover we are not always exiting with IPSET_ERR_PROTOCOL. For
example hash:ip,port,net may return IPSET_ERR_HASH_RANGE_UNSUPPORTED
or IPSET_ERR_INVALID_CIDR.
Sergey Popovich [Fri, 12 Jun 2015 19:23:31 +0000 (21:23 +0200)]
netfilter: ipset: Permit CIDR equal to the host address CIDR in IPv6
Permit userspace to supply CIDR length equal to the host address CIDR
length in netlink message. Prohibit any other CIDR length for IPv6
variant of the set.
Also return -IPSET_ERR_HASH_RANGE_UNSUPPORTED instead of generic
-IPSET_ERR_PROTOCOL in IPv6 variant of hash:ip,port,net when
IPSET_ATTR_IP_TO attribute is given.
1) Fix uninitialized struct station_info in cfg80211_wireless_stats(),
from Johannes Berg.
2) Revert commit attempt to fix ipv6 protocol resubmission, it adds
regressions.
3) Endless loops can be created in bridge port lists, fix from Nikolay
Aleksandrov.
4) Don't WARN_ON() if sk->sk_forward_alloc is non-zero in
sk_clear_memalloc, it is a legal situation during swap deactivation.
Fix from Mel Gorman.
5) Fix order of disabling interrupts and unlocking NAPI in enic driver
to avoid a race. From Govindarajulu Varadarajan.
6) High and low register writes are swapped when programming the start
of periodic output in igb driver. From Richard Cochran.
7) Fix device rename handling in mpls stack, from Robert Shearman.
8) Do not trigger compaction synchronously when optimistically trying
to allocate an order 3 page in alloc_skb_with_frags() and
skb_page_frag_refill(). From Shaohua Li.
9) Authentication with COOKIE_ECHO is not handled properly in SCTP, fix
from Marcelo Ricardo Leitner.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net:
Doc: networking: Fix URL for wiki.wireshark.org in udplite.txt
sctp: allow authenticating DATA chunks that are bundled with COOKIE_ECHO
net: don't wait for order-3 page allocation
mpls: handle device renames for per-device sysctls
net: igb: fix the start time for periodic output signals
enic: fix memory leak in rq_clean
enic: check return value for stat dump
enic: unlock napi busy poll before unmasking intr
net, swap: Remove a warning and clarify why sk_mem_reclaim is required when deactivating swap
bridge: fix multicast router rlist endless loop
tipc: disconnect socket directly after probe failure
Revert "ipv6: Fix protocol resubmission"
cfg80211: wext: clear sinfo struct before calling driver
Eric Dumazet [Sat, 13 Jun 2015 02:31:32 +0000 (19:31 -0700)]
flow_dissector: fix ipv6 dst, hop-by-hop and routing ext hdrs
__skb_header_pointer() returns a pointer that must be checked.
Fixes infinite loop reported by Alexei, and add __must_check to
catch these errors earlier.
Fixes: 6a74fcf426f5 ("flow_dissector: add support for dst, hop-by-hop and routing ext hdrs") Reported-by: Alexei Starovoitov <alexei.starovoitov@gmail.com> Tested-by: Alexei Starovoitov <alexei.starovoitov@gmail.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Raghu Vatsavayi [Sat, 13 Jun 2015 01:11:50 +0000 (18:11 -0700)]
Fix Cavium Liquidio build related errors and warnings
1) Fixed following sparse warnings:
lio_main.c:213:6: warning: symbol 'octeon_droq_bh' was not
declared. Should it be static?
lio_main.c:233:5: warning: symbol 'lio_wait_for_oq_pkts' was
not declared. Should it be static?
lio_main.c:3083:5: warning: symbol 'lio_nic_info' was not
declared. Should it be static?
lio_main.c:2618:16: warning: cast from restricted __be16
octeon_device.c:466:6: warning: symbol 'oct_set_config_info'
was not declared. Should it be static?
octeon_device.c:573:25: warning: cast to restricted __be32
octeon_device.c:582:29: warning: cast to restricted __be32
octeon_device.c:584:39: warning: cast to restricted __be32
octeon_device.c:594:13: warning: cast to restricted __be32
octeon_device.c:596:25: warning: cast to restricted __be32
octeon_device.c:613:25: warning: cast to restricted __be32
octeon_device.c:614:29: warning: cast to restricted __be64
octeon_device.c:615:29: warning: cast to restricted __be32
octeon_device.c:619:37: warning: cast to restricted __be32
octeon_device.c:623:33: warning: cast to restricted __be32
cn66xx_device.c:540:6: warning: symbol
'lio_cn6xxx_get_pcie_qlmport' was not declared. Should it be s
octeon_mem_ops.c:181:16: warning: cast to restricted __be64
octeon_mem_ops.c:190:16: warning: cast to restricted __be32
octeon_mem_ops.c:196:17: warning: incorrect type in initializer
2) Fix build errors corresponding to vmalloc on linux-next 4.1.
3) Liquidio now supports 64 bit only, modified Kconfig accordingly.
4) Fix some code alignment issues based on kernel build warnings.
Signed-off-by: Derek Chickles <derek.chickles@caviumnetworks.com> Signed-off-by: Satanand Burla <satananda.burla@caviumnetworks.com> Signed-off-by: Felix Manlunas <felix.manlunas@caviumnetworks.com> Signed-off-by: Raghu Vatsavayi <raghu.vatsavayi@caviumnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Vincent Cuissard [Fri, 12 Jun 2015 13:35:53 +0000 (15:35 +0200)]
NFC: nci: remove current SLEEP mode management
NCI deactivate management was modified to support all NCI
deactivation type. Problem is that all the API are not ready
yet for it.
Problem is that with current code, when neard asks to deactivate
the tag it sends a deactivate SLEEP but nobody will then send a
IDLE deactivate. This IDLE deactivate is mandatory since NFC
controller can only be unlocked by DH.
Signed-off-by: Vincent Cuissard <cuissard@marvell.com> Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
Tom Herbert [Fri, 12 Jun 2015 16:01:05 +0000 (09:01 -0700)]
flow_dissector: Fix MPLS entropy label handling in flow dissector
Need to shift after masking to get label value for comparison.
Fixes: b3baa0fbd02a1a9d493d8 ("mpls: Add MPLS entropy label in flow_keys") Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Florian Westphal [Fri, 12 Jun 2015 10:12:22 +0000 (12:12 +0200)]
net: ipv4: un-inline ip_finish_output2
text data bss dec hex filename
old: 16527 44 0 16571 40bb net/ipv4/ip_output.o
new: 14935 44 0 14979 3a83 net/ipv4/ip_output.o
Suggested-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
sctp: allow authenticating DATA chunks that are bundled with COOKIE_ECHO
Currently, we can ask to authenticate DATA chunks and we can send DATA
chunks on the same packet as COOKIE_ECHO, but if you try to combine
both, the DATA chunk will be sent unauthenticated and peer won't accept
it, leading to a communication failure.
This happens because even though the data was queued after it was
requested to authenticate DATA chunks, it was also queued before we
could know that remote peer can handle authenticating, so
sctp_auth_send_cid() returns false.
The fix is whenever we set up an active key, re-check send queue for
chunks that now should be authenticated. As a result, such packet will
now contain COOKIE_ECHO + AUTH + DATA chunks, in that order.
Reported-by: Liu Wei <weliu@redhat.com> Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Acked-by: Vlad Yasevich <vyasevich@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Linus Torvalds [Fri, 12 Jun 2015 18:35:19 +0000 (11:35 -0700)]
Merge branch 'for-linus' of git://git.kernel.dk/linux-block
Pull block layer fixes from Jens Axboe:
"Remember about a week ago when I sent the last pull request for 4.1?
Well, I lied. Now, I don't want to shift the blame, but Dan, Ming,
and Richard made a liar out of me.
Here are three small patches that should go into 4.1. More
specifically, this pull request contains:
- A Kconfig dependency for the pmem block driver, so it can't be
selected if HAS_IOMEM isn't availble. From Richard Weinberger.
- A fix for genhd, making the ext_devt_lock softirq safe. This makes
lockdep happier, since we also end up grabbing this lock on release
off the softirq path. From Dan Williams.
- A blk-mq software queue release fix from Ming Lei.
Last two are headed to stable, first fixes an issue introduced in this
cycle"
* 'for-linus' of git://git.kernel.dk/linux-block:
block: pmem: Add dependency on HAS_IOMEM
block: fix ext_dev_lock lockdep report
blk-mq: free hctx->ctxs in queue's release handler
Linus Torvalds [Fri, 12 Jun 2015 18:33:03 +0000 (11:33 -0700)]
Merge tag 'md/4.1-rc7-fixes' of git://neil.brown.name/md
Pull three more md fixes from Neil Brown:
"Hasn't been a good cycle for md has it :-(
The main issue fixed here is a rare race which can result in two
reshape threads running at once, which doesn't end well.
Also a minor issue with a write to a sysfs file returning the wrong
value. Backports to 4.0-stable are indicated"
* tag 'md/4.1-rc7-fixes' of git://neil.brown.name/md:
md: make sure MD_RECOVERY_DONE is clear before starting recovery/resync
md: Close race when setting 'action' to 'idle'.
md: don't return 0 from array_state_store
Linus Torvalds [Fri, 12 Jun 2015 18:28:57 +0000 (11:28 -0700)]
Merge git://git.infradead.org/intel-iommu
Pull VT-d hardware workarounds from David Woodhouse:
"This contains a workaround for hardware issues which I *thought* were
never going to be seen on production hardware. I'm glad I checked
that before the 4.1 release...
Firstly, PASID support is so broken on existing chips that we're just
going to declare the old capability bit 28 as 'reserved' and change
the VT-d spec to move PASID support to another bit. So any existing
hardware doesn't support SVM; it only sets that (now) meaningless bit
28.
That patch *wasn't* imperative for 4.1 because we don't have PASID
support yet. But *even* the extended context tables are broken — if
you just enable the wider tables and use none of the new bits in them,
which is precisely what 4.1 does, you find that translations don't
work. It's this problem which I thought was caught in time to be
fixed before production, but wasn't.
To avoid triggering this issue, we now *only* enable the extended
context tables on hardware which also advertises "we have PASID
support and we actually tested it this time" with the new PASID
feature bit.
In addition, I've added an 'intel_iommu=ecs_off' command line
parameter to allow us to disable it manually if we need to"
* git://git.infradead.org/intel-iommu:
iommu/vt-d: Only enable extended context tables if PASID is supported
iommu/vt-d: Change PASID support to bit 40 of Extended Capability Register
We store the rule blob per (possible) cpu. Unfortunately this means we can
waste lot of memory on big smp machines. ipt_entry structure ('rule head')
is 112 byte, so e.g. with maxcpu=64 one single rule eats
close to 8k RAM.
Since previous patch made counters percpu it appears there is nothing
left in the rule blob that needs to be percpu.
On my test system (144 possible cpus, 400k dummy rules) this
change saves close to 9 Gigabyte of RAM.
Florian Westphal [Wed, 10 Jun 2015 23:34:54 +0000 (01:34 +0200)]
netfilter: xtables: use percpu rule counters
The binary arp/ip/ip6tables ruleset is stored per cpu.
The only reason left as to why we need percpu duplication are the rule
counters embedded into ipt_entry et al -- since each cpu has its own copy
of the rules, all counters can be lockless.
The downside is that the more cpus are supported, the more memory is
required. Rules are not just duplicated per online cpu but for each
possible cpu, i.e. if maxcpu is 144, then rule is duplicated 144 times,
not for the e.g. 64 cores present.
To save some memory and also improve utilization of shared caches it
would be preferable to only store the rule blob once.
So we first need to separate counters and the rule blob.
Instead of using entry->counters, allocate this percpu and store the
percpu address in entry->counters.pcnt on CONFIG_SMP.
This change makes no sense as-is; it is merely an intermediate step to
remove the percpu duplication of the rule set in a followup patch.
Suggested-by: Eric Dumazet <edumazet@google.com> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> Reported-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: Florian Westphal <fw@strlen.de> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
netfilter: bridge: restore vlan tag when refragmenting
If bridge netfilter is used with both
bridge-nf-call-iptables and bridge-nf-filter-vlan-tagged enabled
then ip fragments in VLAN frames are sent without the vlan header.
This has never worked reliably. Turns out this relied on pre-3.5
behaviour where skb frag_list was used to store ip fragments;
ip_fragment() then re-used these skbs.
But since commit 3cc4949269e01f39443d0fcfffb5bc6b47878d45
("ipv4: use skb coalescing in defragmentation") this is no longer
the case. ip_do_fragment now needs to allocate new skbs, but these
don't contain the vlan tag information anymore.
Fix it by storing vlan information of the ressembled skb in the
br netfilter percpu frag area, and restore them for each of the
fragments.
Fixes: 3cc4949269e01f3 ("ipv4: use skb coalescing in defragmentation") Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
net: ip_fragment: remove BRIDGE_NETFILTER mtu special handling
since commit d6b915e29f4adea9
("ip_fragment: don't forward defragmented DF packet") the largest
fragment size is available in the IPCB.
Therefore we no longer need to care about 'encapsulation'
overhead of stripped PPPOE/VLAN headers since ip_do_fragment
doesn't use device mtu in such cases.
Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
IPv6 fragmented packets are not forwarded on an ethernet bridge
with netfilter ip6_tables loaded. e.g. steps to reproduce
1) create a simple bridge like this
modprobe br_netfilter
brctl addbr br0
brctl addif br0 eth0
brctl addif br0 eth2
ifconfig eth0 up
ifconfig eth2 up
ifconfig br0 up
2) place a host with an IPv6 address on each side of the bridge
set IPv6 address on host A:
ip -6 addr add fd01:2345:6789:1::1/64 dev eth0
set IPv6 address on host B:
ip -6 addr add fd01:2345:6789:1::2/64 dev eth0
3) run a simple ping command on host A with packets > MTU
ping6 -s 4000 fd01:2345:6789:1::2
4) wait some time and run e.g. "ip6tables -t nat -nvL" on the bridge
IPv6 fragmented packets traverse the bridge cleanly until somebody runs.
"ip6tables -t nat -nvL". As soon as it is run (and netfilter modules are
loaded) IPv6 fragmented packets do not traverse the bridge any more (you
see no more responses in ping's output).
After applying this patch IPv6 fragmented packets traverse the bridge
cleanly in above scenario.
Signed-off-by: Bernhard Thaler <bernhard.thaler@wvnet.at>
[pablo@netfilter.org: small changes to br_nf_dev_queue_xmit] Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Bernhard Thaler [Sat, 30 May 2015 13:28:28 +0000 (15:28 +0200)]
netfilter: bridge: refactor frag_max_size
Currently frag_max_size is member of br_input_skb_cb and copied back and
forth using IPCB(skb) and BR_INPUT_SKB_CB(skb) each time it is changed or
used.
Attach frag_max_size to nf_bridge_info and set value in pre_routing and
forward functions. Use its value in forward and xmit functions.
Signed-off-by: Bernhard Thaler <bernhard.thaler@wvnet.at> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This does not work with ip6tables on a bridge in NAT66 scenario
because the REDIRECT/DNAT/SNAT is not correctly detected.
The bridge pre-routing (finish) netfilter hook has to check for a possible
redirect and then fix the destination mac address. This allows to use the
ip6tables rules for local REDIRECT/DNAT/SNAT REDIRECT similar to the IPv4
iptables version.
This patch makes it possible to use IPv6 NAT66 on a bridge. It was tested
on a bridge with two interfaces using SNAT/DNAT NAT66 rules.
Reported-by: Artie Hamilton <artiemhamilton@yahoo.com> Signed-off-by: Sven Eckelmann <sven@open-mesh.com>
[bernhard.thaler@wvnet.at: rebased, add indirect call to ip6_route_input()]
[bernhard.thaler@wvnet.at: rebased, split into separate patches] Signed-off-by: Bernhard Thaler <bernhard.thaler@wvnet.at> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
netfilter: conntrack: warn the user if there is a better helper to use
After db29a9508a92 ("netfilter: conntrack: disable generic tracking for
known protocols"), if the specific helper is built but not loaded
(a standard for most distributions) systems with a restrictive firewall
but weak configuration regarding netfilter modules to load, will
silently stop working.
This patch then puts a warning message so the sysadmin knows where to
start looking into. It's a pr_warn_once regardless of protocol itself
but it should be enough to give a hint on where to look.
Cc: Florian Westphal <fw@strlen.de> Cc: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
David Woodhouse [Fri, 12 Jun 2015 09:15:49 +0000 (10:15 +0100)]
iommu/vt-d: Only enable extended context tables if PASID is supported
Although the extended tables are theoretically a completely orthogonal
feature to PASID and anything else that *uses* the newly-available bits,
some of the early hardware has problems even when all we do is enable
them and use only the same bits that were in the old context tables.
For now, there's no motivation to support extended tables unless we're
going to use PASID support to do SVM. So just don't use them unless
PASID support is advertised too. Also add a command-line bailout just in
case later chips also have issues.
The equivalent problem for PASID support has already been fixed with the
upcoming VT-d spec update and commit bd00c606a ("iommu/vt-d: Change
PASID support to bit 40 of Extended Capability Register"), because the
problematic platforms use the old definition of the PASID-capable bit,
which is now marked as reserved and meaningless.
So with this change, we'll magically start using ECS again only when we
see the new hardware advertising "hey, we have PASID support and we
actually tested it this time" on bit 40.
The VT-d hardware architect has promised that we are not going to have
any reason to support ECS *without* PASID any time soon, and he'll make
sure he checks with us before changing that.
In the future, if hypothetical new features also use new bits in the
context tables and can be seen on implementations *without* PASID support,
we might need to add their feature bits to the ecs_enabled() macro.
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
NeilBrown [Fri, 12 Jun 2015 10:05:04 +0000 (20:05 +1000)]
md: make sure MD_RECOVERY_DONE is clear before starting recovery/resync
MD_RECOVERY_DONE is normally cleared by md_check_recovery after a
resync etc finished. However it is possible for raid5_start_reshape
to race and start a reshape before MD_RECOVERY_DONE is cleared. This
can lean to multiple reshapes running at the same time, which isn't
good.
To make sure it is cleared before starting a reshape, and also clear
it when reaping a thread, just to be safe.
NeilBrown [Fri, 12 Jun 2015 09:51:27 +0000 (19:51 +1000)]
md: Close race when setting 'action' to 'idle'.
Checking ->sync_thread without holding the mddev_lock()
isn't really safe, even after flushing the workqueue which
ensures md_start_sync() has been run.
While this code is waiting for the lock, md_check_recovery could reap
the thread itself, and then start another thread (e.g. recovery might
finish, then reshape starts). When this thread gets the lock
md_start_sync() hasn't run so it doesn't get reaped, but
MD_RECOVERY_RUNNING gets cleared. This allows two threads to start
which leads to confusion.
So don't both if MD_RECOVERY_RUNNING isn't set, but if it is do
the flush and the test and the reap all under the mddev_lock to
avoid any race with md_check_recovery.
Signed-off-by: NeilBrown <neilb@suse.de> Fixes: 6791875e2e53 ("md: make reconfig_mutex optional for writes to md sysfs files.") Cc: stable@vger.kernel.org (v4.0+)