If a client issues a DHCPREQUEST for renewal, the packet is dropped
if the old destination (the old gateway for the client) TQ is smaller
than the current best gateway TQ less GW_THRESHOLD
Signed-off-by: Antonio Quartulli <ordex@autistici.org> Signed-off-by: Marek Lindner <lindner_marek@yahoo.de> Signed-off-by: Sven Eckelmann <sven@narfation.org>
batman-adv: throw uevent in userspace on gateway add/change/del event
In case of new default gw, changing the default gw or deleting the default gw a
uevent is triggered with type=gw, action=add/change/del and
data={GW_ORIG_ADDRESS} (if any).
Signed-off-by: Antonio Quartulli <ordex@autistici.org> Signed-off-by: Marek Lindner <lindner_marek@yahoo.de> Signed-off-by: Sven Eckelmann <sven@narfation.org>
The gateway election mechanism has been a little revised. Now the
gw_election is trigered by an atomic_t flag (gw_reselect) which is set
to 1 in case of election needed, avoding to set curr_gw to NULL.
Signed-off-by: Antonio Quartulli <ordex@autistici.org> Signed-off-by: Marek Lindner <lindner_marek@yahoo.de> Signed-off-by: Sven Eckelmann <sven@narfation.org>
batman-adv: add wrapper function to throw uevent in userspace
Using throw_uevent() is now possible to trigger uevent signal that can
be recognised in userspace. Uevents will be triggered through the
/devices/virtual/net/{MESH_IFACE} kobject.
A triggered uevent has three properties:
- type: the event class. Who generates the event (only 'gw' is currently
defined). Corresponds to the BATTYPE uevent variable.
- action: the associated action with the event ('add'/'change'/'del' are
currently defined). Corresponds to the BATACTION uevent variable.
- data: any useful data for the userspace. Corresponds to the BATDATA
uevent variable.
Signed-off-by: Antonio Quartulli <ordex@autistici.org> Signed-off-by: Sven Eckelmann <sven@narfation.org>
batman-adv: protect the local and the global trans-tables with rcu
The local and the global translation-tables are now lock free and rcu
protected.
Signed-off-by: Antonio Quartulli <ordex@autistici.org> Acked-by: Simon Wunderlich <siwu@hrz.tu-chemnitz.de> Signed-off-by: Sven Eckelmann <sven@narfation.org>
With the current client announcement implementation, in case of roaming,
an update is triggered on the new AP serving the client. At that point
the new information is spread around by means of the OGM broadcasting
mechanism. Until this operations is not executed, no node is able to
correctly route traffic towards the client. This obviously causes packet
drops and introduces a delay in the time needed by the client to recover
its connections.
A new packet type called ROAMING_ADVERTISEMENT is added to account this
issue.
This message is sent in case of roaming from the new AP serving the
client to the old one and will contain the client MAC address. In this
way an out-of-OGM update is immediately committed, so that the old node
can update its global translation table. Traffic reaching this node will
then be redirected to the correct destination utilising the fresher
information. Thus reducing the packet drops and the connection recovery
delay.
Signed-off-by: Antonio Quartulli <ordex@autistici.org> Signed-off-by: Sven Eckelmann <sven@narfation.org>
The client announcement mechanism informs every mesh node in the network
of any connected non-mesh client, in order to find the path towards that
client from any given point in the mesh.
The old implementation was based on the simple idea of appending a data
buffer to each OGM containing all the client MAC addresses the node is
serving. All other nodes can populate their global translation tables
(table which links client MAC addresses to node addresses) using this
MAC address buffer and linking it to the node's address contained in the
OGM. A node that wants to contact a client has to lookup the node the
client is connected to and its address in the global translation table.
It is easy to understand that this implementation suffers from several
issues:
- big overhead (each and every OGM contains the entire list of
connected clients)
- high latencies for client route updates due to long OGM trip time and
OGM losses
The new implementation addresses these issues by appending client
changes (new client joined or a client left) to the OGM instead of
filling it with all the client addresses each time. In this way nodes
can modify their global tables by means of "updates", thus reducing the
overhead within the OGMs.
To keep the entire network in sync each node maintains a translation
table version number (ttvn) and a translation table checksum. These
values are spread with the OGM to allow all the network participants to
determine whether or not they need to update their translation table
information.
When a translation table lookup is performed in order to send a packet
to a client attached to another node, the destination's ttvn is added to
the payload packet. Forwarding nodes can compare the packet's ttvn with
their destination's ttvn (this node could have a fresher information
than the source) and re-route the packet if necessary. This greatly
reduces the packet loss of clients roaming from one AP to the next.
Signed-off-by: Antonio Quartulli <ordex@autistici.org> Signed-off-by: Marek Lindner <lindner_marek@yahoo.de> Signed-off-by: Sven Eckelmann <sven@narfation.org>
batman-adv: Unify the first 3 bytes in each packet
The amount of duplicated code in the receive and routing code can be
reduced when all headers provide the packet type, version and ttl in the
same first bytes.
Signed-off-by: Antonio Quartulli <ordex@autistici.org> Signed-off-by: Sven Eckelmann <sven@narfation.org>
Sven Eckelmann [Wed, 15 Jun 2011 07:41:37 +0000 (09:41 +0200)]
batman-adv: Reduce usage of char
char was used in different places to store information without really
using the characteristics of that data type or by ignoring the fact that
char has not a well defined signedness.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
David Howells [Wed, 15 Jun 2011 07:41:36 +0000 (09:41 +0200)]
batman-adv: count_real_packets() in batman-adv assumes char is signed
count_real_packets() in batman-adv assumes char is signed, and returns -1
through it:
net/batman-adv/routing.c: In function 'receive_bat_packet':
net/batman-adv/routing.c:739: warning: comparison is always false due to limited range of data type
Use int instead.
Signed-off-by: David Howells <dhowells@redhat.com>
[sven@narfation.org: Rebase on top of current version] Signed-off-by: Sven Eckelmann <sven@narfation.org>
Sven Eckelmann [Wed, 15 Jun 2011 13:08:59 +0000 (15:08 +0200)]
batman-adv: Move compare_orig to originator.c
compare_orig is only used in context of orig_node which is managed
inside originator.c. It is not necessary to keep that function inside
the header originator.h.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Sven Eckelmann [Sat, 4 Jun 2011 09:26:00 +0000 (11:26 +0200)]
batman-adv: Use enums for related constants
CodingStyle "Chapter 12: Macros, Enums and RTL" recommends to use enums
for several related constants. Internal states can be used without
defining the actual value, but all values which are visible to the
outside must be defined as before. Normal values are assigned as usual
and flags are defined by shifts of a bit.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Sven Eckelmann [Sun, 5 Jun 2011 08:20:19 +0000 (10:20 +0200)]
batman-adv: Rewrite debugfs kobj_to_* helpers as functions
CodingStyle "Chapter 12: Macros, Enums and RTL" highly recommends to use
functions instead of macros were possible. This ensures type safety and
prevents shadowing of other variables.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Sven Eckelmann [Sat, 4 Jun 2011 10:40:37 +0000 (12:40 +0200)]
batman-adv: Don't return value in void function
gw_node_delete is defined with "void" as return type, but still tries to
return a value. The called function gw_node_delete is also return as
void and thus doesn't provide a value for us.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Daniele Furlan [Mon, 6 Jun 2011 22:45:55 +0000 (00:45 +0200)]
batman-adv: accept delayed rebroadcasts to avoid bogus routing under heavy load
When a link is saturated (re)broadcasts of OGMs are delayed. Under heavy
load this delay may exceed the orig interval which leads to OGMs being
dropped (the code would only accept an OGM rebroadcast if it arrived
before the next OGM was broadcasted). With this patch batman-adv will
also accept delayed OGMs in order to avoid a bogus influence on the
routing metric.
Signed-off-by: Daniele Furlan <daniele.furlan@gmail.com> Signed-off-by: Sven Eckelmann <sven@narfation.org>
Sven Eckelmann [Tue, 10 May 2011 09:22:37 +0000 (11:22 +0200)]
batman-adv: Ensure that we really have route changes in update_route
The debug output of update_route has tests for "route deleted" and "route
added". All other situations are handled as "route changed". This is not
true because neigh_node and curr_router could be both NULL.
The function is not called in this situation, but the code might be
interpreted wrong when reading it without this test.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
batman-adv: a multiline comment should precede the variable it is describing
This comment has been wrongly put after the variable it refers to and was also bad indented
Signed-off-by: Antonio Quartulli <ordex@autistici.org> Acked-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Sven Eckelmann <sven@narfation.org>
batman-adv: use is_broadcast_ether_addr() instead of compare_eth(.., brd_addr)
Instead of comparing mac addresses with the broadcast address by means
of compare_eth(), the is_broadcast_ether_addr() kernel function has to be
used.
Signed-off-by: Antonio Quartulli <ordex@autistici.org> Acked-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Sven Eckelmann <sven@narfation.org>
Sven Eckelmann [Thu, 19 May 2011 19:43:08 +0000 (21:43 +0200)]
batman-adv: Check type of x and y in seq_(before|after)
seq_before and seq_after depend on the fact that both sequence numbers
have the same type and thus the same bitwidth. We can ensure that by
compile time checking using a compare between the pointer to the
temporary buffers which were created using the typeof of both
parameters. For example gcc would create a warning like
"warning: comparison of distinct pointer types lacks a cast".
Signed-off-by: Sven Eckelmann <sven@narfation.org>
batman-adv: move smallest_signed_int(), seq_before() and seq_after() into main.h
smallest_signed_int(), seq_before() and seq_after() are very useful
functions that help to handle comparisons between sequence numbers.
However they were only defined in vis.c. With this patch every
batman-adv function will be able to use them.
Signed-off-by: Antonio Quartulli <ordex@autistici.org> Signed-off-by: Sven Eckelmann <sven@narfation.org>
Sven Eckelmann [Sat, 14 May 2011 22:50:21 +0000 (00:50 +0200)]
batman-adv: Use rcu_dereference_protected by update-side
Usually rcu_dereference isn't necessary in situations were the
RCU-protected data structure cannot change, but sparse and lockdep still
need a similar functionality for analysis. rcu_dereference_protected
implements the reduced version which should be used to support the
dynamic and static analysis.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Sven Eckelmann [Sat, 14 May 2011 21:14:54 +0000 (23:14 +0200)]
batman-adv: Calculate sizeof using variable insead of types
Documentation/CodingStyle recommends to use the form
p = kmalloc(sizeof(*p), ...);
to calculate the size of a struct and not the version where the struct
name is spelled out to prevent bugs when the type of p changes. This
also seems appropriate for manipulation of buffers when they are
directly associated with p.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Sven Eckelmann [Sat, 14 May 2011 21:14:51 +0000 (23:14 +0200)]
batman-adv: Only use int up and down gw representation
It is not save to provide memory for an int and then cast the pointer to
it to long*. It is better to standardize the up and down gateway
bandwith representation to simple ints and only use long inside
conversation routines.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Sven Eckelmann [Sat, 14 May 2011 21:14:50 +0000 (23:14 +0200)]
batman-adv: Add const type qualifier for pointers
batman-adv uses pointers which are marked as const and should not
violate that type qualifier by passing it to functions which force a
cast to the non-const version.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Sven Eckelmann [Sat, 14 May 2011 21:14:49 +0000 (23:14 +0200)]
batman-adv: Don't do pointer arithmetic with void*
The size of void is currently set by gcc to 1, but is not well defined
in general. Therefore it is more advisable to cast it to char* before
doing pointer arithmetic.
Signed-off-by: Sven Eckelmann <sven@narfation.org>
batman-adv: move neigh_node->if_incoming->if_status check in find_router()
Every time that find_router() is invoked, if_status has to be compared with
IF_ACTIVE. Moving this comparison inside find_router() will avoid to write it
each time.
Signed-off-by: Antonio Quartulli <ordex@autistici.org> Signed-off-by: Sven Eckelmann <sven@narfation.org>
Linus Torvalds [Sun, 29 May 2011 21:06:42 +0000 (14:06 -0700)]
arm gpio drivers: make them 'depends on ARM'
We had a few drivers move from arch/arm into drivers/gpio, but they
don't actually compile without the ARM platform headers etc. As a
result they were messing up allyesconfig on x86.
Tyler Hicks [Tue, 24 May 2011 10:11:12 +0000 (05:11 -0500)]
eCryptfs: Remove ecryptfs_header_cache_2
Now that ecryptfs_lookup_interpose() is no longer using
ecryptfs_header_cache_2 to read in metadata, the kmem_cache can be
removed and the ecryptfs_header_cache_1 kmem_cache can be renamed to
ecryptfs_header_cache.
Tyler Hicks [Tue, 24 May 2011 09:56:23 +0000 (04:56 -0500)]
eCryptfs: Cleanup and optimize ecryptfs_lookup_interpose()
ecryptfs_lookup_interpose() has turned into spaghetti code over the
years. This is an effort to clean it up.
- Shorten overly descriptive variable names such as ecryptfs_dentry
- Simplify gotos and error paths
- Create helper function for reading plaintext i_size from metadata
It also includes an optimization when reading i_size from the metadata.
A complete page-sized kmem_cache_alloc() was being done to read in 16
bytes of metadata. The buffer for that is now statically declared.
Tyler Hicks [Mon, 2 May 2011 05:39:54 +0000 (00:39 -0500)]
eCryptfs: Return useful code from contains_ecryptfs_marker
Instead of having the calling functions translate the true/false return
code to either 0 or -EINVAL, have contains_ecryptfs_marker() return 0 or
-EINVAL so that the calling functions can just reuse the return code.
Also, rename the function to ecryptfs_validate_marker() to avoid callers
mistakenly thinking that it returns true/false codes.
Tyler Hicks [Tue, 24 May 2011 08:49:02 +0000 (03:49 -0500)]
eCryptfs: Fix new inode race condition
Only unlock and d_add() new inodes after the plaintext inode size has
been read from the lower filesystem. This fixes a race condition that
was sometimes seen during a multi-job kernel build in an eCryptfs mount.
https://bugzilla.kernel.org/show_bug.cgi?id=36002
Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com> Reported-by: David <david@unsolicited.net> Tested-by: David <david@unsolicited.net>
Linus Torvalds [Sun, 29 May 2011 18:44:33 +0000 (11:44 -0700)]
Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mjg59/platform-drivers-x86
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mjg59/platform-drivers-x86: (43 commits)
acer-wmi: support integer return type from WMI methods
msi-laptop: fix section mismatch in reference from the function load_scm_model_init
acer-wmi: support to set communication device state by new wmid method
acer-wmi: allow 64-bits return buffer from WMI methods
acer-wmi: check the existence of internal 3G device when set capability
platform/x86:delete two unused variables
support wlan hotkey on Acer Travelmate 5735Z
platform-x86: intel_mid_thermal: Fix memory leak
platform/x86: Fix Makefile for intel_mid_powerbtn
platform/x86: Simplify intel_mid_powerbtn
acer-wmi: Delete out-of-date documentation
acerhdf: Clean up includes
acerhdf: Drop pointless dependency on THERMAL_HWMON
acer-wmi: Update MAINTAINERS
wmi: Orphan ACPI-WMI driver
tc1100-wmi: Orphan driver
acer-wmi: does not allow negative number set to initial device state
platform/oaktrail: ACPI EC Extra driver for Oaktrail
thinkpad_acpi: Convert printks to pr_<level>
thinkpad_acpi: Correct !CONFIG_THINKPAD_ACPI_VIDEO warning
...
which was introduced by commit de03c72cfce5 ("mm: convert
mm->cpu_vm_cpumask into cpumask_var_t"), and is due to wrong ordering of
mm_init() vs mm_init_cpumask
Thomas wrote a patch to just fix the ordering of initialization, but I
hate the new double allocation in the fork path, so I ended up instead
doing some more radical surgery to clean it all up.
Reported-by: Thomas Gleixner <tglx@linutronix.de> Reported-by: Ingo Molnar <mingo@elte.hu> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus Torvalds [Sun, 29 May 2011 18:21:12 +0000 (11:21 -0700)]
Merge branch 'for-2.6.40' of git://linux-nfs.org/~bfields/linux
* 'for-2.6.40' of git://linux-nfs.org/~bfields/linux: (22 commits)
nfsd: make local functions static
NFSD: Remove unused variable from nfsd4_decode_bind_conn_to_session()
NFSD: Check status from nfsd4_map_bcts_dir()
NFSD: Remove setting unused variable in nfsd_vfs_read()
nfsd41: error out on repeated RECLAIM_COMPLETE
nfsd41: compare request's opcnt with session's maxops at nfsd4_sequence
nfsd v4.1 lOCKT clientid field must be ignored
nfsd41: add flag checking for create_session
nfsd41: make sure nfs server process OPEN with EXCLUSIVE4_1 correctly
nfsd4: fix wrongsec handling for PUTFH + op cases
nfsd4: make fh_verify responsibility of nfsd_lookup_dentry caller
nfsd4: introduce OPDESC helper
nfsd4: allow fh_verify caller to skip pseudoflavor checks
nfsd: distinguish functions of NFSD_MAY_* flags
svcrpc: complete svsk processing on cb receive failure
svcrpc: take advantage of tcp autotuning
SUNRPC: Don't wait for full record to receive tcp data
svcrpc: copy cb reply instead of pages
svcrpc: close connection if client sends short packet
svcrpc: note network-order types in svc_process_calldir
...
* git://git.kernel.org/pub/scm/linux/kernel/git/agk/linux-2.6-dm:
dm kcopyd: return client directly and not through a pointer
dm kcopyd: reserve fewer pages
dm io: use fixed initial mempool size
dm kcopyd: alloc pages from the main page allocator
dm kcopyd: add gfp parm to alloc_pl
dm kcopyd: remove superfluous page allocation spinlock
dm kcopyd: preallocate sub jobs to avoid deadlock
dm kcopyd: avoid pointless job splitting
dm mpath: do not fail paths after integrity errors
dm table: reject devices without request fns
dm table: allow targets to support discards internally
Linus Torvalds [Sun, 29 May 2011 18:20:02 +0000 (11:20 -0700)]
Merge branch 'nfs-for-2.6.40' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6
* 'nfs-for-2.6.40' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6:
SUNRPC: Support for RPC over AF_LOCAL transports
SUNRPC: Remove obsolete comment
SUNRPC: Use AF_LOCAL for rpcbind upcalls
SUNRPC: Clean up use of curly braces in switch cases
NFS: Revert NFSROOT default mount options
SUNRPC: Rename xs_encode_tcp_fragment_header()
nfs,rcu: convert call_rcu(nfs_free_delegation_callback) to kfree_rcu()
nfs41: Correct offset for LAYOUTCOMMIT
NFS: nfs_update_inode: print current and new inode size in debug output
NFSv4.1: Fix the handling of NFS4ERR_SEQ_MISORDERED errors
NFSv4: Handle expired stateids when the lease is still valid
SUNRPC: Deal with the lack of a SYN_SENT sk->sk_state_change callback...
Linus Torvalds [Sun, 29 May 2011 18:19:16 +0000 (11:19 -0700)]
Merge branch 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux-acpi-2.6
* 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux-acpi-2.6:
ACPI EC: remove redundant code
ACPI: Add D3 cold state
ACPI: processor: fix processor_physically_present in UP kernel
ACPI: Split out custom_method functionality into an own driver
ACPI: Cleanup custom_method debug stuff
ACPI EC: enable MSI workaround for Quanta laptops
ACPICA: Update to version 20110413
ACPICA: Execute an orphan _REG method under the EC device
ACPICA: Move ACPI_NUM_PREDEFINED_REGIONS to a more appropriate place
ACPICA: Update internal address SpaceID for DataTable regions
ACPICA: Add more methods eligible for NULL package element removal
ACPICA: Split all internal Global Lock functions to new file - evglock
ACPI: EC: add another DMI check for ASUS hardware
ACPI EC: remove dead code
ACPICA: Fix code divergence of global lock handling
ACPICA: Use acpi_os_create_lock interface
ACPI: osl, add acpi_os_create_lock interface
ACPI:Fix goto flows in thermal-sys
Linus Torvalds [Sun, 29 May 2011 18:18:09 +0000 (11:18 -0700)]
Merge branch 'idle-release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux-idle-2.6
* 'idle-release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux-idle-2.6:
x86 idle: deprecate mwait_idle() and "idle=mwait" cmdline param
x86 idle: deprecate "no-hlt" cmdline param
x86 idle APM: deprecate CONFIG_APM_CPU_IDLE
x86 idle floppy: deprecate disable_hlt()
x86 idle: EXPORT_SYMBOL(default_idle, pm_idle) only when APM demands it
x86 idle: clarify AMD erratum 400 workaround
idle governor: Avoid lock acquisition to read pm_qos before entering idle
cpuidle: menu: fixed wrapping timers at 4.294 seconds
Al Viro [Sun, 29 May 2011 12:46:08 +0000 (13:46 +0100)]
cifs/ubifs: Fix shrinker API change fallout
Commit 1495f230fa77 ("vmscan: change shrinker API by passing
shrink_control struct") changed the API of ->shrink(), but missed ubifs
and cifs instances.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Boaz Harrosh [Wed, 25 May 2011 18:25:29 +0000 (21:25 +0300)]
pnfs-obj: pg_test check for max_io_size
Implement pg_test vector to test for max IO sizes. We calculate
a max_io_size member only once, and cache it in lseg so to not
do so on every page insert.
Boaz Harrosh [Sun, 29 May 2011 08:45:39 +0000 (11:45 +0300)]
NFSv4.1: define nfs_generic_pg_test
By default, unless pnfs is used coalesce pages until pg_bsize
(rsize or wsize) is reached.
pnfs layout drivers define their own pg_test methods that use
pnfs_generic_pg_test and need to define their own I/O size
limits (e.g. based on the file stripe size).
[Move a check from nfs_pageio_do_add_request to nfs_generic_pg_test] Signed-off-by: Boaz Harrosh <bharrosh@panasas.com> Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Benny Halevy [Sun, 22 May 2011 16:53:48 +0000 (19:53 +0300)]
pnfs: encode_layoutcommit
Add a layout driver method to encode the layout type specific
opaque part of layout commit in-line in the xdr stream.
Currently, the pnfs-objects layout driver uses it to encode metadata hints
to the MDS and the blocks layout driver to commit provisionally allocated
extents to the file.
Boaz Harrosh [Thu, 26 May 2011 18:49:46 +0000 (21:49 +0300)]
pnfs-obj: report errors and .encode_layoutreturn Implementation.
An io_state pre-allocates an error information structure for each
possible osd-device that might error during IO. When IO is done if all
was well the io_state is freed. (as today). If the I/O has ended with an
error, the io_state is queued on a per-layout err_list. When eventually
encode_layoutreturn() is called, each error is properly encoded on the
XDR buffer and only then the io_state is removed from err_list and
de-allocated.
It is up to the io_engine to fill in the segment that fault and the type
of osd_error that occurred. By calling objlayout_io_set_result() for
each failing device.
In objio_osd:
* Allocate io-error descriptors space as part of io_state
* Use generic objlayout error reporting at end of io.
With the objects layout security model, we have object capabilities
that are associated with the layout and we anticipate that the server
will issue a cb_layoutrecall for any setattr that changes security
related attributes (user/group/mode/acl) or truncates the file.
Therefore, the layout is returned before issuing the setattr to avoid
the anticipated cb_layoutrecall.
Benny Halevy [Sun, 22 May 2011 16:52:37 +0000 (19:52 +0300)]
pnfs: layoutreturn
NFSv4.1 LAYOUTRETURN implementation
Currently, does not support layout-type payload encoding.
Signed-off-by: Alexandros Batsakis <batsakis@netapp.com> Signed-off-by: Andy Adamson <andros@citi.umich.edu> Signed-off-by: Andy Adamson <andros@netapp.com> Signed-off-by: Dean Hildebrand <dhildeb@us.ibm.com> Signed-off-by: Fred Isaman <iisaman@citi.umich.edu> Signed-off-by: Fred Isaman <iisaman@netapp.com> Signed-off-by: Marc Eshel <eshel@almaden.ibm.com> Signed-off-by: Zhang Jingwang <zhangjingwang@nrchpc.ac.cn>
[call pnfs_return_layout right before pnfs_destroy_layout]
[remove assert_spin_locked from pnfs_clear_lseg_list]
[remove wait parameter from the layoutreturn path.]
[remove return_type field from nfs4_layoutreturn_args]
[remove range from nfs4_layoutreturn_args]
[no need to send layoutcommit from _pnfs_return_layout]
[don't wait on sync layoutreturn]
[fix layout stateid in layoutreturn args]
[fixed NULL deref in _pnfs_return_layout]
[removed recaim member of nfs4_layoutreturn_args] Signed-off-by: Benny Halevy <bhalevy@panasas.com>
With the use of the in-kernel osd library. Implement read/write
of data from/to osd-objects according to information specified
in the objects-layout.
Support for stripping over mirrors with a received stripe_unit.
There are however a few constrains which are not supported:
1. Stripe Unit must be a multiple of PAGE_SIZE
2. stripe length (stripe_unit * number_of_stripes) can not be
bigger then 32bit.
Also support raid-groups and partial-layout. Partial-layout is
when not all the groups are received on the line, addressing
only a partial range of the file.
TODO:
Only raid0! raid 4/5/6 support will come at later stage
A none supported layout will send IO through the MDS
[Important fallout from the last rebase] Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
[gfp_flags] Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Benny Halevy [Sun, 22 May 2011 16:52:03 +0000 (19:52 +0300)]
pnfs: support for non-rpc layout drivers
Non-rpc layout driver such as for objects and blocks
implement their own I/O path and error handling logic.
Therefore bypass NFS-based error handling for these layout drivers.
[fix lseg ref-count bugs, and null de-refs]
[Fall out from: non-rpc layout drivers] Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
[get rid of PNFS_USE_RPC_CODE]
[get rid of __nfs4_write_done_cb]
[revert useless change in nfs4_write_done_cb] Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Boaz Harrosh [Thu, 26 May 2011 18:45:34 +0000 (21:45 +0300)]
pnfs-obj: objio_osd device information retrieval and caching
When a new layout is received in objio_alloc_lseg all device_ids
referenced are retrieved. The device information is queried for from MDS
and then the osd_device is looked-up from the osd-initiator library. The
devices are cached in a per-mount-point list, for later use. At unmount
all devices are "put" back to the library.
objlayout_get_deviceinfo(), objlayout_put_deviceinfo() middleware
API for retrieving device information given a device_id.
TODO: The device cache can get big. Cap its size. Keep an LRU and start
to return devices which were not used, when list gets to big, or
when new entries allocation fail.
[pnfs-obj: Bugs in new global-device-cache code] Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
[gfp_flags]
[use global device cache]
[use layout driver in global device cache] Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Boaz Harrosh [Sun, 22 May 2011 16:50:20 +0000 (19:50 +0300)]
pnfs-obj: decode layout, alloc/free lseg
objlayout_alloc_lseg prepares an xdr_stream and calls the
raid engins objio_alloc_lseg() to allocate a private
pnfs_layout_segment.
objio_osd.c::objio_alloc_lseg() uses passed xdr_stream to
decode and store the layout_segment information in an
objio_segment struct, using the pnfs_osd_xdr.h API for
the actual parsing the layout xdr.
objlayout_free_lseg calls objio_free_lseg() to free the
allocated space.
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
[gfp_flags]
[removed "extern" from function definitions] Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Benny Halevy [Sun, 22 May 2011 16:49:32 +0000 (19:49 +0300)]
pnfs-obj: pnfs_osd XDR definitions
* Add the pnfs_osd_xdr.h header
* defintions the pnfs_osd_layout structure including all it's
sub-types and constants.
* Declare the pnfs_osd_xdr_decode_layout API + all needed
inline helpers.
* Define the pnfs_osd_deviceaddr structure and all its subtypes and
constants.
* Declare API for decoding of a pnfs_osd_deviceaddr from XDR stream.
* Define the pnfs_osd_ioerr structure, its substructures and constants.
* Declare API for encoding of a pnfs_osd_ioerr into XDR stream.
* Define the pnfs_osd_layoutupdate structure and its substructures.
* Declare API for encoding of a pnfs_osd_layoutupdate into XDR stream.
Benny Halevy [Sun, 22 May 2011 16:49:06 +0000 (19:49 +0300)]
pnfs-obj: objlayoutdriver module skeleton
* Define the PNFS_OBJLAYOUT Kconfig option in the nfs
master Kconfig file.
* Add the objlayout driver to the Kernel's Kbuild system.
* Add the fs/nfs/objlayout/Kbuild file for building the
objlayoutdriver.ko driver
* Define fs/nfs/objlayout/objio_osd.c, register the driver on module
initialization and unregister on exit.
[pnfs-obj: remove of CONFIG_PNFS fallout] Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
[added "unsure" clause]
[depend on NFS_V4_1] Signed-off-by: Benny Halevy <bhalevy@panasas.com>
J. Bruce Fields [Sun, 22 May 2011 16:48:21 +0000 (19:48 +0300)]
pnfs: client stats
A pNFS client auto-negotiates a lot of features (minorversion level,
pNFS layout type, etc.). This is convenient, but makes certain kinds of
failures hard for a user to detect.
For example, if the client falls back on 4.0, or falls back to MDS IO
because the user didn't connect to the right iscsi disks before
mounting, the only symptoms may be reduced performance, which may not be
noticed till long after the actual failure, and may be difficult for a
user to diagnose.
However, such "failures" may also be perfectly normal in some cases, so
we don't want to spam the system logs with them.
One approach would be to put some more information into
/proc/self/mountstats.
Signed-off-by: J. Bruce Fields <bfields@fieldses.org> Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[pnfs: add commit client stats]
[fixup data types for "ret" variables in pnfs_try_to* inline funcs.] Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[fix definition of show_pnfs for !CONFIG_PNFS] Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[nfs41: Fix show_sessions in the not CONFIG_NFS_V4_1 case]
There is a build error when CONFIG_NFS_V4 is set but
CONFIG_NFS_V4_1 is *not* set. show_sessions() prototype
was unbalanced between the two cases. Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
[pnfs: super.c remove CONFIG_PNFS] Signed-off-by: Andy Adamson <andros@netapp.com> Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Benny Halevy [Tue, 24 May 2011 15:04:02 +0000 (18:04 +0300)]
NFSv4.1: use layout driver in global device cache
pnfs deviceids are unique per server, per layout type.
struct nfs_client is currently used to distinguish deviceids from
different nfs servers, yet these may clash between different layout
types on the same server. Therefore, use the layout driver associated
with each deviceid at insertion time to look it up, unhash, or
delete it.
Marc Eshel [Sun, 22 May 2011 16:47:09 +0000 (19:47 +0300)]
pnfs: CB_NOTIFY_DEVICEID
Note: This functionlaity is incomplete as all layout segments referring to
the 'to be removed device id' need to be reaped, and all in flight I/O drained.
[use be32 res in nfs4_callback_devicenotify]
[use nfs_client to qualify deviceid for cb_notify_deviceid]
[use global deviceid cache for CB_NOTIFY_DEVICEID]
[refactor device cache _lookup_deviceid]
[refactor device cache _find_get_deviceid] Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[Bug in new global-device-cache code]
[layout_driver MUST set free_deviceid_node if using dev-cache] Signed-off-by: Boaz Harrosh <bharrosh@panasas.com> Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Tyler Hicks [Tue, 24 May 2011 07:16:51 +0000 (02:16 -0500)]
eCryptfs: Cleanup inode initialization code
The eCryptfs inode get, initialization, and dentry interposition code
has two separate paths. One is for when dentry interposition is needed
after doing things like a mkdir in the lower filesystem and the other
is needed after a lookup. Unlocking new inodes and doing a d_add() needs
to happen at different times, depending on which type of dentry
interposing is being done.
This patch cleans up the inode get and initialization code paths and
splits them up so that the locking and d_add() differences mentioned
above can be handled appropriately in a later patch.
Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com> Tested-by: David <david@unsolicited.net>
Benny Halevy [Fri, 20 May 2011 11:47:33 +0000 (13:47 +0200)]
NFSv4.1: purge deviceid cache on nfs_free_client
Use the pnfs_layoutdriver_type both as a qualifier for the deviceid,
distinguishing deviceid from different layout types on the server,
and for freeing the layout-driver allocated structure containing the
nfs4_deviceid_node.
[BUG in _deviceid_purge_client]
[layout_driver MUST set free_deviceid_node if using dev-cache]
[let ver < 4.1 compile] Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
[removed EXPORT_SYMBOL_GPL(nfs4_deviceid_purge_client)] Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Mikulas Patocka [Sun, 29 May 2011 12:03:11 +0000 (13:03 +0100)]
dm kcopyd: reserve fewer pages
Reserve just the minimum of pages needed to process one job.
Because we allocate pages from page allocator, we don't need to reserve
a large number of pages. The maximum job size is SUB_JOB_SIZE and we
calculate the number of reserved pages based on this.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Mikulas Patocka [Sun, 29 May 2011 12:03:09 +0000 (13:03 +0100)]
dm io: use fixed initial mempool size
Replace the arbitrary calculation of an initial io struct mempool size
with a constant.
The code calculated the number of reserved structures based on the request
size and used a "magic" multiplication constant of 4. This patch changes
it to reserve a fixed number - itself still chosen quite arbitrarily.
Further testing might show if there is a better number to choose.
Note that if there is no memory pressure, we can still allocate an
arbitrary number of "struct io" structures. One structure is enough to
process the whole request.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Mikulas Patocka [Sun, 29 May 2011 12:03:07 +0000 (13:03 +0100)]
dm kcopyd: alloc pages from the main page allocator
This patch changes dm-kcopyd so that it allocates pages from the main
page allocator with __GFP_NOWARN | __GFP_NORETRY flags (so that it can
fail in case of memory pressure). If the allocation fails, dm-kcopyd
allocates pages from its own reserve.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Remove the spinlock protecting the pages allocation. The spinlock is only
taken on initialization or from single-threaded workqueue. Therefore, the
spinlock is useless.
The spinlock is taken in kcopyd_get_pages and kcopyd_put_pages.
kcopyd_get_pages is only called from run_pages_job, which is only
called from process_jobs called from do_work.
kcopyd_put_pages is called from client_alloc_pages (which is initialization
function) or from run_complete_job. run_complete_job is only called from
process_jobs called from do_work.
Another spinlock, kc->job_lock is taken each time someone pushes or pops
some work for the worker thread. Once we take kc->job_lock, we
guarantee that any written memory is visible to the other CPUs.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Mikulas Patocka [Sun, 29 May 2011 12:03:00 +0000 (13:03 +0100)]
dm kcopyd: preallocate sub jobs to avoid deadlock
There's a possible theoretical deadlock in dm-kcopyd because multiple
allocations from the same mempool are required to finish a request.
Avoid this by preallocating sub jobs.
There is a mempool of 512 entries. Each request requires up to 9
entries from the mempool. If we have at least 57 concurrent requests
running, the mempool may overflow and mempool allocations may start
blocking until another entry is freed to the mempool. Because the same
thread is used to free entries to the mempool and allocate entries from
the mempool, this may result in a deadlock.
This patch changes it so that one mempool entry contains all 9 "struct
kcopyd_job" required to fulfill the whole request. The allocation is
done only once in dm_kcopyd_copy and no further mempool allocations are
done during request processing.
If dm_kcopyd_copy is not run in the completion thread, this
implementation is deadlock-free.
MIN_JOBS needs reducing accordingly and we've chosen to reduce it
further to 8.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Mikulas Patocka [Sun, 29 May 2011 12:02:58 +0000 (13:02 +0100)]
dm kcopyd: avoid pointless job splitting
Don't split SUB_JOB_SIZE jobs
If the job size equals SUB_JOB_SIZE, there is no point in splitting it.
Splitting it just unnecessarily wastes time, because the split job size
is SUB_JOB_SIZE too.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
dm mpath: do not fail paths after integrity errors
Integrity errors need to be passed to the owner of the integrity
metadata for processing. Consequently EILSEQ should be passed up the
stack.
Cc: stable@kernel.org Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Acked-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Milan Broz [Sun, 29 May 2011 12:02:52 +0000 (13:02 +0100)]
dm table: reject devices without request fns
This patch adds a check that a block device has a request function
defined before it is used. Otherwise, misconfiguration can cause an oops.
Because we are allowing devices with zero size e.g. an offline multipath
device as in commit 2cd54d9bedb79a97f014e86c0da393416b264eb3
("dm: allow offline devices") there needs to be an additional check
to ensure devices are initialised. Some block devices, like a loop
device without a backing file, exist but have no request function.
Reproducer is trivial: dm-mirror on unbound loop device
(no backing file on loop devices)
Heiko Carstens [Sun, 29 May 2011 10:40:51 +0000 (12:40 +0200)]
[S390] mm: fix mmu_gather rework
Quite a few functions that get called from the tlb gather code require that
preemption must be disabled. So disable preemption inside of the called
functions instead.
The only drawback is that rcu_table_freelist_finish() doesn't get necessarily
called on the cpu(s) that filled the free lists. So we may see a delay, until
we finally see an rcu callback. However over time this shouldn't matter.
So we get rid of lots of "BUG: using smp_processor_id() in preemptible"
messages.
Heiko Carstens [Sun, 29 May 2011 10:40:50 +0000 (12:40 +0200)]
[S390] mm: fix storage key handling
page_get_storage_key() and page_set_storage_key() expect a page address
and not its page frame number. This got inconsistent with 2d42552d
"[S390] merge page_test_dirty and page_clear_dirty".
Result is that we read/write storage keys from random pages and do not
have a working dirty bit tracking at all.
E.g. SetPageUpdate() doesn't clear the dirty bit of requested pages, which
for example ext4 doesn't like very much and panics after a while.