David Ahern [Tue, 31 Jul 2012 04:31:33 +0000 (22:31 -0600)]
perf tools: Change strlist to use the new rblist
Replaces the direct use of rbtree code with the rblist API. In the end
the patch is a no-op on strlist functionality; the API for strlist is
not changed, only its implementaton.
Signed-off-by: David Ahern <dsahern@gmail.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1343709095-7089-3-git-send-email-dsahern@gmail.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
David Ahern [Mon, 30 Jul 2012 02:54:35 +0000 (20:54 -0600)]
perf kvm: Use strtol for walking guestmount directory
Only want to process directories under the guestmnount directory that
have a pid as a name (ie, all digits). Other entries in the guestmount
directory should be ignored. There is already a check that requires the
first character of each entry to be a digit, but atoi is used to convert
the directory name to a pid. For example if guestmount contains a
directory with the name 1foo, atoi converts it to a pid of 1 and a
machine is created with a pid of 1. This is wrong; this directory really
should be ignored. Use strtol to do that.
Signed-off-by: David Ahern <dsahern@gmail.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1343616875-6455-1-git-send-email-dsahern@gmail.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
David Ahern [Mon, 30 Jul 2012 02:53:51 +0000 (20:53 -0600)]
perf tool: Save cmdline from user in file header vs what is passed to record
A number of builtin commands process some user args and then pass the rest to
cmd_record. cmd_record then saves argc/argv that it receives into the header of
the perf data file. But this loses the arguments handled by the first command
-- ie., the real command line from the user. This patch saves the command line
as typed by the user rather than what was passed to cmd_record.
As an example consider the command:
$ perf kvm --guest --host --guestmount=/tmp/guest-mount record
-fo /tmp/perf.data -ag -- sleep 10
Currently the command saved to the header is:
cmdline : /tmp/p3.5/perf record -o perf.data.kvm -fo /tmp/perf.data -ag -- sleep 1
(ignore the duplicated -o -- the first would be yet another bug with perf-kvm).
With this patch the command line saved to the header is:
cmdline : /tmp/p3.5/perf kvm --guest --host --guestmount=/tmp/guest-mount
record -fo /tmp/perf.data -ag -- sleep 1
v2: simplified to saving the command in parse_options per Stephane's suggestion
Signed-off-by: David Ahern <dsahern@gmail.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Namhyung Kim <namhyung@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Link: http://lkml.kernel.org/r/1343616831-6408-1-git-send-email-dsahern@gmail.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
David Ahern [Mon, 30 Jul 2012 02:53:03 +0000 (20:53 -0600)]
perf top: Error handling for counter creation should parallel perf-record
5a7ed29 fixed up perf-record but not perf-top. Similar argument holds
for it -- fallback to PMU only if it does not exist and handle invalid
attributes separately.
Signed-off-by: David Ahern <dsahern@gmail.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Robert Richter <robert.richter@amd.com> Link: http://lkml.kernel.org/r/1343616783-6360-1-git-send-email-dsahern@gmail.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
So that we don't have to store it in the perf_session instance, because
in the future perf_session instances may have multiple evlists, each
with different sample_type/sizes.
Cc: David Ahern <dsahern@gmail.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Namhyung Kim <namhyung@gmail.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Link: http://lkml.kernel.org/n/tip-ptod86fxkpgq3h62m9refkv4@git.kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Like do_wp_page(), __replace_page() should do munlock_vma_page()
for the case when the old page still has other !VM_LOCKED
mappings. Unfortunately this needs mm/internal.h.
Also, move put_page() outside of ptl lock. This doesn't really
matter but looks a bit better.
uprobes: Rename vma_address() and make it return "unsigned long"
1. vma_address() returns loff_t, this looks confusing and this
is unnecessary after the previous change. Make it return "ulong",
all callers truncate the result anyway.
2. Its name conflicts with mm/rmap.c:vma_address(), rename it to
offset_to_vaddr(), this matches vaddr_to_offset().
1. register_for_each_vma() checks that vma_address() == vaddr,
but this is not enough. We should also ensure that
vaddr >= vm_start, find_vma() guarantees "vaddr < vm_end" only.
2. After the prevous changes, register_for_each_vma() is the
only reason why vma_address() has to return loff_t, all other
users know that we have the valid mapping at this offset and
thus the overflow is not possible.
Change the code to use vaddr_to_offset() instead, imho this looks
more clean/understandable and now we can change vma_address().
uprobes: Teach build_probe_list() to consider the range
Currently build_probe_list() builds the list of all uprobes
attached to the given inode, and the caller should filter out
those who don't fall into the [start,end) range, this is
sub-optimal.
This patch turns find_least_offset_node() into
find_node_in_range() which returns the first node inside the
[min,max] range, and changes build_probe_list() to use this node
as a starting point for rb_prev() and rb_next() to find all
other nodes the caller needs. The resulting list is no longer
sorted but we do not care.
This can speed up both build_probe_list() and the callers, but
there is another reason to introduce find_node_in_range(). It
can be used to figure out whether the given vma has uprobes or
not, this will be needed soon.
While at it, shift INIT_LIST_HEAD(tmp_list) into
build_probe_list().
Remove insert_vm_struct()->uprobe_mmap(). It is not needed, nobody
except arch/ia64/kernel/perfmon.c uses insert_vm_struct(vma)
with vma->vm_file != NULL.
And it is wrong. Again, get_user_pages() can not succeed before
vma_link(vma) makes is visible to find_vma(). And even if this
worked, we must not insert the new bp before this mapping is
visible to vma_prio_tree_foreach() for uprobe_unregister().
Remove copy_vma()->uprobe_mmap(new_vma), it is absolutely wrong.
This new_vma was just initialized to represent the new unmapped
area, [vm_start, vm_end) was returned by get_unmapped_area() in
the caller.
This means that uprobe_mmap()->get_user_pages() will fail for
sure, simply because find_vma() can never succeed. And I
verified that sys_mremap()->mremap_to() indeed always fails with
the wrong ENOMEM code if [addr, addr+old_len] is probed.
And why this uprobe_mmap() was added? I believe the intent was
wrong. Note that the caller is going to do move_page_tables(),
all registered uprobes are already faulted in, we only change
the virtual addresses.
NOTE: However, somehow we need to close the race with
uprobe_register() which relies on map_info->vaddr. This needs
another fix I'll try to do later. Probably we need uprobe_mmap()
in move_vma() but we can not do this right now, this can confuse
uprobes_state.counter (which I still hope we are going to kill).
uprobe_munmap() does get_user_pages() and it is also called from
the final mmput()->exit_mmap() path. This slows down
exit/mmput() for no reason, and I think it is simply
dangerous/wrong to try to fault-in a page into the dying mm. If
nothing else, this happens after the last sync_mm_rss(), afaics
handle_mm_fault() can change the task->rss_stat and make the
subsequent check_mm() unhappy.
Change uprobe_munmap() to check mm->mm_users != 0.
The bug was introduced by me in 449d0d7c ("uprobes: Simplify the
usage of uprobe->pending_list").
Yes, we do not care about uprobe->pending_list after return and
nobody can remove the current list entry, but put_uprobe(uprobe)
can actually free it and thus we need list_for_each_safe().
uprobes: Clean up and document write_opcode()->lock_page(old_page)
The comment above write_opcode()->lock_page(old_page) tells
about the race with do_wp_page(). I don't really understand
which exactly race it means, but afaics this lock_page() was not
enough to close all races with do_wp_page().
Anyway, since:
77fc4af1b59d uprobes: Change register_for_each_vma() to take mm->mmap_sem for writing
this code is always called with ->mmap_sem held for writing,
so we can forget about do_wp_page().
However, we can't simply remove this lock_page(), and the only
(afaics) reason is __replace_page()->try_to_free_swap().
Nothing in write_opcode() needs it, move it into
__replace_page() and fix the comment.
write_opcode() does lock_page(new_page) for no reason. Nobody
can see this page until __replace_page() exposes it under ptl
lock, and we do nothing with this page after pte_unmap_unlock().
If nothing else, the similar code in do_wp_page() doesn't lock
the new page for page_add_new_anon_rmap/set_pte_at_notify.
uprobes: __replace_page() should not use page_address_in_vma()
page_address_in_vma(old_page) in __replace_page() is ugly and
wrong. The caller already knows the correct virtual address,
this page was found by get_user_pages(vaddr).
However, page_address_in_vma() can actually fail if
page->mapping was cleared by __delete_from_page_cache() after
get_user_pages() returns. But this means the race with page
reclaim, write_opcode() should not fail, it should retry and
read this page again. Probably the race with remove_mapping() is
not possible due to page_freeze_refs() logic, but afaics at
least shmem_writepage()->shmem_delete_from_page_cache() can
clear ->mapping.
We could change __replace_page() to return -EAGAIN in this case,
but it would be better to simply use the caller's vaddr and rely
on page_check_address().
uprobes: Don't recheck vma/f_mapping in write_opcode()
write_opcode() rechecks valid_vma() and ->f_mapping, this is
pointless. The caller, register_for_each_vma() or uprobe_mmap(),
has already done these checks under mmap_sem.
To clarify, uprobe_mmap() checks valid_vma() only, but we can
rely on build_probe_list(vm_file->f_mapping->host).
perf/x86: Fix LLC-* and node-* events on Intel SandyBridge
LLC-* and node-* events require using the OFFCORE_RESPONSE events
on SandyBridge, but the hw_cache_extra_regs is left uninitialized.
This patch adds the missing extra register configure table for
SandyBridge.
The uncore subsystem in Nehalem-EX consists of 7 components
(U-Box, C-Box, B-Box, S-Box, R-Box, M-Box and W-Box). This
patch is large because the way to program these boxes is
diverse.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/4FF534F1.3030307@intel.com
[ Improved the code. ] Signed-off-by: Ingo Molnar <mingo@kernel.org>
Andrew Vagin [Thu, 12 Jul 2012 10:14:29 +0000 (14:14 +0400)]
sched: Deliver sched_switch events to the current task
Otherwise they can't be filtered for a defined task:
perf record -e sched:sched_switch ./foo
This command doesn't report any events without this patch.
I think it isn't a security concern if someone knows who will
be executed next - this can already be observed by polling /proc
state. By default perf is disabled for non-root users in any case.
I need these events for profiling sleep times. sched_switch is used for
getting callchains and sched_stat_* is used for getting time periods.
These events are combined in user space, then it can be analyzed by
perf tools.
perf annotate: Prevent overflow in size calculation
A large enough symbol size causes an overflow in the size parameter to
the histogram allocation, leading to a segfault in
symbol__inc_addr_samples later on when this histogram is accessed.
In the case of being called via perf-report, this returns back and
gracefully ignores the sample, eventually ignoring the chained return
value of perf_session_deliver_event in flush_sample_queue.
Signed-off-by: Cody Schafer <cody@linux.vnet.ibm.com> Acked-by: Namhyung Kim <namhyung@kernel.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com> Link: http://lkml.kernel.org/r/1342753525-4521-1-git-send-email-cody@linux.vnet.ibm.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Cross compiling perf requires setting ARCH and CROSS_COMPILE variables,
but libtraceevent couldn't detect the changes so it ends up believing no
recompiling is required. Thus the linker failed like:
LINK perf
../lib/traceevent//libtraceevent.a: member ../lib/traceevent//libtraceevent.a(event-parse.o) in archive is not an object
collect2: ld returned 1 exit status
make: *** [perf] Error 1
This patch fixes this by adding TRACEEVENT-CFLAGS file like
PERF-CFLAGS to track those changes.
Signed-off-by: Namhyung Kim <namhyung@kernel.org> Cc: David Ahern <dsahern@gmail.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1341559297-25725-2-git-send-email-namhyung@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Bison 2.6 started to generate parse_events_parse() declaration in header. In
this case we have redundant redeclaration:
util/parse-events.c:29:5: error: redundant redeclaration of ‘parse_events_parse’ [-Werror=redundant-decls]
In file included from util/parse-events.c:14:0:
util/parse-events-bison.h:99:5: note: previous declaration of ‘parse_events_parse’ was here
cc1: all warnings being treated as errors
Let's disable -Wredundant-decls for util/parse-events.c since it includes
header we can't control.
Signed-off-by: Kirill A. Shutemov <kirill@shutemov.name> Cc: Ingo Molnar <mingo@redhat.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/20120723210407.GA25186@shutemov.name Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
perf tools: use XSI-complaint version of strerror_r() instead of GNU-specific
Perf uses GNU-specific version of strerror_r(). The GNU-specific strerror_r()
returns a pointer to a string containing the error message. This may be either
a pointer to a string that the function stores in buf, or a pointer to some
(immutable) static string (in which case buf is unused).
In glibc-2.16 GNU version was marked with attribute warn_unused_result. It
triggers few warnings in perf:
util/target.c: In function ‘perf_target__strerror’:
util/target.c:114:13: error: ignoring return value of ‘strerror_r’, declared with attribute warn_unused_result [-Werror=unused-result]
ui/browsers/hists.c: In function ‘hist_browser__dump’:
ui/browsers/hists.c:981:13: error: ignoring return value of ‘strerror_r’, declared with attribute warn_unused_result [-Werror=unused-result]
They are bugs.
Let's fix strerror_r() usage.
Signed-off-by: Kirill A. Shutemov <kirill@shutemov.name> Acked-by: Ulrich Drepper <drepper@gmail.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Ulrich Drepper <drepper@gmail.com> Link: http://lkml.kernel.org/r/20120723210654.GA25248@shutemov.name
[ committer note: s/assert/BUG_ON/g ] Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
perf tools: Make the breakpoint events sample period default to 1
There have one problem about hw_breakpoint perf event, as watched, the
events reported to userspace is not correctly, sometime one trigger
bp_event report several events, sometime bp_event cannot go through to
user.
The root cause is attr->freq is 1 passed to kernel defaultly in bp
events, this make kernel calculate event period not as expect, make
sample period to 1 will change attr->freq to 0, to fix this problem.
This patch is similar with commit f92128 about tracepoint events:
perf: Make the trace events sample period default to 1
Jiri Olsa [Sun, 22 Jul 2012 12:14:39 +0000 (14:14 +0200)]
perf symbols: Add dso data caching
Adding dso data caching so we don't need to open/read/close, each time
we want dso data.
The DSO data caching affects following functions:
dso__data_read_offset
dso__data_read_addr
Each DSO read tries to find the data (based on offset) inside the cache.
If it's not present it fills the cache from file, and returns the data.
If it is present, data are returned with no file read.
Each data read is cached by reading cache page sized/aligned amount of
DSO data. The cache page size is hardcoded to 4096. The cache is using
RB tree with file offset as a sort key.
Signed-off-by: Jiri Olsa <jolsa@redhat.com> Cc: Arun Sharma <asharma@fb.com> Cc: Benjamin Redelings <benjamin.redelings@nescent.org> Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com> Cc: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Frank Ch. Eigler <fche@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Robert Richter <robert.richter@amd.com> Cc: Stephane Eranian <eranian@google.com> Cc: Tom Zanussi <tzanussi@gmail.com> Cc: Ulrich Drepper <drepper@gmail.com> Link: http://lkml.kernel.org/r/1342959280-5361-17-git-send-email-jolsa@redhat.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Jiri Olsa [Sun, 22 Jul 2012 12:14:33 +0000 (14:14 +0200)]
perf symbols: Add interface to read DSO image data
Adding following interface for DSO object to allow
reading of DSO image data:
dso__data_fd
- opens DSO and returns file descriptor
Binary types are used to locate/open DSO in following order:
DSO_BINARY_TYPE__BUILD_ID_CACHE
DSO_BINARY_TYPE__SYSTEM_PATH_DSO
In other word we first try to open DSO build-id path,
and if that fails we try to open DSO system path.
dso__data_read_offset
- reads DSO data from specified offset
dso__data_read_addr
- reads DSO data from specified address/map.
Signed-off-by: Jiri Olsa <jolsa@redhat.com> Cc: Arun Sharma <asharma@fb.com> Cc: Benjamin Redelings <benjamin.redelings@nescent.org> Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com> Cc: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Frank Ch. Eigler <fche@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Robert Richter <robert.richter@amd.com> Cc: Stephane Eranian <eranian@google.com> Cc: Tom Zanussi <tzanussi@gmail.com> Cc: Ulrich Drepper <drepper@gmail.com> Link: http://lkml.kernel.org/r/1342959280-5361-11-git-send-email-jolsa@redhat.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
perf tools: Fix trace events storms due to weight demux
Trace events have a period (weight) of 1 by default. This can be
overriden on events definition by using the __perf_count() macro.
For example, the sched_stat_runtime() is weighted with the runtime of
the task that fired the event.
By default, perf handles such weighted event by dividing it into
individual events carrying a weight of 1. For example if
sched_stat_runtime is fired and the task has run 5000000 nsecs, perf
divides it into 5000000 events in the buffer.
This behaviour makes weighted events unusable because they quickly
fullfill the buffers and we lose most events.
The commit 5d81e5cfb37a174e8ddc0413e2e70cdf05807ace ("events: Don't
divide events if it has field period") solves this problem by sending
only one event when PERF_SAMPLE_PERIOD flag is set. The weight is
carried in the sample itself such that we don't need to demultiplex it
anymore.
This patch provides the last missing piece to use this feature by
setting PERF_SAMPLE_PERIOD from perf tools when we deal with trace
events.
Before:
$ ./perf record -e sched:* -a sleep 1
[ perf record: Woken up 3 times to write data ]
[ perf record: Captured and wrote 1.619 MB perf.data (~70749 samples) ]
Warning:
Processed 16909 events and lost 1 chunks!
After:
$ ./perf record -e sched:* -a sleep 1
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.074 MB perf.data (~3228 samples) ]
David Ahern [Fri, 20 Jul 2012 23:25:51 +0000 (17:25 -0600)]
perf kvm: Limit repetitive guestmount message to once per directory
After 7ed97ad use of the guestmount option without a subdir for *each*
VM generates an error message for each sample related to that VM. Once
per VM is enough.
Signed-off-by: David Ahern <dsahern@gmail.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1342826756-64663-7-git-send-email-dsahern@gmail.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
After a maddening 2 days sifting through perf minutia I found it --
id_hdr_size is not initialized for guest machines. This shows up on the
report side as random garbage for the cpu and timestamp, e.g.,
David Ahern [Fri, 20 Jul 2012 23:25:47 +0000 (17:25 -0600)]
perf kvm: Set name for VM process in guest machine
COMM events are not generated in the context of a guest machine, so the
thread name is never set for the VMM process. For example, the qemu-kvm
name applies to the process in the host machine, not the guest machine.
So, samples for guest machines are currently displayed as:
99.67% :5671 [unknown] [g] 0xffffffff81366b41
where 5671 is the pid of the VMM. With this patch the samples in the guest
machine are shown as:
Merge branch 'x86-build-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull a x86/build change from Ingo Molnar.
This makes the default stack alignment on x86-64 be just 8, allowing for
improved code generation (it can avoid some unnecessary extra alignment
logic and use just pure push/pop sequences) and smaller stack frames.
We can't generally do SSE with 16-byte alignment issues in the kernel anyway.
* 'x86-build-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86-64, gcc: Use -mpreferred-stack-boundary=3 if supported
Merge branch 'x86-uv-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86/uv changes from Ingo Molnar:
"UV2 BAU productization fixes.
The BAU (Broadcast Assist Unit) is SGI's fancy out of line way on UV
hardware to do TLB flushes, instead of the normal APIC IPI methods.
The commits here fix / work around hangs in their latest hardware
iteration (UV2).
My understanding is that the main purpose of the out of line
signalling channel is to improve scalability: the UV APIC hardware
glue does not handle broadcasting to many CPUs very well, and this
matters most for TLB shootdowns.
[ I don't agree with all aspects of the current approach: in hindsight
it would have been better to link the BAU at the IPI/APIC driver
level instead of the TLB shootdown level, where TLB flushes are
really just one of the uses of broadcast SMP messages. Doing that
would improve scalability in some other ways and it would also
remove a few uglies from the TLB path. It would also be nice to
push more is_uv_system() tests into proper x86_init or x86_platform
callbacks. Cliff? ]"
* 'x86-uv-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/uv: Work around UV2 BAU hangs
x86/uv: Implement UV BAU runtime enable and disable control via /proc/sgi_uv/
x86/uv: Fix the UV BAU destination timeout period
Merge branch 'x86-reboot-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86/reboot changes from Ingo Molnar:
"Now that the revampted x86 real-mode trampoline code is upstream and
seems to be working well, we can extend the 64-bit reboot code to be
as capable as the 32-bit one."
* 'x86-reboot-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86-64, reboot: Be more paranoid in 64-bit reboot=bios
x86, reboot: Drop redundant write of reboot_mode
x86-64, reboot: Allow reboot=bios and reboot-cpu override on x86-64
Merge branch 'x86-platform-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 platform changes from Ingo Molnar:
"This tree mostly involves various APIC driver cleanups/robustization,
and vSMP motivated platform callback improvements/cleanups"
Fix up trivial conflict due to printk cleanup right next to return value
change.
* 'x86-platform-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (29 commits)
Revert "x86/early_printk: Replace obsolete simple_strtoul() usage with kstrtoint()"
x86/apic/x2apic: Use multiple cluster members for the irq destination only with the explicit affinity
x86/apic/x2apic: Limit the vector reservation to the user specified mask
x86/apic: Optimize cpu traversal in __assign_irq_vector() using domain membership
x86/vsmp: Fix vector_allocation_domain's return value
irq/apic: Use config_enabled(CONFIG_SMP) checks to clean up irq_set_affinity() for UP
x86/vsmp: Fix linker error when CONFIG_PROC_FS is not set
x86/apic/es7000: Make apicid of a cluster (not CPU) from a cpumask
x86/apic/es7000+summit: Always make valid apicid from a cpumask
x86/apic/es7000+summit: Fix compile warning in cpu_mask_to_apicid()
x86/apic: Fix ugly casting and branching in cpu_mask_to_apicid_and()
x86/apic: Eliminate cpu_mask_to_apicid() operation
x86/x2apic/cluster: Vector_allocation_domain() should return a value
x86/apic/irq_remap: Silence a bogus pr_err()
x86/vsmp: Ignore IOAPIC IRQ affinity if possible
x86/apic: Make cpu_mask_to_apicid() operations check cpu_online_mask
x86/apic: Make cpu_mask_to_apicid() operations return error code
x86/apic: Avoid useless scanning thru a cpumask in assign_irq_vector()
x86/apic: Try to spread IRQ vectors to different priority levels
x86/apic: Factor out default vector_allocation_domain() operation
...
Merge branch 'x86-debug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull debug-for-linus git tree from Ingo Molnar.
Fix up trivial conflict in arch/x86/kernel/cpu/perf_event_intel.c due to
a printk() having changed to a pr_info() differently in the two branches.
* 'x86-debug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86: Move call to print_modules() out of show_regs()
x86/mm: Mark free_initrd_mem() as __init
x86/microcode: Mark microcode_id[] as __initconst
x86/nmi: Clean up register_nmi_handler() usage
x86: Save cr2 in NMI in case NMIs take a page fault (for i386)
x86: Remove cmpxchg from i386 NMI nesting code
x86: Save cr2 in NMI in case NMIs take a page fault
x86/debug: Add KERN_<LEVEL> to bare printks, convert printks to pr_<level>
Merge branch 'x86-asm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86/asm changes from Ingo Molnar:
"Assorted single-commit improvements, as usual"
* 'x86-asm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/mm/mtrr: Slightly simplify print_mtrr_state()
x86/mm/mtrr: Fix alignment determination in range_to_mtrr()
x86/copy_user_generic: Optimize copy_user_generic with CPU erms feature
x86/alternatives: Use atomic_xchg() instead atomic_dec_and_test() for stop_machine_text_poke()
Merge branch 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer core changes from Ingo Molnar:
"Continued cleanups of the core time and NTP code, plus more nohz work
preparing for tick-less userspace execution."
* 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
time: Rework timekeeping functions to take timekeeper ptr as argument
time: Move xtime_nsec adjustment underflow handling timekeeping_adjust
time: Move arch_gettimeoffset() usage into timekeeping_get_ns()
time: Refactor accumulation of nsecs to secs
time: Condense timekeeper.xtime into xtime_sec
time: Explicitly use u32 instead of int for shift values
time: Whitespace cleanups per Ingo%27s requests
nohz: Move next idle expiry time record into idle logic area
nohz: Move ts->idle_calls incrementation into strict idle logic
nohz: Rename ts->idle_tick to ts->last_tick
nohz: Make nohz API agnostic against idle ticks cputime accounting
nohz: Separate idle sleeping time accounting from nohz logic
timers: Improve get_next_timer_interrupt()
timers: Add accounting of non deferrable timers
timers: Consolidate base->next_timer update
timers: Create detach_if_pending() and use it
Merge branch 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf events changes from Ingo Molnar:
"- kernel side:
- Intel uncore PMU support for Nehalem and Sandy Bridge CPUs, we
support both the events available via the MSR and via the PCI
access space.
- various uprobes cleanups and restructurings
- PMU driver quirks by microcode version and required x86 microcode
loader cleanups/robustization
- various tracing robustness updates
- static keys: remove obsolete static_branch()
- tooling side:
- GTK browser improvements
- perf report browser: support screenshots to file
- more automated tests
- perf kvm improvements
- perf bench refinements
- build environment improvements
- pipe mode improvements
- libtraceevent updates, we have now hopefully merged most bits with
the out of tree forked code base
... and many other goodies."
* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (138 commits)
tracing: Check for allocation failure in __tracing_open()
perf/x86: Fix intel_perfmon_event_mapformatting
jump label: Remove static_branch()
tracepoint: Use static_key_false(), since static_branch() is deprecated
perf/x86: Uncore filter support for SandyBridge-EP
perf/x86: Detect number of instances of uncore CBox
perf/x86: Fix event constraint for SandyBridge-EP C-Box
perf/x86: Use 0xff as pseudo code for fixed uncore event
perf/x86: Save a few bytes in 'struct x86_pmu'
perf/x86: Add a microcode revision check for SNB-PEBS
perf/x86: Improve debug output in check_hw_exists()
perf/x86/amd: Unify AMD's generic and family 15h pmus
perf/x86: Move Intel specific code to intel_pmu_init()
perf/x86: Rename Intel specific macros
perf/x86: Fix USER/KERNEL tagging of samples
perf tools: Split event symbols arrays to hw and sw parts
perf tools: Split out PE_VALUE_SYM parsing token to SW and HW tokens
perf tools: Add empty rule for new line in event syntax parsing
perf test: Use ARRAY_SIZE in parse events tests
tools lib traceevent: Cleanup realloc use
...
Merge branch 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull RCU changes from Ingo Molnar:
"Quoting from Paul, the major features of this series are:
1. Preventing latency spikes of more than 200 microseconds for
kernels built with NR_CPUS=4096, which is reportedly becoming the
default for some distros. This is a first step, as it does not
help with systems that actually -have- 4096 CPUs (work on this case
is in progress, but is not yet ready for mainline).
This category also includes improving concurrency of rcu_barrier(),
placed here due to conflicts. Posted to LKML at:
https://lkml.org/lkml/2012/6/22/381
Note that patches 18-22 of that series have been defered to 3.7, as
they have not yet proven themselves to be mainline-ready (and yes,
these are the ones intended to get rid of RCU's latency spikes for
systems that actually have 4096 CPUs).
2. Updates to documentation and rcutorture fixes, the latter category
including improvements to rcu_barrier() testing. Posted to LKML at
* 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (48 commits)
rcu: Fix broken strings in RCU's source code.
rcu: Fix code-style issues involving "else"
rcu: Introduce check for callback list/count mismatch
rcu: Make RCU_FAST_NO_HZ respect nohz= boot parameter
rcu: Fix qlen_lazy breakage
rcu: Round FAST_NO_HZ lazy timeout to nearest second
rcu: The rcu_needs_cpu() function is not a quiescent state
rcu: Dump only the current CPU's buffers for idle-entry/exit warnings
rcu: Add check for CPUs going offline with callbacks queued
rcu: Disable preemption in rcu_blocking_is_gp()
rcu: Prevent uninitialized string in RCU CPU stall info
rcu: Fix rcu_is_cpu_idle() #ifdef in TINY_RCU
rcu: Split RCU core processing out of __call_rcu()
rcu: Prevent __call_rcu() from invoking RCU core on offline CPUs
rcu: Make __call_rcu() handle invocation from idle
rcu: Remove function versions of __kfree_rcu and __is_kfree_rcu_offset
rcu: Consolidate tree/tiny __rcu_read_{,un}lock() implementations
rcu: Remove return value from rcu_assign_pointer()
key: Remove extraneous parentheses from rcu_assign_keypointer()
rcu: Remove return value from RCU_INIT_POINTER()
...
Merge branch 'core-iommu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull core/iommu changes from Ingo Molnar.
* 'core-iommu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
iommu/dmar: Use pr_format() instead of PREFIX to tidy up pr_*() calls
iommu/dmar: Reserve mmio space used by the IOMMU, if the BIOS forgets to
iommu/dmar: Replace printks with appropriate pr_*()
This commit is subtly buggy: kstrto*int() can return an error but
it's not checked in every path. simple_strtoul() on the other hand
could not fail, so this patch subtly intruduces new failure modes.
Merge emailed kgdb dmesg fixups patches from Anton Vorontsov:
"The dmesg command appears to be broken after the printk rework. The
old logic in the kdb code makes no sense in terms of current
printk/logging storage format, and KDB simply hangs forever upon
entering 'dmesg' command.
The first patch revives the command by switching to kmsg_dumper
iterator. As a side-effect, the code is now much more simpler.
A few changes were needed in the printk.c: we needed unlocked variant
of the kmsg_dumper iterator, but these can surely wait for 3.6.
It's probably too late even for the first patch to go to 3.5, but I'll
try to convince otherwise. :-) Here we go:
- The current code is broken for sure, and has no hope to work at
all. It is a regression
- The new code works for me, and probably works for everyone else;
- If it compiles (and I urge everyone to compile-test it on your
setup), it hardly can make things worse."
* Merge emailed patches from Anton Vorontsov: (4 commits)
kdb: Switch to nolock variants of kmsg_dump functions
printk: Implement some unlocked kmsg_dump functions
printk: Remove kdb_syslog_data
kdb: Revive dmesg command
Anton Vorontsov [Sat, 21 Jul 2012 00:27:37 +0000 (17:27 -0700)]
kdb: Revive dmesg command
The kgdb dmesg command is broken after the printk rework. The old logic
in kdb code makes no sense in terms of current printk/logging storage
format, and KDB simply hangs forever.
This patch revives the command by switching to kmsg_dumper iterator.
The code is now much more simpler and shorter.
Signed-off-by: Anton Vorontsov <anton.vorontsov@linaro.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Merge branch 'upstream' of git://git.linux-mips.org/pub/scm/ralf/upstream-linus
Pull late MIPS fixes from Ralf Baechle:
"This fixes a number of lose ends in the MIPS code and various bug
fixes.
Aside of dropping some patch that should not be in this pull request
everything has sat in -next for quite a while and there are no known
issues.
The biggest patch in this patch set moves the allocation of an array
that is aliased to a function (for runtime generated code) to
assembler code. This avoids an issue with certain toolchains when
building for microMIPS."
* 'upstream' of git://git.linux-mips.org/pub/scm/ralf/upstream-linus: (35 commits)
MIPS: PCI: Move fixups from __init to __devinit.
MIPS: Fix bug.h MIPS build regression
MIPS: sync-r4k: remove redundant irq operation
MIPS: smp: Warn on too early irq enable
MIPS: call set_cpu_online() on cpu being brought up with irq disabled
MIPS: call ->smp_finish() a little late
MIPS: Yosemite: delay irq enable to ->smp_finish()
MIPS: SMTC: delay irq enable to ->smp_finish()
MIPS: BMIPS: delay irq enable to ->smp_finish()
MIPS: Octeon: delay enable irq to ->smp_finish()
MIPS: Oprofile: Fix build as a module.
MIPS: BCM63XX: Fix BCM6368 IPSec clock bit
MIPS: perf: Fix build error caused by unused counters_per_cpu_to_total()
MIPS: Fix Magic SysRq L kernel crash.
MIPS: BMIPS: Fix duplicate header inclusion.
mips: mark const init data with __initconst instead of __initdata
MIPS: cmpxchg.h: Add missing include
MIPS: Malta may also be equipped with MIPS64 R2 processors.
MIPS: Fix typo multipy -> multiply
MIPS: Cavium: Fix duplicate ARCH_SPARSEMEM_ENABLE in kconfig.
...
Merge tag 'dm-3.5-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/agk/linux-dm
Pull device-mapper discard fixes from Alasdair G Kergon:
- avoid a crash in dm-raid1 when discards coincide with mirror
recovery;
- avoid discarding shared data that's still needed in dm-thin;
- don't guarantee that discarded blocks will be wiped in dm-raid1.
* tag 'dm-3.5-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/agk/linux-dm:
dm raid1: set discard_zeroes_data_unsupported
dm thin: do not send discards to shared blocks
dm raid1: fix crash with mirror recovery and discard
Merge branch 'for-linus' of git://git.open-osd.org/linux-open-osd
Pull pnfs/ore fixes from Boaz Harrosh:
"These are catastrophic fixes to the pnfs objects-layout that were just
discovered. They are also destined for @stable.
I have found these and worked on them at around RC1 time but
unfortunately went to the hospital for kidney stones and had a very
slow recovery. I refrained from sending them as is, before proper
testing, and surly I have found a bug just yesterday.
So now they are all well tested, and have my sign-off. Other then
fixing the problem at hand, and assuming there are no bugs at the new
code, there is low risk to any surrounding code. And in anyway they
affect only these paths that are now broken. That is RAID5 in pnfs
objects-layout code. It does also affect exofs (which was not broken)
but I have tested exofs and it is lower priority then objects-layout
because no one is using exofs, but objects-layout has lots of users."
* 'for-linus' of git://git.open-osd.org/linux-open-osd:
pnfs-obj: Fix __r4w_get_page when offset is beyond i_size
pnfs-obj: don't leak objio_state if ore_write/read fails
ore: Unlock r4w pages in exact reverse order of locking
ore: Remove support of partial IO request (NFS crash)
ore: Fix NFS crash by supporting any unaligned RAID IO
and we finally have the fix. I am quite confident the fix is correct
because I could reproduce the problem with nandsim and verify the fix.
It was also verified by Iwo (the reporter).
I am also confident that this is OK to merge the fix so late because
this patch affects only the fixup functionality, which is not used by
most users."
* tag 'upstream-3.5-rc8' of git://git.infradead.org/linux-ubifs:
UBIFS: fix a bug in empty space fix-up
We can't guarantee that REQ_DISCARD on dm-mirror zeroes the data even if
the underlying disks support zero on discard. So this patch sets
ti->discard_zeroes_data_unsupported.
For example, if the mirror is in the process of resynchronizing, it may
happen that kcopyd reads a piece of data, then discard is sent on the
same area and then kcopyd writes the piece of data to another leg.
Consequently, the data is not zeroed.
When process_discard receives a partial discard that doesn't cover a
full block, it sends this discard down to that block. Unfortunately, the
block can be shared and the discard would corrupt the other snapshots
sharing this block.
This patch detects block sharing and ends the discard with success when
sending it to the shared block.
The above change means that if the device supports discard it can't be
guaranteed that a discard request zeroes data. Therefore, we set
ti->discard_zeroes_data_unsupported.
dm raid1: fix crash with mirror recovery and discard
This patch fixes a crash when a discard request is sent during mirror
recovery.
Firstly, some background. Generally, the following sequence happens during
mirror synchronization:
- function do_recovery is called
- do_recovery calls dm_rh_recovery_prepare
- dm_rh_recovery_prepare uses a semaphore to limit the number
simultaneously recovered regions (by default the semaphore value is 1,
so only one region at a time is recovered)
- dm_rh_recovery_prepare calls __rh_recovery_prepare,
__rh_recovery_prepare asks the log driver for the next region to
recover. Then, it sets the region state to DM_RH_RECOVERING. If there
are no pending I/Os on this region, the region is added to
quiesced_regions list. If there are pending I/Os, the region is not
added to any list. It is added to the quiesced_regions list later (by
dm_rh_dec function) when all I/Os finish.
- when the region is on quiesced_regions list, there are no I/Os in
flight on this region. The region is popped from the list in
dm_rh_recovery_start function. Then, a kcopyd job is started in the
recover function.
- when the kcopyd job finishes, recovery_complete is called. It calls
dm_rh_recovery_end. dm_rh_recovery_end adds the region to
recovered_regions or failed_recovered_regions list (depending on
whether the copy operation was successful or not).
The above mechanism assumes that if the region is in DM_RH_RECOVERING
state, no new I/Os are started on this region. When I/O is started,
dm_rh_inc_pending is called, which increases reg->pending count. When
I/O is finished, dm_rh_dec is called. It decreases reg->pending count.
If the count is zero and the region was in DM_RH_RECOVERING state,
dm_rh_dec adds it to the quiesced_regions list.
Consequently, if we call dm_rh_inc_pending/dm_rh_dec while the region is
in DM_RH_RECOVERING state, it could be added to quiesced_regions list
multiple times or it could be added to this list when kcopyd is copying
data (it is assumed that the region is not on any list while kcopyd does
its jobs). This results in memory corruption and crash.
There already exist bypasses for REQ_FLUSH requests: REQ_FLUSH requests
do not belong to any region, so they are always added to the sync list
in do_writes. dm_rh_inc_pending does not increase count for REQ_FLUSH
requests. In mirror_end_io, dm_rh_dec is never called for REQ_FLUSH
requests. These bypasses avoid the crash possibility described above.
These bypasses were improperly implemented for REQ_DISCARD when
the mirror target gained discard support in commit 5fc2ffeabb9ee0fc0e71ff16b49f34f0ed3d05b4 (dm raid1: support discard).
In do_writes, REQ_DISCARD requests is always added to the sync queue and
immediately dispatched (even if the region is in DM_RH_RECOVERING). However,
dm_rh_inc and dm_rh_dec is called for REQ_DISCARD resusts. So it violates the
rule that no I/Os are started on DM_RH_RECOVERING regions, and causes the list
corruption described above.
This patch changes it so that REQ_DISCARD requests follow the same path
as REQ_FLUSH. This avoids the crash.
Boaz Harrosh [Thu, 7 Jun 2012 23:02:30 +0000 (02:02 +0300)]
pnfs-obj: Fix __r4w_get_page when offset is beyond i_size
It is very common for the end of the file to be unaligned on
stripe size. But since we know it's beyond file's end then
the XOR should be preformed with all zeros.
Old code used to just read zeros out of the OSD devices, which is a great
waist. But what scares me more about this situation is that, we now have
pages attached to the file's mapping that are beyond i_size. I don't
like the kind of bugs this calls for.
Fix both birds, by returning a global zero_page, if offset is beyond
i_size.
TODO:
Change the API to ->__r4w_get_page() so a NULL can be
returned without being considered as error, since XOR API
treats NULL entries as zero_pages.
[Bug since 3.2. Should apply the same way to all Kernels since] CC: Stable Tree <stable@kernel.org> Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
ore: Unlock r4w pages in exact reverse order of locking
The read-4-write pages are locked in address ascending order.
But where unlocked in a way easiest for coding. Fix that,
locks should be released in opposite order of locking, .i.e
descending address order.
I have not hit this dead-lock. It was found by inspecting the
dbug print-outs. I suspect there is an higher lock at caller that
protects us, but fix it regardless.
Boaz Harrosh [Fri, 8 Jun 2012 01:30:40 +0000 (04:30 +0300)]
ore: Remove support of partial IO request (NFS crash)
Do to OOM situations the ore might fail to allocate all resources
needed for IO of the full request. If some progress was possible
it would proceed with a partial/short request, for the sake of
forward progress.
Since this crashes NFS-core and exofs is just fine without it just
remove this contraption, and fail.
TODO:
Support real forward progress with some reserved allocations
of resources, such as mem pools and/or bio_sets
[Bug since 3.2 Kernel] CC: Stable Tree <stable@kernel.org> CC: Benny Halevy <bhalevy@tonian.com> Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
Boaz Harrosh [Thu, 7 Jun 2012 22:19:07 +0000 (01:19 +0300)]
ore: Fix NFS crash by supporting any unaligned RAID IO
In RAID_5/6 We used to not permit an IO that it's end
byte is not stripe_size aligned and spans more than one stripe.
.i.e the caller must check if after submission the actual
transferred bytes is shorter, and would need to resubmit
a new IO with the remainder.
Exofs supports this, and NFS was supposed to support this
as well with it's short write mechanism. But late testing has
exposed a CRASH when this is used with none-RPC layout-drivers.
The change at NFS is deep and risky, in it's place the fix
at ORE to lift the limitation is actually clean and simple.
So here it is below.
The principal here is that in the case of unaligned IO on
both ends, beginning and end, we will send two read requests
one like old code, before the calculation of the first stripe,
and also a new site, before the calculation of the last stripe.
If any "boundary" is aligned or the complete IO is within a single
stripe. we do a single read like before.
The code is clean and simple by splitting the old _read_4_write
into 3 even parts:
1._read_4_write_first_stripe
2. _read_4_write_last_stripe
3. _read_4_write_execute
And calling 1+3 at the same place as before. 2+3 before last
stripe, and in the case of all in a single stripe then 1+2+3
is preformed additively.
Why did I not think of it before. Well I had a strike of
genius because I have stared at this code for 2 years, and did
not find this simple solution, til today. Not that I did not try.
This solution is much better for NFS than the previous supposedly
solution because the short write was dealt with out-of-band after
IO_done, which would cause for a seeky IO pattern where as in here
we execute in order. At both solutions we do 2 separate reads, only
here we do it within a single IO request. (And actually combine two
writes into a single submission)
NFS/exofs code need not change since the ORE API communicates the new
shorter length on return, what will happen is that this case would not
occur anymore.
hurray!!
[Stable this is an NFS bug since 3.2 Kernel should apply cleanly] CC: Stable Tree <stable@kernel.org> Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
UBIFS has a feature called "empty space fix-up" which is a quirk to work-around
limitations of dumb flasher programs. Namely, of those flashers that are unable
to skip NAND pages full of 0xFFs while flashing, resulting in empty space at
the end of half-filled eraseblocks to be unusable for UBIFS. This feature is
relatively new (introduced in v3.0).
The fix-up routine (fixup_free_space()) is executed only once at the very first
mount if the superblock has the 'space_fixup' flag set (can be done with -F
option of mkfs.ubifs). It basically reads all the UBIFS data and metadata and
writes it back to the same LEB. The routine assumes the image is pristine and
does not have anything in the journal.
There was a bug in 'fixup_free_space()' where it fixed up the log incorrectly.
All but one LEB of the log of a pristine file-system are empty. And one
contains just a commit start node. And 'fixup_free_space()' just unmapped this
LEB, which resulted in wiping the commit start node. As a result, some users
were unable to mount the file-system next time with the following symptom:
UBIFS error (pid 1): replay_log_leb: first log node at LEB 3:0 is not CS node
UBIFS error (pid 1): replay_log_leb: log error detected while replaying the log at LEB 3:0
The root-cause of this bug was that 'fixup_free_space()' wrongly assumed
that the beginning of empty space in the log head (c->lhead_offs) was known
on mount. However, it is not the case - it was always 0. UBIFS does not store
in it the master node and finds out by scanning the log on every mount.
The fix is simple - just pass commit start node size instead of 0 to
'fixup_leb()'.
Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
Pull last minute Ceph fixes from Sage Weil:
"The important one fixes a bug in the socket failure handling behavior
that was turned up in some recent failure injection testing. The
other two are minor bug fixes."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
rbd: endian bug in rbd_req_cb()
rbd: Fix ceph_snap_context size calculation
libceph: fix messenger retry
Merge tag 'md-3.5-fixes' of git://neil.brown.name/md
Pull three md bugfixes from NeilBrown:
"One of the bugs was introduced in 3.5-rc1. Others have been there for
longer."
* tag 'md-3.5-fixes' of git://neil.brown.name/md:
md/raid1: close some possible races on write errors during resync
md: avoid crash when stopping md array races with closing other open fds.
md: fix bug in handling of new_data_offset
Pull networking changes from David Miller:
"Ok, we should be good to go now"
1) We have to statically initialize the init_net device list head rather
than do so in an initcall, otherwise netprio_cgroup crashes if it's
built statically rather than modular (Mark D. Rustad)
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net:
net: Statically initialize init_net.dev_base_head
MAINTAINERS: Changes in qlcnic and qlge maintainers list
cipso: don't follow a NULL pointer when setsockopt() is called
Merge branch 'upstream-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/hid
Pull HID update from Jiri Kosina:
"A final round of changes for HID for 3.5: just device ID additions."
* 'upstream-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/hid:
HID: hid-multitouch: add support for Zytronic panels
HID: add Sennheiser BTD500USB device support
HID: add battery quirk for Apple Wireless ANSI
The strcpy was being used to set the name of the board. Since the
destination char* was read-only and the name is set statically at
compile time; this was both wrong and redundant.
The type of char* is changed to const char* to prevent future errors.
Reported-by: Radek Masin <radek@masin.eu> Signed-off-by: Ezequiel Garcia <elezegarcia@gmail.com>
[ Taking directly due to vacations - Linus ] Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fixups are executed once the pci-device is found which is during boot
process so __init seems fine as long as the platform does not support
hotplug.
However it is possible to remove the PCI bus at run time and have it
rediscovered again via "echo 1 > /sys/bus/pci/rescan" and this will call
the fixups again.
[ralf@linux-mips.org: Made piixirqmap[] in malta_piix_func0_fixup()
__initdata.]
Signed-off-by: Sebastian Andrzej Siewior <sebastian@breakpoint.cc> Cc: linux-mips@linux-mips.org Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
CC arch/mips/kernel/machine_kexec.o
In file included from include/linux/kernel.h:20:0,
from include/asm-generic/bug.h:35,
from /home/yuasa/src/linux/kernel/git/linux-2.6/arch/mips/include/asm/bug.h:41,
from /home/yuasa/src/linux/kernel/git/linux-2.6/arch/mips/include/asm/bitops.h:20,
from include/linux/bitops.h:22,
from include/linux/signal.h:38,
from include/linux/elfcore.h:5,
from include/linux/kexec.h:60,
from arch/mips/kernel/machine_kexec.c:9:
include/linux/log2.h: In function '__ilog2_u32':
include/linux/log2.h:34:2: error: implicit declaration of function 'fls' [-Werror=implicit-function-declaration]
include/linux/log2.h: In function '__ilog2_u64':
include/linux/log2.h:42:2: error: implicit declaration of function 'fls64' [-Werror=implicit-function-declaration]
include/linux/log2.h: In function '__roundup_pow_of_two':
include/linux/log2.h:63:2: error: implicit declaration of function 'fls_long' [-Werror=implicit-function-declaration]
In file included from include/linux/bitops.h:22:0,
from include/linux/signal.h:38,
from include/linux/elfcore.h:5,
from include/linux/kexec.h:60,
from arch/mips/kernel/machine_kexec.c:9:
/home/yuasa/src/linux/kernel/git/linux-2.6/arch/mips/include/asm/bitops.h: At top level:
/home/yuasa/src/linux/kernel/git/linux-2.6/arch/mips/include/asm/bitops.h:615:19: error: static declaration of 'fls' follows non-static declaration
include/linux/log2.h:34:9: note: previous implicit declaration of 'fls' was here
In file included from /home/yuasa/src/linux/kernel/git/linux-2.6/arch/mips/include/asm/bitops.h:651:0,
from include/linux/bitops.h:22,
from include/linux/signal.h:38,
from include/linux/elfcore.h:5,
from include/linux/kexec.h:60,
from arch/mips/kernel/machine_kexec.c:9:
include/asm-generic/bitops/fls64.h:18:28: error: static declaration of 'fls64' follows non-static declaration
include/linux/log2.h:42:9: note: previous implicit declaration of 'fls64' was here
In file included from include/linux/signal.h:38:0,
from include/linux/elfcore.h:5,
from include/linux/kexec.h:60,
from arch/mips/kernel/machine_kexec.c:9:
include/linux/bitops.h:160:24: error: conflicting types for 'fls_long'
include/linux/log2.h:63:16: note: previous implicit declaration of 'fls_long' was here
cc1: all warnings being treated as errors