git.karo-electronics.de Git - linux-beck.git/log

target: NULL dereference on error path

During a failure in transport_add_device_to_core_hba() code, we called
destroy_workqueue(dev->tmr_wq) before ->tmr_wq was allocated which leads
to an oops.

This fixes a regression introduced in with:

commit af8772926f019b7bddd7477b8de5f3b0f12bad21
Author: Christoph Hellwig <hch@infradead.org>
Date: Sun Jul 8 15:58:49 2012 -0400

target: replace the processing thread with a TMR work queue

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>

target: Allow for target_submit_cmd() returning errors

We want it to be possible for target_submit_cmd() to return errors up
to its fabric module callers. For now just update the prototype to
return an int, and update all callers to handle non-zero return values
as an error.

This is immediately useful for tcm_qla2xxx to fix a long-standing active
I/O session shutdown race, but tcm_fc, usb-gadget, and sbp-target the
fabric maintainers need to check + ACK that handling a target_submit_cmd()
failure due to session shutdown does not introduce regressions

(nab: Respin against for-next after initial NACK + update docbook comment +
fix double se_cmd init in exception path for usb-gadget)

Cc: Chad Dupuis <chad.dupuis@qlogic.com>
Cc: Arun Easi <arun.easi@qlogic.com>
Cc: Chris Boot <bootc@bootc.net>
Cc: Stefan Richter <stefanr@s5r6.in-berlin.de>
Cc: Mark Rustad <mark.d.rustad@intel.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Felipe Balbi <balbi@ti.com>
Cc: Andy Grover <agrover@redhat.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>

target: Check number of unmap descriptors against our limit

Fail UNMAP commands that have more than our reported limit on unmap
descriptors.

Signed-off-by: Roland Dreier <roland@purestorage.com>
Cc: stable@vger.kernel.org
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>

target: Fix possible integer underflow in UNMAP emulation

It's possible for an initiator to send us an UNMAP command with a
descriptor that is less than 8 bytes; in that case it's really bad for
us to set an unsigned int to that value, subtract 8 from it, and then
use that as a limit for our loop (since the value will wrap around to
a huge positive value).

Fix this by making size be signed and only looping if size >= 16 (ie
if we have at least a full descriptor available).

Also remove offset as an obfuscated name for the constant 8.

Signed-off-by: Roland Dreier <roland@purestorage.com>
Cc: stable@vger.kernel.org
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>

target: Fix reading of data length fields for UNMAP commands

The UNMAP DATA LENGTH and UNMAP BLOCK DESCRIPTOR DATA LENGTH fields
are in the unmap descriptor (the payload transferred to our data out
buffer), not in the CDB itself. Read them from the correct place in
target_emulated_unmap.

Signed-off-by: Roland Dreier <roland@purestorage.com>
Cc: stable@vger.kernel.org
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>

target: Add range checking to UNMAP emulation

When processing an UNMAP command, we need to make sure that the number
of blocks we're asked to UNMAP does not exceed our reported maximum
number of blocks per UNMAP, and that the range of blocks we're
unmapping doesn't go past the end of the device.

Signed-off-by: Roland Dreier <roland@purestorage.com>
Cc: stable@vger.kernel.org
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>

target: Add generation of LOGICAL BLOCK ADDRESS OUT OF RANGE

Many SCSI commands are defined to return a CHECK CONDITION / ILLEGAL
REQUEST with ASC set to LOGICAL BLOCK ADDRESS OUT OF RANGE if the
initiator sends a command that accesses a too-big LBA. Add an enum
value and case entries so that target code can return this status.

Signed-off-by: Roland Dreier <roland@purestorage.com>
Cc: stable@vger.kernel.org
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>

target: Make unnecessarily global se_dev_align_max_sectors() static

Signed-off-by: Roland Dreier <roland@purestorage.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>

target: Remove se_session.sess_wait_list

Since we set se_session.sess_tearing_down and stop new commands from
being added to se_session.sess_cmd_list before we wait for commands to
finish when freeing a session, there's no need for a separate
sess_wait_list -- if we let new commands be added to sess_cmd_list
after setting sess_tearing_down, that would be a bug that breaks the
logic of waiting in-flight commands.

Also rename target_splice_sess_cmd_list() to
target_sess_cmd_list_set_waiting(), since we are no longer splicing
onto a separate list.

Signed-off-by: Roland Dreier <roland@purestorage.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>

qla2xxx: Remove racy, now-redundant check of sess_tearing_down

Now that target_submit_cmd() / target_get_sess_cmd() check
sess_tearing_down before adding commands to the list, we no longer
need the check in qlt_do_work(). In fact this check is racy anyway
(and that race is what inspired the change to add the check of
sess_tearing_down to the target core).

Cc: Chad Dupuis <chad.dupuis@qlogic.com>
Cc: Arun Easi <arun.easi@qlogic.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>

target: Check sess_tearing_down in target_get_sess_cmd()

Target core code assumes that target_splice_sess_cmd_list() has set
sess_tearing_down and moved the list of pending commands to
sess_wait_list, no more commands will be added to the session; if any
are added, nothing keeps the se_session from being freed while the
command is still in flight, which e.g. leads to use-after-free of
se_cmd->se_sess in target_release_cmd_kref().

To enforce this invariant, put a check of sess_tearing_down inside of
sess_cmd_lock in target_get_sess_cmd(); any checks before this are
racy and can lead to the use-after-free described above. For example,
the qla_target check in qlt_do_work() checks sess_tearing_down from
work thread context but then drops all locks before calling
target_submit_cmd() (as it must, since that is a sleeping function).

However, since no locks are held, anything can happen with respect to
the session it has looked up -- although it does correctly get
sess_kref within its lock, so the memory won't be freed while
target_submit_cmd() is actually running, nothing stops eg an ACL from
being dropped and calling ->shutdown_session() (which calls into
target_splice_sess_cmd_list()) before we get to target_get_sess_cmd().
Once this happens, the se_session memory can be freed as soon as
target_submit_cmd() returns and qlt_do_work() drops its reference,
even though we've just added a command to sess_cmd_list.

To prevent this use-after-free, check sess_tearing_down inside of
sess_cmd_lock right before target_get_sess_cmd() adds a command to
sess_cmd_list; this is synchronized with target_splice_sess_cmd_list()
so that every command is either waited for or not added to the queue.

(nab: Keep target_submit_cmd() returning void for now..)

Signed-off-by: Roland Dreier <roland@purestorage.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>

sbp-target: Consolidate duplicated error path code in sbp_handle_command()

Cc: Chris Boot <bootc@bootc.net>
Cc: Stefan Richter <stefanr@s5r6.in-berlin.de>
Signed-off-by: Roland Dreier <roland@purestorage.com>

target: Un-export target_get_sess_cmd()

There are no in-tree users of target_get_sess_cmd() outside of
target_core_transport.c. Any new code should use the higher-level
target_submit_cmd() interface. So let's un-export target_get_sess_cmd()
and make it static to the one file where it's actually used.

(nab: Fix up minor fuzz to for-next)

Signed-off-by: Roland Dreier <roland@purestorage.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>

qla2xxx: Get rid of redundant qla_tgt_sess.tearing_down

The only place that sets qla_tgt_sess.tearing_down calls
target_splice_sess_cmd_list() immediately afterwards, without dropping
the lock it holds. That function sets se_session.sess_tearing_down,
so we can get rid of the qla_target-specific flag, and in the one
place that looks at the qla_tgt_sess.tearing_down flag just test
se_session.sess_tearing_down instead.

Cc: Chad Dupuis <chad.dupuis@qlogic.com>
Cc: Arun Easi <arun.easi@qlogic.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>

target: Make core_disable_device_list_for_node use pre-refactoring lock ordering

So after kicking around commit 547ac4c9c90 around a bit more, a tcm_qla2xxx LUN
unlink OP has generated the following warning:

[   50.386625] qla2xxx [0000:07:00.0]-00af:0: Performing ISP error recovery - ha=ffff880263774000.
[   70.572988] qla2xxx [0000:07:00.0]-8038:0: Cable is unplugged...
[  126.527531] ------------[ cut here ]------------
[  126.532677] WARNING: at kernel/softirq.c:159 local_bh_enable_ip+0x41/0x8c()
[  126.540433] Hardware name: S5520HC
[  126.544248] Modules linked in: tcm_vhost ib_srpt ib_cm ib_sa ib_mad ib_core tcm_qla2xxx tcm_loop tcm_fc libfc iscsi_target_mod target_core_pscsi target_core_file target_core_iblock target_core_mod configfs ipv6 iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi dm_round_robin dm_multipath scsi_dh loop i2c_i801 kvm_intel kvm crc32c_intel i2c_core microcode joydev button iomemory_vsl(O) pcspkr ext3 jbd uhci_hcd lpfc ata_piix libata ehci_hcd qla2xxx mlx4_core scsi_transport_fc scsi_tgt igb [last unloaded: scsi_wait_scan]
[  126.595567] Pid: 3283, comm: unlink Tainted: G           O 3.5.0-rc2+ #33
[  126.603128] Call Trace:
[  126.605853]  [<ffffffff81026b91>] ? warn_slowpath_common+0x78/0x8c
[  126.612737]  [<ffffffff8102c342>] ? local_bh_enable_ip+0x41/0x8c
[  126.619433]  [<ffffffffa03582a2>] ? core_disable_device_list_for_node+0x70/0xe3 [target_core_mod]
[  126.629323]  [<ffffffffa035849f>] ? core_clear_lun_from_tpg+0x88/0xeb [target_core_mod]
[  126.638244]  [<ffffffffa0362ec1>] ? core_tpg_post_dellun+0x17/0x48 [target_core_mod]
[  126.646873]  [<ffffffffa03575ee>] ? core_dev_del_lun+0x26/0x8c [target_core_mod]
[  126.655114]  [<ffffffff810bcbd1>] ? dput+0x27/0x154
[  126.660549]  [<ffffffffa0359aa0>] ? target_fabric_port_unlink+0x3b/0x41 [target_core_mod]
[  126.669661]  [<ffffffffa034a698>] ? configfs_unlink+0xfc/0x14a [configfs]
[  126.677224]  [<ffffffff810b5979>] ? vfs_unlink+0x58/0xb7
[  126.683141]  [<ffffffff810b6ef3>] ? do_unlinkat+0xbb/0x142
[  126.689253]  [<ffffffff81330c75>] ? page_fault+0x25/0x30
[  126.695170]  [<ffffffff81335df9>] ? system_call_fastpath+0x16/0x1b
[  126.702053] ---[ end trace 2f8e5b0a9ec797ef ]---
[  126.756336] qla2xxx [0000:07:00.0]-00af:0: Performing ISP error recovery - ha=ffff880263774000.
[  146.942414] qla2xxx [0000:07:00.0]-8038:0: Cable is unplugged...

So this warning triggered because device_list disable logic is now
holding nacl->device_list_lock w/ spin_lock_irqsave before obtaining
port->sep_alua_lock with only spin_lock_bh..

The original disable logic obtains *deve ahead of dropping the entry
from deve->alua_port_list and then obtains ->device_list_lock to do the
remaining work.  Also, I'm pretty sure this particular warning is being
generated by a demo-mode session in tcm_qla2xxx, and not by explicit
NodeACL MappedLUNs.  The Initiator MappedLUNs are already protected by a
seperate configfs symlink reference back se_lun->lun_group, and the
demo-mode se_node_acl (and associated ->device_list[]) is released
during se_portal_group->tpg_group shutdown.

The following patch drops the extra functional change to disable logic
in commit 547ac4c9c90

Cc: Andy Grover <agrover@redhat.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>

target: refactor core_update_device_list_for_node()

Code was almost entirely divided based on value of bool param "enable".

Split it into two functions.

Signed-off-by: Andy Grover <agrover@redhat.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>

target: Eliminate else using boolean logic

Signed-off-by: Andy Grover <agrover@redhat.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>

target: Misc retval cleanups

Bubble-up retval from iscsi_update_param_value() and
iscsit_ta_authentication().

Other very small retval cleanups.

Signed-off-by: Andy Grover <agrover@redhat.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>

target: Remove hba param from core_dev_add_lun

Only used in a debugprint, and function signature is cleaner now.

Signed-off-by: Andy Grover <agrover@redhat.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>

target: Remove unneeded double parentheses

Signed-off-by: Andy Grover <agrover@redhat.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>

target: replace the processing thread with a TMR work queue

The last functionality of the target processing thread is offloading possibly
long running task management requests from the submitter context. To keep
TMR semantics the same we need a single threaded ordered queue, which can
be provided by a per-device workqueue with the right flags.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>

target: remove transport_generic_handle_cdb_map

Remove this command submission path which is not used by any in-tree driver.
This also removes the now unused new_cmd_map fabtric method, which a few
drivers implemented despite never calling transport_generic_handle_cdb_map.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>

target: simply fabric driver queue full processing

There is no need to schedule the delayed processing in a workqueue that
offloads it to the target processing thread. Instead execute it directly
from the workqueue. There will be a lot of future work in this area,
which I'd likfe to defer for now as it is not nessecary for getting rid
of the target processing thread.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>

target: remove transport_generic_handle_data

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>

tcm_fc: Offload WRITE I/O backend submission to tpg workqueue

Defer the write processing to the internal to be able to use
target_execute_cmd. I'm not even entirely sure the calling code requires
this due to the convoluted structure in libfc, but let's be safe for now.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Cc: Mark Rustad <mark.d.rustad@intel.com>
Cc: Kiran Patil <Kiran.patil@intel.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>

tcm_qla2xxx: Offload WRITE I/O backend submission to tcm_qla2xxx wq

Defer the whole tcm_qla2xxx_handle_data call instead of just the error
path to the qla2xxx-internal workqueue. Also remove the useless lock around
the CMD_T_ABORTED check.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Cc: Roland Dreier <roland@purestorage.com>
Cc: Giridhar Malavali <giridhar.malavali@qlogic.com>
Cc: tcm-qla2xxx@qlogic.com
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>

srpt: use target_execute_cmd for WRITEs in srpt_handle_rdma_comp

srpt_handle_rdma_comp is called from kthread context and thus can execute
target_execute_cmd directly. srpt_abort_cmd sets the CMD_T_LUN_STOP
flag directly, and thus the abuse of transport_generic_handle_data can be
replaced with an opencoded variant of that code path. I'm still not happy
about a fabric driver poking into target core internals like this, but
let's defer the bigger architecture changes for now.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>

iscsit: use target_execute_cmd for WRITEs

All three callers of transport_generic_handle_data are from user context
and can use target_execute_cmd directly to handle the backend I/O submission
of WRITE I/O.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Cc: Andy Grover <agrover@redhat.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>

target: merge transport_generic_write_pending into transport_generic_new_cmd

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>

target: call transport_check_aborted_status from target_execute_cmd

When we call target_execute_cmd for write commands the command has been
on the state list before an abort might have come in before
target_execute_cmd. Call transport_check_aborted_status to deal with
this case.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>

target: remove transport_generic_process_write

Just call target_execute_cmd directly. Also, convert loopback, sbp,
usb-gadget to use the newly exported target_execute_cmd().

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>

target: split transport_cmd_check_stop

Inline the transport_off == 0 case into target_execute_cmd to simplify
the function for the remaining cases.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>

tcm_qla2xxx: Remove duplicate header file inclusion

ctype.h and string.h header files were included more than once.

Signed-off-by: Sachin Kamat <sachin.kamat@linaro.org>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>

Revert "target: Do not special-case loop and iscsi fabric module loads"

Existing lio_dump.py code expects this to be in place for /iscsi.

Revert for now to avoid userspace breakage in lio-utils

This reverts commit fd88a785f9ac5d6be437c528571ccd85cdf2d493.

Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>

target: move unmap to struct spc_ops

Having all the unmap payload parsing in the backed is a bit ugly, but until
more drivers support it and we can find a good interface for all of them
that seems the way to go.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>

target: move write_same to struct spc_ops

Add spc_ops->execute_write_same() caller for ->execute_cmd() setup,
and update IBLOCK backends to use it.

(nab: add export of spc_get_write_same_sectors symbol)
(roland: Carry forward: Fix range calculation in WRITE SAME emulation
when num blocks == 0)

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>

target: move sync_cache to struct spc_ops

Add spc_ops->execute_sync_cache() caller for ->execute_cmd() setup,
and update IBLOCK + FILEIO backends to use it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>

target: add struct spc_ops + initial ->execute_rw pointer usage

Remove the execute_cmd method in struct se_subsystem_api, and always use the
one directly in struct se_cmd. To make life simpler for SBC virtual backends
a struct spc_ops that is passed to sbc_parse_cmd is added. For now it
only contains an execute_rw member, but more will follow with the subsequent
commits.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>

target: remove dead SCF_ flags

Remove the dead SCF_SE_ALLOW_EOO and SCF_DELAYED_CMD_FROM_SAM_ATTR
from se_cmd_flags_table.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>

target/iscsi: Remove dead code in lio_get_tpg_from_tpg_item()

It's got no callers...

Signed-off-by: Roland Dreier <roland@purestorage.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>

target/iblock: Add parameter to specify read-only devices

see https://bugzilla.redhat.com/show_bug.cgi?id=818855

Adds a parameter so read-only block devices may be registered as
LIO backstores.

Signed-off-by: Andy Grover <agrover@redhat.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>

target: Do not special-case loop and iscsi fabric module loads

These modules, along with other fabrics, should be loaded as-needed by
the LIO userspace tools.

Signed-off-by: Andy Grover <agrover@redhat.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>

target: move ref_cmd from the generic se_tmr_req into iscsi code

Also remove the unused ref_task_lun field in struct se_tmr_req.

(nab: Add missing TASK_REASSIGN ref_lun vs. ref_cmd orig_fe_lun checks
in iscsit_tmr_task_reassign)

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>

target: remove the execute list

Since "target: Drop se_device TCQ queue_depth usage from I/O path" we always
submit all commands (or back then, tasks) from __transport_execute_tasks.

That means the the execute list has lots its purpose, as we can simply
submit the commands that are restarted in transport_complete_task_attr
directly while we walk the list. In fact doing so also solves a race
in the way it currently walks to delayed_cmd_list as well.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>

target/pscsi: Only emulate REPORT_LUNS for passthrough

This patch changes back the pSCSI backend to follow pre 3.6-queue code to
passthrough SPC-3 persistent reservations + SPC-2 legacy reservation
handling to the underlying LLD / physical hardware.

For folks who really need this for their own SPC-3 emulation logic, avoid
changing the functionality of this beyond what is exported for REPORT_LUNS
for existing code, and to avoid problems with SPC-3 PR/ALUA as INQUIRY
EVPD=0x83 emulation needs to be in place in order for this to work as
expected with spc_parse_cdb() code..

Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>

target: Move MAINTENANCE_[IN,OUT] from pscsi_parse_cdb -> spc_parse_cdb

The MAINTENANCE_[IN,OUT] CDB parsing required for generic ALUA emulation
needs to be in spc_parse_cdb() to function for virtual TYPE_DISK exports,
instead of in backend pscsi_parse_cdb() code used only for passthrough ops.

Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>

target: move transport_generic_prepare_cdb into pscsi

The virtual drivers don't need to clear cdb fields they never look at, so move
this code into the pscsi backend.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>

target: move code for CDB emulation

Move the existing code in target_core_cdb.c into the files for the command
sets that the emulations implement.

(roland + nab: Squash patch: Fix range calculation in WRITE SAME emulation
when num blocks == 0s)

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>

target: add a parse_cdb method to the backend drivers

Instead of trying to handle all SCSI command sets in one function
(transport_generic_cmd_sequencer) call out to the backend driver to perform
this functionality. For pSCSI a copy of the existing code is used, but for
all virtual backends we can use a new parse_sbc_cdb helper is used to
provide a simple SBC emulation.

For now this setups means a fair amount of duplication between pSCSI and the
SBC library, but patches later in this series will sort out that problem.

(nab: Fix up build failure in target_core_pscsi.c)

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>

target: split parsing of SPC commands into a separate helper

(nab: Add EXPORT_SYMBOL usage for spc_parse_cdb)

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>

target: split overflow and underflow checks into a helper

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>

target: remove control CDB flags

We don't need three flags to classifiy the CDB as we can check for a NULL S/G
list for a dataless command, and can infer from the absence of the data flag
that we deal with a control CDB. Also remove the _SG_IO from the data CDB
flag as all I/O is dont on S/G lists now.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>

target: move unrelated code out of transport_generic_cmd_sequencer

Move all code not related to cdb parsing from transport_generic_cmd_sequencer
into target_setup_cmd_from_cdb.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>

target: Fix range calculation in WRITE SAME emulation when num blocks == 0

When NUMBER OF LOGICAL BLOCKS is 0, WRITE SAME is supposed to write
all the blocks from the specified LBA through the end of the device.
However, dev->transport->get_blocks(dev) (perhaps confusingly) returns
the last valid LBA rather than the number of blocks, so the correct
number of blocks to write starting with lba is

dev->transport->get_blocks(dev) - lba + 1

(nab: Backport roland's for-3.6 patch to for-3.5)

Signed-off-by: Roland Dreier <roland@purestorage.com>
Cc: Cc: <stable@vger.kernel.org>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>

target: Clean up returning errors in PR handling code

- instead of (PTR_ERR(file) < 0) just use IS_ERR(file)
- return -EINVAL instead of EINVAL
- all other error returns in target_scsi3_emulate_pr_out() use
"goto out" -- get rid of the one remaining straight "return."

Signed-off-by: Roland Dreier <roland@purestorage.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>

tcm_fc: Fix crash seen with aborts and large reads

This patch fixes a crash seen when large reads have their exchange
aborted by either timing out or being reset. Because the exchange
abort results in the seq pointer being set to NULL, because the
sequence is no longer valid, it must not be dereferenced. This
patch changes the function ft_get_task_tag to return ~0 if it is
unable to get the tag for this reason. Because the get_task_tag
interface provides no means of returning an error, this seems
like the best way to fix this issue at the moment.

Signed-off-by: Mark Rustad <mark.d.rustad@intel.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>

qla2xxx: print the right array elements in qlt_async_event

Based upon Alan's patch from Coverity scan id 793583, these debug
messages in qlt_async_event() should be starting from byte 0, which is
always the Asynchronous Event Status Code from the parent switch statement.

Also, rename reason_code -> login_code following the language used in
2500 FW spec for Port Database Changed (0x8014) -> Port Database Changed
Event Mailbox Register for mailbox[2].

Signed-off-by: Alan Cox <alan@linux.intel.com>
Cc: Chad Dupuis <chad.dupuis@qlogic.com>
Cc: Giridhar Malavali <giridhar.malavali@qlogic.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>

tcm_fc: Resolve suspicious RCU usage warnings

Use rcu_dereference_protected to tell rcu that the ft_lport_lock
is held during ft_lport_create. This resolved "suspicious RCU usage"
warnings when debugging options are turned on.

Signed-off-by: Mark Rustad <mark.d.rustad@intel.com>
Tested-by: Ross Brattain <ross.b.brattain@intel.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>

qla2xxx: Remove version.h header file inclusion

version.h header file is no longer required for qla_target code.

Signed-off-by: Sachin Kamat <sachin.kamat@linaro.org>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>

tcm_qla2xxx: Handle malformed wwn strings properly

If we make a variable an unsigned int and then expect it to be < 0 on
a bad character, we're going to have a bad time.  Fix the tcm_qla2xxx
code to actually notice if hex_to_bin() returns a negative variable.

This was detected by the compiler warning:

    scsi/qla2xxx/tcm_qla2xxx.c: In function ‘tcm_qla2xxx_npiv_extract_wwn’:
    scsi/qla2xxx/tcm_qla2xxx.c:148:3: warning: comparison of unsigned expression >= 0 is always true [-Wtype-limits]

Signed-off-by: Roland Dreier <roland@purestorage.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>

tcm_qla2xxx: tcm_qla2xxx_handle_tmr() can be static

Signed-off-by: Roland Dreier <roland@purestorage.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>

qla2xxx: Don't leak commands we give up on in qlt_do_work()

If we go to the "out_term:" exit path in qlt_do_work(), we call
qlt_send_term_exchange() with a NULL cmd, which means that it can't
possibly free the cmd for us. Add an explicit call to free the
command memory, so we don't leak the allocation.

This will also fix warnings about "BUG qla_tgt_cmd_cachep: Objects
remaining on kmem_cache_close" from slub when unloading the qla2xxx
target module.

Signed-off-by: Roland Dreier <roland@purestorage.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>

qla2xxx: Don't crash if we can't find cmd for failed CTIO

In qlt_do_ctio_completion(), there's no point in calling
qlt_term_ctio_exchange() with a NULL cmd -- all that it does is crash
in a NULL pointer dereference, since it does

qlt_send_term_exchange(vha, cmd, &cmd->atio, 1);

and dereferencing &cmd->atio is a bad idea if cmd itself is NULL.

If we really need to do this, we could take the values from the
failed CTIO we're processing, but it's not clear if it's worth
the replumbing to do that.

Signed-off-by: Roland Dreier <roland@purestorage.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>

tcm_qla2xxx: Don't insert nacls without sessions into the btree

When we create an explicit node ACL in tcm_qla2xxx_make_nodeacl(),
there is a call to tcm_qla2xxx_setup_nacl_from_rport(), which puts the
node ACL into the lport_fcport_map even though there is no session yet
for the initiator. Since the only time we remove entries from this
map is when we free a session, this means that if we later delete this
node ACL without the initiator ever creating a session, we'll leave
the nacl pointer in the btree pointing at freed memory.

This is especially bad if that initiator later does send us a command
that would cause us to create a dynamic ACL and session: we'll find
the stale freed nacl pointer in the btree and end up with use-after-free.

We could add more code to clear the btree entry when deleting the
explicit nacl, but the original insertion is pointless: without a
session attached, we'll just have to update the entry when a session
appears anyway. So we can just delete tcm_qla2xxx_setup_nacl_from_rport()
and the code that calls it.

Signed-off-by: Roland Dreier <roland@purestorage.com>
Cc: Chad Dupuis <chad.dupuis@qlogic.com>
Cc: Giridhar Malavali <giridhar.malavali@qlogic.com>
Cc: Arun Easi <arun.easi@qlogic.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>

target: Return error to initiator if SET TARGET PORT GROUPS emulation fails

The error paths in target_emulate_set_target_port_groups() are all
essentially "rc = -EINVAL; goto out;" but the code at "out:" ignores
rc and always returns success.  This means that even if eg explicit
ALUA is turned off, the initiator will always see a good SCSI status
for SET TARGET PORT GROUPS.

Fix this by returning rc as is intended.  It appears this bug was
added by the following patch:

commit 05d1c7c0d0db4cc25548d9aadebb416888a82327
Author: Andy Grover <agrover@redhat.com>
Date:   Wed Jul 20 19:13:28 2011 +0000

    target: Make all control CDBs scatter-gather

Signed-off-by: Roland Dreier <roland@purestorage.com>
Cc: Andy Grover <agrover@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>

tcm_qla2xxx: Clear session s_id + loop_id earlier during shutdown

This patch adds a new tcm_qla2xxx_clear_sess_lookup() call to clear session
specific s_id + loop_id entries used for se_node_acl pointer lookup ahead
of releasing se_session within the process context workqueue callback in
tcm_qla2xxx_free_session().

It makes the call in existing tcm_qla2xxx_clear_nacl_from_fcport_map()
code invoked from qlt_unreg_sess() in interrupt context w/ hardware_lock
held, ahead of the process context callback into qlt_free_session_done()
-> tcm_qla2xxx_free_session().

We are doing this to address a race between incoming ATIO or TMR packets
using stale se_node_acl pointer once session shutdown has been invoked via
qlt_unreg_sess() in qla_target.c LLD code, and when the entire tcm_qla2xxx
endpoint has not been forced into shutdown w/ echo 0 > ../$QLA2XXX_PORT/enable

Cc: Joern Engel <joern@logfs.org>
Cc: Roland Dreier <roland@purestorage.com>
Cc: Arun Easi <arun.easi@qlogic.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>

tcm_qla2xxx: Convert to TFO->put_session() usage

This patch converts tcm_qla2xxx code to use an internal kref_put() for
se_session->sess_kref in order to ensure that qla_hw_data->hardware_lock
can be held while calling qlt_unreg_sess() for the final put.

Signed-off-by: Joern Engel <joern@logfs.org>
Cc: Roland Dreier <roland@purestorage.com>
Cc: Arun Easi <arun.easi@qlogic.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>

target: Add TFO->put_session() caller for HW fabric session shutdown

This patch adds an optional target_core_fabric_ops->put_session() caller
within the existing target_put_session() code path.

This is required by tcm_qla2xxx code in order to invoke it's own fabric
specific session shutdown handler using se_session->sess_kref.

Signed-off-by: Joern Engel <joern@logfs.org>
Cc: Roland Dreier <roland@purestorage.com>
Cc: Arun Easi <arun.easi@qlogic.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>

Linux 3.5-rc2

mm, oom: fix badness score underflow

If the privileges given to root threads (3% of allowable memory) or a
negative value of /proc/pid/oom_score_adj happen to exceed the amount of
rss of a thread, its badness score overflows as a result of commit
a7f638f999ff ("mm, oom: normalize oom scores to oom_score_adj scale only
for userspace").

Fix this by making the type signed and return 1, meaning the thread is
still eligible for kill, if the value is negative.

Reported-by: Dave Jones <davej@redhat.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler fixes from Ingo Molnar.

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched: Fix the relax_domain_level boot parameter
  sched: Validate assumptions in sched_init_numa()
  sched: Always initialize cpu-power
  sched: Fix domain iteration
  sched/rt: Fix lockdep annotation within find_lock_lowest_rq()
  sched/numa: Load balance between remote nodes
  sched/x86: Calculate booted cores after construction of sibling_mask

sched/fair: fix lots of kernel-doc warnings

Fix lots of new kernel-doc warnings in kernel/sched/fair.c:

  Warning(kernel/sched/fair.c:3625): No description found for parameter 'env'
  Warning(kernel/sched/fair.c:3625): Excess function parameter 'sd' description in 'update_sg_lb_stats'
  Warning(kernel/sched/fair.c:3735): No description found for parameter 'env'
  Warning(kernel/sched/fair.c:3735): Excess function parameter 'sd' description in 'update_sd_pick_busiest'
  Warning(kernel/sched/fair.c:3735): Excess function parameter 'this_cpu' description in 'update_sd_pick_busiest'
  .. more warnings

Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Revert "drm/i915/crt: Do not rely upon the HPD presence pin"

This reverts commit 9e612a008fa7fe493a473454def56aa321479495.

It incorrectly finds VGA connectors where none are attached, apparently
not noticing that nothing replied to the EDID queries, and happily using
the default EDID modes that have nothing to do with actual hardware.

That in turn then causes X to fall down to the lowest common
denominator, which is usually the default 1024x768 mode that is in the
default EDID and pretty much anything supports).

I'd suggest that if not relying on the HDP pin, the code should at least
check whether it gets valid EDID data back, rather than just assume
there's something on the VGA connector.

Cc: Dave Airlie <airlied@linux.ie>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 bug fixes from Theodore Ts'o:
"This update contains two bug fixes, both destined for the stable tree.
  Perhaps the most important is one which fixes ext4 when used with file
  systems originally formatted for use with ext3, but then later
  converted to take advantage of ext4."

* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: don't set i_flags in EXT4_IOC_SETFLAGS
  ext4: fix the free blocks calculation for ext3 file systems w/ uninit_bg

Merge branch 'merge' of git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc

Pull powerpc fixes from Paul Mackerras:
"Two small fixes for powerpc:
   - a fix for a regression since 3.2 that causes 4-second (or longer)
     pauses
   - a fix for a potential oops when loading kernel modules on 32-bit
     embedded systems."

* 'merge' of git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc:
  powerpc: Fix kernel panic during kernel module load
  powerpc/time: Sanity check of decrementer expiration is necessary

Merge tag 'upstream-3.5-rc2' of git://git.infradead.org/linux-ubifs

Pull UBI/UBIFS fixes from Artem Bityutskiy:
"Fix UBI and UBIFS - they refuse to work without debugfs.  This was
  broken by the 3.5-rc1 UBI/UBIFS changes when we removed the debugging
  Kconfig switches.

  Also, correct locking in 'ubi_wl_flush()' - it was extended to support
  flushing a specific LEB in 3.5-rc1, and the locking was sub-optimal."

* tag 'upstream-3.5-rc2' of git://git.infradead.org/linux-ubifs:
  UBI: correct ubi_wl_flush locking
  UBIFS: fix debugfs-less systems support
  UBI: fix debugfs-less systems support

Revert "vfs: stop d_splice_alias creating directory aliases"

This reverts commit 7732a557b1342c6e6966efb5f07effcf99f56167 (and commit
3f50fff4dace23d3cfeb195d5cd4ee813cee68b7, which was a follow-up
cleanup).

We're chasing an elusive bug that Dave Jones can apparently reproduce
using his system call fuzzer tool, and that looks like some kind of
locking ordering problem on the directory i_mutex chain.  Our i_mutex
locking is rather complex, and depends on the topological ordering of
the directories, which is why we have been very wary of splicing
directory entries around.

Of course, we really don't want to ever see aliased unconnected
directories anyway, so none of this should ever happen, but this revert
aims to basically get us back to a known older state.

Bruce points to some of the previous discussion at

       http://marc.info/?i=<20110310105821.GE22723@ZenIV.linux.org.uk>

and in particular a long post from Neil:

       http://marc.info/?i=<20110311150749.2fa2be66@notabene.brown>

It should be noted that it's possible that Dave's problems come from
other changes altohgether, including possibly just the fact that Dave
constantly is teachning his fuzzer new tricks.  So what appears to be a
new bug could in fact be an old one that just gets newly triggered, but
reverting these patches as "still under heavy discussion" is the right
thing regardless.

Requested-by: Al Viro <viro@zeniv.linux.org.uk>
Acked-by: J. Bruce Fields <bfields@fieldses.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 fixes from Ingo Molnar.

* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/nmi: Fix section mismatch warnings on 32-bit
  x86/uv: Fix UV2 BAU legacy mode
  x86/mm: Only add extra pages count for the first memory range during pre-allocation early page table space
  x86, efi stub: Add .reloc section back into image
  x86/ioapic: Fix NULL pointer dereference on CPU hotplug after disabling irqs
  x86/reboot: Fix a warning message triggered by stop_other_cpus()
  x86/intel/moorestown: Change intel_scu_devices_create() to __devinit
  x86/numa: Set numa_nodes_parsed at acpi_numa_memory_affinity_init()
  x86/gart: Fix kmemleak warning
  x86: mce: Add the dropped timer interval init back
  x86/mce: Fix the MCE poll timer logic

Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull perf fixes from Ingo Molnar:
"A bit larger than what I'd wish for - half of it is due to hw driver
  updates to Intel Ivy-Bridge which info got recently released,
  cycles:pp should work there now too, amongst other things.  (but we
  are generally making exceptions for hardware enablement of this type.)

  There are also callchain fixes in it - responding to mostly
  theoretical (but valid) concerns.  The tooling side sports perf.data
  endianness/portability fixes which did not make it for the merge
  window - and various other fixes as well."

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (26 commits)
  perf/x86: Check user address explicitly in copy_from_user_nmi()
  perf/x86: Check if user fp is valid
  perf: Limit callchains to 127
  perf/x86: Allow multiple stacks
  perf/x86: Update SNB PEBS constraints
  perf/x86: Enable/Add IvyBridge hardware support
  perf/x86: Implement cycles:p for SNB/IVB
  perf/x86: Fix Intel shared extra MSR allocation
  x86/decoder: Fix bsr/bsf/jmpe decoding with operand-size prefix
  perf: Remove duplicate invocation on perf_event_for_each
  perf uprobes: Remove unnecessary check before strlist__delete
  perf symbols: Check for valid dso before creating map
  perf evsel: Fix 32 bit values endianity swap for sample_id_all header
  perf session: Handle endianity swap on sample_id_all header data
  perf symbols: Handle different endians properly during symbol load
  perf evlist: Pass third argument to ioctl explicitly
  perf tools: Update ioctl documentation for PERF_IOC_FLAG_GROUP
  perf tools: Make --version show kernel version instead of pull req tag
  perf tools: Check if callchain is corrupted
  perf callchain: Make callchain cursors TLS
  ...

Merge branch 'drm-fixes' of git://people.freedesktop.org/~airlied/linux

Pull drm intel and exynos fixes from Dave Airlie:
"A bunch of fixes for Intel and exynos, nothing too major, a new intel
  PCI ID, and a fix for CRT detection."

* 'drm-fixes' of git://people.freedesktop.org/~airlied/linux:
  drm/i915: pch_irq_handler -> {ibx, cpt}_irq_handler
  char/agp: add another Ironlake host bridge
  drm/i915: fix up ivb plane 3 pageflips
  drm/exynos: fixed blending for hdmi graphic layer
  drm/exynos: Remove dummy encoder get_crtc operation implementation
  drm/exynos: Keep a reference to frame buffer GEM objects
  drm/exynos: Don't cast GEM object to Exynos GEM object when not needed
  drm/exynos: DRIVER_BUS_PLATFORM is not a driver feature
  drm/exynos: fixed size type.
  drm/exynos: Use DRM_FORMAT_{NV12, YUV420} instead of DRM_FORMAT_{NV12M, YUV420M}
  drm/i915: hold forcewake around ring hw init
  drm/i915: Mark the ringbuffers as being in the GTT domain
  drm/i915/crt: Do not rely upon the HPD presence pin
  drm/i915: Reset last_retired_head when resetting ring

Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull leap second timer fix from Thomas Gleixner.

* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
timekeeping: Fix CLOCK_MONOTONIC inconsistency during leapsecond

Merge tag 'moduleparam-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-for-linus

Pull minor module param fixes from Rusty Russell:
"One bugfix for multiple moduleparam levels, one removal of overzealous
  printk."

* tag 'moduleparam-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-for-linus:
  init: Drop initcall level output
  module_param: stop double-calling parameters.

x86/nmi: Fix section mismatch warnings on 32-bit

It was reported that compiling for 32-bit caused a bunch of
section mismatch warnings:

VDSOSYM arch/x86/vdso/vdso32-syms.lds
  LD      arch/x86/vdso/built-in.o
  LD      arch/x86/built-in.o

WARNING: arch/x86/built-in.o(.data+0x5af0): Section mismatch in
reference from the variable test_nmi_ipi_callback_na.10451 to
the function .init.text:test_nmi_ipi_callback() [...]

WARNING: arch/x86/built-in.o(.data+0x5b04): Section mismatch in
reference from the variable nmi_unk_cb_na.10399 to the function
.init.text:nmi_unk_cb() The variable nmi_unk_cb_na.10399
references the function __init nmi_unk_cb() [...]

Both of these are attributed to the internal representation of
the nmiaction struct created during register_nmi_handler.  The
reason for this is that those structs are not defined in the
init section whereas the rest of the code in nmi_selftest.c is.

To resolve this, I created a new #define,
register_nmi_handler_initonly, that tags the struct as
__initdata to resolve the mismatch.  This #define should only be
used in rare situations where the register/unregister is called
during init of the kernel.

Big thanks to Jan Beulich for decoding this for me as I didn't
have a clue what was going on.

Reported-by: Witold Baryluk <baryluk@smp.if.uj.edu.pl>
Tested-by: Witold Baryluk <baryluk@smp.if.uj.edu.pl>
Cc: Jan Beulich <JBeulich@suse.com>
Signed-off-by: Don Zickus <dzickus@redhat.com>
Link: http://lkml.kernel.org/r/1338991542-23000-1-git-send-email-dzickus@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>

powerpc: Fix kernel panic during kernel module load

This fixes a problem which can causes kernel oopses while loading
a kernel module.

According to the PowerPC EABI specification, GPR r11 is assigned
the dedicated function to point to the previous stack frame.
In the powerpc-specific kernel module loader, do_plt_call()
(in arch/powerpc/kernel/module_32.c), GPR r11 is also used
to generate trampoline code.

This combination crashes the kernel, in the case where the compiler
chooses to use a helper function for saving GPRs on entry, and the
module loader has placed the .init.text section far away from the
.text section, meaning that it has to generate a trampoline for
functions in the .init.text section to call the GPR save helper.
Because the trampoline trashes r11, references to the stack frame
using r11 can cause an oops.

The fix just uses GPR r12 instead of GPR r11 for generating the
trampoline code. According to the statements from Freescale, this is
safe from an EABI perspective.

I've tested the fix for kernel 2.6.33 on MPC8541.

Cc: stable@vger.kernel.org
Signed-off-by: Steffen Rumler <steffen.rumler.ext@nsn.com>
[paulus@samba.org: reworded the description]
Signed-off-by: Paul Mackerras <paulus@samba.org>

x86/uv: Fix UV2 BAU legacy mode

The SGI Altix UV2 BAU (Broadcast Assist Unit) as used for
tlb-shootdown (selective broadcast mode) always uses UV2
broadcast descriptor format. There is no need to clear the
'legacy' (UV1) mode, because the hardware always uses UV2 mode
for selective broadcast.

But the BIOS uses general broadcast and legacy mode, and the
hardware pays attention to the legacy mode bit for general
broadcast. So the kernel must not clear that mode bit.

Signed-off-by: Cliff Wickman <cpw@sgi.com>
Cc: <stable@kernel.org>
Link: http://lkml.kernel.org/r/E1SccoO-0002Lh-Cb@eag09.americas.sgi.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>

x86/mm: Only add extra pages count for the first memory range during pre-allocation early page table space

Robin found this regression:

| I just tried to boot an 8TB system. It fails very early in boot with:
| Kernel panic - not syncing: Cannot find space for the kernel page tables

git bisect commit 722bc6b16771ed80871e1fd81c86d3627dda2ac8.

A git revert of that commit does boot past that point on the 8TB
configuration.

That commit will add up extra pages for all memory range even
above 4g.

Try to limit that extra page count adding to first entry only.

Bisected-by: Robin Holt <holt@sgi.com>
Tested-by: Robin Holt <holt@sgi.com>
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Cc: WANG Cong <xiyou.wangcong@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/CAE9FiQUj3wyzQxtq9yzBNc9u220p8JZ1FYHG7t%3DMOzJ%3D9BZMYA@mail.gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>

Merge branch 'exynos-drm-fixes' of git://git.infradead.org/users/kmpark/linux-samsung into drm-fixes

* 'exynos-drm-fixes' of git://git.infradead.org/users/kmpark/linux-samsung:
  drm/exynos: fixed blending for hdmi graphic layer
  drm/exynos: Remove dummy encoder get_crtc operation implementation
  drm/exynos: Keep a reference to frame buffer GEM objects
  drm/exynos: Don't cast GEM object to Exynos GEM object when not needed
  drm/exynos: DRIVER_BUS_PLATFORM is not a driver feature
  drm/exynos: fixed size type.
  drm/exynos: Use DRM_FORMAT_{NV12, YUV420} instead of DRM_FORMAT_{NV12M, YUV420M}

Merge branch 'drm-intel-fixes' of git://people.freedesktop.org/~danvet/drm-intel into drm-fixes

* 'drm-intel-fixes' of git://people.freedesktop.org/~danvet/drm-intel:
  drm/i915: pch_irq_handler -> {ibx, cpt}_irq_handler
  char/agp: add another Ironlake host bridge
  drm/i915: fix up ivb plane 3 pageflips
  drm/i915: hold forcewake around ring hw init
  drm/i915: Mark the ringbuffers as being in the GTT domain
  drm/i915/crt: Do not rely upon the HPD presence pin
  drm/i915: Reset last_retired_head when resetting ring

init: Drop initcall level output

9fb48c744ba6a ("params: add 3rd arg to option handler callback
signature") added similar lines to dmesg:

initlevel:0=early, 4 registered initcalls
initlevel:1=core, 31 registered initcalls
initlevel:2=postcore, 11 registered initcalls
initlevel:3=arch, 7 registered initcalls
initlevel:4=subsys, 40 registered initcalls
initlevel:5=fs, 30 registered initcalls
initlevel:6=device, 250 registered initcalls
initlevel:7=late, 35 registered initcalls

but they don't contain any info for the general user staring at dmesg.
I'm very doubtful the count of initcalls registered per level helps
anyone so drop that output completely.

Cc: Jim Cromie <jim.cromie@gmail.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Jason Baron <jbaron@redhat.com>
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>

module_param: stop double-calling parameters.

Commit 026cee0086fe1df4cf74691cf273062cc769617d "params:
<level>_initcall-like kernel parameters" set old-style module
parameters to level 0. And we call those level 0 calls where we used
to, early in start_kernel().

We also loop through the initcall levels and call the levelled
module_params before the corresponding initcall. Unfortunately level
0 is early_init(), so we call the standard module_param calls twice.

(Turns out most things don't care, but at least ubi.mtd does).

Change the level to -1 for standard module_param calls.

Reported-by: Benoît Thébaudeau <benoit.thebaudeau@advansee.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: stable@kernel.org

powerpc/time: Sanity check of decrementer expiration is necessary

This reverts 68568add2c ("powerpc/time: Remove unnecessary sanity check
of decrementer expiration").  We do need to check whether we have reached
the expiration time of the next event, because we sometimes get an early
decrementer interrupt, most notably when we set the decrementer to 1 in
arch_irq_work_raise().  The effect of not having the sanity check is that
if timer_interrupt() gets called early, we leave the decrementer set to
its maximum value, which means we then don't get any more decrementer
interrupts for about 4 seconds (or longer, depending on timebase
frequency).  I saw these pauses as a consequence of getting a stray
hypervisor decrementer interrupt left over from exiting a KVM guest.

This isn't quite a straight revert because of changes to the surrounding
code, but it restores the same algorithm as was previously used.

Cc: stable@vger.kernel.org
Acked-by: Anton Blanchard <anton@samba.org>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>

Revert "mm: correctly synchronize rss-counters at exit/exec"

This reverts commit 40af1bbdca47e5c8a2044039bb78ca8fd8b20f94.

It's horribly and utterly broken for at least the following reasons:

- calling sync_mm_rss() from mmput() is fundamentally wrong, because
   there's absolutely no reason to believe that the task that does the
   mmput() always does it on its own VM.  Example: fork, ptrace, /proc -
   you name it.

- calling it *after* having done mmdrop() on it is doubly insane, since
   the mm struct may well be gone now.

- testing mm against NULL before you call it is insane too, since a
NULL mm there would have caused oopses long before.

.. and those are just the three bugs I found before I decided to give up
looking for me and revert it asap.  I should have caught it before I
even took it, but I trusted Andrew too much.

Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
Cc: Markus Trippelsdorf <markus@trippelsdorf.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

ext4: don't set i_flags in EXT4_IOC_SETFLAGS

Commit 7990696 uses the ext4_{set,clear}_inode_flags() functions to
change the i_flags automatically but fails to remove the error setting
of i_flags. So we still have the problem of trashing state flags.
Fix this by removing the assignment.

Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@kernel.org

ext4: fix the free blocks calculation for ext3 file systems w/ uninit_bg

Ext3 filesystems that are converted to use as many ext4 file system
features as possible will enable uninit_bg to speed up e2fsck times.
These file systems will have a native ext3 layout of inode tables and
block allocation bitmaps (as opposed to ext4's flex_bg layout).
Unfortunately, in these cases, when first allocating a block in an
uninitialized block group, ext4 would incorrectly calculate the number
of free blocks in that block group, and then errorneously report that
the file system was corrupt:

EXT4-fs error (device vdd): ext4_mb_generate_buddy:741: group 30, 32254 clusters in bitmap, 32258 in gd

This problem can be reproduced via:

    mke2fs -q -t ext4 -O ^flex_bg /dev/vdd 5g
    mount -t ext4 /dev/vdd /mnt
    fallocate -l 4600m /mnt/test

The problem was caused by a bone headed mistake in the check to see if a
particular metadata block was part of the block group.

Many thanks to Kees Cook for finding and bisecting the buggy commit
which introduced this bug (commit fd034a84e1, present since v3.2).

Reported-by: Sander Eikelenboom <linux@eikelenboom.it>
Reported-by: Kees Cook <keescook@chromium.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Tested-by: Kees Cook <keescook@chromium.org>
Cc: stable@kernel.org

Merge branch 'akpm' (Andrew's fixups)

Merge random fixes from Andrew Morton.

* emailed from Andrew Morton <akpm@linux-foundation.org>: (11 patches)
  mm: correctly synchronize rss-counters at exit/exec
  btree: catch NULL value before it does harm
  btree: fix tree corruption in btree_get_prev()
  ipc: shm: restore MADV_REMOVE functionality on shared memory segments
  drivers/platform/x86/acerhdf.c: correct Boris' mail address
  c/r: prctl: drop VMA flags test on PR_SET_MM_ stack data assignment
  c/r: prctl: add ability to get clear_tid_address
  c/r: prctl: add minimal address test to PR_SET_MM
  c/r: prctl: update prctl_set_mm_exe_file() after mm->num_exe_file_vmas removal
  MAINTAINERS: whitespace fixes
  shmem: replace_page must flush_dcache and others

mm: correctly synchronize rss-counters at exit/exec

mm->rss_stat counters have per-task delta: task->rss_stat.  Before
changing task->mm pointer the kernel must flush this delta with
sync_mm_rss().

do_exit() already calls sync_mm_rss() to flush the rss-counters before
committing the rss statistics into task->signal->maxrss, taskstats,
audit and other stuff.  Unfortunately the kernel does this before
calling mm_release(), which can call put_user() for processing
task->clear_child_tid.  So at this point we can trigger page-faults and
task->rss_stat becomes non-zero again.  As a result mm->rss_stat becomes
inconsistent and check_mm() will print something like this:

| BUG: Bad rss-counter state mm:ffff88020813c380 idx:1 val:-1
| BUG: Bad rss-counter state mm:ffff88020813c380 idx:2 val:1

This patch moves sync_mm_rss() into mm_release(), and moves mm_release()
out of do_exit() and calls it earlier.  After mm_release() there should
be no pagefaults.

[akpm@linux-foundation.org: tweak comment]
Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Reported-by: Markus Trippelsdorf <markus@trippelsdorf.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: <stable@vger.kernel.org> [3.4.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

btree: catch NULL value before it does harm

Storing NULL values in the btree is illegal and can lead to memory
corruption and possible other fun as well. Catch it on insert, instead
of waiting for the inevitable.

Signed-off-by: Joern Engel <joern@logfs.org>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

btree: fix tree corruption in btree_get_prev()

The memory the parameter __key points to is used as an iterator in
btree_get_prev(), so if we save off a bkey() pointer in retry_key and
then assign that to __key, we'll end up corrupting the btree internals
when we do eg

longcpy(__key, bkey(geo, node, i), geo->keylen);

to return the key value. What we should do instead is use longcpy() to
copy the key value that retry_key points to __key.

This can cause a btree to get corrupted by seemingly read-only
operations such as btree_for_each_safe.

[akpm@linux-foundation.org: avoid the double longcpy()]
Signed-off-by: Roland Dreier <roland@purestorage.com>
Acked-by: Joern Engel <joern@logfs.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

ipc: shm: restore MADV_REMOVE functionality on shared memory segments

Commit 17cf28afea2a ("mm/fs: remove truncate_range") removed the
truncate_range inode operation in favour of the fallocate file
operation.

When using SYSV IPC shared memory segments, calling madvise with the
MADV_REMOVE advice on an area of shared memory will attempt to invoke
the .fallocate function for the shm_file_operations, which is NULL and
therefore returns -EOPNOTSUPP to userspace. The previous behaviour
would inherit the inode_operations from the underlying tmpfs file and
invoke truncate_range there.

This patch restores the previous behaviour by wrapping the underlying
fallocate function in shm_fallocate, as we do for fsync.

[hughd@google.com: use -ENOTSUPP in shm_fallocate()]
Signed-off-by: Will Deacon <will.deacon@arm.com>
Acked-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

drivers/platform/x86/acerhdf.c: correct Boris' mail address

Correct mail address reference to a mail account which I actually read.

Signed-off-by: Borislav Petkov <bp@alien8.de>
Cc: Peter Feuerer <peter@piie.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>