The basic scheme of the block mode assembler is that we start by
enabling the FPU, loading the key into the floating point registers,
then iterate calling the encrypt/decrypt routine for each block.
For the 256-bit key cases, we run short on registers in the unrolled
loops.
So the {ENCRYPT,DECRYPT}_256_2() macros reload the key registers that
get clobbered.
The unrolled macros, {ENCRYPT,DECRYPT}_256(), are not mindful of this.
So if we have a mix of multi-block and single-block calls, the
single-block unrolled 256-bit encrypt/decrypt can run with some
of the key registers clobbered.
Handle this by always explicitly loading those registers before using
the non-unrolled 256-bit macro.
This was discovered thanks to all of the new test cases added by
Jussi Kivilinna.
Signed-off-by: David S. Miller <davem@davemloft.net>