Never saw a profile of bset_search_tree() where it wasn't bottlenecked
on memory until I got my new Haswell machine, but when I tried it there
it was suddenly burning 20% of the cpu in the inner loop on shrd...
Turns out, the version of shrd that takes 64 bit operands has a 9 cycle
latency. hah.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
static inline uint64_t shrd128(uint64_t high, uint64_t low, uint8_t shift)
{
-#ifdef CONFIG_X86_64
- asm("shrd %[shift],%[high],%[low]"
- : [low] "+Rm" (low)
- : [high] "R" (high),
- [shift] "ci" (shift)
- : "cc");
-#else
low >>= shift;
low |= (high << 1) << (63U - shift);
-#endif
return low;
}