load CQ->tail only once during lockless stealing
Currently we load CQ->tail with acquire semantics to decide whether to steal from the victim, then load it again in the actual stealing logic, which immediately aborts anyway if there are no CQEs to steal. The first load is therefore redundant in the lockless path.
Keep the early tail check as an optimization for the locked case.
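
For illustration only, a minimal sketch of the lockless path after this change, under stated assumptions: the struct layout, the ring size, and the names `cq_push` and `cq_steal_lockless` are hypothetical and not taken from the actual code. It only shows the idea that the steal routine performs the single acquire load of the tail and bails out itself when the queue is empty, so the caller needs no separate pre-check.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define CQ_SIZE 8u /* hypothetical ring size */

/* Hypothetical completion queue; field names are illustrative only. */
struct cq {
    _Atomic unsigned head;      /* next CQE to consume (advanced by thieves) */
    _Atomic unsigned tail;      /* next CQE to produce (advanced by the owner) */
    _Atomic int cqes[CQ_SIZE];  /* atomic slots so racing reads are well defined */
};

/* Owner side: write the slot, then publish it with a release store on tail. */
static bool cq_push(struct cq *cq, int cqe)
{
    unsigned tail = atomic_load_explicit(&cq->tail, memory_order_relaxed);
    unsigned head = atomic_load_explicit(&cq->head, memory_order_acquire);

    if (tail - head == CQ_SIZE)
        return false; /* ring is full */

    atomic_store_explicit(&cq->cqes[tail % CQ_SIZE], cqe, memory_order_relaxed);
    atomic_store_explicit(&cq->tail, tail + 1, memory_order_release);
    return true;
}

/*
 * Lockless steal. The tail is loaded exactly once, here; the caller does
 * not peek at it beforehand. Returns false if there was nothing to steal
 * or if another thief won the race for the same entry.
 */
static bool cq_steal_lockless(struct cq *victim, int *out)
{
    assert(out != NULL);

    unsigned head = atomic_load_explicit(&victim->head, memory_order_acquire);
    unsigned tail = atomic_load_explicit(&victim->tail, memory_order_acquire);

    if (head == tail)
        return false; /* no CQEs to steal: abort immediately */

    /* Read the candidate entry before claiming it. */
    int cqe = atomic_load_explicit(&victim->cqes[head % CQ_SIZE],
                                   memory_order_relaxed);

    /* Claim the entry; on failure the value read above is discarded. */
    if (!atomic_compare_exchange_strong_explicit(&victim->head, &head, head + 1,
                                                 memory_order_acq_rel,
                                                 memory_order_relaxed))
        return false;

    *out = cqe;
    return true;
}
```

With this shape, a caller in the lockless path simply calls `cq_steal_lockless(victim, &cqe)` and treats a false return as "nothing stolen". A locked variant would presumably keep a cheap tail check before acquiring the lock, since there the check avoids lock traffic on an empty queue.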