Fix reap completion race (!117) · Merge requests · i4-old / manycore / emper

Maxim Onciul requested to merge fix_reap_completion_race into master Mar 01, 2021

Our current naive try lock protecting the CQ of a worker's IoContext is racy. That alone is not a problem: a try lock is racy by design, in the sense that two threads race for who takes the lock.

The actual problem is:

While a worker holds the lock, additional completions can arrive that the worker does not observe, because it may already have finished iterating the CQ.

If the worker still holds the lock at that point, it prevents the globalCompleter from reaping the additional completions. This is a lost wakeup problem and can leave the entire runtime asleep while runnable completions sit in a worker's IoContext.

To prevent this lost wakeup, the cq_lock now counts the unsuccessful lock attempts made by the globalCompleter.

If a worker observes that the globalCompleter tried to reapCompletions more than once, a lost wakeup could have occurred, so we try to reap again. Observing one attempt is normal, since the globalCompleter and the worker owning the IoContext race for the cq_lock required to reap completions.
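The counting scheme described above can be sketched roughly as follows. This is a minimal illustration, not the code in this merge request: the class name CountingTryLock, the methods try_lock_or_count and unlock, and the helper reapAsWorker are all hypothetical names chosen for the example, and the encoding of the counter in a single atomic is an assumption.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical sketch of a try lock that counts failed attempts.
// State encoding (an assumption for this example):
//   0      -> lock is free
//   1      -> lock is held, no failed attempts recorded
//   n > 1  -> lock is held, n - 1 failed attempts recorded
class CountingTryLock {
  std::atomic<uint32_t> state{0};

 public:
  // Used by the worker: a plain try lock that does not count failures.
  bool try_lock() {
    uint32_t expected = 0;
    return state.compare_exchange_strong(expected, 1,
                                         std::memory_order_acquire,
                                         std::memory_order_relaxed);
  }

  // Used by the globalCompleter: take the lock if it is free,
  // otherwise record the failed attempt in the lock itself.
  bool try_lock_or_count() {
    uint32_t cur = state.load(std::memory_order_relaxed);
    for (;;) {
      const uint32_t desired = (cur == 0) ? 1 : cur + 1;
      if (state.compare_exchange_weak(cur, desired,
                                      std::memory_order_acq_rel,
                                      std::memory_order_relaxed)) {
        return cur == 0;  // true only if the lock was free
      }
      // cur was reloaded by the failed CAS; retry with the new value.
    }
  }

  // Release the lock and return how many failed attempts were recorded
  // while it was held.
  uint32_t unlock() {
    return state.exchange(0, std::memory_order_release) - 1;
  }
};

// Worker-side reap loop: reap again if more than one failed attempt was
// observed, because completions may have arrived after the CQ was iterated.
template <typename ReapFn>
void reapAsWorker(CountingTryLock& lock, ReapFn reap) {
  while (lock.try_lock()) {
    reap();
    if (lock.unlock() <= 1) {
      break;  // zero or one attempt is the normal race for the lock
    }
  }
}
```

If the second try_lock in reapAsWorker fails, another thread holds the lock and is reaping, so in this sketch the worker can stop without losing completions.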

Additionally:

Reduce the critical section in which the cq_lock is held by copying all seen cqes and completing the Futures after the lock is released.
Don't immediately schedule blocked Fibers or Callbacks; instead, collect them and return them as a batch. Maybe the caller knows better what to do with a batch of runnable Fibers.
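
The two additional changes can be illustrated with the following sketch. All types here (Fiber, Future, Cqe, IoContextSketch) are simplified, hypothetical stand-ins rather than emper's real classes; the CQ is modelled as a plain vector instead of an io_uring completion queue, and the lock is a bare atomic flag.

```cpp
#include <atomic>
#include <vector>

struct Fiber { int id; };  // hypothetical stand-in for a runnable Fiber

struct Future {            // hypothetical stand-in for an IO Future
  Fiber* waiter = nullptr; // Fiber blocked on this Future, if any
  int result = 0;

  // Store the result and hand back the Fiber that became runnable.
  Fiber* complete(int res) {
    result = res;
    return waiter;
  }
};

struct Cqe {               // hypothetical completion queue entry
  Future* future;
  int res;
};

struct IoContextSketch {
  std::atomic<bool> cq_locked{false};  // stand-in for the cq_lock
  std::vector<Cqe> cq;                 // stand-in for the CQ

  // Returns the batch of Fibers made runnable by the reaped completions.
  std::vector<Fiber*> reapCompletions() {
    // Try to take the lock; give up if someone else is reaping.
    if (cq_locked.exchange(true, std::memory_order_acquire)) {
      return {};
    }

    // Critical section: only move the seen cqes out of the CQ.
    std::vector<Cqe> seen;
    seen.swap(cq);
    cq_locked.store(false, std::memory_order_release);

    // Outside the lock: complete the Futures and collect the runnable
    // Fibers instead of scheduling each one immediately.
    std::vector<Fiber*> runnable;
    runnable.reserve(seen.size());
    for (const Cqe& cqe : seen) {
      if (Fiber* fiber = cqe.future->complete(cqe.res)) {
        runnable.push_back(fiber);
      }
    }
    return runnable;
  }
};
```

A caller in this sketch receives the whole batch and decides what to do with it, for example scheduling all Fibers at once.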
