Implement sleep strategy using the IO subsystem
Implement a pipe-based sleep strategy using the IO subsystem
Design goals
- Wake up either on external newWork notifications or on local IO completions, so the sleep strategy is sound without the IO completer
- Do as little as possible in a system saturated with work
- Pass suspended workers a hint about where to find new work
Algorithm
Data:
    Global:
        hint pipe
        sleepers count
    Per worker:
        dispatch hint buffer
        in flight flag
Sleep:
    if we have no sleep request in flight
        Atomically increment the sleepers count
        Remember that we are sleeping
        Prepare a read request (sqe) from the hint pipe into the dispatch hint buffer
    Prevent the completer from reaping completions on this worker's IoContext
    Wait until IO completions have occurred
NotifyEmper(n):
    if observed sleepers <= 0
        return
    // Determine how many sleepers we are responsible for waking
    do
        toWakeup = min(observed sleepers, n)
    while (!CAS(sleepers, observed sleepers - toWakeup))
    write toWakeup hints to the hint pipe
NotifyAnywhere(n):
    // Ensure all n notifications take effect
    while (!CAS(sleepers, observed sleepers - n))
        if observed sleepers <= -n
            return
    toWakeup = min(observed sleepers, n)
    write toWakeup hints to the hint pipe
onNewWorkCompletion:
    reset in flight flag
    allow completer to reap completions on this IoContext
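The two notifier-side counter updates above can be sketched in C++ as follows. This is a minimal illustration, not the code from this change: the `SleeperCounter` type and the `claimFromEmper` / `claimFromAnywhere` names are hypothetical, the default sequentially consistent memory ordering is used for simplicity, and the sketch assumes that a non-positive wakeup count means no hints are written. Each function only returns how many hints the caller would then write to the hint pipe.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the global sleepers count.
struct SleeperCounter {
	std::atomic<int64_t> sleepers{0};

	// NotifyEmper(n): claim at most n of the currently sleeping workers.
	// Returns the number of hints the caller should write to the hint pipe.
	int64_t claimFromEmper(int64_t n) {
		assert(n > 0);
		int64_t observed = sleepers.load();
		int64_t toWakeup;
		do {
			// Nobody is sleeping, so there is nothing to claim.
			if (observed <= 0) return 0;
			toWakeup = std::min(observed, n);
			// On failure, compare_exchange_weak stores the current value
			// in `observed` and the loop recomputes toWakeup.
		} while (!sleepers.compare_exchange_weak(observed, observed - toWakeup));
		return toWakeup;
	}

	// NotifyAnywhere(n): always subtract n so every notification takes
	// effect, even for workers that have not incremented the count yet.
	int64_t claimFromAnywhere(int64_t n) {
		assert(n > 0);
		int64_t observed = sleepers.load();
		while (!sleepers.compare_exchange_weak(observed, observed - n)) {
			// The counter is already negative enough.
			if (observed <= -n) return 0;
		}
		// Assumption: only workers that were actually sleeping get a hint.
		return std::max<int64_t>(0, std::min(observed, n));
	}
};
```

Because the number to wake is derived from the same observed value that the CAS validates, two concurrent notifiers never claim the same sleeper, and the counter is never corrected after the fact.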
Notes
- We must decrement the sleepers count on the notifier side. Otherwise multiple notifiers could all observe the same number of sleepers and try to wake the same workers by writing to the pipe. The pipe would fill up with unconsumed hints, the notify write would block, and the result would be a deadlock.
- The CAS loops on the notifier side are needed because decrementing first and then incrementing the excess is racy: two notifiers can each observe the sum of both their decrements and then both add back too much, resulting in a broken counter.
- Add the dispatch hint code in AbstractWorkStealingScheduler::nextFiber. This allows workers to check the dispatch hint after there was no local work to execute. This is a trade-off: we accept a slower wakeup (a just-awoken worker first checks for local work) in exchange for a faster dispatch hot path when there is work in the local WSQ.
- The completer thread must not reap completions on the IoContexts of sleeping workers, because this introduces a race for cqes and a possible lost wakeup if the completer consumes the completions before the worker is actually waiting for them.
- When notifying sleeping workers from anywhere, we must ensure that all notifications take effect. This is needed, for example, when terminating the runtime, to prevent sleep attempts by worker threads that are about to sleep but have not yet incremented the sleeper count. We achieve this by always decrementing the sleeper count by the notification count.
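To illustrate the last note, here is a small C++ sketch of how a negative sleeper count could stop a late sleeper. It rests on an assumption that the text above does not spell out: a worker that sees a negative count when it increments the counter treats itself as already notified and skips the sleep. The `registerSleeper` name is hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical sleeper-side entry point.
// Returns true if the worker should actually go to sleep.
inline bool registerSleeper(std::atomic<int64_t>& sleepers) {
	// fetch_add returns the value the counter held before the increment.
	const int64_t before = sleepers.fetch_add(1);
	// Assumption: a negative value means a notification from anywhere
	// was already accounted for this worker, so it must not sleep.
	return before >= 0;
}
```

Under this assumption, a NotifyAnywhere(2) issued before two workers try to sleep leaves the counter at -2. Both workers see a negative value, skip the sleep, and the counter returns to 0.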
Thanks to Florian Schmaus <flow@cs.fau.de> for spotting bugs and suggesting improvements.
Edited by Maxim Onciul