BlogtechnicalAdvances In The ZIO 2.0 Scheduler

Advances In The ZIO 2.0 Scheduler

Learn what a scheduler is, why it is important, how it works, some pitfalls in implementing one, and how ZIO's scheduler avoids these pitfalls

26 August 2021# scala# zio

ZIO's new fiber-aware scheduler is one of the many ways that we are driving higher performance in ZIO 2.0.

Advances in the ZIO 2.0 Scheduler

By Adam Fraser

ZIO 2.0 takes concurrent programming in Scala to the next level in a variety of ways, including higher performance, improved ergonomics, built-in support for logging and metrics, and a new vision of streams.

In this post, I want to focus in particular on ZIO's new fiber-aware scheduler, which is one of the many ways that we are driving higher performance in ZIO 2.0. We will talk about what a scheduler is, why it is important, how it works, some pitfalls in implementing one, and how ZIO's scheduler avoids these pitfalls.

Schedulers

Functional effect systems allow us to write our code in an extremely high level, compositional style. For example, I can fork a hundred thousand fibers to do some work like this:

import zio._

for {
  ref    <- Ref.make(0)
  fibers <- ZIO.foreach(1 to 100000)(_ => ref.update(_ + 1).fork)
  _      <- ZIO.foreach(fibers)(_.join)
  value  <- ref.get
} yield value

This program will fork a hundred thousand separate logical processes that will each do some work. Then it will wait, without ever blocking any underlying operating system threads, for all of the work to be completed and finally retrieve the result.

It is great for us to be able to write high-level code like this, but it relies on the runtime being able to execute our logic in a performant way. This is where a scheduler comes in.

We don't want to have a hundred thousand threads in our application, at least if we want to have reasonable performance. Generally, we only want to have a number of threads equal to the number of cores on the machine, which is perhaps 8 on a local development machine or as high as 256 on a server.

So the job of a scheduler is to schedule the execution of the work being performed by countless fibers onto a very limited number of operating system threads in a way that:

  1. Ensures that all work gets done
  2. Maximizes performance

There are a variety of approaches we can take to do this.

One of the simplest, implemented by the ZIO 1.0 scheduler, is to maintain a single global work queue.

global-queue

In this approach, every time new work is submitted for execution, we add it to the end of a global queue. Then each worker thread simply repeatedly takes work units from the queue and executes them.

It is easy to show that this approach is correct. If all new work is added to the end of the same queue, and all workers take from that queue, then eventually, all submitted work will get done.

However, this approach is not as efficient as it could be.

First, the global queue creates a single point of contention. All workers offer to and take from the same queue, so if many workers are trying to offer or take work at the same time, performance can be adversely impacted.

Second, this approach hurts cache locality.

Typically, when we fork a fiber the work being done by that fiber is related to the work we are currently doing. So executing it on the same thread means the information related to that work is possibly already cached, improving performance.

Unfortunately, using a single global queue means we will frequently not achieve this. Work on forked fibers is done by a random thread, executed by a random core, so it is likely to not be the thread that we were previously running on.

The opposite of this approach would be having a local queue for each worker.

local-queues

This approach completely eliminates contention on global resources and maximizes cache locality, because work forked by a fiber will always be run on the same thread as the fiber was running on.

However, this approach creates the risk of uneven workloads (a lack of fairness). One worker may have many items in its local work queue while another worker is idle, creating the risk that we will not use all of our operating system cores.

This approach can be extremely efficient, but pushes the responsibility onto the user to ensure that work is evenly distributed. As a result, this approach can be a useful strategy for inside a performance-sensitive library, but it doesn't make sense for effect systems like ZIO, where we want to be able to write high-level code without having to worry about details of work distribution.

To improve on either of the preceding approaches, the ZIO 2.0 fiber-aware scheduler uses local queues for each worker, but also allows workers to "steal" work from the work queues of other workers when they have no work in their own queues.

work-stealing

This means that most of the time, workers just execute work from their own local work queues, minimizing contention and maximizing cache locality. But because workers steal work from other workers if they have no work to do, we also avoid the situation where one worker has many tasks to process while another sits idle.

This hybrid approach implemented in ZIO 2.0 was inspired by the excellent Tokio project, which provides a fast runtime for writing reliable, asynchronous applications in the Rust programming language.

Fast and Correct

While this work stealing approach has the potential to achieve higher performance, it also has significantly more complexity and thus pitfalls.

The global queue and local queue solutions can be thought of as two extremes on a continuum between fully shared and fully localized state, and so both of them leave few questions for the implementation. In contrast, a hybrid like work stealing leaves a variety of questions to be answered:

  • Should the local queues have limited size, and if so, where should work that does not fit be placed?
  • When should workers signal to other workers that new work is potentially available for stealing?
  • When should workers attempt to steal from each other or take from global sources of work?

The ways we answer these questions present new opportunities for making code faster, but they also create pitfalls, where we may accidentally create schedulers that are fundamentally unsound, and which have catastrophic implications for runtime correctness.

To see an example of this, let's consider the implementation of the batched queue in Cats Effect 3.

The ZIO 2.0 fiber-aware scheduler uses a global queue, in addition to local queues for each worker. The global queue is used for work that is submitted from outside the ZIO runtime, as well as when work is submitted by a worker but the worker's local queue is already full.

Cats Effect 3 uses a similar global queue for new work called the inject queue. The library uses a separate queue for work that does not fit in the local queue of a worker called the batched queue.

When the local queue of a worker is full, Cats Effect 3 takes half of the tasks in the local queue, and submits them to the batched queue, freeing up space in the local queue. Similarly, when workers take from the batched queue, they take an entire batch of tasks.

At a high-level, this batch-oriented approach seems to make sense, because it reduces contention on globally shared state by reducing the number of offers and takes.

Workers don't take work from the batched queue until they are out of work on their local queue, which also seems to make sense. After all, we pushed work onto the batched queue because we didn't have space in the local queue, so why would we take work back from the batched queue until we had first emptied the local queue?

However, on deeper reflection, we can recognize that this violates a basic invariant that we need to maintain safety, which is that all work submitted to the scheduler must eventually be executed.

To see this, consider what would happen if each of the workers were continually generating and executing new work.

For example, the workers might check some piece of state in order to determine whether to continue and then do additional work, repeating this cycle forever, and yielding between each step to ensure fairness for other fibers. In this case, no worker would ever check the batched queue, which means that even if work had been scheduled which would signal to stop continuing, this work would never be executed.

It turns out that we can deadlock the Cats Effect 3 scheduler quite easily by doing just this:

import cats.effect._
import cats.implicits._

object CatsScheduling extends IOApp {

  def run(args: List[String]): IO[ExitCode] =
    io.as(ExitCode.Success)

  def yieldUntil(ref: Ref[IO, Boolean]): IO[Unit] =
    ref.get.flatMap(b => if (b) IO.unit else IO.cede *> yieldUntil(ref))

  val io: IO[Unit] =
    for {
      n    <- IO(java.lang.Runtime.getRuntime.availableProcessors)
      done <- Ref.of[IO, Boolean](false)
      fibers <- List.range(0, n - 1).traverse(_ => yieldUntil(done).start)
      _ <- IO.unit.start.replicateA(200)
      _ <- done.set(true).start
      _ <- IO.unit.start.replicateA(1000)
      _ <- yieldUntil(done)
      _ <- fibers.traverse(_.join)
    } yield ()
}

This program creates exactly the situation described above.

We create one fiber for each operating system core that repeatedly checks if a Ref is set to true and yields by calling cede. We also start a fiber that will set the Ref to true, so we expect that this program will terminate almost immediately.

But in fact, with the Cats Effect 3.0 scheduler, this program will never terminate. Despite the fact that we have used no "unsafe" operators and in fact are repeatedly using the cede operator, which is supposed to guarantee fairness, the fiber that is responsible for setting the Ref to true will never be executed.

ZIO 2.0 Scheduler

The ZIO 2.0 fiber-aware scheduler delivers leading performance with a relatively simple implementation. We use a single global queue and workers check that queue on a regular basis after executing a specified number of iterations of work to avoid unsoundness and catastrophic deadlocks like the one described above.

The preceding program translated to ZIO 2.0 looks like this:

import zio._

object ZIOScheduling extends App {

  def run(args: List[String]): ZIO[ZEnv, Nothing, ExitCode] =
    io.exitCode

  def yieldUntil(ref: Ref[Boolean]): UIO[Unit] =
    ref.get.flatMap(b => if (b) ZIO.unit else ZIO.yieldNow *> yieldUntil(ref))

  val io: UIO[Unit] =
    for {
      n      <- ZIO.succeed(java.lang.Runtime.getRuntime.availableProcessors)
      done   <- Ref.make(false)
      fibers <- ZIO.foreach(List.range(0, n - 1))(_ => yieldUntil(done).forkDaemon)
      _      <- ZIO.unit.forkDaemon.replicateZIO(200)
      _      <- done.set(true).forkDaemon
      _      <- ZIO.unit.forkDaemon.replicateZIO(1000)
      _      <- yieldUntil(done)
      _      <- ZIO.foreach(fibers)(_.join)
    } yield ()
}

As expected, this program terminates correctly and instantly on the ZIO 2.0 scheduler.

Of course, we continue to explore optimization opportunities and details of the implementation will likely evolve between now and the release of ZIO 2.0. However, our paramount focus is on correctness, and we will never cut any corners to appear better on synthetic microbenchmarks.

Fiber-aware schedulers are already quite fast, and as we saw above, it is easy to introduce unsoundness when trying to achieve a few extra points on synthetic microbenchmarks. The reality is that these extra points do not translate into any gains in real-world performance, and if achieved by unsoundness, they can cause catastrophic deadlocks.

Summary

The fiber-aware scheduler is just one of a host of features coming to ZIO 2.0. These features include other performance improvements, such as optimized handling of blocking operations, faster handling of unsafeRun operations in mixed code bases, and a highly tuned and polished runtime system.

ZIO 2.0 also includes many improvements to ergonomics, which make ZIO more accessible than ever to developers coming from a variety of backgrounds, and which safely and automatically manage side-effects accidentally embedded in ZIO constructors. New features like logging and metrics, combined with improved tracing, streams, and high-performance concurrent data structures like ZHub, round out what we hope to be an epic release in the history of the ZIO library.

We'll be sharing more about these features in the coming weeks, but for now, hopefully this post gives you some insight into what the new fiber-aware scheduler in ZIO 2.0 is, how it is going to make your ZIO code faster, and what are some of the design considerations that go into making sure schedulers are not just fast, but also correct and bulletproof!

Related Posts

14 July 2020

African Scala Development Program

African Scala Development Program We are thrilled to announce that Ziverge Inc. has launched an African Scala Development Program, in…

See More
16 November 2020

Functional Scala 2020 Scholarship Program

Functional Scala 2020 Scholarship Program The countdown to Functional Scala 2020 is underway, and today, we are excited to announce the…

See More

Functional Design

Functional Design by John A. De Goes Although functional programming theory is useful, most day-to-day functional programming does not…

See More
Subscribe to our newsletter