Careful, Go's Standard Ticker is not Realtime
Dec 19, 2021
10 minute read

While developing Mumble-Discord-Bridge (MDB) I learned that Go’s standard ticker wasn’t the best solution for MDB’s internals. MDB requires that several routines run at a fixed timing interval. MDB’s main function is to mix multiple audio streams. As there is no shared clock, MDB’s synchronization method is to align audio packets to the next tick of an internal loop. To build a fixed-interval internal loop, many Go developers (myself included) will turn to the standard library’s ticker (time.Ticker). That basic solution looks a little something like this.

package main

import (
	"time"
)

func main() {
	// Create the ticker
	ticker := time.NewTicker(500 * time.Millisecond)

	for {
		// Wait for the tick
		<-ticker.C

		// Carry out fixed-time work here.
	}
}

Using the standard ticker is straightforward: create a ticker and block on the channel it exposes. The channel can be passed into different goroutines and shared. This makes the ticker versatile for a number of use cases, everything from simple timeout logic to shared rate limiters. There is no arguing that the standard implementation is great for most purposes. However, when handling audio we need consistent timing performance, and the standard ticker fails to provide this.
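As an aside, here is a minimal sketch of that shared rate-limiter pattern; the requests channel and handle function are purely illustrative, not part of MDB.

package main

import (
	"fmt"
	"time"
)

// handle is a stand-in for whatever work each request requires.
func handle(req int) {
	fmt.Println("handled request", req, "at", time.Now().Format("15:04:05.000"))
}

func main() {
	// Hypothetical stream of incoming work.
	requests := make(chan int, 5)
	for i := 1; i <= 5; i++ {
		requests <- i
	}
	close(requests)

	// A single shared ticker acts as a simple rate limiter:
	// each request waits for the next tick before being handled.
	limiter := time.NewTicker(200 * time.Millisecond)
	defer limiter.Stop()

	for req := range requests {
		<-limiter.C
		handle(req)
	}
}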

Identify the Failure

The timing failure was not immediately obvious, for two reasons. First, the standard ticker performs reasonably well and at first glance appears to work. Second, failures are hard to detect and associate with the ticker, as timing errors may be confused with network or other factors. While developing MDB I have often been listening to audio in both Mumble and Discord, which allowed me to develop an expectation for how long a typical delay should be. I first noticed the issue when I moved the bridge into a Docker container and ran it on a basic DigitalOcean server. With the bridge running in this new environment I noticed, from time to time, that the audio was heavily delayed. It was difficult to pin down at first, but I ultimately determined that whenever there was a long continuous audio stream from one side of the bridge the delay would increase. I then received what could be interpreted as a slap in the face when my log started warning about dropped packets. While the relationship between these two issues was not obvious at first, it didn’t take long to see that they were in fact correlated.

Looking into the issue I determined that the ticker interval was slower than the interval of the incoming packets, even though the ticker was set to the correct duration. MDB uses several internal buffers to manage jitter and better sync audio. By default these buffers were left intentionally large to accommodate high-jitter network scenarios. With some basic debugging it became obvious that these buffers were filling up as a result of the intervals not firing on time. While there were several possible sources, I recalled seeing a comment in the Go docs, and that is where I returned to start chasing this problem.

Cue the comments for NewTicker.

// NewTicker returns a new Ticker containing a channel that will send
// the time on the channel after each tick. The period of the ticks is
// specified by the duration argument. The ticker will adjust the time
// interval or drop ticks to make up for slow receivers.
// The duration d must be greater than zero; if not, NewTicker will
// panic. Stop the ticker to release associated resources.

Did you see it? “The ticker will adjust the time interval or drop ticks to make up for slow receivers.” Great. I’m looking for a “slow receiver”. Test time.
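For context, the test snippets below are excerpts from a larger test function, so a few shared definitions (plus the usual time, sync, math/rand, fmt and context imports) are assumed. They are not reproduced from the original test file; the following is an illustrative sketch with values matching the 10ms interval, 1m40s window, and up-to-15ms artificial delay discussed later. Each goroutine is launched from a test that calls wg.Add(1) beforehand and wg.Wait() at the end.

// Illustrative test scaffolding assumed by the snippets below.
const (
	tickerInterval         = 10 * time.Millisecond // ticker period under test
	testCount        int64 = 10000                 // number of ticks to wait for
	testDuration           = 100 * time.Second     // tickerInterval * testCount = 1m40s
	maxSleepInterval       = 15 * time.Millisecond // upper bound on the artificial per-loop delay
)

var wg sync.WaitGroup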

// Test 1 Unloaded
go func(interval time.Duration) {
  now := time.Now()
  start := now
  // start the ticker
  t := time.NewTicker(interval)
  var i int64
  for i = 0; i < testCount; i++ {
    now = <-t.C
    // fmt.Println(now)
  }
  t.Stop()
  fmt.Println("Ticker (unloaded) after", testDuration, "drifts", time.Since(start)-testDuration)
  wg.Done()
}(tickerInterval)
Ticker (unloaded) after 1m40s drifts 1.105224ms

This tests the timing of the ticker over a long interval. For the test a 10ms ticker was used with a target time of 1m40s. The test demonstrates that the ticker performs as expected. I almost gave up on the idea until I loaded my machine with another process and saw the test fail. To demonstrate the failure purely in Go we need to introduce some additional “fake” load. Consider the following modification to the test.

go func(interval time.Duration) {
  now := time.Now()
  start := now
  // start the ticker
  t := time.NewTicker(interval)
  var i int64
  for i = 0; i < testCount; i++ {
    if i+1 < testCount {
      time.Sleep(time.Duration(float64(maxSleepInterval) * rand.Float64()))
    }
    now = <-t.C
    // fmt.Println(now)
  }
  t.Stop()
  fmt.Println("Ticker (loaded) after", testDuration, "drifts", time.Since(start)-testDuration)
  wg.Done()
}(tickerInterval)
Ticker (loaded) after 1m40s drifts 2.670534052s

Aha! 2.6 seconds of drift! In this scenario we introduce a random delay in the loop of up to 15ms. This delay can represent slow code inside our loop or, as we will explore in the next section, something called idle jitter. From the perspective of the ticker the loop is now a “slow receiver”. Whenever the delay is larger than the timing interval the ticker drops the next tick. Interestingly, each miss is cumulative, just as it was in MDB, resulting in full buffers and dropped packets. After seeing this result it became apparent that the single-core DigitalOcean machine was simply not powerful enough to hide this timing issue in the implementation. If I had kept using my personal machine with many cores, this issue might never have presented itself.

Idle Jitter

While the problem originally presented itself on a slow machine, I do not think a fast machine is immune to this issue. Depending on how fast we want the ticker to tick, it is possible to see this drift behavior on a seemingly fast machine. There is no doubt that kernel scheduling is an advanced topic reserved for the elite few. Couple kernel scheduling with the Go runtime and it should send anyone’s head for a spin. The fundamental idea I think is necessary to be aware of, however, is idle jitter. I was first made aware of idle jitter when browsing Netdata plugins. Netdata describes it as “The difference between the requested and the actual duration of the sleep, is the idle jitter.” Relevant? Yup. Simple? Sure. If you are interested in monitoring the idle jitter on your own system, Netdata will do it out of the box and is worth an install if you want to learn more about various system metrics. We can borrow Netdata’s approach and measure idle jitter with a simple Go program.

package main

import (
	"fmt"
	"time"
)

func main() {
	for i := 0; i < 100000; i++ {
		// Ask to sleep for exactly 10ms...
		now := time.Now()
		time.Sleep(10 * time.Millisecond)
		// ...and measure how far past the requested wake-up time we actually woke.
		drift := time.Since(now.Add(10 * time.Millisecond))
		if drift > time.Millisecond {
			fmt.Println(i, drift)
		}
	}
}

This simple program sleeps for 10ms and calculates the drift from the desired sleep time.

GOMAXPROCS=1 go run idle_jitter.go

I chose to run the idle jitter application with GOMAXPROCS=1 to remove any possibility of the Go runtime doing something I don’t understand. This should limit the Go application to a single thread actively executing Go code.

1167 1.322981ms
2177 1.057028ms
3865 9.513415ms
4843 1.812275ms
6490 3.273958ms
6513 1.760718ms
6518 3.228439ms
6522 6.380061ms
6565 3.645833ms
6570 1.084399ms
...

We can see from the printout that while most iterations drift less than 1ms, drifts of almost 10ms can be seen. This test was run on a Dell XPS 15 with a six-core i7. The test started with the machine almost idle, with the exception of a handful of open user applications. During the test I ran a separate CPU-intensive process, and around iteration 6500 we can see the number of large drifts increase. This increase correlates with the CPU load from the separate process and suggests that idle jitter is related to CPU load.

Looking at system idle jitter, I believe we can conclude that baseline system behavior could easily lead to degraded performance for an application like MDB, even on a seemingly fast or idle machine. And certainly, a slow enough system or a long enough time horizon could result in the internal loops falling behind, leading to delayed or dropped audio packets.

The Solution

The solution I chose for MDB is actually quite simple. While the ticker is great for a lot of use cases, I turned back to a basic sleep call. Removing the ticker allows the implementation to dictate how missed intervals should be handled and removes the need for the Go runtime to wake another goroutine. I essentially created my own “ticker” type using a sleep call and a few variables to track the current and next sleep targets. This required some minor code modification, as we are no longer blocking on a channel but simply waiting for the sleep call to return.
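The idea can be sketched as a small type like the one below. To be clear, this is not the actual sleepct code from MDB, only a minimal illustration of the approach; the Start and SleepNextTarget names mirror the ones used in the test that follows, but the internals and the unused boolean parameter are assumptions.

package sleepct

import (
	"context"
	"time"
)

// SleepCT ("sleep constant time") tracks a fixed-interval wake-up target.
// Illustrative sketch only; not the actual MDB implementation.
type SleepCT struct {
	d      time.Duration // desired interval
	target time.Time     // next wake-up target
}

// Start records the interval and anchors the first target at the current time.
func (s *SleepCT) Start(d time.Duration) {
	s.d = d
	s.target = time.Now()
}

// SleepNextTarget advances the target by one interval and sleeps until it.
// If the loop is already past the target (a missed interval) no sleep occurs,
// so the loop runs again immediately and catches up over time.
// The second parameter is unused in this sketch.
func (s *SleepCT) SleepNextTarget(ctx context.Context, _ bool) {
	s.target = s.target.Add(s.d)
	if wait := time.Until(s.target); wait > 0 {
		select {
		case <-time.After(wait):
		case <-ctx.Done():
		}
	}
}

With something like that in place, a test of the new implementation shows the results we were looking for.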

go func(interval time.Duration) {
  now := time.Now()
  start := now
  // start the sleep-based timer
  s := sleepct.SleepCT{}
  s.Start(interval)
  var i int64
  for i = 0; i < testCount; i++ {
    if i+1 < testCount {
      time.Sleep(time.Duration(float64(maxSleepInterval) * rand.Float64()))
    }
    s.SleepNextTarget(context.TODO(), false)
  }
  fmt.Println("SleepCT (loaded) after", testDuration, "drifts", time.Since(start)-testDuration)
  wg.Done()
}(tickerInterval)
SleepCT (loaded) after 1m40s drifts 1.154823ms
--- PASS: TestSleepCT (100.00s)

Using the loaded test profile from before we can see that the basic sleep implementation performs as desired. The 1ms drift is the result of the jitter seen at the end of the test and would be present regardless of the length of the test.

The reality here is that the jitter still exists. The implementation simply moves the jitter onto the receiving services. Considering the role of MDB from the perspective of the two external services, I believe it is safe to assume that both Discord and Mumble can handle the jitter imposed by the process. These audio applications constantly have to manage the jitter introduced by the network between client and server, so this implementation essentially stacks process jitter on top of whatever network jitter is present. As with any networked audio application, it is in the best interest of the admin running an instance of MDB to locate it as close to the two external services as possible, opting for the fastest, lowest-jitter network connection for the best performance.

Conclusion

This was an interesting thread to pull while developing MDB. There are a lot of factors that contribute to the timing performance of this soft-realtime application. The simple tests demonstrate how the base ticker fails us when the system lacks the resources to meet the timing requirements. Falling back to a simple sleep implementation is all that is needed to maintain the timing we desire for this application.

While I am no expert in kernel scheduling or the Go runtime, they both play an integral part in waking threads and goroutines in a timely manner. I suspect that if we want to tighten the timing performance further we would need to explore tweaking system parameters.

I hope this was an interesting summary of the problem and that it helps someone in a similar predicament in the future.

Please check out Mumble-Discord-Bridge and tell your fellow OSS enthusiasts.

Thanks for reading! :)


