Radon Meter Crashing Found


After the previous changes I had yet another hang after just 40,800 seconds. Clearly it wasn’t the resistors or power filtering. Reading over my code again I noticed a hold-out from a code sample I copied about a year back when first doing some testing. It was “old” code and had been shared on the internet as an example, it couldn’t possibly be my problem… or could it?

Here’s the relevant section:

(from main.h)
unsigned long timeBudget;


void loop() {
  unsigned long loop_millis = millis();
  buttonHelper.Handle();
  radonSensor.Handle();
  sendHelper.Handle();
  handleLed();
  handleDisplay();
  timeBudget = max((unsigned long) 1, 30 - (millis() - loop_millis));
  delay((int) timeBudget);
 }

The problem is right here:

timeBudget = max((unsigned long) 1, 30 - (millis() - loop_millis));

If millis() - loop_millis is ever greater than 30 this will underflow and the timeBudget will be just a shade under 4,294,967,295. You can probably guess where this is going next.

Let’s run through some examples to understand better what happened:

Our program has been running for about a minute and at the beginning of the loop millis() returns a value of say 61350.
We run through all of the specified handlers and at the end of the loop 15ms have passed.
The calculation looks like this: timeBudget = max((unsigned long) 1, 30 - (61365 - 61350)) or timeBudget = max((unsigned long) 1, 30 - (15)) therefore timeBudget = 15.
The (completely unnecessary) cast to int and back to unsigned long has no effect on the actual value since on the esp8266 both are 32-bit quantities so we delay 15ms and move on.

What if we took longer in our loop?
Again we start our loop at 61350, but this time we take 40ms to handle all of our events.
At the end of the loop millis() returns 61390.
Now let’s calculate the time budget and delay: timeBudget = max((unsigned long) 1, 30 - (61390 - 61350)) reduces to timeBudget = max((unsigned long) 1, 30 - (45)) which underflows and becomes timeBudget = max((unsigned long) 1, 4294967280).
The (again, I feel bad this was even written so I’m saying it once more, completely unnecessary) cast to int and back to unsigned long has no effect on the value so when we pass this to the delay function we are asking for a delay of 4,294,967,280ms or just over 49 days.

Under normal usage, my loop would run in significantly less than 30ms.
Typical execution times were around 2-5ms, with some display routines taking longer to draw (sometimes as high as 9-12ms).
That is why I could run so long without this bug manifesting.
However, on some unlucky passes through the loop, and what is causing this I still have no idea, something took longer and timeBudget underflowed causing my program to ask for a 49 day delay which to my untrained eye looked like a hang.

Even though the failure was relatively simple the difficulty I had in isolating the source of the problem has shown me a few improvements I could make.

  • I feel like I might have identified this quicker if I had made available the device’s IP address, perhaps by displaying it on the screen, because then I could have tested pinging it when it appeared hung. Since it was just in a delay handler the tcp stack should have still been running and it would have responded. Unfortunately, each time I elected to angrily power cycle the device by pulling the power cord. Since the device stopped posting updates to my data handling routines I figured it had locked up - I should have poked back.
  • I also think my complacency with the delay function in the loop, a mistake which I attribute to copying and pasting it from another project which ran without issues for months at a time, caused me to just completely skip over it when reviewing my code for potential problems.
  • Finally, I think that a couple untimely crashes early on in the process of uncovering the real issue caused me to falsely attribute the crashes to environmental or physical issues which just led me further from the real problem and caused me to waste my time.

With all of that behind me, at least the fix is easy! Rewrite the delay calculation to not suck!

(from main.h)
#define FRAME_MS 30
byte lastFrameTime = 0;


void loop() {
  unsigned long loopMillis = millis();
  unsigned long elapsedTime = loopMillis - lastLoopMillis;
  lastLoopMillis = loopMillis;
  lastFrameMillis = min((byte) 255, (byte) elapsedTime);
  if (elapsedTime < FRAME_MS) {
    delay(min(FRAME_MS - elapsedTime, (unsigned long) FRAME_MS));
  }

  buttonHelper.Handle();
  radonSensor.Handle();
  sendHelper.Handle();
  handleLed();
  handleDisplay();
}

I prefer this approach anyway because the whole idea of having a time budget was sort of inconsistent with how I wanted the device to work.
There are a couple times where it is necessary to exceed that budget which would impact the frame rate already, so why don’t I just accept that such things will happen sometimes and just try my best to keep the framerate and instead of treating that time like some kind of divine principle that must not be violated.

See also