Component-Level Root Cause Analysis: From Mystery Failure to MEMS Oscillator

Topics: hardware debugging, root cause analysis, IoT, manufacturing

Tom Wade

7/12/2025 · 5 min read

About 1 in 50 of our wireless sensor units would fail to register on the cellular network at startup — sometimes. Reset the device, and most would come up fine. The pattern was infuriating: not consistent enough to be a hardware fault, not random enough to be software. This is the story of how I traced that failure mode from logs and field reports all the way down to the startup behavior of a quartz crystal — and how I qualified a MEMS oscillator replacement that nearly eliminated the problem. It's also a story about why “intermittent” almost never actually means “random.”

The symptom

We make wireless sensor devices used for structural health monitoring. Each device boots up, talks to a cellular gateway, and starts reporting data. Most boot cleanly. A small but stubborn percentage didn't: they'd fail to register on the cellular network, stall for an unpredictable length of time, and then either register late or never register at all.

Field reports were inconsistent. Sometimes a customer would deploy 50 devices and have all 50 work. Sometimes they'd deploy 50 and have one or two get stuck. Resetting the unit usually fixed it. Sending it back through QC found no fault. Each individual incident felt like a one-off — statistical noise, maybe — but the pattern across enough deployments was unmistakable. Something was wrong, and “sometimes a thing fails” isn't an acceptable answer for a product that's supposed to monitor things people care about.
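A quick back-of-the-envelope check shows why individual deployments looked like noise. The sketch below assumes the roughly 1-in-50 rate applies independently to every unit. That is a simplification, since the real failures turned out to depend on the crystal and the temperature, but it is enough to show what a 50-unit deployment would look like under a simple binomial model.

```python
from math import comb

def prob_k_failures(n: int, k: int, p: float) -> float:
    """Binomial probability of exactly k failures among n independent units."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# Assumed values: 50 units per deployment, ~1-in-50 per-unit failure rate.
N_UNITS = 50
P_FAIL = 1 / 50

p_clean = prob_k_failures(N_UNITS, 0, P_FAIL)
p_one_or_two = sum(prob_k_failures(N_UNITS, k, P_FAIL) for k in (1, 2))

print(f"all {N_UNITS} units register: {p_clean:.2f}")
print(f"one or two units stuck:   {p_one_or_two:.2f}")
```

Under those assumptions, a bit more than a third of 50-unit deployments see no failures at all and over half see one or two, which is consistent with field reports that looked contradictory when read one at a time.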

What I tested first (and ruled out)

The instinct on intermittent failures is to suspect everything. The discipline is to suspect things in order. I went through the obvious candidates first:

Software bugs in the connection sequence. We reviewed the firmware that handled cellular registration, looked for race conditions, retry-logic bugs, and timeout misconfigurations. There were small things we cleaned up, but none of them explained the failure pattern. Devices with the same firmware version were behaving differently.

Cellular signal quality. A weak signal could explain registration failures. We pulled signal strength data from the affected units and the fleet broadly. The affected units weren't in noticeably worse signal environments than working ones. The hypothesis was eliminated.

Power supply issues. A flaky power rail at boot could cause all sorts of weird behavior. We instrumented the power rails on a few units and saw clean power-on profiles. Not it.

Cellular module variation. We considered whether some specific manufacturing batch of cellular modules might have a defect. The failure pattern didn't correlate with module batch — we saw failures on units with modules from multiple vendors and production runs. Not it.

After eliminating the most obvious causes, the failure was still happening. This is the part of debugging that separates engineers who solve problems from engineers who give up: when the easy hypotheses are exhausted, you have to start hypothesizing things that are less obvious.

The breakthrough

The breakthrough came when I started looking at boot logs at temperature extremes rather than at room temperature. We had a few units in a temperature-controlled environment for environmental testing; I asked for their boot logs at -10°C, +25°C, and +40°C. The pattern was clear once I saw it: at temperature extremes, the boot sequence took noticeably longer in the very first stage of startup — before the cellular module was even powered on.
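The comparison itself is simple once the logs are in hand. Here is a minimal sketch, assuming each boot log has already been reduced to a chamber temperature and the duration of the first boot stage. The record layout and the numbers are placeholders for illustration, not data from the project.

```python
from collections import defaultdict
from statistics import median

# Hypothetical records: (chamber temperature in °C, early-boot stage duration in ms).
# The values are invented placeholders, not measurements.
records = [
    (-10, 41.0), (-10, 58.5), (-10, 47.2),
    (25, 12.1), (25, 11.8), (25, 13.0),
    (40, 22.4), (40, 30.9), (40, 25.3),
]

def median_by_temperature(rows):
    """Return {temperature: median early-boot duration} for the given rows."""
    grouped = defaultdict(list)
    for temp_c, duration_ms in rows:
        grouped[temp_c].append(duration_ms)
    return {temp_c: median(values) for temp_c, values in sorted(grouped.items())}

summary = median_by_temperature(records)
for temp_c, med in summary.items():
    print(f"{temp_c:+4d} °C  median early-boot time: {med:.1f} ms")
```

Grouping by the suspected variable, rather than eyeballing individual logs, is what makes a shift at the corners stand out.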

That timing variation pointed to something at the very beginning of the boot process. The cellular module wasn't the problem; the cellular module was getting handed bad initial conditions because something earlier was misbehaving. The next question was: what runs that early in the boot sequence and is temperature-sensitive?

The answer was the system clock. Specifically, the quartz crystal that provides the master clock to the microcontroller. Quartz crystals don't start oscillating instantly when you apply power; they ramp up over a startup period that depends on the physical characteristics of the crystal, the temperature, and a few other factors. If the startup is too slow, code that depends on a stable clock can begin executing before the clock is actually stable, leading to subtle timing-dependent failures downstream, plausibly including the cellular registration handshake, which is sensitive to precise timing.
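To make the race concrete, here is a toy model in Python. It assumes, purely for illustration, that firmware proceeds after a fixed delay; the post does not say how the real firmware waited, and the delay and startup times below are invented. The second function shows the general alternative of polling a readiness check, which is not the fix used here (that was a component swap).

```python
def clock_stable_at_handoff(crystal_startup_ms: float, fixed_delay_ms: float) -> bool:
    """True if the crystal has stabilized by the time firmware stops waiting."""
    return crystal_startup_ms <= fixed_delay_ms

def wait_for_clock(is_stable, timeout_ms: int, step_ms: int = 1) -> int:
    """Poll a readiness check instead of assuming a fixed startup time.

    Returns the elapsed time in ms once is_stable(elapsed) is True,
    or raises TimeoutError if the clock never reports stable.
    """
    elapsed = 0
    while elapsed <= timeout_ms:
        if is_stable(elapsed):
            return elapsed
        elapsed += step_ms
    raise TimeoutError(f"clock not stable after {timeout_ms} ms")

# Hypothetical startup times (ms) for three crystals at a cold corner.
startup_times_ms = [8, 14, 27]
FIXED_DELAY_MS = 20  # assumed firmware delay, not the real value

for t in startup_times_ms:
    ok = clock_stable_at_handoff(t, FIXED_DELAY_MS)
    waited = wait_for_clock(lambda elapsed, t=t: elapsed >= t, timeout_ms=100)
    print(f"startup {t:2d} ms: fixed delay ok={ok}, polling waited {waited} ms")
```

In this model only the slowest crystal misses the fixed window, which is the shape of an intermittent failure: most units pass, and the ones at the tail of the tolerance distribution do not.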

Confirming the hypothesis

The next step was confirming the quartz crystal was actually the culprit, not just a plausible candidate. I instrumented the boot sequence to capture clock stabilization timing on units that had failed before and on known-good units. The pattern held: failed units showed slower clock stabilization at startup, especially at temperature extremes. The amount of variation across units came from manufacturing tolerances on the crystals — some were just slower than others, and the slow ones were the ones we were seeing failures on.
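One simple way to quantify "the pattern held" is to ask how often a previously failed unit is slower than a known-good one. The sketch below is an assumed analysis, not the one used on the project, and the stabilization times are placeholders.

```python
from itertools import product
from statistics import median

def prob_slower(failed_ms, good_ms) -> float:
    """Fraction of (failed, good) pairs where the failed unit stabilized later.

    Ties count as half. 0.5 means no separation; 1.0 means every
    previously failed unit was slower than every known-good unit.
    """
    pairs = list(product(failed_ms, good_ms))
    score = sum(1.0 if f > g else 0.5 if f == g else 0.0 for f, g in pairs)
    return score / len(pairs)

# Placeholder stabilization times in ms, invented for illustration only.
failed_units_ms = [24.0, 31.5, 27.8, 35.2]
good_units_ms = [9.4, 12.1, 10.8, 14.9, 11.3]

print("median failed:", median(failed_units_ms))
print("median good:  ", median(good_units_ms))
print("P(failed slower than good):", prob_slower(failed_units_ms, good_units_ms))
```

A value near 0.5 would have meant the crystal was just another plausible candidate; a value near 1.0 is the kind of separation that justifies moving on to a fix.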

This is where component-level RCA pays off: once you have the right hypothesis, the test to confirm it is usually fast and the fix is usually well-scoped. We weren't going to redesign the cellular subsystem or rewrite the firmware. We were going to swap a single component.

Qualifying the corrective action

MEMS oscillators are an alternative to quartz crystals. They have different physical properties — specifically, much faster startup times and much less variation across units and temperatures. They're also more expensive per unit, so you don't choose them lightly. But for a device whose entire job is to come up reliably and start sending data, the cost difference is trivial compared to the cost of an unreliable boot.

I researched candidates, evaluated them against our timing requirements at temperature extremes, ordered samples from the most promising vendor, and tested them in production hardware. The performance was as advertised — startup was fast, consistent across units, and stable across temperature. We qualified the part for production use.
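The pass/fail logic for a candidate can be written down in a few lines. This sketch assumes a single startup-time limit and reuses the three chamber temperatures mentioned earlier; the part labels, the limit, and the measurements are all hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Measurement:
    part: str          # candidate label (hypothetical)
    temp_c: int        # chamber temperature in °C
    startup_ms: float  # measured time to a stable clock

def worst_case_by_temp(measurements, part):
    """Worst (largest) startup time seen for a part at each temperature."""
    worst = {}
    for m in measurements:
        if m.part == part:
            worst[m.temp_c] = max(worst.get(m.temp_c, 0.0), m.startup_ms)
    return worst

def qualifies(measurements, part, limit_ms, required_temps):
    """Pass only if every required temperature was tested and within the limit."""
    worst = worst_case_by_temp(measurements, part)
    return all(t in worst and worst[t] <= limit_ms for t in required_temps)

REQUIRED_TEMPS_C = (-10, 25, 40)
LIMIT_MS = 5.0  # assumed requirement, not the project's real number

samples = [
    Measurement("candidate_a", -10, 3.1), Measurement("candidate_a", 25, 2.9),
    Measurement("candidate_a", 40, 3.0),
    Measurement("candidate_b", -10, 6.4), Measurement("candidate_b", 25, 3.2),
    Measurement("candidate_b", 40, 3.5),
    Measurement("candidate_c", 25, 2.0),  # never tested at the corners
]

for part in ("candidate_a", "candidate_b", "candidate_c"):
    print(part, "qualifies:", qualifies(samples, part, LIMIT_MS, REQUIRED_TEMPS_C))
```

Treating an untested corner as a failure, rather than as a pass by default, keeps a good room-temperature result from standing in for the conditions that caused the problem in the first place.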

The rollout was the kind of process change that benefits from documentation. We updated the bill of materials, validated the change against our QC framework, ran a transition batch with the new oscillator, and verified that the failure mode I'd been chasing was gone in the new units. We also documented the before-and-after data so we'd have a clean record of why the change was made.

The result

The failure mode was nearly eliminated. The handful of edge cases that remained traced to causes other than clock startup, and the fleet-wide registration reliability improved by enough that customer-side reports of intermittent boot failures essentially stopped. The MEMS oscillator change is now permanent across the product line.

The meta-lesson

The biggest thing this project taught me wasn't about quartz crystals or cellular modules. It was about how to think about intermittent failures.

Intermittent doesn't mean random. It means “the failure happens under conditions you haven't characterized yet.” When something fails sometimes, the question isn't “why is this random?” — it's “what's different about the times it fails?” That reframe is where the breakthrough comes from. Once you start looking for the variable that distinguishes failure from success, you can usually find it.

It also taught me to look at the boundary conditions early. Most of the time, the failure isn't in the middle of the operating envelope — it's at the edges. Temperature extremes, voltage extremes, manufacturing tolerance extremes. The cases that are easy to ignore in normal testing are often the ones that produce the bugs you're trying to catch.

And finally, it taught me that a stubborn investigation is worth more than a fast one. The first four hypotheses I tested were wrong. The fifth was right, but I would never have gotten to the fifth if I'd given up on the fourth. That's worth remembering when you're staring at a problem that doesn't want to give up its answers.
