1. 15

  2. 2

    That seems like such a terrible flaw that it’s hard to believe a commercial vendor would sell such parts. “Here’s a real-time clock which will only work for 20 years.”

    1. 8

      IIRC this was not a hardware failure, but a software one. NASA published an assessment a while back:

      > Although the exact cause of the mission loss is not known with certainty, the spacecraft operations team at JPL and its major contractor uncovered a potential problem with computer time-tagging that could have led to loss of spacecraft attitude control. DI/EPOXI flight software includes a fault management function that reports and logs all onboard fault protection activities and events, such as declaring a symptom, declaring a fault, running a response, or the occurrence of a processor reset. The function computes an integer representation of spacecraft time in tenths of a second by taking the spacecraft clock represented in seconds as a 64-bit floating point number, multiplying it by 10, and casting it back into a 32-bit unsigned (i.e., non-negative) integer. (“Floating point-to-integer conversion” is necessary in order to convert a decimal number issued by the clock into a binary value recognized by the computer.) The computation does not guard against an overflow condition where spacecraft time in tenths of a second can no longer be fully represented within 32 bits. This occurred when the spacecraft clock reached 429,496,729.6 seconds, which corresponded to the date “DOY 223” (2013). At this point in time or beyond, the computation results in a floating point error that ultimately triggers a processor reset.
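To make the arithmetic concrete, here's a minimal Python sketch of that conversion with the guard the assessment says was missing. This is not the flight code: the names `to_ticks_u32` and `overflow_threshold_seconds` are made up for illustration, and the only facts taken from the assessment are the factor of 10 and the 32-bit unsigned target.

```python
UINT32_LIMIT = 2**32      # first value that no longer fits in 32 unsigned bits
TICKS_PER_SECOND = 10     # the assessment describes multiplying seconds by 10


def to_ticks_u32(seconds: float) -> int:
    """Convert a clock value in seconds to integer ticks, refusing to overflow.

    Hypothetical helper: it adds the range check that the described
    computation lacked.
    """
    ticks = int(seconds * TICKS_PER_SECOND)
    if not 0 <= ticks < UINT32_LIMIT:
        raise OverflowError(f"{seconds} s does not fit in a 32-bit unsigned tick count")
    return ticks


def overflow_threshold_seconds() -> float:
    """Clock value (in seconds) at which the tick count first reaches 2**32."""
    return UINT32_LIMIT / TICKS_PER_SECOND


if __name__ == "__main__":
    print(overflow_threshold_seconds())   # 429496729.6
    print(to_ticks_u32(429_496_729.5))    # 4294967295, the last value that fits
    try:
        to_ticks_u32(429_496_729.6)
    except OverflowError as exc:
        print("guard tripped:", exc)
```

The threshold falls straight out of 2^32 / 10, which matches the 429,496,729.6 seconds in the quote.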

      There is some context here that’s worth keeping in mind.

      It seems surprising that this wasn’t caught during development or testing. Deep Impact was launched in January 2005, and the spacecraft was developed starting around Y2K. At that time, this was a well-understood failure mode, so it would be included in testing plans, especially since such a failure had already been painfully observed in 1996, with the first Ariane 5 launch. Linters would also yell at you.

      However, Deep Impact’s primary mission ended in 2005, after it observed the Tempel 1 impact. Its mission was extended after 2005 more or less by chance, and there were no concrete plans to use it after 2005 at the time of its launch. NASA approved its use for additional missions because it still had manoeuvring fuel left, but its intended lifetime was basically around one year. It’s likely that longer-term use simply wasn’t planned for, so it never made its way into the software specs, or the testing documentation, or if it did, it was simply too low on the list of priorities to matter.

      This is one of the reasons why NASA’s own lessons-learned (LL) doc, and much of the analysis in the wider industry, focused not so much on the overflow aspect but on the common-mode aspect. The overflow was embarrassing, of course, but neither much of a revelation nor really a disaster: the probe had completed its original mission, then four other missions, and was neither doing anything nor specifically planned to do anything else when contact was lost (it had been targeted at an asteroid two years before, but without concrete plans for a scientific mission).

      This particular bug was interesting because it was a kind of failure that could lock down a redundant system: a common-mode fault hits every redundant unit that runs the same software in the same way, so failing over doesn’t get you out of it. Due to a bunch of historical reasons, we do have a history of, well, sucking at building long-running redundant systems (tl;dr for a long time we focused on designing redundant systems to survive transient events, and just didn’t spend enough time thinking about how to make redundant systems that run well the rest of the time).
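A toy model of why the common-mode part matters more than the overflow itself. Everything here is an assumption for illustration: the `FlightComputer` class, the `healthy_side` failover loop, and the idea that both sides read the same clock value are mine, not a description of the actual DI/EPOXI fault protection design.

```python
UINT32_LIMIT = 2**32


class FlightComputer:
    """Hypothetical redundant side that resets when the time tag overflows."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.resets = 0

    def tick(self, spacecraft_seconds: float) -> bool:
        """Return True if this side stays up at the given clock value."""
        if int(spacecraft_seconds * 10) >= UINT32_LIMIT:
            self.resets += 1   # same software, same clock, same failure
            return False
        return True


def healthy_side(computers, spacecraft_seconds):
    """Fail over through the sides in order; return the first one that stays up."""
    for computer in computers:
        if computer.tick(spacecraft_seconds):
            return computer.name
    return None                # every side failed the same way


if __name__ == "__main__":
    sides = [FlightComputer("A"), FlightComputer("B")]
    print(healthy_side(sides, 1_000.0))          # A
    print(healthy_side(sides, 429_496_730.0))    # None
```

Redundancy here protects against one side breaking on its own, not against both sides faithfully running the same bug.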