1. 28

  2. 17

    Just to underscore the lesson here: understanding your business requirements can substantially reduce the cost of development.

    Instead of wasting time hunting down leaks in Valgrind, programming defensively, and doing all that good stuff, they did some easy calculations estimating leak occurrence and just baked in a factor of safety.
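
    As a back-of-envelope illustration with made-up numbers (my own, not the actual missile’s figures), the calculation is just worst-case leak rate × maximum flight time × safety margin:

    #include <cstdio>

    int main() {
        // Hypothetical figures, for illustration only:
        const double leak_kb_per_sec = 10.0;   // assumed worst-case leak rate
        const double max_flight_sec  = 300.0;  // assumed maximum flight time
        const double safety_factor   = 2.0;    // headroom for estimation error
        const double budget_mb = leak_kb_per_sec * max_flight_sec * safety_factor / 1024.0;
        std::printf("Provision at least %.1f MB of RAM for leaks\n", budget_mb);
    }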

    Oftentimes, these sorts of “shortcuts” are left on the table by developers who prefer to live in theory instead of sullying their hands with the messy but lax world of business requirements.

    1. 19

      Of course, this being the military and a missile… they will no doubt develop a longer-range version using the same code after that programmer has left…

      …and then wonder why it suddenly deviated from course, killing whatever was unfortunate enough to be in the wrong place.

      1. 8

        This.

        In fact, this basically happened: the explosion of the Ariane 5 back in June 1996. Luckily, Ariane 5 was an unmanned rocket and no human lives were lost, but “the destroyed rocket and its cargo were valued at $500 million”.

        1. 2
          No. Not this.

          The internal SRI software exception was caused during execution of a data conversion from 64-bit floating point to 16-bit signed integer value. The floating point number which was converted had a value greater than what could be represented by a 16-bit signed integer.

          // Silently truncates; undefined behavior when x is outside short's range
          short i6(double x) { return (short)x; }

          Well, no shit.

          Indeed, doing work you don’t need to do (like memory management in a missile) is the exact sort of thing that caused this kind of bug, not the reverse, as you’re suggesting.
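
          For contrast, here is a minimal sketch of a guarded conversion (my own C++ illustration; the actual Ariane code was Ada, where the range check is implicit and raised the exception):

          #include <limits>
          #include <stdexcept>

          // Sketch: reject out-of-range values explicitly instead of truncating.
          short checked_i6(double x) {
              if (x < std::numeric_limits<short>::min() ||
                  x > std::numeric_limits<short>::max())
                  throw std::range_error("value exceeds 16-bit signed range");
              return static_cast<short>(x);
          }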

          1. 2

            Did you read the report of the Inquiry Board mentioned in the page I linked? If not, I suggest you do.

            I was not arguing whether or not it’s safe to convert double to short like that, or whether or not you should do dynamic memory allocation in a safety-critical system.

            The point my parent comment was making (and what I was referring to by “This”) was: when someone does stupid risky shit like letting everything leak and then tries to justify or compensate for it by doubling the maximum amount of leak possible based on the maximum flight time of a particular device, they’re gonna be in serious trouble when that code is carelessly reused in a future version of that device with different specs and requirements (e.g. a longer flight time).
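
            To make that concrete with made-up numbers: a leak budget sized with a 2x safety factor for a 300-second flight is exhausted partway through a longer one:

            #include <cstdio>

            int main() {
                // Hypothetical figures, for illustration only:
                const double leak_kb_per_sec = 10.0;
                const double budget_kb       = 10.0 * 300.0 * 2.0;  // sized for a 300 s flight
                const double new_flight_sec  = 900.0;               // carelessly reused, longer flight
                const double leaked_kb       = leak_kb_per_sec * new_flight_sec;
                std::printf("budget %.0f KB, leaked %.0f KB: %s\n", budget_kb, leaked_kb,
                            leaked_kb > budget_kb ? "out of memory mid-flight" : "within budget");
            }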

            The Ariane 5 incident which I linked is the exact same scenario: they reused arguably badly written code from the previous Ariane 4 rocket without verifying that all of the assumptions made about the system in those pieces of code still held for Ariane 5.

            I’m not going to quote the entire report, but here are some relevant excerpts:

            • The alignment function is operative for 50 seconds after starting of the Flight Mode of the SRIs which occurs at H0 - 3 seconds for Ariane 5. Consequently, when lift-off occurs, the function continues for approx. 40 seconds of flight. This time sequence is based on a requirement of Ariane 4 and is not required for Ariane 5.
            • The value of BH was much higher than expected because the early part of the trajectory of Ariane 5 differs from that of Ariane 4 and results in considerably higher horizontal velocity values.
            • The same requirement does not apply to Ariane 5, which has a different preparation sequence and it was maintained for commonality reasons, presumably based on the view that, unless proven necessary, it was not wise to make changes in software which worked well on Ariane 4.
            • p) Ariane 5 has a high initial acceleration and a trajectory which leads to a build-up of horizontal velocity which is five times more rapid than for Ariane 4. The higher horizontal velocity of Ariane 5 generated, within the 40-second timeframe, the excessive value which caused the inertial system computers to cease operation.

            1. 1

              I’m not going to quote the entire report, but here are some relevant excerpts

              Except they’re not relevant.

              The relevant bits are the recommendations, which don’t identify code reuse as a factor. But here’s my favourite recommendation:

              no software function should run during flight unless it is needed.

              which supports my assertion, and the original assertion: Don’t free memory if you don’t need to reclaim it.

              1. 1

                Except they’re not relevant.

                They’re relevant to the point I was making.

                I’m not opposed to code reuse when it’s done properly. You want to bring over code from the previous rocket into the new one? That’s perfectly fine if you can justify the existence of every function you’re bringing over (which is what you’re saying), and if it doesn’t interfere with the spec of your new system. They shouldn’t have brought that function over from Ariane 4 into Ariane 5, where it wasn’t needed; and they failed to make sure that what they brought in was consistent with the specs of the new rocket.

      2. 5

        On the other hand, taking these shortcuts without also understanding why the leaks occur can have potentially catastrophic consequences. Feynman links improperly thought-out safety factors to the Challenger explosion.

        1. 2

          Although it’s a good point, I’ll note that both Ada and C++ have ways to reclaim memory as objects go out of scope. It’s just a declaration plus a little effort structuring the program. That would also be a baked-in factor of safety, and much easier than tons of work in Valgrind. They can combine that with a WCET (worst-case execution time) analysis [which they already do] to make sure things stay within timing bounds.
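
          As a minimal sketch of that “just a declaration” approach in C++ (my illustration, assuming nothing about their actual codebase): ownership tied to scope, so memory comes back without any explicit free:

          #include <memory>
          #include <vector>

          // Sketch: both allocations are released automatically when the
          // function returns, with no delete/free calls to forget.
          void process_frame() {
              auto buf = std::make_unique<double[]>(1024);  // heap buffer, scope-owned
              std::vector<double> samples(1024);            // container owns its storage
              // ... use buf and samples ...
          }  // destructors reclaim both here, deterministically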

          As far as using something like what’s in the OP: people do this both when bootstrapping without a GC and when using certain kinds of formal analyses. In both cases, they know they can just run the tool and see whether it succeeds or crashes (e.g. runs out of memory). If it crashes, they’re probably doing it wrong and should rewrite it somehow, or just do smaller pieces at a time with a clean slate between them. Not so different from the memory pools Ada can use for the above stuff.
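
          A rough illustration of that “clean slate between pieces” pattern, as a hand-rolled C++ arena (hypothetical, and it ignores alignment for brevity):

          #include <cstddef>
          #include <vector>

          // Sketch: allocations come out of one block with no per-object frees;
          // reset() wipes the slate before the next piece of work.
          class Arena {
              std::vector<std::byte> buf_;
              std::size_t used_ = 0;
          public:
              explicit Arena(std::size_t bytes) : buf_(bytes) {}
              void* alloc(std::size_t n) {                      // note: ignores alignment
                  if (used_ + n > buf_.size()) return nullptr;  // budget exhausted
                  void* p = buf_.data() + used_;
                  used_ += n;
                  return p;
              }
              void reset() { used_ = 0; }                       // clean slate between pieces
          };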

        2. 8

          This is just sloppy design, unless they analysed the system to ensure the memory leak has no chance of explosive growth under unfavourable codepaths; and that analysis would likely be more expensive than fixing the memory leak in the first place.