This is how we built our system. It is fantastic. We use design by contract and turn on assertions in production. Failed assertions kill the process (no exceptions). Let's just say bugs don't last long. In fact, our bug list has zero items.
We have a watchdog that restarts everything.
I have also been advocating exactly the approach you have described…. alas, I have been overruled by my colleagues.
I would be really interested in a lot more detail about your experience and context.
I proposed and made the argument to my team. We voted on keeping asserts on in production.
I guess I was just convincing. There is good research on this now: a paper from Microsoft Research showing higher-quality code with fewer bugs in proportion to asserts per thousand lines of code.
Maybe hit your team with some actual computer science :-p
What language do you mostly use? I imagine you get a core dump and someone investigates it later? Is that how debugging works for the most part?
C++. Yes, it produces a core dump. It also logs a stack trace and a really nice assert message. Usually we don't even need the core dump because the problem is immediately obvious from the assert message.
This paper is referenced from James Hamilton’s paper on deploying internet scale services. It’s a rather terrifying thing to do the first time in a production system but, IMO, worth it.
I think the biggest source of resistance to this is that, especially if you're running at small scale, components simply don't fail that often. Look at the StackExchange link from yesterday: they have not had a single SSD failure yet (according to the article), and they serve a large number of requests on a small number of components. When failure happens so rarely, it's easy to convince yourself that failure is not something you need to worry about.
We had a discussion on this topic way back when. Interestingly, the discussion went in the opposite direction. For me, the big takeaway was, “just because you can’t always do better, doesn’t mean you should never do better”.
Hmm. I think that argument is a bit wrong… you don't have earthquakes / floods / meteors every day.
Programs do crash / run out of resources / net links fail… every day.
The “Continuous Delivery” guys have an important concept.
The important thing in delivering very high availability is not Mean Time Between Failure (MTBF).
Sure, having a very high MTBF is great.
But beyond a certain point, the things that are going to fail you become exponentially less and less controllable by you.
The critical measure then becomes MTRS, Mean Time to Recover Service.
The linked paper addresses some of the issues you and @tedu discuss. Specifically, it argues for a retry mechanism upstream, shielding the external client from the failure where possible.
It’s also important to note that this is talking about highly available services. The cost of your recovery code not working is higher than the cost of a few clients having to retry their operation.
Isn’t it strange that they include postgres as an example of crash-only software? As far as I’m aware, postgres has a “clean shutdown” mode that finishes transactions in progress rather than just zapping them with kill -9, making it clearly not crash-only software. The documentation even specifically warns against using SIGKILL to stop the server.
I think there’s a message of “software should be resilient” in here, which is good, but it’s getting swept up in some fiery ideological speech.