A web search for “Eeek! page_mapcount(page) went negative! (-1)” returns 303 results for me.
The searches seem to indicate that users have been periodically running into this error message since 2007.
This report from 2008 seems to indicate a way to reproduce it, and someone seems to have found a commit that introduced the bug. The bug was closed with an unsatisfying:
“as the package user-mode-linux has just been removed from the Debian archive unstable we hereby close the associated bug reports.”
Sure, so maybe it was reproducible. All that Matt said is that until a pattern emerges, cosmic rays are a reasonable conclusion.
If it’s reproducible, meaning there is a sequence of actions you can take to generate that condition, I’d say it’s unlikely to be cosmic rays.
I’d say that responding to a bug report with “cosmic rays” is unprofessional. There is a stock answer that goes “Sorry, can’t reproduce. Closing, please re-open if you can give me steps to reproduce” that is much more helpful and less antagonistic.
I work with pretty friendly people and so I look at reports of the hostility in software with a skeptical eye, but little things like this remind me that life in the software development world can get nasty for no reason at all.
There was no reproducibility at that time, and it could be explained by cosmic rays.
Matt is one of the nicest Linux hackers I know. Mercurial is nice and has a good UI because Matt cares about being nice. His explanation here of why it’s reasonable to blame cosmic rays until a pattern emerges (which it hadn’t at the time he wrote that) is a very nice and detailed way of explaining why they did not want to look into the problem any deeper.
Awesome post. For those curious, the field of Immunity-Aware Programming tries to address some of these problems.
This is becoming an increasingly severe problem in HPC, to the point where software needs to be written in an explicitly fault-tolerant fashion, since errors like these or even hardware failures will happen on nearly every exaflop run. Even the petaflop machines that are typical today need special handling for hardware failures to avoid crashing constantly.
Would you mind elaborating on the techniques used when attempting to be fault tolerant of bit flips?
One place to start is actually Tandem Computers which were built for fault tolerance, basically by running two computers.
NASA’s guidance system, among other things, has 3 or 4 computers which all compute the same thing then check with each other if they agree.
For systems that can’t run the same computation several times, one can carry a checksum of the data end-to-end, checking it at various places.
I’m sure other solutions exist, but as a non-expert, those are the ones I’ve come across.
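To make the end-to-end checksum idea a bit more concrete, here is a minimal Python sketch. It assumes CRC32 from the standard library's `zlib` as the checksum (the thread doesn't name one), and `attach_checksum`, `verify`, and `flip_bit` are hypothetical names made up for the illustration. The checksum is computed once at the source and re-checked wherever the data lands.

```python
import zlib


def attach_checksum(payload: bytes) -> tuple[bytes, int]:
    """Compute a CRC32 once at the source and ship it along with the data."""
    return payload, zlib.crc32(payload)


def verify(payload: bytes, checksum: int) -> bool:
    """Recompute the CRC32 at a later stage and compare it with the original."""
    return zlib.crc32(payload) == checksum


def flip_bit(payload: bytes, bit_index: int) -> bytes:
    """Simulate a single-bit upset (hypothetical helper, only for this demo)."""
    corrupted = bytearray(payload)
    corrupted[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(corrupted)


data, crc = attach_checksum(b"guidance vector 42")
intact_ok = verify(data, crc)                    # data arrived unchanged
corrupted_ok = verify(flip_bit(data, 5), crc)    # one bit flipped in transit
```

A checksum like this only detects the flip. Recovering from it still needs a second copy of the data or a recomputation.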
What happens when you get an error? E.g., say computer 4 gets hit by a cosmic ray which flips a bit; what’s the procedure for bringing all the computers back into agreement?
If you have multiple computers you can do a quorum. Otherwise, information is lost and it’s up to the situation what you do. You can either fail and tell the user or, if there is a backup policy, execute that.
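A rough sketch of the quorum idea in Python, assuming each replica reports a plain comparable value and that a strict majority is required. `majority_vote` and `disagreeing_replicas` are hypothetical names for illustration, not taken from any real flight or HPC system.

```python
from collections import Counter


def majority_vote(results):
    """Return the value reported by a strict majority of replicas, else None."""
    if not results:
        return None
    value, count = Counter(results).most_common(1)[0]
    return value if count > len(results) / 2 else None


def disagreeing_replicas(results, agreed):
    """Indices of replicas whose answer differs from the agreed value."""
    return [i for i, r in enumerate(results) if r != agreed]


# Four replicas compute the same thing; the fourth has one bit flipped.
results = [7, 7, 7, 7 ^ 4]
agreed = majority_vote(results)
faulty = disagreeing_replicas(results, agreed)

# With no strict majority there is nothing to agree on; the caller has to
# fail and report, or fall back to whatever backup policy exists.
no_quorum = majority_vote([1, 1, 2, 2])
```

The replicas listed in `faulty` would then be resynchronized from the agreed value or restarted; which one depends on the system.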
I’m not terribly familiar with this field, but this report should get you started: http://www.netlib.org/lapack/lawnspdf/lawn289.pdf
Oh nice! I didn’t have that. Thanks.
I second apy’s recommendation of Tandem Computers. I’ll go further with two specific works. The first is by Jim Gray showing how Tandem looked at things systematically to figure out how to eliminate as many error classes as possible. They ended up achieving a five 9’s system. The second is from a competitor, Stratus, covering both hardware and programming techniques for robust systems, including Tandem NonStop.
Why Do Computers Stop and What Can Be Done About It?
Paranoid Programming: Techniques for Constructing Robust Software
Note: First is an old PDF. Second one is a PostScript file from Archive.org since the PDF link is dead with no archive copy.
Thanks for the links! Looks like there is a bunch of goodies in there…
Reminds me of http://www.jakepoz.com/debugging-behind-the-iron-curtain/
TBH sounds apocryphal. The system was in a building away from the tracks, and no living thing can give out that amount of ionizing radiation while still, well, living.