This happened to my family (my Father, older brother, and myself) the last time we tried to run a FreeBSD home server. I was 15(?) at the time. We had decided that we wanted to use ZFS because of all of its cool features, including really nice backup stuff. We made meticulous instructions on how to set up our email, file share and DNS and tested it a couple of times on real hardware before trying to actually replace our old Mac OS X server.
flashbacks intensify
It took years for FreeBSD to be approached by one of us again. We also considered the hard drives to be completely useless at this point. The computer (an Intel NUC) also collected dust for a while.
We didn’t actually diagnose it until a few years later, when we (just my Father and myself this time) tried to reuse the computer at a client’s site. We used CentOS 7 this time. We were seeing some Weird Shit™ like rsync locking up the system, and one of us had the idea to run a memory test.
Memtest86 found it in less than half an hour. This is the only time in my professional career that I’ve ever considered bad memory a possibility. Thankfully it was only one of the modules and we were able to leave soon after we figured this out. (We were only using this computer in the interim while setting up their old Mac OS X server with Linux.)
On the way back late that night, after having everything resolved, my Father told me that he actually remembers having issues with FreeBSD due to bad memory sticks that didn’t happen under any other OS.
sighs intensify
He said he heard that FreeBSD swaps less often than Linux or something like that. Well, since then, I’ve installed FreeBSD on my own computer to fiddle with. It’s pretty nice! I’ve got xdm launching dwm with dmenu, st, surf.
Kind of off-topic: If anybody has any tips on getting tabbed and surf to play nice, I would appreciate it.
It’s a shame that we have to use these consumer-grade systems in order to get a good price. If we could just get manufacturers to treat ECC DRAM as a nice option for consumers, it would be widely available and only slightly more expensive.
In this case, ECC errors or repairs in dmesg would have caught the problem MUCH faster.
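For anyone who does have ECC and wants to look beyond dmesg: on Linux, the EDAC subsystem also exposes per-memory-controller error counters under sysfs. Below is a minimal sketch, assuming a Linux box with an EDAC driver loaded for its memory controller. The function name `read_edac_counts` is made up for illustration, and this does not apply to FreeBSD.

```python
from pathlib import Path
from typing import Dict, Tuple


def read_edac_counts(
    root: str = "/sys/devices/system/edac/mc",
) -> Dict[str, Tuple[int, int]]:
    """Return {controller: (corrected, uncorrected)} from EDAC sysfs counters.

    Assumption: Linux with an EDAC driver loaded. If the directory is
    missing (no ECC, no driver, or another OS), an empty dict is returned.
    """
    counts: Dict[str, Tuple[int, int]] = {}
    base = Path(root)
    if not base.is_dir():
        return counts
    for mc in sorted(base.glob("mc[0-9]*")):
        try:
            corrected = int((mc / "ce_count").read_text().strip())
            uncorrected = int((mc / "ue_count").read_text().strip())
        except (OSError, ValueError):
            # Skip controllers whose counters are unreadable or malformed.
            continue
        counts[mc.name] = (corrected, uncorrected)
    return counts


if __name__ == "__main__":
    for name, (ce, ue) in read_edac_counts().items():
        print(f"{name}: corrected={ce} uncorrected={ue}")
```

A corrected count that keeps climbing on one controller is the sort of early hint that would point at a module long before the crashes start.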
I, for one, will start replacing RAM modules before going down the latter route.
The author seems to be new to the process of troubleshooting, and from this last line in the article, sounds like he still has a ways to go. Basically he observes random issues and starts laying blame on random components with unverified hypotheses. Which is a great way to waste a lot of time and money.
The most effective way to troubleshoot intermittent or random problems is to build a list of all possible causes and eliminate them one by one, in a methodical fashion. If you go through the whole list and still haven’t found the problem, then your list was too small or your verification methods not comprehensive enough. Computers operate in the physical realm and every problem has a cause, even if the solution isn’t obvious.
Whenever I see Weird Shit happening to a server and can’t pin down a known bug via googling, this is my rough order of investigation:
1. Reseat all cables and modules
2. Check Storage
3. Check Memory
4. Swap CPU/motherboard
5. Swap Power supply
The power supply gets moved up to #2 if it’s non-enterprise gear like a desktop workstation or Raspberry Pi.
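The ordering above, including the power supply exception, can be sketched in a few lines of Python. This is only an illustration of the checklist idea: the names `BASE_ORDER`, `investigation_order`, and `first_culprit` are hypothetical, and the actual checks (running memtest, swapping parts) are whatever you plug in.

```python
from typing import Callable, Dict, List, Optional

# Order copied from the list above; the strings are only labels.
BASE_ORDER: List[str] = [
    "Reseat all cables and modules",
    "Check storage",
    "Check memory",
    "Swap CPU/motherboard",
    "Swap power supply",
]


def investigation_order(enterprise_gear: bool = True) -> List[str]:
    """Return the steps in the order to try them."""
    steps = list(BASE_ORDER)
    if not enterprise_gear:
        # Desktop workstation, Raspberry Pi, etc.: power supply becomes #2.
        steps.remove("Swap power supply")
        steps.insert(1, "Swap power supply")
    return steps


def first_culprit(
    steps: List[str], checks: Dict[str, Callable[[], bool]]
) -> Optional[str]:
    """Run checks in order; return the first step whose check finds a fault.

    A check returns True when it finds (or fixes) the problem. None means
    every listed cause was eliminated, so the list was too short or the
    checks were not thorough enough.
    """
    for step in steps:
        check = checks.get(step)
        if check is not None and check():
            return step
    return None
```

For example, `investigation_order(enterprise_gear=False)` puts the power supply swap right after reseating, and `first_culprit` stops at the first check that reports a fault rather than blaming components at random.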
I used to work as a repair tech in a computer shop for a few years, and learned that you should pretty much always check RAM as soon as you have any failure, just to be sure. RAM module failure is fairly common regardless of RAM age, and can cause all sorts of Weird Shit™, as shown in this article. It’s pretty easy (just let memtest run overnight) and a very good ROI time-wise.
I find it’s good practice to run this on new hardware too. For example, GitHub had an automated process for this on all new servers (https://github.blog/2015-12-01-githubs-metal-cloud/).
When my first PC arrived I also went down the rabbit hole for around 1.5 years. The system crashed randomly, sometimes at the login screen, sometimes in games, but mostly in games. It was always some random bluescreen with a different error message, or simply a total hang. Due to previous problems with a laptop GPU and the hassle of returning it, I waited 1.5 years before I sent it back and got it repaired. I wish I had done this earlier.
I tested it with Windows’ own memtest back then, but my bad luck was that the “Normal” test couldn’t find anything wrong with that machine. Only the higher-tier test would eventually hang forever. This cost me one full HDD, which no system can recognize as a drive anymore. For some time I developed a fear of the sound pattern a crashed Windows system makes when it replays the last sound buffer over and over.
Ah, I know this sound too well from my Mac and Linux boxes. I feel your pain.