Tuesday, April 01, 2014

Bad Power Supply! >smack<

Now in the closing phase of a hardware crisis involving my server system, I feel the need to blog and record what happened. It started a couple of weeks ago, when the server started randomly powering off for no reason. The first time, I figured it for a glitch and simply powered it back on. The local power has had some brownouts lately, but my UPS is set on the sensitive side and switches to battery power as needed. My next thought was, maybe it was overheating, since we live in a dusty area and I hadn't dusted out the server case in some time. It didn't seem overly hot inside, but I dusted it out anyway. The problem recurred, so I dusted it some more. The secondary display adapter for the second monitor failed to initialize, so I removed the card and dusted it out as well. After this the system remained stable for eight days so I hoped that I had fixed the problem.

At this point I should mention my file systems and raid arrays. All of my file systems had been converted to EXT3 except for the root partition, and this had to be preened after each failure, necessitating a secondary reboot. Long ago I had decided that EXT2 worked just fine with no need to upgrade, but I re-thought that at some point since my raid partition is now quite huge, and preening it after a crash is a lengthy operation. I had upgraded all partitions except the system disk, since I wasn't sure if I could upgrade a mounted partition. Turns out I can, and so I did, so at least the reboots are fairly uneventful.
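For anyone wanting to repeat the conversion: the usual tool for adding a journal to an EXT2 file system is `tune2fs -j <device>`, which also works on a mounted partition. I am assuming that is what was used here. Afterwards the entry in /etc/fstab has to say ext3 instead of ext2. Here is a minimal sketch of that fstab change in Python. The function name and the sample device names are made up for illustration, and it only edits text, it does not touch any real file.

```python
def switch_fstab_to_ext3(fstab_text, mount_point):
    """Return fstab_text with the ext2 entry for mount_point changed to ext3.

    Comment lines and entries for other mount points are left untouched.
    """
    out = []
    for line in fstab_text.splitlines():
        fields = line.split()
        is_comment = line.lstrip().startswith("#")
        if (not is_comment and len(fields) >= 3
                and fields[1] == mount_point and fields[2] == "ext2"):
            fields[2] = "ext3"
            line = "\t".join(fields)
        out.append(line)
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    # Hypothetical fstab contents, not copied from the real server
    sample = (
        "/dev/sda1 / ext2 defaults 1 1\n"
        "/dev/md0 /home ext3 defaults 1 2\n"
    )
    print(switch_fstab_to_ext3(sample, "/"), end="")
```
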

However I was not so lucky with the raid. After the first two power-offs, the array came up clean. But ever since then, it was coming up dirty, needing a three-hour re-sync process. This was starting to concern me, but in hindsight this doesn't appear to have been an issue.

Anyways, after eight days the problem returned. Initially the power-offs were about every twenty-four hours, and then they became more frequent. Finally last night while I was at work, it failed again, and when I got home I was unable to restart it. I had already figured that the power supply was the likely culprit and I had planned to replace it this morning with a newer one of the same model, borrowed from my disused Windows XP system. Instead I went ahead and replaced the power supply last night.

This should have been the end of my troubles, except for one thing. These power supplies have both the SATA-style connectors as well as the older style. However the newer version has more SATA power connectors and fewer of the legacy kind. Support for legacy IDE drives is eroding fast, and non-existent on the newer motherboards. I really need to migrate away from legacy technology but I haven't, for reasons of a monetary nature. So at any rate I worked from top to bottom plugging in devices, and when I got to the bottom I had one hard drive with no power.

I rummage in my closet looking for power splitters. I used to have a lot of these but I can find only one, and it's not in very good shape. I use it anyway. The front intake fan that blows over the hard drives is piggy-backed off one of the drives, and it's running so I figure the power is good. Actually it's not, and the fan uses a different voltage from the drives, so this was not a good test.

I try to boot the system and it can't see the IDE drives at all. Now, when I built this system I made a poor selection in the motherboard department, and bought one that had only one IDE header along with four SATA connectors. Of course, my system has four IDE devices counting the CD and DVD drives. So I bought a JMicron adapter to provide the additional IDE support that I would need. At that time I was running an older version of Linux which could not see both IDE buses for some reason. The JMicron bus caused the one on the motherboard to be invisible, except during the installation process, when it was able to install Slackware without a problem. I subsequently upgraded Slackware and this problem went away.

My first thought is that the JMicron card has failed somehow, so I unplug the IDE cable to the CD/DVD drives, and plug in the hard drive bus in its place. This results in the system hanging during the power-up sequence, during drive detection. I restore the IDE buses to their original places and boot the system. Slackware comes up, albeit without the two IDE drives, and hence is unable to start the raid array.

My memory of the sequence of events is a little muddled since it was late at night, I had just worked a long cashier shift, and I was feeling unusually tired on top of that. At any rate I attempted to boot Slackware on multiple occasions with varying degrees of success. At one point I was getting hard drive error messages, and either the boot process got hung, or it got to the login prompt but I was unable to log in. It prompted for the username but not the password, and then after a pause returned to the username prompt. Ctrl-Alt-Del shutdown would not work since it couldn't execute /sbin/shutdown for some reason.

I got past this impasse somehow, booted the system with everything except the IDE drives, and went to bed.

In the morning I worked on getting a temporary desktop operational. My normal desktop resides on the raid partition, which is mounted as /home. This partition is shared via NFS and Samba. My satellite Linux systems use NFS to mount this as server3:/home. server3 is the third in a line of server systems, starting with a 486 running Windows NT Server, then a Pentium 4 running Windows 2000 Server, and finally the current Linux implementation. I have a feeling I will be wanting to break ground on a new server soon.

Since the laptops might be away from home and away from the /home directory, I created the concept of /away folders, residing on the local drives of satellite systems. User directories are created here, and symbolic links in the /home/ directories point here. Also, in case the "real" /home directory is not mounted, the mount point in the root partition, when not in use as a mount point, contains symbolic links to the user directories in /away. This allows users to log in without the server being present. For the sake of completeness, I created a /away directory on the server as well.
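To make the fallback part of the /away scheme concrete, here is a rough sketch in Python of the links that sit in the bare mount point. It runs against a scratch directory standing in for the root of the file system, so nothing real is touched. The function names and the user name are invented for the example, and my actual setup was done by hand rather than with a script like this.

```python
import os
import tempfile


def setup_away_links(root, users):
    """Create away/<user> directories and fallback symlinks in the home mount point.

    root stands in for "/" so the sketch can run inside a scratch directory.
    """
    away = os.path.join(root, "away")
    home = os.path.join(root, "home")  # the bare mount point, nothing mounted on it
    os.makedirs(away, exist_ok=True)
    os.makedirs(home, exist_ok=True)
    for user in users:
        os.makedirs(os.path.join(away, user), exist_ok=True)
        link = os.path.join(home, user)
        if not os.path.lexists(link):
            # Relative target, so the link resolves wherever root happens to be
            os.symlink(os.path.join("..", "away", user), link)
    return home


def resolve_home(root, user):
    """Show where home/<user> really lands when the server's /home is not mounted."""
    return os.path.realpath(os.path.join(root, "home", user))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as scratch:
        setup_away_links(scratch, ["alice"])  # "alice" is a made-up user
        print(resolve_home(scratch, "alice"))
```

When the server's /home is mounted over the mount point, these links are simply hidden underneath it, which is why they only come into play when the server is unavailable.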

Therefore with the raid partition not mounted, my home directory points to the /away directory and I am able to log in and bring up X-Windows. I get on to Facebook and update my zoos, drink some coffee, then get back to work. I'm able to get on the web, so I do some research. At this point I'm deciding that one of the IDE drives is bad, and it's bringing down the entire IDE bus. So I try unplugging one drive or the other, one at a time. No dice. I seem to have a culprit: if the lower IDE drive is on the bus, the bus is down.

At this point I toy with the notion of putting one of my IDE-to-SATA tailgates on the 'good' IDE drive and connecting it to a SATA port on the motherboard. The tailgate requires power in the form of a floppy style power connection. My power supply has one, and it's way up at the top of the case hanging off the DVD drives. Remember, I connected things from top to bottom. So I rework the power connections. The power drop with the floppy connector has two legacy connectors. These get plugged into the IDE drives, while the splitter goes onto the CD and DVD drives.

At this point things start to look a whole lot better. I seem to recall now how an unpowered IDE drive can bring down an entire bus. I go through a number of power-ons to the setup screen to check device presence, and eventually I boot up Slackware with all hard drives online.

Now we're into the endgame, trying to get the raid array back online. I try to start /dev/md0. It says started with three devices, but it's dirty and won't run. I do a force, it says I/O error and still won't run. I do an examine on each component partition. They all say that one of the drives is removed, except for the one in question which says everything's present. I must have booted Linux with that drive unplugged from the bus, the one I thought was bad, and it got removed automatically. I've had to do this on a prior occasion but it's been a while and I've forgotten. Anyway an article on the web suggests a re-add, and suddenly everything is working. I'm able to run /dev/md0, re-sync commences, I can remount /home. I restart MySQL since it resides on the raid partition, and everything is looking good.
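For future reference, the steps that mattered can be written down as an mdadm command sequence. The sketch below is in Python and only builds and prints the commands, it does not run anything. The member partition names are placeholders I made up for the example, and the exact invocations I typed that night may have differed, so treat this as a reminder of the general shape rather than a transcript.

```python
import shlex


def raid_recovery_commands(array, members, dropped):
    """Return, in order, the mdadm commands for the recovery described above.

    Nothing is executed; the caller decides what to do with the list.
    """
    if dropped not in members:
        raise ValueError("dropped device must be one of the array members")
    # Look at the superblock on every member to see which one is marked removed
    cmds = [["mdadm", "--examine", m] for m in members]
    # Put the dropped member back into the array
    cmds.append(["mdadm", array, "--re-add", dropped])
    # Start the array now that all members are accounted for
    cmds.append(["mdadm", "--run", array])
    return cmds


if __name__ == "__main__":
    # Hypothetical partition names, not the real ones from this server
    members = ["/dev/sda1", "/dev/hda1", "/dev/hdb1"]
    for cmd in raid_recovery_commands("/dev/md0", members, "/dev/hdb1"):
        print(shlex.join(cmd))
```

Progress of the re-sync can then be watched in /proc/mdstat.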

I'll still have to monitor the system over the next several weeks. I'm assuming it was the power supply, but given the intermittent nature of the problem only time will tell. I'm sort of suspecting that one of the rails in the power supply is weak, and prone to overload, causing the power supply to shut down. I may even be able to use the failed unit in a less-demanding system, maybe a desktop with a single hard drive. I document this incident to remind myself in the future of lessons learned, and hope that others can gain some insight from my experiences.