When we last tuned in, it was the day of the public opening and the Walker’s website was down. We join our hero on his way to physically check the troublesome server:
When I arrived at Onvoy, the server appeared to be trying to reboot and just needed a keypress. After that it came up cleanly - the drive uses ext3, a journaling filesystem, so it didn’t even have to check the disk. Problem solved? At the time I didn’t know for sure what had caused the original issue - and I’d deleted most of /var/log/messages (the main system log), which I’d need to diagnose it. (Why? Bad instincts, I guess: the initial assessment showed the /var partition was full - which is enough to hose a system - so I copied most of what I thought I’d need and then emptied the file.)
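A safer version of that cleanup keeps the whole log instead of a hand-picked excerpt. The sketch below is a minimal illustration in Python, not what was actually run on the server: the function name `archive_then_truncate` and the archive directory are hypothetical, and it assumes there is another filesystem with free space to hold the compressed copy.

```python
import gzip
import os
import shutil
import time


def archive_then_truncate(log_path, archive_dir):
    """Save a compressed copy of the entire log, then empty the original.

    archive_dir is assumed to live on a different partition from the
    full one, so the copy does not make the space problem worse.
    """
    os.makedirs(archive_dir, exist_ok=True)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    name = "{}.{}.gz".format(os.path.basename(log_path), stamp)
    archive_path = os.path.join(archive_dir, name)

    # Copy everything, not just the part that looks relevant right now.
    with open(log_path, "rb") as src, gzip.open(archive_path, "wb") as dst:
        shutil.copyfileobj(src, dst)

    # Truncate in place rather than deleting, so a logging daemon that
    # still has the file open keeps writing to the same file.
    # Caveat: lines written between the copy and the truncate are lost.
    with open(log_path, "r+b") as f:
        f.truncate(0)

    return archive_path
```

The copy-then-truncate order is the point: the onset of a problem is usually at the start of the noise, which is exactly the part that gets thrown away when only the tail is kept.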
So I was left with a working server (yes!!) but no solid idea about what had caused the drive I/O errors — the portion of the log file I’d retained only showed the symptoms, not the onset error. I decided the best I could do immediately was to just let it run and watch the logs - and figure out how to restore from our backups.
The restore procedure turned out to be very straightforward, and I immediately took steps to build a set of worst-case-scenario disaster recovery CDs. (These included base OS installs for all our production servers and a CD containing a fresh install of the recovery utility and master boot record images of the servers.)
But watching the logs proved uneventful - even when the server crashed again early Wednesday and again the next Sunday morning. (ahhhhhh!) It seemed whatever was happening took the drive completely offline and hung the entire operating system while it waited for the drive to come back - so the logs stopped being written. That left no permanent data to diagnose the problem. Also, the machine would not successfully reboot until it was power cycled - a soft reboot did not work. (what??!)
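When the local drive is the thing that disappears, anything written to it disappears too, so one common workaround is to send log messages to another machine as they happen. The sketch below is only an illustration of that idea using Python's standard `logging.handlers.SysLogHandler`. Nothing in this story says it was set up on the server, and the function name, logger name, and any log host are hypothetical.

```python
import logging
import logging.handlers


def make_remote_logger(host, port=514, name="diskwatch"):
    """Return a logger that sends each message to a remote syslog host.

    Messages go out over UDP (the SysLogHandler default), so nothing
    here depends on the local disk being writable.
    """
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    handler = logging.handlers.SysLogHandler(address=(host, port))
    logger.addHandler(handler)
    return logger


# Example use (hypothetical host name):
#   log = make_remote_logger("loghost.example.org")
#   log.warning("I/O error seen on system drive")
```

UDP delivery is best effort, so this would not guarantee the last message before a hang gets through, but it would give the onset error a chance to land somewhere that survives the crash.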
If I could catch the server as it crashed or just after, I could physically get to it before it locked up completely and check the logs and dmesg output. Maybe that would give me enough information to fix the crashes. So it became a game of waiting and researching the few clues I had gathered…