We had a week’s downtime, followed by some short disturbances while I bootstrapped the environment after getting the server back online. In this post I’m trying to write down what actually happened, so that I or anyone else who runs into the same situation can try to resolve it without first spending a few days on trial and error while the search engine provides no useful help.
TL;DR: Sorry about all the downtime. This post does not contain any useful information.
Maintenance for server
First of all: at the start of the whole operation, I was rebooting domains into a new kernel (from 2.6.36 to 3.2.2, to be precise) as regular maintenance and to get some kernel features that were originally missing. I compiled the kernel, shut down the domain and rebooted, only to see the boot process hang at domain creation and, after a few minutes, report that it couldn’t create a domain this big because too little memory was available. Of the 5G of memory the machine has, the domain used only 1G, so that shouldn’t have been a problem. It still didn’t work, and when I restarted the domain with the original kernel, I was greeted by a kernel panic. Very frustrating, or as my friend said, “Yattaa!! ^^”.
After a few tries with different domains, the result was always the same: no way to start domains. At this point I was starting to get worried about what was actually broken, but the logs were showing nothing. Absolutely nothing. So it was time to reboot the machine and see if that helped.
The machine is located in a server room approximately 200 kilometers from where I live, so rebooting it is not something I’d eagerly do, but there was no other way. Maybe xend had gone haywire? So I shut down the rest of the services and rebooted the machine. Everything came back up, I logged in and tried to start the domains. Same result. Still wondering what the heck was going on after a few tries with different kernels and domains, I resorted to rebooting the machine again. Nothing.
After a few more tries, I spotted an even stranger problem: file operations didn’t work properly, and /usr was mounted read-only. Even better, there was nothing in the logs, not even the usual “your hard disk has some problems, please create some backups and throw the drive as far as you can”. Frustrated, I tried to reboot again. The server stayed down.
After some serious cursing, I went to fetch the server, only to find that /usr could not be mounted even though the other partitions still worked. The machine was running and showing a login prompt, but without /usr it couldn’t launch even an SSH daemon, for example. Time to make a decision: forcibly restore /usr and trust the machine to work, or take it with me and do a complete reinstall.
Bringing the server back to life
The former could have worked if there had been no other problems with the machine. But the machine hadn’t been working in the first place, so I took it with me to look at what was going on with it in my own time.
Even in a different environment I still couldn’t get the domains up, so I decided to back up the files, install new disks and reinstall the machine. I also swapped some memory sticks and threw one away; it had most likely been damaged while I was fiddling with the hardware, and it was causing serious instability during the installation process.
After the server was reinstalled and the files copied back, I tried to bring the machine back to life, but neither the old nor the new kernel was usable even to the point where a domain could be created. At this point I think I was banging my head against the wall in frustration, or something equally drastic. Well, back to http://ddg.gg to search for what in the world could be the cause.
Domains were “bootable” if I disabled disks and networking; in that case the kernel booted to the stage where it wanted to find the root partition, which failed because the disk was missing. With those enabled, nothing really worked. Eventually I spotted that there is nowadays an alternative way to use Xen: the xl utility. It is meant to replace xm, which I had been using so far. So I tried it… and it worked. Without doing anything fancy. A week’s worth of debugging wasted because of the simple mistake of using a deprecated utility.
xm is deprecated, but that shouldn’t mean it no longer works. Also, xl’s new disk configuration man page, which ironically can only be found on Xen’s wiki and not as a man page, at least on my system, did not exactly explain how to use physical devices, so I altered the deprecated syntax to use full paths instead of the short form that the xm utility required. That means,
disk = [ 'phy:raid10/jambi,xvda,w' ] requires modification to
disk = [ 'phy:/dev/raid10/jambi,xvda,w' ], where there is /dev/ at the start of the path.
The “/dev/” at the start, which xm could not handle, is required by xl. So the deprecated syntax of the xl utility is not the same as the old syntax used by xm, but something new created specifically for xl. Why it is already deprecated, I can only wonder.
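For anyone with more than one domain config to convert, the path change above can be sketched as a small Python helper. This is only an illustration, assuming every relative phy: target lives under /dev/ as in the jambi example; the names to_xl_disk_spec and convert_disks are made up for this sketch and are not part of Xen or its tools.

```python
import re

# Matches a 'phy:' prefix whose target does NOT already start with '/'.
# Assumption: such relative targets are device names under /dev/.
_PHY_RELATIVE = re.compile(r"^phy:(?!/)")


def to_xl_disk_spec(spec: str) -> str:
    """Return the disk spec with '/dev/' added after a relative 'phy:' prefix.

    Specs that already use an absolute path, or that do not use 'phy:',
    are returned unchanged.
    """
    return _PHY_RELATIVE.sub("phy:/dev/", spec, count=1)


def convert_disks(disks):
    """Apply to_xl_disk_spec to every entry of a 'disk = [...]' list."""
    return [to_xl_disk_spec(d) for d in disks]


if __name__ == "__main__":
    old_disks = ["phy:raid10/jambi,xvda,w"]
    print(convert_disks(old_disks))
```

Running it twice on the same list is harmless, because specs that already start with phy:/ are left alone.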
Now that the server is live again, I don’t want to test whether xm would work, but I’m fairly sure it wouldn’t. Maybe the xend scripts, which xl doesn’t need, are broken somehow.
So, the server is back online, and hopefully it will stay online for a few years before needing maintenance again. By then I hope to have a slave server that can take over its operations in case of failure.
By the way, I opened a discussion about repositories of Jambi here.