Latest Technical SETInfo

Post by **hyperspace** » Thu Mar 02, 2006 10:15 pm

(From the SETI Technical News page)

March 2, 2006 - 21:15 UTC
So it turns out the master database storage arrays had three drive failures during the long and thorough RAID resync process. We had two hot spares and a spare drive on the shelf. This, along with the fact that the array was RAID 10, means that we shouldn't have lost any data, but the resync process took extra time to deal with these lost drives.
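To illustrate why RAID 10 can absorb several drive failures, here is a minimal sketch. It assumes a hypothetical layout of 8 drives arranged as 4 mirrored pairs (the real arrays' size and layout are not stated in the news post), and the function names are made up for illustration. The array keeps its data as long as no mirror pair loses both of its drives.

```python
from itertools import combinations


def raid10_survives(failed, mirror_pairs):
    """Return True if every mirror pair still has at least one healthy drive."""
    failed = set(failed)
    return all(not set(pair) <= failed for pair in mirror_pairs)


def count_survivable(n_failures, mirror_pairs):
    """Count how many ways n simultaneous failures leave the data intact.

    Returns (survivable_combinations, total_combinations).
    """
    drives = [d for pair in mirror_pairs for d in pair]
    combos = list(combinations(drives, n_failures))
    ok = sum(raid10_survives(c, mirror_pairs) for c in combos)
    return ok, len(combos)


# Hypothetical layout (an assumption, not the real array): 4 mirrored pairs.
pairs = [(0, 1), (2, 3), (4, 5), (6, 7)]

# Three failures spread across different pairs: data is still intact.
spread_out = raid10_survives({0, 2, 4}, pairs)

# Two failures that hit both halves of one mirror: data is lost.
same_pair = raid10_survives({0, 1}, pairs)

survivable, total = count_survivable(3, pairs)
```

This models simultaneous failures only. Hot spares that finish rebuilding between failures, as described above, improve the odds further, since each rebuilt mirror is whole again before the next drive dies.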

Why did we lose so many drives? These are old storage arrays donated to us a while ago, and the disks came with heavy wear and tear. We already had several other disks fail in this system so this is no big surprise. Once everything is resync'ed (in about 20 minutes from the time of writing) we'll start up the master database, check its tables (which may take as long as 24 hours), do some other hardware testing, and if all is well start up the assimilators/splitters again. If not, we might be out for an extra day as we continue to clean up.

February 28, 2006 - 21:15 UTC
We had a planned outage today to remove a couple more items from the server closet (the Classic SETI@home data server and several large, heavy disk arrays which contained the old science database). In order to safely do so, we wanted to power down several important machines so they wouldn't accidentally get bumped and go down ungracefully.

The Bay Area is having a rough winter, and a storm today brought lightning which knocked out power to the entire campus, including our lab, around 8am. Most of the servers went down without a hitch. And with the power off anyway we went ahead and cleaned up the closet as planned. We can now get behind the racks again without painful contortion.

Powering up the entire network is painful, as servers need to revive in a set order, and many hidden mounting issues come to light (that only get tickled by a reboot). Plus some drives needed some fsck'ing. Everything eventually booted up just fine, except for the master science database.
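The "set order" problem is essentially a dependency ordering. As a rough sketch, assuming a made-up dependency map (the host roles below are hypothetical and only loosely based on the components named in these posts), Python's standard `graphlib` module can compute a valid power-up sequence:

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: each machine lists what must be up before it.
# These names and relationships are assumptions for illustration only.
depends_on = {
    "file-server": [],
    "master-db": ["file-server"],
    "splitter": ["master-db"],
    "assimilator": ["master-db"],
    "web-server": ["file-server"],
}


def boot_order(deps):
    """Return one valid start-up order in which dependencies come first."""
    return list(TopologicalSorter(deps).static_order())


order = boot_order(depends_on)
```

If two machines each wait on the other, `static_order()` raises `graphlib.CycleError`, which is the scripted equivalent of discovering a hidden mount dependency only at reboot time.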

One of the fibre channel loops disappeared on this particular server. Bad cable? Bad GBIC? Not sure just yet, as the terminal wasn't working well enough to give us all the boot diagnostics. We hooked up a laptop and fought with hyperterm to see these messages, but by the time we got that working the machine booted just fine for no apparent reason... but all the metadevices needed to be resynced. This resync could take up to 24 hours, during which the master science database will be down. That means no splitting and no assimilating, and we'll probably run out of work to send before too long. Oh well.

February 28, 2006 - 00:30 UTC
Just a quick update so you know we haven't disappeared. We've entered a phase of massive cleanup - moving machines around in preparation to put newer ones in the server closet. Since we were cracking the whole system open we figured we might as well bite the bullet and clean all our /usr/local's, update old versions of software, etc. So naturally, everything broke. The last couple of weeks have been spent playing a non-stop game of Whac-a-Mole, trying to fix one minor broken thing after another. You may have noticed some of these failures. For example, the user-of-the-day selection was stuck for a week due to a broken path.

There were some other minor issues. One of the assimilators kept crashing with no error messages - after some painful debugging we found it was freaked out by a single corrupt record in the database. But other than that there has been slow, steady progress. The new data recorder is nearing completion (being stress tested at this point), and we're planning to move more old servers out of the closet tomorrow.

February 16, 2006 - 23:00 UTC
Today we had another quick database backup/compression, and then upgraded the MySQL version again (to the latest 4.1.x). It was a painless upgrade, and a couple of problems seem to have cleared up. Most notably, users are now able to "merge computers" again via our web site. This query had been locking up the system.

February 14, 2006 - 23:00 UTC
We had a couple of outages over the past few days. One was unintentional - we are still having database lock issues involving the "merge computers" function on our web site, and this was turned back on accidentally.

Yesterday we had a standard database backup/compression. We're going to begin doing these twice a week as we continue to figure out why we are having throughput issues.

Today we replaced a disk enclosure that was part of our workunit storage array. It was a relatively painless procedure except that the system wasn't recognizing the old disks upon restart. Eventually this was diagnosed: the system's QLogic fibre channel card needed a configuration "refresh." This required hooking up a console and simply entering/exiting the BIOS without editing anything. Anyway, the whole system is back up and running now.

Post by **Derek** » Fri Mar 03, 2006 12:39 am

Thank you. I'm still waiting for work units!

Post by **Dave Rave** » Fri Mar 03, 2006 2:41 am

quick, where's my SetiQueue for Boinc ??????