April 20, 2011

RAIDers of the lost archives


I don't know about IT support people elsewhere, but here we always have computer problems after a long break: the first day after New Year, or, often worse, the first day after the Songkran festival.

On the last day of the long Songkran vacation, the employee RFID scanner's UPS died and somehow took out the scanner's power supply, so I had to scramble to find a replacement before people started coming in to work. At 8 AM on the first day back, everyone turned on the lights and computers at the same time and tripped all the breakers. Normally someone just resets them manually, but for some reason that didn't work this time, so they called the building maintenance people and switched all the breakers off, without notifying IT, and the power to the servers went out.

After the power came back, the affected RAID drives (the SQL server has an SSD RAID1 array) started rebuilding themselves, and I started getting support calls that users couldn't log in to the ERP database. I looked at the SQL server and saw severity 24 errors: hardware failure. Unfortunately, even though I have a nightly backup routine, it had stopped running earlier the previous week, and the SQL Agent never notified me of the failure. If I restored from the last good backup, I'd lose two days of work and get killed by the users and the boss. If I repaired the database, I'd lose less work, but my ERP support guys would have to clean up the damage. DBCC CHECKDB with REPAIR_ALLOW_DATA_LOSS it is, then.
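For the record, the repair itself is only a few statements. This is roughly the sequence I ran, with "ERP" standing in for the real database name; the point of REPAIR_ALLOW_DATA_LOSS is that it deallocates whatever it can't fix, so you only find out afterwards what you lost. Take whatever backup you can first and kick everyone off before running it:

    -- kick out connected users so the repair can run
    ALTER DATABASE ERP SET SINGLE_USER WITH ROLLBACK IMMEDIATE;

    -- repair, accepting that corrupt pages may be thrown away
    DBCC CHECKDB (ERP, REPAIR_ALLOW_DATA_LOSS);

    -- verify nothing is still broken before letting users back in
    DBCC CHECKDB (ERP) WITH NO_INFOMSGS;

    ALTER DATABASE ERP SET MULTI_USER;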

Fortunately, there wasn't much data loss and internal consistency wasn't compromised, but my ERP guys are still working on the problem as I'm typing this. Later in the afternoon, more support calls started coming in from users on non-standard clients such as Windows 98, and from devices like scanners that automatically scan and save to shared folders on the server. I turned my attention to the new domain controller to check user and folder rights. Then I noticed a little blinking icon on the task bar: one drive in the RAID array had failed.

One of the first posts of this blog was about putting Western Digital RE2 drives into a Buffalo TeraStation. That was in early 2008. Three years later, five of the original fourteen RE2 drives I bought have already failed. The reason I got fourteen drives at the time was to put four each into two TeraStations and three each into two Windows servers. These are supposed to be special drives designed for RAID use, yet the stupidly cheap "consumer" drives that I also use for RAID haven't had any problems, while these expensive RAID drives are dropping like flies.

One time it was especially bad: a drive failed in a TeraStation's RAID5 array, and when I put in a new drive to rebuild it, another drive went bad and took out the entire array. I had to reflash the firmware to bring it back. Last year, a drive in a server's RAID5 array went bad; imagine how I felt when it took 30 hours to rebuild the array.

Luckily I happened to have a test server on an isolated network, so I quickly joined it to the domain and promoted it to a domain controller, then proceeded to replace the failed drive.

During Songkran vacation, while I worked to replace the old server, which I'll write about another time, there was a bunch of half-naked ladyboys outside making noise and splashing water. Had I known I was gonna have all these problems, I would've invited them to come in and just blow up my servers and preemptively end all this misery.

My SQL data and failed hard drive have gone to Bit Heaven. They served well and died without thanks. May they rest in peace.
