April 20, 2011

RAIDers of the lost archives


I don't know about other IT support people elsewhere, but for me, we always have computer problems after a long break, like the first day after the New Year, or, often worse, the first day after Songkran festival.

On the last day of the long Songkran festival vacation, the employee RFID scanner's UPS died and somehow took out the scanner's power supply, so I had to quickly find a replacement before people started coming in to work. At 8 AM on the first day of work, everyone started turning on the lights and computers at the same time, and all the breakers tripped. Normally someone just has to manually reset the breakers, but for some reason resetting didn't work this time, so they called the building maintenance people and then switched all the breakers off without notifying the IT people, and the power to the servers went out.

After the power came back, the affected RAID drives (the SQL server has an SSD RAID1 array) started rebuilding themselves, and I started getting support calls that users couldn't log in to the ERP database. I looked at the SQL server and saw a severity level 024 alert: Hardware Error. Unfortunately, even though I have a nightly backup routine, it had not run since early last week, and the SQL Agent didn't notify me of its failure. If I restore from the last good backup, I lose two days of work and get killed by the users and the boss. If I repair the database, I lose less work but my ERP support guys have to fix it. DBCC CHECKDB with REPAIR_ALLOW_DATA_LOSS it is then.
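In hindsight, the real problem was that the backup job failed silently. Here is a minimal sketch of an independent check that doesn't rely on the SQL Agent at all: it just looks at how old the newest backup file in a folder is. The function names, the `*.bak` pattern, and the 26-hour threshold are my own assumptions for illustration, not anything from my actual setup.

```python
import time
from pathlib import Path


def newest_backup_age_hours(folder, pattern="*.bak", now=None):
    """Return the age in hours of the newest matching file, or None if there is none."""
    now = time.time() if now is None else now
    mtimes = [p.stat().st_mtime for p in Path(folder).glob(pattern)]
    if not mtimes:
        return None
    return (now - max(mtimes)) / 3600.0


def backup_is_stale(folder, max_age_hours=26, pattern="*.bak", now=None):
    """True if there is no backup at all, or the newest one is older than the threshold.

    26 hours is an assumed threshold: a nightly job plus a little slack.
    """
    age = newest_backup_age_hours(folder, pattern, now)
    return age is None or age > max_age_hours
```

Something like this could run from a scheduled task on a different machine and send a message whenever `backup_is_stale(...)` returns True for the backup folder, so that a dead backup job gets noticed within a day instead of a week.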

Fortunately, there wasn't much data loss and internal consistency wasn't compromised, but my ERP guys are still working on the problem as I'm typing this. Later in the afternoon, more support calls started coming in from users using non-standard clients such as Windows 98 and things like scanners that automatically scan and save to shared folders on the server. I turned my attention to the new domain controller to check user and folder rights. Then I noticed a little blinking icon on the task bar: one drive on the RAID array had failed.

One of the first posts of this blog was about putting Western Digital RE2 drives into a Buffalo TeraStation. That was in early 2008. Three years later, five of the original fourteen RE2 drives I bought have already failed. The reason I got fourteen drives at the time was to put four each into two TeraStations, and three each into two Windows servers. I thought these were supposed to be special drives designed for RAID use, but the stupidly cheap "consumer" drives that I also use for RAID have not had any problems, and these expensive RAID drives are dropping like flies.
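To put a number on "dropping like flies", here is a rough back-of-the-envelope calculation. It assumes all fourteen drives were in service for the full three years and that the failure rate was constant over that time, neither of which I can actually confirm, so treat the result as a ballpark figure only.

```python
def annualized_failure_rate(failed, total, years):
    """Estimate a yearly failure rate from a cumulative count.

    Assumes a constant failure rate and that every drive was in
    service for the whole period (both are simplifications).
    """
    survival = 1 - failed / total          # fraction still alive at the end
    return 1 - survival ** (1 / years)     # per-year failure probability


afr = annualized_failure_rate(failed=5, total=14, years=3)
print(f"Cumulative: {5 / 14:.1%}, annualized: {afr:.1%}")
```

Five out of fourteen is about 36% of the drives gone in three years, which is a lot for drives sold specifically for RAID use.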

One time it was really bad: one drive failed in a TeraStation's RAID5 array, and when I put in a new drive to rebuild the array, another drive went bad and took out the entire array. I had to reflash the firmware to bring it back. Last year, one drive in a server's RAID5 array went bad. Imagine how bad I felt when it took 30 hours to rebuild the array.

Luckily I happened to have a test server in an isolated network, so I quickly joined it to the domain and promoted it to a domain controller, then proceeded to replace the failed drive.

During Songkran vacation, while I worked to replace the old server, which I'll write about another time, there was a bunch of half-naked ladyboys outside making noise and splashing water. Had I known I was gonna have all these problems, I would've invited them to come in and just blow up my servers and preemptively end all this misery.

My SQL data and failed hard drive have gone to Bit Heaven. They served well and died without thanks. May they rest in peace.

April 16, 2011

Throwing hardware at the problem

My ERP project had been live for a few months, but I never managed to resolve the slowness issue. I'm not very good with SQL Server, and all the optimizations I did probably had a limited effect on performance, since Dynamics NAV's C/SIDE code is executed client side and the idiots I have for consultants don't know the difference between the native database and SQL Server. I couldn't get to the ERP's code, and even if I could, I probably wouldn't want to hack the code anyway.

A few months ago my boss suddenly told me that we made huge profits this month, and to avoid paying huge amounts of tax on the huge profits, he's giving me special permission to buy stuff. A lot of stuff.

Old server specs (purchased in October 2007): Intel Core 2 Quad Q6600 (2.4 GHz), 8 GB DDR2 RAM, ASUS P5K Premium, and 2x500 GB SATA (RAID1). (Was 3x500 GB SATA RAID5, but I listened to the consultants and changed the RAID5 to RAID1, with no difference in speed whatsoever.)

New server specs (purchased in late 2010):


Oh wait...

New server specs: Intel Core i7-950 (3.06 GHz), 24 GB DDR3 RAM, ASUS P6X58D Premium, and 2x160 GB SSD (RAID1).

The Windows Experience Index score under Windows 7 is 7.5 for the CPU and RAM, and 7.9 for the hard drive. I wanted to buy a faster CPU, but the 950 is the most cost-effective, and the RAM is already maxed out for this Core i7.

Oh, and the ERP didn't run any faster than before. None whatsoever.

I've talked about my Core 2 Quad Q6600 servers many times before. They were originally bought to replace the older domain controllers and run the ERP database. But I used them for SQL only due to performance recommendations, not that it made any difference since my database is so small. Now that I have the Core i7 dedicated to SQL, I decided to replace the domain controllers with the Q6600 servers, and also upgrade the entire domain to the 2008 R2 functional level.