RAID – Server Down, RAID 5 failing – life saver !

November 5, 2014

Recently I noticed that my ClustrMaps map was no longer showing up on my blog.


When I clicked the link, I got redirected to this page:

http://blog.clustrmaps.com/2014/11/01/www3-clustrmaps-com-server-down/

SUMMARY: Major disk crash affects maps on www3.clustrmaps.com; a very lengthy (several days) recovery procedure is underway using reliable backups. The original post and updates below are in chronological order, so the latest updates are at the bottom.

[Original post]: We received an alert at 10:15 GMT 1st November 2014 that one of our servers (www3.clustrmaps.com) was not responding. The means that anyone who has a map on that server (you can tell because the link to your big map begins with www3….) will be experiencing a data outage during this period [no visitor counting], for which we apologise. Our hosting provider (SoftLayer.com) is ‘on the case’, and it looks like there is a problem with the RAID disk array that requires a disk swap and rebuild. This normally takes a good few hours, but additionally it has taken some extra time to locate the fault. Please return to (and refresh) this blog posting for updates. Many thanks for your patience and understanding.
-The ClustrMaps Team

So they had a server down for several days because of a failing RAID 5!?

Well, if you thought RAID 5 was there for reliability, I can tell you that you are wrong!

I had a similarly surprising incident myself a couple of years ago.

Luckily for me, I was able to escape the disaster dance like this:

Solution:

1. First, try replacing the failing disk as you normally would in a RAID 5 array. The rebuild should then start automatically.

2. In my case the rebuild failed because of a bad sector on one of the 2 remaining disks.

Pull out one of the failing disks, wait a bit, and try again.

3. If the rebuild fails again, power down the server.

Pull out all of the disks that belong to the faulty logical drive (not the other ones).

4. Restart the server without the disks. This will clear the RAID configuration stored in the array controller's ROM.

5. Put the disks back in, and the server will set up a new RAID configuration based on the configuration data that is stored on the old disks!

6. If this still does not work:

Shut down the server. Remove all disks and delete the array configuration.

Put brand new disks in the server. Restart the server and create a new array configuration from scratch that exactly matches the old one.

7. Stop the server again. Pull out the new disks and put back the old disks.

Start the server again.

This should start the rebuild without a problem, because the RAID controller cache has now been fully cleared.

This little trick saved my day back then!
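For anyone wondering why the rebuild in step 2 can fail at all, here is a minimal Python sketch of the XOR parity idea behind RAID 5. It is a simplified model and not what any particular controller actually does: it looks at a single stripe on a 3-disk array, and the helper names `xor_blocks` and `rebuild_missing` are made up for illustration. A real array rotates parity across the disks and works on much larger blocks.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild_missing(surviving_blocks):
    """Rebuild the one missing block of a stripe from the surviving blocks.

    An unreadable block (bad sector) is modelled as None. In that case the
    function returns None, because XOR parity can only recover a single
    missing block per stripe.
    """
    if any(block is None for block in surviving_blocks):
        return None
    return xor_blocks(surviving_blocks)


# One stripe on a hypothetical 3-disk array: two data blocks plus parity.
data_a = b"RAID"
data_b = b"five"
parity = xor_blocks([data_a, data_b])

# The disk holding data_a was replaced: rebuild it from the other two disks.
rebuilt_ok = rebuild_missing([data_b, parity])

# Same replacement, but data_b now sits on a bad sector of a remaining disk.
rebuilt_bad = rebuild_missing([None, parity])

print(rebuilt_ok)
print(rebuilt_bad)
```

With one disk gone, every stripe needs a clean read from all the remaining disks, so a single bad sector on one of them is enough to stop the rebuild for that stripe.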

PS: If it still does not work, first try a firmware upgrade of the RAID controller, then start again from step 6.

If none of this works, you are unfortunately in the same position as the guys from ClustrMaps, and you need to rebuild the server from scratch.

Install the OS, then the backup software, and do a restore.

That will take you a long time.