Server Down, RAID 5 failing – life saver !

Recently I noticed that my ClustrMap was no longer showing up on my blog.

When I clicked the link, I was redirected to this page:

http://blog.clustrmaps.com/2014/11/01/www3-clustrmaps-com-server-down/

SUMMARY: Major disk crash affects maps on www3.clustrmaps.com; a very lengthy (several days) recovery procedure is underway using reliable backups. The original post and updates below are in chronological order, so the latest updates are at the bottom.

[Original post]: We received an alert at 10:15 GMT 1st November 2014 that one of our servers (www3.clustrmaps.com) was not responding. This means that anyone who has a map on that server (you can tell because the link to your big map begins with www3….) will be experiencing a data outage during this period [no visitor counting], for which we apologise. Our hosting provider (SoftLayer.com) is ‘on the case’, and it looks like there is a problem with the RAID disk array that requires a disk swap and rebuild. This normally takes a good few hours, but additionally it has taken some extra time to locate the fault. Please return to (and refresh) this blog posting for updates. Many thanks for your patience and understanding.
-The ClustrMaps Team

So they had a server down for several days because of a failing RAID 5!?

Well, if you thought RAID 5 was there to guarantee reliability, I can tell you that you are wrong!

I had a similarly surprising incident myself a couple of years ago.

Luckily for me, I was able to escape the disaster like this:

Solution:

1. First, try to replace the failing disk as you normally would in a RAID 5. The rebuild should start automatically.

2. In my case the rebuild failed because of a bad sector on one of the 2 remaining disks.

Pull out one of the failing disks, wait a bit and try again.

3. If the rebuild fails again, power down the server.

Pull out all of the disks that belong to the faulty logical drive (not the other ones).

4. Restart the server without the disks. This will clear the RAID configuration held in the array controller's ROM.

5. Put the disks back in, and the server will set up a new RAID configuration based on the configuration data that is stored on the old disks.

6. If this still does not work, shut down the server. Remove all disks and delete the array configuration.

Put brand new disks in the server. Restart the server and create a new array configuration from scratch that exactly matches the old one.

7. Stop the server again. Pull out the new disks and put back the old disks.

Start the server again.

This should start the rebuild without a problem, because the RAID controller cache has now been fully cleared.

This little trick saved my day back then!

PS: If it still does not work, first try a firmware upgrade of the RAID controller, then start again from step 6.

If none of this works, you are unfortunately in the same position as the guys from ClustrMaps, and you need to rebuild the server from scratch.

Install the OS first, then the backup software, and do a restore.

That will take you a long time.
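
To see why a single bad sector on a surviving disk (as in step 2) is enough to stop a RAID 5 rebuild, here is a minimal Python sketch of XOR parity, the scheme RAID 5 is built on. The function names and the tiny 3-disk, 4-byte stripe are made up for illustration; a real controller works on much larger stripes and rotates the parity across the disks.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, position by position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild_missing(surviving_blocks):
    """Rebuild the one missing block of a stripe from all surviving blocks.

    A block given as None stands for an unreadable sector (illustrative only).
    """
    if any(block is None for block in surviving_blocks):
        raise OSError("unreadable block on a surviving disk: stripe cannot be rebuilt")
    return xor_blocks(surviving_blocks)


# One stripe on a 3-disk RAID 5: two data blocks plus their parity.
d0 = b"\x01\x02\x03\x04"
d1 = b"\x10\x20\x30\x40"
parity = xor_blocks([d0, d1])

# Disk 0 has failed and been replaced: its block comes back from the other two.
rebuilt = rebuild_missing([d1, parity])
print(rebuilt == d0)  # True

# The same rebuild with a bad sector on disk 1 has nothing to work with.
try:
    rebuild_missing([None, parity])
except OSError as err:
    print("rebuild failed:", err)
```

The missing block is the XOR of all the others, so every surviving block in the stripe has to be readable. With one disk already gone there is no second copy to fall back on.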

One Response to Server Down, RAID 5 failing – life saver !

  1. Ron Sunden says:

    Don’t forget that latent disk errors are one of the biggest problems with RAID technology. If you are rebuilding a RAID, an Unrecoverable Read Error (URE) can stop a RAID rebuild in its tracks, essentially making the entire RAID volume unusable.

    http://www.smbitjournal.com/2012/05/when-no-redundancy-is-more-reliable/

    Typical hard drives are rated at around one URE per 10^14 bits read, and 10^14 bits is about 12.5TB. This means that if you read 12TB in one go, whether from 12x 1TB drives or 6x 2TB drives, you should expect close to one URE on average. If you have a RAID-5 group that has seven 2TB drives and one drive fails, the RAID rebuild has to read all of the remaining disks (all six of them). At that point there is a good chance (well over half at the rated figure) that you will hit a URE during the RAID-5 rebuild and the rebuild will fail. If that happens, you have lost all of your data.

    http://www.lucidti.com/zfs-checksums-add-reliability-to-nas-storage
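
The arithmetic in the comment above can be checked with a few lines of Python. This is only a rough sketch under stated assumptions: read errors are independent, they occur at exactly the quoted rate of one per 10^14 bits, and 1TB means 10^12 bytes. The function name is made up for illustration, and real drives often do better or worse than their rated figure.

```python
import math


def ure_probability(bytes_read, bits_per_ure=1e14):
    """Chance of at least one URE while reading `bytes_read` bytes.

    Assumes independent errors at exactly one per `bits_per_ure` bits.
    """
    bits_read = bytes_read * 8
    # P(at least one) = 1 - (1 - 1/rate) ** bits_read,
    # computed with log1p/expm1 so the tiny per-bit rate is not rounded away.
    return -math.expm1(bits_read * math.log1p(-1.0 / bits_per_ure))


TB = 10**12  # assumption: decimal terabytes, as drive vendors count them

# Seven 2TB drives in RAID 5, one failed: the rebuild reads the six survivors.
six_remaining_2tb = ure_probability(6 * 2 * TB)
print(f"{six_remaining_2tb:.0%}")  # about 62%
```

Under these assumptions, a rebuild that reads 12TB has roughly a 6 in 10 chance of hitting a URE. That is short of a certainty, but far too high for something that is supposed to protect your data, and the odds get worse as the array grows.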
