Welcome to the QA Tech-Tips blog!

Some see things as they are, and ask "Why?"  I dream things that never were, and ask "Why not?"
Robert F. Kennedy

“Impossible” is only found in the dictionary of a fool.  
Old Chinese Proverb

Monday, February 9, 2009

Resurrecting a Failed Hard Drive (on a Linux system)

Originally published on 2/9/09 as
QA Tech-Tip - "Resurrecting the Dead!"



Author's note:
This article describes how I managed to recover and restore a corrupted Western Digital My Book World II NAS device which used a variant of Linux as the base operating system.

It originally appeared as an article I contributed to the Hacking the My Book World Wiki site (http://mybookworld.wikidot.com/rescue-procedure-take-2), and was subsequently re-published as a link in the February 2009 QA Tech-Tip, "Resurrecting the Dead!"

Despite the fact that this article primarily involves the use of a Linux system and Linux system commands to recover the data, I have also used Linux systems - and the associated rescue utilities like dd_rescue - to recover data from other operating systems as well.  Since these tools and techniques are - essentially - independent of the file system being recovered, this article should be useful to anyone wishing to recover data potentially lost to a file-system crash.

It is my hope that the information contained herein, and the knowledge gained from it, may help you with your own data recovery efforts.

To this end, I dedicate this article.

Jim



How to recover data from a crashed MBWE-II


Acknowledgments:


I want to acknowledge the help given me by Gabriel (who sure earned his name this time!) along with everyone else on these fora who posted their own experiences with the MBWE. Without your help I would have been SO SCREWED it would not be funny.

As we all know, there's really no excuse for inadequate backups. And of all people, I know better.

However, there I was with 30+ years of accumulated experience, tools, tricks, tips, software, etc. on a single drive - just waiting for Good 'Ole Mr. Murphy to come in and ball it up. This data was both critical and irreplaceable, so "failure is NOT an option!"

There was no choice, I had to recover that data "regardless of cost or loss!" - even if it meant I had to go through those disks byte-by-byte with a disk editor.

I was damned lucky.

I was able to recover about 99% of my data, with the lost data being (relatively) easily replaced.

It did cost me though. I went through about $700.00, four tanks of gasoline, and a number of trips to my local (!!) Micro-Center to get parts and materials. Not to mention two weeks of acid-reflux.

I am taking the trouble to document what eventually succeeded for me - in the hope that it will help others avoid some of the mistakes *I* made.

Lastly, please excuse the length of this article. Even though I will make it as brief as possible, it was a long time in the telling, and it won't be told here in three lines.



Hardware Requirements:

  • Your hard drive must still be spinning, with the potential for recovering data
    Obviously if your drive's platters have frozen solid and don't spin, or the drive is suffering from a gross mechanical defect - such as pieces rattling around inside - your chances of success plummet like a rock.

  • You will need a computer that you can exclusively dedicate to this task for a while
    "A while" might be measured in days, or even weeks. It took me two weeks of trial-and-error to get my data fully recovered.

  • You will need at least twice as many drives as there were drives in your MBWE
    My device had two 500 gig drives, so I purchased four drives to rebuild data on.

  • Each new drive will need to be at least twice the size of the drive you're trying to recover
    Since I had two 500 gig drives, I purchased four 1 TB drives.

  • You will need a controller card - or available SATA space on your recovery system's MoBo - for the extra drives in addition to the drive(s) already in the system

  • You may need a replacement drive for the one that failed
    Try to get as exact a replacement as possible. Western Digital, same size, same model series if possible, etc.



Software Requirements:

  • You will need a flavor of Linux compatible with your system and controller
    • Some people recommend the use of a "Live CD" for the recovery. I don't. I found it very convenient to be able to save log files, as well as some of the smaller data files, to my desktop. It's not so easy to do this with a "Live" CD.
    • Since you will need to download, install, save test artifacts and files, etc. etc. etc. I found it much easier to just do a flat "install from scratch" on the recovery system.
    • Additionally, the "Live" CD's I did try, (Ubuntu, Fedora, Knoppix), did NOT want to work with the SATA (RAID) card I bought. Chip revisions change, and sometimes the older drivers don't like the newer boards. I was able to get newer drivers, but only for Fedora, and they'd ONLY work on an "installed" system from the full-up install DVD - not the "Live CD" install.

    (N.B. I have since upgraded / reinstalled from Fedora 8 (which the drivers were for) to Fedora 10 (the latest stable release as of this writing), and the additional drivers were not required. The Fedora 10 "Live CD" however did not work. Maybe they have to leave things off like "unusual" drivers on the CD? :-) )

  • You will need ddrescue / dd_rescue
    You will need to find, or download, a copy of the program "ddrescue". (It's called "dd_rescue" on some distributions.) If your distribution does not come with it already, install it via your distribution's package manager. (See the example at the end of this list.)

  • You will need mdadm
    This is commonly included in most recent distributions. If it's not included, you can download it via your distribution's package manager.

  • You will need a recent copy of the Western Digital Data Lifeguard Tools CD to make a boot floppy of the Western Digital Data Lifeguard "Diagnostics".

  • You will need to be on excellent terms with Lady Luck!
    Or, as Scripture says: "The effectual fervent prayer of a righteous man availeth much."

    And I'm not kidding.  If you're reading this, you are probably already in Deep Sneakers, and sinking fast. Luck, prayer, whatever, will be a primary constituent of your success.
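Regarding installing ddrescue and mdadm, above: on a Fedora system of this era the install would look something like this. (Package names vary between distributions - some ship "ddrescue", others "dd_rescue" - so check yours first.)

yum install ddrescue mdadm

(On Debian or Ubuntu systems, the equivalent would be "apt-get install ddrescue mdadm".)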



Notes:

  • You need to be logged in as root to do any of this stuff.

  • Be EXTREMELY CAREFUL with the "dd" and "dd_rescue" commands - they are extremely powerful and useful commands - but a tiny typo could render your drives, or your computer, a quivering wasteland.

  • For brevity, I have NOT included examples of every possible command used (e.g. "mount", "umount", "ls", etc.). If you are not sure how to do this stuff, or are not that familiar with Linux, get help!



My MBWE-II Configuration and Status as of the time of the repair

  • My system was set up as a LINEAR array - that is, the two 500 gig drives in my system appeared to be one 1-terabyte drive.
    • Because of this - whether the array is truly linear (concatenated) or a striped RAID-0 - the data spans both drives, so the failure of any one drive means the entire data store was garbage. (If you're not sure how your own array was configured, see the mdadm check after this list.)
    • To recover this - if both drives are spinning! - I needed to copy off the data from both drives to somewhere I could work on it, and then try to "stitch" the two array halves back together again.
  • The Web Setup (admin) page for my system showed "Drive 'A' Failed"
    • When I used dd_rescue (described below) to image the "failed" drive, the system partitions showed a number of "bad blocks". In this case, it turned out that they were not truly defective, just corrupted beyond the operating system's ability to repair them. However, because the system partitions had bad blocks, I had to hope that the system partitions on drive "B" (my hopefully "good" drive) would be intact enough to recover from.
    • In my case, it turned out that my "B" drive was still good - and as dd_rescue proved further down, I had no bad clusters on that drive - so I could use the system partitions from that drive to re-create the partitions on the "bad" drive.

  • I was able to prove - using the Western Digital Drive Diagnostics - that the "A" drive was actually not truly defective. That saved me from having to actually replace the drive. However, if that had been needed, the only difference would be to substitute the NEW hard drive for the OLD one when you begin the drive "A" rebuild process.
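If you're not sure how your own array was configured, mdadm can usually read the RAID superblock off one of the member partitions and tell you. (A hypothetical example - I'm assuming here that the drive shows up as /dev/sdb on your recovery system. This is a read-only check, so it won't violate Rule #1 below.)

mdadm --examine /dev/sdb4

Look for the "Raid Level" line in the output: "linear" means concatenated, "raid0" means striped, and "raid1" means mirrored.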

Recovery Steps:


Rule #1: Don't Touch That Drive!


You are already in trouble. Dinking around with the drive - potentially changing its contents - will only make it worse.

Prepare the new drives to receive the recovery data

  • Open a terminal session - or two! - and SU to root.
    • You will need to be ROOT (super-user) for any of this to work.
    • Each time you shutdown and restart the system, you need to re-open your terminal sessions and re-su to root.

  • Attach all the new drives, create one single partition on each, and format as ext3.
    • You can do this one-at-a-time, or you can attach all four of the new recovery drives to the controller, and format them all up there. (See the example commands below.)
    • Shutdown and remove all formatted drives and set them aside carefully.
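Commands to do the above (a minimal sketch - I'm assuming the new drive appears as /dev/sdc on your system; run "fdisk -l" first and make ABSOLUTELY sure you have the right device before you type anything!):

fdisk -l
parted /dev/sdc mklabel msdos
parted /dev/sdc mkpart primary ext3 0% 100%
mkfs.ext3 /dev/sdc1

fdisk -l = list every drive the system can see, so you can identify the new one.
mklabel msdos = write a fresh (empty) partition table to the new drive.
mkpart = create one partition spanning the whole drive.
mkfs.ext3 = format that partition as ext3.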

Copying the data off the damaged drive.

  • Install the drive that is NOT damaged, and view the partition table with Gparted or QTParted and verify that the partition table is intact.

  • Your partition table should look like this:
    • Unallocated space. (This space is used to store individual system specific data, such as MAC address, serial number, etc.)
    • Partition #1, formatted as ext3. (This is the boot partition, with /boot, /root, etc. on it.)
    • Partition #2, formatted as swap (This is the system paging file.)
    • Partition #3, formatted as ext3 (This is the rest of the O/S, /var, etc.)
    • Partition #4, unknown format. (This is the data-store, don't modify or change this!)
These partitions will be essentially identical between the two drives on a two drive system - Linear array or mirrored.
  • Using dd_rescue, copy the "un-damaged" drive to a file on one of the new drives.
    • This will take a fairly long while - measured in hours.
    • Take note of any failed blocks. (cut-and-paste to a text file.)
    • Shutdown the system, turn it off, remove the new drive with the file, label it, and put it somewhere safe.
    • Attach another new drive.
    • Reboot.
Commands to do the above:

dd_rescue -l /home/**uname**/Desktop/B-logfile.txt -o /home/**uname**/Desktop/B-bbfile.txt -v /dev/sdb /recover/b/b-recover-disk

Don't type the "asterisks" (**)

**uname** = Your username (this is the path to your desktop)
-l = logfile output
-o = bad-block logfile output (you need both of these for repairs)
-v = verbose progress output
/dev/sdb = The physical device the drive is on
/recover/b/b-recover-disk = the output filename for the extracted disk image.

I mounted my "recovery" drives at a mount-point called "/recover" on my system, and the recovery drives were mounted as "a" and "b", so I had "/recover/a" and "/recover/b" as the two recovery drives on my system.

  • Using dd_rescue, copy the last partition from the "undamaged" drive to a file on the new drive.
    • This will also take a long while. Almost exactly as long as the first copy, since this is where most of the data lives.
    • Again, take note of any failed blocks. Hopefully you won't find any on the "2nd" drive during either copy.
    • Shutdown the system, turn it off, remove both the new drive (mark it and put it somewhere safe), and the "B" drive, label and put somewhere else safe.
Commands to do the above:

dd_rescue -l /home/**uname**/Desktop/B-data-logfile.txt -o /home/**uname**/Desktop/B-data-bbfile.txt -v /dev/**sdb4** /recover/b/b-recover-data

(Use different log-file names than you did for the whole-disk copy, so you don't overwrite the earlier logs.)

/dev/sdb4 = The 4th partition on device "sdb". You can copy any partition by naming it here.
/recover/b/b-recover-data = the output file containing the **data** partition from drive "B".


  • Add the failed drive to the system and attempt to verify partitions
    • Attach the failed drive ("A") to the controller where the "B" drive was, and re-run the Gparted / QTParted partition verification step as noted above.
    • Shut down and turn off the system.
IF the "failed" drive's partition table is OK, continue with the next section.

IF the "failed" drive's partition table is NOT OK, continue with the steps below.
  • Use dd to copy the first 512 bytes from the disk with the good partition table.

  • Copy that file to the first 512 bytes of the "bad" disk to see if we can recover valid partition data.
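Commands to do the above (hypothetical device names - here I'm assuming the "good" disk is /dev/sdb and the "bad" disk is /dev/sda; YOURS MAY DIFFER, so triple-check with "fdisk -l" before copying anything):

dd if=/dev/sdb of=/recover/b/good-mbr.bin bs=512 count=1
dd if=/recover/b/good-mbr.bin of=/dev/sda bs=512 count=1

bs=512 count=1 = copy exactly one 512-byte sector - the MBR, which holds the partition table.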

Attempt to recover data from the failed drive

  • Attach the failed drive ("A") to the controller where the "B" drive was, and attach another new drive.

  • Reboot the system.

  • Using dd_rescue, copy the last partition of the "A" drive to a file on the new disk.
    • Again, this will take a long while.
    • Also, take careful note of any bad blocks.
    • Shutdown the system, turn it off, remove and label the new drive, and put it away safely.
    • Attach the last new drive and reboot.

  • Attempt to copy data from the entire disk to a file on the last new hard disk (example commands follow this list)
    • Allow dd_rescue to copy about half the disk contents to a file, then abort it with CTRL-C.
    • Hopefully, one of the two disks had the system partitions without errors.
    • Shutdown the system, turn it off, remove and label the last new drive, and put it away safely, leaving the potentially defective drive attached.
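Commands to do the above (the same pattern as the drive "B" copies - I'm assuming drive "A" also shows up as /dev/sdb, and that each new drive is mounted at /recover/a in turn; use distinct log-file names so you can tell drive "A"'s logs from drive "B"'s):

dd_rescue -l /home/**uname**/Desktop/A-data-logfile.txt -o /home/**uname**/Desktop/A-data-bbfile.txt -v /dev/sdb4 /recover/a/a-recover-data

dd_rescue -l /home/**uname**/Desktop/A-disk-logfile.txt -o /home/**uname**/Desktop/A-disk-bbfile.txt -v /dev/sdb /recover/a/a-recover-disk

(Abort the second, whole-disk, copy with CTRL-C once it's about half done, as described above.)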

At this point, you should have all the images you need.


Verify if the "failed" drive is really bad

  • At this point, the system should be shut down, with all the new drives removed, and the one failing drive still attached.

  • Boot the system using the "Diagnostics" floppy you created from the Western Digital Data Lifeguard CD.

  • Select the correct drive in your system.

  • Run the "Quick Test".
    • It is not necessary to run the "full" test.
    • If the drive passes the "Quick" test, repeat it a few times to verify that it always passes.
    • Ideally, each pass will return an error code of "0000".
  • If the drive passes, mark it so, and put it away.

  • If the drive fails, mark it so, and set it aside where you won't pick it up to use it.
    • The magnets out of a failed H/D make GREAT 'fridge magnets!
    • Replace it with the replacement drive you purchased, or go purchase one. Remember to get as exact a replacement as humanly possible.
  • Repeat this exact procedure, substituting the other MBWE drive, to verify it is OK.

Attempt to rebuild the damaged data array

  • Re-attach the data image drives and prepare to recover
    • Shutdown and turn off the system if not already shutdown.
    • Attach the two drives that have the two data-partition images on them in positions 1 & 2 on the controller.
    • Attach a blank drive - if available - as position #3.
    • Restart the system.
  • Mount the three drives in a convenient location
    • I will assume /recover/a, /recover/b, and /recover/c are the mount points.
    • I am also assuming that the drive with the drive "A" data image is first, the drive "B" data image is second.
  • Loop-mount the recovered data image files created before
    • I will assume that they're named "a-recover-data" and "b-recover-data"
    • Execute the following commands to loop-mount the two image files:
Commands to do the above:

losetup  /dev/loop0  /recover/a/a-recover-data
losetup  /dev/loop1  /recover/b/b-recover-data

This creates two "fake" (virtual) drives mounted on loop0 and loop1 that contain the contents of these two files.

Trick: You can loop-mount ANY valid file-system image - including things like cd/dvd ISO images, etc.

  • Merge the images into a copy of their original array
    • Execute the following command to re-create the original MBWE array structure:
Commands to do the above:

mdadm --assemble  /dev/md1  --force  /dev/loop0  /dev/loop1

This command takes the two loop-mounted array parts and (hopefully!) merges them into an array image similar to the one on the MBWE that the two drives came out of.

Hopefully the array built - and started! - correctly. If it didn't, I don't know how to help you here.

Assuming the array built correctly - mount /dev/md1 wherever convenient. (Let's assume /recover/md1)

Navigate to the mount point, and view the contents of the root of that "drive". If all has gone well, at this point you should see a filesystem containing folders and data - as you had it on the original MBWE.
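Commands to do the above:

mkdir -p /recover/md1
mount /dev/md1 /recover/md1
ls -l /recover/md1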

If you successfully see a filesystem - congratulate yourself, take a deep breath, and perhaps take a short break.

If you don't have a filesystem here - I am not sure how to fix this. Not without messing with it myself.

Make a "backup" of the filesystem's apparent content.

  • Very Important!
    • Using "cp -R", copy the entire contents of the /dev/md1 mount point to the empty drive you have mounted at your third hard drive mount point.
    • This will take a while. Take careful note of any files that generate errors.
    • We do this because when we try to repair the two partition images, things might get destroyed.
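Commands to do the above (a sketch - I'm assuming the empty third drive is mounted at /recover/c; the "2>" redirection saves any error messages to a file so you can review them afterward):

cp -Rv /recover/md1/. /recover/c/ 2> /home/**uname**/Desktop/copy-errors.txt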

Attempt to repair / recover the partition images

  • Check array partitions for consistency
    • Execute the following command to verify the structure of the array partition's filesystem.
Commands to do the above:

fsck -t ext3 /dev/md1 -- -n -f -v

-n = Don't actually fix anything
-f = Force scan, even if screwy.
-v = Tell us a lot about what you see. 
    • Again, remember to take careful note of any errors or issues seen.
    • In my case, there were a lot of "inode hash" errors
  • Try a "real" fsck to clean up issues
    • This will discover if any of the issues disclosed were "serious" issues. (They probably are, but we can see if we get lucky…)
    • Execute the following command:
Commands to do the above:

fsck -t ext3  /dev/md1  -- -D -p -f -v

-D = consolidate and re-index directories.
-p = "Preen" (auto-repair) non-critical problems.
-f = Force checking
-v = Tell us what's happening.
    • You may get a "/dev/md1: Adding dirhash hint to filesystem" message when you start the "real" fsck. This is indicating that fsck is updating the partition to handle indexing properly. This is a non-problem.
    • When I did this, it still bailed out on me because "inode hash" issues are considered "critical" problems. If you force the fix - and you will need to, trust me - the directories and/or files with the inode hash errors will be deleted, and the space they consumed returned to the free pool.
  • Retry fsck forcing it to fix all errors found
    • We will need to absolutely clean up the issues found, so we must (at this point) force fsck to fix things.
    • Execute the following commands to do this:
Commands to do the above:

fsck -t ext3  /dev/md1  -- -y -f -v

(note, we're omitting the "-D" here deliberately.)

-y = force auto fix (answer any question "yes!")
    • Re-execute the same command again to verify all issues have been resolved.
    • Repeat until there are no more errors found.
    • Once everything is OK, re-run fsck again to optimize and re-index directories.
Commands to do the above:

fsck -t ext3 /dev/md1 -- -D -y -f -v

This works just like before, except the "-D" forces directory re-indexing and optimization again, which helps.

  • Un-mount /dev/md1, and stop the array
Commands to do the above:

umount /dev/md1
mdadm --stop /dev/md1
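If you're going to swap drives around at this point, you may also want to detach the two loop devices (harmless to skip - a reboot clears them anyway):

losetup -d /dev/loop0
losetup -d /dev/loop1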

Stop and take stock of things

Where we should be now

  • We should have two partition image files loop-mounted.
  • We should have them successfully assembled into an array.
  • We should have successfully run fsck on the array partition and cleaned up any errors.
  • We should have at least ONE good disk out of the two that came from the MBWE.
  • We should have at least ONE good system image from the two drives.
    • If you don't, you will need to download one and follow instructions to install it at a later step.

Begin rebuilding the two drives for the MBWE.

  • I am assuming that the "B" drive contained no bad blocks - and if there were, they are in the data partition, not the system partitions.

  • I am also assuming that we have a good drive "A", or a replacement, that may not have a good system image on it.
    • If this is not true - i.e. you do not have ANY good system images - skip the single step below, download a system image, and follow the instructions to install it on the two drives, creating the last (fourth) partition.
    • Using dd_rescue, copy the entirety of drive "B" to drive "A". This will replace the bad/missing system partitions, and re-create the 4th partition for the data. (Example commands for these copies appear after this list.)
    • After this is about 1/2 done, stop the copy with CTRL-C.
  • Using dd_rescue, copy the drive "A" data partition image that we fixed-up before, back to partition 4 of drive "A".
    • We use dd_rescue instead of "dd" - because dd_rescue will properly detect the end of the drive/data and will make sure every byte gets written.  "dd" - when it reaches the end of the drive - would simply fail, and not write the last few blocks of data.

  • Using dd_rescue, copy the drive "B" data partition image that we fixed-up before, back to partition 4 of drive "B".

  • Once that is done, completely shut-down and turn off power.
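Commands to do the above (hypothetical device names one last time - I'm assuming drive "A" is /dev/sda, drive "B" is /dev/sdb, and the image drives are mounted at /recover/a and /recover/b; TRIPLE-CHECK your device names, because a mistake here writes over an entire drive!):

dd_rescue -v /dev/sdb /dev/sda
(the whole-drive "B"-to-"A" copy - abort with CTRL-C at about the half-way point)

dd_rescue -v /recover/a/a-recover-data /dev/sda4
dd_rescue -v /recover/b/b-recover-data /dev/sdb4
(the two repaired data-partition images going back to partition 4 of each drive)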

Rebuild the MBWE

  • Re-install the hard drives
    • Replace the two side-rails on each hard drive (if you removed them)
    • Re-insert the two drives into the MBWE, remembering that drive "A" goes in the slot closest to the controller electronics.
    • Re-connect all connectors removed during MBWE tear-down.
  • Reconnect network and power
    • Re-attach the network cable to the MBWE.
    • Re-attach the power connector to the MBWE.
  • FIRE THAT PUPPY UP!! (and pray…)
    • Re-connect power.
    • Carefully monitor the front-panel lights.
At this point, the MBWE should boot, do a final internal fsck - which is indicated by the internal lit ring spinning - and then come fully back on-line.

Note: If you replaced the system partitions with downloaded partition data, you may have to re-configure the MBWE to your needs.

Verify correct operation

  • Attempt to access the web setup page
    • Verify that the web-setup page works, and that the drive status is "OK"
    • Re-configure any settings that you need to change.
  • Attempt to access the pre-existing shares on the MBWE
    • Verify that the original shares on the MBWE exist, you can access them, and you can read-and-write data to them.
    • Note that any files or directories that were "corrected" during the fsck of the partition array above may not be there - you may have to replace this data. THAT is why I asked you to take notes!

Verify everything's correct, replace any lost data, and return to service

  • Satisfy yourself that everything is back to normal, by shutting down the MBWE, re-booting it, etc.
    • You will probably notice that the MBWE is booting up - and serving files - much faster than it ever did before.
    • This is a result of both cleaning up all the cruft and problems, as well as the consolidate, optimize, and re-index steps that we performed during the FSCK operations above.
  • Replace any necessary lost data
    • Replace any necessary lost data as noted during the FSCK passes above.

  • Return to Service
    • Return the MBWE to normal operational status.

Congratulate Yourself on a Job Well Done!

Jim