Welcome to the QA Tech-Tips blog!

Some see things as they are, and ask "Why?"  I dream things that never were, and ask "Why not?"
Robert F. Kennedy

“Impossible” is only found in the dictionary of a fool.  
Old Chinese Proverb

Monday, February 9, 2009

Resurrecting a Failed Hard Drive (on a Linux system)

Originally published on 2/9/09 as
QA Tech-Tip - "Resurrecting the Dead!"



Author's note:
This article describes how I managed to recover and restore a corrupted Western Digital My Book World II NAS device which used a variant of Linux as the base operating system.

It originally appeared as an article I contributed to the "Hacking the My Book World" wiki (http://mybookworld.wikidot.com/rescue-procedure-take-2) and was subsequently re-published as a link in the February 2009 QA Tech-Tip, "Resurrecting the Dead!"

Although this article primarily uses a Linux system and Linux commands to recover the data, I have also used Linux systems - and the associated rescue utilities like dd_rescue - to recover data from other operating systems as well.  Since these tools and techniques are essentially independent of the file system being recovered, they should be useful to anyone trying to recover data that was potentially lost to a file-system crash.

It is my hope that the information contained herein, and the knowledge gained from it, may help you with your own data recovery efforts.

To this end, I dedicate this article.

Jim



How to recover data from a crashed MBWE-II


Acknowledgments:


I want to acknowledge the help given me by Gabriel (who sure earned his name this time!) along with everyone else on these fora who posted their own experiences with the MBWE. Without your help I would have been SO SCREWED it would not be funny.

As we all know, there's really no excuse for inadequate backups. And of all people, I know better.

However, there I was with 30+ years of accumulated experience, tools, tricks, tips, software, etc. on a single drive - just waiting for Good 'Ole Mr. Murphy to come in and ball it up. This data was both critical and irreplaceable, so "failure is NOT an option!"

There was no choice, I had to recover that data "regardless of cost or loss!" - even if it meant I had to go through those disks byte-by-byte with a disk editor.

I was damned lucky.

I was able to recover about 99% of my data, with the lost data being (relatively) easily replaced.

It did cost me though. I went through about $700.00, four tanks of gasoline, and a number of trips to my local (!!) Micro-Center to get parts and materials. Not to mention two weeks of acid-reflux.

I am taking the trouble to document what eventually succeeded for me - in the hope that it will help others avoid some of the mistakes *I* made.

Lastly, please excuse the length of this article. Even though I will make it as brief as possible, it was a long time in the telling, and it won't be told here in three lines.



Hardware Requirements:

  • Your hard drive must still be spinning, with the potential for recovering data
    Obviously if your drive's platters have frozen solid and don't spin, or the drive is suffering from a gross mechanical defect - such as pieces rattling around inside - your chances of success plummet like a rock.

  • You will need a computer that you can exclusively dedicate to this task for a while
    "A while" might be measured in days, or even weeks. It took me two weeks of trial-and-error to get my data fully recovered.

  • You will need at least twice as many drives as there were drives in your MBWE
    My device had two 500 gig drives, so I purchased four drives to rebuild data on.

  • Each new drive will need to be at least twice the size of the drive you're trying to recover
    Since I had two 500 gig drives, I purchased four 1T drives.

  • You will need a controller card - or available SATA space on your recovery system's MoBo - for the extra drives in addition to the drive(s) already in the system

  • You may need a replacement drive for the one that failed
    Try to get as exact a replacement as possible. Western Digital, same size, same model series if possible, etc.



Software Requirements:

  • You will need a flavor of Linux compatible with your system and controller
    • Some people recommend the use of a "Live CD" for the recovery. I don't. I found it very convenient to be able to save log files, as well as some of the smaller data files, to my desktop. It's not so easy to do this with a "Live" CD.
    • Since you will need to download, install, save test artifacts and files, etc. etc. etc. I found it much easier to just do a flat "install from scratch" on the recovery system.
    • Additionally, the "Live" CD's I did try, (Ubuntu, Fedora, Knoppix), did NOT want to work with the SATA (RAID) card I bought. Chip revisions change, and sometimes the older drivers don't like the newer boards. I was able to get newer drivers, but only for Fedora, and they'd ONLY work on an "installed" system from the full-up install DVD - not the "Live CD" install.

    (N.B. I have since upgraded / reinstalled from Fedora 8 (which the drivers were for) to Fedora 10 (the latest stable release as of this writing), and the additional drivers were not required. The Fedora 10 "Live CD" however did not work. Maybe they have to leave things off like "unusual" drivers on the CD? :-) )

  • You will need ddrescue / dd_rescue
    You will need to find, or download, a copy of the program "ddrescue". (It's called "dd_rescue" on some distributions.) If your distribution does not come with that already, download and install it via your distribution's package manager.

  • You will need mdadm
    This is commonly included in most recent distributions. If it's not included, you can download it via your distribution's package manager.

  • You will need a recent copy of the Western Digital Data Lifeguard Tools CD to make a boot floppy of the Western Digital Data Lifeguard "Diagnostics".

  • You will need to be on excellent terms with Lady Luck!
    Or, as Scripture says: "The effectual fervent prayer of a righteous man availeth much."

    And I'm not kidding.  If you're reading this, you are probably already in Deep Sneakers, and sinking fast. Luck, prayer, whatever, will be a primary constituent of your success.



Notes:

  • You need to be logged in as root to do any of this stuff.

  • Be EXTREMELY CAREFUL with the "dd" and "dd_rescue" commands - they are extremely powerful and useful commands - but a tiny typo could reduce your drives, or your computer, to a quivering wasteland.

  • For brevity, I have NOT included examples of every possible command used (i.e. "mount", "umount", "ls", etc.). If you are not sure how to do this stuff, (or are not that familiar with Linux), get help!



My MBWE-II Configuration and Status as of the time of the repair

  • My system was set up as a LINEAR array - that is, the two 500 gig drives in my system appeared to be one single 1 terabyte drive.
    • Because of this - whether the array is truly linear (concatenated) or RAID-0 (striped), the data spans both drives - the failure of either drive means the entire data store is garbage.
    • To recover this - if both drives are spinning! - I needed to copy off the data from both drives to somewhere I could work on it, and then try to "stitch" the two array halves back together again.
  • The Web Setup (admin) page for my system showed "Drive 'A' Failed"
    • When I used dd_rescue (described below) to image the "failed" drive, the system partitions showed a number of "bad blocks". In this case, it turned out that they were not truly defective, just corrupted beyond the operating system's ability to repair them. However, because the system partitions had bad blocks, I had to hope that the system partitions on drive B (my hopefully "good" drive) would be intact enough to recover from.
    • In my case, it turned out that my "B" drive was "still good" - and as dd_rescue proved further down - I had no bad clusters on that drive - so I could try to use the system partitions from that drive to re-create the partitions on the "bad" drive.

  • I was able to prove - using the Western Digital Drive Diagnostics - that the "A" drive was actually not truly defective. That saved me from having to actually replace the drive. However, if that had been needed, the only difference would be to substitute the NEW hard drive for the OLD one when you begin the drive "A" rebuild process.

Recovery Steps:


Rule #1: Don't Touch That Drive!


You are already in trouble. Dinking around with the drive - potentially changing its contents - will only make it worse.

Prepare the new drives to receive the recovery data

  • Open a terminal session - or two! - and SU to root.
    • You will need to be ROOT (super-user) for any of this to work.
    • Each time you shutdown and restart the system, you need to re-open your terminal sessions and re-su to root.

  • Attach all the new drives, create one single partition on each, and format as ext3.
    • You can do this one-at-a-time, or you can attach all four of the new recovery drives to the controller, and format them all up there.
    • Shutdown and remove all formatted drives and set them aside carefully.
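A sketch of the commands for the steps above. The device name (/dev/sdc) and mount points are assumptions of mine - verify your actual device names with "fdisk -l" before formatting anything:

```shell
# Become root - everything in this procedure requires super-user rights.
su -

# List all attached drives so you can verify which device node is which.
# A mistake here is catastrophic, so check twice!
fdisk -l

# Create one partition spanning the whole drive (assumed here to be /dev/sdc;
# repeat for each new recovery drive), then format it as ext3.
fdisk /dev/sdc      # interactively: n (new), p (primary), 1, accept defaults, w (write)
mkfs.ext3 /dev/sdc1

# Create the mount points used later in this article and mount a drive.
mkdir -p /recover/a /recover/b
mount -t ext3 /dev/sdc1 /recover/a
```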

Copying the data off the damaged drive.

  • Install the drive that is NOT damaged, and view the partition table with Gparted or QTParted and verify that the partition table is intact.

  • Your partition table should look like this:
    • Unallocated space. (This space is used to store individual system specific data, such as MAC address, serial number, etc.)
    • Partition #1, formatted as ext3. (This is the boot partition, with /boot, /root, etc. on it.)
    • Partition #2, formatted as swap (This is the system paging file.)
    • Partition #3, formatted as ext3 (This is the rest of the O/S, /var, etc.)
    • Partition #4, unknown format. (This is the data-store, don't modify or change this!)
These partitions will be essentially identical between the two drives on a two drive system - Linear array or mirrored.
  • Using dd_rescue, copy the "un-damaged" drive to a file on one of the new drives.
    • This will take a fairly long while - measured in hours.
    • Take note of any failed blocks. (cut-and-paste to a text file.)
    • Shutdown the system, turn it off, remove the new drive with the file, label it, and put it somewhere safe.
    • Attach another new drive.
    • Reboot.
Commands to do the above:

dd_rescue -l /home/**uname**/Desktop/B-logfile.txt -o /home/**uname**/Desktop/B-bbfile.txt -v /dev/sdb /recover/b/b-recover-disk

Don't type the "asterisks" (**)

**uname** = Your username (this is the path to your desktop)
-l = logfile output
-o = bad-block logfile output (you need both of these for repairs)
/dev/sdb = The physical device the drive is on
/recover/b/b-recover-disk = the output filename for the extracted disk image.
I mounted my "recovery" drives at a mount-point called "/recover" on my system,  and the recovery drives were mounted as "a" and "b", so I had "/recover/a"  and "/recover/b" as the two recovery drives on my system.

  • Using dd_rescue, copy the last partition from the "undamaged" drive to a file on the new drive.
    • This will also take a long while. Almost exactly as long as the first copy, since this is where most of the data lives.
    • Again, take note of any failed blocks. Hopefully you won't find any on the "2nd" drive during either copy.
    • Shutdown the system, turn it off, remove both the new drive (mark it and put it somewhere safe), and the "B" drive, label and put somewhere else safe.
Commands to do the above:

dd_rescue -l /home/**uname**/Desktop/B-data-logfile.txt -o /home/**uname**/Desktop/B-data-bbfile.txt -v /dev/**sdb4** /recover/b/b-recover-data

(Note: I've given the log files different names here so they don't overwrite the logs from the whole-drive copy above.)

/dev/sdb4 = The 4th partition on device "sdb". You can copy any partition by naming it here.
/recover/b/b-recover-data = the output file containing the **data** partition from drive "B".


  • Add the failed drive to the system and attempt to verify partitions
    • Attach the failed drive ("A"), to the controller where the "B" drive was, and re-run the Gparted, QTParted partition verification step as noted above.
    • Shut down and turn off the system.
IF the "failed" drive's partition table is OK, continue with the next section.

IF the "failed" drive's partition table is NOT OK, continue with the steps below.
  • Use dd to copy the first 512 bytes from the disk with the good partition table.

  • Copy that file to the first 512 bytes of the "bad" disk to see if we can recover valid partition data.
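A sketch of those two dd steps. Here "sdb" is assumed to be the good drive and "sda" the bad one - these device names and the file name are assumptions, so verify with "fdisk -l" before running anything:

```shell
# Copy the first 512 bytes (the MBR, which holds the partition table)
# from the GOOD drive to a file.
dd if=/dev/sdb of=/root/good-mbr.bin bs=512 count=1

# Write that saved sector onto the BAD drive, hopefully restoring a valid
# partition table. This is destructive - triple-check the device names!
dd if=/root/good-mbr.bin of=/dev/sda bs=512 count=1

# Reboot (so the kernel re-reads the partition table), then re-check the
# "bad" drive with Gparted or QTParted as before.
```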

Attempt to recover data from the failed drive

  • Attach the failed drive ("A"), to the controller where the "B" drive was, and attach another new drive.

  • Reboot the system.

  • Using dd_rescue, copy the last partition of the "A" drive to a file on the new disk.
    • Again, this will take a long while.
    • Also, take careful note of any bad blocks.
    • Shutdown the system, turn it off, remove and label the new drive, and put it away safely.
    • Attach the last new drive and reboot.

  • Attempt to copy data from the entire disk to a file on the last new hard disk
    • Allow dd_rescue to copy about half the disk contents to a file, then abort it with CTRL-C.
    • Hopefully, one of the two disks had the system partitions without errors.
    • Shutdown the system, turn it off, remove and label the last new drive, and put it away safely, leaving the potentially defective drive attached.
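A sketch of those two copies, mirroring the earlier drive "B" commands. The log file names, device names, and mount points (/recover/c and /recover/d) are my own assumptions - adjust them to your layout:

```shell
# Image the data partition (partition 4) of the suspect "A" drive,
# logging any bad blocks encountered.
dd_rescue -l /home/**uname**/Desktop/A-logfile.txt -o /home/**uname**/Desktop/A-bbfile.txt -v /dev/sda4 /recover/c/a-recover-data

# Start a whole-disk image of drive "A" to the last new drive; let it run
# to roughly the half-way point, then abort it with CTRL-C.
dd_rescue -l /home/**uname**/Desktop/A-disk-logfile.txt -o /home/**uname**/Desktop/A-disk-bbfile.txt -v /dev/sda /recover/d/a-recover-disk
```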

At this point, you should have all the images you need.


Verify if the "failed" drive is really bad

  • At this point, the system should be shut down, with all the new drives removed, and the one failing drive still attached.

  • Boot the system using the "Diagnostics" floppy you created from the Western Digital Data Lifeguard CD.

  • Select the correct drive in your system.

  • Run the "Quick Test".
    • It is not necessary to run the "full" test.
    • If the drive passes the "Quick" test, repeat it a few times to verify that it always passes.
    • Ideally, each pass will return an error code of "0000"
  • If the drive passes, mark it so, and put it away.

  • If the drive fails, mark it so, and set it aside where you won't pick it up to use it.
    • The magnets out of a failed H/D make GREAT 'fridge magnets!
    • Replace it with the replacement drive you purchased, or go purchase one. Remember to get as exact a replacement as humanly possible.
  • Repeat this same exact procedure, substituting the other MBWE drive to verify it is OK.

Attempt to rebuild the damaged data array

  • Re-attach the data image drives and prepare to recover
    • Shutdown and turn off the system if not already shutdown.
    • Attach the two drives that have the two data-partition images on them in positions 1 & 2 on the controller.
    • Attach a blank drive - if available - as position #3.
    • Restart the system.
  • Mount the three drives in a convenient location
    • I will assume /recover/a, /recover/b, and /recover/c are the mount points.
    • I am also assuming that the drive with the drive "A" data image is first, the drive "B" data image is second.
  • Loop-mount the recovered data image files created before
    • I will assume that they're named "a-recover-data" and "b-recover-data"
    • Execute the following commands to loop-mount the two image files:
Commands to do the above:

losetup  /dev/loop0  /recover/a/a-recover-data
losetup  /dev/loop1  /recover/b/b-recover-data

This creates two "fake" (virtual) drives mounted on loop0 and loop1 that contain the contents of these two files.

Trick: You can loop-mount ANY valid file-system image - including things like cd/dvd ISO images, etc.

  • Merge the images into a copy of their original array
    • Execute the following command to re-create the original MBWE array structure:
Commands to do the above:

mdadm --assemble  /dev/md1  --force  /dev/loop0  /dev/loop1

This command takes the two loop-mounted array parts and (hopefully!) merges them into an array image similar to the one on the MBWE that the two drives came out of.

Hopefully the array built - and started! - correctly. If it didn't, I don't know how to help you here.

Assuming the array built correctly - mount /dev/md1 wherever convenient. (Let's assume /recover/md1)

Navigate to the mount point, and view the contents of the root of that "drive". If all has gone well, at this point you should see a filesystem containing folders and data - as you had it on the original MBWE.

If you successfully see a filesystem - congratulate yourself, take a deep breath, and perhaps take a short break.

If you don't have a filesystem here - I am not sure how to fix this. Not without messing with it myself.

Make a "backup" of the filesystem's apparent content.

  • Very Important!
    • Using "cp -R", copy the entire contents of the /dev/md1 mount point to the empty drive you have mounted at your third hard drive mount point.
    • This will take a while. Take careful note of any files that generate errors.
    • We do this because when we try to repair the two partition images, things might get destroyed.
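A sketch of that backup pass. The mount points and error-log name are assumptions, and mounting read-only is my own precaution since we only need to read from the array here (skip the mount if /dev/md1 is still mounted from the previous step):

```shell
# Mount the assembled array read-only and create a destination folder.
mkdir -p /recover/md1 /recover/c/md1-backup
mount -t ext3 -o ro /dev/md1 /recover/md1

# Copy everything, sending any error messages to a log file for later review.
cp -Rv /recover/md1/. /recover/c/md1-backup/ 2> /root/backup-errors.txt

# Any file named in this log may need to be replaced by hand later.
cat /root/backup-errors.txt
```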

Attempt to repair / recover the partition images

  • Check array partitions for consistency
    • Execute the following command to verify the structure of the array partition's filesystem.
Commands to do the above:

fsck -t ext3 /dev/md1 -- -n -f -v

-n = Don't actually fix anything
-f = Force scan, even if screwy.
-v = Tell us a lot about what you see. 
    • Again, remember to take careful note of any errors or issues seen.
    • In my case, there were a lot of "inode hash" errors
  • Try a "real" fsck to clean up issues
    • This will discover if any of the issues disclosed were "serious" issues. (They probably are, but we can see if we get lucky… .)
    • Execute the following command:
Commands to do the above:

fsck -t ext3  /dev/md1  -- -D -p -f -v

-D = consolidate and re-index directories.
-p = "Preen" (auto-repair) non-critical problems.
-f = Force checking
-v = Tell us what's happening.
    • You may get a "/dev/md1: Adding dirhash hint to filesystem" message when you start the "real" fsck. This is indicating that fsck is updating the partition to handle indexing properly. This is a non-problem.
    • When I did this, it still bailed out on me because "inode hash" issues are considered "critical" problems. What will happen is that - if you force fix, and you will need to, trust me - the directories and/or files with the inode hash errors will be deleted and the space consumed returned to the free pool.
  • Retry fsck forcing it to fix all errors found
    • We will need to absolutely clean up the issues found, so we must (at this point) force fsck to fix things.
    • Execute the following commands to do this:
Commands to do the above:

fsck -t ext3  /dev/md1  -- -y -f -v

(note, we're omitting the "-D" here deliberately.)

-y = force auto fix (answer any question "yes!")
    • Re-execute the same command again to verify all issues have been resolved.
    • Repeat until there are no more errors found.
    • Once everything is OK, re-run fsck again to optimize and re-index directories.
Commands to do the above:

fsck -t ext3 /dev/md1 -- -D -y -f -v
This works just like before, except the "-D" forces directory re-indexing and optimization again, which helps.

  • Un-mount /dev/md1, and stop the array
Commands to do the above:

umount /dev/md1
mdadm --stop /dev/md1

Stop and take stock of things

Where we should be now

  • We should have two partition image files loop-mounted.
  • We should have them successfully assembled into an array.
  • We should have successfully run fsck on the array partition and cleaned up any errors.
  • We should have at least ONE good disk out of the two that came from the MBWE.
  • We should have at least ONE good system image from the two drives.
    • If you don't, you will need to download one and follow instructions to install it at a later step.

Begin rebuilding the two drives for the MBWE.

  • I am assuming that the "B" drive contained no bad blocks - and that if there were any, they were in the data partition, not the system partitions.

  • I am also assuming that we have a good drive "A", or a replacement, that may not have a good system image on it.
    • If this is not true - i.e. you do not have ANY good system images - skip the single step below, download a system image, and follow the instructions to install it on the two drives, creating the last (fourth) partition.
    • Using dd_rescue, copy the entirety of drive "B" to drive "A". This will replace the bad/missing system partitions, and re-create the 4th partition for the data.
    • After this is about 1/2 done, stop the copy with CTRL-C.
  • Using dd_rescue, copy the drive "A" data partition image that we fixed-up before, back to partition 4 of drive "A".
    • We use dd_rescue instead of "dd" - because dd_rescue will properly detect the end of the drive/data and will make sure every byte gets written.  "dd" - when it reaches the end of the drive - would simply fail, and not write the last few blocks of data.

  • Using dd_rescue, copy the drive "B" data partition image that we fixed-up before, back to partition 4 of drive "B".

  • Once that is done, completely shut-down and turn off power.
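A sketch of those dd_rescue steps. The device names are assumptions - here "sda" is drive "A" (or its replacement) and "sdb" is drive "B" - so verify them before running anything:

```shell
# Clone drive "B" onto drive "A" to re-create A's partition table and
# system partitions. Abort with CTRL-C at roughly the half-way point.
# (Skip this single step if you are installing a downloaded system image instead.)
dd_rescue -v /dev/sdb /dev/sda

# Write the repaired data-partition images back to partition 4 of each drive.
dd_rescue -v /recover/a/a-recover-data /dev/sda4
dd_rescue -v /recover/b/b-recover-data /dev/sdb4
```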

Rebuild the MBWE

  • Re-install the hard drives
    • Replace the two side-rails on each hard drive (if you removed them)
    • Re-insert the two drives into the MBWE, remembering that drive "A" goes in the slot closest to the controller electronics.
    • Re-connect all connectors removed during MBWE tear-down.
  • Reconnect network and power
    • Re-attach the network cable to the MBWE.
    • Re-attach the power connector to the MBWE.
  • FIRE THAT PUPPY UP!! (and pray…)
    • Re-connect power.
    • Carefully monitor the front-panel lights.
At this point, the MBWE should boot, do a final internal fsck - which is indicated by the internal lit ring spinning - and then come fully back on-line.

Note If you replaced the system partitions with downloaded partition data, you may have to re-configure the MBWE to your needs.

Verify correct operation

  • Attempt to access the web setup page
    • Verify that the web-setup page works, and that the drive status is "OK"
    • Re-configure any settings that you need to change.
  • Attempt to access the pre-existing shares on the MBWE
    • Verify that the original shares on the MBWE exist, you can access them, and you can read-and-write data to them.
    • Note that any files or directories that were "corrected" during the fsck of the partition array above may not be there - you may have to replace this data. THAT is why I asked you to take notes!

Verify everything's correct, replace any lost data, and return to service

  • Satisfy yourself that everything is back to normal, by shutting down the MBWE, re-booting it, etc.
    • You will probably notice that the MBWE is booting up - and serving files - much faster than it ever did before.
    • This is a result of both cleaning up all the cruft and problems, as well as the consolidate, optimize, and re-index steps that we performed during the FSCK operations above.
  • Replace any necessary lost data
    • Replace any necessary lost data as noted during the FSCK passes above.

  • Return to Service
    • Return the MBWE to normal operational status.

Congratulate Yourself on a Job Well Done!

Jim

Monday, January 19, 2009

January '09 - New Year's Presents!

Originally published  1/19/2009 as
QA Tech Tip - January '09 - New Year's Presents!

This month’s Tech-Tip celebrates the New Year with a couple of New Year’s freebies – they may be free, but they still pack a wallop!


#1:
Avira’s AntiVir Personal  (http://www.free-av.com/)
They offer:
  • An absolutely free (personal use) anti-virus solution for your Windows boxes.

  • Absolutely free product and virus definition updates for as long as you have AntiVir installed.

  • A free “Rescue CD” program that will build a rescue CD ISO file and, (optionally), write it to CD if it recognizes your recorder. The Rescue CD is a stand-alone Linux system-on-CD that runs AntiVir on your hard drives while they are “quiet”, (neither mounted nor active under Windows), to make finding – and removing! – bogus programs easier. Both the virus definitions and the rescue CD image creator program are updated daily.
Avira’s AntiVir Personal does everything that the corresponding Norton, McAfee, and other commercial products do, except:
  • AntiVir is NOT a resource hog, and does NOT bring your system to its knees just because you installed it.  The corresponding Norton and McAfee products quickly reduced every machine I installed them on to the performance equivalent of a 66 MHz ‘486 of days gone by.

  • AntiVir doesn’t cost an arm-and-a-leg to purchase.  The comparable commercial products (from Norton and McAfee) would cost something like $60+/seat. If you have more than one or two computers to support – as I do – then this can become a seriously significant expense.

  • AntiVir doesn’t annoy you with a “subscription” based update system.
    (I don’t think this needs elaboration. . . .)

The important question is this:  Does it REALLY do the job?  Or is it just a pseudo-AV wannabe?

I got the answer to that question when I ran Avira’s AntiVir Rescue Disk on my computer.  Despite running several different versions of both Norton and McAfee on my machine at various times – all of which gave it a clean bill of health – Avira’s product found two, count ‘em, TWO root-kit droppers / worms neatly tucked away into a couple of perfectly innocent e-mail attachments.  At that point, it was up to me to go get the e-mails in question and delete them as AntiVir was reluctant to just stir around inside my Outlook mail-files.

To be perfectly fair, Norton found the installed root-kits / worms, but choked on the removal process which I had to complete manually.  It did NOT discover the source of these infections. AntiVir did.

The one “bad” point about the AntiVir program is that every time it updates, (usually once daily), it opens a pretty darn large dialog on your desktop extolling the virtues of their paid products and providing a convenient link to their on-line purchase page.  The good part of this is that it is entirely optional – you can safely dismiss the dialog and get on with your life. And I really can’t blame them for the attempted up-sell.  Even Anti-Virus developers have to eat sometime!

All in all – I rate it, (especially the rescue CD), as a definite “Must Get”.


#2:
Ubuntu Linux (http://www.ubuntu.com/)

“Ubuntu” is an ancient African philosophy meaning, (in essence), “Humanity toward humanity” or “Humanity toward others”.  While I won’t get into a philosophical discussion on ubuntu, (though there is an excellent Wikipedia article on the subject), Ubuntu as a Linux/GNU operating system comes very close to this ideal.

They offer:
  • Ubuntu “Desktop” – designed for the “typical” desktop user with Firefox, Eudora Mail, Open Office, etc. pre-installed for you.

  • Ubuntu “Server” (Enterprise, clustering, whatever. . .) – is designed for those who wish a more server-oriented install.  It should be noted that - unlike other Linux distributions - the fancier versions, such as Enterprise, Clustering Server, etc. are - all of them - free for the taking.

    Of course, you can purchase commercial support - which might not be a bad idea in a production or mission-critical environment - but you don't have to.  There are lots of ways to get questions answered and problems solved even if your IT budget is NOT bottomless.
The installer media, (at least for the desktop version), is a “live” CD.  This means you can pop it into your machine's CD tray and run it on your PC without installing it, to see how you like it before you commit precious hard drive space to an install.

(Note that all of my observations below are based on the “desktop” version of Ubuntu.)

Ubuntu Linux has a number of characteristics that I believe make it stand above the crowd:
  • Ubuntu Linux is designed, first and foremost, to be used by people.  Note that I said “people” and not “techies”.  In support of that, Ubuntu has made great strides in the area of user experience and just plain old usability.  If you can use Windows, you can use (and install!) Ubuntu.

  • Ubuntu expands on the concept of “usability by non-techies” with a well thought-out installation process.

    In a word it is “slick” – even more so than Windows.  There are five-or-six dialogs in the installation process that basically ask you “who you are” (along with “what do you want to name your computer”), and “where to put it” on your hard drive.  The defaults offered are all reasonable and sane, choices are clearly shown, and if you really want to go behind the scenes and diddle, you can do that too.

    Starting with a stone-cold system, it takes less time to complete the Ubuntu installation dialogs, (and start the install running!), than it takes for Vista’s installer to boot and load.  This is made even sweeter because Ubuntu’s ability to correctly detect – and configure – a machine’s hardware configuration is as good as, or maybe even a tad better than, anything I’ve seen up to this point.  Unlike many other Linux installers, Ubuntu’s installer is truly “plug and play”.  You pop-in the CD, answer a few simple questions and you are on your way.

  • More important than that, Ubuntu doesn’t arrogantly assume that you want to throw away all the other operating systems on the machine.  If there’s another operating system present, Ubuntu will work hard to make itself fit in without disrupting the other system – and the Grub boot manager provides a clean boot process for all of the operating systems installed.

  • Ubuntu strives to be as completely Open Source as possible – but does not become religious or pedantic about it.  Ubuntu will cheerfully make non-open-source drivers or applications available to you – after telling you that there are either licensing restrictions that prevent it from being purely open source, or other issues that you may need to be aware of.

    Example:  Both ATI and NVIDIA have released a number of their video drivers to the Linux community – as binaries – but still retain copyright and ownership.  Ubuntu makes them available to you, but tells you that they’re not “pure” open source.  Other Linux versions, Fedora chief among them, go to great lengths to “forbid” (or even obstruct) the use of non-open-source drivers or software - despite the potential consequences to their users.

  • Because Ubuntu is based on the Tried-And-True Open Source Linux platform – it has available for quick download a HUGE library of free applications and utilities – from the simple, (roving eyeballs for your task-bar), to the complex, (Scribus, the open source replacement for Quark), to the more esoteric, (QCad, an open source AutoCAD replacement); it’s all there, waiting for you to find a need for it.  Since it is based on Debian and uses the equally tried-and-true “.deb” package installation process, adding applications or features is as painless as anything I’ve seen.

  • Ubuntu is about giving the user choices.  You can go this way, or that way, and both ways are just fine by them.  If you like things nice-and-simple, that’s perfectly OK.  If you want to play uber-geek and mess with the more technical aspects of Linux, that’s all there too.
The Ubuntu interface (desktop) is clean and uncluttered, with just about everything you need to get productive one or two mouse clicks away.

Their user security model is very similar to Vista’s (or is it vice-versa? ;-) ), where Ubuntu will let you do whatever you wish – until it would affect the system as a whole – then it asks you to confirm by typing in your own password.

I, for one, think that’s a great idea – rather than the classic Windows’ “everyone is Admin/root/God”, model which has been the bane of Windows users, and a boon for malware writers.  Or their “Restricted User” model, which is nearly useless.  Though Ubuntu’s system is far from bullet-proof, it goes a long way toward making it darn difficult to “accidentally” pooch your system beyond repair.

You can even run Windows apps on it, (at least in theory – I have not tried it yet), by running them inside Wine.  Wine is often loosely called a “Windows emulator”, though it is actually a compatibility layer – the name itself stands for “Wine Is Not an Emulator”.

The real telling characteristic is this:  Ubuntu is the first Linux system that I would seriously consider putting on my wife’s computer, or even my mother’s, (a lady of nearly 80 years), confident that they would be able to use Ubuntu with an absolute minimum of difficulty.

Are there issues?  Of course there are.  Some portions of Ubuntu are less forgiving than others, (God help you if you accidentally set your monitor resolution or refresh rate wrong!), but this is true for any operating system out there – especially the various ‘nix systems and their near brethren.  (And I won’t even discuss the headaches Windows or Vista have given me!)

My bottom line is this:  If you’ve been thinking of trying out Linux, but were afraid of all the “techie” aspects of it, go ahead – take the plunge and give Ubuntu a try.


That’s it! Now go give these a try and have a wonderful New Year!

Jim

Sunday, December 14, 2008

What is "Quality"?

Originally published on 12/14/08 as
December Software QA Tech-Tip: What is "Quality"?

This month's Tech-Tip is a conversation-starter – What IS "Quality"?

To be honest, this Tech-Tip was inspired by a copy of Fortune magazine with a ’50s Chevy on the front cover and the legend “GM – Death of an American Dream”, describing the abysmal state into which American automakers have fallen.

I have heard a number of different definitions for "Quality" – and they all center on the concept of giving the customer a product (or service) that works as expected.

I am going to respectfully disagree – I will say that Quality is not only giving a customer a product or service that works as expected, but also going beyond expectations in thoughtful ways that show you had the customer's best interests in mind when the product or service was designed.

Many years ago, I purchased a brand-new 1990 Hyundai Excel. This was back when everyone was "talkin' trash" about Hyundai and the "crummy" cars they created. This car not only did what I expected a car to do – get me where I wanted to go without a lot of maintenance and without consuming gallons of gasoline – but as I got to know the car, I noticed a whole host of thoughtful touches:

Two examples:

  1. The fuse-panel used the (then new and unfamiliar) “blade” type fuses.  Not only did they provide a plastic-coated, PRINTED fuse-map, securely glued to the inside of the fuse-panel door – right next to it was a blade-type fuse puller, with a molded-in clip to hold it ready for use!  In addition, right there at the bottom of the fuse panel were four extra fuses – one spare for every value used – just in case you blew a fuse in the middle of nowhere.  Not to mention that the fuse panel was located where it was easily reached – and only a slight push would cause the press-lock catch to release, opening up the entire panel for you.

    By comparison, most other cars – even those costing much more than the $7k I spent – if they had a fuse-map at all, it was molded into the plastic somewhere and was almost impossible to read.

    Fuse puller?  Yeah, right!  You'd get one at the local Auto-Parts-Extravaganza!  Spares?  Who are YOU kidding?!!

    Location, location, location . . .  The fuse panel was invariably tucked up somewhere – BEHIND the dashboard – where even Houdini would have trouble reaching it. And even using a flashlight, (assuming you even HAD one with you!), was little help as the fuses were not only hidden somewhere up behind the dashboard, but tucked away behind something else, making them almost impossible to see.
  2. This car – unlike most other cars I had owned up to that point in time – was primarily electronic. Everything, from the dashboard clock to the fuel-injection system was run by computer. Every system, every subsystem had its own electronic module or controller to run it.

    One night, I helped a friend determine that her battery had died by swapping my battery into her car, and using it to start it.  It was dark – dimly lit – and when I placed the battery back into my car, I attached the "minus" lead, tightened it down, and then the "plus" lead . . . .  To the accompaniment of a bright flash and a loud BANG!!  Ohmygosh!  I'd installed the battery BACKWARDS and had blown the 50 amp fusible link that was built into the battery cable…  My heart sank – fifty amps of twelve-volt power, applied BACKWARDS to THOUSANDS OF DOLLARS of expensive car electronics and computers.  Not only was my car fried – it was EXPENSIVELY fried!  It would probably cost more to fix than to replace – and I wasn't even finished paying it off yet!!

    Fearing the worst, I disconnected the battery, swapped it around, connected the battery terminals and, (to jumper out the now-blown 50 amp fuse), I swapped the charging lead from one fuse terminal to the other.  All the time fearing the smell of char-broiled electronics when I re-applied power.

    Once I made contact . . .  Lights came on, the heater fan started and (joy!) I heard not only the radio start playing, but the familiar clicking sound of the engine's computer starting up!  The car – much to my surprise! – was absolutely unscathed.

    Once home, (after singing Holy Hymns the entire way!), I dug out the copy of the service manual I bought when I purchased the car and found out what had happened.

    The engineers in Korea, realizing that car electronics are not cheap and that some clever idiot like myself would – inevitably – reverse-connect the battery, designed a special part into the electrical system, (in all likelihood a heavy-duty protection diode), a part that did absolutely nothing so long as everything was connected correctly.  However, it would short-circuit the entire electrical system – deliberately blowing the master fuse, and thereby protecting the car's electronics – if the battery was accidentally installed backwards.  Total cost for my mistake: $4 for a new 50 amp fuse at the local auto-parts store.  Not counting the thirty-or-so extra grey hairs!

In both cases, the engineers saw an opportunity to add value to the car owner's experience:  First, by providing a very inexpensive tool that would make changing fuses much simpler – AND spares, "just in case".  Second, by the inclusion of two extra components: the special part and extra "Master" fuse, the car was protected from a potentially devastating and expensive mishap.

The usual practice would have been to omit these touches – to save money.  And I am sure it would have saved money: thousands, maybe even millions of dollars per year while these cars were being made.  And since the Hyundai Excel was such an inexpensive car, no one would have missed these touches.  But, by spending a few extra dollars in parts and effort, Hyundai made a loyal customer who referred dozens of friends to the brand.  At seven thousand dollars, plus, per car – the few extra dollars spent proved a worthwhile investment.

Nowadays, Hyundai has glass cases filled with J.D. Power awards and legions of thrilled customers.  Toyota can't make cars fast enough to meet demand.  Lexus dealerships (Lexus being a division of Toyota) are springing up like dandelions, and "Honda" is the new Japanese word for "making tons of money" – while "The Big Three" American automakers pray for another miracle to prevent them from disappearing into oblivion.

Obviously, "Quality" is not just about giving the customer what he expects. It's about giving the customer MORE than he expects; looking for ways to meet needs the customer did not even realize he had.

What say ye?

Jim

Thursday, October 9, 2008

Network and Routing Tools for QA

Originally published on 10/9/08 as
QA Tech-Tip - October: Network and Routing QA Tools

Hello everybody!

This tech-tip will be a departure from the usual “single command or option” type tech-tip I’ve done in the past. In fact, I’m going to discuss an entire operating system!

Problem:
The boundary between applications, end-users and the Internet is becoming increasingly indistinct, as more and more products depend on network connectivity for some or all of their functionality.  If you have a security suite or anti-virus application that receives automatic updates, you will know exactly what I am talking about.

The problem is that the successful software QA Engineer / Analyst needs to become increasingly proficient in disciplines that were once considered outside the domain of Software QA, such as network architecture and infrastructure, bandwidth, security, web services, and so on.

This is compounded by the additional problem of learning these skills on a modest budget.

Fortunately, much of this can be done with an inexpensive Linux setup: Apache for web services, Wireshark and Netmon for network health and packet inspection, iptables (among others) for firewalling, and so on.
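Even the firewalling piece can be explored hands-on.  Here is a minimal iptables sketch – a hedged example, not a production ruleset; the interface and port choices are my own assumptions:

```shell
# Default-deny inbound policy: anything not explicitly allowed is dropped.
iptables -P INPUT DROP

# Always allow loopback traffic.
iptables -A INPUT -i lo -j ACCEPT

# Allow replies to connections this machine initiated (stateful filtering).
iptables -A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT

# Allow inbound SSH (TCP port 22) so we can still administer the box.
iptables -A INPUT -p tcp --dport 22 -j ACCEPT
```

Running a ruleset like this on a scratch machine, then watching the traffic in Wireshark, is a cheap way to learn what "stateful filtering" actually looks like on the wire.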

One of the final roadblocks for the successful QA Engineer involves non-trivial routing, WAN load balancing, or firewalls using “enterprise class” equipment.  Many of the software QA jobs I see out there today need or want higher-end networking (and telecom) experience: WAN routing and load-balancing, multi-homed WAN connections, persistent VPN tunnels, point-to-point links, and so on – all things the up-and-coming QA analyst needs to know.

And few of us have the dollars necessary to get ourselves a full-blown Cisco router to play with.

That roadblock has just fallen, thanks to a company called Vyatta.

“Vyatta?  Who’s that?!” you ask?  Well might you ask – not too many people have heard of Vyatta.  Yet.

Vyatta is a small company out in California that has taken mature open source software, (Debian GNU/Linux, to be exact), stripped it to its bare essentials, and released a software equivalent of a high-end Cisco / Juniper / Nortel router that runs on virtually anything with an x86 processor and a modest amount of memory.  (128 megs at a minimum, to be exact.)

They’ve just released version 4 of their router software – and to mark the occasion they hired The Tolly Group to do an independent evaluation of their software, (running on a beefy Xeon box), head-to-head with the best Cisco has to offer.  Result: Vyatta ate Cisco’s lunch.  And Vyatta did it for about one tenth the price of the Cisco equipment – *including* the cost of the beefy Xeon!

What is more important to us as QA engineers is that the software itself is Open Source – meaning you can download a copy, free of charge, and play with it yourself.  This provides a golden opportunity for people in QA to become more familiar with Internet edge-facing equipment at a minimal investment.  And if you’re using a smaller scale network with lower-than-gigabit requirements, it allows you to use less expensive hardware to host it.

In fact – at right around 150 megs in size – their live CD ISO image won’t take long to download, even if your Internet connection is less than stellar.

That’s right kiddies, I said “live image CD”.

You can, if you wish, run it “out of the box” without even having to install it.  Just make sure there’s a blank floppy around to save your configurations.  Or – if you have the hardware to spare – you can install it right from the CD.  They have oodles of documents (in PDF) right there, so you can get a head-start on getting your system up and running.
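Once it's booted, configuration is done through a Juniper-style CLI.  As a rough sketch, (command syntax recalled from memory and may differ between Vyatta releases; the addresses are made-up examples):

```shell
configure                                               # enter configuration mode
set interfaces ethernet eth0 address 192.168.1.1/24     # LAN-side address
set protocols static route 0.0.0.0/0 next-hop 10.0.0.1  # default route out
commit                                                  # activate the changes
save                                                    # write them to disk (or that floppy!)
exit
```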

IMHO, Vyatta is engaging in some extremely shrewd marketing here – giving away Open Source software to interested parties.  This is a lot like the “Apple in the Schoolroom” approach Apple took years ago to guarantee themselves a solid user-base.  People get a chance to use the stuff – learn how it works and what it can do – and then, when the time comes to join the Big Leagues, they’ll recommend it as a “buy”.

Even if you don’t use one of Vyatta’s hardware solutions (pre-loaded with Vyatta’s software, of course!), you can buy priority service support on an incremental basis.  It’s not cheap, but compared to things like Cisco’s service, it’s not all that expensive either.

Go take a peek – http://www.vyatta.com/

Jim