Welcome to the QA Tech-Tips blog!

Some see things as they are, and ask "Why?"   I dream things that never were, and ask "Why Not".  
Robert F. Kennedy

“Impossible” is only found in the dictionary of a fool.  
Old Chinese Proverb

Sunday, February 24, 2013

OOPS! - When disaster strikes
System Backups (Part 2 of a series)

In my previous article in this series - OOPS! - When disaster strikes (System Restore) - I discussed the importance of the System Restore functionality that has been included in Windows since Windows XP, and how it can save your butt big-time.  This particular "safety net" is easy to implement, and easier to use, since it is essentially automatic once enabled.

System Restore, as good as it is, can't save you if the problem is more drastic.  For example, a hard drive crash, or when malware (or a poorly written piece of software) wanders willy-nilly over your hard drive, leaving a trail of dead or wounded disk clusters in its wake.  In the case where the physical or logical integrity of your storage media has been damaged or destroyed, System Restore cannot help you.

In the same vein, System Restore cannot help you if you're brain-dead at the wheel and "accidentally" permanently delete the one-and-only draft of your doctoral thesis, or some other document, file, or whatever that is similarly valuable and irreplaceable.

In this kind of situation, what you need is a backup.



Backups come in a whole host of shapes, sizes, colors and flavors, depending on what you need; ranging from something as simple as copying cherished family photos to a CD, to something as difficult as managing the backup methodology for Wikipedia's online databases and file servers to ensure data consistency.

If a particular file, or files, are important, you should invest time into making periodic copies of the data while it's evolving.  In the case of program source files, or important document files that are subject to continuing edits, some kind of version control system could be vitally important as well.
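
As a rough illustration of "periodic copies", here is a minimal Python sketch that saves a timestamped copy of a file each time it is called.  The function name save_version and the naming scheme are my own assumptions for the example, and it is no substitute for a real version control system.

```python
import shutil
import time
from pathlib import Path


def save_version(source, backup_dir):
    """Copy `source` into `backup_dir` under a timestamped name; return the new path."""
    source = Path(source)
    backup_dir = Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)

    stamp = time.strftime("%Y%m%d-%H%M%S")
    target = backup_dir / f"{source.stem}-{stamp}{source.suffix}"

    # Avoid overwriting an earlier copy made within the same second.
    counter = 1
    while target.exists():
        target = backup_dir / f"{source.stem}-{stamp}-{counter}{source.suffix}"
        counter += 1

    shutil.copy2(source, target)  # copy2 also tries to preserve file timestamps
    return target
```

Calling save_version("thesis.docx", "backups") before each editing session would leave a trail of dated copies to fall back on.  Ideally the backup directory lives on a different physical drive than the original.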

However, in many cases what we want to preserve, and possibly restore, is the entire state of a particular computer system at a given point in time.

There are two basic methods for making these kinds of system backups:
  • "Bare-Metal" backups
  • Snapshots



Bare-metal Backups:
 
The easiest way to back up a system is via a complete system image, otherwise known as a "bare-metal" backup.

A bare-metal backup is one that takes an existing, known-working system and makes a literal (clone) image of it, byte-for-byte, that can be used to re-create the exact same system on different hardware at some future time if need be.  In other words, you can recreate the imaged system on "bare-metal" - that is, a system that does not have a working operating system installed yet.

Example:
You have a computer that has Windows 7 on it, and it's working wonderfully.  At some point in time, you decide you want to upgrade the system to Windows 8 by doing an in-place upgrade rather than re-installing the entire O/S from scratch.  And. . .  Though you are confident that the engineers and programmers at Microsoft have done a wonderfully thorough job of making sure the upgrade is painless, you want to be able to restore the original system if your upgrade experience is somewhat less than stellar.

In situations like this, bare-metal backups are the best way to preserve the system as it was prior to the upgrade.

Obviously, a literal byte-for-byte image of a three terabyte hard drive would be huge.

Fortunately, modern system imaging software handles this problem using a two-pronged approach.
  • It creates a sparse file, copying only the parts of the disk that are actually being used.
  • It attempts to compress the data that is there, if safely possible.
The advantages are obvious:  If you are only using a small part of the three terabyte hard drive, and a lot of your data is safely compressible, your 3 TB drive may result in an image that is only a few gigs in size.
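
The two-pronged approach can be sketched in a few lines of Python.  This is a toy under stated assumptions, not how Acronis or Clonezilla actually work: the "disk" is just a bytes object, the 4096-byte block size is arbitrary, and an all-zero block stands in for "unused" (real imaging tools typically consult the filesystem's own allocation information instead).

```python
import zlib

BLOCK_SIZE = 4096  # assumed block size, chosen only for this illustration


def make_image(disk: bytes) -> dict:
    """Return {block_index: compressed_block} for every block that is not all zeros."""
    image = {}
    for offset in range(0, len(disk), BLOCK_SIZE):
        block = disk[offset:offset + BLOCK_SIZE]
        if block.count(0) == len(block):
            continue  # "unused" block: not stored at all (the sparse part)
        image[offset // BLOCK_SIZE] = zlib.compress(block)  # the compression part
    return image


def restore_image(image: dict, disk_size: int) -> bytes:
    """Rebuild the full disk; blocks missing from the image come back as zeros."""
    disk = bytearray(disk_size)  # starts out as all zeros
    for index, packed in image.items():
        block = zlib.decompress(packed)
        start = index * BLOCK_SIZE
        disk[start:start + len(block)] = block
    return bytes(disk)
```

A mostly empty disk produces an image holding only a handful of small compressed blocks, yet restore_image gives back the original byte-for-byte.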

The disadvantages are more subtle:  Unless your filesystem is absolutely pristine - totally consistent [1] [2] with no errors whatsoever - the image created may look more like Frankenstein's Monster than a usable filesystem.  Fortunately, most disk cloning software is particularly paranoid when it comes to file system integrity, and will stubbornly refuse to go any further if it sees anything out of the ordinary.

Theoretically (and that really should be in quotation marks!), all cloning software should support most, if not all, modern filesystems.  Despite the claims of some disk imaging software, I prefer to use Windows-based software for imaging Windows drives and Linux-based software for imaging Linux drives.

My two favorites are:
  • Acronis True Image (payware), for Windows-based filesystems.
    • I have used, and can endorse, the 2012 version of Acronis True Image, though there is a newly released 2013 version that I have not yet tried.
  • Clonezilla (GPL freeware), for Linux-based filesystems.
    • I have also used Clonezilla for backing up my Linux servers, especially prior to in-place upgrades, and it has worked well for me.
    • Note that Clonezilla claims to support non-'nix based filesystems, like NTFS or HPFS.  However I have not tried it using these filesystems and cannot vouch for it.  Your Mileage May, (and probably will), Vary.

Both Acronis True Image and Clonezilla can be used as "live CDs", meaning that you can boot the system from the CD and take an image of the entire filesystem while it is at rest.  (IMHO, the best way to obtain a completely consistent system image.)

The advantage of taking an image while the system is not running is obvious:  The filesystem is static so there is no chance of any changes taking place while the backup is being taken.  Because the filesystem is static, the image created is an exact copy of the system state at that point in time.  This can be useful when disinfecting a drive, or doing data recovery.

The disadvantage of a system like this is the same as its advantage:  The filesystem must not be running.

In other words, the system must be stopped, the backup taken (which can take a non-trivial amount of time, depending on the size of the disks and how much of them is in use), and then the system must be restarted.  Obviously, while the backup is being performed, the system is not available for use.  Equally obvious is that systems which must be on-line 24/7 cannot tolerate this kind of "cold system" backup and its associated downtime.

Likewise, most bare-metal backup software will not restore an image to a hard drive smaller than the original, even if most of the space on the drive is unused.  Depending on the software being used, this can be mitigated by shrinking the partitions to the smallest size possible prior to making the backup, and then re-expanding them to the size of the hard drive they're installed on after the restore is finished.

Another potential disadvantage is that - in some cases - the image is atomic in and of itself, so that it is not possible to extract parts of the image, (particular files, for example), without restoring the entire image.

Luckily, this is not universally true, and some of the better backup programs allow the created images to be explored, and portions extracted, as if they were just one big zip file.
  • Acronis True Image allows its images to be explored and individual parts of an image can be retrieved if the True Image Windows software application is installed.
  • Clonezilla does not directly support file extraction from within an image, though there is a workaround described on the Clonezilla site.  (Note:  It appears to be a butt-ugly hack that involves jumping through flaming hoops and, IMHO, I would steer clear of it unless absolutely necessary.)
  • Surprisingly enough, some of the expensive high-end backup solutions do not support file extraction from within an image either.  (Ref: IBM Tivoli Support Technical Exchange Web Seminar: Differences between image and snapshot backups)



Snapshot Backups:

Snapshot backups are a little bit different from bare-metal backups.
  • Snapshot backups can be taken while the system is running.  In other words, they do not need a "cold" or static filesystem prior to taking the snapshot.
  • Snapshot backups are not "bare-metal" backups, despite what some people may think.

Snapshot backups are frozen-in-time copies of the state of a machine, taken while the machine is running.

There are various ways of taking snapshots [1] [2]; however, the method used by VMware is illustrative.
  • You begin with a running virtual machine, configured and running the way you like it.
  • The very first snapshot pauses the machine briefly while a complete copy of the machine state at that instant in time is created.  This becomes the snapshot baseline.
  • The filesystem is marked as a "copy on write" filesystem, where any writes to the disk made after the initial snapshot is created are stored in a special file.
  • The second snapshot is virtually instantaneous, as the currently running "copy on write" file is closed, marked as the second snapshot file, and a new copy-on-write file is created starting at that instant in time.
  • Subsequent snapshots perform the same action as the second snapshot.

Restoring to a particular snapshot point-in-time is essentially the reverse of the above procedure.
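
To make the copy-on-write idea concrete, here is a toy Python model of the steps above.  It is a sketch under my own assumptions, not VMware's implementation: the class name SnapshotDisk is hypothetical, the "disk" is a list of blocks, each "special file" is a dictionary, and the first snapshot freezes the baseline in place rather than making a separate copy.

```python
class SnapshotDisk:
    """Toy block device: a baseline plus a chain of copy-on-write delta "files"."""

    def __init__(self, blocks):
        self.baseline = list(blocks)  # the live disk until the first snapshot
        self.deltas = []              # closed copy-on-write files, oldest first
        self.active = None            # open copy-on-write file (None = no snapshot yet)

    def write(self, index, data):
        if self.active is None:
            self.baseline[index] = data  # no snapshot yet: write straight to disk
        else:
            self.active[index] = data    # after a snapshot: divert the write

    def read(self, index):
        # Search the newest layer first, falling back to the baseline.
        layers = ([self.active] if self.active is not None else []) + self.deltas[::-1]
        for layer in layers:
            if index in layer:
                return layer[index]
        return self.baseline[index]

    def snapshot(self):
        """Take a snapshot and return its number (1 is the baseline)."""
        if self.active is not None:
            self.deltas.append(self.active)  # close the running file
        self.active = {}                     # start a fresh copy-on-write file
        return len(self.deltas) + 1

    def restore(self, number):
        """Roll back to the state captured by snapshot `number`."""
        if self.active is None or not 1 <= number <= len(self.deltas) + 1:
            raise ValueError("no such snapshot")
        self.deltas = self.deltas[:number - 1]  # discard everything newer
        self.active = {}
```

Notice that every snapshot after the first only swaps one dictionary for another, which is why it is nearly instantaneous, and that restoring simply throws away the newer layers.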

Snapshot backups have some interesting advantages:
  • Snapshots can be taken while the machine is running.  (i.e.  It does not require a "cold" system.)
  • Individual snapshots, after the first one, are virtually instantaneous.
  • Snapshots can be taken as often as desired, within the storage limits of the machine.

Snapshot backups have their disadvantages too:
  • You cannot bare-metal restore from snapshots.  However, if you stop to take a bare-metal baseline prior to creating your first snapshot, you have a convenient point of reference if everything goes to Hell in a Hand-Basket.
  • Snapshots consist of an ordered sequence of files that contain the differences between each file and the file immediately preceding it.  Because of this, if any one of the snapshot files is destroyed or corrupted, any subsequent snapshot files are useless.
  • Depending on the software being used, the effect of a snapshot on open files may not be well defined.



The subject of backups is a complex one, as even a cursory on-line reading of the subject will show.  The various backup methods have their individual supporters, and discussions about which backup method is best are akin to the religious wars during the Middle Ages.

Unfortunately, there is really no "one-size-fits-all" solution for backups, as it depends on what your individual requirements are, how much storage you can afford, and how much risk you are willing to take.



Update:

Bennette, in his comment below, pointed out that I had actually left out a good chunk of the "backup" process:  actually verifying that the backups are worth a damn.  I have seen large and tech-savvy companies get their lunch eaten because they grew complacent about their backups.

An ISP that I used to use got trapped this way.  They had a really fancy striped RAID-5+1 setup on their servers. . .  And with all that redundancy, backups aren't really that important, right?

Good 'Ole Mr. Murphy steps in one night when everyone else is asleep at the wheel and introduces just a teeny, tiny bit of corruption.  And, not unlike the Fabergé shampoo commercial, it corrupted two things, and they corrupted two more things, and. . .  Well you get the picture.  Many weeks, huge gobs of money, and a tremendous hit to their customer good-will later - they finally got everything back up and running.

I have had this happen to me as well.  I learned my lesson a while ago when I had a nice RAID-1 array go south on me - and there I am thinking about how clever I am with my (ahem!) "bulletproof" RAID with its "built-in backup".  Right?

WRONG!!!

I spent more than two weeks, not to mention hundreds of dollars that I really didn't have to spare, picking up the pieces after one side of the RAID got corrupted somehow, and dutifully copied the corruption to its mirror.

I guess the best advice I can give you is to quote part of what he said:
A serious backup policy, especially one for systems that contain mission-critical data, should include regular "fire drills" where the steps for recovering a system are rehearsed.
Truer words have never been spoken!
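
One small piece of such a "fire drill" can be automated.  The Python sketch below compares a restored directory tree against the original using SHA-256 checksums and reports anything missing or different.  The function names are my own for illustration, and it assumes the original files are still around to compare against, which is true in a rehearsal but obviously not in a real disaster.

```python
import hashlib
from pathlib import Path


def checksum_tree(root):
    """Map each file's path (relative to `root`) to its SHA-256 digest."""
    root = Path(root)
    digests = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            # read_bytes() loads the whole file; very large files would be
            # better hashed in chunks.
            name = path.relative_to(root).as_posix()
            digests[name] = hashlib.sha256(path.read_bytes()).hexdigest()
    return digests


def verify_restore(original, restored):
    """Return the relative paths that are missing from, or differ in, `restored`."""
    expected = checksum_tree(original)
    actual = checksum_tree(restored)
    return sorted(name for name, digest in expected.items()
                  if actual.get(name) != digest)
```

An empty list from verify_restore means every original file came back intact; anything else is a file you would have lost for real.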



I hope that I have given you enough information so that you can begin to research the subject for yourself, and - ultimately - decide what kind of backup strategy works best for you.

What say ye?

Jim (JR)