Welcome to the QA Tech-Tips blog!

Some see things as they are, and ask "Why?"  I dream things that never were, and ask "Why not?"  
Robert F. Kennedy

“Impossible” is only found in the dictionary of a fool.  
Old Chinese Proverb

Tuesday, March 8, 2011

The 2000 Gigabyte Gorilla


Here's the scenario:

You have a computer that supports SATA / eSATA - or an external drive enclosure that supports SATA - and you decide you want a huge drive to fill it.

You snoop around and find a really good price on 2+ terabyte hard drives, so you buy a couple-or-five, depending on your cash situation.

You bring them home, carry them lovingly to your computer, hook them up, and proceed to partition and format them in the way you usually do.

Unknown to you, there's a 2000GB Gorilla in the room with you.  And that's when the fun begins!



In my case, I wanted to hook them up to the Linux box I am using for my primary file store so that I could make space on my RAID array.  I was planning to move less critical files to a more "near line" storage device, so I needed a very large drive to accommodate them.

So, I did exactly that.  I plugged one in, partitioned and formatted it in the usual way and started copying almost a full terabyte of data over to it.

Unfortunately, about halfway, (maybe two thirds of the way), through the copy, the drive errored out and remounted as read-only, causing the entire copy process to go straight to hell in a hand-basket.

I tried everything.  I changed interface adapters, I used a different power supply to power the drive, I even hooked it up directly to my computer's eSATA port.

No difference.  It would still error out about halfway through the copy.

So I'm thinking:  "$^%#*&@!! - stinkin' hard drive's bad. . .!" and I get out the second one I bought. (I bought two, so I'd have a spare.)

I repeat the entire process and - sure enough - the drive fails about halfway through the bulk copy.

I look on the Internet and I see a whole host of articles complaining that these drives, (from Western Digital), are pieces of GAGH!  Everybody's having issues with them and not a few unkind things were said to - or about - Western Digital.  Not to mention a whole host of other drive manufacturers who appear to be having the same issues.  Even my buddy, Ed, at Micro Center says they're all junk.

Hmmm. . . . .  Is EVERY two terabyte hard drive garbage?  This doesn't make sense to me.  Western Digital, Samsung, Hitachi, Seagate and all the rest of the hard drive manufacturers might be crazy, but one thing is absolutely certain:  They are NOT stupid.  I cannot believe that any reputable manufacturer would deliberately ship crates and crates of drives that are known garbage to an unsuspecting public.

Of course, the "conspiracy theorists" are having a field day:  It's all a conspiracy to get us to buy solid-state drives!

But it doesn't make sense to me.  Why would any reputable manufacturer risk his good name and reputation for the sake of a "conspiracy"?

I still couldn't see the 2000GB Gorilla, but I decided to dig a little bit deeper anyway.



Let's pause for a short trip down memory lane. . . .

Back at the Dawn of Time - when Men were Men, and Hard Drives were Hard Drives, (and starting one sounded like the jet engines on a B-52 winding up) - hard drives used a very simple geometry known as "CHS" - Cylinders, Heads, and Sectors.  Any point on the drive could be addressed by specifying the cylinder, (the radial position of the heads), which of the many heads to use, and which sector on that particular track is desired.

Once hard drives started to get fairly large - larger than about 512 megs - the old CHS scheme ran into trouble.  The number of cylinders and heads needed to address every sector on the drive had become larger than the BIOS and controllers could handle, so there were BIOS updates that allowed the drives to report a fictitious CHS geometry which would still add up to the correct drive size.

Again, when hard drives became relatively huge, (around 8 gigs or so), there was another issue:  The CHS system could not keep up.  So hard drives, and the respective computer BIOS programs, addressed this issue by switching to Logical Block Addressing, (LBA), where each sector was numbered in ascending order.  And that kept people happy for a while. . . .  But not for long, because hard drives were getting bigger, and bigger, and bigger, and . . . . . .

Enter the 137 gig problem:  We've run out of bits to address all the logical blocks on a large drive.  So there was another hack: Extended LBA, (also known as LBA-48), which increased the bit-count even more.  This allowed the IDE/ATA interface to accommodate larger and larger drive capacities.
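If you're curious where these barriers come from, the arithmetic is simple enough to check from a shell prompt.  The first figure is the old Int-13 CHS ceiling - 1024 cylinders, 255 heads, 63 sectors of 512 bytes each; the second is the 28-bit LBA ceiling:

# echo $(( 1024 * 255 * 63 * 512 ))
8422686720
# echo $(( 2**28 * 512 ))
137438953472

Roughly 8.4 gigs and 137 gigs respectively - which is exactly where the trouble showed up each time.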

At around 500-or-so gigs, the parallel ATA architecture itself, (and not just the LBA addressing scheme), was straining at the seams.  There were architectural issues - signaling speed, bulky ribbon cables, and the like - that could not be solved simply by throwing bits at them.

This time - instead of hacking what was rapidly becoming an old and crufty interface - they decided to go in an entirely different direction: SATA, (serial ATA).  It was faster, it was neater to install because the cables were smaller, and it allowed, (theoretically), a virtually unlimited addressing range.

As a plus, because of the smaller cable arrangement with fewer pins to accommodate, drives could be added externally to the computer - hence eSATA.  Drives were still using LBA addresses, but now the addressing range was much greater.

And. . . .  just to make things even more interesting. . . . .

For the longest time hard drives, and their manufacturers, were leading a double-life.

In public they still supported both the CHS and LBA geometries, but secretly they were re-mapping the "public" geometry to a hidden geometry that had no real relationship to the public one.  And what a life it was - on the outside they had the stodgy, old and conservative wife, but secretly they had the young, sexy mistress making things nice for them.

In fact, this had been going on since the original 512 megabyte limit issue, when the drives started reporting fictitious geometries that would keep the BIOS happy.

"All good things must come to an end" and if you're living a double life you eventually get found out.  Which, by the way, is exactly what happened.



Fast forward to the present day as drives keep getting bigger and bigger.

Somewhere between the 1.5 TB and 2 TB drive sizes, the drive manufacturers reached a crisis.  Trying to keep up the "512 byte sector" facade was becoming more and more difficult.  Making things worse was the fact that almost every operating system had given up addressing things in "sectors" long ago.  Operating systems started allocating space in terms of "clusters": groups of sectors that were treated as a single entity.  The result was that for every request to update a cluster, a multitude of sectors had to be read, potentially modified, and then written back - one by one.

Early attempts were made to solve this bottleneck by allowing read and write "bursting"; asking for more than one sector at a time and getting all of them read - or written - all at once.

Increasingly large amounts of cache memory on the hard drive were used to mitigate the issue by allowing the computer to make multiple requests of the drive without actually accessing the drive platters themselves.  Since, for a fairly large percentage of the individual drive requests, the O/S would be addressing the same or nearby locations, the drive's cache and internally delayed writes allowed the drives to keep up with the data-rate demands.

Later still, hard drives adopted "Native Command Queueing", a technique that allowed the drive - internally - to shuffle read and write requests so that the sequence of reads and writes made sense.  For example, if the computer read a block of data, made changes, wrote the changes, then made more changes and wrote them again; the hard drive could choose to skip the first write(s) since all the changes were within the same block of data.

Likewise, if multiple programs were using the disk, and each wanted to read or write specific pieces of data; the drive, (recognizing that all these requests were within a relatively short distance from each other), would read all the data needed by all the applications as one distinct read, (or write as the case may be), saving significant amounts of access time.

However. . . . .  There's still the 2000GB Gorilla.

When you get up into the multiples of terabytes, keeping track of all those sectors becomes hugely unwieldy.  Translation tables were becoming unreasonably large, performance was suffering and the cost of maintaining these huge tables, as well as the optimization software needed to make them work, was becoming excessive.  Both the cost of the embedded hard drive controller chip's capacity and speed, as well as the sheer manpower needed to keep it all working, had become a significant expense.

What happened is what usually happens when manufacturing and engineering face a life-and-death crisis:  All the engineers got together, went to a resort somewhere, and got drunk . . . .

After they sobered up, they came up with a solution:  Drop the facade, and "come clean" with respect to drive geometry.  The result was the new Advanced Format Drive, (AFD), geometry that abandoned the idea of 512 byte sectors, organizing the drive into larger "sectors", (now called "allocation units" or "allocation blocks"), that are 4 KB - 4,096 bytes - in length.
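If you're curious what your own drive admits to, most modern Linux systems will show you both faces of the story - the logical sector size the drive presents to the world, and the physical sector size it actually uses underneath.  (sdX below is just a placeholder for your actual device.)  On a typical Advanced Format drive you should see 512 for the first and 4096 for the second:

# cat /sys/block/sdX/queue/logical_block_size
# cat /sys/block/sdX/queue/physical_block_size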

And I am sure you can guess what happened next.  It's what usually happens when someone comes clean about a sexy young mistress - the stodgy old wives had a fit!

The BIOS writers were - and still are - using the "Interrupt 13", (Int-13), boot process: a fossilized legacy from the days of the XT, and maybe even earlier.  And this boot process requires certain things:
  1. The hard disk must report a "sane" CHS geometry at start up.
  2. The Int-13 bootstrap expects to see 512 byte sectors for the partition table, boot code, and possibly even the secondary boot loader.
. . . . and it's kind of hard to square a 4 KB allocation unit size with a 512 byte sector.

So, to keep the stodgy old wives happy, the hard drive manufacturers did two things:
  1. They allow the first meg-or-so of the drive to be addressed natively as 512 byte sectors.  This provides enough room for the MBR, (Master Boot Record), and enough of the bootstrap loader so that the Int-13 boot process can get things going.  (You can peek at that first sector yourself - see the quick check just after this list.)
  2. The drives would still accept requests for data anywhere on the drive based on 512 byte sectors with two caveats:  There would be a huge performance penalty for doing so, and YOU had to do more of the work to keep track of the sector juggling act.  And God help you if you dropped the ball!
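As a quick sanity check that the very beginning of the drive really does answer to plain old 512 byte sector addressing, you can peek at the MBR directly - assuming you have xxd installed and root access, with sdX as a placeholder.  The last two bytes of the dump should be the classic 55 AA boot signature:

# dd if=/dev/sdX bs=512 count=1 2>/dev/null | xxd | tail -1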
And this is exactly the crux of the problem:  Many operating systems, (surprisingly, later versions of Windows are a notable exception), depend on sharing the juggling act with the hard drive itself.  Even Linux's hard-drive kernel modules assume that the drive will shoulder some of the load when using the legacy msdos partition table format.

I am sure you can guess what happens when HE expects you to be shouldering the entire load, and YOU expect him to shoulder his share.

This, my friends, is the 2000GB Gorilla and if he's not happy, things get "interesting". . . .

So, how do you go about taming this beast?

Interestingly enough, there has been a solution to this all along.  It's only now - when capacities large enough to require the AFD drive geometry have appeared on single-unit drives - that things have come to a head.

First:
The old "msdos" type of partition table makes assumptions about drive geometry that are no longer true.  Not to mention the fact that the msdos partition table can't handle exceptionally large drives.  Not without jumping through hoops or some really ugly hacks that we really don't want to think about.

The solution is to just abandon the msdos partition type, as there are a host of other partition types that will work just as well.  One in particular, GPT, (the GUID Partition Table), is designed specifically to work with more advanced drive geometries.

You do it like this:
(I'm using GNU parted, so that you can actually see what's happening.)
# parted
(parted)  select /dev/[device]
(parted)  mklabel gpt
(parted) [. . . . .]

Presto!  A non-msdos partition table structure that is compatible with the newer drive geometries.
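By the way, if you'd rather do it in one shot instead of interactively, parted's script mode should accomplish the same thing.  (As always, sdX is a placeholder - double-check the device name before you pull the trigger, because mklabel destroys the existing partition table.)

# parted -s /dev/sdX mklabel gpt
# parted -s /dev/sdX print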

Second:
You have to make sure that each partition is laid out so that the logical allocation units, (where the partition and its file system think the clusters are), and the actual, physical allocation units on the hard drive itself, are aligned properly.

If you fail to do this you could suffer the same massive performance penalty as if you were addressing 512 byte sectors; because for every allocation unit you read or write, multiple physical allocation units may have to be individually read, updated, and/or written.  Fortunately the Linux partitioner, parted, will complain bitterly if it notices that things aren't aligned properly.
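To put some numbers on it:  the old msdos convention started the first partition at sector 63, which drops you right into the middle of a physical 4 KB sector.  A little shell arithmetic shows the difference between that and the 1 meg, (sector 2048), starting point:

# echo $(( 63 * 512 % 4096 ))
3584
# echo $(( 2048 * 512 % 4096 ))
0

A partition starting at sector 63 is offset 3,584 bytes into a physical sector, so every 4 KB cluster straddles two of them; start at sector 2048 and everything lines up on an even boundary.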

The solution - when using parted - is to skip the first meg of the drive so that physical and logical allocation units align correctly.

Like this:
(parted)  mkpart primary ext4 1 -1
(parted) [. . . .]
(parted) quit

Here you make a partition that is primary, typed as an ext4 partition, starting at a 1 meg offset from the beginning of the drive and stopping at the very end, (-1).  Of course, you can set the partition type to ext2, ext3, or whatever - parted only records the file-system type as a hint in the partition table; it doesn't actually format anything.  I haven't heard of this being tried with xfs, Reiser, etc, so Your Mileage May Vary.
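If you'd rather not trust parted's bare-number defaults, you can spell the same partition out with explicit units.  I believe this is equivalent to the command above, but treat it as a sketch and let parted's alignment warnings have the final word:

(parted)  mkpart primary ext4 1MiB 100%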

By the way, this works with the Western Digital drives I purchased and, ideally, other manufacturers should map their drives the same way.  However, if you get a warning that the partition is not aligned correctly - look on the web, try different offsets, and keep plugging at it until you get the geometry lined up just right.

Update:
I finally had the chance to try this with a couple of 2TB Seagate drives and the partitioning scheme mentioned above worked like a champ with them as well.  So there's a really good chance that, whatever brand of 2TB hard-drive you buy, this fix will work just fine for you too.
/Update

If you are creating multiple partitions, you have to check alignment for each and every partition from beginning to end.  Fortunately, if the first partition is aligned properly, there's a good chance that subsequent partitions will align properly too.
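Happily, parted can do the checking for you.  The align-check command reports whether a given partition meets the drive's reported optimal alignment; partition 1 here is just an example - repeat it for each partition number you created:

(parted)  align-check optimal 1

If all is well, it should answer with "1 aligned".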

Once you do that, you can use mke2fs, (or whatever), to create the actual file system in the normal manner.  And once that is done you should notice that the drive access times are MUCH faster than before and you don't get a mid-drive logical crash!
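For example, something along these lines should do the trick for ext4 - the device name and the volume label are, of course, just placeholders:

# mkfs.ext4 -L nearline /dev/sdX1
# mkdir -p /mnt/nearline
# mount /dev/sdX1 /mnt/nearline

Note that on a drive this size mke2fs will pick a 4 KB file-system block size by default, which matches the drive's physical allocation units perfectly.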

It may appear more complicated now, but I strongly suspect that those who work with these fundamental drive utilities and drivers will rapidly bring their software up to date so that this stuff is handled transparently in future releases.

There is, unfortunately, one caveat with all this:  You can kiss backwards compatibility with legacy versions of Linux goodbye - as well as compatibility with legacy non-Linux operating systems when you switch to any kind of advanced partitioning scheme.

Of course this is not news.

When drives switched from CHS to LBA, from LBA to LBA-48, or from parallel ATA to serial ATA, backward compatibility for the newer drives was also lost.  You could regain it if needed, but not without using some butt-ugly hacks or specialized hardware adapters.

And my money's on the almost certain possibility that - in a few years, when hundreds-of-terabytes or petabyte hard drives become mainstream - the AFD geometry will need a major update too.

What say ye?

Jim

2 comments:

  1. Just when I thought I was "missing out" by not jumping on the "terabyte" bandwagon. Glad I stuck to my guns. I usually like to stay a few years behind the "bleeding edge" of the grand technology sword, specifically so I don't get cut. I'm sure given a little time (and a flurry of returns by people uninterested in "plugging away" at getting their new purchases to work), the manufacturers will get their ducks in a row. In the meantime, I'll make do with my meager 500 gig drives. As I seem to have everything I've ever done since ATARI days residing there with plenty of room to spare, I guess I can stand to wait a little more. I wonder if the supposedly consumer friendly 2 terabyte USB drives are going to leave people out in the cold, when they finally store that 30,000th photo.

  2. Excellent historical account of the problem, and a delightfully easy fix to a nasty problem if you were caught unawares.

