Cluster sizes in RAID0+1

knowyouremeny
Offline
Joined: Mar 2 2006

I first installed the RAID drivers, obviously, and then when installing Windows, I made a 60GB partition to boot from. The Windows installer automatically formats this partition with 4KB clusters. I left the remaining 410GB as a second partition, unformatted.

Once Windows was loaded, I formatted the 410GB partition (E) with 64KB clusters by running "format e: /FS:NTFS /A:64K" in a command prompt.

As you'll notice, as file size gets larger, the 64KB cluster drive starts to pull away. This cannot be attributed to the position of the partitions on the drives, since the C drive is actually on the outermost (fastest) portion of the disks. At any rate, here are the results:

C (4KB clusters): http://s42.yousendit.com/d.aspx?id=0KYNL4BQTNOOI3NAK1DP0JN36K
E (64KB clusters): http://s48.yousendit.com/d.aspx?id=1ATPXUQKCYRV23R79LSA4SIVNX

It would've been really nice to get a 5th drive so that I wouldn't have had to partition the array, but it's working pretty nicely... despite having a rather jagged bandwidth chart across the drive. To be honest with you, I'm not entirely sure it was worth the hassle. If I had a system drive as well as the array, it would definitely be great, but I'm not sure how much the partition is hindering the performance of the array.

More on cluster sizes here:

If you are going to store multimedia stuff that is usually huge in size, make cluster bigger to increase a performance.

http://www.ntfs.com/ntfs_optimization.htm

The choice of cluster size has an impact on real-world performance, though for most people it is not all that significant. In a nutshell, larger clusters waste more space due to slack but generally provide for slightly better performance because there will be less fragmentation and more of the file will be in consecutive blocks.

http://www.storagereview.com/guide2000/ref/hdd/perf/perf/ext/fileCluster.html

Vimeous
Offline
Joined: Mar 10 2006

Here's my analogy for cluster sizes, hope it helps.

Clusters are the basic storage units for your data. A number of them make up your total partitioned space.
If you consider each cluster to be a box, then different cluster sizes are different sizes of box. In this case let's say we have two similar total storage spaces broken up into a) small boxes (4k clusters) and b) large boxes (64k clusters). Our boxes will be filled with pages of information (data). Critically, a box cannot contain two different sets of information, but multiple boxes can be combined to provide storage for one large piece of information.

Let's say we fill the first 80 small boxes and 5 large ones. We start with box one in both cases and keep going until we have no 'information' left.
In this case we're storing the same quantity of information, and because the boxes are all contiguous, were we searching for it, we'd find it all stored in order. It'd take slightly longer across the small boxes because we have to keep moving from box to box, but overall the difference is negligible.

Now let's add another 80 small boxes of information to a) and 5 large to b). Then remove the original information.

Now I want to put 160 small and 10 large boxes of information into my storage. The first 80/5 fit in before the currently stored information and the remaining 80/5 after. Now when I go to retrieve that information I have to spend time going from the end of the first half of the data to the start of the second half.
Again the time taken is very similar because the information is stored in similar locations.

If, however, we added one set of new information into 65 small boxes and a second set of 95 later, we run into problems when trying to replicate it with large boxes.
With small boxes the first set leaves 15 boxes free in the gap, so we put the first 15 of the second set before the original 80 boxes and the remaining 80 after.
However, the first new information set needs 4.06 large boxes and the second 5.94. Because we can't mix sets one and two in a single box, we have to use 5 large boxes for set one and 6 for set two. The first set goes before the current information and the second after, which uses up an entire large box (or 16 small boxes) of additional storage space. On the positive side, while new set one is easy to access for both small and large boxes, set two is more time-consuming in the small box example, as it takes more time to skip the intervening data to get to the second part of the information.
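
To make the box example above concrete, here is a rough Python sketch of it. It assumes a simple "fill the first free boxes you find" rule and a 2MB toy disk, both of which are my own simplifications; real file systems place data in more sophisticated ways. The names clusters_needed, first_fit and scenario are made up for the illustration.

```python
import math

def clusters_needed(size_kb, cluster_kb):
    """A file always occupies whole clusters, so round up."""
    return math.ceil(size_kb / cluster_kb)

def first_fit(disk, file_id, count):
    """Fill the first free clusters found, in order (assumed policy).
    Returns the number of separate contiguous runs (fragments) used."""
    fragments, prev = 0, None
    for i, owner in enumerate(disk):
        if count == 0:
            break
        if owner is None:
            disk[i] = file_id
            count -= 1
            if prev is None or i != prev + 1:
                fragments += 1  # a gap was skipped, so a new fragment starts
            prev = i
    if count:
        raise ValueError("not enough free clusters")
    return fragments

def scenario(cluster_kb, total_kb=2048):
    """Replay the example: store two 320k sets, delete the first,
    then store a 260k set (65 small boxes) and a 380k set (95 small boxes)."""
    disk = [None] * (total_kb // cluster_kb)
    first_fit(disk, "old", clusters_needed(320, cluster_kb))
    first_fit(disk, "kept", clusters_needed(320, cluster_kb))
    disk[:] = [None if owner == "old" else owner for owner in disk]
    results = {}
    for name, size_kb in (("set1", 260), ("set2", 380)):
        n = clusters_needed(size_kb, cluster_kb)
        results[name] = (n, first_fit(disk, name, n))  # (clusters, fragments)
    return results

print(scenario(4))   # {'set1': (65, 1), 'set2': (95, 2)}
print(scenario(64))  # {'set1': (5, 1), 'set2': (6, 1)}
```

With 4k clusters the second set ends up in two fragments but no space is lost; with 64k clusters both sets stay in one piece but together take 11 large boxes instead of 10.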

Finally let's look at small amounts of information. If we want to store successive allocations of 4 small boxes of information then it's fairly straightforward. However, because we cannot mix data in a box, when we use our large box system each new allocation of four small boxes takes up a whole large box, which is four times the storage space. So for example if I want to store 4x 4 small boxes of info I use only 16 small boxes of space, whereas with large boxes 4x 1 large box uses the equivalent of 64 small boxes of space!
Left-over empty space in clusters is known as wasted space (or slack).
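
If you want to put numbers on the wasted space, a small Python sketch like the one below will do. It only rounds each file up to whole clusters and ignores any file system bookkeeping, so treat it as an estimate; allocated_kb and wasted_kb are names I made up.

```python
import math

def allocated_kb(file_sizes_kb, cluster_kb):
    """Total space taken on disk when every file is rounded up to whole clusters."""
    return sum(math.ceil(size / cluster_kb) * cluster_kb for size in file_sizes_kb)

def wasted_kb(file_sizes_kb, cluster_kb):
    """Slack: space allocated on disk minus the data actually stored."""
    return allocated_kb(file_sizes_kb, cluster_kb) - sum(file_sizes_kb)

# Four allocations of 4 small boxes each (4 x 16k of data).
files = [16, 16, 16, 16]
for cluster in (4, 64):
    print(cluster, allocated_kb(files, cluster), wasted_kb(files, cluster))
# 4k clusters:  64k allocated, 0k wasted
# 64k clusters: 256k allocated (64 small boxes), 192k wasted
```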

On a hard drive the continual addition and deletion of data leads to a situation where the data collection tool (a single disk head) has to whizz round the drive grabbing info from all over the place just to access one set of data.
As this gets worse the drive becomes what is known as fragmented. Defragmentation attempts to take distinct data groups and move them around in the data structure so the clusters they occupy are close together, thereby reducing the time the disk head needs to gather the information.

Going back to the original question of which cluster size is better, I suggest the following:

4k Clusters for OS and installation directories where thousands of small files are common. This makes efficient use of space but can suffer slow-down over time if data is allowed to get too fragmented. Even here many sub-1k files will cause wasted space. Just imagine thousands of 1k files in 64k clusters, however (100x 1k suddenly takes up 6.4MB)!

8-16k Clusters for general use where there is a mix of file sizes and a balance between speed and efficiency is required.

32, 64 and 128k Clusters for large data files, where reduced fragmentation improves speed of access and contiguity of data, and the problem of wasted space is largely avoided.

My apologies for rattling on. I'm sure I could word it better (I prefer to use a pigeon-hole analogy but not everyone knows what that is).
However I hope it helps clarify why different cluster sizes make a difference to the way in which your hard drives perform.

N.B.
On SATA discs many of you will have noted something called NCQ (Native Command Queuing). This is designed to reduce the problems of fragmentation by queuing disc access requests and reordering them by how close the requested data is to the data currently being accessed.
By moving the head to the nearest data rather than using the original order in which the requests were received, the time lost as the disc head continually moves backwards and forwards is reduced.
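
As a rough illustration of the reordering idea, here is a Python sketch that compares total head travel when requests are served in arrival order against a simple "nearest request first" rule. This is only a toy model: track positions are plain numbers, the request list is made up, and real drives do this in firmware and also take the rotational position of the platter into account. head_travel and nearest_first are hypothetical names.

```python
def head_travel(start, requests):
    """Total distance the head moves when serving requests in the given order."""
    pos, total = start, 0
    for track in requests:
        total += abs(track - pos)
        pos = track
    return total

def nearest_first(start, requests):
    """Reorder the queue so the head always goes to the closest pending request."""
    pending = list(requests)
    pos, order = start, []
    while pending:
        nxt = min(pending, key=lambda track: abs(track - pos))
        pending.remove(nxt)
        order.append(nxt)
        pos = nxt
    return order

queue = [95, 10, 60, 5, 90]             # made-up track numbers, in arrival order
reordered = nearest_first(50, queue)    # head starts at track 50
print(reordered)                        # [60, 90, 95, 10, 5]
print(head_travel(50, queue))           # 320
print(head_travel(50, reordered))       # 135
```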

Work (Mrs Vim)
Panasonic HDC SD20/100/200/300
Dell Precision (Nehalim EP) | Microboards Duplicator
Edius 5.5

Home
Canon XL1s | MA-200
Athlon X2 4200+ | 2Gb | A8N-SLI Premium | 6600GT | SonicFury | 4x 250Gb Maxline III (RAID10) 4x 250Gb WD RE2 (RAID10) | DVDRW | Coolermaster Stacker 830 | Seasonic 700W | 2 x VP171s-2 TFT
Premiere 6.5 | Canopus ADVC 300

knowyouremeny
Offline
Joined: Mar 2 2006

Forgot to post my follow-up here! I made some changes:

Due to the very random performance of that setup, I tried multiple times to get it to work properly and be more stable, but it never did. Currently running RAID1 w/ 2 drives on NVRAID and RAID0 w/ 2 drives on the integrated Silicon Image controller.

The NVRAID1 looks to be the culprit for ruining the 0+1, since my current RAID1 is performing just as well as the 0+1 I posted pictures of.

Vimeous
Offline
Joined: Mar 10 2006

Funnily enough I've just rebuilt our edit machine. Like you we have a mix of NVRAID and Silicon Image 3114. I've had some trouble getting a two-drive RAID1 on the NVRAID working correctly as for a time it randomly claimed the set had broken. I've replaced the SATA cables and so far all has been well.
The SI3114 is running 4 drives in RAID5.

We're using the RAID5 split to hold boot/programs, page-file, project and archive partitions. It's using the standard NTFS cluster size of 4k. The RAID1 is using 128k clusters and is intended only for capture.

Hopefully it'll perform well as we've done no real testing yet but it's a definite advance over our previous hardware!
