ZFS




What is ZFS?
ZFS is a new kind of file system: a combined file system and logical volume manager. Its features include data integrity verification to detect and repair silent data corruption, support for very high storage capacities, integration of the concepts of filesystem and volume management, snapshots and copy-on-write clones, continuous integrity checking with automatic repair, RAID-Z and native NFSv4 ACLs.
ZFS Pool
ZFS uses the concept of storage pools to manage physical storage. Historically, file systems were constructed on top of a single physical device. To address multiple devices and provide for data redundancy, the concept of a volume manager was introduced to provide a representation of a single device so that file systems would not need to be modified to take advantage of multiple devices. This design added another layer of complexity and ultimately prevented certain file system advances because the file system had no control over the physical placement of data on the virtualized volumes.
ZFS eliminates volume management altogether. Instead of forcing you to create virtualized volumes, ZFS aggregates devices into a storage pool. The storage pool describes the physical characteristics of the storage (device layout, data redundancy, and so on) and acts as an arbitrary data store from which file systems can be created. File systems are no longer constrained to individual devices, allowing them to share disk space with all file systems in the pool. You no longer need to predetermine the size of a file system, as file systems grow automatically within the disk space allocated to the storage pool. When new storage is added, all file systems within the pool can immediately use the additional disk space without additional work. In many ways, the storage pool works similarly to a virtual memory system: When a memory DIMM is added to a system, the operating system doesn't force you to run commands to configure the memory and assign it to individual processes. All processes on the system automatically use the additional memory.
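As a small illustration of this behaviour (the pool name "tank" and the device names below are made up for this sketch and are not the disks used in the walkthrough that follows):

# zpool create tank c2d0 c2d1
# zfs create tank/home
# zfs create tank/mail
# zpool add tank c3d0

After the "zpool add", both tank/home and tank/mail immediately see the additional free space; no resizing, growing or reformatting of filesystems is needed.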

Using zfs (basics)

Creating and manipulating zpools (zfs)


Zpools are the underlying device layer for zfs filesystems. Mirrors, RAID sets and concatenated/striped storage are defined at this level.


For pooling devices, a zpool can be set up as one of the following (a quick sketch of the corresponding creation commands follows this list):



- a mirror
- a RAIDz with single or double parity
- a concatenated/striped storage
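For orientation, the creation commands for these three variants look roughly like this; the device names are placeholders, and each variant is walked through in detail below:

# zpool create mypool mirror c1d0 c1d1
# zpool create mypool raidz1 c0d1 c1d0 c1d1
# zpool create mypool raidz2 c0d1 c1d0 c1d1
# zpool create mypool c0d1 c1d0 c1d1

(The last form, without a keyword, simply concatenates/stripes the given devices.)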



This worksheet was done with Solaris 10 running in a virtual Parallels machine. The disks are not real; they are virtual disks provided by Parallels, 8 GB each. Not much, but enough to play with.




First we will try to look up the disks accessible by our system:




# format

Searching for disks...done



AVAILABLE DISK SELECTIONS:
       0. c0d0
          /pci@0,0/pci-ide@1f,1/ide@0/cmdk@0,0
       1. c1d0
          /pci@0,0/pci-ide@1f,1/ide@1/cmdk@0,0
Specify disk (enter its number): ^C




Type CTRL-C to quit "format".



If your disks do not show up, use devfsadm:




# devfsadm

# format

Searching for disks...done




AVAILABLE DISK SELECTIONS:
       0. c0d0
          /pci@0,0/pci-ide@1f,1/ide@0/cmdk@0,0
       1. c0d1
          /pci@0,0/pci-ide@1f,1/ide@0/cmdk@1,0
       2. c1d0
          /pci@0,0/pci-ide@1f,1/ide@1/cmdk@0,0
       3. c1d1
          /pci@0,0/pci-ide@1f,1/ide@1/cmdk@1,0
Specify disk (enter its number): ^C




You'll notice that the virtual disks are mapped as IDE/ATA drives, so the disk device names don't have a target specification "t".
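(As a reminder, Solaris block device names follow the pattern c<controller>t<target>d<disk>s<slice>; SCSI and fibre channel devices carry the t<target> part, while plain IDE/ATA disks show up as c<controller>d<disk> only.)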



Let's create our first pool by simply putting together the three remaining disks (c0d0 holds our root partition and is the boot disk, so it is not usable for our example):
# zpool create zfstest c0d1 c1d0 c1d1
That's it. You have just created a zpool named "zfstest" containing all three disks. Your available space will be just the sum of all three disks:
# zpool list

NAME                    SIZE    USED   AVAIL    CAP  HEALTH     ALTROOT
zfstest                23.8G     91K   23.8G     0%  ONLINE     -

Use "zpool status" to get detailed status information of the components of your zpool:
# zpool status

  pool: zfstest
 state: ONLINE
 scrub: none requested
config:



        NAME        STATE     READ WRITE CKSUM
        zfstest     ONLINE       0     0     0
          c0d1      ONLINE       0     0     0
          c1d0      ONLINE       0     0     0
          c1d1      ONLINE       0     0     0



errors: No known data errors

To destroy a pool, use "zpool destroy":
# zpool destroy zfstest
and your pool is gone.
Let's try a mirror now:
# zpool create zfstest mirror c1d0 c1d1
You just created a mirror between disk c1d0 and disk c1d1. Available storage is the same as if you used only one of these disks. If disk sizes differ, the smaller size will be your storage size. Data is replicated between these disks.
"zpool status" now reads:
# zpool status

  pool: zfstest
 state: ONLINE
 scrub: none requested
config:



        NAME        STATE     READ WRITE CKSUM
        zfstest     ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            c1d0    ONLINE       0     0     0
            c1d1    ONLINE       0     0     0



errors: No known data errors

So now we have a simple mirror. But how do we put data on it?
When you create a zpool, a zfs filesystem is automatically created in it. The mountpoint defaults to the pool name, so your pool "zfstest" is mounted as a zfs filesystem at /zfstest:
# df -k

Filesystem            kbytes    used   avail capacity  Mounted on
/dev/dsk/c0d0s0      14951508 5725085 9076908    39%    /
/devices                   0       0       0     0%    /devices
ctfs                       0       0       0     0%    /system/contract
proc                       0       0       0     0%    /proc
mnttab                     0       0       0     0%    /etc/mnttab
swap                 2104456     836 2103620     1%    /etc/svc/volatile
objfs                      0       0       0     0%    /system/object
/usr/lib/libc/libc_hwcap1.so.1
                     14951508 5725085 9076908    39%    /lib/libc.so.1
fd                         0       0       0     0%    /dev/fd
swap                 2103624       4 2103620     1%    /tmp
swap                 2103644      24 2103620     1%    /var/run
zfstest              8193024      24 8192938     1%    /zfstest

We will create a big file on it:
# dd if=/dev/zero bs=128k count=40000 of=/zfstest/bigfile

40000+0 records in
40000+0 records out

It is really there now:
# ls -la /zfstest

total 10241344
drwxr-xr-x   2 root     sys            3 Apr 21 11:15 .
drwxr-xr-x  39 root     root        1024 Apr 21 11:13 ..
-rw-r--r--   1 root     root     5242880000 Apr 21 11:30 bigfile

Now the differences from classical volume managers begin. The underlying zpool "zfstest" actually KNOWS that approx. 5 gigabytes are taken by zfs:
# zpool list

NAME                    SIZE    USED   AVAIL    CAP  HEALTH     ALTROOT
zfstest                7.94G   4.88G   3.05G    61%  ONLINE     -

This has enormous advantages: When replacing a mirrored disk, zfs will only copy allocated blocks to the new disk and not all blocks of the pool. The same is true with RAID devices, only allocated data blocks are reconstructed.
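If, for example, disk c1d0 of a mirror had failed and you had a spare disk at hand (the device name c2d0 is purely hypothetical here), the replacement would look like this, and only the allocated blocks would be resilvered onto the new disk:

# zpool replace zfstest c1d0 c2d0
# zpool status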

Now let's break the mirror. You do that by detaching one drive from it:

# zpool detach zfstest c1d0
This command removed disk c1d0 from the pool. Your mirror is not a mirror any more:
# zpool status

  pool: zfstest
 state: ONLINE
 scrub: none requested
config:



        NAME        STATE     READ WRITE CKSUM
        zfstest     ONLINE       0     0     0
          c1d1      ONLINE       0     0     0



errors: No known data errors

However you did not lose a single bit of data! Your zpool is just as available as it was before (as long as disk c1d1 does not fail).
You may attach another disk to your pool to create a new mirror:
# zpool attach zfstest c1d1 c0d1
Now your mirror consists of disks c1d1 and c0d1. Solaris will immediately begin to duplicate any block that's used by zfs from drive c1d1 to drive c0d1:
# zpool status

  pool: zfstest
 state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
 scrub: resilver in progress, 55.53% done, 0h7m to go
config:



        NAME        STATE     READ WRITE CKSUM
        zfstest     ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            c1d1    ONLINE       0     0     0
            c0d1    ONLINE       0     0     0



errors: No known data errors

The process of replicating data on new or outdated disks is named "resilvering".
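Closely related to resilvering is scrubbing: you can ask zfs at any time to read back every allocated block of a pool, verify its checksum and repair it from redundancy if it does not match. The "scrub:" line in the status outputs above refers to exactly this:

# zpool scrub zfstest
# zpool status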
A mirror is not limited to two disks. If you have big concerns that your valuable data is prone to losses, just attach another disk to your mirror:
# zpool attach zfstest c0d1 c1d0
Your mirror now has three elements. Note that your storage size does not grow by attaching new mirror components. But now two drives may fail completely and you still have all of your data:
# zpool status

  pool: zfstest
 state: ONLINE
 scrub: resilver completed with 0 errors on Mon Apr 21 13:56:16 2008
config:



        NAME        STATE     READ WRITE CKSUM
        zfstest     ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            c1d1    ONLINE       0     0     0
            c0d1    ONLINE       0     0     0
            c1d0    ONLINE       0     0     0



errors: No known data errors

Let's detach two disks now:
# zpool detach zfstest c1d1

# zpool detach zfstest c1d0

Your mirror is gone once again. To set up concatenated or striped storage (write operations on zfs are spread across ALL pool members, so it behaves more like a striped disk set), you may add these disks to your pool (never mistake "add" for "attach" - the former ADDs storage, the latter attaches disks to mirrors):
# zpool add zfstest c1d0 c1d1
Your pool has become the same as the one we created at the beginning of our exercise:
# zpool list

NAME                    SIZE    USED   AVAIL    CAP  HEALTH     ALTROOT
zfstest                23.8G   4.88G   18.9G    20%  ONLINE     -
# zpool status
  pool: zfstest
 state: ONLINE
 scrub: resilver completed with 0 errors on Mon Apr 21 13:56:16 2008
config:



        NAME        STATE     READ WRITE CKSUM
        zfstest     ONLINE       0     0     0
          c0d1      ONLINE       0     0     0
          c1d0      ONLINE       0     0     0
          c1d1      ONLINE       0     0     0



errors: No known data errors

The only difference is our file "bigfile", which is still available as we did not destroy the pool. You can see it in the output of "zpool list" above: 4.88G are still used.
Now we are stuck. It is not possible to remove disks that have been added to a zpool. As writes occur on all members, newly written data is spread across all disks, so there is no way to throw a disk away. Mirror component disks, on the other hand, can be detached at any time - as long as it is not the last disk of a mirror.
Let's destroy the pool and set up a RAID storage. ZFS offers two RAID types: raidz1 and raidz2. raidz1 means single parity, raidz2 double parity.
# zpool destroy zfstest

# zpool create zfstest raidz1 c0d1 c1d0 c1d1

We have now created a raid group:
# zpool status

  pool: zfstest
 state: ONLINE
 scrub: none requested
config:



        NAME        STATE     READ WRITE CKSUM
        zfstest     ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            c0d1    ONLINE       0     0     0
            c1d0    ONLINE       0     0     0
            c1d1    ONLINE       0     0     0



errors: No known data errors

Be aware that "zpool list" is showing the global capacity of your raid set and not the usable capacity:
# zpool list

NAME                    SIZE    USED   AVAIL    CAP  HEALTH     ALTROOT
zfstest                23.9G    157K   23.9G     0%  ONLINE     -

To see how much space we can actually allocate, use the zfs command:
# zfs list

NAME      USED  AVAIL  REFER  MOUNTPOINT
zfstest   101K  15.7G  32.6K  /zfstest

One disk may fail in this scenario.
To set up the same pool with double parity:
# zpool destroy zfstest

# zpool create zfstest raidz2 c0d1 c1d0 c1d1
# zfs list
NAME      USED  AVAIL  REFER  MOUNTPOINT
zfstest  86.7K  7.80G  24.4K  /zfstest

Only 7.8 GB left - compared to 15.7 GB with the single-parity RAID device. Two drives may fail now.
We have achieved the same data security as with a three-way mirror - and hence the same usable storage.
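The numbers add up: each of our three virtual disks contributes roughly 7.9G. With raidz1, one disk's worth of capacity goes to parity, leaving roughly the capacity of two disks (about 15.7G usable); with raidz2, two disks' worth go to parity, leaving about 7.8G - the same as a single disk or a three-way mirror.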
As a practical example, here is the output of "zpool status" and "zpool list" of a mail server. The zpool "mail" consists of two mirror pairs added to one pool.
The creation command was:
# zpool create mail \

 mirror c6t600D0230006C1C4C0C50BE5BC9D49100d0 c6t600D0230006B66680C50AB7821F0E900d0 \
 mirror c6t600D0230006B66680C50AB0187D75000d0 c6t600D0230006C1C4C0C50BE27386C4900d0

As you see, it is perfectly legal and possible to combine the storage of two mirrors in one pool.
# zpool status

  pool: mail
 state: ONLINE
 scrub: none requested
config:



        NAME                                       STATE     READ WRITE CKSUM
        mail                                       ONLINE       0     0     0
          mirror                                   ONLINE       0     0     0
            c6t600D0230006C1C4C0C50BE5BC9D49100d0  ONLINE       0     0     0
            c6t600D0230006B66680C50AB7821F0E900d0  ONLINE       0     0     0
          mirror                                   ONLINE       0     0     0
            c6t600D0230006B66680C50AB0187D75000d0  ONLINE       0     0     0
            c6t600D0230006C1C4C0C50BE27386C4900d0  ONLINE       0     0     0




# zpool list
NAME                    SIZE    USED   AVAIL    CAP  HEALTH     ALTROOT
mail                   6.81T    3.08T  3.73T    45%  ONLINE     -


As you see, you can also use Sun MPxIO devices - they have LONG device names. You may also use FDISK partitions on x86 computers (...p0, ...p1, ...) and Solaris slices (...s0, ...s1, ...) to set up zpools. Neither is recommended for production, but they are fine for playing with zpool commands.
The MPxIO names are usable because they show up just like normal block disk devices in /dev/dsk:
# ls -la /dev/dsk/c6t600D0230006C1C4C0C50BE5BC9D49100d0

lrwxrwxrwx   1 root     root          65 Dec 11 06:22 /dev/dsk/c6t600D0230006C1C4C0C50BE5BC9D49100d0 -> ../../devices/scsi_vhci/disk@g600d0230006c1c4c0c50be5bc9d49100:wd
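If you only want to play with zpool commands and have no spare disks, a slice-based pool would look something like this (the slice name c1d1s0 is just an example and assumes the disk has been labelled accordingly):

# zpool create playpool c1d1s0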


To use zfs, you need to create at least one zpool first.


After that, you should have something like this:



# zpool list

NAME                    SIZE    USED   AVAIL    CAP  HEALTH     ALTROOT

zfstest                23.8G     91K   23.8G     0%  ONLINE     -



This zpool "zfstest" also has one incorporated zfs filesystem on it. To manipulate zfs there is the "zfs" command. So keep in mind: zpool manipulates pool storage, zfs manipulates zfs generation and options. Try this:



# zfs list

NAME      USED  AVAIL  REFER  MOUNTPOINT

zfstest    88K  23.4G  24.5K  /zfstest




As you can see, the pool "zfstest" also has a filesystem on it, mounted automatically at mountpoint /zfstest.



You may create a new filesystem by using "zfs create":



# zfs create zfstest/king

# zfs list

NAME           USED  AVAIL  REFER  MOUNTPOINT
zfstest        118K  23.4G  25.5K  /zfstest
zfstest/king  24.5K  23.4G  24.5K  /zfstest/king



New filesystems within a pool are always named "poolname/filesystemname". Without any additional options, they also mount automatically at "/poolname/filesystemname".



Let's create another one:



# zfs create zfstest/queen

# zfs list

NAME            USED  AVAIL  REFER  MOUNTPOINT
zfstest         147K  23.4G  25.5K  /zfstest
zfstest/king   24.5K  23.4G  24.5K  /zfstest/king
zfstest/queen  24.5K  23.4G  24.5K  /zfstest/queen



We see some differences between old-fashioned filesystems and zfs: usable storage is shared among all filesystems in a pool. "zfstest/king" has 23.4G available, and so do "zfstest/queen" and the master pool filesystem "zfstest".



So why create filesystems then? Couldn't we just use subdirectories in our master pool filesystem "zfstest" (mounted on /zfstest)?



The "trick" about zfs filesystems is the possibility to assign options to them, so they can be treated differently. We will see that later.



First, let's push some throwaway data onto our newly created filesystem:



# dd if=/dev/zero bs=128k count=5000 of=/zfstest/king/bigfile

5000+0 records in

5000+0 records out



This command creates a file "bigfile" in directory /zfstest/king, consisting of 5000 times 128 kilobytes. That's big enough for our purpose.
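To check the arithmetic: 5000 blocks of 128 KB are 5000 x 131072 bytes = 655360000 bytes, or 625 MB in powers-of-two units - both numbers will show up again in the outputs below.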



"zfs list" reads:



# zfs list

NAME            USED  AVAIL  REFER  MOUNTPOINT

zfstest         625M  22.8G  27.5K  /zfstest
zfstest/king    625M  22.8G   625M  /zfstest/king
zfstest/queen  24.5K  22.8G  24.5K  /zfstest/queen



625 megabytes are used by filesystem zfstest/king, as expected. Notice also that every other filesystem on that pool can now only allocate 22.8G, as 625M are taken (compare with the 23.4G above, before creating that big file).



You CAN also look up the free space in your zfs filesystems with "df -k", but I wouldn't recommend it: you won't see snapshots and the numbers can get very big.



Example for our zpool "zfstest":



# df -k

Filesystem            kbytes    used   avail capacity  Mounted on

/dev/dsk/c0d0s0      14951508 5725184 9076809    39%    /
/devices                   0       0       0     0%    /devices
ctfs                       0       0       0     0%    /system/contract
[... lines omitted ...]
zfstest              24579072      27 23938789     1%    /zfstest
zfstest/king         24579072  640149 23938789     3%    /zfstest/king
zfstest/queen        24579072      24 23938789     1%    /zfstest/queen



So 22.8G corresponds to 23938789 kilobytes (23938789 / 1024 / 1024 is roughly 22.8). Sun uses 1K = 1024 bytes, 1M = 1024K, 1G = 1024M, 1T = 1024G. They're a computer company and not an ISO metric organization...



So let's try our first option: "quota".
As you can imagine, "quota" limits storage. You know the concept, as nearly every mailbox provider imposes a quota on your storage, as do file hosting providers.
First: to set and get options, you use "zfs set" and "zfs get", respectively.



So here we define a quota on zfstest/queen:



# zfs set quota=5G zfstest/queen



Result:



# zfs list

NAME            USED  AVAIL  REFER  MOUNTPOINT

zfstest         625M  22.8G  27.5K  /zfstest
zfstest/king    625M  22.8G   625M  /zfstest/king
zfstest/queen  24.5K  5.00G  24.5K  /zfstest/queen



Only 5G left to use at mountpoint /zfstest/queen. Note that you may still gobble up 22.8G in /zfstest/king, which would then make it impossible to put 5G into /zfstest/queen. So a quota does not guarantee any storage, it only limits it.



To guarantee a certain amount of storage, use the option "reservation":



# zfs set reservation=5G zfstest/queen



Now we have simulated a classical "partition": we reserved the same amount of storage as the quota allows, 5G:



# zfs list

NAME            USED  AVAIL  REFER  MOUNTPOINT

zfstest        5.61G  17.8G  27.5K  /zfstest
zfstest/king    625M  17.8G   625M  /zfstest/king
zfstest/queen  24.5K  5.00G  24.5K  /zfstest/queen



The other filesystems only have 17.8G left, as 5G are now really reserved for zfstest/queen.



Now, let's try another nice option: compression.
Perhaps you are now thinking of the compression nightmares on Windows systems, like DoubleSpace, Stacker and all those other parasitic programs which killed performance, not storage needs. Forget them! zfs compression IS reliable and fast!
With today's CPU power, compressing and decompressing data on the fly is cheap and won't significantly harm your overall performance; it can even boost performance, because compressed data needs less i/o.
As with many other zfs options, changing the compression setting only affects newly written files/blocks. Previously written uncompressed blocks can still be read. It is transparent to the application: fseek() et al. do not even notice that files are compressed.



# zfs set compression=on zfstest/queen



Now, compression is activated on /zfstest/queen (as "zfstest/queen" is mounted on /zfstest/queen, we did not change the mountpoint - and yes, you're right, the mountpoint is also just another zfs option...).
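Just to illustrate that point (the path /export/queen is an arbitrary example; we will not actually change the mountpoint in this walkthrough), you could relocate the filesystem with:

# zfs set mountpoint=/export/queen zfstest/queen

zfs would then unmount it from /zfstest/queen and remount it at the new location automatically.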



Let's copy our "bigfile" from king to queen:



# cp /zfstest/king/bigfile /zfstest/queen



OK, THIS is unfair - as our file consists only of zeroes, zfs doesn't really have to compress anything; it just records that 655360000 bytes of zeroes have to be generated when the file is read. It's a kind of built-in "benchmark hook" that produces nice results and avoids wasting space on files full of holes:



# zfs list

NAME            USED  AVAIL  REFER  MOUNTPOINT

zfstest        5.61G  17.8G  27.5K  /zfstest
zfstest/king    625M  17.8G   625M  /zfstest/king
zfstest/queen  24.5K  5.00G  24.5K  /zfstest/queen



No space needed in zfstest/queen... You can check it with "ls -las" (option "s" prints the number of disk blocks needed to store each file):



# ls -las /zfstest/queen

total 7

   3 drwxr-xr-x   2 root     sys            3 Apr 23 06:17 .
   3 drwxr-xr-x   4 root     sys            4 Apr 23 06:05 ..
   1 -rw-r--r--   1 root     root     655360000 Apr 23 06:18 bigfile



One block. On our uncompressed king filesystem the situation looks like this:



# ls -las /zfstest/king

total 1280257

   3 drwxr-xr-x   2 root     sys            4 Apr 23 06:19 .
   3 drwxr-xr-x   4 root     sys            4 Apr 23 06:05 ..
1280251 -rw-r--r--   1 root     root     655360000 Apr 23 06:10 bigfile



To create a "real world" file, we will use the "zfs get all" command, which lists ALL options of a zfs filesystem:



# zfs get all zfstest/queen

NAME           PROPERTY       VALUE                  SOURCE

zfstest/queen  type           filesystem             -
zfstest/queen  creation       Wed Apr 23  6:05 2008  -
zfstest/queen  used           24.5K                  -
zfstest/queen  available      5.00G                  -
zfstest/queen  referenced     24.5K                  -
zfstest/queen  compressratio  1.00x                  -
zfstest/queen  mounted        yes                    -
zfstest/queen  quota          5G                     local
zfstest/queen  reservation    5G                     local
zfstest/queen  recordsize     128K                   default
zfstest/queen  mountpoint     /zfstest/queen         default
zfstest/queen  sharenfs       off                    default
zfstest/queen  checksum       on                     default
zfstest/queen  compression    on                     local
zfstest/queen  atime          on                     default
zfstest/queen  devices        on                     default
zfstest/queen  exec           on                     default
zfstest/queen  setuid         on                     default
zfstest/queen  readonly       off                    default
zfstest/queen  zoned          off                    default
zfstest/queen  snapdir        hidden                 default
zfstest/queen  aclmode        groupmask              default
zfstest/queen  aclinherit     secure                 default
zfstest/queen  canmount       on                     default
zfstest/queen  shareiscsi     off                    default
zfstest/queen  xattr          on                     default



As you can see, the "compressratio" property (which is read-only, so you may only use "zfs get" and not "zfs set") gives the compression ratio of your filesystem; our "zero file" does not count, so it remains at 1.00x.



So let's now create another file in our compressed queen filesystem:



# zfs get all zfstest/queen > /zfstest/queen/outputfile



Our file will use 3 disk blocks:



# ls -las /zfstest/queen

total 10

   3 drwxr-xr-x   2 root     sys            4 Apr 23 06:18 .
   3 drwxr-xr-x   4 root     sys            4 Apr 23 06:05 ..
   1 -rw-r--r--   1 root     root     655360000 Apr 23 06:18 bigfile
   3 -rw-r--r--   1 root     root        1598 Apr 23 06:18 outputfile



Let's copy it to our uncompressed king filesystem:



# cp /zfstest/queen/outputfile /zfstest/king/



Here it will use 5 blocks:



# ls -las /zfstest/king

total 1280262

   3 drwxr-xr-x   2 root     sys            4 Apr 23 06:19 .
   3 drwxr-xr-x   4 root     sys            4 Apr 23 06:05 ..
1280251 -rw-r--r--   1 root     root     655360000 Apr 23 06:10 bigfile
   5 -rw-r--r--   1 root     root        1598 Apr 23 06:19 outputfile
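Once real, compressible data lives on the filesystem, you can watch the ratio with a simple query; the exact value depends entirely on your data:

# zfs get compressratio zfstest/queen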



These were the basic steps to create zfs filesystems, but at least one command is missing: how do you destroy filesystems? Use "zfs destroy":



# zfs destroy zfstest/king
# zfs destroy zfstest/queen



Note that the filesystem must not be in use, otherwise this won't work (just as unmounting (umount) a classical filesystem won't work while it is in use).
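If the destroy (or an unmount) fails because the dataset is busy, something still has a file open or a shell sitting in the directory. On Solaris you can track down the culprit with fuser (the path is just an example):

# fuser -cu /zfstest/king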



Note that you may NOT destroy "zfstest", because that is the master filesystem of your pool; destroy the pool itself if you want to get rid of it:



# zfs destroy zfstest      

cannot destroy 'zfstest': operation does not apply to pools

use 'zfs destroy -r zfstest' to destroy all datasets in the pool
use 'zpool destroy zfstest' to destroy the pool itself


