ZFS
What is ZFS?
ZFS is a new kind of file system: a combined file system and logical volume manager. The features of ZFS include protection against data corruption, support for high storage capacities, integration of the concepts of filesystem and volume management, snapshots and copy-on-write clones, continuous integrity checking with automatic repair, RAID-Z, and native NFSv4 ACLs.
ZFS Pool
ZFS uses the concept of storage
pools to manage physical storage.
Historically, file systems were constructed on top of a single physical device.
To address multiple devices and provide for data redundancy, the concept of a volume manager was introduced to provide a
representation of a single device so that file systems would not need to be
modified to take advantage of multiple devices. This design added another layer
of complexity and ultimately prevented certain file system advances because the
file system had no control over the physical placement of data on the
virtualized volumes.
ZFS eliminates volume
management altogether. Instead of forcing you to create virtualized volumes,
ZFS aggregates devices into a storage pool. The storage pool describes the
physical characteristics of the storage (device layout, data redundancy, and so
on) and acts as an arbitrary data store from which file systems can be created.
File systems are no longer constrained to individual devices, allowing them to
share disk space with all file systems in the pool. You no longer need to predetermine
the size of a file system, as file systems grow automatically within the disk
space allocated to the storage pool. When new storage is added, all file
systems within the pool can immediately use the additional disk space without
additional work. In many ways, the storage pool works similarly to a virtual
memory system: When a memory DIMM is added to a system, the operating system
doesn't force you to run commands to configure the memory and assign it to
individual processes. All processes on the system automatically use the
additional memory.
Using zfs (basics)
Creating and manipulating zpools (zfs)
Zpools are the underlying device layer for zfs filesystems. Mirrors, RAIDs and concatenated storage are defined here. For pooling devices, a zpool can be (a preview of the corresponding commands follows right after this list):
- a mirror
- a RAID-Z with single or double parity
- concatenated/striped storage
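For orientation, the corresponding creation commands look roughly like this (a preview sketch only; the pool name "example" is illustrative, and each variant is demonstrated for real later in this worksheet):
# zpool create example mirror c1d0 c1d1           # a mirror
# zpool create example raidz1 c0d1 c1d0 c1d1      # single-parity RAID-Z
# zpool create example raidz2 c0d1 c1d0 c1d1      # double-parity RAID-Z
# zpool create example c0d1 c1d0 c1d1             # concatenated/striped storage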
This worksheet was done with Solaris 10 running in a virtual Parallels machine. The disks are not real; they are virtualized by Parallels, with 8 GB each. Not much, but enough to play with.
First we will try to look up the disks accessible by our system:
# format
Searching for disks...done
AVAILABLE DISK SELECTIONS:
0. c0d0
/pci@0,0/pci-ide@1f,1/ide@0/cmdk@0,0
1. c1d0
/pci@0,0/pci-ide@1f,1/ide@1/cmdk@0,0
Specify disk (enter its number): ^C
Type CTRL-C to quit "format".
If your disks do not show up, use devfsadm:
# devfsadm
# format
Searching for disks...done
AVAILABLE DISK SELECTIONS:
0. c0d0
/pci@0,0/pci-ide@1f,1/ide@0/cmdk@0,0
1. c0d1
/pci@0,0/pci-ide@1f,1/ide@0/cmdk@1,0
2. c1d0
/pci@0,0/pci-ide@1f,1/ide@1/cmdk@0,0
3. c1d1
/pci@0,0/pci-ide@1f,1/ide@1/cmdk@1,0
Specify disk (enter its number): ^C
You'll notice that the virtual disks are mapped as IDE/ATA drives, so the disk
device names don't have a target specification "t".
Let's create our first pool by simply putting together the three free disks (c0d0 holds our root partition and is the boot disk, so it is not usable for our example):
# zpool create zfstest c0d1 c1d0 c1d1
That's it. You have just created a zpool named
"zfstest" containing all three disks. Your available space will be
just the sum of all three disks:
# zpool list
NAME      SIZE   USED   AVAIL   CAP   HEALTH   ALTROOT
zfstest  23.8G    91K   23.8G    0%   ONLINE   -
Use "zpool status" to get detailed status information
of the components of your zpool:
# zpool status
pool: zfstest
state: ONLINE
scrub: none requested
config:
        NAME        STATE     READ WRITE CKSUM
        zfstest     ONLINE       0     0     0
          c0d1      ONLINE       0     0     0
          c1d0      ONLINE       0     0     0
          c1d1      ONLINE       0     0     0
errors: No known data errors
To destroy a pool, use "zpool destroy":
# zpool destroy zfstest
and your pool is gone.
Let's try a mirror now:
# zpool create zfstest mirror c1d0 c1d1
You just created a mirror between disk c1d0 and disk c1d1.
Available storage is the same as if you used only one of these disks. If disk
sizes differ, the smaller size will be your storage size. Data is replicated
between these disks.
"zpool
status" now reads:
# zpool status
pool: zfstest
state: ONLINE
scrub: none requested
config:
        NAME        STATE     READ WRITE CKSUM
        zfstest     ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            c1d0    ONLINE       0     0     0
            c1d1    ONLINE       0     0     0
errors: No known data errors
So now we have a simple mirror. But how do we put data on it?
When you create a zpool, a zfs filesystem is automatically created in it. The mountpoint defaults to the pool name, so your pool "zfstest" is mounted as a zfs filesystem at /zfstest:
# df -k
Filesystem            kbytes    used    avail capacity  Mounted on
/dev/dsk/c0d0s0     14951508 5725085  9076908    39%    /
/devices                   0       0        0     0%    /devices
ctfs                       0       0        0     0%    /system/contract
proc                       0       0        0     0%    /proc
mnttab                     0       0        0     0%    /etc/mnttab
swap                 2104456     836  2103620     1%    /etc/svc/volatile
objfs                      0       0        0     0%    /system/object
/usr/lib/libc/libc_hwcap1.so.1
                    14951508 5725085  9076908    39%    /lib/libc.so.1
fd                         0       0        0     0%    /dev/fd
swap                 2103624       4  2103620     1%    /tmp
swap                 2103644      24  2103620     1%    /var/run
zfstest              8193024      24  8192938     1%    /zfstest
We will create a big file on it:
# dd if=/dev/zero bs=128k count=40000 of=/zfstest/bigfile
40000+0 records in
40000+0 records out
It is really there now:
# ls -la /zfstest
total 10241344
drwxr-xr-x   2 root     sys            3 Apr 21 11:15 .
drwxr-xr-x  39 root     root        1024 Apr 21 11:13 ..
-rw-r--r--   1 root     root  5242880000 Apr 21 11:30 bigfile
Now the differences from classical volume managers begin. The underlying zpool "zfstest" actually KNOWS that approximately 5 gigabytes are taken by zfs:
# zpool list
NAME      SIZE   USED   AVAIL   CAP   HEALTH   ALTROOT
zfstest  7.94G  4.88G   3.05G   61%   ONLINE   -
This has enormous advantages: when replacing a mirrored disk, zfs will only copy the allocated blocks to the new disk, not all blocks of the pool. The same is true for RAID devices: only allocated data blocks are reconstructed.
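For illustration only (we do not run it in this walkthrough), replacing a member of the mirror would use "zpool replace"; only the allocated blocks are resilvered onto the newcomer:
# zpool replace zfstest c1d0 c0d1     # swap c1d0 out for the currently unused disk c0d1
# zpool status                        # watch the resilver progress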
Now let's just stop the mirror. You do that just by detaching
one drive from the mirror:
# zpool detach zfstest c1d0
This command pulled away disk c1d0 from the pool. Your mirror is
not a mirror any more:
# zpool status
pool: zfstest
state: ONLINE
scrub: none requested
config:
        NAME        STATE     READ WRITE CKSUM
        zfstest     ONLINE       0     0     0
          c1d1      ONLINE       0     0     0
errors: No known data errors
However, you did not lose a single bit of data! Your zpool is just as available as it was before (as long as disk c1d1 does not fail).
You may attach another disk to your pool to create a new mirror:
# zpool attach zfstest c1d1 c0d1
Now your mirror consists of disks c1d1 and c0d1. Solaris will
immediately begin to duplicate any block that's used by zfs from drive c1d1 to
drive c0d1:
# zpool status
pool: zfstest
state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
scrub: resilver in progress, 55.53% done, 0h7m to go
config:
        NAME        STATE     READ WRITE CKSUM
        zfstest     ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            c1d1    ONLINE       0     0     0
            c0d1    ONLINE       0     0     0
errors: No known data errors
The process of replicating data on new or outdated disks is
named "resilvering".
A mirror is not limited to two disks. If you have serious concerns that your valuable data could be lost, just attach another disk to your mirror:
# zpool attach zfstest c0d1 c1d0
Your mirror now has three elements. Note that your storage size does not grow by attaching new mirror components. But now two drives may fail completely and you will still have all of your data:
# zpool status
pool: zfstest
state: ONLINE
scrub: resilver completed with 0 errors on Mon Apr 21 13:56:16 2008
config:
        NAME        STATE     READ WRITE CKSUM
        zfstest     ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            c1d1    ONLINE       0     0     0
            c0d1    ONLINE       0     0     0
            c1d0    ONLINE       0     0     0
errors: No known data errors
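If you want to convince yourself of the redundancy, you can temporarily take one mirror member offline and bring it back (a sketch; the pool stays usable the whole time):
# zpool offline zfstest c1d0     # pool becomes DEGRADED, data remains accessible
# zpool online zfstest c1d0      # the disk is resilvered and the pool returns to ONLINE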
Let's detach two disks now:
# zpool detach zfstest c1d1
# zpool detach zfstest c1d0
Your mirror is gone once again. To set up concatenated or striped storage (write operations on zfs are spread over ALL pool members, so it behaves more like a striped disk set), you may add these disks to your pool (never mistake "add" for "attach" - the former ADDs storage, the latter attaches disks to mirrors):
# zpool add zfstest c1d0 c1d1
Your pool has become the same as the one we created at the
beginning of our exercise:
# zpool list
NAME      SIZE   USED   AVAIL   CAP   HEALTH   ALTROOT
zfstest  23.8G  4.88G   18.9G   20%   ONLINE   -
# zpool status
pool: zfstest
state: ONLINE
scrub: resilver completed with 0 errors on Mon Apr 21 13:56:16 2008
config:
        NAME        STATE     READ WRITE CKSUM
        zfstest     ONLINE       0     0     0
          c0d1      ONLINE       0     0     0
          c1d0      ONLINE       0     0     0
          c1d1      ONLINE       0     0     0
errors: No known data errors
The only difference is our file "bigfile", which is still available as we did not destroy the pool. You can see it in the output of "zpool list" above: 4.8G are still used.
Now we are stuck: it is not possible to remove disks that have been added to our zpool. As writes are spread over all members, newly written data ends up on every disk, so there is no way to simply pull one out. Mirror component disks, on the other hand, can be detached at any time - as long as it is not the last disk of a mirror.
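A sketch of what such an attempt looks like (on this Solaris 10 release "zpool remove" only accepts hot spares, so the command is simply refused; the exact error text may vary):
# zpool remove zfstest c1d0      # refused - c1d0 is a regular pool member, not a hot spare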
Let's destroy the pool and set up a RAID storage. ZFS offers two
RAID types: raidz1 and raidz2. raidz1 means single parity, raidz2 double
parity.
# zpool destroy zfstest
# zpool create zfstest raidz1 c0d1 c1d0 c1d1
We have now created a raid group:
# zpool status
pool: zfstest
state: ONLINE
scrub: none requested
config:
        NAME        STATE     READ WRITE CKSUM
        zfstest     ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            c0d1    ONLINE       0     0     0
            c1d0    ONLINE       0     0     0
            c1d1    ONLINE       0     0     0
errors: No known data errors
Be aware that "zpool list" is showing the global
capacity of your raid set and not the usable capacity:
# zpool list
NAME      SIZE   USED   AVAIL   CAP   HEALTH   ALTROOT
zfstest  23.9G   157K   23.9G    0%   ONLINE   -
To see how much space we can actually allocate, use a zfs command:
# zfs list
NAME      USED  AVAIL  REFER  MOUNTPOINT
zfstest   101K  15.7G  32.6K  /zfstest
One disk may fail in this scenario.
To put up the same pool with double parity:
# zpool destroy zfstest
# zpool create zfstest raidz2 c0d1 c1d0 c1d1
# zfs list
NAME       USED  AVAIL  REFER  MOUNTPOINT
zfstest   86.7K  7.80G  24.4K  /zfstest
Only 7.8 GB left - compared to 15.7 GB with a single-parity RAID device, but two drives may now fail. With raidz1, roughly one disk's worth of capacity goes to parity (3 x 8 GB minus 8 GB, approx. 16 GB usable); with raidz2, two disks' worth (approx. 8 GB usable). We have achieved the same data security as with a three-way mirror - and, incidentally, the same usable storage.
As a practical example, here is the output of "zpool status" and "zpool list" on a mail server. The zpool "mail" consists of two mirror pairs added to one pool. The creation command was:
# zpool create mail \
    mirror c6t600D0230006C1C4C0C50BE5BC9D49100d0 c6t600D0230006B66680C50AB7821F0E900d0 \
    mirror c6t600D0230006B66680C50AB0187D75000d0 c6t600D0230006C1C4C0C50BE27386C4900d0
As you see, it is perfectly legal and possible to add the
storage of two mirrors in one pool.
# zpool status
pool: mail
state: ONLINE
scrub: none requested
config:
        NAME                                       STATE     READ WRITE CKSUM
        mail                                       ONLINE       0     0     0
          mirror                                   ONLINE       0     0     0
            c6t600D0230006C1C4C0C50BE5BC9D49100d0  ONLINE       0     0     0
            c6t600D0230006B66680C50AB7821F0E900d0  ONLINE       0     0     0
          mirror                                   ONLINE       0     0     0
            c6t600D0230006B66680C50AB0187D75000d0  ONLINE       0     0     0
            c6t600D0230006C1C4C0C50BE27386C4900d0  ONLINE       0     0     0
# zpool list
NAME     SIZE   USED   AVAIL   CAP   HEALTH   ALTROOT
mail    6.81T  3.08T   3.73T   45%   ONLINE   -
As you see, you can also use Sun MPxIO devices - they have LONG device names. You may also use fdisk partitions on x86 machines (...p0, ...p1, ...) and Solaris slices (...s0, ...s1, ...) to set up zpools. Neither is recommended for production, but both are fine for playing with the zpool commands.
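A minimal sketch of playing with a slice-based pool (the pool name "playpool" and the slice name are purely illustrative; the slice must not already belong to another pool or filesystem):
# zpool create playpool c0d0s7
# zpool destroy playpool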
The MPxIO names are usable because they show up just like normal
block disk devices in /dev/dsk:
# ls -la /dev/dsk/c6t600D0230006C1C4C0C50BE5BC9D49100d0
lrwxrwxrwx 1 root root 65 Dec 11 06:22 /dev/dsk/c6t600D0230006C1C4C0C50BE5BC9D49100d0 ->
    ../../devices/scsi_vhci/disk@g600d0230006c1c4c0c50be5bc9d49100:wd
To use zfs, you need to create at least one zpool first. After that, you should have something like this:
# zpool list
NAME      SIZE   USED   AVAIL   CAP   HEALTH   ALTROOT
zfstest  23.8G    91K   23.8G    0%   ONLINE   -
This zpool "zfstest" also has one incorporated zfs filesystem on it.
To manipulate zfs there is the "zfs" command. So keep in mind: zpool manipulates pool storage, zfs manipulates zfs generation and
options. Try this:
# zfs list
NAME      USED  AVAIL  REFER  MOUNTPOINT
zfstest    88K  23.4G  24.5K  /zfstest
As you can see, the pool "zfstest" also has a filesystem on it,
mounted automatically at mountpoint /zfstest.
You may create a new filesystem by using "zfs create":
# zfs create zfstest/king
# zfs list
NAME           USED  AVAIL  REFER  MOUNTPOINT
zfstest        118K  23.4G  25.5K  /zfstest
zfstest/king  24.5K  23.4G  24.5K  /zfstest/king
New filesystems within a pool are always named "poolname/filesystemname". Without any additional options, they are also mounted automatically at "/poolname/filesystemname".
Let's create another one:
# zfs create zfstest/queen
# zfs list
NAME            USED  AVAIL  REFER  MOUNTPOINT
zfstest         147K  23.4G  25.5K  /zfstest
zfstest/king   24.5K  23.4G  24.5K  /zfstest/king
zfstest/queen  24.5K  23.4G  24.5K  /zfstest/queen
We see some differences between old-fashioned filesystems and zfs: Usable
storage is shared among all filesystems in a pool. "zfstest/king" has
23.4G available, "zfstest/queen" also, as does the master pool
filesystem "zfstest".
So why create filesystems then? Couldn't we just use subdirectories in our
master pool filesystem "zfstest" (mounted on /zfstest)?
The "trick" about zfs filesystems is the possibility to assign
options to them, so they can be treated differently. We will see that later.
First, let's push some senseless data on our newly created filesystem:
# dd if=/dev/zero bs=128k count=5000 of=/zfstest/king/bigfile
5000+0 records in
5000+0 records out
This command creates a file "bigfile" in directory /zfstest/king,
consisting of 5000 times 128 kilobytes. That's big enough for our purpose.
"zfs list" reads:
# zfs list
NAME            USED  AVAIL  REFER  MOUNTPOINT
zfstest         625M  22.8G  27.5K  /zfstest
zfstest/king    625M  22.8G   625M  /zfstest/king
zfstest/queen  24.5K  22.8G  24.5K  /zfstest/queen
625 megabytes are used by filesystem zfstest/king, as expected. Notice also that every other filesystem in that pool can now only allocate 22.8G, as 625M are taken (compare with the 23.4G above, before creating that big file).
You CAN also look up free space in your zfs filesystems with "df -k", but I wouldn't recommend it: you won't see snapshots and the numbers can get very big.
Example for our zpool "zfstest":
# df -k
Filesystem            kbytes    used    avail capacity  Mounted on
/dev/dsk/c0d0s0     14951508 5725184  9076809    39%    /
/devices                   0       0        0     0%    /devices
ctfs                       0       0        0     0%    /system/contract
[... lines omitted ...]
zfstest             24579072      27 23938789     1%    /zfstest
zfstest/king        24579072  640149 23938789     3%    /zfstest/king
zfstest/queen       24579072      24 23938789     1%    /zfstest/queen
So 22.8G is 23938789 kbytes (23938789 KB / 1024 / 1024 = approx. 22.8 GB). Sun uses 1K = 1024 bytes, 1M = 1024K, 1G = 1024M, 1T = 1024G. They're a computer company and not an ISO metric organization...
So let's try our first option: "quota".
As you can imagine, "quota" limits storage. You know this well, as nearly every mailbox provider imposes a quota on your storage, as do file hosting providers.
First: To set and get options, you need to use "zfs set" and
"zfs get", respectively.
So here we define a quota on zfstest/queen:
# zfs set quota=5G zfstest/queen
Result:
# zfs list
NAME            USED  AVAIL  REFER  MOUNTPOINT
zfstest         625M  22.8G  27.5K  /zfstest
zfstest/king    625M  22.8G   625M  /zfstest/king
zfstest/queen  24.5K  5.00G  24.5K  /zfstest/queen
Only 5G are left to use at mountpoint /zfstest/queen. Note that you may still gobble up 22.8G in /zfstest/king, which would then make it impossible to put 5G into /zfstest/queen. So a quota does not guarantee any storage, it only limits it.
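If you want to watch the quota bite, a sketch (illustration only; the exact error text differs between releases, and you should remove the test file afterwards):
# dd if=/dev/zero bs=128k count=50000 of=/zfstest/queen/toobig
    (dd aborts with a "quota exceeded" style error once roughly 5G have been written)
# rm /zfstest/queen/toobig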
To guarantee a certain amount of storage, use the option
"reservation":
# zfs set reservation=5G zfstest/queen
Now we have simulated a classical "partition" - we reserved the same amount of storage as the quota allows, 5G:
# zfs list
NAME            USED  AVAIL  REFER  MOUNTPOINT
zfstest        5.61G  17.8G  27.5K  /zfstest
zfstest/king    625M  17.8G   625M  /zfstest/king
zfstest/queen  24.5K  5.00G  24.5K  /zfstest/queen
The other filesystems only have 17.8G left, as 5 G are really reserved for
zfstest/queen.
Now let's try another nice option: compression.
Perhaps you are now thinking of the compression nightmares on Windows systems, like DoubleSpace, Stacker and all those other parasitic programs which killed performance, not storage. Forget them! zfs compression IS reliable and - fast! With today's CPU power, compressing and decompressing objects costs little and won't significantly harm your overall performance - it can even boost performance, as compression means less I/O.
As with many other zfs options, changing the compression setting only affects newly written files/blocks. Uncompressed blocks can still be read; it's transparent to the application, and fseek() et al. do not even notice that files are compressed.
# zfs set compression=on zfstest/queen
Now compression is activated on /zfstest/queen (as "zfstest/queen" is mounted on /zfstest/queen; we did not change the mountpoint - and yes, you're right, the mountpoint is also just another zfs option, as sketched below).
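As an aside, a minimal sketch of changing it (the path /export/queen is arbitrary; "zfs inherit" puts the property back to its default so the rest of the worksheet is unaffected):
# zfs set mountpoint=/export/queen zfstest/queen
# zfs inherit mountpoint zfstest/queen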
Let's copy our "bigfile" from king to queen:
# cp /zfstest/king/bigfile /zfstest/queen
OK, THIS is unfair - as our file consists only of zeroes, zfs won't compress it; it merely sets up a marker saying that 655360000 bytes of zeroes have to be generated. It is some kind of "benchmark hook" to get nice results and to avoid wasting space on files full of holes:
# zfs list
NAME            USED  AVAIL  REFER  MOUNTPOINT
zfstest        5.61G  17.8G  27.5K  /zfstest
zfstest/king    625M  17.8G   625M  /zfstest/king
zfstest/queen  24.5K  5.00G  24.5K  /zfstest/queen
No space needed in zfstest/queen... You may check this with "ls -las" (the "s" option prints the number of disk blocks needed to store each file):
# ls -las /zfstest/queen
total 7
   3 drwxr-xr-x   2 root     sys            3 Apr 23 06:17 .
   3 drwxr-xr-x   4 root     sys            4 Apr 23 06:05 ..
   1 -rw-r--r--   1 root     root   655360000 Apr 23 06:18 bigfile
One block. On our uncompressed king filesystem the situation looks like this:
# ls -las /zfstest/king
total 1280257
      3 drwxr-xr-x   2 root     sys            4 Apr 23 06:19 .
      3 drwxr-xr-x   4 root     sys            4 Apr 23 06:05 ..
1280251 -rw-r--r--   1 root     root   655360000 Apr 23 06:10 bigfile
To be able to create a "real world" file, we will use the "zfs get all" command, which lists ALL options of a zfs filesystem:
# zfs get all zfstest/queen
NAME           PROPERTY       VALUE                  SOURCE
zfstest/queen  type           filesystem             -
zfstest/queen  creation       Wed Apr 23  6:05 2008  -
zfstest/queen  used           24.5K                  -
zfstest/queen  available      5.00G                  -
zfstest/queen  referenced     24.5K                  -
zfstest/queen  compressratio  1.00x                  -
zfstest/queen  mounted        yes                    -
zfstest/queen  quota          5G                     local
zfstest/queen  reservation    5G                     local
zfstest/queen  recordsize     128K                   default
zfstest/queen  mountpoint     /zfstest/queen         default
zfstest/queen  sharenfs       off                    default
zfstest/queen  checksum       on                     default
zfstest/queen  compression    on                     local
zfstest/queen  atime          on                     default
zfstest/queen  devices        on                     default
zfstest/queen  exec           on                     default
zfstest/queen  setuid         on                     default
zfstest/queen  readonly       off                    default
zfstest/queen  zoned          off                    default
zfstest/queen  snapdir        hidden                 default
zfstest/queen  aclmode        groupmask              default
zfstest/queen  aclinherit     secure                 default
zfstest/queen  canmount       on                     default
zfstest/queen  shareiscsi     off                    default
zfstest/queen  xattr          on                     default
As you will notice, the "compressratio" property (which is read-only, so you may only use "zfs get" and not "zfs set") gives the compression ratio of your filesystem - but our "zero file" does not count, so it remains at 1.00x.
So let's create another file now in your compressed queen filesystem:
# zfs get all zfstest/queen > /zfstest/queen/outputfile
Our file will use 3 disk blocks:
# ls -las /zfstest/queen
total 10
   3 drwxr-xr-x   2 root     sys            4 Apr 23 06:18 .
   3 drwxr-xr-x   4 root     sys            4 Apr 23 06:05 ..
   1 -rw-r--r--   1 root     root   655360000 Apr 23 06:18 bigfile
   3 -rw-r--r--   1 root     root        1598 Apr 23 06:18 outputfile
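Now that some compressible (non-zero) data lives on the queen filesystem, the compression ratio should rise above 1.00x; a sketch of checking it (the exact value depends on the data):
# zfs get compressratio zfstest/queen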
Let's copy it to our uncompressed king filesystem:
# cp /zfstest/queen/outputfile /zfstest/king/
Here it will use 5 blocks:
# ls -las /zfstest/king
total 1280262
      3 drwxr-xr-x   2 root     sys            4 Apr 23 06:19 .
      3 drwxr-xr-x   4 root     sys            4 Apr 23 06:05 ..
1280251 -rw-r--r--   1 root     root   655360000 Apr 23 06:10 bigfile
      5 -rw-r--r--   1 root     root        1598 Apr 23 06:19 outputfile
These were the basic steps for creating zfs filesystems, but at least one command is missing: how do you destroy filesystems? Use "zfs destroy":
# zfs destroy zfstest/king
# zfs destroy zfstest/queen
Note that the filesystem must not be in use, otherwise this won't work (just as unmounting (umount) a classical filesystem fails while it is in use).
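If a filesystem is busy, a sketch of tracking down its users and unmounting it first (fuser and "zfs unmount" are standard tools; the output will of course differ on your system):
# fuser -cu /zfstest/king      # list processes using the filesystem
# zfs unmount zfstest/king
# zfs destroy zfstest/king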
Note that you may NOT destroy "zfstest", because that is the master filesystem of your pool; destroy the pool itself if you want to get rid of it:
# zfs destroy zfstest
cannot destroy 'zfstest': operation does not apply to pools
use 'zfs destroy -r zfstest' to destroy all datasets in the pool
use 'zpool destroy zfstest' to destroy the pool itself