i/o stat

iostat

As with most of the monitoring commands, the first line of iostat output is a summary of statistics since boot time. To look at meaningful real-time data, run iostat with a time step (e.g., iostat 30) and look at the lines that report summaries over the time step intervals.

For Solaris 2.6 and higher, use iostat -xPnce 30 to get information including the common device names of the disk partitions, CPU statistics, error statistics, and extended disk statistics.

For Solaris 2.5.1 and earlier, or for more compact output, use iostat -xc 30 to get the extended disk and CPU statistics.

In either case, the information reported is:

  • disk: Disk device name.
  • r/s, w/s: Average reads/writes per second.
  • Kr/s, Kw/s: Average kilobytes read/written per second.
  • wait: Average number of transactions waiting for service (queue length).
  • actv: Number of active requests in the hardware queue.
  • %w: Occupancy of the wait queue.
  • %b: Occupancy of the active queue with the device busy.
  • svc_t: Service time (ms). Includes everything: wait time, active queue time, seek, rotation, and transfer time.
  • us/sy: User/system CPU time (%).
  • wt: Wait for I/O (%).
  • id: Idle time (%).

Notes on Odd Behavior

The I/O wait percentage (wt) reported by iostat refers to time spent by a process while waiting for block device (such as disk) I/O to finish. In Solaris 2.6 and earlier, the calculation algorithm sometimes overstates the problem on multi-processor machines, since it does not take into account that an I/O wait on one CPU does not mean that I/O is blocked for processes on the other CPUs. Solaris 7 has corrected this problem.

iostat also sometimes reports excessive svc_t (service time) readings for disks that are very inactive. This is due to the action of fsflush keeping the data in memory and on the disk up-to-date. Since many writes are specified over a very short period of time to random parts of the disk, a queue forms briefly, and the average service time goes up. svc_t should only be taken seriously on a disk that is showing 5% or more activity.
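The 5% rule above is easy to apply mechanically when post-processing iostat -x output. The following is a minimal Python sketch, not part of iostat itself. It assumes that "activity" is measured by the %b column; the function name and the sample rows are hypothetical.

```python
def credible_service_times(rows, min_busy_pct=5.0):
    """Keep svc_t only for disks busy enough for the figure to be trusted.

    rows is an iterable of (disk_name, svc_t_ms, busy_pct) tuples, where
    busy_pct is assumed to be the %b column of iostat -x.
    """
    return {
        name: svc_t
        for name, svc_t, busy_pct in rows
        if busy_pct >= min_busy_pct
    }


# Hypothetical samples: sd0 is nearly idle, so its high svc_t is ignored.
samples = [("sd0", 85.2, 1.0), ("sd1", 12.4, 37.0)]
print(credible_service_times(samples))
```

The threshold is a parameter so that it can be raised on systems where fsflush activity is known to be heavier.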



netstat

netstat provides useful information regarding traffic flow.

In particular, netstat -i lists statistics for each interface, netstat -s provides a full listing of several counters, and netstat -rs provides routing table statistics. netstat -an reports all open ports.

Through Solaris 9, netstat -k provides a useful summary of several network-related statistics in the form of a listing of several component kstat statistics. This option was removed in Solaris 10 in favor of the /bin/kstat command.

Here are some of the issues that can be revealed with netstat:

  • netstat -i: (Collis+Ierrs+Oerrs)/(Ipkts+Opkts) > 2%: This may indicate a network hardware issue.

  • netstat -i: (Collis/Opkts) > 10%: The interface is overloaded. Traffic will need to be reduced or redistributed to other interfaces or servers.

  • netstat -i: (Ierrs/Ipkts) > 25%: Packets are probably being dropped by the host, indicating an overloaded network (and/or server). Retransmissions can be reduced by lowering the rsize and wsize mount parameters to 2048 on the NFS clients. Note that this is a temporary workaround, since this has the net effect of reducing maximum NFS throughput on the segment.

  • netstat -s: If significant numbers of packets arrive with bad headers, bad data length or bad checksums, check the network hardware.

  • netstat -i: If there are more than 120 collisions/second, the network is overloaded. See the suggestions above.

  • netstat -i: If the sum of input and output packets is higher than about 600 per second for a 10 Mb/s interface or 6000 per second for a 100 Mb/s interface, the network segment is too busy. See the suggestions above.
  • netstat -r: This form of the command provides the routing table. Make sure that the routes are as you expect them to be.
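The three ratio rules in the list above can be checked with a few lines of arithmetic. This is a minimal Python sketch under stated assumptions: the function name is hypothetical, the arguments are the Ipkts, Ierrs, Opkts, Oerrs, and Collis counters copied by hand from one interface line of netstat -i, and the messages are shortened versions of the text above. It does not cover the per-second rules, which need counter differences over a known interval.

```python
def netstat_i_findings(ipkts, ierrs, opkts, oerrs, collis):
    """Apply the netstat -i ratio rules of thumb to one interface."""
    findings = []
    total = ipkts + opkts
    # (Collis+Ierrs+Oerrs)/(Ipkts+Opkts) > 2%
    if total and (collis + ierrs + oerrs) / total > 0.02:
        findings.append("possible network hardware issue")
    # Collis/Opkts > 10%
    if opkts and collis / opkts > 0.10:
        findings.append("interface overloaded")
    # Ierrs/Ipkts > 25%
    if ipkts and ierrs / ipkts > 0.25:
        findings.append("packets probably dropped by host")
    return findings


# Hypothetical counters: 15% collisions on output.
print(netstat_i_findings(ipkts=1000, ierrs=0, opkts=1000, oerrs=0, collis=150))
```

Each division is guarded so that an interface with no traffic yet reports no findings instead of raising an error.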

..............................................................................................
  • vmstat, which is one of the simplest and most useful tools because it reports important data in the categories of CPU, memory utilization, and disk I/O
  • iostat, which is also important because you can use it in conjunction with vmstat to determine if there is a disk bottleneck
  • netstat, which is used to monitor critical network activity
  • sar, or system activity reporter, which you can use to record a large set of system statistics that includes time stamping of the statistics. (This tool is most useful for capacity planning and trend analysis. It does not collect network stats.)

You can run these four tools from the OS prompt. For example, to use vmstat:

$ vmstat 300

In this case, data is averaged and output over 5 minute intervals (300 seconds) for an indefinite period of time.

To capture the output to a file, use:

$ vmstat 300 > vmstat.out

To send both to the screen and to an output file, use:

$ vmstat 300 | tee vmstat.out

(Unlike with Intel, there is minimal additional overhead incurred when sending output to the screen on the latest UltraSPARC-III.)

You can use the UNIX man pages to get help on the various switches and definitions of terms. For example:

$ man vmstat

It's relatively easy to create a UNIX shell script that automatically runs a few of the utilities at the desired interval. Here's a sample script; note that the & sets the tool to run in the background. (Note that you could run each of these commands manually at the OS prompt.)

#!/bin/csh
vmstat 300 >vmstat.out &
iostat -xtc 300 >iostat.out &
netstat -i 300 >netstat.out &

The following sections provide more details about these tools and the statistics that are important to monitor. The acceptable ranges for the important statistics are discussed later in the Resource bottleneck threshold rules-of-thumb section.

Details about vmstat

vmstat is one of the simplest and most useful tools because it reports important data in the categories of CPU, memory utilization, and disk I/O. To see the system activity for 3 seconds with a 1 second reporting interval, use:

vmstat 1 3

The following is an example of the results of running vmstat 1 3. The bolded columns (r, b, sr, us, and sy) are most important. (Note that the first output line of the data is really the accumulated statistics since the system startup. That line is not included in the examples in this article.)


Figure 1. Sample vmstat results

In the process (procs) group of statistics, there are two important stats, r and b:

  • r is the number of processes in the CPU run queue.
  • b is the number of processes blocked waiting for resources (I/O, paging, and so forth).

In the memory group of statistics, the important stat is sr:

  • sr is the number of pages scanned per second by the page scanner and can be an indicator of a RAM shortage.

The cpu group of statistics gives a breakdown of the percentage usage of CPU time. On MP systems, this is an average across all processors.

  • us is the percentage of user CPU time.
  • sy is the percentage of system CPU time.
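When the vmstat output has been captured to a file, the five columns singled out above can be pulled out with a short script. The following is a minimal Python sketch, not a definitive parser. It assumes the usual Solaris layout in which r and b are the first two columns and us, sy, and id are the last three; the cpu columns are read by position because the name sy appears twice in the header (once under faults, once under cpu). The function name and the sample text are hypothetical.

```python
def parse_vmstat(text):
    """Return a list of dicts with r, b, sr, us, sy for each sample.

    The first data line (statistics since boot) is dropped.
    """
    header = None
    samples = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "r":          # column-name header line
            header = fields
            continue
        if header is None or not fields[0].isdigit():
            continue                  # group-title line (kthr, memory, ...)
        samples.append({
            "r": int(fields[0]),
            "b": int(fields[1]),
            "sr": int(fields[header.index("sr")]),
            "us": int(fields[-3]),    # cpu columns by position: us sy id
            "sy": int(fields[-2]),
        })
    return samples[1:]                # skip the since-boot summary


# Hypothetical capture, made up for illustration only.
sample_text = """\
 kthr      memory            page            disk          faults      cpu
 r b w   swap  free  re  mf pi po fr de sr s0 s1 s2 s3   in   sy   cs us sy id
 0 0 0 900000 400000 1 5 0 0 0 0 0 0 0 0 0 300 200 100 2 1 97
 3 1 0 880000 350000 4 20 0 0 0 0 12 1 0 0 0 450 900 300 40 10 50
"""
print(parse_vmstat(sample_text))
```

Because vmstat reprints its header periodically during a long run, the loop simply re-reads the header whenever it appears again.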

Details about iostat

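To see the system activity averaged over a 300 second interval, request two reports with a 300 second reporting interval. The first report is the summary since boot, so only the second reflects current activity: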

iostat 300 2

You can add the switch -x to provide extended statistics, which makes the output more readable because each disk has its own line. You can also add the -c switch to report the percentage of time the system has spent in user mode, in system mode, waiting for I/O, and idling. The following is an example of the results of running iostat with these switches, specifically:

iostat -cx 300 1

The bolded columns (svc_t, %b, us, sy, and wt) are most important.


Figure 2. Results of iostat -cx 300 1

In the extended device group of statistics, there are two important stats, svc_t and %b:

  • svc_t is the average service time, in milliseconds, of the disk.
  • %b is the percent of time the disk is busy (transactions in progress).

In the cpu group of statistics, there are three important statistics, us, sy, and wt:

  • us and sy are the percent CPU for user and system respectively.
  • wt is the percentage of time the CPU spent waiting on I/O.

Details about netstat

The netstat command can show you how healthy your network is. Using the -i switch summarizes all the network interfaces. Another switch that can be useful is the -s switch, which lists statistics for each protocol. The following is an example of the results of running netstat -i:


Figure 3. Sample results of netstat -i

The column Collis indicates the number of collisions of the network packets. If there are no collisions, then the network is probably not experiencing a performance problem. (Note that you will not see collisions on servers on switched networks.)

Because netstat's viewpoint is that of just one node on the network, its usefulness is limited in assessing overall Domino server performance. A single 100 Mb/s pipe can deliver requests and absorb responses much faster than all but the very largest Domino servers can handle. Network issues are best viewed from the perspective of the entire network using network sniffer hardware or software.


Resource bottleneck threshold rules-of-thumb

This section attempts to establish rules-of-thumb to determine if important system resources are on the verge of being a bottleneck and limiting server performance.

Disk bottlenecks

Disk bottlenecks are the most likely bottlenecks. Here are the thresholds you should look for using the different monitoring tools.

Using iostat:

  • The significant bottleneck threshold is %b (percent time disk busy) > 20% AND (20 ms < svc_t (ServiceTime) < 30 ms)
  • The critical bottleneck threshold is %b (percent time disk busy) > 20% AND ( svc_t (ServiceTime) > 30 ms)
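The two iostat thresholds can be expressed as a small classifier. This is a minimal Python sketch with a hypothetical function name. It assumes that the significant band is an svc_t between 20 ms and 30 ms with %b above 20%, and it treats an svc_t of exactly 30 ms as significant, which is a choice of this sketch and not something the rules state.

```python
def classify_disk(busy_pct, svc_t_ms):
    """Classify one disk sample from its %b and svc_t (ms) values."""
    if busy_pct > 20 and svc_t_ms > 30:
        return "critical"
    if busy_pct > 20 and svc_t_ms > 20:
        return "significant"       # 20 ms < svc_t <= 30 ms in this sketch
    return "ok"


# Hypothetical readings for three disks.
for busy, svc_t in [(45, 42.0), (45, 25.0), (10, 80.0)]:
    print(busy, svc_t, classify_disk(busy, svc_t))
```

Note that a slow but nearly idle disk (the last reading) is not flagged, because both conditions must hold.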

Using vmstat:

  • A significant bottleneck threshold occurs if b (processes blocked for resources) approaches r (# in run queue)
  • A critical bottleneck threshold occurs if b (processes blocked for resources) equals or exceeds r (# in run queue)
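The vmstat rule compares two columns, so it can also be written as a short check. The following is a minimal Python sketch with a hypothetical function name. The rule does not say how close "approaches" is, so the 0.8 ratio used here is purely an assumption of this sketch, as is the decision to report "ok" whenever nothing is blocked.

```python
def classify_blocked(r, b, approach_ratio=0.8):
    """Compare blocked processes (b) with the run queue (r) from vmstat.

    approach_ratio is an assumed cut-off for "b approaches r".
    """
    if b == 0:
        return "ok"                # nothing is blocked, even if r is 0 too
    if b >= r:
        return "critical"
    if b >= approach_ratio * r:
        return "significant"
    return "ok"


# Hypothetical (r, b) pairs.
for r, b in [(10, 3), (10, 8), (4, 4)]:
    print(r, b, classify_blocked(r, b))
```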

Memory bottlenecks

Here are the memory bottleneck thresholds you should look for using the different monitoring tools.

According to the book Sun Performance and Tuning: Java and the Internet, by Adrian Cockcroft and Richard Pettit, using vmstat:

  • A significant bottleneck threshold occurs if sr (free page scan rate) > 200 scans/second
  • A critical bottleneck threshold occurs if sr (free page scan rate) > 300 scans/second

(Note that the last point could indicate thrashing because active and inactive pages will be stolen from the Process Working Set.)
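The scan-rate thresholds translate directly into code. This is a minimal Python sketch with a hypothetical function name; it takes a single sr value as reported by vmstat and applies the 200 and 300 scans/second limits quoted above.

```python
def classify_scan_rate(sr):
    """Classify a vmstat sr (page scan rate) reading."""
    if sr > 300:
        return "critical"
    if sr > 200:
        return "significant"
    return "ok"


# Hypothetical sr readings.
for sr in (0, 250, 420):
    print(sr, classify_scan_rate(sr))
```

A single high reading can be a transient, so in practice it may be more useful to apply the check to readings sustained over several intervals.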

You can also issue the command vmstat -S to report on swapping rather than paging. The si and so columns show swap-ins and swap-outs.
