Saturday, October 24, 2009

script for fmadm alerting

source: http://prefetch.net/code/fmadmnotifier

I found the following script on the site above:
#!/bin/bash
#
# Program: E-mail fault manager errors
#
# Author: Matty < matty91 at gmail dot com >
#
# Current Version: 1.1
#
# Revision History:
#
# Version 1.1
# Avoid the use of temporary files -- Michael Shon
#
# Version 1.0
# Initial Release
#
# Last Updated: 08-18-2006
#
# Purpose:
# Fmadm.sh queries the fault manager to see if errors have been
# generated. If an error is detected, the script will email the
# administrator defined in the ADMIN variable with the error
# details.
#
# License:
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
#
# Installation:
# Copy the shell script to a suitable location
#
# Usage:
# To check for events once per hour, add a cron job similar to the following:
#
# $ crontab -l | grep fmadmnotifier.sh
# 0 * * * * /etc/scripts/fmadmnotifier.sh
#

PATH=/usr/bin:/sbin:/usr/sbin:/usr/sfw/bin

# Who to E-mail with new updates
ADMIN="root"

# Location of binaries
AWK=$(which awk)
FMADM=$(which fmadm)
HOSTNAME=$(which hostname)
MAIL=$(which mailx)
MKTEMP=$(which mktemp)

# Check to make sure the mail binary exists
# (the variable is quoted so an empty or multi-word "which" result fails the test)
if [ ! -f "${MAIL}" ]
then
echo "Cannot find ${MAIL}"
exit 1
fi

# Check to make sure the fmadm utility exists
if [ ! -f "${FMADM}" ]
then
echo "Cannot find ${FMADM}"
exit 1
fi

# Verify that mktemp exists
if [ ! -f "${MKTEMP}" ]
then
echo "Cannot find ${MKTEMP}"
exit 1
fi

# Run fmadm faulty to check for hardware errors
FMADMOUTPUT=$(${FMADM} faulty | ${AWK} '$0 !~ /STATE/ && $0 !~ /^----/ { print $0 }')

if [ -n "${FMADMOUTPUT}" ]
then
(
echo "The fault manager detected a problem with the system hardware."
echo "The fmadm and fmdump utilities can be run to retrieve additional"
echo "details on the faults and recommended next course of action. "

echo ""
echo "fmadm faulty output:"
echo ""

${FMADM} faulty
echo ""
) | ${MAIL} -s "Hardware fault on $($HOSTNAME)" ${ADMIN}
fi
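
The detection step above relies on the awk filter leaving nothing behind when "fmadm faulty" prints only its header. A minimal sketch of that filter in isolation follows; the function name filter_faults and both sample outputs are made up for illustration, and the header layout is assumed to match the older "STATE RESOURCE / UUID" format the script was written for.

```shell
#!/bin/bash
# Sketch: the same awk filter the script applies to "fmadm faulty" output.
# It drops the column header (any line containing STATE) and the dashed
# separator, so only fault entries remain.
filter_faults() {
    awk '$0 !~ /STATE/ && $0 !~ /^----/ { print $0 }'
}

# Hypothetical sample: header and separator only, i.e. a healthy system.
healthy='   STATE RESOURCE / UUID
-------- ----------------------------------------------------------------'

# Hypothetical sample with one fault entry (resource and UUID are made up).
faulty="${healthy}
degraded cpu:///cpuid=0
         aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee"

for sample in "$healthy" "$faulty"; do
    if [ -n "$(printf '%s\n' "$sample" | filter_faults)" ]; then
        echo "fault detected: would send mail"
    else
        echo "no faults: stay quiet"
    fi
done
```

If a newer fmadm release prints a different header, the two patterns in the filter would need adjusting, otherwise the header itself would be treated as a fault.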

Some fmadm details:

The fmadm utility's "config" option can be used to view the list of diagnosis engines and agents that are active on a system:

$ fmadm config
MODULE              VERSION  STATUS  DESCRIPTION
cpumem-retire       1.1      active  CPU/Memory Retire Agent
disk-transport      1.0      active  Disk Transport Agent
eft                 1.16     active  eft diagnosis engine
fmd-self-diagnosis  1.0      active  Fault Manager Self-Diagnosis
io-retire           2.0      active  I/O Retire Agent
snmp-trapgen        1.0      active  SNMP Trap Generation Agent
sysevent-transport  1.0      active  SysEvent Transport Agent
syslog-msgs         1.0      active  Syslog Messaging Agent
zfs-diagnosis       1.0      active  ZFS Diagnosis Engine
zfs-retire          1.0      active  ZFS Retire Agent

Fault manager logs

· The fault manager maintains two log files:
  - The error log contains a list of error events that have been sent to the fault manager daemon
  - The fault log contains a list of problems that have been diagnosed and repaired
· The fault log can be viewed by running fmdump:
$ fmdump
· The error log can be viewed with fmdump's "-e" option:
$ fmdump -e
· fmdump also has a "-u" option to limit the output to a specific UUID, "-t" and "-T" options to display only events that occurred at or after, or at or before, a given time, and "-v" and "-V" options to display verbose output

Monday, October 5, 2009

zpool monitoring

This script checks the current state of the zpools, looking for degraded arrays (caused by failed drives), unavailable spares and unrecovered errors. Because it keeps a state file in /etc/zfs, it needs to run as root. I run it hourly. It should be possible to extend the script to also check for ZFS checksum errors, but I haven't taken the time to do it. The reminder code hasn't been tested, as I haven't had a failure since the code was put in place.

#! /bin/sh

STATEFILE="/etc/zfs/chk.state"
ALARMUSER="root@localhost"

zpool status 2>&1 | \
egrep -i '(degraded|unavail|unrecover)' > /dev/null

STATE=$?

if [ -f $STATEFILE ]
then
LASTSTATE=`cat $STATEFILE`
else
LASTSTATE=1
echo $STATE > $STATEFILE
fi

#
# Error is currently set.
#
if [ $STATE = 0 ]
then

#
# Error wasn't set previously. Send out the error message.
#
if [ $LASTSTATE = 1 ]
then
HOSTNAME=`uname -n`
zpool status -x | \
mailx -s "ZFS.error.on.$HOSTNAME" $ALARMUSER
echo $STATE > $STATEFILE
exit
fi

#
# Send out a reminder every other day: skip it if the state file
# was modified within the last two days.
#
FOUND=`find $STATEFILE -mtime -2`
if [ -n "$FOUND" ]
then
exit
fi
HOSTNAME=`uname -n`
zpool status -x | \
mailx -s "ZFS.error.reminder.on.$HOSTNAME" $ALARMUSER
echo $STATE > $STATEFILE
exit
fi

#
# Error was set, but is no longer. Send out the fixed message.
#
if [ $STATE = 1 -a $LASTSTATE = 0 ]
then
HOSTNAME=`uname -n`
zpool status -x | \
mailx -s "ZFS.error.fixed.on.$HOSTNAME" $ALARMUSER
echo $STATE > $STATEFILE
exit
fi


EDIT: Updated above script to look for unrecovered errors, thanks to information in this post by nhamilto40. To reset the error counts, the "zpool clear pool" command can be used.
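
The checksum check mentioned earlier as a possible extension could look roughly like the sketch below. It is not part of the script above. It assumes the device rows of "zpool status" have the five columns NAME STATE READ WRITE CKSUM; the function name cksum_errors and the sample output are made up for illustration. On older Solaris releases the default awk may not accept user-defined functions, in which case nawk would be needed.

```shell
#!/bin/sh
# Sketch: list devices whose CKSUM column in "zpool status" is non-zero.
# Assumes device rows look like: NAME STATE READ WRITE CKSUM [notes...]
cksum_errors() {
    awk '
        # A counter looks like 0, 12 or 1.2K.
        function is_count(s) { return s ~ /^[0-9]+(\.[0-9]+)?[KMGT]?$/ }
        NF >= 5 && is_count($3) && is_count($4) && is_count($5) && $5 != "0" {
            print $1, $5
        }'
}

# Hypothetical sample output with three checksum errors on one disk.
sample='  pool: tank
 state: ONLINE
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            c1t0d0  ONLINE       0     0     3
            c1t1d0  ONLINE       0     0     0

errors: No known data errors'

printf '%s\n' "$sample" | cksum_errors   # prints: c1t0d0 3
```

In the monitoring script, the real input would come from "zpool status 2>&1 | cksum_errors", and any non-empty result could be treated like the other error conditions.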

I scanned this thread, and see no scripts. Perhaps this will be more useful than I thought.
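
Since the reminder path has never fired in practice, it may help to look at the decision logic on its own. The sketch below models the behaviour the script's comments describe (first alert, a reminder every other day, a fixed notice) as a function without side effects; decide_action and its arguments are hypothetical names, not part of the script.

```shell
#!/bin/sh
# Sketch of the intended decision table.
#   state, laststate: 0 = error present, 1 = healthy (the egrep exit status)
#   recent: "yes" if the state file was modified within the last two days
decide_action() {
    state=$1
    laststate=$2
    recent=$3
    if [ "$state" = 0 ]; then
        if [ "$laststate" = 1 ]; then
            echo "alert"        # new error
        elif [ "$recent" = "yes" ]; then
            echo "none"         # already reported recently
        else
            echo "reminder"     # error still present after two days
        fi
    elif [ "$laststate" = 0 ]; then
        echo "fixed"            # error has cleared
    else
        echo "none"             # healthy before and now
    fi
}

decide_action 0 1 yes   # prints: alert
decide_action 0 0 no    # prints: reminder
decide_action 1 0 yes   # prints: fixed
```

Walking through the combinations this way makes it possible to check the reminder branch without waiting for a real pool failure.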

ZFS tutorial using files instead of disks

Using Files
To use files on an existing filesystem, create four 128 MB files, e.g.:

# mkfile 128m /home/ocean/disk1
# mkfile 128m /home/ocean/disk2
# mkfile 128m /home/ocean/disk3
# mkfile 128m /home/ocean/disk4

# ls -lh /home/ocean
total 1049152
-rw------T 1 root root 128M Mar 7 19:48 disk1
-rw------T 1 root root 128M Mar 7 19:48 disk2
-rw------T 1 root root 128M Mar 7 19:48 disk3
-rw------T 1 root root 128M Mar 7 19:48 disk4

This makes testing easy, because you don't need real disks or partitions.
(source: http://flux.org.uk/howto/solaris/zfs_tutorial_01)
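
As a rough sketch of what can be done with these files, the commands below build a pool of two mirrors from them. The pool name testpool and the run helper are assumptions for illustration. By default the script only prints the commands; set DRY_RUN=0 to execute them as root on a system with ZFS.

```shell
#!/bin/sh
# Sketch: create a test pool from the four files, then remove it again.
DRY_RUN=${DRY_RUN:-1}
DIR=/home/ocean
POOL=testpool   # hypothetical pool name

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# File vdevs must be given as absolute paths.
run zpool create "$POOL" mirror "$DIR/disk1" "$DIR/disk2"
run zpool add "$POOL" mirror "$DIR/disk3" "$DIR/disk4"
run zpool status "$POOL"

# Remove the pool when done experimenting.
run zpool destroy "$POOL"
```

Keeping the default as a dry run avoids creating or destroying a pool by accident when the snippet is pasted into a shell.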

Saturday, October 3, 2009

Remote powerdown from Windows with plink.exe

Because I am using a private network without a connection to the outside world,
I am not concerned about security.
Using plink.exe, part of the PuTTY package, I managed a remote shutdown.
Create a shortcut on Windows with the following command
(I use the user admin to log on):

plink.exe -ssh admin@hostname -pw password -m shutdown

The tricky bit is the -m option, which reads the remote command from a local file
(you have to use full paths in that command because no profile is loaded),
so my "shutdown" file saved on the Windows box contains this line:

/usr/bin/pfexec /usr/sbin/init 5

This is very similar to the 'pfexec init 5' you use when logged on to Solaris.