Version 8 (Anonymous, 08/18/2008 01:53 PM) → Version 9/11 (Anonymous, 08/19/2008 05:04 PM)

= Cluster Monitoring =
The cluster's resources and performance need to be monitored constantly, and user activity needs to be tracked.

We assume the following configuration:
||!StorMan || 2.12_B928 || [wiki:StorMan Installation & Configuration] ||
||Dell RAID Controller || || [wiki:DellRaid Installation & Configuration] ||
||Ganglia || 3.0.4 || [wiki:Ganglia Installation & Configuration] ||
||!JobMonarch || 0.2 || [wiki:JobMonarch Installation & Configuration] ||
||Monit || 4.9.2 || [wiki:Monit Installation & Configuration] ||
||!SiteMeter || || [wiki:SiteMeter Installation & Configuration] ||

== !StorMan ==
!StorMan is the ''Dell RAID Storage Manager'' (RSM) for RAID systems. The agent keeps ''/dev/aac0'' open and listens on ports 34571, 34572 and 34573.

* Download [repos:Externals/Cluster/RSM-2.12_B928_Linux.tgz] from the repository and unpack it. Enter at the command line of the master node:
{{{
tar -xzf RSM-2.12_B928_Linux.tgz
}}}

* Install the RSM ignoring the dependencies. Enter at the command line of the master node:
{{{
rpm -Uv --nodeps StorMan-2.12.i386.rpm
}}}

* Make sure that the RSM starts at boot time. Enter at the command line of the master node:
{{{
cd /usr/StorMan
cp ./stor_agent /etc/init.d/rsd
/sbin/chkconfig --add rsd
/sbin/chkconfig rsd on
}}}

* Start the RSM. Enter at the command line of the master node:
{{{
/sbin/service rsd start
}}}
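
Once the agent is running, the three ports mentioned above should accept TCP connections. The following is only a sketch of such a check: the function names are made up for illustration, and it assumes a bash with ''/dev/tcp'' support and a coreutils version that provides `timeout`, which may not be true on older installations.

```shell
#!/bin/bash
# Ports the StorMan agent is expected to listen on (taken from the text above).
rsm_ports() {
    printf '%s\n' 34571 34572 34573
}

# Succeeds if a TCP connection to host:port can be opened within 2 seconds.
# Assumption: bash supports /dev/tcp redirection and `timeout` is installed.
port_open() {
    timeout 2 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null
}

# Report the state of every agent port on the given host (default: localhost).
check_rsm_ports() {
    host=${1:-localhost}
    for port in $(rsm_ports); do
        if port_open "$host" "$port"; then
            echo "$host:$port open"
        else
            echo "$host:$port closed"
        fi
    done
}
```

Calling `check_rsm_ports localhost` on the master node prints one line per port; a "closed" line suggests that the agent did not start.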

== Further RAID Monitoring ==

=== Command Line Interface (CLI) for Dell's RAID Controller ===
The manual can be found at [http://support.euro.dell.com/support/edocs/storage/CS6CH/en/ug/dell_ceg.htm]

=== Comments by William Armitage ===
On procksi0 I have dropped the RAID tools into /usr/local/afa:

{{{
root@master01:/usr/local/afa# ls -l /usr/local/afa
total 3032
-rwxr--r-- 1 wja procksi 1893976 Nov 29 11:51 afacli
-rw-r--r-- 1 root root 0 Nov 29 11:58 cfg.log
-rw-r--r-- 1 wja procksi 572 Nov 29 11:51 getcfg.afa
-rw-r--r-- 1 root root 165 Dec 19 12:34 i2
-rw-r--r-- 1 root root 2050 Dec 19 12:36 i2.log
-rw-r--r-- 1 root root 159 Dec 11 18:11 i2.orig
-rw-r--r-- 1 root root 325 Dec 19 12:34 i3
-rw-r--r-- 1 root root 1153256 Dec 19 12:38 i3.log
-rw-r--r-- 1 root root 98 Nov 29 14:47 input
-rwxr--r-- 1 wja procksi 595 Nov 29 11:51 MAKEDEV.afa
-rw-r--r-- 1 root root 1152 Nov 30 12:39 output
-rw-r--r-- 1 root root 1152 Nov 29 14:48 output.0
-rw-r--r-- 1 root root 1152 Nov 29 15:23 output.1
-rw-r--r-- 1 root root 1152 Nov 29 18:04 output.2
}}}

afacli is from http://linux.dell.com/storage.shtml under the section AACRAID > Management Utility > afa-apps-snmp.2807420-A04.tar.gz; untar and pull apart afaapps-4.1-0.i386.rpm.

[http://support.dell.com/support/downloads/download.aspx?c=us&l=en&s=gen&releaseid=R85529&formatcnt=1&fileid=112003]

If /dev/afa0 does not exist, create it with:
{{{
cd /dev
/usr/local/afa/MAKEDEV.afa afa0
}}}
It disappeared after the reboot and needed recreating.

afacli has been described as a bad port of a DOS program.
Run interactively, it does weird things to the terminal, so feed it scripts instead.
It echoes to standard output in addition to any log file that has been set, but it uses escape
codes that write to the alternate screen of a colour xterm and then immediately
switch back at the end.

The "input" script comes from
[http://linux.dell.com/files/aacraid/nagios/check_raid_pl.txt][[br]]
{{{output log "output"}}}

the more detailed script "i2" comes from
[http://www.techno-obscura.com/~delgado/notes/sles9-NagiosAfacli.html][[br]]
{{{output i2.out}}}

"i3" dumps the controller logs. It is based on
[http://threebit.net/mail-archive/centos/msg02033.html][[br]]
{{{output i3.out}}}

== Monitoring of Services: Monit ==
You can find more about Monit at [http://mon.wiki.kernel.org/].

* Add the DAG repository on the ''master node'' and ''slave nodes''. Enter at the command line as ''root'':
{{{
wget http://apt.sw.be/redhat/el5/en/x86_64/rpmforge/RPMS/rpmforge-release-0.3.6-1.el5.rf.x86_64.rpm
rpm -Uvh rpmforge-release-0.3.6-1.el5.rf.x86_64.rpm
}}}

* Install Monit on the ''master node'' and ''slave nodes''. Enter at the command line as ''root'':
{{{
yum install monit
}}}

* General configuration on the ''master node'' and ''slave nodes''. Edit ''/etc/monit.conf'' as ''root'':

Start Monit as a daemon and check the services every 2 minutes:
{{{
set daemon 120
}}}

Use ''syslog'' in order to rotate log files:
{{{
set logfile syslog facility log_daemon
}}}

Set list of mailservers to be used for alert delivery:
{{{
set mailserver master01.procksi.local # primary mailserver
marian.cs.nott.ac.uk # fallback relay
}}}

Change the alert message format:
{{{
set mail-format { from: monit@procksi.net }
}}}

Set the alert recipient(s):
{{{
set alert procksi@cs.nott.ac.uk # receive all alerts
}}}
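
After each change to ''/etc/monit.conf'' it can help to validate the file before Monit is (re)started. Monit refuses a control file that is accessible by group or others, and `monit -t` runs a syntax check. The sketch below combines both; the function names are hypothetical, and it assumes GNU `stat` and the default control file path used on this page.

```shell
# Succeeds if the file is accessible by its owner only (no group/other bits).
# Assumption: GNU stat, which prints the octal mode with -c '%a'.
conf_mode_ok() {
    mode=$(stat -c '%a' "$1") || return 1
    case $mode in
        [0-7]00|0) return 0 ;;
        *)         return 1 ;;
    esac
}

# Check the permissions first, then let Monit parse the control file.
check_monit_conf() {
    conf=${1:-/etc/monit.conf}
    if ! conf_mode_ok "$conf"; then
        echo "$conf: permissions too open, try: chmod 600 $conf" >&2
        return 1
    fi
    monit -t -c "$conf"
}
```

Run `check_monit_conf` as ''root''; a non-zero exit status means the file should be fixed before the daemon is started.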

* Monitor the general system resources on the ''master node'':

Edit the Monit configuration file ''/etc/monit.conf'':
{{{
check system master01.procksi.local
if loadavg (1min) > 4 then alert
if loadavg (5min) > 2 then alert
if memory usage > 75% then alert
if cpu usage (user) > 70% then alert
if cpu usage (system) > 30% then alert
if cpu usage (wait) > 20% then alert
}}}
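
To get a feeling for whether the load thresholds above are sensible for a node, the current values can be compared by hand. This is only an illustrative sketch, not part of Monit: `check_loadavg` is a made-up helper that reads a line in ''/proc/loadavg'' format and prints the thresholds that are exceeded.

```shell
# Print which load thresholds are exceeded.
# Usage: check_loadavg MAX_1MIN MAX_5MIN < /proc/loadavg
check_loadavg() {
    awk -v max1="$1" -v max5="$2" '{
        if ($1 > max1) print "loadavg (1min) " $1 " > " max1
        if ($2 > max5) print "loadavg (5min) " $2 " > " max5
    }'
}

# Example with the limits from the rule above (run on the node itself):
#   check_loadavg 4 2 < /proc/loadavg
```

No output means that both load averages are currently below the limits.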

* Monitor the Apache web server on the ''master node'':

Edit ''/etc/monit.conf'':
{{{
check process apache with pidfile /var/run/httpd.pid
start program = "/sbin/service httpd start"
stop program = "/sbin/service httpd stop"
if cpu > 60% for 2 cycles then alert
if cpu > 80% for 5 cycles then restart
if totalmem > 200.0 MB for 5 cycles then restart
if children > 250 then restart
if loadavg(5min) greater than 10 for 8 cycles then stop
if failed host www.procksi.net port 80 protocol http
and request "/monit/token"
then restart
if 3 restarts within 5 cycles then timeout
group server
}}}

Edit the Apache configuration file ''/etc/httpd/conf/httpd.conf'':
{{{
#General Aliases for Monitoring and Testing
Alias /monit/ "/home/procksi/monit/"
Alias /ganglia/ "/usr/local/ganglia/html/"
Alias /trees/ "/home/procksi/trees/"
}}}

Create the Monit token file and restart Apache. Enter at the command line:
{{{
mkdir /home/procksi/monit/
echo "ProCKSI.monit" > /home/procksi/monit/token
chown -R procksi:procksi_dev /home/procksi/monit/
/sbin/service httpd restart
}}}

* Configure services to be monitored on the ''slave nodes'':
Check the general system resources; edit the host name for slaveXX accordingly. Edit ''/etc/monit.conf'':
{{{
check system slaveXX.procksi.local
if loadavg (1min) > 4 then alert
if loadavg (5min) > 2 then alert
if memory usage > 75% then alert
if cpu usage (user) > 70% then alert
if cpu usage (system) > 30% then alert
if cpu usage (wait) > 20% then alert
}}}

* Configure files to be monitored:
[TBC]

* Monitor devices on the ''master node'':
Monitor / and /home. Edit the Monit configuration file ''/etc/monit.conf'':

{{{
check device root with path /dev/sda1
if space usage > 80% for 5 times within 15 cycles then alert
if space usage > 95% then alert
group server

check device home with path /dev/sda3
start program = "/bin/mount /home"
stop program = "/bin/umount /home"
if space usage > 80% for 5 times within 15 cycles then alert
if space usage > 95% then alert
group server
}}}

* Monitor devices on the ''slave nodes'':
Monitor / and /scratch. Edit the Monit configuration file ''/etc/monit.conf'':

{{{
check device root with path /dev/sda1
if space usage > 80% for 5 times within 15 cycles then alert
if space usage > 95% then alert
group server

check device scratch with path /dev/sda3
start program = "/bin/mount /scratch"
stop program = "/bin/umount /scratch"
if space usage > 80% for 5 times within 15 cycles then alert
if space usage > 95% then alert
group server
}}}
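
The rule "for 5 times within 15 cycles" only raises an alert if the limit was exceeded in at least 5 of the last 15 checks; with the 120 second cycle configured above, that window covers 30 minutes. The following sketch mimics this logic on a list of usage samples. It is a simplified model for illustration, not Monit's implementation, and both function names are hypothetical.

```shell
# Succeeds (alert) if at least TIMES of the last WINDOW samples exceed LIMIT.
# Usage: would_alert LIMIT TIMES WINDOW SAMPLE...
would_alert() {
    limit=$1 times=$2 window=$3
    shift 3
    # Discard samples that are older than the window.
    while [ "$#" -gt "$window" ]; do
        shift
    done
    hits=0
    for usage in "$@"; do
        if [ "$usage" -gt "$limit" ]; then
            hits=$((hits + 1))
        fi
    done
    [ "$hits" -ge "$times" ]
}

# Current usage of a mounted filesystem in percent, e.g. `usage_percent /home`.
usage_percent() {
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}
```

For example, `would_alert 80 5 15 81 82 83 84 85` succeeds, whereas four samples above 80% followed by one below do not trigger the alert.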

* Make the Monit daemon start at bootup. Enter at the command line as ''root'' on the ''master node'' and ''slave nodes'':
{{{
/sbin/chkconfig monit on
}}}
* Start the Monit Daemon. Enter at the command line as ''root'' on the ''master node'' and ''slave nodes'':
{{{
/sbin/service monit start
}}}


== Monitoring of Cluster Resources: Ganglia ==
''Ganglia is a scalable distributed monitoring system for high-performance computing systems such as clusters and Grids.''

* Download the latest release of the ''Ganglia Monitoring Core'' from [http://ganglia.sourceforge.net/ http://ganglia.sourceforge.net].
* Install Ganglia into ''/usr/local/ganglia'', its web frontend into ''/usr/local/ganglia/html/'', and its databases into ''/usr/local/ganglia/rrds/''.
* Install the ''Ganglia Monitoring Daemon'' (gmond) on each node, and the ''Ganglia Meta Daemon'' (gmetad) on the head node.

=== Ganglia Monitoring Daemon ===
* Configure, build and install Ganglia on each slave node (only with ''gmond''):
{{{
./configure --prefix=/usr/local
}}}
and on the master node (with ''gmond'' and ''gmetad''):
{{{
./configure --prefix=/usr/local --with-gmetad
}}}

* Initialise the configuration file for ''gmond'':
{{{
gmond --default_config > /etc/gmond.conf
}}}

* Configure the ''Ganglia Monitoring Daemon'' in ''/etc/gmond.conf'':
* Set the name of the cluster:
{{{
cluster {
name = "ProCKSI"
}
}}}
* Set the IP address and port for multicast data exchange:
{{{
udp_send_channel {
mcast_join = 239.2.11.71
port = 8649
}
udp_recv_channel {
mcast_join = 239.2.11.71
port = 8649
bind = 239.2.11.71
}
}}}

* Copy start-up script for ''gmond'':
{{{
cp ./gmond/gmond.init /etc/init.d/gmond
}}}

* Add an additional route for correct data exchange via multicast using the ''internal'' interface (''eth0''). Modify ''/etc/init.d/gmond'' so that the route is added before the daemon is started and removed when it is stopped:
{{{
#Add multicast route to internal interface
/sbin/route add -host 239.2.11.71 dev eth0
daemon $GMOND
}}}
{{{
#Remove multicast route to internal interface
/sbin/route delete -host 239.2.11.71 dev eth0
killproc gmond
}}}
* Make the Ganglia Monitoring Daemon start at bootup.
{{{
/sbin/chkconfig gmond on
}}}
* Start the Ganglia Monitoring Daemon:
{{{
/sbin/service gmond start
}}}
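
After gmond has been started through the modified init script, the multicast route should show up in the routing table. A minimal sketch of such a check, assuming the usual `route -n` output with the destination in the first column and the interface in the last; `has_route` is a hypothetical helper name.

```shell
# Succeeds if the routing table read from stdin (in `route -n` format)
# contains an entry for the given destination on the given interface.
has_route() {
    awk -v dst="$1" -v dev="$2" '
        $1 == dst && $NF == dev { found = 1 }
        END { exit !found }
    '
}

# Usage on a node, with the address and interface from the text above:
#   /sbin/route -n | has_route 239.2.11.71 eth0 && echo "multicast route present"
```

If the route is missing, the nodes may send their multicast packets over the external interface and will then not see each other.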

=== Ganglia Meta Daemon ===
* Install and configure the ''Ganglia Meta Daemon'' (gmetad) on the master node.

* Make the Ganglia Meta Daemon start at bootup.
{{{
/sbin/chkconfig --add gmetad
/sbin/chkconfig gmetad on
}}}
* Start the Ganglia Meta Daemon:
{{{
/sbin/service gmetad start
}}}

* If the pie charts do not show up, install the ''php-gd'' package.


=== Further Customisation ===
In order to display more fine-grained time intervals, edit the following files in ''/usr/local/ganglia/html/'':
* '''header.php'''
{{{
if (!$physical) {
$context_ranges[]="10 minutes";
$context_ranges[]="20 minutes";
$context_ranges[]="30 minutes";
$context_ranges[]="1 hour";
$context_ranges[]="2 hours";
$context_ranges[]="4 hours";
$context_ranges[]="8 hours";
$context_ranges[]="12 hours";
$context_ranges[]="1 day";
$context_ranges[]="2 days";
$context_ranges[]="week";
$context_ranges[]="month";
$context_ranges[]="year";
}}}

* '''get_context.php'''
{{{
switch ($range) {
case "10 minutes": $start = -600; break;
case "20 minutes": $start = -1200; break;
case "30 minutes": $start = -1800; break;
case "1 hour": $start = -3600; break;
case "2 hours": $start = -7200; break;
case "4 hours": $start = -14400; break;
case "8 hours": $start = -28800; break;
case "12 hours": $start = -43200; break;
case "1 day": $start = -86400; break;
case "2 days": $start = -172800; break;
case "week": $start = -604800; break;
case "month": $start = -2419200; break;
case "year": $start = -31449600; break;
}}}
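
The `$start` values are negative offsets in seconds relative to the current time. When adding further ranges, the offset can be derived from a count and a unit, as in the sketch below (`range_start` is a hypothetical helper, not part of Ganglia). Note that the existing values treat "month" as 4 weeks and "year" as 52 weeks.

```shell
# Print the negative offset in seconds for COUNT units of time.
# Usage: range_start COUNT UNIT   (UNIT: minutes, hours, days or weeks)
range_start() {
    count=$1
    case $2 in
        minute*) unit=60 ;;
        hour*)   unit=3600 ;;
        day*)    unit=86400 ;;
        week*)   unit=604800 ;;
        *) echo "unknown unit: $2" >&2; return 1 ;;
    esac
    echo "-$((count * unit))"
}

range_start 20 minutes   # -1200
range_start 4 weeks      # -2419200, the value used for "month"
range_start 52 weeks     # -31449600, the value used for "year"
```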

== !JobMonarch ==
!JobMonarch is an add-on to Ganglia which provides PBS job monitoring through the web browser.

See [http://subtrac.rc.sara.nl/oss/jobmonarch/wiki/Documentation http://subtrac.rc.sara.nl/oss/jobmonarch/wiki/Documentation] for information on requirements, configuration and installation.

'''Attention''': Does not work properly yet.

== Domain Usage Monitoring ==
All HTML documents must contain the following code in order to be tracked correctly.

{{{
<!-- Site Meter -->
<script type="text/javascript" src="http://s18.sitemeter.com/js/counter.js?site=s18procksi">
</script>
<noscript>
<a href="http://s18.sitemeter.com/stats.asp?site=s18procksi" target="_top">
<img src="http://s18.sitemeter.com/meter.asp?site=s18procksi"
alt="Site Meter" border="0"/>
</a>
</noscript>

<!-- Copyright (c)2006 Site Meter -->
}}}