Version 5 (Anonymous, 08/07/2008 02:35 PM) → Version 6/11 (Anonymous, 08/18/2008 01:34 PM)

= Cluster Monitoring =
The cluster's resources and performance need to be monitored constantly, and user activity needs to be tracked.

We assume the following configuration:
||!StorMan||2.12_B928||
||Ganglia||3.0.4||
||!JobMonarch||0.2||
||Monit||4.9.2||

== !StorMan ==
!StorMan is the ''DELL RAID Storage Manager'' (RSM) for RAID systems. The agent keeps ''/dev/aac0'' open and listens on ports 34571, 34572 and 34573.

* Download [repos:Externals/Cluster/RSM-2.12_B928_Linux.tgz] from the repository and unpack it. Enter at the command line of the master node:
{{{
tar -xzf RSM-2.12_B928_Linux.tgz
}}}

* Install the RSM ignoring the dependencies. Enter at the command line of the master node:
{{{
rpm -Uv --nodeps StorMan-2.12.i386.rpm
}}}

* Make sure that the RSM starts at boot time. Enter at the command line of the master node:
{{{
cd /usr/StorMan
cp ./stor_agent /etc/init.d/rsd
/sbin/chkconfig --add rsd
/sbin/chkconfig rsd on
}}}

* Start the RSM. Enter at the command line of the master node:
{{{
/sbin/service rsd start
}}}
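
Once the agent is running, the three ports named at the top of this section should be in the listening state. The following is a minimal sketch of such a check, not part of the RSM package: the helper name `missing_rsm_ports` is made up for illustration, and it assumes that `netstat -ltn` (net-tools) is available on the master node.

```shell
#!/bin/sh
# Report which of the expected RSM agent ports are absent from a
# "netstat -ltn"-style listing read on standard input.
# missing_rsm_ports is a hypothetical helper name, not part of StorMan.
missing_rsm_ports() {
    listing=$(cat)
    for port in 34571 34572 34573; do
        # Look for ":PORT" followed by whitespace, as it appears in the
        # local address column of the listing; print the port if not found.
        printf '%s\n' "$listing" | grep -Eq ":${port}[[:space:]]" || echo "$port"
    done
}

# Possible use on the master node (assumes net-tools is installed):
#   netstat -ltn | missing_rsm_ports
```

An empty result means all three ports were found; any port number printed is one the agent does not appear to be listening on.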

== Further RAID Monitoring ==

=== Command Line Interface (CLI) for Dell's RAID Controller ===
The manual can be found at [http://support.euro.dell.com/support/edocs/storage/CS6CH/en/ug/dell_ceg.htm].

=== Comments by William Armitage ===
On procksi0 I have dropped the RAID tools into /usr/local/afa:

{{{
root@master01:/usr/local/afa# ls -l /usr/local/afa
total 3032
-rwxr--r-- 1 wja procksi 1893976 Nov 29 11:51 afacli
-rw-r--r-- 1 root root 0 Nov 29 11:58 cfg.log
-rw-r--r-- 1 wja procksi 572 Nov 29 11:51 getcfg.afa
-rw-r--r-- 1 root root 165 Dec 19 12:34 i2
-rw-r--r-- 1 root root 2050 Dec 19 12:36 i2.log
-rw-r--r-- 1 root root 159 Dec 11 18:11 i2.orig
-rw-r--r-- 1 root root 325 Dec 19 12:34 i3
-rw-r--r-- 1 root root 1153256 Dec 19 12:38 i3.log
-rw-r--r-- 1 root root 98 Nov 29 14:47 input
-rwxr--r-- 1 wja procksi 595 Nov 29 11:51 MAKEDEV.afa
-rw-r--r-- 1 root root 1152 Nov 30 12:39 output
-rw-r--r-- 1 root root 1152 Nov 29 14:48 output.0
-rw-r--r-- 1 root root 1152 Nov 29 15:23 output.1
-rw-r--r-- 1 root root 1152 Nov 29 18:04 output.2
}}}

afacli is from http://linux.dell.com/storage.shtml under the section AACRAID > Management Utility > afa-apps-snmp.2807420-A04.tar.gz; untar and pull apart afaapps-4.1-0.i386.rpm.

[http://support.dell.com/support/downloads/download.aspx?c=us&l=en&s=gen&releaseid=R85529&formatcnt=1&fileid=112003]

If you don't have /dev/afa0, create it with:
{{{
cd /dev
/usr/local/afa/MAKEDEV.afa afa0
}}}
It disappeared after the reboot and needed recreating.
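
Because the device node does not survive a reboot, the creation step could be wrapped in a small guard and called from a boot script. This is only a sketch: `ensure_afa_device` is a hypothetical helper name, its defaults are the paths used above, and calling it from a file such as /etc/rc.d/rc.local is an assumption about the local setup.

```shell
#!/bin/sh
# Recreate the afa device node if it is missing.
# ensure_afa_device is a hypothetical helper; its arguments default to the
# paths used in this section.
ensure_afa_device() {
    dev_dir=${1:-/dev}
    makedev=${2:-/usr/local/afa/MAKEDEV.afa}
    node=${3:-afa0}
    if [ -e "$dev_dir/$node" ]; then
        return 0    # node already present, nothing to do
    fi
    # MAKEDEV.afa is run from inside the device directory, as shown above.
    (cd "$dev_dir" && "$makedev" "$node")
}

# Possible use from a boot script (assumption):
#   ensure_afa_device
```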

afacli is described as a bad port of a DOS program.
On the command line it does weird things to the terminal, so feed it scripts instead.
It echoes to the output as well as to any log that is set, but it uses escape
codes that write to the alternate screen of a colour xterm and then immediately
switches back at the end.
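
If the captured output is to be parsed later, the escape codes can be filtered out first. The sketch below is an illustration only: `strip_ansi` is a hypothetical helper, it assumes that afacli emits only standard CSI sequences (ESC, `[`, parameters, final letter), and the usage line assumes that afacli accepts its script on standard input. None of these assumptions has been verified against afacli itself.

```shell
#!/bin/sh
# Remove ANSI CSI escape sequences (ESC [ parameters final-letter) from stdin.
# strip_ansi is a hypothetical helper; it assumes afacli only emits CSI-style
# sequences, which has not been verified here.
strip_ansi() {
    esc=$(printf '\033')
    sed "s/${esc}\[[0-9;?]*[A-Za-z]//g"
}

# Possible use (assumes afacli reads its script from standard input):
#   /usr/local/afa/afacli < input | strip_ansi > output.clean
```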

The "input" script comes from
[http://linux.dell.com/files/aacraid/nagios/check_raid_pl.txt][[br]]
{{{output log "output"}}}

the more detailed script "i2" comes from
[http://www.techno-obscura.com/~delgado/notes/sles9-NagiosAfacli.html][[br]]
{{{output i2.out}}}

"i3" dumps the controller logs. It's based on
[http://threebit.net/mail-archive/centos/msg02033.html][[br]]
{{{output i3.out}}}

== Monitoring of Services: Monit ==
You can find more about ''monit'' at [http://mon.wiki.kernel.org/].

* Add the DAG repository on the ''master node'' and ''slave nodes''. Enter at the command line as ''root'':
{{{
wget http://apt.sw.be/redhat/el5/en/x86_64/rpmforge/RPMS/rpmforge-release-0.3.6-1.el5.rf.x86_64.rpm
rpm -Uvh rpmforge-release-0.3.6-1.el5.rf.x86_64.rpm
}}}

* Install ''monit'' on the ''master node'' and ''slave nodes''. Enter at the command line as ''root'':
{{{
yum install monit
}}}

* General configuration on the ''master node'' and ''slave nodes''. Edit ''/etc/monit.conf'' as ''root'':

Start Monit as a daemon and check the services every 2 minutes:
{{{
set daemon 120
}}}

Use ''syslog'' in order to rotate log files:
{{{
set logfile syslog facility log_daemon
}}}

Set list of mailservers to be used for alert delivery:
{{{
set mailserver master01.procksi.local,   # primary mailserver
               marian.cs.nott.ac.uk      # fallback relay
}}}

Change the alert message format:
{{{
set mail-format { from: monit@procksi.net }
}}}

Set the alert recipient(s):
{{{
set alert procksi@cs.nott.ac.uk # receive all alerts
}}}


* Configure services to be monitored:
[TBC]

* Configure files to be monitored:
[TBC]

* Configure devices to be monitored:
[TBC]

* Make the ''monit'' daemon start at bootup. Enter at the command line as ''root'' on the ''master node'' and ''slave nodes'':
{{{
/sbin/chkconfig monit on
}}}
* Start the ''monit'' daemon. Enter at the command line as ''root'' on the ''master node'' and ''slave nodes'':
{{{
/sbin/service monit start
}}}

== Monitoring of Cluster Resources: Ganglia ==
''Ganglia is a scalable distributed monitoring system for high-performance computing systems such as clusters and Grids.''

* Download the latest release of the ''Ganglia Monitoring Core'' from [http://ganglia.sourceforge.net/ http://ganglia.sourceforge.net].
* Install Ganglia into ''/usr/local/ganglia'', its web frontend into ''/usr/local/ganglia/html/'', and its databases into ''/usr/local/ganglia/rrds/''.
* Install the ''Ganglia Monitoring Daemon'' (gmond) on each node, and the ''Ganglia Meta Daemon'' (gmetad) on the master node.

=== Ganglia Monitoring Daemon ===
* Configure, build and install Ganglia on each slave node (only with ''gmond''):
{{{
./configure --prefix=/usr/local
}}}
and on the master node (with ''gmond'' and ''gmetad''):
{{{
./configure --prefix=/usr/local --with-gmetad
}}}

* Initialise the configuration file for ''gmond'':
{{{
gmond --default_config > /etc/gmond.conf
}}}

* Configure the ''Ganglia Monitoring Daemon'' in ''/etc/gmond.conf'':
* Set the name of the cluster:
{{{
cluster {
name = "ProCKSI"
}
}}}
* Set the IP address and port for multicast data exchange:
{{{
udp_send_channel {
mcast_join = 239.2.11.71
port = 8649
}
udp_recv_channel {
mcast_join = 239.2.11.71
port = 8649
bind = 239.2.11.71
}
}}}

* Copy start-up script for ''gmond'':
{{{
cp ./gmond/gmond.init /etc/init.d/gmond
}}}

* Add an additional route for correct data exchange via multicast using the ''internal'' interface (''eth0''). Modify ''/etc/init.d/gmond'' so that the route is added before the daemon starts and removed when it stops:
{{{
#Add multicast route to internal interface
/sbin/route add -host 239.2.11.71 dev eth0
daemon $GMOND
}}}
{{{
#Remove multicast route to internal interface
/sbin/route delete -host 239.2.11.71 dev eth0
killproc gmond
}}}
* Make the Ganglia Monitoring Daemon start at bootup.
{{{
/sbin/chkconfig gmond on
}}}
* Start the Ganglia Monitoring Daemon:
{{{
/sbin/service gmond start
}}}
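
To see whether the nodes are actually exchanging data, the XML report that gmond serves over TCP can be inspected. The sketch below assumes the default `tcp_accept_channel` on port 8649 from the generated configuration, that `nc` is installed, and that `NAME` is the first attribute of each `<HOST>` element; `list_ganglia_hosts` is a hypothetical helper name.

```shell
#!/bin/sh
# Print the NAME attribute of every <HOST ...> element in a gmond XML dump
# read from standard input. list_ganglia_hosts is a hypothetical helper and
# assumes NAME is the first attribute of the element.
list_ganglia_hosts() {
    grep -o '<HOST NAME="[^"]*"' | sed 's/^<HOST NAME="//; s/"$//'
}

# Possible use on any node (assumes nc and the default TCP port 8649):
#   nc localhost 8649 | list_ganglia_hosts
```

If multicast is working, every node should list all cluster nodes, not just itself.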

=== Ganglia Meta Daemon ===
* Install and configure the ''Ganglia Meta Daemon'' (gmetad) on the master node.

* Make the Ganglia Meta Daemon start at bootup.
{{{
/sbin/chkconfig --add gmetad
/sbin/chkconfig gmetad on
}}}
* Start the Ganglia Meta Daemon:
{{{
/sbin/service gmetad start
}}}

* If the pie charts do not show up, install the ''php-gd'' package.


=== Further Customisation ===
In order to display more fine-grained time intervals, edit the following files in ''/usr/local/ganglia/html/'':
* '''header.php'''
{{{
if (!$physical) {
$context_ranges[]="10 minutes";
$context_ranges[]="20 minutes";
$context_ranges[]="30 minutes";
$context_ranges[]="1 hour";
$context_ranges[]="2 hours";
$context_ranges[]="4 hours";
$context_ranges[]="8 hours";
$context_ranges[]="12 hours";
$context_ranges[]="1 day";
$context_ranges[]="2 days";
$context_ranges[]="week";
$context_ranges[]="month";
$context_ranges[]="year";
}}}

* '''get_context.php'''
{{{
switch ($range) {
case "10 minutes": $start = -600; break;
case "20 minutes": $start = -1200; break;
case "30 minutes": $start = -1800; break;
case "1 hour": $start = -3600; break;
case "2 hours": $start = -7200; break;
case "4 hours": $start = -14400; break;
case "8 hours": $start = -28800; break;
case "12 hours": $start = -43200; break;
case "1 day": $start = -86400; break;
case "2 days": $start = -172800; break;
case "week": $start = -604800; break;
case "month": $start = -2419200; break;
case "year": $start = -31449600; break;
}}}
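
The offsets in the switch statement are the range lengths in seconds, negated. When adding further ranges, the values can be cross-checked with a small helper such as the sketch below. `range_to_start` is a hypothetical name; the factors for "month" (28 days) and "year" (364 days) are chosen to match the values listed above rather than calendar lengths.

```shell
#!/bin/sh
# Convert "<count> <unit>" into a negative offset in seconds, mirroring the
# values in the switch statement above. range_to_start is a hypothetical name.
range_to_start() {
    count=$1
    unit=$2
    case "$unit" in
        minute|minutes) factor=60 ;;
        hour|hours)     factor=3600 ;;
        day|days)       factor=86400 ;;
        week|weeks)     factor=604800 ;;      # 7 days
        month|months)   factor=2419200 ;;     # 28 days, as in the list above
        year|years)     factor=31449600 ;;    # 364 days, as in the list above
        *) echo "unknown unit: $unit" >&2; return 1 ;;
    esac
    echo "-$((count * factor))"
}

# Example: range_to_start 8 hours
```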

== !JobMonarch ==
!JobMonarch is an add-on to Ganglia which provides PBS job monitoring through the web browser.

See [http://subtrac.rc.sara.nl/oss/jobmonarch/wiki/Documentation http://subtrac.rc.sara.nl/oss/jobmonarch/wiki/Documentation] for information on requirements, configuration and installation.

'''Attention''': Does not work properly yet.

== Domain Usage Monitoring ==
All HTML documents must contain the following code in order to be tracked correctly.

{{{
<!-- Site Meter -->
<script type="text/javascript" src="http://s18.sitemeter.com/js/counter.js?site=s18procksi">
</script>
<noscript>
<a href="http://s18.sitemeter.com/stats.asp?site=s18procksi" target="_top">
<img src="http://s18.sitemeter.com/meter.asp?site=s18procksi"
alt="Site Meter" border="0"/>
</a>
</noscript>

<!-- Copyright (c)2006 Site Meter -->
}}}
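
Since every HTML document has to carry this snippet, it can be useful to list the pages that lack it. The following sketch searches a directory tree for `.html` files that do not reference the counter script; `untracked_pages` is a hypothetical helper name, and the document root passed to it is whatever directory the site is served from.

```shell
#!/bin/sh
# List HTML files under a directory that lack the Site Meter counter script.
# untracked_pages is a hypothetical helper name.
untracked_pages() {
    # grep -F treats the pattern as a fixed string; -L prints only the
    # names of files in which the pattern does not occur.
    find "$1" -type f -name '*.html' -exec grep -F -L 'sitemeter.com/js/counter.js' {} +
}

# Possible use (the path is a placeholder for the real document root):
#   untracked_pages /path/to/document/root
```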