ClusterMonitoring » History » Version 5

Anonymous, 08/07/2008 02:35 PM
Add configuration for 'monit' monitoring daemon

= Cluster Monitoring =
The cluster resources and performance need to be constantly monitored, and the users need to be tracked.

We assume the following configuration:
 ||!StorMan    || 2.12_B928 ||
 ||Ganglia     || 3.0.4     ||
 ||!JobMonarch || 0.2       ||
 ||Monit       || 4.9.2     ||

== !StorMan ==
!StorMan is the ''Dell RAID Storage Manager'' (RSM) for RAID systems. The agent keeps ''/dev/aac0'' open and listens on ports 34571, 34572, and 34573.

 * Download [repos:Externals/Cluster/RSM-2.12_B928_Linux.tgz] from the repository and unpack it. Enter at the command line of the master node:
  {{{
  tar -xzf RSM-2.12_B928_Linux.tgz
  }}}

 * Install the RSM, ignoring the dependencies. Enter at the command line of the master node:
  {{{
  rpm -Uv --nodeps StorMan-2.12.i386.rpm
  }}}

 * Make sure that the RSM starts at boot time. Enter at the command line of the master node:
  {{{
  cd /usr/StorMan
  cp ./stor_agent /etc/init.d/rsd
  /sbin/chkconfig --add rsd
  /sbin/chkconfig rsd on
  }}}

 * Start the RSM. Enter at the command line of the master node:
  {{{
  /sbin/service rsd start
  }}}
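
Once the agent is running, the three ports named above can be checked. The following is only a minimal sketch and not part of the original procedure: the helper name `missing_rsm_ports` is made up for illustration, the ports are assumed to be TCP, and the `netstat` pipeline in the usage comment assumes that the net-tools package is installed.

```shell
# Hypothetical helper: reads listening port numbers on stdin (one per line)
# and prints each expected RSM agent port that is NOT in that list.
missing_rsm_ports() {
    listening=$(cat)
    for port in 34571 34572 34573; do
        printf '%s\n' "$listening" | grep -qx "$port" || printf '%s\n' "$port"
    done
}

# Usage on the master node (assumption: net-tools' netstat, whose first two
# output lines are headers and whose column 4 is the local address:port):
#   netstat -ltn | awk 'NR > 2 { n = split($4, a, ":"); print a[n] }' | missing_rsm_ports
```

An empty result means that all three ports were found in the list.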

== Further RAID Monitoring ==

=== Command Line Interface (CLI) for Dell's RAID Controller ===
The manual can be found at [http://support.euro.dell.com/support/edocs/storage/CS6CH/en/ug/dell_ceg.htm].

=== Comments by William Armitage ===
On procksi0 I have dropped the RAID tools into ''/usr/local/afa'':

{{{
root@master01:/usr/local/afa# ls -l /usr/local/afa
total 3032
-rwxr--r-- 1 wja  procksi 1893976 Nov 29 11:51 afacli
-rw-r--r-- 1 root root          0 Nov 29 11:58 cfg.log
-rw-r--r-- 1 wja  procksi     572 Nov 29 11:51 getcfg.afa
-rw-r--r-- 1 root root        165 Dec 19 12:34 i2
-rw-r--r-- 1 root root       2050 Dec 19 12:36 i2.log
-rw-r--r-- 1 root root        159 Dec 11 18:11 i2.orig
-rw-r--r-- 1 root root        325 Dec 19 12:34 i3
-rw-r--r-- 1 root root    1153256 Dec 19 12:38 i3.log
-rw-r--r-- 1 root root         98 Nov 29 14:47 input
-rwxr--r-- 1 wja  procksi     595 Nov 29 11:51 MAKEDEV.afa
-rw-r--r-- 1 root root       1152 Nov 30 12:39 output
-rw-r--r-- 1 root root       1152 Nov 29 14:48 output.0
-rw-r--r-- 1 root root       1152 Nov 29 15:23 output.1
-rw-r--r-- 1 root root       1152 Nov 29 18:04 output.2
}}}

''afacli'' comes from http://linux.dell.com/storage.shtml, under the section AACRAID > Management Utility > afa-apps-snmp.2807420-A04.tar.gz; untar the archive and extract ''afacli'' from afaapps-4.1-0.i386.rpm.

[http://support.dell.com/support/downloads/download.aspx?c=us&l=en&s=gen&releaseid=R85529&formatcnt=1&fileid=112003]

If ''/dev/afa0'' does not exist, create it with:
{{{
  cd /dev
  /usr/local/afa/MAKEDEV.afa afa0
}}}
The device node disappeared after the reboot and had to be recreated.
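
Because the device node does not survive a reboot, the manual step can be wrapped so that it only runs when the node is missing. This is a sketch under assumptions, not something taken from the notes above: the function name `ensure_afa_device` is hypothetical, and it simply repeats the two commands shown.

```shell
# Hypothetical helper: recreate the afa0 device node if it is missing.
# $1 = device directory (default /dev)
# $2 = MAKEDEV script   (default /usr/local/afa/MAKEDEV.afa, as used above)
ensure_afa_device() {
    dev_dir=${1:-/dev}
    makedev=${2:-/usr/local/afa/MAKEDEV.afa}
    if [ -e "$dev_dir/afa0" ]; then
        echo "present"
        return 0
    fi
    # Run MAKEDEV.afa from inside the device directory, as in the manual step.
    (cd "$dev_dir" && "$makedev" afa0) && echo "created"
}
```

Calling `ensure_afa_device` from a boot-time script such as ''/etc/rc.d/rc.local'' would be one way to use it; whether that file is the right place on this system is an assumption.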

''afacli'' has been described as a bad port of a DOS program.
When used interactively on the command line it does weird things to the terminal, so feed it scripts instead.
It echoes to standard output in addition to any log file that is set, but it uses escape
codes that switch to the alternate screen in a colour xterm and then immediately
switch back at the end.
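
If the echoed output is captured to a file, the escape codes can be filtered out afterwards. The filter below is a minimal sketch: the name `strip_escapes` is made up, and it assumes that the codes are ordinary CSI sequences (ESC, `[`, parameters, one letter), which is typical for xterm alternate-screen switches but has not been verified against ''afacli'' itself.

```shell
# Hypothetical filter: remove CSI escape sequences (ESC [ params letter)
# from stdin, e.g. the xterm alternate-screen switches ESC[?1049h / ESC[?1049l.
strip_escapes() {
    esc=$(printf '\033')
    sed "s/${esc}\[[0-9;?]*[A-Za-z]//g"
}

# Usage (the file name is only an example):
#   strip_escapes < captured_output.txt
```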

The "input" script comes from
[http://linux.dell.com/files/aacraid/nagios/check_raid_pl.txt][[br]]
{{{output log "output"}}}

The more detailed script "i2" comes from
[http://www.techno-obscura.com/~delgado/notes/sles9-NagiosAfacli.html][[br]]
{{{output i2.out}}}

"i3" dumps the controller logs. It is based on
[http://threebit.net/mail-archive/centos/msg02033.html][[br]]
{{{output i3.out}}}

== Monitoring of Services: Monit ==
You can find more about ''monit'' at [http://mon.wiki.kernel.org/].

 * Add the DAG (RPMforge) repository:
{{{
 wget http://apt.sw.be/redhat/el5/en/x86_64/rpmforge/RPMS/rpmforge-release-0.3.6-1.el5.rf.x86_64.rpm
 rpm -Uvh rpmforge-release-0.3.6-1.el5.rf.x86_64.rpm
}}}

 * Install ''monit'':
{{{
 yum install monit
}}}

 * General configuration:
   [TBC]

 * Configure services to be monitored:
 [TBC]

 * Configure files to be monitored:
 [TBC]

 * Configure devices to be monitored:
 [TBC]

 * Make the ''monit'' daemon start at bootup:
  {{{
   /sbin/chkconfig  monit  on
  }}}
 * Start the ''monit'' daemon:
  {{{
   /sbin/service  monit  start
  }}}
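
The configuration steps above are still marked [TBC]. Purely as an illustration of what a service entry could look like, and not as the configuration used on this cluster, the sketch below prints a ''monit'' `check process` stanza for a SysV service. The helper name `monit_process_stanza` and the pidfile path are assumptions.

```shell
# Hypothetical helper: print a monit "check process" stanza for a SysV service.
# $1 = service name, $2 = pidfile path (both must match the local system).
monit_process_stanza() {
    name=$1
    pidfile=$2
    cat <<EOF
check process $name with pidfile $pidfile
    start program = "/sbin/service $name start"
    stop program = "/sbin/service $name stop"
EOF
}

# Example: a stanza for gmond; the pidfile location is an assumption.
monit_process_stanza gmond /var/run/gmond.pid
```

The printed stanza would go into the ''monit'' control file; `monit -t` runs a syntax check on that file before the daemon is restarted.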

== Monitoring of Cluster Resources: Ganglia ==
''Ganglia is a scalable distributed monitoring system for high-performance computing systems such as clusters and Grids.''

 * Download the latest release of the ''Ganglia Monitoring Core'' from [http://ganglia.sourceforge.net/ http://ganglia.sourceforge.net].
 * Install Ganglia into ''/usr/local/ganglia'', its web frontend into ''/usr/local/ganglia/html/'', and its databases into ''/usr/local/ganglia/rrds/''.
 * Install the ''Ganglia Monitoring Daemon'' (gmond) on each node, and the ''Ganglia Meta Daemon'' (gmetad) on the head node.

=== Ganglia Monitoring Daemon ===
 * Configure, build and install Ganglia on each slave node (only with ''gmond''):
  {{{
  ./configure --prefix=/usr/local
  }}}
  and on the master node (with ''gmond'' and ''gmetad''):
  {{{
  ./configure --prefix=/usr/local --with-gmetad
  }}}

 * Initialise the configuration file for ''gmond'':
  {{{
  gmond --default_config > /etc/gmond.conf
  }}}

 * Configure the ''Ganglia Monitoring Daemon'' in ''/etc/gmond.conf'':
  * Set the name of the cluster:
  {{{
  cluster {
    name = "ProCKSI"
  }
  }}}
  * Set the IP address and port for multicast data exchange:
  {{{
  udp_send_channel {
    mcast_join = 239.2.11.71
    port = 8649
  }
  udp_recv_channel {
    mcast_join = 239.2.11.71
    port = 8649
    bind = 239.2.11.71
  }
  }}}

 * Copy the start-up script for ''gmond'':
  {{{
  cp ./gmond/gmond.init /etc/init.d/gmond
  }}}

 * Add an additional route so that multicast data is exchanged via the ''internal'' interface (''eth0''). Modify ''/etc/init.d/gmond'' so that the route is added before the daemon is started:
  {{{
   #Add multicast route to internal interface
   /sbin/route add -host 239.2.11.71 dev eth0
   daemon $GMOND
  }}}
  and removed before the daemon is stopped:
  {{{
   #Remove multicast route to internal interface
   /sbin/route delete -host 239.2.11.71 dev eth0
   killproc gmond
  }}}
 * Make the Ganglia Monitoring Daemon start at bootup:
  {{{
   /sbin/chkconfig  gmond  on
  }}}
 * Start the Ganglia Monitoring Daemon:
  {{{
   /sbin/service  gmond  start
  }}}
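
After ''gmond'' has been started, it can be useful to confirm that the multicast route exists and that nodes are being reported. The two helpers below are a hedged sketch: their names are hypothetical, the host count assumes that each `<HOST ...>` element of the XML dump starts on its own line, and the usage comments assume that `nc` is installed and that ''gmond'' still serves XML on its default TCP port 8649.

```shell
# Hypothetical helper: succeed if `route -n` output on stdin contains a route
# for destination $1 whose last column (the interface) is $2.
has_route() {
    awk -v dst="$1" -v dev="$2" \
        '$1 == dst && $NF == dev { found = 1 } END { exit !found }'
}

# Hypothetical helper: count the lines of gmond's XML dump (on stdin) that
# contain an opening <HOST ...> element.
count_ganglia_hosts() {
    grep -c '<HOST ' || true
}

# Usage on a node (assumptions: nc installed, default tcp_accept_channel):
#   /sbin/route -n | has_route 239.2.11.71 eth0 && echo "multicast route present"
#   nc localhost 8649 | count_ganglia_hosts
```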

=== Ganglia Meta Daemon ===
 * Install and configure the ''Ganglia Meta Daemon'' (gmetad) on the master node.

 * Make the Ganglia Meta Daemon start at bootup:
  {{{
   /sbin/chkconfig  --add gmetad
   /sbin/chkconfig  gmetad  on
  }}}
 * Start the Ganglia Meta Daemon:
  {{{
   /sbin/service  gmetad  start
  }}}

 * If the pie charts do not show up in the web frontend, install the ''php-gd'' package.

=== Further Customisation ===
In order to display more fine-grained time intervals, edit the following files in ''/usr/local/ganglia/html/'':
 * '''header.php'''
 {{{
  if (!$physical) {
   $context_ranges[]="10 minutes";
   $context_ranges[]="20 minutes";
   $context_ranges[]="30 minutes";
   $context_ranges[]="1 hour";
   $context_ranges[]="2 hours";
   $context_ranges[]="4 hours";
   $context_ranges[]="8 hours";
   $context_ranges[]="12 hours";
   $context_ranges[]="1 day";
   $context_ranges[]="2 days";
   $context_ranges[]="week";
   $context_ranges[]="month";
   $context_ranges[]="year";
 }}}

 * '''get_context.php'''
 {{{
  switch ($range) {
   case "10 minutes":   $start = -600; break;
   case "20 minutes":   $start = -1200; break;
   case "30 minutes":   $start = -1800; break;
   case "1 hour":       $start = -3600; break;
   case "2 hours":      $start = -7200; break;
   case "4 hours":      $start = -14400; break;
   case "8 hours":      $start = -28800; break;
   case "12 hours":     $start = -43200; break;
   case "1 day":        $start = -86400; break;
   case "2 days":       $start = -172800; break;
   case "week":         $start = -604800; break;
   case "month":        $start = -2419200; break;
   case "year":         $start = -31449600; break;
 }}}
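
Each `$start` value is the length of the range in seconds, negated. The sketch below reproduces the arithmetic so that further ranges can be added consistently; the helper name `start_offset` is made up for illustration. Note that "month" works out to 4 weeks and "year" to 52 weeks.

```shell
# Hypothetical helper: offset in seconds for "count units ago".
# $1 = count, $2 = length of one unit in seconds.
start_offset() {
    echo $(( -1 * $1 * $2 ))
}

MINUTE=60; HOUR=3600; DAY=86400; WEEK=604800

start_offset 10 "$MINUTE"   # 10 minutes -> -600
start_offset 2  "$HOUR"     # 2 hours    -> -7200
start_offset 2  "$DAY"      # 2 days     -> -172800
start_offset 4  "$WEEK"     # "month"    -> -2419200  (4 weeks, i.e. 28 days)
start_offset 52 "$WEEK"     # "year"     -> -31449600 (52 weeks, i.e. 364 days)
```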

== !JobMonarch ==
!JobMonarch is an add-on to Ganglia which provides PBS job monitoring through the web browser.

See [http://subtrac.rc.sara.nl/oss/jobmonarch/wiki/Documentation http://subtrac.rc.sara.nl/oss/jobmonarch/wiki/Documentation] for information on requirements, configuration and installation.

'''Attention''': !JobMonarch does not work properly yet.

== Domain Usage Monitoring ==
All HTML documents must contain the following code in order to be tracked correctly.

 {{{
<!-- Site Meter -->
	<script type="text/javascript" src="http://s18.sitemeter.com/js/counter.js?site=s18procksi">
	</script>
	<noscript>
		<a href="http://s18.sitemeter.com/stats.asp?site=s18procksi" target="_top">
			<img src="http://s18.sitemeter.com/meter.asp?site=s18procksi"
				alt="Site Meter" border="0"/>
		</a>
	</noscript>

<!-- Copyright (c)2006 Site Meter -->
 }}}
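
Since every HTML document has to carry this snippet, a quick scan can list the pages that were missed. This is a minimal sketch under assumptions: the helper name `untracked_pages` is hypothetical, only files ending in `.html` are examined, and a page counts as tracked if it references the counter script URL shown above.

```shell
# Hypothetical helper: print every *.html file under directory $1 that does
# NOT reference the Site Meter counter script.
untracked_pages() {
    find "$1" -type f -name '*.html' | while IFS= read -r page; do
        grep -qF 'sitemeter.com/js/counter.js?site=s18procksi' "$page" \
            || printf '%s\n' "$page"
    done
}

# Usage (the document root is only an example path):
#   untracked_pages /var/www/html
```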