ClusterMonitoring » History » Version 8

Anonymous, 08/18/2008 01:53 PM
Monit: monitor devices

= Cluster Monitoring =
The cluster's resources and performance need to be monitored constantly, and its users need to be tracked.

We assume the following configuration:
 ||!StorMan    || 2.12_B928 ||
 ||Ganglia     || 3.0.4     ||
 ||!JobMonarch || 0.2       ||
 ||Monit       || 4.9.2     ||

== !StorMan ==
!StorMan is the ''DELL RAID Storage Manager'' (RSM) for RAID systems. The agent keeps ''/dev/aac0'' open and listens on ports 34571, 34572 and 34573.

 * Download [repos:Externals/Cluster/RSM-2.12_B928_Linux.tgz] from the repository and unpack it. Enter at the command line of the master node:
  {{{
  tar -xzf RSM-2.12_B928_Linux.tgz
  }}}

 * Install the RSM, ignoring the dependencies. Enter at the command line of the master node:
  {{{
  rpm -Uv --nodeps StorMan-2.12.i386.rpm
  }}}

 * Make sure that the RSM starts at boot time. Enter at the command line of the master node:
  {{{
  cd /usr/StorMan
  cp ./stor_agent /etc/init.d/rsd
  /sbin/chkconfig --add rsd
  /sbin/chkconfig rsd on
  }}}

 * Start the RSM. Enter at the command line of the master node:
  {{{
  /sbin/service rsd start
  }}}

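As a quick sanity check after starting the agent, one might confirm that the three ports named above are actually listening. The following is only a sketch and not part of the RSM distribution: the function name `missing_rsm_ports` is hypothetical, and it assumes a listing in the "address:port" column layout printed by `netstat -ltn`.

```shell
#!/bin/sh
# Hypothetical helper: reads a listening-socket listing on stdin and prints
# every expected RSM port that does not appear in it.
missing_rsm_ports() {
    listing=$(cat)
    for port in 34571 34572 34573; do
        # Match ":<port>" followed by whitespace, as in "0.0.0.0:34571 ".
        if ! printf '%s\n' "$listing" | grep -Eq ":${port}[[:space:]]"; then
            printf '%s\n' "$port"
        fi
    done
}
```

A possible use is `netstat -ltn | missing_rsm_ports`, which prints nothing when all three ports are open.
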
== Further RAID Monitoring ==

=== Command Line Interface (CLI) for Dell's RAID Controller ===
The manual can be found at [http://support.euro.dell.com/support/edocs/storage/CS6CH/en/ug/dell_ceg.htm].

=== Comments by William Armitage ===
On procksi0 I have dropped the RAID tools into ''/usr/local/afa'':

{{{
root@master01:/usr/local/afa# ls -l /usr/local/afa
total 3032
-rwxr--r-- 1 wja  procksi 1893976 Nov 29 11:51 afacli
-rw-r--r-- 1 root root          0 Nov 29 11:58 cfg.log
-rw-r--r-- 1 wja  procksi     572 Nov 29 11:51 getcfg.afa
-rw-r--r-- 1 root root        165 Dec 19 12:34 i2
-rw-r--r-- 1 root root       2050 Dec 19 12:36 i2.log
-rw-r--r-- 1 root root        159 Dec 11 18:11 i2.orig
-rw-r--r-- 1 root root        325 Dec 19 12:34 i3
-rw-r--r-- 1 root root    1153256 Dec 19 12:38 i3.log
-rw-r--r-- 1 root root         98 Nov 29 14:47 input
-rwxr--r-- 1 wja  procksi     595 Nov 29 11:51 MAKEDEV.afa
-rw-r--r-- 1 root root       1152 Nov 30 12:39 output
-rw-r--r-- 1 root root       1152 Nov 29 14:48 output.0
-rw-r--r-- 1 root root       1152 Nov 29 15:23 output.1
-rw-r--r-- 1 root root       1152 Nov 29 18:04 output.2
}}}

afacli is available from http://linux.dell.com/storage.shtml under AACRAID > Management Utility > afa-apps-snmp.2807420-A04.tar.gz; untar the archive and extract afacli from afaapps-4.1-0.i386.rpm.

[http://support.dell.com/support/downloads/download.aspx?c=us&l=en&s=gen&releaseid=R85529&formatcnt=1&fileid=112003]

If you don't have ''/dev/afa0'', create it with:
{{{
  cd /dev
  /usr/local/afa/MAKEDEV.afa afa0
}}}
The device node disappeared after the reboot and had to be recreated.
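
Because the node has to be recreated after each reboot, the step could be made repeatable. The following is a minimal sketch under stated assumptions: `ensure_node` is a hypothetical helper, not something shipped with the afa tools, and calling it from a boot-time script such as ''/etc/rc.d/rc.local'' is only a suggestion.

```shell
#!/bin/sh
# Hypothetical helper: run the given command in the directory that should hold
# the node, but only when the node does not exist yet.
# $1 = node path (e.g. /dev/afa0), remaining arguments = command that creates it.
ensure_node() {
    node=$1
    shift
    if [ -e "$node" ]; then
        return 0
    fi
    ( cd "$(dirname "$node")" && "$@" )
}
```

With the paths from this page, the call would be `ensure_node /dev/afa0 /usr/local/afa/MAKEDEV.afa afa0`; it does nothing when the node is already present.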

afacli is described as a bad port of a DOS program. Used interactively from the command line, it does weird things to the terminal, so feed it scripts instead. It echoes to the output in addition to any logging that has been set, but it uses escape codes that write to the alternate screen of a colour xterm and then switches back immediately at the end.

The "input" script comes from
[http://linux.dell.com/files/aacraid/nagios/check_raid_pl.txt][[br]]
{{{output log "output"}}}

The more detailed script "i2" comes from
[http://www.techno-obscura.com/~delgado/notes/sles9-NagiosAfacli.html][[br]]
{{{output i2.out}}}

"i3" dumps the controller logs. It is based on
[http://threebit.net/mail-archive/centos/msg02033.html][[br]]
{{{output i3.out}}}

== Monitoring of Services: Monit ==
You can find out more about Monit at [http://mon.wiki.kernel.org/].

 * Add the DAG repository on the ''master node'' and ''slave nodes''. Enter at the command line as ''root'':
{{{
 wget http://apt.sw.be/redhat/el5/en/x86_64/rpmforge/RPMS/rpmforge-release-0.3.6-1.el5.rf.x86_64.rpm
 rpm -Uvh rpmforge-release-0.3.6-1.el5.rf.x86_64.rpm
}}}

 * Install Monit on the ''master node'' and ''slave nodes''. Enter at the command line as ''root'':
{{{
 yum install monit
}}}

 * General configuration on the ''master node'' and ''slave nodes''. Edit ''/etc/monit.conf'' as ''root'':

  Start Monit as a daemon and check the services every 2 minutes:
{{{
set daemon 120
}}}

  Log via ''syslog'' so that the log files are rotated:
{{{
set logfile syslog facility log_daemon
}}}

  Set the list of mail servers to be used for alert delivery:
{{{
set mailserver master01.procksi.local  # primary mailserver
               marian.cs.nott.ac.uk    # fallback relay
}}}

  Change the alert message format:
{{{
set mail-format { from: monit@procksi.net }
}}}

  Set the alert recipient(s):
{{{
set alert procksi@cs.nott.ac.uk                 # receive all alerts
}}}


 * Monitor the general system resources on the ''master node'':

  Edit the Monit configuration file ''/etc/monit.conf'':
{{{
  check system master01.procksi.local
    if loadavg (1min) > 4 then alert
    if loadavg (5min) > 2 then alert
    if memory usage > 75% then alert
    if cpu usage (user) > 70% then alert
    if cpu usage (system) > 30% then alert
    if cpu usage (wait) > 20% then alert
}}}

 * Monitor the Apache web server on the ''master node'':

  Edit ''/etc/monit.conf'':
{{{
  check process apache with pidfile /var/run/httpd.pid
    start program = "/sbin/service httpd start"
    stop program  = "/sbin/service httpd stop"
    if cpu > 60% for 2 cycles then alert
    if cpu > 80% for 5 cycles then restart
    if totalmem > 200.0 MB for 5 cycles then restart
    if children > 250 then restart
    if loadavg(5min) greater than 10 for 8 cycles then stop
    if failed host www.procksi.net port 80 protocol http
       and request "/monit/token"
       then restart
    if 3 restarts within 5 cycles then timeout
    group server
}}}

  Edit the Apache configuration file ''/etc/httpd/conf/httpd.conf'':
{{{
#General Aliases for Monitoring and Testing
Alias /monit/    "/home/procksi/monit/"
Alias /ganglia/  "/usr/local/ganglia/html/"
Alias /trees/    "/home/procksi/trees/"
}}}

  Create the Monit token file and restart Apache. Enter at the command line:
{{{
mkdir -p /home/procksi/monit/
echo "ProCKSI.monit" > /home/procksi/monit/token
chown -R procksi:procksi_dev /home/procksi/monit/
/sbin/service httpd restart
}}}

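Before relying on the Monit rule, it can help to check by hand that Apache really serves the token file through the `/monit/` alias. The sketch below is an illustration only: `token_matches` is a hypothetical helper, and the use of `curl` to fetch the URL is an assumption about the tools available on the node.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if the text on stdin equals the expected
# token. $1 = expected token contents (without the trailing newline).
token_matches() {
    expected=$1
    actual=$(cat)    # command substitution drops the trailing newline
    [ "$actual" = "$expected" ]
}
```

A possible call is `curl -s http://www.procksi.net/monit/token | token_matches "ProCKSI.monit" && echo "token served"`.
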
 * Configure services to be monitored on the ''slave nodes'':
   Check the general system resources. Edit ''/etc/monit.conf'' and replace slaveXX with the host name of the respective node:
{{{
  check system slaveXX.procksi.local
    if loadavg (1min) > 4 then alert
    if loadavg (5min) > 2 then alert
    if memory usage > 75% then alert
    if cpu usage (user) > 70% then alert
    if cpu usage (system) > 30% then alert
    if cpu usage (wait) > 20% then alert
}}}


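Since the stanza differs between nodes only in the host name, it could be generated rather than edited by hand. This is a minimal sketch, assuming the thresholds shown above; `monit_system_stanza` is a hypothetical helper and not part of Monit.

```shell
#!/bin/sh
# Hypothetical helper: print a "check system" stanza for the given host name,
# using the same thresholds as the stanzas on this page.
monit_system_stanza() {
    host=$1
    cat <<EOF
  check system $host
    if loadavg (1min) > 4 then alert
    if loadavg (5min) > 2 then alert
    if memory usage > 75% then alert
    if cpu usage (user) > 70% then alert
    if cpu usage (system) > 30% then alert
    if cpu usage (wait) > 20% then alert
EOF
}
```

On a slave node, `monit_system_stanza "$(hostname -f)" >> /etc/monit.conf` would append the stanza, assuming that `hostname -f` returns the slaveXX.procksi.local name.
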
 * Configure files to be monitored:
 [TBC]

 * Monitor devices on the ''master node'':
  Monitor / and /home. Edit the Monit configuration file ''/etc/monit.conf'':

{{{
     check device root with path /dev/sda1
        if space usage > 80% for 5 times within 15 cycles then alert
        if space usage > 95% then alert
        group server

     check device home with path /dev/sda3
        start program = "/bin/mount /home"
        stop program = "/bin/umount /home"
        if space usage > 80% for 5 times within 15 cycles then alert
        if space usage > 95% then alert
        group server
}}}

 * Monitor devices on the ''slave nodes'':
  Monitor / and /scratch. Edit the Monit configuration file ''/etc/monit.conf'':

{{{
     check device root with path /dev/sda1
        if space usage > 80% for 5 times within 15 cycles then alert
        if space usage > 95% then alert
        group server

     check device scratch with path /dev/sda3
        start program = "/bin/mount /scratch"
        stop program = "/bin/umount /scratch"
        if space usage > 80% for 5 times within 15 cycles then alert
        if space usage > 95% then alert
        group server
}}}


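To see where a file system currently stands relative to the 80% and 95% thresholds used above, the figure can be read from `df`. The following is a rough sketch: both function names and the level names "warning" and "critical" are hypothetical, and the "5 times within 15 cycles" condition is not modelled.

```shell
#!/bin/sh
# Hypothetical helper: print the used-space percentage of the file system
# holding the given path, as an integer without the "%" sign.
space_usage() {
    # df -P prints a header line, then one line for the file system;
    # column 5 is the capacity, e.g. "42%".
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Map a percentage to the level implied by the two space usage rules.
usage_level() {
    if [ "$1" -gt 95 ]; then
        echo "critical"
    elif [ "$1" -gt 80 ]; then
        echo "warning"
    else
        echo "ok"
    fi
}
```

For example, `usage_level "$(space_usage /scratch)"` reports the level for the scratch partition of a slave node.
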
 * Make the Monit daemon start at bootup. Enter at the command line as ''root'' on the ''master node'' and ''slave nodes'':
  {{{
   /sbin/chkconfig  monit  on
  }}}
 * Start the Monit daemon. Enter at the command line as ''root'' on the ''master node'' and ''slave nodes'':
  {{{
   /sbin/service  monit  start
  }}}


== Monitoring of Cluster Resources: Ganglia ==
''Ganglia is a scalable distributed monitoring system for high-performance computing systems such as clusters and Grids.''

 * Download the latest release of the ''Ganglia Monitoring Core'' from [http://ganglia.sourceforge.net/ http://ganglia.sourceforge.net].
 * Install Ganglia into ''/usr/local/ganglia'', its web frontend into ''/usr/local/ganglia/html/'', and its databases into ''/usr/local/ganglia/rrds/''.
 * Install the ''Ganglia Monitoring Daemon'' (gmond) on each node, and the ''Ganglia Meta Daemon'' (gmetad) on the master node.

=== Ganglia Monitoring Daemon ===
 * Configure, build and install Ganglia on each slave node (only with ''gmond''):
  {{{
  ./configure --prefix=/usr/local
  }}}
  and on the master node (with ''gmond'' and ''gmetad''):
  {{{
  ./configure --prefix=/usr/local --with-gmetad
  }}}

 * Initialise the configuration file for ''gmond'':
  {{{
  gmond --default_config > /etc/gmond.conf
  }}}

 * Configure the ''Ganglia Monitoring Daemon'' in ''/etc/gmond.conf'':
  * Set the name of the cluster:
  {{{
  cluster {
    name = "ProCKSI"
  }
  }}}
  * Set the IP address and port for multicast data exchange:
  {{{
  udp_send_channel {
    mcast_join = 239.2.11.71
    port = 8649
  }
  udp_recv_channel {
    mcast_join = 239.2.11.71
    port = 8649
    bind = 239.2.11.71
  }
  }}}

 * Copy the start-up script for ''gmond'':
  {{{
  cp ./gmond/gmond.init /etc/init.d/gmond
  }}}

 * Add an additional route so that multicast data is exchanged via the ''internal'' interface (''eth0''). Modify ''/etc/init.d/gmond'' so that the route is added before the daemon is started:
  {{{
   #Add multicast route to internal interface
   /sbin/route add -host 239.2.11.71 dev eth0
   daemon $GMOND
  }}}
  and removed again when the daemon is stopped:
  {{{
   #Remove multicast route to internal interface
   /sbin/route delete -host 239.2.11.71 dev eth0
   killproc gmond
  }}}
 * Make the Ganglia Monitoring Daemon start at bootup:
  {{{
   /sbin/chkconfig  --add gmond
   /sbin/chkconfig  gmond  on
  }}}
 * Start the Ganglia Monitoring Daemon:
  {{{
   /sbin/service  gmond  start
  }}}
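
After starting ''gmond'', one may want to confirm that the multicast host route from the init script is in place. The following is a sketch only: `has_mcast_route` is a hypothetical helper, and it assumes the column layout printed by `/sbin/route -n`, with the destination in the first column and the interface in the last.

```shell
#!/bin/sh
# Hypothetical helper: read a routing table listing on stdin and succeed if it
# contains a route to the given destination via the given interface.
# $1 = destination address, $2 = interface name.
has_mcast_route() {
    awk -v dest="$1" -v dev="$2" '
        $1 == dest && $NF == dev { found = 1 }
        END { exit (found ? 0 : 1) }
    '
}
```

With the values from this page, the call would be `/sbin/route -n | has_mcast_route 239.2.11.71 eth0 && echo "route present"`.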

=== Ganglia Meta Daemon ===
 * Install and configure the ''Ganglia Meta Daemon'' (gmetad) on the master node.

 * Make the Ganglia Meta Daemon start at bootup:
  {{{
   /sbin/chkconfig  --add gmetad
   /sbin/chkconfig  gmetad  on
  }}}
 * Start the Ganglia Meta Daemon:
  {{{
   /sbin/service  gmetad  start
  }}}

 * If the pie charts do not show up, install the ''php-gd'' package.


=== Further Customisation ===
In order to display more fine-grained time intervals, edit the following files in ''/usr/local/ganglia/html/'':
 * '''header.php'''
 {{{
  if (!$physical) {
   $context_ranges[]="10 minutes";
   $context_ranges[]="20 minutes";
   $context_ranges[]="30 minutes";
   $context_ranges[]="1 hour";
   $context_ranges[]="2 hours";
   $context_ranges[]="4 hours";
   $context_ranges[]="8 hours";
   $context_ranges[]="12 hours";
   $context_ranges[]="1 day";
   $context_ranges[]="2 days";
   $context_ranges[]="week";
   $context_ranges[]="month";
   $context_ranges[]="year";
  }
 }}}

 * '''get_context.php'''
 {{{
  switch ($range) {
   case "10 minutes":   $start = -600; break;
   case "20 minutes":   $start = -1200; break;
   case "30 minutes":   $start = -1800; break;
   case "1 hour":       $start = -3600; break;
   case "2 hours":      $start = -7200; break;
   case "4 hours":      $start = -14400; break;
   case "8 hours":      $start = -28800; break;
   case "12 hours":     $start = -43200; break;
   case "1 day":        $start = -86400; break;
   case "2 days":       $start = -172800; break;
   case "week":         $start = -604800; break;
   case "month":        $start = -2419200; break;
   case "year":         $start = -31449600; break;
  }
 }}}


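The `$start` values are negative offsets in seconds. When adding further ranges, the arithmetic can be checked with a small script. This is a sketch with a hypothetical function name, `range_to_start`; it reproduces the values listed above, which treat a "month" as four weeks (28 days) and a "year" as 52 weeks (364 days).

```shell
#!/bin/sh
# Hypothetical helper: convert a range label such as "2 hours" or "week"
# into the negative offset in seconds used in get_context.php.
range_to_start() {
    # Split the label into an optional count and a unit; a bare unit means 1.
    set -- $1
    if [ $# -eq 2 ]; then count=$1; unit=$2; else count=1; unit=$1; fi
    case $unit in
        minute|minutes) secs=60 ;;
        hour|hours)     secs=3600 ;;
        day|days)       secs=86400 ;;
        week)           secs=$((7 * 86400)) ;;
        month)          secs=$((28 * 86400)) ;;      # four weeks
        year)           secs=$((52 * 7 * 86400)) ;;  # 52 weeks
        *)              return 1 ;;
    esac
    echo "-$((count * secs))"
}
```

For example, `range_to_start "6 hours"` gives the offset for a new six-hour entry.
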
== !JobMonarch ==
!JobMonarch is an add-on to Ganglia that provides PBS job monitoring through the web browser.

See [http://subtrac.rc.sara.nl/oss/jobmonarch/wiki/Documentation http://subtrac.rc.sara.nl/oss/jobmonarch/wiki/Documentation] for information on requirements, configuration and installation.

'''Attention''': !JobMonarch does not work properly yet.


== Domain Usage Monitoring ==
All HTML documents must contain the following code in order to be tracked correctly.

 {{{
<!-- Site Meter -->
	<script type="text/javascript" src="http://s18.sitemeter.com/js/counter.js?site=s18procksi">
	</script>
	<noscript>
		<a href="http://s18.sitemeter.com/stats.asp?site=s18procksi" target="_top">
			<img src="http://s18.sitemeter.com/meter.asp?site=s18procksi"
			     alt="Site Meter" border="0"/>
		</a>
	</noscript>

<!-- Copyright (c)2006 Site Meter -->
 }}}