ClusterMonitoring » History » Version 6

Anonymous, 08/18/2008 01:34 PM

= Cluster Monitoring =
The cluster's resources and performance need to be monitored constantly, and its users need to be tracked.

We assume the following configuration:
 ||!StorMan    || 2.12_B928 ||
 ||Ganglia     || 3.0.4     ||
 ||!JobMonarch || 0.2       ||
 ||Monit       || 4.9.2     ||


== !StorMan ==
!StorMan is the ''DELL RAID Storage Manager'' (RSM) for RAID systems. The agent keeps ''/dev/aac0'' open and listens on ports 34571, 34572 and 34573.

 * Download [repos:Externals/Cluster/RSM-2.12_B928_Linux.tgz] from the repository and unpack it. Enter at the command line of the master node:
  {{{
  tar -xzf RSM-2.12_B928_Linux.tgz
  }}}

 * Install the RSM, ignoring the dependencies. Enter at the command line of the master node:
  {{{
  rpm -Uv --nodeps StorMan-2.12.i386.rpm
  }}}

 * Make sure that the RSM starts at boot time. Enter at the command line of the master node:
  {{{
  cd /usr/StorMan
  cp ./stor_agent /etc/init.d/rsd
  /sbin/chkconfig --add rsd
  /sbin/chkconfig rsd on
  }}}

 * Start the RSM. Enter at the command line of the master node:
  {{{
  /sbin/service rsd start
  }}}
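
Once the agent is running, the three ports mentioned above can be checked. The following is only a minimal sketch and not part of the RSM distribution: the helper name ''missing_ports'' is made up for illustration, and it assumes that the text piped into it has the layout of {{{netstat -tln}}} (local address written as host:port, followed by whitespace).

```shell
#!/bin/sh
# Ports the RSM agent is expected to listen on (taken from the text above).
RSM_PORTS="34571 34572 34573"

# missing_ports (hypothetical helper): read a listening-socket listing on
# stdin and print every expected port that does not appear in it.
missing_ports() {
    listing=$(cat)
    for port in $RSM_PORTS; do
        # Look for ":<port>" followed by whitespace, i.e. the local address column.
        if ! printf '%s\n' "$listing" | grep -Eq ":${port}[[:space:]]"; then
            printf '%s\n' "$port"
        fi
    done
}
```

For example, {{{netstat -tln | missing_ports}}} should print nothing when all three ports are in the listening state.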

== Further RAID Monitoring ==

=== Command Line Interface (CLI) for Dell's RAID Controller ===
The manual can be found at [http://support.euro.dell.com/support/edocs/storage/CS6CH/en/ug/dell_ceg.htm].


=== Comments by William Armitage ===
On procksi0 I have dropped the RAID tools into ''/usr/local/afa'':

{{{
root@master01:/usr/local/afa# ls -l /usr/local/afa
total 3032
-rwxr--r-- 1 wja  procksi 1893976 Nov 29 11:51 afacli
-rw-r--r-- 1 root root          0 Nov 29 11:58 cfg.log
-rw-r--r-- 1 wja  procksi     572 Nov 29 11:51 getcfg.afa
-rw-r--r-- 1 root root        165 Dec 19 12:34 i2
-rw-r--r-- 1 root root       2050 Dec 19 12:36 i2.log
-rw-r--r-- 1 root root        159 Dec 11 18:11 i2.orig
-rw-r--r-- 1 root root        325 Dec 19 12:34 i3
-rw-r--r-- 1 root root    1153256 Dec 19 12:38 i3.log
-rw-r--r-- 1 root root         98 Nov 29 14:47 input
-rwxr--r-- 1 wja  procksi     595 Nov 29 11:51 MAKEDEV.afa
-rw-r--r-- 1 root root       1152 Nov 30 12:39 output
-rw-r--r-- 1 root root       1152 Nov 29 14:48 output.0
-rw-r--r-- 1 root root       1152 Nov 29 15:23 output.1
-rw-r--r-- 1 root root       1152 Nov 29 18:04 output.2
}}}

''afacli'' comes from http://linux.dell.com/storage.shtml, in the section AACRAID > Management Utility > afa-apps-snmp.2807420-A04.tar.gz; untar the archive and pull apart ''afaapps-4.1-0.i386.rpm''.

[http://support.dell.com/support/downloads/download.aspx?c=us&l=en&s=gen&releaseid=R85529&formatcnt=1&fileid=112003]

If you don't have ''/dev/afa0'', create it with:
{{{
  cd /dev
  /usr/local/afa/MAKEDEV.afa afa0
}}}
It disappeared after the reboot and needed recreating.
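
Because the device node does not survive a reboot, the creation step could be repeated at boot time, for instance from ''/etc/rc.local'' (this placement is an assumption; any boot-time hook would do). A minimal sketch, in which the helper name ''ensure_device'' is hypothetical:

```shell
#!/bin/sh
# ensure_device PATH CMD [ARGS...] (hypothetical helper):
# run CMD with ARGS only if PATH does not exist yet.
ensure_device() {
    dev=$1
    shift
    if [ -e "$dev" ]; then
        # The device node is already there, nothing to do.
        return 0
    fi
    "$@"
}
```

With the paths used above, the call would be {{{cd /dev && ensure_device /dev/afa0 /usr/local/afa/MAKEDEV.afa afa0}}}.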

''afacli'' is described as a bad port of a DOS program. Used interactively on the command line, it does weird things to the terminal, so feed it scripts instead. It echoes to the output in addition to any logging that is set, but it uses escape codes that write to the alternate screen of a colour xterm and then immediately switches back at the end.

The "input" script comes from
[http://linux.dell.com/files/aacraid/nagios/check_raid_pl.txt][[br]]
{{{output log "output"}}}

The more detailed script "i2" comes from
[http://www.techno-obscura.com/~delgado/notes/sles9-NagiosAfacli.html][[br]]
{{{output i2.out}}}

"i3" dumps the controller logs. It is based on
[http://threebit.net/mail-archive/centos/msg02033.html][[br]]
{{{output i3.out}}}

== Monitoring of Services: Monit ==
You can find more about ''monit'' at [http://mon.wiki.kernel.org/].

 * Add the DAG repository on the ''master node'' and the ''slave nodes''. Enter at the command line as ''root'':
{{{
 wget http://apt.sw.be/redhat/el5/en/x86_64/rpmforge/RPMS/rpmforge-release-0.3.6-1.el5.rf.x86_64.rpm
 rpm -Uvh rpmforge-release-0.3.6-1.el5.rf.x86_64.rpm
}}}

 * Install ''monit'' on the ''master node'' and the ''slave nodes''. Enter at the command line as ''root'':
{{{
 yum install monit
}}}

 * General configuration on the ''master node'' and the ''slave nodes''. Edit ''/etc/monit.conf'' as ''root'':

  Start Monit as a daemon and check the services every 2 minutes (120 seconds):
{{{
set daemon 120
}}}

  Log through ''syslog'' so that the log files are rotated:
{{{
set logfile syslog facility log_daemon
}}}

  Set the list of mail servers to be used for alert delivery:
{{{
set mailserver master01.procksi.local, # primary mailserver
               marian.cs.nott.ac.uk    # fallback relay
}}}

  Change the alert message format:
{{{
set mail-format { from: monit@procksi.net }
}}}

  Set the alert recipient(s):
{{{
set alert procksi@cs.nott.ac.uk                 # receive all alerts
}}}


 * Configure services to be monitored:
 [TBC]

 * Configure files to be monitored:
 [TBC]

 * Configure devices to be monitored:
 [TBC]

 * Make the ''monit'' daemon start at bootup. Enter at the command line as ''root'' on the ''master node'' and the ''slave nodes'':
  {{{
   /sbin/chkconfig  monit  on
  }}}
 * Start the ''monit'' daemon. Enter at the command line as ''root'' on the ''master node'' and the ''slave nodes'':
  {{{
   /sbin/service  monit  start
  }}}


== Monitoring of Cluster Resources: Ganglia ==
''Ganglia is a scalable distributed monitoring system for high-performance computing systems such as clusters and Grids.''

 * Download the latest release of the ''Ganglia Monitoring Core'' from [http://ganglia.sourceforge.net/ http://ganglia.sourceforge.net].
 * Install Ganglia into ''/usr/local/ganglia'', its web frontend into ''/usr/local/ganglia/html/'', and its databases into ''/usr/local/ganglia/rrds/''.
 * Install the ''Ganglia Monitoring Daemon'' (gmond) on each node, and the ''Ganglia Meta Daemon'' (gmetad) on the master node.

=== Ganglia Monitoring Daemon ===
 * Configure, build and install Ganglia on each slave node (only with ''gmond''):
  {{{
  ./configure --prefix=/usr/local
  }}}
  and on the master node (with ''gmond'' and ''gmetad''):
  {{{
  ./configure --prefix=/usr/local --with-gmetad
  }}}

 * Initialise the configuration file for ''gmond'':
  {{{
  gmond --default_config > /etc/gmond.conf
  }}}

 * Configure the ''Ganglia Monitoring Daemon'' in ''/etc/gmond.conf'':
  * Set the name of the cluster:
  {{{
  cluster {
    name = "ProCKSI"
  }
  }}}
  * Set the IP address and port for multicast data exchange:
  {{{
  udp_send_channel {
    mcast_join = 239.2.11.71
    port = 8649
  }
  udp_recv_channel {
    mcast_join = 239.2.11.71
    port = 8649
    bind = 239.2.11.71
  }
  }}}
 
 * Copy the start-up script for ''gmond'':
  {{{
  cp ./gmond/gmond.init /etc/init.d/gmond
  }}}

 * Add an additional route so that multicast data is exchanged via the ''internal'' interface (''eth0''). Modify ''/etc/init.d/gmond'' so that the route is added before the daemon is started:
  {{{
   #Add multicast route to internal interface
   /sbin/route add -host 239.2.11.71 dev eth0
   daemon $GMOND
  }}}
  and removed before the daemon is stopped:
  {{{
   #Remove multicast route to internal interface
   /sbin/route delete -host 239.2.11.71 dev eth0
   killproc gmond
  }}}
 * Make the Ganglia Monitoring Daemon start at bootup:
  {{{
   /sbin/chkconfig  gmond  on
  }}}
 * Start the Ganglia Monitoring Daemon:
  {{{
   /sbin/service  gmond  start
  }}}
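
To check that the nodes actually see each other over multicast, the XML report that ''gmond'' serves can be inspected. The following is a rough sketch under several assumptions: the default ''gmond.conf'' keeps a TCP accept channel on port 8649, a tool such as ''nc'' is available to read from it, and each node appears in the report as a {{{<HOST NAME="...">}}} element. The helper name ''gmond_hosts'' is made up for illustration.

```shell
#!/bin/sh
# gmond_hosts (hypothetical helper): read a gmond XML report on stdin
# and print the NAME attribute of every HOST element, one per line.
gmond_hosts() {
    grep -o '<HOST NAME="[^"]*"' | sed 's/^<HOST NAME="//; s/"$//'
}
```

On the master node, {{{nc localhost 8649 | gmond_hosts}}} should then list every node on which ''gmond'' is running; a missing node hints at a routing problem on the internal interface.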

=== Ganglia Meta Daemon ===
 * Install and configure the ''Ganglia Meta Daemon'' (gmetad) on the master node.

 * Make the Ganglia Meta Daemon start at bootup:
  {{{
   /sbin/chkconfig  --add gmetad
   /sbin/chkconfig  gmetad  on
  }}}
 * Start the Ganglia Meta Daemon:
  {{{
   /sbin/service  gmetad  start
  }}}

 * If the pie charts do not show up in the web frontend, install the ''php-gd'' package.


=== Further Customisation ===
In order to display more fine-grained time intervals, edit the following files in ''/usr/local/ganglia/html/'':
 * '''header.php'''
 {{{
  if (!$physical) {
   $context_ranges[]="10 minutes";
   $context_ranges[]="20 minutes";
   $context_ranges[]="30 minutes";
   $context_ranges[]="1 hour";
   $context_ranges[]="2 hours";
   $context_ranges[]="4 hours";
   $context_ranges[]="8 hours";
   $context_ranges[]="12 hours";
   $context_ranges[]="1 day";
   $context_ranges[]="2 days";
   $context_ranges[]="week";
   $context_ranges[]="month";
   $context_ranges[]="year";
 }}}

 * '''get_context.php'''
 {{{
  switch ($range) {
   case "10 minutes":   $start = -600; break;
   case "20 minutes":   $start = -1200; break;
   case "30 minutes":   $start = -1800; break;
   case "1 hour":       $start = -3600; break;
   case "2 hours":      $start = -7200; break;
   case "4 hours":      $start = -14400; break;
   case "8 hours":      $start = -28800; break;
   case "12 hours":     $start = -43200; break;
   case "1 day":        $start = -86400; break;
   case "2 days":       $start = -172800; break;
   case "week":         $start = -604800; break;
   case "month":        $start = -2419200; break;   // 28 days
   case "year":         $start = -31449600; break;  // 52 weeks
 }}}
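
The offsets in ''get_context.php'' are simply the length of each range in seconds, written as a negative number. When adding another range, the value can be recomputed with a small shell helper. This is only a sketch: ''range_start'' is a hypothetical name and not part of Ganglia, and it deliberately follows the table above in treating a month as 28 days and a year as 52 weeks.

```shell
#!/bin/sh
# range_start LABEL (hypothetical helper): print the negative start offset
# in seconds for a label such as "10 minutes", "2 days" or "week".
range_start() {
    set -- $1                                   # split the label into count and unit
    if [ $# -eq 1 ]; then set -- 1 "$1"; fi     # "week" is treated as "1 week"
    case ${2-} in
        minute|minutes) unit=60 ;;
        hour|hours)     unit=3600 ;;
        day|days)       unit=86400 ;;
        week)           unit=604800 ;;          # 7 days
        month)          unit=2419200 ;;         # 28 days, as in the table above
        year)           unit=31449600 ;;        # 52 weeks (364 days)
        *)              return 1 ;;             # unknown unit
    esac
    echo "-$(( $1 * unit ))"
}
```

For instance, {{{range_start "4 hours"}}} gives the value used for the "4 hours" case above.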


== !JobMonarch ==
!JobMonarch is an add-on to Ganglia that provides PBS job monitoring through the web browser.

See [http://subtrac.rc.sara.nl/oss/jobmonarch/wiki/Documentation http://subtrac.rc.sara.nl/oss/jobmonarch/wiki/Documentation] for information on requirements, configuration and installation.

'''Attention''': The !JobMonarch installation does not work properly yet.


== Domain Usage Monitoring ==
All HTML documents must contain the following code in order to be tracked correctly.

 {{{
<!-- Site Meter -->
	<script type="text/javascript" src="http://s18.sitemeter.com/js/counter.js?site=s18procksi">
	</script>
	<noscript>
		<a href="http://s18.sitemeter.com/stats.asp?site=s18procksi" target="_top">
			<img	src="http://s18.sitemeter.com/meter.asp?site=s18procksi"
    				alt="Site Meter" border="0"/>
		</a>
	</noscript>

<!-- Copyright (c)2006 Site Meter -->
 }}}
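
Since every HTML document has to carry this snippet, it can be useful to look for pages that lack it. A minimal sketch follows; the helper name ''untracked_pages'' is hypothetical, and it assumes that the pages are static ''*.html'' files below a single directory.

```shell
#!/bin/sh
# untracked_pages DIR (hypothetical helper): print every *.html file below
# DIR that does not reference the Site Meter counter script.
untracked_pages() {
    find "$1" -type f -name '*.html' | while IFS= read -r page; do
        # -F matches the string literally, -q suppresses the output.
        grep -Fq 's18.sitemeter.com/js/counter.js' "$page" || printf '%s\n' "$page"
    done
}
```

Called with the document root of the web server as its argument, the helper prints nothing when all pages are tracked.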